Research on Web Page Information Extraction Based on Visual Features

Original Chinese title: 基于视觉特征的网页信息抽取方法研究
Authors: WANG Xianfa (王宪发); GUO Yan (郭岩); LIU Yue (刘悦); YU Xiaoming (俞晓明); CHENG Xueqi (程学旗)
Affiliations: School of Computer Science and Technology, University of Chinese Academy of Sciences; CAS Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences
Keywords: visual features; web information extraction; automatic template generation
Journal: Journal of Chinese Information Processing
Journal code: MESS
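Institutions (as given in the Chinese record): School of Computer and Control Engineering, University of Chinese Academy of Sciences; CAS Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences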
Publication date: 2019-05-15
Publisher: Journal of Chinese Information Processing (中文信息学报)
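Year: 2019
Volume: 33
Issue: 05
Pages: 108-117
Page count: 10
Funding: National Key Research and Development Program of China (2017YFB0803302, 2016YFB1000902); National Basic Research Program of China (973 Program) (2014CB340405, 2014CB340401); National Natural Science Foundation of China (61433014)
Language: Chinese
CN: 11-2325/N
Record ID: MESS201905012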

Abstract

When applied to large-scale heterogeneous web pages, web information extraction methods based on visual features generally suffer from poor generality and low extraction efficiency. To address the generality problem, this paper proposes WEMLVF, a web page information extraction framework based on visual features and supervised machine learning. Experiments on forum sites and news comment sites validate the framework's effectiveness and show that it generalizes well. To address the low efficiency caused by the high time cost of extracting visual features, the paper then uses WEMLVF to propose two methods for automatically generating information extraction templates: one based on XPath and one based on SoftMealy, a classic wrapper induction algorithm. Both methods use visual features to generate the templates, but the templates themselves contain no visual features, so applying a template to a page does not require extracting that page's visual features. The approach therefore keeps the benefit of visual features for extraction while significantly improving extraction efficiency, a conclusion supported by the experimental results.
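
The abstract does not describe how the XPath templates are actually generated, so the following is only a minimal sketch of the general idea, under stated assumptions: a classifier that uses visual features has already labeled a few target nodes on a training page, and those labels are available as absolute XPaths. The helper names (`split_steps`, `induce_template`, `apply_template`), the sample page, and the generalization rule (keep a positional index only when every labeled example agrees on it) are all hypothetical and are not taken from the paper.

```python
import re
import xml.etree.ElementTree as ET

# One XPath step: a tag name with an optional positional index, e.g. "div[2]".
STEP = re.compile(r"^([A-Za-z][\w-]*)(?:\[(\d+)\])?$")


def split_steps(xpath):
    """Split '/html/body/div[2]' into [('html', None), ('body', None), ('div', 2)]."""
    steps = []
    for part in xpath.strip("/").split("/"):
        match = STEP.match(part)
        if match is None:
            raise ValueError(f"unsupported XPath step: {part!r}")
        tag, index = match.group(1), match.group(2)
        steps.append((tag, int(index) if index else None))
    return steps


def induce_template(xpaths):
    """Generalize labeled XPaths: keep an index only if all examples share it."""
    parsed = [split_steps(x) for x in xpaths]
    tags = [tag for tag, _ in parsed[0]]
    if any([tag for tag, _ in p] != tags for p in parsed):
        raise ValueError("examples must share the same tag sequence")
    out = []
    for i, tag in enumerate(tags):
        indices = {p[i][1] for p in parsed}
        index = parsed[0][i][1]
        if len(indices) == 1 and index is not None:
            out.append(f"{tag}[{index}]")   # stable position: keep it
        else:
            out.append(tag)                 # varying position: generalize
    return "/" + "/".join(out)


def apply_template(template, root):
    """Evaluate the template on a parsed page; only the DOM tree is used."""
    steps = template.strip("/").split("/")
    if re.sub(r"\[\d+\]$", "", steps[0]) != root.tag:
        raise ValueError("template does not start at the page root")
    # ElementTree evaluates paths relative to the root, so drop the first step.
    return [el.text for el in root.findall("./" + "/".join(steps[1:]))]


# Hypothetical, well-formed sample page standing in for a forum thread.
PAGE = """<html><body>
<div><p>site banner</p></div>
<div>
  <ul>
    <li><span>alice</span><span>first post</span></li>
    <li><span>bob</span><span>second post</span></li>
    <li><span>carol</span><span>third post</span></li>
  </ul>
</div>
</body></html>"""

# Assumed output of the visual-feature classifier: two labeled post bodies.
LABELED = [
    "/html/body/div[2]/ul/li[1]/span[2]",
    "/html/body/div[2]/ul/li[2]/span[2]",
]

TEMPLATE = induce_template(LABELED)
print(TEMPLATE)
print(apply_template(TEMPLATE, ET.fromstring(PAGE)))
```

In this sketch the expensive step (rendering pages to obtain visual features) would be needed only to produce `LABELED`; applying `TEMPLATE` to further pages from the same site touches only the parsed DOM, which mirrors the efficiency argument made in the abstract. The third post is extracted even though it was never labeled, because the varying `li` index was generalized away.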

