Abstract
To extract the main content of a Web page accurately, this paper proposes an algorithm based on a Support Vector Machine (SVM) and a DOM gravity radius model. An SVM is first applied to the page's DOM node set to obtain the text-block nodes. A gravity radius is then computed from the page's link information and the text-block nodes found in this first pass, and the gravity radius model is used for a second, more precise extraction. The corresponding formula derivation and the hyperparameter selection process are also given. Experimental results show that, compared with algorithms such as statistical extraction and FFT-based extraction, the proposed algorithm achieves higher accuracy and extraction efficiency, as well as better generalization ability.
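The two-stage structure described above can be illustrated with a minimal sketch. The abstract gives neither the SVM features nor the gravity radius formula, so everything below is an assumption made for illustration: the `Node` fields, the hand-set weights in `is_text_block` (a stand-in for a trained linear SVM decision function), and the definition of the radius in `refine_by_radius` as `k` weighted standard deviations around a text-weighted centroid. It is not the paper's model.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Node:
    index: int      # position of the DOM node in document order (assumed feature)
    text_len: int   # number of text characters in the node
    link_len: int   # number of those characters that sit inside <a> tags


def is_text_block(node, w=(0.01, -4.0), b=-0.5):
    """Stage 1 stand-in: a linear decision function w.x + b > 0.

    A trained linear SVM has this form; the weights here are hand-set
    for illustration and are NOT learned from data.
    """
    link_ratio = node.link_len / node.text_len if node.text_len else 1.0
    score = w[0] * node.text_len + w[1] * link_ratio + b
    return score > 0


def refine_by_radius(blocks, k=1.5):
    """Stage 2 stand-in: keep blocks close to the text-weighted centroid.

    The radius is assumed to be k weighted standard deviations;
    k plays the role of a hyperparameter.
    """
    total = sum(n.text_len for n in blocks)
    if total == 0:
        return []
    center = sum(n.index * n.text_len for n in blocks) / total
    variance = sum(n.text_len * (n.index - center) ** 2 for n in blocks) / total
    radius = k * sqrt(variance)
    return [n for n in blocks if abs(n.index - center) <= radius]


# Toy page: a navigation bar, three article paragraphs, a distant footer.
nodes = [
    Node(0, 40, 40),     # navigation bar: all of its text is link text
    Node(10, 300, 10),   # article paragraph
    Node(11, 280, 0),    # article paragraph
    Node(12, 320, 5),    # article paragraph
    Node(40, 120, 0),    # long footer notice, far from the article
]

blocks = [n for n in nodes if is_text_block(n)]   # first-pass extraction
content = refine_by_radius(blocks)                # second-pass refinement
print([n.index for n in content])
```

With these toy values the first pass rejects only the link-heavy navigation bar, and the second pass then drops the footer because it lies outside the radius around the article paragraphs. This mirrors the coarse-then-fine idea in the abstract, under the assumptions stated above.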