Information Extraction of Web Pages Based on Support Vector Machine

Chinese title: 基于支持向量机的网页正文内容提取方法
Authors: LIANG Dong (梁东); YANG Yong-quan (杨永全); WEI Zhi-qiang (魏志强)
Keywords: support vector machine; information extraction; HTML tag; noise reduction; machine learning
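Journal (Chinese abbreviation): JYXH
Journal: Computer and Modernization (计算机与现代化)
Institution: School of Information Science and Engineering, Ocean University of China
Publication date: 2018-09-15
Publisher: Computer and Modernization
Year: 2018
Cumulative issue: No.277
Funding: Aoshan Science and Technology Innovation Program of the National Laboratory for Marine Science and Technology (2016ASKJ07, 2016ASKJ07-08)
Language: Chinese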
Article ID: JYXH201809007
Page count: 7
Issue: 09
CN: 36-1137/TP
Pages: 25-30+35

Abstract

This paper proposes a support vector machine (SVM) based method for extracting the main text of web pages. The method follows a "lenient in, strict out" strategy. In the first step, it traverses the page's DOM tree according to regularities in web page structure and locates an HTML tag that contains both the main text and noise. In the second step, it selects five important features of the HTML tags that contain noise and trains an SVM on the sample data. The resulting model effectively removes noise such as navigation, promotional content, and copyright notices while retaining the main text. The method was applied to several widely used websites. The experimental results show that it extracts main text and reduces noise well, and that it accurately preserves content that traditional methods often delete by mistake, such as short text and hyperlinks related to the main text.
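
The two steps described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the container-location rule (`locate_container` with a 0.6 ratio), the five features in `block_features`, the toy page, and the hand labels are all hypothetical, because the abstract does not name them. A small linear SVM trained by subgradient descent on the regularized hinge loss stands in for whatever SVM solver and kernel the paper used.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}

class Node:
    """Minimal DOM element; children holds Nodes and text strings in order."""
    def __init__(self, tag, attrs=(), parent=None):
        self.tag, self.attrs, self.parent, self.children = tag, dict(attrs), parent, []
    def elements(self):
        return [c for c in self.children if isinstance(c, Node)]
    def raw(self):
        return "".join(c.raw() if isinstance(c, Node) else c for c in self.children)
    def text(self):
        return " ".join(self.raw().split())  # collapse whitespace
    def link_len(self):  # characters of text inside <a> elements
        if self.tag == "a":
            return len(self.text())
        return sum(c.link_len() for c in self.elements())

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = self.cur = Node("root")
    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.cur)
        self.cur.children.append(node)
        if tag not in VOID_TAGS:
            self.cur = node
    def handle_endtag(self, tag):
        node = self.cur
        while node.parent is not None and node.tag != tag:
            node = node.parent
        if node.parent is not None:  # ignore stray or void end tags
            self.cur = node.parent
    def handle_data(self, data):
        self.cur.children.append(data)

def plain_len(node):
    return len(node.text()) - node.link_len()  # text outside links

def locate_container(root, ratio=0.6):
    """Step 1 (lenient): descend while one child still holds most non-link text."""
    node = root
    while node.elements():
        best = max(node.elements(), key=plain_len)
        if plain_len(best) == 0 or plain_len(best) < ratio * plain_len(node):
            break
        node = best
    return node

def block_features(block, index, total):
    """Five illustrative features (assumed, not the paper's list)."""
    text = block.text()
    n = max(len(text), 1)
    return [
        min(len(text) / 100.0, 1.0),                        # capped text length
        block.link_len() / n,                               # link density
        sum(text.count(p) for p in ",.;!?，。；！？") / n,   # punctuation density
        min(len(block.elements()) / 10.0, 1.0),             # capped child-element count
        index / max(total - 1, 1),                          # relative position
    ]

def train_linear_svm(X, y, lam=0.001, lr=0.1, epochs=300):
    """Step 2 (strict): subgradient descent on the L2-regularized hinge loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:   # inside the margin: the hinge term contributes
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:            # only the regularizer contributes
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

PAGE = """<html><body>
<div id="nav"><a href="/">Home</a><a href="/news">News</a></div>
<div id="content">
<p>First paragraph of the article, long enough to count as text.</p>
<p>Short note.</p>
<p>Second paragraph of similar length, also with <a href="/x">a related link</a> inside it.</p>
<div class="ad"><a href="/buy">Buy now</a><a href="/sale">Sale</a></div>
</div><div id="footer">Copyright 2018</div></body></html>"""

builder = TreeBuilder()
builder.feed(PAGE)
builder.close()
container = locate_container(builder.root)
blocks = container.elements()
X = [block_features(blk, i, len(blocks)) for i, blk in enumerate(blocks)]
y = [1, 1, 1, -1]  # hypothetical hand labels: three text blocks, one promotional block
w, b = train_linear_svm(X, y)
kept = [blk.text() for blk, x in zip(blocks, X) if predict(w, b, x) == 1]
print(container.attrs.get("id"), kept)
```

In practice the labeled blocks would come from many annotated pages rather than from the page being cleaned; training on a single page here only keeps the sketch self-contained.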

