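Research on Directed Subject Acquisition of Internet Information (互联网主题信息定向采集研究)

English title: Research on Directed Subject Acquisition of Internet Information
Author: 黄仲清 (Huang Zhongqing)
Degree level: Master's
Discipline: Information Science (情报学)
Chinese keywords: 主题信息 ; 互联网信息 ; 定向采集 ; 正文抽取 ; 舆情信息
English keywords: Subject Information ; Internet Information ; Directed Acquisition ; Text Extraction ; Public Opinion Information
Degree year: 2010
Supervisors: 袁毅 (Yuan Yi) ; 许鑫 (Xu Xin)
Discipline code: 120502
Degree-granting institution: East China Normal University (华东师范大学)
Thesis submission date: 2010-05-01
Chair of the defense committee: 李国秋 (Li Guoqiu)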

Abstract

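In the information age, the volume of information on the Internet is growing at an extraordinary pace, and information explosion and information overload have created a new predicament. How to obtain the needed information quickly and effectively from massive Internet information resources has become an important problem that urgently needs solving. At the same time, users' information needs increasingly tend to be subject-oriented, domain-specific, specialized, and personalized, and meeting these subject-oriented needs is likewise an important current topic.
     Against this background, this thesis first studies and compares the theories, technologies, and information acquisition schemes currently applicable to directed acquisition of subject information on the Internet, including supporting technologies such as general and vertical search engine strategies, subject information acquisition techniques, automatic Chinese word segmentation, and large-scale text computing. On this basis it proposes an Internet subject-directed acquisition strategy that combines general search engines with vertical search engines, uses a domain-based method for generating and optimizing a subject thesaurus to determine the subject scope, and uses a text similarity algorithm for the system's text processing. With the acquisition strategy and underlying technologies determined, the framework of the Internet subject-directed acquisition system is then designed.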
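     The thesis analyzes and improves three key technologies in the acquisition system: it proposes an anti-blocking solution for web crawling that combines several anti-blocking techniques; it improves a text-density-based method for extracting the main text of web pages; and it implements content-based title deduplication using a word-segmentation-based vector space model and the cosine formula. An example illustrates an anti-blocking implementation that uses simulated-browser techniques to log in to websites automatically. The improved main-text extraction method is suited to news pages and is presented as a fairly general algorithm with good performance. For web page deduplication, the thesis mainly covers URL-comparison deduplication and content-based deduplication strategies, and gives the core algorithm of the title deduplication.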
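The content-based title deduplication described above can be sketched roughly as follows. This is a minimal illustration, not the thesis's core algorithm: the thesis relies on a Chinese word segmenter, whereas the hypothetical `tokenize` helper here falls back to whitespace splitting or character bigrams so that the example needs only the standard library. The function names and the 0.8 threshold are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(title):
    """Stand-in for a real word segmenter (assumption, not the thesis's tool).

    Spaced text is split on whitespace; unspaced text (e.g. Chinese)
    is split into overlapping character bigrams.
    """
    text = title.strip().lower()
    if " " in text:
        return text.split()
    return [text[i:i + 2] for i in range(len(text) - 1)] or [text]


def cosine_similarity(a, b):
    """Cosine of the angle between the term-frequency vectors of two titles."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def deduplicate_titles(titles, threshold=0.8):
    """Keep a title only if it is not too similar to any title already kept."""
    kept = []
    for title in titles:
        if all(cosine_similarity(title, k) < threshold for k in kept):
            kept.append(title)
    return kept
```

The pairwise comparison is quadratic in the number of titles, which is acceptable for a sketch but would likely need indexing or blocking at crawl scale.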
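     Finally, starting from research on Internet public opinion, the thesis analyzes the demands that public opinion research places on the collection and analysis of online information. For one branch of online public opinion, namely online information concerning overseas Chinese (网络侨情), it develops an Internet overseas Chinese information acquisition system. A subject thesaurus and seed websites for the overseas Chinese domain are determined, and a workflow covering URL crawling, page source fetching, title and main-text extraction, and web page deduplication is implemented. This lays a foundation for further analysis and processing of online public opinion information.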
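The URL-comparison deduplication step in this workflow might look like the sketch below. The normalization rules (lowercasing the scheme and host, dropping default ports and fragments) and the names `normalize_url` and `UrlFilter` are assumptions for illustration; the thesis's own rules are not given in the abstract.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Reduce a URL to a canonical form so trivially different spellings compare equal.

    Illustrative rules only: lowercase scheme and host, drop default ports,
    drop the fragment, and use "/" for an empty path. Any user:password
    part of the URL is discarded as a side effect of using `hostname`.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    is_default = (scheme == "http" and port == 80) or (scheme == "https" and port == 443)
    if port and not is_default:
        host = f"{host}:{port}"
    path = parts.path or "/"
    return urlunsplit((scheme, host, path, parts.query, ""))


class UrlFilter:
    """Remembers normalized URLs and reports whether a URL has been seen before."""

    def __init__(self):
        self._seen = set()

    def is_new(self, url):
        key = normalize_url(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

A crawler loop would call `is_new` before fetching each candidate link and skip the link when it returns `False`.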

Author's English abstract

The amount of information on the Internet is increasing dramatically with the coming of the information age. Because of information explosion and information overload, people face many new difficulties. Furthermore, the requirements of information users are increasingly subject-oriented, domain-specific, professional, and individualized. This thesis proposes a strategy for subject acquisition to meet people's diversified information requirements.
     First, this thesis compares the theories, technologies, and information acquisition solutions related to subject-oriented acquisition of Internet information. Second, an integrated information acquisition strategy is put forward that combines the general search engine and vertical search engine strategies. Third, the framework of the directed subject acquisition system is designed after identifying the acquisition strategies and underlying technologies.
     Three key technologies related to the system are analyzed and improved: an anti-shielding solution that combines several techniques to avoid being blocked, web content extraction based on text density, and duplicate elimination based on the vector space model (VSM) and the cosine formula.
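As a rough illustration of text-density-based content extraction, the sketch below scores each line of HTML by the ratio of visible text to total characters and keeps lines that are both long and dense. It is a simplified, generic variant and not the improved method of the thesis; `extract_body`, `min_density`, and `min_chars` are hypothetical names and values.

```python
import re
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")
SCRIPT_STYLE_RE = re.compile(r"<(script|style)\b.*?</\1>", re.DOTALL | re.IGNORECASE)


def extract_body(html, min_density=0.5, min_chars=20):
    """Return the lines of a page that look like main text.

    A line is kept when its tag-free text has at least `min_chars`
    characters and makes up at least `min_density` of the raw line.
    Both thresholds are illustrative guesses, not tuned values.
    """
    cleaned = SCRIPT_STYLE_RE.sub("", html)  # scripts and styles are never body text
    kept = []
    for line in cleaned.splitlines():
        line = line.strip()
        if not line:
            continue
        text = unescape(TAG_RE.sub("", line)).strip()
        density = len(text) / len(line)
        if len(text) >= min_chars and density >= min_density:
            kept.append(text)
    return "\n".join(kept)
```

Navigation bars and footers tend to be short and tag-heavy, so they score low on both criteria, while news paragraphs are long and mostly text. Pages whose markup is minified onto a single line would defeat this line-based variant.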
     Lastly, a directed subject information acquisition system for overseas Chinese information was developed. The process includes identifying the thesaurus of overseas Chinese Internet information and the seed sites, acquiring URLs and web source files, and extracting titles and text. The thesis lays groundwork for future research on Internet public opinion.
