Research on Web Forum Crawling and Hot Topic Detection (网络论坛采集及热点话题发现研究)

English title: Key Technology Research on Web Forums Crawling and Hot Topic Detection
Author: Li Hengxun (李恒训)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: Web Forums Crawling ; Automatic Structure Analysis ; Page-flipping Detection ; Web Page Structure Clustering ; Traversal Strategy (crawling path) ; Framework Design of Crawling ; Hot Topic Detection
Degree year: 2011
Advisors: Wang Bin (王斌) ; Liu Jingang (刘金刚)
Discipline code: 081203
Degree-granting institution: Capital Normal University (首都师范大学)
Submission date: 2011-05-15

Abstract

In recent years the Internet has grown rapidly and become an indispensable part of everyday life. Web forums, being interactive, immediate, and open, have attracted large numbers of users and become an important part of the Internet. Forums are an important channel for publishing and obtaining information, and they play a role in daily life, work, and entertainment. Users communicate through forums by posting a topic for everyone to discuss or by raising a question for others to help solve. Forums are thus a platform for sharing language and culture; they hold a great deal of valuable information, form a huge knowledge base, and are an important data source for search engines. In addition, Chinese Internet users are more active in expressing opinions than ever before. Hot topics continually emerge on forums, and some develop into major social events that can trigger serious public opinion crises. Forum crawling is therefore an important foundation for information retrieval, data mining, and public opinion monitoring. However, the distinctive structure of forums makes them very difficult to crawl, and most general-purpose search engines either avoid crawling forums or handle them only in a simple way.
     This thesis studies the key technologies of forum crawling. It addresses the complex structure of forums, their deep link hierarchies, the difficulty of identifying page-flipping links, and the tendency of crawlers to fall into crawler traps, and it proposes a fairly general method for crawling forums automatically.
     First, a randomized algorithm that combines depth-first and breadth-first traversal samples a certain number of pages from the forum. The logical structure of the forum is then analyzed offline through page structure clustering, dynamic page link clustering, and page validity identification, which yields an optimal crawling path; page-flipping link detection is used to reach forum posts that sit behind deep links. Based on the results of the offline analysis and a small amount of manual adjustment, the thesis designs and implements an efficient forum crawling framework, analyzes the performance problems of large-scale crawling, and applies the framework on a distributed file system for distributed crawling. Experimental results show that, compared with traditional crawling methods, the proposed method greatly improves the effective rate and coverage of forum crawling.
     Building on forum crawling, the thesis studies hot topic detection in forums, proposes a fast clustering algorithm based on topic words, and builds a prototype hot topic detection system. The system detects, in real time, the hot topics in a forum over a given period together with the posts that belong to each topic. It has been applied successfully in practice, in a public opinion monitoring system at ICT, CAS.
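
The abstract does not spell out how the topic-word clustering works, so the following is only a minimal sketch of one common approach, single-pass clustering with Jaccard overlap between topic-word sets. It is not the thesis's actual algorithm. The names `Topic` and `cluster_posts`, the threshold of 0.3, and the sample posts are all illustrative assumptions, and topic-word extraction is assumed to have been done already.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Topic:
    """A candidate hot topic: its accumulated topic words and member posts."""
    keywords: set[str]
    posts: list[str] = field(default_factory=list)


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two word sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_posts(posts, threshold=0.3):
    """Single-pass clustering over (post_id, topic_words) pairs.

    Each post joins the most similar existing topic if the similarity
    reaches `threshold` (an assumed value); otherwise it starts a new topic.
    Topics are returned with the largest (hottest) first.
    """
    topics: list[Topic] = []
    for post_id, words in posts:
        best, best_sim = None, 0.0
        for topic in topics:
            sim = jaccard(words, topic.keywords)
            if sim > best_sim:
                best, best_sim = topic, sim
        if best is not None and best_sim >= threshold:
            best.posts.append(post_id)
            best.keywords |= words  # let the topic vocabulary grow
        else:
            topics.append(Topic(keywords=set(words), posts=[post_id]))
    return sorted(topics, key=lambda t: len(t.posts), reverse=True)


# Hypothetical posts with pre-extracted topic words.
sample_posts = [
    ("p1", {"earthquake", "rescue", "donation"}),
    ("p2", {"earthquake", "rescue", "volunteers"}),
    ("p3", {"football", "league", "final"}),
    ("p4", {"earthquake", "donation", "volunteers"}),
]
hot_topics = cluster_posts(sample_posts)
for topic in hot_topics:
    print(len(topic.posts), topic.posts)
```

A single pass keeps the cost at one comparison per post per existing topic, which suits a stream of newly crawled posts; ranking by post count is just one possible measure of how hot a topic is.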

