Research on the Discovery and Analysis of Hot Events Based on Network Information
Abstract
With the rapid growth of the Internet, the web has become a major channel through which people publish and obtain information, and online information attracts ever more attention. Discovering and analyzing hot events in this information is therefore necessary. Online information comes from many sources, is highly random, and reflects the differing viewpoints and angles of its publishers, so judgment based on traditional experience alone cannot help users identify the main hot events or the main aspects of a particular hot event. Techniques and methods are therefore needed to process online information automatically, to discover hot events quickly and accurately, and to support some analysis and prediction of those events.
     This paper takes Internet web page information as its research object. Using information collection, clustering, and related techniques, it provides an effective solution for discovering and analyzing online information, so that users can clearly understand the current hot events in society and, to some extent, analyze the patterns these events follow and predict their development.
     First, the paper introduces the background of hot event discovery and analysis for online information, the state of research in China and abroad, and the key technologies involved. Second, it focuses on the paper's two innovations: an improved strategy for collecting online information and an improved clustering algorithm. These improvements raise, to some extent, both the efficiency of information collection and the quality of hot event discovery. Next, based on the collected results, the paper proposes several models of how hot events develop over time, which are used to analyze and predict online hot events. Finally, the paper presents a case experiment on hot event discovery and analysis using companies listed on the Growth Enterprise Market (GEM); the experiment shows that the proposed ideas achieve some effect.
     Research on hot event discovery and analysis in China is still relatively underdeveloped and leaves many problems to be solved, which also means that the field has considerable room for improvement. The paper closes with a summary of the work done and an outlook on future research.