Research and Implementation of Key Technologies for Online Public Opinion Analysis

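Author: Wu Yu (吴娱)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords: public opinion analysis; web text categorization; sentiment classification; web crawlers
Degree year: 2011
Advisor: She Kun (佘堃)
Discipline code: 081202
Degree-granting institution: University of Electronic Science and Technology of China (电子科技大学)
Thesis submission date: 2011-03-01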

Abstract

With the rapid development of computer and communication technology, the Internet has become an indispensable part of people's lives. According to International Telecommunication Union (ITU) statistics, by December 2010 the number of Internet users worldwide had exceeded 2 billion, of whom more than 390 million were in China. The Internet is widely regarded as the "fourth medium" after newspapers, radio, and television, and the public's rights to know, to express, to participate, and to supervise have largely been realized online. Internet users actively voice opinions on hot topics such as business, people's livelihood, government administration, anti-corruption, and social morality. These opinions create strong public pressure whose influence now far exceeds that of traditional media, and the Internet has become the principal carrier of public opinion.

As online public opinion grows rapidly, analyzing and monitoring it becomes increasingly important. The openness and relative freedom of the Internet release people from institutional controls and restrictions on speech, so they can express personal views, positions, and emotions without reservation, and public opinion flows more freely. At the same time, the virtual nature of the Internet introduces significant risks: speakers' identities are concealed, and rules and effective oversight are lacking, so the Internet easily becomes a place for some users to vent negative emotions. The country was also isolated for a long time, which leaves it vulnerable to outside ideological and cultural influence. Moreover, China is currently in a period of social transition with many contradictions, and a small number of social administrators habitually avoid or suppress public opinion. A public opinion analysis system is therefore needed to analyze and monitor online public opinion, to guard promptly against the social harm caused by misleading opinion, and to help keep public opinion on a sound course in support of building a harmonious society.

This thesis analyzes the requirements of an online public opinion analysis system, proposes a system design, and implements its key technologies, including web page text classification and text sentiment (orientation) analysis. Its contributions are:

1) To address the limitations of existing general-purpose crawlers, it proposes a data collection method based on a crawling strategy and a filtering strategy, which filters out large amounts of useless information. It also defines a page-repository update strategy tailored to public opinion analysis, which keeps the local page repository fresh.
2) Building on a study of naive Bayes web page text classification, it proposes a naive Bayes classification method improved with rough sets (a rough-set-weighted Bayesian classifier) and applies it to public opinion classification in the system.
3) It examines existing semantics-based and machine-learning-based sentiment analysis techniques and, combining the strengths of both, proposes a machine learning sentiment analysis method improved with semantic information, which is applied successfully in the system.
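The abstract does not give the formula behind the rough-set-weighted Bayesian classifier in item 2), so the following is only a minimal sketch of the general idea of attribute-weighted naive Bayes, not the thesis's method. It assumes a multinomial naive Bayes model with Laplace smoothing in which each term's log-likelihood is scaled by a weight. The class name `WeightedNaiveBayes`, the toy documents, and the `weights` values are all hypothetical; in a rough-set approach the weights would presumably come from an attribute-significance measure, which is not computed here.

```python
import math
from collections import Counter, defaultdict


class WeightedNaiveBayes:
    """Multinomial naive Bayes with a per-term weight on each log-likelihood.

    Hypothetical illustration: `weights` stands in for attribute significance
    values (for example, derived from rough set theory); they are supplied
    by the caller, not computed here.
    """

    def __init__(self, weights=None, alpha=1.0):
        self.weights = weights or {}  # term -> weight, default weight is 1.0
        self.alpha = alpha            # Laplace smoothing constant

    def fit(self, docs, labels):
        self.class_docs = Counter(labels)        # documents per class
        self.term_counts = defaultdict(Counter)  # class -> term frequencies
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.term_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_terms = {c: sum(cnt.values())
                            for c, cnt in self.term_counts.items()}
        return self

    def log_scores(self, tokens):
        n_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        scores = {}
        for c in self.class_docs:
            score = math.log(self.class_docs[c] / n_docs)  # log prior
            for t in tokens:
                if t not in self.vocab:
                    continue  # ignore terms never seen in training
                p = ((self.term_counts[c][t] + self.alpha)
                     / (self.total_terms[c] + self.alpha * vocab_size))
                score += self.weights.get(t, 1.0) * math.log(p)
            scores[c] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Toy, made-up training data (already tokenized).
docs = [
    ["corruption", "official", "report"],
    ["official", "bribe", "investigation"],
    ["match", "goal", "team"],
    ["team", "coach", "season"],
]
labels = ["politics", "politics", "sports", "sports"]
weights = {"official": 2.0, "team": 2.0, "report": 0.5}  # assumed values

clf = WeightedNaiveBayes(weights).fit(docs, labels)
print(clf.predict(["official", "report"]))
print(clf.predict(["team", "goal"]))
```

With all weights left at 1.0 the class reduces to ordinary multinomial naive Bayes, which makes it easy to compare weighted and unweighted predictions on the same data.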
