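Research and Implementation of an Internet Public Opinion Monitoring and Analysis System

Author: 刘德鹏
Degree level: Master's
Discipline: Software Engineering
Keywords: Network Public Opinion; Monitoring and Analysis; Hot Topic Recognition; Text Tendency Analysis; Semantic Role Labeling
Degree year: 2011
Advisors: 徐谡; 许慧新
Discipline code: 081203
Degree-granting institution: University of Electronic Science and Technology of China (电子科技大学)
Submission date: 2011-03-25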

Abstract

With the rapid development of the Internet, the network gives people an unprecedentedly open and convenient platform for sharing and publishing information. More and more people express their opinions, ideas, emotions and attitudes online. This includes information that has a positive, constructive effect on how events develop as well as negative, harmful information. At the same time, the openness, directness and anonymity of online platforms give online public opinion a growing influence on people's ideology. Timely and effective monitoring and analysis of large volumes of public opinion information is therefore of practical importance for maintaining social stability and promoting national development.
     Network public opinion monitoring systems are closely tied to natural language processing (NLP) technology. Constrained by the state of NLP, traditional systems concentrate on topic recognition and related tasks and pay little attention to the emotional side of public opinion. Some researchers have studied opinion and sentiment mining for public opinion, but the results depend heavily on the corpus used, which limits their practical value.
     In recent years, as NLP research has deepened, shallow semantic analysis has begun to emerge and, in applied research, has proved more intelligent and practical than part-of-speech tagging and syntactic parsing. Shallow semantic analysis is a simplified form of semantic analysis: because the verb is key to understanding a sentence, it formalizes the meaning of the sentence around the verb. Semantic role labeling is one kind of shallow semantic analysis. It labels the constituents of a sentence that act as semantic roles of a given verb predicate, and it has the advantages of a clearly defined task and straightforward evaluation.
     Drawing on this recent NLP technology and on a comparative analysis of existing public opinion monitoring and analysis algorithms, we design and implement a network public opinion monitoring and analysis system. Its contributions are: (1) a new tendency algorithm that combines the word semantic similarity algorithm published with HowNet with a character-based tendency computation, together with an optimized integration of existing hot topic identification and tracking techniques; (2) regularities in how tendency is expressed in language, obtained from statistical analysis of a large number of samples and represented as a role-feature probability table and a role-emotion probability table, which provide an objective data basis for subsequent analysis.
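The abstract does not give the formula for the combined tendency algorithm, so the following is only a minimal sketch of one way a similarity-based score and a character-based score could be blended. Everything in it is an assumption made for illustration: the function names, the tiny lexicons and seed words, the mixing weight `alpha`, and above all `toy_similarity`, which is a hand-written stand-in for a HowNet-based word similarity measure and not HowNet's actual interface.

```python
from collections import Counter

def char_weights(pos_words, neg_words):
    """Estimate a tendency in [-1, 1] for each character from lexicon counts."""
    pos = Counter(ch for w in pos_words for ch in w)
    neg = Counter(ch for w in neg_words for ch in w)
    total_pos = sum(pos.values())
    total_neg = sum(neg.values())
    weights = {}
    for ch in set(pos) | set(neg):
        p = pos[ch] / total_pos  # relative frequency in positive words
        n = neg[ch] / total_neg  # relative frequency in negative words
        weights[ch] = (p - n) / (p + n)
    return weights

def char_tendency(word, weights):
    """Average the character tendencies; unseen characters count as neutral."""
    return sum(weights.get(ch, 0.0) for ch in word) / len(word)

def similarity_tendency(word, pos_seeds, neg_seeds, similarity):
    """Mean similarity to positive seeds minus mean similarity to negative seeds."""
    pos = sum(similarity(word, s) for s in pos_seeds) / len(pos_seeds)
    neg = sum(similarity(word, s) for s in neg_seeds) / len(neg_seeds)
    return pos - neg

def combined_tendency(word, weights, pos_seeds, neg_seeds, similarity, alpha=0.5):
    """Blend the two scores; alpha is an illustrative mixing weight."""
    sim_score = similarity_tendency(word, pos_seeds, neg_seeds, similarity)
    return alpha * sim_score + (1 - alpha) * char_tendency(word, weights)

# Hypothetical toy data, not taken from the thesis.
POS_WORDS = ["美好", "优秀", "善良"]
NEG_WORDS = ["丑恶", "恶劣", "卑劣"]
TOY_TABLE = {("良好", "美好"): 0.8, ("良好", "丑恶"): 0.1}

def toy_similarity(a, b):
    """Stand-in for a HowNet-based similarity in [0, 1]; unknown pairs score 0."""
    return 1.0 if a == b else TOY_TABLE.get((a, b), 0.0)

weights = char_weights(POS_WORDS, NEG_WORDS)
score_good = combined_tendency("良好", weights, ["美好"], ["丑恶"], toy_similarity)
score_bad = combined_tendency("恶劣", weights, ["美好"], ["丑恶"], toy_similarity)
print(score_good, score_bad)
```

In this sketch the character-based score still gives a signal for words the similarity measure knows nothing about (such as "恶劣" above), which is one plausible motivation for combining the two.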
     This thesis covers the following:
     (1) Design of the system framework and modules. Based on the characteristics of network public opinion information, the thesis proposes an overall system framework and designs the information preprocessing, information mining and information service modules.
     (2) Hot topic identification. To extract and track news subjects that appear in large numbers on the network over a period of time, the thesis integrates ICTCLAS word segmentation, document-frequency feature selection, TF-IDF weighting and the K-means clustering algorithm to identify and track hot topics.
     (3) Shallow semantic analysis. The thesis uses a semantic role labeling tool, with training and testing, to label the semantic roles in text, which can significantly improve the efficiency of text tendency analysis.
     (4) Text tendency analysis. The thesis extracts the opinions and emotions expressed in text. The work mainly covers construction of an emotion lexicon and a feature lexicon, the emotional tendency computation algorithm, and knowledge discovery in the corpus.
     The work described here has been applied to the analysis of domestic events. It can effectively assist public opinion monitoring and reduce manual intervention, and it is expected to benefit network information management in the future.
