Research and Application of Key Technologies for Network Public Opinion Information Mining
Abstract
With the rapid development of the Internet, more and more people express their opinions, ideas, emotions and attitudes online. This content includes information that has a positive, constructive effect on how events develop, as well as negative information. At the same time, the openness, directness and anonymity of online platforms give online public opinion a growing influence on people's ideology. Mining large volumes of public opinion information promptly and effectively is therefore of practical significance for maintaining social stability and promoting national development.
     Network public opinion information mining is closely related to natural language processing (NLP). Limited by the state of NLP technology, traditional network public opinion mining has focused mainly on topic identification and has paid little attention to the sentiment expressed in public opinion. In recent years shallow semantic analysis has emerged and has proved more intelligent and practical in applied research than part-of-speech tagging and syntactic parsing. Shallow semantic analysis is a simplified form of semantic analysis that gives a formal representation of sentence meaning centered on the verb. Drawing on related NLP techniques and on a comparative analysis of existing public opinion analysis algorithms, this work studies and experiments with public opinion information mining techniques and applies the results in a network public opinion monitoring and analysis system. The main contents are as follows.
     (1) Introduction to NLP technology. Given the importance of NLP in network public opinion information mining, Chapter 2 briefly describes the key parts of this technology.
     (2) Research on hot topic identification in public opinion. Building on ICTCLAS word segmentation and part-of-speech tagging, this work proposes a hot topic identification method that combines text keyword extraction with text clustering. Because public opinion information is highly time-sensitive, the segmentation error rate for out-of-vocabulary words is high; joining adjacent segments according to word co-occurrence probability effectively improves the segmentation of out-of-vocabulary words. Keyword extraction also takes word position weights into account.
     (3) Research on orientation analysis of public opinion text. Semantic role labeling (SRL), a form of shallow semantic analysis, is combined with sentiment lexicon construction to mine the orientation of a text. Statistical analysis of SRL samples yields a role-feature probability table and a role-sentiment probability table, which provide data to support the choice of role extraction order. The sentiment lexicon is built by combining manual annotation with automatic expansion; experiments on character-based calculation of sentiment word orientation lead to an improved method for automatic lexicon expansion.
     (4) Design and implementation of a public opinion monitoring and analysis system. Based on the characteristics of network public opinion information, an overall system framework is proposed and the main modules are briefly introduced.
     The work described here has been applied in a network public opinion monitoring and analysis system, where it effectively assists public opinion monitoring and reduces manual intervention. It is expected to benefit future network information management.
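     As an illustration of the co-occurrence idea in item (2), the following is a minimal sketch rather than the method used in this work. It assumes the text has already been segmented into token lists (for example by ICTCLAS), and it joins two adjacent tokens when the pair accounts for most occurrences of both tokens. The function name merge_by_cooccurrence, the thresholds min_count and min_prob, and the toy data are all hypothetical; the abstract does not give the actual probability definition or threshold.

```python
from collections import Counter
from typing import Iterable, List


def merge_by_cooccurrence(
    docs: Iterable[List[str]],
    min_count: int = 3,      # hypothetical threshold, not from the source
    min_prob: float = 0.8,   # hypothetical threshold, not from the source
) -> List[List[str]]:
    """Join adjacent tokens that almost always occur together.

    A single greedy left-to-right pass; each token joins at most one neighbor.
    """
    docs = [list(doc) for doc in docs]
    unigram = Counter(tok for doc in docs for tok in doc)
    bigram = Counter((a, b) for doc in docs for a, b in zip(doc, doc[1:]))

    def should_join(a: str, b: str) -> bool:
        n = bigram[(a, b)]
        if n < min_count:
            return False
        # The adjacent pair must dominate the occurrences of both tokens.
        return n / unigram[a] >= min_prob and n / unigram[b] >= min_prob

    merged_docs = []
    for doc in docs:
        out, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and should_join(doc[i], doc[i + 1]):
                out.append(doc[i] + doc[i + 1])  # concatenate the two segments
                i += 2
            else:
                out.append(doc[i])
                i += 1
        merged_docs.append(out)
    return merged_docs


# Toy data: an out-of-vocabulary word wrongly split into two single characters.
docs = [
    ["微", "博", "用户", "增长"],
    ["微", "博", "热点"],
    ["关注", "微", "博"],
]
print(merge_by_cooccurrence(docs))
# [['微博', '用户', '增长'], ['微博', '热点'], ['关注', '微博']]
```

     Requiring the pair to dominate both tokens keeps frequent function words from being glued to their neighbors; a real system would likely tune the thresholds on annotated data.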