Research on Hotspot Information Discovery and Tendency Analysis in Internet Public Opinion
Abstract
With the development of information technology and the growing reach of the Internet, the network has become the main channel through which the public obtains information, and an important platform for posting comments and expressing public opinion. As news topics and user comments on the Internet grow rapidly, information technology faces several hard problems: how to collect, from massive volumes of data, the information that meets a specific need; how to organize Internet information into effective machine-readable data; and how to separate useful information from useless information in the collected data. Internet public opinion is the sum of the political beliefs, attitudes, opinions, and emotions that the public expresses through the Internet about government administration and about the phenomena and problems of society. Internet public opinion and social public opinion interact with and influence each other. The two are consistent in the form their content takes, and Internet public opinion also influences, to some extent, how social public opinion develops, so its impact on society is considerable. Government departments therefore need the ability to monitor Internet public opinion: to identify promptly the hot issues that concern the public in a given period and to understand public views and attitudes toward hot events, so that they can make sound decisions and proactively guide public opinion.
     Building on a review of current research on hotspot information discovery and on tendency (sentiment orientation) analysis of Internet public opinion, this paper starts from the sources of public opinion information and designs a detailed collection process. For hot information, which concerns both the general public and government departments, the paper establishes criteria for judging hot information from its concept and characteristics, quantifies those characteristics, builds a mathematical model, and describes the discovery and acquisition of hot information as an algorithm. For tendency analysis, the paper first builds a polarity dictionary by hand and then expands and corrects it. It further analyzes how out-of-vocabulary words, negation words, and intensifying adverbs affect the polarity of the original polarity words, and proposes corresponding solutions. Ordinary text is represented as a vector, and the feature terms of a text are selected by computing term weights. Because Chinese sentences are delimited by punctuation marks, the paper parses each sentence syntactically, extracting the dependency relations between words and tagging each word with its part of speech. The paper builds semantic templates and determines the semantic pattern of a sentence by template matching. It computes the polarity value of each word from the polarity dictionary, then combines syntactic analysis and pattern matching to obtain the word's contextual polarity. The tendency of a sentence is determined by its topic words, its polarity words, and their polarity values; the tendency of a text is computed from the tendencies of its sentences and the weight of each sentence within the whole text. Finally, the paper reports simulation experiments on this work and discusses and analyzes the results.
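The dictionary-based part of the tendency analysis described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the lexicon entries, the negation and intensifier lists, the numeric values, and the function names `sentence_tendency` and `text_tendency` are all hypothetical. The sketch also swaps the paper's dependency parsing and semantic-template matching for a much simpler rule (a modifier applies to the next polarity word), and it ignores topic words. English tokens stand in for text that, in Chinese, would first need word segmentation.

```python
"""Illustrative lexicon-based tendency scoring; every value below is an assumption."""

# Hypothetical polarity dictionary: word -> polarity value in [-1, 1].
POLARITY = {"good": 0.8, "excellent": 1.0, "bad": -0.7, "terrible": -1.0}
# Hypothetical negation words: each one flips the sign of the next polarity word.
NEGATIONS = {"not", "never"}
# Hypothetical intensifying adverbs: each one scales the next polarity word.
INTENSIFIERS = {"very": 1.5, "slightly": 0.5}


def sentence_tendency(tokens):
    """Sum the context-adjusted polarity of every polarity word in one sentence."""
    score = 0.0
    sign, scale = 1.0, 1.0  # modifiers waiting for the next polarity word
    for token in tokens:
        if token in NEGATIONS:
            sign = -sign
        elif token in INTENSIFIERS:
            scale *= INTENSIFIERS[token]
        elif token in POLARITY:
            score += sign * scale * POLARITY[token]
            sign, scale = 1.0, 1.0  # modifiers are consumed by this word
        # Any other word leaves pending modifiers in place (a simplification
        # standing in for real syntactic analysis).
    return score


def text_tendency(sentences, weights):
    """Weighted average of sentence tendencies; weights model sentence importance."""
    if len(sentences) != len(weights):
        raise ValueError("need exactly one weight per sentence")
    total = sum(weights)
    if total == 0:
        return 0.0
    weighted = sum(w * sentence_tendency(s) for s, w in zip(sentences, weights))
    return weighted / total
```

For example, `text_tendency([["service", "is", "not", "good"], ["food", "is", "very", "good"]], [1.0, 2.0])` scores the first sentence negatively and the second positively, then averages them with the second sentence counted twice as heavily. How sentence weights should be chosen (position, topic relevance, or something else) is left open here, since the abstract does not specify it.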