Research on Sentiment Orientation Analysis of Blog Posts Based on Blog Search
Abstract
With the spread of the Internet and its rapid worldwide development, the volume of blog content online is growing explosively. Blog usage among Internet users has reached 57.7%, and awareness and adoption of blogs continue to rise. Blogs let authors publish their views easily and let readers quickly browse and comment on posts, so sharing ideas and resources through blogs is increasingly popular. Blogs have become an important platform for expressing and exchanging sentiment, and increasingly a primary venue where public opinion forms and spreads.
     In an age of information overload, however, Internet users pay more attention to concise, sentiment-bearing information about celebrities and topical issues. To obtain sentiment information such as support or opposition from blogs quickly and on demand, a suitable sentiment retrieval tool is urgently needed to organize and search the massive body of blog resources. Blog sentiment orientation search is the best fit for this need.
     By analyzing the sentiment factors implicit in Chinese blog posts and drawing on natural language processing techniques, this thesis proposes SPOA (Sentiment-dictionary and Parsing based Orientation Analysis), an algorithm for blog sentiment orientation analysis that combines a sentiment dictionary with dependency parsing. In the preprocessing stage, a basic sentiment dictionary and a dictionary of words that can carry either positive or negative polarity are built to identify the sentiment words in a post. Groups of relation pairs serve as the smallest unit of sentiment analysis. Combined with the proposed VCCA algorithm (VOB and CMP Convert to ADV), which converts relation pairs with displaced sentiment, this makes the computation of context-dependent modification degree more accurate and reasonable.
     Experiments show that, for sentiment analysis of Chinese blog posts, the dependency-syntax-based SPOA method outperforms the window-based modification algorithm, and that introducing syntactic distance and modification through dependency relation pairs clearly improves the performance of sentiment orientation analysis. Full-text analysis and analysis of web excerpts show no marked difference, but processing the key sentiment sentences identified from the structure of a post gives the best overall performance, which indicates that the structural characteristics of blog posts have a clear effect on sentiment analysis.
     Finally, the sentiment analysis algorithm is applied to rank and return blog posts according to the orientation the user asks for, yielding an initial prototype of a blog sentiment search system.
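The abstract does not give the details of SPOA or VCCA, so the following is only a minimal sketch of the general idea it describes: score sentiment words from a lexicon, adjust each score using the words that modify it through a dependency relation, and rank posts by the orientation a user asks for. The lexicon entries, the weights, the function names, and the assumption that each sentence arrives already parsed into (head index, dependent index, relation) triples are all hypothetical and are not taken from the thesis. Only the relation label ADV comes from the abstract.

```python
# Hypothetical toy lexicons; the thesis's dictionaries are not given in the abstract.
SENTIMENT = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}
INTENSIFIERS = {"very": 1.5, "slightly": 0.5}
NEGATORS = {"not", "never"}


def score_sentence(tokens, pairs):
    """Score one sentence.

    tokens: list of words.
    pairs:  list of (head_index, dependent_index, relation) triples,
            assumed to come from an external dependency parser.
    """
    total = 0.0
    for i, word in enumerate(tokens):
        polarity = SENTIMENT.get(word)
        if polarity is None:
            continue  # not a sentiment word
        # Apply every adverbial (ADV) modifier attached to this sentiment word.
        for head, dep, relation in pairs:
            if head != i or relation != "ADV":
                continue
            modifier = tokens[dep]
            if modifier in NEGATORS:
                polarity = -polarity
            elif modifier in INTENSIFIERS:
                polarity *= INTENSIFIERS[modifier]
        total += polarity
    return total


def score_post(sentences):
    """Sum sentence scores; sentences is a list of (tokens, pairs) tuples."""
    return sum(score_sentence(tokens, pairs) for tokens, pairs in sentences)


def rank_posts(posts, want="support"):
    """Return post ids ordered by the requested orientation.

    posts: dict mapping post id -> list of (tokens, pairs).
    want:  "support" puts the most positive posts first,
           anything else puts the most negative posts first.
    """
    scores = {post_id: score_post(s) for post_id, s in posts.items()}
    return sorted(scores, key=scores.get, reverse=(want == "support"))
```

In this sketch a modifier affects only the sentiment word it is attached to in the parse, whereas a window-based approach would apply it to any sentiment word within a fixed number of tokens. That difference is the kind of distinction the experiments in the abstract compare.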
