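Analysis and Application Research of Text Orientation Based on Semantics

Original title (Chinese): 基于语义的文本倾向性分析与应用研究
Author: 杨天明 (Yang Tianming)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: semantics; text orientation; HNC theory; online public opinion
Degree year: 2009
Supervisor: 程显毅 (Cheng Xianyi)
Discipline code: 081203
Degree-granting institution: Jiangsu University (江苏大学)
Thesis submission date: 2009-10-01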

Abstract

With the rapid development of Internet technology, more and more people use the Internet to express opinions on goods and services and to exchange views on all kinds of events. The Internet is no longer just a repository from which people retrieve information; it has become a forum for voicing and exchanging opinions, and it has changed both the way its users work and the way they live. When people comment on something or state a view, they usually do so with an orientation. Text orientation analysis arose to extract useful information from this wealth of material. It is an active area of natural language processing whose goal is to judge whether a text supports or opposes the object it evaluates. The main work of this thesis is as follows:
     (1) Traditional text orientation analysis methods are analyzed and their shortcomings are pointed out. Based on an analysis of the theory of semantic information and semantic orientation, three semantic analysis methods based on semantic orientation are discussed.
     (2) An algorithm is proposed that computes the original polarity of words using HNC-based semantic relatedness. Building on an in-depth study of the basic theory of HNC, a method for computing semantic relatedness based on the HNC concept primitive symbol system is presented; calculation strategies for semantic relatedness are given according to HNC theory, and a detailed method for quantitatively comparing concept symbols is implemented. Finally, the HNC-based semantic relatedness method is applied to the analysis of original word polarity, which makes the original polarity of words easier to compute and more accurate.
     (3) An improved algorithm is proposed for computing the contextual polarity of words. The overall framework of the text orientation algorithm is given first, followed by a detailed description of the algorithm flow. Because ignoring the connectives in a sentence can cause the direction or strength of a polarity word to be judged incorrectly, a context-based word orientation analysis method is proposed to address this problem. The orientation of a review text is then determined from the distribution of its polarity words by computing the breadth, density, and intensity with which polarity components occur in the text.
     (4) On the basis of this theoretical research, text orientation analysis is applied to an online public opinion monitoring system, the National Security Intelligence System (国保情报系统). Experiments show that applying text orientation analysis to the public opinion monitoring system improves the efficiency with which the system can be used.
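
The idea in item (3) can be illustrated with a minimal sketch. The abstract does not give the connective rules or the formulas for breadth, density, and intensity, so everything concrete here is an assumption made for illustration: the toy lexicon `PRIOR` (standing in for original polarities that the thesis derives from HNC-based semantic relatedness), the word lists, the weight `PRE_ADVERSATIVE_WEIGHT`, and the three measure definitions. It is not the thesis's algorithm.

```python
# Hypothetical toy lexicon (assumption): word -> original polarity in [-1, 1].
# The thesis derives such values from HNC-based relatedness; here they are hard-coded.
PRIOR = {"good": 1.0, "excellent": 1.0, "bad": -1.0, "slow": -0.5}
NEGATORS = {"not", "never"}                    # flip the direction of the next polarity word
INTENSIFIERS = {"very": 1.5, "slightly": 0.5}  # scale the strength of the next polarity word
ADVERSATIVES = {"but", "however"}              # connectives that shift emphasis to what follows
PRE_ADVERSATIVE_WEIGHT = 0.5                   # assumed down-weighting of the earlier clause


def contextual_scores(tokens):
    """Return the contextual polarity of each polarity word in one tokenized sentence."""
    scores = []
    modifier = 1.0  # accumulated effect of negators and intensifiers
    pivot = None    # number of polarity words seen before the last adversative
    for tok in tokens:
        if tok in NEGATORS:
            modifier *= -1.0
        elif tok in INTENSIFIERS:
            modifier *= INTENSIFIERS[tok]
        elif tok in ADVERSATIVES:
            pivot = len(scores)
            modifier = 1.0
        elif tok in PRIOR:
            scores.append(PRIOR[tok] * modifier)
            modifier = 1.0  # a modifier applies to one polarity word only
    if pivot is not None:
        # Polarity words before the connective count less than those after it.
        scores = [s * PRE_ADVERSATIVE_WEIGHT if i < pivot else s
                  for i, s in enumerate(scores)]
    return scores


def text_orientation(sentences):
    """Aggregate contextual scores over a text given as a list of token lists."""
    per_sentence = [contextual_scores(s) for s in sentences]
    all_scores = [x for s in per_sentence for x in s]
    if not all_scores:
        return {"breadth": 0.0, "density": 0.0, "intensity": 0.0, "label": "neutral"}
    n_tokens = sum(len(s) for s in sentences)
    # Assumed definitions of the three measures:
    breadth = sum(1 for s in per_sentence if s) / len(sentences)  # share of sentences with polarity words
    density = len(all_scores) / n_tokens                          # polarity words per token
    intensity = sum(all_scores) / len(all_scores)                 # signed mean strength
    if intensity > 0:
        label = "positive"
    elif intensity < 0:
        label = "negative"
    else:
        label = "neutral"
    return {"breadth": breadth, "density": density, "intensity": intensity, "label": label}


if __name__ == "__main__":
    review = [
        ["the", "screen", "is", "very", "good"],
        ["battery", "is", "not", "good", "but", "camera", "is", "excellent"],
    ]
    print(text_orientation(review))
```

In this sketch the label follows the sign of the intensity alone, while breadth and density are reported alongside it as an indication of how much of the text carries polarity. How the thesis combines the three measures is not stated in the abstract.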

