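Blog-Based Analysis of Author Reputation Degree

English title: The Analysis of Blog Author Reputation Degree
Author: Diao Yufeng (刁宇峰)
Thesis level: Master's
Discipline: Computer Application Technology
Keywords (translated from Chinese): Blog; spam comments; sentiment orientation; multi-sentence joint assessment method; author reputation degree
English keywords: Blog; Opinion Spam; Emotion Orientation; Sentences joint assessment method; Blog author reputation degree
Degree year: 2011
Advisor: Lin Hongfei (林鸿飞)
Discipline code: 081203
Degree-granting institution: Dalian University of Technology (大连理工大学)
Thesis submission date: 2011-11-01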

Abstract

With the development of Web 2.0, many application platforms built on it have grown rapidly, and blogs are among the most representative of these platforms. The free, divergent, and casual nature of blogs has made them a major venue where public opinion forms and spreads. Research on spam comments, duplicate comments, and blog author reputation has therefore become increasingly important.
     This thesis measures a blog author's reputation degree from two angles: the comments and the blog posts. Rather than ranking authors by click count alone, as traditional approaches do, it takes click counts fully into account and adds two further analyses. For the comment set, it considers comment quality, removing spam comments and duplicate comments to obtain the set of relevant comments. For the posts, it extracts the topic each post discusses. It then combines the quality and content of the comments, the content of the posts, and the emotional keynote of each post, which is obtained by analyzing semantic features and evaluated with a paragraph-based multi-sentence joint assessment method. Together these elements determine the sentiment orientation of the comments, which is combined with page views and other factors to compute the author's reputation degree. The result is a re-ranking of the authors on a blog site that addresses inaccurate author rankings and yields individual blog rankings based on author reputation degree.
     Lifestyle blogs were extracted from Sina Blog as the training set, and reputation analysis and ranking were carried out on several of the top-ranked blogs. The experiments show that the method effectively captures author reputation and yields a fairer ranking of authors. The method can help with managing comment collections in the blog space and with timely monitoring of online public opinion, and it provides an effective way to compute blog author reputation degree.
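
The pipeline outlined in the abstract (filter the comment set, judge comment sentiment, then combine it with page views) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the toy word lists, the `is_spam` flag (assumed to come from an upstream spam detector), exact-match duplicate removal, whitespace tokenization, the weight `alpha`, and the scoring formula itself.

```python
import math
from dataclasses import dataclass

# Hypothetical toy lexicon; a real system would use a full sentiment lexicon.
POSITIVE = {"great", "helpful", "love"}
NEGATIVE = {"boring", "wrong", "bad"}


@dataclass
class Comment:
    text: str
    is_spam: bool = False  # assumed output of an upstream spam detector


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def relevant_comments(comments):
    """Drop spam and exact duplicates (after normalization), keeping order."""
    seen = set()
    kept = []
    for c in comments:
        key = normalize(c.text)
        if c.is_spam or key in seen:
            continue
        seen.add(key)
        kept.append(c)
    return kept


def comment_polarity(text: str) -> int:
    """Return +1, -1, or 0 from lexicon hits (whitespace tokens only)."""
    words = normalize(text).split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return (score > 0) - (score < 0)


def reputation(comments, page_views: int, alpha: float = 0.5) -> float:
    """Illustrative score: log-scaled page views adjusted by mean sentiment.

    With 0 <= alpha <= 1 the adjustment factor stays within [1 - alpha, 1 + alpha].
    """
    kept = relevant_comments(comments)
    mean_sentiment = (
        sum(comment_polarity(c.text) for c in kept) / len(kept) if kept else 0.0
    )
    return math.log1p(page_views) * (1 + alpha * mean_sentiment)


if __name__ == "__main__":
    author_a = [
        Comment("Boring and wrong"),
        Comment("boring  and WRONG"),             # duplicate after normalization
        Comment("buy cheap pills", is_spam=True),  # flagged as spam
    ]
    author_b = [Comment("Great post"), Comment("very helpful")]
    print("A:", reputation(author_a, page_views=1000))
    print("B:", reputation(author_b, page_views=500))
```

In this toy run, author B ranks above author A despite having fewer page views, because A's only relevant comment is negative. This is the kind of re-ranking the abstract motivates, although the thesis's actual scoring may differ.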
