Data Feature Extraction of Blogs and Filtering of Splogs Based on Classification

Original title: 博客数据特征提取与基于分类的垃圾博客过滤
Author: 闫瑞 (Yan Rui)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords: splog classification; assembly classifier; AdaBoost algorithm; ensemble learning; text clustering
Degree year: 2009
Advisor: 曹先彬 (Cao Xianbin)
Discipline code: 081202
Degree-granting institution: University of Science and Technology of China
Thesis submission date: 2009-05-01

Abstract

With the rapid development of the Internet, blogs have become a new generation of online communication after email, BBS, and QQ/ICQ, and they have quickly become part of people's daily lives as a basic Internet service. As the blogosphere has grown, spam blogs (splogs) have spread rapidly to every corner of it. The large number of splogs seriously degrades the accuracy of information retrieval and makes the user experience steadily worse, so identifying splogs precisely has become one of the pressing problems in information retrieval. In information security, opinion analysis of blog content has become a new research focus, but a large number of splogs seriously distorts the results of such analysis and greatly reduces their correctness and credibility. Blogs therefore need to be filtered for spam before further analysis and retrieval.
     Building on existing splog feature extraction, this thesis proposes using part-of-speech analysis to extract further features from blogs. First, in Chinese grammar a sentence consists of subject, predicate, and object, and colloquial text in particular contains many elliptical sentences that have only a subject and a predicate, or only a predicate. Second, most blog authors write about things that interest them or record their moods and recent circumstances, so they use a rich set of adjectives and modal particles to express themselves. Third, splogs are usually written only to raise click-through rates, or to boost the importance of a page in search engines by adding links and keywords, so their text contains a large number of nouns, especially industry-specific proper nouns. Analyzing the parts of speech in blog posts and extracting part-of-speech features therefore greatly increases the complementarity among features and improves splog classification and filtering.
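The part-of-speech idea above can be illustrated with a minimal sketch. It assumes each post has already been segmented and tagged by some external Chinese tagger, and that the tags follow PKU/ICTCLAS-style prefixes ("n" for nouns, "a" for adjectives, "y" and "e" for modal particles and interjections). The names `TAG_GROUPS` and `pos_ratio_features`, the choice of ratios, and the two hand-tagged sample posts are hypothetical and are not taken from the thesis.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Assumed tag prefixes (PKU/ICTCLAS style); illustrative, not from the thesis.
TAG_GROUPS = {
    "noun": ("n",),        # n, nr, ns, nt, nz, ...
    "adjective": ("a",),   # a, ad, an, ...
    "modal": ("y", "e"),   # modal particles and interjections
}


def pos_ratio_features(tagged: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Return the share of tokens that fall in each tag group for one post."""
    counts = Counter()
    total = 0
    for _word, tag in tagged:
        total += 1
        for group, prefixes in TAG_GROUPS.items():
            if tag.startswith(prefixes):
                counts[group] += 1
                break
    if total == 0:
        # An empty post yields all-zero ratios instead of dividing by zero.
        return {f"{group}_ratio": 0.0 for group in TAG_GROUPS}
    return {f"{group}_ratio": counts[group] / total for group in TAG_GROUPS}


# Hand-tagged toy posts (hypothetical data).
personal_post = [("今天", "t"), ("心情", "n"), ("真", "d"), ("好", "a"), ("啊", "y")]
spam_post = [("北京", "ns"), ("搬家", "vn"), ("公司", "n"), ("电话", "n"), ("优惠", "vn")]

print(pos_ratio_features(personal_post))
print(pos_ratio_features(spam_post))
```

Under these assumptions the spam-like post shows a much higher noun ratio and no adjectives or modal particles, which is the kind of contrast the part-of-speech features are meant to expose to a classifier.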
     This thesis further designs a dynamic assembly classification algorithm for splog filtering. The algorithm first constructs a treelike assembly classifier structure to support classification, and then uses a dynamic adjustment strategy to train the assembly classifier. Compared with existing methods based on a single classifier or a simple ensemble classifier, it adapts the combination structure of the classifiers to the distribution of the samples, which effectively mitigates the impact of sparse sample features and highly imbalanced samples on classification performance. Experiments on splog filtering show that the algorithm achieves good precision and recall.
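The abstract does not describe the tree structure or the adjustment strategy in detail, so the following sketch does not reproduce the thesis algorithm. It only illustrates standard discrete AdaBoost with decision stumps, the ensemble technique named in the keywords, on a small imbalanced toy set. The feature layout (noun ratio, modal-particle ratio), the sample values, and all function names are assumptions made for illustration.

```python
import math
from typing import List, Tuple

Stump = Tuple[int, float, int]  # (feature index, threshold, polarity)


def stump_predict(stump: Stump, x: List[float]) -> int:
    """Predict `polarity` when the feature exceeds the threshold, else its negation."""
    j, thr, pol = stump
    return pol if x[j] > thr else -pol


def best_stump(X, y, w) -> Tuple[Stump, float]:
    """Exhaustively pick the stump with the lowest weighted training error."""
    best, best_err = None, float("inf")
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for pol in (1, -1):
                stump = (j, thr, pol)
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(stump, xi) != yi)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err


def adaboost(X, y, rounds: int = 10):
    """Train discrete AdaBoost; labels must be +1 (splog) or -1 (normal)."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump, err = best_stump(X, y, w)
        if err >= 0.5:  # no stump is better than chance on the current weights
            break
        err = max(err, 1e-10)  # avoid log of infinity when a stump is perfect
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Raise the weight of misclassified samples, lower the rest, renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble


def predict(ensemble, x: List[float]) -> int:
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


# Toy data (hypothetical): [noun ratio, modal-particle ratio]; 2 splogs, 6 normal blogs.
X = [[0.70, 0.00], [0.65, 0.01], [0.30, 0.08], [0.25, 0.10],
     [0.35, 0.05], [0.20, 0.12], [0.40, 0.06], [0.28, 0.09]]
y = [1, 1, -1, -1, -1, -1, -1, -1]

model = adaboost(X, y, rounds=5)
print([predict(model, x) for x in X])
```

The per-round reweighting is what lets a boosted ensemble concentrate on the rare class when samples are imbalanced. The thesis goes further by adapting the combination structure itself, which this sketch does not attempt.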
     Finally, this thesis designs and implements a prototype information retrieval system based on blog content and applies the splog filtering algorithm to it, with good results.

