基于信任和非信任传播的搜索引擎反作弊研究

英文题名：Combating Web Spam Based on Both Trust and Distrust Propagation
作者：王友
论文级别：硕士
学科专业名称：计算机应用技术
中文关键词：搜索引擎作弊 ; 信任传播 ; 非信任传播
英文关键词：Web Spam ; Trust Propagation ; Distrust Propagation
学位年度：2011
导师：张宪超
学科代码：081203
学位授予单位：大连理工大学
论文提交日期：2011-10-08

摘要

随着互联网的飞速发展,搜索引擎成了人们在互联网上查找有用信息的主要途径。网站在搜索引擎中的排名越高,从中获取的用户流量也就越多,流量越多也就意味着更多的利润。这就激励某些网站通过不正当的手段来操纵搜索引擎的排名。这种不正当的操纵就被定义为搜索引擎作弊。搜索引擎作弊不但会造成搜索引擎资源的浪费,还会降低用户的体验。商业搜索引擎不得不采取有效的措施,来减少搜索引擎作弊的不良影响。
     目前基于信任或非信任传播的链接反作弊算法被广泛用于抵抗搜索引擎作弊行为。相比传统的基于内容或启发式规则的反作弊算法,基于信任或非信任传播的链接反作弊算法不但对作弊者的攻击具有更高的鲁棒性,能抵抗多种作弊类型,而且由于只处理链接而拥有更好的性能。然而,不论是信任传播算法,还是非信任传播算法都存在两大问题。一方面,信任(非信任)传播的过程中好歹不分,即在传播的过程中,对权威页面和作弊页面同等对待。另一方面,虽然有很多学者都认为权威种子和作弊种子共同使用能带来更好的效果,但是之前没有研究者提出有效的利用办法。
     本文提出的TDR算法,认为一个网页具有两个方面,有价值的一面和作弊的一面,并给每个页面分配两个分数：T-Rank,代表该页面可信的一面；D-Rank,代表该页面不可信,即作弊的一面。TDR算法从权威种子和作弊种子出发,分别沿着链接或反向链接的方向同时传播T-Rank和D-Rank。在传播过程中,一个页面的T-Rank(D-Rank)的传播将受到当前该页面的D-Rank(T-Rank)的削弱。这样,上文提到的信任和非信任传播算法的两大问题都得到了很好的解决。在数据集WEBSPAM-UK2007和ClueWeb09上的实验结果表明,在众多标准下,TDR算法优于其他传统的反作弊算法。
As the rapid development of Word Wide Web, search engines become the dominant way for people to find useful information on the Web. Since higher ranking in searching results brings more traffic, and more traffic means more profit to the owners of Web sites. It drives some Web sites owners to manipulate ranking results of search engines through unethical methods. This kind of unethical manipulation is termed as Web spamming. Web spam will not only waste resources of search engines, but also decrease the experience of users. Commercial search engines have to take measures to eliminate the negative effect of spam.
     Recently, anti-spam algorithms based on trust or distrust propagation is widely used to combat Web spam. Anti-spam algorithms based on trust or distrust propagation is more robust to the attack of spammers and more efficient on computing because of only dealing with page links than that based on contents or heuristic rules. However, existing trust or distrust propagating algorithms all have two serious issues. On one hand, trust/distrust is propagated in non-differential ways, that is, it threats the authorities and the spam pages alike in the propagating process. One the other hand, it has been mentioned that a combined use of good and bad seeds can lead to better results, however, little work has been known to realize this insight successfully.
     The proposed TDR algorithm in this paper, views that each Web page has both a trustworthy side and an untrustworthy side, and assigns two scores to each Web page:T-Rank, scoring the trustworthiness, and D-Rank, scoring the untrustworthiness. From good and bad seeds, TDR simultaneously propagates T-Rank through links and D-Rank through inverse-links, respectively. In the propagating process, the propagation of T-Rank/D-Rank is penalized by the target's current D-Rank/T-Rank. In this way, propagating both trust and distrust with target differentiation is implemented and the above mentioned two problems are solved. Experimental results on WEBSPAM-UK2007 datasets and ClueWeb09 datasets show that TDR outperforms other typical anti-spam algorithms under various criteria.

引文

[1]Silverstein, C, Marais, H, Henzinger, M, et al. Analysis of a very large web search engine query log [J]. SIGIR Forum,1999,33(1):6-12.
    [2]Wu, B, Goel, V, Davison, B D. Topical TrustRank:using topicality to combat web spam [C]. Proceedings of the 15th international Conference on World Wide Web, Southampton,2006:63-72.
    [3]Gyongyi, Z, Garcia-Molina, H. Web Spam Taxonomy[C]. First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb'05), Chiba,2005:39-47.
    [4]Henzinger, M R, Motwani, R, Silverstein, C. Challenges in web search engines [J]. SIGIR Forum,2002,36(2):11-22.
    [5]Gyongyi, Z, Garcia-Molina, H, Pedersen, J. Combating web spam with trustrank [C]. Proceedings of the Thirtieth international Conference on Very Large Data Bases, Toronto,2004:576-587.
    [6]Krishnan, V, Raj, R. Web Spam Detection with Anti-Trust Rank [C]. Proceedings of the Second International Workshop on Adversarial Information Retrieval on the Web (AIRWeb'06), Seattle,2006:37-40.
    [7]Zhao, L, Jiang, Q, Zhang, Y. From good to bad ones:Making spam detection easier [C]. Proceedings of the 2008 IEEE 8th International Conference on Computer and Information Technology Workshops, Washington,2008:129-134.
    [8]Wu, B, Goel, V, Davision, B D. Propagating trust and distrust to demote web spam [C]. Proceedings of the WWW'06 Workshop on Models of Trust for the Web (MTW'06), Edinburgh,2006:1-9.
    [9]Zhang, L, Zhang, Y, Zhang, Y, et al. Exploring both content and link quality for anti-spamming [C]. Proceedings of the Sixth IEEE International Conference on Computer and Information Technology, Washington,2006:37-37.
    [10]Baeza-Yates R A, Ribeiro-Neto B. Modern Information Retrieval [M]. Boston: Addison-Wesley Longman Publishing,1999.
    [11]Google Inc. We knew the web was big [EB/OL]. (2008,7,25)[2011,10,7]. http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html.
    [12]中国互联网信息中心(CNNIC).第28次中国互联网络发展状况统计报告[R/OL].北京：中国互联网信息中心,2011[2011,10,7].http://www.cnnic.cn/research/bgxz/tjbg/201107/P020110721502208383670.pdf.
    [13]Meyerzon, D, Shoroff, S, Terek, S F, et al. Method and system for detecting duplicate documents in web crawls:US,6547829[P/OL].1999-6-30[2003-4-15]. http://www.freepatentsonline.com/6547829,html.
    [14]黄征宇.李彦宏谈框计算[J].中国信息化,2010,1：60-61.
    [15]Jones, K S. A statistical interpretation of term specificity and its application in retrieval [J]. Journal of documentation,1972,28(1):11-21.
    [16]Salton, G, Wong, A, Yang, C S. A vector space model for automatic indexing [J]. Communications of the ACM,1975,18(11):613-620.
    [17]Page, L, Brin, S, Motwani, R, et al. The pagerank citation ranking:Bringing order to the Web [R/OL]. Stanford:Stanford University,1999 [2011,10,7]. http://ilpubs.stanford, edu:8090/422/.
    [18]Liu, B. Web Data Mining [M]. Heidelberg:Springer,2007.
    [19]Kleinberg, J M. Authoritative sources in a hyperlinked environment [J]. Journal of the ACM,1999,46(5):604-632.
    [20]刘长德.Hao123网址之家“我的Hao123”[J].电脑知识与技术：经验技巧,2008,7：112-113.
    [21]Lempel, R, Moran, S. The stochastic approach for link-structure analysis (SALSA) and the TKC effect [J]. Computer Networks,2000,33:387-401.
    [22]艾瑞咨询.艾瑞咨询：受北美SEM市发展启发,中国SEM市场始动[EB/OL]. (2008,4,7) [2011,10,7]. http://www.iresearch.com.cn/Report/View.aspx?Newsid=78779.
    [23]Cormack, G, Smucker, M, Clarke, C. Efficient and effective spam filtering and re-ranking for large Web datasets [J]. Information Retrieval,2011,14:1-25.
    [24]Mathes, A. Filler Friday:Google bombing [EB/OL]. (2001,4,6)[2011,10,7]. http://uber.nu/2001/04/06/.
    [25]钱亚平.黑客“暗链”政府网[J].共产党员：下半月,2010,12：27-27.
    [26]Wu, B. Finding and fighting search engine spam [D]. Bethlehem:University of Lehigh,2007.
    [27]Westbrook, A, Greene, R. Using semantic analysis to classify search engine spam [R]. Stanford:Stanford University,2003.
    [28]Hall, M, Frank, E, Holmes, B, et al. The weka data mining software:an update [J]. SIGKDD Explor. Newsl.,2009,11(1):10-18.
    [29]Ntoulas, A, Najork, M, Manasse, M, et al. Detecting spam web pages through content analysis [C]. Proceedings of the 15th international conference on World Wide Web (WWW'06), New York,2006:83-92.
    [30]Blei, D M, Ng, A Y, Jordan, M I. Latent dirichlet allocation [J]. The Journal of Machine Learning Research,2003,3:993-1022.
    [31]Biro, I, Siklosi, D, Szabo, J, et al. Linked latent dirichlet allocation in web spam filtering [C]. Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb'09), New York,2009:37-40.
    [32]Fetterly, D, Manasse, M, Najork, M. Spam, damn spam, and statistics:using statistical analysis to locate spam Web pages [C]. Proceedings of the 7th International Workshop on the Web and Databases:collocated wit ACM SIGMOD/PODS 2004 (WebDB' 04), New York,2004:1-6.
    [33]Benczur, A A, Csalogany, K, Sarlos, T, et al. Spamrank-fully automatic link spam detection. First International Workshop on Adversarial Information Retrieval on the Web [C], Chiba,2005:25-38.
    [34]Fogaras, D, Rccz, B. Towards scaling fully personalized pagerank [J]. Algorithms and Models for the Web-Graph,2004,3243:105-117.
    [35]Boldi, P. Totalrank:ranking without damping [C]. Special interest tracks and posters of the 14th international conference on Word Wide Web (WWW'05), New York, 2005:898-899.
    [36]Becchetti, L, Leonardi, S. Using rank propagation and probabilistic counting for link-based spam detection [C]. WebKDD' 06, Philadelphia,2006:1-10.
    [37]Becchetti, L, Castillo, C, Donato, D, et al. Link analysis for Web spam detection [J]. ACM Transaction on Web,2008,2(2):1-42.
    [38]Yang, H, King, I, Lyu, M R. Diffusionrank:a possible penicillin for Web spamming [C]. Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR' 07), New York,2007:431-438.
    [39]Wang, X, Tao, T, Sun, J T, et al. Dirichletrank:Solving the zero-one gap problem of pagerank [J]. ACM Trans. Inf. Syst.,2008,26(10):1-29.
    [40]Kaul, R, Yun, Y, Kim, S G. Ranking billions of Web pages using diodes [J]. Communications of ACM,2009,52:132-136.
    [41]Baeza-Yates, R, Castillo, C, Lopez, V. Pagerank increase under different collusion topologies [C]. First International Workshop on Adversarial Information Retrieval on the Web, Chiba,2005:17-24.
    [42]Wu, B, Davision, B D. Identifying link farm spam pages [C]. Special interest tracks and posters of the 14th international conference on World Wide Web (WWW' 05), New York,2005:820-829.
    [43]Ono, H, Toyoda, M, Kitsuregawa, M. Identifying Web spam by densely connected sites and its statistic in Japanese Web snapshot [C].22nd International Conference on Data Engineering Workshops, Atlanta,2006:1-4.
    [44]Saito, H, Toyoda, M, Kitsuregawa, M, et al. A large-scale study of link spam detection by graph algorithms. Proceedings of the 3rd international workshop on Adversarial information retrieval on the Web (AIRWeb'07), New York,2007:45-48.
    [45]Caverleen, J, Liu, L. Countering Web spam with credibility-based link analysis [C]. Proceedings of the 26th annual ACM symposium on Principles of distributed computing (PODC'07), New York,2007:157-166.
    [46]Chen, Q, Yu, S N, Cheng, S. Link variable trustrank for fighting Web spam [C]. International Conference on Computer Science and Software Engineering, Wuhan,2008: 1004-1007.
    [47]Zhang, Y, Jiang, Q, Zhang, L, et al. Deeply exploiting link structure:Setting a tougher life for spammers [R/OL]. Beijing:Department of Machine Intelligence, Peking University,2Q09. http://www.cis.pku.edu.cn/faculty/system/zhangyan/papers/CPV.pdf.
    [48]Zhang, Y, Jiang, Q, Zhang, L, et al. Exploiting bidirectional links:making spamming detection easier [C]. Proceeding of the 18th ACM conference on Information and knowledge management, Hong Kong,2009:1839-1842.
    [49]Chung, Y, Toyoda, M, Kitsuregawa, M. Detecting link hijacking by Web Spammers [C]. In Proceedings of the 13th Pacific-Asia conference on knowledge discovery and data mining, Bangkok,2009:339-350.
    [50]Jiang, Q, Zhang, L, Zhu, et al. Larger is better:seed selection in link-based anti-spamming algorithms [C]. Proceeding of the 17th international conference on World Wide Web, Beijing,2008:1065-1066.
    [51]韩博.反搜索引擎作弊中种子集合自动扩展算法研究[D].大连：大连理工大学,2009.
    [52]Google Inc. Cloaking, sneaky javascript redirects, and doorway pages [EB/OL]. (2011,10,31) [2011,12,1]. http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=66355.
    [53]Wu, B, Davision, B D. Cloaking and redirection:A preliminary study [C]. Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, Chiba,2005:1-10.
    [54]Wu, B, Davision, B D. Detecting semantic cloaking on the Web [C]. Proceedings of the 15th international conference on World Wide Web (WWW'06), Edinburgh, 2006:819-828.
    [55]Lin, J L. Detection of cloaked Web spam by using tag-based methods[J]. Expert Systems with Applications,2009,36(4):7493-7499.
    [56]Chellapilla K, Maykov, A. A taxonomy of javascript redirection spam [C]. Proceedings of the 3rd international workshop on Adversarial information retrieval on the Web (AIRWeb'07), Beijing,2007:81-88.
    [57]Wang, Y, Ma, M, Niu, Y, et al. Spam double-funnel:Connecting Web spammers with advertisers [C]. Proceedings of the 16th international conference on World Wide Web, Banff, Alberta,2007:291-300.
    [58]Deif, A S. Advanced matrix theory for scientists and engineers [M]. London:Taylor & Francis,1982.
    [59]Yahoo! Research. Web Spam Collections [DB/OL]. Crawled by the Laboratory of Web Algorithmics, University of Milan,2007. http://law.dsi.unimi.it/. (2011,1,21)[2011,10,7]. http://barcelona. research, yahoo, net/webspam/datasets/.
    [60]Callan, J, Hoy, M, Yoo, C, et al. The Clueweb09 data set [DB/OL]. (2011,12,6)[2011,12,12]. http://lemurproject.org/clueweb09/.
    [61]Yahoo! Research. Guidelines for WEBSPAM-UK2007 [EB/OL]. (2008,7,21)[2011,10,7]. http://barcelona.research.yahoo.net/webspam/datasets/uk2007/guidelines/.
    [62]Thelwall, M, Vaughan, L. A fair history of the Web? Examining country balance in the Internet Archive [J]. Library & information science research,2004, 26(2):162-176.
    [63]Cormack, G, Smucker, M, Clarke, C. Efficient and effective spam filtering and re-ranking for large Web datasets [J]. Information Retrieval,2001,14:1-25.
    [64]Heymann, P, Koutrika, G, Garcia-Molina, H. Fighting spam on social web sites:A survey of approaches and future challenges [J]. IEEE Internet Computing,2007, 11(6):36-45.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700