The Research on Key Technologies in Web Spam Detection Based on Genetic Programming and Ensemble Learning

Original title (Chinese): 基于遗传规划和集成学习的Web Spam检测关键技术研究
Author: Niu Xiaofei (牛小飞)
Degree level: Doctoral
Discipline: Computer System Architecture
Keywords (Chinese): Web Spam检测; 遗传规划; 集成学习; 非平衡数据集分类
Keywords: Web Spam Detection; Genetic Programming; Ensemble Learning; Classification on the Imbalanced Dataset
Degree year: 2012
Advisor: Ma Jun (马军)
Discipline code: 081201
Degree-granting institution: Shandong University (山东大学)
Thesis submission date: 2012-10-15

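Abstract (translated from the Chinese)

As the information on the web grows explosively, search engines have become an essential everyday tool for finding information. For a given query a search engine typically returns thousands of pages, yet most users read only the first few, so the rank of a page matters a great deal. Many people therefore use various means to deceive ranking algorithms so that certain pages receive undeservedly high rankings, attract users' attention and yield some kind of benefit. All deceptive practices that attempt to raise a page's rank in a search engine are called Web Spam. Web Spam severely degrades the quality of search results, puts major obstacles in the way of users seeking information and produces a poor user experience. For the search engine itself, even spam pages that do not rank high enough to disturb users still cost resources to crawl, index and store. Identifying Web Spam has thus become one of the major challenges for search engines.
     Based on the characteristics of Web Spam datasets, this thesis studies how to build classifiers on web page features to detect Web Spam. The main work covers the following three aspects: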
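     (1) A method for detecting Web Spam with discriminating functions learned by genetic programming
     Each individual is defined as a discriminating function for detecting Web Spam; through genetic operations, genetic programming can find an optimized discriminating function that improves detection performance. One problem arises, however, when genetic programming is used to produce discriminating functions: with no prior knowledge of the optimal solution, it is hard to know the appropriate length of an individual. If an individual is too short, it contains few features, has little discriminating power, and the corresponding function expression classifies poorly. Making full use of the content, link and other features in a Web Spam dataset requires a longer discriminating function and therefore a larger individual. For a population of larger individuals, construction and search take longer. Building on the principle that a longer discriminating function is composed of several shorter ones, this thesis proposes learning discriminating functions by genetic programming to detect Web Spam. The method first creates multiple populations of small individuals; each population produces its best individual through genetic operations, and the best individuals from all populations are then combined by genetic programming into a better discriminating function. A longer, better-performing discriminating function can thus be produced in a shorter time. The thesis also studies the influence of the depth of the binary trees representing individuals on the evolutionary process, as well as the efficiency of the combination.
     Experiments were carried out on the WEBSPAM-UK2006 dataset. The results show that, compared with single-population genetic programming, multi-population genetic programming using two combinations raises recall by 5.6%, F-measure by 2.25% and accuracy by 2.83%. Compared with SVM, the new method raises recall by 26%, F-measure by 11% and accuracy by 4%.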
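     (2) A method for detecting Web Spam with ensemble learning based on genetic programming
     Most current classification-based methods for detecting Web Spam use a single classification algorithm to build a single classifier, and most of them ignore the imbalance between spam and normal samples in the dataset, i.e. normal samples far outnumber spam samples. Because many different types of Web Spam techniques exist and new ones keep appearing, it is not realistic to expect one all-purpose classifier to detect every type of Web Spam. Combining the detection results of several classifiers into an enhanced classifier is therefore an effective approach, and ensemble learning is also one of the effective ways to handle classification on imbalanced datasets. In ensemble learning, two key questions are how to produce diverse base classifiers and how to combine their results. This thesis proposes ensemble learning based on genetic programming to detect Web Spam: different classification algorithms are first trained on different sample sets and feature sets to produce diverse base classifiers, and genetic programming is then used to learn a novel classifier that gives the final detection result from the detection results of the base classifiers.
     In line with the characteristics of Web Spam datasets, the method uses different data sets and classification algorithms to produce base classifiers that differ considerably from one another, and uses genetic programming to combine their results. This not only makes it easy to integrate the results of different types of classifiers and improve classification performance, but also allows only some of the base classifiers to be selected for the ensemble, which reduces prediction time. The method can also merge under-sampling with ensemble learning to improve classification performance on imbalanced datasets. To verify the effectiveness of the genetic programming ensemble, experiments were run on a balanced dataset and on an imbalanced dataset. On the balanced dataset, the influence of classification algorithms and feature sets on the ensemble was analyzed first, and the method was then compared with known ensemble learning algorithms; the results show that it outperforms some known ensemble learning algorithms in precision, recall, F-measure, accuracy, error rate and AUC. The experiments on the imbalanced dataset show that the genetic programming ensemble improves classification performance for both homogeneous and heterogeneous ensembles, and that heterogeneous ensembles are more effective than homogeneous ones. The genetic programming ensemble achieves a higher F-measure than AdaBoost, Bagging, RandomForest, majority-vote ensembles, the EDKC algorithm and the method based on Prediction Spamicity.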
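     (3) A method for detecting Web Spam with new features generated by genetic programming
     Features play a very important role in classification. The Web Spam dataset has 96 content features, 41 link features and 138 transformed link features, where the 138 transformed link features are simple combinations or logarithmic transformations of the 41 link features. Producing such features not only requires experts but is also labor-intensive, and it is not easy to fuse different types of features (such as content features and link features). The method uses genetic programming to combine existing features into new, more discriminating features, which then serve as classifier inputs for detecting Web Spam. Experiments on the WEBSPAM-UK2006 dataset show that classifiers using 10 new features give better results than classifiers using the original 41 link features, and perform comparably to classifiers using the 138 transformed link features.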

Abstract (original English version)

With the explosive growth of information on the web, search engines have become an important tool that helps people find the information they want in daily life. Given a query, a search engine generally returns thousands of pages, but most users read only the first few, so page ranking is highly important. As a result, many people use various means to deceive the ranking algorithms of search engines so that some web pages achieve undeservedly high rankings, attract users' attention and bring some benefit. All deceptive actions that try to increase the ranking of a page in search engines are generally referred to as Web Spam. Web Spam seriously degrades the quality of search results, creates major obstacles for users seeking information and leads to a poor user experience. From the point of view of a search engine, even if spam pages are not ranked high enough to annoy users, there is a cost to crawl, index and store them. Detecting Web Spam has therefore become one of the top challenges for web search engines.
     Based on the characteristics of Web Spam datasets, this thesis focuses on constructing classifiers from the features of web pages in order to improve Web Spam detection performance. The work consists of the following three parts:
     (1) A method that learns a discriminating function by Genetic Programming to detect Web Spam
     An individual is defined as a discriminating function for detecting Web Spam. Through genetic operations, Genetic Programming can find an optimized discriminating function that improves Web Spam detection performance. However, one problem arises when Genetic Programming is used to generate discriminating functions: because there is no prior knowledge about the optimal solution, it is difficult to know the proper length of an individual. If an individual is too short, it contains few features, its discriminating power is low, and the classification performance of the corresponding function expression is poor. Making full use of the content features, link features and other features in a Web Spam dataset requires a longer discriminating function, and hence a larger individual. For a population composed of large individuals, construction and search take more time. Based on the principle that a long discriminating function is composed of several short ones, this thesis proposes a new method for learning a discriminating function by Genetic Programming to detect Web Spam. The method first creates multiple populations of small individuals, and each population produces its own best individual through genetic operations. The best individuals of all populations are then combined by Genetic Programming to obtain a better discriminating function. In this way a longer, better-performing discriminating function can be generated in less time. We also study how the depth of the binary trees that represent individuals affects the Genetic Programming evolution process, and the efficiency of the combination.
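The two-stage idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis's implementation: the function set, the terminal set, the decision rule (output >= 0 means spam) and the accuracy fitness are all assumed here, and real selection, crossover and mutation are replaced by simply picking the fittest member of a random population to keep the sketch short. All names (`random_tree`, `best_individual`, `multi_population_search`) are hypothetical.

```python
import operator
import random

def protected_div(a, b):
    """Division that returns 1.0 when the divisor is (near) zero."""
    return a / b if abs(b) > 1e-9 else 1.0

# Assumed function set for the internal nodes of the binary trees.
FUNCTIONS = [operator.add, operator.sub, operator.mul, protected_div]

def random_tree(n_inputs, depth, rng):
    """Build a random full binary tree of the given depth."""
    if depth == 0:
        if rng.random() < 0.8:
            return ("x", rng.randrange(n_inputs))   # input terminal
        return ("c", rng.uniform(-1.0, 1.0))        # constant terminal
    return (rng.choice(FUNCTIONS),
            random_tree(n_inputs, depth - 1, rng),
            random_tree(n_inputs, depth - 1, rng))

def evaluate(tree, x):
    """Evaluate a discriminating function (tree) on one input vector."""
    head = tree[0]
    if head == "x":
        return x[tree[1]]
    if head == "c":
        return tree[1]
    return head(evaluate(tree[1], x), evaluate(tree[2], x))

def classify(tree, x):
    """Assumed decision rule: non-negative output means spam (1)."""
    return 1 if evaluate(tree, x) >= 0 else 0

def fitness(tree, samples):
    """Fraction of (features, label) pairs classified correctly."""
    return sum(classify(tree, x) == y for x, y in samples) / len(samples)

def best_individual(samples, n_inputs, size, depth, rng):
    """Stand-in for evolution: fittest member of one random population."""
    population = [random_tree(n_inputs, depth, rng) for _ in range(size)]
    return max(population, key=lambda t: fitness(t, samples))

def multi_population_search(samples, n_features, n_pops, size, depth, rng):
    # Stage 1: every small population yields its best short function.
    bests = [best_individual(samples, n_features, size, depth, rng)
             for _ in range(n_pops)]
    # Stage 2: the outputs of the short functions become the inputs of a
    # combining tree, so the result is one longer discriminating function.
    lifted = [([evaluate(b, x) for b in bests], y) for x, y in samples]
    combiner = best_individual(lifted, n_pops, size, depth, rng)

    def combined(x):
        return classify(combiner, [evaluate(b, x) for b in bests])

    return bests, combined
```

Because each stage only searches shallow trees, the search space per stage stays small, while the composed function can still reference many features.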
     We perform experiments on the WEBSPAM-UK2006 dataset. The results show that, compared with single-population Genetic Programming, multi-population Genetic Programming using two combinations improves recall by 5.6%, F-measure by 2.25% and accuracy by 2.83%. Compared with SVM, the approach improves recall by 26%, F-measure by 11% and accuracy by 4%.
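For readers less familiar with the measures quoted here, the following sketch computes recall, precision, F-measure and accuracy from predicted and true labels using their standard definitions. It assumes spam is the positive class (label 1); the abstract does not state whether the reported gains are absolute or relative differences, so the sketch only shows how the raw measures are obtained. The function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, with spam = 1 as positive."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def spam_metrics(y_true, y_pred):
    """Return precision, recall, F-measure and accuracy as a dict."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "accuracy": accuracy}
```

On an imbalanced dataset, accuracy can stay high even when most spam is missed, which is why recall and F-measure are reported alongside it.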
     (2) A method that detects Web Spam by an ensemble learning algorithm based on Genetic Programming
     At present, most classification-based Web Spam detection methods use a single classification algorithm to build a single classifier, and most ignore the imbalance between spam and normal samples, i.e. normal samples far outnumber spam samples. Because there are many types of Web Spam techniques and new ones keep appearing, it is unrealistic to expect a single all-purpose classifier to detect every kind of Web Spam. Integrating the detection results of multiple classifiers is an effective way to obtain an enhanced classifier for Web Spam detection, and ensemble learning is also one of the effective methods for classification on imbalanced datasets. Two key issues in ensemble learning are how to generate diverse base classifiers and how to combine their results. This thesis proposes to detect Web Spam by an ensemble learning algorithm based on Genetic Programming. The method first generates multiple diverse base classifiers by training different classification algorithms on different sample sets and feature sets. Genetic Programming is then used to learn a novel classifier that gives the final detection result based on the detection results of the base classifiers.
     According to the characteristics of Web Spam datasets, this method uses different data sets and classification algorithms to generate base classifiers that differ considerably from one another. Using Genetic Programming to combine the results of the base classifiers not only makes it easy to integrate the results of heterogeneous classifiers and improve classification performance, but also allows only part of the base classifiers to be selected for the ensemble, which reduces prediction time. The approach can also combine under-sampling with ensemble learning to improve classification performance on imbalanced datasets. To verify the effectiveness of the Genetic Programming-based ensemble, we perform experiments on balanced and imbalanced datasets respectively. The experiments on the balanced dataset first analyze the effect of classification algorithms and feature sets on the ensemble, and then compare the method with some known ensemble learning algorithms; the results show that the new approach outperforms them in terms of precision, recall, F-measure, accuracy, error rate and AUC. The experiments on the imbalanced dataset show that the method improves classification performance whether the base classifiers are of the same type or not, and that in most cases heterogeneous ensembles work better than homogeneous ones. The F-measure of the new ensemble method is higher than those of AdaBoost, Bagging, RandomForest, majority voting (Vote), the EDKC algorithm and the method based on Prediction Spamicity.
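A minimal sketch of how under-sampling, diverse base classifiers and an expression-based combiner could fit together is given below. Everything in it is an assumption made for illustration: the threshold "stump" stands in for the real classification algorithms, the combiner tree is written by hand where the thesis would evolve it with Genetic Programming, and the 0.5 decision threshold is arbitrary. The names `balanced_subsets`, `train_stump`, `used_bases` and `ensemble_predict` are hypothetical.

```python
import operator
import random

def balanced_subsets(samples, n_subsets, rng):
    """Under-sampling: all spam plus an equal-sized random draw of normal pages."""
    spam = [s for s in samples if s[1] == 1]
    normal = [s for s in samples if s[1] == 0]
    k = min(len(spam), len(normal))
    return [spam + rng.sample(normal, k) for _ in range(n_subsets)]

def train_stump(subset, feature):
    """Hypothetical base learner: threshold halfway between the class means."""
    spam_vals = [x[feature] for x, y in subset if y == 1]
    norm_vals = [x[feature] for x, y in subset if y == 0]
    mean_spam = sum(spam_vals) / len(spam_vals)
    mean_norm = sum(norm_vals) / len(norm_vals)
    threshold = (mean_spam + mean_norm) / 2
    sign = 1 if mean_spam >= mean_norm else -1
    return lambda x: 1 if sign * (x[feature] - threshold) >= 0 else 0

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def used_bases(tree):
    """Indices of the base classifiers that the combiner tree refers to."""
    if tree[0] == "b":
        return {tree[1]}
    if tree[0] == "c":
        return set()
    return used_bases(tree[1]) | used_bases(tree[2])

def combine(tree, outputs):
    """Evaluate the combiner on a dict {base index: 0/1 output}."""
    if tree[0] == "b":
        return outputs[tree[1]]
    if tree[0] == "c":
        return tree[1]
    return OPS[tree[0]](combine(tree[1], outputs), combine(tree[2], outputs))

def ensemble_predict(tree, bases, x, threshold=0.5):
    # Only the base classifiers that appear in the tree are evaluated,
    # which is how selecting a subset can reduce prediction time.
    outputs = {i: bases[i](x) for i in used_bases(tree)}
    return 1 if combine(tree, outputs) >= threshold else 0
```

Training each base classifier on a different balanced subset and a different feature gives the diversity the method relies on, while the tree form lets the combiner ignore base classifiers that do not help.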
     (3) A method that generates new features by Genetic Programming to detect Web Spam
     Features play an important role in classification. The publicly available WEBSPAM-UK2006 dataset contains 96 content-based features, 41 link-based features and 138 transformed link-based features. The transformed link-based features are simple combinations or logarithms of the link-based features; producing them requires experts and is labor-intensive. In addition, it is not easy to combine different kinds of features, such as content features and link features. This method uses Genetic Programming (GP) to derive new, more discriminating features from the existing ones, and uses the newly generated features as inputs to an SVM classifier and GP classifiers for Web Spam detection.
     Experiments on WEBSPAM-UK2006 show that classifiers using 10 new features perform much better than classifiers using the original 41 link-based features, and comparably to classifiers using the 138 transformed link-based features.
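To make the notion of a generated feature concrete, the sketch below treats each new feature as an arithmetic expression over the raw feature vector and scores it by how well it separates the two classes. The two expressions are hand-written stand-ins for what Genetic Programming would evolve, and the Fisher-style score is an assumed fitness measure; the abstract does not specify the one actually used. All names are hypothetical.

```python
import math

def protected_div(a, b):
    """Division that returns 1.0 when the divisor is (near) zero."""
    return a / b if abs(b) > 1e-9 else 1.0

def protected_log(v):
    """Logarithm of |v| that returns 0.0 for values near zero."""
    return math.log(abs(v)) if abs(v) > 1e-9 else 0.0

# Hypothetical generated features: expressions over the raw vector x.
GENERATED = [
    lambda x: protected_div(x[0], x[1]),
    lambda x: protected_log(x[0]) - protected_log(x[1]),
]

def transform(rows, generators=GENERATED):
    """Map each raw feature vector to the vector of generated features."""
    return [[g(x) for g in generators] for x in rows]

def fisher_score(values, labels):
    """Squared distance between class means over the summed class variances."""
    spam = [v for v, y in zip(values, labels) if y == 1]
    normal = [v for v, y in zip(values, labels) if y == 0]

    def mean(vs):
        return sum(vs) / len(vs)

    def var(vs):
        m = mean(vs)
        return sum((v - m) ** 2 for v in vs) / len(vs)

    return (mean(spam) - mean(normal)) ** 2 / (var(spam) + var(normal) + 1e-12)
```

The transformed rows can then be passed to any classifier in place of the raw features; a small set of well-separating expressions is what would allow a handful of new features to stand in for a much larger hand-built set.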

