Research on Text Categorization Method by Active Multi-Field Learning for Spam Filtering (面向垃圾信息过滤的主动多域学习文本分类方法研究)

English title: Research on Text Categorization Method by Active Multi-Field Learning for Spam Filtering
Author: LIU Wu-Ying (刘伍颖)
Thesis level: Doctoral
Discipline: Computer Science and Technology
Chinese keywords: 垃圾信息过滤 ; 文本分类 ; 多域学习 ; 主动学习 ; 幂律 ; Token频率索引 ; 基于方差的非确定采样 ; TREC评测
English keywords: Spam Filtering ; Text Categorization ; Multi-Field Learning ; Active Learning ; Power Law ; Token Frequency Index ; Variance-Based Uncertainty Sampling ; TREC Evaluation
Degree year: 2011
Advisor: WANG Ting (王挺)
Discipline code: 081203
Degree-granting institution: National University of Defense Technology (国防科学技术大学)
Thesis submission date: 2011-04-01

Abstract

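Spam filtering is one of the key technologies for improving the usability of network information. Although the field has already produced many results, the pressing social demand for spam filtering, together with the many shortcomings that filtering techniques show in real deployments and evaluations, has led many research institutions in recent years to study its key technologies actively and in greater depth, in order to improve filtering performance and to solve problems that arise in practice. Most current research addresses spam filtering with statistical text categorization methods. Against this background, this dissertation studies in depth five problems of statistical online binary text categorization for spam filtering: the overall framework problem, the field document splitting problem, the field classification result combining problem, the space-time-efficient field classification problem, and the costly feedback problem. It proposes a series of methods for these problems. The effectiveness of the proposed methods is validated by email spam filtering experiments on the TREC07P email corpus, SMS spam filtering experiments on the CSMS Chinese mobile short message corpus, and multiclass document classification experiments on the TanCorp web news corpus. The main work of this dissertation includes: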
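     (1) The text structure of information documents is analyzed, showing that such documents commonly have a multi-field structure. Based on this property, a multi-field learning framework is proposed. The framework follows a divide-and-conquer approach: it breaks the complex text categorization problem of a multi-field document into several simple field classification sub-problems, each with its own feature space and statistical text categorization model. Experimental results show that the multi-field learning framework is an effective overall framework for statistical online binary text categorization. Within the framework, text features are more independent across fields, and the text categorization model inside each field is more targeted. In each field classification sub-problem, both feature extraction and model construction are simpler and more efficient.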
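     (2) The field document splitting problem is studied, and two strategies are proposed: natural field document splitting and attribute-specific field document splitting. Natural field document splitting uses the multi-field structure that a document already has: it identifies the field separators and splits one text document into several field text documents. Attribute-specific field document splitting is a text feature reuse technique: it extracts text with strong discriminative power according to some rule and assembles it into a text field that does not actually exist in the original document. Experimental results show that the former strategy is broadly applicable, because information documents commonly have a multi-field structure, while the latter is better suited to short text documents, because it can overcome their sparse-feature problem.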
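     (3) The field classification result combining problem is studied, and five combining strategies are proposed: equal weights, support vector model weights, weights from the historical performance of the field classifiers, weights from the information quantity of the field documents, and compound weights. Experimental results show that, within the multi-field learning framework, all five strategies improve the performance of existing text categorization algorithms. The compound-weight strategy, which considers both the historical performance of the field classifiers and the information quantity of the current field documents, achieves the better results in both time complexity and classification accuracy.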
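     (4) The token frequency distribution in collections of information documents is analyzed, showing that it commonly follows a power law. Based on this property, a text categorization algorithm based on a token frequency index is proposed. The algorithm applies ideas from text retrieval to text categorization and uses equal-probability random sampling to compress labeled documents online. It thereby addresses a difficulty of traditional online text categorization research: turning a posteriori rules obtained by offline batch processing into a priori rules that can be computed online. The token frequency index data structure has low time complexity for each query and each incremental update, and it also has the raw-text compression property of an index and a compression property based on random sampling, so it can efficiently capture changes in document content and the drift of the spam concept. Experimental results show that the token-frequency-index-based algorithm solves the space-time-efficient field classification problem well, and that integrating it into the multi-field learning framework achieves the best performance, with low space-time complexity and high classification accuracy. In addition, the token frequency index idea is extended to a text categorization algorithm based on a multiclass token frequency index. Experimental results show that this algorithm is also effective for multiclass document classification.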
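     (5) The costly feedback problem is studied, and three active learning strategies are proposed: chronological priority, a priori range, and variance-based uncertainty sampling. The variance-based uncertainty sampling strategy makes full use of the disagreement among the decisions of multiple field classifiers: it compares the current variance of the field classification results with a threshold derived from historical variances and selects informative documents for which to request user feedback. Experimental results show that variance-based uncertainty sampling performs best among the three strategies. It still achieves good classification performance while greatly reducing the amount of user feedback, and because computing a variance has low space-time complexity, it is an effective active learning strategy.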
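     In summary, this dissertation studies several key problems facing spam filtering and proposes a series of text categorization methods centered on multi-field learning, which meet the practical requirements of spam filtering well. Future work will continue to revolve around multi-field learning, and further refinement and development of multi-field learning can be expected to yield better results.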
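The divide-and-conquer idea behind the multi-field learning framework can be illustrated with a minimal sketch. It assumes an email-like message whose natural fields are a header and a body separated by the first blank line, and it combines per-field spam scores with a plain weighted average. The function names, the toy keyword scorer, and the weighting scheme are hypothetical illustrations, not the dissertation's actual classifiers or combining strategies.

```python
from typing import Callable, Dict, Set

FieldScorer = Callable[[str], float]


def split_natural_fields(message: str) -> Dict[str, str]:
    """Split an email-like message into 'header' and 'body' at the first blank line."""
    normalized = message.replace("\r\n", "\n")
    header, _, body = normalized.partition("\n\n")
    return {"header": header, "body": body}


def combine_field_scores(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-field spam scores in [0, 1]; 0.5 means undecided."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        return 0.5
    return sum(weights[name] * scores[name] for name in scores) / total_weight


def classify(message: str, scorers: Dict[str, FieldScorer], weights: Dict[str, float]) -> float:
    """Score each field with its own classifier, then combine the field results."""
    fields = split_natural_fields(message)
    scores = {name: scorers[name](text) for name, text in fields.items()}
    return combine_field_scores(scores, weights)


def keyword_scorer(keywords: Set[str]) -> FieldScorer:
    """Toy field classifier (assumption): 1.0 if any keyword occurs, else 0.0."""
    def score(text: str) -> float:
        words = set(text.lower().split())
        return 1.0 if words & keywords else 0.0
    return score
```

Setting every weight to the same value gives an equal-weight combination; weights derived from each field classifier's past accuracy or from the amount of text in each field would correspond, loosely, to the other strategies the abstract lists.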
Spam filtering is one of the key technologies for improving the availability of network information. Although spam filtering has been extensively investigated and many advances have been made, actual applications and evaluations show that many problems remain to be solved. As a consequence, many academic and industrial groups have in recent years been conducting in-depth research on spam filtering technologies. Currently, most researchers use statistical text categorization (TC) methods to solve the problem of spam filtering. This dissertation explores the main framework problem of statistical online binary TC for spam filtering, the splitting problem of field documents, the combining problem of field results, the space-time-efficient classifying problem of field documents, and the costly feedback problem, and proposes a series of methods to solve them. We use an email spam filtering experiment on the TREC07P collection, a short message service spam filtering experiment on the CSMS collection, and a multiclass document classification experiment on the TanCorp collection to validate the effectiveness of the proposed methods, and make the following contributions:
     (1) The text structure of information documents is investigated in detail, and it is found that many information documents have a multi-field structure. Based on this finding, we propose a multi-field learning (MFL) framework, which uses a divide-and-conquer idea to break the complex TC problem of multi-field documents into several simple field sub-problems. Each sub-problem has its own feature space and statistical TC model. Experimental results show that the MFL framework is an effective main framework for statistical online binary TC. In this framework, text features are more independent between fields and the TC model is more targeted within each field. In each sub-problem, feature extraction and TC model construction are both straightforward and efficient.
     (2) We investigate the splitting problem of field documents, and propose a natural field document splitting (NFDS) strategy and an attribute-specific field document splitting (ASFDS) strategy. The NFDS strategy splits a document into several field documents at the splitting positions identified from its natural field structure. The ASFDS strategy, a technique for reusing text features, extracts highly discriminative text by some rules to form a field document that does not actually exist in the original document. Experimental results show that the NFDS strategy is generally applicable because information documents commonly have a multi-field structure, and that the ASFDS strategy is more suitable for short text documents because it can overcome the problem of sparse features.
     (3) We also investigate the combining problem of field results in depth, and propose five combining strategies (arithmetical average, support vector model, historical performance of field classifiers, text quantity of field documents, and compound). Experimental results show that all five strategies improve the performance of previous TC algorithms, and that the compound strategy, which considers both the historical performance of field classifiers and the current text quantity of field documents, achieves the better results in time complexity and classification accuracy.
     (4) The token frequency distribution in document collections is investigated in detail, and it is found that this distribution commonly follows a power law. Based on this finding, we propose a token frequency index (TFI) based TC algorithm. This algorithm transfers a research idea from text retrieval to TC problems, uses equal-probability random sampling to compress labeled documents online, and addresses a hard problem in traditional statistical TC methods: turning an a posteriori rule derived from offline batches into an a priori rule that is computable online. The TFI data structure has low time complexity for each query and each incremental update, and has both the raw-text compression property of indexes and a compression property based on random sampling. The TFI can therefore capture varying content and concept drift space-time-efficiently. Experimental results show that the TFI-based TC algorithm solves the space-time-efficient classifying problem of field documents, and that, integrated in the MFL framework, this algorithm achieves the best performance, with low space-time complexity and high classification accuracy. Moreover, we extend the research idea of TFI and propose a multiclass token frequency index (MTFI) based TC algorithm. Experimental results show that the MTFI-based TC algorithm is effective in multiclass document classification.
     (5) For the costly feedback problem, three active learning strategies (chronological priority, a priori range, and variance-based uncertainty sampling) are proposed. The variance-based uncertainty sampling strategy makes use of the disagreement among the decisions of several field classifiers: it compares the current variance of the field results with a threshold derived from historical variances to choose informative documents for which to request user feedback. Experimental results show that variance-based uncertainty sampling is the best of the three strategies and achieves good performance with greatly reduced user feedback. Owing to the low space-time complexity of computing a variance, variance-based uncertainty sampling is an effective active learning strategy.
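The selection rule in item (5) can be sketched as follows. This is a minimal illustration under stated assumptions: the threshold is taken to be the running mean of all previously observed variances, and the very first document is always queried because no history exists yet. The abstract does not specify how the historical threshold is computed, so both choices, along with the class name `VarianceSampler`, are hypothetical.

```python
from statistics import pvariance
from typing import Sequence


class VarianceSampler:
    """Request a label when field classifiers disagree more than they have on average."""

    def __init__(self) -> None:
        self.seen = 0              # number of documents observed so far
        self.mean_variance = 0.0   # running mean of past variances (assumed threshold)

    def should_query(self, field_scores: Sequence[float]) -> bool:
        """Return True if user feedback should be requested for this document."""
        current = pvariance(field_scores)  # population variance of the field results
        # Assumption: with no history, always ask for feedback.
        query = self.seen == 0 or current > self.mean_variance
        # Incrementally update the running mean with the current variance.
        self.seen += 1
        self.mean_variance += (current - self.mean_variance) / self.seen
        return query
```

Each call costs one variance over a handful of field scores plus a constant-time update, which is consistent with the low space-time cost the abstract attributes to this strategy.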
     In conclusion, this dissertation investigates some key problems in spam filtering and proposes a series of TC methods built around the MFL idea. The proposed TC methods meet the practical application requirements of spam filtering. Further research on MFL can be expected to achieve higher performance.
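To make the token frequency index idea more concrete, here is a rough sketch of an index that supports cheap incremental updates and queries, and that keeps each labeled document with equal probability. The scoring rule used here (an average of add-one-smoothed per-token document-frequency ratios) is a stand-in chosen for illustration; the abstract does not give the dissertation's actual scoring rule, and the class and parameter names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable


class TokenFrequencyIndex:
    """Per-class token -> document-frequency counts with sampled online updates."""

    def __init__(self, sample_rate: float = 1.0, seed: int = 0) -> None:
        self.sample_rate = sample_rate  # probability of indexing a labeled document
        self.rng = random.Random(seed)
        self.token_counts: Dict[str, Dict[str, int]] = {
            "spam": defaultdict(int),
            "ham": defaultdict(int),
        }
        self.doc_counts = {"spam": 0, "ham": 0}

    def update(self, tokens: Iterable[str], label: str) -> bool:
        """Index a labeled document with probability sample_rate; return True if kept."""
        if self.rng.random() >= self.sample_rate:
            return False  # document dropped by equal-probability sampling
        self.doc_counts[label] += 1
        for token in set(tokens):
            self.token_counts[label][token] += 1
        return True

    def spam_score(self, tokens: Iterable[str]) -> float:
        """Average per-token spam vote in [0, 1]; 0.5 when there is no evidence."""
        votes = []
        for token in set(tokens):
            # Add-one smoothing (assumption) keeps unseen tokens neutral.
            spam_rate = (self.token_counts["spam"].get(token, 0) + 1) / (self.doc_counts["spam"] + 2)
            ham_rate = (self.token_counts["ham"].get(token, 0) + 1) / (self.doc_counts["ham"] + 2)
            votes.append(spam_rate / (spam_rate + ham_rate))
        return sum(votes) / len(votes) if votes else 0.5
```

Both `update` and `spam_score` touch only the tokens of the document at hand, so their cost grows with document length rather than with the size of the training history. Lowering `sample_rate` trades some evidence for a smaller index.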

