Adversarial Classification for Email Spam Filtering

Original title: 垃圾邮件过滤中的敌手分类问题研究
Author: Deng Wei (邓蔚)
Degree level: Doctoral
Discipline: Computer Architecture
Keywords: spam filtering; adversarial classification; Stackelberg games; Kolmogorov complexity; Chinese good word attacks
Degree year: 2011
Advisor: Qin Zhiguang (秦志光)
Discipline code: 081201
Degree-granting institution: University of Electronic Science and Technology of China
Thesis submission date: 2011-03-15

Abstract

Machine learning is an important technique for intelligent information processing and is widely used in spam filtering systems. In real adversarial network environments, however, spam filters face unending malicious attacks from spammers, so learning algorithms that perform well in experimental settings may perform poorly once deployed. Adversarial classification was proposed to meet this challenge. It has become an active research topic in machine learning, with both theoretical and practical value.
     This dissertation studies adversarial classification in spam filtering along three lines: the attack-defense game in adversarial classification, resistance to Chinese good word attacks in spam filtering, and robust classification based on Kolmogorov complexity. It makes the following five contributions.
     1. An adversarial classification model based on a Stackelberg game with reaction-time delay. Earlier Stackelberg game models of adversarial classification cannot explain why spammers keep attacking after the Nash equilibrium has been reached. The proposed model introduces the follower's real-world reaction delay (here, the data miner's) into the Stackelberg game and analyzes how that delay affects the payoffs of the leader and the follower. A genetic algorithm is used to find the Nash equilibrium, and simulation experiments verify the model. The model shows that the spammer has a first-mover advantage and earns extra payoff during the data miner's reaction delay, which is why the spammer keeps launching new attacks.
     2. An adversarial classification model based on a Stackelberg game with uncertainty. Existing Stackelberg game models of adversarial classification typically assume that the follower acts optimally and rationally, which is unrealistic in practical spam filtering. The proposed model incorporates the follower's bounded rationality and limited observation of the spammer's strategy, and analyzes how the uncertainty parameters affect classifier performance. Experiments on a real email dataset verify the model's effectiveness.
     3. A multiple-instance logistic regression model that resists good word attacks on Chinese spam filters. Little research has addressed Chinese good word attacks so far. The model preprocesses messages with Chinese word segmentation and feature selection, then learns and classifies with a multiple-instance mechanism and logistic regression. Experiments on Chinese email corpora show that the model counters Chinese good word attacks effectively and is more robust than single-instance logistic regression and single-instance support vector machine models.
     4. A spam image classification model based on Kolmogorov complexity. Traditional spam image classifiers suffer from poor robustness and from image features that are sensitive to the particular dataset. The model uses data compression and a Kolmogorov complexity based classification mechanism to classify spam images, and experiments on a spam image dataset verify that it does so effectively. The security of the model's update mechanism is also analyzed. The model needs neither text extraction from images nor the definition and selection of image features. It is a data-driven, parameter-free classification method.
     5. A malware detection framework based on Kolmogorov complexity. Spam is an effective channel for spreading malware, and traditional signature-based methods have difficulty detecting new and variant malware. The framework offers a general malware detection method that classifies code samples with dynamic Markov compression, and experiments verify that it classifies malware accurately. It is simple to implement, requires no signature extraction, and effectively identifies new and variant malware.
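
The compression-based idea behind contributions 4 and 5 can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: `zlib` stands in for the compressors the work relies on (dynamic Markov compression is the one named for the malware framework), and the function names and toy corpora are invented for illustration. The compressed length of a byte string serves as a computable stand-in for its Kolmogorov complexity, and a sample is assigned to the class whose training corpus makes it cheapest to encode.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Compressed length in bytes: a crude, computable proxy for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def extra_bytes(corpus: bytes, sample: bytes) -> int:
    """Additional compressed bytes needed to encode `sample` after `corpus`."""
    return compressed_size(corpus + sample) - compressed_size(corpus)


def classify(sample: bytes, corpora: Mapping[str, bytes]) -> str:
    """Return the label whose corpus gives the smallest encoding cost for `sample`."""
    return min(corpora, key=lambda label: extra_bytes(corpora[label], sample))


# Toy corpora, invented for illustration only (assumption, not thesis data).
corpora = {
    "spam": b"cheap pills buy now limited offer click here free prize winner " * 20,
    "ham": b"meeting agenda attached please review the quarterly report before friday " * 20,
}

spam_like = b"buy cheap pills now, click here for your free prize"
ham_like = b"please review the quarterly report attached before the friday meeting"

print(classify(spam_like, corpora))
print(classify(ham_like, corpora))
```

No features are defined or selected here and the input is treated as raw bytes, which is the property the abstract describes as data-driven and parameter-free. The same scoring applies unchanged to image files or executable code, although a general-purpose compressor such as `zlib` is only a rough approximation of the models used in the dissertation.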

