Research on Spam Filtering Theory and Key Technologies
Abstract
As one of the Internet's major "disasters", the ever-growing flood of spam has attracted wide public attention. Since the first spam message appeared in the mid-1980s, anti-spam strategies and techniques have emerged in response and developed rapidly, and the field remains active today. Research on the anti-spam problem, however, has gradually led researchers into a "garden of uncertainty": because judging whether a message is spam involves both subjective and objective uncertainty, current automatic classification and filtering techniques face a substantial performance bottleneck. After years of research, many scholars have observed that uncertain intelligent computing techniques can, to a certain extent, handle some of the uncertain reasoning problems that arise in practical engineering applications. Although this line of research is not yet mature, just as many researchers believe that God did not simply throw dice to create humankind, the remarkable deterministic laws hidden behind uncertainty continue to attract sustained exploration and have already yielded staged results. This dissertation argues that uncertain intelligent computing techniques can likewise, at certain levels, effectively handle the many subjective and objective uncertainties involved in spam identification. Studying uncertain computing theory and applying it both to improve existing mail-filtering algorithms and to design new ones is therefore the focus of this work; the introduction of uncertain intelligent computing makes anti-spam research a task that is as enjoyable as it is challenging.
     Drawing on the latest results in uncertain intelligent computing and anti-spam filtering, this dissertation investigates both areas in depth at the theoretical and the application level. The main research results are as follows:
     1. The background of the spam problem is analyzed systematically, and the theoretical value and practical significance of anti-spam research are pointed out. By tracking the latest progress at home and abroad, the state of existing anti-spam classification techniques is summarized and the strengths and weaknesses of the various approaches are compared. The analysis concludes that uncertain intelligent learning and classification methods grounded in statistical theory are an important theoretical means of raising the level of anti-spam technology and deserve further study.
     2. Bayesian network theory is studied in depth and several improvements are proposed. (1) For general complex networks, a PPJT algorithm based on global message propagation is proposed; it effectively reduces the time complexity of inference while preserving the required inference accuracy under small observation samples. (2) For complex Bayesian networks with a polytree structure, inference is extended to a multi-processor mode: by analyzing the structure of medium- and large-scale polytree Bayesian networks, a new parallel evidence-handling format suited to multi-processor environments is defined and a parallel inference algorithm is proposed, providing a parallel solution for improving global evidence propagation in such networks. (3) Parameter learning under incomplete evidence is investigated: an evidence-loss model is built from the standard likelihood function, the χ² distance is used to approximate the error introduced by missing evidence, and an EM algorithm that incorporates a learning rate is derived. Experiments show that, compared with the conventional algorithm, the new algorithm converges faster without loss of estimation accuracy and provides reliable, efficient Bayesian-network parameter estimation under incomplete evidence (a minimal damped-EM sketch follows the abstract).
     3. A Bayesian parameter-estimation method that incorporates kernel functions is proposed, improving the practicality of Bayesian parameter estimation. Important email features are analyzed and extracted from both message content and message format, and a corresponding Bayesian classification network is constructed. On-line learning experiments on different email test sets show that, with the kernel-based parameter estimation applied to the classification network, the new model filters spam effectively (see the kernel-density sketch below).
     4. A fitted logistic regression model is used to model the email classification problem, and a partial-dependency coefficient function is introduced during modeling to capture the partial-dependency characteristic of mail filtering. Experiments on several email sample sets show that the new model discriminates asymmetrically between false-positive and false-negative errors, responding more strongly to false positives, and thus realizes a classifier with partial-dependency characteristics at the algorithm level (see the weighted-loss sketch below).
     5. To sidestep the difficulties that current content-based techniques face in text association and content understanding, spam classification is studied from a different angle: the behavior patterns of spam senders. Feature vectors are extracted from email attributes closely tied to sender behavior, and a support vector machine is used to construct the classification function, yielding a behavior-feature-based spam classification model. Simulation experiments show that this model determines message categories accurately and robustly (see the SVM sketch below).
     6. A multi-layer spam-filtering system, SpamWeeder, deployed in front of the mail server, is designed and implemented. SpamWeeder integrates the methods proposed in this dissertation, namely Naive Bayes classification based on multi-level attribute sets, Bayesian-network classification, logistic-regression classification, and behavior-feature classification; the methods cooperate with and complement one another, forming an anti-spam filtering system that is accurate, fast, efficient, easy to manage, and adaptable to individual requirements (an illustrative filtering cascade is sketched below).
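Contribution 2 above only names the ingredients of the incomplete-evidence learning result: a likelihood-based evidence-loss model, a χ²-distance error estimate, and an EM algorithm with a learning rate. The dissertation's actual derivation is not reproduced here; the following is a minimal sketch, assuming a toy two-node network C → X with partially missing class labels, of a generic damped EM update of the same flavor (θ ← (1 − η)θ + η·θ_EM).

```python
"""Sketch: EM with a learning rate for a tiny two-node Bayesian network
C -> X (C = spam/ham class, X = a binary email feature), where the class
label C is missing for part of the training data ("incomplete evidence").
This is NOT the dissertation's chi-square-based derivation, only a generic
damped-EM update of the same flavor."""
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth parameters used to simulate data (hypothetical values).
P_C = 0.4                 # P(C=1), i.e. the spam prior
P_X_GIVEN_C = [0.2, 0.8]  # P(X=1 | C=0), P(X=1 | C=1)

N = 2000
c = rng.binomial(1, P_C, size=N)
x = rng.binomial(1, [P_X_GIVEN_C[ci] for ci in c])
observed = rng.random(N) > 0.6      # class observed for only ~40% of messages
c_obs = np.where(observed, c, -1)   # -1 marks missing evidence

def em_with_learning_rate(x, c_obs, eta=0.7, n_iter=50):
    p_c, p_x = 0.5, np.array([0.5, 0.5])     # initial guesses
    for _ in range(n_iter):
        # E-step: posterior P(C=1 | x) where the class is unobserved.
        lik1 = p_c * np.where(x == 1, p_x[1], 1 - p_x[1])
        lik0 = (1 - p_c) * np.where(x == 1, p_x[0], 1 - p_x[0])
        resp = lik1 / (lik1 + lik0)
        resp = np.where(c_obs == -1, resp, c_obs)   # keep hard evidence when present
        # Plain EM (M-step) estimates.
        p_c_em = resp.mean()
        p_x_em = np.array([
            np.sum((1 - resp) * x) / np.sum(1 - resp),
            np.sum(resp * x) / np.sum(resp),
        ])
        # Damped update controlled by the learning rate eta.
        p_c = (1 - eta) * p_c + eta * p_c_em
        p_x = (1 - eta) * p_x + eta * p_x_em
    return p_c, p_x

print(em_with_learning_rate(x, c_obs))
```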
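Contribution 3 describes Bayesian parameter estimation that incorporates kernel functions, but the abstract does not spell out the construction. As a hedged illustration only, the sketch below uses Gaussian kernel density estimates for the class-conditional feature densities inside a naive Bayes classifier ("flexible Bayes"); the two continuous features are hypothetical. Swapping a parametric density for a kernel estimate is one common way to make the estimation less sensitive to distributional assumptions, which is the kind of practicality gain the contribution refers to.

```python
"""Sketch: kernel-based class-conditional density estimation inside a
naive Bayes spam classifier.  The dissertation's exact kernel construction
is not given in the abstract; this only illustrates the general idea of
replacing parametric P(x_j | C) with a Gaussian kernel density estimate.
Feature names are hypothetical."""
import numpy as np

def gaussian_kde_logpdf(value, samples, bandwidth):
    """log of a 1-D Gaussian KDE evaluated at `value`."""
    z = (value - samples) / bandwidth
    dens = np.exp(-0.5 * z * z).sum() / (len(samples) * bandwidth * np.sqrt(2 * np.pi))
    return np.log(dens + 1e-300)

class KernelNaiveBayes:
    def fit(self, X, y, bandwidth=0.5):
        self.bandwidth = bandwidth
        self.classes_ = np.unique(y)
        self.priors_ = {c: np.mean(y == c) for c in self.classes_}
        self.samples_ = {c: X[y == c] for c in self.classes_}  # keep training points per class
        return self

    def predict(self, X):
        preds = []
        for row in X:
            scores = {}
            for c in self.classes_:
                s = np.log(self.priors_[c])
                for j, v in enumerate(row):        # naive feature-independence assumption
                    s += gaussian_kde_logpdf(v, self.samples_[c][:, j], self.bandwidth)
                scores[c] = s
            preds.append(max(scores, key=scores.get))
        return np.array(preds)

# Toy demo with two hypothetical continuous features
# (e.g. fraction of capital letters, URLs per KB of body text).
rng = np.random.default_rng(1)
ham  = rng.normal([0.1, 0.5], 0.2, size=(200, 2))
spam = rng.normal([0.6, 2.0], 0.4, size=(200, 2))
X = np.vstack([ham, spam])
y = np.array([0] * 200 + [1] * 200)
clf = KernelNaiveBayes().fit(X, y)
print("training accuracy:", (clf.predict(X) == y).mean())
```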
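Contribution 4 introduces a partial-dependency coefficient function so that the logistic-regression filter treats false positives and false negatives asymmetrically. The exact coefficient function is not given in the abstract; the sketch below stands in for the idea with a simple per-class weight in the logistic loss (mistakes on legitimate mail cost more), trained by plain gradient descent.

```python
"""Sketch: logistic-regression spam classifier whose training loss treats
false positives (ham classified as spam) as more costly than false
negatives.  The per-class weight w_fp is a stand-in for the dissertation's
partial-dependency coefficient function, whose form the abstract omits."""
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_weighted_logreg(X, y, w_fp=5.0, lr=0.1, epochs=500):
    """y = 1 for spam, 0 for ham; ham examples get weight w_fp in the loss."""
    Xb = np.hstack([X, np.ones((len(X), 1))])        # add bias column
    w = np.zeros(Xb.shape[1])
    sample_w = np.where(y == 0, w_fp, 1.0)           # penalize mistakes on ham more
    for _ in range(epochs):
        p = sigmoid(Xb @ w)
        grad = Xb.T @ (sample_w * (p - y)) / len(y)  # gradient of the weighted log-loss
        w -= lr * grad
    return w

def predict(w, X, threshold=0.5):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (sigmoid(Xb @ w) >= threshold).astype(int)

# Toy demo with two hypothetical numeric email features.
rng = np.random.default_rng(2)
ham  = rng.normal([-1.0, -1.0], 1.0, size=(300, 2))
spam = rng.normal([+1.0, +1.0], 1.0, size=(300, 2))
X = np.vstack([ham, spam])
y = np.array([0] * 300 + [1] * 300)
w = train_weighted_logreg(X, y, w_fp=5.0)
pred = predict(w, X)
fp = np.sum((pred == 1) & (y == 0))
fn = np.sum((pred == 0) & (y == 1))
print("false positives:", fp, "false negatives:", fn)  # weighting typically drives FP below FN
```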
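Contribution 5 classifies mail by sender-behavior features with an SVM. The concrete behavior attributes and the SVM solver used in the dissertation are not listed in the abstract, so the features below (messages per hour, recipients per message, HELO/From domain mismatch) are hypothetical, and the linear SVM is trained with the textbook Pegasos sub-gradient method rather than the dissertation's solver.

```python
"""Sketch: behaviour-feature spam classification with a linear SVM.
Feature columns and the training method are illustrative assumptions,
not the dissertation's actual feature set or solver."""
import numpy as np

def pegasos_train(X, y, lam=0.01, epochs=50, seed=3):
    """y must be in {-1, +1}; returns weights (last component = bias)."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(Xb)):
            t += 1
            eta = 1.0 / (lam * t)
            if y[i] * (Xb[i] @ w) < 1:                 # margin violated: hinge-loss step
                w = (1 - eta * lam) * w + eta * y[i] * Xb[i]
            else:                                      # only shrink (regularization)
                w = (1 - eta * lam) * w
    return w

def svm_predict(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.sign(Xb @ w)

# Toy behaviour vectors: [msgs_per_hour, recipients_per_msg, helo_from_mismatch]
rng = np.random.default_rng(3)
ham  = np.column_stack([rng.poisson(2, 300),  rng.poisson(1, 300) + 1,  rng.binomial(1, 0.05, 300)])
spam = np.column_stack([rng.poisson(40, 300), rng.poisson(20, 300) + 1, rng.binomial(1, 0.7, 300)])
X = np.vstack([ham, spam]).astype(float)
X = (X - X.mean(axis=0)) / X.std(axis=0)               # standardize features
y = np.array([-1] * 300 + [+1] * 300)
w = pegasos_train(X, y)
print("training accuracy:", (svm_predict(w, X) == y).mean())
```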
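Contribution 6 summarizes SpamWeeder as a multi-layer filter in front of the mail server but gives no interface details. The sketch below shows one plausible way such layers can be chained, with cheap, confident layers short-circuiting before the more expensive ones; the layer names, thresholds, and message fields are illustrative assumptions, not SpamWeeder's actual design.

```python
"""Sketch: chaining several filters into a front-end cascade, in the
spirit of the multi-layer SpamWeeder design summarized above.  All names
and thresholds here are illustrative assumptions."""
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Verdict:
    label: str    # "spam" or "ham"
    source: str   # which layer decided

# Each layer returns a Verdict when confident, or None to pass the
# message on to the next (slower but more thorough) layer.
Layer = Callable[[dict], Optional[Verdict]]

def behaviour_layer(msg: dict) -> Optional[Verdict]:
    if msg.get("recipients", 1) > 50:                  # bulk-sending behaviour
        return Verdict("spam", "behaviour")
    return None

def bayes_layer(msg: dict) -> Optional[Verdict]:
    score = msg.get("bayes_spam_prob", 0.5)            # assumed to be precomputed upstream
    if score > 0.9:
        return Verdict("spam", "bayes")
    if score < 0.1:
        return Verdict("ham", "bayes")
    return None

def logistic_layer(msg: dict) -> Optional[Verdict]:
    score = msg.get("logreg_spam_prob", 0.5)
    return Verdict("spam" if score >= 0.5 else "ham", "logistic")  # final layer always decides

PIPELINE: list[Layer] = [behaviour_layer, bayes_layer, logistic_layer]

def classify(msg: dict) -> Verdict:
    for layer in PIPELINE:
        verdict = layer(msg)
        if verdict is not None:
            return verdict
    return Verdict("ham", "default")

print(classify({"recipients": 200}))
print(classify({"recipients": 2, "bayes_spam_prob": 0.95}))
print(classify({"recipients": 1, "bayes_spam_prob": 0.5, "logreg_spam_prob": 0.3}))
```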
