Term frequency combined hybrid feature selection method for spam filtering
  • Authors: Yuanning Liu; Youwei Wang; Lizhou Feng; Xiaodong Zhu
  • Keywords: Feature selection; Spam filtering; Document frequency; Term frequency; Parameter optimization
  • Journal: Pattern Analysis & Applications
  • Year: 2016
  • Issue date: May 2016
  • Volume: 19
  • Issue: 2
  • Pages: 369-383
  • Full-text size: 1,307 KB
  • References: 1.Androutsopoulos I, Koutsias J, Chandrinos KV, Paliouras G, Spyropoulos C (2000) An evaluation of naive Bayesian anti-spam filtering. In: Proceedings of the workshop on machine learning in the new information age
    2.Azam N, Yao J (2012) Comparison of term frequency and document frequency based feature selection metrics in text categorization. Expert Syst Appl 39(5):4760–4768
    3.Bermejo P, Ossa L, Gámez JA, Puerta JM (2012) Fast wrapper feature subset selection in high-dimensional datasets by means of filter re-ranking. Knowl-Based Syst 25(1):35–44
    4.Boubezoul A, Paris S (2012) Application of global optimization methods to model and feature selection. Pattern Recogn 45(10):3676–3686
    5.Breiman L, Friedman JH, Olshen RA (1984) Classification and regression trees. Wadsworth International Group, Monterey
    6.Chen CM, Lee HM, Chang YJ (2009) Two novel feature selection approaches for web page classification. Expert Syst Appl 36(1):260–272
    7.Chen JN, Huang HK, Tian SF, Qu YL (2009) Feature selection for text classification with Naïve Bayes. Expert Syst Appl 36(3):5432–5435
    8.Clark J, Koprinska I, Poon J (2003) A neural network based approach to automated e-mail classification. In: Proceedings of the IEEE/WIC international conference on web intelligence (WI 03)
    9.Cormack GV (2007) TREC 2007 spam track overview. In: Proceedings of TREC 2007: the 16th text retrieval conference
    10.Correa RF, Ludermir TB (2006) Improving self-organization of document collections by semantic mapping. Neurocomputing 70(1):62–69
    11.Fawcett T (2006) An introduction to ROC analysis. Pattern Recogn Lett 27(8):861–874
    12.Forman G (2008) BNS feature scaling: an improved representation over TFIDF for SVM text classification. In: Proceedings of the ACM conference on information and knowledge management. ACM, New York, pp 263–279
    13.Gomez JC, Moens MF (2012) PCA document reconstruction for email classification. Comput Stat Data Anal 56(3):741–751
    14.Guzella TS, Caminhas WM (2009) A review of machine learning approaches to spam filtering. Expert Syst Appl 36(7):10206–10222
    15.Lee C, Lee GG (2006) Information gain and divergence-based feature selection for machine learning-based text categorization. Inf Process Manag 42(1):155–165
    16.Liu Y, Wang G, Chen H, Dong H, Zhu X, Wang S (2011) An improved particle swarm optimization for feature selection. J Bionic Eng 8(2):191–200
    17.López FR, Jiménez-Salazar H, Pinto D (2007) A competitive term selection method for information retrieval. In: Proceedings of 8th international conference on computational linguistics and intelligent text processing, (CICLing’07), Lecture notes in computer science, vol 4394, pp 468–475
    18.McCallum A, Nigam K (2007) A comparison of event models for naive Bayes text classification. In: EACL '03 Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics, vol 1, pp 307–314
    19.Mengle SSR, Goharian N (2009) Ambiguity measure feature selection algorithm. J Am Soc Inform Sci Technol 60(5):1037–1050
    20.Mladenic D, Grobelnik M (2003) Feature selection on hierarchy of web documents. Decis Support Syst 35(1):45–87
    21.Ogura H, Amano H, Kondo M (2009) Feature selection with a measure of deviations from Poisson in text categorization. Expert Syst Appl 36(3):6826–6832
    22.Quinlan JR (1986) Induction of decision trees. Mach Learn 1:81–106
    23.Ruiz R, Riquelme JC, Aguilar-Ruiz JS, García-Torres M (2012) Fast feature selection aimed at high-dimensional data via hybrid-sequential-ranked searches. Expert Syst Appl 39(12):11094–11102
    24.Salton G, Clement TY (1973) On the construction of effective vocabularies for information retrieval. In: Proceedings of the 1973 meeting on programming languages and information retrieval. ACM, New York, pp 48–60
    25.Salton G, Wong A, Yang C (1975) A vector space model for automatic indexing. Commun ACM 18:613–620
    26.Santos I, Laorden C, Sanz B, Bringas PG (2012) Enhanced topic-based vector space model for semantics-aware spam filtering. Expert Syst Appl 39(1):437–444
    27.Shang W, Huang H, Zhu H, Lin Y, Qu Y, Wang Z (2007) A novel feature selection algorithm for text categorization. Expert Syst Appl 33(1):1–5
    28.SpamAssassin (2005) SpamAssassin public corpus. http://spamassassin.apache.org/publiccorpus/. Accessed June 2008
    29.Tezel SK (2009) Improving SVM classification on imbalanced data sets in distance space. In: Proceedings of the ninth IEEE international conference on data mining
    30.Tretyakov K (2004) Machine learning techniques in spam filtering. Data mining problem-oriented seminar MTAT.03.177, pp 60–79
    31.Willett P (2006) The Porter stemming algorithm: then and now. Progr Electron Libr Inf Syst 40(3):219–223
    32.Yan J, Liu N, Zhang B, Yan S, Chen Z, Cheng Q (2005) OCFS: optimal orthogonal centroid feature selection for text categorization. In: Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval. ACM, New York, pp 122–129
    33.Yang J, Liu Y, Liu Z, Zhu X, Zhang X (2011) A new feature selection algorithm based on binomial hypothesis testing for spam filtering. Knowl-Based Syst 24(6):904–914
    34.Yang J, Liu Y, Zhu X, Liu Z, Zhang X (2012) A new feature selection based on comprehensive measurement both in inter-category and intra-category for text categorization. Inf Process Manag 48(4):741–754
    35.Yang Y, Pedersen J (1997) A comparative study on feature set selection in text categorization, In: Fisher DH (ed) Proceedings of the 14th international conference on machine learning. Morgan Kaufmann, San Francisco, pp 412–420
    36.Youn S, McLeod D (2007) A comparative study for email classification. In: Advances and innovations in systems, computing sciences and software engineering, pp 387–391
    37.Yu B, Xu Z (2008) A comparative study for content-based dynamic spam classification using four machine learning algorithms. Knowl-Based Syst 21(4):355–362
    38.Yu SN, Lee MY (2012) Conditional mutual information-based feature selection for congestive heart failure recognition using heart rate variability. Comput Methods Programs Biomed 108(1):299–309
    39.Zhang Y, Li S, Wang T, Zhang Z (2012) Divergence-based feature selection for separate classes. Neurocomputing 101(4):32–42
    40.Zhu Y, Tan Y (2011) A local-concentration-based feature extraction approach for spam filtering. IEEE Trans Inf Forensics Secur 6(2):486–497
  • Author affiliations: Yuanning Liu (1)
    Youwei Wang (1)
    Lizhou Feng (1)
    Xiaodong Zhu (1)

    1. Jilin University, No. 2699, Qianjin Street, Changchun, 130012, Jilin, China
  • Journal category: Computer Science
  • Journal subject: Pattern Recognition
  • Publisher: Springer London
  • ISSN:1433-755X
Abstract
Feature selection is an important technique for improving the efficiency and accuracy of spam filtering. Among the numerous methods, document frequency-based feature selection methods ignore term frequency information and therefore often yield unsatisfactory results. In this paper, a hybrid method (called HBM) that combines document frequency information and term frequency information is proposed. To maintain the category-distinguishing ability of the selected features, an optimal document frequency-based feature selection (called ODFFS) is chosen; terms that are genuinely discriminative are selected by ODFFS. For the remaining features, term frequency information is considered and the terms with the highest HBM values are selected. Furthermore, a novel method called feature subset evaluating parameter optimization (FSEPO) is proposed for parameter optimization. Experiments with support vector machine (SVM) and naïve Bayesian (NB) classifiers are conducted on four corpora: PU1, LingSpam, SpamAssassin and Trec2007. Six feature selection methods: information gain, Chi square, improved Gini-index, multi-class odds ratio, normalized term frequency-based discriminative power measure and comprehensive measurement feature selection are compared with HBM. Experimental results show that HBM is significantly superior to the other feature selection methods on all four corpora when SVM and NB are applied, respectively.
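The abstract describes a two-stage pipeline (ODFFS picks clearly discriminative terms by document frequency, HBM then ranks the remaining terms using term frequency, and FSEPO tunes the parameters) but does not give the formulas. Purely as an illustration of the two-stage idea, the minimal Python sketch below selects terms with a toy document-frequency score first and fills the remaining slots by a toy term-frequency score; the score definitions, the hand-set threshold and all function names are assumptions for illustration, not the paper's ODFFS, HBM or FSEPO definitions.

```python
def df_score(term, docs_by_class):
    # Toy document-frequency score: absolute difference between the
    # fractions of spam and ham documents that contain the term.
    rates = [sum(term in doc for doc in docs) / len(docs)
             for docs in docs_by_class]
    return abs(rates[0] - rates[1])

def tf_score(term, docs_by_class):
    # Toy term-frequency score: absolute difference between the average
    # per-document counts of the term in the two classes.
    means = [sum(doc.count(term) for doc in docs) / len(docs)
             for docs in docs_by_class]
    return abs(means[0] - means[1])

def hybrid_select(vocab, docs_by_class, k, df_threshold):
    # Stage 1: keep every term whose DF score already marks it as
    # discriminative (the role ODFFS plays in the abstract).
    stage1 = [t for t in vocab if df_score(t, docs_by_class) >= df_threshold]
    # Stage 2: rank the remaining terms by TF information and fill up to k.
    rest = sorted((t for t in vocab if t not in stage1),
                  key=lambda t: tf_score(t, docs_by_class), reverse=True)
    return (stage1 + rest)[:k]

# Tiny example: documents are token lists, grouped as (spam, ham).
spam = [["free", "win", "cash"], ["win", "cash", "now"]]
ham = [["meeting", "agenda", "cash"], ["project", "meeting"]]
vocab = sorted({t for doc in spam + ham for t in doc})
print(hybrid_select(vocab, (spam, ham), k=3, df_threshold=0.8))
```

In the paper, FSEPO would search for values such as the threshold and the feature count automatically by evaluating candidate feature subsets, rather than relying on the hand-set values used above.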
