Research on SVM-Based Multi-Class Text Categorization Algorithms and Their Applications
Abstract
With the rapid development of communication and computer technology, especially the Internet, information of all kinds is growing geometrically, and text, the traditional information carrier, is no exception. To obtain useful knowledge and information from massive text collections promptly and accurately, text representation and automatic text categorization have attracted wide attention. As a new machine learning method based on statistical learning theory, the support vector machine (SVM) handles practical problems such as nonlinearity, high dimensionality, and local minima well, and has become a new research focus in machine learning. Text categorization is the core technique of content-based automatic information management. Text vectors are highly sparse and high-dimensional, and their features are strongly correlated; SVMs are insensitive to feature correlation and sparsity and have clear advantages on high-dimensional problems. They are therefore well suited to text categorization, where they hold great application potential and are an active research topic. This thesis investigates the problems SVMs face in practical applications such as text categorization. The main work is as follows:
     First, the thesis analyzes the overall text categorization model, covering information preprocessing, feature representation, and feature extraction, with emphasis on feature representation, feature extraction, and categorization algorithms. SVMs were originally designed for binary classification, and extending them effectively to multi-class classification remains an open problem. The thesis analyzes the shortcomings of existing multi-class methods and introduces the half-against-half (HAH) classification algorithm. On this basis, drawing on the properties of tree-structured SVMs, it proposes an SVM-based half-against-half multi-class classification method that designs the tree structure of the tree-structured SVM so as to overcome its error-accumulation problem. Experiments show that, compared with other multi-class SVM methods, the proposed method achieves higher classification accuracy and faster training, improving the effectiveness of SVMs on multi-class problems.
     Second, the thesis studies the main content of statistical learning theory and the basic principles of the SVM algorithm, discusses the active topic of kernel functions, and reviews the state of SVM research and applications together with the open problems. Combining this with the semantic concept space, it proposes an HAH multi-class classification algorithm based on SVMs and the semantic concept space. Experiments show that the algorithm not only improves classification accuracy but also greatly reduces the number of labeled samples required.
     Finally, exploiting the advantages of SVMs in text categorization, the thesis applies them to feature extraction and proposes an SVM-based word clustering method. The method uses an SVM to measure each word's contribution to classification and merges words with consistent contributions into a single feature of the text vector. Experiments show that, with essentially no loss of classification information, the method greatly reduces the dimensionality of text vectors, reduces correlation among text features, and improves the precision and recall of text categorization.
With the rapid development of communication technology and the Internet, all kinds of information grow exponentially, and text, the most typical information carrier, is no exception. To manage and retrieve valuable information, research on automatic text categorization (TC) has become very important. As a new machine learning method based on statistical learning theory, support vector machines (SVMs) have attracted more and more attention and become a hot topic in machine learning, because they handle such practical problems as nonlinearity, high dimensionality, and local minima well. Text categorization is a key technique in content-based automatic information management. Text vectors are high-dimensional and extremely sparse, with many correlated features. SVMs are particularly well suited to text categorization and have great potential there, since they are insensitive to correlated features and sparse data and deal well with high-dimensional problems; this is an active field of research. This thesis focuses on the drawbacks of SVMs in practical applications, including text categorization. The main work is as follows:
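The high dimensionality and sparsity described above come from the vector space model, in which every document is weighted over the entire vocabulary. A minimal stdlib-only sketch of TF-IDF weighting; the toy corpus and whitespace tokenization are illustrative, not from the thesis:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map tokenized documents to TF-IDF vectors over the full vocabulary."""
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))   # document frequency
    vecs = []
    for d in docs:
        tf = Counter(d)                             # Counter returns 0 for absent words
        vecs.append([tf[w] * math.log(n / df[w]) for w in vocab])
    return vocab, vecs

docs = [
    "support vector machines learn a maximal margin classifier".split(),
    "kernel functions map text vectors into feature space".split(),
    "markets rallied after strong quarterly earnings".split(),
]
vocab, vecs = tfidf_vectors(docs)

# Each document uses only a small slice of the vocabulary, so most
# entries are zero -- exactly the sparsity SVMs tolerate well.
nonzero = sum(1 for v in vecs for x in v if x != 0.0)
sparsity = 1 - nonzero / (len(vecs) * len(vocab))
```

Even on this tiny corpus most vector entries are zero; on a real collection with tens of thousands of vocabulary terms the sparsity is far more extreme.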
     Firstly, the thesis analyzes the overall model of text categorization, including information preprocessing, feature representation, and feature extraction, with particular attention to feature representation, feature extraction, and categorization algorithms. SVMs were originally designed for binary classification; how to extend them effectively to multi-class classification is still an ongoing research issue. Several existing multi-class SVM methods are compared and analyzed, and a half-against-half (HAH) algorithm is presented. Then, based on the HAH algorithm and the characteristics of tree-structured SVMs, a tree-structured multi-class SVM classification method is proposed. The method designs the tree structure with a semi-fuzzy kernel clustering algorithm and thereby overcomes the error accumulation of tree-structured SVMs. Experimental results indicate that the method achieves higher precision and faster training than other multi-class SVM methods, improving SVM classification performance on multi-class problems.
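The half-against-half idea can be pictured as a binary tree: each internal node separates one half of the remaining classes from the other, so a prediction takes only about log2(k) binary decisions rather than k one-vs-rest votes. In the sketch below a nearest-centroid rule on 1-D toy data stands in for each node's binary SVM, and the naive half/half class split is an illustrative assumption, not the thesis's semi-fuzzy kernel clustering:

```python
from statistics import mean

class HAHNode:
    """One node of a half-against-half decision tree.

    A real implementation would train a binary SVM at each internal
    node; here a nearest-centroid rule stands in for that SVM.
    """
    def __init__(self, classes, samples):
        if len(classes) == 1:                     # leaf: one class remains
            self.label = classes[0]
            return
        self.label = None
        mid = len(classes) // 2                   # naive half/half split
        left_set = set(classes[:mid])
        left = [x for x, y in samples if y in left_set]
        right = [x for x, y in samples if y not in left_set]
        self.cl, self.cr = mean(left), mean(right)
        self.left = HAHNode(classes[:mid],
                            [(x, y) for x, y in samples if y in left_set])
        self.right = HAHNode(classes[mid:],
                             [(x, y) for x, y in samples if y not in left_set])

    def predict(self, x):
        if self.label is not None:
            return self.label
        nxt = self.left if abs(x - self.cl) <= abs(x - self.cr) else self.right
        return nxt.predict(x)

samples = [(0.1, "a"), (0.2, "a"), (1.0, "b"), (1.1, "b"),
           (2.0, "c"), (2.1, "c"), (3.0, "d"), (3.1, "d")]
tree = HAHNode(["a", "b", "c", "d"], samples)
```

With four classes the tree is two levels deep, so each prediction costs two binary decisions; a mistake at the root cannot be corrected lower down, which is the error-accumulation problem a well-designed tree structure tries to minimize.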
     Secondly, the thesis studies statistical learning theory (SLT) and support vector machine (SVM) theory in depth, discusses kernel functions, reviews the state of SVM research and applications, and points out some important open issues. Combining this with the semantic concept space, an HAH multi-class classification algorithm based on SVMs and the semantic concept space is proposed for text classification. Experiments show that the algorithm not only improves classification precision but also greatly reduces the number of labeled samples required.
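The claim of needing far fewer labeled samples is in the spirit of semi-supervised (transductive) learning: a model trained on a few labeled points confidently labels part of the unlabeled pool, then retrains. A stdlib-only self-training sketch with a nearest-centroid stand-in for the SVM; the seed data, pool, and confidence rule are illustrative assumptions, not the thesis's algorithm:

```python
from statistics import mean

def self_train(labeled, unlabeled, rounds=5):
    """Iteratively absorb unlabeled points into the labeled set.

    `labeled` is a list of (x, y) pairs; `unlabeled` a list of x values.
    Each round labels the single pool point closest to any class
    centroid (the "most confident" prediction) and retrains the centroids.
    A real system would use a (transductive) SVM in place of centroids.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        cents = {y: mean(x for x, yy in labeled if yy == y)
                 for y in {y for _, y in labeled}}
        best_x, best_y, best_d = None, None, float("inf")
        for p in pool:
            for label, c in cents.items():
                if abs(p - c) < best_d:
                    best_x, best_y, best_d = p, label, abs(p - c)
        labeled.append((best_x, best_y))
        pool.remove(best_x)
    return labeled

seed = [(0.0, "neg"), (10.0, "pos")]          # only two labeled samples
pool = [0.5, 1.0, 9.0, 9.5, 10.5]
result = self_train(seed, pool)
```

Starting from just two labeled points, the procedure labels the whole pool, illustrating how unlabeled data can substitute for manual labeling.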
     Thirdly, taking advantage of SVMs' strengths in text categorization and applying them to text feature extraction, a word clustering method based on SVMs is proposed. The method evaluates each word's contribution to classification with an SVM and merges words with similar contributions into one text feature. Experimental results indicate that the method loses almost no classification information, dramatically decreases the dimensionality of text vectors and the number of correlated features, and improves the precision and recall of text categorization.
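The word-clustering step can be illustrated by quantizing each word's contribution to classification (which a linear SVM exposes as its per-word weight) and merging words that land in the same contribution bin into one feature. The weight values below are made up for illustration, not learned weights from the thesis's experiments:

```python
def cluster_words(weights, n_bins=3):
    """Group words whose classification contributions are similar.

    `weights` maps word -> contribution (e.g. a linear SVM weight);
    words in the same bin would be merged into one text feature.
    """
    lo, hi = min(weights.values()), max(weights.values())
    width = (hi - lo) / n_bins or 1.0           # avoid division by zero
    clusters = {}
    for word, w in weights.items():
        bin_id = min(int((w - lo) / width), n_bins - 1)
        clusters.setdefault(bin_id, []).append(word)
    return clusters

# Illustrative weights: strongly positive, strongly negative, near-zero.
weights = {"margin": 0.9, "vector": 0.85, "kernel": 0.8,
           "market": -0.75, "earnings": -0.7, "the": 0.02}
clusters = cluster_words(weights)

# Merging each cluster into one feature shrinks the 6-word space
# down to len(clusters) merged features.
n_features = len(clusters)
```

Words that push the decision the same way (here "margin", "vector", "kernel") collapse into one feature, which is how the method reduces dimensionality and inter-feature correlation while keeping the discriminative signal.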
