Research on Text Classification and Feature Dimensionality Reduction
Abstract
With the rapid development of information technology, especially the spread of the Internet, the volume of information is growing explosively, and people urgently need techniques to organize and manage information efficiently. As a key technology for organizing and processing large amounts of text, text classification can resolve information chaos to a large extent; it is of great practical significance for the efficient management and effective use of information and has become an important research direction in data mining.
     Text classification has already been widely applied in many fields, such as information filtering, information retrieval, word sense disambiguation, news distribution, e-mail classification, digital libraries and text databases, and considerable progress has been made. More and more scholars are also joining text classification research, and many new methods and techniques have appeared. Text classification nevertheless faces unprecedented challenges, and there is still much room for development in both theory and practice.
     This thesis introduces the research background, significance and state of the art of text classification at home and abroad, and, on the basis of an analysis and summary of text preprocessing, text representation models, feature dimensionality reduction, feature weighting, classification methods and classification performance evaluation, studies text classifiers and their feature dimensionality reduction in depth. The main innovative work is as follows:
     (1) A text classifier based on the cloud model (CMTC) is proposed. First, a smoothing factor σ is introduced to solve the problem that an ordinary cloud classifier cannot be used directly for text classification because of the sparse feature space; the relation between σ and classification performance is then analyzed experimentally, and a suitable σ is selected. Experimental results show that on the Reuters10 dataset (a subset of Reuters-21578) CMTC handles the data better than SVM and KNN; in particular, its maximum macro-averaged F1 is 5.06% and 6.19% higher than those of KNN and SVM respectively. On the corpus provided by Fudan University, the classification performance of CMTC is on a par with KNN and sometimes better, and it also outperforms SVM.
     (2) A feature selection method (CMFS) based on the backward cloud model is proposed. A model of each feature in each class of the training set is first built from backward cloud model theory; the inter-class distinction of each feature is then computed from these models, and the features with large inter-class distinction are selected as classification features. Feature frequency is also taken into account. Experimental results show that, whether Naive Bayes or SVM is used for classification, the performance of CMFS is close to that of information gain and better than that of weight of evidence for text and mutual information.
     (3) A strong class-related feature selection method for imbalanced text is proposed. On the basis of an analysis of the four basic information elements from which traditional feature selection methods are constructed, a measure of strong class information is proposed, and a strong class-related feature selection method suited to imbalanced text is derived; the method combines class information and term frequency to improve the classification performance on minority and majority classes respectively. Experimental results show that, with SVM, classification is best at 100 features, where Micro_F1 is 2.12%, 1.91% and 1.91% higher than IG, CHI and DFICF respectively, and Macro_F1 is 1.21%, 1.55% and 1.14% higher. With the Naive Bayes classifier, classification is best at 300 features, where Micro_F1 is 1.08%, 1.76% and 0.79% higher than IG, CHI and DFICF respectively, and Macro_F1 is 0.75%, 2.85% and 0.41% higher.
     (4) A feature extraction method based on Sprinkling is proposed. First, the local and global weights of features are considered. Second, the membership degree of each sample is considered, defined with the descending half-Cauchy distribution. Third, each class in the document collection is mapped to an auxiliary feature (an artificially added feature), and the weight of the auxiliary features is adjusted to tune the closeness between words of the same class. The influence of the auxiliary feature weight on classification performance is also discussed. Experimental results show that classification accuracy reaches its maximum of 94.22%, 1.71% higher than the original Sprinkling method, when the auxiliary feature weight is 2.
With the rapid development of information technology, especially the popularization of the Internet, the volume of information is growing explosively, and there is a pressing need for technology that can organize and manage information efficiently. As a key technology for processing and organizing vast amounts of text data, text classification can alleviate information chaos to a great extent. It is of very real significance for the efficient management and effective utilization of information and has become an important research direction in the field of data mining.
     Text classification is now widely applied in many fields, such as information filtering, information retrieval, word sense disambiguation, news distribution, mail classification, digital libraries and text databases, and great progress has been made. In addition, more and more scholars are devoting themselves to research on text classification, and many novel methods and techniques have emerged. However, text classification also faces unprecedented challenges, and there is still broad room for development in both theory and practice.
     This thesis first describes the research background, significance and state of the art of text classification at home and abroad. It then covers the concepts of text classification, text preprocessing, text representation models, feature selection, feature weighting, classification methods and classification performance evaluation. On this basis, the text classifier and feature dimensionality reduction technology are studied in depth. The main research contents are as follows.
     (1) A text classifier based on the cloud model (CMTC) is proposed.
     First, the parameter σ is introduced into the CMTC classifier to solve the problem that a traditional cloud-model classifier cannot be used for text classification because of the sparse feature space. The relation between σ and classification performance is then analyzed through experiments, and a proper value of σ is selected accordingly. Experimental results show that the CMTC classifier handles the imbalanced dataset Reuters10 (a subset of Reuters-21578) better than SVM and KNN; in particular, its maximum Macro_F1 is 5.06% higher than that of KNN and 6.19% higher than that of SVM. When tested on the Fudan Chinese dataset, the classification performance of CMTC is at least equal to, and sometimes better than, that of the KNN classifier, and it is clearly better than that of the SVM classifier.
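The role of the smoothing parameter can be sketched as follows. This is a minimal illustration rather than the thesis's exact formulation: the certainty degree under a normal cloud (Ex, En, He) is computed with the expected entropy En (a full forward cloud generator would draw En' ~ N(En, He²); its mean is used here so the sketch is deterministic), and the name `sigma` and its default value are assumptions.

```python
import math

def cloud_certainty(x, ex, en, he, sigma=0.05):
    """Expected certainty degree of value x under a normal cloud (Ex, En, He).

    he is kept in the signature for fidelity to the cloud model's three
    digital characteristics but is unused in this deterministic sketch.
    The smoothing term sigma keeps the denominator positive when En is
    (near) zero, as happens for sparse text features.
    """
    denom = 2.0 * (en ** 2 + sigma ** 2)
    return math.exp(-((x - ex) ** 2) / denom)
```

With `sigma = 0` and `En = 0` the degree is undefined for any x ≠ Ex; any positive `sigma` removes that failure mode, and the thesis selects its value experimentally.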
     (2) A feature selection method (CMFS) based on the backward cloud model is proposed.
     First, a model of each feature in each class is built according to the theory of the backward cloud model, and the distinction of each feature between the classes is computed from these models; the features with the largest inter-class distinction are then selected. In addition, the frequency of the features is taken into account. The method is simple and has low time and space complexity. Experimental results show that the performance of the proposed feature selection method is comparable to that of IG (information gain) and higher than that of WET (weight of evidence for text) and MI (mutual information).
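The selection pipeline above can be sketched in a few lines. `backward_cloud` below is the standard backward cloud generator without certainty degrees; `inter_class_distinction` (the variance of the per-class Ex values) is a hypothetical stand-in for the thesis's distinction measure, which the abstract does not spell out.

```python
import math

def backward_cloud(samples):
    """Estimate the cloud digital characteristics (Ex, En, He) of a feature
    from its sample values: the standard backward cloud generator that
    requires no certainty degrees."""
    n = len(samples)
    ex = sum(samples) / n
    # En from the mean absolute deviation, He from the leftover variance.
    en = math.sqrt(math.pi / 2.0) * sum(abs(x - ex) for x in samples) / n
    var = sum((x - ex) ** 2 for x in samples) / (n - 1)
    he = math.sqrt(max(var - en ** 2, 0.0))
    return ex, en, he

def inter_class_distinction(per_class_samples):
    """Score a feature by how far apart its per-class models lie: here
    simply the variance of the per-class Ex values (an illustrative
    stand-in for the thesis's distinction measure)."""
    exs = [backward_cloud(s)[0] for s in per_class_samples]
    mean_ex = sum(exs) / len(exs)
    return sum((e - mean_ex) ** 2 for e in exs) / len(exs)
```

Features whose per-class values barely differ score near zero and are dropped; features concentrated in one class score high and are kept.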
     (3) A strong class-related feature selection method for imbalanced datasets is proposed.
     First, a new measure of strong class information is proposed after analyzing the four basic information elements from which traditional feature selection methods are constructed. Based on it, a new feature selection method applicable to imbalanced text classification is derived. The method considers both strong class information and term frequency, which improve the classification performance on minority classes and majority classes respectively. Experimental results on Reuters10 show that, when using SVM, the Micro_F1 of the new method is 2.12% higher than that of IG, 1.91% higher than that of CHI and 1.91% higher than that of DFICF, while its Macro_F1 is 1.21%, 1.55% and 1.14% higher respectively. When using the Naive Bayes classifier, the Micro_F1 of the new method is 1.08%, 1.76% and 0.79% higher than that of IG, CHI and DFICF respectively, and the Macro_F1 is 0.75%, 2.85% and 0.41% higher.
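Traditional feature selection measures are built from four contingency counts per term and class. The sketch below shows one plausible way a "strong class information" score could combine them; the thesis's exact formula is not given in the abstract, so `strong_class_score` is purely illustrative.

```python
def strong_class_score(a, b, c, d, tf=1.0):
    """Hypothetical strong-class-information score for one term and one
    class, built from the four contingency counts that classical measures
    (IG, CHI, ...) also use:
      a: docs of the class containing the term
      b: docs of other classes containing the term
      c: docs of the class not containing the term
      d: docs of other classes not containing the term (unused here,
         kept so the signature matches the four-element view)
    A term carries strong class information when it concentrates in one
    class and covers much of it; the term-frequency factor tf lets the
    majority classes benefit as well.
    """
    if a == 0:
        return 0.0
    concentration = a / (a + b)  # how exclusively the term belongs to the class
    coverage = a / (a + c)       # how much of the class the term covers
    return concentration * coverage * tf
```

A term occurring only in one class and in every document of it scores 1.0 (times `tf`); a term spread evenly across classes scores much lower.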
     (4) A feature extraction method based on Sprinkling is proposed.
     First, both the local weight and the global weight of features are considered in this method. Then the membership degrees of the samples are taken into account; the membership degree is defined with the descending half-Cauchy distribution. Finally, the feature set is augmented with one artificial term per class label, and the weight of the artificial terms is adjusted to improve classification performance; the influence of this weight on classification performance is also discussed. The results show that classification accuracy reaches its maximum of 94.22%, 1.71% higher than that of the original Sprinkling method, when the weight of the augmented feature is 2.
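The core Sprinkling step, augmenting each document vector with one artificial term per class label, can be sketched as follows; in the full method the augmented matrix would then go through LSI/SVD, which is omitted here, and the function name is an assumption.

```python
def sprinkle(doc_term_rows, labels, n_classes, weight=2.0):
    """Append one artificial term (column) per class label to each
    document vector. The column matching the document's class gets the
    sprinkling weight; the others stay 0. Raising the weight pulls
    documents of the same class closer together before LSI/SVD, which
    is the tunable knob the thesis studies (best accuracy reported at
    weight 2)."""
    sprinkled = []
    for row, label in zip(doc_term_rows, labels):
        extra = [0.0] * n_classes
        extra[label] = weight
        sprinkled.append(list(row) + extra)
    return sprinkled
```

For example, two documents with two term features and labels 0 and 1 become four-dimensional vectors whose last two coordinates encode the class.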
