Research on High-Performance Chinese Text Classification Based on Machine Learning
Abstract
Text classification is an important research direction in information processing. With the development of information technology, and in particular the maturing of machine-learning-based text classification methods in the 1990s, text classification has found wide application in natural language processing and understanding, information organization and management, content filtering, and related fields. The continuing demand from these fields has greatly stimulated research on text classification and made it an active topic in computer science.
     In machine-learning-based text classification, approaches fall into three kinds according to the learning paradigm: supervised, semi-supervised, and unsupervised. Supervised classification is usually simply called text categorization (TC); its task is to assign a document to one of a predefined set of class labels according to its content. Unsupervised classification is called text clustering: a document collection is organized or partitioned by some criterion so that similar documents fall into the same cluster and dissimilar documents into different clusters. Semi-supervised learning lies between the two; it concerns how to obtain a learner with good generalization ability, and hence correct category discrimination, when training samples are scarce or part of the data's information is missing.
     Whatever the classification algorithm, for high-dimensional text, feature extraction and feature selection are the key dimensionality-reduction methods for lowering computational complexity and improving classifier performance. Like the classification algorithms above, they face the challenges of massive data volumes, unstructured data, the curse of dimensionality, and skewed data sets.
     This thesis studies Chinese text classification, concentrating on four aspects: feature extraction, feature selection, classification, and clustering. We first propose a sentence-element-based feature extraction algorithm, a balanced feature selection algorithm, and a lower bound on feature selection dimensionality. We then propose KNN classification algorithms based on feature indexing and on feature compensation, and apply balanced feature selection to nonlinear semi-supervised classification. Finally, building on the work of Hartuv and Shamir, we propose a weighted-graph clustering algorithm, WGC. The main contributions are:
     1. Sentence-element-based text feature extraction. Feature extraction often admits terms unrelated to the topic. Observing that different sentence elements play different roles in expressing the topic, we use syntactic analysis to label sentence elements and, on that basis, propose a sentence-element-based feature extraction algorithm. Experiments show that it filters out topic-irrelevant terms effectively while avoiding the limitations of stop-word lists and part-of-speech filtering (see the first sketch after this list).
     2. Balanced feature selection. The assumptions commonly made about class distributions are hard to satisfy in practice, and real data are often skewed. By analyzing the text classification objective function, we propose a balanced feature selection algorithm; theoretical analysis and experiments on public text collections show that it handles data skew among subclasses effectively. We further give a method for computing the lower bound on the number of dimensions a feature selection function needs on a given document collection, together with a non-uniform-dimensionality selection algorithm under that bound (see the second sketch after this list).
     3. High-performance text classification algorithms. To speed up classification by avoiding comparisons between unlabeled samples and irrelevant vector sets, we use the selected feature set as an index for the documents to be classified and propose a nearest-neighbor classification algorithm based on feature-space indexing; experiments show that its classification time is only mildly affected by growing dimensionality. To improve accuracy, we take discriminative terms that fall outside the feature space as a compensation feature set and propose a feature-compensation KNN algorithm. Finally, combining balanced feature selection with robust path regularization, we realize nonlinear semi-supervised classification of text (see the third sketch after this list).
     4. A weighted-graph clustering algorithm based on minimum cuts. Building on the work of Hartuv and Shamir, we propose a graph-theoretic clustering algorithm, WGC, which has low polynomial complexity and provable clustering properties, and which determines the number of clusters automatically during clustering (see the fourth sketch after this list).
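A minimal Python sketch of the sentence-element-based extraction in contribution 1, assuming documents have already passed through a syntactic analyzer (such as the LTP toolkit cited as [80]) that tags each word with its sentence role; the role names and the choice of content-bearing roles below are illustrative assumptions, not the thesis's actual scheme:

from collections import Counter

# Illustrative assumption: terms filling content-bearing roles are kept,
# while adjuncts and function words are dropped without a stop-word list.
CONTENT_ROLES = {"subject", "predicate", "object"}

def extract_features(parsed_sentences):
    """parsed_sentences: iterable of sentences, each a list of
    (word, role) pairs produced by a syntactic analyzer."""
    counts = Counter()
    for sentence in parsed_sentences:
        for word, role in sentence:
            if role in CONTENT_ROLES:
                counts[word] += 1
    return counts

# Toy usage: the determiner and the adverbial never reach the feature set.
parsed = [[("the", "det"), ("cat", "subject"), ("chased", "predicate"),
           ("mouse", "object"), ("yesterday", "adverbial")]]
print(extract_features(parsed))  # Counter({'cat': 1, 'chased': 1, 'mouse': 1})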
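The abstract does not reproduce the balanced selection objective of contribution 2, so the second sketch shows only the general idea under an assumed design: features are drawn round-robin from per-class rankings (for example, per-class chi-square scores computed beforehand), so that a dominant class cannot crowd the minority classes' best features out of the selected set:

def balanced_select(scores_by_class, k):
    """scores_by_class: {class: {feature: score}}. Returns up to k features,
    picked round-robin from each class's own ranking."""
    rankings = {c: sorted(s, key=s.get, reverse=True)
                for c, s in scores_by_class.items()}
    positions = {c: 0 for c in rankings}
    selected, seen = [], set()
    while len(selected) < k:
        progressed = False
        for c, ranked in rankings.items():
            i = positions[c]
            while i < len(ranked) and ranked[i] in seen:
                i += 1                      # skip features another class took
            if i < len(ranked):
                seen.add(ranked[i])
                selected.append(ranked[i])
                positions[c] = i + 1
                progressed = True
                if len(selected) >= k:
                    break
            else:
                positions[c] = i
        if not progressed:                  # every ranking is exhausted
            break
    return selected

# Toy usage on a skewed corpus: the small class still contributes features.
scores = {"big":   {"market": 9.0, "stock": 8.5, "bank": 8.0, "fund": 7.5},
          "small": {"goal": 3.0, "match": 2.5, "bank": 2.0}}
print(balanced_select(scores, 4))  # ['market', 'goal', 'stock', 'match']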
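For contribution 3, the third sketch shows k-nearest-neighbor search through an inverted index over the selected feature set, so that an unlabeled document is scored only against training vectors sharing at least one indexed feature; this is an assumed reading of the feature-space index, with cosine similarity standing in for whatever similarity measure the thesis actually uses:

import heapq
from collections import defaultdict
from math import sqrt

class IndexedKNN:
    """KNN whose candidate set comes from an inverted index on the selected
    features, so vectors sharing no feature with the query are never scored."""

    def __init__(self, k=3):
        self.k = k
        self.index = defaultdict(set)      # feature -> ids of training docs
        self.docs, self.labels, self.norms = [], [], []

    def fit(self, vectors, labels):
        """vectors: list of sparse {feature: weight} dicts."""
        for i, (vec, y) in enumerate(zip(vectors, labels)):
            self.docs.append(vec)
            self.labels.append(y)
            self.norms.append(sqrt(sum(w * w for w in vec.values())))
            for f in vec:
                self.index[f].add(i)

    def _cosine(self, vec, norm, i):
        dot = sum(w * self.docs[i].get(f, 0.0) for f, w in vec.items())
        return dot / ((norm * self.norms[i]) or 1.0)

    def predict(self, vec):
        norm = sqrt(sum(w * w for w in vec.values()))
        candidates = {i for f in vec for i in self.index.get(f, ())}
        top = heapq.nlargest(self.k, candidates,
                             key=lambda i: self._cosine(vec, norm, i))
        votes = defaultdict(float)
        for i in top:
            votes[self.labels[i]] += self._cosine(vec, norm, i)
        return max(votes, key=votes.get) if votes else None

# Toy usage: the finance vector is never touched when classifying "match".
clf = IndexedKNN(k=1)
clf.fit([{"stock": 1.0, "bank": 0.5}, {"goal": 1.0, "match": 0.8}],
        ["finance", "sports"])
print(clf.predict({"match": 1.0}))  # sports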
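For contribution 4, the fourth sketch performs recursive minimum-cut clustering on a weighted graph in the spirit of Hartuv and Shamir's highly-connected-subgraphs algorithm [142]. The abstract does not state WGC's acceptance criterion, so the cut-weight-per-node test below is an assumed placeholder, and networkx's Stoer-Wagner routine [160] stands in for the thesis's cut computation. Note that the number of clusters emerges from the recursion rather than being fixed in advance, matching the property claimed for WGC:

import networkx as nx

def wgc_like(G, threshold=1.0):
    """Recursively split G at its global minimum weighted cut; a subgraph is
    accepted as a cluster once its min cut is heavy relative to its size."""
    if G.number_of_nodes() <= 1:
        return [list(G.nodes)]
    if not nx.is_connected(G):                 # split components first
        return [c for comp in nx.connected_components(G)
                for c in wgc_like(G.subgraph(comp).copy(), threshold)]
    cut_value, (part1, part2) = nx.stoer_wagner(G)
    if cut_value / G.number_of_nodes() >= threshold:  # assumed acceptance test
        return [list(G.nodes)]
    return (wgc_like(G.subgraph(part1).copy(), threshold)
            + wgc_like(G.subgraph(part2).copy(), threshold))

# Toy usage: the weakly linked pair {x, y} separates from the heavy triangle.
G = nx.Graph()
G.add_weighted_edges_from([("a", "b", 5.0), ("b", "c", 4.0), ("a", "c", 4.5),
                           ("x", "y", 5.0), ("c", "x", 0.2)])
print(wgc_like(G))  # e.g. [['a', 'b', 'c'], ['x', 'y']]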
References
[1] Sebastiani F. Machine learning in automated text categorization[J]. ACM Computing Surveys, 2002, 34(1): 1-47.
    [2] Su JS, Zhang BF, Xu X. Advances in machine-learning-based text categorization[J]. Journal of Software, 2006, 17(9): 1848-1859. (in Chinese)
    [3] Chawla NV, Japkowicz N, Kotcz A. Editorial: Special issue on learning from imbalanced data sets[J]. SIGKDD Explorations Newsletter, 2004, 6(1): 1-6.
    [4] Estabrooks A, Jo TH, Japkowicz N. A multiple resampling method for learning from imbalanced data sets[J]. Computational Intelligence, 2004, 20(1): 18-36.
    [5] Forman G. An extensive empirical study of feature selection metrics for text classification[J]. Journal of Machine Learning Research, 2003, 3: 1289-1305.
    [6] Castillo MDd, Serrano JI. A multistrategy approach for digital text categorization from imbalanced documents[J]. SIGKDD Explorations Newsletter, 2004, 6(1): 70-79.
    [7] Zheng Z, Wu X, Srihari R. Feature selection for text categorization on imbalanced data[J]. SIGKDD Explorations Newsletter, 2004, 6(1): 80-89.
    [8] Forman G. A pitfall and solution in multi-class feature selection for text classification[C]. In: Brodley CE, ed. Proc. of the 21st Int'l Conf. on Machine Learning (ICML-04). Banff: Morgan Kaufmann Publishers, 2004. 38.
    [9] Manevitz LM, Yousef M. One-class SVMs for document classification[J]. Journal of Machine Learning Research, 2001, 2(1): 139-154.
    [10] Brank J, Grobelnik M. Training text classifiers with SVM on very few positive examples[R]. Technical Report, MSR-TR-2003-34, Redmond: Microsoft Research, 2003.
    [11] Tan S. Neighbor-weighted k-nearest neighbor for unbalanced text corpus[J]. Expert Systems with Applications, 2005, 28(4): 667-671.
    [12] Nigam K. Using unlabeled data to improve text classification [Ph.D. Thesis]. Pittsburgh: Carnegie Mellon University, 2001.
    [13] Joachims T. Transductive inference for text classification using support vector machines[C]. In: Bratko I, Dzeroski S, eds. Proc. of the 16th Int'l Conf. on Machine Learning (ICML-99). Bled: Morgan Kaufmann Publishers, 1999. 200-209.
    [14] Chen YS, Wang GP, Dong SH. A progressive transductive inference algorithm based on support vector machine[J]. Journal of Software, 2003, 14(3): 451-460 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/14/451.htm
    [15] Taira H, Haruno M. Text categorization using transductive boosting[C]. In: Raedt LD, Flach PA, eds. Proc. of the 12th European Conf. on Machine Learning (ECML-01). Freiburg: Springer-Verlag, 2001. 454-465.
    [16] Park SB, Zhang BT. Co-trained support vector machines for large scale unstructured document classification using unlabeled data and syntactic information[J]. Information Processing and Management, 2004, 40(3): 421-439.
    [17] Kiritchenko S, Matwin S. Email classification with co-training[C]. In: Stewart DA, Johnson JH, eds. Proc. of the 2001 Conf. of the Centre for Advanced Studies on Collaborative Research. Toronto: IBM Press, 2001. 8.
    [18] Liu B, Dai Y, Li X, Lee WS, Yu PS. Building text classifiers using positive and unlabeled examples[C]. In: Proc. of the 3rd IEEE Int'l Conf. on Data Mining (ICDM-03). Melbourne: IEEE Computer Society, 2003. 179-188.
    [19] Tong S, Koller D. Support vector machine active learning with applications to text classification[J]. Journal of Machine Learning Research, 2001, 2(1): 45-66.
    [20] Yang Y, Pedersen JO. A Comparative Study on Feature Selection in Text Categorization[C]//Proceedings of the 14th International Conference on Machine Learning. Nashville: Morgan Kaufmann Press, 1997:412-420.
    [21] Duda R O, Hart P E, Stork D G. Pattern Classification[M]. 2nd Edition, New York: John Wiley & Sons, 2001.
    [22] Shen H, Lu BL, Uchiyama M, et al. Comparison and improvement of feature extraction methods for text classification[J]. Computer Simulation, 2006, 23(3). (in Chinese)
    [23] Chen W, Chang X, Wang H, Zhu J, Yao T. Automatic word clustering for text categorization using global information[C]. In: Myaeng SH, Zhou M, Wong KF, Zhang H, eds. Proc. of the Information Retrieval Technology, Asia Information Retrieval Symp. (AIRS 2004). Beijing: Springer-Verlag, 2004. 1-11.
    [24] Forman G. An extensive empirical study of feature selection metrics for text classification[J]. Journal of Machine Learning Research, 2003, 3: 1289-1305.
    [25] Jin YH, Miao CJ. An algorithm of extracting text character based on a model of context framework[J]. Journal of Computer Research and Development, 2004, 41(4).
    [26] Zhao P, Geng HT. An approach of Chinese text representation based on semantic and statistic feature[J]. Journal of Chinese Computer Systems, 2007, 28(7).
    [27] Hu JN, Guo J, et al. Independent semantic feature extraction algorithm based on short text[J]. Journal on Communications, 2007, 28(12).
    [28] Liao SS, Jiang MH. A feature selection method in Chinese text classification based on concept extraction with a shielded level[J]. Journal of Chinese Information Processing, 2006, 20(3).
    [29] Hu Y, Wu HZ, Zhong L. Research on part-of-speech-based feature extraction for Chinese text classification[J]. Journal of Wuhan University of Technology, 2007, 29(4). (in Chinese)
    [30] Mladenic D, Grobelnik M. Feature selection for unbalanced class distribution and naive Bayes[C]. The 16th Int'l Conf. on Machine Learning, Bled, Slovenia, 1999.
    [31] Lu YC, Lu MY, Li F, et al. Analysis and construction of word weighting function in VSM[J]. Journal of Computer Research and Development, 2002, 39(10): 1205-1210.
    [32] Rogati M, Yang Y. High-performing feature selection for text classification[C]. In: David G, Kalpakis K, Sajda Q, Han D, Len S, eds. Proc. of the 11th ACM Int'l Conf. on Information and Knowledge Management (CIKM-02). McLean: ACM Press, 2002. 659-661.
    [33] Gabrilovich E, Markovitch S. Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5[C]. In: Brodley CE, ed. Proc. of the 21st Int’l Conf. on Machine Learning (ICML-04). Banff: Morgan Kaufmann Publishers, 2004. 41.
    [34] Maron ME. Automatic indexing: An experimental inquiry[J]. Journal of the ACM, 1961, 8: 404-417.
    [35] Sebastiani F. A tutorial on automated text categorization[C]. Proceedings of ASAI-99, 1st Argentinian Symposium on Artificial Intelligence, 1999: 7-35.
    [36] Fuhr N, Hartmann S, Lustig G, et al. AIR/X - a rule-based multi-stage indexing system for large subject fields[C]. In Proceedings of RIAO'91, 1991: 606-623.
    [37] Vapnik V. The Nature of Statistical Learning Theory[M]. New York: Springer, 1995.
    [38] Chakrabarti S, Roy S, Soundalgekar M. Fast and accurate text classification via multiple linear discriminant projections[J]. Int'l Journal on Very Large Data Bases, 2003, 12(2): 170-185.
    [39] Wu H, Phang TH, Liu B, Li X. A refinement approach to handling model misfit in text categorization[C]. In: Davis H, Daniel K, Raymoind N, eds. Proc. of the 8th ACM Int’l Conf. on Knowledge Discovery and Data Mining (SIGKDD-02). Edmonton: ACM Press, 2002. 207-216.
    [40] Wang J, Wang H, Zhang S, Hu Y. A simple and efficient algorithm to classify a large scale of text[J]. Journal of Computer Research and Development, 2005, 42(1): 85-93 (in Chinese with English abstract).
    [41] Tan S, Cheng X, Wang B, Xu H, Ghanem MM, Guo Y. Using dragpushing to refine centroid text classifiers. In: Ricardo ABY, Nivio Z, Gary M, Alistair M, John T, eds. Proc. of the ACM SIGIR-05. Salvador: ACM Press, 2005. 653-654.
    [42] Debole F, Sebastiani F. An analysis of the relative hardness of reuters-21578 subsets. Journal of the American Society for Information Science and Technology, 2004,56(6):584-596.
    [43] Yang Y, Chute CG. An example-based mapping method for text categorization and retrieval[J]. ACM Transactions on Information Systems, 1994, 12(3): 252-277.
    [44] Masand B, Linoff G, Waltz D. Classifying news stories using memory-based reasoning[C]. In Proceedings of SIGIR-92, 15th ACM International Conference on Research and Development in Information Retrieval, Kobenhavn, DK, 1992: 59-65.
    [45] Langley P, Iba W, Thompson K. An analysis of Bayesian classifiers[C]. In: National Conference on Artificial Intelligence, 1992: 223-228.
    [46] Lewis DD, Ringuette M. A comparison of two learning algorithms for text categorization[C]. In Proc. of Third Annual Symposium on Document Analysis and Information Retrieval, 1994: 81-93.
    [47] Wiener ED, et al. A neural network approach to topic spotting[C]. In Proc. of SDAIR-95, the 4th Annual Symposium on Document Analysis and Information Retrieval, 1995: 317-332.
    [48] Zhou ZH, Wang J. Machine Learning and Its Applications[M]. Beijing: Tsinghua University Press, 2007. (in Chinese)
    [49] Zhu X. Semi-Supervised Learning Literature Survey[R]. Technical Report 1530, Department of Computer Sciences, University of Wisconsin at Madison, 2007. http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf.
    [50] Chapelle O, Scholkopf B, Zien A. Semi-Supervised Learning[M]. Cambridge: MIT Press, 2006.
    [51] Xiong YB. Research on several key technologies of text information processing[D]. Ph.D. Thesis, Fudan University, 2006. (in Chinese)
    [52] Qian TY. Research on key technologies of associative classification[D]. Ph.D. Thesis, Huazhong University of Science and Technology, 2006. (in Chinese)
    [53] Tan SB. Research on high-performance text classification algorithms[D]. Ph.D. Thesis, Graduate School of the Chinese Academy of Sciences, 2006. (in Chinese)
    [54] Baeza-Yates R, Ribeiro-Neto B. Modern Information Retrieval[M]. 1st ed. Reading, MA: Addison-Wesley-Longman, 1999.
    [55] Salton G, Fox EA, Wu H. Extended Boolean information retrieval[J]. Communications of the ACM, 1983, 26(11): 1022-1036.
    [56] Lee JH. Properties of extended Boolean models in information retrieval[C]. Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, 1994: 182-190.
    [57] Salton G, Wong A, Yang CS. A vector space model for automatic indexing[J]. Communications of the ACM, 1975, 18: 613-620.
    [58] Maron M, Kuhns J. On relevance, probabilistic indexing and information retrieval[J]. Journal of the ACM, 1960: 216-244.
    [59] Robertson SE. The probability ranking principle in IR[J]. Journal of Documentation, 1977, 33(4): 294-304.
    [60] Baeza-Yates R, Ribeiro-Neto B, et al. Modern Information Retrieval[M]. ACM Press, 1999.
    [61] Robertson SE, Walker S, Beaulieu MM, et al. Okapi at TREC-4[C]. The Fourth Text REtrieval Conference (TREC-4), NIST Special Publication 500-236, National Institute of Standards and Technology, Gaithersburg, MD, 1996: 73-86.
    [62] Ponte J, Croft W. A language modeling approach to information retrieval[C]. In Proceedings of SIGIR, 1998: 275-281.
    [63] Dasarathy BV. Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques[M]. Los Alamitos, California: IEEE Computer Society Press, 1991.
    [64] Yang Y, Liu X. A re-examination of text categorization methods[C]. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), 1999: 42-49.
    [65] Dai LL, Huang HY, Chen ZX. A comparative study of feature extraction methods for Chinese text categorization[J]. Journal of Chinese Information Processing, 2004, 18(1): 26-32. (in Chinese)
    [66] Yang Y. A study on thresholding strategies for text categorization[C]. In Proc. of the 24th ACM SIGIR. New York, NY, USA: ACM Press, 2001: 137-145.
    [67] McCallum A, Nigam K. A comparison of event models for naive Bayes text classification[C]. In AAAI-98 Workshop on Learning for Text Categorization, 1998.
    [68] Hull DA. Improving text retrieval for the routing problem using latent semantic indexing[C]. In: Croft WB, van Rijsbergen CJ, eds. Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval, Dublin, IE. Springer Verlag, Heidelberg, DE, 1994: 282-289.
    [69] Lin J, Gunopulos D. Dimensionality reduction by random projection and latent semantic indexing[C]. Text Mining Workshop, at the 3rd SIAM International Conference on Data Mining, 2003.
    [70] Kaski S. Dimensionality reduction by random mapping: Fast similarity computation for clustering[C]. Proceedings of International Joint Conference on Neural Networks (IJCNN'98). IEEE Service Center, Piscataway, NJ, 1998: 413-418.
    [71] Bingham E, Mannila H. Random projection in dimensionality reduction: Applications to image and text data[C]. Proc. SIGKDD, 2001: 245-250.
    [72] Dumais S, Furnas G, Landauer T, et al. Using latent semantic analysis to improve access to textual information[C]. In Proceedings of the Conference on Human Factors in Computing Systems (CHI'88), Washington, DC, USA, 1988: 281-285.
    [73] Deerwester S, Dumais ST, Furnas GW, et al. Indexing by latent semantic analysis[J]. Journal of the American Society for Information Science, 1990, 41(6): 391-407.
    [74] Yu XL, Ren XS. Multivariate Statistical Analysis[M]. Beijing: China Statistics Press, 1999. (in Chinese)
    [75] Zhao L, Hu T, Huang XJ, et al. A HowNet-based approach to concept feature extraction[J]. Journal on Communications, 2004. (in Chinese)
    [76] Wilbur JW, Sirotkin K. The automatic identification of stop words[J]. Journal of Information Science, 1992.
    [77] Y.Yang. Noise reduction in a statistical approach to text categorization[C]. In Proceedings of the 18th Ann Int ACM SIGIR Conference on Research and Development in Information Retrieval(SIGIR'95), 1995:256-263.
    [78] Yiming Yang, John Wilbur. Using corpus statistics to remove redundant words in text categorization[J]. Journal of the American Society of Information Science, 1996,47(5).
    [79] Lu S, Li XL, Bai S, et al. An improved approach to weighting terms in text[J]. Journal of Chinese Information Processing, 2000, 14(6): 8-13, 20. (in Chinese)
    [80] http://ir.hit.edu.cn/demo/ltp/
    [81] Institute of Computing Technology, Chinese Academy of Sciences. Open platform for Chinese natural language processing[EB/OL]. http://www.nlp.org.cn/, 2007.
    [82] Hermjakob U. Parsing and question classification for question answering[C]. ACL-2001 Workshop on Open-Domain Question Answering, Toulouse, 2001: 255-262.
    [83] Li X, Roth D. Learning question classifiers[C]. In: Proceedings of the 19th International Conference on Computational Linguistics, Taipei, 2002: 556-562.
    [84] Tesnière L. Éléments de syntaxe structurale[M]. Paris: Klincksieck, 1959.
    [85] Wang JH, Wang L, Hu YF. Quantitative identification of dependency relations between words[J]. Journal of Chinese Information Processing, 2005, 19(4). (in Chinese)
    [86] Lin Y, Shi XD, Guo F. A Chinese syntactic parser based on probabilistic context-free grammar[J]. Journal of Chinese Information Processing, 2006, 20(2). (in Chinese)
    [87] Liu T, Ma JS, Li S. A Chinese dependency parsing model based on lexical dominance degree[J]. Journal of Software, 2006, 17(9): 1876-1883. (in Chinese)
    [88] Meng Y, Li S, Zhao TJ, et al. Performance comparison of four basic statistical parsing models for Chinese[J]. Journal of Chinese Information Processing, 2003, 17(3). (in Chinese)
    [89] Meng Y, Li S, Zhao TJ, et al. A survey of statistics-based parsing techniques[J]. Computer Science, 2003, 30(9). (in Chinese)
    [90] You F, Li JZ, Wang ZY. Construction of a Chinese corpus based on semantic dependency relations[J]. Journal of Chinese Information Processing, 2003, 17(1). (in Chinese)
    [91] Liu WQ, Wang MH, Zhong YX. Establishing a hierarchical system of dependency relations for modern Chinese[J]. Journal of Chinese Information Processing, 10(2). (in Chinese)
    [92] Duda RO, Hart PE, Stork DG. Pattern Classification[M]. New York: John Wiley & Sons, 2000.
    [93] Debole F, Sebastiani F. Supervised term weighting for automated text categorization[C]. In: Haddad H, George AP, eds. Proc. of the 18th ACM Symp. on Applied Computing (SAC-03). Melbourne: ACM Press, 2003: 784-788.
    [94] Forman G. An extensive empirical study of feature selection metrics for text classification[J]. Journal of Machine Learning Research, 2003, 3: 1289-1305.
    [95] Yang Y, Liu X. A re-examination of text categorization methods[C]. In: Gey F, Hearst M, Rong R, eds. Proc. of the 22nd ACM Int’l Conf. on Research and Development in Information Retrieval (SIGIR-99). Berkeley: ACM Press, 1999: 42-49.
    [96] Fawcett T. ROC graphs: Notes and practical considerations for researchers[R]. Technical Report, HPL-2003-4, Palo Alto: HP Laboratories, 2003.
    [97] Chawla NV, Japkowicz N, Kotcz A. Editorial: Special issue on learning from imbalanced data sets[J]. SIGKDD Explorations Newsletter, 2004, 6(1): 1-6.
    [98] Estabrooks A, Jo TH, Japkowicz N. A multiple resampling method for learning from imbalanced data sets[J]. Computational Intelligence, 2004, 20(1): 18-36.
    [99] Manevitz LM, Yousef M. One-class SVMs for document classification[J]. Journal of Machine Learning Research, 2001, 2(1): 139-154.
    [100] Brank J, Grobelnik M. Training text classifiers with SVM on very few positive examples[R]. Technical Report, MSR-TR-2003-34, Redmond: Microsoft Research, 2003.
    [101] Tan S. Neighbor-weighted k-nearest neighbor for unbalanced text corpus[J]. Expert Systems with Applications, 2005, 28(4): 667-671.
    [102] Castillo MDd, Serrano JI. A multistrategy approach for digital text categorization from imbalanced documents[J]. SIGKDD Explorations Newsletter, 2004, 6(1): 70-79.
    [103] Zheng Z, Wu X, Srihari R. Feature selection for text categorization on imbalanced data[J]. SIGKDD Explorations Newsletter, 2004, 6(1): 80-89.
    [104] Forman G. A pitfall and solution in multi-class feature selection for text classification[C]. In: Brodley CE, ed. Proc. of the 21st Int’l Conf. on Machine Learning (ICML-04). Banff: Morgan Kaufmann Publishers, 2004. 38.
    [105] Soucy P, Mineau GW. Feature selection strategies for text categorization[C]. In: Xiang Y, Chaib-Draa B, eds. Proc. of the 16th Conf. of the Canadian Society for Computational Studies of Intelligence (CSCSI-03). Halifax: Springer-Verlag, 2003: 505-509.
    [106] Xue D, Sun M. Chinese text categorization based on the binary weighting model with non-binary smoothing[C]. In: Sebastiani F, ed. Proc. of the 25th European Conf. on Information Retrieval (ECIR-03). Pisa: Springer-Verlag, 2003:408-419.
    [107] Yang Y, Zhang J, Kisiel B. A scalability analysis of classifiers in text categorization[C]. In: Callan J, Cormack G, Clarke C, Hawking D, Smeaton A, eds. Proc. of the 26th ACM Int’l Conf. on Research and Development in Information Retrieval (SIGIR-03). Toronto:ACM Press, 2003: 96-103.
    [108] Liu TY, Yang Y, Wan H, Zhou Q, Gao B, Zeng HJ, Chen Z, Ma WY. An experimental study on large-scale web categorization[C]. In: Ellis A, Hagino T, eds. Proc. of the 14th Int'l World Wide Web Conf. (WWW-05). Chiba: ACM Press, 2005: 1106-1107.
    [109] Chakrabarti S, Roy S, Soundalgekar M. Fast and accurate text classification via multiple linear discriminant projections[J]. Int'l Journal on Very Large Data Bases, 2003, 12(2): 170-185.
    [110] Cristianini N. An Introduction to Support Vector Machines (Chinese translation)[M]. Beijing: Publishing House of Electronics Industry, 2000.
    [111] Yang M. Face recognition using extended Isomap[C]. In Proc. of 2002 Int. Conf. on Image Processing, 2, 2002: II-117-II-120.
    [112] Niskanen M, Silven O. Comparison of dimensionality reduction methods for wood surface inspection[C]. In Proc. of 6th Int. Conf. on Quality Control by Artificial Vision, 2003: 178-188.
    [113] Hadid A, Kouropteva O, Pietikainen M. Unsupervised learning using locally linear embedding: experiments with face pose analysis[C]. In Proc. of 16th Int. Conf. on Pattern Recognition, 1, 2002: 111-114.
    [114] Roweis S, Saul L. Nonlinear dimensionality reduction by locally linear embedding[J]. Science, 2000, 290(22): 2323-2326.
    [115] Laskaris NA, Ioannides AA. Semantic geodesic maps: a unifying geometrical approach for studying the structure and dynamics of single trial evoked responses[J]. Clinical Neurophysiology, 2002, 113: 1209-1226.
    [116] Kouropteva O, Okun O, Pietikainen M. Classification of handwritten digits using supervised locally linear embedding algorithm and support vector machine[C]. In Proc. of the 11th European Symposium on Artificial Neural Networks (ESANN 2003), April 23-25, Bruges, Belgium, 2003: 229-234.
    [117] Cover TM, Hart PE. Nearest neighbor pattern classification[J]. IEEE Transactions on Information Theory, 1967, 13(1): 21-27.
    [118] Yang Y, Liu X. A re-examination of text categorization methods[C]. In Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), Berkeley, 1999: 42-49.
    [119] Shin C, Yun U, Kim H, Park S. A hybrid approach of neural network and memory based learning to data mining[J]. IEEE Trans. on Neural Networks, 2000, 11(3): 637-646.
    [120] Wettschereck D, Aha DW, Mohri T. A review and empirical evaluation of feature weighting methods for a class of lazy learning algorithms[J]. AI Review, 1997, 11(2): 273-314.
    [121] Wang XY, Wang ZO. An improved method of K-nearest-neighbor classification[J]. Journal of Electronics &amp; Information Technology, 2005, 27(3): 487-491. (in Chinese)
    [122] Lam W, Lai KY. Automatic textual document categorization based on generalized instance sets and a metamodel[J]. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2003, 25(5): 628-633.
    [123] Tsay JJ, Wang JD. Improving linear classifier for Chinese text categorization[J]. Information Processing and Management, 2004, 40(2): 223-237.
    [124] Tan S, Cheng X, Ghanem MM, et al. A novel refinement approach for text categorization[C]. In: Otthein H, Hans JS, Norbert F, Abdur C, Wilfried T, eds. Proc. of the 14th ACM Conf. on Information and Knowledge Management (CIKM-05). Bremen: ACM Press, 2005. 469-476.
    [125] Wang Y, Wang ZO, Bai S. An improved KNN algorithm for text classification[J]. Journal of Chinese Information Processing, 2007, 21(3). (in Chinese)
    [126] Zhang ZY, Huang YL, Wang HH. An efficient KNN classification algorithm[J]. Computer Science, 2008, 35(3). (in Chinese)
    [127] Sun Y, Lyu SP, Tang YY. A KNN algorithm without prior-order constraints[J]. Journal of Chinese Computer Systems, 2008(4). (in Chinese)
    [128] Pang XL, Feng YQ, Jiang W. A compensation strategy for missing feature terms in Bayesian text classification[J]. Journal of Harbin Institute of Technology, 2008, 40(6). (in Chinese)
    [129] Seung HS, Lee DD. The manifold ways of perception[J]. Science, 2000, 290(22): 2268-2269.
    [130] Bregler C, Omohundro SM. Nonlinear image interpolation using manifold learning[C]. Advances in Neural Information Processing Systems 7. MIT Press, 1995.
    [131] Tenenbaum JB, Silva VD, Langford JC. A global geometric framework for nonlinear dimensionality reduction[J]. Science, 2000, 290(22): 2319-2323.
    [132] Donoho D, Grimes C. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data[C]. Proceedings of the National Academy of Sciences, 2003: 5591-5596.
    [133] Belkin M, Niyogi P. Laplacian eigenmaps for dimensionality reduction and data representation[J]. Neural Computation, 2003, 15(6): 1373-1396.
    [134] Coifman R, Lafon S, Lee A, et al. Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps[C]. In Proc. of the National Academy of Sciences, 102, 2005: 7426-7431.
    [135] Bregler C, Omohundro SM. Nonlinear manifold learning for visual speech recognition[C]. In Proc. of Int. Conf. on Computer Vision, 1995: 494.
    [136] Fischer B, Zoller T, Buhmann J M. Path Based Pairwise Data Clustering with Application to Texture Segmentation[C]. Proceedings of the International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, 2001: 235-250.
    [137] Chang H, Yeung D Y. Robust Path-Based Spectral Clustering[J]. Pattern Recognition, 2008,41(1): 191-203.
    [138] Fischer B, Roth V, Buhmann J M. Clustering with the Connectivity Kernel[C]. Proceedings of the Annual Conference on Neural Information Processing System, 2004.
    [139] Jain A.K., Murty M.N., Flynn P.J. Data Clustering: A Review. ACM Computing Surveys, 31, 1999: 264–323.
    [140] Dhillon I. S., Mallela S., Kumar R. A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification[J]. Journal of Machine Learning Research, 3, 2003:1265-1287.
    [141] Guha S., Meyerson A., Mishra N., et al: Clustering Data Streams: Theory and Practice[J]. IEEE Trans. on Knowledge and Data Engineering, 2003,15(3) : 515–528.
    [142] Hartuv E., Shamir R. A Clustering Algorithm Based on Graph Connectivity[J]. Information Processing Letters, 2000,76: 175–181.
    [143] Julisch K. Clustering Intrusion Detection Alarms to Support Root Cause Analysis[J]. ACM Trans. on Information and System Security, 2003, 6(4):443-471.
    [144] Tombros A, Villa R, Rijsbergen C.J.V. The Effectiveness of Query-Specific Hierarchic Clustering in Information Retrieval. Information Processing and Management, 38, 2002: 559-582.
    [145] Wang J.-B., Peng H., Hu J.-S, et al. Ensemble Learning for Keyphrases Extraction from Scientific Document[J]. In: Wang J., et al (Eds.): Advances in Neural Networks. Lecture Notes in Computer Science. Springer-Verlag, Berlin Heidelberg , 2006.
    [146] Wang L, Cheung DW-L, Mamoulis N, et al. An efficient and scalable algorithm for clustering XML documents by structure[J]. IEEE Trans. on Knowledge and Data Engineering, 2004, 16(1): 82-96.
    [147] Zhang Y.-J., Liu Z.-Q. Refining Web Search Engine Results Using Incremental Clustering[J]. International Journal of Intelligent Systems, 2004, 19,191–199.
    [148] Zhuge H. Retrieve Images by Understanding Semantic Links and Clustering Image Fragments[J]. The Journal of Systems and Software, 2004, 73: 455–466.
    [149] He X., Zha H., Ding C. H.Q., et al. Web Document Clustering Using Hyperlink Structures[J]. Computational Statistics & Data Analysis, 2002, 41: 19–45.
    [150] Kannan R, Vempala S, Vetta A. On clusterings: Good, bad and spectral[J]. Journal of the ACM, 2004, 51(3): 497-515.
    [151] Matula D.W. k-Components, Clusters and Slicings in Graphs[J]. SIAM J. Applied Mathematics, 1972, 22(3): 459–480.
    [152] Peng Y., Ngo C.-W. Hot Event Detection and Summarization by Graph Modeling and Matching[J]. In: Leow W.-K., et al. (Eds.): CIVR 2005. Lecture Notes in Computer Science, Vol. 3568, Springer-Verlag Berlin Heidelberg , 2005: 257–266.
    [153] Shi J., Malik J.: Normalized Cuts and Image Segmentation[J]. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2000, 22(8): 888–905.
    [154] Wu X., Chen D. Z., Mason J. J., et al: Pairwise Data Clustering and Applications. Warnow T. and Zhu B. (Eds.): COCOON 2003. Lecture Notes in Computer Science, Vol. 2697, Springer-Verlag Berlin Heidelberg , 2003:455-466.
    [155] Wu Z, Leahy R. An optimal graph theoretic approach to data clustering: Theory and its application to image segmentation[J]. IEEE Trans. Pattern Analysis and Machine Intelligence, 1993, 15(11): 1101-1113.
    [156] Zahn C. T. Graph-theoretical methods for detecting and describing gestalt clusters[J]. IEEE Trans. Computer. 1971,C-20: 68–86.
    [157] West DB. Introduction to Graph Theory[M]. Beijing: China Machine Press, 2004.
    [158] Ford LR, Fulkerson DR. Flows in Networks[M]. Princeton, NJ: Princeton University Press, 1962.
    [159] Nagamochi H, Ibaraki T. Computing edge-connectivity in multigraphs and capacitated graphs[J]. SIAM J. Discrete Mathematics, 1992, 5: 54-66.
    [160] Stoer M, Wagner F. A simple min-cut algorithm[J]. Journal of the ACM, 1997, 44(4): 585-591.
    [161] Bui T. N., Moon B. R. Genetic Algorithm and Graph Partitioning[J]. IEEE Trans. on Computers, 1996, 45(7): 841–855.
    [162] Chuangxin Yang, Hong Peng, Jiabing Wang. A Clustering Algorithm for Weighted Graph Based on Minimum Cut[C]. Intelligent Networks and Intelligent Systems, ICINIS '08. 2008:649-653.
    [163] Jing-Song Hu,Jia-bing Wang,Chuang-xin Yang. Self-optimum fuzzy controller based on minesweeping strategy[C]. Proceedings of 2007 International Conference on Computational Intelligence and Security. CISW 2007, 2007: 19-22.
    [164] Xu Y, Li JT, Wang B. A high-performance feature selection method based on category-discriminating capability[J]. Journal of Software, 2008, 19(1): 82-89. (in Chinese)
