Research on Text Classification Techniques and Applications
Abstract
The Internet is flooded with information of every kind, some of which, such as messages spread by terrorist organizations, directly threatens national security and stability. Traditional approaches that block information by IP address or by topic no longer meet current needs; the emphasis has shifted to monitoring content itself.
Since most information on the Internet exists as text, such monitoring techniques depend largely on understanding textual content, and their core technologies are text classification and text clustering. The explosive growth of text imposes new standards of accuracy and speed on text understanding: accuracy must improve while training and classification become faster.
This thesis studies three difficulties and challenges in text classification: skewed data sets (the class distribution of the data is imbalanced, i.e., class imbalance), feature selection, and the small-sample problem (the annotation bottleneck). Aiming to make classification both faster and more accurate, it proposes several effective solutions and improvements. It also investigates an important application area of text clustering and classification, topic detection and tracking. The main contributions are the following three points:
1. Handling class imbalance in the kNN text classifier
Class imbalance is one of the common problems in data mining. The kNN method, widely used in text classification, suffers a marked drop in performance when the training samples are imbalanced across classes. When we applied a kNN classifier in a text content security project, almost all test samples of the minority classes were misclassified into the majority classes. To address this weakness, the concept of the critical point (CP) of a training set is proposed; the traditional kNN decision function is modified according to the lower and upper approximations of CP (LA and UA) and the number of training samples per class, yielding an adaptive weighted kNN classifier. Experiments on skewed text data sets show that LA and UA are good shrink factors, and that adaptive weighted kNN outperforms both traditional kNN and random resampling.
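The abstract does not state the revised decision function explicitly, so the following is only a minimal sketch of a class-weighted kNN vote: similarity votes for each class are scaled by a per-class shrink factor. The cosine-similarity vote and the `min(1, LA / class size)` weighting shown here are illustrative assumptions, not the thesis's actual CP/LA/UA formula.

```python
import numpy as np
from collections import defaultdict

def weighted_knn_predict(x, X_train, y_train, k, class_weight):
    """Cosine-similarity kNN vote in which each class's contribution is
    scaled by a per-class weight (a hypothetical stand-in for the
    CP/LA/UA-based shrink factors described in the abstract)."""
    sims = X_train @ x / (np.linalg.norm(X_train, axis=1) * np.linalg.norm(x) + 1e-12)
    nearest = np.argsort(-sims)[:k]            # indices of the k most similar training docs
    score = defaultdict(float)
    for i in nearest:
        score[y_train[i]] += sims[i] * class_weight[y_train[i]]
    return max(score, key=score.get)

# Hypothetical weighting: shrink the vote of classes larger than a reference
# size LA (standing in for the lower approximation of the critical point).
# counts = {c: int((y_train == c).sum()) for c in set(y_train)}
# class_weight = {c: min(1.0, LA / counts[c]) for c in counts}
```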
2. Selection of training samples
The choice of training samples is critical when building a classifier: atypical samples not only lengthen training but also tend to introduce noise into the training set. As an instance-based method, the kNN classifier has heavy computation and storage requirements, and an imbalanced distribution of training data further degrades its performance. To address these drawbacks, the MultiEdit and Condensing algorithms are first improved, and then a sampling method combining feature selection with Condensing is proposed. The method has two steps: first, several traditional feature selection methods generate a feature set for each class in the training data; second, redundant training instances are removed with a Condensing strategy guided by each document's own class features. Extensive experiments show that the method substantially shrinks the training set, reducing time and space costs while improving classification performance.
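Condensing here refers to Hart's condensed nearest-neighbour rule. The sketch below shows that baseline step on a dense feature matrix; the thesis's improvements and the per-class feature-selection step that guides it are not reproduced.

```python
import numpy as np

def condense(X, y):
    """Hart's Condensed Nearest Neighbour rule: keep only the training
    samples needed for 1-NN to classify the remaining ones correctly."""
    y = np.asarray(y)
    keep = [int(np.flatnonzero(y == c)[0]) for c in np.unique(y)]  # seed with one sample per class
    changed = True
    while changed:
        changed = False
        for i in range(len(y)):
            if i in keep:
                continue
            dists = np.linalg.norm(X[keep] - X[i], axis=1)   # distance to every kept sample
            if y[keep[int(np.argmin(dists))]] != y[i]:       # 1-NN on the kept set gets it wrong,
                keep.append(i)                               # so this sample must be retained
                changed = True
    return sorted(keep)
```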
3. Semi-supervised text classification
Traditional classifiers are trained only on labeled data, yet labeled instances are expensive and time-consuming to obtain, which creates the annotation bottleneck. Semi-supervised learning builds well-performing classifiers by combining a small amount of labeled data with a large amount of unlabeled data, thereby easing this bottleneck. Because it needs less manual effort while retaining high accuracy, it is significant both in theory and in practice. Building on a study of existing semi-supervised algorithms, and targeting the case where labeled data are so scarce that statistical estimates of labeling confidence cannot be used, this thesis proposes a two-stage co-training method based on kNN and SVM; experiments confirm that the method is effective.
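The exact two-stage procedure is not spelled out in the abstract; the sketch below shows only a generic co-training-style loop between a kNN learner and an SVM learner, using scikit-learn estimators and illustrative confidence measures (neighbour-vote fraction for kNN, decision-function margin for SVM). It assumes dense feature matrices.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

def co_train(X_lab, y_lab, X_unlab, rounds=10, per_round=20):
    """Generic co-training loop between kNN and SVM: each round, every
    learner labels the unlabeled pool and hands its most confident
    predictions to the other learner. A sketch of the idea only, not the
    thesis's exact two-stage procedure."""
    Xk, yk = X_lab.copy(), np.asarray(y_lab).copy()
    Xs, ys = X_lab.copy(), np.asarray(y_lab).copy()
    pool = np.arange(len(X_unlab))
    knn = svm = None
    for _ in range(rounds):
        knn = KNeighborsClassifier(n_neighbors=min(5, len(yk))).fit(Xk, yk)
        svm = LinearSVC().fit(Xs, ys)
        if len(pool) == 0:
            break
        # confidence: kNN -> fraction of neighbours voting for the winner,
        #             SVM -> magnitude of the decision function
        conf_k = knn.predict_proba(X_unlab[pool]).max(axis=1)
        dec = svm.decision_function(X_unlab[pool])
        conf_s = np.abs(dec) if dec.ndim == 1 else dec.max(axis=1)
        pick_k = pool[np.argsort(-conf_k)[:per_round]]   # kNN's most confident documents
        pick_s = pool[np.argsort(-conf_s)[:per_round]]   # SVM's most confident documents
        # each learner teaches the other with its confidently labeled documents
        Xs = np.vstack([Xs, X_unlab[pick_k]]); ys = np.append(ys, knn.predict(X_unlab[pick_k]))
        Xk = np.vstack([Xk, X_unlab[pick_s]]); yk = np.append(yk, svm.predict(X_unlab[pick_s]))
        pool = np.setdiff1d(pool, np.concatenate([pick_k, pick_s]))
    return knn, svm
```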
As an application of text classification and clustering, we studied topic detection and tracking (TDT) on BBS forums. From a text-mining perspective, topic detection resembles text clustering, while topic tracking resembles multi-class text classification. The goal of TDT is to find, organize, and exploit multilingual information from multiple news media by topic. Such technology is urgently needed in practice, for example to automatically monitor information sources such as radio and television and detect unexpected events, new events, and new information about known events; this has broad uses in information security and securities market analysis. It can also gather all reports on a topic a user is interested in and trace how that topic evolves. Based on a study of TDT algorithms and the characteristics of BBS content, we built a BBS-oriented topic detection and tracking system.
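To make the analogy between topic detection and clustering concrete, the sketch below shows single-pass incremental clustering, a classic TDT baseline in which each incoming document either joins the most similar existing topic or starts a new one. It is an illustration of the general approach, not the BBS system described in the thesis.

```python
import numpy as np

def detect_topics(doc_vectors, threshold=0.3):
    """Single-pass incremental clustering baseline for topic detection."""
    centroids, labels = [], []
    for v in doc_vectors:
        v = v / (np.linalg.norm(v) + 1e-12)          # unit-length document vector
        sims = [float(c @ v) for c in centroids]     # cosine similarity to each topic centroid
        if sims and max(sims) >= threshold:
            t = int(np.argmax(sims))
            c = centroids[t] + v                     # fold the document into the topic centroid
            centroids[t] = c / (np.linalg.norm(c) + 1e-12)
        else:
            t = len(centroids)
            centroids.append(v)                      # first story of a new topic
        labels.append(t)
    return labels
```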
Building on the above work, we developed a prototype system for text content security management.
The Internet is awash with all kinds of information, some of which, such as messages spread by terrorist organizations, directly threatens national security and stability. Traditional techniques that block information according to IP address or topic are out of date; the current approach is to monitor the content of the information itself.
Because text is the main form in which information is represented, most monitoring techniques depend on understanding text, with text classification and clustering as the key technologies. The explosive growth of text information poses new challenges and requires that text understanding become faster, more efficient, and more accurate.
In this thesis, three challenges in text categorization are explored: class imbalance, feature selection, and the annotation bottleneck. To improve the speed and accuracy of classification, several methods and techniques are presented. In addition, topic detection and tracking, an important application of text classification and clustering, is discussed. Our main contributions are:
1. A strategy for handling class imbalance in kNN classification
Class imbalance is one of the problems that plague the data mining community. The performance of kNN, an algorithm widely used in text categorization, deteriorates when the distribution of training data is skewed across classes. When used in a text content security project, kNN classified almost all test samples of the minority classes into the majority classes. To overcome this defect, the critical point (CP) of the training set is proposed, and the traditional kNN decision function is revised using LA or UA, the lower and upper approximations of CP. The result is an adaptive kNN with weight adjustment. Experiments on skewed data sets show that this adaptive weighted kNN outperforms traditional kNN and random resampling.
2. Selection of training samples
The selection of training samples is vital for building a classifier. Atypical samples not only increase training time but also introduce noise into the training set. As an instance-based algorithm, the kNN classifier has heavy computational and storage requirements, and an imbalanced distribution of training data further degrades its performance. To deal with these defects, the MultiEdit and Condensing algorithms are first modified, and then a sampling method based on feature selection and Condensing is proposed. First, several traditional feature selection methods are combined to form a feature set for each class. Second, redundant instances are removed by combining each document's class features with the Condensing algorithm. Extensive experiments show that the size of the training set decreases sharply, which reduces time and space costs and improves classification quality.
3. Semi-supervised text categorization
Traditional classifiers are trained only on labeled data, but labeling data is expensive and time-consuming: it is tedious and requires experienced annotators with plenty of time and, sometimes, special equipment. This is the so-called annotation bottleneck. Unlabeled data, by contrast, are easy to obtain and can be used in diverse ways. Semi-supervised learning builds good classifiers from labeled data together with large amounts of unlabeled data, thereby easing the annotation bottleneck. Because semi-supervised learning needs less manual work, it is important both in theory and in practice. After examining existing semi-supervised learning algorithms, a two-phase co-training method based on kNN and SVM is proposed; experiments show that the method is effective.
We also discuss a practical application of text classification and clustering technology: topic detection and tracking oriented to BBS forums.
From the point of view of text mining, topic detection is similar to text clustering, and topic tracking is similar to text categorization. Topic detection and tracking (TDT) aims to organize and exploit multilingual news from various news agencies according to topic. This technology is needed in applications such as automatically monitoring information sources (for instance, radio and TV) and recognizing unexpected events, new events, and new information about existing events; it can be widely used in information security and in securities market analysis. In addition, TDT can be used to collect all reports on a topic a user is interested in and to trace the evolution of a specific topic. On the basis of a survey of TDT, we developed a TDT system oriented to BBS forums.
We apply the above results in a prototype system for text content security.
