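Research on Several Key Techniques in Text Mining

English title: The Key Techniques Research on Text Mining
Author: Chen Xiaoyun (陈晓云)
Degree level: Doctoral
Discipline: Computer Software and Theory
Chinese keywords: text mining; feature selection; association analysis; text association classification; rule weighting; sample weighting
English keywords: Text Mining; Feature Selection; Text Association Analysis; Text Association Categorization; Rule Intensity; Boosting Technique
Degree year: 2005
Advisor: Hu Yunfa (胡运发)
Discipline code: 081202
Degree-granting institution: Fudan University
Thesis submission date: 2005-04-18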

Abstract

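Faced with a vast sea of electronic information, how to help people effectively collect and select the information that interests them, and how to help users discover potentially useful knowledge in an ever-growing body of information, has become a hot topic in information technology. Data mining is the research field that arose to address this problem. Since it emerged in the 1990s, data mining has been studied in considerable depth, covering association analysis, classification, clustering, trend analysis, and other areas. Because the vast majority of real-world information resources exist as unstructured data, whereas data mining generally targets structured data such as the contents of relational databases, mining unstructured information has become a further research topic following data mining.
     Among common forms of unstructured data such as text, images, and video, text is the most widely used. It appears in digital libraries, product catalogs, newsgroups, medical reports, and organizational and personal home pages. Text mining techniques are widely applied in natural language understanding, automatic text summarization, information extraction, information filtering, information retrieval, and related fields, and therefore have higher commercial value than data mining.
     This thesis takes text data as its object of study and investigates several key techniques of text mining, mainly text feature extraction and feature selection, text association analysis, and text association classification, and proposes more effective text mining algorithms. The research work and contributions include the following: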
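     (1) A document-frequency feature evaluation function with a minimum term-frequency threshold is used to reduce the proportion of noise features and improve the quality of text classification.
     At present, text feature selection generally relies on feature evaluation functions, which differ according to whether they use term frequency or document frequency. Observing that noise features generally have low term frequency, we propose a document-frequency method with a minimum term-frequency threshold for feature selection. Experiments applying this method to three feature evaluation functions (mutual information, information gain, and the χ² statistic) show that the minimum term-frequency threshold effectively reduces the proportion of noise features in the feature set, and that as the threshold increases, the feature sets produced by the different evaluation functions tend to converge.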
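The abstract does not spell out how the minimum term-frequency threshold enters the document-frequency counts, so the following is only a minimal sketch under one plausible reading: a document counts toward a term's document frequency only if the term occurs at least `min_tf` times in that document. The function names, the `min_tf` and `k` parameters, and the toy corpus are all hypothetical, and only the χ² statistic is shown.

```python
from collections import Counter


def thresholded_doc_freq(docs, labels, min_tf):
    """Per term, count documents of each class in which the term
    occurs at least min_tf times (assumed reading of the threshold)."""
    df = {}  # term -> Counter mapping class label -> document count
    for tokens, label in zip(docs, labels):
        for term, tf in Counter(tokens).items():
            if tf >= min_tf:
                df.setdefault(term, Counter())[label] += 1
    return df


def chi_square(df_term, class_sizes, target):
    """Chi-square statistic of one term for one class, computed from
    the 2x2 table of (term present / absent) x (in class / not in class)."""
    n = sum(class_sizes.values())
    a = df_term.get(target, 0)       # in class, term present
    b = sum(df_term.values()) - a    # other classes, term present
    c = class_sizes[target] - a      # in class, term absent
    d = n - a - b - c                # other classes, term absent
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def select_features(docs, labels, min_tf, k):
    """Return the k terms with the highest maximum chi-square over classes."""
    df = thresholded_doc_freq(docs, labels, min_tf)
    class_sizes = Counter(labels)
    scores = {
        term: max(chi_square(counts, class_sizes, y) for y in class_sizes)
        for term, counts in df.items()
    }
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


if __name__ == "__main__":
    # Hypothetical toy corpus: each document is a list of tokens.
    docs = [
        ["ball", "ball", "goal", "the"],
        ["ball", "ball", "team", "the"],
        ["vote", "vote", "law", "the", "ball"],
        ["vote", "vote", "tax", "the"],
    ]
    labels = ["sport", "sport", "politics", "politics"]
    print(select_features(docs, labels, min_tf=2, k=2))
```

In this sketch, raising `min_tf` removes terms that appear only once per document from the counts before any evaluation function is applied, which is one way low-frequency noise terms could be kept out of the selected feature set.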
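     (2) To address the difficulty of determining a minimum support threshold in text association analysis, an algorithm for mining the N most frequent itemsets is proposed.
     Frequent itemset mining is an important step in text association analysis, but users consistently find it difficult to define a suitable minimum support threshold. This thesis proposes an algorithm for mining the N most frequent itemsets based on a strategy of dynamically adjusting the minimum support threshold; the algorithm controls the size of the result by specifying the number N of frequent itemsets to be produced. During mining, the minimum support threshold is repeatedly raised according to the results found so far, which reduces the search space and improves mining performance. Based on this strategy, two algorithms are proposed for mining the top N frequent itemsets: an Apriori-like algorithm and IntvMatrix, which is based on an inverted matrix.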
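The dynamic-threshold idea can be illustrated with a level-wise, Apriori-style search. The abstract gives no details of the thesis's Apriori-like algorithm or of IntvMatrix, so this is a hedged sketch rather than a reconstruction of either: it assumes a min-heap holding the N best supports seen so far, with the support of the N-th best serving as the current minimum support. The function name and the tie-breaking behavior are assumptions of this sketch.

```python
import heapq
from itertools import combinations


def top_n_frequent_itemsets(transactions, n):
    """Return up to n itemsets with the highest support as
    (sorted item tuple, support) pairs, raising the minimum
    support threshold as better itemsets are found."""
    transactions = [frozenset(t) for t in transactions]
    top = []         # min-heap of (support, sorted item tuple), size <= n
    min_support = 1  # raised once n itemsets are known

    def offer(itemset, support):
        nonlocal min_support
        entry = (support, tuple(sorted(itemset)))
        if len(top) < n:
            heapq.heappush(top, entry)
        elif support > top[0][0]:
            heapq.heapreplace(top, entry)
        if len(top) == n:
            # The n-th best support becomes the new threshold.
            min_support = max(min_support, top[0][0])

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    for itemset, support in counts.items():
        offer(itemset, support)
    frequent = {s: c for s, c in counts.items() if c >= min_support}

    k = 1
    while frequent:
        k += 1
        # Join step with the Apriori prune: every (k-1)-subset must
        # have survived the current threshold.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in frequent
                for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        for itemset, support in counts.items():
            if support >= min_support:
                offer(itemset, support)
        # Filter with the (possibly raised) threshold to shrink the next level.
        frequent = {s: c for s, c in counts.items() if c >= min_support}

    ranked = sorted(top, key=lambda e: (-e[0], e[1]))
    return [(items, support) for support, items in ranked]
```

Because support can only decrease when an itemset grows, any itemset whose support falls below the current N-th best can be discarded along with all of its supersets, which is why raising the threshold during the search shrinks the candidate space. In this sketch, ties at the N-th position are broken arbitrarily.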
With the rapid development and spread of the Internet, the volume of electronic information has grown enormously. How to collect and find the information that interests users, and how to discover latent, useful knowledge quickly, accurately, and completely, has become a hot topic in information science and technology. Data mining (DM) is a research field that emerged to address this problem. Since the concept was introduced in the 1990s, research on DM has gone quite deep, covering association analysis, classification, clustering, trend analysis, and so on. DM mainly targets structured data such as relational databases, but in practice most information exists as unstructured data; some data indicate that unstructured data account for 80% of existing information sources. Mining unstructured information has therefore followed DM as a new challenge.
     Text is the most widely used form among common unstructured data such as text, images, and video. It is used in digital libraries, product catalogs, newsgroups, medical reports, and organizational or personal home pages, and text mining is applied broadly in natural language understanding, text summarization, information extraction, information filtering, information retrieval, and related fields. Its commercial value is therefore higher than that of DM.
     This thesis studies key techniques of text mining, including text feature extraction and feature selection, text association analysis, and text association classification. Several methods and techniques are presented with the aim of improving speed, precision, and stability. The primary contributions are as follows.
     (1) The thesis presents a document-frequency-based feature evaluation function with a minimum term-frequency threshold, which reduces the proportion of noise features and improves the quality of text categorization.
     At present, feature evaluation functions are the main means of selecting text features for text categorization. These evaluation functions differ in that some use term frequency and others use document frequency. This thesis presents a document-frequency-based feature evaluation function with a minimum term-frequency threshold. Experimental results show that mutual information, information gain, and the χ² statistic are each more effective with a minimum term-frequency threshold than with document frequency alone.
     (2) Mining the top N most frequent itemsets in a text collection.
     Frequent itemset mining is an important step in text association analysis, but it is very difficult to determine a suitable minimum support threshold. The thesis presents a strategy

