Research on Inter-term Semantic Relations and Its Application to Text Categorization
Abstract
Automatic text categorization is one of the fundamental tasks in information retrieval. With the explosive growth of information on the Internet, it has become difficult for people to extract the information they need quickly and effectively from massive amounts of text. To address this "information disorientation" problem, research on text categorization has become increasingly important.
     This thesis designs and implements a modular, extensible automatic text categorization system and carries out a detailed, comprehensive study and analysis of each important stage of the classification process. On this basis, we propose combining a term semantic-relation mining model from natural language processing with the text categorization system, aiming to remedy the unreasonable basic assumption of the vector space model that terms are mutually independent. At the same time, by exploiting the deeper associations between terms in a text, we expect to represent richer document information in a smaller vector space and thereby improve text categorization performance.
     The semantic-relation mining model combines syntactic parsing from linguistics with statistical methods from information theory; through deep mining of text corpora it produces a network-style thesaurus of inter-term semantic relations. This thesaurus enriches the vector information of documents, making the vector representation more efficient and compact. Combining this model with a powerful SVM classifier significantly improves the system's classification results.
     In our experiments we compare this model with the standard bag-of-words model on the 20NG and Reuters test corpora. The results show that semantic-relation expansion clearly improves the precision and recall of text categorization. Moreover, it can effectively reduce the space and time complexity of computation while preserving classification quality, making the analysis of very large text corpora feasible. Finally, the author outlines future research directions for the semantic-relation mining model in information retrieval.
Text categorization is one of the basic tasks in information retrieval. With the explosive growth of web information, people have difficulty finding the information they need in this flood of text. To address this so-called "information confusion" problem, research on text categorization has become increasingly important.
     This paper designs and implements a module-based, scalable automated text categorization framework, together with a comprehensive survey of each important step in the framework. Based on this framework, we propose a method that integrates term semantic relationships into the classic text categorization task. This method addresses the inherent weakness of the Vector Space Model's basic assumption that terms are mutually independent. We also show that the deeper associations between terms can be used to improve our experimental results.
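The expansion idea can be sketched in a few lines. In the thesis the thesaurus is mined from parsed corpora; in this illustrative sketch the thesaurus entries and the damping factor `alpha` are made-up assumptions. Part of each observed term's weight is propagated to its semantically related terms:

```python
from collections import Counter

# Hypothetical term-relation thesaurus: term -> {related term: relatedness weight}.
# (Hand-written here purely for illustration; the thesis mines this from corpora.)
THESAURUS = {
    "car": {"automobile": 0.8, "vehicle": 0.6},
    "automobile": {"car": 0.8},
}

def expand_bow(tokens, alpha=0.5):
    """Build a bag-of-words vector, then add alpha-scaled weight
    for each term semantically related to an observed term."""
    vec = Counter(tokens)
    # Snapshot the observed terms so thesaurus additions don't re-trigger expansion.
    for term, count in list(vec.items()):
        for related, weight in THESAURUS.get(term, {}).items():
            vec[related] += alpha * weight * count
    return dict(vec)

print(expand_bow(["car", "engine"]))
```

A document mentioning only "car" thus also gains weight on "automobile" and "vehicle", so it can match documents that use those terms instead, which is exactly the independence assumption the model relaxes.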
     Term semantic relationships are obtained by combining sentence parsing from natural language processing with statistical methods from information theory. We represent these deep term relationships as a thesaurus, which makes the document vector more informative and effective. Combined with the classification power of SVM, this method yields high performance in text categorization.
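As a toy stand-in for the SVM stage (a simple perceptron, not a true max-margin classifier; the feature dicts below, with their thesaurus-added weights written in by hand, are illustrative), the enriched sparse vectors can be fed to a linear classifier:

```python
def perceptron_train(examples, epochs=10):
    """examples: list of (sparse feature dict, label in {+1, -1}); returns a weight dict."""
    w = {}
    for _ in range(epochs):
        for feats, y in examples:
            score = sum(w.get(f, 0.0) * v for f, v in feats.items())
            if y * score <= 0:  # misclassified (or on the boundary): update toward y
                for f, v in feats.items():
                    w[f] = w.get(f, 0.0) + y * v
    return w

# Sparse document vectors whose related terms got extra weight from a thesaurus.
train = [
    ({"car": 1.0, "automobile": 0.4}, +1),   # autos category
    ({"stock": 1.0, "share": 0.4}, -1),      # finance category
]
w = perceptron_train(train)

# A test document that overlaps the autos document only via the thesaurus-added term.
test_doc = {"automobile": 1.0}
score = sum(w.get(f, 0.0) * v for f, v in test_doc.items())
print("predicted:", "autos" if score > 0 else "finance")  # → predicted: autos
```

The point of the sketch: without the expansion the test document shares no term with the training data and cannot be classified; with it, the classifier still receives a usable signal.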
     We compare this technique with SVM-based categorization and other term relationship models on the 20NG and Reuters-21578 datasets using the simple bag-of-words (BOW) representation. The comparison shows that our method outperforms the other models in most cases.
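Precision and recall, the measures behind such comparisons, reduce for a single category to counting overlaps between the predicted and gold document sets (the document ids below are made up for illustration):

```python
def precision_recall(predicted, gold):
    """Per-category precision and recall from sets of document ids."""
    tp = len(predicted & gold)  # true positives: documents correctly assigned
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

p, r = precision_recall({1, 2, 3}, {2, 3, 4})
f1 = 2 * p * r / (p + r) if p + r else 0.0
print(p, r, f1)
```

Corpus-level figures are then obtained by micro-averaging (pooling the counts over categories) or macro-averaging (averaging the per-category scores).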
     Finally, we outline future research directions for using term semantic relationships in information retrieval.
