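An Applied Study of Text Classification Based on Concept Space

English title: A Study on Concept-VSM and Its Application in Text Classification
Author: 黄海英 (Huang Haiying)
Thesis level: Master's
Discipline: Computer Software and Theory
Keywords: text classification; latent semantic indexing; vector space model
Degree year: 2002
Supervisor: 林士敏 (Lin Shimin)
Discipline code: 081202
Degree-granting institution: Guangxi Normal University (广西师范大学)
Thesis submission date: 2002-02-01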

Abstract

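With the rapid growth of textual information, and especially of online information on the Internet, text (web page) classification is becoming increasingly important. Text classification helps users read and process massive amounts of text selectively, goes a long way toward resolving the current disorder of online information, and makes it easier for users to locate and route the information they need accurately. Automatic text classification has therefore become a key technology of considerable practical value and a powerful means of organizing and managing data. Text classification methods fall into two groups: knowledge-based methods and statistics-based methods. A knowledge-based text classification system is applied to one specific domain and needs that domain's knowledge base as support; because of problems with knowledge extraction, updating, maintenance, and self-learning, its applicability is narrow. Statistics-based methods, by contrast, rely on purely mathematical operations, do not demand complex linguistic or domain knowledge, and have performed well in practice, so they have become the prevailing approach. Widely used statistical models include the vector space model, the Naive Bayes model, the instance-mapping model, and the support vector machine model. The vector space model (VSM), proposed by G. Salton and colleagues in the 1960s, reduces a document to a vector whose components are term weights and reduces classification to operations on vectors in that space, which greatly lowers the complexity of the problem. Moreover, the vector space model does not prescribe a single way to weight terms or to compute similarity; it only provides a theoretical framework in which different weighting functions and similarity measures can be used, which makes the model widely adaptable. However, the model generally represents documents by index words, and classification is carried out by matching characters and words between documents. This is shallow word matching rather than deep semantic matching, and it is therefore imprecise. Clearly, the synonymy and polysemy of characters and words hurt the recall and the precision of text classification, respectively.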
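The vector space model described above can be sketched in a few lines. The abstract stresses that the model fixes neither the weighting function nor the similarity measure, so the TF-IDF weights and cosine similarity used here are assumptions chosen for illustration, and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Map each tokenized document to a sparse {term: weight} vector."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # TF-IDF is one common weighting (an assumption, not the thesis's choice).
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors (0.0 if either is all zero)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy documents, already tokenized.
docs = [
    ["car", "engine", "repair"],
    ["automobile", "engine", "repair"],
    ["stock", "market", "price"],
]
vecs = tf_idf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # overlap only on "engine" and "repair"
sim_unrelated = cosine(vecs[0], vecs[2])  # no shared terms
```

Note that "car" and "automobile" contribute nothing to `sim_related` because they are different strings. This is the synonymy problem of pure word matching that the abstract points out.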
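    LSI (Latent Semantic Indexing) is an algebraic model for information retrieval proposed by S. T. Dumais and colleagues in 1988. Its basic idea is that the words in a text are related to one another, that is, there is a latent semantic structure. Statistical methods are therefore used to find this structure, and words and texts are then represented in terms of it, which removes the correlation between words and simplifies the text vectors. LSI performs retrieval with statistically derived concept indexes rather than with traditional index characters and words. LSI rests on the assertion that a document collection contains an implicit semantic structure of word usage, which is obscured in part by the variety of word meanings and forms in the documents. LSI computes the singular value decomposition (SVD) of the collection's term-document matrix and keeps the k largest singular values and their corresponding singular vectors to form a new matrix that approximates the original term-document matrix. Because the new matrix reduces the vagueness of the semantic relations between words and documents, it is better suited to information retrieval. Compared with traditional retrieval models, LSI has two advantages: the meaning of each dimension of the vector space changes substantially, so that it no longer reflects the simple frequency and distribution of words but a strengthened semantic relation; and replacing the original term and document vectors with low-dimensional ones makes it possible to handle large document collections efficiently.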
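The truncation step can be illustrated with NumPy's `numpy.linalg.svd`. This is a minimal sketch, not the thesis's implementation: the 5-by-5 term-document matrix, the raw-count weighting, and the choice k = 2 are all assumptions made for the example.

```python
import numpy as np

# Hypothetical term-document matrix (rows: terms, columns: documents).
terms = ["car", "automobile", "engine", "stock", "market"]
A = np.array([
    [1, 0, 0, 0, 0],   # car        appears in d0
    [0, 1, 0, 0, 0],   # automobile appears in d1
    [1, 1, 1, 0, 0],   # engine     appears in d0, d1, d2
    [0, 0, 0, 1, 1],   # stock      appears in d3, d4
    [0, 0, 0, 1, 1],   # market     appears in d3, d4
], dtype=float)

# Full SVD, then keep the k largest singular values and their vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Rank-k matrix that approximates the original term-document matrix.
A_k = U_k @ np.diag(s_k) @ Vt_k

# Low-dimensional coordinates of terms and documents in the concept space.
term_vecs = U_k * s_k     # one row per term
doc_vecs = Vt_k.T * s_k   # one row per document

def cosine(u, v):
    """Cosine similarity of two dense vectors (0.0 if either is all zero)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

sim_words_raw = cosine(A[0], A[1])                  # car vs automobile, word space
sim_words_lsi = cosine(term_vecs[0], term_vecs[1])  # car vs automobile, concept space
```

In this toy collection "car" and "automobile" never occur in the same document, so their raw row vectors are orthogonal. Both co-occur with "engine", and in the two-dimensional concept space they end up pointing in the same direction, which is the kind of synonym handling the paragraph above attributes to LSI.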
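    Building on LSI and inspired by [1][2], this thesis explores computational methods for text classification based on a concept space. Because text classification is a branch of computer information retrieval, the thesis first briefly introduces the meaning, history, and development trends of information retrieval and computer information retrieval; the basic theory, objects of study, and methods of computer information retrieval; and the key techniques of text classification. It then presents the ideas and theoretical basis of latent semantic indexing (LSI), illustrates them with figures and a small worked example, and explains the advantages of LSI. The main work of the thesis is to construct a concept space for text classification on the basis of the vector space model and LSI, and to propose methods for computing, in that concept space, the similarity between words, between documents, and between a document to be classified and a class. Concepts are acquired from a large training set, documents are converted into document vectors, and class reference vectors are constructed; finally, document vectors are matched against the class reference vectors in the concept space to complete the classification. The thesis also discusses classification learning problems that remain to be explored in the concept space. Experiments confirm that concept-space-based text classification achieves good results.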
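The pipeline in this paragraph (acquire concepts from training data, turn documents into vectors, build class reference vectors, match) might look roughly like the sketch below. The abstract does not give the thesis's actual formulas, so every concrete choice here is an assumption: the toy matrix and labels, k = 2, the standard LSI fold-in projection, class reference vectors taken as the mean of each class's training vectors, and cosine similarity for matching.

```python
import numpy as np

# Hypothetical labeled training collection (rows: terms, columns: documents).
terms = ["car", "automobile", "engine", "stock", "market"]
A = np.array([
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
], dtype=float)
labels = np.array(["auto", "auto", "auto", "finance", "finance"])

# Concept acquisition: truncated SVD of the training matrix.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k = U[:, :k], s[:k]

def to_concept_space(term_counts):
    """Fold a term-count vector into the k-dimensional concept space."""
    return (term_counts @ U_k) / s_k

# Document vectors for the training set, one row per document.
train_vecs = np.array([to_concept_space(A[:, j]) for j in range(A.shape[1])])

# Class reference vectors: mean of each class's document vectors (assumption).
class_refs = {c: train_vecs[labels == c].mean(axis=0) for c in sorted(set(labels))}

def cosine(u, v):
    """Cosine similarity of two dense vectors (0.0 if either is all zero)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def classify(tokens):
    """Assign the class whose reference vector is most similar to the document."""
    counts = np.array([tokens.count(t) for t in terms], dtype=float)
    vec = to_concept_space(counts)
    return max(class_refs, key=lambda c: cosine(vec, class_refs[c]))

label_a = classify(["automobile"])
label_b = classify(["stock", "market"])
```

A document that contains only "automobile" shares no word with the training document that contains "car", yet in the concept space it still lands next to the "auto" reference vector.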
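    Because synonymy and polysemy are pervasive in language, text classification based on word matching is inherently deficient. The concept-space-based method proposed in this thesis replaces the original document vector space, which is built on independent index words, with a smaller and more robust statistically derived concept space, and it shows a clear performance advantage. It is hoped that more systematic study of the computational methods for concept-space-based text classification will lead to a classification method that is both rigorously grounded in theory and workable in practice.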
As the volume of information available on the Internet and corporate intranets continues to increase, there is growing interest in helping people find, filter, and manage these resources. Text classification, the assignment of natural language texts to one or more predefined categories based on their content, is an important component of many information organization and management tasks. Its most widespread application has been assigning subject categories to documents to support text retrieval, routing, and filtering. In many contexts, trained professionals are employed to categorize new items. This process is time-consuming and costly, which limits its applicability. Rule-based approaches similar to those used in expert systems are common, but they generally require manual construction of the rules, make rigid binary decisions about category membership, and are typically difficult to modify. Another strategy is to use statistical analysis to construct classifiers automatically from labeled training data. The resulting classifiers have many advantages: they are easy to construct and update, they depend only on information that is easy for people to provide, they can be customized to specific categories of interest to individuals, and they allow users to trade off precision and recall smoothly depending on their tasks. A growing number of statistical classification methods have been applied to text categorization, including the vector space model, the Naive Bayes model, and the support vector machine model. The vector space model (VSM) was proposed by G. Salton in the 1960s. In this model, each document is represented as a vector of words, as is typically done in the popular vector representation for information retrieval. However, text classification is essentially semantic categorization, whereas the VSM represents the contents of documents and queries with a set of index terms, which can lead to poor classification performance.
    Latent semantic indexing (LSI), presented by S. T. Dumais in 1988, is an algebraic model that has achieved good results in information retrieval. It maps document and query vectors into a lower-dimensional space by singular value decomposition, so that the inherent vagueness of a retrieval process based on keyword sets is considerably reduced and the semantic associations among documents are highlighted. LSI is useful for finding relations between terms where human effort does not bring good results. Synonymy can thus be handled, and polysemy can be handled in part.
    Guided by LSI and VSM theory and building on [1][2], this thesis investigates text classification based on concept-VSM. First, it gives a brief introduction to the concepts of information, information retrieval, and computer information retrieval, and to their development. Second, it discusses the types of information retrieval models and the approach and basic content of attribute theory. Third, it introduces the fundamental principles of LSI and uses an illustration and an example to explain its advantages. The focus of the work is on building a concept space based on VSM and LSI, presenting methods for calculating word similarity and text similarity in the concept space, acquiring concepts from a large training set, converting texts to text vectors, and constructing the basis vector. Finally, the thesis discusses future work on the classification learning problem in the concept space. Theoretical analysis and experimental results show that classification based on concept-VSM can improve categorization performance significantly and achieves high classification precision and recall on average.
    Because of synonymy and polysemy, text classification based on words is inherently deficient. This thesis presents a text classification method based on concept-VSM that uses a small but more robust concept space instead of the text vector space ba

References

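    [1] 陈家辉. "A 'Spam' E-mail Filtering Method Based on Latent Semantic Indexing." Application Research of Computers, 2000, No. 10.
    [2] 周水庚, 关佶红, 胡运发. "Latent Semantic Indexing and Its Application in Chinese Text Processing." Mini-Micro Systems, 2001, No. 2.
    [3] 林鸿飞. "An Example-Based Mechanism for Classifying Text Titles." Journal of Computer Research and Development, 2001, No. 9.
    [4] 李晓黎, 刘继敏, 史忠植. "Concept Inference Networks and Their Application in Text Classification." Journal of Computer Research and Development, 2000, No. 9.
    [5] 范焱, 陈恩红, 王清毅, 蔡庆生, 刘洁. "A Performance Study of a Hypertext Coordinated Classifier." Journal of Computer Research and Development, 2000, No. 9.
    [6] 朱华宇, 孙正兴, 张福炎. "A Vector Space Model Based System for Automatic Classification of Chinese Text." Computer Engineering, 2001, No. 2.
    [7] 庞剑锋, 卜东波, 白硕. "Research and Implementation of an Automatic Text Classification System Based on the Vector Space Model." Application Research of Computers, 2001, No. 9.
    [8] 赖茂生, 王延飞, 赵丹群 (eds.). Computer Information Retrieval. Peking University Press, 1993.
    [9] 倪国熙. Commonly Used Matrix Theory and Methods. Shanghai Scientific and Technical Publishers, 1984.
    [10] 赵有云 (ed.). Retrieval and Use of Scientific and Technical Literature. University of Science and Technology of China Press, 1999.
    [11] 王继成, 萧嵘, 孙正兴, 张福炎. "Advances in Web Information Retrieval Research." Journal of Computer Research and Development, 2001, No. 2.
    [12] 赖茂生, 徐克敏. Scientific and Technical Literature Retrieval. Peking University Press, 1985.
    [13] 雷鸣, 刘建国, 王建勇, 陈葆珏. "A Dictionary-Based Dynamic Update Model for Search Engine Systems." Journal of Computer Research and Development, 2000, No. 10.
    [14] 吴应良, 韦岗, 金连文, 李海洲. "A Fuzzy Vector Correlation Model for Information Retrieval." Computer Engineering and Applications, 2000, No. 11.
    [15] 崔伟东, 李呈. "Design and Implementation of a Classified Query System for Chinese Web Pages." Computer Engineering and Applications, 2000, No. 11.
    [16] 刘芳, 卢正鼎. "Retrieving HTML Documents Effectively." Mini-Micro Systems, 2000, No. 9.
    [17] 战学刚, 林鸿飞, 姚天顺. "The Infolite Chinese Retrieval System." Mini-Micro Systems, 2000, No. 9.
    [18] 吴立德, 罗航哉, 薛向阳. "Fast Similarity Retrieval Based on Multiple Inverted Files." Chinese Journal of Computers, 2000, No. 11.
    [19] 李培峰, 杨季文, 吕强, 朱巧明. "Implementation of an Internet-Based Chinese Search Engine Model." Microcomputer Applications, 2000, No. 6.
    [20] 李晓黎, 史忠植. "Acquiring Chinese Part-of-Speech Tagging Rules with Data Mining Methods." Journal of Computer Research and Development, 2000, No. 12.
    [21] 邹海山, 吴勇, 吴月珠, 陈阵. "Chinese Information Processing Techniques in Chinese Search Engines." Application Research of Computers, 2000, No. 12.
    [22] 王奇, 宋国新, 邵志清. "Link-Based Web Page Ranking Algorithms in Information Retrieval." Journal of East China University of Science and Technology, 2000, No. 5.
    [23] 丁永生, 周斌, 杨文春. "A Fuzzy Retrieval Model for HTML Documents." Computer Engineering and Applications, 2001, No. 3.
    [24] 林鸿飞. "A Text Filtering Model Based on Hybrid Patterns." Journal of Computer Research and Development, 2001, No. 9.
    [25] 周水庚, 关佶红, 胡运发, 周傲英. "A Chinese Document Classification System That Requires Neither Dictionary Support nor Word Segmentation." Journal of Computer Research and Development, 2001, No. 7.
    [26] 钟涛, 陈新明, 万钧, 张世易. "Design and Implementation of a Web Search Engine for Chinese Text." Computer Engineering and Applications, 2001, No. 17.
    [27] 刘汝杰, 袁保宗, 唐晓芳. "A New Clustering-Based Algorithm for Fusing Multiple Classifiers." Journal of Computer Research and Development, 2001, No. 10.
    [28] 王云. Computer Network Retrieval Methods. National Defense Industry Press, 1999.
    [29] 严怡民 (chief ed.). Theory and Practice of Information Systems. Wuhan University Press, 1999.
    [30] 李广原. "The Extended Boolean Retrieval Model: The Salton Model." Journal of Guangxi Academy of Sciences, 2000, No. 11.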
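    [31] 王崇德 (ed.). An Introduction to Information Science. Tianjin University Press, 1994.
    [32] 康耀红. Modern Information Retrieval Theory. Scientific and Technical Documentation Press, 1990.
    [33] 张进. "Information Retrieval Systems: From Boolean Logic to Vector Space." Journal of Library Science in China, 1997, No. 6.
    [34] 康耀红, 任志纯. "Current State and Development Trends of Research on Modern Information Retrieval Theory." Journal of the China Society for Scientific and Technical Information, August 1990.
    [35] 贾同兴 (ed.). Artificial Intelligence and Information Retrieval. Beijing Library Press, 1997.
    [36] 黎难秋, 徐萍. "Philosophical Reflections on Retrieval." Library and Information Service, 1998, No. 8.
    [37] 杨谱春. "On the Principle of Similarity Matching in Information Retrieval." Information and Documentation Services, 1998, No. 2.
    [38] 邹涛. "Retrieval Models." China Computerworld, special feature, April 19, 1999.
    [39] 李正吾. "On Vector Methods in Information Retrieval." Journal of the China Society for Scientific and Technical Information, December 1992.
    [40] 顾耀芳. "A Survey of Full-Text Retrieval Systems." New Technology of Library and Information Service, 1992, No. 1.
    [41] 贾同兴. "A Review of the Past Decade of Foreign Research on Intelligent Information Retrieval." Journal of the China Society for Scientific and Technical Information, December 1995.
    [42] 徐进洪, 邵品洪, 李明霞. "Mathematical Models of Information Retrieval and Some Technical Advances." New Technology of Library and Information Service, 1990, No. 3.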
    [43] C. H. Papadimitriou, et al. Latent Semantic Indexing: A Probabilistic Analysis [C]. In Proceedings of PODS'98, Seattle, WA, 1998.
    [44] G. Golub and C. Van Loan. Matrix Computations [M]. Johns Hopkins, Baltimore, Maryland, second edition, 1989.
    [45] Susan Dumais, John Platt, David Heckerman, Mehran Sahami. "Inductive Learning Algorithms and Representations for Text Categorization."
    [46] Stanley Loh, Leandro Krug Wives, Jose Palazzo M. de Oliveira. "Concept-Based Knowledge Discovery in Texts Extracted from the Web."
    [47] David Landau, Ronen Feldman, Yonatan Aumann, Moshe Fresko, Yehuda Lindell, Orly Lipshtat. "TextVis: An Integrated Visual Environment for Text Mining." PKDD'98, Principles of Data Mining and Knowledge Discovery.
    [48] Sofus A. Macskassy, Arunava Banerjee, Brian D. Davison, Haym Hirsh. "Human Performance on Clustering Web Pages: A Preliminary Study."
    [49] Thorsten Joachims, Tom Mitchell, Dayne Freitag, and Robert Armstrong. "WebWatcher: Machine Learning and Hypertext."
    [50] S. K. M. Wong, W. Ziarko, and P. C. N. Wong. "Generalized Vector Space Model in Information Retrieval." In ACM SIGIR Conference on Research and Development in Information Retrieval.
    [51] P. Sheridan and J. P. Ballerini. "Experiments in Multilingual Information Retrieval Using the SPIDER System." SIGIR'96, 58-65, 1996.
    [52] M. Agosti. Hypertext and Information Retrieval. IP&M, 1993, 29(3), 283-285.
    [53] S. Deerwester, S. T. Dumais, et al. Indexing by Latent Semantic Analysis [J]. Journal of the American Society for Information Science, 1990, 41(6): 391-407.
    [54] S. T. Dumais. Latent Semantic Indexing (LSI) and TREC-2 [C]. In D. Harman, ed., The Second Text Retrieval Conference (TREC-2), National Institute of Standards and Technology Special Publication, 1994. 105-116.
    [55] P. Foltz. Using latent semantic indexing for information filtering [C]. In Proceedings of the ACM Conference on Office Information Systems (COIS), 1990. 40-47.
    [56] G. O'Brien. Information management tools for updating an SVD-encoded indexing scheme [D]. Master's thesis, University of Tennessee, Knoxville, Tennessee, December 1994.
    [57] P. Young. Cross-language information retrieval using latent semantic indexing [D]. Master's thesis, University of Tennessee, Knoxville, Tennessee, December 1994.
    [58] T. G. Kolda and D. P. O'Leary. Large latent semantic indexing via a semi-discrete matrix decomposition [R]. Technical Report No. UMCP-CSD CS-TR-3713, Department of Computer Science, Univ. of Maryland, November 1996.
    [59] S. T. Dumais. LSI meets TREC: A status report [C]. In D. Harman, ed., The First Text Retrieval Conference, National Institute of Standards and Technology Special Publication 500-207, NIST, Gaithersburg, MD, March 1993. 137-152.
    [60] S. T. Dumais. Using LSI for information filtering: TREC-3 experiments [C]. In D. Harman, ed., The Third Text Retrieval Conference (TREC-3), National Institute of Standards and Technology Special Publication, 1995.
    [61] Apte C, Damerau F, Weiss S. Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, 1994, 12(3): 233-251.
    [62] Yang Y. Expert network: Effective and efficient learning from human decisions in text categorization and retrieval. In: Proc of the 17th Int'l ACM SIGIR Conf on Research and Development in Information Retrieval. Dublin, 1994. 13-22.
    [63] Lewis D D, Schapire R E, Callan J P et al. Training algorithms for linear text classifiers. In: Proc of the 19th Int'l ACM SIGIR Conf on Research and Development in Information Retrieval. Zurich, 1996. 298-306.
    [64] Cohen W W, Singer Y. Context-sensitive learning methods for text categorization. In: Proc of the 19th Int'l ACM SIGIR Conf on Research and Development in Information Retrieval. Zurich, 1996. 307-315.
