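Text Feature Description Based on Semantic Concepts

English title: The Description of Text's Feature Based on Semanteme Concept
Author: Yu Gang (余刚)
Degree level: Master's
Discipline: Computer Application Technology
Chinese keywords: feature extraction; normalization; disambiguation; weight calculation; word co-occurrence frequency; sememe; similarity computation; matching
English keywords: feature description; standardization; word disambiguation; term weighting; word co-occurrence; sememe; similarity computing; matching
Degree year: 2005
Supervisor: Zhu Zhengyu (朱征宇)
Discipline code: 081203
Degree-granting institution: Chongqing University
Thesis submission date: 2005-04-01
Chair of the defense committee: He Zhongshi (何中市)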

Abstract

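Text feature description is foundational work for natural language processing, text classification, clustering, Chinese information retrieval, personalized services, and related research. It studies which methods and models should be used to represent the main idea of a document. The description must summarize the main content of the document well, and it must also be convenient for a computer to process. At present the vector-based approach, the vector space model (VSM), is widely used; it represents a document by a number of feature terms and their weights. In this model, two main factors affect the accuracy of the description: the selection of feature terms and the way their weights are computed. Most research concentrates on these two aspects, in the hope of capturing the main idea of the text and reflecting its implicit information. Selecting feature terms and computing weights with statistics and information theory has, to some extent, addressed the accuracy of VSM-based text description, but few of these methods involve or reveal the semantic information of the feature terms. This thesis addresses how the VSM can carry the semantic information of feature terms in the following two respects.
    (1) Because the linguistic context in which a word occurs strongly influences its actual meaning, we improve on the widely used TF-IDF weighting scheme and adopt a weighting scheme based on word co-occurrence frequency to represent the weights of a text. This scheme retains the statistical information of the TF-IDF formula and also reflects the influence of the specific context on word meaning.
    (2) For comparing the similarity of texts, we abandon purely mathematical formulas for vector similarity (for example, the Euclidean distance between vectors, the cosine of the angle between vectors, Bayesian algorithms, and the K-nearest-neighbor algorithm). Instead, we first compute the semantic similarity between the feature words of the two vectors, then compute the maximum-weight matching of the two vectors, and finally sum the similarities of the matched pairs, taking the weight of each feature word into account in the summation. The advantage of this computation is that it considers the semantic information of the feature words, and that no disambiguation or normalization is needed when the vector description of a text is obtained.
    Finally, we built a text classifier to compare our work on these two aspects with other approaches, and the experiments show that the proposed algorithms improve classification precision and recall to some extent. Although our research mainly targets personalized services, it applies equally to Chinese information retrieval and natural language processing, and it can be extended to other fields that involve language processing.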
Describing the features of a text is fundamental work for NLP, document categorization and clustering, Chinese information retrieval, personalized services, and so on. It concerns the methods and models used to represent the topic of a document. On the one hand, the feature description should summarize the content of the document; on the other hand, the model should be easy for a computer to process. Currently, the VSM is widely used. The VSM uses several feature words and their weights to represent a document. In this model, two factors affect the precision of the description: one is the choice of the feature words; the other is the method of computing their weights. Most scholars' research focuses on these two points, hoping to summarize the topics of documents and reflect their implicit information. Using statistics and information theory to choose the feature words and compute their weights has improved, to some extent, the precision with which the VSM describes a document, but few methods reflect the semantics of the feature terms. This thesis discusses how the VSM can reflect the semantic information of its terms, from the following two aspects:
    (I) Considering that the context has a great impact on the actual meaning of a word, we improve on the TF-IDF method, which is the most widely used way to compute term weights. Our method is based on word co-occurrence. It contains the information of TF-IDF and also reflects the impact of the specific context on word meaning.
    (II) For comparing the similarity of texts, we abandon purely mathematical methods (e.g. the Euclidean distance, the cosine of the angle between vectors, the Bayes algorithm, K-nearest neighbors, and so on). Instead, we first compute the semantic similarity between the terms of the two vectors and then compute the maximum-weight matching of the two vectors. Lastly, we compute the sum of the similarities of the matched pairs, taking the term weights into account. The advantages of our method are that it considers the semantics of the terms and avoids disambiguation and normalization.
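The matching-based comparison described in (II) can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so several parts here are assumptions made only for illustration: `TOY_SIM` is a hypothetical hand-written similarity table standing in for a sememe-based word similarity, each matched pair is weighted by the average of the two normalized term weights, and the function names are invented. The maximum-weight matching itself uses the standard Hungarian (Kuhn-Munkres) algorithm.

```python
from typing import Dict, List, Tuple

# Hypothetical toy similarity table (assumption). A real system would compute
# word similarity from a semantic resource, e.g. sememe-based similarity.
TOY_SIM = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("engine", "motor")): 0.8,
    frozenset(("car", "motor")): 0.3,
    frozenset(("engine", "automobile")): 0.4,
}


def word_similarity(a: str, b: str) -> float:
    """Semantic similarity of two words in [0, 1] (toy lookup)."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)


def max_weight_matching(sim: List[List[float]]) -> List[Tuple[int, int]]:
    """Maximum-weight bipartite matching via the Hungarian algorithm.

    Returns (row, col) pairs. Every element of the smaller side is matched,
    which is optimal here because similarities are non-negative.
    """
    if not sim or not sim[0]:
        return []
    transposed = len(sim) > len(sim[0])
    if transposed:  # the algorithm below needs rows <= columns
        sim = [list(col) for col in zip(*sim)]
    n, m = len(sim), len(sim[0])
    inf = float("inf")
    u, v = [0.0] * (n + 1), [0.0] * (m + 1)   # dual potentials
    p, way = [0] * (m + 1), [0] * (m + 1)     # p[j] = row matched to column j
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [inf] * (m + 1), [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    # cost = -similarity, so minimizing cost maximizes weight
                    cur = -sim[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # augment along the alternating path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    pairs = [(p[j] - 1, j - 1) for j in range(1, m + 1) if p[j]]
    return [(c, r) for r, c in pairs] if transposed else pairs


def document_similarity(doc_a: Dict[str, float], doc_b: Dict[str, float]) -> float:
    """Weighted sum of matched-pair similarities (weighting is an assumption).

    Documents map feature words to positive weights.
    """
    if not doc_a or not doc_b:
        return 0.0
    words_a, words_b = list(doc_a), list(doc_b)
    total_a, total_b = sum(doc_a.values()), sum(doc_b.values())
    sim = [[word_similarity(a, b) for b in words_b] for a in words_a]
    score = 0.0
    for i, j in max_weight_matching(sim):
        pair_weight = (doc_a[words_a[i]] / total_a + doc_b[words_b[j]] / total_b) / 2
        score += sim[i][j] * pair_weight
    return score
```

For example, `document_similarity({"car": 0.6, "engine": 0.4}, {"automobile": 0.5, "motor": 0.5})` pairs "car" with "automobile" and "engine" with "motor" and evaluates to about 0.855, although the two documents share no surface word. A cosine measure over the same two vectors would give 0.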
    Finally, we construct a classifier to compare our method with others. Our experiments show that the method improves precision and recall to some extent. Although our research targets personalized services, it can also be applied to Chinese information retrieval and NLP.
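Precision and recall, the two measures used in the evaluation, have standard per-category definitions for a classifier. The sketch below shows one way to compute them. The function name and the example labels are illustrative assumptions, not taken from the thesis experiments.

```python
from collections import Counter
from typing import Dict, List, Tuple


def precision_recall(gold: List[str], predicted: List[str]) -> Dict[str, Tuple[float, float]]:
    """Per-category (precision, recall) for parallel lists of labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    # Correct predictions, counted per category
    true_pos = Counter(g for g, p in zip(gold, predicted) if g == p)
    gold_count = Counter(gold)       # documents that belong to each category
    pred_count = Counter(predicted)  # documents assigned to each category
    result = {}
    for cat in set(gold) | set(predicted):
        precision = true_pos[cat] / pred_count[cat] if pred_count[cat] else 0.0
        recall = true_pos[cat] / gold_count[cat] if gold_count[cat] else 0.0
        result[cat] = (precision, recall)
    return result
```

With gold labels `["sports", "sports", "finance", "finance"]` and predictions `["sports", "finance", "finance", "finance"]`, the "sports" category has precision 1.0 and recall 0.5, while "finance" has precision 2/3 and recall 1.0.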

