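Research on Text Retrieval Algorithm Based on Latent Semantic Analysis

Original title (Chinese): 基于潜在语义分析的文本检索算法研究
Author: Zhao Yahui (赵亚慧)
Degree level: Master's
Discipline: Computer Application Technology
Chinese keywords: 文本信息检索 ; 向量空间模型 ; 潜在语义索引 ; 遗传算法
English keywords: text information retrieval ; vector space model ; latent semantic indexing ; genetic algorithms
Degree year: 2009
Advisor: Cui Rongyi (崔荣一)
Discipline code: 081203
Degree-granting institution: Yanbian University (延边大学)
Submission date: 2009-05-01
Chair of the defense committee: Bai Baoxing (白宝兴)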

Abstract

The goal of text information retrieval is to identify and obtain the desired text from a large collection of texts. With the spread of the Internet, text retrieval has become an important means of using information resources effectively and of acquiring text information quickly and comprehensively. Demand for the technique keeps growing, and it is of great significance for study and scientific research.
     This thesis studies text retrieval strategies and algorithms that efficiently and accurately locate, within a text collection, the paragraphs that are semantically similar to a query text.
     The vector space model (VSM) is used as the basic text representation, the latent semantic indexing (LSI) model as the basis of semantic representation, and the genetic algorithm (GA) as the basis of the search algorithm. The main work is as follows:
     (1) The construction of the latent semantic space is analyzed. The term-text matrix is factored by singular value decomposition and, according to the distribution of its singular values, replaced by its best approximation in the least-squares sense. The projection matrix onto the latent semantic space is built from this approximation. Any text vector can be represented in the latent semantic space through this projection matrix, which on the one hand effectively removes the correlation between terms and on the other hand suppresses noise.
     (2) An effective method is proposed for deciding that a query text is unrelated to a large text. The query vector is decomposed into a latent-semantic-space component and a null-semantic-space component. When the latent-semantic-space component is smaller than a given threshold, the query is judged dissimilar to every paragraph of the large text, and the retrieval strategy can skip any further detailed matching.
     (3) A paragraph retrieval algorithm based on the genetic algorithm is designed. When the latent-semantic-space component of the query is large enough, all paragraphs (sub-documents) in that space are taken as matching candidates and compared with the query's latent-semantic-space component by cosine similarity. The genetic algorithm locates a near-optimal paragraph efficiently, and because retrieval is carried out in the latent semantic space, the located paragraph is semantically similar to the query.
     Experimental results show that, compared with traditional algorithms, the proposed retrieval strategy based on latent semantics and the proposed retrieval method based on the genetic algorithm achieve considerably higher precision, recall, and F-measure, and that the proposed algorithm is also more efficient than traditional text retrieval methods. The proposed strategy and method are therefore applicable to retrieval over large-volume text.
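
Steps (1) and (2) can be illustrated with a minimal sketch, assuming NumPy. The abstract does not state how the rank of the approximation is chosen or how the size of the latent-semantic-space component is measured, so the fixed rank `k`, the norm-ratio test, and the function names `build_latent_space`, `split_query`, and `is_irrelevant` are all assumptions made for illustration, not the thesis's implementation.

```python
import numpy as np


def build_latent_space(term_doc, k):
    """Return U_k, the top-k left singular vectors of the term-by-text matrix.

    The columns of U_k form an orthonormal basis of the rank-k latent
    semantic space (k is an assumed, caller-supplied rank).
    """
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k]


def split_query(U_k, q):
    """Split a query vector into latent-space and null-space components."""
    coords = U_k.T @ q        # k coordinates of q in the latent space
    latent = U_k @ coords     # orthogonal projection of q onto span(U_k)
    residual = q - latent     # part of q outside the latent space
    return coords, latent, residual


def is_irrelevant(U_k, q, threshold):
    """Assumed test: the query is unrelated when the share of its norm
    that lies in the latent space falls below `threshold`."""
    q_norm = np.linalg.norm(q)
    if q_norm == 0:
        return True
    coords, _, _ = split_query(U_k, q)
    return np.linalg.norm(coords) / q_norm < threshold


# Toy term-by-text matrix: 4 terms (rows), 3 texts (columns).
# The fourth term never occurs in the collection.
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0]])
U_k = build_latent_space(A, k=2)

in_vocab_query = np.array([1.0, 1.0, 0.0, 0.0])   # uses terms of the collection
off_vocab_query = np.array([0.0, 0.0, 0.0, 1.0])  # uses only the unseen term
print(is_irrelevant(U_k, in_vocab_query, threshold=0.5))
print(is_irrelevant(U_k, off_vocab_query, threshold=0.5))
```

In this toy case the second query lies entirely outside the latent space, so the gate rejects it before any paragraph-level matching is attempted.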

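Step (3) can be sketched in the same spirit. The abstract does not describe the chromosome encoding or the genetic operators, so everything below is an assumption for illustration: a candidate paragraph is encoded as a window `(start, length)` over consecutive text units already projected into the latent space, fitness is the cosine similarity between the summed window vector and the query, and the hypothetical function `ga_find_passage` uses tournament selection, one-gene crossover, small random mutation, and elitism.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity, defined as 0 when either vector is zero."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def ga_find_passage(units, query, max_len, pop_size=30, generations=40,
                    mutation_rate=0.2, seed=0):
    """Search for the window of consecutive units most similar to `query`.

    units: (n, k) array of latent-space vectors of consecutive text units.
    query: (k,) latent-space vector of the query text.
    Returns ((start, length), similarity) of the best window found.
    """
    rng = np.random.default_rng(seed)
    n = len(units)
    # Prefix sums give any window's summed vector in O(k) time.
    prefix = np.vstack([np.zeros(units.shape[1]), np.cumsum(units, axis=0)])

    def repair(start, length):
        # Clamp a candidate so the window stays inside the text.
        start = int(np.clip(start, 0, n - 1))
        length = int(np.clip(length, 1, min(max_len, n - start)))
        return start, length

    def fitness(ind):
        start, length = ind
        return cosine(prefix[start + length] - prefix[start], query)

    pop = [repair(rng.integers(0, n), rng.integers(1, max_len + 1))
           for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _gen in range(generations):
        children = [best]  # elitism: the best window always survives
        while len(children) < pop_size:
            parents = []
            for _pick in range(2):  # tournament selection of size 2
                i, j = rng.integers(0, pop_size, size=2)
                parents.append(max(pop[i], pop[j], key=fitness))
            # Crossover: start from one parent, length from the other.
            start, length = parents[0][0], parents[1][1]
            if rng.random() < mutation_rate:
                start += int(rng.integers(-2, 3))
                length += int(rng.integers(-1, 2))
            children.append(repair(start, length))
        pop = children
        best = max(pop, key=fitness)
    return best, fitness(best)


# Toy usage with random latent vectors (illustrative data only).
demo_rng = np.random.default_rng(1)
demo_units = demo_rng.random((12, 3))
demo_query = np.array([1.0, 0.0, 0.0])
(window_start, window_len), similarity = ga_find_passage(
    demo_units, demo_query, max_len=4)
print(window_start, window_len, round(similarity, 3))
```

A genetic search of this kind returns a near-optimal window rather than a guaranteed optimum; its appeal is that the number of fitness evaluations is fixed by `pop_size` and `generations` instead of growing with the number of candidate windows.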
