Research of Chinese-Text Retrieval Based on Latent Semantic Indexing

Chinese title: 基于潜在语义索引的中文文本检索研究
Author: Li Yuanyuan (李媛媛)
Thesis level: Master's
Discipline: Computer Application Technology
Keywords: Information Retrieval; Latent Semantic Indexing; Term Weighting; doc-doc retrieval
Degree year: 2008
Advisor: Ma Yongqiang (马永强)
Discipline code: 081203
Degree-granting institution: Southwest Jiaotong University (西南交通大学)
Thesis submission date: 2008-03-01

Abstract

Most information on the Internet is stored as text, and the explosive growth of text poses a major challenge for information retrieval: it is increasingly difficult to find relevant information on the web quickly and accurately. The most widely used approach, keyword-based string matching, compares only the surface form of words. Because natural language is full of synonymy (several words with one meaning) and polysemy (one word with several meanings), users often cannot express what they actually want to retrieve with a keyword or a string of keywords.
     The Latent Semantic Indexing (LSI) model addresses the synonymy and polysemy problems that keyword-based retrieval cannot handle. It is easy to compute and requires little human intervention. LSI builds a latent semantic space by truncated singular value decomposition and projects both terms and documents into that space. Deeper semantic relationships among terms can then be extracted, which exposes the semantic structure of natural language and improves retrieval performance.
     This thesis discusses how to use LSI and its characteristics to further improve the performance of Chinese text retrieval. First, the key techniques and mathematical foundations of LSI are examined in depth, and their application to Chinese text is illustrated and analyzed with examples. Second, term weighting, an important optimization step in LSI, is analyzed in detail; a new weighting scheme based on a "non-linear function" and a "location factor" is proposed, and its effect is verified by comparison. Third, because LSI makes the similarity between two documents easy to compute, a "doc-doc retrieval" function is proposed. It offsets the loss of retrieval precision caused by short or inaccurately entered queries and helps users retrieve more effectively. Finally, a "Chinese LSI Analysis System" was developed as an experimental platform. An experimental method is designed for each relatively independent stage of LSI, the results are presented visually, and all of the research in the thesis was verified on this system.
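
The pipeline summarized in the abstract (term weighting, truncated SVD, document-to-document similarity) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the form of the non-linear function or the location factor, so `np.log1p` and the `title_boost` parameter are stand-in assumptions, as are all function names. Documents are assumed to be already segmented into tokens, since Chinese word segmentation is outside the scope of this sketch.

```python
import numpy as np

def weighted_term_doc_matrix(docs, title_boost=2.0):
    """Build a weighted term-document matrix.

    docs: list of (title_tokens, body_tokens) pairs, already segmented.
    title_boost is a hypothetical location factor: a term occurrence in
    the title counts title_boost times as much as one in the body.
    """
    vocab = sorted({t for title, body in docs for t in title + body})
    index = {t: i for i, t in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, (title, body) in enumerate(docs):
        for t in title:
            A[index[t], j] += title_boost
        for t in body:
            A[index[t], j] += 1.0
    # Assumed non-linear damping: log(1 + location-weighted count).
    return vocab, np.log1p(A)

def lsi_doc_vectors(A, k):
    """Truncated SVD: keep the k largest singular triplets and return
    one k-dimensional row vector per document."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt[:k].T * s[:k]

def doc_doc_similarity(doc_vecs):
    """Cosine similarity between every pair of document vectors."""
    norms = np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    unit = doc_vecs / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T

def most_similar(sim, j):
    """Indices of the other documents, most similar to document j first."""
    order = np.argsort(-sim[j])
    return [int(i) for i in order if i != j]

# Toy corpus (illustrative only): documents 0 and 2 share no terms,
# but both overlap with document 1.
docs = [
    (["car"], ["engine"]),
    (["automobile"], ["car", "engine", "motor"]),
    (["automobile"], ["motor"]),
    (["recipe"], ["kitchen", "food"]),
    (["food"], ["recipe", "oven"]),
]

vocab, A = weighted_term_doc_matrix(docs)
sim = doc_doc_similarity(lsi_doc_vectors(A, k=2))
print(most_similar(sim, 0))
```

In this toy corpus, documents 0 and 2 have no term in common, so their cosine similarity on the raw weighted matrix is zero. After projection into the two-dimensional latent space they should come out as near neighbours, because both co-occur with the terms of document 1. That is the effect "doc-doc retrieval" relies on: a whole document serves as the query instead of a short keyword string.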
