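Research on Intelligent Search Technology Based on Latent Semantic Analysis (基于潜在语义分析的智能搜索技术研究)

English title: The Intelligent Search Technology Based on Latent Semantic Analysis
Author: Wang Yang (王洋)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords: Search Engine; Latent Semantic Analysis; Singular Value Decomposition; Query Expansion
Degree year: 2010
Advisor: Yin Guisheng (印桂生)
Discipline code: 081202
Degree-granting institution: Harbin Engineering University (哈尔滨工程大学)
Thesis submission date: 2010-01-01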

Abstract

In recent years the Internet has grown rapidly into a vast, dynamic information service network that spans many kinds of information resources and sites around the world, giving users an extremely valuable source of information. Search engines offer a friendly retrieval interface, help people extract useful information from this mass of data, and greatly reduce the time users spend on queries.
     The vast majority of information on the Internet is stored as text, and the exponential growth of that text poses a major challenge for search engine technology: it is increasingly difficult to find relevant information on the web quickly and accurately. Because natural language contains uncertainties such as synonymy (several words with one meaning) and polysemy (one word with several meanings), the same concept can be expressed in many different ways. Traditional search engines based on keyword string matching compare only surface forms, not the full concepts those forms express, so users can rarely convey what they really want to find with keywords or keyword strings alone. Raising search engine technology from keyword matching to the semantic level, so that user queries are recognized and processed intelligently according to their meaning, has become an active research topic.
     Starting from intelligent search modeling and drawing on latent semantic analysis, this thesis studies document processing, query processing, and the final information matching step in search engines. On this basis, the weights in the latent semantic space are analyzed and improved from a probabilistic perspective so that they better reflect the semantic relations among documents and between documents and terms; user queries are semantically expanded to compensate for insufficient user input or mismatches with the index vocabulary; and a second-search strategy is proposed to adjust unsatisfactory results so that they come closer to what the user wants. Finally, an intelligent search system based on latent semantic analysis is designed and implemented. It verifies that the algorithms improve, to some extent, the search engine's understanding of semantics and achieve relatively high accuracy and precision.
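
The abstract names latent semantic analysis, singular value decomposition, and query expansion but does not give formulas. As a rough illustration only, the following sketch shows the textbook form of these steps in Python with NumPy: build a term-document matrix, truncate its SVD, fold a query into the latent space, and pick expansion terms by latent similarity. The toy corpus, the raw-count weighting, the rank `k = 2`, the `0.9` threshold, and all function names are assumptions made for this example; the thesis's probabilistic weighting and second-search strategy are not reproduced here.

```python
import numpy as np

# Hypothetical toy corpus; each document is already tokenized.
docs = [
    ["car", "engine", "repair"],
    ["automobile", "engine", "repair"],
    ["car", "automobile", "dealer"],
    ["fruit", "apple", "banana"],
    ["apple", "banana", "juice"],
]
vocab = sorted({t for d in docs for t in d})
index = {t: i for i, t in enumerate(vocab)}


def term_vector(tokens):
    """Raw term-count vector over the vocabulary (unknown tokens are ignored)."""
    v = np.zeros(len(vocab))
    for t in tokens:
        if t in index:
            v[index[t]] += 1.0
    return v


# Term-document matrix A: one row per term, one column per document.
A = np.column_stack([term_vector(d) for d in docs])

# Rank-k truncated SVD: A is approximated by U_k diag(s_k) V_k^T.
k = 2  # assumed number of latent dimensions for this toy example
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k = U[:, :k], s[:k]
doc_vecs = Vt[:k, :].T   # row j = document j in the latent space
term_vecs = U_k * s_k    # row i = term i in the latent space


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def search(tokens):
    """Fold the query into the latent space and score every document."""
    q = term_vector(tokens) @ U_k / s_k  # q_hat = q^T U_k S_k^{-1}
    return [cosine(q, d) for d in doc_vecs]


def expand_query(tokens, threshold=0.9):
    """Return vocabulary terms whose latent vectors lie close to the query terms."""
    centroid = term_vector(tokens) @ term_vecs
    return [t for t in vocab
            if t not in tokens
            and cosine(centroid, term_vecs[index[t]]) >= threshold]


scores = search(["car"])          # one similarity score per document
expanded = expand_query(["car"])  # candidate expansion terms
```

In this toy corpus the second document never contains "car", yet it scores about as high as the documents that do, because "car" and "automobile" share a latent dimension; the fruit documents score close to zero. Plain keyword matching would miss the second document entirely, which is the gap the abstract describes.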

