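Research on Personalized Vertical Search Engines (个性化垂直搜索引擎研究)

English title: Research on Individuated Vertical Search Engine
Author: Li Wenze (李文泽)
Degree level: Master's
Discipline: Applied Mathematics
Keywords: vertical search engine; ontology; Lucene; indexing; information extraction; MVC
Degree year: 2007
Advisor: Xu Bin (徐彬)
Discipline code: 070104
Degree-granting institution: Henan University (河南大学)
Submission date: 2007-05-01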

Abstract

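The major search engine providers on the Internet today, such as Yahoo, Baidu, and Google, all offer users horizontal search across massive volumes of information. Yet as the Internet continues to change and evolve, ordinary users find that locating the material they need is like looking for a needle in a haystack. Sheer volume of information is no longer the main driver of development; awareness and timeliness are. The key question for the Internet is no longer whether information can be delivered to users quickly and in bulk, but whether users can obtain the information they expect at the expected time and place, in the expected form, and at the expected cost. General-purpose search engines can satisfy horizontal search over large amounts of information, but they struggle to deliver both accuracy and relevance. Their value lies in navigating large amounts of information; they offer little guidance to industry users whose information needs are relatively concentrated and more finely classified. Solving this problem is an opportunity for the development of search and is becoming a focus of competing research institutions. Vertical search, a new mode of search, arose against this background.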
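     The research in this thesis falls into two parts. The first, through theoretical analysis, proposes ideas for improving the information collection algorithm of a vertical search engine. The second dissects the core technologies of vertical search engines and designs and implements a prototype system. The body of the thesis presents this work in five chapters.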
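     Chapter 1, the introduction, reviews the history of search engines in detail and identifies the problems that general-purpose search engines currently face and the way to address them, which is the subject of this thesis: the vertical search engine. A comparison with general-purpose search engines in terms of information services and key technologies shows the considerable advantages and room for growth of vertical search engines. The chapter closes by surveying the state of vertical search engines in China and abroad and stating the problems this thesis sets out to solve.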
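     Chapter 2, on overall architecture and information collection, presents the design of the overall architecture and workflow of a vertical search engine and analyzes its distinguishing characteristics. On collection strategy, it presents commonly used information collection models and analyzes the core idea and shortcomings of the algorithm in general use today, similarity matching based on the vector space model. Finally, after introducing ontologies, it proposes an approach to building an intelligent collection strategy based on an ontology knowledge base, to address polysemy (one word, several meanings) and synonymy (one meaning, several words) during collection.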
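     As a rough illustration of similarity matching under the vector space model, the sketch below scores a page against a topic description by the cosine of their term-frequency vectors. It is not the thesis's algorithm: the whitespace tokenizer, the raw term-frequency weights, and the sample texts are all assumptions made for the example.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a raw term-frequency vector from whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical topic description and page texts, for illustration only.
topic = term_vector("mobile phone handset price screen battery")
page_on_topic = term_vector("new handset with large screen and long battery life")
page_synonym = term_vector("cellphone review and cost comparison")

score_on_topic = cosine_similarity(topic, page_on_topic)
score_synonym = cosine_similarity(topic, page_synonym)
```

     The second page is about the same subject but shares no literal term with the topic, so it scores zero. This is the synonymy weakness of purely term-based matching that an ontology-backed strategy is meant to address.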
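     Chapter 3, on the Lucene framework, gives a detailed analysis of Lucene, which the thesis regards as the best open-source full-text retrieval framework currently available. It covers full-text retrieval technology, the origin of the Lucene project and the composition of the framework, and two techniques central to Lucene's indexing and search functions, the inverted index and the scoring mechanism, and it gives the core code for building an index and running a search. It ends with an introduction to Chinese word segmentation and to how segmentation is implemented in Lucene.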
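     The idea behind an inverted index can be sketched in a few lines: each term maps to the documents that contain it, so a query is answered by intersecting posting lists rather than scanning every document. The code below is a minimal sketch and does not use or reproduce Lucene's API or index format; the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_all_terms(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical product descriptions, for illustration only.
docs = {
    1: "slim handset with camera",
    2: "handset with large battery",
    3: "camera with zoom lens",
}
index = build_inverted_index(docs)
hits = search_all_terms(index, "handset camera")
```

     A production index such as Lucene's also stores term frequencies and positions so that matches can be scored and ranked; this sketch covers only boolean lookup.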
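     Chapter 4, on implementation, combines the open-source crawler Heritrix with the Lucene framework to design and build a prototype vertical search engine for mobile phone product information. The system is implemented in three parts. The first implements information collection on top of the Heritrix framework and designs a program for structured information extraction. The second designs a word segmentation tool for mobile phone product information and uses Lucene to index the structured text. The third designs a query interface based on the MVC architecture and implements the prototype's retrieval function. The chapter is intended as a useful reference for the technical implementation of vertical search engines.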
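     The abstract does not say which segmentation method the prototype uses. Forward maximum matching against a domain dictionary is one common approach to Chinese word segmentation, and the sketch below shows it purely as an illustration; the dictionary entries, the `max_len` default, and the sample string are assumptions.

```python
def forward_max_match(text, dictionary, max_len=8):
    """Segment text greedily, taking the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical domain dictionary of mobile phone terms, for illustration only.
phone_terms = {"诺基亚", "手机", "智能手机", "摄像头", "蓝牙"}
tokens = forward_max_match("诺基亚智能手机带蓝牙", phone_terms)
```

     Because the longest match wins, "智能手机" is kept as one token instead of being split at "手机", which is the kind of behavior a product-oriented segmenter would typically want for brand and feature names.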
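     Chapter 5, the summary and outlook, summarizes the work of this thesis and discusses development trends for vertical search engines and several directions for further research.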
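     A well-known saying in the search field holds that "users cannot describe what they are looking for until they see it." A technical expert at Microsoft Research has said that "general-purpose search engines cannot find 75% of content." As a branch of search engine technology, the vertical search engine is the inevitable result of a shift in what Internet users want from search: from comprehensive coverage alone to comprehensive coverage combined with higher accuracy and more timely information. By collecting or reorganizing the information models and user models of an industry domain in structured form, a vertical search engine can offer more numerous, more specialized, and more personalized industry services, and appears smarter and more user-friendly than traditional general-purpose search. The vertical search market therefore has both a reason to exist and broad prospects. Vertical search is nonetheless a technology in its infancy, with much left to improve and overcome, and this study of vertical search engine technology is intended to offer practical guidance for its development.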
At present, the main search engine providers on the Internet, such as Yahoo, Baidu, and Google, offer users horizontal search over large volumes of information. With the continuous updating and evolution of the Internet, an ordinary user trying to find the data they need is looking for a needle in a haystack. The sheer quantity of information is no longer the main force behind further development; awareness and timeliness are the real driving forces. The key problem for Internet development is not delivering information to users quickly and in bulk, but enabling users to obtain the expected information at the expected time and place, in the expected manner, and at the expected cost. General-purpose search engines can handle horizontal search over large amounts of information, but they find it very difficult to deliver both accuracy and relevance. Their value lies in navigating large amounts of information, and they give little direction to industry users whose information needs are relatively concentrated and finely classified. Solving this problem is an opportunity for search engine development and will be a focus of competitive study among research institutions. The vertical search engine, a new search mode, emerged against this background.
     This dissertation constructs a prototype vertical search engine through theoretical analysis and concrete design. The body presents the research in detail in five chapters.
     Chapter one, the introduction, describes the development history of search engines in detail and points out the problems that general-purpose search engines currently face and the route to solving them, which is the direction this dissertation studies: the vertical search engine. Through a comparative analysis with general-purpose search engines in information services and key technologies, it shows that vertical search engines have substantial advantages and room for development. Finally, it analyzes the state of vertical search engines at home and abroad and states the problems this dissertation sets out to solve.
     Chapter two covers overall architecture analysis and design. It presents the overall design and workflow of the vertical search engine and analyzes its characteristics. It then presents the information collection models commonly used in gathering strategies and analyzes the core idea and the deficiencies of the commonly used collection algorithm, similarity matching based on the vector space model. Finally, after introducing ontology, it proposes a way to implement an intelligent information gathering strategy based on an ontology repository, in order to resolve the problems of polysemy (one word with several meanings) and synonymy (one meaning expressed by several words) during information collection.
     Chapter three studies the Lucene framework, analyzing this classic open-source full-text retrieval framework in detail. It introduces full-text retrieval technology, the origin of the project, and the structure of the framework, as well as the inverted index technology and scoring mechanism that underpin the indexing and search functions Lucene provides, and it shows the core code for building an index and performing a search. Finally, it introduces Chinese word segmentation and how segmentation is implemented in Lucene.
     Chapter four describes how the open-source crawler Heritrix and the Lucene framework are combined to realize the personalized vertical search engine, and constructs a prototype vertical search engine for mobile phone product information. The system is implemented in three parts. Part one implements information gathering based on the Heritrix framework and designs the procedure for structured information extraction. Part two designs a word segmentation tool for mobile phone product information and uses the Lucene framework to index the structured text. Part three designs the query interface based on the MVC framework and implements the search function of the prototype system. The chapter thus provides a useful reference and guidance for the technical realization of vertical search engines.
     Chapter five, the summary and outlook, briefly summarizes the work of this dissertation and puts forward the development trends of vertical search engines and several directions for continued study.
     There is a well-known saying in the search field: "Users cannot describe what they want to find until they are shown it." A technologist at Microsoft Research has said: "Almost 75% of content cannot be found with general-purpose search engines." As a branch of search engine technology, the vertical search engine is the necessary result of a shift in Internet users' expectations, from simply wanting comprehensive content to wanting comprehensive content together with better accuracy and timeliness. It will provide related services that are not only more numerous but also more professional and personalized. Compared with traditional search, it is smarter. The vertical search engine market therefore has good reason to exist and broad prospects for development. As a technology still at an early stage, however, it has many areas that need improvement and breakthroughs, and this dissertation's study of vertical search engine technology is intended to provide practical guidance for the development of vertical search.

