Ontology-based Semantic Search Engine for Deep Web

Original title (Chinese): 基于本体的Deep Web语义搜索引擎
Author: Tan Chunliang (谭春亮)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords: Semantic Web; Semantic Search; Deep Web; Ontology; Classification
Degree year: 2008
Supervisor: Jiang Yuncheng (蒋运承)
Discipline code: 081202
Degree-granting institution: Guangxi Normal University
Submission date: 2008-04-01

Abstract

With the rapid growth and spread of the WWW, it has become an enormous information repository, and searching it now suffers from "information overload" and users "getting lost" in information. Because the WWW is autonomous, open, heterogeneous, dynamic, and growing exponentially, both directory-based and full-text search engines have exposed fundamental shortcomings. They rely on keyword queries, retrieve only static pages, and support only "navigational" retrieval, which leads to exponential growth in index size and steadily declining recall and precision. Users now expect a new generation of search engines to raise recall and precision, to support retrieval at "knowledge granularity", and to search at the semantic level. Solving these problems at the root requires a new knowledge representation for the WWW. To this end Tim Berners-Lee, the inventor of the World Wide Web, proposed the Semantic Web, an architecture for the next-generation Web in which information has well-defined meaning, so that people and machines, and machines among themselves, can better share information and cooperate. The Semantic Web can address the problems of traditional search engines at the root. However, because the WWW is autonomous, adoption of the Semantic Web will take a long time, and because most Semantic Web research remains theoretical, a new-generation search engine is hard to realize. This thesis finds a meeting point between the new-generation search engine and the WWW: it applies the Semantic Web architecture to Deep Web search and proposes an ontology-based semantic search engine for the Deep Web. Such an engine addresses the shortcomings of traditional search engines, which search only static pages, cannot search semantically, and cannot offer retrieval at "knowledge granularity". The innovations of this thesis are as follows:
     1. The thesis performs semantic search over the Deep Web based on the Semantic Web architecture. This addresses the limitations of traditional search engines, which search only static pages and cannot reach the Deep Web, support keyword search but not semantic search, and index the content of static pages but not metadata. It improves recall and precision and avoids the index-capacity bottleneck.
     2. The thesis extracts metadata from Deep Web query interfaces, treating each query interface as the meta-schema of its back-end database. It describes query interfaces in the metadata description language RDF and then retrieves this RDF metadata with the help of a domain ontology, which realizes semantic search over query interfaces and improves the accuracy of query-interface retrieval. Because query interfaces are highly domain-specific, this in turn improves the precision of the search engine.
     3. The thesis proposes a framework for a Deep Web semantic search engine based on a domain ontology. It consists of a Deep Web crawler, a Deep Web classifier, Deep Web form extraction, a natural language query interface, semantic reasoning, a form retriever, a Web retriever, unified interface query, and result integration modules. The thesis focuses on Deep Web discovery, Deep Web classification, and semantic retrieval over query-interface RDF. The RDF retrieval system uses Jena as its development platform and is validated with a car-domain ontology and query-interface RDF models as the example.
     4. The HowNet-based lexical semantic relation algorithm uses HowNet as the ontology and applies structure-based pattern matching to determine the logical relations between words. The Deep Web feature selection algorithm uses word frequency as the within-class and between-class separability criterion and selects features with a Tabu search strategy. The query-interface RDF extraction algorithm maps the HTML code of a query interface to the query-interface RDF model according to the features of that HTML code. The query-interface RDF query algorithm takes the keyword sequence entered by the user as the search condition, classifies the keywords, and applies a concept reasoning operator to obtain a sequence of concept-keyword pairs and a sequence of instance-keyword pairs. It queries the RDF with the RDQL language according to the concept-keyword pairs and then, using the RDF results and the instance-keyword pairs, retrieves data from the Web with HTTP requests. These algorithms are validated on examples in the thesis.
     The thesis studies a Deep Web search engine based on the Semantic Web architecture from a theoretical standpoint. It proposes the overall framework of the search engine and the algorithmic ideas behind each key component, and it refines the retrieval workflow of such an engine. The approach is theoretically feasible, the retrieval workflow and the key algorithms are validated on examples from a specific domain, and the whole system can be developed on the Jena platform.
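
The mapping from a query interface's HTML code to RDF metadata, described in the abstract, can be illustrated with a minimal sketch. It is written in Python with the standard library only, not on Jena as in the thesis. The namespace `QI`, the property names (`action`, `method`, `hasField`, `fieldType`), the sample form, and the URI `FORM_URI` are all hypothetical and chosen for illustration; the thesis's actual RDF model is not reproduced here.

```python
from html.parser import HTMLParser

QI = "http://example.org/qi#"  # hypothetical vocabulary namespace, not from the thesis


class FormExtractor(HTMLParser):
    """Collect the action, method, and named input fields of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = "get"
        self.fields = []       # (name, kind) pairs in document order
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = a.get("action") or ""
            self.method = (a.get("method") or "get").lower()
        elif self._in_form and tag in ("input", "select", "textarea"):
            name = a.get("name")
            kind = (a.get("type") or "text").lower() if tag == "input" else tag
            # Skip controls that carry no query attribute.
            if name and kind not in ("submit", "hidden", "button"):
                self.fields.append((name, kind))

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def form_to_triples(html, interface_uri):
    """Describe one query interface as (subject, predicate, object) triples."""
    parser = FormExtractor()
    parser.feed(html)
    triples = [
        (interface_uri, QI + "action", parser.action),
        (interface_uri, QI + "method", parser.method),
    ]
    for name, kind in parser.fields:
        field_uri = f"{interface_uri}/{name}"
        triples.append((interface_uri, QI + "hasField", field_uri))
        triples.append((field_uri, QI + "fieldType", kind))
    return triples


# Hypothetical car-search form used only to exercise the sketch.
SAMPLE = """
<form action="/search" method="GET">
  Make: <input type="text" name="make">
  Price: <select name="price"><option>any</option></select>
  <input type="submit" value="Go">
</form>
"""
FORM_URI = "http://example.org/cars/form1"
triples = form_to_triples(SAMPLE, FORM_URI)
for t in triples:
    print(t)
```

In a full system the triples would be loaded into an RDF store and linked to domain-ontology concepts; the plain tuples here only show the shape of the mapping.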
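
The feature selection step, which combines a word-frequency separability criterion with a Tabu search strategy, might look roughly like the following Python sketch. The abstract does not give the exact criterion, so the ratio used in `separability` (between-class distance over within-class spread), the swap neighbourhood, the tabu tenure, and the toy data are all assumptions made for illustration.

```python
from itertools import product


def separability(selected, docs_by_class):
    """Between-class distance over within-class spread on the selected terms.

    docs_by_class maps a class label to a list of {term: frequency} dicts.
    This specific criterion is an assumption, not the thesis's formula.
    """
    means, within = {}, 0.0
    for label, docs in docs_by_class.items():
        mean = {t: sum(d.get(t, 0) for d in docs) / len(docs) for t in selected}
        means[label] = mean
        within += sum(
            (d.get(t, 0) - mean[t]) ** 2 for d in docs for t in selected
        ) / len(docs)
    labels = list(means)
    between = sum(
        (means[a][t] - means[b][t]) ** 2
        for i, a in enumerate(labels)
        for b in labels[i + 1:]
        for t in selected
    )
    return between / (1.0 + within)


def tabu_select(terms, docs_by_class, k, iterations=20, tenure=2):
    """Pick k terms by swapping one term in and one out per step."""
    current = set(terms[:k])                 # deterministic starting subset
    best, best_score = set(current), separability(current, docs_by_class)
    tabu = []                                # recently moved terms, oldest first
    for _ in range(iterations):
        candidates = []
        outside = sorted(set(terms) - current)
        for out_t, in_t in product(sorted(current), outside):
            neighbour = (current - {out_t}) | {in_t}
            score = separability(neighbour, docs_by_class)
            is_tabu = out_t in tabu or in_t in tabu
            # Aspiration: a tabu move is allowed if it beats the best so far.
            if not is_tabu or score > best_score:
                candidates.append((score, out_t, in_t, neighbour))
        if not candidates:
            break
        score, out_t, in_t, current = max(candidates, key=lambda c: c[0])
        tabu = (tabu + [out_t, in_t])[-2 * tenure:]
        if score > best_score:
            best, best_score = set(current), score
    return best, best_score


# Toy term frequencies from two hypothetical query-interface classes.
DOCS = {
    "cars": [{"search": 1, "make": 2, "model": 1}, {"search": 1, "make": 2}],
    "books": [{"search": 1, "author": 2, "isbn": 1}, {"search": 1, "author": 2}],
}
TERMS = ["search", "model", "isbn", "make", "author"]
best, score = tabu_select(TERMS, DOCS, k=2)
print(sorted(best), score)
```

On this toy data the term `search` occurs equally in both classes and contributes nothing to the criterion, so the search moves away from it toward terms that occur in only one class.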
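
The query algorithm's flow (classify the keywords, derive concept and instance pairs, match interfaces by concept, then issue an HTTP request with the instance values) can be sketched as follows. This is a simplified Python stand-in: the in-memory lookup replaces the RDQL query that the thesis runs against RDF on Jena, and the ontology fragment, the interface table, the field names, and the URLs are hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical fragment of a car-domain ontology: instance -> concept.
INSTANCE_OF = {"toyota": "Make", "corolla": "Model", "red": "Color"}
# Hypothetical words that name a concept directly.
CONCEPT_NAMES = {"make": "Make", "brand": "Make", "model": "Model", "color": "Color"}

# Hypothetical interface metadata: (interface URL, form field, concept).
INTERFACE_FIELDS = [
    ("http://example.org/cars/search", "mk", "Make"),
    ("http://example.org/cars/search", "md", "Model"),
    ("http://example.org/books/search", "author", "Author"),
]


def classify(keywords):
    """Split keywords into concept pairs and instance pairs: (concept, keyword)."""
    concepts, instances = [], []
    for word in keywords:
        key = word.lower()
        if key in CONCEPT_NAMES:
            concepts.append((CONCEPT_NAMES[key], word))
        elif key in INSTANCE_OF:
            # Concept reasoning step: an instance keyword implies its concept.
            instances.append((INSTANCE_OF[key], word))
    return concepts, instances


def build_requests(keywords):
    """Return one request URL per interface whose fields match a wanted concept."""
    concepts, instances = classify(keywords)
    wanted = {c for c, _ in concepts} | {c for c, _ in instances}
    urls = []
    for interface in dict.fromkeys(i for i, _, _ in INTERFACE_FIELDS):
        field_for = {c: f for i, f, c in INTERFACE_FIELDS if i == interface}
        if not wanted.intersection(field_for):
            continue                      # interface is unrelated to the query
        params = [(field_for[c], w) for c, w in instances if c in field_for]
        urls.append(interface + ("?" + urlencode(params) if params else ""))
    return urls


urls = build_requests(["Toyota", "Corolla"])
print(urls)
```

Concept keywords only select which interfaces are relevant, while instance keywords become the parameter values sent to the selected interface; unknown keywords are ignored in this sketch.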

