Ontology-Based Web Information Extraction in Tourism Domain

Original title: 基于本体的旅游领域Web信息抽取
Author: Chen Lina (陈立娜)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords: ontology; OWL; Web information extraction; SHOIQ(D)-Tableaux algorithm
Degree year: 2009
Advisor: Wang Ju (王驹)
Discipline code: 081202
Degree-granting institution: Guangxi Normal University (广西师范大学)
Submission date: 2009-04-01

Abstract

With the development of the Internet and Web technology, the WWW has become an enormous information repository. With traditional search engines, however, users often find it very difficult to locate precisely the information they need. Web information extraction technology emerged against this background.
     Web information extraction has been studied extensively. The main approaches are based on natural language processing, wrapper induction, HTML structure, and ontologies. Ontology-based information extraction relies mainly on descriptions of the data itself and depends little on the layout of individual Web pages. An ontology also provides machine-readable domain concepts and the relations among them, and it supports simple reasoning. Using an ontology in information extraction has further advantages. First, an ontology provides a rich, predefined vocabulary that serves as a stable conceptual interface to the data sources and is independent of the data schema. Second, the knowledge represented in an ontology is sufficient to support the conversion of all relevant information sources. Third, an ontology supports consistency management and the identification of inconsistent data.
     Based on this analysis and the practical needs of our project, this thesis proposes an ontology-based method for Web information extraction in the tourism domain, and designs and implements a prototype system for extracting Guangxi tourism information. The main work and contributions are as follows:
     (1) Several major ontology construction methods are analyzed and compared. On balance, the tourism domain ontology is built with the method proposed by Mike Uschold and Michael Gruninger. During construction, the thesis studies the relations between concepts, the concept hierarchy, concept equivalence, property restrictions, and individual equivalence.
     (2) The Pellet reasoner is introduced and the SHOIQ(D)-Tableaux reasoning algorithm is described. The algorithm is applied to reasoning over the tourism domain ontology, including ontology consistency checking, concept subsumption checking, concept satisfiability checking, property restriction checking, and instance checking. The thesis then describes how Jena is used to parse the ontology and extract its concepts, keywords, relations, and instances, which are stored in a database.
     (3) Building on ontology reasoning and parsing, Web pages are first converted into DOM trees, and an algorithm is described that uses tourism ontology keywords to locate and extract the main text of a page. Next, Chinese word segmentation is performed by combining the ICTCLAS segmentation tool with a tourism domain vocabulary, followed by stop-word filtering. Finally, the extraction rules are described; they are constructed by combining the semantic characteristics of properties with triples.
     Finally, based on these key techniques, a prototype platform for Guangxi tourism information extraction, Tourism_IESystem, is implemented, and the performance of the extraction system is evaluated on Web pages from tourism sites. The results indicate that the proposed method is technically feasible and has practical application value.
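     As a rough illustration of the extraction step in item (3), the Python sketch below scores the text blocks of a page by how many ontology keywords they contain, keeps the best block as the main text, and then applies property-trigger rules to emit (subject, property, object) triples. It is not the thesis's implementation. The thesis works on DOM trees, ICTCLAS segmentation, and an ontology parsed with Jena; here a flat list of block-level elements stands in for the DOM tree, and plain substring matching stands in for segmentation and stop-word filtering. All function names, keywords, trigger phrases, and the sample page are hypothetical.

```python
import re
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Collect the text between block-level tag boundaries (a flat stand-in for a DOM tree)."""

    BLOCK_TAGS = {"div", "p", "td", "li"}

    def __init__(self):
        super().__init__()
        self.blocks = []  # finished text blocks, in page order
        self._buf = []    # text seen since the last block boundary

    def _flush(self):
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def locate_main_text(html, keywords):
    """Return the block with the most ontology keyword hits, or "" if no block has any."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()

    def score(block):
        return sum(block.count(k) for k in keywords)

    best = max(parser.blocks, key=score, default="")
    return best if best and score(best) > 0 else ""


def extract_triples(text, instances, triggers):
    """Emit (subject, property, object) when a sentence names an instance and a trigger phrase."""
    triples = []
    for sentence in re.split(r"[.。!?]", text):
        subject = next((i for i in instances if i in sentence), None)
        if subject is None:
            continue
        for prop, phrase in triggers.items():
            _, found, tail = sentence.partition(phrase)
            if found and tail.strip():
                triples.append((subject, prop, tail.strip()))
    return triples


if __name__ == "__main__":
    # Made-up page and vocabulary, for illustration only.
    page = ("<html><body><div>Home | Login | About</div>"
            "<div><p>Elephant Trunk Hill is located in Guilin. "
            "The Elephant Trunk Hill ticket price is 75 yuan.</p></div></body></html>")
    main_text = locate_main_text(page, {"Guilin", "ticket"})
    rules = {"locatedIn": "is located in", "ticketPrice": "ticket price is"}
    print(extract_triples(main_text, ["Elephant Trunk Hill"], rules))
```

     Keyword scoring is only one plausible way to separate the main text from navigation and advertising blocks. A real system would also need to handle nested blocks, script and style content, and numbers containing decimal points, which the naive sentence split above would break.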

