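Research on Web Information Extraction Technology Based on Frame Semantic Tagging

Original title (Chinese): 基于框架语义标注的Web信息抽取技术研究
Author: Bai Pengzhou (白鹏洲)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords: information extraction; frame semantics; domain ontology; wrapper; extraction rules
Degree year: 2008
Advisor: Niu Zhixian (牛之贤)
Discipline code: 081202
Degree-granting institution: Taiyuan University of Technology (太原理工大学)
Thesis submission date: 2008-05-01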

Abstract

With the rapid development of the Internet, the Web has become a global information source and a good platform for sharing information and resources. However, it is hard to find the information one needs quickly and accurately with traditional search engines. Information extraction arose against this background: it is a technique for automatically extracting useful information from web pages (text), and it is currently an important research topic in intelligent information processing. The information that an extraction system obtains from the Web can be provided directly to users, and it can also serve as the basis for building intelligent query systems and data mining systems, so the technique has broad application prospects.

This thesis first introduces the background and history of information extraction systems, surveys the current state of research, and analyzes several important information extraction tools together with their shortcomings: they either lack semantics or rely on semantic models that are too simple. To address this weakness, the thesis uses the strength of frame semantics in representing semantic information to solve the problem of missing or oversimplified semantic information in extraction results, and proposes an information extraction method based on frame semantic tagging.

The thesis illustrates the idea of this method by building a Web book information extraction system based on frame semantic tagging. The approach combines frame semantic network technology, domain ontology knowledge, and information extraction technology. When extracting information from free text, the text is first tagged with frame semantics, and extraction rules are then generated from the tagging results together with domain ontology knowledge. The distinguishing feature of the method is that frame semantic tagging serves as the basis for building extraction rules, and a single unified method guides the extraction process: information patterns are built around semantic roles, which raises pattern construction to the level of semantic roles, so that the extracted information carries explicit semantic information.

The system has practical significance for research on semantics-based information extraction. In addition, its architecture and the design ideas behind its main modules offer a useful reference for designing and implementing information extraction systems for other kinds of documents.
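
The pipeline outlined in the abstract (tag free text with frame semantics, then derive extraction rules from the tags plus domain ontology knowledge) might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the frame and role names, the `BOOK_ONTOLOGY` mapping, the `FrameAnnotation` type, and the sample data are all hypothetical, and the tagging step itself is assumed to have been done already.

```python
from dataclasses import dataclass


@dataclass
class FrameAnnotation:
    """One frame-semantic tag: a frame name plus its semantic-role fillers."""
    frame: str
    roles: dict  # role name -> text span that fills the role


# Hypothetical book-domain ontology: (frame, role) -> ontology property.
BOOK_ONTOLOGY = {
    ("Text_creation", "Author"): "book.author",
    ("Text_creation", "Text"): "book.title",
    ("Publishing", "Publisher"): "book.publisher",
    ("Publishing", "Time"): "book.publication_date",
}


def build_rules(annotations, ontology):
    """Derive extraction rules: keep the (frame, role) pairs the ontology covers."""
    rules = {}
    for ann in annotations:
        for role in ann.roles:
            key = (ann.frame, role)
            if key in ontology:
                rules[key] = ontology[key]
    return rules


def extract(annotations, rules):
    """Apply the rules to tagged text; each value is labeled by its ontology property."""
    record = {}
    for ann in annotations:
        for role, span in ann.roles.items():
            prop = rules.get((ann.frame, role))
            if prop is not None:
                record[prop] = span
    return record


# Made-up tagged input standing in for the output of a frame-semantic tagger.
tagged = [
    FrameAnnotation("Text_creation", {"Author": "Zhang San", "Text": "An Example Book"}),
    FrameAnnotation("Publishing", {"Publisher": "Example Press", "Time": "2007", "Place": "Beijing"}),
]

rules = build_rules(tagged, BOOK_ONTOLOGY)
record = extract(tagged, rules)
print(record)
```

Because the rules are keyed on semantic roles rather than on surface patterns, every extracted value arrives with an explicit semantic label; the `Place` role is dropped here only because the hypothetical ontology has no property for it.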

