Research of Web Information Extraction in Informatization Education

Author: Qiu Yana (邱亚娜)
Degree level: Master's
Discipline: Educational Technology
Keywords: Web Information Extraction; HTML; XML; DOM tree; Informatization Education
Degree year: 2008
Advisor: Zhang Guiyun (张桂芸)
Discipline code: 040110
Degree-granting institution: Tianjin Normal University
Thesis submission date: 2008-03-01

Abstract

The rapid development of computer technology and the Internet has turned the Web into a global, vast, distributed, and shared information space. As an enormous repository of resources, the Web has brought great convenience to people's study, life, and work. Yet faced with the sheer volume of information on the Web, people find themselves in the awkward position of being "data rich but knowledge poor". Because most Web data is currently published as HTML, applications cannot directly obtain the information on the Web. Web information extraction technology emerged against this background.
This thesis analyzes the technical characteristics of several typical information extraction (IE) systems and discusses how, in informatization education, personalized service information can be extracted starting from learners' needs. It implements a personalized information extraction system based on the document structure tree. The system consists of two parts: the definition of extraction rules and the execution of extraction rules. In the rule definition phase, the retrieved HTML page is first normalized and converted into a well-formed, semantically clear XML file, and the DOM tree of the document is generated; the user then specifies the location of the information to be extracted and the schema of the corresponding target table; finally, extraction rules are generated from this information. In the rule execution phase, the system extracts Web data according to the user-defined extraction rules and loads it into the target table at the specified location, where it is stored in structured form.
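
The two phases described above can be illustrated with a minimal sketch. It is not the system implemented in the thesis, whose rule format is not given in this record. It assumes a hypothetical rule written as a Python dictionary (a target table name, a path that selects record nodes, and one path per column), a made-up sample page, and Python's standard library in place of a dedicated HTML normalizer.

```python
import sqlite3
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TreeBuilder(HTMLParser):
    """Builds a well-formed element tree from loosely written HTML."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")  # synthetic root (assumption)
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        node = self.stack[-1]
        if len(node):  # text after a child element goes into that child's tail
            node[-1].tail = (node[-1].tail or "") + text
        else:
            node.text = (node.text or "") + text


def html_to_tree(html):
    """Rule definition, step 1: normalize HTML into a DOM-like tree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def execute_rule(tree, rule, conn):
    """Rule execution: extract records and load them into the target table."""
    cols = list(rule["columns"])
    # The table and column names come from a trusted rule in this sketch.
    ddl = ", ".join(f"{c} TEXT" for c in cols)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {rule['table']} ({ddl})")
    rows = []
    for record in tree.findall(rule["record"]):
        row = []
        for col in cols:
            node = record.find(rule["columns"][col])
            row.append("".join(node.itertext()).strip() if node is not None else None)
        rows.append(tuple(row))
    marks = ", ".join("?" for _ in cols)
    conn.executemany(f"INSERT INTO {rule['table']} VALUES ({marks})", rows)
    return rows


# Hypothetical page: upper-case tags, an unquoted attribute, an unclosed <BR>.
SAMPLE_HTML = """<HTML><BODY>
<TABLE id=courses>
<TR><TD>Educational Technology<BR></TD><TD>Instructor A</TD></TR>
<TR><TD>Data Mining</TD><TD>Instructor B</TD></TR>
</TABLE>
</BODY></HTML>"""

# Hypothetical rule: where the records are and how they map to the table.
RULE = {
    "table": "courses",
    "record": ".//table[@id='courses']/tr",
    "columns": {"title": "td[1]", "teacher": "td[2]"},
}

if __name__ == "__main__":
    tree = html_to_tree(SAMPLE_HTML)
    print(ET.tostring(tree, encoding="unicode"))  # the normalized XML form
    conn = sqlite3.connect(":memory:")
    for row in execute_rule(tree, RULE, conn):
        print(row)
```

Keeping the rule as plain data separates the definition phase from the execution phase: the same `execute_rule` function can be rerun against a freshly fetched copy of the page without redefining the rule. The path expressions are limited to the XPath subset that `xml.etree.ElementTree` supports.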

