Research on Semi-structured Web Information Extraction

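Original English title: Research on Semi-structure Information Extraction for Web
Author: Zhou Shengqiang (周盛强)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: data mining; information extraction; semi-structured data; Web
Degree year: 2009
Advisor: Sun Changsong (孙长嵩)
Discipline code: 081203
Degree-granting institution: Harbin Engineering University (哈尔滨工程大学)
Thesis submission date: 2009-02-01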

Abstract

With the rapid growth and spread of the Internet, people increasingly rely on the Web to obtain information. Finding the desired information quickly and efficiently has become a pressing problem, and Web information extraction technology emerged in response. Many methods for generating wrappers have been proposed, but each has its own limitations, and they struggle to meet high requirements for precision, robustness, and generality. The research focus of information extraction is therefore how to build good wrappers.

This thesis first analyzes existing information extraction techniques and XML technology, then proposes a Web information extraction system based on XML. With this system, users can extract the information points of interest from HTML pages, and the extraction results are represented in XML, which is structured and highly extensible. The system offers good generality and flexibility: users can quickly customize Web information extraction wrappers for different domains. Exploiting the strengths of XPath in data location, the thesis proposes a DOM-based XPath generation algorithm. XSLT serves as the description language for extraction rules, and XPath is used to locate the information points to be extracted.

The proposed method handles the Web information extraction problem well, and the system achieves high recall and precision.
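
The abstract does not give the details of the DOM-based XPath generation algorithm, so the following is only a minimal sketch of the general idea, not the thesis's own method. It assumes a well-formed page that Python's standard `xml.etree.ElementTree` can parse; real HTML would usually need cleaning first. The function name `build_xpath`, the sample page, and the book data are all hypothetical.

```python
import xml.etree.ElementTree as ET


def build_xpath(root, target):
    """Return a positional path from root to target, e.g. './body[1]/p[2]'.

    Illustrative only: walks from the selected node up to the root and
    records each node's tag plus its 1-based index among same-tag siblings.
    """
    # ElementTree nodes have no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        index = next(i for i, c in enumerate(same_tag, start=1) if c is node)
        steps.append(f"{node.tag}[{index}]")
        node = parent
    return "./" + "/".join(reversed(steps)) if steps else "."


# Hypothetical sample page; stands in for a cleaned-up HTML document.
PAGE = """<html><body>
<h1>Books</h1>
<table>
<tr><td>XML Basics</td><td>35.00</td></tr>
<tr><td>Data Mining</td><td>58.00</td></tr>
</table>
</body></html>"""

root = ET.fromstring(PAGE)
target = root.findall(".//td")[3]      # the node a user marks as interesting
path = build_xpath(root, target)       # path that locates exactly that node
located = root.find(path)

# Dropping the row index turns the single-node path into a per-record rule.
record_path = path.replace("tr[2]", "tr")
prices = [cell.text for cell in root.findall(record_path)]

print(path)
print(located.text, prices)
```

Here the generated path identifies one marked node, and relaxing one step of it selects the same field in every record. In a system like the one described, such a path would presumably be embedded in an XSLT rule that emits the XML result; that step is omitted because the Python standard library has no XSLT processor.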

