Analysis and Application of Genetic Algorithms in Synchronized Web Data Extraction
Abstract
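With the emergence of online structured databases, the demand for large-scale information integration grows daily, and in every domain the Web holds a large and continually updated body of data sources. Accessing these data effectively and systematically means processing an enormous volume of sources, so large-scale integration clearly needs more automated and more accurate handling. For each newly added data source, a Web data wrapper is generated automatically to process it, but existing techniques for doing so remain imperfect. This paper proposes a method to improve existing Web data extraction algorithms.
     As the Internet has developed, the Deep Web has come to provide a large amount of dynamic information. Extracting these data and wrapping them in a context-aware way raises many problems. We focus on three:
     1. How can a large number of peer data sources be used to improve the accuracy of a single Web data wrapper?
     2. How can multiple parallel Web data wrappers be used to improve wrapping accuracy?
     3. How should existing synchronized Web data extraction methods be improved to raise both wrapping accuracy and algorithmic efficiency?
     These problems look unrelated, but all stem from the lack of context awareness in Web data wrapping. Current wrappers wrap only one data source at a time, and in processing its content they do not enforce consistency across multiple peer sources or with domain rules.
     This paper proposes a genetic-algorithm-based synchronized Web data extraction algorithm that produces a context-aware Web data wrapper, one that uses multiple peer data sources and domain rules to find more accurate matches. It exploits context awareness to find mutually consistent matches for the content to be processed across peer sources. We use the genetic algorithm to build a spiral decoding mechanism that links the parallel wrappers to one another.
     The main work of this paper is:
     1. An in-depth study of information extraction from Deep Web online structured databases and of Web data wrappers, leading to a proposed context-aware form of wrapping.
     2. A synchronized Web data extraction algorithm, built on the genetic algorithm, that realizes context-aware wrapping through spiral decoding.
     3. The algorithm uses multiple peer data sources, parallel wrappers, and domain rules to achieve context-aware wrapping, improving the wrapper's extraction accuracy.
     The significance of this work is that it applies the genetic algorithm to give a complete scheme for context-aware Web data wrapping. Specifically: it uses a large number of peer data sources to improve the accuracy of a single Web data wrapper; it uses multiple parallel Web data wrappers to improve wrapping accuracy; and it applies the genetic algorithm to improve the synchronized Web data extraction method, raising the algorithm's efficiency.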
The deep web presents a pressing need for integrating large numbers of dynamically evolving data sources. To make the building of an integration system more automatic, we address three problems:
     First, across sequential tasks, how can extraction leverage the peer sources to facilitate the subsequent matching task?
     Second, across parallel sources, how can a wrapper leverage the peer wrappers or domain rules to enhance extraction accuracy?
     Third, how can the extraction algorithm be improved to enhance both extraction accuracy and algorithm efficiency?
     These issues, while seemingly unrelated, all boil down to the lack of "context awareness". Current automatic wrapper induction approaches generate a wrapper for one source at a time, in isolation, and thus inherently lack awareness of the peer sources or domain knowledge in the context of integration.
     In this paper, we propose the concept of context-aware wrappers that are amenable to matching and that can leverage peer wrappers or prior domain knowledge. Such context awareness inspires a synchronization framework to construct wrappers consistently and collaboratively across their mutual context. We draw insight from turbo codes and apply the genetic algorithm to develop a turbo syncer that interconnects extraction with matching, which together achieve context awareness in wrapping.
     The main contributions of this paper are:
     1. We discuss synchronized data extraction on the deep web and propose the concept of context-aware wrappers.
     2. We apply the genetic algorithm to develop a turbo syncer to interconnect extraction with matching, which together achieve context awareness in wrapping.
     3. We leverage the peer sources, peer wrappers and domain rules to enhance extraction accuracy.
     The contribution of this paper is a treatment of how to realize context-aware wrapping. We use the peer sources to facilitate the matching task and enhance a wrapper's extraction accuracy by leveraging the peer wrappers or domain rules. First, we introduce the concept of context-aware wrapping. Then, to realize it, we propose a spiral-decoding method that synchronizes the extractions. Finally, we apply the genetic algorithm to develop a turbo syncer that implements this method.
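
     The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea of using a genetic algorithm to make extractions from peer sources agree with one another. Everything in it is an assumption made for illustration: the toy CANDIDATES data, the agreement-counting fitness function, and the names fitness and evolve are hypothetical and are not taken from the paper. Each chromosome picks one candidate field layout per source, and fitness rewards layouts that agree position by position across all pairs of sources.

```python
import random
from itertools import combinations

# Hypothetical toy data: for each peer source, the alternative field
# layouts that an isolated wrapper might propose for its records.
CANDIDATES = [
    [("title", "author", "price"), ("title", "price", "author")],
    [("title", "author", "price"), ("author", "title", "price")],
    [("title", "price", "author"), ("title", "author", "price")],
    [("author", "title", "price"), ("title", "author", "price")],
]


def fitness(chromosome):
    """Count position-wise label agreements over all pairs of sources."""
    layouts = [CANDIDATES[s][g] for s, g in enumerate(chromosome)]
    return sum(
        sum(a == b for a, b in zip(x, y))
        for x, y in combinations(layouts, 2)
    )


def evolve(pop_size=20, generations=30, mutation_rate=0.1, seed=0):
    """Search for one layout choice per source that maximizes agreement."""
    rng = random.Random(seed)
    n = len(CANDIDATES)
    population = [
        [rng.randrange(len(CANDIDATES[s])) for s in range(n)]
        for _ in range(pop_size)
    ]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        next_gen = population[:2]  # elitism: carry over the two best
        while len(next_gen) < pop_size:
            # Tournament selection: best of three random individuals.
            p1 = max(rng.sample(population, 3), key=fitness)
            p2 = max(rng.sample(population, 3), key=fitness)
            cut = rng.randrange(1, n)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            for s in range(n):  # mutation: resample a gene at random
                if rng.random() < mutation_rate:
                    child[s] = rng.randrange(len(CANDIDATES[s]))
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)


if __name__ == "__main__":
    best = evolve()
    print(best, fitness(best))
```

     On this toy input, the fully consistent choice [0, 0, 1, 1] gives every source the same layout and scores 18, the maximum possible. Domain rules could be folded in as extra terms in the fitness function, which is one reason an agreement-based score is used in this sketch.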