Research on Deep Web Data Extraction and Integration Technology

English title: Research on Deep Web Oriented Information Extraction and Integration
Author: Liu Guifeng (刘桂峰)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: Deep Web; Data Integration; Data Extraction; Clustering; Search Engine
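Degree year: 2009
Advisor: Cui Zhiming (崔志明)
Discipline code: 081203
Degree-granting institution: Soochow University (苏州大学)
Thesis submission date: 2009-05-01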

Abstract

As World Wide Web technology and database technology have converged, the Web has deepened rapidly. A large amount of information is hidden in Web databases, and users can retrieve it dynamically only by submitting queries through query forms; researchers call these resources the Deep Web. Because Deep Web resources are scattered across many separate Deep Web sites, they are inconvenient to use, which has motivated data integration systems for the Deep Web.
     This thesis studies data extraction and integration techniques for the Deep Web, proposes corresponding algorithms and solutions, and finally designs a prototype Deep Web search engine. The main contributions are as follows:
     (1) Extracting Web data objects from query result pages is the first step of Deep Web data integration. Based on the Document Object Model (DOM), this thesis proposes a method that extracts Web data objects automatically in three stages: preprocessing the page, extracting a candidate set of Web data objects, and removing the candidates that are not Web data objects.
     (2) A method is proposed for integrating Web data objects with heterogeneous schemas. Built on the vector space model, it uses clustering to integrate heterogeneous Web data objects from different Deep Web sites. It then detects duplicate Web data objects, using attribute discriminability as the basis and similarity as the measure, and removes the duplicates.
     (3) The thesis analyzes how the organization of massive data affects query response speed and, on that basis, proposes a method for organizing a massive collection of Web data objects. Through incremental clustering, Web data objects group naturally according to their own features and form a category hierarchy, which lays the foundation for fast query response.
     (4) Building on the work above, a prototype Deep Web search engine is designed.
     The methods and techniques proposed in the thesis were also evaluated experimentally, and the results show that they are feasible and effective.
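
The abstract names the vector space model, similarity-based duplicate detection, and incremental clustering without giving details. The sketch below is only an illustration of how those three ideas can fit together, not the thesis's algorithm. Everything specific in it is an assumption: the whitespace tokenization, raw term-frequency weights, the summed-vector centroid, the function names, and the two thresholds (`join_threshold`, `dup_threshold`). It also produces a flat set of clusters rather than the category hierarchy the thesis describes, and it omits attribute discriminability weighting.

```python
import math
from collections import Counter


def to_vector(obj):
    """Turn a Web data object (dict of attribute -> text) into a term-frequency vector.

    Attribute names are ignored on purpose, so objects with different schemas
    (e.g. "title" vs "name") can still be compared by their values.
    """
    tokens = []
    for value in obj.values():
        tokens.extend(str(value).lower().split())
    return Counter(tokens)


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def incremental_cluster(objects, join_threshold=0.3, dup_threshold=0.9):
    """Assign objects one at a time to the most similar existing cluster.

    Returns (clusters, duplicates). clusters is a list of lists of object
    indices; duplicates is a list of (earlier_index, later_index) pairs.
    Both thresholds are illustrative assumptions.
    """
    vectors = [to_vector(o) for o in objects]
    clusters = []      # each entry: {"centroid": Counter, "members": [indices]}
    duplicates = []
    for i, vec in enumerate(vectors):
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is None or best_sim < join_threshold:
            # Nothing similar enough: the object starts a new cluster.
            clusters.append({"centroid": Counter(vec), "members": [i]})
            continue
        # Duplicate check is limited to the chosen cluster's members.
        for j in best["members"]:
            if cosine(vec, vectors[j]) >= dup_threshold:
                duplicates.append((j, i))
                break
        best["members"].append(i)
        # The centroid is kept as a summed vector; cosine ignores its scale.
        best["centroid"].update(vec)
    return [c["members"] for c in clusters], duplicates


if __name__ == "__main__":
    # Hypothetical records from two sites with different attribute names.
    objects = [
        {"title": "Data Mining Concepts", "author": "Han"},
        {"name": "data mining concepts", "writer": "han"},
        {"title": "Italian Pasta Recipes", "author": "Rossi"},
    ]
    clusters, duplicates = incremental_cluster(objects)
    print(clusters, duplicates)
```

Restricting the duplicate check to the members of the chosen cluster is the reason for clustering first: each new object is compared against one cluster rather than the whole collection, which matches the abstract's point that data organization affects response speed.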

