Deep Web实体搜索的关键技术研究

英文题名：Research on Key Techniques of Entity Search for Deep Web
作者：寇月
论文级别：博士
学科专业名称：计算机软件与理论
中文关键词：Deep ; Web ; 实体搜索 ; 关联知识 ; 实体抽取 ; 实体评估 ; 实体去重
英文关键词：Deep Web ; entity search ; relationship knowledge ; entity extraction ; entity ranking ; entity deduplication
学位年度：2009
导师：于戈
学科代码：081202
学位授予单位：东北大学
论文提交日期：2008-11-10
答辩委员会主席：栾贵兴

摘要

Web上的信息量巨大而丰富,并且已成为了企业、个人赖以生存和发展的主要信息资源。随着Web数据库的不断增长,通过对Deep Web的访问逐渐成为获取信息的主要手段。然而,Deep Web环境中的数据信息对于传统的搜索引擎来说是不可见的,针对Deep Web的新型搜索引擎还远没有发展成熟。面对Deep Web环境的信息量巨大、内容缺乏结构性、结果异构性、数据状态可变等特性,使得Deep Web信息搜索课题不断面临新的挑战和机遇。因此,如何有效地搜索Deep Web中的数据资源成为一个值得研究的问题,其目标是从大规模的、动态变化的Deep Web数据中自动地获取满足用户需求的结果信息。为此,本文针对Deep Web搜索过程中的关联知识构建、实体抽取、实体评估、实体去重等内容进行了研究。主要工作包括以下几点：
     (1)提出了一种Deep Web实体搜索机制DWESM。通过分析传统的页面级搜索技术和面向专业领域的垂直搜索技术的特点,提出了DWESM的层次模型,具体由关联知识构建、实体抽取、实体评估及实体去重等模型组成；DWESM以网页中的实体数据作为操作的基本单元,不仅能够适合Deep Web的环境特点,而且继承了垂直搜索中的技术思想,更加专注、具体和深入。
     (2)提出了一种基于语义及统计分析的关联知识构建模型SS-KCM。基于文本匹配模型、语义分析模型和分组统计模型,构建了SS-KCM的整体模型框架；提出了文本粗略匹配、语义关联获取以及分组统计分析的三段式逐步求精策略,基于文本特征、语义关联及约束规则获取实体间的关联关系；提出了静态分析、动态协调相结合的自适应知识维护策略,构建和完善实体关联知识库,以适应Web数据的动态性并保证关联知识的完备性；通过实验验证了SS-KCM中所采用的关键技术的可行性和有效性。
     (3)提出了一种基于DOM树的Deep Web实体抽取模型D-EEM。D-EEM采用基于DOM树的自动实体抽取策略,利用DOM树中的文本内容和层次结构来确定数据区域和实体区域,提高了实体抽取的准确性；提出了一种基于上下文距离和共现次数的语义标注方法,能够有效地将来自不同数据源的抽取结果进行合成；通过实验验证了D-EEM在抽取效率及抽取准确性等方面所具有的优势。
     (4)提出了一种局部与全局评估相结合的实体评估模型LG-ERM。针对实体评估所涉及的实体特征、数据源特征、实体关联关系等影响因素进行了分析并量化表示；提出了一种局部与全局评估相结合的实体评估策略,既在数据源内部进行局部多重评估处理,又基于实体关联知识将局部评估结果进行聚集整合,有效地提高了评估的准确性；通过实验验证了LG-ERM所采用的关键技术的可行性和有效性。
     (5)提出了一种基于多相似度估算器的实体去重模型。针对实体描述属性的不同特征,定义了一系列相似度估算器,以适应不同的属性类型；提出了实体记录相似度的计算方法以及不确定重复记录的处理策略；实验数据表明,该模型在重复记录识别的准确度和有效性等方面具有一定的优势。
     (6)设计并实现了DWESM的原型系统。实现了本文所提出的关联知识构建、实体抽取、实体评估、实体去重等理论和方法,并验证了这些理论和方法的正确性和有效性。
     总之,本文研究了Deep Web实体搜索中的关联知识构建、实体抽取、实体评估以及实体去重等问题,提出了一种适合Deep Web环境的实体搜索机制,能够有效地解决Deep Web搜索中结果数据的抽取、排序、消重及整合等问题。理论分析和大量的实验结果证明了这些方法的有效性和高效性。我们希望这些方法和技术对于开发Deep Web搜索系统具有一定的参考价值。
The immense scale of the Web has rendered it as the most important information resource relied by enterprises and users. With the increase of Web databases, accessing Deep Web is becoming the main method to acquire information. But the data supplied by Deep Web is unavailable for traditional search engines. And currently there is not a novel search engine which can be suitable for Deep Web. Because of the large-scale unstructured content, heterogeneous result and dynamic data in Deep Web, there are some new challenges for Deep Web search. Thus it is important to solve the problem about searching the valuable data from Deep Web pages. Our goal is to search the best target entities that can meet users'needs from the large-scale Deep Web pages. The approaches about relationship knowledge construction, entity extraction, entity ranking and entity deduplication are researched in this dissertation. Here our work includes the following major aspects:
     (1) A Deep Web entity search mechanism called DWESM is presented. By analyzing the technique characteristics of traditional page-level search and vertical search, the hierarchy model of DWESM is presented, which includes the components of relationship knowledge construction, entity extraction, entity ranking and entity deduplication. By inheriting the basic idea of vertical search, entities contained in Web pages are considered as the operational units, which can make the search more professionally and throughly.
     (2) A semantics and statistical analysis based relationship knowledgement construction model called SS-KCM is presented, which includes text matching model, semantics analysis model and group statistics model. Also a three-phase gradual refining strategy is adopted, which includes text initial matching, semantic relationship abstraction and group statistics analysis. And based on text characteristics, semantic information and constraints, the relationship among entities are identified. By performing the self-adaptive knowledge maintenance strategy, the content of entity relationship knowledge database can be more complete and effective. The experiments demonstrate the feasibility and effectiveness of the key techniques of SS-KCM.
     (3) A DOM-tree based Deep Web entity extraction model called D-EEM is presented. A DOM-tree based automatic entity extraction strategy is performed in D-EEM to determine the data regions and the entity regions respectively, which can improve the accuracy of extraction by considering both the textual content and the hierarchical structure in DOM-trees. Also based on the Web context and co-occurrence, a semantic annotation method is proposed to benefit the process of data integration effectively. The experiments show that D-EEM is superior in the accuracy and efficiency of extraction.
     (4) An entity-level ranking model called LG-ERM for Deep Web query based on local and global scoring is presented. More rank influencing factors including the characteristics of entities, the importance of Web sources, as well as the entity relationships are considered and quantified. By combining local and global scoring in ranking, the query result can be more accurate and effective to meet users'needs. The experiments demonstrate the feasibility and effectiveness of the key techniques of LG-ERM.
     (5) An entity deduplication model based on multiple similarity calculators is presented. According to different characteristics of attributes, a series of similarity calculators are defined to suit different attribute types. The strategies of similarity calculation and uncertain records processing are also proposed. The experiments show that our approach is superior in the accuracy and efficiency of duplicate identification.
     (6) We design and implement the prototype system of DWESM, which applies the theories and approaches about relationship knowledge construction, entity extraction, entity ranking and entity deduplication proposed in this dissertation. The system shows validity and efficiency of these theories and approaches.
     In summary, this dissertation dedicates to study fundamental problems related to relationship knowledge construction, entity extraction, entity ranking and entity deduplication. And an entity search mechanism for Deep Web is presented, which can effectively solve the problem of result extraction, ranking, deduplication and consolidation. Lots of theoretical analysis and experiments show that these approaches are efficient and effective. We hope that these approaches and techniques could make some contributions to developing Deep Web search systems.

引文

1. Fetterly D, Manasse M, Najork M, et al. A Large-scale Study of the Evolution of Web Pages [C], WWW,2003,669-678.
    2. Bergman M K. The Deep Web:Surfacing Hidden Value [J], Journal of Electronic Publishing,2002,7(1):8912-8914.
    3. Chang K C C, He B, Li C K, et al. Structured Databases on the Web:Observations and Implications [R], SIGMOD Record,2004,33(3):61-70.
    4. Bilenko M, Mooney R. Adaptive Duplicate Detection Using Learnable String Similarity Measures [C], SIGKDD,2003,39-48.
    5. Cohen W W, Ravikumar P, Fienberg S E. A Comparison of String Distance Metrics for Name-matching Tasks [C], IJCAI,2003,73-78.
    6. 朱恒民,王宁生.一种改进的相似重复记录检测方法[J],控制与决策,2006,21(7)：805-813.
    7. 凌妍妍,刘伟,王仲远等Deep Web数据集成中的实体识别方法[J],计算机研究与发展,2006,43：46-53.
    8. Koudas N, Sarawagi S, Srivastava D. Record Linkage: Similarity Measures and Algorithms [C], SIGMOD,2006,802-803.
    9. 王丽娟,关守义,王晓龙等.基于属性权重的Fuzzy C Mean算法[J],计算机学报,2006,29(10)：1797-1803.
    10. Das G, Hristidis V. Ordering the Attributes of Query Results [C], SIGMOD,2006, 395-406.
    11. Nambiar U, Kambhampati S. Mining Approximate Functional Dependencies and Concept Similarities to Answer Imprecise Queries [C], WebDB,2004,73-78.
    12. Chaudhuri S, Granti V, Motwani R. Robust Identification of Fuzzy Duplicates [C], ICDE, 2005,865-876.
    13. Chen Z, Kalashnikov D V, Mehrotra S. Exploiting Relationships for Object Consolidation [C], SIGMOD Workshop on IQIS,2005,47-58.
    14. Thor A, Rahm E. MOMA- A Mapping-based Object Matching System [C], CIDR,2007, 247-258.
    15.Nie Z Q, Wen J R, Ma W Y. Object-level Vertical Search [C], CIDR,2007,235-246.
    16. Bhattacharya I, Getoor L. Iterative Record Linkage for Cleaning and Integration [C], SIGMOD Workshop on DMKD,2004,11-18.
    17. Dong X, Halevy A, Madhaven J. Reference Reconciliation in Complex Information Spaces [C], SIGMOD,2005,85-96.
    18. Wei M, Naumann F. DogmatiX Tracks down Duplicates in XML [C], SIGMOD,2005, 431-442.
    19. Calife M, Mooney R. Relational Learning of Pattern Match Rules for Information Extraction [C], AAAI/IAAI,1999,328-334.
    20. Soderlan S. Learning Information Extraction Rules for Semi-structured and Free Text [J], International Journal of Machine Learning,1999,34(1-3):233-272.
    21. Muslea I, Minton S, Knoblock G A Hierarchical Approach to Wrapper Induction [C], AA, 1999,190-197.
    22. Sahuguet A, Azavant F. Wysiwyg Web Wrapper Factory (W4F) [C], WWW,1999.1-12.
    23. Bergamaschi S, Castano S, De S, et al. An Intelligent Approach to Information Integration [C], FOIS,1998,253-267.
    24. Zhao H K, Meng W Y, Wu Z H, et al. Fully Automatic Wrapper Generation for Search Engines [C], WWW,2005,66-75.
    25. Liu W, Meng X F, Meng W Y. Vision-based Web data Records Extraction [C], SIGMOD Workshop on Web and Database,2006,20-25.
    26. Deng C, Yu S, Wen J R, et al. VIPS:A Vision-based Page Segmentation Algorithm [R], Microsoft Technical Report, MSR-TR-203-79,2003.
    27. Liu L, Pu C, Han W. XWRAP:An XML-enable Wrapper Construction System Web Information Sources [C], ICDE,2000,611-621.
    28. Crescenzi V, Mecca G, Merialdo P. RoadRunner: Towards Automatic Data Extraction from Large Web Sites [C], VLDB,2001,109-118.
    29.王茹,宋瀚涛,陆玉昌.基于树自动机的网页数据抽取[J].北京理工大学学报,2004,24(9)：790-793.
    30.李文奇,张忠能.页面包装器自动生成的改进算法[J].计算机工程与应用,2004,40(22)：113-115.
    31. Dong X, Halevy A Y, Yu C. Data Integration with Uncertainty [C], VLDB,2007, 687-698.
    32. Chakrabarti K, Ganti V, Han J W, et al. Ranking Objects by Exploiting Relationships: Computing Top-K over Aggregation [C], SIGMOD,2006,371-382.
    33. Cheng T, Yan X, Chang K C C. EntityRank:Searching Entities Directly and Holistically [C], VLDB,2007,387-398.
    34. Cheng T, Chang K C C. Entity Search Engine: towards Agile Best-effort Information Integration over the Web [C], CIDR,2007,108-113.
    35. Nie Z Q, Ma Y, Shi S, et al. Web Object Retrieval [C], WWW,2007.
    36. Nie Z Q, Zhang Y, Wen J, et al. Object-level Ranking: Bringing Order to Web Objects [C], IW3C2,2005.
    37. Zhao H. Semantic Matching across Heterogeneous Data Sources [J], Communications of the ACM,2008,11(50):45-50.
    38. Gao C, Fan W. Improving Data Quality:Consistency and Accuracy [C], VLDB,2007, 315-326.
    39.周宏广,周继承,彭银桥.数据ETL工具通用框架设计[J],计算机应用,2003,23(12)：96-98.
    40. Monge A, Elkan C. An Efficient Domain-independent Algorithm for Detecting Approximately Duplicate Database Records [C], SIGMOD Workshop on KDD,1997, 112-118.
    41. DeWitt J, Naught F, Sckneider A. An Evaluation of Non-join Algorithms [C], VLDB, 1991,443-452.
    42. Hernander M, Stolfo S. The Merge/Purge Proplem for Large Databases [C], SIGMOD, 1995,127-138.
    43. Liang J, Chen L, Sharad M. Efficient Record Linkage in Large Data Sets [C], DASFAA, 2003,137-146.
    44. Chaudhuri S, Ganjam K. Robust and Efficient Fuzzy Match for Online Data Cleaning [C], SIGMOD,2003,39-48.
    45. Low W, Lee M, Wang L. A Knowledge-based Approach for Duplicate Elimination in Data Cleaning [J]. Information Systems,2001,585-606.
    46. Ananthakrishna R, Chaudhuri S, Ganti V. Eliminating Fuzzy Duplicates in Data Warehouses [C], VLDB,2002,586-597.
    47. CoSORT [EB/OL]. http://www.iri.com/external/dbtrends.htm,2007.
    48. Data Stage [EB/OL]. http://www.ardentsoftware.com/datawarehouse/datastage,2007.
    49.王晓宇,周傲英.万维网的链接结构分析及其应用综述[J],软件学报,2003,14(10)：1768-1780.
    50. Hobbes'Internet Timeline v8.2 [EB/OL]. http://www.zakon.org/robert/internet/timeline/, 2006.
    51. Google [EB/OL]. http://www.google.com.
    52. Baidu [EB/OL]. http://www.baidu.com.
    53. Search Engine Watch [EB/OL]. http://www.searchenginewatch.com.
    54. Open Directory Project [EB/OL]. http://www.dmoz.org.
    55. Yahoo [EB/OL]. http://www.yahoo.com.
    56. MSN [EB/OL]. http://www.msn.com.
    57. Dogpile [EB/OL]. http://www.dogpile.com.
    58. Brin S, Page L. The Anatomy of a Large-scale Hypertextual Weg Search Engine [C], WWW,1998,107-117.
    59.李晓明,闫宏飞,王继民.搜索引擎原理技术与系统[M],北京,科学出版社,2004.
    60. Heydon A, Najork M. Mercator:A Scalable, Extensible Web Crawler [C], WWW,1999, 2(4):219-229.
    61. Aggarwal C C, Al-Garawi F, Yu P S. Intelligent Crawling on the World Wide Web with Arbitrary Predicates [C], WWW,2001,96-105.
    62.赫枫龄,左万利.用有向图解决网页爬行中循环链接问题[J],吉林大学学报(理学版),2004,42(3)：402-404.
    63.严威,赵政.开发中文搜索引擎汉语处理的关键技术[J],计算机工程,1999,25(61)：5-6.
    64.李晓明,闫宏飞,王继民.搜索引擎——原理、技术与系统[M],北京,科学出版社,2005.
    65. Baeza-Yates R, Ribeiro-Neto B. Modern Information Retrieval [M], ACM Press,1999.
    66. Kleinberg J. Authoritative Sources in a Hyperlinked Environment [C], ACM-SIAM, 1998,668-677.
    67. Page L, Brin S, Motwani R, et al. The Pagerank Citation Ranking: Bringing Order to the Web [EB/OL], http://www-db.stanford.edu/-backrub/pageranksub.ps,1998.
    68. Chakrabarti S, Berg M, Dom B. Distributed Hypertext Resource Discovery through Examples [C], VLDB,1999,375-386.
    69. Rennie J, Mccallum A. Using Reinforcement Learning to Spider the Web Efficiently [C], ICML,1999.
    70. Mitchecll T M. Machine Learning [M], 北京,机械工业出版社,2006.
    71. Diligenti M, Coetzee F M, Lawrence S, et al. Focused Crawling Using Context Graphs [C], VLDB,2000,527-534.
    72.肖冬梅.垂直搜索引擎研究[J],图书馆学研究,2003,2：87.
    73.陈新颜.垂直搜索引擎辨析[J],现代情报,2004,9：134-135.
    74.刘畅.综合搜索引擎与垂直搜索引擎的比较研究[J],情报科学,2007,1：99-104.
    75.王晓伟.垂直搜索引擎若干关键技术的研究[D],浙江大学,2007.
    76.肖亮.垂直搜索引擎的研究与实现[D],北京交通大学,2007.
    77. Cope J, Craswell N, Hawking D. Automated Discovery of Search Interfaces on the Web [C],ADC,2003,181-189.
    78. Zhang Z, He B, Chang K C. Understanding Web Query Interfaces:Best-effort Parsing with Hidden Syntax [C], SIGMOD,2004,107-118.
    79. He H, Meng W, Yu C, et al. WISE-Integrator: an Automatic Integrator of Web Search Interfaces for E-commerce [C], VLDB,2003,357-368.
    80. Arasu A, Garcia-Molina H. Extracting Structured Data form Web Pages [C], SIGMOD, 2003,337-348.
    81. Wittenburg K, Weitzman L. Visual Grammars and Incremental Parsing for Interface Languages [C], IEEE Symposium on Visual Languages,1990,111-118.
    82. Peng Q, Meng W, He H, et al. WISE-cluster:Clustering E-commerce Search Engines Automatically [C], WIDM,2004,104-111.
    83. He B, Tao T, Chang K C. Clustering Structured Web Sources:a Schema-based, Model-differentiation Approach [C], EDBT,2004,536-546.
    84. Ipeirotis P G, Gravano L, Sahami M, et al. Probe, Count and Classify:Categorizing Hidden Web Databases [C], SIGMOD,2001,67-78.
    85. Meng W, Wang W, Sun H, et al. Concept Hierarchy Based Text Database Categorization [J], Knowl Inf Syst,2002,4(2):132-150.
    86. Leake D B, Scherle R. Towards Context-based Search Engine Selection [C], IUI,2001, 109-112.
    87. Meng W, Yu C T, Liu K. Building Efficient and Effective Metasearch Engines [J], ACM Comput. Surv.,2002,34(1):48-89.
    88. Yu C, Liu K, Meng W, et al. A Methodology to Retrieve Text Documents from Multiple Databases [J], IEEE Trans. Knowl.Data Eng.,2002,14(6):1347-1361.
    89. Yu C T, Philip G, Meng W. Distributed Top-N Query Processing with Possibly Uncooperative Local Systems [C], VLDB,2003,117-128.
    90. Wu W, Yu C T, Doan A, et al. An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web [C], SIGMOD,2004,95-106.
    91. He H, Meng W, Yu C T, et al. Constructing Interface Schemas for Search Interfaces of Web Databases [C], WISE,2005,29-42.
    92. He H, Meng W, Yu C T, et al. Automatic Integration of Web Search Interfaces with WISE-Integrator [C], VLDB,2004,13(3):256-273.
    93. Wu Z, Raghavan V, Du C, et al. SE-LEGO:Creating Metasearch Engines on Demand [C],SIGIR,2003,464.
    94.郭志懋,周傲英.数据质量和数据清洗研究综述[J],软件学报,2002,13(11)：2076-2082.
    95. Hall P, Dowling G. Approximate String Matching [J], ACM Computing Surveys,1980, 12(4):381-402.
    96. Levenshtein V. Binary Codes Capable of Correcting Deletions, Insertions and Reversals [J], Doklady Akademii Nauk SSSR,1965,163(4):845-848.
    97. Skikant R, Agrawal R. Mining Generalized Association Rules [C], VLDB,1995, 407-419.
    98. Liu B, Hsu W, Ma Y. Integrating Classification and Association Rule Mining [C], KDD, 1998,80-86.
    99. Raggett D, Hors A L, Jacobs L. HTML 4.0 Specification [EB/OL]. http://www.w3.org /TR/1998/REC-html40-19980424,1997.
    100. Raggett D, Hors A L, Jacobs L. HTML 4.01 Specification [EB/OL]. http://www.w3.org /TR/HTML401/,1999.
    101. Morrison M. XML揭秘入门应用精通[M],北京,清华大学出版社,2001.
    102. Fallside D C,. Walmsley P. XML Schema Part 0:Primer Second Edition [EB/OL]. http://www.w3.org/TR/xmlschema-O/,2004.
    103. Raggett D. Clean up Your Web pages with HTML Tidy [EB/OL]. http://www.w3.org/ People/Raggett/tidy/,2005.
    104. Raggett D. HTML Tidy Library Project [EB/OL]. http://tidy.sourceforge.net/,2008.
    105. Hegaret P L, Whitmer R. Document Object Model (DOM) [EB/OL]. http://www.w3.org/ DOM/,1998.
    106. Nigam K, McCallum A K, Thrun S. Text Classification from Labeled and Unlabeled Documents Using EM [J], Machine Learning,2000,39(2):103-134.
    107. Lertnattee V, Theeramunkong T. Effect of Term Distributions on Centroid-based Text Categorization [J], Information Sciences,2004,158:89-115.
    108. Song R, Liu H, Wen J. Learning Block Importance Models for Web Pages [C], WWW, 2004,203-211.
    109. Bianchini M, Gori M, Scarselli F. Inside PageRank [J], ACM Transactions on Internet Technology,5(1),2005,92-128.
    110. Parreira J X, Weikum G.JXP:Global Authority Scores in a P2P Network [C], WebDB, 2005,31-36.
    111. Kou Y, Shen D R, Yu G, et al. KDS-CM:A Cache Model Based on Top-K Data Source for Deep Web Query [J], Wuhan University Journal of Natural Sciences,2007,12(5): 830-834.
    112. Kou Y, Shen D R, Yu G, et al. A Top-K-based Cache Model for Deep Web Query [C], INFOSCALE,2007,49-50.
    113. Ullmann J. A Binary N-Gram Technique for Automatic Correction of Substitution, Deletion, Insertion, and Reversal Errors in Words [J], The Computer,1977,20(2): 141-147.
    114. Smith T, Waterman M. Identification of Common Molecular Subsequences [J], Molecular Biology,1981,147:195-197.
    115. Jaro M. Unimatch: a Record Linkage System:User's Manual [C], US Bureau of the Census,1976,414-420.