Research on Key Issues in Deep Web Data Integration
Abstract
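With the rapid development of Internet technology, the Web has become an enormous information source holding massive amounts of data. These data are highly valuable: many application domains, such as market intelligence analysis, urgently need to analyze and mine them to obtain useful knowledge and support decision making as far as possible. However, Web data are large-scale, heterogeneous, autonomous, and distributed, which makes analysis and mining particularly difficult; the data must first be integrated so that analysis and mining can work on high-quality input. According to the "depth" of the information it contains, the Web can be divided into the Surface Web and the Deep Web. Deep Web data far exceed Surface Web data in both quantity and quality and therefore have higher application value. How to integrate Deep Web data for more effective analysis and mining is thus of practical significance and has broad application prospects.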
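     Current research on the Deep Web concentrates on query-oriented Deep Web data integration. That style of integration retrieves a limited amount of data and suits users' on-the-fly query needs, but it cannot support applications whose goal is analysis and mining. This thesis is devoted to analysis-oriented Deep Web data integration, which aims to acquire as many Deep Web pages as possible and to apply extraction and deduplication techniques to obtain well-structured, high-quality data that support further analysis and mining. Analysis-oriented Deep Web data integration raises the following open problems: (1) analysis and mining need large amounts of data, and in the Deep Web these data come from Deep Web pages generated dynamically by multiple Web databases in a domain, so the pages must be acquired automatically and as completely as possible; (2) analysis and mining need well-structured, semantically rich data, but these data are embedded in complex, semi-structured Deep Web pages, so structured data must be extracted from the pages accurately and their semantics understood; (3) analysis and mining need unified, high-quality data, but the same data are duplicated across multiple Web databases in a domain, so duplicate records must be detected across those Web databases.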
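     With analysis-oriented Deep Web data integration as its goal, this thesis studies the key problems involved. The main work and contributions are summarized as follows: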
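     1. A Deep Web query interface matching method based on extended evidence theory is proposed, which effectively solves the problem of understanding query interface semantics when crawling different Web databases in the same domain.
     A domain contains many Web databases whose query interface schemas are heterogeneous. As a result, when different Web databases are crawled, the interface attributes into which query terms should be submitted are hard to identify in a uniform way, which hampers the acquisition of Deep Web pages. To address this problem, the thesis proposes a Deep Web query interface matching method based on extended evidence theory. It builds the matching between the query interface of the Web database to be crawled and the corresponding domain query interface, and uses that matching to understand the semantics of the interface attributes. The method exploits multiple features of query interfaces to build different matchers, extends existing evidence theory by dynamically predicting the credibility of each matcher, and combines the results of the matchers, which improves the adaptability of the combination. Matching decisions are made with a top-k global optimization strategy and tree-structure heuristic rules, yielding the final matching that is used to interpret the query interface of the Web database to be crawled. Experimental results show that the method achieves high matching accuracy and overcomes the low accuracy that existing query interface matching methods suffer because of their poor adaptability.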
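     2. A Web database crawling method based on a query term harvest rate model is proposed, which effectively solves the problem of acquiring Deep Web pages on a large scale.
     Applications aimed at analysis and mining need large amounts of Deep Web data, which come from Deep Web pages generated dynamically by multiple Web databases in a domain. Because Web databases can be accessed only through their query interfaces, traditional search engine crawlers cannot reach their content. To address this problem, the thesis proposes a Web database crawling method based on a query term harvest rate model. The method samples the Web database and, from the sampled data, selects multiple features to build training samples automatically, which avoids manual labeling. It then fits a query term harvest rate model to the training samples by multiple linear regression and uses the model to select query terms iteratively for submission, thereby crawling the Web database. Experimental results show that the method achieves high coverage when crawling a Web database and overcomes the simplistic and empirical nature of the heuristic rules that existing crawling methods use to choose query terms. The learned harvest rate model can also be applied effectively to crawling other Web databases in the same domain.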
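     3. A Deep Web data extraction method based on hierarchical clustering is proposed, which effectively solves the problem of automatically extracting structured data from Deep Web pages.
     Deep Web pages are semi-structured, which makes the structured data in them hard to process automatically. To address this problem, the thesis proposes a Deep Web data extraction method based on hierarchical clustering. The method uses information from the query result list page to help identify the content blocks in a Deep Web page and thus determine the extraction region. It then combines structural and content features from multiple Deep Web pages and applies hierarchical clustering to the feature vectors of the content nodes in the same content block across those pages, thereby extracting the Web data records. Experimental results show that the method achieves high extraction accuracy and overcomes the low accuracy of most existing methods, which rely only on the structural information of the page itself.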
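     4. A Deep Web data semantic annotation method based on constrained conditional random fields is proposed, which effectively solves the problems of missing semantics in Deep Web data and of schema heterogeneity among data records from multiple Web sites.
     For extracted Web data records, annotation that relies solely on the semantic labels already present in Deep Web pages cannot handle missing labels. Moreover, different sites usually use different semantic labels, which makes the records from different sites heterogeneous in schema. To address these problems, the thesis proposes a Deep Web data semantic annotation method based on constrained conditional random fields. The method builds confidence constraints from existing Web database information and logical constraints from the logical relationships among the data elements in a Web data record, and introduces both kinds of constraints into the traditional conditional random field model to form a constrained conditional random field model. Using integer linear programming for inference, it assigns each data element in a Web data record a semantic label from the global attribute label set of the domain Web database schema. This annotates the Deep Web data and at the same time unifies the schemas of data records from multiple Web sites. Experimental results show that the method achieves high annotation accuracy and overcomes the low accuracy of traditional conditional random fields, which cannot jointly exploit existing Web database information and the logical relationships among Web data elements.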
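     5. A duplicate record detection method based on unsupervised learning is proposed, which effectively solves the problem of large-scale duplicate record detection in the Deep Web.
     A domain contains many Web databases with highly redundant data, which makes it hard to supply high-quality data for analysis and mining. To address this problem, the thesis proposes a duplicate record detection method based on unsupervised learning. The method uses cluster ensembles to select the initial training samples automatically, which improves the accuracy of the training samples; it builds classification models by iterative support vector machine classification, which improves classification accuracy; and it integrates the results of multiple classification models with extended evidence theory to build a domain-level duplicate record detection model, which makes duplicate record detection possible across a large number of Web databases in the same domain. Experimental results show that the method achieves high duplicate detection accuracy, that the resulting domain-level model performs well within its domain, and that the method overcomes the difficulty traditional methods have with large-scale duplicate record detection.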
With the rapid development of network technology, the Web has become a huge information source holding massive amounts of valuable data. Many application domains, such as market intelligence analysis, urgently need to analyze and mine these data to obtain useful knowledge that can aid decision making. However, Web data are heterogeneous, autonomous, and distributed, which makes analysis and mining difficult. Integrating Web data to facilitate analysis and mining has therefore become an urgent problem. According to the depth of the data stored in it, the Web can be divided into two parts, the Surface Web and the Deep Web. The data in the Deep Web already far exceed those in the Surface Web in both quantity and quality, so integrating Deep Web data to facilitate analysis and mining is of practical significance and has broad application prospects.
     Recent research has focused on query-oriented Deep Web data integration, which obtains a limited amount of data and suits user queries issued on the fly. That method, however, does not fit applications whose goal is analysis and mining. This thesis studies analysis-oriented Deep Web data integration, whose goal is to obtain as many Deep Web pages as possible and to use extraction and deduplication techniques to produce structured, high-quality data as the basis for analysis and mining. Analysis-oriented Deep Web data integration raises the following issues: (1) Because analysis requires plenty of data, and these data come from Deep Web pages generated dynamically by multiple Web databases in the same domain, as many pages as possible must be acquired automatically. (2) Because analysis requires well-formed, semantically rich data, and these data exist in complex, semi-structured Deep Web pages, the structured data must be extracted accurately and their semantics understood. (3) Because analysis requires consistent, high-quality data, and these data are highly duplicated across multiple Web databases in the same domain, duplicate records must be detected among these Web databases.
     This dissertation targets analysis-oriented Deep Web data integration and focuses on the issues listed above. The main research work and contributions are as follows.
     1. A query interface matching approach based on extended evidence theory is proposed to solve the problem of understanding query interface semantics when crawling different Web databases.
     A domain contains a large number of Web databases. The heterogeneity among their query interfaces makes it very difficult to recognize, in a unified way, the interface attributes into which query terms should be submitted. To solve this issue, a query interface matching approach based on extended evidence theory is proposed. It constructs the matches between the query interface of the Web database to be crawled and the corresponding domain query interface in order to understand the semantics of the former. The approach exploits multiple features of query interfaces to construct different matchers. It then extends traditional evidence theory with dynamically predicted credibilities of the individual matchers and uses the extended theory to combine the matchers' results. Finally, it makes one-to-one matching decisions with a top-k global optimization policy and one-to-many matching decisions with heuristic rules based on the tree structure. Experimental results show that the proposed approach improves matching accuracy and overcomes the poor adaptability of traditional approaches.
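     The combination step can be illustrated with standard Dempster-Shafer machinery. The following is a minimal sketch, not the thesis's algorithm: it assumes each matcher outputs a mass function over candidate domain attributes, applies classical Shafer discounting with a fixed credibility per matcher (the thesis predicts credibilities dynamically), and merges the results with Dempster's rule. The attribute names, mass values, and credibilities are invented for illustration.

```python
from itertools import product


def discount(mass, credibility, frame):
    """Shafer discounting: keep `credibility` of each mass, move the rest to the full frame."""
    out = {s: credibility * v for s, v in mass.items() if s != frame}
    out[frame] = 1.0 - credibility + credibility * mass.get(frame, 0.0)
    return out


def combine(m1, m2):
    """Dempster's rule of combination for two mass functions keyed by frozenset."""
    joint = {}
    conflict = 0.0
    for (a, va), (b, vb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            joint[inter] = joint.get(inter, 0.0) + va * vb
        else:
            conflict += va * vb  # mass assigned to contradictory hypotheses
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {s: v / (1.0 - conflict) for s, v in joint.items()}


# Hypothetical candidates for one attribute of a source query interface.
FRAME = frozenset({"title", "author", "isbn"})
# Hypothetical outputs of two matchers (label-based and value-based).
label_matcher = {frozenset({"title"}): 0.7, frozenset({"author"}): 0.2, FRAME: 0.1}
value_matcher = {frozenset({"title"}): 0.4, frozenset({"isbn"}): 0.5, FRAME: 0.1}

# Assumed credibilities: the label matcher is trusted more than the value matcher.
combined = combine(
    discount(label_matcher, 0.9, FRAME),
    discount(value_matcher, 0.5, FRAME),
)
best = max((s for s in combined if len(s) == 1), key=combined.get)
print(sorted(best), round(combined[best], 3))
```

     Discounting before combination keeps an unreliable matcher from vetoing a candidate outright, because part of its mass is returned to the whole frame as ignorance.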
     2. A Web database crawling approach based on a query harvest rate model is proposed to solve the problem of acquiring Deep Web pages on a large scale.
     Analysis and mining applications need a large amount of Deep Web data, which come from Deep Web pages generated dynamically by multiple Web databases in the same domain. However, because Web databases can be accessed only through their query interfaces, their content cannot be crawled by traditional search engine crawlers. To solve this issue, a Web database crawling approach based on a query harvest rate model is proposed. The approach first samples the Web database and uses the sample to select multiple kinds of features and construct training instances automatically, which avoids manual labeling. It then learns a query harvest rate model from the training instances. Finally, in every crawling round it uses the model to select the most promising query term to submit, so as to crawl as much of the Web database as possible. Experimental results show that the proposed approach achieves high coverage of the Web database and overcomes the simplistic, empirical nature of traditional heuristic rules. The learned query harvest rate model can also be used effectively to crawl other Web databases in the same domain.
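     The crawling loop described above can be sketched as follows. This is an illustrative toy under stated assumptions, not the thesis's implementation: the harvest rate of a query is taken to be the fraction of returned records that are new, the learned regression model is replaced by a linear scoring function with hand-set weights, and the in-memory `DB`, the `search` function, and the single document-frequency feature are all hypothetical.

```python
def harvest_rate(returned_ids, seen_ids):
    """Fraction of the returned records that have not been downloaded yet."""
    if not returned_ids:
        return 0.0
    return len(returned_ids - seen_ids) / len(returned_ids)


def features(term, local):
    """Toy feature vector: a bias term and the term's document frequency in the local copy."""
    n = len(local)
    df = sum(1 for terms in local.values() if term in terms)
    return [1.0, df / n if n else 0.0]


def predict(weights, feats):
    """Linear model: predicted harvest rate as a weighted sum of the features."""
    return sum(w * f for w, f in zip(weights, feats))


def crawl(search, seed, weights, max_queries):
    """Greedy crawl: each round submits the candidate term with the highest predicted rate."""
    local = {}            # record id -> set of terms downloaded so far
    issued = set()
    candidates = {seed}
    log = []              # (term, observed harvest rate) per round
    for _ in range(max_queries):
        pool = candidates - issued
        if not pool:
            break
        term = max(sorted(pool), key=lambda t: predict(weights, features(t, local)))
        issued.add(term)
        result = search(term)
        log.append((term, harvest_rate(set(result), set(local))))
        for rid, terms in result.items():
            local[rid] = terms
            candidates |= terms   # terms of new records become candidate queries
    return local, log


# Hypothetical Web database hidden behind a keyword query interface.
DB = {
    1: {"java", "web"},
    2: {"java", "data"},
    3: {"python", "data"},
    4: {"python", "mining"},
    5: {"mining", "web"},
}


def search(term):
    return {rid: terms for rid, terms in DB.items() if term in terms}


# Assumed weights: terms that are already frequent locally are expected to return fewer new records.
local, log = crawl(search, "java", [1.0, -1.0], max_queries=4)
print(len(local), log)
```

     In the approach described above the weights would come from a model fitted on automatically built training instances; here they are fixed by hand only to keep the sketch self-contained.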
     3. A data extraction approach for the Deep Web based on hierarchical clustering is proposed to solve the problem of extracting structured data from massive numbers of Deep Web pages.
     The structure of Deep Web pages is so complex that the structured data in them are difficult to process automatically. To solve this issue, a data extraction approach for the Deep Web based on hierarchical clustering is proposed. The approach uses information from the query result list page to recognize the content blocks in a Deep Web page, which determines the area for data extraction. It then combines structural and content features from multiple Deep Web pages and clusters the content feature vectors in corresponding content blocks of these pages to extract Web data records. Experimental results show that the proposed approach improves extraction accuracy and overcomes the limitation of traditional approaches that use only the structural information of the page itself.
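     As a rough illustration of the clustering step, the sketch below runs agglomerative (bottom-up) hierarchical clustering with average linkage over content-node feature vectors and stops merging at a distance threshold. The feature layout (DOM depth, normalized text length, numeric flag), the sample vectors, and the threshold are assumptions made for the example; the abstract does not specify the features or linkage used in the thesis.

```python
import math


def average_linkage(c1, c2, vectors):
    """Mean pairwise Euclidean distance between the members of two clusters."""
    total = sum(math.dist(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(vectors, threshold):
    """Merge the two closest clusters repeatedly until the closest pair exceeds `threshold`."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], vectors)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]   # b > a, so index a stays valid
    return clusters


# Hypothetical content nodes from several pages: (DOM depth, normalized text length, is_numeric).
vectors = [
    (3, 0.80, 0),   # long text node, page 1
    (3, 0.75, 0),   # long text node, page 2
    (4, 0.10, 1),   # short numeric node, page 1
    (4, 0.12, 1),   # short numeric node, page 2
    (3, 0.78, 0),   # long text node, page 3
]
clusters = agglomerate(vectors, threshold=0.5)
print(sorted(sorted(c) for c in clusters))
```

     Nodes that end up in the same cluster play the same role across pages, which is what allows them to be aligned into one attribute of the extracted records.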
     4. A semantic annotation approach for Deep Web data based on constrained conditional random fields is proposed to solve the problems of labeling attributes that lack semantic labels and of schema heterogeneity among data records from multiple Web sites.
     The extracted Web data records need to be annotated, but relying only on the semantic labels already present in Deep Web pages leaves data elements without labels unannotated. In addition, different sites often use different semantic labels, which results in schema heterogeneity among their data records. To solve these issues, a semantic annotation approach for Deep Web data based on constrained conditional random fields is proposed. The approach introduces confidence constraints and logical constraints to make use of existing Web database information and of the logical relationships among Web data elements. It then incorporates an inference procedure based on integer linear programming and extends traditional conditional random fields so that both kinds of constraints are supported naturally and efficiently. It uses the global attribute labels of the domain Web database schema to annotate every data element in the Web data records. Experimental results show that the proposed approach improves the accuracy of semantic annotation and overcomes the limitation of traditional conditional random fields, which cannot simultaneously use existing Web database information and the logical relationships among Web data elements.
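     The effect of adding constraints to sequence labeling can be shown on a tiny example. The sketch below is only an illustration: it replaces the integer linear programming solver with exhaustive search over label sequences, which finds the same constrained optimum on very small inputs, and all scores, labels, and constraints are invented rather than learned. One constraint plays the role of a confidence constraint (an element known from an existing Web database), the other of a logical constraint (each label used at most once).

```python
from itertools import product


def decode(node_scores, trans_scores, labels, constraints):
    """Return the highest-scoring label sequence that satisfies every constraint."""
    best_seq, best_score = None, float("-inf")
    for seq in product(labels, repeat=len(node_scores)):
        if not all(check(seq) for check in constraints):
            continue
        score = sum(node_scores[i][y] for i, y in enumerate(seq))
        score += sum(trans_scores.get(pair, 0.0) for pair in zip(seq, seq[1:]))
        if score > best_score:
            best_seq, best_score = seq, score
    return best_seq, best_score


LABELS = ("title", "author", "publisher")
# Hypothetical per-element scores for the record ["Data Mining", "J. Han", "Morgan Kaufmann"].
node_scores = [
    {"title": 2.0, "author": 0.5, "publisher": 0.3},
    {"title": 0.2, "author": 1.8, "publisher": 0.4},
    {"title": 0.1, "author": 1.5, "publisher": 0.6},   # looks like a person name
]
trans_scores = {("title", "author"): 0.5, ("author", "publisher"): 0.5}


def known_publisher(seq):
    """Confidence-style constraint: element 2 matches a known publisher value."""
    return seq[2] == "publisher"


def labels_unique(seq):
    """Logical-style constraint: no label is assigned to two elements of one record."""
    return len(set(seq)) == len(seq)


unconstrained, _ = decode(node_scores, trans_scores, LABELS, [])
constrained, _ = decode(node_scores, trans_scores, LABELS, [known_publisher, labels_unique])
print(unconstrained)
print(constrained)
```

     Exhaustive search grows exponentially with record length, which is why a real system would hand the same objective and constraints to an integer linear programming solver instead.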
     5. A duplicate record detection approach based on unsupervised learning is proposed to solve the problem of massive duplicate record detection in the Deep Web.
     Because the Deep Web is large in scale and highly redundant, a duplicate record detection approach based on unsupervised learning is proposed. The approach first uses a cluster ensemble to select the initial training instances, which avoids manual labeling. It then applies SVM classification iteratively to construct a classification model, which improves the accuracy of the model. Finally, it uses a voting approach to combine the results of multiple classification models into a domain-level duplicate record detection model, which solves the problem of massive duplicate record detection. Experimental results show that the proposed approach achieves high duplicate record detection accuracy and that the domain-level model performs well, overcoming the inability of traditional approaches to carry out massive duplicate record detection.
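     The idea of learning a duplicate detector without hand-labeled data can be sketched in a few lines. This example is a deliberately simplified stand-in, not the thesis's pipeline: agreement among similarity measures replaces the cluster ensemble for picking seed examples, and a nearest-centroid classifier replaces the SVM in the iterative step. The record pairs, similarity values, and thresholds are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))


def select_seeds(pairs, high, low):
    """Seed labels: duplicates if every similarity is high, non-duplicates if every one is low."""
    pos = {i for i, v in enumerate(pairs) if min(v) >= high}
    neg = {i for i, v in enumerate(pairs) if max(v) <= low}
    return pos, neg


def self_train(pairs, pos, neg):
    """Iteratively label the most confidently classified remaining pair and retrain."""
    pos, neg = set(pos), set(neg)
    if not pos or not neg:
        raise ValueError("need at least one seed of each class")
    unlabeled = set(range(len(pairs))) - pos - neg
    while unlabeled:
        c_pos = centroid([pairs[i] for i in pos])
        c_neg = centroid([pairs[i] for i in neg])
        # Positive margin: closer to the duplicate centroid than to the non-duplicate one.
        margins = {
            i: math.dist(pairs[i], c_neg) - math.dist(pairs[i], c_pos)
            for i in unlabeled
        }
        best = max(sorted(unlabeled), key=lambda i: abs(margins[i]))
        (pos if margins[best] > 0 else neg).add(best)
        unlabeled.remove(best)
    return pos, neg


# Hypothetical record pairs as (title similarity, author similarity).
pairs = [
    (0.95, 0.90), (0.92, 0.88),   # near-identical records
    (0.10, 0.05), (0.15, 0.20),   # clearly different records
    (0.70, 0.65), (0.35, 0.30),   # ambiguous pairs left for the iterative step
]
seed_pos, seed_neg = select_seeds(pairs, high=0.85, low=0.2)
duplicates, non_duplicates = self_train(pairs, seed_pos, seed_neg)
print(sorted(duplicates), sorted(non_duplicates))
```

     Labeling only the most confident pair in each round limits the risk that an early mistake on an ambiguous pair contaminates the training set.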
引文
[1]M K Bergman.The Deep Web:Surfacing Hidden Value.2001, http://www.press.umich.edu/jep/07-01/bergman.html.
    [2]D Florescu, A Levy,A Mendelzon, Database techniques for the World-Wide Web:a survey, SIGMOD Record,1998,27(3):59-74.
    [3]C Sherman.The Invisible Web.2001, http://www.freepint.com/issues/ 08060.htm.
    [4]K C Chang, B He, C Li, M Patel,Z Zhang, Structured databases on the web: observations and implications, SIGMOD Record,2004,33(3):61-70.
    [5]B He, M Patel, Z Zhang,K C Chang, Accessing the deep web.A Survey, Communications of the ACM,2007,50(5):94-101.
    [6]H He, W Meng, C Yu,Z Wu, Wise-integrator:an automatic integrator of web search interfaces for E-commerce, In:Proceedings of the 29th International Conference on Very large data bases,Berlin, Germany,2003,pp.357-368.
    [7]J D Lafferty, A McCallum,F C Pereira, Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, In: Proceedings of the Eighteenth International Conference on Machine Learning,San Francisco, CA, USA,2001, pp.282-289.
    [8]S Raghavan,H Garcia-Molina, Crawling the Hidden Web, In:Proceedings of the 27th International Conference on Very Large Data Bases,San Francisco, CA, USA,2001, pp.129-138.
    [9]K C Chang, Bin He, Toward Large Scale Integration:Building a MetaQuerier over Databases on the Web, In:Conference on Innovative Data Systems Research,Asilomar, CA, USA,2005, pp.44-55.
    [10]B He, K C Chang,J Han, Discovering complex matchings across web query interfaces:a correlation mining approach, In:Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,New York, NY, USA,2004, pp.148-157.
    [11]Z Zhang, B He,K C Chang, Light-weight domain-based form assistant: querying web databases on the fly, In:Proceedings of the 31st International Conference on Very large data bases,Trondheim, Norway,2005, pp.97-108.
    [12]B He,K C Chang, Making holistic schema matching robust:an ensemble approach, In:Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining,New York, NY, USA,2005, pp.429-438.
    [13]. K C Chang, B He,Z Zhang, Mining semantics for large scale integration on the web:evidences, insights, and challenges, SIGKDD Explorations,2004,6 (2):67-76.
    [14]B He,K C Chang, Statistical schema matching across web query interfaces, In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data,New York, NY, USA,2003, pp.217-228.
    [15]K C Chang,H Garcia-Molina, Approximate Query Translation Across Heterogeneous Information Sources, In:Proceedings of the 26th International Conference on Very Large Data Bases,San Francisco, CA, USA,2000, pp. 566-577.
    [16]T Cheng,K C Chang, Entity Search Engine:Towards Agile Best-Effort Information Integration over the Web, In:Proceedings of the Third Conference on Innovative Data Systems Research,Asilomar, CA, USA,2007, pp.108-113.
    [17]T Cheng, X Yan,K C Chang, EntityRank:searching entities directly and holistically, In:Proceedings of the 33rd International Conference on Very Large Data Bases,Vienna, Austria,2007, pp.387-398.
    [18]Y Lu, H He, Q Peng, W Meng,C Yu, Clustering e-commerce search engines based on their search interface pages using WISE-cluster, Data&Knowledge Engineering,2006,59(2):231-246.
    [19]E C Dragut, T Kabisch, C Yu,U Leser, A hierarchical approach to model web query interfaces for web source integration, Proceedings of the VLDB Endowment,2009,2(1):325-336.
    [20]E Dragut, W Wu, P Sistla, C Yu,W Meng, Merging Source Query Interfaces on Web Databases, In:Proceedings of the 22nd International Conference on Data Engineering,Washington, DC, USA,2006, pp.46.
    [21]H He, W Meng, Y Lu, C Yu,Z Wu, Towards Deeper Understanding of the Search Interfaces of the Deep Web, World Wide Web,2007,10(2):133-155.
    [22]W Wu, C Yu, A H Doan,W Meng, An interactive clustering-based approach to integrating source query interfaces on the deep Web, In:Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data,New York, NY, USA,2004, pp.95-106.
    [23]R B Doorenbos, O Etzioni,D S Weld, A scalable comparison-shopping agent for the World-Wide Web, In:Proceedings of the First International Conference on Autonomous Agents,New York, NY, USA,1997, pp.39-48.
    [24]P G Ipeirotis, L Gravano,M Sahami, Automatic Classification of Text Databases Through Query Probing, In:Proceedings of the ACM SIGMOD Workshop on the Web and Databases,2000, pp.117-122.
    [25]P G Ipeirotis, A Ntoulas, J Cho,L Gravano, Modeling and Managing Content Changes in Text Databases, In:Proceedings of the 21st International Conference on Data Engineering,Washington, DC, USA,2005, pp.606-617.
    [26]P G Ipeirotis,L Gravano, When one sample is not enough:improving text database selection using shrinkage, In:Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data,New York, NY, USA,2004, pp.767-778.
    [27]L Gravano, P G Ipeirotis,M Sahami, QProber:A system for automatic classification of hidden-Web databases, ACM Transactions on Information Systems,2003,21(1):1-41.
    [28]J Wang, J R Wen, F Lochovsky,W Y Ma, Instance-based schema matching for web databases by domain-specific query probing, In:Proceedings of the Thirtieth International Conference on Very Large Data Bases,Toronto, Canada,2004, pp.408-419.
    [29]刘伟,孟小峰,Deep Web数据集成研究综述,计算机学报,2007,14(1): 28-35.
    [30]X Meng,W Liu, Vision-based Web Data Records Extraction, In:Proceedings of the 9th SIGMOD International Workshop on the Web and Databases,Chicago, Illinois, USA,2006, pp.20-25.
    [31]凌妍妍,刘伟,王仲远,艾静,孟小峰,Deep Web数据集成中的实体识别方法,计算机研究与发展,2006,43(增刊):46-53.
    [32]W Liu, X Li, Y Ling, X Zhang,X Meng, A Deep Web Data Integration System for Job Search, Wuhan University Journal of Natural Sciences,2006,11(5): 1197-1201.
    [33]寇月,申德荣,李冬,聂铁铮,一种基于语义及统计分析的Deep Web实体识别机制,2008,19(2):194-208.
    [34]Y Kou, D Shen, G Yu,T Nie, Combining Local Scoring and Global Aggregation to Rank Entities for Deep Web Queries, Journal of Computer Science and Technology,2009,24(4):626-637.
    [35]王辉,刘艳威,左万利,使用分类器自动发现特定领域的深度网入口,软件学报,2008,19(2):246-256.
    [36]徐和祥,王鑫印,王述云,胡运发,基于知识的Deep Web集成环境变化处理的研究,软件学报,2008,19(2):257-266.
    [37]袁柳,李战怀,陈世亮,基于本体的Deep Web数据标注,软件学报,2008,19(2):237-245.
    [38]P Zhao, Z Cui, L Gao,H Zhong, Vision-based Deep Web Query Interfaces Automatic Extraction, Journal of Computional Information System,2007, 3(4):1441-1448.
    [39]W Wu, A H Doan,C Yu, WebIQ:Learning from the Web to Match Deep-Web Query Interfaces, In:Proceedings of the 22nd International Conference on Data Engineering,Washington, DC, USA,2006, pp.44.
    [40]W S Li, C Clifton,S Y Liu, Database Integration Using Neural Networks: Implementation and Experiences, Knowledge and Information Systems,2000, 2(1):73-96.
    [41]J Berlin,A Motro, Database Schema Matching Using Machine Learning with Feature Selection, In:Proceedings of the 14th International Conference on Advanced Information Systems Engineering,London, UK,2002, pp.452-466.
    [42]J Madhavan, P A Bernstein,E Rahm, Generic Schema Matching with Cupid, In: Proceedings of the 27th International Conference on Very Large Data Bases,San Francisco, CA,USA,2001, pp.49-58.
    [43]A Doan, P Domingos,A Halvey, Reconciling schemas of disparate data sources:A machine-learning apporach, In:Proceedings of the 2001 SIGMOD International Conference on Management of Data,New York,2001, pp. 509-520.
    [44]H H Do,E Rahm, COMA:a system for flexible combination of schema matching approaches, In:Proceedings of the 28th International Conference on Very Large Data Bases,San Francisco, CA, USA,2002, pp.610-621.
    [45]S Melnik, H Garcia-Molina,E Rahm, Similarity Flooding:A Versatile Graph Matching Algorithm and Its Application to Schema Matching, In:Proceedings of the 18th International Conference on Data Engineering,Washington, DC, USA,2002, pp.117.
    [46]R Dhamankar, Y Lee, A H Doan, A Halevy,P Domingos, iMAP:discovering complex semantic matches between database schemas, In:Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data,New York, NY, USA,2004, pp.383-394.
    [47]L Xu,D W Embley, A composite approach to automating direct and indirect schema mappings, Information Systems,2006,31(8):697-732.
    [48]Y Lee, M Sayyadian, A H Doan,A S Rosenthal, eTuner:tuning schema matching software using synthetic scenarios, The VLDB Journal,2007,16 (1): 97-122.
    [49]S Berkovsky, Y Eytani,A Gal, Measuring the Relative Performance of Schema Matchers, In:Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence,Washington, DC, USA,2005, pp.366-371.
    [50]J Madhavan, P A Bernstein, A H Doan,A Halevy, Corpus-Based Schema Matching,In:Proceedings of the 21st International Conference on Data Engineering,Washington, DC, USA,2005, pp.57-68.
    [51]A Gal, Managing Uncertainty in Schema Matching with Top-K Schema Mappings,2006,6:90-114.
    [52]P Wu, J R Wen, H Liu,W Y Ma, Query Selection Techniques for Efficient Crawling of Structured Web Sources, In:Proceedings of the 22nd International Conference on Data Engineering,Washington, DC, USA,2006, pp.47-56.
    [53]J Madhavan, D Ko, L Kot, V Ganapathy, A Rasmussen,A Halevy, et al., Google's Deep Web crawl, Proceedings of the VLDB Endowment,2008,1(2): 1241-1252.
    [54]L Barbosa,J Freire, Siphoning Hidden-Web Data through Keyword-Based Interfaces, In:XIX Simposio Brasileiro de Bancos de Dados,Distrito Federal, Brasil, Anais,2004, pp.309-321.
    [55]A Ntoulas, P Zerfos,J Cho, Downloading textual hidden web content through keyword queries, In:Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital libraries,New York, NY, USA,2005, pp.100-109.
    [56]J Lu, Y Wang, J Liang, J Chen,J Liu, An Approach to Deep Web Crawling by Sampling, In:Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology,Washington, DC, USA,2008, pp.718-724.
    [57]C H Chang, M Kayed, M R Girgis,K F Shaalan, A Survey of Web Information Extraction Systems, IEEE Transaction on Knowledge and Data Engineering, 2006,18(10):1411-1428.
    [58]A H Laender, B A Ribeiro-Neto, S A da,J S Teixeira, A brief survey of web data extraction tools, SIGMOD Record,2002,31(2):84-93.
    [59]J Hammer, J McHugh,H Garcia-Molina, Semistructured Data:The TSIMMIS Experience, In:Proceedings of the First East-European Workshop on Advances in Databases and Information Systems,Petersburg,Russia,1997, pp. 1-8.
    [60]V Crescenzi,G Mecca, Grammars have exceptions, Information System,1998, 23(9):539-565.
    [61]G O Arocena,A O Mendelzon, WebOQL:restructuring documents, databases, and webs, Theory and Practice of Object Systems,1999,5(3):127-141.
    [62]G Mecca, P Atzeni, A Masci, G Sindoni,P Merialdo, The Araneus Web-based management system, SIGMOD Record,1998,27(2):544-546.
    [63]N Kushmerick, Wrapper induction:efficiency and expressiveness, Artificial Intelligence,2000,118(1):15-68.
    [64]C N Hsu,M T Dung, Generating finite-state transducers for semi-structured data extraction from the Web, Information System,1998,23(9):521-538.
    [65]C Hsu, Initial results on wrapping semi-structured Web pages with finite-state transducers and contextual rules, In:Workshop on AI and Information Integration, in conjuction with the 15th National Conference on Artificial Intelligence,1998, pp.436-441.
    [66]I Muslea, S Minton,C A Knoblock, Hierarchical Wrapper Induction for Semistructured Information Sources, Autonomous Agents and Multi-Agent Systems,2001,4(1):93-114.
    [67]A Sahuguet,F Azavant, Building intelligent web applications using lightweight wrappers, Data Knowledge Engnieering,2001,36(3):283-316.
    [68]L Liu, C Pu,W Han, XWRAP:An XML-Enabled Wrapper Construction System for Web Information Sources, In:Proceedings of the 16th International Conference on Data Engineering,Washington, DC, USA,2000, pp.611.
    [69]D J Buttler, L Liu,C Pu, A Fully Automated Object Extraction System for the World Wide Web, In:Proceedings of the The 21st International Conference on Distributed Computing Systems,Washington, DC, USA,2001, pp.361.
    [70]C H Chang, C N Hsu,S C Lui, Automatic information extraction from semi-structured Web pages by pattern discovery, Decision Support Systems, 2003,35(1):129-147.
    [71]V Crescenzi, G Mecca,P Merialdo, RoadRunner:automatic data extraction from data-intensive web sites,In:Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data,New York, NY, USA,2002, pp.624-624.
    [72]B Liu, R Grossman,Y Zhai, Mining data records in Web pages, In: Proceedings of the ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,New York, NY, USA,2003, pp. 601-606.
    [73]Y Zhai,B Liu, Web data extraction based on partial tree alignment, In: Proceedings of the 14th International Conference on World Wide Web,New York, NY, USA,2005, pp.76-85.
    [74]A Arasu,H Garcia-Molina, Extracting structured data from Web pages, In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data,New York, NY, USA,2003, pp.337-348.
    [75]H Zhao, W Meng, Z Wu, V Raghavan,C Yu, Fully automatic wrapper generation for search engines, In:Proceedings of the 14th International Conference on World Wide Web,New York, NY, USA,2005, pp.66-75.
    [76]K Simon,G Lausen, ViPER:augmenting automatic information extraction with visual perceptions, In:Proceedings of the 14th ACM International Conference on Information and Knowledge Management,New York, NY, USA,2005, pp.381-388.
    [77]W Liu, X Meng,W Meng, ViDE:A Vision-based Approach for Deep Web Data Extraction, IEEE Transactions on Knowledge and Data Engineering,2010, 22(3):447-460.
    [78]杨少华,林海略,韩燕波,针对模板生成网页的一种数据自动抽取方法,软件学报,2008,19(2):209-223.
    [79]胡东东,孟小峰,一种基于树结构的Web数据自动抽取方法,计算机研究与发展,2004,41(10):1607-1613.
    [80]L Arlotta, V Crescenzi, G Mecca, P Merialdo,U R Tre, Automatic Annotation of Data Extracted from Large Web Sites, In:Proceedings of the 6th International Workshop on the Web and Databases,San Diego,2003, pp.7-12.
    [81]J Wang,F H Lochovsky, Data extraction and label assignment for web databases, In:Proceedings of the 12th International Conference on World Wide Web,New York, NY, USA,2003, pp.187-196.
    [82]H He, W Meng, H Zhao,C Yu, Annotating structured data of the deep Web, In: Proceedings of the 23th International Conference on Data Engineering,2007, pp.376-385.
    [83]Z Nie, F Wu, J R Wen,W Y Ma, Extracting Objects from the Web, In: Proceedings of the 22nd International Conference on Data Engineering,2006, pp.123.
    [84]T Hastie, R Tibshirani,J Friedman. The elements of statistical learning New York,Berlin,Heidelberg:, Springer,2001.
    [85]A P Dempster, N M Laird,D B Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society,1977, B(39):1-38.
    [86]Improved decision rules in the Felligi-Sunter model of record linkage. Technical Report Statistical Research Report Series RR93/12.
    [87]D Cohn, L Atlas,R Ladner, Improving Generalization with Active Learning, Machine Learning,1994,15(2):201-221.
    [88]BL,FJH,ORA. Classification and Regression Trees, CRC Press,1984.
    [89]M Bilenko, R Mooney, W Cohen, P Ravikumar,S Fienberg, Adaptive Name Matching in Information Integration, IEEE Intelligent Systems,2003,18(5): 16-23.
    [90]T Joachims. Making large-scale support vector machine learning practical, MIT-Press,1999, p.169-184.
    [91]W W Cohen,J Richman, Learning to match and cluster large high-dimensional data sets for data integration, In:Proceedings of the eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,New York, NY, USA,2002, pp.475-480.
    [92]A Mccallum,B Wellner, Conditional models of identity uncertainty with application to noun coreference, In:Advances in Neural Information Processing System,Vancouver,2004.
    [93]S Sarawagi,A Bhamidipaty, Interactive deduplication using active learning, In: Proceedings of the eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,New York, NY, USA,2002, pp. 269-278.
    [94]S Tejada, C A Knoblock,S Minton, Learning domain-independent string transformation weights for high accuracy object identification, In: Proceedings of the eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,New York, NY, USA,2002, pp. 350-359.
    [95]R Ananthakrishna, S Chaudhuri,V Ganti, Eliminating fuzzy duplicates in data warehouses, In:Proceedings of the 28th International Conference on Very Large Data Bases,Hong Kong, China,2002, pp.586-597.
    [96]S Guha, N Koudas, A Marathe.D Srivastava, Merging the results of approximate match operations, In:Proceedings of the Thirtieth International Conference on Very Large Data Bases,Toronto, Canada,2004, pp.636-647.
    [97]S Chaudhuri, V Ganti,R Motwani, Robust Identification of Fuzzy Duplicates, In:Proceedings of the 21st International Conference on Data Engineering,Washington, DC, USA,2005, pp.865-876.
    [98]Y R Wang,S E Madnick, The Inter-Database Instance Identification Problem in Integrating Autonomous Systems, In:Proceedings of the 5th International Conference on Data Engineering,Washington, DC, USA,1989, pp.46-55.
    [99]M A Hem,S J Stolfo, Real-world Data is Dirty:Data Cleansing and The Merge/Purge Problem, Data Mining and Knowledge Discovery,1998,2(1): 9-37.
    [100]H Galhardas, D Florescu, D Shasha, E Simon,C A Saita, Declarative Data Cleaning:Language, Model, and Algorithms, In:Proceedings of the 27th International Conference on Very Large Data Bases,San Francisco, CA, USA,2001,pp.371-380.
    [101]刘伟Deep Web数据集成中的关键技术研究.博士学位论文,中国人民大学,2008.
    [102]W Su, J Wang,F Lochovsky, Record Matching over Query Results from Multiple Web Databases, IEEE Transtraction on Knowledge and Data Engineering,2010,22 (4):578-589.
    [103]黄健斌,姬红兵,近似重复记录的自适应距离度量检测,西安电子科技大学学报,2007,34(2):126-130.
    [104]K T Yong, CMC:Combining Multiple Schema-Matching Strategies Based on Credibility Prediction, In:Proceedings of 10th International Database Systems for Advanced Applications,Beijing,China,April 17-20,2005, pp. 888-893.
    [105]H Zhongtian, Hong, Jun,D Bell, Schema Matching across Query Interfaces on the Deep Web, In:Proc of the 25th British National Conference on Databases (BNCOD 2008),Cardiff, UK,2008, pp.51-62.
    [106]J Hong, He, Zhongtian, Bell,David, An Evidential Approach to Query Interface Matching on the Deep Web, In:Proceedings of the International Workshop on New Trends in Information Integration,Auckland, New Zealand,2008, pp.20-23.
    [107]A P Dempster, Upper and lower probabilities induced by multivalued mapping, The Annals of Mathematical Statistics,1967,38(2):325-339.
    [108]G Shafer. A Mathematical Theory of Evidence, Princeton University Press, 1976.
    [109]A P Dempster, Upper and lower probabilities induced by multivalued mapping, 1967,38(2):325-339.
    [110]G Shafer. A Mathematical Theory of Evidence, Princeton University Press, 1976.
    [111]WordNet.http://wordnet.princeton.edu/.
    [112]P Hall,G Dowling, Approximate string matching, Computing Surveys,1980: 381-402.
    [113]W Cohen, P Ravikumar,S Fienberg, A comparison of string distance metrics for name-matching tasks, In:Proceedings of the 2th Internaltional Workshop on Information Integration on the Web,2003, pp.73-78.
    [114]R C van, Information Retrieval, Butterworths,1979.
    [115]Y Zhai,B Liu, Web data extraction based on partial tree alignment, In: Proceedings of the 14th international conference on World Wide Web,New York, NY, USA,2005, pp.76-85.
    [116]凌妍妍,孟小峰,刘伟,基于属性相关度的Web数据库大小估算方法,软件学报,2008,19(2):224-236.
    [117]刘伟,孟小峰,凌妍妍,一种基于图模型的Web数据库采样方法,软件学报,2008,19(2):179-193.
    [118]Y Cao, J Xu, T Y Liu, H Li, Y Huang,H W Hon, et al., Adapting ranking SVM to document retrieval, In:Proceedings of the 29th Annual International ACM SIGIR Conference on Research and development in Information Retrieval,New York, NY, USA,2006, pp.186-193.
    [119]B Y Ricardo,R N Berthier. Modern Information Retrieval, ACM Press,1999.
    [120]S E Robertson, Overview of the Okapi Projects, Journal of Documentation, 1997,53(1):3-7.
    [121]O Bretscher. Linear Algebra with Applications, Prentice Hall,1995.
    [122]http://www.w3.org/DOM/.
    [123]L Yi, B Liu,X Li, Eliminating noisy information in Web pages for data mining, In:Proceedings of the ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,New York, NY, USA,2003, pp. 296-305.
    [124]R Song, H Liu, J R Wen,W Y Ma, Learning block importance models for web pages, In:Proceedings of the 13th International Conference on World Wide Web,New York, NY, USA,2004, pp.203-211.
    [125]G Salton, A Wong,C S Yang, A vector space model for automatic indexing, Communications of the ACM,1975,18(11):613-620.
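    [126]G Salton,M J McGill. Introduction to Modern Information Retrieval, New York, NY, USA: McGraw-Hill, Inc.,1986.
    [127]K Yu. Research on semi-structured information extraction from the Internet. Ph.D. dissertation, University of Science and Technology of China,2005 (in Chinese).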
    [128]J Zhu, Z Nie, J R Wen, B Zhang,W Y Ma, 2D Conditional Random Fields for Web information extraction, In:Proceedings of the 22nd International Conference on Machine Learning,New York, NY, USA,2005, pp.1044-1051.
    [129]J Zhu, Z Nie, J R Wen, B Zhang,W Y Ma, Simultaneous record detection and attribute labeling in web data extraction, In:Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,New York, NY, USA,2006, pp.494-503.
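    [130]F Sha,F Pereira, Shallow parsing with conditional random fields, In:Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology,Morristown, NJ, USA,2003, pp.134-141.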
    [131]D C Liu,J Nocedal, On the limited memory BFGS method for large scale optimization, Math. Program.,1989,45(3):503-528.
    [132]T Kristjansson, A Culotta, P Viola,A McCallum, Interactive information extraction with constrained conditional random fields, In:Proceedings of the 19th National Conference on Artificial Intelligence,2004, pp.412-418.
    [133]L Wolsey. Integer Programming, John Wiley&Sons,Inc,1998.
    [134]Multi-Class Support Vector Machine, http://svmlight.joachims.org/svm_multiclass.html.
    [135]C Gueret, C Prins, M Sevaux,S Heipcke. Applications of Optimization with Xpress-MP, Dash Optimization Ltd.,2002.
    [136]M G Elfeky, A K Elmagarmid,V S Verykios, TAILOR:A Record Linkage Tool Box, In:Proceedings of the 18th International Conference on Data Engineering,Washington, DC, USA,2002, pp.17.
    [137]L Gu,R Baxter, Decision Models for Record Linkage, Lecture Notes in Computer Science,2006,3755:146-160.
    [138]T de Vries, H Ke, S Chawla,P Christen, Robust record linkage blocking using suffix arrays, In:Proceedings of the 18th ACM Conference on Information and Knowledge Management,New York, NY, USA,2009, pp.305-314.
    [139]S Chaudhuri, K Ganjam, V Ganti,R Motwani, Robust and efficient fuzzy match for online data cleaning, In:Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data,New York, NY, USA,2003, pp.313-324.
    [140]W W Cohen, P D Ravikumar,S E Fienberg, A Comparison of String Distance Metrics for Name-Matching Tasks, In:Proceedings of IJCAI-03 Workshop on Information Integration on the Web,Acapulco, Mexico,2003, pp.73-78.
    [141]W Tang,Z H Zhou, Bagging-based selective clusterer ensemble, Journal of Software,2005,16(4):496-502 (in Chinese).
    [142]A Strehl,J Ghosh, Cluster ensembles-a knowledge reuse framework for combining multiple partitions, Journal of Machine Learning Research,2003,3: 583-617.
    [1]Dragut E C, Yu C,Meng W. Meaningful labeling of integrated query interfaces. In Proc. of the 32nd international conference on Very large data bases,Seoul,Korea,September 12-15,2006, pp.679-690.
    [2]He B,Chang K C. Statistical schema matching across web query interfaces. In Proc. of the 2003 ACM SIGMOD international conference on Management of data, San Diego,California,USA,June 9-12,2003, pp.217-228.
    [3]Wu W, Yu C, Doan A H,Meng W. An interactive clustering-based approach to integrating source query interfaces on the deep Web. In Proc. of the 2004 ACM SIGMOD international conference on Management of data,Paris,France,June 13-18, 2004, pp.95-106.
    [4]Wu W, Doan A H,Yu C. Merging Interface Schemas on the Deep Web via Clustering Aggregation. In Proc. of the Fifth IEEE International Conference on Data Mining,Houston,Texas,USA,November 27-30,2005, pp.801-804.
    [5]Hong J, He Z,Bell D. An Evidential Approach to Query Interface Matching on the Deep Web. In Proc. of the International Workshop on New Trends in Information Integration,Auckland, New Zealand,August 23,2008, pp.20-23.
    [6]He Z, Hong J,Bell D. Schema Matching across Query Interfaces on the Deep Web. In Proc. of the 25th British National Conference on Databases (BNCOD 2008),Cardiff,UK,July 7-10,2008, pp.51-62.
    [7]He H, Meng W, Yu C T,Wu Z. Wise-integrator:An automatic integrator of web search interfaces for e-commerce. In Proc. of the 29th international conference on Very Large Data Bases,Berlin,Germany,September 9-12,2003, pp.357-368.
    [8]Dempster A P. Upper and lower probabilities induced by multivalued mapping. The Annals of Mathematical Statistics,1967,38(2):325-339.
    [9]Rahm E,Bernstein P A. A survey of approaches to automatic schema matching. The VLDB Journal,2001,10 (4):334-350.
    [10]He B, Chang K C,Han J. Discovering complex matchings across web query interfaces:a correlation mining approach. In Proc. of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining,Seattle, WA, USA,August 22-25,2004, pp.148-157.
    [11]Do H H,Rahm E. COMA:a system for flexible combination of schema matching approaches. In Proc. of the 28th international conference on Very Large Data Bases,Hong Kong, China,August 20-23,2002, pp.610-621.
    [12]Madhavan J, Bernstein P A,Rahm E. Generic Schema Matching with Cupid. In Proc. of the 27th International Conference on Very Large Data Bases,Roma, Italy,September 11-14,2001, pp.49-58.
    [13]Tu K,Yu Y. CMC:Combining Multiple Schema-Matching Strategies Based on Credibility Prediction. In Proc. of 10th International Database Systems for Advanced Applications,Beijing, China,April 17-20,2005, pp.888-893.
    [14]Doan A, Domingos P,Halevy A. Reconciling schemas of disparate data sources:A machine-learning approach. In Proc. of the 2001 SIGMOD International Conference on Management of Data,Santa Barbara, CA, USA,2001, pp.509-520.
    [15]Shafer G. A Mathematical Theory of Evidence, Princeton University Press,1976.
    [16]Hall P A,Dowling G R. Approximate String Matching. ACM Computing Surveys, 1980,12 (4):381-402.
    [17]Cohen W, Ravikumar P,Fienberg S. A comparison of string distance metrics for name-matching tasks. In Proc. of the IJCAI-03 Workshop on Information Integration on the Web,Acapulco, Mexico,August 9-10,2003, pp.73-78.
    [18]ICQ Query Interfaces dataset. http://metaquerier.cs.uiuc.edu/repository/datasets/icq/index.html.
    [19]van Rijsbergen C J. Information Retrieval, Butterworths,1979.
    [20]Wu W, Doan A H,Yu C. WebIQ:Learning from the Web to Match Deep-Web Query Interfaces. In Proc. of the 22nd International Conference on Data Engineering,Atlanta, GA, USA,April 3-8,2006, pp.44-53.
    [1]Z. Nie, F. Wu, J.-R. Wen and W.-Y. Ma, Extracting Objects from the Web, in Proceedings of the 22nd International Conference on Data Engineering,2006, pp. 123-123.
    [2]L. M. Haas, Beauty and the Beast:The Theory and Practice of Information Integration, in Proceedings of the 12th International Conference on Database Theory, 2007, pp.28-43.
    [3]J. D. Lafferty, A. McCallum and F. C. Pereira, Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, in Proceedings of the Eighteenth International Conference on Machine Learning,2001, pp.282-289.
    [4]D. W. Embley, D. M. Campbell, Y. S. Jiang, S. W. Liddle, D. W. Lonsdale and Y. Ng, et al., Conceptual-model-based data extraction from multiple-record Web pages, Data Knowl. Eng.,1999,31(3):227-251.
    [5]S. Mukherjee, I. V. Ramakrishnan and A. Singh, Bootstrapping Semantic Annotation for Content-Rich HTML Documents, in Proceedings of the 21st International Conference on Data Engineering,2005, pp.583-593.
    [6]L. Arlotta, V. Crescenzi, G. Mecca and P. Merialdo, Automatic Annotation of Data Extracted from Large Web Sites, in Proceedings of Sixth International Workshop on the Web and Databases,2003, pp.7-12.
    [7]T. Kristjansson, A. Culotta, P. Viola and A. McCallum, Interactive information extraction with constrained conditional random fields, in Proceedings of the 19th national conference on Artificial intelligence,2004, pp.412-418.
    [8]D. Roth and W. Yih, Integer linear programming inference for conditional random fields, in Proceedings of the 22nd international conference on Machine learning,2005, pp.736-743.
    [9]V. Punyakanok, D. Roth, W. Yih and D. Zimak, Semantic role labeling via integer linear programming inference, in Proceedings of the 20th international conference on Computational Linguistics,2004, pp.1346-1352.
    [10]W. W. Cohen and S. Sarawagi, Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods, in Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining,2004, pp.89-98.
    [11]J. M. Hammersley and P. Clifford, Markov fields on finite graphs and lattices, unpublished manuscript,1971.
    [12]D. C. Liu and J. Nocedal, On the limited memory BFGS method for large scale optimization, Math. Program.,1989,45 (3):503-528.
    [13]L. Wolsey. Integer Programming, John Wiley&Sons,Inc,1998.
    [14]C. Gueret, C. Prins, M. Sevaux and S. Heipcke. Applications of Optimization with Xpress-MP, Dash Optimization Ltd.,2002.
    [15]Multi-Class Support Vector Machine. http://svmlight.joachims.org/svm_multiclass.html.
