Research on Technology of Deep Web Query Interface Matching

Original title: Deep Web查询接口匹配技术研究
Author: Cao Qinghuang (曹庆皇)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: complex matching; Deep Web; association mining; clustering; semantic net; mutual information
Degree year: 2009
Advisor: Ju Shiguang (鞠时光)
Discipline code: 081203
Degree-granting institution: Jiangsu University (江苏大学)
Thesis submission date: 2009-11-01
Chair of the defense committee: Zhan Yongzhao (詹永照)

Abstract

The rapid development of Internet technology has led to the wide use of Web databases. These databases are hidden behind query interfaces, and users can obtain their contents only by submitting requests through the local query interfaces. Because search engines cannot reach this information by following hyperlinks, it is called Deep Web information. Given the huge volume of Deep Web information, its high data quality, and its well-formatted structure, building a Deep Web data integration system is particularly important. Such a system classifies Web databases by domain and builds a unified query interface for each domain. A query submitted to the unified query interface is sent to multiple local query interfaces at the same time. Mapping a request from the unified query interface to each local query interface requires solving the query interface matching problem.
     Query interface matching is the foundation of a Deep Web data integration system. Because existing methods cannot handle complex matching between query interfaces effectively, this thesis proposes a new matching method: positive-correlation association mining discovers potential group attributes, each group is then treated as a single attribute, and attributes with the same semantics are clustered to obtain the matching. Finally, a Deep Web data integration system for the book search domain is implemented. The main work is as follows:
     (1) A method for generating group attributes based on association mining. To address the imprecise computation of attribute correlation, a correlation measure based on mutual information is designed. The measure reflects the characteristic of group attributes, which often occur together and rarely appear alone, and it handles both sparse attributes and high-frequency attributes. To improve efficiency, the notion of an "attribute matrix" is introduced: all computation is carried out on a matrix containing only 0s and 1s, so that complex probability calculations reduce to simple AND operations.
     (2) A method for generating synonym attributes based on semantic clustering. A semantic net is used to compute the semantic similarity between attributes. Because some attributes carry too little semantic information, the similarity of their data domains is also taken into account. Combining semantic similarity and domain similarity in a weighted computation improves the precision of the attribute similarity.
     (3) The design and implementation of a Deep Web data integration system for the book search domain, with an analysis of how the matching technique is applied in the system. The system itself is kept independent of the domain: all domain-related information is stored in a configuration file, so a system for a new domain can be set up quickly by changing that file.
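The attribute matrix idea from item (1) can be illustrated with a small sketch. The abstract does not give the thesis's actual correlation measure, so the code below uses plain pointwise mutual information as a stand-in. The sample interfaces, the attribute names, and the function names are all hypothetical and chosen only for illustration.

```python
import math

def build_attribute_matrix(interfaces, attributes):
    """Rows are query interfaces, columns are attributes; 1 means present."""
    return [[1 if a in iface else 0 for a in attributes] for iface in interfaces]

def cooccurrence_count(matrix, i, j):
    """Count interfaces that contain both attribute i and attribute j.

    On a 0/1 matrix this is just a sum of bitwise ANDs of two columns.
    """
    return sum(row[i] & row[j] for row in matrix)

def pmi(matrix, i, j):
    """Pointwise mutual information log2(P(i,j) / (P(i) * P(j))).

    Assumption: this is a generic stand-in, not the measure from the thesis.
    """
    n = len(matrix)
    n_i = sum(row[i] for row in matrix)
    n_j = sum(row[j] for row in matrix)
    n_ij = cooccurrence_count(matrix, i, j)
    if n_ij == 0:
        return float("-inf")  # the two attributes never co-occur
    return math.log2(n_ij * n / (n_i * n_j))

# Hypothetical book-search interfaces, each described by its attribute labels.
attributes = ["title", "author", "first name", "last name", "isbn"]
interfaces = [
    {"title", "author", "first name", "last name"},
    {"title", "first name", "last name"},
    {"title", "author"},
    {"title", "isbn"},
]
matrix = build_attribute_matrix(interfaces, attributes)

# "first name" and "last name" always appear together and never alone.
print(pmi(matrix, 2, 3))
# "title" appears on every interface, so co-occurring with it says little.
print(pmi(matrix, 0, 1))
```

In this toy data the pair "first name" / "last name" scores higher than "title" / "author", which matches the intuition that group attributes occur together and rarely alone. A production measure would also need the corrections for sparse and high-frequency attributes that the thesis describes.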
