Research on Several Techniques in Deep Web Integration
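Web information can be divided into two classes, the Surface Web and the Deep Web, according to the "depth" of the information it contains. The Surface Web is the set of pages that traditional search engines can index by following hyperlinks. The Deep Web consists of the dynamic pages produced when a filled-in form is submitted as a query to a site's back-end database. How to organize and manage Deep Web information effectively, and how to give users fast and accurate access to the information they need, is a major challenge for information science and technology. As dynamic page technology has matured and the volume of information held in the Deep Web has grown rapidly, access to Web databases has gradually become a principal means of obtaining information, and Deep Web research is attracting growing attention. As a key technology for organizing and processing large-scale Deep Web information, Deep Web data integration can to some extent meet users' need to access these "deep" databases on the Internet. Its related techniques also have broad application prospects in information retrieval, data mining, data extraction, personalized services, digital libraries, and other fields.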
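     (1) Research on Deep Web integration models
     In practice, Deep Web sources are of many types and user needs differ, so Deep Web data integration has to be considered for different situations. The paper studies the relationships among Deep Web sources and the constraints these relationships place on query processing in a Deep Web data integration system. On this basis it proposes integration models for Deep Web data integration and describes the query processing procedure under each model, providing a reference for further research on, and application of, data integration across different types of Deep Web sources.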
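     (2) Machine-learning-based classification of Deep Web sources
     Given the large number of Deep Web sources, classifying them is a key step toward domain-based integration and querying of the Deep Web. The paper proposes a Deep Web representation model and a machine-learning-based classification model, and on that basis a new weight calculation method. Experimental results show that the classifier achieves good results after training on a small number of samples, and that its performance remains stable as the number of training samples grows.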
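     (3) Ontology-based classification of Deep Web query interfaces
     An ontology is a knowledge representation model that defines the basic terms, relations, and rules of a particular domain and expresses them in machine-readable form. For Deep Web query interfaces, the paper proposes a classification ontology model together with inference rules for building the ontology, and a vector space model (VSM) for the Deep Web. Experiments show that this classification method performs well.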
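     (4) Knowledge-based handling of environmental changes in Deep Web integration
     Building on a study of the dependencies among components in a Deep Web integration environment, the paper proposes a knowledge-based method for handling environmental changes. The method comprises a change-handling model for the Deep Web integration environment, a dynamic architecture that adapts to changes in the Deep Web environment, and the associated processing algorithms. It can serve as a reference for further exploration and practical application of large-scale Deep Web integration. Experimental results show that the method not only handles changes in the Deep Web integration environment but also substantially improves the performance of the integration system.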
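     (5) Personalized services based on the Deep Web
     Personalized recommendation lets "information find people" and can to some extent relieve the "information overload" and "information disorientation" caused by massive amounts of information. The paper proposes a framework for Deep Web-based personalized services, comprising a user interest model whose semantic basis is the resource metadata description, a Deep Web crawler, and personalized recommendation. Within the recommendation algorithm it proposes a new semantics-based similarity measure. Finally, a scientific literature recommendation system was developed on these ideas, delivering personalized literature services with as little user involvement as possible.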
Web information can be classified into the Surface Web and the Deep Web according to the depth of the information. The Surface Web consists of the Web pages that traditional search engines can index by following hyperlinks. The Deep Web is the content that traditional search engines cannot see: its pages do not exist until they are created dynamically as the result of a specific query. Because traditional search engine crawlers cannot probe beneath the surface, the Deep Web has so far remained hidden.
     How to organize and process the large amount of Deep Web information is a great challenge for information science and technology. As a key technology for organizing and processing large amounts of Deep Web information, Deep Web integration can largely solve the problem of information disorder and makes it convenient for users to find the required information quickly. Moreover, Deep Web integration has broad application prospects as a technical basis for information retrieval, search engines, personalized services, and so on.
     This paper studies Deep Web integration and its related technologies. Our main contributions are as follows.
     (1) Study on Integration Models of Deep Web Systems
     Because user requirements vary, different integration models of Deep Web systems should be considered. In this paper, we first study the relationships among Deep Web sources and the constraints they impose on query processing. Based on these, we present different integration models of Deep Web systems and discuss their processing flows. This work can serve as a reference for further research and application.
     (2) A Machine Learning Approach to the Classification of Web Databases
     Classifying such structured sources into domains is one of the critical steps toward the integration of heterogeneous Web sources. In this paper, we present a Deep Web model and a machine-learning-based classification model, and propose a novel weighting method. The experimental results show that good performance can be achieved with a small number of training samples for each domain, and that performance remains stable as the number of training samples increases.
     (3) Ontology-based Classification of Deep Web Query Interfaces
     An ontology is a knowledge-based model used to represent the terms, relations, and rules of a domain's concepts in a machine-readable format. In this paper, we present an ontology-based classification of query interfaces, which includes a category ontology model and a novel weighting calculation over the Vector Space Model (VSM). The experimental results show that the method achieves good performance.
     (4) Study on Knowledge-Based Processing of Environmental Changes in Deep Web Integration
     Based on a study of the dependencies among the components of Deep Web integration, a knowledge-based method is given for processing changes in such integration. It includes an environmental-change processing model, a self-adaptive software architecture, and an algorithm. This method can provide a reference for further research on, and application of, large-scale Deep Web integration. The experimental results show that the method can not only process the changes but also greatly improve the performance of the integrated system.
     (5) Study on Personalized Services over the Deep Web
     Digital scientific literature is usually provided through non-free Deep Web sources. The sheer scale of the information can leave users confused and lost. Personalized services over the Deep Web can solve this problem to some extent by letting the information "find" the users who need it. In this paper, we propose a framework for a personalized service system over the Deep Web, which includes a user profile model based on the metadata description of digital resources, a Deep Web crawler, a novel pushing algorithm, and so on. Finally, a personalized service system over selected Deep Web sources is developed. With little user intervention, the system pushes to users the information they need.
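As an illustration of the vector-space classification of query interfaces mentioned in item (3), the following Python sketch assigns an interface to a domain by cosine similarity over TF-IDF vectors. It is only a minimal sketch under stated assumptions: it uses standard smoothed TF-IDF weighting rather than the novel weighting method proposed in the paper (which the abstract does not specify), and the sample interface labels, domain names, and function names are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical training data: the visible field labels of a few query
# interfaces, each tagged with its domain. Not taken from the paper.
SAMPLES = [
    ("title author isbn publisher", "books"),
    ("author title keyword price", "books"),
    ("departure city arrival city date passengers", "airfare"),
    ("from to departure date return date", "airfare"),
]


def tokenize(text):
    """Lower-case and split on whitespace."""
    return text.lower().split()


def build_centroids(samples):
    """Return (idf, centroids): smoothed IDF per term, and one summed
    TF-IDF vector per domain."""
    docs = [tokenize(text) for text, _ in samples]
    n = len(docs)
    # Document frequency: number of interfaces containing each term.
    df = Counter(term for tokens in docs for term in set(tokens))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    centroids = {}
    for tokens, (_, domain) in zip(docs, samples):
        centroid = centroids.setdefault(domain, Counter())
        for term, tf in Counter(tokens).items():
            centroid[term] += tf * idf[term]
    return idf, centroids


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, idf, centroids):
    """Pick the domain whose centroid is most similar to the interface.
    Terms unseen during training are ignored."""
    counts = Counter(tokenize(text))
    vec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    return max(centroids, key=lambda d: cosine(vec, centroids[d]))


idf, centroids = build_centroids(SAMPLES)
print(classify("title author publisher", idf, centroids))
print(classify("departure date passengers", idf, centroids))
```

Because cosine similarity is scale-invariant, summing the vectors of a domain gives the same ranking as averaging them. A different term-weighting scheme could be substituted inside `build_centroids` and `classify` without changing the rest of the sketch.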
