The Key Technology Research on Deep Web Directory Search Engine

Original title: Deep Web分类搜索引擎关键技术研究
Author: Gao Ling (高岭)
Thesis level: Master's
Discipline: Computer Application Technology
Keywords: Deep Web; Search Engine; Focused Crawler; Web Database Content Summary; Data Source Classification
Degree year: 2007
Advisor: Cui Zhiming (崔志明)
Discipline code: 081203
Degree-granting institution: Soochow University (苏州大学)
Submission date: 2007-04-01

Abstract

With the rapid development of the World Wide Web (WWW), the Web has been deepened by a multitude of searchable online databases. This information is hidden behind Web query interfaces and generated dynamically by back-end databases, and traditional search engines cannot index it because of technical limitations. Such information is called the Deep Web.
     Deep Web information retrieval is still an emerging research field and is receiving growing attention from researchers. To make it easier for users to obtain and use Deep Web information in a given domain, this thesis proposes a system architecture for a Deep Web directory search engine, analyzes several key problems of such an engine on the basis of that architecture, and proposes corresponding algorithms and models. The main work includes:
     (1) A survey of the scale, distribution, and structure of Chinese Deep Web resources.
     (2) To address the shortcomings of traditional search engine crawlers in the Deep Web domain, a Deep Web focused crawler is designed, and a method for determining whether a page contains a Deep Web query interface is proposed.
     (3) An efficient Web database content acquisition algorithm is used to sample the contents of Web databases; the sampled pages are analyzed and irrelevant information is removed to obtain a content summary of each Web database.
     (4) Based on the Yahoo directory, a method is proposed that combines the interface pages of Deep Web sites with database content summaries to classify Deep Web resources.
     Finally, a prototype Chinese-language Deep Web directory search engine, Deep Searcher, is designed and implemented, and the proposed algorithms are evaluated and analyzed experimentally.

