Research on Deep Web Data Acquisition Method

Original title: Deep Web数据获取方法研究
Author: Cai Xinbao (蔡欣宝)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: Deep Web Crawler; Attribute Correlation; Attribute Compounding; Query Selection; Incremental Crawler
Degree year: 2010
Advisor: Cui Zhiming (崔志明)
Discipline code: 081203
Degree-granting institution: Soochow University (苏州大学)
Submission date: 2010-04-01

Abstract

With the rapid development of the Internet, the volume of information on the Web keeps growing and offers people a wide variety of usable information. A large share of this information is stored in Web databases and can only be reached through the query interfaces embedded in Web pages. Because such content cannot be reached by following hyperlinks, traditional search engines cannot retrieve it, and it is therefore called the Deep Web. The fast-growing Deep Web has become an important source of information, but the heterogeneity and dynamic nature of Deep Web data pose great challenges for large-scale Deep Web data integration. Acquiring Deep Web data and integrating Web databases locally is therefore becoming increasingly important.
     This thesis studies the techniques for Deep Web data acquisition in depth and proposes corresponding algorithms and models. The main contributions are as follows:
     (1) The characteristics of Deep Web sites and query interfaces are studied. For deciding which form inputs to fill in when submitting a query, a method is proposed that computes the validity of attribute combinations based on attribute correlation.
     (2) The characteristics of the attributes in query interfaces are analyzed, and a machine learning method is proposed to identify each typed text attribute in a query interface.
     (3) Attributes are classified, and query words are generated by a different method for each attribute type. For generic text attributes, the corresponding content is extracted from query result pages, and an adaptive strategy selects suitable keywords from it as query words. For typed text attributes, a manually built knowledge base is used.
     (4) The update characteristics of pages in Deep Web data sources are analyzed, page update events are modeled with a Poisson model, and Deep Web data is acquired incrementally. The system architecture of a crawler for incremental Deep Web data acquisition is also designed.
     In addition, the proposed methods and techniques were evaluated experimentally, and analysis of the results further confirms that they are effective.
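
As an illustration of item (4), the following is a minimal sketch of how a Poisson model of page updates can drive revisit scheduling. It is not the thesis's implementation: the function names, the plain maximum-likelihood rate estimate, and the threshold-based revisit rule are all assumptions made for this example. Under a Poisson process with rate lambda, the probability that a page has changed at least once after time t is 1 - exp(-lambda * t).

```python
import math


def change_probability(rate, elapsed):
    """Probability of at least one update within `elapsed` time units,
    assuming updates follow a Poisson process with the given rate."""
    return 1.0 - math.exp(-rate * elapsed)


def estimate_rate(changes_detected, visits, interval):
    """Plain maximum-likelihood estimate of the update rate (an assumption
    of this sketch). The crawler visits a page every `interval` time units
    and only learns whether the page changed since the previous visit."""
    if visits <= 0 or interval <= 0:
        raise ValueError("visits and interval must be positive")
    if not 0 <= changes_detected <= visits:
        raise ValueError("changes_detected must lie in [0, visits]")
    if changes_detected == visits:
        # Every visit saw a change, so the estimate diverges.
        return math.inf
    return -math.log(1.0 - changes_detected / visits) / interval


def revisit_interval(rate, threshold):
    """Time after which the change probability reaches `threshold`
    (hypothetical scheduling rule for an incremental crawler)."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    if rate <= 0:
        return math.inf  # a page that never changes needs no revisit
    return -math.log(1.0 - threshold) / rate


if __name__ == "__main__":
    # A page changed on 5 of 10 visits spaced 2 days apart.
    rate = estimate_rate(5, 10, 2.0)
    print(rate, revisit_interval(rate, 0.5))
```

Pages with a higher estimated rate get shorter revisit intervals, so crawling effort concentrates on the parts of a data source that change most often.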

