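Research on Form-Based Deep Search Technology

English title: Research on Form-Based Hidden Web
Author: 徐荣 (Xu Rong)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords: web page text classification; hidden web; web information extraction; HTML form; name-value table
Degree year: 2008
Supervisor: 蒋宗礼 (Jiang Zongli)
Discipline code: 081202
Degree-granting institution: 北京工业大学 (Beijing University of Technology)
Submission date: 2008-04-01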

Abstract

Most current search engines index only the static pages that can be reached by following hyperlinks, the publicly indexable web (PIW). A large and growing share of important data, however, is stored in back-end web databases and can be retrieved only by submitting queries through HTML forms; the pages obtained this way are called hidden web pages. To help users obtain more of this information, this thesis discusses methods for searching hidden web pages, presents a system architecture, and discusses the key technologies involved.
     The thesis first analyzes the advantages and disadvantages of the general-purpose search engines in common use and compares general search with deep (hidden web) search. It then proposes a crawling strategy suited to deep search: focused crawling guided by a link classifier and a text classifier. In addition, a stopping criterion applied within each site keeps the crawler from doing excessive speculative work in a single site, and path learning is applied to regularly structured sites, so that as many form-bearing pages as possible are found.
     The work simulates the process by which a user accesses hidden web pages. First, based on a survey of existing work, a crawling strategy is proposed that downloads pages containing forms quickly and effectively; the downloaded pages are then processed, and the form information is extracted and converted into a representation that a program can interpret, that is, the forms are modeled. Second, useful forms are selected by means of heuristic rules and a form classifier. Third, form labels and semantic terms are extracted, and the forms are filled in and submitted automatically to reach the desired pages.
     The approach makes full use of both the structure and the text of forms. The classifier compares classification based on form labels with classification based on the useful text surrounding the form, and is trained with the Centroid, KNN, and SVM algorithms. The experiments show that classification on the text surrounding the form performs well and that SVM gives the best results. Finally, the name-value table used for automatic form filling is discussed.
     The experiments verify the effectiveness of form classification and form information extraction.
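
As an illustration of the form-modeling and heuristic-filtering steps named in the abstract, the following is a minimal Python sketch and not the thesis's implementation. The names `FormField`, `FormModel`, `FormExtractor`, `extract_forms`, and `looks_searchable` are hypothetical. The filtering rule is also an assumption made for this example: it discards forms with a password field and requires at least one named text or select field.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class FormField:
    tag: str   # "input", "select" or "textarea"
    type: str  # e.g. "text", "password", "submit"; defaults to the tag name for non-input tags
    name: str  # value of the name attribute, "" if absent


@dataclass
class FormModel:
    action: str
    method: str
    fields: list = field(default_factory=list)


class FormExtractor(HTMLParser):
    """Collects every <form> on a page into a FormModel (hypothetical model)."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None  # form being parsed, or None when outside a form

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            self._current = FormModel(a.get("action") or "",
                                      (a.get("method") or "get").lower())
        elif self._current is not None and tag in ("input", "select", "textarea"):
            default_type = "text" if tag == "input" else tag
            self._current.fields.append(
                FormField(tag, (a.get("type") or default_type).lower(),
                          a.get("name") or ""))

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def extract_forms(html):
    """Parse an HTML string and return the list of form models found in it."""
    parser = FormExtractor()
    parser.feed(html)
    parser.close()
    return parser.forms


def looks_searchable(form):
    """Assumed heuristic: drop login-style forms, keep forms with a named query field."""
    if any(f.type == "password" for f in form.fields):
        return False
    query_types = ("text", "search", "select", "textarea")
    return any(f.type in query_types and f.name for f in form.fields)


PAGE = """
<form action="/login" method="post">
  <input type="text" name="user"><input type="password" name="pw">
</form>
<form action="/books/search" method="get">
  <input type="text" name="title">
  <select name="category"><option>Fiction</option></select>
  <input type="submit" value="Search">
</form>
"""

candidates = [f.action for f in extract_forms(PAGE) if looks_searchable(f)]
print(candidates)  # ['/books/search']
```

In this sketch the login form is rejected by the rule and only the query form survives. In the approach the abstract describes, such rule-based filtering is combined with a trained form classifier rather than used alone.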

