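Research on Deep Web Dynamic Search

Original title (Chinese): Deep Web动态搜索的研究
Subtitle: Dynamic Search Based on Book Websites
Author: Li Haibin (李海滨)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: form parsing; dynamic form filling; result page parsing; result item ranking
Degree year: 2011
Advisor: Xu Nanshan (许南山)
Discipline code: 081203
Degree-granting institution: Beijing University of Chemical Technology (北京化工大学)
Thesis submission date: 2011-06-03
Defense committee chair: Zhao Ruilian (赵瑞莲)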

Abstract

Targeting the characteristics of book websites, and observing that the label text preceding a form field indicates what the field expects as input, this thesis designs a method that parses form fields and fills in the form dynamically. The dynamically filled form is submitted to obtain result pages, which are then parsed, ranked by weight, and presented in a uniform display format. The thesis designs and implements a system that queries multiple websites of the same type through each site's own advanced search page, giving users a convenient and fast way to search several book websites at once. Experimental results support the correctness of the algorithm design. The main research work comprises the following:
     1. A dynamic form search algorithm based on dictionary matching. The algorithm parses forms with SAX, avoiding the large volume of unneeded information produced by the DOM parsing used in earlier work; it parses the pages hosting the query interfaces with multiple threads to improve processing performance; and it matches the keywords of form fields against a dictionary. A server-side program crawls pages and analyzes them semantically to discover new book websites and to extend the keyword dictionary.
     2. Result page parsing, built on the results obtained by dynamic form filling. The HTML tag structure of each book website's search result pages is studied in advance and abstracted into an extraction template; the template is then applied to parse a page into a linked list of book information objects, which completes result parsing.
     3. Post-processing of query results. The result items parsed from the result pages are ranked, chiefly by two factors: how often similar books appear across the different websites, and their rank position on each website. The two factors are treated as equally important because each reflects a book's popularity and sales, so an equal-weight ranking method is used.
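
The equal-weight ranking in item 3 could look like the following minimal Python sketch. The abstract does not give the scoring formulas, so the two normalizations, the function name `rank_books`, and the sample data are all assumptions made for illustration; "similar books" are simplified here to exact title matches, and each site is assumed to list a title at most once.

```python
from collections import defaultdict


def rank_books(site_results, w_freq=0.5, w_pos=0.5):
    """Rank titles by an equal-weight mix of two scores in [0, 1].

    site_results maps a site name to its ordered list of titles
    (best first). Both scoring formulas are illustrative assumptions,
    not the formulas used in the thesis.
    """
    n_sites = len(site_results)
    positions = defaultdict(list)  # title -> one position score per site
    for titles in site_results.values():
        for index, title in enumerate(titles):
            # 1.0 for the first item on a site, decreasing toward 0.
            positions[title].append(1.0 - index / len(titles))

    scores = {}
    for title, pos_scores in positions.items():
        freq_score = len(pos_scores) / n_sites          # share of sites listing it
        pos_score = sum(pos_scores) / len(pos_scores)   # mean position score
        scores[title] = w_freq * freq_score + w_pos * pos_score

    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))


# Hypothetical result lists from three sites.
results = {
    "site_a": ["Book X", "Book Y", "Book Z"],
    "site_b": ["Book Y", "Book X"],
    "site_c": ["Book Y", "Book W"],
}
ranking = rank_books(results)
print(ranking)
```

With equal weights, a title that appears on every site and near the top of each list ranks first; changing `w_freq` and `w_pos` would shift the balance between the two factors.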
     Building on this work, a dynamic form search system based on the advanced search pages of book websites was designed and implemented. The system offers a relatively novel approach: for websites of the same type, it matches exact query fields through their advanced search pages.
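
The idea of matching query fields through an advanced search form can be sketched as follows. This is a hedged illustration, not the thesis implementation: Python's event-driven `html.parser.HTMLParser` stands in for the SAX parser named above (real-world HTML is rarely well-formed XML), and the dictionary contents, the class and function names, and the sample form are all hypothetical. Multithreaded page processing is omitted.

```python
from html.parser import HTMLParser

# Hypothetical keyword dictionary: canonical query field -> label keywords.
FIELD_DICTIONARY = {
    "title": ["title", "book name", "书名"],
    "author": ["author", "writer", "作者"],
    "publisher": ["publisher", "press", "出版社"],
    "isbn": ["isbn"],
}


class FormLabelParser(HTMLParser):
    """Event-driven pass pairing each text input with the text before it."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.last_text = ""
        self.fields = []  # (label text, input name) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form:
            input_type = (attrs.get("type") or "text").lower()
            if input_type == "text" and attrs.get("name"):
                self.fields.append((self.last_text, attrs["name"]))
            self.last_text = ""  # a label belongs to one input only

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_form and text:
            self.last_text = text


def match_fields(html, dictionary=FIELD_DICTIONARY):
    """Map canonical field names to the site's own input names."""
    parser = FormLabelParser()
    parser.feed(html)
    parser.close()
    mapping = {}
    for label, name in parser.fields:
        normalized = label.lower().rstrip(":：").strip()
        for canonical, keywords in dictionary.items():
            if any(keyword in normalized for keyword in keywords):
                mapping.setdefault(canonical, name)  # keep the first match
                break
    return mapping


def fill_form(mapping, query):
    """Translate a canonical query into the site's request parameters."""
    return {mapping[f]: v for f, v in query.items() if f in mapping}


SAMPLE = """
<form action="/advanced_search" method="get">
  Book name: <input type="text" name="q_t"><br>
  Author: <input type="text" name="q_a"><br>
  Press: <input type="text" name="q_p"><br>
  <input type="submit" value="Search">
</form>
"""
mapping = match_fields(SAMPLE)
params = fill_form(mapping, {"title": "Deep Web", "author": "Chang"})
print(mapping, params)
```

Because the parser only reacts to tag and text events and stores one label string at a time, it never builds a full document tree, which is the property the abstract cites as the reason for preferring SAX over DOM.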
