Deep Web Query Results Extraction and Annotation

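English title: Deep Web Query Results Extraction and Annotation
Author: Xie Ying (谢莹)
Degree level: Master's
Discipline: Computer Software and Theory
Chinese keywords: Deep Web ; 抽取 ; 注释 ; 标签树 ; 本体 ; 启发式规则
English keywords: Deep Web ; extraction ; annotation ; tag tree ; ontology ; heuristic rules
Degree year: 2010
Advisor: Zuo Wanli (左万利)
Discipline code: 081202
Degree-granting institution: Jilin University
Submission date: 2010-04-01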

Abstract

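This paper studies Deep Web data integration systems, focusing on two units of such a system, query results extraction and query results annotation, and proposes an implementation method for each.
     Query results extraction means automatically extracting data records from the result pages returned for a query; query results annotation means adding a semantic label to each data item in the extracted data records.
     In the query results extraction unit, this paper adopts a method based on the HTML tag tree and mines data records top-down in the tag tree through a recursive procedure. Data records are identified by computing the similarity between tag trees, and that similarity is computed from edit distance. This paper proposes a definition of a data record that differs from the one used by traditional methods; the extraction process based on this definition is simpler than traditional methods, because it does not need to mine data regions first and instead extracts data records directly.
     In the query results annotation unit, this paper combines an ontology with heuristic rules to add semantic labels to the data items to be annotated: the ontology ensures the consistency of the annotations, and the heuristic rules improve their completeness. The unit is divided into an ontology management module and a semantic annotation module. The ontology management module builds an ontology library for the book domain and maintains the ontology with a sub-concept table and a candidate-concept table; the semantic annotation module defines the heuristic rules and specifies the procedure for annotating a data item.
     This paper uses query result pages from multiple Chinese book-domain Deep Web sites for experimental testing, and the test results show that the proposed method is accurate and effective.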
[...] carry this information, Web databases appeared. Information is stored in Web databases; when users want to find it, they only need to fill in the databases' entry forms, which are also called query interfaces. Web sites that contain Web databases are called the Deep Web. Because the Deep Web is rich in information, it is attracting more and more research. Current research on the Deep Web consists mainly of three parts: Query Interfaces Integration, Query Processing, and Query Results Processing. Query Interfaces Integration includes four units: Web Databases Discovery, Query Interfaces Schema Extraction, Web Databases Classification, and Query Interfaces Integration. Query Processing includes two units: Web Databases Selection and Query Transformation. Query Results Processing contains three units: Query Results Extraction, Query Results Annotation, and Query Results Combination. Together, these three parts form the Deep Web Data Integration System.
     This paper focuses on the Query Results Extraction and Query Results Annotation units. Query Results Extraction means mining and extracting data records from the returned query results page. Query Results Annotation means adding a semantic label to each data item of the data records.
     In the Query Results Extraction unit, this paper uses a method based on the HTML tag tree. Data records on the same result page are highly similar in structure, and this similarity shows up in the tag trees that form them; so, once a page is converted to a tag tree, data records can be identified by the similarity of its sub-trees. After observing a large number of result pages from Deep Web sites and their source code, we summarized the structural characteristics of data records in tag trees and put forward a definition of a data record that differs from earlier ones. The method has two steps: (1) building the tag tree of a web page; (2) mining data records. In step (1), this paper uses HtmlParser to parse the page, and the result is saved in a parse tree. Parse tree nodes are of type Node, an interface defined by HtmlParser with three implementation classes, TagNode, TextNode and RemarkNode, of which TagNode represents the tag nodes of the HTML code. The parse tree is traversed top-down to find the TagNode instances, which become the nodes of the tag tree, and nodes that only express style are deleted. In step (2), mining data records is a recursive procedure that starts from the root of the tag tree: the node is passed as the actual parameter, and the procedure checks whether it is a data record. If so, the recursion has reached an exit: the content of the node is extracted and the procedure returns. If not, the procedure finds all child nodes of the node and calls itself recursively on each in turn. The most important part of checking whether a node is a data record is calculating the similarity of sub-trees of the tag tree, which is done with the edit distance algorithm: the two sub-trees are first traversed to turn them into two tag node sequences, which are then passed as actual parameters to the edit distance algorithm.
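     Step (1) can be illustrated with a minimal sketch. It is not the thesis implementation: Python's standard html.parser stands in for the Java HtmlParser library, the TreeNode class and the function names are hypothetical, and the STYLE_TAGS set is only a guess at which nodes count as "expressing style". Text and comment nodes are ignored, so only tag nodes reach the tree.

```python
from html.parser import HTMLParser

# Assumption: tags treated as pure styling and left out of the tag tree.
STYLE_TAGS = {"b", "i", "u", "font", "em", "strong", "br"}
# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"img", "input", "hr", "meta", "link"}


class TreeNode:
    """One tag node of the tag tree (hypothetical helper class)."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree while the parser walks the page top-down."""

    def __init__(self):
        super().__init__()
        self.root = TreeNode("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in STYLE_TAGS:
            return  # children of a style tag attach to its parent
        node = TreeNode(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in STYLE_TAGS or tag in VOID_TAGS:
            return
        # Pop back to the matching open tag, tolerating unclosed tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tag_sequence(node):
    """Pre-order traversal of a sub-tree into a flat tag sequence."""
    seq = [node.tag]
    for child in node.children:
        seq.extend(tag_sequence(child))
    return seq


page = "<table><tr><td><b>Title</b></td><td>Price</td></tr></table>"
print(tag_sequence(build_tag_tree(page)))
# ['#root', 'table', 'tr', 'td', 'td']
```

     The flat sequence returned by tag_sequence is the form in which two sub-trees would be handed to an edit distance routine.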
As can be seen from the above, and unlike many tag-tree-based methods, the main feature of our method is that it does not need to mine data regions first; it mines data records directly.
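     The recursive mining of step (2) might look like the following sketch. The abstract does not state the thesis's definition of a data record, so the test used here is purely an assumption: a node counts as a record when its tag sequence has at least min_size tags and is similar enough to the sequence of at least one sibling. The normalization of edit distance into a 0..1 similarity, the threshold values, the nested-tuple tree representation and all function names are likewise hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two tag sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Assumed normalization: 1.0 for identical sequences, 0.0 for disjoint."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def tag_sequence(node):
    """Pre-order tag sequence of a (tag, children) sub-tree."""
    tag, children = node
    seq = [tag]
    for child in children:
        seq.extend(tag_sequence(child))
    return seq


def is_record(node, siblings, threshold, min_size):
    """Assumed record test: big enough and similar to some sibling."""
    seq = tag_sequence(node)
    if len(seq) < min_size:
        return False
    return any(similarity(seq, tag_sequence(s)) >= threshold
               for s in siblings if s is not node)


def mine_records(node, siblings=(), threshold=0.9, min_size=3):
    """Top-down recursion: stop at a record, otherwise descend into children."""
    if is_record(node, siblings, threshold, min_size):
        return [node]  # recursion exit: the node itself is a data record
    records = []
    _, children = node
    for child in children:
        records.extend(mine_records(child, children, threshold, min_size))
    return records


def book_row():
    return ("tr", [("td", [("a", [])]), ("td", [])])


page = ("html", [("body", [
    ("div", [("h1", [])]),
    ("table", [book_row(), book_row(), book_row()]),
])])

records = mine_records(page)
print(len(records), [tag for tag, _ in records])
# 3 ['tr', 'tr', 'tr']
```

     No data region is located beforehand in this sketch; the recursion reaches the three rows directly, which mirrors the property claimed for the method.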
     In the Query Results Annotation unit, this paper adds a semantic label to each data item by using an ontology combined with heuristic rules. The unit contains an Ontology Management Module and a Semantic Annotation Module. The Ontology Management Module extracts concepts, comprising main-concepts and sub-concepts, according to the characteristics of many Deep Web query interface schemas, establishes the ontology library, and maintains a sub-concept table and a candidate-concept table in the Ontology Manager so that the ontology can be modified automatically. The sub-concept table stores all the sub-concepts of each main-concept in the ontology library, which ensures the consistency of ontology concepts; the candidate-concept table stores concepts that the system cannot match. These concepts must be confirmed by domain experts as either domain relevant or domain irrelevant. Domain-relevant concepts are put into the ontology library as main-concepts and the sub-concept table is updated; domain-irrelevant concepts are simply discarded. Enriching the sub-concept table and the candidate-concept table also enhances the ontology's ability to distinguish domain semantics. The Semantic Annotation Module first pre-processes the data records extracted by the Query Results Extraction unit: each data item is standardized, text content is kept and image content is removed. Then, for a data item to be labeled, it first determines whether the item is a semantic-based data item or a content-type data item.
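     The two tables kept by the Ontology Manager can be pictured with a small sketch. The class, its method names and the English book-domain concepts below are hypothetical (the thesis works on Chinese book sites), and exact string equality is used as the matching rule only for simplicity.

```python
class OntologyManager:
    """Toy model of the sub-concept table and the candidate-concept table."""

    def __init__(self):
        # Sub-concept table: main-concept -> set of its sub-concepts.
        self.sub_concepts = {}
        # Candidate-concept table: concepts the system could not match.
        self.candidates = set()

    def add_main_concept(self, main, subs=()):
        self.sub_concepts.setdefault(main, set()).update(subs)

    def match(self, text):
        """Return the main-concept for text, or None after recording it
        as a candidate for expert review."""
        for main, subs in self.sub_concepts.items():
            if text == main or text in subs:
                return main
        self.candidates.add(text)
        return None

    def review(self, text, domain_relevant):
        """Apply a domain expert's verdict on a candidate concept."""
        self.candidates.discard(text)
        if domain_relevant:
            self.add_main_concept(text)  # promoted to a main-concept
        # domain-irrelevant concepts are simply dropped


manager = OntologyManager()
manager.add_main_concept("author", {"writer", "written by"})

print(manager.match("writer"))   # author
print(manager.match("ISBN"))     # None
print(manager.candidates)        # {'ISBN'}

manager.review("ISBN", domain_relevant=True)
print(manager.match("ISBN"))     # ISBN
```

     Mapping every sub-concept to one main-concept is what keeps labels consistent across sites, while the candidate table is the route by which the ontology grows.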
     For a semantic-based data item, the main-concepts in the ontology library, or their sub-concepts, are matched against its descriptive text. If the match succeeds, the main-concept of the matching concept replaces the descriptive text as the label of the data item; if it fails, the descriptive text is put into the under-judge-concept table and the text that follows the descriptive text is treated as a content-type data item. For a content-type data item, the instances in the ontology library are matched against it. If the match succeeds, the main-concept of the matching instance is used as the descriptive text to label the data item; if it fails, the heuristic rules are applied to it.
     The Query Results Extraction experiment first uses query result pages from many book-domain Deep Web sites in a test run to set the thresholds. This run shows that the F-measure reaches its maximum of 96.8% when the thresholds H, L and S are set to 2, 10 and 0.9. With these thresholds, an experiment on query result pages from several Chinese book-domain Deep Web sites gives a precision of 100% and a recall of 98.2%, which is a very good result. The data records extracted in this experiment are then used in the Query Results Annotation experiment, where precision and recall are 98.1% and 90.4% and the F-measure is 94.1%; this meets the requirements of real applications but still leaves room for improvement.
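     The reported annotation figures can be checked with a short calculation, assuming the "F index" is the usual balanced F-measure, F = 2PR / (P + R); the abstract does not give the formula, so this is an assumption, and the function name is hypothetical.

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Annotation experiment: precision 98.1%, recall 90.4%.
print(f"{f_measure(0.981, 0.904):.1%}")  # 94.1%
```

     Under this assumption the result agrees with the 94.1% stated for the annotation experiment.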
