基于结构连接的XML查询处理与研究

英文题名：Query Processing and Research for Structural Join Against XML Data
作者：贾蓓
论文级别：硕士
学科专业名称：计算机应用技术
中文关键词：XML ; XPath ; 编码方法 ; 过滤 ; 结构连接
英文关键词：XML ; XPath ; Numbering Scheme ; Filter ; Structural Join
学位年度：2008
导师：包小源
学科代码：081203
学位授予单位：天津师范大学
论文提交日期：2008-03-01

摘要

XML自从1998年由W3C提出以来,就迅速的成为Internet上用于数据表示和数据交换的标准。XML文档大量涌现,XML的有效管理受到广泛关注。由于XML数据具有不同于传统数据形式的树状结构,使得传统的数据库技术不能有效地发挥作用,因此需要针对其特点研究新的处理方法。为了解决XML路径查询处理中的关键技术问题,为较大规模的XML查询应用提出切实可行的解决方案,本文给出了XPath查询的系统框架,定义了系统可以处理的XPath的语法,实现了一个XML文档的查询处理系统。
     作为XML查询处理的核心操作,结构连接操作的高效实现是提高查询处理性能的关键所在。本文针对结构连接操作的高效问题,在XML数据区间编码的基础上,把基于过滤的小枝结构连接技术应用到查询系统中。把源路径以及路径包含的概念引入过滤算法,减少了PSet集合中的路径数目。对使用过滤算法与不使用过滤算法的整体小枝连接技术进行了实验比对,试验结果显示使用过滤算法的整体小枝连接具有更好的性能。
     现有的XML结构连接算法都是在节点编码的基础上提出的。目前,各种节点编码方式及其对应的结构连接算法很多。本文针对多种结构连接算法进行了系统的总结和比较,并分析了各种算法的不同性能。
XML has become new criteria of data represention and exchange in Internet and it has been accepted in many fields since it was put forward by W3C in 1998. This is creating a new set of data management requirements involving XML. Traditional database technologies can't work efficiently owing to the tree-like nature of XML data and new application environment .New technologies specially designed for XML data are needed to process XML data efficiently. In this paper, we focus on the path expression processing such that the key issues in the large-scale XML query application can be settled by feasible approaches. We propose a system framework of XPath query, defining the XPath grammar that the system can deal with, giving the query processing system.
     As the core operation of XML query processing, the efficient implimenation of structural join is the key to improve XML query processing. Based on the region numbering scheme of XML data, we led into filter-based twig structural join technology. Different form previous algorithms, filteration algorithm filters the query pattern and the data set with the path encoded information, leaving the elements to join the structural join. Then we use twig join algorithm for these elements. We introduce the concept of source path and path containment, decreasing the amount of PSet. We hava carried out an experiment to compare the technologys about whether using filtering algorithm or not. The results of our comprehensive experiment show that the twig join algorithm with filtering process performs well both synthetic and real-word datasets,and has good scalability.
     The XML containment join algorithm is proposed based on XML encoding. Many researchers hava proposed all kinds of encoding and relevant containment join algorithms. We sum various structural join algorithms up. At last, we analyse the performance of these algorithms.

引文

[1] Junghoo Cho, Sridhar Rajagopalan, "A fast regular expression indexing engine", In Proceedings of the 18th International Conference on Data Engineering (ICDE.02 )

    [2] Chee-Yong Chan, Pascal Felber, Minos Garofalakis, Rajeev Rastogi, "Efficient Filtering of XML Documents with XPath Expressions", In Proceedings of the 18th International Conference on Data Engineering (ICDE.02 ), 2002.

    [3] Quanzhong Li, Bongki Moon, "Indexing and Querying XML Data for Regular Path Expression", In Proceedings of the 27th VLDB Conference, Roma, Italy, 2001

    [4] Shu-Yao Chien, Zografoula Vagena, Donghui Zhang, Vassilis J. Tsotras, Carlo Zaniolo, "Efficient Structural Joins on Indexed XML Documents", In Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002

    [5] Dao Dinh Kha, Masatoshi Yoshikawa, Shunsuke Uemura, "An XML Indexing Structure with Relative Region Coordinate", In Proceedings of the 17th International Conference on Data Engineering (ICDE.01 ), Heidelberg, Germany, April 02-06,2001.

    [6] Yan Chen, Sanjay Madria, Kalpdrum Passi, and Sourav Bhowmick, "Efficient Processing of XPath Queries Using Indexes", In Proceedings of DEXA 2002, LNCS 2453, pp. 721-730,2002.

    [7] World Wide Web Consortium. Extensible Makeup Language (XML) 1.0 (Third Edition). W3C Recommendation [S]. 4 February 2004. http://www.w3c.org/TR/REC-XML/

    [8] Bosak J, Bray T, Connolly, et al. W3C XML Specification ("XML-spec") DTD version 2.1 [S]. 15 February 2000. http://www.w3c.org/XML/1998/06/XMLspec-report-v21.htm

    [9] World Wide Web Consortium. XML Path Language (XPath) 2.0 [S]. W3C Recommendation 23 January 2007 http://www.w3.org/TR/xpath20/

    [10] S. Al-Khalifa, H. V. Jagadish, N. Koudas, J. M. Patel, D. Srivastava, and Y. Wu. Structural joins: A primitive for efficient XML query pattern matching. In Proceedings of the IEEE International Conference on Data Engineering, 2002.

    [11] Jens E. Wolff, Holger Florke, and Armin B. Cremers. Searching and browsing collections of structural information. In IEEE Advances in Digital Libraries (ADL'2000),pages 141-150, Bethesda, MD, May 1997.

    [12]R. Goldman, J. McHugh, and J. Widom. From Semisturctured Data to XML: Migrating the Lore Model and Query Language. WebDB 1999, pages25-30

    [13] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. L.Wiener. The Lorel query language for semistructured data. International Journal on Digital Libraries, 1(1):68—88, April 1997.

    [14] Alin Deutsch, Mary Fernandez, Daniela Florescu, Alon Levy, and Dan Suciu. A query language for XML. In Proceedings of the 8th International World Wide Web Conference, pages 77-91, Toronto, Canada, May 1999.

    [15] Stefano Ceri, Sara Comai, Ernesto Damiani, Piero Fraternali, Stefano Paraboschi, and Letizia Tanca. XML-GL: A graphical language for querying and restructuring XML documents. In Proceedings of the 8th International World Wide Web Conference, pages 93-109, Toronto, Canada, May 1999.

    [16] Don Chamberlin, Jonathan Robie, and Daniela Florescu. Quilt: An XML query language for heterogeneous data sources. In International Workshop on the Web and Databases (WebDB'2000), Dallas, TX, May 2000.

    [17] Don Chamberlin, Daniela Florescu, Jonathan Robie, Jrme Simon, and Mugur Stefanescu. XQuery:A Query Language for XML W3C working draft.Technical Repor WD-xquery-20010215,World Wide Web Consortium,February 2001.
    [18]Chun Zhang,Jeffrey Naughton,David DeWitt,Qiong Luo,Guy Lohman.On Supporting Containment Queries in Relational Database Management Systems.In Proc.of ACM SIGMOD 2001,Santa Barbara,Califoruia,USA,May21-24,2001.
    [19]Paul F.Dietz.Maintaining order in a linked list.In Proceedings of the Fourteenth Annual ACM Symposium on Theory of Computing,pages 122-127,San Francisco,California,May 1982.
    [20]Q.Li and B.Moon.Indexing and querying XML data for regular path expressions,In Proc.of VLDB,Rome,Italy,361-370,2001.
    [21]Zhang C,Nanghton J,DeWitt D,et al.On Supporting Contianment Queries in Relational Database Management System.In:Mehrotra S et al Eds.Proceddings of the 20th ACM SIGMOD International Conference on Management of Data,2001
    [22]G.Salton and M.J.McGill.Introduction to modern information retrieval.McGraw-Hill,New York,1983
    [23]Online Computer Library Center.Dewey decimal classification.Http://www.oclc.org/dewey/
    [24]Wirth.N,Type Extensions.ACM Transactions on Programming Languages and Systems,1988
    [25]J.Kim,S.H.Lee,and H.-J.Kim,Efficient Structural Joins with Clustered Extents,Information Processing Letters,Volume 91,Number 2,2004,pages 69-75.
    [26]Y.Wu,J.M.Patel,H.V.Jagadish,Structural join order selection for XML query optimization,in:IEEE International Conference on Data Engineering,2003.
    [27]N.Bruno,N.Koudas,D.Srivastava,Holistic twig joins:Optimal XML pattern matching,in:Proceedings of the ACM SIGMOD International Conference on the Management of Data,2002.
    [28]Jiang Haifeng,Lu Hongjun,Wangwei,et al.XR-Tree:Indexing XML Data for Efficient Structural Joins.Proceddings of the 19th IEEE ICDE International Conference on Data Engineering,2003
    [29]Kyung-Sub Min,Hyoung-Joo Kim,A path-based node filtering method for efficient structural joins,Information Processing Letters 95,2005,pages 480-486.
    [30]万常选,XML数据库技术,清华大学出版社,2005
    [31](美)哈罗德(Harold.E.R.)著;刘文红等译Java语言与XML处理教程:SAX,DOM,JDOM与TrAX指南[M]北京:电子工业出版社,2003.11