一种基于小枝模式匹配的XML数据查询处理算法

英文题名：A XML Date Query Processing Algorithm Based on Twig Pattern Matching
作者：苗红涛
论文级别：硕士
学科专业名称：计算机应用技术
中文关键词：XML数据库 ; 小枝模式匹配 ; 路径表达式 ; 查询处理
英文关键词：XML Data Base ; Twig pattern matching ; Path expression ; Query processing
学位年度：2009
导师：陶世群
学科代码：081203
学位授予单位：山西大学
论文提交日期：2009-06-01

摘要

XML作为Web发展所带来的新技术中的代表,已经成为网上数据表示和交换的标准,并逐渐成为了学术界和工业界所关注的焦点。由于XML数据具有不同于传统数据形式的特点,因此,在各种基于XML的应用中,如何有效地查询XML数据已经成为一个重要的研究课题。
     近年来,小枝模式匹配问题已经被广泛地研究。一个小枝模式查询O在XML文档树T中的匹配,也就是找到这个查询模式Q在T中出现的所有情况。以前的工作比如Holistic Twig和TJFast都是把小枝模式分解成多个从根结点到叶子结点的单个路径,然后把这些路径模式的匹配结果合并起来以得到小枝模式查询Q的匹配结果。它们能够有效地删除不可能的路径模式来避免产生大量的中间结果,但合并这些线性路径会花费很高的计算代价。为了解决这个问题,最近又提出了Twig2Stack和TwigList算法,但它们都需要遍历所有的输入节点,另外,它们也没有考虑有些查询节点是不需要进行处理的。本文对XML路径查询处理中尚存在的难点问题进行了深入的研究,并且在汲取了各种小枝模式匹配算法优点的基础上,提出了一种新的小枝模式匹配算法TwigEN。本文的主要工作如下:
     (1)研究分析了现有的小枝模式匹配算法,并总结其实现方法;
     (2)根据XML文档的结构特点提出了一种新的小枝模式匹配算法TwigEN,该算法只处理有解扩展的元素结点,因此,能够较大限度的跳过输入流中无用的元素结点;
     (3)对算法TwigEN的复杂度和可行性进行了分析,给出了TwigEN算法在最坏情况下的时间和空间复杂度;
     (4)最后构建一个实验系统,实现了本文提出的小枝模式匹配算法,然后通过实验把TwigEN算法同Twigstack算法和TwigList算法进行了比较,并对实验结果进行了分析。实验证明,本文提出的小枝模式匹配算法的效率有了较大的提高。
     本文提出的小枝模式匹配算法能够根据XML文档的结构特点跳过那些在结构连接中无用的元素结点,这样不仅减少了待处理结点的数目,缩短了查询处理时间,而且也节省了内存空间。该算法的时间复杂度和空间复杂度与查询结点的标签流中结点总数呈线性关系。
XML has become the standard of the date representation and exchange on the Internet as the representative of new technologies on the Web, and has become the focus of research and industrial communities. And so, how to effectively query the XML data has become an important research topic in the various application based on XML, owning to owing to the tree-like nature of XML data and new application.
     Twig pattern matching problem has been widely studied in recent years. Give an XML tree T,A twig-pattern matching query,Q, represented as a query tree, is to find all the occurrences of such twig pattern in T. Previous works like Holistic Twig and TJFast decomposed the twig pattern into single paths from root to leaves, and merged all the occurrences of such path-patterns to find the occurrences of the twig-patterns matching query, Q. Their techniques can effectively prune the impossible path-patterns to avoid producing a large amount of intermediate results. But they still need to merge path-patterns which occurs high computational cost. So, Twig2Stack and TwigList are proposed to resolve this situation, but all they need to scan all the nodes, in addition, they don't considered another nodes that basically don't need to be processed. After deeply studying the remaining difficult issues in XML query processing and based on the advantage of the twig pattern before, this article offers a new twig pattern algorithm TwigEN. The main contributions are summarized as follows:
     (1) Studying and analyzing the current twig pattern matching algorithm, and summarize the implementation method.
     (2) A new twig pattern matching algorithm TwigEN is proposed, which is processing the solution-extension nodes only, so, the algorithm could skip nodes efficiently.
     (3) Analyzing the complexity and feasibility of the algorithm, and giving the time and space complexity on the worst situation.
     (4) Finally, constructs an experimental system and implements the proposed twig pattern matching algorithm, then compares the TwigEN with TwigStack and TwigList, and the results show that the efficient of our approach is improved greatly.
     According to XML, the algorithm can skip nodes that are useless in the structural join. This is a good way to save time and memory as well as the amount of nodes. The relationship between the complication of time and space on the approach and sum of nodes is Linear.

引文

[1] extensible Markup Language(XML)1.0. http://www.w3 .org/XML/.Feb,2007[Z].
    [2] Zhang C, Naughton J, DeWitt D, Luo Q, Lohman G. On supporting containment queries in relational database management systems[A].2001,425-436.
    [3] Li QZ, Moon B. Indexing and querying XML data for regular path expressions[C]. In: Apers PMG, Atzeni P, Ceri S, Paraboschi S,Ramamohanarao K, Snodgrass RT, eds. Proc. of the 27th Int'l Conf. on Very Large Data Bases. San Francisco: Morgan Kaufrnann, Publishers, 2001. 361-370.
    [4] AI-Khalifa S, Jagadish H.V, Koudas N, Patel J.M, Srivastava D, Wu Y. Structural joins: A primitive for efficient XML querypattern matching[C]. In: Agrawal R,Dittrich K, Ngu AHH, eds. Proc. of the 18th Int'l Conf. on Data Engineering. Los Alamitos: IEEE Press, 2002.141-152.
    [5] Bruno N, Koudas N, Srivastava D. Holistic Twig Joins: Optimal XML Pattern Matching[J]. In Proceedings of SIGMOD, 2002. 310-321.
    [6] Jiang Haifeng,Wang Wei,Lu Hongjun. Holistic Twig Joins on Indexed XML Documents[J]. IN: VLDB. 2003.273-284.

    [7] Ting Chen, Jiaheng Lu and Tok Wang Ling . On Boosting Holism in XML Twig Pattern Matching[J]. In Proceedings of SIGMOD, 2005. 455-466.
    [8] Lu JH, Chen T and Tok Wang Li.Efficient Processing of XML Twig Patterns with Parent Child Edges: A Look-ahead Approach[C].In: CIKM'04, November 8-13,2004, Washington, DC, USA, 533-542.
    [9] Lu J, Ling T.W, Chan C.Y, Chen T. From Region Encoding To Extended Dewey.On Efficient Processing of XML Twig Pattern Matching[J]. In Proceedings of VLDB, 2005. 193-204.
    [10]Aghili S, Li H.G, Agrawal D., Abbadi A.E. Twix: Twig structure and content matching of selective queries using binary labeling[C]. In: INFOSCALE. 2006.

    [11]Chen S., Li H.G., Tatemura J., Hsiung W.P., Agrawal D., Candan K.S.Twig2stack:Bottom-up processing of generalized- tree pattern queries over XML documents[J]. In: VLDB. 2006.283-294.
    [12]Lu Qin, Jeffrey Xu Yu, and Bolin Ding. A Query Language and Algebra for semistructured data based on recursion[J].VLDB Journal 2000, 9(1),76-110.
    [13]Alin Deutsch,Mary Fernandez,Daniela Florescu,Alon levy and Dan Suciu.A Query Language for XM[C]L.Computer Network(Amsterdam,Netherlands) 1999,Volume 31,numberll-16,1155-1169.
    [14]Alin Deutsch, Mary Fernandez, Daniela Florescu, Alon levy and Dan Suciu. XML-QL: A Query Language for XML[Z].W3C Note.http://www.w3 .org/TR/NOTE-xml-ql
    [15]J.McHugh, S. Abiteboul, R. Goldman, D.Quass and J.Widom.Lore: A Database Management System for Semistructured Data[J].SIGMOD Record. 1997,26(3),54-66.
    [16]P.Buneman, M. Fernandez, D. Suciu.UnQL: A Query Language and Algebra for semistructured data based on recursion[J].VLDB Journal 2000, 9(l),76-110.
    [17] Sergey Melnik, Hector Garcia-Molina and Erhard Rahm.Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching[C]. Proceedings of the 18th ICDE Conference.2002,117-128.
    [18]Alin Deutsch, Mary Fernandez and Dan Suciu.Storing Semistructured Data with STORED[J].SIGMOD. 1999,431-442.
    [19]Wirth N. Type Extensions.ACM Transactions on Programming Languages and Systems, 1988,10(2),204-214.
    [20]H. Wang, S. Park, W. Fan, and P. S. Yu. VIST: A dynamic index method for querying XML data by tree structures[J]. In: SIGMOD. 2003,110-121.
    [21] P. Rao and B. Moon. PRDC: Indexing and querying XML using prufer sequences[C]. In: ICDE.2004,288-300.
    [22] Online Computer Library Center.Dewey decimal classification[Z]. Http://www.oclc.org/dewey/.
    [23] Wang W, Jiang HF, Lu HJ, Jeffrey XY. PBiTree coding and efficient processing of containment joins[C]. In: Dayal U, Ramamritham K, Vijayaraman TM, eds.Proc. of the 19th Int'l Conf. on Data Engineering. Los Alamitos: IEEE Press,2003,391-402.
    [24] Zhang C, Naughton J, DeWitt D, Luo Q, Lohman G. On supporting containment queries in relational database management systems[A]. In: Timos S, ed. Proc. of the 2001 ACM SIGMOD Int'l Conf.on Management of Data.New York:ACM Press,2001,425-436.
    [25]XML Path Language(XPath) 2.0.http://www.w3.org/TR/xpath/.Feb,2007[Z].
    [26]万常选.XML数据库技术[A].第1版.北京,清华大学出版社,2005,13-36.
    [27]富丽贞,陶世群.基于小枝模式的XML数据查询处理技术研究[O].山西大学.2008
    [28]傅立功,陶世群.一种基于Dewey编码的XML小枝模式匹配方法[O].山西大学.2008.
    [29]Zhangfeng Bao,Tok Wang Ling,Jiaheng Lu.Effective XML Keyword Search with Relevance Oriented Ranking[C].School of Computing National University of Singapore.2009,781-792.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700