TJDirect:一种基于IFED编码的XML文档小枝模式匹配算法
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
XML已成为Web上数据表示、集成和交换的标准,它的格式简单、自我描述能力强,实现了内容、结构和表现三者的分离,更适合于数据表示和交换。近年来,XML在各个领域得到了广泛的使用,Web上已经涌现了大量的XML数据。如何对XML文档进行有效地查询成为一个重要的研究课题。
     XML文档的结构查询通常被转化为两个节点列表之间的结构连接操作。结构连接主要有两种思想:一种是路径分解法,将查询的路径表达式分解成多个简单路径表达式,进行匹配,然后将结果连接起来。这种方法不可避免的要产生无用的中间结果。另一种是小枝模式匹配方法,将查询表达式用树的形式建模,称为“小枝模式”(twig pattern),再将小枝模式到XML文档树中进行匹配,这种方法使得算法的I/O代价、CPU时间复杂度和空间复杂度降低,提高了查询的效率。最近提出的基于ExtendedDewey编码的小枝模式匹配算法——TJfast能较有效地处理XML文档查询,但是ExtendedDewey编码不支持动态插入,并且TJfast算法的执行效率有待进一步提高。
     本文针对以上两个不足进行改进,主要工作如下:
     (1)提出了IFED(Insert-Friendly ExtendedDewey)编码方法,它从ExtendedDewey编码的基础上改进而来,既支持有效的查询,又支持动态插入。
     (2)TJfast算法的I/O代价和CPU时间复杂度与n个输入列表的大小和最后匹配结果大小的总和线性相关,但是,输入列表中有些不参与连接的节点是可以被跳过的。本文提出了TJDirect算法,它采用一种更为灵活的小枝模式匹配方法,跳过了一些不参与连接的节点,提高了算法的执行效率,并且算法中不再用集合保存匹配的节点,使得空间复杂度降低。
     (3)通过实验比较了基于IFED编码的TJDirect算法和基于IFED编码的TJfast算法的执行时间,结果显示TJDirect算法的执行效率较高。
     (4)为XML文档编码,考察IFED编码的存储空间超出ExtendedDewey编码的比值,实验证明IFED编码的存储空间不是很大。
XML has already become the standards of the expression, integration and exchange of the data on web. It has simple form and strong self-describing ability. Beside, it realizes the separation of the content, structure and expression, so that it is more adapt to data expression and exchange. During recent years, XML is widely used in various fields, and its data has abundantly appeared on web. How to query XML document become an important research topic.
     Querying XML document is usually transformed to join two nodes' list. There are two ideas about joining: one is to decompose it to several simple paths, match them and then join the results. The method inevitably produces useless result. The other is to match holistic twig, that is, to convert query path to a tree structure named twig pattern and then find its match in XML document tree. The method reduces algorithms' I/O and CPU complexity and space complexity. It makes query more efficiently. TJfast: a holistic twig matching algorithm proposed recently based on Extended Dewey encoding can efficiently deal with XML document query, but ExtendedDewey encoding doesn't support dynamic insertion and the performance of TJfast need improve.
     In this paper, we make improvement aiming at the two shortcomings mentioned above. The main contributions are summarized as follows:
     (1)We propose a new encoding scheme named IFED(Insert-Friendly ExtendedDewey),which improved from ExtendDewey. It supports both efficient query and dynamic insertion.
     (2)Algorithm TJfast has I/O and CPU time complexities linear in the sum of sizes of the n input lists and the output list. But some nodes in input lists which do not contribute to final results may be skipped. In this paper, we propose a more flexible twig pattern matching algorithm named TJDirect, which skip some nodes which won't be joined. Algorithms' performance is improved. Furthermore, the algorithm doesn't use sets associated with matching nodes, so the space complexity is reduced.
     (3)We compare the execution time of TJfast based on IFED and TJDirect based on IFED through experiments and conclude that TJDirect is more efficient than TJfast.
     (4)We encode for XML documents to make a survey of the space of IFED encoding and ExrendedDewey encoding and make a conclusion that the space of IFED is not very large.
引文
[1] World Wide Web Consortium. Extensible Markup Language (XML) 1.0 (Fourth Edition) [EB/OL]. 2006.http://www.w3.org/TR/REC-xml.
    [2] 万常选.XML数据库技术[M].北京:清华大学出版社.2005
    [3] C. Zhang, J. F. Naughton, D. J. DeWitt, Q. Luo, and G. Lohman. On supporting containment queries in relational database management systems[C]. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data.Santa Barbara, California, USA. June 2001.125-436
    [4] Wang Wei, Jiang Haifeng, Lu Hongjun, et al. PBiTree Coding and Efficient Processing of containment joins[C]. In: Casati F et al Eds. Proceedings of the 19th IEEE ICDE International Conference on Data Engineering. Bangalore, India. March 5-8, 2003. Los Alamitos: IEEE Computer Society. 2003.443-454
    [5] Grust T. Accelerating XPath Location Steps[C]. In: Franklin M J et al Eds. Proceedings of the 21th ACM SIGMOD International Conference on Management of Data. Madison, USA. June 3-6,2002.ACM Press. 2002. 109-120
    [6] Grust T, Keulen M V, Teubner J. Staircase Join: Teach A Relational DBMS to Watch its(Axis) steps[C]. In: Heuer A et al Eds. Proceedings of the 29th VLDB International Conference on Very Large Database. Berlin, Germany. September 9-12,2003.San Francisco: Morgan Kaufmann Publishers. 2003. 524-535
    [7] 王静,孟小峰,王珊.基于区域划分的XML结构连接[J].软件学报,2004.15(5):720-729.
    [8] Shu-Yao Chien, Zografoula Yagena, Donghui Zhang. Efficient Structural Joins on Indexed XML Documents[C]. Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002.263-274.
    [9] 万常选,刘云生,徐升华,刘喜平,林大海.基于区间编码的XML索引结构的有效结构连接[J].计算机学报,2005.28(1):113-127
    [10] 刘云生,万常选,徐升华.基于关系数据库有效地实现RPE查询[J].小型微型计算机系统,2003.24(10):1764-1771
    [11] Jiang Haifeng, Lu Hongjun, Wang Wei, et al. XR tree: Indexing XML Data for Efficient Structural Joins[C]. In: Casati F et al Eds. Proceedings of the 19th IEEE ICDE Intemational Conference on Data Engineering. Bangalore, India. March 5-8,2003.Los Alamitos: IEEE Computer Society. 2003.253-264
    [12] Lam F, Shui W M, Fisher D K, et al. Skipping Strategies for Efficient Structural Joins[C]. In: Proceedings of the 9th DASFAA International Conference on Database Systems for Advanced Applications.2004,3.196-207
    [13] World Wide Web Consortium. XML Path Language (XPath) 2.0[EB/OL]. January 2007.http://www.w3.org/TR/xpath20/
    [14] 吕建华,王国仁,于戈.XML数据的路径表达式查询优化技术[J].软件学报,2003.14(9):1615-1620.
    [15] H. Wang, S. Park, W. Fan, and P. S. Yu. ViST: A dynamic index method for querying XML data by tree structures[C]. In SIGMOD. 2003.110-121.
    [16] P. Rao and B. Moon. PRIX: Indexing and querying XML using prufer sequences[C]. In ICDE.2004. 288-300.
    [17] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: Optimal XML pattern matching[C]. In SIGMOD. 2002.310-321.
    [18] H. Jiang, W. Wang, and H. Lu. Holistic twig joins on indexed XML documents[C]. In Proc. of VLDB. 2003. 273—284.
    [19] J. Lu, T. Chen and T. Ling. Efficient Processing of XML Twig Patterns with Parent Child Edges: A Lookahead Approach[C]. In CIKM. 2004.533—542,.
    [20] Y. Chen, S. B. Davidson, and Y. Zheng. BLAS: An efficient XPath processing system[C]. In Proc. of SIGMOD, 2004.47—58.
    [21] T. Chen, J. Lu, and T. Ling. On boosting holism in XML twig pattern matching using structural indexing techniques[C]. In SIGMOD To appear.2005.
    [22] Al-Khalifa S, Jagadish HV, Koudas N, Patel JM, Srivastava D, Wu Y. Structural joins: A primitive for efficient XML query pattern matching[C]. In: Proc. of the 2002 IEEE ICDE Conf. San Jose: IEEE Computer Society, 2002. 141-154.
    [23] Q.Z. Li, B. Moon. Indexing and querying XML data for regular path expressions[C]. In Proceedings of the 27th International Conference on VLDB. Rome, Italy, September 2001.361-370
    [24] 罗道锋,孟小峰,蒋瑜.XML数据扩展前序编码的更新方法.软件学报[J].2005.16(5):810-818.
    [25] B. Yang, M. Fontoura, E. J. Shekita, S. Rajagopalan, and K. S. Beyer. Virtual cursors for XML joins[C]. In CIKM, 2004. 523-532,
    
    [26] H. Kaplan, T. Milo,R. Shabo, A Comparison of Labeling Schemes for Ancestor Queries[C], In: Proceedings ACM-SIAM Symposium on Discrete Algorithms, January 2002.
    [27] Tatarinov, S. Viglas, K. Beyer, J. Shanmugasun -daram, E. Shekita, C. Zhang. Storing and Querying Ordered XML Using a Relational Database System[C]. SIGMOD 2002.
    [28] E. Cohen, H. Kaplan and T. Milo, Labeling Dynamic XML Tree[C], In Proceedings of the 21st ACM Symposium on Principles of Database Systems, Madison, Wisconsin, USA, 2002. 271-281.
    [29] J. Lu, T. Wang Ling, C. Chan, T. Chen . From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching[C]. VLDB .2005. 193-204
    [30] S. Abiteboul, D. Quass, J. McHudh, J. Widom, J. Wiener. The Lorel query language for semistructured data[C]. International Journal on Digital Libraries(1), April 1997.68-88.
    [31] World Wide Web Consortium. Xquery 1.0: An xml query language[EB/OL].January 2007.http://www. w3.org/TR/ xquery.
    [32] Deutsch, M. F. Fernandez, D. Florescu, A. Y. Levy, D. Suciu. A Query Language for XML[C]. Computer Network 31(11-16).May 1999.1155-1169.
    
    [33] S. Ceri, S.Comai,E, Damiani, P.fratemail, S.Paraboschi, L.Tanca. XML-GL:A Graphical Language for Querying and Restructuring WWW Data[C]. In Proceedings of 8th International WWW conference.May 1999.
    [34] J.Robie, J.Lapp, D.Schach. XML Query Language(XQL)[C]. In Proceedings of the Query Language Workshop. December 1998.
    
    [35] D.Chamberlin, J.Robic, D.Florescu. Quilt: An XML Query Language for Heterogeneous Data Source[C]. WebDB2000, May 2000.
    
    [36] 张剑妹,陶世群.一种适用于顺序XML树的前缀编码方法[J].计算机应用,2005.25(12):2879-2881.
    
    [37] P.F.Dietz.Maintaining order in a linked list[C].In: Proceeding of 14th Annual ACM Symposium on Theory of Computing, San Francisco, California, May 1982.122-127
    [38] P. O'Neil, E. O'Neil, S. Pal, I. Cseri, G. Schaller, N. Westbury. ORDPATHs: Insert-Friendly XML Node Labels[C]. SIGMOD (2004) 903-908.
    
    [39] Centrum voor Wiskunde en Informatica.XMARK- An XML-benchmark project[EB/OL]. 2003. http://monetdb.cwi.nl/xml/

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700