XML结构索引技术及查询优化研究

英文题名：Study on Structural Index Technology and Query Optimization for XML
作者：郭松涛
论文级别：硕士
学科专业名称：计算机软件与理论
中文关键词：XML ; Numbering ; Schema ; 存储 ; B~+树结构索引 ; 连接算法
英文关键词：Numbering Schema ; storage ; B~+-tree structural index ; join algorithm
学位年度：2003
导师：朱征宇
学科代码：081202
学位授予单位：重庆大学
论文提交日期：2003-05-06
答辩委员会主席：朱庆生

摘要

为了实现XML的查询优化，近年来人们相继提出了很多索引技术和连接算法[12,13,14,15,16,23,24]。这些索引主要是根据边标签和元素值建立的。然而有的索引不包含所有的元素结点，因而在进行查询时许多路径仍需要检测；有的在向前或向后遍历时产生了大量的冗余数据，从而造成查询代价较大。另外，在所提出的算法中，尽管有的算法，如MPMGJN算法[23]优于标准的RDBMS连接算法，但是该算法为匹配基本的结构关系，特别是在父子关系情况下，执行了大量不必要的计算和占用了大量的I/O资源；有的算法虽然代表了结构连接算法的先进水平，如Stack-Tree-Desc[24]连接算法，但是它没有利用索引结构而是顺序浏览输入列表。这样，必然浪费I/O资源，影响连接的速度。
    针对以上情况，本文做了以下几个方面的工作：
    ① 由于采用传统的Numbering Schema方法来表示XML文件结构不便于元素更新，本文在改进的基础上提出了Sparse Numbering Schema方法。与传统方法相比，其优点在于：由于在插入新结点时不需要重新计算其结点的start和end值，树结构更新效率得到提高；树的创建只需遍历一次文档，进一步地节省了建树开销；此外，它还能为索引提供一个相对持久和稳定的参考。
    ② 鉴于目前关于Numbering Schema存储方法的研究较为少见，本文针对Sparse Numbering Schema进行研究，给出了在关系数据库中的存储方法。该存储方法不仅有利于根据start值快速建立索引，而且可以节省存储空间。
    ③ 本文将关系数据库中B＋树索引技术与Sparse Numbering Schema相结合，提出了一种新的XML文件索引结构——B~+树结构索引，它对XML查询中连接操作和元素定位操作的优化有着重要作用。进而，通过引入指针对该索引进行改进，提出了一种带有Sibling Pointer的B~+树结构索引（简称B~+-SP）。利用这种索引可以克服元素查找总是从树的根部开始进行的缺陷。
    ④ 基于B~+-SP索引，本文还研究给出了Anc-Desc-B~+-sp连接算法。经理论分析，其算法的时间复杂度O(|A|+log|A|)比没有采用该索引的Stack-Tree-Desc算法[24]的时间复杂度O(|A|+|D|+|outlist|)明显降低，因|D|≥|A|，故|D|+|outlist|>>log|A|。经初步实验表明，本算法是一个有效、快速的连接算法。
    ⑤ 在XML查询中，影响查询时间的另一个重要因素是对涉及的XML数据源的定位问题。为解决XML数据源的快速定位问题，本文提出了一种分布式XML数据源定位系统框架，协作式XML搜索引擎（CXSE）。CXSE通过基于站点选择搜索和对XML数据源计分等方法来缩短收集时间，来实现对XML数据源的快速、准确定位。特别地，当在XML查询中同时涉及多个XML数据源时，该并行搜索技术也能起到一定的效果。
Various index techniques and join algorithms [12,13,14,15,16,23,24] have been recently proposed, in order to realize query optimization for XML. The indices are built on the tags and the element values. Nevertheless, some indices do not contain all element nodes, many paths need to still be examined in the query; other indices produce redundant data in the preorder or postorder traversal, this makes the cost of query much more. In the proposed join algorithms, although some algorithms such as MPMGIN algorithm [23], outperform standard RDBMS join algorithms, they perform a lot of unnecessary computation and I/O for matching basic structural relationships, especially in the case of parent-child relationships; other algorithms such as the Stack-Tree-Desc algorithm [24], represent the state-of-the-art in structural joins, however, they do not utilize indexed structures but sequentially scan the input lists. Thus, I/O's can be wasted for scanning element that do not participate in the join, and join speed can be influenced.
    According to this situation, the main works and contributions in the paper are as follow:
    ① As it is inconvenient to update that the conventional numbering schema is used to represent the structure of XML document, a sparse numbering schema based on improvements is proposed in this paper. By comparing with the conventional method, the sparse numbering schema has some merits as follow: the values of start and end do not recomputed, when a new node is inserted, the updating efficiency of tree structure is improvement, and the XML documents are traversed only once, when the schema is constructed, this further decrease the cost of building tree, and the schema can provide a durable conference for index.
    ② As the storage approach of the numbering schema is scarce, this paper proposes a new approach that the sparse numbering schema is stored in the relational database. By utilizing this storage approach, the indices can be easily built on the start column, and the storage space can be mostly reduced.
    ③ This paper refers to the indexing technologies of B~+-tree in DBMS, and combine it with the sparse numbering schema, and proposes a new indexed structure — B~+-tree structural index. It is very important for the optimization of join operation and element location in XML query. By introducing the pointers for further improving the indexed structure, this paper proposes a B~+-tree structural index with sibling pointer (B~+-sp for shorten). This structural index can avoid the defect of always traversing B~+-tree from the boot.
    ④ Based on B~+-sp, this paper proposes Anc-Desc- B~+-sp join algorithm. It is theoretically analyzed that the time complexity of the join algorithm (O(|A|+log|A|)) is obviously less than that of the Stack-Tree-Desc algorithm (O(|A|+|D|+|outlist|) [24], because of |D|≥|A|, |D|+|outlist|>>log|A|. Experiment results primarily prove that the join algorithm is more efficient and quick join algorithm.
    In XML query, the other important factor of influencing the query time is the problem of the location of XML data source. In order to resolve the problem, this paper proposes a distributed XML data source location system frame, called Cooperative XML Search Engine (CXSE). CXSE can shorten collection time by searching based site selection and

    ⑤ scoring Web document. Accordingly, CXSE really realize to quickly and correctly locate the URL of XML document needed. Moreover，the retrieval system is available to several XML data source in XML query.

引文

[1] W3C. "Extensible Markup Language (XML) 1.0" W3C Recommendation. February 10, 1998. http://www.w3c/TR/1998/REC-xml-19980210.html
    [2] S.Abiteboul, D.Quass, J.McHugh. The Lorel query language for semistructured data. International Journal on Digital Libraries, April 1997
    [3] Alin Deutsch, Mary Fernandez. A query language for XML. In Proceeding of the 8th International World Wide Web Conference, May 1999
    [4] Stafano Ceri, Sara Comai. XML-GL: A graphical language for querying and restructuring XML documents. In Proceeding of the 8th International World Wide Web Conference, May 1999
    [5] Don Chamberlin,Jonathan Robie. Quilt: An XML query language for heterogeneous data sources. In International Workshop on the Web and Databases. May 2000
    [6] James Clark and Steve DeRose. XML Path Language (XPath) version 1.0 w3c recommendation. Technical Report REC-xpath-19991116, World Wide Web Consortium, November 1999
    [7] World Wide Web Consortium, "XQuery 1.0: An XML Query Languages", W3C Working Draft November 15 , 2002 . http://www.w3.org/TR/2002/WD-xquery-20021115/
    [8] Jason McHugh and Jennifer Widom. Query optimization for XML. In Proceeding of the 25th VLDB Conference, Edinburgh, Scotland, 1999
    [9] R. Goldman and J. Widom. DataGuides: enabling query formulation and optimization in semistructured databases. In Proc. of the Int'l Conf. on Very Large Databases, pages 436-445, 1997
    [10] T. Milo and D. Suciu. Index structures for path expressions. In Proc. of the Int'l Conf. on Database Theory, pages 277-295, 1999
    [11] H. Liefke and D. Suciu. Xmill: an Efficient Compressor for XML Data. In Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, pages 153-164, 2000
    [12] J. McHugh and J. Widom. Query optimization for semistructured data. Technical Report, Stanford University, 1997
    [13] B.F.Cooper, N.Sample, M.J.Francline, "A fast index for semistructured data", Proc. Of VLDB, 2001
    [14] Flavio Rizzolo, Alberto Mendelzon. Indexing XML Data with ToXin (2001) . http://citeseer.nj.nec.com/rizzolo01indexing.html


    [15] 方翔，袁国栋，李伟生。建立特殊索引实现XML文档的查询优化。《计算机工程》 2002年03期
    [16] Quanzhong Li, Bongki Moon. Indexing and Querying XML Data for Regular Path Expressions. In Proceeding of the 27th VLDB Conference, Roma, Italy, 2001
    [17] A.Apostolico and Z.Galil. Pattern Matching Algrithm. Oxford University Press, 1997
    [18] G.Jacobson, B.Krishnamurthy, and D.Suciu. Focusing search in hierarchical structures with directory sets. In Proceedings of the seventh International Conference on Information and Knowledge Management (CIKM), Washington, DC, Nov. 1998
    [19] H.V.Jagadish, L.V.S.Lakshmanan. Querying network directories. In Proceedings of the ACM SIGMOD Conference on Management of Data. Philadelpia, PA，June, 1999
    [20] D.Dewitt, J.Naughton, and D.Schnieider. An Evaluation of Non Equijoin Algorithms. Proceedings of ACM SIGMOD, pages 443-452, 1991
    [21] N.Koudas and K.C.Sevcik. Size Separation Spatial Join. Proceedings of ACM SIGMOD, pages 324-335, 1997
    [22] M.P.Consens and T.Milo. Optimizing queries on files. In Proceedings of the ACM SIGMOD Conference on Management of Data. Minneapolis, MN，May, 1995
    [23] C.Zhang, J.Naughton, D.Dewitt, Q.Luo. On Supporting containment queries in relational database management systems. In Proceedings of the ACM SIGMOD Conference on Management of Data. 2001
    [24] S.Al-Khalifa, H.V.Jagadish, N.Koudas, "Structural Join: A Primitive for Efficient XML Query Pattern Matching", Proc. Of ICDE, 2002
    [25] Tarjan R. Fast algorithms for solving path problems. Journalof ACM, 1981, 2 8(3 ):594～ 6 1 4
    [26] Shanmugasundaram J, Tufte K, Zhang C et al. Relational databases for querying XML document: Limitations and opportunities. In Proc. of the 25th Int'l Conf on Very Large Databases. Edinburgh, Scotland, 1999. 3 0 2～2 1 4
    [27] 王继成、杨晓江，基于元资料与Z39.50的分布协作式Web信息检索，软件学报，2001/12（04）0620-8
    [28] C. Mic Bowman, Peter B. Danzig, Darren R. Hardy, Udi Manber, Michael F. Schwartz: “The Harvest Information Discovery and Access System”, http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/schwartz.harvest/schwartz.harvest.html
    [29] Ingrid, http://www.ingrid.org/
    Nobuyoshi Sato, Minoru Uehara, Yoshifumi Sakai, Hideki Mori: “A Distributed Search Engine for Fresh Information Retrieval”, Third International Workshop on Network-Based

    [30] Information Systems (NBIS-3), (2001.9)

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700