XML关键字查询中数据索引和查询结果排序算法研究

英文题名：Research on Data Index and Query Result Sort Algorithm in XML Keyword Query
作者：王雪松
论文级别：硕士
学科专业名称：计算机应用技术
中文关键词：xml索引 ; Ttree变形树 ; xml关键字排序 ; xquery查询评估 ; 路径摘要 ; 值索引
英文关键词：xml index ; Ttree distortion tree ; xml keyword sort ; xquery evaluation ; path summary ; value index
学位年度：2007
导师：陈俊杰
学科代码：081203
学位授予单位：太原理工大学
论文提交日期：2007-05-01

摘要

90年代以来，信息技术进入了一个历史上前所未有的飞速发展时期。INTERNET为用户提供了资源共享和信息交互的方便快捷的手段和平台。目前，大多数WEB上的文档是用HTML格式存放和传送的，但在扩展性、结构化和可验证性等方面的不足限制了HTML的应用能力。于是，可扩展标记语言(XML)应运而生，它是专门为WEB设计的一个简单的SGML的子集，既克服了HTML的不足，又去除了SGML中那些对于当前WEB用户来说不必要的特性。目前，XML已成为万维网数据表示和交换的标准。随着XML文档的大量涌现，针对XML文档的检索查询成为目前研究的热点方向之一。
     尽管基于HTML关键字的搜索引擎已取得很大的成功，但由于XML与HTML在诸多方面存在差异，若不加修改直接用于XML数据查询，则不能完全发挥XML所带来的好处。特别是在关键字搜索查询时，针对XML数据建立有效的索引机制是实现高效查询的重要手段；此外，由于关键字搜索查询有可能返回大量的查询结果，对关键字查询的结果进行有效的排序，也显得尤为重要。本论文即是针对XML数据索引和查询结果排序这两个与XML关键字查询相关的技术开展研究。
     针对XML数据索引问题，本文在对现有的XML数据索引技术进行分析的基础上，并通过对XML文档树进行压缩的方式，建立了Ttree变形树索引结构和相应算法。我们详细地讨论了Ttree变形树索引的数据结构及其相关算法，并对现有的XML数据索引和Ttree变形树索引进行了对比实验，通过实验验证了Ttree变形树索引的优越性。针对XML关键字搜索查询结果的排序问题，本文探讨了一种关键字搜索查询结果的排序算法ElemRank，我们也详细地讨论了该算法的每个步骤，并通过实验验证了该算法的有效性。
Since the 90's,the information technology entered in a history the unprecedented rapid development time. INTERNET was the user has provided resources sharing and the information interactive convenience quick method and the platform. At present, on the majority WEB documents are with HTML form depositing and the transmission, but in the extension, the structure and might the confirmation and so on the aspect insufficiency limits the HTML application ability. Thereupon, may expand the mark language (XML) to arise at the historic moment, it's specially is a WEB design simple SGML subset, both has overcome the HTML insufficiency, and removed in SGML these to say the nonessential characteristic regarding the current WEB user. At present, XML has become the World Wide Web data expression and the exchange standard.Massively emerges along with the XML documents, inquires one of hotspot directions in view of the XML documents retrieval which into at present studies.
     Although has obtained the very big success based on the HTML essential character search engine, but because XML and HTML have the difference in many aspects, if does not revise directly uses in the XML data inquiry, then cannot completely display the advantage which XML brings. Specially when essential character search inquiry, in view ofthe XML data establishment effective index mechanism is the realization highly effective inquiry important method; In addition, because the essential character search inquiry has the possibility to return to the massive inquiries result, carries on the effective arrangement to the essential character inquiry result, also appears especially importantly. The present paper is arranges these two in view of the XML data index and the inquiry result with the XML essential character inquiry related technology development research.
     In view of the XML data index question, this article in carries on the analysis to the existing XML data index technology in the foundation,and through carries on the compression to the XML documents tree the way, has established the Ttree distortion tree index structure and the corresponding algorithm. We in detail discussed the Ttree distortion tree index construction of data and its the conelation algorithm, and has carried on the contrast experiment to the existing XML data indexand the Ttree distortion tree index, has confirmed the Ttree distortion tree index superiority through the experiment. In view of the XML essential character search inquiry result arrangement question, this article has discussed one kind of essential character search inquiry result sort algorithm ElemRank, we also in detail discussed this algorithm each step, and has confirmed this algorithm validity through the experiment.

引文

[1] S.Abiteboul, P.Buneman, and D.Suciu.Data on the web: from relations to semis tructured data and xml. Morgan kaufman Publisher, Los Altos, USA, 1999.

    [2] A.Arion, A.Bonifati, G.Costa and S.D Aguanno.Ioana ManoLescu, Pugliese: Efficient Query Evaluation over Compressed XML Data.Edbt, 2004
    [3] Q.Chen, A.Lim, and K.Ong D(k)-Index: An adaptive Structural summary for Graph-structured data. ACM SIGMOD, 2003.
    [4] S-Y.Chien, Z.Vagena, D.Zhang, VJ.Tsotras, and C.Zaniolo. Efficient Structural Joins on Indexed XML Documents, VLDB, 2002.
    [5] C.Chung, J.Min, and K.Shim.Apex:An adaptive path index for XML data.ACM SIGMOD, 2002.
    [6] B.Cooper, N.Example, M.J.Franklin, GR.Hjaltason, and M.Shadmon.A fast index for data.VLDB, 2001.
    [7] R.Goldman and J.Widom.Dataguides: Enabling query formulation and optimization in semistructured databases, VLDB, 1997.
    [8] R.Kaushik, P.Bohannon, J.Naughton, and H.Korth.Covering indexes for branching path queries.ACM DIGMOD, 2002.
    [9] R.Kaushik, P.Shenoy, P.Bohannon and E.Gudes.Exploring Local Similarity for Indexing Paths in Graph-Structured Data.ICDE, 2002.

    [10] R.Kaushik, P.Bohannon, J.F.Naughton and P.Shenoy. Updates for Structure Indexes.VLDB, 2002.

    [11] R.Kaushik, R.Krishnamurthy, J.F.Naughton and R.Ramakrishnan.On the Integration of Structure Indexes and Inverted Lists.SIGMOG, 2004.

    [12] Michael Ley. DBLP database website: http://www.informatik. uni-trier. de/ley/db.
    [13] Q.Li and B.Moon.Indexing and querying XML data for regular pathexpressions.VLDB,2001.
    [14] H.Jiang, H.LU, W.Wang, and B.C.Ooi.XR-Tree: Indexing XML Data for Efficient Structural Joins.ICDE, 2003.

    [15] T.Milo, and D.Suciu.Index structures for path expression.ICDT, 1999.
    [16] S.Nestorov, J.Ullman, J.Wiener, andS.Chawathe.Representative objects: Concise representations of semistructured, hierachical data.ICDE, 1997.
    [17] P.Rao, and B.Moon.PRIX: Indexing and querying XML using Prunfer sequences,ICDE, 2004.
    [18] D.Srivastava, S.A1-Khalifa, H.VJagadish, N.Koudas, J.M.Patel, and Y.Wu.Structural joins: A primitive for efficient XML query pattern matching.ICDE, 2002.
    [19] I.Tatarionv, Z.G.Ives,A.Y.Halevy,D.S.Weld.Updating XML.SIGMOD, 2001.
    [20] H.Wang, S.Park,W.Fan, and P.S Yu.ViST: A dynamic index method for querying XML data by tree structures.SIGMOD, 2003.
    [21] F.Weigel, H.Menuss, F.BryandK.U.Schulz. Content-Awareataguides: Interleaving IR and DB Indexing Techniques for Efficient Retreival of Texual XML Data.ECIR, 2004.
    [22] M.Yoshikawa, T.Amagasa, T.Shimura, and S.Uemura.XRel: A path-bases Approach to storages and retrieval of XML documents using relational databases.ACM Transaction on Internet Technology, 1 (1)110-141, August 2001.
    [23] C.Zhang, J.Naughton, D.DeWitt, Q.Luo, and G.Lohman.On supporting Containment queries in relational database management systems. ACMSIGMOD, 2001.
    [24] S.Liu, Q.Zou, W.Chu.Configurable Indexing and Ranking for XML Information Retreival.ACMSIGIR, 2004.
    [25] XMARK (The XML-benchmark project) http://monetdb.cwi.nl/xml.
    [26] INEX(InitiativefortheEvaluationofXMLRetrieval)http://inex.is.informatik.uni-duisburg.de:2003/.
    [27] XML Path Language (XPath) 2.0. W3C Working Draft 2003, www.w3c.org/TR/xpath20/.
    [28] D.Chamberlin, J.Clark, D.FloreScu, J.Robie, J.Simeon, M.Stefanescu. XQuery 1.0: XML Query Language.W3C Working Draft 2001. www.w3c.org/TR/xquery.
    [29] S.Brin, L.Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW Conf., 1998.
    [30] S.Dar, et al., "DTL's DataSpot: Database Exploration Using Plain Language", VLDB Conf., 1998.
    [31] 万常选．DTD和Schema在电子商务应用中的比较研究．计算机应用研究，2002，19(9)：30～32．