支持压缩域查询的XML数据压缩方法研究

英文题名：Research on Query Supporting XML Data Compression
作者：张万松
论文级别：博士
学科专业名称：计算机应用技术
中文关键词：XML ; 规范化 ; 数据冗余 ; 数据压缩 ; 查询处理 ; 存取控制规则
英文关键词：XML ; Normalization ; Data Redundancy ; Data Compression ; Query Evaluation ; Access Control Rules
学位年度：2005
导师：刘大昕
学科代码：081203
学位授予单位：哈尔滨工程大学
论文提交日期：2005-05-10

摘要

近年来,随着Internet的迅猛发展,XML已经成为数据交换和表示的主要标准。由于XML具有良好的可扩展性和跨平台性,越来越多的信息以XML文件的形式进行交换和存储。XML数据的一个特点是存在较大的数据冗余,会造成存储空间的浪费、查询效率的降低。因此,对XML数据行有效压缩和查询成为XML数据库研究领域的一个重要的研究问题。
     本文主要研究XML数据的压缩和查询技术,对XML数据的存储模式拆分调整、XML数据的规范化存储、XML数据的相似性分析、频繁子树的挖掘、基于树文法的压缩、基于签名自动机的压缩数据查询技术等方面进行了深入的研究,提出了有效的算法。
     本文的研究工作主要围绕以下几个方面进行:
     首先对XML的研究历史与现状进行综述,分析了当前XML数据压缩与查询的研究现状和目前已有XML数据压缩方法的不足,并指出了研究主题及目标。
     其次,提出了XML模式规范化方法,利用函数依赖和规范化规则发现和消除XML文档中存在的冗余结构,实现在语义一级消除XML数据冗余;研究并阐述了基于树文法的XML数据压缩方法。研究了XML文档集之间和文档内部的结构冗余问题,并在此基础上,通过对文档集进行聚类、发现频繁子树,最终实现压缩,并对所提出的算法进行了实验,验证了算法的功能和有效性;提出了基于压缩域的XML压缩数据查询处理方法。为了实现非完全解压缩状态下的查询处理问题,提出将签名技术和自动机技术相结合的基于签名自动机的查询处理算法,实现XML压缩数据在非完全解压缩状态下的查询处理;提出了XML数据存取控制规则的压缩与查询方法。为了处理XML压缩数据的安全控制,以及由此带来的存取控制规则规模膨胀的问题,提出了基于DAC模型的存取规则剪枝处理方法,有效地压缩存取控制规则所占用的空间,并给出了规则压缩的查询处理方法。
In recent years, as XML has become an emerging standard for information exchange on the World Wide Web, it is clear that an enormous amount of data in the Internet will be encoded in XML in the near future because of it's extensibility and characteristic of cross-platform. However, XML documents in their textual form are rather verbose and tend to predate disk space and hinder the ability of query, due to the textual and repetitive nature of the XML tags and of several XML types. How to efficiently compress XML data and evaluate XPath queries over compressed XML data is a fundamental problem.In this paper, Methods of XML data compression with query support were studied intensively, including XML data models, schema formalisms and decomposition, the similarity analysis of XML documents, finding frequent subtrees, tree grammar based compression and pushing queries to compression data based on signature automata, etc.The main research works and specific contributions found in this thesis cover the following aspects:The research history and status-art of XML were summarized. Moreover, the XML data management technologies were analyzed. The disadvantages of exist methods of XML data compression were analyzed in detail. Furthermore, the developing aspects and goals of the study on XML data compression were given.A concept of XK-NF normal form for XML documents based on DTD path expression is proposed. The advantage of the definition is that it can represent the normal form with key constrains with three forms of functional dependency. The decomposition algorithm for XML schema is proposed for reducing the data redundancies based on the formalization rules, which is not mentioned by other XML compressor.The method of compressing XML data based on tree grammar is put forward. Redundant data appearing not only in a single XML document but also within
    different documents, an XML compression method based tree-grammar is proposed. In order to compress XML data with query support, a clustering step based on k-means is performed as the first step for raw XML documents to generate clusters. Next, within a cluster, a frequent sub-structure mining algorithm is presented to generate the compression dictionary similar to FP-growth method. Finally, subtrees are substituted by binding variable and frequent sub-structure based on the thinking of tree-grammar.The method of querying compressed data is studied. A significant portion of this thesis is devoted to query over compressed XML data with the analysis of indexing schemas and query method appeared in other XML compressor. The queries are performed effectively based on signature index and signature automata under non-full decompression.A method of access control rules compression with query support is given. In order to cope with the duplication of access control rule, a rule pruning method is proposed for XML data access control based on DAC model, which can compress the access control rules effectively. Furthermore, the query algorithm is presented for compressed access control map.

引文

[1] Bray T, Paoli J, Sperberg C M, et al. Extensible Markup Language (XML) 1.0. W3C Recommendation, 1998. http://www.w3.org/TR/REC-xml
    [2] Daniela F, Donald K. A Performance Evaluation of Alternative Mapping Schemes for Storing XML Data in a Relational Database. Technical Report 3684, INRIA, March 1999
    [3] Meike K, Holger M. XML and Object-Relational Database Systems-Enhancing Structural Mappings Based on Statistics. Selected Papers From the 3rd International Workshop WebDB 2000 on the World Wide Web and Databases, 2000: 151-170P
    [4] Florescu D, Kossmann D. Storing and Querying XML Data Using an RDBMS. IEEE Data Engineering, 1999, 22(3): 27-34P
    [5] Ioana M, Daniela F, Donald K. Pushing XML Queries Inside Relational Databases, Inria Rapport de recherche, No. 4112, 2001
    [6] Michael J. Review-Relational Databases for Querying XML Documents: Limitations and Opportunities. ACM SIGMOD Digital Review, 2000
    [7] Shanmugasundaram J, Shekita E J, Barr R, et al. Efficiently Publishing Relational Data as XML Documents. Proceedings of VLDB, 2000
    [8] Roy G, Jennifer W. Dataguides: Enabling Query Formulation and Optimization in Semistructured Databases. Technical Report, Stanford, 1977
    [9] Chan C Y, Felber P, Garofalakis M, et al. Efficient Filtering of XML Documents with XPath Expressions. Technical Report, Bell Labs, June 2001
    [10] Sara C, Yaron K, Yakov A. et al. EquiX-A Search and Query Language for XML. JASIST, 2002, 53(6): 454-466P
    [11] McHugh J, Widom J, Abiteboul S, et al. Indexing Semistructured Data. Technical Report, Stanford University, 1998
    [12] Felix W. A Survey of Indexing Techniques for Semistructured Documents. Technical Report, University of Munich, 2002
    [13] Chan C Y, Garofalakis M, Rastogi R. RE-Tree: An Efficient Index Structure for Regular Expressions. Technical Report, Bell Laboratories, Memorandum, February 2002
    [14] Yoon J, Raghavan V, Venu C. BitCube: A Three Dimensional Bitmap Indexing for XML Documents. The 13th International Conference on Scientific and Statistical Database Management, FairFax, VA, 2001
    [15] Haifeng Jiang, Hongjun Lu, Wei Wang, et al. XR-Tree: Indexing XML Data for Efficient Structural Joins. The 19th International Conference on Data Engineering, 2003
    [16] http://www.data-compression.info/Algorithms/index.htm
    [17] Witten I H, Neal R M, Cleary J G. Arithmetic Coding for Data Compression. Communications of the ACM, 1987, 30(6):520-540P
    [18] G.Cormack, Data Compression in a Database System, Communications of the ACM,28, 12,Dec, 1985
    [19] 骆吉洲,李建中.一种有效的关系数据库压缩方法.软件学报.2005 16(2):205-214页
    [20] Milo T, Suciu D. Index Structures for Path Expressions. Lecture Notes in Computer Science, 1999, 1540: 277-295P
    [21] Connell S J, Winterbottom N. Performing Joins Without Decompression in a Compressed Database System. ACM SIGMOD Record. 2003, 32(1): 55-67P
    [22] Kunihiko S. Unifying Text Search and Compression-Suffix Sorting, Block Sorting and Suffix Arrays: [dissertation]. The University of Tokyo, December 1999
    [23] Suel T, Yuan J. Compressing the Graph Structure of the Web. Proceedings of the IEEE Data Compression Conference (DCC), March 2001
    [24] Liefke H, Suciu D. XMilI: An Efficient Compressor for XML Data. Proceedings of SIGMOD, 2000
    [25] Cheney J. Compressing XML with Multiplexed Hierarchical PPM Models. Proceedings of the IEEE Data Compression Conference, 2000
    [26] Levene M, Wood P T. XML Structure Compression. Proceedings of the 2nd International Workshop on the Web Dynamics, 2002
    [27] 王宏志,李建中,何震瀛.一种压缩XML数据仓库的存储策略.NDBC 2002
    [28] 陈敏敏,胡蓉,唐常杰.对源于数据库的XML文件的结构制导压缩技术.NDBC 2002
    [29] Aditya N. Using Hardcoded Hedge Automata for Compressing Structured Documents. Echnical Report, No. TR-ⅡSc-CSA-2002-1, Department of Computer Science and Automation, ⅡSc, July 2002
    [30] Tolani P M, Haritsa J R. XGRIND: A Query-Friendly XML Compressor. Proceedings of ICDE, 2002
    [31] Min K, Park M J, Chung C W. XPRESS: A Queriable Compression for XML Data. Proceedings of SIGMOD, 2003
    [32] Buneman P, Grohe M, Koch C. Path Queries on Compressed XML. Proceedings of VLDB, 2003
    [33] Arion A B, Costa G, DAguanno S, et al. XQueC: Pushing Queries to Compressed XML Data. Proceedings of VLDB, 2003
    [34] James C, Wilfred N. Xqzip: Querying Compressed XML Using Structural Indexing. EDBT 2004, Lecture Notes of Computer Science, Vol. 2992
    [35] 王宏志,李建中.基于路径压缩XML数据上的twig查询处理.NDBC 2004
    [36] Angela B. Technical Survey of XML Schema and Query Languages. Proceedings of VLDB, 2001
    [37] Li Q, Moon B. Indexing and Querying XML Data for Regular Path Expressions. Proceedings of VLDB, 2001
    [38] Yanlei D. Path Sharing and Predicate Evaluation for High Performance XML Filtering. Proceedings of TODS, 2003
    [39] 吴永辉,周傲英.消除冗余模式并保证无损连接的XML模式设计. NDBC 2002
    [40] 陆世潮,孟小峰.OrientX中XQuery的导航式实现.计算机研究与发展.2004,41(10):1815-1822页
    [41] 王腾蛟,唐世渭,杨冬青.COMMIX中半结构化数据的模式提取与查询处理.软件学报.2001,12(suppl):230-236页
    [42] Chawathe S, Garcia M H, Hammer J, et al. The TSIMMIS Project: Integration of Heterogenous Information Sources. Proceedings of the 10th Meeting of the Information Processing Society of Japan, 1994:7-18P
    [43] Schlieder T. Schema-Driven Evaluation of ApproXQL Queries. Technical Report B 02-01, Freie Universitat Berlin, February 2002
    [44] Bonifati S C. Comparative Analysis of Five XML Query Languages. ACM SIGMOD Record, 2000, 29(1): 68-79P
    [45] Abiteboul S, Quass D, McHugh J, et al. The Lorel Query Language for Semistructured Data. International Journal on Digital Libraries, 1997, 1 (1): 68-88P
    [46] Deutsch A, et al. XML-QL: A Query Language for XML. http://www.w3.org/TR/1998/NOTE-xml-ql-19980819/
    [47] Don C, Jonathan R, Daiela F. Quilt: An XML Query Language for Heterogeneous Data Sources. Lecture Notes in Computer Science, Springer-Verlag. 2001 : 199-234P
    [48] Don C, James C, Daniela F, et al. XQuery 1.0: An XML Query Language. W3C Working Draft, 2001. http://www.w3.org/TR/xquery
    [49] World Wide Web Consortium, XML Path Language (XPath) Version 1.0. W3C Recommendation, Nov. 16, 1999. http://www.w3.org/TR/xpath.xml
    [50] Mignet L, Barbosa D V. The XML Web: A First Study. Proceedings of the 12th International Conference on World Wide Web, 2003, 500-510P
    [51] Provost W. Nomalizing XML. www.xml.com/pub/a/2001/11/13/normalizing.html
    [52] Marcelo Arenas, Leonid Libkin.A Normal Form for XMLDocuments Proceedings of ACM Symposium on Principles of Database Systems (PODS). Madison, Wisconsin, USA, 2002: 85-96P
    [53] Mong Li Lee, Tok Wang Ling, Wai Lup Low. Designing Functional Dependencies for XML. Proceedings of the 8th Conference on Extending Database Technology, EDBT 2002. Prague: Springer-Verlag, 2002: 124-141P
    [54] Chen Y, Davidson S B, Hara C S, et al. RRXF: Redundancy Reducing XML Storage in Relations. Proceedings of VLDB, 2003: 189-200P
    [55] Buneman P, Davidson S B, Fan W F, et al. Keys for XML. Proceedings of the 10th International World Wide Web Conference. Hong Kong: ACM Press, 2001: 201-210P
    [56] Abiteboul S, Hull R, Vianu V. Foundations of Database. Boston, MA: Addison-Wesley, 1995
    [57] Ramakrishnan R, Gehrke J. Database Management Systems. NY: McGraw-Hill Higher Education, 2000
    [58] Buneman P, Fan W F, Weistein S. Path Constraints on Semistructured and Structured Data. Proceedings of the ACM Symposium on Principles of Database Systems. Seattle: ACM Press, 1998: 129-138P
    [59] Abiteboul S, Vianu V. Regular Path Queries With Constraints. Proceedings of the ACM Symposium on Principles of Database Systems. Tucson: ACM Press, 1997: 122-133P
    [60] Bunenman P, Davidson S B, Fan W F. Reasoning about Keys for XML. In: Ghelli G, Grahne G, Eds. Database Programming Language, the 8th International Workshop. Lecture Notes in Computer Science 2397, Frascati: Springer-Verlag, 2001: 133-148P
    [61] 吕腾,顾宁,闫萍.XML文档的范式.小型微型计算机系统,2004,25(10):1836-1840页
    [62] Kaushik R, Bohannon P, Naughton J F, et al. Covering Indexes for Branching Path Queries. Proceedings of SIGMOD, 2002
    [63] Doucet A, Ahonen M. Native Clustering of a Large XML Documents Collection. Proceedings of the 1st Annual Workshop of the Initiative for
    ??the Evaluation of XML retrieval, 2002
    [64] MacQueen J. Some Methods for Classification and Analysis of Multivariate Observation. Proceedings of the 5th Berkeley Symp Math Statist and Prob 1. California: University of California Press, 1967: 281- 297P
    [65] Kauffman L, Rousseeuw P J. Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley & Sons, 1990
    [66] Ankerst M, Breuning M M, Kriegel H P, et al. OPTICS: Ordering Points to Identify the Clustering Structure. Proceedings of the 1999 ACM SIGMOD Int'l Conf on Management of Data. New York: ACM Press, 1999: 164-169P
    [67] Yun Shen, Bing Wang. Clustering Schemaless XML Documents. CoopIS/DOA/ODBASE, 2003: 767-784P
    [68] Lee M L, Yang L H. XClust: Clustering XML Schemas for Effective Integration. Proceedings of the 7th CIKM, 2002: 292-299P
    [69] Jagadish H V, Koudas N. On Effective Multi-Dimensional Indexing for Strings. ACM SIGMOD, Dallas, TX, 2000
    [70] Shasha D, Zhang K, Statman R. On the Editing Distance between Unordered Labeled Trees. Information Processing Letters. 1992, 42:133- 139P
    [71] DBLP Computer Science Bibliography. http://www.informatik.uni- trier.de/~ley/db/
    [72] Schmidt A, Waas F, Kersten M, et al. The XML Benchmark Project. Technical Report, INSR0103, 2001
    [73] Yan X, Han J. gSpan: Graph-Based Substructure Pattern Mining. Proceedings of the IEEE International Conference on Data Mining (IEEE ICDM), 2002
    [74] XML Data Sets, http://www.cs.wics.edu/niagara/data.html
    [75] Liang H Y, Mong L L, Wynne H. Sumit Acharya: Mining Frequent Query Patterns from XML Queries. DASFAA 2003: 355-362P
    [76] Nijssen S, Kok J N. Efficient Discovery of Frequent Unordered Trees. Proceedings of the First International Workshop on Mining Graphs, Trees and Sequences (MGTS'03), 2003
    [77] Pei J, Han J, Pinto H, et al. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth, Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), Heidelberg, Germany, 2001
    [78] Mediana A, Lakhina A. Brite: Universal Topology Generation from User's Perspective. Technical Report BUCS-TR2001-003, Boston University, 2001
    [79] Nevill-Manning C G, Witten I H. Compression and Explanation Using Hierarchical Grammars. Computer Journal, 1997: 103-116P
    [80] Kieffer J C, Yang E H. Grammar Based Codes: A New Class of Universal Lossless Source Codes. IEEE Transactions on Information Theory, 2000: 737-754P
    [81] Charikar M, Lehman E, Liu D, et al. Approximating the Smallest Grammar: Kolmogorov Complexity in Natural Models. Proc. 34th ACM STOC, 2002: 792-801P
    [82] Kieffer J C, Yang E. Grammar Based Codes: A New Class of Universal Lossless Source Codes. IEEE Transactions on Information Theory. 2000, 46: 737-754P
    [83] Lehman E, Shelat A. Approximations Algorithms for Grammar-Based Com-Pression. Proc. SODA, 2002: 205-212P
    [84] Nevill-Manning C, Witten I. Compression and Explanation Using Hierarchical Grammars. Computer Journal. 1997,40(2/3): 103-116P
    [85] Sakamoto H. A Fully Linear-Time Approximation Algorithm for Grammar-Based Com-Pression. DOI Technical Report 214, Department of Informatics, Kyushu Univer-sity, 2003
    [86] Giorgio B, Markus L. Grammar-Based Tree Compression. Technical Report IC/2004/80, EPFL, 2004
    [87] Chung C W, Min J K. APEX: An Adaptive Path Index for XML Data.
    ??Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, June 2002
    [88] Haixun Wang, Sanghyun Park. ViST: A Dynamic Index Method for Querying XML Data by Tree Structures. Proceedings of the 2003 ACM SIGMOD Int'l Conf on Management of Data. New York: ACM Press, 2003: 110-121P
    [89] Chang W W, Schek H J. A Signature Access Method for the Starburst Database System. Proceedings of VLDB, 1989
    [90] Diao Y, et al. Yfilter: Efficient and Scalable Filtering of XML documents. Proceedings of ICDE, 2002
    [91] XML Signature W3C Working Draft. XML-Signature, http:// www.w3c.org/ TR/2000 / WD-xmldsig-core-2000228
    [92] Signed Document Markup Language (SDML). http://www.w3c.org /TR/NOTE-SDML
    [93] AlphaWorks. XML Security Suite. http://alphaworks.ibm.com/tech /xmlsecuritysuite
    [94] Kahan J. WDAI: A Simple World Wide Web Distributed Authorization Infrastructure. Proceedings of the 8th International World Wide Web Conference, 1999
    [95] Lewontin S, Zurko M E. The DCE Project: Providing Authorizations and Other Distributed Services to the World Wide Web. Proceedings of the 2nd World Wide Web Conference, 1994
    [96] Samarati P, Bertino E. An Authorization Model for a Distributed Hypertext System. IEEE Transaction on Knowledge and Data Engineering, 1996, 8(4): 555-562P
    [97] Damiani E, Sabrina D. Design and Implementation of an Access Control Processor for XML Documents, Computer Networks, 2000: 59-75P
    [98] Graharm G, Denning P. Protection-Principles and Practice, Proceedings on AFIPS Spring Joint Computer Conference. AFIPS Press, Arlington, 1972: 407-429P
    [99] Akihiro I, Takashi W, Hiroshi M. An Apriori-Based Algorithm for Mining Frequent Substructures from Graph Data. PKDD 2000: 13-23P
    [100] Miyahara T, Shoudai T, Uchida T, et al. Discovery of Frequent Tree Structured Patterns in Semistructured Web Documents. PAKDD 2001: 47- 52P
    [101] Mohammed J Z. Efficiently Mining Frequent Trees in a Forest. KDD 2002: 71-80P
    [102] Tatsuya A, Kenji A, Shinji K, et al. Efficient Substructure Discovery from Large Semi-structured Data. SDM 2002
    [103] Marcelo A, Leonid L. A Normal Form for XMLDocuments. Proceedings of ACM Symposium on Principles of Database Systems (PODS). Madison, Wisconsin, USA, 2002: 85-96P
    [104] Cooper B, Sample N, Franklin M J, et al. A Fast Index for Semistructured Data. Proceedings of VLDB, 2001
    [105] Roy G and Jennifer W. Dataguides: Enabling Query Formulation and Optimization in Semistructured Databases. Technical Report, Stanford, 1977
    [106] Fernandez M F, Suciu D. Optimizing Regular Path Expressions Using Graph Schemas. Proceedings of the 14th International Conference on Data Engineering, February 1998: 14-23P
    [107] Mathias N, John N W. Improving XML Processing Using Adapted Data Structures. Web, Web-Services, and Database Systems 2002: 206-220P
    [108] Cormack G. Data Compression in a Database System, Communications of the ACM, 1985,28(12): 1336-1342P
    [109] Haruo H, Benjamin P. Regular Expression Pattern Matching for XML. ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2001:67-80P
    [110] Alsberg P. Space and Time Savings through Large Database Compression and Dynamic Restruring, Proc. IEEE, 1975, 63(8)
    [111] Fernandez M, Suciu D. Optimizing Regular Path Expression Using Graph Schemas. Proceedings of the ICDE, 1998
    [112] Hu D A. A Method for the Construction of Minimum Redandancy Codes. Proceedings of the Institute of Radio Engineers 40, 1952: 1098-1101P

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700