一种快速的非提取式XML解析器的设计与实现
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
随着XML技术的广泛应用,如何提高XML解析器的性能是一个亟待解决的问题。XML解析模型直接影响XML解析器的性能,因此解决这个问题应从XML解析模型入手。当前的研究工作大多基于提取式XML解析模型,对非提取式XML解析模型的研究很少。VTD-XML是一种新型的非提取式XML解析模型。本文在VTD-XML的基础上设计并实现了一种快速的非提取式XML解析器,称为NEM-XML。
     首先,NEM-XML是一种非提取式XML解析器。它抛弃了XML DOM模型中为每个XML节点创建节点对象的做法,取而代之的是使用64位长的整数保存XML节点的元信息,极大地减少了解析XML文档所需的时间和内存空间。NEM-XML以静态链表的方式组织内部的数据结构,既方便了元素节点的添加和删除,又提高了XML文档的遍历速度。
     其次,探索了复用XML解析结果的方法,也就是在第一次使用XML文档时进行正常的解析并将解析结果保存到二进制文件中,以后使用时直接利用二进制文件还原原始的解析结果。这在那些仅对XML文档进行访问而无更新操作的应用中有很大的实用价值。为了复用NEM-XML的解析结果,本文改进了NEM-XML的数据结构,以减少保存解析结果所需的空间以及还原解析结果所需的时间。
     最后,并行计算是当前的一个重点研究领域,XML并行解析也得到了越来越多的关注。本文研究了NEM-XML的并行解析算法,提出了一种受限制的XML文档划分方法,可以很快地确定各个文档片段的初始解析状态。这个划分算法兼顾XML文档的层次结构和负载平衡,划分结果比较理想。
     本文对XML解析技术的研究具有一定的现实意义。它不但扩展了VTD-XML所体现的非提取式XML解析思想,还进一步研究了如何复用NEM-XML的解析结果,可以促进XML在各个领域的应用。另外,本文提出的受限制的XML文档划分方法对其它XML并行解析方面的研究具有一定的参考价值。
As a platform- and language-neutral markup language, XML plays an important role indata representation and data exchange over Internet. However, how to improve the perfor-mance of XML parsing is an urgent task. Nowadays most research is based on XML DOMwhich is the most widely used XML parsing model. This paper presents a fast non-extractiveXML parser based on VTD-XML, called NEM-XML.
     Firstly, NEM-XML is a non-extractive XML parsing model, which means that it doesnot create node objects for all XML nodes during parsing. Instead, it encodes the nodeinformation in 64-bit integers. In this way, a lot of memory space is saved and parsing per-formance is improved. To gain more ?exibility and usability, NEM-XML keeps the structureinformation in a static linked array. This kind of data structure can facilitate the updatingand navigation operation among XML nodes significantly.
     Secondly, it is quite promising to reuse the XML parsing results where there is no needto perform updating in XML document. This is a good way of avoiding parsing the sameXML document repetitively. This paper makes a further change on NEM-XML to reducethe space needed to save the XML parsing results and the time to restore them.
     Finally, parallel computing is a hot research field nowadays and parallel XML parsingtherefore becomes more and more popular. This paper proposes a restricted XML partitionmethod to reduce the uncertainty of each chunk. This partition method takes both the docu-ment structure and load balancing into consideration. The partition result is quite satisfied.
     The work on XML parsing technology has certain practical significance. On one hand,it extends VTD-XML to make it more ?exible and makes a deep research on reusing XMLparsing results, which promotes the application of XML in various fields to some extent.On the other hand, the partition algorithm this paper presents is quite e?cient and providessome reference for relative research on parallel XML parsing.
引文
[1] W3C. eXtensible Markup Language(XML)[EB/OL].http://www.w3.org/TR/2008/PER-xml-20080205/.
    [2] Diomidis Spinellis. Using and Abusing XML[J]. IEEE SOFTWARE, 2008, 25(2):88–89.
    [3]左伟明. XML数据标记语言参考手册[M].北京:人民邮电出版社, 2007.
    [4] Serge Abiteboul. Semistructured Data: from Practice to Theory[C]. Proceedings of the16th Annual IEEE Symposium on Logic in Computer Science, 2001: 379–386.
    [5] Sungchul Hong and Yeong Song. E?cient XML query using Relational DataModel[C]. The 8th ACIS International Conference on Software Engineering, Artifi-cial Intelligence, Networking, and Parallel/Distributed Computing, 2007: 1095–1100.
    [6] Michael Champion. How Much Pain for XML’s Gain?[C]. XML 2004 Conference,2004.
    [7] Tak Lam, Jianxun Ding, and Jyh Liu. XML Document Parsing: Operational and Per-formance Characteristics[J]. IEEE Computer Society, 2008, 41(9): 30–37.
    [8]金蓓弘,曹冬磊,任鑫,余双,戴蓓洁.高性能的XML解析器OnceXMLParser[J].软件学报,2008,19(10): 2728–2738.
    [9] XimpleWare. VTD-XML: The Future of XML Processing[EB/OL].http://vtd-xml.sourceforge.net/.
    [10] Scott Lee, Jing Sun, Gillian Dobbie, Lindsay Groves, and Yuanfang Li. CorrectnessCriteria for Normalization of Semistructured Data[C]. The 19th Australian Conferenceon Software Engineering, 2008: 248–257.
    [11] W3C. Document Object Model Level 3(DOM)[EB/OL].http://www.w3.org/TR/DOM-Level-3-Core/.
    [12] Mariano Consens and Flavio Rizzolo. Fast Answering of XPath Query Workloads onWeb Collections[J]. Lecture Notes in Computer Science, 2007: 31–45.
    [13] Christian Gru¨n, Alexander Holupirek, Marc Kramis, Marc Scholl, and Marcel Waldvo-gel. Pushing XPath Accelerator to its Limits[C]. Proceedings of the 1st InternationalWorkshop on Performance and Evaluation of Data Management Systems, 2006.
    [14] Herb Sutter. The Free Lunch Is Over: A Fundamental Turn Toward Concurrency inSoftware[EB/OL]. http://www.gotw.ca/publications/concurrency-ddj.htm.
    [15]陈国良.并行计算──结构·算法·编程(修订版)[M].北京:高等教育出版社,2003.
    [16] Faiza Abbaci, Jean Valsamis, and Pascal Francq. Index and Search XML Documents byCombining Content and Structure[C]. International Conference on Internet Computing,2006.
    [17] XimpleWare. VTD-XML Introduction and API Overview[R]. XimpleWare, 2008.
    [18] Toshiro Takase, Hisashi Miyashita, Toyotaro Suzumura, and Michiaki Tatsubori. AnAdaptive, Fast, and Safe XML Parser Based on Byte Sequences Memorization[C]. Pro-ceedings of the 14th International Conference on World Wide Web, 2005: 692–701.
    [19] Gang Wang, Cheng Xu, Ying Chen, and Ying Li. Analyzing XML Parser MemoryCharacteristics: Experiments towards Improving Web Services Performance[C]. IEEEInternational Conference on Web Services, 2006: 681–688.
    [20] Giuseppe Psaila. Virtual DOM: An E?cient Virtual Memory Representation for LargeXML Documents[C]. The 19th International Conference on Database and Expert Sys-tems Application, 2008: 233–237.
    [21] Matt Bone, Peter Nabicht, Konstantin La¨ufer, and George Thiruvathukal. TamingXML: Objects First, Then Markup[C]. IEEE International Conference on Electro/In-formation Technology, 2008: 488–493.
    [22] Mariano Consens, Flavio Rizzolo, and Alejandro Vaisman. AxPRE Summaries: Ex-ploring the (Semi-)Structure of XML Web Collections[C]. IEEE 24th InternationalConference on Data Engineering, 2008: 1519–1521.
    [23] Lei Li, Chunlei Niu, Ningjiang Chen, and Jun Wei. High Performance Web ServicesBased on Service-Specific SOAP Processor[C]. IEEE International Conference on WebServices, 2006: 603–610.
    [24] Jonathan Robie. XML Processing and Data Integration with XQuery[J]. IEEE InternetComputing, 2007, 11(4): 62–67.
    [25]于博文,范学峰. VTD-XML技术在.NET平台下EDI系统中的运用[J].计算机与现代化,2008,(3): 115-119.
    [26] Chang Chee, Mohd Faisal, and Mustapha Azhar. RBStreX: Hardware XML Parser forEmbedded System[C]. International Conference for Internet Technology and SecuredTransactions, 2009: 1–6.
    [27] Jaakko Kangasharju, Tancred Lindholm, and Sasu Tarkoma. On Encrypting and Sign-ing Binary XML Messages in the Wireless Environment[C]. International Conferenceon Web Services, 2006: 637–646.
    [28] Roberto Bayardo, Daniel Gruhl, Vanja Josifovski, and Jussi Myllymaki. An Evaluationof Binary XML Encoding Optimizations for Fast Stream Based XML Processing[C].Proceedings of the 13th International Conference on World Wide Web, 2004: 345–354.
    [29] Wei Lu, Kenneth Chiu, and Dennis Gannon. Building a Generic SOAP Frameworkover Binary XML[C]. The 15th IEEE International Symposium on High PerformanceDistributed Computing, 2006: 195–204.
    [30] Alois Zoitl, Ingo Hegny, and Andreas Schimmel. Utilizing Binary XML Representa-tions for Improving the Performance of the IEC 61499 Configuration Interface[C]. The7th IEEE International Conference on Industrial Informatics, 2009: 66–71.
    [31] Kenneth Chiu, Tharaka Devadithya, Wei Lu, and Aleksander Slominski. A BinaryXML for Scientific Applications[C]. Proceedings of the 1st International Conferenceon e-Science and Grid Computing, 2005: 336–343.
    [32] Michael Leventhal. Is Now the Time for Binary XML? Report on Current W3C Activ-ity[R]. XML Conference and Exposition, 2004.
    [33] David Geer. Will Binary XML Speed Network Tra?c?[J]. IEEE Computer Society,2005, 38(4): 16–18.
    [34] Wei Lu, Kenneth Chiu, and Yinfei Pan. A Parallel Approach to XML Parsing[C]. The7th IEEE/ACM International Conference on Grid Computing, 2006: 223–230.
    [35] Yinfei Pan, Wei Lu, Ying Zhang, and Kenneth Chiu. A Static Load-Balancing Schemefor Parallel XML Parsing on Multicore CPUs[C]. The 7th IEEE International Sympo-sium on Cluster Computing and the Grid, 2007: 351–362.
    [36] Yinfei Pan, Ying Zhang, Kenneth Chiu, and Wei Lu. Parallel XML Parsing UsingMeta-DFAs[C]. IEEE International Conference on e-Science and Grid Computing,2007: 237–244.
    [37]王国仁,汤南,于亚新,孙冰,于戈.一种并行XML数据库分片策略[J].软件学报,2006,17(4): 770–781.
    [38]于亚新,王国仁,于戈.并行XML数据库系统中数据分片策略的研究[J].计算机研究与发展,2003,40(10): 1499–1508.
    [39]孙静,宋扬,胡金星,郝银全.大型XML文件的分割和动态加载研究[J].计算机工程与应用,2003,39(16): 107–108.
    [40]冯进,丁博,史殿习,张瞩熹,许凯. XML解析技术研究[J].计算机工程与科学,2009, 31(2): 120–124.
    [41]张德才.基于索引的高性能XML解析器研究[J].微计算机信息, 2008, 24(33):198-200.
    [42]陈娟,李晖,鱼雷. XML文档快速解析技术研究[J].计算机技术与发展, 2007,17(10): 40–42.
    [43] Xiaoji Lan, Jianqiang Su, and Jinbao Cai. VTD-XML-based Design and Implementa-tion of GML Parsing Project[C]. International Conference on Information Engineeringand Computer Science, 2009: 1–5.
    [44]鱼雷,李晖,陈娟. VTD-XML解析技术研究[J].通信与信息技术, 2006: 72–74.
    [45] Eric Perkins, Margaret Kostoulas, Abraham Heifets, Morris Matsa, and Noah Mendel-sohn. Performance Analysis of XML APIs[C]. In XML 2005 Conference, 2005.
    [46]陈娟. XML安全快速解析技术研究[D].西安:西安电子科技大学, 2007.
    [47] Shiren Ye and Tat Chua. Learning Object Models from Semistructured Web Docu-ments[J]. IEEE Transactions on Knowledge and Data Engineering, 2006, 18(3): 334–349.
    [48] Wei Zhang and Robert Engelen. A Table-Driven Streaming XML Parsing Methodol-ogy for High-Performance Web Services[C]. IEEE International Conference on WebServices, 2006: 197–204.
    [49] Yunsong Zhang, Lei Zhao, Jiwen Yang, and Liying Yu. NEM-XML: A Fast Non-extractive XML Parsing Algorithm[C]. The 3rd International Conference on Multime-dia and Ubiquitous Engineering, 2009 347–350.
    [50] Li Zhao and Laxmi Bhuyan. Performance Evaluation and Acceleration for XML DataParsing[C]. Proceedings of 9th Workshop Computer Architecture Evaluation UsingCommercial Workloads, 2006.
    [51] In Song, Ju Paik, and Ung Kim. Semantic-based Similarity Computation for XMLDocument[C]. International Conference on Multimedia and Ubiquitous Engineering,2007: 796–803.
    [52] Alessandro Marchetto , Paolo Tonella, and Filippo Ricca. State-Based Testing of AjaxWeb Applications[C]. International Conference on Software Testing, Verification, andValidation, 2008: 121–130.
    [53] Abdul Waheed and Jianxun Ding. Benchmarking XML Based Application OrientedNetwork Infrastructure and Services[C]. Proceedings of the 2007 International Sym-posium on Applications and the Internet, 2007.
    [54] Adam Dukovich, Jimmy Hua, Jong Lee, Michael Hu?man, and Alex Dekhtyar. JOXM:Java ObjectXML Mapping[C]. The 8th International Conference on Web Engineering,2008: 332–335.
    [55] Yunsong Zhang, Lei Zhao, and Jiwen Yang. R-NEMXML: A Reusable Solution forNEM-XML Parser[C]. The 2nd International Symposium on Information Processing,2009: 155–158.
    [56] Yuanhao Sun, Tianyou Li, Qi Zhang, Jia Yang, and Shih Liao. Parallel XML Transfor-mations on Multi-Core Processors[C]. IEEE International Conference on e-BusinessEngineering, 2007: 701–708.
    [57] Yinfei Pan, Ying Zhang, and Kenneth Chiu. Simultaneous Transducers for Data-Parallel XML Parsing[C]. IEEE International Symposium on Parallel and DistributedProcessing, 2008: 1–12.
    [58] Michael Head and Madhusudhan Govindaraju. Approaching a Parallelized XMLParser Optimized for Multi-Core Processors[C]. Proceedings of the 2007 Workshopon Service-oriented Computing Performance, 2007: 17–22.
    [59] Dong Zhou. Exploiting Structure Recurrence in XML Processing[C]. The 8th Interna-tional Conference on Web Engineering, 2008: 311–324.

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700