基于权重边集比较法的XML语义聚类研究
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
XML(eXtensible Markup Language)即可扩展的标记语言,由于具有简单、可扩展、互操作性强、开放性强等特点,正迅速成为一种与技术无关的数据交换的标准和传输格式。与HTML相比,XML具有更大的灵活性。它不仅可以用来标记无结构的文本信息,还可以标记高度结构化的规则数据(如数据库中的数据)。随着Web上XML数据的快速增长,如何帮助用户快速有效地检索大量的XML数据,得到想要的信息,便成为亟待解决的课题。
     文档聚类是一种帮助人们检索信息的有效手段。为了有效的分析XML文档中的信息,XML文档聚类研究也就成了当前研究的热点。对XML文档聚类的关键点是文档间相似性的度量,由于XML文档是一种半结构化的文本,其信息可以通过文档结构得以描述,所以并不是所有的文本相似性算法都适合于XML文本。
     目前XML文档相似性计算方法主要有:元素比较法、边集比较法和编辑距离法。元素比较法简单,速度快,但是只是考虑节点的个数但是没有考虑XML文档树的结构复杂性,聚类结果不是很理想。树编辑距离法考虑了XML文档树的结构复杂性和节点相似行,有着良好的聚类结果,但是时间复杂度较高。边集比较法的性能介于二者之间,因此本文对边集比较法进行了扩展,提出带权重的边集比较算法,通过消除XML文档树中的嵌套和重复节点有效的简化了XML标记树,并结合语义信息度量XML文档之间的相似度。得到XML概要树间的相似度后,利用划分聚类法,对XML文档进行聚类。
     基于经典的边集比较算法,本文做出了以下创新:
     一、提出了带权重的边集比较法的概念,对XML概要树上每一条边都根据结构复杂性和所处的层次,赋予一定的权重,加强了XML中结构和层次的重要性。
     二、结合语义信息计算XML概要标记树中有向边的相似性,得到在语义上等价的边的集合,以此确定两个XML概要树之间的相似度,增加了聚类的精确度。实验结果表明,基于语义的带权重的边集比较法有较好的聚类结果。
XML (eXtensible Markup Language) with the simple, scalable, strong inter-operable and open features is becoming a kind of standards and transmission format for data exchange, which is unrelated to the technology. Compared with HTML, XML has greater flexibility. It not only can be used to tag the text of unstructured information but also can be used to mark highly structured data (e.g. data in the database) With the rapid growth of XML data on the Web, how to help users quickly and efficiently retrieve a large number of XML data and get the useful information will become an urgent issue to resolve.
     Document clustering is an effective means to help people retrieve information. In order to effectively analyze the information in the XML document, so the research of XML document clustering has become a hotspot in current research. The key point of XML Document Clustering is measure of the document similarity. As XML documents is Half-Structure text, and its information Can be described via documents structure. Thus, not all the text similarity algorithm is available for XML documents clustering.
     The current calculation methods of XML document similarity are:the method of elements comparison, edge set comparison algorithm and tree edit distance method. The elements comparison method is simple and fast, but it only considers the number of nodes, it does not take into account the structural complexity of XML document tree, so the clustering results are not very satisfactory. The tree edit distance method takes into account the complex structure of XML document tree and nodes similarity, and it can get a good clustering result, but it has a higher time complexity. The performance of edge set comparison method is between elements comparison method and edit distance method. This paper just extends edge set comparison method, and proposes the weighted edge set comparison algorithm, which eliminates the nested and repeated nodes of the XML document tree, and gets the effective simplified the XML labeled tree. It combines semantic information to measure the similarity between XML documents. After getting the similarity among the XML trees, it uses classified clustering method to cluster XML documents.
     Based on the classic edge set comparison algorithm, this paper makes the innovation as following:
     1. The idea of edge set comparison algorithm with weight is proposed. It gives some weight for each side of the XML summary tree according to the structure complexity and the level, so it strengthens importance of the structure and levels of the XML tree.
     2. The new algorithm calculates the edges similarities of XML labeled tree combined with semantic information, then gets the set of semantically equivalent edges so as to determine similarity between the two XML labeled trees.
     The experiments show that the semantic-based weighted edge set comparison algorithm has better clustering results.
引文
[1]XQuery:A query language for XML. W3C Working Draft 15 February 2001, http://www. w3. org/TR/xquery/
    [2]Deutsch A, Fernandez M, Florescu Detal. XML-QL:A query language for XML.W3C Note.1998, http://www.w3.org/TR/1998/NOTE-xml-ql-19980819/
    [3]Robie J, Lapp J, Schach D. XML query language(XOL). W3C Note,1998, http://www.w3.org/TandS/QL/QL98/pp/xql.html
    [4]Zhang C, Naughton J F, DeWitt D Jet al. On supporting containment queries in relational database management systems, Proceedings of ACM SIGMOD Conference, Santa Barbara, USA,2001.425-436.
    [5]Florescu D, Kossmann D, Manolescu I. Integrating keyword search into XML query processing, Proceedings of the 9th Internat ional www Conference, Amsterdam, Netherlands,2000.
    [6]潘有能.XML文档自动聚类研究.情报学报,2006,25(2).
    [7]Tai KC, The tree to tree correction problem. Journal of the ACM,1979,26(3): 422-433.
    [8]A. Nierman,H. Jagadish. Evaluating structural similarity in XML documents. In Proceedings ofWebDB'02. Madison,Wisconsin,USA.2002
    [9]Zhang K, Shasha D. On the edi ting di stance between unordered labeled trees. Information processing Letters,1992,42(3):133-139.
    [10]Zhang K. A consgrained editing distance between unordered labeled trees. Journal Of Algorithmic,1996,15(3):205-222.
    [11]T. Asai, K. Abe, S. Kawasoe, H. Arimura,H. Satamoto, S. Arikawa. Efficient substructure discovery from large semi-structured data. ACM SIAM International Conference on Data Mning,2002
    [12]K. Wang,H. Q. Liu. Discovering Typical Structures of Documents:A Road Map Approach. ACM SIGIR Conference. Melbourne, Australia,1998
    [13]M. J. Zaki, C. C. Aggarwal. XRules:An Effective Structural Classifier for XML Data. ACM KDD Conference. Washington D. C., USA,2003
    [14]Sachindra Joshi, Neeraj Agrawal, Raghu Krishnapuram, Sumit Negi. A Bag of Paths Model for Measuring Structural Similarity in Web Documents. SIGKDD'03, Washington D. C., USA,2003:577-582
    [15]Cobena G, Abiteboul S, Marian A. Detecting changes in XML document[A]. In 18th Int 1 Conf on Data Engineering(ICDE 2002) [C].2002.
    [16]Chawathe S S, Rajaraman A, Garcia-Molina H, et al. Change detection in hierarchically structured information[A]. In Proc s of the Int'l Conf on Management of Data(SIGMOD' 96)[C].1996.493-504.
    [17]Bertino E, Guerrini G, Mesiti M. Measuring the structural similarity among XML documents and DTDs[EB/CD]. Technical Report DISI-TR-02-02, Department of Computer Science, University of Genova,2002. http://www.disiunige.it/person/ MesitiM.
    [18]Flesca S, Manco G, Masciari E, et al. Detecting structural similarities between XML documents[A]. Proceedings of the Fifth Internat ional Workshop on the Web and Databases[C]. Web DB 2002, Madison, Wisconsin, USA, June 6-7,2002, in conjunction with ACM PODS/SIGMOD 2002. Informal proceedings.
    [19]Douceta, Mykaha. Naive clustering of a large XML document collection [A]. In Proc 1stAnnualWorkshop of the Initiative for the Evaluation of XML retrieval(INEX 02)[C]. Schloss Dagstuh,1 Germany,2002.
    [20]Lianw, Cheung DW, MAMOULISN, et al. An efficient and scalable algorithm for clustering xml documents by structure[J]. IEEE Transactions on Knowledge and Data Engineering,2004,16(1):82-96.
    [21]Leung Ho-pong,Chung Fu-lai, Chan C F, et al. XML document clustering using common xpath. In Proceedings of the 2005 International Workshop on Challenges in Web Information Retrieval and Integration(WIRI'05). Tokyo, Japan,2005
    [22]Rafiei D. Finding syntactic similarities between XML documents. In Proc 17th Int Conf on Database and Expert Systems Applications(DEXA'06). Krakow, Poland, 2006
    [23]Lee Jung-Won,Lee Kiho, Kim Won. Preparations for semantics—based XML mining. In Proceedings of the 2001 IEEE International Conference on Data Mining(ICDM' 01),2001
    [24]Leung H P. On the use of hierarchical information in sequential mining-based XML document similarity computation. Knowledge and Information Systems,2005
    [25]Richi Nayak, Wina Iryadi XML Schema Clustering with Semantic and Hierarchical Similarity Measures, Knowledge-Based Systems,2007, Vol. 20: 336-349
    [26]Mong-Li Lee, Liang Huai Yang, Wynne Hsu, Xia Yang. XClust:Clustering XML Schemas for Effective Integration. In Proceedings of the 11th ACM International Conference on Information and Knowledge Management(CIKM). McLean, Virginia, USA.2002:292-299
    [27]Marko Smiljanic, Maurice van Keulen, Willem Jonker. Using Element Clustering to Increase the Efficiency of XML Schema Matching. In Proceedings of the 22nd international Conference on Data Engineering Workshops(ICDEW),2006: 45-46
    [28]Shelokar,P S., Jayaraman, V K., Kulkarni, B. D. An ant colony approach for clustering. Analytica Chimica Acta,2004:187-195
    [29]Handl J, Knowles J, Dorigo M. Ant—based clustering and topographic mapping. Artificial Life,2006:35-61
    [30]A.C.Tsoi, M. Hagenbuchner, A. Sperduti. Self-organising Map Techniques for Graph Data Applications to Clustering of XML Documents. Advanced Data Mining and Applications,2006, Vol. 4093:19-30
    [31]M. Hagenbuchner, A. Sperduti, A. Tsoi. A self-organizing map for adaptive processing of structured data. IEEE Transactions on Neural Networks,2003:49 1-505
    [32]刘江,王军.WEB数据挖掘中的XML文档的聚类研究.大众科学(科学研究与实践),1002-6908(2007)0620038—03.
    [33]王桐,刘大昕.一种新的混合XML文档聚类方法.哈尔滨工程大学报,2007, Vol.28, No 6.
    [34]杨厚群,何中市,雷景生.基于划分的XML文档聚类研究.计算机科学,2008, Vol.3, No 3.
    [35]傅珊珊,吴扬扬.基于频繁结构的XML文档聚类.计算机工程与应用,2008,44(9).
    [36]张杰,卫金茂,刘丹.基于BFS树的XML文档图结构相似性计算.计算机工程与设计,1000-7024(2008)17-4603—03.
    [37]http://newsblaster. cs. columbi a. edu/
    [38]Zheng Chen, Wei-YingMa, Jinwen Ma. Leaning to Cluster web Search Result[A]. In proceedings of the 27th Annual International ACM SIGIR Conference [C]. Sheffield, South Yorkshire, UK, July 2004,210-217.
    [39]Anton Leuski and James Allan. Improving Interactive Retrieval By Combining Ranked Lists and Clustering [A]. In proceedings of RIAO'2000[C]. Paris, France, April 12-14,2000,665-681.
    [40]Y.C.Fang, S.Parthasarathy, F.Schwartz, Using Clustering to Boost Text Classification[J]. In Proceedings of the IEEE ICDM Workshop on Text Mining, Maebashi City, Japan,2002.
    [41]AntonV, Leuski and W. BruceCroft. An Evaluation of Techniques for Clustering Search Result[A]. Technical Report IR-76, Department of Computer Science, University of Massachusetts, Amherst,1996. http://www.CS.washington.edu/esearch/ clustering.
    [42]Zhang Dell. Semantic, Hierarchical, Online Clustering of Web Search Results [A]. In proceedings of the 6th Asia Pacific web Conference (APWEB) [C]. HangZhou, Chin, April 2004.
    [43]郑仕辉,周傲英,张龙.XML文档的相似测度和结构索引研究[J],计算机学报,2003.26(9):1116-1122.
    [44]Theobald,M. Schenkel, R. Weilama,G Exploiting Structure, Annotation, and
    Ontological Knowledge for Automatic Classification of XML Data[A], Proceedings
    of the 6th International Workshop on the Web and Databases,2003:1-6.
    [45]Theodore Dalamagas, Tao Cheng, Klaas-Jan Winkel, Timos Sellis, A
    methodology for clustering XML documents by structure. Elsevier B.V.2004.11.009, pp.187-228
    [46]Gauch S. and Chong M. K. Automatic Word Similarity Detection for TREC 4 Query Expansion, Proc. of TREC-4:The 4th Annual Text REtrieval Conf., Nov.1995, Gaithersburg, MD,1995, pp.527-536
    [47]LI Xiaobin, Szpakowicz S.and Matwin S., A WordNet-based algorithm for word sense disambiguation, Proc. of the Twelth International Joint Conference on Artificial Intelligence (IJCAI).1995, pp.1368-1374
    [48]王斌,汉英双语语料库自动对齐研究,中国科学院计算技术研究所博士学位论文,1999
    [49]李涓子,汉语词义排歧方法研究,清华大学博士论文,1999
    [50]Agirre E. and Rigau G., A proposal for word sense disambiguation using conceptual distance, Proc. of International Conference Recent Advances in Natural Language Processing (RANLP),1995, pp.258-264, Tzigov Chark, Bulgaria.
    [51]C. Fellbaum, WordNet:An Electronic Lexical Database, MIT Press,1998.
    [52]S.V. Rice, H. Bunke, T.A. Nartker, Classes of cost functions for string edit distance, Algorithmica 18 (2) (1997) 271-280.
NGLC 2004-2010.National Geological Library of China All Rights Reserved.
Add:29 Xueyuan Rd,Haidian District,Beijing,PRC. Mail Add: 8324 mailbox 100083
For exchange or info please contact us via email.