基于频繁模式树的XML数据挖掘

英文题名：XML Data Mining Based on Frequent Pattern Tree
作者：吴雅双
论文级别：硕士
学科专业名称：计算机软件与理论
中文关键词：XML挖掘 ; 频繁模式树 ; 模式树匹配
英文关键词：XML mining ; Frequent pattern tree ; Pattern tree matching
学位年度：2009
导师：张东站
学科代码：081202
学位授予单位：厦门大学

摘要

数据挖掘是指从大量的、不完全的、有噪声的、模糊的数据中提取出隐含在其中的、人们事先不知道的但又潜在有用的知识的半自动化的方法,它是解决“数据丰富、信息贫乏”的有效方法。
     XML是由SGML发展而来的一种简单、灵活的文本格式。它已经成为Internet上数据描述和交换的标准,越来越多的数据以XML文档进行存储,在这些数据中隐含着大量的知识信息与各类模式,因此,人们迫切需要一些有效的方法来从中提取出一些潜在的、有价值的知识,这就是XML挖掘。
     但是,作为一种树形的半结构化数据,XML非常复杂且具有异构性,它不能轻易地被映射到关系模型,这样,传统的面向关系型数据的挖掘方法如Apriori算法等,并不能直接应用到XML挖掘上。因此,研究一种有效的针对XML的数据挖掘方法成为数据挖掘领域和XML技术领域的一项重要课题。
     本文首先介绍了传统的数据挖掘基本理论、XML的基本理论、XML的特点以及XML有关技术规范。
     其次介绍了频繁子树挖掘的相关概念和现有的一些频繁子树挖掘算法。
     接着在分析了现有频繁模式树挖掘算法FREQT和Freqttree的基础上,提出了一种新的频繁模式树挖掘算法—PDOM算法。PDOM算法采用最右路径扩展的思想,然后利用递推式的候选节点集更新技术来压缩候选节点集,使产生的候选模式数量大大减少,并且在计算候选模式树的支持数时,采用增量式技术,提高算法效率。通过定理证明了PDOM算法的正确性,并对其进行了实验分析。
     最后,考虑到XML的树形结构,提出了基于频繁模式树的XML文档分类算法—BFPC算法。BFPC算法基于XML内容和XML结构两方面。它首先利用tf*idf权值法提取XML文件中非结构的信息即XML内容的特征代表,接着利用PDOM算法提取各个类别的频繁模式树,作为该类别的结构特征,并赋予每个模式树一定的权值。同时,本文还提出了一种模式树匹配算法—PMatch,通过最右匹配集来实现模式树的匹配。最后测试阶段,利用PMatch算法以及关键字匹配,计算测试文档的得分,判断该文档所属的类别。通过实验证明,BFPC算法有较高的查准率。
Data mining is defined as a non-trivial process of extracting valid, novel, potentially useful, and ultimately understandable patterns from a large number of incomplete, noisy and ambiguous data. It is an efficient method of resolving the problem of "data rich-information poor".
     XML is a simple, very flexible text format derived from SGML. XML has become the standards for data representation and exchange over the Internet. More and more datas are stored in XML format, and a lot of information and various of patterns are hidden in the datas. Hence, there have been increasing demands of efficient methods that extract potential and valuable d information from XML data, namely XML data mining.
     However, as a kind of semi-structured data, XML data are a huge amount of complex and heterogeneous data modeled by trees, and cannot be easily mapped into a relational framework. Thus, we cannot directly apply to XML data traditional data mining methods for relational databases, such as Apriori. Hence, it is a important challenge to develop efficient and scalable methods for XML data mining.
     This paper first introduces the basic theory of the traditional data mining, the basic theory of XML, the features of XML documents and technical specifications related to XML.
     Second, it introduces the concepts related to frequent subtree mining, and some of the existing frequent subtree mining algorithm.
     Third, it proposes a novel algorithm PDOM, based on the analysis of the FREQT and Freqttree algorithm, which are the frequent subtree mining algorithm. The algorithm adopts the technology of the rightmost expansion. Then it uses a method of recursive updating the set of candidate nodes to reduce the number of candidate nodes. Thus, the number of the candidate patterns is small. And, it adopts incremental method to compute the support of candidate pattern trees, which improves the efficiency of algorithm. It proves the correctness of the algorithm PDOM through theorem, and analysis the algorithm through the experiment.
     At last, taking into account the tree structure of xml, the paper proposes an algorithm named BFPC for classifying xml documents based on frequent pattern tree. It makes use of both document content and structure. First, it use tf * idf method to extract the representative of characteristics from the non-structural information that is the content of xml in other words. Then, it uses the frequent pattern tree algorithm to extract frequent pattern tree of each class to be the representative of the class and give some weights to each frequent pattern tree. Simultaneously, we propose a pattern tree match algorithm-Pmatch to implement pattern tree match by rightmost match set. In the testing phase, it uses Pmatch algorithm and keyword matching method to calculate the scores of the test document, and judges which class it belongs to. Experimental results show that BFPC algorithm has higher accuracy.

引文

[1]J Han,M Kamber.Data Mining:Concepts and Techniques.Morgan Kaufmann Publishers,2000.
    [2]U M Fayyad,G P Shapiro,P Smyth and R Uthurusamy.Adavances in Knowledge Discovery and Data Mining.AI/MIT Press.1996.
    [3]XML.hnp://www.w3.org/XML/.
    [4]Zaki M J.Efficiently mining frequent trees in a forest.Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.Alberta,Canada,2002:71-80.
    [5]Asai T,Abe K,Kawasoe S,et al.Efficient substructure discovery from large semi-structured data.Proceedings of the 2nd SIAM International Conference on Data Mining.Arlington,VA,2002:158-174.
    [6]T Miyahara,T Shoudai,T Uchida,K Takahashi,H Ueda.Discovery of Frequent Tree Structured Patterns in Semistructured Web Doeuments.Proceedings of the 5~(th) Pacific-Asia Conference on Knowledge Discovery and Data Mining,London,UK,2001:47～52.
    [7]T Asai,H Arimura,K Abe,S Kawasoe,S Arikawa.Online Algorithms for Mining Semi-structured Data Stream.Proceedings of the 2002 International Conference on Data Mining(ICDM02),MaebashiCity,Japan,2002:27～34.
    [8]M J Zaki and C C Aggarwal.XRules:An Effective Structural Classifier for XML Data.Proceedings of the 9~(th) ACM SIGKDD International Conference on Knowledge Diseovery and Data Mining(SIGKDD03),Washington,DC,USA,2003.
    [9]M J Zaki and C C Aggarwal.XRules:An Effective Algorithm for Structural Classification of XML Data.Machine Learning Journal special issue on Statistical Relational Learning and Multi-Relational Data Mining,2006,62(1-2):137-170.
    [10]Wang Lian,David Wai-lok Cheung.An Efficient and Scalable Algorithm for Clustering XML Documents by Structure.Proceedings of IEEE Transactions on Knowledge and Data Engineering.2004.
    [11]Jacky W.W.Wan,Gillian Dobble.Mining Association Rules from XML Data using XQuery.Proceedings of the second workshop on Australia information security,Data Mining and Web Intelligence.and Software Internationalization,New Zealand,2004.
    [12]Amnon Meisels,Miehael Orlov,Tal Maor.Discovery Associations in XML Data.Proceedings of the third International Conference on Web Information Systems Engineering(Workshops).Singapore.2002.
    [13]R Agrawal,T Inielinski,and A Swami.Mining association rules between sets of items in large databases.In P.Buneman and S.Jajodia,editors,Proceedings of ACM SIGMOD Conference on Management of Data(SIGMOD 93),pages 207-216,ACM,May 1993.
    [14]Inokuchi A,Washio T,Motoda H.An Apriori-Based Algorithm for Mining Frequent Substructures from Graph Data[A].Proc of the 4th European Conf on Principles of Knowledge Discovery and Data Mining[C].2000.
    [15]Inokuchi A,Washio T,Motoda H.Complete Mining of Frequent Patterns from Graphs.Mining Graph Data[J].Machine Learning,2003,50(3)321-354.
    [16]潘锦.Chopper:一个高效的有序标号树频繁结构的挖掘算法.第20届全国数据库年会(NDBC).长沙,2003:303-308
    [17]Wang C,Hong M,Pei J,et al.Efficient Pattern-Growth Methods for Frequent Tree Pattern Mining[A].Proc of the Pacific-Asia Conf on Knowledge Discovery and Data Mining[C],2004.
    [18]Caglek.XML高级开发指南。北京:电子工业出版社,2001.
    [19]熊光彩,莫蓉,赵歆波.XML文档对象模型研究与应用.计算机工程与设计,2002,23(5):1-4.
    [20]熊家治,王飞乐,丁祥武等.基于XML的异构数据XQuery查询.计算机应用与软件,2004,21(2):18-19,104.
    [21]R.Wright,B.Richmond,A.Odzlyzko.Constant Time Generation of Free Trees.SIAMJ Computing,1986,15(4):540-548.
    [22]F Luccio,A M Enriquez,P O Rieumont,L Pagli.Exact Rooted Subtree Matching in Sublinear Time.Technical Report TR-01-14,Universita',Di Pisa,2001,http://citeseer.ist.psu.edu/luccio01exact.html.
    [23]F Luccio,A M Enriquez,P O Rieumont,L Pagli.Bottom-up Subtree Isomorphism for Unordered Labeled Trees.Technical Repert TR-04-13,Universita',Di Pisa,2004.http://citeseer.ist.psu.edu/luccio04bottomup.html.
    [24]J Karkkainen and P Sanders.Simple Linear Work Suffix Array Construetion.Proceedings of the 30~(th) International Colloquium on Automata,Languages,and Programming,Springer-Verlag LNCS 2719,2003:943-955.
    [25]T.Asai,H.Arimura,T.Uno.Discovering Frequent Substructures in Large Unordered Trees.The 6~(th) International Conference on Discovery Science,2003:47-61.
    [26]D.Avis,K.Fukuda.Reverse Search for Enumeration.Discrete Applied Mathematics,1996,65(1-3):21-46.
    [27]M.J.Zaki.Efficiently Mining Frequent Embedded Unordered Trees.Fundamental Informaticae,2005:33-52.
    [28]朱永泰,王晨.ESPM:频繁子树挖掘算法.计算机研究与发展,2004,41(10):1720-1727.
    [29]K.Wang,H.Liu.Schema Discovery for Semistructure Data.The 3~(rd) International Conference on Knowledge Discovery and Data Mining,Newport Beach,California,USA,1997:271-274.
    [30]T.Miyahara.Discovery of Frequent Tree Structured Patterns in Semi Structured Web Documents.Knowledge Discovery and Data Mining,the 5~(th) Pacific-Asia Conference,Hong Kong,2001:47-52.
    [31]Y.Chi,Y.Yang,R.Muntz.HybridTreeMiner:An Efficient Algorithm for Mining Frequent Rooted Trees and Free Trees Using Canonical Forms.The 16~(th) International Conference on Scientific and Statistical Database Management(SSDBM),2004:11-20.
    [32]Y.Chi,Y.Yang,R.Muntz.Indexing and Mining Free Trees,Proceedings of the IEEE International Conference on Data Mining,Florida,USA,2003:509-512.
    [33]Y.Xiao,J.F.Yao,M Dunham.Efficient Data Mining for Maximal Frequent Subtrees.Proceedings of the IEEE International Conference on Data Mining(ICDM),Florida,USA,2003:379-386.
    [34]Y.Chi,Y.Yang,R.Muntz.CMTreeMiner:Mining both Closed and Maximal Frequent Subtrees.The 8th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining(PAKDD),Sydney,Australia,2004:63-73.
    [35]D Braga,A CamPi,M Klemettinen,et al.Mining Association Rules from XML Data.Proceedings of the 4~(th) International Conference on Data Warehousing and Knowledge Discovery(DawaK'02),Aixen,France,2002:21-30.
    [36]D Braga,A CamPi,S Ceri,et al.Discovering Interesting Information in XML Data with Association Rules.Proceedings of the 18~(th) Symposium on Applied Computing(SAC'03),Melbourne,Florida.2003:450-454.
    [37]D Braga,A CamPi,S Ceri,et al.A Tool for Extracting XML Association Rules from XML Documents.Proceedings of the 14~(th) IEEE International Conference on Tools with Artificial Intelligence(ICTA'02),Washington DC,USA,2002:57-64.
    [38]吉根林,韦素云,鲍培明.一种基于DOM树的XML数据频繁模式挖掘算法.南京航空航天大学学报,2006,38(2):206-211.
    [39]Wu Gongxing.A Study on the Mining Algorithm of Fast Association Rules for the XML Data.International Conference on Computer Science and Information Technology.2008:204-207.
    [40]杨沛,郑启伦,彭宏,等.PFTM:一种基于投影的频繁子树挖掘算法.计算机科学,2005,(32):206-209.
    [41]Dom4j.The flexible XML framework for Java[EB/OL](2005-05-16).http://www.dom4j.org.
    [42]Sigmod XML dataset.2002-10-25[2007-11-1].http://www.acm.org/sigmod/record/xml.
    [43]范明、孟小峰等译.数据挖掘概念与技术.北京:机械工业出版社.2006.3.
    [44]A.Bratko and B.Filipic.Exploiting structural information in semi-structured document classification.In Proc.13~(th) International Electro technical and Computer Science Conference,ERK'2004,vol B,pages 145-149,Portoroz,Slovenia,2004.
    [45]Weikum.G Theobald.M,Schenkel.R.Exploiting structure,annotation and ontological knowledge for automatic classification of xml data.In WebDB,San Diego,CA,2003.
    [46]J.Yi and N.Sundaresan.A classifier for semi-structured documents.In Proceedings of the 6~(th)ACM SIGKDD international conference on Knowledge discovery and data mining,pages 340-344,Boston,MA,2000.
    [47]D.Shen and et al.Web-page classification through summarization.In Proceedings of the 27~(th) annual international ACM SIGIR conference on Research and development in information retrieval,pages 242-249,2004.
    [48]A.Sun,E.P.Lim,and W.K.Ng.Web classification using support vector machine.In Proceedings of the 4~(th) international workshop on Web information and data management,pages 96-99,2002.
    [49]A Nierman and H V Jagadish.Evaluating Structural Similarity in XML Documents.Proceedings of the 5~(th) International Workshop on the Web and Database(WebDB),Madison,2002:61-66.
    [50]Theobald M.,Schenkel R.& Weikum G.Exploiting Structure,Annotation,and Ontological Knowledge for Automatic Classification of XML Data.In WebDB Workshop,2003,1-6.
    [51]唐凯。基于内容和分层结构的XML文件自动分类方法。计算机工程与应用。2007,43(3):168-172.
    [52]孙登峰.XML文档信息检索技术研究与实现.安徽:国防科学技术大学研究生院学位论文,2002
    [53]Salton,G.and Buckley,C.Term-weighting approaches in automatic text retrieval.Information Processing & Managemen.1988,24(5):513-523.
    [54]A.Kurt and T.Engin.Classification of xslt-generated web documents with support vector machines.In Knowledge Discovery from XML Documents,volume 3915,pages 33-42,2006.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700