Research on the Data Model and the Approaches to Data Mining in the Semi-structured Data


English title: Research on the Data Model and the Approaches to Data Mining in the Semi-structured Data
Author: Sun Tao (孙涛)
Thesis level: Doctoral
Discipline: Computer Application Technology
Keywords: Data Mining; Semi-structured Data; Labelled Tree; Skewed Data; Neighborhood Balance; Frequently Changing Structure; Data Mining System
Degree year: 2010
Advisor: Li Xiongfei (李雄飞)
Discipline code: 081203
Degree-granting institution: Jilin University (吉林大学)
Thesis submission date: 2010-06-01

Abstract

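With the rapid development of computer technology, the Internet, and database technology, the volume of semi-structured data and information accumulated in every field has grown sharply. There is an urgent need to design semi-structured data models driven by knowledge discovery requirements, and to use those models to store and describe both the content and the structure of semi-structured data. There is likewise a need for effective semi-structured data mining algorithms that extract, from large collections of semi-structured documents, deep descriptive information, structural features, and predictions of how things will evolve, combining content, structure, and other information to discover latent knowledge. This thesis studies semi-structured data models and data mining methods in depth. Its main contents are as follows.
     (1) Starting from the overall scope of semi-structured data research, we give a detailed survey of the field. We summarize the semi-structured data models and data schemas proposed so far; review current progress in semi-structured data mining from the perspectives of feature extraction, frequent structure discovery, and document clustering and classification; and describe the functional characteristics of currently popular data mining systems.
     (2) To handle imprecise and uncertain knowledge under semi-structured data models, we design LTRS, a rough set model based on labeled trees. LTRS analyzes semi-structured data from both the structural and the content perspective. Using the tree representation, it generates decision rules from both perspectives and describes the compositional relationships among tree nodes and the knowledge reduction of content. Existing semi-structured data models lack a formal definition of the trend and degree of data change and describe the dynamic properties of data poorly. We therefore propose ADAWT, a tree model that carries dynamic change information such as the average depth and average width of the tree, which lays the foundation for efficiently discovering spatially dynamic changing structures.
     (3) We propose SSGP, a new data-level balancing method for the skewed-data classification problem inherent in semi-structured data. The algorithm can handle data sets that contain several minority classes. It also extends and applies a modulus operation on examples, which greatly improves computational efficiency.
     (4) Clustering and classification of XML and other semi-structured data sets both suffer from overlapping class boundaries and boundary noise, which degrade clustering quality or classification accuracy. Drawing on the notion of directionality and on the law of universal gravitation in physics, and taking the interaction between data objects as the basis, we discuss density-based clustering in terms of scalar influence and directional influence, and propose VICA, a density clustering algorithm that considers vector influence between objects. VICA judges neighborhood balance with two methods for computing the vector influence function, the direction similarity method and the accumulated vector method. It handles clustering when boundaries are sparse, object density is unevenly distributed, and boundary noise points are present.
     (5) Traditional static mining algorithms are not suited to knowledge discovery over dynamically changing XML documents. Using the proposed ADAWT model, we design the SCSFinder algorithm, which discovers spatial structures whose changes in average depth and average width reach a level of interest set by the user. In addition, we build a feature space from the dynamic structures extracted, represent XML documents as feature vectors, and cluster large collections of XML documents with an improved clustering algorithm.
     (6) Building on existing semi-structured data mining theory and combining the strengths of popular, mature data mining products in industry and research (such as SAS Enterprise Miner and Weka), we design DBIN Miner, a multi-strategy data mining prototype system. The system stores semi-structured XML data and integrates the mining algorithms introduced above together with commonly used basic data mining algorithms. To address the difficulty that data mining techniques and systems face with large-scale data, we paid particular attention to designing and implementing system extensibility through buffering and plug-in techniques.
     This thesis carries out research on semi-structured data model design, on classification and clustering for semi-structured data applications, and on document clustering based on dynamic feature extraction from semi-structured data, providing a theoretical basis for knowledge discovery in semi-structured data. The theory is also applied to the design and implementation of a data mining prototype system, laying a foundation for commercial application of the related theory.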
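The neighborhood-balance idea in item (4) can be illustrated with a minimal sketch. This is not the VICA algorithm: the abstract does not give the vector influence function, so the code assumes the simplest accumulated-vector reading, in which the unit vectors from a core point to its neighbors are averaged and the length of the mean vector measures how one-sided the neighborhood is. The function names and the `threshold` value of 0.5 are hypothetical.

```python
import math

def unit_vectors(core, neighbors):
    """Unit vectors pointing from the core point to each distinct neighbor."""
    vecs = []
    for p in neighbors:
        diff = [b - a for a, b in zip(core, p)]
        norm = math.sqrt(sum(d * d for d in diff))
        if norm > 0:  # skip points that coincide with the core point
            vecs.append([d / norm for d in diff])
    return vecs

def imbalance(core, neighbors):
    """Length of the mean unit vector: 0 = evenly surrounded, 1 = all on one side."""
    vecs = unit_vectors(core, neighbors)
    if not vecs:
        return 0.0  # an empty neighborhood is treated as trivially balanced
    dim = len(core)
    mean = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return math.sqrt(sum(m * m for m in mean))

def is_balanced(core, neighbors, threshold=0.5):
    """Assumed rule: a point is balanced when its imbalance is at most `threshold`."""
    return imbalance(core, neighbors) <= threshold

core = (0.0, 0.0)
around = [(1, 0), (-1, 0), (0, 1), (0, -1)]   # neighbors on every side
one_side = [(1, 0), (1, 1), (1, -1), (2, 0)]  # neighbors on one side only

print(is_balanced(core, around))    # interior-like point
print(is_balanced(core, one_side))  # boundary-like point
```

Under this assumption, a point inside a cluster has neighbors in all directions and scores near 0, while a point on a sparse boundary has neighbors mostly on one side and scores near 1, so a density-based expansion could decline to grow a cluster through it.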
As society enters the information age and computer networks and computing technology are applied ever more widely, databases in every industry accumulate large amounts of data. The need to use these data, extracting useful information or knowledge from them to guide enterprise production and distribution, gave rise to data mining, a widely used and highly practical computing technology. With the spread of the Internet, network data keep growing, and much of it is semi-structured. Semi-structured data are favored for data storage and data exchange because they are extensible, self-describing, and dynamic. They give system implementations flexibility and make it easier for organizations to share resources.
     Semi-structured data lack a rigid, complete structure, so they carry both content and structural information; the structure may be implicit and may even change constantly. Data models are therefore needed that describe the characteristics of semi-structured data well and are designed around data analysis requirements. Well-designed models provide a stable basis for data storage, index construction, query optimization, and knowledge discovery. In addition, the flexibility of semi-structured data causes many problems in application analysis, such as data skew, unclear cluster boundaries, and cluster boundary noise, and suitable semi-structured data mining algorithms must be designed to solve them. The structure and content of semi-structured data may change continuously and are highly dynamic, and these changes reflect how the data evolve over time. Two questions follow: how to find dynamic structures in the change history, and how to use those structures and the associated information, together with clustering and classification methods, to analyze semi-structured data. Answering them is of great significance for making better use of the flexibility and dynamics of semi-structured data.
     As data volumes grow and analysis requirements increase, many kinds of data analysis tools and data mining systems need to be developed. Mining historical data makes it possible to build decision rules that guide management or development and bring greater economic benefit to enterprises. Data mining was application-oriented from the start, and its widespread use and popularization in turn promote research on data mining theory.
     The main results obtained by this thesis are summarized as follows:
     1) We analyze current research on semi-structured data models and data mining. Based on the relevant literature, we summarize the characteristics of the semi-structured data models and data schemas proposed so far and point out where they describe data poorly in applications. From the application side, we identify problems such as data skew and unclear cluster boundaries. We then summarize research on feature extraction, frequent structure discovery, and document clustering and classification, and describe the characteristics of popular data mining systems. This survey forms the basis for the thesis.
     2) Driven by data mining requirements, we design two semi-structured data models, LTRS and ADAWT. To characterize and handle the vagueness and uncertainty of semi-structured data, as well as the composition and content implied in semi-structured data models, we present the Labeled Tree Rough Set model (LTRS), which extends the traditional rough set model. Using both the structure and the content of semi-structured data, we redefine on the tree structure the information system and the basic concepts of rough sets, such as the equivalence relation, the indiscernibility relation, and the upper and lower approximations. We also describe the discernibility matrix and decision rules. By analyzing XML data sets with the LTRS model, we can construct decision rules from structure and content at the same time and describe the compositional relationships between tree nodes and the knowledge reduction of content. Existing semi-structured data models lack a formal definition of the direction and degree of data change and describe the dynamic properties of data poorly. We therefore present ADAWT, a tree model that carries dynamic change information about tree depth and width. The model integrates the dynamic change information of tree-shaped documents such as XML across N historical versions and provides a basis for effective discovery of dynamic structures.
     3) We put forward SSGP, a data balancing algorithm for classifying skewed semi-structured data. Skewed data are common in semi-structured Web applications, and traditional classifiers handle them poorly: a classifier may partly or completely ignore the positive examples, even to the point of predicting every example as negative. Predicting and analyzing examples from under-represented classes is therefore an important branch of data mining, and classification algorithms are needed for the widespread problem of skewed semi-structured data. To balance training sets that have several classes, SSGP builds on the idea that cases of the same class differ little from one another. It forms new minority class cases by interpolating between several minority class cases that lie close together. It is proved that SSGP does not add noise to the data set. To improve efficiency, SSGP uses the modulus of cases instead of computing many pairwise dissimilarities between cases. Tests with a decision tree classifier show that a single run of SSGP improves the predictive accuracy for several minority classes.
     4) We present VICA, a clustering algorithm that considers vector influence between objects, to deal with unclear cluster boundaries and cluster noise. While working on the classification of skewed semi-structured data, we found that both clustering and classification lose precision when cluster boundaries are unclear and noisy. VICA is a density-based clustering algorithm. Following the law of gravity, the influence between particles has two aspects, distance and direction. We define a Vector Influence Function by introducing a scalar influence function and a direction influence function, and we propose two methods, the similarity method and the summation method, to compute the direction influence. VICA normalizes the projections of the objects in the neighborhood of a core point, checks whether the core point is balanced, and then expands into a cluster the objects that are reachable from balanced core points with balanced density. Theoretical analysis and experimental results indicate that the algorithm can discover clusters of arbitrary shape and effectively eliminates noise such as sparse boundary points. It addresses problems caused by unclear cluster boundaries in high-dimensional data, uneven density distributions, and large numbers of cluster boundary objects. The algorithm improves clustering accuracy and gives better clustering results on various data sets.
     5) We study dynamic feature extraction and document clustering for XML data. Traditional static mining algorithms cannot discover knowledge in dynamically changing XML documents. We summarize the basic concepts and definitions from existing work on finding FSC and FS, and design a corresponding structure-finding algorithm based on a temporal data model, which greatly reduces the time spent on change detection between versions. We then present the ADAWT model from the viewpoint of spatial scale changes between XML versions. Finally, we construct a feature space from the various dynamic structures extracted, represent XML documents as feature vectors, and cluster large-scale XML document collections with the VICA algorithm.
     6) We design DBIN Miner, a multi-strategy data mining system. Advances in database technology and the widespread use of database management systems have led to growing data volumes and analysis requirements, and many data mining systems and business intelligence products continue to be developed. We review the history of data mining systems, analyze the characteristics of typical systems, and design a multi-strategy data mining system. To handle large-scale data, we introduce and design an algorithm component (plug-in) mechanism, buffer processing, and XML-based configuration files. The system integrates the algorithms designed above and is highly extensible. The results of this thesis advance research on semi-structured data models, on classification and clustering for semi-structured data analysis, and on dynamic feature extraction and document clustering for semi-structured data. The theoretical research and the prototype design have both theoretical significance and practical value.
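The depth and width information that ADAWT is said to track across historical versions can be sketched as follows. This is an illustration only, not the ADAWT model or the SCSFinder algorithm: the abstract gives no formal definitions, so the code assumes that average depth is the mean depth of leaf elements (root at depth 1) and that average width is the mean number of elements per tree level. The names `shape_stats` and `shape_changes` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def shape_stats(xml_text):
    """Return (average leaf depth, average elements per level) for one XML version."""
    root = ET.fromstring(xml_text)
    leaf_depths = []
    level_counts = Counter()
    stack = [(root, 1)]  # assumption: the root element is at depth 1
    while stack:
        node, depth = stack.pop()
        level_counts[depth] += 1
        children = list(node)
        if not children:
            leaf_depths.append(depth)
        for child in children:
            stack.append((child, depth + 1))
    avg_depth = sum(leaf_depths) / len(leaf_depths)
    avg_width = sum(level_counts.values()) / len(level_counts)
    return avg_depth, avg_width

def shape_changes(versions):
    """Absolute change in (average depth, average width) between consecutive versions."""
    stats = [shape_stats(v) for v in versions]
    return [(abs(b[0] - a[0]), abs(b[1] - a[1])) for a, b in zip(stats, stats[1:])]

# Two toy versions of one document: version 2 adds two children under <b>.
v1 = "<a><b/><c/></a>"
v2 = "<a><b><d/><e/></b><c/></a>"

for depth_change, width_change in shape_changes([v1, v2]):
    print(round(depth_change, 3), round(width_change, 3))
```

A structure whose depth or width change exceeds a user-chosen threshold over many versions could then be reported as frequently changing, and the per-structure changes could serve as entries of a document feature vector for clustering.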

引文

[1] FAYYAD U, PIATETSKY S, SMYTH, UYHURUSAMY. Advances in Knowledge Discovery and Data Mining[M], 1996, MIT Press.
    [2] FENG J H, QIAN Q. Efficient Mining of Frequent Closed XML Query Pattern[J]. Jourmal of Computer Science and Technology (JCST), 2007, 22(5): 725-735.
    [3] MANKU G S, MOTWANI R. Approximate Frequency Counts over Data Streams[C]. In: Proceedings of the 28th VLDB, 2002: 346-357.
    [4] ROMER K, FRANK C, MARRON P J, et al. Generic role assignment for wireless sensor networks[C]. In: Proceedings of the 11th ACM SIGOPS European Workshop, New York: ACM Press, 2004: 7-12.
    [5] ALON Y L, ANAND R, JOANN J O. Querying heterogeneous information sources using source descriptions[C]. In: Proceedings of the 22nd International Conference on Very Large Data Bases, San Francisco: Morgan Kaufmann Publishers Inc, 1996: 251-262.
    [6] TIRTHANKAR L, SERGE A, JENNIFER W. Ozone: Integrating Structured and Semistructured Data[C]. In: Proceedings of 8th international workshop on Database Programming Languages, 1999: 297-323.
    [7] GRACANIN D, ELTOWEISSY M, WADAA A, DASILVA L A. A service-centric model for wireless sensor Networks[J]. IEEE Journal on Selected Areas in Communications, 2005, 23(6): 1159-1166.
    [8] CHAWATHE S, GARCIA M H, HAMMER J, IRELAND K, PAPAKONSTANTINOU Y, ULLMAN J, WIDOM J. The TSIMMIS project: integration of heterogeneous information sources[C]. In: Proceedings of 10th Anniversary Meeting of the Information Processing Society of Japan, Tokyo, Japan, 1994: 7-18.
    [9] PAPAKONSTANTINOU Y, GARCIA M H, WIDOM J. Object Exchange Across Heterogeneous Information Sources[C]. In: Proceedings of ICDE, Taipei, 1995: 251-260.
    [10] BUNEMAN P, DAVIDSON S, FERNANDEZ M, SUCIU D. Adding Structure to Unstructured Data[C]. In: Proceedings of the International Conference on Data Base Theory, 1997: 336-350.
    [11] TATSUYA A, KENJI A, SHINJI K. Efficient Substructure Discovery from Large Semi-structured Data[C]. In: Proceedings of the Second SIAM International Conf on Data Mining (SDM 2002), 2002: 158-174.
    [12] MENGCHI L, WANG L. A Data Model for Semistructured Data with Partial and Inconsistent Information[C]. In: Proceedings of 7th International Conference on Extending Database Technology (EDBT 2000), Konstanz, Germany, 2000: 317-331.
    [13] DOBBIE G, WU X, LING T W, LEE M L. ORA-SS: An Object-Relationship-Attribute Model for Semi-structured Data[R]. Technical Report TR21/00, School of Computing, Natioal University of Singapore, 2000.
    [14] GOLDMAN R, WIDOM J. Dataguides: Enabling query formulation and optimization in semistructured databases[C]. In: Proceedings Of 23rd International Conference on Very Large Data Bases, 1977: 436-445.
    [15] TATSUYA A, HIROKI A, TAKEAKI U, SHIN N. Discovering Frequent Substructures in Large Unordered Trees[C]. Lecture Notes in Computer Science, 2003: 47-61.
    [16] XIAOYING W, TOK W L, Sin Y L, Mong L L. NF-SS: A Normal Form for Semistructured Schema[C]. Lecture Notes in Computer Science, 2002: 198-211.
    [17] NESTOROV S, ABITEBOUL S, MOTWANI R. Extracting schema from semistructured data[C]. In: Proceedings of the ACM-SIGMOD International Conf. on Management of Data, Seattle, Washington, 1998: 295-306.
    [18] BOSAK J, et al. W3C XML Specification DTD. 1998[2010-3-11]. Available at: http://www.w3.org./XML/1998/06/xml spec_ report 19980910. html.
    [19] BRAY T, FRANKSTON C, MALHOTRA A. Document content description (DCD). 1999[2010-3-11] Available at: http://www.w3.or -g/TR/NOTE-dcd, 1999.
    [20] AGRAWAL R, IMIELINSKI T, SWAMI A N. Mining Sssociation Rules between Sets of Items in Large Databases[C]. In:SIGMOD Conference, 1993: 207-216.
    [21] LIN J L, DUNHAM M H. Mining association rules: Anti-skew algorithms[C]. In: Proceedings of the Internatioal Conference on Data Engineering, Orlando, Florida, USA, 1998: 486-493.
    [22] HAN J W, PEI J, YIN Y. Mining frequent patterns without candidate generation[C]. In: Proceedings of the ACM SIGMOD Int.Conf. Management of Data, 2000: 1-12.
    [23] LI Xiaolei, HAN Jiawei. Mining Approximate Top-K Subspace Anomalies in Multi-Dimensional Time-Series Data[C]. In: Proceedings of VLDB, 2007: 447-458.
    [24] KAMBER M, HAN Jiawei, CHIANG J. Metarule-Guided Mining of Multi-Dimensional Association Rules Using Data Cubes[C]. In: Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA, USA, 1997: 207-210.
    [25] PEI J, HAN J, MORTAZAVI B, et al. Mining Access Patterns Efficiently from Web Logs[C]. In: Proceedings of the 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining(PAKDD), 2000: 396-407.
    [26] ZAIK M J, SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning [J], 2001, 42(1/2): 31-60.
    [27] MAGED E, ELKE A R, CAROLINA R. FS-Miner: An Efficient and Incremental System to Mine Contiguous Frequent Sequences[R]. In: Computer Science Technical Report Series, 2003.
    [28] LU Y, EZEIFE C I, Position Coded Pre-order Linked WAP-Tree for Web Log Sequential Pattern Mining[C]. In: Proceedings of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2003: 337-349.
    [29] ZHOU B Y, HUI S C, FONG A C M. CS-mine: An Efficient WAP-tree Mining for Web Access Patterns[C]. In: Proceedings of the 6th Asia Pacific Web Conference (APWeb’04), 2004: 523-532.
    [30] LEUNG H, CHUNG K F. Chan Stephen Chi-fai, On the use of hierarchial information in sequential mining-based XML document similarity computation[J]. Knowl. Inf.Syst, 2005, 7(4): 476-498.
    [31] LEUNG H, CHUNG K F, CHAN S C. A New Sequential Mining Approach to XML Document Similarity Computation[C]. In: Proceedings of Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2003: 569.
    [32] LEUNG H, CHUNG K F, CHAN S C, et al. XML Document Clustering Using Common Xpath[C]. In: Proceedings International Workshop on Challenges in Web Information Retrieval and Integration, 2005: 91-96.
    [33] CALIN G, FLORENT M, BRIGITTE T. Sequential Pattern Mining for Structure-based XML Document Classification[C]. In: Proceedings of the 2005 Initiative for the Evaluation of XML Retrieval Workshop (INEX’05), 2005: 350-351.
    [34] ZAKI M J. Efficiently Mining Frequent Trees in a Forest[C]. In: Proceedings of the 8th ACM SIGKDD Intermational Conference Knowledge Discovery and Data Mining, 2002: 71-80.
    [35] ZAKI M J. Efficiently Mining Frequent Trees in a Forest: Algorithms and Applications[J]. IEEE Trans, Knowl. Data Eng, 2005, 17(8): 1021-1035.
    [36] ZAKI M J. Efficiently Mining Frequent Embedded Unordered Tree[J]. Fundamenta Informaticae, 2005, 66(1/2): 33-52.
    [37] ASAI T, ABE K, KAWASOE S, et al. Efficient Substucture Discovery from Large Semi-structured Data[C]. In: Proceedings of the 2nd SIAM Int’l Conference on Data Mining, 2002: 158-174.
    [38] ASAI T, ARIMURA H, UNO T, et al. Discovering Frequent Substuctures in Large Unordered Trees[C]. In: Proceedings of the 6th Int’l Conf. on Discovery Science, 2003: 47-61.
    [39] YUN C, YI X, YIRONG Y, RICHARD R. Muntz Mining closed and maximal frequent subtrees from databases of labeled rooted trees[J]. IEEE Transactions on Knowledge and Data Engineering. February 2005, 17(2): 190-202
    [40] CHI Y, YANG Y, MUNTZ R R. HybridTreeMiner: An Efficient Algorithm for Mining Frequent Rooted Trees and FreeTrees Using Cannical Forms[C]. In: Proceedings of the 16th International Conference on Scientific and Statistical Database Management, 2004: 11-20.
    [41] WANG C, HONG M S, PEI J, et al. Efficient Pattern-Growth Methods for Frequent Tree Pattern Mining[C]. In: Proceedings of the PAKDD, 2004: 441-451.
    [42] TERMIER A, ROUSSET M C, SEBAG M. TreeFinder: a First Step towards XML Data Mining[C]. In: Proceedings of the 2002 IEEE International Conference on Data Mining(ICDM’02), 2002: 450-457.
    [43] YANG L H, LEE M L, HSU W. Efficient Mining of XML Query Patterns for Caching[C]. In: Proceedings of VLDB, 2003: 69-80.
    [44] BEI Y J, CHEN G, DONG J X. BUXMiner: An Efficient Bottom-Up Approach to Mining XML Query Patterns[C]. In: Proceedings of APWEB/WAIM, 2007: 709-720.
    [45] LI Guoliang, FENG Jianhua, WANG Jianyong, et al. Incremental Mining of Frequent Query Patterns from XML Queries for Caching[C]. In: Proceedings of the 2006 IEEE Int. Conf. on Data Mining, 2006: 350-361.
    [46] CURBERA, EPSTEIN D A. Fast difference and update of XML documents[C]. In: Proceedings of Xtech, 1999.
    [47] COBENA G, ABITEBOUL S, MARIAN A. Detecting changes in XML documents[C]. In: Proceedings of ICDE International Conference on Data Engineering, 2002: 41.
    [48] WANG Y, DEWITT D J, CAI J. X-Diff: An effective change detection algotithm from XML documents[C]. In: Proceedings of ICDE International Conference on Data Engineering, 2003: 519-530.
    [49] ZHAO Qiankun, CHEN Ling. Bhowmick Sourav S., et al., XML structural delta mining; Issues and challenges[J]. Data Knowl. Eng, 2006, 59(3): p627-651.
    [50] ZHAO Qiankun, BHOWMICK S S. MUKESH K, et al. Discovering frequently changing structures from historical structural deltas of unordered XML[C]. In: Proceeding of the CIKM, 2004: 188-197.
    [51] ZHAO Qiankun, BHOWMICK S S. FASST Mining: Discovering Frequently Changing Semantic Structure from Versions of Unordered XML Documents[C]. In: Proceedings of the DASFAA, 2005: 724-735.
    [52] CHEN L, BHOWMICK S S, CHIA L T. FRACTURE mining: Mining frequently and concurrently mutating structures from historical XML documents[J]. Data Knowl. Eng, 2006, 59(2): 320-347.
    [53] CHEN Ling, BHOWMICK S S, CHIA L T. Mining Association Rules from Structural Deltas of Historial XML Documents[C]. In: Proceedings of the PAKDD, 2004: 452-457.
    [54] BEI Yijun, CHEN Gang, YU Lihua, et al. XML Query Recommendation Based On Association Rules[C]. In: Proceedings of the SNPD, 2007: 303-308.
    [55] RUSU L I, RAHAYU W, TANIAR D. Extracting Variable Knowledge form Multiversioned XML Documents[C]. In: Proceedings of the Sixth IEEE International Conference on Data Mining Workshops, 2006: 70-74.
    [56] RUSU L I, RAHAYU W, TANIAR D. Maintaining Versions of Dynamic XML Documents[C]. In: Proceedings of The 6th International Conference on Web Information Systems Engineering, 2005: 536-543.
    [57] LUDOVIC D. Mining XML documents: Bridging the gap between Machine Learning and Information Retrieval[C]. In: Proceedings of INEX 2005, 2005: 332-340..
    [58] DENOYER L, GALINARI P. A Belief Networks-Based Generative Model for Structured Documents[J]. An Application to the XML Categorization, 2004, 40: 807-827.
    [59] BOUCHACHIA A, HASSLER M. Classification of XML Documents[C]. In: Proceedings of Computational Intelligence and Data Mining, 2007: 390-396.
    [60] Clark M, Watt S. Classifying XML Documents by Using Genre Features[C]; In: TIR-07 4th International Workshop, 2007: 242-248.
    [61] Ghosh S, Mitra P. Combining Content and Structure Similarity for XML Document Classification using Composite SVM Kernel[C]. In: Proceedings of Pattern Recognition, 2008: 1-4.
    [62] MOSTAFA H C, MORTEZA H C, CARO L, MASOUD R, EUHANNA G. Efficient rule based structural algorithms for classification of tree structured data[C]; In: Proceedings of Intelligent Data Analysis, 2009: 165-188.
    [63] Yan X, HANG S J. Graph_based substructure pattern mining[C]. In: Proceedings of IEEE International Conference on Data Mining, 2002: 124-133.
    [64] LAURENT C, ISABELLE T, FABIEN. Transforming XML trees for efficient classification and clustering[C]. Lecture Notes in Computer Science, 2006: 469-480.
    [65] BJ?RN B, ALBRECHT Z. Tree2-Decision Trees for Tree Structured Data[C]. In: European Conference on Principles and Practice of Knowledge Discovery in Databases edition, Porto, Portugal, 2005:46-58.
    [66] WANG Yuan, DAVID J D, CAI J Y. X-Diff An Effective Change Detection Algorithm for XML Documents[C]. In: Proceedings of IEEE19th ICDE. 2003: 519-530.
    [67] SWATHY G. XML Classification[D]. Master`s degree paper, University of Kansas, 2002.
    [68] TAGARELLI A, GRECO S. Toward Semantic XML Clustering[C]. In: Proceedings of the 2006 Siam Conference on Data Mining(SDM’06), Maryland, USA, 2006: 188-199.
    [69] MA Y, CHBEIR R. Content and Structure Based Approach for XML Similarity[C]. In: Proceedings of the 2005 Conference on Instructional Technologies (CIT‘05) , Binghamton, Canada, 2005: 136-140.
    [70] COSTA G, MANCO G, ORTALE R, TAGARELLI A. A Tree-Based Approach to Clustering XML Documents by Structure[C]. In: Proceedings of the 8th European Conference on Principles and Practice Knowledge Discovery in Databases (PKDD’04), Pisa, Italy, 2004: 137-148.
    [71] DALAMAGAS T, CHENG T, WINKEL K, SELLIS T K. A Methodology for clustering XML documents by structure[J]. In: Information Systems Journal, 2006, 31(3): 187-228.
    [72] JEONG H H, KEUN H R. Clustering and retrieval of XML documents by structure[C]. Computational Science and Its Applications. In: Proceedings of the Iccsa 2005, 2005: 925-935
    [73] GIANNI C, GIUSEPPE M, RICCARDO O, ANDREA T. A tree-based approach to clustering XML documents by structure[C]. In: Proceedings of the Knowledge Discovery in Databases, PKDD 2004, 2004, 137-148.
    [74] JEONG H H, KEUN H R. A new XML clustering for structural retrieval[C]. In: Conceptual Modeling-Er 2004, 2004: 377-387.
    [75] JEONG H H, KEUN H R. A new sequential mining approach to XML document clustering. In: Web Technologies Research and Development - Apweb 2005. 2005: 266-276
    [76] WANG L C, MAMOULIS D W, YIU N S. An efficient and scalable algorithm for clustering XML documents by structure[J]. IEEE Transactions on Knowledge and Data Engineering. 2004, 16(1): 82-96.
    [77] DOUCET A, AHONEN M H. Na?ve Clustering of a large XML Document Collection[C]. In: Proceedings of the 2002 Initiative for the Evaluation of XML Retrieval Workshop (INEX’02), 2002: 81-87.
    [78] ANDREA T, SERGIO G. Clustering transactional XML data with semantically-enriched content and structural features. In: Web Information Systems - Wise 2004. 2004: 266-278.
    [79] JEONG H H, KEUN H R. A new sequential mining approach to XML document clustering. In: Web TechnologiesResearch and Development - Apweb 2005. 2005: 266-267.
    [80] HWAN C, BONGKI M, KIM H J. A clustering method based on path similarities of XML data[C]. In: Data & Knowledge Engineering. 2007: 361-367.
    [81] PANAGIOTIS A, CHRISTOS M, NIKOS T. XEdge: Clustering Homogeneous and Heterogeneous XML Documents Using Edge Summarie[C]. In: Proceedings of the 2008 ACM symposium on Applied computing, 2008: 1081-1088.
    [82] RICHI N, SUMEI X. XML documents clustering by structures with XCLS[C]. In: Proceedings of the 2005 Initiative for the Evaluation of XML Retrieval Workshop (INEX’05), 2005: p337-349.
    [83] VERCOUSTRE A M, MOUNIR F, SABA G, YVES L. A Flexible Structured-based Representation for XML Document Mining[C]. In: Proceedings of the 2005 Initiative for the Evaluation of XML Retrieval Workshop (INEX’05), 2005: 349-350.
    [84] DOUCET, Lehtonen M. On the unsupervised classification of text-centric XML document collections[C]. In: Proceedings of the 2006 Initiative for the Evaluation of XML Retrieval Workshop (INEX’06), 2006: 288-292.
    [85] KC M, HAGENBUCHNER M, TSOI A C, SCARSELLI F, GORI M, SPERDUTI A. XML Document Mining using Contextual Self-Organizing Maps for Structures[C]. In: Proceedings of the 2006 Initiative for the Evaluation of XML Retrieval Workshop (INEX’06), 2006: 292-307.
    [86] KNIJF J D. FAT-CAT: Frequent Attributes Tree based Classification[C]. In: Proceedings of the 2006 Initiative for the Evaluation of XML Retrieval Workshop (INEX’06), 2006: 307-318.
    [87] YANG J, ZHANG F. XML Document Classification using Extended VSM[C]. In: Proceedings of the 2007 Initiative for the Evaluation of XML Retrieval Workshop (INEX’07), 2007: 200-211.
    [88] MURUGESHAN M. S, LAKSHMI K, MUKHERJEE S. A Categorization Approach for Wikipedia Collection Based on Negative Category Information and Initial Descriptions[C]. In: Proceedings of the 2007 Initiative for the Evaluation of XML Retrieval Workshop (INEX’07), 2007: 212-215.
    [89] TRAN T, NAYAK R. Document Clustering using Incremental and Pairwise Approaches[C]. In: Proceedings of the 2007 Initiative for the Evaluation of XML Retrieval Workshop (INEX’07), 2007: 215-223.
    [90] BORIS C. Semi-supervised Categorization of Wikipedia collection by Label Expansion[C]. In: Proceedings of the 2008 Initiative for the Evaluation of XML Retrieval Workshop (INEX’08), 2008: 352-360.
    [91] LUIS M, CAMPOS D, JUAN M. FERN L, JUAN F H, ALFONSO E R. Probabilistic Methods for Link-based Classification[C]. In: Proceedings of the 2008 Initiative for the Evaluation of XML Retrieval Workshop (INEX’08), 2008: 360-365.
    [92] CHRIS D V, SHLOMO G. Document Clustering with K-tree[C]. In: Proceedings of the 2008 Initiative for the Evaluation of XML Retrieval Workshop (INEX’08), 2008: 366-384.
    [93] GOLDMAN R, Widom J. Dataguides: Enabling query formulation and optimization in semistructured databases[C]. In: Proceedings Of 23rd International Conference on Very Large Data Bases, 1977: 436-445.
    [94] LEONARDI E, BHOWMICK S S. OXONE: A Scalable Solution for Detecting Superior Quality Deltas on Ordered Large XML Documents[C]. In: Proceedings of the 25th International Conference on Conceptual Modelling, 2006: 196-211.
    [95] LEONARDI E, BHOWMICK S S. Detecting Changes on Unordered XML Documents Using Relational Databases: A Schema-Conscious Approach[C]. In: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, 2005: 509-516.
    [96]陈滢,王能斌,半结构化数据查询的处理和优化.软件学报[J], 1999, 10(8): 883-891.
    [97]王宁,徐宏炳,王能斌,基于带根连通有向图的对象集成模型及代数.软件学报[J], 1998, 9(12): 894-898.
    [98]王宁,徐宏炳,王能斌,异构数据源集成系统中基于数据源能力的查询分解和优化策略.计算机学报[J], 1999, 22(1): 31-38.
    [99]吴胜利,钟华,黄涛,孙红艳,虞海江,关系型多数据库系统IS-Global的设计与实现.软件学报[J], 1999, 10(8): 877-882.
    [100]石祥滨,张斌,王国仁,于戈,郑怀远,SCOPE/CIMS系统中模式集成的形式化基础.计算机学报[J], 1998, 21(11): 1015-1021.
    [101]石祥滨,张斌,王国仁,于戈,郑怀远,一个实现对象查询语言的形式化基础.软件学报[J], 1998, 9(5): 360-365.
    [102]张斌,基于面向对象的大规模多数据源集成技术的研究与实现[D].东北大学,沈阳,1996.
    [103]李瑞轩,卢正鼎,肖卫军,李兵,多数据库系统中基于XIDM的模式映射方法研究.计算机研究与发展[J], 2004, 41(3): 485-491.
    [104]许学标,顾宁,施伯乐,半结构化数据模型及查询.计算机研究与发展[J].1998, 35(10): 896-901.
    [105]杨建武,陈晓鸥,半结构化数据相似搜索的索引技术研究.计算机学报[J]. 2002, 25(11): 1219-1226.
    [106]雷向欣,胡运发,杨智应,刘勇,张凯,基于互关联后继树的XML索引技术.计算机研究与发展[J]. 2005, 42(7): 1261-1271.
    [107]张斌,石祥滨,郑怀远,面向对象的多数据库技术.计算机科学[J], 1996, 26(5): 33-37.
    [108] FENG Jianhua, QIAN Qian, WANG Jianyong, et al. Exploit Sequencing to Accelerate Hot XML Query Pattern Mining[C]. In: Proceedings of the 21st ACM Symposium on Applied Computing (Data Mining Track), 2006: 517-524.
    [109] WANG Jianyong, ZENG Zhipin, ZHOU Lizhu. CLAN: An Algorithm for Mining Closed Cliques from Large Dense Graph Databases[C]. In: Proceedings of the 2006 IEEE Int. Conf. on Data Engineering, 2006: 73-85.
    [110] ZENG Zhiping, WANG Jianyong, ZHOU Lizhu, et al. Coherent Closed Quasi-Clique Discovery from Large Dense Graph Databases[C]. In: Proceedings of the 12th Int. Conf. on Knowledge Discovery and Data Mining, 2006: 797-802.
    [111] SAS/QC Software: Reference, Version 6, First [M]. SAS Institute Inc, 1991.
    [112] SAS Institute White Paper[R],“Finding the Solution to Data Mining: A Map of the Features and Components of SAS Enterprise Miner? Software,”SAS Institute Inc, 1999.
    [113]高惠璇,SAS系统SAS/STAT软件使用手册[M].北京:中国统计出版社,1997.
    [114]卢纹岱等编著. SPSS for Windows从入门到精通[M].北京:电子工业出版社. 1998.
    [115]游湘涛、史忠植,多策略通用数据采掘工具MSMiner[J].计算机研究与发展. 2001, 38(5): 581-586.
    [116] PAWLAK Z. Rough sets[J]. International Journal of Computer and Information Science , 1982, 11(5): 341～356.
    [117] JELONEK J, KRAWIEC K, SLOWINSKI R. Rough set reduction of attributes and their domains for neural networks[J]. Computational Intelligence, 1995, 11(2): 339-347.
    [118]苗夺谦,胡桂荣.知识约简的一种启发式算法[J].计算研究与发展, 1999, 36(6): 681-684.
    [119]杨明.一种基于改进差别矩阵的属性约简增量式更新算法[J].计算机学报. 2007, 30(5):815-822.
    [120] YAO Y Y, LINGARAS P. Interpretations of belief functions in the theory of rough sets[J]. Information Sciences. 1998, 104: 81-106.
    [121] YAO Y Y, LIN T Y. Gerneralization of rough sets using modal logic[J]. Intelligent Automation and Soft Computing, 1996, 2: 103-120.
    [122] YAO Y Y. Relational interpretations of neighborhood operators and rough set approximation[J]. Information Sciences, 1998, 111: 239-259.
    [123] DUBOIS D, PRADE H. Rough fuzzy sets and fuzzy rough sets[J]. International Journal of General Systems, 1990, 17: 191-209.
    [124] PAWLAK Z, WONG S K M, ZIARKO W. Rough sets: probabilistic versus deterministic approach[J]. International Journal of Man-Machine Studies, 1988, 29: 81-95.
    [125] ZIARKO W. Variable precision rough set model[J]. Journal of Computer and System Sciences, 1993, 46: 39-59.
    [126]张文修,吴志伟.基于随机集的粗糙集模型(I)[J].西安交通大学学报, 2000, 34 (12) :15-19.
    [127] QUINLAN J R. Induction of decision trees [J]. Machine Learning. Kluwer Academic Publishers, 1986, 1(1): 81-106.
    [128] QUINLAN J R. C4.5: Programs for machine learning [M]. Morgan Kaufmann Publishers, 1993.
    [129] MEHTA M, AGRAWAL R, RISSANEN J. SLIQ: A fast scalable classifier for data mining [C]. In: Proceedings of the 5th International Conference on Extending Database Technology. Springer Verlag, 1996: 18-32.
    [130] SHAFER J, AGRAWAL R, MEHTA M. SPRINT: A scalable parallel classifier for data mining [C]. In: Proceedings of the 22nd International Conference on Very Large Data Bases. Morgan Kaufmann Publishers, 1996: 544-555.
    [131] KUMAR A, NAGADEVARA V. Development of Hybrid Classification Methodology for Mining Skewed Data Sets-A Case Study of Indian Customs Data[C]. In: Proceedings of the IEEE International Conference on Computer Systems and Applications. IEEE Computer Society, 2006: 584-591.
    [132] WEISS G. Mining with rarity: a unifying framework[J]. ACM SIGKDD Explorations, 2004, 6(1):7-19.
    [133] CHAWLA N V, BOWYER K W, HALL L O, KEGELMEYER W P. SMOTE: Synthetic Minority Over-sampling Technique[J]. Journal of Artificial Intelligence Research, 2002, 16: 321-357.
    [134] TOMEK I. Two Modifications of CNN[J]. IEEE Transactions on Systems, Man, and Cybernetics, 1976, 6(11): 769-772.
    [135] BATISTA G E A P A, PRATI R C, MONARD M C. A study of the behavior of several methods for balancing machine learning training data[J]. ACM SIGKDD Explorations, 2004, 6(1): 20-29.
    [136] TOMEK I. Two Modifications of CNN[J]. IEEE Transactions on Systems, Man, and Cybernetics, 1976, 6(11): 769-772.
    [137] WILSON D L. Asymptotic Properties of Nearest Neighbor Rules Using Edited Data[J]. IEEE Transactions on Systems, Man, and Cybernetics, 1972, 2(3): 408-421.
    [138] CHEN J X, CHENG T H, CHAN A L F, WANG H Y. An application of classification analysis for skewed class distribution in therapeutic drug monitoring - the case of vancomycin[C]. In: Proceedings of the IDEAS Workshop on Medical Information Systems: The Digital Hospital (IDEAS-DH'04), 2004: 35-39.
    [139] FERRI C, FLACH P, HERNÁNDEZ-ORALLO J. Learning Decision Trees Using the Area Under the ROC Curve[C]. In: Proceedings of the ICML(2002), 2002: 139.
    [140] ASUNCION A, NEWMAN D J. UCI Machine Learning Repository[R]. http://www.ics.uci.edu/~ml, 2007[2010-3-11].
    [141] MARDIA K V, JUPP P. Directional Statistics (2nd edition) [M]. Chichester, U.K.: John Wiley and Sons Ltd., 2000.
    [142] DONOHO D L. High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality [D]. USA: Department of Statistics Stanford University, 2000, 8.
    [143] ESTER M, et al. A density-based algorithm for discovering clusters in large spatial databases with noise [C]. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining. Portland, Oregon, USA: AAAI Press, 1996: 226-231.
    [144] YANG M S, PAN J A. On fuzzy clustering of directional data [J]. Fuzzy Sets and Systems, 1997, 91(3): 319-326.
    [145] DHILLON I S, MODHA D S. Concept decompositions for large sparse text data using clustering [J]. Machine Learning,2001, 42(1): 143-175.
    [146] BANERJEE A, DHILLON I S, GHOSH J, SRA S. Clustering on the Unit Hypersphere using Von Mises-Fisher Distributions [J]. Journal of Machine Learning Research, 2005, 6(6):1345-1382.
    [147] BANERJEE A, GHOSH J. Frequency-Sensitive Competitive Learning for Scalable, Balanced Clustering on High-dimensional Hyperspheres [J]. IEEE Transactions on Neural Networks, 2004, 15(3): 702-719.
    [148] BERKHIN P. Survey of Clustering Data Mining Techniques [R]. Accrue Software, San Jose, California, USA, 2002.
    [149] XU R, et al. Survey of clustering algorithms [J]. IEEE Transactions on Neural Networks. 2005, 16(3): 645-678.
    [150] SANDER J, ESTER M, KRIEGEL H P, XU X. Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and its Applications [J]. Data Mining and Knowledge Discovery, 1998, 2(2): 169-194.
    [151] ANKERST M, BREUNIG M M, KRIEGEL H P, et al. OPTICS: ordering points to identify the clustering structure[C]. In: Proceedings of the ACM SIGMOD’99 International Conference on Management of Data. Philadelphia, Pennsylvania, USA: ACM Press, 1999: 49-60.
    [152] XU X, ESTER M, KRIEGEL H, SANDER J. A distribution-based clustering algorithm for mining in large spatial databases[C]. In: Proceedings of the 14th International Conference on Data Engineering. Orlando, Florida, USA: IEEE Computer Society Press, 1998: 324-331.
    [153] HINNEBURG A, KEIM D A. An Efficient Approach to Clustering in Multimedia Databases with Noise [C]. In: Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining. New York, USA: 1998: 58-65.
    [154]倪巍伟,孙志挥,陆介平. k-LDCHD—高维空间k邻域局部密度聚类算法[J].计算机研究与发展, 2005, 42(5): 784-79.
    [155]周水庚,周傲英等.基于数据分区的DBSCAN算法[J].计算机研究与发展, 2000, 37(10): 1153-1159.
    [156]马帅,王腾蛟,唐世渭等.一种基于参考点和密度的快速聚类算法[J].软件学报, 2003, 14(6): 1089-1095.
    [157]王玲,薄列峰,焦李成.密度敏感的谱聚类[J].电子学报, 2007, 35(8): 1577-1581.
    [158]宋余庆,谢丛华,朱玉全,李存华等.基于近似密度函数的医学图像聚类分析研究[J].计算机研究与发展, 2006, 43(11): 1947-1952.
    [159]倪巍伟,陈耿,吴英杰,孙志挥.一种基于局部密度的分布式聚类算法[J].软件学报, 2008, 19(9): 2339-2348.
    [160]谌德荣,孙波,陶鹏,宫久路.基于核光谱角余弦的高光谱图像空间邻域聚类方法[J].电子学报, 2008, 36(10): 1992-1995.
    [161]刘铭,王晓龙,刘远超.基于语义的高维数据聚类技术[J].电子学报, 2009, 37(5): 925-929.
    [162] HALKIDI M, VAZIRGIANNIS M. Clustering Validity Assessment: Finding the Optimal Partitioning of a Data Set[C]. In: Proceedings of the 2001 IEEE International Conference on Data Mining. California, USA: IEEE Computer Society Press, 2001: 187-194.
    [163] RIZZOLO F, VAISMAN A A. Temporal XML: modeling, indexing, and query processing[J]. The VLDB Journal, 2008, 17: 1179-1212.
    [164] NIERMAN A, JAGADISH H V. Evaluating structural similarity in XML documents[C]. In: Proceedings of the WebDB Workshop, Madison, Wisconsin, USA, 2002: 61-66.
    [165] GROSSMAN R. The Terabyte Challenge Discovering Information in Distributed and Massive Data[C]. In: Proceedings of the KDD’01.
    [166] KARGUPTA H, PARK B H. Mining Decision Trees from Data Streams in a Mobile Environment[C]. In: Proceedings of the PKDD’01. http://citeseer.nj.nec.com/453619.html. 2001: 281.