Research on Key Algorithms for Mining Frequent Patterns in Data Streams and Their Application in Simulation System

Author: Ao Fujiang (敖富江)
Degree level: Doctoral
Discipline: Control Science and Engineering
Keywords: simulation; data stream mining; association rule; frequent pattern; maximal frequent itemsets; closed frequent itemsets; Top-K most-frequent itemsets; classification based on association rules; clustering of high-dimensional data streams
Degree year: 2008
Supervisor: Huang Kedi (黄柯棣)
Discipline code: 081103
Degree-granting institution: National University of Defense Technology
Submission date: 2008-06-01

Abstract

System simulation integrates technologies from many fields, including computing, networking, graphics and image processing, information processing, and automatic control, and it is an important means of system analysis and research. Data mining is a powerful tool for discovering knowledge hidden in simulation data. As simulation systems grow in complexity and scale, simulation runs take longer and produce ever larger volumes of data, so that simulation data takes on the characteristics of a data stream. It is therefore necessary to process simulation data with data stream mining techniques. A data stream is an ordered sequence of data items that is continuous, high-speed, unbounded, and time-varying. These characteristics pose severe challenges for data stream mining. Traditional algorithms designed for static datasets cannot be applied to data streams directly, and existing data stream mining algorithms fall short in time and space efficiency. It is therefore important to develop time- and space-efficient data stream mining algorithms for the data mining tasks commonly used in simulation.
     Among data mining tasks, association rule mining is the one most commonly used in simulation, and frequent pattern mining is the key step in generating association rules. This thesis therefore studies key algorithms for mining frequent patterns in data streams. It focuses on algorithms for mining maximal frequent itemsets, closed frequent itemsets, and Top-K most-frequent itemsets in data streams, together with a data stream classification algorithm based on closed frequent itemsets and a clustering algorithm for high-dimensional data streams based on Top-K frequent patterns. Finally, it examines how to integrate data stream mining algorithms quickly into different simulation systems, with emphasis on the reuse of these algorithms in simulation. The main work and contributions fall into the following six areas:
     (i) An algorithm for mining maximal frequent itemsets in data streams is proposed. Maximal frequent itemsets are fewer in number than either the complete set of frequent itemsets or the closed frequent itemsets, so algorithms that mine them achieve high time and space efficiency. The thesis therefore studies the mining of maximal frequent itemsets in data streams, aiming at an algorithm that can quickly maintain the maximal frequent itemsets of the current sliding window at any time. The work has three parts. First, a pruning technique for mining maximal frequent itemsets in data streams, Subset Equivalence Pruning, is presented. Second, a one-pass algorithm for mining maximal frequent itemsets, FPMFI-DS, is proposed; it applies Subset Equivalence Pruning to shrink the search space and thereby improve efficiency. Third, building on FPMFI-DS, the algorithm FPMFI-DS+ is proposed, which updates the maximal frequent itemsets of a data stream sliding window online. Experiments show that on dense datasets Subset Equivalence Pruning reduces the search space by about 40 percent, that FPMFI-DS mines quickly and scales well, and that FPMFI-DS+ updates quickly and is stable.
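     To make the notion of maximal frequent itemsets over a sliding window concrete, the following is a minimal brute-force sketch in Python. It is not FPMFI-DS or FPMFI-DS+: it uses no FP-tree and no Subset Equivalence Pruning, and it recomputes everything on each query. The class name `SlidingWindowMFI`, the parameters `size` and `min_count`, and the helper names are assumptions made for illustration only.

```python
from collections import deque
from itertools import combinations


def frequent_itemsets(window, min_count):
    """Return {itemset: support count} for every itemset with count >= min_count."""
    items = sorted({i for t in window for i in t})
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for cand in combinations(items, size):
            c = frozenset(cand)
            count = sum(1 for t in window if c <= t)
            if count >= min_count:
                result[c] = count
                found = True
        if not found:
            # No frequent itemset of this size, so none of a larger size either.
            break
    return result


def maximal(freq):
    """Keep only the frequent itemsets that have no frequent proper superset."""
    return {s for s in freq if not any(s < other for other in freq)}


class SlidingWindowMFI:
    """Hypothetical helper: keeps the last `size` transactions of a stream."""

    def __init__(self, size, min_count):
        self.window = deque(maxlen=size)  # the oldest transaction drops out automatically
        self.min_count = min_count

    def add(self, transaction):
        self.window.append(frozenset(transaction))

    def maximal_itemsets(self):
        # Brute-force recomputation over the current window (illustration only).
        return maximal(frequent_itemsets(self.window, self.min_count))


w = SlidingWindowMFI(size=3, min_count=2)
for t in ["abc", "ab", "ac"]:
    w.add(t)
print(w.maximal_itemsets())  # the maximal frequent itemsets are {a, b} and {a, c}
```

     The sketch shows why the maximal itemsets form a compact summary: every frequent itemset in the window is a subset of one of them, although their individual supports are not retained.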
     (ii) An algorithm for mining closed frequent itemsets in data streams is proposed. The number of closed frequent itemsets lies between that of all frequent itemsets and that of maximal frequent itemsets, and the closed frequent itemsets preserve the support information of every frequent itemset. Mining closed frequent itemsets in a data stream is therefore efficient in time and space while keeping the information complete. The thesis presents FPCFI-DS, an algorithm that mines the closed frequent itemsets of a data stream sliding window at high speed within limited memory and maintains the exact closed frequent itemsets of the current window at any time. Experiments show that FPCFI-DS is significantly more efficient in time and space than Moment, a classic algorithm of the same kind.
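     The claim that closed frequent itemsets preserve all support information can be checked on a toy example. The Python sketch below is a brute-force illustration under assumed names (`frequent_itemsets`, `closed_itemsets`, `support_from_closed`); it is not FPCFI-DS and does not handle a sliding window. It recovers the support of any frequent itemset as the largest support among its closed supersets.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_count):
    """Return {itemset: support count} for every itemset with count >= min_count."""
    items = sorted({i for t in transactions for i in t})
    freq = {}
    for size in range(1, len(items) + 1):
        level = {}
        for cand in combinations(items, size):
            c = frozenset(cand)
            n = sum(1 for t in transactions if c <= t)
            if n >= min_count:
                level[c] = n
        if not level:
            break  # no frequent itemset of this size, so none larger
        freq.update(level)
    return freq


def closed_itemsets(freq):
    """Keep itemsets that have no proper superset with the same support."""
    return {s: n for s, n in freq.items()
            if not any(s < o and n == m for o, m in freq.items())}


def support_from_closed(itemset, closed):
    """Support of a frequent itemset is the maximum support of its closed supersets."""
    return max(n for s, n in closed.items() if itemset <= s)


data = [frozenset("abc"), frozenset("ab"), frozenset("ac")]
freq = frequent_itemsets(data, min_count=2)
closed = closed_itemsets(freq)
# Every frequent itemset's support can be rebuilt from the closed ones alone.
print(all(support_from_closed(s, closed) == n for s, n in freq.items()))
```

     On this data there are five frequent itemsets, three closed ones, and two maximal ones, which matches the ordering of set sizes described above.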
     (iii) An algorithm for mining Top-K most-frequent itemsets in data streams is proposed. An advantage of Top-K most-frequent itemset mining is that the user does not need to specify a minimum support threshold, only the number k of itemsets to find. Existing Top-K most-frequent itemset mining algorithms have the problems of an excessive number of initial items and a high initial border support. To address these problems, the thesis first presents MTKFP, an efficient algorithm based on mixed search that combines breadth-first and depth-first search to mine Top-K most-frequent itemsets. Building on MTKFP, it then presents MTKFP-DS, a data stream algorithm for mining Top-K most-frequent itemsets based on the Chernoff inequality. Experiments show that MTKFP obtains at least 70 percent fewer initial items than existing algorithms, while its initial border support is higher, so that it outperforms the best existing algorithm by more than 100 percent; they also show that MTKFP-DS is suitable for mining data streams.
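     The idea of mining without a user-specified minimum support can be sketched as follows. This Python example is not MTKFP or MTKFP-DS: it uses a plain depth-first search over a static list of transactions and no Chernoff-based estimation. The function name `top_k_itemsets` is an assumption. A min-heap of size k holds the best itemsets found so far, and its smallest count plays the role of a border support that only rises as the search proceeds.

```python
import heapq


def top_k_itemsets(transactions, k):
    """Return the k itemsets with the highest support as (count, items) pairs."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    heap = []  # min-heap of (count, itemset as a sorted tuple)

    def border():
        # Until k itemsets are known, any itemset that occurs at least once qualifies.
        return heap[0][0] if len(heap) == k else 1

    def extend(prefix, start, covering):
        for idx in range(start, len(items)):
            item = items[idx]
            cov = [t for t in covering if item in t]
            count = len(cov)
            if count < border():
                continue  # supersets cannot have higher support, so prune this branch
            cand = prefix + (item,)
            if len(heap) < k:
                heapq.heappush(heap, (count, cand))
            elif count > heap[0][0]:
                heapq.heapreplace(heap, (count, cand))
            extend(cand, idx + 1, cov)

    extend((), 0, transactions)
    return sorted(heap, key=lambda e: (-e[0], e[1]))


stream = ["abc", "ab", "ab", "ac", "a"]
print(top_k_itemsets(stream, 3))
# → [(5, ('a',)), (3, ('a', 'b')), (3, ('b',))]
```

     When several itemsets tie at the border count, which of them is kept depends on the search order; the returned counts are still the k highest.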
     (iv) A data stream classification algorithm based on closed frequent itemsets is proposed. Compared with some traditional classification algorithms, classification based on association rules achieves higher accuracy. Such algorithms usually generate class association rules from frequent itemsets, but frequent itemset mining is prone to combinatorial explosion, which hurts efficiency; data streams pose further challenges for classification algorithms. To address these problems, the thesis proposes CBC-DS, an efficient data stream classification algorithm based on closed frequent itemsets, which comprises an efficient one-pass algorithm for mining closed frequent itemsets and an effective method for building the classifier. Experiments show that the average classification accuracy of CBC-DS is about 1.09 percent higher than that of the classic algorithm CMAR, and that CBC-DS classifies faster than CMAR.
     (v) Clustering algorithms for high-dimensional data streams based on Top-K frequent patterns are proposed. Clustering high-dimensional data is a difficult problem in clustering research. A method combining density-based and grid-based clustering handles it well, and the key to that method is finding dense units. The traditional approach finds dense units by mining frequent itemsets, which has two drawbacks: the user must specify a minimum density threshold, and, because one threshold is applied to all subspaces, units in sparse subspaces cannot be identified as dense. The thesis therefore presents three clustering algorithms for high-dimensional data streams, based on Top-K most-frequent itemsets, N-most interesting itemsets, and Top-K frequent items, respectively. None of them requires a user-specified minimum density threshold. The second favors finding dense units within groups of subspaces of a given dimensionality, and the third favors finding dense units within a specific subspace, which addresses the discovery of dense units in sparse subspaces. Experiments show that the proposed algorithms are suitable for clustering high-dimensional data streams.
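     As a rough illustration of finding dense units without a density threshold, the Python sketch below bins points into equal-width grid cells and returns the k most populated cells among all subspaces of a chosen dimensionality. It is not one of the thesis algorithms and works on a static list of points rather than a stream. The function names, the equal-width binning, and the `width` parameter are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def cell_of(point, dims, width):
    """Grid cell of `point` in the subspace spanned by `dims` (equal-width bins)."""
    return tuple((d, int(point[d] // width)) for d in dims)


def top_k_dense_units(points, k, subspace_dim, width=1.0):
    """Return the k most populated grid cells over all subspaces of `subspace_dim` dimensions."""
    n_dims = len(points[0])
    counts = Counter()
    for dims in combinations(range(n_dims), subspace_dim):
        for p in points:
            counts[cell_of(p, dims, width)] += 1
    # Rank cells by population instead of comparing against a fixed density threshold.
    return counts.most_common(k)


points = [(0.1, 0.2, 5.5), (0.3, 0.4, 7.1), (0.5, 0.9, 9.9), (3.2, 3.3, 0.0)]
print(top_k_dense_units(points, k=1, subspace_dim=2))
# → [(((0, 0), (1, 0)), 3)]
```

     Each cell is written as a tuple of (dimension, bin index) pairs, which corresponds to an itemset whose items are dimension-interval pairs; ranking within one dimensionality is the part that avoids comparing cells from subspaces of different sizes against a single threshold.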
     (vi) The application of data stream mining in simulation is studied. An application framework for simulation based on data stream mining is proposed. To allow data stream mining algorithms to be integrated quickly into simulation systems built on the HLA architecture, general-purpose data stream mining components and a general data stream mining federate are designed following a modular development approach, which improves the reusability of the algorithms. Finally, the design of the general association rule mining federate is described, using the Missile-Breakthrough simulation system as an example.

引文

[1]邱晓钢,黄柯棣,黄健.先进分布仿真技术基础.国防科技大学机电工程与自动化学院HLA技术资料, 2003.
    [2] L.Golab, M.T. ?zsu. Issues in Data Stream Management. ACM SIGMOD Record, 2003, 32(2): 5-14.
    [3] MM.Gaber, A.Zaslavsky, S.Krishnaswamy. Mining Data Streams: A Review. ACM SIGMOD Record, 2005, 34(2): 18-26.
    [4] B.Babcock, S.Babu, M.Datar, R.Motwani, J.Widom. Models and Issues in Data Streams. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. Madison: ACM Press, 2002, 1-16.
    [5] N.Jiang, L.Gruenwald. Research Issues in Data Stream Association Rule Mining. ACM SIGMOD Record, 2006, 35(1): 14-19.
    [6]潘云鹤,王金龙,徐从富.数据流频繁模式挖掘研究进展.自动化学报, 2006, 32(4):594-602.
    [7]王涛,李舟军,颜跃进,陈火旺.数据流挖掘分类技术综述.计算机研究与发展, 2007, 44(11):l809-1815.
    [8]刘学军,徐宏炳,董逸生等.数据流管理技术.计算机科学, 2005, 32(4):6-10.
    [9]张玲东,毛宇光,曹晨光,宋卫东.数据流管理系统研究与进展.计算机应用研究, 2005, 22 (6):12-15.
    [10] R.Motwani, J.Widom, A.Arasu, et al. Query Processing, Approximation, and Resource Management in a Data Stream Management System. In: Proceedings of Conference on Innovative Data Systems Research (CIDR), 2003, 245-256.
    [11] M.Cherniack, H.Balakrishnan, M.Balazinska, et al. Scalable Distributed Stream Processing. In: Proceedings of Conference on Innovative Data Systems Research (CIDR), 2003.
    [12] S.Chandrasekaran, MJ.Franklin. PSoup: A System for Streaming Queries over Streaming Data. The VLDB Journal, 2003, 12(2):140-156.
    [13] S.Chandrasekaran, O.Cooper, A.Deshpande, et al. TelegraphCQ: Continuous Dataflow Processing for an Uncertain World. In: Proceedings of Conference on Innovative Data Systems Research (CIDR), 2003, 269-280.
    [14] Y.Zhu, D.Shasha. StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time. In: Proceedings of the 28th International Conference on Very Large Data Bases. Hong Kong: Morgan Kaufmann, 2002, 358-369.
    [15] L.Liu, C.Pu, W.Tang. Continual Queries for Internet Scale Event-driven Information Delivery. IEEE Trans. on Knowledge and Data Engineering, Aug. 1999, 11(4):583-590.
    [16] J.Chen, D.J.DeWitt, F.Tian, Y.Wang. NiagraCQ: A Scalable Continuous Query System for Internet Databases. In: Proceedings of the 2000 ACM SIGMOD Intl. Conf. on Management of Data, May 2000, 379-390.
    [17] D.Terry, D.Goldberg, D.Nichols, and B.Oki. Continuous Queries over Append-only Databases. In: Proceedings of the 1992 ACM SIGMOD international conference on Management of data, San Diego, California, June 1992, 321-330.
    [18]王永利,徐宏炳,董逸生,钱江波,刘学军.配电自动化的数据流管理系统设计.电力系统自动化, 2004, 28(13):85-91.
    [19]何轶璇,罗毅,涂光瑜. EMS数据流管理系的框架设计.电力系统自动化,. 2006, 30(24):33-39.
    [20] M.Halatchev, L.Gruenwald. Estimating Missing Values in Related Sensor Data Streams. In: Proceedings of the Eleventh International Conference on Management of Data, 2005, 83-94.
    [21] ED.Demaine, A.Lpez-Ortiz, JI.Munro. Frequency Estimation of Internet Packet Streams with Limited Space. In: Proceedings of the 10th Annual European Symposium on Algorithms. September 2002, 348-360.
    [22] YD.Cai, D.Clutter, G.Pape, J.Han, M.Welge, L.Auvil. MAIDS: Mining Alarming Incidents from Data Streams. In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France, June 2004, 919-920.
    [23] H.Kargupta, R.Bhargava, K.Liu, M.Powers. VEDAS: A Mobile and Distributed Data Stream Mining System for Real-Time Vehicle Monitoring. In: Proceedings of SIAM International Conference on Data Mining, Madison: ACM Press, 2004.
    [24] J.Han, H.Cheng, D.Xin, X.Yan. Frequent Pattern Mining: Current Status and Future Directions. Data Mining and Knowledge Discovery, 2007, 55-86.
    [25] R.Agrawal, T.Imielinski, A.N.Swamy. Mining Assocaition Rules between Sets of Items in Large Databases. In: Proceedings of 1993 ACM SIGMOD International Conference on Management of Data, Washington, D.C. USA, 1993, 207-216.
    [26] J.Han, M.Kamber著,范明,孟小峰译.数据挖掘概念与技术[M].北京机械工业出版社. 2001.
    [27]黄柯棣等.系统仿真技术[M].国防科技大学出版社, 1998.
    [28]熊光楞等.先进仿真技术与仿真环境[M].国防工业出版社, 1997.
    [29] J.Steinman, D.Hardy. Evolution of the Standard Simulation Architecture[C]. In: Proceedings of the 2004 Spring Simulation Interoperability Workshop, paper 2004-SIW-100, 2004.
    [30]颜跃进.最大频繁项集挖掘算法的研究[Ph.D Thesis].长沙:国防科技大学计算机学院, 2005.
    [31]王涛.数据流挖掘分类方法关键技术研究[Ph.D Thesis].长沙:国防科技大学计算机学院, 2007.
    [32]郝建国,黄健,黄柯棣. HLA联邦数据收集的研究与实现.计算机仿真, 2002, 19(1):38-43.
    [33]李国和,何红梅,赵沁平.基于HLA的分布交互仿真数据收集系统的研究.计算机科学, 2000, 27(10):12-16.
    [34]陈彬. HLA仿真系统的联邦观测方法研究及工具设计.长沙:国防科技大学硕士论文. 2005.
    [35] MK. Painter, M.Erraguntla, GL.Hogg, B.Beachkofski. Using Simulation, Data Mining, and Knowledge Discovery Techniques for Optimized Aircraft Engine Fleet Management. In: Proceedings of the 2006 Winter Simulation Conference. Monterey, California, 2006, 1253-1260.
    [36] M.Remondino, G.Correndo. Data Mining Applied to Agent based Simulation. In: Proceedings 19th European Conference on Modelling and Simulation, 2005.
    [37]张文明,薛青.粗糙集方法在作战仿真数据挖掘中的应用.系统仿真学报. 2006, 18(2):179-181.
    [38] C. Morbitzer, P. Strachan, C. Simpson. Application of Data Mining Techniques for Building Simulation Performance Prediction analysis. In: Proceedings of the 8th International Building Performance Simulation Association Conference, Eindhoven, Nehterlands. August 11-14, 2003.
    [39] S.Bachinsky, G.Tarbox, E.Powell. Data Collection in an HLA Environment, In: Proceedings of the Spring Simulation Interoperability Workshop, Orlando, FL, 1997.
    [40] G.Abdulla, T.Critchlow, W.Arrighi. Simulation Data as Data Streams. ACM SIGMOD Record, 2004, 33(1): 89-94.
    [41]方伍元,陆介平,轩志远.基于相关性精简关联规则生成算法.江苏科技大学学报, 2007, 21(1):56-60.
    [42]郭俊芳,谢益武,周生宝.关联规则相关性的度量.计算机应用, 2007, 27(4):892-894.
    [43]罗可,吴杰.怎样获得有效的关联规则.小型微型计算机系统, 2002, 23(6):1711-1733.
    [44]伊卫国,卫金茂,王名扬.关联规则挖掘方法的改进.东北师大学报, 2006, 38(2):15-19.
    [45]任亚洲.频繁项集挖掘算法综述.开发研究与设计技术. 2007, 1066-1069.
    [46] HF.Li, SY.Lee, MK.Shan. An Efficient Algorithm for Mining Frequent Itemests over the Entire History of Data Streams. In: Proceedings of 1st International Workshop on Knowledge Discovery in Data Streams, 2004.
    [47] CC. Aggarwal, PS.Yu. Online Generation of Association Rules. In: Proceedings of 14th International Conference on Data Engineering (ICDE'98), Orlando, FL, USA, 1998, 402-411.
    [48] GS.Manku, R.Motwani. Approximate Frequency Counts over Data Streams. In: Proceedings of the 28th International Conference on Very Large Data Bases. Hong Kong, Morgan Kaufmann, 2002, 346-357.
    [49] C.Giannella, J.Han, J.Pei, X.Yan, PS.Yu. Mining Frequent Patterns in Data Streams at Multiple Time Granularities.Data Mining: Next Generation Challenges and Future Directions, 2003, 191-212.
    [50] C. Giannella, J. Han, E. Robertson, C. Liu. Mining Frequent Itemsets over Arbitrary Time Intervals in Data Streams. In: Technical Report TR587, Indiana University, 2003.
    [51] JH.Chang, WS.Lee. estWin: Adaptively Monitoring the Recent Change of Frequent Itemsets over Online Data Streams. In: Proceedings of the twelfth International Conference on Information and Knowledge Management, New Orleans, USA: ACM Press, 2003, 536-539.
    [52] H.Li, S.Lee, M.Shan. An Efficient Algorithm for Mining Frequent Itemsets over the Entire History of Data Streams. In: Proceedings of the first International Workshop on Knowledge Discovery in Data Streams. Pisa, Italy, 2004.
    [53]张昕,李晓光,王大玲等.数据流中一种快速启发式频繁模式挖掘方法.软件学报, 2005, 16(12):2099-2105.
    [54] L.Jia, C.Zhou, Z.Wang, X.Xu. SuffixMiner: Efficiently Mining Frequent Itemsets in Data Streams by Suffix-Forest. In: Proceedings of Fuzzy Systems and Knowledge Discovery, Changsha, China, 2005, 592-595.
    [55]刘学军,徐宏炳,董逸生,王永利,钱江波.挖掘数据流中的频繁模式.计算机研究与发展. 2005, 42(12): 2192-2198.
    [56] J.Cheng, Y.Ke, W.Ng. Maintaining Frequent Itemsets over High-Speed Data Streams. In: Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD2006), Singapore, 2006, 462-467.
    [57] D.Lee, W.Lee. Finding Maximal Frequent Itemsets over Online Data Streams Adaptively. In: Proceedings of the Fifth IEEE International Conference on Data Mining, Houston, 2005, 266-273.
    [58] H.Li, S.Lee, M.Shan. Online Mining (Recently) Maximal Frequent Itemsets over Data Streams. In: Proceedings of the fifteenth International Workshops on Research Issues in Data Engineering: Stream Data Mining and Applications, Tokyo, Japan: IEEE Press, 2005, 11-18.
    [59] G.Mao, X.Wu, X.Zhu, et al. Mining Maximal Frequent Itemsets from Data Streams. Journal of Information Science, 2007, 33(3):251-262.
    [60] F.Ao, Y.Yan, J.Huang, K.Huang. Mining Maximal Frequent Itemsets in Data Streams Based on FP-Tree. In: Proceedings of the 5th Int. Conf. on Machine Learning and Data Mining (MLDM'2007), Leipzig, German, July, 2007, 479-489.
    [61] F.Ao, Y.Yan, J.Huang, K.Huang. A Novel Pruning Technique for Mining Maximal Frequent Itemsets. In: Proceedings of the 4th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD'07), Haikou, China, August, 2007, 469-473.
    [62] J.Wang, J.Han, J.Pei. CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets. In: Proceedings of the ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, USA: ACM Press, 2003, 236-245.
    [63] Y.Chi, H.Wang, PS.Yu, RR.Muntz. Moment: Maintaining Closed Frequent Itemsets over a Stream Sliding Window. In: Proceedings of the fourth IEEE International Conference on Data Mining, Brighton, UK: IEEE Press, 2004, 59-66.
    [64] H.Li, C.Ho, F.Kuo, S.Lee. A New Algorithm for Maintaining Closed Frequent Itemsets in Data Streams by Incremental Updates. In: Proceedings of Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), Hong Kong, December 18-22, 2006, 672-676.
    [65] N.Jiang, L.Gruenwald. CFI-Stream: Mining Closed Frequent Itemsets in Data Streams. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, August 20-23, 2006, 592-597.
    [66]刘旭,毛国君等.数据流中频繁闭项集的近似挖掘算法.电子学报, 2007, 35(5) :900-905.
    [67]刘学军,徐宏炳等.基于滑动窗口的数据流闭合频繁模式的挖掘.计算机研究与发展, 2006, 43(10):1738-1743.
    [68] J.Pei, G.Dong, W.Zou, J.Han. On Computing Condensed Frequent Pattern Bases. In: Proceedings of Second IEEE International Conference on Data Mining (ICDM'02), 2002, 378-385.
    [69] J.Pei, G.Dong, W.Zou, J.Han. Mining Condensed Frequent-Pattern Bases. Knowledge and Information Systems, 2004, 570-594.
    [70] G.Song, D.Yang, B.Cui, et al. CLAIM: An Efficient Method for Relaxed Frequent Closed Itemsets Mining over Stream Data. In: Proceedings of the 12th International Conference on Database Systems for Advanced Applicaitons (DASFAA), Bangkok, Thailand, 2007, 664-675.
    [71] J.Wang, J.Han, Y.Lu, P.Tzvetkov. TFP: An Efficient Algorithm for Mining Top-k Frequent Closed Itemsets. IEEE Trans. on Knowledge and Data Engineering,2005, 17(5):652-664.
    [72] S.Cong. Mining the Top-K Frequent Itemset with Minimum Length M [Master thesis], Simon Fraser University, 2001.
    [73] AWC.Fu, WK.Renfrew, J.Tang. Mining N-most Interesting Itemsets. In: Proceedings of the 12th International Symposium on Foundations of Intelligent Systems, October 11-14, 2000, 59-67.
    [74] YL.Cheung, AWC.Fu. Mining Frequent Itemsets without Support Threshold: With and without Item Constraints. IEEE Trans. on Knowledge and Data Engineering, September 2004 , 16(9):1052-1069.
    [75] SC.Ngan, T.Lam, RCW.Wong, AWC.Fu. Mining N-most Interesting Itemsets without Support Threshold by the COFI-tree. International Journal of Business Intelligence and Data Mining, 2005, 1(1):88-106.
    [76] Y.Hirate, E.Iwahashi, H.Yamana. TF2P-growth: An Efficient Algorithm for Mining Frequent Patterns without Any Thresholds. In: Proceedings of ICDM04. Brighton, UK, October, 2004.
    [77] T.M.Quang, S.Oyanagi, K.Yamazaki. ExMiner: An Efficient Algorithm for Mining Top-K Frequent Patterns. In: LNAI 4093, Springer-verlag, Berlin, Heidelberg, 2006, 436-447.
    [78] T.M.Quang, S.Oyanagi, K.Yamazaki. Mining the K-Most Interesting Frequent Patterns Sequentially. In: Proceedings of Intelligent Data Engineering and Automated Learning (IDEAL06), 2006, 620-628.
    [79] RCW.Wong, AWC.Fu. Mining Top-K Itemsets over a Sliding Window Based on Zipfian Distribution. In: Proceedings of 2005 SIAM International Conference on Data Mining, 2005.
    [80] RCW.Wong, AWC.Fu. Mining top-K Frequent Itemsets from Data Streams. Data Mining and Knowledge Discovery, 2006.
    [81] A.Manjhi, V.Shkapenyuk, K.Dhamdhere, C.Olston. Finding (Recently) Frequent Items in Distributed Data Streams. In: Proceedings of the 21st International Conference on Data Engineering, Washington, DC, USA, 2005, 767-778.
    [82] W.Feng, Q.Guo, Z.Zhang. Finding Hierarchical Frequent Items in Data Streams. In: Proceedings of the Sixth World Congress on Intelligent Control and Automation, 2006, 5972- 5976.
    [83]王伟平,李建中,张冬冬,郭龙江.一种有效的挖掘数据流近似频繁项算法.软件学报, 2005, 16(12):2099-2105.
    [84] A.Metwally, D.Agrawal, A.El.Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams. In: Proceedings of the 10th International Conference on Database Theory, 2005, 398-412.
    [85] B.Babcock, C.Olston. Distributed Top-k Monitoring. In: Proceedings of the 2003ACM SIGMOD International Conference on Management of Data, San Diego, California, 2003, 28-39.
    [86] J.Pei, J.Han, LVS.Lakshmanan. Mining Frequent Itemsets with Convertible Constraints. In: Proceedings of the 17th International Conference on Data Engineering, 2001, 433-442.
    [87]宋余庆,朱玉全,孙志挥,杨鹤标.一种基于频繁模式树的约束最大频繁项目集挖掘及其更新算法.计算机研究与发展. 2005, 42(5):777-783.
    [88]崔立新,苑森淼,赵春喜.约束性相联规则发现方法及算法.计算机学报, 2000, 23(2):216-220.
    [89]刘君强,孙晓莹,潘云鹤.关联规则挖掘方法研究的新进展.计算机科学, 2004, 31(1):110-113.
    [90] JS.Park, MS.Chen, PS.Yu. An Effective Hash-based Algorithm for Mining Association Rules. In: Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, California, United States,1995, 175-186.
    [91] A.Savasere, E.Omiecinski, S.Navathe. An Efficient Algorithm for Mining Association Rules in Large Databases. In: Proceedings of the 21th International Conference on Very Large Data Bases, 1995, 432-444.
    [92] H.Toivonen. Sampling Large Databases for Association Rules. In: Proceedings of the 22th International Conference on Very Large Data Bases, 1996, 134-145.
    [93] S.Brin, R.Motwani, JD.Ullman, S.Tsur. Dynamic Itemset Counting and Implication Rules for Market Basket Data. In: Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, United States, 1997, 255-264.
    [94] RC.Agarwal, CC.Aggarwal, VVV.Prasad. A Tree Projection Algorithm for Generation of Frequent Item Sets. Journal of Parallel and Distributed Computing, 2001, 350-371.
    [95] J.Hipp, U.Güntzer, G.Nakhaeizadeh. Mining Association Rules: Deriving a Superior Algorithm by Analyzing Today’s Approaches. In: Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, 2000, 159-168.
    [96] J.Hipp, A.Myka, R.Wirth, U.Guntzer. A New Algorithm for Faster Mining of Generalized Association Rules. In: Proceedings of the Second European Symposium on Principles of Data Mining and Knowledge Discovery, 1998, 74-82.
    [97] D.Burdick, M.Calimlim, J.Gehrke. MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases. In: Proceedings of the 17th International Conference on Data Engineering, 2001, 443-452.
    [98] S.Orlando, C.Lucchese, P.Palmerini, R.Perego, F. Silvestri.kDCI: AMulti-Strategy Algorithm for Mining Frequent Sets. In: Proceedings of IEEE ICDM’03 Workshop (FIMI’03), 2003.
    [99] C. Lucchese, S. Orlando, R. Perego. DCI Closed: A Fast and Memory Efficient Algorithm to Mine Frequent Closed Itemsets. In: Proceedings of IEEE ICDM’04 Workshop (FIMI’04), 2004.
    [100] M.Charikar, K.Chen, M.Farach-Colton. Finding Frequent Items in Data Streams. In: Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP), Malaga, Spain: Springer, 2002, 693-703.
    [101] ED.Demaine, A.Lopez-Ortiz, JI.Munro. Frequency Estimation of Internet Packet Streams with Limited Space. In: Proceedings of the 10th Annual European Symp. Rome: Springer-Verlag, 2002, 348-360.
    [102] C.Jin, W.Qian, C.Sha, JX.Yu, A.Zhou. Dynamically Maintaining Frequent Items over a Data Stream. In: Proceedings of the 2003 ACM CIKM International Conference on Information and Knowledge Management. New Orleans: ACM Press, 2003, 287-294.
    [103] RM.Karp, S.Shenker, CH.Papadimitriou. A Simple Algorithm for Finding Frequent Elements in Streams and Bags. ACM Transactions on Database Systems (TODS), 2003, 28(1): 51-55.
    [104] JH.Chang, WS.Lee. A Sliding Window Method for Finding Recently Frequent Itemsets over Online Data Streams. Journal of Information Science and Engineering, 2004, 20(4): 753-762.
    [105] CH.Lin, DY.Chiu, YH.Wu, ALP.Chen, T.Hsinchu. Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window. In: Proceedings of the fifth SIAM International on Data Mining, Newport Beach, USA, 2005.
    [106] A.Arasu, GS.Manku. Approximate Counts and Quantiles over Sliding Windows. In: Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Paris, France: ACM Press, 2004, 286-296.
    [107] JH.Chang, WS.Lee. Finding Recent Frequent Itemsets Adaptively over Online Data Streams. In: Proceedings of the ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, USA: ACM Press, 2003, 487-492.
    [108] L.Yang, M.Sanver. Mining Short Association Rules with One Database Scan. In: Proceedings of the International Conference on Information and Knowledge Engineering. Las Vegas, Nevada, USA: CSREA Press, 2004, 392-398.
    [109] R.Rymon. Search through Systematic Set Enumeration. In: Proceedings of Third Int'l Conf. on Principles of Knowledge Representation and Reasoning, 1992, 539-550.
    [110]蔡自兴,徐光祐.人工智能及其应用(第二版)[M].清华大学出版社. 1996.
    [111] A.Fiat, S.Shporer. AIM2: Improved Implementation of AIM. In: Proceedings of IEEE ICDM’04 Workshop FIMI’04, 2004.
    [112] B. Racz. nonordfp: An FP-growth Variation without Rebuilding the FP-tree. In: Proceedings of IEEE ICDM’04 Workshop FIMI’04, 2004.
    [113] G.Liu, H.Lu, JX.Yu, W.Wei, X.Xiao. AFOPT: An Efficient Implementation of Pattern Growth Approach. In: Proceedings of IEEE ICDM’03 Workshop FIMI’03, 2003.
    [114] G.Grahne, J.Zhu. Efficiently Using Prefix-trees in Mining Frequent Itemsets. In: Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations Melbourne, Florida, USA, November 19, 2003.
    [115] A.Pietracaprina, D. Zandolin. Mining Frequent Itemsets using Patricia Tries. In: Proceedings of IEEE ICDM’03 Workshop FIMI’03, 2003.
    [116] Y.Yan, Z.Li, H.Chen. Fast Mining Maximal Frequent ItemSets Based on FP-Tree. In: Proceedings of the 17th Australian Computer Society (ACS) Australian Joint Conference on Artificial Intelligence (AI 2004), Cairns Australia, December, 2004, 475-487.
    [117]路松峰,卢正鼎.快速开采最大频繁项目集.软件学报, 2001,12(2):293-297.
    [118]宋余庆,朱玉全,孙志挥,陈耿.基于FP-Tree的最大频繁项目集挖掘及更新算法.软件学报, 2003(9), 1586-1592.
    [119] J. Han, J. Pei, Y.Yin. Mining Frequent Patterns without Candidate Generation. In: Proceedings of the Special Interest Group on Management of Data, 2000, 1-12.
    [120] R.Bayardo. Efficiently Mining Long Patterns from Databases. In: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, Seattle, Washington, United States, 1998, 85-93.
    [121] D.Burdick, M.Calimlim, J.Gehrke. MAFIA: A Performance Study of Mining Maximal Frequent Itemsets. In: Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations Melbourne, Florida, USA, November 19, 2003.
    [122] K.Gouda, MJ.Zaki.Efficiently Mining Maximal Frequent Itemsets.In: Proceedings of the 2001 IEEE International Conference on Data Mining, San Jose, CA, USA, 2001, 163-170.
    [123]马志新,陈晓云,王雪,李龙杰.最大频繁项集挖掘中搜索空间的剪枝策略.清华大学学报, 2005, 45(9):1748-1752.
    [124] B. Goethals. The FIMI Repository, http://fimi.cs.helsinki.fi/, 2003.
    [125] R.Agrawal, R.Srikant. Fast Algorithms for Mining Association Rules. In: Proceedings of the 20th Intl. Conf. on Very Large Databases (VLDB’94), Santiago, Chile, Sept, 1994, 487-499.
    [126] T.Uno, T.Asai, Y.Uchida, H.Arimura. LCM: An Efficient Algorithm forEnumerating Frequent Closed Item Sets. In: Proceedings of IEEE ICDM’03 Workshop FIMI’03, 2003.
    [127] T.Uno, T.Asai, Y.Uchida, H.Arimura. LCM ver. 2: Efficient Mining Algorithms for Frequent/Closed/Maximal Itemsets. In: Proceedings of IEEE ICDM'04 Workshop FIMI'04, 2004.
    [128] T.Uno, T.Asai, Y.Uchida, H.Arimura. LCM ver. 3: Collaboration of Array, bitmap and Prefix Tree for Frequent Itemset Mining. In: Proceedings of the 1st International Workshop on Open Source Data Mining: Frequent Pattern Mining Implementations, 2005, 77-86.
    [129] H.Wang, W.Li, Z.Li, L.Fan. Finding Closed Itemsets in Data Streams. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, Philadelphia, PA, USA, 2006, 592-597.
    [130] J.Pei, J.Han, R.Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. In: Proceedings of 2000 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, May, 2000, 21-30.
    [131] J.Wang, G.Karypis. HARMONY: Efficiently Mining the Best Rules for Classification. In: Proceedings of 2005 SIAM conf. Data Mining (SDM’05), Newport Beach, CA, 2005, 205-216.
    [132] JR.Quinlan, RM.Cameron-Jones. FOIL: A Midterm Report. In: Proceedings of the European Conference on Machine Learning, Vienna, Austria, 1993, 3-20.
    [133] B.Liu, W.Hsu, Y.Ma. Integrating Classification and Association Rule Mining. In: Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, New York, NY, Aug. 1998, 80-86.
    [134] W.Li, J.Han, J.Pei. CMAR: Accurate and Efficient Classification based on Multiple Class-Association Rules. In: Proceedings of the 2001 IEEE International Conference on Data Mining, San Jose, CA, Nov. 2001, 369-376.
    [135] X.Yin, J.Han. CPAR: Classification Based on Predictive Association Rules. In: Proceedings of the Third SIAM International Conference on Data Mining (SDM’03), San Francisco, CA, 2003, 331-335.
    [136] F.Coenen, P.Leng. Obtaining Best Parameter Values for Accurate Classification. In: Proceedings of 5th IEEE International Conference on Data Mining, 2005, 597-600.
    [137]王鹏,吴晓晨等. CAPE--数据流上的基于频繁模式的分类算法.计算机研究与发展, 2004, 41(10):1677-1683.
    [138] Q. H. Xie. An Efficient Approach for Mining Concept-Drifting Data Streams [Master Thesis].
    [139] F.Thabtah. Rule Pruning in Associative Classification Mining. In: Proceedings of the IBIMA Conference, 2005.
    [140] GW.Snedecor, WG.Cochran. Statistical Methods, Eighth Edition, Iowa State University Press, 1989.
    [141] F.Ao, J.Du, Y.Yan, B.Liu, K.Huang. An Efficient Algorithm for Mining Closed Frequent Itemsets in Data Streams. In: Proceedings of CIT2008, Sydney, Australia, 2008.
    [142] F.Coenen. LUCS KDD Implementation of CMAR (Classification based on Multiple Association Rules). http://www.csc.liv.ac.uk/~frans/KDD/Software/CMAR/cmar.html, Department of Computer Science, The University of Liverpool, UK, 2004.
    [143] C.L.Blake, C.J.Merz. UCI Repository of Machine Learning Databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, Irvine, CA: University of California, Department of Information and Computer Science, 1998.
    [144] F.Coenen. LUCS-KDD DN Software, http://www.csc.liv.ac.uk/~frans/KDD/Software/LUCS_KDD_DN/, Department of Computer Science, The University of Liverpool, UK, 2003.
    [145] P.Domingos, G.Hulten, L.Spencer. Mining Time-changing Data Streams. In: Proceedings of the 7th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining. San Francisco: ACM Press, 2001, 97-106.
    [146] J.Wang, J.Han, Y.Lu, P.Tzvetkov. Mining Top-K Frequent Closed Patterns without Minimum Support. In: Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM'02), 2002, 211-221.
    [147] A.Pietracaprina, F.Vandin. Efficient Incremental Mining of Top-K Frequent Closed Itemsets. Discovery Science, 2007, 275-280.
    [148] Y.Lan, Y.Qiu. Efficient Algorithms of Mining Top-k Frequent Closed Itemsets. In: Proceedings of 8th International Conference on Electronic Measurement and Instruments, ICEMI '07, 2007.
    [149] W.Hoeffding. Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association, 1963, 58:13-30.
    [150] O.Maron, A.Moore. Hoeffding Races: Accelerating Model Selection Search for Classification and Function Approximation. Advances in Neural Information Processing Systems, Morgan Kaufmann, 1993, 59-66.
    [151] P.Domingos, G.Hulten. Mining High-Speed Data Streams. In: Proceedings of the Association for Computing Machinery Sixth International Conference on Knowledge Discovery and Data Mining, 2000.
    [152]周晓云,孙志挥,张柏礼,杨宜东.高维数据流子空间聚类发现及维护算法.计算机研究与发展. 2006, 43(5):834-840.
    [153]周晓云,孙志挥,张柏礼,杨宜东.高维数据流聚类及其演化分析研究.计算机研究与发展. 2006, 43(11):2005-2011.
    [154]颜晓龙,沈鸿.一种适用于高维数据流的子空间聚类方法.计算机应用. 2007, 27(7):1680-1684.
    [155] R.Kohavi, CE.Brodley, B.Frasca, L.Mason, Z.Zheng. KDD-Cup 2000 Organizers Report: Peeling the Onion. In: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, August 26-29, San Francisco, California, 2001, 401-406.
    [156] J.Yu, Z.Chong, H.Lu, A.Zhou. False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Streams. In: Proceedings of the Thirtieth International Conference on Very Large Data Bases. Toronto, Canada, 2004, 204-215.
    [157] R.Agrawal, J.Gehrke, D.Gunopulos, et al. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. In: Proceedings of the 1998 ACM SIGMOD Int’l Conf on Management of Data. New York: ACM Press, 1998, 94-105.
    [158] JA.Hartigan. Clustering Algorithms [M]. John Wiley & Sons, New York NY, 1975.
    [159] T.Zhang, R.Ramakrishnan, M.Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. ACM SIGMOD Record, 1996, 25(2):103-114.
    [160] M.Ester, HP.Kriegel, J.Sander, X.Xu. A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In: Proceedings of the 2nd ACM SIGKDD, Portland Oregon. 1996, 226-231.
    [161] W.Wang, J.Yang, R.Muntz. STING: A Statistical Information and Grid Approach to Spatial Data Mining. In: Proceedings of Twenty-Third International Conference on Very Large Data Bases. Athens, Greece, 1997, 186-195.
    [162] DH.Fisher. Knowledge Acquisition via Incremental Conceptual Clustering. Machine Learning. 1987, 2(2):139-172.
    [163] S.Guha, N.Mishra, R.Motwani. Clustering Data Streams: Theory and Practice. IEEE TKDE Special Issue on Clustering, 2003, 3(2):37-46.
    [164] CC.Aggarwal, J.Han, J.Wang, PS.Yu. A Framework for Clustering Evolving Data Streams. In: Proceedings of the 29th International Conference on Very Large Data Bases. San Francisco: Morgan Kaufmann, 2003, 81-92.
    [165] CC.Aggarwal, J.Han, J.Wang, PS.Yu. A Framework for Projected Clustering of High Dimensional Data Streams. In: Proceedings of the Thirtieth International Conference on Very Large Data Bases, San Francisco: Morgan Kaufmann, 2004, 852-863.
    [166] M.Donn. A Survey of Users of Thermal Simulation Programs. In: Proceedings of Building Simulation 97, Prague, 1997, 65-72.
    [167]鞠儒生.基于数据耕种与数据挖掘的系统效能评估方法研究[Ph.D Thesis].长沙国防科技大学机电工程与自动化学院. 2006.
    [168]薄涛,彭再求,刘秀罗,王正志,黄柯棣.基于模糊规则的双机格斗行为建模方法研究.系统仿真学报. 2002, 14(4):440-443.
    [169]黄柯棣,刘宝宏,黄健等.作战仿真技术综述.系统仿真学报, 2004, 16(9):1887-1894.
    [170]陈欣,胡晓惠,付勇,傅好华.基于XML的仿真想定标记语言SSML.系统仿真学报. 2004, 16(9):1928-1930.
    [171]杜静.流体系结构的编译技术研究—面向科学计算程序的编译优化[Ph.D Thesis].长沙国防科技大学计算机学院. 2008.
