流数据的层次聚类和频繁模式的挖掘算法研究

英文题名：The Research of Clustering and Frequent Pattern in Streaming Data
作者：屠强
论文级别：硕士
学科专业名称：模式识别与智能系统
中文关键词：流数据 ; 聚类 ; 频繁模式
英文关键词：streaming data ; clustering ; frequent pattern
学位年度：2011
导师：陆建峰
学科代码：081104
学位授予单位：南京理工大学
论文提交日期：2011-02-01

摘要

流数据的聚类或频繁模式挖掘要求仅扫描数据集一次,就得到聚类或者频繁模式挖掘的结果。本文主要研究如何提高流数据的聚类和频繁模式挖掘算法的精度,在文中我们提出了两个新的算法：基于密度的高精度流聚类算法Density-based High Precision Streaming-data Clustering(DHPSC)和FP-tree单遍扫描算法Single-Pass Scan FP(SPSFP)。在本文提出的DHPSC算法中,我们使用基于密度的凝聚层次聚类法。该方法使用凝聚层次聚类法作为算法框架,在这种框架下,核心问题就是如何合并两个簇。目前,许多的流数据聚类算法仅仅使用簇的中心点去代表整个簇,这种做法会导致不好的结果。通过细致的分析我们发现,在数据的单遍扫描过程中,簇内距离、簇间距离和方差都是可以精确计算的,从而保证聚类结果的精度。这样,我们可以使用新的基于密度的公式,做为簇间是否合并的标准。实验结果表明,新算法会节省时间和空间方面的开销,并取得较好结果。流数据的频繁模式挖掘方面,FP-growth算法是频繁模式挖掘中用于静态数据集的经典算法。但是FP-tree的创建需要扫描数据库两遍,在处理流数据方面收到了很大的限制,使用滑动窗口虽然能在一定程度上解决这一问题,但是依然会造成FP-tree生成时的不准确,影响到后续的挖掘。本文提出的SFSFP算法,单遍扫描数据集即可准确创建出FP-tree。与传统的FP-tree创建算法相比,本文算法仅扫描数据库一遍,并且不需要将整个数据集调入内存。该方法不仅节省了所占用的空间,而且使得准确挖掘流数据中的频繁模式成为了可能,它的时间耗费方面与传统方法相当。
The streaming data mining is required to scan the data set only once and get the result; this research mainly focuses on improving the accuracy in clustering and frequent pattern mining. In this paper, we present two new algorithm:Density-based High Precision Streaming-data Clustering (DHPSC) and Single-Pass Scan FP (SPSFP). The clustering framework of the proposed method is based on Agglomerative method. Under this framework, the core problem of clustering is how to carry out cluster merging. Currently, most of the streaming data clustering algorithms only use the center of the cluster to represent whole cluster when merging, which leads to unsatisfactory clustering result. After detailed analysis, we find that even if the data set is accessed only once, some statistical information, such as variance within clusters, the intra-cluster density, as well as the inter-cluster distance, can be calculated accurately, which can be helpful to cluster merging and outperform the center of cluster. Besides, a new criterion is proposed for cluster merging also, which considers the impact of density. The experimental results show that the proposed method costs less space and can achieve fairly good results. In frequent pattern mining FP-growth is a classical algorithm, which is often used in static data mining. However, the double-scan-of-database manner in FP-tree creation is a serious bottleneck in streaming data analysis. Sliding window technique could solve this problem in certain degree, but it still can lead to the inaccuracy of FP-tree creation, which may impact the consequent data mining. In this paper, a new FP-tree algorithm is presented for streaming data, which creates the FP-tree by a single-pass scanning (SPSFP) throughout the database. Compared with the traditional FP-tree creation, the new method scans the database only once and doesn't need to store the whole data set into memory, which not only saves the memory space but also makes it possible to mine frequent pattern accurately in streaming data. Furthermore, the time cost of the new algorithm is almost equivalent to the traditional one.

引文

[1]陈鹏.数据挖掘技术应用初探.电脑知识与技术,2010,33(6)：9604-9605.
    [2]工光宏,蒋平.数据挖掘综述.同济大学学报,2004,32(2).
    [3]Naisbitt, J.Megatrends. Ten New directions transforming our lives. New York:Warner Books,1982.16-17.
    [4]王鹏.数据流上的分类算法的研究.复旦大学,2007.
    [5]韩家伟.数据挖掘概念与技术.机械工业出版社,2006.
    [6]张金.数据挖掘技术在3G业务扩展中的研究与应用.湖南师范大学,2010.
    [7]佟强.科学数据网格中数据挖掘技术研究.中国科学院研究生院,2010.
    [8]王莉.数据挖掘中聚类方法的研究.天津大学,2003.
    [9]刘红岩,陈剑.数据挖掘中的数据分类算法综述.清华大学学报,2002,42(6)：727-730
    [10]房鹏杰,张索兰,张继福.基于概念格和条件信息熵的分类规则获取算法.计算机工程与应用,2010,64(14).
    [11]Daniel,Cosmin,Porumbel. An efficient algorithm for computing the distance between close partitions.Discrete Applied Mathematics.2011,159(1):53-59
    [12]Aime Ntwari,Abdellali Kelil. DNAc:A Clustering Method for Identifying Kinship Relations Between DNA Profiles Using a Novel Similarity Measure.Journal of Forensic Sciences,2011,56(1):S17-S22.
    [13]张学茂.关联规则挖掘研究.长沙理工大学,2006.
    [14]O'Callaghan L., Mishra N. Streaming-Data Algorithm for High-Quality Clustering. Data Engineering (ICDE'02).2002:0685.
    [15]单世民.基于网格和密度的数据流聚类方法研究.大连理工大学,2006.
    [16]朱蔚恒,印鉴,谢益煌.基于数据流的任意形状聚类算法.软件学报,,2006,17(3)：379-387.
    [17]L. Golab, M.Tamer. Issues in data stream management. ACM SIGMOD,2003, 32(2):5-14.
    [18]刘学军.数据流聚集查询和频繁模式挖掘的研究.东南大学,2006.
    [19]黄惠宁,刘源璋,梁昭阳.多传感器数据融合技术概述.科技信息.2010,5.
    [20]胡侃,刘云生.传感器网络中写作实施数据库事务的提交控制.计算机学报,2007,6.
    [21]刘佳,张芳,刘国华,刘琳.基于流数据技术的信息检测系统的研究与设计.计算机工程,2007,5.
    [22]M. Sullivan, A.Heybey. Tribeea:A System for Managing Large Databases of Network. USENIX Annual Technical Conf.,1998.
    [23]Y.Zhu, D.Shasha. Statistical Monitoring of Thousands of Data Streams. Very Large Data Bases,2002.358-369.
    [24]A.Arasu, S.Babu, J.Widom. An Abstract Semantics and Concrete Language for Continuous Queries over Streams and Relations. Stanford University Technical Report, 2002.
    [25]C.Cortes, K.Fisher. Haneoek:A Language for Extracting Signatures from Data Streams. Knowledge Discovery and DataMining,2000.9-17.
    [26]Henzinger, M.R., Raghavan, P., Rajagopalan, S. Computing on data streams. DIMACS:1998,107-118.
    [27]Babcock, B., Babu, S., Datar, M. Models and issues in data streams. Principles of Database Systems.2002,1-16.
    [28]Dong, G.Z., Han J.W., Laks, V.S. Online mining of changes from data streams: Research problems and preliminary results. Management and Processing of Data Streams:2003,225-236.
    [29]曹峰.数据流聚类分析算法.复旦大学,2006.
    [30]赵恒.数据挖掘中聚类若干问题研究.西安电子科技大学,2005.
    [31]M.Ankerst. OPTICS:Ordering Points to Identify Clustering Structure.Management of Data. Philadelphia:ACM,1999.49-60.
    [32]Fisher.D. Knowledge acquisition via incremental conceptual clustering. Machine Learning,1987.139-172.
    [33]Zhu W.H., Yin J., Xie Y.H. Arbitrary shape cluster algorithm for clustering data stream[J]. Journal of Software,2006,17(3):379-387.
    [34]Barbara D. Requirements for clustering data streams. ACM SIGKDD Explorations Newsletter,2003,3(2):23-27.
    [35]常建龙,曹峰,周傲英.基于滑动窗口的进化数据流聚类[J].软件学报,2007,18(4)：905-918.
    [36]M.Charikar, L.O'Callaghan, and R.Panigrahy. Better streaming algorithms for clustering problems. STOC:2003.30-39.
    [37]L.O.Chalaghan, N.Mishra. Streaming data algorithm for hith-quality clustering[C]. ICDE:2002.685-694.
    [38]P.Domingos, G.Hulton. A general method for scaling up machine learning algorithms and its application to clustering[C]. ICML:2001.106-113.
    [39]Aggarwal CC, Han J. A framework for clustering evolving data streams. Very Large Data Bases. Berlin:Morgan Kaufmann.2003.81-92.
    [40]徐辉增.关联规则数据挖掘方法的研究.中国石油大学,2009.
    [41]朱涛.基于FP-growth关联规则挖掘算法的研究与应用.南昌大学,2010.
    [42]R. Agrawal, R. Srikant. Fast Algorithms for Mining Association Rules[C].International Conference:1994.
    [43]J.W. Han, J. Pei. Mining frequent patterns by pattern-growth:methodology and implications. ACM SIGKDD Explorations Newletter:2000.14-20.
    [44]吴仁堂.基于关联规则的数据挖掘算法研究.内蒙古农业大学,2010.
    [45]严澄,胡天磊MARSW:一种高效的基于活动窗口数据流关联规则挖掘方法[J].计算机研究与发展,2009：413-419.
    [46]C. Giannella, J.W. Han. Mining frequent patterns in data streams at multiple time granularities. Data Mining:Next Generation Challenges and Future Directions: AAAI/MIT,2004 (6).
    [47]J.L. Koh, S.N. Shin. An approximate approach for mining recently frequent itemsets from data streams. DaWaK:2006.352-362.
    [48]M. Seno, G. Karypis. Finding frequent patterns using length-decreasing support constraints. Data Mining and Knowledge Discovery,10(2005):197-228.
    [49]Cao, F., Ester, M., Qian, W. Density-based clustering over an evolving data stream with noise. Data Ming.2006.328-339.
    [50]Lu, J.F, Tang, J.B., Tang, Z.M.,Yang,J.Y. Hierarchical initialization approach for K-Means clustering. Pattern Recognition Letters:2008.787-795.
    [51]Ankerst, M., Breunig, M.M., Kriegel, H.P. OPTICS:Ordering Points To Identify the Clustering Structure. Management of Data, Philadelphia. PA.1999.
    [52]Wu, S., Tommy W.S. Clustering of the self-organizing map using a clustering validity index based on inter-cluster and intra-cluster density. Pattern Recognition:2004.175-188.
    [53]Chen, Y.X., Tu, L. Density-Based Clustering for Real-Time Stream Data. Knowledge Discovery and Data Mining.2007.133-142.
    [54]S.K. Tanbeer, C.F. Ahmed. Sliding window-based frequent pattern mining over data streams. Information Science,2009(179):3843-3865.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700