Data stream clustering by divide and conquer approach based on vector model


Authors: Madjid Khalilian; Norwati Mustapha; Nasir Sulaiman
Keywords: Data mining; Data stream clustering; Vector space model; Divide-and-conquer
Journal: Journal of Big Data
Publication year: 2016
Publication date: December 2016
Volume: 3
Issue: 1
Full-text size: 2,608 KB
Author affiliations: Madjid Khalilian (1)
Norwati Mustapha (2)
Nasir Sulaiman (2)

1. Islamic Azad University, Karaj Branch, Karaj, Iran
2. Faculty of Computer Science and Information Technology, UPM University, Serdang, Malaysia
Journal category: Database Management; Information Storage and Retrieval; Data Mining and Knowledge Discovery; Computa
Journal subjects: Database Management; Information Storage and Retrieval; Data Mining and Knowledge Discovery; Computational Science and Engineering; Mathematical Applications in Computer Science; Communications Engine
Publisher: Springer International Publishing
ISSN: 2196-1115

Abstract

Recently, many researchers have focused on data stream processing as an efficient method for extracting knowledge from big data. Data stream clustering is an unsupervised approach employed for huge datasets. Continuing work on data stream clustering methods shares one common goal: an accurate clustering algorithm. However, previous work proposing data stream clustering solutions has overlooked several issues: (1) clustering datasets that include big segments of repetitive data, (2) monitoring the clustering structure of ordinal data streams, and (3) determining important parameters such as k, the exact number of clusters in a stream of data. In this paper, the DCSTREAM method is proposed to address these issues and to cluster big datasets using the vector model and a k-Means divide-and-conquer approach. Experimental results show that DCSTREAM achieves superior quality and performance compared with the STREAM and ConStream methods on real-world datasets exhibiting abrupt and gradual change. The results also show that batch processing in DCSTREAM and ConStream is time-consuming compared with STREAM, but it avoids further analysis for detecting outliers and novel micro-clusters.
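The abstract names a k-Means divide-and-conquer approach without giving its details. As a rough illustration of the general idea only, the sketch below clusters a stream chunk by chunk, keeps the weighted chunk centers as a summary, and then clusters those summaries. It is not the DCSTREAM algorithm; the function names (`weighted_kmeans`, `divide_and_conquer_kmeans`), the `chunk_size` parameter, and the plain Lloyd iteration are all assumptions made for this example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_kmeans(points, weights, k, iters=20, seed=0):
    """Lloyd's k-means on weighted points (hypothetical helper).

    Returns (centers, totals), where totals[j] is the weight assigned
    to center j during the last assignment step.
    """
    rng = random.Random(seed)
    k = min(k, len(points))
    dim = len(points[0])
    centers = [list(p) for p in rng.sample(points, k)]
    totals = [0.0] * k
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        totals = [0.0] * k
        # Assignment step: each point adds its weight to the nearest center.
        for p, w in zip(points, weights):
            j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            totals[j] += w
            for d, x in enumerate(p):
                sums[j][d] += w * x
        # Update step: move each non-empty center to its weighted mean.
        for j in range(k):
            if totals[j] > 0:
                centers[j] = [s / totals[j] for s in sums[j]]
    return centers, totals


def divide_and_conquer_kmeans(stream, k, chunk_size):
    """Cluster a stream chunk by chunk, then cluster the chunk centers."""
    summaries, summary_weights = [], []
    chunk = []

    def flush():
        # Divide: reduce the current chunk to at most k weighted centers.
        if not chunk:
            return
        centers, totals = weighted_kmeans(chunk, [1.0] * len(chunk), k)
        for c, w in zip(centers, totals):
            if w > 0:
                summaries.append(c)
                summary_weights.append(w)
        chunk.clear()

    for point in stream:
        chunk.append(tuple(point))
        if len(chunk) == chunk_size:
            flush()
    flush()  # handle a final partial chunk

    # Conquer: cluster the weighted summaries into the final k centers.
    final_centers, _ = weighted_kmeans(summaries, summary_weights, k)
    return final_centers
```

Only one chunk plus the accumulated summaries is held in memory at a time, which is the usual motivation for this pattern on streams. Here `k` is supplied by the caller, whereas the abstract lists determining k as one of the issues the paper addresses.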
