基于动态网格的数据流聚类算法及其应用研究

英文题名：Research on Algorithm and Application of Dynamic Grid-based Clustering over Data Stream
作者：何勇
论文级别：硕士
学科专业名称：项目管理
中文关键词：数据流 ; 聚类分析 ; 动态网格 ; 网络异常 ; 异常检测系统
英文关键词：Data Stream ; Clustering Analysis ; Dynamic Grid ; Network Anomaly ; Anomaly Intrusion Detection System
学位年度：2008
导师：刘青宝 ; 陆昌辉
学科代码：1201
学位授予单位：国防科学技术大学
论文提交日期：2008-04-01

摘要

近来,随着计算机通信技术和传感器网络技术的飞速发展,人们遇到大量无法用数据库进行存储的数据,这些数据是高速的、连续的、动态的、变化的、无限的,对它们的访问只能是顺序的、一次或有限次的,对它们的存储也只能是动态的、概要的。为了有效处理这些数据,人们提出了数据流模型。数据流模型已经引起了人们极大的关注。
     作为传统聚类方法在数据流环境下的延伸,数据流聚类方法已成为数据流模型研究的重点方向之一。本文通过对传统数据聚类方法和已有的数据流聚类方法进行研究分析,发现基于密度的方法可以发现任意形状的类,但其运算时间复杂度比较大并且不适合于发现分布情况不同的类。而基于网格的方法具有较高的运算速度,但牺牲了聚类的质量。因此,在研究和改进上述两类算法的基础上,结合数据流的特点,本文提出了一个基于动态网格的数据流聚类算法,主要对网格密度计算方式、参数的自适应设置和聚类顺序等进行了改进和创新,并通过实验数据和真实数据验证了算法的正确性和有效性。
     将数据流聚类分析方法用于网络异常检测是一个有趣的尝试。网络在给人们带来巨大方便的同时,也会给人们带来巨大的不便,层出不穷的入侵技术和手段经常会给网络带来毁灭性的灾难,而传统的异常入侵检测技术在扩展性和适应性上已不能应付越来越复杂的攻击方式,因此许多其他领域的知识被引入。最近,聚类分析方法由于能够发现未知的模式、自动实时更新异常入侵检测规则库而得到了高度重视。本文在Snort系统基础上构建了一个基于聚类分析方法的异常检测系统,最后通过实验证明该系统用于网络异常检测是有效的。
Recently, with the rapidly development of communication technology and sensor network, people faced large volume of data which is beyond the storage in database, these data is high-speed, continuous, dynamic, variable and infinity. We can visit them only once or limited times, and the storage is dynamic and synoptic. To dealing with it, the data stream model has been wildly concerned by researchers.
     As the extension of traditional clustering, data stream clustering has been one of the most important directions of data stream research. In this paper, Firstly, the traditional clustering methods have been analyzed, It finds that methods based density can discover clusters of arbitrary shape but the running time is horrible and it can't find clusters in the data sets which have different distribute. While methods based grid have high running speed, but the quality of the results is not good. Secondly, on the basic of researching and improving the existing traditional clustering methods, we have presented a new dynamic grid-based clustering algorithm which major works are summarized as follows: improving the method of density calculating, setting parameter automatically and clustering orders. Lastly experiment results show the correctness and effectiveness of algorithm.
     It's an interesting attempt to detect network anomaly with data stream clustering. The network brings us not only a large amount of convenience, but also inconvenience. The network is often destructive by various intrusion technique and measures. Nevertheless the traditional anomaly intrusion detection technique couldn't deal with the more and more complex attacks in the field of expansibility and adaptability, we need use other scopes of technology. Recently, clustering analytical method has been paid more attention since it can discover some unknown patterns and update the rules of anomaly intrusion detection in real-time. This paper establishes an anomaly intrusion detection system which is based on clustering analytical method and is built on Snort system. Finally it proves this system is effective by the experiment results.

引文

[1]Papadimitriou S,Sun J,Faloutsos C,Streaming Pattern Discovery in Multiple Time-Series[C],in:Proceedings of the 31~(st) VLDB Conf,2005,pp.697-708.
    [2]Pang-Ning Tan,Michael Steinbach,Vipin kumar著.数据挖掘导论[M].范明,范宏建译.北京:人民邮电出版社,2006.05.
    [3]Don Carney,Ugur Cetintemel,Mitch Chemiack,Monitoring Streams-a New Class of Data Management Applications[C],in:Proceedings of the 28~(th) VLDB Conference,China,Hong Kong,2002,pp.1-10.
    [4]Henzinger M R,Raghavan P,Rajagopalan S,Computing on Data Streams[D],SRC Technical Note 1998-011,Digital systems research center,Palo Alto,California 1998.
    [5]刘青宝,非精确多维数据建模技术研究[D],长沙:国防科学技术大学,工学博士学位论文,2006.10
    [6]Jiawei Han,Micheline Kamber著.数据挖掘:概念与技术(第2版)[M].范明,孟小峰译.北京:机械工业出版社,2007.03.
    [7]Yunyue Zhu,Dennis Shasha Statstream:Statistical Monitoring of Thousands of Data Streams in Real Time[C],in:Proceedings of VLDB,2002,pp.358-369.
    [8]P.B.Gibbons,Y.Matias,Synopsis Data Structures[C],in:Proceedings of SODA,1999,pp.909-910.
    [9]Brian Babcock,Shivnath Babu,Mayur Datar,Rajeev Motwani,Jennifer Widom,Models and Issues in Data Stream Systems[C],in:Proceedings of the 21~(th) ACM Symposium on Principles of Database Systems,Madison,Wisconsin,USA,2002,pp.1-16.
    [10]L.Qiao,D.Agrawal,A.E.Abbadi,Rhist:Adaptive Summarization over Continuous Data Streams[C],in:Proceedings of the Eleventh International Conference on Information and Knowledge Management,ACM Press,New York,2002,pp.469-476.
    [11]Francesco Buccafurri,Gianluca Lax,Fast Range Query Estimation by N-Level Tree Histograms[J],Data & Knowledge Engineering,2004,vol.51(2),pp.257-275.
    [12]Gilbert AC,Kotidis Y,Muthukrishnan S,Strauss MJ,Surfing Wavelets on Streams:One-Pass Summaries for Approximate Aggregate Queries[C],in:Apers PMG,Atzeni P,Ceri S,Paraboschi S,Ramamohanarao K,Snodgrass RT,eds.VLDB 2001,Proceedings of the 27~(th) Int'l Conf.on Very Large Data Bases Roma:Morgan Kaufmann,2001,pp.79-88.
    [13]邵峰晶,于忠清编著.数据挖掘原理与算法[M].北京:中国水利水电出版社,2004.10.
    [14]RuiXu,Survey of Clustering Algoritluns[J],IEEE Transactions on Neural Network,2005,vol.16(3),pp.645-678.
    [15]王莉,数据挖掘中聚类方法的研究[D],天津:天津大学,博士学位论文,2003.
    [16]T.Zhang,R.Ramakrisbnan and M.Livny,Birch:An Efficient Data Clustering Method for Very Large Databases[C],in:Proceedings of 1996 ACMSIGMOD,Montreal,1996,pp.103-114.
    [17]D.Fisher,Improving Inference through Conceptual Clustering[C],in:Proceedings of 1987 AAAI Conf,Seattle,WA,1987,pp.461-465.
    [18]J.Gennar,P.Langley,D.Fisher,Models of Incremental Concept Formation[J],Artificial Intelligence,vol.40 1),pp.11-61,1989.
    [19]M.Ester,H-P.Kriegel,J.Sander,and X.Xu,A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise[C],in:Proceedings of 1996 Int.Conf.Knowledge Discovery and Data Mining,Portland,1996,pp.226-331.
    [20]M.Ankerst,M.Breunig,H-P.Kriegel,and J.Sander,Optics:Ordering Points to Identify the Clustering Structure[C],in:Proceedings of 1999 ACM SIGMOD Int.Conf.Management of Data,Philadelphia,PA,1999,pp.49-60.
    [21]A.Hinneburg,D.A.Keim,An Efficient Approach to Clustering in Large Multimedia Databases with Noise[C],in:Proceedings of 1998 International Conference on Knowledge Discovery and Data Mining,New York,Aug,1998,pp.58-65.
    [22]Wei Wang,Jiong Yang,Richard Muntz,Sting:A Statistical Information Grid Approach to Spatial Data Mining[C],in:Proceedings of the 23~(rd) VLDB Conference nce,Athens,Greece,1997.
    [23]Sheikholeslami G;Chattterjee S,Zhang A,Wavecluster:Multi-Resolution Clustering Approach for Very Large Spatial Databases[C],in:Proceedings of the 24~(th)Conference on VLDB,New York,NY,1998,pp.428-439.
    [24]Agrawal R,Gehrke J,Gunopulos D,Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications[C],in:Proceedings of ACM SIGMOD Conf,Seattle,WA,1998,pp.94-105.
    [25]Barbard,Daniel,Requirements for Clustering Data Streams[J],SIGKDD Explorations,2003,vol.3(2),pp.23-27.
    [26]孙玉芬,卢炎生,流数据挖掘综述[J],计算机科学,2007,vol.34(1),pp.1-5.
    [27]单世民,基于网格和密度的数据流聚类方法研究[D],大连:大连理工大学,博士学位论文,2006.
    [28]L.O'Callaghan,N.Mishra,A.Meyerson,S.Guha,R.Motwani,Streaming-Data Algorithm for High-Quality Clustering[C],in:Proceedings of IEEE International Conference on Data Engineering,2002,pp.685-696.
    [29]C.C.Aggarwal,J.Han,J.Wang,A Framework for Clustering Evolving Data Streams[C],in:Proceedings of the 29~(th) VLDB Conf,2003,pp.81-92.
    [30]Sudipto Guha,Adam Meyerson,Nina Mishra,Clustering Data Streams:Theory and Practice[J],TKDE special issue on clustering,vol.15,2003.
    [31]C.C.Aggarwal,J.Han,J.Wang,A Framework for Projected Clustering of High Dimensional Data Steams[C],in:Proceedings of the 30~(th) VLDB Conf,2004,pp.852-863.
    [32]Nam Hun Park,Won Suk Lee,Statistical Grid-Based Clustering over Data Streams[J],SIGMOD Record,2004,vol.33(1),pp.32-37.
    [33]Ester M,Kriegel H-P,Sander J,Incremental Clustering for Mining in a Data Warehousing Environment[C],in:Proceedings of the 24~(th) International Conference on VLDB,New York,1998,pp.323-333.
    [34]S.Guha,Clustering Data Stream[C],in:Proceedings of The 41~(st) Annual Symposium on Foundations of Computer Science,2000,pp.359-366.
    [35]朱蔚恒,印鉴,谢益煌,基于数据流的任意形状聚类算法[J],软件学报,2006,vol.17(3),pp.379-386.
    [36]Qing-bao Liu,Su Deng,Changhui Lu,Relative Density Based K-Nearest Neighbors Clustering Algorithm[C],in:the International Conference on Machine Learning and Cybernetics,CHINA,2003.
    [37]曹峰,数据流聚类算法分析[D],上海:复旦大学,博士学位论文,2006.04
    [38]孙玉芬,卢炎生,一种基于网格方法的高维数据流子空间聚类算法[J],计算机科学,2007,vol.34(4),pp.199-203.
    [39]Yansheng Lu,Yufen Sun,Guiping Xu,A Grid-Based Clustering Algorithm for High-Dimensional Data Stream[J],Springer-Verlag 2005,pp.824-831.
    [40]王生生,刘大有,曹斌,一种高维空间数据的子空间聚类算法[J],计算机应用,2005,vol.25(11),pp.2615-2617.
    [41]冯兴杰,黄亚楼,增量式cure聚类算法研究[J],小型微型计算机系统,vol.2510,pp.1847-1849,2004.
    [42]http://www.ics.uci.edu/databases/.
    [43]Maxion R A,Frank E,A Case Study of Ethernet Anomalies in a Distributed Computing Environment[J],IEEE Transaction on Reliability,vol.39(4),1990,pp.433-443.
    [44]唐正军,李建华编著.入侵检测技术[M].北京:清华大学出版社,2004.04.
    [45]Portnoy L,Eskin L,Stolfo S J,Intrusion Detection with Unlabeled Data Using Clustering[C],in:Proceedings of ACM CSS Workshop on Data Mining Applied to Security(DMSA-2001),Philadelphia,2001,pp.109-115.
    [46]Sang-Hyun Oh,Jin-Suk Kang,Yung-Cheol Byun,Intrusion Detection Based on Clustering a Data Stream[C],in:Proceedings of the 2005 Third ACIS Int'l Conference on Software Engineering Research,Management and Applications,2005,pp.220-227.
    [47]http://www.snort.org/.
    [48]韩东海,王超,李群.入侵检测系统实例剖析[M].北京:清华大学出版社,2002.
    [49]http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.Html.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700