基于特征点选择的聚类算法研究与应用

英文题名：Research and Application of Clustering Algorithm Based on Feature Point Selection
作者：朱国红
论文级别：硕士
学科专业名称：计算机软件与理论
中文关键词：数据挖掘 ; 聚类分析 ; K-均值 ; 入侵检测
英文关键词：Data Mining ; Clustering Analysis ; K-means ; Intrusion Detection
学位年度：2010
导师：石冰
学科代码：081202
学位授予单位：山东大学
论文提交日期：2010-04-05

摘要

随着全球信息量的爆炸式的增长,数据挖掘技术已成为新世纪计算机科学技术的研究热点。聚类分析是数据挖掘的最主要的功能之一,聚类就是将数据对象分组为多个类或簇,在同一个簇中的对象之间具有较高的相似度,而不同簇中的对象差别较大。聚类分析主要解决的问题是如何在没有先验知识的前提下,实现满足这种要求的聚簇的集合。到目前为止,人们提出了各种各样数据挖掘的聚类算法,但这些算法仅适用于特定的应用以及用户,而且它们在理论和方法上还有待完善,甚至还有严重的不足之处。K-means聚类算法在数据挖掘领域具有非常重要的应用价值。但随着应用领域的拓展和新的问题需求,K-means本身存在的局限越来越突出。在应用中聚类个数通常根据用户视觉和使用方便性假定,但用户往往不能准确的确定聚类个数,聚类个数一旦确定在整个聚类过程中都不能更改,最终得到的簇的数目就是初始的聚类个数。并且初始聚类中心的选取不同也同样会影响聚类算法的效果,因此用户一般不会得到准确的聚类。K-means算法这两个重要缺点严重影响了它在聚类算法中的应用范围。
     本文在分析了当前各种聚类算法的思想和方法的同时,针对K-means算法存在的一些缺陷和不足,提出了基于特征点选择的聚类算法CFPS (Clustering algorithm based on Feature Point Selection)。CFPS算法同样也属于划分聚类算法,CFPS算法在聚类过程中引入了适应度函数,算法根据对象间的距离和适应函数的值进行聚类和调整聚类个数k,CFPS算法不用选取初始聚类中心,算法开始时每个聚类对象自成一类,因此聚类结果稳定,算法不会陷入局部最优的聚类结果。实验结果表明CFPS聚类算法在数据挖掘中与其它聚类算法相比,CFPS算法提高了聚类精度和效率。因此用户可以方便地使用本文提出CFPS算法,不需要配置复杂的参数,并且能得到更好或一样的结果
     聚类分析及相关技术在入侵检测中的应用是当前入侵检测研究的一个热点,本文尝试将CFPS聚类算法应用于入侵检测系统中,并使用KDD CUP1999数据集作为实验数据,对K-means算法与CFPS算法进行了仿真实验,算法分析与实验结果表明CFPS算法具有较好的检测性能,可以获得较高的检测率和较低的误报率,该方法克服了传统K-means算法需要人为确定k值和受初始聚类中心点选择影响的问题。
With an explosive increase in global information, data mining technique has been a focus of the new century computer science and technology research. Cluster analysis in one of the most important fonctions in data mining.Clustering is the process of grouping a set of physical objects or objects into classes or clusters, in which similar objects are grouped in the same cluster while different objects are in different clusters. Clustering processes are always carried out in the condition without pre-known knowledge, so the main task is to solve that how to get the clustering result in this premise. Up to present, many clustering algorithms have been presented, but these algorithms are only suited special problems and users. Furthermore, they are imperfect both theoretically and methodologically, even severe fault. The K-means algorithm has the extremely important application value in Dam Mining, but with the application development and the new question demand, K-means limitations become increasingly prominent. The number of clusters in applications are usually based on the user assumes. But users often do not set the exact number of clusters. The number of clusters once have be established, in the whole clustering process can not be changed, the final clusters number is the initial number of clusters. And select different initial core nodes of the data also will affect the effectiveness of clustering algorithm, so the user generally will not get an accurate clustering. These two important shortcomings serious impact K-means algorithm's application scope in clustering algorithms.
     This dissertation systematically, deeply, roundly and detailedly studies and analyses the technique and methods of clustering analysis, puts forward an improved Clustering algorithm based on Feature Point Selection(CFPS), considering the fault of K-means clustering algorithm. The CFPS algorithm also belongs to the database segmentation category.CFPS algorithm use a fitness function during clustering, CFPS algorithm according to the distance of clusters and the fitness function of the points to clustering and adjust parameter k of clusters, this algorithm don't need select the initial core nodes of the data, at the beginning each object belongs to a cluster, so the result of clustering is stable, CFPS algorithm does not fall into local optimum clustering result. Experimental results show that the CFPS clustering algorithm in data mining, compared with other clustering algorithms, CFPS algorithm improves the clustering accuracy and efficiency. So users can easily use the algorithm proposed in this paper without configure complex parameters, and can get better or the same as the results of other clusterig algorithm.
     Cluster analysis and related technologies in Intrusion Detection Intrusion Detection is currently a hot topic, this dissertation attempts to use CFPS clustering algorithm in intrusion detection systems, and use the KDD CUP 1999 data set as the experimental data, the K-means algorithm and CFPS algorithm have be tested, algorithm analysis and experimental results show that the CFPS algorithm has better detection performance, get a higher detection rate and low false alarm rate, the method can overcome the traditional K-means algorithm needs to man-made determine the k value and by the initial clustering center of choice implications.

引文

[1]Abraham Silberschatz, Henry F.Korth, S.Sudarshan著.杨冬青等译,数据库系统概念[M],北京：机械工业出版社.2006.9
    [2]韩培友等,数据库技术[M],西安：西北工业大学出版社,2008.12
    [3]Jiawei Han, Micheline Kamber著.范明,孟小峰等译.数据挖掘概念与技术[M],北京：机械工业出版社.2002.9
    [4]David hand, heikki Mannila, Padhraic Smyth著.张银奎,廖丽,宋俊等译.数据挖掘原理[M],北京：机械工业出版社.2003.4.
    [5]Ryszard S.Michalski, Ivan Bratko, Miroslav Kubat et al.机器学习与数据挖掘[M],北京：电子工业出版社,2004.
    [6]Richard O.Duda Peter E.Hart David G.Stork著。李宏东等译,模式分类(中文版·第2版)[M],北京：机械工业出版社2004．2
    [7]唐龙妹,杨俊英.应用聚类分析研究医学论文中统计学设计的部分问题[J],数理医药学杂志,2006,19(1)：48-50.
    [8]Daasch W. Robert, Madge Robert. Variance reduction and outliers: statistical analysis of semiconductor test data [C], Test Conference,2005, 304-312.
    [9]Tubao Ho, Trongdung Nguyen, etc. Visualization Support For User-Centered Model Selection In Knowledge Discovery and Data Mining[J]. International Journal On Artificial Intelligence Tools,2001,10 (4):691-713
    [10](美)迪安等著.顾国昌等译.人工智能：理论与实践[M], 北京：电子工业出版社2003.6
    [11]郭蕴华,陈定方.基于模糊聚类分析的客户分类算法研究[J]计算机应用研究,2005,22(4)：67-71
    [11]BESEMER J, LOMSADZE A, BORODOVSKY M. A self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions [J]. Nucleic Acids Research,2001, 29(12):2607-2618
    [12]胡冰,胡东军,马文超.文本挖掘研究及发展[J]电脑知识与技术：学术交流,2008,11(3)：792-793.
    [13]柴胜,周云轩.基于科学工作流的空间数据处理技术探讨[J]计算机工程与应用,2007,43(2)：187-189.
    [14]Shah H., Undercoffer J., Joshi A. Fuzzy clustering for intrusion detection [C],12th IEEE International Conference on Fuzzy, Systems,2003, 2:1274-1278.
    [15]Das A., Nguyen D., Zambreno J. et al. An FPGA-Based Network Intrusion Detection Architecture Information Forensics and Security [C], IEEE Transactions,2008,3(1):118-132.
    [16]Guan Y, Ghorbani A.A., Belacel N. Y-means:a clustering method for intrusion detection [C], Canadian Conference on Electrical and Computer Engineering,2003,2:1083-1086.
    [17]Jiong Zhang, Mohammad Zulkernine. Anomaly Based Network Intrusion Detection with Unsupervised Outlier Detection [C], IEEE International Conference on Communications,2006,5:2388-2393.
    [18]Jagadeesh B.S., Rajesh K., Mundada R.S. et al. Anupam-Alpha parallel computer for operational weather forecasting [C],4th International Conference/Exhibition on High Performance Computing in the Asia-Pacific Region,2000,2:1140-1145.
    [19]Jagadeesh B.S., Rajesh K., Mundada R.S. et al. Anupam-Alpha parallel computer for operational weather forecasting [C],4th International Conference/Exhibition on High Performance Computing in the Asia-Pacific Region,2000,2:1140-1145.
    [20]Klawonn F., Angelov P. Evolving Extended Naive Bayes Classifiers [C], 6th IEEE International Conference on Data Mining Workshops, 2006:643-647.
    [21]Zengchang Qin. Naive Bayes Classification Given Probability Estimation Trees [C],5th International Conference on Machine Learning and Applications,2006:34-42.
    [22]Turner K, Ghosh J. Estimating the Bayes error rate through classifier combining[C],13th International Conference on Pattern Recognition, 1996,2:695-699.
    [23]Viswanathan K., Oruganti R., Srinivasan D. Dual-mode control of tri-state boost converter for improved performance [C], IEEE Transactions on Power Electronics,2005,20(4):790-797.
    [24]Wang J., Dunford W.G., Mauch K. A comparison between two proposed boost topologies and conventional topologies for power factor correction [C], Industry Applications Conference,1996,2:1210-1217.
    [25]Jingquan Chen, Maksimovic D, Erickson R. Buck-boost PWM converters having two independently controlled switches [C], Power Electronics Specialists Conference,2001,2:736-741.
    [26]Dorigo M, M aniezzoV, ColomiA. Ants ystem:Optimizationb ya Colony of Cooperating Agents[J]. IEEE Transactions on Systems, Man,and Cybernetics,1996,26(1):29-41
    [27]Shang Liu, ZhiTong Dou, Fei Li et al. A NEW ANT COLONY CLUSTERING ALGORITHM BASED ON DBSCAN [C], Proceedings of the Third International Conference on Machine Learning and Cybernetics, 2004,3:1491-1496.
    [28]张惟皎,刘春煌,尹晓峰.蚁群算法在数据挖掘中的应用研究[J],计算机工程与应用,2004,28：171-173.
    [29]解英文.基于蚁群算法的网络路由算法[D].济南：山东大学硕士学位论文,2009
    [30]Bo Liu, Hussein A.Abbass, Bob McKay. Density based Heuristic for Rule Discovery with Ant_Miner [C],6th Australia-Japan Joint Workshop on Intelligent and Evolutionary System,2002:180-184.
    [31]Bernard Chen, Phang C. Tai, R. Harrison et al. Novel Hybrid Hierarchical K-means Clustering Method (H-K-means) for Microarray Analysis [C], 2005 IEEE Computational Systems Bioinformatics Conference Workshops, 2005:105-108.
    [32]Magidson, J. and Vermunt, J.K. Latent class models for clustering:A comparison with K-means[J], Canadian Journal of Marketing Research, 2002,20:36-43.
    [33]Hamng Z. Extensions to the K-means algorithm for clustering large data sets with categorical value[J]s. Data Mining and Knowledge Discovery, 1998,2(3):283-304.
    [34]毛韶阳,林肯立.优化K-means初始聚类中心研究[J].计算机工程与应用,2007,43(22)：179-181.
    [35]朱国红,石冰,邢晓娜.基于特征点选择的聚类算法研究[J].山东大学学报(理学版),2009,44(9)：40-42。
    [36]杨兰仓.数据挖掘中聚类和孤立点检测算法的研究[D].济南：山东大学硕士论文,2008.
    [37]Satu Virtanen. Data clustering [R], Report for the 2003 Seminar on String Algorithms,2003.
    [38]Alexander Strehl, Joy deep Ghosh. Value-based customer grouping from large retail data-sets [C], SPIE Conference on Data Mining and Knowledge Discovery:Theory, Tools, and Technology Ⅱ,2000,4057:33-42
    [39]G. Sudipto, R.Rajeev, and S. Kyuseok, Cure:an efficient clustering algorithm for large databases[J], Information Systems,2001,26(1):35 -58.
    [40]Ali M N, Storey C, Topographical multilevel single linkage[J], Journal of Global Optimization 5,1994:349-358.
    [41]梁循.数据挖掘算法与应用[M].北京：北京大学出版社,2006.
    [42]Magidson J, Vermunt J K. Latent class models for clustering:A comparison with K-means[J], Canadian Journal of Marketing Research,2002,20: 36-43.
    [43]Lloyd-Williams M, Discovering the hidden secrets in your data-the data mining approach to information[J], Information Research,1997,3(2). Available at:http://informationr.net/ir/3-2/paper36.html
    [44]史美林,钱俊,董永乐.入侵检测技术与发展趋势[J].信息安全与通信保密,2002,17：12-16.
    [45]Xia Shixiong, Li Wenchao, Zhou Yong, et al. Improved K-means clustering algorithm[J]. Journal of Southeast University (English Edition), 2007,23 (3):435-438.
    [46]蒋建春,马恒太,任党恩,等.网络安全入侵检测研究：研究综述.软件学报,2000；11(11)：1460—1466.
    [47]张相锋,孙玉芳.入侵检测系统发展的研究综述[J]．计算机科学,2003,31：45-49.
    [48]石凌云.数据挖掘在网络入侵检测中的运用[J].电脑知识与技术,2009,16(5)：4120-4121
    [49]温智宇,唐红,吴渝.数据挖掘技术在入侵检测系统中的应用[J].计算机工程与应用,2003,40(17)：153-156
    [50]Richard P. Lippmann, David J. Fried, Isaac Graf et al. Evaluating intrusion detection systems:The 1998 DARPA of-line intrusion detection evaluation[R]. In Proceedings of the 2000 DARPA Information Survivability Conference and Exposition.2000.
    [51]叶和平,尚敏,范路桥.入侵检测系统的数据标准化应用研究[J].计算机工程,2007,33(9)：142-144.
    [52]张瑞丰.精通MATLAB6.5[M].北京：中国水利水电出版社.2004.2
    [53]张志涌.精通MATLAB6.5版[M].北京：北京航空航天大学出版社.2003.3

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700