基于遗传算法的模糊聚类挖掘方法应用研究

英文题名：Application and Research of Fuzzy Clustering Mining Method Base on Genetic Algorithm
作者：李新宇
论文级别：硕士
学科专业名称：计算机应用技术
中文关键词：数据挖掘 ; 入侵检测 ; 遗传算法 ; 模糊聚类
英文关键词：Data Mining ; Intrusion Detection ; Genetic algorithm ; Fuzzy Clustering algorithm
学位年度：2011
导师：周铁军
学科代码：081203
学位授予单位：中南林业科技大学
论文提交日期：2011-06-01

摘要

随着信息技术和数据库技术的迅猛发展,人们能够获取的数据也与日俱增,对数据的加工处理已经成为人们获取有用信息不可缺少的工具。数据挖掘是一种通用的知识发现技术,利用各种分析工具在大量数据中发现模型和数据间的关系的过程。聚类分析是数据挖掘技术中重要的组成部分,数据聚类挖掘技术是一个正在蓬勃发展的领域,涉及了人们生活的各个方面。
     模糊聚类FCM(Fuzzy c-means)算法是的一种重要的无监督学习的数据聚类挖掘方法,已成为聚类分析技术研究的热点。该算法具有结构简单、局部搜索能力强且收敛速度快的特点,然而FCM算法容易受聚类初始化的影响,而且在迭代时非常容易陷入局部极小。遗传算法是一种随机搜索的全局优化算法,它通过模拟自然进化过程对最优解进行搜索,其显著的特点是具有并行性及对搜索范围的全局性。如果将FCM算法和遗传算法相结合,用遗传算法来解决聚类问题,既能发挥遗传算法的全局寻优能力,又能兼顾FCM算法的局部搜索能力,从而大大提高算法的性能。
     本文提出了一种基于改进遗传算法的模糊聚类算法(IG-FCM),该算法首先采用遗传算法的全局搜索特性对初始聚类中心进行全局优化,接着运用FCM算法的局部寻优特性进一步的最优解搜索。IG-FCM算法采用了一种启发式聚类的方法,通过有序改变聚类类别数目,利用聚类有效性评价函数自动确定最优聚类数目及最优聚类结果。由于采用传统遗传算法进行聚类会出现算法收敛速度慢,以及稳定性不高、精准性低等问题,本文改进的遗传算法采取最优保存策略来保留当前种群中适应度最高的个体,让其副本及其他个体进行最大适应度差异交叉操作,确保遗传算法优良基因迭代的稳定性,避免不良基因的扩散,提高了算法的收敛速度和精确度。
     本文在IG-FCM聚类算法研究的基础上,针对现有的入侵检测系统检测性能的不足以及聚类算法在入侵检测系统中应用的特点,提出了基于改进遗传算法的特征加权模糊聚类算法(IG-WFCM)算法,将该算法用于入侵检测系统中训练数据集的聚类划分,以此为依据来检测网络数据是否正常。基于IG-WFCM算法的入侵检测系统采用将连续型属性和离散型属性分别处理的数据预处理方式,数据之间相似性度量采用加权的混合距离度量方式,并且采用设定正常数据类集聚类宽度阈值的方法来检测异常数据,以此来提高入侵检测系统的检测率。
     本文通过采用KDD CUP 1999入侵检测数据集进行了仿真实验,结果显示IG-WFCM算法的平均检测率达到了80.1%,平均误警率保持为1.605%左右。这充分表明IG-WFCM算法的可行性和有效性,能够克服FCM算法易陷入局部极小值、检测精度低等缺陷,在一定程度上提高了入侵检测系统的性能和效率。
With information technology and database technology developing at very fast speed, information processing has become a indispensable tool for people to acquire useful message. Data mining is a generic knowledge discovery technology, it is a process of findingmodel and the relationship of the data in a large amount of data by analytical tools. Clustering analytical is an important component of data mining technology. Data clustering mining technology is an emerging area which involves various areas.
     FCM (Fuzzy c-means) algorithm, as a kind of unsupervised learning methods, it is a research hotspot concerning about data clustering analytical technology. FCM is one of important algorithm in data clustering mining methods, it has the characteristics as simple, fast convergence and strong local searching power, etc. However, FCM is sensitive to initialization and tends to result in local minimum in iterations. Genetic Algorithm is a random searching global optimization algorithm. It is a computational model of the human evolution, with implicit parallelism and capacity of using effective global information. The combination of FCM algorithm and genetic algorithm will get a hybrid algorithm which benefits to solve clustering problem and make tremendous improvement in algorithm performance, the hybrid algorithm has good global and local search capability.
     This paper presents a hybrid Fuzzy c-means algorithm(IG-FCM) based on improved genetic algorithm.The algorithm use global search ability of genetic algorithm to optimize the initial cluster centers of clustering algorithm, and then carry out the FCM algorithm base on local optimization. IG-FCM is a Heuristic clustering algorithm, it orderly changes clustering class number, then automatically determine the optimal number of clustering class and the optimal clustering base on Evaluation of clustering validity function. Because of the traditional genetic algorithm has shortcomings like slow convergence, poor stability and low accuracy rate. This paper adopts the optimum preservation strategy in the selection operation to maintain the optimum individual in the process of genetic, and then copy the selected individual, then the optimum individual enters the next generation directly without participating in crossover and mutation operation. The copies and other individual will participate in crossover and mutation operation base on maximize degree of genetic variation. IG-FCM is greater improvement than classical clustering algorithms in performance, it could ensure the stability of genetic evolution and improve the convergence speed and accuracy.
     Based on analyzing the characteristics of IG-FCM algorithm and clustering algorithm that applied to intrusion detection system, according to the insufficiency of existing intrusion detection system detection performance, this paper proposed a hybrid weighted FCM clustering algorithm (IG-WFCM). The algorithm can be used to partition clustering of training data set for intrusion detection systems. The intrusion detection systems can detect the network data base on the results of clustering. IG-WFCM algorithm could make a difference between the continuous attributes and discrete attributes in the intrusion detection data Pretreatment Process, it using weighted hybrid distance metric methods to measure the similarity of data. In order to detect abnormal data and improve the detection rate of intrusion detection system, the algorithm using the method that set the width threshold of the normal data class.
     In this paper, we make the Intrusion Detection simulation experiment using KDD CUP 1999 data set base on IG-WFCM algorithm, results show that the average detection rate reached 80.1%, the average false positive rate remains at 1.605% or so. This results of experiment could fully demonstrate the feasibility and effectiveness of the IG-WFCM algorithm, and it could overcome the shortcomings like easily trapped into local minima, and low defect detection accuracy of FCM algorithm, and then could improve the performance and efficiency of the intrusion detection system to a certain extent.

引文

[1]Margart H, Dunham.DATA MINING Introductory and Advanced Topics[M].北京：清华大学出版社,2003：3-71.
    [2]CHEN M-S,HAN J, YU P S. Data mining:an overview from a database perspective[J]. IEEE Transactions on Knowledge and Data Engineering, 1996,8(6):866-883.
    [3]朱明.数据挖掘[M].合肥：中国科学技术大学出版社,2002
    [4]李雄飞,李军.数据挖掘与知识发现[M].北京：高等教育出版社,2003.11
    [5]Liu J Z,et al.GA-based fuzzy c-means clustering algorithms[J]. ISYI'94, Beijing, china.1994.
    [6]陈国良,王煦法,庄镇泉,等.遗传算法及其应用[M].北京：人民邮电出版社,1996,1-25.
    [7]Maulik U,Bandyopadhyay S.Genetic algorithm-based clustering techn-ique[J]. Pattern Recognition.2000,33(9):1455-1465.
    [8]Charikar Moses Samson. Algorithms for Clustering Problems[J]. LM3000018,2000
    [9]Pang-Ning Tan, Michael Steinbach Vipin Kumar. Introduction to Data Mining[M].2006,5:31-36.
    [10]毛国君,段立娟,王实,等.数据挖掘原理与算法[M].北京：清华大学出版社,2006.
    [11]JainA K, MurtyMN, FlynnPJ. Data clustering:Asurvey[J].ACM ComPute. Surv,1999,31:264-323.
    [12]Fayyad U eds,Knowledge Discovery and Data Mining Towards a Unifying Framework KDD'96 Proc[C].2nd Intl.Conf.On knowledge Discovery&Data Mining.AAAI Press.1996.
    [13]张云涛,龚玲.数据挖掘原理与技术[M].北京：电子工业出版社,2004.
    [14]蔡伟杰,张晓辉,关联规则挖掘综,计算机工程[J],2001,(5)：31-33
    [15]王晓晔,时间序列数据分类方法的研究：[D],天津.天津大学,2003
    [16]李永森,杨善林,马溪骏,等.空间聚类算法中的K值优化问题研究 [J].系统仿真学报,2006(3)：574-576.
    [17]李同强,周天弋,吴斌.基于改进遗传算法的加权模糊C均值聚类算法[J],计算机应用.2009,s2：42-45.
    [18]J.Han,M.Kamber,Data Mining:Concepts and Techniques[M],北京.高等教育出版社,2001
    [19]Filippone M, Camastra F.Masuili F, Rovetta S.A survey of kernel and spectral methods for clustering [J].Pattern Recognition,2008,41(1): 176-190.
    [20]Zhang,T Raghu R, Livny M.BIRCH:An Effieient Data Clustering Method for Very Large Databases[C], ACM SIGMOD Int, Conf Monrteal. Canada,1996,103-114.
    [21]GhuaS,Rasotgi R, Shim K.CURE:An Effieient Clustering Algorithm for Lagre Data bases[C], ACM SIGMOD Int.Conf Seattle, USA,1998, 73-84
    [22]G.K yarPis, E.Hna, V.Kmu.ar Chameleon:Hierarchical Clustering Using Dynamic Modeling[C],IEEE ComPuter,1999, (ICDE'98),1998, Vol.32, No.8,68-75.
    [23]Hinneburg,Keim D.Optimal Grid-Clustering:Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering [C], Proe.Int.Conf on Very Lagre Databases (VLDB'99),1999,506-517.
    [24]D.Fisher.Improving inherence through conceptual clustering.Proc.1987 AAAI Conf.461-465
    [25]D.E.Rumelhart, D.Zipser, Feature discovery by competitive learning, Congnitive Science,1985,9(1):75-112
    [26]T.Kohonen,Self-organization and associate memory, Berlin:Springer-Verlag,1984, Chapter 5.
    [27]D.E.Rumelhart, D.Zipser, Feature discovery by competitive learning, Congnitive Science,1985,9(1):75-112
    [28]T.Kohonen, Self-organization and associate memory, Berlin:Springer-Verlag,1984, Chapter 5.
    [29]Kohonen Teuvo, The self-organizing map, Neurocomputing,1998,21 (1-3):1-6
    [30]Kohonen T, Self-Organizing Maps, Springer, Berlin:1995
    [31]刘菁.基于模糊理论与人工神经网络的暂态稳定评估方法[D].上海：上海交通大学(硕士),2007：16-21.
    [32]孟丽敏,宋余庆,朱峰.基于空间邻域加权的模糊C-均值聚类及其应用研究[J],计算机应用研究,2010,10：32-35.
    [33]RusPini E H.A new approach to clustering[J].Int.Conf.1996,15(1): 22-23.
    [34]张晓丽.模糊聚类分析在多目标跟踪中的应用[J].国防技术基础,2008,(06).
    [35]陈水利,李敬功,王向松.模糊集理论及其应用[J].北京：科学出版社,2005
    [36]叶海军.模糊聚类分析技术与其应用研究[D].合肥：合肥工业大学,2006.
    [37]E.Bagheri,H.Edldari. Dejong Function Optimization by Means of a Parallel Approach to Fuzzified Genetic Algorithm[J]. Computers and communications, ISCC'06 Proceedings,2006,675-680
    [38]黄永青,梁昌勇,杨善林,等.基于一种加速收敛变异策略的交互式遗传算法[N].系统仿真学报,2007,19(9)：1913-1916
    [39]邹琳,夏巨谌,胡安国.基于实数编码的多种群并行遗传算法研究[J].小型微型计算机系统.2004,25(6)：982-986
    [40]潘伟,刁华宗.一种改进的实数自适应遗传算法[J].控制与决策,2006,21(7)：792-793.
    [41]周洪伟,原锦辉,张来顺.遗传算法“早熟”现象的改进策略[J].计算机工程,2007,33(19)：201-203
    [42]罗会兰.聚类集成关键技术研究[D].浙江：浙江大学计算机学院,2007.
    [43]覃宝灵.聚类分析技术及其应用研究[J].广西工学院学报.2007,18(3)：105-107
    [44]刘林,曹艳平,王婷,李娟飞.应用模糊数学[M].西安：陕西科学技术出版社,2008
    [45]蒋盛宜,李霞一种改进的BIRCH聚类算法[J].计算机应用,2009,29(1)：293-296.
    [46]A.K.Jain and R.C.Dubes,Algorithms for Clustering Data.Englewood Cliffs[J], NJ:Prentice-Hall,1988.
    [47]R.Agrawal,H.Mannila,R.Srikant etc., Fast discovery of association rules[J], Advances in Knowledge Discovery and Data mining, AAAI/MIT Press,1996,307-328.
    [48]曲福恒.一类模糊聚类算法研究及其应用[D].吉林大学,2009
    [49]殷瑞飞.数据挖掘中的聚类方法及其应用[D].厦门大学,2008.
    [50]Sanghamitra Bandyopadhyay,Ujjwal Maulik.An evolutionary technique based on K-Means algorithm for optimal clustering in RN[J].Information Sciences,2002,146:221-237.
    [51]李敏强,遗传算法的基本理论与应用[M],北京.科学出版社,2002.
    [52]Uday Kumar Chakrabo. A branching process model for genetic algorithms[J]. Information Process Letters,1995,56:281-290.
    [53]D E Goldberg,K Deb.A comparison of selection schemes used in genetic algorithms[J]. Foundations of Genetic Algorithms. San Mateo, CA:Morgan Kauffman Publishers.1991:69-63
    [54]周丽娟,石倩,葛学彬,王林爽.基于聚类的模糊遗传挖掘算法的研究[J].计算机工程与应用,2010,13：36-40.
    [55]戴文华.基于遗传算法的文本分类及聚类研究[M].北京：科学出版社2008
    [56]李颖,李传龙,马龙,于水明.动态加权模糊核聚类算法[J].合肥工业大学学报(自然科学版),2010,06：43-46
    [57]NikhilR.Pal, KuhuPal, Jmaes M.Kelle, rnadJmaesC.Bezdek.A Possibilistic Fuzzy C-Means Clustering Algorithm[J].IEEE Transactions On Fuzzy Systems,13(4),2005:517-530
    [58]刘蕊洁,张金波,刘锐.模糊c均值聚类算法[J].重庆学院学报(自然科学版),2008,22(2)：139-141
    [59]Hen, Chun-Wei Tsai, MSGKA:An Efficient Clustering Algorithm for Large Databases[J],2002 IEEE International Conference on Systems, Man and Cybernetics,2002, vol.5,110-116.
    [60]陈黎飞.高维数据的聚类方法研究与应用[D].厦门大学,2008
    [61]高滢.多关系聚类分析方法研究[D].吉林大学,2008
    [62]李晶,杨场.模糊聚类分析在数据挖掘中的应用[J].漯河职业技术学院学报,2010,05：29-32.
    [63]Eschrich S., Jingwei Ke, Hall L.O.etc.,Fast accurate fuzzy clustering through data reduction[J], IEEE Transactions on Fuzzy Systems, 2003,11(2):262-270
    [64]宋娇,葛临东.一种遗传模糊聚类算法及其应用[J].计算机应用,2008,28(5)：1197-1199
    [65]王小姣,徐夫田,单国杰.模糊C-均值聚类算法的改进[J].微型机与应用,2010,12：45-48.
    [66]王浩,王秀友,陈蕴.新的混合模糊C-均值聚类算法[J].计算机工程与设计,2008,29(4)：917-922
    [67]侯惠芳,刘素华一种改进的基于遗传算法的模糊C-均值算法[J].计算机工程,2005,31(17)：152-154.
    [68]覃俊华,张洪伟,赵世政.基于遗传算法的模糊聚类研究及其应用[J].计算机应用,2007,27(1)：52-55
    [69]周世兵,徐振源,唐旭清.新的K-均值算法最佳聚类数确定方法[J].计算机工程与应用,2010,46(16)：27-31.
    [70]诸克军,苏顺华,黎金玲.模糊c-均值中的最优聚类与最佳聚类数[J].系统工程理论与实践,2005,3：52-61.
    [71]李洁,高新波,焦李成.一种基于修正划分模糊度的聚类有效性函数[J].系统工程与电子技术,2005,27(4)：723-724.
    [72]肖立中,邵志清,马汉华,等.网络入侵检测中的自动决定聚类数算法[J].软件学报,2008,19(8)：2140-2148.
    [73]向继,高能,荆继武.聚类算法在网络入侵检测中的应用[D].计算机工程.哈尔滨工程大学硕士学位论文2003,29(16)：48-50
    [74]Joseph S. Sherif, Tommy G. Dearmond.Intrusion detection:Systems and models[J].proc. of the Eleventh IEEE International Workshops on Enabling Technologies:Infrastructure for Collaborative Enterprises, 2002.
    [75]罗守山.入侵检测[M].北京：北京邮电大学出版社,2004.101-120.
    [76]Tobias Chyssler, Stefan Burschka, Michael Semling, Tomas Lingvall, Kalle Burbeck. Alarm reduction and correlation in intrusion detection systems[J].In Proceedings of Detection of Intrusions and Malware & Vulnerability Assessment, GI SIG SIDAR Workshop,2004.
    [77]付兴兵,刘光远.入侵检测技术趋势[J].信息安全,2006,10：19-22.
    [78]李永忠,王汝山,张念贵,王玉雷.基于半监督模糊聚类的入侵检测技术[J].江苏科技大学学报,2010,4：35-39
    [79]陈健美,宋顺林,路鑫,宋全庆,朱玉金.改进模糊聚类算法及其在入侵检测中的应用[J].东南大学学报.2007,37(4)：589-592
    [80]杨德刚.基于模糊C均值聚类的网络入侵检测算法[J].计算机科学,2005,32(1)：86-87,91.
    [81]阎巧,谢维信.异常检测技术的研究与发展[J].西安电子科技大学学报,2002,29(1)：128-132
    [82]严骏.模糊聚类算法应用研究[D].浙江大学,2006.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700