基于粗糙集和遗传算法的聚类方法研究

英文题名：Research of Clustering Algorithm Based on Rough Set and Genetic Algorithm
作者：李琳
论文级别：硕士
学科专业名称：计算机软件与理论
中文关键词：数据挖掘 ; 聚类 ; 粗糙集 ; 遗传算法 ; 自适应
英文关键词：Data mining ; clustering ; rough set ; genetic algorithms ; adaptive
学位年度：2009
导师：张超英
学科代码：081202
学位授予单位：广西师范大学
论文提交日期：2009-04-01

摘要

数据挖掘技术是机器学习、数据库和统计理论相结合的产物,是从大量的、不完全的、有噪声的、模糊的、随机的实际数据中,提取隐含的、先前未知的并有潜在价值的信息的非平凡过程。在数据挖掘领域中,聚类分析是一项重要的研究课题。与分类不同,聚类的目标是在没有任何先验知识的前提下,根据数据的相似性将数据聚合成不同的簇,使得相同簇中的元素尽可能相似,不同簇中的元素差别尽可能大,因此又被称为非监督分类。聚类分析作为数据挖掘系统中的一个模块,既可以作为一个单独的工具以发现数据库中数据分布的深层信息,也可以作为其他数据挖掘分析算法的一个预处理步骤,因此研究如何提高聚类算法的性能具有重要的意义。
     遗传算法是基于生物进化的概念设计了一系列过程来达到优化的目的。这些过程包括:基因组合、交叉、变异、自然选择。在这些过程中,通过“优胜劣汰”的原则来淘汰掉解较差的基因,使得解朝着好的方向发展。遗传算法从一组初始可行解出发在只需要目标函数这一信息的条件下实现对可行域的全局高效搜索并以概率1收敛到全局最优解,这种良好的特性使得遗传算法成为组合优化和函数优化的有力工具,并成为计算智能领域的研究热点.
     粗糙集理论是一种刻画不确定性和不完整性知识的数学工具,由波兰数学家在上世纪八十年代初首先提出的。粗糙集理论善于分析隐藏在数据中的事实而不需要关于数据的任何附加知识。该理论以其独特的优势正赢得越来越多的研究者的关注,并在各个领域获得了广泛的应用。在数据挖掘领域,粗糙集最初主要用于分类,而今有关粗糙集的研究已深入到该领域的各个方面。
     目前所用的聚类方法大多是基于对数值属性进行处理的,并且对数值进行处理的方法比较多。而聚类算法中针对符号属性的数据处理则比较困难,往往都是使用概念聚类方法,或者将符号属性转化为数值属性的方法。但是前者过于复杂也不成熟,后者对于数据的符号属性选择有局限性。所以目前大部分的聚类算法都面向数值属性,针对符号属性的则比较少。所以本文提出的算法主要是研究符号属性的数据。
     粗糙集理论适合用于数据之间(精确的或近似的)依赖关系发现、评价某一分类(属性)的重要性、数据相似或差异发现。经典粗糙集模型比较好的解决了符号型数据的机器学习问题,尤其是符号数据的特征选择、属性约简和规则归纳问题。所以说粗糙集特别适合于处理符号属性的数据。
     在提高聚类算法的性能方面,遗传聚类算法可较好地解决聚类时的优化问题以及满足优化目标的多样性需求。适应度是遗传算法得以进行下去的关键。由于有了适应度,个体之间才存在竞争。遗传算法的目标函数及适应度函数定义具有很大的灵活性,可根据需要进行定义。
     遗传算法是可调节的、鲁棒的、高效率的随机搜索算法,它具有的并行性、易于和其它模型结合等性质,适用于数据挖掘,但遗传算法较复杂,容易收敛于局部极小值。粗糙集不需要给出数据之外的额外信息,可以简化输入信息的表达空间,算法简单,易于操作,粗糙集处理的对象是类似二维关系表的信息表,也适用于数据挖掘。遗传算法与粗糙集理论具有优势互补的特点,可以将两者结合应用到聚类中。
     本文将粗糙集思想与遗传算法结合,提出了一种新的聚类方法。聚类算法质量的一个要求就是高类内相似度、低类间相似度,所以在本文中应用类内相似度和类间相似度来定义遗传算法的适应度函数。由于粗糙集的广义近似空间提出了类内不可区分度和类间不可区分度,所以可以将此思想应用到遗传算法中的适应度函数定义中。本文提出了一种新的面向符号属性的聚类算法(RNGACA)。该算法对于每个不同的值,采用自顶向下的分裂式层次聚类策略,利用RAGA算法对数据集进行逐层二分,直到达到预先指定的聚类的个数,然后输出聚类结果。RAGA算法则是将粗糙集思想和自适应遗传算法结合,对数据进行二分。
     为了验证该算法,做了4部分实验,第一部分是对4组实验数据进行测试,4组数据均是取自UCI机器学习数据库,该部分以聚类准确率为衡量准则,将RAGACA算法同其他3种算法进行比较;第二部分实验测试是根据基于F-measure方法的测试结果来衡量RAGACA算法和其他两种算法;第三部分是分析RAGACA算法中RAGA算法的收敛性,通过比较RAGA算法与使用标准遗传算法和使用普通自适应遗传算法来分析它们的收敛性;第四部分是分析RAGACA算法的时间复杂度和空间复杂度。通过这四部分实验,可以分析出RAGACA对符号属性数据进行聚类的可行性,以及拥有较高的准确率和收敛性,另外时间复杂度和空间复杂度也并不比其他算法差。
The data mining technique is a combination of machine learning, database and Statistical theory. Data mining can seek interesting or valuable information within large, incomplete, noisy, rough and random databases. Cluster analysis is an important research problem in the domain of data mining. The goal of clustering is to classify data set into such clusters that intra-cluster data are similar and inter-cluster data are dissimilar without any prior known as“unsupervised classification”. Cluster analysis as a module in the system of data mining can be used not only as a separate technique to discover the information about data distribution, but also as the preprocessing of other data mining operations, therefore it is very meaningful to research how to improve the performance of clustering algorithms.
     Genetic algorithm based on the conception of biological evolution designs a series of process to optimize the solution. These processes include gene combination、crossover、variation、natural selection. In these procedures, eliminate the bad gene through the principle of“Survival of the fittest”and develop the solution to better direction. Genetic algorithm begins with a group of initial feasible solutions, achieves the global effective search of the feasible field with the only information object function and converges to the global optimum value with the probability 1, this kind of nicer characteristic make the genetic algorithm a useful tool for combination and function optimization. The genetic algorithm becomes the research hotspot in the field of computational intelligence.
     Rough set theory (RST) is a mathematical tool used for dealing with vagueness and uncertainty which is introduced by Pawlak, in the early 1980s.Rough set theory is good at analyzing the facts hidden in the data without any additional knowledge about the data. Due to its particular advantages, rough set theory has been received more and more attentions from researchers and applied in a variety of areas in recent years. In the domain of data mining, rough set was only used for classification at the beginning, but the research about rough set has already expanded to any aspects of data mining today.
     We are used in most of the clustering method is based on dealing with numerical attributes currently, and more is to deal with numerical methods. And clustering algorithm for symbolic attributes in the data processing is more difficult, often use the concept of clustering method, or property will be transformed into symbols of the methods of numerical attributes. The former is too complicated but not mature, which is the symbol for the data attribute selection is limited. Therefore, most all of the clustering algorithm for numerical attributes, the attributes for the symbols is relatively small. Therefore, the algorithm proposed in this paper is to study the clustering algorithm for the data of symbolic attributes.
     Rough set theory was suitable for data (precise or approximate) dependence, evaluation of a classification (attributes) the importance of similarity or difference between the data found. Classical rough set model to solve relatively good data symbols of the machine learning problem, in particular, feature selection data symbol, attribute reduction and rule induction problem. So, Rough set is suited to deal with the data which has symbolic attributes.
     In improving the performance of clustering algorithm, the genetic algorithm can solve the optimization problem at the time of clustering, as well as the diversity of objectives to meet the needs of optimization. Genetic algorithm fitness is the key to continue. Because of the fitness, the competition can be existent. The objective function of genetic algorithm and the definition of fitness function with a great deal of flexibility can be defined.
     Genetic algorithm is adjustable, robust and efficient random search algorithm, it is the parallelism, easy, and other models such as the nature, applicable to data mining, but more complex genetic algorithm can easily converge to local minimum value. Rough set data do not need to give additional information, can simplify the expression of input space, the algorithm is simple, easy to operate, the target of rough set to deal with two-dimensional relationship is similar to the information in table form, but also to data mining. Genetic algorithm and rough set theory with the characteristics of complementary advantages, a combination of both can be applied to clustering.
     This article will be thinking of rough set and genetic algorithms, a new method of clustering. The quality of clustering algorithm is a requirement of high-class similarity and low between-class similarity, so in this paper the application of category similarity and between-class similarity of the genetic algorithm to define the fitness function. As a result of generalized rough set approximation space within a category and can not distinguish between types of non-discrimination, so this thinking can be applied to genetic algorithms in the definition of fitness function. In this paper, a new attribute-oriented clustering algorithm symbols (RNGACA). The algorithm for each different value, the use of top-down hierarchical clustering split strategy RAGA algorithm using data sets of the second sub-layer by layer until it reaches pre-specified number of cluster, and then output the results of clustering.
     There have been four experiments to verify the algorithm. First, four groups of data which were taken from the UCI machine learning database have been used to measure the algorithm. And use the clustering accuracy to test the RAGACA algorithm and other three algorithms. Second, the experimental test is based on F-measure method to measure the other two algorithms and RAGACA algorithm. Third, the experimental test is the analysis of algorithm convergence RAGA algorithm, by comparing the algorithm with the use of RAGA standard genetic algorithm and adaptive genetic algorithm using the general analysis of their convergence. Fourth, this part is to analyze the algorithm's time complexity and space complexity. Through the four experiments, you can analyze that the feasibility of the RAGACA could cluster of the data with the symbols attributes. And the RAGACA has higher accuracy and convergence, and time complexity and space complexity is not worse than other algorithms.

引文

[1] Ni Zhi-wei, Cai Qing-sheng,Li Long-shu. Data Mining and Neural Network Techniques incase Based System. Journal Of Wuhan university, 2001,6 (1):25-28
    [2]Jiawei Han,Micheline Kamber,范明,孟晓峰译数据挖掘:概念与技术[M],北京,机械工业出版社,2002.
    [3] Margaret H.Dunham.郭崇慧,田凤占,靳晓明等译,《数据挖掘教程》[M],清华大学出版社,2006
    [4]史忠植。知识发现。北京清华大学出版社,2002.1
    [5] Michael J.Corey,Michael Abbey, Lan Abramson and Ben Taub.Oracle 8数据仓库分析及构建实用指南[M],陈越,郭渊博等.机械工业出版社,2000
    [6]王斌,浅析数据挖掘的主要方法和研究方向[J].计算机仿真,2005,22(10):1-3.
    [7]邝粉良;刘宁;冯子山;云南省两栖动物地理分布格局的聚类分析[J],四川动物, 2007年02期,P445-P447
    [8]杨若明;张经华;蓝翁驰;蓝叶芬;头发中微量元素的测定与糖尿病人的聚类分析研究[J],分析试验室,2007年02期,P76-P79
    [9]陆远权;杨丹;关于人口质量区域差异的聚类分析[J],统计与决策,2007年12期,P80-P82
    [10]王霞;包启挺,聚类回归分析(CLR)在市场细分研究中的应用[J],数理统计与管理, 2008年02期,338-345
    [11]贺玲,吴玲达,蔡益朝。数据挖掘中的聚类算法综述[J],计算机应用研究,2007年1期,P10-P13
    [12]王小平等.遗传算法—理论,应用与软件实现[M],西安交大出版社, 2002.
    [13]张丽萍,柴跃廷。遗传算法的现状及发展动向[J],信息与控制,2001年第30卷第6期,P531-P536
    [14]张文修,吴伟志,梁吉业,李德玉,粗糙集理论与方法[M],科学出版社,2005.1
    [15]Z.Pawlak,Rough set Theory and its application to data analysis[J], Cybernetics and Systems,1998,29(9):P661-P568
    [16]Nejman D.A rough set based method of handwritten numerals classification.Institute of Compoter Scinence Report[M],Waraw University of Technology,Warsaw,1994
    [17]孙吉贵,刘杰,赵连宇,聚类算法研究[J],软件学报,2008年01期,P48-P61
    [18]谷淑化,吕维先,马于涛,关于数据挖掘中聚类分析算法的比较[J],现代计算机,2005.3,P84-P85
    [19]邵峰晶,于忠清.数据挖掘原理与算法[M].中国水利水电出版社,2003.
    [20] Shehro S.Khan,Amir Ahmad,Cluster center initialization algorithm for K-means clustering[J],Pattern Recognition Letters,2004(25),P1293-P1302
    [21] Ng R,Han J.Efficient and effective clustering method for spatial data mining[J]. Proc of int’l conf VLDB,1994.P122-155
    [22]T.Zhang,R.Ramakrishnan,M.Livny.BIRCH,An efficient data clustering method for very large databases[A].proc.1996 ACM-SIGMOD int.conf.Magement of data(SIGMOD’96)P103-P114
    [23] S.Guha,R.Rastogi,K.Shim.CURE:an efficient clustering algorithm for large database [J].Information Systems.26(1):P35-P58
    [24]G.Karypis,E.H.Han,V.Kumar.CHAMELEON:A hierarchical clustering algorithm using dynamic modeling[J].COMPUTER,1999,32(8):P68-P75
    [25]M.Ester,H.P.Kriegel,J.Sander,and X.Xu.A density based algorithm for discovering clusters in large spatial databases with noise[C].In proceedings of 2nd International Conference on Knowledge Discovering in Databases an Data Mining(KDD-96),porland,Oregon,August 1996
    [26]Mihael Ankerst,Markus M.breunig,Hans-peter Krieget,etc.POTICS:ordering Points to identify the clustering strcture[C].In Proceedings of the ACM SIFGMOD Conference.Philadephia:ACM press.1999:P49-P60
    [27]Hinneburg A,Keim DA.An efficient approach to clustering in large multimedia databases with noise.in:Proceedings of the International Conference on Knowledge Discovery and Data Mining(KDD’98). NewYork,1998.P58-P65
    [28]Rui Xu,Donald Wunsh II.survey of clustering algorithm[J]. IEEE Transactions on NeuralNetworks ,2005,16(3),P645-P678
    [29]Wang W,Yang J,Muntz R.SRTING:A statistical information grid approach to spatial data mining. In proceedings of the 23rd International conference on very large databases.Athens,Greece,1997.P186-P195
    [30]R.Agrawal,J.Gehrke,D..Gunopulos,and P.Raghavan.Automatic subspace clustering of high dimensional data for data mining applications[A]. In prodeedings of the 1998 ACM SIGMOD international conference on management of data[C],ACM press ,1998 ,P94-P105
    [31]G.Sheikholeslami,S.Chatterjee,and A.Zhang.WaveCLuster:A multi-resolution clustering approach for very large spatial databases [A],In proc.1998 Int.Conf. Very large Databases(VLDB’98)[C]. NewYork, Aug.1998.
    [32]阎平凡,张长水.人工神经网格与模拟进化计算[M].清华大学出版社.2000
    [33]K.Y.Yeung,C.Raley,A.Murua,et al.Model-based clustering and data transformations for gene expression data[J]. Biinformatiics,2001.17(10).P977-P987
    [34]Xiaohua Hu. Knowledge Discovery in Databases : An Attribute-Oriented Rough Set Approach . Phd Thesis of the university of Regina , Canada , 1995
    [35]T.Mollestad, A Skowron .A rough set frame work for datamining of propositional default rules In : Z.W. Ras , M Michalewicz ed. ISMIS-96 : Ninth International Symposium on Methodologies for Intelligent Systems , Zakopane , Poland , June 10-13 , Lecture Notes In Artificial Intelligence 1079. Berlin Springer-Verlag , 1996 : 448-457
    [36] Xlaohua Hu .Using rough sets theory and database operations to construct a good ensemble of classifiers for data mining applications In : Proceedings of 2001 International Conference on Data Mining 2001 , 233-240
    [37] E. Orlowska ed. Incomplete Information: Rough Set Analysis New York: Physica-Verlag, 1998
    [38]刘清,黄兆华等,带Rough算子的决策规则及数据挖掘中的软计算,计算机研究与发展,1999 , 36 ( 7 ) : 800 -804
    [39]尹旭日等,一种基于Rough set理论的数据过滤方法,计算机研究与发展,2000 ,37 (9) : 1082一1086
    [40] A . Chouchoulas , Q. Shen . A rough set -based approach to text classification .In : N. Zhang , A. Skowron , S. Ohsuga ed. New Directions In Rough Sets , Data Mining , and Granular -Soft Computing , 7th International workshop , RSFDGrC ' 99 ,Springer–Verlag. 1999 , 118-127
    [41] A Chouchoulas, Q Shen Rough set-aided keyword reduction for text categorization Applied Artificial Intelligence 2001, 15 (9): 843 -873
    [42] Q. shen, A .chouchoulas. Rough set-based dimensionality reduction for supervised and unsupervised learning . Applied Mathematics and Computer Science , Special Issue on Rough Sets and their Applications , 2001 , 11 ( 3 ) : 583一601
    [43] H.S. Nguyen , A. Skowron , P Synak , J.Wroblewski. Knowledge discovery In databases : Rough set approach. In : M Mares , R Meisar , V Novak , J Ramik ed. Proceedings of the Seventh International Fuzzy Systems Assotiation world Congress ( IFSA‘97 ) 1997 , 204 -209
    [44]王国胤, Rough set理论与知识获取,西安交通大学出版社, 2001
    [45]王志海,基于粗糙集合的知识发现研究,合肥工业大学博士论文, 1998
    [46]T.Y Lin ed. Proceedings of the Third International workshop on Rough Sets and Soft Computing (RSSC’94 ) San Jose , California , USA , 1994
    [47]王珏等主编,第二届中国Rough Set与软计算学术研讨会(CRSSC ' 2002 )计算机科学, 2002 , 29 ( 9专刊)
    [48]J.Wroblewski.Finding minimal reducts using genetic algorithm. Warsaw University of Technology: ICS Research report, 1995, 16/95
    [49]刘清,Rough Set及Rough推理,科学出版社,2001
    [50]刘少辉,胡斐,贾自艳,等.一种基于Rough集的层次聚类算法[J].计算机研究与发展,2004,41(4):552—557
    [51]AK Jain, M N Murty, P J Flynn·Data clustering: A review·ACM Computing Surveys, 1999, 31(3): 264~323
    [52]S Guha, R Rastogi, K Shim·Rock: A robust clustering algorithm for categorical attributes·Information Systems, 2000, 25 (5):345~366
    [53]雷英杰,张善文,李续武,周创明,matlab遗传算法工具箱及应用[M],西安电子科技大学出版社,2006
    [54]任江涛,吴海建,吴向军,印鉴,张毅,《一种基于遗传算法的分裂式层次化聚类算法》[J],计算机应用2005.25.11,P2618-P2620
    [55] Xiaoyao Zhou,Haozhong cheng.The induction motor parameter estimation through an adaptive genetic algorithm.Universities Power Engineering Conference.2004,1:494-498P
    [56] HazemM.Abbas,Mohamed M.Bayoumi.Volterra system identification using adaptive real-code genetic algorithms.Applied Soft Computing.2004,5:75-86P
    [57]许文杰,刘希玉,基于改进免疫遗传算法的聚类分析研究与应用[J],计算机科学,2008.1,204-205
    [58]傅平,罗可,基于信息熵的免疫遗传算法聚类分析[J],计算机工程,2008年6期,227-228
    [59]朱玲利,李吉桂,鲍苏苏,基于遗传算法的聚类分析在CT图像分割中的应用[J],计算机科学,2006年10期,186-188
    [60]苏守金,刘仁金,基于佳点集遗传算法的聚类技术[J],计算机应用,2005年03期,P643-645
    [61]武兆慧,张桂娟,刘希玉,基于模拟退火遗传算法的聚类分析,计算机应用研究,2005年12期,24-26
    [62] JAIN AK, DUBESRC. Algorithms for clustering data[M]. Prentice Hal,l 1988.
    [63] MAULIK U, BANDYOPADHYAY S. Genetic algorithm-based clustering technique[J]. Pattern Recognition, 2000,33(9):1455-1465.
    [64] TSENG LY, YANG SB. A genetic approach to the automatic clustering problem[J]. Pattern Recognition, 2001, 34(2): 415-42
    [65] TSENG LY, YANG SB. A genetic clustering algorithm for datawith nonspherical-shape clusters[J]. PatternRecognition, 2000,33(7):1251-1259.
    [66] YANG SB, LEE YL. A genetic clustering algorithm for searching the non-spherical-shape clusters[A]. TONBIIN KW, Jr MERI-AUDEAU F, ed. Sixth InternationalConference on Quality Control by Artificial Vision, SPIE 5132[C], 2003.113-118
    [67] PAN H, ZHU J, HAN D. Genetic Algorithms Applied toMulti-ClassClustering forGeneExpressionData[J]. GenomicsProteomics Bioinformatics, 2003, 1(4)58-62
    [68] KRISHNA K, MUTYMN. GeneticK-MeansAlgorithm[J]. IEEE Transactions on Systems, Man and Cybernetics PartB: Cybernetics, 1999, 29(3): 433-439
    [69]余有明;刘玉树;阎光伟;遗传算法的编码理论与应用,计算机工程与应用, 2006年03期72-75
    [70]张思才;张方晓;一种遗传算法适应度函数的改进方法,计算机应用与软件, 2006年02期23-26
    [71]朱鳌鑫;遗传算法的适应度函数研究,系统工程与电子技术, 1998年11期116-119
    [72]夏桂梅;曾建潮;一种基于轮盘赌选择遗传算法的随机微粒群算法,计算机工程与科学, 2007年06期89-91
    [73]乐光学;基于Gnutella协议的分布式Peer-to-Peer网络连接管理策略及改进研究,计算机工程与应用,2004年29期213-216
    [74]康海燕;李彦芳;林培光;樊孝忠;信息检索策略性能的云模型评价方法,中文信息学报, 2005年01期15-18
    [75]李村合;孟文杰;基于分类评价的元搜索引擎调度策略,计算机工程与设计2008年05期56-58
    [76]王继民;国内综合性搜索引擎时新性的计算,计算机工程与应用, Computer Engineering and Applications,编辑部邮箱2003年21期78-81
    [77]覃晓;元昌安;基于遗传算法和自组织特征映射网络的文本聚类方法,计算机应用, 2008年03期46-50
    [78]张刚;刘悦;郭嘉丰;程学旗;一种层次化的检索结果聚类方法,计算机研究与发展, 2008年03期44-47
    [79]刘海卫;倪恩志;周昌乐;基于遗传算法的分类规则序列生成厦门大学学报(自然科学版) , 2008年02期29-31
    [80]陶志,许宝栋,汪定伟等.基于遗传算法的粗糙集知识简约方法[J].系统工程,2003,21(4):116-122.
    [81]李石君,王汉飞.关系数据库中统计关系的挖掘和应用[J ].计算机工程与应用,2000 ,43(6) .
    [82]郭建生,赵奕,施鹏飞.一种有效的用于数据挖掘的动态概念聚类算法[J ].软件学报,2001 ,12(4) .

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700