用户名: 密码: 验证码:
基于粒子对和极值优化的基因聚类混合算法研究
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
随着人类基因组计划的完成,生命科学的研究进入到后基因组时代,研究的重点已变为确定每条基因在生物体中的功能以及基因之间相互作用和调控的关系。作为后基因组时代功能基因组研究最基本的实验手段,基因芯片一次实验可以同时观测成千上万条基因在不同实验条件下的表达情况,从而产生了大量蕴含着基因活动信息的基因表达数据。如何分析和处理这些基因表达数据,以提取出对人类有意义的生物、医学信息,已成为后基因组时代人们关注和研究的热点。目前,聚类方法是对基因表达数据进行分析和处理的主要计算技术之一。通过对基因表达数据进行聚类,能够将表达模式相似或相同的基因归纳成类,有助于对基因功能、基因调控、细胞过程、细胞亚型等进行综合的研究,在补充未知基因的生物学功能注释、临床诊断治疗等方面具有重要的现实意义。因此,已有大量国内外学者提出了应用到基因表达数据聚类分析中的各种聚类算法。作为一种较新颖的基因聚类算法,粒子对算法(PPO)在一些基因表达数据集中获得了较好的聚类效果,但也存在着一些有待解决的问题。本文就是围绕着如何进一步提高PPO算法的聚类效果开展研究,主要做的相关研究工作如下:
     (1)对生物信息学的相关基础知识进行了简单介绍,接着对基因表达数据的获得、表示、预处理、聚类分析原理和聚类结果评价进行了较为详细的分析,最后获取了本文进行聚类分析实验所用到的两组基因表达数据集。
     (2)对K-means、层次聚类这两种传统的基因聚类算法的原理进行了简单分析,接着介绍了标准粒子群优化算法(PSO)的原理,并分析了粒子群聚类算法的原理和优缺点,最后对基本PPO算法的原理、聚类流程和特点进行了较为详细的阐述。
     (3)对基本PPO算法进行了较为深入的研究,分析了PPO算法存在着有待解决的3个问题,并相应提出了3种改进思路:用K-means快速聚类结果初始化一个粒子、为初始粒子对之间引入一种最优信息共享策略、根据粒子对的统计信息对属于不同类别的粒子采用不同的速度进化公式,由此得到了一种新的改进粒子对算法ImPPO。为验证改进思路和改进算法ImPPO的有效性,采用了三个基因表达数据集进行聚类分析实验。实验结果表明,与K-means、基本PPO算法相比,提出的改进思路和改进算法ImPPO在一些基因表达数据集中获得了较好的聚类效果,并且再一次说明了对于不同的聚类算法,甚至同一聚类算法使用不同的参数,应用到同一基因表达数据集中可能会得到不同的聚类结果。
     (4)在对基本极值优化算法(EO)的原理、特点进行分析的基础上,结合PPO和EO算法的优点,提出了一种新的基因聚类混合算法PPO-EO。混合算法PPO-EO在精英粒子对的迭代过程中根据一定的迭代次数将EO算法引入到PPO算法中,一方面利用EO算法强大的局部搜索能力的优点克服PPO算法后期可能过早陷入局部最优的缺点,另一方面利用PPO算法能够保证全局收敛的优点克服EO算法不能保证收敛的缺点,发挥二者的优势完成基因聚类,以提高基因聚类结果的精度。为评价混合算法的聚类效果,通过采用另外三个基因表达数据集进行了聚类分析实验。实验结果表明,混合算法PPO-EO在三个聚类评价指标均方差函数、类内紧致性和类间分离度方面获得了比K-means、PPO算法更好的聚类结果精度。
With the accomplishment of human genome project, life science research enters into the post-genome era, and the key point of research has changed into determining each gene's function in organism as well as the relations of interaction and regulation among genes. As the most basic experiment method of function genome study in the post-genome era,gene-chip each experiment can simultaneously monitor the expression of thousands of genes under different experimental condition,which thus has generated massive gene expression data containing gene activity information. How to analyze and handle these gene expression data, digging out valuable biology and medical information towards human,has become a hotspot of concern and study in the post-genome era. At present,cluster method is one of the major computation technologies for analyzing and handing gene expression data. Clustering the gene expression data can classify the genes of similar or same expression patterns into a category,which is helpful to synthetic research on gene function,gene regulation,cell process and cell hypotype and so on,and has vital practical significance in suppling unknown genes'biology function annotation,clinical diagnosis treatment and so on. Therefore,massive domestic and foreign scholars have proposed all kinds of clustering algorithm applied to gene exression data clusting analysis. As one kind of novel gene clustering algorithm,Particle-Pair Optimization(PPO) has obtained better clustering effect in some gene expression data sets,but it also has some questions to be solved. This article carries on research around how to further enhance the cluster effect of PPO algorithm,and main related research work is as follows:
     (1) We simply introduced the related elementary knowledge of Bioinformatics,then detaily analyzed the obtaining,expression,preprocessing,cluster analysis principle and cluster result validation of gene expression data,finally gained two group of gene expression data sets which were used to cluster analysis experiment in this article.
     (2) We simply analyzed the principle of two traditional cluster algorithms that K-means and hierarchical clustering,then introduced the principle of standard particle swarm optimization algorithm(PSO) and analyzed the principle,merit and drawback of particle swarm cluster algorithm,finally detaily elaborated the principle,cluster flow and characteristic of PPO algorithm.
     (3) Based on studying the basic PPO algorithm thorougher, we analyzed three questions to be solved for the PPO algorithm,and proposed three corresponding improved threads:initializing a particle with the fast cluser result of K-means,introducing a sharing strategy of best information between the initial particle-pair,using different velocity evolution formula for the particle that belongs to different category according to the statistical information of particle-pair,which formed an improved particle-pair algorithms called ImPPO. Finally,in order to validate the effectiveness of improved schemes and ImPPO,we used three gene expression data sets to do the experiments of cluster analyzing. The experimental results indicated that improved schemes and ImPPO had better cluster effect than K-means and basic PPO algorithm in som gene expression data sets,and again showed that different cluster algorithm,even same cluster algorithm with different parameter might produce different cluster result in the same gene expression data sets.
     (4) Based on analyzing the principle and characteristic of basic Extremal Optimization(EO) algorithm,we proposed a new gene cluster hybrid algorithm called PPO-EO by combining the merits of PPO and EO algorithm.The extremal optimization algorithm was introduced in the iteration process of elitist particle pair according to interval iteration. On the one hand,PPO-EO used the merit of EO with power local search capability to avoid PPO falling into local optimization premature in the later period,on the other hand,it used the merit of PPO with ensuring convergence to overcome the drawback of EO without ensuring convergence. PPO-EO completed gene cluster by performing the merits of PPO and EO,which could improve the cluster precision. Finally, in order to evaluate the effectiveness of hybrid algorithm,we used the other three gene expression data sets to do the experiments of cluster analyzing. The experiment results indicated that hybrid algorithm was better precision than K-means and PPO algorithm in three cluster evaluating indicators that mean square error,homogeneity and separation.
引文
[1]钟扬,张文娟,王莉,等.基因计算[M].上海:上海教育出版社,2005.
    [2]孙啸,陆祖宏,谢建明.生物信息学基础(第1版)[M].北京:清华大学出版社,2005.
    [3]李瑶.基因芯片技术一解码生命[M].北京:化学工业出版社,2004.
    [4]纪震,廖惠连,吴青华.粒子群算法及应用[M].北京:科学出版社,2009.
    [5]Jain A K,Murty M N, Flynn P J. Data ClusteringrA Review[J]. ACM Computing Surveys,1999,31 (3):264-323.
    [6]Jiang D X, Tang C, Zhang A D. Cluster analysis for gene expression data:a survey[J]. IEEE Transaction on Knowledge and Data Engineering,2004,16(11):1368-1370.
    [7]陆媛,杨慧中.基于代表熵的基因表达数据聚类分析方法[J].计算机工程与应用,2008,44(27):151-153.
    [8]马志强,孙平平,马雅楠,等.改进K-means聚类算法在基因系统发育谱分析中的应用[J].生物信息学,2008,6(4):82-84.
    [9]Du Zhihua, Wang Yiwei, Ji Zhen.PK-means:A new algorithm for gene clustering [J]. Computational Biology and Chemistry,2008,32(4):243-247.
    [10]张亮,张岩,周一鸣,等.用聚类法分析受抗真菌物质处理后的酵母细胞全基因表达谱[J].生物化学与生物物理进展,2002,29(4):538-542.
    [11]Veer L J, Dal H, Vijver M J, etal.Gene expression profiling predicts clinical outcome of breast cancer[J]. Nature,2002,4:530-536.
    [12]阮晓钢,李晓明,王金莲.边介数聚类算法在肿瘤基因表达谱中的应用[J].北京工业大学学报,2008,34(7):696-700.
    [13]黎刚果,王正志.结合蛋白质相互作用数据进行基因表达数据聚类[J].生物信息学,2009,7(4):280-283.
    [14]岳峰,孙亮,王宽全,等.基因表达数据的聚类分析研究进展[J].自动化学报,2008,34(2):113-119.
    [15]Mac Queen J B. Some methods for classification and analysis of multivariate observations[A]. In:Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability[C]. Berkeley:University of California Press,1967:281-297.
    [16]Milligan G W, Cooper M C.A study of the comparability of external criteria for hierarchical cluster analysis[J]. Multivariate Behavioral Research,1986,21: 441-458.
    [17]Fritzke B. Growing cell structures-a self-organizing network for unsupervised and supervised learning[J]. Neural Networks,1994,7:1141-1160.
    [18]Tavazoie S, Huges J D, Campbell M J. Systematic determination of genetic network architecture[J]. Nature Genetics,1999,22(3):281-285.
    [19]Eisen M B, Spellman P T, Brown P 0, etal. Cluster analysis and display of genome-wide expression patterns[J]. Proc Natl Acad Sci USA,1998,95:14863-14868.
    [20]Tamayo P, Slonim D, Mesirov J, etal. Interpreting patterns of gene expression with self-organizing maps:methods and application to hematopoietic differentiation[J]. Proc Natl Acad Sci USA.1999,96:2907-2912.
    [21]Gasch A P, Eisen M B. Exploring the conditional coregulation of yeast gene expression through fuzzy K-means clustering[J]. Genome Biology,2002,3(11).
    [22]Belacel N, Hansen P, Mladenovic N. Fuzzy J-means:a new heuristic for fuzzy clustering[J]. Pattern Recognition,2002,35(10):2193-2200.
    [23]Belacel N, Cuperlovic-Culf M, Laflamme M, etal. Fuzzy J-means and VNS methods for Clustering genes from microarray data[J]. Bioinformatics,2004,20(11):1690-1701.
    [24]Gustafson D E, Kessel W C. Fuzzy clustering with a fuzzy covariance matrix [A]. In Proceedings of the 17th Conference on Decision and Control Including Symposium on Adaptive Processes[C]. San Diego, USA:IEEE,1979:761-766.
    [25]Kim D W, Lee K H, Lee D. Detecting clusters of different geometrical shapes in microarray gene expression data[J]. Bioinformatics,2005,21(9):1927-1934.
    [26]Herrero J,Valencia A, Dopazo J. A hierarchical unsuper-vised growing neural network for clustering gene expression patterns[J]. Bioinformatics,2001,17(2): 126-136.
    [27]Feng Luo, Khan L, Bastan F, etal. A dynamically growing self-organizing tree (DGSOT) for hierarchical clustering gene expression profiles [J]. Bioinformatics, 2004,19(16):2605-2617.
    [28]Ji X L, Yuan Y, Li Y D, Sun Z R.HMMGEP:clustering gene expression data using hidden Markov models[J]. Bioinformatics,2004,20(11):1799-1800.
    [29]Lee S I, Batzoglou S. Application of independent component analysis to microarrays[J]. Genome Biology,2003,4(11):76.
    [30]Hinneburg A, Keim D A. An effcient approach to clustering in large multimedia database with nosie[A]. In Proceedings of the 4th International Conference on Knowledge Discovery in Databases[C]. New York, USA:AAAI,1998:58-65.
    [31]何宏,谭永红.基于计算智能的基因表达数据聚类分析研究进展[J].信息与控制,2009,38(6):743-751.
    [32]Krishna K,Murty N M. Genetic K-means algorithm[J]. IEEE Transactions on Systems, Man, and Cybernetics-Part B:Cybernetics,1999,29(3):433-439.
    [33]Lu Y, Lu S, Fotouchi F, etal. FGKA:a fast genetic K-means clustering algorithm[A] . In-.Proceedings of the 2004 ACM Symposium on Applied Computing [C]. New York, NJ, USA: ACM,2004:622-623.
    [34]Lu Y, Lu S, Fotouhi F, etal. Incremental genetic K-means algorithm and its application in gene expression data analysis[J]. BMC Bioinformatics,2004,5:172.
    [35]Xiao X, Dow E R, Eberhart R C, etal. Gene clustering using self-organizing maps and particle swarm optimization[A]. In Proceedings of the 17th International Symposium on Parallel and Distributed Processings[C]. Los Alamitos, CA, USA:IEEE Computer Society,2003:1-10.
    [36]Sun J, Xu W B, Feng B. Adaptive parameter control for quantum-behaved particles swarm optimization on individual level[A]. In Proceedings of 2005 IEEE International Conference on Systems, Man and Cybernetics.vol.4[C]. Piscataway, NJ, USA:IEEE,2005:3049-3054.
    [37]Chen W, Sun W, Ding Y R, etal. Clustering of gene expression data with quantum-behaved particle swarm optimization[A]. In:New Frontiers in Applied Artificial Intelligence[C]. Berlin, Germany:Springer-Verlag,2008:388-396.
    [38]高倩倩,须文波,孙俊.量子行为粒子群算法在基因聚类中的应用[J].计算机工程与应用,2010,46(21):152-155.
    [39]高倩倩.基因表达数据的聚类算法研究及其实现[D].无锡:江南大学,2009.
    [40]Xu W B,Sun J. Adaptive Parameter Selection of Quantum-Behaved Particle Swarm Optimization on Global Level[J]. Springer-Verlag Berlin Heidelberg,2005,1:420-428.
    [41]张国印,程慧杰,刘咏梅,等.一种新算法在基因表达谱聚类中的应用[J].计算机工程与应用,2009,45(36):216-218.
    [42]杨春梅.基因表达数据聚类分析算法研究和应用[D].天津:天津大学,2006.
    [43]王勇.聚类方法在生物数据中的研究与应用-基因表达数据聚类方法研究[D].无锡:江南大学,2008.
    [44]Boettcher S, Percus A G. Extremal optimization:Methods Derived from co-evolu tion[A]. In Proceedings of the Genetic and Evolutionary Computation Conference [C]. San Francisco:Morgan Kaufmann,1999:825-832.
    [45]梅丽.人类启动子识别算法研究[D].大连:辽宁师范大学,2010.
    [46]Schena M, Shalon D, Heller R. Parallel human genome analysis:Microarray-based expression monitoring of 1000 genes [J]. Proc. Natl. Acad. Sci. USA,1996,93 (20):10614 -10619.
    [47]Xiang C, Chen Y. cDNA microarray technology and its applications[J]. Biotechnology Advances,2000,18:35-46.
    [48]Smyth G K, Speed T.Normalization of cDNA microarray data[J]. Methods,2003, 31(4):265-273.
    [49]Pierre B,Anthony D L.A Bayesian framework for the analysis of microarray expression data:Regularized t-test and statistical inferences of gene changes [J]. Bioinformatics,2001,17(6):509-519.
    [50]Kaski S. Learning metrics for exploratory data analysis[J]. Neural Networks for Signal Processing,2001, XI:53-62.
    [51]Jain A, Dubes R. Algorithms for Clustering Data[M]. New Jersey:Prentice Hall, 1988.
    [52]Ji X L, Li L J, Sun Z R. Mining gene expression data using a novel approach based on hidden Markov models[J]. FEBS Letters,2003,542(1-3):123-131.
    [53]Cho R J, Campbell M J, Winzeler E A, etal. A genome-wide transcriptional analysis of the mitotic cell cycle[J]. Molecular Cell,1998,2(1):65-73.
    [54]Yeung K Y, Haynor D R,Ruzzo W L.Validating clustering for gene expression data[J]. Bioinformatics,2001,12:201-205.
    [55]Tavazoie S, Huges J D, Campbell M J, etal. Systematic determination of genetic network architecture[J]. Nature Genetics,1999,22:281-285.
    [56]Kennedy J, Eberhart R C. Particle Swarm Optimization[A]. In:Proc IEEE Int'1 Conf. on Neural Networks, IV[C]. Piscataway, NJ:IEEE Service Center,1995,1942-1948.
    [57]Shi Y, Eberhart R C.A Modified Particle Swarm Optimizer [A]. In:Proceedings of the IEEE Internattional Conference on Evolutionary Computation[C]. Piscataway, NJ: IEEE Press,1998,69-73.
    [58]曾建潮,介婧,崔志华.微粒群算法[M].北京:科学出版社,2004.
    [59]Shen Qi, Shi Wei-Min, Kong Wei. Hybrid particle swarm optimization and tabu search approach for selecting genes for tumor classification using gene expression data[J]. Computational Biology and Chemistry,2008,32(10):53-56.
    [60]Chuang Li-Yeh, Chang Hsueh-Wei,Tu Chung-Jui, etal. Improved binary PSO for feature selection using gene expression data[J]. Computational Biology and Chemistry,2008,32(9):29-38.
    [61]纪震,廖慧连,许文焕,等.粒子对算法在图像矢量量化中的应用[J].电子学报,2007,35(10):1916-1920.
    [62]Bak P. How nature works:The science of self-organized criticality[M]. New York:Springer,1996.
    [63]Bak P, Tang C, Wiesenfeld K. Self-organized criticality:An explanation of the 1/f noise[J]. Physical Review Letters,1987,59(4):381-384.
    [64]Bak P, Sneppen K. Punctuated Equilibrium and Criticality in a Simple Model of Evolution[J]. Physical Review Letters,1993,71(24):4083-4086.
    [65]李晓红,熊盛武.极值优化算法研究[J].武汉理工大学学报,2009,31(3):40-44.
    [66]齐洁,汪定伟.极值优化算法综述[J].控制与决策,2007,22(10):1081-1085.
    [67]欧阳金华,孙季丰.基于改进粒子群算法的矢量量化码书设计研究[J].科学技术与工程,2010,10(28):7022-7025.
    [68]Kennedy J. The particle swarm:social adaption of knowledge[A]. In:Proceedings of International Conference on Evolutionary Computation[C]. Piscataway,NJ,1997: 303-308.
    [69]吴延科,徐晨,李国.基于粒子群统计规律的PSO算法[J].郑州大学学报(理学版),2006,38(4):98-101.
    [70]毕晓君,刘国安.种群分类粒子群改进算法[J].哈尔滨工程大学学报,2008,29(9):991-996.
    [71]栾丽君,谭立静,牛奔.一种基于粒子群优化算法和差分进化算法的新型混合全局优化算法[J].信息与控制,2007,36(6):708-714.
    [72]Chen Min-rong, Li Xia, Zhang Xi,etal.A novel particle swarm optimizer hybridized with extremal optimization[J]. Applied Soft Computing,2010,10(2):367-373.

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700