整合贪婪随机构造和多目标粒子群模型的双聚类算法在基因表达数据中的应用

英文题名：Application of Biclustering Integrated MOPSO and Greedy Randomized Construction to Gene Expression Data
作者：赖丽娜
论文级别：硕士
学科专业名称：计算机应用技术
中文关键词：基因表达数据 ; 双聚类 ; 多目标粒子群模型 ; 贪婪随机构造
英文关键词：Gene expression data ; Biclustering ; Multi-objective particle swarm optimization ; Greedy Randomized Construction
学位年度：2012
导师：刘桂霞
学科代码：081203
学位授予单位：吉林大学
论文提交日期：2012-04-01

摘要

高通量基因微阵列技术的出现，产生了大量的微阵列数据集，使得同时测量多个不同实验条件下成千上万个基因的表达水平成为可能。一般使用n m的数据矩阵组织基因表达数据，其中行对应着基因，列对应着实验条件。从这些微阵列数据集中挖掘局部模型对于生物信息学和药物研究都是一个非常重要的课题。
     聚类技术广泛的应用在基因表达数据分析中，它可以识别出具有相似表达谱的基因集合。但是根据细胞过程可知，一些基因子集仅在实验条件子集上才会有共调控和共表达的行为，而传统聚类技术不能识别出这样的基因子集。最早由Hartigan提出的双聚类技术能够同时进行行聚类和列聚类。后来，Cheng和Church将双聚类技术引入到基因表达分析上。双聚类技术能够有效的发掘出隐藏在基因表达数据中的局部模式。局部模式是指在实验条件子集下具有相似表达模式的基因子集，而不是在所有的实验条件下。目前，有大量的双聚类算法可以进行基因表达数据分析，如biclustering，SAMBA，谱双聚类，网格采样双聚类等等。大部分双聚类算法都使用局部搜索产生最优双聚类。最近几年，提出了很多的进化算法用于识别最优双聚类。并且，通过对进化算法进行改进，使其可以处理多目标优化问题。因此，研究者将这种基于启发式搜索的进化算法应用在基因表达数据分析上。
     在基因表达数据中搜索最优双聚类时，需要同时优化几个目标函数，这些目标函数分别是双聚类的大小，平均平方残基和行差异。那么，可以将多目标进化算法(MOEAs)应用到基因表达数据上搜索最优双聚类。MOEA算法中主要有三种Pareto占优弱化形式，分别是dominance， dominance和b dominance。使用Pareto占优弱化形式可以控制MOEA算法的收敛性，以及增加外部文档中粒子的多样性。在这三种形式中，由于
     dominance的执行效率和理论基础都较好，使其广泛应用在多目标进化算法中。粒子群算法（particle swarmoptimization，PSO）最早是由Kebnnedy和Eberhart受鸟群觅食行为启发而提出的。在处理单目标优化问题时，PSO具有较好的效率并且易于实现。由于粒子群算法具有快速收敛和执行效率高等特点，使得研究者更加关注于使用粒子群算法处理多目标优化问题，因此提出了多目标粒子群优化算法（multi-objective particleswarm optimization，MOPSO）并将其应用到双聚类搜索当中。
     在本论文中，我们提出一个新的改进算法，GR-MOPSO算法。该算法将dominance，拥挤距离（crowding distance）和随时间变化参数等方法融合到多目标粒子群模型上。首先，应用贪婪随机构造方法产生双聚类种子，并使用这些双聚类种子初始化粒子群中的所有粒子；其次，粒子群算法有三个重要的参数，即惯性权值w和加速度系数c1，c2。为了提高算法的收敛特性，将随时间变化的参数引入到多目标粒子算法中，即惯性权值和加速度系数随着迭代次数而变化，这样可以在早期阶段增强全局搜索，在后期阶段加强局部搜索；最后，为了控制外部文档的大小和保持文档中粒子具有较好的多样性，使用拥挤距离修剪外部文档的大小以及将较好的粒子保留在外部文档中。将GR-MOPSO算法分别应用在酵母菌和人类B-cells数据集上，并与MOEA和MOPSOB算法进行实验结果比较。实验结果表明：GR-MOPSO发现的双聚类质量比MOEA和MOPSOB算法发现的质量好。同时，GR-MOPSO使的种群具有较好的多样性以及快速收敛的特性。
     本论文的创新主要有以下两个方面：①通过分析可知要挖掘的双聚类的特征之间具有一定冲突，因此可以将多目标优化的思想应用到双聚类分析上。将dominance，拥挤距离（crowdingdistance），随时间变化的惯性权值和加速度系数等三种方法融合到多目标粒子群模型中，使得多目标粒子群算法在双聚类种子的基础上能够发现最佳的双聚类，并且粒子群中的粒子更具多样性，收敛速度较快。②在多目标粒子群初始化阶段，不是使用随机值对粒子的初始位置和个体最优进行赋值，而是通过贪婪随机构造阶段产生的双聚类种子进行初始化，这样可以使粒子具有一定的双聚类性质，可以提高算法搜索最优解的效率。
     未来的工作：首先，在大规模基因表达数据的双聚类分析中，阈值的大小决定着搜索到的双聚类个数以及双聚类质量。因此，需要进一步研究如何选取恰当的阈值，提高搜索到的双聚类质量以及算法效率。其次，由于粒子群算法没有使用交叉和变异等过滤操作，所以需要对算法的进行改进从而进一步增加粒子群中粒子的多样性，防止陷入极小值，同时提高算法的收敛速度。最后，算法搜索到的双聚类是否具有生物学意义需要进一步探究。
Due to the advent of High-throughput microarray technology, it generated large amountsof gene expression datasets. Microarray technology allows simultaneously to measure theexpression level of thousands of genes under different experimental conditions. Themicroarray datasets are generally presented as a n mdata matrix, where rows correspondto genes and columns correspond to experimental conditions. Mining local patterns whichshow similar expression patterns under subsets of conditions from those microarray datasets isan important research problem in bioinformatics and medical domain.
     Clustering technology has been widely used in gene expression profiles anaysis, and itcan identify sets of genes which has similar expression patterns under all expressionconditions. But in the cellular processes, it is knowledge that subsets of genes wereco-regulated and co-expression only under subsets of condtions. But clustering can notidentify such subsets of gene. Biclustering algorithms which were originally described byHartigan can perform simultaneous clustering of both rows and columns of the geneexpression matrix. Afterwards, biclustering was introduced to gene expression analysis byCheng and Church. It is a very useful technology to identify local patterns in which sets ofgenes shares similar expression under subsets of conditions but all of conditions. So far, manybiclustering algorithms have been developed for gene expression data analysis, such as
     biclustering, SAMBA, Gibbs sampling biclustering, spectral biclustering, and so on.Most of these methods use local search to generate optimal biclusters. In recent years, manyevolution algorithms (EA) have been proposed to indentify optimal bicluster in microarraydata in order to escape from local minima. And, through improving the evolution algorithms,EA is able to solve multi-objective optimization problem. So, more and more researchers areinterested in evolution algorithms based heutistics optimization methods.
     When searching optimal biclusters from gene expression data, we need to optimizesimultaneously several objectives, which are in conflict with each other. These severalobjectives are the size, mean squared score and row variance of the bicluster. In this situation,we can appy multi-objective evolutionary algorithms (MOEAs) to discover efficiently optimalbiclusters. MOEAs mainly proposed three relaxed forms of Pareto dominance, such as
     ε-dominance, α-dominanceb b-dominance. Applying these relaxed forms of Paretodomincace can regulate convergence of MOEA, and improve diversity of all particles in theexternal archive. Amonge these forms, is widely applied in the MOEAs, dueto its effectiveness and its theoretical foundation. Particle swarm optimization (PSO) firstlyintruduced by Kebnnedy and Eberhart is inspired by the behavior of a bird flock finding food.PSO has been found to be quite efficient in solving the single objective optimization problem.Because of rapid convergence and efficiency, researchers are interested in applying PSO tohandle multi-objective optimization problem. This methods is named as multi-objectiveparticle swarm optimization (MOPSO).
     In this paper, we propose a new inporoved algorithm, GR-MOPSO algorithm. Thismethod incorporate, crowding distance and time varying parametersapproach into MOPSO biclustering framework. Firstly, we apply greedy randomizedconstruction to generate bicluster seeds, then use these bicluster seeds to initialize all particlesin the particle swarm. Secondly, PSO has three vital parameters which are inertia w andacceleration coefficientsc1,c2. In order to improve the convergence of PSO, we introducetime varying parameters into MOPSO. These parameters are changing with the iterations,which can enhance global exploration at early stage and raise local exploitation at later stage.Lastly, with the purpose of controlling the size of external archive and maintaining the gooddiversity of particles in the archive, we appy the crowding distance to truncate archivepopulation. We respectively implement this new improved aglorithm GR-MOPSO on theyeast and human B-cells datasets, and compare the results with MOEA, MOPSOB. Theresults show: the quality of biclusters GR-MOPSO found are better than the results of MOEAand MOPSOB. And GR-MOPSO have well population diversity and rapid convergence.
     The innovation of this paper is mainly the following two aspects:①Through analyzingthe mining bicluster, we found the features of bicluster conflicting with each other.So,we canapply multi-objective optimization theory into the biclustering. Incorporated,crowding distance and time varying parameters approach into MOPSO framework,GR-MOPSO can search more optimal biclusters when application of the improved MOPSO tothe bicluster seeds.And,it can maintain diversity of particles in the particle swarm andenhance rapid convergence.②In the initial stage of multi-objective particle swarmoptimization, we use bicluster seeds to initialize the position and velocity of all particles in theswarm, not use random value. The bicluster seeds are the results of Greedy RandomizedConstruction. So, it can enable all particles to have the character of bicluster and enhance efficiency to search optimal solutions.
     The future work: Firstly, in the biclustering of largely gene expression data,thresholddetermines the number and quatity of found biclusters. So, we need to study how to selectsuitable threshold in order to improve quatity of biclusters and run efficiently.Secondly, due toPSO without the filting process such as crossover and mutation, we need to improve theparticle swarm for enhancing the diversity of the particles in the swarm and increasing rate ofconvergence. Lastly, we further rearch the biological significance of the biclustering results.

引文

[1] Cheng Y, Church GM. Biclustering of gene expression data[C]. Proceedings of theInternational Conference on Intelligent Systems for Molecular Biology,2000:93-103.
    [2] Hartigan JA. Direct clustering of a data matrix[J]. The American Statistical Association,1972,67(337):123-127.
    [3] Deb K, John Wiley, Sons. Multi-Objective Optimization Using EvolutionaryAlgorithms[J]. Chichester,2001:63-70.
    [4] Deb K, Pratap A, Agarwal S, Meyarivan T. A fast and elitist multiobjective geneticalgorithm: NSGA-II[J]. IEEE Transactions on Evolutionary Computation,2002,6:182-197.
    [5] Ron Edgar, Mlchael Domrachev, Alex E Lash. Gene Expression Omnibus: NCBI geneexpression and hybridization array data repository[J]. Nucleic Acids Research,2002,30(1):207-210.
    [6] DeRisi J L, Iyer V R, Brown P O, et al. Yeast microarrays for genome wide parallelgenetic and gene expression analysis[J]. Science,1997,94:13057-13062.
    [7] Alon U, Barkai N, et al. Broad patterns of gene expression revealed by clustering analysisof tumor and normal colon tissues probed by oligonucleotide arrays [J]. Proc. Natl. Acad.Sci. USA,1999(96):6745-6750.
    [8] Alvis Brazma, Jaak Vilo. Gene expression data analysis[J]. FEBS Letters480,2000:17-24.
    [9] J MacQueen. Some Methods for classification and Analysis of MultivariateObservations[C]. Proceedings of Fifth Berkeley Symposium on Mathematical Statisticsand Probability,1987,1:281-297.
    [10] G. Getz, E Levine, and E. Domany. Coupled Two-Way Clustering Analysis of GeneMicroarray Data[J]. Proc Natural Academy of Sciences US,2000:12079-12084.
    [11] Wassim Ayadi, Mourad Elloumi, Jin-Kao Hao. A biclustering algorithm based on aBicluster Enumberation Tree: application to DNA micrparray data[J]. BioData Mining,2009:2-7.
    [12] Teng L, Chan L. Discovering biclusters by iteratively sorting with weighted correlationcoefficient in gene expression data[J]. J Signal Process Syst,2008,50(3):267-280.
    [13] Aguilar Ruiz JS. Shifting and scaling patterns from gene expression data[J].Bioinformatics,2005,21:3840-3845.
    [14] Pontes B, Divina F, Giraldez R, Aguilar-Ruiz J. Virtual error: A new measure forevolutionary biclustering[C]. Evolutionary Computation, Machine Learning and DataMining in Bioinformatics,2007:217-226.
    [15] Lehmann EL, D’Abrera HJM. Nonparametrics: Statistical Methods Based on Ranks[J]. Inrev.ed Englewood Cliffs, NJ Prentice-Hall,1998:292-323.
    [16] Cheng K, Law N, Siu W, Liew A. Identification of coherent patterns in gene expressiondata using an efficient biclustering algorithm and parallel coordinate visualization[J].BMC Bioinformatics,2008,9(210):234-267.
    [17] Madeira SC, Oliveira AL. Biclustering algorithms for biological data analysis: Asurvey[J]. IEEE/ACM Transactions on Computational Biology andBioinformatics(TCBB),2004, I(I):24-45.
    [18] Praveen Kumar Tripathi, Sanghamitra Bandyopadhyay. Muliti-Objective Particle SwarmOptimization with time variant inertia and acceleration coefficients[J]. InformationSciences,2007,177:5033-5049.
    [19] Minqiang Li, Liu Liu, Dan Lin. A fast steady-state-dominance multi-objectiveevolutionary algorithm[C]. Springer Science+Business Media, LLC,2009:3-10.
    [20] Goldberg D E. Genetic Algorithms in Search, Optimization and Machine Learning[M].Boston USA: Addison-Wesley Longman Publishing Co,1989.
    [21] Chinchuluun A, Pardalos P. A survey of recent developments in multiobjectiveoptimization[J]. Ann Oper Res,2007,154(1):29–50.
    [22] Ikeda K, Kita H, Kobayashi S. Failure of Pareto-based MOEAs: does non-dominatedreally mean near to optimal?[J]. Evolutionary Computation,2001,2:957-962.
    [23] Deb K, Mohan M, Mishra S. Evaluating the epsilon-domination based multi-objectiveevolutionary algorithm for a quick computation of Pareto-optimal solutions[J]. EvolComput,2005,13(4):501–525.
    [24] Laumanns M, Thiele L, Deb K, Zitzler E. Combining convergence and diversity inevolutionary multiobjective optimization[J]. Evol Comput,2002,10(3):263–282.
    [25] Zitzler E, Thiele L. Multiobjective evolutionary algorithms: a comparative case study andthe strength Pareto approach[C]. IEEE Transaction on Evolutionary Computation,1999,3(4):257–271.
    [26] Kukkonen S, Deb K. Improved pruning of non-dominated solutions based on crowdingdistance for bi-objective optimization problems,2006[C]. Canada: IEEE Congress onEvolutionary Computation,2006:3995-4002.
    [27] Knowles J, Corne D. The Pareto archived evolution strategy: a new baseline algorithmfor Pareto multiobjective optimization,1999[C]. Washington DC: Proceedings of theCongress on Evolutionary Computation,1999:98-105.
    [28] Corne D W, Jerram N R, Knowles J D, Oates M J. PESA-II: Region-based selection inevolutionary multiobjective optimization[C]. Proceedings of the Genetic andEvolutionary Computation Conference (GECCO-2001),2001:283-290.
    [29] Jensen M T. Reducing the run-time complexity of multiobjective EAs: The NSGA-II andother algorithms[J]. IEEE Trans Evol Comput,2003,7(5):503–515.
    [30] Liu L, Li M, Lin D. A novel epsilon-dominance multi-objective evolutionary algorithmsfor solving DRS multi-objective optimization problems,2007[C]. Haikou, HaiNan: ThirdInternational Conference on Natural Computation,2007:286-301.
    [31] Schütze O, Laumanns M, Tantar E, Coello, et al. Convergence of stochastic searchalgorithms to gap-free Pareto front approximations,2007[C]. London, England:Proceedings of the9th Annual Conference on Genetic and Evolutionary Computation,2007:892-899.
    [32] J Kennedy, R Eberhart. Particle swarm optimization[C]. IEEE International ConferenceNeural Networks,1995:1942-1948.
    [33] Moore J, Chapman R. Application of Particle Swarm to Multi-objective Optimization [M].Department of Computer Science and Software Engineering, Auburn University,1999.
    [34] X Hu, R Eberhart. Multiobjective optimization using dynamicneighborhood particleswarm optimization, May2002[C]. Honolulu: Proc Congr Evolutionary Computation(CEC,2002),2002,2:1677–1681.
    [35] Junwan Liu, Zhoujun Li. Biclustering of microarray data with MOPSO based oncrowding distance [J]. BMC Bioinformatics,2009, IO(Suppl4): S9.
    [36] Divina F, Aguilar Ruiz JS. Biclustering of expression data with evolutionarycomputation[J]. IEEE transactions on knowledge and data engineering,2006,18:590-602.
    [37] Y Shi, R C Eberhart. Empirical study of particle swarm optimization[C]. IEEEInternational Congress on Evolutionary Computation,1999,3:101–106.
    [38] A Ratnaweera, S K Halgamuge, H C Watson. Self-organizing hierarchical particle swarmoptimizer with time-varying acceleration coefficients[J]. IEEE Transactions OnEvolutionary Computation,2004,8(3):240–255.
    [39] Smitha Dharan, Achuthsankar S Nair, Biclustering of gene e xpression data using reactivegreedy randomized adaptive search procedure[J]. BMC Bioinformatics,2009, IO(Suppl I):S27.
    [40] Prais M, Ribeiro CC. Reactive GRASP: An application to a matrix decompositionproblem in TDMA traffic assignment[J]. INFORMS Journal on Computing,2000,12:164-176.
    [41] Kennedy J, Eberhart RC. A discrete binary version of the particle swarm algorithm[C].Proceedings of IEEE Conference on Systems, Man and Cybernetics,1997:4104-4109.
    [42] Raquel C, Naval P Jr. An effective use of crowding distance in multiobjective particleswarm optimization[C]. Proceedings of the2005conference on Genetic and evolutionarycomputation,2005:257-264.
    [43] Cho RJ, Campbell MJ, Winzeler EA, Steinmetz L, Conway A, Wodicka L, Wolfsberg TG,Gabrielian AE, Landsman D, Lockhart DJ. A Genome-Wide Transcriptional Analysis ofthe Mitotic Cell Cycle[J]. Molecular Cell,1998,2:65-73.
    [44] Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, SabetH, Tran T, Yu X. Distinct types of diffuse large B-cell lymphoma identified by geneexpression profiling[J]. Nature,2000,403:503-511.
    [45] Mitra S, Banka H. Multi-objective evolutionary biclustering of gene expression data[J].Pattern Recognition,2006,39:2464-2477.
    [46] Liu JW, Li ZJ, Liu FF, Chen YM. Multi-Objective Particle Swarm OptimizationBiclustering of Microarray Data[C]. IEEE International Conference on Bioinformaticsand Biomedicine,2008:363-366.
    [47] Mitra S, Banka H. Multi-objective evolutionary biclustering of gene expression data[J].Pattern Recognition,2006,39:2464-2477.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700