Research on Tumor Classification Algorithms Based on Gene Expression Data
Abstract
With the rapid development of gene chip technology, an ever-growing volume of tumor gene expression data can now be measured. Early tumor diagnosis at the molecular-biology level, based on gene expression data, is of great significance: a timely, accurate diagnosis improves the effectiveness of subsequent treatment, whereas a misdiagnosis may cause a cancer patient to miss the best window for treatment. Gene expression data, however, are high-dimensional, imbalanced in class distribution, and small in sample size, and how to analyze, process, and exploit such data effectively has drawn wide attention. Because of the large number of redundant genes and substantial noise, the classification performance achieved on gene expression data has not yet reached a practical level, so current research focuses on two problems: (1) how to extract the few key disease-related genes from high-dimensional data, and (2) how to find the most suitable classification algorithm and improve its performance.
     This thesis builds classification models for tumor gene expression data mainly with neural networks and the Extreme Learning Machine (ELM), and validates the proposed methods experimentally on several tumor and non-tumor data sets. The main contributions are as follows:
     1) For dimensionality reduction of high-dimensional gene expression data, a gene selection method based on information gain and a genetic algorithm is proposed, casting feature-gene selection as a global optimization problem. In the genetic-algorithm search stage, the ratio of the between-class distance to the within-class distance serves as the fitness function, yielding a model-independent gene selection algorithm that reduces the data dimensionality. Experiments show that each selected feature is closely related to the classification target and that the selection improves the generalization ability of the classifier.
     2) For the imbalance and small-sample problems of gene expression data, class balance is pursued by enlarging the minority class and shrinking the majority class. Feature selection first retains the features critical to classification; then, drawing on the SMOTE over-sampling theory, the FS-Sampling algorithm is proposed. Experiments show that the method balances the data distribution well, effectively reduces the imbalance, and markedly improves the classification accuracy of the minority class.
     3) To address the effect of the data distribution on the approximation accuracy of neural-network models and the unstable performance of a single ELM, an ensemble classifier is constructed at the data level, based on an ensemble strategy driven by differences between training sets, and an ensemble algorithm based on sample-set splitting is proposed. First, the sample set is split into k equal parts; then a training set is drawn by random sampling from k-1 of the parts, and this is repeated n times to train n base classifiers; finally, the classifiers are combined by majority voting. Experiments confirm that the algorithm increases the diversity among base classifiers and effectively improves the ensemble classification accuracy.
     4) For the unstable performance of a single ELM, classifiers are ensembled from the perspective of differences in their outputs, and a dissimilarity-based ELM ensemble algorithm driven by an output-inconsistency measure (D-D-ELM) is proposed. First, the dissimilarity of multiple ELM models is judged by the output-inconsistency measure; next, models are discarded according to the average ELM classification accuracy; finally, the remaining models are combined by majority voting. The algorithm is supported by a theoretical proof and experimental validation, and the results show that it reaches a comparatively stable classification accuracy with fewer models.
     5) To lower decision risk and average cost, the classification problem with an embedded rejection cost and asymmetric misclassification costs is studied with minimum classification cost as the objective, and an ELM algorithm embedding misclassification and rejection costs is proposed. Building cost-sensitive factors into the algorithm enables the cost-embedded ELM to handle data with different costs directly. Experiments confirm that the algorithm effectively reduces the average misclassification cost and improves classification reliability.
     In summary, for the challenging problems in classifying tumor gene expression data, effective gene selection and over-sampling synthesis methods are proposed to tackle high dimensionality with small samples, dimensionality reduction, and imbalanced distributions. These methods not only improve classifier performance but also remove the interference of many irrelevant genes, helping to locate feature genes that discriminate disease and supporting related diagnosis. For the classification stage, ensemble models based on neural networks and the ELM are proposed, realizing ensemble algorithms driven by differences between data sets and by differences between classifier outputs, with cost-sensitive factors embedded to reflect the differing importance of data in tumor recognition. Together, this work builds an algorithmic framework suited to classifying gene expression data, improves the classification accuracy of tumor gene expression data, resolves to some extent the difficult problems of this field, and has theoretical and practical value for advancing research on high-dimensional, imbalanced data.
With the rapid development of gene chip technology, more and more tumor gene expression data can be measured. Early diagnosis of tumors at the molecular-biology level based on gene expression data is very important: an accurate early diagnosis greatly benefits treatment, while a misdiagnosis may cause cancer patients to miss the best treatment opportunity. Gene expression data typically exhibit high dimensionality, an imbalanced class distribution, and a small sample size, so how to effectively analyze, process, and use such data has drawn increasing attention from researchers in this area. Owing to the large number of redundant genes and noise, classification performance on gene expression data has not yet reached a practical level. To solve the classification problems of tumor gene expression data, current research focuses on two aspects: (i) identifying the few critical causative genes in high-dimensional data, and (ii) developing the most suitable algorithms and improving their performance.
     This thesis studies a machine learning algorithm, the extreme learning machine (ELM), to build classification models and predict gene expression information. Several tumor and non-tumor data sets are used to validate the developed algorithms. The contributions of this dissertation are briefly described as follows:
     (1) A gene selection method based on a genetic algorithm and information gain is proposed to sharply reduce the dimensionality of the data. Genetic evolution transforms the gene selection problem into a global optimization problem. The fitness function is the ratio of the between-class distance to the within-class distance in the genetic-algorithm search stage, which makes the method a model-independent gene selection algorithm for reducing the data dimension. Experimental results show that the selected features are closely related to the classification objective and improve the generalization performance of the classifier.
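The fitness function above, the ratio of the between-class distance to the within-class distance over a candidate gene subset, can be sketched in a few lines of NumPy. This is a minimal illustration under assumed conventions (binary labels, Euclidean distances), not the thesis implementation; the `fitness` name and toy data are hypothetical.

```python
import numpy as np

def fitness(X, y, mask):
    """Ratio of between-class to within-class distance for the selected genes.

    X: (n_samples, n_genes) expression matrix; y: binary 0/1 labels;
    mask: boolean gene-selection vector (one individual of the GA population).
    """
    Xs = X[:, mask]
    c0, c1 = Xs[y == 0], Xs[y == 1]
    m0, m1 = c0.mean(axis=0), c1.mean(axis=0)
    # distance between the two class means
    between = np.linalg.norm(m0 - m1)
    # mean distance of samples to their own class mean, summed over classes
    within = (np.linalg.norm(c0 - m0, axis=1).mean()
              + np.linalg.norm(c1 - m1, axis=1).mean())
    return between / within
```

A genetic algorithm would evolve a population of such masks, ranking individuals by this score; a mask selecting a gene that separates the classes scores far higher than one selecting pure noise.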
     (2) To address the imbalance and small-sample problems of gene expression data, the idea of expanding the minority class and reducing the majority class is explored, and the FS-Sampling algorithm is put forward. Crucial features are first selected by analyzing the characteristics of the gene expression data, and minority-class samples are then synthesized following the SMOTE sampling theory. Experiments show that the presented method balances the data distribution well and effectively improves the classification accuracy on tumor data.
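The SMOTE step that FS-Sampling builds on synthesizes new minority samples by interpolating each minority sample toward one of its nearest minority neighbours. A minimal NumPy sketch of that interpolation (assuming Euclidean neighbours; the `smote_oversample` name is hypothetical, and the full FS-Sampling algorithm additionally performs feature selection first):

```python
import numpy as np

def smote_oversample(X_min, n_new, k=3, rng=None):
    """Generate n_new synthetic minority samples (SMOTE-style).

    Each synthetic point lies on the segment between a random minority
    sample and one of its k nearest minority neighbours.
    """
    rng = np.random.default_rng(rng)
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to every other minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]      # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                       # interpolation factor in [0, 1)
        synth.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synth)
```

Because every synthetic point is a convex combination of two existing minority points, the new samples stay inside the region the minority class already occupies, unlike naive duplication.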
     (3) To study the impact of the data distribution on the approximation accuracy of neural-network models and the unstable performance of a single ELM, an ensemble algorithm based on dataset splitting is presented, built on an ensemble strategy driven by dataset differences. First, the original training set is divided into k disjoint subsets. Second, random re-sampling over k-1 of the k subsets produces a training set with which a neural-network classifier is trained; repeating this procedure n times yields n networks. Finally, the class label of unseen data is predicted by majority voting over the ensemble. Experimental results show that the algorithm enhances the diversity of the networks and effectively improves the accuracy of the classifier ensemble.
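The split-resample-vote procedure can be sketched as follows. This is an illustrative reconstruction, not the thesis code: a nearest-centroid classifier stands in for the neural-network/ELM base learner, and the `split_ensemble` and `majority_vote` names are assumptions.

```python
import numpy as np

class NearestCentroid:
    """Stand-in base learner (the thesis trains neural networks / ELMs)."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None], axis=2)
        return self.classes_[d.argmin(axis=1)]

def split_ensemble(X, y, n_models=7, k=5, seed=0):
    """Split the data into k folds; each base model trains on a random
    resample drawn from k-1 of the folds, so training sets differ."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    models = []
    for _ in range(n_models):
        held_out = rng.integers(k)               # leave one fold out
        pool = np.concatenate([f for i, f in enumerate(folds) if i != held_out])
        idx = rng.choice(pool, size=len(pool), replace=True)
        models.append(NearestCentroid().fit(X[idx], y[idx]))
    return models

def majority_vote(models, X):
    """Combine base-classifier predictions by majority voting."""
    votes = np.stack([m.predict(X) for m in models])
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

Holding out a different fold per model makes the training sets (and hence the base classifiers) diverse, which is what the ensemble strategy relies on.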
     (4) To cope with the unstable performance of a single ELM, an ensemble is constructed at the level of classifier outputs. Starting from the differences among the classifier outputs, an ensemble of selected classifiers with large dissimilarity, named D-D-ELM, is presented. First, the diversity of the ELM models is judged by a disagreement measure over their outputs; then, models whose classification accuracy falls below the average are removed; finally, the selected models are combined by majority voting. Both theoretical analysis and experimental results demonstrate that the algorithm effectively improves the accuracy of the ensemble through the large diversity among the networks.
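The two pruning criteria, pairwise output disagreement and below-average accuracy, can be sketched as a selection routine over the models' validation predictions. A minimal sketch under assumed conventions (the threshold `theta`, the greedy accuracy-first order, and the function names are illustrative, not the thesis specification):

```python
import numpy as np

def disagreement(pred_a, pred_b):
    """Output-inconsistency measure: fraction of samples labelled differently."""
    return np.mean(pred_a != pred_b)

def select_diverse_accurate(preds, y, theta=0.1):
    """Keep models with at least average accuracy that disagree with every
    already-kept model by more than theta; return the kept indices."""
    acc = np.array([np.mean(p == y) for p in preds])
    keep = []
    for i in np.argsort(-acc, kind="stable"):    # most accurate first
        if acc[i] < acc.mean():
            continue                              # prune below-average models
        if all(disagreement(preds[i], preds[j]) > theta for j in keep):
            keep.append(i)                        # sufficiently dissimilar
    return keep
```

The surviving subset can then be combined by majority voting; duplicated models contribute no diversity and are filtered out, which is how the method reaches stable accuracy with fewer models.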
     (5) To reduce decision risk and average cost, and taking the minimum classification cost as the objective, classification with an embedded rejection cost and asymmetric misclassification costs is studied, and an ELM algorithm embedding misclassification and rejection costs is proposed. Embedding the cost-sensitive factors enables the algorithm to handle data with different costs directly. Experiments show that the method effectively reduces the total classification cost and improves the reliability of tumor classification.
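The decision logic behind combining asymmetric misclassification costs with a rejection cost can be illustrated by a minimum-expected-cost rule over class posteriors (Chow-style rejection). The cost values and the `min_cost_decision` name below are illustrative assumptions, not the thesis parameters, and the thesis embeds the costs inside the ELM training rather than as a post-hoc rule:

```python
import numpy as np

def min_cost_decision(probs, cost_fp=1.0, cost_fn=5.0, cost_reject=0.5):
    """Per sample, pick the action minimising expected cost.

    probs: posterior probability of class 1 (e.g. tumor) for each sample.
    Returns 0 (predict class 0), 1 (predict class 1), or 2 (reject).
    Missing a tumor (fn) is priced higher than a false alarm (fp).
    """
    p1 = np.asarray(probs, dtype=float)
    exp_cost = np.stack([
        p1 * cost_fn,                    # predict 0: pay cost_fn if truly class 1
        (1 - p1) * cost_fp,              # predict 1: pay cost_fp if truly class 0
        np.full_like(p1, cost_reject),   # reject: fixed rejection cost
    ])
    return exp_cost.argmin(axis=0)
```

Confident samples are classified, while ambiguous ones (whose cheapest expected misclassification cost exceeds the rejection cost) are rejected for, say, further clinical examination; this is how the rejection option lowers the average decision cost.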
     To sum up, developing algorithms that efficiently classify tumor gene expression data is a challenging task, since many existing algorithms suffer from small samples with high dimensionality, the need for dimensionality reduction, and imbalanced distributions. This thesis develops effective gene selection and over-sampling synthesis methods, which not only improve the performance of the classifier but also exclude a large number of unrelated genes, and could greatly benefit further study and applications in locating disease genes and diagnosing the related diseases.
     For data classification, an ensemble classification model based on neural networks and the ELM is presented. The ensemble algorithm considers differences in both the data sets and the classifier outputs, and cost-sensitive factors are embedded in the algorithm to reflect the importance of different data during tumor recognition. This work develops an algorithm framework suitable for the classification of gene expression data and improves the classification accuracy of tumor gene expression data. The research has theoretical significance for the classification of high-dimensional, imbalanced data and offers helpful guidance for applications in tumor diagnosis.
