Abstract
Diagnosing cancer at the gene level can effectively improve patients' cure rates. However, cancer gene expression datasets are typically high-dimensional, small-sample, noisy, and class-imbalanced, which makes classifying them a challenging task. To address these problems, a differential evolution based cost-sensitive Stacking ensemble (DE-CStacking) algorithm is proposed for gene expression data classification. Random forest, K-nearest neighbors, and naive Bayes serve as the lower-level (base) learners of the Stacking ensemble, and a cost-sensitive support vector machine serves as the higher-level (meta) learner. The output class probabilities of the lower-level learners, together with the original feature set, form the input of the higher-level learner, and the parameters of these learners are optimized by differential evolution. Experiments on four cancer gene expression datasets from UCI show that DE-CStacking achieves better generalization performance on such data than other traditional ensemble algorithms.
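The parameter-optimization step can be illustrated with a minimal sketch. The abstract does not state which differential evolution variant, control parameters, or fitness function the authors use, so everything below is an assumption: the classic DE/rand/1/bin scheme, the values of `F` and `CR`, and the hypothetical `toy_validation_error` function, which merely stands in for a cross-validated error computed over two learner hyperparameters.

```python
import random


def differential_evolution(objective, bounds, pop_size=20, F=0.5, CR=0.9,
                           generations=100, seed=0):
    """Minimise `objective` over the box `bounds` with DE/rand/1/bin (assumed variant)."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Initialise the population uniformly inside the bounds.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(ind) for ind in pop]
    for _ in range(generations):
        next_pop, next_scores = list(pop), list(scores)
        for i in range(pop_size):
            # Pick three distinct individuals, all different from the target i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # forces at least one mutated component
            trial = []
            for j in range(dim):
                if rng.random() < CR or j == j_rand:
                    # Mutation: base vector plus scaled difference vector.
                    value = pop[a][j] + F * (pop[b][j] - pop[c][j])
                    lo, hi = bounds[j]
                    value = min(max(value, lo), hi)  # clip into the search box
                else:
                    # Crossover keeps the target's component.
                    value = pop[i][j]
                trial.append(value)
            # Greedy selection: the trial replaces the target if it is no worse.
            trial_score = objective(trial)
            if trial_score <= scores[i]:
                next_pop[i], next_scores[i] = trial, trial_score
        pop, scores = next_pop, next_scores
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


def toy_validation_error(params):
    """Hypothetical stand-in for a cross-validated error; minimum 0 at (1.0, -2.0)."""
    log_c, log_gamma = params
    return (log_c - 1.0) ** 2 + (log_gamma + 2.0) ** 2


best_params, best_score = differential_evolution(
    toy_validation_error, bounds=[(-5.0, 5.0), (-5.0, 5.0)])
```

In an actual pipeline the objective would presumably train the Stacking ensemble with the candidate parameters and return a validation error or a cost-weighted loss; the quadratic used here only keeps the sketch self-contained.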