Gene Expression Data Classification Algorithm Based on DE-CStacking Ensemble

English title: Classification Algorithm of Gene Expression Data Based on Differential Evolution and Cost-sensitive Stacking Ensemble
Authors: GAO Hui-yun (高慧云); LU Hui-juan (陆慧娟); YAN Ke (严珂); YE Min-chao (叶敏超)
Author affiliation: College of Information Engineering, China Jiliang University
Keywords: Stacking ensemble; differential evolution; cost-sensitive; gene expression data
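Chinese journal title: XXWX
English journal title: Journal of Chinese Computer Systems
Institution: College of Information Engineering, China Jiliang University
Publication date: 2019-08-09
Publisher: 小型微型计算机系统 (Journal of Chinese Computer Systems)
Year: 2019
Issue: v.40
Funding: Supported by the National Natural Science Foundation of China (61272315); supported by the Zhejiang Provincial Science and Technology Program (2017C34003)
Language: Chinese
Page: XXWX201908004
Page count: 5
CN: 08
ISSN: 21-1106/TP
Classification code: 19-23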

Abstract

Diagnosing cancer at the gene level can effectively improve patients' cure rate. However, cancer gene expression datasets are typically high-dimensional, small-sample, noisy, and class-imbalanced, which makes classifying them a challenging task. To address these problems, a gene expression data classification algorithm based on a differential-evolution-optimized, cost-sensitive Stacking ensemble (DE-CStacking) is proposed. Random forest, K-nearest neighbors, and naive Bayes serve as the lower-level (primary) learners of the Stacking ensemble, and a cost-sensitive support vector machine serves as the higher-level (secondary) learner. The output class probabilities of the lower-level learners, together with the original feature set, form the input of the higher-level learner, and differential evolution is used to optimize the parameters of these learners. In experiments on four UCI cancer gene datasets, DE-CStacking shows better generalization performance than other traditional ensemble algorithms.
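The abstract says that learner parameters are tuned by differential evolution but gives no details of the variant, the fitness function, or the search ranges. The following is a minimal sketch of the common DE/rand/1/bin scheme in standard-library Python, not the authors' implementation. The objective `toy_cv_error`, the parameter names `log_c` and `log_gamma`, the bounds, and the settings `F`, `CR`, `pop_size`, and `generations` are all illustrative assumptions; in a real pipeline the objective would plausibly be a cross-validated error of the stacked model.

```python
import random


def differential_evolution(objective, bounds, pop_size=20, F=0.5, CR=0.9,
                           generations=100, seed=0):
    """Minimise `objective` over the box `bounds` with DE/rand/1/bin."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Initialise the population uniformly inside the bounds.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals, all different from i.
            others = [j for j in range(pop_size) if j != i]
            a, b, c = rng.sample(others, 3)
            j_rand = rng.randrange(dim)  # forces at least one mutated gene
            trial = []
            for j, (lo, hi) in enumerate(bounds):
                if j == j_rand or rng.random() < CR:
                    # Mutation: base vector plus scaled difference vector.
                    v = pop[a][j] + F * (pop[b][j] - pop[c][j])
                    v = min(max(v, lo), hi)  # clip back into the search box
                else:
                    # Crossover keeps the parent's gene.
                    v = pop[i][j]
                trial.append(v)
            trial_score = objective(trial)
            # Greedy selection: keep the trial only if it is no worse.
            if trial_score <= scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


def toy_cv_error(params):
    """Hypothetical stand-in for a cross-validated error (assumption).

    A simple quadratic whose minimum lies at (3.0, -1.0).
    """
    log_c, log_gamma = params
    return (log_c - 3.0) ** 2 + (log_gamma + 1.0) ** 2


best_params, best_error = differential_evolution(
    toy_cv_error, bounds=[(-5.0, 5.0), (-5.0, 5.0)])
```

Searching in log space for parameters such as an SVM penalty is a common convention rather than something the abstract specifies. Swapping `toy_cv_error` for a function that trains and scores the ensemble would leave the loop itself unchanged.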

    [22]Fujino A,Isozaki H,Suzuki J.Multi-label text categorization with model combination based on F1-score maximization[C]//Proceedings of Ijcnlp,2013:823-828.