A Study on the Application of Resampling Techniques to the Classification of Imbalanced Medical Data

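English title: Application of the Resampling Technology in the Classification of Imbalanced Medical Datasets
Authors: Yan Ci (闫慈); Tian Xianghua (田翔华); Alayi Ahan (阿拉依·阿汗); Zhang Weiwen (张伟文); Cao Mingqin (曹明芹)
Authors (English, as listed): Yan Ci; Tian Xianghua; Alayi Ahan; College of Public Health, Xinjiang Medical University
Keywords: metabolic syndrome; imbalanced datasets; resampling techniques; neural network; decision tree
Keywords (English, as listed): Metabolic syndrome; Imbalanced datasets; Resampling technique; Neural network; Decision tree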
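Journal (Chinese): ZGWT
Journal (English): Chinese Journal of Health Statistics
Institutions: Department of Epidemiology and Health Statistics, School of Public Health, Xinjiang Medical University; Computer Science Teaching and Research Section, College of Medical Engineering and Technology, Xinjiang Medical University
Publication date: 2018-04-25
Publisher: Chinese Journal of Health Statistics (中国卫生统计)
Year: 2018
Issue: v.35
Funding: Xinjiang Science and Technology Aid Project (2016E02082); National Natural Science Foundation of China (71663053)
Language: Chinese
Page: ZGWT201802004
Number of pages: 5
CN: 02
ISSN: 21-1153/R
Classification code: 19-22+27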

Abstract

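Objective: Using metabolic syndrome as an example, to examine how imbalanced data affect classification algorithms, to balance the data with resampling techniques, and to compare the classification performance of neural networks and decision trees. Methods: Four resampling techniques were applied: random oversampling, random undersampling, hybrid sampling, and synthetic data generation. The classification performance of the neural network and the decision tree was compared before and after resampling and across the four resampling techniques, with F-Measure, G-mean, and AUC as the model evaluation metrics. Results: (1) Classifier performance declined as the imbalance ratio of the dataset became more severe. (2) Among the four resampling techniques, random oversampling gave the best classification performance for both the BP neural network and the C4.5 decision tree. Conclusion: Classification performance declines as the prevalence of disease in the dataset decreases. Random oversampling improved the classification performance of the algorithms. Applying random oversampling before classifying imbalanced medical data is recommended to improve classification performance.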
Objective: Taking metabolic syndrome as the case study, to examine the influence of imbalanced datasets on classification, to balance the datasets with resampling techniques, and to compare the classification performance of a neural network and a decision tree. Methods: (1) A BP neural network and a C4.5 decision tree were used to classify imbalanced datasets with different imbalance ratios. (2) Four resampling techniques (random oversampling, random undersampling, hybrid methods, and synthetic data) were applied, and the neural network and decision tree were compared before and after resampling and across the four techniques, using F-Measure, G-mean, and AUC as the model evaluation metrics. Results: (1) As the imbalance ratio of the datasets increased, the AUC decreased gradually, indicating that the performance of the classification algorithms declined as the datasets became more imbalanced. (2) Classification after random oversampling gave the best performance. Conclusion: Random oversampling improves the performance of the classification algorithms. Applying random oversampling before running classification algorithms on imbalanced medical datasets is recommended.
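
The abstract names random oversampling, random undersampling, F-Measure, and G-mean without giving an implementation. The following is a minimal sketch of those four pieces, assuming a binary problem in which the minority class is the smaller one; the function names (`random_oversample`, `random_undersample`, `f_measure_and_g_mean`) are hypothetical and are not taken from the paper. Hybrid sampling, synthetic data generation, and AUC are omitted.

```python
import math
import random
from collections import Counter


def _split_indices(y, minority_label):
    """Return (minority indices, majority indices) for a binary label list."""
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    return minority, majority


def random_oversample(X, y, minority_label, seed=0):
    """Duplicate randomly chosen minority samples until both classes match in size."""
    rng = random.Random(seed)
    minority, majority = _split_indices(y, minority_label)
    # Draw with replacement from the minority class to fill the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = majority + minority + extra
    return [X[i] for i in keep], [y[i] for i in keep]


def random_undersample(X, y, minority_label, seed=0):
    """Randomly discard majority samples until both classes match in size."""
    rng = random.Random(seed)
    minority, majority = _split_indices(y, minority_label)
    # Draw without replacement; assumes the minority class is the smaller one.
    kept_majority = rng.sample(majority, len(minority))
    keep = kept_majority + minority
    return [X[i] for i in keep], [y[i] for i in keep]


def f_measure_and_g_mean(y_true, y_pred, positive=1):
    """Compute F-Measure (F1) and G-mean from true and predicted binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0          # sensitivity
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    g_mean = math.sqrt(recall * specificity)
    return f_measure, g_mean


if __name__ == "__main__":
    # Toy data: 2 positive (minority) cases and 8 negative cases.
    X = [[i] for i in range(10)]
    y = [1, 1] + [0] * 8
    _, y_over = random_oversample(X, y, minority_label=1)
    _, y_under = random_undersample(X, y, minority_label=1)
    print(Counter(y_over), Counter(y_under))
```

Resampling should be applied to the training split only, so that the evaluation metrics are computed on data that keeps the original class distribution.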

