A Software Defect Prediction Model for Class-Imbalanced Datasets

English title: Software defect prediction model based on class imbalanced datasets
Authors: 李冉; 周丽娟; 王华
English authors: Li Ran; Zhou Lijuan; Wang Hua (College of Information Engineering, Capital Normal University)
Keywords: software defect prediction; class-imbalanced data; feature selection; ensemble algorithm
English keywords: software defect prediction; class imbalanced data; attribute selection; ensemble algorithm
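Chinese journal title: JSYJ
English journal title: Application Research of Computers
Affiliation: College of Information Engineering, Capital Normal University
Publication date: 2018-05-07 17:06
Publisher: 计算机应用研究 (Application Research of Computers)
Year: 2018
Issue: v.35; No.323
Funding: National Natural Science Foundation of China (61601310); Beijing Engineering Research Center for Highly Reliable Embedded System Technology (2013BAH19F01)
Language: Chinese
Page: JSYJ201809059
Page count: 5
CN: 09
ISSN: 51-1196/TP
Classification number: 252-256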

Abstract

Class imbalance in software defect data degrades the accuracy of defect prediction classifiers. To reduce this effect, the paper studies the order in which the data preprocessing algorithms should be executed and proposes a software defect prediction model, ASRAdaBoost, that improves classification performance. After comparison experiments determine the optimal preprocessing order, the model first applies chi-square feature selection and then SMOTE oversampling together with the simple resample method, which addresses class imbalance and attribute redundancy when both are present. Finally, it uses the AdaBoost ensemble algorithm to build the ASRAdaBoost prediction model. All experiments use the J48 decision tree as the base classifier. The experimental results show that ASRAdaBoost effectively improves the accuracy of software defect prediction and achieves a better classification effect.
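
The preprocessing order described in the abstract (chi-square feature selection, then SMOTE oversampling, then AdaBoost) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: all function names are hypothetical, the chi-square score is computed from per-class totals of non-negative features, the resample step is omitted, and one-feature threshold stumps stand in for the J48 base classifier. Labels are assumed to be +1 (defective) and -1 (clean).

```python
import math
import random


def chi2_scores(X, y):
    """Chi-square score of each non-negative feature against the class label."""
    classes = sorted(set(y))
    n = len(X)
    scores = []
    for j in range(len(X[0])):
        total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = total * sum(1 for label in y if label == c) / n
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def select_top_k(X, y, k):
    """Keep the k features with the highest chi-square score."""
    scores = chi2_scores(X, y)
    keep = sorted(sorted(range(len(scores)), key=lambda j: -scores[j])[:k])
    return [[row[j] for j in keep] for row in X], keep


def smote(X, y, minority, n_new, k=3, seed=0):
    """Add n_new synthetic minority rows by interpolating towards near neighbours."""
    rng = random.Random(seed)
    pool = [row for row, label in zip(X, y) if label == minority]
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(pool)
        neighbours = sorted((p for p in pool if p is not base),
                            key=lambda p: math.dist(p, base))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return X + synthetic, y + [minority] * n_new


def fit_stump(X, y, w):
    """Threshold rule on one feature with the lowest weighted error (stand-in for J48)."""
    best = None
    for j in range(len(X[0])):
        for t in sorted(set(row[j] for row in X)):
            for sign in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (sign if row[j] > t else -sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best


def adaboost(X, y, rounds=10):
    """Discrete AdaBoost over threshold stumps; labels must be +1 / -1."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, j, t, sign = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, t, sign))
        # Raise the weight of misclassified rows, lower the rest, then renormalise.
        w = [wi * math.exp(-alpha * yi * (sign if row[j] > t else -sign))
             for row, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return model


def predict(model, row):
    """Sign of the alpha-weighted vote of all stumps."""
    vote = sum(a * (sign if row[j] > t else -sign) for a, j, t, sign in model)
    return 1 if vote > 0 else -1
```

A typical call sequence would be `select_top_k` on the raw metrics, `smote` on the reduced data to top up the defective class, and `adaboost` on the balanced result. New modules must be projected onto the kept feature indices before calling `predict`.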

