Software Defect Prediction Based on Cost-Sensitive Support Vector Machine

Authors: REN Sheng-bing (任胜兵); LIAO Xiang-dang (廖湘荡)
Affiliation: School of Software, Central South University
Keywords: software defect prediction; cost sensitivity; support vector machine; unbalanced data classification; parameter selection; genetic algorithm
Journal code: JSJK
Journal: Computer Engineering & Science (计算机工程与科学)
Publication date: 2018-10-15
Publisher: Computer Engineering & Science
Year: 2018
Issue: v.40; No.286
Language: Chinese
Page: JSJK201810011
Page count: 9
CN: 10
ISSN: 43-1258/TP
Classification number: 75-83

Abstract

Software defect prediction is a typical imbalanced learning problem. We improve the cost-sensitive support vector machine (SVM) algorithm by combining CS-SVM with a clustering algorithm, and propose the CCS-SVM software defect prediction model. CCS-SVM combines SVM with class misclassification costs, takes an evaluation index for imbalanced data as the objective function, and optimizes the misclassification cost factor to raise the recognition rate of minority-class samples. Clustering locates the center point of each class, and the distance from each sample to its class center defines that sample's class confidence, so that each sample receives its own misclassification cost coefficient. Introducing this confidence into the cost-sensitive SVM optimization problem improves the robustness of the algorithm and the classification performance of the SVM. To improve the generalization ability of the model, a genetic algorithm optimizes feature selection and model parameters. Experiments on the NASA Metrics Data Program (MDP) datasets show that the method clearly improves the G-mean and F-measure scores.
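The per-sample confidence and the two evaluation measures named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract gives no formulas, so the class mean stands in for the clustering step, the linear decay of confidence with distance is an assumed form, and the function names (`class_centers`, `class_confidence`, `sample_costs`, `g_mean_f_measure`) are hypothetical. G-mean and F-measure follow their standard definitions.

```python
import math


def class_centers(X, y):
    """Mean vector of each class (assumption: a simple stand-in for clustering)."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def class_confidence(X, y, eps=1e-6):
    """Confidence in (0, 1]: 1 at the class center, decaying linearly
    with Euclidean distance (assumed form, not taken from the paper)."""
    centers = class_centers(X, y)
    dist = [math.dist(row, centers[label]) for row, label in zip(X, y)]
    d_max = {}
    for d, label in zip(dist, y):
        d_max[label] = max(d_max.get(label, 0.0), d)
    return [1.0 - d / (d_max[label] + eps) for d, label in zip(dist, y)]


def sample_costs(confidence, y, class_cost):
    """Per-sample misclassification cost: class-level cost scaled by confidence."""
    return [class_cost[label] * c for c, label in zip(confidence, y)]


def g_mean_f_measure(y_true, y_pred, positive=1):
    """G-mean = sqrt(recall * specificity); F-measure = harmonic mean of
    precision and recall, both computed for the positive (defective) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    g_mean = math.sqrt(recall * specificity)
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return g_mean, f_measure
```

In a full pipeline, the values returned by `sample_costs` would be passed to an SVM trainer that accepts per-sample weights, and `g_mean_f_measure` would serve as the fitness function for the search over cost factors, features, and kernel parameters.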

