A Hybrid Instance Selection Using Nearest-Neighbor for Cross-Project Defect Prediction

Authors: Duksan Ryu; Jong-In Jang; Jongmoon Baik
Keywords: software defect analysis; instance-based learning; nearest-neighbor algorithm; data cleaning
Journal: Journal of Computer Science and Technology
Publication date: September 2015
Volume: 30
Issue: 5
Pages: 969-980
Full-text size: 380 KB
Author affiliations: Duksan Ryu (1)
Jong-In Jang (1)
Jongmoon Baik (1)

1. School of Computing, Korea Advanced Institute of Science and Technology, Yuseong-gu, Daejeon, 305-701, Korea
Journal category: Computer Science
Journal subjects: Computer Science, general
Software Engineering
Theory of Computation
Data Structures, Cryptology and Information Theory
Artificial Intelligence and Robotics
Information Systems Applications and The Internet
Chinese Library of Science
Publisher: Springer Boston
ISSN: 1860-4749

Abstract

Software defect prediction (SDP) is an active research field in software engineering that aims to identify defect-prone modules. With SDP, limited testing resources can be allocated effectively to defect-prone modules. Although SDP requires sufficient local data within a company, local data are sometimes unavailable, e.g., for pilot projects. Companies without local data can employ cross-project defect prediction (CPDP), which uses external data to build classifiers. The major challenge of CPDP is that training and test data follow different distributions. To tackle this, instances of source data similar to the target data are selected to build classifiers. Software datasets also suffer from class imbalance: the ratio of defective to clean instances is very low, which usually lowers classifier performance. We propose a Hybrid Instance Selection Using Nearest-Neighbor (HISNN) method that performs hybrid classification, selectively learning local knowledge (via k-nearest neighbor) and global knowledge (via naïve Bayes). Instances having strong local knowledge are identified via nearest neighbors with the same class label. Previous studies showed low PD (probability of detection) or high PF (probability of false alarm), which makes them impractical to use. The experimental results show that HISNN produces high overall performance as well as high PD and low PF.
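
The abstract gives only a high-level description of HISNN, so the following is a minimal sketch of the general idea rather than the authors' algorithm. It assumes binary labels (1 = defective, 0 = clean), Euclidean distance, and a simple rule: if all k nearest training instances share one label, that label is used (local knowledge); otherwise a Gaussian naive Bayes model trained on all instances decides (global knowledge). The function names, the rule itself, and the default k = 3 are hypothetical choices made for illustration. PD and PF follow their standard definitions, PD = TP / (TP + FN) and PF = FP / (FP + TN).

```python
import math


def pd_pf(y_true, y_pred):
    """Return (PD, PF) for binary labels where 1 = defective, 0 = clean."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    pd = tp / (tp + fn) if tp + fn else 0.0
    pf = fp / (fp + tn) if fp + tn else 0.0
    return pd, pf


def fit_gaussian_nb(X, y):
    """Fit per-class priors, feature means, and feature variances."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        cols = list(zip(*rows))
        means = [sum(col) / len(rows) for col in cols]
        # Small constant keeps the variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / len(rows) + 1e-9
            for col, m in zip(cols, means)
        ]
        model[label] = (len(rows) / len(y), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Return the class with the highest Gaussian log-posterior for x."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for v, m, s2 in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return score
    return max(model, key=log_posterior)


def hybrid_predict(X, y, x, k=3):
    """Assumed hybrid rule: unanimous k-NN vote, else naive Bayes fallback."""
    nearest = sorted(range(len(X)), key=lambda i: math.dist(X[i], x))[:k]
    labels = {y[i] for i in nearest}
    if len(labels) == 1:
        # Strong local knowledge: every neighbor agrees on the label.
        return labels.pop()
    # Neighbors disagree: defer to the global model.
    return predict_gaussian_nb(fit_gaussian_nb(X, y), x)
```

In a cross-project setting, `X` and `y` would come from the source projects and `x` from the target project; comparing predictions against known target labels with `pd_pf` gives the two measures the abstract reports.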
