A Hybrid Instance Selection Using Nearest-Neighbor for Cross-Project Defect Prediction
  • Authors: Duksan Ryu; Jong-In Jang; Jongmoon Baik
  • Keywords: software defect analysis; instance-based learning; nearest-neighbor algorithm; data cleaning
  • Journal: Journal of Computer Science and Technology
  • Publication year: 2015
  • Publication date: September 2015
  • Volume: 30
  • Issue: 5
  • Pages: 969-980
  • Full-text size: 380 KB
  • Affiliations: Duksan Ryu (1)
    Jong-In Jang (1)
    Jongmoon Baik (1)

    1. School of Computing, Korea Advanced Institute of Science and Technology, Yuseong-gu, Daejeon, 305-701, Korea
  • Journal category: Computer Science
  • Journal subjects: Computer Science, general
    Software Engineering
    Theory of Computation
    Data Structures, Cryptology and Information Theory
    Artificial Intelligence and Robotics
    Information Systems Applications and The Internet
    Chinese Library of Science
  • Publisher: Springer Boston
  • ISSN:1860-4749
Abstract
Software defect prediction (SDP) is an active research field in software engineering that aims to identify defect-prone modules. With SDP, limited testing resources can be allocated effectively to those modules. Although SDP requires sufficient local data within a company, there are cases where local data are not available, e.g., pilot projects. Companies without local data can employ cross-project defect prediction (CPDP), which uses external data to build classifiers. The major challenge of CPDP is that the training and test data follow different distributions. To tackle this, instances of the source data that are similar to the target data are selected to build classifiers. Software datasets also suffer from a class imbalance problem: the ratio of the defective class to the clean class is very low, which usually lowers classifier performance. We propose a Hybrid Instance Selection Using Nearest-Neighbor (HISNN) method that performs hybrid classification, selectively learning local knowledge (via k-nearest neighbor) and global knowledge (via naïve Bayes). Instances with strong local knowledge are identified via nearest neighbors that share the same class label. Previous studies showed low PD (probability of detection) or high PF (probability of false alarm), which makes them impractical to use. The experimental results show that HISNN produces high overall performance as well as high PD and low PF.
Keywords: software defect analysis; instance-based learning; nearest-neighbor algorithm; data cleaning
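The abstract names two measures (PD and PF) and a hybrid of local and global knowledge without giving details. The Python sketch below is only an illustration of those ideas, not the authors' HISNN algorithm. It assumes the usual confusion-matrix definitions PD = TP / (TP + FN) and PF = FP / (FP + TN), Euclidean distance, a small Gaussian naïve Bayes as the global model, and the rule "trust the k nearest neighbors only when they all share one label". The function names (`pd_pf`, `fit_gaussian_nb`, `predict_gaussian_nb`, `hybrid_predict`) and the toy data are hypothetical.

```python
import math


def pd_pf(y_true, y_pred):
    """Return (PD, PF) for binary labels where 1 = defective and 0 = clean."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    pd_value = tp / (tp + fn) if tp + fn else 0.0  # share of defective modules found
    pf_value = fp / (fp + tn) if fp + tn else 0.0  # share of clean modules flagged
    return pd_value, pf_value


def fit_gaussian_nb(X, y):
    """Global knowledge (assumed form): per-class prior, feature means, variances."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A tiny floor keeps every variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Return the class with the highest log-posterior under the fitted model."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for v, m, s2 in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return score
    return max(model, key=log_posterior)


def hybrid_predict(x, X, y, model, k=3):
    """Use the k nearest neighbors if they agree, otherwise fall back to naive Bayes.

    Returns (label, source), where source is "local" or "global".
    This agreement rule is an assumption made for illustration.
    """
    nearest = sorted(range(len(X)), key=lambda i: math.dist(x, X[i]))[:k]
    labels = {y[i] for i in nearest}
    if len(labels) == 1:
        return labels.pop(), "local"
    return predict_gaussian_nb(model, x), "global"


if __name__ == "__main__":
    # Toy data: two features per module, 0 = clean, 1 = defective.
    X = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
    y = [0, 0, 0, 1, 1, 1]
    model = fit_gaussian_nb(X, y)
    queries = [[1.0, 1.05], [3.5, 3.5], [5.1, 5.0]]
    predictions = [hybrid_predict(q, X, y, model, k=4)[0] for q in queries]
    print(predictions)
    print(pd_pf([0, 1, 1], predictions))
```

A practical detector needs PD high and PF low at the same time, which is why the sketch reports both values together rather than a single accuracy figure; on imbalanced data a classifier that labels everything clean can score high accuracy with a PD of zero.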
