Classifying imbalanced data in distance-based feature space
  • Author: Shin Ando
  • Keywords: Class imbalance; Weighted nearest neighbor classifier; Structural classifier
  • Journal: Knowledge and Information Systems
  • Publication date: March 2016
  • Year: 2016
  • Volume: 46
  • Issue: 3
  • Pages: 707-730
  • Full-text size: 1,768 KB
  • Author affiliation: Shin Ando (1)

    1. School of Management, Tokyo University of Science, Tokyo, Japan
  • Journal category: Computer Science
  • Journal subjects: Information Systems and Communication Service;
    Business Information Systems
  • Publisher: Springer London
  • ISSN: 0219-3116
Abstract
Class imbalance is a significant issue in practical classification problems. Important countermeasures, such as re-sampling, instance weighting, and cost-sensitive learning, have been developed, but each approach has limitations as well as advantages. Synthetic re-sampling methods have wide applicability but require a vector representation to generate additional instances. Instance-based methods can be applied to distance-space data but are not tractable with regard to a global objective. Cost-sensitive learning can minimize the expected cost given the costs of errors, but generally does not extend to nonlinear measures such as the F-measure and the area under the curve. To address these shortcomings, this paper proposes a nearest neighbor classification model that employs a class-wise weighting scheme to counteract the class imbalance and a convex optimization technique to learn its weight parameters. As a result, the proposed model maintains a simple instance-based rule for prediction, yet retains mathematical support for learning to maximize a nonlinear performance measure over the training set. An empirical study is conducted to evaluate the performance of the proposed algorithm on imbalanced distance-space data and to compare it with existing methods.
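
The abstract describes a class-wise weighted nearest neighbor rule whose weights are learned by convex optimization of a nonlinear performance measure. The sketch below illustrates only the prediction side of such a rule, assuming a precomputed distance matrix and caller-supplied class weights; the function name weighted_knn_predict, the weight argument w, and the neighborhood size k are illustrative choices, and the paper's actual learning procedure for the weights is not reproduced here.

# Minimal sketch (not the paper's algorithm) of a class-wise weighted
# nearest-neighbor decision rule on a precomputed distance matrix.
# The per-class weights in `w` stand in for parameters that, in the paper,
# are learned by convex optimization; here they are simply supplied.
import numpy as np

def weighted_knn_predict(dist, train_labels, w, k=5):
    """Predict labels for test points.

    dist         : (n_test, n_train) matrix of pairwise distances
    train_labels : integer class labels of the training instances
    w            : per-class weights (dict or array), e.g. larger for the minority class
    k            : number of nearest neighbors to consider
    """
    dist = np.asarray(dist, dtype=float)
    train_labels = np.asarray(train_labels)
    classes = np.unique(train_labels)
    preds = np.empty(dist.shape[0], dtype=train_labels.dtype)
    for i, row in enumerate(dist):
        nn = np.argsort(row)[:k]                 # indices of the k nearest training points
        scores = {c: 0.0 for c in classes}
        for j in nn:
            c = train_labels[j]
            # inverse-distance vote scaled by the class-wise weight
            scores[c] += w[c] / (row[j] + 1e-12)
        preds[i] = max(scores, key=scores.get)   # class with the largest weighted vote
    return preds

As a usage illustration, one might compute dist with scipy.spatial.distance.cdist(X_test, X_train) and set w = {0: 1.0, 1: n_majority / n_minority} for a binary problem with minority class 1; this mirrors a common frequency-based re-weighting heuristic rather than the weights the paper learns by optimizing a measure such as the F-measure.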
