The study of under- and over-sampling methods’ utility in analysis of highly imbalanced data on osteoporosis
Abstract
Osteoporosis is a common bone disease that has no typical early symptoms but leads to serious complications, e.g. low-energy bone fractures. Patients with risk factors should be screened so that a proper diagnosis is made as early as possible. Unfortunately, the registered medical data are often highly imbalanced, which makes machine-based data processing difficult or even impossible. Our goal was therefore to find the best method of coping with class imbalance in the analysed data on osteoporotic patients. To this end, we tested several classifier paradigms in combination with preprocessing techniques that address the skewed class distribution of the data. In the source dataset, 92.6% of instances related to patients without any fractures (negative cases) and only 7.41% to patients who reported at least one fracture (positive cases). To alleviate class imbalance, we examined not only data-level methods, which modify the input dataset, but also ensemble methods, which strengthen the results of the base algorithms. In the first group, under- and over-sampling methods were used, such as random undersampling, edited nearest neighbours and the synthetic minority over-sampling technique; in the second group, a range of methods based on various subsets of the training data were analysed. Various combinations of the above were also investigated.
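As an illustration of the simplest data-level method named above, the following is a minimal sketch of random undersampling on toy data. The helper names (`imbalance_ratio`, `random_undersample`), the label coding (0 = negative, 1 = positive) and the toy dataset are assumptions made for this example and are not taken from the study; the toy class sizes (25 negatives, 2 positives) were chosen only because 2/27 is roughly 7.41%.

```python
import random


def imbalance_ratio(labels):
    """Return majority-class count divided by minority-class count (binary labels)."""
    pos = sum(1 for label in labels if label == 1)
    neg = len(labels) - pos
    return max(pos, neg) / min(pos, neg)


def random_undersample(X, y, seed=0):
    """Randomly drop negative (label 0) rows until both classes have equal size.

    Assumes label 0 is the majority class, as in the abstract's description.
    """
    rng = random.Random(seed)
    pos_idx = [i for i, label in enumerate(y) if label == 1]
    neg_idx = [i for i, label in enumerate(y) if label == 0]
    kept_neg = rng.sample(neg_idx, len(pos_idx))  # sample without replacement
    keep = sorted(pos_idx + kept_neg)
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: 25 negatives and 2 positives (about 7.41% positive).
toy_y = [0] * 25 + [1] * 2
toy_X = [[float(i)] for i in range(len(toy_y))]

X_bal, y_bal = random_undersample(toy_X, toy_y)
print(imbalance_ratio(toy_y), imbalance_ratio(y_bal))
```

The sketch makes the cost of undersampling visible: balancing the classes this way discards most of the negative cases, which is one reason the choice between undersampling and oversampling matters for data this skewed.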
Additionally, we propose a way to find the balancing level that gives appropriate classification efficiency without excessively distorting the raw input data. The aim of our experiment was to identify whether an undersampling or an oversampling approach, used with simple and ensemble-based classifiers, achieves the best results. The comparative studies on our dataset showed that the highest efficiency was achieved with the synthetic minority over-sampling technique and the RandomForest classifier. As for the optimal balancing level, we determined empirically that 300% oversampling with the synthetic minority over-sampling method, combined with edited nearest neighbours undersampling, achieved the required classification precision.
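The combination of 300% oversampling and edited nearest neighbours can be sketched as follows. This is a simplified illustration and not the authors' implementation. It assumes that "300%" means three synthetic points per minority sample, that the cleaning step removes only negative (label 0) samples whose k nearest neighbours vote for the other class, and that at least two minority samples exist. The function names, the values of k and the toy coordinates are all hypothetical.

```python
import math
import random


def nearest(point, pool, k):
    """Return indices of the k points in pool closest to point (Euclidean distance)."""
    order = sorted(range(len(pool)), key=lambda i: math.dist(point, pool[i]))
    return order[:k]


def smote(minority, percent=300, k=5, seed=0):
    """Create percent/100 synthetic points per minority sample.

    Each synthetic point is interpolated between a sample and one of its
    k nearest minority neighbours, chosen at random.
    """
    rng = random.Random(seed)
    per_sample = percent // 100
    synthetic = []
    for i, p in enumerate(minority):
        others = [q for j, q in enumerate(minority) if j != i]
        neigh = nearest(p, others, min(k, len(others)))
        for _ in range(per_sample):
            q = others[rng.choice(neigh)]
            gap = rng.random()  # position along the segment from p to q
            synthetic.append([a + gap * (b - a) for a, b in zip(p, q)])
    return synthetic


def enn(X, y, k=3):
    """Edited nearest neighbours (assumed variant): drop label-0 samples
    that the majority vote of their k nearest neighbours would label 1."""
    keep = []
    for i, p in enumerate(X):
        others = [j for j in range(len(X)) if j != i]
        neigh = nearest(p, [X[j] for j in others], k)
        votes = sum(y[others[n]] for n in neigh)
        predicted = 1 if votes * 2 > k else 0
        if y[i] == 1 or predicted == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: a negative cluster, one stray negative, two positives.
negatives = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3],
             [0.0, 0.4], [0.4, 0.0], [0.2, 0.4], [0.4, 0.2],
             [3.0, 3.1]]  # the last point sits among the positives
positives = [[2.9, 3.0], [3.2, 3.2]]

synthetic = smote(positives, percent=300)
X = negatives + positives + synthetic
y = [0] * len(negatives) + [1] * (len(positives) + len(synthetic))
X_res, y_res = enn(X, y, k=3)
print(y_res.count(0), y_res.count(1))
```

In this toy run the oversampling step quadruples the positive class, and the cleaning step then removes the single negative point that lies inside the positive region, so the two steps reduce the imbalance from opposite sides.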
