Abstract
Classification of imbalanced data often suffers from severe class imbalance and low accuracy on minority-class samples, and as data volumes grow, classification efficiency also becomes a bottleneck. To address these problems, this paper draws on the efficient data processing capability of Spark and proposes an ensemble classification method for imbalanced data based on comprehensive weights in the Spark environment. First, the method samples the majority classes according to a comprehensive weight, obtained from the weight of each class within the majority-class samples and the number of minority-class samples, and combines the sampled data with the minority-class samples to form a balanced training set. Second, a correlation-based feature selection method is used to select the optimal feature subset, and the random forest algorithm is improved and optimized and then used to obtain the sub-classifiers. Finally, the method is validated experimentally on UCI datasets in the Spark environment. The experimental results show that the proposed method improves both overall classification accuracy and classification efficiency.
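The sampling step can be illustrated with a minimal Python sketch. The abstract does not state the formula for the comprehensive weight, so the rule used here is only an assumption: each majority class receives a quota equal to its share of all majority samples multiplied by the minority-class sample count. The function name `balanced_undersample` and its parameters are hypothetical, and the sketch runs on a single machine rather than on Spark.

```python
import random
from collections import defaultdict


def balanced_undersample(samples, labels, minority_label, seed=0):
    """Undersample the majority classes to roughly match the minority count.

    Assumption (not taken from the paper): the quota of each majority class
    is its share of all majority samples times the minority sample count.
    """
    rng = random.Random(seed)

    # Group samples by class label.
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    minority = by_class.pop(minority_label)
    n_min = len(minority)
    n_maj = sum(len(items) for items in by_class.values())

    # Keep every minority sample.
    balanced = [(x, minority_label) for x in minority]

    for label, items in by_class.items():
        weight = len(items) / n_maj  # class share among majority samples
        # Hypothetical stand-in for the "comprehensive weight" quota.
        quota = min(len(items), round(weight * n_min))
        balanced.extend((x, label) for x in rng.sample(items, quota))
    return balanced
```

For example, with 60 and 30 samples in two majority classes and 10 minority samples, this sketch keeps 7 and 3 majority samples alongside all 10 minority samples. Because of rounding, the majority total may differ slightly from the minority count on other inputs.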
References
[1] Kim H, Howland P, Park H. Dimension reduction in text classification with support vector machine[J]. Journal of Machine Learning Research, 2005, 6(1): 37-53.
[2] Zou Quan, Guo Mao-zu, Liu Yang, et al. A classification method for class-imbalanced data and its application on bioinformatics[J]. Journal of Computer Research and Development, 2010, 47(8): 1407-1414.
[3] Liu Shao-yu, Zhou Jie, Li Bi-cheng, et al. Entity relation extraction method based on multi-SVM-KNN classifier[J]. Journal of Data Acquisition and Processing, 2015, 30(1): 202-210.
[4] Tao Xin-min, Hao Si-yuan, Zhang Dong-xue, et al. Overview of classification algorithms for unbalanced data[J]. Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), 2013, 25(1): 102-110.
[5] Jiang Sheng-yi, Miao Bang, Yu Wen. Under sampling method based on onepass clustering for imbalanced data distribution[J]. Journal of Chinese Computer Systems, 2012, 33(2): 232-236.
[6] Hu Xiao-sheng, Wen Ju-ping, Zhong Yong. Imbalanced data ensemble classification using dynamic balance sampling[J]. CAAI Transactions on Intelligent Systems, 2016, 11(2): 257-263.
[7] Cao Peng, Li Wei, Zhao Da-zhe. Multiclass imbalanced data classification based on decision criteria optimization[J]. Journal of Chinese Computer Systems, 2014, 35(5): 961-966.
[8] Li Ke-wen, Yang Lei, Liu Wen-ying, et al. Classification method of imbalanced data based on RSBoost[J]. Computer Science, 2015, 42(9): 249-252.
[9] Li Xiong-fei, Li Jun, Dong Yuan-fang, et al. A new learning algorithm for imbalanced data-PCBoost[J]. Chinese Journal of Computers, 2012, 35(2): 2202-2209.
[10] Andrzejak A, Langner F, Zabala S. Interpretable models from distributed data via merging of decision trees[C]. Computational Intelligence and Data Mining, IEEE, 2013: 1-9.
[11] Ray P K, Mohanty S R, Kishor N, et al. Optimal feature and decision tree-based classification of power quality disturbances in distributed generation systems[J]. IEEE Transactions on Sustainable Energy, 2013, 5(1): 200-208.
[12] Río S D, López V, Benítez J M, et al. On the use of MapReduce for imbalanced big data using random forest[J]. Information Sciences An International Journal, 2014, 285(C): 112-137.
[13] Wang Z, Xin J, Tian S, et al. Distributed weighted extreme learning machine for big imbalanced data learning[M]. Proceedings of ELM-2015 Volume 1. Springer International Publishing, 2016.
[14] Wang Wen, Zhao Kan-kan, Li Cui-ping, et al. Feature extension and category research for short text based on spark platform[J]. Journal of Frontiers of Computer Science and Technology, 2017, 11(5): 732-741.
[15] Chen J, Li K, Tang Z, et al. A parallel random forest algorithm for big data in a spark cloud computing environment[J]. IEEE Transactions on Parallel & Distributed Systems, 2017, 28(4): 919-933.
[16] Breiman L. Random forests[J]. Machine Learning, 2001, 45(1): 5-32.
[17] Xie Juan-ying, Xie Wei-xin. Several feature selection algorithms based on the discernibility of a feature subset and support vector machines[J]. Chinese Journal of Computers, 2014, 37(8): 1704-1718.
[18] Qin Jing, Qian Xue-zhong, Wang Wei-tao, et al. A algorithm for unbalanced big data using paralleled random forest[J]. Microelectronics & Computer, 2017, 34(4): 22-27.
1 http://spark.apache.org