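Network Traffic Feature Selection Method Based on the Chi-square Method and Symmetric Uncertainty (基于卡方方法及对称不确定性的网络流量特征选择方法)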
  • English title: Network Traffic Feature Selection Method Based on Chi-square Method and Symmetric Uncertainty
  • Authors: LIU Xueya (刘雪亚); JIANG Zhixia (姜志侠); XU Xuan (徐轩); YANG Zishuai (杨子帅); LI Lin (李林)
  • Keywords: data imbalance (数据不均衡); network traffic (网络流量); relative uncertainty (相对不确定性); recall rate (召回率)
  • Chinese journal title: CGJM
  • English journal title: Journal of Changchun University of Science and Technology (Natural Science Edition)
  • Institution: School of Science, Changchun University of Science and Technology (长春理工大学理学院)
  • Publication date: 2019-04-15
  • Publisher: Journal of Changchun University of Science and Technology (Natural Science Edition) (长春理工大学学报(自然科学版))
  • Year: 2019
  • Issue: v.42
  • Funding: National Natural Science Foundation of China (51378076)
  • Language: Chinese
  • Pages: CGJM201902017
  • Page count: 5
  • CN: 02
  • ISSN: 22-1364/TH
  • Classification number: 78-82
Abstract
When network traffic is classified with machine learning, the data contain many classes and the number of samples per class is imbalanced. This degrades classifier performance: samples tend to be misclassified into the classes with many samples, so the recall of classes with few samples (small classes) becomes too low. To address this problem, a network traffic feature selection method based on the chi-square method and symmetric uncertainty is proposed. The method first computes the weighted chi-square value between each feature and the classes and selects the features with larger chi-square values to form a candidate feature subset. It then further filters this subset according to the symmetric uncertainty between each feature and all classes. Experiments on the Moore network traffic data set show that classifying network traffic with the features selected by this method achieves a high recall for small classes while maintaining high accuracy, which mitigates the adverse effects of data imbalance.
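The two-stage procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the weighting scheme, the number of candidates, or the symmetric uncertainty cutoff, so `class_weights`, `k`, and `su_threshold` are hypothetical parameters, features are assumed to be already discretized, and all function names are invented for this sketch. Symmetric uncertainty is taken in its usual form, SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)).

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant
    mutual_info = hx + hy - entropy(list(zip(x, y)))
    return 2.0 * mutual_info / (hx + hy)


def weighted_chi_square(x, y, class_weights=None):
    """Pearson chi-square between feature x and class y, with each class's
    contribution scaled by a weight.

    ASSUMPTION: the per-class weighting is hypothetical; the abstract only
    says the chi-square values are weighted. With class_weights=None every
    weight is 1 and this is the ordinary chi-square statistic.
    """
    n = len(x)
    x_counts, y_counts = Counter(x), Counter(y)
    joint = Counter(zip(x, y))
    total = 0.0
    for cls, n_cls in y_counts.items():
        weight = 1.0 if class_weights is None else class_weights[cls]
        for val, n_val in x_counts.items():
            expected = n_val * n_cls / n
            observed = joint[(val, cls)]  # Counter yields 0 for unseen pairs
            total += weight * (observed - expected) ** 2 / expected
    return total


def select_features(features, labels, k, su_threshold, class_weights=None):
    """Two-stage filter over a dict {feature_name: list of discrete values}.

    Stage 1 keeps the k features with the largest weighted chi-square value.
    Stage 2 keeps the candidates whose symmetric uncertainty with the class
    labels is at least su_threshold (a hypothetical cutoff).
    """
    ranked = sorted(
        features,
        key=lambda name: weighted_chi_square(features[name], labels, class_weights),
        reverse=True,
    )
    candidates = ranked[:k]
    return [
        name
        for name in candidates
        if symmetric_uncertainty(features[name], labels) >= su_threshold
    ]
```

One possible use on imbalanced data is to pass inverse class frequencies as `class_weights`, so that features associated with a small class score higher in stage 1; whether this matches the weighting used in the paper cannot be determined from the abstract.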
