Studying Class Imbalance Characteristics and Classification Methods on Internet Traffic Flows

Original title (Chinese): 因特网流量类不平衡特性与分类方法的研究
Author: 刘珍 (Liu Zhen)
Degree level: Doctoral
Discipline: Computer Application Technology
Keywords: Internet traffic classification; class imbalance characteristics; data resampling; feature selection; cost-sensitive learning; ensemble learning
Degree year: 2013
Advisor: 刘琼 (Liu Qiong)
Discipline code: 081203
Degree-granting institution: 华南理工大学 (South China University of Technology)
Submission date: 2013-10-09

Abstract

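Internet traffic classification is an essential basis for network management, quality-of-service assurance, network accounting, and network security. Traditional traffic classification methods struggle to keep pace with the rapid growth of Internet applications, and methods based on machine learning are a promising alternative. However, such methods are usually optimized for high overall classification accuracy and do not account for the multi-class imbalance of Internet traffic data, so classification performance tends to favor the majority classes and neglect the minority classes. In Internet traffic, some minority-class applications carry command flows or real-time communication flows, whose classification performance bears on communication reliability or user experience; other minority classes are heavyweight applications, whose classification performance bears on network planning or bandwidth allocation.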
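     At present, the class imbalance characteristics of Internet traffic and the corresponding classification methods have not been studied systematically. Working with Internet traffic datasets and a selected feature space, this thesis observes and analyzes the class distribution of flow samples and then studies Internet traffic classification methods from three angles: data resampling, feature selection, and classification algorithms. The main contributions are as follows: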
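     (1) Class imbalance characteristics of Internet traffic data. The thesis examines the class imbalance of traffic data from both its outward appearance and its intrinsic structure. Comparing the number of flows and bytes in each class shows that traffic data usually contains several majority classes and several minority classes, that the gap in flow counts between majority and minority classes is substantial, that a minority class may account for a large share of bytes, and that a class may also be markedly imbalanced between large and small flows. Observing the distribution of flow samples in the selected feature space shows that samples of the same class are often spread over several sub-concept regions, that some sub-concepts contain only a few flow samples, and that samples of different classes frequently overlap. A study of how these characteristics affect classification performance finds that the multiple-sub-concept property has a stronger effect than between-class flow-count imbalance or between-class overlap.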
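     (2) A cost-sensitive learning algorithm suited to the many minority classes of Internet traffic. When cost-sensitive learning is used to handle class imbalance in traffic data, a misclassification cost matrix based on flow ratios is unsuitable for difficult minority classes (minority classes that do not have the fewest training flow samples but whose traffic is hard to classify correctly). The thesis controls the misclassification cost matrix through weighting: it analyzes the relationship between the room for increasing the misclassification cost and the degree of class imbalance, and proposes a metric for the degree of class imbalance together with a weight calculation method, so as to raise the misclassification cost of difficult minority classes moderately while leaving the classification performance of the majority classes essentially intact.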
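     (3) A resampling method for Internet traffic data. To address the between-class flow-count imbalance, between-class overlap, multiple sub-concepts, and small disjuncts that may be present in Internet traffic data, the thesis proposes a hierarchical data resampling method, PSC (partition, sampling and combining). The original traffic dataset is first partitioned into several disjoint, dense subsets to reduce the number of sub-concepts within each class. Within each subset, minority-class flow samples are then augmented by random interpolation over their feature values, which handles small disjuncts. Also within each subset, majority-class flow samples lying in the region where majority and minority classes overlap are removed, which alleviates between-class overlap. PSC thus builds, for training the sub-classifiers, training subsets with low within-class dispersion, low between-class overlap, and a low degree of class imbalance.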
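     (4) A selection algorithm for statistical features of Internet traffic. Given that Internet traffic data may exhibit multiple sub-concepts within a class, between-class overlap, and many minority classes, the thesis proposes a balanced feature selection algorithm, BFS (balanced feature selection). To select features under which the flow samples of a single class have low dispersion, a local correlation metric is proposed to evaluate how much certainty a single feature provides on the flow samples of a single class. To select features under which flow samples of different classes overlap little, a global correlation metric is used to evaluate how much certainty a feature provides about the class variable. Based on the local and global correlation of each feature, features that are locally correlated and globally discriminative are selected for every class, so that the selected feature subset helps distinguish multiple minority classes.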
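     (5) Classification methods for heavy flows. In Internet traffic, the imbalance between large and small flows within a class may cause the classifier to neglect learning the large flows, and the between-class flow-count imbalance may cause the classifier to neglect minority classes that carry a large number of bytes. Either case can make heavy flows hard to classify and yield poor byte-level classification performance. For the imbalance between large and small flows, the thesis proposes a flow size modularization method based on information gain ratio (FSMGR). FSMGR searches for the threshold separating large from small flows with the goal of minimizing the data complexity of the large-flow set, splits the original traffic dataset into a large-flow subset and a small-flow subset, and trains a classifier on each, which strengthens the learning of large flows. For the between-class flow-count imbalance, the PSC resampling method of (3) is improved so that it alleviates the imbalance between minority and majority classes while retaining the heavy flows, and it is combined with the Boosting ensemble learning algorithm to improve classifier stability.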
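The random-interpolation step named in contribution (3) can be illustrated with a minimal sketch. This is not the thesis's PSC implementation: the function name `interpolate_oversample`, the two-feature flow records, and the choice to interpolate between two randomly chosen minority samples of the same subset are all assumptions made for illustration.

```python
import random


def interpolate_oversample(minority, n_new, rng=None):
    """Create n_new synthetic samples by interpolating between random
    pairs of minority-class samples taken from one subset (hypothetical helper)."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)  # two distinct samples of the same class
        t = rng.random()                # interpolation factor in [0, 1)
        # Each feature of the new sample lies between the two parents' values.
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


# Hypothetical minority-class flows in one subset: [mean packet size, duration]
minority_flows = [[60.0, 0.5], [80.0, 0.7], [100.0, 0.9]]
new_flows = interpolate_oversample(minority_flows, n_new=5)
```

Because every synthetic sample is built from two samples of the same subset, the new points stay inside the region that subset already occupies, which is the property the abstract relies on when it applies oversampling after partitioning.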
Internet traffic classification is an important foundation for network management, quality-of-service guarantees, network accounting, and network security. Traditional traffic classification methods have difficulty accommodating the rapid development of network applications, and Internet traffic classification using machine learning (ML) is a promising alternative. However, the traffic classifier is usually optimized for high overall classification accuracy, which does not take into account the class imbalance of Internet traffic datasets. As a result, classification performance is biased towards the majority classes and the minority classes are neglected. In Internet traffic, some minority classes contain signaling flows or real-time communication flows, and their classification performance influences communication quality and user experience. Other minority classes carry a large number of bytes, and their classification performance affects network planning or bandwidth allocation.
     At present, systematic research on the class imbalance characteristics of Internet traffic and on corresponding classification methods is lacking. This thesis observes the class distribution of Internet traffic datasets in a selected feature space, analyzes their imbalance characteristics, and then studies Internet traffic classification methods from three aspects: data resampling, feature selection, and classification algorithms. The main contributions are as follows.
     (1) Class imbalance characteristics of Internet traffic datasets. This thesis studies the class imbalance characteristics of Internet traffic datasets from external and internal aspects. Comparing the number of flows and bytes in each traffic class shows that traffic datasets usually contain multiple majority classes and multiple minority classes, that there is a large gap between the flow counts of majority and minority classes, that a minority class may carry a large number of bytes, and that some classes show an obvious imbalance between large flows and small flows. The distribution of flow samples in the feature space shows that samples from the same class usually form several sub-concepts, that some sub-concepts have only a small number of flow samples, and that the flow samples of one class overlap those of other classes. A study of how these characteristics influence Internet traffic classification performance shows that multiple sub-concepts are more closely correlated with classification performance than flow-count imbalance or class overlap.
     (2) Cost-sensitive learning for traffic datasets with multiple minority classes. When a cost-sensitive learning algorithm is applied to classify traffic flows, a cost matrix based on flow ratios does not suit difficult classes, which have more flows than the smallest classes but are hard to identify. This thesis uses weights to improve the cost matrix. By analyzing the relationship between the degree of class imbalance and the room for increasing the misclassification cost, it proposes an evaluation metric for the degree of class imbalance and a method for calculating the weights. The method aims to increase the weights of difficult classes appropriately without significantly decreasing the classification performance of the majority classes.
     (3) Data resampling method for Internet traffic datasets. A traffic dataset may exhibit several imbalance-related factors, i.e., flow-count imbalance, class overlap, multiple sub-concepts, and small disjuncts. To handle these problems simultaneously, a hierarchical data resampling method named PSC (partition, sampling and combining) is proposed. First, the original traffic dataset is partitioned into multiple disjoint, dense subsets to reduce the number of sub-concepts. Oversampling is then performed on each cluster, which handles small disjuncts by adding flow samples for minority classes. Next, a heuristic undersampling method is performed on each class, for which rules for removing majority-class flow samples are devised, so as to alleviate class overlap. PSC can build training subsets with lower within-class dispersion, class overlap, and class imbalance.
     (4) Selection algorithm for Internet traffic flow features. Considering multiple sub-concepts, class overlap, and multiple minority classes, a balanced feature selection (BFS) algorithm is proposed. To select features under which the flow samples of a class have lower dispersion, a local correlation metric is proposed to evaluate the certainty of a feature on the flow samples of a class. To select features under which the flow samples of different classes overlap less, a global correlation metric is applied to evaluate the certainty of the class variable when a feature is given. Based on the local and global correlation of each feature, a search algorithm is proposed that selects, for each class, a locally correlated feature that also has high global discrimination power. The selected feature subset therefore includes features that help discriminate minority classes.
     (5) Classification methods for large flows. An imbalance between large flows and small flows exists in some classes, which may cause the classifier to neglect the learning of large flows. The flow-count imbalance between minority and majority classes may cause the classifier to neglect the classification performance of minority classes that carry a large number of bytes. Both cases may make large flows difficult to classify and lead to low byte accuracy. To handle the imbalance between small flows and large flows, a flow size modularization method based on information gain ratio (FSMGR) is proposed. With the objective of minimizing the data complexity of large flows, it searches for a partition threshold (related to bytes). The original traffic training set is partitioned into large-flow and small-flow subsets according to this threshold, and each subset is used to train its own classifier, so that the large flows are emphasized and the classification problem becomes easier. To handle the imbalance between the minority and majority classes, the PSC method in (3) is improved (the improved version is named BPSC) to alleviate the flow-count imbalance while retaining all large flows, and the boosting ensemble learning algorithm is used to improve the stability of the classifier.
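The distinction in contribution (5) between per-flow performance and byte accuracy can be made concrete with a small worked example. The flow records below are hypothetical and the helper name `flow_and_byte_accuracy` is an assumption. The sketch only shows how a classifier can label almost every flow correctly and still misattribute most of the bytes when a single large flow is misclassified.

```python
def flow_and_byte_accuracy(flows):
    """flows: list of (true_label, predicted_label, size_in_bytes) tuples.
    Returns (fraction of flows classified correctly,
             fraction of bytes carried by correctly classified flows)."""
    total_flows = len(flows)
    total_bytes = sum(size for _, _, size in flows)
    correct_flows = sum(1 for true, pred, _ in flows if true == pred)
    correct_bytes = sum(size for true, pred, size in flows if true == pred)
    return correct_flows / total_flows, correct_bytes / total_bytes


# Hypothetical data: 99 small web flows classified correctly,
# and one large FTP flow misclassified as web.
flows = [("web", "web", 1_000)] * 99 + [("ftp", "web", 901_000)]
flow_acc, byte_acc = flow_and_byte_accuracy(flows)
print(f"flow accuracy = {flow_acc:.2f}, byte accuracy = {byte_acc:.3f}")
# prints: flow accuracy = 0.99, byte accuracy = 0.099
```

In this toy case 99% of flows are correct but under 10% of bytes are, which is the situation that motivates treating large flows separately.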

