P2P Traffic Detection on Large-Scale NetFlow Data
Abstract
As P2P applications have become increasingly widespread, P2P traffic detection has become a hot topic in network traffic analysis. According to some reports, P2P applications generate more than 50% of total network traffic. Because P2P applications can cause network congestion, identifying P2P traffic within the overall network traffic has become an important problem for telecom operators. P2P applications transfer data over random ports, and P2P systems are inherently complex, distributed, and variable, all of which make P2P traffic detection harder.
     The main purpose of this paper is to study how to detect P2P traffic effectively on large-scale NetFlow data. Existing P2P detection methods all operate on packets. Analyzing the enormous number of packets on a backbone network consumes large amounts of storage and computing resources, so most academic research cannot be put into practical use. The P2P detection products currently in service capture packet contents with inline devices and rely on hardware for computation; they are expensive to deploy, scale poorly, and intrude on privacy. This paper uses NetFlow data for P2P detection and thereby overcomes these problems. NetFlow data aggregates and summarizes packet information, which preserves the important information that characterizes the traffic while reducing the data volume. Moreover, NetFlow is an industry standard that is already widely used by telecom operators.
     The main contributions of this paper are:
     1) Based on how P2P protocols operate, a series of characteristics that P2P traffic may exhibit is hypothesized. Each characteristic is tested experimentally for its effectiveness in distinguishing P2P traffic from non-P2P traffic, and the effective ones are selected as detection features according to the experimental results.
     2) A P2P traffic detection algorithm based on NetFlow data is designed. The algorithm organizes the effective features selected in 1) according to the detection logic, which makes detection more efficient.
     3) Based on the algorithm in 2), the P2P traffic detection system INFOPAD is implemented. The system stores its data in a database and implements the detection algorithm as SQL queries, which effectively solves the problem of storing and computing over large volumes of traffic data. Each detection rule forms an independent module, and a new rule can easily be integrated into the system as a new module, so the system architecture is open and extensible.
     4) INFOPAD is evaluated on real NetFlow data collected from Shanghai Telecom routers, and the detection results are verified against the deep packet inspection (DPI) report provided by Shanghai Telecom. The experiments show that the detection algorithm of INFOPAD achieves high accuracy and that the system's performance meets the requirements of offline analysis.
     The detection system implemented in this paper is suited to P2P traffic detection on telecom backbone networks. It receives the large volume of NetFlow data exported by routers, analyzes the data offline, and produces a P2P traffic report. The system is already used in the daily network management of Shanghai Telecom. Compared with the DPI product that Shanghai Telecom had previously deployed, it achieves a comparable level of accuracy at a much lower deployment cost, and its detection algorithm is easier to maintain and update.
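     The idea in contribution 3), detection rules kept as independent SQL queries over stored flow records, can be illustrated with a minimal sketch. Everything specific below is an assumption made for illustration and is not taken from INFOPAD: the `flows` table and its columns, the `RULES` registry, the two sample rules, and the threshold of 3. The first rule flags host pairs that communicate over both TCP and UDP, a heuristic known from the transport-layer P2P identification literature; the second flags hosts that contact several distinct peers on high ports.

```python
import sqlite3

# Hypothetical flow schema: a small subset of typical NetFlow fields.
SCHEMA = """
CREATE TABLE flows (
    src_ip   TEXT,
    dst_ip   TEXT,
    src_port INTEGER,
    dst_port INTEGER,
    proto    INTEGER,  -- 6 = TCP, 17 = UDP
    bytes    INTEGER
)
"""

# Each rule is an independent SQL query returning suspect host IPs.
# Adding a rule means adding one entry; existing rules stay untouched.
# Both rules are illustrative assumptions, not INFOPAD's actual rules.
RULES = {
    # A source that reaches the same destination over both TCP and UDP.
    "tcp_udp_pair": """
        SELECT src_ip AS ip FROM flows
        WHERE proto IN (6, 17)
        GROUP BY src_ip, dst_ip
        HAVING COUNT(DISTINCT proto) = 2
    """,
    # A source that contacts at least 3 distinct peers on high ports.
    "many_peers": """
        SELECT src_ip AS ip FROM flows
        WHERE dst_port > 1024
        GROUP BY src_ip
        HAVING COUNT(DISTINCT dst_ip) >= 3
    """,
}


def detect(conn):
    """Run every rule and map each suspect IP to the rules it triggered."""
    hits = {}
    for name, sql in RULES.items():
        for (ip,) in conn.execute(sql):
            hits.setdefault(ip, set()).add(name)
    return hits


def load_demo():
    """Build an in-memory database with a few made-up flow records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    rows = [
        ("10.0.0.1", "10.0.0.2", 40000, 51413, 6, 90000),
        ("10.0.0.1", "10.0.0.2", 40001, 51413, 17, 1200),
        ("10.0.0.1", "10.0.0.3", 40002, 6881, 6, 50000),
        ("10.0.0.1", "10.0.0.4", 40003, 6882, 6, 70000),
        ("10.0.0.9", "10.0.0.5", 50000, 80, 6, 3000),
    ]
    conn.executemany("INSERT INTO flows VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    for ip, rules in detect(load_demo()).items():
        print(ip, sorted(rules))
```

     In this sketch the database does the grouping and counting, so the Python side only collects results; a further rule would be one more entry in `RULES`, which mirrors the modular design described above under the stated assumptions.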
