面向多核多线程的BGP协议并行技术研究

英文题名：Research on BGP Parallelism Technologies for Multicore and Multi-threading
作者：高蕾
论文级别：博士
学科专业名称：计算机科学与技术
中文关键词：多核处理器 ; BGP ; 性能优化 ; 线程级并行 ; 路由表并行访问 ; 线程间路由通告
英文关键词：Multicore Processor ; BGP ; Performance Optimization ; Thread Level Parallelism ; Parallel Access to Routing Table ; Route Announcement among Threads
学位年度：2009
导师：龚正虎 ; 刘亚萍
学科代码：081202
学位授予单位：国防科学技术大学
论文提交日期：2009-09-01

摘要

为了适应超大规模超高速信息网络的需求,下一代Internet应具备安全性、实时性、可管理性、可用性和可扩展性等特点,而核心路由器作为构建Internet网络的重要基础设施,其现有硬件体系结构及协议软件结构已无法满足上述要求。特别是,随着Internet规模的快速增长以及各种网络应用的大量涌现,路由系统中关键域间路由协议BGP(Border Gateway Protocol)逐渐面临着性能与多维可扩展性等方面的严峻挑战。多核处理器具有较强计算能力、支持多个层次的并行性、拥有更高存储与I/O带宽、易扩展以及功耗低等巨大优势,为解决BGP协议所面临的问题提供了广阔的研究空间,成为构建下一代Internet网络的基石。因此,研究BGP协议如何充分利用多核处理器在结构与性能方面的优势来满足未来Internet需求,为下一代互联网基础设施建设提供理论基础和技术支撑,具有非常重要的实践意义。
     本文面向多核处理器平台,以BGP协议性能优化为重点,展开BGP协议线程级并行技术方面的研究。首先,通过分析BGP协议行为与潜在的并行特征,提出了两种线程化BGP(T-BGP)协议并行软件结构:多实例T-BGP与多任务T-BGP,并给出了它们的体系结构设计;继而,重点研究了多任务T-BGP协议中的推测线程划分,以及多实例T-BGP协议中的路由表并行访问与线程间路由通告技术;最后,在多实例T-BGP协议基础上完成了T-BGP原型系统的设计与实现。本文取得的研究成果主要包括以下几个方面:
     1、提出了两种T-BGP协议并行软件结构:多实例T-BGP协议与多任务T-BGP协议。多实例T-BGP协议以邻居会话划分为基础,由一个管理线程和一组BGP实例线程组成,利用数据并行思想将邻居会话分布在不同BGP实例线程上并行处理。该并行结构具有性能优化效果好、多维可扩展性强、路由选择实时性高等优势,程序开发设计难度较低,且易实现线程间的负载均衡,然而,其运行效率的提升还取决于对路由表并行访问冲突与线程间路由通告阻塞等关键问题的有效解决。多任务T-BGP协议以缓解BGP协议运行瓶颈为目标,采用了任务并行方式,将协议执行核心的更新报文处理过程分解在解析线程、路由更新线程和封装线程上并行执行,而其余功能仍由管理线程处理。该协议旨在挖掘程序内部的任务并行性,其并行性能的发挥取决于线程划分与线程调度等关键因素。
     2.提出了一种基于局部推测的线程划分技术,挖掘多任务T-BGP协议中更细粒度的线程并行来有效提高协议并行度。通过分析多任务T-BGP协议各功能线程中潜在的可推测性,采用基于最小割的启发式算法对各功能线程进行独立的局部推测线程划分。同时,还提出相应的线程推测策略来实现多线程并行执行,采用推测提交策略和写后读相关检测机制确保了线程推测执行与程序运行结果的正确性。最后,完成了基于局部推测技术的多任务T-BGP协议的功能验证与性能模拟,实验结果显示,该方法具有线程划分效果好且计算复杂性低的特点,采用局部推测技术之后显著减少了各功能线程的执行时间,获得了较好的性能提升。
     3.提出了一种高效路由表并行访问技术,有效解决了多实例T-BGP协议沿用传统路由表结构带来的路由表并行访问冲突问题。该方法将传统路由表重新组织成为一种动态可重构路由表结构,提出了细粒度动态拆分重组算法,周期性地动态调整路由表结构,根据路由节点访问频度来划分子树进而重组为多个子树集合。动态可重构路由表通过两级访问、局部子树加锁等途径支持不同线程并行访问路由表,有效克服了串行访问路由表的性能瓶颈。同时拆分重组算法通过均衡各子树集合上的访问负载,使得路由表并行访问冲突降低了92.5%。另外,还修改了路由表操作以支持简捷的访问行为。实验结果显示,在三种线程数量配置下,单线程最大路由更新耗时较使用传统路由表时平均降低了约48.7%、50.5%与71.1%,有效缓解了路由更新阶段的性能瓶颈。
     4.提出一种适用于多实例T-BGP协议的无阻塞线程间路由通告技术,克服了采用传统基于锁机制共享队列的路由通告方式所带来的大量共享访问冲突问题。在该方法中,每个线程为所有邻居各维护一个邻居播报队列,将更新后路由仅写入本线程所有播报队列;在获取播报路由时,各线程将根据本地处理的邻居会话标识从所有线程匹配的邻居播报队列中读取路由信息,避免了各线程竞争访问邻居播报队列。同时每个邻居播报队列都设计为SCLF(Speedy Concurrency Lock-free FIFO)无锁队列,实现了线程间无阻塞快速路由信息传播;还设计了带cache乒乓缓解的路由播报过程以降低因cache数据颠簸所造成的cache失效开销。实验结果显示,该方法使得线程间路由通告时间平均降低了约79.4%,设计的SCLF无锁队列显著改善了队列操作时间,较Lamport队列的操作耗时减少了56.5%左右。
     5、在多实例T-BGP协议的基础上,完成了T-BGP原型系统的设计与实现,并重点阐述了主线程与从线程设计的实现细节。该系统应用了动态可重构路由表技术与无阻塞线程间路由通告技术,较多实例T-BGP协议能够提供更加高效的并行处理能力。继而,还对T-BGP原型系统的路由学习时间、eBGP邻居切换速度、CPU使用率及系统加速比等关键指标进行了测试与分析,实验结果显示,与传统BGP协议相比,主要性能指标均得到了明显改善。
     综上所述,本文的工作针对BGP协议性能优化提出了有效的解决方案,对于推进新型域间路由技术的理论研究和实用化具有一定的理论价值与应用价值。
To satisfy the requirement for the ultra large-scale and high-speed information network, the next generation Internet should possess the characteristics of security, real time, manageability, availability and scalability. For the limitations of hardware and intricate software architecture, the core routers which are the most important infrastructure on Internet cannot satisfy the requirements above. Especially, with the scale growth of Internet and the emergence of a large number of applications, the critical BGP (Border Gateway Protocol) in the interdomain routing system, gradually faces the challenges of performance and multi-dimensional scalability. Multicore processors have the advantages of powerful computing ability, multi-level parallelism, higher bandwidth for memory and I/O, and they are easy to scale and consume less energy, thus providing a broad research platform as the basis of the next generation Internet. Therefore, it is of great significance for practice to research how BGP utilizes the advantages of architecture and performance on multicores to meet the demands of Internet in the future, so as to provide the theoretical basis and technical support for the construction of the next generation Internet infrastructure.
     Towards multicore platform, the performance optimization of BGP is focused and the research of thread-level parallelism for BGP is carried out in this dissertation. Firstly, by analyzing the parallelism underlying in the behavior of BGP, two parallel architectures of Threaded BGP (T-BGP) including multi-instance T-BGP and multi-task T-BGP are presented along with their architecture designs. This dissertation emphasizes to address the issues of the parallel access to routing table and route propagation among threads in the first architecture and then the speculative thread partition in the second one. At last, the T-BGP prototype system is designed and implemented. The contributions of this dissertation are as follows:
     1. Two thread-level parallel architectures including multi-instance T-BGP and multi-task T-BGP are proposed on multicores. Based on the neighbor session division, the multi-instance T-BGP is composed of a manager thread and a group of BGP instance threads. With the idea of data parallelism, this architecture distributes the neighbor sessions on different BGP instance threads to achieve the parallel processing. Multi-instance T-BGP has the advantages of good optimization effects for performance, strong multi-dimensional scalability, high real-time for route selection, etc., and it is easy to implement the load balance and has low design complexity. But the key issues of parallel access to routing table and route announcement among threads both of which decide the performance of the protocol are still needed to be resolved. On the other hand, aiming at releasing the performance bottleneck of BGP, the multi-task T-BGP employs the way of task parallelism, and divides the update packet processing which is the core procedure of BGP into three parallel function threads involving parsing thread, route updating thread and packing thread, leaving the remaining functions in the master thread. This architecture could explore the parallelism inherent in the protocol, but the critical issues of thread partition and thread scheduling are still to be addressed to further improve its performance.
     2. Coarse-grain task parallelism limits the performance improvement of the multi-task T-BGP, and exploiting its finer-grain parallelism becomes the major method to yield better parallel efficiency. Thus, with the analysis for potential speculation of the function threads in multi-task T-BGP, a thread partition technology based on local speculation is proposed. It presents the min-cut heuristic algorithm to divide the speculative threads locally in each function thread, and also put forwards the static speculative strategy to achieve the thread speculation, making threads work in parallel. Additionally, the ordinal commit mechanism and the dependence detection strategy ensure the validity of thread speculation and program results. At last, the multi-task T-BGP based on local speculation is preliminarily implemented, and the functional verification and performance evaluation are carried out. The experimental results show that our method has the characteristics of good partition effects and low computational complexity, and greatly reduces the runtime of all fuction threads by using local speculation, thereby yielding good performance improvement.
     3. A mass of contentions by racing to access the routing table are brought in the multi-instance T-BGP, if still using the traditional routing table. Thus, this dissertation proposes a highly-efficient parallel access approach, in which a novel dynamic reconfigurable routing table is presented to achieve the fast parallel table access. In this approach, a heuristic-based divide-and-recombine algorithm is devised to partition routing table into subtries according to the real-time access frequency of table nodes, and then reorganize the subtries into several sets. And the algorithm is periodically performed to dynamically regulate the table structure. The main advantages of the dynamic reconfigurable routing table are concentrated on three aspects as follows. Firstly, the parallelization of route update by multi-threading is achieved through two-phase trie access by locking subtrie, so as to efficiently break the restriction of sequential table access. Secondly, the proposed divide-and-recombine algorithm significantly reduces the parallel access contentions by balancing accesses to subtrie sets. Thirdly, the table operations are effectively optimized to maintain the consistency with the behaviors of traditional routing table, attaining the higher rate of route update. The experimental results show that, when compared to traditional routing table, the maximal update time per thread is decreased by nearly 48.7%, 50.5% and 71.1% with the dynamic reconfigurable routing table under three thread configurations respectively, efficiently releasing the performance bottleneck of route update.
     4. A non-blocking route announcement method is put forward for the multi-instance T-BGP, thereby eliminating a great number of contentions which are incurred by the common method with the lock-based shared queues. In our scheme, each thread maintains an advertising queue for each neighbor with announcing the locally updated routes, and then gets the routes from the advertising queues of all threads with matching its local peer, thus avoiding the conflicts of accessing the same advertising queue by different threads. Additionally, each advertising queue is designed as the SCLF to significantly enhance the rate of operating the shared queues, and achieves the non-blocking route propagation among threads. At last, the route announcing process with cache ping pong mitigation is also presented to reduce the miss overhead by cache thrashing. The experimental results show that the runtime of route announcement among threads are reduced by 79.4% on average and the SCLF decreases the operation time by nearly 56.5% compared to Lamport lock-free queue.
     5. Based on the multi-instance T-BGP architecture, the T-BGP prototype system is implemented focusing on the details of the master thread and the slave one. With the key technologies of parallel access to routing table and non-blocking route announce- ment among threads, T-BGP could provide high-efficiency parallelism in comparison with multi-instance T-BGP. Finally, the experimental results for the key metrics of route learning time, the switch rate of eBGP neighbor, the CPU utilization and the system speedup show that T-BGP yields good performance improvement compared to BGP.
     To sum up, we present well-evaluated solutions in this dissertation for some key issues of BGP performance optimization. We believe that our contributions make a nice groundwork for future research and engineering on the interdomain routing protocol optimization both in theory and practice.

引文

[1].郑纬民,张广艳,薛瑞尼.计算机科学发展综合报告[R].北京:清华大学出版社,2006: 1~46.
    [2]. RFC 1723, RIP Version 2 - Carrying Additional Information[S].
    [3]. RFC 1142, OSI IS-IS intra-domain routing protocol[S].
    [4]. RFC 2328, OSPF version 2[S].
    [5]. RFC 904, Exterior Gateway Protocol Formal Specification[S].
    [6]. RFC 4271, A Border Gateway Protocol 4 (BGP-4)[S].
    [7]. Jakma P. Quagga Routing Software Suit[EB/OL]. http://www.quagga.net/down- load.html, 2009-08-28/2008-05-8.
    [8]. RFC 1519, Classless Inter-Domain Routing (CIDR): an Address Assignment and Aggregation Strategy[S].
    [9]. Huston G. Internet BGP Table[EB/OL]. http://www.potaroo.net/.html, 2009-09 -29/2008-10-01.
    [10]. Bu T, Gao L, Towsley D. On routing table growth[C] // Proc. of ACM SIG- COMM Computer Communication Review, New York, USA, 2002(32):7~77.
    [11]. Huston G. The BGP Report for 2005[EB/OL]. http://www.potaroo.net/papers/ isoc/ 2006-06/ bgpupds.html, 2006-06-01/2007-05-01.
    [12]. Labovitz C, Malan G R, Jahanian F. Internet Routing Instability[J]. IEEE/ACM Transaction on Networking, 1998, 6(5):515~528.
    [13].段晶晶,唐海娜,李俊.BGP中高活动IP地址前缀的测量与分析[J].微电子学与计算机,2007,24(3):37~40.
    [14]. Iannaccone G, Chuah C, Mortier R, et al. Analysis of link failures in an IP backbone[C] // Proc. of IMW, New York, 2002:237~242.
    [15]. Griffin T, Wilfong G. On the correctness of ibgp configuration[C] // Proc. of ACM SIGCOMM, Pittsburgh, Pennsylvania, USA, 2002:17~29.
    [16]. Kleffman M D. Analysis of Effects of BGP Black Hole Routing on a Network Like the NIPRNET[D]. Ohio:Wright-Patterson Air Force Base, 2005:1~105.
    [17]. Taft N. The basics of bgp routing and its performance in today’s internet[C] // Proc. of RHDM High-speed Networks and Multimedia Workshop, France, 2001: 35~42.
    [18]. Griffin T, Shepherd F B, Wilfong G. The stable paths problem and interdomain routing[J]. IEEE/ACM Transactions On Networking, 2002, 10(1):232~234.
    [19]. Varadhan K, Govindan R, Estrin D. Persistent route oscillations in inter-domain routing[J]. Computer Networks, 2000, 32(1):1~16.
    [20]. Agarwal S, Chuah C N, Bhattacharyya S, et al. Impact of bgp dynamics on router cpu utilization[C] // Barakat C and Pratt I(eds.), Passive Active Measurement Workshop, France: Springer-Verlag GmbH, 2004(3015):278~288.
    [21].贾凤根,张建辉,郭云飞等.BGP路由协议实际应用中的问题及解决方案[J].电信科学, 2004,20(3):9~13.
    [22].赵金晶.域间路由系统自组织特性及关键问题研究[D].长沙:国防科技大学,2007:1~139.
    [23]. Feldmann A, Kong H W, Maennel O, et al. Measuring bgp pass-through times[C] // Barakat C and Pratt I(eds.). Passive Active Measurement Workshop, France: Springer-Verlag GmbH, 2004(3015):267~277.
    [24]. Hennessy J L, Patterson D A. Computer Architecture: A quantitative approach[M]. USA: Morgan Kaufman Publishers, 2006:196~264.
    [25]. Kahle J A, et al. Introduction to the Cell multiprocessor[J]. IBM Journal of Research and Development, 2005, 49(4/5):589~604.
    [26]. Schwartz J. UltraSPARC T2 Processor-Overview[EB/OL]. http://www.sun.com/ processors/UltraSPARC-T2/.html, 2008-03-04/2008-09-02.
    [27]. Sinharoy B, Kalla R N, Tendler J M. POWER5 System microarchitecture[J]. IBM Journal of Research and Development, 2005, 49(4/5):505~521.
    [28]. Intel's multicore architecture briefing[EB/OL]. http://www.cdrinfo.com/Sections/ News/Details.aspx?NewsId=22782, 2008-03-18/2009-02-27.
    [29]. Chaudhry S, Cypher R, Ekman M, et al. Rock: A High-Performance Sparc CMT Processor[J]. IEEE Micro, IEEE Computer Society Press, 2009, 29(2):6~16.
    [30]. Gepner P, Fraser D L, Kowlik M F. Second Generation Quad-Core Intel Xeon Processors Bring 45 nm Technology and a New Level of Performance to HPC Applications[C] // Proc. of ICCS, Krakow, Poland, 2008:417~426.
    [31]. Nishikawa Y, Koibuchi M, Yoshimi M. Performance Improvement Methodology for ClearSpeed's CSX600[C] // Proc. of ICPP, 2007:77~77.
    [32]. Teraops research chip[EB/OL]. http://techresearch.intel.com/articles/Tera-Scale/ 1449.html, 2007-04-25/2009-08-28.
    [33].冯小兵,李致远等.编译技术:应对片上多处理器的挑战[R].北京:清华大学出版社,2006:375~424.
    [34].徐恪,熊勇强,吴建平.宽带IP路由器的体系结构分析[J].软件学报,2002,11(2): 179~186.
    [35].张晓哲.路由协议并行处理技术研究.长沙:国防科技大学,2005:1~80.
    [36]. Henriksson T, Nordqvist U, Liu D K. Embedded protocol processor for fast and efficient packet reception[C] // Proceedings of IEEE International Conference on Computing Design: VLSI in Computers and Processors, Freiburg,2002(2):414~419.
    [37]. Ang B S. An evaluation of an attempt at offloading tcp/ip protocol processing onto an i960rn-based inic[R]. USA: Computer Systems and Technology Laboratory HP Laboratories, 2001.
    [38]. Maruyama M, Takahashi N. CORErouter-I: An Experimental Parallel IP Router Using a Cluster of Workstations[J]. IEICE Transaction, 1997, E80-B(10): 1407~1414.
    [39].王圣,苏金树.TCP加速技术研究综述[J].软件学报.2004,15(11):1689~1699.
    [40]. Bilic H, Yitzhak B. Deferred segmentation for wire-speed transmission of large tcp frames over standard gbe networks[C] // Proceedings of the Ninth Sym- posium on High Performance Interconnects (HOTI), IEEE Hot Interconnects, 2001(9): 81~85.
    [41]. Blackwell T. Speeding up protocols for small messages[C] // Proceedings of Applications, technologies, architectures, and protocols for computer communications, Palo Alto, California, 1996: 85~95.
    [42]. Deval M, Khosravi H, Muralidhar R. Distributed control plane architecture for network elements[J]. Intel Technology Journal, 2003, 7(4):1689~1699.
    [43]. Halabi S. Pluris Massively Parallel Routing[R]. USA: Pluris Inc., 1999.
    [44]. Xiao X P, Ni L M. Parallel routing table computation for scalable ip routers[C] // Proceedings of the IEEE International Workshop on CANPC. Las Vegas, Nevada, USA: Springer-Verlag, 1998:144~158.
    [45]. Bjorkman M, Gunningberg P. Performance Modeling of Multiprocessor Implementations of Protocols[J]. IEEE/ACM Transactions On Networking, 1998, 6(3):262~273.
    [46]. Hutchinson N, Peterson L. The x-kernel: An architecture for implementing network protocols[J]. IEEE Trans. Software Eng, 1991(17): 64~75.
    [47]. Balaji P, Feng W, et al. Head-to-toe evaluation of high-performance sockets over protocol offload engines[C] // Proc. of Cluster, Boston, MA, 2005: 1~10.
    [48]. Falsafi B, Wood D A. Parallel Dispatch Queue: A Queue-Based Programming Abstraction To Parallelize Fine-Grain Communication Protocols[C] // Proc. of the Fifth International Symposium on High Performance Computer Architecture (HPCA), Orlando, FL, USA, 1999:182~192.
    [49]. Koufopavlou O G, Ziterbart M. Parallel tcp for high performance communication subsystems[C] // Proc. of GLOBECOM, Orlando, FL, USA, 1992: 1395~1399.
    [50]. Yates D J, Nahum E M, Kurose J F, et al. Networking Support for Large Scale Multiprocessor Servers[C] // Proc. of IEEE Workshop on the Architecture and Implementation of High Performance Communications Subsystems (HPCS),Mystic, CT, 1995:153~157.
    [51]. Kaiserswerth M. The parallel protocol engine[J]. IEEE/ACM Transactions On Networking, 1993, 1(6):650~663.
    [52]. Bare A, Jayasumana A P, Piratla N M. On growth of parallelism within routers and its impact on packet reordering[C] // Proc. of IEEE workshop on local and metropolitan area networks, Princeton, NJ, 2007: 145~150.
    [53]. Goldberg M, Neufeld G, Ito M. A parallel approach to OSI connection-oriented protocols[C] // Proc. of 3rd IFIP Workshop Protocols for High-Speed Networks, Stockholm, Sweden, 1992:225~240.
    [54]. Liu H. A trace driven study of packet level parallelism[C] // Proc. of ICC, New York, USA, 2002: 2191~2195.
    [55].张晓哲,朱培栋,卢锡城.基于Cluster的BGP协议分布式并行实现模型及关键算法[J]。计算机研究与发展,2004, 41(suppl):115~121.
    [56]. Klockar T, Hidell M, et al. Modularized bgp for decentralization in a distributed router[C] // Proc. of Winternet Grand Finale workshop, poster, 2005: 65~65.
    [57]. Ball D A, Bennett R E, et al. Distributed software architecture for implementing BGP[P]. USA: US 2005/0074003, 2005.
    [58]. Zhang X, Zhu P, Lu X. Fully-distributed and highly-parallelized implement- tation model of bgp4 based on clustered routers[C] // Proc. of ICN 2005, LNCS 3421, Berlin: Springer-Verlag, 2005:433~441.
    [59]. Wu K, Wu J, Xu K. A tree-based distributed model for BGP route processing[C] // Proc. of HPCC 2006, LNCS 4208, Berlin: Springer-Verlag, 2006:119~128.
    [60]. Xu K, He H. BGP parallel computing model based on the iteration tree[J]. Journal of China Universities of Posts and Telcommunications, 2008, 15(Suppl.): 1~8.
    [61].侯锐.多核加速串行程序技术综述[J].信息技术快报,2006,4(6):1~13.
    [62]. Blagojevic F, Stamatakis A, Antonopoulos C D, et al. Dynamic multigrain parallelization on the cell broad band engine[C] // Proc. of PPoPP, San Jose, California, 2007:90~100.
    [63]. Chen T, Cao M, Shi Q. Component-based network protocol architecture for multi-core[C] // Proc. of International Conf. Networking, Architecture, and Storage, 2008:189~190.
    [64]. Ennals R, Sharp R, Mycroft A. Task Partitioning for Multi-Core Network Processors[J]. Computer Science, 2005(3443): 76~90.
    [65]. Mignolet J Y, Baert R, Ashby T J, et al. MPA: Parallelizing an application onto a multicore platform made easy[C] // Proc. of IEEE MICRO, 2009:31~39.
    [66]. Ostadzadeh S A, Meeuws R J, et al. A multipurpose clustering algorithm for taskpartitioning in multicore reconfigurable systems[C] // Proc. of International Conf. Complex, Intelligent and Software Intensive Systems, Ukuoka, Japan 2009:663~668.
    [67]. Gu Z, Yuan M, He X. Optimal Static Task Scheduling on Reconfigurable Hardware Devices Using Model-Checking[C] // Proc. of IEEE Real Time and Embedded Technology and Applications Symposium, Bllevue, WA, 2007: 32~44.
    [68]. Qu Y, Soininen J, Nurmi J. Static Scheduling Techniques for Dependent Tasks on Dynamically Reconfigurable Devices[J]. Journal of Systems Architecture, 2007, 53(11):861~876.
    [69]. Wiangtong T, Cheung P, Luk W. Cluster-Driven Hardware/Software Partitioning and Scheduling Approach for a Reconfigurable Computer System[J]. Computer Science, 2003(2778):1071~1074.
    [70]. Kranz D A, Halstead J R H, Mohr E. Mul-T: a High-Performance Parallel Lisp[J]. SIGPLAN Not., 1989, 24(7): 81~90.
    [71]. Prechelt L, Hansgen S U. Efficient parallel execution of irregular recursive programs[J]. IEEE Transactions on Parallel and Distributed Systems, 2002, 13(2): 167~178.
    [72]. Taura K, Tabata K, Yonezawa A. StackThreads/MP: Integrating Futures into Calling Standards[J]. SIGPLAN Not., 1999, 34(8): 60~71.
    [73]. Duran A, Corbalan J, Ayguade E. An adaptive cut-off for task parallelism[C] // Proc. of SC, Austin, Texas, USA, 2008: 1~11.
    [74]. Goldstein S C, Schauser K E, Culler D E. Lazy Threads: Implementing a Fast Parallel Call[J]. Journel of Parallel Distrib. Comput., 1996, 37(1): 5~20.
    [75]. Ottoni G, Rangan R, Stoler A, et al. I. August. Automatic thread extraction with decoupled software pipelining[C] // Proc. of MICRO, Los Alamitos, CA, USA, 2005: 105~118.
    [76]. Ottoni G, Rangan R, Stoler A, et al. From sequential programs to concurrent threads[J]. IEEE Computer Architecture Letters, 2006(5): 6~9.
    [77]. Dai J, Huang B, Li L, Harrison L. Automatically partitioning packet processing applications for pipelined architectures[C] // Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, New York, NY, USA, 2005: 237~248.
    [78].张鹏,杜建国,解晓东等.一种基于多核流水的多标准视频解码器体系结构[J].计算机研究与发展.2008,45(11):1947~1954.
    [79]. Hu X, Tang X, Hua B. High-performance IPv6 forwarding algorithm for multi-core and multithreaded network processor[C] // Proc. of PPoPP, New York,NY, USA, 2006: 168~177.
    [80].孙小涓,孙凝晖,陈明宇.多核平台上B-NIDS的优化[J].计算机研究与发展,2007,4(11):1733~1740.
    [81]. Wang J, Cheng H, Hua B, et al. Practice of parallelizing network applications on multi-core architectures[C] // Proc. of ICS, Yorktown Heights, New York, USA, 2009: 204~213.
    [82]. Liu Y, Xu D, Song W, et al. Design and implementation of high performance IPSec applications with multi-core processors[C] // Proc. of International Semi- nar on Future Information Technology and Management Engineering, 2008: 595~598.
    [83]. El-Mahdy A, El-shishiny H. An efficient load-balancing algorithm for image processing applications on multicore processors[C] // Proc. of IFMT, Cairo, Egypt, 2008(356): 1~8.
    [84]. Rangan R, Vachharajani N, et al. Performance scalability of decoupled software pipelining[J]. ACM Trans. on Architecture and Code Optimization, 2008, 5(2): 1~25.
    [85]. Thies W, Chandrasekhar V, Amarasinghe S. A practical approach to exploiting coarse-grained pipeline parallelism in c program[C] // Proc. of IEEE/ACM Inter- national symposium on Microarchitecture, Washington DC, USA, 2007: 356~368, 2007.
    [86]. Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters[C] // Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI), NY, USA, 2004: 137~150.
    [87]. DeWitt D, Gray J. Parallel database systems: The future of high performance database processing[J]. Communications of the ACM, 1992, 35(6): 85~98.
    [88]. Tarditi D, Puri S, Oglesby J. Accelerator: using data-parallelism to program GPUs for general-purpose uses[C] // Proc. of ASPLOS, Boston, MA, 2006: 325~335.
    [89]. Cieslewicz J, Ross K A. Data partitioning on chip multiprocessors[C] // Proc. of DaMoN, Vancouver, Canada, 2008: 25~34.
    [90]. Bader D A, Agarwal V, Kang S. Computing discrete transforms on the Cell Broadband Engine[J]. Parallel Computing, 2009, 35(2009): 119~137.
    [91]. Isard M, Budiu M, Yu Y, et al. Dryad: distributed data-parallel programs from sequential building blocks[C] // Proc. of EuroSys, Lisboa, Protugal, 2007: 59~72.
    [92]. Kulkarni M, Pingali K, Walter B, et al. Optimistic parallelism requires abstractions[C] // Proc. of PLDI, NY, USA, 2007, 42(6):211~222.
    [93]. Kulkarni M, Ramanarayanan G, Walter B, et al. Optimistic parallelism benefitsfrom data partitioning[C] // Proc. of ASPLOS, Seattle, Washington, USA, 2008: 233~243.
    [94]. Saglam B E, Mooney V J. System-on-a-chip processor synchronization support in hardware[C] // Proc. of Conf. on Design, automation and test in Europe, Munich, Germany, 2001: 633~641.
    [95]. Cheng L, Carter J. Fast barriers for scalable ccNUMA systems[C] // Proc. of International Conference on Parallel Processing, 2005: 241~250.
    [96]. Nicolau A, Li G, Kejariwal A. Techniques for efficient placement of synchronization primitives[C] // Proc. of PPoPP, Raleigh, North Carolina, USA, 2009: 199~208.
    [97]. Sampson J, Gonzalez R, et al. Exploiting fine-grained data parallelism with chip multiprocessors and fast barriers[C] // Proc. of MICRO, Washington DC, USA, 2006: 235~246.
    [98]. Ding C, Shen X, Kelsey K, et al. Software behavior oriented parallelization[C] // Proc. of PLDI, NY, USA, 2007: 223~234.
    [99]. Chen M K, Olukotun K. The Jrpm System for Dynamically Parallelizing Java Programs[C] // Proc. of the 30th Annual Intl. Symp. On Comp. Arch., NY, USA, 2003: 434~46.
    [100]. Steffan J G, Colohan C B, et al. A scalable approach to thread-level speculation[C] // Proc. of ISCA, NY, USA, 2000: 1~12.
    [101]. Balarkrishnan S, Sohi G S. Program demultiplexing: data-flow based speculative parallelization of methods in sequential programs[C] // Proc. of ISCA, Madison, WI, USA, 2006: 163~172.
    [102]. Gao L, Li L, Xue X, et al. Exploiting speculative tlp in recursive programs by dynamic thread prediction[C] // Proc. of ETAPS, Springer-Verlag, Berlin, Hei- delberg, 2009: 78~93.
    [103]. Gupta M, Nim R. Techniques for speculative run-time parallelization of loops[C] // Proc. of SC, San Jose, USA, 1998: 1~12.
    [104]. Tian C, Feng M, Nagarajan V, Gupta R. Copy or discard execution model for speculative parallelization on multicores[C] // Proc. of International Symposium on Microarchitecture, Washington DC, USA, 2008: 330~341.
    [105]. Madriles C, Lopez P, et al. Boosting single-thread performance in multi-core systems through fine-grain multi-threading[C] // Proc. of ISCA, Austin, Texas, USA, 2009: 474~483.
    [106]. Johnson T A, Eigenmann R, Vijaykumar T N. Speculative thread decomposition through empirical optimization[C] // Proc. of PPoPP, San Jose, California, USA, 2007: 205~214.
    [107]. Wang C, Wu Y, Borin E, et al. Dynamic parallelization of single-threaded binary programs using speculative slicing[C] // Proc. of ICS, Yorktown Heights, New York, USA, 2009: 158-168.
    [108]. Kejariwal A, X. Tian, M. Girkar, et al. Tight analysis of the performance potential of thread speculation using SPEC CPU2006[C] // Proc. of PPoPP, San Jose, California, USA, 2007: 215~225.
    [109]. Wang C, Y. Wu, E. Borin, et al. Dynamic parallelization of single-threaded binary programs using speculative slicing[C] // Proc. of ICS, Yorktown Heights, New York, USA, 2009: 158~168.
    [110]. Kejariwal A, Tian X, et al. On the performance potential of different types of speculative thread-level parallelism[C] // Proc. of ICS, Cairns, Queensland, Aus- tralia, 2006: 24-32.
    [111]. Marcuello P, Gonzalez A, Tubella J. Thread partitioning and value prediction for exploiting speculative thread-level parallelism[J]. IEEE Trans. on Computers, 2004, 53(2): 114~125.
    [112].林福利.BGP协议并行性分析与评测[D].长沙:国防科技大学,2009: 1~65.
    [113]. Zilles C B. Master/Slave Speculative Parallelization and Approximate Code[D]. USA: Computer Sciences Department, University of Wisconsin-Madison, 2002: 72~98.
    [114]. Sohi G S, Breach S E, Vijaykumar T N. Multiscalar processors[C] // Proc. of ISCA, NY, USA, 1995: 414~425.
    [115]. Wang Z, O’Boyle M F P. Mapping parallelism to multi-cores: a machine learning based approach[C] // Proc. of PoPP, Raleigh, North Carolina, USA, 2009: 75~84.
    [116]. Kwok Y K, Ahmad I. Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors[J]. ACM Computing Surveys, 1999, 31(4):406~471.
    [117]. Liu W, et al. POSH: A TLS Compiler that Exploits Program Structure[C] // Proceedings of the Symposium on Principles and Practice of Parallel Programming, 2006: 158~167.
    [118]. Vijaykumar T N, Sohi G S. Task Selection for a Multiscalar Processor[C] // Proc. of International Symposium on Microarchitecture, Los Alamitos, CA, USA, 1998: 81~92.
    [119]. Tubella J, Gonzalez A. Control Speculation in Multithreaded Processors through Dynamic Loop Detection[C] // Proceedings of the 4th IEEE Symposium on High-Performance Computer Architecture, Washington DC, USA, 1998: 14~23.
    [120]. Codrescu L, Wills D S. On Dynamic Speculative Thread Partitioning and the MEM-Slicing Algorithm[J]. Journal of Universal Computer Science, 2000, 6(10):908~914.
    [121]. Sarkar V, Hennessy J. Partitioning Parallel Programs for Macro-Dataflow[C] // Proceedings of the Conference on LISP and Functional Programming, NY, USA, 1986: 202~211.
    [122]. Krishnan V, Torrellas J. Hardware and Software Support for Speculative Execution of Sequential Binaries on a Chip-Multiprocessor[C] // Proc. of ICS, New York, NY, 1998: 85~92.
    [123]. Olukotun K, Hammond L, Willey M. Improving the Performance of Speculatively Parallel Applications on the Hydra CMP[C] // Proc. of ICS, New York, NY, 1999: 21~30.
    [124]. Barli N D, Mine H, et al. A thread partitioning algorithm using structural analysis[J]. Joho Shori Gakkai Kenkyu Hokoku, 2000(74): 37~42.
    [125]. Chen L, Wu Y. Aggressive compiler optimization and parallelization with thread-level speculation[C] // Proc. of ICPP, 2007: 607~614.
    [126]. Moore B, Rodrigues A. Application of genetic algorithms to partitioning and scheduling[R]. France: University of Notre Dame, 2004: 1~5.
    [127]. Rotenberg E, Jacobson Q, etc. Trace Processors[C] // Proc. of the 30th Int. Symp. on Microarchitecture, Washington DC, USA, 1997: 138~148.
    [128]. Zilles C. Master/Slave Speculative Parallelization and Approximate Code[D]. USA: University of Wisconsin-Madison, 2002: 1~165.
    [129]. Chang F, Gibson G A. Automatic i/o hint generation through speculative execution[C] // Proc. of OSDI, New Orleans, Louisiana, USA, 1999: 1~14.
    [130]. Thain D, Livny M. The ethernet approach to grid computing[C] // Proc. of HPDC, New York: IEEE Computer Society Press, 2003: 138~147.
    [131]. Lai A, Falsafi B. Memory sharing predictor: the key to a speculative coherent dsm[C] // Proceedings of the 26th annual international symposium on Computer architecture, New York: IEEE Computer Society Press, 1999:172~183.
    [132]. Oplinger J, et al. Software and hardware for exploiting speculative parallelism with a multiprocessor[R]. CA, USA: Stanford University, 1997: 1~12.
    [133]. Nightingale E B, Chen P M, Flinn J. Speculative execution in a distributed file system[C] // Proc. of SOSP’05: Proceedings of the twentieth ACM symposium on Operating systems principles, New York, NY, USA, 2005: 191~205.
    [134]. Gopal S, Vijaykumar T N, Smith G E, et al. Speculative Versioning Cache[C] // Proc. of the 4th IEEE Symposium on High-Performance Computer Architecture, Las Vegas, NV, USA, 1998: 195~205.
    [135]. Ohsawa T, Takagi M, Kawahara S, et al. Pinot: Speculative Multi-threading Processor Architecture Exploiting Parallelism over a Wide Range of Granularities[C] // Proc. of MICRO, Barcelona, Spain, 2005: 81~92.
    [136]. VtuneTM. Intel Vtune Performance Analyzer for Linux[Z]. Santa Clara(CA) : Intel Corporation, 2007.
    [137].徐玖平,胡知能,李军.运筹学[M].北京:科学出版社,2008:35~47.
    [138]. Smartech Consulting Inc.. Spirent Adtech AX4000 broadband test system[EB/OL]. http://www.smartechconsulting.com/Test/Spirent-Adtech-AX- 4000.html, 2004-03-01/2007-05-01.
    [139]. Linux online Inc..Linux kernel[EB/OL]. http://www.linux.org.html, 2008-04-03/ 2008-04-03.
    [140]. Ruiz-Sanchez M A, Biersack E W, Dabbous W. Survey and taxonomy of IP address lookup algorithms[J]. IEEE Trans. On Networking, 2001, 15(2): 8~23.
    [141]. Sklower K. A Tree-Based Packet Routing Table for Berkeley Unix[C] // Proc. of 1991 Winter Usenix Conf., Dallas, TX, USA, 1991: 93~99.
    [142]. Gupta P, Lin S, McKeown N. Routing Lookups in Hardware at Memory Access Speeds[C] // Proc. of IEEE INFOCOM, San Francisco, CA, USA, 1998: 1240~ 1247.
    [143]. Sahni S, Kim K S. Efficient construction of multibit tries for ip lookup[J]. IEEE/ACM Transactions on Networking, 2003, 11(4): 650~662.
    [144]. Nilsson S, Karlsson G. IP address lookup using LC-tries[J]. IEEE JSAC, 1999, 17(6): 1083~1092.
    [145]. Degermark M, et al. Small Forwarding Tables for Fast Routing Lookups[C] // Proc. of ACM SIGCOMM, New York, USA, 1997: 3~14.
    [146]. Sangireddy R, Futamura N, Aluru S, et al. Scalable, memory efficient, high-speed ip lookup algorithms[J]. IEEE/ACM Transactions on networking, 2005, 13(4): 802~812.
    [147].梁志勇,吴建平等.基于非重叠前缀集合的并行路由查找系统.电子学报,2004,32(8): 1277~1281.
    [148]. Srinivasan T, et al. An efficient parallel ip lookup technique using CREW based multiprocessor organization[C] // Proceeding of the 4th Annual Communication Networks and Services Research Conference (CNSR), Washington DC, USA, 2006: 221~226.
    [149]. Zheng K, Hu C, Lu H, Liu B. A TCAM-Based Distributed Parallel IP Lookup Scheme and Performance Analysis[J]. IEEE/ACM Transactions on networking, 2006, 14(4): 863~875.
    [150]. Meyer D. RouteViews project[EB/OL]. http://www.routeviews.org/data.html, 2006-02-01/2008-05-01.
    [151].李宝峰,富弘毅,李韬译.多核程序设计技术-通过软件多线程提升性能[M].北京:电子工业出版社,2007:176~217.
    [152]. Scherer W N, Lea D, Scott M L. Scalable synchronous queues[C] // Proc. of PPoPP: Symposium on Principles and Practice of Parallel Programming, New York, USA, 2006: 147~156.
    [153]. Silberschatz A, Galvin P. Operation system concepts[M]. UK: Addison Wesley, 1994: 3~23.
    [154]. Ladan-Mozes E, Shavit N. An optimistic approach to lock-free FIFO queues[C] // Proc. of DISC’04, USA: Springer, 2004(3274): 117~131.
    [155]. Michael M, Scott M L. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms[C] // Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing (PODC), New York, USA, 1996: 267~275.
    [156]. Doherty S, Herlihy M, Luchangco V, et al. Bringing Practical Lock-Free Synchronization to 64-bit Applications[C] // Proc. of the 23rd Ann. ACM Symp. on Principles of Distributed Computing (PODC),New York, USA, 2004: 31~39.
    [157]. Sundell H, Tsigas P. Fast and lock-free concurrent priority queues for multi-thread systems[C] // Proc. IPDPS, Orlando, USA: Academic Press, 2003: 22~26.
    [158]. Mudigonda J, Vin H M, Yavatkar R. Overcoming the memory wall in packet processing: hammers or ladders[C] // Proceedings of the Symposium on Architecture for Networking and Communications Systems, New York, NY, USA, 2005: 1~10.
    [159]. Evequoz C. Non-blocking concurrent fifo queues with single word synchronization primitives[C] // Proc. of ICPP, Washington DC, USA, 2008: 397~405.
    [160]. Giacomoni J, Moseley T, et al. Fastforward for efficient pipeline parallelism a cache-optimized concurrent lock-free queue[C] // Proc. of PPoPP, Utah, USA, 2008: 42~52.
    [161]. Tsigas P, Zhang Y. A Simple, Fast and Scalable Non-Blocking Concurrent FIFO Queue for Shared Memory Multiprocessor Systems[C] // Proc. of the 13th Ann. ACM Symp. on Parallel Algorithms and Architectures (SPAA), New York, USA, 2001: 134~143.
    [162]. Moir M, Nussbaum D, et al. Using elimination to implement scalable and lock-free fifo queues[C] // Proc. of SPAA, Las Vegas, Nevada, USA, 2005: 253~262.
    [163]. Hendler D, Shavit N, Yerushalmi L. A scalable lock-free stack algorithm[C]. // Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures, New York, USA, 2004: 206~215.
    [164]. Shavit N, Touitou D. Elimination trees and the construction of pools and stacks[J]. Theory of Computing Systems, 1997(30):645~670.
    [165]. Kim K H, Colmenares J A, Rim K W. Efficient Adaptations of the Non-Blocking Buffer for Event Message Communication between Real-Time Threads[C] // Proc. of ISORC, Washington DC, USA, 2007: 29~40.
    [166].盛骤,谢式千,潘承毅.概率论与数理统计.北京:高等教育出版社,1989:4~67.
    [167].陈国良,吴俊敏等.并行计算机体系结构.北京:高等教育出版社,2002:65~124.
    [168]. Lamport L. Specifying concurrent program modules[J]. ACM Transactions on Programming Languages and Systems, 1983, 5(2):190~222.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700