Research and Implementation of Scalable Parallel Debugging Technology for High-Performance Computers
Abstract
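Parallel debuggers play a major role in parallel program development. How to capture exceptions in parallel programs effectively and correct their errors, and thereby improve the quality of parallel program development, has long been a concern of researchers.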
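     In addition to the errors common in serial programs, such as pointer errors, variable errors, and syntax errors, parallel programs are subject to errors specific to parallelism, such as deadlock, livelock, and race conditions, so debugging them is far more complex than debugging serial programs. Compared with serial debugging, parallel debugging also faces problems of its own: insufficient scalability, nondeterminism, and high time overhead.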
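     This paper focuses on the scalability of parallel debuggers based on the MPI programming model. Traditional parallel debuggers use a flat Client/Server communication model in which the master carries a heavy load; when thousands of processes need to be debugged, the master often cannot work properly because its communication and computation load is too great. To meet the scalability requirements set out at design time, this paper proposes TreeNet, a tree-structured communication protocol. With TreeNet, the system distributes the master's load across intermediate communication nodes, which greatly reduces both the system startup time and the time taken by each command and thus improves scalability. Using the TreeNet protocol, the scalable parallel debugger can support debugging parallel programs with more than 512 processes.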
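     The effect of moving from a flat layout to a tree can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the TreeNet protocol itself: the abstract does not describe the protocol's messages or topology, so the heap-style k-ary tree, the fan-out of 8, and all function names below are hypothetical. For simplicity every non-master node both hosts one debugged process and relays commands to its children.

```python
from collections import Counter

MASTER = 0  # node 0 plays the role of the debugger master (assumption)


def children(node, n_nodes, fanout):
    """Children of `node` in a complete k-ary tree laid out heap-style."""
    first = node * fanout + 1
    return list(range(first, min(first + fanout, n_nodes)))


def broadcast(node, n_nodes, fanout, command, sent):
    """Forward `command` to the subtree rooted at `node`.

    Returns the replies gathered from that subtree and counts, in `sent`,
    how many messages each node had to send itself.
    """
    replies = {}
    if node != MASTER:
        replies[node] = f"{command}: ok"  # stand-in for a debug server reply
    for child in children(node, n_nodes, fanout):
        sent[node] += 1
        replies.update(broadcast(child, n_nodes, fanout, command, sent))
    return replies


def run(n_procs, fanout):
    """Broadcast one command to `n_procs` processes; return replies and send counts."""
    sent = Counter()
    replies = broadcast(MASTER, n_procs + 1, fanout, "break main", sent)
    return replies, sent


# Flat layout: a fan-out equal to the process count makes every process
# a direct child of the master.
flat_replies, flat_sent = run(1024, fanout=1024)
# Tree layout: the master only talks to 8 nodes, which relay further down.
tree_replies, tree_sent = run(1024, fanout=8)

print(flat_sent[MASTER], tree_sent[MASTER])  # 1024 8
```

     In both layouts all 1024 processes receive the command and reply, but in the tree no node, including the master, sends more than 8 messages per command.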
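     The scalable parallel debugger uses group debugging: the processes of a parallel job are divided, as needed, into a number of logical process sets, and every operation can be applied simultaneously to all processes in the currently active process set. The debugger supports source-level debugging through breakpoints, single-stepping, and similar facilities. It supports two debugging modes, Load and Attach. In Load mode, the parallel job remains under the debugger's control throughout its entire lifetime, until the debugging session ends. In Attach mode, the parallel job is launched independently; the user can start a debugging session when needed and bring the whole parallel job under the debugger's control, and once debugging is finished the user can detach the debugger, after which the job continues to run.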
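     The idea of applying one operation to every process in the active process set can be sketched as follows. The class and method names (`ProcessSets`, `define`, `focus`, `set_breakpoint`) and the breakpoint location are hypothetical, chosen only for illustration; the abstract does not give the debugger's actual command interface.

```python
class ProcessSets:
    """Named logical sets of MPI ranks with one currently active set."""

    def __init__(self, n_procs):
        # The set "all" always exists and contains every rank of the job.
        self.sets = {"all": frozenset(range(n_procs))}
        self.active = "all"
        self.breakpoints = {rank: set() for rank in range(n_procs)}

    def define(self, name, ranks):
        """Create or redefine a named process set."""
        ranks = frozenset(ranks)
        unknown = ranks - self.sets["all"]
        if unknown:
            raise ValueError(f"ranks not in job: {sorted(unknown)}")
        self.sets[name] = ranks

    def focus(self, name):
        """Make `name` the active set for subsequent operations."""
        if name not in self.sets:
            raise KeyError(name)
        self.active = name

    def set_breakpoint(self, location):
        """Apply one breakpoint command to every rank in the active set."""
        targets = self.sets[self.active]
        for rank in targets:
            self.breakpoints[rank].add(location)
        return sorted(targets)


dbg = ProcessSets(8)
dbg.define("workers", range(1, 8))   # every rank except rank 0
dbg.focus("workers")
hit = dbg.set_breakpoint("solver.c:42")
print(hit)  # [1, 2, 3, 4, 5, 6, 7]
```

     Rank 0 is outside the active set here, so it receives no breakpoint; switching back with `dbg.focus("all")` would make later commands apply to all eight ranks.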
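     The scalable parallel debugger is built on the Eclipse platform. It uses the Eclipse plug-in architecture to extend user-interface elements such as views and perspectives, providing a good graphical user interface.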
A parallel debugger greatly helps programmers find problems such as incorrect results or abnormal termination when writing parallel programs. To improve the quality and efficiency of parallel programs, researchers have therefore focused on how to catch exceptions and correct program errors early.
     Besides the errors that also occur in sequential programs, such as mishandled pointers, misused variables, and syntax errors, parallel programs have distinctive problems of their own, such as deadlock, livelock, and race conditions. In addition, compared with sequential debugging, parallel debugging faces further problems of its own, such as non-deterministic execution, low scalability, and high time overhead.
     This paper concentrates on the research and implementation of the Scalable Parallel Debugger. Our aim is to develop an efficient parallel debugger based on the MPI model. A traditional debugger architecture usually consists of a root debugger connected directly to, and controlling, the debug servers. This approach runs into many problems as the number of application processes grows. We therefore designed the TreeNet protocol to improve the scalability of the parallel debugger. Using the TreeNet protocol, the Scalable Parallel Debugger can debug more than 512 processes.
     The Scalable Parallel Debugger uses the process group concept to manipulate parallel processes. Groups partition the processes logically. The basic methods for source-level debugging are breakpoints and stepping. The Scalable Parallel Debugger supports load and attach debugging. In load mode, the parallel tasks are under the complete control of the parallel debugger from the moment they start. In attach mode, the parallel debugger can attach to the parallel tasks at any time while they are running.
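     The difference between the two modes can be pictured as a small state machine over the job's lifecycle. This is an illustrative sketch only: the state names, the `DebugSession` class, and its methods are hypothetical, and the rule that a load-mode job cannot be detached is one reading of the description above rather than a documented behavior of the debugger.

```python
from enum import Enum


class JobState(Enum):
    NOT_STARTED = "not started"
    RUNNING_FREE = "running without debugger"
    UNDER_DEBUGGER = "under debugger control"


class DebugSession:
    """Tracks whether a parallel job is currently controlled by the debugger."""

    def __init__(self):
        self.state = JobState.NOT_STARTED
        self.mode = None  # "load" or "attach" once a session exists

    def load(self):
        """Load mode: the debugger launches the job and controls it from the start."""
        self._require(JobState.NOT_STARTED)
        self.state, self.mode = JobState.UNDER_DEBUGGER, "load"

    def launch_independently(self):
        """The job is started on its own, outside the debugger."""
        self._require(JobState.NOT_STARTED)
        self.state = JobState.RUNNING_FREE

    def attach(self):
        """Attach mode: take control of a job that is already running."""
        self._require(JobState.RUNNING_FREE)
        self.state, self.mode = JobState.UNDER_DEBUGGER, "attach"

    def detach(self):
        """Release an attached job so that it continues to run on its own."""
        self._require(JobState.UNDER_DEBUGGER)
        if self.mode != "attach":
            raise RuntimeError("a job started in load mode stays under control")
        self.state, self.mode = JobState.RUNNING_FREE, None

    def _require(self, expected):
        if self.state is not expected:
            raise RuntimeError(f"job is {self.state.value}")


session = DebugSession()
session.launch_independently()
session.attach()     # the user starts a debugging session when needed
session.detach()     # the job keeps running afterwards
print(session.state.value)  # running without debugger
```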
     The Scalable Parallel Debugger is based on the Eclipse platform. We use the Eclipse plug-in architecture and extend the view and perspective extension points to develop a user-friendly debugger.
