Research on Key Technology of General Parallel Computing Framework in Multi-core Heterogeneous Environment

Original title: 多核异构环境下通用并行计算框架关键技术研究
Author: 盛艳秀
Degree level: Doctoral
Discipline: Cartography and Geographic Information Systems
Keywords: parallel computing; layered parallel computing framework; parameter library; method library; language interpretation system
Degree year: 2013
Advisor: 魏志强
Discipline code: 070503
Degree-granting institution: Ocean University of China (中国海洋大学)
Submission date: 2013-12-08

Abstract

With the development of science and technology, and of computer technology in particular, data volumes in every industry have begun to grow exponentially, and traditional serial computing can no longer meet the growing demand for data processing. Parallel computing emerged against this background; its main purpose is to solve large, complex computational problems quickly. Parallel computing is closely tied to a country's scientific and economic development, and it directly affects national defense capability and national security, in applications such as nuclear explosion simulation, accurate solution of complex systems, genetic research, and the encryption and decryption of confidential state communications. Parallel computing capability is an important indicator of national strength.
     Although parallel computing has been developed for many years, and practical solutions and considerable experience exist for some specific problems, parallel algorithms are still far less plentiful than serial algorithms, and the discipline is not yet mature. The biggest difference from serial algorithms is that a parallel algorithm must consider not only how to solve the problem itself but also which parallel model suits the problem. To maximize efficiency, it must also account for factors such as processor architecture and network interconnection. This inevitably makes parallel algorithms harder to design and implement.
     Based on an analysis of the difficulties in parallel computing and of the state of research in China and abroad, this thesis addresses the problems of parallel computing models by proposing a new layered, heterogeneous, general-purpose parallel computing model that meets the computing needs of multi-core processor clusters, and it presents a preliminary study of the key technologies involved. The detailed contents are as follows:
     (1) A new layered, heterogeneous, general-purpose parallel computing model is proposed to meet the computing needs of multi-core processor clusters. The model divides development for a target problem into three stages: program model algorithm design, parallel program design, and parallel program execution. In the program model algorithm design stage, developers design the program model algorithm against a parameterized parallel machine. In the parallel program design stage, developers use a parallel development platform to write the concrete parallel program that implements the parallel tasks. In the parallel program execution stage, the parallel program runs on the corresponding software and hardware architecture, and instruction execution efficiency is improved through computation parameters tuned by the interpretation system.
     (2) The model framework is refined and implemented. For each layer of the layered heterogeneous model, corresponding tools such as a method library, a parameter library, and a program reuse library are designed to support the model's functions, so that parallel algorithm design and implementation proceed layer by layer. The aim is a parallel computing model that is dynamic, adaptive, reconfigurable, and general.
     (3) A language interpretation system and a compilation system are proposed to link the layers, which ensures that the model framework is complete and implementable.
     (4) The heterogeneous general-purpose parallel computing model is used to design a parallel algorithm for prestack migration. Prestack migration is one of the classic algorithms in petroleum exploration. Its serial implementation is already fairly mature, but parallelizing it has long been a relatively complex problem. Applying the model solved its parallelization well.
     The general parallel computing framework offers application developers a simple, easy-to-use design language and supports efficient, correct, and broadly applicable parallel program design, so it has broad application prospects and significant social benefits. The layered heterogeneous general-purpose computing model provides application developers with a hardware-independent, extensible programming interface, builds a method library for commonly occurring problems and a parameter library for the execution platform, manages different computing resources in an integrated way, and allocates computing tasks sensibly, which reduces both the difficulty of program development and the workload of application developers.

