The Design and Implementation of TIDeFlow: A Dataflow-Inspired Execution Model for Parallel Loops and Task Pipelining
详细信息    查看全文
  • 作者:Daniel Orozco ; Elkin Garcia ; Robert Pavel…
  • 关键词:Dataflow ; Task pipelining ; Parallel execution models ; TIDeFlow ; Runtime system ; Graph languages ; Codelets ; Iterated dataflow ; Dependency graph
  • 刊名:International Journal of Parallel Programming
  • 出版年:2016
  • 出版时间:April 2016
  • 年:2016
  • 卷:44
  • 期:2
  • 页码:278-307
  • 全文大小:3,334 KB
  • 参考文献:1.Agrawal, G., Saltz, J.: Interprocedural data flow based optimizations for compilation of irregular problems. In: Huang, C.H., Sadayappan, P., Banerjee, U., Gelernter, D., Nicolau, A., Padua, D. (eds.) Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science, vol. 1033, pp. 465–479. Springer, Berlin (1996). doi:10.​1007/​BFb0014218 CrossRef
    2.Arvind, Culler, D.E.: Dataflow Architectures, pp. 225–253. Annual Reviews Inc., Palo Alto. http://​portal.​acm.​org/​citation.​cfm?​id=​17814.​17824 (1986)
    3.Blumofe, R., Leiserson, C.: Scheduling multithreaded computations by work stealing. In: Foundations of Computer Science, 1994 Proceedings, 35th Annual Symposium on, pp. 356 –368 (1994). doi:10.​1109/​SFCS.​1994.​365680
    4.Butenhof, D.: Programming with POSIX Threads. Addison-Wesley Professional, Boston (1997)
    5.Chapman, B., Jost, G., van der Pas, R.: Using OpenMP: Portable Shared Memory Parallel Programming (Scientific and Engineering Computation). MIT Press, Cambridge (2007)
    6.del Cuvillo, J., Zhu, W., Gao, G.: Landing openmp on cyclops-64: an efficient mapping of openmp to a many-core system-on-a-chip. In: CF ’06: Proceedings of the 3rd Conference on Computing Frontiers, ACM, New York, NY, USA, pp. 41–50, (2006a). doi:10.​1145/​1128022.​1128030
    7.del Cuvillo, J., Zhu, W., Hu, Z., Gao, G.R.: Toward a software infrastructure for the cyclops-64 cellular architecture. In: High-Performance Computing in an Advanced Collaborative Environment, p. 9 (2006b). doi:10.​1109/​HPCS.​2006.​48
    8.Del Cuvillo, J., Zhu, W., Hu, Z., Gao, G.: Tiny threads: a thread virtual machine for the cyclops64 cellular architecture. In: Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International, IEEE, p. 8 (2005)
    9.Dennis, J.B.: First version of a data flow procedure language. In: Programming Symposium, Proceedings Colloque sur la Programmation. Springer, London, pp. 362–376 (1974). http://​portal.​acm.​org/​citation.​cfm?​id=​647323.​721501
    10.Duran, A., Ayguad, E., Badia, R.M., Labarta, J., Martinell, L., Martorell, X., Planas, J.: Ompss: a proposal for programming heterogeneous multi-core architectures. Parallel Process. Lett. 21(02), pp. 173–193 (2011). doi:10.​1142/​S012962641100015​1
    11.Ebcioglu, K., Saraswat, V., Sarkar, V.: X10: programming for hierarchical parallelism and non-uniform data access. In: Proceedings of the International Workshop on Language Runtimes, OOPSLA (2004)
    12.Ellson, J., Gansner, E., Koutsofios, L., North, S., Woodhull, G.: Graphviz—open source graph drawing tools. In: Mutzel, P., Jünger, M., Leipert, S.(eds.) Graph Drawing. Lecture Notes in Computer Science, vol. 2265, pp. 483–484. Springer, Berlin Heidelberg (2002). doi:10.​1007/​3-540-45848-4_​57
    13.Gao, G.R.: A pipelined code mapping scheme for static data flow computers. PhD thesis, Massachusetts Institute of Technology. http://​hdl.​handle.​net/​1721.​1/​37165 (1986)
    14.Garcia, E., Orozco, D., Khan, R., Venetis, I., Livingston, K., Gao, G.: A dynamic schema to increase performance in many-core architectures through percolation operations. In: Proceedings of the 2013 IEEE International Conference on High Performance Computing (HiPC 2013), Bangalore. IEEE Computer Society (2013)
    15.Garcia, E., Venetis, I.E., Khan, R., Gao, G.: Optimized dense matrix multiplication on a many-core architecture. In: Proceedings of the Sixteenth International Conference on Parallel Computing (Euro-Par 2010), Part II, Springer, Ischia, Italy, Lecture Notes in Computer Science, vol. 6272, pp. 316–327 (2010b)
    16.Garcia, E., Venetis, I.E., Khan, R., Gao, G.R.: Optimized dense matrix multiplication on a many-core architecture. In: Euro-Par 2010-Parallel Processing, pp. 316–327 (2010c)
    17.Garcia, E., Orozco, D., Khan, R., Venetis, I.E., Livingston, K., Gao, G.R.: Dynamic percolation: a case of study on the shortcomings of traditional optimization in many-core architectures. In: ACM International Conference on Computing Frontiers 2012 (CF’12) (2012a)
    18.Garcia, E., Orozco, D., Pavel, R., Gao, G.: A discussion in favor of dynamic scheduling for regular applications in many-core architectures. In: Parallel and Distributed Processing Symposium Workshops and PhD Forum (IPDPSW), 2012 IEEE 26th International, IEEE, pp. 1591–1600 (2012b)
    19.Garcia, E., Orozco, D., Pavel, R., Gao, G.R.: A discussion in favor of Dynamic Scheduling for regular applications in Many-core Architectures. In: Proceedings of 2012 Workshop on Multithreaded Architectures and Applications (MTAAP 2012); 26th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2012), pp. 1591–1600. ACM, Shanghai (2012)
    20.Gautier, T., Besseron, X., Pigeon, L.: Kaapi: A thread scheduling runtime system for data flow computations on cluster of multi-processors. In: Proceedings of the 2007 International Workshop on Parallel Symbolic Computation, PASCO ’07, pp. 15–23. ACM, New York, NY, USA (2007)
    21.Gropp, W., Lusk, E., Thakur, R.: Using MPI-2: Advanced Features of the Message-Passing Interface. MIT Press, Cambridge (1999)
    22.Irigoin, F., Triolet, R.: Supernode partitioning. In: Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of Programming Languages, pp. 319–329. ACM (1988)
    23.Murata, T.: Petri nets: properties, analysis and applications. Proc. IEEE 77(4), 541–580 (1989). doi:10.​1109/​5.​24143 CrossRef
    24.Najjar, W.A., Lee, E.A., Gao, G.R.: Advances in the dataflow computational model. Parallel Comput. 25, 1907–1929 (1999)CrossRef
    25.Nemawarkar, S., Gao, G.: Measurement and modeling of earth-manna multithreaded architecture. In: Modeling, Analysis, and Simulation of Computer and Telecommunication Systems. MASCOTS ’96, Proceedings of the Fourth International Workshop on, pp. 109–114 (1996). doi:10.​1109/​MASCOT.​1996.​501002
    26.Gulati, K., Khatri, S.P.: GPU architecture and the CUDA programming model. In: Hardware acceleration of EDA algorithms, pp. 23–30. Springer US (2010). doi:10.​1007/​978-1-4419-0944-2_​3
    27.Orozco, D.: Tideflow: a parallel execution model for high performance computing programs. In: 2011 International Conference on Parallel Architectures and Compilation Techniques, p. 211 (2011)
    28.Orozco, D., Gao, G.: Mapping the FDTD application to many-core chip architectures. In: Parallel Processing. ICPP ’09. International Conference on, pp. 309–316 (2009)
    29.Orozco, D., Xue, L., Bolat, M., Li, X., Gao, G.R.: Experience of optimizing FFT on intel architectures. In: Parallel and Distributed Processing Symposium. IPDPS 2007. IEEE International, IEEE, pp. 1–8 (2007)
    30.Orozco, D., Garcia, E., Gao, G.: Locality optimization of stencil applications using data dependency graphs. In: Proceedings of the 23rd International Conference on Languages and Compilers for Parallel Computing, LCPC’10, pp. 77–91. Springer, Berlin (2011a)
    31.Orozco, D., Garcia, E., Khan, R., Livingston, K., Gao, G.: High throughput queue algorithms. Tech. rep., CAPSL Technical Memo 103 (2011b)
    32.Orozco, D., Garcia, E., Pavel, R., Khan, R., Gao, G.R.: Polytasks: a compressed task representation for hpc runtimes. In: Proceedings of the 24th International Conference on Languages and Compilers for Parallel Computing, LCPC 11 (2011c)
    33.Orozco, D., Garcia, E., Pavel, R., Khan, R., Gao, G.R.: Polytasks: a compressed task representation for hpc runtimes. CAPSL Technical Memo 105 (2011d)
    34.Orozco, D., Garcia, E., Khan, R., Livingston, K., Gao, G.R.: Toward high-throughput algorithms on many-core architectures. ACM Trans. Archit. Code Optim. 8(4), 49 (2012)CrossRef
    35.Sarkar, V., Hennessy, J.: Partitioning parallel programs for macro-dataflow. In: Proceedings of the 1986 ACM Conference on LISP and Functional Programming, LFP ’86, pp. 202–211. ACM, New York, NY, USA (1986). doi:10.​1145/​319838.​319863
    36.Stone, J.E., Gohara, D., Shi, G.: Opencl: a parallel programming standard for heterogeneous computing systems. Comput. Sci. Eng. 12(3), 66 (2010)CrossRef
    37.Theobald, K.: Earth: an efficient architecture for running threads. PhD thesis, University of Delaware (1999)
    38.Yan, Y., Chatterjee, S., Orozco, D., Garcia, E., Budimlic, Z., Shirako, J., Pavel, R., Sarkar, V., Gao, G.: Hardware and software tradeoffs for task synchronization on manycore architectures. In: Proceedings of the Seventeenth International Conference on Parallel Computing (Euro-Par 2011), Bordeaux, France, Lecture Notes in Computer Science (2011)
    39.Zuckerman, S., Suetterlein, J., Knauerhase, R.,Gao, G.: Using a codelet program execution model for exascale machines: position paper. In: Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era, pp. 64–69. ACM (2011)
  • 作者单位:Daniel Orozco (1)
    Elkin Garcia (1)
    Robert Pavel (1)
    Jaime Arteaga (1)
    Guang Gao (1)

    1. University of Delaware, Newark, DE, USA
  • 刊物类别:Computer Science
  • 刊物主题:Theory of Computation
    Processor Architectures
    Software Engineering, Programming and Operating Systems
  • 出版者:Springer Netherlands
  • ISSN:1573-7640
文摘
This paper provides an extended description of the design and implementation of the Time Iterated Dependency Flow (TIDeFlow) execution model. TIDeFlow is a dataflow-inspired model that simplifies the scheduling of shared resources on many-core processors. To accomplish this, programs are specified as directed graphs and the dataflow model is extended through the introduction of intrinsic constructs for parallel loops and the arbitrary pipelining of operations. The main contributions of this paper are: (1) a formal description of the TIDeFlow execution model and its programming model, (2) a description of the TIDeFlow implementation and its strengths over previous execution models, such as the ability to natively express parallel loops and task pipelining, (3) an analysis of experimental results showing the advantages of TIDeFlow with respect to expressing parallel programs on many-core architectures and (4) a presentation of the implementation of a low overhead runtime system for TIDeFlow.

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700