Research and Optimization of Parallel Encoding Algorithms for H.264

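English title: Parallel Algorithm Design and Optimization for H.264 Video Encoding
Author: Jiang Xingchang (蒋兴昌)
Thesis level: Master's
Discipline: Signal and Information Processing
Keywords (Chinese): H.264/AVC; 指令级并行 (instruction-level parallelism); 线程级并行 (thread-level parallelism); 多核 (multi-core); 自适应线程池模型 (adaptive thread pool model)
Keywords (English): H.264/AVC; Instruction Level Parallelism; Thread Level Parallelism; Multi-Core; Adaptive Threading-Pool Model
Degree year: 2008
Advisor: Zhou Jun (周军)
Discipline code: 081002
Degree-granting institution: Shanghai Jiao Tong University
Thesis submission date: 2007-10-01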

Abstract

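In 2001, the ISO Moving Picture Experts Group (MPEG) and ITU/VCEG formed the Joint Video Team (JVT) to jointly develop and study the H.26L standard, and the resulting draft was incorporated into ITU-T video recommendation H.264 and into Part 10 (AVC) of the MPEG-4 standard produced by ISO/MPEG. As a new-generation video coding algorithm, H.264/AVC absorbs the strengths of earlier coding schemes and greatly improves both compression performance and network friendliness, but these gains come at the cost of added complexity. An analysis of the H.264 encoder structure shows that its heavy computational load comes mainly from two sources: first, inter-frame coding, with quarter-pixel-accuracy motion search, multiple variable block-size modes, and motion estimation over multiple reference frames; second, intra-frame coding, with its many prediction modes. How to encode and decode quickly has therefore become a pressing problem for H.264.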
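     Almost all CPU architectures are now moving toward multi-core designs, for example Intel's and AMD's x86 and IBM's PowerPC. Compared with traditional single-core CPUs, multi-core CPUs offer stronger parallel processing capability and higher computing density. Multi-core CPUs can be expected to become the mainstream products of the future processor market.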
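     Many parallel optimization methods have been proposed to address the complexity of H.264 encoding. One is instruction-level parallelism (ILP), for example the single-instruction multiple-data (SIMD) instructions provided by many DSP platforms; the other is thread-level parallelism (TLP), which has to be combined with multi-core technology. Experiments show that neither method alone achieves maximal parallelization of the encoder. Because X264 already implements instruction-level parallelism on the x86 platform, this thesis takes the X264 encoder as its object of study and adds thread-level parallel optimization on top of its SIMD instruction-level optimization. On an Intel dual-core processor platform running Linux Fedora Core 5, this yields a substantial improvement in encoding speedup.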
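     X264 currently uses a slice-level parallel algorithm based on the Fork-Join model. Test results show that this model has high overhead, which severely limits the encoding speedup of the slice-level algorithm in low-resolution encoding. To avoid the overhead of the Fork-Join model, this thesis proposes an adaptive thread pool model, built on the conventional thread pool, and uses it to optimize X264.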
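The difference between the two threading models can be illustrated with a minimal sketch. It is not X264 code and it does not reproduce the adaptive model proposed in the thesis, whose details the abstract does not give; it only contrasts per-frame thread creation (Fork-Join) with a conventional pool whose workers are created once and reused. The function `encode_slice` and the toy frame layout are hypothetical stand-ins. In CPython the threads will not speed up this pure-Python work; the sketch shows structure only.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def encode_slice(frame, index, n_slices):
    """Hypothetical stand-in for slice encoding: sum one strip of the frame."""
    rows_per_slice = len(frame) // n_slices
    start = index * rows_per_slice
    end = len(frame) if index == n_slices - 1 else start + rows_per_slice
    return sum(sum(row) for row in frame[start:end])


def encode_fork_join(frames, n_slices):
    """Fork-Join: create and join a fresh set of threads for every frame."""
    output = []
    for frame in frames:
        results = [None] * n_slices

        def worker(i, frame=frame, results=results):
            results[i] = encode_slice(frame, i, n_slices)

        threads = [threading.Thread(target=worker, args=(i,))
                   for i in range(n_slices)]
        for t in threads:   # fork: thread creation cost is paid per frame
            t.start()
        for t in threads:   # join: wait for all slices of this frame
            t.join()
        output.append(results)
    return output


def encode_with_pool(frames, n_slices):
    """Thread pool: workers are created once and reused for every frame."""
    output = []
    with ThreadPoolExecutor(max_workers=n_slices) as pool:
        for frame in frames:
            futures = [pool.submit(encode_slice, frame, i, n_slices)
                       for i in range(n_slices)]
            output.append([f.result() for f in futures])
    return output


# Four toy 8x8 "frames"; both models must produce identical results.
frames = [[[f + r + c for c in range(8)] for r in range(8)] for f in range(4)]
assert encode_fork_join(frames, 2) == encode_with_pool(frames, 2)
```

Thread creation and teardown happen once per frame in the first function and once per sequence in the second. When each frame carries little work, as at low resolution, that fixed per-frame cost takes up a larger share of the total time.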
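     For non-real-time encoding applications with strict requirements on bitstream quality, this thesis applies GOP-level parallelism to the X264 encoder and obtains a near-linear speedup.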
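GOP-level parallelism can be sketched as follows, assuming closed GOPs so that no frame refers to a frame outside its own group. This is an illustrative toy, not the thesis implementation: frames are single numbers, `encode_gop` is a hypothetical stand-in for a real encoder, and the "P" units are plain differences from the previous frame. In CPython the threads will not speed up this pure-Python work; the point is only the structure.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_gops(frames, gop_size):
    """Cut the frame sequence into consecutive groups of gop_size frames."""
    return [frames[i:i + gop_size] for i in range(0, len(frames), gop_size)]


def encode_gop(gop):
    """Hypothetical stand-in for encoding one closed GOP.

    The first frame is coded on its own (like an I-frame); every later frame
    is coded as a difference from the previous frame in the same GOP.
    """
    coded = [("I", gop[0])]
    for prev, cur in zip(gop, gop[1:]):
        coded.append(("P", cur - prev))
    return coded


def encode_sequence(frames, gop_size, workers=2):
    """Encode GOPs concurrently; map() returns results in input order."""
    gops = split_into_gops(frames, gop_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        coded_gops = list(pool.map(encode_gop, gops))
    # Concatenate per-GOP output back into one stream in display order.
    return [unit for gop in coded_gops for unit in gop]


frames = [10, 12, 15, 15, 20, 18, 18, 25]   # toy "frames" as single numbers
stream = encode_sequence(frames, gop_size=4)
```

Because each GOP depends only on its own frames, the workers never wait on one another, which is why this granularity can approach linear speedup. The price is latency: several GOPs must be buffered before encoding starts, which fits the non-real-time setting described above.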
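     For real-time encoding applications with strict requirements on bitstream quality, this thesis combines frame-level and slice-level parallelism. The combined algorithm avoids both the low speedup of frame-level parallelism used alone and the loss of bitstream quality caused by slice-level parallelism, and achieves a high speedup with almost no effect on the quality of the output bitstream.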
In 2001, the Moving Picture Experts Group (MPEG) of ISO and the Video Coding Experts Group (VCEG) of ITU-T formed the Joint Video Team (JVT) to develop the H.26L standard together. The resulting video coding standard, H.264, significantly improves both coding efficiency and network adaptation. To achieve this coding efficiency, the H.264 encoder introduces a great deal of complexity. An analysis of the encoder structure shows that the complexity comes from two sources: inter prediction, including multiple reference frames, and intra prediction. How to encode an H.264 stream more quickly is therefore one of the most pressing problems at present.
     To avoid rising power consumption, almost all CPU architectures are moving toward multi-core technology. Compared with single-core CPUs, multi-core CPUs provide more computing power and true parallel processing.
     There are two main parallel methods for optimizing the H.264 encoder. One is Instruction Level Parallelism (ILP), which is available on many DSP platforms; the other is Thread Level Parallelism (TLP), which must be combined with multi-core technology. This thesis optimizes the H.264 encoder using both methods and obtains a substantial speedup on an Intel dual-core platform under the Linux Fedora Core 5 operating system.
     The X264 encoder uses the "Fork-Join" model for its parallel algorithm, but this model is costly, especially in low-resolution encoding. To avoid this penalty, this thesis introduces a new model, the "Adaptive Threading Pool" model.
     For non-real-time applications, we choose the GOP-level parallel algorithm; test results show that the speedup reaches about 2 on the Intel dual-core architecture.
     For real-time applications, we combine the frame-level and slice-level parallel algorithms. The combination is faster than frame-level parallelism alone and gives much better quality than slice-level parallelism alone.

