Abstract
Motion estimation in an H.264/AVC encoder is computationally intensive. This paper presents a parallel block-matching algorithm, implemented on a general-purpose graphics processing unit (GPU), that speeds up video coding for unmanned aerial vehicles (UAVs). Traditional parallel block-matching algorithms primarily exploit the large number of computational cores in a GPU by assigning the block-matching operation at each candidate position in the search range to an independent kernel thread. In practice, however, transferring pixel values among the memory modules of a GPU system takes far longer than the kernel threads spend computing each block-matching operation, and this data transfer becomes the bottleneck that limits further performance gains in GPU algorithm design. The proposed algorithm exploits the differing data-transfer speeds of the distinct memory modules and introduces a feasible mechanism that reduces the data-transfer bandwidth required by parallel block-matching algorithms on a GPU system. In experiments on GPU systems, the proposed parallel block-matching algorithm reduces the execution time of motion estimation by up to 99% compared with motion estimation performed on the host processor alone.
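As a rough illustration of the block-matching operation referred to above, the following Python sketch performs a sequential full search over a square search range. It is not the paper's GPU implementation: the function names `sad` and `full_search`, the choice of the sum of absolute differences (SAD) as the matching cost, and the list-of-lists frame layout are all assumptions made for illustration. The point to note is that the cost at each candidate position is computed independently of every other candidate, which is the unit of work a traditional parallel scheme would hand to one kernel thread.

```python
def sad(cur, ref, bx, by, rx, ry, n):
    """Sum of absolute differences between the n x n block of `cur`
    at (bx, by) and the n x n block of `ref` at (rx, ry).
    SAD is an assumed matching cost, chosen for illustration."""
    total = 0
    for j in range(n):
        for i in range(n):
            total += abs(cur[by + j][bx + i] - ref[ry + j][rx + i])
    return total


def full_search(cur, ref, bx, by, n, search_range):
    """Return (cost, dx, dy) for the best match of the current block
    within +/- search_range pixels in the reference frame, or None if
    no candidate lies inside the frame. Frames are lists of rows."""
    height, width = len(ref), len(ref[0])
    best = None
    # Each (dx, dy) candidate is independent of the others; on a GPU,
    # one thread could evaluate one candidate position.
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = bx + dx, by + dy
            # Skip candidates whose block falls outside the reference frame.
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue
            cost = sad(cur, ref, bx, by, rx, ry, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

Every candidate reads an n x n window of reference pixels, and neighbouring candidates read largely overlapping windows. This repeated pixel traffic is the kind of memory-transfer cost that the abstract identifies as dominating the arithmetic.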