Non-sequential Context Modeling and Its Application in Executable Compression
Abstract
In the last two decades, data compression techniques have developed rapidly, with extensive applications to fields such as text, speech, image, video, and executable compression. The process of data compression is conventionally split into two steps: modeling and encoding. Now that the code length achieved by encoding algorithms is very close to the Shannon bound, modeling plays the decisive role in improving compression efficiency. A large body of research has been devoted to context modeling, and several classic modeling methods have been proposed.
     With the spread of networks and portable devices, executable compression, as a branch of data compression, has become increasingly important in applications such as online distribution of program binaries and storage of driver files on portable devices. However, although classic context modeling algorithms, which consider sequential context models only, perform well in text compression, their performance in executable compression is unsatisfactory.
     In this dissertation, we first review the classic sequential context modeling algorithms and their redundancy estimates. We then generalize the definitions of non-sequential contexts and context models by relaxing the restriction on sequential contexts: a context may be formed from any combination of symbols in the predicted subsequence rather than only its suffix. On this basis, we develop a method for non-sequential context modeling, with three main contributions. 1) We establish a series of context weighting trees for non-sequential contexts by introducing a model tree, from which a weighted probability estimate can be derived. 2) We derive an upper bound on the model redundancy of this weighted estimate and compare it with the corresponding bound for sequential context modeling, demonstrating its advantage when estimating sources with non-sequential correlation. 3) We propose model selection methods for both cases, with and without a training sequence: when no training sequence is available, a fast method uses the already predicted symbols to approximately determine the context models; when a training sequence exists, a greedy algorithm selects an optimal set of models under the minimum description length (MDL) criterion.
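As an illustration of the third point, greedy MDL selection over non-sequential contexts can be sketched as follows for a binary alphabet. The Krichevsky-Trofimov (KT) estimator, the fixed per-position `model_cost`, and the bit-mask representation of a context set are simplifying assumptions made for this sketch, not the dissertation's actual construction.

```python
import math

def kt_bits(zeros, ones):
    """Code length in bits that the Krichevsky-Trofimov estimator
    assigns to a binary string with `zeros` 0s and `ones` 1s."""
    bits, z, o = 0.0, 0, 0
    for _ in range(zeros):
        bits -= math.log2((z + 0.5) / (z + o + 1))
        z += 1
    for _ in range(ones):
        bits -= math.log2((o + 0.5) / (z + o + 1))
        o += 1
    return bits

def data_bits(train, mask):
    """Data part of a two-part code: each context, formed by the positions
    that `mask` selects among the D previous bits, gets its own KT estimator.
    A mask with gaps corresponds to a non-sequential context."""
    D, counts = len(mask), {}
    for t in range(D, len(train)):
        ctx = tuple(train[t - D + i] for i in range(D) if mask[i])
        counts.setdefault(ctx, [0, 0])[train[t]] += 1
    return sum(kt_bits(z, o) for z, o in counts.values())

def greedy_mdl(train, D=4, model_cost=2.0):
    """Greedily add context positions while the data savings exceed the
    `model_cost` bits charged for describing one more position."""
    mask = [False] * D
    best = data_bits(train, mask)
    improved = True
    while improved:
        improved = False
        for i in range(D):
            if mask[i]:
                continue
            trial = mask[:]
            trial[i] = True
            cost = data_bits(train, trial) + model_cost * sum(trial)
            if cost < best:
                best, mask, improved = cost, trial, True
    return mask, best
```

For a source where each bit copies an earlier, non-adjacent bit, the selected mask skips the immediately preceding position — a context that a suffix-based (sequential) model of the same size cannot express.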
     For executable compression, both sequential and non-sequential context models are used, and their predictions are mixed into a weighted probability estimate. After summarizing the sequential and non-sequential correlations in executable files, particularly in the IA-32 instruction set, we apply the non-sequential context modeling method to assembly instructions, exploiting these correlations, and use the greedy algorithm described above to select an optimal set of context models from training data under the MDL criterion. We further propose an executable compression framework comprising: preprocessing, which models correlations unique to executables and parses the syntax of IA-32 instructions; probability estimation for each instruction by the selected sequential or non-sequential context models; and mixing of the resulting estimates by a family of normalized least mean square (LMS) algorithms based on the p-norm of the input estimates, which asymptotically attains the optimal weighted probability estimate.
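The mixing stage can be illustrated with a plain normalized LMS update on binary probability estimates. The 2-norm normalization, the step size, and the clamping below are illustrative simplifications of the p-norm family described above, not the dissertation's exact algorithm.

```python
class NLMSMixer:
    """Mix binary probability estimates from several models, adapting the
    weights with a normalized LMS step (p = 2 case of the p-norm family)."""

    def __init__(self, n_models, step=0.2):
        self.w = [1.0 / n_models] * n_models  # start from a uniform mixture
        self.step = step

    def predict(self, probs):
        """Weighted probability that the next bit is 1, clamped to (0, 1)
        so the arithmetic coder never receives a degenerate estimate."""
        p = sum(w * q for w, q in zip(self.w, probs))
        return min(max(p, 1e-6), 1.0 - 1e-6)

    def update(self, probs, bit):
        """NLMS step: prediction error times each input estimate,
        normalized by the squared 2-norm of the input vector."""
        err = bit - self.predict(probs)
        norm = sum(q * q for q in probs) + 1e-12
        self.w = [w + self.step * err * q / norm
                  for w, q in zip(self.w, probs)]
```

The weights track whichever model has recently been accurate, and because the error is divided by the squared norm of the inputs, the effective step size is invariant to the scale of the incoming estimates.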
