信息散度和Alignment空间的一些研究
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
随着信息科学的不断发展,人们对信息论学科的认识日益加深,信息论学科与其他学科的交叉渗透也越来越广。目前对信息论的研究已经从香农当年仅限于通信系统的数学理论的狭义范围扩展开来,像智能计算、生物、金融等领域都开始大量运用信息论的有关知识。在以上这些领域中,涉及到了各种随机分布差异的概念,需要利用信息量去衡量它们之间的区别。另外,在通信科学与生命科学中,对差错的概念已经有所推广,不再仅仅是经典的字符改变形成的差错,而且还扩展到了字符的插入与删除形成的差错。面对这些新问题,本文从三个部分进行了初步的探讨。
     第一部分:信息散度(Information Divergence)的研究
     本文的第二章主要讨论了信息散度的问题。信息散度在信息论中又被称为离散量,主要用来衡量两个随机分布之间的差异。比如最早提出的相对熵(即Kullback-Leibler散度)就是其中最为人所熟知的一种。本章首先介绍了一些著名的信息散度,然后讨论了信息散度与概率分布空间中度量的关系。2003年,Endres和Schindelin在论文“A New Metric for Probability Distributions”中将Jensen-Shannon散度(有的论文也称为capacitory discrimination)作了改进,证明了改进后的结果可以成为概率分布空间中的度量。本章在此论文的基础之上继续研究,从而得到了一类由概率分布生成的新度量。文中证明了得到新度量的充分必要条件,讨论了新度量的最值问题。本章最后对Jensen-Shannon散度的凸性作了一点探讨。
     第二部分:F_q上的Alignment空间的相关研究以及计数问题
     在数据处理问题中,差错的类型有多种,除了符号的替换之外还有数据的插入与丢失等等情况发生,本文称这样的差错为广义差错或者突变误差。由广义差错可以得到一种非线性空间——Alignment空间,这种空间在编码、密码、计算机与生物信息等等领域中有着广泛的应用。比如带插入/删除的信道编码、生物序列比对、图像处理等,都需要用到广义差错和Alignment空间中的有关概念与性质。本文在这一部分对广义差错和Alignment空间作了详细的说明和讨论。
     本文的第三章介绍了F_q集合上的Alignment空间的相关概念。首先我们对F-q集合上的Alignment空间和Alignment距离的定义作了说明,然后对Alignment距离的计算方法作了介绍,这个计算方法就是经典的动态规划算法。接下来文中讨论了广义差错的Levenshtein距离与Alignment距离的关系,最后介绍了该空间的一些简单的性质。
     第四章主要讨论一种研究Alignment空间的途径——序列的模结构理论以及虚拟符号的运算理论。本章首先简要介绍了序列的模结构理论,然后详细介绍了比对序列的虚拟符号运算理论,严格证明了两序列的比对序列间虚拟符号运算子的存在性,并且证明了等位运算子成为保距运算子和微调运算子的充分必要条件。
     第五章主要讨论Alignment空间中的计数问题。Alignment空间中的计数问题主要分为两类,本章开始对其作了说明。然后文中详细讨论了F_2上的n维Alignment子空间中Alignment距离为n与Alignment距离为2的序列对数目。得到了F_2上的n维Alignment子空间中Alignment距离为n的序列有2n对,F_2上的n维Alignment子空间中Alignment距离为2的序列有(2n~2-7n+11)-6对的结果,并且得到了Alignment距离为n的序列对满足的充分必要条件,说明了它们的最长的最小罚分比对序列就是最短的最大得分比对序列的结论。
     第三部分:由一般拓扑度量空间生成的Alignment空间
     在第二部分讨论的基础之上,Alignment空间还可以继续扩展到更一般的情况,由一般的拓扑度量空间同样可以产生Alignment空间。第六章中首先对由一般拓扑度量空间所产生的Alignment空间和其中的Alignment距离的定义作了说明,然后证明了此时得到的Alignment距离与一类推广的Levenshtein距离等价的结论,并且利用模结构理论详细证明了此距离满足度量空间中度量的三个条件。本章最后给出了该空间在生物信息学中的一个应用——蛋白质三维结构形态的比对问题。
With the development of information science, people have recognized that the information theory has been not only used in mathematical theory of communication and secrecy systems, but also used in other fields, such as intelligent computing, biology, financial modeling. In all the fields above, random distributions is a very significant concept. People need " information " to identify the distinction between two different distributions. Therefore, information theory can play an important role in these areas. Besides, the concept of error has promoted. At previous time, the error only refers to the substitutions of symbols. Recently, the deletions and insertions of symbols are also considered as errors yet. We call all these errors generalized errors. Facing those problems mentioned, we mainly study three aspects below in this paper.
     1. The information divergence is studied. Information divergence is a way to measure the difference between two random distributions. For example, relative entropy is the most well-known information divergence. In chapter 2, we introduce some famous information divergences and discuss the relationship of them and metrics in probability distribution space. We improve the chief conclusion in the paper " A New Metric for Probability Distributions " and obtain a class of new metrics for probability distributions. The new metrics based on the important concept of relative entropy and Jensen - Shannon divergence. The sufficient and necessary conditions for the new metrics to hold are provided next. The maxima and minima of the new metrics are given, too. At last, the convexity of the Jensen - Shannon divergence is discussed.
     2. The Alignment space over F_q is studied. Alignment space is a nonlinear space defined by generalized errors. In chapter 3-5, the detailed introductions of the related concepts of Alignment space over F_q are presented. Firstly, we give the definitions of the Alignment space over F_q and the Alignment distance in it. We illustrate that the Alignment distance can be computed by classical dynamic programming algorithm. The properties and relations between Levenshtein distance by expansion sequences and Alignment distance are discussed. In order to research the Alignment space, we introduce the modulus structure theory of sequences and virtual symbol operation theory. That there must be a virtual symbol operation between two alignment sequences is proved strictly in this part. The sufficient and necessary conditions for isometric operator and micro-adapted operator are also proved. At last, we study the structure of the Alignment space, give the number of the sequence pairs whose Alignment distance is n in the n - dimensional Alignment subspace is 2n and whose Alignment distance is 2 in the n - dimensional Alignment subspace is (2n~2—7n + 11)—6, obtained the sufficient and necessary conditions for sequence pairs whose Alignment distance is n in the n - dimensional Alignment subspace and the result that the maximum score alignment sequences are the minimum penalty alignment sequences in this case.
     3. The Alignment space generated by general metric space is studied. The Alignment space mentioned in part 2 can be extended to general condition. Firstly, the definitions of the Alignment space generated by general metric space and the Alignment distance in it are described. And then, we prove that the Alignment distance is equivalent to a class of Levenshtein distances. Using modulus structure theory, the Alignment distance is proved to be a metric in metric space. At last, we give an example on the protein structure sequence alignment in 3-dimensional space as the application of Alignment space.
引文
[1]R.V.L.Hartley.Transmission of information.Bell.Syst.Tech.J.,1928,7:535-563.
    [2]R.A.Fisher.On the mathematical foundations of theoretical statistics.Phil.Trans.Roy.Soc.,1922,222:309-368.
    [3]R.A.Fisher.Theory of statistical estimation.Proc.Cambridge Philos.Soc.,1925,22:700-725.
    [4]N.Wiener.Cybernetics.New York:Wiley,1948.
    [5]C.E.Shannon.A mathematical theory of communication.Bell.Syst.Tech.J.,1948,27:379-423,623-656.
    [6]C.E.Shannon.Communication theory of secrecy systems.Bell.Syst.Tech.J.,1949,28:656-715.
    [7]A.N.Kolmogorov.Three approaches to the quantitative definition of information.Probl.Inf.Transm.(USSR),1965,1:4-7.
    [8]A.N.Kolmogorov.Logical basis for information theory and probability theory.IEEE Trans.Inf.Theory,1968,IT-14:662-664.
    [9]A.Feinstein.A new basic theorem of information theory.IRE Trans.Inf.Theory,1954,IT-4:2-22.
    [10]J.Wolfowitz.The coding of messages subject to chance errors.Ill.J.Math.,1957,1:591-606.
    [11]E.T.Jaynes.Information theory and statistical mechanics.Phys.Rev.,1957,106:620-630.
    [12]L.G.Kraft.A device for quantizing,grouping and coding amplitude modulated pulses.Master's thesis,Department of Electrical Engineering,MIT,Cambridge,MA,1949.
    [13]D.A.Huffman.A method for the construction of minimum redundancy codes.Proc.IRE.,1952,40:1098-1101.
    [14]R.M.Fano.Class notes for transmission of information,course 6.574(Technical Report).MIT,Cambridge,MA,1952.
    [15]C.E.Shannon.Coding theorems for a discrete source with a fidelity criterion.IRE Nat.Cony.Rec.,1959,Pt.4:142-163.
    [16]T.Berger.Rate Distortion Theory:A mathematical Basis for Data Compression.Englewood Cliffs:Prentice-Hall,NJ,1971.
    [17]R.V.Hamming.Error detecting and error correcting codes.Bell.Syst.Tech.J.,1950,29:147-160.
    [18]R.J.MacWilliams,N.J.A.Sloane The theory of Error-Correcting Codes.Amsterdam:North-Holland Mathematical Library,1977,vol16.
    [19]T.M.Cover,J.A.Thomas.Elements of Information Theory(2nd ed).New York:Wiley-Interscience,2006.
    [20]T.M.Cover,J.A.Thomas著,阮吉寿,张华译.信息论基础(第二版).北京:机械工业出版社,2008.
    [21]沈世镒,吴忠华.信息论基础与应用.北京:高等教育出版社,2004.
    [22]傅祖芸.信息论—基础理论与应用(第二版).北京:电子工业出版社,2007.
    [23]仇佩亮.信息论与编码.北京:高等教育出版社,2003.
    [24]J.Ziv,A.Lemple.A universal algorithm for sequential data compression.IEEE Trans.Inf.Theory,1977,IT-23:337-343.
    [25]J.Ziv,A.Lemple.Compression of individual sequences by variable rate coding.IEEE Trans.Inf.Theory,1978,IT-24:530-536.
    [26]T.A.Welch.A technique for high-performance data compression.Computer,1984,17(1):8-19.
    [27]C.Nevill-Manning and I.H.Witten.Compression and explanation using hierarchical grammars Computer J,1997,40:103-116.
    [28]E.-H.Yang and J.C.Kieffer.On the performance of data compression algorithms based upon string matching.IEEE Trans.on Inf.Theory,1998,30(1):47-65.
    [29]J.C.Kieffer and E.-H.Yang.Grammar based codes:A new class of universal lossles source codes.IEEE Trans.on Inf.Theory,2000,43(3):733-754.
    [30]吴乐南.数据压缩(第二版).北京:电子工业出版社,2005.
    [31]R.Ahlswede.Multi-way communication channels.Proc.2nd Int.Symp.Inf.Theory,23-52.Hungarian Academy of Science,Budapest,1971.
    [32]H.Liao.Multiple access channels:PhD thesis,Honolulu:University of Hawaii,1972
    [33]T.H.Cover.Broadcast channels.IEEE Trans.on Inf.Theory,1972,IT-18:2-14.
    [34]T.H.Cover.Comments on broadcast channels.IEEE Trans.on Inf.Theory.1998,44(6):2524-2530.
    [35]S.Vishwanath,N.Jindal,and A.Goldsmith.Duality,achievable rates,and sum-rate capacity of Gaussian MIMO broadcast channels.IEEE Trans.on Inf.Theory,2003,49(10):2658-2668.
    [36]V.K.Goyal.Multiple description coding:Compression meets the network.IEEE Signal Proc.Mag,2001,18(1):74-93.
    [37]S.S.Pradhan,K.Ramchandran.Distributed source coding using syndromes(DIS-CUS):design and construction.IEEE Trans.on Inf.Theory,2003,49(3):626-643.
    [38]R.W.Yeung.Multilevel diversity coding with distortion.IEEE Trans.on Inf.Theory,1995,41(2):412-422.
    [39]R.W.Yeung,Zhen Zhang.Distributed source coding for satellite communications.IEEE Trans.on Inf.Theory,1999,45(4):1111-1120.
    [40]R.Ahlswede,Ning Cai,S.-Y.R.Li,etc.Network information flow.IEEE Trans.on Inf.Theory,2000,46(4):1204-1216.
    [41]R.W.Yeung,S.-Y.R.Li,Ning Cai,etc.Theory of network coding.Foundations and Trends in Communications and Information Theory,2005,2:241-381.
    [42]Zhen Zhang.Linear network error correction codes in packet networks.IEEE Trans.on Inf.Theory,2008,54(1):209-218.
    [43]L.-L.Xie and P.R.Kumar.A network information theory for wireless communication:scaling laws and optimal operation.IEEE Trans.on Inf.Theory.2004,50(5):748-767.
    [44]S.Kullback and R.A.Leibler.On information and sufficiency.Ann.Math.Stat,1951,22:79-86.
    [45]I.Csiszar.Information-type measures of difference of probability distributions and indirect observations.Stud.Sci.Math.Hung.,1967,2:299-318.
    [46]S.Amari and H.Nagaoka.Methods of Information Geometry.Oxford:Oxford University Press,1999.
    [47]A.Renyi.On measures of entropy and information.Proe.4th Berkeley Symp.on Math.Star.and Prob.,Berkley:Univ.Calif.Press,1961:547-561.
    [48]M.C.Pardo and I.Vajda.About distances of discrete distributions satisfying the data processing theorem of information theory.IEEE Trans.on Inf.Theory,1997,43(4):1288-1293.
    [49]F.Topsφe.Some Inequalities for Information Divergence and Related Measures of Discrimination.IEEE Trans.on Inf.Theory,2000,46(4):1602-1609.
    [50]F.Osterreicher and I.Vajda.A New Class of Metric Divergences on Probability Spaces and its Applicability in Statistics.Ann.Inst.Statist.Math.,2003,55(3):639-653.
    [51]S.-C.Tsai,W.-G.Tzeng and H.-L.Wu.On the Jensen-Shonnon Divergence and Variational Distance.IEEE Trans.on Inf.Theory,2005,51(9):3333-3336.
    [52]Y.Ofran and B.Rost.Analyzing Six Types of Protein-Protein Interfaces.J.Mol.Biol.,2003,325:377-387.
    [53]S.F.Altschul,John C.Wootton,etc.Protein database searches using compositionally adjusted substitution matrices.FEBS Journal,2005,272:5101-5109.
    [54]D.M.Endres and J.E.Schindelin.A New Metric for Probability Distributions.IEEE Trans.on Inf.Theory,2003,49(7):1858-1860.
    [55]V.I.Levenshtein.Binary Coded Capable of Correcting Deletion,Insertions and Reversals.(Russian) Doklady Akademii Nauk SSSR,1965,163(4):845-848;(English)Soviet Phys.Doki.,1966,10(8):707-710.
    [56]M.C.Davey and D.J.C.Mackay.Reliable communication over channels with insertions,deletions,and substitutions.IEEE Trans.on Inf.Theory,2001,47(2):687-697.
    [57]S.Diggavi and M.Grossglauser.On information transmission over a finite buffer channel.IEEE Trans.on Inf.Theory,2006,52(3):1226-1237.
    [58]E.Drinea and M.Mitzenmacher.On lower bounds for the capacity of deletion channels.IEEE Trans.on Inf.Theory,2006,52(10):4648-4657.
    [59]M.Mitzenmacher and E.Drinea.A simple lower bound for the capacity of the deletion channel.IEEE Trans.on Inf.Theory,2006,52(10):4657-4660.
    [60]E.Drinea and M.Mitzenmacher.Improved Lower Bounds for the Capacity of i.i.d.Deletion and Duplication Channels.IEEE Trans.on Inf.Theory,2007,53(8):2693-2714.
    [61]P.A.Bours.Construction of fixed-length insertion/deletion correcting runlength-limited codes.IEEE Trans.on Inf.Theory.1994,40(6):1841-1856.
    [62]A.S.J.Helberg and H.C.Ferreira.On multiple insertion/deletion correcting codes.IEEE Trans.on Inf.Theory.2002,48(1):305-308.
    [63]T.G.Swart and H.C.Ferreira.A note on insertion/deletion Codes.IEEE Trans.on Inf.Theory.2003,49(1):269-273.
    [64]A.Klein.On perfect deletion-correcting codes.J.Comb.Des..2004,12:72-77.
    [65]S.K.Houghten,D.Ashlock and J.Lenarz.Construction of optimal edit metric codes.Proceedings of 2006 IEEE Information Theory Workshop(ITW'06),Chengdu,2006:259-263.
    [66]J.Wang.Some combinatorial constructions for optimal perfect deletion-correcting codes.Des.Codes Cryptogr.,2008,48:331-337.
    [67]Shi-Yi Shen,KuiWang,Shu-Tao Xia etc.On the Alignment Space.Annual International Conference of the IEEE Engineering in Medicine and Biology-Proceedings,v 7VOLS,Proceedings of the 2005 27th Annual International Conference of the Engineering in Medicine and Biology Society,IEEE-EMBS 2005,Shanghai,2005:244-247.
    [68]David W.Mount.Bioinformatics-Sequence and Genome Analysis.New York:Cold Spring Harbor LaBoratory Press,2001.
    [69]S.Y.Shen and J.A.Tuszynski.Theory and Mathematical Methods for Bioinformatics,Berlin Heidelberg:Springer-Verlag,2008.
    [70]I.J.Taneja.Bounds on non-symmetric divergence measures in terms of symmetric divergence measures,arXiv:math.PR/0506256v1,2005b.
    [71]叶中行.信息论基础(第二版).北京:高等教育出版社,2007.
    [72]I.J.Taneja.Refinement Inequalities Among Symmetric Divergence Measures.The Australian Journal of Mathematical Analysis and Applications,2005,2,Art.8:1-23.
    [73]M.S.Pinsker.Information and Information Stability of Random Variables and Processes (in Russiau).Moscow,U.S.S.R.:Izv.Akad.Nauk,1960.
    [74]J.Lin.Divergence measures based on the Shannon entropy.IEEE Trans.on Inf.Theory,1991,37(1):145-150.
    [75]F.(O|¨)terreicher and I.Vajda.Statistical information and discriminiation.IEEE Trans.on Inf.Theory,1993,39(3):1036-1039.
    [76]F.Topsφe.Inequalities for the Jensen-Shannon divergence.Draft available at http://www.math.ku.dk/topsoe/,2006,unpublished.
    [77]I.J.Schoenberg.Metric spaces and positive definite functions.Trans.Amer.Math.Soc.,1938,44:522-536.
    [78]C.Berg,J.P.R.Christensen,and P.Ressel.Harmonic Analysis on Semigroups.New York:Springer-Verlag,1984.
    [79]B.Fuglede and F.Topsφe.Jensen-shannon divergence and hilbert space embedding.Proc.of the Internat.Symposium on Information Theory,Chicago,2004:31.
    [80]程士宏.测度论与概率论基础.北京:北京大学出版社,2004.
    [81]老大中.变分法基础(第二版).北京:国防工业出版社,2007.
    [82]刘光中.凸分析与极值问题.北京:高等教育出版社,1991.
    [83]H.D.L.Hollmann.A Relation Between Levenshtein-Type Distances and Insertion -and-Deletion Correcting Capabilities of Codes,IEEE Trans.Inform.Theory,1993,39(4):1424-1427.
    [84]S.N.Diggavi and M.Grossglauser.On Transmission Over Deletion Channels.Allerton Conference,Monticello,Illinois:October 2001.
    [85]G.Navarro.A Guided Tour to Approximate String Matching.ACM Compuing Surveys,2001,33(1):31-88.
    [86]Z.D.Dai and K.Imamura.Linear complexity for one-symbol substitution of a periodic sequence over GF(q).IEEE Trans.Inform,.Theory,1998,44(3):1328-1331.
    [87]S.Q.Jiang,Z.D.Dai and K.Imamura.Linear compulexity of a sequence obtained from a periodic sequence by either substituting,inserting,or deleting k symbols with in one period.IEEE Trans.Inform.Theory,2000,46(3):1174-1177.
    [88]R.A.Wagner and M.J.Fischer.The String-to-String Correction Problem.Journal of the Association for Computing Machinery,1974,21(1):168-173.
    [89]L.M.Adleman.Molecular computation of solutions to combinatorial problems.Science,1994,266:1021-1024.
    [90]P.H.Sellers.On the theory and computation of evolutionary distances.SIAM J.Appl.Math.,1974,26(4):787-793.
    [91]沈世镒.生物序列突变与比对的结构分析.北京:科学出版社,2004.
    [92]S.B.Needleman and C.D.Wunsch.A general method applicable to the search for similarities in the amino acid sequence of two proteins.J.Mol.Biol.,1970,48:443-453.
    [93]T.F.Smith,M.S.Waterman and W.M.Fitch.Comparative biosequence metrics.J.Molecular Evolution,1981,18:38-46.
    [94]K.-M.Chao,W.R.Pearson and W.Miller.Aligning two sequences within a specified diagonal band.Comput.Appl.Biosci.,1992,8:481-487.
    [95]K.-M.Chao,J.Zhang,J.Ostell and W.Miller.A Tool Aligning Very Similar DNA Sequences.Comput.Appl.Biosci.,1997,13(1):75-80.
    [96]K.-M.Chao,R.C Hardison,and W.Miller.Recent Developments in Linear-Space Aligning Methods:A Survey.Comput.Appl.Biosci.,1994,1(4):271-291.
    [97]K.M.Chao,J.Ostell and W.Miller.A Local Aligning Tool for Very Long DNA Sequences.Comput.Appl.Biosci.,1995,11(2):147-153.
    [98]A.Torres,A.Cabada and J.J.Nieto.An exact formula for the number of alignment s between two DNA sequences.DNA Sequence,2003,14(6):427-430.
    [99]S.-Y.Shen,K.Wang,G.Hu and S.-T.Xia.On the alignment space and its applications.Proceedings of 2006 IEEE Information Theory Workshop(ITW'06),Chengdu,2006:165-169.
    [100]W.Ewens and G.Grant.Statistical methods in bioinformatics:an introduction.New York:Springer-Verlag,2001.
    [101]K.Lange.Mathematical and statistical methods for genetic analysis.New York:Springer-Verlag,2002.
    [102]L.P.Chew.,K.Kedem,D.P.Huttenlocher and Kleinberg.Fast detection of geometric substructure in proteins.J.Comp.Biol.,1999,6(3-4):313-325.
    [103]K.Kedem,L.P.Chew and R.Elber.Unit-vector RMS(URMS) as a tool to analyze molecular dynamics trajectories.Proteins Struct.Funct.Genet.,1999,37:554-564.

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700