Pattern Recognition Methods and Power Analysis Based on Sequence Analysis
Abstract
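Gene expression is the process by which a cell converts the genetic information stored in DNA into biologically active protein molecules. It proceeds in two steps, transcription and translation. Transcription takes place in the cell nucleus: one strand of the DNA serves as the template, and mRNA is synthesized under the catalysis of RNA polymerase according to the rules of complementary base pairing. Translation uses the mRNA as the template and tRNA as the carrier, and assembles amino acids into a polypeptide chain with the help of the relevant enzymes, cofactors, and energy.
     A transcription factor binding site (TFBS) is the DNA region to which a transcription factor binds when it regulates gene expression. Binding sites include promoters, enhancers, and silencers, and are also called cis-acting elements. A binding site does not itself encode any protein; it only provides a site of action, but through its binding to the transcription factor it regulates the efficiency and precision of transcription. Within a collection of molecular sequences, the binding sites of a given transcription factor usually follow a characteristic pattern, called a motif. Identifying these motifs is of great importance for the study of gene transcription and expression.
     Initially, binding sites were determined experimentally with the electrophoretic mobility shift assay (EMSA) and DNase footprinting, but such experiments are time-consuming and laborious, carry uncertainty, and cannot support large-scale, high-throughput analysis. Capillary electrophoresis, which emerged in the mid-1990s, greatly increased sequencing throughput. In recent years, combining chromatin immunoprecipitation (ChIP) with microarray (chip) technology has produced large amounts of ChIP-chip data, with fragment lengths of about 800 bp ([26],[15]). Many methods have been developed to locate transcription factor binding sites in sequences of this length, but the statistical power of detecting binding sites has so far been studied only by simulation, and theoretical treatments are rare. With the development of second-generation sequencing, combining ChIP with second-generation sequencing has produced large amounts of ChIP-seq data; how to find binding sites in ChIP-seq data, and how to study the power of detecting them, is a new problem. This thesis addresses these two questions.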
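     1. Power of motif detection in long sequences
     Many methods for identifying transcription factor binding sites now exist. The most successful design a statistic that tests whether a pattern is over-represented in a sequence. The power of such a statistic, however, has so far been studied only by simulation; no theoretical method is available. In Chapter 2 we build a hidden Markov model to study the power of the statistic.
     In this hidden Markov model, sequence generation is governed by three components: the background sequence, the foreground sequence, and the distribution of the motif. The background is assumed to be a sequence of independent and identically distributed (i.i.d.) random variables, the foreground is the inserted motif, and the motif distribution is the probability with which the motif is inserted into the background. The model is then specified by the emission probabilities of the background, the position weight matrix of the motif, the initial distribution, the state transition matrix, and the state space. Let n be the length of the sequence and let W be the motif of interest, of length w; we write N_W(n) for the number of occurrences of W in the sequence.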
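The generative model just described can be illustrated with a minimal Python sketch. It is a simplification and not the thesis's implementation: the motif is a fixed word rather than a draw from a position weight matrix, and the function names `simulate_sequence` and `count_occurrences`, the seed, and the parameter values are all hypothetical.

```python
import random

def simulate_sequence(n, background, motif, density, rng):
    """Generate a sequence of length n.

    At each step, with probability `density` (and only if it still fits),
    a full copy of `motif` is emitted (foreground state); otherwise one
    background letter is drawn i.i.d. from `background`.
    Assumption: the motif is a fixed word, not a position weight matrix.
    """
    letters = list(background)
    weights = [background[a] for a in letters]
    out = []
    while len(out) < n:
        if len(out) + len(motif) <= n and rng.random() < density:
            out.extend(motif)
        else:
            out.append(rng.choices(letters, weights)[0])
    return "".join(out)

def count_occurrences(seq, word):
    """Count occurrences of `word` in `seq`, overlaps included."""
    w = len(word)
    return sum(1 for i in range(len(seq) - w + 1) if seq[i:i + w] == word)

rng = random.Random(0)  # fixed seed so the run is reproducible
bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
seq = simulate_sequence(1000, bg, "ACGT", 0.05, rng)
n_w = count_occurrences(seq, "ACGT")  # one realization of N_W(n)
```

Counting with overlaps matters for self-overlapping words: "11" occurs three times in "1111".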
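     In the theoretical part, Section 2.2.1 first gives a method for computing the mean and variance of N_W(n). We then show that for frequently occurring patterns the distribution of N_W(n) can be approximated by a normal distribution, whereas for rare patterns we approximate it by a compound Poisson distribution.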
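For the special case of a pure i.i.d. background with no inserted motif, the mean and variance of the overlapping word count have a standard closed form, sketched below. This covers only the null model; the thesis's formulas for the full hidden Markov model are not reproduced here, and the function names are hypothetical.

```python
from math import prod

def word_prob(word, p):
    """Probability that an i.i.d. sequence spells `word` at a fixed position."""
    return prod(p[c] for c in word)

def count_mean_var(word, p, n):
    """Exact mean and variance of the overlapping count of `word`
    in an i.i.d. sequence of length n with letter probabilities `p`."""
    w = len(word)
    m = n - w + 1                      # number of possible start positions
    if m <= 0:
        return 0.0, 0.0
    mu = word_prob(word, p)
    mean = m * mu
    var = m * mu * (1 - mu)            # sum of the indicator variances
    for d in range(1, min(w, m)):      # shifts at which two copies overlap
        # Two copies at distance d are compatible only if the word's
        # suffix of length w - d equals its prefix of length w - d.
        compatible = word[d:] == word[:w - d]
        joint = mu * word_prob(word[w - d:], p) if compatible else 0.0
        var += 2 * (m - d) * (joint - mu * mu)
    return mean, var

uniform_bits = {"0": 0.5, "1": 0.5}
mean_11, var_11 = count_mean_var("11", uniform_bits, 5)
```

Indicators at distance w or more are independent under the i.i.d. assumption, so only shifts d < w contribute covariance terms.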
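     In the simulation part we consider three simulations. In the first, the pattern is "11", the state space is {0,1}, the emission probability of "1" in the background takes the values 0.1, 0.5, and 0.7, and the motif density takes the values 0, 0.05, and 0.1. In the second, the state space is {A,C,G,T}, the patterns are "ACGT" and "CGCG", and the nucleotide emission probabilities of the background follow three settings: CG poor, uniform, and CG rich. In the third, we consider two relatively long patterns, "ACGTATC" and "AAGAAGAA", again under the CG poor, uniform, and CG rich settings. For all three simulations we compare simulated and theoretical results using three criteria. The first compares the simulated mean and variance with the theoretical mean and variance. The second compares the simulated power with the theoretical power: under the normal approximation for pattern "11", and under the compound Poisson approximation for patterns "ACGT", "CGCG", "ACGTATC", and "AAGAAGAA". The third compares distributions: a Q-Q plot of the occurrence count for pattern "11", and histograms of the simulated occurrence frequencies against the compound Poisson distribution for patterns "ACGT", "CGCG", "ACGTATC", and "AAGAAGAA". Finally, we provide an online program for computing the power.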
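As an illustration of how a theoretical power under the normal approximation can be evaluated, the sketch below uses a one-sided over-representation test. It assumes the mean and standard deviation of the count are already known under the null (motif density 0) and under the alternative (motif density greater than 0); the function name and the test set-up are hypothetical and are not taken from the thesis.

```python
from statistics import NormalDist

def normal_power(mean0, sd0, mean1, sd1, alpha=0.05):
    """Approximate power of a one-sided test for over-representation.

    H0: count ~ Normal(mean0, sd0); H1: count ~ Normal(mean1, sd1).
    H0 is rejected when the count exceeds the (1 - alpha) quantile
    of the null distribution.
    """
    z = NormalDist().inv_cdf(1 - alpha)
    threshold = mean0 + z * sd0
    return 1 - NormalDist(mean1, sd1).cdf(threshold)

# When the alternative equals the null, the power equals the level.
level_check = normal_power(10.0, 2.0, 10.0, 2.0, alpha=0.05)
```

Under these assumptions the power increases as the alternative mean moves above the null mean, which is the qualitative behaviour the power-versus-density plots are meant to show.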
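     In the real-data part we give four examples. In the first, we consider CpG enrichment in three organisms, the nematode C. elegans, the fruit fly D. melanogaster, and E. coli, and plot the power of the CpG occurrence count under the normal approximation against sequence length. In the remaining three, we consider the transcription factor SP1, the zinc finger motif C2H2, and a structural motif; using their position weight matrices, we plot the power under the compound Poisson approximation against motif density.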
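     2. Identification of motifs in second-generation sequencing data and its power
     Although first-generation sequencing made it possible to complete the draft sequence of the human genome, the effort cost about 3 billion US dollars and took more than a decade, which is clearly far from an ideal sequencing method. Since the beginning of the 21st century, second-generation sequencing has developed rapidly. It retains high accuracy while greatly reducing cost and increasing speed, and is now widely used in biological research.
     In second-generation sequencing, the reads (the short sequence fragments produced by the experiment) are sampled at random from the genome. Current methods for analyzing such data mostly map the reads to the genome first and then analyze the mapped reads. However, the genomes of many organisms are unknown, and even when the genome is known, a read does not necessarily map to a unique position. A new method for analyzing second-generation sequencing data is therefore needed.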
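     Here we analyze second-generation sequencing data by word counting. For first-generation sequencing data, many methods have been developed to study the distribution of the number of occurrences of a pattern in a single long sequence, but to our knowledge no one has studied the distribution of pattern counts for second-generation sequencing data. In Chapter 3 we build a probabilistic model in which the background sequence is an i.i.d. random sequence of length n, from which M reads of length β are drawn at random. For a pattern W, N_W(M,n,β) denotes the number of occurrences of W in these M reads of length β.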
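The read-sampling model can be sketched in Python as follows. The sketch assumes uniformly distributed read start positions on a single strand and an i.i.d. background; under those assumptions each read is itself an i.i.d. sequence, so the expected count is M(β - w + 1) times the word probability. Function names are hypothetical.

```python
import random

def sample_reads(genome, num_reads, read_len, rng):
    """Draw `num_reads` reads of length `read_len` with uniformly
    distributed start positions (single strand, an assumption)."""
    last_start = len(genome) - read_len
    if last_start < 0:
        raise ValueError("read length exceeds genome length")
    starts = (rng.randint(0, last_start) for _ in range(num_reads))
    return [genome[s:s + read_len] for s in starts]

def count_in_reads(reads, word):
    """Total number of (overlapping) occurrences of `word` over all reads."""
    w = len(word)
    return sum(1 for r in reads
                 for i in range(len(r) - w + 1) if r[i:i + w] == word)

def expected_count(word, p, num_reads, read_len):
    """E[N_W(M, n, beta)] = M * (beta - w + 1) * P(W) for an i.i.d. background."""
    mu = 1.0
    for c in word:
        mu *= p[c]
    return num_reads * (read_len - len(word) + 1) * mu

rng = random.Random(1)
genome = "".join(rng.choices("ACGT", k=5000))   # uniform i.i.d. background
reads = sample_reads(genome, 200, 36, rng)
observed = count_in_reads(reads, "ACGT")
expected = expected_count("ACGT", dict.fromkeys("ACGT", 0.25), 200, 36)
```

The mean does not depend on n in this setting, but the variance does, because reads that overlap on the genome produce correlated counts.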
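     In the theoretical part, as in the previous part, we give methods for computing the mean and variance of N_W(M,n,β) and consider both the normal approximation and the compound Poisson approximation. For the compound Poisson approximation we treat the single-strand and double-strand models separately, and we give upper bounds on the total variation distance for all three cases. Finally, using the hidden Markov model built in Chapter 2, we discuss the power of N_W(M,n,β).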
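To show how a compound Poisson distribution can be evaluated numerically, the sketch below applies Panjer's recursion (see [41]) to a Poisson number of clumps with geometric clump sizes. The geometric clump-size law and the parameters `lam` and `a` are illustrative assumptions; the thesis's own parameterization for the single-strand and double-strand models is not given in this abstract.

```python
from math import exp

def compound_poisson_pmf(lam, clump_pmf, kmax):
    """P(S = k) for k = 0..kmax, where S = C_1 + ... + C_K,
    K ~ Poisson(lam) and the C_i are i.i.d. on {1, 2, ...} with
    P(C = j) = clump_pmf(j). Uses Panjer's recursion:
    p_k = (lam / k) * sum_{j=1..k} j * q_j * p_{k-j}.
    """
    p = [exp(-lam)]
    for k in range(1, kmax + 1):
        s = sum(j * clump_pmf(j) * p[k - j] for j in range(1, k + 1))
        p.append(lam * s / k)
    return p

def geometric_clump(a):
    """Clump-size law P(C = j) = (1 - a) * a**(j - 1); an assumption."""
    return lambda j: (1 - a) * a ** (j - 1)

def upper_tail(pmf, k):
    """P(S >= k), usable as a p-value for an observed count k."""
    return max(0.0, 1.0 - sum(pmf[:k]))

pmf = compound_poisson_pmf(1.5, geometric_clump(0.3), 200)
p_value = upper_tail(pmf, 6)
```

With a = 0 every clump has size one and the distribution reduces to an ordinary Poisson, which gives a simple sanity check.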
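     In the simulation part we consider five patterns: "TAT", "ACGT", "CGCG", "ACGTATC", and "AAGAAGAA". For the nucleotide distribution we again use the CG poor, uniform, and CG rich settings. In every simulation we compare the simulated histogram with the probability distribution given by the compound Poisson approximation, and in some cases we add the density curve of the normal approximation. We also study the power of the pattern count and compare theoretical and simulated power for the five patterns. Finally, we provide a Matlab program for computing the p-value of a pattern.
     In the real-data part we analyze the binding sites of the transcription factor GABP. Using the ChIP-seq data of [64], we obtain, through the compound Poisson approximation, the p-values of all patterns of length 6 in the control data and in the ChIP-seq data. By assembling the 10 patterns with the smallest p-values, we recover a pattern that agrees exactly with the experimentally determined one.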
Gene expression is the process by which a cell converts the genetic information stored in DNA into active protein. It occurs in two major stages. The first is transcription, in which one strand of the DNA is copied into a messenger RNA molecule according to the principle of complementary base pairing. The second stage is protein synthesis. This stage is also known as translation, so called because there is no direct correspondence between the nucleotide sequence in DNA (and RNA) and the sequence of amino acids in the protein. In this process the messenger RNA serves as the template and tRNA as the carrier, and the protein is synthesized with the help of enzymes.
     A transcription factor binding site is the DNA region to which a transcription factor binds when it regulates gene expression. Binding sites include promoters, enhancers, and silencers, and are also known as cis-acting elements. A binding site does not encode any protein; it only provides a location at which the transcription factor can bind to regulate gene expression. In molecular sequences, the binding sites of each transcription factor follow a given pattern, known as a motif. The identification of motifs is very important in genomic research.
     Early on, scientists identified transcription factor binding sites through the electrophoretic mobility shift assay (EMSA) and DNase footprinting, but these methods are time-consuming, do not give certain results, and cannot support high-throughput analysis. In the mid-1990s, capillary array electrophoresis made high throughput possible. Recently, by combining chromatin immunoprecipitation (ChIP) with microarrays (chips), scientists have obtained large amounts of ChIP-chip data, with lengths of about 800 bp ([26],[15]). Many approaches have been developed to identify transcription factor binding sites in these long sequences, but the power of detecting binding sites has been studied only by simulation, and no theoretical power is available. With the development of next-generation sequencing, combining ChIP with next-generation sequencing has made large amounts of ChIP-seq data available. How to identify transcription factor binding sites in ChIP-seq data, and how to study the power of detecting them, is a new problem. This thesis discusses these two questions.
     1. The Power of Motif Detection Based on Long Sequences
     Many approaches have been developed to identify transcription factor binding sites. One of the most successful is to identify statistically over- or under-represented patterns in a sequence. So far, only simulation has been used to evaluate the power of motif detection methods; no systematic theoretical formulas are available for the power of detecting over-represented patterns when the sequence contains multiple instances of a motif. In Chapter 2 we therefore develop a hidden Markov model to study the power of the test statistic.
     In the hidden Markov model, we model the sequence data using three components: the background model, the foreground model for the motif, and the distribution of the motifs along the sequence. The model is specified by the emission probabilities of the background model, the position weight matrix of the motif, the motif density, the initial distribution, the state space, and the state transition matrix. Let n be the length of the sequence and let W be the motif of interest, of length w. Let N_W(n) be the number of occurrences of W in the sequence.
     In the theoretical part, we first give the mean and variance of N_W(n) in Section 2.2.1. We then show that for frequent patterns the distribution of N_W(n) can be approximated by the normal distribution, and that for rare patterns it can be approximated by a compound Poisson distribution.
     In the simulation part, we carry out three simulations to evaluate the validity of the theoretical results. In the first simulation, the pattern of interest is "11", the state space is {0,1}, the probability of drawing 1 in the background sequence is 0.1, 0.5, or 0.7, and the motif density is 0, 0.05, or 0.1. In the second simulation, the state space is {A,C,G,T} and we consider two patterns, "ACGT" and "CGCG", under three settings: CG poor, uniform, and CG rich. In the third simulation, we consider two relatively long patterns, "ACGTATC" and "AAGAAGAA", under the same three settings. For these three simulations, we use three criteria to compare the theoretical and simulated results. Under the first criterion, we compare the simulated mean and variance with the theoretical mean and variance. Under the second criterion, we give the simulated power and the normal-approximation power for pattern "11", and the simulated power and the compound Poisson-approximation power for patterns "ACGT", "CGCG", "ACGTATC", and "AAGAAGAA". Under the third criterion, for pattern "11" we use a Q-Q plot to compare the standardized N_W(n) with the standard normal distribution; for patterns "ACGT", "CGCG", "ACGTATC", and "AAGAAGAA" we compare histograms of the simulated values of N_W(n) with the compound Poisson distribution. We also provide an online program to calculate the power of a pattern.
     In the real-data part, we give four examples. In the first example, we consider CpG enrichment in C. elegans, D. melanogaster, and E. coli, and plot the power of the CpG occurrence count under the normal approximation against sequence length. In the other three examples, we consider the binding sites of the transcription factor SP1, the zinc finger motif C2H2, and a structural motif. Based on the position weight matrices of these examples, we plot the power under the compound Poisson approximation against motif density.
     2. The Identification of Motifs and Its Power Based on Next-Generation Sequencing
     The Human Genome Project (HGP) was accomplished with first-generation sequencing, but it cost nearly three billion US dollars and took more than a decade, so first-generation sequencing is not an ideal method. Next-generation sequencing technology has developed rapidly since the beginning of the 21st century. It keeps the accuracy of first-generation sequencing while offering low cost and high throughput, and is now used in many biological studies.
     In next-generation sequencing data, the reads are randomly sampled from the genome sequence of interest. Most computational approaches first map the reads to the genome and then analyze the data based on the mapped reads. However, many organisms have unknown genome sequences, and many reads cannot be uniquely mapped even when the genome sequence is known. A new method is therefore needed to analyze next-generation sequencing data.
     Here we use word patterns to analyze next-generation sequencing data. Word pattern counting has played an important role in molecular sequence analysis. Many approaches have been developed to analyze the number of occurrences of a word pattern in a long sequence and to approximate its distribution, but for next-generation sequencing data no studies of the distribution of the number of occurrences of word patterns have been carried out. In Chapter 3 we develop a probabilistic model in which the background sequence is an i.i.d. random sequence of length n, from which M reads of length β are chosen at random. For a pattern W, let N_W(M,n,β) be the number of occurrences of W in these M reads of length β.
     In the theoretical part, as in the previous part, we first give the mean and variance of N_W(M,n,β). We also consider the normal approximation and the compound Poisson approximation. For the compound Poisson approximation we consider the single-strand and double-strand cases separately, and we give bounds on the total variation distance for all three approximations. At the end of the theoretical part, we discuss the power of N_W(M,n,β) using the hidden Markov model developed in Chapter 2.
     In the simulation part, we consider five patterns: "TAT", "ACGT", "CGCG", "ACGTATC", and "AAGAAGAA". For the nucleotide probabilities we consider three settings: CG poor, uniform, and CG rich. In all simulations we compare the histogram of the simulated values of N_W(M,n,β) with the compound Poisson distribution, and in some cases we also overlay the density curve of the normal approximation. In addition to the histograms, we study the power of N_W(M,n,β) and compare the simulated power with the theoretical power for the five patterns. Finally, we have developed a MATLAB GUI program to calculate the p-value of a pattern.
     In the real-data part, we consider the ChIP-seq data for the binding sites of the transcription factor GABP in [64]. We obtain the p-values of all patterns of length 6 for the control data and the ChIP-seq data through the compound Poisson approximation. We analyze the 10 smallest p-values and the corresponding patterns, and then construct a consensus sequence that agrees with the experimentally determined sequence.
References
[1] F. Antequera and A. Bird. CpG islands as genomic footprints of promoters that are associated with replication origins. Current Biology, 9(17):R661-R667, 1999.
    [2] R. Arratia, L. Goldstein, and L. Gordon. Poisson approximation and the Chen-Stein method. Statistical Science, 5(4):403-424, 1990.
    [3] T.L. Bailey and C. Elkan. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Proc Int Conf Intell Syst Mol Biol, volume 2, pages 28-36. Citeseer, 1994.
    [4] T.L. Bailey and C. Elkan. Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine learning, 21(1):51-80, 1995.
    [5] V. Boeva, J. Clement, M. Regnier, M.A. Roytberg, and V.J. Makeev. Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules. Algorithms for molecular biology, 2(1):13, 2007.
    [6] A. Campbell, J. Mrazek, and S. Karlin. Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA. Proceedings of the National Academy of Sciences of the United States of America, 96(16):9184-9189, 1999.
    [7] M. Carpentier, S. Brouillet, and J. Pothier. YAKUSA: a fast structural database scanning method. Proteins: Structure, Function, and Bioinformatics, 61(1):137-151, 2005.
    [8] L.H.Y. Chen and Q.-M. Shao. Normal approximation under local dependence. The Annals of Probability, 32(3A):1985-2028, 2004.
    [9] D. Dalevi, D. Dubhashi, and M. Hermansson. Bayesian classifiers for detecting HGT using fixed and variable order Markov models of genomic signatures. Bioinformatics, 22(5):517-522, 2006.
    [10] C. Dufraigne, B. Fertil, S. Lespinats, A. Giron, and P. Deschavanne. Detection and characterization of horizontal transfers in prokaryotes using genomic signature. Nucleic Acids Research, 33(1):e6, 2005.
    [11] J.C. Fu and W.Y.W. Lou. On the normal approximation for the distribution of the number of simple or compound patterns in a random sequence of multi-state trials. Methodology and Computing in Applied Probability, 9(2):195-205, 2007.
    [12] A.J. Gentles and S. Karlin. Genome-scale compositional comparisons in eukaryotes. Genome research, 11(4):540-546, 2001.
    [13] A.P. Godbole. Poisson approximations for runs and patterns of rare events. Advances in Applied Probability, 23(4):851-865, 1991.
    [14] L.J. Guibas and A.M. Odlyzko. Periods in strings. Journal of Combinatorial Theory, Series A, 30(1):19-42, 1981.
    [15] Sun Haixi and Xiujie Wang. The development and future perspectives of DNA sequencing technology. e-Science Technology, 2(3):19-29, 2009.
    [16] H. Huang. Error bounds on multivariate normal approximations for word count statistics. Advances in Applied Probability, 34(3):559-586, 2002.
    [17] S.R. Jun, G.E. Sims, G.A. Wu, and S.H. Kim. Whole-proteome phylogeny of prokaryotes by feature frequency profiles: An alignment-free method with optimal feature resolution. Proceedings of the National Academy of Sciences of the United States of America, 107(1):133-138, 2010.
    [18] S. Karlin, C. Burge, and A.M. Campbell. Statistical analyses of counts and distributions of restriction sites in DNA sequences. Nucleic acids research, 20(6):1363-1370, 1992.
    [19] S. Karlin and J. Mrazek. Compositional differences within and between eukaryotic genomes. Proceedings of the National Academy of Sciences of the United States of America, 94(19):10227-10232, 1997.
    [20] J. Kleffe and M. Borodovsky. First and second moment of counts of words in random texts generated by Markov chains. Computer applications in the biosciences, 8(5):433-441, 1992.
    [21] J. Kleffe and U. Langbecker. Exact computation of pattern probabilities in random sequences generated by Markov chains. Computer applications in the biosciences, 6(4):347-353, 1990.
    [22] S Sri Krishna, Indraneel Majumdar, and Nick V Grishin. Structural classification of zinc fingers: survey and summary. Nucleic Acids Research, 31(2):532-550, 2003.
    [23] S.Y. Ku and Y.J. Hu. Protein structure search and local structure characterization. BMC Bioinformatics, 9:349, 2008.
    [24] E.S. Lander and M.S. Waterman. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics, 2(3):231-239, 1988.
    [25] C.E. Lawrence, S.F. Altschul, M.S. Boguski, J.S. Liu, A.F. Neuwald, and J.C. Wootton. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science (New York, NY), 262(5131):208-214, 1993.
    [26] Ting-ting LI, Bo JIANG, Xiao-wo WANG, and Xue-gong ZHANG. Tutorial for computational analysis of transcription factor binding sites. Acta Biophysica Sinica, 5:006, 2008.
    [27] J.S. Liu. The Collapsed Gibbs Sampler in Bayesian Computations with Applications to a Gene Regulation Problem. Journal of the American Statistical Association, 89(427):958-966, 1994.
    [28] X.S. Liu, D.L. Brutlag, and J.S. Liu. An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nature biotechnology, 20(8):835-839, 2002.
    [29] W.C. Lo, P.J. Huang, C.H. Chang, and P.C. Lyu. Protein structural similarity search by Ramachandran codes. BMC bioinformatics, 8(1):307, 2007.
    [30] M. Lothaire. Combinatorics on words, encyclopedia of mathematics, vol. 17, 1983.
    [31] D. MacLean, J.D.G. Jones, and D.J. Studholme. Application of 'next-generation' sequencing technologies to microbial genetics. Nature Reviews Microbiology, 7(4):287-296, 2009.
    [32] E.R. Mardis. Next-generation DNA sequencing methods. Annual review of genomics and human genetics, 9(1):387-402, 2008.
    [33] E.R. Mardis. The impact of next-generation sequencing technology on genetics. Trends in Genetics, 24(3):133-141, 2008.
    [34] A.C. McHardy, H.G. Martin, A. Tsirigos, P. Hugenholtz, and I. Rigoutsos. Accurate phylogenetic classification of variable-length DNA fragments. Nature methods, 4(1):63-72, 2006.
    [35] A. Nekrutenko and W.H. Li. Assessment of compositional heterogeneity within and between eukaryotic genomes. Genome Research, 10(12):1986-1995, 2000.
    [36] C. Nguyen, G. Liang, T.D.T. Nguyen, D. Tsao-Wei, S. Groshen, M. Lübbert, J.H. Zhou, W.F. Benedict, and P.A. Jones. Susceptibility of Nonpromoter CpG Islands to De Novo Methylation in Normal and Neoplastic Cells. Journal of the National Cancer Institute, 93(19):1465-1472, 2001.
    [37] G. Nuel. LD-SPatt: Large deviations statistics for patterns on Markov chains. Journal of Computational Biology, 11(6):1023-1033, 2004.
    [38] G. Nuel. Effective p-value computations using Finite Markov Chain Imbedding (FMCI): application to local score and to pattern statistics. Algorithms for Molecular Biology, 1(1):5, 2006.
    [39] G. Nuel. Numerical Solutions for Patterns Statistics on Markov Chains. Statistical Applications in Genetics and Molecular Biology, 5(1):26, 2006.
    [40] G. Nuel. Pattern Markov chains: optimal Markov chain embedding through deterministic finite automata. Journal of Applied Probability, 45(1):226-243, 2008.
    [41] H.H. Panjer. Recursive evaluation of a family of compound distributions. Astin Bulletin, 12(1):22-26, 1981.
    [42] Utz J Pape, Sven Rahmann, Fengzhu Sun, and Martin Vingron. Compound Poisson approximation of the number of occurrences of a position frequency matrix (PFM) on both strands. Journal of Computational Biology, 15(6):547-564, 2008.
    [43] G. Pavesi, P. Mereghetti, G. Mauri, and G. Pesole. Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Research, 32(Web Server Issue):W199-W203, 2004.
    [44] L.R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, 1989.
    [45] M. Regnier. A unified approach to word occurrence probabilities. Discrete Applied Mathematics, 104(1-3):259-280, 2000.
    [46] G. Reinert and S. Schbath. Compound Poisson and Poisson process approximations for occurrences of multiple words in Markov chains. Journal of Computational Biology, 5(2):223-253, 1998.
    [47] G. Reinert, S. Schbath, and M.S. Waterman. Probabilistic and statistical properties of words: an overview. Journal of Computational Biology, 7(1-2):1-46, 2000.
    [48] G. Reinert, S. Schbath, and M.S. Waterman. Statistics on words with applications to biological sequences. Applied combinatorics on words, 105:252-323, 2005.
    [49] P. Ribeca and E. Raineri. Faster exact Markovian probability functions for motif occurrences: a DFA-only approach. Bioinformatics, 24(24):2839-2848, 2008.
    [50] S. Robin and J.J. Daudin. Exact distribution of word occurrences in a random sequence of letters. Journal of applied probability, 36(1):179-193, 1999.
    [51] S. Robin, F. Rodolphe, and S. Schbath. DNA, Words and Models: Statistics of Exceptional Words. Cambridge University Press, 2005.
    [52] S. Robin and S. Schbath. Numerical comparison of several approximations of the word count distribution in random sequences. Journal of Computational Biology, 8(4):349-359, 2001.
    [53] Halsey Lawrence Royden and Patrick Fitzpatrick. Real analysis, volume 3. Prentice Hall, Englewood Cliffs, NJ, 1988.
    [54] S. Schbath. Compound Poisson approximation of word counts in DNA sequences. ESAIM: Probability and Statistics, 1:1-16, 1995.
    [55] S. Schbath. An overview on the distribution of word counts in Markov chains. Journal of Computational Biology, 7(1-2):193-201, 2000.
    [56] S. Schbath and S. Robin. How can pattern statistics be useful for DNA motif discovery? Scan Statistics: Methods and Applications, pages 319-350, 2009.
    [57] G. Shan and W.M. Zheng. Counting of oligomers in sequences generated by Markov chains for DNA motif discovery. Journal of bioinformatics and computational biology, 7(1):39-54, 2009.
    [58] G.E. Sims, S.R. Jun, G.A. Wu, and S.H. Kim. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proceedings of the National Academy of Sciences of the United States of America, 106(8):2677-2682, 2009.
    [59] D. Takai and P.A. Jones. Comprehensive analysis of CpG islands in human chromosomes 21 and 22. Proceedings of the National Academy of Sciences of the United States of America, 99(6):3740-3745, 2002.
    [60] H.J. Thiesen and C. Bach. Target detection assay (TDA): a versatile procedure to determine DNA binding sites as demonstrated on SP1 protein. Nucleic acids research, 18(11):3203-3209, 1990.
    [61] M. Tompa, N. Li, T.L. Bailey, G.M. Church, B. De Moor, E. Eskin, A.V. Favorov, M.C. Frith, Y. Fu, W.J. Kent, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nature biotechnology, 23(1):137-144, 2005.
    [62] M. Tyagi, P. Sharma, C.S. Swamy, F. Cadet, N. Srinivasan, A.G. de Brevern, and B. Offmann. Protein Block Expert (PBE): a web-based protein structure analysis server using a structural alphabet. Nucleic Acids Research, 34(Web Server issue):W119-W123, 2006.
    [63] E.C. Uberbacher and R.J. Mural. Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proceedings of the National Academy of Sciences of the United States of America, 88(24):11261, 1991.
    [64] A. Valouev, D.S. Johnson, A. Sundquist, C. Medina, E. Anton, S. Batzoglou, R.M. Myers, and A. Sidow. Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nature methods, 5(9):829-834, 2008.
    [65] N. Vergne and M. Abadi. Poisson approximation for search of rare words in DNA sequences. Alea, 4:223-244, 2008.
    [66] M.S. Waterman. Introduction to computational biology: maps, sequences and genomes. Chapman & Hall, 1995.
    [67] G.E. Willmot and H.H. Panjer. Difference equation approaches in evaluation of compound distributions. Insurance: Mathematics and Economics, 6(1):43-56, 1987.
    [68] G.A. Wu, S.R. Jun, G.E. Sims, and S.H. Kim. Whole-proteome phylogeny of large dsDNA virus families by an alignment-free method. Proceedings of the National Academy of Sciences of the United States of America, 106(31):12826-12831, 2009.
    [69] J.M. Yang and C.H. Tung. Protein structure database search and evolutionary classification. Nucleic acids research, 34(13):3646-3659, 2006.
    [70] J. Zhang, B. Jiang, M. Li, J. Tromp, X. Zhang, and M.Q. Zhang. Computing exact P-values for DNA motifs. Bioinformatics, 23(5):531-537, 2007.
    [71] Z.D. Zhang, J. Rozowsky, M. Snyder, J. Chang, and M. Gerstein. Modeling ChIP sequencing in silico with applications. PLoS Computational Biology, 4(8):e1000158, 2008.
