Research on Gene Prediction Methods Based on DNA Signal Sequence Analysis
Abstract
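Bioinformatics is among the most promising sciences of the 21st century. It aims to interpret the wealth of genetic information, to uncover and extract the regularities within it, and ultimately to reach a comprehensive understanding of life and its processes. The key to interpreting and understanding genome sequences is gene prediction, that is, identifying all functional units in a genome, including protein-coding DNA segments and other functional units. Because of the diversity of genes across organisms, the complexity of gene structure, and the relative youth of the discipline, existing recognition algorithms still have many problems in identification accuracy, computational cost, and range of applicability. To address these problems, this thesis studies three aspects of gene prediction: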
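     1. Splice site prediction. Splice site identification is an important step in gene prediction. Because the Takagi-Sugeno (T-S) fuzzy model generalizes well, is robust, and has a simple structure, this thesis proposes a T-S modeling method that combines fuzzy clustering based on a fuzzy likelihood function with least squares. T-S splice site prediction models are built separately from the statistical features of the sequences flanking splice sites and from the way the base composition of those flanking sequences varies with GC content, which effectively improves recognition accuracy. To further improve accuracy and reduce computation, an improved Bayesian splice site prediction model based on the composition and position information of the bases in a sequence is proposed. Building on kernel method theory, the algorithm introduces a Bayesian feature mapping: by mapping DNA sequences into a new feature space, it derives a linear relationship between the decision attribute and the logarithms of the condition attributes, estimates the coefficients of this relationship by least squares, and thereby yields a new Bayesian classifier. Simulation results show that the algorithm is computationally efficient, structurally simple, and accurate, that it outperforms SVM-B and naive Bayes, and that it can handle structure identification on large volumes of DNA sequence data.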
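     2. Protein-coding region prediction. Identifying protein-coding regions is an important research topic in gene prediction. This thesis proposes an integrated algorithm for locating exons precisely. First, a binary support vector machine (SVM) classifier is built from the conserved sequences of protein-coding regions. Then, exploiting the "period-3 behavior" of the first base of each codon, the classifier output is analyzed with the short-time Fourier transform (STFT) to locate coding regions precisely. Because gene structure is complex and varied, accuracy improves when base positions in a gene are divided into three classes. A binary SVM classifier cannot distinguish these positions well, and a multi-class SVM has a rather complicated structure. A Takagi-Sugeno fuzzy model is therefore used to model the gene sequence; its output indicates whether the base at the center of the input window is a non-coding base, the first base of a codon in a coding region, or a non-first base of a codon in a coding region. The model output is then analyzed with the STFT to locate coding regions precisely.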
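     3. Human promoter prediction. Eukaryotic promoter identification is a difficult problem in gene prediction. This thesis proposes a promoter recognition method based on a model of the positional distribution density of oligonucleotides. First, a Gaussian mixture model (GMM) is used to model the positional distribution density of oligonucleotides in order to extract important motifs, which often play an important regulatory role in biological signals. The GMM parameters are estimated with the expectation-maximization (EM) algorithm, and fuzzy clustering guides the choice of the number of mixture components and the initial means, which helps preserve the accuracy of the GMM. Then, based on the extracted oligonucleotide positional densities, a least-squares-based weighted Bayesian classifier identifies human promoters. This algorithm has a low computational cost and suits modeling on massive data. To make better use of the intrinsic signal features of promoter sequences and improve accuracy, the original promoter sequences are projected by Bayesian feature mapping into a high-dimensional space of oligonucleotide positional distribution densities, a new kernel function is constructed on that basis, and a least squares support vector machine (LS-SVM) model is built to identify human promoters. The feature transformation of the kernel combines the oligonucleotide composition and position information of promoter sequences and reflects the actual transcriptional regulation mechanism reasonably well. The method generalizes well, and its computational cost is independent of the input dimension. The prediction method can be applied to several other biological problems.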
     Finally, the work of this thesis is summarized and directions for future work are identified.
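The period-3 analysis in item 2 can be illustrated with a small sketch. It is not the thesis implementation: the rectangular window, the window length, the step size, and the helper names `period3_power` and `toy_scores` are all assumptions made for illustration, and the toy score sequence merely stands in for a classifier output that is high at codon-first positions.

```python
import cmath
import math


def period3_power(values, window=30, step=3):
    """Sliding-window power of `values` at frequency 1/3.

    This is one frequency bin of a short-time Fourier transform with a
    rectangular window (an assumed, simplest-possible window choice).
    """
    if window % 3 != 0:
        raise ValueError("window must be a multiple of 3 so the 1/3 bin exists")
    # exp(-2*pi*i*k/3) only takes three distinct values, so precompute them.
    twiddle = [cmath.exp(-2j * math.pi * k / 3) for k in range(3)]
    powers = []
    for start in range(0, len(values) - window + 1, step):
        acc = sum(values[start + k] * twiddle[k % 3] for k in range(window))
        powers.append((start, abs(acc) ** 2 / window))
    return powers


def toy_scores(length=180, coding=(60, 120)):
    """Hypothetical classifier output: 1.0 at codon-first positions
    inside the coding interval, 0.0 everywhere else."""
    lo, hi = coding
    return [1.0 if lo <= i < hi and (i - lo) % 3 == 0 else 0.0
            for i in range(length)]


scores = toy_scores()
profile = period3_power(scores)
peak_start, peak_power = max(profile, key=lambda item: item[1])
print(peak_start, round(peak_power, 3))
```

On the toy input the power at frequency 1/3 is zero in the non-coding flanks and largest for windows that lie entirely inside the coding interval, which is the property the localization step relies on.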
Bioinformatics is one of the most promising scientific fields of the 21st century. It is dedicated to interpreting genomic information, exploring the patterns hidden in the genome, and ultimately reaching a comprehensive understanding of life and its processes. The key to interpreting and understanding genome sequences is gene prediction, namely the identification of all functional units in the genome, including protein-coding DNA fragments and other functional units. Because of biodiversity and the large variation in gene structure, existing recognition algorithms have many problems with accuracy, computational load, and scope of application. To deal with these problems, three aspects are studied:
     1. Splice site prediction. The recognition of splice sites is an important step in gene prediction. Because the Takagi-Sugeno (T-S) fuzzy model offers good generalization, robustness, and a simple structure, a T-S modeling algorithm based on least squares and fuzzy clustering with a fuzzy likelihood function is proposed. A GC-content-classified (high versus low GC content) modeling method is presented, based on the conserved signal sequences around splice sites and on the statistical observation that the composition of the sequences upstream and downstream of a splice site depends on the GC content of the surrounding sequence. This improves the identification accuracy. To improve accuracy further and reduce computational complexity, an improved naive Bayesian splice site classifier is proposed that uses the composition and position information of the bases in the sequence. Based on kernel method theory, this method adopts a Bayesian feature function to map the sequences into a new feature space. A linear relationship between the condition attributes and the decision attribute is derived, and its coefficients are determined by the least squares method, yielding a new Bayesian classifier. Simulation results show that the computation time is directly proportional to the number of sequences and that the method has high classification accuracy. Its performance improves on SVM-B and the naive Bayesian classifier. The method is well suited to gene structure identification on large DNA sequence data sets.
     2. Accurate localization of protein-coding regions. The recognition of protein-coding regions is an important research subject in gene prediction. An integrated algorithm for exon identification is proposed. First, according to the conserved sequences of DNA coding regions, a support vector machine (SVM) classifier for the first nucleotide of a codon in coding regions is established. Then, according to the period-3 behavior of the first nucleotide of a codon, the output sequence of the model is analyzed with the short-time Fourier transform (STFT), and the position of coding regions can be accurately determined. Given the complexity and diversity of gene structure, the positions of bases in a gene should be divided into three classes to improve identification accuracy. A binary SVM classifier cannot recognize the position of bases well, and the structure of a multi-class SVM is complicated. A T-S fuzzy model is therefore used to construct the gene sequence model. Its single output indicates whether the nucleotide at the center of the input window belongs to a non-coding region, is the first nucleotide of a codon in a coding region, or is a non-first nucleotide of a codon in a coding region. The output sequence of the model is then analyzed with the STFT, and the position of coding regions can be accurately determined.
     3. Human promoter prediction. The recognition of eukaryotic promoters is a difficult research subject in gene prediction. A promoter recognition algorithm based on a model of the positional densities of oligonucleotides is proposed. First, a Gaussian mixture model (GMM) is adopted to model the positional densities of oligonucleotides in order to extract important motifs that play an important role in signal regulation. The expectation-maximization (EM) algorithm is used to estimate the parameters of the GMM. To improve the modeling accuracy, the number of mixture components and the initial means are determined through fuzzy clustering. Based on the resulting oligonucleotide positional densities, a weighted Bayesian classifier based on least squares is built to identify human promoters. The computational cost is small, and the method is suitable for large DNA sequence data sets. To exploit the signal features of promoters and improve identification accuracy and efficiency, the original promoter DNA sequences are projected into the high-dimensional space of oligonucleotide positional densities using Bayes feature mapping, and a least squares support vector machine (LS-SVM) with a new kernel function corresponding to the Bayes feature mapping is established; human promoters are then identified by the LS-SVM. Through the transformation of this kernel, both the content and the position information of oligonucleotides are integrated, which reflects the characteristics of the actual transcriptional regulation mechanism well. The algorithm generalizes well, and its computational cost is insensitive to the input dimension of the samples. These prediction methods can be generalized to several other biological problems.
     Finally, the research work of this thesis is summarized, and directions for future work are pointed out.
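The GMM step in item 3 can be sketched as a one-dimensional EM fit to the positions at which an oligonucleotide occurs. This is a minimal illustration under stated assumptions, not the thesis implementation: the positions are synthetic, the number of components and the initial means are supplied by hand (the thesis selects them by fuzzy clustering, which is not reproduced here), and the names `normal_pdf` and `fit_gmm_1d` are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(xs, initial_means, iters=100, min_var=1e-3):
    """Fit a 1-D Gaussian mixture to `xs` with EM, starting from `initial_means`."""
    k = len(initial_means)
    means = list(initial_means)
    weights = [1.0 / k] * k
    centre = sum(xs) / len(xs)
    spread = sum((x - centre) ** 2 for x in xs) / len(xs)
    variances = [max(spread, min_var)] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each position.
        resp = []
        for x in xs:
            p = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300  # guard against underflow to zero
            resp.append([value / total for value in p])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, min_var)
    return weights, means, variances


# Synthetic positions relative to a transcription start site (made-up data):
# a tight cluster near -30 and a broader one near -80.
random.seed(0)
positions = ([random.gauss(-30, 3) for _ in range(300)]
             + [random.gauss(-80, 8) for _ in range(200)])
weights, means, variances = fit_gmm_1d(positions, initial_means=[-20, -100])
print([round(m, 1) for m in means], [round(w, 2) for w in weights])
```

The fitted mixture gives a smooth positional density for each oligonucleotide, which is the kind of feature the downstream Bayesian classifier and kernel in item 3 are described as consuming.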
引文
[1]Fleischmann R D et al. Whole-genome random sequencing and assembly of Haemophilus influenza. Science,1995,269:496-512.
    [2]Goffeau et al. The yeast genome directory. Nature,1997,387 (Suppl):5-105.
    [3]The C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans:a platform for investigating biology. Science,1998,282:2012-2018.
    [4]Myers EW et al. A whole-genome assembly of Drosophila. Science,2000,287:868-877.
    [5]The Arabidopsis Genome Initiative. Arabidopsis genome has been sequenced and annotated by the Arabidopsis genome Initiative(AGI). Nature,2000,408:796-815.
    [6]International Human Genome Sequencing Consortium. Initial sequencing and analysis of human genome. Nature,2001,409 (6822):860-921.
    [7]Venter C et al. The sequence of the human genome. Science,2001,291:1304-1351.
    [8]Alan Wee Chung Liew, Hong Yan, Mengsu Yang. Pattern recognition techniques for the emerging field of bioinformatics:A review. Pattern Recognition,2005,38 (11):2055-2073.
    [9]Zoheir Ezziane. Applications of artificial intelligence in bioinformatics:A review. Expert Systems with Applications,2006,30 (1):2-10.
    [10]赵国屏.生物信息学.北京:科学出版社.2005.
    [11]Dunham AR et al. The DNA sequence of human chromosome 22. Nature,1999,402:489-495.
    [12]Hattori M et al. The DNA sequence of Human chromosome 21. Nature,2000,405:311-319.
    [13]孙啸,陆祖宏,谢建明.生物信息学基础,北京:清华大学出版社,2005.
    [14]http://www.ncbi.nlm.nih.gov/SCIENCE98.
    [15]李稚锋.真核基因简介机制相关特征研究:(博士论文).长沙:国防科技大学,2006.
    [16]方刚,陈蕴佳,高歌,刘翟,何坤,吴昕,顾孝诚,罗静初.基因组数据库简介.遗传,2003,25(4):440-444.
    [17]http://www.biosino.org/pages/readme.htm.
    [18]张革新.简明生物信息学教程,化学工业出版社,2005.
    [19]张阳德.生物信息学.北京:科学出版社,2009.
    [20]David C Kulp. Protein-coding gene structure prediction using generalized Hidden Markov Models:[doctor dissertation]. USA:University of California SANTA CRUZ,2003.
    [21]R Staden. Finding protein-coding regions in genomic sequences. Methods Enzymol,1990,183: 63-80.
    [22]R Staden. Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Res, 1984,12(1 Pt2):505-19.
    [23]M S Gelfand. Prediction of function in DNA sequence analysis. J Comput Biol,1995,2(1): 87-115.
    [24]C Burge, S Karlin. Prediction of complete gene structures in human genomic DNA. J Mol Biol, 1997,268 (1):78-94.
    [25]S Brunak, J Engelbrecht, S Knudsen. Prediction of human mRNA donor and acceptor sites from the dna sequence. J Mol Biol,1991,220 (1):49-65.
    [26]M G Reese, F H Eeckman, D Kulp, D Haussler. Improved splice site detection in Genie. J comput Biol,1997,4 (3):311-23.
    [27]M G Reese. Application of a time-delay neural network to promoter annotation in the Drosophila melanogaster genome. Comput Chem,20001,26 (1):51-56.
    [28]V V Solovyev, A A Salamov, C B Lawrence. Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frame. Nucleic Acids Res,1994, 22 (24):5156-63.
    [29]M Q Zhang. Identification of protein coding regions in the human genome by quadratic discriminant analysis. Proc Natl Acad Sci USA, Genetics,1997,94 (2):565-568.
    [30]R Guigo, J W Fickett. Distinctive sequence features in protein coding genic non-coding and intergenic human DNA. J Mol Biol,1995,253 (1):51-60.
    [31]Mihaela Pertea. Gene finding in eukaryotes:[doctor dissertation]. Baltimore Maryland:The Johns Hopkins University,2001.
    [32]Schuler, G.D. Sequence mapping by electronic PCR. Genome Res,1997,7 (5):541-50.
    [33]Altschul S, Gish W, Miller W, Myers, E, Lipman, D. Basic local alignment search tool. J. Mol. Biol.,1990,215:403-410.
    [34]http://www.sanger.ac.uk/Software/analysis/SSAHA.
    [35]Stein, L. Genome annotation:from sequence to biology. Nat Rev Genet.,2001,2 (7):493-503.
    [36]Florea, L, Hartzell, G, Zhang Z, Rubin G M, Miller, W. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res,1998,8 (9):967-74.
    [37]Mount S M. A catalogue of splice junction sequences. Nucleic Acids Res,1982,10(2):459-72.
    [38]Senapathy P, Shapiro M B, Harris N L. Splice junctions, branch point sites, and exons: sequence statistics, identification, and applications to genome project. Methods Enzymol,1990, 183:252-78.
    [39]Tarn, W Y, Steitz J A. Pre-mRNA splicing:the discovery of a new spliceosome doubles the challenge. Trends Biochem Sci.1997,22 (4):132-7.
    [40]Zhang M Q. Statistical features of human exons and their flanking regions. Hum Mol Genet, 1998,7 (5):919-32.
    [41]Ermolaeva, M D, Khalak H G, White O, Smith H O, Salzberg S L. Prediction of transcription terminators in bacterial genomes. J Mol Biol,2000,301 (1):27-33.
    [42]Tompa, M. An exact method for finding short motifs in sequences, with application to the ribosome binding site problem.7th Intl. Conf. Intelligent Systems for Molecular Biology, Heidelberg, Germany,1999:262-71.
    [43]Stormo G D. Consensus patterns in DNA. Methods Enzymol,1990,183:211-21.
    [44]Fickett, J.W. The gene identification problem-an overview for developpers. Computers and Chemistry,1996,20 (1):103-118.
    [45]Zhang M Q, Marr T G. A weight array method for splicing signal analysis. Computational Applied Bioscience,1993,9 (5):499-509.
    [46]Dong S, SearIs D B. Gene structure prediction by linguistic methods. Genomics,1994,23 (3):540-51.
    [47]Farber R, Lapedes A, Sirotkin K. Determination of enkaryotic protein coding regions using neural networks and information theory. Journal of Molecular Biology,1992,226 (2):471-9.
    [48]Matis S, Xu Y, Shah M, Guan X, Einstein J R, Mural R, Uberbacher E. Detection of RNA polymerase Ⅱ promoters and polyadenylation sites in human DNA sequence. Computers and Chemistry,1996,20 (1):135-40.
    [49]O'Neill M C. Training back-propagation neural networks to define and detect DNA-binding sites. Nucleic Acids Research,1991,19 (2):313-8.
    [50]O'Neill M C. Escherichia coli promoters:neural networks develop distinct descriptions in learning to search for promoters of different spacing classes. Nucleic Acids Research,1992,20 (13):3471-7.
    [51]Salzberg S. Locating protein coding regions in human DNA using a decision tree algorithm. Journal of Computational Biology,1995,2 (3):473-85.
    [52]Erickson J M, Altman G G. A search for patterns in the nucleotide sequence of the MS2 genome. Journal of Mathematical Biology,1979,7:219-230.
    [53]Hebsgaard S M, Korning P G, Tolstrup N, Engelbrecht J, Rouze P, Brunak S. Splice site prediction in Arabidopsis thaliana DNA by combining local and global sequence information. Nucleic Acids Research,1996,24 (17):3439-3452.
    [54]Salzberg S L. A method for identifying splice sites and translational start sites in eukaryotic mRNA. Computational Applied Bioscience,1997,13 (4):365-76.
    [55]Zien A, Ratsch G, Mika S, Scholkopf B, Lengauer T, Muller K R. Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics,2000,16 (9):799-807.
    [56]Fickett JW. Recognition of protein coding regions in DNA sequences. Nucleic Acids Res.1982, 10(17):5303-18.
    [57]Borodovsky M, Mclninch J. Recognition of genes in DNA sequence with ambiguities. Biosysterns,1993,30 (1-3):161-71.
    [58]Xu Y, Uberbacher E C. Computational gene prediction using neural networks and similarity search. Computational Methods in Molecular Biology,1998:109-128.
    [59]Nussinov R. Compositional variations in DNA sequences. Comput Appl Biosci.1991,7 (3):287-93.
    [60]Burge C, Campbell A M, Karlin S. Over-and under-representation of short oligonucleotides in DNA sequences. Proc Natl Acad Sci US,1992,489 (4):1358-62.
    [61]Claverie J M, Sauvaget I, Bougueleret L. K-tuple frequency analysis:from intron/exon discrimination to T-cell epitope mapping. Methods Enzymol,1990,183:237-52.
    [62]White O, Dunning T, Sutton G, Adams M, Venter J C, Fields C. A quality control algorithm for DNA sequencing projects. Nucleic Acids Res,1993,21 (16):382-938.
    [63]Shuanhu Wu, Xudong Xie, Alan Wee-Chung Liew, Hong Yan. Eukaryotic promoter prediction based on relative entropy and positional information. PHYSICAL RE VIE WE,2007,75: 041908-1-041908-7.
    [64]Duret L, Mouchiroud D, Gautier C. Statistical analysis of vertebrate sequences reveals that long genes are scarce in GC-rich isochores. Journal of Molecular Evolution,1995,40 (3):308-17.
    [65]P P Vaidyanathan. Genomics and Proteomics:A Signal Processor's Tour. IEEE CIRCUITS AND SYSTEMS MAGAZINE,2004, Fourth Quarter:6-29.
    [66]S Tiwari, S Ramachandran, A Bhattacharya, S Bhattacharya, R Ramaswamy, Prediction of probable genes by Fourier analysis of genomic sequences. CABIOS,1997,13 (3):263-270.
    [67]D Anastassiou. Genomic signal processing. IEEE Signal Processing Magazine,2001:8-20.
    [68]P P Vaidyanathan, B J Yoon. Gene and exon prediction using allpass-based filters. Workshop on Genomic Sig. Proc. And Stat., Raleigh,2002.
    [69]E Pirogova, Q Fang, M Akay, I Cosic. Investigation of the structural and functional relationships of oncogene proteins. Proc. of the IEEE,2002,90 (12):1859-1867.
    [70]Y Neuvo, C Y Dong, S K Mitra. Interpolated finite impulse response filters. IEEE Trans, on ASSP,1984:563-570.
    [71]Yoon B J, Vaidyanathan P P. Digital filters for gene prediction applications. Proc. of 36th Asilomar Conference on Signals, Systems, and Computers, Monterey, CA, Nov.2002,1:306-310.
    [72]P P Vaidyanathan, B J Yoon. The role of signal processing concepts in genomics and proteomics. Journal of the Franklin Institute,2004,341:111-135.
    [73]Vaidyanathan P P, Yoon B J. The role of signal-processing concepts in genomics and proteomics. Journal of the Franklin Institute (invited paper), Special Issue on Genomics,2004, 341:111-135.
    [74]Bergen S W A, Antoniou A. Application of parametric window functions to the STDFT method for gene prediction.2005 IEEE Pacific Rim Conference on Communications, Computers and signal Processing, Canada:University of Victoria,2005:324-327.
    [75]Ambikairajah E, Epps J, Akhtar M. Gene and exon prediction using time domain algorithms. Proceedings of the Eighth International Symposium on Signal Processing and Its Applications, Sydney, Australia,2005,1:199-202.
    [76]Akhtar M, Epps J, Ambikairajah E. On DNA Numerical Representations for Period-3 Based Exon Prediction. IEEE International Workshop on Genomic Signal Processing and Statistics, Tuusula, Finland,2007:1-4.
    [77]W Li. The study of correlation structures of DNA sequences:A critical review. Computers Chem.1997,21(4):257-271.
    [78]田元新,陈超,邹小勇等.外显子周期三行为特征的研究.化学学报,2005,263(13):1215-1219.
    [79]马宝山,朱义胜.一种用于基因预测的FIR数字滤波器.电子学报,2007,35(9):1710-1713.
    [80]A Krogh, I Saira Mian, D Haussler. A hidden Markov model that finds genes in E. Coli DNA. Nucleic Acids Research,1994,22:4768-4778.
    [81]Franco, G R, Adams MD, Soares M B, Simpson A J, Venter JC, Pena SD. Identification of new Schistosoma mansoni genes by the EST strategy using a directional cDNA library. Gene,1995,152 (2):141-7.
    [82]王立新.模糊系统与模糊控制教程.北京:清华大学出版社,2003.
    [83]Gelfand M S, Mironov A A, Pevzner P A. Gene recognition via spliced sequence alignment. Proc. Natl. Sci. USA,1996,93:9061-9066.
    [84]Jiang J, Jacob H J. EbEST:an automated tool using expressed sequence tags to delineate gene structure. Genome Res,1998,8 (3):268-75.
    [85]Jelinek F. Statistical Methods for Speech Recognition. MIT Press,1998.
    [86]朱红梅,王家廞,赵燕南,杨泽红.延时HMM在基因剪接供体位点识别中的应用.计算机工程,2007,33(5):1-3.
    [87]StevenLS, Arthur L D, Simon K et al. Microbial gene identification using interpolated Markov models. Nucleic Acid Research,1998,26 (2):544-548.
    [88]谢雪英,孙啸,谢建明,陆祖宏.基于内插马尔可夫模型的Gibbs改进算法识别调控元件.中国生物医学工程学报,2006,25(4):396-399.
    [89]Rabiner L R. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recgnition.1989,77 (2):257-285.
    [90]李冬冬,杜耀华,王正志.一种针对基因识别的GHMM简化算法.国防科技大学学报, 2004,26(4):103-106.
    [91]Kulp D, Haussler D, Reese M G, Eeckman F H. A generalized hidden Markov model for the recognition of human genes in DNA. Proceedings of the Fourth International Conference on Intelligent System for Molecular Biology, St. Louis, MO, USA,1996:134-142.
    [92]Quinlan J R. Induction of decision trees. Machine Learning,1986,1:81-106.
    [93]Quinlan J R. C4.5:Programs for Machine Learning. Morgan Kaufman,1993.
    [94]Breiman L, Friedman J, Olshen R, Stone C. Classification and Regression Trees. Wadsworth International Group,1984.
    [95]Kliisgen W. Explora:a multipattern and multistrategy discovery assistant. Advances in Knowledge Discovery and Data Mining,1996:249-271.
    [96]Piatetsky Shapiro G. Discovery, analysis, and presentation of strong rules. Knowledge Discovery in Databases,1991:229-238.
    [97]Weiss S M, Kulikowski C A. Computer Systems that Learn:Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufinann Publishers,1991.
    [98]Huang X, Adams M D, Zhou H, Kerlavage A R. A tool for analyzing and annotating genomic sequences. Genomics,1997,46 (1):37-45.
    [99]Guigo R, Knudsen S, Drake N, Smith T. Prediction of gene structure. Journal of Molecular Biology,1992,226 (1):141-57.
    [100]Lukashin A V, M Borodovsky. GeneIVIark.hmm:new solutions for gene finding. Nucleic Acids Research,1998,26:1107-1115.
    [101]Snyder E E, Stormo G D. Identification of coding regions in genomic DNA sequences:an application of dynamic programming and neural networks. Nucleic Acids Research,1993,21 (3): 607-613.
    [102]Snyder E E, Stormo G D. Identification of Coding Regions in Genomic DNA. Journal of Molecular Biology,1995,248:1-18.
    [103]晏春.基因剪接的信号序列分析和相关特征研究:(博士论文).长沙:国防科学技术大学,2006.
    [104]Uberbacher EC, Mural R J. Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proc Natl Acad Sci USA,1991,88 (24):11261-5.
    [105]Xu Y, Mural R J, Uberbacher E C. Constructing gene models from accurately predicted exons: an application of dynamic programming. Comput Appl Biosci,1994,10 (6):613-23.
    [106]Krogh A. Two methods for improving performance of an HIVIM and their application for gene finding. Proc Int Conf Intell Syst Mol Biol,1997,5:179-86.
    [107]Laub M T, Smith D W. Finding intron/exon splice junctions using INFO, INterruption Finder
    and Organizer. J Comput Biol,1998,5 (2):307-21.
    [108]Salzberg S, Delcher A, Fasman K, Henderson J. A Decision Tree System for Finding Genes in DNA. Journal of Computational Biology,1998:667-680.
    [109]Hutchinson G B, Hayden M R. The prediction of exons through an analysis of spliceable open reading frames. Nucleic Acids Research,1992,20 (13):3453-62.
    [110]Henderson J, Salzberg S, Fasman K H. Finding genes in DNA with a Hidden Markov Model. J Comput Biol,1997,4 (2):127-41.
    [111]Milanesi L, Kolchanov N, Rogozin I, Kel A, Titov I. Sequence functional inference. In Bishop, MJ (ed.) Guide to Human genome computing. Cambridge, UK,1993:249-312.
    [112]Rogozin I B, Kolchanov NA, Milanesi L. A computing system for protein-coding regions prediction in Diptera nucleotide sequences. Drosophila Information Service,1995,76:185-187.
    [113]Rogozin I B, Milanesi L, Kolchanov NA. Gene structure prediction using information on homologous protein sequence. Comput Applic Biosci,1996,12:161-170.
    [114]Milanesi L, Rogozin I B. Prediction of human gene structure. In Guide to Human Genome Computing (2nd ed.), Cambridge,1998:215-259.
    [115]Milanesi L, D'Angelo D, Rogozin I B. GeneBuilder:interactive in silico prediction of gene structure. Bioinformatics,1999,15 (7-8):612-21.
    [116]C Wilson, L Hilyer, P Green. Genefinder Documentation, [Online]. Available: http://weeds.mgh.harvard.edu/doc/genefinder.doc.html,1995.
    [117]L Milanesi, N A Kolchanov, I B Rogozin, I V Ischenko, A E Kel, Y L Orlov, M P Ponomarenko, P Vezzoni. Gen View:A computing tool for protein-coding regions prediction in nucleotide sequences. Proc.2nd Int. Conf. Bioinformatics, Supercomput. Complex Genome Anal, Singapore,1993:573-588.
    [118]S L Salzberg, M Pertea, A L Delcher, M J Gardner, H Tettelin. Interpolated Markov models for eukaryotic gene finding. Genomics,1999,59:24-31.
    [119]I Korf, P Flicek, D Duan, M R Brent. Integrating genomic homology into gene structure prediction. Bioinformatics,2001,17 (1):140-148.
    [120]T Schiex, A Moisan, L Duret, P Rouze. EuGene:An eukaryotic gene finder that combines several sources of evidence. Lecture Notes in Computational Science,2066, New York: Springer-Verlag,2001.
    [121]M S Gelfand, T V Astakhova, M A Roytberg. An algorithm for highly specific recognition of protein-coding regions. Genome Inf.,1996,7:82-87.
    [122]Y Xu, R J Mural, E C Uberbaker. Constructing gene models from accurately predicted exons: An application of dynamic programming. Comput. Appl. Biosci.,1994,10:613-623.
    [123]I B Rogozin, L Milanesi. Analysis of donor splice signals in different organisms. J. Mol. Evol., 1997,45:50-59.
    [124]J Kleffe, K Hermann, W Vahrson, B Witting, V Brendel. Logitlinear models for the prediction of splice sites in plant pre-mRNAsequences. Nucleic Acids Res.,1996,24 (23):4709-4718.
    [125]J S Chuang, D Roth. Gene recognition based on DAG shortest paths. Bioinformatics,2001,1: 1-9.
    [126]E C Uberbacher, Y Xu, R J Mural. Discovering and understanding genes in human DNA sequence using GRAIL. Methods Enzymol,1996,266:259-281.
    [127]Vladimir B. Bajic, Seng Hong Seah. Dragon Gene Start Finder identifies approximate locations of the 50 ends of genes. Nucleic Acids Research,2003,31 (13):3560-3563.
    [128]A E Kel, Reuter E Cheremushkin O V Kel-Margoulis, E Wingender. MATCHTM:a tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Research,2003,31 (13):3576-3579.
    [129]Matthias Scherf, Andreas Klingenhoff, Thomas Werner. Highly Specific Localization of Promoter Regions in Large Genomic Sequences by PromoterInspector:A Novel Context Analysis Approach. J. Mol. Biol,2000,297:599-606.
    [130]Olga V Kel-Margoulis, Alexander E Kel, Ingmar Reuter, Igoe V Deineko, Edgar wingender. TRANSCompel:a database on composite regulatory elements in eukaryotic genes. Nucleic Acids Research,2002,30 (1):332-334.
    [131]V Matys, E Fricke, R Geffers et al. TRANSFAC1:transcriptional regulation, from patterns to profiles. Nucleic Acids Research,2003,31 (1):374-378.
    [132]Christoph D Schmid, Viviane Praz, Mauro Delorenzi, Rouaida Perier, Philipp Bucher. The Eukaryotic Promoter Database EPD:the impact of in silico primer extension. Nucleic Acids Research,2004,32:D82-D85.
    [133]Burset M, Guigo R. Evaluation of gene structure prediction programs. Genomics,1996,34 (3):353-67.
    [134]Burset M, Guigo R. Evaluation of gene structure prediction programs. Genomics,1996,34 (3):353-67.
    [135]http://genomic.sanger.ac.uk/gf/gf.html.
    [136]Friedland P, Kedes L H. Discovering the secrets of DNA. Communications of the ACM,1985, 28 (11):1164-1186.
    [137]Milaela Pertea. Gene finding in eukaryotes:[dissertation]. Baltimore, Maryland:The Johns Hopkins University,2001.
    [138]Staden R. A computer program to search for tRNA genes. Nucleic Acids Research,1980,8 (4): 817-825.
    [139]Shulman M J, Steinberg C M, Westmoreland N. The coding function of nucleotide sequences can be discerned by statistical analysis. Journal of Theoretical Biology,1981,88:409-420.
    [140]Birney E, Durbin R. Using GeneWise in the Drosophila annotation experiment. Genome Res., 2000,10 (4):547-548.
    [141]Gelfand M, Mironov A A, Pevzner P A. Gene recognition via spliced sequence alignment. Proc. Natl Acad. Sci. USA,1996,93 (17):9061-9066.
    [142]CLAVERIE J M, SAUVAGET I, BOUGUELERET L. K-tuple frequency analysis:from intron/exon discrimination to t-cell epitope mapping. Method in Enzymology,1990,183:237-252.
    [143]闻芳,卢欣,孙之荣等.基于支持向量机(SVM)的剪接位点识别.生物物理学报,1999,15(4):733-738.
    [144]Quanwei Zhang, Qinke Peng, Qi Zhang, Yanhua Yan, Kankan Li, Jing Li. Splice sites prediction of Human genome using length-variable Markov model and feature selection. Expert Systems with Applications,2010,37 (4):2771-2782.
    [145]TAKAGI T, SUGENO M. Fuzzy identification of systems and its application to modeling and control. IEEE Trans on systems, Manand Cybernetics,1985,15 (1):116-132.
    [146]曾凡锋,蔡自兴,马润津.基于模糊似然函数的模糊辨识方法.控制与决策,1998,13(5):581-584.
    [147]陈建勤,席裕庚,张钟俊.用模糊模型在线辨识非线性系统.自动化学报,1998,24(1):90-94.
    [148]王守唐,高东杰.基于T-S模糊模型的辨识算法.控制与决策,2001,16(5):630-632.
    [149]Chen Weixu, Yong Zailu. Fuzzy Model Identification and Self-learning for Dynamic Sys tems. IEEE Tram on System, Man and Cybernetics,1987,17 (4):683-689.
    [150]Liang Wang. Complex Systems Modeling via Fuzzy Logic. IEEE Trans on System, Man and Cybernetics,1996,26 (1):100-106.
    [151]郭烁,李平.模糊聚类与最小二乘相结合建立非线性系统模型.模式识别与人工智能,2003,16(3):288-291.
    [152]尚修刚,蒋慰孙.一种新的模糊似然函数.模式识别与人工智能,1997,10(1):9-14.
    [153]邵青,冯汝鹏.非线性系统模糊辨识的新方法.控制与决策.2001,16(1):83-85.
    [154]Universita del Sannio. Homo Sapiens Splice Sites Dataset. http://www.sci.unisannio.it/docenti/rampone,2003-06-16.
    [155]A consortium of the Drosophila Genome Center, The Berkeley Drosophila Genome Project (BDGP), http://www.fruitfly.org/seq_tools/datasets/Human,2006-03-29.
    [156]T Shashi Rekha, Chanchal K Mitra. Comparative Analysis of Splice Site Regions by Information Content. Genomics, Proteomics & Bioinformatics,2006,4 (4):230-237.
    [157]孙啸,陆祖宏,谢建明.生物信息学概论,清华大学出版社,2004:117-176.
    [158]周艳红,王卉,杨雷.基于特征挖掘与融合的剪接位点识别.华中科技大学学报:自然科 学版,2006,34(12):117-120.
    [159]YIN M, WANG J. GeneScout:a data mining system for predicting vertebrate genes in genomic DNA sequences, Information Sciences:an International Journal,2004,163 (1):201-218.
    [160]孙宗晓,桑凌洁,居理宁,朱怀球.基于剪接信号和调节元件序列特征的剪接位点预测方法.科学通报,2008,53(19):2298-2306.
    [161]DAVID R, B. STOCKWELL. LBS:Bayesian learning system for rapid expert system development. Expert Systems With Applicatitms,1993,6:137-147.
    [162]李百策,苑森淼,王利民.贝叶斯网络的简约模式表达.仪器仪表学报,2005,26(10):1701-3701.
    [163]CLAVERIE J M, SAUVAGET I, BOUGUELERET L. K-tuple frequency analysis:from intron/exon discrimination to t-cell epitope mapping. Method in Enzymology,1990,183:237-252.
    [164]MARASHI S A, GOODARZI H, SADEGHI M, et al. Importance of RNA secondary structure information for yeast donor and acceptor splice site predictions by neural networks. Computational Biology and Chemistry,2006,30 (1):50-57.
    [165]Jones D, Watkins C. Comparing kernels using synthetic DNA and genomic data. Technical report, Department of Computer Science, University of London, UK,2000.
    [166]Y Zhang, C H Chu, Y Chen, H Zha, X Ji. Splice site prediction using support vector machines with a Bayes kernel. Expert Systems with Application,2006,30:73-81.
    [167](英)John Shawe Taylor,(美)Nello Cristianini著,赵玲玲,翁苏明,曾华军等译.模式分析的核方法.北京:机械工业出版社,2006:21-51.
    [168]Mercer J. Functions of positive and negative type, and their connection with the theory of integral equations. Transactions of the London Philosophical Society(A),1909,209:415-446.
    [169]Courant R, D Hilbert. Methods of Mathematical Physics, New York:Wiley Interscience,1970.
    [170]刘惠.蛋白质序列数据的分类预测研究:(博士论文).上海:上海交通大学,2007.
    [171](美)Richard O Duda, Peter E Hart, David G Stork著,李宏东,姚天翔等译.模式分类,北京:机械工业出版社,中信出版社,2003.
    [172]Hatzigeorgiou A, Mache N, Reczko M. Functional Site Prediction on the DNA sequence by Artificial Neural Networks. Proceedings of the 1996 IEEE International Joint Symposia on Intelligence and Systems, Rockville, Maryland:IEEE Computer Society IEEE Press,1996,7 (96):12-16.
    [173]Cai Y D, Bork P. Homology-Based Gene Prediction Using Neural Nets. Analytical Biochemistry,1998,265 (2):269-274.
    [174]Brona B, Daniel G B, Tomas V. The most probable annotation problem in HMMs and its application to bioinformatics. Journal of Computer and System Sciences,2007,73 (7):1060-1077.
    [175]Tsonis A A, Elsner J B, Tsonis P A. Periodicity in DNA coding sequences:Implications in Gene Evolution. Theor Biol,1991,151 (3):323-331.
    [176]Eftestol T, Ryen T. Eukaryotic gene prediction by spectral analysis and pattern recognition techniques. Signal processing symposium, NORSIG 2006 Proceedings of the 7th Nordic. Reykjavik, Iceland, Askja:University of Iceland,2007:146-149.
    [177]Akhtar M, Ambikairajah E, Epps J. Detection of period-3 behavior in genomic sequences using singular value decomposition. Proceeding of the IEEE Symposium on Emerging Technologies, IEEE Press,2005,13-17.
    [178]Vapnik V N. The Nature of Statistical Learning Theory. Springer, New York,1995.
    [179]Vapnik V N. Statistical learning theory. Wiley Interscience, New York,1998.
    [180]Cristianini N, Taylor J S. An introduction to support vector machines and other kernel-based learning methods. Cambridge:Cambridge University Press,2000.
    [181]Cortes C, Vapnik V N. Support vector networks. Mach. Learning,1995,20:273-293.
    [182]Drucker H, Wu D, Vapnik V N. Support vector machines for spam categorization. IEEE Trans. Neural Networks,1999,10:1054-1084.
    [183]Sun Y F, Fan X D, Li Y D. Identifying splicing sites in eukaryotic RNA:Support vector machine approach. Computers in Biology and Medicine,2003,33:17-29.
    [184]Sohn I, Shim J, Hwang C, Kim S, Lee I W. Informative transcription factor selection using support vector machine-based generalized approximate cross validation criteria. Computational Statistics and Data Analysis,2009,53:1727-1735.
    [185]Emmersen J, Rudd S. Separation of sequences from host-pathogen interface using triplet nucleotide frequencies. Fungal Genetics and Biology,2007,44 (27):231-241.
    [186]Bronislava Brejova. Evidence Combination in Hidden Markov Models for Gene Prediction:[doctoral dissertation]. Waterloo, Ontario, Canada:University of Waterloo,2005.
    [187]Boser B, Guyon I, Vapnik V N. A training algorithm for optimal margin classifiers. Fifth Annual Workshop on Computational Learning Theory, San Mateo, CA:Morgan Kaufmann,1992:144-152.
    [188]Kun Nan Tsai, Shu Hung Lin, Shin Ru Shih, Jhih Siang Lai, Chung Ming Chen. Genomic splice site prediction algorithm based on nucleotide sequence pattern for RNA viruses. Computational Biology and Chemistry,2009,33 (2):171-175.
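    [189]Hu Guangshu. A Course in Modern Signal Processing. Beijing:Tsinghua University Press,2005. (in Chinese)
    [190]Hu Guangshu. Digital Signal Processing:Theory, Algorithms and Implementation (2nd ed.). Beijing:Tsinghua University Press,2003. (in Chinese)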
    [191]Gabor D. Theory of communication. J IEE,1946,93:429-457.
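    [192]Huangfu Kan, Chen Jianwen, Lou Shengqiang. Modern Digital Signal Processing. Beijing:Publishing House of Electronics Industry,1988. (in Chinese)
    [193]Wei Song, Li Qi, Zhao Rencai. A speech signal analysis algorithm based on the short-time Fourier transform. Electronic Measurement Technology,2006,29 (1):16-17. (in Chinese)
    [194]Zhang Ming, Lu Jukang. A gene-finding model and algorithm based on hidden Markov chains. Computer Engineering,2003,29 (7):122-123. (in Chinese)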
    [195]Nini Rao, Xu Lei, Jianxiu Guo, Hao Huang, Zhenglong Ren. An efficient sliding window strategy for accurate location of eukaryotic protein coding regions. Computers in Biology and Medicine,2009,39 (4):392-395.
    [196]Fickett J W, Tung C S. Assessment of protein coding measures. Nucleic Acids Res,1992,20:6441-6450.
    [197]Yin C, Yau S S T. Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence. Journal of Theoretical Biology,2007,247:687-694.
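    [198]Wu Xin, Luo Jingchu, Li Wuju. Computational identification of gene regulatory elements and construction of gene regulatory networks. Beijing:2003 China Association for Science and Technology Young Scientists Forum on "Bioinformatics and Evolutionary Computation",2003,92-108. (in Chinese)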
    [199]Michael Towsey, Peter Timms, James Hogan, Sarah A Mathews. The cross-species prediction of bacterial promoters using a support vector machine. Computational Biology and Chemistry,2008, 32 (5):359-366.
    [200]Werner T. Identification and functional modelling of DNA sequence elements of transcription. Brief Bioinform,2000,1 (4):372-380.
    [201]Stormo G D. DNA binding sites:representation and discovery. Bioinformatics,2000,16 (1):16-23.
    [202]Munch R, Hiller K, Barg H, Heldt D, Linz S, Wingender E, Jahn D. PRODORIC:prokaryotic database of gene regulation. Nucleic Acids Res.,2003,31 (1):266-269.
    [203]Salgado H, Gama Castro S, Martinez Antonio et al. RegulonDB (version 4.0):transcriptional regulation, operon organization and growth conditions in Escherichia coli K-12. Nucleic Acids Res., 2004,32 (Database issue):D303-D306.
    [204]Wingender E. TRANSFAC, TRANSPATH and CYTOMER as starting points for an ontology of regulatory networks. In Silico Biol.,2004,4 (1):55-61.
    [205]Wingender E, Chen X, Hehl R et al. TRANSFAC:an integrated system for gene expression regulation. Nucleic Acids Res.,2000,28 (1):316-319.
    [206]Zhao F, Xuan Z, Liu L, Zhang M Q. TRED:a Transcriptional Regulatory Element Database and a platform for in silico gene regulation studies. Nucleic Acids Res.,2005,33 (Database issue):D103-D107.
    [207]Sandelin A, Wasserman W W, Lenhard B. ConSite:web-based prediction of regulatory elements using cross-species comparison. Nucleic Acids Res.,2004,32 (Web Server issue): W249-W252.
    [208]WeiChun Huang. Computational methods for identifying and characterizing the Human Gene regulatory regions and CIS-elements:[doctoral dissertation]. USA:North Carolina State University,2005.
    [209]Pennacchio L A, Rubin E M. Genomic strategies to identify mammalian regulatory sequences. Nat. Rev. Genet.,2001,2 (2):100-110.
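    [210]Wang Jianxin, Yang De, Huang Yuannan. Comparison and analysis of algorithms for finding weak-signal motifs in DNA sequences. Computer Science,2008,35 (8):188-194. (in Chinese)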
    [211]Schneider T D. Consensus sequence Zen. Appl. Bioinformatics,2002,1 (3):111-119.
    [212]Staden R. Computer methods to locate signals in nucleic acid sequences. Nucl. Acids Res., 1984,12(1):505-519.
    [213]Stormo G D, Fields D S. Specificity, free energy and information content in protein-DNA interactions. Trends Biochem. Sci.,1998,23 (3):109-113.
    [214]Hutchinson G B. The prediction of vertebrate promoter regions using differential hexamer frequency analysis. Comp Appl Biosci,1996,12:391-398.
    [215]Chen QK, Hertz GZ, Stormo GD. PromFD 1.0:a computer program that predicts eukaryotic pol Ⅱ promoters using strings and IMD matrices. Comp Appl Biosci,1997,13:29-35.
    [216]Hannenhalli S, Levy S. Promoter prediction in the human genome. Bioinformatics,2001,17 (Suppl 1):90-96.
    [217]Davuluri R V, Grosse I, Zhang M Q. Computational identification of promoters and first exons in the human genome. Nat Genet,2001,29:412-417.
    [218]Scherf M, Klingenhoff A, Werner T. Highly specific localization of promoter regions in large genomic sequences by Promoter Inspector:a novel context analysis approach. J Mol Biol,2000, 297:599-606.
    [219]Quandt K, Frech K, Karas H, Wingender E, Werner T. MatInd and MatInspector:new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucl. Acids Res.,1995,23 (23):4878-4884.
    [220]Kel A E, Gossling E, Reuter I, Cheremushkin E et al. MATCH:A tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res.,2003,31 (13):3576-3579.
    [221]Agarwal P, Bafna V. Detecting non-adjoining correlations with signals in DNA. In RECOMB '98:Proceedings of the second annual international conference on Computational molecular biology, USA:New York, NY, ACM Press,1998:2-8.
    [222]Benos P V, Lapedes A S, Fields D S, Stormo G D. SAMIE:statistical algorithm for modeling interaction energies. Pac. Symp. Biocomput.,2001:115-126.
    [223]Bulyk M L, Johnson P L F, Church G M. Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. Nucl. Acids Res.,2002,30 (5):1255-1261.
    [224]Fickett J W, Hatzigeorgiou A G. Eukaryotic promoter recognition-review. Genome Res,1997, 7:861-878.
    [225]Werner T. The state of the art of mammalian promoter recognition. Briefings Bioinformat,2003,4:22-30.
    [226]Bajic V B, Seah S H, Chong A et al. Computer model for recognition of functional transcription start sites in RNA Polymerase II promoters of vertebrates. J Mol Graph Model,2003,21:323-332.
    [227]Werner T. Models for prediction and recognition of eukaryotic promoters. Mammalian Genome,1999,10:168-175.
    [228]Pedersen A G, Baldi P, Chauvin Y, Brunak S. The biology of eukaryotic promoter prediction-a review. Comput Chem,1999,23:191-207.
    [229]Zhang M Q. Computational methods for promoter recognition, computational molecular biology, Cambridge, Massachusetts:MIT Press,2002:249-268.
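    [230]Turner P C, McLennan A G, Bates A D et al. Molecular Biology. Translated by Liu Jinyuan, Li Wenjun, Wang Xuelin et al. Beijing:Science Press,2000. (in Chinese)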
    [231]Bajic V B, Choudhary V, Hock C K. Content analysis of the core promoter region of human genes. In Silico Biol,2003,4:0011.
    [232]Smale S T, Kadonaga J T. The RNA polymerase II core promoter. Ann Rev Biochem,2003, 72:449-479.
    [233]Suzuki Y, Tsunoda T, Sese J, Taira H et al. Identification and characterization of the potential promoter regions of 1031 kinds of human genes. Genome Res,2001,11:677-684.
    [234]Helden J, Andre B, Collado Vides J. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol,1998, 281:827-842.
    [235]Vipin Narang, Wing Kin Sung, Ankush Mittal. Computational modeling of oligonucleotide positional densities for human promoter prediction. Artificial intelligence in medicine,2005, 35:107-119.
    [236]Han Jiqing, Zhang Lei, Zheng Tieran. Speech Signal Processing. Beijing:Tsinghua University Press,2004. (in Chinese)
    [237]Carlin B P, Louis T A. Bayes and empirical bayes methods for data analysis. Florida: Chapman and Hall,2000.
    [238]Zhang H, Sheng S. Learning Weighted Naive Bayes with Accurate Ranking. Fourth IEEE International Conference on Data Mining (ICDM'04). UK:Brighton,2004:567-570.
    [239]Wan V, Campbell W M. Support vector machines for speaker verification and identification. Proc ICASSP,2002,1:669-672.
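    [240]Huang Wei, Dai Beibei. Speaker verification based on GMM statistical feature parameters and SVM. Journal of Data Acquisition and Processing,2004,19 (4):365-370. (in Chinese)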
    [241]Suykens J A K, Vandewalle J. Least squares support vector machine classifiers. Neural Processing Letters,1999,9 (3):293-300.
    [242]Suykens J A K, Lukas L, Vandewalle J. Sparse approximation using least squares support vector machines. Proc of IEEE International Symposium on Circuits and Systems. Switzerland:Geneva, IEEE,2000:757-760.