Research on Human RNA Polymerase Ⅱ Promoter Recognition

English title: Research on Human POL Ⅱ Promoter Recognizing
Author: Zhi Hui (智慧)
Degree level: Master's
Discipline: Analytical Chemistry
Keywords: human RNA POL Ⅱ promoter recognition; support vector machine; consensus model; dual-layer SVM (Dual-SVM); biostatistics
Degree year: 2008
Advisor: Li Tonghua (李通化)
Discipline code: 070302
Degree-granting institution: Tongji University
Submission date: 2008-03-01

Abstract

Promoter recognition is an important part of gene recognition. Knowledge of promoter regions not only supports laboratory analysis and research, but also helps in understanding whole-genome function, the mechanisms that regulate gene expression, and the relationship between human diseases and promoter polymorphisms or mutations.
     This thesis aims to classify human RNA polymerase (POL) Ⅱ promoter data and to improve recognition accuracy. We applied novel encoding methods to human promoter sequences, built suitable consensus models, and classified the promoter data with support vector machines (SVMs), which improved the accuracy of promoter recognition.
     First, we obtained the DNA promoter and non-promoter sequences used for classification from the Eukaryotic Promoter Database (EPD) and from non-promoter databases. The positive and negative data sets were each split into 5 parts and into 10 parts, for 5-fold and 10-fold cross-validation respectively. In addition, we obtained experimentally determined human chromosome promoter data from the DataBase of Transcriptional Start Sites (DBTSS) for use in the later stages of the study.
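The fold construction described here can be sketched as follows. This is a minimal illustration, not the thesis code: the function names `split_folds` and `cross_validation_rounds`, the shuffling seed, and the round-robin dealing of sequences into folds are all assumptions. The only point taken from the text is that positives and negatives are split separately into k parts.

```python
import random

def split_folds(items, k, seed=0):
    """Shuffle items and deal them into k folds of near-equal size."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def cross_validation_rounds(positives, negatives, k, seed=0):
    """Yield (train, test) lists of (sequence, label) pairs, one per fold.

    Positives (label 1) and negatives (label 0) are split separately,
    so every fold keeps roughly the class ratio of the full data set.
    """
    pos_folds = split_folds(positives, k, seed)
    neg_folds = split_folds(negatives, k, seed + 1)
    for i in range(k):
        test = [(s, 1) for s in pos_folds[i]] + [(s, 0) for s in neg_folds[i]]
        train = [(s, 1) for j in range(k) if j != i for s in pos_folds[j]]
        train += [(s, 0) for j in range(k) if j != i for s in neg_folds[j]]
        yield train, test
```

Calling `cross_validation_rounds(promoters, non_promoters, 10)` would give the ten train/test rounds of a 10-fold run; passing `5` gives the 5-fold variant.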
     Second, after pre-processing the data (including ensuring that the data were non-redundant), we encoded the base sequences and selected suitable parameters and encoding methods. This is the focus and the main difficulty of the study. According to the encoding method used, the work was divided into three steps.
     In step one, we applied a knowledge-based statistical encoding method and extended it into six sub-encodings: single-base statistical features, adjacent dual-base statistical features, dual-base statistical features with a gap of one, two, or three positions, and adjacent triple-base statistical features. With these encodings, SVM-based promoter recognition reached an accuracy of 89.68% in 10-fold cross-validation, with sensitivity between 86.24% and 90.11% and specificity between 85.91% and 98.35%, an improvement of about 5% over other SVM-based promoter recognition tools.
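The six sub-encodings differ only in which positions of a sliding window are read, which the sketch below makes explicit. The abstract does not say which statistic is computed for each base pattern, so this example assumes plain relative frequencies over the sequence; the names `SUB_ENCODINGS`, `pattern_freqs`, and `encode` are hypothetical.

```python
from itertools import product

BASES = "ACGT"

# Relative positions read by each sub-encoding (assumed naming).
SUB_ENCODINGS = {
    "single": (0,),
    "pair_adjacent": (0, 1),
    "pair_gap1": (0, 2),       # one base skipped between the two bases
    "pair_gap2": (0, 3),
    "pair_gap3": (0, 4),
    "triple_adjacent": (0, 1, 2),
}

def pattern_freqs(seq, offsets):
    """Relative frequency of every base pattern read at the given offsets."""
    keys = ["".join(p) for p in product(BASES, repeat=len(offsets))]
    counts = dict.fromkeys(keys, 0)
    total = 0
    for start in range(len(seq) - offsets[-1]):
        word = "".join(seq[start + o] for o in offsets)
        if word in counts:          # windows containing N etc. are skipped
            counts[word] += 1
            total += 1
    return [counts[k] / total if total else 0.0 for k in keys]

def encode(seq):
    """Return one feature vector per sub-encoding for a DNA sequence."""
    seq = seq.upper()
    return {name: pattern_freqs(seq, offs) for name, offs in SUB_ENCODINGS.items()}
```

Each sequence then yields vectors of length 4, 16, 16, 16, 16, and 64, which can be fed to a classifier separately or concatenated.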
     In step two, we applied CpG encoding and pentamer encoding to encode human RNA POL Ⅱ promoter sequences from different perspectives, extract variable information, and identify the encoding methods that gave the best prediction results and combined most sensibly, for use in the subsequent work.
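As a rough illustration of these two feature families, the sketch below computes GC content with the observed/expected CpG ratio, and a 1024-dimensional pentamer count vector. The abstract does not define the CpG encoding, so using the standard CpG-island statistics here is an assumption, as are the function names `cpg_features` and `pentamer_counts`.

```python
from itertools import product

def cpg_features(seq):
    """Return (GC content, observed/expected CpG ratio) for a sequence.

    Observed/expected is computed as (#CpG * N) / (#C * #G), the usual
    CpG-island statistic; it is set to 0.0 when C or G is absent.
    """
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = sum(1 for i in range(n - 1) if seq[i:i + 2] == "CG")
    gc_content = (c + g) / n if n else 0.0
    obs_exp = cpg * n / (c * g) if c and g else 0.0
    return gc_content, obs_exp

PENTAMERS = ["".join(p) for p in product("ACGT", repeat=5)]
PENTAMER_INDEX = {word: i for i, word in enumerate(PENTAMERS)}

def pentamer_counts(seq):
    """Count every overlapping 5-mer; windows with non-ACGT symbols are skipped."""
    seq = seq.upper()
    vec = [0] * len(PENTAMERS)
    for i in range(len(seq) - 4):
        idx = PENTAMER_INDEX.get(seq[i:i + 5])
        if idx is not None:
            vec[idx] += 1
    return vec
```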
     In step three, we also tried a new encoding method, the Pattern Dictionary method developed in our laboratory. To suit the characteristics of promoter data, we combined the four bases A, T, C, and G pairwise, expanding the alphabet to sixteen characters and thereby increasing the number of feature variables.
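The alphabet expansion can be pictured as recoding each pair of bases as one of sixteen symbols. The sketch below covers only that recoding step, not the Pattern Dictionary method itself. The letters `a` to `p` are hypothetical codes, and whether pairs are read with overlap is not stated in the abstract, so both options are offered.

```python
from itertools import product

PAIRS = ["".join(p) for p in product("ATCG", repeat=2)]   # AA, AT, AC, ... GG
SYMBOLS = "abcdefghijklmnop"                               # hypothetical codes
PAIR_TO_SYMBOL = dict(zip(PAIRS, SYMBOLS))

def recode_pairs(seq, overlapping=True):
    """Rewrite a DNA sequence over the 16-symbol pair alphabet.

    With overlapping=True every adjacent pair is read (window step 1);
    otherwise the sequence is cut into disjoint pairs (step 2) and a
    trailing single base is dropped. Pairs with non-ATCG symbols map to '?'.
    """
    seq = seq.upper()
    step = 1 if overlapping else 2
    return "".join(
        PAIR_TO_SYMBOL.get(seq[i:i + 2], "?")
        for i in range(0, len(seq) - 1, step)
    )
```

For instance, `recode_pairs("ATCG")` reads the pairs AT, TC, CG, while `recode_pairs("ATCG", overlapping=False)` reads only AT and CG.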
     Third, based on the recognition results of these encoding methods, we built consensus models whose member sub-models differ in encoding method, sample selection, kernel function, and so on, and performed recognition with a dual-layer SVM. Because the consensus models take into account the independence of the sub-models and the differences between them, they exploit the complementary strengths of the sub-models and so improve the final recognition accuracy.
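The data flow of such a two-layer consensus model can be sketched as below: one first-layer classifier per encoding, with a second-layer classifier trained on their decision values. To keep the example free of external libraries, `CentroidClassifier` is a simple stand-in, not an SVM; in the thesis both layers are SVMs. The class names and the `fit` / `decision_function` / `predict` interface are assumptions.

```python
class CentroidClassifier:
    """Stand-in for an SVM: scores a sample by its distance to class centroids."""

    def fit(self, X, y):
        self.centroids = {}
        for label in (0, 1):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def decision_function(self, X):
        def dist2(x, c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        # Positive score: sample is closer to the class-1 centroid.
        return [dist2(x, self.centroids[0]) - dist2(x, self.centroids[1]) for x in X]

    def predict(self, X):
        return [1 if s > 0 else 0 for s in self.decision_function(X)]


class TwoLayerConsensus:
    """First layer: one member per view (encoding). Second layer: a combiner."""

    def __init__(self, n_views, make_classifier=CentroidClassifier):
        self.members = [make_classifier() for _ in range(n_views)]
        self.combiner = make_classifier()

    def _member_scores(self, views):
        # views[k] is the feature matrix produced by encoding k.
        columns = [m.decision_function(X) for m, X in zip(self.members, views)]
        return [list(row) for row in zip(*columns)]

    def fit(self, views, y):
        for member, X in zip(self.members, views):
            member.fit(X, y)
        # Simplification: the combiner sees in-sample member scores.
        self.combiner.fit(self._member_scores(views), y)
        return self

    def predict(self, views):
        return self.combiner.predict(self._member_scores(views))
```

In practice the second layer is usually trained on out-of-fold scores from the first layer, so that the combiner does not learn from scores the members produced on their own training data; the sketch skips this for brevity.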
     Finally, we applied the best recognition models and the consensus-model approach to promoter data from human chromosome 22, where the recognition accuracy reached 90.98%.

