miRNA Prediction and miRNA Target Gene Prediction Based on Support Vector Machines
Abstract
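miRNAs are a class of non-coding RNAs about 20 to 24 nucleotides long, and they have been a research focus in bioinformatics ever since their discovery. miRNAs regulate genes by cleaving their target mRNAs or repressing their translation. Research on miRNAs helps clarify the regulatory networks among genes and is also of great significance for the study of gene function and the exploration of biological evolution. This paper studies two prominent problems in the miRNA field: miRNA prediction and miRNA target gene prediction. Both studies are based on the support vector machine method, and each designs an effective feature extraction scheme to classify candidate gene sequences.
     For each of the two problems, this paper proposes a new, improved algorithm and obtains good results. For miRNA prediction, it proposes a method named PMirP, which predicts miRNAs by distinguishing real pre-miRNAs from pseudo pre-miRNAs. PMirP combines the structure and sequence of the pre-miRNA stem for feature extraction and, for the first time, adds the two-free-nucleotide property as an important feature for miRNA prediction. Experimental results show that the method greatly improves the accuracy, sensitivity, and specificity of miRNA prediction. For miRNA target gene prediction, the paper takes into account the compensatory effect of the non-seed region on the seed region and divides the miRNA:mRNA duplex into a seed region and a non-seed region for feature extraction. This effectively improves the precision of miRNA target gene prediction, and the overall prediction performance is good.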
Bioinformatics is a hybrid discipline that draws on traditional subjects such as biology, computer science, and applied mathematics; it emerged with the launch of the Human Genome Project at the end of the 1980s. In bioinformatics, researchers try to retrieve critical biological information by analyzing nucleotide and amino acid sequences, so as to provide evidence for explaining the origin of life and to better understand the evolution and development of life. Achieving this requires a variety of techniques, such as data storage, data indexing, pattern classification, and data mining. Recently, researchers have recognized that not only genetic information itself but also the regulation of its expression is very important for organisms.
     RNA, one of the most important genetic materials, had long been considered merely an intermediate in the transfer of information from DNA to protein. Recently, small RNA molecules have received substantial attention from researchers, who have found that these molecules take part in controlling cell function and thereby regulate gene expression. MicroRNAs (miRNAs) are one kind of small RNA, found widely in animals and plants, that cleave target mRNAs or suppress their translation by binding to them. Further studies show that about one third of human genes are regulated by miRNAs, making miRNA molecules one of the core components of gene regulatory networks. miRNAs also play essential roles in many biological processes, including development, cell proliferation, cell death, fat metabolism, and cell differentiation. Moreover, increasing evidence has demonstrated that miRNAs are closely linked to the formation of cancer. Research on miRNAs will therefore help in understanding how gene expression regulatory networks are constructed and, in turn, gene function. Studies of gene function, for their part, may have a strong impact on human disease control and on the study of biological evolution.
     In this paper, we focus on two active research problems related to miRNAs, namely miRNA prediction and miRNA target prediction. Our goal is to tackle these two tasks with machine learning techniques from a bioinformatics perspective. Specifically, we use the well-known support vector machine approach and design effective feature extraction methods to improve prediction accuracy on both tasks. We believe that our research will help find more new miRNAs and their targets, and will help provide an accurate and reliable data source for the study of miRNA functions and mechanisms.
     A large number of miRNAs have already been identified through various methods, but in theory many more remain to be identified. To understand miRNAs more fully, it is important to keep searching for new ones. Mature miRNAs consist of about 20 to 24 nucleotides and are processed from pre-miRNAs, which have a characteristic stem-loop hairpin structure. During biogenesis, the hairpin structure of the pre-miRNA is essential for miRNA formation, so miRNAs can be predicted by distinguishing true pre-miRNAs from pseudo ones. According to current studies, however, a large number of similar hairpins can be folded from many genomes, which makes pre-miRNAs harder to identify. To date, machine learning methods have been widely used for pre-miRNA prediction, and feature extraction is a key step. Because of inadequate feature extraction, however, existing experiments yield low precision and limited recall. In this paper, building on current feature extraction methods, we develop an improved method for pre-miRNA prediction using a support vector machine (SVM). The method, named PMirP, uses a set of hybrid features, including structure-sequence characteristics extracted from the pre-miRNA stem, the free energy, and the number of paired nucleotides in the stem. We are also the first to propose an important feature for miRNA prediction: the two free nucleotides in the miRNA:miRNA* double-helix structure. The optimal parameters for the RBF kernel were selected on the training set; these parameters, together with the training set, were then used to build the SVM classifiers, which were applied to several separate test sets for evaluation. The experimental results show that PMirP not only identifies true human pre-miRNAs effectively but also predicts pre-miRNAs of other species with high accuracy. Compared with existing methods, PMirP significantly improves sensitivity and specificity, and is therefore effective for miRNA prediction. Lastly, we have made our method available as a web service for scientific research purposes.
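     To illustrate what a structure-sequence feature can look like, the following minimal sketch encodes a hairpin as 32 triplet frequencies: each position contributes its nucleotide together with the paired or unpaired state of itself and its two neighbours. This local structure-sequence encoding is a common choice in SVM-based pre-miRNA classifiers; that PMirP's stem features resemble it is an assumption, and the name `triplet_features` and the toy hairpin are hypothetical. The free-energy, paired-nucleotide, and two-free-nucleotide features, as well as the SVM training itself, are not shown.

```python
from itertools import product

NUCLEOTIDES = "ACGU"
# All 8 paired/unpaired patterns over a 3-position window: "(((", "((.", ...
PATTERNS = ["".join(p) for p in product("(.", repeat=3)]
# 32 feature names such as "A(((" or "U..."
FEATURE_NAMES = [n + p for n in NUCLEOTIDES for p in PATTERNS]


def triplet_features(sequence, structure):
    """Return normalized triplet frequencies for an RNA hairpin.

    sequence  -- RNA string over A, C, G, U
    structure -- dot-bracket string of the same length (assumed to come
                 from an external folding tool; it is not computed here)
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    # Treat both bracket directions as "paired".
    paired = structure.replace(")", "(")
    counts = dict.fromkeys(FEATURE_NAMES, 0)
    total = 0
    for i in range(1, len(sequence) - 1):
        key = sequence[i].upper() + paired[i - 1:i + 2]
        if key in counts:  # skip windows containing unexpected symbols
            counts[key] += 1
            total += 1
    if total == 0:
        return [0.0] * len(FEATURE_NAMES)
    return [counts[name] / total for name in FEATURE_NAMES]


# Toy hairpin, for illustration only (not a real pre-miRNA).
seq = "GGGAAAUCCC"
dot = "(((....)))"
vector = triplet_features(seq, dot)
```

     A vector of this kind would be one row of the training matrix passed to an SVM; the other features named in the abstract would be appended as extra columns.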
     miRNAs regulate genes by binding to complementary sites in the 3' untranslated region of the target gene, so identifying miRNA targets is the basis for research on miRNA function. miRNA molecules are generally very short, so in practice many genes across the whole genome may be complementary to them. In addition, only a small number of miRNAs have been confirmed, which makes it hard to find target mRNAs. Recently, bioinformatics has come to play a dominant role in miRNA target prediction, and several methods have emerged. These methods, however, rely mainly on sequence complementarity in the seed region and remain limited in their ability to reveal actual target genes. Introducing machine learning techniques into miRNA target prediction has proved successful and has improved prediction accuracy. Combining machine learning with newly identified biological characteristics is an important way to improve accuracy further. In this study, we propose a method for miRNA target prediction using support vector machines, built on these key features. Because the out-seed segment of the miRNA:mRNA duplex can compensate for imperfect base pairing within the seed segment, our method takes this property into account: we partition the duplex into two parts, the seed and the out-seed, for feature extraction. The nucleotide positions of the seed region, the minimum free energy (MFE), and other properties serve as features for target gene prediction. The procedure is similar to that of PMirP: the optimal parameters are selected to build the SVM classifiers, and the classifiers are then tested on the test dataset. The experimental results show that the method performs well on target prediction, with high sensitivity and specificity, which supports its feasibility and effectiveness.
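     The seed and out-seed partition can be sketched as follows. This minimal example assumes an ungapped alignment in which each miRNA position, counted from the 5' end, faces exactly one nucleotide of the target site, and it takes the seed to be miRNA positions 2 to 8, a common convention that the abstract itself does not specify. The names `pair_type` and `duplex_features` are hypothetical, and the MFE feature is omitted because it requires an external RNA folding tool.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}

SEED_START, SEED_END = 2, 8  # 1-based, inclusive; assumed seed definition


def pair_type(a, b):
    """Classify a nucleotide pair as 'match', 'wobble' or 'mismatch'."""
    if (a, b) in WATSON_CRICK:
        return "match"
    if (a, b) in WOBBLE:
        return "wobble"
    return "mismatch"


def duplex_features(mirna, site):
    """Build seed and out-seed features for an ungapped miRNA:mRNA duplex.

    mirna -- miRNA sequence written 5'->3'
    site  -- target-site nucleotides written 3'->5', so that site[i]
             faces mirna[i]; both strings must have the same length
    """
    if len(mirna) != len(site):
        raise ValueError("mirna and site must be aligned to equal length")
    types = [pair_type(a, b) for a, b in zip(mirna.upper(), site.upper())]
    seed = types[SEED_START - 1:SEED_END]
    out_seed = types[:SEED_START - 1] + types[SEED_END:]
    features = {}
    # Position-specific pairing state inside the seed.
    for offset, kind in enumerate(seed):
        features[f"seed_pos{SEED_START + offset}"] = kind
    # Summary counts for both regions.
    for name, region in (("seed", seed), ("out_seed", out_seed)):
        for kind in ("match", "wobble", "mismatch"):
            features[f"{name}_{kind}"] = region.count(kind)
    return features


# Toy duplex, for illustration only: a G:U wobble at seed position 5
# and a mismatch at out-seed position 10.
mirna = "UGAGGUAGUAGG"
site = "ACUCUAUCACCC"
features = duplex_features(mirna, site)
```

     Keeping the two regions as separate feature groups lets a classifier weigh strong out-seed pairing against an imperfect seed, which is the compensation effect described above. Categorical values such as `seed_pos5` would need a numeric encoding before being passed to an SVM.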
     Research on human miRNAs is still in its infancy. To date, most existing prediction work is based on combining known miRNA characteristics, which requires a better understanding of the biological characteristics of miRNAs and their targets, such as structure, sequence, and pathway. In this paper, based on the support vector machine, we develop two prediction methods, each of which is verified with good results. This study therefore not only provides effective tools for research on miRNAs and miRNA targets but also lays a foundation for future research in this field.
