Identification of long intergenic non-coding RNAs using the random forest algorithm
  • English title: Identification of large intergenic non-coding RNAs using random forest
  • Authors (Chinese): 徐炜娜; 张广乐; 李仕红; 陈园园; 李强; 杨涛; 许明敏; 乔宁; 张良云
  • Authors (English): XU Wei-na; ZHANG Guang-le; LI Shi-hong; CHEN Yuan-yuan; LI Qiang; YANG Tao; XU Ming-min; QIAO Ning; ZHANG Liang-yun; College of Science, Nanjing Agricultural University
  • Keywords: long intergenic non-coding RNA; random forest algorithm; minimum free energy; signal-noise ratio
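  • Chinese journal title: SDDX
  • English journal title: Journal of Shandong University (Natural Science)
  • Institution: College of Science, Nanjing Agricultural University
  • Publication date: 2018-12-24 17:01
  • Publisher: Journal of Shandong University (Natural Science)
  • Year: 2019
  • Issue: v.54
  • Funding: National Natural Science Foundation of China (11571173; 11401311; 11601231)
  • Language: Chinese
  • Page: SDDX201903012
  • Number of pages: 9
  • CN: 03
  • ISSN: 37-1389/N
  • Classification number: 89-96+105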
Abstract
To support further study of the regulatory mechanisms of lincRNAs, we build an efficient lincRNA identification model that can serve as a data source for subsequent research. Using features including minimum free energy (MFE) and signal-noise ratio (SNR), and removing redundant features according to their feature contribution, we construct a random forest (RF) classification model to identify lincRNAs. Tested on the same experimental dataset, the model reaches a sensitivity, specificity and accuracy of 94.1%, 93.2% and 93.7%, respectively, each higher than the corresponding figure for the existing methods PhyloCSF, LncRNA-ID and CPC. The model is robust during identification and identifies lincRNAs accurately.
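The three figures reported in the abstract are conventionally derived from a confusion matrix. The abstract does not give the formulas, so the following is only a minimal sketch that assumes the standard definitions, with lincRNAs as the positive class. The function name `classification_metrics` and the counts passed to it are hypothetical and chosen purely for illustration; they are not the paper's data.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts.

    Assumes the standard definitions and that both classes are non-empty:
      sensitivity = TP / (TP + FN)
      specificity = TN / (TN + FP)
      accuracy    = (TP + TN) / (TP + FN + TN + FP)
    """
    sensitivity = tp / (tp + fn)  # share of true lincRNAs that were recovered
    specificity = tn / (tn + fp)  # share of non-lincRNAs correctly rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)  # share of all calls that are correct
    return sensitivity, specificity, accuracy


# Hypothetical counts for illustration only (not taken from the paper).
sn, sp, acc = classification_metrics(tp=90, fn=10, tn=80, fp=20)
print(f"sensitivity={sn:.3f} specificity={sp:.3f} accuracy={acc:.3f}")
```

Reporting sensitivity and specificity alongside accuracy matters when the positive and negative sets differ in size, since accuracy alone can hide poor performance on the smaller class.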
References
[1] PONTING C, OLIVER P, REIK W. Evolution and functions of long noncoding RNAs[J]. Cell, 2009, 136(4): 629-641.
    [2] CABILI M N, TRAPNELL C, GOFF L, et al. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses[J]. Genes, 2011, 25(18): 1915-1927.
    [3] ØROM U A, THOMAS D, MALTE B, et al. Long noncoding RNAs with enhancer-like function in human cells[J]. Cell, 2011, 27(4): 46-58.
    [4] GUTTMAN M, AMIT I, GARBER M, et al. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals[J]. Nature, 2009, 458(12): 223-227.
    [5] ULITSKY I, SHKUMATAVA A, JAN C H, et al. Conserved function of lincRNAs in vertebrate embryonic development despite rapid sequence evolution[J]. Cell, 2011, 147(7): 1537-1550.
    [6] CAO C H, ZHANG D, GUO X. The long intergenic noncoding RNA UFC1, a target of microRNA 34a, interacts with the mRNA stabilizing protein HuR to increase levels of β-catenin in HCC cells[J]. Gastroenterology, 2015, 148(2): 415-426.
    [7] WENG Xia, HONG Xiaoming. Expression of lincRNA-PVT1 in thyroid carcinoma and its clinicopathological significance[J]. Journal of Practical Oncology, 2017, 32(1): 57-61.
    [8] TSENG Y Y, MORIARITY B S, GONG W, et al. PVT1 dependence in cancer with MYC copy-number increase[J]. Nature, 2014, 512(7512): 82-86.
    [9] PAULI A, RINN J L, SCHIER A F. Non-coding RNAs as regulators of embryogenesis[J]. Nat Rev Genet, 2011, 12(2): 136-149.
    [10] PAULI A, VALEN E, LIN M F, et al. Systematic identification of long noncoding RNAs expressed during zebrafish embryogenesis[J]. Genome Res, 2012, 22(3): 577-591.
    [11] CABILI M N, TRAPNELL C, GOFF L, et al. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses[J]. Genes, 2011, 25(18): 1915-1927.
    [12] SUN K, CHEN X N, JIANG P Y, et al. iSeeRNA: identification of long intergenic non-coding RNA transcripts from transcriptome sequencing data[J]. BMC Genomics, 2013, 14(S2): 13-23.
    [13] SHI Wei, ZHAO Jian, SONG Xiaofeng, et al. Research progress of LincRNA[J]. Progress in Modern Biomedicine, 2016, 16(9): 1762-1765.
    [14] LIN M F, JUNGREIS I, KELLIS M. PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions[J]. Bioinformatics, 2011, 27(13): i275-i282.
    [15] PIAN C, ZHANG G, CHEN Z, et al. LncRNApred: classification of long non-coding RNAs and protein-coding transcripts by the ensemble algorithm with a new hybrid feature[J]. PLOS ONE, 2016, 11(5): e0154567.
    [16] ACHAWANANTAKUN R, CHEN J, SUN Y, et al. LncRNA-ID: long non-coding RNA Identification using balanced random forests[J]. Bioinformatics, 2015, 31(24): 3897-390.
    [17] KONG L, ZHANG Y, YE Z, et al. CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine[J]. Nucleic Acids Res, 2007, 35(Web Server issue): 345-349.
    [18] BU D, YU K, SUN S, et al. NONCODE v3.0: integrative annotation of long noncoding RNAs[J]. Nucleic Acids Res, 2012, 36(8): 210-215.
    [19] SPEIR M L, ZWEIG A S, ROSENBLOOM K R, et al. The UCSC genome browser database: 2016 update[J]. Nucleic Acids Res, 2016, 44(D1): D717.
    [20] TINOCO I, BORER P N, DENGLER B, et al. Improved estimation of secondary structure in ribonucleic acids[J]. Nat New Biol, 1973, 246(150): 40-41.
    [21] BONNET E, WUYTS J, PIERRE Y, et al. Evidence that microRNA precursors, unlike other non-coding RNAs, have lower folding free energies than random sequences[J]. Bioinformatics, 2004, 20(17): 2911-2917.
    [22] DING X, ZHU L, JI T, et al. Long intergenic non-coding RNAs (lincRNAs) identified by RNA-Seq in breast cancer[J]. PLOS ONE, 2014, 9(8): e103270.
    [23] HUANG T, CHANG H Y. Long noncoding RNA in genome regulation: prospects and mechanisms[J]. RNA Biol, 2010, 7(5): 582-585.
    [24] YAN M, LIN Z S, ZHANG C T. A new Fourier transform approach for protein coding measure based on the format of the Z-curve[J]. Bioinformatics, 1998, 14(8): 685-690.
    [25] LIU G, LUAN Y. An adaptive integrated algorithm for noninvasive fetal ECG separation and noise reduction based on ICA-EEMD-WS[J]. Med Biol Eng Comput, 2015, 53(11): 1113-1127.
    [26] YIN C, YAU S S. Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence[J]. Theor Biol, 2007, 247(4): 687-694.
    [27] KAPRANOV P, CHENG J, DIKE S, et al. RNA maps reveal new RNA classes and a possible function for pervasive transcription[J]. Science, 2007, 316(5830): 1484-1488.
    [28] COMPEAU P, PEVZNER P, TESLER G. How to apply de Bruijn graphs to genome assembly[J]. Nat Biotechnology, 2011, 29(11): 987-991.
    [29] HURST L D, MERCHANT A R. High guanine-cytosine content is not an adaptation to high temperature: a comparative analysis amongst prokaryotes[J]. The Royal Society, 2001, 268(1466): 493-497.
    [30] FREYHULT E, GARDNER P P, MOULTON V. A comparison of RNA folding measures[J]. BMC Bioinformatics, 2005, 6(1): 241.
    [31] SOPHIA S, LEE F, SUN L, et al. EM-random forest and new measures of variable importance for multi-locus quantitative trait linkage analysis[J]. Bioinformatics, 2008, 5(21): 1603-1610.
    [32] ROBIN G, JEAN-MICHEL P, CHRISTINE T M. VSURF: an R package for variable selection using random forests[J]. Computing, 2016, 7(2): 19-31.
    [33] HUANG G B, ZHU Q Y, SIEW C K. Extreme learning machine: a new learning scheme of feed forward neural networks[J]. Proc Int Joint Conf Neural Netw, 2004, 2(2): 985-990.
    [34] VLADIMIR V, CORINNA C. Support-vector networks[J]. Machine Learning, 1995, 20(3): 273-297.
    [35] BREIMAN L. Random forest[J]. Machine Learning, 2001, 45(1): 5-32.
    [36] JESSE D, MARK G. The relationship between Precision-Recall and ROC curves[J]. ICML, 2006, 6(23): 233-240.
    [37] ATAPATTU S, TELLAMBURA C, JIANG H, et al. Analysis of area under the ROC curve of energy detection[J]. IEEE Transactions on Wireless Communications, 2010, 9(3): 1216-1225.
