findDEG: a software package for transcriptome-based differential gene expression analysis

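English title: findDEG: an integrated software package for differential gene expression analysis with RNA sequencing data
Authors: 吴吉妍; 姚丹; 吴海楠; 童春发
Authors (English): WU Jiyan; YAO Dan; WU Hainan; TONG Chunfa; College of Forestry, Nanjing Forestry University
Keywords: differential gene expression analysis; Perl; poplar; transcript; transcriptome sequencing; findDEG
Keywords (English): gene differential expression analysis; Perl language; poplar; transcript; transcriptome sequencing; findDEG
Chinese journal title: NJLY
English journal title: Journal of Nanjing Forestry University (Natural Sciences Edition)
Affiliation: College of Forestry, Nanjing Forestry University
Publication date: 2019-03-15
Publisher: Journal of Nanjing Forestry University (Natural Sciences Edition)
Year: 2019
Volume/number: v.43; No.200
Funding: National Natural Science Foundation of China (31270706, 31870654); Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD)
Language: Chinese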
File name: NJLY201902013
Page count: 7
Issue: 02
CN: 32-1161/S
Pages: 97-103

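Abstract

【Objective】With the continuing development of next-generation sequencing, transcriptome sequencing (RNA-seq) has been widely used in many species for differential gene expression analysis and gene annotation. The many existing tools for differential expression analysis involve numerous, complicated steps, and different methods can give markedly different results, which makes the analysis of real data difficult for researchers. To simplify the process, an integrated software package was developed from existing software. 【Methods】Three well-established differential gene expression workflows, Trinity, TopHat+Cufflinks and HISAT2+StringTie, were integrated into a single package, taking into account whether a reference genome is available for the study species, whether the samples have replicates, whether sequencing is single-end or paired-end, and the different methods for computing gene expression levels and for testing the significance of differential expression. 【Results】A package named findDEG was developed in Perl for differential gene expression analysis of RNA-seq data. It has three modules: Trinity, TopHat+Cufflinks and HISAT2+StringTie. The Trinity module offers three methods for computing transcript abundance and four significance tests for differentially expressed genes; the TopHat+Cufflinks module lets users choose the new or the old Cufflinks analysis scheme; the HISAT2+StringTie module has a single scheme. The package is freely available at http://www.bioseqdata.com/findDEG/findDEG.htm. Transcriptome data from Populus simonii under normal and drought-stress conditions were analyzed with the new and old Cufflinks schemes and one Trinity combination. The two Cufflinks schemes identified 53 and 33 differentially expressed genes, respectively, 25 of which were shared; the Trinity method identified as many as 1,641 differentially expressed genes, of which 14 and 3 were shared with the two Cufflinks schemes, respectively. 【Conclusion】findDEG offers more than ten differential gene expression analysis schemes and runs the whole computation in a single step, sparing users the intermediate parameter input and handling of intermediate results, which makes it convenient to use.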
【Objective】With the rapid development of next-generation sequencing technology, transcriptome sequencing (RNA-seq) is widely used for differential gene expression analysis and gene annotation in many species. A variety of software packages for RNA-seq data analysis are available. However, a practical analysis involves several complicated steps and many parameters, making it difficult for most researchers to perform such an analysis accurately. 【Method】Based on available software packages such as Trinity, TopHat+Cufflinks and HISAT2+StringTie, an integrated package was built to analyze RNA-seq data, covering different methods for computing gene expression abundance and for hypothesis testing of differential gene expression. Other factors were also considered, including whether a reference genome is available, whether the samples have replicates, and whether the reads are paired-end or single-end. 【Result】An integrated software package called findDEG was developed in Perl for differential gene expression analysis. The software consists of three modules: Trinity, TopHat+Cufflinks, and HISAT2+StringTie. The Trinity module provides three methods for calculating transcript expression abundance and four methods for testing differentially expressed genes, while the TopHat+Cufflinks module allows users to choose either the new or the old version of Cufflinks for differential gene expression analysis. The HISAT2+StringTie module has only one analysis strategy. The software is freely available at http://www.bioseqdata.com/findDEG/findDEG.htm. Using three analytical strategies (the old and new versions of Cufflinks and the Trinity module), we analyzed RNA-seq data from Populus simonii under normal and drought stress conditions. The new and old versions of Cufflinks identified 53 and 33 differentially expressed genes, respectively, with 25 genes in common. Trinity detected as many as 1,641 differentially expressed genes, of which 14 and 3 were shared with the results from the new and old versions of Cufflinks, respectively. 【Conclusion】The newly developed software findDEG provides more than a dozen strategies for differential gene expression analysis of RNA-seq data and runs the whole analysis in a single piece of software, avoiding many intermediate parameters and results that would otherwise need to be processed manually.
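
The counts in the abstract can be checked with a little arithmetic. The Python sketch below is illustrative only and is not part of findDEG, which is written in Perl. It assumes that every pairing of the three Trinity quantification methods with the four significance tests counts as one strategy, which the abstract does not state explicitly, and it uses placeholder labels (`quant_A`, `test_1`, and so on) because the abstract gives only the number of methods, not their names. The helper names `enumerate_strategies` and `overlap_summary` are likewise hypothetical.

```python
from itertools import product

# Placeholder labels: the abstract reports only how many methods exist,
# not which ones, so these names are hypothetical.
TRINITY_QUANT = ["quant_A", "quant_B", "quant_C"]           # 3 abundance methods
TRINITY_TESTS = ["test_1", "test_2", "test_3", "test_4"]    # 4 significance tests
CUFFLINKS_SCHEMES = ["new", "old"]                          # 2 Cufflinks schemes


def enumerate_strategies():
    """List (module, option_1, option_2) tuples for every analysis strategy.

    Assumes all Trinity quantification/test pairings are valid strategies.
    """
    strategies = [("Trinity", q, t) for q, t in product(TRINITY_QUANT, TRINITY_TESTS)]
    strategies += [("TopHat+Cufflinks", s, None) for s in CUFFLINKS_SCHEMES]
    strategies.append(("HISAT2+StringTie", None, None))
    return strategies


def overlap_summary(n_a, n_b, n_shared):
    """Summarize the overlap of two gene sets from their sizes alone."""
    union = n_a + n_b - n_shared
    return {
        "union": union,
        "only_a": n_a - n_shared,
        "only_b": n_b - n_shared,
        "jaccard": n_shared / union,
    }


if __name__ == "__main__":
    print(len(enumerate_strategies()))      # total number of strategies
    print(overlap_summary(53, 33, 25))      # new vs old Cufflinks
    print(overlap_summary(1641, 53, 14))    # Trinity vs new Cufflinks
```

Under that assumption the total is 3 × 4 + 2 + 1 = 15 strategies, consistent with "more than a dozen". The two Cufflinks schemes together cover 53 + 33 − 25 = 61 distinct genes, so fewer than half of their combined calls are shared, and the Trinity result overlaps with either of them only marginally.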
