Aligning High Error Rate Reads Using Enhanced Sparse Suffix Array Index

Authors: WEI Hao (韦好); ZHONG Cheng (钟诚)
Affiliation: School of Computer and Electronics and Information, Guangxi University; Key Laboratory of Parallel and Distributed Computing Technology of Guangxi Universities
Keywords: sequence alignment; enhanced sparse suffix array; index; maximal exact match
Journal code: XXWX
Journal: Journal of Chinese Computer Systems
Publication date: 2019-08-09
Publisher: 小型微型计算机系统 (Journal of Chinese Computer Systems)
Year: 2019
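Volume: v.40
Funding: Supported by the National Natural Science Foundation of China (61462005) and the Guangxi Natural Science Foundation (2014GXNSFAA118396)
Language: Chinese
Article ID: XXWX201908044
Page count: 5
Issue: 08
CN: 21-1106/TP
Pages: 222-226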

Abstract

Biological sequence alignment helps locate similar regions between sequences. Rapid advances in sequencing technology require alignment algorithms that can flexibly handle longer reads with higher error rates. This paper proposes an improved long-read alignment algorithm: the reference sequence is indexed with an enhanced sparse suffix array, the minimum seed length is adjusted adaptively, and the maximal exact matches and super-maximal exact matches found between the reference sequence and the reads are used as seeds for extension. Experiments on simulated and real data show that, compared with existing representative algorithms, the proposed algorithm achieves essentially the same accuracy while markedly improving recall, has higher sensitivity overall, and identifies more reads.
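
The seeding idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it uses a plain sparse suffix array (every k-th suffix, without the LCP and child tables of an enhanced index), and the function names `build_sparse_sa`, `find_mems`, and `adaptive_mems`, the parameters, and the "lower the minimum length by one until a seed is found" policy are all assumptions made for illustration. The abstract does not say how the minimum seed length is actually adapted.

```python
def build_sparse_sa(ref, k):
    """Start positions of every k-th suffix of ref, in lexicographic order."""
    # Naive construction by sorting suffix slices; adequate for a demo only.
    return sorted(range(0, len(ref), k), key=lambda i: ref[i:])


def _bound(ref, sa, probe, upper):
    """First index in sa whose suffix prefix is >= probe (or > probe if upper)."""
    lo, hi, m = 0, len(sa), len(probe)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = ref[sa[mid]:sa[mid] + m]
        if prefix < probe or (upper and prefix == probe):
            lo = mid + 1
        else:
            hi = mid
    return lo


def find_mems(ref, read, sa, k, min_len):
    """Maximal exact matches (ref_pos, read_pos, length) with length >= min_len."""
    if min_len < k:
        raise ValueError("min_len must be at least the sparseness factor k")
    # A match of length >= min_len covers a sampled reference position within
    # its first k characters, so at least min_len - k + 1 characters match
    # starting from that sampled position.
    seed = min_len - k + 1
    mems = set()
    for q in range(len(read) - seed + 1):
        probe = read[q:q + seed]
        first = _bound(ref, sa, probe, upper=False)
        last = _bound(ref, sa, probe, upper=True)
        for r in sa[first:last]:
            a, b = r, q                      # extend to the left
            while a > 0 and b > 0 and ref[a - 1] == read[b - 1]:
                a -= 1
                b -= 1
            e_r, e_q = r + seed, q + seed    # extend to the right
            while e_r < len(ref) and e_q < len(read) and ref[e_r] == read[e_q]:
                e_r += 1
                e_q += 1
            if e_r - a >= min_len:
                mems.add((a, b, e_r - a))
    return sorted(mems)


def adaptive_mems(ref, read, k, start_len, floor_len):
    """Hypothetical policy: relax min_len one step at a time until seeds appear."""
    sa = build_sparse_sa(ref, k)
    min_len = start_len
    while min_len >= max(floor_len, k):
        mems = find_mems(ref, read, sa, k, min_len)
        if mems:
            return min_len, mems
        min_len -= 1
    return None, []


if __name__ == "__main__":
    ref, read = "ACGTACGTTAGC", "CGTTAG"
    sa = build_sparse_sa(ref, 2)
    print(find_mems(ref, read, sa, 2, 4))    # [(5, 0, 6)]
    print(adaptive_mems(ref, read, 2, 8, 3))
```

Sampling only every k-th suffix shrinks the index by roughly a factor of k, at the cost of probing with a shorter seed and extending leftward to recover the full match. A read with many errors contains only short exact matches, which is why a fixed, large minimum length can leave it without seeds.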
