DNA数据压缩方法的研究

英文题名：Research of DNA Data Compression Method
作者：谭丽
论文级别：博士
学科专业名称：信息与通信工程
中文关键词：DNA数据压缩 ; 扩展操作 ; EARV ; 动态混沌搜索算子 ; Memetic算法 ; 码书索引变换模型 ; K-means聚类
英文关键词：DNA data compression ; extended operations ; EARV ; daptive chaotic search
英文关键词：operator ; memetic algorithm ; codebook index transformation model ; k-means clustering
学位年度：2014
导师：孙季丰
学科代码：0810
学位授予单位：华南理工大学
论文提交日期：2014-06-11
答辩委员会主席：秦家银

摘要

随着DNA测序技术的发展，生物医学研究面临着如何存储和传输DNA数据的问题。DNA数据压缩技术成为其中解决问题的重要方法之一，即以高效的压缩编码方法，将DNA数据存储于较小的空间。由于DNA数据的特殊性，使用传统的压缩算法并不是很理想，因此出现了专门针对DNA数据的压缩算法，这类算法主要有两大类，即替代法和统计法。替代法是在原始DNA数据中寻找重复出现频率高的DNA片段，并将这些片段编入字典，用字典索引值替代原始DNA数据中的这些片段；统计法是通过统计归纳出原始DNA数据中每个碱基符号出现的规律，估计出其对应的概率统计模型，然后利用该模型对DNA数据进行熵编码。这两类压缩方法对DNA基准测序序列具有较好的压缩效果，而对高通量DNA数据压缩有限，故出现一系列针对高通量DNA数据的压缩方法。
     目前，高通量DNA数据的压缩方法主要为重测序的DNA数据压缩方法和从头测序的DNA数据压缩方法。对于重测序的DNA数据压缩方法主要是基于参考基因组压缩，基于参考基因组的DNA数据压缩方法，虽然获得的压缩比很高，但对参考序列依赖性太强，而实际上有些测序序列并不存在现成的参考基因组，同时由于压缩和解压都需要相同的参考基因组，故参考基因组必须事先保存在本地，从而导致算法资源开销较大；对于从头测序的DNA数据压缩方法，不依赖外部的参考基因组，自完备性较好，但在一定程度上仍受限于短读拼接技术。
     本文针对这些问题，首先在前两章中调研了DNA数据压缩方法的研究现状，并对相关的DNA数据压缩技术及压缩方法所面临的挑战与展望进行了分析与讨论。最后，提出了几种DNA数据压缩方法，并对算法进行了分析。本论文主要有如下几方面的贡献：
     1.在经典DNA数据压缩算法的基础上，提出一种基于扩展操作的DNA数据压缩算法(DNA sequence compression using extended operations, DNAEC)。该算法将三种标准编辑操作扩展为八种操作，利用LZ算法思想对精确匹配、互补回文、自匹配三种模型进行压缩，而对非重复片段使用基于上下文的二阶二进制算术编码器进行压缩编码。最后通过实验仿真，与典型的DNA压缩算法相比，在DNA基准测序序列上该算法的压缩性能得到了提升，特别是对较长的DNA数据，其压缩效果更为明显。
     2.在Memetic算法框架下，提出一种基于CPMA (Collaborative particle swarmoptimization-based memetic algorithm)的DNA数据压缩方法。CPMA分别采用综合学习粒子群优化算法和动态调整的混沌搜索算子进行全局搜索和局部搜索，接着寻找全局最优的基于扩展操作的近似重复矢量码书，并用此码书压缩DNA数据。实验结果表明，CPMA比其它优化算法有很大的改善，对文中采用的大部分测试函数，其解都非常接近全局最优点；对于DNA基准测序序列，与经典DNA数据压缩算法相比，基于CPMA的压缩性能得到了显著提升。
     3.在Memetic算法框架下，提出另外一种混合粒子群优化(Hybrid particle swarmoptimization based memetic algorithm, HPMA)的DNA数据压缩方法。HPMA采用动态综合学习粒子群优化算法作为全局搜索方法，然后分别用两种不同的局部搜索算子，即中心对称变异差分进化算子和自适应混沌搜索算子，接着寻找全局最优的基于扩展操作的近似重复矢量码书，用此码书压缩DNA数据。最后对算法进行实验仿真，在19个高维测试函数上，HPMA能获得比论文提及的优化算法更好的优化性能和伸缩性能；在DNA基准测序序列上，与经典DNA数据压缩算法相比其压缩率得到明显的提高。
     4.考虑到基于参考基因组的DNA数据压缩算法对参考序列依赖性太强的特点，提出一种高通量DNA数据的压缩算法(Codebook index transformation for high-throughputDNA data, CITD)，该算法先采用码书索引变换模型，将传统码书索引值的表示方法变换成由四个标准碱基字符替代的四进制数值方式，并采用一种界定替换串与非替换串的简明编码方法，接着通过信息熵的大小来决定是否进行BWT (Burrows-Wheelertransformation)，最后用MTF (Move to front)变换和Huffman熵编码对高通量DNA数据进行压缩。在多个测序数据集上的实验结果表明，CITD在大多数情况下可以获得比所对比的高通量DNA数据专用压缩方法更优的压缩性能。
     5.提出另外一种高通量DNA数据压缩算法，即DNAC-K(K-means clustering forhigh-throughput DNA data)。该算法首先利用K-means方法建立DNA数据聚类族，接着在聚类族中对子序列之间进行序列比对，最后用Huffman熵编码对高通量DNA数据进行压缩。实验结果表明，在大多数测序数据集上DNAC-K可以获得比从前的高通量DNA数据专用压缩方法更优的压缩性能。
Due to the recent advancements in DNA sequencing technologies, the biomedicalresearch is faced with the problems of storage and transmission of the huge amount of DNAdata. The use of compression techniques provides an important candidate solution for thestorage and transmission challenges of DNA data, in other words, the DNA data is stored inthe smaller space by the effective coding methods. Because of the particularity of the DNAdata, it is not very ideal with traditional compression algorithms. Thus compressionalgorithms for DNA data specially appeared, including substitutional methods and statisticalmethods. The substitutional methods look for high frequently repeated fragments of DNA inoriginal DNA data, and index these fragments into a dictionary, which will be used to indteadthese fragments of DNA data; The statistical methods estimate the corresponding probabilitystatistical model through statistical rule of original DNA data, and then DNA data is entropycoded by this statistical model. These two types of compression methods have bettercompression effect for the benchmark DNA sequences, but limited for high-throughput DNAdata compression. So, a series of high-throughput DNA data compression methods appeared.
     At present, the high-throughput DNA data compression methods mainly includeresequencing DNA data compression method and de novo sequencing DNA data compressionmethod. The resequencing DNA data compression method mainly based on the referencegenome, which obtain the high compression ratio but require reference genome, but in fact,some sequences have no existing reference data, and because the same reference data isneeded by compression and decompression, so these reference genome must be stored in thelocal, increasing the costs. The de novo ssequencing DNA data compression methods do notrely on external reference genome and the completeness is better, but is still limited by theshort read splicing technology.
     In view of the above problem, the research status of DNA data compression method aresurveyed in this thesis, the challenges and prospects faced by DNA data compression methodare also discussed. Finally, several kinds of DNA data compression methods are proposed.The contribution of this dissertation mainly has the following several aspects:
     (1) A compression algorithm called DNAEC is proposed for DNA data compression, inwhich the extended operations are used. The three edit operations are extended to eightoperations so that the DNA data can form the three match patterns (exact match,complementary palindrome match and self repeat match), and then can be well compressed bythe idea of LZ, encoding the unmatched fragments with the order-2arithmetic code. Simulation results show better compression performance of DNAEC than typical DNA datacompression algorithms on benchmark DNA sequences, especially for longer DNA data.
     (2) On memetic algorithm framework, a DNA data compression method based oncollaborative particle swarm optimization-based memetic Algorithm (CPMA) is proposed, inwhich the comprehensive learning particle swarm optimization (CLPSO) is used for globalsearch and a dynamic adjustive chaotic search operator (DACSO) as the local searchrespectively. In CPMA, it looks for the global optimal code book based on extendedapproximate repeat vector (EARV), by which the DNA data is compressed. Experimentalresults demonstrate better performance of CPMA than the other optimization algorithms, andit is very close to the global optimal points in most of the test functions adopted by the thesis.The compression performance of the method based on CPMA is markedly improvedcompared to many of the classical DNA sequence compression algorithms on benchmarkDNA sequences.
     (3) A novel hybrid particle swarm optimization based memetic algorithm (HPMA) isproposed for DNA data compression, in which dynamic comprehensive learning particleswarm optimization (DCLPSO) method within the frameworkl of the memetic algorithm isused for global search and two adaptive local searches (LS) operators including centersymmetry mutation differential evolution (CSMDE) operator and adaptive chaotic searchoperator (ACSO) work in cooperative way. HPMA looks for the global optimal code bookbased on EARV, by which the DNA sequence will be compressed. Experiments wereconducted on19high-dimensional functions and11real DNA sequences. The results showthat HPMA is more competitive in both the performance and scability, and also attains bettercompression ability than other representative DNA specific algorithms on DNA sequencedata.
     (4) Considering the DNA data compression algorithm based on the reference genome,the dependence have been too strong. A novel high-throughput DNA data compressionmethod based on codebook index transformation (CITD) is proposed, in which the codebookindex transformation (CIT) model is adopted to substitute the traditional representation ofcodebook indexes by the quaternary values which are expressed by the four standard basecharacters, and adopts a simple encoding method to distinguish the replaced and non-replacedsubstring, and subsequently determines whether need to use the burrows-wheelertransformation (BWT) according to the value of information entropy, finally uses move tofront (MTF) transformation and Huffman entropy coding to compress the data. Experimentalresults on several sequencing data sets demonstrate better performance of CITD than the high-throughput DNA data compression algorithms cited in this thesis, in most cases.
     (5) A compression algorithm, called DNAC-K, is proposed, which is based on K-meansclustering for high-through DNA data and creates cluster of sequences based on K-meansclustering method at first, then the clusters are iterated according to the edit distances ofsubsequences, and finally, Huffman coding is adopted to encode the clustering result.Experimental results on several sequencing data sets demonstrate better performance ofDNAC-K than many of the previous high-throughput DNA data compression algorithms.

引文

[1] Sanger研究所,深圳华大基因研究院,美国人类基因组研究所.千人基因组计划[EB/OL]. http://www.1000genomes.org/,2014-05-07.
    [2]国际联盟.国际人类基因组单体型图计划[EB/OL]. http://hapmap.ncbi.nlm.nih.gov/,2014-05-07.
    [3]深圳华大基因研究院.孟得尔遗传疾病计划[EB/OL]. http://www.genomics.cn/en/navigation/show_navigation?nid=4213,2014-05-07.
    [4] Margulies M, Egholm M, Altman W E, et al. Genome sequencing in microfabricatedhigh-density picolitre reactors [J],2005,437(7057):376-380.
    [5] Galperin M Y, Cochrane G R. Petabyte-scale innovations at the European nucleotidearchive [J]. Nucleic Acids Research,2009,37: D1-D4.
    [6] Baxevanis A D, Ouellette B F F. Bioinformatics: A Practical Guide to the Analysis ofGenes and Proteins, Third Edition [M]. United States: Wiley Publishing House,2005.
    [7]孙啸,陆祖宏,谢建明.生物信息学基础[M].北京:清华大学出版社,2005:31-34.
    [8]樊龙江.生物信息学札记(第3版)[EB/OL]. http://ibi.zju.edu.cn/bioinplant/,2010-01.
    [9] Cao Y, Zhu S, Yu J C, et al. Protein detection based on small molecule-linked DNA [J].Analytical,2012,84(10):4314-4320.
    [10] Spengler S J. Bioinformatics in the Information Age [J]. Science,2000,287(5456):1221-1223.
    [11] Kahn S D. On the future of genomic data [J]. Science,2011,331(6018):728-729.
    [12] Schatz M C, Langmead B. The DNA data deluge [J]. IEEE Spectrum,2013,50(7):28-33.
    [13]张成岗,贺福初.生物信息学方法与实践[M].北京:科学出版社,2002:52-90.
    [14] Kuruppu S. Compression of Large DNA Databases [D]. Melbourne: The University ofMelbourne,2012.
    [15] Compress. Compress tool [EB/OL]. http://compress-tool1.software.informer.com/,2014-05-08.
    [16] Deutsch P. Gzip file format specification version4.3[M].1996.
    [17]7-zip.7-zip tool [EB/OL]. http://www.7-zip.org/,2014-05-08.
    [18] Bzip2. Bzip2tool [EB/OL]. http://bzip.org/,2014-05-08.
    [19] Grumbach S, Tahi F. Compression of DNA sequences [A]. Data CompressionConference on [C].1993:340-350.
    [20] Grumbach S, Tahi F. A new challenge for compression algorithms: Genetic sequences[J]. Information Processing and management,1994,30(6):875-886.
    [21] Rivals E, Delahaye J, Dauchet M, Delgrange O. A guaranteed compression scheme forrepetitive DNA sequences [A]. Data Compression Conference on [C].1996:453.
    [22] Chen X, Kwong S, et al. A compression algorithm for DNA sequences and itsapplications in genome comparison [A]. The10th Workshop on Genome Informatics
    [C]. Tokyo: GIW,1999:51-61.
    [23] Matsumoto T, Sadakane K, Imai H. Biological sequence compression algorithms [J].Genome Informatics,2000,11:43-52
    [24] Chen X, Li M, Ma B, Tromp J. DNACompress: Fast and Effective DNA sequencecompression [J]. Bioinformatics,2002,18(12):1696-1698
    [25] Li M, Ma B. Patternhunter Ⅱ: Highly sensitive and fast homology search [J]. Journalof Bioinformatics and Computational Biology,2004,2(3):417-439.
    [26] Behzadi B, Fessant FL. DNA compression challenge revisited: A dynamicprogramming approach [A]. The16th Annual Symposium on Combinatorial PatternMatching [C].2005:190-200
    [27] Klein S T, Ben-Nissan M K. Using Fibonacci compression codes as alternatives todense codes [A]. Data Compression Conference on [C].2008:472-481.
    [28] Xiao L, Zhang M, Xing D, et al. Shifted encoding strategy in retinal luminanceadaptation: form firing rate to neural correlation [J]. Journal of Neurophysiology,2013:110(8):1793-803.
    [29] Srinivasa KG, Jagadish M, Venugopal KR, Patnaik LM. Efficient compression ofnon-repetitive DNA sequence using dynamic programming [A]. Advanced Computingand Communications Conference on [C].2006:569-574.
    [30] Korodi G, Tabus I. DNA sequence compression-based on the normalized maximumlikelihood model [J]. IEEE Signal Processing Magazine,2007,24(1):47-53
    [31]纪震,周家锐,朱泽轩, Wu Q H.基于生物信息学特征的DNA序列数据压缩算法[J].电子学报,2011,39(5):991-995.
    [32] Zhu ZX, Zhou JR, Ji Z, Shi YH. DNA sequence compression using adaptive particleswarm optimization-based memetic algorithm [J]. IEEE Trans on EvolutionaryComputation,2011,15(5):643-658
    [33] Ji Z, Liao H, Wang Y, Wu Q H. A novel intelligent particle optimizer for globaloptimization of multimodal functions [A]. IEEE Congress on EvolutionaryComputation [C].2007:3272-3275.
    [34] Liang JJ, Qin AK, Suganthan PN, Baskar S. Comprehensive learning particle swarmoptimizer for global optimization of multimodal functions [J]. IEEE Trans onEvolutionary Computation,2006,10(3):281-295
    [35] Ji Z, Zhou JR, Zhu ZX, Chen SP. Self-configuration single particle optimizer for DNAsequence compression [J]. Soft Computing,2013,17(4):675-682
    [36]纪震,周家锐,廖惠连,吴青华.智能单粒子优化算法[J],计算机学报,2010,33(3):556-561.
    [37]周家锐,纪震,朱泽轩,陈思平.基于Memetic优化的智能DNA序列数据压缩算法[J].电子学报,41(3):513-518.
    [38] Giancarlo R, Scaturro D, Utro F. Textual data compression in computational biology:asynopsis [J]. Bioinformatics,2009,25(13):1575-1586.
    [39]朱泽轩,张永朋,尤著宏,等.高通量DNA测序数据压缩研究进展[J].深圳大学学报,2013,30(4):409-415.
    [40] Christley S, Lu Y, Li C, et al. Huffman genomes as email attachments [J].Bioinformatics,2009,25(2):274-275.
    [41] Christley S, James Watson’s genome [EB/OL]. http://www.ics.uci.edu/~dnazip/JWB.tgz,2014-05-08.
    [42] Brandon M C, Wallace D G, Baldi P. Data structures and compression algorithms forgenomic sequence data [J]. Bioinformatics,2009,25(14):1731-1738.
    [43] Wang CM, Zhang DB. A novel compression tool for efficient storage of genomeresequencing data [J]. Nucleic Acids Research,2011,39(7): E45-U74.
    [44] Fritz M H Y, Leinonen R, Cochrane G, et al. Efficient storage of high throughput DNAsequencing data using reference-based compression [J]. Genome Research,2011,21(5):734-740.
    [45] Jones D, Ruzzo W, Peng X, et al. Compression of next-generation sequencing readsaided by highly efficient denovo assembly [J]. Nucleic Acids Research,2012,40(22):e171.
    [46] Huffman, D. A method for the construction of minimum-redundancy codes [J].Proceedings of the IRE,1952,40(9):1098-1101.
    [47] Ziv J, Lempel A. A universal algorithm for sequential data compression [J].Information Theory, IEEE Transactions on,1977, IT-23(3):337-343.
    [48] Wu XY, Yan H, Li LR. Numerical method for reliability analysis of phased-missionsystem using markov chains [J]. Communications in Statistics-Teory and Methods,2012,41(21):3960-3973.
    [49] Tembe W, Lowey J, Suh E. G-SQZ: compact encoding of genomic sequence andquality data [J]. Bioinformatics,2010,26(17):2192-2194.
    [50] Deorrowicz S, Grabowski S. Compression of DNA sequence reads in FASTQ format[J]. Bioinformatics,2011,27(6):860-862.
    [51] Popitsch N, Haeseler A V. NGG: lossless and lossy compression of aligned highthroughput sequencing data [J]. Nucleic Acids Research,2012,41(1): e27.
    [52] Kuruppu S, Puglisi S J, Zobel J. Relative Lempel-Ziv compression of genomes forlarge-scale storage and retrieval [A], The17th International Symposium. StringProcessing and Information Retrieval [C].2010:201-206.
    [53] Kuruppu S, Beresford-Smith B, Conway T, Zobel J. Iterative dictionary constructionfor compression of large DNA data sets [J]. IEEE/ACM Trans on ComputationalBiology and Bioinformatics,2012,9(1):137-149.
    [54] Stein L D. The case for cloud computing in genome informatics [J]. Genome Biology,2010,11(5):207.
    [55] Hibbs X, Krstic S, Mastrangelo A, et al. The potential and challenges of nanoporesequencing [J]. Nature Biotechnology,2008,26(10):1146-1153.
    [56]张丽霞,宋鸿陡.多重压缩DNA序列数据[J].计算机应用,2010,30(5):1379-1382.
    [57]王玉忠,陈思翀,袁立华.高分子科学导论[M].北京:科学出版社,2010:179-202.
    [58] Sanger. S.cerevisiae genome [EB/OL]. ftp://ftp.sanger.ac.uk/pub/users/dmc/yeast/latest,2014-05-12.
    [59] Paolo M. Bio-Quantum physics: DNA/RNA as an information energy catalyst in lifesystems [EB/OL]. http://www.gsjournal.net/old/science/manzelli.pdf,2014-05-12.
    [60] Mridula T V, Samuel P. Lossless segment based DNA compression [A].20113rdInternational Electronics Computer Technology Conference on [C].2011:298-302.
    [61] Loh P-R, Baym M, Berger B. Compressive genomics [J]. Nat Biotechnol,2012,30:627-30.
    [62] Bonfield J K, Mahoney M V. Compression of FASTQ and SAM format sequencingdata [J]. PLoS One,2013,8: e59190.
    [63] DNA序列数据库. EMBL Database [EB/OL]. http://www.ac.uk/ebi_docs/embl_db/ebi/topembl.html,2014-05-12.
    [64] Hossein S M, Mohapatra P K D. DNA compression and security techniques based onpalindrome searching [J]. Indian Journal of Applied Research,2013,3(3):107-110.
    [65] Kozanitis C, Saunders C, Kruglyak S, et al. Compressing genomic sequence fragmentsusing SlimGene [J]. Journal of Computational Biology,2011,18:401-13.
    [66]王玉,饶妮妮,等.基于小波变换技术预测DNA序列的编码区[J].电子学报,2007,35(1):141-144.
    [67] Daily K, Rigor P, Christley S, et al. Data structures and compression algorithms forhigh-throughput sequencing technologies [J]. BMC Bioinformatics,2010,11:514.
    [68] Garcia S P, Rodrigues J M O S, Santos S, et al. A genomic distance for assemblycomparison based on compressed maximal exact matches [J]. IEEE/ACM Transactionson Computational Biology and Bioinformatics,2013,10(3):793-798.
    [69] Johnson A, Lew J, et al. Molecular Biology of the Cell, Fourth Edition [M]. New Yorkand London: Garland Science,2002.
    [70] Kaessmann H, Ebersberger I, et al. DNA sequence variation among humans and apes[A]. The Annual International Computational Molecular Biology Conference on [C].2002:175.
    [71] Gupta R, Mittal A, et al. An efficient algorithm to detect palindromes in DNAsequences using periodicity transform [J]. Signal Processing,2006,86(8):2067-2073.
    [72] Arndt P F, Burge C B. DNA sequence evolution with neighbor-dependent mutation [A].The Annual International Computational Molecular Biology Conference on [C].2002:32-38.
    [73] Osborne M. Predicting DNA sequences using a backoff language model [DB/OL].http://www.cogsci.ed.ac.uk/~osborne/dna-backoff.ps.gz,2014-05-12.
    [74] Benson D A, Cavanaugh M, Clark K, et al.. GenBank [J]. Nucleic Acids Research,2013,41: D36-D42.
    [75] Kuruppu S, Beresford-Smith B, et al. Iterative dictionary construction for compression of large DNA datasets: supplementary material [DB/OL]. http://www.computer.org/csdl/trans/tb/2012/01/ttb2012010137Abs.html,2013-12-09.
    [76]纪震,周家锐,姜来, Wu Q H. DNA序列数据压缩技术综述[J].电子学报,2010,38(5):1113-1121.
    [77]谭丽,孙季丰,郭礼华.基于Memetic算法的DNA序列数据压缩方法[J].电子与信息学报,2014,36(1):121-127.
    [78] Deorowicz S, Grabowski S. Robust relative compression of genomes with randomaccess [J]. Bioinformatics,2011,27(21):2979-2986.
    [79] Gupta A, Dubey K K. An efficient compressor for biological sequences [A]. The3rdIEEE International Advance Computing Conference on [C].2013:690-695.
    [80] Guo C, Chang C C, Wang Z H. A new data hiding scheme based on DNA sequence [J].International Journal of Innovative Computing, Information and Control,2012,8(1):139-149.
    [81] Ferzli R, Karam L J. DNA-residual: A DNA compression algorithm using forwardlinear prediction [A]. The IEEE International Acoustics, Speech and Signal ProcessingConference [C].2006:1100-1103.
    [82] Garcia Zapata J L, Diaz Martin J C. A geometric algorithm for winding numbercomputation with complexity analysis [J]. Journal of Complexity,2012,28(3):320-345.
    [83] Li H Y, Zhang Y L, Xu Xing. Hybrid dynamical evolutionary algorithm and timecomplexity analysis [J]. Applied Mechanics and Materials,2013,263-266(PART1):2344-2348.
    [84] Watanabe T. Computational complexity analysis and algorithm design forcombinatorial optimization problems [A].20123rd International Conference onNetworking and Computing [C].2012:19-20.
    [85] Cannane A, Williams, H E. General-Purpose compression for efficient retrieval [J].Journal of the American Society for Information Science and Technology,2001,52(5):430-437.
    [86] Netto E, Azevedo R, Centoducatte P, Araujo G. Mixed static/dynamic profiling fordictionary based code compress [M], International Symposium on System-on-Chip,2003:159-163.
    [87] Austin C D, Ash J N, Moses R L. Dynamic dictionary algorithms for model order andparameter estimation [J]. IEEE Transactions on Signal Processing,2013,61(20):5117-5130.
    [88] Ziv J, Lempel A. Compression of individual sequences via variable-rate coding [J].IEEE Transactions on Information Theory,1978,24(5):530-536.
    [89] Wang Z H, Yang H R, et al. A high-performance reversible data-hiding scheme forLZW codes [J]. Journal of Systems and Software,2013,86(11):2771-2778.
    [90] Crain T, Gramoli V, Raynal M. A speculation-friendly binary search tree [J]. ACMSIGPLAN Notices,2012,47(8):161-170.
    [91] Witten I H, Moffat A, Bell T C. Managing Gigabytes: Compressing and IndexingDocuments and Images [M]. Managing Gigabytes,1999.
    [92] Larsson N J. Structures of string matching and data compression [D]. Department ofComputer Science, Lund University, Sweden.
    [93] Pinho A J, Pratas D, Ferreira P J S G. Bacteria DNA sequence compression using amixture of finite-context model [A]. The IEEE Workshop on Statistical SignalConference on [C].2011:125-128.
    [94] Cao M D, Dix T I, Allison L, Mears C. A simple statistical algorithm for biologicalsequence compression [A].2007Data Compression Conference on [C].2007:43-52.
    [95] Pinho A J, Ferreira P J S G, Neves A J R, Bastos C A C. On the representability ofcomplete genomes by multiple competing finite-context (Markov) models [J]. PlosOne,2011,6(6): e21588.
    [96] Pinho A J, Pratas D. MFCompress: a compression tool for FASTA and multi-FASTAdata [J]. Bioinformatics,2014,30(1):117-118.
    [97]罗泽举,李艳会,宋丽红,朱思铭.基于隐马尔可夫模型的DNA序列识别[J].华南理工大学学报(自然科学版),2007,35(8):123-126.
    [98] Pratas D, Bastos C A C, Pinho A J, et al. DNA synthetic sequences generation usingmultiple competing Markov models [A]. The IEEE Workshop on Statistical SignalProcessing [C].2011:133-136.
    [99] Pratas D, Pinho A J. Compressing the human genome using exclusively markovmodels [A]. The5th International Conference on Practical Applications ofComputation Biology and Bioinformatics [C].2011:213-220.
    [100]李航.统计学习方法[M].北京:清华大学出版社,2012:171-174.
    [101]国际联盟.万种脊椎动物基因组计划[EB/OL]. https://genome10k.soe.ucsc.edu/,2014-05-09.
    [102] Pail Y K, et al. The Chromosome-Centric Human Proteome Project for catalogingproteins encoded in the genome [J]. Nature Biotechnology,2012,30(3):221-223.
    [103] Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variantsusing mapping quality scores [J]. Genome Research,2008,18(11):1851-1858.
    [104] Li Cong, Ji Zhenzhou, et al. Efficient parallel design for BWT-based DNA sequencesdata multicompression algorithm [A]. Proceeding of International Conference onAutomatic Control and Artificial Intelligence [C].2012:967-970.
    [105] Miller J R, Koren S, Sutton G. Assembly algorithms for next generation sequencingdata [J]. Genomic,2010,95(6):315-327.
    [106] Pevzner P A, Tang H, Waterman M S. An Eulerian path approach to DNA fragmentassembly [J]. PNAS,2001,98(17):9748-9753.
    [107] Neri F, Cotta C. Memetic algorithms and memetic computing optimization: A literaturereview [J]. Swarm and Evolutionary Computation,2012,2:1-14.
    [108] Ong YS, Lim MH, Chen XS. Research frontier: Memetic Computation-Past, Present&Future [J]. IEEE Computational Intelligence Magazine,2010,5(2):24-36.
    [109]张军,詹志辉,等.计算智能[M].北京:清华大学出版社,2009:177-194.
    [110] Wang P, Tang K, E.P.K T, Yao X. A memetic genetic programming with decisiontree-based local search for classification problems [A]. IEEE EvolutionaryComputation Congress on [C].2011:917-924.
    [111] Sengupta A, Chakraborti T, Konar A, Eunjin K, Nagar AK. An adaptive memeticalgorithm using a synergy of differential evolution and learning automata [A]. IEEECongress on Evolutionary Computation [C].2012:1-8.
    [112] Zhu Z X, Ji Z, Jia S. Memetic ant colony optimization for band selection ofhyperspectral imagery classification [A].2010Chinese Pattern RecognitionConference on [C].2010:1012-1017.
    [113] Ezzat A, Abdelbar A M, Wunsch II D C. A bare-bones ant colony optimizationalgorithm that performs competitively on the sequential ordering problem [J]. MemeticComputing,2014,6(1):19-29.
    [114] Ni JC, Li L, Qiao F, Wu QD. A novel memetic algorithm based on the comprehensivelearning PSO [A]. The IEEE Congress on Evolutionary Computation Conference on
    [C].2012:1-8.
    [115] Wang H F, Yang S X, Ip W H, Wang D W. A memetic particle swarm optimizationalgorithm for dynamic multi-modal optimization problems [J]. International Journal ofSystems Science,2012,43(7):1268-1283.
    [116] Iacca G, Neri F, Mininno E, Ong YS, Lim MH. Ockham’s Razor in memetic computing:three stage optimal memetic exploration [J]. Information Science,2012,188:17-43.
    [117] Gao X Z, Wang X, Zenger K, Wang X F. A bee foraging-based memetic harmonysearch method [A]. IEEE International Systems, Man and Cybernetics Conference on[C].2012:184-189.
    [118] Zhang Y W, Wang L, Wu Q D. Differential annealing for global optimization [A]. The3rd International Advances in Swarm Intelligence Conference on [C].2012:382-389.
    [119] Neri F, Weber M, Caraffini F, Poikolainen I. Meta-Lamarckian learning in three stageoptimal memetic exploration [A]. The12th Workshop on Computational IntelligenceConference on [C].2012:1-8.
    [120] Chen XS, Ong YS. A conceptual modeling of meme complexes in stochastic search [J].IEEE Trans on Systems, Man and Cybernetics Part C: Applications and Review,2012,42(5):612-625.
    [121] Li H J, Ni B, W M H, et al. A fast CUDA implementation of agrep algorithm forapproximate nucleotide sequence matching [A].2011IEEE9th Symposium onApplication Specific Processors Conference on [C].2011:74-77.
    [122] Xie H Y, Zhang M. Impacts of sampling strategies in tournament selection for geneticprogramming [J]. Soft Computing,2012,16(4):615-633.
    [123] Long M and Tan L. A chaos-based data encryption algorithm for image/video[A].20102nd International MultiMedia and Information Technology Conference on [C].2010:172-175.
    [124]戴朝华.搜寻者优化算法及其应用研究[D],博士学位论文,西南交通大学,2009.
    [125] Rakshit P, Konar A, et al. Realization of an adaptive memetic algorithm usingdifferential evolution and q-learning: A case study in multirobot path planning [J].IEEE Transactions on Systems, Man, and Cybemetics: Systems,2013,43(4):814-831.
    [126] Piotrowski, A P. Adaptive memetic differential evolution with global and localneighborhood-based mutation operators [J]. Information Science,2013,241:164-194.
    [127] Zhu J, Yan X F. Adaptive variable space differential evolution algorithm based onpopulation distribution [J]. Memetic Computing,2013,5(1):49-64.
    [128] Tan Li, Sun Jifeng. A spring oscillator model used for particle swarm optimizer [A].IEEE Symposium Series on Computational Intelligence Conference on [C].2013:90-94.
    [129] Lozano M, Molina D and Herrera F. Editorial scalability of evolutionary algorithmsand other metaheuristics for large-scale continuous optimization problems [J]. SoftComputing,2011,15(11):2085-2087.
    [130] Bilal Alatas, Erhan Akin, A Bedri Ozer. Chaos embedded particle swarm optimizationalgorithms [J]. Chaos, Solitons and Fractals,2009,40(4):1715-1734.
    [131] Molina D, Lozano M, Sanchez AM, Herrera F. Memetic algorithms based on localsearch chains for large scale continuous optimization problems: MA-SSW-Chains [J].Soft Computing,2011,15(11):2201-2220.
    [132] Wang H F, Moon L, Yang S X, Wang D W. A memetic particle swarm optimizationalgorithm for multimodal optimization problems [J]. Information Sciences,2012,197:38-52.
    [133] SCI2S research group. Statistical Inference in Computational Intelligence and DataMining [DB/OL]. http://sci2s.ugr.es/sicidm/,2014-05-13.
    [134] García S, Fernández A, Luengo J, Herrera F. Advanced nonparametric tests formultiple comparisons in the design of experiment in computational intelligence anddata mining: experiment analysis of power [J]. Information Sciences,2010,180(10):2044-2064.
    [135] Wikipedia. Move-to-front Transformation [DB/OL]. http://en.wikipedia.org/wiki/Move-to-front-transform,2013-12-09.
    [136] Shamir. Gil I. Universal lossless compression with unknown alphabets-the averagecase [J]. IEEE Transactions on Information Theory,2006,52(11):4915-4944.
    [137] Kuruppu S, Puglisi S J, Zobel J. Optimized relative Lempel-ziv compression ofgenomes [A]. The34th Australasian Computer Science Conference on [C].2011:91-98.
    [138] Solovyov A, Lipkin W. Centroid based clustering of high throughput sequencing readsbased on n-mer counts [J]. BMC Bioinformatics,2013,14(1):268.
    [139] BarileéB. Baridam. Investigating the particle swarm optimization clustering method onnucleic acid sequences [J]. International Journal of Jnnovative Technology&CreativeEngineering,2011,1(5):32-40.
    [140] Kong Dexi and Kong Rui. A fast and effective kernel-based K-means clusteringalgorithm [A].20133rd International Conference on Intelligent System Design andEngineering Applications (ISDEA)[C].2013:58-61.
    [141] Kyoto University Bioinformatics Center. Multiple Sequence Alignment by Clustal[DB/OL]. http://www.genome.jp/tools/clustalw/,2014-05-13.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700