大规模生物序列分析的高性能算法和模型

英文题名：High Performance Algorithms and Models for Large-scale Biological Sequence Analysis
作者：杨矫云
论文级别：博士
学科专业名称：计算机软件与理论
中文关键词：大规模数据 ; SNP ; 单体型 ; 模体 ; 最长公共子序列 ; 网络流 ; 图搜索
英文关键词：Large scale data ; SNP ; haplotype ; motif ; longest common subsequence ;
英文关键词：flow network ; graph search
学位年度：2014
导师：陈国良 ; 徐云
学科代码：081202
学位授予单位：中国科学技术大学
论文提交日期：2014-05-01

摘要

随着测序技术的发展,生物序列的规模呈现爆炸性增长,目前生物信息学中的计算方法与技术如何应对快速增长的序列数据,已成为当前生物信息学迫切需要解决的问题。为了适应大规模生物序列数据的分析和计算,本文主要从三个层面研究了数据组织、算法设计和并行化加速。数据组织就是建立数据表示和组织的模型,模型能尽量给出全局信息以及有利于分析和计算的效率提高；算法设计就是给出适应大规模数据处理的高效算法,算法具有低时间复杂度或尽可能短的时间内输出好的解(即尽快求解算法)；并行化加速是实际大规模数据处理必须考虑的手段,着重要解决算法的并行化与有效的负载平衡。
     本文选取生物信息学中单体分型、模体发现和最长公共子序列三个重要的生物序列分析问题,来探究大规模生物序列数据处理中的关键技术和方法。本文的主要工作有：
     (1)单体分型问题：单体型是单条染色体上特异位点组成的序列,与人类疾病密切相关。生物实验测序通常得到两条单体型合并而成的基因型,因此需要将基因型分型成单体型。本文研究群体数据集的单体分型问题,首先建立了网络流模型,并在该模型上对已有的分型规则进行分析和综合,归纳出新的启发式知识,进而设计了新的单体分型算法FNphasing。在大规模数据集上,计算实验表明FNphasing算法的时间性能显著优于已有的算法,且精度也达到了目前最优。
     (2)模体发现问题：模体是生物序列中一些重复出现、保守的区域,通常具有重要的生物功能,通过发现模体可以帮助了解生命机体的原理和特征。本文研究(l,d)模体发现问题,首先采用新哈希策略来减少存储的潜在模体数目,进一步设计了新的剪枝策略,降低了算法的平均时间复杂度。在挑战性实例的求解上,计算实验表明新算法CVoting的时间性能比已有算法降低一个数量级,且空间消耗更少。
     (3)最长公共子序列问题：寻找序列间的最长公共子序列是序列相似性鉴定的一种重要手段,序列间的相似性可以作为物种共同起源的证据。本文研究多序列最长公共子序列(MLCS)问题,首先将该问题转化为图搜索,然后采用迭代最佳优先搜索策略设计了尽快求解算法Pro-MLCS,计算实验表明Pro-MLCS算法一般在总运行时间的前3%时间内即可输出最优解。在Pro-MLCS算法的基础上,进一步设计了空间增长缓慢的SA-MLCS算法和空间受限的SLA-MLCS算法。SA-MLCS算法采用迭代beam加宽的搜索策略,使得其找到与Pro-MLCS算法相同解所消耗的空间要少得多;而SLA-MLCS算法采用替换策略,使得其在SA-MLCS算法达到空间限制后能够继续搜索更好的解,进一步提高了可解问题规模。计算实验表明,在给定的空间限制内,SA-MLCS算法与SLA-MLCS算法能够处理的数据规模比Pro-MLCS算法高一个数量级。最后设计了Pro-MLCS算法的并行化版本：DPro-MLCS和DSDPro-MLCS,前者适用于分布式环境,后者适用于分布式-共享分层存储的集群环境。计算实验反映,二者均能达到了线性加速,且具有良好的尽快求解性能。
     本文所研究的大规模生物序列数据处理中的关键技术和方法,其主要创新之处如下：
     (1)数据组织：贡献在于全局表示模型的建立。对于单体型问题,本文构建了单体分型全局视图的网络流模型,该模型包含了原始数据的全局信息,使得单体分型的可行解与模型上的流存在一一对应关系,更有利于设计高效的分型算法。对于模体发现问题,本文采用新的哈希策略,减少了存储的潜在模体数目,使得空间消耗大大降低,减少了空间对大规模数据处理的制约。对于最长公共子序列问题,本文将该问题的解空间组织为搜索图,并转化为在图中寻找最长路径问题,高效的图搜索算法可以在该问题上的得到应用。
     (2)算法设计：贡献在于高效算法和尽快求解算法的设计。对于单体型问题,本文使用网络流模型的全局信息设计了高效的启发式搜索算法FNphasing,其在大规模数据处理的应用中,时间性能显著优于已有算法。对于模体发现问题,本文设计了新的剪枝算法减少哈希表的访问次数,使得新算法的平均时间复杂度达到目前最好。对于最长公共子序列问题,本文设计了尽快求解算法模式和空间受限的尽快求解算法模式。相比于已有的算法,尽快求解算法Pro-MLCS在求得相同解的情形下时间性能降低了一个数量级,而空间受限的尽快求解算法SLA-MLCS在相同的时间与空间限制下可求解问题规模提高了两个数量级。
     (3)并行化：贡献在于尽快求解算法的并行化。本文针对新提出的尽快求解算法,设计了一种跨层并行化策略,使得不同层之间的并行处理成为可能,并利于实现负载均衡,新的并行算法达到了线性加速,且维持了尽快求解性能,能够充分利用大规模集群环境的计算资源,能够处理大规模数据。
With the development of new DNA sequencing technologies, an enormous amount of sequence data has been generated, which presents great challenges for computational biology problems. Therefore, how to deal with the rapid growth of data volumes be-comes a problem needing to be solved urgently. In order to handle the large scale datasets, we study the methods from three different levels, include data organization, algorithm development and parallelization. Data organization is to establish models for data presentation and organization, which contains the global information and could improve the efficiency of data analysis; Algorithm development is to design efficient algorithms for large scale datasets, which have small time complexity or ouput good solutions as soon as possible (i.e. anytime algorithms); Parallelization must be consid-ered when dealing with large scale datasets, and the parallelization of algorithms and load balance should be solved.
     In this dissertation, we choose three important biological sequences analysis prob-lems, including haplotype phasing, motif search and longest common subsequence, to study the methods and technologies of large scale data analysis. The works of this dis-sertation include:
     (1) Haplotype phasing problem:haplotype is composed of the SNPs on a single chromosome, which is related to human diseases. The sequenced data are often geno-types, i.e. the combination of two haplotypes, therefore, the genotypes need to be phased into haplotypes. In this dissertation, we studied the population haplotype phasing prob-lem. At first, we established the flow network model and analyzed the phasing rules on this model in order to obtain some heuristic knowledge. Then, a new haplotype phasing algorithm FNphasing was presented. When applied on large scale data sets, FNphasing is significantly faster than the state-of-the-art algorithms, and also achieves an equal or superior accuracy compared with other approaches.
     (2) Motif search problem:Motifs are recurring and conserved regions in biologi-cal sequences, which have molecular structural or functional features, the identification of them can help us better understand the mechanisms of life. In this dissertation, we studied (I, d) motif search problem. At first, we adopted a new hashing strategy to re-duce the number of potential motifs. Then a new pruning algorithm was designed to reduce the average time complexity. When solving challenging instances, the new al-gorithm CVoting is an order of magnitude faster than current algorithms, and uses much less space.
     (3) Longest common subsequence problem:finding the longest common subse- quence an important method to identify the similarity of sequences, which could serve as the measurement of how two different species are evolutionarily related. In this disser-tation, we studied multiple longest common subsequence (MLCS) problem. At first, we coverted the longest common subsequence problem into a graph search problem. Then, an anytime algorithm Pro-MLCS was designed by using iterative best-first search. The experimental results show that Pro-MLCS could achieve the optimal solution with only3%of the total runnning time. Furthermore, based on Pro-MLCS, two anytime algo-rithms were designed:memory efficient algorithm-SA-MLCS-with small space in-crease rate, and space bounded algorithm, SLA-MLCS. SA-MLCS adopted an iterative beam widening strategy, and could use much less space to find the same quality solution when compared to Pro-MLCS; While SLA-MLCS adopted a replacement strategy, and could continue to find better solutions after SA-MLCS reaches the space bound. The experimental results show SA-MLCS and SLA-MLCS could deal with an order of mag-nitude larger size instances than Pro-MLCS when given the same space bound. At last, we developed two parallel algorithms:DPro-MLCS and DSDPro-MLCS:the former one is for distributed memory architecture and the latter is for hierarchical distributed-shared memory architecture. Both algorithms could achieve linear speedup, and have good anytime property.
     This dissertation focuses on the methods of dealing with large scale biological sequence data, thus the contributions of this dissertation for three levels are as follows:
     (1) Data organization:the contribution is the establishment of the global repre-sentation model. For haplotype phasing problem, this dissertation constructed the flow network model, which contains all the information that could be derived from the orig-inal data, which makes the feasible solutions and the flows one-to-one mapping, which would be useful for the algorithm design. For motif search problem, this dissertation adopted a new hashing strategy to reduce the number of stored potential motifs, which brings down the space usage. For longest common subsequence problem, this disserta-tion reorganized the solution space into a graph, and formulated the problem into finding the longest path in the graph, thus the efficient graph search algorithms could be applied.
     (2) Algorithm development:the contribution is the design of efficient algorithms and anytime algorithms. For haplotype phasing problem, we developed a heuristic al-gorithm FNphasing based on flow network model, which is significantly faster than current algorithms when applied on large scale datasets. For motif search problem, we designed a new pruning method to reduce the access number of hashing table, mak-ing the average time complexity of the new algorithm achieve the best among current algorithms. For longest common subsequence problem, we established an anytime al-gorithm Pro-MLCS and a space bounded anytime algorithm SLA-MLCS. Compared to current algorithms, Pro-MLCS is an order of magnitude faster when achieving the same quality solution and SLA-MLCS could handle an order of magnitude larger size instances when given the same time and space bounds.
     (3) Parallelization:the contribution is the parallelization of anytime algorithm. This dissertation designed a new parallel strategy to parallelize the anytime algorithm for longest common subsequence problem. The new strategy makes the parallel com-putations of different layers possible. The new parallel algorithms could achieve nearly linear speedup, and have good anytime property. It could make full use of the compu-tational resources of clusters to deal with larger scale datasets.

引文

[1]Smith T.F. and Waterman M.S., Identification of common molecular subsequences, Journal of Molecular Biol-ogy,1981,147:195-197.
    [2]http://www.ncbi.nlm.nih.gov/genbank/statistics
    [3]Altschul S.F., Gish W., Miller W., Myers E.W. and Lipman D.J., Basic local alignment search tool, J. Mol. Biol.,1990,215:403-410.
    [4]Pevzner P.A., Tang H. and Waterman M.S., An Eulerian path approach to DNA fragment assembly, Proc. Natl. Acad. Sci.,2001,98:9748-9753.
    [5]Langmead B., Trapnell C., Pop M. and Salzberg S.L., Ultrafast and memory-efficient alignment of short DNA sequences to the human genome, Genome Biol.,2009,10(3):R25.
    [6]Ma B., Tromp J., Li M., PatternHunter:faster and more sensitive homology search, Bioinformatics,2002, 18(3):440-445.
    [7]Browning S.R., and Browning B.L., Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering, Am. J. Hum. Genet,2007,81:1084-1097.
    [8]Williams A.L., Patterson N., Glessner J., Hakonarson H., and Reich D., Phasing of Many Thousands of Geno-typed Samples, Am. J. Hum. Genet.,2012,91(2):238-251.
    [9]Delaneau O., Zagury J.-F. and Marchini J., Improved whole-chromosome phasing for disease and population genetic studies, Nature Methods,2013,10:5-6.
    [10]Wallace J.R., Elmquist K.A. and Haines E.A., A ray tracing algorithm for progressive radiosity, ACM SIG-GRAPH Computer Graphics,1989,23(3):315-324.
    [11]Pietracaprina A., Riondato M., Upfal E. and Vandin F., Mining top-K frequent itemsets through progressive sampling, DATA MINING AND KNOWLEDGE DISCOVERY,2010,21(2):310-326.
    [12]Hoppe H., Progressive meshes, Proceeding SIGGRAPH'96,1996,99-108.
    [13]Hansen E.A. and Zhou R., Anytime heuristic search, Journal of Artificial Intelligence Research (JAIR),2007, 28:267-297.
    [14]Korf R.E., Divide-and-conquer bidirectional search:First results. In Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI-99), (Stockholm, Sweden).1999,1184-1189.
    [15]Korf R.E. and Zhang W, Divide-and-conquer frontier search applied to optimal sequence alignment, In Pro-ceedings of the 17th National Conference on Artificial Intelligence (AAAI-00), (Austin, TX).2000,910-916.
    [16]Korf R.E., Zhang W., Thayer I., and Hohwald H., Frontier search, Journal of the ACM (JACM),2005, 52(5):715-748.
    [17]Chakrabarti P.P., Ghose S., Acharya A. and de Sarkar S.C., Heuristic search in restricted memory, Artif. Intell., 1989,41(2):197-221.
    [18]Russell S., Efficient memory-bounded search methods, In Proceedings of the 10th European Conference on Artificial Intelligence (ECAI-92) (Vienna, Austria),1992,1-5.
    [19]http://www.graph500.org/.
    [20]Munguia L.M., Bader D.A, Ayguade E., Task-based parallel breadth-first search in heterogeneous environ-ments, High Performance Computing (HiPC),201219th International Conference on. IEEE,2012:1-10.
    [21]MerrllD., Garland M. and Grimshaw A., Scalable GPU graph traversal, ACM SIGPLANNotices. ACM,2012, 47(8):117-128.
    [22]http://googleblog.blogspot.com/200S/11/sorting-lpb-with-mapreduce.html.
    [23]Hoehe M.R., Kopke K., Wendel B., Rohde K., Flachmeier C., Kidd K.K., Berrettini W.H. and Church G.M., Sequence variability and candidate gene analysis in complex disease:association of μ opioid receptor gene variation with substance dependence, Human Molecular Genetics,2000,9(19):2895-2908.
    [24]Ben-Hur A. and Brutlag D., Sequence motifs:highly predictive features of protein function. Feature Extraction, 2006,207:625-645.
    [25]Polo S., Sigismund S., Faretta M., Guidi M., Capua M.R., Bossi G., Chen H., De Camilli P. and Di Fiore P.P., A single motif responsible for ubiquitin recognition and monoubiquitination in endocytic proteins, Nature,2002, 416:451-455.
    [26]Sassanfar M. and Szostak J.W., An RNA motif that binds ATP, Nature,1993,364:550-553.
    [27]Jones N.C.and Pevzner P.A., An introduction to bioinformatics algorithms, The MIT Press,2004.
    [28]Pray L.A., Discovery of DNA structure and function:Watson and Crick, Nature Education,2008,1(1):100.
    [29]Russell P.J., iGenetics,3rd ed.,2010.
    [30]Jahani S., Setarehdan S.K., Centromere and length detection in artificially straightened highly curved human chromosomes, International Journal of Biological Engineering,2012,2(5):56-61.
    [31]http://www.bio.miami.edu/dana/pix/transcription_factor.jpg.
    [32]D'haeseleer P., What are DNA sequence motifs?, Nature biotechnology,2006,24(4):423-425.
    [33]Clark S., Inference of Haplotypes from Pcr-Amplified Samples of Diploid Populations, Molecular Biology and Evolution,1990,7:111-122.
    [34]Wang L. and Xu Y., Haplotype Inference by Maximum Parsimony, Bioinformatics,2003,19:1773-1780.
    [35]Excoffier L., and Slatkin M., Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population, Mol. Biol. Evol,1995,12:921-927,.
    [36]Gusfield D., Haplotyping as Perfect Phylogeny:Conceptual Framework and Efficient Solutions, Proceedings of the sixth annual international conference on Computational biology (RECOMB),2002,166-175.
    [37]Stephens M., Smith N.J., and Donnelly P., A new statistical method for haplotype reconstruction from popula-tion data, Am. J. Hum. Genet,2001,68:978-989.
    [38]Hubbell E., Finding a Maximum Parsimony Solution to Haplotype Phase Is NP-Hard, personal comm.,2000.
    [39]Catanzaro D. and Labbe M., The pure parsimony haplotyping problem:overview and computational advances, INFORMS Journal on Computing,2009,16(5):561-584.
    [40]Lancia G., Pinotti M.C., and Rizzi R., Haplotyping Populations by Pure Parsimony:Complexity of Exact and Approximation Algorithms, INFORMS J. Computing,2004,16:348-359.
    [41]Sharan R., Halldorsson B.V., and Istrail S., Islands of Tractabibty for Parsimony Haplotyping, IEEE/ACM Trans. Computational Biology and Bioinformatics,2006,3(3):303-311.
    [42]Zhang Q., Che H., Chen G., and Sun G., A practical algorithm for haplotyping by maximum parsimony, Journal of Software,2005,16(10):1699-1707.
    [43]Tininini L., Bertolazzi P., Godi A., and Lancia G., CollHaps:A Heuristic Approach to Haplotype Inference by Parsimony, IEEE/ACM Trans. Computational Biology and Bioinformatics,2010,7(3):511-523.
    [44]Browning S.R. and Browning B.L., Haplotype phasing:existing methods and new developments, Nature Re-views Genetics,2011,12:703-714.
    [45]Qin Z.S., Niu T. and Liu J.S., Partition-Ligation EM algorithm for haplotype inference with single nucleotide polymorphisms. Am. J. Hum. Genet.,2002,71:1242-1247.
    [46]Zhao Y, Xu Y, Wang Z., Zhang H. and Chen G., A better block partition and ligation strategy for individual haplotyping, Bioinformatics,2008,24(23):2720-2725.
    [47]Brinza D. and Zelikovsky A.,2SNP:scalable phasing based on 2-SNP haplotypes, Bioinformatics,2006, 22(3):371-373.
    [48]Bonizzoni P., Vedova G.D., Dondi R. and Li J., The Haplotyping problem:An overview of computational models and solutions, Journal of Computer Science and Technology,2003,18(6):675-688.
    [49]Bafna V, Gusfield D., Hannenhalli S., and Yooseph S., A Note on Efficient Computation of Haplotypes via Perfect Phylogeny, Journal of Computational Biology,2004, 11(5):858-866.
    [50]Stephens M., and Scheet P., Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation, Am. J. Hum. Genet.,2005,76:449-462.
    [51]Scheet P., and Stephens M., A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase, Am. J. Hum. Genet.,2006,78:629-644.
    [52]Climer S., Templeton A.R., and Zhang W., SplittingHeirs:Inferring Haplotypes by Optimizing Resultant Dense Graphs, Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology (BCB),2010,127-136.
    [53]Daly M.J., Rioux J.D., Schaffner S.F., Hudson T.J. and Lander E.S., High-resolution haplotype structure in the human genome, Nat. Genet.,2001,29:229-232.
    [54]PatilN., Berno A.J., Hinds D.A., Barrett W.A., Doshi J.M., Hacker C.R., Kautzer C.R., Lee D.H., Marjoribanks C., McDonough D.P., Nguyen B.T.N., Norris M.C., Sheehan J.B., Shen N., Stern D., Stokowski R.P., Thomas D.J., Trulson M.O., Vyas K.R., Frazer K.A., Fodor S.P.A. and Cox D.R., Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21, Science,2001,294:1719-1723.
    [55]Lewontin R.C., On measures of gametic disequilibrium, Genetics,1988,120:849-852.
    [56]Barrett J.C., Fry B., Maller J. and Daly M.J., Haploview:analysis and visualization of LD and haplotype maps, Bioinformatics,2005,21:263-265.
    [57]Marchini J., Howie B., Myers S., McVean G. and Donnelly P., A new multipoint method for genome-wide association studies by imputation of genotypes, Nature Genet,2007,39:906-913.
    [58]Lin S., Cutler D.J., Zwick M.E. and Chakravarti A., Haplotype inference in random population samples, Am. J. Hum. Genet,2002,71:1129-1137.
    [59]Stephens M. and Donnelly P., A comparison of bayesian methods for haplotype reconstruction from population genotype data, Am. J. Hum. Genet,2003,73:1162-1169.
    [60]Rieder M.J., Taylor S.L., Clark A.G., and Nickerson D.A., Sequence Variation in the Human Angiotensin Converting Enzyme, Nature Genetics,1999,22:59-62.
    [61]Frazer K.A., et al. A second generation human haplotype map of over 3.1 million SNPs, Nature,2007,449:851-861.
    [62]Hudson R.R., Generating Samples under a Wright-Fisher Neutral Model of Genetic Variation, Bioinformatics, 2002,18:337-338.
    [63]Chin F.Y.L., Leung H.C.M., Voting algorithms for discovering long motifs. Proc. Third Asia-Pacific Bioinfor-matics Conf. (APBC'05),2005,261-271.
    [64]Davila J., Balla S., Rajasekaran S., Fast and practical algorithms for planted (1,d) motif search. IEEE/ACM Transactions on Computational Biology and Bioinformatics,2007,4(4):544-552.
    [65]Rajasekaran S., Balla S., Huang C.-H., Exact algorithms for planted motif chanllenge problems. Journal of Computational Biology,2005,12(8):1117-1128.
    [66]Pavesi G., Mauri G., Pesole G., An algorithm for finding signals of unknown length in DNA sequences. Bioin-formatics,2001,17(Suppl 1):S207-S214.
    [67]Huang C.-W., Lee W.-S., Hsieh S.-Y., An improved heuristic algorithm for finding motif signals in DNA sequences. IEEE/ACM Transactions on Computational Biology and Bioinformatics,2011,8(4):959-975.
    [68]Price A., Ramabhadran S., Pevzner P.A., Finding subtle motifs by branching from sample strings. Bioinfor-matics,2003,19(Suppl 2):ii149-ii155.
    [69]Ruhler J., Tompa M., Finding motifs using random projections. Journal of Computational Biology,2002, 9(2):225-242.
    [70]Bailey T.L., Elkan C., Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning,1995,21:51-80.
    [71]Lawrence C.E., Reilly A.A., An expectation maximization (EM) algorithm for the identification and character-ization of common sites in unaligned biopolymer sequences. Proteins:Structure, Function and Bioinformatics, 1990,7:41-51.
    [72]Lawrence C.E., Altschul S.F., Boguski M.S., Liu J.S., Neuwald A.F., Wootton J.C.:Detecting subtle sequence signals, a Gibbs sampling strategy for multiple alignment. Science,1993,262:208-214.
    [73]Yin M.M., Wang J.T.L., Effective hidden Markov models for detecting splicing junction sites in DNA se-quences. Information Sciences,2001,139(1-2):139-163.
    [74]Evans P.A., Smith A.D., Wareham H.T., On the complexity of finding common approximate substrings. The-oretical Computer Science,2003,306:407-430.
    [75]Pevzner P.A., Sze S.H., Combinatorial approaches to finding subtle signals in DNA sequences. Proc. Eighth Int'l Conf. Intelligent Systems for Molecular Biology,2000,269-278.
    [76]Eskin E., Pevzner P.A., Finding composite regulatory patterns in DNA sequences. Bioinformatics,2002, 18(1):353-363.
    [77]Evans P.A., Smith A.D., Toward optimal motif enumeration. Algorithms and Data Structures,2003,2748:47-58.
    [78]Pisanti N., Carvalho A.M., Marsan L., Sagot M.-F., RISOTTO:Fast extraction of motifs with mismatches. LATIN 2006:Theoretical Informatics,2006,3887:757-768.
    [79]Sagot M.-F., Spelling approximate repeated or common motifs using a suffix tree. Latin'98:Theoretical Infor-matics,1998,1380:111-127.
    [80]Sun H.Q., Low M. Y.H., Hsu W. J. and Rajapakse J.C., RecMotif:a novel fast algorithm for weak motif discov-ery, BMC Bioinformatics,2010, 11(Suppl 11):S8.
    [81]Sun H.Q. Low M.Y.H., Hsu W.J., Tan C.W. and Rajapakse J.C., Tree-structured algorithm for long weak motif discovery, Bioinformatics,2011,27(19):2641-2647.
    [82]Ho E.S., Jakubowski C.D. and Gunderson S.I., iTriplet, a rule-based nucleic acid sequence motif finder, Algo-rithms for Molecular Biology,2009,4:14.
    [83]Y. Zhao, Studies and applications of some biological sequence analysis algorithms, University of Science and Technology of China,2010.
    [84]Satya R. V. and Mukherjee A., New algorithms for finding monad patterns in DNA sequences, String Processing and Information Retrieval,2004,3246:273-285.
    [85]Smith T.F. and Waterman M.S., Identification of common molecular subsequences, Journal of Molecular Biol-ogy,1981,147:195-197.
    [86]SankoffD., Matching sequences under deletion/insertion constraints, Proc. Natl. Acad. Sci. USA,1972,69:4-6.
    [87]Hsu W.J. and Du M.W., Computing a longest common subsequence for a set of strings, BIT,24(1):45-59.
    [88]Maier D., The Complexity of Some Problems on Subsequences and Supersequences, J. ACM,1978,25:322-336.
    [89]Apostolico A., Browne S., and Guerra C., Fast linear-space computations of longest common subsequences, Theoretical Computer Science,1992,92(1):3-17.
    [90]Masek W.J. and Paterson M.S., A faster algorithm computing string edit distances, J. Comput. Syst. Sci.,1980, 18-31.
    [91]Hirschberg D.S., Algorithms for the longest common subsequence problem, J. ACM,1977,24:664-675.
    [92]Hakata K. and Imai H., Algorithms for the Longest Common Subsequence Problem, Genome Informatics Work-shop Ⅲ,1992,53-56.
    [93]Hakata K. and Imai H., Algorithms for the Longest Common Subsequence Problem for Multiple Strings Based on Geometric Maxima, Optimization Methods and Software,1998,10:233-260.
    [94]Wang Q., Korkin D., and Shang Y, A Fast Multiple Longest Common Subsequence (MLCS) Algorithm, IEEE Transactions on Knowledge and Data Engineering,2011,23(3):321-334.
    [95]Kossmann D., Ramsak F., and Rost S., Shooting Stars in the Sky:an Online Algorithm for Skyline Queries, Very Large Data Bases Conference (VLDB),2002,275-286.
    [96]Papadias D., Tao Y., Fu G. and Seeger B., Progressive skyline computation in database systems, ACM Trans-actions on Database Systems,2005,30:41-82.
    [97]Jiang T., and Li M., On the approximation of shortest common supersequences and longest common subse-quences, SIAM Journal on Computing,1995,24:1122-39.
    [98]Bonizzoni P., Vedova G.D., and Mauri G., Experimenting an approximation algorithm for the LCS, Discrete Applied Mathematics,2001,110:13-24.
    [99]Huang K., Yang C., and Tseng K., Fast algorithms for finding the common subsequence of multiple sequences, In:Proceedings of international computer symposium,2004,1006-1011.
    [100]Shyu S.J., and Tsai C.-Y, Finding the longest common subsequence for multiple biological sequences by ant colony optimization, Comput. Oper. Res.,2009,36(1):73-91.
    [101]Bluma C., Blesaa M.J., and Lopez-Ibanez M., Beam search for the longest common subsequence problem, Comput. Oper. Res.,2009,36(12):3178-3186.
    [102]Wang Q., Pan M., Shang Y. and Korkin D., A Fast Heuristic Search Algorithm for Finding the Longest Com-mon Subsequence of Multiple Strings, Proceedings of the Twenty-Fourth AAAI Conference on Artificial In-telligence,2010,1287-1292.
    [103]Aggarwal A. and Park J., Notes on Searching in Multidimensional Monotone Arrays, Proc.29th Ann. IEEE Symp. Foundations of Comput. Sci.,1988,497-512.
    [104]Apostolico A., Atallah M., Larmore L. and Mcfaddin S., Efficient parallel algorithms for string editing and related problems, SIAM J Computing,1990,19:968-988.
    [105]Lu M. and Lin H., Parallel algorithms for the longest common subsequence problem, IEEE Transaction on Parallel and Distributed System,1994,5:835-848.
    [106]Allison L. and Dix T.L., A bit-string longest common subsequence algorithm, Inform. Process. Lett.,1986, 23(6):305-310.
    [107]Crochemore M., Iliopoulos C.S., Pinzon YJ., Reid J.F., A fast and practical bit vector algorithm for the longest common subsequence problem, Inform. Process. Lett.,2001,80(5):279-285.
    [108]Hyyro H., Bit-parallel LCS-length computation revisited, In Proc.15th Australasian Workshop on Combina-torial Algorithms AWOCA,2004,16-27.
    [109]Luce G., Myoupo J.F., Systolicbased parallel architecture for the longest common subsequences problem, Integration,1998,25(1):53-70.
    [110]Freschi V. and Bogliolo A., Longest common subsequence between run-length-encoded strings:a new algo-rithm with improved parallelism, Information Processing Letters,2004,90(4):167-173.
    [111]Ukiyama N. and Imai H., Parallel multiple alignments and their implementation on CM5, Genome Informatics, 1993,103-108.
    [112]Chen Y, Wan A. and Liu W., A fast parallel algorithm for finding the longest common sequence of multiple biosequences, BMC Bioinformatics,2006,7(Suppl 4):S4.
    [113]Korkin D., Wang Q. and Shang Y., An Efficient Parallel Algorithm for the Multiple Longest Common Sub-sequence (MLCS) Problem, ICPP'08:Proc.37th Intl. Conf. on Parallel Processing,2008,354-363.
    [114]Likhachev M., Gordon G. and Thrun S., ARA*:Anytime A* with provable bounds on sub-optimality, Ad-vances in Neural Information Processing Systems (NIPS),2003,16.
    [115]Likhachev M., Ferguson D., Gordon G., Stentz A. and Thrun S., Anytime search in dynamic graphs, Artificial Intelligence,2008,172(14):1613-1643.
    [116]van den Berg J., Shah R., Huang A. and Goldberg K., ANA*:Anytime Nonparametric A, Proceedings of AAAI Conference on Artificial Intelligence,2011.
    [117]Aine S., Chakrabarti P.P. and Kumar R., AWA*-a window constrained anytime heuristic search algorithm, International Joint Conference on Artificial Intelligence, Morgan Kaufmann Publishers Inc.,2007,2250-2255.
    [118]Vadlamudi S.G., Aine S. and Chakrabarti P.P., MAWA*-a memory-bounded anytime heuristic-search algo-rithm, IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics:a publication of the IEEE Systems, Man, and Cybernetics Society,2011,41(3):725.
    [119]Zhang W., Complete anytime beam search, Proceedings of the National Conference on Artificial Intelligence, JOHN WILEY and SONS LTD,1998,425-430.
    [120]Zhou R. and Hansen E.A., Beam-stack search:Integrating backtracking with beam search, Proceedings of the International Conference on Automated Planning and Scheduling (ICAPS),2005,90-98.
    [121]Furcy D., ITSA*:Iterative tunneling search with A*, Proceedings of the National Conference on Artificial In-telligence (AAAI), Workshop on Heuristic Search, Memory-Based Heuristics and Their Applications, Boston, Massachusetts.2006.
    [122]Zhou R., Hansen E., Sparse-memory graph search, in:Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI-03), Acapulco, Mexico,2003,1259-1266.
    [123]Zhou R. and Hansen E.A., Sweep-A*:Space-efficient heuristic search in partially ordered graphs. In Proceed-ings of the 15th International Conference on Tools with Artificial Intelligence (ICTAI-03) (Sacramento, CA), 2003,427-434.
    [124]Zhou R. and Hansen E.A., Breadth-first heuristic search, In Proceedings of the 14th International Conference on Automated Planning and Scheduling (ICAPS-04) (Whistler, British Columbia, Canada),2004,92-100.
    [125]Zhou R. and Hansen E.A., Breadth-first heuristic search, Artificial Intelligence,2006,170(4):385-408.
    [126]Kaindl H. and Khorsand A., Memory-bounded bidirectional search, in Proc.12th Nat. Conf. Artif. Intell. (AAAI), (Menlo Park, CA), American Association for Artificial Intelligence,1994,2:1359-1364.
    [127]Zhou R. and Hansen E.A., Memory-bounded A. graph search, in Proc.15th Int. Florida Artif. Intell. Res. Soc. Conf.,2002,203-209.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700