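Abstract
This paper applies the natural vector method proposed by Yau (2011) to the evolutionary analysis of three groups of viruses: HIV-1, HIV-2, and PLV. We use Python to compute the natural vector of each sequence and the pairwise distances between sequences, and then use MEGA to draw a phylogenetic tree from the resulting distance matrix. To test the feasibility of the method, we selected 20 complete genome sequences from the HIV Sequence Database and built phylogenetic trees with both our method and traditional multiple sequence alignment (MSA). The results show that the tree produced by our method is clearly better than the one produced by MSA, and that our method also takes less time. Our method can therefore serve as a useful tool for research on the evolution of HIV.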
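The two computational steps named in the abstract, computing a natural vector per sequence and then the pairwise distances, can be sketched as follows. This is a minimal illustration and not the authors' code. It assumes the commonly used 12-dimensional form of the natural vector (for each of A, C, G, T: the count, the mean position, and the normalized second central moment of the positions) and the Euclidean distance between vectors; the abstract itself does not state the dimension or the metric. The function names `natural_vector` and `distance_matrix` are hypothetical.

```python
import math

NUCLEOTIDES = "ACGT"


def natural_vector(seq):
    """Return a 12-dimensional natural vector for a DNA sequence.

    Assumed layout (not specified in the abstract):
    (n_A, n_C, n_G, n_T, mu_A, ..., mu_T, D2_A, ..., D2_T), where for
    nucleotide k, n_k is its count, mu_k the mean of its 1-based
    positions, and D2_k = sum((pos - mu_k)^2) / (n_k * n).
    """
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for base in NUCLEOTIDES:
        # 1-based positions at which this nucleotide occurs
        positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
        n_k = len(positions)
        if n_k == 0:
            # Convention for an absent nucleotide: all three components are 0
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


def distance_matrix(seqs):
    """Pairwise Euclidean distances between the natural vectors of seqs."""
    vectors = [natural_vector(s) for s in seqs]
    return [[math.dist(u, v) for v in vectors] for u in vectors]


if __name__ == "__main__":
    # Toy sequences for illustration only, not data from the paper
    for row in distance_matrix(["ACGT", "AATT", "ACGTACGT"]):
        print(" ".join(f"{d:8.4f}" for d in row))
```

Because each sequence is reduced to a fixed-length vector in one pass, the cost is linear in sequence length plus a quadratic number of cheap vector comparisons, with no alignment step. The resulting matrix would still need to be written in a format the tree-building software accepts.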
References
[1] Amano K, Nakamura H, Ichikawa H. Self-organizing clustering: a novel non-hierarchical method for clustering large amount of DNA sequences[J]. Genome Informatics, 2003, 14:575-576.
[2] Emrich S J, Kalyanaraman A, Aluru S. Algorithms for large-scale clustering and assembly of biological sequence data[J]. Handbook of Computational Molecular Biology, 2006:13.1-13.30.
[3] FitzGerald P C, Shlyakhtenko A, Mir A A, et al. Clustering of DNA sequences in human promoters[J]. Genome research, 2004, 14(8):1562-1574.
[4] Waterman M S. Introduction to computational biology: maps, sequences and genomes[M]. CRC Press, 1995.
[5] Abe T, Kanaya S, Kinouchi M, et al. Informatics for unveiling hidden genome signatures[J]. Genome research, 2003, 13(4):693-702.
[6] Chuzhanova N A, Jones A J, Margetts S. Feature selection for genetic sequence classification[J]. Bioinformatics (Oxford, England), 1998, 14(2):139-143.
[7] Karlin S, Ladunga I. Comparisons of eukaryotic genomic sequences[J]. Proceedings of the National Academy of Sciences, 1994, 91(26):12832-12836.
[8] Nakashima H, Ota M, Nishikawa K, et al. Genes from nine genomes are separated into their organisms in the dinucleotide composition space[J]. DNA Research, 1998, 5(5):251-259.
[9] Yau S S T, Wang J, Niknejad A, et al. DNA sequence representation without degeneracy[J]. Nucleic acids research, 2003, 31(12):3078-3080.
[10] Liu L, Ho Y, Yau S. Clustering DNA sequences by feature vectors[J]. Molecular phylogenetics and evolution, 2006, 41(1):64-69.
[11] Yau S S T, Yu C, He R. A protein map and its application[J]. DNA and cell biology, 2008, 27(5):241-250.
[12] Carr K, Murray E, Armah E, et al. A rapid method for characterization of protein relatedness using feature vectors[J]. PLoS One, 2010, 5(3):e9550.
[13] Yu C, Liang Q, Yin C, et al. A novel construction of genome space with biological geometry[J]. DNA research, 2010, 17(3):155-168.
[14] Larkin M A, Blackshields G, Brown N P, et al. Clustal W and Clustal X version 2.0[J]. Bioinformatics, 2007, 23(21):2947-2948.
[15] Edgar R C. MUSCLE: a multiple sequence alignment method with reduced time and space complexity[J]. BMC Bioinformatics, 2004, 5(1):113.
[16] Katoh K, Misawa K, Kuma K, et al. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform[J]. Nucleic acids research, 2002, 30(14):3059-3066.
[17] Wang L, Jiang T. On the complexity of multiple sequence alignment[J]. Journal of computational biology, 1994, 1(4):337-348.
[18] Musto H, Cacciò S, Rodríguez-Maseda H, et al. Compositional constraints in the extremely GC-poor genome of Plasmodium falciparum[J]. Memórias do Instituto Oswaldo Cruz, 1997, 92(6):835-841.
[19] Deng M, Yu C, Liang Q, et al. A novel method of characterizing genetic sequences: genome space with biological distance and applications[J]. PLoS One, 2011, 6(3):e17293.