Abstract
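Based on the κ-word position sequences of a protein sequence, the normalized κ-word average interval distance and an improved normalized κ-word average interval distance are used as numerical features of the protein sequence, and a method for comparing the similarity of protein sequences is given. Finally, the two methods are applied to analyze the similarity of the ND5 protein sequences of 9 species and the ND6 protein sequences of 8 species, and cross-validation shows that the method based on the improved normalized κ-word average interval distance is more accurate than the method based on the normalized κ-word average interval distance.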
This paper uses the normalized κ-word average relative distance method and a new normalized κ-word average relative distance method, and applies both to the alignment-free analysis of protein sequences. For a fixed κ, each method represents a protein sequence as a 20^κ-dimensional feature vector. The ND5 protein sequences of 9 species and the ND6 protein sequences of 8 species were analyzed with the two methods, and similarity matrices and hierarchical cluster analyses were produced. Cross-validation shows that the new normalized κ-word average relative distance method is more accurate than the normalized κ-word average relative distance method.
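The abstract does not state the exact formulas, so the following is only a minimal sketch of how a 20^κ-dimensional κ-word feature vector might be built from position sequences. The normalization (mean gap between consecutive occurrences divided by the number of κ-word windows), the value 0 for words occurring fewer than twice, the Euclidean comparison, and all function names are assumptions made for illustration, not the authors' definitions.

```python
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kword_positions(seq, k):
    """Map each k-word to the list of its 1-based start positions in seq."""
    positions = {}
    for i in range(len(seq) - k + 1):
        positions.setdefault(seq[i:i + k], []).append(i + 1)
    return positions


def feature_vector(seq, k):
    """Return a 20**k-dimensional vector of length-normalized mean gaps.

    Assumed form: for each k-word, the mean distance between consecutive
    occurrences, divided by the number of k-word windows in the sequence.
    """
    positions = kword_positions(seq, k)
    n_windows = len(seq) - k + 1
    vector = []
    for letters in product(AMINO_ACIDS, repeat=k):
        pos = positions.get("".join(letters), [])
        if len(pos) < 2:
            # Assumption: a word seen fewer than twice has no interval.
            vector.append(0.0)
        else:
            mean_gap = (pos[-1] - pos[0]) / (len(pos) - 1)
            vector.append(mean_gap / n_windows)
    return vector


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Under these assumptions, two sequences would be compared with `euclidean(feature_vector(s1, k), feature_vector(s2, k))`, and the pairwise distances would fill a similarity matrix that a clustering routine could consume.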