Protein Sequence Analysis Methods Based on k-word Position Sequences and Their Applications

English title (as published): Based on k-word Position Sequence Analysis Methods of Protein Sequence and Application
Authors: WANG Lei (王磊); GAO Bing-tao (高冰涛); XUE Xiao-long (薛晓龙); XIE Xiao-li (解小莉)
Keywords: alignment-free sequence analysis; genetic distance matrix; k-word; protein sequence
English keywords (as published): free-alignment analysis; similarity matrix; k-word; protein sequence
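Journal code: SSJS
Journal: Mathematics in Practice and Theory
Institutions: College of Science, Northwest A&F University; College of Information Engineering, Northwest A&F University
Publication date: 2017-10-08
Publisher: 数学的实践与认识 (Mathematics in Practice and Theory)
Year: 2017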
Volume: 47
Language: Chinese
Record ID: SSJS201719019
Page count: 8
Issue: 19
CN: 11-2018/O1
Pages: 160-167

Abstract

Based on the k-word position sequences of protein sequences, this paper uses two quantities as numerical features of a protein sequence: the normalized k-word average interval distance (called the "average relative distance" in the published English abstract) and an improved version of it. For a fixed k, each protein sequence is represented by a 20^k-dimensional feature vector, and a method for comparing the similarity of protein sequences is given on that basis. Both methods are applied to the alignment-free analysis of the ND5 protein sequences of 9 species and the ND6 protein sequences of 8 species, producing similarity matrices and hierarchical cluster analyses. Cross-validation indicates that the improved normalized k-word average interval distance method is more accurate than the original normalized method.
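
The k-word position idea in the abstract can be illustrated with a minimal Python sketch. This record does not give the paper's formulas, so everything specific here is an assumption made for illustration: the feature of a k-word is taken to be the mean gap between its consecutive occurrences divided by the sequence length, k-words that occur fewer than twice get the value 0, and sequences are compared by Euclidean distance. The function names (`kword_positions`, `feature_vector`, `distance_matrix`) are hypothetical and are not taken from the paper.

```python
from itertools import product
from math import sqrt

# The 20 standard amino acids; this fixes the order of the 20**k coordinates.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kword_positions(seq, k):
    """Map each k-word in seq to the list of its 1-based start positions."""
    positions = {}
    for i in range(len(seq) - k + 1):
        positions.setdefault(seq[i:i + k], []).append(i + 1)
    return positions


def feature_vector(seq, k):
    """Return a 20**k-dimensional vector, one coordinate per possible k-word.

    Assumed feature (not the paper's formula): the mean gap between
    consecutive occurrences of the k-word, divided by the sequence length.
    k-words occurring fewer than twice get 0.0.
    """
    positions = kword_positions(seq, k)
    n = len(seq)
    vector = []
    for letters in product(AMINO_ACIDS, repeat=k):
        pos = positions.get("".join(letters), [])
        if len(pos) < 2:
            vector.append(0.0)
        else:
            # Consecutive gaps sum to (last - first), so this is their mean.
            mean_gap = (pos[-1] - pos[0]) / (len(pos) - 1)
            vector.append(mean_gap / n)
    return vector


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(seqs, k):
    """Pairwise distance matrix between the feature vectors of seqs."""
    vectors = [feature_vector(s, k) for s in seqs]
    return [[euclidean_distance(u, v) for v in vectors] for u in vectors]
```

Calling `distance_matrix(sequences, k)` on a list of protein strings gives a symmetric matrix with a zero diagonal, which could then be passed to a hierarchical clustering routine. Because the vector length grows as 20^k, small values of k are the practical choice for a dense representation like this one.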

