A New Algorithm of Feature Extraction for Signal Peptide Based on Compressed Sensing and Dynamic Time Warping
  • Original title: 一种基于压缩感知和动态时间规整的信号肽特征提取新算法
  • Authors: Zhang Yanglijun (张洋俐君); Gao Cuifang (高翠芳); Chen Wei (陈卫); Tian Fengwei (田丰伟)
  • Affiliations: School of Science, Jiangnan University; School of Food Science and Technology, Jiangnan University
  • Keywords: signal peptide; dynamic time warping; compressed sensing; feature extraction; machine learning
  • Journal abbreviation: SJCJ
  • Journal: Journal of Data Acquisition and Processing
  • Publication date: 2019-03-15
  • Publisher: Journal of Data Acquisition and Processing (数据采集与处理)
  • Year: 2019
  • Volume/No.: v.34; No.154
  • Funding: National Natural Science Foundation of China, Young Scientists Fund (61402202); China Postdoctoral Science Foundation (2015M581724); Jiangsu Postdoctoral Science Foundation (1401099C); Natural Science Foundation of Jiangsu Province, Youth Fund (BK20150124)
  • Language: Chinese
  • Article ID: SJCJ201902013
  • Page count: 9
  • Issue: 02
  • CN: 32-1367/TN
  • Pages: 113-121
Abstract
Accurate identification of signal peptides is important for protein research and localization. Compressed sensing can reduce redundant information in a biological sequence while preserving its main information, projecting the high-dimensional sequence onto a low-dimensional space for feature extraction. Building on compressed sensing and combining it with the dynamic time warping (DTW) algorithm, this paper proposes a new, highly discriminative feature extraction method for signal peptides. The extracted features reflect the amino acid composition, sequence order, and structure of the signal peptide, and they also align its different regions nonlinearly along the time dimension, providing an effective feature representation for machine learning algorithms. Experimental results show that the feature vectors extracted by the new method achieve recognition rates of 99.65%, 98.05%, and 98.56% on the three datasets Eukaryotes, Gram+ bacteria, and Gram- bacteria, respectively. The method can also be applied easily to the recognition of other biological sequences.
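The two ingredients named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation, which this record does not describe in detail. The integer residue encoding, the seeded Gaussian measurement matrix, the absolute-difference local cost, all function names, and the toy input strings are assumptions chosen only for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def encode(sequence):
    """Map each residue to its index in AMINO_ACIDS (an assumed, illustrative encoding)."""
    return [float(AMINO_ACIDS.index(residue)) for residue in sequence]


def random_projection(signal, n_measurements, length, seed=0):
    """Compute y = Phi x with a seeded Gaussian matrix Phi (n_measurements x length).

    The signal is zero-padded or truncated to `length`, so every sequence
    is measured with the same matrix for a given seed.
    """
    rng = random.Random(seed)
    padded = (signal + [0.0] * length)[:length]
    scale = n_measurements ** -0.5
    return [
        sum(rng.gauss(0.0, 1.0) * scale * value for value in padded)
        for _ in range(n_measurements)
    ]


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with an absolute-difference local cost."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


if __name__ == "__main__":
    # Made-up toy strings, not real signal peptides.
    x = encode("MKALLVAA")
    y = encode("MKKALLLVAA")
    print("compressed features:", random_projection(x, 4, 16))
    print("DTW distance:", dtw_distance(x, y))
```

Because DTW lets one element of a sequence match several elements of the other, sequences of different lengths can be compared directly, which is one way to read the abstract's "nonlinear alignment in the time dimension". How the paper combines the compressed measurements with DTW is not stated in this record, so the sketch keeps the two steps separate.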
