iRSpot-GAEnsC: identifing recombination spots via ensemble classifier and extending the concept of Chou’s PseAAC to formulate DNA samples
  • Authors: Muhammad Kabir ; Maqsood Hayat
  • Keywords: TNC ; DNC ; DNA ; SVM ; PNN
  • Journal: Molecular Genetics and Genomics
  • Publication date: February 2016
  • Volume: 291
  • Issue: 1
  • Pages: 285-296
  • Full-text size: 919 KB
  • Author affiliations: Muhammad Kabir (1)
    Maqsood Hayat (1)

    1. Department of Computer Science, Abdul Wali Khan University, Mardan, KP, Pakistan
  • Journal category: Biomedical and Life Sciences
  • Journal subjects: Life Sciences
    Cell Biology
    Biochemistry
    Microbial Genetics and Genomics
  • Publisher: Springer Berlin / Heidelberg
  • ISSN: 1617-4623
Abstract
Meiotic recombination is vital for maintaining sequence diversity in the human genome. Meiosis and recombination are considered essential phases of cell division. In meiosis, the genome is divided into equal parts for sexual reproduction, whereas in recombination, diverse genomes are combined to form new combinations of genetic variations. Recombination does not occur randomly across the genome; it targets specific regions called recombination “hotspots” and “coldspots”. Owing to the rapid growth of polygenetic sequences in data banks, it is impractical to characterize these sequences by conventional methods. Given the significance of recombination spots, it is indispensable to develop an accurate, fast, robust, and high-throughput automated computational model. In the proposed model, numerical descriptors are extracted using two sequence representation schemes: dinucleotide composition (DNC) and trinucleotide composition (TNC). The performance of seven classification algorithms was investigated. Finally, the predicted outcomes of the individual classifiers are fused into an ensemble classifier, formed through majority voting and through a genetic algorithm (GA). The GA-based ensemble model outperforms both the individual classifiers and the majority-voting ensemble model. iRSpot-GAEnsC achieved 84.46% accuracy. The empirical results show that iRSpot-GAEnsC performs better not only than the examined algorithms but also than existing methods in the literature. It is anticipated that the proposed model might be helpful for the research community, academia, and drug discovery.
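
As a rough illustration of the two feature schemes and the simpler of the two fusion rules named in the abstract, the sketch below computes k-mer composition vectors (k = 2 for DNC, k = 3 for TNC) and fuses classifier outputs by majority vote. The function names `kmer_composition` and `majority_vote` are hypothetical, the normalization by overlapping-window count is an assumption, and the GA-based fusion is not reproduced here; this is not the authors' implementation.

```python
from collections import Counter
from itertools import product


def kmer_composition(seq, k):
    """Return normalized frequencies of all 4**k k-mers over A, C, G, T.

    Assumption: overlapping windows, frequencies normalized by the number
    of windows that contain only A/C/G/T. k=2 gives a 16-dimensional DNC
    vector, k=3 a 64-dimensional TNC vector.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def majority_vote(predictions):
    """Fuse per-classifier label lists into one label per sample.

    `predictions` holds one list of labels per classifier, all of equal
    length. With an odd number of binary classifiers no ties can occur.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


if __name__ == "__main__":
    dnc = kmer_composition("ACGTACGT", 2)
    tnc = kmer_composition("ACGTACGT", 3)
    print(len(dnc), len(tnc))
    # Three hypothetical classifiers voting on three samples.
    print(majority_vote([[1, 0, 1], [1, 1, 0], [0, 0, 1]]))
```

In practice the DNC and TNC vectors would be fed to each individual classifier, and the per-classifier predictions would then be passed to the fusion step.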
