Abstract
Because the search space in drug design is huge, machine learning has become a powerful method for predicting the binding affinity between small-molecule drugs and their target proteins. However, the many available machine learning algorithms, each with a large number of parameters, make it difficult to choose a prediction framework. In this work, we took a recent drug design competition (organized by XtalPi on the DataCastle platform) as a representative case to optimize the parameters of different machine learning algorithms and to identify the most effective one. After parameter optimization, we compared typical decision-tree methods (XGBoost, LightGBM) and artificial neural networks (MLP, CNN) using the root-mean-square error (RMSE) and the coefficient of determination (R²). For this specific affinity prediction problem, with ~160000 samples, the decision-tree methods were more effective than the neural networks, with the ranking LightGBM > XGBoost > CNN > MLP. For a much larger screening task in a more complicated drug design study, a sophisticated neural network model may outperform the decision-tree algorithms once its generalization is improved and overfitting is reduced. Advanced machine learning methods could extract more information about protein-ligand binding than traditional ones and improve the screening efficiency of drug design by up to 200–1000 times.
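The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes the standard textbook definitions, RMSE = sqrt(mean((y − ŷ)²)) and R² = 1 − SS_res/SS_tot; the full paper may define R² differently (for example as a squared correlation coefficient). The function names and the affinity values are hypothetical and are not taken from the competition data.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted affinities (illustrative values only).
measured = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 5.5, 7.5, 7.5]

print(rmse(measured, predicted))       # lower is better
print(r_squared(measured, predicted))  # closer to 1 is better
```

Under these definitions a lower RMSE and an R² closer to 1 both indicate a better model, which is the sense in which the ranking LightGBM > XGBoost > CNN > MLP should be read.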