Abstract
Accurately predicting the binding affinities of large sets of diverse protein-ligand complexes is a key challenge in computational biomolecular science, with applications in drug discovery, chemical biology, and structural biology. Because a scoring function (SF) is used to score, rank, and identify drug leads, both its computational complexity and the fidelity with which it predicts the affinity of a ligand candidate for a protein's binding site have a significant bearing on the accuracy and throughput of virtual screening. Despite intense efforts in this area, no universal SF has so far been found that consistently outperforms the others. In this work, we therefore explore a range of novel SFs that employ different machine-learning (ML) approaches in conjunction with a diverse feature set characterizing protein-ligand complexes. We assess the scoring and ranking accuracies and the computational complexity of these new ML-based SFs, as well as those of conventional SFs, on the 2007 and 2010 PDBbind benchmark datasets. We also investigate how the size of the training dataset, the number of features, the protein family, and the novelty of the protein target influence scoring accuracy. Furthermore, we examine the interpretive power of the best ML-based SFs. We find that the best-performing ML-based SF achieves a Pearson correlation coefficient of 0.797 between predicted and measured binding affinities, compared to 0.644 for a state-of-the-art conventional SF. Finally, we show that there is potential for further improvement in our proposed ML-based SFs.
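For readers unfamiliar with the scoring-accuracy metric quoted above, the following is a minimal sketch of how a Pearson correlation coefficient between predicted and measured binding affinities can be computed. The function name `pearson_r` and the affinity values are hypothetical and chosen purely for illustration; they are not taken from this work or from the PDBbind datasets.

```python
import math


def pearson_r(predicted, measured):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    # Sums of squared deviations (unnormalized variances).
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_m)


# Hypothetical affinities (e.g. in pK units) for five complexes; illustrative only.
measured = [4.2, 5.1, 6.3, 7.8, 9.0]
predicted = [4.8, 5.0, 6.9, 7.1, 8.5]

r = pearson_r(predicted, measured)
print(f"Pearson r = {r:.3f}")
```

A value close to 1 indicates that the predicted affinities vary almost linearly with the measured ones. The coefficient measures linear association only, so it does not by itself show that the predictions are correct in absolute terms.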