Melting Point Prediction Employing k-Nearest Neighbor Algorithms and Genetic Parameter Optimization
Abstract
We have applied the k-nearest neighbor (kNN) modeling technique to the prediction of melting points. A data set of 4119 diverse organic molecules (data set 1) and an additional set of 277 drugs (data set 2) were used to compare performance in different regions of chemical space, and we investigated the influence of the number of nearest neighbors using different types of molecular descriptors. To compute the prediction on the basis of the melting temperatures of the nearest neighbors, we used four different methods (arithmetic and geometric average, inverse distance weighting, and exponential weighting), of which the exponential weighting scheme yielded the best results. We assessed our model via a 25-fold Monte Carlo cross-validation (with approximately 30% of the total data as a test set) and optimized it using a genetic algorithm. Predictions for drugs based on drugs (separate training and test sets each taken from data set 2) were found to be considerably better [root-mean-squared error (RMSE) = 46.3 °C, r² = 0.30] than those based on nondrugs (prediction of data set 2 based on the training set from data set 1, RMSE = 50.3 °C, r² = 0.20). The optimized model yields an average RMSE as low as 46.2 °C (r² = 0.49) for data set 1, and an average RMSE of 42.2 °C (r² = 0.42) for data set 2. It is shown that the kNN method inherently introduces a systematic error in melting point prediction. Much of the remaining error can be attributed to the lack of information about interactions in the liquid state, which are not well-captured by molecular descriptors.
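
The four neighbor-combination schemes and the 25-fold Monte Carlo cross-validation named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything here is an assumption made for illustration: the function names `combine_neighbors` and `monte_carlo_splits`, the `eps` guard, the use of kelvin, and in particular the `exp(-d)` form of the exponential weights, which may differ from the weighting the authors actually used.

```python
import math
import random


def combine_neighbors(temps, dists, scheme="exponential", eps=1e-9):
    """Combine the melting points of the k nearest neighbors into one prediction.

    temps: melting points of the neighbors (positive values, e.g. in kelvin)
    dists: descriptor-space distances to those neighbors, in the same order
    """
    if len(temps) != len(dists) or not temps:
        raise ValueError("temps and dists must be non-empty and of equal length")
    if scheme == "arithmetic":
        return sum(temps) / len(temps)
    if scheme == "geometric":
        # Needs strictly positive temperatures, hence kelvin rather than Celsius.
        return math.exp(sum(math.log(t) for t in temps) / len(temps))
    if scheme == "inverse_distance":
        # eps avoids division by zero when a neighbor coincides with the query.
        weights = [1.0 / (d + eps) for d in dists]
    elif scheme == "exponential":
        # Assumed form: weights decay as exp(-d); the published formula may differ.
        weights = [math.exp(-d) for d in dists]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return sum(w * t for w, t in zip(weights, temps)) / sum(weights)


def monte_carlo_splits(n_samples, n_rounds=25, test_fraction=0.3, seed=0):
    """Yield (train_indices, test_indices) for repeated random hold-out splits."""
    rng = random.Random(seed)
    n_test = round(n_samples * test_fraction)
    for _ in range(n_rounds):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]
```

In this sketch, closer neighbors receive larger weights under both distance-based schemes, while the two averages ignore distance entirely. Each cross-validation round draws a fresh random test set of roughly 30% of the data, so a molecule may appear in the test set of several rounds.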
