Abstract
The objective of many cheminformatics applications is to search large databases of molecules effectively. For example, in ligand-based screening, searches can identify lead molecules that are most likely to bind to a drug target, typically a protein receptor or enzyme, at the earliest stage of drug discovery. Identifying lead molecules at that early stage may decrease the cost of drug discovery and shorten the overall development time. The ongoing expansion of molecular databases, as new compounds are discovered or synthesized, and the potential exploration of the large chemical space of virtual compounds create a need for improved search quality and efficiency. The work presented here addresses this need in three ways: (1) efficiently computing the statistical significance of similarity scores to give them biological relevance; (2) developing algorithms and data structures that speed up searches of very large databases many-fold; and (3) formulating and validating new similarity metrics for finding biologically relevant molecules given a set of multiple query molecules.