The structural identification of unknown biochemical compounds in complex biofluids continues to be a major challenge in metabolomics research. Using LC/MS, there are currently two major options
for solving this problem: searching small biochemical databases, which often do not contain the unknown of interest, or searching large chemical databases, which include large numbers of nonbiochemical compounds. Searching larger chemical databases (larger chemical space) increases the odds of identifying an unknown biochemical compound, but only if nonbiochemical structures can be eliminated
from consideration. In this paper we present BioSM, a cheminformatics tool that uses known endogenous mammalian biochemical compounds (as scaffolds) and graph matching methods to identify endogenous mammalian biochemical structures in chemical structure space. The results of a comprehensive set of empirical experiments suggest that BioSM identifies endogenous mammalian biochemical structures with high accuracy. In a leave-one-out cross validation experiment, BioSM correctly predicted 95% of 1388 Kyoto Encyclopedia of Genes and Genomes (KEGG) compounds as endogenous mammalian biochemicals using 1565 scaffolds. Analysis of two additional biological data sets containing 2330 human metabolites (HMDB) and 2416 plant secondary metabolites (KEGG) resulted in biochemical annotations of 89% and 72% of the compounds, respectively. When a data set of 3895 drugs (DrugBank and USAN) was tested, 48% of these structures were predicted to be biochemical. However, when a set of synthetic chemical compounds (Chembridge and Chemsynthesis databases) was examined, only 29% of the 458,207 structures were predicted to be biochemical. Moreover, BioSM predicted that 34% of 883,199 randomly selected compounds
from PubChem were biochemical. We then expanded the scaffold list to 3927 biochemical compounds and reevaluated the above data sets to determine whether scaffold number influenced model performance. Although there were significant improvements in model sensitivity and specificity using the larger scaffold list, the data set comparison results were very similar. These results suggest that additional biochemical scaffolds will not
further improve our representation of biochemical structure space and that the model is reasonably robust. BioSM provides a qualitative (yes/no) and quantitative (ranking) method
for endogenous mammalian biochemical annotation of chemical space and, thus, will be useful in the identification of unknown biochemical structures in metabolomics. BioSM is
freely available at http://metabolomics.pharm.uconn.edu.
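
The scaffold-matching and leave-one-out ideas described above can be illustrated with a minimal Python sketch. It is not the BioSM implementation. It assumes a toy molecule representation (atom labels plus undirected bonds, with no bond orders or hydrogens), a brute-force subgraph test in place of whatever graph matching method BioSM uses, and a hypothetical coverage score standing in for the quantitative ranking. All function names and example molecules are illustrative assumptions.

```python
# Toy sketch only: molecules are labeled graphs, and a candidate is flagged as
# "biochemical-like" when some known scaffold is a subgraph of it.
# The representation, the names and the score are assumptions, not BioSM's.

def make_mol(atoms, bonds):
    """Build a toy molecule from atom labels and (i, j) index pairs."""
    return {"atoms": list(atoms), "bonds": {frozenset(b) for b in bonds}}

def contains_scaffold(mol, scaffold):
    """True if scaffold maps into mol preserving atom labels and bonds."""
    m_atoms, s_atoms = mol["atoms"], scaffold["atoms"]
    if len(s_atoms) > len(m_atoms):
        return False

    def extend(mapping):
        # mapping[k] is the molecule atom assigned to scaffold atom k
        i = len(mapping)
        if i == len(s_atoms):
            return True
        for j, label in enumerate(m_atoms):
            if j in mapping or label != s_atoms[i]:
                continue
            # every scaffold bond to an already-mapped atom must exist in mol
            ok = all(
                frozenset((mapping[k], j)) in mol["bonds"]
                for k in range(i)
                if frozenset((k, i)) in scaffold["bonds"]
            )
            if ok and extend(mapping + [j]):
                return True
        return False

    return extend([])

def coverage_score(mol, scaffolds):
    """Hypothetical ranking score: atom fraction covered by the largest matching scaffold."""
    sizes = [len(s["atoms"]) for s in scaffolds if contains_scaffold(mol, s)]
    return max(sizes, default=0) / len(mol["atoms"])

def leave_one_out_rate(compounds):
    """Fraction of compounds matched by at least one of the other compounds."""
    hits = 0
    for i, query in enumerate(compounds):
        others = compounds[:i] + compounds[i + 1:]
        if any(contains_scaffold(query, s) for s in others):
            hits += 1
    return hits / len(compounds)

# Toy heavy-atom skeletons (illustrative, not taken from KEGG or HMDB)
ethanol = make_mol(["C", "C", "O"], [(0, 1), (1, 2)])
propanol = make_mol(["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
acetate = make_mol(["C", "C", "O", "O"], [(0, 1), (1, 2), (1, 3)])
chloroethane = make_mol(["C", "C", "Cl"], [(0, 1), (1, 2)])

toy_set = [ethanol, propanol, acetate, chloroethane]
loo_rate = leave_one_out_rate(toy_set)
propanol_score = coverage_score(propanol, [ethanol, acetate])
```

In this toy set, each compound is held out in turn and the remaining compounds serve as scaffolds, which mirrors the leave-one-out setup in spirit only. A real evaluation would use a cheminformatics toolkit for substructure matching rather than this exponential-time search.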