December 4, 2009
(These are live notes from a talk Prof Reymond gave at EBI today)
The GDB Database
GDB = Generated Database (of Molecules)
The Chemical Universe Project – how many small molecules are possible?
GDB was put together by starting from graphs. In this case the graphs were hydrocarbons, and the GENG software was used to elaborate all possible graphs (after predefining which graphs are chemically reasonable and incorporating bonding information etc.). Then place atoms, enumerate, get a combinatorial explosion of compounds and apply filters to remove chemical impossibilities. Result: a couple of billion compounds.
Some choices restricting diversity: no allenes, no double bonds at bridgeheads etc., no problematic heteroatom constellations (peroxides were not considered), no hydrolytically labile functional groups.
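To make the enumerate-then-filter idea concrete, here is a minimal sketch in Python. It is not the GDB pipeline: the three-node chain, the element set and the single "no O–O bond" filter are assumptions chosen purely for illustration, and the function names are hypothetical.

```python
from itertools import product

# Illustrative element set; the real database uses its own atom types.
ELEMENTS = ("C", "N", "O")


def has_peroxide(atoms, edges):
    """True if any bond joins two oxygens (a stand-in for one GDB-style filter)."""
    return any(atoms[i] == "O" and atoms[j] == "O" for i, j in edges)


def canonical(atoms):
    """A linear chain reads the same from either end, so keep one orientation."""
    return min(atoms, atoms[::-1])


def enumerate_chain(n_nodes):
    """Place atoms on every node of a linear graph, deduplicate, then filter."""
    edges = [(i, i + 1) for i in range(n_nodes - 1)]
    unique = {canonical(atoms) for atoms in product(ELEMENTS, repeat=n_nodes)}
    kept = sorted(a for a in unique if not has_peroxide(a, edges))
    return len(unique), kept


total, kept = enumerate_chain(3)
print(total, len(kept))  # 18 15
```

Even on this toy graph the number of candidates grows quickly with chain length, which is why the filters have to be applied during enumeration rather than afterwards.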
In general – number of possible molecules increases exponentially with increasing number of nodes.
Showing that the molecular diversity increases with linear open carbon skeletons – cyclic graphs have fewer substitution possibilities. Chiral compounds offer more diversity than non-chiral ones.
Now talking about GDB13:
Removed fluorine, introduced sulphur, and filtered out molecules with “too many” heteroatoms, because of synthetic difficulties and because they may be of lesser interest to medchem.
Now showing statistical analysis of molecular types in GDB. 95% of all marketed drugs violate at least two Lipinski Rules. All molecules in the GDB13 are Lipinski conformant.
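For reference, a Lipinski check is easy to sketch once the four descriptors are known. The snippet below assumes the descriptors have already been computed elsewhere; the `Descriptors` class and the approximate aspirin values are illustrative assumptions, not data from the talk.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed properties for one molecule (hypothetical container)."""
    mol_weight: float   # g/mol
    logp: float         # octanol-water partition coefficient
    h_donors: int       # hydrogen-bond donors
    h_acceptors: int    # hydrogen-bond acceptors


def lipinski_violations(d):
    """Count how many of the four rule-of-five limits are exceeded."""
    checks = [
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ]
    return sum(checks)


# Approximate, illustrative values for aspirin.
aspirin = Descriptors(mol_weight=180.16, logp=1.2, h_donors=1, h_acceptors=4)
print(lipinski_violations(aspirin))  # 0
```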
Use case: take a known drug and find isomers. For Aspirin, the database contains approx 180 compounds with a Tanimoto similarity > 0.7. Points out that many of these molecules may never have been imagined by chemists.
GDB15 is just out: corrected some bugs, eliminated enol ethers (due to quick hydrolysis), optimized CPU usage… approx 26 billion molecules, 1.4 TB (counting them takes a day).
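The similarity search can be sketched as follows. The Tanimoto coefficient of two fingerprints is the number of shared "on" bits divided by the number of bits set in either. The fingerprints below are made-up sets of bit indices, and the candidate names are hypothetical; which fingerprint the speaker used was not stated.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


# Made-up fingerprints for illustration only.
query = {1, 4, 7, 9, 12}
candidates = {
    "cand_a": {1, 4, 7, 9, 12, 15},  # 5 shared / 6 total
    "cand_b": {1, 4, 20, 21},        # 2 shared / 7 total
    "cand_c": {1, 4, 7, 9},          # 4 shared / 5 total
}

hits = sorted(name for name, fp in candidates.items() if tanimoto(query, fp) > 0.7)
print(hits)  # ['cand_a', 'cand_c']
```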
Applications of the Database – mainly GDB11
Use case: Glutamatergic Synapse Binding
Used a Bayesian classifier trained on known actives to retrieve about 11,000 molecules from GDB11. This was followed by high-throughput docking, from which 22 compounds were selected for lab testing. Enrichment of glycine-containing compounds. Now showing some activity data for selected compounds.
Use case: Glutamate Transporter: applied certain structural selection criteria to database molecules to obtain a subset of approx 250k compounds. Again followed by high-throughput docking. Now showing syntheses of some selected candidate structures together with screening data.
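The talk did not say which Bayesian model or features were used. As a rough illustration of the general idea, here is a Bernoulli naive Bayes scorer over binary fingerprints with Laplace smoothing; the training sets, the four-bit fingerprints and the library entries are invented for the example.

```python
import math


def train(actives, inactives, n_bits):
    """Estimate smoothed per-bit frequencies for each class, plus the class prior."""
    def freqs(mols):
        return [(sum(b in m for m in mols) + 1) / (len(mols) + 2) for b in range(n_bits)]

    prior = len(actives) / (len(actives) + len(inactives))
    return freqs(actives), freqs(inactives), prior


def score(mol, model):
    """Log-odds that a molecule (set of 'on' bits) is active rather than inactive."""
    p_act, p_inact, prior = model
    s = math.log(prior / (1 - prior))
    for b, (pa, pn) in enumerate(zip(p_act, p_inact)):
        if b in mol:
            s += math.log(pa / pn)
        else:
            s += math.log((1 - pa) / (1 - pn))
    return s


# Invented training data: each molecule is a set of fingerprint bits.
actives = [{0, 1}, {0, 1, 2}, {0}]
inactives = [{2, 3}, {3}, {1, 3}]
model = train(actives, inactives, n_bits=4)

# Rank a toy "database" and keep the best-scoring entries for docking.
library = {"x": {0, 1}, "y": {3}, "z": {0, 2}}
ranked = sorted(library, key=lambda name: score(library[name], model), reverse=True)
print(ranked)  # ['x', 'z', 'y']
```

In a real screen the top of such a ranking would be the subset passed on to docking.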
“Molecular Quantum Numbers”
Classification system for large compound databases. Draws analogy to the periodic table, which is a classification system for elements. We do not have something like this for molecules. Define features for molecules: atom types, bond types, polarity, topology… 42 categories in total. Now examines the ZINC database against these features: can show that there are common features for molecules occupying similar categories. PCA: first 2 PCs cover 70% of diversity space; first PC includes molecular weight… 2D representations considered to be acceptable. PCA also shows nice grouping of molecules by number of cycles.
Same analysis for GDB11: first PCs now mainly account for molecular flexibility and polarity (GDB11 doesn’t contain many rings due to the atom limitation).
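A counting descriptor of this kind can be sketched in a few lines. The eight features below are only in the spirit of the scheme described; they are not the actual 42 categories, and the heavy-atom graph representation (element list plus `(i, j, order)` bonds) is an assumption for the example.

```python
from collections import Counter


def count_features(atoms, bonds):
    """Integer feature vector from a connected heavy-atom graph.

    atoms: list of element symbols; bonds: list of (i, j, bond_order).
    """
    elements = Counter(atoms)
    orders = Counter(order for _, _, order in bonds)
    return {
        "carbon": elements["C"],
        "nitrogen": elements["N"],
        "oxygen": elements["O"],
        "heavy_atoms": len(atoms),
        "single_bonds": orders[1],
        "double_bonds": orders[2],
        "triple_bonds": orders[3],
        # Ring count of a connected graph: edges - nodes + 1.
        "rings": len(bonds) - len(atoms) + 1,
    }


# Acetic acid (hydrogens omitted): CH3-C(=O)-OH
acetic_acid = count_features(["C", "C", "O", "O"], [(0, 1, 1), (1, 2, 2), (1, 3, 1)])
# Cyclopropane: three carbons in a ring
cyclopropane = count_features(["C", "C", "C"], [(0, 1, 1), (1, 2, 1), (2, 0, 1)])

print(acetic_acid["double_bonds"], acetic_acid["rings"])  # 1 0
print(cyclopropane["rings"])  # 1
```

Because every feature is a simple count, each molecule becomes a point in an integer feature space, which is what makes the PCA on whole databases feasible.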
Analysis for PubChem – difficult to discover information at the moment.
Was on the cover of ChemMedChem this November.
Shows examples of fishing out structural motif analogues for given molecular motifs.