Exploring Chemical Space with GDB – Jean-Louis Reymond (University of Bern)

(These are live notes from a talk Prof Reymond gave at EBI today)

The GDB Database

GDB = Generated Database (of Molecules)

The Chemical Universe Project – how many small molecules are possible?

GDB was put together by starting from graphs. In this case the graphs were hydrocarbons, and the GENG software was used to elaborate all possible graphs (after predefining which graphs are chemically reasonable and incorporating bonding information etc.). Then place atoms, enumerate, get a combinatorial explosion of compounds and apply filters to remove chemical impossibilities. Result: a couple of billion compounds.


Some choices restricting diversity: no allenes, no double bonds at bridgeheads etc., no problematic heteroatom constellations (did not consider peroxides), no hydrolytically labile functional groups.

In general – number of possible molecules increases exponentially with increasing number of nodes.
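A toy version of the enumerate-then-filter idea, to make the growth visible. This is not the GENG/GDB pipeline, just a brute-force sketch: generate all connected graphs on n nodes, drop those exceeding a valence limit, and keep one representative per isomorphism class. The function names and the `valence_limit` parameter are made up for illustration, and the brute-force canonical form is only practical for very small n.

```python
from itertools import combinations, permutations


def canonical_form(n, edges):
    """Smallest relabelled edge tuple over all node permutations (tiny n only)."""
    best = None
    for perm in permutations(range(n)):
        relabelled = tuple(sorted(
            tuple(sorted((perm[a], perm[b]))) for a, b in edges
        ))
        if best is None or relabelled < best:
            best = relabelled
    return best


def is_connected(n, edges):
    """Depth-first search from node 0; True if every node is reached."""
    adjacency = {i: set() for i in range(n)}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, stack = {0}, [0]
    while stack:
        for neighbour in adjacency[stack.pop()]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return len(seen) == n


def max_degree(n, edges):
    """Largest number of edges meeting at any single node."""
    degree = [0] * n
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return max(degree)


def enumerate_skeletons(n, valence_limit=4):
    """Unique connected graphs on n nodes with no node above valence_limit."""
    all_pairs = list(combinations(range(n), 2))
    unique = set()
    for k in range(len(all_pairs) + 1):
        for edges in combinations(all_pairs, k):
            if not is_connected(n, edges):
                continue  # filter: fragments are not single molecules
            if max_degree(n, edges) > valence_limit:
                continue  # filter: carbon-like valence limit (assumption)
            unique.add(canonical_form(n, edges))
    return unique


for n in range(1, 6):
    print(n, len(enumerate_skeletons(n)))
```

For n = 1 to 5 this gives 1, 1, 2, 6 and 21 skeletons, and that is before any atoms or bond orders are placed on the graphs.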

Showing that the molecular diversity increases with linear open carbon skeletons – cyclic graphs have fewer substitution possibilities. Chiral compounds offer more diversity than non-chiral ones.


GDB Website


Now talking about GDB13:

removed fluorine, introduced sulphur, filtered for molecules with “too many” heteroatoms – due to synthetic difficulties and the fact they may be of lesser interest to medchem.

Now showing statistical analysis of molecular types in GDB. 95% of all marketed drugs violate at least two Lipinski Rules. All molecules in the GDB13 are Lipinski conformant.
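For reference, a minimal sketch of a Lipinski (rule of five) check using the standard thresholds: molecular weight at most 500, logP at most 5, at most 5 hydrogen-bond donors and at most 10 hydrogen-bond acceptors. The function name is hypothetical, and the descriptor values are assumed to be computed elsewhere; the aspirin numbers below are approximate and only there as an example.

```python
def lipinski_violations(mol_weight, log_p, h_donors, h_acceptors):
    """Return the names of the rule-of-five criteria that a molecule violates."""
    violations = []
    if mol_weight > 500:
        violations.append("molecular weight > 500")
    if log_p > 5:
        violations.append("logP > 5")
    if h_donors > 5:
        violations.append("H-bond donors > 5")
    if h_acceptors > 10:
        violations.append("H-bond acceptors > 10")
    return violations


# Approximate descriptor values for aspirin (illustrative, not from the talk).
aspirin = lipinski_violations(mol_weight=180.16, log_p=1.2, h_donors=1, h_acceptors=4)
print(aspirin)

# A made-up large, lipophilic molecule for contrast.
print(lipinski_violations(mol_weight=650.0, log_p=6.2, h_donors=2, h_acceptors=11))
```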

Use case: take a known drug and find its isomers. For aspirin there are approx 180 compounds with a Tanimoto similarity to aspirin of > 0.7. Points out that many of these molecules may not have been imagined by chemists.
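The Tanimoto score used here is the size of the intersection of two fingerprints divided by the size of their union. A minimal sketch, assuming fingerprints are represented as sets of "on" bit indices; which fingerprint was used in the talk was not stated, and the library entries below are invented.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def similar_to(query_fp, library, threshold=0.7):
    """Names of library entries whose similarity to the query exceeds threshold."""
    return [name for name, fp in library.items()
            if tanimoto(query_fp, fp) > threshold]


# Hypothetical fingerprints, purely for illustration.
query = {1, 2, 3, 4, 5, 6, 7, 8}
library = {
    "isomer_a": {1, 2, 3, 4, 5, 6, 7, 8, 9},   # 8 shared bits out of 9
    "isomer_b": {1, 2, 3, 4, 9, 10, 11, 12},   # 4 shared bits out of 12
}
print(similar_to(query, library))
```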


GDB15 is just out – corrected some bugs, eliminated enol ethers (due to quick hydrolysis), optimized CPU usage… approx 26 billion molecules, 1.4 TB (counting them takes a day).


Applications of the Database – mainly GDB 11

Use case: Glutamatergic Synapse Binding

Used a Bayesian classifier trained with known actives to retrieve about 11000 molecules from GDB11. This was followed by high-throughput docking, from which 22 compounds were selected for lab testing. Enrichment of glycine-containing compounds. Now showing some activity data for selected compounds.
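The classifier details were not given in the talk. As a rough idea of how a Bayesian classifier can rank database molecules, here is a sketch of a Bernoulli naive Bayes model over binary fingerprints with Laplace smoothing. The function names, the set-of-on-bits fingerprint format and the training data are all assumptions for illustration.

```python
import math


def train_bernoulli_nb(fingerprints, labels, n_bits):
    """Fit per-class log priors and per-bit probabilities (Laplace smoothed).

    Assumes both classes (1 = active, 0 = inactive) occur in the training set.
    """
    model = {}
    for cls in (0, 1):
        rows = [fp for fp, y in zip(fingerprints, labels) if y == cls]
        bit_prob = [
            (sum(1 for fp in rows if bit in fp) + 1) / (len(rows) + 2)
            for bit in range(n_bits)
        ]
        model[cls] = (math.log(len(rows) / len(labels)), bit_prob)
    return model


def log_odds_active(model, fp, n_bits):
    """Log posterior odds of 'active' versus 'inactive' for one fingerprint."""
    score = {}
    for cls, (log_prior, bit_prob) in model.items():
        total = log_prior
        for bit in range(n_bits):
            p = bit_prob[bit]
            total += math.log(p if bit in fp else 1.0 - p)
        score[cls] = total
    return score[1] - score[0]


# Invented training set: three "actives" and three "inactives" on 4 bits.
train_fps = [{0, 1}, {0, 2}, {0, 1, 3}, {2, 3}, {3}, {1, 3}]
train_labels = [1, 1, 1, 0, 0, 0]
model = train_bernoulli_nb(train_fps, train_labels, n_bits=4)

# Rank a small invented "database" by predicted activity, best first.
database = {"mol_a": {0, 1}, "mol_b": {3}, "mol_c": {0, 3}}
ranked = sorted(database,
                key=lambda name: log_odds_active(model, database[name], 4),
                reverse=True)
print(ranked)
```

The top of such a ranking would then be passed on to the more expensive docking step.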

Use case: Glutamate Transporter. Applied certain structural selection criteria to database molecules to obtain a subset of approx 250k compounds. Again followed by HT docking. Now showing syntheses of some selected candidate structures together with screening data.


“Molecular Quantum Numbers”

Classification system for large compound databases. Draws analogy to the periodic table, a classification system for elements: we do not have something like this for molecules. Define features for molecules: atom types, bond types, polarity, topology… 42 categories in total. Now examines the ZINC database against these features: can show that there are common features for molecules occupying similar categories.

PCA: first 2 PCs cover 70% of diversity space; first PC includes molecular weight… 2D representations considered to be acceptable. PCA also shows nice grouping of molecules by number of cycles.
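To make the "first PCs cover x% of diversity space" statement concrete, here is a small pure-Python sketch that computes the fraction of total variance captured by the first principal component of a descriptor table, using power iteration on the covariance matrix. The three-column descriptor rows are invented stand-ins for the 42 features, and the function names are hypothetical.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length descriptor rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_principal_component(cov, iterations=200):
    """Largest eigenvalue and eigenvector of cov by power iteration.

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # zero covariance matrix: nothing to project onto
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return eigenvalue, v


def explained_fraction(rows):
    """Fraction of total variance captured by the first principal component."""
    cov = covariance_matrix(rows)
    eigenvalue, _ = first_principal_component(cov)
    total = sum(cov[i][i] for i in range(len(cov)))
    return eigenvalue / total


# Invented rows: (heavy atom count, ring count, heteroatom count).
descriptors = [(10, 1, 0), (20, 2, 1), (30, 3, 1), (40, 4, 2)]
print(round(explained_fraction(descriptors), 3))
```

Note that without scaling the columns, a size-like descriptor with a large numeric range dominates the first PC in this toy example.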

Same analysis for GDB 11: first PCs now mainly account for molecular flexibility, polarity (doesn’t contain many rings due to atom limitation).

Analysis for PubChem – difficult to discover information at the moment.

Was on the cover of ChemMedChem this November.

Shows examples of fishing out structural motif analogues for given molecular motifs.

2 Responses to Exploring Chemical Space with GDB – Jean-Louis Reymond (University of Bern)

  1. What’s the DB’s license? CC0?

    Any discussion on why such a database is relevant, compared to compute-on-demand approaches, like in Christoph’s CASE approach?

    • na303 says:

      Hi Egon,

      it is not clear what the licence is. Here’s the statement from the GDB’s website:

      GDB-13 may be used free of charge for research by individuals and institutions. Whereas you are free to share the results of a GDB-13 search or a screen of molecules from GDB-13, you may not redistribute major portions of GDB-13 without the express written permission of Jean-Louis Reymond.

      So as far as I can see without doing too much research, the answer is: free to view, download and work with, but not to redistribute. No clear licence though. And as to the discussion of necessity: no such discussion took place. I think the major argument was that combinatorial enumeration delivers structures that chemists may not necessarily “think of”, though there is no reason why that could not also be done on demand. If the database WERE under CC0 or PDDL, one could make the argument for avoidance of redundancy, as all the computed data could be distributed. But, no, no real formal discussion.
