Exploring Chemical Space with GDB – Jean-Louis Reymond (University of Bern)


(These are live notes from a talk Prof Reymond gave at EBI today)

The GDB Database

GDB = Generated Database (of Molecules)

The Chemical Universe Project – how many small molecules are possible?

GDB was put together by starting from graphs – in this case the graphs were hydrocarbons – and using the GENG software to elaborate all possible graphs (after predefining which graphs are chemically reasonable, incorporating bonding information etc.). Then place atoms, enumerate, get a combinatorial explosion of compounds and apply filters to remove chemical impossibilities. Result: a couple of billion compounds.


Some choices restricting diversity: no allenes, no double bonds at bridgeheads etc., no problematic heteroatom constellations (peroxides were not considered), no hydrolytically labile functional groups.

In general – number of possible molecules increases exponentially with increasing number of nodes.

Showing that the molecular diversity increases with linear open carbon skeletons – cyclic graphs have fewer substitution possibilities. Chiral compounds offer more diversity than non-chiral ones.


GDB Website


Now talking about GDB13:

removed fluorine, introduced sulphur, filtered for molecules with “too many” heteroatoms – due to synthetic difficulties and the fact they may be of lesser interest to medchem.

Now showing statistical analysis of molecular types in GDB. 95% of all marketed drugs violate at least two Lipinski Rules. All molecules in the GDB13 are Lipinski conformant.
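As a reminder of what "Lipinski conformant" means in practice: the rule of five counts a violation whenever molecular weight exceeds 500 Da, logP exceeds 5, there are more than 5 hydrogen-bond donors, or more than 10 hydrogen-bond acceptors. The sketch below is only an illustration and assumes the descriptors have already been computed elsewhere; the class and function names are made up for this note and the aspirin values are approximate, none of them come from the talk.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptors for one molecule (hypothetical container)."""
    mol_weight: float       # in Da
    log_p: float            # octanol/water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four rule-of-five criteria are violated."""
    checks = [
        d.mol_weight > 500,
        d.log_p > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ]
    return sum(checks)


# Approximate, illustrative values for aspirin.
aspirin = Descriptors(mol_weight=180.16, log_p=1.2, h_bond_donors=1, h_bond_acceptors=4)
# A made-up large, lipophilic molecule that breaks three criteria.
big_molecule = Descriptors(mol_weight=720.0, log_p=6.1, h_bond_donors=2, h_bond_acceptors=12)

print(lipinski_violations(aspirin), lipinski_violations(big_molecule))
```

A database filter in this spirit would simply keep the molecules for which the count is zero.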

Use case: take a known drug and find isomers. Aspirin has approx. 180 similar compounds (Tanimoto similarity > 0.7). Points out that any of these molecules may not have been imagined by chemists.
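For readers who have not met the Tanimoto score: for two fingerprints treated as sets of "on" bits it is the size of the intersection divided by the size of the union. The sketch below shows that calculation and a threshold search like the one described for aspirin. The fingerprints are toy bit sets invented for this note, the function names are hypothetical, and nothing here reflects how the GDB search is actually implemented.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bits."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def similar_to(query: set, library: dict, threshold: float = 0.7) -> list:
    """Return (name, score) pairs scoring above the threshold, best first."""
    hits = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in hits if score > threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy fingerprints (invented bit positions, not real chemistry).
query_fp = {1, 2, 3, 4, 5, 6, 7, 8}
library = {
    "candidate_a": {1, 2, 3, 4, 5, 6, 7},
    "candidate_b": {1, 2, 3, 9},
    "candidate_c": {1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
}

print(similar_to(query_fp, library))
```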


GDB15 is just out – corrected some bugs, eliminated enol ethers (due to quick hydrolysis), optimized CPU usage… approx. 26 billion molecules, 1.4 Tb (counting them takes a day).


Applications of the Database – mainly GDB 11

Use case: Glutamatergic Synapse Binding

used Bayesian classifier trained with known actives and then used that to retrieve about 11000 molecules from GDB11. This was followed by high throughput docking – selected 22 compounds for lab testing. Enrichment of glycine-containing compounds. Now showing some activity data for selected compounds.

Use case: Glutamate Transporter: applied certain structural selection criteria to database molecules to obtain a subset of approx 250 k compounds. Again followed by HT docking. Now showing syntheses of some selected candidate structures together with screening data.


“Molecular Quantum Numbers”

Classification system for large compound databases. Draws analogy to periodic table: classification system for elements. We do not have something like this for molecules. Define features for molecules: atom types, bond types, polarity, topology… 42 categories in total. Now examines ZINC database against these features: can show that there are common features for molecules occupying similar categories. PCA: first 2 PCs cover 70% of diversity space; first PC includes molecular weight… 2D representations considered to be acceptable. PCA also shows nice grouping of molecules by number of cycles.

Same analysis for GDB 11: first PCs now mainly account for molecular flexibility, polarity (doesn’t contain many rings due to atom limitation).

Analysis for PubChem – difficult to discover information at the moment.

Was on the cover of ChemMedChem this November.

Shows examples of fishing out structural motif analogues for given molecular motifs.


ChemAxiom: An Ontology for Chemistry 4. ChemAxiomChemDomain

Obligations to our funders and some publishers have delayed me for a few days in continuing this series of blog posts and in participating in the discussion on the Google Group, but I hope I can catch up on both now. In my previous blog post, I briefly summarised all of the ChemAxiom modules: now is the time to delve into some more detail. First up then: ChemAxiomChemDomain.

ChemAxiomChemDomain is, at the moment, a rather small, but nevertheless important ontology, which clarifies some fundamental domain concepts in chemistry, namely the relationship between platonic molecules, platonic bulk substances, instances of either and roles.  

First of all, let’s turn to some fundamental concepts. The classes “ChemicalElement”, “MolecularEntity”, and “ChemicalSpecies” are all subclasses of “snap:Object”. The class “Object” in the BFO is defined as a “material entity [snap:MaterialEntity] that is spatially extended, maximally self-connected and self-contained (the parts of a substance are not separated from each other by spatial gaps) and possesses an internal unity. The identity of substantial object [snap:Object] entities is independent of that of other entities and can be maintained through time.” Various disjointness axioms specify the fact that “MolecularEntities” are not the same as “ChemicalSpecies”, thus addressing some of the fundamental issues about the relationship between molecules and substances etc.

Further axioms on these classes specify other necessary parthood relationships: “ChemicalSpecies” are composed of molecules or other ChemicalSpecies (thus giving recursion and allowing the modelling of formulations) or BulkChemicalElements:

ChemistryOntology:ChemicalSpecies
      a       owl:Class ;
      rdfs:comment "An ensemble of chemically identical molecular entities that can explore the same set of molecular energy levels on the time scale of the experiment."@en ;
      rdfs:subClassOf snap:Object ;
      rdfs:subClassOf
              [ a       owl:Class ;
                owl:unionOf ([ a       owl:Restriction ;
                            owl:onProperty ChemistryOntology:hasPart ;
                            owl:someValuesFrom ChemistryOntology:MolecularEntity
                          ] [ a       owl:Restriction ;
                            owl:onProperty ChemistryOntology:hasPart ;
                            owl:someValuesFrom ChemistryOntology:ChemicalSpecies
                          ] [ a       owl:Restriction ;
                            owl:hasValue ChemistryOntology:BulkChemicalElement ;
                            owl:onProperty ChemistryOntology:hasPart
                          ])
              ] ;
      rdfs:subClassOf
              [ a       owl:Restriction ;
                owl:onProperty ChemistryOntology:preseentInAmount ;
                owl:someValuesFrom xsd:string
              ] ;
      rdfs:subClassOf
              [ a       owl:Restriction ;
                owl:onProperty ChemAxiomProp:hasProperty ;
                owl:someValuesFrom ChemAxiomProp:Property
              ] ;
      owl:disjointWith ChemistryOntology:ChemicalElement , ChemistryOntology:MolecularEntity .

When integrated with ChemAxiomProp (as has been done in ChemAxiomContinuants), ChemicalSpecies can be connected up to their properties and other statements which one might wish to make about chemical species.
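The recursion in the parthood axioms can be illustrated outside OWL as well. The following Python sketch is only an analogy, not part of ChemAxiom: it mirrors the three kinds of parts named in the axioms (molecular entities, other chemical species, bulk chemical elements) and shows how a formulation can be unpacked recursively. All class and method names are invented for this illustration.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class MolecularEntity:
    name: str


@dataclass(frozen=True)
class BulkChemicalElement:
    symbol: str


@dataclass
class ChemicalSpecies:
    """A species whose parts may themselves be species (hence formulations)."""
    name: str
    parts: List[Union[MolecularEntity, BulkChemicalElement, "ChemicalSpecies"]] = field(
        default_factory=list
    )

    def molecular_entities(self) -> List[str]:
        """Collect the names of all molecular entities, descending into nested species."""
        found = []
        for part in self.parts:
            if isinstance(part, MolecularEntity):
                found.append(part.name)
            elif isinstance(part, ChemicalSpecies):
                found.extend(part.molecular_entities())
        return found


water = ChemicalSpecies("water", [MolecularEntity("H2O")])
ethanol = ChemicalSpecies("ethanol", [MolecularEntity("C2H5OH")])
# A formulation is a species made of other species.
formulation = ChemicalSpecies("aqueous ethanol", [water, ethanol])

print(formulation.molecular_entities())
```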

Another part of ChemAxiomChemDomain is the definition of roles: generic types of ChemicalSpecies, such as solvents, acids and catalysts, can be defined in terms of roles. No molecule is ever only just a solvent or an acid or a catalyst. Rather, these categories are realisable entities; a molecular species or a chemical entity behaves as a catalyst, a nucleophile or a solvent under certain circumstances:

      a       owl:Class ;
      rdfs:subClassOf ChemistryOntology:MolecularEntity ;
      owl:disjointWith ChemistryOntology:ElectrophileMolecule ;
      owl:equivalentClass
              [ a       owl:Class ;
                owl:intersectionOf (ChemistryOntology:MolecularEntity [ a       owl:Restriction ;
                            owl:onProperty ChemistryOntology:hasRole ;
                            owl:someValuesFrom ChemistryOntology:NucleophileRole
                          ])
              ] .

Furthermore, roles in combination with MolecularEntity or ChemicalSpecies allow the definition of generic molecules or substances, such as acids (hydrochloric acid) and acids (proton donor), catalysts, solvents etc. At the moment, the number of axioms is small; however, as the body of axioms grows, ChemAxiom can be expected to become more and more useful for the disambiguation of concepts: while it would make sense to talk about a pH value for a chemical species which is an acid, it would not make sense to speak of “molecular acids” in the same terms.
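The idea that a role is only held under certain circumstances can be sketched in a few lines of Python. This is an analogy rather than a rendering of the ontology: roles are stored per context instead of being baked into the type of the molecule. The class, the method names and the context strings are all invented for this illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class MolecularEntity:
    name: str
    # Maps a context (the "circumstances") to the roles realised in it.
    roles: Dict[str, Set[str]] = field(default_factory=dict)

    def add_role(self, role: str, context: str) -> None:
        self.roles.setdefault(context, set()).add(role)

    def has_role(self, role: str, context: str) -> bool:
        return role in self.roles.get(context, set())


water = MolecularEntity("water")
water.add_role("nucleophile", context="ester hydrolysis")
water.add_role("solvent", context="aqueous NaCl solution")

# The same molecule is a nucleophile in one setting and not in another.
print(water.has_role("nucleophile", "ester hydrolysis"))
print(water.has_role("nucleophile", "aqueous NaCl solution"))
```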

Finally, OWL’s model of classes as collections of instances models the things we need to model really well: the classes “ChemicalSpecies” and “MolecularEntity” and their respective subclasses can be thought of as representing the platonic ideals of molecules or substances, whereas instances of these classes can be thought of as representing “real” samples of both molecules (e.g. a single molecule in, for example, matrix isolation) and substances (100 ml of HCl in a flask).

So much for ChemAxiomChemDomain for now. It is the beginning of a domain model and very much driven by the use-case I outlined in a previous blog post. Obviously, we would like to expand the scope of this particular ontology to be more universally useful in the future. However, I believe that rather than doing this via random ontological engineering, this should be driven by use-cases. So if you have use-cases in mind, please get in touch and let’s discuss how we can collaborate.

Tags and automatic links, as always, by Zemanta.


ChemAxiom: An Ontology for Chemistry 2. The Set-Up

Now that I have introduced at least some of the motivation behind ChemAxiom, let me outline some of the mechanics.

ChemAxiom is a collective term for a set of ontologies, all of which make a start at describing subdomains within chemistry. The ontology modules are independent and self-contained and can (largely) be developed separately and concurrently. Although they are independent, they are interoperable and integrated via a common upper ontology – in the case of ChemAxiom, we have chosen the Basic Formal Ontology (BFO). I will blog the reasons for this choice in the next post.


The ontologies are currently in various stages of axiomatisation depending on how long we have been working on them and how much we have had a chance to play. So if there are axioms that are missing and that you think should be there, or if you agree/disagree with some of our design decisions, please let us know. In any case, the discussion has already started with some helpful comments over on the Google Group. Let me describe the various modules in greater detail:

The Reasons for Modularity: When developing ontologies, it is always tempting to develop the ueber-McDaddy-ontology-of-everything, because, of course, ontology development is, by definition, never done: we always need more than we have – more terms, more axioms etc. Very quickly, this can result in monstrously large and virtually unmaintainable constructs. Modularisation has, from our perspective, the advantage of (a) smaller and more manageable ontologies, (b) ontologies which are easier to maintain, (c) ontologies which can be developed in parallel or orthogonally and subsequently integrated using either a common upper ontology or mapping/rules etc. Furthermore, if refactoring of ontologies is necessary during the development process, this is also facilitated by modularity: changes in one module have less chance of forcing changes in another module.

The General Use Case: One of the things we are particularly interested in here in Cambridge is the extraction of chemical entities and data from text, and Peter Corbett’s OSCAR is now fairly well established within the chemical informatics community. Our text sources vary widely, and can range from standard chemical papers to theses, blogs and Wikipedia pages. To give you an impression of the types of data we are talking about, here is an example, Wikipedia’s infobox for benzene (somewhat truncated):


[Image: Wikipedia infobox for benzene]

So we have to deal with names, identifiers of various types, physico-chemical property data as well as the corresponding metadata (e.g. measurement pressures, measurement temperatures etc.), and chemical structure (InChI, SMILES). Our ontologies should enable us to generate RDF that allows us to hold this data – the ontology here serves as a schema. While we are interested in reasoning/using reasoners for the purposes of (retrospective) typing (again, I will explain what I mean by that in subsequent blog posts), applying ontologies to the description of chemical data is our first use-case.
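To give a feel for what "holding this data in RDF" might look like, here is a small Python sketch that writes a property measurement and its metadata as N-Triples-style lines. It is purely illustrative: the `http://example.org/chem#` namespace and every property name in it are placeholders invented for this example and are not ChemAxiom terms; only the benzene SMILES and its normal boiling point are real data.

```python
EX = "http://example.org/chem#"            # hypothetical namespace, not ChemAxiom's
XSD = "http://www.w3.org/2001/XMLSchema#"  # standard XML Schema datatype namespace


def iri(local_name):
    """Wrap a local name in the placeholder namespace as an IRI term."""
    return f"<{EX}{local_name}>"


def literal(value, datatype=None):
    """Render a literal, escaping backslashes and quotes; optionally typed."""
    text = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{text}"^^<{XSD}{datatype}>' if datatype else f'"{text}"'


def measurement_triples(substance, smiles, prop, value, unit, pressure_kpa):
    """Describe one measured property together with its measurement pressure."""
    subject = iri(substance)
    measurement = iri(f"{substance}_{prop}")
    return [
        f"{subject} {iri('hasSMILES')} {literal(smiles)} .",
        f"{subject} {iri('hasMeasurement')} {measurement} .",
        f"{measurement} {iri('ofProperty')} {iri(prop)} .",
        f"{measurement} {iri('hasValue')} {literal(value, 'double')} .",
        f"{measurement} {iri('hasUnit')} {literal(unit)} .",
        f"{measurement} {iri('measuredAtPressureKPa')} {literal(pressure_kpa, 'double')} .",
    ]


triples = measurement_triples("benzene", "c1ccccc1", "BoilingPoint", 80.1, "degC", 101.325)
print("\n".join(triples))
```

The point of routing the value through a separate measurement node is that metadata such as the pressure attaches to the measurement, not to the substance itself.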

With all of that said, let me provide a quick summary of the modules:

Chemistry Domain Ontology – ChemAxiomDomain. ChemAxiomDomain is the first module in the set. It is currently a small ontology which clarifies some fundamental relationships in the chemistry domain. Key concepts in this ontology are “ChemicalElement”, “ChemicalSpecies” and “MolecularEntity” as well as “Role”. ChemAxiomDomain clarifies the relationships between these terms (see my previous blog post) and also deals with identifiers etc. Chemical roles, too, are important: while chemical entities may be or act as nucleophiles, acids, solvents etc. some of the time, they do not have these roles all of the time – roles are realisable entities and ChemAxiomDomain provides a mechanism for dealing with that. There are few other high-level domain concepts in there at the moment, though obviously we are looking to expand as and when the need arises and use-cases are provided. I will blog some details in a subsequent blog post.

Properties Ontology – ChemAxiomProp. ChemAxiomProp is an ontology of over 150 chemical and materials properties, together with a first set of definitions and symbols (where available and appropriate) and some axioms for typing of properties. Again, details will follow in a subsequent blog post.

Measurement Techniques – ChemAxiomMetrology. This is an ontology of over 200 measurement techniques and also contains a list of instrument parts and axioms for typing of measurement techniques. It does not currently include information about minimum information requirements for measurement techniques (e.g. the measurement of a boiling point also requires a measurement of pressure) and other metadata, but this will be added at a later stage. Again, a detailed blog-post will follow.
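A minimum information requirement of the kind mentioned for boiling points could, for instance, be checked along the following lines. This Python sketch is not how ChemAxiomMetrology does or will do it: the lookup table and the function name are invented here, and the table contains only the boiling point/pressure example given above.

```python
# Hypothetical table: property name -> metadata that must accompany a measurement.
REQUIRED_METADATA = {
    "boiling point": {"pressure"},
}


def missing_metadata(property_name, metadata):
    """Return the required metadata keys absent from a measurement record, sorted."""
    required = REQUIRED_METADATA.get(property_name, set())
    return sorted(required - set(metadata))


complete = {"value": 80.1, "unit": "degC", "pressure": "101.325 kPa"}
incomplete = {"value": 80.1, "unit": "degC"}

print(missing_metadata("boiling point", complete))
print(missing_metadata("boiling point", incomplete))
```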

ChemAxiomPoly and ChemAxiomPolyClass – These two ontologies contain terms which are in common use across polymer science as well as a taxonomy of polymers based on the composition of their backbone (though the latter is not axiomatised yet). Details will follow in a further blog post.

ChemAxiomMeta – ChemAxiomMeta is a developing ontology that will allow the specification of provenance of data (e.g. data derived from wiki pages etc.) and will also define what a journal, journal article, thesis, thesis chapter etc. is and what the relationships between these entities are. We have not released this yet. Details will follow in a further blog post.

ChemAxiomContinuants – ChemAxiomContinuants represents an integration of all the above sub-ontologies into an ontological framework for chemical continuants (with some occurrents mixed in when we need to talk about measurement techniques). Details will follow in a further blog post.

We have also started to work on ontologies of chemical reactions, actions and, as mentioned above, minimum information requirements – however, these are at a relatively early stage of development and hence not released yet.

So much for a short overview of the mechanics of the ontologies. I am sure there are a thousand other things I should have said, but that will have to do for now. Comments and suggestions via the usual channels. Automatic links and tags, as always, by Zemanta.


The Unilever Centre @ Semantic Technology 2009

In a previous blogpost, I had already announced, that both Jim and I had been accepted to speak at Semantic Technology 2009 in San Jose.

Well, the programme for the conference is out now and looks even more mind-blowing (in a very good way) than last year. Jim and I will be speaking on Tuesday, 16th June at 14:00. Here are our talk abstracts:

PART I | Lensfield – The Working Scientist’s Linked Data Space Elevator (Jim Downing)

The vision of Open Linked Data in long-tail science (as opposed to Big Science, high energy physics, genomics etc) is an attractive one, with the possibility of delivering abundant data without the need for massive centralization. In achieving that vision we face a number of practical challenges. The principal challenge is the steep learning curve that scientists face in dealing with URIs, web deployment, RDF, SPARQL etc. Additionally, most software that could generate Linked Data runs off-web, on workstations and internal systems. The result of this is that the desktop filesystem is likely to remain the arena for the production of data in the near to medium term. Lensfield is a data repository system that works with the filesystem model and abstracts semantic web complexities away from scientists who are unable to deal with them. Lensfield makes it easy for researchers to publish linked data without leaving their familiar working environment. The presentation of this system will include a demonstration of how we have extended Lensfield to produce a Linked Data publication system for small molecule data.

PART II | The Semantic Chemical World Wide Web (Nico Adams)

The development of modern new drugs, new materials and new personal care products requires the confluence of data and ideas from many different scientific disciplines, and enabling scientists to ask questions of heterogeneous data sources is crucial for future innovation and progress. The central science in much of this is chemistry, and therefore the development of a “semantic infrastructure” for this very important vertical is essential and of direct relevance to large industries such as the pharmaceuticals and life sciences, home and personal care and, of course, the classical chemical industry. Such an infrastructure should include a range of technological capabilities, from the representation of molecules and data in semantically rich form to the availability of chemistry domain ontologies and the ability to extract data from unstructured sources.

The talk will discuss the development of markup languages and ontologies for chemicals and materials (data). It will illustrate how ontologies can be used for indexing, faceted search and retrieval of chemical information and for the “axiomatisation” of chemical entities and materials beyond simple notions of chemical structure. The talk will discuss the use of linked data to generate new chemical insight and will provide a brief discussion of the use of entity extraction and natural language processing for the “semantification” of chemical information.

But that’s not all. Lezan has been accepted to present a poster and so she will be there too, showing off her great work on the extraction and semantification of chemical reaction data from the literature. Here is her abstract:

The domain of chemistry is central to a large number of significant industries such as the pharmaceuticals and life sciences industry, the home and personal care industry as well as the “classical” chemical industry. All of these are research-intensive and any innovation is crucially dependent on the ability to connect data from heterogeneous sources: in the pharmaceutical industry, for example, the ability to link data about chemical compounds, with toxicology data, genomic and proteomic data, pathway data etc. is crucial. The availability of a semantic infrastructure for chemistry will be a significant factor for the future success of this industry. Unfortunately, virtually all current chemical knowledge and data is generated in non-semantic form and in many silos, which makes such data integration immensely difficult.

In order to address these issues, the talk will discuss several distinct, but related areas, namely chemical information extraction, information/data integration, ontology-aided information retrieval and information visualization. In particular, we demonstrate how chemical data can be retrieved from a range of unstructured sources such as reports, scientific theses and papers or patents. We will discuss how these sources can be processed using ontologies, natural language processing techniques and named-entity recognisers to produce chemical data and knowledge expressed in RDF. We will furthermore show, how this information can be searched and indexed. Particular attention will also be paid to data representation and visualisation using topic/topology maps and information lenses. At the end of the talk, attendees should have a detailed awareness of how chemical entities and data can be extracted from unstructured sources and visualised for rapid information discovery and knowledge generation.

It promises to be a great conference and I am sure our minds will go into overdrive when there….can’t wait to go! See you there!?


ChemAxiom: An Ontology for Chemistry – 1. The Motivation

I have already announced the fact that we are working on ontologies in the polymer domain some time ago, though I realise that so far, I have yet to produce the proof of that: the actual ontology/ontologies.

So today I am happy to announce that the time of vapourware is over and that we have released ChemAxiom – a modular set of ontologies, which form the first ontological framework for chemistry (or at least so we believe). The development of these ontologies has taken us a while: I started this on a hunch and as a nice intellectual exercise, not entirely sure where to go with them and what to use them for, and therefore not working on them full time. As the work progressed, however, we understood just how inordinately useful they would be for doing what we are trying to accomplish in both polymer informatics and chemical informatics at large. I will introduce and discuss the ontologies in a succession of blog posts, of which this is the first one.

So what, though maybe somewhat retrospectively, was the motivation for the preparation of the ontologies? In short – the breakdown of many common chemistry information systems when confronted with real chemical phenomena rather than small subsections of idealised abstractions. Let me explain.

Chemistry and chemical information systems positively thrive on the use of a connection table as a chemical identifier and determinant of uniqueness. The reasons for this are fairly clear: chemistry, for the past 100 years or so, has elevated the (potential) correlation between the chemical structure of a molecule and its physicochemical and biological properties to be its “central dogma.” The application of this dogma has served subsections of the community – notably organic/medicinal/biological chemists – incredibly well, while causing major headaches for other parts of the chemistry community and giving an outright migraine to information scientists and researchers. There are several reasons for the pain:

The use of a connection table as an identifier for chemical objects leads to significant ontological confusion. Often, chemists and their information systems do not realise that there is a fundamental distinction between (a) the platonic idea of a molecule, (b) the idea of a bulk substance and (c) an instance of the bulk substance (“the real bulk substance”) in a flask or bottle on the researcher’s lab bench. An example of this is the association of a physicochemical property of a chemical entity with a structure representation of a molecule: while it would, for example, make sense to do this for a HOMO energy, it does NOT make sense to speak of a melting point or a boiling point in terms of a molecule. The point here simply is that many physicochemical properties are the mereological sums of the properties of many molecules in an ensemble. If this is true for simple properties of pure small molecules, it is even more true for properties of complex systems such as polymers, which are ensembles of many different molecules of many different architectures. A similar argument can also be made for identifiers: in most chemical information systems, it is often not clear whether the identifier (such as a CAS number etc.) refers to a molecule or a substance composed of these molecules.

Many chemical objects have temporal characteristics. Often, chemical objects have temporal characteristics which influence and determine their connection table. A typical example of this is rapidly interconverting isomers: glucose, when dissolved in water, for example, can be described by several rapidly interconverting structures – a single connection table is not enough to describe the concept “glucose in water” and there exists a parthood relationship between the concept and several possible connection tables. Ontologies can help with specifying and defining these parthood relationships.
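The one-concept-to-many-structures relationship can be sketched as a simple data structure. The Python below is an illustration only, with invented class names; the list of glucose forms is deliberately not exhaustive (the furanose forms are left out), and structures are referred to by name rather than by an actual connection table.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ConnectionTable:
    """Stand-in for a real connection table; only a label is stored here."""
    label: str


@dataclass(frozen=True)
class ChemicalConcept:
    """A concept that has several interconverting structures as parts."""
    name: str
    forms: Tuple[ConnectionTable, ...]

    def is_described_by(self, table: ConnectionTable) -> bool:
        return table in self.forms


open_chain = ConnectionTable("open-chain D-glucose")
alpha_pyranose = ConnectionTable("alpha-D-glucopyranose")
beta_pyranose = ConnectionTable("beta-D-glucopyranose")

# Non-exhaustive: furanose forms are omitted for brevity.
glucose_in_water = ChemicalConcept(
    "glucose in water", (open_chain, alpha_pyranose, beta_pyranose)
)

print(len(glucose_in_water.forms), glucose_in_water.is_described_by(beta_pyranose))
```

No single element of `forms` identifies the concept, which is exactly why a lone connection table makes a poor identifier here.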

There is another aspect to time dependence we also need to consider. For many materials, their existence in time, or, put another way, their history, often holds more meaningful information about an observed physical property of that substance than the chemical structure of one of the components of the mixture. For an observable property of a polymer, such as the glass transition temperature, for example, it matters a great deal whether the polymer was synthesized in the solid phase in a pressure autoclave or in solution at ambient pressure. Furthermore, it matters whether and how a polymer was processed – how was it extruded, grafted etc. All of these processes have a significant influence on the observable physical properties of a bulk sample of this polymer, while leaving the chemical description of the material essentially unchanged (in current practice, polyethylene is often represented either by using the structure of the corresponding monomer (ethene, for example) or the structure of a repeat unit fragment (-CH2-CH2-)). Ontologies will help us to describe and define these histories. Ultimately, we envisage that this will result in a “semantic fingerprint” of a material, which – one might speculate – will be much more appropriate for the development of design rules for materials than the dumb structure representations in use today.

Many chemical objects are mixtures….and mixtures simply do not lend themselves to being described using the connection table of a single constituent entity of that mixture. If this is true for glucose in water, it is even truer for things such as polymers: polymers are mixtures of many different macromolecules, all of which have slightly different architectures etc. An observed physical property, and therefore a data object, is the mereological sum of the contributions made by all the constituent macromolecules and therefore, such a data object cannot simply be associated with a single connection table.
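A familiar concrete case of a property that belongs to the ensemble rather than to any one macromolecule is the average molar mass of a polymer sample. The Python sketch below computes the number-average (sum of N_i M_i over sum of N_i) and the weight-average (sum of N_i M_i^2 over sum of N_i M_i) from a chain-length distribution. The sample distribution is made up for illustration and the function names are my own.

```python
def number_average_molar_mass(chains):
    """Mn = sum(N_i * M_i) / sum(N_i) for (molar_mass, count) pairs."""
    total_count = sum(count for _, count in chains)
    return sum(mass * count for mass, count in chains) / total_count


def weight_average_molar_mass(chains):
    """Mw = sum(N_i * M_i**2) / sum(N_i * M_i) for (molar_mass, count) pairs."""
    total_mass = sum(mass * count for mass, count in chains)
    return sum(count * mass ** 2 for mass, count in chains) / total_mass


# Made-up sample: two chains of 10,000 g/mol and one chain of 20,000 g/mol.
sample = [(10_000, 2), (20_000, 1)]

mn = number_average_molar_mass(sample)
mw = weight_average_molar_mass(sample)
print(mn, mw, mw / mn)  # the ratio Mw/Mn is the dispersity of the sample
```

Neither value can be read off the connection table of a single chain; both only exist for the distribution as a whole.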

This, in my view, is a short summary of the case for ontology in chemistry. Please feel free to violently (dis-)agree and if you want to do so, I am looking forward to a discussion in the comments section.

There’s one more thing:


The ChemAxiom ontologies are far from perfect and far from finished. We hope that they show what an ontological framework for chemistry could look like. In developing these ontologies, we can contribute our particular point of view, but we would like to hear yours. Even more, we would like to invite the community to get involved in the development of these ontologies in order to make them a general and valuable resource. If you would like to become involved, then please send an email to chemaxiom at googlemail dot com or leave comments/questions etc. in the ChemAxiom Google Group.

In the next several blog posts, I will dive into some of the technical details of the ontologies.

(Automatic Links etc., as always, by Zemanta)
