Polymer Informatics and The Semantic Web – The Solution, Part 1: Adding Structure

In one of my last posts I mentioned that one of the problems we encounter in current knowledge bases is the fact that polymer information is quite often present in free text. It is therefore very hard to extract information from these sources (although it can be done, see Peter Corbett’s OSCAR system) and even when it is accomplished, one is quite often faced with the problem of what the extracted information means. Take your favourite search engine and look for the term “cook” for example. The search engine will most likely retrieve information about people called “Cook”, about “cook” the profession, the Cook Islands or Cook County, Illinois.

One way around this is to add more descriptive data to the data contained in web pages and other documents, or, in other words, data about data. If we could mark up the term “cook” as a person, a profession or a place name according to the context in which we use it, a machine would have a much easier time finding the bits of information we were really interested in. Now, data about data is also called “metadata” and one way of adding metadata to documents is through the use of markup languages and, in our case, through the use of the Extensible Markup Language (XML) and its dialects.

Now the concept of a markup language should not be unfamiliar. Every internet user will have heard of HTML, the Hypertext Markup Language, which can be used to structure text into headings, tables, paragraphs etc. XML, just like HTML, belongs to the class of descriptive markup languages.

markup-languages.gif

If you use wikis at all, then you will have come across and used another type of markup, which serves purely presentational purposes. And maybe you write your papers in LaTeX and deal with PostScript files a lot, in which case you will have had exposure to procedural markup languages too.

Now according to the Wikipedia entry on XML, the latter “provides a text-based means to describe and apply a tree-based structure to information. At its base level, all information manifests as text, interspersed with markup that indicates the information’s separation into a hierarchy of character data, container-like elements, and attributes of those elements.” In an XML document, metadata is enclosed in angle brackets (“<” and “>”), which, in turn, enclose the data to be described. This is what is meant by a container. Let’s look at a simple XML document, a recipe for baking bread (also taken from the Wikipedia article):

bread.gif

We see that there are a number of containers with labels (known as “elements”), such as “recipe”, “title”, “ingredient”, “instructions” and “step”. Some of these carry a number of attributes, such as “name”, “prep_time”, “unit” and “state”, which specify further information concerning that element.

When looking at this example, you will hopefully have realized that XML is eminently human readable and that you don’t have to be a computer genius to figure out what is going on in the document. And you will hopefully also realize that this markup should now make it easy for a computer to, for example, extract all the ingredients from the text, as they are now explicitly labelled as such.
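To see just how easy the extraction becomes, here is a minimal sketch in Python. The element and attribute names follow the Wikipedia bread recipe example referenced above; the particular quantities are illustrative.

```python
# Minimal sketch: extracting ingredients from a recipe marked up in XML.
# Element/attribute names follow the Wikipedia bread-recipe example
# discussed above; the quantities here are illustrative.
import xml.etree.ElementTree as ET

recipe_xml = """
<recipe name="bread" prep_time="5 mins" cook_time="3 hours">
  <title>Basic bread</title>
  <ingredient amount="3" unit="cups">Flour</ingredient>
  <ingredient amount="0.25" unit="ounce">Yeast</ingredient>
  <ingredient amount="1.5" unit="cups" state="warm">Water</ingredient>
  <ingredient amount="1" unit="teaspoon">Salt</ingredient>
  <instructions>
    <step>Mix all ingredients together.</step>
    <step>Knead thoroughly.</step>
  </instructions>
</recipe>
"""

root = ET.fromstring(recipe_xml)
# Because every ingredient is explicitly labelled as such, extraction is
# a one-liner rather than a text-mining problem:
ingredients = [(el.text, el.get("amount"), el.get("unit"))
               for el in root.iter("ingredient")]
print(ingredients)
```

Contrast this with running a named-entity recognizer over the free-text version of the same recipe: here the markup has done all the work already.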

In my next post, I’ll discuss how to mark up chemistry and molecules….but maybe you can begin to see now, how this structuring of information could be useful for polymers already.

Web 2.0 for Scientists

I have used the term “web 2.0” on a number of occasions. Now most scientists tend to be relatively internet-savvy. Talking to people, I find, however, that most have not heard of “web 2.0” and certainly have no notion of what it entails. My colleague Dr. Andrew Walkingshaw recently gave an excellent talk to the Department of Earth Sciences here in Cambridge. We have videoed his talk and it is up on YouTube. Alternatively, watch it straight on here!

Polymer Informatics and the Semantic Web – The Problem, Part II – A Common Understanding of Polymers.

In my previous post concerning the challenges facing the polymer information scientist, I talked about a general lack of freely available polymer data as well as insufficient curation of that data.
Another reason that polymer informatics is in its infancy is simply the fact that we do not, at the moment, have a shared understanding of what a polymer is and how to represent it. Let me discuss this further.

Take a simple polymer, which has the following repeat unit:

repeat-unit.gif

Now if you are the Chemical Abstracts Service, you would register this polymer as “1,3-butadiene, homopolymer”. If you were IUPAC, you would allow any of these four names: “polybutadiene”, “poly(but-1-ene-1,4-diyl)”, “1,4-polybutadiene” or “poly(buta-1,3-diene)”.

Historical continuity of the indexing system is also an issue. The following monomer:

monomer.gif

would be registered as “methacrylic acid, methyl ester” in the 8th Collective Index of the Chemical Abstracts Service, but as “2-propenoic acid, 2-methyl-, methyl ester” in the 9th Collective Index.

I can already hear you saying: well, so how about a chemistry-based representation? Well, ok. To do that, polymers would traditionally be indexed using their repeat unit structure. So how about a polymer like this:

polymer.gif

For a polymer like this, two perfectly good repeat units could be written, namely -O-CHF-CH2- or -O-CH2-CHF-. Now, these two are identical for a chemist, but completely different things for a machine. So with multiple possible repeat units, you then have to get into the business of rules again and start to fiddle around with the alphabetical precedence of atoms, locants etc. And if you ever decide to change these rules, you have issues with historical continuity, which makes information searching and retrieval harder.
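The problem, and the flavour of a rule-based fix, can be sketched in a few lines of Python. Note that the token representation and the “smallest rotation read in either direction” rule below are illustrative assumptions of mine, not the actual CAS or IUPAC precedence rules discussed above.

```python
# Why multiple valid repeat units defeat naive matching, and how an arbitrary
# canonicalisation rule fixes it. The backbone-token representation and the
# "lexicographically smallest rotation in either direction" rule are
# illustrative assumptions, not the real CAS/IUPAC rules.

def canonical(units):
    """Canonical form of a repeat unit (given as a list of backbone tokens):
    the lexicographically smallest rotation of the chain, read in either
    direction, since a repeat unit has no privileged start or orientation."""
    candidates = []
    for seq in (list(units), list(reversed(units))):
        for i in range(len(seq)):
            candidates.append(tuple(seq[i:] + seq[:i]))
    return min(candidates)

a = ["O", "CHF", "CH2"]   # -O-CHF-CH2-
b = ["O", "CH2", "CHF"]   # -O-CH2-CHF-

print(a == b)                        # naive comparison: different
print(canonical(a) == canonical(b))  # canonical forms: identical
```

Any such rule works, as long as everyone uses the same one forever; the moment the rule changes, old and new registrations stop matching, which is exactly the historical continuity problem.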

So in summary, we have not satisfactorily solved representation issues and frequently encounter issues of historical continuity. Any modern polymer informatics system should aim to overcome those challenges.

I will discuss just how that could be accomplished in my next posts.

Blogging the ACS.

As I keep saying, the chemical blogosphere is exploding at the moment. And while I am not at the ACS in Chicago, I can follow a lot of it via the blogs:

Egon Willighagen on Chem-bla-ics
The People at Nature on The Sceptical Chymist
Kyle Finchsigmate on The Chem Blog

These are the people in my RSS reader. If you know of any more, please let me know!

Self-immolative Dendrimers.

Sometimes one comes across a piece of chemistry that makes one proud to be a chemist. I came across one such example recently at a conference on polymer therapeutics, where Doron Shabat of Tel Aviv University gave a presentation on self-immolative dendrimers [1,2].
The idea behind these molecules is that the dendrons can be loaded up with drug molecules, which they deliver to a target site. At the site of action, a trigger is cleaved, which leads to the self-unravelling of the dendrimer and the release of the drug. It is almost like a knitted jumper: pull on a bit of loose thread and the whole thing unravels. Here is how the chemistry works:

self-immolative-dendrimers.gif

The molecule essentially consists of three elements: a trigger moiety, an adaptor unit and a reporter, which is usually the drug. The adaptor unit is based on 2,6-bis(hydroxymethyl)-p-cresol and is therefore trifunctional. The phenolic group links to a short N,N′-dimethylethylenediamine spacer, which, in turn, connects to the trigger molecule. The hydroxybenzyl groups connect to reporter molecules via carbamate linkers.

Cleavage of the trigger moiety leads to an amine, which cyclizes to form a cyclic N,N-dimethylurea and a phenol. The latter undergoes a 1,4-quinone methide rearrangement, immediately followed by a decarboxylation, which liberates one of the reporter molecules. Attack of a molecule of water on the methide reforms the phenol, which subsequently undergoes a second rearrangement and loses the second arm. Attack by another water molecule reforms the phenol.

Now there are some problems with this: does one really want (relatively toxic) monomers from a carrier scaffold floating about in cells? But irrespective of this, it is just a beautiful, aesthetic and elegant piece of chemistry. I have only shown the first-generation dendrons here, but the synthesis has been reported up to the third generation. It’ll be interesting to see whether a linear version of this will come out too….doing this with a linear polymer would of course mean that much higher reporter/drug loadings could be achieved.

These are the key references:

[1] Amir, R. J., Pessah, N., Shamis, M., Shabat, D. Angew. Chem. Int. Ed., 42, 4494 (2003)
[2] Shamis, M., Lode, H. N., Shabat, D., J. Am. Chem. Soc., 126, 1726 (2004)

Wikis have hit IUPAC!

Just saw an announcement on the front page of the IUPAC website that they have finally discovered wikis. Kermit Murray and his colleagues have used them as a collaborative tool to develop standard definitions of terms relating to mass spectrometry. You can look at the project wiki here and the link to the paper is here.

For those of you who don’t know what wikis are: the name comes from the Hawaiian word ‘wiki-wiki’, which apparently means “quick.” Wikis combine the processes of editing and viewing a website. Website content is stored in a database on a webserver and the actual webpage is generated (using PHP) on the fly as the page is requested, using content in the database. Changes are also stored in the database, which means they can be tracked and undone, if necessary.
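The storage model described above, with every edit kept as a revision that can be tracked and undone, fits in a few lines of Python. This is only a toy sketch of the idea: real wiki engines use a relational database and render pages on the fly, and all names below are my own inventions.

```python
# Toy sketch of the wiki storage model described above: content lives in a
# store, every edit appends a new revision, and changes can be tracked and
# undone. Real wiki engines keep this in a database and render on the fly;
# the class and method names here are illustrative inventions.
from collections import defaultdict

class MiniWiki:
    def __init__(self):
        self.revisions = defaultdict(list)  # page name -> list of revisions

    def edit(self, page, text):
        self.revisions[page].append(text)   # never overwrite, always append

    def view(self, page):
        return self.revisions[page][-1]     # "render" the latest revision

    def history(self, page):
        return list(self.revisions[page])   # full change history for tracking

    def undo(self, page):
        if len(self.revisions[page]) > 1:   # keep at least one revision
            self.revisions[page].pop()

wiki = MiniWiki()
wiki.edit("MassSpec", "m/z: mass-to-charge ratio")
wiki.edit("MassSpec", "m/z: the dimensionless mass-to-charge ratio")
wiki.undo("MassSpec")
print(wiki.view("MassSpec"))
```

The append-only revision list is the whole trick: because nothing is ever destroyed, "undo" is cheap and every change is attributable.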

Because of their ease of use, wikis are interesting and efficient collaborative tools, and can be used for anything from communal paper writing to lab journaling and, of course, knowledge sharing (see Wikipedia itself). I know of at least one VERY VERY (hint) large chemical company that is using wikis for this purpose in house.

Polymer Informatics and the Semantic Web – The Problem, Part I: Availability and Curation of Data

In one of my last posts, I outlined the vision that we have for polymer informatics. Now let me outline some of the challenges that stand in our way. In my little scenario, I talked about a semantic web agent going off and gathering data. Well, here is where the difficulty starts for a machine. There are several problems:

1. Data Availability

One of the best loved and most commonly used sources of polymer information is the Polymer Handbook by Brandrup and Immergut. It contains information about approximately 2500 different polymers, scattered over multiple chapters. As it is paper-based, it is not accessible to machines and information has to be extracted and collated by hand. Wiley has taken the contents and turned them into a collection of HTML documents which are connected via hyperlinks. Though available in electronic form, it is still very difficult for a machine to extract anything from this, as all the information is present in unstructured free text. It is not impossible, mind you: systems such as OSCAR, which we are currently developing in-house, make that sort of thing possible, but it is still far from trivial and requires much hard work. “Polymers – A Property Database”, published by CRC, is set up in much the same way and is therefore subject to the same limitations. Furthermore, it is worth pointing out that all of these sources of data are commercial and if one’s host institution/organization does not subscribe to the relevant data source, one is….well….hosed anyway.

Things look up a bit with the PoLyInfo database, maintained by the National Institute for Materials Science of Japan. Here we find, amongst other valuable features, (sub)structure search and a string which defines the repeat unit structure of the polymer and which, in principle at least, is parseable. And all this goodness for approximately 13000 polymers, a large variety of physicochemical properties and, best of all, for free.

2. Data Curation

However, there is a catch. When looking, for example, at the glass transition temperature (Tg) entry for polydimethylsiloxane, we find an incredibly wide temperature range….-163 deg C to +42 deg C. How come there is such a wide range? Well, first of all, and this is the problem with a lot of polymer properties, the glass transition temperature is dependent on the molecular weight in the low molecular weight regime. As MWs increase, Tg eventually becomes invariant w.r.t. the molecular weight. Now when it comes to registering polymer property values, the polymer science community has gotten into the habit of reporting them WITHOUT the variables they depend on, such as MW in the case of the glass transition temperature. Clearly, this makes it very hard to build good and accurate predictive models for such properties.
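The molecular weight dependence I am describing is often captured by the empirical Fox-Flory equation, Tg(Mn) = Tg∞ - K/Mn. The parameter values in the sketch below are rough literature figures for polystyrene and are used purely for illustration; the point is the shape of the curve, not the numbers.

```python
# Illustration of why reporting Tg without the molecular weight is a problem.
# Empirical Fox-Flory relation: Tg(Mn) = Tg_inf - K / Mn.
# Parameter values are rough literature figures for polystyrene
# (Tg_inf ~ 373 K, K ~ 1e5 K.g/mol), used purely for illustration.

def fox_flory_tg(mn, tg_inf=373.0, k=1.0e5):
    """Glass transition temperature (K) vs number-average MW (g/mol)."""
    return tg_inf - k / mn

for mn in (3_000, 10_000, 100_000, 1_000_000):
    print(f"Mn = {mn:>9} g/mol  ->  Tg = {fox_flory_tg(mn):6.1f} K")
```

Running this shows Tg climbing steeply at low molecular weights and flattening out at high ones; a database entry that records only "Tg of polystyrene" without Mn could legitimately hold any value on that curve, which is exactly why the PoLyInfo ranges are so wide.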

Sticking with the glass transition temperature for a moment, here’s another one. Tg is mainly determined using two different methods, namely Differential Scanning Calorimetry (DSC) and Thermomechanical Analysis (TMA). While both methods try to determine a glass transition temperature, they measure fundamentally different things: DSC essentially determines a change in the heat capacity of a polymer, whereas TMA measures a dimensional change in the sample. And yes, when both methods are used on the same sample, the results usually differ by between 6 and 10 K. So it is crucial to report the measurement method, the experimental conditions etc. Furthermore, when data is abstracted to and accumulated in a knowledge system, it has to be curated to ensure that all relevant and necessary bits of metadata are available.

On occasion, PoLyInfo will also register data for composites under the pure polymer…..which of course can shift properties tremendously.

In summary, then, the first set of challenges we encounter are data availability, data curation and metadata. Unfortunately there is more, which I will discuss in one of my next posts.

When Ontologies go wrong.

Now as part of the polymer informatics project I am currently working on an ontology for polymer concepts. For those of you who don’t know: an ontology in computer science is a data model that represents a set of concepts within a domain and the relationships between those concepts, and is used to reason about the objects within that domain. This is the Wikipedia definition.

As part of this I am also currently trying to map out which ontologies on the wild wild web contain chemistry terms and to look at those slightly more closely. Today I had a look at Cyc, which claims to be an upper ontology of everything. When looking up the definition of the term “polymer” I found the following:

cyc-polymer-molecule-definition.gif

Now what is wrong with that definition? A whole lot. First of all, the term “PolymerMolecule” is a contradiction in terms. A polymer is a SUBSTANCE, which is composed of many individual MACROMOLECULES, as the definition above quite rightly states. It cannot therefore be a molecule in its own right and hence a concept like “PolymerMolecule” does not make sense. Furthermore, a macromolecule, according to IUPAC, is “a molecule of high relative molecular mass, the structure of which essentially comprises the multiple repetition of units derived, actually or conceptually, from molecules of low relative molecular mass.” So even if the “chemical union of five or more identical combining units” were to refer to a concept “macromolecule”, this particular definition would generate an oligomer at best.
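The distinction the IUPAC definitions draw can be made concrete with a toy class hierarchy. The class names below are my own illustrative inventions, not taken from Cyc or any real chemistry ontology; the point is simply that “polymer” belongs under substance, not under molecule.

```python
# Sketch of the distinction discussed above: a polymer is a SUBSTANCE made up
# of many individual macromolecules, so modelling it as a kind of molecule
# (as "PolymerMolecule" does) puts it in the wrong branch of the hierarchy.
# All class names here are illustrative, not from any real ontology.

class Molecule:
    pass

class Macromolecule(Molecule):
    """A molecule of high relative molecular mass (per the IUPAC definition)."""
    pass

class Substance:
    pass

class Polymer(Substance):
    """A substance composed of many individual macromolecules."""
    def __init__(self, macromolecules):
        self.macromolecules = macromolecules

pb = Polymer([Macromolecule() for _ in range(1000)])
print(isinstance(pb, Substance))   # a polymer is a substance...
print(isinstance(pb, Molecule))    # ...not a molecule in its own right
```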

Upon seeing this, my colleague Dr Peter Corbett remarked that upper ontologies are fundamentally broken. I tend to agree with him.

Polymer Informatics and the Semantic Web – The Vision

What is the thing that you would love to be able to do as a scientist when you develop a new application involving polymers (and in this context it does not matter whether you would like to develop a new shampoo formulation involving a polymer or whether you use a polymer as a matrix for drug delivery)?

You would love to be able to draw up a property profile that any polymer which you use in your particular application has to conform to (these could be physico-chemical properties as much as, for example, toxicity and biodistribution data or availability and cost data from a supplier). In other words, you would like to solve the inverse structure-property relationship problem.

Now a typical scenario for this could be the following. Imagine a young, trendy polymer chemist, let’s call him Peter, at your favourite personal care or pharmaceutical company. One morning – Peter is in the lab doing his research as usual – he gets a call from his boss, John. John tells him that the polymer they are using in their shampoo/washing powder/pill/toothpaste formulation is likely to become blacklisted by a regulatory authority soon and that he (Peter) really ought to come up with a viable alternative polymer.

Peter agrees to take the job and heads to his computer as soon as he gets off the phone with John. The first thing he does is to retrieve the recipe of the shampoo/washing powder/pill/toothpaste formulation. Next he instructs his semantic web agent to find alternative polymers for the formulation. The agent does this by going off to in-house databases and retrieving physico-chemical properties and toxicology data, but also by going to suppliers’ catalogues and gathering information about pricing and availability, and to the servers of a regulatory authority to check whether a candidate polymer is blacklisted. It then compiles the information from the various sources, evaluates it against the property profile of the polymer currently in use and returns a list of candidate polymers to Peter.

That is the vision. And when I say vision, I mean vision – we are a very long way away from being able to make this come true. It is the vision that Tim Berners-Lee developed in his Scientific American article in 2001. Semantic technologies can go a long way towards making this vision come true. In subsequent posts under the same heading, I will discuss what is wrong with polymer informatics at the moment and will develop a strategy for making the vision come true.

If you are reading this, I would very much like your comments and ideas on all of this….so consider yourselves invited to get involved.

Hermann Staudinger

Hermann Staudinger (1881 – 1965) is the father of modern polymer chemistry and the man who gave us the concept of the macromolecule. Therefore I think it is only appropriate that he should lend his name to this blog and that I should quickly discuss who he was, before this blog starts in earnest.

Staudinger was born in Worms, an old Roman foundation, and studied chemistry at the universities of Halle, Darmstadt and Munich. After receiving his doctorate from Halle (only four years after he matriculated at the university), he habilitated at the University of Strasbourg and at the age of 26 was appointed to a professorship in organic chemistry at the Technische Hochschule Karlsruhe and later at the ETH Zurich as the successor to the famous Richard Willstaetter (who had received the Nobel Prize in Chemistry in 1915). At the time, Staudinger was only 31 years old. It was at Zurich where his work on azide chemistry and the synthesis of synthetic diamonds first created a “bang” in the scientific world in more than one way. First there was the reaction named after him, the Staudinger reaction, which allows the gentle reduction of an azide to an amine:


Mechanistically, the triphenylphosphine reacts with the azide to form a phosphazide, which subsequently loses nitrogen to give an iminophosphorane. Aqueous workup then leads to the amine.

The other “bang” resulted from Staudinger’s attempts to create synthetic diamonds: in a quarry close to Zurich he conducted experiments in which he reacted carbon tetrachloride (InChI=1/CCl4/c2-1(3,4)5) with sodium metal (InChI=1/Na) in a closed container. The idea, of course, was that the reaction would form sodium chloride (InChI=1/ClH.Na/h1H;/q;+1/p-1) and elemental carbon (InChI=1/C), which would arrange into a diamond lattice given the high pressures generated by the explosion.
During his time in Karlsruhe, Staudinger started to pursue research into rubber chemistry and in a paper in 1920 floated the idea that rubbers, and polymers in general, were composed of small repeating molecular units, which were all covalently linked [1]. The idea put him at odds with a number of leading chemists of his time, most notably Emil Fischer, who, like many others at the time, believed in Graham’s colloid theory, which stated that micellar self-assembly of small, non-covalently linked molecules was essentially responsible for polymer properties. Another doubter was Wieland, who wrote in a letter to Staudinger [2]:

“Dear Colleague, abandon your idea of large molecules, organic molecules with molecular weights exceeding 5000 do not exist. Purify your products such as rubber, they will crystallize and turn out to be low molecular weight compounds”

In his memoirs, Staudinger later added [2]:

“Those colleagues who were aware of my early publications in the field of low molecular weight chemistry asked me why I had decided to quit these beautiful fields of research and why I devoted myself to such disgusting and ill-defined compounds as rubber and synthetic polymers, which at that time, in view of their properties, were referred to as grease chemistry (“Schmierenchemie”).”

In 1922, he published a paper concerning the hydrogenation of natural rubber [3] and it was in this paper that he first coined the term “macromolecule.” Upon moving to the University of Freiburg in 1926, Staudinger started to pursue grease chemistry full time, continuing to amass experimental evidence indicating the existence of macromolecules. His studies on crystalline poly(oxymethylene) (POM) using X-ray crystallography clearly proved the existence of macromolecules [4]. Overall, his research in macromolecular chemistry resulted in the publication of 644 papers and the award of the Nobel Prize in Chemistry in 1953.
Rolf Muelhaupt, Staudinger’s successor in the chair for Macromolecular Chemistry at Freiburg, has recently published an eminently readable biography [2] of the man and his chemistry, which I would urge you all to read.

[1] Staudinger, H., Ber. Deut. Chem. Ges., 53, 1073 (1920)
[2] Muelhaupt, R., Angew. Chem. Int. Ed., 43(9), 1054 (2004). DOI: 10.1002/anie.200330070
[3] Staudinger, H., Fritschi, J., Helv. Chim. Acta, 5, 785 (1922)
[4] Staudinger, H., Johner, H., Signer, H., Mie, G., Hengstenberg, J., Z. Phys. Chem., 126, 425 (1927)