2011 – The International Year of Chemistry


In their editorial for the January issue (you will need a Nature subscription to access this; alternatively, see the Sceptical Chymist post here), the good folks at Nature Chemistry have reminded us that 2011 is the International Year of Chemistry:

“The United Nations has proclaimed 2011 to be the International Year of Chemistry. Under this banner, chemists should seize the opportunity to highlight the rich history and successes of our subject to a much broader audience — and explain how it can help to solve the global challenges we face today and in the future.”

The year even has a website. The UN also singles out two important areas of chemistry – neither of which has chemistry in its name – on the front page of the site: namely the development of advanced materials and molecular medicine. I am extremely happy to see this – materials, and in particular polymers, have been a long-standing interest of mine, and some of the immunology work I am currently doing has implications for molecular medicine too.

There are several ways to participate in the Year of Chemistry – one of them is through an essay and video competition: “A World Without Polymers”. Students are asked to make short videos or write essays, trying to imagine what the world would be like without polymers. Furthermore, there are networking events, conferences and more all across the world. So go and check out the UN’s site, participate and contribute!


ChemAxiom: An Ontology for Chemistry – 1. The Motivation

I announced some time ago that we are working on ontologies in the polymer domain, though I realise that so far I have yet to produce the proof: the actual ontology/ontologies.

So today I am happy to announce that the time of vapourware is over and that we have released ChemAxiom – a modular set of ontologies which form the first ontological framework for chemistry (or at least so we believe). The development of these ontologies has taken us a while: I started on a hunch and as a nice intellectual exercise, not entirely sure where to go with them or what to use them for, and therefore not working on them full time. As the work progressed, however, we understood just how inordinately useful they would be for what we are trying to accomplish in both polymer informatics and chemical informatics at large. I will introduce and discuss the ontologies in a succession of blog posts, of which this is the first.

So what, though perhaps somewhat retrospectively, was the motivation for preparing the ontologies? In short: the breakdown of many common chemistry information systems when confronted with real chemical phenomena rather than small subsections of idealised abstractions. Let me explain.

Chemistry and chemical information systems positively thrive on the use of a connection table as a chemical identifier and determinant of uniqueness. The reasons for this are fairly clear: chemistry, for the past 100 years or so, has elevated the (potential) correlation between the chemical structure of a molecule and its physicochemical and biological properties to its “central dogma.” The application of this dogma has served subsections of the community – notably organic, medicinal and biological chemists – incredibly well, while causing major headaches for other parts of the chemistry community and giving an outright migraine to information scientists and researchers. There are several reasons for the pain:

The use of a connection table as an identifier for chemical objects leads to significant ontological confusion. Often, chemists and their information systems do not realise that there is a fundamental distinction between (a) the platonic idea of a molecule, (b) the idea of a bulk substance and (c) an instance of (b) – the real bulk substance in a flask or bottle on the researcher’s lab bench. An example of this is the association of a physicochemical property of a chemical entity with a structure representation of a molecule: while it would, for example, make sense to do this for a HOMO energy, it does NOT make sense to speak of a melting point or a boiling point in terms of a molecule. The point here is simply that many physicochemical properties are the mereological sums of the properties of many molecules in an ensemble. If this is true for simple properties of pure small molecules, it is even more true for properties of complex systems such as polymers, which are ensembles of many different molecules with many different architectures. A similar argument can also be made for identifiers: in most chemical information systems, it is often not clear whether an identifier (such as a CAS number) refers to a molecule or to a substance composed of these molecules.
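The three-way distinction can be sketched in a few lines of code. This is a minimal illustration in Python (the class names are mine, not ChemAxiom’s): molecular properties hang off the molecule, ensemble properties such as a melting point hang off the substance, and a sample is a concrete instance on the bench.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Molecule:
    """The platonic idea of a molecule, identified by a connection table."""
    connection_table: str                   # e.g. a SMILES string
    homo_energy_ev: Optional[float] = None  # a genuinely molecular property

@dataclass
class Substance:
    """A bulk substance: an ensemble of molecules."""
    name: str
    constituents: Tuple[Molecule, ...]
    melting_point_c: Optional[float] = None  # an ensemble property - belongs here

@dataclass
class Sample:
    """An instance of a substance in a particular flask or bottle."""
    substance: Substance
    location: str
    mass_g: float
```

Nothing more than a type system is needed to make the category error visible: a melting point simply has no place to live on `Molecule`.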

Many chemical objects have temporal characteristics. Often, chemical objects have temporal characteristics which influence and determine their connection table. A typical example of this is rapidly interconverting isomers: glucose, when dissolved in water, for example, can be described by several rapidly interconverting structures – a single connection table is not enough to describe the concept “glucose in water”, and there exists a parthood relationship between the concept and several possible connection tables. Ontologies can help with specifying and defining these parthood relationships.
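The glucose example can be pictured as a simple parthood mapping. The form names below are real tautomers of D-glucose, but the mapping itself is only an illustration of the idea, not ChemAxiom’s actual representation:

```python
# "glucose in water" has as parts several interconverting forms, each of
# which would need its own connection table; no single table describes
# the concept. (Illustrative sketch only.)
has_part = {
    "glucose (aqueous)": {
        "alpha-D-glucopyranose",
        "beta-D-glucopyranose",
        "open-chain D-glucose",
    },
}

def connection_tables_for(concept):
    """Resolve a chemical concept to the set of structures it comprises."""
    return has_part.get(concept, set())
```

An information system built on one-concept-one-connection-table has no slot for this one-to-many relation; an ontology with an explicit `hasPart` relation does.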

There is another aspect of time dependence we also need to consider. For many materials, their existence in time – or, put another way, their history – often holds more meaningful information about an observed physical property of that substance than the chemical structure of one of the components of the mixture. For an observable property of a polymer, such as the glass transition temperature, it matters a great deal whether the polymer was synthesized on the solid phase in a pressure autoclave or in solution at ambient pressure. Furthermore, it matters whether and how a polymer was processed – how it was extruded, grafted etc. All of these processes have a significant influence on the observable physical properties of a bulk sample of the polymer, while leaving the chemical description of the material essentially unchanged (in current practice, polyethylene is often represented either by the structure of the corresponding monomer (ethene, for example) or by the structure of a repeat unit fragment (-CH2-CH2-)). Ontologies will help us to describe and define these histories. Ultimately, we envisage that this will result in a “semantic fingerprint” of a material, which – one might speculate – will be much more appropriate for the development of design rules for materials than the dumb structure representations in use today.
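To make the point concrete, here are two polyethylene samples with an identical structural description but different histories – and hence, potentially, different observed properties such as Tg. The class and process-step names are purely illustrative:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Material:
    """A material described by a repeat unit plus its processing history."""
    repeat_unit: str                          # e.g. "-CH2-CH2-"
    history: List[str] = field(default_factory=list)

    def process(self, step):
        """Record a synthesis or processing step and return self for chaining."""
        self.history.append(step)
        return self

pe_a = Material("-CH2-CH2-").process("solution polymerisation, ambient pressure")
pe_b = (Material("-CH2-CH2-")
        .process("solid phase, pressure autoclave")
        .process("extruded"))
```

A connection-table-keyed database would collapse `pe_a` and `pe_b` into one record; the ordered history is exactly the extra information a “semantic fingerprint” would preserve.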

Many chemical objects are mixtures… and mixtures simply do not lend themselves to being described using the connection table of a single constituent entity of that mixture. If this is true for glucose in water, it is even truer for things such as polymers: polymers are mixtures of many different macromolecules, all of which have slightly different architectures. An observed physical property, and therefore a data object, is the mereological sum of the contributions made by all the constituent macromolecules, and therefore such a data object cannot simply be associated with a single connection table.
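The textbook molar-mass averages make this concrete: the number average Mn = ΣNiMi/ΣNi and the weight average Mw = ΣNiMi²/ΣNiMi are sums over the whole ensemble and cannot be attached to any single connection table. A minimal sketch, with invented numbers:

```python
def molar_mass_averages(distribution):
    """distribution: iterable of (n_i, M_i) pairs - molecule count and
    molar mass for each distinct chain species in the ensemble."""
    dist = list(distribution)
    total_n = sum(n for n, m in dist)
    total_nm = sum(n * m for n, m in dist)
    total_nm2 = sum(n * m * m for n, m in dist)
    mn = total_nm / total_n      # number-average molar mass
    mw = total_nm2 / total_nm    # weight-average molar mass
    return mn, mw, mw / mn       # mw/mn is the dispersity

# Two chain species, 100 chains each, at 10 and 20 kg/mol:
mn, mw, pdi = molar_mass_averages([(100, 10_000), (100, 20_000)])
```

Even this toy ensemble of two species already has Mw > Mn: the “property of the polymer” is a property of the distribution, not of either chain.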

This, in my view, is a short summary of the case for ontology in chemistry. Please feel free to violently (dis-)agree – and if you do, I look forward to a discussion in the comments section.

There’s one more thing:

AN INVITATION

The ChemAxiom ontologies are far from perfect and far from finished. We hope that they show what an ontological framework for chemistry could look like. In developing these ontologies, we can contribute our particular point of view, but we would like to hear yours. Even more, we would like to invite the community to get involved in the development of these ontologies in order to make them a general and valuable resource. If you would like to become involved, then please send an email to chemaxiom at googlemail dot com or leave comments/questions etc. in the ChemAxiom Google Group.

In the next several blog posts, I will dive into some of the technical details of the ontologies.

(Automatic Links etc., as always, by Zemanta)


Polymer Markup Language Paper

Now, I started this blog with the intention of writing about polymers, informatics etc. Somewhere along the way, some advocacy, some ranting and a general critique of the scholarly publication process also crept in and, of course, there were long breaks. However, we have recently published Polymer Markup Language, which has been in the making for a while, and I am pleased to announce the paper, published in the Journal of Chemical Information and Modeling:

Chemical Markup, XML and the World-Wide Web. 8. Polymer Markup Language

Nico Adams, Jerry Winter, Peter Murray-Rust and Henry S. Rzepa

Polymers are among the most important classes of materials but are only inadequately supported by modern informatics. The paper discusses the reasons why polymer informatics is considerably more challenging than small molecule informatics and develops a vision for the computer-aided design of polymers, based on modern semantic web technologies. The paper then discusses the development of Polymer Markup Language (PML). PML is an extensible language, designed to support the (structural) representation of polymers and polymer-related information. PML closely interoperates with Chemical Markup Language (CML) and overcomes a number of the previously identified challenges.

Many thanks are due to everybody who worked on this and everybody in the Unilever Centre who was available for discussions, comments and critique.

The paper can be found here.


The Dutch Polymer Institute

I am currently sitting through one of those corporate presentations and so have some time to blog about the Dutch Polymer Institute (DPI) which is organising the meeting I am currently attending.

The DPI was set up by the Dutch Government in 1996 as part of the “leading technology institute” (LTI) initiative – other current institutes are the Netherlands Institute for Metals Research, the Telematics Institute, the Wageningen Centre for Food Sciences, the Dutch Separation Technology Institute, Top Institute Pharma, Wetsus and TI Green Genetics. The DPI is a public-private partnership (PPP), with funding being provided by industry (25 %), academia (25 %) and government (50 %). A 2003 OECD study suggested that the DPI was one of the purest forms of PPP, and it is certainly one of the few examples I know of which work well. In a typical scenario, an industrial member joins the institute (which is a virtual institute – it has no laboratories or facilities of its own) by purchasing a share (“a ticket”) in the institute, currently worth approximately 50,000 Euros per annum with a minimum commitment of four years. Academia contributes the same amount of money (in practice through in-kind contributions) and the Dutch government doubles this sum. An investment of Eur 200,000 by industry thus generates about Eur 800,000 in research funding (the DPI has minimal overheads due to the virtual nature of the institute). This is a daring scheme in many ways: the institute is international, and while the largest beneficiary of this funding model is still Dutch research (and in particular the Eindhoven University of Technology (TU/e)), international research is funded too, and we in Cambridge certainly benefit from this as well. It is a great credit to the Netherlands that this is possible and I wonder how many other European governments would be willing to set up such a scheme.
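To spell out the leverage, here is the arithmetic behind the figures quoted above (ticket price and multipliers are as stated in the post, not independently verified):

```python
# One industrial "ticket" over the minimum four-year commitment:
ticket_per_year = 50_000                    # EUR per annum
industry = ticket_per_year * 4              # total industrial investment
academia = industry                          # matched via in-kind contributions
government = industry + academia             # government doubles the combined sum
total = industry + academia + government     # research funding generated
```

This reproduces the 25/25/50 split: 200,000 Euros from industry becomes 800,000 Euros of research funding.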

DPI’s main mission is to catalyse the further development of fundamental research and to bring it up to the end of the pre-competitive phase, so that it can subsequently be taken up by industry and developed into commercial products. The DPI does this by financing academic research and staff, suggested and requested by the academic partners in the programme and approved by the industrial members. Any intellectual property generated as part of the projects will be transferred to the industrial members if there is a request to do so (part of industry’s ROI); if not, it will be disseminated in the normal way (i.e. through publications). The DPI furthermore provides a platform for networking and recruiting.

The institute has a lot to be proud of: DPI-funded research produces between 150 and 200 papers per year and a good number of patents, a significant number of which have been transferred to industry. For full details, check the annual report (2007) here.

A meeting like the one today shows the vibrancy of its community and I can only hope that it will continue to prosper in the future.


An appetite for open data…

…is what I have encountered here at Antwerp already. I am currently at the annual meeting of the Dutch Polymer Institute, with which I have been associated in various forms over the best part of five years now. We are the guests of Borealis here in Antwerp and as such, it promises to be an interesting meeting. The morning will be taken up with “Golden Thesis Awards”. The DPI evaluates all PhD theses it funds by scientific merit and the best PhD students in a year are given an award. This is followed by an excursion to Borealis and in the afternoon, there will be thematic sessions: “Polymers and Water” and “Polymers and Time”. The former is self-explanatory and the latter concerns mainly molecular simulations of polymers at short and long time scales. This is followed by poster sessions and a Borealis-hosted dinner in the evening. Tomorrow we will have several further talks on bio-based polymers, sustainability and solar cells, and in the evening a brainstorming session: “What could polymers mean for the bottom of the pyramid?” I like DPI meetings – they are extremely young… most of the participants are PhD students and post-docs, always brimming with energy.

In that spirit, I arrived at my hotel last night and sat down for dinner. It didn’t take long before I was surrounded by old and some new acquaintances and we spent the time catching up and discussing what we have been doing. And inevitably the conversation turned to polymer informatics and open data. There were many questions: “Will extraction of data from a manuscript cause problems with publication later?”, “Why should I trust you and give you my manuscript or thesis to datamine?”, “How does copyright work out?”, “What happens to the publishers – why should they not sell my data?” etc. However, all the minds were open. They see the argument for open data and open knowledge and they agree with it in principle, but there is great uncertainty as to the politics and technicalities associated with open data. The moral of the story is: much more talking needs to be done and much more education. Open access and open data evangelists should put together an FAQ for “mere mortals”, i.e. researchers who do not think about this all the time and who should not have to think subtly about the differences between “gold OA”, “green OA”, “libre OA” and what have you. We need to do much more talking to the science community. Let’s start now. And let’s not weaken our position by OA sophistry. I will try and blog some more as the meeting goes on and hopefully also provide some photos.

PS: You will see some new and unusual tags at the bottom of this blog post (UPDATE: no tags, apparently) and links in the text. I have installed Zemanta to try and make this blog semantically a little richer. The tags and links are autogenerated and I hope the result is worthwhile.


Some Polymer Geek Chic for the Weekend: Amazing Poly(acrylic acid)

No matter how often I see it demonstrated, I still think that poly(acrylic acid) and its salts are amazing materials. Sodium polyacrylate is used, as many of you know, as a superabsorber – it can absorb about 1000 times its weight in water: 1 g is enough to turn a litre of water into a gel. Unsurprisingly, poly(sodium acrylate) is now a standard ingredient in baby diapers. It all works via hydrogen bonding… the addition of salt to the gel will reverse the effect and destroy the gel:

Why oh why oh why….?

…do we make things so difficult for ourselves sometimes? Sigh. It’s late-ish on Friday evening and therefore it seems time for another rant.

I was browsing a couple of journal homepages just now and came across an interesting paper by Fenniri et al. – well-known names in combinatorial chemistry. What caught my eye was that these guys had prepared a library of 630 polystyrene copolymers, with the intention of using them as substrates for solid phase organic synthesis (all the styrene monomers they used were spectroscopically active, and their vibrational fingerprint (IR or Raman) could be used to uniquely identify each copolymer and therefore the synthesis history of each molecule on a copolymer bead (DOI)).

Being someone with a history and interest in combinatorial polymer science as well as solid phase synthesis, I naturally got excited about this: a library of 630 copolymers together with spectroscopic data does not get reported all that often. They had also provided all the spectroscopic information in the supporting info. So off I go to have a look at it and to see how it had been reported. Well, I click on the supporting info link and all seems to be well… a largish file (approx. 8 MB) is downloading. It’s a pdf… ok… download has finished… my computer should open the thing automatically… wait for it… oh… what do I see? This:

[image: pdf-package.gif]

Sure enough… Preview, my built-in Mac viewer for all things pdf, refuses to open this. So what has happened? With the introduction of Acrobat 8, Adobe introduced the concept of pdf packages… you can now package up a couple of individual pdf files in a new one… it’s kind of a zip for pdf. But this is all wrong. Sure, Adobe Reader is free and I could download it and view the things. But I am using a Mac and therefore have a brilliant pdf reader shipping with the operating system. It renders much faster than Adobe’s own software, and interacting with it is a pleasure – unlike Adobe’s software. I don’t want to have to download it… and I suspect that some of the tools that we are using, and that Peter has spent a long time hacking, will fall over this. And other (free) non-Adobe tools don’t seem to deal with it… at least not yet. That is the trouble with proprietary standards. But that is only the beginning. I happen to have access to Acrobat Professional 7, which managed to open this file after a couple of error messages and a bit of huffing and puffing. And then what do you see? 630 spectra which look like this:

[image: spectrum.gif]

And that’s when it hit me really hard… what a waste this is… all these spectra, reported and accessible to the world in this format. I mean, I would love to get my grubby mittens on this data… if there were some polymer property data there too, it could potentially be a wonderful dataset for mining, structure-property relationships, the lot. But of course, this is not going to happen if I can only get at the data in the form of tiny spectral graphs in pdf… there is just so little one can do with that. What I would really need is the digital raw data, preferably in some open format, which I can download, look at and work on. But because I cannot get to the data and do stuff with it, it is potentially lost. I cannot add value to it… as Jim pointed out recently, reposited data gains value over time, because you can be almost sure that someone will be interested in it, work with it and maybe mash it up with other data in ways never envisaged by the original data creator.

So who is to blame? The scientist for being completely unthinking and publishing his data as graphs in pdf? Adobe for messing with pdf? The publishers for using pdf and new formats and keeping the data inaccessible unless I as a user use their technology standard (quite apart from needing to subscribe to the journal etc.)? The chemical community at large for not having evolved mechanisms and processes to make it easy for researchers to distribute this data? The science infrastructure (e.g. libraries, learned societies etc.) for not providing the necessary infrastructure to deal with data capture and distribution? Well, maybe everybody… a little…
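To make the contrast concrete: even with nothing but the standard library, raw spectral data can be shipped in an open, tabular form that anyone can parse. The data points below are invented for illustration; JCAMP-DX would be the proper domain format, but even plain CSV beats a picture in a pdf:

```python
import csv
import io

# A spectrum as (wavenumber / cm^-1, absorbance) pairs - illustrative values.
spectrum = [(4000.0, 0.02), (2916.0, 0.85), (1601.0, 0.40)]

# Write it out as CSV with a descriptive header row.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["wavenumber_cm-1", "absorbance"])
writer.writerows(spectrum)
csv_text = buf.getvalue()

# Anyone can read it straight back - no proprietary viewer required:
rows = list(csv.reader(io.StringIO(csv_text)))
```

Ten lines of code, and the data is mineable, plottable and mashable by anyone, forever.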

Let’s start with ourselves, the scientists. Certainly, when I was a classically working synthetic chemist, data didn’t matter. No, really. We produced compounds on a one-by-one basis, which, even over the course of, for example, a PhD career, doesn’t really amount to a data flood, particularly if all you do is the usual chemistry rut: synthesize and characterize. And even if you make slightly more compounds than your average PhD student (I prepared about 120 distinct and fully characterized compounds during my doctorate), all it takes is a slightly sadistic supervisor (I had one of those too) for whom it is perfectly ok to have a PhD student sit there for a couple of weeks and type it all up in a Word document. And really, that is the problem. Most chemistry labs have evolved a combination of a text processor and a spreadsheet for their data handling needs, which suits them perfectly for group-internal purposes… and a combination of Word and Excel on that level can be quite a powerful one. Because one’s learned brethren never ask for more than the data typed up in the usual way in a paper, because it is the paper and therefore the publisher that takes care of dissemination, and because the Word/Excel combination takes care of data in the lab, the average chemist never has to think about data – how to generate it, how to store it, how to disseminate it. And – I only speak for the materials sciences now; the situation is different in parts of organic chemistry, in particular medicinal chemistry – even the not-so-standard chemists, namely the combinatorial and high-throughput ones, who generate more data and should therefore be interested, often fall into this trap: the powerful combination of Word and Excel as datastores. And who can blame them? For them, too, data is a means, not an end. They are interested in the knowledge they can extract from the data… and once extracted, the data becomes secondary.

Now you might argue that it is not a chemist’s job to worry about data, it’s his job to do chemistry and make compounds (I know… it’s a myopic view of chemistry, but let’s stick to it for now). And yes, that is a defensible point, though I think that, certainly with the increasingly routine use of high-throughput and combinatorial techniques, it is becoming less defensible. Chemists need to realise that the data they produce has value beyond the immediate research project for which it was produced. Furthermore, it has usually been generated at great cost and effort and should be treated as a scarce resource. Apart from everything else, data produced through public funding is a public good and produced in the public interest. So I think chemists have to start thinking about data… and it won’t come easy to them. And one way of doing this is, of course, to get them where it hurts most: the money. So the recent BBSRC data policy initiative seems to me to be a step in the right direction:

BBSRC has launched its new data sharing policy, setting out expected standards of data sharing for BBSRC-supported researchers. The policy states that BBSRC expects research data generated as a result of BBSRC support to be made available with as few restrictions as possible in a timely and responsible manner to the scientific community for examination and use. In turn, BBSRC will provide support and funding to facilitate appropriate data sharing activities, and recognises that different fields of study will require different approaches. What is sensible in one scientific or technological area may not work in others. The policy aims to achieve the sharing of data in both a timely and scientifically appropriate manner. In implementing the policy, all research proposals submitted to BBSRC from 26th April 2007 must now include a statement on data sharing. This should include concise plans for data management and sharing or provide explicit reasons why data sharing is not possible or appropriate. The statement on data sharing will be under a separate section within the ‘case for support’ document, and the guidance for applicants has been updated accordingly. Statements will be considered separately from the scientific excellence of the proposed research; however, an application’s credibility will suffer if peer review agrees the statement is inappropriate.

This is a good step in the right direction (provided it is also policed!!) and one can only hope that the EPSRC, which, as far as I know, does not have a formal policy at the moment, will follow suit. There’s another thing though. Educating people about data needs to be part of the curriculum, starting with the undergraduate chemistry syllabus. And the few remaining chemical informaticians of the world need to get out of their server rooms and into the labs. If you think that organic chemists are bad at not wanting to have anything to do with informatics, well, informaticians are usually even worse at not wanting to have anything to do with flasks. And it makes me hopping mad when I hear that “this is not on the critical path”. Chemical informatics only makes sense in combination with experiments, and it is the informaticians here who should lead the way and show the world just how successful a combination of laboratory and computing can be. It is that which will educate the next generation of students and make them computer- and data-literate.

You might also argue, of course, that it should be a researcher’s institution that takes care of data produced by the research organisation. Which then brings us on to institutional repositories. Well, the trouble here is: can an institution really produce the tools for archiving and dissemination? What a strange question, you will say. Is not Cambridge involved in DSpace and SPECTRa etc.? Yes. The point, though, is that scientific data is incredibly varied and new data with new data models gets produced all the time. Will institutional repositories really be able to evolve quickly enough to accommodate all this, or will they be limited to well-established data and information models because they typically operate a “one software fits all” model?
I may not be the best qualified person to judge this, but having worked in a number of large institutions in the past and observed the speed at which they evolve, there is nothing that leads me to believe that institutions and centralized software systems will be able to evolve rapidly enough. Jim, in a recent post, already alludes to something like this and makes reference to a post by Clifford Lynch, who defines institutional repositories as

“a set of services that a university offers to the members of its community for the management and dissemination of digital materials created by the institution and its community members”

I must confess, I find this thought intriguing. The set of services doesn’t have to be a monolithic bit of software as we have now, but rather a set of services which manages digital materials. So could one naively consider a bottom-up approach? One in which, for example, universities make available resources that would allow an individual department or even researcher to collaborate directly with a librarian and an information scientist, who would then make sure that a scientist’s data needs were met, while at the same time allowing the institution to ensure proper data preservation etc. How about individual research groups maintaining their own repositories, which they evolve quickly with the help of a central institution, and which a central system could then access, index, link to etc. – and help with dissemination?

Now the question here is: what’s in it for the individual researcher? Not much, I would say. Sure, some will care about data preservation, and some will acknowledge the fact that publicly funded research is a public good, that the researcher therefore has a duty of care towards the product, and will therefore care. But as discussed above: the point of generating data is ultimately getting the next publication and finishing the project… what happens beyond that is often irrelevant to the researcher who generates the data, which means there is no point in expending much effort, and maybe even climbing a learning curve, to be able to archive it and disseminate it in any way other than by sticking it into pdf. Ultimately, the data disappears in the long tail. So if there is no obvious carrot, then maybe a stick? Well, the stick will live and die with, for example, the funders. Having a data policy is great, but it also needs to be policed and enforced. The funders can either do this themselves or, in the case of public money, might even conceivably hand this to a national audit office. And I think the broad gist of this discussion also applies to learned societies etc. So now we are back at the researcher. And back to needing to educate the researcher and the student… and therefore ultimately back again at chemoinformaticians having to leave their server rooms and touch a flask…

So how about the publishers? Haven’t they traditionally filled this role? Yes, they have – but they are now trying to harvest the data themselves and re-sell it to us in the form of data- and knowledge bases (see Wiley’s Chemgate and eMolecules, for example). For that reason alone it seems utterly undesirable to have a commercial publisher continue to fill that role. If the publisher is an open access publisher, then getting at the data is not a concern, but the data format is… a publisher is just as much an institution as a library, and whether they will be able to be nimble enough to cope with constantly evolving data needs and models is doubtful. Which means we would be back to the generic “one software for all” model.
Which, at least to me, seems bad. The same, sadly, applies to learned societies.

Hmmm… the longer I think about it, the more I come to the conclusion that the lab chemists or the departments will have to do it themselves, assisted and educated by the chemoinformaticians and their own institutions, setting up small-scale, dedicated and light-weight repositories. The institutions will have to make a commitment to ensure long-term preservation, inter-linking and interoperability between repositories evolved by individual researchers or departments. And funders, finally – well, funders will not only have to have a data policy like the BBSRC’s, but they will also have to police it and, in Jim’s words, “keep the scientists honest”.