Polymer Informatics and the Semantic Web – The Problem, Part I: Availability and Curation of Data
March 23, 2007
In one of my recent posts, I outlined the vision that we have for polymer informatics. Now let me outline some of the challenges that stand in our way. In my little scenario, I talked about a semantic web agent going off and gathering data. Well, here is where the difficulty starts for a machine. There are several problems:
1. Data Availability
One of the best-loved and most commonly used sources of polymer information is the Polymer Handbook by Brandrup and Immergut. It contains information about approximately 2500 different polymers, scattered over multiple chapters. As it is paper-based, it is not accessible to machines, and information has to be extracted and collated by hand. Wiley has taken the contents and turned them into a collection of HTML documents connected via hyperlinks. Though available in electronic form, it is still very difficult for a machine to extract anything from this, as all the information is present as unstructured free text. It is not impossible, mind you: systems such as OSCAR, which we are currently developing in-house, make that sort of thing possible, but it is still far from trivial and requires much hard work. “Polymers – A Property Database”, published by CRC, is set up in much the same way and is therefore subject to the same limitations. Furthermore, it is worth pointing out that all of these sources of data are commercial, and if one’s host institution/organization does not subscribe to the relevant data source, one is….well….hosed anyway.
Things look up a bit with the PoLyInfo Database, maintained by the National Institute for Materials Science of Japan. Here we find, amongst other valuable features, the ability to search by (sub)structure, and a string which defines the repeat unit structure of the polymer and which, in principle at least, is parseable. And all this goodness covers approximately 13,000 polymers and a large variety of physicochemical properties and, best of all, is free.
2. Data Curation
However, there is a catch. When looking, for example, at the glass transition temperature (Tg) entry for polydimethylsiloxane, we find an incredibly wide temperature range….-163 deg. C to +42 deg C. How come there is such a wide range? Well, first of all, and this is the problem with a lot of polymer properties, the glass transition temperature is dependent on the molecular weight in the low molecular weight regime. As MWs increase, Tg eventually becomes invariant w.r.t. the molecular weight. Now when it comes to registering polymer property values, the polymer science community has gotten into the habit of reporting them WITHOUT the corresponding dependent variables, such as MW in the case of the glass transition temperature. Clearly, this makes it very hard to build good and accurate predictive models for such properties.
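To see how strong this molecular-weight dependence is, the low-MW regime is commonly described by the Flory–Fox equation, Tg(Mn) = Tg,∞ − K/Mn. A minimal sketch (the polystyrene-like constants below are typical textbook values, used purely for illustration, not taken from any database record):

```python
def flory_fox_tg(mn: float, tg_inf: float, k: float) -> float:
    """Estimate Tg (in K) at number-average molecular weight mn (g/mol)
    using the Flory-Fox equation: Tg = Tg_inf - K / Mn."""
    return tg_inf - k / mn

# Illustrative polystyrene-like constants: Tg_inf ~ 373 K, K ~ 1.2e5 K*g/mol.
# Note how Tg plateaus towards Tg_inf as Mn grows.
for mn in (2_000, 10_000, 100_000, 1_000_000):
    print(f"Mn = {mn:>9} g/mol  ->  Tg = {flory_fox_tg(mn, 373.0, 1.2e5):.1f} K")
```

The point is not the particular constants but the shape of the curve: without the MW recorded alongside each Tg value, a low-MW oligomer and a high-MW polymer of the same repeat unit look like wildly contradictory data points.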
Sticking with the glass transition temperature for a moment, here’s another one. Tg is mainly determined using two different methods, namely Differential Scanning Calorimetry (DSC) or Thermomechanical Analysis (TMA). While both methods try to determine a glass transition temperature, they measure fundamentally different things. DSC essentially determines a change in the heat capacity of a polymer, whereas TMA measures a dimensional change in the sample. And yes, when both methods are used on the same sample, the results usually differ by 6–10 K. So it is crucial to report the measurement method, the experimental conditions etc. Furthermore, when data is abstracted to and accumulated in a knowledge system, it has to be curated to ensure that all relevant and necessary bits of metadata are available.
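What would a properly curated record look like? A minimal sketch of a Tg entry that carries its metadata (the field names and the numeric values are my own invention for illustration, not PoLyInfo's schema or actual database values):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TgMeasurement:
    polymer: str
    tg_kelvin: float
    method: str                               # e.g. "DSC" or "TMA"
    mn_g_per_mol: Optional[float] = None      # the dependent variable: molecular weight
    heating_rate_k_per_min: Optional[float] = None
    notes: str = ""

def is_comparable(a: TgMeasurement, b: TgMeasurement) -> bool:
    """Two Tg values are only directly comparable if they were
    measured by the same method and both report a molecular weight."""
    return (a.method == b.method
            and a.mn_g_per_mol is not None
            and b.mn_g_per_mol is not None)

# Hypothetical records for the same sample measured two different ways:
dsc = TgMeasurement("polydimethylsiloxane", 148.0, "DSC", mn_g_per_mol=50_000)
tma = TgMeasurement("polydimethylsiloxane", 155.0, "TMA", mn_g_per_mol=50_000)
print(is_comparable(dsc, tma))  # DSC vs TMA: not directly comparable
```

A knowledge system built on records like these can refuse to average a DSC value with a TMA value, or a value with a missing MW with one without, rather than silently producing the kind of -163 to +42 deg. C spread we saw above.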
On occasion, PoLyInfo will also register data for composites under the pure polymer…..which of course can shift properties tremendously.
In summary then, the first set of challenges we encounter are data availability, data curation and metadata. Unfortunately there is more, which I will discuss in one of my next posts.