Why oh why oh why….?

…do we make things so difficult for ourselves sometimes? Sigh. It’s lateish on Friday evening, and therefore it seems time for another rant.

I was browsing a couple of journal homepages just now and came across an interesting paper by Fenniri et al…well-known names in combinatorial chemistry. What caught my eye was that these guys had prepared a library of 630 polystyrene copolymers…with the intention of using them as substrates for solid phase organic synthesis. All the styrene monomers they used were spectroscopically active, and their vibrational fingerprint (IR or Raman) could be used to uniquely identify each copolymer and therefore the synthesis history of each molecule on a copolymer bead (DOI).

Being someone with a history of and interest in combinatorial polymer science as well as solid phase synthesis, I naturally got excited about this: a library of 630 copolymers together with spectroscopic data does not get reported all that often. They had also provided all the spectroscopic information in the supporting info. So off I go to have a look at it and to see how it had been reported. Well, I click on the supporting info link and all seems to be well….a largish file (approx. 8 MB) is downloading. It’s a pdf….ok…..download has finished….my computer should open the thing automatically….wait for it….oh…..what do I see? This:

pdf-package.gif

Sure enough…..Preview….my built-in Mac viewer for all things…refuses to open this. So what has happened? With Acrobat 8, Adobe has introduced the concept of pdf packages…..you can now package up a couple of individual pdf files in a new one…it’s kind of a zip for pdf. But this is all wrong. Sure, Adobe Reader is free and I could download it and view the things. But I am using a Mac and therefore have a brilliant pdf reader shipping with my operating system. It renders much faster than Adobe’s own software, and interacting with it is a pleasure – unlike Adobe’s software. I don’t want to have to download it….and I suspect that some of the tools that we are using and that Peter has spent a long time hacking will fall over on this. And other (free) non-Adobe tools don’t seem to deal with it…at least not yet. That is the trouble with proprietary standards.

But that is only the beginning. I happen to have access to Acrobat Professional 7, which managed to open this file after a couple of error messages and a bit of huffing and puffing. And then what do you see? 630 spectra which look like this:

spectrum.gif

And that’s when it hit me really hard…..what a waste this is….all these spectra, reported and accessible to the world only in this format. I mean, I would love to get my grubby mittens on this data…..if there were some polymer property data there too, it could potentially be a wonderful dataset for mining, structure-property relationships, the lot. But of course, this is not going to happen if I can only get at the data in the form of tiny spectral graphs in pdf…..there is just so little one can do with that. What I would really need is the digital raw data…preferably in some open format, which I can download, look at and work on. But because I cannot get to the data and do stuff with it, it is potentially lost. I cannot add value to it….as Jim pointed out recently…..reposited data gains value over time….because you can be almost sure that someone will be interested in it, work with it and maybe mash it up with other data in ways never envisaged by the original data creator.

So who is to blame? The scientist for being completely unthinking and publishing his data as graphs in pdf? Adobe for messing with pdf? The publishers for using pdf and new formats and keeping the data inaccessible unless I as a user use their technology standard (quite apart from needing to subscribe to the journal etc.)? The chemical community at large for not having evolved mechanisms and processes to make it easy for researchers to distribute this data? The science infrastructure (e.g. libraries, learned societies etc.) for not providing the necessary infrastructure to deal with data capture and distribution? Well, maybe everybody….a little….
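To make the raw-data wish above a bit more concrete, here is a minimal sketch of what even the plainest open format would allow. Everything in it is an assumption for illustration: the numbers are invented (they are not taken from the paper’s supporting info), the column names and the helper functions `write_spectrum`, `read_spectrum` and `strongest_band` are hypothetical, and plain CSV merely stands in for whatever open spectral format one would actually choose.

```python
import csv
import io

# Hypothetical (wavenumber in cm^-1, intensity) pairs, invented for illustration;
# they are NOT taken from the paper's supporting information.
spectrum = [(1000.0, 0.12), (1001.0, 0.48), (1002.0, 0.95), (1003.0, 0.40), (1004.0, 0.10)]


def write_spectrum(points, handle):
    """Write the spectrum as plain CSV with a header row."""
    writer = csv.writer(handle)
    writer.writerow(["wavenumber_cm-1", "intensity"])
    writer.writerows(points)


def read_spectrum(handle):
    """Read the CSV back into a list of (wavenumber, intensity) floats."""
    reader = csv.DictReader(handle)
    return [(float(row["wavenumber_cm-1"]), float(row["intensity"])) for row in reader]


def strongest_band(points):
    """Return the (wavenumber, intensity) pair with the highest intensity."""
    return max(points, key=lambda point: point[1])


# Round-trip through an in-memory "file" so the sketch needs nothing on disk.
buffer = io.StringIO()
write_spectrum(spectrum, buffer)
buffer.seek(0)
recovered = read_spectrum(buffer)

print(strongest_band(recovered))
```

The point is not the few lines of code but that they are all it takes: once the numbers are available as numbers, anyone can reload them and start asking questions, which a picture of a spectrum in a pdf does not allow.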

Let’s start with ourselves, the scientists. Certainly, when I was a classically working synthetic chemist, data didn’t matter. No, really. We produced compounds on a one-by-one basis, which, even over the course of, for example, a PhD career, doesn’t really amount to a data flood, particularly if all you do is the usual chemistry rut: synthesize and characterize. And even if you make slightly more compounds than your average PhD student (I prepared about 120 distinct and fully characterized compounds during my doctorate), all it takes is a slightly sadistic supervisor (I had one of those too) for whom it is perfectly ok to have a PhD student sit there for a couple of weeks and type it all up in a Word document. And really, that is the problem. Most chemistry labs have evolved a combination of a text processor and a spreadsheet for their data handling needs, which suits them perfectly for group-internal purposes…and a combination of Word and Excel on that level can be quite a powerful one. Because one’s learned brethren never ask for more than the data typed up in the usual way in a paper, because it is the paper and therefore the publisher that takes care of dissemination, and because the Word/Excel combination takes care of data in the lab, the average chemist never has to think about data: how to generate it, how to store it, how to disseminate it. And – I only speak for the materials sciences now; the situation is different in parts of organic chemistry, in particular medicinal chemistry – even the not-so-standard chemists, namely the combinatorial and high-throughput ones, who generate more data and should therefore be interested, often fall into this trap: the powerful combination of Word and Excel as datastores. And who can blame them? For them, too, data is a means, not an end. They are interested in the knowledge they can extract from the data…and once extracted, the data becomes secondary.

Now you might argue that it is not a chemist’s job to worry about data, it’s his job to do chemistry and make compounds (I know..it’s a myopic view of chemistry, but let’s stick to it for now). And yes, that is a defensible point, though I think that with the increasingly routine use of high-throughput and combinatorial techniques, it is becoming less defensible.

Chemists need to realise that the data they produce has value beyond the immediate research project for which it was produced. Furthermore, it has usually been generated at great cost and effort and should be treated as a scarce resource. Apart from everything else, data produced through public funding is a public good and produced in the public interest. So I think chemists have to start thinking about data….and it won’t come easy to them. And one way of doing this is, of course, to get them where it hurts most: the money. So the recent BBSRC data policy initiative seems to me to be a step in the right direction:

BBSRC has launched its new data sharing policy, setting out expected standards of data sharing for BBSRC-supported researchers. The policy states that BBSRC expects research data generated as a result of BBSRC support to be made available with as few restrictions as possible in a timely and responsible manner to the scientific community for examination and use. In turn, BBSRC will provide support and funding to facilitate appropriate data sharing activities, and recognises that different fields of study will require different approaches. What is sensible in one scientific or technological area may not work in others. The policy aims to achieve the sharing of data in a both a timely and scientifically appropriate manner. In implementing the policy, all research proposals submitted to BBSRC from 26th April 2007 must now include a statement on data sharing. This should include concise plans for data management and sharing or provide explicit reasons why data sharing is not possible or appropriate. The statement on data sharing will be under a separate section within the ‘case for support’ document, and the guidance for applicants has been updated accordingly. Statements will be considered separately from the scientific excellence of the proposed research; however, an application’s credibility will suffer if peer review agrees the statement is inappropriate.

This is a good step in the right direction (provided it is also policed!!) and one can only hope the EPSRC, which, as far as I know, does not have a formal policy at the moment, will follow suit. There’s another thing though. Educating people about data needs to be part of the curriculum starting with the undergraduate chemistry syllabus. And the few remaining chemical informaticians of the world need to get out of their server rooms and into the labs. If you think that organic chemists are bad in not wanting to have to do anything with informatics, well, informaticians are usually even worse in not wanting to have anything to do with flasks. And it makes me hopping mad when I hear that “this is not on the critical path”. Chemical informatics only makes sense in combination with experiments and it is the informaticians here that should lead the way and show the world just how successful a combination of laboratory and computing can be. It is that, which will educate the next generation of students and make them computer and data literate.

You might also argue, of course, that it should be a researcher’s institution that takes care of data produced by the research organisation. Which then brings us on to institutional repositories. Well, the trouble here is: can an institution really produce the tools for archiving and dissemination? What a strange question, you will say. Is not Cambridge involved in DSpace and Spectra etc.? Yes. The point, though, is that scientific data is incredibly varied, and new data with new data models gets produced all the time. Will institutional repositories really be able to evolve quickly enough to accommodate all this, or will they be limited to well-established data and information models because they typically operate a “one software fits all” model?
I may not be the best qualified person to judge this, but having worked in a number of large institutions in the past and observed the speed at which they evolve, there is nothing that leads me to believe that institutions and centralized software systems will be able to evolve rapidly enough. Jim, in a recent post, already alludes to something like this and makes reference to a post by Clifford Lynch, who defines institutional repositories as

“a set of services that a university offers to the members of its community for the management and dissemination of digital materials created by the institution and its community members”

I must confess, I find this thought intriguing. The set of services then doesn’t have to be a monolithic bit of software as we have now, but rather a set of services which manages digital materials. So could one naively consider a bottom-up approach? One in which, for example, universities make available resources that would allow an individual department or even researcher to collaborate directly with a librarian and an information scientist, who would then make sure that a scientist’s data needs were met, while at the same time allowing the institution to ensure proper data preservation etc. How about individual research groups maintaining their own repositories, which they, with the help of a central institution, evolve quickly and which a central system could then access, index, link to etc….and help with dissemination?

Now the question here is: what’s in it for the individual researcher? Not much, I would say. Sure, some will care about data preservation, and some will acknowledge that publicly funded research is a public good, that the researcher therefore has a duty of care towards the product, and will therefore care. But as discussed above: the point of generating data is ultimately getting the next publication and finishing the project…..what happens beyond that is often irrelevant to the researcher who generates the data, which means there is no point in expending much effort, and maybe even climbing a learning curve, to be able to archive it and disseminate it in any way other than by sticking it into pdf. Ultimately, the data disappears in the long tail. So if there is no obvious carrot, then maybe a stick? Well, the stick will live and die with, for example, the funders. Having a data policy is great, but it also needs to be policed and enforced. The funders can either do this themselves or, in the case of public money, might even conceivably hand this to a national audit office. And I think the broad gist of this discussion also applies to learned societies etc. So now we are back at the researcher. And back to needing to educate the researcher and the student…..and therefore ultimately back again at chemoinformaticians having to leave their server rooms and touch a flask…..

So how about the publishers? Haven’t they traditionally filled this role? Yes, they have, but they are now trying to harvest the data themselves and re-sell it to us in the form of data- and knowledge bases (see Wiley’s Chemgate and eMolecules, for example). For that reason alone it seems utterly undesirable to have a commercial publisher continue to fill that role. If the publisher is an open access publisher, then getting at the data is not a concern, but the data format is….a publisher is just as much an institution as a library, and whether they will be nimble enough to cope with constantly evolving data needs and models is doubtful. Which means we would be back to the generic “one software for all” model. Which, at least to me, seems bad. The same, sadly, applies to learned societies.

Hmmm…the longer I think about it, the more I come to the conclusion that the lab chemists or the departments will have to do it themselves, assisted and educated by the chemoinformaticians and their own institutions, by setting up small-scale, dedicated and light-weight repositories. The institutions will have to make a commitment to ensure long-term preservation, inter-linking and interoperability between repositories evolved by individual researchers or departments. And funders, finally, well, funders will not only have to have a data policy like the BBSRC’s, but they will also have to police it and, in Jim’s words, “keep the scientists honest”.
