dev8D – JISC Developer Happiness Days Day One

I have just finished the first day of dev8D – the JISC developer happiness days 2009. A lot is already being blogged and tweeted about it (look for the hashtag dev8D), so just a quick summary here.

The day saw a number of workshops and code labs – one of them Python for n00bs (albeit developer n00bs rather than n00by n00bs), others on mashups, HTML and CSS. I attended the Python sessions – I must confess to never having used Python and was interested in some snake-wrangling because several of my present and former colleagues swear by it. The morning started very simply, with a number of “Hello World” type programs and a discussion of concepts such as test-driven development, classes and IDEs. The first sessions were given by Peter Sefton (of ICE fame and one of our collaborators in the Unilever Centre) and Ben O’Steen from Oxford.

This was followed by a nice introduction to Unicode and how to handle encodings in Python – necessary for any text mining – and a discussion of the differences between the recently released Python 3.0 and Python 2.x (not to mention much sighing about the amount of code that everybody expects to have to rewrite).
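
To give a flavour of the encoding issue, here is a minimal sketch of my own (not code from the session), assuming Python 3, where text (`str`) and encoded data (`bytes`) are separate types. The example string and the `safe_decode` helper are made up for illustration.

```python
# Python 3: str holds Unicode text, bytes holds encoded data.
text = "Schrödinger"                 # made-up example string

utf8 = text.encode("utf-8")          # str -> bytes (ö becomes two bytes)
latin1 = text.encode("latin-1")      # same text, different byte sequence

# Decoding with the matching codec recovers the original text.
assert utf8.decode("utf-8") == text
assert latin1.decode("latin-1") == text


def safe_decode(data, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: try each encoding in turn, return the first success."""
    for enc in encodings:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings worked")
```

Trying UTF-8 first is a deliberate choice: it rejects most byte sequences that are not valid UTF-8, whereas Latin-1 will decode any bytes at all and so only makes sense as a last resort.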

After a typically British lunch of sandwiches and some pushing and pulling for the few available power strips (I have never understood this: why are organisers of large gatherings of hackers and computer-fanciers so blasé when it comes to the provision of power? They had made an effort, but it was not nearly enough. Someone please enlighten me…) there was a nice session on using Python, and libraries and software written in Python, for data mining problems. In particular, solutions for clustering, rule learning etc. were discussed. Links to the referenced libraries can be found here. This was followed by the final and long session on developing web apps using Django and the Google App Engine, given by Brian. After a short discussion of Django, Brian invited ideas for the real-time development of a web app, although he had a planned demo, which presumably worked well. Someone suggested using the “We feel fine” API to retrieve a list of feelings from We Feel Fine and re-display them in the app to be developed. Inviting such challenges is valiant and to be commended, but in my experience it almost inevitably leads to difficulties, and so things didn’t quite work out as planned. Nobody took particular exception to this, though, and there were many offers to continue hacking in the bar or later in the hotel.

The latter was another wonderful feature of the first day – there were people huddled together in every corner of the building, either building or discussing prototypes for web apps, mashups and visualisations. It was wonderful and uplifting to be in such an enthusiastic and “can do” atmosphere, where innovative things are being done. This is in sharp contrast to the normal world of chemoinformatics and the use of information technology for chemistry as understood in academia: calculate yet another transition state, develop yet another machine learning technique/QSAR model, dock yet another molecule into yet another protein.

Anyhoo, before I go off into a rant as I am sometimes prone to do on this blog, suffice it to say that the first day was not only enjoyable, but has also taught me a huge amount of stuff. Due to other commitments, I did not participate in the evening’s revelries and entertainment, but undoubtedly there will be reports on this too.

I am looking forward to an equally enjoyable and informative second day tomorrow.

Why oh why oh why….?

…do we make things so difficult for ourselves sometimes? Sigh. It’s lateish on Friday evening and therefore it seems time for another rant.

I was browsing a couple of journal homepages just now and came across an interesting paper by Fenniri et al., well-known names in combinatorial chemistry. What had caught my eye was that these guys had prepared a library of 630 polystyrene copolymers with the intention of using them as substrates for solid phase organic synthesis. All the styrene monomers they used were spectroscopically active, and their vibrational fingerprint (IR or Raman) could be used to identify each copolymer, and therefore the synthesis history of each molecule on a copolymer bead, uniquely (DOI).

Being someone with a history and interest in combinatorial polymer science as well as solid phase synthesis, I got naturally excited about this: a library of 630 copolymers together with spectroscopic data does not get reported altogether that often. They had also provided all the spectroscopic information in the supporting info. So off I go to have a look at it and to see how it had been reported. Well, I click on the supporting info link and all seems to be well….a largish file (approx 8 Mb) is downloading. It’s a pdf….ok…..download has finished….my computer should open the thing automatically….wait for it….oh…..what do I see? This:

pdf-package.gif

Sure enough, Preview, my built-in Mac viewer for all things, refuses to open this. So what has happened? With Acrobat 8, Adobe has introduced the concept of pdf packages: you can now package up a couple of individual pdf files in a new one…it’s kind of a zip for pdf. But this is all wrong. Sure, Adobe Reader is free and I could download it and view the things. But I am using a Mac and therefore have a brilliant pdf reader shipping with my operating system. It renders much faster than Adobe’s own software, and interacting with it is a pleasure – unlike Adobe’s software. I don’t want to have to download Adobe Reader…and I suspect that some of the tools that we are using and that Peter has spent a long time hacking will fall over this. And other (free) non-Adobe tools don’t seem to deal with it…at least not yet. That is the trouble with proprietary standards.

But that is only the beginning. I happen to have access to Acrobat Professional 7, which managed to open this file after a couple of error messages and a bit of huffing and puffing. And then what do you see? 630 spectra which look like this:

spectrum.gif

And that’s when it hit me really hard: what a waste this is…all these spectra, reported and accessible to the world in this format. I mean, I would love to get my grubby mittens on this data. If there were some polymer property data there too, it could potentially be a wonderful dataset for mining, structure-property relationships, the lot. But of course, this is not going to happen if I can only get at the data in the form of tiny spectral graphs in pdf…there is just so little one can do with that. What I would really need is the digital raw data, preferably in some open format, which I can download, look at and work on. But because I cannot get to the data and do stuff with it, it is potentially lost. I cannot add value to it. As Jim pointed out recently, reposited data gains value over time, because you can be almost sure that someone will be interested in it, work with it and maybe mash it up with other data in ways never envisaged by the original data creator.

So who is to blame? The scientist for being completely unthinking and publishing his data as graphs in pdf? Adobe for messing with pdf? The publishers for using pdf and new formats and keeping the data inaccessible unless I as a user use their technology standard (quite apart from needing to subscribe to the journal etc.)? The chemical community at large for not having evolved mechanisms and processes to make it easy for researchers to distribute this data? The science infrastructure (e.g. libraries, learned societies etc.) for not providing necessary infrastructure to deal with data capture and distribution? Well, maybe everybody…a little…

Let’s start with ourselves, the scientists. Certainly, when I was a classically working synthetic chemist, data didn’t matter. No really. We produced compounds on a one-by-one basis, which, even over the course of, for example, a PhD career, doesn’t really amount to a data flood, particularly if all you do is the usual chemistry rut: synthesize and characterize. And even if you make slightly more compounds than your average PhD student (I prepared about 120 distinct and fully characterized compounds during my doctorate), all it takes is a slightly sadistic supervisor (I had one of those too) for whom it is perfectly ok to have a PhD student sit there for a couple of weeks and type it all up in a Word document. And really, that is the problem. Most chemistry labs have evolved a combination of a text processor and a spreadsheet for their data handling needs, which suits them perfectly for group internal purposes…and a combination of Word and Excel on that level can be quite a powerful one. Because one’s learned brethren never ask for more than the data typed up in the usual way in a paper, because it is the paper and therefore the publisher that takes care of dissemination, and because the Word/Excel combination takes care of data in the lab, the average chemist does not have to think about data, how to generate it, how to store it, how to disseminate it, ever. And – I only speak for the materials sciences now; the situation is different in parts of organic chemistry, in particular medicinal chemistry – even the not-so-standard chemists, namely the combinatorial and high throughput ones, who generate more data and should therefore be interested, often fall into this trap: the powerful combination of Word and Excel as datastores. And who can blame them? For them, too, data is a means, not an end. They are interested in the knowledge they can extract from the data…and once extracted, the data becomes secondary.

Now you might argue that it is not a chemist’s job to worry about data, it’s his job to do chemistry and make compounds (I know…it’s a myopic view of chemistry but let’s stick to it for now). And yes, that is a defensible point, though I think that, with the increasingly routine use of high-throughput and combinatorial techniques, it is becoming less defensible. Chemists need to realise that the data they produce has value beyond the immediate research project for which it was produced. Furthermore, it has usually been generated at great cost and effort and should be treated as a scarce resource. Apart from everything else, data produced through public funding is a public good and produced in the public interest. So I think chemists have to start thinking about data…and it won’t come easy to them. And one way of doing this is, of course, to get them where it hurts most: the money. So the recent BBSRC data policy initiative seems to me to be a step in the right direction:

BBSRC has launched its new data sharing policy, setting out expected standards of data sharing for BBSRC-supported researchers. The policy states that BBSRC expects research data generated as a result of BBSRC support to be made available with as few restrictions as possible in a timely and responsible manner to the scientific community for examination and use.

In turn, BBSRC will provide support and funding to facilitate appropriate data sharing activities, and recognises that different fields of study will require different approaches. What is sensible in one scientific or technological area may not work in others. The policy aims to achieve the sharing of data in a both a timely and scientifically appropriate manner.

In implementing the policy, all research proposals submitted to BBSRC from 26th April 2007 must now include a statement on data sharing. This should include concise plans for data management and sharing or provide explicit reasons why data sharing is not possible or appropriate. The statement on data sharing will be under a separate section within the ‘case for support’ document, and the guidance for applicants has been updated accordingly. Statements will be considered separately from the scientific excellence of the proposed research; however, an application’s credibility will suffer if peer review agrees the statement is inappropriate.

This is a good step in the right direction (provided it is also policed!!) and one can only hope the EPSRC, which, as far as I know, does not have a formal policy at the moment, will follow suit. There’s another thing though. Educating people about data needs to be part of the curriculum starting with the undergraduate chemistry syllabus. And the few remaining chemical informaticians of the world need to get out of their server rooms and into the labs. If you think that organic chemists are bad in not wanting to have to do anything with informatics, well, informaticians are usually even worse in not wanting to have anything to do with flasks. And it makes me hopping mad when I hear that “this is not on the critical path”. Chemical informatics only makes sense in combination with experiments and it is the informaticians here that should lead the way and show the world just how successful a combination of laboratory and computing can be. It is that, which will educate the next generation of students and make them computer and data literate.

You might also argue, of course, that it should be a researcher’s institution that takes care of data produced by the research organisation. Which then brings us on to institutional repositories. Well, the trouble here is: can an institution really produce the tools for archiving and dissemination? What a strange question, you will say. Is not Cambridge involved in DSpace and Spectra etc.? Yes. The point, though, is that scientific data is incredibly varied and new data with new data models gets produced all the time. Will institutional repositories really be able to evolve quickly enough to accommodate all this, or will they be limited to well-established data and information models because they typically operate a “one software fits all” model?
I may not be the best qualified person to judge this, but having worked in a number of large institutions in the past and observed the speed at which they evolve, there is nothing that leads me to believe that institutions and centralized software systems will be able to evolve rapidly enough. Jim, in a recent post, already alludes to something like this and makes reference to a post by Clifford Lynch, who defines institutional repositories as

“a set of services that a university offers to the members of its community for the management and dissemination of digital materials created by the institution and its community members”

I must confess, I find this thought intriguing. The set of services then doesn’t have to be a monolithic bit of software as we have now, but rather a set of services which manages digital materials. So could one naively consider a bottom-up approach? One in which, for example, universities make available resources that would allow an individual department or even researcher to collaborate directly with a librarian and an information scientist, who would then make sure that a scientist’s data needs were met, while at the same time allowing the institution to ensure proper data preservation etc. How about individual research groups maintaining their own repositories, which they evolve quickly with the help of a central institution, and which a central system could then access, index, link to etc…and help with dissemination?

Now the question here is: what’s in it for the individual researcher? Not much, I would say. Sure, some will care about data preservation, and some will acknowledge that publicly funded research is a public good, that the researcher therefore has a duty of care towards the product, and will therefore care. But as discussed above, the point of generating data is ultimately getting the next publication and finishing the project. What happens beyond that is often irrelevant to the researcher who generates the data, which means there is no point in expending much effort, and maybe even climbing a learning curve, to be able to archive it and disseminate it in any way other than by sticking it into pdf. Ultimately, the data disappears in the long tail. So if there is no obvious carrot, then maybe a stick? Well, the stick will live and die with, for example, the funders. Having a data policy is great, but it also needs to be policed and enforced. The funders can either do this themselves or, in the case of public money, might even conceivably hand this to a national audit office. And I think the broad gist of this discussion also applies to learned societies etc. So now we are back at the researcher. And back to needing to educate the researcher and the student…and therefore ultimately back again at chemoinformaticians having to leave their server rooms and touch a flask…

So how about the publishers? Haven’t they traditionally filled this role? Yes, they have, but they are now trying to harvest the data themselves and re-sell it to us in the form of data- and knowledge bases (see Wiley’s Chemgate and eMolecules, for example). For that reason alone it seems utterly undesirable to have a commercial publisher continue to fill that role. If the publisher is an open access publisher, then getting at the data is not a concern, but the data format is…a publisher is just as much an institution as a library, and whether they will be nimble enough to cope with constantly evolving data needs and models is doubtful. Which means we would be back to the generic “one software for all” model.
Which, at least to me, seems bad. The same, sadly, applies to learned societies.

Hmmm…the longer I think about it, the more I come to the conclusion that the lab chemists or the departments will have to do it themselves, assisted and educated by the chemoinformaticians and their own institutions, by setting up small-scale, dedicated and light-weight repositories. The institutions will have to make a commitment to ensure long-term preservation, inter-linking and interoperability between repositories evolved by individual researchers or departments. And funders, finally, well, funders will not only have to have a data policy like the BBSRC, but they will also have to police it and, in Jim’s words, “keep the scientists honest”.

Aesthetic Data

I have a confession to make: I am a sucker for bright ideas and bright people – particularly when those ideas resonate with my academic work, or when they strike a particular aesthetic chord with me. And there are a number of places on the web where both can usually be found in abundance. One of these places is a podcast, “Entrepreneurial Thought Leaders”, produced by the Stanford Department of Management Science and Engineering. Another one is the homepage of the TED (Technology, Entertainment, Design) conference and in particular their videocasts.

When browsing their website, I came across a short presentation by Jonathan Harris, who has done work, which not only resonated with our own, but is just simply beautiful. Jonathan is both a computer scientist and an artist, who is trying to understand the world and us human beings by analyzing the content and artefacts they are contributing on the web. He does this by collecting huge amounts of data, which he then visualizes in incredibly beautiful displays. One of his most impressive projects is the “We feel fine” project.

In “We feel fine” he scours the world’s blogs and extracts all those sentences from blog entries which contain the phrases “I feel” and “I am feeling”. If a sentence like this is found, it is scraped and transferred into a database. Using language processing it is then scanned to determine whether it contains one of a set of “pre-defined” feelings. If the blog post contains a picture, that is scraped too and associated with the sentence. Using statistics and a lot of visualization, Jonathan can draw an amazing number of conclusions about “how the world feels” at the moment. I did a search just now, asking how both men and women in the UK felt when the weather was sunny between 2005 and 2007. The results speak volumes: the top 10 emotions in descending order are: ill, old, free, tired, weird, sorry, down, guilty, sad and happy.
Conceptually, there is so much in this that is analogous to the work some of my colleagues are doing here in the Centre (data gathering and analysis in Crystal Eye for example…..I am sure that if we were allowed to gather a bit more metadata than we are, we could tell so much about what’s occupying science at the moment, follow scientific progress etc…..just by looking at the molecules people are working on), and so much that we will have to learn about yet (e.g. analysis and visualization of large data sets, useful representations of data for chemistry).
But enough of me, best let Jonathan do the talking and the presentation. And I urge you to spend some time with the “We feel fine” website….it is fascinating.
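
The extraction step described above can be sketched in a few lines of Python. This is my own rough illustration, not Jonathan's actual code: the tiny `FEELINGS` set, the crude sentence splitter and the `extract_feelings` function are all assumptions made for the example.

```python
import re

# Hypothetical, tiny feelings list; the real project uses its own, much larger one.
FEELINGS = {"happy", "sad", "tired", "guilty", "free"}

# Split after sentence-ending punctuation followed by whitespace (a crude heuristic).
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")
# The two trigger phrases mentioned in the text.
TRIGGER = re.compile(r"\bI (?:feel|am feeling)\b", re.IGNORECASE)


def extract_feelings(post):
    """Return (sentence, feelings) pairs for sentences containing a trigger phrase."""
    results = []
    for sentence in SENTENCE_SPLIT.split(post.strip()):
        if TRIGGER.search(sentence):
            words = set(re.findall(r"[a-z]+", sentence.lower()))
            results.append((sentence, sorted(words & FEELINGS)))
    return results


post = "The sun is out. I feel happy and free today! Work was long. I am feeling tired."
for sentence, feelings in extract_feelings(post):
    print(feelings, "-", sentence)
```

A real pipeline would of course need a proper sentence tokenizer and a database behind it; the point here is only how little logic the core idea requires.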

Copyright Permissions – How they can also be done!

Having recovered from my hopping madness vented in my last post by taking a brisk walk round the chemistry department, I sat down to request more re-use permissions. One of the figures I wanted to use came from PNAS – the Proceedings of the National Academy of Sciences. Their current copyright policy was explained in a PNAS editorial in 2004 (I quote the entire thing):

Liberalization of PNAS copyright policy: Noncommercial use freely allowed

Nicholas R. Cozzarelli, Editor-in-Chief, Kenneth R. Fulton, Publisher, and Diane M. Sullenberger, Executive Editor

We have changed our copyright and permissions policies to make it easier for authors and readers to use material published in PNAS for research or teaching. Our guiding principle is that, while PNAS retains copyright, anyone can make noncommercial use of work in PNAS without asking our permission, provided that the original source is cited. For commercial use (e.g., in books for sale or in corporate marketing materials), we approve requests on an individual basis and may ask for compensation. We have revised our copyright assignment form to make the changes clear (www.pnas.org/misc/copyright.pdf) and added to our web site a “frequently asked questions” (FAQ) section on author and reader rights (www.pnas.org/misc/authorfaq.shtml).

As a PNAS author, you automatically have the right to do the following:

* Post a PDF of your article on your web site.
* Post a webcast containing material from your article.
* Make electronic or hard copies of articles for your personal use, including classroom use.
* Use, after publication, all or part of your article in a printed compilation of your work, such as collected writings or lecture notes.
* Include your article in your thesis or dissertation.
* Reuse your original figures or tables in your future works.
* Post a preprint of your article on a public electronic server, provided that you do not use the files created by PNAS.
* Present your paper at a meeting or conference, including those that are webcast, and give copies of your paper to meeting attendees before or after publication in PNAS. For interactions with the media prior to publication, see the PNAS policy on media coverage (www.pnas.org/misc/forms.shtml).
* Permit others to use your original figures or tables published in PNAS for noncommercial use (e.g., in a review article), provided that the source is cited. Third parties need not request permission to use figures and tables for such use.

Given that authors and readers can automatically use original material in PNAS for research or teaching, why do we request copyright transfer? We do so for three reasons: to allow us to publish, archive, and migrate articles to new media; to remove the administrative burden of rights and permissions management from authors; and to provide protection from copyright abuse.

We do not feel that this or any copyright policy is the only one possible. In fact, our policy has changed through our 90 years of publishing and surely will change again. We have requested that authors transfer copyright only since 1993. From the first issue of PNAS in 1915 through 1992, authors held copyright to their articles. From 1978 to 1992, we registered copyright for each journal issue as a collected work but did not request copyright for individual articles. In 1993, we began requiring that authors transfer copyright “in all forms, languages, and media now or hereafter known,” which granted us the rights to publish papers online in 1997 and to then digitize selected back issues and post them online.

We think that our current policy best meets the needs of readers, authors, and the journal, for the following reasons:

1. To store and migrate archival formats of the journal. We are committed to facilitating permanent, freely accessible archives of the scientific literature. PNAS is a charter member of PubMed Central, a digital archive of the life sciences journal literature (www.pubmedcentral.nih.gov), and is a participant in the National Library of Medicine’s effort to digitize and post back issues of journals. Not holding copyright to individual articles from 1915 to 1992 delayed our posting of this older material online because we do not have the legal rights to do so. In the end we proceeded without explicit permission from the original authors or their heirs. We accept the risk in doing so because we believe it is clearly in everyone’s best interest. If a copyright holder objects, however, we will immediately remove the article from our online collection. Full copyright transfer allows publishers explicit rights to invest in long-term archiving strategies.
2. To provide an administrative convenience for everyone. Despite our liberal rights and permissions policies, PNAS still receives more than 50 commercial and noncommercial permission requests per week. We routinely agree to noncommercial use, so such requests waste everyone’s time.
Unfortunately, PNAS cannot provide permission for others to use all or part of articles published from 1915 to 1992 because we do not hold copyright. Only the original authors or their designees can grant permission. Researchers are frustrated when they contact us for permission to use seminal works and we are unable to grant their requests.
3. To provide international protection regarding infringement or plagiarism. On the rare occasion that material is misused, authors appeal to PNAS to intervene on their behalf to enforce copyright protection. In such cases, a formal query from PNAS or the threat of a copyright infringement lawsuit has prompted expeditious action. In cases of redundant publication we sanction authors for violating journal and copyright policy. Because international standards and copyright law are complex, PNAS leaves interpretation of global copyright standards to our expert legal counsel.

We also support creative efforts such as charting, mining, analyzing, sorting, navigating, and displaying information contained in PNAS. The highly successful Sackler Colloquium “Mapping Knowledge Domains” (www.pnas.org/content/vol101/suppl_1) is a prime example (1). We encourage authors to use standard forms of data presentation to facilitate this process.

References

1. Shiffrin, R. M. & Börner, K., eds. (2004) Proc. Natl. Acad. Sci. USA 101, Suppl. 1, 5183–5310.

This is reaffirmed on the up-to-date 2007 PNAS copyright FAQ page:

Can others (nonauthor third parties) use my original figures or tables in their works without asking PNAS for permission?

Yes, PNAS automatically permits others to use your original figures or tables published in PNAS for noncommercial and educational use (i.e., in a review article, in a book that is not for sale), provided that the original source and copyright notice are cited. Commercial reuse of figures and tables (i.e., in promotional materials, in a textbook for sale) requires permission from PNAS.

Now this is an entirely sensible and science-friendly policy. Not open access according to the BBB declaration, but a huge step in the right direction, certainly in comparison with the last example:

  • automatic grant of re-use rights for non-commercial purposes (because asking for permission and credit card numbers is a WASTE OF TIME)
  • re-distribution rights for authors
  • self-archiving rights (of original manuscript etc.)
  • free classroom use etc.

Thanks PNAS – you guys made everyone’s life a bit easier.

Requesting Permissions for Re-use of Copyrighted Material.

Now I am not normally one to rant, at least not on a blog, but today I encountered something that makes me just mad…..and I mean hopping mad.

I have just finished writing a review paper on poly(2-oxazolines)…a class of polymers close to my heart. As it is a review paper, I have included some figures, which were taken from the original research papers forming part of the review. Given that these were not my figures and that I respect and honour the copyright of other authors who have worked hard to produce high-quality and illustrative figures for their publications and the copyright of publishers who have been assigned those rights by an author, I went off to request permissions for re-use of copyrighted material from the relevant publishers. The review was based on about 150 papers, and I had taken figures from a few of them…ACS, RSC…no problem. Their procedures are all more or less automated and relatively pain free, although time consuming. And then, well then I came to Elsevier……

Elsevier has outsourced their copyright clearance procedure to a company called the Copyright Clearance Centre (I have included the link for your edification), which, on its website, claims to “help to advance education, innovation and the free flow of information.” So far so good. Following Elsevier’s instruction, the first thing I have to do to obtain permission is to go and find the resource I took the figure from on ScienceDirect. So off I go and locate the relevant journal (Talanta) and citation on ScienceDirect. Next, the website instructs me to find the abstract of the paper and to press the “Request Permissions” button.

abstract.gif

Pressing this button launches a pop-up window which asks me what I want to do and I make my selections:

frontpage.gif

I am somewhat curious as to why it asks me which currency area I am currently in, but decide to ignore it for the moment. Having made my choices, I hit the “continue” button. I am then asked to set up an account as I have never used Rightslink before. Ok, getting tedious, but I hit the button to set up an account (note: none of this is necessary with the other publishers). I am now taken to a page where the anger really sets in: they are asking me how I want to pay.

secondpage.gif

How I want to pay?? All I want is to request permission for reuse of one small figure. I do not want to pay anything – my institution is subscribing to the journal for me. Why on earth would you want to lump requests for re-use of copyrighted material together with a business process that may be appropriate for the purchase of pay-per-view access? If I do not want to have pay-per-view access, why do I need to hand over payment details? However, the dropdown menu only gives me the opportunity to choose between a credit card payment and an invoice.

Hmmm…..on I go and fill in my details hoping that the “payment” thing is just going to go away down the line. But no such luck and sure enough, on the next screen I am being asked for my credit card details IN ORDER TO BE ABLE TO SET UP AN ACCOUNT to request re-use permissions.

thirdpage.gif

At this stage, I broke off the procedure. I understand that it might be convenient for the “Copyright Clearance Centre” to set up an account for me in such a way, that if I ever wanted to purchase a journal article from one of their customers, they have all the necessary information. IT IS NOT CONVENIENT FOR ME. All I want is permission to re-use a figure in a paper. I do not think that I should have to hand over my credit card details for this and I refuse to do so.

So what is the consequence of this? I am not prepared to set up a Rightslink account with the Copyright Clearance Centre under these circumstances. Therefore I cannot obtain permission to reproduce the figure I wanted and therefore I cannot use the figure in my paper. Furthermore, there is the personal inconvenience: I now have to throw the figure out of the manuscript and to renumber all of my figures in the text. This will cost me at least half an hour.

More significantly though, this has a negative impact on scientific dissemination. On the grand scale of things, it is only a tiny thing, but in effect this has stopped me from re-using a figure created by other scientists, who, I am sure, have a vested interest in their research being talked about, evaluated and disseminated. That is part of a scientist’s core business. The Copyright Clearance Centre has neither helped to advance education and innovation, nor indeed the flow of information, but rather has impeded it. And Elsevier is indirectly guilty: they have not done their best for their authors by helping to disseminate their science, but are collaborating with an organisation which actually puts people off reusing science. They have allowed requests for re-use of material to be lumped into the same procedure used for the purchase of pay-per-view articles. At best that is thoughtless and very poor customer service.

Now as I say, I don’t like to rant, but this kind of thoughtlessness makes me mad.

Something exciting, catalytic and quite delightful…

…has happened today.

I recently blogged about attending the first ESF summer school in Nanomedicine in Wales and speaking about our efforts in polymer informatics there.

After my talk, I was approached by an undergraduate, Hosea Handoyo, who wanted to know more about our work and who, amongst many other things, is currently a Neuroscience student in the Netherlands. When I asked him why he was interested, he said that he was attending the summer school in the capacity of a “student journalist”. Apparently, Hosea is part of a group of Indonesian students who attend research conferences and try to find out what is going on in various areas of science at the moment. They then write this up in the form of “popular science” articles, which get published on the web.

Now if I remember our conversation correctly, there are several points to this. Firstly, it is intended to inform the Indonesian public in simple terms about what is going on at the cutting edge of research science at the moment. Secondly though, it also serves as a landmark for students in Indonesia as to what research is going on where and which institutions/research groups they might consider joining in the future.

Now this morning, when I looked over my blog, I saw an incoming link from a website netsains.com (as annoying as the WordPress software may sometimes be when wanting to publish code in angle brackets, it is fantastic for all the housekeeping bits it offers). It looked a bit odd, but I could make out the terms “polimer informatika” in the link and so investigated further. And indeed, it turns out that the link led to an article that Hosea had written about our work here in Cambridge. Now his article is all written in Indonesian and I had no idea what it said, though I could make out some words: “polimer informatika”, “kanker” (cancer – a lot of work in polymer pharmaceuticals is done in the area of anti-cancer drugs), the Unilever Centre was mentioned as was polymer markup language (PML), some of the databases I had discussed and Peter Corbett’s OSCAR (which wows people every time it is demonstrated). I have since found out that his article has also appeared on the pages of the Indonesian Chemistry Forum.

Furthermore, there were links to all of the Unilever Centre blogs, my blog, a link to OSCAR 3 on sourceforge and even to the video of a lecture on polymer informatics which is up on Google Video. I then got in touch with Hosea via email to make sure that I had remembered the details of our conversation correctly and I also asked him about the purpose of the netsains.com website. Well, he told me that the site is supported by the Indonesian Minister for Research and Technology and is modeled on the Dutch Kennislink site. Kennislink was set up by the Dutch Ministry for Education, contains over 5000 popular science articles across all disciplines and is the most prominent Dutch language popular science site.

Now in his email (quoted with permission), Hosea said:

“All of these websites are aiming to bridge the gap between Indonesian scientists (and students) abroad and the ones in Indonesia. ICT especially internet is very limited in Indonesia (though the gadgets are quite sophisticated) so it is troublesome for people just simply browsing for information. By providing them the hottest issues from Europe, Japan, US, China, and many other countries, we share the information of research and development of scientific world with them. We could provide them the information of technology and in return, Indonesian communities abroad get updates of what happens in Indonesia and the link to translate their research/latest technology to what public in Indonesia needs. Simply like an open source idea but this is more to information sharing and empowering public awareness in scientific field.”

I found this really heartwarming and delightful for a number of reasons:

  • A genuine interest in science. It is fantastic to see that undergraduates go out to conferences with an interest in science and a desire to find out what is going on. In the past, I have worked in institutions where even the attendance of PhD students and post-docs was considered to be a “waste of time and money”. Personally, I think that it is never too early to expose someone who is genuinely interested in science to the cutting edge of what is going on in the world.
  • The idea of sharing and openness. It is an often quoted mantra, but one that is hardly ever practiced. We tend to lock up science and access to data in closed access journals, books or other resources. Often enough that already breaks our backs at well-resourced and well-funded institutions like Cambridge and makes scientific progress difficult. In other parts of the world, this is an absolutely insurmountable barrier. However, the more people like Hosea and others write about science on websites like kennislink.com or esains.com, the more people blog about their and other people’s science (the chemical blogosphere is exemplary in this) and the more students write their theses in the open, the more we can start to break these barriers down. And the internet, blogs, wikis etc. are the disruptive technology that will make it possible. Furthermore there is a social dimension here: those with access to resources (IT, conferences, literature etc….) enable access for those with fewer resources in the most efficient way through filtering and feedback.
  • The ability to set an agenda. Undergraduates turn into research students, post-docs, academics and decision makers. As research students, they have (always assuming the presence of an enlightened supervisor) the ability to determine what they work on (through choice of the research group they join) and maybe therefore also a choice over the culture in which science is done and in which they want to do science. As post-docs and academics they have the opportunity (together with their colleagues) to fundamentally change the way science is done and communicated. And as decision makers, they might just hold the purse strings, which enables them to tell academics how and where to publish (some funding bodies, for example, mandate that research funded through that body is published in open access journals or reposited).

I think that Hosea and people like him are the catalysts for positive change, which we need to move forward.

Polymer Theses, Polymer Data and a Common Language.

I am currently at the European Science Foundation’s first summer school on Nanomedicine in Cardiff, where I was invited to present some of the work in polymer informatics which we are doing in Cambridge. The summer school is a wonderful event, with approximately 180 attendees, the majority of whom are PhD students, along with a few undergraduates and a significant number of tenured faculty. The attendees come from a number of scientific disciplines, such as chemistry, biology, physics, medicine and ethics. And bringing people together in this way to talk about a field of research which is completely interfacial is the only sustainable way forward.
An awful lot of people were very impressed by the work we do and our approach to data and knowledge management and many of the PhD students I spoke to were enthused by the potential power that informatics can bring to their research. They also appreciated the need to have well-curated data that is freely available and not copyrighted by publishers etc. With so many PhD students here talking to each other freely about their research, getting to know each other and appreciating each other’s science, it seemed to me that there is a real chance to build a community that exchanges data and information in order to communally advance a field of research.

While the summer school was very multidisciplinary, there was a predominance of people interested in the use of polymers for all sorts of different applications – not least for applications in drug and gene delivery.
People working in polymer therapeutics are quite often “jacks of all trades”: not only are they chemists who know how to synthesize and purify polymers, but, to a certain extent at least, they also have to be physical chemists, biologists, formulators etc. So the polymer pharmaceuticals community produces very rich and diverse datasets. The data they create is usually of general importance:
An important property of polymers in medical applications, for example, is solubility. So quite often, people working in polymer pharmaceuticals will engage in the determination of phase diagrams for polymers. And as there is a lot of interest in stimulus responsive polymers, these diagrams are not just measured in pure water, but also in the presence of different ions and at different pH values. Researchers might also be interested in the dimensions of the polymer chain under all of those conditions, so light or x-ray scattering studies are carried out. And that is just on the pure polymer! Conjugation of a drug or gene to the pure material changes the game completely and so all of these measurements potentially get carried out again.

Once we are done with the physicochemical characterisation, we then go on to try and characterize the polymers we have synthesized w.r.t. their biological properties: we are interested in their toxicology, their biodistribution, their specificity etc. That, too, generates an awful lot of data which is potentially related to the structure of the polymers we are dealing with.

And as I said before, it is not only other pharmaceutical people that are interested in this sort of data. A lot of polymer chemists in general as well as companies should in principle be very interested in this type of data: polymers are present in most modern household and cleaning products (check the labels of your shampoo and washing powder bottles).

Therefore it seems to me that we have a rich source of polymer-related data here, which we should attempt to harvest. The initial enthusiasm that I have encountered at the summer school leads me to think that maybe we have an opportunity to work with the polymer pharmaceutics/nanomedicine research community to build up, at least in the long term, a valuable polymer knowledge base. Now, I am aware of the fact that this community in particular is very conscious of patents and intellectual property and we have mechanisms to ensure that these considerations can be taken into account and accommodated. How could we get hold of this data?
Over on his blog, Peter has pointed out that a viable way would be to capture digital theses in repositories, which would not only allow the thesis to be preserved, but would undoubtedly also help with dissemination and intelligent data mining. Furthermore, it would be a way to prevent publishers from copyrighting scientific data.

All of this said, the potentialities go much further than this. I have already mentioned the strongly interdisciplinary nature of the summer school. Now, in our work here in Cambridge, we use semantic web technologies to hold information about polymers….we have developed an XML-based polymer markup language and are working on ontologies, which codify polymer knowledge. One of the conclusions of my talk was that biologists and medics use exactly the same technologies to communicate their data and knowledge and so here, for the first time, we have an opportunity to bring knowledge from disparate disciplines together and map it onto each other. In that way, we should be able to develop a joint language in which we and our information systems can understand each other and that should allow us to ask new questions – Peter has already demonstrated what is possible when a thesis can be turned into RDF.
And theses originating in a strongly interdisciplinary field of research could be a wonderful starting point.

So, dear polymer science/polymer pharmaceuticals community, how about it? If you are interested not only in preserving and disseminating your data (after patenting etc.), but also in being able to ask new questions of it and in bringing multiple disciplines together, then give us your theses and let us work with you to show you how all this can be achieved. Here’s an offer – please take us up on it.

Polymer Informatics and The Semantic Web – The Solution, Part 1: Adding Structure

In one of my last posts I mentioned that one of the problems we encounter in current knowledge bases is the fact that polymer information is quite often present in free text. It is therefore very hard to extract information from these sources (although it can be done, see Peter Corbett’s OSCAR system) and even when it is accomplished, one is quite often faced with the problem of what the extracted information means. Take your favourite search engine and look for the term “cook” for example. The search engine will most likely retrieve information about people called “Cook”, about “cook” the profession, the Cook Islands or Cook County, Illinois.

One way around this is to add more descriptive data to data contained in web pages and other documents, or, in other words, data about data. If we could mark up the term “cook” as a person, a profession or a place name according to the context in which we use it, a machine would have a much better time of finding the bits of information we were really interested in. Now, data about data is also called “metadata” and one way of adding metadata to documents is through the use of markup languages and, in our case, through the use of Extensible Markup Language (XML) and its dialects.
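To make this a little more concrete, here is a minimal sketch in Python. The element name "term" and the attribute "type" are invented purely for this illustration and do not come from any real vocabulary; the only library used is the standard xml.etree.ElementTree module. The point is simply that once each use of “cook” carries a label, a machine can pick out the sense it wants.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the element name "term" and the attribute "type"
# are invented for this illustration, not part of any standard.
DOCUMENT = """
<text>
  <sentence><term type="person">Cook</term> wrote this report.</sentence>
  <sentence>The <term type="profession">cook</term> prepared dinner.</sentence>
  <sentence>Chicago lies in <term type="place">Cook County</term>.</sentence>
</text>
"""

def find_terms(xml_text, wanted_type):
    """Return the text of every <term> element whose type matches."""
    root = ET.fromstring(xml_text)
    return [term.text for term in root.iter("term")
            if term.get("type") == wanted_type]

# Only the place name is returned; the person and the profession are ignored.
print(find_terms(DOCUMENT, "place"))
```

A plain text search for “cook” would have matched all three sentences; the labelled version lets us ask for places only.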

Now the concept of a markup language should not be unfamiliar. Every internet user should have heard of HTML, Hypertext Markup Language, which can be used to structure text into headings, tables, paragraphs etc. XML, just like HTML, belongs to the class of descriptive markup languages.

markup-languages.gif

If you use Wikis at all, then you will have come across and used another type of markup, which is used for purely presentational purposes. And maybe you write your papers in LaTeX and deal with postscript files a lot, in which case you will have had exposure to procedural markup languages too.

Now according to the Wikipedia entry on XML, the latter “provides a text-based means to describe and apply a tree-based structure to information. At its base level, all information manifests as text, interspersed with markup that indicates the information’s separation into a hierarchy of character data, container-like elements, and attributes of those elements.” In an XML document, metadata is enclosed in angle brackets (“<” and “>”), which, in turn, enclose the data to be described. This is what is meant by a container. Let’s look at a simple XML document: a recipe for baking bread (also taken from the Wikipedia article):

bread.gif

We see that there are a number of containers with labels (known as “elements”) such as “recipe”, “title”, “ingredient”, “instructions” and “step”. Some of these carry a number of attributes, such as “name”, “prep_time”, “unit” and “state”, which specify further information concerning that element.

When looking at this example, you will hopefully have realized that XML is eminently human readable and that you don’t have to be a computer genius to figure out what is going on in the document. And you will hopefully also realize that this markup should now make it easy for a computer to, for example, extract all the ingredients from the text, as they are now explicitly labelled as such.
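As a rough sketch of what such an extraction could look like, the Python snippet below parses a shortened recipe with the standard xml.etree.ElementTree module and lists the ingredients. The recipe text is an illustrative reconstruction that uses the element and attribute names mentioned above; the "amount" attribute and all of the values are assumptions made for this example and not a copy of the figure.

```python
import xml.etree.ElementTree as ET

# Illustrative recipe using the element and attribute names mentioned in
# the text; the "amount" attribute and all values are assumptions made
# for this sketch.
RECIPE = """
<recipe name="bread" prep_time="5 mins">
  <title>Basic bread</title>
  <ingredient amount="8" unit="dL">Flour</ingredient>
  <ingredient amount="10" unit="grams">Yeast</ingredient>
  <ingredient amount="4" unit="dL" state="warm">Water</ingredient>
  <instructions>
    <step>Mix all ingredients together.</step>
    <step>Knead thoroughly.</step>
  </instructions>
</recipe>
"""

def list_ingredients(xml_text):
    """Return (name, amount, unit) for every <ingredient> element."""
    root = ET.fromstring(xml_text)
    return [(item.text, item.get("amount"), item.get("unit"))
            for item in root.iter("ingredient")]

# Print one line per ingredient, e.g. amount, unit, name.
for name, amount, unit in list_ingredients(RECIPE):
    print(f"{amount} {unit} {name}")
```

Note that the code never has to guess which words are ingredients: it simply asks for everything labelled “ingredient”, and the steps of the instructions are left alone.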

In my next post, I’ll discuss how to mark up chemistry and molecules….but maybe you can begin to see now how this structuring of information could be useful for polymers already.

Polymer Informatics and the Semantic Web – The Problem, Part I: Availability and Curation of Data

In one of my last posts, I outlined the vision that we have for polymer informatics. Now let me outline some of the challenges that are in our way. In my little scenario, I talked about a semantic web agent going off and gathering data. Well, here is where the difficulty starts for a machine. There are several problems:

1. Data Availability

One of the best loved and most commonly used sources of polymer information is the Polymer Handbook by Brandrup and Immergut. It contains information about approximately 2500 different polymers, scattered over multiple chapters. As it is paper-based, it is not accessible to machines and information has to be extracted and collated by hand. Wiley has taken the contents and turned them into a collection of HTML documents which are connected via hyperlinks. Though available in an electronic form, it is still very difficult for a machine to extract anything from this, as all the information is present in unstructured free text. It is not impossible, mind you: systems such as OSCAR, which we are currently developing in-house, make that sort of thing possible, but it is still far from trivial and requires much hard work. “Polymers – A property database”, published by CRC, is set up in much the same way and therefore subject to the same limitations. Furthermore, it is worth pointing out that all of these sources of data are commercial and if one’s host institution/organization does not subscribe to the relevant data source, one is….well….hosed anyway.

Things look up a bit with the PoLyInfo Database, maintained by the National Institute for Materials Science of Japan. Here we find, amongst other valuable information, the ability to search by (sub)structure and a string which defines the repeat unit structure of the polymer and which, in principle at least, is parseable. And all this goodness for approximately 13000 polymers, a large variety of physicochemical properties and, best of all, for free.

2. Data Curation

However, there is a catch. When looking, for example, at the glass transition temperature (Tg) entry for polydimethylsiloxane, we find an incredibly wide temperature range….-163 deg. C to +42 deg. C. How come there is such a wide range? Well, first of all, and this is the problem with a lot of polymer properties, the glass transition temperature is dependent on the molecular weight in the low molecular weight regime. As MWs increase, Tg eventually becomes invariant w.r.t. the molecular weight. Now when it comes to registering polymer property values, the polymer science community has gotten into the habit of reporting them WITHOUT the corresponding dependent variables, such as MW in the case of the glass transition temperature. Clearly, this makes it very hard to build good and accurate predictive models for such properties.
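To get a feel for how much the molecular weight can matter, here is a small Python sketch based on the Flory-Fox relation, a commonly used empirical description of this behaviour: Tg = Tg_inf - K / Mn, where Tg_inf is the limiting value at very high molecular weight and Mn is the number-average molecular weight. The relation is my choice for illustration and is not taken from any of the databases discussed here, and the parameter values are made up and do not describe any particular polymer.

```python
def flory_fox_tg(mn, tg_inf, k):
    """Estimate Tg (in K) from number-average molecular weight Mn (g/mol).

    Flory-Fox relation: Tg = Tg_inf - K / Mn
    """
    if mn <= 0:
        raise ValueError("molecular weight must be positive")
    return tg_inf - k / mn

# Hypothetical parameters, chosen only to illustrate the trend.
TG_INF = 373.0   # limiting Tg at very high molecular weight, in K
K = 1.0e5        # empirical constant, in K g/mol

# Tg rises steeply at low Mn and levels off at high Mn.
for mn in (1_000, 10_000, 100_000, 1_000_000):
    print(f"Mn = {mn:>9} g/mol -> Tg = {flory_fox_tg(mn, TG_INF, K):.1f} K")
```

With these made-up numbers, the lowest and highest molecular weight samples differ by roughly 100 K, while the two highest are almost indistinguishable. A Tg value stored without its molecular weight could therefore sit anywhere in that range.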

Sticking with the glass transition temperature for a moment, here’s another one. Tg is mainly determined using two different methods, namely Differential Scanning Calorimetry (DSC) or Thermomechanical Analysis (TMA). While both methods try to determine a glass transition temperature, they measure fundamentally different things. DSC essentially determines a change in heat capacity of a polymer, whereas TMA measures a dimensional change in the sample. And yes, when both methods are used on the same sample, the results usually differ by between 6 and 10 K. So it is crucial to report the measurement method, the experimental conditions etc. Furthermore, when data is abstracted to and accumulated in a knowledge system, it has to be curated to ensure that all relevant and necessary bits of metadata are available.
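A very simple form of such a curation check can be sketched in a few lines of Python. The record layout below is hypothetical (the field names are my own and are not taken from any existing database), and the values are made up; the idea is merely that a Tg entry without a measurement method or a molecular weight gets flagged rather than silently accepted.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TgRecord:
    """One glass transition temperature entry (hypothetical schema)."""
    polymer: str
    tg_kelvin: float
    method: Optional[str] = None              # e.g. "DSC" or "TMA"
    mol_weight_g_mol: Optional[float] = None  # molecular weight of the sample

def missing_metadata(record):
    """Return the names of metadata fields that have not been filled in."""
    required = ("method", "mol_weight_g_mol")
    return [name for name in required if getattr(record, name) is None]

# Made-up values, for illustration only.
records = [
    TgRecord("polymer A", 150.0, method="DSC", mol_weight_g_mol=50_000),
    TgRecord("polymer A", 158.0, method="TMA"),
    TgRecord("polymer B", 310.0),
]

# Report, for each entry, whether it is complete or what is missing.
for rec in records:
    gaps = missing_metadata(rec)
    status = "ok" if not gaps else "missing " + ", ".join(gaps)
    print(f"{rec.polymer} ({rec.tg_kelvin} K): {status}")
```

A real knowledge system would need many more fields (heating rate, sample history and so on), but even this much would make it obvious which entries cannot safely be compared with each other.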

On occasion, the PoLyInfo database will also register data for composites under the pure polymer…..which of course can shift properties tremendously.

In summary then, the first set of challenges we encounter are data availability, data curation and metadata. Unfortunately there is more, which I will discuss in one of my next posts.
