Capturing process: In silico, in laboratorio and all the messy in-betweens – Cameron Neylon @ the Unilever Centre

I am not very good at live-blogging, but Cameron Neylon is at the Unilever Centre and giving a talk about capturing the scientific process. This is important stuff and so I shall give it a go.

He starts off by making the point that to capture the scientific process we need to capture the information about the objects we are investigating as well as the process by which we get there.

Journals not enough – the journal article is static but knowledge is dynamic. Can solutions come from software development? Yes to a certain extent….

e.g. source control/versioning systems – captures snapshots of development over time, date stamping etc.
Unit testing – continuous tests as part of the science/knowledge testing
Solid-replication…distributed version control

Branching and merging: data integration. However, commits are free text..unstructured knowledge…no relationships between objects – what Cameron really wants to say is NO ONTOLOGIES, NO LINKED DATA.

Need linked data, need ontologies: towards a linked web of data.

Data is all well and good…but how about the stuff that goes on in the lab? Objects and data spread over multiple silos make recording much harder: we need to worry about the lab notebook.

“Lab notebook is pretty much an episodic journal” – which is not too dissimilar to a blog. The similarities are striking: descriptions of stuff happening, date stamping, categorisation, tagging, accessibility…and not of much interest to most people…;-). But the problem with blogs is still information retrieval – same as the lab notebook…

Now showing a blog of one of his students recording lab work…software built by Jeremy Frey’s group….blog IS the primary record: blog is a production system…2GB of data. At first glance lab-log similar to conventional blog: dates, tags etc….BUT fundamental difference is that data is marked up and linked to other relevant resources…now showing video demo of capturing provenance, date, linking of resources, versioning, etc: data is linked to experiment/procedure, procedure is linked to sample, sample is linked to material….etc….

Proposes that his blog system is a system for capturing both objects and processes….a web of objects…now showing a visualisation of resources in the notebook and demonstrates that the visualisation of the connectedness of the resources can indicate problems in the science or in the recording of the science etc….and says it is only the linking/networking effect that allows you to do this. BUT…no semantics in the system yet (tags yes…no PROPER semantics).
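The connectedness check described here can be toyed with in a few lines – a minimal sketch in which the record names and link types are made up for illustration, not taken from the actual lab-blog system. The idea: in a web of notebook objects, anything that nothing links to is a likely gap in the record.

```python
# Toy "web of objects" from a lab notebook: (subject, link, object) edges.
# All names are invented for the example.
links = [
    ("experiment-1", "uses", "sample-A"),
    ("experiment-1", "produces", "data-1"),
    ("sample-A", "made_from", "material-X"),
]

# Every object the notebook knows about; note sample-B appears nowhere above.
objects = {"experiment-1", "sample-A", "data-1", "material-X", "sample-B"}

def orphans(objects, links):
    """Objects that participate in no link at all – candidates for a
    recording problem (a sample no procedure ever used, say)."""
    linked = {s for s, _, o in links} | {o for s, _, o in links}
    return objects - linked
```

Here `orphans(objects, links)` flags `sample-B` – exactly the kind of signal the connectedness visualisation surfaces, and one that only exists because the records are linked.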

Initial lab blog used hand-coded markup: scientists needed to know how to hand-code markup…and hated it…..this led to a desire for templates….templates create posts, associate a controlled vocab and specify the metadata that needs to be recorded for a given procedure….in effect they are metadata frameworks….templates can be preconfigured for procedures and experiments….metadata frameworks map onto ontologies quite well….
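As a rough illustration of what such a template-as-metadata-framework might look like – the field names and controlled vocabularies here are invented for the example, not taken from the actual system:

```python
# A template names the metadata fields a procedure must record and
# constrains some of them to a controlled vocabulary.
# (Hypothetical example: an NMR experiment template.)
NMR_TEMPLATE = {
    "required": ["sample", "solvent", "nucleus", "spectrometer_mhz"],
    "controlled_vocab": {
        "solvent": {"CDCl3", "D2O", "DMSO-d6"},
        "nucleus": {"1H", "13C", "31P"},
    },
}

def validate_post(metadata: dict, template: dict) -> list[str]:
    """Return a list of problems; an empty list means the post
    satisfies the template's metadata framework."""
    problems = []
    for field in template["required"]:
        if field not in metadata:
            problems.append(f"missing field: {field}")
    for field, vocab in template["controlled_vocab"].items():
        if field in metadata and metadata[field] not in vocab:
            problems.append(f"{field}={metadata[field]!r} not in controlled vocabulary")
    return problems
```

Because the template fixes both the field names and the allowed values, it is a short step from here to mapping those fields onto ontology terms – which is the point made above.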

Bio-ontologies…sometimes convolute process and object….says there is no particularly good ontology of experiments….I think the OBI and EXPO people might disagree….

So how about the future?

    • The important thing: capture at source, IN CONTEXT
    • Capture as much as possible automatically – take the human out of the equation as far as possible
    • In the lab, capture each object as it is created; capture the plan and track the execution step by step
    • Data repositories as easy as Flickr – repos specific to a data type, with artefacts linked together across repos, e.g. the Periodic Table of Videos on YouTube, or the embedding of chemical structures into pages from ChemSpider
    • More natural interfaces for interacting with these records…better visualisation etc…
    • Trust and provenance and cutting through the noise: which objects/people/literature will I trust and pay attention to? Managing people and the reputation of the people creating the objects: the SEMANTIC SOCIAL WEB (now shows FriendFeed as an example: subscription as a measure of trust in people, but people discussing objects) “Data finds the data, then people find the people”…a social network with objects at the centre…
    • Connecting with people only works if the objects are OPEN
    • Connected research changes the playing field – again, resources are key
    • OUCH controversy: communicate first, standardize second…but at least he acknowledges that it will be messy…
    • UPDATE: Cameron’s slides of the talk are here:


    ChemAxiom: An Ontology for Chemistry – 1. The Motivation

    I announced some time ago that we are working on ontologies in the polymer domain, though I realise that, so far, I have yet to produce the proof of that: the actual ontology/ontologies.

    So today I am happy to announce that the time of vapourware is over and that we have released ChemAxiom – a modular set of ontologies which form the first ontological framework for chemistry (or at least so we believe). The development of these ontologies has taken us a while: I started this on a hunch and as a nice intellectual exercise, not entirely sure where to go with them or what to use them for, and therefore not working on them full time. As the work progressed, however, we understood just how inordinately useful they would be for what we are trying to accomplish in both polymer informatics and chemical informatics at large. I will introduce and discuss the ontologies in a succession of blog posts, of which this is the first.

    So what, though perhaps somewhat retrospectively, was the motivation for preparing these ontologies? In short: the breakdown of many common chemistry information systems when confronted with real chemical phenomena rather than with small subsections of idealised abstractions. Let me explain.

    Chemistry and chemical information systems positively thrive on the use of a connection table as a chemical identifier and determinant of uniqueness. The reasons for this are fairly clear: chemistry, for the past 100 years or so, has elevated the (potential) correlation between the chemical structure of a molecule and its physicochemical and biological properties to be its “central dogma.” The application of this dogma has served subsections of the community – notably organic, medicinal and biological chemists – incredibly well, while causing major headaches for other parts of the chemistry community and giving an outright migraine to information scientists and researchers. There are several reasons for the pain:

    The use of a connection table as an identifier for chemical objects leads to significant ontological confusion. Often, chemists and their information systems do not realise that there is a fundamental distinction between (a) the Platonic idea of a molecule, (b) the idea of a bulk substance and (c) an instance of the latter (“the real bulk substance”) in a flask or bottle on the researcher’s lab bench. An example of this is the association of a physicochemical property of a chemical entity with a structure representation of a molecule: while it would, for example, make sense to do this for a HOMO energy, it does NOT make sense to speak of a melting point or a boiling point in terms of a molecule. The point here is simply that many physicochemical properties are the mereological sums of the properties of many molecules in an ensemble. If this is true for simple properties of pure small molecules, it is even more true for the properties of complex systems such as polymers, which are ensembles of many different molecules of many different architectures. A similar argument can be made for identifiers: in most chemical information systems, it is often not clear whether an identifier (such as a CAS number) refers to a molecule or to a substance composed of these molecules.

    Many chemical objects have temporal characteristics. Often, chemical objects have temporal characteristics which influence and determine their connection table. A typical example of this is rapidly interconverting isomers: glucose, when dissolved in water, can be described by several rapidly interconverting structures – a single connection table is not enough to describe the concept “glucose in water”, and there exists a parthood relationship between the concept and several possible connection tables. Ontologies can help with specifying and defining these parthood relationships.
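As a minimal sketch of this parthood relationship in plain (subject, predicate, object) triples – all identifiers below are made up for illustration:

```python
# "Glucose in water" is one concept with several interconverting forms as
# parts; each form has its own connection table, and no single connection
# table identifies the concept. (Identifiers are illustrative only.)
triples = [
    ("ex:glucose_in_water", "rdf:type",   "ex:ChemicalSystem"),
    ("ex:glucose_in_water", "ex:hasPart", "ex:alpha_D_glucopyranose"),
    ("ex:glucose_in_water", "ex:hasPart", "ex:beta_D_glucopyranose"),
    ("ex:glucose_in_water", "ex:hasPart", "ex:open_chain_form"),
    ("ex:alpha_D_glucopyranose", "ex:hasConnectionTable", "ex:ct_alpha"),
    ("ex:beta_D_glucopyranose",  "ex:hasConnectionTable", "ex:ct_beta"),
    ("ex:open_chain_form",       "ex:hasConnectionTable", "ex:ct_open"),
]

def objects_of(triples, subject, predicate):
    """All objects for a given subject/predicate pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]
```

Asking `objects_of(triples, "ex:glucose_in_water", "ex:hasConnectionTable")` returns nothing – the concept deliberately has no connection table of its own, only parts that do.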

    There is another aspect of time dependence we also need to consider. For many materials, their existence in time – put another way, their history – often holds more meaningful information about an observed physical property of that substance than the chemical structure of one of the components of the mixture. For an observable property of a polymer, such as the glass transition temperature, it matters a great deal whether the polymer was synthesized in the solid phase in a pressure autoclave or in solution at ambient pressure. Furthermore, it matters whether and how a polymer was processed – how it was extruded, grafted etc. All of these processes have a significant influence on the observable physical properties of a bulk sample of the polymer, while leaving the chemical description of the material essentially unchanged (in current practice, polyethylene is often represented either by the structure of the corresponding monomer (ethene, for example) or by the structure of a repeat unit fragment (-CH2-CH2-)). Ontologies will help us to describe and define these histories. Ultimately, we envisage that this will result in a “semantic fingerprint” of a material, which – one might speculate – will be much more appropriate for the development of design rules for materials than the dumb structure representations in use today.

    Many chemical objects are mixtures….and mixtures simply do not lend themselves to being described by the connection table of a single constituent entity of that mixture. If this is true for glucose in water, it is even truer for things such as polymers: polymers are mixtures of many different macromolecules, all of which have slightly different architectures etc. An observed physical property, and therefore a data object, is the mereological sum of the contributions made by all the constituent macromolecules, and therefore such a data object cannot simply be associated with a single connection table.
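A standard polymer-science example makes the point concrete: the number-average (Mn) and weight-average (Mw) molecular weights are defined over the whole population of chains, never over a single connection table. A small sketch (the chain counts below are invented for the example):

```python
def mn(population):
    """Number-average molecular weight:
    Mn = sum(Ni * Mi) / sum(Ni) over (chain count Ni, molar mass Mi) pairs."""
    return sum(n * m for n, m in population) / sum(n for n, _ in population)

def mw(population):
    """Weight-average molecular weight:
    Mw = sum(Ni * Mi^2) / sum(Ni * Mi)."""
    return sum(n * m * m for n, m in population) / sum(n * m for n, m in population)

# An invented distribution of polyethylene chains: (number of chains, g/mol).
chains = [(100, 10_000), (200, 20_000), (100, 40_000)]
```

For this made-up distribution Mn comes out at 22,500 g/mol and Mw at roughly 27,778 g/mol. Both are properties of the ensemble – sums over every chain in the mixture – which is exactly why neither can be attached to any one connection table.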

    This, in my view, is a short summary of the case for ontology in chemistry. Please feel free to violently (dis)agree – if you want to do so, I look forward to a discussion in the comments section.

    There’s one more thing:


    The ChemAxiom ontologies are far from perfect and far from finished. We hope that they show what an ontological framework for chemistry could look like. In developing these ontologies, we can contribute our particular point of view, but we would like to hear yours. Even more, we would like to invite the community to get involved in the development of these ontologies in order to make them a general and valuable resource. If you would like to become involved, then please send an email to chemaxiom at googlemail dot com or leave comments/questions etc. in the ChemAxiom Google Group.

    In the next several blog posts, I will dive into some of the technical details of the ontologies.

    (Automatic Links etc., as always, by Zemanta)


    Semantic Universe Website with Chemical/Polymer Informatics Contributions now Live

    Over on Twitter, Semantic Universe has just announced the relaunch of their website. The purpose of the site is “to educate the world about semantic technologies and applications.”

    To quote from the website:

    “Semantic Universe and Cerebra today announced the launch of the “Semantic Universe Network”, a vibrant educational and networking hub for the global semantic technology marketplace. Semantic Universe Network will be the educational and information resource for the people and companies within the high-growth semantics sector, covering the latest news, opinions, events, announcements, products, solutions, promotions and research in the industry.”

    As part of the re-launch, Lezan Hawizy and I have written two short contributions reviewing the state of semantic chemistry and showcasing our work on how the semantification of chemistry can happen. The contributions were intended to be short “how-tos” and as such are written in a somewhat chatty style. Here are the links:

    Semantic Chemistry

    The Semantification of Chemistry

    Feedback is welcome.


    (More) Triples for the World

    I have taken a long hiatus from blogging for a number of reasons and still don’t have time to blog much, but something has just happened that has really excited me.
    During this year’s International Semantic Web Conference in Karlsruhe (which I am still angry about not being able to attend due to time constraints), it was announced that Freebase now produces RDF!

    Now just in case you are wondering what Freebase is, here’s a description from their website:

    Freebase, created by Metaweb Technologies, is an open database of the world’s information. It’s built by the community and for the community – free for anyone to query, contribute to, build applications on top of, or integrate into their websites.

    Already, Freebase covers millions of topics in hundreds of categories. Drawing from large open data sets like Wikipedia, MusicBrainz, and the SEC archives, it contains structured information on many popular topics, including movies, music, people and locations – all reconciled and freely available via an open API. This information is supplemented by the efforts of a passionate global community of users who are working together to add structured information on everything from philosophy to European railway stations to the chemical properties of common food ingredients.

    By structuring the world’s data in this manner, the Freebase community is creating a global resource that will one day allow people and machines everywhere to access information far more easily and quickly than they can today.

    And all of this data, they are making available as RDF triples, which you can get via a simple service:

    Welcome to the Freebase RDF service.

    This service generates views of Freebase Topics following the principles of Linked Data. You can obtain an RDF representation of a Topic by sending a simple GET request to, where the “thetopicid” is a Freebase identifier with the slashes replaced by dots. For instance to see “/en/blade_runner” represented in RDF request

    The /ns end-point will perform content negotiation, redirecting your client to the HTML view of the Topic if HTML is preferred (as it is in standard browsers) or redirecting you to an RDF representation in N3, RDF/XML or Turtle depending on the preferences expressed in your client’s HTTP Accept header.

    This service will display content in Firefox if you use the Tabulator extension.

    If you have questions or comments about the service please join the Freebase developer mailing list.
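The interaction the quoted passage describes can be sketched as follows. The base URL here is a placeholder (the original post elides the real end-point), so this is an illustration of the dotted-id transformation and of content negotiation via the Accept header, not a working client:

```python
import urllib.request

BASE = "http://example.org/ns"  # placeholder for the service's /ns end-point

def topic_to_path(topic_id: str) -> str:
    """'/en/blade_runner' -> 'en.blade_runner' (slashes become dots),
    as the quoted service description specifies."""
    return topic_id.strip("/").replace("/", ".")

def rdf_request(topic_id: str, mime: str = "text/turtle") -> urllib.request.Request:
    """Build (but do not send) a content-negotiating GET for a Topic:
    the Accept header asks the end-point for RDF rather than the HTML view."""
    return urllib.request.Request(
        f"{BASE}/{topic_to_path(topic_id)}",
        headers={"Accept": mime},
    )
```

Passing `"application/rdf+xml"` or `"text/n3"` as `mime` would, per the description above, steer the negotiation to a different serialisation.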

    So now there’s DBPedia and Freebase. More triples for the world, more data, more opportunity to move ahead. In chemistry, it’s sometimes so difficult to convince people of the value of open and linked data. This sort of stuff makes me feel that we are making progress. Slowly, but inexorably. And that is exciting.

    Twine as a model for repository functionality?

    Although I have not blogged anything for a long time again, I did not mean to write this blog post, as I have some more pressing academic concerns to deal with at the moment. However, the discussion as to what a repository should be or should do for its users is flaring up again here and here, and in particular Chris Rusbridge asked for ideas about repository functionality, so I thought I should chime in.

    Reading through all of the posts referenced above, the theme of automated metadata discovery is high on everybody’s agenda, and for good reason: while I DO use our DSpace implementation here in Cambridge and try to submit posters and manuscripts, I feel considerable frustration every time I do so. Having to enter the metadata first (and huge amounts of it) costs me anything from 5 to 10 minutes a pop. Now (ex-)DSpacers tell me that the interface and functionality that make me do this are a consequence of user interaction studies. If that is true, then the mind boggles….but anyway, back to the point.

    I have been wondering for a while now whether the darling of the semantic web community, namely Radar Networks’ Twine, could not be a good model for at least some of the functionality that an institutional repo should/could have. It takes a little effort to explain what Twine is, but if you were to press me for the elevator pitch, it would probably be fair to say that Twine is to interests and content what Facebook is to social relationships and LinkedIn is to professional relationships.

    In short, when logging into Twine, I am provided with a sort of workspace, which allows me to reposit all sorts of stuff: text documents, pdf documents, bookmarks, videos etc. The Twine Workspace:

    I can furthermore organize this content into collections (“Twines”), which can be either public or private:

    Once uploaded, all resources get pushed through a natural language processing workflow, which aims to extract metadata from these and subsequently marks the metadata up in a semantically rich form (RDF) using Twine’s own ontologies. Here, for example, is a bookmark for a book on Amazon’s site:

    The extracted metadata is shown in a user-friendly way on the right. And here is the RDF that Twine produces as a consequence of metadata extraction from the Amazon page:

    So far, the NLP functionality extracts people, places, organisations, events etc. However, Radar Networks have announced that users will be allowed to use their own ontologies come the end of the year. Now I have no idea how this will work technically, but assuming that they can come up with a reasonable implementation, things get exciting: it is then up to users to “customize” their workspace around their interests and to decide on the information they want to see.

    On the basis of the extracted metadata, the system will suggest other documents in my own collection or in other public Twines, which might be of interest to me, and I, for one, have already been alerted to a number of interesting documents this way. Again, if Radar’s plans go well, Twine will offer document similarity analyses on the basis of clustering around autumn time.

    It doesn’t end here: there is also a social component to the system. On the basis of the metadata extracted from my documents, other users with a similar metadata profile – and therefore presumed similar interests – will be recommended to me, and I have the opportunity to link up with them.

    At the moment, Twine is in private beta and so the stuff is hidden behind a password for now. However, if everything goes to plan, Radar plans to take the passwords off the public Twines so that the stuff will be exposed on the web, indexed by Google etc. And once that happens, of course, there are more triples for the world too…..which can only be a good thing.

    Personally, I am excited about all of this, simply because the potential is huge. Some of my colleagues are less enthusiastic – for all sorts of reasons. For one, the user interface is far from intuitive at the moment and it actually takes a little while to “get” Twine. But once you do, it is very exciting….and I think that a great deal of this functionality could be/should be implemented by institutional repos as well. Oh and what would it mean for data portability/data integration etc. if institutional repos started to expose RDF to the world….?

    By the way, I have quite a few Twine invites left – so should anybody want to have a look and play with the system, leave a comment on the blog and I’ll send you an invite!