Capturing process: In silico, in laboratorio and all the messy in-betweens – Cameron Neylon @ the Unilever Centre
May 12, 2009 1 Comment
He starts off by making the point that to capture the scientific process we need to capture the information about the objects we are investigating as well as the process how we get there.
Journals not enough – the journal article is static but knowledge is dynamic. Can solutions come from software development? Yes to a certain extent….
e.g. source control/versioning systems – captures snapshots of development over time, date stamping etc.
Unit testing – continuous tests as part of the science/knowledge testing
Solid-replication…distributed version control
Branching and merging: data integration. However, commits are free text..unstructured knowledge…no relationships between objects – what Cameron really wants to say is NO ONTOLOGIES, NO LINKED DATA.
Need linked data, need ontologies: towards a linked web of data.
Data is nice and well…but how about the stuff that goes on in the lab? Objects, data spread over multiple silos – recording much harder: we need to worry about the lab notebook.
“Lab notebook is pretty much an episodic journal” – which is not too dissimilar to a blog. Similarities are striking: descriptions of stuff happening, date stamping, categorisation, tagging, accessibility…and not of much interest to most people…;-). But problem with blogs is still information retrieval – same as lab notbook…
Now showing a blog of one of his students recording lab work…software built by Jeremy Frey’s group….blog IS the primary record: blog is a production system…2GB of data. At first glance lab-log similar to conventional blog: dates, tags etc….BUT fundamental difference is that data is marked up and linked to other relevant resources…now showing video demo of capturing provanance, date, linking of resources, versioning, etc: data is linked to experiment/procedure, procedure is linked to sample, sample is linked to material….etc….
Proposes that his blog system is a system for capturing both objects and processes….a web of objects…now showing a visualisation of resources in the notbook and demonstrates that the visualisation of the connectedness of the resources can indicate problems in the science or recording of science etc….and says it is only the linking/networking effect that allows you to do this. BUT…no semantics in the system yet (tags yes…no PROPER semantics).
Initial labblog used hand-coded markup: scientists needed to know how to hand code markup…and hated it…..this led to a desire for templates….templates create posts and associate controlled vocab and specify the metadata that needs to be recorded for a given procedure….in effect they are metadata frameworks….templates can be preconfigured for procedures and experiments….metadata frameworks map onto ontologies quite well….
Bio-ontologies…sometimes convolute process and object….says there is no particularly good ontology of experiments….I think the OBI and EXPO people might disagree….
So how about the future?
Important thing is: capture at source IN CONTEXT
Capture as much as possible automatically. Try and take human out of the equation as much as possible.
In the lab capture each object as it is created and capture the plan and track the execution step by step
Data repositories as easy as Flickr – repos specific for a data type and then link artefacts together across repos..e.g. the Periodic Table of Videos on YouTube, embedding of chemical structures into pages from ChemSpider…
More natural interfaces to interact with these records…better visualisation etc…
- Trust and Provenance and cutting through the noise: which objects/people/literature will I trust and pay attention to? Managing people and reputation of people creating the objects: SEMANTIC SOCIAL WEB (now shows FriendFeed as an example: subscription as a measure of trust in people, but people discussing objects) “Data finds the data, then people find the people”..Social network with objects at the Centre…
- Connecting with people only works if the objects are OPEN
- Connected research changes the playing field – again resources are key
- OUCH controversy: communicate first, standardize second….but at least he ackowledges that it will be messy….
UPDATE: Cameron’s slides of the talk are here: