What is a repository (and more importantly: what should it be)?
March 31, 2008
Every once in a while, things come together and one begins to understand stuff. I have had that experience over the last couple of days.
I spent much of last week trying to put together a white paper explaining what we do in simple terms to some managers. I have been through that exercise a couple of times now and so am quite comfortable explaining semantic technologies to chemists. So far so good. Recently, however, we have started working on molecular repositories – Peter has already mentioned this in a previous blog post. Now, of course, we do want to make use of them in polymer science and so I sat down and tried to explain what a repository is. Turns out, I got stuck.
I got stuck because I had to explain what a repository is from the point of view of functionality – when talking to non-specialist managers, it is the only way one can sell and explain these things…they do not care about the technological perspective…the stuff that’s under the hood. I found it impossible to explain what the value proposition of a repository is and how its functionality differs from, say, the document management systems in their respective companies. I talked it over with some colleagues and they couldn’t come up with anything much either. As a matter of fact, we were unable even to come up with a definition of the word “repository” that would satisfy and help differentiate: it seemed a completely meaningless word.
Now today it struck me: a repository should be a place where information is (a) collected, preserved and disseminated, (b) semantically enriched, and (c) on the basis of that semantic enrichment, put into a relationship with other repository content to enable knowledge discovery, collaboration and, ultimately, the creation of knowledge spaces. To me, this is what the repository of the future should be and how it should be defined.
Let’s take a concrete example to illustrate what I mean: my institutional repository provides me with a workspace, which I can use in my scientific work every day. In that workspace, I have my collection of literature (scientific papers, other people’s theses etc.), my scientific data (spectra, chromatograms etc.) as well as drafts of the papers that I am working on at the moment. Furthermore, I have the ability to share some of this stuff with my colleagues and also to permanently archive data and information that I don’t require for projects in the future.
Now I spend my morning working on a literature review, so I have my manuscript open and I am searching and retrieving papers from journals and information from the web. As I retrieve this information I submit it to my workspace. Before I submit, the workspace functionality of course allows me to tag the information, using either my own tags or tags that are actually ontology terms and are suggested to me as I type. Submission to the workspace triggers a whole chain of events: the paper I have just added to my workspace gets submitted to a workflow, which parses it and discovers people, places, chemical information etc., and the relevant metadata is added to my document in the form of RDF. The document thus gets semantically enriched.
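To make that chain of events a little more concrete, here is a minimal sketch in Python of what the enrichment step might look like. Everything in it is an assumption made for illustration: the `KNOWN_TERMS` lookup table is only a toy stand-in for a real entity recogniser, the `http://example.org/` namespace and the `taggedWith`, `mentions` and `type` predicates are placeholders rather than terms from any actual ontology, and the function names are made up.

```python
import re

# Hypothetical vocabulary: a toy stand-in for a real entity recogniser.
KNOWN_TERMS = {
    "methacrylate": "ChemicalEntity",
    "atrp": "Technique",
    "synthesis": "Action",
}

EX = "http://example.org/"  # placeholder namespace, not a real ontology


def find_entities(text):
    """Return the known terms that occur in the text, sorted alphabetically."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return sorted(words & set(KNOWN_TERMS))


def enrich(doc_id, text, user_tags=()):
    """Build (subject, predicate, object) triples describing one document."""
    subject = f"{EX}doc/{doc_id}"
    # Tags chosen by the user at submission time.
    triples = [(subject, f"{EX}taggedWith", tag) for tag in user_tags]
    # Metadata discovered by the background workflow.
    for term in find_entities(text):
        term_uri = f"{EX}term/{term}"
        triples.append((subject, f"{EX}mentions", term_uri))
        triples.append((term_uri, f"{EX}type", f"{EX}{KNOWN_TERMS[term]}"))
    return triples


def to_ntriples(triples):
    """Serialise triples as N-Triples-style lines.

    Simplification: objects that are not URIs are written as plain
    literals without any escaping.
    """
    def node(value):
        return f"<{value}>" if value.startswith("http") else f'"{value}"'

    return "\n".join(f"<{s}> <{p}> {node(o)} ." for s, p, o in triples)


if __name__ == "__main__":
    triples = enrich(
        "paper1", "ATRP synthesis of methyl methacrylate", user_tags=["review"]
    )
    print(to_ntriples(triples))
```

The point of the sketch is only the shape of the workflow: the user's tags and the automatically discovered entities end up as statements about the same document, so that later steps can reason over them together.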
Now the system, having discovered that the document I just submitted contains the terms “methacrylate”, “ATRP” and “synthesis”, reasons that the paper is probably about a polymer synthesis. It therefore shows me other papers that talk about acrylates and radical polymerisation, drawn from my own workspace and from the workspaces of colleagues who have either shared this information or “befriended” me.
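Again purely as an illustration, the reasoning and recommendation step could be sketched like this. The rule, the set of related terms, the workspace layout and all the names are assumptions of mine; a real system would presumably express the rule in something like OWL or SWRL rather than hard-code it in Python.

```python
# Hypothetical rule: all three terms together suggest a polymer synthesis paper.
POLYMER_SYNTHESIS_RULE = {"methacrylate", "atrp", "synthesis"}

# Hypothetical set of related terms used to match other papers.
RELATED_TERMS = {"methacrylate", "acrylate", "atrp", "radical polymerisation"}


def infer_topics(terms):
    """Return the topics inferred from the terms found in a document."""
    topics = set()
    if POLYMER_SYNTHESIS_RULE <= set(terms):
        topics.add("polymer synthesis")
    return topics


def visible_papers(me, workspaces, friends):
    """Yield (owner, title, terms) for every paper the user may see.

    A paper is visible if it is the user's own, if its owner shared it,
    or if its owner has befriended the user.
    """
    for owner, papers in workspaces.items():
        for paper in papers:
            if owner == me or paper["shared"] or owner in friends:
                yield owner, paper["title"], set(paper["terms"])


def recommend(me, new_terms, workspaces, friends):
    """Suggest visible papers on related terms for a polymer synthesis paper."""
    if "polymer synthesis" not in infer_topics(new_terms):
        return []
    return [
        (owner, title)
        for owner, title, terms in visible_papers(me, workspaces, friends)
        if terms & RELATED_TERMS
    ]


if __name__ == "__main__":
    workspaces = {
        "me": [{"title": "Acrylate kinetics", "terms": ["acrylate"], "shared": False}],
        "alice": [{"title": "ATRP review", "terms": ["atrp"], "shared": True}],
        "carol": [{"title": "Private notes", "terms": ["atrp"], "shared": False}],
    }
    new_paper_terms = {"methacrylate", "atrp", "synthesis"}
    for owner, title in recommend("me", new_paper_terms, workspaces, friends=set()):
        print(f"{owner}: {title}")
```

Note that the visibility check comes before the relevance check: papers that a colleague has neither shared nor opened up by befriending me never reach the recommendation step at all.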
In the afternoon, I go into the lab and am finally able to make the compound that I have been working on for a while in a pure form. I write up my experimental in my lab notebook and submit the data to the workspace. Again, a background procedure detects chemical entities, actions etc. and augments my document with the relevant metadata. My workspace already contains the NMR spectrum I ran earlier in the afternoon and I can now simply cross-reference the write-up for the compound with the NMR spectrum and also annotate the spectrum right there and then in my workspace. Once I am satisfied with the assignments, I archive the compound and the data – I know that when it comes to writing this up, the system has a number of templates built in which allow me, with the click of a button, to pull this data together and autogenerate an experimental section in the style of a particular paper. Furthermore, the system alerts me that someone in another group tried to prepare a very similar compound a while ago…hmm…maybe I should go and talk to that person, as I did encounter some difficulties during the synthesis. Finally, as I am happy with the purity of my compound and the associated data, I allow the system to expose this data on the web after an embargo period (say until after the publication of the relevant data).
I could carry on with this scenario for quite some time, but the most salient point is this: repositories must not be roach motels. They need to make data do work for the scientist. And by doing so, almost as an afterthought, they will also fulfill all their traditional roles of collecting artefacts, preserving them and disseminating them.
Now all of the technologies for this are, in principle, in place: we have DSpace, for example, for storage and dissemination, natural language processing systems such as OSCAR3 or part-of-speech taggers for entity recognition, RDF to hold the data, and OWL and SWRL for reasoning. And although the example here was chemistry-specific, the same thing should be doable for any other discipline. As an outsider, it seems to me that what needs to happen now is for these technologies to converge and integrate. Yes, it needs to be done in a subject-specific manner. Yes, every department should have an embedded informatician to take care of the data structures that are specific to a particular discipline. But the important thing is to just make the damn data do work!