Reading the Tea Leaves of 2011 – Data and Technology Predictions for the Year Ahead

The beginning of a new year usually affords the opportunity to join in the prediction game and to think about which topics will not only be on our radar screens in the coming year, but may dominate them. I couldn’t help but attempt the same in my particular line of work – if for no other reason than to see how wrong I was when I look at this again at the beginning of 2012. Here are what I think will be at least some of the big technology and data topics in 2011:

1. Big, big, big Data
2010 has been an extraordinary year when it comes to data availability. Traditional big data producers such as biology continue to generate vast amounts of sequencing and other data. Government data is pouring in, be it here in the United Kingdom or in the United States, and efforts to liberate and obtain government data are also starting in other countries. The Linked Open Data Cloud is growing steadily:

Linked Open Data October 2007 - Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch.

Linked Open Data September 2010 - Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch.

The current linked data cloud has about 20 billion triples in it. Britain now has, thanks to the Open Knowledge Foundation, an open bibliography. The Guardian’s Datastore is a wonderful example of a commercial company making data available. The New York Times is making an annotated corpus available. Twitter and other user-generated content also provide significant data firehoses from which one can drink and build interesting mashups and applications, such as Ben Marsh’s UK Snow Map. Those are just some examples of big data, and there are several issues associated with it that will occupy us in 2011.

2. Curation and Scalability
A lot of this big data we are talking about is “real-world” and messy. There is no nice underlying ontological model (the stuff that I am so fond of) and by its nature it is exceptionally noisy. Extracting a signal out of clean data is hard enough, but getting one out of messy data requires a great deal of effort and an even greater deal of care. The development of curation tools and methodologies, both automated and social, will therefore continue to be high up on the agenda of the data scientist. And yes, I do believe that this effort is going to become a lot more social – there are signs of this starting to happen everywhere.
However, we are now generating so much data that the sheer amount is starting to outstrip our ability to compute on it – and therefore scalability will become an issue. The fact that service providers such as Amazon are offering Cluster GPU Instances as part of the EC2 offering is highly significant in this respect. MapReduce technologies seem to be extremely popular in “Web 2.0” companies and the Hadoop ecosystem is growing extremely fast – and the ability to “make Hadoop your bitch”, as an acquaintance of mine recently put it, seems to be an in-demand skill at the moment and, I think, for the foreseeable future. And – needless to say – successful automated curation of big data, too, requires scalable computing.

3. Discovery
Having a lot of datasets available to play with is wonderful, but what if nobody knows they are there? Even in science, it is still much, much harder to discover datasets than ought to be the case. And even once you have found what you may have been looking for, it is hard to decide whether that really was what you were looking for – descriptive metadata is often extremely poor or not available. There is currently little collaboration between information and data providers. Data marketplaces such as Infochimps, Factual, Public Datasets on Amazon AWS or the Talis Connected Commons (to name but a few) are springing up, but there is a lot of work to do still. And is it just me, or is science – the very people whose primary product is data and knowledge – lagging far behind in developing these marketplaces? Maybe they will develop as part of a change in the scholarly publication landscape (journals such as Open Research Computation have a chance of leading the way here), but it is too early to tell. The increasing availability of data will push this topic further onto the agenda in 2011.

4. An Impassioned Plea for Small Data
One thing that will unfortunately not be on the agenda much is small data. Of course it won’t matter to you if you do stuff at web scale or if you work in genomics. However, looking at my past existence as a laboratory-based chemist in an academic lab, a significant amount of valuable data is being produced by the lone research student who is the only one working on his project, or by a small research group in a much larger department. Although there is a trend towards large-scale projects in academia and away from individual small grants, small-scale data production on small-scale research projects is still the reality in a significant number of laboratories the world over. And the only time this data will get published is as a mangled PDF document in some journal’s supplementary information – and as such it is dead. And sometimes it is perfectly good data which never gets published at all: in my previous workplace we found that our in-house crystallographer was sitting on several thousand structures, which were perfectly good and publishable, but had, for various reasons, never been published. And usually it is data that has been produced at great cost to both the funder and the student. Now small data like this is not sexy per se. But if you manage to collect lots of small data from lots of small laboratories, it becomes big data. So my plea would simply be not to forget small data, and to build systems which collect, curate and publish it and make it available to the world. It’ll be harder to convince funders, institutions and often researchers to engage with it. But please let’s not forget it – it’s valuable.

Enough soothsaying for one blog post. But let’s get the discussion going – what are your data and technology predictions for 2011?

Licences for Ontologies

One of the things that I have been grappling with for quite some time is the whole notion of licences for ontologies. Of course, neither I nor anybody else, for that matter, should have to worry about this. But the world is the way it is and so the question is: what would an appropriate licence for an ontology be? The answer to that question would mainly depend on what an ontology actually is. Is it a piece of software? Is it a database? A structured document (whatever that means in the context of licensing)?

I have spent quite some time talking to my colleagues about this and we haven’t been able to come up with a satisfactory answer. Even emailing the good folks at the Open Knowledge Foundation did not elicit a response. Now, it seems that Science Commons has made an attempt to provide some answers on its website.

They state that whether an ontology is protected by copyright law will mainly depend on whether the ontology “contains a sufficient degree of creative expression” or whether it draws entirely on fact. In the latter case, it might not be protected. Now such a statement in itself is intriguing – in the communities in which I and many of the Science Commons people tend to spend most of our time, ontologies are usually understood to be representational artefacts, “whose representational units are intended to designate universals in reality and the relations between them.” Just how much “creative expression” that would allow is an interesting debate in itself, which is probably best had in the pub. But I digress.

Science Commons then goes on to quote some legal precedent in which US courts have upheld copyright in medical ontologies. So really, we don’t know. Science Commons then counsels “pre-emptive” licensing: if in doubt, slap a Creative Commons licence on your ontology (CC0 is explicitly recommended) – if it is later found that copyright cannot subsist in ontologies and that your licence is therefore invalid, you haven’t lost anything, but if it turns out that copyright does indeed subsist in your ontology, your bottom is covered. Small surprise, too, that Science Commons would wish to promote the licences of its sister organisation, Creative Commons.

Again, I am not convinced that Creative Commons licences are an appropriate form of licence for ontologies, any more than I am convinced that the GPL licence attached to ChemAxiom is an entirely appropriate licence for an ontology. I would be interested in what the OKF experts have to say about this. The bottom line, for now at least, seems to be that we just won’t know until someone does a lot of deep thinking or the question is tested in court.

Any comments and opinions would be extremely welcome!
