January 4, 2011
The beginning of a new year usually affords the opportunity to join in the prediction game and to think about which topics will not only be on our radar screens in the coming year, but may dominate them. I couldn’t help but attempt the same in my particular line of work – if for no other reason than to see how wrong I was when I look at this again at the beginning of 2012. Here is what I think will be at least some of the big technology and data topics in 2011:
1. Big, big, big Data
2010 has been an extraordinary year when it comes to data availability. Traditional big data producers such as biology continue to generate vast amounts of sequencing and other data. Government data is pouring in from countries all over the world, be it here in the United Kingdom or in the United States, and efforts to liberate and obtain government data are also starting in other countries. The Linked Open Data Cloud is growing steadily: the current linked data cloud has about 20 billion triples in it. Britain now has, thanks to the Open Knowledge Foundation, an open bibliography. The Guardian’s Datastore is a wonderful example of a commercial company making data available. The New York Times is making an annotated corpus available. Twitter and other sources of user-generated content also provide significant data firehoses from which one can drink and build interesting mashups and applications, such as Ben Marsh’s UK Snow Map. So those are just some examples of big data, and there are several issues associated with it that will occupy us in 2011.
2. Curation and Scalability
A lot of this big data we are talking about is “real-world” and messy. There is no nice underlying ontological model (the stuff that I am so fond of) and by necessity it is exceptionally noisy. Extracting a signal out of clean data is hard enough, but getting one out of messy data requires a great deal of effort and an even greater deal of care. The development of curation tools and methodologies, both automated and social, will therefore continue to be high up on the agenda of the data scientist. And yes, I do believe that this effort is going to become a lot more social – there are signs of this starting to happen everywhere.
However, we are now generating so much data that the sheer amount is starting to outstrip our ability to compute on it – and therefore scalability will become an issue. The fact that service providers such as Amazon are offering Cluster GPU Instances as part of the EC2 offering is highly significant in this respect. MapReduce technologies seem to be extremely popular in “Web 2.0” companies and the Hadoop ecosystem is growing extremely fast – and the ability to “make Hadoop your bitch”, as an acquaintance of mine recently put it, seems to be an in-demand skill at the moment and, I think, for the foreseeable future. And – needless to say – successful automated curation of big data, too, requires scalable computing.
Having a lot of datasets available to play with is wonderful, but what if nobody knows they are there? Even in science, it is still much, much harder to discover datasets than ought to be the case. And even once you have found what you may have been looking for, it is hard to decide whether that really was what you were looking for – descriptive metadata is often extremely poor or not available. There is currently little collaboration between information and data providers. Data marketplaces such as Infochimps, Factual, Public Datasets on Amazon AWS or the Talis Connected Commons (to name but a few) are springing up, but there is a lot of work to do still. And is it just me, or is science – the very people whose primary product is data and knowledge – lagging far behind in developing these marketplaces? Maybe they will develop as part of a change in the scholarly publication landscape (journals such as Open Research Computation have a chance of leading the way here), but it is too early to tell. The increasing availability of data will push this topic further onto the agenda in 2011.
4. An Impassioned Plea for Small Data
One thing that will unfortunately not be on the agenda much is small data. Of course it won’t matter to you if you do stuff at web scale or if you are someone working in genomics. However, looking at my past existence as a laboratory-based chemist in an academic lab, a significant amount of valuable data is being produced by the lone research student who is the only one working on his project, or by a small research group in a much larger department. Although there is a trend towards large-scale projects in academia and away from individual small grants, small-scale data production on small-scale research projects is still the reality in a significant number of laboratories the world over. And the only time this data will get published is as a mangled PDF document in some journal’s supplementary information – and as such it is dead. And sometimes it is perfectly good data which never gets published at all: in my previous workplace we found that our in-house crystallographer was sitting on several thousand structures, which were perfectly good and publishable, but had, for various reasons, never been published. And usually it is data that has been produced at great cost to both the funder and the student. Now small data like this is not sexy per se. But if you manage to collect lots of small data from lots of small laboratories, it becomes big data. So my plea would simply be not to forget small data, and to build systems which collect, curate and publish it and make it available to the world. It’ll be harder to convince funders, institutions and often researchers to engage with it. But please let’s not forget it – it’s valuable.
Enough soothsaying for one blog post. But let’s get the discussion going – what are your data and technology predictions for 2011?