Tomorrow’s Giants 1 – Big Data

I recently spent an afternoon at a meeting entitled “Tomorrow’s Giants”, which was jointly organized by the Royal Society and Nature and took place here in Cambridge. The meeting was in preparation for a larger meeting, also entitled “Tomorrow’s Giants”, which is to be held on 1 July 2010 as part of the Royal Society’s 350th anniversary celebrations. The purpose of the larger event will be to bring together scientists and politicians in an effort to gather scientists’ visions for the next five decades and to ask questions such as:

  • What will be required to enable academic achievement in the future?
  • What are the main goals and challenges facing science in the future?

In discussing this, funding considerations were to be left to one side. This is interesting, considering that the current fashion for, and move towards, larger and larger platform grants has profound implications for some of the questions the Royal Society and Nature wanted to debate.

As part of the preparatory Cambridge meeting, the Royal Society and Nature had singled out four questions they wished us to debate:

  • “Database Management”
  • “Science Organisation”
  • “Metrics”
  • “Career Security and Support”

For historical and other reasons, readers of this blog will not be surprised to know that my personal interests are centered on scientific data, and I shall therefore spend a few blog posts on the data questions that we were asked to debate. In this context, “Database Management” was a very unfortunate name for a vastly important topic, which had everything to do with how science handles its data in the future. The questions that were asked were: (a) managing big data – what is the right infrastructure for data sharing? (b) is “big data” more of a concern for some disciplines than others (e.g. biologists)? (c) how can we – and is it appropriate to – facilitate inter-laboratory dataset comparison? (d) does the type of data have an impact on the ways it can be shared? (e) future literatures in the wider sense, i.e. not just how findings are published in journals, but how can interim findings be shared and accessed? (f) what about the tension between transparency and data protection? (g) what are the implications of the growing use of Web 2.0 as a resource for sharing research findings? and (h) how well organised is the current use of Web 2.0 and how does this impact accessibility?

These were all wonderful questions which must be asked in order to “future-proof” science and to which we were expected to provide answers in 20 minutes (!). While I was and am glad that we were to debate these issues, the devil is – as always – in the detail, and the undifferentiated nature of the asking made my heart sink again.

In this post, I would like to address the first two questions:

Managing big data – what is the right infrastructure for sharing?
The Good: What is exciting here is the recognition by the RS that data needs infrastructure, and that infrastructure is both a technical and a sociocultural problem. Some components of that infrastructure (and by far not all) that are direly needed are:

  • Data Repositories (departmental, university-level, subject-specific and transinstitutional)
  • Open, non-proprietary and standards-based markup (exchange formats)
  • Computable Metadata (e.g. ontologies which can be used to give data COMPUTABLE meaning)
  • University librarians who think that preservation of the data generated by one’s own institution falls WITHIN the remit of the library
  • Scholarly Societies who remember that they were founded in response to a scaling problem – namely the increasing availability of scientific data and the need to distribute it – and who start taking this reason for their existence seriously again rather than trying to lock up data in inaccessible and copyrighted/DRM’ed/pdf’ed publications
  • Academics who believe that data science should be a compulsory part of every undergraduate’s course
  • Funding agencies who mandate open access publishing and data sharing as a condition of the award of a grant
  • The availability and use of appropriate data licences, such as Creative Commons licences or Open Knowledge Foundation Licences

etc. etc. … I am sure there are many more things that I should mention here and that I have forgotten. Come to think of it: funding bodies and universities – don’t forget about or squeeze out the infrastructure guys. Don’t say to the infrastructure guys that the development of institutional repositories/markup languages/models/eScience tools is not science but engineering and has no place in a research university that “does science”. Do you detect bitterness? Yes you do – some of my colleagues, even those that call themselves “chemoinformaticians”, tell me just this on a regular basis. Only thing is – without the infrastructure guys and the engineers that develop all of this stuff, and develop it in a scientific manner using scientific methods, NO science will get done because there will be no infrastructure to support it. And which buttons will you push then to calculate your transition states, dock your molecules etc.? Yes – data needs infrastructure … now universities, senior academics and funding bodies … put your money and your recognition where your mouth is.
The Bad: The focus of the question on BIG data perturbs me immensely. Because BIG data is, well, BIG data, one of the first things that people who produce/manage/exchange BIG data have to do – almost by the very nature of the thing – is to worry about infrastructure for BIG data. And while we may not have all the technical answers just yet (e.g. it is sad in a way that the fastest bandwidth we have for shuffling really BIG data, such as that produced by astronomers around the world, is to load it onto hard disks, load these onto trucks and send the trucks on their way), people who deal in BIG data are very aware that it needs infrastructure and hardly need convincing. It is not BIG data that is the problem. What is the problem is data that is produced in the “bog-standard” long-tail research group of between 3 and 20 people. It is these guys who usually DO NOT (unless they happen to be blessed and are biologists) have the infrastructure to make data available in such a way that it can be stored, exchanged and re-used. It is the biology/chemistry/physics… PhD student who has slaved for three years to assemble data and keeps it in an Excel spreadsheet that we need to worry about – how do we make it possible for him to publish his data and make it reusable? How about the departmental crystallographer who sits on thousands of publication-quality but unpublished crystal structures just because the compound never quite made it into a paper? We need to develop mechanisms and infrastructure for the small “long-tail” laboratory scientists … the big data guys have this figured out anyway.
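
To make the spreadsheet case a little more concrete, here is a minimal sketch of what “publishable” long-tail data might look like: the rows go into an open format (CSV), and a small metadata file travels alongside them saying what each column means, who made the data and under which licence it may be reused. Everything in it is an assumption made for illustration, not an existing tool or standard: the function name publish_dataset, the layout of the metadata file and the example column names are all hypothetical. It uses only the Python standard library.

```python
import csv
import hashlib
import json
from pathlib import Path


def publish_dataset(rows, columns, out_dir, name, licence, creator):
    """Write rows to <name>.csv plus a <name>.metadata.json sidecar.

    Hypothetical helper: `columns` maps each column name to a dict holding
    a description and a unit, so the meaning of every field is recorded
    next to the data instead of living in somebody's head.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Open, non-proprietary exchange format: plain CSV with a header row.
    data_path = out_dir / f"{name}.csv"
    with data_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(columns))
        writer.writeheader()
        writer.writerows(rows)

    # Machine-readable metadata: licence, creator, column meanings, and a
    # checksum so a later reader can verify the data file is unchanged.
    metadata = {
        "name": name,
        "creator": creator,
        "licence": licence,
        "data_file": data_path.name,
        "sha256": hashlib.sha256(data_path.read_bytes()).hexdigest(),
        "row_count": len(rows),
        "columns": columns,
    }
    meta_path = out_dir / f"{name}.metadata.json"
    meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return data_path, meta_path
```

A student could call this with a list of dictionaries exported from the spreadsheet, for example publish_dataset(rows, {"compound": {"description": "lab notebook identifier", "unit": None}, "yield_percent": {"description": "isolated yield", "unit": "%"}}, "out", "thesis_compounds", "CC0-1.0", "A. Student"). The hard part, of course, is not these thirty lines but having a repository to put the result in and a culture that rewards doing so.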

Is Big Data more of a concern for some disciplines rather than others (e.g. biologists)?
The Good: Yes, of course it is. High-throughput screening/gene sequencing/radio astronomy produce huge amounts of data. Yes, it is a concern for them – but they are thinking about it already.
The Bad: Big data again. See above – it is not about Big data … let’s talk about the synthetic organic chemist and the data associated with the 20 compounds he makes over 3 years too, please.

I’ll continue to address some of the other data-related questions in other blog posts.
