Tomorrow’s Giants 1 – Big Data

I recently spent an afternoon at a meeting entitled “Tomorrow’s Giants”, which was jointly organized by the Royal Society and Nature and took place here in Cambridge. The meeting was in preparation for a larger meeting, also entitled “Tomorrow’s Giants”, which is to be held on the 1st July 2010 as part of the Royal Society’s 350th anniversary celebrations. The purpose of the larger event will be to bring together scientists and politicians in an effort to gather scientists’ visions for the next 5 decades and to ask questions such as

  • What will be required to enable academic achievement in the future?
  • What are the main goals and challenges facing science in the future?

In discussing this, funding considerations were to be left to one side. This is interesting, considering that the current fashion for, and move towards, larger and larger platform grants has profound implications for some of the questions the Royal Society and Nature wanted to debate.

As part of the preparatory Cambridge meeting, the Royal Society and Nature had singled out four questions they wished us to debate:

  • “Database Management”
  • “Science Organisation”
  • “Metrics”
  • “Career Security and Support”

For historical and other reasons, readers of this blog will not be surprised to know that my personal interests are centered on scientific data, and I shall therefore spend a few blog posts on the questions of scientific data that we were asked to debate. In this context, “Database Management” was a very unfortunate name for a vastly important topic which has everything to do with how science handles its data in the future. The questions that were asked were: (a) managing big data – what is the right infrastructure for data sharing? (b) is “big data” more of a concern for some disciplines rather than others (e.g. biologists)? (c) how can we – and is it appropriate to – facilitate inter-laboratory dataset comparison? (d) does the type of data have an impact on the ways it can be shared? (e) future literatures in the wider sense, i.e. not just how findings are published in journals, but how can interim findings be shared and accessed? (f) what about the tension between transparency and data protection? (g) what are the implications of the growing use of web 2.0 as a resource for sharing research findings? and (h) how well organised is the current use of web 2.0 and how does this impact accessibility?

These were all wonderful questions which must be asked in order to “future-proof” science and to which we were expected to provide answers in 20 min (!). While I was and am glad that we were to debate these issues, the devil is – as always – in the detail, and the undifferentiated nature of the questions made my heart sink again.

In this post, I would like to address the first two questions:

Managing big data – what is the right infrastructure for sharing
The Good: What is exciting here is the recognition by the RS that data needs infrastructure, and that infrastructure is both a technical and a sociocultural problem. Some components of that infrastructure (and by no means all) that are direly needed are

  • Data Repositories (departmental, university level, subject-specific and transinstitutional)
  • Open, non-proprietary and standards-based markup (exchange formats)
  • Computable Metadata (e.g. ontologies which can be used to give data COMPUTABLE meaning)
  • University librarians who think that preservation of the data generated by one’s own institution falls WITHIN the remit of the library
  • Scholarly Societies who remember that they were founded in response to a scaling problem – namely the increasing availability of scientific data and the need to distribute it – and who start taking this reason for their existence seriously again rather than trying to lock up data in inaccessible and copyrighted/DRM’ed/pdf’ed publications
  • Academics who believe that data science should be a compulsory part of every undergraduate’s course
  • Funding agencies who mandate open access publishing and data sharing as a condition of the award of a grant
  • The availability and use of appropriate data licences, such as Creative Commons licences or Open Knowledge Foundation Licences

etc. etc. I am sure there are many more things that I should mention here and that I have forgotten. Come to think of it: funding bodies and universities – don’t forget about or squeeze out the infrastructure guys. Don’t say to the infrastructure guys that the development of institutional repositories/markup languages/models/eScience tools is not science but engineering and has no place in a research university that “does science”. Do you detect bitterness? Yes you do – some of my colleagues, even those that call themselves “chemoinformaticians”, tell me just this on a regular basis. Only thing is – without the infrastructure guys and the engineers who develop all of this stuff, and develop it in a scientific manner using scientific methods, NO science will get done because there will be no infrastructure to support it. And which buttons will you push then to calculate your transition states, dock your molecules etc.? Yes – data needs infrastructure…now universities, senior academics and funding bodies…put your money and your recognition where your mouth is.
The Bad: The focus of the question on BIG data perturbs me immensely. Because BIG data is, well, BIG data, one of the first things that people who produce/manage/exchange BIG data have to do – almost by the very nature of the thing – is to worry about infrastructure for BIG data. And while we may not have all the technical answers just yet (e.g. it is sad in a way that the fastest way we have of shuffling really BIG data around the world, such as that produced by astronomers, is to load it onto hard disks, load these onto trucks and send the trucks on their way), people who deal in BIG data are very aware that it needs infrastructure and hardly need convincing. It is not BIG data that is the problem. The problem is the data that is produced in the “bog-standard” long-tail research group of between 3 and 20 people. It is these guys who usually DO NOT (unless they happen to be blessed and are biologists) have the infrastructure to make data available in such a way that it can be stored, exchanged and re-used. It is the biology/chemistry/physics…PhD student who has slaved for three years to assemble data and keeps it in an Excel spreadsheet that we need to worry about – how do we make it possible for him to publish his data and make it reusable? How about the departmental crystallographer who sits on thousands of publication-quality but unpublished crystal structures just because the compound never quite made it into a paper? We need to develop mechanisms and infrastructure for the small “long-tail” laboratory scientists…the big data guys have this figured out anyway.
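
To make the spreadsheet case concrete, here is a minimal sketch of one possible low-barrier route: writing tabular results out as plain CSV together with a small machine-readable metadata record that carries units and a licence. Everything in it is an assumption for illustration only: the function name `export_dataset`, the column names, the sample values and the choice of the CC0 licence identifier are hypothetical and are not taken from the meeting or from any particular repository standard.

```python
import csv
import io
import json


def export_dataset(rows, columns, licence, title):
    """Serialise rows to CSV text plus a JSON metadata record describing each column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[column["name"] for column in columns],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)

    # The metadata travels alongside the data, so units and licence are
    # available to software and not only to a human reading the sheet.
    metadata = {
        "title": title,
        "licence": licence,
        "columns": columns,
        "row_count": len(rows),
    }
    return buffer.getvalue(), json.dumps(metadata, indent=2)


# Hypothetical measurements, invented purely for illustration.
columns = [
    {"name": "compound_id", "unit": None, "description": "lab notebook identifier"},
    {"name": "melting_point", "unit": "degC", "description": "observed melting point"},
]
rows = [
    {"compound_id": "C-001", "melting_point": 121.5},
    {"compound_id": "C-002", "melting_point": 98.0},
]

csv_text, metadata_json = export_dataset(
    rows, columns, licence="CC0-1.0", title="Example melting points"
)
print(csv_text)
print(metadata_json)
```

The point of the sketch is only that an open exchange format plus explicit metadata is cheap to produce once someone provides the tooling; a real deposit would follow whatever schema the receiving repository asks for.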

Is Big Data more of a concern for some disciplines rather than others (e.g. biologists)?
The Good: Yes, of course it is. High-throughput screening, gene sequencing and radio astronomy produce huge amounts of data. Yes, it is a concern for them – but they are thinking about it already.
The Bad: Big data again. See above – it is not about Big data…let’s talk about the synthetic organic chemist and the data associated with the 20 compounds he makes over 3 years too, please.

I’ll continue to address some of the other data related questions in other blog posts.

An FAQ for Open Access

In a blogpost yesterday, I asked for an FAQ for Open Access for mere mortals. Well, it turns out that Peter Suber has already provided one, which I had previously overlooked. Peter pointed me to it in the comments section of the post. As this is important and tremendously valuable, I thought I should make it more explicit here:

An Open Access Overview

An appetite for open data…

…is what I have encountered here at Antwerp already. I am currently at the annual meeting of the Dutch Polymer Institute, with which I have been associated in various forms over the best part of five years now. We are the guests of Borealis here in Antwerp and as such, it promises to be an interesting meeting. The morning will be taken up with “Golden Thesis Awards”. The DPI evaluates all the PhD theses it funds on scientific merit, and the best PhD students in a year will be given an award. This is followed by an excursion to Borealis and in the afternoon, there will be thematic sessions: “Polymers and Water” and “Polymers and Time”. The former is self-explanatory and the latter concerns mainly molecular simulations of polymers at short and long time scales. This is followed by poster sessions and a Borealis-hosted dinner in the evening. Tomorrow then we will have several further talks on bio-based polymers, sustainability and solar cells and in the evening a brainstorming session: “What could polymers mean for the bottom of the pyramid?” I like DPI meetings – they are extremely young…most of the participants are PhD students and post-docs and always brimming with energy.

In that spirit, I arrived at my hotel last night and sat down for dinner. It didn’t take long before I was surrounded by old and some new acquaintances and we spent the time catching up and discussing what we have been doing. And inevitably the conversation turned to polymer informatics and open data. There were many questions: “Will extraction of data from a manuscript cause problems with publication later?”, “Why should I trust you and give you my manuscript or thesis to datamine?”, “How does copyright work out?”, “What happens to the publishers – why should they not sell my data?” etc. However, all the minds were open. They see the argument for open data and open knowledge and they agree with it in principle, but there is great uncertainty as to the politics and technicalities associated with open data. The moral of the story is: much more talking needs to be done and much more education. Open access and open data evangelists should put together an FAQ for “mere mortals”, i.e. researchers who do not think about this all the time and who should not have to think subtly about the differences between “gold OA”, “green OA”, “libre OA” and what have you. We need to do much more talking to the science community. Let’s start now. And let’s not weaken our position by OA sophistry. I will try and blog some more as the meeting goes on and hopefully also provide some photos.

PS: You will see some new and unusual tags at the bottom of this blog post (UPDATE: no tags apparently) and links in the text. I have installed Zemanta to try and make this blog semantically a little richer. The tags and links are autogenerated and I hope the result is worthwhile.
