Tomorrow’s Giants 2 – Dataset Comparison, Data Sharing and Future Literatures

Following my first post from last week, here are more questions that the Royal Society wanted us Cambridge researchers to discuss during the preparatory Tomorrow’s Giants meeting in Cambridge.

How can – and is it appropriate to – facilitate inter-laboratory dataset comparison?
Great that the question was asked. And the answer is yes, of course it is. Not only is it appropriate, it is the very essence of scientific endeavour. What else could be called science? That said, the fact that the question even had to be asked and that the answer is not self-evident is disappointing. What have scientists lost by way of attitude and ethics that makes us even ask that question? Admittedly, there may be commercial reasons why this sort of comparison is not desirable. One of the participants in the session was at great pains to point out that there is often commercial interest tied up in data, which prevents sharing and re-use, and that is a fair point. However, over the past couple of years I have sat through far too many presentations where the presenter got up and talked about the development of a proprietary model/machine learning tool using a proprietary dataset and proprietary software. Now that is NOT science – at best it is a piece of local engineering which solves a particular problem for the presenter, but it does not advance human knowledge at all. I, as a fellow scientist, could not pick up any aspect of this work and build upon it, as it is all proprietary. Local engineering at best.

Does the type of data have an impact on the ways it can be shared?
Flippantly speaking: “you betcha”. Again, great that the question was even asked. And the answer is multifaceted because the question can be read in a number of different ways. It could be read as “Does the provenance of the data and the context in which it was generated have an impact on the ways in which it can be shared?” The question can also be read as “Does the (technical) format the data is in have an impact on the way in which it can be shared?” The answer in both cases is yes. Let’s tackle these two in turn. One of the participants of the workshop worked at the faculty of education, and her primary research data consisted of a large collection of interviews she had conducted with children over the course of her work. She believes that this data is valuable to other researchers in her field and would dearly love to share it – but finds herself in a mire of legal and ethical concerns with respect to, for example, the children’s privacy, which effectively prevent her from sharing data. So yes, the context in which data is produced and the type of data that is generated can be an obstacle to sharing. If “type of data” is understood to mean “format”, then the answer is also yes. A number of my colleagues have pointed out (see here, for example) the data loss that occurs when documents containing scientific data are converted from the format in which they were produced to PDF (examples are here, here and here). The production of data in vernacular or lossy data formats obviously also has an impact on data sharing – particularly when the sharing and exchange format is itself lossy.
However, the fact that the question had to be asked at all, and that it went straight over the heads of most scientists who were at the meeting and who do not work in the data business, is intensely disappointing. Laboratory researchers have no appreciation of what they are doing when they convert their Word documents to PDF. Data science and informatics are not part of the standard curriculum in the education of scientists – something that desperately needs to change if data loss due to ignorance in data handling is to be avoided in the future.

Future literatures in the wider sense i.e. not just how findings are published in journals, but how can interim findings be shared and accessed?
That is a great question and one, as it turns out, that many of the people present at the meeting had already pondered in one form or another. Scientists should not only be assessed on the basis of the journal articles they write, but also, for example, on the (raw) data they publish. However, science has, so far, not evolved a technical solution to the data publication problem (of course, there isn’t just one solution – there are many, depending on the type of data as well as the specific subject/sub-subject/sub-sub-subject that is producing the data). Interim findings are part of this, and systems like Nature Precedings could point the way (although even Nature Precedings does not allow us to deal with data). Obviously, one has to be careful that these do not just become dumping grounds for lower-quality science. Once we have evolved technical solutions for publishing data, the next step will be to develop an ecosystem of metrics. And those metrics should extend only to things like data quality, trust and data provenance. Data “usefulness” – e.g. things like citation indices for data – should, I think, not be part of the mix: it is impossible to predict what data will be useful when and under which circumstances (and incidentally it is the same for papers). In that sense, data usefulness can be as flighty as fashion and should not be a criterion.

There were a few more questions – and I will blog about these in a future post.


Tomorrow’s Giants 1 – Big Data

I recently spent an afternoon at a meeting entitled “Tomorrow’s Giants”, which was jointly organised by the Royal Society and Nature and took place here in Cambridge. The meeting was in preparation for a larger meeting, also entitled “Tomorrow’s Giants”, which is to be held on 1 July 2010 as part of the Royal Society’s 350th anniversary celebrations. The purpose of the larger event will be to bring together scientists and politicians in an effort to gather scientists’ visions for the next five decades and to ask questions such as:

  • What will be required to enable academic achievement in the future?
  • What are the main goals and challenges facing science in the future?

In discussing this, funding considerations were to be left to one side. This is interesting, considering that the current fashion for, and move towards, larger and larger platform grants has profound implications for some of the questions the Royal Society and Nature wanted to debate.

As part of the preparatory Cambridge meeting, the Royal Society and Nature had singled out four questions they wished us to debate:

  • “Database Management”
  • “Science Organisation”
  • “Metrics”
  • “Career Security and Support”

For historical and other reasons, readers of this blog will not be surprised to know that my personal interests are centered on scientific data, and I shall therefore spend a few blog posts on the questions about scientific data that we were asked to debate. In this context, “Database Management” was a very unfortunate name for a vastly important topic which had everything to do with how science will handle its data in the future. The questions that were asked were: (a) managing big data – what is the right infrastructure for data sharing? (b) is “big data” more of a concern for some disciplines rather than others (e.g. biologists)? (c) how can – and is it appropriate to – facilitate inter-laboratory dataset comparison? (d) does the type of data have an impact on the ways it can be shared? (e) future literatures in the wider sense, i.e. not just how findings are published in journals, but how can interim findings be shared and accessed? (f) what about the tension between transparency and data protection? (g) implications for the growing use of web 2.0 as a resource for sharing research findings and (h) how well organised is the current use of web 2.0 and how does this impact accessibility?

These were all wonderful questions which must be asked in order to “future-proof” science and to which we were expected to provide answers in 20 minutes (!). While I was and am glad that we were to debate these issues, the devil is – as always – in the detail, and the undifferentiated way in which the questions were asked made my heart sink again.

In this post, I would like to address the first two questions:

Managing big data – what is the right infrastructure for sharing
The Good: What is exciting here is the recognition by the RS that data needs infrastructure, and that infrastructure is both a technical and a sociocultural problem. Some components of that infrastructure (and by far not all) that are direly needed are:

  • Data Repositories (departmental, university level, subject-specific and transinstitutional)
  • Open, non-proprietary and standards-based markup (exchange formats)
  • Computable Metadata (e.g. ontologies which can be used to give data COMPUTABLE meaning)
  • University librarians who think that preservation of the data generated by one’s own institution falls WITHIN the remit of the library
  • Scholarly Societies who remember that they were founded in response to a scaling problem – namely the increasing availability of scientific data and the need to distribute it – and who start taking this reason for their existence seriously again rather than trying to lock up data in inaccessible and copyrighted/DRM’ed/pdf’ed publications
  • Academics who believe that data science should be a compulsory part of every undergraduate’s course
  • Funding agencies who mandate open access publishing and data sharing as a condition of the award of a grant
  • The availability and use of appropriate data licences, such as Creative Commons licences or Open Knowledge Foundation Licences

etc. etc. I am sure there are many more things that I should mention here and that I have forgotten. Come to think of it: funding bodies and universities – don’t forget about or squeeze out the infrastructure guys. Don’t say to the infrastructure guys that the development of institutional repositories/markup languages/models/eScience tools is not science but engineering and has no place in a research university that “does science”. Do you detect bitterness? Yes you do – some of my colleagues, even those that call themselves “chemoinformaticians”, tell me just this on a regular basis. Only thing is – without the infrastructure guys and the engineers who develop all of this stuff, and develop it in a scientific manner using scientific methods, NO science will get done because there will be no infrastructure to support it. And which buttons will you push then to calculate your transition states, dock your molecules etc.? Yes – data needs infrastructure…now universities, senior academics and funding bodies…put your money and your recognition where your mouth is.
The Bad: The focus of the question on BIG data perturbs me immensely. Because BIG data is, well, BIG data, one of the first things that people who produce/manage/exchange BIG data have to do – almost by the very nature of the thing – is to worry about infrastructure for BIG data. And while we may not have all the technical answers just yet (e.g. it is sad in a way that the fastest bandwidth we have for shuffling really BIG data, such as that produced by astronomers around the world, is to load it onto hard disks, load these onto trucks and send the trucks on their way), people who deal in BIG data are very aware that it needs infrastructure and hardly need convincing. It is not BIG data that is the problem. What is the problem is data that is produced in the “bog-standard” long-tail research group of between 3 and 20 people. It is these guys who usually DO NOT (unless they happen to be blessed and are biologists) have the infrastructure to make data available in such a way that it can be stored, exchanged and re-used. It is the biology/chemistry/physics…PhD student who has slaved for three years to assemble data and keeps it in an Excel spreadsheet that we need to worry about – how do we make it possible for him to publish his data and make it reusable? How about the departmental crystallographer who sits on thousands of publication-quality but unpublished crystal structures just because the compound never quite made it into a paper? We need to develop mechanisms and infrastructure for the small “long-tail” laboratory scientists…the big data guys have this figured out anyway.

Is Big Data more of a concern for some disciplines rather than others (e.g. biologists)?
The Good: Yes, of course it is. High-throughput screening/gene sequencing/radio astronomy produce huge amounts of data. Yes, it is a concern for them – but they are thinking about it already.
The Bad: Big data again. See above – it is not about Big data…let’s talk about the synthetic organic chemist and the data associated with the 20 compounds he makes over 3 years too, please.

I’ll continue to address some of the other data related questions in other blog posts.
