Reading the Tea Leaves of 2011 – Data and Technology Predictions for the Year Ahead

The beginning of a new year usually affords the opportunity to join in the predication game and to think about which topics will not only be on our radar screens on the next year, but may dominate it. I couldn’t help myself but to attempt to do the same in my particular line of work – if for no other reason, than to see how wrong I was when I will look at this again at the beginning of 2012. Here are what I think will be at least some of the big technology and data topics in 2011:

1. Big, big, big Data
2010 has been an extraordinary year when it comes to data availability. Traditional big data producers such as biology continue to generate vast amounts of sequencing and other data. Government data is pouring in from countries all over the world, be it here in the United Kingdom, in the United States and efforts to liberate and obtain government data are also starting in other countries. The Linked Open Data Cloud is growing steadily:

Linked Open Data October 2007 - Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch.

Linked Open Data September 2010 - Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch.

the current linked data cloud has about 20 billion triples in it. Britain now has, thanks to the Open Knowledge Foundation, an open bibliography. The Guardian’s Datastore is a wonderful example of a commercial company making data available. The New York Times is making an annotated corpus available. Twitter and other user-generated content also provide significant data firehoses from which one can drink and build interesting mashups and applications, such as Ben Marsh’s UK Snow Map. So that are just some examples of big data and there are several issues associated with it, that will occupy us in 2011.

2. Curation and Scalability
A lot of this big data we are talking about is “real-world” and messy. There is no nice underlying ontological model (the stuff that I am so fond of) and by necessity it is exceptionally noisy. Extracting a signal out of clean data is hard enough, but getting one out of messy data requires a great deal of effort and an even greater deal of care. And therefore the development of curation tools and methodologies will continue to be high up on the agenda of the data scientist. The development of both automated and social curation tools will be high up on the agenda. And yes, I do believe that this effort is going to become a lot more social – there are signs of this starting to happen everywhere.
However, we are now generating so much data, that the sheer amount is starting to outstrip our ability to compute it – and therefore scalability will become an issue. The fact that service providers such as Amazon are offering Cluster GPU Instances as part of the EC2 offering is highly significant in this respect. MapReduce technologies seem to be extremely popular in “Web 2.0” companies and the Hadoop ecosystem is growing extremely fast – and the ability to “make Hadoop your bitch” as an acquaintance of mine recently put it, seems to be an in-demand skill at the moment and I think for the forseeable future. And – needless to say – successful automated curation of big data,, too, requires scalable computing.

3. Discovery
Having a lot of datasets available to play with is wonderful, but what if nobody knows they are there. Even in science, it is still much much harder to discover datasets than ought to be the case. And even once you have found what you may have been looking for, it is hard to decide whether that really was what you were looking for – describing metadata is often extremely poor or not available. There is currently little collaboration between information and data providers. Data marketplaces such as Infochimps, Factual, Public Datasets on Amazon AWS or the Talis Connected Commons (to name but a few) are springing up, but there is a lot of work to do still. And is it just me or is science – the very people whose primary product is data and knowledge – is lagging far behind in developing these market places. Maybe they will develop as part of a change in the scholarly pulication landscape (journals such as Open Research Computation have a chance of leading the way here), but it is too early to tell. The increasing availablity of data will push this topic further onto the agenda in 2011.

4. An Impassioned Plea for Small Data
One thing, that will unfortunately not be on the agenda much is small data. Of course it won’t matter to you when you do stuff either at web scale or if you are someone working in Genomics. However, looking at my past existence as a laboratory-based chemist in an academic lab, a significant amount of valuable data is being produced by the lone research student who is the only one working on his project or by a small research group in a much larger department. Although there is a trend to large-scale projects in academia and away from individual small grants, small-scale data production on small scale research projects is still the reality in a significant number laboratories the world over. And the only time, this data will get published, is as a mangled PDF document in some journal supplementary – and as such is dead. And sometimes it is perfectly good data, which never gets published at all: in my previous woworkplace we found that our in-house crystallographer was sitting on several thousand structures, which were perfectly good and publishable, but had, for various reasons, never been published. And usually it is data that has been produced at great cost to both the funder as well as the student. Now small data like this is not sexy per se. But if you manage to collect lots of small data from lots of small laboratories, it becomes big data. So my plea would simply be not to forget small data, to build systems, which collect, curate and publish it and make it available to the world. It’ll be harder to convince both funders and institutions and often researchers to engage with it. But please let’s not forget it – it’s valuable.

Enough soothsaying for one blog post. But let’s get the discussion going – what are your data and technology predictions for 2011?

SWAT4LS2009 – James Eales: Mining Semantic Networks of Bioinformatics eResources from Literature

eResource Annotations could help with

  • making better choices: which resource is best?
  • which is available?
  • reduce curation
  • help with service discovery

Approach: link bioinformatics resources using semantic descriptors generated from text mining….head terms for services can be used to assign services to types..e.g. applications, data sources etc.

Reblog this post [with Zemanta]

SWAT4LS2009 – Keynote Alan Ruttenberg: Semantic Web Technology to Support Studying the Relation of HLA Structure Variation to Disease

(These are live-blogging notes from Alan’s keynote…so don’t expect any coherent text….use them as bullt points to follow the gist of the argument.)

The Science Commons:

  • a project of the Creative Commons
  • 6 people
  • CC specializes CC to science
  • information discovery and re-use
  • establish legal clarity around data sharing and encourage automated attribution and provenance

Semantic Web for Biologist because it maximizes value o scientific work by removing repeat experimentation.

ImmPort Semantic Integration Feasibility Project

  • Immport is an immunology database and analysis portal
  • Goals:metaanalysis
  • Question: how can ontology help data integration for data from many sources

Using semantics to help integrate sequence features of HLA with disorders

  • Curation of sequence features
  • Linking to disorders
  • Associating allele sequences with peptide structures with nomenclature with secondary structure with human phenotype etc etc etc…

Talks about elements of representation

  • pdb structures translated into ontology-bases respresentations
  • canonical MHC molecule instances constructed from IMGT
  • relate each residue in pdb to the canonical residue if exists
  • use existing ontologies
  • contact points between peptide and other chains computed using JMOL following IMGT. Represented as relation between residue instances.
  • Structural features have fiat parts

Connecting Allele Names to Disease Names

  • use papers as join factors: papers mention both disease and allele – noisy
  • use regex and rewrites applied to titles and abstracts to fish out links between diseases and alleles

Correspondence of molecules with allele structures is difficult.

  • use blast to fiind closest allele match between pdb and allele sequence
  • every pdb and allele residue has URI
  • relate matching molecules
  • relate each allele residue to the canonical allele
  • annotate various residoes with various coordinate systems

This creates massive map that can be navigated and queried. Example queries:

  • What autoimmune diseases can de indexed against a given allele?
  • What are the variant residues at a position?
  • Classification of amino acids
  • Show alleles perturned at contacts of 1AGB

Summary of Progress to Date:
Elements of Approach in Place: Structure, Variation, transfer of annotation via alignment, information extraction from literature etc…

Nuts and Bolts:

  • Primary source
  • Local copy of souce
  • Scripts transforms to RDF
  • Exports RDF Bundles
  • Get selected RDF Bundles and load into triple store
  • Parsers generate in memory structures (python, java)
  • Template files are instructions to fomat these into owl
  • Modeling is iteratively refined by editiing templates
  • RDF loaded into Neurocommons, some amount of reasoning

RDFHerd package management for data

Can we reduce the burden of data integration?

  • Too many people are doing data integration – wasting effort
  • Use web as platform
  • Too many ontologies…here’s the social pressure again


  • have lawyers bless every bit of data integration
  • reasoning over triple stores
  • SPARQL over HTTP
  • Understand and exploit ontology and reasoning
  • Grow a software ecosystem like Firefox
Reblog this post [with Zemanta]

Capturing process: In silico, in laboratorio and all the messy in-betweens – Cameron Neylon @ the Unilever Centre

I am not very good at live-blogging, but Cameron Neylon is at the Unilever Centre and giving a talk about capturing the scientific process. This is important stuff and so I shall give it a go.

He starts off by making the point that to capture the scientific process we need to capture the information about the objects we are investigating as well as the process how we get there.

Journals not enough – the journal article is static but knowledge is dynamic. Can solutions come from software development? Yes to a certain extent….

e.g. source control/versioning systems – captures snapshots of development over time, date stamping etc.
Unit testing – continuous tests as part of the science/knowledge testing
Solid-replication…distributed version control

Branching and merging: data integration. However, commits are free text..unstructured knowledge…no relationships between objects – what Cameron really wants to say is NO ONTOLOGIES, NO LINKED DATA.

Need linked data, need ontologies: towards a linked web of data.

Data is nice and well…but how about the stuff that goes on in the lab? Objects, data spread over multiple silos – recording much harder: we need to worry about the lab notebook.

“Lab notebook is pretty much an episodic journal” – which is not too dissimilar to a blog. Similarities are striking: descriptions of stuff happening, date stamping, categorisation, tagging, accessibility…and not of much interest to most people…;-). But problem with blogs is still information retrieval – same as lab notbook…

Now showing a blog of one of his students recording lab work…software built by Jeremy Frey’s group….blog IS the primary record: blog is a production system…2GB of data. At first glance lab-log similar to conventional blog: dates, tags etc….BUT fundamental difference is that data is marked up and linked to other relevant resources…now showing video demo of capturing provanance, date, linking of resources, versioning, etc: data is linked to experiment/procedure, procedure is linked to sample, sample is linked to material….etc….

Proposes that his blog system is a system for capturing both objects and processes….a web of objects…now showing a visualisation of resources in the notbook and demonstrates that the visualisation of the connectedness of the resources can indicate problems in the science or recording of science etc….and says it is only the linking/networking effect that allows you to do this. BUT…no semantics in the system yet (tags yes…no PROPER semantics).

Initial labblog used hand-coded markup: scientists needed to know how to hand code markup…and hated it…..this led to a desire for templates….templates create posts and associate controlled vocab and specify the metadata that needs to be recorded for a given procedure….in effect they are metadata frameworks….templates can be preconfigured for procedures and experiments….metadata frameworks map onto ontologies quite well….

Bio-ontologies…sometimes convolute process and object….says there is no particularly good ontology of experiments….I think the OBI and EXPO people might disagree….

So how about the future?

    • Important thing is: capture at source IN CONTEXT
      Capture as much as possible automatically. Try and take human out of the equation as much as possible.
      In the lab capture each object as it is created and capture the plan and track the execution step by step
      Data repositories as easy as Flickr – repos specific for a data type and then link artefacts together across repos..e.g. the Periodic Table of Videos on YouTube, embedding of chemical structures into pages from ChemSpider
      More natural interfaces to interact with these records…better visualisation etc…
      Trust and Provenance and cutting through the noise: which objects/people/literature will I trust and pay attention to? Managing people and reputation of people creating the objects: SEMANTIC SOCIAL WEB (now shows FriendFeed as an example: subscription as a measure of trust in people, but people discussing objects) “Data finds the data, then people find the people”..Social network with objects at the Centre…
      Connecting with people only works if the objects are OPEN
      Connected research changes the playing field – again resources are key
      OUCH controversy: communicate first, standardize second….but at least he ackowledges that it will be messy….
  • UPDATE: Cameron’s slides of the talk are here:

    Reblog this post [with Zemanta]

    The Unilever Centre @ Semantic Technology 2009

    In a previous blogpost, I had already announced, that both Jim and I had been accepted to speak at Semantic Technology 2009 in San Jose.

    Well, the programme for the conference is out now and looks even more mind-blowing (in a very good way) than last year. Jim and I will be speaking on Tuesday, 16th June at 14:00. Here’s our talk abstracts:

    PART I | Lensfield – The Working Scientist’s Linked Data Space Elevator (Jim Downing)

    The vision of Open Linked Data in long-tail science (as opposed to Big Science, high energy physics, genomics etc) is an attractive one, with the possibility of delivering abundant data without the need for massive centralization. In achieving that vision we face a number of practical challenges. The principal challenge is the steep learning curve that scientists face in dealing with URIs, web deployment, RDF, SPARQL etc. Additionally most software that could generated Linked Data runs off-web, on workstations and internal systems. The result of this is that the desktop filesystem is likely remain the arena for the production of data in the near to medium term. Lensfield is a data repository system that works with the filesystem model and abstracts semantic web complexities away from scientists who are unable to deal with them. Lensfield makes it easy for researchers to publish linked data without leaving their familiar working environment. The presentation of this system will include a demonstration of how we have extended Lensfield to produce a Linked Data publication system for small molecule data.

    PART II | The Semantic Chemical World Wide Web (Nico Adams)

    The development of modern new drugs, new materials and new personal care products requires the confluence of data and ideas from many different scientific disciplines and enabling scientists to ask questions of heterogeneous data sources is crucial for future innovation and progress. The central science in much of this is chemistry and therefore the development of a “semantic infrastructure” for this very important vertical is essential and of direct relevance to large industries such as the pharmaceuticals and life sciences, home and personal care and, of course, the classical chemical industry. Such an infrastructure shouls include a range of technological capabilities, from the representation of molecules and data in semantically rich form to the availability of chemistry domain ontologies and the ability to extract data from unstructured sources.

    The talk will discuss the development of markup languages and ontologies for chemicals and materials (data). It will illustrate how ontologies can be used for indexing, faceted search and retrieval of chemical information and for the “axiomatisation” of chemical entities and materials beyond simple notions of chemical structure. The talk will discuss the use of linked data to generate new chemical insight and will provide a brief discussion of the use of entity extraction and natural language processing for the “semantification” of chemical information.

    But that’s not all. Lezan has been accepted to present a poster and so she will be there too,, showing off her great work on the extraction and semantification of chemical reaction data from the literature. Here is her abstract:

    The domain of chemistry is central to a large number of significant industries such as the pharmaceuticals and life sciences industry, the home and personal care industry as well as the “classical” chemical industry. All of these are research-intensive and any innovation is crucially dependent on the ability to connect data from heterogeneous sources: in the pharmaceutical industry, for example, the ability to link data about chemical compounds, with toxicology data, genomic and proteomic data, pathway data etc. is crucial. The availability of a semantic infrastructure for chemistry will be a significant factor for the future success of this industry. Unfortunately, virtually all current chemical knowledge and data is generated in non-semantic form and in many silos, which makes such data integration immensely difficult.

    In order to address these issues, the talk will discuss several distinct, but related areas, namely chemical information extraction, information/data integration, ontology-aided information retrieval and information visualization. In particular, we demonstrate how chemical data can be retrieved from a range of unstructured sources such as reports, scientific theses and papers or patents. We will discuss how these sources can be processed using ontologies, natural language processing techniques and named-entity recognisers to produce chemical data and knowledge expressed in RDF. We will furthermore show, how this information can be searched and indexed. Particular attention will also be paid to data representation and visualisation using topic/topology maps and information lenses. At the end of the talk, attendees should have a detailed awareness of how chemical entities and data can be extracted from unstructured sources and visualised for rapid information discovery and knowledge generation.

    It promises to be a great conference and I am sure our minds will go into overdrive when there….can’t wait to go! See you there!?

    Reblog this post [with Zemanta]