SWAT4LS: Demo Preview NeuroLex.org.
- online wiki-based ontology for neuroscience
- built on top of mediawiki
- domain scientists can make contributions and a curation process turns this into formal representations
SWAT4LS2009 – Keynote Alan Ruttenberg: Semantic Web Technology to Support Studying the Relation of HLA Structure Variation to Disease
(These are live-blogging notes from Alan’s keynote…so don’t expect any coherent text….use them as bullet points to follow the gist of the argument.)
The Science Commons:
- a project of the Creative Commons
- 6 people
- Science Commons specializes CC to science
- information discovery and re-use
- establish legal clarity around data sharing and encourage automated attribution and provenance
Semantic Web for biologists, because it maximizes the value of scientific work by removing repeat experimentation.
ImmPort Semantic Integration Feasibility Project
- ImmPort is an immunology database and analysis portal
- Goals: meta-analysis
- Question: how can ontology help data integration for data from many sources?
Using semantics to help integrate sequence features of HLA with disorders
Challenges:
- Curation of sequence features
- Linking to disorders
- Associating allele sequences with peptide structures with nomenclature with secondary structure with human phenotype etc etc etc…
Talks about elements of representation
- PDB structures translated into ontology-based representations
- canonical MHC molecule instances constructed from IMGT
- relate each residue in PDB to the canonical residue if it exists
- use existing ontologies
- contact points between peptide and other chains computed using JMOL following IMGT. Represented as relation between residue instances.
- Structural features have fiat parts
Connecting Allele Names to Disease Names
- use papers as join factors: papers mention both disease and allele – noisy
- use regex and rewrites applied to titles and abstracts to fish out links between diseases and alleles
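A minimal sketch of the regex-and-rewrite idea, assuming a simplified allele pattern and a tiny hand-made disease list. The actual patterns used in the project are not recorded in these notes, so `REWRITE_RE`, `ALLELE_RE`, `DISEASES` and `extract_links` are all hypothetical names made up for illustration.

```python
import re

# Hypothetical rewrite: normalise "HLA B27" to "HLA-B27" before matching.
REWRITE_RE = re.compile(r"\bHLA\s+(?=[A-Z])")

# Simplified, assumed pattern for names like "HLA-B27" or "HLA-DRB1*04:01".
ALLELE_RE = re.compile(r"\bHLA-(?:[A-Z]+\d*\*\d+(?::\d+)*|[A-Z]+\d+)")

# Tiny illustrative lexicon; a real run would use a disease ontology.
DISEASES = ["ankylosing spondylitis", "rheumatoid arthritis", "type 1 diabetes"]


def extract_links(text):
    """Return (allele, disease) pairs co-mentioned in a title or abstract."""
    normalised = REWRITE_RE.sub("HLA-", text)
    alleles = sorted(set(ALLELE_RE.findall(normalised)))
    lowered = normalised.lower()
    diseases = [d for d in DISEASES if d in lowered]
    # Co-mention is only weak evidence, hence "noisy" in the notes above.
    return [(a, d) for a in alleles for d in diseases]


title = "HLA B27 and susceptibility to ankylosing spondylitis"
links = extract_links(title)
```

Every pair produced this way is only a candidate link: a paper mentioning both names may be reporting the absence of an association.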
Correspondence of molecules with allele structures is difficult.
- use BLAST to find closest allele match between PDB and allele sequence
- every PDB and allele residue has a URI
- relate matching molecules
- relate each allele residue to the canonical allele
- annotate various residues with various coordinate systems
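Once an alignment between a PDB chain and an allele sequence is available (for example the gapped rows a BLAST hit reports), relating residues comes down to walking the two rows in step. The sketch below assumes that input shape. The URI scheme, the function names and the toy sequences are invented for illustration and are not taken from the project.

```python
# Hypothetical URI scheme; the project's real URIs are not given in the notes.
BASE = "http://example.org/residue"


def residue_pairs(aligned_pdb, aligned_allele):
    """Pair 1-based residue positions from two gapped alignment rows."""
    pairs = []
    pdb_pos = allele_pos = 0
    for pdb_char, allele_char in zip(aligned_pdb, aligned_allele):
        if pdb_char != "-":
            pdb_pos += 1
        if allele_char != "-":
            allele_pos += 1
        # Only columns with a residue on both sides give a correspondence.
        if pdb_char != "-" and allele_char != "-":
            pairs.append((pdb_pos, allele_pos))
    return pairs


def residue_uri(source, identifier, position):
    """Mint a URI for one residue of one molecule."""
    return f"{BASE}/{source}/{identifier}/{position}"


# Toy alignment rows, not real sequence data.
pairs = residue_pairs("GSH-SMRY", "GSHYS-RY")
links = [
    (residue_uri("pdb", "1AGB", p), residue_uri("allele", "example", a))
    for p, a in pairs
]
```

Each entry in `links` could then be stored as a relation between two residue instances, which is what lets annotations travel from one coordinate system to another.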
This creates massive map that can be navigated and queried. Example queries:
- What autoimmune diseases can be indexed against a given allele?
- What are the variant residues at a position?
- Classification of amino acids
- Show alleles perturbed at contacts of 1AGB
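To give a feel for how such questions become pattern matches over the map, here is a toy in-memory triple matcher. It is only a sketch: a real setup would use a triple store and SPARQL, and the triples, predicate names and allele names below are placeholders, not real data.

```python
# Placeholder triples; none of these are real alleles or annotations.
TRIPLES = [
    ("ex:alleleA", "ex:residueAt9", "Y"),
    ("ex:alleleB", "ex:residueAt9", "F"),
    ("ex:alleleC", "ex:residueAt9", "Y"),
    ("ex:alleleA", "ex:associatedWith", "ex:diseaseX"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    pattern = (subject, predicate, obj)
    return [
        t for t in triples
        if all(want is None or want == got for want, got in zip(pattern, t))
    ]


def variant_residues(triples, predicate):
    """Distinct residues seen at one position across all alleles."""
    return sorted({o for _, _, o in match(triples, predicate=predicate)})


variants = variant_residues(TRIPLES, "ex:residueAt9")
diseases = [o for _, _, o in match(TRIPLES, "ex:alleleA", "ex:associatedWith")]
```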
Summary of Progress to Date:
Elements of Approach in Place: Structure, Variation, transfer of annotation via alignment, information extraction from literature etc…
Nuts and Bolts:
- Primary source
- Local copy of source
- Scripts transform to RDF
- Exports RDF Bundles
- Get selected RDF Bundles and load into triple store
- Parsers generate in-memory structures (Python, Java)
- Template files are instructions to format these into OWL
- Modeling is iteratively refined by editing templates
- RDF loaded into Neurocommons, some amount of reasoning
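The parser-plus-template step can be pictured roughly as follows. This is a sketch under assumptions: the project's template format is not described in these notes, so Python's `string.Template` stands in for it, the output is N-Triples rather than OWL, and the URIs and field names are invented.

```python
from string import Template

# Stand-in for a template file; editing this line changes the modelling
# without touching the parser.
TRIPLE_TEMPLATE = Template(
    '<$base/residue/$chain/$pos> <$base/vocab#aminoAcid> "$aa" .'
)


def render(records, base="http://example.org"):
    """Format parsed in-memory records as N-Triples lines."""
    return [TRIPLE_TEMPLATE.substitute(base=base, **record) for record in records]


# What a parser might hand over after reading a local copy of the source.
records = [
    {"chain": "A", "pos": 9, "aa": "Y"},
    {"chain": "A", "pos": 10, "aa": "T"},
]
lines = render(records)
```

Keeping the modelling in the template is what makes the iterative refinement cheap: the same parsed records can be re-rendered each time the representation changes.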
RDFHerd package management for data
neurocommons.org/bundles
Can we reduce the burden of data integration?
- Too many people are doing data integration – wasting effort
- Use web as platform
- Too many ontologies…here’s the social pressure again
Challenges
- have lawyers bless every bit of data integration
- reasoning over triple stores
- SPARQL over HTTP
- Understand and exploit ontology and reasoning
- Grow a software ecosystem like Firefox
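For the "SPARQL over HTTP" point, the SPARQL protocol lets a query be sent as a plain GET request with the query text in a `query` parameter. The sketch below only builds such a URL; the endpoint address and the vocabulary in the query are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint; substitute the address of a real SPARQL service.
ENDPOINT = "http://example.org/sparql"

# The allele URI and predicate are placeholders.
QUERY = """
SELECT ?disease WHERE {
  <http://example.org/allele/A> <http://example.org/vocab#associatedWith> ?disease .
}
"""


def sparql_get_url(endpoint, query):
    """Build the GET URL for a SPARQL query, percent-encoding the query text."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(ENDPOINT, QUERY)
# Decoding the URL again recovers the original query text.
round_trip = parse_qs(urlsplit(url).query)["query"][0]
```

Fetching `url` with any HTTP client would then return the result set, which is the sense in which the web itself becomes the integration platform.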
Licences for Ontologies
One of the things that I have been grappling with for quite some time is the whole notion of licences for ontologies. Of course, neither I – nor anybody else for that matter, should have to worry about this. But the world is the way it is and so the question is: what would an appropriate licence for an ontology be? The answer to that question would mainly depend on what an ontology actually is. Is it a piece of software? Is it a database? A structured document (whatever that means in the context of licensing)?
I have spent quite some time talking to my colleagues about this and we haven’t been able to come up with a satisfactory answer. Even emailing the good folks at the Open Knowledge Foundation did not elicit a response. Now, it seems that the Science Commons have made an attempt to provide some answers on their website.
They state that whether an ontology is protected by copyright law will mainly depend on whether the ontology “contains a sufficient degree of creative expression” or whether it draws entirely on fact. In the latter case, it might not be protected. Now such a statement in itself is intriguing – in the communities in which I and many of the Science Commons people tend to spend most of our time, ontologies are usually understood to be representational artefacts, “whose representational units are intended to designate universals in reality and the relations between them.” Just how much “creative expression” that would allow is an interesting debate in itself, which is probably best had in the pub. But I digress.
Science Commons then goes on to quote some legal precedent in which US courts have upheld copyright in medical ontologies. So really, we don’t know. Science Commons then counsels “pre-emptive” licensing: if in doubt, slap a Creative Commons licence on your ontology (CC0 is explicitly recommended) – if it is later found that copyright cannot subsist in ontologies and that your licence is therefore invalid, you haven’t lost anything, but if it turns out that copyright does indeed subsist in an/your ontology, your bottom is covered. Small surprise, too, that the Science Commons would wish to promote the licences of their sister organisation, the Creative Commons.
Again, I am not convinced that Creative Commons Licences are an appropriate form of licence for ontologies any more than I am convinced that the GPL licence attached to ChemAxiom is an entirely appropriate licence for an ontology. I would be interested in what the OKF experts have to say about this. The bottom line, for now at least, seems to be that we just won’t know until someone does a lot of deep thinking or it will be tested in court.
Any comments and opinions would be extremely welcome!
Tomorrow’s Giants 2 – Dataset Comparison, Data Sharing and Future Literatures
Following my first post from last week, here are more questions that the Royal Society wanted us Cambridge researchers to discuss during the preparatory Tomorrow’s Giants meeting in Cambridge.
How can – and is it appropriate to – facilitate inter-laboratory dataset comparison?
Great that the question was asked. And the answer is yes, of course it is. Not only is it appropriate, it is the very essence of scientific endeavour. What else could be called science? That said, the fact that the question even had to be asked and that the answer is not self-evident is disappointing. What has science/have scientists lost by way of attitude/ethics etc. that makes us even ask that question? Yes, admittedly, there may be commercial reasons as to why this sort of comparison is not desirable. One of the participants in the session was at great pains to point out that there is often commercial interest tied up with data which prevents sharing and re-use, and that is a fair point. However, over the past couple of years I have sat through far too many presentations where the presenter got up and talked about the development of a proprietary model/machine learning tool using a proprietary dataset and proprietary software. Now that is NOT science – at best it is a piece of local engineering which solves a particular problem for the presenter, but it does not advance human knowledge at all. I, as a fellow scientist, could not pick up any aspect of this work and build upon it as it is all proprietary. Local engineering at best.
Does the type of data have an impact on the ways it can be shared?
Flippantly speaking: “you betcha”. Again, great that the question was even asked. And the answer is multifaceted because the question can be read in a number of different ways. It could be read as “Does the provenance of the data and context in which it was generated have an impact on the ways in which it can be shared?” The question can also be read as “Does the (technical) format the data is in have an impact on the way in which it can be shared?” The answer in both cases is yes. Let’s tackle these two in turn. One of the participants of the workshop worked at the Faculty of Education and her primary research data consisted of a large collection of interviews she had conducted with children over the course of her work. She believes that this data is valuable to other researchers in her field and would dearly love to share – but finds herself in a mire of legal and ethical concerns with respect to, for example, the children’s privacy that effectively prevent her from data sharing. So yes, the context in which data is produced and the type of data that is generated can be an obstacle to sharing. If “type of data” is understood to mean “format” then the answer is also yes. A number of my colleagues have pointed out (see here, for example) the data loss that occurs when documents containing scientific data are converted from the format in which they were produced to pdf (examples are here, here and here). The production of data in vernacular or lossy data formats obviously also has an impact on data sharing – particularly when the sharing and exchange format is lossy.
However, the fact that the question had to be asked at all and that it went straight over the heads of most scientists who were at the meeting and who do not work in the data business, is intensely disappointing. Laboratory researchers have no appreciation of what they are doing when they convert their Word documents to pdf. Data science and informatics are not part of the standard curriculum in the education of scientists – something that desperately needs to change if data loss due to ignorance in data handling is to be avoided in the future.
Future literatures in the wider sense i.e. not just how findings are published in journals, but how can interim findings be shared and accessed?
That is a great question and one, as it turns out, that many of the people present in the meeting had pondered themselves in one form or another already. Scientists should not only be assessed on the basis of the journal articles they write, but, for example, also on the (raw) data they publish. However, science has, so far, not evolved a technical solution to the data publication problem (of course, there isn’t just one solution – there are many, depending on the type of data as well as the specific subject/sub-subject/sub-sub-subject that is producing the data etc.). Interim findings are part of this and systems like Nature Precedings could point the way (although even Nature Precedings does not allow us to deal with data). Obviously, one has to be careful that these do not just become dumping grounds for lower quality science. Once we have evolved technical solutions for publishing data, the next step will be to develop an ecosystem of metrics. And those metrics should only extend to things like data quality, trust and data provenance. Data “usefulness” – e.g. things like citation indices etc. for data – should, I think, not be part of the mix: it is impossible to predict what data will be useful when and under which circumstances (and incidentally it is the same for papers). In that sense, data usefulness can be as flighty as fashion and should not be a criterion.
There were a few more questions – and I will blog about these in a future post.
The Microsoft Biology Foundation
As most of you may know, Microsoft – and in particular the folks at Microsoft (External) Research – have started to make major inroads into developing tools for scientists, be that in the area of scholarly communication with a repository offering as well as an ontology plugin for Word or in chemistry with Chem4Word, which is currently being developed by Joe Townsend, Jim Downing and Peter Murray-Rust here at Cambridge and the team at Microsoft External Research.
Now they have also announced a first version of the Microsoft Biology Foundation. From the announcement:
“The Microsoft Biology Foundation (MBF) is a language-neutral bioinformatics toolkit built as an extension to the Microsoft .NET Framework. Currently it implements a range of parsers for common bioinformatics file formats; a range of algorithms for manipulating DNA, RNA, and protein sequences; and a set of connectors to biological Web services such as NCBI BLAST. MBF is available under an open source license, and executables, source code, demo applications, and documentation are freely downloadable [...]“
Now every time Microsoft gets involved in something like this, it is bound to generate discussion and debate, such as happened around Chem4Word (see here and links contained in this). I, for one, am happy about every constructive and open contribution to the canon of scientific tools available to the community and welcome the news.
Tomorrow’s Giants 1 – Big Data
I recently spent an afternoon at a meeting entitled “Tomorrow’s Giants”, which was jointly organized by the Royal Society and Nature and took place here in Cambridge. The meeting was in preparation for a larger meeting, also entitled “Tomorrow’s Giants”, which is to be held on the 1st July 2010 as part of the Royal Society’s 350th anniversary celebrations. The purpose of the larger event will be to bring together scientists and politicians in an effort to gather scientists’ visions for the next 5 decades and to ask questions such as
- What will be required to enable academic achievement in the future?
- What are the main goals and challenges facing science in the future?
In discussing this, funding considerations were to be left to one side. This is interesting, considering that the current fashion and move towards larger and larger platform grants has profound implications for some of the questions the Royal Society and Nature wanted to debate.
As part of the preparatory Cambridge meeting, the Royal Society and Nature had singled out four questions they wished us to debate:
- “Database Management”
- “Science Organisation”
- “Metrics”
- “Career Security and Support”
For historical and other reasons, readers of this blog will not be surprised to know that my personal interests are centered on scientific data and I shall therefore spend a few blogposts on the question of scientific data that we were asked to debate. In this context, “Database Management” was a very unfortunate name for a vastly important topic which had everything to do with how science handles its data in the future. The questions that were asked were: (a) managing big data – what is the right infrastructure for data sharing, (b) is “big data” more of a concern for some disciplines rather than others (e.g. biologists), (c) how can – and is it appropriate to – facilitate inter-laboratory dataset comparison, (d) does the type of data have an impact on the ways it can be shared, (e) future literatures in the wider sense, i.e. not just how findings are published in journals, but how can interim findings be shared and accessed, (f) what about the tension between transparency and data protection, (g) implications for the growing use of web 2.0 as a resource for sharing research findings and (h) how well organised is the current use of web 2.0 and how does this impact accessibility?
These were all wonderful questions which must be asked in order to “future-proof” science and to which we were expected to provide answers in 20 min (!). While I was and am glad that we were to debate these issues, the devil is – as always – in the detail and the undifferentiated nature of the asking made my heart sink again.
In this post, I would like to address the first two questions:
Managing big data – what is the right infrastructure for sharing
The Good: What is exciting here is the recognition by the RS that data needs infrastructure. And that infrastructure is both a technical and a sociocultural problem. Some components of that infrastructure (and by far not all) that are direly needed are
- Data Repositories (departmental, university level, subject-specific and transinstitutional)
- Open, non-proprietary and standards-based markup (exchange formats)
- Computable Metadata (e.g. ontologies which can be used to give data COMPUTABLE meaning)
- University librarians who think that preservation of the data generated by one’s own institution falls WITHIN the remit of the library
- Scholarly Societies who remember that they were founded in response to a scaling problem – namely the increasing availability of scientific data and the need to distribute it – and who start taking this reason for their existence seriously again rather than trying to lock up data in inaccessible and copyrighted/DRM’ed/pdf’ed publications
- Academics who believe that data science should be a compulsory part of every undergraduate’s course
- Funding agencies who mandate open access publishing and data sharing as a condition of the award of a grant
- The availability and use of appropriate data licences, such as Creative Commons licences or Open Knowledge Foundation Licences
etc etc…..I am sure there are many more things that I should mention here and that I have forgotten. Come to think of it: funding bodies and universities – don’t forget about or squeeze out the infrastructure guys. Don’t say to the infrastructure guys that the development of institutional repositories/markup languages/models/eScience tools is not science but engineering and has no place in a research university that “does science”. Do you detect bitterness? Yes you do – some of my colleagues – even those that call themselves “chemoinformaticians” – tell me just this on a regular basis. Only thing is – without the infrastructure guys and the engineers that develop all of this stuff and develop it in a scientific manner using scientific methods, NO science will get done because there will be no infrastructure to support it. And which buttons will you push then to calculate your transition states, dock your molecules etc.? Yes – data needs infrastructure…now universities, senior academics and funding bodies….put your money and your recognition where your mouth is.
The Bad: The focus of the question on BIG data perturbs me immensely. Because BIG data is, well, BIG data, one of the first things that people who produce/manage/exchange BIG data have to do – almost by the very nature of the thing – is to worry about infrastructure for BIG data. And while we may not have all the technical answers just yet (e.g. it is sad in a way that the fastest bandwidth we have for shuffling really BIG data, such as produced by astronomers around the world, for example, is to load it onto hard disks and to load these onto trucks and to send the trucks on their way), people who deal in BIG data are very aware that it needs infrastructure and hardly need convincing. It is not BIG data that is the problem. What is the problem is data that is produced in the “bog-standard” long-tail research group of between 3 and 20 people. It is these guys who usually DO NOT (unless they happen to be blessed and are biologists) have the infrastructure to make data available in such a way that it can be stored, exchanged and re-used. It is the biology/chemistry/physics…PhD student that has slaved for three years to assemble data and keeps it in an Excel spreadsheet that we need to worry about – how do we make it possible for him to publish his data and make it reusable? How about the departmental crystallographer who sits on thousands of publication-quality but unpublished crystal structures just because the compound never quite made it into a paper? We need to develop mechanisms and infrastructure for the small “long-tail” laboratory scientists…the big data guys have this figured out anyway.
Is Big Data more of a concern for some disciplines rather than others (e.g. biologists)?
The Good: Yes, of course it is. High throughput screening/gene sequencing/radioastronomy produce huge amounts of data. Yes, it is a concern for them – but they are thinking about it already.
The Bad: Big data again. See above – it is not about Big data…let’s talk about the synthetic organic chemist and the data associated with the 20 compounds he makes over 3 years too, please.
I’ll continue to address some of the other data related questions in other blog posts.
Hello from Hinxton
So in my last post I pretty much said good-bye to the Unilever Centre and the people there and now it is time for a hello – a hello to a new job. I have recently joined the Department of Genetics and the group of Prof Ashburner as a Research Associate. While I am formally employed by the university, I will, however, spend most of my time at the European Bioinformatics Institute in the group of Christoph Steinbeck.
My remit here will be to continue to develop chemical ontology and in particular to help, together with my colleagues and the ChEBI user community, to put the ChEBI ontology onto a “formal” footing and to align it with the upper ontology used by the OBO Foundry ontologies. I will blog more about this as the story develops – however, for now, I am very excited about this new opportunity. I have a great set of new colleagues (Duncan Hull has also just joined the ChEBI team and has blogged about it) both in the ChEBI group as well as in the wider EBI community and there is a community of people here that believe in the value of this type of work. So I am very much looking forward to helping create some exciting ontology and resources of value to the chemical and biological community.
As I was walking across the Genome Campus this morning, I couldn’t help but be struck by its beauty – here are some pictures I shot with my mobile phone:

Hinxton High Street - On the way to the Genome Campus

Genome Campus - By Hinxton Hall
Goodbye and Hello…
The eagle-eyed amongst you (or those that do not use feedreaders to read this blog) will have noticed that there have been some changes to Staudinger’s Semantic Molecules recently – the biggest change is that the blog is no more in its old form and that you are now being re-directed to Semantic Science Blog. I have migrated the old content from “Staudinger’s Semantic Molecules” to this blog and everything apart from the pictures should be here.
The move has to do with the fact that yesterday was my last day at the Unilever Centre for Molecular Science Informatics and in Peter Murray-Rust’s group. I am now taking a short break to relax for a bit and pursue, with more force, some side projects that I have been working on for a while now. I will blog about my new job soon – so for now suffice it to say that I will stay in the area of chemical informatics and ontology development – and that also means that ChemAxiom will continue to be developed and, if anything, I will now have much more time to devote to it than I have had recently.
Finally, the three years at the Unilever Centre have been a fantastic experience. I have learned a lot and hopefully grown a lot and much of this is due to Peter’s guidance and support. I have also had a set of wonderful colleagues throughout and can think of few people I would rather work with. So a VERY BIG thank you to all of them.
I will continue to blog about all things semantic, science and chemistry and hopefully a few other subjects too at this location for now. I am currently working on a new website and so may move the blog again in the future, but for now, wordpress will be my new home. The forward from the old wwmm address should continue to work, but if you have a moment, then change your bookmarks/feedreader subscriptions to http://semanticscience.wordpress.com.
For now, thanks to all the readers of this blog and in particular to those that have left comments and engaged in interesting debate – your thoughts and inputs have been and will continue to be much appreciated! See you all soon…..
Today is a special day…(at least in the US)
…it’s the First Annual National Postdoc Appreciation Day. And how badly needed it is too….after all, it’s usually the post-doc that keeps both the lab and the PI organised, writes the grant proposal that the PI then subsequently gets to keep and administer and is the guy or gal that generally keeps research science ticking over. So spare a thought for your post-doc today….give him/her a pat on the back and buy him/her a beer. Alternatively, if that does not seem good enough, here’s a link to a post-doc appreciation kit you can follow for inspiration.