Project Management by Committee

I am just catching up on Seth Godin’s blog – as usual his posts are short and poignant. This one struck a particular chord:

“Hi, we’re here to take your project to places you didn’t imagine.

With us on board, your project will now take three times as long.

It will cost five times as much.

And we will compromise the art and the vision out of it, we will make it reasonable and safe and boring.”

Great work is never reasonable, safe or boring. Thanks anyway.

Support the Long-Term Future of KEGG

The KEGG database is an invaluable resource for biologists, bioinformaticians, clinical researchers, chemists etc. in general and has also been invaluable in some of my personal activities. KEGG is developed in the laboratory of Minoru Kanehisa, who is now approaching mandatory retirement and is looking to put KEGG on a sustainable footing and to give it a viable business model for the future. The following is a complete reproduction (though no explicit licence for reuse is provided, I claim fair use) of a recent post on the KEGG website:

Plea to Support KEGG

Since 1995 the KEGG database has been developed in my laboratories (Kanehisa Laboratories) at Kyoto University and the University of Tokyo thanks to funding from the Japanese Ministry of Education and its agencies. Contrary to popular perception, KEGG has never been a public database, as there has never been an official long-term commitment from any government agency. Although I have managed over the years to obtain multiple and overlapping short-term research grants to support KEGG, this has become more difficult now that I am reaching the mandatory retirement age. Foreseeing this eventuality, together with my colleagues, I started a non-profit organization, NPO Bioinformatics Japan, as a vehicle to raise funds for the service that we have been delivering.

For the last ten years our major source of funding has come from the Institute for Bioinformatics Research and Development (BIRD) of the Japan Science and Technology Agency (JST). As of April 1, 2011 BIRD has been converted to the National Bioscience Database Center (NBDC) in JST. The newly established NBDC focuses on the integration of various databases, and does not support the development of individual databases as BIRD did. The good news is that I was awarded a three-year grant from NBDC for integration of KEGG MEDICUS with disease and drug information used in practice and in society. However, the bad news is that this grant is not sufficient to continue to hire my talented crew of KEGG curators and software developers.

KEGG is now one of the most widely used biological databases in the world as indicated by the web access statistics (150 to 200 thousand unique visitors per month) and the number of KEGG paper citations (one thousand per year). I intend to ensure that KEGG remains a freely available web resource. However, this will be possible only with your support. First, I would like to ask all of you who have benefited from KEGG to write, email, tweet, and blog about your support for KEGG. I hope, in the long run, your voices will increase our chances of getting more stable funding. Second, we will continue to ask commercial organizations to obtain a license to use KEGG from Pathway Solutions Inc. I am very grateful to all the companies who have so far supported KEGG by obtaining license agreements. This licensing revenue is fully reinvested to further the development of KEGG. Unfortunately though, this is still insufficient to maintain the high-quality service that we strive to accomplish. Consequently, I would like to introduce the following mechanism.

Starting on July 1, 2011 the KEGG FTP site for academic users will be transferred from GenomeNet at Kyoto University to NPO Bioinformatics Japan, and it will be available only to paid subscribers. The publicly funded portion, the medicus directory, will continue to be freely accessible at GenomeNet. The KEGG FTP site for commercial customers managed by Pathway Solutions will remain unchanged. The new FTP site is available for free trial until the end of June.

Please register to learn more about the KEGG FTP subscription.

Thank you!

Minoru Kanehisa

2011 – The International Year of Chemistry

In their editorial for the January issue (you will need a Nature subscription to access this; alternatively, see the Sceptical Chymist post here), the good folks at Nature Chemistry have reminded us that 2011 is the International Year of Chemistry:

“The United Nations has proclaimed 2011 to be the International Year of Chemistry. Under this banner, chemists should seize the opportunity to highlight the rich history and successes of our subject to a much broader audience — and explain how it can help to solve the global challenges we face today and in the future.”

The year even has a website. The UN also singles out two important areas of chemistry – neither of which has chemistry in the name – on the front page of the site, namely the development of advanced materials and molecular medicine. I am extremely happy to see this – materials, and in particular polymers, have been a long-standing interest of mine, and some of the immunology work I am currently doing has implications for molecular medicine too.

There are several ways to participate in the Year of Chemistry – one of them is through an essay and video competition: “A World Without Polymers”. Students are asked to make short videos or write essays, trying to imagine what the world would be like without polymers. Furthermore there are networking events, conferences and more all across the world. So go and check out the UN’s site, participate and contribute!

Reading the Tea Leaves of 2011 – Data and Technology Predictions for the Year Ahead

The beginning of a new year usually affords the opportunity to join in the prediction game and to think about which topics will not only be on our radar screens in the coming year, but may dominate them. I couldn’t help but attempt the same in my particular line of work – if for no other reason than to see how wrong I was when I look at this again at the beginning of 2012. Here are what I think will be at least some of the big technology and data topics in 2011:

1. Big, big, big Data
2010 has been an extraordinary year when it comes to data availability. Traditional big data producers such as biology continue to generate vast amounts of sequencing and other data. Government data is pouring in from countries all over the world, be it here in the United Kingdom or in the United States, and efforts to liberate and obtain government data are starting in other countries too. The Linked Open Data Cloud is growing steadily:

Linked Open Data October 2007 - Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

Linked Open Data September 2010 - Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

The current linked data cloud has about 20 billion triples in it. Britain now has, thanks to the Open Knowledge Foundation, an open bibliography. The Guardian’s Datastore is a wonderful example of a commercial company making data available. The New York Times is making an annotated corpus available. Twitter and other user-generated content also provide significant data firehoses from which one can drink and build interesting mashups and applications, such as Ben Marsh’s UK Snow Map. Those are just some examples of big data, and there are several issues associated with it that will occupy us in 2011.

2. Curation and Scalability
A lot of this big data we are talking about is “real-world” and messy. There is no nice underlying ontological model (the stuff that I am so fond of) and by necessity it is exceptionally noisy. Extracting a signal out of clean data is hard enough, but getting one out of messy data requires a great deal of effort and an even greater deal of care. The development of curation tools and methodologies, both automated and social, will therefore continue to be high up on the agenda of the data scientist. And yes, I do believe that this effort is going to become a lot more social – there are signs of this starting to happen everywhere.
However, we are now generating so much data that the sheer amount is starting to outstrip our ability to compute on it – and therefore scalability will become an issue. The fact that service providers such as Amazon are offering Cluster GPU Instances as part of the EC2 offering is highly significant in this respect. MapReduce technologies seem to be extremely popular in “Web 2.0” companies and the Hadoop ecosystem is growing extremely fast – and the ability to “make Hadoop your bitch”, as an acquaintance of mine recently put it, seems to be an in-demand skill at the moment and, I think, for the foreseeable future. And – needless to say – successful automated curation of big data, too, requires scalable computing.
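
For readers who have not met the MapReduce pattern mentioned above, here is a minimal single-machine sketch in Python. It is an illustration only: it does not use Hadoop, the function names (map_phase, shuffle, reduce_phase, word_count) are my own, and a real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse all values seen for one key into a single count."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle and reduce sequentially over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data big compute", "small data"])
print(counts)
```

The point of the pattern is that each map call and each reduce call is independent of the others, which is what allows the work to be spread over a cluster.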

3. Discovery
Having a lot of datasets available to play with is wonderful, but what if nobody knows they are there? Even in science, it is still much, much harder to discover datasets than ought to be the case. And even once you have found what you may have been looking for, it is hard to decide whether that really was what you were looking for – descriptive metadata is often extremely poor or not available. There is currently little collaboration between information and data providers. Data marketplaces such as Infochimps, Factual, Public Datasets on Amazon AWS or the Talis Connected Commons (to name but a few) are springing up, but there is a lot of work to do still. And is it just me, or is science – the very people whose primary product is data and knowledge – lagging far behind in developing these marketplaces? Maybe they will develop as part of a change in the scholarly publication landscape (journals such as Open Research Computation have a chance of leading the way here), but it is too early to tell. The increasing availability of data will push this topic further onto the agenda in 2011.

4. An Impassioned Plea for Small Data
One thing that will unfortunately not be on the agenda much is small data. Of course it won’t matter to you if you work at web scale or in genomics. However, looking back at my past existence as a laboratory-based chemist in an academic lab, a significant amount of valuable data is produced by the lone research student who is the only one working on a project, or by a small research group in a much larger department. Although there is a trend towards large-scale projects in academia and away from individual small grants, small-scale data production on small-scale research projects is still the reality in a significant number of laboratories the world over. And the only time this data gets published is as a mangled PDF document in some journal’s supplementary information – and as such it is dead. And sometimes it is perfectly good data which never gets published at all: in my previous workplace we found that our in-house crystallographer was sitting on several thousand structures, which were perfectly good and publishable, but had, for various reasons, never been published. And usually it is data that has been produced at great cost to both the funder and the student. Now small data like this is not sexy per se. But if you manage to collect lots of small data from lots of small laboratories, it becomes big data. So my plea would simply be not to forget small data, and to build systems which collect, curate and publish it and make it available to the world. It’ll be harder to convince funders, institutions and often researchers to engage with it. But please let’s not forget it – it’s valuable.

Enough soothsaying for one blog post. But let’s get the discussion going – what are your data and technology predictions for 2011?

The Manuscript Submission Process at Science (Magazine)

Today Peter Stern, the Senior Editor of Science Magazine, was here at the Genome Campus to give a talk about the manuscript submission process at Science. No matter where you stand w.r.t. scientific publication and the future of scholarly communication, the talk was very engaging, thoughtfully delivered and mercifully done without any audiovisual aids. I made a few notes during the talk and here they are. They are unedited “live typing” and as such not pretty – but hopefully useful when trying to understand the publication process for Science. I have (mostly) refrained from commenting on what he said – though there is a lot that could be said about this – but maybe at a later date. For now, just the raw unadulterated notes:

Peter Stern: The Manuscript Submission Process at Science
Three points to address:

1. Presubmission Enquiries
2. Board Members
3. Review Process

Ad One
Scientists sometimes forget the bigger picture as they work and get their results. Presubmission enquiries can be useful to get some feedback and help place work in some bigger picture. Insists on confidentiality of info provided pre sub enquiries.

Ad Two
Science has about 28 individuals trying to cover all of science….all editors have science profiles…multiple ones…many have run research groups.

Ad Three
Paper is submitted…Stern makes high play of safety and confidentiality. Paper gets assigned to editor. Editors try to read in full and form an informal opinion, but admits that this is getting harder due to volume of submissions. Now talks at great length about the “wackos” (creationists, people inventing perpetuum mobile). Editors work with advisors – “board members”….again about 10-12 advisors…..but they don’t do a review but rather try to place manuscript in the bigger picture of science. They come back with a short evaluation and a confidence score. Board members are active scientists with labs…..looks for gentleman factor in board members (wants to be sure that they are fair to papers even if paper disagrees scientifically with board member).

Once feedback from board members has been received editor opens another round of discussion with fellow editors. If there is a positive decision at this stage paper will be sent for full review. Most papers fail at this stage. Also little room for discussion – decision is essentially binary.

Finding referees: authors can prepare a “negative” list and a positive list of referees. Lists are usually respected…certainly the “negative” list. Editors often scan websites of grant giving bodies…to avoid friends/collaborators refereeing each other. Recommends “Guardians of Science” – a sociological study of the peer review process. Default option of two referees, sometimes more if necessary. Default review time of two weeks: seen as the right balance between speed, ensuring that authors don’t get scooped, and allowing enough time for in-depth review.

When referee comments come back, there is room for negotiation depending on comments. What happens next depends on what referees ask for. If it is reasonable further work, paper could go back to authors, if too much further work is requested editor has to make a decision. “Peer review is not a democratic process.” If referee reviews are all over the shop could use an arbitrator – which could be a board member.

If positive decision is made, editor will do a “pre-edit” to make it fit Science style. If author is native English speaker, editor will focus on logical argument and flow of paper, if non-native speaker, more linguistic help is needed. After pre-edit is done, paper is returned to authors and a revised version is expected back within 4 weeks unless experimental work needs to be done which takes longer. Most of the time revised paper goes back to referees and gets green light if referee comments have been addressed.

Once accepted, papers can go onto Science Express for rapid publication and to allow the scientists to claim precedence of publication. This is followed by harsh copy-editing. Calls orthographic mistakes an “affront to science”. Now talks about how good they are at disseminating science and making their authors famous. Here’s the gatekeeper justification again.

Visualisation of Ontologies and Large Scale Graphs

For a whole number of reasons, I am currently looking into the visualisation of large-scale graphs and ontologies and to that end, I have made some notes concerning tools and concepts which might be useful for others. Here they are:

Visualisation by Node-Link and Tree

jOWL: jQuery Plugin for the navigation and visualisation of OWL ontologies and RDFS documents. Visualisations mainly as trees, navigation bars.

OntoViz: Plugin into Protege…at the moment supports Protege 3.4 and doesn’t seem to work with Protege 4.

IsaViz: Much the same as OntoViz really. Last stable version 2004 and does not seem to see active development.

NeOn Toolkit: The Neon toolkit also has some visualisation capability, but not independent of the editor. Under active development with a growing user base.

OntoTrack: OntoTrack is a graphical OWL editor and as such has visualisation capabilities. Meager though and it does not seem to be supported or developed anymore either…the current version seems about 5 years old.

Cone Trees: Cone trees are three-dimensional extensions of 2D tree structures and have been designed to allow a greater amount of information to be visualised and navigated. I have not found any software for download at the moment, but the idea is so interesting that we should bear it in mind. Examples are here, here and the key reference is Robertson, George G. and Mackinlay, Jock D. and Card, Stuart K., Cone Trees: animated 3D visualizations of hierarchical information, CHI ’91: Proceedings of the SIGCHI conference on Human factors in computing systems, 1991, ISBN 0-89791-383-3, pp. 189-194. (DOI here)

PhyloWidget: PhyloWidget is software for the visualisation of phylogenetic trees, but should be repurposable for ontology trees. Javascript – so appropriate for websites. Student project as part of the Phyloinformatics Summer of Code 2007.

The JavaScript Information Visualization Toolkit: Extremely pretty JS toolkit for the visualisation of graphs etc…..Dynamic and interactive visualisations too…just pretty. Have spent some time hacking with it and I am becoming a fan.

Welkin: Standalone application for the visualisation of RDF graphs. Allows dynamic filtering, colour coding of resources etc…

Three-Dimensional Visualisation

Ontosphere3D: Visualisation of ontologies on 3D spheres. Does not seem to be supported anymore and requires Java 3D, which is just a bad nightmare in itself.

Cone Trees (see above) with their extension of Disc Trees (for an example of disc trees, see here)

3D Hyperbolic Tree as exemplified by the Walrus software. Originally developed for website visualisation, results in stunning images. Not under active development anymore, but source code available for download.

Cytoscape: The 1000 pound gorilla in the room of large-scale graph visualization. There are several plugins available for interaction with the Gene Ontology, such as BiNGO and ClueGO. Both tools treat the ontology as annotation rather than as a knowledge base in its own right and can be used for the identification of GO terms which are overrepresented in a cluster/network. In terms of visualisation of ontologies themselves, there is the RDFScape plugin, which can visualize ontologies.
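
As a note to self on what “overrepresented” means in practice: over-representation of a GO term in a cluster is commonly assessed with a one-sided hypergeometric test. The sketch below is my own illustration of that general idea using only the Python standard library; it is not taken from BiNGO or ClueGO, and the function and parameter names are made up for the example.

```python
from math import comb

def enrichment_p_value(population, annotated, cluster, hits):
    """One-sided hypergeometric p-value for term over-representation.

    population: total number of genes considered
    annotated:  genes in the population carrying the GO term
    cluster:    number of genes in the cluster of interest
    hits:       genes in the cluster carrying the GO term

    Returns the probability of seeing at least `hits` annotated genes
    when drawing `cluster` genes at random without replacement.
    """
    total = comb(population, cluster)
    upper = min(annotated, cluster)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, cluster - i)
        for i in range(hits, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 10 genes, 5 annotated, cluster of 5, all 5 annotated.
p = enrichment_p_value(population=10, annotated=5, cluster=5, hits=5)
print(p)
```

A small p-value suggests the term turns up in the cluster more often than chance would predict; with many terms tested at once, a multiple-testing correction would be needed on top of this.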

Zoomable Visualisations

Jambalaya – Protege Plugin, but can also run as a browser applet. Uses Shrimp to visualise class hierarchies in ontologies and arrows between boxes to represent relationships.

CropCircles (link is to the paper describing it): CropCircles have been implemented in the SWOOP ontology editor which is not under active development anymore, but where the source code is available.

Information Landscapes – again, no software, just papers.

Merry Christmas Everyone

Another year is coming to a close and it has been nothing short of eventful. There has been the end of one direction of research, the beginning of my existence as a service provider at the EBI and several new strands of research. Not to speak of moving house and a number of other things.

I have learned a lot about people this year and sometimes more than I wanted to. In particular, I have learned that “trust” is the only thing that allows anyone to manage anything – both in business and academia. Destroying trust between people, or between people and organisations, causes untold harm in the medium and long term, no matter how expedient it seems at the time.

However, it is Christmas now and the world rests for a few days. Time to reflect on 2009 and to look forward to the new year with all its possibilities and challenges.

A very merry Christmas and a happy new year to you all, thank you for reading the blog and see you in 2010!
