Tuesday 23 October 2018

Open data and the scientific gift culture


Continuing our theme for International Open Access Week, Professor Kevin Cowtan, Department of Chemistry, writes about the value of open data for scientific research.

If you've applied for a research council grant recently, you'll know that research councils have become rather keen on 'open data' in recent years. Funders would like us not just to produce new results, but also to provide all the data used in deriving those results. Many journals are introducing similar requirements.

At first glance this might look like research funders imposing more bureaucracy on grant holders. However, I would like to suggest that open data is fundamental to how science works, and that releasing research data can also provide significant benefits to the researcher themself.

All science involves building on the work of others, or 'standing on the shoulders of giants'. This makes science a gift culture - we take the gift of the work of others and in turn gift our own work to others for them to build on. Making our results available sooner increases the opportunities for others to build on them, or if necessary to point out our errors, both of which increase human knowledge. Releasing our data often increases the value of our work, because other researchers can test our hypotheses and others against the data. In open source software, these benefits are characterized by the slogans 'release early, release often', and 'given enough eyeballs, all bugs are shallow'.

Or that is what is supposed to happen. But does it work in practice? I would like to highlight three experiences from my own career which suggest that it does.

Example 1: In the 1990s Dr Paul Emsley and I developed a new piece of software for X-ray crystallography, called 'Coot'. University culture at the time was heavily focussed on the commercialisation of software outputs; however, we (not without difficulty) made our work 'open source', meaning anyone else could build on our work, and we in turn could incorporate the work of others. This turned out to be a very good decision: Coot quickly surpassed and largely replaced all competing tools, and for the past few years the software has typically been cited in around 10 new peer-reviewed papers every day. Because the software is used in industry as well as in academia, it also has an economic impact.

Example 2: Around 2013 I became interested in climate science, and identified a problem with how a major historical temperature dataset was being used. Users assumed that the data were global in coverage, when in fact they were not. I published a paper on estimating an unbiased global mean from the incomplete data, but also released the data and monthly updates from then on. The dataset has attracted over 200 citations and been used in official reports from government organizations. The name recognition this has generated has made it easy for me to build collaborations with climate scientists - which is not always easy when starting in a new field.

Example 3: In 2015 I identified a problem in how climate model simulations are compared with observations - the most commonly used method did not provide an 'apples to apples' comparison because of complexities of the historical data. A correct comparison involved some dull but careful data analysis. Again, I released the software as well as the data. Several subsequent comparisons have made use of this code, leading to both citations and co-authorships, at least one of which will be REF returnable.

Image courtesy of XKCD, https://xkcd.com/1827, under a CC BY-NC 2.5 licence

Now, this may all have been luck. After all, had I not had success in releasing data and computer code, I would not have been asked to write this blog post. There could be hundreds of people releasing data and not seeing any benefits. I could be the beneficiary of 'survivorship bias', explained by Randall Munroe in the comic XKCD.

However, there are objective reasons to believe that releasing data does benefit the researcher. In 2013, Piwowar and Vision found that, after controlling for a range of other factors, papers with open data received more citations than papers without open data. Open data also provides economic impact, estimated for example by Houghton and Gruen in 2014, which, when measurable, may be useful to the department and the researcher for REF "impact" studies.

In summary, open data is a natural extension of the principles of good scientific research: science is and has always been a social activity, and the gifting of information is fundamental to that activity. Studies of open data publications show benefits both to the researcher and to the wider economy. My own research career has been built on giving away data and computer code: not every case has led to benefits, but the net benefit over the course of my career has far outweighed the time cost of releasing the data.


Professor Cowtan is an interdisciplinary data scientist working in the fields of X-ray crystallography and climate science. While most of his career has been at the University of York, he has also spent sabbaticals at the San Diego Supercomputer Center. He is the chair of the university Research Data Management committee.
