I am an ardent supporter of open data. In fact, I am such a big fan I even published all of my PhD data back in 2013. It was mostly very boring but I know of at least one dataset that was used (albeit briefly) by another group. I thought that spending the time to put it up with a few notes and references to parts of my thesis was something that might be good. While I wasn’t particularly aware of it, I was doing exactly what the UK research councils wanted.
In 2012 the research councils of the UK decided that all projects they fund should share their data online, free for everyone to use with a bare minimum attribution to the original creator of the data. It’s publicly funded research, so the idea of making publicly funded data publicly available seems like an easy sell. And I think most people are behind the idea – I mean free data, what’s not to love?
Well as it turns out, the bit that’s not to love is the fact that someone has to actually now go to the trouble of putting all that data on to the internet. And even less lovely, the people who need to do it are generally the scientists who generate it. Which is fair enough because a) they are generally the ones with the data and b) anyone who has ever generated data knows full well that the originating person is probably the only person on the planet who knows which files are which! But researchers spend quite a lot of time doing research so extra work is rarely viewed without a degree of withering scepticism.
Right now, anyone working on a Research Council grant is technically meant to be uploading their data to an accessible repository, which then keeps it safe for a whopping 10 years.
But there are some problems.
Firstly, it’s not entirely clear that the research councils actually understood what they were asking for. I say that because what they have asked for is (in the way most universities are interpreting it) crazy. They want ALL the data produced as part of a research council project to be hosted online and made public.
To give you some idea of what that means: I’m coming to the end of a Research Council grant which has generated 37.42 GB of data in 104,780 files. Now the 37.42 GB is easy enough: space is cheap, so hosting that should be simple enough.
But what exactly are those 104,780 files?
Well, they are a big mix of things – mostly data files or code for converting the source files into other files. But I can make some hand-wavy reductions based on some approximate amounts.
First off there’s my ‘admin’ folder with all my PORs, quotes, H&S and all manner of other admin. That’s about 1000 files – we have a lot of admin…
Then there are experiments that are each a single dataset but generate lots of files. It’s very normal with my data collection to collect about 1,000 files per experiment. So if we assume that around 75% of the remaining 103,780 files (77,835 of them) belong to these larger datasets, that chunk shrinks down to just 78 datasets.
The remaining 25% (25,945 for anyone keeping track) are also probably sets but much smaller than 1,000. So just for the sake of argument let’s assume they are grouped in 100s. Neither of these approximations is perfect, but together they get us to a probably overly generous number of about 300 datasets.
So far I’ve uploaded 20 datasets to our open access data repository. This, you might note, is a lot fewer than 300. The project ends in about 6 months and I think it’s fair to say that I will still have uploaded a lot fewer than 300.
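For anyone who wants to check the sums, here is a minimal Python sketch of the estimate. It assumes the roughly 1,000 admin files are set aside before the 75/25 split, which is what the 25,945 figure implies; the function and constant names are made up for illustration.

```python
# Back-of-envelope dataset count, using the figures quoted in the post.
TOTAL_FILES = 104_780
ADMIN_FILES = 1_000          # approximate size of the 'admin' folder
LARGE_SHARE = 0.75           # assumed share of files in ~1,000-file experiments
FILES_PER_LARGE_SET = 1_000  # typical files per experiment
FILES_PER_SMALL_SET = 100    # assumed grouping for the remaining files


def estimate_datasets(total=TOTAL_FILES, admin=ADMIN_FILES):
    """Return a rough breakdown of files into datasets."""
    research_files = total - admin                   # drop admin paperwork first
    large_files = round(research_files * LARGE_SHARE)
    small_files = research_files - large_files       # the remaining 25%
    large_sets = round(large_files / FILES_PER_LARGE_SET)
    small_sets = round(small_files / FILES_PER_SMALL_SET)
    return {
        "small_files": small_files,
        "large_sets": large_sets,
        "small_sets": small_sets,
        "total_sets": large_sets + small_sets,
    }


if __name__ == "__main__":
    print(estimate_datasets())
```

With these inputs the sketch gives 78 large datasets and 259 small ones, 337 in total, which is in the same ballpark as the "about 300" used in the rest of the post.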
Clearly I’m terrible at open data.
The problem is that while I love open data, 300 datasets is just too monumental a task. For each dataset I need the following information at a bare minimum: file name, authors, subject area, a metadata description of the file contents, funders and links to papers. The datasets I’ve already done had a big advantage, as they have papers describing their data, so the metadata was practically already written. Metadata is the worst part because it’s not a single word but an actual description of the data that puts it in context; you actually have to think rather than just copy and paste.
I timed myself for one dataset and it took me 15 mins. There are now 192 working days left in the project, and doing all 300 datasets will take me about 10 days (at 7.5 working hours per day) of solid data entry. Which… hang on, that’s not that much at all. That’s just 5% of the remaining project time or 1% of the total project time…
Well, in retrospect, I should have done this maths before starting this blog post, because that sounds like an entirely reasonable thing to ask of researchers. They actually want something that is not only practically possible, but the person-time required is essentially a rounding error on my grant. Damn. Well, there goes my excuse, better get uploading…
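A rough check of that sum, as a minimal Python sketch using only the numbers quoted above (300 datasets, 15 minutes each, 7.5-hour days, 192 working days left). The function name is hypothetical, and the single timed upload is assumed to be representative of all the others.

```python
# Rough upload-effort estimate from the figures in the post.
DATASETS = 300
MINUTES_PER_DATASET = 15   # from timing a single upload
HOURS_PER_DAY = 7.5
WORKING_DAYS_LEFT = 192


def upload_effort(datasets=DATASETS,
                  minutes_each=MINUTES_PER_DATASET,
                  hours_per_day=HOURS_PER_DAY,
                  days_left=WORKING_DAYS_LEFT):
    """Return (hours, working days, share of remaining project time)."""
    hours = datasets * minutes_each / 60
    days = hours / hours_per_day
    share = days / days_left
    return hours, days, share


if __name__ == "__main__":
    hours, days, share = upload_effort()
    print(f"{hours:.0f} hours = {days:.0f} working days "
          f"= {share:.1%} of the remaining time")
```

This comes out at 75 hours, or 10 working days of solid data entry, which is where the figure of roughly 5% of the remaining project time comes from.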
Terry Bucknell · 2 February 2017 at 08:54
Research Council people have told us it’s data underlying publications that must be made available. Making the rest of the data available is optional! But good on you for going the extra mile!
Georgina Parsons · 5 April 2017 at 16:48
Interesting, as that’s not what their policies say – e.g. “EPSRC-funded research data is a public good produced in the public interest and should be made freely and openly available with as few restrictions as possible in a timely and responsible manner”. Their clarifications go on to talk about the different cases of data underpinning publications versus other retained research data where there is no intention to publish on those data. I suspect the distinction is really about the practicalities of compliance checking – at the moment the focus is on checking the bare minimum of publishing data underpinning publications, but we’ll get to the next stage soon enough…