We have been doing some thinking about how to improve the research data management services we offer here at Lancaster. We’re keen to move away from the idea that research data management exists purely for compliance purposes – we want to really push open data and data reuse, and to develop the idea that the research data produced by the university are valuable assets. We know that researchers at the university are working on interesting, valuable and important work. Look at Derek Gatherer’s work on the Zika virus, or Maggie Mort’s project on disaster planning and children, or a host of other more specialised datasets supporting research right across the sciences and the humanities. Each dataset has its own context, background and requirements for it to be properly interpreted and understood.
Capturing high quality data means capturing high quality metadata: the structure which supports the data. The metadata explains the research data and supports discovery and (re)interpretation. Archivists are well used to supplying metadata for collections (or cataloguing, as it is more familiarly known!) and also know that the richest metadata is that supplied by the creator of the collection. This will be the person who knows most about the data, who fully understands the context and who can supply the additional information that will help with later reuse and re-interpretation of the data.
The ideal set-up would be one where each dataset came with full, rich descriptive metadata, with keywords taken from relevant subject-specific vocabularies, but the reality is always going to fall very far short of this.
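To make that ideal concrete, here is a rough sketch of what a rich, creator-supplied record might look like, written as a Python dictionary loosely modelled on the DataCite schema. The field names, values and the choice of LCSH as the controlled vocabulary are illustrative assumptions rather than anything mandated by our systems.

```python
# A sketch of a "rich" dataset record, loosely modelled on the DataCite
# metadata schema. All field names and values are illustrative assumptions.
rich_record = {
    "title": "Household interview transcripts, flood recovery study",
    "creators": [{"name": "Researcher, A.", "affiliation": "Lancaster University"}],
    "description": (
        "Anonymised transcripts of 24 semi-structured interviews. "
        "See README.txt for the interview schedule and coding scheme."
    ),
    "subjects": [
        # Keywords drawn from a subject-specific controlled vocabulary,
        # with the scheme recorded so machines can resolve the terms.
        {"term": "Disaster planning", "scheme": "LCSH"},
        {"term": "Floods", "scheme": "LCSH"},
    ],
    "geo_location": "Cumbria, UK",
    "temporal_coverage": "2014-01/2015-12",
    "formats": ["text/plain", "text/csv"],
    "rights": "CC-BY-4.0",
}
```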
Research data are often seen as something of a by-product of the research process, and this can reinforce the idea that action is only necessary because the research councils demand it, running the risk of creating a compliance culture.
The truth of the matter is that researchers have little spare time or resource to devote to creating detailed and complex descriptions of their data (often having already done so in the related published article). Capturing the data in a format likely to keep it accessible and reusable well into the future is harder still. From Art through to Women’s Studies via Engineering, Linguistics, Physics and Creative Writing, and everything in between, there is a dizzying array of software and file types supporting everything from spreadsheets to videos to models to graphs.
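As one hypothetical example of what a preservation-friendly format can mean in practice, the sketch below exports each sheet of a proprietary spreadsheet to plain CSV using the pandas library; the filenames are invented.

```python
# A minimal sketch of format normalisation: exporting each sheet of a
# proprietary spreadsheet to plain CSV, a format far more likely to remain
# readable in ten years' time. Filenames are hypothetical.
import pandas as pd  # requires pandas and openpyxl to be installed

sheets = pd.read_excel("survey_results.xlsx", sheet_name=None)  # all sheets
for name, frame in sheets.items():
    # One CSV per sheet keeps the data usable without the original software.
    frame.to_csv(f"survey_results_{name}.csv", index=False)
```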
To what extent might it be possible to expect, or even demand, rich metadata and standardised file formats? Current practice at data repositories varies widely. Some repositories are extremely prescriptive about what can be deposited. The UK Data Archive, for example, is a repository for “large collections of high quality data” in the Social Sciences. With a reputation for high quality, reliable data, the UK Data Archive is in a position to demand specific file formats and detailed metadata. Because of that reputation, researchers immediately see the value of investing time in producing data in the required format, and to some extent compete for the privilege of having their work deposited there. However, the majority of institutional repositories cater for the long tail of research – datasets which have no “natural” home and do not meet the requirements of repositories such as the UK Data Archive. This puts institutional repositories on the back foot: the starting position is that of the repository of last resort, so rather than researchers competing for the privilege of depositing, they use the repository as a filing cabinet to clear away the papers at the end of the project.
So what to do about this? Again, there are a variety of approaches, ranging from the prescriptive to the permissive. Some repositories – ourselves included – put no restrictions on the format of data and ask only for the minimum amount of metadata required by their institutional system (in our case Pure). We ask for keywords, geographic locations and covering dates, but these are not required fields. We place no restriction on the format of the digital files deposited, although we ask, where possible, for some explanatory notes to help future users of the data. We are, however, at the mercy of our depositors. The result can be anything from extremely rich, well-described datasets to ones where a lack of time and resources (and possibly engagement) leaves only scant metadata, producing datasets which are hard for others to interpret, especially where data managers have had to add metadata and descriptions later. At best we end up with uneven and patchy descriptions; at worst, data which are unusable by anyone other than the creator right from the outset.
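One small thing a data manager can do with such uneven deposits is flag the scant ones for follow-up. Here is a minimal sketch of that idea, assuming hypothetical field names based on the optional fields mentioned above rather than Pure’s actual schema.

```python
# A sketch of a completeness check a data manager might run over deposited
# records. The field names mirror the optional fields mentioned above
# (keywords, geographic location, covering dates, notes) but are
# assumptions, not Pure's actual schema.
OPTIONAL_FIELDS = ["keywords", "geo_location", "covering_dates", "notes"]

def completeness(record: dict) -> float:
    """Return the fraction of optional descriptive fields actually supplied."""
    supplied = sum(1 for field in OPTIONAL_FIELDS if record.get(field))
    return supplied / len(OPTIONAL_FIELDS)

# A sparse deposit scores poorly and can be flagged for follow-up.
sparse = {"title": "Model outputs", "files": ["run1.dat"]}
print(completeness(sparse))  # 0.0
```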
There are several improvements we can make. We should advocate and educate so that researchers understand the need for high quality data and metadata. We should be better at getting across why it is important to make data openly available for transparency and reuse.
We should also be looking at ways to automate data discovery. There are various interesting initiatives in this area, although they all depend on rich metadata to enable that kind of detailed analysis.
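Much of this automation is likely to build on standard repository interfaces such as OAI-PMH, the harvesting protocol most repository platforms expose. Below is a minimal sketch, with a placeholder endpoint, of how a harvester might pull Dublin Core titles – and a reminder that it can only ever be as good as the metadata supplied.

```python
# A sketch of automated discovery over OAI-PMH, the harvesting protocol most
# institutional repositories expose. The endpoint URL is a placeholder.
import urllib.request
import xml.etree.ElementTree as ET

ENDPOINT = "https://repository.example.ac.uk/oai"  # hypothetical endpoint
url = ENDPOINT + "?verb=ListRecords&metadataPrefix=oai_dc"

with urllib.request.urlopen(url) as response:
    tree = ET.parse(response)

DC = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core element namespace
for title in tree.iter(DC + "title"):
    print(title.text)  # the harvest is only as rich as the metadata deposited
```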
Each institution will find itself in a different position with regard to the level of engagement, but collaborative approaches will clearly work well, both in raising the profile of data management and in looking for shared solutions to data discovery and sharing. It will be interesting to see how the forthcoming JISC-sponsored shared Research Data Services project affects these issues. Hopefully it will promote more consistency and a stronger voice, especially for smaller institutions which don’t have the resources to develop a complex repository.
There is a lot happening right now in data management, with the emphasis on making data discoverable and reusable, and we are keen to be part of that conversation.