Here at Lancaster University we are very excited to be part of a group of pilot institutions taking part in Jisc’s Research data shared services project. This aims to provide a flexible range of services which suit the varied needs of institutions in the HE sector and help them achieve policy compliance for deposit, publication, discovery, storage and long term preservation of research data. It’s an ambitious project but one for which there is an undoubted need, and we are trying to work with Jisc to help them achieve this goal.
Last week we were invited down to Jisc London HQ to learn about the progress of the project and – just as importantly – share our own thoughts and experiences on the process.
Daniela Duca has written a comprehensive overview of the meeting and of the way forward for Jisc.
Our table represented a microcosm of the project: Cambridge University (large institution), ourselves at Lancaster (medium) and the Royal College of Music (small). We all have extremely different needs and resources and how one institution tackles a problem will not work at another. However we have a common purpose in supporting our academics and students in their research, ensuring compliance with funders and enabling our institutions to support first class research outputs to share with the wider world.
We had been asked to do some preparatory work around costing models for the meeting – I think it would be fair to say we all found this challenging – probably because it is! My previous knowledge of costings comes from having looked at the Curation Costs Exchange, which is an excellent starting point for anyone approaching the very difficult task of costing curation services.
My main interest in the day lay in the preservation aspects of the project especially in exploring wider use cases. It’s clear that many institutions have a number of digital preservation scenarios for which the Shared Service solution might also be applicable. What is also clear is that there are so many possible use cases that it would be very easy to accidentally create a whole new project without even trying! I think it’s fair to say that all of us in the room – whether we are actively involved in digital preservation or not – are very interested in this part of the project. There is no sense in Jisc replicating work which has already been done elsewhere or is being developed by other parties so it presents an ideal opportunity for collaborative working and building on the strengths of the existing digital preservation community.
Overall there was much food for thought and I look forward to the next development in the shared services project.
Well… it’s probably quite hard to get to the truth of the matter but here at Lancaster we are trying to find out what researchers really think. This is crucial for developing and improving our services and vital for delivering the service our researchers want.
We are one of the organisations taking part in the JISC RDM Shared Services pilot and you can read their take on the work being done here. With JISC’s help we undertook a researcher survey to find out a bit more about the kinds of research data which were being produced, how the data were (or weren’t) being managed and researcher attitudes towards their data.
Researchers were asked about the types of data which were generated from their research. The results were quite interesting to us. Unsurprisingly perhaps, far and away the most popular “type” of data was “document or report”, followed with a bit of a gap by spreadsheets. Structured text files (e.g. XML, JSON etc.) came a lot lower down the list, as did databases.
What interested us was comparing the kinds of files which researchers said they created during the research process with the kinds of files which were actually being deposited with us as research outputs. Obviously comparisons are problematic not least because our researchers were being asked about the data generated as part of their research activities rather than specifically those which were ultimately selected for permanent preservation. We also know that we only get a small proportion of the research data which are being created within the university and the respondents may include people who have not deposited data with us. Having analysed the research datasets which we have already we can see that a huge percentage were structured or unstructured text files and a much smaller proportion were spreadsheets or Word documents.
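For anyone wanting to run a similar comparison on their own holdings, here is a minimal sketch of one way to tally the file types in a folder of deposited datasets. It is not the method we used: the function names are made up for illustration, and it relies on file extensions, which are only a rough proxy for the actual format.

```python
from collections import Counter
from pathlib import Path


def tally_extensions(root: Path) -> Counter:
    """Count files under `root` (recursively) by lower-cased extension."""
    counts: Counter = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            # Files with no extension are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return counts


def as_percentages(counts: Counter) -> dict:
    """Turn raw counts into percentages of all files, most common first."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ext: round(100 * n / total, 1) for ext, n in counts.most_common()}
```

Calling `as_percentages(tally_extensions(Path("deposits")))` on a hypothetical `deposits` folder would give a quick picture to set alongside the survey answers, bearing in mind that an extension says nothing about what is really inside the file.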
Is it that our researchers have a false sense of the kinds of data which they are creating and using or is it that we as data curators have a poor understanding of the researcher community? I suspect that it is a bit of both but as data curators it is our duty to both have a good understanding of the data environment and also to be able to communicate to our research community. This is something we need to address as part of improving our advocacy and engagement strategies.
Another question which was asked was about sharing data and this got answers which did surprise us. The majority said that they did already share data and very few said they were not willing to share. For the ones who did not share data it was mostly because it was sensitive or confidential data or they did not have permission to share it. Of those who did share data the majority said it was for “the potential for others to re-use data” and because “research is a public good and should be open to all”. An encouraging third of those questioned said they had re-used someone else’s data.
Of course we know that the people who did answer our survey represent those who are in some way already engaged with the RDM process. We also know that people are likely to give the answers they want us to hear! But if people are serious about being willing and able to share we really want to support them in this.
So we’ve decided to try and get talking to our researchers – and for them to talk to each other – by setting up a series of Data Conversations – events where researchers can discuss creation and dissemination of data to try and encourage a climate of sharing and valuing the data. It means we can hope for data that is well curated from the start of its life and that will be selected for deposit appropriately and with good metadata.
Better communication and advocacy will help us in the long run to preserve high quality, relevant data which can be shared and reused. Managing (research) data and long term preservation of digital data are collaborative activities and the more we understand and share the better we will be at achieving these goals.
I attended the first Research Data Alliance workshop held in sunny Birmingham which was designed to bring together practitioners from across the UK to find out more about the work of the RDA. It was also a chance to see how we might be able to contribute to and benefit from what the organisation has to offer. Despite already being a member of the RDA Interest Group for Archives and Records Professionals, I confess to having been more of a casual observer than an active participant. So it was a brilliant opportunity to find out more about exactly what the Research Data Alliance is, how it works and what it hopes to achieve. Rachel Bruce from JISC introduced the event by outlining some of the ways in which JISC are working with the RDA across broad areas of Research Data Management and then handed over to Mark Parsons, the charismatic Secretary General of the RDA. Parsons is passionate about data, about connecting people and about creativity. He gave examples of technology “leapfrogging” and how local networks can come together to solve global issues. He used an illustration from the New York Magazine on how Willie Nelson is using local networks to take on corporate agricultural firms in the battle for the rising (legalised) marijuana market.
He also introduced ideas around how networks and connections lead to creativity and again referenced Anna Lowenhaupt Tsing’s Friction (this is the link if you’re lucky enough to be a Lancaster University person!) as well as Steve Johnson: “Chance favors the connected mind”:
That is how innovation happens…
The RDA, he explained, were absolutely not about a top-down framework but instead promoted a model of organic development: creating spaces for things to happen in. It was not, as Parsons explained, about thinking locally and acting globally but about doing local and global at the same time. The RDA has 75 Working and Interest Groups covering a very wide range of topics from the general right through to the extremely specific. There is no question that it is a complex network so we were invited to hear from a few of the Interest Groups: I chose Certification and Metadata, mostly because of their particular relevance to Digital Preservation.
The first session of the afternoon was on certification and first up was Lesley Rickards from the British Oceanographic Data Centre introducing the work of the Certification of Digital Repositories Interest Group. They are trying to map out Core Requirements for certifying repositories across the two main certification schemes for “trusted repositories”: World Data System (WDS) and the Data Seal of Approval (DSA). The two are different schemes using different concepts and methodologies which the RDA were keen to bring together. This they have successfully achieved with a Common Requirements document painstakingly mapping one onto the other and allowing for greater interoperability.
Next was Ingrid Dillo from the Data Archiving and Networked Service in the Netherlands who spoke about their experiences with obtaining certification – they went the whole hog and obtained Data Seal of Approval, World Data System certification and NestorSeal. DSA certification was A Lot of Work (approximately 250 staff hours) but nothing like as onerous as NestorSeal which took an eye popping 1500 person hours (if I recall correctly) which is something few repositories I imagine would be willing to contemplate. Interestingly DANS did not attempt ISO 16363. Certification is extremely important and Dillo pointed out the benefits of increased stakeholder trust and raising the profile of digital preservation in her organisation. She also felt the extra effort of attaining NestorSeal was worth it because it addressed some of the issues she felt were outstanding in the way they managed data. As for ISO 16363 it has a notoriously low take up and I wonder if too onerous a system coupled with limited resources means this situation does not change much in the near future.
The second session of the afternoon was on metadata, with Alex Ball of the Digital Curation Centre talking about the work of the RDA Metadata Standards Catalog Working Group whose initial aim was to make metadata standards easier to find and to advocate for their adoption. They hope that creating a more easily searchable catalogue of metadata standards will help with this. Sarah Jones (DCC) also introduced an enhancement to DMPOnline (a really useful tool we find!) which will make the addition of metadata easier and move towards Data Management Plans which are capable of being analysed by machines. This session also included a presentation from Dom Fripp of JISC on some of the ways in which they are trying to bring people together and be effective at using shared resources – don’t develop in isolation! He talked about JISC’s Research Data Discovery Service – a massive project which looks very exciting – and also some of the work of the RDA Interoperability Working Group.
My quote of the day was “You’ve got to grab [metadata] when it’s produced” (Dom Fripp). This is so true and needs to be factored in when developing workflows and planning advocacy strategies.
My take-aways from the day were: it’s good to collaborate. Connections and conversations lead to new ideas.
I was extremely lucky to attend iPres 2016, the International Digital Preservation conference, held this year in the beautiful Swiss capital city Bern.
The conference attracts some of the leading practitioners in the field so it’s a real privilege to be able to hear from and speak to people who are leading in research and development – creating tools, developing workflows and undertaking research into all aspects of digital management and preservation.
It will take a while to digest everything – there was so much to learn! – but I thought I would gather together some “highlights” of the session while still fresh in my mind.
The conference opened with a keynote from Bob Kahn who reflected on the need for interoperability and unique identifiers for digital objects. The world we live in is a networked one, and as we conceive of information and objects as linked to one another over networks, we must find ways of identifying them uniquely and unambiguously. This matters all the more when objects can exist anywhere and in several places at once.
To complement this I attended a workshop on persistent identifiers which gave an extremely helpful introduction to the world of URNs, URLs, PURLs, Handles, DOIs and the rest. Sometimes it can seem a little like acronym spaghetti but the presenters Jonathan Clark, Maurizio Lunghi, Remco van Veenendaal, Marcel Ras and Juha Hakala did their best to untangle it for us. Remco van Veenendaal introduced a great online tool from the National Archives of the Netherlands which aims to guide practitioners towards an informed choice about which identifier scheme to use. You can have a go at it here and the Netherlands Coalition for Digital Preservation are keen for feedback.
What is particularly useful about it is that it explains in some detail at each stage which PID system might be particularly good in specific circumstances, allowing for a nuanced approach to collections management.
Current persistent identifier systems do not cope well with complex digital objects and likely future developments will be around tackling these shortcomings. Sadly the current widely used systems have already developed along separate lines to the extent that they cannot be fully aligned – not the interoperable future we are all hoping for.
The second keynote came from Sabine Himmelsbach of the House of Electronic Art in Basel and was a lively and engaging account of a range of digital artworks and how digital preservation and curation has to work closely with artists to (re)create artworks. It threw up many philosophical questions about authenticity and integrity, not to mention the technical challenges of emulation and preservation of legacy formats. This was a theme returned to again and again in various sessions throughout the conference, as was the constant refrain of how the main challenges are not necessarily technological.
The conference had so many highlights it’s very hard to choose from amongst them. There were a number of papers looking specifically at the issues around the long term preservation of research data, which is of particular interest to the work we are undertaking at Lancaster University. There was a fascinating paper given by Austrian researchers from SBA Research and TU Wien (the Vienna University of Technology) looking specifically at the management of the so-called “long tail” of research data – that is, the wide variety of file formats spread over a relatively small number of files which characterises research data in particular, but which is also relevant to the management of legacy digital collections and digital art collections. This discussion was returned to by Jen Mitcham (University of York) and Steve Mackey (Arkivum) talking about preserving Research Data and also in my final workshop on file format identification. Jay Gattusso – nobly joining in at 4 am local time from New Zealand – talked about similar issues at the National Library of New Zealand involving legacy digital formats where there were only one or two examples.
One of the posters also captured this point perfectly – “Should We Keep Everything Forever?: Determining Long-Term Value of Research Data” from the team at the University of Illinois at Urbana-Champaign which looked at trying to create a methodology for assessing and appraising research data.
Plenty of food for thought there about how much effort we should put into preserving, how we prioritise and how we appraise our collections.
The final keynote was from Dr David Bosshart of the Gottlieb Duttweiler Institute – a provocative take on the move from an industrial to a digital age. He had a very particular view of the future which caused a bit of a mini-twitter storm from those who felt that his view was very narrow; after all more than half the world is not online. Whilst his paper was no doubt deliberately designed to create debate, it highlighted the issues about where we direct our future developments and what our ultimate goals are. This is common to all archives/preservation strategies: whose stories are we preserving? and how are we capturing complex narratives? This issue was revisited later in a workshop on personal digital archiving. Preservation can only happen where information is captured in the first place. It can be about educating and empowering people to capture and present their own narratives.
There is still a lot for me to think about from such a varied and interesting conference. There was very little time for leisure but there were wonderful evening events which the conference organisers arranged – a drinks reception at the National Library of Switzerland and a conference dinner at the impressive fifteenth century Rathaus. There are lots of conference photos online which give a flavour of the event.
And speaking of flavours I couldn’t visit Switzerland and not try a fondue…. Delicious!
I was delighted recently to welcome colleagues from across the UK to Lancaster University for an Archivematica UK User Group meeting. It was the hottest day of September here in Lancaster and while the campus did look lovely I did recommend our wonderful campus ice cream shop to help cool down.
Archivematica UK User Group is an informal group made up of people considering, testing or using Archivematica, a digital preservation system. Those who attended are at all different stages of development and have a wide range of collections that they manage. What unites us all is a desire to tackle digital preservation as best we can with the resources we have available and to share experiences with others in the digital preservation community.
What Archivematica is: an open-source digital preservation system.
What Archivematica is not: a magic bullet that will solve all your digital preservation needs.
It relies very much on community input and can be implemented in a wide range of environments. It’s very important for all of us to be able to share experiences and what we’ve been up to so we can all move forward with tackling digital preservation.
My colleague Dr Adrian Albin-Clark and I were up first talking about our experiments with using Archivematica for the preservation of research data. RCUK stipulate that research data should be preserved and made available for at least 10 years after last use – or forever in practical terms! We’ve been focusing on how to get our institutional Research Information System (in our case Pure) to work with Archivematica and you can see details of Adrian’s work here.
Jasmin Boehmer, a student at Aberystwyth University, gave us an insight into her research around metadata and rights management and we were excited to learn that her dissertation should be available soon via Aberystwyth University’s library catalogue. A lot of digital preservation development work requires time, a resource few of us can spare, so it’s fantastic that there are people out there undertaking detailed research and then sharing the results.
Jake Henry reported the latest news from the National Library of Wales where they are working at integrating Archivematica with their local digital repository and providing long term preservation and access to a wide range of digital content. They are also supporting a national programme of managing digital archives which involves remote deposit and management of digital collections from across Wales. It looked complicated and ambitious but we really look forward to hearing more from this project. We especially liked their local instance of Archivematica – Archwfmatica; wonder if there are plans for a Welsh language version…
Final update from the morning was Kirsty Lee from the University of Edinburgh who was the envy of all having attended Archivematicamp (Un)conference in Michigan – there’s a report here from some other attendees but it sounded like a really useful event where it was as much about attendees learning about Archivematica development as users giving feedback on what they wanted. We are really hoping there might be an Archivematicamp UK at some point – watch this space!
After lunch Jen Mitcham and Julie Allinson from the University of York gave an update from their “Filling the Digital Preservation Gap” project – part of the JISC-funded Research Data Spring project. This was the official last day of the project but not the end of the work! There will still be more blog posts to read here and the work they have already produced is extremely well documented and shared. I personally have found it of great practical use and also inspirational not least in terms of file format identification work which was the general discussion topic for the afternoon. We were invited to think about the “problem” of unidentified file formats – which is especially acute for those who work with very diverse research data outputs. There was some discussion about how we might contribute to file format identification as a community and balancing up the widely different needs of different institutions. For some it would be worthwhile and important to invest time and resource in id-ing particular file types but for others a more basic level of preservation has to be the priority with id-ing work coming second. I do now wonder if it might be useful to try and do more advocacy work in this area to try and get the creators of the data to help map and describe formats.
Next up was John Kaye who is leading on JISC’s research and development strand around managing research data and digital preservation. There’s a lot happening here particularly around the issue of shared services and it was useful for sharing experiences of managing and preserving research data and a good opportunity to hear what support JISC is planning to offer in these areas.
Heather Roberts from the Royal Northern College of Music was up next asking for advice on taking first steps with Archivematica and there were some very good tips from the floor – someone very sensibly said “be very clear about what you want to achieve”. Archivematica has the functionality to do lots of different things and in different ways but as I said earlier, it’s not a magic bullet, so it’s a good idea to have a clear idea of the workflow and infrastructure to fit it into. This is great advice for anyone looking at systems and tools to help solve problems!
Finally we were delighted to have a Skype call with Sarah Romkey from Artefactual Systems, the company who produce Archivematica, who updated us on the news from the company and what developments are in the pipeline. It was also a chance for us to add our questions and comments. I was just relieved that the technology all worked to enable the conversation!
It was a very packed day and as ever it felt like there wasn’t enough time to speak to everyone who was there. It was hot and there was a bit of noise from the exciting new building developments which are taking place outside the library (!) but I hope everyone enjoyed visiting as much as I enjoyed hosting. I have various avenues from this which I intend to chase up and am looking forward to the next Archivematica UK User Group meeting in the new year.
We have been doing some thinking around how to improve the research data management services we offer here at Lancaster. We’re keen to move away from the idea of the role of research data management as purely for compliance purposes – we want to really push the idea of open data and data reuse and develop the idea that the research data produced by the university are valuable assets. We know that researchers at the university are working on interesting, valuable and important work. Look at Derek Gatherer’s work on the Zika virus or Maggie Mort’s project looking at disaster planning and children and a host of other more specialized datasets supporting research right across the sciences and the humanities. Each dataset will have its own context, background and requirements for it to be properly interpreted and understood.
Capturing high quality data means capturing high quality metadata; the structure which supports the data. The metadata explains the research data and supports discovery and (re)interpretation. Archivists are well used to supplying metadata for collections (or cataloguing it as it is more familiarly known!) and also know that the richest metadata is that which is supplied by the creator of the collection. This will be the person who knows most about the data, who fully understands the context and can supply additional information which will help with the later re-use and re-interpretation of the data.
The ideal set up would be one where each dataset came with full and rich descriptive metadata with keywords taken from relevant subject specific vocabularies but the reality is always going to fall very far short of this.
Research data is often seen as something of a by-product of the research process and this can reinforce the idea that action is only necessary because the research councils demand it, running the risk of creating a compliance culture.
The truth of the matter is that researchers have little spare time or resource to devote to creating detailed and complex descriptions of their data (often having done so in the related published article). Even worse is when it comes to capturing the data in a format which is likely to promote its chances of being accessible and reusable well into the future. From Art through to Women’s Studies via Engineering, Linguistics, Physics and Creative Writing and everything in between there is a dizzying array of software and file types supporting everything from spreadsheets, to videos, to models to graphs.
To what extent might it be possible to expect and demand rich metadata and standardised file formats? In terms of current practices at data repositories there is wide variety. Some repositories are extremely prescriptive about what can be deposited. One example is the UK Data Archive, which is a repository for “large collections of high quality data” for the Social Sciences. With a reputation for high quality reliable data the UK Data Archive service is in a position to demand specific file formats and detailed metadata. Because it has a high institutional reputation, researchers immediately see the value of investing time in producing data in the format required and to some extent compete for the privilege of having work deposited in this repository. However the majority of institutional repositories are catering for the long tail of research – datasets which have no “natural” home and do not meet the requirements of repositories such as the UK Data Archive. This puts institutional repositories on the back foot – the starting position is that of the repository of last resort, so rather than researchers competing for the privilege of depositing they are using the repository as a filing cabinet to clear away the papers at the end of the project.
So what to do about this? Again there are a variety of approaches which range from the prescriptive to the permissive. Some repositories – ourselves included in this – put no restrictions on the format of data and ask for the minimum amount of metadata as required by their institutional system (in our case Pure). We ask for keywords, geographic locations and covering dates but these are not required fields. We make no restriction on the format of the digital files deposited although we ask, where possible, for some explanatory notes to help future users of the data. We are, however, at the mercy of our depositors. This can mean anything from extremely rich and well described datasets to ones where lack of time and resources (and possibly engagement) provide scant metadata and risk having datasets which are hard for others to interpret, especially where data managers have had to add in metadata and descriptions later. At best we end up with uneven and patchy descriptions and at worst data which are unusable by anyone other than the creator right from the outset.
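To make the idea of “uneven and patchy descriptions” a little more concrete, here is a minimal sketch of how one might measure how often the optional fields are actually filled in. It assumes dataset records have been exported as plain Python dictionaries; the field names are made up for illustration and are not Pure’s own schema or API.

```python
# Hypothetical field names for the optional metadata we ask depositors for.
OPTIONAL_FIELDS = (
    "keywords",
    "geographic_locations",
    "covering_dates",
    "explanatory_notes",
)


def missing_fields(record: dict) -> list:
    """List the optional fields that are absent or empty in one record."""
    return [field for field in OPTIONAL_FIELDS if not record.get(field)]


def completeness(records: list) -> dict:
    """For each optional field, the share of records (0 to 1) that fill it in."""
    if not records:
        return {}
    return {
        field: sum(1 for r in records if r.get(field)) / len(records)
        for field in OPTIONAL_FIELDS
    }
```

A report like this only says whether a field was filled in, not whether what was entered is any good, so it is a starting point for conversations with depositors rather than a measure of quality.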
There are several improvements we can make. We should advocate and educate so that researchers understand the need for high quality data and metadata. We should be better at getting across the message of why it is important to make data openly available for transparency and reuse.
We should also be looking at ways to refine the automation of data discovery and there are various interesting initiatives around although they would require rich metadata to allow for this kind of detailed analysis.
Each institution will find itself in a different position with regard to the level of engagement but clearly collaborative approaches will work well both in raising the profile of data management and also in looking for shared solutions to data discovery and sharing. It will be interesting to see how the forthcoming JISC-sponsored project for shared Research Data Services will affect these current issues. Hopefully it will promote more consistency and a stronger voice, especially for smaller institutions who don’t have the resources to develop a complex repository.
There is a lot happening right now in data management with the emphasis on making it discoverable and reusable and we are keen to be a part of that conversation.