We had our third Data Conversation here at Lancaster University again with the aim of bringing together researchers to share their data stories and discuss issues and exchange ideas in a friendly and informal setting.
We all had plenty of time to eat pizza and crisps before Neil invited us all to consider reproducibility and sustainability in relation to software. Neil has a very clear and engaging style which really helped us, the audience, navigate around the complex issues of managing software. He asked us all to imagine returning to our work in three months’ time – would it make sense? Would it still work? He also addressed some of the complex issues around versioning, authorship and sharing software.
The second half of the afternoon followed the more traditional Data Conversations route of short lightning talks given by Lancaster University researchers.
First up was Barry Rowlingson (Lancaster Medical School) talking about the benefits of using GitLab for developing, sharing and keeping software safe.
Barry Rowlingson weighs up the benefits of GitLab over GitHub…
Next was Kristoffer Geyer (Psychology) talking about the innovative and challenging uses of smartphone data for investigating behaviour and in particular the issues of capturing the data from external and ever-changing software. Kris mentioned how the recent update of Android (to Oreo) makes retrieving relevant data more difficult – a flexible approach is definitely what is needed.
Then we heard from Andrew Moore (School of Computing and Communications) who returned to the theme of sharing software, looking at some of the barriers and opportunities which present themselves. Andrew argued passionately that we need more resources for software sharing (such as specialist Research Software Engineers) but also that researchers need to shift their attitudes towards sharing their code.
Our final speaker was the Library’s own Stephen Robinson (Library Developer) talking about using containers as a method of software preservation. This provoked quite some debate – which is exactly what we want to encourage at these events!
We think these kinds of conversations are a great way of getting people to share good ideas and good practice around data management and we look forward to the next Data Conversations in January 2018!
This blog post was co-authored by Rachel MacGregor and Hardy Schwamm.
It was fantastic to see PASIG 2017 (Preservation and Archives Special Interest Group) come to Oxford this year which meant I had the privilege of attending this prestigious international conference in the beautiful surroundings of Oxford’s Natural History Museum. All slides and presentations are available here.
The first day was advertised as Bootcamp Day so that everyone could be up-to-speed with the basics. And I thought: “do I know everything about Digital Preservation?” and the answer was “No” so I decided to come along to see what I could learn. The answer was: quite a lot. There was some excellent advice on offer from Sharon McMeekin of the Digital Preservation Coalition and Stephanie Taylor of CoSector who both have a huge amount of experience in delivering and supporting digital preservation training. Adrian Brown (UK Parliament) gave us a lightning tour of relevant standards – what they are and why they are important. It was so whistle stop that I think we were all glad that the slides of all the presentations are available – this was definitely one to go back to.
The afternoon kicked off with “What I wish I knew before I started” and again responses to these have been summarised in some fantastic notes made collaboratively but especially by Erwin Verbruggen (Netherlands Institute for Sound and Vision) and David Underdown (UK National Archives). One of the pieces of advice I liked the most came from Tim Gollins (National Records of Scotland) who suggested that inspiration for solutions does not always come from experts or even from within the field – it’s an invitation to think broadly and get ideas, inspiration and solutions from far and wide. Otherwise we will never innovate or move on from current practices or ways of thinking.
There was much food for thought from the British Library team who are dealing with all sorts of complex format features. The line between book and game and book and artwork is often blurred. They used the example of Nosy Crow’s Goldilocks and Little Bear – is it a book, an app, a game or all three? And then there is Tea Uglow’s A Universe Explodes, a blockchain book, designed to be ephemeral and changing. In this it has many things in common with time-based artworks which institutions such as the Tate, MOMA and many others are grappling with preserving.
The conference dinner was held at the beautiful Wadham College and it was great again to have the opportunity to meet new people in fantastic surroundings. I really liked what Wadham College had done with their Changing Faces commission – four brilliant portraits of Wadham women.
The conference proper began on Day Two and over the course of the two days there were lots of interesting presentations which it would be impossible to summarise here. One highlight was John Sheridan’s engaging and thought-provoking talk on disrupting the archive, which mapped the transition from paper archive to digital not just in a literal sense but also in the sense of our ways of thinking. Paper-based archival practices rely on hierarchies and order – this does not work so well with digital content. We probably also need to be thinking more like this:
and less like this:
for our digital archives.
Eduardo del Valle of the University of the Balearic Islands gave his Digital Fail story – a really important example of how sharing failures can be as important as sharing successes – in his case they learnt key lessons and can move on from this and hopefully prevent others from making the same mistakes. Catherine Taylor of Waddesdon Manor also bravely shared the shared drive – there was a nervous giggle from an audience made up of people who all work with similarly idiosyncratically arranged shared drives… In both cases acquiring tools and applying technical solutions was only half of the work (or possibly not even half); it’s the implementation of the entire system (made up of a range of different parts) which is the difficult part to get right.
As a counterpoint to John Sheridan’s theory we had the extremely practical and important presentation from Angeline Takawira of the United Nations Mechanism for Criminal Tribunals who explained that preserving and managing archives is a core part of the function of the organisation. Access for an extremely broad range of stakeholders is key. Some of the stakeholders live in parts of Rwanda where internet access is usually wifi onto mobile devices – this is an important part of considerations of how to make material available.
Alongside Angeline Takawira’s presentation Pat Sleeman of the UN Refugee Agency packed a powerful punch with her description of archives and records management in the field when coping with the biggest humanitarian crisis in the history of the organisation. How do you put together a business case for spending on digital preservation when the organisation needs to spend money on feeding starving babies? And even Twitter, which had been lively during the course of the conference at the hashtag #PASIG17, fell silent at the testimony of Emi Mahmoud, which exemplifies the importance of preserving the voices and stories of refugees and displaced persons.
I came away with a lot to think about and also a lot to do. What can we do (if anything) to help with some of the tasks faced by the digital preservation community as a whole? The answer is we can share the work we are doing – success or failure – and all learn that it is through a combination of tools, processes and skills which come from right across the board of IT, archives, librarians, data scientists and beyond that we can help preserve what needs to be preserved.
We were very excited to be visiting the lovely city of York for the Digital Preservation Coalition’s event “From Planning to Deployment: Digital Preservation and Organizational Change”. The day promised a mixture of case studies from organisations who have implemented or are in the process of implementing a digital preservation programme and also a chance for Jisc to showcase some of the work they have been sponsoring as part of the Research Data Shared Services project (which we are a pilot institution for). It was a varied programme and the audience was very mixed – one of the big benefits of attending events like these is the opportunity to speak to colleagues from other institutions in related but different roles. I spoke to some Records Managers and was interested in their perspective as active managers of current data. I’m a big believer in promoting digital preservation through involvement at all stages of the data lifecycle (or records continuum if you prefer) so it is important that as many people as possible – whatever their role in the creation or management of data – are encouraged into good data management practices. This might be by encouraging scientists to adopt the FAIR principles or by Records Managers advising on file formats, file naming and structures and so on.
The first half of the day was a series of case studies presented by various institutions, large and small, who had a whole range of experiences to share. It was introduced by a presentation from the Polonsky Digital Preservation Project based at Oxford and Cambridge Universities. Lee Pretlove and Sarah Mason jointly led the conversation talking us through the challenges of developing and delivering a digital preservation project which has to continue beyond the life of the project. Both Universities represented in this project are very large organisations but this can make the issues faced by the team extremely complex and challenging. They have been recording their experiences of trying to embed practices from the project so that digital preservation can become part of a sustainable programme.
The first case study came from Jen Mitcham from York University talking about the digital preservation work they have undertaken there. Jen has documented her activities very helpfully and consistently on her blog and she talked specifically about the amount of planning which needs to go into work and then the very real difficulties in implementation. She has most recently been looking at digital preservation for research data – something we are working on here at Lancaster University.
Next up was Louisa Matthews from the Archaeology Data Service who have been spearheading approaches to Digital Preservation for a very long time. The act of excavating a site is by its nature destructive so it is vital to be able to capture data about it accurately and be able to return to and reuse the data for the foreseeable future. This captures digital preservation in a nutshell! Louisa described how engaging with their contributors ensures high quality re-usable data – something we are all aiming for.
The final case study for the morning was Rebecca Short from the University of Westminster talking about digital preservation and records management. The university have already had success implementing a digital preservation workflow and are now seeking to embed it further in the whole records creation and management process. Rebecca described the very complex information environment at her university – relatively small in comparison to the earlier presentations but no less challenging for all that.
The afternoon was a useful opportunity to hear from Jisc about their Research Data Shared Services project which we are a pilot for. We heard presentations from Arkivum, Preservica and Artefactual Systems who are all vendors taking part in the project and gave interesting and useful perspectives on their approaches to digital preservation issues. The overwhelming message however has to be – you can’t buy a product which will do digital preservation. Different products and services can help you with it, but as William Kilbride, Executive Director of the Digital Preservation Coalition has so neatly put it “digital preservation is a human project” and we should be focussing on getting people to engage with the issues and for all of us to be doing digital preservation.
Already our third Data Interview! This time with Dr Jude Towers. Jude is Lecturer in Sociology and Quantitative Methods and the Associate Director of the Violence and Society UNESCO Centre. She holds Graduate Statistician status from the Royal Statistical Society, is an Accredited Researcher through the ONS Approved Researcher Scheme, and is level 3 vetted by Lancashire Constabulary. Her current research is focused on the measurement of violence. Jude also presented at the first Data Conversations.
Then we comply with the Home Office and ONS [Office for National Statistics] recommendations about the sizes of cells for publication. They say there should be a minimum of 50 respondents in a cell before it’s statistically analysed. You must ensure that if you’re doing cross tabulations, for example, the numbers are sufficient that you couldn’t identify individual respondents. That is relatively straightforward and I would say that’s general good practice in dealing with that kind of data.
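As a rough illustration of the cell-size rule Jude describes, here is a minimal Python sketch that cross-tabulates respondents and hides any cell that falls below a threshold. The threshold of 50 is the figure quoted in the interview; the function name, field names and sample records are hypothetical, and this is not the Home Office or ONS procedure itself.

```python
from collections import Counter

MIN_CELL_SIZE = 50  # figure quoted in the interview; adjust to the guidance you follow


def suppressed_crosstab(records, row_key, col_key, min_cell=MIN_CELL_SIZE):
    """Count respondents per (row, column) cell and hide cells below min_cell.

    records is a list of dicts; row_key and col_key are hypothetical field names.
    Cells with too few respondents are returned as None instead of a count.
    """
    counts = Counter((r[row_key], r[col_key]) for r in records)
    return {cell: (n if n >= min_cell else None) for cell, n in counts.items()}


# Hypothetical data: 60 respondents in one cell, 3 in another.
sample = (
    [{"sex": "F", "region": "North"}] * 60
    + [{"sex": "M", "region": "North"}] * 3
)
table = suppressed_crosstab(sample, "sex", "region")
```

In this sketch the cell with 60 respondents keeps its count and the cell with 3 is replaced by None. Real disclosure control usually goes further, for example checking that a hidden cell cannot be worked out from row or column totals.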
We have also used the Intimate Violence module, which is a self-complete module as part of the Crime Survey. For that there is a special level of access which requires training from what used to be the Administrative Data Liaison Service. That was a one day training course in London, signing of lots of different agreements. Then you access that data through your desktop computer, it has to be a static IP address, and everything is held on their server. You go into their server, you can’t bring anything out, and everything you do has to be done in there.
That means if you want to write a journal article using that data you have to write it inside their server. Anything that you produce using that data, whether it’s a presentation in PowerPoint, a table in a slide, all of that has to have approval from the UK Data Service before it can come off the server into any form of public domain. That has to be done each time you use it. It is quite onerous in some ways but is a very high level of security.
Q: That data is already in an archive so there is no need to share it again. Is citing that data straightforward in case somebody wants to see the data that you used?
Jude: Yes, it’s straightforward to cite. If people want to have access to the raw data they’d have to be accredited in the same way I got accredited. We got the whole team accredited at the same time so we can share data as we produced the work. There is nobody in our team who isn’t accredited. There is no problem …. we can sit in front of the computer and look at that data as we’re trying to develop the work.
Q: So if I were to look at your screen here to view the data I’d have to have the accreditation.
Jude: Yes! Actually it’s interesting that some of these requirements are similar to the ones for police data.
We are doing a lot of work with Lancashire Constabulary. We as a team have just been vetted to Level 3 which gives us the same access as any serving police officer. We have direct access to raw data at the individual level. This is for two reasons. One is that you can ask for data that the police put together, anonymise and give you but if you don’t know what data there is, it is really difficult to know what to ask for. And the second reason is being able to explore the data at that level means that you can make links that you couldn’t otherwise make. You can find individual people in different datasets, which allows you to ask much more complex research questions and then anonymise and take it out as a dataset.
That’s been quite an interesting process. First of all, you have to be vetted. Then you get your police access card. Rather than it being on a secure server what we have now got is police laptops. We access the police server through that police laptop. Again, you can’t take anything out until it is anonymised. The keyboard on the laptop records every keystroke so someone can exactly see who you have looked for and why you have looked for them.
Then the requirements that are similar to some of the Home Office ones which are being in a locked office without public access so someone can’t look at what you’re doing over your shoulder whilst you’re doing it. I couldn’t take my police laptop and work in the Library. You can’t work on it in public spaces.
That’s quite interesting because we just got two ESRC Studentships with Lancashire Constabulary and they will do the same. They go through Level 3 vetting and they’ll have the police laptops. But then we came across the problem where do we put them? They can’t go in an office with other PhD students who are not vetted. They are at different stages in their PhD. So actually, what we’ve had to do were quite specific arrangements so that those students share a room that’s locked. You can’t have someone else in the room who is not vetted!
Q: Is it more difficult in this case to cite data because the data is not in an archive like the UK Data Archive?
Jude: What we haven’t yet done in any official capacity, but we’ve had discussions. The Crime Survey data people can access. What we have done in some of the cases where we have produced new data we’ve done data tables and can release those. So people can see the data we use, completely anonymized, aggregated to a very high level. If people want the raw data they can get accredited or they can go to the UK Data Service. If people just want to re-run our statistical tests then the “semi-raw data” if you like is there.
Q: Is that what you could do with the police dataset?
Jude: That is the conversation we are currently having with the police: Is there any point at which that data can be released into the public domain? We haven’t yet made agreements about that. I think what we’ll end up doing will be very interesting. There are very few researchers who are doing it in this way. Most people get given anonymised data that the police have anonymised themselves.
So we are doing a series of test cases saying that as we increasingly aggregate and anonymise the data at what level can that data be put into the public domain and at what level is it useful? We’ll have to see if we can find a place that matches where it is still useful and it can go public. If we are able to do that then we’ll put it into archives.
Q: That is really interesting!
Jude: Yes, but it is very clear in the ESRC Studentships that the police have the final say on that.
Q: Do the police have a level of expertise and confidence in providing data and working with you? Does that work well?
Jude: It does work well. The police are in a really interesting position. They [are] systematically, some more quickly than others… [nationally] moving to evidence based policing and significantly improving their research capacity. At the moment they are doing that in two ways. One is by working closely with universities and the other is by more systematically training police officers and associate staff.
I am doing a lot of work with Leeds University on data analytics for the police and we are setting up CPD [Continuing Professional Development] for data analysts in the police to have a more systemic and academic approach to research questions. Now that’s really interesting because the position they are in in their organisation tends to be relatively low but some of the things they are asked are just impossible.
So we are trying to give them the tools to say you can’t ask me for this when you don’t collect it. Or you want me to evaluate something but nobody told me it was happening so there is no data from before. We’re getting them to think through the research process in order to influence how data analytics are used inside the police. It is interesting because there is a bit of a debate about whether they really need data analysts or they can spend their money buying really good algorithms [which] will sort all this stuff out. Our argument is that you need really good data analysts because you need them to explicate the inherent theories that people have, that they’re trying to test, that they can talk people through that research process.
In Lancashire Police those things are coming together. They are much more actively working with academics and they are much more systemically embedding academic research processes inside the institution. They have a Futures team that includes multiple PhDs, M.A.s and now even some undergraduate students. They have a list of research questions that they are interested in as an institution, and they are actively going out looking for people who do that research for them and to sit inside the police while they do it.
Q: That is really fascinating! Is there anything Lancaster University could do to help you or your colleagues with your research? Or does the set up work for you?
Jude: I think it’s OK. The sticky parts are things we are working through for example around contracts. Who owns the Intellectual Property? Who gets final say over publications? We’ve been lucky so far that we’ve negotiated things but I know in other areas these have been problematic: getting clarity and setting up protocols is useful.
There’s been some talk about setting up secure data hubs and I’m in two minds about it. I think in some ways they’d be really useful but I think in other ways they are perhaps a bit inflexible. My colleague across the corridor is doing the same as us with social work data and they’ve done what we have done. They accredited the individuals and have given them a specific laptop to access that data directly, and that works really well.
This is our second Data Interview. This time we were glad to have a chat with Dr Jo Knight.
Jo is a Reader within the CHICAS research group, Research Director in the Lancaster Medical School and theme lead for Health within Lancaster’s Data Science Institute. Jo has experience in developing new methods for analysing genetic data as well as experience in applying known techniques to a large variety of datasets.
Q: Jo, when you talked at our recent Data Management event about a “positive” data management story and a “negative” story there was a lot of interest in that, so we thought we could use this in our next Data Interview. Which story would you like to start with?
Jo: I think it would be good to start with a negative one so I can end on a positive note. And chronologically that is how it occurred.
So the negative story relates to an early time in my career. I had some genetic data on a number of individuals, about 120. I did some statistical analysis of the data. I noticed that some of the patterns that I had in my analysis seemed unusual. They weren’t characteristic of the type of patterns you would expect given that the individuals in this sample were supposed to be siblings. I didn’t have enough genetic information to establish their relationships completely but I did have enough to see that overall patterns didn’t look how I expected them to.
I took the data to someone more experienced and said: “There is something wrong with the patterns here”, and he said “Yep, there is definitely something wrong. Those individuals clearly aren’t related to each other.”
At that time, given the technologies that were available, we couldn’t just get more data to determine the relationships. We had to throw all of that data away!
It was essentially because the data and the samples had not been linked and managed. At some point between labelling the samples, entering the labels into a database and recording the relationships and the rest of the information about the individuals something had gone wrong. So the data management had gone wrong and these samples were now completely useless. As well as the loss of my time we couldn’t use these samples for any other work either. They no longer had the data provenance.
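The kind of failure Jo describes can sometimes be caught early with a simple consistency check between the labels on the samples and the records in the database. The Python sketch below is a hypothetical illustration only: the sample IDs and the function name are invented, and it does not describe the process used in the study.

```python
def check_sample_linkage(tube_labels, database_ids):
    """Compare sample labels read from tubes with IDs recorded in the database.

    Returns labels with no database record, records with no physical sample,
    and labels that appear more than once. All names here are hypothetical.
    """
    seen = set()
    duplicates = set()
    for label in tube_labels:
        if label in seen:
            duplicates.add(label)
        seen.add(label)
    db = set(database_ids)
    return {
        "unrecorded_samples": sorted(seen - db),
        "missing_samples": sorted(db - seen),
        "duplicate_labels": sorted(duplicates),
    }


# Hypothetical example: one duplicated label, one unrecorded tube, one missing tube.
report = check_sample_linkage(
    tube_labels=["S001", "S002", "S002", "S004"],
    database_ids=["S001", "S002", "S003"],
)
```

A check like this cannot prove that a label is attached to the right person, but running it at each hand-off (labelling, data entry, recording relationships) would at least flag labels that are missing, duplicated or unrecorded.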
Q: Can you quantify how much time you invested in that project?
Jo: It’s hard to remember but for me it would have been months of work to interrogate the samples! It would also have cost a fair amount in reagents. And for the person that collected the data probably up to a year’s work getting all the DNA samples from the individuals. Furthermore those individuals had given samples for medical research that could not be undertaken.
Q: That is a rather sad story.
Jo: Yes, it is.
Q: Now the positive story. What happened?
Jo: I’m involved in a Consortium now, the Psychiatric Genomics Consortium, and in this Consortium over 800 researchers from 38 countries have come together and worked really very hard through ethical approvals, data procedures, data collection and data pooling in order to collate samples.
And they have been able to collect data that is now published, actually a couple of years ago in 2014, on more than 35,000 schizophrenic cases and even more control samples than that. And through the good and appropriate management of data it has meant that we were able to identify 108 genetic risk loci for schizophrenia. It has enabled us to move the field forward in terms of beginning to understand the genetic contribution to schizophrenia.
For a long time we knew that schizophrenia has a genetic component but we were unable to pinpoint very many of the risk variants at all, and this study was a real landmark in identifying a large number of the risk variants involved in the disorder. Lots more work needs to be done! What is really exciting about the Consortium is that the original paper is just the tip of the iceberg. That was the paper where the first analysis was done but the data is now held and managed in a manner that researchers who work in psychiatric genetics are able to access that data, analyse that data and answer lots of different questions about the genetic predisposition to schizophrenia.
The Psychiatric Genomics Consortium holds data on lots of other disorders as well. Basically, the appropriate management of that data means we are able to learn a lot more about diseases than we would have if people hadn’t got together and as a large group effectively managed the data.
Q: What is the key step in doing this?
Jo: It’s a willingness to share data and to see the bigger scientific question that can be answered if you share the data, and not just try to hold onto it and answer your own smaller questions. It is a willingness to put considerable amounts of time into data management. So there are lots of people including myself that have informal unpaid roles in managing that data to make it accessible.
Q: What can we as an institution do to encourage that willingness to share data?
Jo: I think Lancaster University as an institution has a very strong positive view of collaborative research across the Faculties and beyond the University. And that’s the kind of thing that does encourage people to share data and be involved in these projects. I think that is something we need to continue to pursue. And also the support systems that we have in place, the people and systems that help us to deposit data and make it available.
Thanks very much for the interview Jo!
You can find out more about Jo and her research here. The full reference of the article on schizophrenia mentioned by Jo is:
Schizophrenia Working Group of the Psychiatric Genomics Consortium (2014). “Biological insights from 108 schizophrenia-associated genetic loci.” Nature, 24 July, 511 (7510): 421-7. doi:10.1038/nature13595
You can find a short summary of the event, the slides and some photos below.
Denes Csala – The sensor cloud around us: collecting, mining and visualizing the energy and building management data of the campus
Dr Denes Csala is a newly appointed lecturer in Energy Storage Systems Dynamics with Energy Lancaster.
There are 30,000 sensors on campus capturing all sorts of data about energy and energy consumption. This has the potential for us to understand a huge amount about the way energy is managed and used but at the same time throws up the issue of managing extremely sensitive commercial and personal data. Access to the data is strictly controlled but Energy Lancaster are very excited about the possibilities of what could be done with the data.
You can see an animated visualization of the campus energy metering system sensor data here:
Kopo Ramokapane – Cloud computing: When is Deletion Deletion
Kopo reported that when you delete data in the cloud there is no way to be sure that all copies or all versions have been deleted from the cloud provider. This issue isn’t new but doesn’t get as much attention as it should. Because of the way Cloud storage operates it is almost impossible even for the service providers to be certain that all the data has been deleted. Avoid storing confidential data in the Cloud and learn more about how the systems work! Lancaster University has a contract with cloud service Box which ensures that compliance issues are dealt with in relation to storage of confidential or sensitive data.
Karen Broadhurst and Stuart Bedston – Better data for better justice: Towards data-driven analyses of Family Court policy and practice
Professor Karen Broadhurst and Stuart Bedston from the Sociology Department reported on concerns about transparency in family court decision-making. Greater transparency and “open data” would have a positive impact in many ways but is hard to achieve given the security requirements and potential risks.
Karen and Stu highlighted the changes that would be needed in order to strengthen interdisciplinary research using controlled-data here at Lancaster University but also the difficulties that stand in the way.
John Couzins – Security Overview at Lancaster University
Next on was John Couzins, the IT Security Manager of Lancaster University. John, who works for the institutional IT service ISS, reported on the certifications that are necessary to fulfil requirements of certain providers of confidential data. Current examples are Cyber Essentials Plus and the IG Toolkit (Information Governance Toolkit) which is used by the NHS.
Mateusz Mikusz – Running Research as a Service. Implications for Privacy Policies and Ethics
The issue regarding the data is that it is used for two purposes:
To make the app and its use cases work
To create research data of usage and other properties that can be analysed by the project team
Mateusz explained that he is working hard to bring both things together in an ethical way that still allows innovative research.
It was a great showcase for a lot of fantastic research that is taking place at Lancaster University and the way in which handling sensitive data and tackling data security is at the forefront of this. There were probably as many questions raised as there were answers given but it was a great opportunity to share approaches to handling data securely and ethically.
On 5 April we invited Libby Bishop to give a workshop on how to share qualitative data. Libby is well known in the Research Data Management (RDM) world as the Manager for Producer Relations at the UK Data Archive (University of Essex) although she introduced herself as a “maverick social science researcher”.
Why have a workshop on sharing qualitative data?
The short answer is: because it is difficult! If we look at the datasets deposited in our Lancaster Research Directory (currently about 150) you will find very few qualitative datasets. The reason for that is that there are many challenges in sharing this type of data. Which is why we invited expert advice from Libby.
Firstly, you can have a look at Libby’s slides below but I would like to highlight a few things that were especially of interest to me further on.
Qualitative data does get reused! Not just for research.
One of the surprises for me personally was that the reuse purpose of qualitative data is mainly for learning purposes (see figure below). According to Libby’s research 64% of downloads of qualitative data are for learning and 15% for research.
In our workshop Libby used a dataset created by Lancaster University researchers to illustrate the benefits of archiving data: It will get re-used! The example is the dataset “Health and Social Consequences of the Foot and Mouth Disease Epidemic in North Cumbria, 2001-2003” which is available from the UK Data Service (http://doi.org/10.5255/UKDA-SN-5407-1). It is a rich qualitative study including interviews with people affected by the Foot & Mouth crisis and diaries documenting experiences in Cumbria 2001-2003.
Libby explained how the researchers themselves thought the data could not be archived but with support (and some extra funding) created an important resource that is being reused in different contexts.
Get the consent right!
A major hurdle on the way to sharing qualitative data is obtaining the right consent from research participants. Workshop participants worked on some real life examples provided by Libby and realised that critiquing consent forms is much easier than writing one yourself.
For example, any pledge to “totally anonymise” an interview is a promise you are unlikely to keep. Also, vague statements or legalistic terminology were criticised.
Libby highlighted that consent statements actually have become more difficult to write as dissemination tools (including data archives) have diversified.
Here are a few points that stuck in my mind after the Sharing Qualitative Data workshop:
Sharing qualitative data offers many benefits. We heard of examples where research participants were more keen on sharing their (anonymised) data than overly careful researchers.
The prime responsibility of the researcher is to protect participants but she/he also has a responsibility to science and funders. Both together, according to Libby, “is not an easy package”.
The three tools for sharing qualitative data are:
A well written and explained informed consent form
Protection of identities (through careful anonymisation)
Regulated access (not all data should be open without restrictions)
Q: In support of Open Data, what role do policies by funders or the University have? Are they helpful? Or are they seen as just another hurdle in the way of doing research?
David: I could be wrong but I don’t think most people just view it as just a hurdle. I think when people have to write a Data Management Plan for a grant that is a bit of a pain. But I don’t think the idea of having the data freely available is something where most people say “I can’t be bothered”. It is an additional step but it is something people should be doing anyway because if you are going to be clear on what your results are the data should be in a form that’s usable and could be easily moved between people. I think most people say that’s a good thing but maybe I’m biased…
Q: I have talked to PhD students asking if they want to share their data and they said I should have asked them three years ago because now it is so much work. I wonder why that is and if we need to change the way we teach them how to manage their data?
David: I wonder if I would have said the same thing. All the data of my PhD is still around but as I was learning my craft I probably wasn’t the most efficient, and my data wasn’t managed as efficiently as I would do now. I don’t remember going to a data management training or anything. And if someone had done that on day one of my PhD? Data should be kept in an ordered fashion etc. I created a lot of extra work for myself because I would do some analysis, close the file and end up re-doing the same things I have done multiple times. And even on that level that is not very efficient.
Q: Is that something we should teach students, you think?
David: It’s probably something students wouldn’t be too keen [on].
Q: Yes, you don’t want to patronise them.
David: And it is a bit like saying: For god’s sake back up stuff! If you look at all the horror stories [about] who loses data. It’s only when it goes wrong that it becomes a problem. I think some people are automatically super organised. I was probably somewhere in the middle, probably more organised now. I think the issue is in a lot of academia, you just figure it out as you go. And some people develop brilliant habits and some people, including myself, bad and good. And other people develop really bad habits. And that just carries on.
I sometimes look at Retraction Watch to see what’s in, and there is this really interesting example of an American paper, an American guy who posted a paper in Psychological Science whose undergraduate student collected the data and then it turns out the entire paper is wrong when someone re-analysed the data and found so many mistakes in it. Of course it has been retracted. Now the professor has said it is the student’s fault [whole story here]. But whoever taught that student data management? If that is the issue and it looks like it, they have taken the eye off the ball. And now without a doubt his other papers will be scrutinised. Clearly, there are bad habits ingrained that he’s been passed on.
And it is not just students, it is people higher up as well. The students have been informed by their own supervisors. So I say to my students: back stuff up, make sure things are organised, and I can usually tell without going into their file system. What usually happens, if I ask them for something, a piece of data, it will appear quickly because they know where it is, and that is good enough for me. But if it takes ages, that’s when we end up having a talk saying “What are you actually doing with your data?” because this seems really all over the place. But not every supervisor does that, as that guy proved. He didn’t even seem to look at the data. I am not saying that can happen here; but it is not only the students.
Q: What could the University do more to assist Open Data supporters like yourself?
David: I really like the fact that the Library is pushing the fact that you can upload datasets. I know there are not many people from my Department that are doing it… I think that is really interesting. It is something that I – not necessarily challenge – but I do mention it. I don’t really get why. It is the sort of thing where you are submitting a paper you don’t even have to do it formally. There are journals that don’t have a data policy but I can still through our Pure system link data and paper together. I don’t see how that is a bad thing and that there is a huge effort needed to do that.
Maybe academics say it is just another thing to do? A colleague of mine would always say if they want the data they can always email me. Now that might be true but there are lots of cases when you email academics and they never get back to you. The same colleague gets so many emails that they have someone to manage their mails. I take the point that the counter argument is that nobody actually will want to see the data and maybe they won’t. But given how random stuff is… you don’t know. What you publish today might not be important and then suddenly it is important.
So my answer to the question is I am not exactly sure. There is more support in this institution than in my other, to my memory, in terms of: “this is a place to put my dataset”. One of the courses I was on here about data management as part of the 50 Programme was really useful in the sense that I left thinking from now on I am going to put my data there [into Pure].
Q: Should there be other incentives for opening up research data rather than “doing a good thing”? Should there be more credit for Open Data?
David: Yes, probably. We are always judged, when we do PDRs every year, on how much we published and how much money we got. But actually, the data output does have a DOI now and it is citable and it is a contribution that the University is getting from the academic. It is additional effort. So it would be interesting to see what happened if it went as far as maybe not a promotion thing, but … part of good practice. I think the question I would ask academics is: if your data is not there, where are you keeping it long term? Now I am working on another project where data cannot be made open and that is fair enough, but in general I do wonder where all that data is going. There is a duty where it needs to be kept for a certain length of time. I think it is easier to put it in there [Pure]; then I don’t need to think about it, if nothing else. That gives me more comfort.
Q: Is there anything you’d like to add?
David: I am certainly in support of Open Data but I write more about data visualisation because I like pictures as much as I like data [laughs].
Thanks David for an interesting interview. We hope to do more Data Interviews soon. In the meantime, if you have any questions or comments leave them below or email email@example.com.
This is the first interview of hopefully a series to come about the impact of Open Data on research. The interview was conducted by Hardy Schwamm.
Q: We define Open Data as data that can be freely used, shared and built-on by anyone, anywhere, for any purpose. Open Data is also a way to remove legal and technical barriers to using digital information. Does that go with your idea of what Open Data is?
David: Yes, I think so. I might add to that: the data is actually useful and fit for purpose. To me it’s one thing to just upload all that data and make it available. But a lot of the time, how useful that is on its own is not quite clear. As a psychologist you can run an experiment and you have a lot of data coming out of a study. You can just dump that data online but is there enough information there for other scientists to use that data and get the results?
Q: So would you say that the usefulness of data depends on what we as librarians call metadata, data about the data?
David: Yes, exactly. The definition you gave earlier is spot on. I would just add you need to make sure it is useful to other people. That might also depend on the audience but there are lots of datasets that people post for papers that are just the raw data. That is useful but to understand how they get from the raw data to the conclusions is an important step. There isn’t always space in publications to make that clear.
Q: My next question you have probably answered already. What is your interest in Open Data? Do you support it as a principle or because it is useful for your research?
David: I do support it as a matter of principle! I always find it weird, even as a student, that you could have papers published and it was just a “Take our word for it” process. I still find that weird now. So absolutely, I support it as a matter of principle. I think as a scientist it just seems right. The data is the cornerstone of every publication. So if that is not there it seems like a massive omission, unless there is a reason for it not to be there. There are lots of mainstream psychology journals that don’t have any policy on data.
Q: That leads me to my next question: To what extent do researchers in your field, Psychology, support or embrace a culture of Open Data?
David: Psychology does have a culture of it and it is probably growing. I think it is inevitable that this is going to become the standard practice if you look at the way Open Access publishing is going.
Q: Why do you think this is happening?
David: Because I think what is eventually happening is that journals are going to say… Lots of people are doing it but it is like everything else: particularly if that data is going to be usable, it does require a bit more effort on the author’s part to make sure that things are organised and that they have a Data Management Plan. I am not suggesting that lots of people don’t have Data Management Plans but it’s something that, if you look at current problems in Social Psychology, really wasn’t being followed. There have been leaks and there have been other problems.
To give you a story: last week a 3rd year student at Glasgow University spotted errors in a published paper, and they were actually errors in the degrees of freedom. They didn’t need the raw data but the point is that a lot of that could have been sorted if the raw data had been made available. There are lots of little issues that keep coming up.
There is nevertheless still resistance and there are plenty of journals where there really is no policy, certainly the journals I review for. At the end, there is no data provided, and I don’t know what the policy is. It would be nice if in the future authors could upload raw data but that depends on the journal’s policy and whether the journal has a policy.
Q: Where should the push for Open Data come from? From journals, funders or the science community?
David: I think from all! If peer reviewers started asking for data, which I think more are, and I think if more scientists start uploading data as supplementary material as a matter of course then I think journals will start to do that. I guess the other option is that journals will start to be favoured that do provide additional resources. So particularly given how much money places like Elsevier make, what do they actually offer? If they want to sell themselves they could offer lots of things but they don’t seem to be pushing it.
And I appreciate it is very discipline specific, and that came up after my talk at the Data Conversations [on 30 January 2017]: some disciplines don’t share data. It has improved massively since I started as a postgrad student. Then it just wasn’t a thing and it has slowly become more of an issue.
Q: Do you think this has to do with skills and knowledge of researchers and PhD students? Do they know how to prepare and share data? Do they know how to use other researchers’ data? Is there something missing?
David: A lot of psychologists are in a kind of hybrid area. They are obviously not statisticians and I do wonder if there is a bit of a concern: what if I upload everything, what if somebody finds a mistake? My view is always: I’d rather know that there is a mistake. But I do wonder if people are sometimes sceptical about it. Not because they’ve got anything to hide but because they are not 100 per cent sure sometimes. They understand the result and they know what the numbers mean but we are not mathematicians.
I am just curious, given the number of statistical mistakes being flagged up in psychology papers… I am sure I have made mistakes myself. I’d just rather know about them. And having the data there means someone can check if they really want to. My view is that I am quite flattered if someone bothered to go and re-run my analysis. They are obviously reading it!
The interview with David will be continued in Part 2 which you can find here.
Here at Lancaster University we are very excited to be part of a group of pilot institutions taking part in Jisc’s Research data shared services project. This aims to provide a flexible range of services which suit the varied needs of institutions in the HE sector and help achieve policy compliance for deposit, publication, discovery, storage and long term preservation of research data. It’s an ambitious project but one for which there is an undoubted need, and we are trying to work with Jisc to help them achieve this goal.
Last week we were invited down to Jisc London HQ to learn about the progress of the project and – just as importantly – share our own thoughts and experiences on the process.
Daniela Duca has written a comprehensive overview of the meeting and the way forward for Jisc.
Our table represented a microcosm of the project: Cambridge University (large institution), ourselves at Lancaster (medium) and the Royal College of Music (small). We all have extremely different needs and resources, and how one institution tackles a problem will not necessarily work at another. However, we have a common purpose in supporting our academics and students in their research, ensuring compliance with funders and enabling our institutions to support first class research outputs to share with the wider world.
We had been asked to do some preparatory work around costing models for the meeting – I think it would be fair to say we all found this challenging – probably because it is! My previous knowledge of costings comes from having looked at the Curation Costs Exchange, which is an excellent starting point for anyone considering approaching the very difficult task of costing curation services.
My main interest in the day lay in the preservation aspects of the project especially in exploring wider use cases. It’s clear that many institutions have a number of digital preservation scenarios for which the Shared Service solution might also be applicable. What is also clear is that there are so many possible use cases that it would be very easy to accidentally create a whole new project without even trying! I think it’s fair to say that all of us in the room – whether we are actively involved in digital preservation or not – are very interested in this part of the project. There is no sense in Jisc replicating work which has already been done elsewhere or is being developed by other parties so it presents an ideal opportunity for collaborative working and building on the strengths of the existing digital preservation community.
Overall there was much food for thought and I look forward to the next development in the shared services project.