We held our fourth Data Conversation here at Lancaster University, bringing together researchers and their experiences of using and sharing data over pizza and cake…
Pizza is a big attraction at an event, but more importantly it brings people together to share experiences and creates a relaxed and informal environment which encourages conversation – exactly what we want. Now at the fourth event in the series we have some “regulars” who come for the conversation (and the pizza), but also new faces who bring new perspectives.
We had another interesting programme with a range of researchers from different disciplines:
Our first speaker was Dr John Towse, Senior Lecturer in Psychology. For this Data Conversation he reflected on his role as editor of the Journal of Numerical Cognition, an open access journal which charges no author fees. The Journal is very encouraging of data sharing, and as editor John is in the position of being able to ask his contributors to share their data, although the journal does not require it. John stressed that you can’t expect data sharing to happen organically – you have to ask.
Our next speaker was Dr Jo Knight, who has featured as part of our Data Interview series talking about her work. She explained the emergence of the Psychiatric Genomics Consortium out of a need to share genomics data even where that data can be quite sensitive. The aim is to make the data as open as possible, and this has been achieved by creating a community of trust. She emphasised that they are motivated by the wish to change people’s lives and do not share the data with commercial entities.
Dr Kyungmee Lee from the Department of Educational Research works with distance learners, supporting their doctoral training in preparation for their PhD research. She encourages students to reuse existing datasets to investigate research methods, and it was whilst doing this that she realised how many of the datasets out there are difficult to use because they lack context.
Dr Dermot Lynott entertained us with his confessions of a poor data manager, as he, like the rest of us, has been guilty of poor file organisation and even worse file naming. However, he also gave us a success story of publishing data which has been shared and re-used for over 10 years, and he was keen to encourage others to see the benefit of doing the same.
Finally, Professor Maggie Mort wrapped up with a moving and powerful description of the data gathered as part of the Documenting Flood Experience project, and with warnings about the difficulties which might lie ahead under the incoming GDPR, which will affect future projects that gather, use and store data relating to children. This sparked off even more interest and debate.
To be honest we could easily have been there all day and we’re very much looking forward to the next Data Conversation on 10th April – Stories from the Field.
Q: When does software become research data in your understanding?
Andrew: As soon as you start writing software towards a research paper, I would count that as research data.
Q: Is that when you need the code to verify results or re-run calculations?
Andrew: You also need the code to clean your data, which is just as important as your results, because how you clean your data informs what your results are going to be.
Q: And the software is needed to clean the data?
Andrew: Yes. The software will be needed for cleaning the data. So as soon as you start writing your software towards a paper that is when the code becomes research data. It doesn’t have to be in the public domain but it really should be.
Q: What is the current practice when you publish a paper? Do you get asked where your software is?
Andrew: No, that’s the conference chairs who are asking but it is not a requirement. Personally I think it should be. I can understand in certain cases when for instance there are security concerns. But normally the sensitivity is on the data side rather than the software.
Q: At the moment, if you read a paper, is the software that is linked to the paper not available?
Andrew: Normally, if there is software with the paper the paper would have a link, normally on the first or the last page. But a large proportion of the papers don’t have a link. Normally there would be a link to GitHub, maybe 50 per cent of the time. Other than that you can dig around if you’re really looking for it, perhaps Google the name but that’s not really how it should be.
Q: So sometimes the software is available but not referenced in the paper?
Andrew: That’s correct.
Q: But why would you not reference the software in the paper when it is available?
Andrew: I am really puzzled by this [laughs]. I can think of a few reasons. One of them could be that the GitHub instance is just used as backup. The problem I have with that is that if it is not referenced in the paper, how much do you trust the code to be the version that is associated with the paper?
Also, the other problem with GitHub is that even if you reference it in a paper, you can keep changing the code, and unless you “tag” it on GitHub with something like a version number and reference that tag in your paper, you don’t know what the correct version is.
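The tagging Andrew describes takes only a couple of commands. A minimal sketch, run inside the repository that accompanies the paper (the tag name `v1.0-paper` is just an illustrative choice):

```shell
# Create an annotated tag marking the exact version of the code used in the paper
git tag -a v1.0-paper -m "Code as submitted with the paper"

# Publish the tag to GitHub so readers can check out exactly that state
git push origin v1.0-paper
```

The tag (or the commit hash it points to) is then what gets cited in the paper, rather than the bare repository URL.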
Q: What about pushing a version of the code from GitHub to [the data archiving tool] Zenodo and get a DOI?
Andrew: I didn’t know about that until recently!
Q: So this mechanism is not widely known?
Andrew: I know what DOIs are but not really how you can get them.
Q: So are the issues why software isn’t shared about the lack of time or is it more technical as we have just discussed, to do with versions and ways of publishing?
Andrew: I think time and technical issues go hand in hand. To be technically better takes time and to do research takes time. It is always a tradeoff between “I want my next paper out” and spending extra time on your code. If your paper is already accepted that is “my merit” so why spend more time?
But there are incentives! When I submitted a paper to an evaluation workshop, I said that everybody should release their software, because it was about evaluating models, so it makes sense to have all the code online. In the end it was decided that we shouldn’t enforce the release, but it was encouraged, and the argument was that you are likely to get more citations. Because if your code is available, people are more likely to use it and then to credit you by citing your paper. So getting more citations is a good incentive, but I am not sure if there are studies proving that releasing software correlates with more citations?
Q: There are a number of studies proving there is a positive correlation when you deposit your research data. I am not aware there is one for software. So maybe we need more evidence to persuade researchers to release code?
Andrew: Personally I think you should do it anyway! You spend so many hours on writing software, so even if it takes you a couple of hours extra to put it online, it might save somebody else a lot of time doing the same thing. But some technical training could help significantly. In my experience, the better I got at software development, the quicker I became at releasing code.
Q: Is that something that Lancaster University could help with? Would that be training or do we need specialists that offer support?
Andrew: I am not too sure. I have a personal interest in training myself but I am not sure how that would fit into research management.
As for specialists who offer support, I think that would be a great idea. They could help direct researchers. Even if they don’t do any development work for them, they could look at the code and point them in the right direction, suggesting “I think you should do this or that”, like re-factoring. I think that kind of supervision would be really beneficial – like a mentor, even if they are not directly on the project. Even just ten per cent of their time on a project would help.
Q: Are you aware that this is happening elsewhere?
Andrew: Yes, I did a summer internship with the Turing Institute and they have a team of Research Software Engineers.
Q: And who do the Research Software Engineers support?
Andrew: The Alan Turing Institute is a partnership of five universities and is the national institute for data science in the UK. They have their own researchers but also associated researchers from the five partner universities. The Research Software Engineers are embedded on the research side, integrated with the researchers.
When I was an intern at the Turing Institute one of the Research Software Engineers had a time slot for us available once a week.
Q: Like a drop in help session?
Andrew: Yes, like that. They helped me by directing me to different libraries and software to unit test my code and create documentation, as well as stating the benefits of doing this. I know that other teams benefited from their guidance and support on using Microsoft Azure cloud computing to facilitate their work. I imagine that a lot of time was saved by the help that they gave.
Q: Thanks Andrew. And to get to the final question. You deposited data here at Lancaster University using Pure. Does that work for you as a method to deposit your research data and get a DOI? Does that address your needs?
Andrew: I think better support for software might be needed on Pure. It would be great if it could work with GitHub.
Q: Yes, at the moment you can’t link Pure with GitHub in the same way you can link GitHub with Zenodo.
Andrew: When you link GitHub and Zenodo does Zenodo keep a copy of the code?
Q: I am not an expert, but I believe it provides a DOI for a specific release of the software.
Andrew: One thing I think is really good is that we keep data in Lancaster’s repository. In twenty years’ time GitHub might not exist anymore, and then I would really appreciate a copy stored in the Lancaster archives. The assumption that “It’s in GitHub, it’s fine” might not be true.
Q: Yes, if we assume that GitHub is a platform for long-term preservation of code, we need to trust it, and I am not sure that this is the case. If you deposit here at Lancaster, the University has a commitment to preservation, and I believe that the University’s data archive is “trustworthy”.
Andrew: So depositing a zipped copy of your code is a good solution for now, but in the long term the University’s archives could be better for software. An institutional GitLab might be good and useful – I know there is one in Medicine, but an institution-wide one would help. It would be nice if Pure could talk to these systems, but I can imagine that is difficult.
The area of Neuroscience seems to be doing quite well with releasing research software. They have an opt-in system for the review of code. I think one of the Fellows of the Software Sustainability Institute was behind this idea.
Q: Did that happen locally here at Lancaster University?
Andrew: No, the Fellow was from Cambridge. They seem to be ahead of the curve but it only happened this year. But they seem to be really pushing for that.
Q: Thanks a lot for the Data Interview Andrew!
The interview was conducted by Hardy Schwamm.
For example: Piwowar, H. A., & Vision, T. J. (2013). Data reuse and the open data citation advantage. PeerJ, 1, e175. http://doi.org/10.7717/peerj.175
International Digital Preservation Day 30th November 2017 #IDPD2017
What’s that about then?
Digital Archivists are a much misunderstood lot.
A lot of people think our work on digital preservation must be something to do with digitising old documents, but this is absolutely not the case. Of course, digitising old documents is fantastic, and wonderful resources are now increasingly available on the internet – there are so many examples, but Charles Booth’s London and the Cambridge Digital Library are two of my favourites. There are thousands and thousands, useful for scholars, historians, students, teachers, genealogists, journalists – just about anyone who is interested in getting access to sources that would otherwise be near impossible to access. Digitising archive and library content has revolutionised the way we access and interact with archives, manuscripts and special collections.
However – this is not what the digital archivist does (although there are overlaps). The digital archivist is concerned mainly (although not exclusively) with archives, data, stuff – whatever you want to call it – which was created in a digital format and has never had a physical existence. If someone accidentally deletes the digitised version of Charles Booth’s poverty maps, the original is still there and can be digitised again. Of course that would be an enormous waste of time and effort which is why we often treat digitised content as if it were the original content and guard against accidental deletion or loss.
But although digitisation does help preserve a document, because it reduces the wear and tear on the original, it often swaps one stable format (paper, parchment etc.) for a less stable one. So you could argue that digitising – rather than helping with preservation issues – is just creating new ones. Of course there are many very unstable analogue formats, such as many photographic processes and magnetic tape, which need to be digitised if they are to survive at all.
Digitisation is not preservation.
With digitised content you would like to think (!) that you have some measure of control over what that content is, specifically the format it comes in. It is possible to choose to save the image files in a format that is widely used and well documented, so that the risk that they will be hard to access in 5 or 10 years’ time is lessened. There are formats which are recommended for long-term preservation because they are widely adopted and well supported, and by choosing these we help the process of digital preservation by giving those files a “head start”.
However, files which are created by others – perhaps completely outside of the organisation – can come in *literally* any format. A good example of this is when I analysed a sample of the data deposited by academics undertaking research at our institution and found a grand total of 59 different file types. OK, so that doesn’t sound *too* bad, but I couldn’t identify 55% of the files at all. Which is not so good.
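A first pass at this kind of survey can be done with standard command-line tools. A rough sketch, assuming the deposited files live under a directory called `deposit/` (a hypothetical path):

```shell
# Tally files per MIME type across a deposit directory:
# `file -b --mime-type` prints one MIME type per file, which we then count
find deposit/ -type f -exec file -b --mime-type {} + | sort | uniq -c | sort -rn
```

Serious format identification would use signature-based tools such as DROID or Siegfried against the PRONOM registry; `file` is just a quick way to get a feel for what a deposit contains.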
So we could try (as some archives do) saying we will only accept files in a certain format, to give our files the best chance of a long and happy life. But clearly there are lots of circumstances where this is either impractical or impossible. For example, with the papers of a now-deceased person, we cannot ask them to convert or resubmit their files. And in the case of our researchers, they will need to use specific software to perform specialised tasks, and they themselves may have very little say in their choice of software.
Another major – and perhaps often overlooked – issue with digital preservation is actually making sure that the files are captured in the first place. This is not a digital-specific problem: any kind of data, whether it is research outputs, personal papers or the financial records of a business, is at risk of disappearing if it is not looked after properly. The files need a safe storage environment where the risk of accidental or malicious damage is kept to a minimum, and where they can be found, their content understood, and shared effectively. For digital files this means a particularly rigorous ongoing check that the content and format are stable and that they can still be made accessible.
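In practice that ongoing check usually starts with fixity: recording a checksum for every file at ingest and re-verifying the whole set on a schedule. A minimal sketch with standard tools (the `archive/` directory name is illustrative):

```shell
# At ingest: record a SHA-256 checksum for every file in the archive
find archive/ -type f -exec sha256sum {} + > manifest.sha256

# Later, as a scheduled integrity check: verify that no file has changed
sha256sum -c --quiet manifest.sha256 && echo "fixity check passed"
```

With `--quiet`, per-file OK lines are suppressed, so any output from the check signals a problem. Real preservation systems layer replication and format monitoring on top of this, but checksum verification is the foundation.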
So what is digital preservation?
It’s not just backing stuff up
It’s the active management of digital assets to ensure they will still be accessible in the future.
Making sure we can still open files in the future.
Making sure we can still understand files in the future.
We had the opportunity to discuss with Alison and Shuruq the research data issues surrounding their project. It turned out to be a highly interesting conversation on topics such as confidentiality, the limits of anonymisation, legal frameworks and freedom of speech.
Q: Could you describe the aims of your project?
Alison: Thanks for inviting us. It is strange to be on the receiving end because we have been doing a lot of data collection where we put people at ease and now we are at the other end.
About 4 years ago I became concerned about the increasing surveillance culture around Muslim communities, particularly on campus, because that has an impact on free expression – or could do. To me as an experienced researcher this seemed to be a politicisation of a research field: Muslims are generally identified as the “official other”, and we are also told that they are dangerous, with the 2015 Counter-Terrorism and Security Act and its attendant Prevent duty. What is currently not acknowledged is that the Prevent duty is actually not compulsory, but the university sector has adopted it in order to keep their reputations clean.
So it is quite a difficult topic and the project aims to look at four major questions:
What do university staff and students know about Islam?
Where do they find that information?
Thirdly, with specific reference to three issues, how do they formulate their opinions? The first issue with regard to Islam is gender, because that’s often in the media – the whole hijab discussion, for example. The second is radicalisation; there is no point ignoring it, even though there is no evidence that anybody gets radicalised on campus. And the third is inter-faith, because relations among students of different faiths (and intra-faith relations too) are of interest to us. We live in a very secular culture, and yet for many young people their faith identity is important – more important than we realise, because of the secular atmosphere that we have created on campus.
The fourth question is: given that there might be some discrepancies self-identified by our participants in their responses to the first three questions, what could be done to improve the quality of the discussion on campus about Islam? How could we improve the discussion about anything that is regarded by university authorities as risky?
So right from the start, when I built the team, we were all thinking about issues around Islam but also about the implications for free speech on campus. That turned out to be a big issue, because it is discussed more and more, even in the press.
Q: How long does the project run?
Alison: It is a 3 year project from 2015-2018. We are two thirds through.
Q: What kind of data do you need to answer your research questions?
Shuruq: We have two sets of data, and we have actually completed data collection. We collected quantitative data through a survey questionnaire. It was designed to be sent to the 6 universities which are participating in the research. Before we received the grant, and throughout the first year, we were in conversation with the gatekeepers at those universities, who were usually senior managers. They promised to facilitate the research, including the survey to staff and students.
When we started on-site research we also wanted to do the questionnaire at the same time, but the gatekeepers withdrew their collaboration. They tried to get approval from the vice-chancellors and senior management, but we came across a problem at several sites: what some describe as survey fatigue. The universities were worried about students and staff receiving too many requests to fill in questionnaires, and it seemed that they were very reluctant to facilitate our surveys.
We had to redesign the questionnaire so that it was no longer specific to the case studies; it was now a nation-wide questionnaire targeting students only, and we went to a private company to do that. The private company had access to students and could build up a sample for us. For example, we wanted our sample to include Muslims and non-Muslims, equal representation of gender, and other criteria that we had in mind. We decided not to do the staff questionnaire, because you can’t do that through the private companies and the universities were refusing to help. We had to make these decisions because of that particular challenge.
The other subset of data which is qualitative is based on interviews, focus groups, ethnography and curricular material. On each of the six campuses we interviewed 10 students and 10 members of staff. We attempted to handpick staff according to an ideal list which represents a mix of administrative and academic staff, senior and junior staff in different departments, Human Resources, deans and postdocs, etc. The student interviewees were recruited through emails sent through the student union or were invited by researchers. It was a random sample. There were four focus groups on each site, one with staff and three with students. We wanted one focus group to be with Muslims, one with non-Muslims and one mixed. We didn’t always achieve all types and we faced a real challenge in recruiting students. Sometimes non-Muslim students weren’t at all interested in religion or Islam. We tried different techniques such as focus groups in cafes or other hang-out spaces for students but if participants are not interested in your topic no matter how you promote it, it’s really challenging! You might get a self-selected sample of participants who are interested in that topic.
Then we’ve also done ethnography which included observing the sites where students are, talking to different student societies, talking to a wide range of university staff. We attended public events, observing and describing these events: Who attends them, who the speakers are, especially if they are related to topics of religion, Islam, freedom of speech?
Part of the research is also how Islam is studied in the classroom. For each campus we attempted to collate data about all the courses that included a component on Islam. For a long time we used to call this “Islamic Studies” but we don’t mean Islamic Studies in a narrow sense, we mean it in a broad sense. We changed that label for that category of data to “Studying Islam” to broaden it out to include a course in the Faculty of Medicine on for example “Religion and Health”. We collected material through desktop research on all the courses that are offered in the year of the field work which have a component on Islam or religion.
Then we tried to zoom in on some modules reflecting a range of disciplines and approaches, collecting course programme and syllabus for further analysis. Within that sample we also attended some of the classes to observe the actual teaching and how the students respond. So we have a very complex set of data and we are just about to start the analysis stage and there are quite a few challenges there too.
Q: You have collected a wide range of data, from publicly available information to sensitive data like views on religion. Does that have an impact on how you manage your data?
Alison: There are challenges of managing that data but also of collecting it. When I submitted the research proposal to AHRC that was a year before the Counter-Terrorism and Security Act was passed. When I was awarded the grant that act had been passed. So a situation on campus that had already been quite sensitive arguably becomes more so. We were determined as a team to protect the identity of participants and we have established a sequence of events which we hope maximizes that possibility. We do tell our participants that they have to accept that it is actually impossible for us to be completely sure that we can protect them. Because if somebody wants to hack and they have money and expertise then they can get access to stuff.
But I’ll run you quickly through how we do things. There are only two documents that have the allocated number given to a participant and their name. One of them is the consent form. That is kept away from the university, locked up. The other document that has their allocated number and their identity is an Excel spreadsheet which is kept in a virtual vault which has all their characteristics except their political views. We are not collecting political views which the 1998 Data Protection Act lists as something that should be protected. So we are acting in accordance with that Act by seeking to protect their identity.
Once we’ve done that we then tell them before they speak that they have the right to withdraw, the right to anonymity and confidentiality and we give them a timeline so they have six months in which they could say “I’m actually not comfortable with this” but nobody has done that. What we cannot be sure of, of course, is who are the people who walked away from the possibility of speaking to us? It could be the silent majority. We will never know that. We have worked through the student unions to secure the interested students but if something pops up on their screens regarding opinions on Islam there are people who might think “I don’t want to enter that arena” for all sorts of different reasons.
Q: Can you expand on your data security and confidentiality measures?
Alison: We keep our master spreadsheet encrypted via VeraCrypt which is a non-aligned programme unlike BitLocker which belongs to Microsoft.
In order to conduct an interview or a focus group we allocate a number to each person and before we did this we thought participants will find this ridiculous. But actually, with focus group people find it liberating which is the ideal. Every time they spoke they said “Number 32 speaking” and they would even say things like “I would like to endorse what Number 42 has just said”. That was perfect!
Q: Instead of a name badge people would wear a number?
Alison: No name badge, but a numbered Post-it on the table in front of them, and we know who they are if we want to track back. That worked much better than we thought it possibly could.
Then the interviews and focus groups are transcribed by a company called Divas, because there is a lot of material. They have their own confidentiality agreement, and we created one from SOAS as well. Divas destroy the original audio files after a couple of weeks. We keep them but will destroy them some time in the second year. They will never be archived.
After the transcripts come back to us we have to clean them up. We have to take out any mention of names.
Shuruq: Let me add to that. Two issues have come up when cleaning the data.
Q: By cleaning do you mean anonymising?
Shuruq: Yes, anonymising and removing any identifiers. Even when we use numbers in the focus groups, participants will refer to sites on their particular campus, which makes locations identifiable, or they will refer to a lecturer by name or to a course title. These are all ways in which confidentiality on that campus could be undermined. So we weren’t just anonymising the participants but also ensuring the anonymity of the campuses. Although the campuses are all named in our research, we have agreed that when we come to write up the findings we will not identify the campuses, because of sensitive issues such as how each university implements the Prevent policies. There could be some negative opinions, some difficult experiences, and we don’t want to link those to specific campuses. So we are cleaning the data more extensively than normal, perhaps.
It is quite challenging, because as you strip down the data you lose context. If there is a university in Wales, the Welsh context has certain factors that are important to remember when you are analysing the data. Or a specific college in London – how do we handle that? We were negotiating the cleaning of the data with regard to gender, ethnicity, background and names of places. We tried to replace these with descriptions that convey what these elements are while maintaining anonymity. If it is a café, we would strip out the name but still reflect the fact that it is a café in a student union.
But sometimes, especially with interviews we’ve had people who have roles, for example a student who is the Head of a Society or who is active on campus, is well-known and speaks in a certain way. Even if we clean the transcription if we want to quote him he might still be identified by his peers and people who know him.
And then one of the things we are coming up against is transliteration because as we look at how Islam is studied, some of the courses are linked with language training and attract overseas students. It is normal to hear different languages in this context. In an interview different languages could be used. Most of our team members speak several languages so participants have felt at ease using other languages. So how do we transliterate or translate? Sometimes it’s copious work. Some of the terms used in Arabic have specific religious connotations.
This is also sensitive data because often Arabic is perceived suspiciously as a sign of being foreign, as a sign of being a bit radical or of being committed to certain religious concepts. Do you keep the Arabic in the data? Certain words like Hijab and Jihad are loaded with negative connotations in public discourses. On some occasions we made the decision not to send a particular interview to the transcriber because it would endanger the person because they have expressed political views or they used a language that might be misunderstood. To protect the identity of that particular person on one occasion, our postdoc decided to transcribe the interview herself.
Q: Will you be able to share your data?
Alison: It will go into the UK Data Archive. That is a commitment we made to the AHRC and the ESRC who are partly funding us. There are definitely difficulties in assessing the risk of re-identification because it is impossible for us to know how recognisable somebody is to their colleagues or their friends by the way they are expressing themselves.
Q: Can I just confirm that you will share only transcriptions?
Alison: Yes, no audio, no video. But also, we haven’t decided what level of sharing is needed. We have already discussed this with the UK Data Archive and they have three access levels. Our data will not be Open Access. Some of it might be open to all registered users; other data might be accessible to approved researchers only. There might be two tiers. I think our concern all the way through was not that anybody has said anything dangerous, because nobody has, but that it might be construed as overly political by somebody who is looking at that data. If one of our participants has a view on foreign policy that doesn’t concur with the Government’s – in a democracy that should be possible, but it may be problematic in the current climate.
Q: Thanks for the explanation. What kind of research data services can Lancaster University offer to help your project?
Alison: I am personally very interested in the General Data Protection Regulation (GDPR) which will come into force in 2018. It appears to be inviting member states to decide whether to tighten up on consent. This is an issue to do with Big Data and the way in which it is now possible for all of us to covertly record, film or track each other. Anything is possible now, so the issues about consent may impact upon our ethnography. We did nothing covertly, but inevitably, if we were in a big open meeting, we may have made notes about something somebody said, and even if we don’t identify them we haven’t asked their consent. We would like guidance as to whether this is going to clamp down on issues around consent, or whether it is business as usual, which means that if you go to reasonable lengths to protect somebody’s identity then that is acceptable.
We would also like you to be our critical friend [laughs]. We have a year to go. I think we are well prepared and we worked really hard on this aspect but there may be issues that we haven’t covered.
Q: Can I ask about the ethnography, field notes and observations, will you be able to share them?
Alison: I’ll give you a specific example. At campuses where it was possible, we secured the approval of members of staff to allow us to sit in on a lesson. The students were told when we were there, but we didn’t ask each of them to sign a consent form. For example, a student in one class I was in, about international politics, described how her relatives were caught up in border violence in Eastern Europe. I didn’t have her name, but I made a note of the fact that this was an example of a really difficult issue being taught so well that the trust between students and staff was high enough for a student to self-disclose.
But it might be necessary under the new General Data Protection Regulation to remove that and simply say that there was evidence that trust was high, rather than giving the specific example. To me it doesn’t seem that I am endangering that person’s identity, absolutely not.
Shuruq: And the other difficulty is of course that we have also done ethnography at public events which could have been organised by the chaplaincy or a student society. Again, if you wanted to identify these events that can be done. These societies often set up event pages.
It could also be a lecture on Islam and the media, which was one of the public lectures I attended. The speaker is well known and the event was well publicised. My observations look at the discussions and the kind of questions that emerged, and at how the audience was made up (mostly Muslims; very few white students attended that talk). The ones who are interested in Islam in the media are those who are impacted by media representation, which is largely Muslim students on campus.
How do you keep aspects of the context that shed light on the meaningfulness of this event and which makes the ethnography useful without undermining anonymity?
Q: One final question: In our training sessions we often hear the concern that if you include a statement in a consent form saying that anonymised data will be shared publicly you might get fewer participants. Is that something you have experienced?
Alison: No, participants accept that. The point is that if they come to meet us, if they made that step that means that the information that was sent out by staff or student bodies has convinced them that this is an ethically planned project where we are not going in with preconceptions. If we then say that anonymised data will be shared they accept that.
The issue I am raising, which the ICO [Information Commissioner’s Office] hasn’t really clarified, is whether you would have to get a consent form from thirty people in a classroom – which at one level is a reasonable extension of consent but challenges our understanding of ethnography.
Shuruq: Of course we don’t collect any information on the students; we don’t know who they are. But the course outlines and lecture names will not be anonymised in class ethnography so that is something we need to be reflecting upon. The other thing is that the lecturer of one class asked if we were allowing students to withdraw from the class and whether we are asking for their consent. Our team member asked for a verbal consent and the lecturer gave students the opportunity to stay or withdraw from the class. So this could be an issue for some people.
Q: Do you have any final comments on your project with regards to data?
Shuruq: On one campus, at a private university, they had a previous experience of research where the anonymity of some of the interviewees was not protected and the way they were represented in the book that came out of the research was very negative. They were extremely reluctant to allow us in without sufficient guarantees that we were going to protect their identity. But we are facing a serious dilemma because it is such a unique campus that it is impossible to report anything on it without revealing which one it is. That is a serious challenge.
Alison: Just to follow on from that. We mentioned free speech right at the beginning. These strictures, which are ethically motivated, like the possible new legislation [GDPR] about consent, are at one level eminently sensible, but at another level they may make it almost impossible to do research on people’s ability to express themselves freely. If people can’t express themselves freely because it might compromise them or their institution, then we can’t do the research. So it is a very clever double bind, but it’s not good for democracy, because the ability to express oneself freely has, in the public eye, possibly come to mean the ability to have a strong opinion about something. Instead of what I think it is, going right back to Socrates: talking something through in order to understand it better and to understand your own decision-making processes. For young adults at university the heuristic value of freedom of expression, as long as it is not rude or illegal, is absolutely paramount to having citizens who are able to conduct themselves wisely in this complex world! There are huge issues at stake here!
Alison, Shuruq, thank you very much for this interesting interview!
The interview was conducted by Hardy Schwamm @hardyschwamm
We had our third Data Conversation here at Lancaster University again with the aim of bringing together researchers to share their data stories and discuss issues and exchange ideas in a friendly and informal setting.
We all had plenty of time to eat pizza and crisps before Neil invited us all to consider reproducibility and sustainability in relation to software. Neil has a very clear and engaging style which really helped us, the audience, navigate the complex issues of managing software. He asked us all to imagine returning to our work in three months’ time – would it make sense? Would it still work? He also addressed some of the complex issues around versioning, authorship and sharing software.
The second half of the afternoon followed the more traditional Data Conversations route of short lightning talks given by Lancaster University researchers.
First up was Barry Rowlingson (Lancaster Medical School) talking about the benefits of using GitLab for developing, sharing and keeping software safe.
Barry Rowlingson weighs up the benefits of GitLab over GitHub…
Next was Kristoffer Geyer (Psychology) talking about the innovative and challenging uses of smartphone data for investigating behaviour and in particular the issues of capturing the data from external and ever changing software. Kris mentioned how the recent update of Android (to Oreo) makes retrieving relevant data more difficult – a flexible approach is definitely what is needed.
Then we heard from Andrew Moore (School of Computing and Communications) who returned to the theme of sharing software, looking at some of the barriers and opportunities which present themselves. Andrew argued passionately that we need more resources for software sharing (such as specialist Research Software Engineers) but also that researchers need to change their attitudes towards sharing their code.
Our final speaker was the Library’s own Stephen Robinson (Library Developer) talking about using containers as a method of software preservation. This provoked quite some debate – which is exactly what we want to encourage at these events!
We think these kind of conversations are a great way of getting people to share good ideas and good practice around data management and we look forward to the next Data Conversations in January 2018!
This blog post was co-authored by Rachel MacGregor and Hardy Schwamm.
It was fantastic to see PASIG 2017 (Preservation and Archives Special Interest Group) come to Oxford this year which meant I had the privilege of attending this prestigious international conference in the beautiful surroundings of Oxford’s Natural History Museum. All slides and presentations are available here.
The first day was advertised as Bootcamp Day so that everyone could be up to speed with the basics. I thought: “Do I know everything about digital preservation?” and the answer was “no”, so I decided to come along to see what I could learn. The answer was: quite a lot. There was some excellent advice on offer from Sharon McMeekin of the Digital Preservation Coalition and Stephanie Taylor of CoSector, who both have a huge amount of experience in delivering and supporting digital preservation training. Adrian Brown (UK Parliament) gave us a lightning tour of relevant standards – what they are and why they are important. It was such a whistle-stop tour that I think we were all glad the slides of all the presentations are available – this was definitely one to go back to.
The afternoon kicked off with “What I wish I knew before I started” and the responses have been summarised in some fantastic notes, made collaboratively but especially by Erwin Verbruggen (Netherlands Institute for Sound and Vision) and David Underdown (UK National Archives). One of the pieces of advice I liked most came from Tim Gollins (National Records of Scotland), who suggested that inspiration for solutions does not always come from experts or even from within the field – it’s an invitation to think broadly and get ideas, inspiration and solutions from far and wide. Otherwise we will never innovate or move on from current practices or ways of thinking.
There was much food for thought from the British Library team, who are dealing with all sorts of complex format features. The line between book and game, or book and artwork, is often blurred. They used the example of Nosy Crow’s Goldilocks and Little Bear – is it a book, an app, a game or all three? And then there is Tea Uglow’s A Universe Explodes, a blockchain book designed to be ephemeral and changing. In this it has much in common with the time-based artworks which institutions such as the Tate, MoMA and many others are grappling with preserving.
The conference dinner was held at the beautiful Wadham College and it was great again to have the opportunity to meet new people in fantastic surroundings. I really liked what Wadham College had done with their Changing Faces commission – four brilliant portraits of Wadham women.
The conference proper began on Day Two, and over the course of the two days there were lots of interesting presentations which it would be impossible to summarise here. John Sheridan gave an engaging and thought-provoking talk on disrupting the archive, mapping the transition from paper archive to digital not just in a literal sense but also in the sense of our ways of thinking. Paper-based archival practices rely on hierarchies and order – this does not work so well with digital content. We probably also need to be thinking more like this:
and less like this:
for our digital archives.
Eduardo del Valle of the University of the Balearic Islands gave his Digital Fail story – a really important example of how sharing failures can be as important as sharing successes. In his case they learnt key lessons, can move on, and can hopefully prevent others from making the same mistakes. Catherine Taylor of Waddesdon Manor also bravely shared the shared drive – there was a nervous giggle from an audience made up of people who all work with similarly idiosyncratically arranged shared drives… In both cases acquiring tools and applying technical solutions was only half of the work (or possibly not even half); it’s the implementation of the entire system (made up of a range of different parts) which is the difficult part to get right.
As a counterpoint to John Sheridan’s theory we had the extremely practical and important presentation from Angeline Takawira of the United Nations Mechanism for International Criminal Tribunals, who explained that preserving and managing archives is a core part of the function of the organisation. Access for an extremely broad range of stakeholders is key. Some of the stakeholders live in parts of Rwanda where internet access is usually wifi onto mobile devices – an important consideration in deciding how to make material available.
Alongside Angeline Takawira’s presentation, Pat Sleeman of the UN Refugee Agency packed a powerful punch with her description of archives and records management in the field, coping with the biggest humanitarian crisis in the history of the organisation. How do you put together a business case for spending on digital preservation when the organisation needs to spend money on feeding starving babies? Even Twitter, which had been lively during the course of the conference at the hashtag #PASIG17, fell silent at the testimony of Emi Mahmoud, which exemplifies the importance of preserving the voices and stories of refugees and displaced persons.
I came away with a lot to think about and also a lot to do. What can we do (if anything) to help with some of the tasks faced by the digital preservation community as a whole? The answer is that we can share the work we are doing – success or failure – and learn that it is a combination of tools, processes and skills from right across the board – IT, archives, libraries, data science and beyond – that will help preserve what needs to be preserved.
We were very excited to be visiting the lovely city of York for the Digital Preservation Coalition’s event “From Planning to Deployment: Digital Preservation and Organizational Change”. The day promised a mixture of case studies from organisations who have implemented, or are in the process of implementing, a digital preservation programme, and a chance for Jisc to showcase some of the work they have been sponsoring as part of the Research Data Shared Services project (which we are a pilot institution for). It was a varied programme and the audience was very mixed – one of the big benefits of attending events like these is the opportunity to speak to colleagues from other institutions in related but different roles. I spoke to some Records Managers and was interested in their perspective as active managers of current data. I’m a big believer in promoting digital preservation through involvement at all stages of the data lifecycle (or records continuum if you prefer), so it is important that as many people as possible – whatever their role in the creation or management of data – are encouraged into good data management practices. This might be by encouraging scientists to adopt the FAIR principles or by Records Managers advising on file formats, file naming and structures and so on.
The first half of the day was a series of case studies presented by various institutions, large and small, who had a whole range of experiences to share. It was introduced by a presentation from the Polonsky Digital Preservation Project based at Oxford and Cambridge Universities. Lee Pretlove and Sarah Mason jointly led the conversation, talking us through the challenges of developing and delivering a digital preservation project which has to continue beyond the life of the project. Both universities represented in the project are very large organisations, which can make the issues faced by the team extremely complex and challenging. They have been recording their experiences of trying to embed practices from the project so that digital preservation can become part of a sustainable programme.
The first case study came from Jen Mitcham from the University of York, talking about the digital preservation work they have undertaken there. Jen has documented her activities very helpfully and consistently on her blog, and she talked specifically about the amount of planning which needs to go into the work, and then the very real difficulties of implementation. She has most recently been looking at digital preservation for research data – something we are working on here at Lancaster University.
Next up was Louisa Matthews from the Archaeology Data Service, who have been spearheading approaches to digital preservation for a very long time. The act of excavating a site is by its nature destructive, so it is vital to be able to capture data about it accurately and to be able to return to and reuse that data for the foreseeable future. This captures digital preservation in a nutshell! Louisa described how engaging with their contributors ensures high-quality reusable data – something we are all aiming for.
The final case study of the morning came from Rebecca Short of the University of Westminster, talking about digital preservation and records management. The university has already had success implementing a digital preservation workflow and is now seeking to embed it further in the whole records creation and management process. Rebecca described the very complex information environment at her university – relatively small in comparison to the earlier presentations but no less challenging for all that.
The afternoon was a useful opportunity to hear from Jisc about their Research Data Shared Services project, which we are a pilot institution for. We heard presentations from Arkivum, Preservica and Artefactual Systems, who are all vendors taking part in the project and gave interesting and useful perspectives on their approaches to digital preservation issues. The overwhelming message, however, has to be: you can’t buy a product which will do digital preservation. Different products and services can help you with it, but as William Kilbride, Executive Director of the Digital Preservation Coalition, has so neatly put it, “digital preservation is a human project”, and we should be focussing on getting people to engage with the issues and for all of us to be doing digital preservation.
Already our third Data Interview! This time with Dr Jude Towers. Jude is Lecturer in Sociology and Quantitative Methods and Associate Director of the Violence and Society UNESCO Centre. She holds Graduate Statistician status from the Royal Statistical Society, is an Accredited Researcher through the ONS Approved Researcher Scheme, and is vetted to Level 3 by Lancashire Constabulary. Her current research is focused on the measurement of violence. Jude also presented at the first Data Conversations.
Then we comply with the Home Office and ONS [Office for National Statistics] recommendations about the sizes of cells for publication. They say there should be a minimum of 50 respondents in a cell before it is statistically analysed. You must ensure that, if you’re doing cross-tabulations for example, the numbers are sufficient that you couldn’t identify individual respondents. That is relatively straightforward, and I would say that’s general good practice in dealing with that kind of data.
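The cell-size rule Jude describes can be applied mechanically before any table is released. Here is a minimal sketch in Python with pandas – the column names and data are invented for illustration, not taken from the Crime Survey – which suppresses any cross-tabulation cell that falls below the 50-respondent threshold:

```python
import pandas as pd

MIN_CELL = 50  # minimum respondents per published cell (Home Office / ONS guidance)

def safe_crosstab(df, row_var, col_var, min_cell=MIN_CELL):
    """Cross-tabulate two variables, suppressing cells below the threshold.

    Suppressed cells become pandas <NA>, so a reader can see that a cell
    exists without learning its potentially disclosive count.
    """
    table = pd.crosstab(df[row_var], df[col_var])
    return table.where(table >= min_cell).astype("Int64")

# Fabricated survey-style data: the "South" cells are too small to publish.
df = pd.DataFrame({
    "region": ["North"] * 120 + ["South"] * 30,
    "response": (["yes"] * 70 + ["no"] * 50) + (["yes"] * 25 + ["no"] * 5),
})
table = safe_crosstab(df, "region", "response")
print(table)  # North's counts survive; both South cells appear as <NA>
```

In practice disclosure control also involves secondary suppression (so a hidden cell cannot be recovered from the row and column totals), but the basic check is this simple.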
We have also used the Intimate Violence module, which is a self-complete module as part of the Crime Survey. For that there is a special level of access which requires training from what used to be the Administrative Data Liaison Service. That was a one day training course in London, signing of lots of different agreements. Then you access that data through your desktop computer, it has to be a static IP address, and everything is held on their server. You go into their server, you can’t bring anything out, and everything you do has to be done in there.
That means if you want to write a journal article using that data you have to write it inside their server. Anything that you produce using that data, whether it’s a presentation in PowerPoint, a table in a slide, all of that has to have approval from the UK Data Service before it can come off the server into any form of public domain. That has to be done each time you use it. It is quite onerous in some ways but is a very high level of security.
Q: That data is already in an archive so there is no need to share it again. Is citing that data straightforward in case somebody wants to see the data that you used?
Jude: Yes, it’s straightforward to cite. If people want to have access to the raw data they’d have to be accredited in the same way I got accredited. We got the whole team accredited at the same time so we can share data as we produced the work. There is nobody in our team who isn’t accredited. There is no problem …. we can sit in front of the computer and look at that data as we’re trying to develop the work.
Q: So if I were to look at your screen here to view the data I’d have to have the accreditation.
Jude: Yes! Actually it’s interesting that some of these requirements are similar to the ones for police data.
We are doing a lot of work with Lancashire Constabulary. We as a team have just been vetted to Level 3, which gives us the same access as any serving police officer. We have direct access to raw data at the individual level. This is for two reasons. One is that you can ask for data that the police put together, anonymise and give you, but if you don’t know what data there is, it is really difficult to know what to ask for. And the second is that being able to explore the data at that level means you can make links that you couldn’t otherwise make. You can find individual people in different datasets, which allows you to ask much more complex research questions and then anonymise the results and take them out as a dataset.
That’s been quite an interesting process. First of all, you have to be vetted. Then you get your police access card. Rather than it being on a secure server, what we now have are police laptops. We access the police server through that police laptop. Again, you can’t take anything out until it is anonymised. The laptop records every keystroke, so someone can see exactly who you have looked for and why you have looked for them.
Then there are requirements similar to some of the Home Office ones, such as being in a locked office without public access, so that someone can’t look over your shoulder at what you’re doing. I couldn’t take my police laptop and work in the Library. You can’t work on it in public spaces.
That’s quite interesting because we just got two ESRC Studentships with Lancashire Constabulary and they will do the same. They go through Level 3 vetting and they’ll have the police laptops. But then we came across the problem: where do we put them? They can’t go in an office with other PhD students who are not vetted and who are at different stages in their PhD. So what we’ve had to do is make quite specific arrangements so that those students share a room that’s locked. You can’t have someone else in the room who is not vetted!
Q: Is it more difficult in this case to cite data because the data is not in an archive like the UK Data Archive?
Jude: That’s something we haven’t yet done in any official capacity, but we’ve had discussions. The Crime Survey data people can access anyway. In some cases where we have produced new data we’ve made data tables and can release those, so people can see the data we used, completely anonymised and aggregated to a very high level. If people want the raw data they can get accredited or they can go to the UK Data Service. If people just want to re-run our statistical tests then the “semi-raw” data, if you like, is there.
Q: Is that what you could do with the police dataset?
Jude: That is the conversation we are currently having with the police: is there any point at which that data can be released into the public domain? We haven’t yet made agreements about that. I think what we’ll end up doing will be very interesting. There are very few researchers who are doing it in this way. Most people get given anonymised data that the police have anonymised themselves.
So we are doing a series of test cases, asking: as we increasingly aggregate and anonymise the data, at what level can that data be put into the public domain, and at what level is it still useful? We’ll have to see if we can find a point where it is still useful and can go public. If we are able to do that then we’ll put it into archives.
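The trade-off Jude describes – aggregate until the data are safe to release, but stop while they are still useful – can be prototyped by checking the smallest cell at each level of aggregation. A hedged sketch with invented fields (not real police data):

```python
import pandas as pd

# Fabricated incident-level records standing in for police data.
incidents = pd.DataFrame({
    "ward":  ["A", "A", "A", "B", "B", "C", "C", "C", "C", "C"],
    "month": ["2017-01", "2017-01", "2017-02", "2017-01", "2017-02",
              "2017-01", "2017-01", "2017-02", "2017-02", "2017-02"],
    "offence": ["assault"] * 10,
})

def smallest_group(df, keys):
    """Size of the smallest cell when aggregating by the given keys.

    The release decision then becomes: is this minimum above the agreed
    disclosure threshold, and is the aggregation still fine-grained
    enough to be useful for analysis?
    """
    return int(df.groupby(keys).size().min())

# Progressively coarser levels of aggregation: the finest level has
# cells of one incident (too disclosive); coarser levels are safer
# but carry less analytical detail.
for keys in (["ward", "month"], ["ward"], ["month"]):
    print(keys, "-> smallest cell:", smallest_group(incidents, keys))
```

Running this over candidate aggregation levels makes the negotiation with the data owner concrete: pick the finest level whose smallest cell still clears the threshold.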
Q: That is really interesting!
Jude: Yes, but it is very clear in the ESRC Studentships that the police have the final say on that.
Q: Do the police have a level of expertise and confidence in providing data and working with you? Does that work well?
Jude: It does work well. The police are in a really interesting position. Nationally they are systematically – some more quickly than others – moving to evidence-based policing and significantly improving their research capacity. At the moment they are doing that in two ways. One is by working closely with universities and the other is by more systematically training police officers and associate staff.
I am doing a lot of work with Leeds University on data analytics for the police and we are setting up CPD [Continuing Professional Development] for data analysts in the police to have a more systemic and academic approach to research questions. Now that’s really interesting because the position they are in in their organisation tends to be relatively low but some of the things they are asked are just impossible.
So we are trying to give them the tools to say: you can’t ask me for this when you don’t collect it; or, you want me to evaluate something but nobody told me it was happening, so there is no “before” data. We’re getting them to think through the research process in order to influence how data analytics are used inside the police. It is interesting because there is a bit of a debate about whether they really need data analysts or whether they can just spend their money buying really good algorithms which will sort all this stuff out. Our argument is that you need really good data analysts because you need them to explicate the inherent theories that people have and are trying to test, and to talk people through that research process.
In Lancashire Police those things are coming together. They are much more actively working with academics and much more systemically embedding academic research processes inside the institution. They have a Futures team that includes multiple PhD, MA and now even some undergraduate students. They have a list of research questions that they are interested in as an institution, and they are actively going out looking for people to do that research for them, sitting inside the police while they do it.
Q: That is really fascinating! Is there anything Lancaster University could do to help you or your colleagues with your research? Or does the set up work for you?
Jude: I think it’s OK. The sticky parts are things we are working through for example around contracts. Who owns the Intellectual Property? Who gets final say over publications? We’ve been lucky so far that we’ve negotiated things but I know in other areas these have been problematic: getting clarity and setting up protocols is useful.
There’s been some talk about setting up secure data hubs and I’m in two minds about it. I think in some ways they’d be really useful but I think in other ways they are perhaps a bit inflexible. My colleague across the corridor is doing the same as us with social work data and they’ve done what we have done. They accredited the individuals and have given them a specific laptop to access that data directly, and that works really well.
This is our second Data Interview. This time we were glad to have a chat with Dr Jo Knight.
Jo is a Reader within the CHICAS research group, Research Director in the Lancaster Medical School and theme lead for Health within Lancaster’s Data Science Institute. Jo has experience in developing new methods for analysing genetic data as well as experience in applying known techniques to a large variety of datasets.
Q: Jo, when you talked at our recent Data Management event about a “positive” data management story and a “negative” story there was a lot of interest in that, so we thought we could use this in our next Data Interview. Which story would you like to start with?
Jo: I think it would be good to start with a negative one so I can end on a positive note. And chronologically that is how it occurred.
So the negative story relates to an early time in my career. I had some genetic data on a number of individuals, about 120. I did some statistical analysis of the data. I noticed that some of the patterns that I had in my analysis seemed unusual. They weren’t characteristic of the type of patterns you would expect given that the individuals in this sample were supposed to be siblings. I didn’t have enough genetic information to establish their relationships completely but I did have enough to see that overall patterns didn’t look how I expected them to.
I took the data to someone more experienced and said: “There is something wrong with the patterns here”, and he said “Yep, there is definitely something wrong. Those individuals clearly aren’t related to each other.”
At that time, given the technologies that were available, we couldn’t just get more data to determine the relationships. We had to throw all of that data away!
It was essentially because the data and the samples had not been properly linked and managed. At some point between labelling the samples, entering the labels into a database, and recording the relationships and the rest of the information about the individuals, something had gone wrong. So the data management had gone wrong and these samples were now completely useless. As well as the loss of my time, we couldn’t use these samples for any other work either. They no longer had data provenance.
Q: Can you quantify how much time you invested in that project?
Jo: It’s hard to remember, but for me it would have been months of work interrogating the samples! It would also have cost a fair amount in reagents. And for the person that collected the data, probably up to a year’s work getting all the DNA samples from the individuals. Furthermore, those individuals had given samples for medical research that could not now be undertaken.
Q: That is a rather sad story.
Jo: Yes, it is.
Q: Now the positive story. What happened?
Jo: I’m involved in a Consortium now, the Psychiatric Genomics Consortium, and in this Consortium over 800 researchers from 38 countries have come together and worked really very hard through ethical approvals, data procedures, data collection and data pooling in order to collate samples.
And they have been able to collect data, published a couple of years ago in 2014, on more than 35,000 schizophrenia cases and even more control samples. Through good and appropriate management of the data we were able to identify 108 genetic risk loci for schizophrenia. It has enabled us to move the field forward in terms of beginning to understand the genetic contribution to schizophrenia.
For a long time we knew that schizophrenia has a genetic component but we were unable to pinpoint very many of the risk variants at all, and this study was a real landmark in identifying a large number of the risk variants involved in the disorder. Lots more work needs to be done! What is really exciting about the Consortium is that the original paper is just the tip of the iceberg. That was the paper where the first analysis was done but the data is now held and managed in a manner that researchers who work in psychiatric genetics are able to access that data, analyse that data and answer lots of different questions about the genetic predisposition to schizophrenia.
The Psychiatric Genomics Consortium holds data on lots of other disorders as well. Basically, the appropriate management of that data means we are able to learn a lot more about diseases than we would have if people hadn’t got together and as a large group effectively managed the data.
Q: What is the key step in doing this?
Jo: It’s a willingness to share data and to see the bigger scientific question that can be answered if you share the data, and not just try to hold onto it and answer your own smaller questions. It is a willingness to put considerable amounts of time into data management. So there are lots of people including myself that have informal unpaid roles in managing that data to make it accessible.
Q: What can we as an institution do to encourage that willingness to share data?
Jo: I think Lancaster University as an institution has a very strong positive view of collaborative research across the Faculties and beyond the University. And that’s the kind of thing that does encourage people to share data and be involved in these projects. I think that is something we need to continue to pursue. And also the support systems that we have in place, the people and systems that help us to deposit data and make it available.
Thanks very much for the interview Jo!
You can find out more about Jo and her research here. The full reference of the article on schizophrenia mentioned by Jo is:
Schizophrenia Working Group of the Psychiatric Genomics Consortium (2014) “Biological insights from 108 schizophrenia-associated genetic loci.” Nature, 24 July, 511 (7510): 421–7. doi:10.1038/nature13595
You can find a short summary of the event, the slides and some photos below.
Denes Csala – The sensor cloud around us: collecting, mining and visualizing the energy and building management data of the campus
Dr Denes Csala is a newly appointed lecturer in Energy Storage Systems Dynamics with Energy Lancaster.
There are 30,000 sensors on campus capturing all sorts of data about energy and energy consumption. This has the potential for us to understand a huge amount about the way energy is managed and used but at the same time throws up the issue of managing extremely sensitive commercial and personal data. Access to the data is strictly controlled but Energy Lancaster are very excited about the possibilities of what could be done with the data.
You can see an animated visualization of the campus energy metering system sensor data here:
Kopo Ramokapane – Cloud computing: When is Deletion Deletion?
Kopo reported that when you delete data in the cloud there is no way to be sure that all copies or all versions have been deleted from the cloud provider. This issue isn’t new but doesn’t get as much attention as it should. Because of the way cloud storage operates it is almost impossible, even for the service providers, to be certain that all the data has been deleted. His advice: avoid storing confidential data in the cloud, and learn more about how these systems work! Lancaster University has a contract with the cloud service Box which ensures that compliance issues are dealt with in relation to storage of confidential or sensitive data.
Karen Broadhurst and Stuart Bedston – Better data for better justice: Towards data-driven analyses of Family Court policy and practice
Professor Karen Broadhurst and Stuart Bedston from the Sociology Department reported on concerns about transparency in family court decision-making. Greater transparency and “open data” would have a positive impact in many ways, but are hard to achieve given the security requirements and potential risks.
Karen and Stu highlighted the changes that would be needed in order to strengthen interdisciplinary research using controlled data here at Lancaster University, but also the difficulties that stand in the way.
John Couzins – Security Overview at Lancaster University
Next up was John Couzins, the IT Security Manager at Lancaster University. John, who works for the institutional IT service ISS, reported on the certifications that are necessary to fulfil the requirements of certain providers of confidential data. Current examples are Cyber Essentials Plus and the IG Toolkit (Information Governance Toolkit), which is used by the NHS.
Mateusz Mikusz – Running Research as a Service. Implications for Privacy Policies and Ethics
The issue regarding the data is that it is used for two purposes:
To make the app and its use cases work
To create research data of usage and other properties that can be analysed by the project team
Mateusz explained that he is working hard to bring both purposes together in an ethical way that still allows innovative research.
It was a great showcase for a lot of fantastic research that is taking place at Lancaster University and the way in which handling sensitive data and tackling data security is at the forefront of this. There were probably as many questions raised as there were answers given but it was a great opportunity to share approaches to handling data securely and ethically.