We recently held our fifth Data Conversations here at Lancaster University Library. These events bring researchers together and act as a forum to share their experiences of using and sharing data. The vibe’s informal and we provide our attendees with complimentary coffee, cake and pizza…
It’s FAIR to say that pizza is a popular part of the event. Who doesn’t love pizza…? The informal lunch at the start brings researchers together. It’s a chance to spark conversations and connections with colleagues from different disciplines and at different career stages.
Once again we had a great programme with contributions from three fantastic speakers:
Up first was Dr David Ellis, Lecturer in Computational Social Science from the Psychology department and one of our Jisc Data Champions. David spoke about his experiences (including challenges and solutions) of working with National Health Service Data.
Next up was Jessica Phoenix, Criminology PhD Candidate. Jess spoke about her Masters dissertation project, which looked at missing persons and the link between risk assessment and time to resolution. She spoke about the challenges and solutions associated with creating a dataset from pre-existing raw data – issues that were amplified because the data (police records) were highly sensitive and identifiable.
Last up was Professor Chris Hatton, Centre for Disability Research, Division of Health Research. Chris discussed his experience of collaborating with social workers to achieve uniquely valuable results. He also explored the way in which social media (his Twitter account) has provided a platform to engage with a wide array of voices that he couldn’t have reached through conventional research methods.
We held our fourth Data Conversation here at Lancaster University, bringing researchers together to share their experiences of using and sharing data over pizza and cake…
Pizza is a big attraction at any event, but more importantly it brings people together to share experiences and creates a relaxed and informal environment which encourages conversation – exactly what we want. Now at our fourth event in the series we have some “regulars” who come for the conversation (and the pizza) but also new faces who bring new perspectives.
We had another interesting programme with a range of researchers from different disciplines:
Our first speaker was Dr John Towse, Senior Lecturer in Psychology, and for this Data Conversation he reflected on his role as editor of the Journal of Numerical Cognition, an open access journal which charges no author fees. The journal is very encouraging of data sharing, and as editor John is in the position of being able to ask his contributors to share their data, although the journal does not require it. John stressed that you can’t expect data sharing to happen organically – you have to ask.
Our next speaker was Dr Jo Knight, who has featured as part of our Data Interview series talking about her work. She explained the emergence of the Psychiatric Genomics Consortium out of a need to share genomics data even where that data can be quite sensitive. The aim is to make the data as open as possible, and this has been made possible by creating a community of trust. She emphasised that they are motivated by the wish to change people’s lives and do not share the data with commercial entities.
Dr Kyungmee Lee from the Department of Educational Research works with Distance Learners supporting their doctoral training as part of the preparation for their PhD research. She encourages students to reuse existing datasets to investigate research methods and it was whilst doing this she realised how many datasets were out there which were difficult to use because they lacked context.
Dr Dermot Lynott entertained us with his confessions of a poor data manager, as he, like the rest of us, has been guilty of poor file organisation and even worse file naming. However, he also gave us a success story of publishing data which has been shared and re-used for over 10 years, and he was keen to encourage others to see the benefit of doing the same.
Finally Professor Maggie Mort wrapped up with a moving and powerful description of the data gathered as part of the Documenting Flood Experience project and with warnings about the difficulties which might lie ahead with the incoming GDPR regulations which will impact on future projects which gather, use and store data relating to children. This sparked off even more interest and debate.
To be honest we could easily have been there all day and we’re very much looking forward to the next Data Conversation on 10th April – Stories from the Field.
We had the opportunity to discuss research data issues surrounding their project. It turned out to be a highly interesting conversation, covering topics such as confidentiality, the limits of anonymisation, legal frameworks and freedom of speech.
Q: Could you describe the aims of your project?
Alison: Thanks for inviting us. It is strange to be on the receiving end because we have been doing a lot of data collection where we put people at ease and now we are at the other end.
About 4 years ago, I became concerned about the increasing surveillance culture around Muslim communities, particularly on campus because that has an impact on free expression or could do. To me as an experienced researcher this seemed to be a politicisation of a research field if you generally identify Muslims as the “official other” and also tell us that they are dangerous with the 2015 Counter-Terrorism and Security Act and its attendant Prevent duty. What is currently not acknowledged is that the Prevent duty is actually not compulsory but the university sector has adopted it in order to keep their reputations clean.
So it is quite a difficult topic and the project aims to look at four major questions:
What do university staff and students know about Islam?
Where do they find that information?
Thirdly with specific reference to three issues, how do they formulate their opinions? The first issue with regard to Islam is gender because that’s often in the media. The whole hijab discussion for example. Radicalisation, there is no point ignoring it because … [even though] there is no evidence that anybody gets radicalised on campus. And the third one is inter-faith because relations among students of different faiths and intra-faith also is of interest to us because it is a very secular culture we live in and yet for many young people their faith identity is important, more important than we realise because of the secular atmosphere that we created on campus.
The fourth question is given that there might be some discrepancies self-identified by our participants in their responses to their first three questions, what could be done to improve the quality of the discussion on campus about Islam? How could we improve the discussion about anything that is regarded by university authorities as risky?
So all the way right from the start when I built a team we were all thinking about issues around Islam but also about the implications of that for the campus about free speech. That turned out to be a big issue because that gets more and more discussed even in the press.
Q: How long does the project run?
Alison: It is a 3 year project from 2015-2018. We are two thirds through.
Q: What kind of data do you need to answer your research questions?
Shuruq: We have two sets of data. We have actually completed data collection. We collected quantitative data through a survey questionnaire. It was designed to be sent to the six universities which are participating in the research. Before we received the grant and throughout the first year we were in conversation with the gatekeepers at those universities, who were usually senior managers. They promised to facilitate the research, including the survey to staff and students.
When we started on-site research, we also wanted to do the questionnaire at the same time but the gatekeepers withdrew their collaboration. The gatekeepers tried to get approval from the vice-chancellors and senior management. We came across a problem on several sites and that is what some describe as survey-fatigue. They were worried about students and staff receiving too many requests to fill in questionnaires. It seemed that universities were very reluctant to facilitate our surveys.
We had to redesign the questionnaire so that it was no longer specific to the case studies; it was now a nation-wide questionnaire targeting students only, and we went to a private company to do that. The private company had access to students and could build up a sample for us. For example, we wanted our sample to include Muslims and non-Muslims and equal representation of gender and other criteria that we had in mind. We decided not to do the staff questionnaire because you can’t do that through the private companies and the universities were refusing to help. We had to make these decisions because of that particular challenge.
The other subset of data which is qualitative is based on interviews, focus groups, ethnography and curricular material. On each of the six campuses we interviewed 10 students and 10 members of staff. We attempted to handpick staff according to an ideal list which represents a mix of administrative and academic staff, senior and junior staff in different departments, Human Resources, deans and postdocs, etc. The student interviewees were recruited through emails sent through the student union or were invited by researchers. It was a random sample. There were four focus groups on each site, one with staff and three with students. We wanted one focus group to be with Muslims, one with non-Muslims and one mixed. We didn’t always achieve all types and we faced a real challenge in recruiting students. Sometimes non-Muslim students weren’t at all interested in religion or Islam. We tried different techniques such as focus groups in cafes or other hang-out spaces for students but if participants are not interested in your topic no matter how you promote it, it’s really challenging! You might get a self-selected sample of participants who are interested in that topic.
Then we’ve also done ethnography which included observing the sites where students are, talking to different student societies, talking to a wide range of university staff. We attended public events, observing and describing these events: Who attends them, who the speakers are, especially if they are related to topics of religion, Islam, freedom of speech?
Part of the research is also how Islam is studied in the classroom. For each campus we attempted to collate data about all the courses that included a component on Islam. For a long time we used to call this “Islamic Studies” but we don’t mean Islamic Studies in a narrow sense, we mean it in a broad sense. We changed that label for that category of data to “Studying Islam” to broaden it out to include a course in the Faculty of Medicine on for example “Religion and Health”. We collected material through desktop research on all the courses that are offered in the year of the field work which have a component on Islam or religion.
Then we tried to zoom in on some modules reflecting a range of disciplines and approaches, collecting course programme and syllabus for further analysis. Within that sample we also attended some of the classes to observe the actual teaching and how the students respond. So we have a very complex set of data and we are just about to start the analysis stage and there are quite a few challenges there too.
Q: You have collected a wide range of data, from publicly available information to sensitive data like views on religion. Does that have an impact on how you manage your data?
Alison: There are challenges of managing that data but also of collecting it. When I submitted the research proposal to AHRC that was a year before the Counter-Terrorism and Security Act was passed. When I was awarded the grant that act had been passed. So a situation on campus that had already been quite sensitive arguably becomes more so. We were determined as a team to protect the identity of participants and we have established a sequence of events which we hope maximizes that possibility. We do tell our participants that they have to accept that it is actually impossible for us to be completely sure that we can protect them. Because if somebody wants to hack and they have money and expertise then they can get access to stuff.
But I’ll run you quickly through how we do things. There are only two documents that have the allocated number given to a participant and their name. One of them is the consent form. That is kept away from the university, locked up. The other document that has their allocated number and their identity is an Excel spreadsheet which is kept in a virtual vault which has all their characteristics except their political views. We are not collecting political views which the 1998 Data Protection Act lists as something that should be protected. So we are acting in accordance with that Act by seeking to protect their identity.
Once we’ve done that we then tell them before they speak that they have the right to withdraw, the right to anonymity and confidentiality and we give them a timeline so they have six months in which they could say “I’m actually not comfortable with this” but nobody has done that. What we cannot be sure of, of course, is who are the people who walked away from the possibility of speaking to us? It could be the silent majority. We will never know that. We have worked through the student unions to secure the interested students but if something pops up on their screens regarding opinions on Islam there are people who might think “I don’t want to enter that arena” for all sorts of different reasons.
Q: Can you expand on your data security and confidentiality measures?
Alison: We keep our master spreadsheet encrypted via VeraCrypt, which is an independent programme, unlike BitLocker, which belongs to Microsoft.
In order to conduct an interview or a focus group we allocate a number to each person and before we did this we thought participants will find this ridiculous. But actually, with focus group people find it liberating which is the ideal. Every time they spoke they said “Number 32 speaking” and they would even say things like “I would like to endorse what Number 42 has just said”. That was perfect!
Q: Instead of a name badge people would wear a number?
Alison: No name badge, but a numbered post-it on the table in front of them, and we know who they are if we want to track back. That worked much better than we thought it possibly could.
Then the interviews and focus groups were transcribed by a company called Divas, because there is a lot of material. They have their own confidentiality agreement and we created one from SOAS as well. Divas destroy the original audio recordings after a couple of weeks. We keep them but will destroy them some time in the second year. They will never be archived.
After the transcripts come back to us we have to clean them up. We have to take out any mention of names.
Shuruq: Let me add to that. Two issues have come up when cleaning the data.
Q: By cleaning do you mean anonymising?
Shuruq: Yes, anonymising and removing any identifiers. Even when we use numbers in the focus groups they will refer to sites on their particular campus which will make locations identifiable. Or they would refer to a lecturer by name or to a course title. These are all ways by which confidentiality on that campus would be undermined. So we weren’t anonymising just the participants but also ensuring the anonymity of the campuses. Although the campuses are all named in our research we have agreed that when we come to write up the findings, we will not identify the campuses, because of sensitive issues such as how does the university implement Prevent policies. There could be some negative opinions, some difficult experiences. We don’t want to link those to specific campuses. So we are cleaning the data more extensively than normal perhaps.
It is quite challenging because as you are stripping down the data you lose context. If there is a university in Wales the Welsh context actually has certain factors that are important to remember when you are analysing the data. Or a specific college in London, how do we do that? We were negotiating the cleaning of the data with regard to gender, ethnicity, background, names of places. We tried to replace these with things that identify these elements but which maintain the anonymity. If it is a café we would strip down the name but still reflect the fact that it is a café in a student union.
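As an illustrative aside, the replacement strategy Shuruq describes – swapping identifying details for bracketed category labels that keep the context – can be sketched in a few lines. Everything here is invented for illustration (the identifier list, the category labels, the example sentence); a real project would build these mappings by hand for each transcript, with far more care than a simple find-and-replace.

```python
import re

# Hypothetical mapping from identifying terms to category placeholders.
# A real project would compile this list manually for each transcript.
REPLACEMENTS = {
    r"\bCosta on Alexandra Square\b": "[café in the student union]",
    r"\bDr\.? Smith\b": "[lecturer]",
    r"\bLancaster\b": "[university town]",
}

def clean_transcript(text):
    """Replace known identifiers with bracketed category labels that
    keep the context (it is still a café, a lecturer, a town) while
    removing the identifying detail."""
    for pattern, label in REPLACEMENTS.items():
        text = re.sub(pattern, label, text)
    return text

example = "We met at Costa on Alexandra Square after Dr Smith's lecture."
cleaned = clean_transcript(example)
# "We met at [café in the student union] after [lecturer]'s lecture."
```

The design choice mirrors the one described above: rather than deleting identifiers outright, each is replaced with a label that preserves the analytically useful context.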
But sometimes, especially with interviews we’ve had people who have roles, for example a student who is the Head of a Society or who is active on campus, is well-known and speaks in a certain way. Even if we clean the transcription if we want to quote him he might still be identified by his peers and people who know him.
And then one of the things we are coming up against is transliteration because as we look at how Islam is studied, some of the courses are linked with language training and attract overseas students. It is normal to hear different languages in this context. In an interview different languages could be used. Most of our team members speak several languages so participants have felt at ease using other languages. So how do we transliterate or translate? Sometimes it’s copious work. Some of the terms used in Arabic have specific religious connotations.
This is also sensitive data because often Arabic is perceived suspiciously as a sign of being foreign, as a sign of being a bit radical or of being committed to certain religious concepts. Do you keep the Arabic in the data? Certain words like Hijab and Jihad are loaded with negative connotations in public discourses. On some occasions we made the decision not to send a particular interview to the transcriber because it would endanger the person because they have expressed political views or they used a language that might be misunderstood. To protect the identity of that particular person on one occasion, our postdoc decided to transcribe the interview herself.
Q: Will you be able to share your data?
Alison: It will go into the UK Data Archive. That is a commitment we made to the AHRC and the ESRC who are partly funding us. There are definitely difficulties in assessing the risk of re-identification because it is impossible for us to know how recognisable somebody is to their colleagues or their friends by the way they are expressing themselves.
Q: Can I just confirm that you will share only transcriptions?
Alison: Yes, no audio, no video. But also, we haven’t decided what level of sharing is needed. We have already discussed this with the UK Data Archive and they have three access levels. Our data will not be Open Access. Some of it might be open to all registered users; other data might be accessible to approved researchers only. There might be two tiers. I think our concern all the way through was not that anybody has said anything dangerous, because nobody has, but that it might be construed as overly political by somebody who is looking at that data. If one of our participants has a view on foreign policy that doesn’t concur with the Government – in a democracy that should be possible but may be problematic in the current climate.
Q: Thanks for the explanation. What kind of research data services can Lancaster University offer to help your project?
Alison: I am personally very interested in the General Data Protection Regulation (GDPR) which will come into force in 2018. It appears to be inviting member states to decide if they tighten up on consent. This is an issue to do with Big Data and the way in which it is possible for all of us to covertly record or film each other, track each other. Anything is possible now. So the issues about consent may impact upon our ethnography. We did nothing covertly, but inevitably if we were in a big open meeting we may have made notes about something somebody said, and even if we don’t identify them we haven’t asked their consent. We would like guidance on whether this is going to mean a clampdown on issues around consent, or whether it is business as usual, which means that if you go to reasonable lengths to protect somebody’s identity then that is acceptable.
We would also like you to be our critical friend [laughs]. We have a year to go. I think we are well prepared and we worked really hard on this aspect but there may be issues that we haven’t covered.
Q: Can I ask about the ethnography, field notes and observations, will you be able to share them?
Alison: I’ll give you a specific example. At campuses where it was possible we secured the approval of members of staff to allow us to sit in on a lesson. The students were told when we were there but we didn’t ask each of them to sign a consent form. For example, a student in one class I sat in on, about international politics, described how her relatives were caught up in border violence in Eastern Europe. I didn’t have her name, but I made a note of the fact that this was an example of a really difficult issue being taught so well, with trust between students and staff so high, that a student could self-disclose.
But it might be necessary under the new General Data Protection Regulation to remove that and simply say that there was evidence that trust was high, rather than giving the specific example. To me it doesn’t seem that I am endangering that person’s identity, absolutely not.
Shuruq: And the other difficulty is of course that we have also done ethnography at public events which could have been organised by the chaplaincy or a student society. Again, if you wanted to identify these events that can be done. These societies often set up event pages.
It could also be a lecture on Islam and the media, which was one of the public lectures I attended. The speaker is well known and the event was well publicised. My observations look at the discussions and the kinds of questions that emerged, and at how the audience was made up (mostly Muslims; very few white students attended that talk). The ones who are interested in Islam in the media are those who are impacted by the media representation, which is largely Muslim students on campus.
How do you keep aspects of the context that shed light on the meaningfulness of this event and which makes the ethnography useful without undermining anonymity?
Q: One final question: in our training sessions we often hear the concern that if you include a statement in a consent form saying that anonymised data will be shared publicly, you might get fewer participants. Is that something you have experienced?
Alison: No, participants accept that. The point is that if they come to meet us, if they made that step that means that the information that was sent out by staff or student bodies has convinced them that this is an ethically planned project where we are not going in with preconceptions. If we then say that anonymised data will be shared they accept that.
The issue I am raising, which the ICO [Information Commissioner’s Office] hasn’t really clarified, is whether you would have to get a consent form from thirty people in a classroom – which at one level is a reasonable extension of consent issues but challenges our understanding of ethnography.
Shuruq: Of course we don’t collect any information on the students; we don’t know who they are. But the course outlines and lecture names will not be anonymised in class ethnography so that is something we need to be reflecting upon. The other thing is that the lecturer of one class asked if we were allowing students to withdraw from the class and whether we are asking for their consent. Our team member asked for a verbal consent and the lecturer gave students the opportunity to stay or withdraw from the class. So this could be an issue for some people.
Q: Do you have any final comments on your project with regards to data?
Shuruq: On one campus, at a private university, they had a previous experience of research where the anonymity of some of the interviewees was not protected, and the way they were represented in the book that came out of the research was very negative. They were extremely reluctant to allow us in without sufficient guarantees that we are going to protect their identity. But we are facing a serious dilemma, because it is such a unique campus that it is impossible to report anything on it without revealing which one it is. That is a serious challenge.
Alison: Just to follow on from that. We mentioned free speech right at the beginning. These strictures, which are ethically motivated, like the possible new legislation [GDPR] about consent, are at one level eminently sensible, but at another level they may make it almost impossible to do research on people’s ability to express themselves freely. If people can’t express themselves freely because it might compromise them or their institution, then we can’t do the research. So it is a very clever double bind, but it’s not good for democracy, because the ability to express oneself freely has, in the public eye, possibly become the ability to have a strong opinion about something – instead of what I think it is, going right back to Socrates, where you talk something through in order to understand it better and understand your own decision-making processes. For young adults at university the heuristic value of freedom of expression, as long as it is not rude or illegal, is absolutely paramount to having citizens who are able to conduct themselves wisely in this complex world! There are huge issues at stake here!
Alison, Shuruq, thank you very much for this interesting interview!
The interview was conducted by Hardy Schwamm @hardyschwamm
We had our third Data Conversation here at Lancaster University again with the aim of bringing together researchers to share their data stories and discuss issues and exchange ideas in a friendly and informal setting.
We all had plenty of time to eat pizza and crisps before Neil invited us all to consider reproducibility and sustainability in relation to software. Neil has a very clear and engaging style which really helped us, the audience, navigate the complex issues of managing software. He asked us all to imagine returning to our work in three months’ time – would it make sense? Would it still work? He also addressed some of the complex issues around versioning, authorship and sharing software.
The second half of the afternoon followed the more traditional Data Conversations route of short lightning talks given by Lancaster University researchers.
First up was Barry Rowlingson (Lancaster Medical School) talking about the benefits of using GitLab for developing, sharing and keeping software safe.
Barry Rowlingson weighs up the benefits of GitLab over GitHub…
Next was Kristoffer Geyer (Psychology) talking about the innovative and challenging uses of smartphone data for investigating behaviour and in particular the issues of capturing the data from external and ever changing software. Kris mentioned how the recent update of Android (to Oreo) makes retrieving relevant data more difficult – a flexible approach is definitely what is needed.
Then we heard from Andrew Moore (School of Computing and Communications), who returned to the theme of sharing software, looking at some of the barriers and opportunities which present themselves. Andrew argued passionately that we need more resources for software sharing (such as specialist Research Software Engineers) but also that researchers need to change their attitudes towards sharing their code.
Our final speaker was the Library’s own Stephen Robinson (Library Developer) talking about using containers as a method of software preservation. This provoked quite some debate – which is exactly what we want to encourage at these events!
We think these kind of conversations are a great way of getting people to share good ideas and good practice around data management and we look forward to the next Data Conversations in January 2018!
This blog post was co-authored by Rachel MacGregor and Hardy Schwamm.
We were very excited to be visiting the lovely city of York for the Digital Preservation Coalition’s event “From Planning to Deployment: Digital Preservation and Organizational Change”. The day promised a mixture of case studies from organisations who have implemented, or are in the process of implementing, a digital preservation programme, and also a chance for Jisc to showcase some of the work they have been sponsoring as part of the Research Data Shared Services project (which we are a pilot institution for). It was a varied programme and the audience was very mixed – one of the big benefits of attending events like these is the opportunity to speak to colleagues from other institutions in related but different roles. I spoke to some Records Managers and was interested in their perspective as active managers of current data. I’m a big believer in promoting digital preservation through involvement at all stages of the data lifecycle (or records continuum if you prefer), so it is important that as many people as possible – whatever their role in the creation or management of data – are encouraged into good data management practices. This might be by encouraging scientists to adopt the FAIR principles or by Records Managers advising on file formats, file naming and structures and so on.
The first half of the day was a series of case studies presented by various institutions, large and small, who had a whole range of experiences to share. It was introduced by a presentation from the Polonsky Digital Preservation Project based at Oxford and Cambridge Universities. Lee Pretlove and Sarah Mason jointly led the conversation talking us through the challenges of developing and delivering a digital preservation project which has to continue beyond the life of the project. Both Universities represented in this project are very large organisations but this can make the issues faced by the team extremely complex and challenging. They have been recording their experiences of trying to embed practices from the project so that digital preservation can become part of a sustainable programme.
The first case study came from Jen Mitcham from the University of York, talking about the digital preservation work they have undertaken there. Jen has documented her activities very helpfully and consistently on her blog, and she talked specifically about the amount of planning which needs to go into the work and then the very real difficulties in implementation. She has most recently been looking at digital preservation for research data – something we are working on here at Lancaster University.
Next up was Louisa Matthews from the Archaeological Data Service, who have been spearheading approaches to digital preservation for a very long time. The act of excavating a site is by its nature destructive, so it is vital to be able to capture data about it accurately and be able to return to and reuse the data for the foreseeable future. This captures digital preservation in a nutshell! Louisa described how engaging with their contributors ensures high quality re-usable data – something we are all aiming for.
The final case study for the morning was Rebecca Short from the University of Westminster, talking about digital preservation and records management. The university has already had success implementing a digital preservation workflow and is now seeking to embed it further in the whole records creation and management process. Rebecca described the very complex information environment at her university – relatively small in comparison to the earlier presentations but no less challenging for all that.
The afternoon was a useful opportunity to hear from Jisc about their Research Data Shared Services project which we are a pilot for. We heard presentations from Arkivum, Preservica and Artefactual Systems who are all vendors taking part in the project and gave interesting and useful perspectives on their approaches to digital preservation issues. The overwhelming message however has to be – you can’t buy a product which will do digital preservation. Different products and services can help you with it, but as William Kilbride, Executive Director of the Digital Preservation Coalition has so neatly put it “digital preservation is a human project” and we should be focussing on getting people to engage with the issues and for all of us to be doing digital preservation.
Already our third Data Interview! This time with Dr Jude Towers. Jude is Lecturer in Sociology and Quantitative Methods and the Associate Director of the Violence and Society UNESCO Centre. She holds Graduate Statistician status from the Royal Statistical Society, is an Accredited Researcher through the ONS Approved Researcher Scheme, and is level 3 vetted by Lancashire Constabulary. Her current research is focused on the measurement of violence. Jude also presented at the first Data Conversations.
Then we comply with the Home Office and ONS [Office for National Statistics] recommendations about the sizes of cells for publication. They say there should be a minimum of 50 respondents in a cell before it’s statistically analysed. You must ensure that if you’re doing cross tabulations, for example, the numbers are sufficient that you couldn’t identify individual respondents. That is relatively straightforward and I would say that’s general good practice in dealing with that kind of data.
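The small-cell rule Jude describes can be sketched in a few lines. The sketch below is illustrative only (the function name, categories and counts are invented, not from her project); it applies the 50-respondent threshold mentioned above to a cross-tabulation before release, masking any cell that falls below it.

```python
# Minimal sketch of small-cell suppression: before releasing a cross-tabulation,
# mask any cell with fewer than 50 respondents so individuals cannot be
# identified. All names and numbers here are hypothetical.

THRESHOLD = 50  # minimum respondents per cell, per the guidance cited above

def suppress_small_cells(crosstab, threshold=THRESHOLD):
    """Return a copy of the cross-tabulation with small cells masked."""
    return {
        cell: (count if count >= threshold else "suppressed")
        for cell, count in crosstab.items()
    }

# Hypothetical cross-tabulation: (category, region) -> respondent count
table = {
    ("experienced_violence", "north"): 132,
    ("experienced_violence", "south"): 47,   # below threshold, will be masked
    ("no_violence", "north"): 509,
    ("no_violence", "south"): 361,
}

print(suppress_small_cells(table))
```

In practice disclosure control also considers whether suppressed cells can be inferred from row and column totals, which is why the real guidance goes well beyond a simple threshold.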
We have also used the Intimate Violence module, which is a self-complete module as part of the Crime Survey. For that there is a special level of access which requires training from what used to be the Administrative Data Liaison Service. That was a one-day training course in London and the signing of lots of different agreements. Then you access that data through your desktop computer, it has to be a static IP address, and everything is held on their server. You go into their server, you can’t bring anything out, and everything you do has to be done in there.
That means if you want to write a journal article using that data you have to write it inside their server. Anything that you produce using that data, whether it’s a presentation in PowerPoint, a table in a slide, all of that has to have approval from the UK Data Service before it can come off the server into any form of public domain. That has to be done each time you use it. It is quite onerous in some ways but is a very high level of security.
Q: That data is already in an archive so there is no need to share it again. Is citing that data straightforward in case somebody wants to see the data that you used?
Jude: Yes, it’s straightforward to cite. If people want to have access to the raw data they’d have to be accredited in the same way I got accredited. We got the whole team accredited at the same time so we can share data as we produced the work. There is nobody in our team who isn’t accredited. There is no problem …. we can sit in front of the computer and look at that data as we’re trying to develop the work.
Q: So if I were to look at your screen here to view the data I’d have to have the accreditation.
Jude: Yes! Actually it’s interesting that some of these requirements are similar to the ones for police data.
We are doing a lot of work with Lancashire Constabulary. We as a team have just been vetted to Level 3 which gives us the same access as any serving police officer. We have direct access to raw data at the individual level. This is for two reasons. One is that you can ask for data that the police put together, anonymise and give to you, but if you don’t know what data there is, it is really difficult to know what to ask for. And the second reason is that being able to explore the data at that level means you can make links that you couldn’t otherwise make. You can find individual people in different datasets, which allows you to ask much more complex research questions, and then anonymise and take it out as a dataset.
That’s been quite an interesting process. First of all, you have to be vetted. Then you get your police access card. Rather than it being on a secure server, what we have now got is police laptops. We access the police server through that police laptop. Again, you can’t take anything out until it is anonymised. The laptop records every keystroke so someone can see exactly who you have looked for and why you have looked for them.
Then there are requirements similar to some of the Home Office ones, such as being in a locked office without public access so that someone can’t look at what you’re doing over your shoulder whilst you’re doing it. I couldn’t take my police laptop and work in the Library. You can’t work on it in public spaces.
That’s quite interesting because we just got two ESRC Studentships with Lancashire Constabulary and they will do the same. They go through Level 3 vetting and they’ll have the police laptops. But then we came across the problem: where do we put them? They can’t go in an office with other PhD students who are not vetted. They are at different stages in their PhD. So actually, what we’ve had to do is make quite specific arrangements so that those students share a room that’s locked. You can’t have someone else in the room who is not vetted!
Q: Is it more difficult in this case to cite data because the data is not in an archive like the UK Data Archive?
Jude: We haven’t done that yet in any official capacity, but we’ve had discussions. The Crime Survey data people can already access. In some of the cases where we have produced new data we’ve made data tables and can release those. So people can see the data we use, completely anonymised and aggregated to a very high level. If people want the raw data they can get accredited or they can go to the UK Data Service. If people just want to re-run our statistical tests then the “semi-raw data”, if you like, is there.
Q: Is that what you could do with the police dataset?
Jude: That is the conversation we are currently having with the police: is there any point at which that data can be released into the public domain? We haven’t yet made agreements about that. I think what we’ll end up doing will be very interesting. There are very few researchers who are doing it in this way. Most people get given anonymised data that the police have anonymised themselves.
So we are doing a series of test cases asking, as we increasingly aggregate and anonymise the data, at what level can that data be put into the public domain and at what level is it still useful? We’ll have to see if we can find a place that matches, where it is still useful and it can go public. If we are able to do that then we’ll put it into archives.
Q: That is really interesting!
Jude: Yes, but it is very clear in the ESRC Studentships that the police have the final say on that.
Q: Do the police have a level of expertise and confidence in providing data and working with you? Does that work well?
Jude: It does work well. The police are in a really interesting position. They [are] systematically, some more quickly than others… [nationally] moving to evidence based policing and significantly improving their research capacity. At the moment they are doing that in two ways. One is by working closely with universities and the other is by more systematically training police officers and associate staff.
I am doing a lot of work with Leeds University on data analytics for the police and we are setting up CPD [Continuing Professional Development] for data analysts in the police to have a more systemic and academic approach to research questions. Now that’s really interesting because the position they are in in their organisation tends to be relatively low but some of the things they are asked are just impossible.
So we are trying to give them the tools to say you can’t ask me for this when you don’t collect it. Or you want me to evaluate something but nobody told me it was happening so there is no data from before. We’re getting them to think through the research process in order to influence how data analytics are used inside the police. It is interesting because there is a bit of a debate about whether they really need data analysts or they can spend their money buying really good algorithms [which] will sort all this stuff out. Our argument is that you need really good data analysts because you need them to explicate the inherent theories that people have, that they’re trying to test, that they can talk people through that research process.
In Lancashire Police those things are coming together. They are much more actively working with academics and they are much more systemically embedding academic research processes inside the institution. They have a Futures team that includes multiple PhDs, M.A.s and now even some undergraduate students. They have a list of research questions that they are interested in as an institution, and they are actively going out looking for people who do that research for them and to sit inside the police while they do it.
Q: That is really fascinating! Is there anything Lancaster University could do to help you or your colleagues with your research? Or does the set up work for you?
Jude: I think it’s OK. The sticky parts are things we are working through for example around contracts. Who owns the Intellectual Property? Who gets final say over publications? We’ve been lucky so far that we’ve negotiated things but I know in other areas these have been problematic: getting clarity and setting up protocols is useful.
There’s been some talk about setting up secure data hubs and I’m in two minds about it. I think in some ways they’d be really useful but I think in other ways they are perhaps a bit inflexible. My colleague across the corridor is doing the same as us with social work data and they’ve done what we have done. They accredited the individuals and have given them a specific laptop to access that data directly, and that works really well.
This is our second Data Interview. This time we were glad to have a chat with Dr Jo Knight.
Jo is a Reader within the CHICAS research group, Research Director in the Lancaster Medical School and theme lead for Health within Lancaster’s Data Science Institute. Jo has experience in developing new methods for analysing genetic data as well as experience in applying known techniques to a large variety of datasets.
Q: Jo, when you talked at our recent Data Management event about a “positive” data management story and a “negative” story there was a lot of interest in that, so we thought we could use this in our next Data Interview. Which story would you like to start with?
Jo: I think it would be good to start with a negative one so I can end on a positive note. And chronologically that is how it occurred.
So the negative story relates to an early time in my career. I had some genetic data on a number of individuals, about 120. I did some statistical analysis of the data. I noticed that some of the patterns that I had in my analysis seemed unusual. They weren’t characteristic of the type of patterns you would expect given that the individuals in this sample were supposed to be siblings. I didn’t have enough genetic information to establish their relationships completely but I did have enough to see that overall patterns didn’t look how I expected them to.
I took the data to someone more experienced and said: “There is something wrong with the patterns here”, and he said “Yep, there is definitely something wrong. Those individuals clearly aren’t related to each other.”
At that time, given the technologies that were available, we couldn’t just get more data to determine the relationships. We had to throw all of that data away!
It was essentially because the data and the samples had not been properly linked and managed. At some point between labelling the samples, entering the labels into a database and recording the relationships and the rest of the information about the individuals, something had gone wrong. So the data management had gone wrong and these samples were now completely useless. As well as the loss of my time, we couldn’t use these samples for any other work either. They no longer had the data provenance.
Q: Can you quantify how much time you invested in that project?
Jo: It’s hard to remember but for me it would have been months of work to interrogate the samples! It would also have cost a fair amount in reagents. And for the person that collected the data, probably up to a year’s work getting all the DNA samples from the individuals. Furthermore, those individuals had given samples for medical research that could then not be undertaken.
Q: That is a rather sad story.
Jo: Yes, it is.
Q: Now the positive story. What happened?
Jo: I’m involved in a Consortium now, the Psychiatric Genomics Consortium, and in this Consortium over 800 researchers from 38 countries have come together and worked really very hard through ethical approvals, data procedures, data collection and data pooling in order to collate samples.
And they have been able to collect data that is now published, actually a couple of years ago in 2014, on more than 35,000 schizophrenic cases and even more control samples than that. And through the good and appropriate management of data it has meant that we were able to identify 108 genetic risk loci for schizophrenia. It has enabled us to move the field forward in terms of beginning to understand the genetic contribution to schizophrenia.
For a long time we knew that schizophrenia has a genetic component but we were unable to pinpoint very many of the risk variants at all, and this study was a real landmark in identifying a large number of the risk variants involved in the disorder. Lots more work needs to be done! What is really exciting about the Consortium is that the original paper is just the tip of the iceberg. That was the paper where the first analysis was done but the data is now held and managed in a manner that researchers who work in psychiatric genetics are able to access that data, analyse that data and answer lots of different questions about the genetic predisposition to schizophrenia.
The Psychiatric Genomics Consortium holds data on lots of other disorders as well. Basically, the appropriate management of that data means we are able to learn a lot more about diseases than we would have if people hadn’t got together and as a large group effectively managed the data.
Q: What is the key step in doing this?
Jo: It’s a willingness to share data and to see the bigger scientific question that can be answered if you share the data, and not just try to hold onto it and answer your own smaller questions. It is a willingness to put considerable amounts of time into data management. So there are lots of people including myself that have informal unpaid roles in managing that data to make it accessible.
Q: What can we as an institution do to encourage that willingness to share data?
Jo: I think Lancaster University as an institution has a very strong positive view of collaborative research across the Faculties and beyond the University. And that’s the kind of thing that does encourage people to share data and be involved in these projects. I think that is something we need to continue to pursue. And also the support systems that we have in place, the people and systems that help us to deposit data and make it available.
Thanks very much for the interview Jo!
You can find out more about Jo and her research here. The full reference of the article on schizophrenia mentioned by Jo is:
Schizophrenia Working Group of the Psychiatric Genomics Consortium (2014). “Biological insights from 108 schizophrenia-associated genetic loci.” Nature 511 (7510): 421–427. doi:10.1038/nature13595
You can find a short summary of the event, the slides and some photos below.
Denes Csala – The sensor cloud around us: collecting, mining and visualizing the energy and building management data of the campus
Dr Denes Csala is a newly appointed lecturer in Energy Storage Systems Dynamics with Energy Lancaster.
There are 30,000 sensors on campus capturing all sorts of data about energy and energy consumption. This has the potential for us to understand a huge amount about the way energy is managed and used but at the same time throws up the issue of managing extremely sensitive commercial and personal data. Access to the data is strictly controlled but Energy Lancaster are very excited about the possibilities of what could be done with the data.
You can see an animated visualization of the campus energy metering system sensor data here:
Kopo Ramokapane – Cloud computing: When is Deletion Deletion?
Kopo reported that when you delete data in the cloud there is no way to be sure that all copies or all versions have been deleted from the cloud provider. This issue isn’t new but doesn’t get as much attention as it should. Because of the way Cloud storage operates it is almost impossible even for the service providers to be certain that all the data has been deleted. Avoid storing confidential data in the Cloud and learn more about how the systems work! Lancaster University has a contract with the cloud service Box which ensures that compliance issues are dealt with in relation to storage of confidential or sensitive data.
Karen Broadhurst and Stuart Bedston – Better data for better justice: Towards data-driven analyses of Family Court policy and practice
Professor Karen Broadhurst and Stuart Bedston from the Sociology Department reported on concerns about transparency in family court decision-making. Greater transparency and “open data” would have a positive impact in many ways but are hard to achieve given the security requirements and potential risks.
Karen and Stu highlighted the changes that would be needed in order to strengthen interdisciplinary research using controlled-data here at Lancaster University but also the difficulties that stand in the way.
John Couzins – Security Overview at Lancaster University
Next up was John Couzins, the IT Security Manager of Lancaster University. John, who works for the institutional IT service ISS, reported on the certifications that are necessary to fulfil the requirements of certain providers of confidential data. Current examples are Cyber Essentials Plus and the IG Toolkit (Information Governance Toolkit) which is used by the NHS.
Mateusz Mikusz – Running Research as a Service. Implications for Privacy Policies and Ethics
The issue regarding the data is that it is used for two purposes:
To make the app and its use cases work
To create research data of usage and other properties that can be analysed by the project team
Mateusz explained that he is working hard to bring both things together in an ethical way that still allows innovative research.
It was a great showcase for a lot of fantastic research that is taking place at Lancaster University and the way in which handling sensitive data and tackling data security is at the forefront of this. There were probably as many questions raised as there were answers given but it was a great opportunity to share approaches to handling data securely and ethically.
Q: In support of Open Data what roles do Policies by funders of the University have? Are they helpful? Or is it seen as just another hurdle in the way of doing research?
David: I could be wrong but I don’t think most people just view it as just a hurdle. I think when people have to write a Data Management Plan for a grant that is a bit of a pain. But I don’t think the idea of having the data freely available is something where most people say “I can’t be bothered”. It is an additional step but it is something people should be doing anyway because if you are going to be clear on what your results are the data should be in a form that’s usable and could be easily moved between people. I think most people say that’s a good thing but maybe I’m biased…
Q: I have talked to PhD students asking if they want to share their data and they said I should have asked them three years ago because now it is so much work. I wonder why that is and if we need to change the way we teach them how to manage their data?
David: I wonder if I would have said the same thing. All the data from my PhD is still around but as I was learning my craft I probably wasn’t the most efficient, and my data wasn’t managed as efficiently as I would do now. I don’t remember going to any data management training or anything. And if someone had done that on day one of my PhD? Data should be kept in an ordered fashion etc. I created a lot of extra work for myself because I would do some analysis, close the file and end up re-doing the same things multiple times. Even on that level that is not very efficient.
Q: Is that something we should teach students, you think?
David: It’s probably something students wouldn’t be too keen [on].
Q: Yes, you don’t want to patronise them.
David: And it is a bit like saying: For god’s sake back up stuff! If you look at all the horror stories [about] who loses data. It’s only when it goes wrong that it becomes a problem. I think some people are automatically super organised. I was probably somewhere in the middle, probably more organised now. I think the issue is in a lot of academia, you just figure it out as you go. And some people develop brilliant habits and some people, including myself, bad and good. And other people develop really bad habits. And that just carries on.
I sometimes look at Retraction Watch to see what’s in, and there is this really interesting example of an American professor who published a paper in Psychological Science. His undergraduate student collected the data, and when someone re-analysed it they found so many mistakes that the entire paper turned out to be wrong. Of course it has been retracted. Now the professor has said it is the student’s fault [whole story here]. But whoever taught that student data management? If that is the issue, and it looks like it, they have taken their eye off the ball. And now without a doubt his other papers will be scrutinised. Clearly, there are bad habits ingrained that have been passed on.
And it is not just students, it is people higher up as well. The students have been informed by their own supervisors. So I say to my students: back stuff up, make sure things are organised, and I can usually tell without going into their file system. What usually happens is, if I ask them for something, a piece of data, it will appear quickly because they know where it is, and that is good enough for me. But if it takes ages, that’s when we end up having a talk saying “What are you actually doing with your data?” because it seems really all over the place. But not every supervisor does that, as that guy proved. He didn’t even seem to look at the data. I am not saying that can happen here; but it is not only the students.
Q: What could the University do more to assist Open Data supporters like yourself?
David: I really like the fact that the Library is pushing the fact that you can upload datasets. I know there are not many people from my Department that are doing it… I think that is really interesting. It is something that I – not necessarily challenge – but I do mention. I don’t really get why. It is the sort of thing where, when you are submitting a paper, you don’t even have to do it formally. There are journals that don’t have a data policy but I can still link data and paper together through our Pure system. I don’t see how that is a bad thing or that there is a huge effort needed to do that.
Maybe academics say it is just another thing to do? A colleague of mine would always say if they want the data they can always email me. Now that might be true but there are lots of cases when you email academics and they never get back to you. The same colleague gets so many emails that they have someone to manage their mail. I take the point that the counter argument is that nobody will actually want to see the data, and maybe they won’t. But given how random stuff is… you don’t know. What you publish today might not seem important and then suddenly it is.
So my answer to the question is I am not exactly sure. There is more support in this institution than in my other, to my memory, in terms of: “this is a place to put my dataset”. One of the courses I was on here about data management as part of the 50 Programme was really useful in the sense that I left thinking from now on I am going to put my data there [into Pure].
Q: Should there be other incentives for opening up research data rather than “doing a good thing”? Should there be more credit for Open Data?
David: Yes, probably. We are always judged, when we do PDRs every year, on how much we published and how much money we got. But actually, the data output does have a DOI now and it is citable and it is a contribution that the University is getting from the academic. It is additional effort. So it would be interesting to see what happened if it went as far as, maybe not a promotion thing, but … part of good practice. I think the question I would ask academics is: if your data is not there, where are you keeping it long term? Now I am working on another project where data cannot be made open and that is fair enough, but in general I do wonder where all that data is going. There is a duty for it to be kept for a certain length of time. I think it is easier to put it in there [Pure]; then I don’t need to think about it if nothing else. That gives me more comfort.
Q: Is there anything you’d like to add?
David: I am certainly in support of Open Data but I write more about data visualisation because I like pictures as much as I like data [laughs].
Thanks David for an interesting interview. We hope to do more Data Interviews soon. In the meantime, if you have any questions or comments leave them below or email email@example.com.
This is the first interview of hopefully a series to come about the impact of Open Data on research. The interview was conducted by Hardy Schwamm.
Q: We define Open Data as data that can be freely used, shared and built-on by anyone, anywhere, for any purpose. Open Data is also a way to remove legal and technical barriers to using digital information. Does that go with your idea of what Open Data is?
David: Yes, I think so. I might add to that: the data should actually be useful and fit for purpose. To me it’s one thing to just upload all that data and make it available. But a lot of the time, how useful that is on its own is not quite clear. As a psychologist you can run an experiment and you have a lot of data coming out of a study. You can just dump that data online but is there enough information there for other scientists to use that data and get the results?
Q: So would you say that the usefulness of data depends on what we as librarians call metadata, data about the data?
David: Yes, exactly. The definition you gave earlier is spot on. I would just add you need to make sure it is useful to other people. That might also depend on the audience but there are lots of datasets that people post for papers that are just the raw data. That is useful but to understand how they get from the raw data to the conclusions is an important step. There isn’t always space in publications to make that clear.
Q: You have probably answered my next question already. What is your interest in Open Data? Do you support it as a principle or because it is useful for your research?
David: I do support it as a matter of principle! I always find it weird, even as a student, that you could have papers published and it was just a “Take our word for it” process. I still find that weird now. So absolutely, I support it as a matter of principle. I think as a scientist it just seems right. The data is the cornerstone of every publication. So if that is not there it seems like a massive omission, unless there is a reason for it not to be there. There are lots of mainstream psychology journals that don’t have any policy on data.
Q: That leads me to my next question: To what extent do researchers in your field, Psychology, support or embrace a culture of Open Data?
David: Psychology does have a culture of it and it is probably growing. I think it is inevitable that this is going to become the standard practice if you look at the way Open Access publishing is going.
Q: Why do you think this is happening?
David: Because I think what is eventually happening is that journals are going to say… Lots of people are doing it, but it is like everything else: particularly if that data is going to be usable it does require a bit more effort on the author’s part to make sure that things are organised and that they have a Data Management Plan. I am not suggesting that lots of people don’t have Data Management Plans but if you look at current problems in Social Psychology that really wasn’t being followed. There have been leaks and there have been other problems.
There was a story last week about a 3rd year student at Glasgow University who had spotted errors in a published paper, actually errors in the degrees of freedom. They didn’t need the raw data, but the point is that a lot of that could have been sorted if the raw data had been made available. There are lots of little issues that keep coming up.
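The kind of slip David mentions, reported degrees of freedom that don't match the reported sample sizes, can be caught even without the raw data. The sketch below is a hypothetical illustration (the function names and all the numbers are invented, not from the Glasgow case): for a classic independent-samples t-test the degrees of freedom should be n1 + n2 − 2, so a reviewer can recompute and compare.

```python
# Illustrative consistency check: for an independent-samples t-test,
# degrees of freedom = n1 + n2 - 2. If a paper reports group sizes and a
# t-statistic whose df disagrees with that, something is off.
# All numbers below are hypothetical.

def expected_df_independent_t(n1, n2):
    """Degrees of freedom for a classic independent-samples t-test."""
    return n1 + n2 - 2

def check_reported_df(n1, n2, reported_df):
    """Return (ok, expected_df) so a reviewer can see any discrepancy."""
    expected = expected_df_independent_t(n1, n2)
    return reported_df == expected, expected

print(check_reported_df(15, 15, 28))  # groups of 15 and 15: df = 28, consistent
print(check_reported_df(15, 15, 27))  # reported df of 27 would be flagged
```

Tools in the psychology community automate richer versions of this idea, recomputing p-values from reported test statistics, but even this minimal check shows how published numbers can be audited without access to the underlying data.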
There is nevertheless still resistance and there are plenty of journals where there really is no policy, certainly the journals I review for. In the end, there is no data provided and I don’t know what the policy is. It would be nice if in the future authors could upload raw data but that depends on the journal’s policy, and on whether the journal has a policy at all.
Q: Where should the push for Open Data come from? From journals, funders or the science community?
David: I think from all! If peer reviewers started asking for data, which I think more are, and I think if more scientists start uploading data as supplementary material as a matter of course then I think journals will start to do that. I guess the other option is that journals will start to be favoured that do provide additional resources. So particularly given how much money places like Elsevier make, what do they actually offer? If they want to sell themselves they could offer lots of things but they don’t seem to be pushing it.
And I appreciate it is very discipline specific, and that came up after my talk at the Data Conversations [on 30 January 2017] some disciplines don’t share data. It has improved massively since I started as a postgrad student. Then it just wasn’t a thing and it has slowly become more of an issue.
Q: Do you think this has to do with skills and knowledge of researchers and PhD students? Do they know how to prepare and share data? Do they know how to use other researchers’ data? Is there something missing?
David: A lot of psychologists are in a kind of hybrid area. They are obviously not statisticians and I do wonder if there is a bit of a concern: what if I upload everything and somebody finds a mistake? My view is always: I’d rather know that there is a mistake. But I do wonder if people are sometimes sceptical about it. Not because they’ve got anything to hide but because they are not 100 per cent sure sometimes. They understand the result and they know what the numbers mean but we are not mathematicians.
I am just curious, given the number of statistical mistakes being flagged up in psychology papers… I am sure I have made mistakes myself. I’d just rather know about them. And having the data there means someone can check if they really want to. My view is that I am quite flattered if someone is bothered enough to go and re-run my analysis. They are obviously reading it!
The interview with David will be continued in Part 2 which you can find here.