Data Interview on “Messy Data”

Our latest Data Interview features our two Jisc-sponsored Data Champions, Dr Jude Towers and Dr David Ellis. Jude is a Lecturer in Sociology and Quantitative Methods and David is a Lecturer in Computational Social Science in our Psychology Department.

Jude and David recently presented at a Jisc event on ‘Stories from the Field: Data are Messy and that’s (kind of) ok’.

We talked to Jude and David about what Messy Data are (and many other things):

Q: At the recent Research Data Champions Day the title of your presentation was ‘Data are Messy and that’s (kind of) ok’. I wonder what are ‘messy research data’ in your fields?

Jude: My ‘messy data’ are crime data. The ‘messiness’ comes from a lot of different directions. One of the main ones is that administrative data is not collected for research, it is collected for other purposes. It never quite has the thing that you want. You need to work out if it is a good enough proxy or not.

For example, I am interested in violence and gender, but police crime data doesn't disaggregate by gender. There is no such crime as domestic violence, so it depends on whether somebody has flagged it as such, which is not mandatory, so it is hit and miss. I think the fact the data are not collected for research makes them messy for researchers, and then there are all the other kinds of biases that come with administrative data. If you think about crime, not everybody reports a crime, so you only get a particular sample. And if there is a particular initiative, for instance the big initiatives around domestic violence that run every time there are international football matches, reporting goes up, and everyone says that domestic violence is related to football. But is it, or is it just related to the fact that everyone tells you that you can report, and that there is zero tolerance of domestic violence during football matches? It's more likely to be recorded.

Then you get feedback loops. The classic one at the moment is knife crime in London: because knife crime has gone up the agenda, more money and resource will go into knife crime; at some point that will probably go down and something else will go up, because there is a finite amount of resource. The research you do on administrative data creates these feedback loops, and people don't always remember that when they come to interpret research.

Jude and David presenting on Messy Data

David: The majority of data within psychology that measures people is messy because people are messy, particularly for social psychological phenomena; there is always noise within it. The challenge is often trying to get past that noise to understand what might be going on. This is true of administrative data and of data you collect in a lab. Probably the only exception in psychology is very, very controlled experiments, maybe in visual perception, where the measurement is very fine-grained, but almost everything else in psychology is by its nature extremely messy, and data never look like they appear in a textbook.

Q: So there is always that ‘noise’ in research data, regardless of whether you use external data such as NHS data or collect data yourself, unless, as you say, it is in a very controlled environment?

David: Yes. And I guess that within psychology there is an argument that if the data are collected in a very controlled environment, is that actually someone’s real behaviour, or is a less controlled environment more ecologically valid? You’ve always got that balance to try and address.

Q:  So what are the advantages, why do you work with messy data?     

Jude: Sometimes because there is nothing else. [laughs]

David: Because there is nothing else. I think psychology generally is going to be messy because, as I said, people aren’t perfect; you know, they are not perfect scientific participants. Participants are not 100% predictable, and neither are social phenomena. There are very few theories within social psychology, in fact I don’t think there are any, that are 100% spot on.

When you compare that to say physics, where there is Newton’s law, where there are governing theories, which are singular truths, which explain a certain phenomenon. We don’t have much of that in psychology!  We have theories that tend to explain social phenomena but people are too unpredictable.  There are good examples of where theories have held for a long time but it is never a universal explanation.

David presenting

Q: What are the implications for management of that kind of messy data?

David: I think the implications are that you make sure it is clear to people how you got from the raw data, which was noisy or messy, to something that resembles a conclusion. So that could be: how did you get from X number of observations, which you boiled down to an average, to what you then analysed? What is the process behind that? It’s not just about running a statistical test; it’s about the whole process, from ‘this is what we started with’ to ‘this is what we ended with’.

Jude:  I think that’s right, I think being very clear about what your data can and cannot support and be very clear that you are not producing facts, you are testing theory, where everything is iterative, a tiny step towards something else, not the end. You never get to the end.

David: I think researchers have a responsibility to do that, and people have to be careful about the language they use to convey how that has happened. A good example at the moment is the current debate, with a lot in the press, about the effects social media has on children or teenagers; the way it is measured and the language used to talk about it are, to me, totally disconnected. That behaviour isn’t really measured. It is generated by people providing an estimate of what they do, yet we know that that estimate isn’t very accurate. The conclusions which have been drawn are that this is having a big effect on people. I’m not saying it’s not having any effect, but it’s not as exciting to say: ‘well actually the data are really messy, or not perfect, so we can’t conclude very much’. Instead it’s being pushed into saying that [social media] is causing a massive problem for young people, which we don’t know. That is why there is a responsibility for that to be clear, and I don’t think in that debate it is clear, and I think there are big consequences because of it.

Jude and David at Jisc panel discussion

Q: So in your dream world, what would change, so we could work better with this kind of data?

Jude: I think we need better statistical literacy, across the board. This is what I did with my Masters students: I told them to go and find a paper or media story which used statistics, then critique it. So, how do you know what someone is telling you is ‘true’? Why are they telling it to you in that particular way? What data have they used? What have they excluded?

You go to the stats literature and they talk about outliers as though they are a mere statistical phenomenon, but those decisions are often political and they massively change what we know, and nobody talks about that; nobody sets out exactly what it means. The only official statistics for crime in England and Wales are currently capped at a maximum of five incidents. If you are beaten up by your partner 40 times a year, only the first five are included in the count, which is a huge bias in what we know about crime, and then in the way resources are distributed between different groups and in which crimes appear to be going up and which falling. I think this lack of people questioning statistics in particular, and data more generally, is a real problem. In our social science degrees we just do not teach undergraduates how to do that. We do it with qualitative data, but we don’t do it with quantitative data. It’s exactly the same process, exactly the same questions, but we just don’t do it; we are really bad at it in Britain!
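
As a rough illustration of the capping effect Jude describes, here is a minimal sketch with entirely hypothetical figures:

```python
# Minimal sketch, with hypothetical figures, of how capping incidents per
# respondent changes an estimated total.
incidents_per_respondent = [1, 1, 2, 40]  # one respondent reports 40 incidents in a year

uncapped_total = sum(incidents_per_respondent)
capped_total = sum(min(n, 5) for n in incidents_per_respondent)  # cap each series at five

print(uncapped_total, capped_total)  # 44 vs 9: the cap hides most of the repeat victimisation
```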

David: I think, more generally, there is a cultural issue within the whole ethos of science: how it gets published, what becomes read and what doesn’t. So say I go and do a paper and find no relationship between social media use and anxiety. That would be harder to publish than a paper finding a tiny correlation between anxiety and social media, one which is probably spurious and not even relevant. So this comes down to criticising what is out there, but also to what is more sellable or has more ‘impact’. I use the word impact in inverted commas: what sounds more interesting, but might actually be totally wrong. I think what is pushed is what is more interesting rather than what is true. It’s worth remembering that science is about getting a result and trying to unpick it, looking at what else could explain it and what we might have missed, rather than saying ‘that’s it, it’s done’. It’s similar to what Jude was saying about a critical thinking process.

Q: Following on from what Jude said about the skills gap: you say that undergraduates are not taught the skills they need. Therefore, by the time we get PhD students and early career researchers, might this gap have widened even further?

Jude: Yes, and they don’t use quantitative data, or they use it really uncritically. Lots of post-graduate students who work on domestic violence won’t use quantitative data, but their thesis often starts with ‘one in four women will experience domestic violence in their lifetime’ or ‘two women a week are killed by intimate partners’, but they don’t know where that data comes from, how reliable it is or how it was produced; it is just parroted.

David: I can give a similar example, where it is sometimes difficult to take numbers back once they become part of the common discourse. Years ago we found that people check their smartphone 85 times a day on average. That was a sample of about thirty young people. We obviously talked about that, but that number is now used repeatedly. There is no way that my grandmother or my parents check their phone 85 times a day. But that sample did, so there is now this kind of view that everyone checks their phone 85 times a day. They probably don’t, but I can’t take that back now; there are things you don’t know at the time, but that is what the data showed. It’s tricky to balance: it was picked up as an impactful thing, but it wasn’t what we really meant.

Q: Is there also a job for you as a researcher, if your findings are picked up by media looking for catchy, easy numbers, to write your paper differently so that it is not picked up so easily? Or is it the fault of the media, because they are just looking for a simplified version of a complex issue?

David: There is a cultural issue, a kind of toing and froing, because we want our work to be read, and writing a press release is certainly one way of achieving that. I think what you put in the press release has to be even more refined, because a lot of people won’t read the paper, but they will see the press release, and that will be spun. Once the press release is done, it’s out of your control in some ways. You can get it as right as you want, but a journalist might still tweak it a certain way. It’s a really tough balance because, as you say, the other extreme is to say ‘I am just going to leave it’. But then people might not hear about the work, so it’s a very tricky tightrope to walk.

Jude: When our work started getting picked up by the media, we made the decision as a Centre that we would not talk to the media about anything that had not been through peer review, so it is always peer reviewed first. We work closely with one person from the press office, all the way through the process of putting the paper together and deciding on the press release and how we are going to release it. What we have now got is contacts in several newspapers and media outlets, and we say we will work with you exclusively provided this is the message which goes out. We have been successful enough that we’ve now got two or three people on board who will do that with us. They get exclusives provided we see the copy before it goes public.

Jude

David: That is very hard to do, but really good.

Jude: We have been really hardcore, and we’ve had a lot of pressure to put stuff out earlier, to make a bigger splash, to go with more papers. It was only because we resisted that, I think, that in the long run it has been much better, although it is hard to resist the pressure. In our early work the press wanted our trends, but we wanted them to talk about the data; we wouldn’t release the trends unless they talked about the problems of bias in official statistics. So we married the two. They didn’t want it, but that was the deal.

David: It’s like when you say: ‘people do X this number of times’ then you can’t put in brackets ‘within the sample’ so I understand where journalists come from and I understand the conversations with the press. To me as I said it’s like walking a tightrope. It has to be interesting enough that people want to read it, but at the same time it needs to be accurate.

Jude: But that’s the statistical literacy, because you want someone reading a media story going ‘Really? Well how did you get that?’ That’s something we would do as academics when you are reading it. People are always telling me ‘interesting facts’ about violence and my first reaction is always: ‘Where has that come from?’ These questions should become routine. I think journalist training is terrible!  I mean I have spent hours on the phone with journalists, who want me to say a really particular thing, and its clearly absolute nonsense! But they have got two little bits of data and they have drawn a line between them.

David: I have had a few experiences where journalists have tried to get a comment about someone else’s work and I have said things like ‘I don’t think this is right’, or I’ve been critical, and the journalist has said, ‘well, really what we are looking for is a positive comment’. And I’ve said ‘well, I’m not going to give you one’, and they have said ‘alright, bye then’, and have gone and found someone who will. That doesn’t happen very often, but you can see what they are hoping for. Presumably, some of the time, I have said things where I have been really critical and those have been used. The BBC are quite good at that; they get someone who they know is going to be critical without having to explicitly say anything negative.

Q: This has been fascinating; we have been through the whole life cycle of data, from creation to management and now to digestion by the media. This tells us that data management issues are fundamental to the outputs of research.

Jude: I think it impacts on the open data agenda, though, because if I was going to put my data out, the caveat manual which came with it would be three times the size of the data. Again, you don’t have any control over how someone presents an analysis of that data. I think it’s really difficult because we are not consistent with good practice in reporting on the messiness of data.

David: I think there is a weight of responsibility on scientists to get that right, because it does affect other things. I keep using social media as an example. The government are running an enquiry at the moment into the effects of screen time and social media. If I were being super critical I would say it’s a bit early for an enquiry, because there isn’t any cause-and-effect evidence. Even some of the studies they report on the enquiry’s home page are totally flawed; one of them is not peer reviewed. That lack of transparency or statistical literacy, even among Members of Parliament, is clearly leading to things being investigated where actually we could be missing a bigger problem. That is just one example, but there is a lot of noise about it, a lot of ‘this might be a problem’, or ‘is it a problem?’, right through to ‘it definitely is a problem’, without anyone standing back and asking, ‘actually, is this an issue, is the quality of the evidence there?’

David

Jude: Or can you even do it at the moment?

David: Yes, absolutely! That is a separate area and there is a methodological challenge in that.

Jude: We get asked to measure trafficking in human beings on a regular basis; we’ve even written a report that said you can’t measure it at the moment! There is no mechanism in place that can give you any data that is good enough to produce any kind of measure.

David: But that isn’t going to make it onto the front of the Daily Mail. [laughs]

Q: Maybe just to conclude our interview, what can the university do? You mentioned statistical literacy as one thing. Are there other things we can do to help?

Jude: We are starting to move a little bit in FASS [Faculty of Arts and Social Sciences] with some of the Research Training Programme, and I think things like the Data Conversations, which are hard to measure, are actually having a really good impact. Drawing people in through those kinds of mechanisms and then setting up people who are interested in talking about this would be good. I would like to see something around… what you need to tell people about your data when it’s published; you know, the caveats: what it can and can’t support, how far you can push it.

David: I think the University as a whole does a lot; certainly in Psychology it is preaching to the converted, in a way. I would like a feature in Pure [Lancaster University Data Repository] so that when you upload a paper it asks ‘have you included any code or data?’, just as a sort of ‘by the way, you can do that’. One, it tells people that we do it, and two, it reminds people who aren’t doing it; it would be useful to have a tick box just to see why. Obviously there are lots of cases where you can’t do it, but it would be good for that to be recorded. So is it ‘I can’t do it because the data is a total mess’, some other reason, or ‘I’m not bothered’? There is an issue here about why not, because if it has just been published it should be in a form which is sensible and clear.

Jude: I wonder if there is some scope in just understanding the data, so maybe a Data Conversation specifically about qualitative data, and then other, even more obscure forms, like literature reviews as data, because I still keep thinking about when you told me you offered to do data management with FASS and you were told they didn’t have any data.

I think that people don’t think about it as data in the same way and it would be really good to kind of challenge that.  I think data science has a massive problem in that area, it has become so dominant, and if you’re not doing what fits inside the data science box you’re not doing data and you’re not doing science and it’s really excluding.  I think for the university to embrace a universal definition of data would be really, really, beneficial.

David: It’s also good for the University, [to] capitalise on that extra resource; it would have a big effect on the institution as a whole.

Jude, David, thank you very much for this interesting interview!

Jude and David presenting

Jude and David have also featured in previous Data Interviews.

The interview was conducted by Hardy Schwamm, Research and Scholarly Communications Manager @hardyschwamm. Editing was done by Aniela Bylinski-Gelder and Rachel MacGregor.

Data Interview with Andrew Moore

Andrew Moore (@apmoore94) is a 2nd year PhD student at Lancaster University within the School of Computing and Communications. He is studying how sentiment analysis can be improved through world knowledge using finance as his specialised domain. His research interests are across Natural Language Processing, Machine Learning, and Reproducibility.

We talked to Andrew after he presented at the 3rd Data Conversations.

Q: When does software become research data in your understanding?

Andrew: As soon as you start writing software towards a research paper, I would count that as research data.

Q: Is that when you need the code to verify results or re-run calculations?

Andrew: You also need the code to clean your data, which is just as important as your results, because how you clean your data shapes what your results are going to be.

Q: And the software is needed to clean the data?

Andrew: Yes. The software will be needed for cleaning the data. So as soon as you start writing your software towards a paper that is when the code becomes research data. It doesn’t have to be in the public domain but it really should be.

Q: What is the current practice when you publish a paper? Do you get asked where your software is?

Andrew: Recently we have, actually, for some of our conferences in the computational linguistics or Natural Language Processing field. But it is not a requirement for getting published. It is a friendly question rather than an obligation.

Q: Who is asking, the publisher?

Andrew: No, that’s the conference chairs who are asking but it is not a requirement. Personally I think it should be. I can understand in certain cases when for instance there are security concerns. But normally the sensitivity is on the data side rather than the software.

Q: At the moment if you read a paper the software that is linked to the paper is not available?

Andrew: If there is software with the paper, the paper will normally have a link, usually on the first or the last page. But a large proportion of papers don’t have a link. Where there is one it is normally a link to GitHub, maybe 50 per cent of the time. Other than that you can dig around if you’re really looking for it, perhaps Google the name, but that’s not really how it should be.

Q: So sometimes the software is available but not referenced in the paper?

Andrew: That’s correct.

Q: But why would you not reference the software in the paper when it is available?

Andrew: I am really puzzled by this [laughs]. I can think of a few reasons. One of them could be that the GitHub instance is just used as a backup. The problem I have with that is that if it is not referenced in the paper, how much do you trust the code to be the version that is associated with the paper?

The other problem with GitHub is that even if you reference it in a paper, you can keep changing the code, and unless you “tag” it on GitHub with something like a version number and reference that tag in your paper, you don’t know which version is the correct one.
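
A minimal sketch of the tagging Andrew describes, with a hypothetical version number and assuming the code lives in a local git repository with an ‘origin’ remote:

```python
# Minimal sketch: create an annotated git tag for the exact code used in a paper
# and push it, so the paper can reference that tag rather than a moving branch.
# The version number and message are hypothetical.
import subprocess

def tag_release(version: str, message: str) -> None:
    subprocess.run(["git", "tag", "-a", version, "-m", message], check=True)
    subprocess.run(["git", "push", "origin", version], check=True)

if __name__ == "__main__":
    tag_release("v1.0.0", "Code as used for the experiments reported in the paper")
    # The paper can then link to the repository's .../releases/tag/v1.0.0 page.
```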

Q: What about pushing a version of the code from GitHub to [the data archiving tool] Zenodo and get a DOI?

Andrew: I didn’t know about that until recently!

Andrew presenting at Data Conversations

Q: So this mechanism is not widely known?

Andrew: I know what DOIs are but not really how you can get them.

Q: So are the reasons why software isn’t shared to do with a lack of time, or are they more technical, as we have just discussed, to do with versions and ways of publishing?

Andrew: I think time and technical issues go hand in hand. Being technically better takes time, and doing research takes time. It is always a trade-off between “I want my next paper out” and spending extra time on your code. If your paper is already accepted, that is “my merit”, so why spend more time?

But there are incentives! When I submitted a paper to an evaluation workshop I said that everybody should release their software, because the workshop was about evaluating models, so it makes sense to have all the code online. In the end it was decided that we shouldn’t enforce the release, but it was encouraged, and the argument was that you are likely to get more citations: if your code is available, people are more likely to use it and then to credit you by citing your paper. So getting more citations is a good incentive, but I am not sure if there are studies proving that releasing software correlates with more citations?

Q: There are a number of studies proving there is a positive correlation when you deposit your research data[1]. I am not aware there is one for software[2]. So maybe we need more evidence to persuade researchers to release code?

Andrew: Personally I think you should do it anyway! You spend so many hours on writing software, so even if it takes you a couple of hours extra to put it online, it might save somebody else a lot of time doing the same thing. But some technical training could help significantly. In my experience, the better I have got at software development, the quicker I have become at releasing code.

Q: Is that something that Lancaster University could help with? Would that be training or do we need specialists that offer support?

Andrew: I am not too sure. I have a personal interest in training myself but I am not sure how that would fit into research management.

Q: I remember that at the last Data Conversations Research Software Engineers were being discussed as a support method.

Andrew: I think that would be a great idea. They could help direct researchers. Even if they don’t do any development work for them, they could have a look at the code, point them in the right direction and suggest “I think you should do this or that”, like re-factoring. I think that kind of supervision would be really beneficial, like a mentor, even if they are not directly on that project. Even, say, ten per cent of their time on a project would help.

Q: Are you aware that this is happening elsewhere?

Andrew: Yes, I did a summer internship with the Turing Institute and they have a team of Research Software Engineers.

Q: And who do the Research Software Engineers support?

Andrew: The Alan Turing Institute is a partnership of five universities and acts as the national institute for data science in the UK. It has its own researchers but also associated researchers from the five partner universities. The Research Software Engineers are embedded on the research side, integrated with the researchers.

When I was an intern at the Turing Institute one of the Research Software Engineers had a time slot for us available once a week.

Q: Like a drop in help session?

Andrew: Yes, like that. They helped me by directing me to different libraries and software to unit test my code and create documentation, as well as explaining the benefits of doing this. I know that other teams benefited from their guidance and support on using Microsoft Azure cloud computing to facilitate their work. I imagine that a lot of time was saved by the help that they gave.
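
As a rough sketch of the kind of unit test being described, with a hypothetical cleaning function and values:

```python
# Minimal sketch of a unit test for a data-cleaning step; the function and the
# expected values are hypothetical.
import unittest

def normalise_token(raw: str) -> str:
    """Hypothetical cleaning step: strip whitespace and lower-case a token."""
    return raw.strip().lower()

class TestNormaliseToken(unittest.TestCase):
    def test_strips_whitespace_and_lowercases(self):
        self.assertEqual(normalise_token("  Profit \n"), "profit")

if __name__ == "__main__":
    unittest.main()
```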

Q: Thanks Andrew. And to get to the final question. You deposited data here at Lancaster University using Pure. Does that work for you as a method to deposit your research data and get a DOI? Does that address your needs?

Andrew: I think better support for software might be needed on Pure. It would be great if it could work with GitHub.

Q: Yes, at the moment you can’t link Pure with GitHub in the same way you can link GitHub with Zenodo.

Andrew: When you link GitHub and Zenodo does Zenodo keep a copy of the code?

Q: I am not an expert, but I believe it provides a DOI for a specific release of the software.

Andrew: One thing I think is really good is that we keep data in Lancaster’s repository. In twenty years’ time GitHub might not exist anymore, and then I would really appreciate a copy stored in the Lancaster archives. The assumption that “it’s on GitHub, it’s fine” might not be true.

Q: Yes, if we assume that GitHub is a platform for long-term preservation of code we need to trust it, and I am not sure that this is the case. If you deposit here at Lancaster, the University has a commitment to preservation and I believe that the University’s data archive is “trustworthy”.

Andrew: So depositing a zipped copy of your code is a good solution for now, but in the long term the University’s archives could be better for software. An institutional GitLab might be good and useful. I know there is one in Medicine, but an institution-wide one would help. It would be nice if Pure could talk to these systems, but I can imagine it is difficult.

The area of Neuroscience seems to be doing quite well with releasing research software. They have an opt-in system for the review of code. I think one of the Fellows of the Software Sustainability Institute was behind this idea.

Q: Did that happen locally here at Lancaster University?

Andrew: No, the Fellow was from Cambridge. They seem to be ahead of the curve, but it only happened this year. They seem to be really pushing for it.

Q: Thanks a lot for the Data Interview Andrew!

The interview was conducted by Hardy Schwamm.

[1] For example: Piwowar, H. A., & Vision, T. J. (2013). Data reuse and the open data citation advantage. PeerJ, 1, e175. http://doi.org/10.7717/peerj.175

[2] Actually there is a relevant study: Vandewalle, P. (2012). Code sharing is associated with research impact in image processing. Computing in Science & Engineering. http://ieeexplore.ieee.org/document/6200247/

Data Interview with Alison Scott-Baumann and Shuruq Naguib

Our latest Data Interview follows up a presentation at our 2nd Data Conversation. Alison Scott-Baumann (Professor of Society & Belief, SOAS) and Dr Shuruq Naguib (Lecturer in Politics, Philosophy and Religion, Lancaster) are working on the Re/presenting Islam on Campus project. Re/presenting Islam on Campus is a three-year project funded by the Arts and Humanities Research Council (AHRC) and the Economic and Social Research Council (ESRC). It explores how Islam and Muslims are represented and perceived on UK university campuses.

We had the opportunity to discuss research data issues surrounding their project. It turned out to be a highly interesting conversation on topics such as confidentiality, the limits of anonymisation, legal frameworks and the freedom of speech.

Q: Could you describe the aims of your project?

Alison: Thanks for inviting us. It is strange to be on the receiving end because we have been doing a lot of data collection where we put people at ease and now we are at the other end.

About four years ago I became concerned about the increasing surveillance culture around Muslim communities, particularly on campus, because that has, or could have, an impact on free expression. To me, as an experienced researcher, this seemed to be a politicisation of a research field: you generally identify Muslims as the “official other” and also, with the 2015 Counter-Terrorism and Security Act and its attendant Prevent duty, tell us that they are dangerous. What is currently not acknowledged is that the Prevent duty is actually not compulsory, but the university sector has adopted it in order to keep its reputation clean.

So it is quite a difficult topic and the project aims to look at four major questions:

  1. What do university staff and students know about Islam?
  2. Where do they find that information?
  3. With specific reference to three issues, how do they formulate their opinions? The first issue with regard to Islam is gender, because that is often in the media: the whole hijab discussion, for example. The second is radicalisation; there is no point ignoring it, because … [even though] there is no evidence that anybody gets radicalised on campus. And the third is inter-faith, because relations among students of different faiths, and intra-faith relations, are also of interest to us: we live in a very secular culture, and yet for many young people their faith identity is important, more important than we realise because of the secular atmosphere that we have created on campus.
  4. The fourth question is, given that there might be some discrepancies self-identified by our participants in their responses to the first three questions, what could be done to improve the quality of the discussion on campus about Islam? How could we improve the discussion about anything that is regarded by university authorities as risky?

So right from the start, when I built the team, we were all thinking about issues around Islam but also about the implications for free speech on campus. That turned out to be a big issue, because it is discussed more and more, even in the press.

Alison Scott-Baumann

Q: How long does the project run?

Alison: It is a 3 year project from 2015-2018. We are two thirds through.

Q: What kind of data do you need to answer your research questions?

Shuruq: We have two sets of data. We have actually completed data collection. We collected quantitative data through a survey questionnaire. It was designed to be sent to the six universities which are participating in the research. Before we received the grant, and throughout the first year, we were in conversation with the gatekeepers at those universities, who were usually senior managers. They promised to facilitate the research, including the survey to staff and students.

When we started on-site research we also wanted to do the questionnaire at the same time, but the gatekeepers withdrew their collaboration. The gatekeepers tried to get approval from the vice-chancellors and senior management, but we came across a problem on several sites, what some describe as survey fatigue. They were worried about students and staff receiving too many requests to fill in questionnaires. It seemed that universities were very reluctant to facilitate our surveys.

We had to redesign the questionnaire so that it was no longer specific to the case studies; it was now a nation-wide questionnaire targeting students only, and we went to a private company to do that. The private company had access to students and could build up a sample for us. For example, we wanted our sample to include Muslims and non-Muslims and equal representation of gender, among other criteria that we had in mind. We decided not to do the staff questionnaire because you can’t do that through the private companies and the universities were refusing to help. We had to make these decisions because of that particular challenge.

The other subset of data, which is qualitative, is based on interviews, focus groups, ethnography and curricular material. On each of the six campuses we interviewed 10 students and 10 members of staff. We attempted to handpick staff according to an ideal list representing a mix of administrative and academic staff, senior and junior staff in different departments, Human Resources, deans and postdocs, etc. The student interviewees were recruited through emails sent via the student union or were invited by researchers; it was a random sample. There were four focus groups on each site, one with staff and three with students. We wanted one focus group to be with Muslims, one with non-Muslims and one mixed. We didn’t always achieve all types, and we faced a real challenge in recruiting students. Sometimes non-Muslim students weren’t at all interested in religion or Islam. We tried different techniques, such as focus groups in cafes or other hang-out spaces for students, but if participants are not interested in your topic, no matter how you promote it, it’s really challenging! You might get a self-selected sample of participants who are interested in that topic.

Then we’ve also done ethnography, which included observing the sites where students are, talking to different student societies and talking to a wide range of university staff. We attended public events, observing and describing them: who attends, who the speakers are, especially if the events are related to topics of religion, Islam or freedom of speech.

Part of the research is also how Islam is studied in the classroom. For each campus we attempted to collate data about all the courses that included a component on Islam. For a long time we used to call this “Islamic Studies”, but we don’t mean Islamic Studies in a narrow sense, we mean it in a broad sense. We changed the label for that category of data to “Studying Islam” to broaden it out to include, for example, a course in the Faculty of Medicine on “Religion and Health”. We collected material through desktop research on all the courses offered in the year of the field work which have a component on Islam or religion.

Then we tried to zoom in on some modules reflecting a range of disciplines and approaches, collecting course programmes and syllabi for further analysis. Within that sample we also attended some of the classes to observe the actual teaching and how the students respond. So we have a very complex set of data, we are just about to start the analysis stage, and there are quite a few challenges there too.

Q: You have collected a wide range of data, from publicly available information to sensitive data like views on religion. Does that have an impact on how you manage your data?

Alison: There are challenges of managing that data but also of collecting it. When I submitted the research proposal to AHRC that was a year before the Counter-Terrorism and Security Act was passed. When I was awarded the grant that act had been passed. So a situation on campus that had already been quite sensitive arguably becomes more so. We were determined as a team to protect the identity of participants and we have established a sequence of events which we hope maximizes that possibility. We do tell our participants that they have to accept that it is actually impossible for us to be completely sure that we can protect them. Because if somebody wants to hack and they have money and expertise then they can get access to stuff.

But I’ll run you quickly through how we do things. There are only two documents that have the allocated number given to a participant and their name. One of them is the consent form. That is kept away from the university, locked up. The other document that has their allocated number and their identity is an Excel spreadsheet which is kept in a virtual vault which has all their characteristics except their political views. We are not collecting political views which the 1998 Data Protection Act lists as something that should be protected. So we are acting in accordance with that Act by seeking to protect their identity.

Once we’ve done that, we tell them before they speak that they have the right to withdraw and the right to anonymity and confidentiality, and we give them a timeline: they have six months in which they can say “I’m actually not comfortable with this”, but nobody has done that. What we cannot be sure of, of course, is who the people are who walked away from the possibility of speaking to us. It could be the silent majority; we will never know. We have worked through the student unions to reach interested students, but if something pops up on their screens regarding opinions on Islam, there are people who might think “I don’t want to enter that arena” for all sorts of different reasons.

Q: Can you expand on your data security and confidentiality measures?   

Alison: We keep our master spreadsheet encrypted via VeraCrypt which is a non-aligned programme unlike BitLocker which belongs to Microsoft.

In order to conduct an interview or a focus group we allocate a number to each person and before we did this we thought participants will find this ridiculous. But actually, with focus group people find it liberating which is the ideal. Every time they spoke they said “Number 32 speaking” and they would even say things like “I would like to endorse what Number 42 has just said”. That was perfect!

Q: Instead of a name badge people would wear a number?

Alison: No name badge, but a numbered Post-it note on the table in front of them, and we know who they are if we want to track back. That worked much better than we thought it possibly could.

Then, because it is a lot of material, the interviews and focus groups are transcribed by a company called Divas. They have their own confidentiality agreement and we created one from SOAS as well. Divas destroy the original audio files after a couple of weeks. We keep them but will destroy them some time in the second year. They will never be archived.

After the transcripts come back to us we have to clean them up. We have to take out any mention of names.

Shuruq: Let me add to that. Two issues have come up when cleaning the data.

Q: By cleaning do you mean anonymising?

Shuruq: Yes, anonymising and removing any identifiers. Even when we use numbers in the focus groups, participants will refer to sites on their particular campus, which makes locations identifiable. Or they would refer to a lecturer by name or to a course title. These are all ways by which confidentiality on that campus could be undermined. So we weren’t anonymising just the participants but also ensuring the anonymity of the campuses. Although the campuses are all named in our research, we have agreed that when we come to write up the findings we will not identify them, because of sensitive issues such as how a university implements Prevent policies. There could be some negative opinions, some difficult experiences, and we don’t want to link those to specific campuses. So we are cleaning the data more extensively than normal, perhaps.

Shuruq Naguib

It is quite challenging, because as you strip down the data you lose context. If there is a university in Wales, the Welsh context has certain factors that are important to remember when you are analysing the data. Or a specific college in London: how do we handle that? We were negotiating the cleaning of the data with regard to gender, ethnicity, background and names of places. We tried to replace these with descriptions that convey the relevant element but maintain anonymity. If it is a café, we would strip out the name but still reflect the fact that it is a café in a student union.
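
A minimal sketch of the kind of replacement being described, with entirely invented identifiers and labels:

```python
# Minimal sketch of replacing identifying strings in a transcript with generic
# category labels; every name and label here is invented for illustration.
replacements = {
    "the Willow Cafe": "[a cafe in the student union]",
    "Dr Example": "[a lecturer in the department]",
    "Introduction to Islamic Law": "[a course with a component on Islam]",
}

def clean_transcript(text: str) -> str:
    for identifier, label in replacements.items():
        text = text.replace(identifier, label)
    return text

print(clean_transcript("We spoke to Dr Example after class at the Willow Cafe."))
```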

But sometimes, especially with interviews we’ve had people who have roles, for example a student who is the Head of a Society or who is active on campus, is well-known and speaks in a certain way. Even if we clean the transcription if we want to quote him he might still be identified by his peers and people who know him.

And then one of the things we are coming up against is transliteration because as we look at how Islam is studied, some of the courses are linked with language training and attract overseas students. It is normal to hear different languages in this context. In an interview different languages could be used. Most of our team members speak several languages so participants have felt at ease using other languages. So how do we transliterate or translate? Sometimes it’s copious work. Some of the terms used in Arabic have specific religious connotations.

This is also sensitive data because often Arabic is perceived suspiciously as a sign of being foreign, as a sign of being a bit radical or of being committed to certain religious concepts. Do you keep the Arabic in the data? Certain words like Hijab and Jihad are loaded with negative connotations in public discourses. On some occasions we made the decision not to send a particular interview to the transcriber because it would endanger the person because they have expressed political views or they used a language that might be misunderstood. To protect the identity of that particular person on one occasion, our postdoc decided to transcribe the interview herself.

Q: Will you be able to share your data?

Alison: It will go into the UK Data Archive. That is a commitment we made to the AHRC and the ESRC who are partly funding us. There are definitely difficulties in assessing the risk of re-identification because it is impossible for us to know how recognisable somebody is to their colleagues or their friends by the way they are expressing themselves.

Q: Can I just confirm that you will share only transcriptions?

Alison: Yes, no audio, no video. But also, we haven’t decided what level of sharing is needed. We have already discussed this with the UK Data Archive and they have three access levels. Our data will not be Open Access. Some of it might be open to all registered users; other data might be accessible to approved researchers only. There might be two tiers. I think our concern all the way through was not that anybody has said anything dangerous, because nobody has, but that it might be construed as overly political by somebody who is looking at that data. If one of our participants has a view on foreign policy that doesn’t concur with the Government’s, in a democracy that should be possible, but it may be problematic in the current climate.

Q: Thanks for the explanation. What kind of research data services can Lancaster University offer to help your project?

Alison: I am personally very interested in the General Data Protection Regulation (GDPR), which will come into force in 2018. It appears to be inviting member states to decide whether they tighten up on consent. This is an issue to do with Big Data and the way in which it is now possible for all of us to covertly record, film or track each other. Anything is possible now. So the issues around consent may impact upon our ethnography. We did nothing covertly, but inevitably, if we were in a big open meeting, we may have made notes about something somebody said, and even if we don’t identify them we haven’t asked their consent. We would like guidance on whether this is going to clamp down on issues around consent, or whether it is business as usual, which means that if you go to reasonable lengths to protect somebody’s identity then that is acceptable.

We would also like you to be our critical friend [laughs]. We have a year to go. I think we are well prepared and we worked really hard on this aspect but there may be issues that we haven’t covered.

Project website: http://representingislamoncampussoas.co.uk

Q: Can I ask about the ethnography, field notes and observations, will you be able to share them?

Alison: I’ll give you a specific example. At campuses where it was possible, we secured the approval of members of staff to allow us to sit in a lesson. The students were told when we were there, but we didn’t ask each of them to sign a consent form. For example, a student in one class I was in, about international politics, described how her relatives were caught up in border violence in Eastern Europe. I didn’t have her name, but I made a note that this was an example of how a really difficult issue can be taught so well, and the trust between students and staff be so high, that a student can self-disclose.

But it might be necessary under the new General Data Protection Regulation to remove that and simply say that there was evidence that trust was high, rather than giving the specific example. To me it doesn’t seem that I am endangering that person’s identity, absolutely not.

Shuruq: And the other difficulty is of course that we have also done ethnography at public events which could have been organised by the chaplaincy or a student society. Again, if you wanted to identify these events that can be done. These societies often set up event pages.

It could also be a lecture on Islam and the media, which was one of the public lectures I attended. The speaker is well known and the event was well publicised. My observations look at the discussions and the kinds of questions that emerged, and at how the audience was made up (mostly Muslims; very few white students attended that talk). The ones who are interested in Islam in the media are those who are affected by the media representation, which largely means Muslim students on campus.

How do you keep the aspects of the context that shed light on the meaningfulness of this event, and which make the ethnography useful, without undermining anonymity?

Q: One final question: in our training sessions we often hear the concern that if you include a statement in a consent form saying that anonymised data will be shared publicly, you might get fewer participants. Is that something you have experienced?

Alison: No, participants accept that. The point is that if they come to meet us, if they made that step that means that the information that was sent out by staff or student bodies has convinced them that this is an ethically planned project where we are not going in with preconceptions. If we then say that anonymised data will be shared they accept that.

The issue I am raising, which the ICO [Information Commissioner’s Office] hasn’t really clarified, is whether you would have to get a consent form from thirty people in a classroom, which at one level is a reasonable extension of consent issues but challenges our understanding of ethnography.

Shuruq: Of course we don’t collect any information on the students; we don’t know who they are. But the course outlines and lecturer names will not be anonymised in class ethnography, so that is something we need to reflect upon. The other thing is that the lecturer of one class asked whether we were allowing students to withdraw from the class and whether we were asking for their consent. Our team member asked for verbal consent and the lecturer gave students the opportunity to stay or to withdraw from the class. So this could be an issue for some people.

Alison presenting at Data Conversations

Q: Do you have any final comments on your project with regards to data?

Shuruq: On one campus, at a private university, they had a previous experience of research where the anonymity of some of the interviewees was not protected, and the way they were represented in the book that came out of the research was very negative. They were extremely reluctant to allow us in without sufficient guarantees that we were going to protect their identities. But we are facing a serious dilemma, because it is such a unique campus that it is impossible to report anything on it without revealing which one it is. That is a serious challenge.

Alison: Just to follow on from that. We mentioned free speech right at the beginning. These strictures, which are ethically motivated, like the possible new legislation [GDPR] about consent, are at one level eminently sensible, but at another level they may make it almost impossible to do research on people’s ability to express themselves freely. If people can’t express themselves freely, because it might compromise them or their institution, then we can’t do the research. So it is a very clever double bind, but it’s not good for democracy, because the ability to express oneself freely has possibly become, in the public eye, the ability to have a strong opinion about something, instead of what I think it is, which goes right back to Socrates: talking something through in order to understand it better and to understand your own decision-making processes. For young adults at university the heuristic value of freedom of expression, as long as it is not rude or illegal, is absolutely paramount to having citizens who are able to conduct themselves wisely in this complex world! There are huge issues at stake here!

Alison, Shuruq, thank you very much for this interesting interview!

The interview was conducted by Hardy Schwamm @hardyschwamm

 

Data Interview with Jude Towers

Already our third Data Interview! This time with Dr Jude Towers. Jude is a Lecturer in Sociology and Quantitative Methods and the Associate Director of the Violence and Society UNESCO Centre. She holds Graduate Statistician status from the Royal Statistical Society, is an Accredited Researcher through the ONS Approved Researcher Scheme, and is Level 3 vetted by Lancashire Constabulary. Her current research is focused on the measurement of violence. Jude also presented at the first Data Conversations.

Q: Jude, what data do you currently work with?

Jude: The main data I work with is the Crime Survey for England and Wales. That is available on the UK Data Service. The different parts of it have different access requirements. The main questionnaire, which I now mostly use, is relatively straightforward. You can just download it and use it.

Then we comply with the Home Office and ONS [Office for National Statistics] recommendations about the sizes of cells for publication. They say there should be a minimum of 50 respondents in a cell before it is statistically analysed. You must ensure, if you’re doing cross-tabulations for example, that the numbers are sufficient that you couldn’t identify individual respondents. That is relatively straightforward and I would say it’s general good practice in dealing with that kind of data.
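
A minimal sketch, on synthetic data, of the kind of cell-size check Jude describes before a cross-tabulation is published:

```python
# Minimal sketch of checking that every cell of a cross-tabulation meets a
# minimum respondent count before publication; the survey data are synthetic.
import pandas as pd

MIN_CELL_SIZE = 50  # the minimum cell size mentioned above

def safe_to_publish(df: pd.DataFrame, row: str, col: str) -> bool:
    table = pd.crosstab(df[row], df[col])
    return bool((table >= MIN_CELL_SIZE).all().all())

# Synthetic example: 200 respondents split across two regions and two answers.
survey = pd.DataFrame({
    "region": ["North", "South"] * 100,
    "victim": ["yes", "no", "no", "no"] * 50,
})
print(safe_to_publish(survey, "region", "victim"))  # False: one cell is empty
```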

We have also used the Intimate Violence module, which is a self-completion module within the Crime Survey. For that there is a special level of access which requires training from what used to be the Administrative Data Liaison Service. That was a one-day training course in London and the signing of lots of different agreements. Then you access the data through your desktop computer, which has to have a static IP address, and everything is held on their server. You go into their server, you can’t bring anything out, and everything you do has to be done in there.

That means if you want to write a journal article using that data you have to write it inside their server. Anything that you produce using that data, whether it’s a presentation in PowerPoint, a table in a slide, all of that has to have approval from the UK Data Service before it can come off the server into any form of public domain. That has to be done each time you use it. It is quite onerous in some ways but is a very high level of security.

Jude Towers

Q: That data is already in an archive so there is no need to share it again. Is citing that data straightforward in case somebody wants to see the data that you used?

Jude: Yes, it’s straightforward to cite. If people want to have access to the raw data they’d have to be accredited in the same way I got accredited. We got the whole team accredited at the same time so we can share data as we produced the work. There is nobody in our team who isn’t accredited. There is no problem …. we can sit in front of the computer and look at that data as we’re trying to develop the work.

Q: So if I were to look at your screen here to view the data I’d have to have the accreditation.

Jude: Yes! Actually it’s interesting that some of these requirements are similar to the ones for police data.

We are doing a lot of work with Lancashire Constabulary. We as a team have just been vetted to Level 3, which gives us the same access as any serving police officer. We have direct access to raw data at the individual level. This is for two reasons. One is that you can ask for data that the police put together, anonymise and give you, but if you don’t know what data there is, it is really difficult to know what to ask for. The second reason is that being able to explore the data at that level means you can make links that you couldn’t otherwise make. You can find individual people in different datasets, which allows you to ask much more complex research questions, and then anonymise the results and take them out as a dataset.

That’s been quite an interesting process. First of all, you have to be vetted. Then you get your police access card. Rather than working on a secure server, what we now have are police laptops. We access the police server through the police laptop. Again, you can’t take anything out until it is anonymised. The laptop records every keystroke, so someone can see exactly who you have looked for and why you have looked for them.

Then there are requirements similar to some of the Home Office ones, such as being in a locked office without public access, so someone can’t look at what you’re doing over your shoulder while you’re doing it. I couldn’t take my police laptop and work in the Library. You can’t work on it in public spaces.

That’s quite interesting because we have just got two ESRC Studentships with Lancashire Constabulary and they will do the same. They go through Level 3 vetting and they’ll have the police laptops. But then we came across the problem of where to put them. They can’t go in an office with other PhD students who are not vetted. They are at different stages in their PhDs. So what we’ve had to do is make quite specific arrangements so that those students share a room that is locked. You can’t have someone else in the room who is not vetted!

Q: Is it more difficult in this case to cite data because the data is not in an archive like the UK Data Archive?

Jude: We haven’t yet done that in any official capacity, but we’ve had discussions. People can access the Crime Survey data. In some of the cases where we have produced new data, we have made data tables and can release those. So people can see the data we used, completely anonymised and aggregated to a very high level. If people want the raw data they can get accredited, or they can go to the UK Data Service. If people just want to re-run our statistical tests then the “semi-raw” data, if you like, is there.

Jude Towers

Q: Is that what you could do with the police dataset?

Jude: That is the conversation we are currently having with the police: is there any point at which that data can be released into the public domain? We haven't yet made agreements about that. I think what we'll end up doing will be very interesting. There are very few researchers who are doing it in this way; most people are given anonymised data that the police have anonymised themselves.

So we are doing a series of test cases, asking: as we increasingly aggregate and anonymise the data, at what level can it be put into the public domain and at what level is it still useful? We'll have to see if we can find a point where it is both still useful and able to go public. If we are able to do that then we'll put it into archives.
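[Editor's note: to make the aggregation and anonymisation step concrete, here is a minimal, hypothetical sketch of the kind of processing involved: individual-level records are rolled up into high-level counts and small cells are suppressed before anything leaves the secure environment. The column names and threshold are illustrative assumptions, not the team's actual procedure.]

```python
import pandas as pd

# Hypothetical individual-level records (never released in this form).
records = pd.DataFrame({
    "district": ["North"] * 6 + ["South"] * 2,
    "year": [2016] * 8,
    "offence": ["violence"] * 8,
})

# Aggregate to high-level counts: one row per district/year/offence.
table = (records
         .groupby(["district", "year", "offence"])
         .size()
         .reset_index(name="count"))

# Suppress small cells before release; the threshold is an illustrative choice.
MIN_CELL = 5
table["count"] = table["count"].where(table["count"] >= MIN_CELL)

print(table)  # the larger North cell survives, the small South cell is blanked out
```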

Q: That is really interesting!

Jude: Yes, but it is very clear in the ESRC Studentships that the police have the final say on that.

Jude at the first Data Conversations

Q: Do the police have a level of expertise and confidence in providing data and working with you? Does that work well?

Jude: It does work well. The police are in a really interesting position. They are systematically, some more quickly than others, moving nationally towards evidence-based policing and significantly improving their research capacity. At the moment they are doing that in two ways: one is by working closely with universities and the other is by more systematically training police officers and associate staff.

I am doing a lot of work with Leeds University on data analytics for the police and we are setting up CPD [Continuing Professional Development] for data analysts in the police so they can take a more systematic and academic approach to research questions. That's really interesting because the position analysts hold in their organisation tends to be relatively low, yet some of the things they are asked to do are just impossible.

So we are trying to give them the tools to say: you can't ask me for this when you don't collect it; or, you want me to evaluate something but nobody told me it was happening, so there is no data from before. We're getting them to think through the research process in order to influence how data analytics are used inside the police. It is interesting because there is a bit of a debate about whether the police really need data analysts or whether they can spend their money buying really good algorithms [which] will sort all this out. Our argument is that you need really good data analysts, because you need people who can explicate the implicit theories people have and are trying to test, and who can talk people through that research process.

In Lancashire Police those things are coming together. They are much more actively working with academics and they are much more systematically embedding academic research processes inside the institution. They have a Futures team that includes multiple PhD, MA and now even some undergraduate students. They have a list of research questions that they are interested in as an institution, and they are actively going out looking for people to do that research for them and to sit inside the police while they do it.

Q: That is really fascinating! Is there anything Lancaster University could do to help you or your colleagues with your research? Or does the set-up work for you?

Jude: I think it's OK. The sticky parts are things we are working through, for example around contracts: who owns the Intellectual Property? Who gets the final say over publications? We've been lucky so far in that we've negotiated these things, but I know in other areas they have been problematic, so getting clarity and setting up protocols is useful.

There's been some talk about setting up secure data hubs and I'm in two minds about it. In some ways they'd be really useful, but in other ways they are perhaps a bit inflexible. My colleague across the corridor is doing the same as us with social work data and they've done what we have done: they accredited the individuals and gave them a specific laptop to access that data directly, and that works really well.

Thanks very much for the interview Jude!

 You can find out more about Jude and her research here. Her current research papers are: with Walby and Francis, ‘Is violent crime increasing or decreasing?’ (BJC 2016); with Walby, ‘Measuring violence to end violence’ (Journal of Gender-based Violence forthcoming); and with Walby et al, The Concept and Measurement of Violence against Women and Men (Policy Press 2017).

Data Interview with Jo Knight

This is our second Data Interview. This time we were glad to have a chat with Dr Jo Knight.

Jo is a Reader within the CHICAS research group, Research Director in the Lancaster Medical School and theme lead for Health within Lancaster’s Data Science Institute. Jo has experience in developing new methods for analysing genetic data as well as experience in applying known techniques to a large variety of datasets.

The Conversation by Michael Dunne, Flickr, CC BY-NC

Q: Jo, when you talked at our recent Data Management event about a “positive” data management story and a “negative” story there was a lot of interest in that, so we thought we could use this in our next Data Interview. Which story would you like to start with?

Jo: I think it would be good to start with a negative one so I can end on a positive note. And chronologically that is how it occurred.

So the negative story relates to an early time in my career. I had some genetic data on a number of individuals, about 120. I did some statistical analysis of the data. I noticed that some of the patterns that I had in my analysis seemed unusual. They weren’t characteristic of the type of patterns you would expect given that the individuals in this sample were supposed to be siblings. I didn’t have enough genetic information to establish their relationships completely but I did have enough to see that overall patterns didn’t look how I expected them to.

I took the data to someone more experienced and said: “There is something wrong with the patterns here”, and he said “Yep, there is definitely something wrong. Those individuals clearly aren’t related to each other.”

At that time, given the technologies that were available, we couldn’t just get more data to determine the relationships. We had to throw all of that data away!

It was essentially because the data and the samples had not been properly linked and managed. At some point between labelling the samples, entering the labels into a database and recording the relationships and the rest of the information about the individuals, something had gone wrong. So the data management had gone wrong and these samples were now completely useless. As well as the loss of my time, we couldn't use the samples for any other work either; the data provenance was gone.
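[Editor's note: as an illustration of the kind of check that catches this sort of linkage failure early, here is a minimal, hypothetical sketch: sample labels recorded in the lab are merged against the phenotype database and any identifiers that appear on only one side are flagged. The file contents and column names are invented for the example and are not Jo's actual workflow.]

```python
import pandas as pd

# Hypothetical inputs: sample labels from the lab and the phenotype database.
lab_samples = pd.DataFrame({"sample_id": ["S001", "S002", "S003", "S004"]})
phenotypes = pd.DataFrame({
    "sample_id": ["S001", "S002", "S005"],
    "family_id": ["F01", "F01", "F02"],
})

# An outer merge with an indicator column shows exactly where linkage breaks down.
check = lab_samples.merge(phenotypes, on="sample_id", how="outer", indicator=True)

unlinked_samples = check.loc[check["_merge"] == "left_only", "sample_id"]
orphan_records = check.loc[check["_merge"] == "right_only", "sample_id"]

print("Samples with no phenotype record:", list(unlinked_samples))
print("Phenotype records with no sample:", list(orphan_records))
```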

Q: Can you quantify how much time you invested in that project?

Jo: It's hard to remember, but for me it would have been months of work interrogating the samples! It would also have cost a fair amount in reagents. And for the person who collected the data, probably up to a year's work getting all the DNA samples from the individuals. Furthermore, those individuals had given samples for medical research that could then not be undertaken.

Q: That is a rather sad story.

Jo: Yes, it is.

Dr Jo Knight

Q: Now the positive story. What happened?

Jo: I'm involved in a Consortium now, the Psychiatric Genomics Consortium, in which over 800 researchers from 38 countries have come together and worked very hard through ethical approvals, data procedures, data collection and data pooling in order to collate samples.

And they have been able to collect data, published a couple of years ago in 2014, on more than 35,000 schizophrenia cases and even more control samples than that. Through the good and appropriate management of the data we were able to identify 108 genetic risk loci for schizophrenia, which has enabled us to move the field forward in terms of beginning to understand the genetic contribution to schizophrenia.

For a long time we knew that schizophrenia has a genetic component, but we were unable to pinpoint very many of the risk variants at all, and this study was a real landmark in identifying a large number of the risk variants involved in the disorder. Lots more work needs to be done! What is really exciting about the Consortium is that the original paper is just the tip of the iceberg. That was the paper where the first analysis was done, but the data is now held and managed in a way that lets researchers working in psychiatric genetics access it, analyse it and answer lots of different questions about the genetic predisposition to schizophrenia.

The Psychiatric Genomics Consortium holds data on lots of other disorders as well. Basically, the appropriate management of that data means we are able to learn a lot more about diseases than we would have if people hadn’t got together and as a large group effectively managed the data.

Q: What is the key step in doing this?

Jo: It's a willingness to share data and to see the bigger scientific question that can be answered if you share the data, rather than just trying to hold onto it and answer your own smaller questions. It is also a willingness to put considerable amounts of time into data management. There are lots of people, including myself, who have informal, unpaid roles in managing that data to make it accessible.

Q: What can we as an institution do to encourage that willingness to share data?

Jo: I think Lancaster University as an institution has a very strong positive view of collaborative research across the Faculties and beyond the University. And that’s the kind of thing that does encourage people to share data and be involved in these projects. I think that is something we need to continue to pursue. And also the support systems that we have in place, the people and systems that help us to deposit data and make it available.

Thanks very much for the interview Jo!

 You can find out more about Jo and her research here. The full reference of the article on schizophrenia mentioned by Jo is:

Schizophrenia Working Group of the Psychiatric Genomics Consortium (2014) "Biological insights from 108 schizophrenia-associated genetic loci." Nature 511(7510): 421-427. doi:10.1038/nature13595

 

Data Interview by Hardy Schwamm (@HardySchwamm), 3 May 2017.

 

Data Interview with David Ellis (Part 2)

Part 2 (of two) of a Data Interview with Dr David Ellis (@davidaellis). See here for Part 1. David is a Lecturer in Computational Social Science and holds a 50th Anniversary Lectureship in Psychology at Lancaster University.

Picture from https://commons.wikimedia.org/wiki/File:Opendata.png CC-BY-SA

Q: In support of Open Data, what role do policies from funders or the University play? Are they helpful? Or are they seen as just another hurdle in the way of doing research?

David: I could be wrong but I don't think most people view it as just a hurdle. I think having to write a Data Management Plan for a grant is a bit of a pain. But I don't think the idea of having the data freely available is something where most people say "I can't be bothered". It is an additional step, but it is something people should be doing anyway, because if you are going to be clear on what your results are, the data should be in a form that is usable and can easily be moved between people. I think most people say that's a good thing, but maybe I'm biased…

Q: I have talked to PhD students asking if they want to share their data and they said I should have asked them three years ago because now it is so much work. I wonder why that is and if we need to change the way we teach them how to manage their data?

David: I wonder if I would have said the same thing. All the data from my PhD is still around, but as I was learning my craft I probably wasn't the most efficient, and my data wasn't managed as well as I would manage it now. I don't remember going to any data management training. And what if someone had told me on day one of my PhD that data should be kept in an ordered fashion, and so on? I created a lot of extra work for myself because I would do some analysis, close the file, and end up re-doing the same things multiple times. Even on that level, it is not very efficient.

David Ellis

Q: Is that something we should teach students, you think?

David: It's probably something students wouldn't be too keen [on].

Q: Yes, you don’t want to patronise them.

David: And it is a bit like saying: for god's sake, back up your stuff! Look at all the horror stories [about] people who lose data; it's only when it goes wrong that it becomes a problem. I think some people are automatically super organised. I was probably somewhere in the middle, and I am probably more organised now. I think the issue is that in a lot of academia you just figure it out as you go. Some people develop brilliant habits, some people, including myself, develop a mix of good and bad, and other people develop really bad habits. And that just carries on.

I sometimes look at Retraction Watch to see what's on there, and there is a really interesting example of an American researcher who published a paper in Psychological Science for which an undergraduate student collected the data. It turned out the entire paper was wrong when someone re-analysed the data and found so many mistakes in it. Of course it has been retracted. Now the professor has said it is the student's fault [whole story here]. But who taught that student data management? If that is the issue, and it looks like it, they have taken their eye off the ball. And now, without a doubt, his other papers will be scrutinised. Clearly there are ingrained bad habits that have been passed on.

And it is not just students, it is people higher up as well. Students' habits are shaped by their own supervisors. So I say to my students: back stuff up, make sure things are organised. I can usually tell without going into their file system: if I ask them for something, a piece of data, and it appears quickly because they know where it is, that is good enough for me. But if it takes ages, that's when we end up having a talk: "What are you actually doing with your data? Because this seems all over the place." But not every supervisor does that, as that case proved. He didn't even seem to look at the data. I am not saying that could happen here, but it is not only the students.

David presenting at Data Conversations

Q: What could the University do more to assist Open Data supporters like yourself?

David: I really like the fact that the Library is pushing the message that you can upload datasets. I know there are not many people from my Department doing it… I think that is really interesting. It is something that I, not necessarily challenge, but I do mention. I don't really get why not. It is the sort of thing where, when you are submitting a paper, you don't even have to do it formally. There are journals that don't have a data policy, but I can still link the data and the paper together through our Pure system. I don't see how that is a bad thing, or that a huge effort is needed to do it.

Maybe academics say it is just another thing to do? A colleague of mine would always say, if they want the data they can always email me. Now that might be true, but there are lots of cases where you email academics and they never get back to you. The same colleague gets so many emails that they have someone to manage their inbox. I take the point, the counter-argument is that nobody will actually want to see the data, and maybe they won't. But given how random these things are, you don't know. What you publish today might not seem important, and then suddenly it is.

So my answer to the question is that I am not exactly sure. There is more support in this institution than at my previous one, to my memory, in terms of "this is a place to put my dataset". One of the courses I took here about data management, as part of the 50 Programme, was really useful in the sense that I left thinking: from now on I am going to put my data there [into Pure].

Q: Should there be other incentives for opening up research data rather than “doing a good thing”? Should there be more credit for Open Data?

David: Yes, probably. We are always judged, when we do PDRs every year, on how much we have published and how much money we have brought in. But actually, the data output does have a DOI now, it is citable and it is a contribution that the University is getting from the academic. It is additional effort. So it would be interesting to see what happened if it went as far as, maybe not a promotion thing, but part of good practice. The question I would ask academics is: if your data is not there, where are you keeping it long term? I am working on another project now where the data cannot be made open, and that is fair enough, but in general I do wonder where all that data is going. There is a duty to keep it for a certain length of time. I think it is easier to put it in there [Pure]; then, if nothing else, I don't need to think about it. That gives me more comfort.

Q: Is there anything you’d like to add?

David: I am certainly in support of Open Data but I write more about data visualisation because I like pictures as much as I like data [laughs].

Thanks David for an interesting interview. We hope to do more Data Interviews soon. In the meantime, if you have any questions or comments leave them below or email rdm@lancaster.ac.uk.

Data Interview with David Ellis (Part 1)

Part 1 (of two) of a Data Interview with Dr David Ellis (@davidaellis). David is a Lecturer in Computational Social Science and holds a 50th Anniversary Lectureship in Psychology at Lancaster University. David presented at the first Data Conversations on Data Visualization.

This is the first interview of hopefully a series to come about the impact of Open Data on research. The interview was conducted by Hardy Schwamm.

Q: We define Open Data as data that can be freely used, shared and built-on by anyone, anywhere, for any purpose. Open Data is also a way to remove legal and technical barriers to using digital information.  Does that go with your idea of what Open Data is?

David: Yes, I think so. I might add to that: the data should actually be useful and fit for purpose. To me it's one thing to just upload all that data and make it available, but a lot of the time how useful that is on its own is not quite clear. As a psychologist you can run an experiment and have a lot of data coming out of a study. You can just dump that data online, but is there enough information there for other scientists to use that data and get the same results?

Q: So would you say that the usefulness of data depends on what we as librarians call metadata, data about the data?

David: Yes, exactly. The definition you gave earlier is spot on. I would just add that you need to make sure it is useful to other people. That might also depend on the audience, but there are lots of datasets that people post for papers that are just the raw data. That is useful, but understanding how they got from the raw data to the conclusions is an important step, and there isn't always space in publications to make that clear.
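[Editor's note: one lightweight way to provide that "data about the data" is to deposit a small data dictionary alongside the raw file. The sketch below is a hypothetical example of what such a record might look like; the dataset, variables and exclusion rule are invented for illustration.]

```python
import json

# A minimal, hypothetical data dictionary deposited alongside the raw data.
data_dictionary = {
    "dataset": "reaction_time_study_raw.csv",
    "collected": "2017-02",
    "variables": {
        "participant_id": "anonymised participant identifier",
        "condition": "experimental condition: 'control' or 'treatment'",
        "rt_ms": "response time in milliseconds",
        "excluded": "True if the trial was excluded from the published analysis",
    },
    "processing": "trials with rt_ms < 200 or > 2000 were excluded before analysis",
}

# Writing it out as JSON keeps it both human- and machine-readable.
with open("data_dictionary.json", "w") as f:
    json.dump(data_dictionary, f, indent=2)
```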

Q: You have probably already answered my next question. What is your interest in Open Data? Do you support it as a principle or because it is useful for your research?

David: I do support it as a matter of principle! I always found it weird, even as a student, that you could have papers published and it was just a "take our word for it" process. I still find that weird now. So absolutely, I support it as a matter of principle. As a scientist it just seems right. The data is the cornerstone of every publication, so if it is not there it seems like a massive omission, unless there is a reason for it not to be there. There are lots of mainstream psychology journals that don't have any policy on data.

Q: That leads me to my next question: to what extent do researchers in your field, Psychology, support or embrace a culture of Open Data?

David: Psychology does have a culture of it and it is probably growing. I think it is inevitable that this is going to become the standard practice if you look at the way Open Access publishing is going.

Q: Why do you think this is happening?

David: Because I think what is eventually going to happen is that journals are going to say… Lots of people are doing it already, but like everything else, if that data is going to be usable it does require a bit more effort on the author's part to make sure that things are organised and that they have a Data Management Plan. I am not suggesting that lots of people don't have Data Management Plans, but if you look at current problems in Social Psychology, that really wasn't being followed. There have been leaks and there have been other problems.

To give you a story from last week: a third-year student at Glasgow University spotted errors in a published paper, actually errors in the degrees of freedom. They didn't need the raw data, but the point is that a lot of that could have been sorted out if the raw data had been made available. There are lots of little issues that keep coming up.
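[Editor's note: checks like the one the student ran can often be done from the reported statistics alone. Below is a minimal, hypothetical sketch: for a standard independent-samples t-test the degrees of freedom should equal n1 + n2 - 2, so a reported value can be compared against the reported group sizes. The numbers are invented for illustration.]

```python
# Hypothetical values as they might be reported in a paper.
n_group_1 = 24
n_group_2 = 26
reported_df = 49  # degrees of freedom printed alongside the t statistic

# For an equal-variance two-sample t-test, df = n1 + n2 - 2.
expected_df = n_group_1 + n_group_2 - 2

if reported_df != expected_df:
    print(f"Possible error: reported df = {reported_df}, "
          f"but n1 + n2 - 2 = {expected_df}")
else:
    print("Reported degrees of freedom are consistent with the sample sizes.")
```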

There is nevertheless still resistance, and there are plenty of journals where there really is no policy, certainly among the journals I review for. At the end there is no data provided and I don't know what the policy is. It would be nice if in the future authors could upload raw data, but that depends on the journal's policy, and on whether the journal has a policy at all.

Q: Where should the push for Open Data come from? From journals, funders or the science community?

David: I think from all of them! If peer reviewers start asking for data, which I think more are, and if more scientists start uploading data as supplementary material as a matter of course, then I think journals will start to do it. I guess the other option is that journals that do provide additional resources will start to be favoured. Particularly given how much money places like Elsevier make, what do they actually offer? If they want to sell themselves they could offer lots of things, but they don't seem to be pushing it.

And I appreciate it is very discipline-specific; that came up after my talk at the Data Conversations [on 30 January 2017]: some disciplines don't share data. It has improved massively since I started as a postgraduate student. Then it just wasn't a thing, and it has slowly become more of an issue.

Q: Do you think this has to do with skills and knowledge of researchers and PhD students? Do they know how to prepare and share data? Do they know how to use other researchers’ data? Is there something missing?

David: A lot of psychologists are in a kind of hybrid area. They are obviously not statisticians, and I do wonder if there is a bit of a concern: what if I upload everything and somebody finds a mistake? My view is always that I'd rather know there is a mistake. But I do wonder if people are sometimes sceptical about it, not because they've got anything to hide, but because they are not 100 per cent sure sometimes. They understand the result and they know what the numbers mean, but we are not mathematicians.

I am just curious, given the number of statistical mistakes being flagged up in psychology papers… I am sure I have made mistakes myself. I'd just rather know about them. And having the data there means someone can check if they really want to. My view is that I am quite flattered if someone is that bothered to go and re-run my analysis. They are obviously reading it!

The interview with David will be continued in Part 2 which you can find here.