Data Interview on “Messy Data”

Our latest Data Interview features our two Jisc-sponsored Data Champions, Dr Jude Towers and Dr David Ellis. Jude is a Lecturer in Sociology and Quantitative Methods and David is a Lecturer in Computational Social Science in our Psychology Department.

Jude and David recently presented at a Jisc event on ‘Stories from the Field: Data are Messy and that’s (kind of) ok’.

We talked to Jude and David about what Messy Data are (and many other things):

Q: At the recent Research Data Champions Day the title of your presentation was ‘Data are Messy and that’s (kind of) ok’. I wonder what are ‘messy research data’ in your fields?

Jude: My ‘messy data’ are crime data. The ‘messiness’ comes from a lot of different directions. One of the main ones is that administrative data is not collected for research, it is collected for other purposes. It never quite has the thing that you want. You need to work out if it is a good enough proxy or not.

For example, I am interested in violence and gender, but police crime data doesn’t disaggregate by gender. There is no such crime as domestic violence, so it depends on whether somebody has flagged it as such, which is not mandatory, so it is hit and miss. I think the fact the data are not collected for research makes them messy for researchers, and then I guess there are all the other kinds of biases that come with things like administrative data. So if you think about crime, not everybody reports a crime, so you only get a particular sample. If you have a particular initiative, so every time there are international football matches they have big initiatives around domestic violence, so reporting goes up, so everyone says that domestic violence is related to football. But is it, or is it just related to the fact that everyone tells you that you can report and they have zero tolerance to domestic violence during football matches? It’s more likely to be recorded.

Then you get feedback loops. The classic one at the moment is knife crime in London: because knife crime has gone up on the agenda, more money and resource will go into knife crime. At some point that will probably go down, and something else will go up, because there is a finite amount of resource. These create feedback loops by the research that you do on the administrative data and people don’t always remember that when they come to interpret research.

Jude and David presenting on Messy Data

David: The majority of data within psychology that tends to measure people is messy because people are messy. Particularly with social, psychological phenomena, there is always noise within. The challenge is often trying to get past that noise to understand what might be going on. This is also true in administrative data and data you collect in a lab. Probably the only exception in psychology is where people are doing very, very controlled experiments, maybe in visual perception, where the measurement is very fine-grained, but almost everything else in Psychology is by its nature extremely messy, and data never looks like it appears in a textbook.

Q: So there is always that ‘noise’ in research data, regardless of whether you use external data such as NHS data or collect data yourself, unless, as you say, it is in a very controlled environment?

David:  Yes. And I guess that within Psychology there is an argument that if the data is collected in a very controlled environment, is that actually someone’s real behaviour or is a less controlled environment more ecologically valid as you’ve always got that balance to try and address?

Q:  So what are the advantages, why do you work with messy data?     

Jude: Sometimes because there is nothing else. [laughs]

David: Because there is nothing else.  I think Psychology generally is going to be messy. Because as I said people aren’t perfect, you know they are not perfect scientific participants. Participants are not 100% predictable, people aren’t predictable social phenomena.  There are very few theories within social psychology, in fact, I don’t think there’s any that are 100% spot on.

When you compare that to say physics, where there is Newton’s law, where there are governing theories, which are singular truths, which explain a certain phenomenon. We don’t have much of that in psychology!  We have theories that tend to explain social phenomena but people are too unpredictable.  There are good examples of where theories have held for a long time but it is never a universal explanation.

David presenting

Q: What are the implications for management of that kind of messy data?

David: I think the implications are that you make sure that it is clear to people how you got from the raw data, which was noisy or messy, to something that resembles a conclusion. So that could be: how did you get from X number of observations that you boiled down to an average that you then analysed? What is the process of that? It’s not just about running a statistical test, it’s about the whole process from: this is what we started with and this is what we ended with.

Jude: I think that’s right. I think it is about being very clear about what your data can and cannot support and being very clear that you are not producing facts, you are testing theory, where everything is iterative, a tiny step towards something else, not the end. You never get to the end.

David: I think researchers have a responsibility to do that and people have to be careful in the language they use to convey how that has happened. A good example of that at the moment is, there is a lot in the press and current debate about the effects social media has on children or on teenagers, and the way that it is measured and the language that is used to talk about that is to me totally disconnected. That behaviour isn’t really measured. It is generated by people providing an estimate of what they do, yet we know that that estimate isn’t very accurate. The conclusions which have been drawn are that this is having this big effect on people. I’m not saying it’s not having any effect; it’s not as exciting to say: ‘well actually the data’s really messy or not perfect, we can’t really conclude very much’. Instead it’s being pushed into saying that [social media] is causing a massive problem for young people, which we don’t know. Which is why there is a responsibility for that to be clear and I don’t think in that debate it is clear, and I think there are big consequences because of it.

Jude and David at Jisc panel discussion

Q: So in your dream world, what would change, so we could work better with this kind of data?

Jude: I think we need better statistical literacy, across the board. This is what I did with my Masters students:  I told them to go and find a paper or media story which used centred statistics,  then critique it.  So, how do you know what someone is telling you is ‘true’? Why are they telling it you in that particular way? What data have they used? What have they excluded?

You go to the stats literature and they talk about outliers, as though it’s just a mere statistical phenomenon, but those decisions are often political and they massively change what we know, and nobody talks about that, nobody sets out exactly what that means. The only official statistics for crime in England and Wales are currently capped at a maximum of five incidents. If you are beaten up by your partner 40 times a year, only the first five are included in the count, which is a huge bias effect in what we know about crime. It then affects the way resources are distributed between different groups, and what we know about which crimes are going up and which are falling. I think this lack of people questioning statistics in particular, but data more generally, is a real problem. In our social science degrees we just do not teach undergraduates how to do that. We do it with qualitative data, but we don’t do it with quantitative data. It’s exactly the same process, it’s exactly the same questions, but we just don’t do it. We are really bad at it in Britain!
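To make the effect of such a cap concrete, here is a minimal Python sketch. The incident counts are invented purely for illustration and are not taken from any real survey; the cap of five comes from Jude's description above, and the function name `capped_total` is hypothetical.

```python
# Hypothetical number of incidents reported by six respondents in one year.
# These values are invented for illustration only.
incidents_per_victim = [1, 2, 1, 40, 3, 12]

CAP = 5  # at most five incidents are counted per respondent, as described above


def capped_total(counts, cap):
    """Sum incidents, counting at most `cap` incidents per respondent."""
    return sum(min(count, cap) for count in counts)


uncapped = sum(incidents_per_victim)
capped = capped_total(incidents_per_victim, CAP)
share_dropped = 1 - capped / uncapped

print(f"Uncapped total: {uncapped}")
print(f"Capped total:   {capped}")
print(f"Share of incidents dropped by the cap: {share_dropped:.0%}")
```

In this made-up sample the respondents with the most incidents account for most of the total, so the cap removes a large share of the count. How large the effect is in real data depends on how many high-frequency victims there are.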

David: I think more generally, there is a cultural issue within the whole ethos of science, of how it gets published, of what becomes read and what doesn’t become read.  So again, say I go back, do a paper and find no relationship between social media use and anxiety.  That would be harder to publish than if I write a paper and find a tiny correlation, which is probably spurious and not even relevant, between anxiety and social media. So again, this comes down to both criticising what is out there but also what is just becoming more sellable or having more ‘impact’.  I use the word impact with inverted commas; what sounds more interesting, but actually might be totally wrong.  I think what is pushed is what’s more interesting rather than what is truth.  I think it’s worth remembering that science is about getting a result and trying to unpick it, looking at what else could explain this, what might we have missed.  Rather than saying ‘that’s it, it’s done’, it’s similar to what Jude was saying about a critical thinking process.

Q: Following on from what Jude said about the skills gap: You say that undergraduates are not taught the skills they need.  Therefore, when we eventually get PhD students and early career researchers this gap might have even increased?

Jude: Yes, and they don’t use quantitative data, or they use it really uncritically. So lots of post-graduate students who work on domestic violence won’t use quantitative data, but their thesis often starts with ‘one in four women will experience domestic violence in their lifetime’ or ‘two women a week are killed by intimate partners’, but they don’t know where that data comes from or how reliable it is or how it was achieved, yet it is just parroted.

David: I can give a similar example to that where it is sometimes difficult to take those numbers back, once they become a part of the common discourse.  So years ago we found that people check their smartphone 85 times a day on average.  Now that was a sample of about thirty young people. Now we obviously talked about that, but that number is now used repeatedly.  Now there is no way that my grandmother or my parents check their phone 85 times a day.  But that sample did, so there is now this kind of view that everyone checks it 85 times a day.  They probably don’t, but I can’t take that back now, there are things you don’t know at the time, but that is what that data showed.  It’s tricky to balance, and it was picked up as an impactful thing, but it wasn’t what we really meant.
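David's point about a small, young sample can be illustrated with a short Python sketch. All numbers below are invented for illustration and are not the data from the study he mentions; the two groups and their values are assumptions chosen only to show how an average from one group can differ from the average across a wider population.

```python
from statistics import mean

# Hypothetical daily phone checks; every value here is invented for illustration.
young_sample = [70, 85, 100, 90, 80]  # the kind of group that was sampled
older_group = [10, 15, 20, 5, 25]     # a group the sample did not include

sample_mean = mean(young_sample)
wider_mean = mean(young_sample + older_group)

print(f"Mean in the young sample: {sample_mean}")
print(f"Mean once the older group is included: {wider_mean}")
```

The average is accurate for the people who were measured, but quoting it without ‘within the sample’ implies it holds for everyone.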

Q: Is there also a job for you as a researcher, if your findings are picked up by media looking for catchy, easy numbers, to write your paper differently so that it is not picked up so easily? Or is it the fault of the media, because they are just looking for a simplified version of a complex issue?

David:  There is a cultural issue, a kind of toing and froing; because we want our work to be read and we want people to read it and certainly writing a press release is one way of doing that.  I think it’s actually what you put in the press release [that] has to be even more refined, because a lot of people won’t read the paper, but they will see the press release, and that will be spun.  Once the press release is done, it’s out of your control in some ways.  You can get it as right as you want but a journalist might still tweak it a certain way.  It’s a really tough balance because as you say the other extreme is to say I am just going to leave it. But then people might not hear about the work, so it’s a very tricky tightrope to walk.

Jude: We made the decision as a Centre when our work started getting picked up by the media, that we would not talk to the media about anything that had not been through peer review, so it is always peer reviewed first. We work with one person from the press office, we work with her closely, all the way through the process of putting the paper together and deciding the press release and how we are going to release it. What we have actually got now is contacts in several newspapers and media outlets, and we say we will work with you exclusively providing this is the message which goes out. We have actually been successful enough that we’ve now got two or three people on board who will do that with us. They get exclusives providing we see the copy before it goes public.


David: That is very hard to do, but really good.

Jude: We have been really hardcore and we’ve had a lot of pressure to put stuff out earlier, to make a bigger splash, to go with more papers. It was only I think because we resisted that, that in the long run it has been much better, although it is hard to resist the pressure.  The press in our early work wanted our trends, but we wanted them to talk about the data, we wouldn’t release the trends unless they talked about the problems with bias, official statistics.  So we kind of married the two, but they didn’t want it, but that was the  deal.

David: It’s like when you say: ‘people do X this number of times’ then you can’t put in brackets ‘within the sample’ so I understand where journalists come from and I understand the conversations with the press. To me as I said it’s like walking a tightrope. It has to be interesting enough that people want to read it, but at the same time it needs to be accurate.

Jude: But that’s the statistical literacy, because you want someone reading a media story going ‘Really? Well how did you get that?’ That’s something we would do as academics when reading it. People are always telling me ‘interesting facts’ about violence and my first reaction is always: ‘Where has that come from?’ These questions should become routine. I think journalist training is terrible! I mean I have spent hours on the phone with journalists, who want me to say a really particular thing, and it’s clearly absolute nonsense! But they have got two little bits of data and they have drawn a line between them.

David: I have had a few experiences where journalists have tried to get a comment about someone else’s work and I have said things like, ‘I don’t think this is right’ or I’ve been critical and the journalist said, ‘well really what we are looking for is a positive comment’. And I’ve said ‘well I’m not going to give you one’, and they have said ‘alright bye then’, and have gone and found someone that will. That doesn’t happen very often, but we can see what they are kind of hoping for. Presumably, some of the time I have said things where I have been really critical. The BBC are quite good at that; they get someone who they know is going to be critical without having to explicitly say something negative.

Q: This has been fascinating; we have been through the whole life cycle of data from the creation to the management and now to the digestion by the media. This tells us that data management issues are fundamental to the outputs of research.

Jude: I think it impacts on the open data agenda though ‘cause if I was going to put my data out, the caveat manual which came with it would be three times the size of the data.  Again, you don’t have any control over how someone presents an analysis of that data. I think it’s really difficult because we are not consistent with good practice in reporting on messiness of data.

David: I think there is a weight of responsibility on scientists to get that right! Because it does affect other things. I keep using social media as an example. The government are running an enquiry at the moment into the effects of screen time and social media. If I was being super critical I would say it’s a bit early for an enquiry, because there isn’t any cause and effect evidence. Even some of the studies they report on their home page of the enquiry are totally flawed; one of them is not peer reviewed. That lack of transparency or statistical literacy even among Members of Parliament, clearly, is leading to things being investigated where actually we could be missing a bigger problem here. So that is just one example, but that is where there is a lot of noise about it, there is a lot of ‘this might be a problem’, or ‘is it a problem?’, right through to ‘it definitely is a problem’, without anyone standing back and going, ‘actually, is this an issue, is the quality of the evidence there?’


Jude: Or can you even do it at the moment?

David: Yes, absolutely! That is a separate area and there is a methodological challenge in that.

Jude: We get asked to measure trafficking in human beings on a regular basis, we’ve  even written a report that said you can’t measure it at the moment! There is no mechanism in place that can give you any data that is good enough to produce any kind of measure.

David: But that isn’t going to make it onto the front of the Daily Mail. [laughs]

Q: Maybe just to conclude our interview, what can the university do? You mentioned statistical literacy as one thing. Are there other things we can do to help?

Jude: We are starting to move a little bit in FASS [Faculty of Arts and Social Sciences] with some of the Research Training Programme, and I think things like the data conversations, which are hard to measure, are actually having a really good impact. Drawing people in through those kinds of mechanisms and then setting up people that are interested in talking about this would be good. I would like to see something around… what you need to tell people about your data when it’s published; you know, the caveats: what it can and can’t support, how far you can push it.

David: I think the University as a whole does a lot, certainly psychology, is preaching to the converted, in a way. I would like a thing in Pure [Lancaster University Data Repository] that when you upload a paper it says… ‘have you included any code or data?’ just as a sort of a ‘by the way you can do that’. One, it tells people that we do it and two, it reminds people that if you’re not doing that it would be useful just to have a tick box just to see why. Obviously, there are lots of cases where you can’t do it, but it would be good for that to be recorded. So is it actually, I can’t do it because the data is a total mess or some other reason or I’m not bothered. There is an issue here about why not, because, if it has just been published it should be in a form which is sensible and clear.

Jude: I wonder if there is some scope in just understanding the data, so maybe like the data conversation is specifically about qualitative data, and then other even more obscure forms like literature reviews as data, ‘cause I still keep thinking about when you told me you offered to do data management with FASS and you were told they didn’t have any data.

I think that people don’t think about it as data in the same way and it would be really good to kind of challenge that.  I think data science has a massive problem in that area, it has become so dominant, and if you’re not doing what fits inside the data science box you’re not doing data and you’re not doing science and it’s really excluding.  I think for the university to embrace a universal definition of data would be really, really, beneficial.

David: It’s also good for the University, [to] capitalise on that extra resource; it would have a big effect on the institution as a whole.

Jude, David, thank you very much for this interesting interview!

Jude and David presenting

Jude and David have also featured in previous Data Interviews.

The interview was conducted by Hardy Schwamm, Research and Scholarly Communications Manager @hardyschwamm. Editing was done by Aniela Bylinski-Gelder and Rachel MacGregor.