So Long and Thanks For All the Pizza

Today is my last day working as Digital Archivist at Lancaster University, so I thought I would take a little time to reflect on my three years here: the highlights and what I have learnt along the way.

Pizza

Pizza has featured quite a bit in my time here at Lancaster. And that’s not just team lunches! Pizza is a core component of our Data Conversations – the networking event designed to bring together researchers to share experiences of creating, using and sharing data. Having a peer-led discussion forum has been a fantastic success story for us – it’s even gone global!  The pizza is a key part of this as it helps create an informal, friendly environment where sharing is central. I’ve learnt a huge amount from being part of the Data Conversations and I will definitely be taking forward what I’ve learnt about successful engagement activities.

Culture change

Data Conversations' attendees enjoying refreshments and conversation

The focus for the Research Services team from early on (i.e. when there were only two of us!) was on how to push forward culture change. We were fortunate in having a management team who supported and promoted team-led agenda setting. We identified our priorities for development, which focused on bringing about culture change and promoting an Open Research agenda. Looking at our goals and keeping them at the centre of what we do was important and meant we could tailor and prioritise activities around encouraging and promoting good data management practices. From my perspective of being engaged in digital preservation, the best chance we have of preserving data is by ensuring that the data is created in the right way in the first place. Sending the message about good data practices “upstream”, so that well-formed data is captured early and with the right metadata, means it has the best chance of being accessible into the future.

Opportunities

Eating fondue

I’ve learnt a huge amount in the time I’ve been at Lancaster; when I started I had a lot of enthusiasm but not much practical experience. I hope I’ve retained the enthusiasm but added experience and practical application to it. Things change at a very fast pace in the digital preservation world so it’s brilliant to be able to go to training events and conferences and hear from the leaders in the field. I was lucky enough to attend iPres 2016 in the beautiful city of Bern. I learnt a lot there and have very much built on that knowledge and experience, especially around personal digital archiving and community engagement activities.

Communities

Did I mention cake and biscuits?

The last three years have also brought me more fully into the Digital Preservation Community and it is a community where sharing best practice and collaborating is greatly encouraged for the benefit of all. I have had help and support from countless people but I would single out Jen Mitcham at the University of York and Sean Rippington at the University of St Andrews as being particularly supportive and inspirational. The Archivematica UK user group has also been a fantastic group and I am looking forward to continuing these relationships into the future.

The Future

So now I’m off to take up new challenges at the University of Warwick in their Modern Records Centre. I am looking forward to future collaborations with the team of colleagues and friends at Lancaster and to seeing us all face the challenges that digital data presents together.

Rachel MacGregor, Digital Archivist

Connecting the Bits

Glasgow: location of the unconference (CC0: https://pixabay.com/en/glasgow-scotland-city-tourism-2997987/)

We are members of the Digital Preservation Coalition, a membership organisation which exists to secure our digital legacy. Members include businesses, HE institutions, funding bodies, national heritage and cultural organisations and are drawn from every continent.

Last week all members were invited to the annual un-conference where we come together not only to share experiences and network but also to help set the Digital Preservation Coalition’s training and development agenda for the year ahead. The idea is that members have the opportunity to raise the issues which really matter to them and then discuss how the DPC can take action to move forward on these issues.

The agenda is set on the day and full members are invited to give a three-minute presentation of their successes, challenges and future opportunities. Listening to the reports it was clear that there were themes common to all, whatever stage of maturity they are at.

So what were the common themes which came out of the day?

Challenges

(CC0 https://pixabay.com/en/magic-cube-patience-tricky-hobby-1976725/)

Many people shared their efforts to meet the challenge of preserving specific “types” of data:

  • Software and software environments
  • Email
  • Audio visual materials
  • Sensitive data

The preservation of Research Data, which usually means a huge range of data types, also came up. Here at Lancaster the preservation of Research Data has been our focus so we are well aware of the challenges we face, but it’s great to be able to share them and know that there is a community out there working on this together. We have also been engaging with software preservation and looking into ways in which we can support our researchers who create software. There is really encouraging work being done by the Software Sustainability Institute and here at Lancaster we have been running various initiatives including inviting Neil Chue Hong of the SSI to speak at our Data Conversation and presenting at our own Research Software Forum.

There were quite a few organisational challenges discussed such as:

  • Huge rise in quantity of data and difficulties in predicting the growth rate.
  • Resources either staying the same or being cut in the face of the growth in data
  • Sustaining work beyond a project level – moving it on to business as usual
  • Dealing with organisational restructures

Finding the right tools for the job

(CC0 https://pixabay.com/en/tools-awl-pliers-antique-equipment-1083796/)

These challenges require robust strategies and planning to tackle.  Again the approaches we need to develop can be done as a community. Here at Lancaster we are developing a tool called DMAonline as part of the Jisc Research Data Shared Service.  DMAonline has reporting functionality for a variety of research data and scholarly communications outputs but one of the things we are hoping for is that it will be able to provide intelligence (rather than analytics) – it will use machine learning to make suggestions on growth and development and predictions on future use.

We don’t just want to create pretty graphs; we want to answer questions, for example predicting growth in storage needs or predicting the growth of the “long tail” of unidentified file formats. It’s an ambitious aim but we are keen to take part in the challenges presented by the long term preservation of digital assets.

Finding the right tools for the job was also mentioned.  I think we would all agree that the tools we currently have are not necessarily the right fit for the job. Often we just need to get on with the job and have to use the tools which are available but sometimes it’s good to take a step back and say – what are we trying to achieve? What is the best way to get there and what should the tools we need look like? I don’t have the technical knowledge to build them but I can work with others – like my team here at Lancaster – to work towards this.

The human problem

One thing that came up was the challenge of getting the data/records/archives as quickly as possible, i.e. before they are lost, altered, deleted, degraded or end up on a corrupted CD. Some of this challenge is technical, i.e. having simple easy-to-use systems which people will engage with and which will encourage good data practices. However, more of the challenge is about getting people to engage with the process in the first place so that vital data, metadata and contextual information is not lost over the passage of time.

Successes

(CC0 https://pixabay.com/en/raise-challenge-landscape-mountain-3338589/)

It was great to hear about many successes with many institutions implementing a fully functional preservation system. Other institutions had successes getting digital preservation on the agenda with senior management. One institution mentioned that they argued that by not investing in digital preservation and training they would fall behind competitors. Another mentioned getting digital preservation recognised on a risk register. These are all significant achievements and show that individual institutions are moving forward and making progress.

It was also really good to hear about some specific projects such as the work done by the National Library of Scotland on converting tif files to jp2 or the British Library’s work to keep up with the challenge of preserving digital formats which form part of the collections of a legal deposit library.  This work will also benefit other institutions tackling similar problems. 

Moving on

I really hope this day leads to relevant and targeted planning and support for all DPC members and I also hope it helps connect us as a community to tackle the common challenges which we all face.  The Digital Preservation Coalition also provide lots of resources for the wider non-member community so it’s a great way of coordinating development work and sharing expertise to help foster a real community of practice.

[This blog post was first published by the Digital Preservation Coalition]

Rachel MacGregor (Digital Archivist)

International Archives Day

Today is International Archives Day where everyone involved in preserving archives, records, data – whatever your take – celebrates the work that is happening worldwide to ensure the preservation of our memory and heritage and the protection of our rights by documenting decisions and building the foundations for good governance.

Lancaster Castle: very visible heritage (image author’s own CC-BY)

It’s easy to get people interested in memory and heritage – our history surrounds us in very visible ways and our memories are what binds us together with sharing and celebrating the past to inform our culture and identity.  But it’s much harder to get excited about “governance” even though it’s all about maintaining rights and responsibilities and ensuring justice and equality across the board.

So I want to take a moment to hear it for governance and shout about how the work we are doing here at Lancaster University is contributing towards supporting the creation of strong and accountable governance structures.  Accountable governance ensures fairness and equality for all. The work in my team is all about promoting the Open Research agenda which creates an environment where research is sustainable, reliable, accountable and for the greater good.

“Good governance in the public sector encourages
better informed and longer-term decision making as
well as the efficient use of resources. It strengthens
accountability for the stewardship of those
resources… People’s lives are
thereby improved.”

(International framework: good governance in the public sector IFAC/CIPFA 2014)

And it’s improving people’s lives that we are all really putting all our effort into.

So how are we hoping to support these objectives? The long term preservation of good quality, reliable data means that we can support the decision making processes which affect all of us. Poor data leads to poor decisions so we are looking to see if we can establish ways of preserving data in a way that guarantees its authenticity and integrity and ensures that it will be available for the long term. The work is not done in isolation and we are looking at best practice and initiatives such as the Jisc Research Data Shared Service which we are hoping will deliver huge advances in helping us preserve important data.

Let’s celebrate everyone who is working hard on preserving documents, manuscripts, archives, data – all kinds of information – which enrich our lives and help us build a better world.

Rachel MacGregor (Digital Archivist)

Data Interview on “Messy Data”

Our latest Data Interview features our two Jisc sponsored Data Champions, Dr Jude Towers and Dr David Ellis. Jude is a Lecturer in Sociology and Quantitative Methods and David a Lecturer in Computational Social Science in our Psychology Department.

Jude and David recently presented at a Jisc event on ‘Stories from the Field: Data are Messy and that’s (kind of) ok’.

We talked to Jude and David about what Messy Data are (and many other things):

Q: At the recent Research Data Champions Day the title of your presentation was ‘Data are Messy and that’s (kind of) ok’. I wonder what are ‘messy research data’ in your fields?

Jude: My ‘messy data’ are crime data. The ‘messiness’ comes from a lot of different directions. One of the main ones is that administrative data is not collected for research, it is collected for other purposes. It never quite has the thing that you want. You need to work out if it is a good enough proxy or not.

For example I am interested in violence and gender but police crime data doesn’t disaggregate by gender. There is no such crime as domestic violence, so it depends on whether somebody has flagged it as such, which is not mandatory, so it is hit and miss. I think the fact the data are not collected for research makes them messy for researchers, and then I guess there is all the other kind of biases that come with things like administrative data. So if you think about crime, not everybody reports a crime, so you only get a particular sample. If you have a particular initiative, so every time there are international football matches they have big initiatives around domestic violence, so reporting goes up, so everyone says that domestic violence is related to football. But is it, or is it just related to the fact that everyone tells you that you can report and they have zero tolerance to domestic violence during football matches? It’s more likely to be recorded.

Then you get feedback loops, so the classic one at the moment is knife crime in London, because knife crime has gone up on the agenda more money and resource will go into knife crime, at some point that will probably go down, and something else will go up because there is a finite amount of resource.  These create feedback loops by the research that you do on the administrative data and people don’t always remember that when they come to interpret research.

Jude and David presenting on Messy Data

David: The majority of data within psychology that tends to measure people is messy because people are messy, particularly social, psychological phenomenon, there is always noise within. The challenge is often trying to get past that noise to understand what might be going on.  This is also true in administrative data and data you collect in a lab.  Probably the only exception in psychology is where people are doing very, very controlled maybe visual perception experiments where the measurement is very fine grain, but almost everything else in Psychology is by its nature extremely messy, and data never looks like it appears in a textbook.

Q: So there is always that ‘noise’ in research data, regardless if you use external data such as NHS data, or if you collect data yourself, unless as you say it is in a very controlled environment?

David:  Yes. And I guess that within Psychology there is an argument that if the data is collected in a very controlled environment, is that actually someone’s real behaviour or is a less controlled environment more ecologically valid as you’ve always got that balance to try and address?

Q:  So what are the advantages, why do you work with messy data?     

Jude: Sometimes because there is nothing else. [laughs]

David: Because there is nothing else.  I think Psychology generally is going to be messy. Because as I said people aren’t perfect, you know they are not perfect scientific participants. Participants are not 100% predictable, people aren’t predictable social phenomena.  There are very few theories within social psychology, in fact, I don’t think there’s any that are 100% spot on.

When you compare that to say physics, where there is Newton’s law, where there are governing theories, which are singular truths, which explain a certain phenomenon. We don’t have much of that in psychology!  We have theories that tend to explain social phenomena but people are too unpredictable.  There are good examples of where theories have held for a long time but it is never a universal explanation.

David presenting

Q: What are the implications for management of that kind of messy data?

David: I think the implications are that you make sure that it is clear to people how you got from the raw data, which was noisy or messy, to something that resembles a conclusion. So that could be: how did you get from X number of observations that you boiled down to an average that you then analysed? What is the process of that? It’s not just about running a statistical test, it’s about the whole process from: this is what we started with and this is what we ended with.

Jude:  I think that’s right, I think being very clear about what your data can and cannot support and be very clear that you are not producing facts, you are testing theory, where everything is iterative, a tiny step towards something else, not the end. You never get to the end.

David: I think researchers have a responsibility to do that and people have to be careful in the language they use to convey how that has happened.  A good example of that at the moment is, there is a lot in the press and current debate about the effects social media has on children or on teenagers, and the way that it is measured and the language that is used to talk about that is to me totally disconnected. That behaviour isn’t really measured. It is generated by people providing an estimate of what they do, yet we know that, that estimate isn’t very accurate.  The conclusions which have been drawn  are that this is having this big effect on people.  I’m not saying it’s not having any effect; it’s not as exciting to say: ‘well actually the data’s really messy or not perfect, we can’t really conclude very much’. Instead it’s being pushed into saying that [social media] is causing a massive problem for young people, which we don’t know.  Which is why there is a responsibility for that to be clear and I don’t think in that debate it is clear, and I think there are big consequences because of it.

Jude and David at Jisc panel discussion

Q: So in your dream world, what would change, so we could work better with this kind of data?

Jude: I think we need better statistical literacy, across the board. This is what I did with my Masters students:  I told them to go and find a paper or media story which used centred statistics,  then critique it.  So, how do you know what someone is telling you is ‘true’? Why are they telling it you in that particular way? What data have they used? What have they excluded?

You go to the stats literature and they talk about outliers, as though it’s just a mere statistical phenomenon, but those decisions are often political and they massively change what we know, and nobody talks about that, nobody sets out exactly what that means. The only official statistics for crime in England and Wales are currently capped at a maximum of five incidents. If you are beaten up by your partner 40 times a year, only the first five are included in the count, which is a huge bias effect in what we know about crime. Then in the way resources are distributed between different groups, about what crimes are going up and what are falling. I think this lack of people questioning statistics in particular, but data more generally, is a real problem. In our social science degrees we just do not teach undergraduates how to do that. We do it with qualitative data, but we don’t do it with quantitative data. It’s exactly the same process, it’s exactly the same questions, but we just don’t do it, we are really bad at it in Britain!

David: I think more generally, there is a cultural issue within the whole ethos of science, of how it gets published, of what becomes read and what doesn’t become read.  So again, say I go back, do a paper and find no relationship between social media use and anxiety.  That would be harder to publish than if I write a paper and find a tiny correlation, which is probably spurious and not even relevant, between anxiety and social media. So again, this comes down to both criticising what is out there but also what is just becoming more sellable or having more ‘impact’.  I use the word impact with inverted commas; what sounds more interesting, but actually might be totally wrong.  I think what is pushed is what’s more interesting rather than what is truth.  I think it’s worth remembering that science is about getting a result and trying to unpick it, looking at what else could explain this, what might we have missed.  Rather than saying ‘that’s it, it’s done’, it’s similar to what Jude was saying about a critical thinking process.

Q: Following on from what Jude said about the skills gap: You say that undergraduates are not taught the skills they need.  Therefore, when we eventually get PhD students and early career researchers this gap might have even increased?

Jude: Yes, and they don’t use quantitative data, or they use it really uncritically. So lots of post-graduate students who work on domestic violence won’t use quantitative data, but their thesis often starts with ‘one in four women will experience domestic violence in their lifetime’ or ‘two women a week are killed by intimate partners’, but they don’t know where that data comes from or how reliable it is or how it was achieved, yet it is just parroted.

David: I can give a similar example to that where it is sometimes difficult to take those numbers back, once they become a part of the common discourse.  So years ago we found that people check their smartphone 85 times a day on average.  Now that was a sample of about thirty young people. Now we obviously talked about that, but that number is now used repeatedly.  Now there is no way that my grandmother or my parents check their phone 85 times a day.  But that sample did, so there is now this kind of view that everyone checks it 85 times a day.  They probably don’t, but I can’t take that back now, there are things you don’t know at the time, but that is what that data showed.  It’s tricky to balance, and it was picked up as an impactful thing, but it wasn’t what we really meant.

Q: Is there also a job for you as a researcher, if your findings are picked up by the media looking for catchy, easy numbers, to write your paper differently so that it is not being picked up so easily, or is it the fault of the media, because they are just looking for a simplified version of a complex issue?

David:  There is a cultural issue, a kind of toing and froing; because we want our work to be read and we want people to read it and certainly writing a press release is one way of doing that.  I think it’s actually what you put in the press release [that] has to be even more refined, because a lot of people won’t read the paper, but they will see the press release, and that will be spun.  Once the press release is done, it’s out of your control in some ways.  You can get it as right as you want but a journalist might still tweak it a certain way.  It’s a really tough balance because as you say the other extreme is to say I am just going to leave it. But then people might not hear about the work, so it’s a very tricky tightrope to walk.

Jude: We made the decision as a Centre when our work started getting picked up by the media, that we would not talk to the media about anything that had not been through peer review, so it is always peer reviewed first. We work with one person from the press office, we work with her closely, all the way through the process of putting the paper together and deciding the press release and how we are going to release it. What we have actually got now is contacts in several newspapers and media outlets and we say we will work with you exclusively providing this is the message which goes out. We have actually been successful enough that we’ve now got two or three people on board who will do that with us. They get exclusives providing we see the copy before it goes public.

Jude

David: That is very hard to do, but really good.

Jude: We have been really hardcore and we’ve had a lot of pressure to put stuff out earlier, to make a bigger splash, to go with more papers. It was only I think because we resisted that, that in the long run it has been much better, although it is hard to resist the pressure.  The press in our early work wanted our trends, but we wanted them to talk about the data, we wouldn’t release the trends unless they talked about the problems with bias, official statistics.  So we kind of married the two, but they didn’t want it, but that was the  deal.

David: It’s like when you say: ‘people do X this number of times’ then you can’t put in brackets ‘within the sample’ so I understand where journalists come from and I understand the conversations with the press. To me as I said it’s like walking a tightrope. It has to be interesting enough that people want to read it, but at the same time it needs to be accurate.

Jude: But that’s the statistical literacy, because you want someone reading a media story going ‘Really? Well how did you get that?’ That’s something we would do as academics when you are reading it. People are always telling me ‘interesting facts’ about violence and my first reaction is always: ‘Where has that come from?’ These questions should become routine. I think journalist training is terrible! I mean I have spent hours on the phone with journalists, who want me to say a really particular thing, and it’s clearly absolute nonsense! But they have got two little bits of data and they have drawn a line between them.

David: I have had a few experiences where journalists have tried to get a comment about someone else’s work and I have said things like, ‘I don’t think this is right’ or I’ve been critical and the journalist said, ‘well really what we are looking for is a positive comment’. And I’ve said ‘well I’m not going to give you one’, and they have said ‘alright bye then’, and have gone and found someone that will. That doesn’t happen very often, but we can see what they are kind of hoping for. Presumably, some of the time I have said things where I have been really critical. The BBC are quite good at that; they get someone who they know is going to be critical without having to explicitly say something negative.

Q: This has been fascinating; we have been through the whole life cycle of data from the creation to the management and now to the digestion by the media. This tells us that data management issues are fundamental to the outputs of research.

Jude: I think it impacts on the open data agenda though ‘cause if I was going to put my data out, the caveat manual which came with it would be three times the size of the data.  Again, you don’t have any control over how someone presents an analysis of that data. I think it’s really difficult because we are not consistent with good practice in reporting on messiness of data.

David: I think there is a weight of responsibility on scientists to get that right! Because it does affect other things. I keep using social media as an example. The government are running an enquiry at the moment into the effects of screen time and social media. If I was being super critical I would say it’s a bit early for an enquiry, because there isn’t any cause and effect evidence. Even some of the studies they report on the home page of the enquiry are totally flawed, and one of them is not peer reviewed. That lack of transparency or statistical literacy even among Members of Parliament, clearly, is leading to things being investigated where actually we could be missing a bigger problem here. So that is just one example, but that is where there is a lot of noise about it, there is a lot of ‘this might be a problem’, or ‘is it a problem?’, right through to ‘it definitely is a problem’, without anyone standing back and going, ‘actually, is this an issue, is the quality of the evidence there?’

David

Jude: Or can you even do it at the moment?

David: Yes, absolutely! That is a separate area and there is a methodological challenge in that.

Jude: We get asked to measure trafficking in human beings on a regular basis, we’ve  even written a report that said you can’t measure it at the moment! There is no mechanism in place that can give you any data that is good enough to produce any kind of measure.

David: But that isn’t going to make it onto the front of the Daily Mail. [laughs]

Q: Maybe just to conclude our interview, what can the university do? You mentioned statistical literacy as one thing. Are there other things we can do to help?

Jude: We are starting to move a little bit in FASS [Faculty of Arts and Social Sciences] with some of the Research Training Programme and I think things like the Data Conversations, which are hard to measure, are actually having a really good impact. Drawing people in through those kinds of mechanisms and then setting up people that are interested in talking about this would be good. I would like to see something around… what you need to tell people about your data when it’s published; you know, the caveats: what it can and can’t support, how far you can push it.

David: I think the University as a whole does a lot, certainly psychology, is preaching to the converted, in a way. I would like a thing in Pure [Lancaster University Data Repository] that when you upload a paper it says… ‘have you included any code or data?’ just as a sort of a ‘by the way you can do that’. One, it tells people that we do it and two, it reminds people that if you’re not doing that it would be useful just to have a tick box just to see why. Obviously, there are lots of cases where you can’t do it, but it would be good for that to be recorded. So is it actually, I can’t do it because the data is a total mess or some other reason or I’m not bothered. There is an issue here about why not, because, if it has just been published it should be in a form which is sensible and clear.

Jude: I wonder if there is some scope in just understanding the data, so maybe like the data conversation is specifically about qualitative data, and then other even more obscure forms like literature reviews as data, ‘cause I still keep thinking about when you told me you offered to do data management with FASS and you were told they didn’t have any data.

I think that people don’t think about it as data in the same way and it would be really good to kind of challenge that.  I think data science has a massive problem in that area, it has become so dominant, and if you’re not doing what fits inside the data science box you’re not doing data and you’re not doing science and it’s really excluding.  I think for the university to embrace a universal definition of data would be really, really, beneficial.

David: It’s also good for the University, [to] capitalise on that extra resource; it would have a big effect on the institution as a whole.

Jude, David, thank you very much for this interesting interview!

Jude and David presenting

Jude and David have also featured in previous Data Interviews.

The interview was conducted by Hardy Schwamm, Research and Scholarly Communications Manager @hardyschwamm. Editing was done by Aniela Bylinski-Gelder and Rachel MacGregor.

Two days in the City

Beautiful sunshine in the City: Westminster Bridge (photo: Rachel MacGregor CC-BY)

I was lucky enough to have two days in London last week to attend two separate but linked events: the first was a Jisc sponsored workshop on Digital Appraisal and the second an Archivematica UK User group meet up. It was a nice balance of activities: Day One was around the theory of how we decide what to keep or what to throw away and Day Two was about sharing experiences of using Archivematica – a digital preservation tool which can potentially help us with aspects of this.

Wednesday was a day at the University of Westminster (founded in 1838), in its beautiful buildings at 309 Regent Street.

Foyer at University of Westminster (Credit: Big Rock Cat / Sabotage1 https://en.wikipedia.org/w/index.php?title=File:University_of_Westminster_Foyer.jpg CC-BY 3.0)

This event – kindly sponsored by Jisc – was designed to bring together digital preservation practitioners to discuss and explore approaches to the theory and practice of managing digital archives. Chatham House Rules applied so there was freedom to discuss practice in an open and honest way. The morning session comprised two presentations. The first focussed on the theory of appraisal, that is how we make decisions about what to keep and what to get rid of. The second explored practical experiences of the same and reflected on the change that those who are responsible for managing and looking after records have experienced in the move to the digital age.

For the afternoon session we reflected on what we had heard in the morning and were divided into smaller groups and invited to discuss the approaches we took to appraising both digital and physical collections.  It was a good chance to share experiences of tools which we found useful and difficulties we encountered.

For me it was a great opportunity to meet people out there actually “doing preservation” using a wide variety of tools. Sometimes when people use one software package or another it can have the effect of dividing them into camps. It’s really important to be able to meet up with and share experiences of others who are in a similar position – as witnessed at the Archivematica Meeting the next day – but it is also good to hear a diversity of experience. There was a strong feeling that any tools, workflows and ways of working are likely to change and develop rapidly, paralleling rapid technological changes, so that anything we opt for now is necessarily only a “temporary” solution. We have to learn to work in a state of flux and be dynamic in our approaches to preservation.

Day two was the Archivematica UK User group, this time hosted by Westminster School. I’ve blogged before about this group when we hosted here at Lancaster University. It was yet another fantastic setting for our meeting and another brilliant opportunity to discuss our work with colleagues from a wide range of institutions.

Deans Yard, Westminster (Photo by Rachel MacGregor CC-BY)

The morning session involved the sharing of workflows and, in a nice parallel to the previous day’s session, talking about appraisal!

Lunch was back-to-school in the canteen but I’m pleased to report that school dinners have certainly moved on since I remember them!

In the afternoon there was a selection of presentations – including one that I gave to update people on our work at Lancaster as part of the Jisc RDSS to create a reporting tool – DMAonline – which will work with Archivematica to give added reporting functionality. One of the attractive things about Archivematica as a digital preservation tool is that it is Open Source, so it allows for development work to happen parallel to the product and to suit all sorts of circumstances.

We also heard from Hrafn Malmquist at University of Edinburgh talking about his recent work with Archivematica to help with preserving the records of the University Court. Sean Rippington from the University of St Andrews talked to us about experimenting with exporting SharePoint files and Laura Giles from the University of Hull talked about documenting Hull’s year as City of Culture.

We were also lucky enough to get a tour of Westminster School’s archive which gave the archivist Elizabeth the chance to show off her favourite items, including the wonderful Town Boy ledger which you can discover for yourself here.

All in all it was a very useful couple of days in London which gave me a lot to think about and incorporate into my practice. Having time to reflect on theoretical approaches is invaluable and rarely achieved when the “day job” is so busy, and I am grateful to have had the time to attend.

Rachel MacGregor

5th Data Conversations – Stories from the Field

We recently held our fifth Data Conversations here at Lancaster University Library. These events bring researchers together and act as a forum to share their experiences of using and sharing data. The vibe’s informal and we provide our attendees with complimentary coffee, cake and pizza…

It’s FAIR  to say that pizza is a popular part of the event. Who doesn’t love pizza…? The informal lunch at the start brings researchers together. It’s a chance to spark conversations and connections with colleagues from different disciplines and at different career stages.

Data Conversations' attendees enjoying refreshments and conversation

Once again we had a great programme with contributions from three fantastic speakers: 

Up first was Dr David Ellis, Lecturer in Computational Social Science from the Psychology department and one of our Jisc Data Champions. David spoke about his experiences (including challenges and solutions) of working with National Health Service Data.

David Ellis beginning his presentation

Next up was Jessica Phoenix, Criminology PhD Candidate. Jess spoke about her Masters dissertation project which looked at missing persons and the link between risk assessment and time to resolution. She spoke about the challenges and solutions associated with creating a dataset from pre-existing raw data, issues that were amplified because the data were highly sensitive and identifiable (police records).

Image showing Jess as she begins her presentation

Last up was Professor Chris Hatton, Centre for Disability Research, Division of Health Research. Chris discussed his experience of collaborating with social workers to achieve uniquely valuable results. He also explored the way in which social media (his Twitter account) has provided a platform to engage with a wide array of voices that he couldn’t have reached through conventional research methods.

Chris enjoying jovial interaction with attendees

It was another fantastic instalment in an ongoing series of Data Conversations. We thoroughly enjoyed it and we’re looking forward to the 6th Data Conversations: Keep it, throw it, put it in the vault…? We hope you can join us, sign up today!

Digital flyer promoting 6th Data Conversations to be held 18th September, 13:30-16:00, the Library, C130. Link below.

Joshua Sendall, Research Data Manager @JSendall

4th Data Conversation – Open Data Open Doors

 

We held our fourth Data Conversation here at Lancaster University, bringing together researchers and their experiences of using and sharing data over pizza and cake…

Pizza is a big attraction at an event but more importantly it brings people together to share experiences and creates a relaxed and informal environment which encourages conversation – exactly what we want.  Now in our fourth event in the series we have some “regulars” who come for the conversation (and the pizza) but also new faces who bring new perspectives.

We had another interesting programme with a range of researchers from different disciplines:

Our first speaker was Dr John Towse, Senior Lecturer in Psychology. For this Data Conversation he reflected on his role as editor of the Journal of Numerical Cognition, an open access journal which charges no author fees. The Journal is very encouraging of data sharing, and as editor John is in a position to ask his contributors to share their data, although the journal does not require it. John stressed that you can’t expect data sharing to happen organically – you have to ask.

Dr John Towse at the Data Conversation

Our next speaker was Dr Jo Knight who has featured as part of our Data Interview series talking about her work.  She explained about the emergence of the Psychiatric Genomics Consortium out of a need to share genomics data even where that data can be quite sensitive.  The aim is to make the data as open as possible and this has been made possible by creating a community of trust.  She emphasised that they are motivated by the wish to change people’s lives and do not share the data with commercial entities.

Dr Jo Knight discusses the issues of sharing genomic data

Dr Kyungmee Lee from the Department of Educational Research works with Distance Learners supporting their doctoral training as part of the preparation for their PhD research.  She encourages students to reuse existing datasets to investigate research methods and it was whilst doing this she realised how many datasets were out there which were difficult to use because they lacked context.

Dr Dermot Lynott takes us on a Data Journey

Dr Dermot Lynott entertained us with his confessions of a poor data manager, as he, like the rest of us, has been guilty of poor file organisation and even worse file naming. However, he also gave us a success story of publishing data which has been shared and re-used for a period of over 10 years and was keen to encourage others to see the benefit of doing the same.

Finally Professor Maggie Mort wrapped up with a moving and powerful description of the data gathered as part of the Documenting Flood Experience project and with warnings about the difficulties which might lie ahead with the incoming GDPR regulations which will impact on future projects which gather, use and store data relating to children.  This sparked off even more interest and debate.

Professor Maggie Mort discusses working with children and their data

To be honest we could easily have been there all day and we’re very much looking forward to the next Data Conversation on 10th April – Stories from the Field.

Rapt attention at the 4th Data Conversation

Rachel MacGregor, Digital Archivist

 

Data Interview with Andrew Moore

Andrew Moore (@apmoore94) is a 2nd year PhD student at Lancaster University within the School of Computing and Communications. He is studying how sentiment analysis can be improved through world knowledge using finance as his specialised domain. His research interests are across Natural Language Processing, Machine Learning, and Reproducibility.

We talked to Andrew after he presented at the 3rd Data Conversations.

Q: When does software become research data in your understanding?

Andrew: As soon as you start writing software towards a research paper that I would count as research data.

Q: Is that when you need the code to verify results or re-run calculations?

Andrew: You also need the code to clean your data which is just as important as your results because depending on how you clean your data that informs on what your results are going to be.

Q: And the software is needed to clean the data?

Andrew: Yes. The software will be needed for cleaning the data. So as soon as you start writing your software towards a paper that is when the code becomes research data. It doesn’t have to be in the public domain but it really should be.

Q: What is the current practice when you publish a paper? Do you get asked where your software is?

Andrew: Recently we have actually, for some of our conferences in the computational linguistics or Natural Language Processing field. But it is not a requirement to get published. It is a friendly question rather than an obligation.

Q: Who is asking, the publisher?

Andrew: No, that’s the conference chairs who are asking but it is not a requirement. Personally I think it should be. I can understand in certain cases when for instance there are security concerns. But normally the sensitivity is on the data side rather than the software.

Q: At the moment if you read a paper the software that is linked to the paper is not available?

Andrew: Normally, if there is software with the paper the paper would have a link, normally on the first or the last page. But a large proportion of the papers don’t have a link. Normally there would be a link to GitHub, maybe 50 per cent of the time. Other than that you can dig around if you’re really looking for it, perhaps Google the name but that’s not really how it should be.

Q: So sometimes the software is available but not referenced in the paper?

Andrew: That’s correct.

Q: But why would you not reference the software in the paper when it is available?

Andrew: I am really puzzled by this [laughs]. I can think of a few reasons. One of them could be that the GitHub instance is just used as backup. The problem I have with that is that if it is not referenced in the paper, how much do you trust the code to be the version that is associated with the paper?

Also, the other problem with that if I’m on GitHub is that if you reference it in a paper, on GitHub you can keep changing the code and unless you “tag” it on GitHub like a version number and reference that tag in your paper you don’t know what is the correct version.
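As a rough illustration of the tagging point Andrew makes here, the sketch below builds a link that is pinned to one tagged version of a repository, which is the kind of reference that could go into a paper. It assumes the code lives on GitHub and that a tag has already been created; the owner, repository and tag names are hypothetical, and the function name is made up for this example.

```python
import re


def pinned_code_url(owner: str, repo: str, tag: str) -> str:
    """Return a GitHub URL that points at one tagged version of a repository.

    Assumes the tag already exists, for example created with:
        git tag -a v1.0-paper -m "Code as used in the paper"
        git push origin v1.0-paper
    """
    # Keep the check simple: letters, digits, dots, underscores and hyphens only.
    if not re.fullmatch(r"[A-Za-z0-9._-]+", tag):
        raise ValueError(f"unexpected characters in tag: {tag!r}")
    return f"https://github.com/{owner}/{repo}/tree/{tag}"


# Hypothetical names, for illustration only.
print(pinned_code_url("example-researcher", "sentiment-paper-code", "v1.0-paper"))
```

A link to the tag keeps pointing at the same snapshot even if the main branch keeps changing afterwards, which addresses the "which version is it?" question.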

Q: What about pushing a version of the code from GitHub to [the data archiving tool] Zenodo and get a DOI?

Andrew: I didn’t know about that until recently!

Andrew presenting at Data Conversations

Q: So this mechanism is not widely known?

Andrew: I know what DOIs are but not really how you can get them.

Q: So are the issues why software isn’t shared about the lack of time or is it more technical as we have just discussed, to do with versions and ways of publishing?

Andrew: I think time and technical issues go hand in hand. To be technically better takes time and to do research takes time. It is always a tradeoff between “I want my next paper out” and spending extra time on your code. If your paper is already accepted that is “my merit” so why spend more time?

But there are incentives! When I submitted a paper at an evaluation workshop I said that everybody should release their software because it was about evaluating models so it makes sense to have all the code online. So it was decided that we shouldn’t enforce the release but it was encouraged and the argument was that you are likely to get more citations. Because if your code is available people are more likely to use it and then to credit you by citing your paper. So getting more citations is a good incentive but I am not sure if there are some studies proving that releasing software correlates to more citations?

Q: There are a number of studies proving there is a positive correlation when you deposit your research data[1]. I am not aware there is one for software[2]. So maybe we need more evidence to persuade researchers to release code?

Andrew: Personally I think you should do it anyway! You spend so many hours on writing software so even if it takes you a couple of hours extra to put it online it might save somebody else a lot of time doing the same thing. But some technical training could help significantly. From my understanding, the better I got at doing software development the quicker I’ve been getting at releasing code.

Q: Is that something that Lancaster University could help with? Would that be training or do we need specialists that offer support?

Andrew: I am not too sure. I have a personal interest in training myself but I am not sure how that would fit into research management.

Q: I remember that at the last Data Conversations Research Software Engineers were being discussed as a support method.

Andrew: I think that would be a great idea. They could help direct researchers. Even if they don’t do any development work for them they could have a look at the code and point them into directions and suggest “I think you should do this or that”, like re-factoring. I think that kind of supervision would be really beneficial, like a mentor even if they are not directly on that project. Just for example ten per cent of their time on a project would help.

Q: Are you aware that this is happening elsewhere?

Andrew: Yes, I did a summer internship with the Turing Institute and they have a team of Research Software Engineers.

Q: And who do the Research Software Engineers support?

Andrew: The Alan Turing Institute is made up of five universities. They represent the Institute of Data Science for the UK. They do have their own researchers but also associated researchers from the five universities. The Research Software Engineers are embedded in the research side, integrated with the researchers.

When I was an intern at the Turing Institute one of the Research Software Engineers had a time slot for us available once a week.

Q: Like a drop in help session?

Andrew: Yes, like that. They helped me by directing me to different libraries and software to unit test my code and create documentation, as well as stating the benefits of doing this. I know that other teams benefited from their guidance and support on using Microsoft Azure cloud computing to facilitate their work. I imagine that a lot of time was saved by the help that they gave.

Q: Thanks Andrew. And to get to the final question. You deposited data here at Lancaster University using Pure. Does that work for you as a method to deposit your research data and get a DOI? Does that address your needs?

Andrew: I think better support for software might be needed on Pure. It would be great if it could work with GitHub.

Q: Yes, at the moment you can’t link Pure with GitHub in the same way you can link GitHub with Zenodo.

Andrew: When you link GitHub and Zenodo does Zenodo keep a copy of the code?

Q: I am not an expert but I believe it provides the DOI to a specific release of the software.

Andrew: One thing I think is really good is that we keep data at Lancaster’s repository. In twenty years’ time GitHub might not exist anymore and then I would really appreciate a copy stored in the Lancaster archives. The assumption that “It’s in GitHub, it’s fine” might not be true.

Q: Yes, if we assume that GitHub is a platform for long-term preservation of code we need to trust it and I am not sure that this is the case. If you deposit here at Lancaster the University has a commitment to preservation and I believe that the University’s data archive is “trustworthy”.

Andrew: So putting a zipped copy of your code is a good solution for now. But in the long term the University’s archives could be better for software. An institutional GitLab might be good and useful. I know there is one in Medicine but an institution wide one would help. It would be nice if Pure could talk to these systems but I can imagine it is difficult.

The area of Neuroscience seems to be doing quite well with releasing research software. You have an opt-in system for the review of code. I think one of the Fellows of the Software Sustainability Institute was behind this idea.

Q: Did that happen locally here at Lancaster University?

Andrew: No, the Fellow was from Cambridge. They seem to be ahead of the curve but it only happened this year. But they seem to be really pushing for that.

Q: Thanks a lot for the Data Interview Andrew!

The interview was conducted by Hardy Schwamm.

[1] For example: Piwowar, H. A., & Vision, T. J. (2013). Data reuse and the open data citation advantage. PeerJ, 1, e175. http://doi.org/10.7717/peerj.175

[2] Actually there is a relevant study: Vandewalle, Patrick. Code Sharing Is Associated with Research Impact in Image Processing. Computing in Science & Engineering, 2012, http://ieeexplore.ieee.org/document/6200247/.

 

 

International Digital Preservation Day

What’s that about then?

International Digital Preservation Day 30th November 2017 #IDPD2017

Digital Archivists are a much misunderstood lot.

A lot of people think our work on digital preservation must be something to do with digitising old documents but this is absolutely not the case. Of course digitising old documents is fantastic, and wonderful resources are now increasingly available on the internet, like Charles Booth’s London or the Cambridge Digital Library (there are so many examples; these are just some of my favourites). There are thousands and thousands, useful for scholars, historians, students, teachers, genealogists, journalists – well, just about anyone really who is interested in getting access to sources that would otherwise be near impossible to access. Digitising archive and library content has revolutionised the way we access and interact with archives, manuscripts and special collections.

Image: Flickr https://flic.kr/p/dVHkbG Kjetil Korslien CC BY-NC 2.0

However – this is not what the digital archivist does (although there are overlaps).  The digital archivist is concerned mainly (although not exclusively) with archives, data, stuff – whatever you want to call it – which was created in a digital format and has never had a physical existence.  If someone accidentally deletes the digitised version of Charles Booth’s poverty maps, the original is still there and can be digitised again.  Of course that would be an enormous waste of time and effort which is why we often treat digitised content as if it were the original content and guard against accidental deletion or loss.

But although digitisation does help preserve a document because it reduces the wear and tear on the original it is often swapping one stable format (paper, parchment etc) for a less stable one.  So you could argue that digitising – rather than helping with preservation issues – is just creating new ones.  Of course there are many very unstable analogue formats such as many photographic processes, magnetic tape and so forth which need to be digitised if they are to survive at all.

Digitisation is not preservation.

With digitised content you would like to think (!) that you might have some measure of control over what that content is, specifically the format it comes in. It is possible to choose to save the image files in a format that is widely used and well documented, so that the risk that they will be hard to access in 5 or 10 years’ time is lessened. There are formats which are recommended for long term preservation because they are widely adopted and well supported and by choosing these we help the process of digital preservation by giving those files a “head start”.

However files which are created by others – perhaps completely outside of the organisation – can come in *literally* any format.  A good example of this is when I analysed a sample of the data deposited by academics undertaking research at our institution and found a grand total of  59 different file types.  OK so that doesn’t sound *too* bad but 55% of the files I couldn’t identify at all.  Which is not so good.
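To give a flavour of what “identifying” a file involves: identification tools generally look at the first few bytes of a file (its signature, sometimes called a “magic number”) rather than trusting the file extension. The sketch below is only an illustration of that idea and not the method used for the sample above. The four signatures are well-known ones, but the table and the function names are made up for this example and come nowhere near the coverage of a real format registry.

```python
# A deliberately tiny, hypothetical signature table for illustration.
SIGNATURES = {
    b"%PDF-": "PDF document",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\xff\xd8\xff": "JPEG image",
    b"PK\x03\x04": "ZIP container (also used by DOCX, XLSX, ODT)",
}


def identify(header: bytes) -> str:
    """Match the leading bytes of a file against the known signatures."""
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return "unidentified"


def identify_file(path) -> str:
    """Read only the first few bytes of a file and try to identify it."""
    with open(path, "rb") as handle:
        return identify(handle.read(16))


def unidentified_share(headers) -> float:
    """Fraction of the supplied file headers that match no known signature."""
    labels = [identify(header) for header in headers]
    if not labels:
        return 0.0
    return labels.count("unidentified") / len(labels)
```

Anything produced by specialised research software is likely to fall straight through a table like this, which is one way a collection ends up with a large share of unidentified files.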

So we could try (as some archives do) saying we will only accept files in a certain format, to give our files the best chance of a long and happy life. But clearly there are lots of circumstances where this is either impractical or impossible. For example with the papers of a now-deceased person – we cannot ask them to convert them or resubmit them. And in the case of our researchers they will need to be using specific software to perform specific specialised tasks and they themselves may have very little say in their choice of software.

Another major – and perhaps often overlooked – issue with digital preservation is actually making sure that the files are captured in the first place. This is not a digital specific problem – any kind of data, whether it is research outputs, personal papers or the financial records of a business, is at risk of disappearing if it is not looked after properly. It will need a safe storage environment where the risk of accidental or malicious damage is kept to a minimum and it can be found, the content understood and shared effectively. For digital files this means a particularly rigorous ongoing check that the content and format are stable and that they can still be made accessible.
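One common way of doing that ongoing check on content is a “fixity” check: record a checksum for every file when it arrives, then recompute the checksums later and compare. The sketch below is a minimal illustration of the idea, assuming SHA-256 checksums and a plain dictionary as the record; the function names are invented for this example and it is not a description of any particular preservation system.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Record a checksum for every file under the folder."""
    folder = Path(folder)
    return {
        str(path.relative_to(folder)): sha256_of(path)
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }


def check_fixity(folder, manifest):
    """Compare the files now on disk with an earlier manifest."""
    current = build_manifest(folder)
    changed = sorted(
        name for name in manifest
        if name in current and current[name] != manifest[name]
    )
    missing = sorted(name for name in manifest if name not in current)
    return {"changed": changed, "missing": missing}
```

A check like this only says that the bits have or have not changed. Whether the format can still be opened and understood is a separate question.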

So what is digital preservation?

It’s not just backing stuff up

It’s the active management of digital assets to ensure they will still be accessible in the future.

Making sure we can still open files in the future.
Making sure we can still understand files in the future.

It’s a tough job – but someone’s got to do it!

 

Data Interview with Alison Scott-Baumann and Shuruq Naguib

Our latest Data Interview follows up a presentation at our 2nd Data Conversation. Alison Scott-Baumann (Professor of Society & Belief, SOAS) and Dr Shuruq Naguib (Lecturer in Politics, Philosophy and Religion, Lancaster) are working on the Re/presenting Islam on Campus project. Re/presenting Islam on Campus is a three year project funded by the Arts and Humanities Research Council (AHRC) and by the Economic and Social Research Council (ESRC). It explores how Islam and Muslims are represented and perceived on UK University campuses.

We had the opportunity to discuss research data issues surrounding their project. It turned out to be a highly interesting conversation on topics such as confidentiality, the limits of anonymisation, legal frameworks and the freedom of speech.

Q: Could you describe the aims of your project?

Alison: Thanks for inviting us. It is strange to be on the receiving end because we have been doing a lot of data collection where we put people at ease and now we are at the other end.

About 4 years ago, I became concerned about the increasing surveillance culture around Muslim communities, particularly on campus because that has an impact on free expression or could do. To me as an experienced researcher this seemed to be a politicisation of a research field if you generally identify Muslims as the “official other” and also tell us that they are dangerous with the 2015 Counter-Terrorism and Security Act and its attendant Prevent duty. What is currently not acknowledged is that the Prevent duty is actually not compulsory but the university sector has adopted it in order to keep their reputations clean.

So it is quite a difficult topic and the project aims to look at four major questions:

  1. What do university staff and students know about Islam?
  2. Where do they find that information?
  3. Thirdly with specific reference to three issues, how do they formulate their opinions? The first issue with regard to Islam is gender because that’s often in the media. The whole hijab discussion for example. Radicalisation, there is no point ignoring it because … [even though] there is no evidence that anybody gets radicalised on campus. And the third one is inter-faith because relations among students of different faiths and intra-faith also is of interest to us because it is a very secular culture we live in and yet for many young people their faith identity is important, more important than we realise because of the secular atmosphere that we created on campus.
  4. The fourth question is, given that there might be some discrepancies self-identified by our participants in their responses to the first three questions, what could be done to improve the quality of the discussion on campus about Islam? How could we improve the discussion about anything that is regarded by university authorities as risky?

So all the way right from the start when I built a team we were all thinking about issues around Islam but also about the implications of that for the campus about free speech. That turned out to be a big issue because that gets more and more discussed even in the press.

Alison Scott-Baumann

Q: How long does the project run?

Alison: It is a 3 year project from 2015-2018. We are two thirds through.

Q: What kind of data do you need to answer your research questions?

Shuruq: We have two sets of data. We have actually completed data collection. We have collected quantitative data through a survey questionnaire. It was designed to be sent to the 6 universities which are participating in the research. Before we received the grant and throughout the first year we were in conversation with the gatekeepers at those universities who were usually senior managers. They promised to facilitate the research including the survey to staff and students.

When we started  on-site research, we also wanted to do the questionnaire at the same time but the gatekeepers withdrew their collaboration.  The gatekeepers tried to get approval from the vice-chancellors and senior management. We came across a problem on several sites and that is what some describe as survey-fatigue. They were worried about students and staff receiving too many requests to fill in questionnaires. It seemed that universities were very reluctant to facilitate our surveys.

We had to redesign the questionnaire so that it was no longer specific to the case studies; it was now a nation-wide questionnaire targeting students only, and we went to a private company to do that. The private company had access to students and could build up a sample for us. For example, we wanted our sample to include Muslims and non-Muslims and equal representation of gender and other criteria that we had in mind. We decided not to do the staff questionnaire because you can’t do that through the private companies and the universities were refusing to help. We had to make these decisions because of that particular challenge.

The other subset of data which is qualitative is based on interviews, focus groups, ethnography and curricular material. On each of the six campuses we interviewed 10 students and 10 members of staff. We attempted to handpick staff according to an ideal list which represents a mix of administrative and academic staff, senior and junior staff in different departments, Human Resources,  deans and postdocs, etc. The student interviewees were recruited through emails sent through the student union or were invited by researchers. It was a random sample. There were four focus groups on each site, one with staff and three with students. We wanted one focus group to be with Muslims, one with non-Muslims and one mixed. We didn’t always achieve all types and we faced a real challenge in recruiting students. Sometimes non-Muslim students weren’t at all interested in religion or Islam. We tried different techniques such as focus groups in cafes or other hang-out spaces for students but if participants are not interested in your topic no matter how you promote it, it’s really challenging! You might get a self-selected sample of participants who are interested in that topic.

Then we’ve also done ethnography which included observing the sites where students are, talking to different student societies, talking to a wide range of university staff. We attended public events, observing and describing these events: Who attends them, who the speakers are, especially if they are related to topics of religion, Islam, freedom of speech?

Part of the research is also how Islam is studied in the classroom. For each campus we attempted to collate data about all the courses that included a component on Islam. For a long time we used to call this “Islamic Studies” but we don’t mean Islamic Studies in a narrow sense, we mean it in a broad sense. We changed that label for that category of data to “Studying Islam” to broaden it out to include a course in the Faculty of Medicine on for example “Religion and Health”. We collected material through desktop research on all the courses that are offered in the year of the field work which have a component on Islam or religion.

Then we tried to zoom in on some modules reflecting a range of disciplines and approaches, collecting course programme and syllabus for further analysis. Within that sample we also attended some of the classes to observe the actual teaching and how the students respond. So we have a very complex set of data and we are just about to start the analysis stage and there are quite a few challenges there too.

Q: You have collected a wide range of data, from publicly available information to sensitive data like views on religion. Does that have an impact on how you manage your data?

Alison: There are challenges of managing that data but also of collecting it. When I submitted the research proposal to AHRC that was a year before the Counter-Terrorism and Security Act was passed. When I was awarded the grant that act had been passed. So a situation on campus that had already been quite sensitive arguably became more so. We were determined as a team to protect the identity of participants and we have established a sequence of events which we hope maximizes that possibility. We do tell our participants that they have to accept that it is actually impossible for us to be completely sure that we can protect them, because if somebody wants to hack and they have money and expertise then they can get access to stuff.

But I’ll run you quickly through how we do things. There are only two documents that have the allocated number given to a participant and their name. One of them is the consent form. That is kept away from the university, locked up. The other document that has their allocated number and their identity is an Excel spreadsheet which is kept in a virtual vault which has all their characteristics except their political views. We are not collecting political views which the 1998 Data Protection Act lists as something that should be protected. So we are acting in accordance with that Act by seeking to protect their identity.

Once we’ve done that we then tell them before they speak that they have the right to withdraw, the right to anonymity and confidentiality and we give them a timeline so they have six months in which they could say “I’m actually not comfortable with this” but nobody has done that. What we cannot be sure of, of course, is who are the people who walked away from the possibility of speaking to us? It could be the silent majority. We will never know that. We have worked through the student unions to secure the interested students but if something pops up on their screens regarding opinions on Islam there are people who might think “I don’t want to enter that arena” for all sorts of different reasons.

Q: Can you expand on your data security and confidentiality measures?

Alison: We keep our master spreadsheet encrypted via VeraCrypt which is a non-aligned programme unlike BitLocker which belongs to Microsoft.

In order to conduct an interview or a focus group we allocate a number to each person, and before we did this we thought participants would find it ridiculous. But actually, in the focus groups people found it liberating, which is the ideal. Every time they spoke they said “Number 32 speaking” and they would even say things like “I would like to endorse what Number 42 has just said”. That was perfect!

Q: Instead of a name badge people would wear a number?

Alison: No name badge but a numbered Post-it on the table in front of them, and we know who they are if we want to track back. That worked much better than we thought it possibly could.

Then the interviews and focus groups are transcribed. We used a company called Divas for this because it is a lot of material. They have their own confidentiality agreement and we created one from SOAS as well. Divas destroy the original audios after a couple of weeks. We keep them but will destroy them some time in the second year. They will never be archived.

After the transcripts come back to us we have to clean them up. We have to take out any mention of names.

Shuruq: Let me add to that. Two issues have come up when cleaning the data.

Q: By cleaning do you mean anonymising?

Shuruq: Yes, anonymising and removing any identifiers. Even when we use numbers in the focus groups they will refer to sites on their particular campus which will make locations identifiable. Or they would refer to a lecturer by name or to a course title. These are all ways by which confidentiality on that campus would be undermined. So we weren’t anonymising just the participants but also ensuring the anonymity of the campuses. Although the campuses are all named in our research we have agreed that when we come to write up the findings, we will not identify the campuses, because of sensitive issues such as how the university implements Prevent policies. There could be some negative opinions, some difficult experiences. We don’t want to link those to specific campuses. So we are cleaning the data more extensively than normal perhaps.

Shuruq Naguib

It is quite challenging because as you are stripping down the data you lose context. If there is a university in Wales the Welsh context actually has certain factors that are important to remember when you are analysing the data. Or a specific college in London, how do we do that? We were negotiating the cleaning of the data with regard to gender, ethnicity, background, names of places. We tried to replace these with things that identify these elements but which maintain the anonymity. If it is a café we would strip down the name but still reflect the fact that it is a café in a student union.

But sometimes, especially with interviews we’ve had people who have roles, for example a student who is the Head of a Society or who is active on campus, is well-known and speaks in a certain way. Even if we clean the transcription if we want to quote him he might still be identified by his peers and people who know him.

And then one of the things we are coming up against is transliteration because as we look at how Islam is studied, some of the courses are linked with language training and attract overseas students. It is normal to hear different languages in this context. In an interview different languages could be used. Most of our team members speak several languages so participants have felt at ease using other languages. So how do we transliterate or translate? Sometimes it’s copious work. Some of the terms used in Arabic have specific religious connotations.

This is also sensitive data because often Arabic is perceived suspiciously as a sign of being foreign, as a sign of being a bit radical or of being committed to certain religious concepts. Do you keep the Arabic in the data? Certain words like Hijab and Jihad are loaded with negative connotations in public discourses. On some occasions we made the decision not to send a particular interview to the transcriber because it would endanger the person because they have expressed political views or they used a language that might be misunderstood. To protect the identity of that particular person on one occasion, our postdoc decided to transcribe the interview herself.

Q: Will you be able to share your data?

Alison: It will go into the UK Data Archive. That is a commitment we made to the AHRC and the ESRC who are partly funding us. There are definitely difficulties in assessing the risk of re-identification because it is impossible for us to know how recognisable somebody is to their colleagues or their friends by the way they are expressing themselves.

Q: Can I just confirm that you will share only transcriptions?

Alison: Yes, no audio, no video. But also, we haven’t decided what level of sharing is needed. We have already discussed this with the UK Data Archive and they have three access levels. Our data will not be Open Access. Some of it might be open to all registered users; other data might be accessible to approved researchers only. There might be two tiers. I think our concern all the way through was not that anybody has said anything dangerous, because nobody has, but that it might be construed as overly political by somebody who is looking at that data. If one of our participants has a view on foreign policy that doesn’t concur with the Government – in a democracy that should be possible but may be problematic in the current climate.

Q: Thanks for the explanation. What kind of research data services can Lancaster University offer to help your project?

Alison: I am personally very interested in the General Data Protection Regulation (GDPR) which will come into force in 2018. It appears to be inviting member states to decide if they tighten up on consent. This is an issue to do with Big Data and the way in which it is possible for all of us to covertly record or film each other, track each other. Anything is possible now. So the issues about consent may impact upon our ethnography. We did nothing covertly but inevitably if we were in a big open meeting we may have made notes about something somebody said and even if we don’t identify them we haven’t asked their consent. We would like guidance on whether this is going to clamp down on issues around consent or if it is business as usual, which means that if you go to reasonable lengths to protect somebody’s identity then that is acceptable.

We would also like you to be our critical friend [laughs]. We have a year to go. I think we are well prepared and we worked really hard on this aspect but there may be issues that we haven’t covered.

Project website: http://representingislamoncampussoas.co.uk

Q: Can I ask about the ethnography, field notes and observations, will you be able to share them?

Alison: I’ll give you a specific example. At campuses where it was possible we secured the approval of members of staff to allow us to sit in a lesson. The students were told when we were there but we didn’t ask each of them to sign a consent form. For example a student in one class I was in about international politics described how her relatives were caught up in border violence in Eastern Europe. I didn’t have her name but I made a note of this as an example of how a really difficult issue can be taught so well, and the trust between the student and the staff can be so high, that a student can self-disclose.

But it might be necessary under the new General Data Protection Regulation to remove that and simply say that there was evidence that trust was high rather than giving the specific example. To me it doesn’t seem that I am endangering that person’s identity, absolutely not.

Shuruq: And the other difficulty is of course that we have also done ethnography at public events which could have been organised by the chaplaincy or a student society. Again, if you wanted to identify these events that can be done. These societies often set up event pages.

It could also be a lecture on Islam and the media, which was one of the public lectures I attended. The speaker is well known and the event was well publicized. My observations cover the discussions and the kinds of questions that emerged, and how the audience was made up (mostly Muslims; very few of the white students attended that talk). The ones who are interested in Islam in the media are those who are impacted by the media representation, which is largely Muslim students on campus.

How do you keep aspects of the context that shed light on the meaningfulness of this event and which makes the ethnography useful without undermining anonymity?

Q: One final question: In our trainings we often hear the concern that if you include a statement in a consent form that anonymised data will be shared publicly you might get fewer participants. Is that something you have experienced?

Alison: No, participants accept that. The point is that if they come to meet us, if they made that step that means that the information that was sent out by staff or student bodies has convinced them that this is an ethically planned project where we are not going in with preconceptions. If we then say that anonymised data will be shared they accept that.

The issue I am raising, which the ICO [Information Commissioner’s Office] hasn’t really clarified, is whether you would have to get a consent form from thirty people in a classroom. At one level that is a reasonable extension of consent issues, but it challenges our understanding of ethnography.

Shuruq: Of course we don’t collect any information on the students; we don’t know who they are. But the course outlines and lecture names will not be anonymised in class ethnography so that is something we need to be reflecting upon. The other thing is that the lecturer of one class asked if we were allowing students to withdraw from the class and whether we are asking for their consent. Our team member asked for a verbal consent and the lecturer gave students the opportunity to stay or withdraw from the class. So this could be an issue for some people.

Alison presenting at Data Conversations

Q: Do you have any final comments on your project with regards to data?

Shuruq: On one campus at a private university they had a previous experience of research where the anonymity of some of the interviewees was not protected and the way they were represented in the book that came out of the research was very negative. They were extremely reluctant to allow us in without sufficient guarantees that we are going to protect their identity. But we are facing a serious dilemma because it is such a unique campus that it is impossible to report anything on it without revealing which one it is. That is a serious challenge.

Alison: Just to follow on from that. We mentioned free speech right at the beginning. These ethically motivated strictures, like the possible new legislation [GDPR] about consent, are at one level eminently sensible but at another level they may make it almost impossible to do research on people’s ability to express themselves freely. If people can’t express themselves freely because it might compromise them or their institution then we can’t do the research. So it is a very clever double bind but it’s not good for democracy, because the ability to express oneself freely has possibly become, in the public eye, the ability to have a strong opinion about something, instead of what I think it is, which goes right back to Socrates: you talk something through in order to understand it better and understand your own decision-making processes. For young adults at university the heuristic value of freedom of expression, as long as it is not rude or illegal, is absolutely paramount to having citizens who are able to conduct themselves wisely in this complex world! There are huge issues at stake here!

Alison, Shuruq, thank you very much for this interesting interview!

The interview was conducted by Hardy Schwamm @hardyschwamm