We held our third Data Conversations here at Lancaster University, again with the aim of bringing together researchers to share their data stories, discuss issues and exchange ideas in a friendly and informal setting.
We all had plenty of time to eat pizza and crisps before Neil invited us all to consider reproducibility and sustainability in relation to software. Neil has a very clear and engaging style which really helped us, the audience, navigate around the complex issues of managing software. He asked us all to imagine returning to our work in three months’ time – would it make sense? Would it still work? He also addressed some of the complex issues around versioning, authorship and sharing software.
The second half of the afternoon followed the more traditional Data Conversations route of short lightning talks given by Lancaster University researchers.
First up was Barry Rowlingson (Lancaster Medical School) talking about the benefits of using GitLab for developing, sharing and keeping software safe.
Barry Rowlingson weighs up the benefits of GitLab over GitHub…
Next was Kristoffer Geyer (Psychology) talking about the innovative and challenging uses of smartphone data for investigating behaviour and in particular the issues of capturing the data from external and ever changing software. Kris mentioned how the recent update of Android (to Oreo) makes retrieving relevant data more difficult – a flexible approach is definitely what is needed.
Then we heard from Andrew Moore (School of Computing and Communications) who returned to the theme of sharing software, looking at some of the barriers and opportunities which present themselves. Andrew argued passionately that we need more resources for software sharing (such as specialist Research Software Engineers) but also that researchers need to share their attitudes towards sharing their code.
Our final speaker was the Library’s own Stephen Robinson (Library Developer) talking about using containers as a method of software preservation. This provoked quite some debate – which is exactly what we want to encourage at these events!
We think these kinds of conversations are a great way of getting people to share good ideas and good practice around data management and we look forward to the next Data Conversations in January 2018!
This blog post was co-authored by Rachel MacGregor and Hardy Schwamm.
You can find a short summary of the event, the slides and some photos below.
Denes Csala – The sensor cloud around us: collecting, mining and visualizing the energy and building management data of the campus
Dr Denes Csala is a newly appointed lecturer in Energy Storage Systems Dynamics with Energy Lancaster.
There are 30,000 sensors on campus capturing all sorts of data about energy and energy consumption. This has the potential for us to understand a huge amount about the way energy is managed and used but at the same time throws up the issue of managing extremely sensitive commercial and personal data. Access to the data is strictly controlled but Energy Lancaster are very excited about the possibilities of what could be done with the data.
You can see an animated visualization of the campus energy metering system sensor data here:
Kopo Ramokapane – Cloud computing: When is Deletion Deletion
Kopo reported that when you delete data in the cloud there is no way to be sure that all copies or all versions have been deleted from the cloud provider. This issue isn’t new but doesn’t get as much attention as it should. Because of the way Cloud storage operates it is almost impossible even for the service providers to be certain that all the data has been deleted. Avoid storing confidential data in the Cloud and learn more about how the systems work! Lancaster University has a contract with cloud service Box which ensures that compliance issues are dealt with in relation to storage of confidential or sensitive data.
Karen Broadhurst and Stuart Bedston – Better data for better justice: Towards data-driven analyses of Family Court policy and practice
Professor Karen Broadhurst and Stuart Bedston from the Sociology Department reported on concerns about transparency in family court decision-making. Greater transparency and “open data” would have a positive impact in many ways but is hard to achieve given the security requirements and potential risks.
Karen and Stu highlighted the changes that would be needed in order to strengthen interdisciplinary research using controlled-data here at Lancaster University but also the difficulties that stand in the way.
John Couzins – Security Overview at Lancaster University
Next on was John Couzins, the IT Security Manager of Lancaster University. John, who works for the institutional IT service ISS, reported on the certifications that are necessary to fulfil the requirements of certain providers of confidential data. Current examples are Cyber Essentials Plus and the IG Toolkit (Information Governance Toolkit) which is used by the NHS.
Mateusz Mikusz – Running Research as a Service. Implications for Privacy Policies and Ethics
The issue regarding the data is that it is used for two purposes:
To make the app and its use cases work
To create research data of usage and other properties that can be analysed by the project team
Mateusz explained that he is working hard to bring both things together in an ethical way that still allows innovative research.
It was a great showcase for a lot of fantastic research that is taking place at Lancaster University and the way in which handling sensitive data and tackling data security is at the forefront of this. There were probably as many questions raised as there were answers given but it was a great opportunity to share approaches to handling data securely and ethically.
Q: In support of Open Data what roles do Policies by funders of the University have? Are they helpful? Or is it seen as just another hurdle in the way of doing research?
David: I could be wrong but I don’t think most people just view it as just a hurdle. I think when people have to write a Data Management Plan for a grant that is a bit of a pain. But I don’t think the idea of having the data freely available is something where most people say “I can’t be bothered”. It is an additional step but it is something people should be doing anyway because if you are going to be clear on what your results are the data should be in a form that’s usable and could be easily moved between people. I think most people say that’s a good thing but maybe I’m biased…
Q: I have talked to PhD students asking if they want to share their data and they said I should have asked them three years ago because now it is so much work. I wonder why that is and if we need to change the way we teach them how to manage their data?
David: I wonder if I would have said the same thing. All the data of my PhD is still around but as I was learning my craft I probably wasn’t the most efficient, and my data wasn’t managed as efficiently as I would do now. I don’t remember going to a data management training or anything. And if someone had done that on day one of my PhD? Data should be kept in an ordered fashion etc. I created a lot of extra work for myself because I would do some analysis, close the file and end up re-doing the same things I have done multiple times. And even on that level that is not very efficient.
Q: Is that something we should teach students, you think?
David: It’s probably something students wouldn’t be too keen [on].
Q: Yes, you don’t want to patronise them.
David: And it is a bit like saying: For god’s sake back up stuff! If you look at all the horror stories [about] who loses data. It’s only when it goes wrong that it becomes a problem. I think some people are automatically super organised. I was probably somewhere in the middle, probably more organised now. I think the issue is in a lot of academia, you just figure it out as you go. And some people develop brilliant habits and some people, including myself, bad and good. And other people develop really bad habits. And that just carries on.
I sometimes look at Retraction Watch to see what’s in, and there is this really interesting example of an American paper, an American guy who posted a paper in Psychological Science whose undergraduate student collected the data and then it turns out the entire paper is wrong when someone re-analysed the data and found so many mistakes in it. Of course it has been retracted. Now the professor has said it is the student’s fault [whole story here]. But whoever taught that student data management? If that is the issue and it looks like it, they have taken the eye off the ball. And now without a doubt his other papers will be scrutinised. Clearly, there are bad habits ingrained that he has passed on.
And it is not just students, it is people higher up as well. The students have been informed by their own supervisors. So I say to my students: back stuff up, make sure things are organised and I can usually tell without going into their file system. What usually happens, if I ask them for something, a piece of data, it will appear quickly because they know where it is, and that is good enough for me. But if it takes ages, that’s when we end up having a talk saying “What are you actually doing with your data?” because this seems really all over the place. But not every supervisor does that, as that guy proved. He didn’t even seem to look at the data. I am not saying that can happen here; but it is not only the students.
Q: What could the University do more to assist Open Data supporters like yourself?
David: I really like the fact that the Library is pushing the fact that you can upload datasets. I know there are not many people from my Department that are doing it… I think that is really interesting. It is something that I – not necessarily challenge – but I do mention it. I don’t really get why. It is the sort of thing where you are submitting a paper you don’t even have to do it formally. There are journals that don’t have a data policy but I can still through our Pure system link data and paper together. I don’t see how that is a bad thing and that there is a huge effort needed to do that.
Maybe academics say it is just another thing to do? A colleague of mine would always say if they want the data they can always email me. Now that might be true but there are lots of cases when you email academics they never get back to you. The same colleague gets so many emails that they have someone to manage their mails. I take the point that the counter argument is that nobody actually will want to see the data and maybe they won’t. But given how random stuff is… you don’t know. For, what you publish today it might not be important and then suddenly it is important.
So my answer to the question is I am not exactly sure. There is more support in this institution than in my other, to my memory, in terms of: “this is a place to put my dataset”. One of the courses I was on here about data management as part of the 50 Programme was really useful in the sense that I left thinking from now on I am going to put my data there [into Pure].
Q: Should there be other incentives for opening up research data rather than “doing a good thing”? Should there be more credit for Open Data?
David: Yes, probably. We are always judged, when we do PDRs every year, on how much I published and got this much money. But actually, the data output does have a DOI now and it is citable and it is a contribution that the University is getting from the academic. It is additional effort. So it would be interesting to see what happened if it went as far as maybe not a promotion thing, but … part of good practice. I think the question I would ask academics is: if your data is not there, where are you keeping it long term? Now I am working at another project where data cannot be made open and that is fair enough, but in general I do wonder where all that data is going. There is a duty where it needs to be kept for a certain length of time. I think it is easier to put in there [Pure] then I don’t need to think about it if nothing else. That gives me more comfort.
Q: Is there anything you’d like to add?
David: I am certainly in support of Open Data but I write more about data visualisation because I like pictures as much as I like data [laughs].
Thanks David for an interesting interview. We hope to do more Data Interviews soon. In the meantime, if you have any questions or comments leave them below or email email@example.com.
The first Data Conversations happened on Monday, 31st of January 2017. Below is a quick overview of the action. You can find slides of four talks below.
Data Conversations Opening
The event was opened by Professor Adrian Friday from the Data Science Institute (DSI) who emphasised that the DSI is all about collaboration between disciplines which is also the spirit of Data Conversations. In fact the 25 attendees came from a range of Departments: Biological and Life Sciences, Chemistry, Computing, Educational Research, History, Law, Lancaster Environment Centre, Politics, Psychology and others.
Data Conversations Talks
Unfortunately, Dr Chris Jewell from the Medical School had to cancel his talk. You can see an overview of the agenda below.
Leif Isaksen – Does Linked Data Have to be Open?
Leif Isaksen from the History Department (Leif is also involved in the Data Science Institute) presented the Pelagios Commons project which provides online resources for using open data methods to link and explore historical places.
Leif stressed that linking data is a social process which is built on open partnerships.
You can see Leif’s presentation below:
Jude Towers – Is Violent Crime Increasing or Decreasing?
Dr Jude Towers from Lancaster’s Sociology Department discussed crime rates, especially the rate of domestic violence over time through the Crime Survey for England and Wales. A current ESRC project is looking at how changing survey methodologies alter the underlying data of crime statistics.
Alison Scott-Baumann – Protecting participants and their data on a sensitive topic
Alison and Shuruq explained how difficult it is to get the balance right between confidentiality and data security required to manage often highly sensitive data, and to meet the expectations of data sharing. They stressed how much effort they spend on explaining the terms of the consent forms to project participants.
David Ellis – Building interactive data visualisations to support publications
Chris Donaldson & James Butler – Mining and mapping places with multiple names
Finally, Dr Christopher Donaldson and Dr James Butler talked about their research using a 1.5 million word corpus of Lake District 18th and 19th century literature. Christopher and James use the Edinburgh Geoparser System to automatically recognise place names in text and disambiguate them with respect to a gazetteer.
James demonstrated how he deals with name variations (secondary names), which is a lot of work. For example, the lake “Coniston” appears in the corpus as: Thurstan, Coniston Lake, Coniston Water, Thurston, Conistone, Conistone Lake, Cunnistone Lake, Thurston Lake, Coniston Mere, Lake of Coniston, Conis- ton, Conyngs Tun, Conyngeston, Thorstane’s watter, Turstinus.
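To give a feel for what resolving secondary names involves, here is a minimal Python sketch. It is not the Edinburgh Geoparser and not James's actual workflow, just a plain lookup table built from the variants listed above. The names GAZETTEER, normalise, build_lookup and resolve are made up for this illustration, and using "Coniston" as the canonical label is an assumption.

```python
import re

# Hypothetical mini-gazetteer: canonical label -> variant spellings from the list above.
GAZETTEER = {
    "Coniston": [
        "Thurstan", "Coniston Lake", "Coniston Water", "Thurston",
        "Conistone", "Conistone Lake", "Cunnistone Lake", "Thurston Lake",
        "Coniston Mere", "Lake of Coniston", "Conyngs Tun", "Conyngeston",
        "Thorstane's watter", "Turstinus",
    ],
}


def normalise(name):
    """Lower-case, unify apostrophes, rejoin line-break hyphens, collapse spaces."""
    name = name.replace("\u2019", "'").lower()
    name = re.sub(r"-\s+", "", name)  # "conis- ton" -> "coniston"
    return re.sub(r"\s+", " ", name).strip()


def build_lookup(gazetteer):
    """Map every normalised variant (and the canonical label itself) to its canonical label."""
    lookup = {}
    for canonical, variants in gazetteer.items():
        lookup[normalise(canonical)] = canonical
        for variant in variants:
            lookup[normalise(variant)] = canonical
    return lookup


LOOKUP = build_lookup(GAZETTEER)


def resolve(name):
    """Return the canonical label for a place name, or None if it is unknown."""
    return LOOKUP.get(normalise(name))
```

Calling resolve("Thurston Lake") or resolve("Conis- ton") returns "Coniston", while a name that is not in the table returns None. A real gazetteer also has to cope with names that match several different places, which a simple table like this cannot do.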
Feedback so far
The feedback from attendees and presenters so far is encouraging.
Enjoyed the presentations. I hope these data conversations will become a nice community for those interested in data. Relaxed and nicely themed but not too prescribed. The venue was good and the cakes and biscuits were very good!
We got some comments on the length of the presentations and question time.
Really enjoyable – perhaps a bit more time for each speaker / questions and discussions.
We will look into amending the format. We do like to keep a balance between time for data stories and discussions and giving a number of Lancaster researchers a forum to talk about their experiences. Thanks for the comments and suggestions so far!
Upcoming: 2nd Data Conversations 4th of May
We hope to report on some of the data presentations in more detail in future blog posts. Meanwhile, we are already preparing for the next Data Conversations event on 4th of May (1.45-4 pm). The theme of the event will be “Data Security and Confidentiality”, and registrations are open: http://bit.ly/ludatacon2. Please come along and if you have any questions get in touch with the RDM Support Team: firstname.lastname@example.org.