We recently held our fifth Data Conversations here at Lancaster University Library. These events bring researchers together and act as a forum to share their experiences of using and sharing data. The vibe’s informal and we provide our attendees with complimentary coffee, cake and pizza…
It’s FAIR to say that pizza is a popular part of the event. Who doesn’t love pizza…? The informal lunch at the start brings researchers together. It’s a chance to spark conversations and connections with colleagues from different disciplines and at different career stages.
Once again we had a great programme with contributions from three fantastic speakers:
Up first was Dr David Ellis, Lecturer in Computational Social Science from the Psychology department and one of our Jisc Data Champions. David spoke about his experiences (including challenges and solutions) of working with National Health Service Data.
Next up was Jessica Phoenix, Criminology PhD Candidate. Jess spoke about her Master’s dissertation project which looked at missing persons and the link between risk assessment and time to resolution. She spoke about the challenges and solutions associated with creating a dataset from pre-existing raw data. These issues were amplified because the data (police records) were highly sensitive and identifiable.
Last up was Professor Chris Hatton, Centre for Disability Research, Division of Health Research. Chris discussed his experience of collaborating with social workers to achieve uniquely valuable results. He also explored the way in which social media (his Twitter account) has provided a platform to engage with a wide array of voices that he couldn’t have reached through conventional research methods.
We held our fourth Data Conversation here at Lancaster University, bringing together researchers and their experiences of using and sharing data over pizza and cake…
Pizza is a big attraction at an event but more importantly it brings people together to share experiences and creates a relaxed and informal environment which encourages conversation – exactly what we want. Now in our fourth event in the series we have some “regulars” who come for the conversation (and the pizza) but also new faces who bring new perspectives.
We had another interesting programme with a range of researchers from different disciplines:
Our first speaker was Dr John Towse, Senior Lecturer in Psychology. For this Data Conversation he reflected on his role as editor of the Journal of Numerical Cognition, an open access journal which charges no author fees. The journal strongly encourages data sharing, and as editor John is able to ask his contributors to share their data although the journal does not require it. John stressed that you can’t expect data sharing to happen organically – you have to ask.
Our next speaker was Dr Jo Knight who has featured as part of our Data Interview series talking about her work. She explained about the emergence of the Psychiatric Genomics Consortium out of a need to share genomics data even where that data can be quite sensitive. The aim is to make the data as open as possible and this has been made possible by creating a community of trust. She emphasised that they are motivated by the wish to change people’s lives and do not share the data with commercial entities.
Dr Kyungmee Lee from the Department of Educational Research works with Distance Learners supporting their doctoral training as part of the preparation for their PhD research. She encourages students to reuse existing datasets to investigate research methods and it was whilst doing this she realised how many datasets were out there which were difficult to use because they lacked context.
Dr Dermot Lynott entertained us with his confessions of a poor data manager, as he, like the rest of us, has been guilty of poor file organisation and even worse file naming. However, he also gave us a success story of publishing data which has been shared and re-used for over 10 years, and was keen to encourage others to see the benefit of doing the same.
Finally Professor Maggie Mort wrapped up with a moving and powerful description of the data gathered as part of the Documenting Flood Experience project. She also warned about the difficulties which might lie ahead with the incoming GDPR, which will affect future projects that gather, use and store data relating to children. This sparked off even more interest and debate.
To be honest we could easily have been there all day and we’re very much looking forward to the next Data Conversation on 10th April – Stories from the Field.
Q: When does software become research data in your understanding?
Andrew: As soon as you start writing software towards a research paper that I would count as research data.
Q: Is that when you need the code to verify results or re-run calculations?
Andrew: You also need the code to clean your data which is just as important as your results because depending on how you clean your data that informs on what your results are going to be.
Q: And the software is needed to clean the data?
Andrew: Yes. The software will be needed for cleaning the data. So as soon as you start writing your software towards a paper that is when the code becomes research data. It doesn’t have to be in the public domain but it really should be.
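As a minimal sketch of the point that cleaning decisions shape results, the Python below applies two cleaning rules to the same made-up readings and gets two different means. The values, the function names and the threshold of 100.0 are all assumptions for illustration. None of it comes from Andrew's own code.

```python
from statistics import mean

# Hypothetical raw readings: None marks a missing value,
# and 250.0 is a value a researcher might suspect is an error.
raw = [12.0, 15.0, None, 14.0, 250.0, 13.0]


def clean_drop_missing(values):
    """Cleaning rule A: remove missing values only."""
    return [v for v in values if v is not None]


def clean_drop_missing_and_outliers(values, upper=100.0):
    """Cleaning rule B: also remove values above an assumed plausibility threshold."""
    return [v for v in values if v is not None and v <= upper]


mean_a = mean(clean_drop_missing(raw))
mean_b = mean(clean_drop_missing_and_outliers(raw))

# Same raw data, different cleaning code, different headline number.
print(mean_a, mean_b)
```

Without the cleaning code, a reader of the paper cannot tell which of the two numbers was reported, or why.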
Q: What is the current practice when you publish a paper? Do you get asked where your software is?
Andrew: No, that’s the conference chairs who are asking but it is not a requirement. Personally I think it should be. I can understand in certain cases when for instance there are security concerns. But normally the sensitivity is on the data side rather than the software.
Q: At the moment if you read a paper the software that is linked to the paper is not available?
Andrew: Normally, if there is software with the paper the paper would have a link, normally on the first or the last page. But a large proportion of the papers don’t have a link. Normally there would be a link to GitHub, maybe 50 per cent of the time. Other than that you can dig around if you’re really looking for it, perhaps Google the name but that’s not really how it should be.
Q: So sometimes the software is available but not referenced in the paper?
Andrew: That’s correct.
Q: But why would you not reference the software in the paper when it is available?
Andrew: I am really puzzled by this [laughs]. I can think of a few reasons. One of them could be that the GitHub instance is just used as backup. The problem I have with that is that if it is not referenced in the paper, how much do you trust the code to be the version that is associated with the paper?
Also, the other problem with GitHub is that even if you reference it in a paper, you can keep changing the code. Unless you “tag” it on GitHub, like a version number, and reference that tag in your paper, you don’t know which is the correct version.
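To illustrate the tagging idea, here is a small Python sketch that builds a link to one tagged snapshot of a GitHub repository, not to a branch that keeps moving. The owner, repository and tag names are hypothetical. It assumes the tag has already been created and pushed, for example with `git tag -a v1.0.0 -m "Version used in the paper"` followed by `git push origin v1.0.0`.

```python
def tagged_source_url(owner: str, repo: str, tag: str) -> str:
    """Build a GitHub URL that points at one tagged snapshot of a repository."""
    if not tag:
        # Without a tag the link would follow the default branch, which can change.
        raise ValueError("a tag is required so the link stays fixed")
    return f"https://github.com/{owner}/{repo}/tree/{tag}"


# Hypothetical owner, repository and tag, for illustration only.
print(tagged_source_url("example-lab", "paper-code", "v1.0.0"))
```

Citing a link of this form in the paper tells the reader exactly which state of the code produced the results.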
Q: What about pushing a version of the code from GitHub to [the data archiving tool] Zenodo and getting a DOI?
Andrew: I didn’t know about that until recently!
Q: So this mechanism is not widely known?
Andrew: I know what DOIs are but not really how you can get them.
Q: So are the issues why software isn’t shared about the lack of time or is it more technical as we have just discussed, to do with versions and ways of publishing?
Andrew: I think time and technical issues go hand in hand. To be technically better takes time and to do research takes time. It is always a tradeoff between “I want my next paper out” and spending extra time on your code. If your paper is already accepted that is “my merit” so why spend more time?
But there are incentives! When I submitted a paper at an evaluation workshop I said that everybody should release their software because it was about evaluating models so it makes sense to have all the code online. So it was decided that we shouldn’t enforce the release but it was encouraged and the argument was that you are likely to get more citations. Because if your code is available people are more likely to use it and then to credit you by citing your paper. So getting more citations is a good incentive but I am not sure if there are some studies proving that releasing software correlates to more citations?
Q: There are a number of studies proving there is a positive correlation when you deposit your research data. I am not aware there is one for software. So maybe we need more evidence to persuade researchers to release code?
Andrew: Personally I think you should do it anyway! You spend so many hours on writing software so even if it takes you a couple of hours extra to put it online it might save somebody else a lot of time doing the same thing. But some technical training could help significantly. From my understanding, the better I got at doing software development the quicker I’ve been getting at releasing code.
Q: Is that something that Lancaster University could help with? Would that be training or do we need specialists that offer support?
Andrew: I am not too sure. I have a personal interest in training myself but I am not sure how that would fit into research management.
Andrew: I think that would be a great idea. They could help direct researchers. Even if they don’t do any development work for them they could have a look at the code and point them into directions and suggest “I think you should do this or that”, like re-factoring. I think that kind of supervision would be really beneficial, like a mentor even if they are not directly on that project. Just for example ten per cent of their time on a project would help.
Q: Are you aware that this is happening elsewhere?
Andrew: Yes, I did a summer internship with the Turing Institute and they have a team of Research Software Engineers.
Q: And who do the Research Software Engineers support?
Andrew: The Alan Turing Institute is made up of five universities. Together they form the national institute for data science for the UK. They do have their own researchers but also associated researchers from the five universities. The Research Software Engineers are embedded in the research side, integrated with the researchers.
When I was an intern at the Turing Institute one of the Research Software Engineers had a time slot for us available once a week.
Q: Like a drop in help session?
Andrew: Yes, like that. They helped me by directing me to different libraries and software to unit test my code and create documentation, as well as stating the benefits of doing this. I know that other teams benefited from their guidance and support on using Microsoft Azure cloud computing to facilitate their work. I imagine that a lot of time was saved by the help that they gave.
Q: Thanks Andrew. And to get to the final question. You deposited data here at Lancaster University using Pure. Does that work for you as a method to deposit your research data and get a DOI? Does that address your needs?
Andrew: I think better support for software might be needed on Pure. It would be great if it could work with GitHub.
Q: Yes, at the moment you can’t link Pure with GitHub in the same way you can link GitHub with Zenodo.
Andrew: When you link GitHub and Zenodo does Zenodo keep a copy of the code?
Q: I am not an expert but I believe it provides the DOI to a specific release of the software.
Andrew: One thing I think is really good is that we keep data at Lancaster’s repository. In twenty years’ time GitHub might not exist anymore and then I would really appreciate a copy stored in the Lancaster archives. The assumption that “It’s in GitHub, it’s fine” might not be true.
Q: Yes, if we assume that GitHub is a platform for long-term preservation of code we need to trust it and I am not sure that this is the case. If you deposit here at Lancaster the University has a commitment to preservation and I believe that the University’s data archive is “trustworthy”.
Andrew: So putting a zipped copy of your code is a good solution for now. But in the long term the University’s archives could be better for software. An institutional GitLab might be good and useful. I know there is one in Medicine but an institution wide one would help. It would be nice if Pure could talk to these systems but I can imagine it is difficult.
The area of Neuroscience seems to be doing quite well with releasing research software. You have an opt-in system for the review of code. I think one of the Fellows of the Software Sustainability Institute was behind this idea.
Q: Did that happen locally here at Lancaster University?
Andrew: No, the Fellow was from Cambridge. They seem to be ahead of the curve but it only happened this year. But they seem to be really pushing for that.
Q: Thanks a lot for the Data Interview Andrew!
The interview was conducted by Hardy Schwamm.
 For example: Piwowar, H. A., & Vision, T. J. (2013). Data reuse and the open data citation advantage. PeerJ, 1, e175. http://doi.org/10.7717/peerj.175
We had our third Data Conversation here at Lancaster University again with the aim of bringing together researchers to share their data stories and discuss issues and exchange ideas in a friendly and informal setting.
We all had plenty of time to eat pizza and crisps before Neil invited us all to consider reproducibility and sustainability in relation to software. Neil has a very clear and engaging style which really helped us, the audience, navigate around the complex issues of managing software. He asked us all to imagine returning to our work in three months’ time – would it make sense? Would it still work? He also addressed some of the complex issues around versioning, authorship and sharing software.
The second half of the afternoon followed the more traditional Data Conversations route of short lightning talks given by Lancaster University researchers.
First up was Barry Rowlingson (Lancaster Medical School) talking about the benefits of using GitLab for developing, sharing and keeping software safe.
Barry Rowlingson weighs up the benefits of GitLab over GitHub…
Next was Kristoffer Geyer (Psychology) talking about the innovative and challenging uses of smartphone data for investigating behaviour and in particular the issues of capturing the data from external and ever changing software. Kris mentioned how the recent update of Android (to Oreo) makes retrieving relevant data more difficult – a flexible approach is definitely what is needed.
Then we heard from Andrew Moore (School of Computing and Communications) who returned to the theme of sharing software, looking at some of the barriers and opportunities which present themselves. Andrew argued passionately that we need more resources for software sharing (such as specialist Research Software Engineers) but also that researchers need to change their attitudes towards sharing their code.
Our final speaker was the Library’s own Stephen Robinson (Library Developer) talking about using containers as a method of software preservation. This provoked quite some debate – which is exactly what we want to encourage at these events!
We think these kinds of conversations are a great way of getting people to share good ideas and good practice around data management and we look forward to the next Data Conversations in January 2018!
This blog post was co-authored by Rachel MacGregor and Hardy Schwamm.
You can find a short summary of the event, the slides and some photos below.
Denes Csala – The sensor cloud around us: collecting, mining and visualizing the energy and building management data of the campus
Dr Denes Csala is a newly appointed lecturer in Energy Storage Systems Dynamics with Energy Lancaster.
There are 30,000 sensors on campus capturing all sorts of data about energy and energy consumption. This has the potential for us to understand a huge amount about the way energy is managed and used but at the same time throws up the issue of managing extremely sensitive commercial and personal data. Access to the data is strictly controlled but Energy Lancaster are very excited about the possibilities of what could be done with the data.
Kopo Ramokapane – Cloud computing: When is Deletion Deletion
Kopo reported that when you delete data in the cloud there is no way to be sure that all copies or all versions have been deleted from the cloud provider. This issue isn’t new but doesn’t get as much attention as it should. Because of the way Cloud storage operates it is almost impossible even for the service providers to be certain that all the data has been deleted. Avoid storing confidential data in the Cloud and learn more about how the systems work! Lancaster University has a contract with cloud service Box which ensures that compliance issues are dealt with in relation to storage of confidential or sensitive data.
Karen Broadhurst and Stuart Bedston – Better data for better justice: Towards data-driven analyses of Family Court policy and practice
Professor Karen Broadhurst and Stuart Bedston from the Sociology Department reported on concerns about transparency in family court decision-making. Greater transparency and “open data” would have a positive impact in many ways but are hard to achieve given the security requirements and potential risks.
Karen and Stu highlighted the changes that would be needed in order to strengthen interdisciplinary research using controlled data here at Lancaster University but also the difficulties that stand in the way.
John Couzins – Security Overview at Lancaster University
Next on was John Couzins, the IT Security Manager of Lancaster University. John, who works for the institutional IT service ISS, reported on the certifications that are necessary to fulfil the requirements of certain providers of confidential data. Current examples are Cyber Essentials Plus and the IG Toolkit (Information Governance Toolkit) which is used by the NHS.
Mateusz Mikusz – Running Research as a Service. Implications for Privacy Policies and Ethics
The issue regarding the data is that it is used for two purposes:
To make the app and its use cases work
To create research data of usage and other properties that can be analysed by the project team
Mateusz explained that he is working hard to bring both things together in an ethical way that still allows innovative research.
It was a great showcase for a lot of fantastic research that is taking place at Lancaster University and the way in which handling sensitive data and tackling data security is at the forefront of this. There were probably as many questions raised as there were answers given but it was a great opportunity to share approaches to handling data securely and ethically.
The first Data Conversations happened on Monday, 31st of January 2017. Below is a quick overview of the action. You can find slides of four talks below.
Data Conversations Opening
The event was opened by Professor Adrian Friday from the Data Science Institute (DSI) who emphasised that the DSI is all about collaboration between disciplines which is also the spirit of Data Conversations. In fact the 25 attendees came from a range of Departments: Biological and Life Sciences, Chemistry, Computing, Educational Research, History, Law, Lancaster Environment Centre, Politics, Psychology and others.
Data Conversations Talks
Unfortunately, Dr Chris Jewell from the Medical School had to cancel his talk. You can see an overview of the agenda below.
Leif Isaksen – Does Linked Data Have to be Open?
Leif Isaksen from the History Department (Leif is also involved in the Data Science Institute) presented the Pelagios Commons project which provides online resources for using open data methods to link and explore historical places.
Leif stressed that linking data is a social process which is built on open partnerships.
Jude Towers – Is Violent Crime Increasing or Decreasing?
Dr Jude Towers from Lancaster’s Sociology Department discussed crime rates, especially the rate of domestic violence over time through the Crime Survey for England and Wales. A current ESRC project is looking at how changing survey methodologies alter the underlying data of crime statistics.
Alison Scott-Baumann – Protecting participants and their data on a sensitive topic
Alison and Shuruq explained how difficult it is to get the balance right between confidentiality and data security required to manage often highly sensitive data, and to meet the expectations of data sharing. They stressed how much effort they spend on explaining the terms of the consent forms to project participants.
David Ellis – Building interactive data visualisations to support publications
Chris Donaldson & James Butler – Mining and mapping places with multiple names
Finally, Dr Christopher Donaldson and Dr James Butler talked about their research using a 1.5 million word corpus of Lake District 18th and 19th century literature. Christopher and James use the Edinburgh Geoparser System to automatically recognise place names in text and disambiguate them with respect to a gazetteer.
James demonstrated how he can deal with name variations (secondary names), which is a lot of work. For example, the lake “Coniston” appears in the corpus as: Thurstan, Coniston Lake, Coniston Water, Thurston, Conistone, Conistone Lake, Cunnistone Lake, Thurston Lake, Coniston Mere, Lake of Coniston, Conis- ton, Conyngs Tun, Conyngeston, Thorstane’s watter, Turstinus.
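As a rough illustration of what resolving secondary names involves, the Python sketch below maps some of the variants listed above onto one canonical name with a simple lookup table. This is an assumed, simplified approach and not how the Edinburgh Geoparser or the project itself works. The function names are hypothetical and only a subset of the variants is included.

```python
# A subset of the variant spellings quoted above, all referring to the same lake.
CONISTON_VARIANTS = [
    "Thurstan",
    "Coniston Lake",
    "Coniston Water",
    "Thurston",
    "Conistone",
    "Conistone Lake",
    "Cunnistone Lake",
    "Thurston Lake",
    "Coniston Mere",
    "Lake of Coniston",
    "Conyngeston",
    "Turstinus",
]


def tidy(name):
    """Collapse repeated whitespace and ignore letter case."""
    return " ".join(name.split()).casefold()


def build_lookup(canonical, variants):
    """Map every tidied variant (and the canonical name itself) to the canonical name."""
    return {tidy(v): canonical for v in [canonical, *variants]}


def normalise(name, lookup):
    """Return the canonical name for a known variant, or None if it is unknown."""
    return lookup.get(tidy(name))


lookup = build_lookup("Coniston", CONISTON_VARIANTS)
print(normalise("thurston  lake", lookup))
print(normalise("Windermere", lookup))
```

A table like this only covers spellings someone has already identified, which hints at why the work is so labour-intensive: each new variant in the corpus has to be found and confirmed before it can be added.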
Feedback so far
The feedback from attendees and presenters so far is encouraging.
Enjoyed the presentations. I hope these data conversations will become a nice community for those interested in data. Relaxed and nicely themed but not too prescribed. The venue was good and the cakes and biscuits were very good!
We got some comments on the length of the presentations and question time.
Really enjoyable – perhaps a bit more time for each speaker / questions and discussions.
We will look into amending the format. We do like to keep a balance between time for data stories and discussions and giving a number of Lancaster researchers a forum to talk about their experiences. Thanks for the comments and suggestions so far!
Upcoming: 2nd Data Conversations 4th of May
We hope to report on some of the data presentations in more detail in future blog posts. Meanwhile, we are already preparing for the next Data Conversations event on 4th of May (1.45-4 pm). The theme of the event will be “Data Security and Confidentiality”, and registrations are open: http://bit.ly/ludatacon2. Please come along and if you have any questions get in touch with the RDM Support Team: firstname.lastname@example.org.