Data Interview with Andrew Moore

Andrew Moore (@apmoore94) is a 2nd year PhD student at Lancaster University within the School of Computing and Communications. He is studying how sentiment analysis can be improved through world knowledge using finance as his specialised domain. His research interests are across Natural Language Processing, Machine Learning, and Reproducibility.

We talked to Andrew after he presented at the 3rd Data Conversations.

Q: When does software become research data in your understanding?

Andrew: As soon as you start writing software towards a research paper that I would count as research data.

Q: Is that when you need the code to verify results or re-run calculations?

Andrew: You also need the code to clean your data which is just as important as your results because depending on how you clean your data that informs on what your results are going to be.

Q: And the software is needed to clean the data?

Andrew: Yes. The software will be needed for cleaning the data. So as soon as you start writing your software towards a paper that is when the code becomes research data. It doesn’t have to be in the public domain but it really should be.

Q: What is the current practice when you publish a paper? Do you get asked where your software is?

Andrew: Recently we have actually, for some of our conferences in the computational linguistics or Natural Languages Processing field. But it is not a requirement to get published. It is a friendly question rather than an obligation.

Q: Who is asking, the publisher?

Andrew: No, that’s the conference chairs who are asking but it is not a requirement. Personally I think it should be. I can understand in certain cases when for instance there are security concerns. But normally the sensitivity is on the data side rather than the software.

Q: At the moment if you read a paper the software that is linked to the paper is not available?

Andrew: Normally, if there is software with the paper the paper would have a link, normally on the first or the last page. But a large proportion of the papers don’t have a link. Normally there would be a link to GitHub, maybe 50 per cent of the time. Other than that you can dig around if you’re really looking for it, perhaps Google the name but that’s not really how it should be.

Q: So sometimes the software is available but not referenced in the paper?

Andrew: That’s correct.

Q: But why would you not reference the software in the paper when it is available?

Andrew: I am really puzzled by this [laughs]. I can think of a few reasons. One of them could be that the GitHub instance is just used as backup. The problem I have with that is that it is not referenced in the paper how much do you trust the code to be the version that is associated with the paper?

Also, the other problem with that if I’m on GitHub is that if you reference it in a paper, on GitHub you can keep changing the code and unless you “tag” it on GitHub like a version number and reference that tag in your paper you don’t know what is the correct version.

Q: What about pushing a version of the code from GitHub to [the data archiving tool] Zenodo and get a DOI?

Andrew: I didn’t know about that until recently!

Andrew presenting at Data Conversations

Q: So this mechanism is not widely known?

Andrew: I know what DOIs are but not really how you can get them.

Q: So are the issues why software isn’t shared about the lack of time or is it more technical as we have just discussed, to do with versions and ways of publishing?

Andrew: I think time and technical issues go hand in hand. To be technically better takes time and to do research takes time. It is always a tradeoff between “I want my next paper out” and spending extra time on your code. If your paper is already accepted that is “my merit” so why spend more time?

But there are incentives! When I submitted paper at an evaluation workshop I said that everybody should release their software because it was about evaluating models so it makes sense to have all the code online. So it was decided that we shouldn’t enforce the release but it was encouraged and the argument was that you are likely to get more citations. Because if your code is available people are more likely to use it and then to credit you by citing your paper. So getting more citations is a good incentive but I am not sure if there are some studies proving that releasing software correlates to more citations?

Q: There are a number of studies proving there is a positive correlation when you deposit your research data[1]. I am not aware there is one for software[2]. So maybe we need more evidence to persuade researchers to release code?

Andrew: Personally I think you should do it anyway! You spend so many hours on writing software so even if it takes you a couple of hours extra to put it online it might save somebody else a lot of time doing the same thing. But some technical training could help significantly. From my understanding, the better I got at doing software development the quicker I’ve been getting at releasing code.

Q: Is that something that Lancaster University could help with? Would that be training or do we need specialists that offer support?

Andrew: I am not too sure. I have a personal interest in training myself but I am not sure how that would fit into research management.

Q: I remember that at the last Data Conversations Research Software Engineers were being discussed as a support method.

Andrew: I think that would be a great idea. They could help direct researchers. Even if they don’t do any development work for them they could have a look at the code and point them into directions and suggest “I think you should do this or that”, like re-factoring. I think that kind of supervision would be really beneficial, like a mentor even if they are not directly on that project. Just for example ten per cent of their time on a project would help.

Q: Are you aware that this is happening elsewhere?

Andrew: Yes, I did a summer internship with the Turing Institute and they have a team of Research Software Engineers.

Q: And who do the Research Software Engineers support?

Andrew: The Alan Turing Institute is made up of five institutes. They represent the Institute of Data Science for the UK. They do have their own researchers but also associated researchers from the other five universities. The Research Software Engineers are embedded in the research side integrated with the researchers.

When I was an intern at the Turing Institute one of the Research Software Engineers had a time slot for us available once a week.

Q: Like a drop in help session?

Andrew: Yes, like that. They helped me by directing me to different libraries and software to unit test my code and create documentation as well stating the benefits of doing this. I know that others teams benefited from there guidance and support on using Microsoft Azure cloud computing to facilitate their work. I imagine that a lot of time was saved by the help that they gave.

Q: Thanks Andrew. And to get to the final question. You deposited data here at Lancaster University using Pure. Does that work for you as a method to deposit your research data and get a DOI? Does that address your needs?

Andrew: I think better support for software might be needed on Pure. It would be great if it could work with GitHub.

Q: Yes, at the moment you can’t link Pure with GitHub in the same way you can link GitHub with Zenodo.

Andrew: When you link GitHub and Zenodo does Zenodo keep a copy of the code?

Q: I am not an expert but I believe provides the DOI to a specific release of the software.

Andrew: One thing I think it is really good that we keep data at Lancaster’s repository. In twenty years’ time GitHub might not exist anymore and then I would really appreciate a copy store in the Lancaster archives. The assumption that “It’s in GitHub, it’s fine” might not be true.

Q: Yes, if we assume that GitHub is platform for long-term preservation of code we need to trust it and I am not sure that this is the case. If you deposit here at Lancaster the University has a commitment to preservation and I believe that the University’s data archive is “trustworthy”.

Andrew: So putting a zipped copy of your code is a good solution for now. But in the long term the University’s archives could be better for software. An institutional GitLab might be good and useful. I know there is one in Medicine but an institution wide one would help. It would be nice if Pure could talk to these systems but I can imagine it is difficult.

The area of Neuroscience seems to be doing quite well with releasing research software. You have an opt-in system for the review of code. I think one of the Fellows of the Software Sustainability Institute was behind this idea.

Q: Did that happen locally here at Lancaster University?

Andrew: No, the Fellow was from Cambridge. They seem to be ahead of the curve but it only happened this year. But they seem to be really pushing for that.

Q: Thanks a lot for the Data Interview Andrew!

The interview was conducted by Hardy Schwamm.

[1] For example: Piwowar, H. A., & Vision, T. J. (2013). Data reuse and the open data citation advantage. PeerJ, 1, e175. http://doi.org/10.7717/peerj.175

[2] Actually there is a relevant study: Vandewalle, Patrick. Code Sharing Is Associated with Research Impact in Image Processing . Computing in Science & Engineering, 2012, http://ieeexplore.ieee.org/document/6200247/.