
“In the Artificial Intelligence (AI) Science boom, beware: your results are only as good as your data.”

Hunter Moseley shines a light on how we can make experimental results more trustworthy. Thoroughly vetting them, both before and after publication, helps ensure that huge, complex data sets are accurate and valid. We need to question results and papers: publication is no guarantee that a piece of work is accurate, or even correct, whoever the author may be and whatever their credentials.

The key to ensuring the accuracy of these results is reproducibility: careful examination of the data by peers and by other research groups investigating the same outcomes. This is vitally important when a data set is reused in new applications. Moseley and his colleagues found something unexpected when they investigated some recent research papers: the data sets used in three papers contained duplicate entries, meaning they were corrupted.

In machine learning it is usual to split a data set in two, using one subset to train a model and the other to evaluate the model’s performance. With no overlap between the training and testing subsets, performance in the testing phase reflects how well the model has learned. However, in their examination Moseley’s team found what they described as a “catastrophic data leakage” problem: the two subsets were cross-contaminated, destroying the ideal separation. About one quarter of the data set in question was represented more than once, corrupting the cross-validation steps. After cleaning up the data sets and applying the published methods again, the observed performance was far less impressive, with the accuracy score dropping from 0.94 to 0.82. A score of 0.94 is reasonably high and “indicates that the algorithm is usable in many scientific applications”, but at 0.82 it is useful with limitations, and then “only if handled appropriately”.
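To make the leakage concrete, here is a minimal sketch in Python, using the pandas and scikit-learn libraries. It is not the authors’ actual code or data; the synthetic data set, the model and the numbers it prints are purely illustrative assumptions. It shows how duplicate records that end up on both sides of a train/test split can inflate a measured accuracy score, and how deduplicating before splitting avoids the leak.

```python
# A toy illustration (not the published pipeline): duplicate rows that land on
# both sides of a train/test split let the model "recognise" test samples it
# has already memorised, inflating the reported accuracy.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: 1,000 unique samples, 20 features, a noisy binary label.
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + rng.normal(scale=2.0, size=1000) > 0).astype(int)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(20)])
df["label"] = y

# Simulate the corruption: roughly a quarter of the rows appear more than once.
leaky = pd.concat([df, df.sample(frac=0.25, random_state=0)], ignore_index=True)

def fit_and_score(frame: pd.DataFrame) -> float:
    """Split the data, train a classifier and return its test accuracy."""
    features = frame.drop(columns="label")
    labels = frame["label"]
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.3, random_state=0
    )
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

print("with leaked duplicates:", round(fit_and_score(leaky), 3))

# The fix: drop exact duplicates *before* splitting, so no sample can sit in
# both the training and the testing subsets.
print("after deduplication:   ", round(fit_and_score(leaky.drop_duplicates()), 3))
```

On data like this, the version with leaked duplicates will typically report a noticeably higher test accuracy than the deduplicated one; this is the same effect that dropped the published score from 0.94 to 0.82 once the duplicates were removed.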

So what?

Studies published with flawed results obviously call the research into question. When researchers do not make their code and methods fully available, this type of error can go undetected. Reports of high performance may also discourage other researchers from attempting to improve on the results, feeling that “their algorithms are lacking in comparison.” And because some journals prefer to publish successful results, this could hold back progress, with follow-up work dismissed as not valid or not even worth publishing.

Encouraging reproducibility:

Moseley argues that a measured approach is needed. Where transparency is demonstrated, with data, code and full results made available, a thorough evaluation and identification of the problematic data set would allow the authors to correct their work. Another of his proposals is to retract studies with highly flawed results and little or no support for reproducible research. Scientific reproducibility should not be optional.

Researchers at all levels will need to learn to treat published data with a degree of scepticism; the research community does not want to repeat others’ mistakes. But data sets are complex, especially when AI is involved. Making these data sets, and the code used to analyse them, available benefits the original authors, helps validate the research and ensures rigour across the research community.

Link to full article in Nature.