Articles, Literature Reviews

“In the Artificial Intelligence (AI) Science boom, beware: your results are only as good as your data.”

Hunter Moseley shines a light on how we can make our experimental results more trustworthy. Thoroughly vetting results both before and after publication helps ensure that huge, complex data sets are accurate and valid. We need to question results and papers: just because something has been published does not mean it is accurate or even correct, whoever the author may be and whatever their credentials.

The key to ensuring the accuracy of these results is reproducibility: careful examination of the data, with peers and other research groups investigating the outcomes. This is vitally important for a data set that is used in new applications. Moseley and his colleagues found something unexpected when they investigated some recent research papers: duplicates appeared in the data sets used in three papers, meaning the data were corrupted.

In machine learning it is usual to split a data set in two, using one subset to train a model and the other to evaluate the model’s performance. With no overlap between the training and testing subsets, performance in the testing phase reflects how well the model has learned. However, in their examination they found what they described as a “catastrophic data leakage” problem: the two subsets were cross-contaminated, destroying the ideal separation. About one quarter of the data set in question was represented more than once, corrupting the cross-validation steps. After cleaning up the data sets and applying the published methods again, the observed performance was much less impressive, with the accuracy score dropping from 0.94 to 0.82. A score of 0.94 is reasonably high and “indicates that the algorithm is usable in many scientific applications”, but at 0.82 it is useful only with limitations and “only if handled appropriately”.
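
To make the leakage problem concrete, here is a minimal, hypothetical sketch in Python with scikit-learn (synthetic data, not the authors’ actual pipeline): duplicate records that straddle a train/test split inflate the measured accuracy, and deduplicating before splitting removes the effect.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 20))
# Noisy labels: even a perfect model cannot reach 100% accuracy.
y = (X[:, 0] + rng.normal(scale=2.0, size=800) > 0).astype(int)

# Simulate the leakage: duplicate ~25% of the rows before splitting,
# so copies of the same sample can land in both subsets.
dup_idx = rng.choice(len(X), size=200, replace=False)
X_leaky = np.vstack([X, X[dup_idx]])
y_leaky = np.concatenate([y, y[dup_idx]])

def evaluate(features, labels):
    """Train/test split, fit a classifier, report test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.3, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

print("with duplicates (leaky):    ", round(evaluate(X_leaky, y_leaky), 2))

# The fix: deduplicate first, then split, so no sample appears in both subsets.
X_clean, keep_idx = np.unique(X_leaky, axis=0, return_index=True)
print("after deduplication (clean):", round(evaluate(X_clean, y_leaky[keep_idx]), 2))
```

Because copies of some test samples were seen during training, the first score is optimistically inflated; this is the same mechanism behind the published 0.94 falling to 0.82 once the duplicates were removed.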

So what?

Studies that are published with flawed results obviously call the research into question. If researchers do not make their code and methods fully available, this type of error can occur and go undetected. If artificially high performance is reported, other researchers may not attempt to improve on the results, feeling that “their algorithms are lacking in comparison.” Some journals prefer to publish successful results, so inflated benchmarks could hold back progress: genuine improvements may be judged not valid or not even worth publishing.

Encouraging reproducibility:

Moseley argues that a measured approach is needed. Where transparency is demonstrated, with data, code and full results made available, a thorough evaluation and identification of the problematic data set would allow the authors to correct their work. Another of his solutions is to retract studies with highly flawed results and little or no support for reproducible research. Scientific reproducibility should not be optional.

Researchers at all levels will need to learn to treat published data with a degree of scepticism; the research community does not want to repeat others’ mistakes. But data sets are complex, especially when using AI. Making these data sets, and the code used to analyse them, available will benefit the original authors, help validate the research and ensure rigour across the research community.

Link to full article in Nature.

Articles

Butterflies and ChatGPT

Prompting is the way we talk to generative AI and large language models (LLMs). The way we construct a prompt can change a model’s decision and affect the accuracy of the results it provides. Research from the University of Southern California Information Sciences Institute shows that a minute tweak – such as adding a space at the beginning of a prompt – can change the results. This is likened to chaos theory, where a butterfly flapping its wings generates a minor ripple in the air that results in a tornado several weeks later in a faraway land.

The researchers, who were sponsored by the US Defense Advanced Research Projects Agency (DARPA), chose ChatGPT and applied a range of prompt variations. Even slight changes led to significant changes in the results. They found many factors at play, and there is more work to be done to find ways of mitigating this effect.

Why do slight changes produce such significant shifts in output? Do the changes “confuse” the model? By running experiments across 11 classification tasks, the researchers were able to measure how often the LLM changed its predictions and what the impact on accuracy was. Studying the correlation between confusion and an instance’s likelihood of having its answer changed (using a subset of the tasks with individual human annotations) did not give a full answer.
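
As a rough illustration of this kind of measurement (not the study’s actual code), the sketch below sends superficially perturbed versions of the same classification prompt to a model and counts how often the predicted label flips. The query_llm function is a hypothetical stand-in for whichever chat API is being tested.

```python
def query_llm(prompt: str) -> str:
    """Hypothetical stand-in: replace with a call to the model under test.
    This toy version just keyword-matches so the sketch runs end to end."""
    return "Positive" if "good" in prompt.lower() else "Negative"

BASE = "Classify the sentiment of this review as Positive or Negative: {text}"

# Superficial perturbations of the kind the study describes.
PERTURBATIONS = [
    lambda p: p,                     # original prompt
    lambda p: " " + p,               # leading space
    lambda p: p + "\n",              # trailing newline
    lambda p: p.replace(":", " -"),  # punctuation tweak
]

def prediction_flip_rate(texts):
    """Fraction of inputs whose predicted label changes under some perturbation."""
    flips = 0
    for text in texts:
        answers = [query_llm(f(BASE.format(text=text))) for f in PERTURBATIONS]
        if len(set(answers)) > 1:  # at least one variant disagreed
            flips += 1
    return flips / len(texts)

print(prediction_flip_rate(["The food was good.", "Terrible service."]))
```

In the study itself, this kind of flip rate, together with the change in task accuracy, is what was measured across the 11 classification tasks.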

So what?

Building LLMs that are resistant to such changes and yield consistent, accurate answers is a logical next step. However, this will require a deeper understanding of why responses change under minor tweaks. Is there a way to anticipate these changes in output? With ChatGPT being integrated into systems at scale, this work will be important for the future.

Link to full article. 

Articles

The EU AI Act – what does this mean for innovation?

European Union member states and the European Parliament have worked to publish “the AI Act”, a framework of “staggered rules” based on risk. These include items such as:

  • Unacceptable risk systems – tech that poses a threat to people will mostly be banned
  • AI systems must respect EU copyright rules and be transparent about how generative AI models have been trained
  • General-purpose tools like ChatGPT will be assessed on how powerful they are. Tools trained using large amounts of computing power would face more obligations and reporting requirements.

So what?

Companies will not have to comply with the rules for two years, in which time the rules could be out of date before they are even enforced. Will these rules stifle innovation in this fast-moving sector?

Link here: Euronews

Articles, Literature Reviews

Hypotheses devised by AI could find “blind spots” in research

“Could Artificial Intelligence (AI) have a creative role in the scientific process?” was a question posed in 2023 by a group of researchers in Stockholm. AI is already being used in literature searches, to automate data collection, to run statistical analyses and even to draft parts of industry and academic papers. Sendhil Mullainathan, an economist at the University of Chicago Booth School of Business in Illinois, has suggested using AI to generate hypotheses, stating: “it’s probably been the single most exhilarating kind of research I’ve ever done in my life”.

AI could help with creativity. Using large language models (LLMs) to create new text, even if it is inaccurate, can lead to statements of the form “here’s a kind of thing that looks true” – and when you think about it, that is exactly what a hypothesis is! These “hallucinations” are sometimes things a human would not have come up with, and could aid thinking outside the box.

Hypotheses sit on a spectrum from the concrete and specific to the abstract and general; using AI in areas where the fundamentals remain hidden could generate real insights. For example, we may know that a certain behaviour is happening but not why – could AI identify rules that might apply to the situation? James Evans, a sociologist at the University of Chicago, notes that AI systems that generate hypotheses based purely on machine learning require a lot of data. Should we be looking to build AI that goes beyond matching patterns and can also be guided by known laws? Rose Yu, a computer scientist at the University of California, San Diego, says this would be a “powerful way to include scientific knowledge into AI systems”.

Ross King, a computer scientist at Chalmers University of Technology in Gothenburg, is building robotic systems that perform experiments. Factors are adjusted subtly in his “Genesis” systems, “allowing these robot scientists to be more consistent, unbiased, cheap, efficient and transparent than humans”.

Hypothesis generation by AI is not new. In the 1980s, Don Swanson pioneered “literature-based discovery” with software he created called “Arrowsmith”, which searched the literature for indirect connections and proposed, for example, that fish oil might help treat Raynaud’s syndrome, a condition in which circulation to the hands is restricted. When taken forward, this hypothesis proved correct: fish oil decreases the blood’s viscosity, leading to improved circulation.
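
For the curious, here is a schematic sketch of the “ABC” idea behind literature-based discovery (illustrative only, not Arrowsmith’s actual implementation): if papers link concept A to B, and other papers link B to C, but A and C are never mentioned together, then A–C becomes a candidate hidden hypothesis.

```python
# Toy "literature": each paper is reduced to the set of concepts it mentions.
papers = [
    {"fish oil", "blood viscosity"},
    {"fish oil", "platelet aggregation"},
    {"blood viscosity", "raynauds syndrome"},
    {"platelet aggregation", "raynauds syndrome"},
    {"aspirin", "platelet aggregation"},
]

def cooccurring(term):
    """Concepts that appear in the same paper as `term`."""
    linked = set()
    for concepts in papers:
        if term in concepts:
            linked |= concepts - {term}
    return linked

def candidate_links(a):
    """C-terms reachable from A via an intermediate B, but never co-mentioned with A."""
    direct = cooccurring(a)
    indirect = set()
    for b in direct:
        indirect |= cooccurring(b)
    return indirect - direct - {a}

print(candidate_links("fish oil"))  # {'raynauds syndrome', 'aspirin'}
```

Swanson’s fish-oil example followed exactly this pattern: the literature linked fish oil to blood viscosity, and blood viscosity to Raynaud’s syndrome, before any paper connected fish oil to Raynaud’s directly.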

Data gathering is becoming more automated, and automating hypothesis generation could become important as more data is generated than humans can handle. Scaling up “intelligent, adaptive questions” will ensure that this capacity is not wasted.

So What?

This approach could lead to valid hypotheses being developed that are clear and broad, in areas where the underlying principles are poorly understood. A panacea, perhaps, for “researcher’s block”, unlocking blind spots? For Defence this could mean helping to avoid groupthink, encouraging more innovation outside of the chain of command and enabling things to be done differently in an often slow-to-change organisation. AI could prove to be a lot more useful than simply performing literature reviews.

Full article: Nature magazine

Articles

How can we “stop deepfakes from sinking society?”

It is easy for AI to generate convincing images and videos, but we need to ensure we can guard against the harm such deepfakes could cause. Wil Corvey, who leads this work at the US Defense Advanced Research Projects Agency (DARPA), says we should ask not “how much of this is synthetic?” but “why was this made?” One problem identified is that “people are not used to generative technology”, says Cynthia Rudin of Duke University, North Carolina: there is little scepticism, because this technology has exploded onto the scene rather than developed slowly. Some deepfake images are fun and made for entertainment, but others can be used to deceive and to carry out fraudulent activities. One way to help detect synthetic images is to watermark them, altering pixels in a way that is imperceptible to the naked eye but can be picked up on analysis, or to tag a file’s metadata to authenticate an image.
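
To illustrate the pixel-alteration idea in the simplest possible terms, here is a minimal least-significant-bit watermarking sketch in Python with NumPy. It is purely illustrative: real provenance watermarks used by generative-AI providers are far more sophisticated and robust to editing.

```python
import numpy as np

def embed_watermark(image: np.ndarray, bits: np.ndarray) -> np.ndarray:
    """Hide a bit pattern in the least significant bit of each pixel value."""
    flat = image.astype(np.uint8).ravel().copy()
    flat[: bits.size] = (flat[: bits.size] & 0xFE) | bits  # overwrite the LSB
    return flat.reshape(image.shape)

def extract_watermark(image: np.ndarray, n_bits: int) -> np.ndarray:
    """Read the hidden bits back out of the pixel LSBs."""
    return image.astype(np.uint8).ravel()[:n_bits] & 1

rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
mark = rng.integers(0, 2, size=256, dtype=np.uint8)

tagged = embed_watermark(original, mark)
assert np.array_equal(extract_watermark(tagged, mark.size), mark)
# Changing only the LSB shifts each pixel value by at most 1 of 255 levels:
# imperceptible to the eye, but recoverable by software that knows where to look.
print("max pixel change:", np.abs(tagged.astype(int) - original.astype(int)).max())
```

The metadata-tagging approach mentioned above is simpler still, but metadata is easy to strip from a file, which is one reason pixel-level marks are attractive.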

So What?

We are fighting a losing battle to detect deepfakes; we need greater technological literacy and better tools at our disposal to counteract the harmful ones. As our own Dr Sophie Nightingale says, “People’s ability to really know where they should place their trust is falling away. And that’s a real problem for democracy”. With major elections due in many countries, this is potentially a significant threat to contend with.

Link: https://www.nature.com/articles/d41586-023-02990-y

Articles

Deep fakes – a cause for concern?

[Image: an example of how deep fake images can be dangerous]

In this issue we wanted to take a look at deep fakes and how easy it is to detect them. Image manipulation/editing is nothing new, and deep fakes are the latest in a long line of techniques used for manipulation. Joseph Stalin had people removed from photographic images of him so he was not seen to be associating with the “wrong type of people”.

What is a deep fake? Deep fakes refer to audio, images, text or video that have been automatically synthesised by a machine learning or other AI system. Deep fake technology can be used to create highly realistic images or videos that depict people saying or doing things that they did not. For example, recent images circulated of the Pope wearing a large white “puffer” coat, something he never did. Link here: https://www.bloomberg.com/news/newsletters/2023-04-06/pope-francis-white-puffer-coat-ai-image-sparks-deep-fake-concerns

  • Public concern: The public are concerned about the misuse of deep fakes; they are hard to detect and the technology is advancing rapidly. Public understanding is limited, and there is a risk of misinformation, especially as deep fakes become more sophisticated. When trying to decide whether an image is a fake, it helps to look for inconsistencies such as mismatched earrings or inconsistent eye blinking.
  • Worries and considerations: Deep fakes are increasingly being used for malicious purposes, such as the creation of pornography, and modern tools for creating them are readily available and increasing in sophistication, yielding better and better results. Even though public awareness is increasing, the ability to detect a deep fake is not. However, some recent research has shed light on who might be better at detecting them.
  • Research by Ganna Pogrebna: Ganna is a decision theorist and behavioural scientist working at the Turing Institute. She recently gave a talk over Zoom on her empirical study into the “Temporal Evolution of Human Perceptions and Detection of Deep fakes”. Ganna identified a range of 37 personality traits that can be measured using psychological scales (e.g. anxiety, extraversion, self-esteem) and, based on the description of each trait, developed an algorithm. The hypothesis was based on the “big five” personality traits (openness, conscientiousness, extraversion, neuroticism and agreeableness).

The study commenced with a small group of 200 people and has now increased to 3,000 people in each of five Anglophone countries: the UK, US, Canada, Australia and New Zealand. Because Ganna has a large dataset of deep fakes, she can test with images of many different people, not just actors and politicians as in some studies. This has yielded copious amounts of data, including cross-sectional data from representative samples. Each participant was exposed to six deep fake algorithm variations in a between-subjects design.

  • Results: People’s ability to detect deep fakes gradually declines as the quality of the deep fakes improves. However, people who show high emotional intelligence and conscientiousness and are prevention-focused are better at detecting deep fakes, with neuroticism, resilience, empathy, impulsivity and risk aversion a close second. Only 2% of participants were very good at detecting deep fakes (although no exact definition of “very good” was presented); these participants scored statistically higher than others on three traits: conscientiousness, emotional intelligence and prevention focus. General intelligence and knowledge of technology do not make you better at detecting deep fakes, so testing for general versus emotional intelligence could be an interesting addition to the data. It will be good to see the full results, in terms of exact performance and effect sizes, when published.
  • So What: We are becoming familiar with deep fakes and with talking about “hallucinations”, such as content created by ChatGPT: assertions confidently made by algorithms even though they are far removed from the truth. The future of this technology is exciting, and the possibilities seem endless as new technologies emerge at an exponential rate, but we need to question more than ever what we see and what we read.

Let us know what work you are doing in the deep fake arena – we’d love to hear from you – CAISS@lancaster.ac.uk