Byte

Super-intelligent AI is not a thing

Panic not, says a report in Nature: LLMs are not on course to match or even exceed human beings on most tasks. “Scientific study to date strongly suggests most aspects of language models are indeed predictable,” says computer scientist and study co-author Sanmi Koyejo. Emerging artificial “general” intelligence is no longer apparent when systems are tested in different ways. This “emergence”, where AI models appear to gain abilities in sharp and unpredictable jumps, is nothing more than a mirage; systems’ abilities build gradually.

So What?

Models are making improvements but they are nowhere near approaching consciousness. Perhaps benchmarking needs more attention, focusing on how tasks map onto real-world activities. Link to article.

Articles, Literature Reviews

“In the Artificial Intelligence (AI) Science boom, beware: your results are only as good as your data.”

Hunter Moseley shines a light on how we can make our experimental results more trustworthy. Thoroughly vetting results before and after publication helps ensure that huge, complex data sets are both accurate and valid. We need to question results and papers; just because something has been published does not mean it is accurate or even correct, whoever the author may be and whatever their credentials.

The key to ensuring the accuracy of these results is reproducibility: careful examination of the data, with peers and other research groups investigating the outcomes. This is vitally important when a data set is used in new applications. Moseley and his colleagues found something unexpected when they investigated some recent research papers: duplicates appeared in the data sets used in three papers, meaning the data were corrupted.

In machine learning it is usual to split a data set in two, using one subset to train a model and the other to evaluate the model’s performance. With no overlap between the training and testing subsets, performance in the testing phase reflects how well the model has learned and generalises. However, in their examination the authors found what they described as a “catastrophic data leakage” problem: the two subsets were cross-contaminated, undermining the ideal separation. About one quarter of the data set in question was represented more than once, corrupting the cross-validation steps. After cleaning up the data sets and applying the published methods again, the observed performance was far less impressive, with the accuracy score dropping from 0.94 to 0.82. A score of 0.94 is reasonably high and “indicates that the algorithm is usable in many scientific applications”; at 0.82 it is useful, but with limitations, and then “only if handled appropriately”.
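
The kind of leakage described here is straightforward to test for once the data are shared. Below is a minimal sketch (illustrative only, using a made-up feature matrix rather than the data from the papers in question) that checks whether any feature rows appear in both the training and test subsets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix with deliberately duplicated rows (illustrative only).
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 8))
X = np.vstack([X, X[:50]])                     # re-insert 50 duplicate rows
y = rng.integers(0, 2, size=len(X))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Represent each row by its raw bytes so duplicates can be compared cheaply.
train_rows = {row.tobytes() for row in X_train}
test_rows = {row.tobytes() for row in X_test}

leaked = train_rows & test_rows
print(f"{len(leaked)} distinct feature rows appear in both subsets")
```

Any non-empty overlap means the test score partly reflects memorisation of examples the model has already seen, rather than genuine generalisation.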

So what?

Studies that are published with flawed results obviously call the research into question. If researchers do not make their code and methods fully available, this type of error can go undetected. If high performance is reported, other researchers may not attempt to improve on the results, feeling that “their algorithms are lacking in comparison.” And since some journals prefer to publish successful results, follow-up work may not be considered valid or even worth publishing, holding back progress in the field.

Encouraging reproducibility:

Moseley argues that a measured approach is needed. Where transparency is demonstrated, with data, code and full results available, a thorough evaluation and identification of the problematic data set would allow an author to correct their work. Another of his solutions is to retract studies with highly flawed results and little or no support for reproducible research. Scientific reproducibility should not be optional.

Researchers at all levels will need to learn to treat published data with a degree of scepticism; the research community does not want to repeat others’ mistakes. But data sets are complex, especially when using AI. Making these data sets, and the code used to analyse them, available will benefit the original authors, help validate the research and ensure rigour in the research community.

Link to full article in Nature.

Articles

Butterflies and ChatGPT

Prompting is the way we talk to generative AI and large language models (LLMs). The way we construct a prompt can change a model’s decision and affect the accuracy of the results it provides. Research from the University of Southern California Information Sciences Institute shows that a minute tweak, such as a space at the beginning of a prompt, can change the results. This is likened to chaos theory, where a butterfly flapping its wings generates a minor ripple in the air that results in a tornado several weeks later in a faraway land.

The researchers, who were sponsored by the US Defense Advanced Research Projects Agency (DARPA), chose ChatGPT and applied a range of prompt variations. Even slight changes led to significant changes in the results. They found many factors at play, and there is more work to be done to find solutions to this effect.

Why do slight changes produce such significant shifts? Do the changes “confuse” the model? By running experiments across 11 classification tasks, the researchers were able to measure how often the LLM changed its predictions and the impact on accuracy. Studying the correlation between confusion and an instance’s likelihood of having its answer changed (using a subset of the tasks with individual human annotations), they did not find a full answer.
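
To make the measurement concrete, here is a minimal sketch of counting prediction flips under small formatting tweaks. The `query_model` function is a hypothetical stand-in for a real LLM call, and the perturbations and examples are invented for illustration:

```python
from collections import Counter

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call that returns a class label.
    The 'sensitivity' to formatting here is faked purely for illustration."""
    return "positive" if len(prompt) % 2 == 0 else "negative"

def perturbations(prompt: str):
    """Minor formatting tweaks of the kind studied in the paper."""
    yield prompt                  # original wording
    yield " " + prompt            # leading space
    yield prompt + "\n"           # trailing newline
    yield prompt + " Thank you."  # polite suffix

examples = [
    "Classify the sentiment: 'The film was wonderful.'",
    "Classify the sentiment: 'I would not watch it again.'",
]

flips = Counter()
for text in examples:
    answers = [query_model(p) for p in perturbations(text)]
    flips[text] = sum(a != answers[0] for a in answers[1:])

for text, n in flips.items():
    print(f"{n}/3 perturbations changed the prediction for: {text!r}")
```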

So what?

Building LLMs that are resistant to such changes and yield consistent, accurate answers is a logical next step. However, this will require a deeper understanding of why responses change under minor tweaks. Is there a way we can anticipate these changes in outputs? With ChatGPT being integrated into systems at scale, this work will be important for the future.

Link to full article. 

Articles

The EU AI Act – what does this mean for innovation?

European Union member states and the European Parliament have worked to publish “the AI Act”, which sets out a framework of “staggered rules” based on risk. These include items such as:

  • Unacceptable risk systems – tech that poses a threat to people will mostly be banned
  • AI systems must respect EU copyright rules and be transparent about how generative AI models have been trained
  • General-purpose tools like ChatGPT will be assessed on how powerful they are. Tools trained using large amounts of computing power would face more obligations and reporting requirements.

So what?

Companies will not have to comply with the rules for two years, by which time the rules could be out of date before they are even enforced. Will these rules stifle innovation in this fast-moving sector?

Link here: Euronews

Talk Review

CAISS TALK: Dr Lewys Brace – Biases when exploring the “incelosphere”

The November talk was a great event, with Dr Lewys Brace from the University of Exeter discussing “Biases when exploring online extremist sub-cultures and the ‘incelosphere’: examples from the ConCel project”. Incel, short for “involuntary celibate”, is an online sub-culture in which individuals define themselves by their inability to form sexual relationships with women. Recent years have seen an increase in the amount of work using large-scale, data-driven analysis methods to understand such ecosystems. This work typically uses text data acquired from online spaces, which is then analysed using Natural Language Processing (NLP) techniques. However, there are several points along the road from data collection through to interpretation of results where biases can emerge when using such methods on these online sub-cultures.

Lewys talked to us about the ways bias can appear:

  • Not selecting the “right” online spaces for gathering data
  • The data collected may not be representative of the extremist ecosystem
  • The initial “seed” list may be biased – (manual checking attempts to reduce this)
  • Data cleaning – as these sites have a specific sub-cultural language in use, interrogating the data in depth helps with this
  • Edgy humour can be a euphemism for racist, misogynistic and homophobic views – is it irony or genuine?
  • Deciding on the measure to use can be problematic – using multiple measures helps with a “sanity check” and can offer additional insights

The team used the Fisher-Jenks algorithm, which takes an iterative approach to finding the best groupings of numbers based on how close together they are (i.e. based on variance from each group’s mean), while also trying to ensure the different groupings are as distinct as possible (by maximizing the variance between groups). Analyses were also carried out at the micro-level, adopting a context-based approach (i.e. integrating ideology with personal life experiences), and at the macro-level, which in this case can cause issues with the use of hateful language; this was mitigated by using violent language and out-group terms in the analysis.
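
For readers unfamiliar with the method, the sketch below illustrates the objective that Fisher-Jenks optimises: split a sorted list of values into contiguous groups so that the total within-group variance is as small as possible. It uses a brute-force search on toy data, not the optimised algorithm the team will have used:

```python
from itertools import combinations

def within_group_ss(group):
    """Sum of squared deviations from the group mean."""
    mean = sum(group) / len(group)
    return sum((x - mean) ** 2 for x in group)

def natural_breaks(values, n_classes):
    """Brute-force search over contiguous partitions of the sorted values,
    returning the grouping with the lowest total within-group variance
    (the quantity the Fisher-Jenks algorithm minimises efficiently)."""
    data = sorted(values)
    best_groups, best_score = None, float("inf")
    for breaks in combinations(range(1, len(data)), n_classes - 1):
        cuts = (0, *breaks, len(data))
        groups = [data[cuts[i]:cuts[i + 1]] for i in range(n_classes)]
        score = sum(within_group_ss(g) for g in groups)
        if score < best_score:
            best_groups, best_score = groups, score
    return best_groups

scores = [1, 2, 2, 3, 9, 10, 11, 25, 27, 30]   # toy data
for group in natural_breaks(scores, 3):
    print(group)
```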

A very engaging question and answer session followed, covering many aspects of Lewys’ work, such as: group isolation, whether Incels use the dark web (they tend not to), whether the groups can be infiltrated (no, people doing this are spotted, ridiculed and driven out), cross-cultural spread (groups are emerging in Japan and Russia), Incel demographics (young, white males in general) and how to track individuals over time. Further work in this area is ongoing using topic modelling and the idea of potential hybrid ideologies.

Link to accompanying report

Talk Review

CAISS TALK: Assistant Professor Xiao Hui Tao – Monitoring internal displacement

CAISS were privileged to have Assistant Professor Xiao Hui Tao from the University of California, Davis deliver our December talk.

Xiao Hui talked to us about how mobile phone data can be used to monitor internal displacement within a country, in this case Afghanistan. This is especially relevant at present given current world events. The forced displacement of people is a key cost of violence, and Internally Displaced Persons (IDPs) are hard to keep track of. This matters for targeting aid more effectively and for understanding likely locations of future instability, in order to allocate forces or target specific programmes.

The “vast untapped resource” of mobile phone data was used to estimate violence-induced displacement in a granular manner. The work was both methodological and substantive: what was the overall effect of violence on displacement in Afghanistan, what factors affected the choice of destination, and could the team confirm and test hypotheses from qualitative work gathered from surveys? A large amount of mobile phone data was used: 20 billion transactions from the anonymised records of 10 million subscribers of Afghanistan’s largest mobile phone operator, from April 2013 to March 2017, covering 398 districts, 5,984 violent events and 13,000 cell towers grouped by proximity into 1,439 tower groups. The results showed that for those in a district on a day with violence, there was an immediate and statistically significant increase in the likelihood of leaving the district. Results also showed a larger impact for Islamic State-related violence than for Taliban violence; this could be attributed to the fact that IS have been known to target civilians, for example by filming executions. There was also a larger impact when violence had been experienced recently, and a smaller impact in provincial capitals.
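
As a rough illustration of the kind of comparison involved (with entirely invented data and column names, and none of the econometric care the actual study applied to billions of records), one could contrast how often subscribers leave their home district after a violent day versus an ordinary day:

```python
import pandas as pd

# Toy daily panel: one row per subscriber per day, district inferred from
# the modal cell tower that day (all values and column names are invented).
panel = pd.DataFrame({
    "subscriber": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "day":        [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "district":   ["A", "A", "B", "A", "A", "A", "B", "B", "B"],
})
violent_days = {("A", 2)}   # district A saw violence on day 2

panel = panel.sort_values(["subscriber", "day"])
panel["next_district"] = panel.groupby("subscriber")["district"].shift(-1)
panel["left_district"] = panel["next_district"].notna() & (
    panel["next_district"] != panel["district"]
)
panel["violent_day"] = [
    (d, t) in violent_days for d, t in zip(panel["district"], panel["day"])
]

# Share of subscriber-days followed by a move out of the district,
# split by whether that day saw violence in the district.
print(panel.groupby("violent_day")["left_district"].mean())
```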

When displaced, people were not just seeking economic opportunity. Half of those moving from a capital moved to other capitals or major cities. Of those moving from non-capitals, more than half went to capitals or major cities, with 30% moving to a provincial capital in the same province. The main driver was seeking safety rather than economic opportunity, which is consistent with the qualitative narrative. In non-capitals, violence resulted in people seeking safety close to home.

Xiao Hui talked specifically about some of the limitations and mitigating biases:

  • There could be bias in the data sources
  • Check and check again if the results contain bias
  • People could be sharing mobile phones
  • Are phones only being used by the wealthy?
  • Are women using phones in a patriarchal society?
  • Is the displacement intra district rather than inter district?
  • Are cell phone towers being destroyed resulting in data of false displacement?

The analysis of this data provided insight into the nature of violence-induced displacement in Afghanistan and helped to quantify some of the human costs of violence that would be difficult to measure using traditional methods such as surveys. While there are definite limitations to what can be observed through mobile phone data, conflict-prone regions are often also the places where traditional survey-based data are the least reliable and most difficult to obtain. This approach could complement traditional perspectives on displacement and eventually contribute to the design of effective policies for prevention and mitigation.

Articles, Literature Reviews

Hypotheses devised by AI could find “blind spots” in research

“Could Artificial Intelligence (AI) have a creative role in the scientific process?” was a question posed in 2023 by a group of researchers in Stockholm. AI is already being used for literature searches, to automate data collection, to run statistical analyses and even to draft parts of industry and academic papers. Sendhil Mullainathan, an economist at the University of Chicago Booth School of Business in Illinois, has suggested using AI to generate hypotheses, stating that “it’s probably been the single most exhilarating kind of research I’ve ever done in my life”.

AI could help with creativity: using large language models (LLMs) to create new text, even if it is inaccurate, could lead to a statement such as “here’s a kind of thing that looks true”; when you think about it, this is exactly what a hypothesis is! These “hallucinations” are sometimes things a human would not come up with and could aid thinking outside of the box.

Hypotheses sit on a spectrum from the concrete and specific to the abstract and general; using AI in areas where the fundamentals remain hidden could generate insights. For example, we know a particular behaviour is happening but do not know why: could the AI identify some rules that might apply to the situation? James Evans, a sociologist at the University of Chicago, says AI systems that generate hypotheses based purely on machine learning require a lot of data. Should we be looking to build AI that goes beyond “matching patterns” and can also be guided by known laws? Rose Yu, a computer scientist at the University of California, San Diego, states that this would be a “powerful way to include scientific knowledge into AI systems”.

Ross King, a computer scientist at Chalmers University of Technology in Gothenburg, is building robotic systems that perform experiments. Factors are adjusted subtly in his “Genesis” systems, allowing these robot scientists to be more consistent, unbiased, cheap, efficient and transparent than humans.

Hypothesis generation by AI is not new. In the 1980s Don Swanson pioneered “literature-based discovery” with software he created called “Arrowsmith”, which searched for indirect connections and proposed, for example, that fish oil might help treat Raynaud’s syndrome, a condition in which blood circulation to the hands is limited. This hypothesis was later proved correct: fish oil decreases the blood’s viscosity, improving circulation.

Data gathering is becoming more automated, and automating hypothesis generation could become an important factor, as more data is being generated than humans can handle. Scaling up “intelligent, adaptive questions” will ensure that this capacity is not wasted.

So What?

This approach could lead to valid hypotheses being developed that are clear and broad in areas where the underlying principles are poorly understood. A panacea, perhaps, for “researcher’s block”, unlocking blind spots? For Defence this could mean helping to avoid groupthink, encouraging more innovation outside of the chain of command and enabling things to be done differently in an often slow-to-change organisation. AI could prove to be a lot more useful than just performing literature reviews.

Full article: Nature magazine

Byte

Why Algorithms pick up on our biases

Why do algorithms pick up on our biases? It could be argued that this is due to a 95-year-old economic model that assumes people’s preferences can be revealed by looking at their behaviour. However, the choices we make are not always what would be best for us. We might have a great wish list on our Netflix account that reflects our true interests, yet watch the “trashy” shows Netflix serves up because they are easier to click on. Algorithms are built on what the user is doing, making predictions from revealed preferences that can be incomplete and even misleading. Should algorithms move away from revealed preferences and incorporate more behavioural science? Would this lead to an improvement in our welfare? Or do we just need to watch something “trashy” to de-stress at the end of the day?

SOURCE: Nature Human Behaviour

Articles

How can we “stop deepfakes from sinking society?”

It is easy for AI to generate convincing images and videos, so we need to ensure we can guard against the harm such deepfakes could cause. Wil Corvey, who leads this work at the US Defence Advanced Research Projects Agency, says we should question not “how much of this is synthetic” but instead “why was this made?” One problem identified is that “people are not used to generative technology”, says Cynthia Rudin from Duke University, North Carolina: there is little scepticism because the technology has exploded onto the scene rather than developing slowly. Some deepfake images are fun and made for entertainment, but others can be used to deceive and carry out fraud. One way to help detect synthetic images is to watermark them, altering pixels in a way that is imperceptible to the naked eye but is picked up on analysis, or to tag a file’s metadata to authenticate an image.
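
As a toy illustration of the pixel-level idea (a naive least-significant-bit scheme, far simpler and less robust than the watermarking methods the article discusses), a short bit pattern can be hidden in an image without visibly changing it:

```python
import numpy as np

def embed_watermark(pixels: np.ndarray, bits: str) -> np.ndarray:
    """Hide a bit string in the least significant bit of the first pixels.
    Each value changes by at most 1, so the edit is invisible to the eye."""
    flat = pixels.flatten()          # flatten() returns a copy
    for i, bit in enumerate(bits):
        flat[i] = (flat[i] & 0xFE) | int(bit)
    return flat.reshape(pixels.shape)

def read_watermark(pixels: np.ndarray, length: int) -> str:
    return "".join(str(v & 1) for v in pixels.flatten()[:length])

image = np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)  # stand-in image
marked = embed_watermark(image, "10110010")

print(read_watermark(marked, 8))                              # -> "10110010"
print(np.abs(marked.astype(int) - image.astype(int)).max())   # max change: 1
```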

So What?

A losing battle is being fought to detect deepfakes; we need greater technological literacy and better tools at our disposal to counteract harmful ones. As our own Dr Sophie Nightingale says, “People’s ability to really know where they should place their trust is falling away. And that’s a real problem for democracy”. With major elections due in many countries, there is possibly a big threat to contend with.

Link: https://www.nature.com/articles/d41586-023-02990-y

Byte

CAISS Bytes: ChatGPT Content Moderation

Anirban Ghosal, senior writer for Computerworld, discusses how OpenAI are planning to use the GPT-4 LLM for content moderation, and how this could help to eliminate bias. By automating the process of content moderation on digital platforms, especially social media, GPT-4 could interpret the rules and nuances in long content-policy documentation, as well as adapting instantly to policy updates. The company believes AI can help to moderate online traffic and relieve the mental burden on the large number of human moderators. It posits that custom content policies could be created in hours, using data sets containing real-life examples of policy violations to label the data. Traditionally people label the data, which is time-consuming and expensive.

Human experts then read the policy and assign labels to the same data set without seeing GPT-4’s answers. Using the discrepancies between the two sets of labels, the experts can ask GPT-4 to explain the reasoning behind its labels, look into the policy definitions, discuss the ambiguity and resolve any confusion. This iterative process involves many steps with data scientists and engineers before the LLM can generate good, useful results.
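
A minimal sketch of that comparison step is shown below. The `llm_label` function is a hypothetical stand-in for a call to the model, and the policy and examples are invented for illustration:

```python
def llm_label(text: str, policy: str) -> str:
    """Hypothetical stand-in for asking the LLM to label text under a policy."""
    return "violates" if "scam" in text.lower() else "allowed"

policy = "Content that attempts to defraud users is not allowed."

human_labels = {
    "Win a free phone, just send your bank details!": "violates",
    "Selling my old phone, message me for details.":  "allowed",
    "This scam site stole my money, avoid it.":       "allowed",  # reports, not commits, fraud
}

# Cases where the model and the human annotator disagree are the ones worth
# discussing with the model and using to tighten the policy wording.
disagreements = [
    (text, human, llm_label(text, policy))
    for text, human in human_labels.items()
    if llm_label(text, policy) != human
]

for text, human, model in disagreements:
    print(f"human={human!r}  model={model!r}  text={text!r}")
```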

So What?

Using this approach should lead to a decrease in inconsistent labelling and a faster feedback loop, with more consistent results. Undesired biases can creep into content moderation during training, so results and outputs will need to be carefully reviewed and further refined by keeping humans in the loop; done this way, bias could be reduced. Industry experts suggest that this approach has potential and could lead to a massive multi-million-dollar market for OpenAI.

Link: https://www.computerworld.com/article/3704618/openai-to-use-gpt-4-llm-for-content-moderation-warns-against-bias.html