Text editions

The Text Edition (v2) of Bot or Not? (est. 2023) examines a question that looks relatively mundane on the surface, but that is in fact criminologically and societally significant:

how well can readers distinguish between authentic and AI-generated online reviews, and how confident are they in doing so?

Led by Prof Claire Hardaker and Dr Georgina Brown with major research assistance from Amy Dixon, this strand of the overall Bot or Not suite focuses specifically on hotel reviews. Fake reviews are a long-standing problem in platform economies, and generative AI has merely industrialised the practice. Large language models such as ChatGPT make it trivial to produce high-volume, stylistically fluent endorsements for products, services, or even entirely fictitious businesses, enabling wide-scale fraud and fraud-adjacent behaviour built on deceptive, persuasive content.

The central research questions are:

  • What is the overall detection accuracy for AI-generated vs. human-written reviews?
  • How well calibrated are participants’ confidence judgements?
  • What linguistic or stylistic cues do readers report relying on?
  • Are these cues diagnostic, or illusory?

Design overview (v2)

Participants are presented with 15 reviews randomly drawn from a large bank of 400 reviews overall. This bank consists of…

As with other editions, a participant’s set may be:

  • entirely human-authored
  • entirely AI-generated
  • or a mixture

Moreover, the AI-generated reviews fall into three categories:

  • minimal prompt engineering, e.g. “write a positive review of Hotel XYZ”
  • moderate prompt engineering, e.g. “write a positive review of Hotel XYZ, include specific details”
  • extensive prompt engineering, e.g. “write a positive review of Hotel XYZ, include specific details and some typos”

All hotels are anonymised as Hotel XYZ to prevent brand familiarity effects.

For each review, participants:

  1. Classify the review as human or AI-generated
  2. Provide confidence ratings (pre- and post-task)
  3. Offer qualitative explanations of the cues or reasoning underlying their decisions (submitted prior to score feedback)

A key methodological refinement in v2 is the disaggregation of accuracy and confidence (see below for more on this). Participants’ scores are based solely on the correctness of their binary classifications. Confidence is measured independently to enable analysis of:

  • calibration (confidence vs. accuracy alignment)
  • overconfidence or underconfidence trends
  • shifts in metacognitive awareness across the task
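
To make the first two of these concrete, here is a minimal illustrative sketch (hypothetical data and function names, not the project's actual analysis pipeline) of how disaggregated confidence and correctness support a basic calibration summary:

```python
# Illustrative sketch only: hypothetical data and a simple calibration
# summary, not the project's actual analysis code.

def calibration_summary(correct, confidence):
    """correct: list of 0/1 classification outcomes;
    confidence: list of self-rated confidence values in [0, 1]."""
    n = len(correct)
    accuracy = sum(correct) / n
    mean_conf = sum(confidence) / n
    # Positive overconfidence values mean participants were more
    # confident than accurate; negative values mean underconfidence.
    return {
        "accuracy": accuracy,
        "mean_confidence": mean_conf,
        "overconfidence": mean_conf - accuracy,
    }

# Hypothetical participant: 9 of 15 correct, but confidence averaging 0.8,
# i.e. an overconfident responder.
correct = [1] * 9 + [0] * 6
confidence = [0.8] * 15
print(calibration_summary(correct, confidence))
```

Because scores in v2 depend only on the `correct` vector, the `confidence` vector can be analysed independently, which is precisely what the embedded-confidence design of v1 (discussed below) did not allow.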

This design brings the Text Edition into alignment with the Music and Speech editions, facilitating cross-modal comparison.

Current performance snapshot (v2)

How well do people do at this particular quiz? Our latest summary statistics (Feb 2026) are as follows:

Responses   Lowest score   Highest score   Mean   SD     Variance
589         1              15              9.10   2.28   5.19

Feb 2026: BoNTE results

Scope (v2)

The dataset supports work in:

  • metacognition and judgement under uncertainty
  • platform governance and consumer protection
  • AI literacy and public resilience
  • adversarial text generation modelling

Of particular interest is the gap between perceived cues (as reported in free-text explanations) and statistically predictive cues: participants often articulate rationales that do not correspond to reliable discriminators – a finding with implications well beyond hotel reviews. The relationship between confidence and competence is a further focus of analysis.

What about version 1?

The first iteration of the Text Edition closed on 13 May 2025 with 957 completed responses. In v1 (the very first quiz in the entire Bot or Not suite), rather than being collected separately, confidence was embedded directly within each response option using a five-point categorical scale:

  • Definitely human
  • Maybe human
  • Not sure
  • Maybe bot
  • Definitely bot

Scoring was weighted:

  • 1 point for a correct “definitely” judgement
  • 0.5 points for a correct “maybe” judgement
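
The weighted rule above can be sketched as follows (the option labels mirror the quiz, but the function itself is a hypothetical illustration, not the quiz's actual scoring code):

```python
# Illustrative sketch of the v1 weighted scoring rule described above;
# the function and its interface are hypothetical.

def v1_score(response, truth):
    """response: one of the five v1 options, e.g. "definitely human";
    truth: the ground-truth label, "human" or "bot"."""
    weights = {"definitely": 1.0, "maybe": 0.5}
    if response == "not sure":
        return 0.0
    strength, guess = response.split(" ", 1)
    # Full or half credit only when the guessed label matches the truth.
    return weights[strength] if guess == truth else 0.0

print(v1_score("definitely bot", "bot"))    # strong correct judgement
print(v1_score("maybe human", "human"))     # hedged correct judgement
print(v1_score("definitely human", "bot"))  # incorrect judgement
```

Note how the rule rewards stronger correct answers more: this is exactly the incentive to overstate certainty that the paragraph below identifies as a behavioural distortion.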

While this gamified structure provided useful gradience data and a measure per sample, it also introduced a potential behavioural distortion. Because stronger responses yielded higher scores, participants may have been incentivised to overstate certainty to maximise performance. In other words, the measurement instrument risked shaping the phenomenon it aimed to observe.

For this reason, v2 separates binary classification from confidence reporting, removing score-based incentives for overclaiming certainty and enabling cleaner calibration analysis.

Final scores snapshot (v1)

Responses   Lowest score   Highest score   Mean   SD     Variance
957         1              15              7.19   2.34   5.48

Feb 2026: BoNTE v1 results

Across both versions, the Text Edition continues to ask a simple but increasingly consequential question: when reading persuasive online prose, what convinces us that a human is behind it, and how often are we mistaken?