CONTENT RATING: universal
Can you tell your real individual from your robot agent? In the ultimate game of Bot or Not, would you stake $26m of your own money on it? Below you will find data, audio credits, further reading, and a transcript of the podcast.
This episode was supported by UKRI as part of the annual Festival of Social Science. Each year this Festival celebrates the amazing research and advancements of our best and brightest social scientists. Hundreds of events are running all over the country from the 19th of October to the 9th of November, and this year’s theme is “Our Digital Lives”.
Play the real Bot or Nots
- Audio Edition: can you tell your doting aunt from your digital agent?
- Text Edition: can you spot the scam AI-generated hotel reviews?
Audio credits
Kai Engel – Machinery
Kai Engel – The Moments of Our Mornings
Scott Holmes – Infrastructure
Lee Rosevere – Puzzle Pieces
Transcript
Case S02E10: Bot or Not
It’s mid-morning on Monday the 15th of January, 2024. The sky is clear, and as the temperature steadily rises towards 25°C (that’s about 77°F) the air shimmers above six lanes of inner-city traffic. Looking down over the cars, bikes, and hot pavement is a building that dwarfs many of the structures around it, not so much in height, but mainly in sheer footprint. Above ground, three floors – black, brown, and white – each sit, one on top of the other, none of them quite flush. Out of sight below ground are a further four basement levels. This is Festival Walk, a one million square foot shopping centre in Kowloon Tong, close to the heart of Hong Kong. Among its hundreds of shops, it is home to the usual suspects – Calvin Klein, Juicy Couture, Hollister – and of course it has a multiplex cinema, an atrium, and an ice rink. It may be huge and architecturally quirky, but it’s also just like many other malls worldwide. And, perhaps most importantly, it’s not really the scene where you might expect a state-of-the-art, multimillion dollar online fraud to take place.
But Festival Walk isn’t just one of the world’s largest retail real estates. As if to emphasise its overall geometric irregularity, at one end of the building, a further four floors of crisp black glass rise from the mall’s roof – an extra quarter million square feet of office space available for rent. If you have the credentials to pass through the marble-style entrance lobby and secure smartgates, you’ll find the acutely angled floors of Festival Walk Office Tower tenanted by a range of businesses and users, from little coworking startups to the City University of Hong Kong. And at the top rung of this striking commercial ladder, we find Arup.
Who, you might ask, is Arup? It’s a multinational consultancy that turned over £1.9 billion in revenue (that’s nearly $2.5 billion in US dollars) in 2022 alone, and this is where the story of our heist begins.
Welcome
Welcome to en clair, an archive of forensic linguistics, literary detection, and language mysteries. You can find case notes about this episode, including credits, acknowledgements, and, far more than usual, many extra links to further reading at the blog. The web address is given at the end of this podcast.
Modern wonders
If you’ve never heard of Arup before, don’t worry. It’s not the sort of business the average person has day-to-day dealings with. To simplify a long history into a few words, the Arup acorn was planted in London in 1946 when Sir Ove Arup founded a professional consultancy for the built environment. As the years passed, this provided an increasing number of services ranging from design and engineering to architecture and planning, and opened offices all over the world. You may not have heard of Arup, but you’ve probably heard of at least a few of its biggest projects, such as Apple Park in North America (that’s the corporate headquarters of Apple), Aspire Tower in Qatar, the Sydney Opera House in Australia, and in the UK alone, the Shard, the Gherkin, the London Eye, and the Millennium Bridge. That’s the bridge in Harry Potter and the Half-Blood Prince that the Death Eaters spiral around and destroy as they fly away through London with a captive. Arup has been involved in everything from sports stadiums and bus stations to TV studios and coastal seaports. Banks, bridges, arenas, libraries, galleries, sculptures, terminals – like I said, Arup’s bread and butter tends to be projects at a scale that the average person is simply not involved in.
The inner clockworks of a platinum-tier multinational like Arup might be a mystery to most of us, but for one enterprising outfit, Arup’s Hong Kong office is not only on their radar; they have been researching it for some time now, and they have just clicked send on a handful of emails to Arup employees working in Festival Walk Office Tower. As those emails flit across cyberspace, weather warnings arrive for a possible monsoon, but no one knows that over the coming week, the temperature is about to plunge below freezing, and not just metaphorically…
The heist
One of Arup’s employees is frowning at the email. For ease, we’re going to call that employee Jiang. For the record, I have no secret information that this is their name, but we need to call them something, and this one means river. Back to Jiang’s email. It’s come all the way from London headquarters. And not just anyone. It’s Arup’s Global Chief Financial Officer, Rob Boardman. He wants an important task doing, but it requires discretion. For a moment, Jiang is doubtful. It sounds just a little bit odd. Has the CFO really just… no, surely not… But then comes the clincher: Boardman has scheduled an online video conference. He will be there, himself, personally, to explain and to give instructions.
Any remaining worries are put to rest.
Boardman is going to oversee the matter directly, and who is our little Hong Kong employee to argue with someone of this seniority? Questions could trigger shameful recriminations or even career-ending fury. If Boardman will be on this call, what more proof could anyone want?
Sure enough, Jiang joins the call and Rob Boardman himself is on camera along with several other recognisable, senior Arup figures and a few outsiders. It’s a terse meeting. Jiang is asked to introduce themselves, and accordingly does so, but this isn’t a social occasion. No one is here to chitchat. There is big-money business to be done, and quickly. Jiang is given a series of instructions for making fifteen bank transfers to five different Hong Kong accounts totalling HK$200 million – that’s about £20 million, or $26 million. And then the call abruptly ends. But though the meeting is over, the communication doesn’t stop. Boardman and the others continue with instructions and commands, sometimes sending emails, sometimes using instant messaging, and sometimes appearing in one-to-one video calls.
But as these missives come in, gradually, the voice of doubt revives itself. What if something irregular is happening here? Why do they need this money transferring in such an odd way? Rob Boardman is the CFO – one of the most powerful people in this massive, global, billion-dollar multinational. Doesn’t he have someone in London to do this? A dozen someones? Couldn’t he even do it himself? Are legitimate instructions to move huge sums into different bank accounts really sent by instant messenger to far less senior people in faraway offices? What if something is afoot… Will Jiang be the one held responsible? Could they get fired? Might there even be criminal charges?
Finally, the tension and uncertainty become unbearable. The risk is too high. The balance between respectful deference and self-preservation tips. Jiang writes an email to London HQ, takes a breath, hits send. It’s done. For better or worse. Now there’s just the wait. Refreshing the inbox. Praying. Hoping. Fearing. And there it is – a reply from HQ.
No, Mr Boardman has taken part in no such video calls. No, Mr Boardman definitely has not requested multiple large transfers to different bank accounts. Yes, this has all been an elaborate scam, and an enterprising criminal outfit has just escaped with HK$200 million.
To err is human
Everyone has been there, and if you haven’t, I pity you, because your day will surely come. We’ve all done it. That screw up so monumental that we feel like the waters of shame will close over our heads forever and never let us back up for air. I’m not talking about malicious actors, but innocent hardworking people doing their jobs, and something, somehow, slips through the net. In 2024 alone, there was the poor soul who released the CrowdStrike update and tanked half the computers around the globe, including grounding flights and shutting hospitals. You know that feeling that goes through you when you realise you have screwed up, and screwed up big.
On that basis, I want to give Jiang every possible defence. Is it possible they were a bit daft? Sure. I have days where I’m daft too. But they were also at an immediate disadvantage. It isn’t easy in many cultures, and especially for those with certain histories or personalities or both, to challenge even our equals if we think something suspicious might be afoot. What if they get offended? What if they complain? What if our suspicions are ill-founded and we cause problems at our workplace? We have to be there all day, every day. That problem is compounded many times over if this isn’t just a colleague but one of the most senior people in a huge, international business. How do you safely stop a top boss who probably earns more in a day than you do in a lifetime, and articulate a question like, “I’m sorry but are you real? I think you might be an impostor. Can you prove your identity to me please?”
I mean, what would proof even look like? They already have the right face and the right voice, so what then – a birth certificate? A passport? At best it might sound like you’re accusing them of looking dishonest, and people have suddenly found themselves and their entire careers in the bin for far less. On paper, superiors should reward employees who are ready to ask difficult questions to protect the business, but in practice, far too many managers are insecure and will interpret such behaviour as an attack on themselves. I’m not for one moment suggesting that this is the climate at Arup. I simply don’t know. But Jiang probably wouldn’t have known either. It’s unlikely that an office worker in Hong Kong would have spent enough time – if indeed any – with the London-based CFO to learn whether he would tolerate questions like this. And this possibility of a fragile ego is the perfect exploit for an enterprising white-collar criminal.
As is so often the case in crimes involving computers, it isn’t necessarily the technological weaknesses that cause the problems. It’s the humans, and their habits, biases, patterns, and flaws. All we need is anxious deference combined with compelling evidence, and we have a near-perfect recipe for crime. And in this case, the evidence was the face and voice of Rob Boardman. Or so it seemed. What is more probable, after all – that a convincing lookalike and soundalike is running a multimillion-dollar scam and they just happen to have picked you out of everyone in a giant global organisation? Or… you’re speaking to the actual CFO? And there were others on the call too – people that Jiang recognised as Arup figures, and who also spoke. Perhaps a criminal enterprise might track down one really good face-and-voice clone, but amassing multiple is implausible, even with the help of a criminal cameos craigslist. (Yes, it took me several tries to record that.)
Of course, there is a seemingly obvious explanation. Most of us are now familiar with the idea of deepfakes, where artificial intelligence manipulates existing videos or generates entirely new ones. Just one common deepfake genre involves a notable figure like a celebrity or a politician supposedly being caught on camera saying or doing something mind-blowingly offensive when in reality the footage is partly or fully counterfeit. But at the time of writing about this case, these examples have all been designed for passive consumption – they’re effectively movies for the audience to simply watch. They are not live interactions that need to be responsive to audience feedback. At present, generative AI may be convincing at some tasks, but when it comes to ordinary conversation, it has quite a spectacular feat to achieve.
To start understanding just where the field is right now, we first need to take a moment to understand a little of the history, and all the layers necessary to convincingly spoof a real person’s voice.
Fort Vox
Let’s start with the simplest option, and we’ve already mentioned it. Find a human being who can impersonate you. To do that, they have to either have a great deal of control over their whole vocal tract or, more realistically, they have to have a vocal tract that really isn’t so dissimilar to yours. As it happens, some people have more generic-sounding voices and, no surprise, they’re easier to emulate. But being distinctive isn’t an automatic defence against being impersonated. Think how many Elvis Presley, Morgan Freeman, and Barack Obama tribute acts there are out there.
So, what about tech? Well, since the 1920s we’ve had something called the vocoder. Very simplistically, the vocoder was intended to reduce the amount of bandwidth required to transmit voices over the phone, but as an accidental consequence, this had a distorting effect on what came out the other side, and, incidentally, the military was delighted with the way that this process presented opportunities to scramble and thus protect war-time communications. Anyway, take a few zig-zagging steps forward through history and you arrive at voice changers. The clue is in the name. This is technology deliberately designed to change the voice. You see these as toys or spy tech or superhero gadgets in plenty of films and TV shows where characters use them to sound like robots or hide their identities or strike fear into the hearts of their enemies or all of the above, thus complicating the plot a little further than was ever strictly necessary. Take another few steps forward, though, and you get from changing your voice in some sort of generic way – making it lower or robotic or whatever – to changing it in a very particular way to sound like someone else. Someone specific. Someone who exists.
On paper, that sounds like an instant win. Why bother with complex computational modelling and algorithms and embeddings when you can just swing by a toyshop and start sounding like Obama for less than the price of a family takeout? Well, because in practice, the tech is simply not that good. At all. And secondarily, even if it were that good, we are not.
What I mean by that is, I could pick up the perfect Taylortron from a nearby toyshop that makes my voice sound exactly like Taylor Swift’s, and yet, I don’t think I would fool a devoted fan for more than a few minutes, if that. Why? Because I don’t have Taylor’s accent. Even if I could put accents on, which I can’t, I also don’t know her stress patterns, her habitual intonational contours, how she pauses, how quickly she speaks, the sorts of linguistic errors or false starts or self-interruptions she makes, whether she sounds chirpy and happy or down or just neutral, and so on… Someone who genuinely knew Taylor Swift and who had spoken to her enough would recognise very quickly that whilst the quality of the voice might sound right, that would be where any similarity ended. Voice conversion is, effectively, an extremely superficial veneer. A paper-thin mask that can quickly fall apart under any real scrutiny.
So, perhaps computers can be helpful. For instance, we could feed a high-end machine with thousands of hours of Taylor’s voice so an AI model can learn how her voice sounds and how she uses it, and then I could type in the words and tell it to say them out loud. Tacotron2 does precisely this. If you’d like to go full nerd for a second, Tacotron2 is a neural network architecture that you can use to synthesise speech directly from text with no additional prosody information. There you go. Bet you didn’t know you’d be learning that today. Anyway, this is what plenty of generative AI models do now – they learn a voice, you give them a script, and with luck, those words come out sounding like your chosen person. The output could range from okay to reasonable to excellent, but in our Taylor Swift example, it still may not articulate her accent correctly. Why is that?
Well, before you can ask an AI to generate someone else’s voice, it has to know how to speak in the first place, and to arrive at that point, it is usually pre-trained on mountains of data. That data tends to be vacuumed up en masse from the internet, and you need only think about what you find a lot of on the internet. The podcasts and radio shows and videos and so on. Certain types of voices, accents, and demographics dominate. Others barely appear at all. So a generative model like Tacotron2 will already come with an inbuilt phonology, and that phonology will tend to have been derived from white middle-class western males in their twenties, thirties, and forties. In fairness, that isn’t so dissimilar from Taylor Swift, but for someone primed to spot a fake, it could be enough.
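For the code-minded, here is a minimal sketch of what that text-to-speech step looks like as a pipeline: an acoustic model turns text into a mel spectrogram, and a separate vocoder turns the spectrogram into a waveform. Every name below is a hypothetical placeholder that returns dummy arrays, not the real Tacotron2 API – the point is only the shape of the process, and the fact that all of the prosody and accent comes from whatever data the models were trained on rather than from anything you type.

```python
# A minimal, illustrative sketch of a two-stage, Tacotron2-style TTS pipeline.
# The classes are hypothetical placeholders returning dummy arrays; a real system
# would swap in a trained acoustic model and a trained neural vocoder.
import numpy as np

class AcousticModel:
    """Placeholder: characters in, mel spectrogram out."""
    def text_to_mel(self, text: str) -> np.ndarray:
        # A real model predicts ~80 mel bands per frame, with duration, pitch,
        # and stress patterns learned entirely from its training data.
        n_frames = max(1, len(text)) * 5           # crude stand-in for predicted duration
        return np.zeros((80, n_frames))            # dummy mel spectrogram

class Vocoder:
    """Placeholder for a neural vocoder: mel spectrogram in, audio samples out."""
    def mel_to_audio(self, mel: np.ndarray, hop_length: int = 256) -> np.ndarray:
        return np.zeros(mel.shape[1] * hop_length)  # dummy waveform

def synthesise(text: str, sample_rate: int = 22050) -> np.ndarray:
    """Text goes in, a waveform comes out - no prosody markup supplied by the user."""
    mel = AcousticModel().text_to_mel(text)
    audio = Vocoder().mel_to_audio(mel)
    print(f"'{text}' -> {mel.shape[1]} mel frames -> {len(audio) / sample_rate:.2f}s of audio")
    return audio

synthesise("My voice is my password")
```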
Even with a state-of-the-art text-to-speech synthesis model though, you still haven’t made it all the way there yet. Your output may sound like Taylor, and even have her accent, but we’ve left a major layer unconsidered. If I were typing in the speech, I would probably repeatedly screw up the dialectal choices – the words she uses for things like rubbish bin and tap and bumper and so on. The nicknames she’d choose for her various people. The turns of phrase she uses. Her go-to one-liners and in-jokes. The ways she indicates that she’s thinking. Her back-channel markers. Even the topics she would choose to speak about and how and in what tone.
Again, maybe computers can help because a lot of this is content and it’s countable. In theory I could scoop up thousands of transcripts of Taylor interviews and talks and voice-overs and so on, have an AI model identify these patterns, learn them, and generate new scripts in that same style. And then you get your synthesiser to turn that text into speech, and finally, you might have a convincing Taylor Swift voice.
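As a toy illustration of that idea – collect transcripts, identify the patterns, generate new text in the same style – here is a deliberately crude sketch using a simple word-level chain. The example transcripts are invented, and a real attacker would prompt or fine-tune a large language model on thousands of genuine interviews rather than use anything this basic, but the principle of “count the patterns, then reuse them” is the same.

```python
# A toy "style learner": count which words tend to follow which in some transcripts,
# then stitch together new text from those habitual patterns. The transcripts below
# are invented examples, not real quotes.
import random
from collections import defaultdict

transcripts = [
    "honestly i think the fans are the best part of every tour honestly",
    "i think the new record is honestly the most personal thing i have made",
]

# Build a simple bigram table: word -> list of words observed to follow it.
follows = defaultdict(list)
for line in transcripts:
    words = line.split()
    for current, nxt in zip(words, words[1:]):
        follows[current].append(nxt)

def generate(start: str, length: int = 12) -> str:
    """Walk the chain, falling back to a random known word at dead ends."""
    word, output = start, [start]
    for _ in range(length - 1):
        options = follows.get(word)
        word = random.choice(options) if options else random.choice(list(follows))
        output.append(word)
    return " ".join(output)

random.seed(1)
print(generate("i"))  # a new line stitched from the speaker's habitual word patterns
```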
And yet, you’re still not done. Remember, we wanted to not just create a convincing spoof, but one that could hold a conversation with a human and fool them. For this, we need a crime, and for a minute, you need to be a criminal. Here’s the con: You want to spend a week in a top tier, supremely luxurious hotel but you don’t have anything like the money. You figure if Taylor Swift called the hotel, they’d let her stay in their finest presidential suite for free. You hatch a plan. You’re going to synthesise her voice, call the hotel, and bag yourself a first-class ticket to luxury. You can predict that when the receptionist answers, they’ll probably ask how they can help, so you line up a pre-typed answer saying hi, asking to stay, hoping to keep it all low profile. But the receptionist is a little sceptical. Taylor Swift? The Taylor Swift? The Taylor Swift who’s in Edinburgh right now as part of the latest concert tour? You freeze. You hadn’t planned on this. The receptionist is actually wrong but you can’t just ignore him. You have to type your response swiftly but also very quietly, and naturally you make a typo and have to fix it. Meanwhile the gap has spun out too long and the receptionist is saying, “Hello? Hello…?” You send the answer. That tour ended a few weeks ago. You’re tired. You really need a break away from the spotlights. The receptionist seems somewhat reassured. What a pleasure to be speaking to the real Taylor Swift on the phone! It’s very exciting! He has a cousin who is the biggest fan! Derek. Have you heard of Derek? Derek sends you lots of emails… You start typing again. No. No you haven’t heard of Derek. You’re sorry. There are a lot of Dereks out there. But again, it’s taking too long, and the receptionist is getting suspicious again. What’s the typing sound? Why are the gaps so big? Is this a scam?
And it’s over.
All this to say, latency is a major issue. Even if you used AI to convert the receptionist’s voice to text, had it construct a meaningful response as swiftly as it could, and then promptly generated that as speech, the gap would still be too big. In conversational turns our sensitivity to latency is extraordinary. We’re regularly leaving gaps of fractions of a second. Anything in the one second region is considered a slight delay. Two seconds is a long delay. Three seconds or more and we start to think that there’s a major miscommunication problem. Never mind that even if you could fix this timing issue, most if not all of our generative LLMs at the moment – the AIs that spit out convincing chunks of text – are trained on writing. They’re not trained on speech that has been transcribed. And that is also important, because it means that we’re missing all kinds of things I mentioned before – disfluencies, lags, breathing, sniffles, coughs, false-starts, repetitions. And that’s before we get into the fact that we don’t speak the way that we write, grammatically, so even if you artificially inserted those things – some false-starts and sniffs – if the grammar is still the grammar of writing and not of speech, you still have traces of text left in the voice. And this podcast is a great example of that. When I write these episodes I insert speech-like features, but fundamentally this is text read out loud.
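To make the latency half of that problem concrete, here is a minimal back-of-the-envelope sketch of the round trip a live voice-spoofing pipeline would need on every single conversational turn. The stage timings are invented, illustrative numbers rather than measurements of any real system, and the thresholds simply mirror the rough tolerances described above.

```python
# An illustrative latency budget for a live "deepfake caller": transcribe the other
# person, generate a reply with an LLM, synthesise it as speech, send it back.
# All numbers are assumptions for the sake of the sketch, not benchmarks.
STAGE_LATENCY_S = {
    "transcribe the caller's last turn": 0.6,
    "generate a reply with an LLM": 1.5,
    "synthesise the reply as speech": 0.8,
    "network / audio buffering": 0.3,
}

def classify_gap(gap_s: float) -> str:
    """Rough conversational tolerances, as described in the episode."""
    if gap_s < 0.5:
        return "normal turn-taking gap"
    if gap_s < 1.5:
        return "slight delay"
    if gap_s < 3.0:
        return "long delay"
    return "major miscommunication territory"

total = sum(STAGE_LATENCY_S.values())
for stage, seconds in STAGE_LATENCY_S.items():
    print(f"{stage:<40} {seconds:.1f}s")
print(f"{'total gap before the bot speaks':<40} {total:.1f}s -> {classify_gap(total)}")
```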
That might make you think that actually, the risks around spoofed voices being used for scams are therefore overstated. And to that I would say, not so much. The field is leaping forward swiftly, and not every scam requires a live conversation. Plenty of fraud is committed using voice notes and voicemails. In such cases there’s no need for interaction so a lot of the complexity is stripped away. And as the various speech synthesisers improve, our ability to distinguish them from human speech is swiftly diminishing.
And that takes us neatly onto Bot or Not.
Bot or Not
Bot or Not is a quiz we came up with. By we, I mean me and Dr Georgina Brown with the help of our researchers Amy Dixon and Hope McVean. On the surface, each version is just a bit of fun. On the one hand, can you tell real hotel reviews from AI-generated ones? If you want to play that one, head to the blog and you’ll find a link to it. And on the other hand, can you tell spoofed voices from real humans? I probably don’t need to explain that, however fun these are on the surface, each one is actually addressing a very serious problem.
Generative AI that produces text can be used for good, certainly, and I talk about that at the end because we’ll need it, but it can also be used for creating rivers of disinformation that can lead to real offline harms. It can be used for synthesising the style of someone to scam their loved ones. It can be used for suggesting novel ways of committing crimes or cleaning up after them. And of course, it can be used for generating millions of fake reviews to con unsuspecting customers into parting with their hard-earned money for inferior or non-existent products or services. Think about it. You can run a giant one-time scam like Arup and get away with US$26m but with all the major investigative bodies of the world screaming down the highway after you. Or you can run a US$100 scam 260,000 times, get away with the same amount of money, and probably never even make it onto a local police force’s radar. And arguably, if you ran this sort of con right, you could make substantially more money even than that.
Of course, fake reviews have existed as long as review systems have existed. Previously you just bought them from humans. AI simply made it quick and free. By contrast, the criminal use of AI generated voices to pull off scams of any scale is, relatively speaking, still in its infancy, but even so, we’re now at a point where instead of answering a phone or listening to a voicemail and mentally asking ourselves, “Who is this person?” we’ve arrived at a point where we might be better to ask ourselves, “Is this a person?”
So, let’s give it a go. I’m going to play you five very short samples. They may all be real individuals. They may all be robot impostors. They may be any mix in-between. You have a simple job – decide whether each sample is bot, or not. I’ll leave a long enough gap between each so that you can pause and think, and then I’ll do the reveal a few seconds after the end. And if you think that it’s unfair because they’re so short, it’s worth noting that these are the same length as, if not considerably longer than, that key passphrase, “My voice is my password”.
Are you ready? Off we go.
Sample 1:
Sample 2:
Sample 3:
Sample 4:
Sample 5:
Alright, if you want to skip back and have another listen, now’s your chance, but if you feel like you’re prepared for the answer, then for those reading the blog post, the final answer is at the very bottom.
New biometrics please
As that probably showed, we shouldn’t be complacent around the potential uses of spoofed voices. In fact, synthesised voice tech has improved at such a pace that it poses significant global problems for how we legitimately identify each other. But it’s not just an interpersonal issue around validating that the friend calling you from a strange number really is your friend. Plenty of security systems use voice identification – banks, utility companies, building access points, and more. A common version of this, as I hinted at above, is using the phrase, “My voice is my password” as a way to access secure information and systems. (Incidentally, there’s a whole argument to be had around whether a voice is a biometric anyway, but let’s just leave that one for now. It’s used as one in these contexts and that’s the problem.)
To simplify a very complex matter, biometrics turn some sort of consistent reading from you into a long string. A hash if you like. That could be your voice, your fingerprint, your iris, your face, the veins in the back of your hand, even your ear shape. It doesn’t especially matter as long as it’s some part of you that’s unique versus the rest of the population, it’s relatively exhaustive (in other words, this is pretty much all of it), and it’s relatively unchanging.
Anyway, biometrics are a way of turning these various reasonably stable readings into hashes. These hashes are stored in computers and then when you next try to access your bank, you reproduce that biometric, it gets hashed again, the one you’ve just submitted is compared with the one that’s on file, and if they match, you’re allowed in. But there’s a rather chilling problem here. Firstly, the owners of those hashes tend to be corporations, and honestly, corporations don’t have glowing track records when it comes to ethical behaviour. Secondly, corporations routinely get hacked and their data gets leaked everywhere. Or worse, it doesn’t get leaked, because whoever took it knew exactly what they wanted from it and isn’t up for some sort of Robin Hood style sharing of the wealth. Thirdly, all profit-making entities are extremely strongly motivated to not disclose when they have experienced breaches because this almost always has an enormous impact on their profit margins as customers ditch them for supposedly safer alternatives. And fourthly, most important of all, if that corporation has been hacked, and the biometrics databases have been copied, whether you find out or not, someone now owns copies of your biometrics. Your voice. Your fingerprints. Your iris. Your face. Whatever was submitted. And now they have those hashes, they can use them to get into any other systems that are protected using the same.
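For the technically curious, here is a minimal sketch of that enrol-then-compare model, following the episode’s own simplification that a biometric reading becomes “a long string – a hash if you like”. Real deployments do fuzzy template matching rather than exact cryptographic hashing, but the structural problem described above is the same.

```python
# A minimal sketch of enrol-then-verify, using an exact hash purely to mirror the
# episode's simplification. Real biometric systems compare fuzzy templates instead.
import hashlib

enrolled = {}  # what the bank/corporation stores: user -> hash of the biometric reading

def enrol(user: str, biometric_reading: bytes) -> None:
    enrolled[user] = hashlib.sha256(biometric_reading).hexdigest()

def verify(user: str, biometric_reading: bytes) -> bool:
    """Hash the fresh reading and compare it with the one on file."""
    return enrolled.get(user) == hashlib.sha256(biometric_reading).hexdigest()

enrol("jiang", b"my voice is my password - voiceprint bytes")
print(verify("jiang", b"my voice is my password - voiceprint bytes"))  # True: let in
print(verify("jiang", b"someone else's voiceprint bytes"))             # False: refused

# The chilling part: if the enrolled table leaks, whoever holds it can replay the
# reading (or the hash itself) against any other system that trusts the same biometric.
```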
In different and far fewer words, as a system admin, if a user ID and password database is compromised, I could automatically issue everyone with new login details, but if a biometric database is compromised, I can’t issue you with new fingerprints. Or eyeballs. Or faces. Or voices. That’s the issue with them being both relatively exhaustive and uniquely identifying. There’s no more of that thing to draw on, and it only ever belonged to you. Of course, if you just used one finger for your biometric there’s a chance you have nine more lives, and if you used one eyeball for your iris scan, you hopefully still have one left spare. But your face? Your voice? That’s it. The wolves in the wires got away with it, and there’s no telling what they might do with it.
Technological determinism
It’s not easy in a true crime podcast to end on a cheery note, but even for me, this is a particularly bleak one, so I’d like to just balance it up slightly. Our digital lives are being shaped by AI in lots of new and challenging ways, but some of those developments are good. In fact, not just good. They’re incredible.
Generative AI and synthesised voices present outstanding opportunities for people who have lost the ability to speak for whatever reason – throat cancers, head injuries, neurological pathologies, and so on – and giving those people back not just any voice, but their own voice can be life-changing. Similarly, for those who have never had a voice to begin with, these technological developments can offer a new quality of life that was simply impossible before.
Globally, AI translation technologies break down barriers, especially for the most vulnerable populations among us. Once, communication between speakers of different languages required the investment of learning that language or the help of an interpreter – a luxury most of us can’t afford in either time or money. Especially for displaced adults and their children, this is invaluable in allowing them to join their new communities, understand neighbours and teachers, and feel welcomed.
AI tools can do much more besides. They can handle disaster responses by rapidly sending out emergency alerts in multiple languages. They can generate really nice audiobooks and narration of content, which is an especial benefit for people with reading difficulties or who are constrained by time. They can provide mental health support and meditation sessions and guidance for people struggling with anxiety or depression. They can not only subtitle and caption videos, but they can do so in lots of languages, making them far more accessible than this sort of content has ever been. They can create schedules and send reminders and read stories and play games and provide news for people who are isolated, in care homes, or who need additional social support. And that’s just the beginning. No surprise, this episode has focussed on the dark side of AI, but as ever, the technology itself is totally agnostic. It’s what we choose to do with it that counts, and it’s those choices that reveal just how human we really are.
Outro
The episode was researched, fact-checked, narrated, and produced by me, Professor Claire Hardaker. However, this work wouldn’t exist in its current form without the prior efforts of many others. You can find acknowledgements and references for those people at the blog (here!). Also here you can find data, links, articles, pictures, older cases, and more besides.
The address for the blog is wp.lancs.ac.uk/enclair. And you can follow the podcast on Twitter at _enclair. Or if you like, you can follow me on Twitter at DrClaireH.
Whither the bots?
Four of the samples were bots. One was a human. If you want to go back and have another go to see if you can find the human, now is your chance. Don’t read any further.
Otherwise, the absolutely final reveal is that the human was the fifth sample.
Uncanny, no?