Over the past few weeks, I’ve gradually been looking into setting up and running my own Twitter bot, so I’d like to introduce you to PyClaireH, my very first digital clone. She may be slightly… erm… sweary. In fact, this is probably a pretty close insight into what I’d be like if I drunk-tweeted.
What’s a bot?
A bot is, usually, simple software doing a simple job. Some bots are nice and some are less so. The good ones tend to be playing simulated characters in video games, producing art, crawling the internet for data, putting bids in on eBay for you, providing answers to customer questions, or holding a (hopefully convincing) conversation with you in some way. These latter “chatty” types are informally called chatbots. Meanwhile, the malicious bots are busily exploiting vulnerabilities in computer systems, crashing servers with artificially high traffic, sending out torrents of spam or abuse, scraping data they’re not allowed to collect, and maliciously impersonating people.
What kind of bot is PyClaireH?
Well, that’s an interesting question. PyClaireH (and, in the future, hopefully, RyClaireH) are chatbots who both produce communication and respond to certain linguistic prompts. However, PyClaireH is also intended to impersonate me. That’s hardly malicious in itself, of course, but the technique does have malicious applications, and that’s my interest: in the online arms race of fraud and counter-fraud, how well can bots pretend to be us, or more accurately, to be specific instances of us? In an experimental setting, for instance, could PyClaireH ever fool someone into believing that she really was me? And can we identify the linguistic tells that distinguish the ghosts in the machine from the humans?
On a more technical note, PyClaireH is written in the programming language Python (hence the Py), and she’s primarily powered by the markovbot.py library. (See What are your future plans below for more on an in-development bot, RyClaireH, who will be written in Ruby.)
Why would you even do this?
It might seem that, for a linguist, coding software to spout stuff randomly onto the internet is an odd pastime, and yes, such a project may well tread lightly into the grey areas of interests that are a little more niche, but it does also have quite a number of overlaps with my research.
Let’s start with corpus linguistics. Any half-decent bot will need to produce realistic-seeming language, and for that, it needs to draw on a reservoir of linguistic data – a corpus. But it’s more interesting than that. Computer languages and programs are objective, logical, binary, and discrete. Our languages and thought patterns are subjective, inconsistent, vague, and messy. Whilst computers embrace technical precision and overt clarity, we throw ourselves deep into the fuzzy waters of dreamy speculation and half-finished allusion. We hint, joke, imply, cross-reference, and make all kinds of context-dependent contributions in conversation that a computer is simply unable to access. In short, computers are digital, and humans are analogue. Unsurprisingly, then, since computers don’t really understand human language in a meaningful way, they’re also not very good at producing convincing language, let alone holding conversations. (More on that shortly.) At the simplest level, we could let a computer pick words at random and string them together to make “sentences”, but this output would almost certainly be gibberish. Because the bot can’t really understand what it’s reading or writing, we need to give it other strategies, besides comprehension, that maximise the chance that its output will be coherent.
In this particular case, we’re talking about using a Markov chain. In extremely simplified terms, the markovbot powering PyClaireH works out the probability that two words like Then one will be followed by a third word like day, that one day will be followed by I, and so on until it’s completed a sentence. As it moves its window of attention forward, the markovbot script only ever keeps two words (e.g. Then one) in mind whilst searching for a new one. As that new word is added onto the end, the window moves forward to include it, and in so doing, it forgets the word at the start (e.g. one day (I)… day I (will)… I will (see)… will see (you)… see you (now)… you now (.)). As a result, the bot has a very short attention span, but the benefit is that it doesn’t have to hold quite so much data in the computer’s active memory, which would slow it down significantly if it were working with a very large corpus. Thus, from its ruminations, we get our whole sentence, e.g. Then one day I will see you now.
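To make that concrete, here’s a minimal sketch of the idea in plain Python – a toy illustration of an order-two Markov chain, not the markovbot library itself, and the little corpus is invented:

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each two-word window onto the words observed to follow it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, max_words=10, seed=None):
    """Slide the window forward one word at a time; picking from the list
    of observed followers reproduces the corpus's transition probabilities."""
    window = seed if seed is not None else random.choice(list(chain))
    out = list(window)
    for _ in range(max_words):
        followers = chain.get(tuple(out[-2:]))
        if not followers:  # no observed continuation: stop here
            break
        out.append(random.choice(followers))
    return " ".join(out)

corpus = ("Then one day I will see you . "
          "Then one day I will ride . "
          "One day I will see the sea .")
chain = build_chain(corpus)
print(generate(chain, seed=("Then", "one")))
```

Because the chain only ever looks at the last two words, it can happily splice the start of one sentence onto the end of another, which is exactly how hybrids like Then one day I will see you now arise.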
This also takes us into the area of the Turing Test, in which we try to program software to produce behaviour, and particularly language, that convinces us that we are dealing with a real human. And, as you might imagine, that has some pretty strong crossovers with my research on deception and interest in authorship analysis. When creating PyClaireH, I didn’t want a bot that produced just any old convincing language – though that’s an extremely ambitious goal by itself. I wanted a clone of myself that, as far as possible, tweeted like me. PyClaireH is written for fun, but other bots have rather more nefarious purposes. Understanding how we can automatically and convincingly replicate individual authorial style has implications for work in areas like (spear)phishing, mass market frauds, dating scams, and the like, where initial baiting may be done through automated means, as unwitting victims are lured into “conversations” with naught but shadows in the bits of the internet. What’s unique for the study of authorship attribution in a case like this, too, is that unlike Criminal A attempting to replicate Victim B’s writing style using a generalised, vague, imperfect knowledge of B’s style, a Markov bot is using real snippets of language, along with accurate statistical probabilities. Whether this makes bots better or worse at spoofing than humans is, of course, up for testing with PyClaireH.
How did you bring PyClaireH into existence?
Let’s start with a confession: I am an extremely amateur coder. It’s a stretch even to apply the word “coder” to what I do. My first step was to find a good, easy tutorial and do as I was told. After a couple of evenings of tinkering, and more than the odd hiccup, I had my bot up and running. Rather than talking through the same tutorial in different words, though, I’ll focus instead on the interesting decisions to be made in setting up the bot.
The first major issue is the data. The bot produces its tweets based on the data fed into it. (See What are your future plans below for more on this.) These kinds of bots are what they eat: they need to read enough sensible input to have a fighting chance of writing even halfway-sensible output. I initially figured that I would have literally millions of my own words lying about, and made a quick list of places I could easily acquire reasonable amounts of data:
- Sent emails
- Real (DrClaireH) tweets
- Online messenger chat logs
- Facebook posts
- Text messages
- Academic writing
- Academic feedback on student work
The trouble is, many of these genres vary quite drastically across numerous dimensions: purpose, intended audience, formality, privacy, accessibility, and so on. It would be rather awkward if the bot tweeted a snippet from a very personal text, for instance, and since I would have to carefully screen thousands, if not tens of thousands of texts, emails, Facebook posts, and diary entries, it was just easier to exclude them. In fact, after weighing up the pros and cons, I was left with online chat logs (carefully checked over, but who knows what may still crop up), tweets, and solo-authored academic writing (stripped of quotes, examples, and bibliographies). This took my corpus to 318,230 words. It does mean, though, that PyClaireH is going to be an odd hybrid of scholar and sailor – moments of intellectual pondering interspersed with emoticon-studded profanities.
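As a rough illustration of that screening-and-pooling step, here’s a toy sketch. Nothing in it comes from the real pipeline – the example lines and the blocklist are invented – and a string filter like this is only an aid to the careful manual checking described above, not a replacement for it:

```python
def screen(text, blocklist):
    """Drop any line containing a blocklisted string - a crude aid to
    manual screening, not a substitute for it."""
    kept = [line for line in text.splitlines()
            if not any(term in line.lower() for term in blocklist)]
    return "\n".join(kept)

def pool(texts):
    """Join the screened sources into one training text and return it
    along with a rough word count."""
    combined = "\n".join(t.strip() for t in texts)
    return combined, len(combined.split())

chat = screen("brb :)\nmy pin is 1234\nthat talk was class", ["pin", "password"])
tweets = screen("corpus linguistics is ace\nnew paper out today", [])
corpus, n_words = pool([chat, tweets])
print(n_words)
```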
Once the data was chosen, cleaned, and stuck into the relevant folder, the next interesting decision was what I wanted the bot to actually do. It can, for instance, tweet gently into the ether, sharing its digital wisdom with whoever is willing to listen. You can also have it respond to certain words or phrases. It might, for instance, find and reply to all tweets containing the word “horse”. More sophisticated still, you can set it so that if those tweets contain particular keywords like “gallop” or “expensive”, it will use those as linguistic seeds from which to generate a response, in an effort to make its reply seem relevant. You can also ensure that it always includes certain prefixes (e.g. @DrClaireH) or suffixes (e.g. #itLIVES). At the moment, PyClaireH is set up to tweet randomly and to respond to anyone who tweets her directly. However, she’ll only respond once or twice, so that she doesn’t get too spammy, and sometimes she just ignores people anyway. She has no prefixes or suffixes set, and I haven’t given her any keywords to model her responses on. Overall, she’s a free, if somewhat delirious, spirit.
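For the curious, the markovbot side of that setup looks roughly like the sketch below. This is reconstructed from memory of the library’s example script, so treat every function name and parameter as an assumption to check against markovbot’s own documentation; the credentials and file path are placeholders, which is also why this fragment can’t be run as-is:

```python
from markovbot import MarkovBot

# Placeholder credentials: these would come from a Twitter developer account.
cons_key = "..."
cons_secret = "..."
access_token = "..."
access_token_secret = "..."

bot = MarkovBot()
bot.read("pyclaireh_corpus.txt")  # placeholder path to the cleaned corpus

bot.twitter_login(cons_key, cons_secret, access_token, access_token_secret)

# Tweet gently into the ether every few hours; keywords, prefix, and
# suffix are all left unset, as described above.
bot.twitter_tweeting_start(days=0, hours=4, minutes=0,
                           keywords=None, prefix=None, suffix=None)

# Reply to anyone who mentions the bot directly, with a capped
# conversation depth so she doesn't get too spammy.
bot.twitter_autoreply_start("@PyClaireH", maxconvdepth=2)
```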
What are your future plans?
As I mentioned above, PyClaireH is written in Python, and is primarily powered by the markovbot Python library. She’s actually very simple, in many respects, so it’s impressive how well she does. As a result, I’m interested to see how well she performs versus another, more sophisticated Markov chain, so I’m currently developing RyClaireH, who will be written in Ruby, and based on this framework. Unlike PyClaireH, who works purely from her training corpus, RyClaireH will be able to draw on extensive, editable dictionaries of nouns and adjectives, so her tweets should be more original, going beyond parroting only the corpora she’s fed. In turn, though, this may dilute the style by introducing words I would never use. Overall, testing which one produces more “believable” language is not likely to be a scientific process, but it should be a fun one nonetheless. [Edited to add: I subsequently did that here.]
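Very crudely, the dictionary idea amounts to something like this hypothetical sketch, where slot-words in a generated line are swapped for random picks from editable word lists. The lists and the template are invented for illustration, and I’m keeping the sketch in Python even though RyClaireH herself will be in Ruby:

```python
import random

# Invented, editable word lists standing in for RyClaireH's dictionaries.
nouns = ["horse", "corpus", "tweet", "sailor"]
adjectives = ["delirious", "sweary", "fuzzy", "scholarly"]

def fill_slots(template):
    """Swap each ADJ/NOUN slot for a random dictionary entry, so the output
    can contain words that never appeared in the training corpus."""
    words = []
    for token in template.split():
        if token == "ADJ":
            words.append(random.choice(adjectives))
        elif token == "NOUN":
            words.append(random.choice(nouns))
        else:
            words.append(token)
    return " ".join(words)

print(fill_slots("what a ADJ NOUN you are"))
```

The trade-off noted above falls straight out of this design: the dictionaries buy originality, but every entry in them that I wouldn’t naturally use dilutes the authorial style being cloned.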
Another plan I have, which is rather less scientific and even more fun still, is to add different voices to PyClaireH’s training corpus. There’s no reason that she needs to be trained on a corpus of just my words. It would, in fact, be interesting to see how I would speak if I were mashed up with Shakespeare, or Taylor Swift, or Jane Austen, or Sheldon Cooper, or Jeremy Kyle transcripts, or Presidential speeches, or eighteenth-century legislation, or The Idiot’s Guide to Showjumping, or clickbait links, or the poetry of Sylvia Plath, or the BASIC programming language, or death metal lyrics, or tweets about Brexit, or Daily Mail headlines, or Daily Mail comments, or literally any part of the Daily Mail, or horror stories – wait, I’m repeating myself. Anyway, you get the idea. Finding about 300,000 words of another “voice” should be relatively straightforward and will be, if nothing else, an amusing foray into the world of artificial co-authorship synthesis.