Cloning humans with code: building a bot that speaks just like me

Over the past few weeks, I’ve been gradually setting up and running my own Twitter bot, so I’d like to introduce you to PyClaireH, my very first digital clone. She may be slightly… erm… sweary. In fact, this is probably a pretty close insight into what I’d be like if I drunk-tweeted.


What’s a bot?

A bot is, usually, simple software doing a simple job. Some bots are nice and some are less so. The good ones play simulated characters in video games, produce art, crawl the internet for data, put in bids on eBay for you, answer customer questions, or hold a (hopefully convincing) conversation with you in some way. These latter “chatty” types are informally called chatbots. Meanwhile, the malicious bots are busy exploiting vulnerabilities in computer systems, crashing servers with artificially high traffic, sending out torrents of spam or abuse, scraping data they’re not allowed to collect, and impersonating people.

What kind of bot is PyClaireH?

Well, that’s an interesting question. PyClaireH (and, in the future, hopefully, RyClaireH) is a chatbot who both produces communication and responds to certain linguistic prompts. However, PyClaireH is also intended to impersonate me. That’s hardly in the realm of the malicious, of course, but it does have malicious applications, and that’s my interest: in the online arms race of fraud and counter-fraud, how well can bots pretend to be us, or, more accurately, specific instances of us? In an experimental setting, for instance, could PyClaireH ever fool someone into believing that she really was me? And can we identify the linguistic tells that distinguish the ghosts in the machine from the humans?
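The post doesn’t say how PyClaireH actually generates her tweets, but one common minimal approach for this kind of style-mimicking bot is a word-level Markov chain trained on an archive of one’s own tweets. The sketch below is purely illustrative: the function names, the order-2 chain, and the plain-text corpus are all my assumptions, not PyClaireH’s real code.

```python
import random
from collections import defaultdict

def build_chain(corpus, order=2):
    """Map each run of `order` words to the words observed to follow it."""
    words = corpus.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain[key].append(words[i + order])
    return chain

def generate(chain, length=20, seed=None):
    """Walk the chain from a random starting key, emitting up to `length` words."""
    rng = random.Random(seed)
    key = rng.choice(list(chain.keys()))
    out = list(key)
    for _ in range(length - len(key)):
        followers = chain.get(tuple(out[-len(key):]))
        if not followers:
            break  # dead end: this word sequence only ever ended the corpus
        out.append(rng.choice(followers))
    return " ".join(out)
```

Trained on a real tweet archive, a chain like this reproduces a person’s characteristic word pairings while still producing novel (and often nonsensical) sentences, which is exactly where the “linguistic tells” question gets interesting.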

Oxford Dictionaries’ Word of the Year is “post-truth”, but what makes a newly coined word survive?

From the distribution of letters in a box of Cheez-It Scrabble crackers to the incorporation of new words into the dictionary, it seems that we are constantly fascinated by all aspects of language, and particularly by its newest developments. Today, the Oxford Dictionaries Word of the Year was announced: post-truth – “relating to or denoting circumstances in which objective facts are less influential in shaping public opinion than appeals to emotion and personal belief”.

Interestingly, this has links to ideas that have been around for quite some time. Centuries ago, Jonathan Swift (1667-1745) was lamenting that whilst falsehood flies, the truth comes limping after it. Similarly, for the modern era we have Brandolini’s law: “The amount of energy needed to refute bullshit is an order of magnitude bigger than to produce it.” From a political perspective, this lends itself to highly expedient game-playing. Misstatements and convenient omissions are today’s front-page attention-grabbing headlines, and retractions are tomorrow’s tiny, overlooked addenda. The benefits gained from a lie may vastly outweigh whatever consequences trail along in its wake. This kind of post-truth politics has even driven the rise of sites like FactCheck.org, PolitiFact, and Snopes as audiences increasingly recognise that they may not be getting a fair representation of issues.

To return to the main point of this post, however, I find these Words of the Year/Decade/Century events linguistically interesting for three reasons – how a word is born and flourishes (or not), the social importance of new words, and the method behind identifying them (corpus linguistics). And since I’ve had some media interest in this already today, I thought I’d write out my ideas more fully here.
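The corpus-linguistics side of spotting candidate words can be illustrated with a toy sketch: compare how often each word appears in an older corpus versus a newer one, and surface the words whose relative frequency has jumped. To be clear, this is my own simplified stand-in, not the lexicographers’ actual pipeline, which works over vastly larger corpora with lemmatisation and proper statistical measures such as log-likelihood.

```python
from collections import Counter

def emerging_words(old_corpus, new_corpus, min_new_count=2):
    """Rank words by how much more frequent they are in the new corpus.

    A toy illustration of corpus comparison: real pipelines use far
    larger corpora and more robust statistics than this ratio.
    """
    old_counts = Counter(old_corpus.lower().split())
    new_counts = Counter(new_corpus.lower().split())
    old_total = sum(old_counts.values()) or 1
    new_total = sum(new_counts.values()) or 1
    scored = {}
    for word, count in new_counts.items():
        if count < min_new_count:
            continue  # ignore one-off coinages
        old_rate = old_counts[word] / old_total
        new_rate = count / new_total
        # add-one smoothing so genuinely new words don't divide by zero
        scored[word] = new_rate / (old_rate + 1 / old_total)
    return sorted(scored, key=scored.get, reverse=True)
```

Even this crude ratio captures the basic intuition: a genuinely new word like “post-truth” scores highly because it has no history in the older corpus, while established words that merely stay popular do not.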