How to be a (recreational) forensic linguist

Yesterday (5 September 2018) the New York Times printed an op-ed by an anonymous source, entitled, “I Am Part of the Resistance Inside the Trump Administration. [I work for the president but like-minded colleagues and I have vowed to thwart parts of his agenda and his worst inclinations.]” Allegedly penned by a senior official in the Trump administration, possibly from within the White House itself, the piece outlines aspects of Trump’s behaviour in fairly unflattering terms, and describes some of the internal resistance being staged against him on a day-to-day basis.

Interestingly, rather than disputing the author’s claimed proximity, the response from Trump a few hours later was to encourage the author to resign. This rather lent credibility to the piece’s authenticity and arguably made it even more damaging than it would otherwise have been had Trump simply laughed it off as a wild story by some fabulist who had never been within a mile of him. The result was that, overnight, a whole slew of people suddenly became forensic linguists specialising in authorship attribution. It’s been fun to watch, and I’ve written this simple “how to be a forensic linguist” guide in response to the scramble to identify the author.

Worker, she twote: are the Amazombies real people, or is it all a PR stunt?

Over the past few days, TechCrunch, The Guardian, and others have been writing with some amusement about the inexplicable and rather silly little army of Amazon bot-drones that has sprung up on Twitter. Assuming they don’t get deleted in a fit of embarrassment, you should be able to find those accounts here. The questions that sprang up across various media sources ranged from, “Is this elaborate counter-marketing?” to “Are these people really… real?”

That latter question naturally caught my full attention, so I decided to spend my Saturday afternoon addressing two research questions:

  • Are these tweets really being sent by a bunch of unique people?
  • Or are these all synthetic identities that are the work of a PR company employee or two?


The devil in the data: are women as aggressive online as men?

So, sometimes things grind along veeeeery slowly in the background, and eventually, finally, at long last, I get to them. This has been one of those slow-burn projects. To understand what’s going on, it’s worth briefly casting our eyes back to this post. Alternatively, if you want a very short overview, then here it is…

On the 26th of May, 2016, Demos thinktank’s Centre for the Analysis of Social Media (CASM) presented the results of an investigation into the use of misogynistic terms on Twitter to the House of Commons. This work was a collaboration between the thinktank Demos and the TAG laboratory at the University of Sussex, and the release of the report was timed to coincide with a political, cross-party campaign, Recl@im the Internet. Launched by Labour MP Yvette Cooper, the aim of Recl@im the Internet is described thus:

Missing Melania II: has Trump used the First Lady’s Twitter accounts in the past?

With her absence from Camp David this weekend, Melania Trump’s disappearance from public view continues into its 24th day, and as time passes, the number of theories about it is increasing. Twitter is an excellent source for these if you’re interested, but the main ones appear to be that she is secretly giving evidence to Mueller; that she is back in New York with her son; and that she is busy instigating divorce proceedings.

However, whilst all of that foments away in the background, something quite different caught my eye – the highlighted tweet below from @MELANIATRUMP back on the 18th of October 2013:

@realDonaldTrump: I love watching the dishonest writers @NYMag suffer the magazine’s failure.

@DanAmira: Your wife is waiting for you to die (…)

@MELANIATRUMP: @DanAmira @NYMag Only a dumb “animal” would say that! You should be fired from your failing magazine! (Link)

Intriguingly, in 2016, The New Yorker noted of this very tweet that, “One couldn’t help but detect Donald’s influence when Melania fired off [this] reply”, and plenty of users replied with similar questions about precisely who might have written it.

Missing Melania: is the First Lady’s Twitter account being used by Trump?

As we speak (1st June 2018), interesting conspiracies are breaking across some news networks about Melania Trump. The First Lady has allegedly not been seen in public for twenty days, and according to White House sources, she is recovering from an operation. However, the increasing concern for her welfare was heightened by the latest (as of this moment) tweet from her FLOTUS account on Wednesday 30th May, which reads:

I see the media is working overtime speculating where I am & what I’m doing.  Rest assured, I’m here at the @WhiteHouse w my family, feeling great, & working hard on behalf of children & the American people! (Link)

The reaction by some to this tweet was immediate scepticism that she had written it. Some felt that it was authored by Trump, and others demanded to see her holding a current newspaper. So, is that tweet truly unusual for the First Lady? Or is this a case of mass confirmation bias, where many people who already hold Donald Trump in contempt have simply found another possible avenue of attack? I thought I’d have a look at it out of curiosity and see what I could see.

The savage garden of social media: London’s violent crime surge

Over the past few days, the media has been reporting on a “surge” in violent crime in the capital. Figures such as Met Commissioner Cressida Dick and Home Secretary Amber Rudd have framed this fluctuation with a narrative that social media is playing a key role in arguments between young people, particularly those in gangs, by allowing them to react quickly to online grievances with offline violence. For instance, Met Commissioner Dick claims that “sites and apps such as YouTube, Snapchat and Instagram are partially to blame for the bloodshed” (source). As someone who researches online aggression, I find this notion particularly interesting.

The Ghost(writer) Busters: Can machine learning help in the fight against contract cheating?

Yesterday morning (Mon 12 Feb 18) the Times Higher published an article entitled “Caution over Turnitin’s role in fight against essay mills” (tagline: New software to identify ghost-written essays welcomed, but experts say it is not a panacea). To summarise, the article describes how, later this year, Turnitin will be releasing their new tool, Authorship Investigation, which “will use machine learning algorithms and forensic linguistic analysis to detect major differences in students’ writing style between papers”.

PyClaireH versus RyClaireH: which bot wins the imitation game?

I’ve been meaning to write about my newest bot for a while, but finally here we are. Welcome, RyClaireH, to the fold. Yay! In case you’re wondering who the other bot is, you can find out all about PyClaireH in this blog post. For those already familiar with Py, the easiest way to describe Ry is through her differences. Where Py is a Markov chain (more detail on this), Ry is a much more sophisticated pseudo-Markov chain. Py essentially uses word-level probabilities to construct sentences based on the likelihood that one word will occur after another. On the other hand, Ry uses NLP (natural language processing). From this toolkit, she tags each word in the data for its part of speech (e.g. noun, verb, adverb, adjective, etc.) in advance, and also uses dictionaries of nouns and adjectives to help her formulate more syntactically coherent tweets. In a nutshell, Py is a very simple bot that works at the level of the word (lexis), whereas Ry works with both words (lexis) and grammar (syntax). Neither bot has any help at the higher levels of language, such as with the meanings of words (semantics), and certainly not with the meanings communicated beyond the word (pragmatics). Arguably a semantically “aware” bot is possible – semantic tagging by something like the USAS tagger could provide a nice way in, but to incorporate that into the model requires a level of programming competence I don’t possess.
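To make the word-level idea concrete, here is a minimal sketch of a Markov-chain text generator in Python. It is not PyClaireH's actual code: the function names, the toy corpus, and the seeded random generator are all assumptions made purely for illustration. It only shows the general technique of choosing each next word according to how often it followed the current word in the source text.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words observed immediately after it.

    Repeats are kept on purpose, so a word that follows another more often
    is proportionally more likely to be picked later.
    """
    chain = defaultdict(list)
    words = text.split()
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def generate(chain, start, max_words=12, rng=None):
    """Walk the chain from `start`, stopping at a dead end or at max_words."""
    rng = rng or random.Random()
    word = start
    output = [word]
    while len(output) < max_words and word in chain:
        word = rng.choice(chain[word])  # frequency-weighted via the repeats
        output.append(word)
    return " ".join(output)


# Hypothetical toy corpus; a real bot would be fed an archive of tweets.
corpus = "the bot writes tweets and the bot reads tweets and the human sighs"
chain = build_chain(corpus)
print(generate(chain, "the", rng=random.Random(0)))
```

Because the model only ever looks one word back, its output tends to be locally plausible but globally aimless, which is the gap that part-of-speech tagging is meant to narrow.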

As I mentioned in my Py post a few months ago, one of my interests was in whether more coding and extra tools would help Ry to be more convincing than Py, or if it would actually hinder her, so this post is a non-serious, barely-scientific, entirely-amusement-driven shoot-out between the two. You can (and indeed should) pick about three hundred holes in the general methods and rigour of this, but I refuse to let that stop me.

So, onwards.

Lies, damned lies, and slippery surveys

On the afternoon of Thursday 16th February, perhaps in light of a week that has been even rockier than usual, current US President Donald Trump held a controversial press conference. Whilst this was, in itself, newsworthy for a variety of reasons, there was an unexpected plot-twist. Trump followed up by sending out a mailshot to his Republican supporters…

In case you can’t read the image for some reason, the email opened with the ongoing narrative that “the mainstream media” (this damning moniker seems to exclude pro-Trump agencies such as Fox News, incidentally) is carrying out hit jobs, attacks, deceptions, and so forth, specifically against Trump and the Republicans. As part of the resistance to this, recipients were encouraged to complete a “Mainstream Media Accountability Survey” (PDF for posterity).

Very quickly, that terribly biased, pesky mainstream media noted that this survey was, itself, rather biased. In fairness to both sides, claiming that a survey is biased is an easy win. Every survey and questionnaire contains bias right from the start – the goal of the survey, the topic choice, the time of asking, the person who is asked, the person doing the asking – all are the product of intentional choice and have an ability to alter the results, but the point is to limit and control for all of these factors as much as possible. More importantly, it’s an easy claim to make because it can be surprisingly difficult to pick out the exact features that are creating larger degrees of bias than we would consider acceptable. You might read the survey and get an intuitive sense that it isn’t playing fairly, but it’s helpful to be able to specifically identify the very methods that are being employed to push you one way or another. And that’s what I do in this post.

Cloning humans with code: building a bot that speaks just like me

Over the past few weeks, I’ve been gradually looking into setting up and running my own Twitter bot, so I’d like to introduce you to PyClaireH, my very first digital clone. She may be slightly… erm… sweary. In fact, this is probably a pretty close insight into what I’d be like if I drunk-tweeted.


What’s a bot?

A bot is, usually, simple software doing a simple job. Some bots are nice and some are less so. The good ones tend to be playing simulated characters in video games, producing art, crawling the internet for data, putting bids in on eBay for you, providing answers to customer questions, or holding a (hopefully convincing) conversation with you in some way. These latter “chatty” types are informally called chatbots. Meanwhile, the malicious bots are busily exploiting vulnerabilities in computer systems, crashing servers with artificially high traffic, sending out torrents of spam or abuse, scraping data they’re not allowed to collect, and maliciously impersonating people.

What kind of bot is PyClaireH?

Well, that’s an interesting question. PyClaireH (and, in the future, hopefully, RyClaireH) are chatbots who both produce communication and respond to certain linguistic prompts. However, PyClaireH is also intended to impersonate me. That’s hardly in the realms of malicious, of course, but it does have malicious applications, and that’s my interest: in the online arms race of fraud and counter-fraud, how well can bots pretend to be us, or more accurately, specific instances of us? In an experimental setting, for instance, could PyClaireH ever fool someone into believing that she really was me? And can we identify the linguistic tells that distinguish the ghosts in the machine from the humans?