How to be a (recreational) forensic linguist

Yesterday (5 September 2018) the New York Times printed an op-ed by an anonymous source, entitled, “I Am Part of the Resistance Inside the Trump Administration. [I work for the president but like-minded colleagues and I have vowed to thwart parts of his agenda and his worst inclinations.]” Allegedly penned by a senior official in the Trump administration, possibly from within the White House itself, the piece outlines aspects of Trump’s behaviour in fairly unflattering terms, and describes some of the internal resistance being staged against him on a day-to-day basis.

Interestingly, rather than deny that the author was anywhere near him, Trump responded a few hours later by encouraging the author to resign. This rather lent credibility to the piece’s authenticity and arguably made it even more damaging than it would have been had Trump simply laughed it off as a wild story by some fabulist who had never been within a mile of him. The result was that, overnight, a whole slew of people suddenly became forensic linguists specialising in authorship attribution. It’s been fun to watch and I’ve written this simple “how to be a forensic linguist” guide in response to the scramble to identify the author.

The most crucial point I’m going to make is that this how-to guide is light-hearted, incomplete, and very, very simplified. You are not going to learn how to be a fully qualified, courtroom-ready, expert-evidence-giving forensic linguist from a single blog post, BUT if you want to have a bit of fun and maybe even stand an extremely slim chance at cracking the mystery, then the following five steps will help you make a better job of it.

So, how do you go about investigating the authorship of an anonymous text?

Step 1 – choose your prime suspects

In forensic linguistics (hereafter FL) we sometimes call this your “closed set”. Why do we start with a small pool of suspects before gathering any evidence? Because we have to. FL is awesome, but it is not a fingerprint or DNA test. It is not going to pick one person out of seven billion, nor even out of seven hundred. You have to come up with a shortlist – and I do mean short – of suspects simply to make the task manageable. Luckily, in this particular case, not many people qualify as being “a Trump appointee”, and from the evidence available in the op-ed it’s probably relatively straightforward to draw up a list of those who would have the insider knowledge to write it. However, for every person added to the suspect list, the amount of work and analysis required goes up substantially.

Step 2 – collect evidence

Firstly, you have your questioned text – the op-ed itself. So far so good. But now you need something to compare it to for similarities and differences, and so for each suspect on your list, you now need to gather data. Sounds simple – scoop up some tweets, get a bit of audio, find the transcript of some speech or other, right? Nope. You need data that is as comparable as possible across as many dimensions as possible. Seems easy enough, but in reality, this is tricky. What are the dimensions I refer to? These are things like…

  • mode: writing (as opposed to transcribed speech)
  • register: formal
  • genre: press article
  • subgenre: op-ed
  • topic: Trump, White House, politics
  • publisher: the NYT or similar
  • target audience: Democrats
  • creation date: September 2018
  • creation method: ??? (laptop? iPad? pen? dictated? etc.)
  • purpose: persuade, inform
  • …and so on.

In practice, we don’t have scandalous exposé op-eds from all the prime suspects everyone is pointing at – Pence, Sanders, Mattis, et al. – and if the author is actually a backroom staffer who doesn’t put out official texts (or doesn’t do so with their name on them) then we may not have any data for them at all. Where data does exist but isn’t perfect, we have to go to the next nearest text that ticks as many of the above as possible, bearing in mind that every step away from the ideal is an increased chance of your ultimate conclusion being wrong.

Fudging some of these dimensions will matter more than others. Choosing a transcription of, say, press conference answers by one of the suspects will ruin several linguistic levels of analysis. For obvious reasons you couldn’t judge their spelling or punctuation use, but non-linguists are unlikely to realise how radically different the syntax of spontaneous speech is versus that of writing. Note that I said spontaneous speech. A speech-speech, like a formal address to Congress, will typically be written beforehand and read out from screens (typically – looking at you here Donald) so it’s more like writing, right? Well yes, but a different problem crops up in the form of… *drumroll*… speechwriters. Very senior figures in the administration will often read out speeches that they are very unlikely to have penned themselves, so including data of that nature will really mess up the purity of your evidence.

You might protest that, hey, language is language, and language by Trump is language by Trump, how much can it really matter? Well, I’m afraid you’re going to have to take my word for it. It really does. Simple example – I am writing about the same topics on this blog that I write about in my research, and that I lecture about, but the way I produce language about FL in formal peer-reviewed publications and the way I speak about it in class are about as different as you can imagine. Similarly, I like to read, and to write book reviews about what I read. I write an extremely candid (*cough* sweary *cough*) review for my closed-group reading club, then I sanitise and edit it heavily for the general public reviews on here and Goodreads. That affects everything from lexis and syntax to orthography and pragmatics (see below) even though it’s exactly the same genre and on exactly the same topic for exactly the same purpose. The major factor that has changed? Presumed audiences.

Anyway, spend the next few hours/weeks/months agonising over the perfect data for each of your suspects until you have a corpus for each that (a) you feel confident they produced, (b) is as comparable to the questioned text as you can feasibly make it, and (c) is as big as possible. Done? Excellent. You may now pass go and lose another 300 hours/weeks/months on the next step.

Step 3 – analyse your questioned text

Now you can get stuck into your op-ed. Just what are you looking for? Well, the magic of linguistics is that the waters of communication run very deep. Or, to torture another metaphor, language is a cake of many layers, so you could focus on any, or all, of the following:

  • Orthography (spelling)
    • Problems: throughout the copyediting process, any typos and creative spellings will have been cleaned up
  • Punctuation
    • This is a more promising area. People may watch their words but they rarely notice their commas and full stops…
  • Lexis (vocabulary)
    • Does it use the vocabulary of an adult or child? An expert or lay person? Does it use slang or specialist terminology? Are there rare words? Are common words being used in unusual ways? Does it under- or over-use closed class words (if, and, but, of, to, the, etc.)?
    • Problems: open-class words are easily manipulated. We notice words – especially unusual words (ahem, lodestar), so if you were trying to spoof someone, this is the amateur’s first line of attack…
  • Syntax (grammar)
    • Are the sentences simple, compound, or complex? Are they active or passive? Long or short? Interrogatives, declaratives, or imperatives? Are there lots of nested clauses? Is there a preference for certain tenses? Nominalisations? Adjectival strings? Adverbial phrases? Etc.
  • Semantics (meaning behind the words)
    • Is the choice of language emotional? Scientific? Does it drag in ideological baggage? Are the word choices ambiguous? Etc.
  • Pragmatics (meaning between the lines)
    • Is it bossy? Polite? Abrupt? Tentative? Sarcastic? Braggadocios? (See what I did there.)
    • Problems: this category can be extremely subjective
  • Style
    • Is it childish? High flown? Lyrical? Punchy?
    • Note that this category is an amalgamation of all of the above, plus a bit extra besides.

Okay, so now you know language is a lot more than just vocabulary. What next? Now, you need to go through the op-ed, and highlight any feature that is (a) distinctive, and (b) consistent.

Those are the two golden words in authorship attribution. What do they mean? Well, a distinctive word is, for instance, lodestar. Lodestar is an unusual lexical item in anyone’s vocabulary, and it now has some pretty strong ties to Pence. However, one occurrence does not consistency make. That’s precisely why rare words are not very helpful, and in fact can be distinctly unhelpful. They’re rare. And they stand out. And that makes them easy for a smart person who isn’t Pence to pop into an op-ed to keep people off their trail.

For the once-in-a-lifetime, holy grail, gold-standard feature that is extremely persuasive evidence, you really want a feature that (a) occurs only in your questioned text(s) – the op-ed in this case – and in only one of your suspects’ texts – the one who wrote the op-ed, (b) occurs as frequently in your questioned text as in that suspect’s texts, and (c) is not easily consciously manipulated. But wait. What if that feature is “lol”? And what if nine of your ten stuffy White House suspects don’t use “lol” because they’re stuffy White House people, but one does? See the problem? Just because Suspect 7 and the op-ed author both use “lol” doesn’t mean Suspect 7 wrote the op-ed. I have been known to use “lol” myself. So maybe I wrote the op-ed… Point being, truly gold-standard features also need to be relatively distinct versus a sensible, broader population. If you want to crack the nut of precisely what a sensible, broader population is, knock yourself out, but for the sake of keeping this shorter than a thesis, I’m going to move on.

Note that some analysts rely not on features that are supremely distinctive, but instead on distinctive patterns of use of common features. For example, we could analyse closed class words, such as the frequency of “the”, “and”, and so on, in the op-ed versus in suspect texts. This is actually a sensible idea. The use of such words tends to fall below the level of conscious choice, so it’s unlikely to be deliberately manipulated by someone, and if the questioned text is long enough, they should occur frequently enough to extract meaningful averages for them. But that’s a big if.
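If you fancy trying the closed class word idea yourself, the counting can be sketched in a few lines of Python. This is a toy illustration, not an established tool: the word list is a hand-picked handful (real studies use far longer lists), the tokeniser is a crude regular expression, and the name function_word_profile is invented for this example.

```python
import re
from collections import Counter

# Illustrative, hand-picked list (an assumption for this sketch);
# real studies typically use dozens or hundreds of function words.
FUNCTION_WORDS = ["the", "and", "of", "to", "but", "if"]


def function_word_profile(text, words=FUNCTION_WORDS):
    """Return each listed word's rate per 1,000 tokens in `text`."""
    # Crude tokeniser: lowercase runs of letters and straight apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {w: 0.0 for w in words}
    counts = Counter(tokens)
    return {w: 1000 * counts[w] / len(tokens) for w in words}


sample = "The cat and the dog sat on the mat, but the bird flew to the tree."
profile = function_word_profile(sample)
for word, rate in profile.items():
    print(f"{word}: {rate:.1f} per 1,000 tokens")
```

A sixteen-word sample like this one is far too short to mean anything, which is exactly the “big if” about text length.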

Whatever you decide to put on your list – gold-standard features (vanishingly rare so doubtful) or a handful of silver-standard features and lots of bronze-standard ones (much more typical of authorship attribution tasks), once you have a list of features that you consider to be consistent, and distinctive, it’s time for the next step.

Step 4 – analyse your known texts

Get stuck into your suspects’ datasets. Ideally, you need to devise a system of coding and then counting. Some people like to print and run amok with coloured highlighters. Some people prefer to use software. It doesn’t really matter as long as you find every instance of each of your features. (Notably, some forensic linguists like to start by identifying features in known texts first and then tackle the questioned text. There are good reasons for this but this is just for fun, so don’t sweat it too much.) You need to normalise the frequency of occurrence if you can. It’s no use comparing the frequency of the word “and” in a ten-page speech versus a single tweet. It simply can’t occur as often in the latter. And you need to account for opportunity of occurrence. In some text types or on some topics, some features just don’t get the chance to appear, or they appear all the time.
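To make the two adjustments above concrete, here is a minimal sketch in Python. The helper names (per_thousand, rate_of_use) and all the numbers are made up for illustration; the only point is that raw counts are converted to rates before anything gets compared.

```python
def per_thousand(count, total_words):
    """Normalise a raw count to a rate per 1,000 words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 1000 * count / total_words


def rate_of_use(uses, opportunities):
    """Share of opportunities in which a feature was actually used.

    Returns None when the text never gave the feature a chance to
    appear, because "no opportunity" is not the same as "never used".
    """
    if opportunities == 0:
        return None
    return uses / opportunities


# Made-up counts: "and" in a long speech versus a single tweet.
speech_rate = per_thousand(150, 5000)  # 150 hits in 5,000 words
tweet_rate = per_thousand(1, 40)       # 1 hit in 40 words

# Made-up counts: a comma placed before "and" in 3 of 4 lists.
list_comma_rate = rate_of_use(3, 4)

print(speech_rate, tweet_rate, list_comma_rate)
```

On raw counts the speech looks 150 times more “and”-heavy than the tweet; as rates per 1,000 words the two are close. The tweet’s rate still rests on a single occurrence, though, so treat figures from very short texts with suspicion.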

Once you have your results, compare them to your op-ed to see which of your suspects matches it the most (if any), and then, you have your answer.

Or do you?
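For the comparison step described above, a very rough sketch might look like the following. It assumes you already have normalised rates for the same features in every text, the suspect names and figures are entirely made up, and mean absolute difference is just one simple way of scoring similarity chosen for this example, not a standard forensic method.

```python
def mean_abs_difference(questioned, known):
    """Average absolute gap between two profiles over the questioned text's features."""
    return sum(abs(questioned[f] - known[f]) for f in questioned) / len(questioned)


def rank_suspects(questioned, suspects):
    """Return (name, score) pairs, smallest difference first."""
    scores = {
        name: mean_abs_difference(questioned, profile)
        for name, profile in suspects.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical rates per 1,000 words, invented for illustration.
questioned = {"the": 60.0, "and": 30.0, "semicolon": 2.0}
suspects = {
    "Suspect A": {"the": 58.0, "and": 31.0, "semicolon": 2.5},
    "Suspect B": {"the": 70.0, "and": 22.0, "semicolon": 0.0},
}

ranking = rank_suspects(questioned, suspects)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Two cautions. Raw differences like these are dominated by the most frequent features, so more careful approaches standardise each feature before comparing. And a ranking only tells you who is closest among the people you chose to put on the list, not that any of them wrote the text.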

Step 5 – state your findings… cautiously

So, have you identified the author? Chances are, your answer is no. Instead, you’ll probably find that you now have more questions than you started with. Did you put the right people on the list to begin with? Is your comparison data good enough? Do your figures show anything significant? If more than one person is similar but across different features, did they both write the op-ed together or is this just a coincidence? Even if you feel like you can answer all of these questions with a reasonable degree of confidence, and even if one person in your suspect list seems like a good fit for the op-ed author, you still don’t have proof. You only have a possibility.


Conclusion

I don’t think forensic linguistics can solve this one – at least not single-handedly. The reasons are as follows. The first issue is that the authorship of the piece has almost certainly been contaminated by a series of editors, proof-readers, and so forth. It may never have been a solo-authored piece in the first place (note the frequent use of “we” throughout it) but by the end it will very likely have had input from multiple sources, whether concurrently or consecutively.

The second issue is that the list of people who could have written this is too extensive. Forensic linguistics can suggest similarity of style within a small pool of potential candidates, but the list of possible authors in this case could number in the hundreds.

The third issue is that the author is clearly smart, and could therefore have foreseen the likelihood of people feverishly trying to identify them. Consequently, there has been a lot of focus on lodestar (a Pence-esque word) and first principles (a Mattis-esque phrase) but if you wanted to throw would-be Sherlocks off the scent, these are precisely the kinds of eye-catching breadcrumbs you would leave in your writing to lead people astray.

The fourth issue – which is really only a restatement of the first in different words – is the extremely murky political climate in the White House right now, to the extent that this op-ed may very well have been (foolishly or otherwise) commissioned by someone who is actually on Trump’s side but who is trying to divert attention away from other issues. The linguistic aspect of that angle is that the production of a distraction piece is likely to be different to that of a sincere exposé.

Overall, then, it might be clear why I haven’t had a bash at this particular chestnut. Even if I overlooked all the obstacles, authorship attribution tasks, if done properly, are long, exhaustive, and intensive investigations that more often than not run headlong into giant question marks at the end. Even a toy analysis for this blog would burn up at least half a week and I’m actually on annual leave right now, soooo… (If someone is willing to pay me a healthy five figures I will gladly set aside a month and have a serious go at cracking it, but before you all rush to whip out your chequebooks, you might want to reread Step 5 and the conclusion.)

Anyway, whatever the case, welcome to the wonderful, frustrating, addictive world of forensic linguistics. Please bring more coffee.