Introduction: if love is a battlefield, politics is a bonfire
As the Dominic Cummings v Boris Johnson battle rages across the headlines, it caught my eye that two high-profile figures have made similar, rather interesting points. The first was Robert Hutton, who, at 3:03pm on the 23rd of April 2021 (three days ago as I write this), tweeted:
“Look, the point of getting journalists to attribute a quote to “friends” or “allies” is that you have plausible deniability. There’s no point if the quote sounds so absolutely like you that it just looks like you always refer to yourself in the third person.”
He then includes a small screenshot that, judging by the style and font, looks like a quote from The Telegraph (but I can’t check to credit them properly because paywall 🤷♀️). The text of that is thus:
“Allies of Mr Cummings have hit back at Number 10 for starting “a war they can’t win”, adding: “Dom doesn’t care about all this stuff and they’re in gov. It’s like the Americans going into Vietnam – they may be able to drop big bombs but in a war of attrition, the rebel always wins.””
This first case is fun, and it would make a nice little study for uses such as gov (possibly some sort of written correspondence that’s been copied and pasted?) and the types of analogies that Cummings routinely makes (does he fall back on war examples a lot?), but realistically it’s a bit short so whilst tempting, I left it alone.
The second figure was Alastair Campbell, who tweeted:
“Interesting change of style in Cummings’ latest blog. From long, rambling and incontinent, to rather tight and focused, as though he had the help of an experienced journalist who knew how to land more blows with fewer words. Anyone seen @michaelgove ?”
Research question: did Dom go to SpadSavers?
What’s really nice about this is that Campbell has very loosely posed an authorship analysis research question somewhere along the lines of,
Is this statement authored in the same way as the rest of Cummings’ blog posts?
So we have something that could actually be investigated.
There are still plenty of issues even with this simple question. Let’s take Campbell’s suggestion that Dominic Cummings (DC) had “help” as a starting point. The trouble is, this is rather vague. Did DC write the statement, and then others (e.g. Journalist A, or Jay for short) simply read it over and provide suggestions that DC dealt with himself? In other words, was it all effectively penned by DC but lightly steered by Jay? Or, in a complete reversal, did Jay ghostwrite the statement entirely – though presumably based on DC’s information – and DC suggest possible changes that Jay then carried out, if any were required? Or was it something in between? Or was it something else entirely?
Because a second problem here is that I’ve followed Campbell and assumed that the influence was one (1) journalist. In reality, as I’ve found from personal experience, press releases and the like are actually infinitely messier, and I can only imagine what goes into a press release quite like this.
For instance, we might guess that DC had a number of fraught discussions with his wife on the subject, since she would probably take a very direct interest in her husband’s affairs when they suddenly start to grip the nation’s headlines. And Mary Wakefield is, as some will already know, a journalist. But it’s arguably different editorialising on topics safely air-gapped from one’s personal life versus giving quasi-legal crisis management advice to one’s spouse about a seedy national scandal that’s threatening to engulf the incumbent Prime Minister and his girlfriend. So assuming that DC discussed this with his wife, she, in turn, may have made numerous calls to fellow journalists, the family lawyer, a friend who had survived a similar experience. Similarly, DC may have had a flurry of tense, off-the-record chats with past colleagues at No10, consulted a PR representative, confided in a close relative. And everyone’s input may have made it into this statement in various guises. Some of it may have been carried in from recollection. Others may have directly edited the text. This statement may have done the email rounds some thirty or forty times before it finally arrived on DC’s blog, primped and tweaked and taste tested to within an inch of its life. Indeed, DC may even have written it with input from Boris Johnson himself.
With such a number of problematic stories in the news at the moment, Johnson might have hoped that several dead cats on the table would consume some of the oxygen in the room. Confected scandal and conveniently timed controversy as diversionary tactics are nothing new, after all, and such a drama would be too dazzling for most of the media to ignore.
In some ways, however, the actual numbers and identities and practical involvement of the influencing voices don’t really matter, because the research question is much simpler than that. The answer to, “Is this statement authored in the same way as the rest of Cummings’ blog posts?” is merely yes or no, though obviously I’m going to take an eternity and go from thread to needle to get there.
Anyway, naturally, as a forensic corpus linguist, the Case of Who Wrote Dom’s Statement (or more accurately, but less pithily, If It Really Is Different Did Dom Get Help Writing His Statement Or Is It UnDomlike Because He Was Freaking The Fuck Out Or Is It A Complicated Mix Of Both And Can We Even Tell Anyway?) is messy, fun, eternally unanswerable without more information, and best of all, there’s a reasonable dataset to draw on.
Data: the Dom Comms
The data – yes I’m going to use it in the singular, no I don’t care if that makes prescriptivists cry – all comes from DC’s blog. I took the statement as the Disputed Text (DT). This is the one Campbell seems to think has had authorial assistance. Then I collected the previous ten blog posts in a row and designated these the Known Texts (KTs) on the assumption that these were all solo-authored by DC. Of course, they might not be. He might have ordered an intern to write some, or plagiarised bits, or had a friend help with a couple. We honestly don’t know. At least two of them contain significant numbers of long quotes, which is a real headache to deal with objectively. Anyway, I don’t have much choice but to work on the basis that the KTs were all authored by DC and that they did not experience an undue amount of authorial interference from other sources, because otherwise this post simply couldn’t proceed any further, but it’s vital to note that the benchmarking KTs are built on an unverifiable assumption.
Here are some basic stats for the dataset because this wouldn’t be as much fun if I didn’t throw a table in somewhere, and there are nifty links to the original posts so that you can even go and immerse yourself eyebrow deep in DC’s uniquely extensive blogging oeuvre:
| Type | Post | Date (YY-MM-DD) | Words | Sentences | Words per sentence |
|---|---|---|---|---|---|
| DT | statement | 21-04-23 | 1,101 | 48 | 22.93 |
| KT | two_hands | 20-01-02 | 2,924 | 136 | 21.50 |
| KT | ref34 | 19-11-27 | 2,338 | 114 | 20.50 |
| KT | ref33 | 19-06-26 | 10,642 | 422 | 25.21 |
| KT | ref24n | 19-03-27 | 2,087 | 91 | 22.93 |
| KT | ref32 | 19-03-11 | 3,007 | 103 | 29.19 |
| KT | complexity | 19-03-06 | 920 | 33 | 27.87 |
| KT | biolabs | 19-03-04 | 1,930 | 69 | 27.97 |
| KT | ref31 | 19-03-01 | 15,150 | 572 | 26.48 |
| KT | ref30 | 19-02-21 | 4,177 | 162 | 25.78 |
| KT | ref29 | 18-09-11 | 592 | 41 | 14.43 |
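For anyone who wants to check the words-per-sentence figures, here is a minimal Python sketch that recomputes them from the word and sentence counts in the table. It assumes the figures were cut to two decimal places rather than rounded, which is only an inference from the numbers themselves (1,101 / 48 is 22.9375, and the table shows 22.93). The names `posts` and `words_per_sentence` are made up for illustration, and the harder job of deciding where a sentence ends is not attempted here.

```python
import math

# (words, sentences) copied from the table above.
posts = {
    "statement": (1101, 48),
    "two_hands": (2924, 136),
    "ref34": (2338, 114),
    "ref33": (10642, 422),
    "ref24n": (2087, 91),
    "ref32": (3007, 103),
    "complexity": (920, 33),
    "biolabs": (1930, 69),
    "ref31": (15150, 572),
    "ref30": (4177, 162),
    "ref29": (592, 41),
}

def words_per_sentence(words, sentences):
    """Mean sentence length, cut (not rounded) to two decimal places."""
    # Assumption: truncation, inferred from the table's own figures.
    return math.floor(100 * words / sentences) / 100

wps = {name: words_per_sentence(w, s) for name, (w, s) in posts.items()}

for name, value in wps.items():
    print(f"{name:<12}{value:>7.2f}")
```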
So that’s the dataset. What next? Some light analysis…
Analysis: three is a pattern?
Remember Campbell’s claim that DC’s statement was “rather tight and focused, as though he had the help of an experienced journalist who knew how to land more blows with fewer words”? This was implicitly contrasted with the rest of the blog, which was “long, rambling and incontinent”. That made for a quick and easy first port of call: post length. The ten other posts from DC’s blog that I collected amounted to roughly 44,000 words. (The total is somewhere between 43,529 and 43,767, but that depends on quite how you count a word – and yes, it’s actually a lot harder to count words than you might think. Would you count a number as a word? Is “blog-post” one word or two? But I digress.) So Dom’s comms average roughly 4,400 words per post versus the relatively short 1,100 word statement. However, the range is very wide. Two posts are literally dissertation-length (10.6k and 15.1k long) and, going by the table, two are shorter than the statement, one of them by a fair bit. Regardless, however, the statement still ranks ninth out of all eleven posts I collected for length.
As a momentary visual interlude, here’s a nice barchart that basically says the same thing in reassuringly geometric primary colours:
Anyway, we could just throw all caution to the wind, take the Olympics approach, and knock out the extreme highest and lowest outliers, and that brings DC’s average post down to 3,500 words, but that’s still three times longer than the statement, and the statement then ends up as the second shortest piece. Whichever way you torture the data, overall, it pretty much confesses that the statement is unusually short.
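The arithmetic behind those averages can be sketched in a few lines of Python. This is only an illustration using the word counts from the table of posts, so the total comes out at the upper figure of 43,767; the variable names are invented, and the "Olympics approach" is implemented here as simply dropping the single shortest and single longest post before averaging.

```python
from statistics import mean

# Word counts for the ten Known Texts, copied from the table of posts.
known_posts = {
    "two_hands": 2924, "ref34": 2338, "ref33": 10642, "ref24n": 2087,
    "ref32": 3007, "complexity": 920, "biolabs": 1930, "ref31": 15150,
    "ref30": 4177, "ref29": 592,
}
statement_words = 1101  # the Disputed Text

counts = sorted(known_posts.values())
total = sum(counts)                 # all ten Known Texts together
plain_mean = mean(counts)           # ordinary average post length
trimmed_mean = mean(counts[1:-1])   # drop the shortest and longest post
ratio = trimmed_mean / statement_words

print(f"total words:  {total}")
print(f"mean length:  {plain_mean:.0f}")
print(f"trimmed mean: {trimmed_mean:.0f}")
print(f"trimmed mean is {ratio:.1f}x the statement")
```

Dropping one post from each end is the bluntest possible trim. With only ten posts, anything more aggressive would throw away a large share of the data.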
But words are not the only thing it’s light on. The statement is also unusually plain. Skim over DC’s other posts, including at least a dozen or so beyond the ten that I collected, and one finds a frenetic FORMAT ALL THE THINGS philosophy. DC might say that he is leveraging the formatting affordances of the modern blogging platform to maximise cognitive asset impact… or something, but in simpler terms, you’ll find an ample supply of bold, italics, headings, indented quoting, bullet points, numbered lists, hyperlinks, pictures, embedded videos, even a bit of old-school all-caps… In fact, really the only thing that’s missing is the More function, which allows the reader to preview the first few lines of a post and then click to see the rest if they wish, but I can see how that might not fit DC’s general style.
Anyway, by contrast, the statement contains… nothing. Not even a single italicised No! And that’s interesting – to me anyway – because the statement does actually contain text that is, in practice, a numbered list. And DC does like his bullet points and numbered lists very much. Just go take a trip down his More-less homepage which scrolls on for fucking ever and count how often he empties entire clips of bullets into a post till it’s an almost illegible spattered mess of dying syntax. And yet that doesn’t happen here. The opportunity to impose the structured order of a neatly formatted list has been passed by. It is simply left as plain text.
Why would this matter? Is this a tiny thread of bloodstained polyester from which the CSI linguist can reconstruct the entire murder?
No. That would be ridiculous.
But it suggests… something.
Perhaps DC stripped all formatting from the post because… I don’t know. I’m not sure why he’d do that. But then, this is the Barnard Castle guy, so really, anything is possible.
Perhaps he wrote the post in the text editor interface rather than the visual interface, thus side-stepping any auto-formatting, but again, why this change in habit?
Perhaps his other posts were primarily written or subsequently edited in the WordPress interface, whereas this statement was written elsewhere – email, Word, notepad++, WhatsApp, the drafts of a burner Gmail account, whatever – and then the finished, unformatted text was simply dumped into a new post and published as was.
I slightly favour this last explanation based on one minor point: in WordPress, as in a lot of editing software now, if a new line starts with a number directly followed by, e.g. a full stop, pressing the spacebar will trigger WordPress’s misguidedly officious auto-formatting function, which will promptly transform the preceding text into a numbered list. So if he had typed it, live, into WordPress, it would have “helped” by correcting the plain 1. into a proper, HTML-tagged ordered list, with correct indenting and so forth. However, if he had simply pasted a whole, unformatted text in as is, then it would not go back and change anything.
In short, we have another intriguing difference here – the total absence of formatting, especially in an instance where, compared to DC’s other posts, it would typically occur.
But what else?
The third aspect I considered was word length. I decided to compare the statement (green line) not just to DC’s other posts (black line), but also to a 100,000-word sample of writing from the BNC – a sort of benchmark for “ordinary” writing (red line). For ease I used the Signature software, and the results turned out thus:
So, anything meaningful happening here? Well, yes actually. Nothing outrageous, but a little bit. What we find is that when compared with both the statement (green) and the BNC (red), DC’s other posts (black) favour longer words and disfavour shorter ones. Indeed, the statement is more similar to the BNC written sample which stands as a very rough “norm”, and this suggests a fairly reasonable shift in authorial style.
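For the curious, a word-length comparison of this kind can be roughed out in Python. To be clear, this is not how Signature works internally (I have no idea what it does under the bonnet); it is a minimal sketch that assumes a crude letters-only tokeniser and pools very long words into one bucket. The function name `word_length_profile` and the sample sentence are placeholders, not real data.

```python
import re
from collections import Counter

def word_length_profile(text, max_len=12):
    """Share of words at each length; lengths >= max_len are pooled together."""
    # Crude tokeniser (an assumption): runs of letters only, so "don't"
    # becomes "don" and "t", and numbers are ignored entirely.
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return {}
    counts = Counter(min(len(w), max_len) for w in words)
    return {n: counts[n] / len(words) for n in range(1, max_len + 1)}

sample = "I am happy to publish"  # placeholder text, not real data
profile = word_length_profile(sample)

for length, share in profile.items():
    if share:
        print(f"{length:>2} letters: {share:.0%}")
```

Running the same function over the statement, the other posts, and a reference sample would give three profiles that can be plotted against each other, which is roughly what the three coloured lines represent.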
How do we explain this, then? Well, I can think of at least two possibilities. We could indeed say that the statement’s “simpler” wording might be indicative of a journalist’s input, since the arc of journalism invariably bends toward selling papers, and one sensible way to achieve this is to make the content as accessible, punchy, and credible as possible. (By contrast, lofty, high-flown lexis is sometimes associated with sophistry, elitism, and deception, and that would be exactly not the tone DC would want to strike in this post.)
However, a second possible explanation is that it could simply be an artefact of the topic. DC isn’t waxing lyrical about ethical research into biolabs or the potentiality of AI or the leveraging of cognitive heuristics. Nor is he dissecting policy and practice surrounding referendums. He’s rebutting a series of specific points from the PM directed at himself, and that involves a lot of “I” (38) and “me” (13) and “PM” (20) and “him” (7). Ergo, less opportunity for philosophical academese, more demand for personal pronouns.
Whatever the explanation, the statement does indeed differ, again, on this count.
There was a fourth area I looked into: content, and I particularly took a little time with a function known as keywords. These are words that appear with unusual frequency in a text when it’s compared with something else, so in this case, the statement contains some fairly unsurprising keywords when compared with DC’s other posts – PM, secretary, inquiry, leak, cabinet, Dyson, and so on. Nothing worthy of a headline.
By contrast, the DC posts are heavy on words like science, ideas, systems, project, technology, and so on. None of the words that appear in the results were remotely surprising, with the exception of just one.
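To make the idea of keywords a little more concrete, here is a hedged Python sketch. I haven't said which keyness statistic produced the results above, so treat this purely as an illustration: it assumes log-likelihood, which is one common choice in corpus linguistics, and the token lists are toy examples rather than the real texts. The function names are invented for this sketch.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Keyness of a word seen a times in a c-word target text
    and b times in a d-word reference text."""
    expected_target = c * (a + b) / (c + d)
    expected_reference = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / expected_target)
    if b > 0:
        ll += b * math.log(b / expected_reference)
    return 2 * ll

def keywords(target_tokens, reference_tokens, top=10):
    """Words relatively more frequent in the target, ranked by keyness."""
    target, reference = Counter(target_tokens), Counter(reference_tokens)
    c, d = len(target_tokens), len(reference_tokens)
    scored = []
    for word, a in target.items():
        b = reference[word]
        if a / c > b / d:  # keep only words overused in the target
            scored.append((log_likelihood(a, b, c, d), word))
    return [word for _, word in sorted(scored, reverse=True)[:top]]

# Toy token lists (hypothetical, not the real posts).
statement_tokens = "pm leak inquiry pm the the".split()
other_tokens = "the science the ideas the systems the the".split()
print(keywords(statement_tokens, other_tokens))
```

Swapping the two arguments gives the keywords of the other posts relative to the statement, which is how the second list (science, ideas, systems and so on) would be produced.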
In the statement DC is happy five times (an average of 0.5% of the time). He’s happy to meet with the Cabinet Secretary, happy to publish messages, happy to tell what he knows, happy to give evidence, and even happy for No10 to publish every one of his emails, except for those that might pose a national security issue. When it comes to putting out information and correspondence and evidence, DC would seem to be one numbered list away from outright gleeful disclosure. In the rest of the posts I collected he’s happy only once (an average of 0.002% of the time), and even then, it’s angrily sarcastic: “Such is [a powerful network’s] loyalty to the EU they were happy to make Britain a laughing stock.” (Emphasis mine.)
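Those percentages are just hits divided by total words. A quick Python sketch of the sum, assuming the word totals from the table of posts (1,101 for the statement and the upper total of 43,767 for the other ten posts):

```python
def percent(hits, total_words):
    """Occurrences of a word as a percentage of all words in a text."""
    return 100 * hits / total_words

statement_rate = percent(5, 1101)   # "happy" in the statement
others_rate = percent(1, 43767)     # "happy" in the ten other posts

print(f"{statement_rate:.2f}% vs {others_rate:.4f}%")
# prints "0.45% vs 0.0023%"
```

Put another way, on these counts happy turns up roughly two hundred times more often in the statement than in the rest of the posts, though with only six occurrences in total that is a very thin basis for anything.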
Anyway, because happy isn’t exactly a common word in such a context, without analysing the whole blog, there’s little to conclude about it from an authorship perspective. Mostly I was struck by how delighted DC appears to be at the prospect of sharing emails, WhatsApp messages, conversations, and more with literally anyone he can get them to at even the slightest invitation. From my other perspective as someone who analyses manipulation and aggression, it sounded like a series of very loud hints that Johnson should back away carefully without breaking eye contact, lest some very awkward recordings and screenshots suddenly escape DC’s phone for a better life out in the sunlit tabloid uplands. Naturally, it could be a bluff, since each might have as much to lose as the other, but it will be interesting to see how it plays out.
Perhaps the most frustrating issue for a linguist is that it is not especially fruitful to analyse much of the language in this post because, as I’ve mentioned a few times now, the format, topic, speed, necessity, purpose, and intended audience of the rest of the blog are all so different.
When it comes to DC’s other posts and their format, some are striving to be academic papers. One is a painfully inappropriate job advert. Another is a list of links with very little actual content. The topic tends to be either romantic meanderings through the misty vales of Dom’s techno-governmental utopia or passionate careening into all-caps outbursts about referendum-blocking billionaire cabals. Speed? All these posts were presumably penned at leisure. Necessity? Whilst some may have felt very important to DC, they probably did not feel absolutely critical. Purpose? DC’s goal seems to be the creation of a sort of Aldi-level Slate Star Codex that will impress the reader with his brilliance and thus persuade them to agree with and adore him. Audience? He seems to be aiming (however unsuccessfully) at a reasonably coherent mix of aspirational, idealistic, would-be politician-scientists and LessWrong subscribers.
When it comes to undertaking a meaningful comparison, these variables create a distinct problem, because the statement manages to be different on every single count. It is roughly formatted as a press release. The topic is a fixed list of specific, pre-defined points. When it comes to speed, I think it’s reasonable to suggest that this post will have been rushed out, under intense pressure, within hours. It will probably also have felt spectacularly more necessary than any other post on the blog. Similarly the purpose is far higher stakes: not only convincing as many people as possible that he is telling the truth in a matter of national and potentially criminal conduct, but also cowing Johnson and staying any further attacks. And this, all for the perusal of every media outlet, every politician, every Cabinet lawyer, every single person idly scrolling through their phone over breakfast this week, wondering what the latest No10 scandal is today.
So, is this statement authored in the same way as the rest of Cummings’ blog posts? Well, on at least three counts, patently, no, it is not. Shorter words are used to write a shorter post that is bereft of formatting.
The next question, then, is did DC write this post himself? And naturally, the answer to that question is far more complicated. Yes, this statement may look different to DC’s other posts, and yet it may still have been entirely solo-authored by him. So then why would it be so different? Well, it could simply be that No10 staring down the barrels of their many media blunderbusses may have focussed DC’s blogging style wonderfully, propelling it from its usual meandering academese to a far more en pointe riposte. To put it more simply, the duress and significance of the situation might have directly caused the change in DC’s style. Or, as noted above, those same factors may have driven him to involve others in his writing process, and this influence may have changed what would have otherwise been the final output.
Both is good.
But what if we are determined to conclude that the difference can only be explained by an active, external influence – a ghostwriter? Or even an entire ghost army? Well, we still wouldn’t be able to come up with a number of potential voices. There is simply no way to disentangle the possible influences of proofreaders or factcheckers or style editors or legal exposure mitigators or brand management gurus or vexed spouses, nor whether they got to edit the document directly or left voicemails that were then paraphrased or sent WhatsApp messages that were copied, pasted, and edited in or any other of a thousand possibilities. And there’s absolutely no point trying to infer their identities. Both tasks are impossible with the information we have, and would be as close to impossible even with substantially more evidence and a closed set of well-evidenced suspect candidates.
In summary, as is so often the case, without further information, there’s no way to be sure who wrote DC’s statement – whether he played Lone Ranger and did it all himself or had an entire Dom Squad firing the thing back and forth for hours, dragging it through innumerable exhausting rounds until every sentence had been tortured to everyone’s satisfaction. We can speculate based on his legal exposure and his connections and his income and the high stakes at play and what a seasoned veteran of No10 would be wise to do in such a situation, of course, and just like Alastair Campbell (only in 3,500 words rather than 42) we can conclude that this statement certainly is different to his other posts, but ultimately, we can’t absolutely know why.
So in the Case of Who Wrote Dom’s Statement, I must end with the status as: