The Case of Mark Zuckerberg

1 – Introduction
Over the past few months, news has emerged[ref][ref] that yet another person is trying to sue Mark Zuckerberg for a share of Facebook. This time, it’s Paul Ceglia, who claims to have a work-for-hire contract with Zuckerberg entitling him to 50% of Facebook’s fortune. To support this claim, Ceglia has produced emails that he says were sent between himself and Zuckerberg, including one allegedly by Zuckerberg stating, “I’d like to suggest that […] we officially return to 50/50 ownership”. However, whilst emails could usually have their headers examined and cross-referenced with server logs to establish certain aspects of their authenticity, Ceglia claims that he kept these emails as Word files, ergo no headers.

Facebook’s lawyers have chosen to dispute authorship (amongst other things) by arguing that Ceglia’s emails are fabrications, and they called in Emeritus Professor Gerald R. McMenamin to perform a forensic stylistic analysis on the eleven emails in Ceglia’s Complaint[ref] (Doc39) which were purportedly written by Zuckerberg. McMenamin has submitted a Declaration[ref] (Doc50) in support of the defendant, disputing the authenticity of these eleven suspect emails. On the plaintiff’s side, John Evans has filed a Declaration[ref] (Doc61) which briefly refers to the Word files.

1.1 – Hypotheses
I am working on two hypotheses for the following sections, which can only be partially tested, as will be seen below.

The first hypothesis is that if the emails are fabricated, then it seems reasonable to argue that Ceglia will have done this himself. This is because Ceglia has read many of Zuckerberg’s emails before, and therefore has a working knowledge of his style (though whether he can then bring that knowledge to bear is another matter). But perhaps more importantly, it would be unwise, particularly in a case involving such an amount of money, to ask a third party to undertake the fabrication. Should that third party have a sudden attack of conscience, greed, dislike etc., either before or after the case, then Ceglia might find himself dealing with blackmail, Zuckerberg may discover that vital information has mysteriously become available – for a price, and so forth. In short, managing any wrongdoing personally and privately seems a safer method of ensuring that the deception succeeds.

The second hypothesis is that if fabrication has been carried out, it does not immediately follow that all the Zuckerberg emails are entirely fabricated, from greeting to signature. The word fabrication itself tends to suggest that the suspicious emails have been completely made up, but one can alter a text in several ways so that it is no longer faithful to the original. For instance, one could:

Fabricate (i.e. invent some/all)
Insert (i.e. introduce text by same author from another source)
Reword (i.e. paraphrase/rewrite some/all)
Restructure (i.e. rearrange some/all, or move email to another place in the discussion)
Remove (i.e. delete some/all)[Note]

Naturally, a person isn’t restricted to just one of these. Let’s imagine that I write:
Ok, could you bring the draft then? Am meeting Jim tonight.

This could be edited and rearranged to read:
meeting Jim tonight. bring the draft then

Despite the fact that two thirds of the content is original, right down to the punctuation and capitalisation, the tone, structure, and meaning are entirely altered. The first email reads as a response to a previous message, a request, an observation which implies that the recipient is not expected at the meeting (since the inflection of am alludes to an omitted I, not an omitted we), and a subtle suggestion that the draft is therefore expected before the meeting. The second email, however, reads as a new message (i.e. one that is not necessarily responding to previous communication), an observation, a peremptory command which implies that the recipient is expected at the meeting, and a clear suggestion that the draft is expected then, and not before.

The problem for the analyst is that genuine text which has undergone insertion, rewording, restructuring, or removal (IRRR), is going to be harder, and maybe even impossible to spot, compared to elements which have been entirely falsified by another author. Whilst IRRR can change tone, style, and meaning, the main effect is likely to be felt at the level of the discourse, sentence, and clause (since these are easier to switch around, introduce, or delete en masse) rather than on the phrasal, lexical, formatting, or punctuation level (since reorganising at this level requires more work and attention), and it tends to be on these finer-grained levels that some of the most interesting and distinctive style-markers occur. As a result, I do not attempt to identify IRRR original text, but instead focus on analysing the data for potential fabrications.

We can also reasonably argue that the degree of risk of being found out increases with the amount of fabrication, and that if Ceglia did engage in this, then it would be in his interests to keep those fabrications to a minimum. As such, whilst I have analysed the data (see 2 below) for issues of disputed authorship, I have not assumed that the whole Zuckerberg dataset is false and then worked to prove this[note]. Instead, I have worked on the principles that (1) the authenticity of the data has been called into question, (2) some, all, or none of the data may be fabricated, (3) if the data does contain fabrications, then Ceglia most likely authored these, and (4) if the data does contain fabrications authored by Ceglia, then these fabrications are likely to share stylistic markers with Ceglia’s own emails.

2 – Data
The major portion of the data that I use is taken from Ceglia’s First Amended Complaint (Doc39)[ref], with a small amount of data taken from McMenamin’s Declaration (Doc50)[ref], and the information regarding the Word files containing the emails taken from Evans’ Declaration (Doc61)[ref]. Amongst many other things, Doc39 contains:

– eleven emails (1,079 words), purportedly by Zuckerberg, forming the Suspect Corpus (SC)
– seven emails (1,207 words), purportedly by Ceglia, forming Ceglia’s Corpus (CC)

There are five important points to note. Firstly, the wordcounts are mine, based on the corpora I have constructed from Doc39. Secondly, McMenamin’s forensic stylistic analysis was augmented by access to thirty-five emails by Zuckerberg whose authorship was verified, providing McMenamin with a Zuckerberg Corpus (ZC) that I have limited access to. The only parts of ZC I can access are those snippets and summaries which McMenamin reproduces in Doc50. As a result, I have no wordcount for ZC.

Thirdly, Doc39 was issued by Ceglia’s team, so they have had first hand in deciding what was included, excluded, or paraphrased. Probably for brevity (and doubtless for other reasons), Doc39 only contains some of the Ceglia/Zuckerberg emails, whilst others are clearly paraphrased:

“Communications between Zuckerberg and Ceglia concerning their development of The Face Book and the website planned by the Harvard upperclassmen described in Zuckerberg’s November 22, 2003 email continued through the balance of 2003.” (Doc39: 37)

“Ceglia responded on the same day with an email explaining to Zuckerberg that he could not remember the relevant terms of the Agreement and did not have access to it. Consequently, he could not respond to Zuckerberg’s request for a waiver. Zuckerberg replied by email to Ceglia, informing him that he would scan the Agreement and send it to him.” (Doc39: 39)

“On January 13, 2004 and January 16, 2004, Ceglia and Zuckerberg exchanged emails concerning the functionality of The Face Book’s website […]” (Doc39: 44)

Fourthly, McMenamin does not perform any analysis of CC, nor in any way compare it to SC – an entirely understandable and, in fact, expected approach since McMenamin’s object will have been restricted to casting doubt on the authenticity of SC by contrasting it with ZC. The reason for this restriction is that forensic linguistics, like many fields of scientific enquiry, is far more suited to defence strategies of questioning (e.g. doubting the validity of evidence) than to the prosecution strategies of proving (e.g. verifying the validity of evidence). Trying, therefore, to not only cast doubt on the fact that Zuckerberg authored the emails, but also to identify who did author them would be a risky step to take in court. It is better to stick with casting doubt, since if the identification turns out to be wrong (and it is usually easier to prove something wrong than it is to prove it right), then the rest of the – possibly excellent – analysis could well be written off as equally unreliable. To an extent, the remainder of this article actually sets out in part to demonstrate why proving authorship is very difficult, but moving back to my original point, McMenamin does not perform analysis of CC, so any evidence or analysis of CC below is mine. I have striven throughout to keep my views and analyses of SC and ZC distinct from McMenamin’s, but don’t hesitate to point out any lack of clarity anywhere.

Fifthly, we know very little about ZC. We know that it consists of emails sent from Zuckerberg both to Ceglia and others from the same time period, but it is doubtful that those were the only thirty-five emails Zuckerberg sent around that time (the period in question covers roughly April 2003 to April 2004). Therefore, the potential dataset from which ZC was ultimately drawn must have been winnowed down from possibly thousands of emails, and in a multi-million dollar court case, I would be frankly astonished if the emails hadn’t been carefully selected to provide the defence with the best case possible. It would be nice to think that the ZC emails were chosen at random, or objectively on the basis of their representativeness of Zuckerberg’s style, or best of all, if all emails from that time were being used, but the likelihood is that the analyst has ended up with data that has been selected in order to prove a predetermined argument. And thus we go back round to the Witchcraft Analogy.

In summary, bear in mind whilst reading the remainder that, (1) I do not have full datasets for each individual (I have SC and CC, but only limited information about ZC), (2) the datasets that I do have access to have been primarily shaped by the plaintiff, (3) whilst I critique the approach McMenamin took in other parts of this discussion, this is partly due to my different methodological views and approach (namely my emphasis on corpus/statistics), and (4) I am undertaking a different analysis in places to McMenamin by comparing SC with CC – something he does not do.

2.1 – Methodology
A first issue with the Doc50 analysis is the lack of data wordcounts, and the fact that raw frequencies are not normalised[note]. Using, and especially comparing, raw frequencies can be extremely misleading if the corpora in question are different sizes. For example, let’s imagine that cat occurs ten times in Corpus A, and only twice in Corpus B, and, on the basis of these raw frequencies, I tell you that cat occurs more in A. If you later discover that A is one million words, and B is only one hundred words, you would probably feel misled. A is ten thousand times bigger, so cat has ten thousand times more chance of appearing. To fairly compare these raw frequencies, they need to be normalised, and one method of normalising is percentages[note]. If we normalise our cat raw frequencies into percentages, we now find that cat constitutes a mere 0.001% of Corpus A, whereas it constitutes 2% of Corpus B. In other words, cat is 2,000 times more frequent in B, which is a fairly significant reversal of my earlier claim.
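As a minimal sketch of that normalisation (the function name percentage is my own, purely for illustration), the cat figures can be run through a few lines of Python:

```python
def percentage(raw_count: int, corpus_size: int) -> float:
    """Return a raw frequency as a percentage of the corpus size (in words)."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be a positive number of words")
    return 100 * raw_count / corpus_size


# Figures taken from the cat example in the text.
cat_in_a = percentage(10, 1_000_000)  # Corpus A: ten hits in one million words
cat_in_b = percentage(2, 100)         # Corpus B: two hits in one hundred words

print(f"Corpus A: {cat_in_a:.3f}%   Corpus B: {cat_in_b:.0f}%")
print(f"Relative frequency, B versus A: {cat_in_b / cat_in_a:.0f} times")
```

The point of dividing by corpus size before comparing is simply that the comparison then no longer depends on how much text each corpus happens to contain.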

In the case of the emails, we have a problem with a missing quantity. Since I created my own SC and CC, I can give their wordcounts and they turn out to be fairly similar in size (SC = 1,079 words, fifty sentences, eleven emails, CC = 1,207 words, sixty-six sentences, seven emails). However, we know nothing of ZC’s size, other than that it contains thirty-five emails. It could be three times SC’s size, or three hundred times, but speculation is meaningless, and without that information, whenever raw frequencies from SC or ZC are compared in Doc50, those results could be as misleading as the cat example above.

A second issue that is also a vital part of describing and understanding results, is known as significance. Significance does not necessarily refer to something that occurs a lot. And and the occur a lot, but that doesn’t make them important. In fact, it would be far more interesting (and significant) if we found a lengthy document in which they never appeared at all. Instead, rather as the and/the example implies, significance is determined by something happening more, or less, than chance alone would dictate. To understand what chance would dictate, we need to establish norms (which are the regular play-out of chance in action), and certain norms, particularly in text, can be uncovered with exceptional speed and thoroughness via corpus linguistics – assuming you have access to an adequately sized corpus. Extrapolating from an insufficient dataset can lead to all sorts of problems, not least including elevating exceptions to the status of norms, and then using those ‘norms’ to find ‘deviations’. I return to this in 3.3.

3 – Analysis/Discussion
The following sections divide roughly into discussions of McMenamin’s analysis, my assessment of that analysis, and in some cases, my own additional analyses, usually incorporating CC. In the case in question, McMenamin concludes that,

“the KNOWN writings of Mr. Zuckerberg demonstrate a sufficiently significant set of differences vis-à-vis the QUESTIONED writings to constitute evidence that Mr. Zuckerberg is not the author of the excerpted QUESTIONED references.” (Doc50: 14)

You get to find out in the conclusion (4 below) what my… well… conclusions are.

3.1 – Apostrophes
In exhibit B1, Doc50 states that in SC, “apostrophes indicating contraction and possession are sometimes absent”. In fact, out of twenty-five words requiring apostrophes in SC, only four examples fail to contain them, giving an 84% correct apostrophe presence, or CAP, for want of a better word. My analysis of CC shows that, out of forty-four instances requiring apostrophes, six lack them (wouldnt, doesnt, persons, whats, couldnt, dont)[note], demonstrating an 86.363% CAP. ZC, we are told, demonstrates 100% CAP: “All apostrophes in contractions and possessives are present” (Doc50: B1).

This actually unearths a few interesting facts. Firstly, SC appears to be more similar to Ceglia’s emails than to Zuckerberg’s. However, this does not apply to the whole of SC, but rather to parts of it. SC’s four missing apostrophes only occur in two of the eleven emails (6 (x2), and 9 (x2)) – an 18% distribution across the corpus, whereas the six missing apostrophes in CC occur in five of the seven emails (1 (x1), 2 (x1), 3 (x2), 4 (x1), 6(x1)) – a 71% distribution across the corpus. This suggests that (a) a consistent feature found across the majority of CC is the omission of roughly one in every seven apostrophes, and (b) an inconsistent feature found only in parts of SC is to miss out roughly one in every seven apostrophes.
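For anyone wanting to check the arithmetic, here is a small sketch using only the counts quoted above. The names cap, sc_missing, and cc_missing are illustrative, not part of any established toolkit:

```python
def cap(required: int, missing: int) -> float:
    """Correct apostrophe presence (CAP) as a percentage."""
    return 100 * (required - missing) / required


# Email number -> count of missing apostrophes, as listed in the text.
sc_missing = {6: 2, 9: 2}
cc_missing = {1: 1, 2: 1, 3: 2, 4: 1, 6: 1}

sc_cap = cap(25, sum(sc_missing.values()))  # SC: 25 words requiring apostrophes
cc_cap = cap(44, sum(cc_missing.values()))  # CC: 44 words requiring apostrophes

# Share of each corpus's emails that contain at least one omission.
sc_spread = 100 * len(sc_missing) / 11  # SC has eleven emails
cc_spread = 100 * len(cc_missing) / 7   # CC has seven emails

print(f"CAP     SC: {sc_cap:.1f}%  CC: {cc_cap:.1f}%")
print(f"Spread  SC: {sc_spread:.0f}%  CC: {cc_spread:.0f}%")
```

The two CAP figures are close, whereas the spread figures are not, which is the contrast the paragraph above turns on.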

Let’s say that on this basis, we now wanted to argue that sections of SC (e.g. emails 6 and 9) were partly or fully fabricated by Ceglia. By itself, the above is barely a shadow of evidence, since people’s internal variation of style is enormous. You need only think about the differences in the way you would text a friend to join you for a drink versus the way you would write a covering letter for a job. To complicate matters, email falls into an interesting grey area in that it is not generally perceived as an especially formal (i.e. written letter), or informal (i.e. text message) mode of communication. In the main, people are as happy to receive mortgage approvals by email as they are to get ungrammatical one-liners from their mates. This means that the scope for internal variation that could pass by unmarked (i.e. without being noticed as odd) within this communicative form is particularly large.

Back to apostrophes, then. Ceglia demonstrates in CC that he does, generally, know how to apostrophise contractions, even sometimes varying within the same word (e.g. don’t x7, dont x1), meaning that the omissions do not seem to be predominantly caused by a lack of knowledge, or by incorrect knowledge. So, Ceglia may have fabricated the whole of SC, and, knowing Zuckerberg’s style, tried to ensure that by and large, all apostrophes were present. Or he may have only fabricated part of SC. Or Zuckerberg may have written these emails and, just like Ceglia, may occasionally ditch or forget apostrophes for reasons of speed, tiredness, annoyance, etc.

Short version? We need more evidence…

Secondly, the (mis)used apostrophes actually lead us in the direction of some interesting differences. If we count all correctly and incorrectly punctuated possessives and contractions, we find that these constitute 3.645% of CC (44 altogether), and 2.316% of SC (25 altogether). In other words, Ceglia’s emails show a greater propensity for apostrophised words. Does that mean, then, that SC shows a greater propensity for non-contracted forms, or elaborations? However, we need to be careful. Some words, such as possessives, and certain contractions such as let’s, don’t readily elaborate back into their original forms without sounding odd[note], so these have been discounted (CCx3, SCx3). This is somewhat arbitrary, since we could argue the same point, to a weaker extent, for several of the contractions below, but ultimately, since the excluded forms occur almost equally as often in both corpora, this shouldn’t overly affect the results.

CC – 1,207 SC – 1,079 CC – 1,207 SC – 1,079
Aren’t (x1) Aren’t (x0) Are not (x0) Are not (x0)
Can’t (x2) Can’t (x1) Can not/cannot (x0) Can not/cannot (x1)
Couldn’t (x1) Couldn’t (x0) Could not (x0) Could not (x0)
Doesnt (x1) Doesnt (x1) Does not (x0) Does not (x0)
Don’t (x8) Don’t (x6) Do not (x0) Do not (x1)
I’d (x2) I’d (x2) I would (x0) I would (x2)
I’ll (x3) I’ll (x2) I will (x0) I will (x2)
I’m (x4) I’m (x2) I am (x5) I am (x7)
It’ll (x0) It’ll (x0) It will (x0) It will (x2)
Isn’t (x1) Isn’t (x0) Is not (x0) Is not (x0)
It’s (x5) It’s (x1) It is (x1) It is (x2)
I’ve (x4) I’ve (x2) I have (x0) I have (x5)
Shouldn’t (x0) Shouldn’t (x0) Should not (x0) Should not (x1)
Sites (x0) Sites (x1) Site is (x0) Site is (x1)
That’s (x0) That’s (x0) That is/has (x3) That is/has (x3)
There’s (x0) There’s (x0) There is/has (x0) There is/has (x2)
They’re (x0) They’re (x0) They are (x0) They are (x1)
Won’t (x0) Won’t (x3) Will not (x0) Will not (x0)
Wouldn’t (x2) Wouldn’t (x1) Would not (x0) Would not (x0)
You’re (x0) You’re (x0) You are (x0) You are (x1)
You’ve (x4) You’ve (x0) You have (x0) You have (x1)
We’ll (x0) We’ll (x0) We will (x0) We will (x2)
Weren’t (x1) Weren’t (x0) Were not (x0) Were not (x0)
We’ve (x1) We’ve (x0) We have (x0) We have (x0)
What’s (x1) What’s (x0) What is/has (x0) What is/has (x0)
Total: 3.396% (41) Total: 2.038% (22) Total: 1.491% (9 instances, 18 words) Total: 6.302% (34 instances, 68 words)

(This table ignores non-contractable examples, e.g. the site is cool as it is.) The interesting thing about contractions versus elaborations is that stylistically, they are less amenable to accidental inclusion or exclusion. By this, I mean that it would be easy to miss out or drop in an apostrophe since it’s only one tap of a button, whereas it’s pretty unrealistic to suggest that one might accidentally type out a full elaboration. Elaborations, then, are less open to pure chance, but they can still be affected by emotional state, the requirements of formality, and so forth.
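The totals row of the table can be reproduced, and the two corpora compared, with the short sketch below. It assumes, as the totals row does, that each contraction counts as one word and each elaboration as two; the function name rate is mine:

```python
CC_WORDS, SC_WORDS = 1_207, 1_079


def rate(words_involved: int, corpus_words: int) -> float:
    """Words involved in a feature, as a percentage of the corpus."""
    return 100 * words_involved / corpus_words


# Contractions: one word per instance.
cc_contr = rate(41, CC_WORDS)
sc_contr = rate(22, SC_WORDS)

# Elaborations: two words per instance (e.g. 34 instances, 68 words).
cc_elab = rate(9 * 2, CC_WORDS)
sc_elab = rate(34 * 2, SC_WORDS)

print(f"Contractions  CC: {cc_contr:.2f}%  SC: {sc_contr:.2f}%  "
      f"(CC/SC = {cc_contr / sc_contr:.2f})")
print(f"Elaborations  CC: {cc_elab:.2f}%  SC: {sc_elab:.2f}%  "
      f"(SC/CC = {sc_elab / cc_elab:.2f})")
```

Comparing the normalised rates rather than the raw counts matters here because CC is a little larger than SC.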

However, let’s analyse what we’ve got in the table above. If we compare the results, we can see that CC contains more than half as many contractions again as SC, showing that Ceglia’s style tends toward contracting two words into one. By contrast, SC shows over four times the disposition for using elaborations[note]. This table doesn’t tell us where in SC these usages have occurred, though, and if only parts of SC have been altered, then there may be interesting patterns, or ‘hotspots’ in the results. Here’s a map of the number of contractions/elaborations in each SC email:

SC 1 2 3 4 5 6 7 8 9 10 11 Tot.
Cont. 3 1 2 0 1 2 3 1 4 1 4 22
Elab. 1 4 2 7 0 5 3 0 4 3 5 34

Interestingly, emails 6 and 9, which were both previously mentioned for being the only emails with apostrophe errors, also both contain an equal or greater amount of elaborated constructions, but at the same time, email 9 has one of the highest rates of contractions too. Another notable result is that email 4 (one of the longest emails, at 159 words) contains the highest number of elaborations, and no contractions at all. (Emails 5 and 8, incidentally, are only eleven, and twenty-three words long respectively, therefore the chance for elaborations is minimal.) Again, we should be careful not to over-interpret these results. They are a guide only, since the way in which we can use contractions is often limited by context and convention.
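A rough way of scanning the map for hotspots is sketched below. The per-email counts are copied from the map; the variable names, and the choice to flag emails with no contractions at all, are mine:

```python
# Per-email counts for SC emails 1 to 11, copied from the map above.
contractions = [3, 1, 2, 0, 1, 2, 3, 1, 4, 1, 4]
elaborations = [1, 4, 2, 7, 0, 5, 3, 0, 4, 3, 5]

total_contractions = sum(contractions)
total_elaborations = sum(elaborations)

# Emails in which elaborations are at least as common as contractions.
elab_heavy = [
    email
    for email, (c, e) in enumerate(zip(contractions, elaborations), start=1)
    if e > 0 and e >= c
]

# Emails that use elaborations but no contractions at all.
no_contractions = [
    email
    for email, (c, e) in enumerate(zip(contractions, elaborations), start=1)
    if c == 0 and e > 0
]

print("Totals:", total_contractions, "contractions,", total_elaborations, "elaborations")
print("Elaboration-heavy emails:", elab_heavy)
print("Emails with no contractions:", no_contractions)
```

Summing the rows is a useful sanity check in its own right, since the per-email counts should agree with the 22 contractions and 34 elaborations given for SC in the larger table.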

The short summary is that Ceglia’s style appears to favour three particular aspects: (a) using contractions more than half as much again as SC, (b) missing out roughly one in every seven apostrophes, and (c) using only about one quarter of the elaborated constructions that SC does. Therefore, whilst (b) makes his style similar to some parts of SC (i.e. emails 6 and 9), it also differentiates him from other parts, and would support the notion that if Ceglia did engage in fabrication, it was probably not of the entire SC, but only of selected parts.

3.2 – Ellipses
Two deviations of three-dot ellipses/suspension points which contain spaces (e.g. . . .), or are abutted by spaces (e.g. I can …) are noted by McMenamin in SC, versus six in ZC which are all used without spaces either inside of, or around them (e.g. boxes…there). However, in CC, there are three instances of two-dot suspension points occurring in two different emails which are consistently kept together, and abutted by spaces on either side:

It doesnt to me .. I am wondering (Doc39: 40, or CC email 2)
my time is very thin .. Let’s get it (Doc39: 40, or CC email 4)
especially the parents .. they deserve (Doc39: 46, or CC email 4)
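The distinctions being drawn here (how many dots, whether there are spaces inside the run, and whether the run is abutted by spaces) can be made mechanical. The following is a rough sketch only: the pattern and the function classify_ellipses are my own assumptions about how one might code these features, not a description of how McMenamin or I actually counted them.

```python
import re

# An optional whitespace character either side of a run of two or more full
# stops (optionally separated by single spaces), or a single "…" character.
ELLIPSIS = re.compile(r"(\s?)(\.(?: ?\.)+|…)(\s?)")


def classify_ellipses(text: str) -> list[dict]:
    """Describe each run of suspension points found in text."""
    found = []
    for match in ELLIPSIS.finditer(text):
        before, dots, after = match.groups()
        found.append({
            # The single-character ellipsis is treated as three dots.
            "dots": 3 if dots == "…" else dots.count("."),
            "internal_spaces": " " in dots,
            "abutted_by_space": bool(before or after),
        })
    return found


for sample in ("boxes…there", "It doesnt to me .. I am wondering", ". . . I've been"):
    print(repr(sample), classify_ellipses(sample))
```

A single full stop at the end of a sentence is deliberately not matched, since the pattern requires at least two dots or the dedicated ellipsis character.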

As I stress in other parts of this discussion, so few examples in such a small dataset does not a trend make, but what we can safely say is that whilst the irregular suspension points in SC may not match those in ZC, they certainly don’t match those in CC either. That would suggest either that Zuckerberg’s style is more varied than ZC suggests, that Ceglia’s style is more varied than CC suggests, or that someone else introduced those suspension points. And herein lies a small issue. The examples given in Doc39 look like this:

“On July 30, 2003, Zuckerberg sent an email to Ceglia informing Ceglia that:
. . . I’ve been tweaking the search engine today [referring to the project] and I’m pleased with its results.” (Doc39: 32, or SC email 1)

“Zuckerberg replied: “I’ll just get this site online as quickly as I can”” (Doc39: 41, or SC email 5)

In academia (as actually occurs in the first email above), it is common to demarcate alterations or additions to the original text via square brackets, so that the reader is aware that these are non-original elements which have been added in afterwards by a third party. However, there is variation in how deletion is handled. Some people place three dots within square brackets to indicate that something is missing, but others omit the square brackets and rely on the dots alone. Normally, this wouldn’t be an issue, but then, normally documents don’t get forensically analysed in multi-million dollar lawsuits, and when Ceglia submitted Doc39, he or his lawyers may not have known that those emails were going to be subject to a forensic stylistic analysis. If so, he/his lawyers also wouldn’t have been aware of the need to have any meta-markup very clearly defined, so I can’t help but wonder if those dots were put in by Ceglia/his team, innocently indicating that earlier text has been omitted? Kind of embarrassing if so, eh?

3.3 – Spelling
Exhibits B3, B4 and B5 are described by McMenamin as dealing with spelling. (More accurately, they deal with formatting and compounding, but there you go.)

Feature        CC (1,207 words)   SC (1,079 words)   ZC (? words)
B3: back end   –                  0.092% (1)         –
B3: backend    0.082% (1)         –                  ??% (6)
B4: internet   –                  0.092% (1)         –
B4: Internet   –                  –                  ??% (2)
B5: can not    –                  0.092% (1)         –
B5: cannot     –                  –                  ??% (6)

And now we return to the discussion of significance, and its intersection with corpus size. McMenamin seems to suggest that Zuckerberg normally title-cases Internet, and normally compounds backend and cannot – otherwise, why would alleged deviations from these appear in the analysis? However, this presupposes that McMenamin has derived a firmly-established norm from ZC, based on numerous examples demonstrating little or, preferably, no variation, i.e. it would suck for McMenamin’s analysis if it ever came to light that Zuckerberg occasionally switched between Internet and internet or randomly spaced out words which are more commonly written as compounds. Given that we don’t know the size of ZC (I’m guessing that it isn’t a six-, or even a five-figure corpus) we are left in the dark about the validity of any extrapolations taken from it. What we do know, however, is the size of SC, and the norms that can be reasonably extrapolated from this.

In kind words, SC is small. It features only 396 different words altogether (not the wordcount, but the number of distinct words that occur, properly known as types; the running words that make up the wordcount are tokens). Of all these words, 245 only occur once[note]. That’s 22.7% of the whole SC where relevant and important words (e.g. amount, build, completed, discuss, expenses, funding, launch, owed, results, server, students, users, working, etc.) only occur once. It is not only reasonable, but expected that if SC was bigger, these words – including words like backend, internet, and cannot – would occur more often, and with the increased frequency, would come the increased chance for variation.
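For readers unfamiliar with these counts, the sketch below shows one way of obtaining a wordcount (tokens), the number of distinct words (types), and the words that occur only once. The tokenising rule is an assumption of mine, and a different rule (for hyphens, numbers, or apostrophes, say) would give slightly different figures, so it should not be expected to reproduce the 396 and 245 above exactly. The sample sentence is loosely modelled on SC email 9 rather than quoted from it.

```python
import re
from collections import Counter


def lexical_profile(text: str) -> dict:
    """Count tokens (running words), types (distinct words), and one-off words."""
    # Assumed rule: lower-cased runs of letters/digits, with an optional
    # apostrophe plus letters so that contractions stay in one piece.
    tokens = re.findall(r"[a-z0-9]+(?:['’][a-z]+)?", text.lower())
    counts = Counter(tokens)
    once_only = sorted(word for word, n in counts.items() if n == 1)
    return {"tokens": len(tokens), "types": len(counts), "once_only": once_only}


sample = "I can not risk the site. The site is cool as it is."
profile = lexical_profile(sample)

print(profile["tokens"], "tokens;", profile["types"], "types")
print("Occurring once:", profile["once_only"])
```

Even in this tiny sample, most of the distinct words occur only once, which is the same pattern, on a smaller scale, as the one described for SC.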

McMenamin might argue with regard to can not and back end that he is actually citing a general habit of spacing compounds, and that we should therefore count can not and back end as two examples of the same feature, rather than as single examples of two different features. However, there are other examples of unspaced compounds in SC (e.g. upperclassmen, roommate), weakening any habitual compound-spacing argument. At best, all SC demonstrates is variability in compound-spacing, rather than a fixed habit. In short, whether we view these as individual examples of different features, or multiple examples of a habit, B3, B4, and B5 effectively prove very little. Making an argument based on a single example, which by its very nature can only appear in one form in a corpus, is worrying, and the entire ‘spelling’ section in Doc50 (which is actually more about formatting, compounding, etc.) is probably worth quietly excising.

McMenamin also discusses single-word sentence openers, deixis, and if-clause commas in a way that I find equally as unconvincing for different reasons, but since trawling over all that with the same fine-toothed comb would drive us all nuts, I am going to focus on the last element that I think does have promise, namely, run-on sentences.

3.4 – Run-on sentences
McMenamin notes that in ZC, there are, “No run-on sentences”[note], meaning, I guess, that ZC is 100% run-on-free. Analysis of SC reveals that out of all fifty sentences[note], 20% (10) are run-on, and the remaining 80% (40) are correct. Meanwhile, out of sixty-six sentences in CC, 42.424% (28) are run-on (i.e. more than double the proportion found in SC), and the remaining 57.575% (38) are correct. Again, however, it is important to note where the run-on sentences in SC occur, since this may guide us towards interesting hotspots:

SC 1 2 3 4 5 6 7 8 9 10 11 Tot.
Run-on 1 1 0 0 0 2 1 1 3 0 1 10
Correct 3 8 3 5 1 2 4 2 4 3 5 40

Once again, emails 6 and 9 have been flagged up for having the highest numbers of run-on sentences, whilst emails 2, 4, and 11 show high numbers of correct sentences. As with the contractions, though, we have to take this as a guide, and not an absolute. If an email only contains one sentence, it’s impossible for it to run-on (cf. email 5). Another issue is that, like apostrophes, we are talking about the difference of a button press, and added to this, the full-stop and comma are right next to each other. Further, the run-on habit is particularly common – we are not talking about some genuinely peculiar stylistic choice that very few others use. However, habit is something to bear in mind. Someone with an ingrained habit for producing correct sentences is unlikely to produce run-ons, but such a general observation doesn’t make for good evidence in court!
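Since raw counts favour longer emails, one further check is to look at the share of run-on sentences within each email. The sketch below does this with the counts from the table above. Ranking by share is my own choice of illustration, and with so few sentences per email it is a guide at most:

```python
# Per-email sentence counts for SC emails 1 to 11, copied from the table above.
run_on = [1, 1, 0, 0, 0, 2, 1, 1, 3, 0, 1]
correct = [3, 8, 3, 5, 1, 2, 4, 2, 4, 3, 5]

# Overall proportions: SC from the table, CC from the counts in the text.
sc_rate = 100 * sum(run_on) / (sum(run_on) + sum(correct))
cc_rate = 100 * 28 / 66

# Share of each email's sentences that are run-on.
share = {
    email: r / (r + c)
    for email, (r, c) in enumerate(zip(run_on, correct), start=1)
}
top_two = sorted(share, key=share.get, reverse=True)[:2]

print(f"Run-on rate  SC: {sc_rate:.1f}%  CC: {cc_rate:.1f}%")
print("Highest run-on share:", top_two)
```

On these figures, emails 6 and 9 come out on top by share as well as by raw count, so their prominence is not simply an effect of their length.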

3.5 – Register
Something that McMenamin doesn’t consider is the register (i.e. the formality, ‘tone’, lexical choices, etc.) of the emails, and there are some notable differences within SC itself. Register can be determined by a number of factors, including emotional state, how much of a rush we are in, the topic, and so forth, but one generally clear aspect is that whilst someone capable of a high register can use a lower one, someone who is only capable of a lower register will struggle to use a higher one.

A higher register generally calls for greater use of standard English, lower-frequency lexis (e.g. equine, abnegate, inebriated), more complex syntactic structures, more formal language, an impersonal style, and (no surprises) elaborations as opposed to contractions. A lower register, on the other hand, may be created by using higher-frequency words (e.g. horse, give up, drunk), simpler syntactic structures, contractions, a more personal style, and more informal language. (This isn’t an exhaustive list, by the way.) The trouble with some of these is that they’re fairly subjective – what one might find formal and distancing another might consider neutral and expected, so after analysing SC, I’ve tried to pull out only those instances that seem most clearly formal or informal. There’s plenty of scope for argument, so feel free to disagree:

SC Higher register Lower register
1 Would it be agreeable with you if … ? I’ve been tweaking the search engine
2 The issue we must resolve is how to produce a revenue stream … My conclusion this past week is to … we could, as you suggested, rapidly expand
4 I just wanted to extend to you a Happy New Year … I think it is unnecessary at this point, with all of the extra work I have done for you, to hold me to the original completion date … thereby delaying my start … Thus, I am requesting a written waiver on your part exempting me from the obligation to give you additional ownership in the project that is outlined in our original contract
6 just because I am young doesnt mean I’m afraid of my parents response … they would probably just laugh you off anyway
7 I have a rather serious issue to discuss with you I don’t even think it is legal to charge such a huge penalty
9 Now that the sites live I feel I must take creative control and I just can not risk injuring my sites reputation by cheapening it with your idea of selling college junk, nor do I wish to spend my time shipping out coffee mugs to rich alumni. The site is cool as it is … That is money I am entitled to and is rightfully mine

There are some remarkable contrasts in tone and style, and once again, emails 6 and 9 are flagged up as seeming to be similar to each other, but different to the rest of SC. However, to gain a fuller picture, we need to compare the above table with CC:

CC Higher register Lower register
1 I like your thinking … The minute I did it I just knew that is what I was waiting for … … sweatshirts, mugs, t shirsts and stuff
2 just a quick question … causing us some grief … I guess I am somewhat torn … does that make sense to you? It doesnt to me ..
3 you can’t just take a persons investment and then spend it on women and beer or whatever you do up there in Harvard. I’ve been stalled long enough on this thing and if I don’t see something soon I’ll have no choice but to contact the school and perhaps your parents in Dobbs Ferry and let them know whats been going on
4 OK fine Mark 50/50 just as long as we … without having to do the un PC thing of asking someone face to face … 🙂 … why couldnt we have the … local pizza and chinese or whatever
5 Congrats Mark! … Just wondering if we might think of another title for it without the the, but plenty of time for that, I’ll try and think of some names … you know another thing I’ve been thinking
6 you think in your head that an Ok way to act is to just say- oh I’ve changed my mind I don’t think it’s cool to make money and that that should be that … It’s one thing to say [x2] … angel investors are just con men … and anyway at this point it’s just a freaking harvard thing.
7 You’ve got some nerve talking about me owing you with the CRIMINAL stunts you’ve pulled … dude, you stole code, not once, not twice but THREE TIMES! … Grow up, take a fucking ethics class, choke yourself with that silver spoon of yours.

Ceglia’s style is particularly informal throughout, featuring anything from smileys and taboo language to creative punctuation (???, !, .., etc.), and he tends to employ a mental/journal-narrative style, where he confides his thought-processes, asks rhetorical questions, and uses colloquialisms. In fact, I couldn’t really find anything in CC that could be described as higher-register in the way that appears in SC. What we might summarise from this section then, is that emails 6 and 9 are once again closer to Ceglia’s style than to the rest of SC, and whilst these may not be the most incriminating emails of the bunch, the questioned provenance of the whole SC, coupled with the marked differences that SC demonstrates, altogether throws up some interesting questions.
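For anyone who would rather count than eyeball, the more mechanical register markers (contractions, elaborated forms, missing apostrophes) can be tallied automatically. What follows is only a minimal sketch in Python: the function name `register_markers` and the three marker lists are my own assumptions for illustration, they are far from exhaustive, and they say nothing about lexis, syntax, or tone.

```python
import re

# Contracted forms such as I’m, don't, I’ve (straight or curly apostrophe).
# The 's ending is deliberately left out so possessives are not counted.
CONTRACTION = re.compile(r"\b\w+[’'](?:t|m|ve|re|ll|d)\b", re.IGNORECASE)

# A small, assumed list of elaborated forms that could have been contracted.
ELABORATED = re.compile(
    r"\b(?:I am|I have|I will|it is|that is|do not|does not|did not|"
    r"is not|can not|cannot)\b",
    re.IGNORECASE,
)

# A small, assumed list of contractions written without their apostrophe.
MISSING_APOSTROPHE = {
    "doesnt", "dont", "didnt", "isnt", "wasnt",
    "couldnt", "wouldnt", "shouldnt", "whats", "thats",
}


def register_markers(text):
    """Count three crude register markers in a piece of text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {
        "contractions": len(CONTRACTION.findall(text)),
        "elaborated": len(ELABORATED.findall(text)),
        "missing_apostrophes": sum(w in MISSING_APOSTROPHE for w in words),
    }
```

Feeding it an excerpt from each email (for instance the SC email 6 line quoted in the table above) gives one small dictionary of counts per email, which makes it easier to see which emails cluster together. With excerpts this short, though, the counts are a prompt for closer reading rather than a result.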

3.6 – Word documents
One of the elements of the story that particularly caught my attention was the fact that the emails were not pulled from a server but, according to Doc61, had been copied and pasted into three separate Word documents and saved on one of 169 floppy disks that were handed over to Evans at Project Leadership Associates (along with 1,075 CDs, a laptop, and a hard drive image).

Evans gives details about these documents which suggest authenticity:

Mark harvard emails up to dec.doc
Date created: 30/12/03. Last revision: 30/12/03. (Doc61: 9)
Note: this date falls after SC email 3, before CC email 3.

mark feb emails.doc
Date created: 14/02/04. Last revision: 14/02/04. (Doc61: 10)
Note: this is the same day as SC email 7. Since times are not given for either the emails or the document creation/revision, it isn’t possible to say whether email 7 could/should (not) be in this document.

Mark emails july04.doc
Date created: 23/06/04. Last revision: 23/06/04. (Doc61: 11)
Note: this is the day after SC email 11.

I don’t doubt, however, that if Evans had been working for the defence, he would have blithely reeled off the many ways in which one can modify or falsify a document’s metadata[ref][ref]. This aside, there are several interesting points to note about Evans’ discussion of the documents. For example, he writes that,

“A common[note] means to save the contents of an MSN Hotmail webmail message, is to copy the message into a text editing software program, such as Microsoft Word.” (Doc61: 7)

We are left to gather from this that Ceglia was sending/receiving his email via Hotmail and not a desktop email client. Further, we are left with the implication (not the assertion) that since Hotmail doesn’t allow users to download email, Ceglia was retrospectively copy/pasting his emails into Word documents. I could, of course, wax lyrical on this point. Did Ceglia save emails like this prior to this time? Afterwards? Did he do this for all his contacts? Or just for Zuckerberg? If so, why? And if not, then what prompted him on December 30th, after SC email 3, to think that there was any particular need to start saving Zuckerberg’s emails, since everything at that stage seemed to be going okay?

Anyway, those questions could go on forever, and there are better quotes from Evans’ Declaration, like this one:

“One of the floppy disks in PLA’s custody, as collected and imaged from Mr. Ceglia, contains three Microsoft Word documents containing what I understand to be email communications between Mr. Ceglia and Mr. Zuckerberg relating to the issues in this case.” (Doc61: 8)

Curious. Mr Evans, with his BA in Computer Science, his extensive knowledge of software and internet technologies, his experience in identifying and analysing computer-related evidence, and his involvement in over 100 investigations concerning digital evidence (Doc61: 4), who is prepared to call a floppy a floppy, and a program a program, does not seem willing to commit himself to calling the contents of those Word documents emails. They are, in his own rather lukewarm terms, “what I understand to be email communications”.

And here is my final big thought of the article. Going online and modifying the contents of an email saved on a webmail server like Hotmail would take all kinds of hacking expertise which Ceglia seemingly didn’t have (why hire an 18/19-year-old student to do coding if he had that much talent himself?), and it would run any number of serious legal risks. However, modifying the contents (and, one might add, the metadata) of a Word document is the risk-free work of a moment, and here are some final interesting coincidences I noticed about the emails:

“[…] you would be seriously violating our trust by doing so, I have done what I can with the small amount of money you have invested and I will have something live for you to view soon.” (Doc39: 38, or SC email 6)

If we consider that ZC (according to McMenamin) contained absolutely no run-on sentences, it is interesting to see that the run-on that does occur in this email lands just before a sentence which acknowledges a financial investment in the project. Here’s another one:

“Paul, I have a rather serious issue to discuss with you, according to our contract I owe you over 30% more of the business in late penalties which would give you over 80% of the company.” (Doc39: 45, or SC email 7)

Another run-on directly before an acknowledgement of Ceglia’s 50% ownership… But wait. Here’s yet more:

“[…] Another summer is here and I still don’t have any time to build our site, I understand that I promised I would, but other things have come up and I am out in California working during break.” (Doc39: 55, or SC email 11)

A run-on before an acknowledgement of a promise. Whoda thunk. Anyway, does this make for legal evidence? No, of course not, but then, this is a blog post, not a courtroom, so I can speculate far more freely on the case than McMenamin, Evans et al. can.
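Spotting run-ons by eye is easy enough with three sentences, but across a larger corpus a rough automatic flag would help. Here is a crude sketch in Python of a comma-splice heuristic. Everything about it is my own assumption (the name `count_comma_splices`, the pronoun list, the conjunction lists): it flags a comma only when the text on both sides contains a subject pronoun followed by another word, the right-hand side doesn’t open with a coordinating conjunction, and the left-hand side doesn’t open with a subordinator. It will miss splices that don’t involve pronouns and it will throw up false positives, so anything it flags still needs a human look.

```python
import re

# A pronoun followed by another word: a very rough stand-in for "this
# stretch of text contains its own subject and verb".
CLAUSE = re.compile(r"\b(?:I|we|you|they|he|she|it)\s+[a-z’']+", re.IGNORECASE)

COORDINATORS = {"and", "but", "or", "nor", "so", "yet", "for"}
SUBORDINATORS = {
    "if", "when", "because", "although", "since",
    "while", "as", "after", "before", "unless",
}


def first_word(segment):
    """Return the first alphabetic word of a segment, lowercased."""
    words = re.findall(r"[A-Za-z’']+", segment)
    return words[0].lower() if words else ""


def count_comma_splices(text):
    """Count commas that appear to join two independent clauses."""
    count = 0
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        segments = sentence.split(",")
        for left, right in zip(segments, segments[1:]):
            if not (CLAUSE.search(left) and CLAUSE.search(right)):
                continue  # one side has no pronoun-plus-word pattern
            if first_word(right) in COORDINATORS:
                continue  # ", but ..." is ordinary coordination
            if first_word(left) in SUBORDINATORS:
                continue  # "If ..., I ..." is ordinary subordination
            count += 1
    return count
```

Run over the three excerpts quoted above, this flags one comma in each, and it ignores both the vocative comma after “Paul” and the “, but” in SC email 11. That agrees with my reading, but three hand-picked sentences are no validation of the heuristic.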

4 – Conclusion
In my view, McMenamin’s analysis fails to uphold the forensic linguistic principle of distinctiveness with consistency. A feature that occurs once or twice in a small corpus cannot be considered a distinctive norm from which comparative deviations in other small corpora may be measured, since both the ‘distinctive feature’ and the ‘deviation’ could simply be the result of tiredness, rushing, annoyance, carelessness, a cat on the keyboard, distracting TV, or just plain old over-helpful auto-c’wreck.

Like any scientific field, forensic linguistics should strive to triangulate its results, and given that corpus linguistics excels at establishing consistency and distinctiveness both within and between datasets, there is an excellent argument for making corpus/statistics an extra method to work alongside the more traditional discourse analysis approach. Further, where the dataset is humble, so too should be the claims based upon it.
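To put a rough number on how humble those claims need to be, here is a minimal sketch in Python of a Wilson score interval for a proportion. The counts are hypothetical, chosen purely for illustration and not taken from McMenamin’s data or mine: a feature seen in 2 of 11 emails, against the same rate seen in 200 of 1,100. The function name `wilson_interval` is mine, and `z=1.96` is the usual approximation for a 95% interval.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical counts, for illustration only.
small_low, small_high = wilson_interval(2, 11)
large_low, large_high = wilson_interval(200, 1100)
print(f"2 of 11:     {small_low:.2f} to {small_high:.2f}")
print(f"200 of 1100: {large_low:.2f} to {large_high:.2f}")
```

The observed rate is identical in both cases, but the interval for the eleven-email corpus is many times wider than the one for the larger corpus. That is the statistical version of the point above: a feature seen once or twice in a handful of emails tells us very little about how often the writer really uses it.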

Moving on, I actually did find some really interesting and intriguing elements in this data, and some distinct differences both within SC, and between SC and CC. For example, in SC, email 4 contains no contractions, no run-on sentences, and a notably higher register. In complete contrast, emails 6 and 9 both demonstrate missing apostrophes, the highest levels of run-on sentences, and the lowest register. In short, that makes these two SC emails far more like CC emails, and whilst that doesn’t mean that Ceglia wrote them, it does allow us to start to question whether the same person authored all of the emails within SC.

Though this is a very muted view compared to McMenamin’s assertion that “the KNOWN writings of Mr. Zuckerberg demonstrate a sufficiently significant set of differences […] to constitute evidence that Mr. Zuckerberg is not the author of the excerpted QUESTIONED references”, the potential impact is still as drastic. If my far more modest view is correct (that the authorship of some of the emails may be called into question), then Ceglia’s claim that these emails are written by Zuckerberg must be, at least in part, false, and if even part of the evidence turns out to be fabricated, then the whole house of cards comes down.

The irony, at the end of all this, is that without ZC, all we can really establish is how (un)like SC is to other parts of itself, and how (un)like SC is to CC, rather than the far more useful and pertinent question of how (un)like SC is to ZC. As mentioned above, what I have really done is spend a lot of words showing that forensic linguistics is much better at doubting authenticity than at proving it, and even what has been done above is, at best, superficial and brief. I could easily write a hundred thousand more words on this dataset, with or without access to ZC (though access would be nice… Anyone…?) but for now I think we must leave this one in the hands of the superbly well-paid lawyers and their expert witnesses.

5 – References/Sources
Should any of the links below die off, the easiest way to find PDFs on the case is to Google “Case 1:10-cv-00569” and check the results-previews for document numbers. (I recommend Doc62 – forensic document analysis, and Doc66 – forensic ink analysis).

Language Log (2011) High-stakes forensic linguistics
News and Insight (2011) Doc61
New York Times (2011) Decoding Your E-Mail Personality
Scribd (2011) Doc50
Wired (2011) Doc39

6 – Postscript
17/08/11: The original contract has come under scrutiny[ref][ref][ref]. Ruh roh!
18/08/11: I stuck my neck out and emailed Facebook to see if I might be allowed a copy of the thirty-five Zuckerberg emails.
25/10/15: Facebook never replied. I know. I’m as stunned as you are.