CONTENT RATING: UNIVERSAL
In 2010, Paul Ceglia sues Mark Zuckerberg for half of Facebook. In response, Zuckerberg calls in a forensic linguist. Below you will find credits, sources, and a transcript.
References, sources, and more
The data, including Doc 39 and Doc 50
Wikipedia: Paul Ceglia
Wikipedia: Mark Zuckerberg
Ars Technica (bail)
New York Post
New York Times
Case S01E02 – Paul Ceglia.
It’s spring of 2003, in Cambridge, Massachusetts, in the US. April here has been a pretty chilly month, close to freezing for the most part except for a couple of unseasonably warm days.
In a Harvard University dorm room, a nineteen-year-old student is scrolling through Craigslist, looking for work. He spots an advert for website development, and responds. The advertiser seeking help with his website is then-27-year-old ex-teacher and businessman, Paul Ceglia. After some email conversation, in April 2003, Ceglia agrees to pay the student $1,000, sets a deadline, and writes up a contract.
Nine months later, in February 2004, the student launches a website, but Ceglia’s name is nowhere on it. This site will quickly become the biggest social media platform in the world, and the student will become one of the richest people on the planet. This student is, of course, Mark Zuckerberg, and the website is Facebook.
Seven years pass by. Zuckerberg has long since dropped out of Harvard and moved to California to focus on Facebook. It is a decision that will prove to be more than worthwhile.
Let’s consider 2010 alone.
In July of this year, Facebook will hit half a billion users.
In October, Vanity Fair’s annual list of the top 100 most influential people of the Information Age will put 26-year-old Zuckerberg in first place.
And in December, Time magazine’s annual list of the 100 wealthiest and most influential people in the world will include Zuckerberg for the first time, and he will continue to appear in this ranking every year afterward, to the present day.
On the surface, then, it would seem that Zuckerberg is literally a living embodiment of the American Dream. But behind the scenes, there are the first tremors of a potential nightmare.
In June of 2010, Ceglia files a lawsuit. In this suit, he claims that he paid Zuckerberg $1,000 for a project called “The Face Book” and that his contract entitles him to 50% of this site. He also claims that a further clause in the contract awards him an additional 1% interest in the site for every day it was late past a January implementation deadline. In total, Ceglia is suing for 84% of Facebook, which on those terms implies the original 50% plus thirty-four days’ worth of penalty. Zuckerberg, meanwhile, agrees that he did indeed do work for Ceglia at around this time, but on an entirely unrelated website known as StreetFax.
Ceglia, however, claims to have a receipt for the $1,000, an email trail, and a copy of a contract for “The Face Book” to prove his claim.
Surely no one would be so audacious as to invent such a wildly implausible story… would they?
If true, this would be a bombshell. Zuckerberg would likely be bankrupted overnight.
The world’s biggest and most powerful website, Facebook, would effectively change hands.
And Ceglia would be catapulted into unimaginable wealth and power.
Welcome to en clair, an archive of forensic linguistics, literary detection, and language mysteries. You can find case notes about this episode, including credits, links, and a transcript at the blog. The web address is given at the end of the podcast.
Context and history
During the trial, alongside plenty of other evidence from forensic digital analysis of computers, two types of linguistic evidence are presented by Ceglia – the contract, and the emails between Ceglia and Zuckerberg. We’ll focus just on the emails, and for good reason. In those online conversations between Zuckerberg and Ceglia, there are statements that further support the terms within the contract that Ceglia claimed to have with Zuckerberg. For instance, an email allegedly from Zuckerberg to Ceglia supposedly sent in February 2004, a few weeks after the site was meant to be implemented, states,
I’d like to suggest that you drop the penalty completely and we officially return to 50/50 ownership.
The following day, on the 03rd of February, Ceglia then supposedly replied,
Ok fine MArk 50/50 just as long as we start making some money from this thing.
Then on the 04th, Zuckerberg supposedly emailed,
Paul, thefacebook.com opened for students today.
The 04th of February is indeed the day that Facebook actually launched. Ceglia replies the same day with some congratulations and a suggestion to remove the initial “the” from the site name, along with some ideas for monetising the site through selling branded merchandise.
Two days later, on the 06th of February, Zuckerberg responds:
Sorry it’s taken me a few days to respond, (sic) Now that the sites (sic) live I feel I must take creative control and I just can not risk injuring my sites (sic) reputation by cheapening it with your idea of selling college junk, nor do I wish to spend my time shipping out coffee mugs to rich alumni. The site is cool as it is and I don’t care about making any money on it right now, I just want to see if people will use it. If I had the rest of the money I was owed by you for all that extra work I did I wouldn’t even need to make money at all on this site. That is money I am entitled to and is rightfully mine.
Two months later, on the 06th of April, Zuckerberg supposedly sends Ceglia another email:
Paul, I have become too busy to deal with the site and no one wants to pay for it, so I am thinking of just taking the server down. My parents have a fund that I can tap into for my college expenses and I would just like to give you your two thousand dollars back and call it even on the rest of the money you owe me for the extra work. At this point I won’t even really be able to work on the facebook until Summer.
There are several other emails besides, and you can read all of those that are available via links provided in the Case Notes. To save you some time, the bulk of the correspondence is contained in Document 39.
So far, it seems like Ceglia might have a pretty good case. But there’s a catch. Under normal circumstances, when dealing with emails, a forensic analyst would check the email headers – the information about when they were sent, what servers they were routed through, who they were sent to – and they would cross-reference this with the server logs at, say, Harvard University where Zuckerberg was. But Ceglia claims that he kept these emails saved in three different Word files. In other words, in some form or other, he was copying and pasting them into Word, but whatever he was doing, there are no headers. Still, this doesn’t automatically mean that they are not real. It just makes it harder for Ceglia to prove their authenticity.
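To make that header check a little more concrete, here is a minimal sketch in Python of the kind of information a header carries. To be clear, the message below is entirely invented for illustration (the example addresses, servers, and timestamps are my assumptions and come from no case document), and a real forensic examination would go much further, cross-referencing each hop against the relevant server logs.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# An invented message, NOT taken from the case files. Addresses and
# servers use reserved example domains and documentation IP ranges.
RAW_MESSAGE = """\
Received: from mail.sender.example ([192.0.2.10])
    by mx.recipient.example with ESMTP; Wed, 04 Feb 2004 10:15:02 -0500
Received: from dorm-pc.sender.example ([192.0.2.55])
    by mail.sender.example with SMTP; Wed, 04 Feb 2004 10:15:00 -0500
Date: Wed, 04 Feb 2004 10:14:58 -0500
From: student@sender.example
To: client@recipient.example
Subject: site update
Message-ID: <abc123@sender.example>

The site opened today.
"""


def summarise_headers(raw):
    """Pull out who, when, and via which servers from a raw email."""
    msg = message_from_string(raw)
    hops = []
    for received in msg.get_all("Received", []):
        # Received headers are folded across lines; flatten the whitespace.
        flat = " ".join(received.split())
        # By convention the hop's timestamp follows the last semicolon.
        route, _, stamp = flat.rpartition(";")
        hops.append((route.strip(), parsedate_to_datetime(stamp.strip())))
    return {
        "from": msg["From"],
        "to": msg["To"],
        "date": parsedate_to_datetime(msg["Date"]),
        "message_id": msg["Message-ID"],
        "hops": hops,  # newest hop first: each server prepends its own line
    }


summary = summarise_headers(RAW_MESSAGE)
for route, stamp in summary["hops"]:
    print(stamp.isoformat(), "|", route)
```

Text pasted into a Word file has none of this, which is why an analyst working from such a file has nothing to cross-reference.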
In response, Zuckerberg’s defence team hire a forensic linguist. Their job? Analyse the emails that Ceglia claims Zuckerberg wrote, and compare them with the language of emails known to be sent by Zuckerberg. Would the language be similar? Or would it be quite different?
Data and evidence
Facebook’s legal team hire forensic linguist Gerald McMenamin. At the time of this case being heard – that is, in 2011 – Gerald McMenamin is Professor Emeritus of Linguistics and former Chair of the Department of Linguistics at California State University, Fresno.
McMenamin’s approach to forensic linguistics is to use forensic stylistic analysis. What is this? Well, in simple terms, this uses a subdiscipline within linguistics known as stylistics to study variation in language. In essence, stylistics is interested in the choices we make at each point in a text. Take Case Notes, for instance. Do we choose to write this as two distinct words – Case … Notes? Or as a hyphenated word – Case-dash-Notes? Or as one complete word with neither spaces nor dashes in the middle? And do we make this choice every single time? Do we use are not, cannot, do not, or do we use aren’t, can’t, don’t? Do we write should have? Or should of? Just to stress, this isn’t about picking up errors or correcting people. This is about habits and preferences, or in short, it is about style. These choices come through in every level of our language from where we put in full-stops right up to how wordy we are in our… er… podcast scripts.
One feature alone doesn’t tell us very much, of course. There’s little point in only observing that you write while and I write whilst. There’s also little point in noting that I say knock-a-door-run and you say knock-knock-ginger. And there’s not much use in merely stating that I routinely don’t capitalise whereas you always capitalise beautifully. Any one of those features, alone, is not unique to either of us. (EEEther… Or is it EYEther…?!) Moving on, stylistics doesn’t care for one feature alone. Its interest is in collections of features that begin to characterise an individual’s overall style, or idiolect. The more of these you can find within a text, the more you build an idea of that person’s habitual choices.
Back to the case. McMenamin’s findings are outlined in a relatively short analysis contained in Document 50. As I’ve said, McMenamin’s task was a simple one: he was given thirty-five emails known to be by Zuckerberg to compare against eleven emails that Ceglia alleges Zuckerberg wrote. But these eleven disputed emails, as it happens, are sourced from the amended complaint – that is, Document 39. As McMenamin writes in his report:
I was […] asked to determine, to the extent possible, the authorship of a series of QUESTIONED writings excerpted into an Amended Complaint in this matter…
What’s the issue here? Well, it’s not entirely clear why McMenamin wasn’t given clean, proper copies of all the emails by Ceglia’s team, and why he instead had to rely on the ones in Document 39. Those emails have clearly been altered in a few different ways. Glosses have been inserted into some to make referents within the email clearer. Multiple times, (sic) has been added, presumably to indicate that some typo or other is an original feature of that email. And one time, just after paragraph 32, the email starts with a three-dot ellipsis, which usually indicates that some preceding text has been cut off, perhaps because it hasn’t been deemed relevant. The email also reads as though some prior text is missing. Similarly, Paragraph 41 contains a short example from an email that ends with an ellipsis. It is very unclear in both cases where those ellipses have come from. I’ll come back to this shortly.
On the other side of the fence, we don’t get much information about the thirty-five known Zuckerberg emails. We find out that they were all from the same time period, and that some were ones that Zuckerberg did indeed send to Ceglia, but about the others, all we are told is that they went to “related parties”. What their purpose was, who they were sent to, how formal or informal they were, what device they were composed on, how long they are on average – all of these factors are not described, and I’ll also come back to those issues in a little while.
Overall, then, McMenamin had thirty-five known Zuckerberg emails to use as a comparison corpus against the eleven questioned Zuckerberg emails that had been extracted from Document 39. If you scroll down Document 50 to Exhibit B, McMenamin lays out eleven features that he focussed on – two issues of punctuation, three of spelling, five of syntax, and one feature at the discourse level. I’ll go through most, but not all of them. So what did McMenamin find?
- Zuckerberg consistently uses apostrophes in contractions and possessives correctly, but in the emails produced by Ceglia, there were four instances where an apostrophe was missing from a word that needed one.
- Zuckerberg writes the three-dot suspension point – the ellipsis – with no spaces between the dots, nor between the dots and the word that comes directly beforehand. However, one ellipsis in Ceglia’s emails has spaces between each dot, and the other comes after a space.
- Zuckerberg writes terms like backend and frontend and also cannot as one word, whereas in one case, in the disputed emails, backend is split into two words, whilst cannot is split into two multiple times.
- Zuckerberg capitalises the word Internet twice, but the one time this occurs in the disputed emails, it is not capitalised.
- Zuckerberg never produces a run-on sentence, but the disputed emails contained at least nine examples.
- Zuckerberg frequently opens sentences with words like Okay, And, Anyhow, Also, But, Then, However. The disputed emails instead used words like Further, Additionally, Thus, Again, First, Mostly, Paul.
- Zuckerberg consistently uses commas after an if-clause. In fifteen possible places where these can occur, they are present thirteen times. In the disputed emails there are three possible places for commas to occur after if-clauses, and they never do.
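If you wanted to count a few of these markers yourself, a rough sketch in Python might look like the following. This is my own illustration and not McMenamin’s method: the marker names and regular expressions are assumptions, they cover only the simpler spelling and punctuation features from the list above, and the sample text is the disputed 06th of February excerpt quoted earlier with the (sic) glosses removed. Missing apostrophes and run-on sentences are deliberately left out because they need human judgement (or a proper parser) to identify reliably.

```python
import re

# Each entry is ONE variant of a stylistic choice. The patterns are
# deliberately simple and illustrative, not a validated feature set.
MARKERS = {
    "cannot_one_word": re.compile(r"\bcannot\b", re.IGNORECASE),
    "cannot_two_words": re.compile(r"\bcan not\b", re.IGNORECASE),
    "backend_one_word": re.compile(r"\bbackend\b", re.IGNORECASE),
    "backend_two_words": re.compile(r"\bback end\b", re.IGNORECASE),
    # Note: a sentence-initial "Internet" is capitalised regardless of habit.
    "internet_capitalised": re.compile(r"\bInternet\b"),
    "internet_lowercase": re.compile(r"\binternet\b"),
    # Three dots directly after a word, with no spaces anywhere.
    "ellipsis_tight": re.compile(r"(?<=\w)\.{3}(?!\.)"),
    # Dots separated by spaces, or three dots preceded by a space.
    "ellipsis_spaced": re.compile(r"\.\s\.\s\.|(?<=\s)\.{3}"),
}


def count_markers(text):
    """Return how many times each variant occurs in the text."""
    return {name: len(pattern.findall(text)) for name, pattern in MARKERS.items()}


# Excerpt from the disputed email quoted earlier, (sic) glosses removed.
QUESTIONED_EXCERPT = (
    "Now that the sites live I feel I must take creative control and "
    "I just can not risk injuring my sites reputation"
)

counts = count_markers(QUESTIONED_EXCERPT)
for name, n in counts.items():
    if n:
        print(name, n)
```

Run over both the known and the questioned sets, counts like these give you the raw material for a comparison, though on their own they say nothing about how many chances each variant had to occur.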
Aside from these differences, like any good forensic linguist, McMenamin also observes points of similarity, and notes two.
- Both the known and disputed sets of emails each include an instance of an email finishing with the word Thanks!
- And one of the disputed emails started with the word Sorry once, whilst emails known to be by Zuckerberg started with the word Sorry four times.
What can we say about this analysis? Well, we’re obviously not going to get every tiny bit of detail in a report aimed at the court. The court usually only wants clean, clear results. They don’t want all the lengthy, minute, extended detail that can surround those results. If there’s something to be challenged in the methodology or the execution or whatever, it’s for the cross-examining lawyer to investigate if and when the forensic linguist takes the stand. However, just on a read through of this there are definitely things I would ask more questions about.
Firstly, let’s take the capitalisation of the word internet. Plenty of software has a penchant for autocorrecting perceived errors in text, including “fixing” (I’m saying this in quote marks) things that are otherwise fine. In fact it’s such a common issue that there are whole websites dedicated to autocorrects that have gone horribly, or hilariously, wrong. Zuckerberg’s known emails could well have been composed using an email editor that does precisely this, and if we put him on another device that doesn’t do this and made him type internet, we might discover that he actually never capitalises this word. What we could have here is interference from the device used to compose the message. In fact, depending on the autocorrects and the message, it is sometimes possible to tentatively infer operating systems or devices that have been used to compose a message.
A second issue is the sheer opportunity for occurrence. This is not only an issue of length, since longer texts will generally offer more chances for a feature to crop up. It’s also one of context. For instance, let’s consider the four missing apostrophes in the disputed emails versus the known emails where every apostrophe occurs as it should. It’s extremely unlikely, of course, but imagine that there are only two possible instances for those apostrophes to occur in Zuckerberg’s actual emails. Perhaps he very infrequently contracts words like I am to I’m or she would to she’d and instead prefers to write them out fully, reducing the overall requirement for apostrophes. Getting them all right if they don’t occur very often is not beyond the realms of chance. These are contextual issues that are left out of the description of the known Zuckerberg dataset. In short, to really understand McMenamin’s analysis and results, we also need a sense of how often the feature could have occurred, versus how often it did or did not occur in each dataset.
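The one feature where we do get both numbers is the comma after an if-clause: thirteen out of fifteen opportunities in the known emails, and none out of three in the disputed ones. As a rough sketch of why opportunity matters, here is how you might turn those counts into rates and a simple chance estimate in Python. I should stress that this calculation is mine and does not appear in McMenamin’s report, and that it assumes every opportunity is independent of the others, which is almost certainly too generous for real writing.

```python
from math import comb


def rate(present, opportunities):
    """Share of opportunities in which the feature actually occurred."""
    return present / opportunities


def chance_of_so_few(k_present, n_questioned, total_present, total_slots):
    """One-sided hypergeometric tail (the logic behind Fisher's exact test).

    If both sets came from one writer, and the total_present commas were
    scattered at random over total_slots opportunities, how likely is it
    that the n_questioned slots would contain k_present commas or fewer?
    """
    ways = sum(
        comb(total_present, k) * comb(total_slots - total_present, n_questioned - k)
        for k in range(k_present + 1)
    )
    return ways / comb(total_slots, n_questioned)


# Counts as reported for the comma after an if-clause.
known_present, known_slots = 13, 15
questioned_present, questioned_slots = 0, 3

known_rate = rate(known_present, known_slots)
questioned_rate = rate(questioned_present, questioned_slots)
p = chance_of_so_few(
    questioned_present,
    questioned_slots,
    known_present + questioned_present,
    known_slots + questioned_slots,
)
print(f"known: {known_rate:.0%}, questioned: {questioned_rate:.0%}, chance: {p:.3f}")
```

With only three opportunities in the disputed emails, a single extra comma would change the picture considerably, which is exactly why the number of opportunities needs reporting alongside the number of occurrences.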
And thirdly, remember those three-dot suspension points – the ellipses? There were two in the questioned emails and they crop up in McMenamin’s feature list. However, when we look at them in Document 39, as I mentioned, one occurs at the start of an email and looks like it has been inserted to signal that some previous part of the email had been cropped for brevity, whilst the other occurs at the end of a very short email quote, and could have been inserted to mean the same thing. If so, this feature is actually noise and interference from whoever put Document 39 together. This takes us back to the issue of McMenamin not being given a clean, clear set of data to work with.
To finish off, however, it’s vital to return to the notion that forensic stylistic analysis works on the basis of combining a multitude of features together. No single feature in this list makes a case alone, since each is relatively weak in isolation, but therein lies the strength of the analysis. If one feature is discounted, there are – or should be – a host of others still, and it is in that constellation of features, rather than in any single choice, that we find an individual’s overall style.
In this case, the points of similarity – the sorry and the thanks – are extremely weak. You could find thousands, if not hundreds of thousands, of people all doing these exact same things in their emails right now, around the world. By contrast, the pattern of differences, which ranges from run-on sentences to comma- and apostrophe use and more besides, is relatively compelling. These results can’t tell us who did author the eleven questioned emails, but McMenamin’s comparison makes a good case that across a range of dimensions, those eleven emails produced by Ceglia don’t match the thirty-five emails known to be by Zuckerberg.
To return to the point, McMenamin’s overall conclusion is unsurprising. In his expert report, Document 50, he gives his opinion thus:
Based on the contrastingly-distinct style markers which the QUESTIONED excerpts and the KNOWN-Zuckerberg writings demonstrate, as well as the presence of no more than two minimally-significant similarities between the QUESTIONED and KNOWN-Zuckerberg writings, I conclude that the KNOWN writings of Mr. Zuckerberg demonstrate a sufficiently significant set of differences vis-à-vis the QUESTIONED writings to constitute evidence that Mr. Zuckerberg is not the author of the excerpted QUESTIONED references.
From Ceglia’s perspective, the case all very quickly starts to fall apart. In rapid succession, he is fired by a raft of law firms that he has hired to represent him, including DLA Piper, Connors & Vilardo, and Lippes Mathias Wexler Friedman. Whether spurred on by McMenamin’s forensic linguistic report, or because of their own discoveries in the interim, some of these firms directly cite the fact that Ceglia is using forged evidence as their reason for dropping him. In turn, Facebook sues a number of Ceglia’s past legal representatives, alleging that these lawyers knew, or should have known, that Ceglia was a conman, that his lawsuit was a malicious fraud, and that he was using forged documents.
In late 2012 Ceglia is arrested and charged. A GPS tracker is fitted to his leg to monitor his whereabouts and his bail is set at $250,000. Ceglia’s mother, father, and brother become guarantors for the bail bond. For those who are not clear what this means, if Ceglia breaches his bail conditions, he will land his family with a debt of a quarter million dollars.
For a few years, at least, this seems to work, but then, in early 2015, the authorities become suspicious that something has happened. Agents break into his home and, according to reports in the media, find his GPS tracker strapped to a rotating ceiling fan. Ceglia’s leg, and the rest of him, are nowhere to be found. It seems that Ceglia has somehow managed to get the tracker off, but perhaps suspicious that it will send out an alert if it doesn’t move enough, he has found a method to fool it into thinking that it is still being worn. This has then bought him enough time to abscond with his wife, two sons, and even his Jack Russell, Buddy.
A warrant is issued for Ceglia’s arrest and reward money is offered, but he appears to have vanished without a trace. Ceglia remains at large for the next three years, somehow successfully keeping himself and his family hidden.
In 2018, however, the authorities track Ceglia – now aged 45 – down to a location in Ecuador, South America. At the time of recording, Ceglia is being represented by Roberto Calderón, and he is currently fighting extradition back to the US.
If returned, he will likely face charges of wire and mail fraud – offences that can carry a sentence of up to forty years in jail.
This episode of en clair is entirely researched, narrated, and produced by me, Dr Claire Hardaker. However, this work wouldn’t exist in its current form without the prior efforts of many others. You can find acknowledgements and references for those people at the blog. Also there you can find data, links, articles, pictures, older cases, and more besides. The address for the blog is wp.lancs.ac.uk/enclair. And you can follow the podcast on Twitter at _enclair. Or if you like, you can follow me on Twitter at DrClaireH.