Encyclopedia of Shakespeare's Language

Some thoughts on version 1.1

Posted on May 22, 2026 by Andrew Hardie

This May marks four years since the online release of the corpora created by the Encyclopedia of Shakespeare’s Language Project – and eight years since the original limited pre-release.

One of the major features of the Enhanced Shakespearean Corpus (ESC) has always been that its annotations were created through detailed and painstaking manual effort.

The principal component of ESC was a corpus of all 38 of Shakespeare’s plays based mostly on the First Folio edition (ESC:Folio). This component received the lion’s share of those efforts. We first took source texts from the Internet Shakespeare Editions (to whom much credit and gratitude) and reformatted their XML markup by automatic means. But we then invested months of effort, first into adding a spelling regularisation layer, and second into post-editing the part-of-speech tagging of ESC:Folio.

Part-of-speech post-editing – where the output of an automatic grammatical tagger is checked, word-by-word, by a trained grammarian to fix all mis-taggings – was relatively common in the 1990s and early 2000s. For example, the two-million-word extract from the BNC1994 known as the BNC Sampler was post-edited by hand.

These days, however, corpora tend to be either too large, or too ephemeral, for the effort of full post-editing to be justified. But at one million words in extent, ESC:Folio is small enough, and its role in our efforts to describe Shakespeare’s language central enough, for post-editing to be feasible.

As a result, we have been able to say that the part-of-speech tagging in ESC:Folio is as close as humanly possible to 100% correct. In contrast, most automatic taggers score between 95% and 97% on contemporary English, with accuracy reduced for earlier periods… like Early Modern English.

Although we could not afford the time to manually process the spelling regularisations and part-of-speech tags for the other components of ESC, what we could do was take the outputs from our work on ESC:Folio and use them to improve the automatic procedures applied to those other datasets.

So while those corpora – the collection of Early Modern books in ESC:EEBO, the corpus of comparative playwrights in ESC:Comp, and the subsidiary corpus of Shakespeare’s other writings in ESC:Quartos and ESC:Verse – couldn’t be brought up to quite the same standard, we are confident of having achieved best-in-class automatic annotation for Early Modern corpora.

We knew, though, even at the release of ESC:Folio that a few errors in the spelling regularisation and part-of-speech tagging must remain – thus, our proviso that the tagging is not 100% correct but only as close as humanly possible to 100% correct.

Our processes involved passing all our manual work on ESC:Folio in front of multiple sets of eyes, so any error made by one person could be caught by somebody else – but while this did indeed catch many slips, it couldn’t catch absolutely everything.

So since the release of the corpora, we’ve slowly accumulated a list of things-to-fix in the spelling regularisation and tagging. Some we spotted ourselves, others were reported to us by our students or by colleagues outside Lancaster making use of the data. By 2026, we felt certain that it was time for a version 1.1 of the ESC – with issues manually fixed in ESC:Folio, and updated automatic procedures applied to the other corpora.

What kinds of things have changed? Some of the issues we fixed were simple cases of human error. For instance, in the following couplet from Titus Andronicus:

The green leaves quiver with the cooling wind,
And make a chequered shadow on the ground:

… the word green is clearly an adjective rather than a noun, though it can be a noun, for instance if we talk about “the village green”. But in ESC:Folio as originally released, it has a noun tag (NN1) rather than an adjective tag (JJ).

This example typifies the kind of distinction that the automatic tagger struggles with (since it’s possible for green to be a noun, and possible for a noun to premodify a noun, there is no basis for the tagger to rule out the noun tag in this context). In fact, the manual post-editing focused on expected errors like this, and we fixed nearly all of them prior to the original release. But not this one – damnit, it just slipped through!

And ultimately, there is always a last remnant of issues which are not so much obvious errors as they are areas of ambiguity of analysis, where even to a human being the correct annotation is debatable.

An example on the spelling regularisation side is ope, a clipped version of open which occurs in ESC:Folio roughly 30 times. Our practice was generally to regularise forms that reflect clipped pronunciations. But it’s an open question (forgive the pun) whether that policy ought to apply to ope, which may be seen as different enough from open, and frequent enough, to be treated as an established separate lexical item.

There’s no right or wrong answer here – reasonable analysts could disagree. What we can aim for, however, is consistency, which makes annotation feasible for you to work with, even if you personally disagree with one or more decisions. And we noticed, a few months ago, that some instances of ope were regularised to open and some weren’t. Different members of the team had not treated it entirely consistently – or possibly the same person made different decisions on different days; human beings have been known to do that!

Both the tagging error for green (it’s now JJ) and the regularisation inconsistency for ope (they’re all now mapped to open) have been sorted out, along with dozens of similar tweaks across the 38 plays. The accumulation of fixes was such that we decided it was time for a version 1.1 release of the corpus.

So we’ve re-indexed all five ESC corpora for public use on Lancaster’s CQPweb server. These version 1.1 corpora are now live: if you’ve already signed up to use ESC, you’ll be able to access the new versions immediately without any additional action – just follow the links from the CQPweb homepage (https://cqpweb.lancs.ac.uk/).

We’ve also taken the opportunity to spruce up the user signup system a bit… so you can now get hold of XML downloads through the same form that manages the CQPweb signups. Here’s the link:

http://corpora.lancs.ac.uk/esc-user-service/

The 1.0 versions remain accessible (and always will); it is still true that any research that used ESC:Folio 1.0 was based on annotation that was as close as humanly possible to 100% correct. But for new investigations, ESC 1.1 is now the recommended version, because it’s even closer.

Happy language analysis, everyone!

Posted in Uncategorized | Comments Off

Intern – Katie Bates

Posted on June 22, 2020 by eiajc

Intern-Katie-Bates

Having just finished my second year as an English Language and Linguistics student at Lancaster University, I was thrilled to be given the opportunity to complete an internship with the department, on the subject of the language of Shakespeare.

My work focused specifically on defining low frequency words that featured in Shakespeare’s First Folio, specifically those that had been collectively less ten times, but were most often used only once! Attempting to make sense of some wonderful words (my favourites include lass-lorn, out-paramoured, and traitoress) felt like the careful work of excavation; the infrequent uses of these terms like uncovered fossils from which we can begin to wonder at the lesser-known communications, attitudes and cultures of a past time.

Like previous interns I worked with corpus methods, but I was the first to be able to take advantage of scripts that had been developed to improve speed and productivity. This meant that for each definition, both the relevant corpus data and existent definitions for each term were automatically pooled for me, resulting in much smoother workflow. Through a process of comparison with the existent dictionaries of Shakespeare’s language (e.g. those of Schmidt and Onions) and the Oxford English Dictionary, it became possible to piece together semblances of meaning. Whilst the meanings of some terms not used today are apparent to readers (e.g depopulate), others such as pregnancy (which Shakespeare used with the meaning of clever or quick-wittedness) required heavy inference and research into etymology, context and contemporaries in order to compose a definition.

My involvement with the project through the internship has renewed my awe for the beauty and complexity of Shakespeare’s language, whilst also providing a practical insight into the uses of corpus methods, which will be invaluable to my final year of study. I have found both the research and wider project truly fascinating, and look forward to its completion, so others can enjoy the rich resources it will offer.

Posted in Uncategorized | Comments Off

Shakespeare’s Neologisms: From Myth to Evidence

Posted on November 21, 2019 by Mathew Gillings

Following on from the AHRC-funded Encyclopedia of Shakespeare’s Language project, we are pleased to announce that we have been successfully awarded a grant (£9,740.15) from the British Academy. The project will establish whether, and to what extent, widely held views about Shakespeare’s neologisms are a myth, and also improve our understanding and appreciation of his words.

The website of the well-respected Shakespeare Birthplace Trust proclaims that “William Shakespeare invented over 1,700 words”, with similar estimates being found across non-academic and academic works alike. However, these estimates are often based on the number of words in the Oxford English Dictionary that have as their first citation a work attributed to Shakespeare. No study, however, has systematically scrutinized each of these words, hunting for earlier uses.

The recent advent of Early English Books Online (the largest repository of historical English printed works), paired with the recently-released Enhanced Shakespearean Corpus, means that it is timely to undertake such a study. This study will also investigate a further set of potential neologisms based on a list of words that only occur in texts attributed to Shakespeare.

The project will be led by Prof. Jonathan Culpeper, with Prof. Jonathan Hope acting as a consultant, and Isolde van Dorst providing research assistance.

Posted in Uncategorized | Tagged British Academy, neologisms, Shakespeare | Comments Off

Scuffles, Swagger, and Shakespeare: The Hidden Story of English

Posted on November 21, 2019 by Mathew Gillings

Our very own Jonathan Culpeper recently featured in the BBC Four documentary “Scuffles, Swagger, and Shakespeare: The Hidden Story of English” presented by Dr. John Gallagher. Jonathan discusses some recent work coming out of the Encyclopedia of Shakespeare’s Language project, suggesting that the amount of neologisms credited as being coined by Shakespeare may not be as numerous as has been suggested in the past.

You can still catch the documentary on BBC iPlayer by clicking here.

Posted in Uncategorized | Tagged BBC, documentary, iPlayer, John Gallagher, Shakespeare | Comments Off

Intern – Eleanor Field

Posted on August 3, 2019 by Mathew Gillings

Hi, I’m Eleanor, and I’ve just finished my second year of studying English Language and Linguistics at Lancaster. During my summer break I completed an internship working on the Encyclopaedia of Shakespeare’s Language, supervised by Professor Jonathan Culpeper. I decided to apply for the internship as I have an interest in stylistics, and the fact that the internship allowed for the analysis of literary texts through linguistic methods (such as the use of corpora) seemed exciting to me. I was initially nervous about using corpus research methods as it was something completely new to me, but with Professor Culpeper’s supervision I quickly picked it up and found it to be an invaluable tool throughout the project.

My role involved writing proposed definitions for the encyclopaedia, that could be verified by the project team, focussing on words that occur at an extremely low frequency within Shakespeare’s work. In order to do this, I considered Shakespeare’s usage of each word in context by using a corpus; viewing the language use in context really helped me to understand the intended meaning and guided the writing of the entries.

Completing the internship has not only improved my understanding of corpus research methods and provided me with independent study skills but also confirmed that I wish to pursue postgraduate study. I loved completing the research for this project and hopefully will have the opportunity to undertake my own in the future!

Posted in Uncategorized | Comments Off

BBC Radio 3 Podcast

Posted on July 17, 2019 by Mathew Gillings

Jonathan Culpeper and Alison Findlay feature in a new 45-minute BBC Radio 3 Podcast called “New thinking: Shakespeare’s Language”, presented by John Gallagher. They discuss how the project works, and the light it’s shedding both on how Shakespeare worked as a writer, and on the development of the English language in Shakespeare’s day.

Give it a listen on the BBC website by clicking here!

Posted in Blog | Comments Off

Intern – Sam Hollands

Posted on July 12, 2019 by Mathew Gillings

Being involved with the Encyclopedia of Shakespeare’s Language project has been a great opportunity. I have been working as an intern for the last 4 weeks developing scripts to improve the efficiency of certain workflows, mainly designing a system to increase the speed that we can write definitions for the encyclopedia. The project has been great for giving me hands-on coding experience in an academic environment, and brushing up on my Bash and Python skills. My particular area of interest is in speech processing, so the invitation to improve my computational skills is something I am grateful to have received.

I found Shakespeare’s tendency to create neologisms through the prefix ‘un-’ particularly interesting, suggesting a vast proportion of his neologisms are just negations of words he didn’t invent. Where I once thought Shakespeare invented thousands of neologisms, it appears that this is a myth, largely predicated on the OED’s use of Shakespeare as first user for many terms that simply haven’t been tested for antedating. This has been a fantastic experience and I’m excited to see what the final results are and what percentage of the initial claims of approximately 1,500 neologisms are actually words Shakespeare invented.

Posted in Blog | Comments Off

Encyclopedia of Shakespeare’s Language Symposium

Posted on July 10, 2019 by Mathew Gillings

Encyclopedia of Shakespeare’s Language Symposium

Posted in Blog | Comments Off

A close encounter with Richard III

Posted on August 7, 2018 by Mathew Gillings

By Dr Jane Demmen, Senior Research Associate

Last month project Co-Investigator Andrew Hardie and I presented a paper at the Computational Methods for Literary-Historical Textual Scholarship conference at De Montfort University in Leicester (UK): a great event bringing together scholars from far and wide with interests in digital humanities approaches to literary texts. (The slides from this paper, and others from our project, are available on our website here http://wp.lancs.ac.uk/shakespearelang/outputs/). Leicester is also the resting place of King Richard III: the last English monarch to die in the course of battle and the inspiration for one of Shakespeare’s most interesting and controversial villains…

In Shakespeare’s plays we first meet the character who becomes Richard III in Henry VI parts 2 and 3 (when he is Duke of Gloucester). Part 3 familiarises us with the cruel and merciless tyrant who later will murder any number of individuals standing between him and the English throne (in the play Richard III). Yet Richard’s character evokes a certain admiration, in part for his ruthless cunning but also for his grotesque humour. “See how my sword weeps for the poor king’s death”, he remarks of his bloody weapon, having just fatally stabbed Henry (in Henry VI part 3, 5:6). “Why I can smile, and murder while I smile,” he tells us in his long soliloquy at the end of Henry VI part 3, 3:2. Cleverly, Richard makes these wry observations when he is alone on stage, so we the audience find ourselves drawn into his dastardly ambitions simply through our privileged access to what is going on in his mind.

As in the play, the real king Richard III died leading his troops at the Battle of Bosworth (the last main conflict in the Wars of the Roses, between the noble houses of Lancaster and York), just outside the city of Leicester, in August 1485. He was 32 and had reigned for just two years, having assumed the throne when the young heir apparent, Edward V, was declared illegitimate (based on claims that his father Edward IV had been bigamously married to young Edward’s mother, Queen Elizabeth). The question of whether or not Richard III was subsequently responsible for the murder of young Edward (who with his brother became known as ‘the princes in the tower’), as is portrayed in the play, remains open and a topic of conjecture to this day.

Richard III’s remains were, astonishingly, discovered beneath a car park in Leicester in 2012, after a search instigated by members of the Richard III Society and an archaeological dig led by the University of Leicester (in co-operation with Leicester City Council, owners of the car park). The site of the car park had once been a friary, where Richard’s body was buried after his death and defeat in battle. Following an impassioned debate over where Richard’s remains should hereafter lie, they were finally interred in Leicester Cathedral in 2015 in a coffin made by one of his descendants, whose DNA was used to help confirm the identity of the remains.

On the plinth of Richard’s monument in the cathedral is his coat of arms, his personal emblem (the figure of a boar), and his personal motto Loyaulte me lie (“loyalty binds me”).

Right next to the conference building at DeMontfort University (the green edge of which is just visible in the far right of the photo below) is the Newarke Gate. Richard III’s body would likely have passed through it when it was brought back to the city, draped over a horse, on public display (as proof of Richard’s death, the defeat of the Yorkists, and victory of the Tudor forces).

The analysis of Richard’s remains bore out the fact that he had a curvature of the spine, but not that he was strikingly physically deformed, a popular idea which appears to have been fuelled by literary characterisation of him as twisted in body as well as mind. In Shakespeare’s plays, Queen Elizabeth (widow of Richard’s elder brother, the late king Edward IV) describes him as “that foul bunch-backed Toad” (Richard III, 4:3). Richard himself says, “Then since the Heavens have shaped my Body so, Let Hell make crooked my Mind to answer it” (although only in the audience’s hearing, in Henry VI part 3, 5:6).

Although the real-life Richard’s reign saw bloody conflict, treachery and political intrigue, as one walks around the cathedral and the city of Leicester in the present day noticing artefacts and snippets of information about his life, it’s apparent that he is remembered as a supporter of education and fair laws for ordinary people, and not the villainous tyrant imagined in Shakespeare’s plays.

Education was of course the main reason for my visit to Leicester, and is cornerstone of our project. Learning more about Richard III was an unexpected bonus to attending the conference. There’s more information about visiting the cathedral at http://leicestercathedral.org/learn/richard-iii/ and about the nearby King Richard III Visitors’ Centre at https://kriii.com/.

Posted in Uncategorized | Tagged computational methods, corpus linguistics, Leicester, Richard III, Shakespeare | Comments Off

New intern

Posted on June 24, 2018 by Mathew Gillings

We are very pleased to welcome Poppy Plumb to the Encyclopaedia of Shakespeare’s Language team for the next few weeks. Find out a little more about Poppy and what she’ll be working on below…

I’ve just finished my second year at Lancaster studying English Language and Literature, and for the next few weeks I’ll be working on the Encyclopaedia of Shakespeare’s Language. I was excited to hear about this project because of its application of linguistic and corpus methods to Shakespeare. Being such an integral part of the literary canon, and embedded in the study of English Literature throughout compulsory education in the UK, I find the opportunity to take a more linguistic approach to Shakespeare refreshing and exciting. I’m also keen to pursue postgraduate study following my undergraduate degree, so the opportunity to work on a research project like this would provide me with invaluable experience.

My research on the project will be focussing on neologisms: the words that Shakespeare supposedly coined. I’ll be building upon the work of past interns and comparing various definitions of each word from different sources; checking each word is present in a corpus of Shakespeare’s plays, and coming up with a proposed definition of each word for the encyclopaedia.

Given the immense number of words being added to the encyclopaedia, the work Poppy is doing is integral to its compilation. We look forward to working with her.

Posted in Uncategorized | Tagged Encyclopaedia of Shakespeare's Language, intern, Shakespeare | Comments Off

Encyclopedia of Shakespeare's Language

Some thoughts on version 1.1

Intern – Katie Bates

Shakespeare’s Neologisms: From Myth to Evidence

Scuffles, Swagger, and Shakespeare: The Hidden Story of English

Intern – Eleanor Field

BBC Radio 3 Podcast

Intern – Sam Hollands

Encyclopedia of Shakespeare’s Language Symposium

A close encounter with Richard III

New intern

Recent Posts

Twitter Feed