Some thoughts on version 1.1

This May marks four years since the online release of the corpora created by the Encyclopedia of Shakespeare’s Language Project – and eight years since the original limited pre-release.

One of the major features of the Enhanced Shakespearean Corpus (ESC) has always been that its annotations were created through detailed and painstaking manual effort.

The principal component of ESC was a corpus of all 38 of Shakespeare’s plays based mostly on the First Folio edition (ESC:Folio). This component received the lion’s share of those efforts. We first took source texts from the Internet Shakespeare Editions (to whom much credit and gratitude are due) and reformatted their XML markup by automatic means. But we then invested months of effort, first into adding a spelling regularisation layer, and second into post-editing the part-of-speech tagging of ESC:Folio.

Part-of-speech post-editing – where the output of an automatic grammatical tagger is checked, word-by-word, by a trained grammarian to fix all mis-taggings – was relatively common in the 1990s and early 2000s. For example, the two-million-word extract from the BNC1994 known as the BNC Sampler was post-edited by hand.

These days, however, corpora tend to be either too large, or too ephemeral, for the effort of full post-editing to be justified. But at one million words in extent, ESC:Folio is small enough, and its role in our efforts to describe Shakespeare’s language central enough, for post-editing to be feasible.

As a result, we have been able to say that the part-of-speech tagging in ESC:Folio is as close as humanly possible to 100% correct. In contrast, most automatic taggers score between 95% and 97% on contemporary English, with accuracy reduced for earlier periods… like Early Modern English.
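To put those percentages in perspective, here is a rough back-of-envelope sketch in Python. It assumes a corpus of exactly one million tokens (the approximate size of ESC:Folio) and treats accuracy as a simple per-token rate; the function name is hypothetical and chosen purely for illustration.

```python
def expected_errors(tokens: int, accuracy: float) -> int:
    """Expected number of mis-tagged tokens at a given per-token accuracy."""
    return round(tokens * (1.0 - accuracy))


# Assumption: ESC:Folio is treated as exactly one million tokens.
CORPUS_SIZE = 1_000_000

for accuracy in (0.95, 0.97):
    errors = expected_errors(CORPUS_SIZE, accuracy)
    print(f"{accuracy:.0%} accuracy -> about {errors:,} mis-tagged tokens")
```

On those assumptions, even the upper end of the range would leave tens of thousands of mis-tagged tokens in an unedited corpus of this size, which is the gap that manual post-editing is meant to close.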

Although we could not afford the time to manually process the spelling regularisations and part-of-speech tags for the other components of ESC, what we could do was take the outputs from our work on ESC:Folio and use them to improve the automatic procedures applied to those other datasets.

So while those corpora – the collection of Early Modern books in ESC:EEBO, the corpus of comparative playwrights in ESC:Comp, and the subsidiary corpora of Shakespeare’s other writings in ESC:Quartos and ESC:Verse – couldn’t be brought up to quite the same standard, we are confident of having achieved best-in-class automatic annotation for Early Modern corpora.

We knew, though, even at the release of ESC:Folio that a few errors in the spelling regularisation and part-of-speech tagging must remain – thus, our proviso that the tagging is not 100% correct but only as close as humanly possible to 100% correct.

Our processes involved passing all our manual work on ESC:Folio in front of multiple sets of eyes, so any error made by one person could be caught by somebody else – but while this did indeed catch many slips, it couldn’t catch absolutely everything.

So since the release of the corpora, we’ve slowly accumulated a list of things-to-fix in the spelling regularisation and tagging. Some we spotted ourselves, others were reported to us by our students or by colleagues outside Lancaster making use of the data. By 2026, we felt certain that it was time for a version 1.1 of the ESC – with issues manually fixed in ESC:Folio, and updated automatic procedures applied to the other corpora.

What kinds of things have changed? Some of the issues we fixed were simple cases of human error. For instance, in the following couplet from Titus Andronicus:

The green leaves quiver with the cooling wind,
And make a chequered shadow on the ground:

… the word green is clearly an adjective rather than a noun, though it can be a noun, for instance if we talk about “the village green”. But in ESC:Folio as originally released, it has a noun tag (NN1) rather than an adjective tag (JJ).

This example typifies the kind of distinction that the automatic tagger struggles with: since it’s possible for green to be a noun, and possible for a noun to premodify a noun, there is no basis for the tagger to rule out the noun tag in this context. In fact, the manual post-editing focused on expected errors like this, and we fixed nearly all of them prior to the original release. But not this one – dammit, it just slipped through!

And ultimately, there is always a last remnant of issues which are not so much obvious errors as they are areas of ambiguity of analysis, where even to a human being the correct annotation is debatable.

An example on the spelling regularisation side is ope, a clipped version of open which occurs in ESC:Folio roughly 30 times. Our practice was generally to regularise forms that reflect clipped pronunciations. But it’s an open question (forgive the pun) whether that policy ought to apply to ope, which may be seen as different enough from open, and frequent enough, to be treated as an established separate lexical item.

There’s no right or wrong answer here – reasonable analysts could disagree. What we can aim for, however, is consistency, which keeps the annotation workable for you even if you personally disagree with one or more decisions. And we noticed, a few months ago, that some instances of ope were regularised to open and some weren’t. Different members of the team had not treated it entirely consistently – or possibly the same person made different decisions on different days; human beings have been known to do that!
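A check for this kind of inconsistency is easy to sketch. The Python below is illustrative only and does not describe the project's actual tooling: it assumes the annotation can be read as simple (original spelling, regularised spelling) pairs, the function name is hypothetical, and the sample data is invented.

```python
from collections import defaultdict


def inconsistent_regularisations(pairs):
    """Return original forms that map to more than one regularised form."""
    seen = defaultdict(set)
    for original, regularised in pairs:
        # Compare case-insensitively so "Ope" and "ope" count as one form.
        seen[original.lower()].add(regularised.lower())
    return {form: sorted(targets) for form, targets in seen.items() if len(targets) > 1}


# Invented sample data (hypothetical, not drawn from ESC:Folio).
sample = [
    ("ope", "open"),
    ("ope", "ope"),
    ("greene", "green"),
    ("Ope", "open"),
]

print(inconsistent_regularisations(sample))
```

A form flagged this way is only a candidate for review, since some Early Modern spellings can legitimately correspond to more than one modern word; the decision about each case still rests with a human.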

Both the tagging error for green (it’s now JJ) and the regularisation inconsistency for ope (they’re all now mapped to open) have been sorted out, along with dozens of similar tweaks across the 38 plays. The accumulation of fixes was such that we decided it was time for a version 1.1 release of the corpus.

So we’ve re-indexed all five ESC corpora for public use on Lancaster’s CQPweb server. These version 1.1 corpora are now live: if you’ve already signed up to use ESC, you’ll be able to access the new versions immediately without any additional action – just follow the links from the CQPweb homepage (https://cqpweb.lancs.ac.uk/).

We’ve also taken the opportunity to spruce up the user signup system a bit… so you can now get hold of XML downloads through the same form that manages the CQPweb signups. Here’s the link:

http://corpora.lancs.ac.uk/esc-user-service/

The 1.0 versions remain accessible (and always will); it is still true that any research that used ESC:Folio 1.0 was based on annotation that was as close as humanly possible to 100% correct. But for new investigations, ESC 1.1 is now the recommended version, because it’s even closer.

Happy language analysis, everyone!

About Andrew Hardie
