{"id":1396,"date":"2026-05-22T20:55:26","date_gmt":"2026-05-22T20:55:26","guid":{"rendered":"https:\/\/wp.lancs.ac.uk\/shakespearelang\/?p=1396"},"modified":"2026-05-22T20:55:26","modified_gmt":"2026-05-22T20:55:26","slug":"esc-version-1-1","status":"publish","type":"post","link":"http:\/\/wp.lancs.ac.uk\/shakespearelang\/2026\/05\/22\/esc-version-1-1\/","title":{"rendered":"Some thoughts on version 1.1"},"content":{"rendered":"<p>This May marks four years since the online release of the corpora created by the <em>Encyclopedia of Shakespeare\u2019s Language Project<\/em> \u2013 and eight years since the original limited pre-release.<\/p>\n<p>One of the major features of the <strong>Enhanced Shakespearean Corpus<\/strong> (ESC) has always been that its annotations were created through detailed and painstaking manual effort.<\/p>\n<p>The principal component of ESC was a corpus of all 38 of Shakespeare\u2019s plays based mostly on the First Folio edition (<strong>ESC:Folio<\/strong>). This component received the lion\u2019s share of those efforts. We first took source texts from the <em>Internet Shakespeare Editions<\/em> (to whom much credit and gratitude) and reformatted their XML markup by automatic means. But we then invested months of effort, first into adding a spelling regularisation layer, and second into post-editing the part-of-speech tagging of ESC:Folio.<\/p>\n<p>Part-of-speech post-editing \u2013 where the output of an automatic grammatical tagger is checked, word-by-word, by a trained grammarian to fix all mis-taggings \u2013 was relatively common in the 1990s and early 2000s. For example, the two-million-word extract from the BNC1994 known as the <strong>BNC Sampler<\/strong> was post-edited by hand.<\/p>\n<p>These days, however, corpora tend to be either too large, or too ephemeral, for the effort of full post-editing to be justified. 
But at one million words in extent, ESC:Folio is small enough, and its role in our efforts to describe Shakespeare\u2019s language central enough, for post-editing to be feasible.<\/p>\n<p>As a result, we have been able to say that the part-of-speech tagging in ESC:Folio is as close as humanly possible to 100% correct. In contrast, most automatic taggers score between 95% and 97% on contemporary English, with accuracy reduced for earlier periods\u2026 like Early Modern English.<\/p>\n<p>Although we could not afford the time to manually process the spelling regularisations and part-of-speech tags for the other components of ESC, what we <i>could<\/i> do was take the outputs from our work on ESC:Folio and use them to improve the automatic procedures applied to those other datasets.<\/p>\n<p>So while those corpora \u2013 the collection of Early Modern books in <strong>ESC:EEBO<\/strong>, the corpus of comparative playwrights in <strong>ESC:Comp<\/strong>, and the subsidiary corpora of Shakespeare\u2019s other writings in <strong>ESC:Quartos<\/strong> and <strong>ESC:Verse<\/strong> \u2013 couldn\u2019t be brought up to quite the same standard, we are confident of having achieved best-in-class automatic annotation for Early Modern corpora.<\/p>\n<p>We knew, though, even at the release of ESC:Folio that a few errors in the spelling regularisation and part-of-speech tagging must remain \u2013 thus, our proviso that the tagging is not 100% correct but only <em>as close as humanly possible<\/em> to 100% correct.<\/p>\n<p>Our processes involved passing all our manual work on ESC:Folio in front of multiple sets of eyes, so any error made by one person could be caught by somebody else \u2013 but while this did indeed catch many slips, it couldn\u2019t catch absolutely everything.<\/p>\n<p>So since the release of the corpora, we\u2019ve slowly accumulated a list of things-to-fix in the spelling regularisation and tagging. 
Some we spotted ourselves, others were reported to us by our students or by colleagues outside Lancaster making use of the data. By 2026, we felt certain that it was time for a version 1.1 of the ESC \u2013 with issues manually fixed in ESC:Folio, and updated automatic procedures applied to the other corpora.<\/p>\n<p>What kinds of things have changed? Some of the issues we fixed were simple cases of human error. For instance, in the following couplet from <em>Titus Andronicus<\/em>:<\/p>\n<blockquote><p>The <strong>green<\/strong> leaves quiver with the cooling wind,<br \/>\nAnd make a chequered shadow on the ground:<\/p><\/blockquote>\n<p>\u2026 the word <em>green<\/em> is clearly an adjective rather than a noun, though it <em>can<\/em> be a noun, for instance if we talk about \u201cthe village green\u201d. But in ESC:Folio as originally released, it has a noun tag (NN1) rather than an adjective tag (JJ).<\/p>\n<p>This example typifies the kind of distinction that the automatic tagger struggles with (since it\u2019s possible for <em>green<\/em> to be a noun, and possible for a noun to premodify a noun, there is no basis for the tagger to rule out the noun tag in this context). In fact, the manual post-editing focused on expected errors like this, and we fixed nearly all of them prior to the original release. But not this one \u2013 <em>damnit, it just slipped through!<\/em><\/p>\n<p>And ultimately, there is always a last remnant of issues which are not so much obvious errors as they are areas of ambiguity of analysis, where even to a human being the correct annotation is debatable.<\/p>\n<p>An example on the spelling regularisation side is <em>ope<\/em>, a clipped version of <em>open<\/em> which occurs in ESC:Folio roughly 30 times. Our practice was generally to regularise forms that reflect clipped pronunciations. 
But it\u2019s an <strong>open<\/strong> question (forgive the pun) whether that policy ought to apply to <em>ope<\/em>, which may be seen as different enough from <em>open<\/em>, and frequent enough, to be treated as an established separate lexical item.<\/p>\n<p>There\u2019s no right or wrong answer here \u2013 reasonable analysts could disagree. What we can aim for, however, is <strong>consistency<\/strong>, which makes annotation feasible for you to work with, even if you personally disagree with one or more decisions. And we noticed, a few months ago, that some instances of <em>ope<\/em> were regularised to <em>open<\/em> and some weren\u2019t. Different members of the team had not treated it entirely consistently \u2013 or possibly the same person made different decisions on different days; human beings <em>have<\/em> been known to do that!<\/p>\n<p>Both the tagging error for <em>green<\/em> (it\u2019s now JJ) and the regularisation inconsistency for <em>ope<\/em> (they\u2019re all now mapped to <em>open<\/em>) have been sorted out, along with dozens of similar tweaks across the 38 plays. The accumulation of fixes was such that we decided it was time for a version 1.1 release of the corpus.<\/p>\n<p>So we\u2019ve re-indexed all five ESC corpora for public use on Lancaster\u2019s CQPweb server. These version 1.1 corpora are now live: if you\u2019ve already signed up to use ESC, you\u2019ll be able to access the new versions immediately without any additional action \u2013 just follow the links from the CQPweb homepage (<a href=\"https:\/\/cqpweb.lancs.ac.uk\/\">https:\/\/cqpweb.lancs.ac.uk\/<\/a>).<\/p>\n<p>We\u2019ve also taken the opportunity to spruce up the user signup system a bit\u2026 so you can now get hold of XML downloads through the same form that manages the CQPweb signups. 
Here\u2019s the link:<\/p>\n<p><a href=\"http:\/\/corpora.lancs.ac.uk\/esc-user-service\/\">http:\/\/corpora.lancs.ac.uk\/esc-user-service\/<\/a><\/p>\n<p>The 1.0 versions remain accessible (and always will); it is still true that any research that used ESC:Folio 1.0 was based on annotation that was as close as humanly possible to 100% correct. But for <em>new<\/em> investigations, ESC 1.1 is now the recommended version, because it\u2019s <em>even closer<\/em>.<\/p>\n<p>Happy language analysis, everyone!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This May marks four years since the online release of the corpora created by the Encyclopedia of Shakespeare\u2019s Language Project \u2013 and eight years since the original limited pre-release. One of the major features of the Enhanced Shakespearean Corpus (ESC) &hellip; <a href=\"http:\/\/wp.lancs.ac.uk\/shakespearelang\/2026\/05\/22\/esc-version-1-1\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":37,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_feature_clip_id":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"jetpack_post_was_ever_published":false},"categories":[1],"tags":[],"class_list":["post-1396","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"http:\/\/wp.lancs.ac.uk\/shakespearel
ang\/wp-json\/wp\/v2\/posts\/1396","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/wp.lancs.ac.uk\/shakespearelang\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/wp.lancs.ac.uk\/shakespearelang\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/wp.lancs.ac.uk\/shakespearelang\/wp-json\/wp\/v2\/users\/37"}],"replies":[{"embeddable":true,"href":"http:\/\/wp.lancs.ac.uk\/shakespearelang\/wp-json\/wp\/v2\/comments?post=1396"}],"version-history":[{"count":1,"href":"http:\/\/wp.lancs.ac.uk\/shakespearelang\/wp-json\/wp\/v2\/posts\/1396\/revisions"}],"predecessor-version":[{"id":1397,"href":"http:\/\/wp.lancs.ac.uk\/shakespearelang\/wp-json\/wp\/v2\/posts\/1396\/revisions\/1397"}],"wp:attachment":[{"href":"http:\/\/wp.lancs.ac.uk\/shakespearelang\/wp-json\/wp\/v2\/media?parent=1396"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/wp.lancs.ac.uk\/shakespearelang\/wp-json\/wp\/v2\/categories?post=1396"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/wp.lancs.ac.uk\/shakespearelang\/wp-json\/wp\/v2\/tags?post=1396"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}