Jane Demmen discusses the process of part-of-speech tagging the Shakespeare corpus, explores some of the issues the team encountered, and their subsequent solutions…
One of the many software programs that enables us to carry out the task of creating an electronic encyclopaedia of Shakespeare’s language is the Constituent Likelihood Automatic Word-tagging System (known to its friends as CLAWS). CLAWS “reads” the text of each play and assigns a label to each word denoting its grammatical function (also known as a part-of-speech tag or POS tag).
Why bother with grammatical labels for every word?
Assigning grammatical category labels to the texts of Shakespeare’s plays is essential to our project for several reasons. Crucially, it enables the word-stock of the plays to be classified into headwords, which form the basis of the dictionary-type entries in Volume 1 of our encyclopaedia. A headword is the lemma or base form of a set of grammatically-related words. For example, the headword fight (verb) is related to fights, fightest, fighting, fought and foughtst. The headword fight (noun) is treated separately, because it’s a different part of speech, and is related to fights (plural). Most dictionaries are arranged in a similar way. Importantly, it also lays useful groundwork for further potential studies, especially:
- creating a descriptive grammar of Shakespeare’s language
- studying variation in styles of grammar amongst different characters, plays, genres and between Shakespeare and other authors
- investigating change in grammatical usage over time (within the Shakespeare canon, or between Shakespeare and other authors).
Finally, at some later stage we may want to apply semantic category tags (labels denoting the area of meaning or semantic “domain” to which each word belongs) to our Shakespeare play-texts using another software tool, for example, the USAS (UCREL Semantic Analysis System) software tool (http://ucrel.lancs.ac.uk/usas/). The USAS tool relies partly on grammatical information from the CLAWS tags in order to assign categories of meaning to words in a text, and if these are incorrect it’s less likely to be able to suggest appropriate meaning categories.
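The headword grouping described earlier in this section can be illustrated with a minimal sketch. It assumes each word form already carries a part-of-speech label and a headword; the `TAGGED_FORMS` list and the `group_by_headword` function are hypothetical names invented for illustration, not part of CLAWS or of our project's tooling.

```python
from collections import defaultdict

# Hypothetical (word form, part of speech, headword) triples, based on the
# "fight" example in the text. The headword column is filled in by hand here;
# in practice it would come from a lemmatiser or from manual annotation.
TAGGED_FORMS = [
    ("fight", "verb", "fight"),
    ("fights", "verb", "fight"),
    ("fightest", "verb", "fight"),
    ("fighting", "verb", "fight"),
    ("fought", "verb", "fight"),
    ("foughtst", "verb", "fight"),
    ("fight", "noun", "fight"),
    ("fights", "noun", "fight"),
]


def group_by_headword(triples):
    """Group word forms under (headword, part of speech) keys."""
    entries = defaultdict(list)
    for form, pos, headword in triples:
        entries[(headword, pos)].append(form)
    return dict(entries)


entries = group_by_headword(TAGGED_FORMS)
for (headword, pos), forms in sorted(entries.items()):
    print(f"{headword} ({pos}): {', '.join(forms)}")
```

Because the key includes the part of speech, fights as a verb form and fights as a plural noun end up under different headwords, which is why the grammatical labels need to be right before any frequencies are counted.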
A brief history of CLAWS
CLAWS was developed at the Lancaster research centre UCREL (the University Centre for Computer Corpus Research on Language; http://ucrel.lancs.ac.uk/) in the 1980s. The CLAWS tagset (the range of part-of-speech/grammatical category labels it assigns) has been through several iterations. We use the CLAWS6 tagset in our project (http://ucrel.lancs.ac.uk/claws6tags.html), which has about 200 possible labels for different grammatical categories!
How does it work?
When the CLAWS software is run over a text, it assigns the part-of-speech (POS) tags in part by using the information from its lexicon (a built-in dictionary of known words and the grammatical role(s) they can take) and in part using a set of context-based rules (for example, nouns tend to be preceded by determiners). Of course, many words can play more than one possible grammatical role. For example, to is a highly frequent word which can be a preposition, if it occurs before a noun, or part of an infinitive verb. In cases like this, CLAWS will assign a series of possible tags, starting with the one it calculates as having the greatest probability of being correct. It displays that tag within square brackets, with other possible tags after it. It expresses the probability of each tag being correct as a percentage. POS-tagged words in a text file appear like this:
Once_RR
more_RRR
unto_II
the_AT
Breach_[NN1/100] VV0@/0
In the example above (from Henry V 3_1), CLAWS correctly assigns the POS tags for Once as a general adverb (RR), more as a comparative general adverb (RRR), unto as a general preposition (II), the as an article (AT) and Breach as a singular common noun (NN1). The tags for Breach show the probability of it being a noun as 100% in this context, and a 0% probability of it being a verb.
So far, so good. If only it was always this straightforward!
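For readers who want to work with tagged output of this kind, here is a minimal Python sketch of how the lines shown above could be read. It assumes only the layout visible in the example (word, underscore, then one or more tags, with an optional percentage after a slash and square brackets around the preferred tag). `Candidate` and `parse_tagged_line` are hypothetical names, and the sketch leaves any markers inside a tag, such as the @ in VV0@, untouched.

```python
from typing import List, NamedTuple, Optional, Tuple


class Candidate(NamedTuple):
    tag: str                # e.g. "NN1", or "VV0@" (markers are kept as-is)
    percent: Optional[int]  # None when the line gives no percentage


def parse_tagged_line(line: str) -> Tuple[str, List[Candidate]]:
    """Split one 'word_TAGS' line into the word and its candidate tags.

    Candidates are returned in the order they appear, so the first one is
    the tag that was ranked as most probable.
    """
    word, _, tag_field = line.strip().partition("_")
    candidates = []
    for chunk in tag_field.split():
        # Drop the square brackets, then separate tag and percentage.
        tag, slash, percent = chunk.strip("[]").partition("/")
        candidates.append(Candidate(tag, int(percent) if slash else None))
    return word, candidates


SAMPLE = ["Once_RR", "more_RRR", "unto_II", "the_AT", "Breach_[NN1/100] VV0@/0"]

for line in SAMPLE:
    word, candidates = parse_tagged_line(line)
    print(word, candidates[0].tag, [c.tag for c in candidates[1:]])
```

Keeping every candidate, rather than only the first, is a deliberate choice: the lower-ranked tags are often where the correct answer is hiding when the first choice is wrong.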
CLAWS and older forms of English
CLAWS was developed for late 20th century English, with which it has an impressively high accuracy rate of 96-97% (when applied to the British National Corpus, according to the writers of the manual Geoffrey Leech and Nick Smith in 2000). However, we know from other research carried out by Lancaster colleagues in 2007 that its accuracy drops with English from the 16th/17th century. When the spelling is standardised, as it has been for our project, we can expect an accuracy rate of about 89% – which is still very good, but not good enough for us to be confident in building frequency-based encyclopaedia entries that rely on grammatical information. Therefore, project Co-Investigator Andrew Hardie has carried out some development work on CLAWS specifically for this project (for example, extending its lexicon to include verb forms that agree with the pronoun thou), and he, I and recent CASS PhD graduate Jennifer Hughes have been manually checking the POS tags assigned by CLAWS to every single word of 38 Shakespeare plays, and correcting any tagging errors.
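An accuracy rate of this kind is, in essence, the proportion of words whose automatic tag matches the manually checked one. The sketch below shows that calculation under the assumption that the two tag sequences are aligned word for word; `tagging_accuracy` is a hypothetical helper, not the method used in the studies cited above. The seven-token sample combines two examples from this post: the five correctly tagged words from Henry V and the two mis-tagged words of Bonos dies from Twelfth Night.

```python
def tagging_accuracy(automatic, corrected):
    """Return the share of tokens whose automatic tag equals the checked tag."""
    if len(automatic) != len(corrected):
        raise ValueError("tag sequences must be aligned token for token")
    if not automatic:
        raise ValueError("cannot compute accuracy for an empty text")
    matches = sum(a == c for a, c in zip(automatic, corrected))
    return matches / len(automatic)


# "Once more unto the Breach" (all correct) + "Bonos dies" (both wrong).
automatic = ["RR", "RRR", "II", "AT", "NN1", "NN2", "VVZ"]
corrected = ["RR", "RRR", "II", "AT", "NN1", "FW", "FW"]

accuracy = tagging_accuracy(automatic, corrected)
print(f"{accuracy:.1%}")  # 5 of 7 tags match -> 71.4%
```

A toy sample like this says nothing about real accuracy, of course; it only shows why every corrected tag in the manual check moves the figure.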
What kind of things does CLAWS have trouble with?
There are a number of factors which cause CLAWS difficulty in working out the grammatical role of words in Shakespeare’s language. Some are to do with the style of English of this period in general, such as word orders which were typical then but not now (e.g. the main verb coming first in questions, as in “Know you where you are?”, “Saw you Aufidius?”). Words which are unfamiliar because they are no longer in use, or no longer used in the same way, also cause problems (e.g. ancient, familiar to us as an adjective meaning ‘very old’ in present-day English, but in earlier times also used as a noun to mean either someone who lived a long time ago or a standard-bearer/ensign, a military term). Printing errors (spelling anomalies, missing words or words which may be incorrect) cause further difficulties. Some of these remain in our texts as linguistic artefacts, particularly if there is disagreement among scholars over what the intended word is.
During the course of the tag checking we’ve expanded the tagging lexicon of CLAWS by several thousand words so that, for example, it now knows that ancient can be a noun.
Other factors relate to the type of texts we’re dealing with, and which we could expect to encounter in plays not only by Shakespeare, but also by other dramatists of his day. These include foreign words (French, Italian, Spanish and/or Latin being popular). For example, in Twelfth Night 4_2, Feste the clown (as Sir Topas) says:
“Bonos dies, Sir Toby;”
Bonos dies is meant to be either Latin or Spanish for ‘good day’, which we would tag simply as foreign words (FW). Although CLAWS does recognise some foreign words and tag them as such, it doesn’t in this case, and tags Bonos as a plural noun (NN2) and dies as a verb in the third person singular present (VVZ). Wordplay, puns, innuendo and other such language features beloved of dramatists, especially in comedy dialogue, sometimes baffle CLAWS (and, not infrequently, the human researcher).
For anyone interested in the details of typical and recurring POS tagging errors we’ve encountered and corrected in our data, here are a few:
Unfamiliar-looking adverb (not ending in –ly) tagged incorrectly (as a noun, in this case)
“Why do you speak so startingly and rash [NN1/58] JJ/42?” (Othello 3_4)
Infinitive verbs incorrectly tagged as preposition followed by noun
“To [II/100] TO/0 lip NN1 a wanton in a secure couch,” (Othello 4_1)
Noun incorrectly tagged as verb
“Alas! what cry [VV0/62] NN1/38 is that?” (Othello 5_2)
Noun incorrectly tagged as adjective
“Among the Nettles at the Elder [JJR/74] NN1/26 tree:” (Titus Andronicus 2_3)
Verb incorrectly tagged as noun
“Both heaven and earth Friend [NN1/99] NP1/1 thee for ever.” (The Two Noble Kinsmen 1_4)
Adjective incorrectly tagged as noun
“Patience dear [NN1/55] JJ/44 UH/0 RR@/0 Niece,” (Titus Andronicus 3_1)
Interjection incorrectly tagged as verb
“Hail [VV0/68] UH/21 NN1/11 to thee Lady” (Othello 2_1)
“Marry [VV0/86] UH/14 for justice she is so employed” (Titus Andronicus 4_3)
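The examples above can also be tallied to see which confusions recur. The sketch below is purely illustrative: the `ERRORS` list is a hand-made summary of the quotations in this post (word, category assigned, category intended, taken from the descriptions above each example), not project data, and the variable names are hypothetical.

```python
from collections import Counter

# (word, category CLAWS assigned, category intended), as described above.
ERRORS = [
    ("rash", "noun", "adverb"),
    ("To", "preposition", "infinitive marker"),
    ("lip", "noun", "verb"),
    ("cry", "verb", "noun"),
    ("Elder", "adjective", "noun"),
    ("Friend", "noun", "verb"),
    ("dear", "noun", "adjective"),
    ("Hail", "verb", "interjection"),
    ("Marry", "verb", "interjection"),
]

# How often each (assigned, intended) pair occurs.
confusions = Counter((assigned, intended) for _word, assigned, intended in ERRORS)

# How often each category was the wrong first choice.
assigned_totals = Counter(assigned for _word, assigned, _intended in ERRORS)

for (assigned, intended), count in confusions.most_common():
    print(f"tagged as {assigned}, should be {intended}: {count}")
```

Even in this handful of examples, noun and verb account for most of the wrong first choices, which fits the general picture of CLAWS falling back on the commonest present-day reading of a word.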
To conclude, POS tag checking, although labour-intensive, has been a crucial process in the data preparation for our project because it’s vital to the quality of the output that the underlying assumptions about grammatical categories are correct. It’s not the type of task everyone would enjoy, though I have: it’s challenged and improved my understanding of grammar in the period Shakespeare was writing. The expansion of the tagging lexicon by several thousand words means that we now have a version of CLAWS which is much better equipped for use with English from earlier centuries, which we anticipate will be a useful future resource.