Corpus resources

This webpage offers a range of free online resources in corpus linguistics and the application of the corpus method. These resources can be used to develop knowledge and skills in corpus linguistics. There are three main groups of resources here:

Corpus tools, projects and materials
Readings: All readings listed here are free to access
Video recordings: lectures, workshops and tutorials

I. CORPUS TOOLS, PROJECTS & MATERIALS

CQPWEB: CQPweb is a web-based corpus analysis system, intended to address the conflicting requirements for usability and power in corpus analysis software. CQPweb’s main innovative feature is its flexibility; its more generalised data model makes it compatible with any corpus. The analysis options available in CQPweb include: concordancing; collocations; distribution tables and charts; frequency lists; and keywords or key tags. Despite some limitations, in making a sophisticated query system accessible to untrained users, CQPweb combines ease of use, power and flexibility to a very high degree.

CQPweb can be accessed here. To read more about CQPweb, see Hardie, A. (2012). CQPweb – combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics 17(3): 380–409. [Full text on publisher’s website] [Alternative source for PDF]
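For readers new to the analysis types listed above, the following minimal Python sketch illustrates three of them (a frequency list, a concordance and simple window-based collocates) on a toy text. It is not CQPweb's implementation or query language: the function names, the crude tokenizer and the sample sentences are all assumptions made up for illustration, and the collocate count is a raw co-occurrence count rather than a statistical association measure.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(tokens, top=5):
    """Return the `top` most frequent tokens with their counts."""
    return Counter(tokens).most_common(top)


def concordance(tokens, node, window=3):
    """Return keyword-in-context triples: up to `window` tokens either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


def collocates(tokens, node, window=3):
    """Count tokens that co-occur within `window` tokens of the node word."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


# Hypothetical toy "corpus"; a real corpus would be read from files.
SAMPLE = ("The cat sat on the mat. The dog sat on the log. "
          "A cat and a dog sat together.")
tokens = tokenize(SAMPLE)

if __name__ == "__main__":
    print(frequency_list(tokens, top=3))
    for left, node, right in concordance(tokens, "cat", window=2):
        print(f"{left:>10} [{node}] {right}")
    print(collocates(tokens, "sat", window=1).most_common(3))
```

Tools such as CQPweb add what this sketch leaves out: indexed corpora, annotation-aware queries, metadata-based distribution, and statistical measures for collocations and keywords.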

#LANCSBOX: #LancsBox is a new-generation software package for the analysis of language data and corpora developed at Lancaster University. #LancsBox can be accessed and downloaded for free here. #LancsBox:

Works with your own data or existing corpora.
Can be used by linguists, language teachers, historians, sociologists, educators and anyone interested in language.
Visualises language data.
Analyses data in any language.
Automatically annotates data for part-of-speech.
Works with any major operating system (Windows, Mac, Linux).

Read more in: Brezina, V., McEnery, T., & Wattam, S. (2015). Collocations in context: A new perspective on collocation networks. International Journal of Corpus Linguistics, 20(2), 139-173.

#LANCSBOX X: #LancsBox X is designed for very large corpora. It natively supports XML, which allows working with rich metadata. Data can be loaded and imported into #LancsBox X very easily. #LancsBox X allows searching the BNC1994 and the BNC2014 (both provided).

More info: http://corpora.lancs.ac.uk/lancsbox/docs/pdf/LancsBoxX_EN.pdf

BNCLAB: BNClab is a user-friendly, interactive online corpus platform developed at Lancaster University to give students, teachers and researchers easy access to a large sample of spoken British English. The platform can be accessed directly here. The BNClab platform contains samples from two major corpora of British English:

The British National Corpus: The platform gives access to 5 million words from the BNC representing informal conversations in British English from the 1990s.
The British National Corpus 2014: The platform gives access to 5 million words from the BNC2014 representing informal conversations in British English from the 2010s.

The platform allows users to analyse each corpus separately or to compare the two samples in order to identify patterns of change in spoken British English over the course of twenty years. BNClab also allows users to analyse the effect of social variables such as gender, age and social class on language use. You can watch a tutorial on the use of BNClab at this link.

CORPUS FOR SCHOOLS: The website focuses on the use of corpora for teaching about spoken English. The aim of the project is to bring corpora and corpus methods into classrooms to teach students about the use of the English language. Corpora are large electronic collections of language samples that can be analysed automatically to identify regularities in language use. These patterns can be the result of sociolinguistic, psycholinguistic and historical processes in language and can be found in the language of social groups as well as individuals.

The project brings together corpus linguists, applied linguists, teachers and materials writers to develop teaching materials and online platforms that incorporate corpus-based findings as well as direct access to corpora to teach about how English is used in real-life situations. The materials were developed both for A-level English Language classes and for classes teaching English as a foreign or second language. The teaching materials can be accessed on this page. The Corpus for Schools website can be accessed here.

II. READINGS

We have put together a selection of readings from across a range of topics in corpus linguistics that can be freely accessed.

Baker, P., Gabrielatos, C. and McEnery, T. (2013) Sketching Muslims: A corpus-driven analysis of representations around the word “Muslim” in the British press 1998-2009. Applied Linguistics 34(3): 255-78. The paper analyses the representation of Muslims in the press and can be accessed here.

Baker, P. and Love, R. (2015) ‘The hate that dare not speak its name?’ Journal of Language, Aggression and Conflict 3(1): 57-86. The paper uses keyword and collocation analysis to examine the representation of gay people in parliamentary discourse and can be accessed here.

Baker, P. and Levon, E. (2015) ‘Picking the right cherries?: a comparison of corpus-based and qualitative analyses of news articles about masculinity.’ Discourse and Communication 9(2): 221-236. The paper offers a reflexive comparison of corpus-based and qualitative discourse analysis and can be accessed here.

Baker, P. and Vessey, R. (2018) A corpus-driven comparison of English and French Islamist extremist texts. International Journal of Corpus Linguistics 23(3): 255-278. The paper analyses extremist texts, using keywords to compare comparable French and English corpora. It can be accessed here.

Baker, J. P., & Levon, E. (2016). ‘That’s what I call a man’: Representations of racialised and classed masculinities in the UK print media. Gender and Language, 10(1). This article examines contemporary discourses of masculinity in the British press, using quantitative and qualitative analysis of a corpus of newspaper articles between 2003 and 2011. It can be accessed here.

Baker, P., Brookes, G., Atanasova, D., & Flint, S. W. (2020). Changing frames of obesity in the UK press 2008–2017. Social Science & Medicine, 264, 113403. This study examines how obesity is framed in the press using a 36-million-word database of UK newspaper articles mentioning the words ‘obese’ or ‘obesity’. It can be accessed here.

Brezina, V., & Meyerhoff, M. (2014). Significant or random? A critical review of sociolinguistic generalisations based on large corpora. International Journal of Corpus Linguistics, 19(1), 1-28. This article offers a critical review of a methodology often employed in corpus-based sociolinguistic studies. This methodology relies on a general comparison of frequencies of a target linguistic variable in socially defined sub-corpora. The main issue with this procedure is that it emphasises inter-group differences and ignores within-group variation. It can be accessed here.

Brezina, V., McEnery, T., & Wattam, S. (2015). Collocations in context: A new perspective on collocation networks. International Journal of Corpus Linguistics, 20(2), 139-173. This article discusses the concept of collocation networks and introduces GraphColl (part of #LancsBox), a new tool that builds collocation networks from user-defined corpora. The method of collocation networks is demonstrated with a case study on the late 17th- and early 18th-century discourse on swearing. It can be accessed here.

Brezina, V., & Gablasova, D. (2015). Is there a core general vocabulary? Introducing the New General Service List. Applied Linguistics, 36(1), 1-22. The article describes the methodology used in developing the corpus-based New General Service List. It also includes access to the new-GSL. The paper can be accessed here.

Brezina, V., Hawtin, A., & McEnery, T. (2021). The Written British National Corpus 2014–design and comparability. Text & Talk, 41(5-6), 595-615. It can be accessed here.

Brookes, G. and Baker, P. (2017) ‘What does patient feedback reveal about the NHS? A mixed methods study of comments posted to the NHS Choices online service’. BMJ Open 7(4). The paper examines patient feedback on the NHS and can be accessed here.

Brookes, G. (2021). ‘Lose weight, save the NHS’: Discourses of obesity in press coverage of COVID-19. Critical Discourse Studies, 1-19. This study examines the discourse around obesity in the British press in its coverage of the COVID-19 pandemic. It uses keyword analysis to argue that the discourses surrounding obesity have become more stigmatising in the context of the pandemic. It can be accessed here.

Brookes, G., & Baker, P. (2021). Patient feedback and duration of treatment: A corpus-based analysis of written comments on cancer care in England. Applied Corpus Linguistics, 1(3), 100010. This paper considers the relationship between the length of treatment for cancer and the feedback given in patient comments by analysing keywords that provide qualitative evaluations of the patient’s experience. It can be accessed here.

Collins, L. C. (2022). Pre-exposure prophylaxis (PrEP) and ‘risk’ in the news. Journal of Risk Research, 25(3), 379-394. This article investigates how ‘risk’ is discussed in the news in relation to the HIV prevention drug PrEP using corpus linguistics methods. It can be accessed here.

Collins, L., Brezina, V., Demjén, Z., Semino, E., & Woods, A. (2022). Corpus linguistics and clinical psychology: Investigating personification in first-person accounts of voice-hearing. International Journal of Corpus Linguistics. This paper applies corpus linguistic methods to first-person accounts by voice-hearers in order to understand their experiences and, potentially, to address their difficulties more effectively. It can be accessed here.

Culpeper, J. (2017). Shakespeare’s Language. English and Media Centre eMagazine, 77, 53-55. A brief overview of some of the ways in which corpus-based work is challenging myths about Shakespeare’s language. The paper can be accessed here.

Culpeper, J., Archer, D., Findlay, A., & Thelwall, M. (2018). John Webster, the dark and violent playwright? ANQ: A Quarterly Journal of Short Articles, Notes and Reviews, 31(3), 201-210. The paper shows how corpus/computational methods, specifically relating to semantics and emotion, challenge the idea that John Webster’s plays are outstandingly violent. It can be accessed here.

Culpeper, J. & Findlay, A. (forth.) National identities in the context of Shakespeare’s Henry V: Exploring contemporary understandings through collocations. Language and Literature. Analyses of collocates in millions of words of text written in Shakespeare’s time reveal what it meant to be Scots, Irish or Welsh, and how those general understandings might have impacted on Shakespeare’s construction of Celtic characters in Henry V. The paper can be accessed here.

Culpeper, J. (2018). Affirmatives in Early Modern English: Yes, yea and ay. Journal of Historical Pragmatics, 19(2), 243-264. This study investigates affirmatives in the early modern period, using A Corpus of English Dialogues 1560-1760. It can be accessed here.

Culpeper, J., Hardie, A., Demmen, J., Hughes, J., & Timperley, M. (2021). Supporting the corpus-based study of Shakespeare’s language: Enhancing a corpus of the First Folio. ICAME Journal, 45(1), 37-86. This article explores challenges in using corpus linguistics to analyse both Shakespeare and Early Modern English and offers some possible solutions to these issues. It can be accessed here.

Demmen, J., Semino, E., Demjén, Z., Koller, V., Hardie, A., Rayson, P., & Payne, S. (2015). A computer-assisted study of the use of violence metaphors for cancer and end of life by patients, family carers and health professionals. International Journal of Corpus Linguistics, 20(2), 205-231. This study combines quantitative corpus methods and qualitative analysis to look at violence metaphors for cancer and end of life in a corpus of data from patients, family carers, and healthcare professionals. It can be accessed here.

Gablasova, D., Brezina, V., & McEnery, T. (2017). Exploring learner language through corpora: Comparing and interpreting corpus frequency information. Language Learning, 67(S1), 130-154. This article contributes to the debate about the appropriate use of corpus data in language learning research. It focuses on frequencies of linguistic features in language use and their comparison across corpora. The article can be accessed here.

Gablasova, D., Brezina, V., McEnery, T., & Boyd, E. (2017). Epistemic stance in spoken L2 English: The effect of task and speaker style. Applied Linguistics, 38(5), 613-637. The article investigates epistemic stance in spoken L2 English production using a subset (advanced speakers) of the Trinity Lancaster Corpus of spoken L2 production. It can be accessed here.

Gablasova, D., Brezina, V., & McEnery, T. (2017). Collocations in corpus‐based language learning research: Identifying, comparing, and interpreting the evidence. Language Learning, 67(S1), 155-179. The paper critically reviews both the application of measures used to identify collocability between words and the nature of the relationship between two collocates. Particular attention is paid to the comparison of collocability across different corpora representing different genres, registers, or modalities. The paper can be accessed here.

Gablasova, D., Brezina, V., & McEnery, T. (2019). The Trinity Lancaster Corpus: Development, description and application. International Journal of Learner Corpus Research, 5(2), 126-158. This paper introduces a new corpus resource for language learning research, the Trinity Lancaster Corpus (TLC), which contains 4.2 million words of interaction between L1 and L2 speakers of English. The discussion of practical decisions taken in the construction of the TLC also enables a critical reflection on current methodological issues in corpus construction. It can be accessed here.

Hardie, A. (2014) Modest XML for Corpora: Not a standard, but a suggestion. ICAME Journal 38: 73-103. DOI: 10.2478/icame-2014-0004 The paper can be accessed here.

Hardaker, C., & McGlashan, M. (2016). “Real men don’t hate women”: Twitter rape threats and group identity. Journal of Pragmatics, 91, 80-93. This paper investigates the increasingly prominent phenomenon of rape threats made via social networks. Specifically, it investigates the sustained period of abuse directed towards the Twitter account of feminist campaigner and journalist, Caroline Criado-Perez. The paper can be accessed here.

Love, R., Brezina, V., McEnery, T., Hawtin, A., Hardie, A., & Dembry, C. (2019). Functional variation in the Spoken BNC2014 and the potential for register analysis. Register Studies, 1(2), 296-317. This article considers the design decisions taken in creating the Spoken BNC2014 with the goal of producing a representative corpus of contemporary British English, focusing on the representation of register. It can be accessed here.

Love, R., Dembry, C., Hardie, A., Brezina, V., & McEnery, T. (2017). The Spoken BNC2014: Designing and building a spoken corpus of everyday conversations. International Journal of Corpus Linguistics, 22(3), 319-344. This paper introduces the Spoken British National Corpus 2014, describing the process of designing and building the corpus. It can be accessed here.

McEnery, T., Brezina, V., & Baker, H. (2019). Usage Fluctuation Analysis: A new way of analysing shifts in historical discourse. International Journal of Corpus Linguistics, 24(4), 413-444. This article introduces a new method for the diachronic analysis of large historical corpora, Usage Fluctuation Analysis (UFA). UFA looks at the fluctuation of the usage of a word as observed through collocation.

McEnery, T., Brezina, V., Gablasova, D., & Banerjee, J. (2019). Corpus linguistics, learner corpora, and SLA: Employing technology to analyze language use. Annual Review of Applied Linguistics, 39, 74-92. In this article we explore the relationship between learner corpus research and second language acquisition research. By exploring some of the corpus building practices of learner corpus research, and the theoretical goals of second language acquisition studies, we identify reasons for the lack of interaction between the two fields and make proposals for how this situation could be fruitfully addressed. It can be accessed here.

McEnery, A., & Baker, H. (2016). Corpus linguistics and 17th-century prostitution: Computational linguistics and history. Bloomsbury Academic. This book describes how corpus linguistics can be used as a method in historiography, focusing on prostitution in 17th-century England. It can be accessed here.

McEnery, T., & Baker, H. (2017). The public representation of homosexual men in seventeenth-century England–A corpus based view. Journal of Historical Sociolinguistics, 3(2), 197-217. This article explores public discourse around homosexual men in early-modern English society, exploring methodological issues and historical context. It can be accessed here.

McEnery, T., & Baker, H. (2017). The poor in seventeenth-century England: A corpus based analysis. Token: A Journal of English Linguistics, 6, 51-83. This paper looks at the perceptions of poor people in 17th-century England through a corpus analysis of the phrase ‘the poor’ in the Early English Books Online corpus. It examines the changing representations through time using collocation analysis in each decade. It can be accessed here.

Öksüz, D., Brezina, V., & Rebuschat, P. (2021). Collocational processing in L1 and L2: The effects of word frequency, collocational frequency, and association. Language Learning, 71(1), 55-98. It can be accessed here.

Potts, A. and Semino, E. (2019) Cancer as a metaphor, Metaphor and Symbol, 34, 2, 81-95. The article presents the first systematic study of cancer as a metaphor in contemporary English, showing the forms, frequencies, and functions of 925 metaphorical uses of cancer-related vocabulary in two large English language corpora. It can be accessed here.

Potts, A., & Semino, E. (2017). Healthcare professionals’ online use of violence metaphors for care at the end of life in the US: a corpus-based comparison with the UK. Corpora, 12(1), 55-84. This paper compares frequency and type of violence metaphors in UK and US contexts, relating the findings to cultural and institutional contexts. It can be accessed here.

Semino, E., Demjén, Z. and Demmen, J. (2018) An integrated approach to metaphor and framing in cognition, discourse and practice, with an application to metaphors for cancer, Applied Linguistics, 39, 5, 625-45. In this article, we examine the notion of ‘framing’ as a function of metaphor from three interrelated perspectives—cognitive, discourse-based, and practice-based—with the aim of providing an adaptable blueprint of good practice in framing analysis. The article can be accessed here.

Semino, E., Demjén, Z., Demmen, J., Koller, V., Payne, S., Hardie, A. and Rayson, P. (2017). The online use of ‘Violence’ and ‘Journey’ metaphors by cancer patients, as compared with health professionals: a mixed methods study. BMJ Supportive and Palliative Care, 7, 1, 60-66. The paper compares the frequencies with which patients with cancer and health professionals use Violence and Journey metaphors when writing online, and investigates how patients with cancer use these metaphors. The paper can be accessed here.

III. VIDEO RECORDINGS

Quantitative Research Methods for Corpus Linguistics: ‘Structural equation modeling for corpus linguistic data’ by Tove Larsson, Assistant Professor of Applied Linguistics at Northern Arizona University, and Gregory Hancock, Professor of Measurement, Statistics and Evaluation in the Department of Human Development and Quantitative Methodology, University of Maryland

Reading: Larsson, T., Plonsky, L., & Hancock, G. R. (2021). On the benefits of structural equation modeling for corpus linguists. Corpus Linguistics and Linguistic Theory, 17(3), 683–714. https://doi.org/10.1515/cllt-2020-0051

Watch the interview: https://www.youtube.com/watch?v=OF2XvCqQa4I

Watch the introduction to the workshop: https://www.youtube.com/watch?v=SelkHjpFZ10

Lancaster Summer Schools in Corpus Linguistics

15-19 June 2026
