Online resources

This webpage offers a range of online resources in corpus linguistics and the application of the corpus method. These resources can be used to develop knowledge and skills in corpus linguistics.



CQPweb is a web-based corpus analysis system, intended to address the conflicting requirements for usability and power in corpus analysis software. CQPweb’s main innovative feature is its flexibility; its more generalised data model makes it compatible with any corpus. The analysis options available in CQPweb include: concordancing; collocations; distribution tables and charts; frequency lists; and keywords or key tags. Despite some limitations, CQPweb makes a sophisticated query system accessible to untrained users and so combines ease of use, power and flexibility to a very high degree.

CQPweb can be accessed here. To read more about CQPweb, see Hardie, A. (2012). CQPweb – combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics 17 (3): 380–409. [Full text on publisher’s website] [Alternative source for PDF]
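To give a rough sense of what concordancing (one of the analysis options listed above) involves, here is a minimal Python sketch of a keyword-in-context (KWIC) display. It is an illustration only: it does not use CQPweb or its query syntax, and the function name `concordance`, the simple word tokeniser and the sample sentence are all assumptions made for this example.

```python
import re


def concordance(text, node, window=4):
    """Return (left context, node, right context) for each hit of `node`."""
    # Very simple tokenisation: lower-case the text and keep runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, token in enumerate(tokens):
        if token == node.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


# Invented sample text, used only to show the output format.
sample = "The cat sat on the mat. The dog chased the cat across the garden."

for left, node, right in concordance(sample, "cat", window=3):
    # Right-align the left context so the node word lines up in a column.
    print(f"{left:>20} | {node} | {right}")
```

A real concordancer would also handle punctuation, annotation and very large corpora, which is where a dedicated system such as CQPweb comes in.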


#LancsBox is a new-generation software package for the analysis of language data and corpora developed at Lancaster University. #LancsBox can be accessed and downloaded for free here. #LancsBox:

  • Works with your own data or existing corpora.
  • Can be used by linguists, language teachers, historians, sociologists, educators and anyone interested in language.
  • Visualises language data.
  • Analyses data in any language.
  • Automatically annotates data for part-of-speech.
  • Works with any major operating system (Windows, Mac, Linux).

Read more in: Brezina, V., McEnery, T., & Wattam, S. (2015). Collocations in context: A new perspective on collocation networks. International Journal of Corpus Linguistics, 20(2), 139-173.


BNClab is a user-friendly, interactive online corpus platform developed at Lancaster University to give students, teachers and researchers easy access to a large sample of spoken British English. The platform can be accessed directly here. The BNClab platform contains samples from two major corpora of British English:

  • The British National Corpus: The platform gives access to 5 million words from the BNC representing informal conversations in British English from the 1990s.
  • The British National Corpus 2014: The platform gives access to 5 million words from the BNC2014 representing informal conversations in British English from the 2010s.

The platform allows users to analyse language in each of the corpora separately or to compare the two samples in order to identify patterns of change in spoken British English over the course of twenty years. BNClab also allows users to analyse the effect of social variables such as gender, age and social class on language use. You can watch a tutorial on the use of BNClab at this link.
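When frequencies are compared across corpora or across social groups, raw counts are usually converted to relative frequencies (for example, occurrences per million words) so that samples of different sizes can be compared fairly. The Python sketch below illustrates that idea only; it is not how BNClab works internally, and the function names and the two tiny invented samples are assumptions made for this example.

```python
import re
from collections import Counter


def relative_frequencies(text, per=1_000_000):
    """Map each word in `text` to its frequency per `per` tokens."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total * per for word, count in counts.items()}


def compare(word, sample_a, sample_b):
    """Return the relative frequency of `word` in each of two samples."""
    freq_a = relative_frequencies(sample_a).get(word, 0.0)
    freq_b = relative_frequencies(sample_b).get(word, 0.0)
    return freq_a, freq_b


# Invented toy samples; they are NOT taken from the BNC or the BNC2014.
older_sample = "lovely day lovely weather"
newer_sample = "awesome day nice weather"

older, newer = compare("lovely", older_sample, newer_sample)
print(f"lovely: {older:.0f} vs {newer:.0f} per million words")
```

With samples this small the figures mean nothing; the point is only that both values are expressed on the same scale before they are compared.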


The Corpus for Schools website focuses on the use of corpora for teaching about spoken English. The aim of the project is to bring corpora and corpus methods into classrooms to teach students about the use of the English language. Corpora are large electronic collections of language samples that can be analysed automatically to identify regularities in language use. These patterns can be the result of sociolinguistic, psycholinguistic and historical processes in language and can be found in the language of social groups as well as individuals.

The project brings together corpus linguists, applied linguists, teachers and materials writers to develop teaching materials and online platforms that incorporate corpus-based findings as well as direct access to corpora to teach about how English is used in real-life situations. The materials were developed both for A-level English Language classes and for English as a foreign/second language classes. The teaching materials can be accessed on this page. The Corpus for Schools website can be accessed here.



We have put together a selection of readings from across a range of topics in corpus linguistics that can be freely accessed.

Baker, P., Gabrielatos, C. and McEnery, T. (2013) Sketching Muslims: A corpus-driven analysis of representations around the word “Muslim” in the British press 1998-2009. Applied Linguistics 34(3): 255-78. The paper offers an analysis of the representation of Muslims in the press and can be accessed here.

Baker, P. and Love, R. (2015) ‘The hate that dare not speak its name?’ Journal of Language, Aggression and Conflict 3(1): 57-86. The paper uses keyword and collocation analysis to examine the representation of gay people in parliament and can be accessed here.
Baker, P. and Levon, E. (2015) ‘Picking the right cherries?: a comparison of corpus-based and qualitative analyses of news articles about masculinity.’ Discourse and Communication 9(2): 221-236. The paper is a reflexive comparison of corpus and qualitative discourse analysis and can be accessed here.
Baker, P. and Vessey, R. (2018) A corpus-driven comparison of English and French Islamist extremist texts. International Journal of Corpus Linguistics 23(3): 255-278. The paper analyses extremist texts, using keywords to compare comparable French and English corpora. It can be accessed here.
Brezina, V., & Meyerhoff, M. (2014). Significant or random? A critical review of sociolinguistic generalisations based on large corpora. International Journal of Corpus Linguistics, 19(1), 1-28. This article offers a critical review of a methodology often employed in corpus-based sociolinguistic studies. This methodology relies on a general comparison of frequencies of a target linguistic variable in socially defined sub-corpora. The main issue with this procedure is that it emphasises inter-group differences and ignores within-group variation. It can be accessed here.
Brezina, V., McEnery, T., & Wattam, S. (2015). Collocations in context: A new perspective on collocation networks. International Journal of Corpus Linguistics, 20(2), 139-173. This article discusses the concept of collocation networks and introduces GraphColl (part of #LancsBox), a new tool that builds collocation networks from user-defined corpora. The method of collocation networks is demonstrated with a case study on the late 17th and early 18th centuries’ discourse on swearing. It can be accessed here.
Brezina, V., & Gablasova, D. (2015). Is there a core general vocabulary? Introducing the New General Service List. Applied Linguistics, 36(1), 1-22. The article describes the methodology used in developing the corpus-based New General Service List. It also includes access to the new-GSL. The paper can be accessed here.
Brookes, G. and Baker, P. (2017) ‘What does patient feedback reveal about the NHS? A mixed methods study of comments posted to the NHS Choices online service’. BMJ Open 7(4). The paper examines patient feedback in the NHS and can be accessed here.
Culpeper, J. (2017). Shakespeare’s Language. English and Media Centre eMagazine, 77, 53-55. A brief overview of some of the ways in which corpus-based work is challenging myths about Shakespeare’s language. The paper can be accessed here.
Culpeper, J., Archer, D., Findlay, A., & Thelwall, M. (2018). John Webster, the dark and violent playwright? ANQ: A Quarterly Journal of Short Articles, Notes and Reviews, 31(3), 201-210. The paper shows how corpus/computational methods, specifically relating to semantics and emotion, challenge the idea that John Webster’s plays are outstandingly violent. It can be accessed here.
Culpeper, J. & Findlay, A. (forth.) National identities in the context of Shakespeare’s Henry V: Exploring contemporary understandings through collocations. Language and Literature. Analyses of collocates in millions of words of text written in Shakespeare’s time reveal what it meant to be Scots, Irish or Welsh, and how those general understandings might have impacted on Shakespeare’s construction of Celtic characters in Henry V. The paper can be accessed here.
Gablasova, D., Brezina, V., & McEnery, T. (2017). Exploring learner language through corpora: Comparing and interpreting corpus frequency information. Language Learning, 67(S1), 130-154. This article contributes to the debate about the appropriate use of corpus data in language learning research. It focuses on frequencies of linguistic features in language use and their comparison across corpora. The article can be accessed here.
Gablasova, D., Brezina, V., McEnery, T., & Boyd, E. (2017). Epistemic stance in spoken L2 English: The effect of task and speaker style. Applied Linguistics, 38(5), 613-637. The article investigates epistemic stance in spoken L2 English production using a subset (advanced speakers) of the Trinity Lancaster Corpus of spoken L2 production. It can be accessed here.
Gablasova, D., Brezina, V., & McEnery, T. (2017). Collocations in corpus‐based language learning research: Identifying, comparing, and interpreting the evidence. Language Learning, 67(S1), 155-179. The paper critically reviews both the application of measures used to identify collocability between words and the nature of the relationship between two collocates. Particular attention is paid to the comparison of collocability across different corpora representing different genres, registers, or modalities. The paper can be accessed here.
Gablasova, D., Brezina, V., & McEnery, T. (2019). The Trinity Lancaster Corpus: Development, description and application. International Journal of Learner Corpus Research, 5(2), 126-158. This paper introduces a new corpus resource for language learning research, the Trinity Lancaster Corpus (TLC), which contains 4.2 million words of interaction between L1 and L2 speakers of English. The discussion of practical decisions taken in the construction of the TLC also enables a critical reflection on current methodological issues in corpus construction. It can be accessed here.
Hardie, A. (2014) Modest XML for Corpora: Not a standard, but a suggestion. ICAME Journal 38: 73-103. DOI: 10.2478/icame-2014-0004. The paper can be accessed here.
Hardaker, C., & McGlashan, M. (2016). “Real men don’t hate women”: Twitter rape threats and group identity. Journal of Pragmatics, 91, 80-93. This paper investigates the increasingly prominent phenomenon of rape threats made via social networks. Specifically, it investigates the sustained period of abuse directed towards the Twitter account of feminist campaigner and journalist, Caroline Criado-Perez. The paper can be accessed here.
McEnery, T., Brezina, V., & Baker, H. (2019). Usage Fluctuation Analysis: A new way of analysing shifts in historical discourse. International Journal of Corpus Linguistics, 24(4), 413-444. This article introduces a new method for the diachronic analysis of large historical corpora, Usage Fluctuation Analysis (UFA). UFA looks at the fluctuation of the usage of a word as observed through collocation.
McEnery, T., Brezina, V., Gablasova, D., & Banerjee, J. (2019). Corpus linguistics, learner corpora, and SLA: Employing technology to analyze language use. Annual Review of Applied Linguistics, 39, 74-92. In this article we explore the relationship between learner corpus and second language acquisition research. By exploring some of the corpus building practices of learner corpus research, and the theoretical goals of second language acquisition studies, we identify reasons for the lack of interaction between the two fields and make proposals for how this situation could be fruitfully addressed. It can be accessed here.
Potts, A. and Semino, E. (2019) Cancer as a metaphor. Metaphor and Symbol, 34, 2, 81-95. The article presents the first systematic study of cancer as a metaphor in contemporary English, showing the forms, frequencies, and functions of 925 metaphorical uses of cancer-related vocabulary in two large English language corpora. It can be accessed here.
Semino, E., Demjén, Z. and Demmen, J. (2018) An integrated approach to metaphor and framing in cognition, discourse and practice, with an application to metaphors for cancer. Applied Linguistics, 39, 5, 625-45. In this article, we examine the notion of ‘framing’ as a function of metaphor from three interrelated perspectives (cognitive, discourse-based, and practice-based) with the aim of providing an adaptable blueprint of good practice in framing analysis. The article can be accessed here.
Semino, E., Demjén, Z., Demmen, J., Koller, V., Payne, S., Hardie, A. and Rayson, P. (2017). The online use of ‘Violence’ and ‘Journey’ metaphors by cancer patients, as compared with health professionals: a mixed methods study. BMJ Supportive and Palliative Care, 7, 1, 60-66. The aim of the paper is to compare the frequencies with which patients with cancer and health professionals use Violence and Journey metaphors when writing online, and to investigate the use of these metaphors by patients with cancer. The paper can be accessed here.



Watch the video recordings of some of our past lectures featuring different topics in corpus linguistics.

Lecture No. 1: Prof Elena Semino talks about Corpus Linguistics and health communication: The case of chronic pain

This lecture will demonstrate the use of corpus linguistic tools to carry out research on communication about chronic pain in healthcare settings. Pain is notoriously difficult to put into words, and this is well known to cause problems in diagnosis and treatment. Two studies will be introduced, respectively on (a) a language-based diagnostic questionnaire for pain, and (b) the use of visual images in specialist pain consultation. In both cases, the application of corpus tools leads to findings that are directly relevant to healthcare professionals who care for people with chronic pain. Elena Semino is the Director of the Centre for Corpus Approaches to Social Science (CASS), a leading research centre focusing on the development of corpus methods and their applications to different fields in Social Science, Arts and Humanities. Elena’s research interests are in stylistics, metaphor theory and analysis as well as the medical humanities and health communication. To read more about Prof Semino’s research, follow this link.


Lecture No. 2: Prof Jonathan Culpeper talks about Debunking myths about Shakespeare’s language with corpus methods

This lecture shows how corpus-related techniques can address myths about Shakespeare’s language, such as the claim that he invented a huge number of words. Moreover, along the way, it reflects on particular difficulties that attend corpus data, including how to define the notion of a word and how to tackle non-standard language. Jonathan Culpeper is Professor at the Department of Linguistics and English Language, Lancaster University. His research interests are in pragmatics, English historical linguistics, stylistics and Shakespeare’s language. He is the Principal researcher of the Encyclopaedia of Shakespeare’s Language Project, a £1 million project funded by the AHRC. The essential aim of the project is to bring corpus methods to the study of Shakespeare’s language, providing a systematic description of his words and language patterns, and showing how they compare with those of his contemporaries. To read more about Prof Culpeper’s research, follow this link.


Lecture No. 3: Dr Vaclav Brezina talks about Understanding statistics for corpus analysis

This lecture offers an introduction to statistical methods used in corpus linguistics. It does not presuppose any knowledge of statistics; instead, it invites you on a journey to explore different aspects of statistics applied to linguistic data. The lecture addresses the following key questions:

  • What are the underlying principles of statistical thinking in corpus linguistics?
  • What statistical techniques do we have at our disposal in descriptive and inferential statistics?
  • What is the best practice in the field?

Vaclav Brezina is a Senior Lecturer at Lancaster University, UK and a member of the ESRC Centre for Corpus Approaches to Social Science. His research interests are in the areas of corpus design, statistics and applied linguistics. He is the author of Statistics in Corpus Linguistics: A Practical Guide (CUP).