Building a corpus of deception: methodological and analytical considerations

Deception detection research has been largely dominated by psychologists, and linguists clearly have more to offer the field than they have contributed so far. Previous work in this area has primarily relied on LIWC (Pennebaker et al., 2001), which reports the percentage of words in a given text that fall into categories associated with particular personality traits and mental states. More recently, Archer and Lansley (2015) and McQuaid et al. (2015) have applied corpus linguistic methods to the field, using Wmatrix to investigate differences between truthful and deceptive corpora. However, there has never been a large-scale, systematic corpus study using deceptive spoken data, nor has the sociolinguistic nature of deception been investigated.

In this talk, I will discuss the state of the art in deception detection and identify a series of issues with this earlier work. In particular, I will discuss how certain methodological decisions can affect the quality and validity of the results, and how a different method of analysis can lead to more intuitive and nuanced findings. I will explain how I created a corpus of deception by following best practice in increasing motivation, cognitive load, and ecological validity. I will then discuss how more traditional corpus linguistic methods (such as keyword analysis, effect size measures, dispersion, and concordance analysis), combined with a more flexible, user-friendly analysis tool, can provide further insight into the sociolinguistic nature of deceptive discourse.
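
As a rough illustration of what keyword analysis paired with an effect size measure can involve, the sketch below compares word frequencies in a target word list against a reference word list. It is not the tool or procedure used in this study: the function name `keyness`, the choice of log-likelihood as the significance statistic, log ratio as the effect size, and the 0.5 substitution for zero reference frequencies are all assumptions made for the example.

```python
import math
from collections import Counter


def keyness(target_tokens, reference_tokens, min_freq=1):
    """Score each word in the target list against the reference list.

    Returns tuples of (word, target_freq, reference_freq,
    log_likelihood, log_ratio), sorted by log-likelihood, highest first.
    Hypothetical helper for illustration only.
    """
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    n1 = sum(target.values())      # size of target corpus in tokens
    n2 = sum(reference.values())   # size of reference corpus in tokens

    rows = []
    for word, a in target.items():
        if a < min_freq:
            continue
        b = reference.get(word, 0)

        # Expected frequencies if the word were spread evenly
        # across both corpora in proportion to their sizes.
        e1 = n1 * (a + b) / (n1 + n2)
        e2 = n2 * (a + b) / (n1 + n2)

        # Log-likelihood (G2); a zero observed frequency contributes 0.
        ll = 2 * (a * math.log(a / e1)
                  + (b * math.log(b / e2) if b else 0.0))

        # Log ratio: binary log of the ratio of relative frequencies.
        # Assumption: a zero reference frequency is replaced by 0.5
        # so that the ratio stays defined.
        log_ratio = math.log2((a / n1) / ((b if b else 0.5) / n2))

        rows.append((word, a, b, ll, log_ratio))

    return sorted(rows, key=lambda row: row[3], reverse=True)


# Toy word lists, invented purely to show the call.
deceptive = "i did not take the money i swear".split()
truthful = "we went to the shop and we bought the milk".split()

for word, a, b, ll, lr in keyness(deceptive, truthful):
    print(f"{word:8} {a:3} {b:3} LL={ll:6.2f} LogRatio={lr:6.2f}")
```

Reporting both values reflects the distinction drawn above: log-likelihood indicates how much evidence there is for a frequency difference, while log ratio indicates how large that difference is. Dispersion and concordance analysis would then be needed to check whether a keyword is spread across speakers and how it is actually used in context.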

