Prospects for using the digital humanities in demographic history

On Monday 28 November 2016 the Spatial Humanities project held a meeting of invited experts from around the world to consider together how digital approaches, such as the ones developed during the project, might contribute to future research in demographic history. The focus of the day was to look at current research challenges in these fields, and ask where the tools of digital humanities could be of most use. The goal was to help clarify a future research agenda in which the digital humanities move from the demonstration of tools and techniques to the delivery of new knowledge. A group of distinguished participants gave presentations on the state of the field, current research questions and challenges, and their own thoughts about where digital humanities could have an impact. This site’s readers may be interested in a short report.

The Spatial Humanities project has succeeded in demonstrating that corpus linguistic approaches to text are capable of creating new historical demographic knowledge, which can augment what we were able to discover previously using purely quantitative sources, such as the Census or birth and death registrations. It would be wrong to call this advance revolutionary, but it does add something valuable to the historian’s toolkit. First of all, corpus linguistic explorations can be used to create spatial knowledge, as the project’s Catherine Porter showed. A geoparser can be trained to identify place-names in a text and combined with corpus queries to return all the places mentioned, for example, by the Registrar-General in a given period when talking about cholera, or water-borne diseases generally. A range of quantitative spatial analyses can be performed on the results to examine their significance.
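The idea of combining place-name identification with a corpus query can be illustrated with a very small sketch. It is not the project's geoparser: a hand-made gazetteer lookup stands in for a trained geoparser, the coordinates are approximate, and the names GAZETTEER and places_near_keyword are invented for this illustration.

```python
import re
from collections import Counter

# Hypothetical mini-gazetteer (approximate latitude, longitude).
# A real geoparser would resolve place-names from a far larger resource.
GAZETTEER = {
    "London": (51.507, -0.128),
    "Liverpool": (53.408, -2.992),
    "Leeds": (53.801, -1.549),
}


def places_near_keyword(text, keyword, gazetteer=GAZETTEER):
    """Count gazetteer places named in sentences that mention the keyword.

    Each place is counted at most once per matching sentence.
    """
    counts = Counter()
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    for sentence in sentences:
        if keyword.lower() not in sentence.lower():
            continue
        for place in gazetteer:
            if re.search(r"\b" + re.escape(place) + r"\b", sentence):
                counts[place] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "Cholera was severe in Liverpool and London. "
        "Leeds reported measles. "
        "Cholera returned to Liverpool."
    )
    for place, n in places_near_keyword(sample, "cholera").items():
        print(place, n, GAZETTEER[place])
```

The resulting counts, joined to coordinates, are the kind of table on which spatial analyses could then be run.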

Corpus linguistics can create new historical knowledge in other ways, as Paul Atkinson, also from the project, demonstrated later in the day. Corpus-based discourse analysis, on lines proposed by linguists like Paul Baker and Norman Fairclough, can objectively show how frequent a significant word pattern such as ‘infant mortality’ or ‘nursing mother’ is within a corpus such as the published text of a newspaper. This can be mapped over time. Somewhat more tentatively, deductions can be made from these ‘distant readings’ of very large texts about the underlying ideologies which the published text reinforced. Paul presented conclusions about how discourses in newspapers helped form Victorian society’s construct of infant welfare.
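Tracking how often a word pattern appears over time can be sketched as follows. This is a minimal illustration, assuming the corpus is available as (year, text) pairs; the function name and the per-million-words normalisation are choices made for the example, not a description of the tools used by the project.

```python
import re
from collections import defaultdict


def relative_frequency_by_year(articles, phrase):
    """Return {year: occurrences of phrase per million words}.

    `articles` is an iterable of (year, text) pairs. Matching is
    case-insensitive and respects word boundaries.
    """
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    hits = defaultdict(int)
    words = defaultdict(int)
    for year, text in articles:
        hits[year] += len(pattern.findall(text))
        # Whitespace tokenisation is crude but adequate for a sketch.
        words[year] += len(text.split())
    return {
        year: hits[year] / words[year] * 1_000_000
        for year in sorted(words)
        if words[year] > 0
    }


if __name__ == "__main__":
    toy_corpus = [
        (1850, "infant mortality is high"),
        (1850, "the weather today"),
        (1860, "Infant mortality and infant mortality again"),
    ]
    for year, rate in relative_frequency_by_year(toy_corpus, "infant mortality").items():
        print(year, round(rate))
```

Normalising by the number of words in each year matters because newspapers grew in size over the period, so raw counts alone would exaggerate any rise.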

Ian Shuttleworth (Queen’s University Belfast) and Paul Norman (University of Leeds) contributed a view from geography and social science, based on experience of post-1970 British quantitative data from longitudinal surveys. Adding a text dimension, in Ian’s view, might allow us to capture shifting political sentiments as they affected immigration, for example. Ruth Byrne and James Perry, Lancaster doctoral students, later presented the work they are already doing on migration, using corpus linguistics and other digital tools. Social scientists are more familiar with topic modelling than corpus linguistics for this sort of enquiry: Andrew Hardie (Lancaster) argued that corpus linguistics had a more accurate model of how language worked and should give better results.

Difficult research topics such as long-term trends in internal migration, a central topic in many demographic studies, would of course benefit from more use of automated record linkage, and the meeting heard of some plans in this area. We speculated whether such a system might be able to ingest newspaper text such as court reports, offering a new source of demographic data. In practice, however, suggested Colin Pooley (Lancaster), crime reports usually showed surprisingly little interest in singling out migrants and recording their place of birth; a contrast with some modern newspaper reporting. Andy Beveridge (City University of New York and Social Explorer, Inc.) added his experience of similar types of enquiry with US data. Andy addressed some of the challenges in combining geographical data, such as race data from the Census and housing quality from an independent source. As far as adding text was concerned, his greatest wish was for a way to automate the import of tables from a text source, making the tables machine-readable.

The work of the Cambridge Group for the History of Population and Social Structure naturally played a prominent part in our discussions. The group was represented by Joe Day, Eilidh Garrett and Alice Reid, who provided updates on current projects including the very exciting Atlas of Victorian Fertility Decline. One project which may soon combine text with quantitative data is Transport, urbanization and economic development in England and Wales c.1670-1911, where the use of Bradshaw’s famous railway guide is contemplated.

Isabelle Devos (University of Ghent), another leading contributor to historical demography, also gave an update on her group’s work. She offered some wider lessons to the research community. First, there was a trade-off between a data-driven approach which ‘let the data speak’ and a question-driven one. Each had strengths and weaknesses which needed to be borne in mind. In common with other historians at the meeting, and also with Tim Hitchcock, Isabelle urged us not to lose sight of the need for source-criticism when the source had been automated and/or provided on an open access platform. The scholars who create these digital sources are rarely under many illusions about the reliability or representativeness of the underlying source, or the effect which digital processing has had on it, but later users may come to grief if they ignore these things. Published metadata needs to be of high quality, and needs to be used!

We benefited from two contributions which reflected more widely on the strengths and weaknesses of digital humanities. Colin Pooley (Lancaster) saw considerable potential in the new methods if properly applied. In sympathy with the goals of the meeting, he suggested that this involved a firm focus on developing good research questions, rather than producing research that was ‘technique-led’. Rightly done, corpus linguistics promoted lateral thinking about what source to explore for which question. The linking of quantitative with qualitative enquiry allowed triangulation and led to better explanations: quantitative studies should not try to solve every research problem with statistical techniques, but instead examine issues qualitatively when it would be more enlightening. Colin cautioned that some corpus linguistic tools, at least, presented query results in a very decontextualized way. For the historian, context was all-important: a sentence in a diary might need to be read in the light of the whole diary, not just the hundred words or so around it. The problem was not insuperable and a tool for historians which permitted them to link back from a query result to the page image and the whole digitised document was a possibility, suggested Andrew Hardie.

Our final contributor, Kevin Schürer, made a plea for scholars to break down the disciplinary divides which sometimes separated, for instance, researchers looking at transport, migration and population. There was a series of GIS datasets which did not currently integrate – a missed opportunity. Before GIS, scholars might overlay data on different variables onto a map by hand: why should it be so difficult now, after GIS had promised us so much? One layer historians should use more was the environment itself: elevation, terrain and climate. A ‘flat-earth’ kind of GIS which left out the terrain risked missing the explanations for the phenomena it sought to represent. Text might help us recover historical climate data (for instance from the Registrar-General’s reports). Another data set which could very profitably be mined out of large newspaper text collections was the local variation in the prices of commodities like grain, extremely valuable for studying the cost of living and its fluctuations. Newspapers could give a much richer data set than the national values and scattered local samples often used now. Now that historical data sets looked more like modern ones there was no excuse not to learn methodological lessons from modern transport planners or demographers.

Summing up a very valuable day, Ian Gregory (Professor of Digital Humanities, Lancaster) commented that with the ever more sophisticated data sets now available to historians, we needed to use the latest tools on them, letting us work with large scale sources in ways which had not previously been possible.