Organisers:
Dr Mo El-Haj, Senior Lecturer in NLP and Co-Director of the UCREL NLP Group at Lancaster University.
Dr Saad Ezzini, Lecturer in Computer Science and member of the Software Engineering (SE@L) and UCREL NLP groups at Lancaster University.
Hello, we’re thrilled to host this workshop dedicated to language model creation for under-resourced languages. With a strong focus on linguistic diversity, Lancaster University has organised previous workshops and undertaken projects on languages like Welsh, Igbo, Luxembourgish, and various Arabic dialects. CLIDA 2024 represents our ongoing commitment to pushing the boundaries of linguistic innovation and digital support for these languages.
Keynote Speakers and talk details:
Professor Dawn Knight, Cardiff University
Title: Enhancing language technology resources in minoritised language contexts: past, present and future research projects and opportunities
Abstract: The creation of language technology resources in a minoritised language context poses interesting challenges, but also presents opportunities that are not always available to developers of such resources for larger languages. In this presentation I will demonstrate how scrutiny of the unique context of a specific minoritised language, and meaningful collaboration with potential user groups, can determine the design and construction of language resources. This presentation showcases some recent interdisciplinary and cross-institutional projects involving applied/corpus linguists (at Cardiff University) and colleagues in NLP (Lancaster University). This will include a short demonstration of the CorCenCC corpus (the National Corpus of Contemporary Welsh: www.corcencc.org), and an overview of a range of satellite projects including the Thesawrws and FreeTxt (a bilingual toolkit that supports the analysis and visualisation of free text data) projects, and will overview the recently launched GDC-WDG website (an online collection of freely available digital resources designed to support the exploration, analysis, learning, and referencing of the Welsh language). The creation of these resources involved the development of important new tools and processes, including, in the case of CorCenCC, a unique user-driven corpus design in which language data was collected and validated through crowdsourcing, and an in-built pedagogic toolkit (Y Tiwtiadur) developed in consultation with representatives of all anticipated academic and community user groups. The approaches used to construct the resources mentioned in this talk provide an invaluable template for those researching other minoritised or minority languages. The specifics of how this template might inform corpus construction in these/such languages will be discussed in more depth during my presentation.
Bio: Dawn Knight is a Professor of English Language and Applied Linguistics at Cardiff University, Wales. Her research interests lie in the areas of corpus linguistics, multimodality and discourse analysis. Dawn has expertise in conceptualising, theorising and applying innovative interdisciplinary approaches/methodologies for extracting and predicting language patterns within/across social and linguistic contexts. Her pioneering work on Welsh language resource development (including CorCenCC and FreeTxt), supported by major AHRC, ESRC and Welsh Government grants, is helping to change the landscape of minoritised language research and the potential real-world applications of corpora/corpus-based enquiry.
Kevin Scannell
Title: Explainable AI for Irish grammatical error correction
Abstract: Grammatical error correction is an important end-user application of natural language processing. In recent years, approaches using large language models have led to improved performance on this task, at least for English and a few other well-resourced languages. Nevertheless, it remains challenging to build systems that (1) provide results that are sufficiently reliable for end-users and (2) give some explanation for errors that they detect for the benefit of language learners. I will discuss recent progress on this problem for the Irish language. The primary challenge is assembling a large enough dataset for training: we make use of both synthetic data produced with the help of an Irish dependency parser and error examples mined from Wikipedia edit logs.
Bio: Kevin Scannell is a software developer specializing in computational resources that help speakers of indigenous and minority languages use their language online, with a particular focus on tools that support the Irish, Manx, and Scottish Gaelic communities. He has developed an Irish spell checker, grammar checker, and thesaurus, as well as a number of dictionaries and translation engines for the Gaelic languages. He is also a member of the team that has produced Irish localisations of some important software products including Gmail, Twitter, Firefox, and WhatsApp. Kevin was professor of mathematics and computer science at Saint Louis University from 1998 to 2023.
Dr Daniel Cunliffe, University of South Wales
Title: Exploring the presence of Cymraeg on TikTok
Abstract: A presence in technological domains can illustrate vitality and demonstrate the relevance of Celtic languages in contemporary society. The use, or non-use, of a language on social media is generally recognised as being particularly significant for younger speakers due to their high levels of social media use. This presentation will consider the presence of Cymraeg (the Welsh language) on one popular social media platform – TikTok. Based on a manual analysis of a corpus of 200 videos, it will explore various aspects of the content producers, the videos and audience comments, with a particular focus on the use of language. The corpus reveals a complex and richly bilingual content space with considerable intermingling of Cymraeg and English. Though TikTok is primarily a video sharing platform, commenting on videos and engaging in a text-based conversation with the content creator and other audience members is shown to be a significant activity. Whilst a corpus-based study doesn’t provide answers to all the important questions, it demonstrates that Cymraeg is being used, taught, discussed, and promoted on TikTok. It also shows that TikTok deserves to be taken seriously by those who study minority language use and by those who promote minority language use.
Bio: Daniel Cunliffe is an Associate Professor in the Faculty of Computing, Engineering and Science at the University of South Wales. Since 2001, his research has applied insights and methods from fields such as computer-mediated communication, human-computer interaction and user experience design to the study of minority language use in information and communications technology.
Mr Gruffudd Prys, Bangor University
Title: Recent Language Technology developments for Welsh at Bangor University
Abstract: Over the past few years, with funding from the Welsh Government, the Language Technologies Unit at Bangor University has developed a large number of open resources and services for the Welsh language in a number of areas including speech recognition, text-to-speech, machine translation and NLP. As well as giving an overview of these, the talk will address some of the challenges faced in producing such resources, be they linguistic, technological or legal in nature. The talk will also outline the future aims of the Language Technologies Unit as part of the broader effort to build a language technology infrastructure for the Welsh Government’s long-term strategy, Cymraeg 2050.
Bio: Gruffudd Prys is the Head of the Language Technologies Unit at Bangor University. He serves as a terminologist and senior editor of Y Termiadur Addysg, the terminology dictionary standardizing Welsh-language terminology for school-age education in Wales. As part of this role, he has worked on developing an NLP infrastructure to facilitate terminology development at scale. In addition to this role, he has also managed a number of European-funded projects in the fields of language technology and translation. His current research interests are the appropriate classification of the Welsh verbal noun in NLP settings, and the increasing need to bridge the gap between dictionary forms of Welsh and the forms used in everyday speech.
Dr Inge Birnie, University of Strathclyde (Glasgow)
Title: Gaelic language use and identity in the digital era
Abstract: Gaelic has become increasingly marginalised in the daily linguistic practices of communities across Scotland (Birnie, 2024). This means that language revitalisation and support initiatives in Scotland to promote the use of Gaelic have increasingly focussed on the use of technology to encourage the learning and use of the language, but also in the creation of new communities of practice (Bòrd na Gàidhlig, 2018), which are not bounded by a geographical physical space (Moriarty, 2015). These initiatives have included ongoing support for the traditional media in Gaelic (Radio nan Gàidheal and BBC Alba), but also the establishment of World Gaelic Week, which has included a significant online presence and aims to raise the profile of the language through community initiatives, projects and events. Drawing on data collected across different studies which evaluate the use of Gaelic across different (online) media, this paper will discuss the challenges and opportunities created by these initiatives for the promotion of Gaelic and the creation of virtual breathing spaces (Belmar & Glass, 2019). Findings of these studies highlight the complex nature of identities of (young) Gaelic speakers and how these impact on their current engagement with Gaelic and technology as well as their perceptions of digital platforms. These plurilingual identities mean that engagement with Gaelic media is space and place dependent and their online practices and engagement include multiple languages. Drawing on these findings, this paper concludes by discussing some of the new technological developments and how these might contribute to a sustainable digital future for Gaelic as we enter the Human-machine era – with its focus on speaking in and through technology.
Bio: Dr Inge Birnie is a senior lecturer in the Institute of Education at the University of Strathclyde (Glasgow). Her research interests focus on the use of minority languages, and in particular Gaelic, both in real and virtual communities. Her work has focussed on exploring the public social linguistic soundscape – the languages that individuals use in interactions – and how speakers can be encouraged to use their language with others. This work is now expanded to consider the online communities and how technology can support the creation of virtual breathing spaces and the promotion and sustained use of the language.
She is currently leading a small working group as part of the Language in the Human Machine Era (LITHME) COST+ Action in the creation of recommendations for intergovernmental organisations on the protection of minority and regional languages as advances move towards speaking in and through technology.
Dr Merryn Davies-Deacon, Queen’s University Belfast
Title: Variation and communities among speakers of Breton and Cornish
Abstract: This talk draws on research on Breton and Cornish conducted over the past few years. As a revived language, Cornish is unusual in that almost all active speakers have made the conscious decision to learn the language as adults; in the Breton case, a similar “new speaker” community is prominent in discourse and associated with a particular set of both linguistic and non-linguistic practices. These include a supposed preference for standardised linguistic varieties and for Celtic-based neologisms over borrowings from French, as well as involvement with language activism. Accordingly, this sets new speakers apart from traditional speakers, those who have acquired Breton by means of intergenerational transmission and are said to speak dialectal varieties and to have different beliefs and goals in their use of Breton. While the Cornish speaker community can be characterised as composed entirely of new speakers, this does not mean it is immune from similar splits: different approaches to how to use traditional Cornish as a source for the revived language have led to divergent opinions on how to spell, pronounce, and use Cornish in the twenty-first century. I argue, firstly, that these splits are more complex than they are sometimes depicted. My research shows that many competent speakers of Breton do not fit neatly into the new or traditional speaker category based on their backgrounds and linguistic practices, and innovative uses of languages are emerging in online contexts in particular. In the case of Cornish, while a new standard orthography was proposed in 2008 with the aim of reconciling tensions within the community, this failed to win overall acceptance and the old divisions are still present, albeit less acceptable in public discourse. 
The failure of this “Standard Written Form” comes from the very fact that it was proposed as a “compromise orthography” – it does not meet the ideological needs of Cornish speakers, which, I propose, are more important to these speakers than strictly linguistic concerns. Secondly, leading on from this, I argue that it is important to take into account the linguistic and sociolinguistic diversity in these language communities in academic work, including in computational approaches, and to bear in mind the fact that some speakers are doubly minoritised: they may be users of non-mainstream language varieties within an already small minoritised language community.
Bio: Merryn Davies-Deacon grew up in a bilingual household in Cornwall and has been Lecturer in French Linguistics at Queen’s University Belfast since 2020. Merryn’s research to date has focused on the sociolinguistics of Cornish and Breton, and particularly on the interface between choices around linguistic forms and language attitudes, ideologies, and identities.
Dr David Howcroft, Edinburgh Napier University
Title: Scottish Gaelic in Natural Language Processing
Abstract: Scottish Gaelic is well-represented in NLP research for a low-resource language with only a few tens of thousands of speakers, resulting in the availability of part-of-speech taggers, syntactic parsers, machine translation, and other tools. Despite advances in these areas, however, corpora remain much too small for training the “large” language models which have taken NLP by storm, and key areas like natural language understanding and generation remain virtually unexplored. This talk is my best attempt to provide an overview of the lay of the land: what work can we build on, and what do we still need to build?
Bio: Dave Howcroft is a computational linguist focused on natural language generation, crowdsourcing, and evaluation, with an interest in psycholinguistics. He received his PhD from Saarland University in 2021 and has been working in the UK since June 2019, first at Heriot-Watt University and since June 2021 at Edinburgh Napier University.
Dr Mícheál J. Ó Meachair, Dublin City University
Title: NLP and corpus-linguistic tasks: Current work and challenges for the Irish language
Abstract: In this talk I present a brief history of corpus-linguistic and NLP research for the Irish language, followed by a detailed run-through of the practical tasks and theoretical work being undertaken on the National Corpus of Irish (Corpas Náisiúnta na Gaeilge) project 2022-2024. This project is being completed by the Gaois research group in Fiontar & Scoil na Gaeilge, Dublin City University. It has been funded for three years by the Department of Tourism, Culture, Arts, Gaeltacht, Sport and Media and by the National Lottery. In doing so I first present the arguments for and against the development of NLP tools during a corpus-compilation project in a minority-language context. This is followed by an overview of the current roadmap for Irish-language corpora and NLP tool development, as it is presented in the Digital Plan for the Irish Language: Speech and Language Technologies 2023-2027. The relationship between this digital plan and our current work is detailed in terms of project-level goals and outputs, and lower-level daily tasks. As part of this I show examples of our work including Python programs, the details of our collection and cleaning tasks to date, and I present the beta-version of the Monitor Corpus of Irish (Corpas Monatóireachta na Gaeilge) which is currently available at: https://beta.corpas.ie/ga/cmg/.
Bio: Dr Mícheál J. Ó Meachair is an assistant professor at Fiontar & Scoil na Gaeilge in Dublin City University. He attained his doctoral degree in corpus linguistics from Trinity College, Dublin. The focus of this doctoral research was on the compilation and tagging of a corpus of Irish-language educational materials (EduGA), followed by a language-complexity analysis of the corpus. Dr Ó Meachair is currently working on the National Corpus of Irish project, a project on which four unique Irish-language corpora are being compiled and multiple NLP tools and resources are being developed. He is a co-founder of an Ríomhacadaimh, a group focussed on localization for the GA-ie locale. His research interests include corpus linguistics, language technologies, localization, and language acquisition. He is a member of the DCU NLP Group and the SEALBHÚ research group in DCU, the latter of which is a cross-faculty research group focussing on applied linguistics and language-planning research.
Dr Cedric Lothritz, University of Luxembourg
Title: LuxemBERT: Exploring Data Augmentation and Transfer Learning Techniques to Create Language Models for Luxembourgish
Abstract: Luxembourgish is a Moselle-Franconian dialect that is marked by its German and French influences. Spoken by an estimated 600,000 people world-wide, the Luxembourgish language community is not considered vulnerable; however, it is comparatively small. In addition, due to several contributing factors, the availability of high-quality textual data is rather limited, making Luxembourgish a less-resourced language. In this talk, we will explore how we took advantage of the linguistic similarities between Luxembourgish and German to create adequate language models for the Luxembourgish language, one of which is LuxemBERT, the first BERT model for Luxembourgish.
Bio: Cedric obtained his PhD in Natural Language Processing (NLP) at the University of Luxembourg in 2023. His work focusses on NLP in the FinTech domain and for the Luxembourgish language. His main research interests lie in language modeling and techniques to develop models for low-resource languages. He has been working at SnT since 2019 and has been part of the TruX research group since its inception.
Dr Abigail Walsh, ADAPT Centre at Dublin City University
Title: eSTÓR and More: Developing Datasets for Irish NLP
Abstract: eSTÓR (Sonraí Teanga Óstáilte i gcomhair Ríomhphróiseála) is a project funded by the Department of Tourism, Culture, Arts, Gaeltacht, Sport and Media for the development of an online repository for Irish-language data. Through a combination of outreach, education, and engagement with public institutions, Irish-English corpora are collected and processed, and, where the license permits, shared with the European Commission translation service. Such datasets form a vital part of the development of Irish MT services, and other NLP applications. In this talk, I will describe the development of this project over the years, from its origins as a European Language Resource Infrastructure (ELRI) project, to the current team consisting of researchers and engineers, whose work includes site visits, scientific talks, language processing, research publications and more. The talk will then explore the future of this project, which includes expanding the scope to the development and analysis of language resources annotated with multiword expressions (MWEs), for specialised deployment in NLU and NLG tasks.
Bio: Abigail Walsh is a post-doctoral researcher at the ADAPT Centre at Dublin City University, working on the eSTÓR project developing resources for Irish language technology, with a focus on the automatic processing of MWEs. She is passionate about promoting linguistic diversity and support for minority and low-resource languages in the field of NLP. Her research topics include MWE identification, parsing, machine translation, lexicon development, linguistic analysis, resource development, and machine learning. She is an active member of both the PARSEME community and the UniDive COST Action for universality, diversity and idiosyncrasy in language technology.
Dr Nouran Khallaf is a Research Associate at Lancaster University.
Dr Ignatius Ezeani is a Senior Research Associate at Lancaster University.
Title: Advancing Welsh Natural Language Processing: Bridging Gaps in Resources and Tools for Low-Resource Languages
Abstract: Over the past decade or more, significant strides have been made in the realm of natural language processing (NLP), particularly for well-resourced languages such as English. This progress has given rise to a plethora of technologies capable of achieving near-human performance in the processing and annotation of language data. However, the benefits of these advancements in NLP remain elusive for low-resource languages, such as Welsh, and their speakers. This discrepancy arises from the scarcity of specific language corpora and processing methodologies tailored to these languages. The development of bespoke resources and tools necessitated painstaking and diligent efforts undertaken by a specialized cadre of researchers distributed across various universities and research groups both locally and globally. Fortunately, this sustained endeavor has borne fruit, securing Welsh a distinct position on the NLP landscape. The Welsh NLP initiative has engendered the creation of fundamental language annotation and processing tools, encompassing part-of-speech (POS) and semantic taggers, as well as more advanced tools for language comprehension and analysis, including automatic summarizers and sentiment analyzers. This presentation offers a technical yet accessible overview of these tools, elucidating their contribution to bridging the gap for low-resource languages within the evolving domain of natural language processing.
Professor Paul Rayson is the Director of the UCREL interdisciplinary research centre at Lancaster University, UK. With a focus on corpus linguistics and natural language processing (NLP), he has spearheaded groundbreaking research in semantic multilingual NLP, particularly in challenging scenarios where language is inherently noisy, such as historical texts, learner language, speech, emails, text messages, and other computer-mediated communication varieties.
Throughout his distinguished career, Professor Rayson has collaborated with domain experts to apply his research findings to a diverse range of fields, including dementia detection, mental health, online child protection, cyber security, learner dictionaries, and text mining of biomedical literature, historical corpora, and financial narratives. Notably, he played a key role as a co-investigator in the five-year ESRC Centre for Corpus Approaches to Social Science (CASS), aimed at integrating corpus methodologies into various social science disciplines.
In addition to his academic roles, Professor Rayson actively contributes to various multidisciplinary initiatives, including the Institute for Security Lancaster, the Lancaster Centre for Digital Humanities, and the Data Science Institute.
Paul has worked on numerous projects involving the Welsh language. His contributions to these projects have been invaluable, leveraging his expertise in corpus linguistics and NLP to develop innovative solutions tailored specifically for the Welsh language domain.
Dr Saad Ezzini is a Lecturer (Assistant Professor) in Computer Science at the School of Computing and Communications at Lancaster University. He is an active member of SE@L and the UCREL NLP Groups, contributing significantly to cutting-edge research in various domains.
His research interests span several areas within computer science, with a primary focus on software engineering. Specifically, Saad is deeply involved in software engineering methodologies, including software requirements engineering and empirical software engineering. His expertise extends to code representation, exploring innovative ways to bridge the gap between code and natural language through Code2Text and Text2Code approaches. Additionally, he applies machine learning (ML) and natural language processing (NLP) techniques to software engineering tasks, such as ambiguity detection and resolution, question answering/chatbot systems, and information retrieval.
Moreover, Saad’s research extends to the realm of linguistics, particularly in the domain of low-resource languages. He is passionate about leveraging neural machine translation and other NLP techniques to address the challenges faced by minority and under-resourced languages.
Beyond his research endeavours, Saad is actively engaged in community-building initiatives. He serves as the co-organiser of the CLIDA 2024 workshop (Celtic Languages in the Digital Age), demonstrating his commitment to fostering collaboration and advancing research in linguistics and digital humanities.
Dr Mo El-Haj is a Senior Lecturer in Computer Science at the School of Computing and Communications at Lancaster University, where he focuses on natural language processing (NLP). Additionally, he holds the position of Co-Director of the UCREL NLP Group and serves as the Strategic Lead of the Arabic NLP and Financial NLP themes.
His academic journey began with a PhD from The University of Essex, where he specialised in Arabic Multi-document Summarization. Since then, Dr El-Haj has broadened his research scope to include various NLP domains such as summarization, information extraction, financial NLP, and multilingual NLP.
Mo’s dedication to linguistic diversity is evident in his work on under-resourced languages. He has led several projects in collaboration with Cardiff University, funded by the Welsh Government and AHRC, aimed at developing NLP resources for Welsh.
Mo is the organiser of the CLIDA 2024 workshop (Celtic Languages in the Digital Age) with support from the UCREL NLP Group at Lancaster University and funded by the Faculty of Science and Technology (FST)’s Research Catalyst Fund.