We are developing a publicly available Welsh-language automatic text summarisation tool: Adnodd Creu Crynodebau (ACC). ACC will contribute to the automated tools available in the Welsh language and facilitate the work of those involved in document preparation, proof-reading, and (in certain circumstances) translation. ACC will also allow professionals to quickly summarise long documents for efficient presentation. For instance, ACC will allow educators to adapt long documents for use in the classroom. It is also envisaged that ACC will benefit the wider public, who may prefer to read a summary of complex information presented on the internet or who may have difficulties reading translated versions of information on websites.
What is text summarisation?
Text summarisation is an automated approach to identifying the ‘key’ information contained within texts and creating shortened versions of those texts based on this content. The aim is to provide users with succinct and coherent summaries, which are often time-consuming and difficult to produce manually. Summarisation is useful in the modern digital world, where the creation and sharing of text is ever-increasing, as it enables users to navigate, and make sense of, the wealth of information available with ease.
Approaches to text summarisation
The two main approaches to text summarisation are extraction-based summarisation and abstraction-based summarisation. The former builds the summary from specific words and phrases taken directly from the text, while the latter generates paraphrased summaries that are not copied directly from the source text. With either approach, successful summarisation depends on the accuracy of the automatic algorithms used, which require training on hand-coded gold-standard datasets.
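As a rough illustration of the extraction-based approach described above, the following minimal Python sketch scores each sentence by the average frequency of its words in the document and keeps the highest-scoring sentences in their original order. It is not the method used by ACC: the function names, the naive sentence splitter and the frequency-based scoring are all assumptions made for this example, and no trained model or gold-standard dataset is involved.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (an assumption for this sketch): break after
    # '.', '!' or '?' when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lower-cased word tokens; \w is Unicode-aware in Python 3,
    # so accented Welsh letters such as 'ŵ' and 'ŷ' are kept.
    return re.findall(r"\w+", sentence.lower())


def extractive_summary(text, max_sentences=2):
    """Return up to max_sentences sentences copied verbatim from text."""
    sentences = split_sentences(text)
    # Count how often each word occurs across the whole document.
    freqs = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        # Average document frequency of the words in the sentence.
        words = tokenize(sentence)
        return sum(freqs[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score, highest first.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]),
                    reverse=True)
    # Restore document order so the summary reads naturally.
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)
```

Because every sentence in the output is copied directly from the input, this is extractive rather than abstractive. An abstractive system would instead generate new wording, which typically requires a trained language model.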
Work on automatic text summarisation has a long history in Natural Language Processing (NLP). This work originally focused only on English, but summarisation methods have since been applied to a range of other languages, including French, Spanish, Hindi and Arabic. The ‘MultiLing’ project and its associated conference series are noteworthy champions of developing text summarisation in a range of the world’s 7000+ languages. The project website, http://multiling.iit.demokritos.gr, provides an open repository of test and training data for summarisation tasks, model summaries and other resources. Missing from current summarisation resources are tools that work effectively with the Welsh language. This is the research gap that this project aims to fill.
Dr. Dawn Knight is a Reader in Applied Linguistics at Cardiff University, UK, and Chair of the British Association for Applied Linguistics (BAAL). She was the Principal Investigator (PI) of the CorCenCC (National Corpus of Contemporary Welsh) project and has expertise in corpus linguistics, discourse analysis, digital interaction and non-verbal communication. Dawn is the PI of the Welsh Automatic Text Summarisation project.
Dr. Jonathan Morris is a Senior Lecturer in Welsh linguistics at Cardiff University. Jonathan’s research focuses on sociolinguistic aspects of bilingualism. His publications include work on cross-linguistic phonological interactions and sociophonetic variation in Welsh-English bilinguals’ speech and research on the use of the Welsh language among young people and families.
Dr. Mahmoud El-Haj, also known as Mo, is an NLP Lecturer in Computer Science at the School of Computing and Communications at Lancaster University. Mo received his PhD in Computer Science from the University of Essex, where he worked on multi-document summarisation. His work focuses mainly on summarisation, information extraction, financial NLP and multilingual NLP, and has been applied to many languages, including English, Arabic, Spanish, Portuguese and Welsh. He has an interest in under-resourced languages and building NLP datasets.
Dr. Ignatius Ezeani is a Senior Teaching/Research Associate at Lancaster University. He is interested in the application of NLP techniques in building resources for low-resource languages, including Igbo and Welsh. He works on the efficient adaptation of existing NLP tools and techniques for creating task-oriented systems for low-resource languages.
To learn more about the technical development of ACC, and for access to the tools and dataset being created as part of this project, please visit our GitHub site.
ACC will be available soon. Updates on the development of ACC will be added to this website and the project’s GitHub site, and will be tweeted via the @CorCenCC Twitter account.
This project, which runs from 2021 to 2022, is funded by the Welsh Government as part of the ‘Welsh Automatic Text Summarisation’ project.