Readability for Low Resourced Languages

The 1st workshop on Readability for Low Resourced Languages

5 September 2023

Lancaster University 🔹 Sheffield Hallam University 🔹 King Saud University

About the workshop

Join us for an exciting workshop where experts in the field of natural language processing will come together to discuss the latest research and innovative approaches to assessing the readability of low-resource languages. We will delve into the development of a comprehensive Readability Framework, utilising cutting-edge machine learning techniques to pre-process and identify key factors that impact text readability. The ultimate goal of the workshop is to discuss best practices and state-of-the-art AI-based approaches to create mathematical representations of expected readability levels at different school grade or cognitive ability levels. The workshop will also focus on utilising classifiers that are intuitive for humans to understand and adjust, enabling the analysis and improvement of the decision-making criteria. The main objectives of the workshop are threefold: a) to increase awareness of the importance of readability in low-resource languages and its impact on language learning and literacy, b) to discuss the challenges of readability in low-resource languages, such as limited resources and lack of standardization, and brainstorm strategies for addressing these challenges, and c) to foster a community of practice among participants, allowing them to share their experiences and best practices for addressing readability issues in low-resource languages.

Keynote Speakers:

Professor Laurence Anthony – Faculty of Science and Engineering at Waseda University, Japan

Title of keynote: Vocabulary profiling of low resourced languages: Insights and challenges

Abstract: One of the most transparent and effective ways to assess the readability of texts is to profile their vocabulary content and determine the vocabulary coverage for different threshold levels. In an English setting, if a learner is known to have a vocabulary knowledge of say 5000 word items and those items cover over 95-98% of the vocabulary in a target text, research suggests that the text will be comprehensible without the need for glosses or dictionaries. However, mapping these findings to texts in non-English settings is far from trivial. In this talk, I will first review the basic principles of vocabulary profiling and introduce online and offline tools that are commonly used for this task. Next, I will discuss some of the main challenges that researchers face when profiling the vocabulary of low resourced languages, especially those where even the concept of a word is not always clear. Then, I will demonstrate some novel tools created for vocabulary profiling of modern foreign languages in the UK national curriculum that were designed specifically to overcome some of these issues. Finally, I will offer suggestions for developers of software tools that can hopefully advance the research on low resourced languages.

Biography: Laurence Anthony is Professor of Applied Linguistics at the Faculty of Science and Engineering, Waseda University, Japan. He has a BSc degree (Mathematical Physics) from the University of Manchester, UK, and MA (TESL/TEFL) and PhD (Applied Linguistics) degrees from the University of Birmingham, UK. He is a founding member of the Center for English Language Education in Science and Engineering (CELESE), which runs discipline-specific language courses for the 10,000 students of the faculty. His main research interests are in corpus linguistics, educational technology, and English for Specific Purposes (ESP) program design and teaching methodologies. He received the National Prize of the Japan Association for English Corpus Studies (JAECS) in 2012 for his work in corpus software tools design, including the creation of AntConc.

Dr Violetta Cavalli-Sforza – School of Science and Engineering at Al Akhawayn University, Morocco

Title of keynote: What if readability assessment actually considered the reader’s needs?

Abstract: Readability research has made great strides from the simple but still useful readability formulas of the mid-1900s to the modern approaches based on statistical machine learning and, more recently, deep learning models. Nonetheless, these approaches rely on the availability of large quantities of labeled data. If the models require the extraction and evaluation of text features capturing different aspects of language, the specific tools able to extract those features may not exist or be only rudimentary, which can have a negative impact on the models’ performance. If no feature engineering is required, the models will serve the purpose of evaluating the difficulty of input texts but will provide no information about the text features that contribute to their ease or difficulty. Moreover, most studies of readability in less-resourced languages have focused on classifying texts into a very limited number of levels, often achieving their best results with a coarse-grained categorization into three or four difficulty levels. While not without its value, the coarseness of this labeling is not sufficient to inform the choice of texts suitable for a specific reader, which is becoming increasingly important in this age of adaptive learning. In this talk, I will suggest alternatives to existing readability assessment approaches, which can give the reader (and the writer) more control over features of texts that make them suitable for specific individuals and purposes. The proposed approaches are sensitive to the fact that, after all, readability is not just a number or a ranking but rather the outcome of the interaction between text and reader.

Biography: Violetta Cavalli-Sforza is Associate Professor of Computer Science at Al Akhawayn University in Ifrane (AUI), where she has taught since 2008. She holds graduate degrees in Civil Engineering and Computer Science, culminating in a Ph.D. in Intelligent Systems Studies from the University of Pittsburgh. Her doctoral dissertation focused on computer-assisted instruction to visualize scientific argumentation. She worked in natural language processing, particularly machine translation, at Carnegie Mellon University, but her research in the last few years has focused on language learning through reading and dialogue and on readability of texts with a focus on Arabic. She has taught a variety of topics in Computer Science at the graduate and undergraduate level and has also coordinated and contributed to the faculty development center at AUI. She currently leads the Education Research group in the School of Science and Engineering and promotes undergraduate and interdisciplinary research at the University.

Spotlight Talk:

Professor Nizar Habash – Department of Computer Science at New York University Abu Dhabi, UAE.

Professor Hanada Taha – Arabic Language at Zayed University, UAE.

Talk title: BAREC: Balanced Arabic Readability Evaluation Corpus

Abstract: In this talk we introduce a newly started project: the Balanced Arabic Readability Evaluation Corpus (BAREC). The overarching objective of BAREC is to develop a comprehensive reference resource to facilitate the study and evaluation of Arabic readability across the Arab world. BAREC will adopt an evidence-based approach and generate practical resources and tools to support and enhance the use of the Arabic language. To this end, we aim to curate an open-source corpus of 10 million words that encompasses diverse genres, topics, and countries of origin, with a particular focus on balancing for readability levels. We plan to use the Taha Readability Leveling System, which comprises 19 levels and is designed with young readers in mind. Portions of this corpus will undergo manual annotation to mark sentence segments and their readability, as well as syntactic annotation to help study the effect of syntactic complexity on readability. Furthermore, we will extend a comprehensive lexicon that is closely integrated in a toolkit for Arabic NLP to these readability levels. These annotations will serve as the basis for developing artificial intelligence tools to automatically annotate the remaining corpus. We will also design additional AI tools to assist content creators in assessing the readability levels of their materials based on specific target audiences. The plan for BAREC is to bridge educational content with recreational reading and digital content, maximizing exposure to selected vocabulary that spans across knowledge, culture, and creativity. Ultimately, this project is scalable, and all the resources and tools we create will be made available to the public and open-source to encourage researchers to build upon our work.

As this is a new project that is expected to go on for 2.5 years, we would like to get some feedback on the plan, pointers to resources, possible collaborations, and general advice from previous and similar experiences.

Speaker’s biography: Nizar Habash is a Professor of Computer Science at New York University Abu Dhabi (NYUAD). He is also the director of the Computational Approaches to Modeling Language (CAMeL) Lab. Professor Habash specializes in natural language processing and computational linguistics. Before joining NYUAD in 2014, he was a research scientist at Columbia University’s Center for Computational Learning Systems. He received his PhD in Computer Science from the University of Maryland College Park in 2003. He has two bachelor’s degrees, one in Computer Engineering and one in Linguistics and Languages. His research includes extensive work on machine translation, morphological analysis, and computational modeling of Arabic and its dialects. Professor Habash has been a principal investigator or co-investigator on over 25 research grants. He also has over 250 publications, including a book entitled “Introduction to Arabic Natural Language Processing”. His website is www.nizarhabash.com.

Talks by the Workshop Organisers:

Professor Hend Al-Khalifa – College of Computer and Information Sciences at King Saud University – KSA

Talk title: Text Readability using Cognitive Data

Abstract: In this talk we will explore the intricate connection between cognitive data and text readability. The presentation will focus on various types of cognitive data, examining their impact on our understanding and interpretation of written content. By evaluating diverse approaches from empirical research to practical applications, we will delve into how cognitive processes can be harnessed to improve text readability by demonstrating our current research.

Biography: I am a Professor at the Information Technology Department, College of Computer and Information Sciences, and the first female Professor in the college. In terms of teaching, I have taught and supervised many BSc, MSc, and PhD courses and students. Besides teaching, I have played different managerial roles in the past years. I was the vice dean of the College of Computer and Information Sciences (2011-2015). Also, I was the first female President of the Saudi Computer Society’s female division (since 2013) and the first female member of the Saudi Computer Society board (since 2015). Moreover, I am a member of the Editorial Board of the Journal of King Saud University – Computer and Information Sciences (since 2012).

Dr Mo El-Haj – School of Computing and Communications at Lancaster University – UK

Talk title: Empowering Low-Resourced Languages: Supporting and Enhancing Readability

Abstract: Low-resourced languages, characterized by limited linguistic resources and computational tools, often face significant challenges in achieving effective digital communication and accessibility. This talk explores the nature of low-resourced languages, delving into the complexities they entail and the importance of supporting these linguistic communities. It further highlights how efforts to improve language support can enhance readability, thereby enabling better communication and understanding. The presentation begins by providing an overview of what constitutes a low-resourced language, encompassing factors such as scarcity of linguistic resources, limited digital presence, and insufficient computational tools for natural language processing (NLP). Through examples from diverse languages, including Arabic, Welsh, and Igbo, we examine the unique characteristics and challenges associated with low-resourced languages.

Biography: Hello, I’m Dr. Mahmoud El-Haj, also known as Mo. I’m an NLP Senior Lecturer in Computer Science at the School of Computing and Communications at Lancaster University. I’m also the Co-Director of the UCREL NLP Group and the Strategic Lead of the Arabic NLP and Financial NLP themes. I received my PhD in Computer Science from the University of Essex, working on Arabic Multi-document Summarization. My work focuses mainly on summarization, information extraction, financial NLP, and multilingual NLP, and has been applied to many languages including English, Arabic, Spanish, Portuguese, Welsh and many others. I have a great interest in under-resourced languages and in building NLP datasets.

Dr Abdel-Karim Al Tamimi – Computer Science and Software Engineering at Sheffield Hallam University – UK

Talk title: Reflections on the Automatic Arabic Readability Index (AARI) and Future Directions

Abstract: In this talk, we will explore the research methodology employed during the development of AARI and we will uncover valuable lessons learned along the way. Additionally, we will address the persistent challenges entailed in creating reliable Arabic readability indexes and examine how recent advancements in Natural Language Processing (NLP) hold promising potential for bolstering future research endeavors.

Biography: Dr. Abdel-Karim Al-Tamimi is a Senior Lecturer of Computer Science and Software Engineering at Sheffield Hallam University. He is a renowned expert in the field and holds several prestigious positions. Dr. Al-Tamimi leads the Interactive Data Analytics Group (iDAG) and is a valuable member of the Applied Software Engineering Research Group (ASERG) and the Conversational AI Cluster. He is closely affiliated with Sheffield Hallam University’s Advanced Wellbeing Research Centre (AWRC) as a member of the Systems, Services, and Strategy in Health and Care research group. Dr. Al-Tamimi’s current research focus revolves around the application of machine learning techniques to tackle interdisciplinary research challenges in areas such as Natural Language Processing (NLP), Multimedia Networks, Computer Security, and the Internet of Things (IoT).

Important Dates:

~~Due date for workshop abstract submission: August 1, 2023 (closed)~~
~~Notification of abstract acceptance to authors: August 10, 2023 (sent out on 07/08/2023)~~
Workshop date: September 5, 2023 (online event)

Accepted Abstracts (speakers in bold font):

Bret Mulligan, Hugh Paterson and Michael Rabayda. LexR: a Preliminary Readability Metric for Latin
Nizar Habash, Muhamed Al Khalil, Hind Saddiki, Zhengyang Jiang, Reem Hazim and Bashar Alhafni. Arabic Automatic Readability Resources in the SAMER Project
Iglika Nikolova-Stoupak, Eva Schaeffer-Lacroix and Gaël Lejeune. Readability of Interslavic as a Measure of the Language’s Naturalness
Johannes Sibeko. Sesotho Readability Assessment: Challenges and Solutions
Lucas Tcacenco. Guidelines for the Production of Science and Technology Museum Texts in Plain Language
Kate Challis and Tom Drusa. Foreign Language Reading with Lex-See: a browser extension designed to offload working memory burden

Call for speakers:

We invite researchers and practitioners to submit abstract proposals for talks related to the development of a Readability Framework for low-resource languages. The extended versions of the accepted abstracts will appear in the Computing Research Repository (CoRR), subject to the number of abstracts received. Topics of interest include, but are not limited to:

Machine learning for text readability
Applications of readability assessment
Readability in low-resource languages
Comprehensibility measures
Mathematical representations of readability levels
Text simplification for low-resource languages
Readability & comprehensibility in language learning
The effects of text simplification on readability
Readability frameworks for indigenous languages
Updating readability representations

Abstract submission:

~~Click here to submit an abstract. (Closed)~~
Abstracts can be in either English or Arabic (in the future we’ll try to expand to more languages).
The form only allows submission of text, with a limit of 500 words.
Authors of accepted abstracts will be asked to submit an extended abstract in PDF format (English only).
Please select who is going to be the presenter.

Presentations:

Accepted abstracts will be presented live during the workshop.
All presentations are online.
For abstracts in Arabic, in this edition of the workshop we only allow pre-recorded presentations with English subtitles added. We’ll play the recording during the workshop and allow the presenter to answer questions. The organisers will translate the Q&As.
More details to come…

Organisers:

Dr Mo El-Haj (SCC/DSI/UCREL, Lancaster University)
Dr Abdel-Karim Al Tamimi (CSSE, Sheffield Hallam University)
Prof. Hend Al Khalifa (iWAN, King Saud University)