AraGenre: A Hierarchical Definition-Guided Arabic Genre Classification Shared Task

Mo El-Haj, Saad Ezzini, Mustafa Jarrar, Shadi Abudalfa, Salima Lamsiyah
m.el-haj@lancaster.ac.uk

Registration form: https://forms.gle/RAi7SpYoRD5RgUnP6

AraGenre Codabench: https://www.codabench.org/competitions/16356

1. Overview

AraGenre is a shared task on hierarchical Arabic genre classification. The task evaluates whether systems can identify the communicative genre of Arabic texts across two levels: Broad Genre, referring to high-level communicative functions such as Informative, Interactive, Creative, Religious, Legal, or Learning, and Specific Genre, referring to finer-grained genre subtypes such as encyclopaedic writing, forum discussion, educational explanation, diplomatic communication, or religious commentary.

Unlike traditional text classification benchmarks that rely heavily on large labelled datasets and stable training distributions, AraGenre focuses on robust genre understanding across diverse communicative settings, dialects, orthographic conventions, and writing styles. The task adopts a definition-guided framework where participants are provided with a hierarchical genre taxonomy, English genre definitions, labelled training and development data, and a hidden evaluation benchmark containing naturally occurring Arabic texts.

The released training and development sets consist primarily of synthetic and carefully controlled examples designed to simulate low-resource conditions. In contrast, the hidden evaluation benchmark contains substantially noisier naturally occurring texts and previously unseen genre-definition combinations, encouraging systems to rely on semantic understanding of communicative function rather than memorisation of narrow lexical patterns.

The benchmark spans Modern Standard Arabic, Classical Arabic, and multiple Arabic dialects, including both formal and informal writing styles as well as diacritised and undiacritised text.

Systems are evaluated separately at the broad and specific genre levels. The official ranking metric is Hierarchical Macro F1, calculated as the average of Broad Macro F1 and Specific Macro F1.

AraGenre is intended as a practical computational benchmarking framework for Arabic NLP. Some categories may overlap with related notions in linguistics and discourse studies, including text type, register, and discourse function.

Each input instance consists of an Arabic text segment ranging from short fragments to longer passages. Systems must assign both a broad genre label and a fine-grained specific genre label.

AraGenre adopts a definition-guided evaluation framework where systems receive English genre definitions together with limited labelled training and development data designed to simulate low-resource conditions.

The hidden evaluation benchmark contains noisier naturally occurring Arabic texts from diverse communicative settings and includes previously unseen genre-definition combinations to encourage semantic genre understanding beyond narrow lexical memorisation.

Table 1: Example Genres with Definitions and Illustrative Samples

Broad Genre	Specific Genre	Definition	Example 1	Example 2
Informative	Analytical Reports	Structured analytical writing discussing trends, developments, policies, or evidence-based observations using factual interpretation or comparative analysis.	تشير البيانات الاقتصادية الأخيرة إلى ارتفاع معدلات التضخم مقارنة بالعام السابق، مما أدى إلى تراجع القوة الشرائية للأسر.	أظهرت الدراسة أن الاستخدام المفرط للهواتف الذكية بين المراهقين يرتبط بانخفاض ساعات النوم وضعف التركيز الأكاديمي.
Interactive	Advice Columns	Texts where individuals seek or provide practical, emotional, social, or moral advice regarding personal experiences or everyday situations.	أشعر بالتوتر الشديد قبل الامتحانات، ولا أعرف كيف أتعامل مع هذا القلق المستمر، فماذا تنصحونني؟	أعاني من صعوبة في تنظيم وقتي بين العمل والدراسة، وأحتاج إلى نصائح تساعدني على تحقيق التوازن.
Learning	Educational Explanations	Explanatory educational texts intended to help learners understand academic, scientific, technical, or conceptual topics through clarification and structured reasoning.	يحدث التبخر عندما تتحول المادة من الحالة السائلة إلى الحالة الغازية نتيجة اكتسابها للطاقة الحرارية.	لحساب مساحة المثلث، نقوم بضرب طول القاعدة في الارتفاع ثم نقسم الناتج على اثنين.
Legal	Diplomatic Communication	Official communication concerning international relations, negotiations, cooperation, agreements, or political coordination between states or organisations.	أكدت الدول المشاركة أهمية تعزيز التعاون الإقليمي لمواجهة التحديات الاقتصادية المشتركة.	عقد الوفدان اجتماعاً ثنائياً لبحث آليات تطوير العلاقات التجارية بين البلدين خلال المرحلة المقبلة.
Creative	Motivational Writing	Expressive or reflective writing intended to inspire, encourage, or persuade readers through motivational rhetoric or self-improvement themes.	لا تسمح للفشل أن يوقفك، فكل تجربة صعبة تمنحك فرصة جديدة للنمو والتعلم.	النجاح لا يأتي صدفة، بل يبدأ بخطوة صغيرة وإيمان مستمر بقدراتك.
Religious	Religious Commentary	Reflective or explanatory religious writing discussing moral lessons, spiritual guidance, interpretation, or ethical values within a religious context.	يدعو الإسلام إلى الصبر والتسامح، ويحث الإنسان على معاملة الآخرين بالرحمة والعدل.	يوضح الكاتب أهمية الإخلاص في العمل، وأن النية الصادقة أساس قبول الأعمال عند الله.
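
The two-level structure in Table 1 can be pictured as a mapping from each specific genre to its broad genre. The following is a minimal sketch in Python that uses only the six rows shown in the table; the full taxonomy released with the task may contain more specific genres, and the names SPECIFIC_TO_BROAD, broad_for, and is_consistent are illustrative, not part of the official starter kit.

```python
# Specific-to-broad mapping copied from the six rows of Table 1.
# Assumption: the official taxonomy may list further specific genres.
SPECIFIC_TO_BROAD = {
    "Analytical Reports": "Informative",
    "Advice Columns": "Interactive",
    "Educational Explanations": "Learning",
    "Diplomatic Communication": "Legal",
    "Motivational Writing": "Creative",
    "Religious Commentary": "Religious",
}


def broad_for(specific: str) -> str:
    """Return the broad genre that a specific genre belongs to."""
    return SPECIFIC_TO_BROAD[specific]


def is_consistent(broad: str, specific: str) -> bool:
    """True when the (broad, specific) pair agrees with the hierarchy."""
    return SPECIFIC_TO_BROAD.get(specific) == broad
```

A system that predicts the two levels independently could use a check such as is_consistent to detect pairs that contradict the hierarchy, for example a Legal broad label combined with Advice Columns.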

2. Motivation and Significance

Most contemporary NLP systems rely heavily on large annotated datasets and stable training distributions. In Arabic and other low-resource settings, labelled data are often limited, uneven across dialects and domains, or unavailable altogether. As a result, systems frequently struggle when exposed to noisier real-world text or previously unseen data distributions.

AraGenre reframes genre classification as a robust generalisation problem rather than a conventional closed-label task. Systems must infer communicative function and stylistic behaviour from limited supervision while generalising to naturally occurring Arabic texts.

The benchmark is designed to encourage research on hierarchical classification, definition-guided modelling, and robust semantic generalisation under realistic low-resource conditions.

3. Hierarchical Genre Organisation and Task Design

AraGenre is organised as a hierarchical Arabic genre classification task where systems must predict both a broad genre label and a fine-grained specific genre label for each Arabic text segment.

The benchmark currently includes six broad genres:

Informative
Creative
Interactive
Learning
Legal
Religious

The released training and development data include genres such as analytical reports, educational explanations, diplomatic communication, motivational writing, forum discussion, and religious commentary.

AraGenre adopts a definition-guided framework where participants receive English genre definitions together with limited labelled training and development data designed to simulate realistic low-resource conditions.

In contrast, the hidden evaluation benchmark contains noisier naturally occurring Arabic texts spanning Modern Standard Arabic, Classical Arabic, and multiple Arabic dialects, including both formal and informal writing styles.

Systems are evaluated separately at the broad and specific genre levels. The official ranking metric is Hierarchical Macro F1, calculated as the average of Broad Macro F1 and Specific Macro F1, allowing evaluation of both high-level communicative understanding and fine-grained genre recognition.

4. Evaluation Framework

Systems generate two predictions for each input instance:

a broad genre label,
and a specific genre label.

Performance is evaluated separately at both levels using Macro F1, Weighted F1, and Accuracy.

The official ranking metric is Hierarchical Macro F1, calculated as the average of Broad Macro F1 and Specific Macro F1.
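
The ranking metric described above can be sketched in plain Python as follows. This is an illustration, not the official scorer: the function names are hypothetical, and the sketch assumes that Macro F1 is averaged over every label that appears in either the gold or the predicted labels, which may differ from the label set used by the evaluation script in the starter kit.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over labels seen in gold or pred."""
    labels = set(gold) | set(pred)
    if not labels:
        return 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def hierarchical_macro_f1(gold_broad, pred_broad, gold_specific, pred_specific):
    """Average of Broad Macro F1 and Specific Macro F1."""
    broad = macro_f1(gold_broad, pred_broad)
    specific = macro_f1(gold_specific, pred_specific)
    return (broad + specific) / 2
```

Because the two levels are averaged with equal weight, an error at the broad level costs as much as an error at the specific level, even though there are fewer broad labels.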

The evaluation setup is designed to encourage robust genre understanding under linguistic and stylistic variation rather than narrow lexical memorisation.

5. Baselines and Benchmarking

AraGenre provides two official baseline systems as reproducible reference points for hierarchical Arabic genre classification.

Baseline 1 uses multilingual sentence embeddings and semantic similarity between input texts and genre definitions.
Baseline 2 uses retrieval-augmented generation (RAG) with an instruction-tuned multilingual large language model.
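
The idea behind Baseline 1, matching an input text to the most similar genre definition, can be sketched as below. This is not the official baseline code. The embedder toy_embed is a deliberately simple stand-in based on word counts; it is not multilingual and cannot relate Arabic text to English definitions, so a real system would pass a multilingual sentence-embedding function through the embed parameter instead. All function names here are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in embedder: lowercase word counts (NOT multilingual)."""
    return dict(Counter(re.findall(r"\w+", text.lower())))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text: str, definitions: Dict[str, str],
             embed: Callable[[str], Vector] = toy_embed) -> str:
    """Return the genre whose definition is most similar to the text."""
    query = embed(text)
    return max(definitions, key=lambda genre: cosine(query, embed(definitions[genre])))


# Definitions shortened from Table 1, for illustration only.
definitions = {
    "Analytical Reports": "Structured analytical writing discussing trends and policies",
    "Advice Columns": "Texts where individuals seek or provide practical or moral advice",
}
predicted = classify("I seek practical advice", definitions)
```

Once a specific genre has been selected in this way, the broad genre can be read off the taxonomy, which keeps the two predictions consistent with the hierarchy.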

Baseline results are available on Codabench: https://www.codabench.org/competitions/16356/#/results-tab

Both baselines generate broad and specific genre predictions using the official shared-task JSON format.

On the development benchmark, Baseline 1 achieved stronger overall performance:

Baseline	Hierarchical Macro F1	Specific Macro F1	Specific Weighted F1	Specific Accuracy	Broad Macro F1	Broad Weighted F1	Broad Accuracy
Baseline 1: Embedding Similarity	0.4658	0.4658	0.4999	0.4818	0.4658	0.4999	0.4818
Baseline 2: RAG + Multilingual LLM	0.3899	0.3899	0.3890	0.4273	0.3899	0.3890	0.4273

The starter kit additionally includes data loaders, evaluation scripts, example submissions, and Codabench-compatible prediction formatting.

6. Submission Protocol and Reproducibility

AraGenre uses Codabench as the official evaluation platform:

https://www.codabench.org/competitions/16356

During testing, participants receive the unlabelled evaluation inputs and the genre definitions, in addition to the released training and development data.

For each test instance, systems must generate:

one broad genre prediction,
and one specific genre prediction.

Submissions must follow the official JSON format provided in the starter kit. Prediction files are automatically evaluated against hidden gold-standard annotations on the evaluation server.
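
As a rough illustration of preparing a prediction file, the sketch below serialises one broad and one specific label per instance to JSON. The field names id, broad_genre, and specific_genre and the function build_submission are assumptions made for this example only; the authoritative schema is the one in the starter kit and should be followed instead.

```python
import json


def build_submission(predictions):
    """Serialise (instance_id, broad, specific) tuples to a JSON string.

    The keys used here are hypothetical; consult the starter kit for the
    official field names.
    """
    records = []
    for instance_id, broad, specific in predictions:
        if not broad or not specific:
            raise ValueError(f"instance {instance_id!r} is missing a label")
        records.append({
            "id": instance_id,
            "broad_genre": broad,
            "specific_genre": specific,
        })
    # ensure_ascii=False keeps any Arabic text readable in the output file.
    return json.dumps(records, ensure_ascii=False, indent=2)


payload = build_submission([
    ("test-001", "Legal", "Diplomatic Communication"),
    ("test-002", "Interactive", "Advice Columns"),
])
```

Rejecting instances with a missing label before upload avoids submitting a file that the evaluation server cannot score at both levels.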

Systems are evaluated using Broad Macro F1, Specific Macro F1, and the official ranking metric, Hierarchical Macro F1.

Participants are encouraged to submit prediction files, system description papers, and optional reproducibility material or source code.

The released baselines provide reference implementations for embedding-based definition matching and retrieval-augmented generation using multilingual large language models.

7. Accessibility and Research Impact

AraGenre is designed as an accessible benchmark for hierarchical Arabic genre classification under realistic low-resource conditions. The task encourages systems that generalise across diverse communicative settings, dialects, and writing styles rather than relying on narrow lexical memorisation.

Beyond Arabic NLP, the benchmark supports research on hierarchical classification, definition-guided modelling, retrieval-augmented reasoning, and robust semantic generalisation under limited supervision.

The shared task will be promoted through Arabic NLP and ACL communities, with a dedicated website providing task documentation, datasets, baselines, starter code, and public leaderboards.

8. System Description Papers

All participating teams are strongly encouraged to submit a system description paper describing their approach to the shared task. System papers provide an opportunity to share your methodology, experiments, and insights with the research community, regardless of your final ranking on the leaderboard. We welcome submissions from all participating teams, including those who experimented with novel ideas or conducted informative analyses.

Accepted system description papers will be published in the Proceedings of the ArabicNLP 2026 Workshop, co-located with EMNLP 2026, and will appear in the ACL Anthology.

System papers should be up to 4 pages in length (excluding references) and should follow the ACL author guidelines. Authors are encouraged to describe:

The overall system architecture and modelling approach.
Data preprocessing and feature engineering methods.
Training procedure, hyperparameters, and implementation details.
External datasets, pretrained models, or additional resources used (if any).
Experimental results, error analysis, and lessons learned.

Registration is not required for an accepted system paper to appear in the workshop proceedings and the ACL Anthology. However, authors who wish to attend ArabicNLP 2026 and/or EMNLP 2026 in person must complete the appropriate conference registration.

We look forward to reading about the diverse approaches developed by participating teams and encourage everyone to consider submitting a system description paper.

9. Timeline (all times in UTC)

Date	Milestone
May 18, 2026 – 00:00	Start of the Development Phase
July 20, 2026 – 23:59	Registration deadline https://forms.gle/RAi7SpYoRD5RgUnP6
July 26, 2026 – 23:59	End of the Development Phase
July 27, 2026 – 00:00	Start of the Final Evaluation Phase
July 31, 2026 – 23:59	End of the Final Evaluation Phase
August 22, 2026 – 23:59	System description papers due
September 1, 2026 – 23:59	Shared task overview paper due
September 10, 2026	Conference camera-ready deadline
October 24–29, 2026	ArabicNLP 2026 / EMNLP 2026, Budapest, Hungary

10. Organisers

Mo El-Haj, Organising Chair, Lancaster University, UK, and VinUniversity, Vietnam. m.el-haj@lancaster.ac.uk
Mustafa Jarrar, Programme Chair, Hamad Bin Khalifa University (HBKU), Qatar. mjarrar@hbku.edu.qa
Saad Ezzini, Programme Chair, King Fahd University of Petroleum and Minerals (KFUPM), Saudi Arabia. saad.ezzini@kfupm.edu.sa
Shadi Abudalfa, Programme Chair, King Fahd University of Petroleum and Minerals (KFUPM), Saudi Arabia. shadi.abudalfa@kfupm.edu.sa
Salima Lamsiyah, Programme Chair, University of Luxembourg, Luxembourg. salima.lamsiyah@uni.lu