AraGenre 2026
AraGenre: A Hierarchical Definition-Guided Arabic Genre Classification Shared Task
Mo El-Haj, Saad Ezzini, Mustafa Jarrar, Shadi Abudalfa
m.el-haj@lancaster.ac.uk
Registration form: https://forms.gle/RAi7SpYoRD5RgUnP6
AraGenre Codabench: https://www.codabench.org/competitions/16356
1. Overview
AraGenre is a shared task on hierarchical Arabic genre classification. The task evaluates whether systems can identify the communicative genre of Arabic texts across two levels: Broad Genre, referring to high-level communicative functions such as Informative, Interactive, Creative, Religious, Legal, or Learning, and Specific Genre, referring to finer-grained genre subtypes such as encyclopaedic writing, forum discussion, educational explanation, diplomatic communication, or religious commentary.
Unlike traditional text classification benchmarks that rely heavily on large labelled datasets and stable training distributions, AraGenre focuses on robust genre understanding across diverse communicative settings, dialects, orthographic conventions, and writing styles. The task adopts a definition-guided framework where participants are provided with a hierarchical genre taxonomy, English genre definitions, labelled training and development data, and a hidden evaluation benchmark containing naturally occurring Arabic texts.
The released training and development sets consist primarily of synthetic and carefully controlled examples designed to simulate low-resource conditions. In contrast, the hidden evaluation benchmark contains substantially noisier naturally occurring texts and previously unseen genre-definition combinations, encouraging systems to rely on semantic understanding of communicative function rather than memorisation of narrow lexical patterns.
The benchmark spans Modern Standard Arabic, Classical Arabic, and multiple Arabic dialects, including both formal and informal writing styles as well as diacritised and undiacritised text.
Systems are evaluated separately at the broad and specific genre levels. The official ranking metric is Hierarchical Macro F1, calculated as the average of Broad Macro F1 and Specific Macro F1.
AraGenre is intended as a practical computational benchmarking framework for Arabic NLP. Some categories may overlap with related notions in linguistics and discourse studies, including text type, register, and discourse function.
Each input instance consists of an Arabic text segment ranging from short fragments to longer passages. Systems must assign both a broad genre label and a fine-grained specific genre label.
AraGenre adopts a definition-guided evaluation framework where systems receive English genre definitions together with limited labelled training and development data designed to simulate low-resource conditions.
The hidden evaluation benchmark contains noisier naturally occurring Arabic texts from diverse communicative settings and includes previously unseen genre-definition combinations to encourage semantic genre understanding beyond narrow lexical memorisation.
Table 1: Example Genres with Definitions and Illustrative Samples
| Broad Genre | Specific Genre | Definition | Example 1 | Example 2 |
|---|---|---|---|---|
| Informative | Analytical Reports | Structured analytical writing discussing trends, developments, policies, or evidence-based observations using factual interpretation or comparative analysis. | تشير البيانات الاقتصادية الأخيرة إلى ارتفاع معدلات التضخم مقارنة بالعام السابق، مما أدى إلى تراجع القوة الشرائية للأسر. | أظهرت الدراسة أن الاستخدام المفرط للهواتف الذكية بين المراهقين يرتبط بانخفاض ساعات النوم وضعف التركيز الأكاديمي. |
| Interactive | Advice Columns | Texts where individuals seek or provide practical, emotional, social, or moral advice regarding personal experiences or everyday situations. | أشعر بالتوتر الشديد قبل الامتحانات، ولا أعرف كيف أتعامل مع هذا القلق المستمر، فماذا تنصحونني؟ | أعاني من صعوبة في تنظيم وقتي بين العمل والدراسة، وأحتاج إلى نصائح تساعدني على تحقيق التوازن. |
| Learning | Educational Explanations | Explanatory educational texts intended to help learners understand academic, scientific, technical, or conceptual topics through clarification and structured reasoning. | يحدث التبخر عندما تتحول المادة من الحالة السائلة إلى الحالة الغازية نتيجة اكتسابها للطاقة الحرارية. | لحساب مساحة المثلث، نقوم بضرب طول القاعدة في الارتفاع ثم نقسم الناتج على اثنين. |
| Legal | Diplomatic Communication | Official communication concerning international relations, negotiations, cooperation, agreements, or political coordination between states or organisations. | أكدت الدول المشاركة أهمية تعزيز التعاون الإقليمي لمواجهة التحديات الاقتصادية المشتركة. | عقد الوفدان اجتماعاً ثنائياً لبحث آليات تطوير العلاقات التجارية بين البلدين خلال المرحلة المقبلة. |
| Creative | Motivational Writing | Expressive or reflective writing intended to inspire, encourage, or persuade readers through motivational rhetoric or self-improvement themes. | لا تسمح للفشل أن يوقفك، فكل تجربة صعبة تمنحك فرصة جديدة للنمو والتعلم. | النجاح لا يأتي صدفة، بل يبدأ بخطوة صغيرة وإيمان مستمر بقدراتك. |
| Religious | Religious Commentary | Reflective or explanatory religious writing discussing moral lessons, spiritual guidance, interpretation, or ethical values within a religious context. | يدعو الإسلام إلى الصبر والتسامح، ويحث الإنسان على معاملة الآخرين بالرحمة والعدل. | يوضح الكاتب أهمية الإخلاص في العمل، وأن النية الصادقة أساس قبول الأعمال عند الله. |
2. Motivation and Significance
Most contemporary NLP systems rely heavily on large annotated datasets and stable training distributions. In Arabic and other low-resource settings, labelled data are often limited, uneven across dialects and domains, or unavailable altogether. As a result, systems frequently struggle when exposed to noisier real-world text or previously unseen data distributions.
AraGenre reframes genre classification as a robust generalisation problem rather than a conventional closed-label task. Systems must infer communicative function and stylistic behaviour from limited supervision while generalising to naturally occurring Arabic texts.
The benchmark is designed to encourage research on hierarchical classification, definition-guided modelling, and robust semantic generalisation under realistic low-resource conditions.
3. Hierarchical Genre Organisation and Task Design
AraGenre is organised as a hierarchical Arabic genre classification task where systems must predict both a broad genre label and a fine-grained specific genre label for each Arabic text segment.
The benchmark currently includes six broad genres:
- Informative
- Creative
- Interactive
- Learning
- Legal
- Religious
The released training and development data include genres such as analytical reports, educational explanations, diplomatic communication, motivational writing, forum discussion, and religious commentary.
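The two-level organisation can be pictured as a small lookup structure. The sketch below is illustrative only: it contains just the six broad/specific pairs shown in Table 1 (the full taxonomy is larger), and the helper names `is_consistent` and `repair_broad` are hypothetical, not part of the starter kit. The task description does not state that a specific genre prediction must agree with the broad prediction, so the consistency check is an assumption about how a system might post-process its own output.

```python
# Broad -> specific pairs copied from Table 1 only (assumption: the
# official taxonomy contains more specific genres than listed here).
TAXONOMY = {
    "Informative": {"Analytical Reports"},
    "Interactive": {"Advice Columns"},
    "Learning": {"Educational Explanations"},
    "Legal": {"Diplomatic Communication"},
    "Creative": {"Motivational Writing"},
    "Religious": {"Religious Commentary"},
}

# Reverse index: specific genre -> its broad parent.
PARENT = {
    specific: broad
    for broad, specifics in TAXONOMY.items()
    for specific in specifics
}


def is_consistent(broad, specific):
    """True when the specific genre is a known child of the broad genre."""
    return PARENT.get(specific) == broad


def repair_broad(broad, specific):
    """Hypothetical post-processing step: prefer the parent of the
    specific label when it is known, otherwise keep the predicted broad label."""
    return PARENT.get(specific, broad)
```

A system that predicts the two levels independently could use such a check to flag or repair contradictory label pairs before submission.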
AraGenre adopts a definition-guided framework where participants receive English genre definitions together with limited labelled training and development data designed to simulate realistic low-resource conditions.
In contrast, the hidden evaluation benchmark contains noisier naturally occurring Arabic texts spanning Modern Standard Arabic, Classical Arabic, and multiple Arabic dialects, including both formal and informal writing styles.
Systems are evaluated separately at the broad and specific genre levels. The official ranking metric is Hierarchical Macro F1, calculated as the average of Broad Macro F1 and Specific Macro F1, allowing evaluation of both high-level communicative understanding and fine-grained genre recognition.
4. Evaluation Framework
Systems generate two predictions for each input instance:
- a broad genre label,
- and a specific genre label.
Performance is evaluated separately at both levels using Macro F1, Weighted F1, and Accuracy.
The official ranking metric is Hierarchical Macro F1, calculated as the average of Broad Macro F1 and Specific Macro F1.
The evaluation setup is designed to encourage robust genre understanding under linguistic and stylistic variation rather than narrow lexical memorisation.
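As a minimal sketch of the ranking metric described above, the following code computes Macro F1 at each level and averages the two scores. It is not the official scorer: the function names are hypothetical, and it assumes that Macro F1 is averaged over every label appearing in either the gold or the predicted labels, which may differ from the label set used by the evaluation scripts in the starter kit.

```python
from collections import defaultdict


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over labels seen in gold or pred."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed
    labels = set(gold) | set(pred)
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def hierarchical_macro_f1(gold_broad, pred_broad, gold_specific, pred_specific):
    """Average of Broad Macro F1 and Specific Macro F1."""
    broad = macro_f1(gold_broad, pred_broad)
    specific = macro_f1(gold_specific, pred_specific)
    return (broad + specific) / 2


# Toy example with four instances (labels taken from Table 1).
gold_broad = ["Informative", "Legal", "Creative", "Legal"]
pred_broad = ["Informative", "Legal", "Legal", "Legal"]
gold_specific = ["Analytical Reports", "Diplomatic Communication",
                 "Motivational Writing", "Diplomatic Communication"]
pred_specific = ["Analytical Reports", "Diplomatic Communication",
                 "Diplomatic Communication", "Analytical Reports"]

score = hierarchical_macro_f1(gold_broad, pred_broad, gold_specific, pred_specific)
```

Because both levels contribute equally, a system that is strong on broad genres but weak on specific genres is penalised as much as the reverse.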
5. Baselines and Benchmarking
AraGenre provides two official baseline systems as reproducible reference points for hierarchical Arabic genre classification.
Baseline 1 uses multilingual sentence embeddings and semantic similarity between input texts and genre definitions.
Baseline 2 uses retrieval-augmented generation (RAG) with an instruction-tuned multilingual large language model.
Baseline results are available on Codabench: https://www.codabench.org/competitions/16356/#/results-tab
Both baselines generate broad and specific genre predictions using the official shared-task JSON format.
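The idea behind Baseline 1, matching a text to the closest genre definition in embedding space, can be sketched as follows. This is not the released baseline code. The `classify` function and the `embed` parameter are hypothetical names, and the tiny hand-written vector table stands in for a multilingual sentence encoder so that the example runs without external dependencies; a real system would replace `toy_embed` with a model that maps Arabic texts and English definitions into a shared space.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(text, definitions, embed):
    """Return the (broad, specific) pair whose definition is most similar.

    definitions: dict mapping (broad, specific) -> English definition text.
    embed: callable mapping a string to a list of floats (assumed multilingual).
    """
    text_vec = embed(text)
    return max(
        definitions,
        key=lambda pair: cosine(text_vec, embed(definitions[pair])),
    )


# Two definitions and one sample text copied from Table 1.
DEFINITIONS = {
    ("Learning", "Educational Explanations"):
        "Explanatory educational texts intended to help learners understand "
        "academic, scientific, technical, or conceptual topics through "
        "clarification and structured reasoning.",
    ("Creative", "Motivational Writing"):
        "Expressive or reflective writing intended to inspire, encourage, or "
        "persuade readers through motivational rhetoric or self-improvement themes.",
}
SAMPLE = ("يحدث التبخر عندما تتحول المادة من الحالة السائلة إلى الحالة "
          "الغازية نتيجة اكتسابها للطاقة الحرارية.")

# Stand-in embeddings (made-up numbers, for illustration only).
TOY_VECTORS = {
    DEFINITIONS[("Learning", "Educational Explanations")]: [1.0, 0.0],
    DEFINITIONS[("Creative", "Motivational Writing")]: [0.0, 1.0],
    SAMPLE: [0.8, 0.2],
}


def toy_embed(text):
    return TOY_VECTORS[text]


prediction = classify(SAMPLE, DEFINITIONS, toy_embed)
```

Because the specific genre determines the pair, a single nearest-definition lookup yields both the broad and the specific prediction in this sketch.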
On the development benchmark, Baseline 1 outperformed Baseline 2 on every reported metric (Hierarchical Macro F1 of 0.4658 versus 0.3899):
| Baseline | Hierarchical Macro F1 | Specific Macro F1 | Specific Weighted F1 | Specific Accuracy | Broad Macro F1 | Broad Weighted F1 | Broad Accuracy |
|---|---|---|---|---|---|---|---|
| Baseline 1: Embedding Similarity | 0.4658 | 0.4658 | 0.4999 | 0.4818 | 0.4658 | 0.4999 | 0.4818 |
| Baseline 2: RAG + Multilingual LLM | 0.3899 | 0.3899 | 0.3890 | 0.4273 | 0.3899 | 0.3890 | 0.4273 |
The starter kit additionally includes data loaders, evaluation scripts, example submissions, and Codabench-compatible prediction formatting.
6. Submission Protocol and Reproducibility
AraGenre uses Codabench as the official evaluation platform:
https://www.codabench.org/competitions/16356
Participants receive the released training and development data. During the test phase, they also receive the unlabelled evaluation inputs together with the genre definitions.
For each test instance, systems must generate:
- one broad genre prediction,
- and one specific genre prediction.
Submissions must follow the official JSON format provided in the starter kit. Prediction files are automatically evaluated against hidden gold-standard annotations on the evaluation server.
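The sketch below shows one way a prediction file might be assembled and checked before upload. The official JSON schema is defined in the starter kit and is not reproduced in this document, so the field names `id`, `broad_genre`, and `specific_genre`, the list-of-objects layout, and the function name `build_submission` are all assumptions made for illustration; participants should follow the example submissions in the starter kit.

```python
import json


def build_submission(predictions):
    """Serialise predictions to a JSON string.

    predictions: iterable of (instance_id, broad, specific) tuples.
    Field names below are hypothetical; check the starter kit for the
    official schema.
    """
    rows = []
    for instance_id, broad, specific in predictions:
        # Every instance needs exactly one label at each level.
        if not broad or not specific:
            raise ValueError(f"missing label for instance {instance_id!r}")
        rows.append({
            "id": instance_id,
            "broad_genre": broad,
            "specific_genre": specific,
        })
    # ensure_ascii=False keeps any Arabic text readable in the output file.
    return json.dumps(rows, ensure_ascii=False, indent=2)


submission = build_submission([
    ("test-001", "Legal", "Diplomatic Communication"),
    ("test-002", "Religious", "Religious Commentary"),
])
```

Validating that both labels are present for every instance before serialising avoids submissions that the evaluation server might reject or score as errors.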
Systems are evaluated using Broad Macro F1, Specific Macro F1, and the official ranking metric, Hierarchical Macro F1.
Participants are encouraged to submit prediction files, system description papers, and optional reproducibility material or source code.
The released baselines provide reference implementations for embedding-based definition matching and retrieval-augmented generation using multilingual large language models.
7. Accessibility and Research Impact
AraGenre is designed as an accessible benchmark for hierarchical Arabic genre classification under realistic low-resource conditions. The task encourages systems that generalise across diverse communicative settings, dialects, and writing styles rather than relying on narrow lexical memorisation.
Beyond Arabic NLP, the benchmark supports research on hierarchical classification, definition-guided modelling, retrieval-augmented reasoning, and robust semantic generalisation under limited supervision.
The shared task will be promoted through Arabic NLP and ACL communities, with a dedicated website providing task documentation, datasets, baselines, starter code, and public leaderboards.
9. Timeline (Tentative)
| Date | Milestone |
|---|---|
| May 18, 2026 | Release of task website, training/development data, baseline systems, and evaluation scripts |
| July 30, 2026 | Registration deadline and release of hidden test inputs |
| July 31, 2026 | Submission deadline and final evaluation |
| August 22, 2026 | System description papers due |
| September 1, 2026 | Shared task overview paper due |
| September 10, 2026 | Conference camera-ready deadline |
| October 24–29, 2026 | ArabicNLP 2026 / EMNLP 2026, Budapest, Hungary |
10. Organisers
- Mo El-Haj, Organising Chair, Lancaster University, UK, and VinUniversity, Vietnam. m.el-haj@lancaster.ac.uk
- Mustafa Jarrar, Programme Chair, Hamad Bin Khalifa University (HBKU), Qatar. mjarrar@hbku.edu.qa
- Saad Ezzini, Programme Chair, King Fahd University of Petroleum and Minerals (KFUPM), Saudi Arabia. saad.ezzini@kfupm.edu.sa
- Shadi Abudalfa, Programme Chair, King Fahd University of Petroleum and Minerals (KFUPM), Saudi Arabia. shadi.abudalfa@kfupm.edu.sa

