AraGenre 2026

AraGenre: A Hierarchical Definition-Guided Arabic Genre Classification Shared Task

Mo El-Haj, Saad Ezzini, Mustafa Jarrar, Shadi Abudalfa
m.el-haj@lancaster.ac.uk 

Registration form: https://forms.gle/RAi7SpYoRD5RgUnP6

AraGenre Codabench: https://www.codabench.org/competitions/16356

1. Overview

AraGenre is a shared task on hierarchical Arabic genre classification. The task evaluates whether systems can identify the communicative genre of Arabic texts across two levels: Broad Genre, referring to high-level communicative functions such as Informative, Interactive, Creative, Religious, Legal, or Learning, and Specific Genre, referring to finer-grained genre subtypes such as encyclopaedic writing, forum discussion, educational explanation, diplomatic communication, or religious commentary.

Unlike traditional text classification benchmarks that rely heavily on large labelled datasets and stable training distributions, AraGenre focuses on robust genre understanding across diverse communicative settings, dialects, orthographic conventions, and writing styles. The task adopts a definition-guided framework where participants are provided with a hierarchical genre taxonomy, English genre definitions, labelled training and development data, and a hidden evaluation benchmark containing naturally occurring Arabic texts.

The released training and development sets consist primarily of synthetic and carefully controlled examples designed to simulate low-resource conditions. In contrast, the hidden evaluation benchmark contains substantially noisier naturally occurring texts and previously unseen genre-definition combinations, encouraging systems to rely on semantic understanding of communicative function rather than memorisation of narrow lexical patterns.

The benchmark spans Modern Standard Arabic, Classical Arabic, and multiple Arabic dialects, including both formal and informal writing styles as well as diacritised and undiacritised text.

Systems are evaluated separately at the broad and specific genre levels. The official ranking metric is Hierarchical Macro F1, calculated as the average of Broad Macro F1 and Specific Macro F1.

AraGenre is intended as a practical computational benchmarking framework for Arabic NLP. Some categories may overlap with related notions in linguistics and discourse studies, including text type, register, and discourse function.

Each input instance consists of an Arabic text segment ranging from short fragments to longer passages. Systems must assign both a broad genre label and a fine-grained specific genre label.

AraGenre adopts a definition-guided evaluation framework where systems receive English genre definitions together with limited labelled training and development data designed to simulate low-resource conditions.

The hidden evaluation benchmark contains noisier naturally occurring Arabic texts from diverse communicative settings and includes previously unseen genre-definition combinations to encourage semantic genre understanding beyond narrow lexical memorisation.

Table 1: Example Genres with Definitions and Illustrative Samples

Broad Genre | Specific Genre | Definition | Example 1 | Example 2
Informative | Analytical Reports | Structured analytical writing discussing trends, developments, policies, or evidence-based observations using factual interpretation or comparative analysis. | تشير البيانات الاقتصادية الأخيرة إلى ارتفاع معدلات التضخم مقارنة بالعام السابق، مما أدى إلى تراجع القوة الشرائية للأسر. | أظهرت الدراسة أن الاستخدام المفرط للهواتف الذكية بين المراهقين يرتبط بانخفاض ساعات النوم وضعف التركيز الأكاديمي.
Interactive | Advice Columns | Texts where individuals seek or provide practical, emotional, social, or moral advice regarding personal experiences or everyday situations. | أشعر بالتوتر الشديد قبل الامتحانات، ولا أعرف كيف أتعامل مع هذا القلق المستمر، فماذا تنصحونني؟ | أعاني من صعوبة في تنظيم وقتي بين العمل والدراسة، وأحتاج إلى نصائح تساعدني على تحقيق التوازن.
Learning | Educational Explanations | Explanatory educational texts intended to help learners understand academic, scientific, technical, or conceptual topics through clarification and structured reasoning. | يحدث التبخر عندما تتحول المادة من الحالة السائلة إلى الحالة الغازية نتيجة اكتسابها للطاقة الحرارية. | لحساب مساحة المثلث، نقوم بضرب طول القاعدة في الارتفاع ثم نقسم الناتج على اثنين.
Legal | Diplomatic Communication | Official communication concerning international relations, negotiations, cooperation, agreements, or political coordination between states or organisations. | أكدت الدول المشاركة أهمية تعزيز التعاون الإقليمي لمواجهة التحديات الاقتصادية المشتركة. | عقد الوفدان اجتماعاً ثنائياً لبحث آليات تطوير العلاقات التجارية بين البلدين خلال المرحلة المقبلة.
Creative | Motivational Writing | Expressive or reflective writing intended to inspire, encourage, or persuade readers through motivational rhetoric or self-improvement themes. | لا تسمح للفشل أن يوقفك، فكل تجربة صعبة تمنحك فرصة جديدة للنمو والتعلم. | النجاح لا يأتي صدفة، بل يبدأ بخطوة صغيرة وإيمان مستمر بقدراتك.
Religious | Religious Commentary | Reflective or explanatory religious writing discussing moral lessons, spiritual guidance, interpretation, or ethical values within a religious context. | يدعو الإسلام إلى الصبر والتسامح، ويحث الإنسان على معاملة الآخرين بالرحمة والعدل. | يوضح الكاتب أهمية الإخلاص في العمل، وأن النية الصادقة أساس قبول الأعمال عند الله.

2. Motivation and Significance

Most contemporary NLP systems rely heavily on large annotated datasets and stable training distributions. In Arabic and other low-resource settings, labelled data are often limited, uneven across dialects and domains, or unavailable altogether. As a result, systems frequently struggle when exposed to noisier real-world text or previously unseen data distributions.

AraGenre reframes genre classification as a robust generalisation problem rather than a conventional closed-label task. Systems must infer communicative function and stylistic behaviour from limited supervision while generalising to naturally occurring Arabic texts.

The benchmark is designed to encourage research on hierarchical classification, definition-guided modelling, and robust semantic generalisation under realistic low-resource conditions.

3. Hierarchical Genre Organisation and Task Design

AraGenre is organised as a hierarchical Arabic genre classification task where systems must predict both a broad genre label and a fine-grained specific genre label for each Arabic text segment.

The benchmark currently includes six broad genres:

  • Informative
  • Creative
  • Interactive
  • Learning
  • Legal
  • Religious

The released training and development data include genres such as analytical reports, educational explanations, diplomatic communication, motivational writing, forum discussion, and religious commentary.
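As an illustration of the two-level organisation, the sketch below encodes only the broad/specific pairs shown in Table 1 and checks whether a predicted pair respects the hierarchy. The names TAXONOMY, broad_of, and is_consistent are hypothetical, the mapping is deliberately incomplete (the official taxonomy contains more specific genres), and the task description does not state whether inconsistent pairs are treated specially by the scorer.

```python
# Broad -> specific pairs taken from Table 1 only (assumption: the official
# taxonomy in the starter kit is larger and should be loaded from there).
TAXONOMY = {
    "Informative": {"Analytical Reports"},
    "Interactive": {"Advice Columns"},
    "Learning": {"Educational Explanations"},
    "Legal": {"Diplomatic Communication"},
    "Creative": {"Motivational Writing"},
    "Religious": {"Religious Commentary"},
}


def broad_of(specific):
    """Return the broad genre that contains a specific genre, or None if unknown."""
    for broad, specifics in TAXONOMY.items():
        if specific in specifics:
            return broad
    return None


def is_consistent(broad, specific):
    """True when the specific genre is listed under the given broad genre."""
    return specific in TAXONOMY.get(broad, set())


if __name__ == "__main__":
    print(broad_of("Advice Columns"))
    print(is_consistent("Creative", "Religious Commentary"))
```

A system that predicts the specific genre first could derive the broad label with broad_of, which guarantees a consistent pair at the cost of propagating any specific-level error upwards.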

AraGenre adopts a definition-guided framework where participants receive English genre definitions together with limited labelled training and development data designed to simulate realistic low-resource conditions.

In contrast, the hidden evaluation benchmark contains noisier naturally occurring Arabic texts spanning Modern Standard Arabic, Classical Arabic, and multiple Arabic dialects, including both formal and informal writing styles.

Systems are evaluated separately at the broad and specific genre levels. The official ranking metric is Hierarchical Macro F1, calculated as the average of Broad Macro F1 and Specific Macro F1, allowing evaluation of both high-level communicative understanding and fine-grained genre recognition.

4. Evaluation Framework

Systems generate two predictions for each input instance:

  • a broad genre label,
  • and a specific genre label.

Performance is evaluated separately at both levels using Macro F1, Weighted F1, and Accuracy.

The official ranking metric is Hierarchical Macro F1, calculated as the average of Broad Macro F1 and Specific Macro F1.
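The metrics named above can be sketched in plain Python as follows. This is a minimal illustration and not the official scorer: the function names are hypothetical, and the sketch assumes that Macro F1 averages over every label appearing in either the gold or the predicted labels, whereas the evaluation scripts in the starter kit may fix the label set differently.

```python
from collections import Counter


def per_class_f1(gold, pred):
    """Return {label: F1} for every label that occurs in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted label p was wrong
            fn[g] += 1  # the gold label g was missed
    scores = {}
    for label in set(gold) | set(pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1."""
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(gold, pred):
    """Per-class F1 weighted by the number of gold instances of each class."""
    scores = per_class_f1(gold, pred)
    support = Counter(gold)
    return sum(scores[label] * n for label, n in support.items()) / len(gold)


def accuracy(gold, pred):
    """Fraction of instances whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def hierarchical_macro_f1(gold_broad, pred_broad, gold_specific, pred_specific):
    """Average of Broad Macro F1 and Specific Macro F1, as the task defines it."""
    broad = macro_f1(gold_broad, pred_broad)
    specific = macro_f1(gold_specific, pred_specific)
    return (broad + specific) / 2
```

Because both levels contribute equally to the average, a system that is strong at the broad level but weak on specific genres is penalised as much as the reverse.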

The evaluation setup is designed to encourage robust genre understanding under linguistic and stylistic variation rather than narrow lexical memorisation.

5. Baselines and Benchmarking

AraGenre provides two official baseline systems as reproducible reference points for hierarchical Arabic genre classification.

Baseline 1 uses multilingual sentence embeddings and semantic similarity between input texts and genre definitions.
Baseline 2 uses retrieval-augmented generation (RAG) with an instruction-tuned multilingual large language model.
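The idea behind Baseline 1 can be sketched as a nearest-definition classifier. The code below is an assumption-laden illustration, not the released baseline: classify_by_definition and the embed argument are hypothetical names, and the toy vectors stand in for the output of a real multilingual sentence encoder, which the task description does not name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify_by_definition(text, definitions, embed):
    """Pick the (broad, specific) label whose definition is most similar to the text.

    definitions: {(broad, specific): definition string}
    embed: callable mapping a string to a vector (assumed to be supplied
           by a multilingual sentence encoder in a real system).
    """
    text_vec = embed(text)
    return max(
        definitions,
        key=lambda label: cosine(text_vec, embed(definitions[label])),
    )


# Toy stand-in for an encoder: a lookup table of hand-written vectors.
TOY_VECTORS = {
    "definition of analytical reports": [1.0, 0.1, 0.0],
    "definition of advice columns": [0.0, 1.0, 0.2],
    "some input text": [0.9, 0.2, 0.1],
}

DEFINITIONS = {
    ("Informative", "Analytical Reports"): "definition of analytical reports",
    ("Interactive", "Advice Columns"): "definition of advice columns",
}

if __name__ == "__main__":
    print(classify_by_definition("some input text", DEFINITIONS, TOY_VECTORS.__getitem__))
```

Passing the encoder in as a callable keeps the matching logic independent of any particular embedding model, and returning the label pair yields both the broad and the specific prediction in one step.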

Baseline results are available on Codabench: https://www.codabench.org/competitions/16356/#/results-tab

Both baselines generate broad and specific genre predictions using the official shared-task JSON format.

On the development benchmark, Baseline 1 outperformed Baseline 2 on every reported metric:

Baseline | Hierarchical Macro F1 | Specific Macro F1 | Specific Weighted F1 | Specific Accuracy | Broad Macro F1 | Broad Weighted F1 | Broad Accuracy
Baseline 1: Embedding Similarity | 0.4658 | 0.4658 | 0.4999 | 0.4818 | 0.4658 | 0.4999 | 0.4818
Baseline 2: RAG + Multilingual LLM | 0.3899 | 0.3899 | 0.3890 | 0.4273 | 0.3899 | 0.3890 | 0.4273

The starter kit additionally includes data loaders, evaluation scripts, example submissions, and Codabench-compatible prediction formatting.

6. Submission Protocol and Reproducibility

AraGenre uses Codabench as the official evaluation platform:

https://www.codabench.org/competitions/16356

Participants receive the released training and development data and, during the testing phase, the unlabelled evaluation inputs together with the genre definitions; the gold labels of the evaluation benchmark remain hidden.

For each test instance, systems must generate:

  • one broad genre prediction,
  • and one specific genre prediction.

Submissions must follow the official JSON format provided in the starter kit. Prediction files are automatically evaluated against hidden gold-standard annotations on the evaluation server.
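The official JSON schema is defined in the starter kit and is not reproduced here. Purely as an illustration of producing one broad and one specific prediction per instance, the sketch below serialises predictions to JSON; the field names id, broad_genre, and specific_genre and the function build_submission are hypothetical and should be replaced with whatever the starter kit specifies.

```python
import json


def build_submission(predictions):
    """Serialise {instance_id: (broad, specific)} to a JSON string.

    The record layout is an assumption made for this sketch; the official
    format is the one shipped with the starter kit.
    """
    records = []
    for instance_id, (broad, specific) in predictions.items():
        # Every instance needs both labels before it can be scored.
        if not broad or not specific:
            raise ValueError(f"instance {instance_id!r} is missing a label")
        records.append(
            {
                "id": instance_id,  # hypothetical field name
                "broad_genre": broad,  # hypothetical field name
                "specific_genre": specific,  # hypothetical field name
            }
        )
    # ensure_ascii=False keeps any Arabic content readable in the file.
    return json.dumps(records, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    demo = {
        "ex-001": ("Legal", "Diplomatic Communication"),
        "ex-002": ("Creative", "Motivational Writing"),
    }
    print(build_submission(demo))
```

Checking for missing labels before upload avoids spending a submission on a file the evaluation server would reject or score incompletely.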

Systems are evaluated using Broad Macro F1, Specific Macro F1, and the official ranking metric, Hierarchical Macro F1.

Participants are encouraged to submit prediction files, system description papers, and optional reproducibility material or source code.

The released baselines provide reference implementations for embedding-based definition matching and retrieval-augmented generation using multilingual large language models.

7. Accessibility and Research Impact

AraGenre is designed as an accessible benchmark for hierarchical Arabic genre classification under realistic low-resource conditions. The task encourages systems that generalise across diverse communicative settings, dialects, and writing styles rather than relying on narrow lexical memorisation.

Beyond Arabic NLP, the benchmark supports research on hierarchical classification, definition-guided modelling, retrieval-augmented reasoning, and robust semantic generalisation under limited supervision.

The shared task will be promoted through Arabic NLP and ACL communities, with a dedicated website providing task documentation, datasets, baselines, starter code, and public leaderboards.

9. Timeline (Tentative)

Date | Milestone
May 18, 2026 | Release of task website, training/development data, baseline systems, and evaluation scripts
July 30, 2026 | Registration deadline and release of hidden test inputs
July 31, 2026 | Submission deadline and final evaluation
August 22, 2026 | System description papers due
September 1, 2026 | Shared task overview paper due
September 10, 2026 | Conference camera-ready deadline
October 24–29, 2026 | ArabicNLP 2026 / EMNLP 2026, Budapest, Hungary

10. Organisers