QIAS 2025 Dataset Overview
- The QIAS 2025 dataset is a specialized benchmark that assesses ML models on complex Arabic Islamic inheritance law through MCQs derived from 32,000 authentic fatwas.
- It is organized into training, validation, and test splits spanning three difficulty levels; published systems tackle it with techniques such as majority voting, RAG, and encoder-based scoring.
- The dataset drives advances in fine-tuning and chain-of-thought methods, highlighting trade-offs between peak accuracy and operational efficiency.
The QIAS 2025 dataset is a specialized benchmark developed for evaluating machine learning systems, notably LLMs, on the complex domain of Islamic inheritance law. The dataset, curated from authentic fatwas and verified by domain experts, focuses on assessing automated legal reasoning in Arabic, presenting a rigorous testbed for both natural language understanding and mathematical deduction. QIAS 2025 has become a focal point for research groups participating in shared tasks, aiming to advance technologies capable of high-stakes, rule-based decision support within Islamic legal contexts.
1. Structure and Design of the QIAS 2025 Dataset
The QIAS 2025 dataset comprises multiple-choice questions (MCQs) in Arabic, each constructed from real inheritance scenarios derived from a corpus of 32,000 fatwas (IslamWeb corpus). Every MCQ presents six options (labeled A–F) supplemented with brief explanatory text, forming a comprehensive assessment protocol for both fundamental and advanced legal reasoning.
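For concreteness, a single record might be organized as follows. This is a hypothetical sketch only; the field names are illustrative and not the official release schema.

```python
# Hypothetical QIAS 2025 record layout (field names illustrative only;
# the official release format may differ).
sample_mcq = {
    "id": "qias-00123",                      # assumed identifier scheme
    "level": "Intermediate",                 # Beginner / Intermediate / Advanced
    "question": "توفي رجل عن زوجة وأم وابنتين ...",  # Arabic inheritance scenario
    "options": {"A": "...", "B": "...", "C": "...",
                "D": "...", "E": "...", "F": "..."},
    "answer": "C",                           # expert-verified gold label
    "explanation": "Brief rationale accompanying the gold option.",
}
```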
Dataset organization features distinct difficulty classifications (Beginner, Intermediate, and Advanced), with cases ranging from elementary eligibility determination (e.g., identifying heirs and assigning fixed shares) to intricate multi-generational conflicts involving blocked heirs (ḥajb), proportional reduction (ʿawl), and redistribution (radd).
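To make ʿawl concrete, here is a toy computation (not drawn from the dataset) using Python's fractions module: when the fixed Quranic shares sum to more than the whole estate, every share is scaled down by the same factor so the total becomes exactly one.

```python
from fractions import Fraction

# Toy 'awl (proportional reduction) example: husband 1/2, two full
# sisters 2/3 jointly. The fixed shares sum to 7/6 > 1, so every
# share is divided by the total, yielding 3/7 and 4/7.
shares = {"husband": Fraction(1, 2), "two full sisters": Fraction(2, 3)}
total = sum(shares.values())                      # Fraction(7, 6)
if total > 1:
    shares = {heir: s / total for heir, s in shares.items()}
print(shares)  # {'husband': Fraction(3, 7), 'two full sisters': Fraction(4, 7)}
```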
The principal dataset splits used in research tasks are as follows:
| Split | Number of Samples | Difficulty Coverage |
|---|---|---|
| Training | 9,446 or 20,000 (depending on version; see below) | All (Beginner–Advanced) |
| Validation | 1,000 | All |
| Test | 1,000 (unlabeled) | All (final leaderboard) |
Two major versions are referenced: one with answer choices and brief explanations, and a subsequent extension including detailed reasoning for each option, intended to facilitate chain-of-thought alignment in fine-tuning.
2. Evaluation Paradigms and Model Architectures
Studies utilizing QIAS 2025 compare a diverse set of approaches:
- Open-source base models: Falcon3, Fanar, Allam, MARBERT, ArabicBERT, AraBERT.
- Proprietary frontier models: Gemini Flash 2.5, Gemini Pro 2.5, GPT o3, GPT-4o.
- Fine-tuned variants: Allam Thinking, Llama4 Scout, domain-adapted GPT-4o, Gemini Flash 2.x.
- Hybrid and ensemble systems: Majority-voting approaches, Retrieval-Augmented Generation (RAG), and encoder-based relevance ranking.
Model evaluation employs the dataset's MCQs, with predicted answers (A–F) compared against the expert-verified ground truth. Prompt sensitivity is tested systematically: simple (one-word-answer) prompts versus detailed chain-of-thought prompts yield substantial performance differences. Fine-tuning methodologies include Low-Rank Adaptation (LoRA), adapter layers, and quantization strategies (e.g., 4-bit NF4) to balance efficiency and capacity.
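As a concrete illustration of the LoRA-plus-quantization recipe, the following is a minimal sketch using Hugging Face transformers, peft, and bitsandbytes. The model id and all hyperparameters are assumptions for illustration, not the settings of any cited system.

```python
# Hedged sketch: load a causal LM in 4-bit NF4 and attach a LoRA adapter.
# Model id and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NF4 quantization, as in the text
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "QCRI/Fanar-1-9B",                     # assumed hub id; substitute as needed
    quantization_config=bnb_config,
    device_map="auto",
)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # typical attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # LoRA trains only a small fraction
```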
3. Benchmark Results and Comparative Metrics
Across published research, the primary evaluation metric is accuracy, i.e., the fraction of MCQs answered correctly under the official splits. Notable results:
| Model/System | Accuracy (Overall) | Remarks |
|---|---|---|
| Majority Voting (Gemini/GPT o3) | 92.7% | Ensemble of three base models |
| Gemini Flash 2.5 (base) | 88.1%–91.5% | Proprietary |
| GPT o3 (base) | 88.4%–92.3% | Proprietary |
| Fanar-1-9B (LoRA+RAG, QU-NLP) | 85.8% | 97.6% on Advanced subset |
| MARBERT+ARS (CVPD) | 69.87% | Encoder-based, on-device |
| LLaMA (zero-shot) | ~48.8% | Baseline |
Prompt design exerts a measurable impact: for GPT-4o, accuracy rises from 57.5% (simple) to 70.1% (chain-of-thought). Fine-tuning yields heterogeneous results—Allam Thinking at 38.8%, fine-tuned GPT-4o at 86.6% (Prompt 2), and Llama4 Scout at 84.3%.
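The contrast between the two prompting styles might look as follows; these templates are illustrative paraphrases, not the exact prompts used in the cited studies.

```python
# Illustrative prompt templates (not the exact wording from the cited work).
SIMPLE_PROMPT = (
    "Question: {question}\nOptions:\n{options}\n"
    "Answer with a single letter (A-F) only."
)
COT_PROMPT = (
    "Question: {question}\nOptions:\n{options}\n"
    "First identify the heirs, their fixed shares, and any blocking (hajb), "
    "'awl, or radd adjustments. Reason step by step, then end with "
    "'Final answer: <letter>'."
)
```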
Advanced reasoning subsets (e.g., blocked heirs, ʿawl, radd) discriminate among systems: QU-NLP's Fanar-1-9B reaches 97.6% versus Gemini 2.5 (89.6%) and GPT o3 (92.4%) (AL-Smadi, 20 Aug 2025).
4. Model Methodologies: Ensembles, Retrieval Augmentation, and Efficiency
Three principal architectures are deployed:
A. Majority Voting (Ensemble)
Predictions from the top-performing base models (Gemini Flash 2.5, Gemini Pro 2.5, GPT o3) are combined via majority rule. This approach leverages their diverse error profiles, boosting aggregate accuracy (92.7%) beyond any individual system (AlDahoul et al., 13 Aug 2025).
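A minimal sketch of the voting rule follows; the tie-breaking fallback is an assumption, as the cited work may resolve ties differently.

```python
# Minimal majority-vote sketch over per-model answer letters. Model calls
# are stubbed; in the cited work the voters are Gemini Flash 2.5,
# Gemini Pro 2.5, and GPT o3.
from collections import Counter

def majority_vote(predictions: list[str]) -> str:
    """Return the most common letter; on a three-way tie, fall back to
    the first voter (an assumed rule, not necessarily the paper's)."""
    letter, freq = Counter(predictions).most_common(1)[0]
    return letter if freq > 1 else predictions[0]

print(majority_vote(["C", "C", "E"]))  # -> "C"
print(majority_vote(["A", "B", "C"]))  # tie -> "A" (first voter wins)
```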
B. Retrieval-Augmented Generation (RAG)
QU-NLP applies LoRA fine-tuning to Fanar-1-9B and integrates external document retrieval. Arabic input questions are encoded (MiniLM-L6-v2), relevant passages are identified via FAISS, concatenated with the MCQ, and presented to the generation model. Deterministic decoding (temperature ≈ 0) and regex-based answer extraction (A–F) complete the pipeline.
Pseudocode, from (AL-Smadi, 20 Aug 2025):
```text
Input: Q (question), options A–F
1. Retrieve top-5 passages P₁–P₅ via FAISS
2. Construct Prompt = concatenate(Q, P₁–P₅, options)
3. Model generates output (low temperature)
4. Extract final answer via pattern match
Output: letter in {A, B, C, D, E, F}
```
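A hedged Python sketch of this pipeline using sentence-transformers and FAISS follows. The corpus, encoder id, answer regex, and fallback choice are illustrative assumptions rather than the exact QU-NLP artifacts.

```python
# Hedged sketch of the retrieve-then-generate pipeline described above.
import re
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
corpus = ["passage 1 ...", "passage 2 ..."]        # stand-in fatwa passages
corpus_emb = encoder.encode(corpus, normalize_embeddings=True)

index = faiss.IndexFlatIP(corpus_emb.shape[1])     # inner product = cosine here
index.add(np.asarray(corpus_emb, dtype="float32"))

def answer_mcq(question: str, options: str, generate) -> str:
    """`generate` stands in for the LLM call (run near-deterministically)."""
    q_emb = encoder.encode([question], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q_emb, dtype="float32"), 5)  # top-5
    passages = "\n".join(corpus[i] for i in ids[0] if i != -1)
    prompt = f"{passages}\n\nQuestion: {question}\nOptions:\n{options}\nAnswer:"
    output = generate(prompt)                      # low-temperature decoding
    match = re.search(r"\b([A-F])\b", output)      # regex answer extraction
    return match.group(1) if match else "A"        # assumed fallback choice
```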
C. Encoder-Based Attentive Relevance Scoring (ARS)
CVPD employs text encoders (e.g., MARBERT) to produce L2-normalized [CLS] embeddings for the question and each candidate answer. These embeddings are projected into a shared space, combined through an attentive interaction, and mapped to a scalar relevance score per option. The training objective combines contrastive, dynamic relevance, and logit regularization losses (Bekhouche et al., 30 Aug 2025).
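In that spirit, here is a minimal PyTorch sketch of encoder-based option scoring. The projection dimension, the plain dot-product interaction, and the omission of the training losses are simplifying assumptions relative to the full ARS formulation in the cited paper.

```python
# Hedged sketch of encoder-based relevance scoring in the spirit of ARS.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERT")
encoder = AutoModel.from_pretrained("UBC-NLP/MARBERT")
proj = torch.nn.Linear(encoder.config.hidden_size, 256)  # assumed dim;
                                                         # randomly initialized
                                                         # here, learned in training

def embed(texts: list[str]) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    cls = encoder(**batch).last_hidden_state[:, 0]       # [CLS] embeddings
    return F.normalize(proj(cls), dim=-1)                # project + L2-normalize

question = "من يرث في هذه الحالة؟"
options = ["الزوجة فقط", "الأم والابنتان", "..."]
with torch.no_grad():
    q = embed([question])                 # (1, 256)
    a = embed(options)                    # (num_options, 256)
    scores = (q @ a.T).squeeze(0)         # cosine relevance per option
print("predicted option:", "ABCDEF"[scores.argmax().item()])
```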
5. Key Challenges and Dataset Limitations
All benchmarked models display sensitivity to domain coverage and annotation detail. Limitations include:
- Absence of detailed reasoning for each MCQ option in the original release, impeding fine-tuning for chain-of-thought models.
- Limited scope of inheritance scenarios: even high-performing LLMs exhibit errors in rare or edge cases, indicating incomplete legal knowledge.
- Fine-tuning adapters of smaller size can degrade reasoning capacity in models like Gemini Flash 2.5.
- Artifact prevalence: near-duplicate answer options and an imbalanced representation of blocked-heir cases.
- Efficiency vs. accuracy: larger models provide superior results (up to 92.7%) but carry high computational, latency, and privacy burdens. Encoder-based systems are lighter (69.87%) and support on-device deployment.
A plausible implication is that richer, more fully annotated datasets—and methods to generate interpretative justifications—are necessary for robust symbolic reasoning and compliance in legal domains (AlDahoul et al., 13 Aug 2025, AL-Smadi, 20 Aug 2025).
6. Implications and Future Directions
The QIAS 2025 dataset demonstrates that domain-specific training and external knowledge retrieval (RAG) can permit mid-scale LLMs (e.g., Fanar-1-9B) to approach, and sometimes surpass, proprietary frontier models in complex Arabic legal reasoning (AL-Smadi, 20 Aug 2025). The contrast between high-accuracy ensembles and lightweight encoder-based approaches quantifies the trade-offs between peak performance and real-world deployment constraints (Bekhouche et al., 30 Aug 2025).
Future research avenues include:
- Enhanced retrieval strategies (multi-hop, hierarchical).
- Symbolic reasoning integration for formal share calculation.
- Artifact mitigation in datasets (balanced representation, option diversity).
- Methodologies for interpretative reasoning, supporting transparent decision-making.
These developments would support the broader goal of deploying AI systems in high-stakes, rule-driven environments with rigorous requirements for both factual fidelity and operational efficiency.