Birbal Model: Efficient Instruction-Tuned LLM
- Birbal Model is an efficient, instruction-tuned LLM based on Mistral-7B using 4-bit QLoRA for scalable fine-tuning.
- Its performance is driven by curated instruction datasets from diverse public sources and targeted data filtering techniques.
- Birbal's final challenge score was roughly 35% higher than that of the second-place Qwen-14B submission, showing that a carefully tuned 7B model can outperform larger models while remaining reproducible.
Birbal is an efficient instruction-tuned LLM developed on top of the open-source Mistral-7B Transformer architecture. Designed for high task coverage and robust reproducibility within constrained computational resources, Birbal was the winning entry in the NeurIPS LLM Efficiency Challenge, outperforming both its base model and substantially larger alternatives by leveraging carefully curated instruction datasets and advanced fine-tuning methodologies using QLoRA.
1. Base Model Architecture
Birbal uses the Mistral-7B Transformer as its base architecture. This is a dense decoder-only model with 7 billion parameters, a maximum sequence length of 4,096 tokens, and rotary positional embeddings. Birbal makes no modifications to Mistral-7B's core structure. Fine-tuning relies on 4-bit quantization and Parameter-Efficient Fine-Tuning (PEFT) via QLoRA. Specifically, low-rank adapters are injected into each multi-head self-attention block and feed-forward "Linear" layer, with rank $r = 128$ and a scaling factor $\alpha$. Each adapter adds only $r(d_{\text{in}} + d_{\text{out}})$ trainable parameters to a linear layer of shape $d_{\text{out}} \times d_{\text{in}}$, while the original 7B parameters stay frozen in their quantized form.
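As a rough illustration of why the adapter overhead stays small, the sketch below counts the trainable parameters that a rank-$r$ adapter adds to a single linear layer. The function name `lora_param_count` is hypothetical, and the 4096-wide projection shape is an assumption taken from Mistral-7B's published configuration rather than from the text above.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed shape: a 4096 -> 4096 attention projection (Mistral-7B hidden size).
hidden = 4096
rank = 128  # adapter rank reported for Birbal

adapter = lora_param_count(hidden, hidden, rank)
frozen = hidden * hidden  # base weights, kept frozen in 4-bit form

print(f"adapter params: {adapter:,}")
print(f"frozen params:  {frozen:,}")
print(f"ratio:          {adapter / frozen:.4f}")
```

Under these assumptions the adapter holds about 1.05M trainable parameters against roughly 16.8M frozen weights for that one projection.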
2. Instruction Data Curation
A key factor behind Birbal's performance is the targeted curation of high-quality, diverse instruction data under strict single-GPU and time limits. Birbal's datasets were hand-curated at three scales (200K, 400K, and 700K examples) from seven public data sources. The process included:
- Filtering Natural Instructions (NI): Over 1,600 tasks were reduced to 463 English answer-generation tasks, each classified as Exact-Match or Generation. Base Mistral-7B was run in a few-shot setting on each task, with accuracy measured for Exact-Match tasks and ROUGE for Generation tasks. Tasks were then bucketed by difficulty, and more examples were sampled from the lower-accuracy buckets.
- Excluding GPT-generated instructions: 10% of Open-Platypus entries were removed to eliminate synthetic content.
- Removing redundancies: MMLU multiple-choice tasks from NI and HELM were excluded to prevent overlap with evaluation benchmarks.
- Sampling specialized data: OpenBookQA, QuAC, CNN/DailyMail, and MathInstruct (after discarding LLM-generated examples) contributed expert QA and summarization tasks.
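The difficulty-weighted sampling of NI tasks described in the list above can be sketched as follows. This is a minimal illustration only: the bucket boundaries, the per-bucket weights, the task names, and the scores are all hypothetical, since the source does not state the values actually used.

```python
# Hypothetical bucket edges and sampling weights (not taken from the source).
BUCKETS = [(0.0, 0.33, "hard"), (0.33, 0.66, "medium"), (0.66, 1.01, "easy")]
WEIGHTS = {"hard": 3, "medium": 2, "easy": 1}


def bucket_for(score: float) -> str:
    """Map a base-model score in [0, 1] to a difficulty bucket."""
    for low, high, name in BUCKETS:
        if low <= score < high:
            return name
    raise ValueError(f"score out of range: {score}")


def allocate(task_scores: dict[str, float], budget: int) -> dict[str, int]:
    """Split an example budget across tasks, favouring low-scoring (harder) tasks."""
    weights = {task: WEIGHTS[bucket_for(s)] for task, s in task_scores.items()}
    total = sum(weights.values())
    return {task: budget * w // total for task, w in weights.items()}


# Made-up few-shot scores of the base model on three tasks.
scores = {"task_a": 0.20, "task_b": 0.50, "task_c": 0.90}
print(allocate(scores, budget=6000))
```

The task with the lowest base-model score receives the largest share of the budget, which is the behaviour the curation step aims for.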
The 200K dataset is summarized below; larger datasets scale NI and MathInstruct appropriately.
| Source Dataset | #Examples |
|---|---|
| LIMA | 1,000 |
| Open-Platypus | 25,000 |
| NI (Exact Match) | 50,000 |
| NI (Generation) | 50,000 |
| OpenBookQA (OpenQA) | 5,000 |
| QuAC | 10,000 |
| CNN/DailyMail | 15,000 |
| MathInstruct | 50,000 |
This strategy maximized both the breadth and depth of instruction coverage within a manageable training footprint.
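To make the composition of the 200K mixture concrete, the short sketch below totals the per-source counts from the table and prints each source's share. The counts come from the table; the variable names are illustrative.

```python
mixture = {
    "LIMA": 1_000,
    "Open-Platypus": 25_000,
    "NI (Exact Match)": 50_000,
    "NI (Generation)": 50_000,
    "OpenBookQA": 5_000,
    "QuAC": 10_000,
    "CNN/DailyMail": 15_000,
    "MathInstruct": 50_000,
}

total = sum(mixture.values())
shares = {name: count / total for name, count in mixture.items()}

# Print sources from largest to smallest share of the mixture.
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<18} {share:6.1%}")
print(f"total examples: {total:,}")
```

The listed counts sum to 206,000 examples, with the two NI subsets together accounting for just under half of the mixture.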
3. Fine-Tuning Methodology
Birbal’s fine-tuning was conducted on a single NVIDIA RTX 4090 (24 GB) for approximately 16 hours. The process incorporated:
- 4-bit QLoRA: Low-rank adapters (rank $r = 128$, with a scaling factor $\alpha$) were added to the attention projection matrices and the feed-forward layers, following the QLoRA protocol.
- Random embedding noise: Applied per NEFTune methodology to improve generalization.
- Sample packing: Used to maximize GPU utilization and throughput.
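The embedding-noise step can be illustrated with a small pure-Python sketch of the NEFTune rule, which perturbs each embedding entry with uniform noise scaled by $\alpha_{\text{noise}} / \sqrt{L d}$ for a sequence of length $L$ and embedding width $d$. The function name and the value `noise_alpha=5.0` are assumptions for illustration; the source does not report the noise scale used for Birbal.

```python
import math
import random


def neftune_noise(embeddings: list[list[float]], noise_alpha: float,
                  rng: random.Random) -> list[list[float]]:
    """Add Uniform(-1, 1) noise scaled by noise_alpha / sqrt(L * d) to an (L x d) matrix."""
    seq_len, dim = len(embeddings), len(embeddings[0])
    scale = noise_alpha / math.sqrt(seq_len * dim)
    return [[x + scale * rng.uniform(-1.0, 1.0) for x in row] for row in embeddings]


rng = random.Random(0)
clean = [[0.0] * 8 for _ in range(4)]  # toy embeddings: 4 tokens, 8 dimensions
noisy = neftune_noise(clean, noise_alpha=5.0, rng=rng)  # assumed noise scale

max_shift = max(abs(v) for row in noisy for v in row)
print(f"largest perturbation: {max_shift:.4f}")
```

The noise is applied only during training; at inference time the embeddings are used unchanged.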
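Optimization used paged_adamw_32bit as the optimizer with a peak learning rate $\eta_{\max}$, weight decay of 0.01, 100 warmup steps, and a cosine decay schedule, which in its standard form is $\eta_t = \tfrac{1}{2}\,\eta_{\max}\left(1 + \cos\left(\pi\,\frac{t - t_w}{T - t_w}\right)\right)$ for step $t \ge t_w$, where $t_w = 100$ is the number of warmup steps and $T$ is the total number of training steps. Training used 3 gradient accumulation steps and a micro-batch size of 2, giving an effective batch of 6 sequences per optimizer step.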
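A minimal sketch of a warmup-plus-cosine schedule and of the effective batch size follows. It assumes linear warmup and decay to zero, which are common defaults but are not confirmed by the source; `PEAK_LR` and `TOTAL_STEPS` are placeholders, not the values used for Birbal.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float, warmup_steps: int = 100) -> float:
    """Linear warmup to peak_lr (assumed), then cosine decay to zero (assumed)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


micro_batch_size = 2
grad_accum_steps = 3
effective_batch = micro_batch_size * grad_accum_steps  # sequences per optimizer step

PEAK_LR = 1.0       # placeholder: outputs read as a fraction of the peak rate
TOTAL_STEPS = 1000  # placeholder run length

for s in (0, 50, 100, 550, 1000):
    print(s, round(lr_at(s, TOTAL_STEPS, PEAK_LR), 4))
print("effective batch size:", effective_batch)
```

With `PEAK_LR = 1.0` the printed values read directly as multipliers on whatever peak rate is configured.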
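Validation used 2,000 held-out examples per sub-corpus, with the checkpoint selected by minimum validation cross-entropy loss, which in its standard token-averaged form is $\mathcal{L}_{\text{val}} = -\frac{1}{N}\sum_{i=1}^{N} \log p_\theta(y_i \mid x, y_{<i})$, where $N$ is the number of target tokens, $x$ is the prompt, and $y_i$ is the $i$-th target token. The number of epochs depended on dataset size: ~3 epochs (200K), ~2 (400K), ~1 (700K).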
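Checkpoint selection by validation loss can be sketched as below. The checkpoint names and the per-token probabilities are made up for illustration; in practice the loss would come from the training framework's evaluation loop.

```python
import math


def mean_cross_entropy(token_probs: list[float]) -> float:
    """Average negative log-likelihood of the reference tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# Made-up probabilities assigned to the reference tokens by three checkpoints.
validation_probs = {
    "checkpoint-100": [0.50, 0.40, 0.60],
    "checkpoint-200": [0.70, 0.65, 0.80],
    "checkpoint-300": [0.60, 0.55, 0.90],
}

losses = {name: mean_cross_entropy(p) for name, p in validation_probs.items()}
best = min(losses, key=losses.get)  # keep the checkpoint with the lowest loss
print(best, round(losses[best], 4))
```

Selecting on held-out loss rather than on the last step guards against keeping a checkpoint that has started to overfit.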
4. Evaluation Framework and Results
Birbal's performance was assessed through the four-stage LLM Efficiency Challenge protocol:
- Open evaluation: Using a subset of HELM tasks (MMLU, TruthfulQA, BBQ, GSM8K, BIG-bench).
- Closed evaluation: Hidden tasks (SAMSum, Corr2cause, MATH, Ethics_x).
- Organizer reproduction: Confirmation of results under standardized conditions.
- Final scoring: Weighted as $1/3$ Open + $2/3$ Closed evaluation.
Summary of leaderboard performance:
| Team (Base Model) | Open Eval | Closed Eval | Final Score | Rank |
|---|---|---|---|---|
| Birbal (Mistral-7B) | 0.52 | 0.61 | 0.58 | 1 |
| Rank 2 (Qwen-14B) | 0.63 | 0.32 | 0.42 | 2 |
| Rank 3 (Mistral-7B) | 0.21 | 0.47 | 0.38 | 3 |
Birbal achieved a final score of 0.58, representing approximately a 35% improvement over the next-best Qwen-14B submission. Notably, Birbal-200K achieved best-in-class results on 12 out of 31 evaluation tasks, including substantial gains on TruthfulQA (+7 points), GSM8K (+11 points), and several Ethics subcategories.
A plausible implication is that robust instruction curation and advanced quantized adaptation can enable smaller models to surpass the performance of much larger LLMs when evaluated on both open and closed benchmark suites.
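The final-score weighting can be checked against the leaderboard values reported above. The sketch below applies the 1/3 open and 2/3 closed rule to the three listed rows; the function name `final_score` is illustrative.

```python
def final_score(open_eval: float, closed_eval: float) -> float:
    """Weighted challenge score: one third open evaluation, two thirds closed evaluation."""
    return open_eval / 3 + 2 * closed_eval / 3


# (open eval, closed eval) pairs copied from the leaderboard table.
leaderboard = {
    "Birbal (Mistral-7B)": (0.52, 0.61),
    "Rank 2 (Qwen-14B)": (0.63, 0.32),
    "Rank 3 (Mistral-7B)": (0.21, 0.47),
}

for team, (open_eval, closed_eval) in leaderboard.items():
    print(f"{team:<20} {final_score(open_eval, closed_eval):.2f}")
```

The computed values match the Final Score column to two decimal places, and they show that Birbal's lead comes from the more heavily weighted closed evaluation, since Qwen-14B scores higher on the open tasks.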
5. Reproducibility and Transparency
The entire Birbal pipeline—comprising data curation scripts, fine-tuning recipes, model weights, and evaluation harness—has been open-sourced for full reproducibility. The materials include detailed hyperparameter configurations, dataset sampling scripts with logged random seeds, Docker-based training environments, Axolotl fine-tuning workflows, and exact validation splits. The challenge committee successfully reproduced Birbal’s results as part of the evaluation protocol, confirming transparency and the absence of hidden procedures or proprietary workflows.
6. Significance and Implications
Birbal demonstrates that carefully constructed instruction datasets, combined with quantization-aware PEFT strategies such as QLoRA, can give a small LLM competitive (and in this case superior) task performance against a model roughly twice its size. The results indicate that efficiency in language modeling depends not only on architectural scale but also, to a large degree, on data composition and fine-tuning discipline.
This model advances the field’s understanding of trade-offs between model size, dataset composition, computational resources, and real-world task performance, providing an open and reproducible blueprint for instruction-tuned LLM development under practical constraints (Jindal et al., 2024).