
Curated LLM (CLLM) Overview

Updated 18 November 2025
  • Curated LLM is a large language model fine-tuned with rigorously curated, open-license instruction datasets using difficulty-aware sampling and transparent workflows.
  • The methodology involves multi-stage data filtering, bucket-based sampling, and rigorous validation to ensure high quality and balanced task difficulty.
  • Efficient fine-tuning with QLoRA 4-bit quantization and LoRA adapters achieved a final score of 0.58 on a single 24 GB consumer GPU.

A curated LLM (CLLM) refers to an LLM fine-tuned with carefully selected and constructed instruction datasets, with a central focus on difficulty-aware curation, quality-control mechanisms, and transparent, reproducible workflows. The CLLM paradigm addresses issues of hardware cost, lack of transparency in data and training protocols, and the non-reproducibility endemic to proprietary LLMOps, as exemplified by the Birbal system, a Mistral-7B-based instruct model that achieved state-of-the-art performance in the NeurIPS LLM Efficiency Challenge through rigorous curation and efficient fine-tuning procedures (Jindal et al., 2024).

1. Dataset Curation Methodology

The foundation of CLLM training lies in the multi-stage curation of open-license, instruction–response corpora. Birbal's approach involved aggregating diverse sources: LIMA (1K), Open-Platypus (∼25K), Natural-Instructions (NI, 1.6K tasks), and sub-datasets from HELM (OpenBookQA, QuAC, CNN/DailyMail), as well as MathInstruct. Rigorous task-level filtering was applied: non-English (576 NI tasks) as well as MMLU QA, question-generation/understanding, math, and linguistic-probing tasks were excluded. Only answer-generation tasks within 33 NI categories (e.g., QA, sentiment, program execution, toxicity) were retained.

Difficulty-aware sampling was central; few-shot inference with the Mistral-7B base model served to score and filter tasks:

  • Exact-match tasks were scored by accuracy (Acc); generation tasks by ROUGE-1/2.
  • Low-accuracy tasks (Acc $< \tau_1$) were discarded.
  • Remaining tasks were bucketed and sampled inversely proportional to mean task accuracy, emphasizing harder tasks.
  • Generation tasks were further stratified by ROUGE-2 intervals ([0,0.2), [0.2,0.3), …), with 40% sampled from the lowest and 10% from each higher bucket.
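The bucketed sampling scheme for generation tasks can be sketched in a few lines of Python. The bucket edges above 0.3 follow the interval pattern quoted above; the function and variable names are illustrative assumptions, and the per-bucket quotas (40% lowest, 10% each higher) may leave part of the budget unused when buckets are sparse.

```python
import random

def bucket_of(score, edges):
    """Index of the half-open interval [edges[i], edges[i+1]) containing score."""
    for i in range(len(edges) - 1):
        if edges[i] <= score < edges[i + 1]:
            return i
    return len(edges) - 2  # a score equal to the top edge falls in the last bucket

def sample_by_difficulty(examples, budget, seed=0):
    """examples: (example_id, rouge2) pairs scored with the base model.
    Draw 40% of the budget from the hardest bucket [0, 0.2) and 10% from
    each higher bucket, capped by what each bucket actually contains."""
    edges = [0.0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    buckets = [[] for _ in range(len(edges) - 1)]
    for ex_id, r2 in examples:
        buckets[bucket_of(r2, edges)].append(ex_id)
    quotas = [0.40] + [0.10] * (len(buckets) - 1)
    rng = random.Random(seed)
    selected = []
    for bucket, q in zip(buckets, quotas):
        k = min(len(bucket), round(budget * q))
        selected.extend(rng.sample(bucket, k))
    return selected
```

Because harder buckets carry larger quotas, the resulting mix over-represents examples the base model already fails on, which is the intent of difficulty-aware curation.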

Three final curated splits (200K, 400K, 700K examples) each covered the six major data families with proportional representation. Notably, the 200K split included 463 NI tasks (50K Exact-match, 50K Generation) and robust sampling from other benchmarks (10K–50K examples per source).

2. Quality-Control Measures

Ensuring dataset quality and relevance involved multi-pronged validation:

  • Automatic rejection of tasks below the accuracy threshold $\tau_1$ (base-model Acc < 50%).
  • Bucket-based retention for generation tasks to balance low and high ROUGE examples.
  • 2,000 held-out validation instances per split for monitoring loss and overfitting.
  • Manual spot-checking (500 examples per source) assessed clarity and prompt–response alignment, and excluded LLM-generated prompts.
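The automatic rejection and held-out split steps above can be sketched as follows, assuming tasks are keyed by id with precomputed base-model accuracies (the data shapes and names are illustrative, not from the paper):

```python
import random

TAU_1 = 0.5  # accuracy threshold quoted above (base-model Acc < 50% is rejected)

def filter_and_split(task_acc, examples, n_val=2000, seed=0):
    """task_acc: dict mapping task_id -> base-model few-shot accuracy.
    examples: list of (task_id, example) pairs.
    Reject all examples from tasks below the threshold, then carve off
    a held-out validation split for loss/overfitting monitoring."""
    kept_tasks = {t for t, acc in task_acc.items() if acc >= TAU_1}
    kept = [ex for ex in examples if ex[0] in kept_tasks]
    rng = random.Random(seed)
    rng.shuffle(kept)
    return kept[n_val:], kept[:n_val]  # (train, validation)
```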

This methodical quality process guards against overfitting, redundancy, and spurious data artifacts, reinforcing reproducibility and model reliability.

3. Fine-Tuning Procedure

CLLM fine-tuning utilized Mistral-7B, a decoder-only transformer with grouped-query attention. Adaptation employed QLoRA 4-bit quantization and LoRA adapters (rank 128, $\alpha = 256$), integrated into the Q/K/V projections and feed-forward linear layers. NEFTune regularization added uniform noise to the input embeddings.
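The adapter setup described above might be expressed with the Hugging Face transformers and peft libraries roughly as follows; the NF4 quantization type, compute dtype, dropout value, and Mistral module names are assumptions not stated in the text, only the rank and alpha come from it.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit quantization of the frozen base model for QLoRA.
# NF4 and bfloat16 compute are common defaults (assumed, not from the text).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters with rank 128 and alpha 256, attached to the attention
# Q/K/V projections and feed-forward linear layers as described above.
# Module names follow Mistral-7B's Hugging Face implementation (assumed).
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    target_modules=["q_proj", "k_proj", "v_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,  # assumption; dropout is not specified in the text
    task_type="CAUSAL_LM",
)
```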

The optimization employed paged_adamw_32bit; the learning rate followed cosine decay with warmup (peak $2 \times 10^{-5}$, 100 warmup steps, weight decay 0.01). The effective batch size was 6 (micro_batch = 2, grad_accum = 3). Training ran for ~3 epochs (200K split), 2 (400K), or 1 (700K), with sequence packing for throughput maximization. Validation on 2,000 random examples determined checkpoint retention (minimum loss within the 24-hour budget).
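The schedule above can be sketched in a few lines; linear warmup is assumed, since the warmup shape is not specified in the text.

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_steps=100):
    """Cosine-decay learning-rate schedule with the peak LR and
    warmup length quoted above."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warmup (assumed shape)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

With micro_batch = 2 and grad_accum = 3, each optimizer step consumes 6 examples, so total_steps is roughly the curated-split size divided by 6, times the epoch count.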

Hardware utilization was highly efficient: Birbal was fine-tuned on a single NVIDIA RTX 4090 (24 GB) for 16 hours; runs were capped at 24 hours. GPU-hours were calculated as $GPU_h = N_{GPU} \times T_{hours}$ (Birbal-200K: $1 \times 16 = 16$ h).

4. Performance Evaluation and Comparative Analysis

Evaluation employed both open (HELM subset: MMLU, TruthfulQA, BBQ, GSM8K, BIG-Bench) and closed benchmarks (SAMSum [ROUGE-2], corr2cause, MATH [CoT], ethics categories). Metrics included exact-match accuracy (EM), mean win rate (MWR), ROUGE-2, and error rates for representation and stereotyping.

Scores were aggregated via geometric mean across benchmarks:

$$Score = \left(\prod_{i=1}^{N} \mathrm{MWR}_i\right)^{1/N}$$

The final score formula combined open and closed evaluations:

$$S_{\mathrm{final}} = \frac{1}{3} S_{\mathrm{open}} + \frac{2}{3} S_{\mathrm{closed}}$$

Birbal (Mistral-7B) achieved a final score of 0.58, surpassing the runner-up Qwen-14B (0.42) by roughly 38% relative. Birbal was superior on closed tasks (0.61 vs. 0.32) and competitive on open tasks (0.52 vs. 0.63), underscoring the effectiveness of targeted data curation.

| Evaluation  | Birbal (Mistral-7B) | Qwen-14B | Rank 3 |
|-------------|---------------------|----------|--------|
| Open Eval   | 0.52                | 0.63     | 0.21   |
| Closed Eval | 0.61                | 0.32     | 0.47   |
| Final Score | 0.58                | 0.42     | 0.38   |
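As a sanity check, the aggregation formulas reproduce the reported final scores from the open/closed sub-scores above:

```python
import math

def geo_mean_mwr(mwrs):
    """Geometric mean of per-benchmark mean win rates."""
    return math.prod(mwrs) ** (1.0 / len(mwrs))

def final_score(s_open, s_closed):
    """Weighted 1/3 open + 2/3 closed combination."""
    return s_open / 3.0 + 2.0 * s_closed / 3.0

print(round(final_score(0.52, 0.61), 2))  # Birbal → 0.58
print(round(final_score(0.63, 0.32), 2))  # Qwen-14B → 0.42
```

The 2/3 weight on closed evaluation is why Birbal's strong closed-task score (0.61) outweighs its weaker open-task score (0.52) in the final ranking.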

5. Curation Insights and Ablation Studies

Empirical analysis demonstrated that the compact 200K curated split reached top ranking due to difficulty-aware sampling and balanced task representation. Relative to a naïve “all-data” approach (700K), compact curation yielded:

  • +12% open-eval geometric mean
  • +8% overall final score

This supports the premise that quality-focused sampling accelerates convergence and improves model robustness.

Ablation studies revealed dataset size trade-offs:

  • Open-task performance declined with increasing size (overfitting to hidden tasks).
  • Closed-task performance improved with larger splits up to 400K, but plateaued thereafter.
  • Intensive sampling of low-accuracy NI tasks mitigated “hard-task” deficiencies, particularly for multi-step reasoning.

6. Best Practices for CLLM Development

The CLLM training framework yields several best practices:

  • Difficulty-driven selection: Leverage few-shot base-model inference for learnability assessment.
  • Bucketed sampling: Systematic retention across difficulty strata ensures broad task generalization.
  • Compact data curation (~200K), when paired with adapter tuning (LoRA) and 4-bit QLoRA, enables state-of-the-art results on single commodity GPUs in <24 hours.
  • Continuous validation with held-out splits mitigates overfitting and preserves generalization.
  • Open reproducibility through release of code, data splits, and checkpoints under open licenses fosters transparent LLM research.

This suggests that curated LLM approaches, exemplified by Birbal, balance resource efficiency, performance, and scientific reproducibility, contributing a robust framework for developing specialized instruction models suitable for academic and real-world deployments (Jindal et al., 2024).

References

  • Jindal et al. (2024)