
Ensemble-Instruct: Scalable Instruction Tuning

Updated 24 November 2025
  • Ensemble-Instruct is an instruction-tuning pipeline that generates diverse training data using a three-stage process: instruction generation, instance generation, and output ensembling.
  • It leverages a heterogeneous ensemble of open-source language models (10–40B parameters) to produce high-quality instruction sets with enforced novelty and consensus.
  • Key innovations include simplified in-context learning templates and a greedy consensus algorithm, which together significantly boost performance metrics such as Rouge-L.

Ensemble-Instruct is an instruction-tuning data generation pipeline that synthesizes high-quality training data for large LMs by leveraging a heterogeneous ensemble of open-source LMs in the 10–40B parameter range. It addresses inherent limitations of earlier self-instruction frameworks, such as Self-Instruct and Alpaca, which rely on very large, often closed 175B-parameter models, by devising new in-context learning (ICL) techniques and output ensembling to produce performant instruction-tuning corpora from smaller, open-access models (Lee et al., 2023).

1. Algorithmic Pipeline

Ensemble-Instruct operates by decomposing the task into three chained stages—Instruction Generation, Instance Generation, and Output Ensembling—applied independently to two disjoint classes of tasks:

  • Type A: Tasks requiring an external input (125 Self-Instruct seeds)
  • Type B: Tasks formulated solely as output generations (50 seeds)

The generation process consists of the following steps:

  1. Instruction Generation: For each type t ∈ {A, B}, few-shot prompt templates are populated with a mix of seed examples (drawn from the task's corresponding set) and previously generated synthetic examples (n_inst examples: 24 for A, 10 for B). An LM is then called to generate a new instruction s, which is accepted only if Rouge-L(s, s') < 0.7 for every existing instruction s', enforcing novelty within the growing instruction set.
  2. Instance Generation: Using m_inst few-shot seed examples (18 for A, 15 for B), the LM is prompted to produce (i) for type A, an (input, output) pair conditioned on the new instruction, or (ii) for type B, only an output.
  3. Output Ensembling: Each (s, i) pair (or bare s for type B) with initial LM output o1 is augmented with two additional outputs o2 and o3 from separate LMs (which may include instruction-tuned models), produced under zero-shot or few-shot prompting as appropriate. The system applies a greedy consensus algorithm: it accepts the example only if every output pair satisfies Rouge-L(oi, oj) > 0.01 and returns the output from the pair with the highest Rouge-L similarity; otherwise, the instance is discarded.

All accepted triples (instruction, input (possibly empty), output) populate the synthetic dataset. Iteration continues per type until the target dataset size is reached, yielding a final corpus of approximately 45k examples after ensembling.
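The accept/reject flow of the three stages can be sketched in a few lines of Python. This is a minimal sketch: the three generator callables are hypothetical stand-ins for the few-shot LM calls, and Rouge-L is computed here as a word-level F-measure over the longest common subsequence.

```python
def lcs_len(a, b):
    # longest common subsequence length via dynamic programming
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(s1, s2):
    # word-level Rouge-L F-measure
    a, b = s1.split(), s2.split()
    m = lcs_len(a, b)
    if m == 0:
        return 0.0
    p, r = m / len(b), m / len(a)
    return 2 * p * r / (p + r)

def ensemble_instruct_step(instructions, gen_instruction, gen_instance, gen_extra_outputs):
    """One accept/reject iteration of the three-stage pipeline.

    The generator callables stand in for the few-shot LM calls."""
    # Stage 1: instruction generation with the Rouge-L novelty filter
    s = gen_instruction()
    if any(rouge_l(s, prev) >= 0.7 for prev in instructions):
        return None                          # too close to an existing instruction
    # Stage 2: instance generation (type A: (input, output); type B: input is None)
    inp, o1 = gen_instance(s)
    # Stage 3: output ensembling with two more LMs, then greedy consensus
    outs = [o1, *gen_extra_outputs(s, inp)]
    scores = {(i, j): rouge_l(outs[i], outs[j])
              for i in range(len(outs)) for j in range(i + 1, len(outs))}
    if min(scores.values()) <= 0.01:
        return None                          # low consensus: discard the instance
    best_i, _ = max(scores, key=scores.get)
    instructions.append(s)
    return s, inp, outs[best_i]
```

In a full run, `ensemble_instruct_step` would be called in a loop per task type until the target dataset size is reached.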

2. In-Context Learning Templates and Sampling

The ICL framework in Ensemble-Instruct is designed for maximum utility with smaller LMs. Tasks are strictly categorized, and template strategies are simplified to lower the prompt complexity:

  • Instruction Generation:
    • Type A uses 24-shot context (20 seed + 4 synthetic); type B uses 10-shot (8 seed + 2 synthetic).
  • Instance Generation:
    • Type A (input-output): 18-shot (all seed); type B (output-only): 15-shot (all seed).

Prompt structures adhere to regularized sketches: For type A, the prompt elicits "instruction: <INST> input: <INPUT> output: <OUTPUT> |EoS|"; for type B, it omits the input.
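The sketch below assembles such a prompt from (instruction, input, output) triples; the exact template wording and the example triples are illustrative assumptions following the sketch quoted above.

```python
def build_prompt(examples, new_instruction, task_type="A"):
    """Assemble a few-shot ICL prompt from (instruction, input, output) triples."""
    lines = []
    for inst, inp, out in examples:
        if task_type == "A":
            lines.append(f"instruction: {inst} input: {inp} output: {out} |EoS|")
        else:                      # type B omits the input field
            lines.append(f"instruction: {inst} output: {out} |EoS|")
    # final line carries the new instruction, leaving completion to the LM
    tail = "input:" if task_type == "A" else "output:"
    lines.append(f"instruction: {new_instruction} {tail}")
    return "\n".join(lines)
```

In the pipeline, `examples` would hold the 18 (type A) or 15 (type B) seed demonstrations for instance generation.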

Formally, at chain step m, instruction sampling proceeds as s_m ~ p_M1(s | σ(s_<m), D_1^A), where the context σ(s_<m) includes up to two earlier synthetic instructions, followed by instance sampling (i_m, o_m) ~ p_M1(i, o | s_m, D_2^A). Instruction novelty is enforced by rejecting s_m if max_s' Rouge-L(s_m, s') ≥ 0.7.

Templates and controlled shot counts are critical for success at smaller model scales, mitigating prompt complexity and reducing ICL failure rates.

3. Ensembling and Output Selection

The output ensembling mechanism is central to quality control in Ensemble-Instruct. For each generated (input, output) or (output), additional outputs o2 and o3 are sampled from further LMs (allowing for instruction-tuned variants). The consensus algorithm, a greedy approximation of Minimum Bayes Risk decoding, proceeds as follows:

  1. Computes all pairwise Rouge-L scores R(i, j) for {o1, o2, o3}.
  2. If all R(i, j) > 0.01, selects the output from the most mutually similar pair.
  3. If not, the example is discarded.

This step acts as a filter against spurious or low-consensus generations, increasing both the mean quality and diversity of the resulting dataset. Empirically, this mechanism significantly augments Rouge-L scores over single-model pipelines and is robust across base and instruction-tuned LMs.
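The greedy consensus rule itself is compact. The sketch below assumes a word-level Rouge-L F-measure and breaks ties by pair order; which member of the winning pair is returned is a simplification.

```python
from itertools import combinations

def lcs_len(a, b):
    # longest common subsequence length via dynamic programming
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(s1, s2):
    # word-level Rouge-L F-measure
    a, b = s1.split(), s2.split()
    m = lcs_len(a, b)
    if m == 0:
        return 0.0
    p, r = m / len(b), m / len(a)
    return 2 * p * r / (p + r)

def consensus(outputs, floor=0.01):
    """Greedy MBR-style selection over candidate outputs."""
    scores = {(i, j): rouge_l(outputs[i], outputs[j])
              for i, j in combinations(range(len(outputs)), 2)}
    if min(scores.values()) <= floor:
        return None                  # low agreement: discard the example
    i, j = max(scores, key=scores.get)
    return outputs[i]                # one member of the most similar pair
```

A single off-consensus candidate (e.g. a hallucinated output sharing no words with the others) drives some pairwise score to zero and causes the whole example to be dropped.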

4. Seed Task Utilization and Novelty Enforcement

The procedure leverages the original 175 Self-Instruct tasks, partitioning them into D^A (125 tasks with inputs) and D^B (50 output-only tasks). Few-shot prompt construction always balances seed and synthetic examples for instruction generation and uses only seeds for instance generation, with periodic replacement of seed examples by synthetic ones across generation steps to enhance diversity.

Novelty among instructions is strictly maintained by applying a Rouge-L threshold (< 0.7) against all prior instructions, sharply limiting paraphrasing and redundancy. The rejection process ensures that the expanding instruction pool remains diverse and semantically non-overlapping.

5. Model Selection and Implementation Details

Ensemble-Instruct employs a heterogeneous mixture of open-access LMs, all released under the permissive Apache-2.0 license:

  • Instruction and instance generation models: falcon-40B (decoder-only, 40B), ul2-20B (seq2seq, 20B), gpt-neoxt-chat-20B (decoder-only, 20B; OIG-tuned), flan-ul2-20B (seq2seq, instructed), and flan-t5-xxl (seq2seq, instructed, 11B).
  • Downstream fine-tuning models: pythia-1.4B (vanilla), mpt-7B (decoder-only), and gpt-jt-6B (instruction-tuned).

Sampling for all generation phases uses greedy decoding or default settings (no tuning of sampling temperature or top-k/top-p). Fine-tuning uses either QLoRA on a single A100 (40 GB) for 5–7 epochs at a learning rate of 5×10^-5, or full fine-tuning on two A100s (80 GB each) at a learning rate of 1×10^-6.
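Under the stated hardware and hyperparameter settings, a QLoRA run might be configured roughly as follows. This is a sketch using the Hugging Face transformers and peft libraries; the model id, LoRA rank, and batch size are assumptions, not the paper's exact values.

```python
# Sketch of a QLoRA fine-tuning configuration; the LoRA rank/alpha and
# batch size below are illustrative assumptions.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b",               # one of the base LMs fine-tuned in the paper
    quantization_config=bnb,
    device_map="auto",
)

model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
))

args = TrainingArguments(
    output_dir="ensemble-instruct-qlora",
    num_train_epochs=6,              # the text reports 5-7 epochs
    learning_rate=5e-5,              # QLoRA learning rate from the text
    per_device_train_batch_size=4,   # assumption: batch size is not given
    bf16=True,
)
```

The quantized base model plus low-rank adapters is what makes the 40 GB single-GPU budget feasible for models in this size range.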

The pipeline generates approximately 30–45k samples in total, with the output-ensembling discard step reducing the final dataset to around 18k examples in certain configurations.

6. Empirical Evaluation and Analysis

Empirical validation confirms that Ensemble-Instruct delivers superior or competitive instruction-tuning data compared to both Self-Instruct and Alpaca, especially at reduced data and parameter budgets.

Synthetic Dataset Performance (Rouge-L, Super-NI)

| Dataset | Samples | Rouge-L |
|---|---|---|
| zero-shot | 0 | 9.8 |
| alpaca | 51,760 | 33.4 |
| self-inst (82K) | 82,612 | 34.4 |
| m-self-inst (24K) | 24,984 | 28.5 |
| so-{ul2,neox} (25K) | 25,660 | 33.6 |
| eo-{ul2,neox}-ilm | 18,218 | 38.3 |
| so-falcon (30K) | 30,537 | 34.4 |
| eo-falcon-ilm | 26,701 | 37.1 |
  • Using categorized and simplified templates (so-{ul2,neox}) yields up to +5.1 Rouge-L over vanilla Self-Instruct data from the same models. Output ensembling (eo-) adds a further +4.7 points. Ensemble-Instruct with ~30k samples matches or exceeds the performance of Self-Instruct's 82k examples.

Human Data Quality

Manual assessment on eo-{ul2,neox}-ilm (140 samples):

| Type | Good | Bad | Maybe | Total |
|---|---|---|---|---|
| output-only | 77 | 14 | 9 | 100 |
| input-output | 22 | 15 | 3 | 40 |
| overall | 99 | 29 | 12 | 140 |

Output-only tasks achieved 77% rated good, input-output 55%. The dual-pipeline design ameliorates imbalances inherent in self-instruction.

Large LM Fine-Tuning

Fine-tuning base LMs (6B–40B) on Ensemble-Instruct data increases Rouge-L scores by roughly 30–37 points over zero-shot, consistently across architectures.

| Base LM | Zero-shot | Fine-tuned (eo-combo-ilm) |
|---|---|---|
| gpt-jt-6B | 10.4 | 43.1 |
| mpt-7B | 16.6 | 46.4 |
| open-llama-13B | 11.9 | 46.7 |
| mpt-30B | 12.2 | 49.5 |
| falcon-40B | 12.7 | 49.9 |

On user-facing tasks, ablations show +5–9 points from template simplification and +2–4 from output ensembling.

7. Technical Observations, Limitations, and Future Directions

Ensemble-Instruct's effectiveness at smaller model scales is attributable to its reduced ICL burden via prompt simplification and diversity/quality improvement through output ensembling. The rejection and consensus mechanisms are crucial in maintaining instructional novelty and filtering low-agreement samples.

Limitations include inherent dependence on multiple open LMs (allowing representation biases to propagate), discard of viable low-consensus examples, absence of tuned sampling parameters (temperature, top-k/p), and reliance on automatic metrics (Rouge-L) for most evaluations; broader human assessments are lacking.

Potential avenues for future enhancement comprise integration of learned ensemble metrics (such as BERTScore or ranking LMs), expanded human evaluation, dynamic context/shot selection, and adaptation to multilingual or even smaller LMs. The pipeline’s modularity and reliance on permissively licensed models position it as a viable replacement for closed-model instruction data curation at scale (Lee et al., 2023).
