Ensemble-Instruct: Scalable Instruction Tuning
- Ensemble-Instruct is an instruction-tuning pipeline that generates diverse training data using a three-stage process: instruction generation, instance generation, and output ensembling.
- It leverages a heterogeneous ensemble of open-source language models (10–40B parameters) to produce high-quality instruction sets with enforced novelty and consensus.
- Key innovations include simplified in-context learning templates and a greedy consensus algorithm, which together significantly boost performance metrics such as Rouge-L.
Ensemble-Instruct is an instruction-tuning data generation pipeline that synthesizes high-quality training data for large LMs by leveraging a heterogeneous ensemble of open-source LMs in the 10–40B parameter range. It addresses inherent limitations of earlier self-instruction frameworks, such as Self-Instruct and Alpaca, which rely on very large, often closed 175B-parameter models, by devising new in-context learning (ICL) techniques and output ensembling to produce performant instruction-tuning corpora from smaller, open-access models (Lee et al., 2023).
1. Algorithmic Pipeline
Ensemble-Instruct operates by decomposing the task into three chained stages—Instruction Generation, Instance Generation, and Output Ensembling—applied independently to two disjoint classes of tasks:
- Type A: Tasks requiring an external input in addition to the instruction (125 Self-Instruct seeds), e.g., "Summarize the following paragraph" plus the paragraph to summarize
- Type B: Tasks whose instruction alone determines the output (50 seeds), e.g., "Write a poem about autumn"
The generation process consists of the following steps:
- Instruction Generation: For each type $t \in \{A, B\}$, few-shot prompt templates are populated with a mix of seed examples (drawn from the type's corresponding set) and previously generated synthetic examples ($24$ shots for A, $10$ for B). An LM is then called to generate a new instruction $I$, which is accepted only if $\text{Rouge-L}(I, I') < 0.7$ for every existing instruction $I'$, enforcing novelty within the growing instruction set.
- Instance Generation: Using few-shot seed examples ($18$ for A, $15$ for B), the LM is prompted to produce (i) for type A, an (input, output) pair conditioned on the new instruction, or (ii) only an output for type B.
- Output Ensembling: Each generated (instruction, input) pair (type A) or instruction (type B), together with its initial LM output, is augmented with two additional outputs from separate LMs (which may include instruction-tuned models), produced under zero-shot or few-shot prompting as appropriate. The system applies a greedy consensus algorithm: it accepts the example only if all pairwise Rouge-L scores among the three outputs meet a fixed agreement threshold, returning the output from the most mutually similar pair; otherwise, the instance is discarded.
All accepted (instruction, input, output) triples, with empty inputs for type B, populate the synthetic dataset. Iteration continues per type until the target dataset size is reached, yielding a final corpus of approximately 45k examples after ensembling. A minimal sketch of the type-A loop follows.
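A minimal Python sketch of this loop under stated assumptions: `lm_generate`, `build_instruction_prompt`, and `build_instance_prompt` are hypothetical stand-ins for calls into the ensemble LMs, while `is_novel` and `consensus_output` are the filters sketched in Sections 3–4 below; only the shot counts follow the paper.

```python
# Hypothetical sketch of the Ensemble-Instruct loop for type A tasks.
# Seeds are dicts with "instruction", "input", and "output" keys.
import random

def generate_type_a(seeds, lm_generate, ensemble_lms, threshold, target_size=30_000):
    dataset = []
    instruction_pool = [s["instruction"] for s in seeds]
    while len(dataset) < target_size:
        # Stage 1: instruction generation (20 seed + up to 4 synthetic demos).
        demos = random.sample(seeds, 20) + random.sample(dataset, min(4, len(dataset)))
        instruction = lm_generate(build_instruction_prompt(demos))
        if not is_novel(instruction, instruction_pool):  # Rouge-L novelty gate
            continue
        # Stage 2: instance generation (18 seed demos) -> (input, output);
        # lm_generate is assumed to parse the completion into fields.
        inp, out = lm_generate(build_instance_prompt(random.sample(seeds, 18), instruction, "A"))
        # Stage 3: two extra outputs from other LMs, then greedy consensus.
        candidates = [out] + [lm(instruction, inp) for lm in ensemble_lms]
        final = consensus_output(candidates, threshold)
        if final is not None:
            instruction_pool.append(instruction)
            dataset.append({"instruction": instruction, "input": inp, "output": final})
    return dataset
```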
2. In-Context Learning Templates and Sampling
The ICL framework in Ensemble-Instruct is designed for maximum utility with smaller LMs. Tasks are strictly categorized, and template strategies are simplified to lower the prompt complexity:
- Instruction Generation:
- Type A uses 24-shot context (20 seed + 4 synthetic); type B uses 10-shot (8 seed + 2 synthetic).
- Instance Generation:
- Type A (input-output): 18-shot (all seed); type B (output-only): 15-shot (all seed).
Prompt structures adhere to regularized sketches: for type A, the prompt elicits `instruction: <INST> input: <INPUT> output: <OUTPUT> |EoS|`; for type B, the input field is omitted.
Formally, at chain step $t$, a new instruction is sampled as $I_t \sim p_{\mathrm{LM}}(\cdot \mid C_t)$, where the context $C_t$ mixes seed instructions with up to four (type A) or two (type B) earlier synthetic instructions. Instruction novelty is enforced by rejecting $I_t$ if $\max_{t' < t} \text{Rouge-L}(I_t, I_{t'}) \ge 0.7$.
Templates and controlled shot counts are critical for success at smaller model scales, mitigating prompt complexity and reducing ICL failure rates. A template-construction sketch follows.
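A self-contained sketch of the regularized templates, assuming literal field markers; the paper reports the layout (`instruction`/`input`/`output` plus an end marker) rather than exact strings, so the separators here are illustrative.

```python
# Illustrative prompt builders for the simplified type A / type B templates.
def format_example(example: dict, task_type: str) -> str:
    parts = [f"instruction: {example['instruction']}"]
    if task_type == "A":                      # type B omits the input field
        parts.append(f"input: {example['input']}")
    parts.append(f"output: {example['output']} |EoS|")
    return " ".join(parts)

def build_instance_prompt(demos: list[dict], instruction: str, task_type: str) -> str:
    shots = "\n".join(format_example(d, task_type) for d in demos)  # 18 for A, 15 for B
    stub = f"instruction: {instruction} " + ("input:" if task_type == "A" else "output:")
    return f"{shots}\n{stub}"                 # the LM completes the stub
```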
3. Ensembling and Output Selection
The output ensembling mechanism is central to quality control in Ensemble-Instruct. For each generated (input, output) pair or output, two additional outputs are sampled from separate LMs (allowing for instruction-tuned variants), yielding three candidates $o_1, o_2, o_3$. The consensus algorithm (a greedy approximation of Minimum Bayes Risk decoding), sketched after this list:
- computes all pairwise Rouge-L scores $\text{Rouge-L}(o_i, o_j)$ for $i < j$;
- if every pairwise score meets the agreement threshold, selects the output from the most mutually similar pair;
- otherwise, discards the example.
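A minimal implementation of this filter using the `rouge-score` package; the agreement threshold is left as a parameter, since the paper fixes a single constant that is not reproduced here.

```python
# Greedy consensus over candidate outputs, as described above.
# Requires: pip install rouge-score
from itertools import combinations
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def consensus_output(outputs: list[str], threshold: float) -> str | None:
    """Return an output from the most mutually similar pair, or None
    (discard) when any pair falls below the agreement threshold."""
    scores = {
        (i, j): _scorer.score(outputs[i], outputs[j])["rougeL"].fmeasure
        for i, j in combinations(range(len(outputs)), 2)
    }
    if min(scores.values()) < threshold:
        return None                          # low consensus: discard
    best_pair = max(scores, key=scores.get)  # most mutually similar pair
    return outputs[best_pair[0]]             # a member of the closest pair
```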
This step filters out spurious or low-consensus generations, increasing both the mean quality and the diversity of the resulting dataset. Empirically, it significantly improves Rouge-L scores over single-model pipelines and is robust across base and instruction-tuned LMs.
4. Seed Task Utilization and Novelty Enforcement
The procedure leverages the original 175 Self-Instruct tasks, partitioning them into type A (125 with inputs) and type B (50 output-only). Few-shot prompt construction always balances seed and synthetic examples for instruction generation and uses only seeds for instance generation, with periodic replacement of seed examples by synthetic ones across generation steps to enhance diversity.
Novelty among instructions is strictly maintained by applying a Rouge-L threshold ($< 0.7$) against all prior instructions, sharply limiting paraphrasing and redundancy. The rejection process ensures that the expanding instruction pool remains diverse and semantically non-overlapping; a few-line sketch follows.
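The novelty gate reduces to a max-overlap check; the 0.7 cut-off here follows the Self-Instruct convention that the pipeline inherits.

```python
# Novelty filter over the growing instruction pool (Self-Instruct-style
# Rouge-L cut-off).
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def is_novel(instruction: str, pool: list[str], threshold: float = 0.7) -> bool:
    """Accept an instruction only if its Rouge-L overlap with every
    prior instruction stays below the threshold."""
    return all(
        _scorer.score(prev, instruction)["rougeL"].fmeasure < threshold
        for prev in pool
    )
```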
5. Model Selection and Implementation Details
Ensemble-Instruct employs a heterogeneous mixture of open-access LMs, all under the permissive Apache-2.0 license:
- Instruction and instance generation models: falcon-40B (decoder-only, 40B), ul2-20B (seq2seq, 20B), gpt-neoxt-chat-20B (decoder-only, 20B; OIG-tuned), flan-ul2-20B (seq2seq, instructed), and flan-t5-xxl (seq2seq, instructed, 11B).
- Downstream fine-tuning models: pythia-1.4B (vanilla), mpt-7B (decoder-only), and gpt-jt-6B (instruction-tuned).
Sampling for all generation phases uses greedy decoding with default settings (no sampling-temperature or top-$k$/top-$p$ variation). Fine-tuning uses either QLoRA on a single A100 (40 GB) for 5–7 epochs, or full fine-tuning on two A100s (2 × 80 GB); a hedged QLoRA sketch follows.
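A sketch of the QLoRA setup using transformers, peft, and bitsandbytes; the base model (pythia-1.4B, one of the models listed above) and the LoRA hyperparameters are illustrative assumptions, not the paper's values.

```python
# Illustrative QLoRA configuration (assumed hyperparameters).
# Requires: pip install transformers peft bitsandbytes accelerate
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # 4-bit base weights (QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-1.4b",                # one of the fine-tuned LMs above
    quantization_config=bnb_config,
    device_map="auto",                       # fits easily on one 40 GB A100
)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # assumed, not from the paper
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()           # only adapter weights train
```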
The pipeline generates approximately 30–45k samples in total; depending on configuration, the output-ensemble discard step reduces the final dataset to roughly 18–27k examples (see the dataset sizes in Section 6).
6. Empirical Evaluation and Analysis
Empirical validation confirms that Ensemble-Instruct delivers superior or competitive instruction-tuning data compared to both Self-Instruct and Alpaca, especially at reduced data and parameter budgets.
Synthetic Dataset Performance (Rouge-L, Super-NI)
| Dataset | Samples | Rouge-L |
|---|---|---|
| zero-shot | 0 | 9.8 |
| alpaca | 51,760 | 33.4 |
| self-inst (82K) | 82,612 | 34.4 |
| m-self-inst (24K) | 24,984 | 28.5 |
| so-{ul2,neox} (25K) | 25,660 | 33.6 |
| eo-{ul2,neox}-ilm | 18,218 | 38.3 |
| so-falcon (30K) | 30,537 | 34.4 |
| eo-falcon-ilm | 26,701 | 37.1 |
- Using categorized and simplified templates (so-{ul2,neox}) yields +5.1 Rouge-L over Self-Instruct run with the same smaller models (m-self-inst, 28.5 → 33.6). Output ensembling (eo-) adds a further +4.7 points. Ensemble-Instruct with ~30k samples matches or exceeds the performance of Self-Instruct's 82k examples.
Human Data Quality
Manual assessment on eo-{ul2,neox}-ilm (140 samples):
| Type | Good | Bad | Maybe | Total |
|---|---|---|---|---|
| output-only | 77 | 14 | 9 | 100 |
| input-output | 22 | 15 | 3 | 40 |
| overall | 99 | 29 | 12 | 140 |
Output-only tasks achieved 77% rated good, input-output tasks 55%, indicating that instances requiring an external input are harder to synthesize; the dual-pipeline (type A/B) design is intended to ameliorate this imbalance inherent in self-instruction.
Large LM Fine-Tuning
Fine-tuning base LMs (6B–40B) on Ensemble-Instruct data increases Rouge-L scores by roughly 30–37 points over zero-shot, consistently across architectures.
| Base LM | Zero-shot | Fine-tuned (eo-combo-ilm) |
|---|---|---|
| gpt-jt-6B (6B) | 10.4 | 43.1 |
| mpt-7B (7B) | 16.6 | 46.4 |
| open-llama-13B | 11.9 | 46.7 |
| mpt-30B | 12.2 | 49.5 |
| falcon-40B | 12.7 | 49.9 |
On user-facing tasks, ablations show +5–9 points from template simplification and +2–4 from output ensembling.
7. Technical Observations, Limitations, and Future Directions
Ensemble-Instruct's effectiveness at smaller model scales is attributable to its reduced ICL burden via prompt simplification and diversity/quality improvement through output ensembling. The rejection and consensus mechanisms are crucial in maintaining instructional novelty and filtering low-agreement samples.
Limitations include inherent dependence on multiple open LMs (allowing representation biases to propagate), discard of viable low-consensus examples, absence of tuned sampling parameters (temperature, top-k/p), and reliance on automatic metrics (Rouge-L) for most evaluations; broader human assessments are lacking.
Potential avenues for future enhancement comprise integration of learned ensemble metrics (such as BERTScore or ranking LMs), expanded human evaluation, dynamic context/shot selection, and adaptation to multilingual or even smaller LMs. The pipeline’s modularity and reliance on permissively licensed models position it as a viable replacement for closed-model instruction data curation at scale (Lee et al., 2023).