Ensemble-Instruct: Scalable Instruction Tuning
- Ensemble-Instruct is an instruction-tuning pipeline that generates diverse training data using a three-stage process: instruction generation, instance generation, and output ensembling.
- It leverages a heterogeneous ensemble of open-source language models (10–40B parameters) to produce high-quality instruction sets with enforced novelty and consensus.
- Key innovations include simplified in-context learning templates and a greedy consensus algorithm, which together significantly boost performance metrics such as Rouge-L.
Ensemble-Instruct is an instruction-tuning data generation pipeline that synthesizes high-quality training data for large LMs by leveraging a heterogeneous ensemble of open-source LMs in the 10–40B parameter range. It addresses inherent limitations of earlier self-instruction frameworks, such as Self-Instruct and Alpaca, which rely on very large, often closed 175B-parameter models, by devising new in-context learning (ICL) techniques and output ensembling to produce performant instruction-tuning corpora from smaller, open-access models (Lee et al., 2023).
1. Algorithmic Pipeline
Ensemble-Instruct operates by decomposing the task into three chained stages—Instruction Generation, Instance Generation, and Output Ensembling—applied independently to two disjoint classes of tasks:
- Type A: Tasks requiring an external input in addition to the instruction (125 Self-Instruct seeds), e.g., "Summarize the following paragraph" plus the paragraph to summarize
- Type B: Tasks whose instruction alone determines the output (50 seeds), e.g., "Write a poem about autumn"
The generation process consists of the following steps:
- Instruction Generation: For each type $t \in \{A, B\}$, few-shot prompt templates are populated with a mix of seed examples (drawn from the type's corresponding set) and previously generated synthetic examples ($24$ shots for A, $10$ for B). An LM is then called to generate a new instruction $I$, which is accepted only if $\text{Rouge-L}(I, I') < 0.7$ for every existing instruction $I'$, enforcing novelty within the growing instruction set.
- Instance Generation: Using few-shot seed examples ($18$ for A, $15$ for B), the LM is prompted to produce (i) for type A, an (input, output) pair conditioned on the new instruction, or (ii) only an output for type B.
- Output Ensembling: Each generated (instruction, input) pair (type A) or instruction (type B), together with its initial LM output, is augmented with two additional outputs from separate LMs (which may include instruction-tuned models), produced under zero-shot or few-shot prompting as appropriate. The system applies a greedy consensus algorithm: it accepts the example only if all pairwise Rouge-L scores among the three outputs meet a fixed agreement threshold, returning the output from the most mutually similar pair; otherwise, the instance is discarded.
All accepted (instruction, input, output) triples, with empty inputs for type B, populate the synthetic dataset. Iteration continues per type until the target dataset size is reached, yielding a final corpus of approximately 45k examples after ensembling. A minimal sketch of the type-A loop follows.
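A minimal Python sketch of this loop under stated assumptions: `lm_generate`, `build_instruction_prompt`, and `build_instance_prompt` are hypothetical stand-ins for calls into the ensemble LMs, while `is_novel` and `consensus_output` are the filters sketched in Sections 3–4 below; only the shot counts follow the paper.

```python
# Hypothetical sketch of the Ensemble-Instruct loop for type A tasks.
# Seeds are dicts with "instruction", "input", and "output" keys.
import random

def generate_type_a(seeds, lm_generate, ensemble_lms, threshold, target_size=30_000):
    dataset = []
    instruction_pool = [s["instruction"] for s in seeds]
    while len(dataset) < target_size:
        # Stage 1: instruction generation (20 seed + up to 4 synthetic demos).
        demos = random.sample(seeds, 20) + random.sample(dataset, min(4, len(dataset)))
        instruction = lm_generate(build_instruction_prompt(demos))
        if not is_novel(instruction, instruction_pool):  # Rouge-L novelty gate
            continue
        # Stage 2: instance generation (18 seed demos) -> (input, output);
        # lm_generate is assumed to parse the completion into fields.
        inp, out = lm_generate(build_instance_prompt(random.sample(seeds, 18), instruction, "A"))
        # Stage 3: two extra outputs from other LMs, then greedy consensus.
        candidates = [out] + [lm(instruction, inp) for lm in ensemble_lms]
        final = consensus_output(candidates, threshold)
        if final is not None:
            instruction_pool.append(instruction)
            dataset.append({"instruction": instruction, "input": inp, "output": final})
    return dataset
```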
2. In-Context Learning Templates and Sampling
The ICL framework in Ensemble-Instruct is designed for maximum utility with smaller LMs. Tasks are strictly categorized, and template strategies are simplified to lower the prompt complexity:
- Instruction Generation:
- Type A uses 24-shot context (20 seed + 4 synthetic); type B uses 10-shot (8 seed + 2 synthetic).
- Instance Generation:
- Type A (input-output): 18-shot (all seed); type B (output-only): 15-shot (all seed).
Prompt structures adhere to regularized sketches: for type A, the prompt elicits `instruction: <INST> input: <INPUT> output: <OUTPUT> |EoS|`; for type B, the input field is omitted.
Formally, at chain step $t$, a new instruction is sampled as $I_t \sim p_{\mathrm{LM}}(\cdot \mid C_t)$, where the context $C_t$ mixes seed instructions with up to four (type A) or two (type B) earlier synthetic instructions. Instruction novelty is enforced by rejecting $I_t$ if $\max_{t' < t} \text{Rouge-L}(I_t, I_{t'}) \ge 0.7$.
Templates and controlled shot counts are critical for success at smaller model scales, mitigating prompt complexity and reducing ICL failure rates. A template-construction sketch follows.
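A self-contained sketch of the regularized templates, assuming literal field markers; the paper reports the layout (`instruction`/`input`/`output` plus an end marker) rather than exact strings, so the separators here are illustrative.

```python
# Illustrative prompt builders for the simplified type A / type B templates.
def format_example(example: dict, task_type: str) -> str:
    parts = [f"instruction: {example['instruction']}"]
    if task_type == "A":                      # type B omits the input field
        parts.append(f"input: {example['input']}")
    parts.append(f"output: {example['output']} |EoS|")
    return " ".join(parts)

def build_instance_prompt(demos: list[dict], instruction: str, task_type: str) -> str:
    shots = "\n".join(format_example(d, task_type) for d in demos)  # 18 for A, 15 for B
    stub = f"instruction: {instruction} " + ("input:" if task_type == "A" else "output:")
    return f"{shots}\n{stub}"                 # the LM completes the stub
```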
3. Ensembling and Output Selection
The output ensembling mechanism is central to quality control in Ensemble-Instruct. For each generated (input, output) pair or output, two additional outputs are sampled from separate LMs (allowing for instruction-tuned variants), yielding three candidates $o_1, o_2, o_3$. The consensus algorithm (a greedy approximation of Minimum Bayes Risk decoding), sketched after this list:
- computes all pairwise Rouge-L scores $\text{Rouge-L}(o_i, o_j)$ for $i < j$;
- if every pairwise score meets the agreement threshold, selects the output from the most mutually similar pair;
- otherwise, discards the example.
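A minimal implementation of this filter using the `rouge-score` package; the agreement threshold is left as a parameter, since the paper fixes a single constant that is not reproduced here.

```python
# Greedy consensus over candidate outputs, as described above.
# Requires: pip install rouge-score
from itertools import combinations
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def consensus_output(outputs: list[str], threshold: float) -> str | None:
    """Return an output from the most mutually similar pair, or None
    (discard) when any pair falls below the agreement threshold."""
    scores = {
        (i, j): _scorer.score(outputs[i], outputs[j])["rougeL"].fmeasure
        for i, j in combinations(range(len(outputs)), 2)
    }
    if min(scores.values()) < threshold:
        return None                          # low consensus: discard
    best_pair = max(scores, key=scores.get)  # most mutually similar pair
    return outputs[best_pair[0]]             # a member of the closest pair
```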
This step filters out spurious or low-consensus generations, increasing both the mean quality and the diversity of the resulting dataset. Empirically, it significantly improves Rouge-L scores over single-model pipelines and is robust across base and instruction-tuned LMs.
4. Seed Task Utilization and Novelty Enforcement
The procedure leverages the original 175 Self-Instruct tasks, partitioning them into type A (125 with inputs) and type B (50 output-only). Few-shot prompt construction always balances seed and synthetic examples for instruction generation and uses only seeds for instance generation, with periodic replacement of seed examples by synthetic ones across generation steps to enhance diversity.
Novelty among instructions is strictly maintained by applying a Rouge-L threshold ($< 0.7$) against all prior instructions, sharply limiting paraphrasing and redundancy. The rejection process ensures that the expanding instruction pool remains diverse and semantically non-overlapping; a few-line sketch follows.
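The novelty gate reduces to a max-overlap check; the 0.7 cut-off here follows the Self-Instruct convention that the pipeline inherits.

```python
# Novelty filter over the growing instruction pool (Self-Instruct-style
# Rouge-L cut-off).
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def is_novel(instruction: str, pool: list[str], threshold: float = 0.7) -> bool:
    """Accept an instruction only if its Rouge-L overlap with every
    prior instruction stays below the threshold."""
    return all(
        _scorer.score(prev, instruction)["rougeL"].fmeasure < threshold
        for prev in pool
    )
```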
5. Model Selection and Implementation Details
Ensemble-Instruct employs a heterogeneous mixture of open-access LMs, all under the permissive Apache-2.0 license:
- Instruction and instance generation models: falcon-40B (decoder-only, 40B), ul2-20B (seq2seq, 20B), gpt-neoxt-chat-20B (decoder-only, 20B; OIG-tuned), flan-ul2-20B (seq2seq, instructed), and flan-t5-xxl (seq2seq, instructed, 11B).
- Downstream fine-tuning models: pythia-1.4B (vanilla), mpt-7B (decoder-only), and gpt-jt-6B (instruction-tuned).
Sampling for all generation phases uses greedy decoding with default settings (no sampling-temperature or top-$k$/top-$p$ variation). Fine-tuning uses either QLoRA on a single A100 (40 GB) for 5–7 epochs, or full fine-tuning on two A100s (2 × 80 GB); a hedged QLoRA sketch follows.
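A sketch of the QLoRA setup using transformers, peft, and bitsandbytes; the base model (pythia-1.4B, one of the models listed above) and the LoRA hyperparameters are illustrative assumptions, not the paper's values.

```python
# Illustrative QLoRA configuration (assumed hyperparameters).
# Requires: pip install transformers peft bitsandbytes accelerate
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # 4-bit base weights (QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-1.4b",                # one of the fine-tuned LMs above
    quantization_config=bnb_config,
    device_map="auto",                       # fits easily on one 40 GB A100
)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # assumed, not from the paper
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()           # only adapter weights train
```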
The pipeline generates approximately 30–45k samples in total; depending on configuration, the output-ensemble discard step reduces the final dataset to roughly 18–27k examples (see the dataset sizes in Section 6).
6. Empirical Evaluation and Analysis
Empirical validation confirms that Ensemble-Instruct delivers superior or competitive instruction-tuning data compared to both Self-Instruct and Alpaca, especially at reduced data and parameter budgets.
Synthetic Dataset Performance (Rouge-L, Super-NI)
| Dataset | Samples | Rouge-L |
|---|---|---|
| zero-shot | 0 | 9.8 |
| alpaca | 51,760 | 33.4 |
| self-inst (82K) | 82,612 | 34.4 |
| m-self-inst (24K) | 24,984 | 28.5 |
| so-{ul2,neox} (25K) | 25,660 | 33.6 |
| eo-{ul2,neox}-ilm | 18,218 | 38.3 |
| so-falcon (30K) | 30,537 | 34.4 |
| eo-falcon-ilm | 26,701 | 37.1 |
- Using categorized and simplified templates (so-{ul2,neox}) yields +5.1 Rouge-L over Self-Instruct run with the same smaller models (m-self-inst, 28.5 → 33.6). Output ensembling (eo-) adds a further +4.7 points. Ensemble-Instruct with ~30k samples matches or exceeds the performance of Self-Instruct's 82k examples.
Human Data Quality
Manual assessment on eo-{ul2,neox}-ilm (140 samples):
| Type | Good | Bad | Maybe | Total |
|---|---|---|---|---|
| output-only | 77 | 14 | 9 | 100 |
| input-output | 22 | 15 | 3 | 40 |
| overall | 99 | 29 | 12 | 140 |
Output-only tasks achieved 77% rated good, input-output tasks 55%, indicating that instances requiring an external input are harder to synthesize; the dual-pipeline (type A/B) design is intended to ameliorate this imbalance inherent in self-instruction.
Large LM Fine-Tuning
Fine-tuning base LMs (6B–40B) on Ensemble-Instruct data increases Rouge-L scores by roughly 30–37 points over zero-shot, consistently across architectures.
| Base LM | Zero-shot | Fine-tuned (eo-combo-ilm) |
|---|---|---|
| gpt-jt-6B (6B) | 10.4 | 43.1 |
| mpt-7B (7B) | 16.6 | 46.4 |
| open-llama-13B | 11.9 | 46.7 |
| mpt-30B | 12.2 | 49.5 |
| falcon-40B | 12.7 | 49.9 |
On user-facing tasks, ablations show +5–9 points from template simplification and +2–4 from output ensembling.
7. Technical Observations, Limitations, and Future Directions
Ensemble-Instruct's effectiveness at smaller model scales is attributable to its reduced ICL burden via prompt simplification and diversity/quality improvement through output ensembling. The rejection and consensus mechanisms are crucial in maintaining instructional novelty and filtering low-agreement samples.
Limitations include inherent dependence on multiple open LMs (allowing representation biases to propagate), discard of viable low-consensus examples, absence of tuned sampling parameters (temperature, top-k/p), and reliance on automatic metrics (Rouge-L) for most evaluations; broader human assessments are lacking.
Potential avenues for future enhancement comprise integration of learned ensemble metrics (such as BERTScore or ranking LMs), expanded human evaluation, dynamic context/shot selection, and adaptation to multilingual or even smaller LMs. The pipeline’s modularity and reliance on permissively licensed models position it as a viable replacement for closed-model instruction data curation at scale (Lee et al., 2023).