Deep Ensemble Router (DER)

Updated 7 June 2026

Deep Ensemble Router (DER) is a framework that models ensemble reasoning as a Markov Decision Process to dynamically select and refine LLM expert outputs.
It employs a compact agent architecture with an encoder, policy head, and value head, optimizing expert selection via Proximal Policy Optimization.
DER achieves high output quality with significant compute savings, as shown by improved BERTScore and GSM8K accuracy compared to traditional ensembles.

The Deep Ensemble Router (DER) is a framework designed for dynamic ensemble reasoning over a pool of LLM experts. It integrates multiple LLMs in a sequential decision process, optimizing both output quality and computational efficiency by adaptively routing queries through selected experts and leveraging knowledge transfer between them. DER models the ensemble reasoning process as a Markov Decision Process (MDP), with a dedicated agent responsible for expert selection and answer refinement at each stage, trained using Proximal Policy Optimization (PPO) with explicit cost and quality awareness (Hu et al., 2024).

1. Markov Decision Process Formulation

DER represents the routing and refinement of answers as an episodic MDP $\langle\mathcal{S},\mathcal{A},\mathcal{T},\mathcal{R},\pi_\theta\rangle$. At each step, the agent decides which expert model from the set $\{\mathcal{M}_1,\dots,\mathcal{M}_N\}$ to invoke, aiming to improve the current answer using as few computational resources as possible.

MDP Components

State Space ( $\mathcal{S}$ ): At time $t$ , the state is $s_t = [ Q: x, \; A: \hat y_{t-1}]$ , encoding the original question $x$ and the best answer $\hat y_{t-1}$ so far.
Action Space ( $\mathcal{A}$ ): Discrete set $\{1,2,\dots,N\}$ , selecting the next expert to query.
Transition Function ( $\mathcal{T}$ ): On choosing action $a_t$, the agent feeds a Knowledge Transfer Prompt (KTP) to expert $\mathcal{M}_{a_t}$, producing $\hat y_t$. The next state is $s_{t+1} = [ Q: x, \; A: \hat y_t]$.
Termination: The episode ends either when an automated Terminator (a trained classifier) deems the answer satisfactory, or after reaching a pre-defined maximum step $T_{\max}$.
Reward ( $\mathcal{R}$ ): At step $t$, the reward weighs answer quality against the cost of the invoked expert:
- For the first step: the quality of the first answer, less a cost penalty for the expert used.
- For later steps: the incremental quality gain of $\hat y_t$ over $\hat y_{t-1}$, less a cost penalty for the expert used.
- Terminal adjustment: a bonus if the episode succeeds (the final answer meets the quality threshold), a penalty otherwise.
- The weights of these terms and the success threshold are fixed hyperparameters.

This formalism enables DER to optimize answer quality (measured by BERTScore) while minimizing overall compute, measured as expert parameter counts.
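The reward structure above can be illustrated with a small sketch. It assumes a quality score in [0, 1] and expert sizes in billions of parameters; the function names, `cost_weight`, `threshold`, and `bonus` are hypothetical choices made for illustration and are not values taken from the paper.

```python
def step_reward(quality, prev_quality, expert_params_b, cost_weight=0.001):
    """Per-step reward: quality (first step) or quality gain (later steps), minus cost.

    prev_quality is None on the first step. cost_weight is a hypothetical
    coefficient, not a value from the paper.
    """
    gain = quality if prev_quality is None else quality - prev_quality
    return gain - cost_weight * expert_params_b


def terminal_adjustment(final_quality, threshold=0.75, bonus=1.0):
    """Hypothetical terminal term: +bonus if the threshold is met, -bonus otherwise."""
    return bonus if final_quality >= threshold else -bonus


# Made-up two-step episode: a 7B expert answers, then a 13B expert refines.
qualities = [0.70, 0.76]
sizes_b = [7, 13]

rewards = []
prev = None
for quality, size in zip(qualities, sizes_b):
    rewards.append(step_reward(quality, prev, size))
    prev = quality
rewards[-1] += terminal_adjustment(prev)  # terminal term is added to the last step
print(rewards)
```

Under these assumptions, a larger expert is only worth invoking when the quality gain it brings exceeds its cost penalty, which is the trade-off the agent is trained to learn.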

2. DER-Agent Architecture

The DER agent consists of an encoder, policy head (actor), and value head (critic):

Input Encoding: Each state $s_t$ is serialized as $[ Q: x, \; A: \hat y_{t-1}]$, which is input to a pre-trained OPT-125M transformer encoder.
Policy Head: Two linear layers are stacked atop the encoder's last hidden state, producing logits $z \in \mathbb{R}^N$. The probability of selecting action $a$ is

$\pi_\theta(a \mid s_t) = \frac{\exp(z_a)}{\sum_{j=1}^{N} \exp(z_j)}$

Value Head: A copy of the encoder plus two linear layers yields a scalar estimate $V_\phi(s_t)$ for value prediction.
Terminator: A lightweight OPT-125M-based classifier predicts whether the current answer meets the BERTScore threshold $p_0$ for early stopping.

The total routing infrastructure is compact (approximately 125M parameters each for policy and value networks), allowing lightweight autonomous control.
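As a rough illustration of how the policy head's logits become an expert choice, the sketch below applies a softmax and then either samples or takes the most probable expert. The logits are made-up numbers standing in for the output of the two linear layers, and `select_expert` is a hypothetical helper, not part of any DER codebase.

```python
import math
import random


def softmax(logits):
    """Convert raw policy-head logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_expert(logits, rng, greedy=False):
    """Return (expert index, probabilities); sample unless greedy is set."""
    probs = softmax(logits)
    if greedy:
        best = max(range(len(probs)), key=probs.__getitem__)
        return best, probs
    sampled = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return sampled, probs


# Made-up logits for a pool of three experts.
logits = [0.2, 1.5, -0.3]
action, probs = select_expert(logits, random.Random(0), greedy=True)
print(action, [round(p, 3) for p in probs])
```

Sampling is the usual choice while collecting training rollouts, whereas a greedy pick is one plausible option at inference time; which of the two DER uses at inference is not stated here.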

3. Training Procedure and Objective

DER employs PPO for policy optimization:

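Trajectory Collection: Rollouts are sampled using the current policy, collecting states $s_t$, actions $a_t$, and rewards $r_t$.
Advantage Estimation: Advantages $\hat A_t$ are computed from the collected rewards and the critic's value estimates.
Actor Objective: PPO's clipped objective:

$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(\rho_t(\theta)\,\hat A_t,\; \operatorname{clip}\left(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat A_t\right)\right]$

with probability ratio $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ and clipping parameter $\epsilon$.

Critic Update: Minimizes the TD-error between the value estimate $V_\phi(s_t)$ and its bootstrapped target.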
Early Termination Mechanism: No formal curriculum; the episode self-terminates on easier samples, thus naturally economizing computation.

This joint PPO framework is central to optimizing the quality-cost tradeoff and supports generalization across diverse question types and ensemble expert sets.
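The clipped actor objective can be made concrete with a few lines of plain Python. This is a generic sketch of PPO's standard clipped surrogate, not DER's training code; the default `eps=0.2` is a commonly used value assumed here for illustration, and the function names are hypothetical.

```python
import math


def clipped_surrogate(new_logp, old_logp, advantage, eps=0.2):
    """PPO clipped surrogate for one (state, action) sample."""
    ratio = math.exp(new_logp - old_logp)  # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_actor_loss(new_logps, old_logps, advantages, eps=0.2):
    """Negative mean surrogate, so that minimizing the loss maximizes the objective."""
    terms = [
        clipped_surrogate(n, o, a, eps)
        for n, o, a in zip(new_logps, old_logps, advantages)
    ]
    return -sum(terms) / len(terms)


# A ratio of 1.5 with a positive advantage is capped at 1 + eps.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))
# The same ratio with a negative advantage is not capped.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=-1.0))
```

The asymmetry in the two printed cases is the point of the clipping: the policy gains nothing from pushing an already favored expert's probability far beyond the old policy, but it is still fully penalized for moving toward a choice with negative advantage.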

4. Knowledge Transfer Prompt (KTP)

The Knowledge Transfer Prompt is a template mechanism provided to each expert at every step to ensure the newly invoked LLM leverages prior answer information constructively. The template places the original question alongside the previous answer and instructs the expert to produce an improved answer.

This prompt construction casts the expert as a student asked to improve upon a previous answer, integrating rather than merely restating the prior solution. The mechanism fosters cumulative answer refinement and improves downstream quality metrics: in ablation studies, removing the KTP lowers BERTScore from 75.0 to 74.3 (Hu et al., 2024).
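A prompt of this kind might be assembled as in the sketch below. The wording of the template is invented for illustration and is not the paper's actual prompt text; `build_ktp` and its handling of the first step (no previous answer) are likewise assumptions.

```python
def build_ktp(question, previous_answer=None):
    """Assemble a knowledge-transfer-style prompt (hypothetical wording)."""
    if not previous_answer:
        # Assumed first-step behaviour: nothing to transfer yet, ask directly.
        return f"Question: {question}\nAnswer:"
    return (
        "You are a student asked to improve on another student's answer.\n"
        f"Question: {question}\n"
        f"Previous answer: {previous_answer}\n"
        "Write an improved answer. Build on the previous answer "
        "rather than restating it.\n"
        "Improved answer:"
    )


first_prompt = build_ktp("What is 17 * 3?")
refine_prompt = build_ktp("What is 17 * 3?", previous_answer="17 * 3 = 41")
print(refine_prompt)
```

Keeping the question and the previous answer in clearly labelled fields mirrors the state layout $[Q: x, A: \hat y_{t-1}]$, so the same two pieces of information drive both the router and the expert.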

5. Inference Procedure and Hardware Efficiency

During inference, DER operates as follows:

For up to $T_{\max}$ steps, the current state $s_t$ is encoded;
The policy selects the next expert to invoke;
The KTP is constructed and presented to the chosen LLM;
If the Terminator signals satisfactory quality, the episode halts early; otherwise, the sequence continues.

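Pseudocode:

    answer <- empty
    for t = 0 to T_max - 1:
        state  <- [Q: x, A: answer]
        a      <- action chosen by the policy for state
        prompt <- KTP(x, answer)
        answer <- expert M_a applied to prompt
        if Terminator judges answer satisfactory: stop
    return answer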

Empirical results show that DER uses only about 15–20B parameters of expert inference per sample on average, a substantial reduction compared to 117B for PairRanker and 234B for full debate-based ensembles. Over half of all episodes conclude within two steps, which further increases the hardware savings.

6. Experimental Performance and Comparative Analysis

DER improves both efficiency and output quality on standard benchmarks:

Method	Average Inference Cost (B Params)	BERTScore	GSM8K Accuracy
Vicuna-13B	13	≈69.6	33.74%
PairRanker	117	≈73.0	–
Full Debate Ensemble	234	–	–
DER	17 (MixInstruct), 26 (GSM8K)	≈75.0	34.98%

DER's BERTScore of ≈75.0 on the MixInstruct test set exceeds that of the best single open-source LLM listed (≈69.6) and of PairRanker (≈73.0), at about one seventh of PairRanker's parameter cost (17B versus 117B). On GSM8K, DER improves on the Vicuna-13B baseline by about 1.2 percentage points in accuracy (34.98% versus 33.74%) at an average cost of 26B parameters, between a quarter and a fifth of the 117B listed for PairRanker.
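The cost comparisons can be checked directly from the table. The short calculation below uses only the figures listed there; treating the baseline costs as comparable across MixInstruct and GSM8K is an assumption made for this illustration.

```python
# Average inference cost in billions of parameters, taken from the table above.
baseline_cost_b = {"Vicuna-13B": 13, "PairRanker": 117, "Full Debate Ensemble": 234}
der_cost_b = {"MixInstruct": 17, "GSM8K": 26}


def cost_ratio(baseline_b, der_b):
    """How many times more parameters the baseline uses than DER."""
    return baseline_b / der_b


for name, cost in baseline_cost_b.items():
    ratio = cost_ratio(cost, der_cost_b["MixInstruct"])
    print(f"{name}: {ratio:.1f}x DER's MixInstruct cost")

# Accuracy gain on GSM8K over the Vicuna-13B baseline, in percentage points.
gsm8k_gain_pp = 34.98 - 33.74
print(f"GSM8K gain over Vicuna-13B: {gsm8k_gain_pp:.2f} percentage points")
```

Note that DER costs more than a single 13B model (a ratio below 1 in the first line of output); the savings are relative to the ensemble baselines.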

Ablation findings establish the significance of both the KTP and the incremental/terminal rewards to DER's effectiveness; without these components, BERTScore drops by up to 3% (Hu et al., 2024).

7. Significance and Context within Ensemble Reasoning

DER reframes LLM ensemble reasoning as a computationally-aware, sequential improvement task, moving beyond static voting or pairwise ranking approaches. It demonstrates that a lightweight router can dynamically leverage complementary expertise and knowledge, achieving higher answer quality under stringent computational budgets. The framework's efficient policy-and-critic design, sequential reasoning via MDP, and knowledge transfer mechanism collectively render it effective for a range of multi-expert LLM scenarios.

A plausible implication is that sequential refinement ensembles with explicit cost-quality tradeoffs, as pioneered by DER, are likely to set new standards for both data efficiency and model utilization in multi-agent language reasoning architectures (Hu et al., 2024).


References

Hu et al. (2024). Dynamic Ensemble Reasoning for LLM Experts.
