
Adaptive Token Ensemble Decoding (ATED)

Updated 28 October 2025
  • ATED is a token-level, training-free ensemble framework that aggregates outputs from LVLMs to reduce hallucination in vision-language tasks.
  • It adaptively weights per-token predictions using uncertainty estimates, enhancing both semantic consistency and factual grounding.
  • Experimental results show improved accuracy and reduced hallucination on benchmarks, illustrating ATED's effectiveness in high-stakes multimodal applications.

Adaptive Token Ensemble Decoding (ATED) is a token-level, training-free ensemble inference framework for large multimodal models, introduced to systematically mitigate hallucination in vision-language reasoning. ATED aggregates per-token predictions from multiple pre-trained Large Vision-Language Models (LVLMs), dynamically weighting each model's contribution at every decoding step according to model-specific uncertainty. This enhances factual grounding, semantic consistency, and robustness without sacrificing output fluency or relevance (Li et al., 21 Oct 2025).

1. Principles of Token-Level Adaptive Ensembling

ATED operates by running N LVLMs in parallel for each input (image v, query q, and prefix x_{<t}) during decoding. At each generation step t, every model i produces a logit-based probability distribution p_i(x_t | v, q, x_{<t}) over the shared vocabulary. These distributions are adaptively fused into an ensemble prediction:

p(x_t \mid v, q, x_{<t}) = \sum_{i=1}^{N} \lambda_i \, p_i(x_t \mid v, q, x_{<t})

with \lambda_i the importance weight for model i at step t, normalized so that \sum_i \lambda_i = 1.
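The fusion above is a convex combination of per-model token distributions. A minimal sketch in NumPy (function name and array shapes are illustrative, not from the paper):

```python
import numpy as np

def fuse_distributions(dists, weights):
    """Convex combination of per-model token distributions.

    dists:   (N, V) array, one probability row per model.
    weights: (N,) array of importance weights lambda_i.
    Returns the fused (V,) distribution p(x_t | v, q, x_{<t}).
    """
    dists = np.asarray(dists, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # enforce sum_i lambda_i = 1
    return weights @ dists             # sum_i lambda_i * p_i
```

Because the weights are normalized and each row is a probability distribution, the result is itself a valid distribution over the shared vocabulary.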

The weighting mechanism is uncertainty-driven. For each model, the entropy H_i = -\sum_x p_i(x) \log p_i(x) quantifies confidence; lower entropy indicates higher reliability. ATED formulates the ensemble as an uncertainty minimization:

\{\lambda_1^*, \ldots, \lambda_N^*\} = \operatorname{argmin}_{\lambda_1, \ldots, \lambda_N} \; -\sum p \log p

where p denotes the softmax over the logit-weighted ensemble. A greedy search, starting from the lowest-uncertainty model, incrementally adjusts the weights over a grid of candidate values, always favoring combinations that further reduce total entropy.
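The paper does not spell out the full search procedure, but the greedy, entropy-ordered grid search it describes can be sketched as follows (an illustrative reading, not the reference implementation; the grid step and acceptance rule are assumptions):

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max()               # numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p, eps=1e-12):
    p = np.clip(p, eps, 1.0)
    return float(-np.sum(p * np.log(p)))

def greedy_weights(logits, step=0.1):
    """Greedy grid search for ensemble weights lambda_i.

    logits: (N, V) per-model logits for the current token.
    Starts from the lowest-entropy (most confident) model and, for each
    remaining model in order of increasing entropy, keeps the mixing
    coefficient on the grid {step, 2*step, ...} only if it lowers the
    entropy of softmax(sum_i lambda_i * logits_i).
    """
    logits = np.asarray(logits, dtype=float)
    ents = [entropy(softmax(l)) for l in logits]
    order = np.argsort(ents)                 # most confident model first
    weights = np.zeros(logits.shape[0])
    weights[order[0]] = 1.0
    mix = logits[order[0]].copy()            # running logit-weighted ensemble
    best_h = ents[order[0]]
    for i in order[1:]:
        best_a = 0.0
        for a in np.arange(step, 1.0, step):
            h = entropy(softmax((1.0 - a) * mix + a * logits[i]))
            if h < best_h:
                best_h, best_a = h, a
        if best_a > 0.0:                     # accept only if entropy drops
            weights *= (1.0 - best_a)
            weights[i] += best_a
            mix = (1.0 - best_a) * mix + best_a * logits[i]
    return weights
```

A model that only adds uncertainty (e.g. near-uniform logits) receives zero weight, so the ensemble never degrades below its most confident member under this acceptance rule.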

2. Contrastive and Multi-Path Aggregation

To refine grounding, ATED introduces multi-path contrastive decoding using visual perturbations. For each model, an alternate input v' is produced via Gaussian noise perturbation, yielding two distributions: p_i(x_t | v, q, x_{<t}) and p_i(x_t | v', q, x_{<t}). These are combined via:

p_i(x_t \mid v, v', q, x_{<t}) = \mathrm{softmax}\left((1+\alpha)\,\mathrm{logit}_\phi(x_t \mid v, q, x_{<t}) - \alpha\,\mathrm{logit}_\phi(x_t \mid v', q, x_{<t})\right)

where \alpha controls the contrastive influence. This fusion improves resilience under uncertain visual features, reducing reliance on spurious correlations and mitigating object misidentification.
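The contrastive fusion amplifies logits that depend on the true image and penalizes those that survive under the perturbed image. A direct transcription of the formula (the function name is illustrative):

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def contrastive_distribution(logits_clean, logits_perturbed, alpha=0.5):
    """Multi-path contrastive fusion for one model at one decoding step.

    Computes softmax((1 + alpha) * logit(x_t | v, ...)
                     - alpha     * logit(x_t | v', ...)),
    where v' is the Gaussian-perturbed image.
    """
    logits_clean = np.asarray(logits_clean, dtype=float)
    logits_perturbed = np.asarray(logits_perturbed, dtype=float)
    return softmax((1.0 + alpha) * logits_clean - alpha * logits_perturbed)
```

Setting alpha = 0 recovers the plain softmax over the clean logits; larger alpha pushes probability mass away from tokens the model would also emit for the perturbed image.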

3. Implementation and Inference Pipeline

Practical deployment of ATED requires a collection of pre-trained LVLMs with a compatible vocabulary. At every token step during inference:

  1. Each LVLM outputs token probabilities and entropy estimates.
  2. The greedy optimization algorithm sorts models by entropy and updates \lambda_i to minimize ensemble entropy.
  3. When multi-path contrastive decoding is activated, each model is queried with both the original and perturbed images.
  4. The ensemble logits are weighted, summed, and the token with highest probability is selected.

This process happens serially for each token, but all model-level computations at each step are parallelizable, making ATED scalable and efficient in heterogeneous, high-throughput environments.
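The four steps above can be combined into a single decoding step. The sketch below is an illustrative composition of the pieces (the acceptance rule and grid step in the weight search are assumptions, as before); per-model logits are taken as inputs, standing in for the parallel LVLM forward passes:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p, eps=1e-12):
    p = np.clip(p, eps, 1.0)
    return float(-np.sum(p * np.log(p)))

def decode_step(clean_logits, perturbed_logits, alpha=0.5, step=0.1):
    """One ATED decoding step (illustrative sketch).

    clean_logits, perturbed_logits: (N, V) arrays, one row per LVLM,
    for the original image v and the perturbed image v'.
    Returns (token_id, fused_distribution).
    """
    clean = np.asarray(clean_logits, dtype=float)
    pert = np.asarray(perturbed_logits, dtype=float)

    # Step 3: per-model contrastive logits.
    contrast = (1.0 + alpha) * clean - alpha * pert

    # Steps 1-2: entropy-ordered greedy weighting on a coarse grid.
    ents = [entropy(softmax(l)) for l in contrast]
    order = np.argsort(ents)
    mix = contrast[order[0]].copy()
    best_h = ents[order[0]]
    for i in order[1:]:
        best_a = 0.0
        for a in np.arange(step, 1.0, step):
            h = entropy(softmax((1.0 - a) * mix + a * contrast[i]))
            if h < best_h:
                best_h, best_a = h, a
        mix = (1.0 - best_a) * mix + best_a * contrast[i]

    # Step 4: fuse and pick the highest-probability token.
    p = softmax(mix)
    return int(np.argmax(p)), p
```

The N forward passes that produce `clean_logits` and `perturbed_logits` are independent and can run in parallel; only the fusion itself is sequential per token.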

4. Experimental Evaluation and Performance

ATED was empirically validated on standard object and caption hallucination benchmarks:

| Benchmark | ATED Accuracy Gain | ATED F1-Score Gain | Baselines Compared |
|---|---|---|---|
| POPE | +4–6% | up to +7% | OPERA, VCD, ICD, SID |
| CHAIR | substantial drop in hallucination ratio | (not numerically specified) | uniform ensemble, single LVLM |
| MME | significant improvements over uniform or single models | (not numerically specified) | LVLM outputs, uniform ensemble |

Qualitative results confirm the ensemble approach suppresses extraneous or erroneous object mentions in open-ended description, while maintaining natural fluency and relevance. The incremental fusion of decoding trajectories across LVLMs ensures improved contextual grounding and semantic consistency.

5. Relationship to Prior Adaptive and Ensemble Decoding Methods

ATED advances prior ensemble methods by implementing token-level adaptive fusion, rather than uniform or statically weighted combination, and applying robust, uncertainty-driven selection at each decoding step. Unlike traditional beam search, contrastive decoding, or end-to-end training-based defenses, ATED:

  • Requires no additional training or internal model modifications.
  • Utilizes uncertainty metrics (entropy) instead of fixed confidence or static weights.
  • Integrates visual contrastive signals via image perturbations for increased robustness.
  • Enables flexible, fine-grained control over the fusion process, accommodating model-specific strengths and weaknesses.

Comparisons with related adaptive mechanisms (e.g., AED (Liu et al., 14 Aug 2024) and ATCD (Kan et al., 19 Nov 2024)) highlight that ATED aggregates cross-model outputs rather than relying on intra-model self-evaluation, and dynamically tunes contribution weights tied directly to token-level uncertainty.

6. Applications, Scalability, and Limitations

ATED is particularly suited for high-stakes multimodal tasks—image captioning, visual question answering—where hallucination reduces reliability or risks safety. Its model-agnostic, training-free design allows deployment across varied LVLM architectures, leveraging diversity of model strengths.

Potential challenges include increased inference latency with large ensemble sizes or wide decoding vocabularies, and diminishing marginal gains when component LVLMs are poorly calibrated or insufficiently diverse. The grid-search granularity (step size s) in the weighting optimization can be tuned to trade speed for accuracy, and early-stopping mechanisms may be employed to bound latency in practice.

7. Future Research Directions

ATED’s uncertainty-guided adaptive ensemble suggests several avenues for further investigation:

  • Refinement of uncertainty metrics, possibly incorporating multimodal embeddings or context features beyond entropy.
  • Acceleration strategies for ensemble inference, such as hierarchical fusion, asynchronous token-level evaluation, or selective model activation.
  • Integration with training-time alignment or verification modules to create broader robustness pipelines.
  • Extension to ensembles across modality boundaries (audio-visual, text-speech) or cross-task adaptation.

This framework constitutes an effective, adaptable approach for mitigating hallucination in multimodal generative modeling, with foundations applicable to text-only or other model domains where token-level uncertainty and ensemble diversity can be leveraged for improved factuality and reliability (Li et al., 21 Oct 2025).
