
VL-DAC: Vision-Language Decoupled Actor-Critic

Updated 8 August 2025
  • VL-DAC is a reinforcement learning paradigm that decouples policy and value updates to enhance stability and generalization in vision-language models.
  • It utilizes a tokenwise PPO for the actor and a stepwise critic loss to specialize representations, reducing overfitting in high-dimensional environments.
  • Empirical results demonstrate significant gains in agentic control and spatial planning, validating the decoupled design’s effectiveness.

Vision-Language Decoupled Actor-Critic (VL-DAC) is a reinforcement learning (RL) algorithmic paradigm designed to equip vision-language models (VLMs) with robust agentic and reasoning capabilities via a principled decoupling of policy (actor) and value (critic) updates. The approach explicitly separates the optimization of action choices (language-conditioned, visually grounded actions) from value estimation, yielding more stable learning, better generalization, and improved transfer of skills from synthetic environments to real-world interactive tasks (Bredis et al., 6 Aug 2025, Garcin et al., 8 Mar 2025, Raileanu et al., 2021).

1. Conceptual Foundation and Distinguishing Features

VL-DAC originates from the insight that the actor and critic serve fundamentally different inferential goals. The actor requires representations that are immediately conducive to optimal action selection, while the critic must encode sufficient state and dynamics information to accurately estimate long-range value. In the standard RL regime, shared representations between these heads often lead to overfitting (especially in high-dimensional, visually-rich environments) and hamper generalization (Garcin et al., 8 Mar 2025, Raileanu et al., 2021). VL-DAC addresses these issues by:

  • Decoupling policy (actor) and value (critic) computation both at the optimization and architectural level.
  • Applying actor updates at the token (i.e., text action) level using Proximal Policy Optimization (PPO).
  • Restricting value updates to a single environment-step granularity, with a stop-gradient between actor and critic modules.
  • Removing hyperparameters related to mixing or weighting of thought/action token rewards, yielding a hyperparameter-free RL algorithm (Bredis et al., 6 Aug 2025).
  • Enabling specialization of actor and critic representations, improving both sample efficiency and out-of-domain generalization (Garcin et al., 8 Mar 2025).
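
The architectural side of this decoupling can be illustrated with a minimal PyTorch-style sketch. The VLMBackbone class below is a stand-in for an actual pretrained vision-language encoder, and all module names and dimensions are illustrative rather than taken from the VL-DAC codebase; the essential points are the separate actor (token-level language head) and critic (step-level value MLP) heads, and the detach() call that implements the stop-gradient between critic and backbone.

```python
import torch
import torch.nn as nn

class VLMBackbone(nn.Module):
    """Stand-in for a pretrained vision-language backbone (illustrative only)."""
    def __init__(self, hidden_dim=768, vocab_size=32000):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.vocab_size = vocab_size
        self.proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, obs_tokens):
        # obs_tokens: (batch, seq_len, hidden_dim) fused image+text features
        return self.proj(obs_tokens)

class DecoupledActorCritic(nn.Module):
    """Actor head acts per token; critic head predicts one value per environment step."""
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone
        # Actor: token-level logits over the vocabulary (language-conditioned actions).
        self.actor_head = nn.Linear(backbone.hidden_dim, backbone.vocab_size)
        # Critic: small MLP producing a single scalar value per environment step.
        self.value_head = nn.Sequential(
            nn.Linear(backbone.hidden_dim, 256), nn.Tanh(), nn.Linear(256, 1)
        )

    def forward(self, obs_tokens):
        h = self.backbone(obs_tokens)                # (B, T, H)
        logits = self.actor_head(h)                  # gradients flow into the backbone
        # Stop-gradient: the critic reads pooled features but never updates the backbone.
        pooled = h.mean(dim=1).detach()              # (B, H)
        value = self.value_head(pooled).squeeze(-1)  # (B,)
        return logits, value
```

Because the value head only sees detached features in this sketch, critic errors cannot distort the shared visual-linguistic representation, which is the mechanism the decoupling relies on for stability.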

2. Algorithmic Structure and Mathematical Formulation

VL-DAC implements the following core algorithmic design (Bredis et al., 6 Aug 2025):

  • Tokenwise PPO Actor Loss: For each environment step $t$, the policy loss is computed per output token:

$$\mathcal{L}_{\text{policy}}(\theta) = -\mathbb{E}_t \left[ \frac{1}{|a_t|} \sum_{i=1}^{|a_t|} \min\left( r_{t,i} A_t,\ \mathrm{clip}(r_{t,i},\, 1-\epsilon,\, 1+\epsilon)\, A_t \right) \right]$$

where $r_{t,i}$ is the importance sampling ratio for token $a_t^i$ and $A_t$ is a single step-level advantage estimate.

  • Stepwise Critic Loss: The value head predicts at the environment step:

$$\mathcal{L}_{\text{value}}(\phi) = \frac{1}{2} \left( V_\phi(s_t) - \hat{R}_t \right)^2$$

with $V_\phi(s_t) = \text{MLP}_\phi(\mathcal{F}_{\text{VLM}}(s_t))$. Critic gradients do not propagate into the backbone, enforced via an explicit stop-gradient.

  • Full Objective:

$$\mathcal{L}(\theta, \phi) = \mathcal{L}_{\text{policy}}(\theta) + \beta\, \mathcal{L}_{\text{KL}}(\theta) + \alpha\, \mathcal{L}_{\text{value}}(\phi)$$

where $\mathcal{L}_{\text{KL}}$ regularizes the policy relative to a reference (e.g., pre-trained) distribution and $(\alpha, \beta)$ are fixed scalars.

Distinctive implementation aspects include not updating the value head during the initial training epochs (value warm-up) and aligning the update periods of PPO actor and stepwise critic, ensuring stable, low-variance learning.
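
Under the same notation, the full objective can be sketched as follows. The tensor shapes, the old_logprobs / ref_logprobs inputs, and the constants eps, alpha, beta are illustrative assumptions rather than the reference implementation; the structural points are the per-token PPO clipping with a single step-level advantage broadcast over the tokens of each action, the simple KL estimate against the reference policy, and the step-level value regression.

```python
import torch
import torch.nn.functional as F

def vl_dac_losses(logits, actions, old_logprobs, ref_logprobs,
                  advantages, values, returns,
                  eps=0.2, alpha=0.5, beta=0.01):
    """Sketch of the VL-DAC objective (shapes and constants are illustrative).

    logits:       (B, T, V) token logits for the sampled action text
    actions:      (B, T)    sampled action token ids
    old_logprobs: (B, T)    log-probs under the behaviour policy
    ref_logprobs: (B, T)    log-probs under the frozen reference policy
    advantages:   (B,)      one step-level advantage per environment step
    values:       (B,)      critic predictions V_phi(s_t), from detached features
    returns:      (B,)      return targets R_hat_t
    """
    logprobs = torch.distributions.Categorical(logits=logits).log_prob(actions)  # (B, T)

    # Tokenwise PPO: the single step-level advantage is broadcast across the action's tokens.
    ratio = torch.exp(logprobs - old_logprobs)                      # r_{t,i}
    adv = advantages.unsqueeze(-1)                                  # (B, 1)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    policy_loss = -torch.min(unclipped, clipped).mean(dim=-1).mean()

    # Simple sample-based KL estimate toward the reference (pre-trained) policy.
    kl_loss = (logprobs - ref_logprobs).mean()

    # Stepwise critic loss: one squared error per environment step.
    value_loss = 0.5 * F.mse_loss(values, returns)

    return policy_loss + beta * kl_loss + alpha * value_loss

# Illustrative call with random placeholder tensors.
B, T, V = 4, 6, 100
loss = vl_dac_losses(
    logits=torch.randn(B, T, V), actions=torch.randint(0, V, (B, T)),
    old_logprobs=-torch.rand(B, T), ref_logprobs=-torch.rand(B, T),
    advantages=torch.randn(B), values=torch.randn(B), returns=torch.randn(B))
```

Because the advantage is estimated once per environment step rather than per token, no thought/action reward-mixing weight is needed, which is the source of the hyperparameter-free claim above.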

3. Representational Specialization and Mutual Information

Decoupling actor and critic in VL-DAC produces highly specialized representations (Garcin et al., 8 Mar 2025):

  • The actor’s representation $\phi_A$ maximizes mutual information with action-relevant cues, $I((Z_A, Z_A'); A)$, while minimizing $I(Z_A; L)$ with respect to non-generalizable (e.g., environment- or level-specific) details.
  • The critic’s representation $\phi_C$ remains sensitive to environmental, value-predictive, and long-range state-transition features, reflected in increased $I(Z_C; V)$ and $I(Z_C; L)$.
  • Empirical analyses show that decoupling reduces actor overfitting, increases critic expressivity, and yields a more favorable parameter-efficiency/generality trade-off.
  • The separation ensures that the critic’s value-estimation capacity can implicitly drive actor exploration, as the actor is guided towards state-space regions with high critic uncertainty or value function nonstationarity.
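
One simple way to probe these claims empirically (not necessarily the protocol used in the cited analyses) is to train linear probes that predict level identity, a proxy for $L$, from frozen actor and critic features: markedly lower probe accuracy on the actor's features than on the critic's is consistent with $I(Z_A; L) < I(Z_C; L)$. A minimal sketch with scikit-learn, using random placeholder features in place of collected rollout representations:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def level_probe_accuracy(features, level_ids):
    """Train a linear probe predicting level identity from frozen features.

    features:  (N, D) representations collected from the actor or critic pathway
    level_ids: (N,)   integer id of the training level/environment each state came from
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, level_ids, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)

# Placeholder arrays standing in for features gathered during rollouts.
rng = np.random.default_rng(0)
z_actor = rng.normal(size=(2000, 64))
z_critic = rng.normal(size=(2000, 64))
levels = rng.integers(0, 10, size=2000)
print("actor probe accuracy:", level_probe_accuracy(z_actor, levels))
print("critic probe accuracy:", level_probe_accuracy(z_critic, levels))
```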

4. Training Methodologies and Simulator Curriculum

VL-DAC’s training leverages a curriculum of inexpensive synthetic simulators, each with varying action spaces and environmental demands (Bredis et al., 6 Aug 2025):

  • MiniWorld (spatial navigation), Gym-Cards/EZPoints (logic reasoning), ALFWorld (agentic household manipulation), and WebShop (web navigation/UI interaction).
  • Each VLM is exposed serially to these environments, training a unified vision-language backbone with decoupled actor-critic heads.
  • KL regularization controls catastrophic policy drift, while environment diversity serves as a source of rich, multitask generalization.
  • This procedure obviates the need for expensive or labor-intensive human annotation and enables transfer of learned skills to harder real-world, image-centric benchmarks.
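
A schematic of the serial curriculum might look like the sketch below. The environment wrapper, rollout collection, and update call are placeholders (the named simulators each have their own APIs), but the structure, cycling one shared backbone with decoupled heads through cheap synthetic environments while the KL term anchors the policy to its pretrained distribution, reflects the procedure described above.

```python
# Schematic training loop over the synthetic-simulator curriculum (all names illustrative).

CURRICULUM = ["MiniWorld", "Gym-Cards/EZPoints", "ALFWorld", "WebShop"]

class DummyTextEnv:
    """Placeholder for a text/vision simulator wrapper exposing a gym-like interface."""
    def __init__(self, name):
        self.name = name
    def reset(self):
        return {"image": None, "instruction": f"start {self.name}"}
    def step(self, action_text):
        obs = {"image": None, "instruction": "next"}
        reward, done = 0.0, True   # trivial dynamics, just to make the loop runnable
        return obs, reward, done

def collect_rollout(env, policy_fn, max_steps=8):
    """Roll the current policy in one environment and return (obs, action, reward) triples."""
    trajectory, obs, done = [], env.reset(), False
    for _ in range(max_steps):
        action = policy_fn(obs)
        obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        if done:
            break
    return trajectory

def train_curriculum(policy_fn, update_fn, epochs_per_env=2):
    # Serial exposure: the same backbone and decoupled actor-critic heads are reused throughout.
    for name in CURRICULUM:
        env = DummyTextEnv(name)
        for _ in range(epochs_per_env):
            batch = [collect_rollout(env, policy_fn) for _ in range(16)]
            update_fn(batch)   # tokenwise PPO + stepwise critic + KL update from Section 2

train_curriculum(policy_fn=lambda obs: "noop", update_fn=lambda batch: None)
```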

5. Empirical Performance and Generalization

VL-DAC demonstrates:

  • +50% relative improvement on BALROG: agentic control in game-like settings.
  • +5% relative improvement on the hardest part of VSI-Bench (spatial planning via language and vision).
  • +2% on VisualWebBench: UI navigation and interaction tasks.
  • Maintenance of general image understanding accuracy post-RL training on interactive tasks (Bredis et al., 6 Aug 2025).
  • Decoupling actor and critic consistently yields superior sample efficiency and model generalization compared to shared-representation PPO variants and alternatives such as RL4VLM (which requires per-token reward mixing), LOOP (sequence-level leave-one-out), and ArCHer (off-policy with replay buffers and dense-reward prerequisites).

Table: Key comparative properties

| Method | Decoupling | Tuning required | Critic granularity | Generalization gains |
|--------|------------|-----------------|--------------------|----------------------|
| VL-DAC | Yes | None | Step | High |
| RL4VLM | Partial | λ (reward-mixing weight) | Sequence + token | Sensitive |
| LOOP | No | Minor | Sequence | Varies |
| ArCHer | No | Replay buffer, dense rewards | Sequence | Dense-reward only |

6. Connections to Broader Vision-Language Literature

While VL-DAC is formulated in the RL regime, decoupled architectures echo advances in supervised and self-supervised pretraining (Li et al., 2021, Jian et al., 2023), where separating modality-specific and cross-modal functions enables more effective multi-task transfer and stable optimization. The principle of staged or modular alignment, as seen in Prompt-Transformer decoupling or two-stream encoding, further substantiates the efficacy of separating high-level policy/value signals from raw sensor input transformation. IDAAC (Raileanu et al., 2021) and related work empirically validate that decoupling (and enforcing invariance in the policy representation) improves performance, especially for out-of-domain generalization in visually complex environments.

7. Limitations and Future Directions

VL-DAC currently excels on tasks with discrete, screen-based, or language-driven action spaces. Extension to continuous control domains (robotics), or cooperative/competitive multi-agent settings, will require architectural modifications. Additional avenues include:

  • Hierarchical RL (multi-scale control via step- and sub-goal-level critics)
  • Memory-augmented transformers for extended planning
  • Integration of representation learning objectives (e.g., dynamics prediction, value distillation) on critic heads
  • Systematic evaluation across more diverse, real-world, and adversarial multimodal environments

These directions aim to extend the generalization and sample-efficiency gains of the decoupling paradigm, supporting the deployment of large vision-language models as robust, generalist, and interactive agents (Bredis et al., 6 Aug 2025, Garcin et al., 8 Mar 2025, Raileanu et al., 2021).
