VL-DAC: Vision-Language Decoupled Actor-Critic
- VL-DAC is a reinforcement learning paradigm that decouples policy and value updates to enhance stability and generalization in vision-language models.
- It pairs a tokenwise PPO objective for the actor with a stepwise critic loss, allowing each head's representation to specialize and reducing overfitting in high-dimensional environments.
- Empirical results demonstrate significant gains in agentic control and spatial planning, validating the decoupled design’s effectiveness.
Vision-Language Decoupled Actor-Critic (VL-DAC) is a reinforcement learning (RL) algorithmic paradigm designed to equip vision-language models (VLMs) with robust agentic and reasoning capabilities via a principled decoupling of policy (actor) and value (critic) updates. This approach explicitly separates the optimization of action choices (language-conditioned, visually grounded actions) from value estimation, facilitating more stable learning, enhanced generalization, and improved transfer of skills from synthetic environments to real-world interactive tasks (Bredis et al., 6 Aug 2025, Garcin et al., 8 Mar 2025, Raileanu et al., 2021).
1. Conceptual Foundation and Distinguishing Features
VL-DAC originates from the insight that the actor and critic serve fundamentally different inferential goals. The actor requires representations that are immediately conducive to optimal action selection, while the critic must encode sufficient state and dynamics information to accurately estimate long-range value. In the standard RL regime, shared representations between these heads often lead to overfitting (especially in high-dimensional, visually-rich environments) and hamper generalization (Garcin et al., 8 Mar 2025, Raileanu et al., 2021). VL-DAC addresses these issues by:
- Decoupling policy (actor) and value (critic) computation both at the optimization and architectural level.
- Applying actor updates at the token (i.e., text action) level using Proximal Policy Optimization (PPO).
- Restricting value updates to a single environment-step granularity, with a stop-gradient between actor and critic modules.
- Removing hyperparameters related to mixing or weighting of thought/action token rewards, yielding a hyperparameter-free RL algorithm (Bredis et al., 6 Aug 2025).
- Enabling specialization of actor and critic representations, improving both sample efficiency and out-of-domain generalization (Garcin et al., 8 Mar 2025).
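To make the architectural side of this decoupling concrete, the following minimal PyTorch-style sketch shows separate actor and critic heads on one shared backbone, with a stop-gradient in front of the critic. The backbone here is a placeholder linear layer standing in for a pretrained VLM; class and attribute names are hypothetical and not taken from the VL-DAC codebase.

```python
import torch
import torch.nn as nn

class DecoupledActorCriticVLM(nn.Module):
    """Illustrative decoupled actor/critic heads on a shared backbone (sketch)."""

    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        # Placeholder backbone; in practice this is a pretrained vision-language model.
        self.backbone = nn.Linear(hidden_dim, hidden_dim)
        self.actor_head = nn.Linear(hidden_dim, vocab_size)  # per-token action logits
        self.critic_head = nn.Linear(hidden_dim, 1)          # one scalar value per step

    def forward(self, features: torch.Tensor):
        h = self.backbone(features)            # (batch, seq, hidden)
        logits = self.actor_head(h)            # actor gradients reach the backbone
        # Stop-gradient: the critic head is trained, but its loss never
        # back-propagates into the shared backbone representation.
        value = self.critic_head(h[:, -1].detach()).squeeze(-1)
        return logits, value
```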
2. Algorithmic Structure and Mathematical Formulation
VL-DAC implements the following core algorithmic design (Bredis et al., 6 Aug 2025):
- Tokenwise PPO Actor Loss: For each environment step $t$, the policy loss is computed per output token $i$:
$$\mathcal{L}_{\text{actor}} = -\,\mathbb{E}_t\!\left[\frac{1}{|a_t|}\sum_{i=1}^{|a_t|}\min\!\Big(\rho_{t,i}\,\hat{A}_t,\ \operatorname{clip}\big(\rho_{t,i},\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right],$$
where $\rho_{t,i}=\pi_\theta(a_{t,i}\mid s_t,a_{t,<i})/\pi_{\theta_{\text{old}}}(a_{t,i}\mid s_t,a_{t,<i})$ is the importance sampling ratio for token $i$ and $\hat{A}_t$ is a single step-level advantage estimate.
- Stepwise Critic Loss: The value head predicts $V_\phi(s_t)$ at the environment step:
$$\mathcal{L}_{\text{critic}} = \big(V_\phi(s_t)-\hat{R}_t\big)^2,$$
with $\hat{R}_t$ the step-level return target and $\hat{A}_t=\hat{R}_t-V_\phi(s_t)$. Critic gradients do not propagate into the backbone, enforced via explicit stop-gradient.
- Full Objective:
$$\mathcal{L} = \mathcal{L}_{\text{actor}} + c_V\,\mathcal{L}_{\text{critic}} + c_{\text{KL}}\,\mathcal{L}_{\text{KL}},$$
where $\mathcal{L}_{\text{KL}}$ regularizes the policy relative to a reference (e.g., pre-trained) distribution and $c_V$, $c_{\text{KL}}$ are fixed scalars.
Distinctive implementation aspects include not updating the value head during the initial training epochs (value warm-up) and aligning the update periods of the PPO actor and the stepwise critic, ensuring stable, low-variance learning.
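A minimal sketch of the combined objective follows, assuming per-token log-probabilities under the current, old, and reference policies have already been gathered for each environment step; the clip range and the coefficients `c_v` and `c_kl` are illustrative placeholders rather than the paper's reported settings.

```python
import torch
import torch.nn.functional as F

def vl_dac_loss(logp, logp_old, logp_ref, token_mask, advantage,
                value, value_target, clip_eps=0.2, c_v=0.5, c_kl=0.01):
    """Illustrative VL-DAC objective: tokenwise clipped PPO scaled by one
    step-level advantage, a stepwise critic MSE, and a KL penalty toward a
    reference policy. Shapes: logp*, token_mask -> (batch, seq);
    advantage, value, value_target -> (batch,)."""
    # Per-token importance ratios, all scaled by the same step-level advantage.
    ratio = torch.exp(logp - logp_old)
    adv = advantage.unsqueeze(-1)                      # broadcast over tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    actor_loss = -(torch.min(unclipped, clipped) * token_mask).sum() / token_mask.sum()

    # Stepwise critic regression; its input features are detached upstream
    # (stop-gradient), so this term never updates the shared backbone.
    critic_loss = F.mse_loss(value, value_target)

    # Approximate KL penalty toward the reference (e.g., pretrained) policy.
    kl_loss = ((logp - logp_ref) * token_mask).sum() / token_mask.sum()

    return actor_loss + c_v * critic_loss + c_kl * kl_loss
```

Note that the single advantage $\hat{A}_t$ is broadcast across every token of the step's action, which is how VL-DAC avoids per-token reward mixing or thought/action weighting hyperparameters.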
3. Representational Specialization and Mutual Information
Decoupling actor and critic in VL-DAC produces highly specialized representations (Garcin et al., 8 Mar 2025):
- The actor’s representation maximizes mutual information with action-relevant cues while minimizing mutual information with non-generalizable (e.g., environment- or level-specific) details.
- The critic’s representation remains sensitive to environmental, value-predictive, and long-range state-transition features, reflected in higher mutual information between the critic’s features and both value targets and level-identifying information.
- Empirical analyses show that decoupling reduces actor overfitting, increases critic expressivity, and yields a more favorable parameter-efficiency/generality trade-off.
- The separation ensures that the critic’s value-estimation capacity can implicitly drive actor exploration, as the actor is guided towards state-space regions with high critic uncertainty or value function nonstationarity.
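One hedged way to formalize these pressures (the notation here is illustrative and not necessarily that of Garcin et al.): write $Z_\pi$ and $Z_V$ for the actor and critic representations, $A^\ast$ for action-relevant cues, $\ell$ for level- or environment-specific nuisance features, and $R$ for returns. The decoupled objectives then push toward
$$\text{actor:}\;\; \max_{Z_\pi}\ \big[\,I(Z_\pi; A^\ast) \;-\; \beta\, I(Z_\pi; \ell)\,\big], \qquad \text{critic:}\;\; \max_{Z_V}\ I(Z_V; R),$$
with the critic free to retain $I(Z_V;\ell)$ wherever level-specific information improves value prediction; $\beta>0$ is an illustrative trade-off weight, not a parameter of VL-DAC itself.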
4. Training Methodologies and Simulator Curriculum
VL-DAC’s training leverages a curriculum of inexpensive synthetic simulators, each with varying action spaces and environmental demands (Bredis et al., 6 Aug 2025):
- MiniWorld (spatial navigation), Gym-Cards/EZPoints (logic reasoning), ALFWorld (agentic household manipulation), and WebShop (web navigation/UI interaction).
- Each VLM is exposed serially to these environments, training a unified vision-language backbone with decoupled actor-critic heads.
- KL regularization controls catastrophic policy drift, while environment diversity serves as a source of rich, multitask generalization.
- This procedure obviates the need for expensive or labor-intensive human annotation and enables transfer of learned skills to harder real-world, image-centric benchmarks.
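A schematic of this serial curriculum is sketched below. The environment names follow the text; the episode counts, KL coefficient, and `make_env`/`train_stage` hooks are hypothetical placeholders rather than the paper's actual training configuration.

```python
from typing import Callable, Sequence

# Serial simulator curriculum: one shared VLM backbone visits each environment in turn.
CURRICULUM = [
    {"env": "MiniWorld",          "skill": "spatial navigation",      "episodes": 5_000},
    {"env": "Gym-Cards/EZPoints", "skill": "logic reasoning",         "episodes": 5_000},
    {"env": "ALFWorld",           "skill": "agentic household tasks", "episodes": 5_000},
    {"env": "WebShop",            "skill": "web navigation / UI use", "episodes": 5_000},
]

def run_curriculum(policy,
                   make_env: Callable[[str], object],
                   train_stage: Callable[..., None],
                   curriculum: Sequence[dict] = CURRICULUM,
                   kl_coef: float = 0.01):
    """Train one policy serially across all simulators.

    `make_env` and `train_stage` are caller-supplied hooks; the fixed KL
    coefficient regularizes the policy against catastrophic drift away from
    its pretrained reference distribution.
    """
    for stage in curriculum:
        env = make_env(stage["env"])
        train_stage(policy, env, episodes=stage["episodes"], kl_coef=kl_coef)
    return policy
```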
5. Empirical Performance and Generalization
VL-DAC demonstrates:
- +50% relative improvement on BALROG (agentic control in game-like settings).
- +5% relative improvement on the hardest part of VSI-Bench (spatial planning via language and vision).
- +2% on VisualWebBench (UI navigation and interaction tasks).
- Maintenance of general image understanding accuracy post-RL training on interactive tasks (Bredis et al., 6 Aug 2025).
- Decoupling actor and critic consistently yields superior sample efficiency and model generalization compared to shared-representation PPO variants and alternatives such as RL4VLM (which requires per-token reward mixing), LOOP (sequence-level leave-one-out), and ArCHer (off-policy with replay buffers and dense-reward prerequisites).
Table: Key comparative properties
| Method | Decoupling | Tuning Required | Critic Granularity | Generalization Gains |
|---|---|---|---|---|
| VL-DAC | Yes | None | Step-level | High |
| RL4VLM | Partial | λ (per-token reward mixing) | Sequence + token | Sensitive to tuning |
| LOOP | No | Minor | Sequence-level | Varies |
| ArCHer | No | Replay buffer / dense rewards | Sequence-level | Dense-reward settings only |
6. Connections to Broader Vision-Language Literature
While VL-DAC is formulated in the RL regime, its decoupled architecture echoes advances in supervised and self-supervised pretraining (Li et al., 2021, Jian et al., 2023), where the architectural separation of modality-specific and cross-modal functions enables more effective multi-task transfer and more stable optimization. The principle of staged or modular alignment, seen in Prompt-Transformer decoupling or two-stream encoding, further substantiates the efficacy of separating high-level policy/value signals from raw sensor-input transformation. IDAAC (Raileanu et al., 2021) and related work empirically validate that decoupling (and enforcing invariance in the policy representation) improves performance, especially for out-of-domain generalization in visually complex environments.
7. Limitations and Future Directions
VL-DAC currently excels on tasks with discrete, screen-based, or language-driven action spaces. Extending it to continuous control domains (robotics) or to cooperative/competitive multi-agent settings will require architectural modifications. Additional avenues include:
- Hierarchical RL (multi-scale control via step- and sub-goal-level critics)
- Memory-augmented transformers for extended planning
- Integration of representation learning objectives (e.g., dynamics prediction, value distillation) on critic heads
- Systematic evaluation across more diverse, real-world, and adversarial multimodal environments
These directions aim to extend the generalization and sample-efficiency gains of the decoupling paradigm, supporting the deployment of large vision-language models as robust, generalist, and interactive agents (Bredis et al., 6 Aug 2025, Garcin et al., 8 Mar 2025, Raileanu et al., 2021).