From Pixels to Factors: Learning Independently Controllable State Variables for Reinforcement Learning

Published 2 Oct 2025 in cs.LG and cs.AI | (2510.02484v1)

Abstract: Algorithms that exploit factored Markov decision processes are far more sample-efficient than factor-agnostic methods, yet they assume a factored representation is known a priori -- a requirement that breaks down when the agent sees only high-dimensional observations. Conversely, deep reinforcement learning handles such inputs but cannot benefit from factored structure. We address this representation problem with Action-Controllable Factorization (ACF), a contrastive learning approach that uncovers independently controllable latent variables -- state components each action can influence separately. ACF leverages sparsity: actions typically affect only a subset of variables, while the rest evolve under the environment's dynamics, yielding informative data for contrastive training. ACF recovers the ground truth controllable factors directly from pixel observations on three benchmarks with known factored structure -- Taxi, FourRooms, and MiniGrid-DoorKey -- consistently outperforming baseline disentanglement algorithms.

Abstract PDF Upgrade to Chat

Authors (3)

Summary

The paper introduces ACF, which learns independently controllable latent variables from pixel data using contrastive objectives.
ACF leverages action-effect sparsity to isolate state components influenced by specific actions for efficient reinforcement learning.
Empirical tests in Taxi and Minigrid domains show ACF outperforms baselines by achieving higher R² metrics in disentanglement.

From Pixels to Factors: Learning Independently Controllable State Variables for Reinforcement Learning

This essay provides an in-depth analysis and authoritative summary of the paper "From Pixels to Factors: Learning Independently Controllable State Variables for Reinforcement Learning," which addresses a significant gap in reinforcement learning (RL) by focusing on learning factored representations from high-dimensional pixel data without predefined state-variable factors.

Introduction

Classical approaches in reinforcement learning (RL) have highlighted the benefits of factored Markov decision processes (FMDPs) for achieving sample efficiency. However, they require predefined state-variable factors, which is impractical in environments where agents only have high-dimensional sensory inputs. This paper introduces Action-Controllable Factorization (ACF), a contrastive learning-based method designed to discover independently controllable latent variables directly from pixels without the need for manual factorization.

Technical Approach

ACF utilizes contrastive learning to isolate state components influenced by actions. The crux of the methodology lies in leveraging the sparsity of action effects: typically, an action affects only a subset of variables, with others following the environment's natural dynamics. A contrastive objective compares predictions of next-state distributions under specific agent actions against natural transitions. The paper introduces a specific energy-based parameterization for learning state transitions, which allows for efficient learning of independently controllable factors.

Mathematical Formulation

The identification problem is formalized where a diffeomorphic mapping from high-dimensional observations to latent spaces needs to be learned. The dynamics of the underlying state space are modeled as:

$T(s' \mid s, a) = \prod_{i \in \Psi(s,a)} T_i(s_i' \mid s, a) \prod_{j \not\in \Psi(s,a)} T_j(s_j' \mid s, a_0),$

where $\Psi(s,a)$ denotes the variables affected by action $a$ .

Learning Framework

The encoder $f_\phi$ and transition dynamics parameterized by energy functions $E_\theta$ ensure the learned factors are independently controllable. The losses designed for optimizing the encoder and energies include:

Inverse Dynamics Loss: Cross-entropy loss to fit a softmax classifier for predicting actions given state transitions.
Forward Dynamics Loss: InfoNCE to maximize mutual information between consecutive states.
Ratio Loss: Binary cross-entropy loss using estimated logits $r_a(x',x)$ to capture the discrepancy between action-induced and natural dynamics.
Figure 1: Factorization metrics. The left side bars show information represented off the diagonal, while the right bars represent mean diagonal values in $R^2$ matrices.

Empirical Evaluation

The paper evaluates ACF across several benchmark environments, including visual variations of the Taxi domain and Minigrid environments like FourRooms and DoorKey. The results demonstrate that ACF consistently recovers controllable factors and performs superior to baseline disentanglement methods such as GCL and DMS.

Results Analysis

Factorization Metrics: The $R^2$ matrices (Figure 2) assessed the disentanglement quality, aiming for high diagonal and low off-diagonal values to signal successful isolation of underlying factors.
Quantitative Performance: ACF demonstrated a significant improvement in identifying independently controllable factors, outperforming baselines by achieving higher mean diagonal $R^2$ values.
Figure 2: Factorization matrices for DoorKey. Mean $R^2$ matrices over 5 seeds.

Insights from Traversals

Latent space traversals in domains like Taxi and DoorKey revealed meaningful correspondence between latent variables and controllable state components (Figures 3 and 4), visualizing how variations in each latent dimension affect the environment's state.

Figure 3: Taxi latent traversals, highlighting passenger dynamics through varied latent variables.

Figure 4: DoorKey latent traversals, demonstrating controllable elements such as agent and key positions.

The study situates ACF within the broader RL literature, noting the historical focus on exploiting factored representations and recent advancements in latent space modeling. It contrasts ACF with previous efforts in disentangled representation learning and causal structure discovery, emphasizing its novel contribution to learning factored structures without explicit supervision or predefined factor graphs.

Conclusion

The research introduces a method that bridges the gap between pixel-based deep learning approaches and the efficiency of factored RL by automatically discovering independently controllable state variables. This advancement points towards potential future improvements in RL systems' ability to generalize across tasks by learning compact, interpretable representations directly from unstructured sensory data. Further exploration could enhance factor identification for both controllable and critical non-controllable state variables.

Markdown Report Issue