
Reinforcement Learning with Generalizable Gaussian Splatting (2404.07950v3)

Published 18 Mar 2024 in cs.CV, cs.AI, and cs.LG

Abstract: An excellent representation is crucial for reinforcement learning (RL) performance, especially in vision-based reinforcement learning tasks. The quality of the environment representation directly influences the achievement of the learning task. Previous vision-based RL typically uses explicit or implicit ways to represent environments, such as images, points, voxels, and neural radiance fields. However, these representations contain several drawbacks. They cannot either describe complex local geometries or generalize well to unseen scenes, or require precise foreground masks. Moreover, these implicit neural representations are akin to a "black box", significantly hindering interpretability. 3D Gaussian Splatting (3DGS), with its explicit scene representation and differentiable rendering nature, is considered a revolutionary change for reconstruction and representation methods. In this paper, we propose a novel Generalizable Gaussian Splatting framework to be the representation of RL tasks, called GSRL. Through validation in the RoboMimic environment, our method achieves better results than other baselines in multiple tasks, improving the performance by 10%, 44%, and 15% compared with baselines on the hardest task. This work is the first attempt to leverage generalizable 3DGS as a representation for RL.


Summary

  • The paper presents GSRL, which leverages a pre-trained, generalizable 3D Gaussian Splatting estimator to create explicit 3D state representations from multi-view images.
  • It employs a multi-stage training pipeline combining depth estimation, UNet-based Gaussian regression, and graph-based refinement to generate coherent 3D outputs.
  • Experimental results on robotic manipulation tasks show significant performance gains and reduced variance compared to image, point cloud, and voxel baselines.

Vision-based reinforcement learning (RL) relies heavily on the quality of the environment representation derived from visual inputs. Traditional approaches use 2D images, depth maps, point clouds, voxels, or implicit neural representations such as Neural Radiance Fields (NeRFs). Each has limitations: image-based representations lack explicit 3D structure and multi-view consistency; point clouds and voxels struggle to capture fine geometric detail efficiently; and NeRFs typically require per-scene optimization, generalize poorly without specialized architectures, can be too slow for real-time inference, may require foreground masks, and often act as "black boxes" that hinder interpretability. The work "Reinforcement Learning with Generalizable Gaussian Splatting" (2404.07950) proposes GSRL, which leverages a generalizable variant of 3D Gaussian Splatting (3DGS) as the state representation for RL tasks, aiming to overcome these limitations.

The GSRL Framework: Generalizable 3DGS for RL

The core challenge in directly applying standard 3DGS to RL is its requirement for per-scene optimization, which is computationally prohibitive within an interactive RL loop. GSRL addresses this by pre-training a generalizable 3DGS estimator network capable of predicting a 3D Gaussian representation directly from multi-view images without iterative optimization.

Pre-training the Generalizable 3DGS Estimator

This stage involves training a network offline to map input multi-view images and associated camera parameters to a set of 3D Gaussians $\{G_i\}$ representing the scene. Each Gaussian $G_i$ is defined by its mean (position) $\mu_i \in \mathbb{R}^3$, covariance $\Sigma_i$ (represented by rotation $R_i \in SO(3)$ and scaling $S_i \in \mathbb{R}^3$), opacity $\alpha_i \in \mathbb{R}$, and color $c_i \in \mathbb{R}^3$. The estimator architecture comprises three main modules (a schematic parameter container is sketched after the list below):

  1. Depth Estimator: This module takes stereo image pairs as input and predicts per-pixel depth maps. It employs a feature extraction backbone followed by a cost volume construction mechanism (akin to MVSNet) and subsequent processing to output dense depth. The predicted depth, combined with known camera intrinsics and extrinsics, provides the initial 3D positions $\mu$ for the Gaussians corresponding to each pixel.
  2. Gaussian Regressor: This module predicts the remaining Gaussian parameters $(R, S, \alpha)$. It utilizes a UNet-like architecture, taking as input image features (from the depth estimator's backbone), the source image itself, and the predicted depth map. The encoder processes this multi-modal input, and subsequent prediction heads output the per-pixel rotation (as quaternions), scale, and opacity values. The color $c$ is directly sampled from the corresponding pixel in the source input image.
  3. Gaussian Refinement: To enhance the quality and spatial coherence of the initially predicted Gaussians, a refinement module based on graph neural networks is employed. This module operates on the set of predicted Gaussians, treating them as nodes in a graph. Using an architecture similar to FoldingNet with KNN-based message passing, it refines the Gaussian properties $(\mu, R, S, \alpha)$ via an autoencoding process, aiming to smooth the representation and mitigate view-dependent noise or artifacts (a generic message-passing sketch also follows this list).
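For concreteness, the per-Gaussian parameters predicted by these modules can be collected in a simple container. The sketch below is illustrative only: the class name GaussianSet, the tensor shapes, and the quaternion convention are assumptions for exposition, not the paper's code.

from dataclasses import dataclass
import torch

@dataclass
class GaussianSet:
    """A batch of N 3D Gaussians predicted for one observation (hypothetical container)."""
    means: torch.Tensor      # (N, 3) positions mu_i
    rotations: torch.Tensor  # (N, 4) unit quaternions encoding R_i
    scales: torch.Tensor     # (N, 3) per-axis scales S_i
    opacities: torch.Tensor  # (N, 1) opacities alpha_i
    colors: torch.Tensor     # (N, 3) RGB colors c_i sampled from the source image

    def covariances(self) -> torch.Tensor:
        """Assemble Sigma_i = R_i S_i S_i^T R_i^T from rotation and scale."""
        q = torch.nn.functional.normalize(self.rotations, dim=-1)
        w, x, y, z = q.unbind(-1)
        # Standard unit-quaternion to rotation-matrix conversion, row-major
        R = torch.stack([
            1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y),
            2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x),
            2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y),
        ], dim=-1).reshape(-1, 3, 3)
        S = torch.diag_embed(self.scales)   # (N, 3, 3)
        M = R @ S
        return M @ M.transpose(-1, -2)      # (N, 3, 3) covariances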

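The refinement step can be pictured as message passing over a kNN graph built from the Gaussian centers. The layer below is a generic sketch of that idea, not the paper's refiner: the KNNRefiner name, the 14-dimensional feature layout (matching the container above), and the residual update are all assumptions.

import torch
import torch.nn as nn

class KNNRefiner(nn.Module):
    """Illustrative kNN message-passing layer over Gaussian centers (schematic, not the paper's module)."""
    def __init__(self, feat_dim: int = 14, k: int = 16):
        super().__init__()
        self.k = k
        self.msg = nn.Sequential(nn.Linear(2 * feat_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))

    def forward(self, means: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # means: (N, 3) Gaussian centers; feats: (N, F) concatenated Gaussian parameters
        dists = torch.cdist(means, means)                                 # (N, N) pairwise distances
        knn_idx = dists.topk(self.k + 1, largest=False).indices[:, 1:]    # k nearest neighbors, excluding self
        neighbor_feats = feats[knn_idx]                                   # (N, k, F)
        center_feats = feats.unsqueeze(1).expand(-1, self.k, -1)          # (N, k, F)
        messages = self.msg(torch.cat([center_feats, neighbor_feats], dim=-1))  # (N, k, F)
        return feats + messages.mean(dim=1)                               # residual update of Gaussian parameters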
The training proceeds sequentially. First, the Depth Estimator is trained using an L1 loss on the predicted depth against ground truth. Subsequently, the depth network is frozen, and the Gaussian Regressor and Refinement modules are trained jointly. The loss function for this stage includes:

  • A rendering loss ($L_r$): This compares the rendered image from the predicted Gaussians (using the differentiable 3DGS rasterizer) against a ground-truth target view. An L1 and SSIM loss combination is typically used (written out below).
  • A reconstruction loss ($L_{recon}$): This is applied to the output of the refinement autoencoder, encouraging it to accurately reconstruct the input Gaussian properties.
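Written out, the combined objective for this stage takes roughly the following form; the $1 - \mathrm{SSIM}$ formulation and the weights $\lambda$ and $\lambda_{recon}$ follow common 3DGS practice and are assumptions, not values stated here:

$$\mathcal{L} = \underbrace{(1-\lambda)\,\|\hat{I} - I_{gt}\|_1 + \lambda\,\bigl(1 - \mathrm{SSIM}(\hat{I}, I_{gt})\bigr)}_{L_r} + \lambda_{recon}\, L_{recon},$$

where $\hat{I}$ is the view rendered from the refined Gaussians and $I_{gt}$ is the ground-truth target view.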

# Stage 1: pre-train the depth estimator, then freeze it and train the
# Gaussian regressor + refiner jointly. Schematic pseudocode: module,
# dataloader, and helper names (DepthEstimator, backproject, render_gaussians,
# SSIM, ReconstructionLoss) are placeholders, not released code.
import torch
from torch.optim import Adam

depth_estimator = DepthEstimator()
gaussian_regressor = GaussianRegressor()
gaussian_refiner = GaussianRefiner()

# Stage 1a: supervise depth with an L1 loss against ground truth
optimizer_depth = Adam(depth_estimator.parameters())
for data in training_dataloader_depth:
    predicted_depth = depth_estimator(data.I_left, data.I_right, data.K, data.E)
    loss_depth = L1Loss(predicted_depth, data.ground_truth_depth)
    optimizer_depth.zero_grad()
    loss_depth.backward()
    optimizer_depth.step()

# Stage 1b: freeze the depth network, train regressor + refiner jointly
depth_estimator.eval()
for param in depth_estimator.parameters():
    param.requires_grad = False

optimizer_gs = Adam(list(gaussian_regressor.parameters()) + list(gaussian_refiner.parameters()))
for data in training_dataloader_gs:
    with torch.no_grad():
        predicted_depth = depth_estimator(data.I_left, data.I_right, data.K, data.E)
        initial_means = backproject(predicted_depth, data.K, data.E)  # lift depth to 3D points
        initial_colors = data.I_left.pixels                           # colors sampled from source pixels

    # Predict remaining Gaussian parameters, reusing features from the frozen depth backbone
    features = depth_estimator.get_features(data.I_left, data.I_right)
    predicted_rotations, predicted_scales, predicted_opacities = gaussian_regressor(
        features, data.I_left, predicted_depth)

    # Assemble the initial per-pixel Gaussians
    initial_gaussians = {'mean': initial_means, 'rot': predicted_rotations,
                         'scale': predicted_scales, 'opacity': predicted_opacities,
                         'color': initial_colors}

    # Graph-based refinement (autoencoding over the Gaussian set)
    refined_gaussians = gaussian_refiner(initial_gaussians)

    # Render a held-out target view with the differentiable 3DGS rasterizer
    rendered_image = render_gaussians(refined_gaussians, data.target_camera_params)

    # Rendering loss (L1 + SSIM term) plus refinement reconstruction loss
    loss_render = L1Loss(rendered_image, data.I_target) + SSIM(rendered_image, data.I_target)
    loss_recon = ReconstructionLoss(refined_gaussians, initial_gaussians)

    total_loss = loss_render + lambda_recon * loss_recon
    optimizer_gs.zero_grad()
    total_loss.backward()
    optimizer_gs.step()
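The backproject helper is left abstract in the pseudocode above. Below is a minimal sketch under a standard pinhole-camera model; it assumes K is a 3x3 intrinsic matrix, E a 4x4 camera-to-world transform, and depth a per-pixel z-depth map, none of which is pinned down by the paper summary.

import torch

def backproject(depth: torch.Tensor, K: torch.Tensor, E: torch.Tensor) -> torch.Tensor:
    """Lift a depth map (H, W) to world-space points (H*W, 3) under a pinhole model."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=depth.dtype),
                          torch.arange(W, dtype=depth.dtype), indexing='ij')
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3)   # homogeneous pixel coords
    rays = pix @ torch.linalg.inv(K).T                                     # camera-frame directions (z = 1)
    pts_cam = rays * depth.reshape(-1, 1)                                  # scale by depth
    pts_h = torch.cat([pts_cam, torch.ones_like(pts_cam[:, :1])], dim=-1)  # homogeneous camera points
    pts_world = pts_h @ E.T                                                # apply camera-to-world transform
    return pts_world[:, :3]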

Integration into the RL Framework

Once the generalizable 3DGS estimator is pre-trained, its weights are frozen and integrated into the RL agent's observation processing pipeline.

  1. Observation: At each timestep $t$, the RL agent receives multi-view image observations $O_t = \{I_{t,1}, I_{t,2}, ..., I_{t,N}\}$ along with corresponding camera parameters $\{K_{t,i}, E_{t,i}\}$.
  2. State Representation Generation: The observations $O_t$ are fed into the frozen generalizable 3DGS estimator. The estimator rapidly predicts a set of 3D Gaussians $G_t = \{G_{t,i}\}$ representing the current state of the environment.
    # Pseudocode for RL step integration (module names and call signatures are schematic)
    # Load the pre-trained estimator weights and switch every module to eval mode
    depth_estimator.load_state_dict(...)
    gaussian_regressor.load_state_dict(...)
    gaussian_refiner.load_state_dict(...)
    depth_estimator.eval(); gaussian_regressor.eval(); gaussian_refiner.eval()

    def get_state_representation(observations, camera_params):
        # observations  = [view_1_img, view_2_img, ...]
        # camera_params = [{'K': ..., 'E': ...}, ...]
        with torch.no_grad():
            # Assume a stereo pair for depth; the left view supplies colors and features
            img_left, img_right = observations[0], observations[1]
            K_l, E_l = camera_params[0]['K'], camera_params[0]['E']
            K_r, E_r = camera_params[1]['K'], camera_params[1]['E']

            predicted_depth = depth_estimator(img_left, img_right, K_l, E_l, K_r, E_r)
            initial_means = backproject(predicted_depth, K_l, E_l)
            initial_colors = img_left.pixels

            features = depth_estimator.get_features(img_left, img_right)
            pred_rot, pred_scale, pred_alpha = gaussian_regressor(features, img_left, predicted_depth)

            initial_gaussians = {'mean': initial_means, 'rot': pred_rot, 'scale': pred_scale,
                                 'opacity': pred_alpha, 'color': initial_colors}
            refined_gaussians = gaussian_refiner(initial_gaussians)

            # Subsample to a fixed budget for the policy (e.g., random or farthest-point sampling)
            num_gaussians_for_policy = 10000  # example budget
            sampled_gaussians = sample_gaussians(refined_gaussians, num_gaussians_for_policy)

        return sampled_gaussians  # this Gaussian set is the state representation s_t
  3. Policy Input: The resulting set of Gaussians $G_t$ (potentially sampled or processed further, e.g., using a SetTransformer encoder to handle the unordered set structure and variable number of points) serves as the state input $s_t$ to the RL policy network $\pi(a_t \mid s_t)$; a minimal set-encoder sketch follows this list.
  4. Action and Transition: The policy outputs an action $a_t$, which is executed in the environment, leading to the next state observation $O_{t+1}$.
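As an illustration of the policy-input step, the sketch below encodes an unordered set of Gaussians into a fixed-size vector with a simple PointNet-style shared MLP plus max pooling. It is an assumed stand-in for the set encoders mentioned above (such as SetTransformer), and GaussianSet refers to the hypothetical container sketched earlier.

import torch
import torch.nn as nn

class GaussianSetEncoder(nn.Module):
    """Per-Gaussian MLP followed by permutation-invariant max pooling (illustrative)."""
    def __init__(self, state_dim: int = 256):
        super().__init__()
        # Each Gaussian contributes 14 features: mean(3) + quaternion(4) + scale(3) + opacity(1) + color(3)
        self.point_mlp = nn.Sequential(
            nn.Linear(14, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, state_dim),
        )

    def forward(self, g: "GaussianSet") -> torch.Tensor:
        feats = torch.cat([g.means, g.rotations, g.scales, g.opacities, g.colors], dim=-1)  # (N, 14)
        per_point = self.point_mlp(feats)      # (N, state_dim)
        return per_point.max(dim=0).values     # pooled state vector s_t

# Usage sketch: s_t feeds the policy pi(a_t | s_t)
# encoder = GaussianSetEncoder()
# s_t = encoder(sampled_gaussians)
# action = policy(s_t)

The permutation-invariant pooling is what lets the policy consume a variable number of Gaussians without imposing an ordering on the set.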

This process allows the RL agent to leverage the rich, 3D-consistent, and geometrically detailed representation provided by the generalizable 3DGS at each step without incurring the cost of online optimization.

Experimental Validation and Results

The effectiveness of GSRL was evaluated on the RoboMimic benchmark, specifically focusing on four robotic manipulation tasks: Lift, Can, Square, and Transport. These tasks involve simulated Franka robots interacting with objects in diverse scenes. Three offline RL algorithms were employed: Batch-Constrained Q-learning (BCQ), Implicit Q-Learning (IQL), and IRIS (Implicit Reinforcement without Interaction at Scale).

GSRL was compared against baseline representations:

  • Multi-view Images: Raw images encoded using a ResNet18 architecture.
  • Point Clouds: 3D points derived directly from the depth estimator module, processed by a SetTransformer.
  • Voxels: A 3D voxel grid representation processed by a 3D Convolutional ResNet.

Key Findings:

  • Superior Performance: GSRL consistently outperformed the baseline representations across most tasks and RL algorithms in terms of task success rate (Table 1 in the paper).
  • Significant Gains on Hard Tasks: The improvements were particularly pronounced on the more complex tasks. For the Transport task using the IRIS algorithm, GSRL achieved improvements of +10%, +44%, and +15% in success rate compared to images, point clouds, and voxels, respectively. Similar substantial gains were observed for the Square task.
  • Reduced Variance: GSRL often exhibited lower variance in success rates across different runs, suggesting more stable learning.

Ablation Studies:

  • Number of Gaussians: The RL performance was relatively robust to the number of Gaussians used as input to the policy (tested from 2048 to 10000), although very low numbers (2048) degraded performance on harder tasks, indicating that a sufficient density of Gaussians is needed to capture critical geometric details.
  • 3DGS Reconstruction Quality: A clear correlation was observed between the reconstruction quality (measured by PSNR) of the pre-trained generalizable 3DGS model and the final RL task performance. Higher PSNR generally led to better success rates, especially on complex tasks, validating the hypothesis that a high-fidelity 3D representation benefits policy learning.
  • Framework Components: Removing the feature reuse mechanism between the depth estimator and Gaussian regressor, or removing the Gaussian refinement module, resulted in lower reconstruction quality (PSNR), confirming their contribution to generating better 3DGS representations. Notably, using the predicted depth from the estimator yielded results nearly on par with using ground-truth depth, demonstrating the effectiveness of the learned depth prediction.

Significance and Implementation Considerations

The GSRL approach represents a significant step towards incorporating sophisticated 3D scene representations into RL. By developing a generalizable 3DGS estimator, it overcomes the primary efficiency bottleneck of standard 3DGS for online applications like RL.

Key Advantages:

  • Explicit 3D Geometry: Provides detailed, explicit 3D structure, including local geometry captured by the Gaussian covariances, which is often missing or poorly represented in other methods.
  • 3D Consistency: Inherently generates representations that are consistent across different viewpoints, benefiting tasks requiring understanding of spatial relationships.
  • Efficiency: The pre-trained estimator allows fast inference, generating the 3DGS representation directly from images within the RL loop's time constraints.
  • Interpretability: As an explicit representation, 3DGS offers potentially greater interpretability compared to purely implicit neural fields.

Implementation Considerations:

  • Pre-training Data: Requires a diverse dataset of multi-view observations with corresponding camera parameters for pre-training the generalizable estimator. The quality and diversity of this data directly impact the generalizability and reconstruction quality.
  • Computational Resources: Pre-training the estimator involves training multiple deep network components (depth network, UNet regressor, graph network refiner) and requires significant GPU resources. However, inference during RL is fast.
  • State Encoder: The output of the 3DGS estimator is a set of Gaussians. An appropriate network architecture, such as a SetTransformer or PointNet++, is needed to encode this set representation effectively for the RL policy.
  • Task Dependency: While showing strong results on manipulation tasks, the utility might vary depending on the specific RL task's reliance on fine-grained 3D geometry versus other factors.

Conclusion

The GSRL framework demonstrates the successful integration of generalizable 3D Gaussian Splatting as a state representation for vision-based RL (2404.07950). By pre-training an efficient estimator, it bypasses the per-scene optimization bottleneck of standard 3DGS, enabling the use of its rich, explicit, and 3D-consistent representation within an RL loop. Experimental results on challenging manipulation tasks show substantial performance improvements over conventional image, point cloud, and voxel representations, highlighting the potential of advanced 3D vision techniques to enhance RL agent capabilities.
