CRT: Corruption Restoration Transformer
- The paper shows that CRT, a high-resolution vision transformer using shifted patch tokenization and rotary position embeddings, recovers near-baseline VLA performance under severe corruptions.
- The training methodology combines adversarial, L1 pixel-wise, and SSIM losses to restore degraded frames with high reconstruction fidelity.
- The plug-and-play design allows CRT to integrate with existing VLA policies without modifications, adding minimal latency while enhancing real-world robotic task success.
The Corruption Restoration Transformer (CRT) is a vision transformer architecture designed to restore visual observations degraded by sensor-level artifacts, thereby immunizing vision-language-action (VLA) models against failures caused by image corruptions. CRT’s primary objective is to serve as a plug-and-play, model-agnostic restoration module that sits upstream of pretrained VLA policies, such as π₀.₅ and SmolVLA, without requiring any modifications or fine-tuning to the downstream policy. Experimental evidence demonstrates CRT’s ability to recover near-baseline manipulation performance under various severe visual disturbances, addressing a key challenge in robust real-world deployment of VLA-driven robotic systems (Orjuela et al., 1 Feb 2026).
1. Architecture and Design Principles
CRT, denoted $G$, constitutes a specialized vision transformer configured for high-resolution image-to-image restoration. The architecture comprises several key components:
- Shifted Patch Tokenization (SPT): The input RGB frame (e.g., $360\times360\times3$ for LIBERO, $480\times480\times3$ for Meta-World) is partitioned into overlapping 2D patches using four shifted grids to reinforce local continuity and texture preservation. Each patch is linearly projected into a $d$-dimensional token (up to $1024$).
- Rotary Position Embedding (RoPE): RoPE injects both absolute and relative 2D positional information into tokens, enhancing spatial awareness.
- Transformer Backbone: The core encoder stacks transformer blocks (up to $16$ at lower resolutions; up to $32$ in high-resolution settings), each using Multi-Head Locality Self-Attention (LSA). Feedforward hidden widths scale with the token dimension (roughly $2\times$, up to $3$–$4\times$).
- Linear Decoder: “Patch-unembedding” reassembles tokens into a full-resolution RGB image, merging overlapping regions via overlap-add and upsampling.
- Discriminator $D$: Mirrors much of $G$'s embedding and transformer structure (8–12 blocks), terminating in an MLP head that outputs a binary scalar ("real" vs. "fake").
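The shifted-grid tokenization described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it extracts the original patch grid plus three spatially shifted copies, concatenates them channel-wise, and flattens each block into a token; all function names and sizes are illustrative.

```python
# Sketch of Shifted Patch Tokenization (SPT): four grids (unshifted plus
# three half-patch shifts) are stacked channel-wise, then partitioned into
# patch-sized blocks, each of which becomes one token.
import numpy as np

def shifted_patch_tokens(img: np.ndarray, patch: int = 8) -> np.ndarray:
    """img: (H, W, 3) float array -> (num_patches, patch*patch*3*4) tokens."""
    h, w, c = img.shape
    s = patch // 2  # half-patch shift
    shifts = [(0, 0), (s, 0), (0, s), (s, s)]
    views = []
    for dy, dx in shifts:
        shifted = np.zeros_like(img)          # zero-pad the shifted borders
        shifted[dy:, dx:] = img[: h - dy, : w - dx]
        views.append(shifted)
    stacked = np.concatenate(views, axis=-1)  # (H, W, 3*4)
    # Partition into non-overlapping patch x patch blocks, flatten each.
    gh, gw = h // patch, w // patch
    blocks = stacked[: gh * patch, : gw * patch]
    blocks = blocks.reshape(gh, patch, gw, patch, c * 4)
    tokens = blocks.transpose(0, 2, 1, 3, 4).reshape(gh * gw, -1)
    return tokens  # each row is later linearly projected to d dimensions

tok = shifted_patch_tokens(np.random.rand(64, 64, 3), patch=8)
print(tok.shape)  # (64, 768): an 8x8 grid of patches, 8*8*3*4 features each
```

Because the shifted copies carry neighboring pixels into each block, every token effectively sees an overlapping receptive field, which is what reinforces local continuity.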
2. Adversarial and Reconstruction Learning Objective
CRT is trained within a generative adversarial framework with the following key loss terms:
- Adversarial Loss: Binary cross-entropy (BCE) loss under which the generator $G$ attempts to produce restored images $\hat{x} = G(\tilde{x})$ indistinguishable, to the discriminator $D$, from clean images $x$:

$$\mathcal{L}_{\text{adv}} = -\,\mathbb{E}_{x}\big[\log D(x)\big] - \mathbb{E}_{\tilde{x}}\big[\log\big(1 - D(G(\tilde{x}))\big)\big]$$

The generator minimizes the second term, $\mathbb{E}_{\tilde{x}}\big[\log\big(1 - D(G(\tilde{x}))\big)\big]$, pushing $D$ to classify restored frames as clean.
- L1 Pixel-wise Loss: Standard reconstruction loss penalizing pixel-level deviations:

$$\mathcal{L}_{1} = \big\| G(\tilde{x}) - x \big\|_{1}$$
- Structural Similarity (SSIM) Loss: Encourages perceptual similarity via SSIM:

$$\mathcal{L}_{\text{SSIM}} = 1 - \mathrm{SSIM}\big(G(\tilde{x}),\, x\big)$$
- Total Generator Objective:

$$\mathcal{L}_{G} = \lambda_{\text{adv}}\,\mathcal{L}_{\text{adv}} + \lambda_{1}\,\mathcal{L}_{1} + \lambda_{\text{SSIM}}\,\mathcal{L}_{\text{SSIM}}$$

with experimentally chosen weights $\lambda_{\text{adv}}$, $\lambda_{1}$, and $\lambda_{\text{SSIM}}$.
No additional regularization is employed beyond standard weight decay.
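The combined generator objective can be sketched numerically. This is a hedged NumPy illustration: the SSIM term is computed globally over the image rather than in sliding windows, and the weight values are placeholders, since the paper's $\lambda$ settings are not reproduced here.

```python
# Minimal sketch of the generator objective: adversarial BCE + L1 + (1 - SSIM).
import numpy as np

def bce_real(logits: np.ndarray) -> float:
    """BCE against the 'real' label: the generator's adversarial term."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return float(-np.mean(np.log(p + 1e-8)))

def l1_loss(pred: np.ndarray, target: np.ndarray) -> float:
    return float(np.mean(np.abs(pred - target)))

def ssim_loss(pred: np.ndarray, target: np.ndarray,
              c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> float:
    """1 - SSIM, computed globally instead of over local windows."""
    mu_p, mu_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(), target.var()
    cov = ((pred - mu_p) * (target - mu_t)).mean()
    ssim = ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2))
    return float(1.0 - ssim)

def generator_loss(fake_logits, restored, clean,
                   w_adv=0.1, w_l1=1.0, w_ssim=0.5):
    # Placeholder weights, not the paper's lambda values.
    return (w_adv * bce_real(fake_logits)
            + w_l1 * l1_loss(restored, clean)
            + w_ssim * ssim_loss(restored, clean))

clean = np.random.rand(32, 32, 3)
# A perfect restoration that also fools D (large positive logit) gives ~0 loss.
print(generator_loss(np.array([10.0]), clean, clean) < 1e-3)  # True
```

The three terms pull in complementary directions: L1 anchors pixel accuracy, SSIM rewards structural agreement, and the adversarial term penalizes outputs a discriminator can tell apart from clean frames.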
3. Plug-and-Play Integration and Modularity
CRT operates as a modular preprocessing layer directly preceding any pretrained VLA model. Each control timestep proceeds as:
- Receipt of a corrupted frame $\tilde{x}_t$, either from simulation or real-world sensors.
- CRT restores the frame as $\hat{x}_t = G(\tilde{x}_t)$.
- The VLA policy $\pi$ (e.g., π₀.₅, SmolVLA) accepts $\hat{x}_t$ together with any language prompt and outputs the appropriate action.
CRT is entirely model-agnostic, imposing zero modifications on the VLA’s weights, input tokenization, or architectural scheme. Its “drop-in” nature facilitates seamless retrofitting to any VLA-based robotic pipeline without policy retraining.
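The per-timestep flow above can be sketched as a thin wrapper loop. This is an illustrative skeleton only: `restore` and `policy` stand in for the trained CRT generator and a frozen pretrained VLA; both names and interfaces are hypothetical.

```python
# Drop-in placement of a restoration module upstream of a frozen VLA policy.
# The policy never sees corrupted pixels and requires no retraining.
from typing import Callable, List
import numpy as np

def run_episode(get_frame: Callable[[], np.ndarray],
                restore: Callable[[np.ndarray], np.ndarray],
                policy: Callable[[np.ndarray, str], np.ndarray],
                prompt: str,
                steps: int = 3) -> List[np.ndarray]:
    actions = []
    for _ in range(steps):
        corrupted = get_frame()     # raw sensor frame, possibly degraded
        clean = restore(corrupted)  # CRT: x_hat = G(x_tilde)
        actions.append(policy(clean, prompt))
    return actions

# Toy stand-ins to exercise the loop:
acts = run_episode(
    get_frame=lambda: np.zeros((8, 8, 3)),
    restore=lambda x: x + 1.0,                 # pretend restoration
    policy=lambda obs, p: np.array([obs.mean()]),
    prompt="pick up the mug",
)
print(len(acts))  # 3
```

Because the wrapper only touches the observation, swapping the downstream policy (π₀.₅ for SmolVLA, say) requires changing a single callable.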
4. Empirical Evaluation: Benchmarks, Protocols, and Corruption Types
CRT’s efficacy was tested on two established benchmarks:
- LIBERO-10: Ten manipulation tasks, input size 360×360.
- Meta-World MT50: Fifty tasks, input size 480×480.
Experiments subjected VLA policies to five distinct corruption types:
- Centered square occluder (25% area, black).
- Zero-mean Gaussian noise applied per-pixel (fixed standard deviation $\sigma$).
- Horizontal black lines covering 50% of rows (high intensity).
- Horizontal lines covering 20% of rows (low intensity).
- Semi-transparent, blurred “water-drop” artifacts at random locations.
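Three of these corruptions can be sketched directly in NumPy. This is an assumption-laden illustration, not the paper's evaluation code: the noise scale and parameter names are placeholders, and images are taken to lie in $[0, 1]$.

```python
# Hedged sketches of three evaluated corruptions: a centered square occluder
# covering 25% of the image area, zero-mean per-pixel Gaussian noise, and
# black horizontal lines over a fraction of rows.
import numpy as np

def center_occluder(img: np.ndarray, area_frac: float = 0.25) -> np.ndarray:
    h, w = img.shape[:2]
    side = int(round((area_frac * h * w) ** 0.5))  # square with 25% of area
    y0, x0 = (h - side) // 2, (w - side) // 2
    out = img.copy()
    out[y0:y0 + side, x0:x0 + side] = 0.0          # black square
    return out

def gaussian_noise(img: np.ndarray, sigma: float = 0.1,
                   seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)              # sigma is an assumption
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def line_corruption(img: np.ndarray, row_frac: float = 0.5) -> np.ndarray:
    out = img.copy()
    step = int(round(1.0 / row_frac))              # e.g. every 2nd row at 50%
    out[::step] = 0.0
    return out

img = np.ones((8, 8, 3))
print(center_occluder(img).mean())        # 0.75: a quarter of pixels blacked
print(line_corruption(img, 0.5).mean())   # 0.5: half the rows blacked
```

The "water-drop" artifact is harder to mock faithfully (semi-transparent, blurred, randomly placed) and is omitted here.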
Performance is quantified via average success rate (SR) across all tasks. Baseline models include π₀.₅ and SmolVLA with no CRT augmentation.
5. Restoration Performance and Quantitative Results
CRT demonstrates substantial recovery of VLA task success rates under severe corruption, as summarized below (SR = success rate):
| Model & Setting | Clean SR | Corrupted (Lines 50%) | CRT+Corrupted (Lines 50%) | Clean w/ CRT |
|---|---|---|---|---|
| π₀.₅ on LIBERO-10 | 90.0% | 2.0% (–97.8%) | 87.0% (–3.3%) | 89.0% (–1.1%) |
| SmolVLA on LIBERO-10 | 43.0% | 0.0% (–100%) | 3.0% (–93.0%) | 33.0% (–23.3%) |
| SmolVLA on Meta-World | 58.0% | 20.6% (–64.5%) | 32.2% (–44.4%) | 47.0% (–19.0%) |
- For π₀.₅, CRT restores nearly all lost performance (≤3% drop from baseline) even under the most severe corruptions.
- For SmolVLA, CRT achieves large absolute gains under corruption (e.g., +11.6 percentage points on Meta-World lines), though some degradation on clean input is observed (roughly a 10-point absolute drop, i.e., 19–23% relative).
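The relative-change figures in the table can be reproduced directly from the success rates, as a quick arithmetic check:

```python
# Relative drop = (SR_condition - SR_clean) / SR_clean * 100, rounded to 0.1.
def rel_change(clean_sr: float, sr: float) -> float:
    return round((sr - clean_sr) / clean_sr * 100, 1)

print(rel_change(90.0, 2.0))   # -97.8  pi-0.5, corrupted (lines 50%)
print(rel_change(90.0, 87.0))  # -3.3   pi-0.5, CRT + corrupted
print(rel_change(43.0, 3.0))   # -93.0  SmolVLA LIBERO, CRT + corrupted
print(rel_change(58.0, 47.0))  # -19.0  SmolVLA Meta-World, clean w/ CRT
print(32.2 - 20.6)             # 11.6 percentage-point absolute gain
```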
6. Component Analysis and Ablations
While no quantitative ablative table is provided, several qualitative findings clarify the roles of architectural and loss-based design choices:
- Adversarial Loss ($\mathcal{L}_{\text{adv}}$): Essential for preserving high-frequency details (edges, handles) otherwise lost with pixel-level objectives alone.
- Shifted Patch Tokenization (SPT): Critical to reconstructing local textures and mitigating “water-drop” and “line” disturbances; SPT removal severely impairs restoration.
- RoPE and LSA: Their combination sharpens attention on authentic object contours over artifact edges.
- Network Depth and Attention Heads: Deeper models with more heads are better at disentangling scene layout from corruption noise and allow for parallel locality-sensitive processing.
A plausible implication is that enhanced transformer depth improves discrimination between semantic and spurious structural features in heavy artifact regimes.
7. Limitations, Computational Cost, and Prospective Work
- Limitations: Slight performance degradation on clean frames for smaller VLAs (e.g., SmolVLA); CRT must be retrained per visual environment, precluding direct cross-domain generalization.
- Overhead: Adds only 10–50 ms of inference latency and ~1 GB of VRAM per frame (batch size 1) on an NVIDIA Quadro RTX 6000, negligible relative to typical VLA decision latencies.
- Prospective Enhancements:
- Automatic corruption detection to trigger CRT only under artifacts, thereby preserving 100% clean accuracy.
- Joint CRT+VLA cascade training to further mitigate distributional shifts.
- Extension to novel real-world distortions (e.g., fisheye, chromatic aberration).
- Lightweight scene-specific fine-tuning from a handful of paired real corrupted/clean samples.
In sum, CRT constitutes a modular, high-capacity approach to input restoration in robotic VLA architectures, providing significant resilience against a wide array of challenging sensor-level corruptions and enabling robust execution of manipulation policies in the presence of substantial observation artifacts (Orjuela et al., 1 Feb 2026).