CRT: Corruption Restoration Transformer
- The paper shows that CRT, a high-resolution vision transformer using shifted patch tokenization and rotary position embeddings, recovers near-baseline VLA performance under severe corruptions.
- The training methodology combines adversarial, L1 pixel-wise, and SSIM losses to restore degraded frames with high reconstruction fidelity.
- The plug-and-play design allows CRT to integrate with existing VLA policies without modifications, adding minimal latency while enhancing real-world robotic task success.
The Corruption Restoration Transformer (CRT) is a vision transformer architecture designed to restore visual observations degraded by sensor-level artifacts, thereby immunizing vision-language-action (VLA) models against failures caused by image corruptions. CRT’s primary objective is to serve as a plug-and-play, model-agnostic restoration module that sits upstream of pretrained VLA policies, such as π₀.₅ and SmolVLA, without requiring any modifications or fine-tuning to the downstream policy. Experimental evidence demonstrates CRT’s ability to recover near-baseline manipulation performance under various severe visual disturbances, addressing a key challenge in robust real-world deployment of VLA-driven robotic systems (Orjuela et al., 1 Feb 2026).
1. Architecture and Design Principles
CRT, denoted $G$, constitutes a specialized vision transformer configured for high-resolution image-to-image restoration. The architecture comprises several key components:
- Shifted Patch Tokenization (SPT): The input RGB frame (e.g., $360\times360\times3$ for LIBERO, $480\times480\times3$ for Meta-World) is partitioned into overlapping 2D patches using four shifted grids to reinforce local continuity and texture preservation. Each patch is linearly projected into a $d$-dimensional token (up to $1024$).
- Rotary Position Embedding (RoPE): RoPE injects both absolute and relative 2D positional information into tokens, enhancing spatial awareness.
- Transformer Backbone: The core encoder stacks transformer blocks (up to $16$ at lower resolutions; up to $32$ in high-resolution settings), each using Multi-Head Locality Self-Attention (LSA). Feedforward hidden widths scale with the token dimension (roughly $2\times$, up to $3$–$4\times$).
- Linear Decoder: “Patch-unembedding” reassembles tokens into a full-resolution RGB image, merging overlapping regions via overlap-add and upsampling.
- Discriminator $D$: Mirrors much of $G$'s embedding and transformer structure (8–12 blocks), terminating in an MLP head that outputs a binary scalar ("real" vs. "fake").
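The shifted-grid tokenization described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it extracts the original patch grid plus three spatially shifted copies, concatenates them channel-wise, and flattens each block into a token; all function names and sizes are illustrative.

```python
# Sketch of Shifted Patch Tokenization (SPT): four grids (unshifted plus
# three half-patch shifts) are stacked channel-wise, then partitioned into
# patch-sized blocks, each of which becomes one token.
import numpy as np

def shifted_patch_tokens(img: np.ndarray, patch: int = 8) -> np.ndarray:
    """img: (H, W, 3) float array -> (num_patches, patch*patch*3*4) tokens."""
    h, w, c = img.shape
    s = patch // 2  # half-patch shift
    shifts = [(0, 0), (s, 0), (0, s), (s, s)]
    views = []
    for dy, dx in shifts:
        shifted = np.zeros_like(img)          # zero-pad the shifted borders
        shifted[dy:, dx:] = img[: h - dy, : w - dx]
        views.append(shifted)
    stacked = np.concatenate(views, axis=-1)  # (H, W, 3*4)
    # Partition into non-overlapping patch x patch blocks, flatten each.
    gh, gw = h // patch, w // patch
    blocks = stacked[: gh * patch, : gw * patch]
    blocks = blocks.reshape(gh, patch, gw, patch, c * 4)
    tokens = blocks.transpose(0, 2, 1, 3, 4).reshape(gh * gw, -1)
    return tokens  # each row is later linearly projected to d dimensions

tok = shifted_patch_tokens(np.random.rand(64, 64, 3), patch=8)
print(tok.shape)  # (64, 768): an 8x8 grid of patches, 8*8*3*4 features each
```

Because the shifted copies carry neighboring pixels into each block, every token effectively sees an overlapping receptive field, which is what reinforces local continuity.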
2. Adversarial and Reconstruction Learning Objective
CRT is trained within a generative adversarial framework with the following key loss terms:
- Adversarial Loss: Binary cross-entropy (BCE) loss under which the generator $G$ attempts to produce restored images $\hat{x} = G(\tilde{x})$ indistinguishable, to the discriminator $D$, from clean images $x$:

$$\mathcal{L}_{\text{adv}} = -\,\mathbb{E}_{x}\big[\log D(x)\big] - \mathbb{E}_{\tilde{x}}\big[\log\big(1 - D(G(\tilde{x}))\big)\big]$$

The generator minimizes the second term, $\mathbb{E}_{\tilde{x}}\big[\log\big(1 - D(G(\tilde{x}))\big)\big]$, pushing $D$ to classify restored frames as clean.
- L1 Pixel-wise Loss: Standard reconstruction loss penalizing pixel-level deviations:

$$\mathcal{L}_{1} = \big\| G(\tilde{x}) - x \big\|_{1}$$
- Structural Similarity (SSIM) Loss: Encourages perceptual similarity via SSIM:

$$\mathcal{L}_{\text{SSIM}} = 1 - \mathrm{SSIM}\big(G(\tilde{x}),\, x\big)$$
- Total Generator Objective:

$$\mathcal{L}_{G} = \lambda_{\text{adv}}\,\mathcal{L}_{\text{adv}} + \lambda_{1}\,\mathcal{L}_{1} + \lambda_{\text{SSIM}}\,\mathcal{L}_{\text{SSIM}}$$

with experimentally chosen weights $\lambda_{\text{adv}}$, $\lambda_{1}$, and $\lambda_{\text{SSIM}}$.
No additional regularization is employed beyond standard weight decay.
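The combined generator objective can be sketched numerically. This is a hedged NumPy illustration: the SSIM term is computed globally over the image rather than in sliding windows, and the weight values are placeholders, since the paper's $\lambda$ settings are not reproduced here.

```python
# Minimal sketch of the generator objective: adversarial BCE + L1 + (1 - SSIM).
import numpy as np

def bce_real(logits: np.ndarray) -> float:
    """BCE against the 'real' label: the generator's adversarial term."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return float(-np.mean(np.log(p + 1e-8)))

def l1_loss(pred: np.ndarray, target: np.ndarray) -> float:
    return float(np.mean(np.abs(pred - target)))

def ssim_loss(pred: np.ndarray, target: np.ndarray,
              c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> float:
    """1 - SSIM, computed globally instead of over local windows."""
    mu_p, mu_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(), target.var()
    cov = ((pred - mu_p) * (target - mu_t)).mean()
    ssim = ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2))
    return float(1.0 - ssim)

def generator_loss(fake_logits, restored, clean,
                   w_adv=0.1, w_l1=1.0, w_ssim=0.5):
    # Placeholder weights, not the paper's lambda values.
    return (w_adv * bce_real(fake_logits)
            + w_l1 * l1_loss(restored, clean)
            + w_ssim * ssim_loss(restored, clean))

clean = np.random.rand(32, 32, 3)
# A perfect restoration that also fools D (large positive logit) gives ~0 loss.
print(generator_loss(np.array([10.0]), clean, clean) < 1e-3)  # True
```

The three terms pull in complementary directions: L1 anchors pixel accuracy, SSIM rewards structural agreement, and the adversarial term penalizes outputs a discriminator can tell apart from clean frames.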
3. Plug-and-Play Integration and Modularity
CRT operates as a modular preprocessing layer directly preceding any pretrained VLA model. Each control timestep proceeds as:
- Receipt of a corrupted frame $\tilde{x}_t$, either from simulation or real-world sensors.
- CRT restores the frame as $\hat{x}_t = G(\tilde{x}_t)$.
- The VLA policy $\pi$ (e.g., π₀.₅, SmolVLA) accepts $\hat{x}_t$ together with any language prompt and outputs the appropriate action.
CRT is entirely model-agnostic, imposing zero modifications on the VLA’s weights, input tokenization, or architectural scheme. Its “drop-in” nature facilitates seamless retrofitting to any VLA-based robotic pipeline without policy retraining.
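The per-timestep flow above can be sketched as a thin wrapper loop. This is an illustrative skeleton only: `restore` and `policy` stand in for the trained CRT generator and a frozen pretrained VLA; both names and interfaces are hypothetical.

```python
# Drop-in placement of a restoration module upstream of a frozen VLA policy.
# The policy never sees corrupted pixels and requires no retraining.
from typing import Callable, List
import numpy as np

def run_episode(get_frame: Callable[[], np.ndarray],
                restore: Callable[[np.ndarray], np.ndarray],
                policy: Callable[[np.ndarray, str], np.ndarray],
                prompt: str,
                steps: int = 3) -> List[np.ndarray]:
    actions = []
    for _ in range(steps):
        corrupted = get_frame()     # raw sensor frame, possibly degraded
        clean = restore(corrupted)  # CRT: x_hat = G(x_tilde)
        actions.append(policy(clean, prompt))
    return actions

# Toy stand-ins to exercise the loop:
acts = run_episode(
    get_frame=lambda: np.zeros((8, 8, 3)),
    restore=lambda x: x + 1.0,                 # pretend restoration
    policy=lambda obs, p: np.array([obs.mean()]),
    prompt="pick up the mug",
)
print(len(acts))  # 3
```

Because the wrapper only touches the observation, swapping the downstream policy (π₀.₅ for SmolVLA, say) requires changing a single callable.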
4. Empirical Evaluation: Benchmarks, Protocols, and Corruption Types
CRT’s efficacy was tested on two established benchmarks:
- LIBERO-10: Ten manipulation tasks, input size 360×360.
- Meta-World MT50: Fifty tasks, input size 480×480.
Experiments subjected VLA policies to five distinct corruption types:
- Centered square occluder (25% area, black).
- Zero-mean Gaussian noise applied per-pixel (fixed standard deviation $\sigma$).
- Horizontal black lines covering 50% of rows (high intensity).
- Horizontal lines covering 20% of rows (low intensity).
- Semi-transparent, blurred “water-drop” artifacts at random locations.
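Three of these corruptions can be sketched directly in NumPy. This is an assumption-laden illustration, not the paper's evaluation code: the noise scale and parameter names are placeholders, and images are taken to lie in $[0, 1]$.

```python
# Hedged sketches of three evaluated corruptions: a centered square occluder
# covering 25% of the image area, zero-mean per-pixel Gaussian noise, and
# black horizontal lines over a fraction of rows.
import numpy as np

def center_occluder(img: np.ndarray, area_frac: float = 0.25) -> np.ndarray:
    h, w = img.shape[:2]
    side = int(round((area_frac * h * w) ** 0.5))  # square with 25% of area
    y0, x0 = (h - side) // 2, (w - side) // 2
    out = img.copy()
    out[y0:y0 + side, x0:x0 + side] = 0.0          # black square
    return out

def gaussian_noise(img: np.ndarray, sigma: float = 0.1,
                   seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)              # sigma is an assumption
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def line_corruption(img: np.ndarray, row_frac: float = 0.5) -> np.ndarray:
    out = img.copy()
    step = int(round(1.0 / row_frac))              # e.g. every 2nd row at 50%
    out[::step] = 0.0
    return out

img = np.ones((8, 8, 3))
print(center_occluder(img).mean())        # 0.75: a quarter of pixels blacked
print(line_corruption(img, 0.5).mean())   # 0.5: half the rows blacked
```

The "water-drop" artifact is harder to mock faithfully (semi-transparent, blurred, randomly placed) and is omitted here.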
Performance is quantified via average success rate (SR) across all tasks. Baseline models include π₀.₅ and SmolVLA with no CRT augmentation.
5. Restoration Performance and Quantitative Results
CRT demonstrates substantial recovery of VLA task success rates under severe corruption, as summarized below (SR = success rate):
| Model & Setting | Clean SR | Corrupted (Lines 50%) | CRT+Corrupted (Lines 50%) | Clean w/ CRT |
|---|---|---|---|---|
| π₀.₅ on LIBERO-10 | 90.0% | 2.0% (–97.8%) | 87.0% (–3.3%) | 89.0% (–1.1%) |
| SmolVLA on LIBERO-10 | 43.0% | 0.0% (–100%) | 3.0% (–93.0%) | 33.0% (–23.3%) |
| SmolVLA on Meta-World | 58.0% | 20.6% (–64.5%) | 32.2% (–44.4%) | 47.0% (–19.0%) |
- For π₀.₅, CRT restores nearly all lost performance (≤3% drop from baseline) even under the most severe corruptions.
- For SmolVLA, CRT achieves large absolute gains under corruption (e.g., +11.6 percentage points on Meta-World lines), though some degradation on clean input is observed (roughly a 10-point absolute drop, i.e., 19–23% relative).
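The relative-change figures in the table can be reproduced directly from the success rates, as a quick arithmetic check:

```python
# Relative drop = (SR_condition - SR_clean) / SR_clean * 100, rounded to 0.1.
def rel_change(clean_sr: float, sr: float) -> float:
    return round((sr - clean_sr) / clean_sr * 100, 1)

print(rel_change(90.0, 2.0))   # -97.8  pi-0.5, corrupted (lines 50%)
print(rel_change(90.0, 87.0))  # -3.3   pi-0.5, CRT + corrupted
print(rel_change(43.0, 3.0))   # -93.0  SmolVLA LIBERO, CRT + corrupted
print(rel_change(58.0, 47.0))  # -19.0  SmolVLA Meta-World, clean w/ CRT
print(32.2 - 20.6)             # 11.6 percentage-point absolute gain
```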
6. Component Analysis and Ablations
While no quantitative ablative table is provided, several qualitative findings clarify the roles of architectural and loss-based design choices:
- Adversarial Loss ($\mathcal{L}_{\text{adv}}$): Essential for preserving high-frequency details (edges, handles) otherwise lost with pixel-level objectives alone.
- Shifted Patch Tokenization (SPT): Critical to reconstructing local textures and mitigating “water-drop” and “line” disturbances; SPT removal severely impairs restoration.
- RoPE and LSA: Their combination sharpens attention on authentic object contours over artifact edges.
- Network Depth and Attention Heads: Deeper models with more heads are better at disentangling scene layout from corruption noise and allow for parallel locality-sensitive processing.
A plausible implication is that enhanced transformer depth improves discrimination between semantic and spurious structural features in heavy artifact regimes.
7. Limitations, Computational Cost, and Prospective Work
- Limitations: Slight performance degradation on clean frames for smaller VLAs (e.g., SmolVLA); CRT must be retrained per visual environment, precluding direct cross-domain generalization.
- Overhead: Adds only 10–50 ms of inference latency and ~1 GB of VRAM per frame (batch size 1) on an NVIDIA Quadro RTX 6000, negligible relative to typical VLA decision latencies.
- Prospective Enhancements:
- Automatic corruption detection to trigger CRT only under artifacts, thereby preserving 100% clean accuracy.
- Joint CRT+VLA cascade training to further mitigate distributional shifts.
- Extension to novel real-world distortions (e.g., fisheye, chromatic aberration).
- Lightweight scene-specific fine-tuning from a handful of paired real corrupted/clean samples.
In sum, CRT constitutes a modular, high-capacity approach to input restoration in robotic VLA architectures, providing significant resilience against a wide array of challenging sensor-level corruptions and enabling robust execution of manipulation policies in the presence of substantial observation artifacts (Orjuela et al., 1 Feb 2026).