
CRT: Corruption Restoration Transformer

Updated 8 February 2026
  • The paper shows that CRT, a high-resolution vision transformer using shifted patch tokenization and rotary position embeddings, recovers near-baseline VLA performance under severe corruptions.
  • The training objective combines adversarial, L1 pixel-wise, and SSIM losses to restore degraded frames with high reconstruction fidelity.
  • The plug-and-play design allows CRT to integrate with existing VLA policies without modifications, adding minimal latency while enhancing real-world robotic task success.

The Corruption Restoration Transformer (CRT) is a vision transformer architecture designed to restore visual observations degraded by sensor-level artifacts, thereby immunizing vision-language-action (VLA) models against failures caused by image corruptions. CRT’s primary objective is to serve as a plug-and-play, model-agnostic restoration module that sits upstream of pretrained VLA policies, such as π₀.₅ and SmolVLA, without requiring any modifications or fine-tuning to the downstream policy. Experimental evidence demonstrates CRT’s ability to recover near-baseline manipulation performance under various severe visual disturbances, addressing a key challenge in robust real-world deployment of VLA-driven robotic systems (Orjuela et al., 1 Feb 2026).

1. Architecture and Design Principles

CRT, denoted $G$, is a specialized vision transformer configured for high-resolution image-to-image restoration. The architecture comprises several key components:

  • Shifted Patch Tokenization (SPT): The input RGB frame $x' \in \mathbb{R}^{H \times W \times 3}$, e.g., 360×360×3 (LIBERO) or 480×480×3 (Meta-World), is partitioned into overlapping 2D patches using four shifted grids to reinforce local continuity and texture preservation (see the sketch after this list). Each patch is linearly projected into a $D$-dimensional token (typically $D \approx 768$–$1024$).
  • Rotary Position Embedding (RoPE): RoPE injects both absolute and relative 2D positional information into tokens, enhancing spatial awareness.
  • Transformer Backbone: The core encoder comprises $L$ transformer blocks ($L = 12$–$16$ at lower resolutions; $L \approx 24$–$32$ in high-resolution settings), each using Multi-Head Locality-Self-Attention (LSA); head counts and feedforward widths scale with the embedding dimension $D$.
  • Linear Decoder: “Patch-unembedding” reassembles tokens into a full-resolution RGB image, merging overlapping regions via overlap-add and upsampling.
  • Discriminator $D$: Mirrors much of $G$'s embedding and transformer structure (8–12 blocks), terminating in an MLP head that outputs a binary scalar ("real" or "fake").
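
The following is a minimal PyTorch sketch of the shifted-patch tokenization step referenced above; the patch size, half-patch shift offsets, and embedding dimension are illustrative assumptions rather than CRT's reported values, and the sketch follows the common channel-concatenation formulation of SPT.

```python
# Hedged sketch of Shifted Patch Tokenization (SPT). Patch size, shift
# offsets, and embed dim are illustrative, not CRT's reported values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShiftedPatchTokenizer(nn.Module):
    def __init__(self, patch: int = 16, dim: int = 768, in_ch: int = 3):
        super().__init__()
        self.patch = patch
        # Original frame + 4 half-patch-shifted copies, concatenated on channels.
        self.proj = nn.Linear(5 * in_ch * patch * patch, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, 3, H, W)
        half = self.patch // 2
        shifts = [(half, half), (half, -half), (-half, half), (-half, -half)]
        views = [x] + [torch.roll(x, s, dims=(2, 3)) for s in shifts]
        x = torch.cat(views, dim=1)                       # (B, 15, H, W)
        # Cut into patch-sized tiles and flatten each tile into a vector.
        tokens = F.unfold(x, self.patch, stride=self.patch).transpose(1, 2)
        return self.proj(tokens)                          # (B, N, dim)
```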

2. Adversarial and Reconstruction Learning Objective

CRT is trained within a generative adversarial framework with the following key loss terms:

  • Adversarial Loss: A binary cross-entropy (BCE) objective under which $G$ tries to produce restored images $\hat{x} = G(x')$ that the discriminator $D$ cannot distinguish from clean images $x$:

$$\mathcal{L}_{\text{adv}}^{D} = -\,\mathbb{E}_{x}\big[\log D(x)\big] \;-\; \mathbb{E}_{x'}\big[\log\big(1 - D(G(x'))\big)\big]$$

The generator minimizes the non-saturating counterpart $\mathcal{L}_{\text{adv}}^{G} = -\,\mathbb{E}_{x'}\big[\log D(G(x'))\big]$.

  • L1 Pixel-wise Loss: A standard reconstruction term penalizing pixel-level deviations:

$$\mathcal{L}_{1} = \mathbb{E}\big[\,\lVert G(x') - x \rVert_{1}\,\big]$$

  • Structural Similarity (SSIM) Loss: Encourages perceptual similarity via SSIM:

$$\mathcal{L}_{\text{SSIM}} = 1 - \mathrm{SSIM}\big(G(x'),\, x\big)$$
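
For reference, SSIM here can be taken as the standard structural-similarity index between image windows $a$ and $b$ (the textbook definition, not a CRT-specific variant), with $C_1$, $C_2$ small stabilizing constants:

$$\mathrm{SSIM}(a,b) = \frac{(2\mu_a\mu_b + C_1)(2\sigma_{ab} + C_2)}{(\mu_a^2 + \mu_b^2 + C_1)(\sigma_a^2 + \sigma_b^2 + C_2)}$$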

  • Total Generator Objective:

$$\mathcal{L}_{G} = \lambda_{\text{adv}}\,\mathcal{L}_{\text{adv}}^{G} \;+\; \lambda_{1}\,\mathcal{L}_{1} \;+\; \lambda_{\text{SSIM}}\,\mathcal{L}_{\text{SSIM}}$$

with experimentally chosen weights $\lambda_{\text{adv}}$, $\lambda_{1}$, and $\lambda_{\text{SSIM}}$ balancing the three terms.

No additional regularization is employed beyond standard weight decay.
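
A compact sketch of the composite generator objective, assuming PyTorch and the third-party pytorch_msssim package; the default weight values and the non-saturating BCE form are assumptions consistent with the losses above, not confirmed hyperparameters.

```python
# Illustrative composite generator loss; lambda weights are placeholders.
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # third-party SSIM implementation

def generator_loss(D, x_clean, x_restored,
                   lam_adv=1.0, lam_l1=1.0, lam_ssim=1.0):
    # Non-saturating adversarial term: G wants D to score x_restored as real.
    logits_fake = D(x_restored)
    l_adv = F.binary_cross_entropy_with_logits(
        logits_fake, torch.ones_like(logits_fake))
    # Pixel-wise L1 reconstruction term.
    l_l1 = F.l1_loss(x_restored, x_clean)
    # Structural-similarity term (SSIM = 1 means identical images).
    l_ssim = 1.0 - ssim(x_restored, x_clean, data_range=1.0)
    return lam_adv * l_adv + lam_l1 * l_l1 + lam_ssim * l_ssim
```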

3. Plug-and-Play Integration and Modularity

CRT operates as a modular preprocessing layer directly preceding any pretrained VLA model. Each control timestep proceeds as:

  1. Receive a corrupted frame $x'$, either from simulation or from real-world sensors.
  2. CRT restores the frame as $\hat{x} = G(x')$.
  3. The VLA policy $\pi$ (e.g., π₀.₅, SmolVLA) accepts $\hat{x}$ together with any language prompt and outputs the appropriate action.

CRT is entirely model-agnostic, requiring no modifications to the VLA's weights, input tokenization, or architecture. Its "drop-in" nature allows seamless retrofitting to any VLA-based robotic pipeline without policy retraining; a minimal sketch of the per-timestep loop appears below.
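
In the sketch, `crt`, `vla_policy`, `camera`, and `robot` are hypothetical stand-ins for the actual components; only the data flow is meant to be illustrative.

```python
# Per-timestep control loop with CRT as an upstream restoration filter.
# All objects below are hypothetical placeholders, not a real API.
import torch

@torch.no_grad()
def control_step(crt, vla_policy, camera, robot, prompt: str):
    x_corrupted = camera.read()              # possibly degraded RGB frame
    x_restored = crt(x_corrupted)            # x_hat = G(x'), CRT restoration
    action = vla_policy(x_restored, prompt)  # frozen VLA policy, no retraining
    robot.apply(action)                      # execute the predicted action
```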

4. Empirical Evaluation: Benchmarks, Protocols, and Corruption Types

CRT’s efficacy was tested on two established benchmarks:

  • LIBERO-10: Ten manipulation tasks, input size 360×360.
  • Meta-World MT50: Fifty tasks, input size 480×480.

Experiments subjected VLA policies to five distinct corruption types (two are sketched in code after the list):

  1. Centered square occluder (25% area, black).
  2. Zero-mean, per-pixel Gaussian noise with fixed standard deviation $\sigma$.
  3. Horizontal black lines covering 50% of rows (high intensity).
  4. Horizontal lines covering 20% of rows (low intensity).
  5. Semi-transparent, blurred “water-drop” artifacts at random locations.
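
For concreteness, here is an illustrative synthesis of two of the corruptions above, assuming float image tensors in [0, 1]; the noise level and line fraction are placeholders rather than the paper's exact settings.

```python
# Hedged corruption synthesis; sigma and the row fraction are
# illustrative defaults, not the paper's exact settings.
import torch

def gaussian_noise(x: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    # Zero-mean per-pixel Gaussian noise, clamped back to the valid range.
    return (x + sigma * torch.randn_like(x)).clamp(0.0, 1.0)

def horizontal_lines(x: torch.Tensor, frac: float = 0.5) -> torch.Tensor:
    # Black out a random subset of rows covering `frac` of the image height.
    h = x.shape[-2]
    rows = torch.randperm(h)[: int(frac * h)]
    x = x.clone()
    x[..., rows, :] = 0.0
    return x
```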

Performance is quantified via average success rate (SR) across all tasks. Baseline models include π₀.₅ and SmolVLA with no CRT augmentation.

5. Restoration Performance and Quantitative Results

CRT demonstrates substantial recovery of VLA task success rates under severe corruption, as summarized below (SR = success rate; parenthesized values are relative changes from the clean, no-CRT baseline):

| Model & Setting | Clean SR | Corrupted (Lines 50%) | CRT + Corrupted (Lines 50%) | Clean w/ CRT |
|---|---|---|---|---|
| π₀.₅ on LIBERO-10 | 90.0% | 2.0% (–97.8%) | 87.0% (–3.3%) | 89.0% (–1.1%) |
| SmolVLA on LIBERO-10 | 43.0% | 0.0% (–100%) | 3.0% (–93.0%) | 33.0% (–23.3%) |
| SmolVLA on Meta-World | 58.0% | 20.6% (–64.5%) | 32.2% (–44.4%) | 47.0% (–19.0%) |
  • For π₀.₅, CRT restores nearly all lost performance (within ~3 percentage points of baseline) even under the most severe corruptions.
  • For SmolVLA, CRT achieves large absolute gains under corruption (e.g., +11.6 percentage points on Meta-World lines), though some degradation on clean inputs is observed (roughly 10–11 percentage points).

6. Component Analysis and Ablations

While no quantitative ablation table is provided, several qualitative findings clarify the roles of the architectural and loss-term design choices:

  • Adversarial Loss ($\mathcal{L}_{\text{adv}}$): Essential for preserving high-frequency details (edges, handles) that are otherwise lost with pixel-level objectives alone.
  • Shifted Patch Tokenization (SPT): Critical to reconstructing local textures and mitigating “water-drop” and “line” disturbances; SPT removal severely impairs restoration.
  • RoPE and LSA: Their combination sharpens attention on authentic object contours over artifact edges.
  • Network Depth and Attention Heads: Deeper models with more heads are better at disentangling scene layout from corruption noise and allow for parallel locality-sensitive processing.

A plausible implication is that enhanced transformer depth improves discrimination between semantic and spurious structural features in heavy artifact regimes.

7. Limitations, Computational Cost, and Prospective Work

  • Limitations: Slight performance degradation on clean frames for smaller VLAs (e.g., SmolVLA); CRT must be retrained per visual environment, precluding direct cross-domain generalization.
  • Overhead: Adds only 10–50 ms of inference latency and ~1 GB of VRAM per frame (batch size 1) on an NVIDIA Quadro RTX 6000, negligible relative to typical VLA decision latencies.
  • Prospective Enhancements:
    • Automatic corruption detection to trigger CRT only when artifacts are present, preserving 100% clean accuracy (a minimal gating sketch follows this list).
    • Joint CRT+VLA cascade training to further mitigate distributional shifts.
    • Extension to novel real-world distortions (e.g., fisheye, chromatic aberration).
    • Lightweight scene-specific fine-tuning from a handful of paired real corrupted/clean samples.
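
The first enhancement could amount to a simple gating wrapper; the sketch below is purely hypothetical, since no such detector is described in the paper.

```python
# Hypothetical gating: run CRT only when a corruption detector fires,
# preserving clean-input accuracy. `detector` is an assumed component.
def maybe_restore(frame, detector, crt, threshold: float = 0.5):
    return crt(frame) if detector(frame) > threshold else frame
```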

In sum, CRT constitutes a modular, high-capacity approach to input restoration in robotic VLA architectures, providing significant resilience against a wide array of challenging sensor-level corruptions and enabling robust execution of manipulation policies in the presence of substantial observation artifacts (Orjuela et al., 1 Feb 2026).

References

Orjuela et al. "CRT: Corruption Restoration Transformer." 1 Feb 2026.
