- The paper demonstrates that pruning causes catastrophic collapse in VLA models, drastically reducing task success and increasing unsafe behaviors.
- The proposed GLUESTICK method uses singular value stitching to reintroduce lost weight contributions without retraining.
- Experimental results reveal substantial recovery in performance and safety across manipulation and navigation tasks while retaining memory efficiency.
Pruning-Induced Collapse and Recovery in Vision-Language-Action Models
Introduction
Vision-Language-Action (VLA) models represent a paradigm shift in robotics, integrating perception, language understanding, and action generation into unified, end-to-end transformer-based policies. These models leverage large-scale robotics datasets and pretrained vision/language backbones to generalize across diverse tasks and environments. However, their substantial parameter count poses significant challenges for deployment on resource-constrained robotic hardware, necessitating model compression techniques such as pruning. While pruning has proven effective for LLMs, this paper provides the first systematic study demonstrating that pruning induces catastrophic degradation in VLA models, both in terms of task success and safety. The authors introduce GLUESTICK, a post-pruning, training-free recovery method that restores much of the lost functionality while retaining the efficiency benefits of structured sparsity.
Pruning in VLA Models: Empirical Collapse
The study reveals that standard pruning algorithms, including Magnitude and Wanda, which are effective for LLMs, result in near-complete collapse of VLA model performance. For instance, pruning OpenVLA and NaVILA to 2:4 structured sparsity (50%) drives success rates from 85.2% (manipulation) and 43.0% (navigation) down to 0.0%, respectively, and substantially increases unsafe-episode rates. This degradation is not merely a reduction in efficiency but a fundamental loss of embodied control capabilities, with pruned agents failing to complete tasks and exhibiting unsafe behaviors such as collisions and object drops.
Spectral analysis of weight matrices reveals that VLA layers exhibit flatter singular value spectra compared to language-only models. In LLMs, energy is concentrated in a few dominant directions, making pruning less destructive. In contrast, VLA models distribute energy across many directions, so pruning even small-magnitude weights discards critical signal, explaining their heightened fragility.
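This flatness can be checked directly on a checkpoint. The sketch below, assuming PyTorch weight matrices and a hypothetical helper name, computes the fraction of squared singular-value energy captured by the top-k directions; a layer with a flat spectrum reports a much smaller fraction than a layer whose energy is concentrated in a few dominant directions.

import torch

def spectral_energy_profile(W, k):
    # Fraction of squared singular-value energy in the top-k directions.
    S = torch.linalg.svdvals(W)
    energy = S ** 2
    return (energy[:k].sum() / energy.sum()).item()

# Illustrative use (layer handles are placeholders, not from the paper):
# spectral_energy_profile(llm_layer.weight, k=50)  # higher for peaked spectra
# spectral_energy_profile(vla_layer.weight, k=50)  # lower for the flatter VLA spectra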
GLUESTICK: Post-Pruning Recovery via Singular Value Stitching
GLUESTICK is a post-hoc, training-free recovery algorithm that operates in weight space and is agnostic to the pruning method. For each pruned linear layer, GLUESTICK computes the gap matrix W_gap = W_dense − W_pruned and performs a truncated SVD to extract the top-r singular components. These components are folded into compact matrices A and B, and at inference the pruned layer's output is corrected as h(x) = W_pruned x + A(B^T x). This approach re-injects dominant lost directions with minimal computational and memory overhead, preserving the efficiency of structured sparsity.
The rank r serves as a hyperparameter controlling the trade-off between memory savings and recovery. Empirically, increasing r improves success-rate recovery at the cost of additional memory, but even moderate values (e.g., r=200 or r=500) yield substantial restoration of performance.
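As a back-of-envelope illustration (assuming fp16 storage and a square 4096x4096 projection, which are illustrative numbers rather than values from the paper), the per-layer cost of the correction terms can be computed directly:

def correction_overhead_mb(d_in, d_out, r, bytes_per_param=2):
    # A is (d_out x r) and B is (d_in x r): (d_in + d_out) * r extra parameters per layer.
    return (d_in + d_out) * r * bytes_per_param / 1e6

# For d_in = d_out = 4096 at fp16:
#   r = 200 -> ~3.3 MB per layer, r = 500 -> ~8.2 MB per layer,
#   compared with ~33.6 MB for the dense weight matrix itself.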
Implementation
GLUESTICK can be integrated into existing PyTorch-based VLA models with minimal code changes. The offline stage computes and stores correction terms for each pruned layer, while the online stage wraps pruned layers to apply the correction during inference. The additional parameters and compute scale as O((d_in + d_out) · r) per layer, which is negligible compared to the dense case.
import torch
import torch.nn as nn
import torch.nn.functional as F

def prime_gluestick(W_dense, W_pruned, r):
    # Offline stage: truncated SVD of the gap between dense and pruned weights.
    W_gap = W_dense - W_pruned
    U, S, Vh = torch.linalg.svd(W_gap, full_matrices=False)
    U_r = U[:, :r]               # (d_out, r) top-r left singular vectors
    S_r = S[:r]                  # (r,)      top-r singular values
    V_r = Vh[:r, :].T            # (d_in, r) top-r right singular vectors
    A = U_r * S_r.unsqueeze(0)   # fold singular values into A: (d_out, r)
    B = V_r                      # (d_in, r)
    return {"A": A, "B": B}

class GLUESTICKWrap(nn.Module):
    def __init__(self, pruned_linear_layer, A, B):
        super().__init__()
        self.pruned_linear = pruned_linear_layer
        # Register as buffers so the correction terms follow .to()/.cuda().
        self.register_buffer("A", A)
        self.register_buffer("B", B)

    def forward(self, x):
        # Online stage: sparse layer output plus the low-rank correction A B^T x.
        y = F.linear(x, self.pruned_linear.weight, self.pruned_linear.bias)
        correction = (x @ self.B) @ self.A.T   # (..., d_in) -> (..., r) -> (..., d_out)
        return y + correction
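A minimal wiring sketch follows; it assumes the dense and pruned checkpoints share module names and that only nn.Linear layers were pruned, and the helper name apply_gluestick is hypothetical rather than taken from the paper's released code.

def apply_gluestick(dense_model, pruned_model, r=500):
    # Offline: gather the dense reference weights by module name.
    dense_weights = {name: m.weight.detach()
                     for name, m in dense_model.named_modules()
                     if isinstance(m, nn.Linear)}
    # Collect targets first so the module tree is not mutated during traversal.
    targets = [(name, m) for name, m in pruned_model.named_modules()
               if isinstance(m, nn.Linear) and name in dense_weights]
    for name, layer in targets:
        terms = prime_gluestick(dense_weights[name], layer.weight.detach(), r)
        parent = (pruned_model.get_submodule(name.rsplit(".", 1)[0])
                  if "." in name else pruned_model)
        setattr(parent, name.rsplit(".", 1)[-1],
                GLUESTICKWrap(layer, terms["A"], terms["B"]))
    return pruned_model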
Experimental Results
Manipulation Tasks
On the LIBERO benchmark, fully sparse pruning yields an average success-rate degradation of 72.4%. GLUESTICK-500 recovers approximately 50% of the lost success, with particularly strong recovery in spatial and goal-oriented tasks (62% and 57%, respectively). Compared to memory-matched baselines, GLUESTICK achieves a 40% improvement in success rate while maintaining similar memory efficiency.
Navigation Tasks
For navigation (VLN-CE-Isaac), pruning collapses NaVILA's success rate from 43.0% to 0%, with pruned agents exhibiting erratic, inefficient trajectories. GLUESTICK-500 fully restores the dense model's performance, matching both success rates and trajectory quality, and maintaining memory savings within 0.38GB of the fully sparse baseline.
Safety
Pruning increases unsafe-episode rates by up to +23.0% in navigation and +13.6% in manipulation. GLUESTICK-500 restores safety profiles to near parity with dense models, with only a minimal +0.4% change across domains. This indicates that the dominant weight-space directions recovered by GLUESTICK are both task-relevant and safety-critical.
Component Sensitivity
Selective pruning experiments show that vision backbones are disproportionately sensitive to pruning, causing outsized harm relative to their parameter count. For OpenVLA, pruning the vision backbone is 4.75x more damaging per million parameters than pruning the language backbone. This suggests that pruning should focus on language components for maximal efficiency with minimal performance loss.
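A component-selective variant of magnitude pruning is straightforward to express. The sketch below applies 2:4 magnitude pruning only to modules whose names contain a given substring; the "language_model" filter is an assumption about module naming, not a detail taken from the paper.

def prune_selected_2to4(model, include_substr="language_model"):
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and include_substr in name:
            W = module.weight.data
            d_out, d_in = W.shape
            groups = W.view(d_out, d_in // 4, 4)   # assumes d_in is divisible by 4
            # Keep the two largest-magnitude weights in every group of four.
            idx = groups.abs().topk(2, dim=-1).indices
            mask = torch.zeros_like(groups).scatter_(-1, idx, 1.0)
            module.weight.data = (groups * mask).view(d_out, d_in)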
SVD Compression vs. Pruning
Direct low-rank SVD compression of weights, without pruning, fails to preserve VLA functionality (0% success rate for rank-200 SVD). The pruned weight matrix retains valuable structure that cannot be captured by SVD alone. GLUESTICK leverages this by preserving pruned weights and using SVD only to reintroduce lost directions.
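The distinction can be made concrete with a small sketch: direct compression replaces the dense weight by a rank-r approximation, whereas GLUESTICK keeps the pruned weight and adds a rank-r correction only for the gap (names below are illustrative).

def svd_compress(W_dense, r):
    # Direct low-rank compression: the rank-r approximation replaces W_dense entirely.
    U, S, Vh = torch.linalg.svd(W_dense, full_matrices=False)
    return (U[:, :r] * S[:r]) @ Vh[:r, :]

# GLUESTICK instead keeps W_pruned and corrects only the gap:
#   W_effective = W_pruned + A @ B.T, with A, B = prime_gluestick(W_dense, W_pruned, r)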
Practical and Theoretical Implications
The findings have immediate implications for deploying VLA models on resource-constrained robotic platforms. Pruning recipes validated on LLMs cannot be naively transferred to embodied models without severe loss of functionality and safety. GLUESTICK provides a universal, training-free recovery step that is compatible with any pruning algorithm and exposes a single interpretable hyperparameter for the efficiency-accuracy trade-off. This enables practitioners to adapt VLA models to diverse hardware constraints without retraining or sacrificing safety.
Theoretically, the work highlights the importance of weight-space structure in multimodal models and the limitations of magnitude-based pruning heuristics in settings where energy is distributed across many directions. The spectral analysis suggests that future compression techniques should account for the anisotropy of singular value spectra in different model components.
Future Directions
Potential avenues for further research include:
- Prioritizing recovery of safety-critical directions in weight space.
- Investigating GLUESTICK's impact on inference speed and energy efficiency.
- Dynamically selecting rank r per layer to optimize the recovery-memory trade-off.
- Extending the approach to other multimodal and embodied AI architectures.
Conclusion
This work demonstrates that pruning induces catastrophic collapse in VLA models, fundamentally impairing both task success and safety. GLUESTICK, a training-free, pruning-agnostic recovery method, restores much of the lost functionality while retaining the efficiency benefits of structured sparsity. The approach is practical, easily integrable, and exposes a tunable trade-off between memory and accuracy, making it well-suited for real-world robotic deployment. The results underscore the need for compression techniques tailored to the unique properties of multimodal, embodied models and lay the groundwork for future research in efficient, safe AI for robotics.