2D or 3D: Who Governs Salience in VLA Models? -- Tri-Stage Token Pruning Framework with Modality Salience Awareness

Published 10 Apr 2026 in cs.MM, cs.CV, and cs.RO | (2604.09244v1)

Abstract: Vision-Language-Action (VLA) models have emerged as the mainstream of embodied intelligence. Recent VLA models have expanded their input modalities from 2D-only to 2D+3D paradigms, forming multi-visual-modal VLA (MVLA) models. Despite achieving improved spatial perception, MVLA faces a greater acceleration demand due to the increased number of input tokens caused by modal expansion. Token pruning is an effective optimization methods tailored to MVLA models. However, existing token pruning schemes are designed for 2D-only VLA models, ignoring 2D/3D modality salience differences. In this paper, we follow the application process of multi-modal data in MVLA models and develop a tri-stage analysis to capture the discrepancy and dynamics of 2D/3D modality salience. Based on these, we propose a corresponding tri-stage token pruning framework for MVLA models to achieve optimal 2D/3D token selection and efficient pruning. Experiments show that our framework achieves up to a 2.55x inference speedup with minimal accuracy loss, while only costing 5.8% overhead. Our Code is coming soon.

Abstract PDF Upgrade to Chat

Authors (11)

Summary

The paper presents a tri-stage token pruning framework that leverages dynamic 2D and 3D salience analysis to accelerate MVLA inference with minimal accuracy loss.
It employs a systematic three-stage approach—preprocessing, semantic synthesis, and temporal adaptation—that achieves a speedup of up to 2.69x while maintaining robust task performance.
Experimental evaluations in simulation and real-world settings validate the framework's scalability and effectiveness in enabling efficient, real-time robotic control.

Introduction and Motivation

The recent evolution of Vision-Language-Action (VLA) models has led to the widespread deployment of multi-visual-modal architectures (MVLA) that integrate both 2D and 3D sensory modalities for embodied AI control. While this expansion to 3D point clouds alongside 2D images significantly augments spatial perception and manipulation capabilities, it also introduces a steep rise in sequence lengths due to an increased number of input tokens. These enlarged token sequences escalate inference latency and computational overhead—inhibiting real-time robotic control and limiting practical deployment. Traditional token pruning techniques, initially developed for single-visual-modal (SVLA) models, do not account for the discrepant task relevance and salience between 2D and 3D modalities in MVLA settings. The paper addresses this critical limitation by proposing a principled, tri-stage analysis of token salience and a matching tri-stage token pruning framework with explicit modality awareness, yielding substantial acceleration with negligible loss in policy performance (2604.09244).

Figure 2: Standard MVLA architectures combine 2D/3D encoders, an LLM backbone, and action decoders with concatenated visual tokens and instructions.

Analysis of Modality Salience Across MVLA Inference

The core insight is that modality salience—i.e., the contribution of 2D and 3D features to downstream decisions—varies dynamically not just across modalities, but also as a function of inference stage and task phase. The authors conduct a comprehensive, three-stage analysis:

Data Preprocessing Stage: Here, feature norm statistics from the final model layer reveal that, generally, 2D tokens exhibit higher salience than 3D tokens. Pruning experiments show pronounced degradation when 2D tokens are pruned, but often improved or stable performance when pruning 3D tokens, indicating that redundancy in 3D information is more tolerable for many tasks.
Semantic Synthesis Stage: The model’s attention matrices are clustered to partition visual tokens into semantic regions (background, robot, object). Analysis demonstrates that 2D salience dominates in backgrounds, while 3D tokens gain prominence in certain task-critical regions—especially the robot and manipulation targets. Furthermore, overlapping redundant representations (i.e., orthogonal/parallel attention) can be decomposed to isolate irreducible modality contributions.
Action Iteration Stage: During real-time control, the modality salience distribution is not static: temporal evolution (as shown via rolling averages of norm and attention-based metrics) indicates the necessity of temporal smoothing and adaptive decision policies to avoid flicker and instability in pruning masks.
Figure 3: The MVLA inference data flow is segmented into preprocessing, semantic fusion, and actuation with dynamic token importance.

Figure 4: Salience decomposition during semantic fusion (a) and temporal dynamics in action iteration (b).

Tri-Stage Token Pruning Framework

Building on the tri-stage analysis, the authors design a matching tri-stage pruning scheme:

Stage 1 (Preprocessing Candidate Selection): Patch-level modality preference is estimated via feature norms and mapped into pruning candidate sets using a dual-threshold scheme, isolating regions retainable by only 2D, only 3D, or both.
Stage 2 (Semantic Synthesis Pruning): Attention-driven clustering subdivides visual patches by semantic role; pruning rates are then adaptively assigned with semantic and modality salience constraints—e.g., aggressive random pruning for backgrounds (up to 90%), semantic safeguarding for object/robot patches, and fallbacks for ambiguous intersections.
Stage 3 (Temporal Adaptation): All per-modality/region metrics are temporally smoothed using EMA and sliding window mechanisms, enabling adaptive, predictive mask updates that prevent oscillations and ensure stable control even under sensor/scene variations.

Candidate sets from all three stages are fused (intersection with conflict resolution) to generate final retention masks. The result is a coarse-to-fine, semantically driven, and temporally stable pruning pipeline, maximizing speed without compromising task-critical token availability.

Figure 5: The tri-stage token pruning pipeline fuses stagewise candidate token sets for robust, dynamic inference acceleration.

Figure 6: Comparison of naive and tri-stage token pruning, emphasizing the modality- and semantic-aware advantage of the proposed pipeline.

Figure 8: Visualization demonstrates gradual, context-driven updates in candidate selection during the “Take Umbrella” real-world task.

Experimental Evaluation

Simulation Results

Extensive simulation experiments on RLBench tasks validate the tri-stage framework. Compared to naively/uniformly pruned or SVLA-designed baselines (e.g., SP-VLA, VLA-Pruner), the method consistently achieves:

2.3 $\times$ –2.69 $\times$ end-to-end speedup (vs. 1.2 $\times$ –1.5 $\times$ for baselines)
Minimal accuracy loss (<5% SR drop) at baseline pruning rates, and in some tasks exceeds even the unpruned baseline due to redundancy reduction
Modality-aware pruning rates: E.g., at $r=70\%$ , 54.7% of 2D and 72.8% of 3D tokens pruned with only 2.5% SR degradation
Figure 7: Task success rate and inference speedup across RLBench tasks confirm superior efficiency and robustness.

Real-World Experiments

Evaluated in real-robot deployments, the framework maintains robust real-time performance and effective policy stabilization:

Average speedup of 2.3 $\times$
Consistent token salience adaptation: 2D pruning dominates, but 3D retention increases transiently during key manipulative events, in line with action iteration analysis
Figure 9: Real-world deployment in pick-and-place tasks confirms alignment with stagewise token salience predictions.

Ablation and Overhead

Ablation studies show strong complementarity between the three stages: semantic-aware pruning is critical for SR robustness, while temporal adaptation supplies most of the acceleration. The framework’s computational overhead is modest (5.8% total), confirmed by step-by-step runtime breakdowns.

Implications and Future Directions

The tri-stage modality-aware approach provides a systematic, interpretable, and highly effective alternative to uniform or SVLA-derived pruning paradigms for MVLA models. By directly quantifying token relevance at each processing and semantic level, this framework facilitates robust pruning policies that adapt to changing sensory conditions, complex manipulation scenes, and real-world noise. Important practical advances include:

Scalability: The framework supports scaling MVLA models to higher sensor resolutions without proportional latency growth
Extensibility: Salience estimation and stagewise pruning could generalize to non-visual modalities (e.g., audio, tactile), and to multimodal fusion in non-embodied AI
Theoretical insight: The explicit quantification and observation of shifting modality salience can guide robust model design, multi-modal embedding learning, and dynamic attention/policy mechanisms in future AI systems
Figure 1: Pruning hyperparameter sweeps confirm narrow SR/speedup sensitivity, indicating high robustness to practical deployment variance.

Conclusion

The paper provides a comprehensive modality salience analysis and proposes a principled, tri-stage token pruning framework that substantially accelerates MVLA inference by leveraging stage-specific, semantic, and temporal token relevance. Numerical results demonstrate strong superiority over baselines, low sensitivity to hyperparameters, and practical deployability in real robotic control. This methodology lays groundwork for future advances in efficient, scalable, and adaptive multi-modal embodied AI.

Markdown Report Issue