Sufficiency of perturbing only late self-attention layers in deep VLA flow models for ACG

Determine whether, in deeper transformer-based Vision-Language-Action (VLA) flow-matching policies employing Action Coherence Guidance (ACG), perturbing only a small subset of the late self-attention layers is sufficient to construct an effective incoherent guidance vector field for coherent action generation.

Background

Action Coherence Guidance (ACG) improves action coherence in flow-based VLA policies by constructing an incoherent vector field, obtained by perturbing self-attention (replacing each attention map with the identity), and guiding sampling away from that field. This requires an additional forward pass through the incoherent variant of the model, increasing inference-time compute.
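
As a rough illustration of this guidance rule, the following minimal sketch assumes a hypothetical model interface with an `identity_attn` flag that swaps every self-attention map for the identity matrix; the flag name, signature, and exact extrapolation form are assumptions for illustration, not the paper's API.

```python
import torch

def acg_velocity(model, x_t: torch.Tensor, t: torch.Tensor, obs, w: float = 1.0):
    """Guided flow-matching velocity under ACG (illustrative sketch).

    `model(x_t, t, obs, identity_attn=...)` is a hypothetical interface:
    identity_attn=True replaces each self-attention map with the identity
    (every token attends only to itself), producing the incoherent field.
    """
    v_coh = model(x_t, t, obs, identity_attn=False)  # coherent velocity
    v_inc = model(x_t, t, obs, identity_attn=True)   # incoherent velocity
    # Steer sampling away from the incoherent direction (CFG-style):
    return v_coh + w * (v_coh - v_inc)
```

Computed naively, the two calls above double the per-step cost, which motivates the caching described next.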

The authors reduce this overhead by reusing intermediate features: later attention layers contribute most to incoherence, so the outputs of the earlier layers can be cached and shared between the coherent and incoherent passes, lowering the cost to about 1.5× that of a single pass (see the sketch below). For deeper networks, however, it remains unknown whether perturbing only a small fraction of the late layers is sufficient for effective guidance, or whether broader perturbation is needed.
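
A minimal sketch of this feature-reuse idea, assuming transformer blocks that expose the same hypothetical `identity_attn` flag as above; `blocks`, `head`, and `split` are illustrative names, not the paper's implementation:

```python
import torch
import torch.nn as nn

def acg_velocity_cached(blocks: nn.ModuleList, head: nn.Module,
                        h: torch.Tensor, split: int, w: float = 1.0) -> torch.Tensor:
    """ACG with early-layer caching (illustrative sketch).

    Layers [0, split) are computed once and shared; only layers [split, L)
    run twice (normal vs. identity self-attention). With L layers, the
    extra cost is roughly (L - split) / L of one pass, e.g. ~1.5x total
    when only the last half of the layers is perturbed.
    """
    for blk in blocks[:split]:            # shared prefix: computed once
        h = blk(h, identity_attn=False)
    h_coh, h_inc = h, h
    for blk in blocks[split:]:            # branched tail: evaluated twice
        h_coh = blk(h_coh, identity_attn=False)
        h_inc = blk(h_inc, identity_attn=True)   # perturbed self-attention
    v_coh, v_inc = head(h_coh), head(h_inc)
    return v_coh + w * (v_coh - v_inc)
```

In this notation, the open question asks how small the perturbed tail `L - split` can be kept in much deeper networks while the incoherent field remains an effective guidance signal.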

References

Still, it remains an open question whether perturbing only a small fraction of the latter layers suffices for deeper networks.

Park et al., "ACG: Action Coherence Guidance for Flow-based VLA models," arXiv:2510.22201, 25 Oct 2025, Conclusion.