Isolate the contribution of each component in the Pi0.5-based BEHAVIOR-1K solution

Determine the individual performance contribution of each component introduced in the Pi0.5-based vision-language-action policy for the 2025 BEHAVIOR Challenge. The components in question are correlated noise with beta shrinkage for flow matching, KV cache transformation for mixed-layer attention, System 2 stage prediction with voting, custom attention masks, delta action space with per-timestamp normalization, multi-sample flow matching, FAST auxiliary training, correlation-aware inpainting, action compression via cubic splines, and task-specific correction rules. Each contribution should be established through rigorous ablation experiments on BEHAVIOR-1K tasks, measured by q-score and binary success.
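One simple way to organize such a study is a leave-one-out design: evaluate the full system, then re-evaluate with exactly one component disabled at a time, and attribute to each component the resulting drop in the metric. The sketch below only enumerates the configurations and computes the differences. The identifiers, the `AblationConfig` class, and the helper functions are hypothetical and not taken from the paper's code; the metric values would have to come from actual BEHAVIOR-1K evaluation runs.

```python
from dataclasses import dataclass

# Component names follow the problem statement; the identifiers themselves
# are illustrative assumptions, not names from the authors' code.
COMPONENTS = (
    "correlated_noise_beta_shrinkage",
    "kv_cache_transformation",
    "system2_stage_voting",
    "custom_attention_masks",
    "delta_actions_per_timestamp_norm",
    "multi_sample_flow_matching",
    "fast_auxiliary_training",
    "correlation_aware_inpainting",
    "cubic_spline_action_compression",
    "task_specific_correction_rules",
)


@dataclass(frozen=True)
class AblationConfig:
    """One experimental arm: a name and the set of enabled components."""
    name: str
    enabled: frozenset


def leave_one_out_configs(components=COMPONENTS):
    """Return the full configuration plus one arm per disabled component."""
    full = frozenset(components)
    configs = [AblationConfig("full", full)]
    for component in components:
        configs.append(AblationConfig(f"no_{component}", full - {component}))
    return configs


def contributions(scores):
    """Map each ablated component to (full score - ablated score).

    `scores` maps arm names ("full", "no_<component>") to a mean metric
    such as q-score or success rate, measured by the caller.
    """
    full = scores["full"]
    return {
        name[len("no_"):]: full - score
        for name, score in scores.items()
        if name != "full"
    }


if __name__ == "__main__":
    for config in leave_one_out_configs():
        print(config.name, len(config.enabled))
```

A leave-one-out design measures only marginal effects relative to the full system, so it cannot separate components that interact. Training-time components would likely need a retrained checkpoint per arm, whereas inference-time heuristics such as action compression and correction rules could plausibly be toggled on a fixed checkpoint.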

Background

The paper reports a first-place solution for the 2025 BEHAVIOR Challenge built upon the Pi0.5 architecture and augmented with multiple innovations, including correlation-aware noise for flow matching, mixed-layer attention via KV cache transformation, System 2 stage tracking, multi-sample flow matching, and inference-time heuristics such as action compression and correction rules.

Due to limited resources and competition constraints, the authors did not conduct comprehensive ablations and explicitly state that they could not isolate which components materially drive performance gains. Establishing component-level contributions would clarify which innovations are necessary and which are redundant, and would guide future model design and the allocation of training budgets.

References

Due to resource constraints, we could not isolate the contribution of each component. Rigorous ablation studies would be valuable to identify which innovations actually matter and which are redundant.

Task adaptation of Vision-Language-Action model: 1st Place Solution for the 2025 BEHAVIOR Challenge (2512.06951 - Larchenko et al., 7 Dec 2025) in Section 7, Discussion and Conclusion