Unclear causes of performance decline when replacing CNN blocks with ViT in FoundationGait

Determine the causes of the observed performance decline when substituting the final two convolutional backbone blocks of FoundationGait with a 12-layer Vision Transformer module, in the self-supervised pretraining and evaluation setting described for FoundationGait-0.03B.

Background

In the ablation study, the authors replaced the last two CNN blocks of FoundationGait-0.03B with a 12-layer Vision Transformer (ViT) module, which led to a noticeable performance drop on CASIA-B and Gait3D. They report the degradation but explicitly state that its causes are unclear, leaving the effect of this architectural substitution in gait models an unresolved technical question.

Understanding this decline matters because a ViT-based design could help align gait foundation models with transformer-centric ecosystems and potentially facilitate cross-domain integration. Without clarity on the root causes of the degradation, however, principled ViT adoption remains challenging.

References

The reasons for this decline remain unclear and are not specific to FoundationGait.

Silhouette-based Gait Foundation Model (2512.00691 - Ye et al., 30 Nov 2025) in Ablation Study, ViT Replacement