Flexible Vision Transformer (FiT)

Updated 3 December 2025
  • Flexible Vision Transformer (FiT) is a dynamic Vision Transformer that adapts key architectural parameters to variable configurations with minimal retraining.
  • FiT employs curriculum-based elasticity and stable subnet sampling to mitigate gradient interference and optimize performance across a range of settings.
  • FiT’s innovative routing, variable token handling, and adaptable positional encodings support robust applications in classification, segmentation, and image synthesis.

A Flexible Vision Transformer (FiT) is a Vision Transformer (ViT) architecture adapted to support variable configurations or dynamic data regimes with minimal or no retraining, allowing for flexibility in resource usage, input format, or downstream adaptation. Techniques under the FiT paradigm have been developed for elastic backbone scaling, dynamic patching, adaptable token length, cross-modal inference, and continual learning. FiT and its derivatives yield robust accuracy-compute trade-offs, adapt to variable input conditions, and retain high performance across a range of vision tasks (Zhu et al., 25 Jul 2025, Raghavan et al., 2022, Beyer et al., 2022, Lu et al., 19 Feb 2024, Wang et al., 17 Oct 2024, Zhang et al., 6 Dec 2024, Das et al., 4 Apr 2025).

1. Elastic and Parameter-Sharing Architectures

Flexible ViT methods often instantiate a single backbone capable of representing exponentially many submodels by enabling elasticity in key architectural hyperparameters. EA-ViT introduces a "nested" elastic architecture where, starting from a pre-trained ViT, each Transformer block supports variation along four primary axes: MLP expansion ratio ($R$), number of attention heads ($H$), global embedding dimension ($E$), and network depth via block skipping ($D$) (Zhu et al., 25 Jul 2025). Discrete candidate values for each axis are collected:

  • $R \in \mathcal{R} = \{ r_1, \dots, r_M \}$
  • $H \in \mathcal{H} = \{ h_1, \dots, h_N \}$
  • $E \in \mathcal{E} = \{ e_1, \dots, e_P \}$
  • $D \in \mathcal{D} = \{0, 1\}^L$ (block inclusion mask over the $L$ blocks)

Submodels are realized by selecting one candidate value along each axis, and "nested" parameter sharing is enforced by sorting channels, heads, and MLP units by importance and allocating the most important ones first, so that every submodel reuses weights of the full-capacity model. This allows rapid instantiation of submodels of various sizes for deployment under heterogeneous resource constraints (Zhu et al., 25 Jul 2025).
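
The following sketch illustrates the nested-slicing idea under simplifying assumptions (separate q/k/v projections and importance-sorted weights); the class, function names, and configuration arguments are illustrative, not EA-ViT's actual code.

```python
import torch.nn as nn

# Minimal sketch of nested parameter sharing: the full-capacity block keeps channels, heads,
# and MLP units sorted by importance, so any submodel is a prefix slice of the full weights.
# FullBlock and the separate q/k/v projections are illustrative simplifications.

class FullBlock(nn.Module):
    def __init__(self, embed_dim=768, num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.head_dim = embed_dim // num_heads
        self.q = nn.Linear(embed_dim, embed_dim)
        self.k = nn.Linear(embed_dim, embed_dim)
        self.v = nn.Linear(embed_dim, embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)
        self.fc1 = nn.Linear(embed_dim, int(mlp_ratio * embed_dim))
        self.fc2 = nn.Linear(int(mlp_ratio * embed_dim), embed_dim)

def prefix_slice(layer: nn.Linear, in_dim: int, out_dim: int) -> nn.Linear:
    """Copy the leading rows/columns of a Linear layer into a smaller one."""
    sub = nn.Linear(in_dim, out_dim)
    sub.weight.data.copy_(layer.weight.data[:out_dim, :in_dim])
    sub.bias.data.copy_(layer.bias.data[:out_dim])
    return sub

def extract_block(full: FullBlock, E: int, H: int, R: float, keep: bool) -> nn.Module:
    """Instantiate one submodel block for a sampled (R, H, E, D) configuration."""
    if not keep:                                  # depth mask D: skip this block entirely
        return nn.Identity()
    attn_dim, hidden = H * full.head_dim, int(R * E)
    return nn.ModuleDict({
        "q": prefix_slice(full.q, E, attn_dim),
        "k": prefix_slice(full.k, E, attn_dim),
        "v": prefix_slice(full.v, E, attn_dim),
        "proj": prefix_slice(full.proj, attn_dim, E),
        "fc1": prefix_slice(full.fc1, E, hidden),
        "fc2": prefix_slice(full.fc2, hidden, E),
    })

sub_block = extract_block(FullBlock(), E=384, H=6, R=2.0, keep=True)
```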

Scala further explores width-wise flexibility by treating any subnet as a contiguous slice of the weight tensors parameterized by a width ratio $r \in [s, 1]$. Isolation and scale coordination during training address cross-subnet gradient interference, yielding near-oracle performance with a single model (Zhang et al., 6 Dec 2024).
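
A minimal sketch of width-ratio slicing in the spirit of Scala, assuming one shared weight tensor and prefix slicing at forward time; the class name and interface are illustrative, not the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Width-slimmable linear layer: a single weight tensor serves every subnet, and a width
# ratio r in [s, 1] selects a contiguous slice of its rows at forward time.

class SlimmableLinear(nn.Linear):
    def forward(self, x: torch.Tensor, r: float = 1.0) -> torch.Tensor:
        out_dim = max(1, int(r * self.out_features))
        in_dim = x.shape[-1]                      # the previous layer already ran at ratio r
        weight = self.weight[:out_dim, :in_dim]
        bias = self.bias[:out_dim] if self.bias is not None else None
        return F.linear(x, weight, bias)

layer = SlimmableLinear(768, 3072)
x = torch.randn(2, 196, 384)                      # upstream activations at r = 0.5
y = layer(x, r=0.5)                               # uses a 1536 x 384 slice of the shared weight
print(y.shape)                                    # torch.Size([2, 196, 1536])
```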

2. Flexible Training Strategies and Optimization

Training all possible submodels simultaneously leads to gradient interference. Flexible ViT frameworks therefore introduce architectural flexibility gradually or stabilize submodel sampling so that joint performance can be optimized:

  • Curriculum-Based Elasticity Expansion: EA-ViT introduces architectural flexibility progressively over a scheduled sequence of training steps. At each curriculum phase, new ranges for $R$, $H$, $E$, or $D$ are unlocked, and submodels are sampled from the current feasible configuration set. This staged approach avoids the convergence difficulties associated with immediate exposure to the full combinatorial design space (Zhu et al., 25 Jul 2025); see the sketch after this list.
  • Scale Coordination and Stable Sampling: Scala samples multiple width-ratio subnets per batch (smallest, largest, plus two intermediates) and coordinates loss propagation across them. The smallest subnet is "isolated" via anti-prefix weight slicing to prevent dominant gradient flow; each intermediate subnet is distilled from its next-larger neighbor, forming a progressive knowledge transfer chain. Stable sampling ensures broad coverage of the width spectrum in each batch (Zhang et al., 6 Dec 2024).
  • Functionally Invariant Paths (FIP): For continual and multi-task learning, adaptation is formulated as traversal along a geodesic in weight space with respect to a Riemannian metric capturing functional invariance on the source task. Each update step minimizes output drift under the metric while guiding the weights toward performance on new objectives. This yields Pareto-optimal trade-offs between retention and adaptation, facilitating continual learning and network sparsification (Raghavan et al., 2022).
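
As referenced in the first item above, here is a minimal sketch of curriculum-based elasticity expansion. The candidate sets, phase boundaries, and `steps_per_phase` are assumptions rather than EA-ViT's published hyperparameters; only the staged unlocking and sampling mechanism follow the description.

```python
import random

# Elastic axes are unlocked in stages; each training step samples one submodel
# configuration from the currently feasible set.

FULL_SPACE = {
    "mlp_ratio":  [4.0, 3.0, 2.0],       # R candidates, full-capacity value first
    "num_heads":  [12, 9, 6],            # H candidates
    "embed_dim":  [768, 576, 384],       # E candidates
    "keep_block": [True, False],         # D: per-block inclusion
}
# Each phase records how many candidates per axis are unlocked so far.
CURRICULUM = [
    {"mlp_ratio": 1, "num_heads": 1, "embed_dim": 1, "keep_block": 1},  # full model only
    {"mlp_ratio": 3, "num_heads": 1, "embed_dim": 1, "keep_block": 1},  # elastic MLP ratio
    {"mlp_ratio": 3, "num_heads": 3, "embed_dim": 1, "keep_block": 1},  # + elastic heads
    {"mlp_ratio": 3, "num_heads": 3, "embed_dim": 3, "keep_block": 2},  # + embed dim and depth
]

def sample_config(step: int, steps_per_phase: int = 10_000) -> dict:
    """Sample one submodel configuration from the axes unlocked at this training step."""
    phase = CURRICULUM[min(step // steps_per_phase, len(CURRICULUM) - 1)]
    return {axis: random.choice(FULL_SPACE[axis][:k]) for axis, k in phase.items()}

print(sample_config(step=25_000))   # e.g. {'mlp_ratio': 2.0, 'num_heads': 9, 'embed_dim': 768, ...}
```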

3. Routing, Submodel Selection, and Inference

To enable task-adaptive or resource-adaptive inference, some FiT methods utilize lightweight routers:

  • Router via Pareto-Initialized MLP: EA-ViT first conducts a Pareto search over submodel configurations via NSGA-II, seeking optimal trade-offs between accuracy and resource use. Pareto-optimal configurations warm-start a router (a two-layer MLP) which, conditioned on a normalized budget, outputs a submodel configuration $\theta$ for execution via a Gumbel-Sigmoid relaxation. Joint optimization of the router and backbone under a combined loss with curriculum-annealed regularization ensures adherence to the Pareto front while allowing flexible exploration (Zhu et al., 25 Jul 2025); see the router sketch after this list.
  • Slice-at-Inference: Scala supports direct selection of the width ratio $r$ based on the instantaneous compute or energy budget. For ViT architectures using LayerNorm, no further calibration is needed at inference (Zhang et al., 6 Dec 2024).
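
As referenced in the first item above, a hedged sketch of a budget-conditioned router with a Gumbel-Sigmoid relaxation follows. The gate encoding of the configuration, the layer widths, and the temperature are assumptions rather than EA-ViT's exact design; only the overall mechanism follows the description.

```python
import torch
import torch.nn as nn

# Two-layer MLP maps a normalized budget to gate logits; Gumbel-Sigmoid produces a
# differentiable (straight-through) submodel selection.

class BudgetRouter(nn.Module):
    def __init__(self, num_gates: int, hidden: int = 64, tau: float = 1.0):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, num_gates))
        self.tau = tau

    def forward(self, budget: torch.Tensor) -> torch.Tensor:
        logits = self.mlp(budget.view(-1, 1))               # budget normalized to [0, 1]
        u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
        noise = torch.log(u) - torch.log(1 - u)             # logistic (Gumbel-difference) noise
        soft = torch.sigmoid((logits + noise) / self.tau)   # Gumbel-Sigmoid relaxation
        hard = (soft > 0.5).float()
        return hard + soft - soft.detach()                  # discrete forward, soft gradients

router = BudgetRouter(num_gates=16)                         # e.g. one gate per elastic choice
gates = router(torch.tensor([0.4]))                         # request ~40% of the full budget
```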

4. Flexible Input, Patch, and Token Handling

FiT advances input flexibility along several axes:

  • Patch Size Randomization: FlexiViT trains on patch sizes $p$ randomly sampled from a predefined set, so that a single set of weights yields near-optimal performance across a range of sequence lengths and compute-accuracy operating points. Architectural adjustments at inference (bilinear or PI-resize of the embedding and positional maps) are lightweight, and all transformer weights are shared (Beyer et al., 2022); see the embedding-resize sketch after this list.
  • Variable-Length Token Streams: For adaptable data or multimodal settings, AdaViT introduces a dynamic tokenizer. Each input modality is patchified and mapped via learnable modality vectors and dynamic convolutions, producing a variable-length concatenation per case. The Transformer operates natively on these variable-length token streams with standard self-attention and positional plus modality embeddings (Das et al., 4 Apr 2025). Similarly, for image generation in diffusion models, FiT reshapes VAE latents into a variable-length sequence, enabling arbitrary resolution and aspect ratio (Lu et al., 19 Feb 2024, Wang et al., 17 Oct 2024).
  • Position Embedding for Variable Resolution: FiT and FiTv2 replace learned 2D grid position encodings with 2D rotary positional embeddings (RoPE), providing seamless extrapolation to unseen spatial grids (Lu et al., 19 Feb 2024, Wang et al., 17 Oct 2024); see the RoPE sketch after this list.
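
For the patch-size flexibility above (FlexiViT), the following simplified sketch shows how the patchify layer and positional table can be adapted to a new patch size. FlexiViT's PI-resize is a pseudo-inverse-based resize; plain bilinear interpolation, shown here, is the simpler baseline it improves on, and the shapes and helper names are illustrative.

```python
import torch
import torch.nn.functional as F

def resize_patch_embed(weight: torch.Tensor, new_p: int) -> torch.Tensor:
    """weight: [embed_dim, 3, p, p] conv kernels of the patch embedding."""
    return F.interpolate(weight, size=(new_p, new_p), mode="bilinear", align_corners=False)

def resize_pos_embed(pos: torch.Tensor, new_grid: int) -> torch.Tensor:
    """pos: [1, g*g, embed_dim] learned positional table for a g x g token grid."""
    g = int(pos.shape[1] ** 0.5)
    grid = pos.reshape(1, g, g, -1).permute(0, 3, 1, 2)             # [1, dim, g, g]
    grid = F.interpolate(grid, size=(new_grid, new_grid), mode="bilinear", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, -1)

w16 = torch.randn(768, 3, 16, 16)           # patchify weights trained at p = 16
w8 = resize_patch_embed(w16, new_p=8)       # reused at p = 8 (4x longer token sequence)
```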
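
For the variable-resolution positional encoding above, a minimal 2D RoPE sketch: half of the head dimension is rotated by the token's x grid coordinate and the other half by its y coordinate, so positions extrapolate to unseen grid sizes. The base frequency and the half/half split are common RoPE conventions, not necessarily FiT's exact values.

```python
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: [..., tokens, d] with d even; pos: [tokens] coordinates along one axis."""
    d = x.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # [d/2]
    angles = pos[:, None].float() * freqs[None, :]                      # [tokens, d/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x: torch.Tensor, xs: torch.Tensor, ys: torch.Tensor) -> torch.Tensor:
    """x: [batch, heads, tokens, head_dim]; xs, ys: per-token grid coordinates."""
    half = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :half], xs), rope_1d(x[..., half:], ys)], dim=-1)

h, w = 4, 6                                         # arbitrary, non-square token grid
ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
q = torch.randn(1, 8, h * w, 64)                    # queries (keys are rotated the same way)
q_rot = rope_2d(q, xs.flatten(), ys.flatten())
```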

5. Application to Diffusion Models and Image Synthesis

FiT and FiTv2 extend to highly flexible latent diffusion models:

  • Dynamic Token Sequences with Masked MHSA: FiT treats each image as a variable-length token sequence that is padded and masked. At each diffusion denoising step, only real tokens participate in attention, allowing images of arbitrary spatial shape and aspect ratio to be processed by a single model (Lu et al., 19 Feb 2024); see the masked-attention sketch after this list.
  • 2D Rotary Embedding and Extrapolation: The RoPE-based positional encodings in FiT enable extrapolation to resolutions and aspect ratios not seen during training, supported by vision-centric extrapolation schemes (VisionNTK, VisionYaRN). FiTv2 adds QK-Norm for attention stability, AdaLN-LoRA for memory-efficient conditioning, rectified-flow scheduling, and logit-normal timestep sampling, achieving state-of-the-art synthesis FIDs across diverse and out-of-distribution resolutions (Wang et al., 17 Oct 2024); see the QK-Norm sketch after this list.
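
For the masked attention above, a sketch of the pad-and-mask pattern for variable-length token sequences: images of different resolutions and aspect ratios yield different token counts, so shorter sequences are zero-padded and padded positions are excluded from attention. The batching helper and shapes are illustrative, not FiT's implementation.

```python
import torch
import torch.nn.functional as F

def pad_and_mask(token_lists):
    """token_lists: list of [n_i, dim] tensors with varying n_i."""
    max_len = max(t.shape[0] for t in token_lists)
    dim = token_lists[0].shape[1]
    tokens = torch.zeros(len(token_lists), max_len, dim)
    valid = torch.zeros(len(token_lists), max_len, dtype=torch.bool)
    for i, t in enumerate(token_lists):
        tokens[i, : t.shape[0]] = t
        valid[i, : t.shape[0]] = True
    return tokens, valid

def masked_attention(q, k, v, valid):
    """q, k, v: [batch, heads, tokens, head_dim]; valid: [batch, tokens]."""
    # Boolean mask: True entries may be attended to, so padded keys are ignored.
    attn_mask = valid[:, None, None, :]                        # [batch, 1, 1, tokens]
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)

tokens, valid = pad_and_mask([torch.randn(192, 64), torch.randn(256, 64)])
q = k = v = tokens.unsqueeze(1)                                # single head, head_dim = 64
out = masked_attention(q, k, v, valid)
```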
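
For the FiTv2 stability component above, a minimal illustration of QK-Norm: queries and keys are normalized before the dot product so attention logits cannot grow unboundedly at scale. LayerNorm over the head dimension is one common variant; FiTv2's exact normalization choice may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    def __init__(self, head_dim: int):
        super().__init__()
        self.q_norm = nn.LayerNorm(head_dim)
        self.k_norm = nn.LayerNorm(head_dim)

    def forward(self, q, k, v):
        # q, k, v: [batch, heads, tokens, head_dim]
        q, k = self.q_norm(q), self.k_norm(k)
        return F.scaled_dot_product_attention(q, k, v)

attn = QKNormAttention(head_dim=64)
x = torch.randn(2, 8, 196, 64)
out = attn(x, x, x)
```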

6. Empirical Benchmarks and Trade-Offs

FiT-based models demonstrate substantial empirical improvements:

| Method | Flexibility Axis | Key Result Domains | Gains Over Baselines |
|---|---|---|---|
| EA-ViT | Width, depth, heads, MLP ratio | Classification, segmentation | +2-10% Top-1 at low MACs vs. elastic ViT baselines |
| Scala | Width ratio | ImageNet-1K, DeiT-B/S | Matches or exceeds separately trained subnets |
| FlexiViT | Patch size | Classification, transfer, retrieval | Matches or exceeds fixed-patch-size ViTs |
| FiT, FiTv2 | Token count, resolution | Diffusion-based image synthesis | SOTA FID at OOD resolutions, 2x faster convergence |
| AdaViT | Modality / intrinsic token count | Medical imaging segmentation | 10-30% Dice improvement in zero-/few-shot settings |

FiT-based approaches unify flexible deployment, efficient adaptation, and high performance across train-time and deployment axes (Zhu et al., 25 Jul 2025, Beyer et al., 2022, Zhang et al., 6 Dec 2024, Lu et al., 19 Feb 2024, Wang et al., 17 Oct 2024, Das et al., 4 Apr 2025).

7. Limitations and Extensions

Challenges persist in scaling FiT to extremely large backbone sizes due to increased memory needs (e.g., for Jacobian computations in FIP (Raghavan et al., 2022)) and limits in extrapolating architectural or input flexibility beyond the range seen during training (e.g., FlexiViT’s patch-size regime (Beyer et al., 2022)). Layerwise approximations, curriculum-based or staged extension of elasticity, and deployment of adaptive position encoding schemes (VisionNTK/YaRN) partially alleviate these constraints. Extensions to hierarchical backbones, multimodal transformers, and richer tokenization or routing policies represent active areas for further research (Das et al., 4 Apr 2025, Wang et al., 17 Oct 2024). FiTv2 demonstrates that FiT architectures can scale up to multi-billion parameter regimes and adapt to new output domains post-training with minimal overhead (Wang et al., 17 Oct 2024).
