
Feature Linear Adaptation (FLA)

Updated 4 December 2025
  • Feature Linear Adaptation (FLA) is a parameter-efficient technique that injects low-rank linear updates into pre-trained models to adapt their internal feature representations.
  • It operates by introducing learnable adapters within key layers, freezing core weights while enabling robust performance under distributional shifts.
  • FLA achieves high sample efficiency and reduced parameter overhead in applications like vision-language-action models and transformer fine-tuning.

Feature Linear Adaptation (FLA) denotes a family of parameter-efficient adaptation techniques that modify the internal representations of pre-trained models via learnable linear transformations in feature space. FLA techniques have emerged as alternatives to classical weight-space adaptation strategies, addressing robustness, generalization under distribution shift, and sample efficiency across diverse domains such as vision-language-action (VLA) models, large-scale transformers, and high-dimensional sparse regression. They share the core concept of leveraging low-rank or structure-constrained linear operations on features to reconcile the distributed knowledge in pre-trained models with new data, while typically freezing the bulk of pre-trained model parameters.

1. Mathematical Foundations and Parameterization

Feature Linear Adaptation methods inject trainable linear mappings at designated sites within the architecture, operating in the space of features rather than directly modifying pre-trained weights. The canonical form, as instantiated in vision transformers (ViTs), parameterizes the adaptation at each linear layer as follows:

Let $W\in\mathbb{R}^{d_\mathrm{out}\times d_\mathrm{in}}$ be a frozen pre-trained weight matrix. FLA adds a low-rank trainable update $\Delta W = U V^T$, where $U\in\mathbb{R}^{d_\mathrm{out}\times r}$ and $V\in\mathbb{R}^{d_\mathrm{in}\times r}$ with $r \ll \min(d_\mathrm{in}, d_\mathrm{out})$. The adapted layer thus computes:

$h' = (W + U V^T) x = W x + U (V^T x).$

Only $U$ and $V$ are updated during adaptation; $W$ remains frozen (Li et al., 2 Dec 2025).
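
A minimal PyTorch sketch of this parameterization is given below; it assumes a standard nn.Linear base layer, and the class and variable names are illustrative rather than taken from the cited work:

import torch
import torch.nn as nn

class FLALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update U V^T."""
    def __init__(self, base: nn.Linear, r: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze W (and bias)
            p.requires_grad = False
        d_out, d_in = base.out_features, base.in_features
        self.U = nn.Parameter(0.01 * torch.randn(d_out, r))   # small random init
        self.V = nn.Parameter(0.01 * torch.randn(d_in, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h' = W x + U (V^T x); gradients reach only U and V
        return self.base(x) + (x @ self.V) @ self.U.T

Because the update is additive, setting $U V^T = 0$ recovers the pre-trained layer exactly.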

A closely related design is Feature-Space Adapter composition, exemplified by LoRFA (Low-Rank Feature Adaptation) and VeFA (Vector-based Feature Adaptation) (Wang et al., 22 Oct 2025). Here, the adaptation acts before the frozen weight matrix $W_0^{(\ell)}$:

  • LoRFA: Transforms $x$ via $x' = (I + B^{(\ell)} A^{(\ell)}) x$, where $A^{(\ell)} \in \mathbb{R}^{r\times q}$ and $B^{(\ell)}\in\mathbb{R}^{q\times r}$. The forward pass becomes $W_0^{(\ell)} x' = W_0^{(\ell)} (I + B^{(\ell)} A^{(\ell)}) x$.
  • VeFA: Uses a parameter-efficient diagonal scaling, $x' = (I + \Lambda_b^{(\ell)}) x$, with $\Lambda_b^{(\ell)} = \mathrm{diag}(b^{(\ell)})$ for $b^{(\ell)}\in\mathbb{R}^q$ (a code sketch of both variants follows this list).
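
A minimal sketch of the two feature-space variants, again in PyTorch with a frozen nn.Linear playing the role of $W_0^{(\ell)}$ (module names are illustrative, not taken from the cited work):

import torch
import torch.nn as nn

class LoRFALinear(nn.Module):
    """Low-rank feature adapter applied before a frozen linear map W0."""
    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        q = base.in_features
        self.A = nn.Parameter(0.01 * torch.randn(r, q))   # A: r x q
        self.B = nn.Parameter(0.01 * torch.randn(q, r))   # B: q x r

    def forward(self, x):
        x_adapted = x + (x @ self.A.T) @ self.B.T         # x' = (I + B A) x
        return self.base(x_adapted)                        # W0 x'

class VeFALinear(nn.Module):
    """Diagonal (vector) feature scaling: x' = (I + diag(b)) x, then W0 x'."""
    def __init__(self, base: nn.Linear):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.b = nn.Parameter(torch.zeros(base.in_features))  # q scalars; zero init starts at identity

    def forward(self, x):
        return self.base(x * (1.0 + self.b))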

Typical insertion sites are the linear layers or attention projections of deep architectures.

2. Adaptation Workflow and Optimization

Feature Linear Adaptation procedures generally follow a consistent protocol:

  1. For each selected layer $\ell$, initialize adapter parameters ($U_\ell, V_\ell$, or $A^{(\ell)}, B^{(\ell)}$, or $b^{(\ell)}$) with small random values (e.g., Gaussian with zero mean and small variance).
  2. Freeze all original model weights. Only the newly introduced adapter parameters are made trainable.
  3. Define the adaptation loss—the standard objective (such as cross-entropy or negative log-likelihood) computed on the outputs of the model with adapters in place.
  4. Optionally, regularize adapter parameters with weight decay.
  5. Optimize the loss using modern optimizers (e.g., AdamW) with learning rates and decay chosen based on downstream task and adapter type.
  6. After adaptation (e.g., 1.5k–2k steps in one-shot FLA for VLA), deploy the model with updated adapter parameters while keeping the core model weights unchanged (Li et al., 2 Dec 2025, Wang et al., 22 Oct 2025).

Typical FLA adaptation pseudocode, as introduced for VLA spatial adaptation, is as follows:

for each selected linear layer ℓ:
    insert adapter ΔW_ℓ = U_ℓ V_ℓ^T
    freeze W_ℓ; mark U_ℓ, V_ℓ as trainable
for T optimization steps:
    sample a minibatch from the demonstration data
    compute the adapted forward pass
    compute the loss (e.g., action cross-entropy)
    backpropagate gradients only into U_ℓ, V_ℓ
    update U_ℓ, V_ℓ via AdamW
deploy the model with the trained adapters
(Li et al., 2 Dec 2025).
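
A runnable PyTorch sketch of this loop is shown below. The model, data, loss, and hyperparameters are placeholders chosen for illustration; only the structure (freeze, attach low-rank adapters to every linear layer, optimize U and V with AdamW) follows the protocol above:

import torch
import torch.nn as nn

def add_fla_adapters(model: nn.Module, r: int = 16):
    """Freeze the model and attach a trainable low-rank update U V^T to every
    nn.Linear via a forward hook, so each layer computes W x + U (V^T x)."""
    for p in model.parameters():
        p.requires_grad = False                            # freeze all original weights
    adapter_params = []
    for module in model.modules():
        if isinstance(module, nn.Linear):
            U = nn.Parameter(0.01 * torch.randn(module.out_features, r))
            V = nn.Parameter(0.01 * torch.randn(module.in_features, r))
            module.fla_U, module.fla_V = U, V              # register on the module
            module.register_forward_hook(
                lambda m, inp, out, U=U, V=V: out + (inp[0] @ V) @ U.T)
            adapter_params += [U, V]
    return adapter_params

# Illustrative usage with a toy model and synthetic minibatches (not the cited VLA setup).
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
params = add_fla_adapters(model, r=4)
opt = torch.optim.AdamW(params, lr=1e-4, weight_decay=1e-10)

for step in range(2000):                                   # e.g., 1.5k-2k steps in one-shot FLA
    x = torch.randn(8, 32)                                 # placeholder demonstration minibatch
    y = torch.randint(0, 10, (8,))
    loss = nn.functional.cross_entropy(model(x), y)        # adaptation loss on adapted outputs
    opt.zero_grad()
    loss.backward()                                        # gradients reach only U and V
    opt.step()
# Deploy `model` with the trained adapters; the frozen core weights are unchanged.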

3. Model Placement, Parameter Efficiency, and Variants

The placement of FLA adapters is application-dependent but generally targets projections governing the model's core feature transformations. In vision-language-action policies (e.g., SigLIP ViT backbone), adapters are inserted into every linear layer of the vision encoder (Li et al., 2 Dec 2025). In transformer-based NLP or NLG, adaptation typically occurs in the query/value projections or MLP intermediates (Wang et al., 22 Oct 2025).

A key feature of FLA is parameter efficiency. The per-layer trainable parameter count is $r(d_\mathrm{in} + d_\mathrm{out})$ for FLA/LoRA-style low-rank adaptation, $2qr$ for LoRFA, and $q$ for VeFA (where $q$ is the feature dimension per layer). As an empirical illustration, in a RoBERTa-base GLUE fine-tuning task, LoRA requires $\sim 300$k parameters, while VeFA uses only $\sim 18$k (Wang et al., 22 Oct 2025). Similarly, on a 27-layer ViT with $d=2048$ and $r=16$, FLA introduces $4.7$M trainable parameters, less than $1\%$ of the backbone, yet matches full LoRA in accuracy (Li et al., 2 Dec 2025).

Adapter   Per-layer Params       Example: RoBERTa GLUE   Example: GPT-2 Large
LoRA      $r(q+p)$               $\sim 300$k             $\sim 800$k
LoRFA     $2qr$                  N/A                     N/A
VeFA      $q$                    $\sim 18$k              $\sim 49$k

The rank hyperparameter $r$ governs the expressive power and overhead; $r=16$ sufficed to close much of the generalization gap in practical VLA adaptation.
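
A quick back-of-the-envelope check of these per-layer counts (the layer width used below is an assumption for illustration, not the exact architecture of the cited models):

def fla_lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)      # low-rank update U V^T

def lorfa_params(q: int, r: int) -> int:
    return 2 * q * r               # A (r x q) plus B (q x r)

def vefa_params(q: int) -> int:
    return q                       # a single scaling vector b per layer

d, r = 2048, 16                    # a square projection at the quoted width and rank
print(fla_lora_params(d, d, r))    # 65536
print(lorfa_params(d, r))          # 65536
print(vefa_params(d))              # 2048
# Summing r*(d_in + d_out) over all adapted linear layers of a 27-layer ViT
# (including its non-square MLP projections) gives a total on the order of a
# few million parameters, consistent with the ~4.7M figure quoted above.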

4. Theoretical Guarantees and Robustness

FLA's underlying theoretical principle is that critical adaptation needs induced by domain shift (e.g., viewpoint changes or distributional skew) can often be addressed by a low-rank correction to internal feature projections. If the optimal correction $\Delta W^*$ is low-rank or well approximated by its best rank-$r$ factorization, then the propagated feature error $\|\Delta W^* - \Delta W_r\|_2$ scales linearly in the network, allowing a bounded action-distribution shift under a locally Lipschitz policy (Theorem 3 in (Li et al., 2 Dec 2025)).
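
The low-rank premise can be checked numerically: by the Eckart-Young theorem, the spectral-norm error of the best rank-$r$ factorization equals the $(r+1)$-th singular value of $\Delta W^*$. A small NumPy sketch with a synthetic, nearly low-rank correction (purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 256, 256, 16

# Synthetic "optimal correction": an exactly rank-16 matrix plus small noise.
delta_star = rng.normal(size=(d_out, r)) @ rng.normal(size=(r, d_in)) \
             + 1e-3 * rng.normal(size=(d_out, d_in))

# Best rank-r approximation via truncated SVD.
U, s, Vt = np.linalg.svd(delta_star, full_matrices=False)
delta_r = (U[:, :r] * s[:r]) @ Vt[:r, :]

# The spectral-norm error coincides with the (r+1)-th singular value and is
# tiny whenever the true correction is close to rank r.
print(np.linalg.norm(delta_star - delta_r, 2), s[r])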

Empirically, this mechanism realigns the representation manifold of target-domain features with the original pre-training distribution, substantially mitigating brittleness to spatial and distributional shifts. Consistently, feature-space adaptation methods (notably VeFA) have shown stronger preservation of zero-shot generalization (robustness metrics $R_1$, $R_2$) than weight-space methods in low-shot scenarios and transfer tasks (Wang et al., 22 Oct 2025).

5. Empirical Evaluation and Applications

FLA has been deployed across a spectrum of adaptation scenarios:

  • Vision-Language-Action (VLA) Models: In Libero novel-view settings, one-shot FLA adaptation closed the gap between zero-shot and full-data LoRA, achieving a $90.8\%$ success rate (SR) with $4.7$M parameters, compared to $90.3\%$ for LoRA using $467$M parameters (Li et al., 2 Dec 2025).
  • Transformer Fine-Tuning: On GLUE and E2E NLG tasks, VeFA matched or outperformed LoRA and LoRFA with far fewer parameters. It also consistently preserved zero-shot accuracy under distribution shift (Wang et al., 22 Oct 2025).
  • Sparse Linear Regression: Feature adaptation frameworks modify classical Lasso by pre-processing the design matrix to identify "bad" directions (small-eigenvalue directions indicating near-dependencies among features) and adapting the regularization scheme, achieving near-optimal sample complexity for constant sparsity $t$ and providing polynomial speedups over naive search (Kelner et al., 2023); a schematic illustration of the pre-processing idea follows this list.
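
The pre-processing idea in the sparse-regression setting can be illustrated schematically; the snippet below is a simplified, hypothetical version that merely flags near-dependent feature directions by eigen-decomposing the empirical covariance, and is not the algorithm of Kelner et al. (2023):

import numpy as np

def bad_directions(X: np.ndarray, tol: float = 1e-3):
    """Return covariance eigenvectors whose eigenvalues fall below tol times the
    largest eigenvalue, i.e. directions of near-dependence among features."""
    cov = X.T @ X / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    mask = eigvals < tol * eigvals[-1]
    return eigvecs[:, mask], eigvals[mask]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
X[:, 1] = X[:, 0] + 1e-4 * rng.normal(size=200)     # inject one near-duplicate feature
V_bad, lam_bad = bad_directions(X)
print(V_bad.shape)                                  # expect exactly one flagged direction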

FLA variants have also demonstrated superior robustness to limited labeled data and domain shift in low-shot image classification (e.g., CLIP ViT), NLP, and NLG tasks.

6. Practical Considerations and Limitations

The successful application of FLA rests on both principled configuration and awareness of its inherent constraints:

  • Rank Selection: A moderate adapter rank ($r=16$) typically suffices; higher $r$ is only necessary for severe domain drift.
  • Parameter Regularization: Weight decay (e.g., $1 \times 10^{-10}$) is essential to prevent overfitting under few-shot adaptation (a consolidated configuration sketch follows this list).
  • Module Freezing: Only adapter parameters are trainable; language and decoder modules remain fixed to preserve original reasoning and action generation capabilities (Li et al., 2 Dec 2025).
  • Adapter Placement: Selection of layers is crucial—most gains are realized by adapting vision or attention projections.
  • Limitations: FLA assumes domain shifts can be effectively modeled by low-rank corrections. Very large or highly non-linear shifts may exceed the expressivity of linear adapters, necessitating either larger ranks or auxiliary non-linear modules. In sparse regression, successful adaptation is contingent on a modest number of "bad" directions (outlier eigenvalues); when the ill-conditioning is excessive, more sophisticated feature adaptation or alternative regularization is required (Kelner et al., 2023).
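
A hypothetical configuration consolidating these guidelines (the values are illustrative defaults drawn from the discussion above, not the exact settings of the cited works):

fla_config = {
    "rank": 16,                      # moderate rank usually suffices
    "weight_decay": 1e-10,           # light regularization against few-shot overfitting
    "optimizer": "AdamW",
    "lr": 1e-4,                      # assumed value; tune per downstream task
    "steps": 2000,                   # one-shot adaptation budget in the VLA setting
    "adapted_modules": ["vision_encoder"],            # placeholder: adapt vision/attention projections
    "frozen_modules": ["language_model", "decoder"],  # keep reasoning and action generation intact
}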

7. Connections to Broader Feature Adaptation Frameworks

Feature Linear Adaptation is situated within a broader spectrum of feature adaptation strategies, which tailor model representations rather than model parameters to counter ill-conditioning, high-dimensionality, or domain shift. In high-dimensional regression, feature adaptation encompasses (1) eigen-decomposition and iterative "peeling" to isolate ill-behaved directions, (2) dictionary construction to expand the representational basis, and (3) boosting over augmented feature sets for bias reduction, enabling polynomial-factor efficiency gains in sample and runtime complexity (Kelner et al., 2023). The general philosophy extends to neural architectures, where lightweight, structure-regularized transformations on intermediary features achieve broad generalizability and parameter efficiency.

Overall, Feature Linear Adaptation provides a principled, generalizable mechanism for adapting pre-trained systems under new data regimes while preserving accumulated knowledge and minimizing catastrophic forgetting. Its modular, low-overhead nature makes it a versatile tool for diverse applications involving high-dimensional distributions and limited or adversarial data.
