ActiveFT: Active Fine-Tuning Method

Updated 20 November 2025
  • ActiveFT is a method that actively selects informative data to fine-tune pretrained models under limited data or compute budgets.
  • It employs feature-space embeddings, continuous mixture models, and uncertainty quantification to balance representativeness, diversity, and information gain.
  • Empirical evaluations show ActiveFT outperforms random and clustering baselines with significant efficiency and accuracy improvements.

ActiveFT denotes a family of algorithms and methodologies that employ active data selection principles to optimize fine-tuning in machine learning, particularly in data- and label-scarce regimes. The paradigm appears across domains, including computer vision, LLMs, and scientific computing. It seeks to maximally exploit annotation or compute budgets by judiciously selecting training instances or subproblems, departing from passive random data selection. This approach relies on theoretical principles from information theory, uncertainty quantification, and optimal transport, and is characterized by algorithmic efficiency and superior empirical performance over standard baselines.

1. Core Principles of ActiveFT

ActiveFT targets the finite-budget regime of fine-tuning where either labels or compute cycles are limited and must be allocated optimally. Unlike classic active learning, which iteratively augments a labeled set from scratch, ActiveFT typically starts from a large-scale pretrained model and a pool of unlabeled data, without a labeled seed set. The key objective is to select a subset S of size B from an unlabeled pool U such that fine-tuning on S yields minimal generalization error:

S^* = \arg\min_{|S| = B} \mathbb{E}_{(x, y) \sim p_u}\left[\mathrm{error}(f(x; w_S), y)\right]

where f(\cdot; w_S) is the pretrained model further fine-tuned on S (Xie et al., 2023). Selection schemes are driven by principles including sample representativeness (distribution matching), diversity, and information gain regarding model uncertainty (Abgrall et al., 2023, Hübotter et al., 2024).
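
Operationally, this objective is approximated by a select-then-fine-tune loop over the frozen pretrained encoder. The sketch below is a minimal illustration of that loop, not a definitive implementation; select_subset, annotate, and finetune are hypothetical placeholders for a concrete selection criterion (Section 2), an annotation step, and a training routine, and the encode interface is likewise assumed.

```python
import torch
import torch.nn.functional as F


def active_finetune(pretrained_model, pool, budget, select_subset, annotate, finetune):
    """Minimal ActiveFT loop: choose B samples from the unlabeled pool, label them,
    then fine-tune. `select_subset`, `annotate`, and `finetune` are hypothetical
    callables standing in for a concrete criterion, a labeling step, and a trainer."""
    # Embed the pool once with the frozen pretrained encoder (hypothetical `encode` interface).
    with torch.no_grad():
        features = torch.stack([pretrained_model.encode(x) for x in pool])
    features = F.normalize(features, dim=-1)       # unit-norm features

    # Spend the annotation budget B according to the selection criterion.
    selected = select_subset(features, budget)     # indices into the pool

    # Label only the selected samples and fine-tune on this subset S.
    subset = [(pool[i], annotate(pool[i])) for i in selected]
    return finetune(pretrained_model, subset)
```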

2. Methodologies and Algorithms

Two major instances of ActiveFT are prominent in recent literature for computer vision and LLMs.

ActiveFT for Computer Vision

The method from "Active Finetuning" (Xie et al., 2023) introduces a parametric, feature-based algorithm for sample selection:

  • Feature-space embedding: Map each x_i \in U to a normalized feature f_i = f(x_i; w_0).
  • Continuous mixture model: Introduce a set of continuous centroids \{\theta_S^j\}_{j=1}^B and define a mixture p_{\theta_S}, with each component p(f \mid \theta_S^j) sharply peaked at its centroid.
  • Loss function: Minimize

L(\theta_S) = -\frac{1}{N} \sum_{i=1}^N \max_{j} \frac{f_i^\top \theta_S^j}{\tau} + \frac{1}{B} \sum_{j=1}^B \log \Bigg(\sum_{k \neq j} \exp\left(\frac{\theta_S^{j\top} \theta_S^k}{\tau}\right)\Bigg)

The first term encourages distribution matching between selected and pool features, while the second (log-sum-exp repulsion) enforces diversity among selected points.

  • Optimization: The centroids \theta_S are updated via gradient descent and snapped (post-training) to their nearest actual feature vectors to obtain S (a code sketch of this procedure follows this list).
  • Theoretical guarantee: The process provably reduces the Earth Mover’s Distance (EMD) between the full pool and the selected subset in feature space.
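
A minimal PyTorch sketch of this selection procedure is given below, assuming unit-normalized pool features; the optimizer, temperature, and iteration count are illustrative placeholders rather than the paper's configuration, and any duplicate snapped indices would need to be resolved in practice.

```python
import torch
import torch.nn.functional as F


def activeft_select(features: torch.Tensor, budget: int, tau: float = 0.07,
                    steps: int = 300, lr: float = 0.1) -> torch.Tensor:
    """Select `budget` indices from `features` (N x d, assumed L2-normalized).
    Sketch only: hyperparameter values are illustrative, not the published settings."""
    N, _ = features.shape
    # Initialize continuous centroids theta_S at randomly chosen pool features.
    theta = torch.nn.Parameter(features[torch.randperm(N)[:budget]].clone())
    opt = torch.optim.SGD([theta], lr=lr)
    diag = torch.eye(budget, dtype=torch.bool, device=features.device)

    for _ in range(steps):
        theta_n = F.normalize(theta, dim=-1)
        sim = features @ theta_n.T                              # (N, B) similarities f_i . theta_j
        match = -(sim.max(dim=1).values / tau).mean()           # distribution-matching term
        pair = (theta_n @ theta_n.T / tau).masked_fill(diag, float("-inf"))
        diversity = torch.logsumexp(pair, dim=1).mean()         # log-sum-exp repulsion term
        loss = match + diversity
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Snap each optimized centroid to its nearest real feature vector to obtain S.
    with torch.no_grad():
        return (F.normalize(theta, dim=-1) @ features.T).argmax(dim=1)
```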

ActiveFT for LLMs (SIFT Algorithm)

In "Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs" (Hübotter et al., 2024), the SIFT (Select Informative data for Fine-Tuning) algorithm underpins ActiveFT for dynamic LLM adaptation:

  • Uncertainty quantification: Model the next-token logits as f_*(x) = W_* \phi(x) under a surrogate linear embedding \phi(x); the posterior covariance

\sigma_n^2(x) = k(x, x) - k_{X_n}(x)^\top [K_{X_n} + \lambda \kappa I_n]^{-1} k_{X_n}(x)

quantifies epistemic uncertainty.

  • Information gain criterion: At each step, select x \in D maximizing the marginal reduction in uncertainty for the target prompt; concretely,

x_{n+1} = \arg\max_{x \in D} \frac{k^{(n-1)}(p, x)^2}{k^{(n-1)}(x, x) + \lambda'}

  • Redundancy-aware update: Schur-complement updates to the kernel ensure previously selected, possibly redundant, candidates do not dominate subsequent steps.
  • Greedy algorithm: Select N examples in order of maximal joint information gain; an efficient implementation prefilters K candidates via fast nearest-neighbor search (a sketch of the greedy criterion follows this list).
  • Adaptive computation: Use the empirical uncertainty \sigma_n(p) to decide when to halt fine-tuning based on expected performance gain per compute unit.
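
A minimal NumPy sketch of the greedy criterion is given below, assuming a plain dot-product kernel over fixed candidate embeddings. For clarity it re-solves the regularized kernel system at every step; an incremental Schur-complement update of the posterior, as described above, avoids this recomputation. Names and the kernel choice are illustrative assumptions and do not mirror the activeft library's internals.

```python
import numpy as np


def sift_select(prompt_emb: np.ndarray, cand_embs: np.ndarray,
                n_select: int, lam: float = 1e-2) -> list:
    """Greedy SIFT-style selection sketch.
    prompt_emb: (d,) embedding of the target prompt p.
    cand_embs:  (K, d) embeddings of K prefiltered candidates.
    Returns candidate indices, most informative first.  The dot-product kernel
    and all names are illustrative assumptions, not the library API."""
    K_cc = cand_embs @ cand_embs.T          # k(x, x') among candidates
    k_pc = cand_embs @ prompt_emb           # k(p, x) between prompt and candidates

    selected = []
    for _ in range(n_select):
        if selected:
            # Posterior kernel after conditioning on the selected set X_n:
            # k_n(a, b) = k(a, b) - k_Xn(a)^T [K_Xn + lam I]^{-1} k_Xn(b)
            K_ss = K_cc[np.ix_(selected, selected)] + lam * np.eye(len(selected))
            K_sc = K_cc[selected, :]                                 # (n, K)
            sol_c = np.linalg.solve(K_ss, K_sc)                      # [K_Xn + lam I]^{-1} k_Xn(x)
            sol_p = np.linalg.solve(K_ss, k_pc[selected])            # [K_Xn + lam I]^{-1} k_Xn(p)
            post_pc = k_pc - K_sc.T @ sol_p                          # k_n(p, x)
            post_cc = np.diag(K_cc) - np.sum(K_sc * sol_c, axis=0)   # k_n(x, x)
        else:
            post_pc, post_cc = k_pc.copy(), np.diag(K_cc).copy()

        # Marginal uncertainty reduction about the prompt: k_n(p, x)^2 / (k_n(x, x) + lam).
        scores = post_pc ** 2 / (post_cc + lam)
        if selected:                        # already-selected points have near-zero posterior
            scores[selected] = -np.inf      # variance; mask them for numerical safety
        selected.append(int(np.argmax(scores)))
    return selected
```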

3. Theoretical Foundations

ActiveFT algorithms leverage a confluence of theoretical constructs:

  • Distribution matching and diversity: The mixture-model formulation minimizes the Kullback–Leibler divergence between the feature distribution of selected samples and that of the unlabeled pool, while regularization maximizes the spread (diversity) among the centroids \theta_S^j (Xie et al., 2023).
  • Earth Mover’s Distance minimization: An explicit connection is established between the ActiveFT loss and the EMD, providing a guarantee that optimizing the selection loss reduces the EMD between the selected subset and the full pool in feature space (the standard optimal-transport definition is recalled after this list).
  • Uncertainty-driven selection: For LLMs, the SIFT approach is grounded in posterior confidence sets and information-theoretic criteria: each selected point reduces the model’s epistemic uncertainty on the target task (Hübotter et al., 2024).
  • Limitation of nearest-neighbor approaches: It is proved that nearest-neighbor selection fails to guarantee uncertainty reduction and is vulnerable to information duplication, whereas SIFT provably achieves optimal uncertainty decay rates.
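
For reference, the quantity involved is the standard optimal-transport (Wasserstein-1) form of the Earth Mover's Distance between the pool feature distribution p_f and the distribution p_{\theta_S} induced by the selected centroids:

\mathrm{EMD}(p_f, p_{\theta_S}) = \inf_{\gamma \in \Pi(p_f, p_{\theta_S})} \mathbb{E}_{(f, \theta) \sim \gamma}\left[ d(f, \theta) \right]

where \Pi(p_f, p_{\theta_S}) is the set of couplings with marginals p_f and p_{\theta_S}, and d(\cdot, \cdot) is the feature-space distance; minimizing the ActiveFT selection loss is shown to reduce a quantity of this form (Xie et al., 2023).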

4. Empirical Evaluation

Comprehensive comparisons benchmark ActiveFT against established baselines across modalities:

Task | Baseline | ActiveFT Accuracy/Metric | Best Baseline | Margin
CIFAR-10 @ 1% | 82.2 (Random) | 88.2 | 85.9 (KMeans) | +2.3
CIFAR-100 @ 2% | 24.3 (Random) | 40.7 | 31.9 (KMeans) | +8.8
ImageNet @ 1% | 45.1 (Random) | 50.1 | 45.1 | +5.0
ADE20K mIoU @ 10% | 20.3 (Random) | 21.6 | 19.1 (KMeans) | +2.5
Pile FT (bpb Δ, avg) | –21.7% (NN) | –26.5% | –21.7% (NN) | +4.8pp

ActiveFT consistently outperforms both random and clustering-based selection in image classification/segmentation (Xie et al., 2023) and surpasses nearest-neighbor FT by 5–30% reduction in bits-per-byte in language modeling (Hübotter et al., 2024). Notably, computational efficiency is high; e.g., ActiveFT selects CIFAR-100 budgets in 12.6s, vs. over an hour for CoreSet.

5. Implementation and Practical Guidance

ActiveFT algorithms are supplied as extensible, user-focused libraries (e.g., activeft, available via PyPI), and integrate with existing PyTorch and HuggingFace pipelines (Hübotter et al., 2024). The typical workflow involves:

  1. Embedding the data pool via pretrained encoders.
  2. Candidate prefiltering with fast approximate nearest neighbors (Faiss).
  3. ActiveFT/SIFT selection for either annotation (CV) or fine-tuning batches (LLM), as sketched after this list.
  4. Optional adaptive budget allocation, halting when marginal gains diminish.
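
A hypothetical end-to-end sketch of steps 1–3 is shown below, using Faiss for prefiltering and the SIFT-style selector sketched in Section 2. The encoder interface, function names, and parameter values are placeholders and are not the activeft library's actual API.

```python
import faiss
import numpy as np
import torch


@torch.no_grad()
def embed_pool(encoder, texts, batch_size=64):
    """Step 1: embed the data pool with a frozen pretrained encoder (placeholder interface)."""
    chunks = [encoder(texts[i:i + batch_size]).cpu().numpy()
              for i in range(0, len(texts), batch_size)]
    embs = np.concatenate(chunks).astype("float32")
    faiss.normalize_L2(embs)                        # unit norm so inner product = cosine similarity
    return embs


def prefilter_candidates(pool_embs, prompt_emb, k=200):
    """Step 2: fast nearest-neighbor prefiltering of K candidates with Faiss."""
    index = faiss.IndexFlatIP(pool_embs.shape[1])   # exact inner-product index
    index.add(pool_embs)
    query = prompt_emb.reshape(1, -1).astype("float32")
    faiss.normalize_L2(query)
    _, idx = index.search(query, k)
    return idx[0]


def select_finetuning_batch(pool_embs, prompt_emb, n_select=32):
    """Step 3: run the SIFT-style selector (sketched earlier) on the prefiltered candidates."""
    cand_idx = prefilter_candidates(pool_embs, prompt_emb)
    local = sift_select(prompt_emb, pool_embs[cand_idx], n_select)
    return cand_idx[local]
```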

Key practical insights:

  • Feature normalization is essential for selection robustness.
  • Regularization (e.g., log-sum-exp repulsion) is required to avoid collapsed solutions.
  • Adaptive approaches to compute budget further increase efficiency, spending more on hard prompts and less on easy ones.
  • For region- or token-level finetuning, extensions to multi-scale or non-global features are suggested.

6. Extensions, Limitations, and Open Problems

Current ActiveFT methodologies assume fixed pretrained encoders and operation at the global feature level. For vision, fine-grained and region-based sampling remains less explored; for LLMs, surrogate linear models do not always capture true model adaptation dynamics. Adaptive hyperparameter tuning, alternate kernels, and multi-modal or multi-step selection represent fertile directions. Theoretical investigation into scaling laws and broader applications (dialog, code, instruction tuning) is ongoing (Xie et al., 2023, Hübotter et al., 2024). There also remains an irreducible uncertainty floor when the data pool lacks relevant information. Application beyond vision and text, such as point clouds or speech, is considered a promising area for future research.

ActiveFT is conceptually related to classical active learning, uncertainty sampling, and core-set strategies but is uniquely tailored for pretrained model adaptation in the finite-budget regime. Distinct from federated active learning or active flux numerical methods (Goetz et al., 2019, Abgrall et al., 2023), which involve selection of clients in distributed settings or fluxes in PDE solvers, respectively, ActiveFT's defining attribute is its focus on maximizing information gain or distributional coverage per unit annotation or compute, within the pretraining-finetuning paradigm.

In summary, ActiveFT constitutes a theoretically backed, practical, and highly effective solution to budget-constrained fine-tuning, with rigorous guarantees and demonstrated superiority across diverse machine learning benchmarks (Xie et al., 2023, Hübotter et al., 2024).
