FisherSFT: Efficient Data & Feature Selection

Updated 15 March 2026

FisherSFT is a framework of algorithms that utilizes Fisher information to optimize data selection and parameter estimation in supervised learning.
It maximizes the log-determinant of the Fisher information matrix to select training examples, improving fine-tuning efficiency for large language models.
The framework also supports adaptive, budget-constrained feature acquisition for tabular data, reducing computation time while maintaining accuracy.

The FisherSFT framework encompasses a family of algorithms that leverage Fisher information metrics for fast, data-efficient, or parameter-efficient supervised learning. Two principal research lines currently bear the FisherSFT name: (1) information-theoretic selection of training data for LLM supervised fine-tuning, and (2) computationally efficient sequential feature acquisition for classification under budget constraints. Although both draw on Fisher information and Fisher-discriminative statistics, they employ distinct methodologies and address different problems: optimal training-data subset selection for LLMs (Deb et al., 20 May 2025) versus adaptive feature acquisition for tabular or vector data (Mirzaei et al., 2023). The following sections detail the core problem settings, algorithms, mathematical formulations, implementation characteristics, and empirical results for each variant, focusing on their unifying Fisher-information perspective.

1. Data-Efficient Supervised Fine-Tuning via Fisher Information

Large-scale supervised fine-tuning (SFT) of LLMs is often bottlenecked by compute that grows with the number of selected training examples. FisherSFT (Deb et al., 20 May 2025) treats example selection itself as an information-gain maximization problem grounded in classical optimal design.

Let $\{\mathbf{y}_i\}_{i=1}^N$ be a candidate training corpus of $N$ sentences and $n \ll N$ the fine-tuning budget. The SFT objective is standard negative log-likelihood minimization over the selected set $S$: $\mathcal{L}(\theta) = -\frac{1}{|S|} \sum_{(i,j)\in S} \log p_\theta(y_{i,j}|x_{i,j}),$ where $j$ indexes the tokens of sentence $i$. The statistical efficiency of parameter estimation is determined by the Fisher information matrix,

$I(\theta; S) = -\mathbb{E}_{(x, y) \in S}[\nabla^2_\theta \ell(\theta; x)],$

where $\ell(\theta;x)$ is the log-likelihood. Information gain about $\theta$ is quantified by the log-determinant (D-optimal) design criterion: $\mathcal{G}(S) = \log \det I(\theta^\star; S).$ Computing the full Fisher matrix is intractable for LLMs because of its dimensionality and because the true parameters $\theta^\star$ are unknown. FisherSFT circumvents this via last-layer linearization, yielding a surrogate design objective: $f(S) \equiv \log \det \left( \sum_{(i,j) \in S} x_{i,j} x_{i,j}^T \right),$ where $x_{i,j}$ are penultimate-layer embeddings, extracted with a forward pass of the frozen base model.
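As a rough illustration of the surrogate objective, the sketch below evaluates $f(S)$ for a subset of sentences with NumPy. The array layout (one `(tokens, d)` array per sentence), the function name `surrogate_logdet`, and the small ridge term `reg` (added so the determinant stays finite when few tokens are selected) are assumptions made for this example, not details taken from the source.

```python
import numpy as np

def surrogate_logdet(embeddings, subset, reg=1e-6):
    """log det( reg*I + sum over selected sentences i, tokens j of x_ij x_ij^T )."""
    d = embeddings[0].shape[1]
    design = reg * np.eye(d)       # ridge term is an assumption; keeps the matrix positive definite
    for i in subset:
        X = embeddings[i]          # shape (tokens_i, d), hypothetical penultimate-layer embeddings
        design += X.T @ X          # equals sum_j x_ij x_ij^T for sentence i
    _, logdet = np.linalg.slogdet(design)
    return logdet

rng = np.random.default_rng(0)
# Hypothetical corpus: 5 sentences, 4 tokens each, 3-dimensional embeddings
embeddings = [rng.normal(size=(4, 3)) for _ in range(5)]
small = surrogate_logdet(embeddings, [0])
large = surrogate_logdet(embeddings, [0, 1, 2])
print(small, large)                # the objective is monotone, so large >= small
```

Using `slogdet` rather than `log(det(...))` avoids overflow or underflow when the embedding dimension is large.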

2. FisherSFT Algorithmic Structure and Selection Rule

Optimal $n$-subset selection under this submodular, monotone set function is NP-hard, but greedy approximation guarantees $(1 - 1/e)$-optimality. The FisherSFT algorithm (Deb et al., 20 May 2025) proceeds as follows:

Initialize the design matrix $H$ to a positive-definite matrix (for example $H = I_d$) and $S = \emptyset$;
For $t = 1$ to $n$:
- For each candidate $i \notin S$, estimate the information gain if $i$ is added:
$g_i = \log \det \left( H + \sum_{j} x_{i,j} x_{i,j}^T \right) - \log \det H$
- Add the candidate $k$ with the largest $g_k$ to $S$ and update $H \leftarrow H + \sum_{j} x_{k,j} x_{k,j}^T$;
Return $S$.
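The greedy loop above can be sketched in a few lines of NumPy. This is a naive version that recomputes every gain from scratch; the identity-scaled starting matrix (`reg`), the per-sentence array layout, and the name `greedy_select` are assumptions for illustration rather than the authors' implementation.

```python
import numpy as np

def greedy_select(embeddings, budget, reg=1.0):
    """Naive greedy log-det maximization over sentences."""
    d = embeddings[0].shape[1]
    H = reg * np.eye(d)                    # design matrix; the reg*I start is an assumption
    selected = []
    remaining = set(range(len(embeddings)))
    for _ in range(min(budget, len(embeddings))):
        base = np.linalg.slogdet(H)[1]
        best, best_gain = None, -np.inf
        for i in sorted(remaining):
            X = embeddings[i]              # shape (tokens_i, d)
            gain = np.linalg.slogdet(H + X.T @ X)[1] - base
            if gain > best_gain:           # strict '>' keeps the lowest index on ties
                best, best_gain = i, gain
        selected.append(best)
        remaining.remove(best)
        H = H + embeddings[best].T @ embeddings[best]
    return selected, H

# Toy corpus: sentences 0 and 1 are duplicates, sentence 2 covers a new direction.
embeddings = [np.array([[2.0, 0.0]]), np.array([[2.0, 0.0]]), np.array([[0.0, 1.0]])]
selected, H = greedy_select(embeddings, budget=2)
print(selected)  # [0, 2]
```

In the toy corpus the second pick skips the duplicate sentence because it adds less log-determinant than the sentence covering an unexplored direction, which is the behavior the design criterion is meant to encourage.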

Efficient implementation exploits the Woodbury formula for determinants (the matrix determinant lemma) and a classic lazy-greedy selection cache. After subset selection, conventional gradient-based SFT is performed on $S$.
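To make the determinant shortcut concrete, the following sketch checks two standard identities numerically: the matrix determinant lemma, which reduces each gain to an $m \times m$ determinant for a sentence with $m$ tokens, and the Woodbury identity, which updates $H^{-1}$ without re-inverting the $d \times d$ matrix. The function names and the random test matrices are hypothetical; how the original implementation organizes these updates is not specified in the source.

```python
import numpy as np

def gain_via_lemma(H_inv, X):
    """log det(H + X^T X) - log det(H) = log det(I_m + X H^{-1} X^T)."""
    m = X.shape[0]
    return np.linalg.slogdet(np.eye(m) + X @ H_inv @ X.T)[1]

def woodbury_update(H_inv, X):
    """Inverse of (H + X^T X) computed from H^{-1} via the Woodbury identity."""
    m = X.shape[0]
    K = np.linalg.inv(np.eye(m) + X @ H_inv @ X.T)   # only an m x m inverse is needed
    return H_inv - H_inv @ X.T @ K @ X @ H_inv

rng = np.random.default_rng(1)
d, m = 6, 3
A = rng.normal(size=(d, d))
H = A @ A.T + np.eye(d)              # random positive-definite design matrix
X = rng.normal(size=(m, d))          # token embeddings of one hypothetical sentence
H_inv = np.linalg.inv(H)

direct_gain = np.linalg.slogdet(H + X.T @ X)[1] - np.linalg.slogdet(H)[1]
fast_gain = gain_via_lemma(H_inv, X)
print(np.isclose(direct_gain, fast_gain))
print(np.allclose(woodbury_update(H_inv, X), np.linalg.inv(H + X.T @ X)))
```

The saving matters when the number of tokens per sentence is much smaller than the embedding dimension, since each candidate then costs an $m \times m$ determinant rather than a $d \times d$ one.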

3. FisherSFT for Sequential Feature Acquisition

In the distinct setting of sequential test-time feature acquisition under budget (Mirzaei et al., 2023), FisherSFT designates a lazy, model-free method for choosing which features to query at test time in a sample-adaptive fashion, with the dual objectives of maximizing classification accuracy and minimizing acquisition cost.

Given test instance $x$, a set of acquired features $\mathcal{A}$, and a maximum budget $B$, the framework cycles:

For each candidate $f \notin \mathcal{A}$, compute the ANOVA Fisher score $F_f$ on the filtered training subset $\mathcal{D}'$:

$F_f = \frac{\sum_{c} n_c \, (\mu_{c,f} - \mu_f)^2}{\sum_{c} n_c \, \sigma_{c,f}^2}$

where $\mu_{c,f}$ is the mean, $\sigma_{c,f}^2$ variance, and $n_c$ count for class $c$ over $\mathcal{D}'$, and $\mu_f$ is the overall mean of feature $f$ over $\mathcal{D}'$.

Select $f^\star = \arg\max_{f \notin \mathcal{A}} F_f$ as the next feature to acquire;
Observe $x_{f^\star}$, update $\mathcal{A} \leftarrow \mathcal{A} \cup \{f^\star\}$;
Filter $\mathcal{D}'$ to retain only points close in the acquired feature subspace (using a distance threshold $\epsilon$);
Repeat or halt if budget is reached or $\mathcal{D}'$ is empty.

Final label prediction is given by the majority class among $\mathcal{D}'$; if $\mathcal{D}'$ is empty, fall back to the prior.
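The acquisition cycle described above can be sketched as follows. This is a minimal illustration under several assumptions that are not confirmed by the source: the Fisher score is taken in its common between-class over within-class variance form, filtering uses Euclidean distance over the acquired features, labels are non-negative integers, and all names (`fisher_score`, `acquire_and_predict`, `eps`) are hypothetical.

```python
import numpy as np

def fisher_score(values, labels):
    """Between-class over within-class variance of a single feature."""
    overall = values.mean()
    between = within = 0.0
    for c in np.unique(labels):
        v = values[labels == c]
        between += len(v) * (v.mean() - overall) ** 2
        within += len(v) * v.var()
    return between / (within + 1e-12)             # small constant avoids division by zero

def acquire_and_predict(X_train, y_train, x_test, budget, eps):
    """Greedy test-time feature acquisition with neighbourhood filtering."""
    prior = np.bincount(y_train).argmax()         # fallback: most frequent training label
    keep = np.ones(len(y_train), dtype=bool)      # current filtered training subset
    acquired = []
    for _ in range(budget):
        candidates = [f for f in range(X_train.shape[1]) if f not in acquired]
        if not candidates or not keep.any():
            break
        scores = [fisher_score(X_train[keep, f], y_train[keep]) for f in candidates]
        f_star = candidates[int(np.argmax(scores))]
        acquired.append(f_star)                   # "observe" x_test[f_star]
        diff = X_train[:, acquired] - x_test[acquired]
        keep = np.linalg.norm(diff, axis=1) <= eps
    if not keep.any():
        return prior, acquired
    return np.bincount(y_train[keep]).argmax(), acquired

# Toy data: feature 0 is uninformative, feature 1 separates the classes.
X_train = np.array([[0.1, 0.0], [0.9, 0.1], [0.2, 1.0], [0.8, 0.9]])
y_train = np.array([0, 0, 1, 1])
label, used = acquire_and_predict(X_train, y_train, np.array([0.5, 0.95]), budget=1, eps=0.2)
print(label, used)
```

With a budget of one feature, the sketch queries only the discriminative feature and classifies from the two training points that survive filtering.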

4. Computational Complexity and Performance

FisherSFT for data selection (Deb et al., 20 May 2025) is dominated, in the naive case, by re-scoring every remaining candidate in each of the $n$ greedy rounds (on the order of $nN$ log-determinant evaluations), with practical reductions via the Woodbury identity and lazy updating; selection time per experiment is reported to be on the order of minutes at the corpus sizes and budgets considered. For sequential test-time feature selection (Mirzaei et al., 2023), per-sample cost is governed by the budget $B$, since each of at most $B$ acquisition rounds scores the remaining candidate features on the current filtered subset. The practical constant is small due to rapid filtering and shrinking candidate sets.
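One way the lazy updating mentioned above might look in practice is the classic lazy-greedy scheme: because the objective is submodular, a gain computed in an earlier round is an upper bound on the current gain, so only the candidate at the top of a priority queue needs re-scoring. The sketch below is an assumption-laden illustration (hypothetical names, identity-scaled initial design matrix), not the authors' code.

```python
import heapq
import numpy as np

def lazy_greedy_select(embeddings, budget, reg=1.0):
    """Lazy-greedy log-det selection; stale gains are upper bounds by submodularity."""
    d = embeddings[0].shape[1]
    H_inv = np.eye(d) / reg                       # inverse of the assumed reg*I design matrix

    def gain(i, H_inv):
        X = embeddings[i]
        # Matrix determinant lemma: gain is an m x m log-determinant
        return float(np.linalg.slogdet(np.eye(len(X)) + X @ H_inv @ X.T)[1])

    # Heap entries: (negated gain, index, number of selections when the gain was computed)
    heap = [(-gain(i, H_inv), i, 0) for i in range(len(embeddings))]
    heapq.heapify(heap)
    selected = []
    while heap and len(selected) < budget:
        _, i, stamp = heapq.heappop(heap)
        if stamp == len(selected):                # gain is fresh, so it is the true maximum
            selected.append(i)
            X = embeddings[i]
            K = np.linalg.inv(np.eye(len(X)) + X @ H_inv @ X.T)
            H_inv = H_inv - H_inv @ X.T @ K @ X @ H_inv   # Woodbury update
        else:                                     # stale: re-score and push back
            heapq.heappush(heap, (-gain(i, H_inv), i, len(selected)))
    return selected

embeddings = [np.array([[2.0, 0.0]]), np.array([[2.0, 0.0]]), np.array([[0.0, 1.0]])]
print(lazy_greedy_select(embeddings, budget=2))  # [0, 2]
```

On this toy input the result matches plain greedy selection; the benefit on larger corpora is that most candidates are never re-scored after the first round.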

Empirical findings:

FisherSFT achieves a twofold reduction in mean and maximum token-prediction error compared to uniform, cluster, and dense sampling on synthetic or word embedding tasks (Deb et al., 20 May 2025).
In LLM fine-tuning of GPT-2 on the tiny-Shakespeare corpus, FisherSFT-selected sentences yielded generations preferred by external LLM evaluators 60–80% of the time over several baseline methods.
In active feature acquisition, FisherSFT attains accuracy-feature curves nearly identical to those of RL-based policies, with three to six orders of magnitude faster inference (e.g., 0.0056 s vs. 47–243 s per test) (Mirzaei et al., 2023).

5. Methodological Connections and Generalizations

Both FisherSFT regimes are rooted in maximizing models’ Fisher information (last-layer for LLMs; ANOVA-discriminative statistics for tabular features) with a greedy, submodular-selection principle. In the context of LLMs, this connects to classic optimal experimental design and information-theoretic sample selection.

Fisher information as a discriminative or parameter-importance metric also underpins recent advances in parameter-efficient fine-tuning (PEFT) regimes. For example, FISH-Tuning (Xue et al., 5 Apr 2025) applies a Fisher diagonal mask to select the most informative subset of trainable parameters within LoRA, adapters, or their reparameterized modules, and is reported to yield superior quality for a fixed parameter budget. However, the “FisherSFT” terminology in the PEFT literature typically refers to training-instance selection, with “FISH-Tuning” more specifically denoting parameter masking.
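The idea of a Fisher diagonal mask can be illustrated generically. The sketch below is not FISH-Tuning's implementation: a logistic-regression weight vector stands in for the trainable parameters of a PEFT module, the diagonal is estimated by the empirical Fisher (mean squared per-example gradient), and the names `fisher_diagonal` and `top_k_mask` are hypothetical.

```python
import numpy as np

def fisher_diagonal(w, X, y):
    """Empirical diagonal Fisher for logistic regression: mean squared per-example gradient."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    per_example_grad = (y - p)[:, None] * X       # gradient of the log-likelihood per example
    return (per_example_grad ** 2).mean(axis=0)

def top_k_mask(scores, k):
    """Boolean mask keeping the k parameters with the largest Fisher scores."""
    mask = np.zeros(scores.shape, dtype=bool)
    if k <= 0:
        return mask
    mask[np.argsort(scores)[-k:]] = True
    return mask

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[:, 2] = 0.0                                     # this input never varies, so its weight carries no information
y = rng.integers(0, 2, size=50).astype(float)
w = np.zeros(4)

scores = fisher_diagonal(w, X, y)
mask = top_k_mask(scores, k=3)
print(mask)                                       # the weight on the constant-zero input is masked out
```

During masked fine-tuning, gradient updates would be multiplied by `mask` so that only the selected parameters change.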

6. Variants, Limitations, and Prospects

The main assumption underlying the data selection variant is that the last-layer linearization and the associated curvature bound adequately capture information gain; this is validated in the original studies by empirical agreement between Fisher and full MCMC-style likelihood analyses for moderate parameter spaces. For test-time sequential feature acquisition, the framework presumes unit-cost features and that Euclidean-distance filtering is appropriate.

Proposed extensions include modeling curvature more precisely (token-wise), adaptive or staged selection interleaved with partial fine-tuning, and integrating preference data or RLHF pipelines. A plausible implication is that further gains could be realized by combining FisherSFT selection with recent advances in mask-based parameter-efficient fine-tuning (Xue et al., 5 Apr 2025).

7. Summary Table: FisherSFT Algorithmic Variants

Domain	FisherSFT Role	Objective
LLM Fine-Tuning (Deb et al., 20 May 2025)	Training example selection	Maximize information gain in SFT, reduce data usage
Tabular Feature Acquisition (Mirzaei et al., 2023)	Test-time sequential feature selection	Minimize #features for high-accuracy prediction
PEFT (FISH-Tuning) (Xue et al., 5 Apr 2025)	Parameter subset selection	Sparse, high-utility mask for PEFT module

In all settings, the FisherSFT approach eschews extensive training passes or learned meta-policies, instead yielding computationally efficient, interpretable methods for optimizing the use of data, features, or trainable model components. The information-theoretic foundation underpins the approximation guarantee of the greedy data-selection step and the empirical competitiveness reported in each domain.