FisherSFT: Applications of Fisher Information

Updated 20 November 2025
  • FisherSFT denotes a family of methods unified by their use of Fisher information: sample selection for language-model fine-tuning, sparse discriminant analysis, and sequential feature acquisition.
  • These methods employ techniques such as submodular optimization and penalty-based sparsity to improve efficiency and accuracy in high-dimensional data and classification tasks.
  • The term also extends to stochastic thermodynamics, where geometric fluctuation relations built on the stochastic Fisher information provide new insights into nonequilibrium dynamics and uncertainty quantification.

FisherSFT refers to a diverse set of methodologies unified by their explicit use of Fisher information or Fisher-based criteria for optimality, efficiency, or interpretability in statistical learning, fine-tuning, discriminant analysis, and stochastic thermodynamics. Notably, the FisherSFT moniker appears in recent literature across data-efficient supervised fine-tuning of LLMs, sparse discriminant analysis, active feature selection for classification, and the formulation of geometric fluctuation relations in nonequilibrium physics. Each of these instantiations leverages the Fisher information’s geometric and statistical structure, adapting it to distinct domains and theoretical objectives.

1. Data-Efficient Fine-Tuning of LLMs via Fisher Information

"FisherSFT: Data-Efficient Supervised Fine-Tuning of LLMs Using Information Gain" (Deb et al., 20 May 2025) introduces FisherSFT as a framework for maximizing the informativeness of training data used to fine-tune LLMs. The central idea is selection of training examples that maximize information gain, operationalized through the Fisher information matrix of the model’s last-layer softmax.

Given an LLM with token-level pre-logit embeddings $x_{i,j} \in \mathbb{R}^d$ and associated token labels $y_{i,j}$ for sentence $i$ and position $j$, the empirical Fisher information (Hessian of the negative log-likelihood) is:

$$H_S(\Theta) = -\nabla^2_\Theta \left[\frac{1}{n} \sum_{i \in S} \sum_{j=1}^{M_i} \log p(y_{i,j} \mid x_{i,j}; \Theta)\right]$$

The optimization problem becomes maximizing $\log \det H_S(\Theta)$ over subsets $S$ of fixed size, which corresponds to a classical D-optimal design and, asymptotically, to minimizing estimator variance. Direct computation is typically infeasible due to parameter dimensionality, but under a last-layer linearization, the determinant reduces to a function of the pre-logit embedding Gramian:

$$\log \det H_S(\Theta) \gtrsim d \cdot \log \det \left( \frac{\gamma}{n} \sum_{i \in S}\sum_{j=1}^{M_i} x_{i,j} x_{i,j}^T \right)$$

where $\gamma > 0$ encodes a lower bound on the minimal eigenvalue induced by the softmax’s Hessian structure.

The greedy subset-selection strategy relies on the logarithmic submodularity of matrix determinants, enabling near-optimal performance with efficient lazy evaluation. Empirically, FisherSFT demonstrates superior sample-efficiency in fine-tuning LLMs for in-domain adaptation, outperforming uniform sampling, clustering, and various kernel-based active-learning baselines across synthetic, embedding-based, and GPT-2 fine-tuning experiments. Downstream generation quality, as measured by human and LLM judgment, consistently favors FisherSFT-chosen example subsets (Deb et al., 20 May 2025).
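
As a concrete illustration, a minimal greedy log-det selection loop might look as follows. This is a sketch assuming per-sentence Gramians $X_i = \sum_j x_{i,j} x_{i,j}^T$ have been precomputed from frozen pre-logit embeddings; function and parameter names are illustrative, not the authors’ released implementation:

```python
import numpy as np

def greedy_doptimal(grams, k, gamma=1e-3):
    """Greedily select k sentences maximizing log det of the regularized
    embedding Gramian (the D-optimal surrogate objective above).

    grams : list of (d, d) PSD arrays, X_i = sum_j x_{i,j} x_{i,j}^T
    gamma : regularizer standing in for the softmax-Hessian eigenvalue bound
    """
    d = grams[0].shape[0]
    A = gamma * np.eye(d)                  # current regularized design matrix
    selected = []
    remaining = set(range(len(grams)))
    for _ in range(k):                     # assumes k <= len(grams)
        # Exhaustive marginal-gain scan; a lazy-greedy variant caches gains
        # in a max-heap and re-checks only the top entry, exploiting the
        # submodularity of log det over PSD sums.
        best = max(remaining, key=lambda i: np.linalg.slogdet(A + grams[i])[1])
        selected.append(best)
        remaining.remove(best)
        A += grams[best]
    return selected
```

Monotone submodularity of $S \mapsto \log\det(\gamma I + \sum_{i \in S} X_i)$ gives the greedy loop the usual $(1 - 1/e)$ approximation guarantee relative to the best size-$k$ subset.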

2. Sparse Fisher Discriminant Analysis with Thresholded Constraints

A distinct instantiation of FisherSFT is "Sparse Fisher's Discriminant Analysis with Thresholded Linear Constraints" (Luo et al., 2015). Here, FisherSFT addresses the inconsistencies of classical linear discriminant analysis (LDA) in high-dimensional, multiclass settings, particularly when $p \gg n$.

For a multiclass Gaussian model with $K \geq 2$ classes, FisherSFT replaces the ill-posed classical Fisher LDA eigenproblem with a penalized quadratic program:

$$\max_{\alpha \in \mathbb{R}^p} \alpha^T \hat B \alpha \quad \text{subject to} \quad \alpha^T \hat \Sigma \alpha + \tau \|\alpha\|^2_\lambda = 1$$

where $\hat B$ is the empirical between-class covariance, $\hat \Sigma$ the within-class covariance, and $\|\alpha\|^2_\lambda = (1-\lambda)\|\alpha\|_2^2 + \lambda\|\alpha\|_1^2$ combines $\ell_2$ ridge and $\ell_1$ sparsity penalties. For $K > 2$, classical linear independence constraints are replaced with orthogonality against thresholded projections,

$$\xi_j = \mathrm{argmin}_{\xi \in \mathbb{R}^p} \|\xi - \hat B \hat \alpha_j\|_2^2 + \kappa\|\xi\|_1$$

so that directions are successively identified with only weak reliance on unreliable high-dimensional covariance estimates.
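
The $\ell_1$-penalized projection above separates coordinate-wise and therefore has a closed-form soft-thresholding solution; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def thresholded_projection(B_hat, alpha_hat, kappa):
    """Closed-form minimizer of ||xi - B_hat @ alpha_hat||_2^2 + kappa * ||xi||_1.

    Setting the coordinate-wise subgradient to zero gives soft thresholding
    of v = B_hat @ alpha_hat at level kappa / 2.
    """
    v = B_hat @ alpha_hat
    return np.sign(v) * np.maximum(np.abs(v) - kappa / 2.0, 0.0)
```

Entries of $\hat B \hat\alpha_j$ smaller than $\kappa/2$ in magnitude are zeroed out, which is what decouples the successive orthogonality constraints from noisy high-dimensional covariance estimates.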

FisherSFT in this context yields consistent and asymptotically optimal discriminant solutions (under regularity and sparsity conditions), with provable bounds on excess risk relative to Bayes-optimal multiclass rules. Empirically, the method produces interpretable, sparse discriminant vectors, and outperforms regularized LDA (RDA) and penalized DA (PDA) in both simulation and functional data settings, achieving low misclassification error in high-dimensional regimes (Luo et al., 2015).

3. Sequential Feature Acquisition via Fisher Score

"Fast Classification with Sequential Feature Selection in Test Phase" (Mirzaei et al., 2023) (FisherSFT) applies a Fisher score-based protocol for budgeted, active feature selection during runtime classification. The method ranks unmeasured features by their ANOVA-style Fisher score:

$$F_j = \frac{\sum_{c=1}^{C} N_c (\mu_{c,j} - \mu_j)^2}{\sum_{c=1}^{C} N_c \sigma_{c,j}^2}$$

with higher $F_j$ indicating stronger class separation. Features are acquired sequentially; after each acquisition, the training set is filtered to retain only those instances within a Euclidean threshold of the test instance on the measured features, and prediction is made by majority vote over the filtered set.

The FisherSFT algorithm has computational complexity $O(BMN)$ per test point for $B$ budgeted feature acquisitions, $M$ features, and $N$ training points. Empirical benchmarks show that FisherSFT achieves accuracy comparable to reinforcement learning-based active acquisition and other sophisticated baselines, while offering orders-of-magnitude faster inference and requiring no retraining. The method is robust to irrelevant features and performs well even when only a small fraction of features are informative (Mirzaei et al., 2023).
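
A minimal sketch of this test-phase protocol follows; the measurement oracle `acquire(j)`, the threshold `eps`, and all function names are illustrative assumptions rather than the paper’s released interface:

```python
import numpy as np
from collections import Counter

def fisher_scores(X, y):
    """ANOVA-style Fisher score F_j per feature (higher = better separation)."""
    mu = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - mu) ** 2
        den += len(Xc) * Xc.var(axis=0)
    return num / (den + 1e-12)               # guard against constant features

def classify_sequential(X_train, y_train, acquire, budget, eps):
    """Acquire up to `budget` features in decreasing Fisher-score order;
    after each measurement keep only training instances within Euclidean
    distance `eps` of the test point on the measured coordinates, then
    predict by majority vote over the survivors."""
    order = np.argsort(fisher_scores(X_train, y_train))[::-1]
    measured, values = [], []
    mask = np.ones(len(X_train), dtype=bool)
    for j in order[:budget]:
        measured.append(j)
        values.append(acquire(j))             # measurement oracle for feature j
        dists = np.linalg.norm(X_train[:, measured] - np.asarray(values), axis=1)
        new_mask = dists <= eps
        if not new_mask.any():                # stop before the set empties
            break
        mask = new_mask
    return Counter(y_train[mask]).most_common(1)[0][0]
```

Scoring all $M$ features costs $O(MN)$ and each of the at most $B$ filtering passes is linear in $N$, consistent with the $O(BMN)$-type per-test-point cost quoted above.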

4. Geometric Stochastic Fluctuation Relations

"Classical Geometric Fluctuation Relations" (Melo et al., 31 Oct 2024) formalizes FisherSFT as a geometric fluctuation relation for the stochastic Fisher information in nonequilibrium stochastic systems. The stochastic Fisher information (SFI) for a Markov process with probability density P(x,t)P(x,t) is defined as

$$\iota_F(x, t) = \left(\frac{\partial}{\partial t} \ln P(x, t) \right)^2 = [\partial_t s_{\text{sys}}(x, t)]^2$$

where $s_{\text{sys}}(x, t) = -\ln P(x, t)$ is the system entropy.

A key geometric object is the stochastic length in entropy space,

$$\ell[x(\cdot)] = \int_0^\tau dt\, \sqrt{\iota_F(x(t),t)}$$

which quantifies the accumulated local rate of entropy change along a trajectory.
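
As a numerical illustration, the stochastic length can be evaluated along a single trajectory of an Ornstein-Uhlenbeck process, for which $P(x,t)$ is Gaussian in closed form; this is a sketch under assumed dynamics and parameters, not a calculation from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ornstein-Uhlenbeck process dx = -x dt + sqrt(2 D) dW started at x0, whose
# time-dependent density P(x, t) is Gaussian in closed form:
#   mean x0 * exp(-t), variance D * (1 - exp(-2 t)).
D, x0, dt, T = 1.0, 2.0, 1e-3, 3.0
t = np.arange(1, int(T / dt)) * dt          # skip t = 0 (delta initial density)

def ln_P(x, t):
    mu = x0 * np.exp(-t)
    var = D * (1.0 - np.exp(-2.0 * t))
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mu) ** 2 / (2.0 * var)

# One Euler-Maruyama trajectory.
x = np.empty(len(t) + 1)
x[0] = x0
for k in range(len(t)):
    x[k + 1] = x[k] - x[k] * dt + np.sqrt(2.0 * D * dt) * rng.standard_normal()

# Stochastic Fisher information iota_F = (d/dt ln P(x, t))^2 at fixed x,
# via a central difference in t along the trajectory.
eps = 1e-5
dlnP_dt = (ln_P(x[1:], t + eps) - ln_P(x[1:], t - eps)) / (2.0 * eps)
iota_F = dlnP_dt ** 2

# Stochastic length ell[x(.)] = integral of sqrt(iota_F) dt (Riemann sum).
ell = np.sum(np.sqrt(iota_F)) * dt
print(f"stochastic length ell = {ell:.3f}")
```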

The main result is a pair of fluctuation relations involving the joint distributions of forward and backward SFI trajectories:

  • The detailed geometric fluctuation relation:

$$\frac{P_F(\{\iota_F(t)\})}{P_B(\{\hat{\iota}_F(t)\})} = \exp\left(\beta q[x(\cdot)] + \ell[x(\cdot)]\right)$$

  • The integral geometric fluctuation relation:

$$\left\langle \exp\left[-\beta q[x(\cdot)] - \ell[x(\cdot)]\right] \right\rangle_F = 1$$
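
As with standard fluctuation theorems, the integral form follows from the detailed one: averaging $e^{-\beta q - \ell}$ over forward trajectories and substituting the detailed relation converts the forward path measure into the backward one, whose normalization gives

$$\left\langle e^{-\beta q[x(\cdot)] - \ell[x(\cdot)]} \right\rangle_F = \int \mathcal{D}[x]\, P_F(\{\iota_F(t)\})\, \frac{P_B(\{\hat{\iota}_F(t)\})}{P_F(\{\iota_F(t)\})} = 1$$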

These geometric fluctuation theorems generalize traditional nonequilibrium entropy fluctuation relations to trajectory-dependent Fisher metrics, with direct analogs to thermodynamic length and trajectory-dependent uncertainty relations:

$$j[x(\cdot)] \geq \ell[x(\cdot)]^2; \qquad \langle \mathcal{I}_F(t) \rangle_t / \sigma^2 \geq 1/2$$

Verification in analytically tractable Langevin models confirms the tightness and universality of these relations (Melo et al., 31 Oct 2024).

5. Comparative Table of FisherSFT Methodologies

Domain & Reference Core Principle Main Application
LLM Fine-Tuning (Deb et al., 20 May 2025) Fisher information maximization (log det Hessian over embeddings) Active example selection for data-efficient SFT
Sparse Discriminant Analysis (Luo et al., 2015) Sparse, penalized Fisher criterion with thresholded constraints High-dimensional classification (LDA extension)
Sequential Feature Acquisition (Mirzaei et al., 2023) Fisher score-based feature ranking and filtering Budgeted, active feature acquisition in test-time classification
Stochastic Thermodynamics (Melo et al., 31 Oct 2024) Trajectory-dependent Fisher information and fluctuation relation Geometric fluctuation relations for Markovian nonequilibrium processes

6. Significance and Theoretical Connections

The FisherSFT paradigm illustrates the unifying power of Fisher information as both a geometric and statistical object. In supervised learning, maximizing Fisher information ensures rapid reduction of parameter uncertainty and improves sample efficiency. In discriminant analysis, Fisher-based sparsity constraints are fundamental for robust high-dimensional generalization. In stochastic processes, trajectory-level Fisher information defines new fluctuation relations, revealing geometric structure underlying nonequilibrium dynamics.

The methodologies surveyed under the FisherSFT nomenclature variously exhibit submodularity, convexity, and strong statistical guarantees regarding convergence, risk optimality, and uncertainty control. In data selection and feature acquisition, the Fisher criterion is directly interpretable as an optimal design principle. For fluctuation relations, it acts as a metric on entropy-space curves, establishing new bounds for trajectory-level uncertainty.

The application-oriented and theoretical advances in FisherSFT frameworks are supported by rigorous experimental validation across simulation, large-scale text, functional data, and stochastic processes (Deb et al., 20 May 2025, Mirzaei et al., 2023, Luo et al., 2015, Melo et al., 31 Oct 2024).
