Generalized Fisher Score Overview

Updated 19 March 2026
  • Generalized Fisher Score is a methodology that extends the classical Fisher score to closed probability simplexes, enabling analysis of distributions with zero-probability events.
  • It enhances supervised feature selection by jointly optimizing feature indicators and class projection, thereby capturing joint effects and reducing redundancy.
  • The framework generalizes Fisher information through $\chi^\beta$-divergences, leading to novel Cramér–Rao inequalities and robust information-geometric insights.

The Generalized Fisher Score (GFS) covers several statistical and information-geometric methods that extend the classical Fisher score and Fisher information across discrete, continuous, and applied settings. Its principal incarnations are: (a) a geometric and algebraic generalization of the Fisher score to the closed probability simplex, including zero-probability events, which provides analytic tools for finite-state statistical models with boundary distributions; (b) a family of information-theoretic generalizations, particularly those induced by $\chi^\beta$-divergences, which yield extended Fisher information and Cramér–Rao inequalities; and (c) algorithmic generalizations for supervised feature selection that capture joint effects and redundancy among features and are reported to outperform the traditional Fisher score on benchmark data.

1. Geometric and Algebraic Generalization on the Closed Simplex

The conventional Fisher score presupposes distributional models restricted to the interior of the probability simplex $\Delta(\Omega)$, where all probabilities $p(x)$ are strictly positive. However, many statistical models, including contingency tables with non-structural zeros, require analytic and differential tools valid on the closure $\Delta(\Omega) := \{ p \in \mathbb{R}^\Omega : p(x)\geq 0 \ \forall x, \ \sum_x p(x)=1 \}$, where $p(x)$ may vanish.

In this setting, the tangent space $T_p\Delta(\Omega)$ at any $p$ is shown to consist of zero-sum vectors supported on $\mathrm{supp}\,p$:

$$T_p\Delta(\Omega) = \{ v \in \mathbb{R}^\Omega : \sum_x v(x)=0, \ \mathrm{supp}\,v \subseteq \mathrm{supp}\,p \}$$

(Pistone et al., 7 Feb 2026). For one-parameter curves $\gamma(t)\in\Delta(\Omega)$ with velocity $v(x;t)=\frac{d}{dt}\gamma(x;t)$, the generalized Fisher score $s(x;t)$ is defined algebraically by

$$v(x;t)=s(x;t)\,\gamma(x;t), \quad x\in\Omega$$

with $s(x;t)$ unique on $\mathrm{supp}\,\gamma(t)$ and arbitrary elsewhere. On the interior, $s(x;t)=\partial_t\log\gamma(x;t)$ recovers the classical case. The main result asserts that the derivatives (velocities), and thus the Fisher score, are well-defined on each face of the closed simplex, with velocities required to vanish on zero-probability cells. This algebraic extension is consistent with the information-geometric framework: tangent bundles, exponential and mixture connections, and the Fisher–Rao metric all extend to cases with probability-0 events (Pistone et al., 7 Feb 2026).
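As a small illustration of the defining relation $v = s\,\gamma$, the following sketch evaluates the score along a mixture curve that stays inside one face of a three-point simplex. The function name `generalized_score`, the `fill` convention for cells off the support, and the example distributions are assumptions made here for illustration and are not taken from the cited work.

```python
def generalized_score(gamma_t, velocity_t, fill=0.0, tol=1e-12):
    """Solve v(x) = s(x) * gamma(x) for s on supp(gamma).

    Off the support the score is arbitrary; `fill` is a hypothetical
    convention chosen for this sketch only.
    """
    scores = []
    for g, v in zip(gamma_t, velocity_t):
        if g > tol:
            scores.append(v / g)
        else:
            # The defining equation has a solution only if v vanishes here.
            if abs(v) > tol:
                raise ValueError("velocity must vanish on zero-probability cells")
            scores.append(fill)
    return scores


# Mixture curve gamma(t) = (1 - t) * p0 + t * p1; the third cell has
# probability zero for every t, so the curve lives on a boundary face.
p0 = [0.5, 0.5, 0.0]
p1 = [0.2, 0.8, 0.0]
t = 0.25
gamma_t = [(1 - t) * a + t * b for a, b in zip(p0, p1)]
velocity_t = [b - a for a, b in zip(p0, p1)]  # d/dt of the mixture curve

score_t = generalized_score(gamma_t, velocity_t)

# Tangent-space checks: the velocity sums to zero, and the score has
# zero mean under gamma(t).
velocity_sum = sum(velocity_t)
score_mean = sum(s * g for s, g in zip(score_t, gamma_t))
```

On the two supported cells the score is the ordinary ratio $v/\gamma$; on the zero-probability cell the velocity is zero, so any value of `fill` satisfies the defining equation.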

2. Feature Selection: Joint Maximization and Non-redundancy

The Fisher score remains a foundational criterion for supervised feature selection but is classically univariate: it scores each dimension independently for class separability, then retains the top $m$ (Gu et al., 2012). This design ignores joint effects and feature redundancy. Features with weak individual scores but strong joint separability can be discarded, and highly redundant features can be selected together.
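The redundancy problem can be seen on a toy example. The sketch below uses the common textbook form of the per-feature score (between-class scatter divided by within-class scatter), which the text above does not spell out and is therefore an assumption here; the data set and the name `fisher_score` are likewise hypothetical.

```python
def fisher_score(column, labels):
    """Univariate Fisher score: between-class over within-class scatter."""
    n = len(column)
    overall_mean = sum(column) / n
    between, within = 0.0, 0.0
    for cls in set(labels):
        values = [v for v, y in zip(column, labels) if y == cls]
        mean = sum(values) / len(values)
        between += len(values) * (mean - overall_mean) ** 2
        within += sum((v - mean) ** 2 for v in values)
    return between / within


labels = [0, 0, 0, 1, 1, 1]
features = [
    [-1.0, 0.0, 1.0, 0.0, 1.0, 2.0],   # class signal plus noise
    [-1.0, 0.0, 1.0, 0.0, 1.0, 2.0],   # exact copy of feature 0 (redundant)
    [-1.0, 0.1, 0.9, -0.9, 0.0, 1.1],  # tracks the noise, almost no class signal
]

scores = [fisher_score(column, labels) for column in features]

# Independent ranking: keep the m highest-scoring features.
m = 2
top_m = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:m]
```

Here the two identical features tie for the highest score, so the top-2 rule keeps both copies and drops the third feature, even though the third feature is the one that would cancel the shared noise when combined with either copy.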

Generalized Fisher Score for Feature Selection transforms this process. Let $X\in\mathbb{R}^{d\times n}$ be zero-mean data and $y_i \in \{1,\ldots,c\}$ the class labels. The GFS objective introduces a binary selection vector $p\in\{0,1\}^d$ with $p^T\mathbf{1}=m$ and selects coordinates via $Z=D_p X$, where $D_p = \mathrm{diag}(p)$. The classical Fisher score $F(Z)$ is:

$$F(Z) = \mathrm{tr}\left[ S_b (S_t + \gamma I)^{-1} \right]$$

where $S_b$ and $S_t$ are the between-class and total scatter matrices and $\gamma > 0$ is a regularization parameter. GFS recasts feature selection as the mixed-integer maximization

$$\max_{p,W} \ \mathrm{tr}\left[ W^T D_p S_b D_p W \cdot \left( W^T D_p (S_t + \gamma I) D_p W \right)^{-1} \right]$$

subject to $p^T\mathbf{1} = m$, jointly optimizing the feature indicators $p$ and the class projection $W$. The resulting mixed-integer program is further reformulated as a quadratically constrained linear program (QCLP), which is solved efficiently by a cutting-plane algorithm with multiple kernel learning (MKL) subroutines (Gu et al., 2012).
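To make the subset-level criterion concrete, the sketch below evaluates $\mathrm{tr}[S_b (S_t + \gamma I)^{-1}]$ restricted to a candidate feature subset and picks the best subset by exhaustive enumeration. This is explicitly not the QCLP or cutting-plane method of Gu et al.; brute force is only feasible for very small $d$ and is used here purely as an illustration. All function names and the toy data are hypothetical.

```python
from itertools import combinations


def scatter_matrices(rows, labels):
    """Return (S_b, S_t) for samples given as rows of feature values."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    s_t = [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows)
            for j in range(d)] for i in range(d)]
    s_b = [[0.0] * d for _ in range(d)]
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        diff = [sum(r[j] for r in members) / len(members) - mean[j]
                for j in range(d)]
        for i in range(d):
            for j in range(d):
                s_b[i][j] += len(members) * diff[i] * diff[j]
    return s_b, s_t


def solve(a, b):
    """Solve A Y = B by Gauss-Jordan elimination with partial pivoting."""
    k = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def subset_criterion(s_b, s_t, subset, gamma):
    """tr[S_b (S_t + gamma I)^{-1}] restricted to the selected features."""
    sb = [[s_b[i][j] for j in subset] for i in subset]
    st = [[s_t[i][j] + (gamma if i == j else 0.0) for j in subset]
          for i in subset]
    y = solve(st, sb)  # (S_t + gamma I)^{-1} S_b has the same trace
    return sum(y[i][i] for i in range(len(subset)))


def best_subset(rows, labels, m, gamma=1e-3):
    """Exhaustive search over all size-m subsets (illustration only)."""
    s_b, s_t = scatter_matrices(rows, labels)
    return max(combinations(range(len(rows[0])), m),
               key=lambda s: subset_criterion(s_b, s_t, s, gamma))


# Toy data: feature 1 duplicates feature 0; feature 2 tracks their noise.
labels = [0, 0, 0, 1, 1, 1]
rows = [[-1.0, -1.0, -1.0], [0.0, 0.0, 0.1], [1.0, 1.0, 0.9],
        [0.0, 0.0, -0.9], [1.0, 1.0, 0.0], [2.0, 2.0, 1.1]]
chosen = best_subset(rows, labels, m=2)
```

On this toy data the joint criterion prefers a pair containing feature 2 over the pair of duplicates, because the duplicate adds no new direction while feature 2 lets a linear projection cancel the shared noise.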

Empirical benchmarks (UCI data, ORL faces, USPS digits) show that GFS outperforms not only standard Fisher score but also Laplacian score, HSIC, and trace-ratio methods—especially in scenarios requiring joint feature evaluation or minimization of redundancy (Gu et al., 2012).

3. Information-theoretic Generalization: $\chi^\beta$-divergence and Extended Fisher Information

Beyond the simplex-centric and feature selection perspectives, the Fisher score and Fisher information admit substantial generalization via $\chi^\beta$-divergence frameworks. For probability densities $p(x)$, $q(x)$ on $X\subset\mathbb{R}^n$ and an auxiliary density $g(x)$, the modified $\chi^\beta$-divergence is defined as:

$$\Delta_\chi^\beta(p\|q;g) := \int_X \left| \frac{q(x)-p(x)}{g(x)} \right|^\beta g(x)\, dx = \mathbb{E}_g \left[ \left| \frac{q-p}{g} \right|^\beta \right]$$

(Bercher, 2013). Local quadratic expansion leads to the generalized Fisher information of order $\beta$:

$$I_\beta[f_\theta|g_\theta;\theta] := \sum_{i=1}^m \mathbb{E}_{g_\theta} \left[ \left| \frac{\partial_i f_\theta(x)}{g_\theta(x)} \right|^\beta \right]$$

whose associated generalized score vector is

$$\psi_g(x;\theta) := \frac{\nabla_\theta f(x;\theta)}{g(x;\theta)}$$

When $\beta=2$, the corresponding extended Fisher information matrix arises. This construction provides the basis for generalized Cramér–Rao inequalities and new characterizations of minimum-uncertainty distributions under non-standard norms and escort density pairs (see below) (Bercher, 2013).
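A numerical sanity check of these definitions is possible in the simplest scalar case. The sketch below assumes a Gaussian location family $f_\theta = \mathcal{N}(\theta, \sigma^2)$ and the particular choice $g_\theta = f_\theta$ (both are assumptions made here, not choices taken from the source). Under that choice $\beta = 2$ should reproduce the classical Fisher information $1/\sigma^2$, and the divergence between $f_\theta$ and $f_{\theta+\delta}$ divided by $|\delta|^\beta$ should approach $I_\beta$ for small $\delta$.

```python
import math


def gaussian_pdf(x, mean, sigma):
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def integrate(func, lo, hi, steps=20000):
    """Midpoint Riemann sum; adequate for smooth, fast-decaying integrands."""
    width = (hi - lo) / steps
    return sum(func(lo + (i + 0.5) * width) for i in range(steps)) * width


def generalized_fisher_information(beta, theta, sigma, lo=-12.0, hi=12.0):
    """I_beta = E_g[|d_theta f / g|^beta] with the assumed choice g = f."""
    def integrand(x):
        f = gaussian_pdf(x, theta, sigma)
        d_theta_f = f * (x - theta) / sigma ** 2  # derivative of f in theta
        return abs(d_theta_f / f) ** beta * f
    return integrate(integrand, lo, hi)


def chi_beta_divergence(beta, theta, delta, sigma, lo=-12.0, hi=12.0):
    """Divergence of f_{theta+delta} from f_theta, with reference g = f_theta."""
    def integrand(x):
        g = gaussian_pdf(x, theta, sigma)
        q = gaussian_pdf(x, theta + delta, sigma)
        return abs((q - g) / g) ** beta * g
    return integrate(integrand, lo, hi)


sigma = 1.5
i2 = generalized_fisher_information(2.0, 0.0, sigma)    # classical case
i15 = generalized_fisher_information(1.5, 0.0, sigma)   # order beta = 1.5

# Small-shift check: divergence / |delta|^beta should be close to I_beta.
delta = 0.01
ratio = chi_beta_divergence(1.5, 0.0, delta, sigma) / delta ** 1.5
```

The quadrature limits of $\pm 12$ are chosen for $\sigma = 1.5$ and $\theta = 0$; other parameter values would need wider or shifted limits.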

4. Generalized Cramér–Rao Bounds and Applications

The extension of Fisher information to arbitrary norms and powers yields a family of generalized Cramér–Rao inequalities, encompassing higher-order moments, general loss functions, and arbitrary bias structures. For an estimator $T(X)$ of $h(\theta)\in\mathbb{R}^k$ (possibly biased), Hölder conjugate exponents $\alpha,\beta>1$, and an arbitrary norm $\|\cdot\|$ with dual $\|\cdot\|_*$, the inequality reads:

$$\left( \mathbb{E}_{g_{\theta}} \left[ \| T(X) - h(\theta) \|^\alpha \right] \right)^{1/\alpha} \cdot \left( \mathbb{E}_{g_{\theta}} \left[ \left\| H \frac{\nabla_\theta f(X;\theta)}{g(X;\theta)} \right\|_*^\beta \right] \right)^{1/\beta} \ge \left| k + \nabla_{h(\theta)} \cdot B_f ( h(\theta) ) \right|$$

where $H_{ij} = \partial \theta_j/\partial h_i$ and $B_f$ is the bias (Bercher, 2013). In the unbiased case, this generalizes the variance–Fisher information duality to the context of $\chi^\beta$-divergences and general powers $\alpha$, $\beta$.
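The scalar, unbiased special case of the inequality can be checked numerically. The sketch below assumes a single Gaussian observation $X \sim \mathcal{N}(\theta, \sigma^2)$, the estimator $T(X) = X$ of $h(\theta) = \theta$ (so $k = 1$, $H = 1$, zero bias) and $g = f$; these choices are illustrative assumptions, not taken from the source. In that setting the right-hand side equals 1.

```python
import math


def gaussian_pdf(x, mean, sigma):
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def expectation(func, mean, sigma, steps=20000, span=10.0):
    """E[func(X)] for X ~ N(mean, sigma^2), by a midpoint Riemann sum."""
    lo, hi = mean - span * sigma, mean + span * sigma
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width
        total += func(x) * gaussian_pdf(x, mean, sigma)
    return total * width


def cramer_rao_product(alpha, theta, sigma):
    """Left-hand side of the inequality for T(X) = X and g = f."""
    beta = alpha / (alpha - 1.0)  # Hölder conjugate: 1/alpha + 1/beta = 1
    error_moment = expectation(
        lambda x: abs(x - theta) ** alpha, theta, sigma)
    score_moment = expectation(
        lambda x: abs((x - theta) / sigma ** 2) ** beta, theta, sigma)
    return error_moment ** (1.0 / alpha) * score_moment ** (1.0 / beta)


# alpha = beta = 2 is the classical variance / Fisher information case,
# where the Gaussian attains the bound (product equals 1 up to quadrature error).
product_2 = cramer_rao_product(2.0, theta=0.3, sigma=1.5)

# alpha = 3, beta = 1.5: the bound still holds but is no longer attained.
product_3 = cramer_rao_product(3.0, theta=0.3, sigma=1.5)
```

That the Gaussian is tight only for $\alpha = \beta = 2$ is consistent with the idea that other exponent pairs have different extremal distributions.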

In particular, for translation families and appropriate escort pairs, the minimizers of generalized Fisher information at fixed moment constraints are $q$-Gaussians:

$$g(x) \propto \left[1-\gamma(q-1)\|x\|^{\alpha}\right]_+^{1/(q-1)},$$

uniquely saturating the $q$-Cramér–Rao bound, and yielding a variational characterization for classes of maximum-entropy distributions (Bercher, 2013).
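The shape of this family is easy to explore numerically. The sketch below evaluates the unnormalized scalar density from the formula above; the function name `q_gaussian_unnormalized` and the explicit handling of $q = 1$ as the limiting exponential case are choices made here for illustration, and the normalizing constant is deliberately omitted.

```python
import math


def q_gaussian_unnormalized(x, q, gamma, alpha=2.0):
    """[1 - gamma (q - 1) |x|^alpha]_+^{1/(q - 1)}, without normalization."""
    if q == 1.0:
        # Limiting case q -> 1 of the expression below.
        return math.exp(-gamma * abs(x) ** alpha)
    base = 1.0 - gamma * (q - 1.0) * abs(x) ** alpha
    if base <= 0.0:
        return 0.0  # positive-part cutoff: compact support when q > 1
    return base ** (1.0 / (q - 1.0))


# q > 1: compact support, the density vanishes once the bracket turns negative.
inside = q_gaussian_unnormalized(1.0, q=1.5, gamma=1.0)
outside = q_gaussian_unnormalized(3.0, q=1.5, gamma=1.0)

# q close to 1: approaches the ordinary Gaussian shape exp(-gamma x^2).
near_gaussian = q_gaussian_unnormalized(1.0, q=1.0 + 1e-6, gamma=1.0)

# q < 1: power-law tails, far heavier than the Gaussian at the same point.
heavy_tail = q_gaussian_unnormalized(10.0, q=0.5, gamma=1.0)
gaussian_tail = q_gaussian_unnormalized(10.0, q=1.0, gamma=1.0)
```

For $q > 1$ the support is bounded, for $q < 1$ the bracket stays positive and the negative exponent gives power-law decay, and $q \to 1$ recovers the exponential form.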

5. Information-geometric and Analytical Consequences

Extending the Fisher score and Fisher information has consequences for information geometry and statistical analysis. On the closed simplex, the construction of the tangent bundle of $p$-contrasts and of the generalized score as a Radon–Nikodym derivative supports a dually flat structure on each face. The exponential and mixture connections extend naturally to boundary cases, which underpins the applicability of natural gradients, information-geometric geodesics, and Cramér–Rao bounds to statistical models with non-strictly positive distributions (Pistone et al., 7 Feb 2026).

Within the $\chi^\beta$ framework, generalized Fisher information appears as the time derivative of generalized entropy in nonlinear diffusion flows (an extended de Bruijn identity), and new uncertainty relations are derived that generalize the Weyl–Heisenberg principle and are saturated by $q$-Gaussians (Bercher, 2013).

6. Summary Table of Main GFS Incarnations

| Context | Core Principle/Formulation | Reference |
|---|---|---|
| Closed simplex/statistical geometry | Generalized score via $v(x;t)=s(x;t)\gamma(x;t)$, $T_p\Delta$ fibers | (Pistone et al., 7 Feb 2026) |
| Feature selection (machine learning) | Joint maximization of a lower bound on the Fisher criterion via QCLP | (Gu et al., 2012) |
| Information theory/geometry | Generalized score vector $\psi_g(x;\theta)$ via $\chi^\beta$-divergence | (Bercher, 2013) |

The Generalized Fisher Score, in its various incarnations, extends classical information-theoretic and statistical principles to models with boundaries, complex feature structure, and generalized divergence measures. Together, the three lines of work supply algebraic, geometric, and algorithmic tools, with analytic, computational, and estimation-theoretic results across finite, continuous, and high-dimensional domains.
