Implicit Deep Equilibrium Models (i-FPN)
- Implicit Deep Equilibrium Models (i-FPN) are defined as fixed-point equations that model infinite-depth behavior without explicit layer unrolling.
- They employ subhomogeneous nonlinear operators ensuring global convergence, stability, and unique equilibria across diverse architectures.
- i-FPN architectures enable efficient backpropagation via implicit differentiation, achieving competitive performance in vision, language modeling, and object detection.
Implicit Deep Equilibrium Models (i-FPN) define the output embedding of a deep neural network as the unique solution of a single, possibly infinite-depth, fixed-point equation rather than an explicit, finite unrolling of layers. This implicitization offers constant-memory learning with global feature propagation, sharp non-Euclidean stability guarantees, and a modular design space governed by subhomogeneous nonlinear operator theory. The i-FPN paradigm extends across feedforward networks, multiscale pyramids for vision, convolutional and graph architectures, and is especially prominent in the context of feature pyramid networks for dense prediction.
1. Mathematical Foundations: Fixed-Point and Subhomogeneity
An i-FPN models the hidden state $z$ for input $x$ as the solution to an operator equation
$$z = f_\theta(z, x),$$
with $\theta$ parametrizing the network family. In canonical cases, $f_\theta(z, x) = \sigma(Wz + Ux + b)$, but more generally, compositional, multiscale, or graph-coupled operators may be used. This formulation replaces explicit deep stacks with the root of the nonlinear residual $z - f_\theta(z, x)$, capturing the effect of “infinite” depth via equilibrium.
Subhomogeneity, introduced to generalize existence and uniqueness results, imposes weaker requirements than strict contractivity. A mapping $F : \mathbb{R}^n_{++} \to \mathbb{R}^n_{++}$ is said to be Clarke-subhomogeneous on the strict positive cone $\mathbb{R}^n_{++}$ with constant $\mu \geq 0$ if for all $x \in \mathbb{R}^n_{++}$ and all Clarke generalized Jacobians $M \in \partial F(x)$,
$$|M|\, x \preceq \mu\, F(x),$$
or, equivalently (under differentiability and componentwise positivity),
$$|F'(x)|\, x \preceq \mu\, F(x),$$
where $\preceq$ denotes the natural cone order on $\mathbb{R}^n$. Classical contraction is recovered as the special case of a $\mu$-Lipschitz map in the Euclidean metric, but subhomogeneity enables analysis in the Thompson metric and on broader operator classes (Sittoni et al., 2024).
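For intuition, the subhomogeneity constant of a single activation can be estimated numerically. The sketch below (illustrative, not taken from the cited paper) does this for the shifted tanh $\sigma(x) = \tanh(x) + 1$, for which the Clarke condition reduces, in the scalar case, to bounding the ratio $\sigma'(x)\,x/\sigma(x)$ over the positive half-line:

```python
import numpy as np

# Scalar sanity check of the subhomogeneity condition |F'(x)| x <= mu F(x):
# for a scalar map it reduces to the ratio F'(x) x / F(x). Here F is the
# shifted tanh sigma(x) = tanh(x) + 1, strictly positive for x > 0.

def sigma(x):
    return np.tanh(x) + 1.0

def sigma_prime(x):
    return 1.0 - np.tanh(x) ** 2

# dense grid on the (strict) positive half-line
x = np.linspace(1e-6, 50.0, 200_000)
ratio = sigma_prime(x) * x / sigma(x)
mu = ratio.max()
print(f"empirical subhomogeneity constant mu ~ {mu:.3f}")  # well below 1
```

The ratio peaks near $x \approx 0.7$ and stays strictly below 1, which is the property the framework exploits; an unshifted tanh, by contrast, attains the ratio 1 at the origin.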
2. Existence, Uniqueness, and Nonlinear Perron–Frobenius Theory
With normalization via a 1-homogeneous, order-preserving functional $\lambda$ (e.g., a monotone norm such as $\|\cdot\|_1$), the subhomogeneous framework provides completeness of the normalized positive slice $S_\lambda = \{x \in \mathbb{R}^n_{++} : \lambda(x) = 1\}$ in the Thompson metric $d_T(x, y) = \|\log x - \log y\|_\infty$.
Existence and uniqueness follow from contraction in this metric: for $F$ subhomogeneous with constant $\mu < 1$ (possibly without additional smoothness), the normalized map $x \mapsto F(x)/\lambda(F(x))$ is contractive, and the Krasnosel’skiĭ–Mann iteration converges globally at a linear rate governed by $\mu$ (Sittoni et al., 2024). If $F$ is linear, subhomogeneity reduces to an entrywise condition on the weight matrix that is strictly weaker than an operator-norm bound, further enlarging the design space for implicit models.
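The global-convergence claim can be illustrated on a toy positive operator. In the sketch below, $F(z) = Az + b$ with entrywise nonnegative $A$ and strictly positive $b$ is subhomogeneous on the positive cone (all parameter choices are illustrative); iterating the $\lambda$-normalized map from two different positive initializations drives both to the same equilibrium on the slice $\|z\|_1 = 1$:

```python
import numpy as np

# Normalized fixed-point iteration on the positive cone. The affine map
# F(z) = A z + b (A >= 0 entrywise, b > 0) is subhomogeneous, so the
# lambda-normalized iteration converges globally; two different positive
# initializations reach the same equilibrium. A and b are illustrative.

rng = np.random.default_rng(0)
n = 8
A = rng.random((n, n))          # entrywise nonnegative
b = rng.random(n) + 0.5         # strictly positive shift

def G(z):                        # normalized map, lambda(z) = ||z||_1
    y = A @ z + b
    return y / y.sum()

def thompson(x, y):              # Thompson metric on the positive cone
    return np.abs(np.log(x) - np.log(y)).max()

z1 = rng.random(n) + 0.1; z1 /= z1.sum()
z2 = rng.random(n) + 0.1; z2 /= z2.sum()
for _ in range(300):
    z1, z2 = G(z1), G(z2)

print("Thompson distance between trajectories:", thompson(z1, z2))
print("fixed-point residual:", np.abs(G(z1) - z1).max())
```

Plain iteration of the normalized map is the Krasnosel’skiĭ–Mann scheme with unit averaging weight; damped averaging behaves analogously.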
3. Algorithmic Framework: Forward and Backward Dynamics
Solving the i-FPN equilibrium involves iterative methods:
- Forward. Initialize $z^0$ (e.g., $z^0 = 0$) and iterate
$$z^{k+1} = f_\theta(z^k, x)$$
until convergence. In practice, Picard iteration, Broyden’s quasi-Newton method, or Anderson acceleration is deployed for improved efficiency and memory scaling. A typical halting criterion is the relative residual $\|z^{k+1} - z^k\| / \|z^k\| < \varepsilon$.
- Backward. Gradients are computed via implicit differentiation: solve
$$\bigl(I - J_{f_\theta}(z^*)\bigr)^\top u = \Bigl(\frac{\partial \ell}{\partial z^*}\Bigr)^\top$$
for $u$, and set
$$\frac{\partial \ell}{\partial \theta} = u^\top \frac{\partial f_\theta(z^*, x)}{\partial \theta}.$$
Matrix-free Jacobian–vector products allow $O(1)$ memory with respect to depth, since only equilibrium states and solver history must be retained (Wang et al., 2020, Bai et al., 2020).
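The two phases can be exercised end to end on a tiny dense layer. The following sketch (layer, loss, and all shapes illustrative) solves the forward fixed point by Picard iteration, computes the implicit gradient with respect to the input by solving the transposed linear system above, and validates it against finite differences:

```python
import numpy as np

# Tiny DEQ layer f(z, x) = tanh(W z + U x); W is scaled small so Picard
# iteration contracts. Implicit gradient w.r.t. x is checked against
# finite differences for the quadratic loss l(z*) = 0.5 ||z*||^2.

rng = np.random.default_rng(1)
n, m = 5, 3
W = rng.normal(scale=0.15, size=(n, n))
U = rng.normal(size=(n, m))
x = rng.normal(size=m)

def solve_forward(x, iters=500):
    z = np.zeros(n)
    for _ in range(iters):
        z = np.tanh(W @ z + U @ x)
    return z

z = solve_forward(x)                       # equilibrium z* = f(z*, x)
D = 1.0 - np.tanh(W @ z + U @ x) ** 2      # diagonal of tanh'
J = D[:, None] * W                         # Jacobian of f w.r.t. z at z*

# solve (I - J)^T u = dl/dz*, with dl/dz* = z*
u = np.linalg.solve((np.eye(n) - J).T, z)
grad_x = (D[:, None] * U).T @ u            # chain through df/dx = diag(D) U

# finite-difference check, coordinate by coordinate
eps, fd = 1e-6, np.zeros(m)
for i in range(m):
    xp = x.copy(); xp[i] += eps
    zp = solve_forward(xp)
    fd[i] = (0.5 * zp @ zp - 0.5 * z @ z) / eps

print("max gradient error:", np.abs(grad_x - fd).max())
```

The same pattern yields parameter gradients by replacing $\partial f/\partial x$ with $\partial f/\partial \theta$ in the final contraction, all without storing intermediate iterates.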
4. Model Instantiations and Architectural Principles
i-FPN theory is applied to diverse neural architectures:
- Feedforward DEQ. $z = \sigma(Wz + Ux + b)$ with subhomogeneous activations (ReLU, shifted $\tanh$, SoftPlus, sigmoid) and normalization (e.g., division by $\lambda$) ensures existence and a unique equilibrium without eigenvalue constraints on $W$ (Sittoni et al., 2024).
- Convolutional DEQ. $z = \sigma(W * z + U * x + b)$, where $*$ is convolution; subhomogeneity holds under a mild condition on the kernel’s induced operator norm.
- Implicit GNN (APPNP variant). Nonlinear variants replace the standard linear propagation $z = (1-\alpha)\hat{A}z + \alpha x$ with $z = (1-\alpha)\,\sigma(\hat{A}z) + \alpha x$, then normalize and enforce subhomogeneity of $\sigma$.
- Multiscale i-FPN (MDEQ). A vector of feature tensors $z = (z_1, \dots, z_S)$, one per resolution, is updated blockwise:
- Intra-scale: residual convolutional blocks with group normalization.
- Cross-scale: structured up/downsampling allows joint equilibrium across all levels.
- Equilibrium is found by solving $z^* = f_\theta(z^*, x)$ jointly over all scales via Broyden or Anderson solvers (Bai et al., 2020).
- Object Detection i-FPN. Backbone feature maps (e.g., from ResNet stages) are stacked into a pyramid $P$, and the shared cross-scale transformation $G_\theta$ is repeatedly applied until the equilibrium condition
$$P^* = G_\theta(P^*, x)$$
is satisfied to solver tolerance. The output $P^*$ feeds into standard RPN/heads. Performance consistently exceeds explicit FPNs on COCO: +3.4 (RetinaNet), +3.2 (Faster R-CNN), +3.5 (FCOS), +4.2 (ATSS), +3.2 (AutoAssign) mAP (Wang et al., 2020).
5. Stability, Regularization, and Practical Guidelines
Subhomogeneous i-FPNs exhibit global linear convergence at rate $\mu$, strong robustness to perturbations, and reproducibility, since every initialization in the positive cone converges to the same equilibrium $z^*$. The input-to-equilibrium map is Lipschitz in the Thompson metric, with a constant controlled by $\mu$ (Sittoni et al., 2024).
Stabilization via explicit Jacobian regularization further enhances robustness: penalizing the Frobenius norm $\|J_{f_\theta}(z^*)\|_F$ shrinks the spectral radius, ensures forward and backward solver convergence, and reduces the number of solver iterations (NFEs) with little accuracy loss. For multiscale i-FPN/DEQ, the regularization is accumulated per scale (Bai et al., 2021).
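The penalized quantity never requires materializing the Jacobian: the standard Hutchinson identity $\mathbb{E}_v \|Jv\|^2 = \|J\|_F^2$ for Rademacher $v$ needs only Jacobian–vector products. A minimal sketch, with a small dense matrix standing in for the equilibrium Jacobian:

```python
import numpy as np

# Hutchinson-style estimate of ||J||_F^2: for v with i.i.d. +-1 entries,
# E ||J v||^2 = ||J||_F^2, so only Jacobian-vector products are needed.
# A small dense J stands in for the equilibrium Jacobian (illustrative).

rng = np.random.default_rng(3)
n = 6
J = rng.normal(size=(n, n))

num_samples = 20_000
est = np.mean([np.sum((J @ rng.choice([-1.0, 1.0], size=n)) ** 2)
               for _ in range(num_samples)])
exact = np.sum(J ** 2)
print(f"Hutchinson estimate {est:.2f} vs exact {exact:.2f}")
```

In training, a single probe vector per step is typically enough, since the estimator's noise averages out across minibatches.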
Design guidelines include:
- Using subhomogeneous activations (shifted $\tanh$, ReLU restricted to the positive cone, positive SoftPlus/sigmoid).
- Final normalization by the functional $\lambda$ to enforce compactness of the normalized slice.
- For linear/convolutional kernels, a mild entrywise condition on the weights suffices (no PSD or operator-norm restriction).
- For optimization-guided variants, expressing each layer as a proximal operator enables convex regularization and feature selection via the SAM algorithm (Xie et al., 2021).
6. Relation to Optimization, Implicit Bias, and Continuous-Time Formulations
i-FPN equilibrium points correspond, in some architectures, to global solutions of convex minimization problems, with the equilibrium equation arising as the first-order optimality condition of a convex objective (e.g., a proximal fixed point). This allows direct incorporation of prior knowledge and feature regularization by modifying objectives or through bilevel optimization schemes (Xie et al., 2021).
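A concrete instance of this correspondence (illustrative, not the cited construction) is a proximal-gradient layer: the fixed point of soft-thresholded gradient steps is exactly the global minimizer of the convex lasso objective $\tfrac{1}{2}\|Az - y\|^2 + t\|z\|_1$, which can be certified by the KKT conditions:

```python
import numpy as np

# ISTA layer: iterate z <- soft(z - eta A^T (A z - y), eta t). Its fixed
# point is the global lasso minimizer; we verify the subgradient optimality
# conditions at the equilibrium. A, y, t, and eta are illustrative.

rng = np.random.default_rng(6)
m, n, t = 12, 6, 0.1
A = rng.normal(size=(m, n))
y = rng.normal(size=m)
eta = 1.0 / np.linalg.norm(A.T @ A, 2)     # step size 1/L

def soft(v, s):                             # prox of s * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - s, 0.0)

def layer(z):
    return soft(z - eta * A.T @ (A @ z - y), eta * t)

z = np.zeros(n)
for _ in range(5000):
    z = layer(z)

g = A.T @ (A @ z - y)                       # gradient of the smooth part
kkt = (np.all(np.abs(g[z == 0.0]) <= t + 1e-6)
       and np.allclose(g[z != 0.0], -t * np.sign(z[z != 0.0]), atol=1e-6))
print("lasso KKT conditions satisfied:", kkt)
```

Changing the regularizer changes the proximal operator, which is the handle through which prior knowledge enters the layer.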
Gradient-based learning on i-FPN/DEQ models is closely related to trust-region Newton methods on shallow equivalent problems, inducing a distinct implicit bias that may underlie their empirical generalization properties (Kawaguchi, 2021).
Continuous DEQ models reinterpret the fixed point as the infinite-time limit of a neural ODE, $\dot{z} = f_\theta(z, x) - z$ with $z^* = \lim_{t \to \infty} z(t)$. Backpropagation operates purely through the terminal equilibrium, avoiding adjoint ODE solves and offering 2–4× speedup over traditional Neural ODEs (Pal et al., 2022). IMEX strategies further bridge explicit and implicit paradigms, reducing wall-clock and backward-pass costs in large-scale tasks.
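The equivalence between the ODE's steady state and the discrete fixed point is easy to verify numerically. The sketch below (layer and integrator choices illustrative) integrates $\dot{z} = f(z, x) - z$ with explicit Euler and compares the result with plain Picard iteration on the same layer:

```python
import numpy as np

# Continuous-DEQ view: the steady state of z'(t) = f(z, x) - z coincides
# with the fixed point z* = f(z*, x). W is scaled small so both the ODE
# flow and the Picard iteration converge.

rng = np.random.default_rng(4)
n = 6
W = rng.normal(scale=0.15, size=(n, n))
x = rng.normal(size=n)

def f(z):
    return np.tanh(W @ z + x)

# explicit Euler integration of the ODE to (approximately) infinite time
z_ode, dt = np.zeros(n), 0.1
for _ in range(2000):
    z_ode = z_ode + dt * (f(z_ode) - z_ode)

# plain Picard fixed-point iteration on the same layer
z_fp = np.zeros(n)
for _ in range(500):
    z_fp = f(z_fp)

print("discrepancy:", np.abs(z_ode - z_fp).max())
```

Since only the terminal state matters, the integration tolerance (not the trajectory) is what training must control, which is the source of the reported speedups.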
7. Empirical Performance, Limitations, and Outlook
Empirical studies show that i-FPN/MDEQ achieves accuracy competitive with state-of-the-art explicit deep networks on high-dimensional tasks with significantly reduced memory requirements:
- Vision: MDEQ-large (63M params) achieves 77.5% top-1 on ImageNet, on par with ResNet-101 (77.1%). Cityscapes mIoU matches or nears top explicit networks (Bai et al., 2020).
- Object detection: i-FPN consistently adds 3–4 mAP points over FPN-based detectors (Wang et al., 2020).
- Language modeling: DEQ-Transformers with Jacobian regularization maintain perplexity within 1% of Transformer-XL, halving iteration requirements (Bai et al., 2021).
- Efficiency: Jacobian regularization reduces forward/backward iterations (NFEs) by 2–3× with minimal accuracy loss.
Remaining challenges include the higher per-batch time cost (∼6× slower than explicit FPN during training), solver parameter tuning, and stability for very deep or stiff nonlinearities. However, the modularity and rigorous guarantees afforded by subhomogeneity, implicit differentiation, and operator-theoretic analysis mark i-FPNs as a robust platform for scalable, memory-efficient deep learning (Sittoni et al., 2024, Wang et al., 2020, Bai et al., 2020, Kawaguchi, 2021, Bai et al., 2021, Xie et al., 2021, Pal et al., 2022).