Entropy Penalized Reparameterization
- Entropy penalized reparameterization is a methodology that augments latent parameterizations with a learned entropy penalty to unify model compression, regularization, and control.
- It employs a reparameterization strategy that integrates discretization, quantization, and learned decoders to enable exact entropy computations and tractable optimization.
- The method demonstrates significant improvements in neural compression, 3D implicit scene encoding, and stochastic control by optimizing trade-offs between fidelity and compressibility.
Entropy penalized reparameterization is a methodology that unifies model compression, regularization, and control by augmenting latent parameterizations or optimization variables with a learned entropy penalty. This penalty, based on an explicit or implicit probability model, scales with the compressibility of the latent representation and is designed to yield highly compact and structured solutions. The paradigm has been deployed in neural network compression, 3D implicit scene encoding, and stochastic control, where entropy penalization improves tractability, generalization, and efficiency by directly optimizing for both task fidelity and information content (Oktay et al., 2019, Bird et al., 2021, Bourdais et al., 2023).
1. Core Principles and Mathematical Formalism
Entropy penalized reparameterization embeds latent variables—such as discretized weights or candidate probability measures—within an optimization or training objective augmented by an entropy (or relative entropy) penalty. In the neural network context, this typically takes the form of
$$\min_{\tilde{\Theta}}\; \mathcal{L}\big(f(\tilde{\Theta})\big) + \lambda\, H\big(Q(\tilde{\Theta})\big),$$
where $\mathcal{L}$ is a reconstruction or classification loss, $H$ is the (possibly learned) entropy of the discretized latent representation $Q(\tilde{\Theta})$, and $\lambda$ governs the rate/distortion or rate/accuracy trade-off (Oktay et al., 2019, Bird et al., 2021). In stochastic control, the penalty is usually relative entropy (Kullback–Leibler divergence) between two measures:
$$D_{\mathrm{KL}}(\mathbb{Q}\,\|\,\mathbb{P}) = \mathbb{E}_{\mathbb{Q}}\!\left[\log \frac{d\mathbb{Q}}{d\mathbb{P}}\right].$$
The entropy term incentivizes compressibility, sparsity, or proximity of auxiliary variables, acting as a regularizer that may render the optimization objective strictly convex or more tractable (Bourdais et al., 2023).
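The discrete form of this objective is small enough to sketch directly. The snippet below (function names are illustrative, not from the cited papers) computes a task loss plus $\lambda$ times the entropy, in bits, of a PMF over quantized latent symbols:

```python
import numpy as np

def entropy_bits(pmf):
    """Discrete entropy H = -sum p*log2(p) of a probability mass function."""
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]  # 0*log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())

def penalized_objective(task_loss, latent_pmf, lam):
    """Rate/accuracy objective: task loss plus lambda times the entropy
    (expected code length in bits) of the discretized latents."""
    return task_loss + lam * entropy_bits(latent_pmf)

# A uniform PMF over 4 symbols costs 2 bits; a peaked PMF costs less,
# so the penalty rewards concentrating mass on few symbols.
obj = penalized_objective(0.5, [0.25, 0.25, 0.25, 0.25], lam=0.1)  # 0.5 + 0.1*2
```

A peaked PMF such as `[1.0]` has zero entropy, which is why the penalty drives the latents toward a small, selective symbol set.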
2. Reparameterization and Quantization Structures
Practical deployment involves "lifting" each real-valued model parameter to a continuous latent domain, which is then discretized by quantization:
- In neural compression, continuous per-layer pre-weights $\tilde{\phi}$ are quantized either by rounding or by other discrete assignment mechanisms: $\hat{\phi} = Q(\tilde{\phi}) = \lfloor \tilde{\phi} \rceil$.
- Each quantized latent is mapped back to the parameter space with a learned, low-dimensional decoder, e.g., a layer-wise affine transformation (Bird et al., 2021), or through more general functions (Oktay et al., 2019).
This discretization enables exact entropy computations and source coding, while learned priors over the latent space induce anisotropic compression (Oktay et al., 2019, Bird et al., 2021). For stochastic control, reparameterization refers to introducing duplicate variables or decoupled probability measures, penalizing their divergence with relative entropy to split the minimization into manageable substeps (Bourdais et al., 2023).
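The lift-quantize-decode pipeline above can be sketched in a few lines; the per-layer affine decoder mirrors the transformation described by Bird et al. (2021), but all names and constants here are illustrative:

```python
import numpy as np

def quantize(pre_weights):
    """Round continuous pre-weights to the nearest integer latent symbol."""
    return np.rint(pre_weights).astype(int)

def affine_decode(latents, scale, shift):
    """Map integer latents back to parameter space with a learned,
    layer-wise affine transform (one scale/shift pair per layer)."""
    return scale * latents + shift

pre = np.array([0.2, 1.7, -0.6, 3.1])   # continuous pre-weights for one layer
q = quantize(pre)                        # integer symbols, e.g. [0, 2, -1, 3]
weights = affine_decode(q, scale=0.05, shift=0.01)  # decoded model weights
```

Because the symbols `q` are integers with a learned prior, they can be entropy-coded exactly; only `q`, the decoder parameters, and the prior need to be stored.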
3. Entropy Penalty, Probability Modeling, and Training Dynamics
The entropy penalty encourages the model to employ a small, selective set of latent symbols (or to align measures in control), achieved as follows:
- Probabilities for each quantized latent are modeled by learned univariate PMFs $p(\hat{\phi})$. During training, these are estimated via small auxiliary networks, and the total bitrate penalty is approximated as $-\log_2 p(\hat{\phi})$ per parameter (Bird et al., 2021, Oktay et al., 2019).
- To preserve differentiability, additive uniform noise on $[-\tfrac{1}{2}, \tfrac{1}{2})$ is injected before quantization, and losses are optimized in expectation over the noise.
- Training uses the straight-through estimator (STE) for quantization, treating the rounding operation as identity in the backward pass to propagate gradients (Bird et al., 2021, Oktay et al., 2019).
- In high-dimensional stochastic control, relative entropy regularizes the discrepancy between two laws $\mathbb{P}$ and $\mathbb{Q}$. Optimization alternates between exponential twist updates (variational inference-like) and drift minimization (convex control), with provable convergence (Bourdais et al., 2023).
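The two differentiability tricks in the list above (noise injection and the straight-through estimator) can be written compactly. This is a framework-agnostic sketch in NumPy; in practice the `(rint(x) - x)` term would be wrapped in a stop-gradient:

```python
import numpy as np

def ste_quantize(x):
    """Straight-through estimator: the forward pass rounds to the nearest
    integer; in an autodiff framework the (rint(x) - x) term would be
    detached, so the backward pass treats rounding as the identity
    (gradient exactly 1)."""
    return x + (np.rint(x) - x)

def noisy_latent(x, rng):
    """Differentiable proxy for quantization: additive uniform noise on
    [-0.5, 0.5), so the loss can be optimized in expectation over noise
    instead of through the non-differentiable rounding step."""
    return x + rng.uniform(-0.5, 0.5, size=np.shape(x))
```

The noise interval matches the rounding bin width, so the noisy latent has the same marginal spread as the quantization error it stands in for.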
4. Architectures, Sharing Mechanisms, and Algorithmic Steps
Entropy penalized reparameterization has been applied to a broad class of models and tasks:
- For neural implicit 3D scene reconstruction, the paradigm is integrated into a NeRF architecture with parameter reparameterization, per-layer decoding, and joint multi-scene soft parameter sharing via a shared "shift" tensor and affine transformations (Bird et al., 2021). Parameter sharing exploits redundancies and reduces bitrate further at low rates.
- For standard DNN compression, a universal framework is proposed with groupwise integer latents, decoders for each layer (standard or DFT-based), and density networks modeling PMFs over latents (Oktay et al., 2019).
- The typical training loop proceeds by iterated stochastic gradient optimizer steps over the latent codes, decoders, and prior parameters, with post-hoc entropy coding (e.g., arithmetic coding) according to the learned prior (Oktay et al., 2019, Bird et al., 2021). For control, alternating minimization cycles between the exponential-twist measure update and the drift update (Bourdais et al., 2023).
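The neural-compression training loop can be summarized as a skeleton. All shapes, constants, and the empirical PMF (a stand-in for the learned density network) are illustrative assumptions, not details from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)
pre_weights = rng.normal(scale=2.0, size=512)  # one layer of continuous latents
lam = 0.02                                     # rate/accuracy trade-off weight

def empirical_pmf(symbols):
    """Empirical PMF over integer symbols; stands in here for the learned
    density network that models the prior during training."""
    _, counts = np.unique(symbols, return_counts=True)
    return counts / counts.sum()

for step in range(50):
    # 1) inject uniform noise as the differentiable quantization proxy
    noisy = pre_weights + rng.uniform(-0.5, 0.5, size=pre_weights.shape)
    # 2) rate term: average bits per parameter under the current prior
    pmf = empirical_pmf(np.rint(noisy).astype(int))
    rate_bits = -(pmf * np.log2(pmf)).sum()
    # 3) a gradient step on task_loss + lam * rate_bits would go here

# 4) after training: hard-quantize and entropy-code under the learned prior
symbols = np.rint(pre_weights).astype(int)
```

The key structural point is that the same prior serves double duty: it supplies the rate penalty during training and the code table for arithmetic coding afterwards.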
A summary table:
| Domain | Latent Representation | Penalty Type |
|---|---|---|
| Neural compression | Quantized integer weights | Discrete entropy ($-\log_2 p$) |
| Implicit scene coding | Quantized MLP weights | Discrete entropy + sharing term |
| Stochastic control | Probability measures | Relative entropy ($D_{\mathrm{KL}}$) |
5. Comparative Performance and Compression Effectiveness
Empirical studies demonstrate clear advantages:
- In neural network model compression, entropy penalized reparameterization achieves 19×–606× reduction in model size on standard benchmarks (MNIST, CIFAR-10, ImageNet) with negligible loss of accuracy, outperforming conventional pruning or quantization approaches without recourse to multi-stage procedures (Oktay et al., 2019).
- For 3D scene compression, the method delivers 5–10× smaller model size at equal or improved PSNR, MS-SSIM, and LPIPS compared to HEVC+LLFF baselines, with further gains from parameter sharing in multi-scene regimes (Bird et al., 2021).
- In stochastic control, the entropy-penalized reformulation enables efficient solution of very high-dimensional problems by reducing a nonconvex strong-form program to two tractable convex subproblems, guaranteeing convergence and preserving approximation accuracy as the penalty weight increases (Bourdais et al., 2023).
6. Extensions, Regularity, and Theoretical Guarantees
Regularization via entropy is robust to architectural and optimization choices:
- The learned prior adapts to quantization structure. If sharing or regularization is not useful for a given scene or parameter group, the prior collapses to a degenerate distribution; otherwise, it induces nontrivial structure supporting compression (Bird et al., 2021).
- In control, under hypotheses of continuity, growth, and convexity, the method admits existence and convergence guarantees, making it suitable for large-scale systems with dynamics and cost functionals of significant complexity (Bourdais et al., 2023).
- Extensions include parameter-tying and marginal constraints (martingale optimal transport, Schrödinger bridges), using the same basic entropy penalty mechanism (Bourdais et al., 2023).
7. Applications and Broader Impact
Entropy penalized reparameterization reframes standard compression and control problems:
- In end-to-end neural compression, it offers a unified framework that enables rate-accuracy/distortion trade-offs via a simple hyperparameter sweep, integrating seamlessly with neural architectures and providing practical post-training coding (Oktay et al., 2019, Bird et al., 2021).
- For neural radiance fields, it permits implicit, differentiably compressed 3D representations with fast, decoder-free rendering at test time (Bird et al., 2021).
- In high-dimensional stochastic optimization and control, it allows decoupling and efficient solution of problems that would otherwise require direct solutions of intractable PDEs or backward equations (Bourdais et al., 2023).
A plausible implication is that this approach, by combining reparameterization, learned probability modeling, and direct entropy penalization, is broadly applicable to settings that require controllable trade-offs between fidelity and compressibility or between optimization quality and tractability.