Instruction Compression in Neural Systems

Updated 7 June 2026

Instruction Compression is the process of converting lengthy operational instructions into compact, quantized representations using precise discount functions.
It employs both exponential and hyperbolic discounting techniques, along with state-dependent mechanisms, to model temporal preferences in neural networks.
This approach enhances performance in reinforcement learning and behavioral economics by improving convergence, sample efficiency, and the alignment of model predictions with empirical data.

Neural discount models describe how agents—including biological organisms and artificial neural systems—assign present value to temporally delayed events by discounting future outcomes. These models formalize the mapping of time delays, uncertainty, or memory constraints onto objective or subjective value, often through functionally or neurally plausible mechanisms. Neural discounting frameworks span psychology, neuroscience, and machine learning, covering discrete and continuous reward delay, intertemporal choice, scale-invariant and hyperbolic discounting, state-adaptive discounting, and applications to behavioral economics and reinforcement learning. Modern neural discount models are characterized by mathematically precise discount functions and network architectures that embody either parametric or learned forms of temporal preference.

1. Mathematical Foundations: Discount Functions and Quantized Representations

Classic discount models assume continuous exponential or hyperbolic decay of value:

Exponential: $V_e(t) = e^{-\lambda t}$, $\lambda > 0$
Hyperbolic: $V_h(t) = \frac{1}{1 + k t}$, $k > 0$

Neural discount models generalize or discretize these formulations to better match empirical behavioral and neural data. Tee & Taylor introduce a quantized discount operator $Q_N$ that maps $x \in [0,1]$ onto $2^N$ discrete bins, $Q_N(x) = \lfloor x\,2^N \rfloor / 2^N$, yielding quantized hyperbolic and exponential forms $V_{h,q}(t) = Q_N\bigl(1/(1+kt)\bigr)$ and $V_{e,q}(t) = Q_N\bigl(e^{-\lambda t}\bigr)$. Empirically, intertemporal choices in humans are fit best by models with approximately 5-bit resolution ($2^5 = 32$ discrete bins), indicating discrete coding of value under delay (Tee et al., 2020).
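The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the example parameter values are assumptions chosen here, and only the formulas for $V_e$, $V_h$, and $Q_N$ come from the text.

```python
import math


def exponential_discount(t: float, lam: float) -> float:
    """V_e(t) = exp(-lam * t), with lam > 0."""
    return math.exp(-lam * t)


def hyperbolic_discount(t: float, k: float) -> float:
    """V_h(t) = 1 / (1 + k * t), with k > 0."""
    return 1.0 / (1.0 + k * t)


def quantize(x: float, n_bits: int) -> float:
    """Q_N(x) = floor(x * 2^N) / 2^N for x in [0, 1]."""
    levels = 2 ** n_bits
    return math.floor(x * levels) / levels


def quantized_hyperbolic(t: float, k: float, n_bits: int) -> float:
    """V_{h,q}(t) = Q_N(1 / (1 + k * t))."""
    return quantize(hyperbolic_discount(t, k), n_bits)


def quantized_exponential(t: float, lam: float, n_bits: int) -> float:
    """V_{e,q}(t) = Q_N(exp(-lam * t))."""
    return quantize(exponential_discount(t, lam), n_bits)
```

With `n_bits=5`, nearby delays can collapse onto the same bin: for example, `quantized_hyperbolic(2.0, 1.0, 5)` maps the continuous value 1/3 down to 10/32.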

Non-exponential discounting, notably hyperbolic forms or arbitrary discount functions, is motivated by findings from behavioral economics and neuroscience. The choice of function (exponential, hyperbolic, or power-law) connects to theoretical assumptions about uncertainty (e.g., variable hazard rates), memory resources, or environmental statistics (Schultheis et al., 2022, Fedus et al., 2019, Ortega et al., 2016).
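A standard worked example illustrates the link between hazard-rate uncertainty and hyperbolic discounting. It is a textbook-style derivation rather than one taken from the cited papers, and the exponential prior over the rate $\lambda$ (with mean $k$) is an assumption made here for illustration. Averaging exponential discounting over that prior gives exactly the hyperbolic form:

$$\int_0^\infty e^{-\lambda t}\,\frac{1}{k}\,e^{-\lambda / k}\,d\lambda \;=\; \frac{1}{k}\cdot\frac{1}{t + 1/k} \;=\; \frac{1}{1 + k t} \;=\; V_h(t).$$

An agent that discounts exponentially at each possible rate, but is uncertain which rate applies, therefore behaves on average like a hyperbolic discounter.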

2. Neural Substrates and Scale-Invariant Mechanisms

Neural discount models are linked to circuit-level implementations, including leaky integrator dynamics, Hebbian associative memory, and logarithmically compressed temporal representations. In dynamic settings, evidence is discounted via a forgetting kernel with rate $\lambda$, which is driven by task volatility and sensory noise, as in evidence accumulation tasks where the environment changes at a given hazard rate (Piet et al., 2017). The discounting kernel shapes the integration window for relevant evidence.
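A leaky integrator with an exponential forgetting kernel can be sketched as follows. This is a generic discrete-time illustration, not the model fitted by Piet et al.; the function name, the parameter `lam` (forgetting rate), and the time step `dt` are assumptions introduced here.

```python
import math
from typing import Iterable, List


def leaky_accumulate(evidence: Iterable[float], lam: float, dt: float = 1.0) -> List[float]:
    """Accumulate evidence while forgetting old samples at rate lam.

    Each step multiplies the running total by exp(-lam * dt) before adding
    the new sample, so a sample that arrived j steps ago is weighted by
    exp(-lam * dt * j). With lam = 0 this reduces to a perfect integrator.
    """
    decay = math.exp(-lam * dt)
    total = 0.0
    trace = []
    for sample in evidence:
        total = decay * total + sample
        trace.append(total)
    return trace
```

A larger `lam` shortens the effective integration window: a pulse of evidence delivered early contributes less to the final value than the same pulse delivered late.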

Howard et al. propose a continuous-time model incorporating Laplace and inverse-Laplace encodings, yielding a full compressed timeline of future events for each cue. Integrating this timeline against a power-law kernel yields value estimates, unifying scale invariance and hyperbolic discounting in a locally computable, parallel neural circuit (Tiganj et al., 2018).

Encoding of these discounting computations at the neural level is compatible with observed discrete coding in reward-related cortical areas and is aligned with BOLD signal patterns (e.g., ventromedial prefrontal cortex) adapted to subjective value under discrete coding constraints (Tee et al., 2020).

3. Algorithmic Implementations in Reinforcement Learning

Neural discount models have been operationalized in deep RL architectures through several approaches:

Progressive or dynamic discount schedules: The discount factor $\gamma$ is increased gradually during training to stabilize neural network function approximation and accelerate convergence. Empirically, this reduces the number of learning steps and increases final performance (François-Lavet et al., 2015). The dynamic schedule is captured recursively by $\gamma_{k+1} = 1 - c\,(1 - \gamma_k)$ ($0 < c < 1$), starting from a small $\gamma_0$.
State-dependent neural discounting: AdaGamma replaces the fixed scalar $\gamma$ with a learned state-dependent function $\gamma(s)$ parameterized by an MLP. Learning is regularized by a return-consistency loss that compares the one-step TD target to an $n$-step discounted return, preventing degenerate collapse to short horizons (Wang et al., 2026). The framework preserves convergence guarantees at the operator level under mild assumptions and yields empirical gains in deep actor-critic architectures (e.g., SAC, PPO) on continuous-control and production-scale tasks.
Multi-horizon and hyperbolic discounting: To approximate non-exponential discounting (notably hyperbolic), architectures allocate multiple Q-function heads $Q_{\gamma_i}$ corresponding to a grid of $\gamma$ values. The hyperbolic Q-value is reconstructed via $Q_h(s,a) \approx \sum_i w_i\, Q_{\gamma_i}(s,a)$, where $w_i$ are analytically derived weights. Simultaneous training of multi-horizon heads serves as an auxiliary task, improving sample efficiency and stability (Fedus et al., 2019).
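The multi-horizon reconstruction rests on the identity $\int_0^1 \gamma^{kt}\,d\gamma = 1/(1+kt)$, which expresses the hyperbolic discount as a weighted mixture of exponential discounts. The sketch below checks this numerically on a fixed reward sequence instead of learned Q-heads. The midpoint grid, the function names, and the use of plain returns in place of TD-trained heads are all assumptions made for illustration and are not claimed to match the published implementation.

```python
from typing import List, Sequence, Tuple


def gamma_grid(n_heads: int, k: float) -> Tuple[List[float], List[float]]:
    """Midpoint grid on (0, 1).

    Returns the per-head exponential discount factors gamma_i ** k and the
    Riemann-sum weights w_i (all equal to the bin width here).
    """
    width = 1.0 / n_heads
    midpoints = [(i + 0.5) * width for i in range(n_heads)]
    return [g ** k for g in midpoints], [width] * n_heads


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Exponentially discounted return: sum_t gamma^t * r_t."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))


def hyperbolic_return_from_heads(rewards: Sequence[float], n_heads: int, k: float) -> float:
    """Weighted sum of exponential returns, approximating the hyperbolic return."""
    gammas, weights = gamma_grid(n_heads, k)
    return sum(w * discounted_return(rewards, g) for g, w in zip(gammas, weights))


def hyperbolic_return_exact(rewards: Sequence[float], k: float) -> float:
    """Direct hyperbolic return: sum_t r_t / (1 + k * t)."""
    return sum(r / (1.0 + k * t) for t, r in enumerate(rewards))
```

Because the return is linear in the discount weights, the same weighted sum applied to per-head value estimates recovers an approximate hyperbolic value, and the approximation tightens as the grid of heads becomes finer.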

Approach	Key Feature	Example Citations
Quantized Models	Finite-bit value coding	(Tee et al., 2020)
State-Dependent RL	$\gamma(s)$ via MLP	(Wang et al., 2026)
Multi-horizon RL	Multiple $\gamma$-heads	(Fedus et al., 2019, François-Lavet et al., 2015)
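The progressive discount schedule among the approaches above can be sketched as a short recursion. The sketch assumes an update of the form $\gamma_{k+1} = 1 - c\,(1 - \gamma_k)$ with $0 < c < 1$, which shrinks the gap to 1 by a constant factor at each update. The function name, the cap `gamma_max`, and the example constants are hypothetical choices made here, not values taken from the cited work.

```python
from typing import List


def gamma_schedule(gamma0: float, shrink: float, n_updates: int,
                   gamma_max: float = 0.99) -> List[float]:
    """Return [gamma_0, ..., gamma_n] under gamma_{k+1} = 1 - shrink * (1 - gamma_k).

    The gap (1 - gamma) is multiplied by `shrink` at every update, so gamma
    rises monotonically toward 1; `gamma_max` (an assumed safeguard) caps it.
    """
    if not 0.0 < shrink < 1.0:
        raise ValueError("shrink must lie strictly between 0 and 1")
    gammas = [min(gamma0, gamma_max)]
    for _ in range(n_updates):
        nxt = 1.0 - shrink * (1.0 - gammas[-1])
        gammas.append(min(nxt, gamma_max))
    return gammas
```

Starting from a small `gamma0` keeps early targets short-horizon and low-variance, and the effective horizon lengthens as training proceeds.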

4. Behavioral, Biological, and Clinical Relevance

Neural discount models provide mechanistic accounts for observed intertemporal choice phenomena and their neural correlates. Discrete (quantized) value representations explain "chunked" delay sensitivity in human temporal preference, compatible with both economic and neurobiological constraints (Tee et al., 2020). Information-theoretic approaches show that both exponential and hyperbolic discount functions naturally emerge from agent memory constraints and predictive information curves; limited coding resources yield hyperbolic forms (Ortega et al., 2016).

Hierarchical RL frameworks incorporating level-dependent discounting elucidate the role of temporal preference gradients in compulsive behaviors such as addiction. Per-level discount factors $\gamma_\ell$ are constructed to ensure value convergence for natural rewards and divergence for drug rewards with dopamine-induced bias, accounting for the elevated impulsivity and drug-seeking seen in substance use disorders. Increased discounting magnitude (lower $\gamma$) systematically amplifies lower-level (habitual) drug-seeking, paralleling clinical severity metrics and neuroimaging gradients in striatal time coding (Palod et al., 2025).

5. Model Fitting, Learning, and Empirical Comparison

Quantitative fitting of neural discount models to behavioral data employs measures such as AIC and BIC for model selection, logistic regression for parameter inference (including bit precision $N$ in quantized models), and bootstrap or cross-validation for robustness (Tee et al., 2020). In neuroscience tasks (e.g., evidence accumulation in rats), models are validated against observed adjustment of integration timescales, change-of-mind timing distributions, and sensitivity to environmental hazard rates (Piet et al., 2017).
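The model-selection step can be sketched with the standard definitions AIC $= 2p - 2\ln L$ and BIC $= p \ln n - 2\ln L$, where $p$ is the number of free parameters and $n$ the number of observations. The choice model below (a logistic rule on the difference in hyperbolically discounted values, with an assumed inverse-temperature parameter `beta`) and the trial layout are illustrative assumptions, not the fitting procedure of the cited studies.

```python
import math
from typing import Sequence, Tuple

# (amount_soon, delay_soon, amount_late, delay_late, chose_late)
Trial = Tuple[float, float, float, float, bool]


def log_sigmoid(z: float) -> float:
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def choice_log_likelihood(trials: Sequence[Trial], k: float, beta: float) -> float:
    """Log-likelihood of binary choices under hyperbolic discounting."""
    total = 0.0
    for amount_soon, delay_soon, amount_late, delay_late, chose_late in trials:
        value_soon = amount_soon / (1.0 + k * delay_soon)
        value_late = amount_late / (1.0 + k * delay_late)
        z = beta * (value_late - value_soon)
        total += log_sigmoid(z) if chose_late else log_sigmoid(-z)
    return total


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2p - 2 ln L."""
    return 2.0 * n_params - 2.0 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: p ln n - 2 ln L."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood
```

Lower AIC or BIC indicates the preferred model. A quantized variant would add the bit precision as an extra parameter and would need a correspondingly better likelihood to be selected.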

In RL, model learning and evaluation rely on minimizing TD losses (possibly across multiple heads or with return-consistency), ablations to test auxiliary-task effects, and benchmarking on synthetic and real-world tasks. Recovery of underlying discount parameters from behavior is realized through inverse RL via sensitivity backpropagation through collocation-based PDE solvers (Schultheis et al., 2022).

6. Limitations, Extensions, and Open Problems

Current neural discount models vary in their domain of applicability and biological plausibility:

Fixed-bit quantization offers a compact fit to behavioral data but leaves the neural origin of bit precision open and does not model neural dynamics.
State-dependent and dynamic neural discounting in deep RL are currently most effective in heterogeneously timed tasks; convergence proofs beyond tabular domains remain to be fully established (Wang et al., 2026).
Scale-invariant temporal coding yields power-law discounting but requires specialized circuit architectures and experimental validation (Tiganj et al., 2018).
Hierarchical integration of discounting matches human and animal data but may require further empirical grounding, especially concerning the joint roles of dopamine modulation, time-resolved value coding, and hierarchical control (Palod et al., 2025).

Future work spans directly testing quantization in neural signals, extending frameworks to clinical or computational populations, automating dynamic or state-based discount learning, and synthesizing information-theoretic with circuit-level models to probe how neural systems balance coding efficiency, subjective time, and value prediction across domains.

Key References:

"A Quantized Representation of Intertemporal Choice in the Brain" (Tee et al., 2020)
"Reinforcement Learning with Non-Exponential Discounting" (Schultheis et al., 2022)
"How to Discount Deep Reinforcement Learning: Towards New Dynamic Strategies" (François-Lavet et al., 2015)
"Hyperbolic Discounting and Learning over Multiple Horizons" (Fedus et al., 2019)
"AdaGamma: State-Dependent Discounting for Temporal Adaptation in Reinforcement Learning" (Wang et al., 2026)
"Estimating scale-invariant future in continuous time" (Tiganj et al., 2018)
"Discounting and Drug Seeking in Biological Hierarchical Reinforcement Learning" (Palod et al., 2025)
"Rats optimally accumulate and discount evidence in a dynamic environment" (Piet et al., 2017)
"Memory shapes time perception and intertemporal choices" (Ortega et al., 2016)