- The paper introduces the recursive perturbed utility (RPU) model, which integrates entropy-based randomization into dynamic portfolio optimization and resolves the ill-posedness of static perturbed-utility models in dynamic settings.
- It mathematically characterizes the optimal randomized policy as a Gaussian distribution with bias conditions dictated by risk aversion and market incompleteness.
- An asymptotic expansion quantifies the first-order deviation from classical strategies and measures the modest wealth loss incurred by investors favoring randomization.
Merton's Problem with Recursive Perturbed Utility: Summary and Analysis
Introduction
The paper "Merton's Problem with Recursive Perturbed Utility" (2602.13544) rigorously addresses the question of why a rational investor might prefer stochastic, rather than deterministic, portfolio choices in dynamic settings. Traditional Merton-type portfolio optimization prescribes deterministic, state-dependent rules, but experimental evidence reveals systematic human inclination toward randomization. Stochastic choice models, such as Fudenberg's additive perturbed utility (APU), explain this in static contexts, yet their extension to dynamic setups can be ill-posed or computationally intractable. The authors introduce the recursive perturbed utility (RPU) framework as an entropy-regularized, recursive dynamic utility model—well-posed for a wide class of preferences and mathematically tractable for continuous-time portfolio optimization in Markovian incomplete markets. RPU incorporates endogenous, state- and history-dependent trade-offs between monetary utility and utility from randomization, thereby overcoming theoretical limitations of static perturbations in dynamic environments.
Mathematical Framework
Market Model and Dynamics
The financial market consists of a risk-free asset and a risky asset whose returns and volatility depend on an observable stochastic factor. The dynamics are Markovian and potentially incomplete, with instantaneous return $\mu(t, X_t)$ and volatility $\sigma(t, X_t)$ driven by an exogenous factor process $X_t$ (see equations (1) and (2) in the paper).
Investor actions are modeled as scalar-valued processes $a_t$ (the fraction of wealth allocated to the risky asset), leading to the self-financing wealth dynamics of the classical Merton problem. In the exploratory formulation, actions are generalized to randomized controls $\pi_t$, probability distributions from which investment decisions are drawn, resulting in altered wealth dynamics that capture the additional volatility induced by extrinsic randomization.
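To make the altered dynamics concrete, here is a minimal Euler–Maruyama sketch. It assumes the exploratory dynamics take the standard form from the entropy-regularized literature the paper builds on: the drift depends on the mean of $\pi_t$ and the diffusion on its second moment, so randomization inflates volatility. The function name and all parameter values are illustrative, not the paper's.

```python
import numpy as np

def simulate_exploratory_wealth(w0=1.0, r=0.02, mu=0.08, sigma=0.2,
                                mean_pi=0.5, var_pi=0.1,
                                T=1.0, n_steps=252, seed=0):
    """Euler-Maruyama simulation of wealth under a Gaussian randomized control.

    Assumed exploratory form: drift uses the mean of pi, diffusion uses its
    second moment, so randomization adds sigma^2 * var_pi to the variance.
    """
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    w = np.empty(n_steps + 1)
    w[0] = w0
    second_moment = mean_pi**2 + var_pi  # E[a^2] under pi
    for k in range(n_steps):
        dB = rng.normal(0.0, np.sqrt(dt))
        drift = (r + (mu - r) * mean_pi) * w[k] * dt
        diffusion = sigma * np.sqrt(second_moment) * w[k] * dB
        w[k + 1] = w[k] + drift + diffusion
    return w

path = simulate_exploratory_wealth()
print(f"terminal wealth: {path[-1]:.4f}")
```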
Entropy-Based Randomization Preference
The RPU framework explicitly rewards the investor for engaging in randomization, measured by the differential entropy $H(\pi)$ of the randomized control. Unlike static APU, which simply adds entropy-based utility to expected terminal wealth, RPU recursively weighs the randomization flow via a dynamic, endogenous discount term that depends on past and present entropy accumulation. This recursive aggregation mirrors Uzawa-type endogenous time preference and ensures that excessive randomization is discouraged, avoiding ill-posedness even for low risk aversion.
Formally, the value process $J^\pi_t$ satisfies a backward stochastic differential equation (BSDE) that incorporates both bequest utility and recursive entropy utility, with the discount rate (the aggregator) decreasing as randomization accumulates.
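The following toy discrete-time sketch illustrates the two ingredients: the differential entropy of a Gaussian control, and a Uzawa-style backward aggregation in which each step's entropy affects how the continuation value is discounted. The aggregator form, the sign convention, and the constants `lam` and `beta` are placeholders for illustration, not the paper's BSDE specification.

```python
import numpy as np

def gaussian_entropy(var):
    """Differential entropy of N(m, var): H = 0.5 * ln(2 * pi * e * var)."""
    return 0.5 * np.log(2.0 * np.pi * np.e * var)

def recursive_entropy_value(entropies, lam=0.1, beta=0.5, dt=1.0 / 252):
    """Toy Uzawa-style backward aggregation of an entropy flow.

    Each step earns the flow lam * H * dt and discounts the continuation
    value at a rate growing with that step's entropy (beta * H), so heavy
    randomization down-weights future randomization utility. This is one
    possible convention, NOT the paper's aggregator; lam and beta are
    illustrative constants.
    """
    value = 0.0
    for h in reversed(entropies):
        value = lam * h * dt + np.exp(-beta * max(h, 0.0) * dt) * value
    return value

ents = [gaussian_entropy(v) for v in np.linspace(0.05, 0.20, 252)]
print(f"aggregated entropy utility: {recursive_entropy_value(ents):.6f}")
```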
Theoretical Results
Optimal Policy Characterization
Under CRRA preferences, the authors prove that the RPU-optimal portfolio follows a Gaussian distribution, independent of wealth (Theorem 1). The variance is given in closed form, $\mathrm{Var}(\pi^*) = \lambda/(\gamma\sigma^2)$, and is strictly decreasing in the risk aversion $\gamma$ and the squared volatility $\sigma^2$. The optimal mean is the sum of (i) a myopic term, proportional to the instantaneous Sharpe ratio and inversely proportional to risk aversion, and (ii) an intertemporal hedging term against market incompleteness, both defined by the solution of a PDE (see eq. (7)–(9)). The hedging term is recursively entangled with randomization: unlike in classical Merton solutions or under additive entropy regularization, policy randomization affects optimal hedging.
In complete markets or for log-utility ($\gamma = 1$), the hedging term vanishes and the mean reduces to the classical, unbiased Merton rule; the recursive and additive entropy models coincide. However, when risk aversion is non-unitary and the market is incomplete, recursive regularization induces a bias in the optimal policy mean.
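A minimal sketch of the resulting policy follows. The variance $\lambda/(\gamma\sigma^2)$ is taken from Theorem 1 as quoted above; the hedging component of the mean solves a PDE in the paper, so here it is treated as an exogenous numeric input. The function name and parameter values are illustrative.

```python
import numpy as np

def rpu_gaussian_policy(mu, r, sigma, gamma, lam, hedging=0.0):
    """Parameters of the RPU-optimal Gaussian policy pi* = N(mean, var).

    var = lam / (gamma * sigma^2) per Theorem 1 as summarized above; the
    mean is the myopic Merton ratio plus an intertemporal hedging term,
    passed in here as a number since its PDE solution is model-specific.
    """
    var = lam / (gamma * sigma**2)
    myopic = (mu - r) / (gamma * sigma**2)  # classical Merton ratio
    return myopic + hedging, var

# Log-utility (gamma = 1): the hedging term vanishes and the mean is the
# classical Merton rule; only the variance reflects the randomization taste.
mean, var = rpu_gaussian_policy(mu=0.08, r=0.02, sigma=0.2, gamma=1.0, lam=0.05)
sample = np.random.default_rng(0).normal(mean, np.sqrt(var))
print(f"mean={mean:.3f}, var={var:.3f}, sampled allocation={sample:.3f}")
```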
Asymptotic Expansion and Financial Cost of Randomization
The paper provides an asymptotic expansion in the temperature parameter $\lambda$, quantifying how the optimal mean deviates from its classical value and measuring the associated wealth loss. The deviation is first-order in $\lambda$, while the relative wealth loss is of higher order, $O(\lambda^2)$, establishing that the cost of preferential randomization is small and measurable for moderate $\lambda$ (Theorem 2). The equivalent wealth loss that investors are willing to pay for the pleasure of randomization is formalized, with explicit PDEs for the expansion coefficients.
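The orders of magnitude can be illustrated numerically. The coefficients `c1` and `c2` below are placeholders (in the paper the expansion coefficients solve explicit PDEs); the table only demonstrates the $O(\lambda)$ versus $O(\lambda^2)$ scaling of Theorem 2.

```python
# Illustrative scaling for Theorem 2: the policy-mean deviation is O(lambda),
# the certainty-equivalent wealth loss is O(lambda^2). c1 and c2 are
# placeholder coefficients; in the paper they depend on the market model.
c1, c2 = 0.8, 0.3
print(f"{'lambda':>8} {'mean dev ~ c1*lam':>18} {'wealth loss ~ c2*lam^2':>23}")
for lam in (0.01, 0.05, 0.1, 0.2):
    print(f"{lam:8.2f} {c1 * lam:18.4f} {c2 * lam**2:23.5f}")
```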
Relation to Entropy Regularization and RL
While entropy-regularized RL frameworks incentivize exploration due to model uncertainty, the RPU approach here formalizes intrinsic human preference for randomization under full information. Mathematically, both lead to relaxed control models with Gibbs optimal measures, but motivations and utility aggregation differ fundamentally. The paper surveys the literature on continuous-time entropy regularization (e.g., [wang2018exploration], [bender2024continuous], [jia2025accuracy]), noting that existing works employ additive perturbations, whereas RPU generalizes to recursive structures.
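A small sketch of the shared mathematical object, a Gibbs measure on a discretized action grid: with a quadratic payoff, the continuum limit is exactly a Gaussian policy like that of Theorem 1. The grid, the payoff function, and the temperature value are illustrative.

```python
import numpy as np

def gibbs_policy(q_values, lam):
    """Gibbs (softmax) measure pi(a) proportional to exp(q(a) / lam).

    Both entropy-regularized RL and the RPU model yield optimal measures
    of this exponential form; the aggregation of utility differs.
    """
    logits = q_values / lam
    logits -= logits.max()  # numerical stabilization before exponentiating
    weights = np.exp(logits)
    return weights / weights.sum()

actions = np.linspace(-1.0, 2.0, 301)
q = -0.5 * 0.2**2 * (actions - 0.75)**2  # illustrative quadratic payoff
pi = gibbs_policy(q, lam=0.05)
print(f"policy mode near {actions[np.argmax(pi)]:.2f}")
```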
Unbiasedness and Bias Conditions
The bias in the optimal randomized policy is shown to depend entirely on the factor dynamics and their stochastic coupling to the asset price. When market incompleteness is absent (deterministic factors or independent factor-stock dynamics), or log-utility is used, the recursive solution is unbiased and coincides with classical benchmarks. The BSDE perspective further elucidates conditions leading to unbiasedness.
Extension and Limitations
Alternative temperature-weighting schemes (constant, wealth-dependent) are discussed and shown to be theoretically problematic (ill-posed or intractable) compared with recursive weighting. The RPU approach is also extended to CARA preferences, demonstrating the framework's generality.
Implications and Future Directions
The recursive perturbed utility framework establishes a rigorous foundation for modeling dynamic stochastic choice in portfolio optimization. The practical implication is that optimal policy randomization can be formally justified and quantitatively analyzed, moving beyond stylized empiricism to precise measurement of the financial cost of randomization preference. For asset management and behavioral finance, incorporating recursive entropy-based utility may improve descriptive and predictive accuracy, potentially illuminating asset pricing puzzles tied to choice randomization.
From a theoretical standpoint, the recursive aggregation offers flexible trade-offs suitable for other dynamic stochastic control problems (e.g., consumption/investment, gambling, mean-field games) and may lead to new formulations for RL with intrinsic exploration motivations.
Future research could extend the framework by integrating consumption, empirically calibrating randomization preference in real markets, and disentangling intrinsic from extrinsic motivations in RL settings. The impact of alternative entropy functionals (Tsallis, Rényi) and utility forms (beyond CRRA/CARA) remains unexplored.
Conclusion
The paper provides a mathematically rigorous and economically interpretable approach to dynamic portfolio optimization with stochastic choice, grounded in recursive entropy-perturbed utility. The optimal policies are tractable (Gaussian), and the bias induced by randomization preference is precisely characterized. Recursive aggregation resolves the ill-posedness of static additive regularization, quantifies the wealth cost of randomization, and opens new avenues for modeling behavioral preferences in stochastic control and reinforcement learning.