Learning Curves for Revenue Maximization

Updated 4 July 2026

Learning curves for revenue maximization are quantitative measures that track how pricing algorithms converge to optimal revenue as sample size increases.
They incorporate various benchmark choices—from fixed-distribution to online regret and pricing-query frameworks—that capture different operational and feedback constraints.
The analysis distinguishes regimes ranging from heavy-tailed slow convergence to exponential decay, emphasizing the roles of demand regularity, inventory limits, and strategic responses.

Learning curves for revenue maximization quantify how rapidly a pricing or mechanism-design procedure approaches its benchmark revenue as information accumulates. In the single-item, single-buyer posted-pricing model, the canonical object is the expected revenue gap

$\epsilon_n(t_n,D):=\operatorname{opt}_D-\mathbb{E}[\operatorname{rev}_D(t_n)],$

viewed as a function of the sample size $n$ for a fixed valuation distribution $D$ (Hanneke et al., 29 Apr 2026). In adjacent literatures, the same idea is instantiated through relative regret against a fluid benchmark, dynamic regret against a time-varying oracle, Bayesian regret across episodes, pricing-query complexity, and sample complexity for approximate optimality. The resulting theory shows that the shape of the curve depends on the benchmark, the feedback model, the regularity of demand, whether the optimal price is attained, and whether the seller faces inventory, budget, or strategic-response constraints.
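As a concrete illustration of the gap $\epsilon_n(t_n,D)$, the following Python sketch estimates it by Monte Carlo for empirical revenue maximization (ERM) on a finite-support distribution. It assumes the standard posted-price convention that the buyer purchases whenever the value is at least the price, so that $\operatorname{rev}_D(p)=p\cdot\Pr_{v\sim D}[v\ge p]$. The function names, the trial count, and the seed are illustrative choices, not taken from the cited paper.

```python
import random

def revenue(price, values, probs):
    """Expected revenue p * P(V >= p) for a finite-support distribution."""
    return price * sum(q for v, q in zip(values, probs) if v >= price)

def erm_price(samples):
    """Empirical revenue maximizer over the sampled values."""
    n = len(samples)
    ordered = sorted(samples, reverse=True)
    # ordered[k] is the (k+1)-th largest sample, so at least k + 1 samples are >= it.
    # For tied values the last tied index carries the full count and wins the max.
    best_revenue, best_price = max(((k + 1) * v / n, v) for k, v in enumerate(ordered))
    return best_price

def revenue_gap(n, values, probs, trials=2000, seed=0):
    """Monte Carlo estimate of opt_D - E[rev_D(t_n)] when t_n is the ERM price."""
    rng = random.Random(seed)
    opt = max(revenue(v, values, probs) for v in values)
    total = 0.0
    for _ in range(trials):
        samples = rng.choices(values, weights=probs, k=n)
        total += revenue(erm_price(samples), values, probs)
    return opt - total / trials
```

For the two-point distribution with values 1 and 3, each with probability 1/2, the optimal price is 3 with revenue 1.5, and calling `revenue_gap(n, [1, 3], [0.5, 0.5])` for increasing `n` traces out an empirical learning curve for this single fixed distribution.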

1. Formalizations and benchmark choices

The fixed-distribution formulation isolates a single distribution $D$ and studies the sequence $\epsilon_n(t_n,D)$ for an algorithm $(t_n)$. In this framework, Bayes-consistency means

$\lim_{n\to\infty}\mathbb{E}[\operatorname{rev}_D(t_n)]=\operatorname{opt}_D$

for every $D$ in the class under study, with the convention that if $\operatorname{opt}_D=\infty$, then $\mathbb{E}[\operatorname{rev}_D(t_n)]\to\infty$ (Hanneke et al., 29 Apr 2026). The same paper distinguishes a PAC upper bound, which controls the worst-case envelope uniformly over a class of distributions, from a universal learning rate, which allows the constants to depend on the fixed underlying distribution. This distinction is central because fixed-distribution learning curves can be much sharper than worst-case PAC rates.

Online and dynamic models replace the sample-size axis by a time horizon and define learning curves through regret. In personalized dynamic pricing with an inventory constraint, regret is

$n$ 0

where $n$ 1 is the fluid benchmark under proportional scaling of demand and capacity (Chen et al., 2018). Under nonstationarity with one-point feedback, the benchmark is dynamic:

$n$ 2

with $n$ 3 and path variation

$n$ 4

as the nonstationarity budget (Yang et al., 20 May 2026). In pricing-query models, the learning curve is expressed by the achievable revenue gap after $n$ 5 pricing queries, typically denoted $n$ 6, while in episodic revenue management with unknown time-varying demand, the benchmark is Bayesian regret relative to the clairvoyant dynamic-programming policy over $n$ 7 episodes of length $n$ 8 (Leme et al., 2021, Shimizu et al., 2024).

These benchmark choices are not interchangeable. A fixed-distribution curve measures convergence to optimal revenue for one environment; a regret curve measures cumulative shortfall under sequential decision-making; a pricing-query curve measures how much can be inferred from binary accept/reject observations. Much of the contemporary literature can be read as a comparison of these benchmark families.

2. Distribution-dependent sample-size regimes

The sharpest current characterization of fixed-distribution learning curves is for the basic single-item, single-buyer posted-pricing model (Hanneke et al., 29 Apr 2026). The central structural distinction is whether the optimal revenue is attained at a finite price.

Regime	Learning-curve rate	Source
$\operatorname{opt}_D<\infty$, not attained at finite price	arbitrarily slow	(Hanneke et al., 29 Apr 2026)
$D$ 0 attained at finite $D$ 1	essentially $D$ 2 up to logarithmic factors	(Hanneke et al., 29 Apr 2026)
bounded support	optimal $D$ 3 universal rate	(Hanneke et al., 29 Apr 2026)
closed discrete support	$D$ 4	(Hanneke et al., 29 Apr 2026)
finite support	$D$ 5	(Hanneke et al., 29 Apr 2026)

For unrestricted distributions, there exists a Bayes-consistent algorithm for all valuation distributions on $D$ 6: capped ERM with cap $D$ 7 achieves $D$ 8 even when $D$ 9 (Hanneke et al., 29 Apr 2026). However, if $\operatorname{opt}_D<\infty$ but no finite price attains it, then convergence can be arbitrarily slow. The lower bound holds in a strong sense: for any algorithm $(t_n)$ and any rate function $D$ 2, there exists a fixed distribution $D$ such that infinitely often

$D$ 4

for some constant $D$ 5 (Hanneke et al., 29 Apr 2026). The heavy-tail example

$D$ 6

illustrates the regime $\operatorname{opt}_D<\infty$ without attainment (Hanneke et al., 29 Apr 2026).

When the optimal revenue is achieved at a finite price $D$ 8, the universal rate becomes essentially $D$ 9. More precisely, for any rate $\epsilon_n(t_n,D)$ 0 there is an algorithm that learns the class at universal rate $\epsilon_n(t_n,D)$ 1, whereas for any $\epsilon_n(t_n,D)$ 2 no algorithm can learn the class universally at that rate (Hanneke et al., 29 Apr 2026). On bounded supports, ERM improves this to an optimal $\epsilon_n(t_n,D)$ 3 universal rate by a localized Bernstein analysis. On closed discrete supports, structured ERM attains $\epsilon_n(t_n,D)$ 4 rates, while vanilla ERM is not Bayes-consistent: there exists a closed discrete support on which ERM incurs a constant expected revenue gap along an infinite subsequence of sample sizes (Hanneke et al., 29 Apr 2026). On finite supports, ERM achieves

$\epsilon_n(t_n,D)$ 5

and no faster-than-exponential universal rate is possible on nontrivial finite supports (Hanneke et al., 29 Apr 2026).

A recurring conclusion is that PAC-style worst-case envelopes obscure these shapes. The fixed-distribution view separates heavy-tail nonattainment, finite-price attainability, bounded support, closed discrete support, and finite support into genuinely different rate classes.

3. Inventory, resource constraints, and contextual decision-making

A large segment of the literature studies learning curves in revenue maximization under operational constraints rather than pure posted-pricing estimation. In personalized dynamic pricing with one inventory resource and $\epsilon_n(t_n,D)$ 6 observable customer types, the seller learns a single dual shadow price $\epsilon_n(t_n,D)$ 7 instead of learning all type-specific demand functions in full (Chen et al., 2018). The fluid benchmark is

$\epsilon_n(t_n,D)$ 8

and the dual variable $\epsilon_n(t_n,D)$ 9 satisfies

$(t_n)$ 0

The primal-dual learning algorithm achieves the dimension-free regret rate

$(t_n)$ 1

with the exponent independent of the number of types $(t_n)$ 2 (Chen et al., 2018). Under sufficient capacity, the final phase uses one price per type; under insufficient capacity, it uses two prices per type and an interpolation parameter $(t_n)$ 3 to pin aggregate sales near $(t_n)$ 4.

In the older single-product, limited-inventory model of Besbes and Zeevi with unknown regular demand, the dynamic pricing algorithm of Wang et al. uses shrinking price intervals and function-value estimation rather than parametric identification (Wang et al., 2011). In the size- $(t_n)$ 5 scaling regime, it achieves

$(t_n)$ 6

while a lower bound shows that no admissible policy can beat order $(t_n)$ 7 up to logarithmic factors (Wang et al., 2011). The paper interprets this as closing the gaps between parametric and non-parametric learning and between a post-price mechanism and a customer-bidding mechanism.

Constraint-coupled contextual revenue maximization introduces a different learning curve. In dual mirror descent with unknown model parameter $(t_n)$ 8, the agent observes i.i.d. contexts $(t_n)$ 9, chooses $\lim_{n\to\infty}\mathbb{E}[\operatorname{rev}_D(t_n)]=\operatorname{opt}_D$ 0, receives revenue $\lim_{n\to\infty}\mathbb{E}[\operatorname{rev}_D(t_n)]=\operatorname{opt}_D$ 1, and incurs costs $\lim_{n\to\infty}\mathbb{E}[\operatorname{rev}_D(t_n)]=\operatorname{opt}_D$ 2 subject to both lower and upper average-cost bounds (Lobos et al., 2021). With known $\lim_{n\to\infty}\mathbb{E}[\operatorname{rev}_D(t_n)]=\operatorname{opt}_D$ 3, regret and lower-bound violations are both $\lim_{n\to\infty}\mathbb{E}[\operatorname{rev}_D(t_n)]=\operatorname{opt}_D$ 4. With unknown $\lim_{n\to\infty}\mathbb{E}[\operatorname{rev}_D(t_n)]=\operatorname{opt}_D$ 5, the decomposition

$\lim_{n\to\infty}\mathbb{E}[\operatorname{rev}_D(t_n)]=\operatorname{opt}_D$ 6

adds an estimation term that depends on

$\lim_{n\to\infty}\mathbb{E}[\operatorname{rev}_D(t_n)]=\operatorname{opt}_D$ 7

If $\lim_{n\to\infty}\mathbb{E}[\operatorname{rev}_D(t_n)]=\operatorname{opt}_D$ 8, the overall regret remains $\lim_{n\to\infty}\mathbb{E}[\operatorname{rev}_D(t_n)]=\operatorname{opt}_D$ 9; if it decays as $D$ 0, the bound becomes $D$ 1 (Lobos et al., 2021).

These results share a common pattern: the learning curve is not only a function of statistical difficulty, but also of how a low-dimensional structure—such as a dual variable, a shrinking interval, or a dual feasibility certificate—compresses the constrained control problem.
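The compression into a single dual variable can be sketched in a deliberately simplified setting. The Python example below is not the algorithm of any cited paper: it assumes one upper budget constraint, a binary accept-or-skip decision per period, fully observed revenues and costs, and a Euclidean dual update (projected subgradient descent). The names `dual_descent`, `budget_rate`, and `step` are hypothetical.

```python
def dual_descent(offers, budget_rate, step=0.1):
    """Accept or skip each (revenue, cost) offer using one shadow price.

    budget_rate is the target average cost per period (an assumed input).
    Returns (total_revenue, total_cost, final_dual).
    """
    mu = 0.0  # dual variable: current shadow price of one unit of cost
    total_revenue = 0.0
    total_cost = 0.0
    for revenue, cost in offers:
        # Primal step: accept only if revenue exceeds the cost priced at mu.
        accept = revenue - mu * cost > 0
        spent = cost if accept else 0.0
        if accept:
            total_revenue += revenue
        total_cost += spent
        # Dual step: raise mu when spending runs above the target rate,
        # lower it otherwise, and keep it nonnegative.
        mu = max(0.0, mu + step * (spent - budget_rate))
    return total_revenue, total_cost, mu
```

With identical offers of revenue 1 and cost 1 and a target rate of 0.5, the shadow price climbs to about 1 and then oscillates, so roughly every other offer is accepted. The decision rule never needs more than the scalar `mu`, however many offer types there are.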

4. Nonstationarity and time-varying demand

When demand changes over time, learning curves acquire an explicit dependence on a variation budget. Under one-point feedback in a convex price domain $D$ 2, mirror ascent with the estimator

$D$ 3

and periodic restarting yields a static regret bound

$D$ 4

(Yang et al., 20 May 2026). With tuned $D$ 5 and $D$ 6, this becomes $D$ 7, and for spherical smoothing $D$ 8 it reduces to $D$ 9. Restarting converts this into a dynamic regret bound

$\operatorname{opt}_D=\infty$ 0

with the special case

$\operatorname{opt}_D=\infty$ 1

when $\operatorname{opt}_D=\infty$ 2 and $\operatorname{opt}_D=\infty$ 3 is known. If $\operatorname{opt}_D=\infty$ 4 is unknown, the bandit-over-bandit meta-layer yields

$\operatorname{opt}_D=\infty$ 5

for $\operatorname{opt}_D=\infty$ 6 (Yang et al., 20 May 2026).
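The combination of one-point feedback and restarting can be sketched in one dimension. The Python example below is a sketch under stated assumptions, not the estimator or tuning of the cited paper: it uses the classical one-point gradient estimate $(r/\delta)\,u$ with a random sign $u\in\{-1,+1\}$, a Euclidean projection in place of a general mirror map, and fixed hypothetical values for `step`, `delta`, and `restart_every`.

```python
import random

def restarted_one_point_ascent(revenue_at, horizon, restart_every,
                               lo=0.0, hi=1.0, step=0.05, delta=0.1, seed=0):
    """Projected gradient ascent on a price using one revenue observation per round.

    revenue_at(t, price) returns the revenue observed in round t (assumed interface).
    The iterate is reset every restart_every rounds to discard stale information.
    """
    rng = random.Random(seed)
    start = (lo + hi) / 2
    x = start
    total = 0.0
    for t in range(horizon):
        if t % restart_every == 0:
            x = start  # restart: forget what was learned in earlier blocks
        u = rng.choice((-1.0, 1.0))
        price = x + delta * u           # perturbed price actually posted
        r = revenue_at(t, price)        # the only feedback received this round
        total += r
        grad = (r / delta) * u          # one-point estimate of the smoothed gradient
        # Keep x at least delta inside the domain so the posted price stays feasible.
        x = min(hi - delta, max(lo + delta, x + step * grad))
    return total
```

In one dimension the expectation of `grad` over `u` equals the symmetric difference quotient $(f(x+\delta)-f(x-\delta))/(2\delta)$, which is why a single observation per round suffices. The price paid is high variance, and this is the quantity that smoothing and step-size tuning trade off.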

In the single-buyer binary-feedback model with drifting valuations, the benchmark is first-best revenue

$\operatorname{opt}_D=\infty$ 7

and the regret is

$\operatorname{opt}_D=\infty$ 8

Here the nonstationarity parameter is either a fixed $\operatorname{opt}_D=\infty$ 9, the average changing rate

$\mathbb{E}[\operatorname{rev}_D(t_n)]\to\infty$ 0

$\mathbb{E}[\operatorname{rev}_D(t_n)]\to\infty$ 1

in the stochastic known- $\mathbb{E}[\operatorname{rev}_D(t_n)]\to\infty$ 2 case (Leme et al., 2021). The optimal exponents differ between adversarial and stochastic drift:

$\mathbb{E}[\operatorname{rev}_D(t_n)]\to\infty$ 3

with matching lower bounds, and the same exponents extend to unknown or dynamic non-increasing $\mathbb{E}[\operatorname{rev}_D(t_n)]\to\infty$ 4 through $\mathbb{E}[\operatorname{rev}_D(t_n)]\to\infty$ 5 (Leme et al., 2021). The algorithms alternate binary-search localization with exploitation phases and sparse checking rounds.
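The localization-then-exploitation pattern is easiest to see in the stationary special case. The Python sketch below assumes a buyer with a fixed hidden value in $[0,1]$ and omits the checking rounds and drift handling of the cited algorithms; `explore_rounds` is a hypothetical tuning parameter.

```python
def binary_search_pricing(value, horizon, explore_rounds, lo=0.0, hi=1.0):
    """Post prices to a buyer with a fixed hidden value, observing only accept/reject.

    Returns (regret, lo, hi), where regret is measured against selling at
    the true value in every round.
    """
    revenue = 0.0
    for t in range(horizon):
        if t < explore_rounds:
            price = (lo + hi) / 2   # localization: halve the uncertainty interval
            if value >= price:      # accepted: the value is at least the price
                lo = price
                revenue += price
            else:                   # rejected: the value is below the price
                hi = price
        else:
            price = lo              # exploitation: a price known to be accepted
            revenue += price
    regret = horizon * value - revenue
    return regret, lo, hi
```

After `explore_rounds` halvings the interval has width $2^{-\text{explore\_rounds}}$, so each exploitation round loses at most that much. When valuations drift, the interval can become invalid, which is the role of the sparse checking rounds in the algorithms described above.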

Episodic revenue management with unknown time-varying demand introduces a further interaction between learning and inventory. With $\mathbb{E}[\operatorname{rev}_D(t_n)]\to\infty$ 6 seasons, $\mathbb{E}[\operatorname{rev}_D(t_n)]\to\infty$ 7 periods per season, a finite price set $\mathbb{E}[\operatorname{rev}_D(t_n)]\to\infty$ 8, and a Bayesian prior over time-varying demand parameters, posterior sampling plus LP re-optimization yields

$\mathbb{E}[\operatorname{rev}_D(t_n)]\to\infty$ 9

while a lower bound shows

$n$ 00

for some prior $n$ 01 (Shimizu et al., 2024). The paper interprets the $n$ 02 term as the unavoidable learning component and the $n$ 03 term as the computational-efficiency cost of using LP rather than exact dynamic programming. Empirically, correlated Gaussian-process priors steepen early learning curves by sharing information across time and price (Shimizu et al., 2024).

Taken together, these models replace a single asymptotic rate by a two-parameter geometry: horizon length controls the accumulation of exploration error, while a variation budget controls how rapidly past information becomes stale.

5. Feedback limitations, strategic responses, and information complexity

Learning curves also depend on what information the seller receives. In pricing-query models, the learner posts a price $n$ 04, observes only the binary signal

$n$ 05

and seeks a reserve price $n$ 06 that nearly maximizes

$n$ 07

(Leme et al., 2021). The resulting query complexity exhibits three regimes:

$n$ 08

Equivalently, the revenue-gap learning curve satisfies

$n$ 09

for general distributions and

$n$ 10

for regular and MHR distributions (Leme et al., 2021). The regular-distribution algorithm relies on the relative flatness property, which rules out hidden interior spikes after probing a constant number of evenly spaced prices. The same paper shows that for regular distributions, learning the reserve is strictly easier than learning the entire distribution in Lévy distance:

$n$ 11

whereas reserve-price learning requires only $n$ 12 queries (Leme et al., 2021).
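For contrast with the structured algorithms above, a naive baseline under the same binary feedback can be sketched as follows. This Python example is not the method of the cited paper: it spends a fixed number of pricing queries on each point of a price grid and picks the price with the best estimated revenue. The names `grid_reserve`, `draw_value`, and `queries_per_price` are hypothetical.

```python
import random

def grid_reserve(draw_value, grid, queries_per_price):
    """Estimate p * P(v >= p) on a price grid from accept/reject queries only.

    draw_value() samples a fresh buyer value; the learner never sees it,
    only whether it clears the posted price.
    """
    best_price, best_estimate = None, -1.0
    for price in grid:
        accepts = sum(1 for _ in range(queries_per_price) if draw_value() >= price)
        estimate = price * accepts / queries_per_price
        if estimate > best_estimate:
            best_price, best_estimate = price, estimate
    return best_price, best_estimate
```

For values uniform on $[0,1]$ the revenue curve is $p(1-p)$ with maximum $0.25$ at $p=0.5$, and a call such as `grid_reserve(random.Random(0).random, [k / 10 for k in range(1, 10)], 4000)` typically lands near that price. The baseline pays for every grid point equally, which is the inefficiency that zooming in on the promising region avoids.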

Patient buyers create a different information geometry. When each buyer has a value-patience type $n$ 13 and can delay purchase over up to $n$ 14 steps, the revenue class for pure non-increasing price sequences has fat-shattering dimension linear in $n$ 15 (Mashiah et al., 2022). The offline pure-strategy sample complexity has two regimes: for sample size $n$ 16 the excess error behaves like $n$ 17, whereas for $n$ 18 it behaves like $n$ 19 up to logarithmic factors. In online learning, regret against the optimal pure strategy is

$n$ 20

after the crossover $n$ 21, while regret against the optimal mixed strategy is $n$ 22 for finite support and $n$ 23 in general (Mashiah et al., 2022).

Strategic data generation further modifies the curve. In ERM with endogenous sampling, the “samples” are buyers’ bids, and a coalition of size $n$ 24 can manipulate the learned reserve. The paper formalizes this through the incentive-awareness measure

$n$ 25

which upper-bounds the expected relative price drop caused by altering $n$ 26 out of $n$ 27 samples (Deng et al., 2020). For guarded ERM,

$n$ 28

for MHR distributions and

$n$ 29

for distributions supported on $n$ 30 (Deng et al., 2020). The endogenous-learning curve therefore becomes the sum of the classical statistical error and a manipulation-robustness term:

$n$ 31
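The guarding idea can be illustrated with a short Python sketch. It assumes that guarded ERM means ERM that excludes a fixed fraction of the largest samples as candidate prices, which is one common reading of the term and may differ in detail from the cited paper's definition. The parameter name `guard_fraction` is hypothetical.

```python
def guarded_erm(samples, guard_fraction):
    """ERM over the sampled values, excluding the top guard_fraction of samples.

    With guard_fraction = 0 this reduces to plain ERM.
    """
    n = len(samples)
    skip = int(guard_fraction * n)  # number of largest samples not eligible as prices
    if skip >= n:
        raise ValueError("guard_fraction leaves no candidate prices")
    ordered = sorted(samples, reverse=True)
    # Selling at ordered[k] would have sold to at least k + 1 of the sampled buyers.
    candidates = [((k + 1) * ordered[k], ordered[k]) for k in range(skip, n)]
    return max(candidates)[1]
```

On the samples `[1, 1, 1, 100]`, plain ERM (`guard_fraction=0`) returns 100 because a single extreme bid dominates the empirical revenue, while `guard_fraction=0.25` returns 1. The guard limits how far a few reported values at the top can move the learned price, and the incentive-awareness measure quantifies the analogous effect of a coalition shading its bids to lower the price.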

Two-sided learning against a budget- and ROI-constrained buyer supplies a further example of structure-driven learnability. The seller’s fixed-price revenue

$n$ 32

is bell-shaped over the price grid: strictly increasing on the non-binding region, flat at $n$ 33 on the budget-binding segment, and strictly decreasing on the ROI-binding segment (Golrezaei et al., 2021). An episodic binary search then achieves

$n$ 34

where $n$ 35 is the buyer’s within-episode adaptivity exponent and $n$ 36 (Golrezaei et al., 2021). If the buyer best responds exactly, or uses empirical-distribution advice, then $n$ 37 and the seller’s regret is $n$ 38.

6. Algorithmic motifs, applications, and open directions

Several recurring design principles organize the field. Capped ERM and structured ERM control tail risk or discrete-support pathologies in fixed-distribution learning (Hanneke et al., 29 Apr 2026). Primal-dual methods reduce a high-dimensional pricing problem to learning a shadow price of capacity or cost (Chen et al., 2018, Lobos et al., 2021). Restarting and meta-learning discount stale information under nonstationarity (Yang et al., 20 May 2026). Relative flatness enables zoom-in without reconstructing the full value distribution (Leme et al., 2021). Posterior sampling combined with LP re-optimization transforms episodic time-varying revenue management into tractable mean-demand planning (Shimizu et al., 2024).

These motifs extend beyond single posted prices. In airline revenue management with unconstrained capacity and simultaneous pricing of $n$ 39 active flights, the unified objective

$n$ 40

balances current expected revenue against Fisher-information-driven learning quality (Pinheiro et al., 2022). In the reported experiments, with $n$ 41, 10 prices from \$n$42230, and a sweep over 160 values of $n$43, the best choice $n$44 achieved normalized expected revenue of $n$45 versus $n$46 for RMS, an absolute improvement of $n$47 (Pinheiro et al., 2022). The same paper reports that the MSE of $n$48 estimation decreases monotonically with $n$49 up to a sweet spot, after which over-exploration reduces revenue.

Revenue-maximizing ranking under random attention spans supplies another example of approximation-oriented learning curves. For random span $n$50 with tail $n$51, the Best-$n$52 policy selects

$n$53

where $n$54 is the fixed-span optimum (Chen et al., 2020). Under the IFR condition on attention spans, Best-$n$55 achieves at least $n$56 of the clairvoyant benchmark, while no algorithm can exceed $n$57 of that benchmark in the worst case (Chen et al., 2020). In the contextual online version, RankUCB attains expected regret of order $n$58 relative to the scaled $n$59 benchmark despite censoring of the attention span (Chen et al., 2020).

The move from posted prices to richer mechanisms preserves the learning-curve perspective but increases parameter dependence. For menus of two-part tariffs, adversarial full-information online regret is

$n$60

while fixed-length lottery menus admit

$n$61

full-information regret and

$n$62

bandit regret (Balcan et al., 2023). In the distributional setting, fixed-length lottery menus have sample complexity

$n$63

whereas arbitrary-length menus exhibit the expected exponential dependence on $n$64 under correlated valuations (Balcan et al., 2023). The same paper shows that dispersion-based methods, successful for smoothed online learning of two-part tariffs, are inadequate for menus of lotteries because dispersion can fail at the optimizer (Balcan et al., 2023).

The open problems are correspondingly structural. One line asks when fixed-distribution curves can be sharpened beyond current worst-case universal rates, especially for unbounded supports and for ERM outside bounded-support or finite-support settings (Hanneke et al., 29 Apr 2026). Another asks whether dimension-free final-phase guarantees survive multiple resource constraints in primal-dual inventory control (Chen et al., 2018). In patient-buyer models, the gap between upper and lower bounds for mixed strategies and the possibility of $n$65 regret against optimal mixed menus remain unresolved (Mashiah et al., 2022). In episodic time-varying revenue management, a formal regret analysis for dynamic per-period posterior sampling and a theory that quantifies the gains from correlated priors remain open (Shimizu et al., 2024).

Across these domains, the central lesson is not that revenue learning has a single canonical asymptotic law, but that it decomposes into sharply different regimes. Heavy tails without attainment can force arbitrarily slow convergence; finite optimal prices produce essentially $n$66 curves; discrete supports can yield almost exponential or exponential decay; nonstationarity introduces explicit variation exponents; and resource, feedback, or strategic constraints add separate structural terms that often dominate the statistical component.