Solving Problems of Unknown Difficulty

Published 31 Mar 2026 in econ.TH | (2604.00156v1)

Abstract: This paper studies how uncertainty about problem difficulty shapes problem-solving strategies. I develop a dynamic model where an agent solves a problem by brainstorming approaches of unknown quality and allocating a fixed effort budget among them. Success arrives from spending effort pursuing good approaches, at a rate determined by the unknown problem difficulty. The agent balances costly exploration (expanding the set of approaches) with exploitation (pursuing existing approaches). Failures could signal either a bad idea or a hard problem, and this uncertainty generates novel dynamics: optimal search alternates between trying new approaches and revisiting previously abandoned ones. I then examine a principal-agent environment, where moral hazard arises on the intensive margin: how the agent explores. Dynamic commitment leads contracts to frontload incentives, which can be counteracted by the presence of learning. The framework reflects scientific discovery, product development, and other creative work, providing insights into innovation and organizational design.

Summary

  • The paper introduces a model that dynamically balances exploration of new approaches with recall of past ones under uncertain problem difficulty.
  • It differs from traditional multi-armed bandit frameworks by coupling learning across arms via a latent difficulty parameter, influencing incentive design.
  • The findings inform optimal dynamic contracts and strategic search behaviors applicable to scientific research, innovation, and creative exploration.

Solving Problems Under Unknown Difficulty: A Dynamic and Strategic Perspective

Overview and Motivation

"Solving Problems of Unknown Difficulty" (2604.00156) presents a formal model to understand search, experimentation, and incentive design in environments where both the viability of potential approaches and the fundamental problem difficulty are unknown. The framework is motivated by real-world contexts such as scientific research, technological innovation, and entrepreneurship, where agents must engage in costly creative exploration and cannot distinguish between unproductive efforts due to flawed ideas versus inherent hardness.

Crucially, the model departs from classical multi-armed bandit setups by making difficulty a shared latent parameter that endogenously correlates learning across alternative approaches. The agent’s failures are ambiguous—providing only entangled statistical evidence about the quality of attempted approaches and the global rate of progress achievable.

Formal Model and Key Mechanisms

An agent sequentially allocates continuous effort among previously brainstormed approaches (arms); novel approaches are generated at a fixed cost. Each new approach is valid with ex-ante probability $\nu_0$; a valid approach yields breakthroughs at a rate set by the global difficulty state, which applies uniformly to all arms. In the "easy" state the arrival rate is $\lambda = \lambda_E$; with probability $\delta_0$ the problem is "hard" ($\lambda = \lambda_H < \lambda_E$), so valid arms yield breakthroughs at a much lower rate. The agent discounts future payoffs at rate $r$ and observes only successes and failures (i.e., the process is bandit-like with censored feedback).
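To make the censored-feedback structure concrete, here is a minimal worked update for a single arm. It is a sketch under the assumptions stated in the comments (Poisson first arrival for valid arms, independent priors on validity and difficulty), not a formula quoted from the paper:

```latex
% Probability of no success after cumulative effort t on one arm, given difficulty \lambda
% (a valid arm survives with probability e^{-\lambda t}; an invalid arm never succeeds):
P(\text{no success by } t \mid \lambda) = \nu_0 e^{-\lambda t} + (1 - \nu_0)

% Posterior that the problem is hard after t units of fruitless effort on that arm:
P(\lambda_H \mid \text{no success by } t) =
  \frac{\delta_0\,(\nu_0 e^{-\lambda_H t} + 1 - \nu_0)}
       {\delta_0\,(\nu_0 e^{-\lambda_H t} + 1 - \nu_0) + (1 - \delta_0)\,(\nu_0 e^{-\lambda_E t} + 1 - \nu_0)}
```

Because the same $\lambda$ governs every arm, this posterior immediately feeds back into beliefs about all other approaches, which is the coupling emphasized below.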

The agent faces a strategic breadth-vs-depth allocation: continued exploitation of existing approaches versus incurring costs for expanding the search space. Exploration and belief-updating are highly interdependent over time and across alternatives.

Departure from Standard Bandits

The critical distinction is the endogenous, information-theoretic coupling of beliefs about each arm via the latent difficulty state $\theta$. This invalidates indexability and the tractable decomposition present in Gittins-style bandit solutions, leading instead to a correlated, restless-bandit dynamic with high-dimensional beliefs.

Behavioral Implications

Ambiguous failures: A central result is that failures on an arm do not merely reduce its perceived validity but also induce dynamic learning about the difficulty of the underlying problem, which then recursively modifies beliefs about all other arms.

The agent’s optimal policy alternates between (a) exploration—generating and focusing initial effort on new approaches, and (b) recall/task-juggling—redistributing effort to previously abandoned arms as updates on the global difficulty state unfold. This contrasts with classical models (with known difficulty), where abandoned arms are never revisited, and agents do not diversify effort.

Characterization and Dynamics of the Optimal Policy

Benchmark: Known Difficulty

When the global difficulty (success rate) is known, the agent sequentially generates a new approach and explores it up to a static threshold $K^*$ of cumulative effort, after which the arm is abandoned and a new one is brainstormed. There is no recall and no parallelization: effort is concentrated on one arm at a time, and an abandoned approach is never revisited.
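For intuition, the following is a minimal simulation sketch of this benchmark policy. The values of `K`, `nu0`, and `lam` are illustrative assumptions, brainstorming costs and discounting are ignored, and the time discretization is crude, so it only illustrates the mechanics of the explore-to-threshold-then-pivot rule, not the optimal $K^*$:

```python
import random

def simulate_benchmark(K=2.0, nu0=0.5, lam=1.0, dt=0.01, max_time=1000.0, seed=0):
    """Known-difficulty benchmark: explore each freshly brainstormed approach up to
    K units of cumulative effort, then abandon it and brainstorm a new one."""
    rng = random.Random(seed)
    t = 0.0
    while t < max_time:
        valid = rng.random() < nu0          # a new approach is valid with probability nu0
        effort = 0.0
        while effort < K:
            # Discretized Poisson arrival: a valid arm succeeds w.p. ~ lam*dt per step.
            if valid and rng.random() < lam * dt:
                return t                    # breakthrough time
            effort += dt
            t += dt
    return None                             # no breakthrough within the horizon

times = [simulate_benchmark(seed=s) for s in range(200)]
solved = [x for x in times if x is not None]
print(f"solved {len(solved)}/200 runs; mean breakthrough time {sum(solved)/len(solved):.1f}")
```

Tolerating more failure per idea (a larger `K`) is what the paper finds optimal when the brainstorming cost $c$ is high, while a higher arrival rate $\lambda$ pushes the optimal threshold down, matching the comparative statics described next.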

The optimal threshold $K^*$ is strictly increasing in the approach cost $c$ and strictly decreasing in the arrival rate $\lambda$. Interestingly, $K^*$ may be non-monotone in the discount rate $r$ (see Figure 1):

Figure 1: Non-monotone effect of time pressure on the threshold $K^*$ (plotted for example parameter values).

Unknown Difficulty: Endogenous Recall and Task-Splitting

When there is aggregate uncertainty over difficulty, the optimal time/effort thresholds for exploration become a state-dependent sequence $\{K^*_n\}$, with $K^*_n$ increasing in the number $n$ of brainstormed approaches. The agent spends more cumulative effort on each new approach before switching, reflecting the shifting posterior over difficulty and the decreasing anticipated value of further search as failure persists.

More importantly, recall and task-juggling emerge endogenously: previous arms are optimally revisited and explored in parallel with newer arms, a sharp contrast to models with known difficulty or independent arm rewards.

Beliefs about approach quality and problem tractability evolve in a sophisticated manner, forming non-trivial stochastic paths (see Figure 2):

Figure 2: Belief over quality of old approaches when brainstorming new approaches.

These non-monotonic belief updates drive periods where the agent is more optimistic about previously-abandoned approaches as accumulating failures shift posterior weight toward global difficulty, thus increasing the likelihood assigned to prior arms being valid but merely unlucky in the prior exploitation round.
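As an illustration of this mechanism, here is a minimal Bayesian sketch, not the paper's code, showing how fruitless effort on a fresh arm shifts posterior weight toward the hard state and raises the posterior validity of an earlier, abandoned arm; the values of $\nu_0$, $\delta_0$, $\lambda_E$, $\lambda_H$ and the effort levels are illustrative assumptions:

```python
import numpy as np

# Illustrative parameters (not taken from the paper): prior validity of an
# approach, prior probability the problem is hard, and the two arrival rates.
nu0, delta0 = 0.5, 0.5
lam_E, lam_H = 1.0, 0.2   # valid arms pay off faster when the problem is easy

def no_success_lik(t, lam):
    """P(no success yet | cumulative effort t on one arm, difficulty lam):
    a valid arm (prob nu0) survives with exp(-lam*t); an invalid arm never succeeds."""
    return nu0 * np.exp(-lam * t) + (1.0 - nu0)

def posteriors(efforts):
    """Joint update from censored feedback. `efforts` lists the cumulative effort
    spent on each arm so far, none of which has produced a breakthrough."""
    # Likelihood of the entire no-success history under each difficulty state.
    lik_E = np.prod([no_success_lik(t, lam_E) for t in efforts])
    lik_H = np.prod([no_success_lik(t, lam_H) for t in efforts])
    p_hard = delta0 * lik_H / (delta0 * lik_H + (1.0 - delta0) * lik_E)

    def valid_given(lam, t):
        # P(arm valid | difficulty lam, no success after effort t on that arm)
        return nu0 * np.exp(-lam * t) / no_success_lik(t, lam)

    p_valid = [(1.0 - p_hard) * valid_given(lam_E, t) + p_hard * valid_given(lam_H, t)
               for t in efforts]
    return p_hard, p_valid

# Arm 1 was explored for 3 units of effort without success and then abandoned.
print(posteriors([3.0]))        # beliefs given arm 1's history alone
# Three further units of fruitless effort on a fresh arm 2 shift weight toward
# "hard", so arm 1 now looks more likely to be valid but merely unlucky.
print(posteriors([3.0, 3.0]))
```

In this example the posterior validity of the abandoned arm rises once the new arm also fails, which is precisely the force that makes recall optimal.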

Continuum Limit and Application to Agency Problems

To enable tractable dynamic-game and contract-theoretic analysis, the paper develops a limiting "continuum of approaches" version of the model. Breadth and depth become continuous control variables, with the overall distribution of breakthrough times being an aggregate over possible valid approaches and realizations of the common difficulty state.

This limit enables closed-form Euler–Lagrange characterizations and comparative static analysis for agent and principal-agent problems.

Principal-Agent Contracting and the Incentive to Explore

A principal contracts with an agent to incentivize creative search when effort and exploration allocation are unobservable (moral hazard arises on the intensive—not extensive—margin).

Key results:

  • Optimal static (time-invariant) contracts induce insufficient creative search: The agent under-explores relative to the social planner, as the cost of broadening search is unobserved and not perfectly contractible.
  • Optimal dynamic contracts generally **frontload incentives**, contradicting classical dynamic moral hazard theory (which predicts backloaded compensation under extensive-margin risk):
    • The principal offers higher equity shares for early success to induce exploration, dissipating the effectiveness of incentives over time as the marginal impact of exploration diminishes.
    • This produces a monotonically declining agent share (see Figure 3):
    • Figure 3: The optimal dynamic contract share given to the agent over time, and the share granted in the no-commitment environment (plotted for example parameter values).

  • When compared to the static (or no-commitment) contract, the dynamic (commitment) contract sustains greater exploration breadth early on but reverts to less creative activity for long periods without success (see Figure 4):

    Figure 4: The path of exploration in the dynamic contract vs the no-commitment equilibrium. Solution parameters as above.

  • Learning-driven backloading: As pessimism accumulates (i.e., as the belief that the problem is impossible increases), the principal is, at times, incentivized to temporarily increase the agent's share (backloading) to encourage continued exploration. This force can induce non-monotonic contracts, especially when failures increasingly signal impossibility rather than mere luck (see Figure 5):

    Figure 5: The optimal contract when learning about impossibility; each color illustrates the contract for a different prior.

Complex non-monotonic paths for the agent's share $\alpha$ can arise, featuring both "early tolerance for failure" and eventual reversion to low-incentive backstops as learning saturates (see Figure 6):

Figure 6: Share $\alpha$ offered over time.

Theoretical and Practical Implications

Theoretical

  • The model introduces an analytically tractable correlated multi-armed bandit structure, extending well beyond classical index and independence-based approaches.
  • It emphasizes the importance of restless, non-separable exploration and endogenously-interdependent beliefs, with immediate consequences for the design of algorithms in adaptive search and learning contexts.
  • In dynamic contracting, the result that intensive-margin (creative allocation) moral hazard induces frontloaded, not backloaded, incentive provision is a substantial departure from classical repeated agency models.

Practical

  • The results have direct application to the design of research and innovation contracts: tolerance for failure, timing of rewards, and dynamic adjustment to accumulated pessimism must be tailored to the structure of ambiguity in feasibility rather than merely effort observability.
  • The behavioral insight that optimal search involves dynamic recall and reuse of previously dismissed ideas, and temporary shifts in attention allocation, aligns with empirical patterns observed in scientific discovery and technological breakthroughs.

Conclusion

This work delivers a rich, technically rigorous framework for the study of dynamic experimentation, exploration, and incentive design in uncertain, creative environments. Through the introduction of global uncertainty in difficulty, the analysis predicts dynamic revisiting of old ideas, optimal parallelization, and nuanced, non-monotonic incentive contracts that depart from canonical results in economics and reinforcement learning.

The continuum model opens multiple directions for future development, including robustness to alternative uncertainty structures, multi-agent and competitive search, and generalizations to endogenously evolving difficulty. The implications for algorithmic experimentation, research management, and organizational design are substantial, especially as search problems become increasingly complex, high-dimensional, and strategically coupled.

Explain it Like I'm 14

A simple guide to “Solving Problems of Unknown Difficulty”

What this paper is about

This paper asks a practical question: how should someone tackle a big, tricky problem when they don’t know how hard it is? Think of inventing a new technology, finding a scientific breakthrough, or building a startup. The author builds a model of how a person (or team) should split their time between:

  • exploring new ideas, and
  • pushing forward on ideas they already have, when they can’t tell if failures mean “this idea is bad” or “this problem is just really hard.”

The main questions in plain language

The paper focuses on five easy-to-understand questions:

  • When you don’t know how hard a problem is, how should you balance trying new approaches versus sticking with current ones?
  • How do repeated failures change what you should do next?
  • Is it ever smart to go back to an old idea you previously abandoned?
  • How do these choices change if you know the problem is easy or hard from the start?
  • If an investor funds a problem-solver (like a startup founder), how should rewards be split over time to encourage smart exploration?

How the study works (with simple analogies)

The author builds a step-by-step model, like a “thought experiment,” to see what a careful problem-solver would do.

  • Imagine you’re trying to open a locked door with a pile of unknown keys.
    • Some keys are “valid” and could open the door.
    • Most keys are “flawed” and will never work.
    • If you have a valid key, trying it longer makes success more likely—but you don’t know how long it will take because the lock itself might be easy or hard.
  • You have limited effort and can:
    • keep trying a key you already have (exploitation), or
    • spend time to find a new key (exploration), which costs energy/time.

The model tracks:

  • your list of approaches (the keys you’ve tried),
  • how long you’ve tried each one,
  • what you learn from not succeeding yet,
  • and how your beliefs change about whether a key is good and whether the lock is easy or hard.

This setup is similar to a classic “multi-armed bandit” problem (like testing different slot machines to see which pays out), with a twist: all “machines” are influenced by the same hidden factor—the overall difficulty of the problem. That common factor makes everything interdependent: trying one approach teaches you about all the others.

To make the results usable for contracts and teamwork, the author also builds a smoother version of the model where there are “many tiny possible approaches” instead of a small number. This helps analyze how incentives (like equity shares for a founder) should change over time.

What the study found and why it matters

Here are the main takeaways, with short explanations of why they’re important:

  • Unknown difficulty changes everything
    • If you already know how hard the problem is, the best plan is simple: try one approach for a fixed amount of time, then move on; never split attention, and never go back to old ideas.
    • If you don’t know the difficulty, failures are ambiguous. Now the best plan is more dynamic:
    • you sometimes split effort across a few promising approaches,
    • you alternate between trying new ideas and revisiting old ones,
    • and you raise your “patience threshold” as you explore more. In other words, the more ideas you’ve generated, the longer you’re willing to test each before moving on.
  • Why revisiting old ideas can be smart
    • Early failures might mean the problem is just harder than you thought, not that the idea is bad. As you learn more (say, you try other ideas and they also don’t work quickly), an old idea can look better again. So, “false starts” and “pivots back” are not mistakes—they can be optimal.
  • How costs and speed affect your choices
    • If it’s more costly to brainstorm a new idea, you spend longer on each idea before moving on.
    • If successful ideas pay off quickly, you switch to new ideas sooner.
    • Time pressure has a surprising effect: being a little more rushed can make you brainstorm more; being too rushed can make you procrastinate on exploring. Creativity often peaks at “medium” time pressure.
  • When the problem might be impossible
    • If a “hard” version of the problem literally never pays off, then there’s a limit to how many new ideas it makes sense to generate. The best plan eventually focuses on what you have.
  • Designing incentives for teams and startups
    • In a principal–agent setting (think: investor and founder), the conflict isn’t about working hard or shirking; it’s about how to work—how broad to search (try new ideas) versus how deep to dig (keep pushing one idea).
    • Entrepreneurs tend to search too narrowly on their own, sticking with ideas for too long and pivoting too little.
    • Optimal contracts can fix this by changing the founder’s equity share over time:
    • Without learning, the best contract “frontloads” incentives: give the founder a higher share early to encourage broader early exploration, which boosts future progress even if early attempts don’t work.
    • With learning, there’s a counterforce: as everyone becomes more pessimistic after long failure, increasing the founder’s share later can help keep exploration going.
    • Result: the best contract can move up and down over time and may even reward or tolerate early failure, to keep the right kind of exploration alive.

Why this matters in the real world

  • For students and researchers: Don’t treat failure as a sure sign your idea is bad. Sometimes the problem is just tough. Track how much effort you’ve spent on each idea, try new ones, and don’t be afraid to revisit a promising old idea later.
  • For startups and innovators: Plan for cycles—explore, focus, revisit, and explore again. Early “false starts” can be part of the smartest path.
  • For managers and investors: Structure rewards so people explore broadly early on, then adjust as you learn about the problem’s difficulty. Tolerating early failure can be exactly what keeps the project moving toward success.

In short, when you don’t know how hard the problem is, the smartest way to search looks less like a straight line and more like cycles of trying, learning, switching, and sometimes circling back. And the right incentives can make those cycles happen at the right times.

Practical Applications

Immediate Applications

Below is a concise set of actionable use cases that can be deployed now, drawing directly from the paper’s findings on optimal search under unknown difficulty, alternating exploration and revisitation, and intensive-margin contracting.

  • R&D project management workflows (software, biotech, materials, hardware)
    • Application: Implement a “breadth–depth” search workflow that alternates between brainstorming new approaches and revisiting previously abandoned ones, with a policy to:
    • Equalize effort across the least-tested approaches (“best” approaches are those with minimal cumulative effort).
    • Use increasing pivot thresholds: require more cumulative effort per approach before greenlighting new brainstorming as the project progresses.
    • Tools/products:
    • Breadth–Depth Tracker (JIRA/Trello plugin) that logs per-approach cumulative effort and recommends when to brainstorm vs revisit based on rising thresholds.
    • Pivot Threshold Calculator (simple spreadsheet) to operationalize the idea that the revisit/pivot threshold increases with the number of explored approaches (a minimal scheduler sketch appears after this list).
    • Assumptions/dependencies: Ability to track time/effort per approach; team discipline in logging; suitable when approaches are ex-ante similar and failure is an imperfect signal.
  • AI/ML research and AutoML pipelines (software, AI)
    • Application: Modify hyperparameter/architecture search to periodically revisit previously dropped configurations when accumulated failures suggest the task may be harder than expected. Allocate compute to equalize effort across least-tested candidates rather than chasing only recent “winners.”
    • Tools/products: AutoML scheduler that enforces equalization across least-tested models and injects scheduled revisit cycles; logging modules to estimate “difficulty” via survival rates.
    • Assumptions/dependencies: Instrumented pipelines with per-trial logs; sufficiently large candidate pool; compute budgets; approximation of success arrival as stochastic.
  • High-throughput screening and re-screening protocols (biotech, materials, energy)
    • Application: Adjust screening campaigns to:
    • Re-screen previously discarded hits later in the campaign if early failures suggest higher underlying difficulty.
    • Allocate instrument time to equalize cumulative exposure across least-tested compounds or configurations.
    • Tools/products: Revisit cadence SOPs; dashboards showing cumulative exposure by compound/material and recommending next batch composition.
    • Assumptions/dependencies: Sample availability, cost per screening run, robust tracking; appropriate when many candidates are ex-ante undifferentiated.
  • Agile product development and innovation teams (software/hardware)
    • Application: Sprint policies that:
    • Timebox effort per prototype, then brainstorm a new prototype if the least-tested pool has hit a threshold.
    • Schedule “revisit sprints” to retest abandoned prototypes after additional failures on new ones (interpreting failures as potentially due to difficulty, not just bad design).
    • Tools/products: Sprint templates with increasing pivot thresholds; backlog analytics to identify least-tested items.
    • Assumptions/dependencies: Granular effort logging; management buy-in to revisit decisions; suitable in environments where success is binary or milestone-based.
  • Venture capital and corporate innovation contracts (finance, entrepreneurship)
    • Application: Frontload incentives to induce broader early exploration (e.g., higher early equity, milestone bonuses for new-approach generation), with provisions that tolerate or reward early failure (reflecting the paper’s result that intensive-margin moral hazard calls for frontloading; learning can later temper this).
    • Tools/products: Exploration Equity Designer (term-sheet templates with time-varying founder equity; milestone bonuses tied to breadth metrics); “failure bonus” micro-grants for documented pivots.
    • Assumptions/dependencies: Legal feasibility of time-varying equity; measurable exploration breadth; cultural acceptance of rewarding early failure.
  • Public funding and grants program design (policy, academia)
    • Application: Stage-gated grants that:
    • Frontload funding and flexibility to expand breadth early.
    • Include “revisit vouchers” to revisit previously unsuccessful directions when accumulated evidence suggests higher difficulty.
    • Use portfolio metrics like Breadth Index and Revisit Rate in reporting.
    • Tools/products: Grant Frontloader policy templates; reviewer guidelines that recognize optimal alternation and revisit.
    • Assumptions/dependencies: Administrative capacity to track breadth/depth; fair and transparent metrics; applicable to fundamental research with uncertain difficulty.
  • Academic advising and lab management (academia)
    • Application: Research planning that:
    • Alternates exploration and revisitation; uses increasing thresholds for moving on.
    • Logs cumulative effort per idea; schedules revisit cycles; educates students that failures ambiguously signal approach quality vs problem difficulty.
    • Tools/products: Breadth–Depth logbooks; lab SOPs for alternating phases; simple calculators for threshold increases as projects expand.
    • Assumptions/dependencies: Willingness to maintain effort logs; many ex-ante similar ideas; acceptance of “false starts” as part of optimal strategy.
  • Personal problem-solving and productivity (daily life)
    • Application: Timeboxing with planned revisits:
    • Equalize time across least-tried strategies.
    • Increase the timebox threshold before trying new strategies as you add more strategies.
    • Revisit earlier strategies after additional failures to avoid prematurely discarding good ideas on hard problems.
    • Tools/products: Spreadsheet or app that tracks strategy timeboxes and auto-schedules revisit sessions.
    • Assumptions/dependencies: Consistent logging; problems where success is rare and difficulty uncertain (e.g., creative writing, algorithmic puzzles).
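As a concrete starting point for the Breadth–Depth Tracker and Pivot Threshold Calculator ideas above, here is a minimal scheduling sketch. The linear threshold rule (`base_threshold + step * n`) is an illustrative stand-in for the paper's increasing thresholds $K^*_n$, not the paper's formula, and all names and defaults are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class BreadthDepthTracker:
    """Toy scheduler: route the next unit of effort to the least-tested approach,
    and recommend brainstorming a new one only once every existing approach has
    received the current (increasing) pivot threshold of cumulative effort."""
    base_threshold: float = 2.0   # illustrative: effort required before the first pivot
    step: float = 0.5             # illustrative: how much the threshold grows per approach
    efforts: dict = field(default_factory=dict)   # approach name -> cumulative effort

    def pivot_threshold(self) -> float:
        # Stand-in for the increasing thresholds K*_n: grows with the number of approaches.
        return self.base_threshold + self.step * len(self.efforts)

    def next_action(self) -> str:
        if not self.efforts:
            return "brainstorm"
        least_tested = min(self.efforts, key=self.efforts.get)
        if self.efforts[least_tested] >= self.pivot_threshold():
            return "brainstorm"           # every approach has hit the current threshold
        return f"work on {least_tested}"  # includes revisiting previously abandoned ideas

    def log_effort(self, approach: str, amount: float) -> None:
        self.efforts[approach] = self.efforts.get(approach, 0.0) + amount

tracker = BreadthDepthTracker()
tracker.log_effort("idea-A", 2.5)
tracker.log_effort("idea-B", 1.0)
print(tracker.next_action())   # -> "work on idea-B" (the least-tested, possibly old, idea)
```

Because effort is always routed to the least-tested approach, previously abandoned ideas naturally rotate back in, mirroring the recall behavior the model predicts.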

Long-Term Applications

The following applications require further research, parameter estimation, scaling, or organizational and regulatory development before widespread deployment.

  • Model-based decision-support platforms for R&D portfolio optimization (software for cross-industry R&D)
    • Application: A platform that estimates $c$, $\nu_0$, $\lambda_E$, $\lambda_H$, and $r$ from historical data, computes rising thresholds $K^*_n$, and recommends when to brainstorm, how to split effort across least-tested approaches, and when to revisit (a simplified estimation sketch appears after this list).
    • Tools/products: Breadth–Depth Scheduler with Bayesian difficulty learners; APIs for lab notebooks, issue trackers, and automation.
    • Assumptions/dependencies: Reliable data to estimate parameters; validation in different domains; user trust in probabilistic recommendations.
  • Adaptive, non-monotonic incentive systems for VC and corporate R&D (finance, entrepreneurship)
    • Application: Contracts that algorithmically adjust founder/agent equity or bonus shares over time, frontloading to induce breadth and selectively backloading when learning indicates sustained difficulty, potentially including failure reward clauses.
    • Tools/products: Contract analytics that co-optimize exploration breadth and principal’s dilution; templates that encode learning-triggered adjustments.
    • Assumptions/dependencies: Legal adaptability, regulatory approval, accurate measurement of exploration behaviors; cultural acceptance.
  • Healthcare diagnostics decision support (healthcare)
    • Application: Systems that allocate diagnostic effort across differential diagnoses, equalizing effort across least-tested hypotheses and revisiting earlier hypotheses as ongoing failure suggests high difficulty, with guardrails for safety and ethics.
    • Tools/products: Diagnostic effort allocators integrated into EHR; policy rules for revisit cadence when uncertainty about difficulty persists.
    • Assumptions/dependencies: Clinical validation; adaptation from single-breakthrough to multi-outcome settings; strong safety and regulatory oversight.
  • AI lab governance and compute allocation (AI, research infrastructure)
    • Application: Compute schedulers that implement breadth-first allocation early, enforce equalization across least-tested experiment families, and algorithmically schedule revisits as evidence accumulates about problem difficulty.
    • Tools/products: Cluster-level breadth–depth controllers; telemetry to estimate survival probabilities and difficulty over time.
    • Assumptions/dependencies: Robust telemetry, parameter learning; integration with job schedulers; buy-in from research leadership.
  • Automation in materials and energy discovery (robotics, materials, energy)
    • Application: Autonomous labs that blend high-throughput exploration with planned revisits based on learned difficulty, dynamically adjusting batch selection to equalize cumulative exposure and increasing thresholds over campaign duration.
    • Tools/products: Lab automation orchestrators with embedded breadth–depth strategies; interfaces to characterization tools.
    • Assumptions/dependencies: Advanced robotics; live parameter estimation; domain-specific adaptations to non-Poisson success and correlated validity.
  • Education and pedagogy for problem solving (education)
    • Application: Curricula that teach “problems vs exercises,” alternating exploration and revisitation, and the ambiguity of failure; classroom tools to log effort and schedule revisits with increasing thresholds.
    • Tools/products: Lesson modules; student dashboards tracking breadth–depth; instructor guidelines.
    • Assumptions/dependencies: Teacher training; curriculum time; assessment frameworks that value breadth and learning dynamics.
  • Government innovation policy for frontier science (policy, national innovation systems)
    • Application: Programs for AGI, fusion, superconductivity, etc., that explicitly frontload exploration budgets, maintain revisit funds, and tolerate early failure as a feature of optimal search under unknown difficulty.
    • Tools/products: Policy playbooks; metrics (Breadth Index, Revisit Rate) for portfolio oversight.
    • Assumptions/dependencies: Political will; robust monitoring; alignment with long-run societal objectives.
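As a rough starting point for the parameter-estimation layer of such a platform, here is a hedged sketch that fits $\nu_0$ and a single arrival rate $\lambda$ to historical per-approach records by maximum likelihood. It deliberately ignores the two difficulty states, the brainstorming cost $c$, and discounting, and the data, variable names, and starting values are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical arm-level history: cumulative effort spent on each approach and
# whether it eventually produced a breakthrough (1) or was censored (0).
efforts = np.array([2.1, 0.8, 3.5, 1.2, 4.0, 0.5, 2.7])
success = np.array([1,   0,   0,   1,   0,   0,   0  ])

def neg_log_lik(params):
    """Censored-mixture likelihood: an approach is valid w.p. nu0 and, if valid,
    succeeds at Poisson rate lam; invalid approaches never succeed.
    Simplification: a single difficulty regime (one lam) is assumed."""
    nu0, lam = params
    dens = nu0 * lam * np.exp(-lam * efforts)          # breakthrough at time t
    surv = nu0 * np.exp(-lam * efforts) + (1.0 - nu0)  # no breakthrough by time t
    ll = np.sum(success * np.log(dens) + (1 - success) * np.log(surv))
    return -ll

res = minimize(neg_log_lik, x0=[0.5, 0.5], bounds=[(1e-3, 1 - 1e-3), (1e-3, None)])
nu0_hat, lam_hat = res.x
print(f"estimated nu0 ~ {nu0_hat:.2f}, lambda ~ {lam_hat:.2f}")
```

Extending this toward the full model would mean adding the hard/easy mixture over $\lambda$ and estimating $\delta_0$ jointly, at the cost of a more delicate likelihood.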

Cross-cutting assumptions and dependencies

  • Ex-ante similarity of approaches and independent validity draws (can be relaxed with correlation, but requires model adaptation).
  • Success modeled as a one-time breakthrough with Poisson arrival; some sectors may need multi-stage or repeated-payoff adaptations.
  • Effort can be fractionally allocated and reliably logged; survival (no success) is an informative but imperfect signal.
  • Unknown difficulty acts as a common factor across approaches; failures update beliefs system-wide.
  • Intensive-margin moral hazard (how to explore) is the relevant friction; environments with extensive-margin shirking may need different contracts.
  • Organizational and legal feasibility of time-varying equity/bonus schedules and “failure bonuses.”
  • Cultural acceptance of revisiting previously abandoned ideas and tolerating early failure as optimal behavior under uncertainty.
