
$\texttt{YC-Bench}$: Benchmarking AI Agents for Long-Term Planning and Consistent Execution

Published 1 Apr 2026 in cs.CL and cs.AI | (2604.01212v1)

Abstract: As LLM agents tackle increasingly complex tasks, a critical question is whether they can maintain strategic coherence over long horizons: planning under uncertainty, learning from delayed feedback, and adapting when early mistakes compound. We introduce $\texttt{YC-Bench}$, a benchmark that evaluates these capabilities by tasking an agent with running a simulated startup over a one-year horizon spanning hundreds of turns. The agent must manage employees, select task contracts, and maintain profitability in a partially observable environment where adversarial clients and growing payroll create compounding consequences for poor decisions. We evaluate 12 models, both proprietary and open source, across 3 seeds each. Only three models consistently surpass the starting capital of \$200K, with Claude Opus 4.6 achieving the highest average final funds at \$1.27M, followed by GLM-5 at \$1.21M with 11$\times$ lower inference cost. Scratchpad usage, the sole mechanism for persisting information across context truncation, is the strongest predictor of success, and adversarial client detection is the primary failure mode, accounting for $47\%$ of bankruptcies. Our analysis reveals that frontier models still fail through distinct failure modes such as over-parallelization, demonstrating persistent capability gaps in long-horizon performance. $\texttt{YC-Bench}$ is open-source, reproducible, and configurable.

Summary

  • The paper introduces YC-Bench to assess LLM agents' long-horizon coherence in strategic planning and execution over a simulated year-long startup operation.
  • It employs a persistent scratchpad and structured CLI commands to manage memory, risk, and resource allocation under POMDP conditions.
  • Empirical analysis shows that only a few models sustain profitability, highlighting the critical role of strategic scratchpad usage and frequent tool actions.

YC-Bench: A Benchmark for Long-Term Coherence in LLM Agents

Benchmark Design and Motivation

YC-Bench addresses a critical evaluation gap for LLM agents: the capacity for sustained strategic coherence across hundreds of compounding steps in realistic, partially observable environments. The benchmark simulates one year of startup operation, requiring agents to navigate dynamic task selection, risk management, resource allocation, and financial optimization in the presence of adversarial clients and delayed rewards.

The environment formalizes the challenge as a POMDP with deterministic but unknown transition and observation dynamics. Agents interact through a CLI interface, issuing explicit commands for observation, task selection, employee assignment, client management, and memory maintenance, most notably a persistent scratchpad, which serves as the only cross-turn information persistence mechanism in the presence of context truncation. The design is adversarial: a significant fraction of clients inflate work requirements after task acceptance, requiring robust inference of reliability from episodic task outcomes (Figure 1).

Figure 1: Overview of YC-Bench—agent interaction via structured CLI commands, state tracking, and hidden adversarial client dynamics.
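The command-style loop described above can be sketched as a minimal dispatcher; the command names, state fields, and dispatch logic here are illustrative assumptions, not the benchmark's actual interface:

```python
# Minimal sketch of a CLI-style agent interface with a persistent scratchpad.
# Command names and state fields are invented for illustration.

def handle_command(state, command):
    """Dispatch one agent command against the (partially observed) state."""
    verb, *args = command.split()
    if verb == "observe":
        # Only a slice of the true state is visible each turn (POMDP).
        return {"funds": state["funds"], "month": state["month"]}
    if verb == "scratchpad_write":
        # The scratchpad is the only store that survives context truncation.
        state["scratchpad"].append(" ".join(args))
        return {"ok": True}
    if verb == "scratchpad_read":
        return {"notes": list(state["scratchpad"])}
    return {"error": f"unknown command: {verb}"}

state = {"funds": 200_000, "month": 1, "scratchpad": []}
handle_command(state, "scratchpad_write avoid Client-X after missed deadline")
print(handle_command(state, "scratchpad_read"))
```

The key design point is that only writes routed through the scratchpad command survive across turns; everything else must be re-observed.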

Empirical Analysis: Model Performance and Behavioral Insights

YC-Bench evaluates twelve LLMs (both proprietary and open-source), benchmarking their ability to maximize final company funds under the constraints of the environment. The results identify substantial gaps in long-horizon agentic abilities that are not captured by standard benchmarks such as AgentBench or SWE-Bench.

Of twelve models, only five turn a profit, and just three (Claude Opus 4.6, GLM-5, GPT-5.4) achieve substantial capital growth (>5× starting funds). These models distinguish themselves by early client trust concentration, strategic avoidance of adversarial clients, and frequent use of the scratchpad to record key facts and decision rules. Conversely, most models distribute effort indiscriminately, suffer avoidable task failures, or go bankrupt through payroll escalation and repeated engagement with unreliable clients (Figure 2).

Figure 2: Funds trajectory over time for 12 models, showing divergence and bankruptcy events; only three models sustain substantial profit across runs.

Scratchpad usage emerges as the dominant predictor of long-term success: top models record decision heuristics, client reliability profiles, and operational strategies persistently, enabling adaptive behavior despite context truncation. The three high-performing models execute 3–4× more tool actions per run than their counterparts, with the largest gaps in task inspection and scratchpad updates, leading to superior trust-building and task throughput (Figure 3).

Figure 3: Tool usage analysis—top models significantly outpace others in command execution, inspection, and persistent memory writing.

Failure Modes and Long-Horizon Coherence Analysis

Analysis of failed tasks reveals three primary modes: (1) repeated engagement with adversarial clients due to poor memory and policy persistence, (2) suboptimal employee assignment (incorrect productivity estimation, excessive task concurrency), and (3) employee over-commitment leading to payroll-driven financial collapse. Identifying adversarial clients—who represent ~35% of the market—is particularly challenging; models that ignore or fail to blacklist unreliable clients accept adversarial tasks at or above market prevalence, resulting in cascading failures and negative feedback loops (Figure 4).

Figure 4: Adversarial client engagement rate; dashed line denotes natural prevalence, with most models overexposed relative to top performers.
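The episodic reliability inference described above can be sketched as a simple failure-rate ledger with a blacklist threshold; the threshold and minimum-observation values are illustrative assumptions, not the benchmark's parameters:

```python
# Sketch of client-reliability tracking from episodic outcomes: blacklist a
# client once its observed failure rate (work inflated after acceptance)
# crosses a threshold. All thresholds are invented for illustration.
from collections import defaultdict

class ClientLedger:
    def __init__(self, min_obs=2, max_failure_rate=0.5):
        self.outcomes = defaultdict(list)  # client -> [True = work inflated]
        self.min_obs = min_obs
        self.max_failure_rate = max_failure_rate

    def record(self, client, inflated):
        self.outcomes[client].append(inflated)

    def blacklisted(self, client):
        hist = self.outcomes[client]
        if len(hist) < self.min_obs:
            return False  # not enough evidence yet
        return sum(hist) / len(hist) > self.max_failure_rate

ledger = ClientLedger()
ledger.record("AcmeAI", inflated=True)
ledger.record("AcmeAI", inflated=True)
ledger.record("DataCo", inflated=False)
print(ledger.blacklisted("AcmeAI"), ledger.blacklisted("DataCo"))
```

For this pattern to work inside the benchmark, the ledger itself would have to live in the scratchpad, since nothing else survives context truncation.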

Behavioral spectrum analysis demonstrates that even high-scoring models occasionally violate their own strategic guidelines, evidencing a reasoning–execution gap. Lower-performing models exhibit rigid action loops, lack reflection, or fail to act even when diagnostic information is correctly recorded. This exposes the lack of unified deliberation–execution capabilities in most current LLM agents, showing that long-horizon coherence depends on a pipeline of perceive → record → retrieve → act, with failure points at every stage (Figure 5).

Figure 5: Representative decision moments; panels highlight diverse reasoning–execution lapses across model classes.

Implications, Limitations, and Directions for Future Research

YC-Bench illuminates persistent gaps in long-horizon agentic reasoning, memory management, and adaptive decision-making for modern LLMs. Model size and API cost do not consistently correlate with performance; cost-efficient open-source models can outperform proprietary baselines despite lower absolute returns. Practical deployment architectures for agentic LLMs must prioritize robust memory interfaces (e.g., persistent external scratchpads), dynamic risk inference, and sustained strategy adaptation for production readiness. The environment's deterministic structure, limited employee roster, and explicit numerical outputs simplify some aspects; future work should introduce stochasticity, dynamic workforce composition, and more naturalistic observation modalities to further stress-test agentic coherence and adaptation.

The benchmark’s POMDP structure, persistent adversarial pressure, and compounding financial constraints present nontrivial sequential decision-making challenges not tractable with short-term heuristics, simple retrieval, or context-only memory solutions. Empirical evidence suggests a need for architectures integrating deliberate memory, policy refinement, and explicit execution mapping between reasoning and action.

Conclusion

YC-Bench establishes a rigorous new standard for evaluating the long-horizon coherence, planning, and execution consistency of LLM agents (2604.01212). The empirical findings—only three models consistently maintain profitability, and the strongest predictor of success is persistent, strategic memory usage—underscore significant gaps in agentic robustness and the need for improved architectures, inference strategies, and evaluation schemes. The benchmark is open-source and extensible, inviting community-driven model stress-testing on compounding, sequential tasks and real-world adversarial dynamics. Future research can leverage YC-Bench to drive advances in memory systems, policy adaptation, and unified deliberative agent architectures.


Explain it Like I'm 14

YC-Bench: A plain‑language summary for teens

What is this paper about?

This paper introduces YC‑Bench, a “video game–like” test for AI agents. In the game, an AI runs a pretend startup company for a whole year. It has to choose which jobs (contracts) to take, assign the right employees, watch its cash, and avoid risky customers. The goal is simple: finish the year with more money than you started with.

What questions are the researchers asking?

They want to know:

  • Can AI agents make smart plans that last a long time, not just quick fixes?
  • Can they remember important facts over hundreds of decisions (like which customers to avoid)?
  • Can they learn from mistakes that only show up much later (like a bad job leading to a missed deadline and a fine)?
  • Which habits make an AI successful or likely to fail in this long, complicated setting?

How does YC‑Bench work? (Methods explained simply)

Think of YC‑Bench as a realistic business simulator with rules:

  • You start with $200,000 and run a company for 12 months (hundreds of turns).
  • Each turn, the AI sees some info (money, events) but not everything; it has to ask for details when needed. That’s called “partial information.”
  • The AI picks contracts from different “clients,” assigns employees with different strengths, and pays salaries every month.
  • Some clients are secretly “adversarial” (like scammy customers). Their jobs look great but are actually impossible to finish on time. The AI must figure out who these are by noticing patterns in its own history.
  • There’s a memory limit: after about 20 turns, the AI “forgets” earlier conversation details unless it writes a short, persistent “scratchpad” (like keeping a notebook). This is the only way to remember things long‑term (e.g., “Never accept jobs from Client X”).
  • Good decisions snowball into bigger rewards (trusted clients lower required work and unlock better jobs), while bad choices add up (salaries keep rising, missed deadlines cost money and reputation).

The researchers tested 12 different AI models and ran each three times. They also compared them to a simple “greedy” strategy that just picks the highest‑paying job and assigns everyone to it without thinking.

What did they find, and why does it matter?

Here are the main takeaways:

  • Only a few AIs actually grew the business. Out of 12 models, only 3 consistently finished with more than the starting $200,000. The best average was about $1.27 million; another reached a similar level at much lower computing cost.
  • Remembering on purpose is crucial. The strongest predictor of success was using the scratchpad well—writing down which clients to trust, which to avoid, and which employees are best for each kind of work.
  • “Tricky clients” sink companies. Not spotting adversarial clients caused 47% of bankruptcies. Many models either didn’t notice the pattern or didn’t act on it.
  • Planning isn’t just thinking; it’s doing. Some AIs wrote good rules (“focus on one task at a time,” “check if we can finish before the deadline”) and then ignored them. Others repeated the same loop without reflecting. The best model both reflected and followed its own rules.
  • Employee assignment matters a lot. Many failures came from assigning the wrong people to jobs or spreading them too thin, so tasks couldn’t be finished on time.
  • Cost matters in the real world. Some models made lots of money in the game but cost a lot to run. Others earned less but were far more cost‑efficient (more “revenue per dollar of compute”).

Why it matters: Lots of common AI tests measure quick reasoning. YC‑Bench shows that long‑term success needs more: steady planning, memory, learning from delayed outcomes, and consistent execution. It reveals capability gaps that don’t show up on short, one‑step tasks.

What could this change or improve in the future?

  • Better AI design: YC‑Bench highlights the need for AIs that can keep a reliable “notebook,” notice long‑term patterns (like risky clients), and then stick to the plan.
  • Safer, more reliable agents: By testing how AIs handle uncertainty and sneaky problems, YC‑Bench can guide improvements for real‑world uses like project management, operations, and finance.
  • Fairer comparisons: Because it’s open‑source and reproducible, researchers and developers can use YC‑Bench to compare models honestly and track progress on long‑term planning.
  • Next steps: The authors suggest making the simulation even more realistic later (e.g., hiring/firing employees, random surprises), which could uncover new weaknesses and push AIs to improve.

In short: YC‑Bench is a year‑long startup simulator for AIs. It shows that to win over time, an AI must plan ahead, remember what matters, learn from slow feedback, avoid bad actors, and follow its own rules consistently.

Knowledge Gaps

Unresolved gaps, limitations, and open questions

Below is a single, concrete list of what the paper leaves missing, uncertain, or unexplored:

  • Causality of scratchpad effects: Only correlational evidence is provided that scratchpad use predicts success; no controlled ablations (e.g., same model with/without scratchpad, enforced memory discipline, or alternative memory mechanisms) establish causal impact.
  • Memory design bias: The scratchpad is the sole long-term memory mechanism; the benchmark does not evaluate or compare structured memory alternatives (e.g., vector DB retrieval, MemGPT-style hierarchical memory, episodic vs semantic stores) or the trade-offs between memory length, cost, and performance.
  • Fairness across model capabilities: A fixed 20-turn conversation history may advantage/disadvantage models with different native context limits or retrieval behaviors; the impact of context window size and prompt length on performance is not systematically studied.
  • Sensitivity to environment parameters: No parameter sweeps test robustness to changes in adversarial client prevalence, work inflation magnitude, trust/decay rates, prestige scaling, salary growth, task deadlines, or reward distributions.
  • Determinism vs realism: The environment is deterministic with hidden state; how results transfer to stochastic, noisy, or partially misspecified environments (delayed signals, observation errors, random shocks) remains untested.
  • Simplified labor dynamics: Employees are fixed and cannot be hired/fired; the effects of staffing decisions, hiring pipelines, layoffs, ramp-up times, or skill drift on long-horizon planning are not evaluated.
  • Limited exogenous disruptions: Only adversarial clients act as disruptions; the benchmark omits realistic shocks (market downturns, supply delays, regulatory changes) that could probe resilience and re-planning.
  • Numerical, non-linguistic signals: Key quantities are provided as exact numbers; the benchmark does not test agents’ ability to parse noisy, incomplete, or narrative business reports, emails, or contracts expressed in natural language.
  • Single-agent, non-competitive market: There is no competition for contracts or strategic interaction with other agents; how performance changes in competitive, multi-agent markets with bidding/negotiation is unknown.
  • Narrow domain and task taxonomy: Four domains (training, inference, research, data engineering) are used; generalization to other industries, task mixes, or heterogeneous contract types (maintenance vs projects) is not assessed.
  • Baseline coverage: Only a simple greedy baseline is included; comparisons to classical OR/heuristic planners, explicit schedulers, RL agents, or LLM+planner hybrids are missing.
  • Statistical robustness: Only 3 seeds per model are reported; the paper lacks confidence intervals or significance tests to assess the reliability of model rankings and effect sizes.
  • Attribution methodology transparency: The procedure for labeling task failure causes (e.g., adversarial vs misassignment vs overcommitment) is not fully specified (rules, thresholds, human validation, inter-rater reliability), leaving the breakdown’s validity uncertain.
  • Cost-efficiency comparability: Revenue-per-API-dollar depends on provider pricing and may fluctuate; wall-clock latency, rate limits, and compute constraints are not incorporated, limiting operational relevance of the metric.
  • Prompting and hyperparameter control: The impact of temperatures, decoding strategies, tool-calling formats, and system prompts across models is not ablated, leaving open whether performance gaps stem from modeling vs prompting.
  • Version stability of proprietary models: Several evaluated models are preview/rapidly evolving; reproducibility across versions or over time is not quantified.
  • Overfitting risk to an open benchmark: Because the environment is public and deterministic, models can be prompt-tuned to the benchmark; no hidden test scenarios or parameter-randomized variants are proposed to prevent benchmark gaming.
  • Transfer validity: The paper does not test whether YC-Bench rankings predict performance on other long-horizon benchmarks (e.g., VB, BALROG games) or real-world agent tasks; cross-benchmark correlation is unknown.
  • Reward shaping and metric diversity: Final funds is the primary scalar; no analysis of risk-adjusted returns, drawdowns, bankruptcy probability, portfolio stability, or multi-objective trade-offs (e.g., growth vs runway safety).
  • Action granularity and tool reliability: The CLI actions are assumed reliable and atomic; the benchmark does not study tool failures, ambiguous API responses, or error recovery strategies.
  • Reasoning–execution gap diagnosis: While observed, the root causes (planning errors, retrieval failures, self-consistency issues) are not disentangled with instrumentation (e.g., introspection probes, step-level audits) or targeted interventions.
  • Learning across episodes: Despite references to multi-episode improvement, the experiments do not quantify whether agents learn policies over repeated runs (e.g., using summaries across episodes) or how quickly they improve.
  • Seed difficulty calibration: The distribution of employee productivity, client sets, and task streams across seeds is not characterized; there is no analysis of whether certain seeds are systematically easier/harder.
  • Trust and prestige dynamics validity: The chosen functional forms (linear prestige scaling, uniform trust decay) are not validated against real-world analogs or tested for alternative forms (e.g., diminishing returns, thresholds).
  • Safety and reward hacking: The benchmark does not probe for emergent undesirable strategies (e.g., gaming metrics, deferring payroll via loopholes), nor does it stress test misalignment under high pressure or ambiguous rules.
  • Multimodal information handling: The environment excludes documents, charts, or attachments; the ability to integrate multimodal signals into long-term plans remains unexplored.
  • Human-in-the-loop settings: The benchmark evaluates fully autonomous agents; the effect of sparse human feedback, approval gates, or periodic interventions is not studied.
  • Generalization over horizon length: Only a one-year horizon is used; how performance scales with longer/shorter horizons or with non-stationary regime shifts over time is unknown.
  • Policy reproducibility within models: Run-to-run variance for the same model/prompt beyond three seeds is not reported, leaving intra-model stability underexplored.

Practical Applications

Immediate Applications

Below are concrete, deployable use cases that leverage YC-Bench’s findings, artifact, and methodology today.

  • Model selection and procurement for autonomous agents (software/finance/ops)
    • What: Use YC-Bench to compare candidate LLMs by long-horizon reliability and cost-efficiency (e.g., revenue-per-API-dollar). Prefer models that show high profitability and disciplined scratchpad usage (e.g., GLM-5 where budgets are tight; Claude Opus where performance is paramount).
    • Tools/products/workflows: “YC-Bench-in-the-loop” bakeoff before production; dashboards reporting bankruptcy probability, adversarial-task rate, trust buildup, and runway.
    • Assumptions/dependencies: Benchmark dynamics are deterministic and industry-agnostic; requires mapping from simulated profit to domain-specific KPIs.
  • Memory scaffolding for agents (software, enterprise automation)
    • What: Adopt persistent scratchpads or memory vaults (e.g., MemGPT-like) as a default design pattern, since scratchpad usage is the strongest predictor of success in long-horizon tasks.
    • Tools/products/workflows: Scratchpad middleware that auto-curates client lists, feasibility rules, and action checklists; CI rules that fail builds if scratchpad is underused.
    • Assumptions/dependencies: Agents must support system-prompt injection or external memory stores; must respect organizational data governance.
  • Risk-aware task acceptance policies (sales, professional services, BPO)
    • What: Implement agent policies to blacklist/whitelist counterparties based on historical failures; add “adversarial client” detectors and enforce pre-acceptance checks.
    • Tools/products/workflows: Client risk-scoring plugins; “acceptance guardrails” that block tasks failing feasibility or adversarial heuristics; audit logs tying failures to counterparties.
    • Assumptions/dependencies: Requires reliable data capture on partner history; adversarial patterns differ by domain and need calibration.
  • Portfolio focus to exploit “trust snowball” (CRM, platform marketplaces)
    • What: Concentrate agent effort on a few counterparties early to compound trust-like benefits (reduced scope/work, higher-tier opportunities), mirroring the benchmark’s top performers.
    • Tools/products/workflows: CRM integrations where agents track trust scores and prioritize tasks from “core accounts”; OKRs tied to deepening selected relationships.
    • Assumptions/dependencies: Real environments must expose “trust” or proxy metrics (SLAs, discounts, easier processes) that actually compound with repeated success.
  • Feasibility calculators and single-task enforcement (project management, software delivery)
    • What: Equip agents with feasibility checks before task acceptance (required_qty/total_rate < hours_until_deadline) and enforce one-task-at-a-time or cap concurrency to avoid over-parallelization.
    • Tools/products/workflows: Pre-acceptance feasibility modules; concurrency governors; automatic reallocation rules; alerts when estimated throughput < required pace.
    • Assumptions/dependencies: Agent must know (or estimate) resource rates and deadlines; real throughput is stochastic and may need buffers.
  • Cost-aware agent operations (finance, FinOps for AI)
    • What: Optimize for “revenue per API dollar”; route simpler tasks to cheaper models (e.g., Kimi-K2.5, Gemini Flash) and keep complex planning with top models; set budget ceilings.
    • Tools/products/workflows: Model-router with cost-performance thresholds; live spend dashboards and kill-switches when marginal ROI dips.
    • Assumptions/dependencies: Requires accurate marginal value estimation per call; token costs vary by provider and may change.
  • Pre-deployment red-teaming with adversarial scenarios (security, governance)
    • What: Use YC-Bench to stress-test agents against adversarial counterparties and scope creep; track how quickly agents learn and avoid repeated traps.
    • Tools/products/workflows: Adversarial task injectors in staging; failure-mode catalogs (e.g., adversarial acceptance rate, blacklist violations).
    • Assumptions/dependencies: Needs domain-specific adversarial patterns and validation data to generalize beyond the benchmark.
  • Training curricula for long-horizon coherence (L&D, academia, internal AI upskilling)
    • What: Teach operators and students the “perceive → record → retrieve → act consistently” pipeline; run labs on client-selection strategy, memory discipline, and payroll/runway management.
    • Tools/products/workflows: Coursework modules using YC-Bench; evaluation rubrics for memory quality and policy adherence.
    • Assumptions/dependencies: Translating simulated lessons to sector reality needs domain mapping and real KPIs.
  • MLOps guardrails for agentic systems (software engineering)
    • What: Add deployment gates: agents must meet minimum YC-Bench scores on profitability, adversarial avoidance, and memory usage before shipping.
    • Tools/products/workflows: CI checks; regression tests comparing new model versions against historical YC-Bench baselines.
    • Assumptions/dependencies: Requires maintenance of benchmark versions and thresholds; risk of overfitting agents to the benchmark format.
  • Personal productivity assistants with robust memory (daily life/education)
    • What: Configure personal AI assistants to keep persistent scratchpads (deadlines, vendor reliability, “do-not-work-with” lists) and enforce feasibility before accepting commitments.
    • Tools/products/workflows: Scratchpad templates for projects; “commitment sanity checks” and focus-mode enforcement.
    • Assumptions/dependencies: Users must allow persistent storage; privacy and data retention policies apply.
  • Vendor reliability tracking for procurement (operations, supply chain)
    • What: Maintain live blacklists/whitelists based on on-time delivery, quality, and scope changes; instruct agents to avoid risky vendors despite attractive pricing.
    • Tools/products/workflows: Procurement agent plug-ins; exception approvals required to engage blacklisted vendors.
    • Assumptions/dependencies: Accurate vendor performance data; incentives aligned to discourage short-term “cheap but risky” choices.
  • Benchmark-driven product marketing and pricing (AI vendors)
    • What: Report YC-Bench results (profitability, cost-efficiency) to customers; tailor pricing/SLA tiers tied to long-horizon performance.
    • Tools/products/workflows: Public scorecards; performance-based pricing pilots.
    • Assumptions/dependencies: Customers accept benchmark relevance; periodic re-benchmarks needed as models update.
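The pre-acceptance feasibility rule quoted in the list above (required_qty/total_rate < hours_until_deadline) can be sketched directly; the safety buffer is an added assumption on top of the quoted rule, not part of it:

```python
# Sketch of a pre-acceptance feasibility check: can the assigned team
# plausibly finish the required work before the deadline? The optional
# buffer multiplier is an illustrative addition for stochastic throughput.

def is_feasible(required_qty, assignee_rates, hours_until_deadline, buffer=1.0):
    """Return True if estimated hours needed (scaled by buffer) fit the deadline."""
    total_rate = sum(assignee_rates)  # combined units of work per hour
    if total_rate <= 0:
        return False  # nobody assigned, or zero productivity
    hours_needed = required_qty / total_rate
    return hours_needed * buffer < hours_until_deadline

# Two employees at 3 and 2 units/hour, 100 units due in 24 hours:
print(is_feasible(100, [3, 2], 24))              # 20h needed -> True
print(is_feasible(100, [3, 2], 24, buffer=1.3))  # 26h with buffer -> False
```

A concurrency governor would simply refuse new acceptances once an employee's summed feasible commitments approach their available hours.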

Long-Term Applications

These applications require further research, scaling, or domain adaptation beyond the current YC-Bench scope.

  • Sector-specific long-horizon certification for autonomous agents (healthcare, finance, gov)
    • What: Regulatory-style test suites adapted from YC-Bench to certify agents for sensitive workflows (e.g., revenue cycle, claims, case management).
    • Tools/products/workflows: Standardized “agent driving tests” with adversarial injects; certification bodies and audit trails.
    • Assumptions/dependencies: Requires sector datasets, safety rules, and interpretability standards; policy consensus needed.
  • Realistic stochastic environments and exogenous shocks (operations, macro risk)
    • What: Extend the benchmark to include hiring/firing, random disruptions, and natural-language noise (e.g., ambiguous requirements).
    • Tools/products/workflows: Next-gen simulators for resilience testing (sick leave, supplier delays, demand shocks).
    • Assumptions/dependencies: Higher simulation complexity; data for shock distributions and calibration.
  • Architectures that close the reasoning–execution gap (AI research, software)
    • What: Develop agent designs that tie reflection to action (e.g., policy compilation from scratchpad rules, action contracts, self-verification).
    • Tools/products/workflows: “Policy binder” that converts written rules into executable constraints; plan–do–check–act loops with automatic enforcement.
    • Assumptions/dependencies: Advances in tool-use reliability and controllability; careful avoidance of over-constraint or brittleness.
  • Learned adversarial-counterparty detectors (finance, marketplaces, trust & safety)
    • What: Train models to infer counterparty risk from multi-episode histories and subtle scope inflation patterns; generalize across domains.
    • Tools/products/workflows: Sequence models over interaction logs; early-warning signals and “hold to review” workflows.
    • Assumptions/dependencies: Requires labeled histories and feedback loops; potential for false positives with business impact.
  • Memory management and curation systems (AgentOS)
    • What: Automated memory pruning, salience scoring, and retrieval to keep only policies and facts that drive outcomes; reduce hallucinated or stale rules.
    • Tools/products/workflows: Memory optimizers; change-detection and policy-refresh pipelines.
    • Assumptions/dependencies: Requires robust instrumentation linking memory entries to performance; privacy-compliant storage.
  • Cross-agent coordination and resource allocation (multi-team orgs, robotics fleets)
    • What: Generalize from single-agent payroll/assignment to multi-agent coordination under partial observability and compounding costs.
    • Tools/products/workflows: Coordinated schedulers; shared memory ledgers and conflict-resolution protocols.
    • Assumptions/dependencies: Communication reliability and synchronization; added complexity in credit assignment.
  • Long-horizon RL/finetuning curricula (AI research)
    • What: Use YC-Bench variants for RL or constitutional/reflection fine-tuning targeting adversarial avoidance, disciplined memory, and focus.
    • Tools/products/workflows: Self-play and curriculum learning with increasing horizon and adversarial intensity.
    • Assumptions/dependencies: Stable training signals over long horizons; robust offline–online transfer.
  • Organizational planning assistants with runway/payroll awareness (startups, SMEs)
    • What: Agents that simulate cash runway, rising payroll, and task portfolio to advise founders on contract mix and client focus over quarters.
    • Tools/products/workflows: FP&A-integrated planning copilot; scenario analysis with trust and prestige proxies (e.g., customer LTV, partner tiers).
    • Assumptions/dependencies: Requires integration with finance systems; mapping “prestige” to real commercial unlocks.
  • Education and assessment for operations management (academia, executive education)
    • What: Case-based curricula and competitions that evaluate strategic coherence over hundreds of steps; scorecards on memory, risk, and focus.
    • Tools/products/workflows: Courseware with instructor dashboards and auto-grading on long-horizon metrics.
    • Assumptions/dependencies: Need domain contextualization and rubrics that reflect real-world constraints.
  • Safety and governance analytics for autonomous operations (policy, compliance)
    • What: Metrics and thresholds (e.g., max adversarial acceptance rate, blacklist violations, memory staleness) baked into governance for autonomous workflows.
    • Tools/products/workflows: Continuous assurance pipelines; incident postmortems tied to memory and policy adherence signals.
    • Assumptions/dependencies: Monitoring infrastructure and standardized telemetry; clarity on acceptable risk thresholds.
  • Domain-adapted benchmarks (healthcare, education, energy)
    • What: YC-Bench-style testbeds for clinical scheduling, tutoring over a semester, or energy dispatch over seasons.
    • Tools/products/workflows: Sector simulators with delayed rewards and adversarial elements (e.g., insurer denials, disengaged students, price spikes).
    • Assumptions/dependencies: Requires expert-designed dynamics and validation datasets; ethical and regulatory constraints.
  • Economic simulations for market design and policy testing (economics, public policy)
    • What: Use agent-based variants to study how different rules (e.g., penalties, trust decay, pricing) affect market outcomes and resilience.
    • Tools/products/workflows: Policy sandboxes; stress tests for proposed regulations involving autonomous agents.
    • Assumptions/dependencies: External validity from simulation to real markets; agreement on modeling assumptions.
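The governance-analytics idea above (thresholds on adversarial acceptance, blacklist violations, memory staleness) can be sketched as a simple assurance gate. This is an illustrative sketch only; the metric names and limits below are assumptions, not quantities defined by YC-Bench.

```python
# Illustrative governance gate for autonomous workflows.
# Metric names and threshold values are hypothetical, not YC-Bench APIs.

def governance_check(metrics: dict, limits: dict) -> list:
    """Return the names of metrics that exceed their allowed limits."""
    return [name for name, limit in limits.items()
            if metrics.get(name, 0) > limit]

limits = {
    "adversarial_acceptance_rate": 0.10,  # max share of adversarial tasks accepted
    "blacklist_violations": 0,            # tasks taken from blacklisted clients
    "memory_staleness_turns": 25,         # turns since the scratchpad was updated
}

observed = {
    "adversarial_acceptance_rate": 0.15,
    "blacklist_violations": 0,
    "memory_staleness_turns": 12,
}

violations = governance_check(observed, limits)
# e.g. flags only "adversarial_acceptance_rate" for this observation
```

A continuous-assurance pipeline would run such a check per episode (or per deployment window) and block or escalate runs whose violations list is non-empty.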

In all cases, the core takeaways from YC-Bench—persistent memory as a first-class capability, risk-aware selection under partial observability, disciplined focus to harness compounding benefits, and sensitivity to cost-efficiency—translate into concrete design rules and evaluation gates for deploying agentic systems in consequential settings.
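The first of these takeaways, persistent memory under a rolling context window, can be made concrete with a minimal sketch: recent turns are truncated, but scratchpad notes survive because they are re-injected into the prompt every turn. The class and method names here are illustrative assumptions, not the YC-Bench interface.

```python
from collections import deque

class AgentMemory:
    """Sketch of memory under a fixed-size turn window (hypothetical API).

    The benchmark's 20-turn context window discards old turns; only
    scratchpad notes, re-injected each turn, persist across truncation.
    """

    def __init__(self, window: int = 20):
        self.history = deque(maxlen=window)  # old turns fall off automatically
        self.scratchpad = []                 # persists for the whole episode

    def record_turn(self, turn: str):
        self.history.append(turn)

    def note(self, fact: str):
        self.scratchpad.append(fact)

    def build_prompt(self) -> str:
        notes = "\n".join(self.scratchpad)
        recent = "\n".join(self.history)
        return f"SCRATCHPAD:\n{notes}\n\nRECENT TURNS:\n{recent}"

mem = AgentMemory(window=3)
mem.note("Client Acme inflated workload after acceptance -> blacklist")
for t in range(10):
    mem.record_turn(f"turn {t}")
# Only the last 3 turns remain in history, but the blacklist note is
# still present in every prompt the agent sees.
```

The design point is that anything not written to the scratchpad is eventually lost, which is why scratchpad usage predicts success in the paper's analysis.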

Glossary

  • Action space: The set of all actions available to the agent at any point. "The full agent action space is in Appendix~\ref{app:action-space}."
  • Adversarial clients: Clients whose tasks are intentionally structured to be nearly impossible by inflating required work after acceptance. "A subset of clients are adversarial: after the agent accepts one of their tasks, the environment inflates the work quantity, making the deadline nearly impossible to meet."
  • Agentic capabilities: The capacity of LLMs to autonomously plan, decide, and act over multiple steps using tools and interactions. "The rapid growth of agentic capabilities of LLMs has spawned a rich ecosystem of benchmarks."
  • Bankruptcy: The terminal state where company funds fall below zero, ending the episode. "The episode terminates when funds drop below zero (bankruptcy) or the horizon ends."
  • CLI: Command-line interface used by the agent to interact with the environment via commands. "The agent operates through a CLI tool interface"
  • Client blacklist: A recorded list of clients to avoid due to repeated failures or adversarial behavior. "building a client blacklist"
  • Client trust: A per-client relationship metric that reduces future workload and unlocks higher-tier tasks when increased. "The agent builds trust with a client by completing its tasks successfully; higher trust reduces the work required on future tasks from that client and unlocks higher-tier tasks."
  • Context truncation: The removal of older turns from conversation history due to context length limits. "Scratchpad usage, the sole mechanism for persisting information across context truncation, is the strongest predictor of success"
  • Context window: The maximum number of recent turns retained in the agent’s context before older interactions are truncated. "through a 20-turn context window that forces the agent to use a persistent scratchpad for memory"
  • Cost-efficiency: Performance relative to API inference cost, often measured as revenue per dollar spent. "There is also a significant gap in cost-efficiency with open-source models being more Pareto optimal than their counterparts."
  • Deterministic but unknown dynamics: Environment transitions and observations that are fixed given actions but not revealed to the agent. "a POMDP-based benchmark with deterministic but unknown transition and observation dynamics."
  • Deterministic transition function: A mapping from current state and action to a unique next state. "$\mathcal{T}: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$ is the deterministic transition function"
  • Domain prestige: A per-domain reputation level that gates task access and scales payouts. "The agent maintains a prestige level in each domain"
  • Failure modes: Recurrent patterns of errors leading to poor outcomes (e.g., missed deadlines, bad assignments). "frontier models still fail through distinct failure modes such as over-parallelization"
  • Frontier models: The most capable, cutting-edge LLMs at a given time. "frontier models still fail through distinct failure modes such as over-parallelization"
  • Greedy baseline: A simple policy that always chooses the immediate highest-reward option without strategic planning. "We compare against a greedy baseline that, in each turn, accepts the highest-reward task available on the market"
  • Information-asymmetry: A setting where some relevant variables are hidden and must be inferred from outcomes. "hidden employee skill rates that create an information-asymmetry puzzle"
  • Multi-episode learning framework: A setup where agents can improve across restarts or repeated episodes. "a multi-episode learning framework that tests whether agents can improve across restarts."
  • Observation space: The set of possible observations available to the agent at each turn. "$\mathcal{O}$ is the observation space"
  • Over-parallelization: Taking on too many tasks simultaneously, causing diluted focus and missed deadlines. "distinct failure modes such as over-parallelization"
  • Pareto optimal: A trade-off point at which no alternative improves one metric without worsening another. "open-source models being more Pareto optimal than their counterparts."
  • Partially observable environment: An environment where the agent cannot directly observe the full state and must infer hidden factors. "in a partially observable environment where adversarial clients and growing payroll create compounding consequences for poor decisions."
  • Partially Observable Markov Decision Process (POMDP): A framework modeling decision-making with hidden states and observations. "We formalize YC-Bench as a Partially Observable Markov Decision Process (POMDP)"
  • Reward function: The mapping from state and action to a scalar signal measuring immediate utility. "$R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the reward function defined as the net change in company funds."
  • Runway: The estimated time before funds are depleted given current burn rate. "Runway down to 1 month,"
  • Scope creep: Expansion of task scope beyond initial estimates, increasing required work. "Analyzing how the models deal with adversarial clients, who have appealing rewards when accepting a task but have a lot more work than claimed (scope creep)."
  • Scratchpad: A persistent memory channel where the agent writes notes to survive context truncation across turns. "the agent may write to a persistent scratchpad that is injected into the system prompt on every turn."
  • Seed: A fixed random initialization used to replicate stochastic components across runs. "We evaluate 12 models, both proprietary and open-source, across 3 seeds each."
  • Simulation clock: The controllable mechanism that advances simulated time to the next event. "Simulation Clock. The agent controls the passage of simulated time explicitly."
  • State space: The set of all possible environment configurations. "$\mathcal{S}$ is the state space"
  • Throughput: The amount of work an employee completes per unit time in a given domain. "a senior-tier employee may have senior-level throughput in training but junior-level throughput in research."
  • Trust-gated task selection: Choosing tasks based on earned client trust thresholds that unlock higher-tier opportunities. "tracking per-client success rates to optimize trust-gated task selection."
  • Trust snowball: A compounding effect where increasing trust reduces workload, enabling more completions and further trust gains. "triggering a trust snowball: each success reduces future task workloads (up to 50\%), enabling more completions per month, which builds further trust."
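The trust snowball can be illustrated with a short worked example: each success cuts future task workload (the paper states reductions of up to 50%), so more tasks fit into a fixed monthly capacity, which in turn yields more successes. The 10%-per-success rate, 40-hour base workload, and 160-hour monthly budget below are illustrative assumptions, not numbers from the paper.

```python
# Worked example of the trust snowball. The 50% reduction cap is from the
# paper; the per-success rate, base workload, and monthly hours are
# illustrative assumptions.

def completions_per_month(successes: int,
                          base_workload: float = 40.0,    # hours per task
                          reduction_per_success: float = 0.10,
                          cap: float = 0.50,              # max reduction (paper)
                          monthly_hours: float = 160.0) -> int:
    """Tasks completable per month after a given number of prior successes."""
    reduction = min(successes * reduction_per_success, cap)
    workload = base_workload * (1 - reduction)
    return int(monthly_hours // workload)

print(completions_per_month(0))  # 160h / 40h  -> 4 tasks per month
print(completions_per_month(5))  # cap reached: 160h / 20h -> 8 tasks per month
```

Under these assumptions, throughput doubles once the cap is reached, and each additional completion accelerates progress toward it, which is the compounding dynamic the term describes.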

