YC-Bench: LLM Startup Simulation
- YC-Bench is a benchmark that tests LLM-based agents on long-horizon planning and strategic decision-making in simulated startup management with partial observability.
- It operationalizes agent performance through metrics like final capital, bankruptcy rates, and cost-efficiency over a one-year simulation, emphasizing delayed rewards and adversarial challenges.
- The benchmark highlights the critical role of explicit memory management via scratchpads and effective adversarial client detection to mitigate compounding risks and strategic failures.
YC-Bench is a benchmark designed to evaluate the long-horizon planning, strategic coherence, and memory management capabilities of LLM-based agents within the context of simulated startup management. The benchmark operationalizes an AI agent’s ability to plan under uncertainty, learn from delayed feedback, and adapt to the compounding consequences of early missteps. Agents are tasked with running a simulated startup over a one-year horizon, which requires orchestrating task selection, employee management, and profitability maintenance under partial observability and adversarial (hidden) dynamics (He et al., 1 Apr 2026).
1. Formal Environment Specification
YC-Bench is formalized as a deterministic, partially observable Markov decision process (POMDP) defined as

$$\mathcal{M} = (\mathcal{S},\; \mathcal{A},\; \mathcal{O},\; T,\; R)$$

where:
- $\mathcal{S}$: hidden state at time $t$,

$$s_t = (F_t,\; P_t,\; \mathcal{E},\; \mathcal{Tsk}_t,\; \text{Prestige}_t,\; \text{Trust}_t,\; \mathbf{h})$$

with $F_t$ as company funds, $P_t$ as payroll, $\mathcal{E}$ as employees (fixed, each with hidden, per-domain productivities), $\mathcal{Tsk}_t$ as current tasks (structurally annotated), $\text{Prestige}_t$ as per-domain prestige, $\text{Trust}_t$ as per-client trust, and $\mathbf{h}$ encoding hidden adversarial clients.
- $\mathcal{A}$: sequence of CLI commands per turn, including observation (e.g., `company status`, `market browse`), task management (`task accept`, `task assign`), progression control (`sim resume`), and explicit memory operations (`scratchpad write`/`append`).
- $\mathcal{O}$: observation at each turn, which exposes only part of $s_t$ (employee productivities and $\mathbf{h}$ remain hidden).
- $T$: deterministic transition function ($s_{t+1} = T(s_t, a_t)$). Key transitions include task progress (enforced work hours, splitting of employee throughput), completion events (funds and payroll adjustment, prestige/trust increments), deadline penalties for overdue tasks, and mandatory payroll deductions ($F_t \leftarrow F_t - P_t$) at the start of each month (bankruptcy if $F_t < 0$).
- $R$: immediate reward as change in funds,

$$r_t = F_{t+1} - F_t$$

Uncertainty derives from adversarial clients (encoded in $\mathbf{h}$), whose status is inferable only through outcome patterns: their task workloads are covertly magnified. There is no direct oracle providing client status.
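The funds-related transitions described above (payroll deduction, task completion, bankruptcy, and reward as change in funds) can be illustrated with a minimal sketch. The class and function names, and all numeric values used with them, are hypothetical and are not taken from the benchmark's code; only the funds and payroll components of the state are modeled.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CompanyState:
    """Hypothetical slice of the state: funds and payroll only."""
    funds: float             # company funds
    payroll: float           # total monthly payroll
    bankrupt: bool = False


def apply_monthly_payroll(state: CompanyState) -> CompanyState:
    """Deduct payroll at the start of a month; negative funds mean bankruptcy."""
    funds = state.funds - state.payroll
    return replace(state, funds=funds, bankrupt=funds < 0)


def complete_task(state: CompanyState, task_reward: float,
                  salary_increment: float) -> CompanyState:
    """Credit the task reward and raise payroll by a fixed increment (assumed form)."""
    return replace(state,
                   funds=state.funds + task_reward,
                   payroll=state.payroll + salary_increment)


def reward(prev: CompanyState, nxt: CompanyState) -> float:
    """Immediate reward: change in funds between consecutive states."""
    return nxt.funds - prev.funds
```

Because every completion raises payroll, a policy that looks profitable per task can still drift toward bankruptcy if monthly deductions outgrow incoming rewards, which is the compounding effect the formalization is meant to capture.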
2. Task Structure and Agent Objectives
The simulation spans a one-year horizon (hundreds of turns), with each agent turn comprising an arbitrary sequence of actions before simulated time is advanced with the `sim resume` command. The environment is constrained as follows:
- Accepting a task in a given domain requires sufficient prestige in that domain.
- Accepting a task from a given client requires trust with that client above a threshold.
- Funds must remain non-negative; violation leads to bankruptcy and episode termination.
- Employee roster is fixed; no hiring or firing. Task completions cause monotonically increasing payroll due to fixed salary increments.
Agent objectives are long-horizon: maximizing end-of-year funds via strategic task and client selection, effective parallelization, and adversarial client avoidance, all while mitigating compounding risks (e.g., payroll inflation, prestige/trust constraints).
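The prestige and trust gates on task acceptance can be sketched as a simple predicate. Everything here is an illustrative assumption: the `Task` fields, the dictionary-based prestige and trust maps, the single trust threshold, and the use of non-strict comparisons are not specified by the source.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Task:
    """Hypothetical task record; field names are illustrative."""
    domain: str
    client: str
    required_prestige: float
    reward: float


def can_accept(task: Task,
               prestige: Dict[str, float],
               trust: Dict[str, float],
               trust_threshold: float) -> bool:
    """Return True if both the domain-prestige and client-trust gates pass.

    Unknown domains or clients default to 0.0, and both comparisons are
    non-strict; these are assumptions of this sketch.
    """
    has_prestige = prestige.get(task.domain, 0.0) >= task.required_prestige
    has_trust = trust.get(task.client, 0.0) >= trust_threshold
    return has_prestige and has_trust
```

An agent would typically filter the market listing with such a predicate before ranking the remaining tasks by expected profit.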
3. Evaluation Metrics and Experimental Methodology
The benchmark quantifies performance through a set of primary and secondary metrics:
| Metric | Definition | Significance |
|---|---|---|
| Final capital | Funds at end of horizon or bankruptcy | Main success criterion |
| Bankruptcy rate | Fraction of runs (seeds) ending in bankruptcy | Robustness and consistency measure |
| Cost-efficiency | Final capital (in million \$) relative to API inference cost | Resource-performance tradeoff indicator |
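As a rough illustration of how the three metrics could be aggregated over seeds, the sketch below uses a hypothetical per-run record with `final_funds`, `bankrupt`, and `api_cost_usd` keys. The cost-efficiency formula (mean final funds divided by mean API cost) is an assumption of this sketch, not the paper's stated definition.

```python
from statistics import mean
from typing import Dict, List


def summarize_runs(runs: List[Dict]) -> Dict[str, float]:
    """Aggregate per-seed results into the three headline metrics.

    Each run is a dict with hypothetical keys:
      'final_funds'  - funds at end of horizon (or at bankruptcy)
      'bankrupt'     - True if the episode ended in bankruptcy
      'api_cost_usd' - total inference cost of the run in dollars
    """
    final_capital = mean(r["final_funds"] for r in runs)
    bankruptcy_rate = sum(1 for r in runs if r["bankrupt"]) / len(runs)
    mean_cost = mean(r["api_cost_usd"] for r in runs)
    return {
        "final_capital": final_capital,
        "bankruptcy_rate": bankruptcy_rate,
        # Assumed definition: final funds earned per dollar of API spend.
        "cost_efficiency": final_capital / mean_cost,
    }
```

With only three seeds per model, the bankruptcy rate can take just four values (0/3 through 3/3), so per-seed ranges are more informative than a single averaged figure.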
Twelve LLMs (including proprietary and open-source) are evaluated over three seeds each. A greedy heuristic (max-reward tasks, maximal employee assignment) serves as baseline. Inference cost, token usage, and runtime are measured via LiteLLM+OpenRouter. Model variance is presented as per-seed ranges; formal significance testing is not reported (He et al., 1 Apr 2026).
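The greedy baseline (take the highest-reward task and assign as many employees as possible) can be sketched as a single decision step. The task dictionaries, the `id` and `reward` keys, and the returned action format are hypothetical; the actual baseline would issue the corresponding CLI commands.

```python
from typing import Dict, List, Optional


def greedy_step(market: List[Dict], employees: List[str]) -> Optional[Dict]:
    """Pick the max-reward task and assign every employee to it.

    Returns None when the market is empty. Data shapes are illustrative.
    """
    if not market:
        return None
    best = max(market, key=lambda task: task["reward"])
    return {"accept": best["id"], "assign": list(employees)}
```

Such a policy ignores deadlines, client history, and payroll growth, which is what makes it a useful lower reference point for long-horizon strategies.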
4. Persistent Memory and Adversarial Inference Mechanisms
A defining constraint of YC-Bench is the limited context window, which retains only the most recent turns and therefore necessitates explicit memory management to keep information available over hundreds of turns. The sole mechanism provided is the scratchpad, invoked via dedicated CLI commands (`scratchpad write` and `scratchpad append`).
Agents can append notes, rules, or inferences; scratchpad entries serve as externalized memory to mitigate context truncation. Empirically, scratchpad usage (writes per 100 turns) is the strongest predictor of agent success: top models write more notes per 100 turns than unsuccessful models.
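The interaction between a truncated context and a persistent scratchpad can be sketched as follows. The class, its method names, the overwrite semantics of the write operation, and the prompt layout are assumptions made for illustration and do not reproduce the benchmark's implementation.

```python
from collections import deque
from typing import List


class AgentMemory:
    """Rolling context window plus a scratchpad that survives truncation."""

    def __init__(self, window: int) -> None:
        # Only the most recent `window` turns stay in context.
        self.context: deque = deque(maxlen=window)
        # Scratchpad entries persist for the whole episode.
        self.scratchpad: List[str] = []

    def observe(self, turn_text: str) -> None:
        """Record a turn; the oldest turn is dropped once the window is full."""
        self.context.append(turn_text)

    def scratchpad_write(self, text: str) -> None:
        """Replace the scratchpad contents (assumed overwrite semantics)."""
        self.scratchpad = [text]

    def scratchpad_append(self, text: str) -> None:
        """Add a note without discarding earlier ones."""
        self.scratchpad.append(text)

    def build_prompt(self) -> str:
        """Assemble what the model sees: scratchpad first, then recent turns."""
        lines = ["# Scratchpad", *self.scratchpad, "# Recent turns", *self.context]
        return "\n".join(lines)
```

Anything the agent learns but does not write down disappears once the corresponding turn falls out of the window, which is why note-taking frequency correlates so strongly with outcomes.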
Adversarial client detection is both critical and challenging: a substantial share of bankruptcies is attributable to adversarial contracts. Performance is captured by
- Proportion of adversarial tasks accepted: the best models keep this proportion below the adversarial clients' share of the market.
- Detection accuracy: estimated from agents' acceptance decisions or via explicit client blacklist policies recorded in scratchpads.
A plausible implication is that persistent external memory and effective pattern mining for adversarial risk are necessary (but not sufficient) for long-term viability in such environments.
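One way an agent could infer adversarial status from outcome patterns is to track, per client, how much work tasks actually required relative to what was quoted. The sketch below assumes the agent can observe both quantities; the class, thresholds, and client names are hypothetical and only illustrate the kind of pattern mining involved.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


class ClientTracker:
    """Flag clients whose tasks consistently take more work than quoted."""

    def __init__(self, ratio_threshold: float = 1.5, min_tasks: int = 2) -> None:
        self.ratio_threshold = ratio_threshold  # assumed cutoff for "inflated"
        self.min_tasks = min_tasks              # evidence needed before flagging
        self.ratios: Dict[str, List[float]] = defaultdict(list)

    def record(self, client: str, quoted_hours: float, actual_hours: float) -> None:
        """Store the actual-to-quoted work ratio for a finished task."""
        self.ratios[client].append(actual_hours / quoted_hours)

    def is_suspect(self, client: str) -> bool:
        """A client is suspect once enough tasks show a high mean ratio."""
        observed = self.ratios.get(client, [])
        return len(observed) >= self.min_tasks and mean(observed) > self.ratio_threshold

    def blacklist(self) -> List[str]:
        """Sorted list of suspect clients, suitable for a scratchpad note."""
        return sorted(c for c in self.ratios if self.is_suspect(c))
```

Writing the resulting blacklist to the scratchpad, and actually consulting it before each acceptance, is what connects detection to persistent memory.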
5. Performance Outcomes and Failure Modes
Of the twelve models evaluated, only five consistently finish above the \$200K starting capital, with the top two finishing above \$1 million:
- Claude Opus 4.6: \$1.27M (0/3 bankruptcies)
- GLM-5: \$1.21M (0/3 bankruptcies)
Cost-efficiency varies: Kimi-K2.5 delivers the highest API cost-performance (2.5× the next best), whereas Opus-4.6, though best in revenue, is considerably less cost-efficient than GLM-5.
Identified failure modes include:
- Over-parallelization: agents (e.g., Sonnet, which runs many tasks concurrently on average) overcommit employee throughput, missing deadlines and incurring penalties.
- Memory neglect: agents neglect or inconsistently update scratchpad rules, leading to repeated suboptimal decisions (e.g., Grok notes “Avoid Equinox” but repeats acceptance).
- Adversarial traps: insufficient adversarial detection, leading to repeated costly mistakes.
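The over-parallelization failure mode can be made concrete with a small worked calculation. The sketch assumes that team throughput is split evenly across all active tasks and that freed capacity is redistributed when a task finishes; the even split, the function names, and the numbers in the usage note are illustrative assumptions.

```python
from typing import List


def finish_times_sequential(work: List[float], throughput: float) -> List[float]:
    """Finish time of each task when the whole team works on one task at a time."""
    elapsed, times = 0.0, []
    for units in work:
        elapsed += units / throughput
        times.append(elapsed)
    return times


def finish_times_shared(work: List[float], throughput: float) -> List[float]:
    """Finish times when throughput is split evenly across all unfinished tasks."""
    remaining = dict(enumerate(work))
    finished = {}
    elapsed = 0.0
    while remaining:
        rate = throughput / len(remaining)   # per-task share of throughput
        step = min(remaining.values()) / rate  # time until the next completion
        elapsed += step
        for idx in list(remaining):
            remaining[idx] -= rate * step
            if remaining[idx] <= 1e-9:
                finished[idx] = elapsed
                del remaining[idx]
    return [finished[i] for i in range(len(work))]


def deadlines_met(times: List[float], deadline: float) -> int:
    """Count tasks finishing on or before a common deadline."""
    return sum(1 for t in times if t <= deadline)
```

For four tasks of 100 work units each, a team throughput of 10 units per day, and a common deadline on day 25, working sequentially completes tasks on days 10, 20, 30, and 40 (two on time), whereas running all four concurrently completes every task on day 40 (none on time).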
A temporal structure is also observed: by around turn 60 (roughly March), high-performing agents cultivate “trust snowballs” with select clients, halving the effective work required per task through accumulated domain prestige and trust, whereas under-performing agents remain indiscriminate in client and domain selection.
6. Implementation, Reproducibility, and Configuration
YC-Bench is fully open-source, with code and configuration files available online: https://github.com/collinear-ai/yc-bench. Key default hyperparameters include a one-year horizon, a fixed employee roster, and initial funds of \$200K. Other defaults include two rate parameters (1% and 35%) and $K=20$.
All supported CLI commands, configuration details, and system prompts are provided in the repository appendix. The platform is configurable, permitting adjustment of environment stochasticity, employee profiles, and memory window.
7. Research Significance and Outlook
YC-Bench constitutes a rigorous testbed for agentic LLMs, foregrounding long-term, partially observable decision-making under sparse and delayed rewards. Empirical results reveal persistent deficiencies in current frontier models: strategic failures typically stem from breakdowns in memory scaffolding, inability to infer adversarial structure, and poor parallelization heuristics. Only a minority of evaluated models exhibit systematic compounding of advantages (prestige/trust) and economically consistent behavior over extended horizons.
The benchmark highlights the importance of persistent memory mechanisms (via the scratchpad) and of causal reasoning under uncertainty. These insights inform both next-generation agent architectures and the design of benchmarks for evaluation in complex, delayed-reward, partially observable settings (He et al., 1 Apr 2026).