VCBench: VC Success Prediction Benchmark
- VCBench is a benchmark for predicting founder success in the venture capital industry, offering a standardized and privacy-preserving dataset that addresses sparse and noisy data challenges.
- It applies rigorous anonymization protocols and a stratified six-fold evaluation scored with F₀.₅ to emphasize precision in forecasting early-stage startup success.
- Key findings include LLMs outperforming human expert baselines and the potential for algorithmic decision support in venture capital investing.
VCBench is a benchmark designed for predicting founder success in the venture capital (VC) industry, providing a standardized, privacy-preserving resource to evaluate artificial intelligence models and human forecasters on early-stage startup investing. It addresses unique challenges inherent to VC—such as sparse and noisy signals, low precision rates for success in real-world datasets, and significant privacy risks associated with profile data. VCBench is publicly available at vcbench.com and is intended as a community-driven platform to support reproducible AGI evaluation in entrepreneurial forecasting (Chen et al., 17 Sep 2025).
1. Development Motivations and Benchmark Rationale
VCBench was created to close a gap in VC research and AGI benchmarking: prior benchmarks (e.g., SWE-bench, ARC-AGI) target perception and reasoning tasks but do not model decision-making under uncertainty with sparse information, as is typical in early-stage investing. Real-world success rates in VC are notably low: the market index achieves a precision of only 1.9% for predicting major exits (acquisition/IPO) or fundraising milestones. Even elite investors (Y Combinator and tier-1 VC firms) achieve only moderate improvements over the index (1.7× and 2.9×, respectively). VCBench thus enables reproducible, comparative evaluation of LLMs against human baselines within the domain's constraints.
2. Dataset Composition and Anonymization Protocols
The benchmark comprises 9,000 anonymized founder profiles, each linked to a startup. A founder is labeled “successful” if the startup achieved a major exit (acquisition or IPO) valued at over \$500M or raised more than \$500M in cumulative funding. Unsuccessful companies are those that raised between \$100K and \$4M but reached neither milestone within eight years.
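As a minimal sketch of this labeling rule (the field names below are hypothetical, not the actual VCBench schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StartupOutcome:
    # Hypothetical fields for illustration; the actual VCBench schema may differ.
    exit_value_usd: Optional[float]  # acquisition/IPO valuation, if any
    total_raised_usd: float          # cumulative fundraising
    years_observed: float            # years since founding in the data window

def label_founder(o: StartupOutcome) -> Optional[bool]:
    """Apply the stated success criteria: an exit valued over $500M or
    cumulative fundraising over $500M marks success; raising $100K-$4M
    with neither milestone within eight years marks failure."""
    if (o.exit_value_usd or 0) > 500e6 or o.total_raised_usd > 500e6:
        return True
    if 100e3 <= o.total_raised_usd <= 4e6 and o.years_observed >= 8:
        return False
    return None  # outside the benchmark's labeling criteria
```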
Anonymization proceeds in two stages:
- Entry-level: All direct identifiers (names, companies, dates, locations) are removed from structured data and nested JSON fields (education, job history).
- Dataset-level: Numeric features (e.g., IPO/acquisition counts) are grouped into coarse bins, and high-cardinality attributes such as industry are clustered using embedding models and agglomerative hierarchical clustering, followed by manual review.
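A minimal sketch of the dataset-level clustering step, assuming precomputed text embeddings (the random vectors and distance threshold below are illustrative stand-ins, not the benchmark's actual pipeline):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Stand-in embeddings for high-cardinality industry labels; the real
# pipeline would obtain these from an embedding model.
industries = ["fintech", "payments", "biotech", "genomics", "logistics"]
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(industries), 384))

# Agglomerative hierarchical clustering merges similar labels; the
# distance threshold controls granularity and would be tuned, with the
# resulting clusters reviewed manually.
clusterer = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=1.0,
    linkage="average",
    metric="cosine",
)
cluster_ids = clusterer.fit_predict(embeddings)
for industry, cid in zip(industries, cluster_ids):
    print(f"{industry} -> cluster {cid}")
```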
These steps substantially reduce the risk of identity leakage (adversarial re-identification tests show a reduction of more than 90%) while preserving the relevant predictive signal.
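As a hypothetical illustration (not the paper's exact protocol), a re-identification test can be framed as a nearest-neighbor matching attack: embed original and anonymized profiles, then measure how often an anonymized profile's closest original profile is its true source.

```python
import numpy as np

def reidentification_rate(orig: np.ndarray, anon: np.ndarray) -> float:
    """Fraction of anonymized profiles whose nearest original profile
    (by cosine similarity) is their true source. Rows are paired:
    anon[i] was derived from orig[i]."""
    orig_n = orig / np.linalg.norm(orig, axis=1, keepdims=True)
    anon_n = anon / np.linalg.norm(anon, axis=1, keepdims=True)
    sims = anon_n @ orig_n.T  # (n, n) cosine similarity matrix
    hits = sims.argmax(axis=1) == np.arange(len(anon))
    return float(hits.mean())

# A >90% risk reduction corresponds to this rate falling by more than
# 90% after anonymization, relative to the pre-anonymization baseline.
```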
3. Benchmarking Procedure and Evaluation Metrics
Profiles are divided into six equal, stratified folds that preserve the overall success rate, and each profile is rendered in multiple formats (from raw JSON to prose) to verify that predictive signal survives anonymization. The core predictive task is binary classification of founder success.
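A sketch of the stratified fold construction (the 1.9% rate below is illustrative, borrowed from the market-index figure; the dataset's actual base rate may differ):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Illustrative labels mimicking a heavily imbalanced success rate.
rng = np.random.default_rng(42)
y = (rng.random(9000) < 0.019).astype(int)

# Stratification preserves the success rate in every fold.
skf = StratifiedKFold(n_splits=6, shuffle=True, random_state=0)
for fold, (_, test_idx) in enumerate(skf.split(np.zeros(len(y)), y)):
    print(f"fold {fold}: n={len(test_idx)}, success rate={y[test_idx].mean():.3%}")
```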
VCBench uses the F₀.₅ score as its chief metric, weighting precision twice as heavily as recall:

$$F_{0.5} = \frac{(1 + 0.5^2)\cdot\text{Precision}\cdot\text{Recall}}{0.5^2\cdot\text{Precision} + \text{Recall}} = \frac{1.25\cdot\text{Precision}\cdot\text{Recall}}{0.25\cdot\text{Precision} + \text{Recall}}$$
This weighting reflects real-world VC priorities: given capital constraints, false positives (funded failures) are costlier than false negatives (missed opportunities). Precision and recall are computed per model and averaged across folds, supporting public leaderboard comparisons.
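A minimal sketch of per-fold scoring using scikit-learn's fbeta_score (toy predictions only):

```python
import numpy as np
from sklearn.metrics import fbeta_score, precision_score, recall_score

def fold_averaged_scores(fold_preds):
    """fold_preds: list of (y_true, y_pred) pairs, one per fold.
    Returns mean precision, recall, and F0.5 across folds."""
    p = [precision_score(t, y, zero_division=0) for t, y in fold_preds]
    r = [recall_score(t, y, zero_division=0) for t, y in fold_preds]
    f = [fbeta_score(t, y, beta=0.5, zero_division=0) for t, y in fold_preds]
    return float(np.mean(p)), float(np.mean(r)), float(np.mean(f))

# Toy example with two folds:
folds = [
    (np.array([1, 0, 0, 1]), np.array([1, 0, 0, 0])),
    (np.array([0, 1, 0, 0]), np.array([0, 1, 1, 0])),
]
print(fold_averaged_scores(folds))
```

Note that averaging per-fold F₀.₅ scores generally differs from plugging fold-averaged precision and recall into the formula, since F₀.₅ is nonlinear in both.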
4. LLM Performance and Comparative Results
VCBench evaluates nine LLMs, including DeepSeek-V3, GPT-4o, DeepSeek-R1, Gemini-2.5-Pro, GPT-5, and Gemini-2.5-Flash.
- DeepSeek-V3: Achieves the highest precision (59.1%), over 6× the baseline, at the cost of low recall (10.2%).
- GPT-4o: Demonstrates the highest F₀.₅ score (25.1%) with precision 29.1% and recall 16.2%.
- Gemini-2.5-Flash: Yields high recall (69.1%) but low precision (16.1%).
Most LLMs surpass the human baselines (Y Combinator and tier-1 VC firms) on anonymized data; adversarial testing confirms that these gains stem from extraction of genuine predictive signal rather than identity leakage.
| Model | Precision (%) | Recall (%) | F₀.₅ (%) |
|---|---|---|---|
| DeepSeek-V3 | 59.1 | 10.2 | 20.3 |
| GPT-4o | 29.1 | 16.2 | 25.1 |
| Gemini-2.5-Flash | 16.1 | 69.1 | 21.7 |
5. Human Expert Baselines and Interpretability
Human-expert baselines include the market index (1.9% precision), Y Combinator (1.7× improvement over the index), and tier-1 VC firms (2.9×). VCBench's anonymized format ensures that LLM improvements do not result from overfitting to known identities. The performance gap between models and human baselines indicates considerable progress in data-driven founder success prediction and an opportunity to deploy LLMs as decision-support tools.
6. Community, Reproducibility, and Future Development
VCBench is structured as a living resource:
- Leaderboard: Half the dataset is public for open evaluation; the other half is reserved for private validation to prevent benchmark contamination in future model pre-training.
- Iterative Updates: The dataset and anonymization protocol will be refined in response to user feedback and advancements in privacy-preserving techniques.
- Richer Simulation Modes: Planned extensions include tournament-style and sequential investment simulations, as well as direct human–AI competitions, aiming to more closely model realistic VC workflows.
- Feature Engineering: Work is underway to define scalable proxies for company prestige, improve cluster assignments for job and education fields, and introduce trajectory-level features relevant to success prediction.
7. Implications for AGI and Venture Capital Research
VCBench sets a precedent for the rigorous, privacy-conscious evaluation of AGI systems in complex, real-world decision-making settings. By providing an anonymized, standardized dataset and robust evaluation metrics that reflect domain-specific priorities (precision-dominant), it enables systematic benchmarking and advances the conversation on the role of LLMs in venture capital. This resource is positioned to facilitate further research into algorithmic investment decision support, reproducible AI evaluation, and the intersection of privacy, prediction, and entrepreneurial success.
A plausible implication is that VCBench may catalyze new methodologies in portfolio construction, risk management, and founder evaluation by providing transparent, cross-model comparisons and supporting the evolution of both human and AI-based VC strategies.