FinSearchComp: Financial Search Benchmark
- FinSearchComp is an open-source benchmark that measures LLM-based financial agents' ability to retrieve, synthesize, and reason over financial information in response to expert-level, time-sensitive queries.
- It employs a rigorous, expert-annotated dataset of 635 questions across global and Greater China markets to simulate real-world financial analysis workflows.
- The evaluation demonstrates that integrating plugin-based real-time search significantly boosts performance, while highlighting challenges in historical data synthesis and multi-source reconciliation.
FinSearchComp is an open-source, expert-constructed benchmark for evaluating end-to-end agent performance on realistic, open-domain search and reasoning tasks encountered in professional financial analysis. It is the first benchmark to assess the full data-search capabilities of LLM-based financial agents, using a corpus of expert-authored, time-sensitive, and complex queries modeled closely on real-world workflows in global and Greater China markets. FinSearchComp comprises 635 rigorously vetted questions covering three key families of analyst tasks, annotated and validated by a panel of 70 financial professionals. The benchmark enables systematic, rubric-based evaluation of retrieval proficiency, information synthesis, and reasoning accuracy at a level that approaches expert financial decision-making.
1. Benchmark Scope and Task Design
FinSearchComp explicitly targets the central skills required for financial research agents by organizing its 635 questions into three task categories:
- Time-Sensitive Data Fetching (T1): Questions require retrieving up-to-the-moment data such as the latest equity or FX prices; answers must be fresh, precise, and source-attributed.
- Simple Historical Lookup (T2): These tasks focus on extracting specific, point-in-time financial metrics (e.g., revenue in a SEC 10-K for a given year/quarter) while enforcing accuracy around domain conventions (such as fiscal calendars, currency normalization, and restatement awareness).
- Complex Historical Investigation (T3): The most challenging category, these queries require agents to perform multi-step, cross-temporal retrieval and synthesis. Examples include identifying multi-period extrema or reconciling data normalized over asset splits, and extracting statistics that require aggregation and reasoning over multiple sources.
All questions are provided in both English and Chinese, so the benchmark evaluates both language proficiency and adaptability to divergent market conventions and regulatory environments.
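To make the taxonomy concrete, the following is a minimal sketch of what a single benchmark item might look like in code. The schema, field names, and `TaskFamily` labels are illustrative assumptions for this summary, not the benchmark's published data format.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class TaskFamily(Enum):
    # Labels assumed for illustration; the paper names the families T1-T3.
    T1_TIME_SENSITIVE = "time_sensitive_fetching"
    T2_SIMPLE_LOOKUP = "simple_historical_lookup"
    T3_COMPLEX_INVESTIGATION = "complex_historical_investigation"

@dataclass
class BenchmarkItem:
    """Hypothetical shape of one FinSearchComp question (illustrative only)."""
    item_id: str
    family: TaskFamily
    market: str                                  # "global" or "greater_china"
    question_en: str                             # English version
    question_zh: str                             # Chinese version
    reference_answer: str                        # expert-validated gold answer
    numeric_tolerance: Optional[float] = None    # rubric acceptance band, e.g. ±0.5%
    decimal_places: Optional[int] = None         # rubric rounding convention
    requires_source_attribution: bool = True     # citation check during judging
    notes: list[str] = field(default_factory=list)  # ambiguity-management notes
```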
2. Construction and Quality Assurance Pipeline
The creation of FinSearchComp follows a multi-tiered, expert-driven pipeline:
- Expert Panel Annotation: 70 professional financial experts are divided into an annotation group and a senior review panel. Together, they design, annotate, and validate all benchmark items to guarantee realism, market coverage, and technical accuracy.
- Rigorous Multi-Stage Review: The process includes annotation, cross-validation, and an additional blind review by a third-party judge. Explicit tolerance bands—such as acceptance ranges, decimal rounding, and ambiguity management—are built into the rubrics for each question.
- Dual-Market Coverage: The benchmark spans both global and Greater China markets, with each question tailored and annotated to the relevant regional standards, ensuring that agents must handle region-specific conventions.
This design ensures questions reflect the complexity and reporting heterogeneity faced by actual financial analysts and prevents overfitting to short, synthetic, or static query forms.
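Continuing the illustrative schema above, a T2 item's rubric tolerances (acceptance range, rounding convention, ambiguity notes) might be encoded as follows; every value here is invented for illustration.

```python
# Hypothetical T2 item; all field values are invented for illustration.
item = BenchmarkItem(
    item_id="t2-0042",
    family=TaskFamily.T2_SIMPLE_LOOKUP,
    market="global",
    question_en="What was Company X's FY2022 Q3 revenue per its SEC filings?",
    question_zh="根据SEC申报文件，X公司2022财年第三季度的营收是多少？",
    reference_answer="USD 4.21B",
    numeric_tolerance=0.005,   # accept answers within ±0.5% of the reference
    decimal_places=2,          # round to two decimals before comparison
    notes=["fiscal year ends in June; use restated figures where applicable"],
)
```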
3. Evaluation Methodology and Scoring
FinSearchComp employs a rubric-based, LLM-assisted judging protocol. The evaluation function is not a simple string match: each candidate answer is compared against its rubric by an assessor function, formalized as Equation 1 of the benchmark (a hedged sketch of this assessor follows the list below). This process includes:
- Explicit specification of acceptable answer bands (ranges, rounding, ± thresholds).
- Source attribution and citation checks for data-sourced answers.
- Multilingual and domain adaptation, with separate rubrics for global and Chinese-market questions.
- Results alignment with a blinded senior-panel expert assessment to mitigate bias.
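The paper's Equation 1 is not reproduced in this summary, so the following is only a plausible reconstruction, assuming a binary accept/reject judge J applied to each question q (with candidate answer a_q and rubric r_q) over the question set Q:

```latex
% Hedged reconstruction, not the paper's verbatim Equation 1.
\mathrm{Acc} \;=\; \frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}}
  \mathbb{1}\!\left[\, J(q,\, a_q,\, r_q) = \text{accept} \,\right]
```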
The LLM-based judging mechanism ensures consistent, scalable grading—including situations involving complex intermediate reasoning, synthesized answers, or questions requiring aggregation over multiple supporting documents.
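As a concrete illustration of the tolerance-band logic above, here is a minimal numeric check. The function name, parameters, and defaults are assumptions for this summary; actual grading delegates free-text and borderline cases to the LLM judge, which this sketch omits.

```python
def within_rubric_band(candidate: float,
                       reference: float,
                       rel_tolerance: float = 0.005,
                       decimal_places: int | None = None) -> bool:
    """Accept a numeric answer inside the rubric's tolerance band (illustrative)."""
    if decimal_places is not None:
        # Apply the rubric's rounding convention before comparing.
        candidate = round(candidate, decimal_places)
        reference = round(reference, decimal_places)
    # Relative acceptance range, e.g. ±0.5% around the reference value.
    return abs(candidate - reference) <= rel_tolerance * abs(reference)

# Example: a quoted FX rate of 7.2410 vs. a reference of 7.2388 passes a ±0.5% band.
assert within_rubric_band(7.2410, 7.2388)
```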
4. Agent Performance and Experimental Findings
The initial edition evaluated 21 deployed LLM-based agents (including both web-augmented products and API-only endpoints). Highlights include:
- Global Subset: Grok‑4 (web) achieved the leading overall score of 68.9%, approaching the human expert reference score of 75.0%.
- Greater China Subset: Doubao (web) was the top performer on this subset.
- Effect of Real-Time Search: Agents equipped with active web search and financial plugins substantially outperformed static models; the gain was most pronounced (up to a 40% uplift) on T1 (time-sensitive fetching) questions.
- Market Geographical Effects: U.S.-based models had clear advantages on global asset queries, while China-origin models excelled on domestic (Greater China) queries. This demonstrates the role of domain and market adaptation in retrieval effectiveness.
- Complexity Gradient: As task complexity increased (T1 → T2 → T3), all agents—including top LLMs—showed a marked decline in performance, indicating the persistent difficulty of historical synthesis and multi-source reconciliation.
5. Implications for Financial Search Agents
FinSearchComp reveals several critical insights about the current state and direction of financial research automation:
- End-to-End Search as the Bottleneck: The benchmark demonstrates that search capability—especially with plugins enabling real-time, structured financial data access—is a limiting factor for agent performance. Static LLMs (lacking retrieval access) consistently underperform on time-sensitive and complex investigation tasks.
- Realism and Workflow Reproduction: By directly sampling from authentic analyst workflows and including ambiguity, data delay, and reporting variants, FinSearchComp challenges systems on aspects that are typically underrepresented in LLM QA benchmarks.
- System Integration Effects: The integration of plugin-based retrieval with LLM reasoning enables agents to approach (but not fully match) expert-level accuracy. The benchmark quantifies the impact of system design decisions (e.g., retrieval pipeline, plugin coverage) in a rigorous and comparable way.
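To illustrate the system-integration point, the sketch below shows one common way to wire a real-time data plugin into an LLM agent loop. The plugin interface, function names, and routing are hypothetical; they do not describe FinSearchComp's harness or any evaluated product.

```python
from typing import Callable, Protocol

class FinancePlugin(Protocol):
    """Hypothetical interface for a structured, real-time financial data source."""
    def latest_quote(self, ticker: str) -> dict: ...

def answer_time_sensitive(question: str,
                          ticker: str,
                          plugin: FinancePlugin,
                          llm_complete: Callable[[str], str]) -> str:
    """Toy T1-style flow: fetch fresh data via the plugin, then have the LLM
    compose a precise, source-attributed answer. A static LLM skips the fetch
    and, as the benchmark shows, fails on freshness."""
    quote = plugin.latest_quote(ticker)   # real-time retrieval step
    prompt = (
        f"Question: {question}\n"
        f"Fresh data: {quote}\n"
        "Answer precisely; cite the data source and quote timestamp."
    )
    return llm_complete(prompt)           # reasoning/synthesis step
```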
6. Applications and Benchmark Significance
The benchmark is structured to guide the development, diagnosis, and competitive assessment of advanced financial research assistants, with implications that span industry tool evaluation, academic algorithm innovation, and regulatory oversight. Applications include:
- Automated analyst assistants able to rapidly retrieve, attribute, and synthesize market data under realistic accuracy constraints.
- Systematic assessment of reasoning and data-fetching components in competitive agent products.
- Facilitating cross-lingual and cross-regional adaptation by providing a dual-market, bilingual test set.
FinSearchComp serves as a calibration tool for both algorithmic research and practical deployment, offering a yardstick for progress on the core problems of financial data search and complex knowledge-grounded reasoning.
7. Outlook and Future Directions
The results of FinSearchComp expose persistent challenges for AI agents:
- Even top-scoring agents do not fully reach expert-level performance, particularly as task complexity deepens or when market conventions diverge.
- Performance improvements are strongly tied to advances in system-level data integration, particularly temporal plugins and retrieval engines, rather than exclusively to LLM parameter scaling.
- Future benchmark expansion is likely to include real-time scenario simulations, cross-asset analytics, and even multi-agent collaboration on research deliverables.
FinSearchComp is therefore positioned as a canonical benchmark for assessing progress toward practical, expert-grade financial AI, while highlighting the avenues for innovation required to close the gap with human professionals (Hu et al., 16 Sep 2025).