Chinese Financial QA Dataset Overview

Updated 3 January 2026
  • A Chinese Financial QA Dataset is a rigorously constructed collection of Chinese-language QA pairs designed to benchmark financial NLP models across diverse modalities, including MCQ, dialogue, and multimodal reasoning.
  • It employs advanced annotation frameworks, multi-stage expert reviews, and data cleaning protocols to ensure quality and consistency across certification exams, consultations, and meeting transcripts.
  • The dataset supports robust evaluation using metrics such as accuracy, F1, and EM, facilitating detailed analysis of model performance on complex financial language tasks.

A Chinese Financial QA Dataset is a rigorously constructed resource containing Chinese-language question–answer pairs designed to benchmark, fine-tune, or evaluate natural language processing models specific to financial tasks. These datasets span a diverse array of domains and modalities, including professional certification exams, corporate disclosures, conversational consulting, complex multimodal reasoning with tables/charts, real-world meeting transcripts, and specialized accounting dialogues. The following sections analyze the primary benchmark datasets, annotation frameworks, data stratifications, evaluation methodologies, and observed limitations shaping current research in this field.

1. Foundational Datasets: Scale, Structure, and Sources

Chinese financial QA datasets manifest in several archetypes, each serving distinct research purposes and model capabilities. Major dataset families, their core properties, and data provenance are summarized in the table below.

| Dataset Name | QA Pairs / Questions | Focus / Modality | Primary Sources |
|---|---|---|---|
| CFinBench (Nie et al., 2024) | 99,100 | MCQ, judgment | Professional mock exams (CPA, CFA, bank), internal exams |
| CFLUE (Zhu et al., 2024) | 38,636 (MCQ) + 16,522 (appl.) | MCQ + NLP tasks | Mock exams, shared tasks, financial news, transcripts |
| CFData-QA (Li et al., 2023) | 12,000 | Chinese FinQA/ConvFinQA | Translated FinQA/ConvFinQA, annual/earnings reports |
| DISC-FIN-SFT (Chen et al., 2023) | 63,000 (consulting), 246,000 (total) | Dialog-style, multi-turn | Forums, FiQA, expert generation, self-chat |
| CAtAcctQA (Luo et al., 2024) | ~70,000 | Accountant–client dialogues | Real-world accounting consultations |
| M³FinMeeting (Zhu et al., 3 Jun 2025) | 6,442 (QA) | Meeting transcript QA | 400 real meeting transcripts, 11 GICS sectors |
| CFBenchmark-MM (Li et al., 16 Jun 2025) | 9,356 QA / 2,339 charts | Multimodal (image+text) | Research reports, financial charts/images |
| VisFinEval (Liu et al., 13 Aug 2025) | 15,848 QA | Multimodal, process scenario | Financial images (K-line, tables, seals), scenario-annotated |
| FAMMA (Xue et al., 2024) | 1,758 (253 Chinese) | Multimodal, advanced QA | Textbooks, exams, finance forums, human experts |
| SNFinLLM SFT-set (Zhao et al., 2024) | 550,000 (pending release) | MCQ, open QA, MRC | News, reports, policies, textbooks, self-instruct |
| FinTruthQA (Xu et al., 2024) | 6,000 | QA with quality grading | Stock exchange interactive platforms (SSE/SZSE) |

Data Acquisition and Annotation

  • Professional exams: Datasets such as CFinBench and CFLUE extract, normalize, and de-duplicate MCQs from CPA, CFA, bank, and other certification sources, often using OCR and manual validation to ensure format and label consistency (Nie et al., 2024, Zhu et al., 2024).
  • Instruction and dialogue: DISC-FIN-SFT and CFData-QA derive QA pairs from forums, financial datasets (e.g., FinQA/ConvFinQA), and generate multi-turn consulting examples reflecting realistic investment/information-seeking scenarios (Chen et al., 2023, Li et al., 2023).
  • Real-world QA: CAtAcctQA and FinTruthQA emphasize authentic practitioner–client interactions or public investor–issuer Q&A, annotated with domain criteria and graded for answer quality, providing representative language use and pragmatic coverage (Luo et al., 2024, Xu et al., 2024).
  • Multimodal and scenario-driven: CFBenchmark-MM, FAMMA, and VisFinEval offer extensive image–question pairs, integrating table/chart comprehension, financial numeracy, and scenario-based reasoning with manually validated or GPT-4-generated rationales (Li et al., 16 Jun 2025, Xue et al., 2024, Liu et al., 13 Aug 2025).
  • Meeting understanding: The M³FinMeeting dataset processes long-form financial meeting transcripts (ASR/manual-corrected), extracting granular QA pairs stratified by sector and transcript length (Zhu et al., 3 Jun 2025).

2. Task Taxonomy, Instruction Format, and Domain Coverage

Task Classes

  • Multiple-Choice and Judgment (MCQ, True/False): Predominant in CFinBench, CFLUE, and FAMMA; standardized for consistency and cross-model comparability (see the example record after this list).
  • Open-Ended/Span Extraction QA: CFData-QA, SNFinLLM, CAtAcctQA, and M³FinMeeting focus on short-form or passage-anchored factual or computational answers.
  • Multi-Turn Dialogue/Consulting: DISC-FIN-SFT and CAtAcctQA model conversational exchanges, preserving dialogue history and user-assisted clarification chains.
  • Machine Reading Comprehension (MRC): SNFinLLM and CFLUE provide extensive MRC data (context-prompt-answer), including both single-span and multi-hop contexts.
  • Financial Computation: SNFinLLM and CFBenchmark-MM embed explicit calculation sub-tasks, with formula-encoded answers and calculator-style outputs.
  • Multimodal Reasoning: FAMMA, CFBenchmark-MM, and VisFinEval serve as benchmarks for table/chart-derived question answering and cross-modal information integration.
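
To make the normalized MCQ format concrete, the sketch below shows a hypothetical record in the style these benchmarks converge on; all field names and values are illustrative, not drawn from any dataset's released schema.

```python
# Hypothetical normalized MCQ record (illustrative fields, not a released schema).
mcq_record = {
    "id": "cpa-audit-000123",
    "category": "Auditing",          # fine-grained taxonomy label, cf. CFinBench subcategories
    "question": "下列关于审计证据充分性的说法中，正确的是（ ）。",
    "options": {"A": "…", "B": "…", "C": "…", "D": "…"},
    "answer": "B",                   # multi-choice items use strings such as "ABD"
    "explanation": "…",              # optional rationale, useful for chain-of-thought evaluation
}
```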

Financial Domain Stratification

  • Certification Knowledge vs. Operational Practice: CFinBench and CFLUE sections target practitioner exams and professional certification syllabi; other resources emphasize contemporaneous analysis, regulatory compliance, and client-facing operations.
  • Subfields and Hierarchies: Most datasets organize QA pairs by fine-grained taxonomy: e.g., 43 CFinBench subcategories (Auditing, Tax Law, Financial Management), or topic markers in CAtAcctQA aligned to CPA domains.
  • Complex reasoning scenarios: VisFinEval, FAMMA, and M³FinMeeting explicitly annotate scenario depth, financial workflow stage (front/mid/back office), or CFA Level (I–III) to enable stratified analysis of model failures.

3. Annotation Quality, Schema, and Validation Protocols

Annotation pipelines emphasize domain authenticity, language consistency, and label validity:

  • Multi-stage annotation: FinTruthQA and CAtAcctQA employ multi-pass expert checking, annotator training, and conflict adjudication, often reporting Cohen’s κ (e.g., κ = 0.84–0.89 in FinTruthQA) or Fleiss’ κ for agreement (Xu et al., 2024, Luo et al., 2024, Zhu et al., 3 Jun 2025); a computation sketch for κ follows this list.
  • Expert/peer review: All major datasets incorporate domain-expert or practitioner review (e.g., DISC-FIN-SFT, CAtAcctQA, FinTruthQA, M³FinMeeting).
  • Quality assurance via model–human feedback: FinTruthQA and SNFinLLM use model-driven disagreement highlighting (e.g., confident learning flags for low-concordance instances) as a final filter (Xu et al., 2024, Zhao et al., 2024).
  • Prompt engineering and context preservation: Multi-turn/instructional datasets define strict schema (e.g., history/context/role fields, in DISC-FIN-SFT) and scripted prompt templates to ensure logical flow, depth, and realistic conversational tone (Chen et al., 2023).
  • Data cleaning and deduplication: MinHash, SimHash, or embedding-based deduplication filters are standard in large-scale resources (Nie et al., 2024, Zhao et al., 2024); a MinHash sketch also follows this list.
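
For reference, Cohen's κ corrects raw annotator agreement for agreement expected by chance. A minimal sketch of the computation over categorical labels, with toy data standing in for real annotations:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labeled independently with the same marginals.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum((counts_a[c] / n) * (counts_b[c] / n)
                     for c in counts_a.keys() | counts_b.keys())
    return (p_observed - p_expected) / (1 - p_expected)

# Toy example: two annotators grading answer quality on a four-grade scale.
print(round(cohens_kappa([3, 2, 3, 1, 0, 3, 2, 2], [3, 2, 3, 1, 1, 3, 2, 3]), 3))  # ≈ 0.636
```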
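
The MinHash deduplication mentioned above can likewise be sketched briefly: each question is reduced to a signature of per-permutation minimum hashes over its character n-grams, and near-duplicates show high signature overlap. This is a pure-Python illustration, not any dataset's actual pipeline:

```python
import hashlib

NUM_PERM = 64  # real pipelines typically use 128+ permutations

def char_ngrams(text, n=3):
    # Character n-grams suit Chinese text, which has no whitespace tokens.
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash_signature(text):
    return [min(int(hashlib.md5(f"{seed}:{g}".encode()).hexdigest(), 16)
                for g in char_ngrams(text))
            for seed in range(NUM_PERM)]

def estimated_jaccard(sig_a, sig_b):
    return sum(x == y for x, y in zip(sig_a, sig_b)) / NUM_PERM

q1 = "下列关于增值税纳税义务发生时间的说法，正确的是？"
q2 = "下列关于增值税纳税义务发生时间的表述，正确的是？"
print(estimated_jaccard(minhash_signature(q1), minhash_signature(q2)))  # high for near-duplicates
```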

4. Evaluation Metrics and Baseline Performance

Evaluation is standardized around accuracy, F1, and scenario-appropriate metrics, with modality-dependent adjustments.

| Dataset | QA Evaluation Metric(s) | Top Model (QA Accuracy/F1) | Benchmark Features |
|---|---|---|---|
| CFinBench | Accuracy (MC, MulC, judgment); aggregate score | Yi1.5-34B: 60.16% (3-shot); GPT-4: 54% | Large-scale, professional stratum |
| CFLUE | Accuracy, F1, ROUGE-L (reasoning) | Qwen-72B: 72.8%; GPT-4: 60.9% | Both knowledge MCQ & NLP tasks |
| CFData-QA | EM, F1 (SQuAD-style) | Not specified; standard practice | Free-form Chinese answers, no MCQ |
| DISC-FIN-SFT | Accuracy (benchmark QA), expert scoring | 51.6% (LoRA expert); GPT-4: 68.6% | FIN-Eval, FINCUGE QA, retrieval-enhanced |
| M³FinMeeting | EM, token-F1 (SQuAD-style spans) | Qwen2.5-72B: F1 92.4%; GPT-4o: 91.7% | Long-meeting QA extraction |
| FinTruthQA | Accuracy, micro/macro-F1, QWK | FinBERT (QWK ≈ 0.68, answer relevance) | Four-grade answer quality evaluation |
| CFBenchmark-MM | Accuracy (MCQ) / point fraction (explanation) | GPT-4V: Q+I avg. 46.7% | Multimodal, staged evaluation |
| FAMMA (Chinese) | Accuracy (GPT-4o scoring, 0/1 per question) | GPT-4o: 37.7% (Chinese test set) | Multilingual, multimodal hard QA |

Evaluation protocols universally separate held-out test data; some benchmarks provide few-shot examples or step-by-step annotated rationales for chain-of-thought prompting (Nie et al., 2024, Zhu et al., 2024). For extraction and reading comprehension, token-level F1 and EM metrics are normative (Zhu et al., 3 Jun 2025, Li et al., 2023).
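
A minimal sketch of these SQuAD-style metrics, adapted to Chinese by comparing at the character level (exact normalization rules differ across benchmarks):

```python
from collections import Counter

def normalize(text):
    # Illustrative normalization: drop whitespace and common punctuation
    # before character-level comparison; real benchmarks define their own rules.
    return "".join(ch for ch in text if ch.strip() and ch not in "，。、；：？！,?!")

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    pred_chars, gold_chars = list(normalize(pred)), list(normalize(gold))
    overlap = sum((Counter(pred_chars) & Counter(gold_chars)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_chars), overlap / len(gold_chars)
    return 2 * precision * recall / (precision + recall)

print(exact_match("净利润为3.2亿元", "净利润为 3.2 亿元"))   # 1.0 after normalization
print(round(token_f1("约3.2亿元", "净利润为3.2亿元"), 3))     # partial overlap ≈ 0.667
```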

5. Advanced Topics: Multimodal, Meeting, and Conversational QA

Multimodal QA

Recent benchmarks integrate visual data—financial charts, tables, diagrams—requiring models to perform non-textual reasoning:

  • CFBenchmark-MM (Li et al., 16 Jun 2025): >9k image–QA pairs across five question types (arithmetic/statistical/structural reasoning, explanation, knowledge), with systematic staged evaluation (Q only, Q+I, Q+C, Q+I+C); a prompt-construction sketch follows this list.
  • VisFinEval (Liu et al., 13 Aug 2025): 8 image modalities, 3 scenario depths, covering the full financial front-, mid-, and back-office lifecycle, including process reasoning and business logic.
  • FAMMA (Xue et al., 2024): 253 Chinese multimodal Qs, fine-grained by subfield and difficulty, with LaTeX-rich rationales.
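
The staged evaluation described for CFBenchmark-MM can be framed as building four prompt variants per item (question only; plus image; plus textual context; plus both) and scoring each independently. The sketch below assumes a generic `ask(prompt, image)` model interface and pre-formatted option strings; both are our illustration, not the benchmark's released harness:

```python
# Hypothetical staged-evaluation loop over Q / Q+I / Q+C / Q+I+C settings.
STAGES = {
    "Q":     {"image": False, "context": False},
    "Q+I":   {"image": True,  "context": False},
    "Q+C":   {"image": False, "context": True},
    "Q+I+C": {"image": True,  "context": True},
}

def evaluate_staged(items, ask):
    """items: dicts with question, options (formatted string), context, image, answer."""
    scores = {stage: 0 for stage in STAGES}
    for item in items:
        for stage, use in STAGES.items():
            prompt = item["question"] + "\n" + item["options"]
            if use["context"]:
                prompt = item["context"] + "\n" + prompt
            pred = ask(prompt, item["image"] if use["image"] else None)
            scores[stage] += pred.strip() == item["answer"]
    return {stage: hits / len(items) for stage, hits in scores.items()}
```

Comparing accuracy across the four stages isolates how much the image and the textual context each contribute to a model's answers.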

Financial Meeting QA

The M³FinMeeting corpus (Zhu et al., 3 Jun 2025) sets a precedent for long-context, realistic meeting QA:

  • 400 meetings (Chinese), ~6,442 question–answer spans, stratified by GICS sector and length, ASR + manual correction.
  • Annotation emphasizes speaker turn structure, domain-specific disambiguation, and exact answer span extraction.
  • SOTA models (Qwen2.5-72B-Instruct, GPT-4o) reach >90% token-F1, demonstrating substantial proficiency in parsing long financial discourse.

Conversational and Dialogue QA

DISC-FIN-SFT (Chen et al., 2023) and CAtAcctQA (Luo et al., 2024) provide context-aware, multi-turn frameworks designed for robust simulation of practical financial and accounting consultations, including intricate follow-up chains and formula-based answers. SNFinLLM’s SFT dataset (Zhao et al., 2024) extends these paradigms to instruction–input–output triplets with explicit financial computation logic.
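
As an illustration of the instruction–input–output schema with preserved history (field names are hypothetical, following the convention the papers describe rather than any released file format):

```python
# Hypothetical SFT record in the spirit of DISC-FIN-SFT / SNFinLLM triplets.
sft_record = {
    "instruction": "你是一名财务顾问，请结合对话历史回答用户的问题。",  # role/system-style prompt
    "history": [
        {"role": "user", "content": "公司的流动比率怎么计算？"},
        {"role": "assistant", "content": "流动比率 = 流动资产 ÷ 流动负债。"},
    ],
    "input": "如果流动资产是500万元，流动负债是250万元呢？",
    "output": "流动比率 = 500 ÷ 250 = 2.0，表明短期偿债能力较好。",
}
```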

6. Limitations, Common Failure Modes, and Ongoing Challenges

Despite expanding coverage and sophistication, Chinese Financial QA datasets face persistent challenges:

  • Model performance ceilings: Even top models rarely exceed 60% aggregate accuracy on complex multi-choice or law-centric questions (CFinBench, CFLUE, FAMMA).
  • Cross-modal and chain-reasoning failures: Frequent misinterpretation of figures, confusion between financial indicators, and business-process lapses observed in VisFinEval and CFBenchmark-MM (Liu et al., 13 Aug 2025, Li et al., 16 Jun 2025).
  • Conversation and long-context reasoning: While F1 scores are high for passage extraction, exact phrasing reproduction lags (EM ≈80%), and multi-hop/hypothetical reasoning remains under-developed (Zhu et al., 3 Jun 2025, Chen et al., 2023).
  • Data representation imbalances: Scenario- or sector-dominant splits, relatively sparse coverage of highly informal or dynamic financial information (e.g., real-time or social media discussions) (Zhao et al., 2024).
  • Annotation and label ambiguity: Despite rigor, explicit inter-annotator agreement is inconsistently reported; some complex QA pairs risk answer ambiguity or label noise.

7. Access, Licensing, and Recommendations for Users

Nearly all major Chinese Financial QA resources adhere to permissive open-source licenses (typically Apache 2.0), with comprehensive train–validation–test splits and code repositories supporting full reproducibility (Nie et al., 2024, Zhu et al., 2024, Xu et al., 2024, Chen et al., 2023). Best practices recommended by dataset maintainers include:

  • Stratified evaluation across domain categories and scenario types to support robust benchmarking of LLMs.
  • Combining domain-adaptive fine-tuning with retrieval-augmented pipelines for reading comprehension and multimodal QA.
  • Leveraging chain-of-thought prompts and stepwise rationales for improved model generalization in complex reasoning and computation sub-tasks (a template sketch follows this list).
  • Cautious extrapolation from benchmark scores to real-world application, due to documented error modes and domain coverage limitations.
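
For the chain-of-thought recommendation above, a minimal prompt-template sketch for a financial MCQ; the wording and few-shot protocol are illustrative, not taken from any benchmark's release:

```python
# Illustrative chain-of-thought prompt template for a financial MCQ.
COT_TEMPLATE = """你是一名金融领域专家，请逐步推理后作答。

问题：{question}
选项：
{options}

请先给出推理过程，再以"答案：X"的格式给出最终选项。"""

def build_prompt(question, options):
    formatted = "\n".join(f"{key}. {text}" for key, text in options.items())
    return COT_TEMPLATE.format(question=question, options=formatted)
```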

In sum, the Chinese Financial QA dataset landscape encompasses a spectrum from traditional exam MCQs to multimodal reasoning, elaborate multi-turn advisory exchanges, and multi-sector, long-context meeting comprehension. These resources collectively catalyze the development, fine-tuning, and evaluation of financial-LLMs, while concurrently revealing the substantial challenges that remain in the accurate, context-rich automation of financial knowledge work.
