
CFGPT: Finance NLP & Visual Reasoning

Updated 3 December 2025
  • CFGPT is a framework that integrates Chinese financial NLP and video-language counterfactual reasoning using a two-stage training process.
  • It employs domain-structured data and adapter-based tuning techniques (QLoRA/LoRA) along with reinforcement learning to boost performance on benchmark tasks.
  • CFGPT enables automated financial analysis and robust multi-modal reasoning, offering actionable insights for corporate disclosures and counterfactual video evaluations.

CFGPT, or Chinese Financial Generative Pre-trained Transformer, refers to two distinct frameworks in the machine learning literature: one designed for Chinese financial NLP and another for visual counterfactual reasoning in video-LLMs. Each leverages LLMs or vision-language models (VLMs), domain-structured data, and novel training strategies to address challenges in its respective domain, achieving state-of-the-art results on open benchmarks (Li et al., 2023; Chen et al., 25 Nov 2025).

1. CFGPT for Chinese Financial NLP

CFGPT, as developed for Chinese financial NLP, is a comprehensive framework comprising a curated financial corpus (CFData), a domain-pretrained and instruction-tuned LLM (CFLLM), and a modular deployment stack (CFAPP). The goal is to facilitate robust financial text understanding, analysis, and application in production settings (Li et al., 2023).

1.1 CFData: Financial Corpus Construction

CFData is partitioned into two constituent datasets: a pre-training corpus (583,978K documents, 140,609M tokens, 573.2 GB) and a supervised fine-tuning dataset (1.57M instruction pairs, 1,512M tokens). The pre-training corpus integrates six sources, covering both the breadth and depth of the Chinese financial domain (see Table 1).

Table 1: CFData Pre-training Subsets

| Subset | Docs (K) | Tokens (B) | % Total | Source |
| --- | --- | --- | --- | --- |
| CFData-CP | 39.1 | 13.4 | 6.24 | Corporate Prospectus |
| CFData-CA | 6,190 | 17.3 | 12.28 | Corporate Announcements |
| CFData-RR | 392 | 3.53 | 2.51 | Research Reports |
| CFData-FN | 82,400 | 26.3 | 18.70 | Financial News |
| CFData-SM | 494,700 | 84.6 | 60.15 | Social Media Posts |
| CFData-Wiki | 255 | 0.137 | 0.09 | Wikipedia Dump |

The supervised fine-tuning dataset is task-aligned with six instructional tasks: Sentiment Analysis (SA), Event Detection (ED), Topic Decomposition (TD), Report Summarization (RS), Question Answering (QA), and Stock Movement Prediction (SP). Each task is constructed with labeled or GPT-4-generated instances, e.g., 120K pairs for SA and 490K pairs across 98 event classes for ED.
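
To make the instruction-pair format concrete, the snippet below shows one hypothetical Sentiment Analysis instance. The field names and exact schema are assumptions for illustration; the raw CFData-SFT record format is not reproduced in this article.

```python
# Hypothetical example of a single CFData-SFT instruction pair for the
# Sentiment Analysis (SA) task. Field names and schema are assumptions;
# the paper does not publish the raw record format.
example = {
    "task": "SA",  # one of: SA, ED, TD, RS, QA, SP
    "instruction": "判断下列财经新闻的情感倾向（正面/中性/负面）。",
    # "Classify the sentiment of the following financial news
    #  (positive / neutral / negative)."
    "input": "某公司三季度净利润同比增长35%，超出市场预期。",
    # "The company's Q3 net profit rose 35% year-on-year, beating
    #  market expectations."
    "output": "正面",  # "positive"
}
```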

1.2 Model Architecture and Training

The core financial LLM, CFLLM, is instantiated from InternLM-chat-7B (7B parameters) using relative positional encodings and FlashAttention for efficient handling of long contexts. Training follows a two-stage process:

  • Continued pre-training: Standard left-to-right causal language modeling on the pre-training corpus; sequence slicing with length 1024 and stride 512 (see the sketch after this list); optimized with AdamW and linear warmup/cosine decay.
  • Supervised fine-tuning: Instruction tuning on the six financial tasks plus Moss-03-sft for generalization, using QLoRA adapters of rank 64. Batch sizes are dynamically adjusted, and the maximum input length is extended to 2048 tokens.
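
A minimal sketch of the stride-based slicing described above, assuming token IDs have already been produced by the tokenizer (the actual CFGPT preprocessing code is not public):

```python
def slice_sequences(token_ids, length=1024, stride=512):
    """Yield overlapping training windows: length-1024 slices every 512 tokens."""
    # Documents shorter than `length` yield a single (shorter) window.
    for start in range(0, max(len(token_ids) - length + 1, 1), stride):
        yield token_ids[start:start + length]

# A 2048-token document produces windows starting at 0, 512, and 1024,
# so consecutive windows overlap by half their length.
windows = list(slice_sequences(list(range(2048))))
assert len(windows) == 3 and all(len(w) == 1024 for w in windows)
```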

1.3 Deployment and Modular Application

CFAPP delivers an LLM-centric application stack supporting input parsing (text/audio/PDF), dynamic task classification, and full workflow logging. Functional modules comprise:

  • Content summary (with support for templates/mind maps through Graphviz)
  • Causal reasoning (ReAct-like, with search/response/pass actions)
  • Price prediction (CFData-SP-specific model calls)
  • Risk management (domain-specific models, e.g., exposure analysis)
  • Auxiliary tool/database integration for complex, multi-invocation workflows
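
As an illustration of dynamic task classification in an LLM-centric stack like CFAPP, the sketch below routes a parsed query to one of the functional modules. The module names and the keyword-based classifier are hypothetical stand-ins for the LLM-based router; CFAPP's actual interfaces are not described at code level.

```python
from typing import Callable, Dict

# Hypothetical module registry mirroring the CFAPP functional modules above.
MODULES: Dict[str, Callable[[str], str]] = {
    "content_summary": lambda q: f"[summary + mind map] {q}",
    "causal_reasoning": lambda q: f"[ReAct-style search/response/pass] {q}",
    "price_prediction": lambda q: f"[CFData-SP model call] {q}",
    "risk_management": lambda q: f"[exposure analysis] {q}",
}

def classify_task(query: str) -> str:
    """Stand-in for the LLM task classifier: keyword routing for illustration."""
    if any(k in query for k in ("price", "stock", "movement")):
        return "price_prediction"
    if any(k in query for k in ("risk", "exposure")):
        return "risk_management"
    if any(k in query for k in ("why", "cause")):
        return "causal_reasoning"
    return "content_summary"

def handle(query: str) -> str:
    route = classify_task(query)
    response = MODULES[route](query)
    print(f"log: route={route!r} query={query!r}")  # full workflow logging
    return response
```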

1.4 Empirical Performance

CFGPT (CFLLM-ins-7B) demonstrates state-of-the-art results across public Chinese financial NLP benchmarks: FLARE (a FLUE extension), BBT-CFLEB, and FinEval. Reported metrics include perplexity, accuracy, F1, and ROUGE across the six fine-tuned tasks. The instruction-tuned CFLLM-ins-7B consistently outperforms FinGPT, PIXIU, and baseline LLMs in zero- and few-shot settings, with notable improvements in sentiment analysis accuracy, event detection F1, report summarization ROUGE-L, and stock movement classification accuracy compared to prior work.

1.5 Use Cases, Limitations, and Outlook

CFGPT underpins automated corporate disclosure analysis, quantitative investment support, analyst report drafting, interactive assistance for both retail and institutional users, and risk monitoring within compliance workflows. Main limitations include hallucination risk on out-of-distribution events, the lack of end-to-end real-time data integration, long-context input constraints (fine-tuning capped at 2K tokens), and the multi-hop reasoning limits inherent to a 7B-parameter model. Future research priorities include retrieval-augmented training, improved numeric reasoning, PDF/multimodal input handling, and continuous adaptation to live data (Li et al., 2023).

2. CFGPT for Vision-Language Counterfactual Reasoning

CFGPT also refers to a domain-agnostic post-training framework for improving counterfactual reasoning in VLMs, introduced alongside the CounterVQA benchmark. The method targets video understanding tasks that require inference over hypothetical interventions and long causal chains (Chen et al., 25 Nov 2025).

2.1 Motivation and Scope

Most VLMs can describe observed events within video frames, but fail to systematically answer counterfactual (“what-if”) queries, especially those involving long or non-existent-event causal chains. On the CounterVQA dataset, CFGPT is proposed as a model-agnostic, lightweight downstream recipe for closing this reasoning gap.

2.2 Formal Methodology and Architecture

CFGPT operates as a two-stage post-training framework on an arbitrary base VLM $f_\theta$:

  • Stage I (Cross-Modal Causal Transfer): Supervised fine-tuning on a dataset $D_{SFT} = \{(v_i, q_i, c_i, a_i)\}$, where the $c_i$ are chain-of-thought rationales generated by text-based teachers, and each input $(v_i, q_i)$ is mapped to both $c_i$ and the ground-truth answer $a_i$.
  • Stage II (Visual-Causal Alignment): Reinforcement learning with a reward $R(o \mid v, q, G) = \alpha R_{causal}(o, G) + \beta R_{visual}(o, v)$, tying the logical consistency of chain-of-thought-and-answer outputs $o$ to causal graphs $G$ (obtained via multimodal parsing) and to visual evidence in the video.
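
A minimal sketch of the Stage II reward blend, assuming placeholder scorers for causal-graph consistency and visual grounding (the paper's actual scoring functions are not published):

```python
def combined_reward(output, video, causal_graph, r_causal, r_visual,
                    alpha=0.5, beta=0.5):
    """R(o | v, q, G) = alpha * R_causal(o, G) + beta * R_visual(o, v).

    `r_causal` scores logical consistency against the causal graph G;
    `r_visual` scores grounding in the video evidence. The defaults
    alpha = beta = 0.5 follow the paper's stated settings.
    """
    return alpha * r_causal(output, causal_graph) + beta * r_visual(output, video)
```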

The architecture features LoRA-style cross-modal adapters, dual generation heads for chain-of-thought and answers, and GRPO (Group Relative Policy Optimization) for the reinforcement steps.

2.3 Training Dynamics and Mathematical Objectives

  • Data curation applies multi-agent systems to extract causal graphs from videos and LLMs to generate/verify counterfactual question-answer pairs, filtering by graph complexity (e.g., causal depth ≥ 3, CNDA ≥ 0.12).
  • Supervised fine-tuning loss:

$$L_{SFT}(\theta) = -\sum_{i=1}^{N} \left[\, \log p_\theta(c_i \mid v_i, q_i) + \log p_\theta(a_i \mid v_i, q_i, c_i) \,\right]$$

  • Reinforcement objective:

$$L_{RL}(\theta) = -\sum_{i,k} A_{i,k} \, \log \pi_\theta(o_{i,k} \mid v_i, q_i)$$

with the group-relative advantage $A_{i,k} = R_{i,k} - \bar{R}_i$ and $R$ the joint causal-graph and visual-alignment reward; a combined sketch of both objectives follows this list.

  • The causal-graph and visual reward weights $\alpha$ and $\beta$ (default 0.5 each) balance the model's reliance on logical structure and video evidence.
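
A combined sketch of both objectives, where `logprob` is a hypothetical helper returning the model's summed token log-probability of `target` under the given conditioning (sampling and optimizer plumbing are omitted):

```python
from statistics import mean

def sft_loss(logprob, batch):
    """L_SFT = -sum_i [log p(c_i | v_i, q_i) + log p(a_i | v_i, q_i, c_i)]."""
    return -sum(
        logprob(target=c, given=(v, q)) + logprob(target=a, given=(v, q, c))
        for v, q, c, a in batch
    )

def rl_loss(logprob, samples, rewards):
    """L_RL = -sum_{i,k} A_{i,k} * log pi(o_{i,k} | v_i, q_i),
    with group-relative advantage A_{i,k} = R_{i,k} - mean_k R_{i,k}."""
    loss = 0.0
    for (v, q, outputs), group_rewards in zip(samples, rewards):
        baseline = mean(group_rewards)  # \bar{R}_i over the K sampled outputs
        for o, r in zip(outputs, group_rewards):
            loss -= (r - baseline) * logprob(target=o, given=(v, q))
    return loss
```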

Implementation details include LoRA adapters (rank 16), learning rates of 5e-5 (Stage I) and 1e-5 (Stage II), 512-frame video sampling at 144×144 resolution, and GRPO sampling of $K = 4$ outputs per input.
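
A hedged sketch of the adapter setup using Hugging Face `peft` with the stated rank of 16. The backbone below is a small public stand-in so the sketch runs end-to-end, and the target modules and other hyperparameters are assumptions; the paper does not publish its adapter configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Stand-in backbone so the sketch is runnable; swap in the actual
# Qwen-3-VL-8B checkpoint for the real setup.
base = AutoModelForCausalLM.from_pretrained("gpt2")

config = LoraConfig(
    r=16,                       # adapter rank, as stated in the paper
    lora_alpha=32,              # assumed scaling factor (not stated)
    target_modules=["c_attn"],  # GPT-2's fused attention projection; a VLM
                                # would target its q/k/v projections instead
    lora_dropout=0.05,          # assumed (not stated)
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only adapter weights are trainable
```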

2.4 Evaluation and Ablation Results

On the Qwen-3-VL-8B backbone, CFGPT yields substantial absolute gains (average accuracy 60.1% → 72.6%) across all CounterVQA difficulty levels:

  • Level 1 (adjacent intervention): 54.6% → 70.1%
  • Level 2 (long-chain): 62.1% → 71.6%
  • Level 3 (non-existent-event): 65.3% → 76.0%

Ablations confirm that both stages are necessary: disabling either the causal-graph reward or the supervised video distillation degrades performance by 6–7 points. Error analysis reveals that conventional models rely heavily on surface priors, hallucinating ungrounded events (11.7% of errors) and disregarding video-anchored facts (6.3%), whereas CFGPT regularizes against these failure modes via explicit grounding.

2.5 Implementation Practices and Reproducibility

  • Robustness of the causal graphs (≥90% precision/recall via human audit) and LLM-based verification of counterfactual question grounding are critical preprocessing steps.
  • Fine-grained, non-binary reward signals during RL are empirically superior to binary correctness.
  • Monitoring chain-of-thought coherence and causal alignment is indispensable to prevent policy collapse during training (Chen et al., 25 Nov 2025).

3. Comparative Perspective and Significance

Both applications of CFGPT share the following attributes:

  • Two-stage training dynamics: task-aligned supervised fine-tuning followed by a specialized alignment (domain or cross-modal reinforcement).
  • Adapter-based parameter-efficient tuning (QLoRA or LoRA).
  • Emphasis on domain or task-specific data curation and filtering.
  • Modular, extensible architecture (LLM-centric deployment or post-training VLM add-on).

A key distinction is the target modality: financial text for Chinese NLP, and video-based counterfactual reasoning for VLMs. The term “CFGPT” thus references a general methodology emphasizing curriculum-aligned distillation and robust alignment via reward signals, in either financial or multimodal settings.

4. Future Directions and Open Challenges

For Chinese financial NLP, future research agendas include retrieval-augmented learning, numeric reasoning enhancements, multimodal integration (notably PDF table understanding), and in-the-loop continuous adaptation to real-time market data. In visual reasoning, expansions toward more complex video-event graphs, dense temporal reasoning, and integration with ever-larger VLMs are anticipated.

Common cross-cutting challenges remain: hallucination in out-of-distribution scenarios, context-length constraints, limits to parameter efficiency in very large models, and the need for rigorous alignment strategies in rapidly evolving domain applications (Li et al., 2023; Chen et al., 25 Nov 2025).
