Open-FinLLMs: Open Source Financial LLMs

Updated 6 May 2026

Open-FinLLMs are a family of open-source large language models tailored for financial tasks using transformer-based architectures and instruction tuning.
They employ parameter-efficient methods like LoRA and QLoRA to achieve high accuracy on structured financial data while leveraging vast, curated corpora.
They extend to multimodal and agentic functionalities, enabling advanced chart reasoning, automated analysis, and real-time decision-making in finance.

Open-FinLLMs are a family of open-source LLMs specifically architected, pre-trained, and instruction-tuned for financial domain applications, with extensions to multimodal reasoning and agentic workflows. These systems leverage high-quality financial corpora, parameter-efficient adaptation, and comprehensive benchmarking to provide robust capabilities in areas such as financial NLP, decision-making, reporting, and automated analysis. Open-FinLLMs aim to democratize access to advanced financial AI, allowing researchers, practitioners, and industry to adapt, evaluate, and deploy LLM-driven solutions in finance, accounting, trading, auditing, and regulation.

1. Model Architectures and Adaptation Strategies

Open-FinLLMs typically employ transformer-based, decoder-only architectures with parameter counts ranging from 7B to 70B, often inheriting from LLaMA (e.g., LLaMA3-8B), Qwen, Baichuan, Falcon, or Gemma bases (Huang et al., 2024, Caillaut et al., 7 Nov 2025, Lee et al., 2024). Parameter-efficient fine-tuning is central: Low-Rank Adaptation (LoRA), QLoRA, DoRA, and federated LoRA reduce the hardware, time, and data requirements of domain adaptation (Yang et al., 2023, Liu et al., 2023, Wang et al., 26 May 2025). Table 1 summarizes principal Open-FinLLMs and their base architectures:

Model Suite	Base Model	Params	Multimodality	Released
FinGPT	GPT-J, LLaMA	6–7B	Text	Yes
FinLLaMA/FinLLaVA	LLaMA3-8B	8B	Text, Vision	Yes
LLM Pro Finance	LLaMA, Qwen, Gemma	8–70B	Text	Partial
Touchstone-GPT	Qwen-2	7B	Text	Yes
PIXIU/FinMA	LLaMA-7B/30B	7–30B	Text	Yes
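
For reference, the low-rank update behind LoRA can be written as below. This follows the general LoRA formulation rather than the recipe of any specific model in Table 1; the symbols $W_0$, $A$, $B$, $r$, and $\alpha$ are introduced here for illustration. A pretrained weight matrix $W_0 \in \mathbb{R}^{d \times k}$ is frozen and only two small factors are trained:

$W = W_0 + \frac{\alpha}{r} B A, \quad B \in \mathbb{R}^{d \times r}, \; A \in \mathbb{R}^{r \times k}, \; r \ll \min(d, k)$

The trainable parameter count per adapted matrix drops from $dk$ to $r(d + k)$. As a worked example under assumed sizes $d = k = 4096$ and $r = 8$, this is $8 \times 8192 = 65{,}536$ trainable parameters against $16{,}777{,}216$ in the full matrix, or about 0.39%.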

Instruction tuning is a universal adaptation method, wrapping domain tasks (e.g., sentiment, NER, QA, forecasting) in structured prompts (Instruction, Input, Options) to align model behavior with financial reasoning chains (Wang et al., 2023, Lin et al., 19 Jan 2025). Multimodal extensions, such as FinLLaVA, add a CLIP-based vision encoder and multimodal adapters to support tabular, chart, and image analysis (Huang et al., 2024).
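
A minimal sketch of the Instruction / Input / Options wrapping described above. The field labels follow the text, but the exact template string, the `build_prompt` helper, the trailing `Answer:` cue, and the example sentence are illustrative assumptions, not the format of any particular dataset.

```python
from typing import Optional, Sequence


def build_prompt(instruction: str, text: str,
                 options: Optional[Sequence[str]] = None) -> str:
    """Wrap one task example in an Instruction / Input / Options template."""
    parts = [f"Instruction: {instruction}", f"Input: {text}"]
    if options:
        # Options are only listed for closed-set tasks such as classification.
        parts.append("Options: " + ", ".join(options))
    parts.append("Answer:")  # assumed cue marking where the model should respond
    return "\n".join(parts)


# Illustrative (made-up) sentiment example.
prompt = build_prompt(
    "Classify the sentiment of the financial sentence.",
    "Operating profit rose compared with the same quarter a year earlier.",
    ["negative", "neutral", "positive"],
)
print(prompt)
```

Open-ended tasks such as summarization would call the same helper without `options`, so one template can cover both classification and generation.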

2. Data Curation, Pretraining, and Instruction Tuning

Open-FinLLMs rely on extensive, carefully curated financial corpora covering SEC filings, earnings calls, financial news, regulatory documents, academic papers, social media, and technical indicators. For example, the FinLLaMA corpus comprises 52B tokens across seven financial domains, including 13B historical market data and 6B SEC filings; general-domain tokens (FineWeb) are included at a specific mixing ratio to mitigate catastrophic forgetting (Huang et al., 2024). Preprocessing steps include document cleaning, Unicode normalization, duplication filtering, entity normalization, and supervised or weakly-supervised labeling (e.g., with post-news price movement as a proxy for sentiment) (Liu et al., 2023, Lin et al., 22 Feb 2026).
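
The weak-labeling idea mentioned above, using post-news price movement as a sentiment proxy, might look like the following sketch. The function name, the simple-return definition, and the 1% neutral band are assumptions chosen for illustration; the cited works may use different horizons and thresholds.

```python
def weak_sentiment_label(price_before: float, price_after: float,
                         threshold: float = 0.01) -> str:
    """Derive a noisy sentiment label from the price move around a news item.

    `threshold` (assumed 1%) defines a band of small moves treated as neutral.
    """
    if price_before <= 0:
        raise ValueError("price_before must be positive")
    simple_return = (price_after - price_before) / price_before
    if simple_return > threshold:
        return "positive"
    if simple_return < -threshold:
        return "negative"
    return "neutral"


# Hypothetical prices before and after three news items.
for before, after in [(100.0, 103.0), (100.0, 97.0), (100.0, 100.5)]:
    print(before, after, weak_sentiment_label(before, after))
```

Labels produced this way are noisy, since prices also move for reasons unrelated to the news item, which is why the text describes the signal as a proxy.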

Supervised fine-tuning (SFT), LoRA/QLoRA adaptation, and reinforcement learning with task-specific reward shaping (e.g., RLSP in FinGPT: reward signals from actual asset price shifts following news events) are standard methods for domain adaptation (Yang et al., 2023, Wang et al., 26 May 2025). Instruction datasets can reach 573K samples (FinLLaMA-Instruct) or 300K high-quality bilingual pairs (Touchstone-GPT), with coverage spanning sentiment, NER, relation extraction, mathematical reasoning, tabular QA, text generation, and summarization (Huang et al., 2024, Wu et al., 2024).

3. Benchmarking: Tasks, Metrics, and the Open FinLLM Leaderboard

A robust benchmarking ecosystem underpins Open-FinLLMs. The Open FinLLM Leaderboard, hosted in partnership with the Linux Foundation and Hugging Face, provides continuous, automated evaluation of models across 42 datasets in seven categories: information extraction, textual analysis, QA, text generation, risk management, forecasting, and decision-making (Lin et al., 19 Jan 2025, Lin et al., 22 Feb 2026). Evaluation is zero-shot, forbidding task-specific fine-tuning, and covers document- and table-based QA, claim analysis, stock movement prediction, summarization, credit scoring, and multi-turn trading decision tasks.

Primary quantitative metrics include accuracy, macro/micro F1, exact match, span-level F1 (entities/relations), Matthews Correlation Coefficient, ROUGE-L, BLEU, and Sharpe Ratio or cumulative return for trading agents (Huang et al., 2024, Wang et al., 26 May 2025, Zhang et al., 4 Aug 2025, Wu et al., 2024). The suite introduces min–max normalization for cross-task comparison:

$\overline{S} = \frac{S - \min}{\max - \min} \times 100$
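
A small sketch of this normalization in code. It assumes that the minimum and maximum are taken over the scores of all models on one task, which the formula itself does not specify; the handling of the degenerate case where all scores are equal is also an assumption.

```python
from typing import List, Sequence


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale raw task scores to the range [0, 100]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumed convention: identical scores carry no ranking information.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) * 100 for s in scores]


# Hypothetical raw scores of three models on one task.
print(min_max_normalize([0.5, 0.75, 1.0]))
```

After rescaling, tasks reported on different scales (for example F1 and ROUGE-L) can be averaged without one metric dominating because of its numeric range.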

Qualitative and metacognitive tests (e.g., LLM-as-Judge, tool-use tracing, self-assessment protocols) supplement quantitative evaluation, especially for agentic and multimodal settings (Lin et al., 22 Feb 2026). Golden Touchstone offers a comprehensive bilingual benchmark, supporting both English and Chinese, with unified instruction–input–output templates for model-agnostic, reproducible evaluation (Wu et al., 2024).

4. Multimodal and Agentic Capabilities

Recent Open-FinLLMs extend core LLMs with vision (chart/image/table) and agentic tools. FinLLaVA integrates a CLIP vision encoder and a multimodal adapter (a 2-layer MLP) to enable chart, table, and image understanding; its joint alignment and SFT stages together use 1.43M multimodal pairs (Huang et al., 2024). Evaluation on ChartBench and TableBench confirms substantial improvements in zero-shot parsing and reasoning versus general models.
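
To make the adapter concrete, the sketch below shows a 2-layer MLP that maps vision-encoder patch features into the LLM embedding space. Only the "2-layer MLP" shape comes from the text; the GELU activation, the toy dimensions, the random initialisation, and all function names are assumptions, and a real implementation would use a tensor library rather than Python lists.

```python
import math
import random


def gelu(x: float) -> float:
    """GELU activation via the Gaussian error function (assumed choice)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Affine map; `weight` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def init_layer(rng, in_dim, out_dim):
    """Small random weights and zero biases, for illustration only."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def project(patches, layer1, layer2):
    """Turn each patch feature into one 'visual token' of LLM width."""
    tokens = []
    for patch in patches:
        hidden = [gelu(h) for h in linear(patch, *layer1)]
        tokens.append(linear(hidden, *layer2))
    return tokens


rng = random.Random(0)
VISION_DIM, LLM_DIM, NUM_PATCHES = 4, 6, 3  # toy sizes, not real model widths
layer1 = init_layer(rng, VISION_DIM, LLM_DIM)
layer2 = init_layer(rng, LLM_DIM, LLM_DIM)
patches = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)]
           for _ in range(NUM_PATCHES)]
visual_tokens = project(patches, layer1, layer2)
print(len(visual_tokens), len(visual_tokens[0]))
```

The projected vectors would then be placed alongside the text token embeddings, so the language model can attend to chart or table content as if it were part of the prompt.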

Agentic financial systems such as FinVerse combine LLMs with hierarchical agent controllers (planner, tool-caller, code-executor), leveraging a curated API set (~642 financial API endpoints) and embedded code interpreters for real-time data retrieval, analysis, and report generation (An et al., 2024). Open-FinLLMs are integrated into agent frameworks (e.g., FinWorld AgentOrchestra) via JSON-RPC, interchangeable LLM backends, and prompt planning pipelines (Zhang et al., 4 Aug 2025). RL-based fine-tuning of action policies via GRPO or PPO is now routine.
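
As a rough illustration of tool calling over JSON-RPC, the sketch below dispatches a JSON-RPC 2.0 request to a local tool registry. The `get_price` tool, its placeholder return value, and the `handle_request` helper are hypothetical and do not reflect the actual FinVerse or FinWorld interfaces; only the message envelope and the `-32601` "Method not found" code follow the JSON-RPC 2.0 specification.

```python
import json


def get_price(symbol: str) -> dict:
    """Hypothetical tool; a real one would query a market-data API."""
    return {"symbol": symbol, "price": 100.0}  # placeholder value


TOOLS = {"get_price": get_price}  # registry of callable tools


def handle_request(raw: str) -> str:
    """Dispatch one JSON-RPC 2.0 request to a registered tool."""
    request = json.loads(raw)
    tool = TOOLS.get(request.get("method"))
    response = {"jsonrpc": "2.0", "id": request.get("id")}
    if tool is None:
        response["error"] = {"code": -32601, "message": "Method not found"}
    else:
        # Named parameters are passed through as keyword arguments.
        response["result"] = tool(**request.get("params", {}))
    return json.dumps(response)


# A request such as an LLM tool-caller might emit.
call = json.dumps({"jsonrpc": "2.0", "method": "get_price",
                   "params": {"symbol": "ACME"}, "id": 1})
reply = json.loads(handle_request(call))
print(reply)
```

Keeping the tool interface behind a plain message format like this is one way the LLM backend can be swapped without changing the tools themselves.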

5. Empirical Performance and Applications

Open-FinLLMs routinely match or surpass strong baselines on domain-specific tasks. For example, FinLLaMA-Instruct outperforms GPT-4 or BloombergGPT on sentiment, NER, and numerical understanding (Huang et al., 2024); LoRA-adapted financial LLMs report average performance gains of 36% over base models on SEC filing tagging, value extraction, and formula calculation tasks, routinely exceeding 98% accuracy/F1 on structured XBRL data (Wang et al., 26 May 2025). Table 2 summarizes representative results:

Task	Best Open-FinLLM	Metric	Value	Baseline
Sentiment (FPB, FiQA-SA)	Touchstone-GPT	W-F₁/ACC	0.86/0.86	GPT-4o: 0.81
NER/Relation Extraction	FinLLaMA-Instruct	F₁	0.82	LLaMA3: 0.39
Financial QA (ConvFinQA)	FinLLaMA	EM	0.51	GPT-4: 0.43
XBRL Value Extraction	Llama 3.1 8B + LoRA	ACC/F1	>0.98	Base: <0.5
Chart/Table Reasoning	FinLLaVA	TableBench	0.72	LLaVA: 0.69
Credit Scoring	FinReasoner	Score	80.1%	DeepSeek-R1: 74.0%

Applications include robo-advisory, regulatory compliance, automated report summarization, algorithmic trading strategy generation, and financial misinformation detection (FMDLlama) (An et al., 2024, Liu et al., 2024, Yang et al., 2023). Multilingual variants demonstrate +10–65% relative accuracy improvements in financial acronym and translation tasks for FR, DE, and EN (Caillaut et al., 7 Nov 2025). Trading-agent benchmarks (FinWorld, FinMem) show risk-adjusted Sharpe ratio gains and superior drawdown characteristics compared to buy-and-hold or generic models (Zhang et al., 4 Aug 2025, Huang et al., 2024).
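
The two trading metrics referred to above, Sharpe ratio and drawdown, can be computed from a series of periodic returns as in the sketch below. The annualisation factor of 252 trading days, the per-period risk-free rate, and the sample returns are assumptions; individual benchmarks may define these quantities differently.

```python
import math
import statistics
from typing import Sequence


def sharpe_ratio(returns: Sequence[float], risk_free: float = 0.0,
                 periods_per_year: int = 252) -> float:
    """Annualised Sharpe ratio; `risk_free` is a per-period rate."""
    excess = [r - risk_free for r in returns]
    spread = statistics.stdev(excess)  # sample standard deviation
    if spread == 0:
        raise ValueError("returns have zero volatility")
    return statistics.mean(excess) / spread * math.sqrt(periods_per_year)


def max_drawdown(returns: Sequence[float]) -> float:
    """Largest peak-to-trough loss of the compounded equity curve."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst


# Hypothetical daily returns of a trading agent.
daily = [0.01, -0.02, 0.015, 0.005, -0.01]
print(sharpe_ratio(daily), max_drawdown(daily))
```

Reporting both is useful because a strategy can post a high Sharpe ratio over a window while still suffering a single deep loss that the maximum drawdown exposes.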

6. Reproducibility, Governance, and Ecosystem Practices

Open-FinLLMs emphasize transparent, reproducible pipelines. Code, datasets, adapter weights, and benchmarking scripts are publicly released under open licenses, ranging from permissive (Apache 2.0, MIT) to non-commercial (CC-BY-NC 4.0), and are version-controlled with checksums and retrieval timestamps (Yang et al., 2023, Caillaut et al., 7 Nov 2025, Zhang et al., 4 Aug 2025). Model Openness Framework (MOF) compliance ensures traceable data provenance, license clarity, and prevention of “open-washing” (Lin et al., 22 Feb 2026, Lin et al., 19 Jan 2025). Community contributions (new benchmarks, models, tasks) are welcomed via GitHub/Hugging Face pull requests; automated evaluation/re-evaluation ensures rapid integration of improvements (Lin et al., 19 Jan 2025).
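
One simple way to record checksums and retrieval timestamps for released artifacts is sketched below. The manifest fields, the choice of SHA-256, and the helper names are assumptions for illustration, not the format used by any of the cited projects.

```python
import hashlib
from datetime import datetime, timezone


def manifest_entry(name: str, data: bytes) -> dict:
    """Describe one artifact by content hash, size, and retrieval time."""
    return {
        "name": name,
        "sha256": hashlib.sha256(data).hexdigest(),
        "bytes": len(data),
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
    }


def verify(entry: dict, data: bytes) -> bool:
    """Check that the content still matches the recorded checksum."""
    return hashlib.sha256(data).hexdigest() == entry["sha256"]


# Hypothetical dataset shard held in memory for the example.
shard = b"date,ticker,headline\n"
entry = manifest_entry("news_shard_000.csv", shard)
print(entry["sha256"], verify(entry, shard))
```

Storing such entries next to the data lets a later evaluation run confirm that it is using byte-identical inputs.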

Layered governance frameworks (AI governance checklists, risk audits, drift monitors) are actively integrated, with procedures for human-in-the-loop audit, hallucination detection, regulatory compliance checks, and privacy-enforcing deployment modes (air-gapped, federated LoRA, zero-knowledge proof for IP protection) (Zhang et al., 4 Aug 2025, Wang et al., 26 May 2025, Lin et al., 22 Feb 2026).

7. Limitations, Challenges, and Forward Directions

Several persistent limitations remain. Free-form financial reasoning, multi-step numerical QA, and relation extraction tasks expose residual performance gaps even in SOTA open FinLLMs, typically trailing closed models like GPT-4o by 10–20 F1 or EM points (Wu et al., 2024, Zhang et al., 4 Aug 2025). Integration of tabular, chart, and report data—while advanced in FinLLaVA/FinWorld—is not universally robust. Hallucination and factuality remain open research areas, particularly for regulatory and high-stakes applications. Computational cost, even with QLoRA, restricts ultra-low-latency or high-frequency applications for large models (Lee et al., 2024).

Future priorities include richer instruction collections (especially multilingual and multimodal), architecture innovations (Mixture-of-Experts for scale; numeric-aware tokenization), advanced RAG pipelines, agent-level real-time dashboarding, and systematic multimodal benchmarking (e.g., extension of Golden Touchstone tasks). Model operations (LLMOps), production monitoring, privacy-preserving data integration, and real-world evaluation (Sharpe, MDD, regulatory stress tests) are further identified as critical for widespread adoption.

References