Open-FinLLMs: Open Source Financial LLMs
- Open-FinLLMs are a family of open-source large language models tailored for financial tasks using transformer-based architectures and instruction tuning.
- They employ parameter-efficient methods like LoRA and QLoRA to achieve high accuracy on structured financial data while leveraging vast, curated corpora.
- They extend to multimodal and agentic functionalities, enabling advanced chart reasoning, automated analysis, and real-time decision-making in finance.
Open-FinLLMs are a family of open-source LLMs specifically architected, pre-trained, and instruction-tuned for financial domain applications, with extensions to multimodal reasoning and agentic workflows. These systems leverage high-quality financial corpora, parameter-efficient adaptation, and comprehensive benchmarking to provide robust capabilities in areas such as financial NLP, decision-making, reporting, and automated analysis. Open-FinLLMs aim to democratize access to advanced financial AI, allowing researchers, practitioners, and industry to adapt, evaluate, and deploy LLM-driven solutions in finance, accounting, trading, auditing, and regulation.
1. Model Architectures and Adaptation Strategies
Open-FinLLMs typically employ transformer-based, decoder-only architectures with parameters spanning from 7B to 70B, often inheriting from LLaMA (e.g., LLaMA3-8B), Qwen, Baichuan, Falcon, or Gemma bases (Huang et al., 2024, Caillaut et al., 7 Nov 2025, Lee et al., 2024). Parameter-efficient fine-tuning is central: Low-Rank Adaptation (LoRA), QLoRA, DoRA, and federated LoRA reduce the hardware, time, and data requirements of domain adaptation (Yang et al., 2023, Liu et al., 2023, Wang et al., 26 May 2025). Table 1 summarizes principal Open-FinLLMs and their base architectures:
| Model Suite | Base Model | Params | Multimodality | Released |
|---|---|---|---|---|
| FinGPT | GPT-J, LLaMA | 6–7B | Text | Yes |
| FinLLaMA/FinLLaVA | LLaMA3-8B | 8B | Text, Vision | Yes |
| LLM Pro Finance | LLaMA, Qwen, Gemma | 8–70B | Text | Partial |
| Touchstone-GPT | Qwen-2 | 7B | Text | Yes |
| PIXIU/FinMA | LLaMA-7B/30B | 7–30B | Text | Yes |
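The low-rank update behind LoRA can be illustrated with a minimal, dependency-free sketch. It assumes the common formulation in which a frozen weight matrix W is augmented by a trainable product B·A scaled by alpha/r; the toy dimensions and function names are illustrative and are not taken from any of the model suites in Table 1.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, alpha):
    """Compute W x + (alpha / r) * B (A x), where only A and B would be trained.

    W: d_out x d_in frozen base weight
    A: r x d_in trainable down-projection
    B: d_out x r trainable up-projection
    """
    r = len(A)
    base = matvec(W, x)            # frozen path
    low = matvec(B, matvec(A, x))  # low-rank path
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low)]


def trainable_params(d_in, d_out, r):
    """Number of trainable parameters in one LoRA adapter."""
    return r * d_in + d_out * r


# Toy example (illustrative values): d_in = 3, d_out = 2, rank r = 1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[1.0, 1.0, 1.0]]
B_zero = [[0.0], [0.0]]       # zero init: the adapter starts as a no-op
B_tuned = [[0.5], [-0.5]]     # pretend values after some training
x = [1.0, 2.0, 3.0]

print(lora_forward(x, W, A, B_zero, alpha=2.0))
print(lora_forward(x, W, A, B_tuned, alpha=2.0))
print(trainable_params(4096, 4096, 8), "adapter params vs", 4096 * 4096, "full")
```

For a 4096 x 4096 projection and rank 8, the adapter trains 65,536 parameters instead of roughly 16.8 million, which is the source of the hardware savings described above.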
Instruction tuning is a universal adaptation method, wrapping domain tasks (e.g., sentiment, NER, QA, forecasting) in structured prompts (Instruction, Input, Options) to align model behavior with financial reasoning chains (Wang et al., 2023, Lin et al., 19 Jan 2025). Multimodal extensions, such as FinLLaVA, add a CLIP-based vision encoder and multimodal adapters to support tabular, chart, and image analysis (Huang et al., 2024).
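As a rough illustration of the (Instruction, Input, Options) wrapping, the sketch below builds a prompt string for a sentiment task. The field labels, the trailing "Answer:" cue, and the `build_prompt` helper are assumptions made for illustration; each benchmark defines its own exact template.

```python
def build_prompt(instruction, text, options=None):
    """Wrap a task instance in an Instruction / Input / Options template."""
    parts = [f"Instruction: {instruction}", f"Input: {text}"]
    if options:
        parts.append("Options: " + ", ".join(options))
    parts.append("Answer:")
    return "\n".join(parts)


prompt = build_prompt(
    "Classify the sentiment of the financial sentence.",
    "The company raised its full-year revenue guidance.",
    options=["positive", "neutral", "negative"],
)
print(prompt)
```

Tasks without a closed label set (for example summarization) would simply omit the options argument.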
2. Data Curation, Pretraining, and Instruction Tuning
Open-FinLLMs rely on extensive, carefully curated financial corpora covering SEC filings, earnings calls, financial news, regulatory documents, academic papers, social media, and technical indicators. For example, the FinLLaMA corpus comprises 52B tokens across seven financial domains, including 13B tokens of historical market data and 6B tokens of SEC filings; general-domain tokens (FineWeb) are included at a specific mixing ratio to mitigate catastrophic forgetting (Huang et al., 2024). Preprocessing steps include document cleaning, Unicode normalization, deduplication, entity normalization, and supervised or weakly-supervised labeling (e.g., with post-news price movement as a proxy for sentiment) (Liu et al., 2023, Lin et al., 22 Feb 2026).
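Three of the preprocessing steps listed above (Unicode normalization, deduplication, and price-based weak labeling) can be sketched as follows. This is a simplified illustration: the exact-match hashing strategy, the NFKC form, and the 1% return threshold are assumptions, not values reported by the cited pipelines.

```python
import hashlib
import unicodedata


def normalize(text):
    """Apply NFKC Unicode normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def deduplicate(docs):
    """Drop exact duplicates (case-insensitive, after normalization)."""
    seen, kept = set(), []
    for doc in docs:
        clean = normalize(doc)
        digest = hashlib.sha256(clean.lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(clean)
    return kept


def weak_sentiment_label(price_before, price_after, threshold=0.01):
    """Label a news item by the asset's post-news return (hypothetical threshold)."""
    ret = (price_after - price_before) / price_before
    if ret > threshold:
        return "positive"
    if ret < -threshold:
        return "negative"
    return "neutral"


docs = ["Q3  revenue rose", "q3 revenue rose", "Guidance cut"]
print(deduplicate(docs))
print(weak_sentiment_label(100.0, 103.0))
```

Production pipelines typically add near-duplicate detection (for example MinHash) on top of exact hashing; that step is omitted here for brevity.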
Supervised fine-tuning (SFT), LoRA/QLoRA adaptation, and reinforcement learning with task-specific reward shaping (e.g., RLSP in FinGPT: reward signals from actual asset price shifts following news events) are standard methods for domain adaptation (Yang et al., 2023, Wang et al., 26 May 2025). Instruction datasets can reach 573K samples (FinLLaMA-Instruct) or 300K high-quality bilingual pairs (Touchstone-GPT), with coverage spanning sentiment, NER, relation extraction, mathematical reasoning, tabular QA, text generation, and summarization (Huang et al., 2024, Wu et al., 2024).
3. Benchmarking: Tasks, Metrics, and the Open FinLLM Leaderboard
A robust benchmarking ecosystem underpins Open-FinLLMs. The Open FinLLM Leaderboard, hosted in partnership with the Linux Foundation and Hugging Face, provides continuous, automated evaluation of models across 42 datasets in seven categories: information extraction, textual analysis, QA, text generation, risk management, forecasting, and decision-making (Lin et al., 19 Jan 2025, Lin et al., 22 Feb 2026). Evaluation is zero-shot, forbidding task-specific fine-tuning, and covers document- and table-based QA, claim analysis, stock movement prediction, summarization, credit scoring, and multi-turn trading decision tasks.
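Primary quantitative metrics include accuracy, macro/micro F1, exact match, span-level F1 (entities/relations), Matthews Correlation Coefficient, ROUGE-L, BLEU, and Sharpe Ratio or cumulative return for trading agents (Huang et al., 2024, Wang et al., 26 May 2025, Zhang et al., 4 Aug 2025, Wu et al., 2024). For cross-task comparison, the suite applies min–max normalization, which in its standard form rescales a raw score $x$ to $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$, where $x_{\min}$ and $x_{\max}$ are the minimum and maximum reference scores for the task.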
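A minimal sketch of min–max normalization over one task's scores is given below. It assumes the minimum and maximum are taken over the models being compared; the leaderboard's actual reference bounds may differ, and the model names and scores are made up for illustration.

```python
def min_max_normalize(scores):
    """Rescale a {model: raw_score} mapping for one task to the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All models tie on this task; avoid division by zero.
        return {name: 0.0 for name in scores}
    return {name: (value - lo) / (hi - lo) for name, value in scores.items()}


# Hypothetical raw scores on a single task.
task_scores = {"model_a": 40.0, "model_b": 60.0, "model_c": 80.0}
print(min_max_normalize(task_scores))
```

Once every task is on the same scale, per-task values can be averaged into a single aggregate without tasks with larger raw ranges dominating the result.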
Qualitative and metacognitive tests (e.g., LLM-as-Judge, tool-use tracing, self-assessment protocols) supplement quantitative evaluation, especially for agentic and multimodal settings (Lin et al., 22 Feb 2026). Golden Touchstone offers a comprehensive bilingual benchmark, supporting both English and Chinese, with unified instruction–input–output templates for model-agnostic, reproducible evaluation (Wu et al., 2024).
4. Multimodal and Agentic Capabilities
Recent Open-FinLLMs extend core LLMs with vision (chart/image/table) and agentic tools. FinLLaVA integrates a CLIP vision encoder and multimodal adapter (2-layer MLP) to enable chart, table, and image understanding; joint alignment and SFT expose 1.43M multimodal pairs during fine-tuning (Huang et al., 2024). Evaluation on ChartBench and TableBench confirms substantial improvements in zero-shot parsing and reasoning versus general models.
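The role of the 2-layer MLP adapter can be sketched as a projection from vision-encoder patch features into the language model's embedding space, after which visual and text tokens form one input sequence. The GELU activation, the toy dimensions, and the helper names below are assumptions for illustration; they are not details confirmed for FinLLaVA.

```python
import math


def gelu(x):
    """Exact GELU activation (assumed here; the source does not name one)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, W, b):
    """Affine map W x + b with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def project_patch(feat, W1, b1, W2, b2):
    """2-layer MLP: vision feature -> hidden -> LLM embedding dimension."""
    hidden = [gelu(h) for h in linear(feat, W1, b1)]
    return linear(hidden, W2, b2)


def build_sequence(patch_feats, text_embeds, params):
    """Prepend projected visual tokens to the text token embeddings."""
    visual = [project_patch(f, *params) for f in patch_feats]
    return visual + text_embeds


# Toy sizes: vision dim 2, hidden dim 3, LLM embedding dim 3.
W1 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b2 = [1.0, 2.0, 3.0]
params = (W1, b1, W2, b2)

patches = [[0.0, 0.0], [1.0, -1.0]]   # two image patches
text = [[0.5, 0.5, 0.5]]              # one text token embedding
sequence = build_sequence(patches, text, params)
print(len(sequence), [len(tok) for tok in sequence])
```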
Agentic financial systems such as FinVerse combine LLMs with hierarchical agent controllers (planner, tool-caller, code-executor), leveraging a curated API set (~642 financial API endpoints) and embedded code interpreters for real-time data retrieval, analysis, and report generation (An et al., 2024). Open-FinLLMs are integrated into agent frameworks (e.g., FinWorld AgentOrchestra) via JSON-RPC, interchangeable LLM backends, and prompt planning pipelines (Zhang et al., 4 Aug 2025). RL-based fine-tuning of action policies via GRPO or PPO is now routine.
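To make the tool-calling path concrete, the sketch below shows a tool-caller issuing a JSON-RPC 2.0 request and a dispatcher routing it to a registered tool. The envelope fields and the -32601 error code follow the JSON-RPC 2.0 specification; the `get_price` tool, its return value, and the helper names are hypothetical and do not reflect any actual FinVerse or FinWorld interface.

```python
import itertools
import json

_request_ids = itertools.count(1)


def make_request(method, params):
    """Serialize a JSON-RPC 2.0 request with a fresh integer id."""
    return json.dumps(
        {"jsonrpc": "2.0", "method": method, "params": params, "id": next(_request_ids)}
    )


def dispatch(raw_request, tools):
    """Route a JSON-RPC request to a registered tool and serialize the response."""
    req = json.loads(raw_request)
    handler = tools.get(req["method"])
    if handler is None:
        resp = {
            "jsonrpc": "2.0",
            "error": {"code": -32601, "message": "Method not found"},
            "id": req["id"],
        }
    else:
        resp = {"jsonrpc": "2.0", "result": handler(**req["params"]), "id": req["id"]}
    return json.dumps(resp)


# Hypothetical tool registry with one stubbed market-data tool.
tools = {"get_price": lambda ticker: {"ticker": ticker, "price": 101.5}}

request = make_request("get_price", {"ticker": "ACME"})
print(dispatch(request, tools))
```

Because the LLM backend only needs to emit a method name and a parameter object, the same dispatcher can serve interchangeable models.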
5. Empirical Performance and Applications
Open-FinLLMs frequently match or surpass strong baselines on domain-specific tasks. For example, FinLLaMA-Instruct outperforms GPT-4 or BloombergGPT on sentiment, NER, and numerical understanding (Huang et al., 2024); LoRA-adapted financial LLMs report average performance gains of 36% over base models on SEC filing tagging, value extraction, and formula calculation tasks, with accuracy/F1 above 98% on structured XBRL (Wang et al., 26 May 2025). Table 2 summarizes representative results:
| Task | Best Open-FinLLM | Metric | Value | Baseline |
|---|---|---|---|---|
| Sentiment (FPB, FiQA-SA) | Touchstone-GPT | W-F₁/ACC | 0.86/0.86 | GPT-4o: 0.81 |
| NER/Relation Extraction | FinLLaMA-Instruct | F₁ | 0.82 | LLaMA3: 0.39 |
| Financial QA (ConvFinQA) | FinLLaMA | EM | 0.51 | GPT-4: 0.43 |
| XBRL Value Extraction | Llama 3.1 8B + LoRA | ACC/F₁ | >0.98 | Base: <0.5 |
| Chart/Table Reasoning | FinLLaVA | TableBench | 0.72 | LLaVA: 0.69 |
| Credit Scoring | FinReasoner | Score | 80.1% | DeepSeek-R1: 74.0% |
Applications include robo-advisory, regulatory compliance, automated report summarization, algorithmic trading strategy generation, and financial misinformation detection (FMDLlama) (An et al., 2024, Liu et al., 2024, Yang et al., 2023). Multilingual variants demonstrate +10–65% relative accuracy improvements in financial acronym and translation tasks for FR, DE, and EN (Caillaut et al., 7 Nov 2025). Trading-agent benchmarks (FinWorld, FinMem) show risk-adjusted Sharpe ratio gains and superior drawdown characteristics compared to buy-and-hold or generic models (Zhang et al., 4 Aug 2025, Huang et al., 2024).
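The risk-adjusted metrics used in trading-agent benchmarks can be computed from a series of per-period returns. The sketch below uses the standard definitions of the annualized Sharpe ratio and maximum drawdown; the 252-period annualization factor, the zero default risk-free rate, and the sample returns are assumptions for illustration.

```python
import math
import statistics


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from per-period simple returns."""
    excess = [r - risk_free for r in returns]
    sd = statistics.stdev(excess)
    if sd == 0:
        return 0.0  # undefined for a constant series; return 0 by convention here
    return statistics.mean(excess) / sd * math.sqrt(periods_per_year)


def max_drawdown(returns):
    """Largest peak-to-trough decline of the compounded equity curve."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst


# Hypothetical daily returns of a trading agent.
daily_returns = [0.02, -0.01, 0.03, 0.0]
print(sharpe_ratio(daily_returns))
print(max_drawdown([0.10, -0.50, 0.20]))
```

Comparing these two numbers against a buy-and-hold return series over the same window gives the kind of baseline comparison described above.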
6. Reproducibility, Governance, and Ecosystem Practices
Open-FinLLMs emphasize transparent, reproducible pipelines. Code, datasets, adapter weights, and benchmarking scripts are publicly released under open licenses (Apache 2.0, MIT, or the non-commercial CC-BY-NC 4.0) and are version-controlled with checksums and retrieval timestamps (Yang et al., 2023, Caillaut et al., 7 Nov 2025, Zhang et al., 4 Aug 2025). Model Openness Framework (MOF) compliance ensures traceable data provenance, license clarity, and prevention of “open-washing” (Lin et al., 22 Feb 2026, Lin et al., 19 Jan 2025). Community contributions (new benchmarks, models, tasks) are welcomed via GitHub/Hugging Face pull requests; automated evaluation/re-evaluation ensures rapid integration of improvements (Lin et al., 19 Jan 2025).
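Checksum-and-timestamp versioning of released artifacts can be sketched with a small manifest helper. The manifest field names and the artifact name are hypothetical; only the use of SHA-256 digests and UTC retrieval timestamps is being illustrated.

```python
import hashlib
from datetime import datetime, timezone


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of an artifact's bytes."""
    return hashlib.sha256(data).hexdigest()


def manifest_entry(name: str, data: bytes) -> dict:
    """Record an artifact's checksum together with a UTC retrieval timestamp."""
    return {
        "name": name,
        "sha256": sha256_of(data),
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
    }


def verify(entry: dict, data: bytes) -> bool:
    """Check that the bytes on disk still match the recorded checksum."""
    return sha256_of(data) == entry["sha256"]


# Hypothetical artifact contents standing in for a dataset shard.
artifact = b"ticker,date,close\nACME,2024-01-02,101.5\n"
entry = manifest_entry("prices_sample.csv", artifact)
print(entry["sha256"], verify(entry, artifact))
```

In practice the bytes would be read from the downloaded file, and the manifest would be committed alongside the code so that any later change to the data is detectable.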
Layered governance frameworks (AI governance checklists, risk audits, drift monitors) are actively integrated, with procedures for human-in-the-loop audit, hallucination detection, regulatory compliance checks, and privacy-enforcing deployment modes (air-gapped, federated LoRA, zero-knowledge proof for IP protection) (Zhang et al., 4 Aug 2025, Wang et al., 26 May 2025, Lin et al., 22 Feb 2026).
7. Limitations, Challenges, and Forward Directions
Several persistent limitations remain. Free-form financial reasoning, multi-step numerical QA, and relation extraction tasks expose residual performance gaps even in SOTA open FinLLMs, typically trailing closed models like GPT-4o by 10–20 F1 or EM points (Wu et al., 2024, Zhang et al., 4 Aug 2025). Integration of tabular, chart, and report data—while advanced in FinLLaVA/FinWorld—is not universally robust. Hallucination and factuality remain open research areas, particularly for regulatory and high-stakes applications. Computational cost, even with QLoRA, restricts ultra-low-latency or high-frequency applications for large models (Lee et al., 2024).
Future priorities include richer instruction collections (especially multilingual and multimodal), architecture innovations (Mixture-of-Experts for scale; numeric-aware tokenization), advanced RAG pipelines, agent-level real-time dashboarding, and systematic multimodal benchmarking (e.g., extension of Golden Touchstone tasks). Model operations (LLMOps), production monitoring, privacy-preserving data integration, and real-world evaluation (Sharpe, MDD, regulatory stress tests) are further identified as critical for widespread adoption.
References
- Open-FinLLMs: Open Multimodal LLMs for Financial Applications
- The LLM Pro Finance Suite: Multilingual LLMs for Financial Applications
- Open FinLLM Leaderboard: Towards Financial AI Readiness
- Evaluation and Benchmarking Suite for Financial LLMs and Agents
- Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial LLMs
- FinWorld: An All-in-One Open-Source Platform for End-to-End Financial AI Research and Deployment
- FinLoRA: Benchmarking LoRA Methods for Fine-Tuning LLMs on Financial Datasets
- FinGPT: Open-Source Financial LLMs
- FinGPT: Democratizing Internet-scale Data for Financial LLMs
- FinGPT: Instruction Tuning Benchmark for Open-Source LLMs in Financial Datasets
- A Survey of LLMs in Finance (FinLLMs)
- FMDLlama: Financial Misinformation Detection based on LLMs
- FinVerse: An Autonomous Agent System for Versatile Financial Analysis