FLANG-BERT/FLANG-ELECTRA: Financial NLP Models
- FLANG-BERT and FLANG-ELECTRA are domain-adapted language models augmented with financial vocabulary and customized masking protocols.
- They employ a two-stage masking approach, including targeted phrase masking and a span boundary objective, to better capture financial semantics.
- Empirical evaluations on the FLUE benchmark demonstrate superior performance across financial sentiment, NER, and QA tasks compared to baseline models.
FLANG-BERT and FLANG-ELECTRA are domain-adapted pre-trained language models for the financial sector, extending the BERT-base and ELECTRA-base architectures respectively. Both incorporate explicit financial vocabulary expansion, finance-aware masking strategies, and additional objectives designed to exploit the structural and semantic properties of financial texts. They were introduced together with the Financial Language Understanding Evaluation (FLUE) benchmark suite, which provides multi-task evaluation for the financial domain. FLANG-BERT and FLANG-ELECTRA outperform prior approaches across a broad array of financial NLP tasks (Shah et al., 2022).
1. Architecture and Model Modifications
FLANG-BERT
FLANG-BERT adopts the BERT-base configuration, comprising 12 Transformer encoder layers, 12 self-attention heads, a hidden size of 768, and approximately 110 million parameters. The major architectural adaptation is the vocabulary expansion: the standard WordPiece vocabulary is augmented by ~8.2k words and phrases targeting financial discourse. Pre-training introduces a two-stage masking procedure—first targeting single-word tokens and then contiguous financial phrases.
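To make the size figures concrete, the following sketch counts the parameters of a BERT-base style encoder and the extra embedding rows that a vocabulary expansion would add. It is an illustration under stated assumptions, not the authors' accounting: the base vocabulary size of 30,522, the feed-forward width of 3,072, and the 512 positions are the usual BERT-base values rather than figures from the text, and each added term is assumed to become exactly one new embedding row.

```python
def bert_param_count(vocab_size, hidden=768, layers=12, intermediate=3072,
                     max_positions=512, type_vocab=2):
    """Count parameters of a BERT-style encoder (embeddings + layers + pooler)."""
    layer_norm = 2 * hidden  # scale and bias vectors
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, each with a bias, plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (up- and down-projection) with biases, plus LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


BASE_VOCAB = 30_522   # assumption: standard uncased BERT-base WordPiece vocabulary
ADDED_TERMS = 8_200   # "~8.2k" from the text, rounded for illustration

base = bert_param_count(BASE_VOCAB)
expanded = bert_param_count(BASE_VOCAB + ADDED_TERMS)

print(f"base encoder parameters:     {base:,}")
print(f"extra embedding parameters:  {expanded - base:,}")
```

Under these assumptions the base count lands close to the "approximately 110 million" quoted above, and the expansion only touches the token-embedding matrix, leaving the Transformer layers unchanged.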
FLANG-ELECTRA
FLANG-ELECTRA builds on ELECTRA-base, with distinct architectures for the generator and discriminator. The generator utilizes 12 layers (hidden size 256, 4 attention heads, ~14 million parameters), while the discriminator matches ELECTRA-base (12 layers, hidden size 768, 12 heads, ~110 million parameters). Modifications include:
- Preferential masking—masking of tokens/phrases selected for financial salience.
- Addition of a Span Boundary Objective to the generator (details below).
- Adoption of the two-stage masking protocol paralleling FLANG-BERT.
2. Pre-training Objectives and Loss Formulations
Both models integrate three core loss components, detailed below using the original mathematical formulations:
2.1 Financial Keyword and Phrase Masking
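Let $n$ denote the sequence length, $F \subseteq \{1, \dots, n\}$ denote the indices $i$ where token $x_i$ is in a financial-term dictionary, and $M$ denote the positions selected for masking. The procedure is:
- Mask 15% of tokens per sequence, so $|M| \approx 0.15\,n$.
- 30% of masked positions are reserved for financial terms ($i \in F$).
- 70% are drawn at random from non-financial terms ($i \notin F$).
Probability a token is masked, as implied by the rates above when positions are drawn uniformly within each group:
$$P(i \in M) = \begin{cases} \min\left(1, \dfrac{0.30 \times 0.15\, n}{|F|}\right), & i \in F \\ \min\left(1, \dfrac{0.70 \times 0.15\, n}{n - |F|}\right), & i \notin F \end{cases}$$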
During the phrase-masking stage, each multi-token financial phrase is masked as a contiguous span (single [MASK] token) with probability 0.3.
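The single-token stage of this procedure can be sketched in a few lines of Python. This is a minimal illustration rather than the released implementation: the function name `select_mask_positions`, the toy dictionary, the rounding of the budgets, and the fallback to non-financial positions when a sequence has too few financial terms are all assumptions. Phrase-level masking is not covered.

```python
import random


def select_mask_positions(tokens, financial_terms, mask_rate=0.15,
                          financial_share=0.30, rng=None):
    """Pick positions to mask, reserving a share of the budget for financial terms."""
    rng = rng or random.Random()
    financial = [i for i, tok in enumerate(tokens) if tok.lower() in financial_terms]
    financial_set = set(financial)
    other = [i for i in range(len(tokens)) if i not in financial_set]

    budget = max(1, round(mask_rate * len(tokens)))       # ~15% of the sequence
    n_financial = min(len(financial), round(financial_share * budget))
    # Assumption: any unused financial budget falls back to ordinary tokens.
    n_other = min(len(other), budget - n_financial)

    chosen = rng.sample(financial, n_financial) + rng.sample(other, n_other)
    return sorted(chosen)


# Hypothetical dictionary and sentence, for illustration only.
FINANCIAL_TERMS = {"ebitda", "revenue", "dividend", "guidance"}
tokens = ("the company reported higher ebitda and revenue while "
          "the dividend payout stayed flat as management kept "
          "guidance for the next quarter unchanged despite weaker "
          "demand in two regions and rising costs across several "
          "of its main product lines this year").split()

masked = select_mask_positions(tokens, FINANCIAL_TERMS, rng=random.Random(0))
print([(i, tokens[i]) for i in masked])
```

With 40 tokens the budget is 6 positions, of which 2 come from the four dictionary terms and 4 from the remaining tokens.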
2.2 Masked Language Modeling (Infilling) Loss
The model predicts masked tokens from their context:
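$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in M} \log p_\theta\left(x_i \mid \tilde{x}\right)$$
where $M$ is the set of masked positions, $x_i$ is the original token at position $i$, and $\tilde{x}$ is the input sequence after masking.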
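As a numerical illustration of the infilling objective, the sketch below sums the negative log-likelihood of the true tokens at the masked positions only. The probabilities are made-up toy values over a four-word vocabulary, and the helper name `masked_lm_loss` is hypothetical.

```python
import math


def masked_lm_loss(token_probs, target_ids, masked_positions):
    """Negative log-likelihood of the original tokens, summed over masked positions."""
    return -sum(math.log(token_probs[i][target_ids[i]]) for i in masked_positions)


# Toy predicted distributions: one row per position, one column per vocabulary id.
probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.05, 0.05, 0.80, 0.10],
]
targets = [0, 3, 2]            # original token id at each position

# Only positions 1 and 2 were masked, so position 0 does not contribute.
loss = masked_lm_loss(probs, targets, masked_positions=[1, 2])
print(f"MLM loss: {loss:.4f}")
```

Unmasked positions are ignored by construction, which is why the choice of which tokens to mask directly shapes what the model is trained to predict.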
2.3 Span Boundary Objective (SBO)
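Specific to FLANG-ELECTRA, the generator reconstructs each masked token $x_i$ using only the boundary tokens of its masked span. For a span $(x_s, \dots, x_e)$, compute
$$y_i = f\left(h_{s-1},\, h_{e+1},\, p_{i-s+1}\right)$$
where $h_{s-1}$ and $h_{e+1}$ are the encoder representations of the tokens immediately to the left and right of the span, $p_{i-s+1}$ is an embedding of the target's position relative to the left boundary, and $f$ is a small feed-forward network. Following the SpanBERT formulation of this objective, $f$ can be written as
$$h' = \text{LayerNorm}\left(\text{GeLU}\left(W_1\,[h_{s-1};\, h_{e+1};\, p_{i-s+1}]\right)\right), \qquad y_i = \text{LayerNorm}\left(\text{GeLU}\left(W_2\, h'\right)\right)$$
The span boundary loss:
$$\mathcal{L}_{\text{SBO}} = -\sum_{i \in M} \log P\left(x_i \mid y_i\right)$$
where $M$ is the set of masked positions.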
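The defining constraint of the span boundary objective is that the prediction for a token inside a masked span may only see the states just outside that span. The sketch below assembles such an input for each span position. It is a schematic under assumptions: the relative position embedding follows the SpanBERT formulation rather than anything stated here, vectors are plain Python lists, and the feed-forward network that would consume these inputs is omitted.

```python
def sbo_inputs(hidden, span_start, span_end, position_embeddings):
    """Build the SBO input for every token in the masked span [span_start, span_end].

    Each input concatenates the left boundary state, the right boundary state,
    and an embedding of the token's offset inside the span. The span must not
    touch either end of the sequence, since both neighbours are required.
    """
    left = hidden[span_start - 1]
    right = hidden[span_end + 1]
    inputs = {}
    for i in range(span_start, span_end + 1):
        offset = i - span_start                 # 0-based position within the span
        inputs[i] = left + right + position_embeddings[offset]  # list concatenation
    return inputs


# Toy encoder states (6 positions, 2 dimensions) and 1-dimensional position embeddings.
hidden = [[float(t), float(t) + 0.5] for t in range(6)]
position_embeddings = [[10.0], [20.0]]

inputs = sbo_inputs(hidden, span_start=2, span_end=3,
                    position_embeddings=position_embeddings)
for position, vector in inputs.items():
    print(position, vector)
```

Note that the states at positions 2 and 3 never appear in the inputs: every token in the span is predicted from the same two boundary states and differs only in its position embedding.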
2.4 Discriminator Loss (ELECTRA)
The discriminator distinguishes real tokens from generator replacements:
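$$\mathcal{L}_{\text{Disc}} = -\sum_{t=1}^{n} \Big[ \mathbb{1}\left(\hat{x}_t = x_t\right) \log D(\hat{x}, t) + \mathbb{1}\left(\hat{x}_t \neq x_t\right) \log\big(1 - D(\hat{x}, t)\big) \Big]$$
where $\hat{x}$ is the sequence with generator samples substituted at the masked positions and $D(\hat{x}, t)$ is the discriminator's predicted probability that the token at position $t$ is original.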
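The replaced-token detection objective is a per-position binary cross-entropy, which the following sketch computes on toy values. The token ids and probabilities are invented for illustration, and `discriminator_loss` is a hypothetical helper name.

```python
import math


def discriminator_loss(original_ids, corrupted_ids, p_original):
    """Binary cross-entropy over all positions.

    p_original[t] is the discriminator's probability that the token at
    position t of the corrupted sequence is the original token.
    """
    loss = 0.0
    for original, seen, p in zip(original_ids, corrupted_ids, p_original):
        if seen == original:
            loss -= math.log(p)          # token is real: reward high p
        else:
            loss -= math.log(1.0 - p)    # token was replaced: reward low p
    return loss


original = [5, 8, 2, 9]
corrupted = [5, 3, 2, 9]                 # the generator replaced position 1
p_original = [0.9, 0.2, 0.8, 0.5]

loss = discriminator_loss(original, corrupted, p_original)
print(f"discriminator loss: {loss:.4f}")
```

Unlike the infilling loss, every position contributes here, including tokens that were never masked.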
2.5 Total Loss
The overall loss combines these components:
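$$\mathcal{L} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{SBO}} + \lambda\, \mathcal{L}_{\text{Disc}}$$
where $\lambda$ weights the discriminator term, as in the standard ELECTRA objective; the discriminator and span boundary terms apply to FLANG-ELECTRA.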
This composite objective explicitly incorporates domain relevance, infilling, and span boundary prediction (Shah et al., 2022).
3. Datasets and Pre-training Protocol
FLANG models are pre-trained using a mixture of general English and financial-domain corpora.
Pre-training Data Sources
| Dataset | # Documents | Years | % Sampled/Epoch |
|---|---|---|---|
| BooksCorpus | — | — | — |
| Wikipedia | — | — | — |
| SEC 10-K filings | 13,660 | 1993–2020 | 8 % |
| SEC 10-Q filings | 36,402 | 1993–2020 | 5 % |
| Earnings call transcripts | 151,359 | 2007–2019 | 1.5 % |
| Reuters financial news | 106,521 | 2007 | 10 % |
| Bloomberg financial news | 387,220 | 2009 | 5 % |
| Analyst reports (LexisNexis) | 201 | 2017–2020 | 100 % |
| Investopedia concept pages | 638 | N/A | 100 % |
- BooksCorpus (800M words) and Wikipedia (2.5B words) represent general English.
- Financial corpora are subsampled per epoch as indicated.
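The sampling percentages in the table determine how much each financial corpus contributes to an epoch. The short calculation below turns them into expected document counts. It assumes the percentage applies at the document level, which the table does not state explicitly, and it ignores differences in document length.

```python
# (number of documents, percent sampled per epoch), copied from the table above.
CORPORA = {
    "SEC 10-K filings": (13_660, 8.0),
    "SEC 10-Q filings": (36_402, 5.0),
    "Earnings call transcripts": (151_359, 1.5),
    "Reuters financial news": (106_521, 10.0),
    "Bloomberg financial news": (387_220, 5.0),
    "Analyst reports (LexisNexis)": (201, 100.0),
    "Investopedia concept pages": (638, 100.0),
}


def expected_docs_per_epoch(corpora):
    """Expected number of documents drawn from each corpus in one epoch."""
    return {name: docs * pct / 100.0 for name, (docs, pct) in corpora.items()}


per_epoch = expected_docs_per_epoch(CORPORA)
total = sum(per_epoch.values())

for name, count in per_epoch.items():
    print(f"{name:30s} {count:10.1f}  ({count / total:6.1%} of the financial mix)")
print(f"{'total':30s} {total:10.1f}")
```

Under this reading the news corpora dominate each epoch by document count, while the small analyst-report and Investopedia sets are seen in full every time.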
Pre-training Regimen
- Models are initialized from the Hugging Face BERT-base or ELECTRA-base checkpoints.
- Four epochs of pre-training on the combined data.
- Epochs 1–2: single-token (word) financial masking.
- Epochs 3–4: token + multi-token phrase masking.
- Masking rate: 15%.
- Adam optimizer; learning rate warm-up as in standard Transformer pre-training.
- The batch size and the exact learning-rate schedule are not specified.
4. Financial Language Understanding Evaluation (FLUE) and Empirical Results
FLUE is an open-source benchmark suite covering five core financial NLP tasks:
- Financial Sentiment Analysis (Financial PhraseBank, FiQA 2018 SA)
- News Headline Classification (Gold Headlines dataset)
- Named Entity Recognition (finance-domain NER set)
- Structure Boundary Detection (FinSBD3)
- Financial Question Answering (FiQA 2018 QA)
Summary Results (averaged over 3 seeds)
| Model | FPB (Acc) | FiQA SA (MSE) | Headline (F₁) | NER (F₁) | SBD (F₁) | FiQA QA (nDCG) |
|---|---|---|---|---|---|---|
| BERT-base | 85.6% | 0.073 | 0.967 | 0.79 | 0.95 | 0.46 |
| FinBERT | 87.2% | 0.070 | 0.968 | 0.80 | 0.89 | 0.42 |
| FLANG-BERT | 91.2% | 0.054 | 0.972 | 0.83 | 0.96 | 0.51 |
| ELECTRA-base | 88.1% | 0.066 | 0.966 | 0.78 | 0.94 | 0.52 |
| FLANG-ELECTRA | 91.9% | 0.034 | 0.980 | 0.82 | 0.97 | 0.55 |
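On every reported metric, the best score belongs to a FLANG model: FLANG-ELECTRA leads on five of the six columns and FLANG-BERT leads on NER (lower is better for the FiQA SA column, which reports MSE). Each FLANG model improves on its own base model and on FinBERT across all six metrics. The only case where a non-adapted baseline beats a FLANG model is FiQA QA, where ELECTRA-base (0.52) narrowly exceeds FLANG-BERT (0.51). Detailed ablation results are available in the paper's supplementary tables.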
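The comparisons in the results table can be checked mechanically. The sketch below copies the table into a dictionary, picks the best model per column while treating MSE as lower-is-better, and reports signed gains over a baseline. The helper names are illustrative.

```python
RESULTS = {
    "BERT-base":     {"FPB": 85.6, "FiQA SA": 0.073, "Headline": 0.967,
                      "NER": 0.79, "SBD": 0.95, "FiQA QA": 0.46},
    "FinBERT":       {"FPB": 87.2, "FiQA SA": 0.070, "Headline": 0.968,
                      "NER": 0.80, "SBD": 0.89, "FiQA QA": 0.42},
    "FLANG-BERT":    {"FPB": 91.2, "FiQA SA": 0.054, "Headline": 0.972,
                      "NER": 0.83, "SBD": 0.96, "FiQA QA": 0.51},
    "ELECTRA-base":  {"FPB": 88.1, "FiQA SA": 0.066, "Headline": 0.966,
                      "NER": 0.78, "SBD": 0.94, "FiQA QA": 0.52},
    "FLANG-ELECTRA": {"FPB": 91.9, "FiQA SA": 0.034, "Headline": 0.980,
                      "NER": 0.82, "SBD": 0.97, "FiQA QA": 0.55},
}
LOWER_IS_BETTER = {"FiQA SA"}   # mean squared error


def best_model(task):
    """Name of the model with the best score on a task."""
    pick = min if task in LOWER_IS_BETTER else max
    return pick(RESULTS, key=lambda model: RESULTS[model][task])


def gain_over(model, baseline, task):
    """Signed improvement of model over baseline (positive means better)."""
    delta = RESULTS[model][task] - RESULTS[baseline][task]
    return -delta if task in LOWER_IS_BETTER else delta


winners = {task: best_model(task) for task in RESULTS["BERT-base"]}
for task, model in winners.items():
    print(f"{task:10s} best: {model}")
print("FLANG-BERT vs BERT-base on FPB:",
      round(gain_over("FLANG-BERT", "BERT-base", "FPB"), 1), "points")
```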
5. Deployment and Usage Recommendations
- Model access: FLANG-BERT and FLANG-ELECTRA are available for download on Hugging Face. Complete code, data, and model configurations are hosted at https://salt-nlp.github.io/FLANG/.
- Fine-tuning: For classification tasks, it is recommended to use a combined loss—cross-entropy plus Supervised Contrastive Loss:
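$$\mathcal{L} = (1 - \lambda)\, \mathcal{L}_{\text{CE}} + \lambda\, \mathcal{L}_{\text{SCL}}$$
where $\mathcal{L}_{\text{CE}}$ is the cross-entropy loss, $\mathcal{L}_{\text{SCL}}$ is the supervised contrastive loss, and $\lambda \in [0, 1]$ balances the two terms.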
Refer to Section –A.3 of (Shah et al., 2022) for the exact formula.
- Usual hyperparameters: 2–5 epochs, learning rate 9–0, batch size 16–32, weight decay 0.01.
- For regression and QA tasks, standard task heads from the Transformers library may be used (MSE loss for regression, bi-encoder or cross-encoder for QA).
This suggests that FLANG-BERT and FLANG-ELECTRA can be integrated into existing NLP pipelines for financial tasks with minimal adaptation, while yielding substantial gains where financial vocabulary and structure are critical.
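The combined classification loss recommended above can be sketched in plain Python as a weighted sum of cross-entropy and a supervised contrastive term. This is an illustration under assumptions, not the formula from the paper: the convex weighting with `lam`, the temperature value, the L2 normalisation, and the averaging over all anchors follow the common supervised contrastive formulation, and all function names are hypothetical.

```python
import math


def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def supervised_contrastive(embeddings, labels, temperature=0.3):
    """Pull same-label examples together and push the rest of the batch apart."""
    def normalise(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    z = [normalise(v) for v in embeddings]
    n = len(z)
    total = 0.0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # anchors without a same-label partner contribute nothing
        sims = {k: math.exp(sum(a * b for a, b in zip(z[i], z[k])) / temperature)
                for k in range(n) if k != i}
        denom = sum(sims.values())
        total -= sum(math.log(sims[j] / denom) for j in positives) / len(positives)
    return total / n


def combined_loss(probs, embeddings, labels, lam=0.5, temperature=0.3):
    """(1 - lam) * cross-entropy + lam * supervised contrastive loss."""
    return ((1.0 - lam) * cross_entropy(probs, labels)
            + lam * supervised_contrastive(embeddings, labels, temperature))


# Toy batch: two classes, 2-dimensional sentence embeddings.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
probs = [[0.8, 0.2], [0.7, 0.3], [0.3, 0.7], [0.1, 0.9]]
labels = [0, 0, 1, 1]

print(f"combined loss: {combined_loss(probs, embeddings, labels):.4f}")
```

Setting `lam=0` recovers ordinary cross-entropy fine-tuning, so the contrastive term can be introduced gradually when tuning on a new classification task.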
6. Distinctive Features and Performance Analysis
Distinctive aspects of FLANG-BERT and FLANG-ELECTRA include:
- Systematic augmentation of vocabulary for financial precision (~8.2k terms/phrases).
- Layered masking regime emphasizing domain relevancy, including span-based and phrase-level masking.
- Integration of the span boundary objective, especially in FLANG-ELECTRA, to leverage local context for masked span reconstruction.
- Empirical validation demonstrating consistent outperformance over BERT-base, ELECTRA-base, and domain-adapted baselines (FinBERT), especially on complex multi-sentence financial tasks and open-domain QA.
- The FLUE benchmark suite is designed to be more comprehensive and challenging than previously available financial NLP benchmarks.
A plausible implication is that targeted adaptation through vocabulary design and masking objectives is critical for extending pre-trained language models to specialized technical domains such as finance. These adaptations achieve superior empirical results while preserving compatibility with standard Transformer and ELECTRA frameworks (Shah et al., 2022).