nuFormer: Transformer for Financial Data
- nuFormer is a transformer-based architecture that tokenizes heterogeneous financial transactions, capturing long-range behavioral patterns.
- It integrates self-supervised pretraining with joint fusion of transformer embeddings and tabular features to improve prediction accuracy.
- Empirical results at Nubank demonstrate measurable gains in AUC and reduced churn, validating its efficacy for large-scale financial applications.
nuFormer is a transformer-based representation learning architecture designed for modeling financial transaction data at scale, addressing limitations of traditional tabular-feature-based machine learning systems in banking and financial services. Developed for large-scale recommendation, risk, and fraud detection tasks, nuFormer incorporates self-supervised learning, specialized tokenization of heterogeneous financial attributes, and an end-to-end joint fusion mechanism integrating transformer-derived embeddings with classic tabular features. The approach has been empirically validated on production-scale datasets, notably at Nubank, demonstrating measurable gains over state-of-the-art (SOTA) baselines in recommendation and user retention settings (Braithwaite et al., 31 Jul 2025).
1. Motivation and Problem Setting
The primary motivation is to automate and enhance the extraction of behavioral user representations from large volumes of raw financial transaction records, which typically encompass text (transaction descriptions), numerical (amounts), and categorical (dates, account types) information. Conventional production systems in finance often rely on labor-intensive, hand-engineered tabular feature pipelines, which are lossy and can omit semantically rich signals found in unstructured data. nuFormer addresses these limitations by adopting transformer models—proven in NLP—to learn end-to-end representations from native, minimally processed transaction sequences.
Key challenges tackled include:
- Handling the multi-modal, high-cardinality, and sequential nature of transaction data.
- Achieving representation learning at enterprise scale (billions of transactions, millions of users).
- Bridging the gap between legacy tabular ML workflows and modern self-supervised approaches, thus maximizing available signal for high-stakes financial prediction.
2. Transformer Adaptations for Structured Financial Data
nuFormer leverages a causal transformer architecture (akin to GPT), but requires adaptations to represent structured transaction inputs. Rather than using sequential IDs or fixed schemas, transactions are encoded as tokenized key-value streams. This process involves:
- Numerical Encoding: Continuous features such as transaction amounts are quantized into discrete bins; the sign receives its own token, and the magnitude is mapped to one of 21 quantization-bin tokens.
- Date Encoding: The day, month, and weekday of each transaction date are each mapped to dedicated tokens.
- Textual Tokenization: Transaction descriptions are tokenized using byte-pair encoding (BPE).
- Stream Construction: The per-transaction tokens are concatenated, delimited by a separator token, and streamed to the transformer as a single chronological sequence.
- Causal Context: The transformer processes the entire chronological user transaction history, allowing subsequent token predictions to incorporate long-range seasonal and behavioral dependencies.
This tokenization design maximizes representational capacity for both structured and unstructured transaction features, bypassing the need for prior feature engineering.
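The tokenization scheme above can be sketched as follows. The token names (`<POS>`, `<AMT_k>`, `<DAY_d>`, `<SEP>`, etc.) and the bin-edge scheme are illustrative assumptions — the paper's exact vocabulary is not reproduced here — but the structure (sign token, 21 magnitude bins, per-component date tokens, BPE description tokens, separator) follows the description:

```python
from datetime import date

NUM_BINS = 21  # magnitude quantization bins, per the description above

def quantize_amount(amount: float, max_abs: float = 10_000.0) -> list[str]:
    """Map a signed amount to a sign token plus one of NUM_BINS magnitude tokens."""
    sign = "<POS>" if amount >= 0 else "<NEG>"
    frac = min(abs(amount) / max_abs, 1.0)
    bin_idx = min(int(frac * NUM_BINS), NUM_BINS - 1)
    return [sign, f"<AMT_{bin_idx}>"]

def encode_date(d: date) -> list[str]:
    """Day, month, and weekday each become a dedicated token."""
    return [f"<DAY_{d.day}>", f"<MONTH_{d.month}>", f"<WDAY_{d.weekday()}>"]

def encode_transaction(amount: float, d: date, desc_tokens: list[str]) -> list[str]:
    """One transaction's slice of the stream: amount, date, BPE description, separator."""
    return quantize_amount(amount) + encode_date(d) + desc_tokens + ["<SEP>"]

# desc_tokens stands in for a BPE tokenizer's output on the raw description text
toks = encode_transaction(-42.50, date(2024, 3, 15), ["uber", "##_trip"])
```

A user's full input is simply the concatenation of `encode_transaction` outputs in chronological order, which is what the causal transformer consumes.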
3. Self-Supervised Pretraining and Fine-Tuning Methodology
Self-Supervised Learning (SSL) Phase
nuFormer is pre-trained using a next token prediction (NTP) objective. The sequence of tokenized transactions is presented to the transformer, which learns to predict the subsequent token based on the context. This task is analogous to causal (autoregressive) language modeling in NLP, but is adapted for the heterogeneously tokenized structure of financial data.
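The NTP objective itself is standard cross-entropy over next tokens. The toy sketch below makes this concrete with a logit table standing in for the transformer (the vocabulary and stream are invented for illustration); the loss computation is the same one the real model minimizes:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<POS>", "<AMT_3>", "<DAY_15>", "<SEP>", "uber"]  # toy vocabulary
V = len(vocab)
stream = [0, 1, 4, 2, 3]  # token ids of one tokenized transaction

# Stand-in "model": logits[i] = scores for the token following token i.
# The real model produces these logits from the full causal context.
logits = rng.normal(size=(V, V))

def ntp_loss(logits, stream):
    """Average cross-entropy of predicting each next token in the stream."""
    total = 0.0
    for cur, nxt in zip(stream[:-1], stream[1:]):
        z = logits[cur]
        logp = z - z.max() - np.log(np.exp(z - z.max()).sum())  # log-softmax
        total -= logp[nxt]
    return total / (len(stream) - 1)

loss = ntp_loss(logits, stream)
```

Driving the correct next token's logit up drives this loss toward zero, which is what pretraining optimizes over billions of transaction tokens.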
End-to-End Fine-Tuning and Joint Fusion
After pretraining, nuFormer is fine-tuned for downstream supervised tasks. The key steps include:
- Extracting the context vector from the last token’s representation (which integrates all previous input).
- Appending an MLP classifier head for tasks like binary product activation prediction.
- Integrating handcrafted tabular features via a custom deep network (modified Deep Cross Network V2, DCNv2).
- Concatenating the transformer output with tabular feature embeddings to enable “joint fusion.”
- Training the fusion model end-to-end, optimizing the transformer, DCNv2, and output MLP jointly.
- Employing regularization techniques such as periodic activations for numerical features and weight decay, as well as LoRA (Low-Rank Adaptation) during fine-tuning to combat overfitting and reduce memory consumption.
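The fusion steps above can be sketched in numpy. Dimensions and initializations are illustrative assumptions; the single `cross_layer` stands in for a full DCNv2 tower, and `txn_emb` stands in for the transformer's last-token context vector, which in production is produced and trained end-to-end:

```python
import numpy as np

rng = np.random.default_rng(0)
D_TXN, D_TAB, D_HID = 64, 291, 32  # embedding dim, tabular features, MLP hidden size

def cross_layer(x0, x, W, b):
    """One DCNv2-style cross layer: x0 * (W @ x + b) + x (explicit feature crossing)."""
    return x0 * (W @ x + b) + x

def fused_logit(txn_emb, tab_feats, params):
    """Concatenate transformer context vector with crossed tabular features; MLP head."""
    W_c, b_c, W1, b1, w2, b2 = params
    tab = cross_layer(tab_feats, tab_feats, W_c, b_c)
    z = np.concatenate([txn_emb, tab])      # the "joint fusion" concatenation
    h = np.maximum(W1 @ z + b1, 0.0)        # ReLU hidden layer
    return float(w2 @ h + b2)               # scalar logit, e.g. product activation

params = (
    rng.normal(size=(D_TAB, D_TAB)) * 0.01, np.zeros(D_TAB),
    rng.normal(size=(D_HID, D_TXN + D_TAB)) * 0.1, np.zeros(D_HID),
    rng.normal(size=D_HID) * 0.1, 0.0,
)
logit = fused_logit(rng.normal(size=D_TXN), rng.normal(size=D_TAB), params)
prob = 1.0 / (1.0 + np.exp(-logit))
```

Because the transformer, the cross network, and the head all feed one loss, gradients from the supervised label flow into all three components, which is what "end-to-end" fine-tuning means here.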
4. Experimental Evaluation and Performance
nuFormer was evaluated on a realistic recommendation system (recsys) problem at Nubank, with the following experimental setup:
| Dataset | Description | Size |
|---|---|---|
| Transactions | Credit/debit card, billet, external sources | >200M training, 2M test |
| Tabular features | Aggregates, bureau scores, event metrics | 291 features |
| Labels | Binary interaction (e.g., product activation) | 6-month future window |
Performance assessment highlighted:
- Joint Fusion AUC Gain: Integrating transformer user embeddings with the DNN-processed tabular features yielded a +1.25% relative improvement in test AUC over a LightGBM feature-only baseline.
- Scaling Effects: Increasing transformer parameters (from 24M to 330M) systematically improved AUC (~0.31 to ~0.52 absolute gain). Expanded context window (from 512 to 2048 tokens) and increased fine-tuning data volume (from 5M to 100M) both amplified gains.
- Deployment Impact: A production deployment targeting churn reduction observed a relative decrease in churn of 4.4% against the system without transformer-augmented representations, significantly exceeding typical business benchmarks for such launches.
5. Architectural and Engineering Considerations
Key architectural features:
- The tokenization pipeline efficiently represents heterogeneous data attributes.
- The causal transformer architecture enables capture of both local activity spikes and long-term patterns such as seasonality.
- “Joint fusion” with DCNv2 preserves the practical value of legacy feature pipelines, providing a smooth transition for production teams accustomed to tabular features.
- Regularization via periodic activations (for numerical features) and LoRA mitigates overfitting and catastrophic forgetting, crucial for reliable adaptation in financial systems.
- Scalability to hundreds of millions of rows and large context sizes is empirically validated.
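Two of the regularization devices listed above can be sketched briefly. Shapes, rank, and frequencies are illustrative assumptions, not the paper's settings: a LoRA update trains only a low-rank product `B @ A` on top of a frozen weight, and a periodic activation maps a scalar feature to sine/cosine pairs:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 16, 16, 2          # rank r << d: only (d_out + d_in) * r params train

W = rng.normal(size=(d_out, d_in))  # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01
B = np.zeros((d_out, r))            # zero init: adapted layer starts identical to W

def lora_forward(x, alpha=1.0):
    """Adapted linear layer: frozen W plus trainable low-rank correction B @ A."""
    return (W + alpha * (B @ A)) @ x

def periodic_encode(v, freqs=(1.0, 2.0, 4.0)):
    """Map a scalar numerical feature to [sin(f*v), cos(f*v)] for each frequency."""
    return np.concatenate([[np.sin(f * v), np.cos(f * v)] for f in freqs])

x = rng.normal(size=d_in)
```

Keeping `W` frozen is what limits catastrophic forgetting, and the zero-initialized `B` guarantees fine-tuning starts exactly from the pretrained behavior.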
A plausible implication is that, due to the modular tokenization scheme, the architectural pattern could be generalized for other multi-modal sequential tasks in finance beyond recommendations, such as dynamic risk or anti-fraud modeling, given sufficient labeled data.
6. Broader Implications and Future Directions
The nuFormer approach demonstrates that transformer-based SSL can materially reduce the need for domain-specific manual feature engineering in financial applications, potentially improving not only predictive accuracy but also the agility and maintainability of production ML systems. By enabling direct modeling of rich, heterogeneous, and long-range transactional histories, such architectures unlock a broader class of behavioral modeling tasks that might have been infeasible due to feature engineering bottlenecks.
Anticipated extensions include:
- Scaling both model size and transaction context windows further to capture more subtle patterns in ever-growing datasets.
- Deriving domain-specific scaling laws for representation quality in heterogeneous sequential data.
- Generalizing the joint fusion paradigm to ingest external, non-transactional, and graph-based signals.
- Adapting the nuFormer framework as a modular foundation model for diverse downstream financial tasks, such as real-time fraud detection or credit underwriting.
These trends suggest a shift toward foundation models in financial machine learning, paralleling developments in NLP, but adapted to the complexity and regulatory requirements of banking data (Braithwaite et al., 31 Jul 2025).