- The paper shows that architectural choices can yield an 11.8 percentage point gain over alternative methods in operand-explicit financial QA.
- It compares retrieval-augmented generation, memory augmentation, and structured memory, highlighting trade-offs in latency, token consumption, and accuracy.
- The study recommends hybrid frameworks with task-aware routing to balance precision in deterministic tasks and robustness in conversational QA settings.
Architectural Trade-offs in Retrieval and Memory Augmentation for Financial QA Under SME Constraints
Problem Context and Significance
The deployment of AI-driven financial analytics is increasingly essential for enterprises, yet small- and medium-sized enterprises (SMEs) operate under severe infrastructure constraints: limited compute budgets, no dedicated AI teams, and no viable access to API inference at scale. This paper investigates how three architectural choices, namely retrieval-augmented generation (RAG), memory augmentation, and structured memory, affect performance on financial QA tasks under a fixed, locally hosted 8B LLM, explicitly modeling realistic SME deployment envelopes (2604.17979).
Experimental Framework and Architectural Variants
A unified evaluation framework is introduced, abstracting financial numerical QA as a problem of grounding, reasoning, and arithmetic over semi-structured financial documents. Four architectures are assessed:
- Baseline LLM: Standard instruction-tuned 8B model.
- Retrieval-Augmented Generation (RAG): Dynamic evidence retrieval with granular fact chunking; top-k relevant snippets are injected at inference.
- Memory-Augmented Reasoning (Mem0-style): Persistent, free-form memory accumulates Q/A context across turns.
- Structured Memory Representation (Structured Mem0): Deterministic, schema-grounded table row serialization; retrieval selects atomic attribute-value pairs, minimizing ambiguity.
All experiments use Llama 3.1 8B hosted via Ollama with a standardized inference configuration. Metrics include exact and tolerance-based match, semantic judge correctness, latency, prompt size, and token consumption.
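The exact and tolerance-based match metrics can be sketched as follows. This is a minimal illustration, not the paper's scoring code: the rounding precision, the 1% relative tolerance, and the function names `exact_match`, `close_match`, and `score` are all assumptions, and the "Corrected Exact" variant reported below is not reproduced here.

```python
import math

def exact_match(pred: float, gold: float, ndigits: int = 4) -> bool:
    """Strict match after rounding both values (ndigits=4 is an assumed setting)."""
    return round(pred, ndigits) == round(gold, ndigits)

def close_match(pred: float, gold: float, rel_tol: float = 0.01) -> bool:
    """Tolerance-based match: relative error within rel_tol (1% is an assumed setting)."""
    return math.isclose(pred, gold, rel_tol=rel_tol, abs_tol=1e-9)

def score(preds, golds):
    """Return the fraction of predictions passing each metric."""
    n = len(golds)
    pairs = list(zip(preds, golds))
    return {
        "exact": sum(exact_match(p, g) for p, g in pairs) / n,
        "close": sum(close_match(p, g) for p, g in pairs) / n,
    }

result = score([10.0, 10.05, 12.0], [10.0, 10.0, 10.0])
```

In this toy run, the second prediction fails the exact metric but passes the tolerance-based one, which is the kind of gap the two metrics are meant to separate.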
On the FinQA benchmark, which demands operand-level reasoning with explicit context, Structured Mem0 outperforms all others (Corrected Exact: 0.354, Close: 0.423), followed by Baseline LLM (Exact: 0.319). The architectural bias of Structured Mem0—typed entity-metric-period tuples and atomic fact retrieval—reduces distractor noise and operand ambiguity, enabling precision even in compute-constrained environments.
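The typed entity-metric-period representation can be illustrated with a small sketch. The `Fact` class, the `serialize_table` and `lookup` helpers, and the table layout (a header row of periods, one metric per row) are hypothetical choices made for illustration; the paper's actual schema and retrieval logic may differ.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    """One atomic attribute-value pair: (entity, metric, period) -> value."""
    entity: str
    metric: str
    period: str
    value: float

def serialize_table(entity, header, rows):
    """Flatten a table into atomic facts.

    header: first cell labels the metric column, remaining cells are periods.
    rows:   first cell is the metric name, remaining cells are numeric values.
    """
    periods = header[1:]
    facts = []
    for row in rows:
        metric = row[0].strip().lower()  # normalize metric names deterministically
        for period, value in zip(periods, row[1:]):
            facts.append(Fact(entity, metric, period, float(value)))
    return facts

def lookup(facts, metric, period):
    """Retrieve only the facts matching the requested metric and period."""
    key = metric.strip().lower()
    return [f for f in facts if f.metric == key and f.period == period]

facts = serialize_table(
    "ACME",  # hypothetical company
    ["metric", "2019", "2018"],
    [["Revenue", 120.0, 100.0], ["Net income", 12.0, 8.0]],
)
hits = lookup(facts, "revenue", "2018")
```

Because each fact carries its own entity, metric, and period, a lookup returns a single operand instead of a text chunk that mixes several candidate numbers.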
RAG offers a latency advantage (p50: 0.913s) but suffers from operand selection failures, notably in percentage questions that require denominator disambiguation. The observed gap between architectures (Structured Mem0 vs. RAG: +11.8pp on Corrected Exact) is achieved entirely through architectural choices, not scale or supervision. Memory-augmented architectures incur higher latency and token consumption without meaningful accuracy gains.
Absolute performance remains below that of supervised program induction methods on larger models, reinforcing that architectural optimization, rather than model scaling, is the practical lever for SMEs.
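Denominator disambiguation is easy to illustrate with a percentage change. The figures below are invented for illustration and are not drawn from FinQA; `pct_change` is a hypothetical helper name.

```python
def pct_change(old: float, new: float, denominator: float) -> float:
    """Percentage change of new relative to old, divided by the chosen denominator."""
    return 100.0 * (new - old) / denominator

old_revenue, new_revenue = 100.0, 120.0

# Correct: growth is measured against the earlier period.
correct = pct_change(old_revenue, new_revenue, denominator=old_revenue)

# Operand selection failure: the later period is picked as the denominator.
wrong = pct_change(old_revenue, new_revenue, denominator=new_revenue)
```

Both computations use arithmetically valid operands from the same table, so the error lies in which value was selected as the base, not in the arithmetic itself.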
ConvFinQA introduces referential ambiguity: implicit metric references, cross-turn dependency, and context drift. Here, RAG consistently dominates numerically (Auto Close: 52.75%), with statistically reliable improvement over Baseline LLM (+4.3%) and memory-based architectures (+9.5%).
Structured memory and persistent memory amplify early entity misalignment; errors committed during turn 0 propagate across the entire dialog, leading to sharp monotonic degradation in downstream turns (median cascade failure: 100%). RAG’s dynamic per-turn retrieval enables recovery from early grounding errors, reflecting robustness under referential uncertainty.
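One plausible way to quantify this cascade is sketched below. The definition used here (the share of later turns that are wrong, given that turn 0 was wrong) is an assumption; the paper's exact formula for cascade failure is not given in this summary, and the function names are hypothetical.

```python
from statistics import median

def cascade_failure(turn_correct):
    """Share of later turns that fail, given that turn 0 failed.

    turn_correct: list of booleans, one per dialog turn.
    Returns None when turn 0 was correct or there are no later turns.
    """
    if len(turn_correct) < 2 or turn_correct[0]:
        return None
    later = turn_correct[1:]
    return sum(not ok for ok in later) / len(later)

def median_cascade_failure(dialogs):
    """Median cascade failure over dialogs whose first turn failed."""
    rates = [r for r in map(cascade_failure, dialogs) if r is not None]
    return median(rates) if rates else None

dialogs = [
    [False, False, False],  # turn-0 error, never recovers
    [False, True, False],   # turn-0 error, partial recovery
    [True, True],           # turn 0 correct, excluded
    [False, False],         # turn-0 error, never recovers
]
rate = median_cascade_failure(dialogs)
```

Under this definition, a median of 100% means that in at least half of the dialogs with a turn-0 error, no later turn was answered correctly.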
Memory-heavy architectures, especially Mem0-Augmented, exhibit pronounced fluency-accuracy divergence: their responses appear semantically correct (Judge Correct: 63.76%), but numeric grounding fails, systematically overstating real-world reliability.
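The fluency-accuracy divergence can be measured by comparing judge verdicts with numeric checks on the same answers. The sketch below assumes each evaluation record carries two boolean fields, `judge_correct` and `numeric_close`; these field names and the function are hypothetical.

```python
def fluency_accuracy_gap(records):
    """Compare semantic-judge correctness with numeric correctness.

    records: list of dicts with boolean 'judge_correct' and 'numeric_close'.
    """
    n = len(records)
    judge = sum(r["judge_correct"] for r in records) / n
    numeric = sum(r["numeric_close"] for r in records) / n
    # Answers the judge accepted although the number itself was wrong.
    fluent_but_wrong = sum(
        r["judge_correct"] and not r["numeric_close"] for r in records
    ) / n
    return {
        "judge": judge,
        "numeric": numeric,
        "gap": judge - numeric,
        "fluent_but_wrong": fluent_but_wrong,
    }

records = [
    {"judge_correct": True, "numeric_close": True},
    {"judge_correct": True, "numeric_close": False},
    {"judge_correct": True, "numeric_close": False},
    {"judge_correct": False, "numeric_close": False},
]
report = fluency_accuracy_gap(records)
```

A large `fluent_but_wrong` share indicates that judge-based scores alone would overstate how often the system returns a usable number.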
Mechanistic Interpretation: Task-Architecture Alignment
A cross-dataset synthesis reveals a fundamental architectural inversion. In FinQA, computational uncertainty is dominant, and structured persistence increases precision. In ConvFinQA, referential uncertainty dominates, and dynamic retrieval is essential. The architectures differ in when they commit to an entity interpretation: Structured Mem0 commits upfront, which is beneficial when semantic identity is stable, whereas RAG defers commitment, allowing re-grounding amid ambiguity.
Model scaling is shown to be insufficient for breaking conversational accuracy ceilings (~50-55%), which are caused by entity drift, denominator ambiguity, and referential misalignment—not arithmetic computation. Targeted supervision in pointer-grounding and entity identification is recommended for future advances.
Implications for System Design and Industry Practice
Dual-Mode Routing and Deployment
Financial workflows bifurcate into deterministic pipelines (reporting, reconciliation) and conversational analytics. Empirical results support a hybrid routing framework: structured memory for operand-explicit queries and RAG for conversational, reference-implicit tasks. Routing on the presence of dialogue history yields a 2.9pp accuracy gain over single architectures with no model or hardware changes.
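A router keyed on dialogue history can be very small. The sketch below is an illustration under assumptions, not the paper's implementation: the `route` function and the two backend callables are hypothetical stand-ins for a structured-memory pipeline and a RAG pipeline.

```python
from typing import Callable, Sequence

# A backend takes the question and prior turns and returns an answer string.
Answerer = Callable[[str, Sequence[str]], str]

def route(
    question: str,
    history: Sequence[str],
    structured: Answerer,
    rag: Answerer,
) -> str:
    """Send queries with dialogue history to RAG, standalone queries to structured memory."""
    backend = rag if len(history) > 0 else structured
    return backend(question, history)

# Stub backends for demonstration only.
def structured_stub(question, history):
    return "structured"

def rag_stub(question, history):
    return "rag"

single_turn = route("What was 2019 revenue?", [], structured_stub, rag_stub)
multi_turn = route("And the year before?", ["What was 2019 revenue?"], structured_stub, rag_stub)
```

The routing signal here is only whether prior turns exist, so it needs no extra model call or classifier.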
Cost-Efficiency and Auditability
Memory-heavy architectures increase token consumption ~3x with lower conversational accuracy. RAG’s cost-accuracy balance is optimal for API-scale deployments; for SMEs, retrieval-first methods are the practical path forward. Retrieval-grounded architectures also improve transparency and traceability, linking outputs to evidence for regulatory compliance.
Breaking Accuracy Ceilings and Supervision
Performance improvements beyond moderate accuracy require labeled data and targeted supervision: supervised entity-grounding models reliably outperform unsupervised LLMs even at larger scales. System designers should prioritize program induction and explicit entity tracking over further scaling unsupervised models.
Conclusion
Architectural choices, rather than model size, govern performance in financial QA under SME constraints. In deterministic, operand-explicit environments, structured memory and symbolic normalization deliver precision. In conversational, reference-implicit settings, retrieval-augmented strategies provide robustness and recoverability. The observed architectural inversion is actionable: hybrid, task-aware routing in production yields measurable gains in accuracy, efficiency, and reliability, confirming that alignment between architectural bias and uncertainty structure is central to practical financial AI adoption.
Future work should extend architectural evaluation to additional base models, validate hybrid frameworks in live enterprise settings, and develop lightweight task-structure classifiers for operational routing. Targeted supervision in entity-grounding presents a viable path to breaking the conversational accuracy ceiling, with implications for broader AI system design in finance.
Reference:
"Architecture Matters More Than Scale: A Comparative Study of Retrieval and Memory Augmentation for Financial QA Under SME Compute Constraints" (2604.17979).