Architecture Matters More Than Scale: A Comparative Study of Retrieval and Memory Augmentation for Financial QA Under SME Compute Constraints

Published 20 Apr 2026 in cs.IR | (2604.17979v1)

Abstract: The rapid adoption of AI and LLMs is transforming financial analytics by enabling natural language interfaces for reporting, decision support, and automated reasoning. However, limited empirical understanding exists regarding how different LLM-based reasoning architectures perform across realistic financial workflows, particularly under the cost, accuracy, and compliance constraints faced by small and medium-sized enterprises (SMEs). SMEs typically operate within severe infrastructure constraints, lacking cloud GPU budgets, dedicated AI teams, and API-scale inference capacity, making architectural efficiency a first-class concern. To ensure practical relevance, we introduce an explicit SME-constrained evaluation setting in which all experiments are conducted using a locally hosted 8B-parameter instruction-tuned model without cloud-scale infrastructure. This design isolates the impact of architectural choices within a realistic deployment environment. We systematically compare four reasoning architectures: baseline LLM, retrieval-augmented generation (RAG), structured long-term memory, and memory-augmented conversational reasoning across both FinQA and ConvFinQA benchmarks. Results reveal a consistent architectural inversion: structured memory improves precision in deterministic, operand-explicit tasks, while retrieval-based approaches outperform memory-centric methods in conversational, reference-implicit settings. Based on these findings, we propose a hybrid deployment framework that dynamically selects reasoning strategies to balance numerical accuracy, auditability, and infrastructure efficiency, providing a practical pathway for financial AI adoption in resource-constrained environments.


Authors (7)

Summary

The paper shows that architectural choices alone can yield an 11.8 percentage point gain (Structured Mem0 over RAG) in operand-explicit financial QA.
It compares retrieval-augmented generation, memory augmentation, and structured memory, highlighting trade-offs in latency, token consumption, and accuracy.
The study recommends hybrid frameworks with task-aware routing to balance precision in deterministic tasks and robustness in conversational QA settings.

Architectural Trade-offs in Retrieval and Memory Augmentation for Financial QA Under SME Constraints

Problem Context and Significance

The deployment of AI-driven financial analytics is increasingly essential for enterprises, yet small and medium-sized enterprises (SMEs) operate under severe infrastructure constraints: limited compute budgets, no dedicated AI teams, and no capacity for API-scale inference. This paper investigates how architectural choices, namely retrieval-augmented generation (RAG), memory augmentation, and structured memory, affect performance in financial QA tasks under a fixed, locally hosted 8B LLM, explicitly modeling realistic SME deployment envelopes (2604.17979).

Experimental Framework and Architectural Variants

A unified evaluation framework is introduced, abstracting financial numerical QA as a problem of grounding, reasoning, and arithmetic over semi-structured financial documents. Four architectures are assessed:

Baseline LLM: Standard instruction-tuned 8B model.
Retrieval-Augmented Generation (RAG): Dynamic evidence retrieval with granular fact chunking; top-k relevant snippets are injected at inference.
Memory-Augmented Reasoning (Mem0-style): Persistent, free-form memory accumulates Q/A context across turns.
Structured Memory Representation (Structured Mem0): Deterministic, schema-grounded table row serialization; retrieval selects atomic attribute-value pairs, minimizing ambiguity.
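The structured variant can be pictured with a minimal sketch. This is not the paper's implementation: the `Fact` class, the `serialize_table` and `lookup` helpers, and the company and figures are all hypothetical, chosen only to show how a table row becomes atomic entity-metric-period facts that can be retrieved by exact key.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Fact:
    """One atomic fact: an entity-metric-period tuple plus its value."""
    entity: str
    metric: str
    period: str
    value: float


def serialize_table(entity: str, header: List[str],
                    rows: Dict[str, List[float]]) -> List[Fact]:
    """Flatten a table (metric rows x period columns) into atomic facts."""
    facts = []
    for metric, values in rows.items():
        for period, value in zip(header, values):
            facts.append(Fact(entity, metric.lower(), period, value))
    return facts


def lookup(facts: List[Fact], metric: str, period: str) -> Optional[float]:
    """Exact-key retrieval: return the matching value, or None if absent."""
    for fact in facts:
        if fact.metric == metric.lower() and fact.period == period:
            return fact.value
    return None


# Hypothetical company and figures, for illustration only.
facts = serialize_table(
    "ExampleCo",
    ["2022", "2023"],
    {"Revenue": [80.0, 100.0], "Net income": [8.0, 12.0]},
)
print(lookup(facts, "revenue", "2023"))
```

Because each fact carries its own metric and period, a lookup either returns one unambiguous operand or nothing at all, which is one plausible reading of how schema grounding reduces distractor noise.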

All experiments employ Llama 3.1 8B hosted via Ollama with a standardized inference configuration. Metrics include exact and tolerance-based match, semantic judge correctness, latency, prompt size, and token consumption.
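As a rough illustration of the difference between exact and tolerance-based matching, the sketch below scores a free-text answer against a gold number. The parsing rule, the two-decimal rounding, and the 1% relative tolerance are assumptions made for this example; the summary does not state the thresholds the paper uses.

```python
import math
import re
from typing import Optional


def parse_number(text: str) -> Optional[float]:
    """Extract the first number in an answer, ignoring commas and '$'."""
    cleaned = text.replace(",", "").replace("$", "")
    match = re.search(r"-?\d+(?:\.\d+)?", cleaned)
    return float(match.group()) if match else None


def exact_match(pred: str, gold: float, decimals: int = 2) -> bool:
    """True when prediction and gold agree after rounding (assumed: 2 dp)."""
    value = parse_number(pred)
    return value is not None and round(value, decimals) == round(gold, decimals)


def close_match(pred: str, gold: float, rel_tol: float = 0.01) -> bool:
    """True when prediction is within a relative tolerance (assumed: 1%)."""
    value = parse_number(pred)
    return value is not None and math.isclose(value, gold, rel_tol=rel_tol)


print(exact_match("The answer is 25.00%", 25.0))
print(exact_match("about 24.9", 25.0), close_match("about 24.9", 25.0))
```

An answer such as "about 24.9" fails the exact check but passes the tolerance check, which is why the two metrics are reported separately.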

Performance Analysis: FinQA (Deterministic Single-Turn)

On the FinQA benchmark, which demands operand-level reasoning with explicit context, Structured Mem0 outperforms all others (Corrected Exact: 0.354, Close: 0.423), followed by Baseline LLM (Exact: 0.319). The architectural bias of Structured Mem0—typed entity-metric-period tuples and atomic fact retrieval—reduces distractor noise and operand ambiguity, enabling precision even in compute-constrained environments.

RAG, while providing latency advantages (p50 latency: 0.913 s), suffers from operand selection failures, notably in percentage questions requiring denominator disambiguation. The observed performance gap between architectures (Structured Mem0 vs. RAG: +11.8 pp on Corrected Exact) is achieved entirely via architectural choices, not scale or supervision. Memory-augmented architectures incur higher latency and token consumption without meaningful accuracy gains.
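The denominator problem is easy to see with a small worked example. The figures below are hypothetical and the `pct_change` helper is written only for this illustration: the same two operands give different answers depending on which period is chosen as the base.

```python
def pct_change(new: float, old: float, denominator: str = "old") -> float:
    """Fractional change between two values, relative to the chosen base."""
    base = old if denominator == "old" else new
    return (new - old) / base


# Hypothetical revenue figures, for illustration only.
revenue = {"2022": 80.0, "2023": 100.0}

# Growth from 2022 to 2023 is conventionally measured against the earlier year.
correct = pct_change(revenue["2023"], revenue["2022"], denominator="old")

# Picking the later year as the base yields a different, wrong figure.
wrong = pct_change(revenue["2023"], revenue["2022"], denominator="new")

print(f"{correct:.2%} vs {wrong:.2%}")
```

Both results use correctly retrieved operands, so a retrieval step can succeed while the final answer is still wrong.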

Absolute performance remains below that of supervised program induction methods on larger models, reinforcing that architectural optimization, rather than model scaling, is the practical lever for SMEs.

Performance Analysis: ConvFinQA (Conversational Multi-Turn)

ConvFinQA introduces referential ambiguity: implicit metric references, cross-turn dependencies, and context drift. Here, RAG consistently achieves the highest numeric accuracy (Auto Close: 52.75%), with statistically reliable improvements over Baseline LLM (+4.3%) and memory-based architectures (+9.5%).

Structured memory and persistent memory amplify early entity misalignment; errors committed during turn 0 propagate across the entire dialog, leading to sharp monotonic degradation in downstream turns (median cascade failure: 100%). RAG’s dynamic per-turn retrieval enables recovery from early grounding errors, reflecting robustness under referential uncertainty.

Memory-heavy architectures, especially Mem0-Augmented, exhibit pronounced fluency-accuracy divergence: their responses appear semantically correct (Judge Correct: 63.76%), but numeric grounding fails, so judge-based scores systematically overstate real-world reliability.

Mechanistic Interpretation: Task-Architecture Alignment

A cross-dataset synthesis reveals a fundamental architectural inversion. In FinQA, computational uncertainty is dominant; structured persistence increases precision. In ConvFinQA, referential uncertainty dominates; dynamic retrieval is essential. Architectures differ in when they commit to an entity interpretation: Structured Mem0 commits upfront, which is beneficial when semantic identity is stable, whereas RAG defers commitment, allowing re-grounding amid ambiguity.

Model scaling is shown to be insufficient for breaking conversational accuracy ceilings (~50-55%), which are caused by entity drift, denominator ambiguity, and referential misalignment—not arithmetic computation. Targeted supervision in pointer-grounding and entity identification is recommended for future advances.

Implications for System Design and Industry Practice

Dual-Mode Routing and Deployment

Financial workflows bifurcate into deterministic pipelines (reporting, reconciliation) and conversational analytics. Empirical results support a hybrid routing framework: structured memory for operand-explicit queries and RAG for conversational, reference-implicit tasks. Routing based on the presence of dialogue history yields a 2.9 pp accuracy gain over single architectures, with no model or hardware changes.
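The routing rule described above can be sketched in a few lines. The `make_router` function and the two stub backends are hypothetical placeholders rather than the paper's code; a real deployment would plug in the structured-memory and RAG pipelines in their place.

```python
from typing import Callable, Sequence

# An answerer takes the question and the prior dialogue turns.
Answerer = Callable[[str, Sequence[str]], str]


def make_router(structured: Answerer, rag: Answerer) -> Answerer:
    """Route on dialogue history: no history -> structured memory, else RAG."""
    def route(question: str, history: Sequence[str]) -> str:
        backend = rag if history else structured
        return backend(question, history)
    return route


# Stub backends (hypothetical) that only report which path was taken.
def structured_stub(question: str, history: Sequence[str]) -> str:
    return f"structured: {question}"


def rag_stub(question: str, history: Sequence[str]) -> str:
    return f"rag: {question}"


route = make_router(structured_stub, rag_stub)
print(route("What was 2023 revenue?", []))
print(route("And the year before?", ["What was 2023 revenue?"]))
```

Keying the decision on whether any history exists keeps the router free of extra model calls, which fits the constraint of no additional model or hardware changes.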

Cost-Efficiency and Auditability

Memory-heavy architectures increase token consumption ~3x with lower conversational accuracy. RAG’s cost-accuracy balance is optimal for API-scale deployments; for SMEs, retrieval-first methods are the practical path forward. Retrieval-grounded architectures also improve transparency and traceability, linking outputs to evidence for regulatory compliance.

Breaking Accuracy Ceilings and Supervision

Performance improvements beyond moderate accuracy require labeled data and targeted supervision: supervised entity-grounding models reliably outperform unsupervised LLMs even at larger scales. System designers should prioritize program induction and explicit entity tracking over further scaling unsupervised models.

Conclusion

Architectural choices, rather than model size, govern performance in financial QA under SME constraints. In deterministic, operand-explicit environments, structured memory and symbolic normalization deliver precision. In conversational, reference-implicit settings, retrieval-augmented strategies provide robustness and recoverability. The observed architectural inversion is actionable: hybrid, task-aware routing in production yields measurable gains in accuracy, efficiency, and reliability, confirming that alignment between architectural bias and uncertainty structure is central to practical financial AI adoption.

Future work should extend architectural evaluation to additional base models, validate hybrid frameworks in live enterprise settings, and develop lightweight task-structure classifiers for operational routing. Targeted supervision in entity-grounding presents a viable path to breaking the conversational accuracy ceiling, with implications for broader AI system design in finance.

Reference:

"Architecture Matters More Than Scale: A Comparative Study of Retrieval and Memory Augmentation for Financial QA Under SME Compute Constraints" (2604.17979).
