Multimodal Arithmetic Tasks

Updated 2 June 2026

Multimodal arithmetic tasks are challenges that integrate numerical reasoning with diverse data modalities like text, images, and audio to perform arithmetic operations.
These tasks employ techniques such as the Perception–Alignment–Reasoning paradigm and embedding arithmetic to align and fuse signals across modalities.
Benchmarks like CLEVR-Math, SIMAT, and MV-MATH highlight current model limitations in visual recognition, symbolic computation, and compositional reasoning.

Multimodal arithmetic tasks comprise a class of problems requiring the integration of arithmetic reasoning with perception and alignment across two or more modalities—most commonly text, images, and, in some instances, audio or speech. These tasks extend classic symbolic or textual arithmetic into settings where numerals, numeric relations, and mathematical operators are embedded in visual, auditory, or multimedia contexts. Modern approaches probe whether large-scale vision–language or multimodal models can reliably execute arithmetic, both by grounding symbolic representations in perception and by transferring semantic regularities found in text space (such as analogy or embedding arithmetic) to multimodal joint spaces.

1. Core Definitions and Taxonomies

Multimodal arithmetic tasks can be grouped along two axes: modality composition and reasoning complexity. At the simplest level, these include direct visual (or audio) recognition and manipulation of numeric symbols to perform basic arithmetic operations (addition, subtraction, multiplication, division), as exemplified in datasets where numerals or equations appear as images or speech and must be computed regardless of representation. More complex instances require compositional reasoning over multi-step operations (as in multihop scenarios), fusion across multiple visual or perceptual panels, or alignment between natural language instructions and visual or auditory contexts.

A foundational example is CLEVR-Math, which formalizes each problem as a tuple $(I, q, a)$, where $I$ is an image (typically synthetic), $q$ is a sequence of textually described actions with arithmetic semantics, and $a$ is the numeric answer. Supported operations include both insertion/removal (addition/subtraction) and quantification conditioned on object attributes (Lindström et al., 2022). Others, such as the IRPD benchmark, encode analogical reasoning, requiring models to recover relational structure from cross-modal input by leveraging embedding arithmetic in joint spaces (Xu et al., 21 Apr 2026). The taxonomy extends to tasks like those in MV-MATH, where models must aggregate and reconcile quantitative information from several images sequentially interleaved with text, and VisioMath, where answer choices themselves are visual diagrams encoding arithmetic results (Wang et al., 28 Feb 2025, Li et al., 7 Jun 2025).
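The $(I, q, a)$ formulation can be illustrated with a minimal sketch in which the image $I$ is replaced by an already-perceived list of objects, so that only the arithmetic semantics of $q$ are exercised. The `Action` type, the `"remove"`/`"add"` operation names, and the default attribute values are hypothetical simplifications and are not taken from the CLEVR-Math release.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Obj = Tuple[str, str]  # (color, shape); stands in for objects perceived in image I


@dataclass
class Action:
    """One textual action from q, reduced to a symbolic form (hypothetical schema)."""
    op: str                      # "remove" or "add"
    color: Optional[str] = None  # None acts as a wildcard when removing
    shape: Optional[str] = None
    count: int = 1               # only used by "add"


def matches(obj: Obj, action: Action) -> bool:
    """True if the object satisfies the action's attribute filters."""
    color, shape = obj
    return action.color in (None, color) and action.shape in (None, shape)


def execute(scene: List[Obj], actions: List[Action]) -> int:
    """Apply the actions in order and return the final object count (the answer a)."""
    objs = list(scene)
    for act in actions:
        if act.op == "remove":
            # Subtraction: drop every object matching the filter.
            objs = [o for o in objs if not matches(o, act)]
        elif act.op == "add":
            # Addition: insert `count` new objects (defaults are illustrative).
            objs.extend([(act.color or "gray", act.shape or "cube")] * act.count)
        else:
            raise ValueError(f"unknown operation: {act.op}")
    return len(objs)


scene = [("red", "cube"), ("blue", "sphere"), ("red", "sphere"), ("green", "cylinder")]
# "Remove all red objects, then add two blue cubes. How many objects are left?"
actions = [Action("remove", color="red"), Action("add", "blue", "cube", count=2)]
print(execute(scene, actions))  # 4 - 2 + 2 = 4
```

The two-action query above is a multi-step chain of the kind the benchmark uses to probe compositional generalization; in the real task the object list would have to be recovered from pixels first.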

2. Embedding Arithmetic and Analogy in Multimodal Spaces

A key direction leverages the hypothesis that geometric regularities observed in word embeddings (e.g., $\text{queen} - \text{king} + \text{man} \approx \text{woman}$) may extend to multimodal representations. In the canonical framework (Couairon et al., 2021), given encoders $E_{\text{img}}$ and $E_{\text{txt}}$ mapping images and text into a shared $d$-dimensional space, a semantic transformation is instantiated as a text delta vector

$\Delta E_{\text{txt}} = E_{\text{txt}}(w_2) - E_{\text{txt}}(w_1)$

and the target multimodal embedding is

$E' = E_{\text{img}}(I_1) + \lambda \, \Delta E_{\text{txt}},$

where $I_1$ is a source image and $\lambda$ controls transformation strength.
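The transformation $E' = E_{\text{img}}(I_1) + \lambda \, \Delta E_{\text{txt}}$ followed by nearest-neighbour retrieval can be sketched with hand-built toy vectors. The three-dimensional embeddings, the gallery names, and the use of cosine similarity for retrieval are assumptions made for illustration; real encoders such as CLIP produce high-dimensional embeddings that need not behave this cleanly.

```python
import math
from typing import Dict, List

Vec = List[float]


def cosine(a: Vec, b: Vec) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def transformed_query(img_emb: Vec, txt_w1: Vec, txt_w2: Vec, lam: float) -> Vec:
    """Compute E' = E_img(I_1) + lam * (E_txt(w_2) - E_txt(w_1))."""
    delta = [t2 - t1 for t1, t2 in zip(txt_w1, txt_w2)]
    return [e + lam * d for e, d in zip(img_emb, delta)]


def retrieve(query: Vec, gallery: Dict[str, Vec]) -> str:
    """Return the gallery item whose embedding is most similar to the query."""
    return max(gallery, key=lambda name: cosine(query, gallery[name]))


# Toy axes (assumed): [sitting, dog, cat]
E_IMG_DOG_SITTING = [1.0, 1.0, 0.0]
E_TXT_DOG = [0.0, 1.0, 0.0]
E_TXT_CAT = [0.0, 0.0, 1.0]
GALLERY = {
    "dog_sitting": [1.0, 1.0, 0.0],
    "cat_sitting": [1.0, 0.0, 1.0],
    "cat_running": [-1.0, 0.0, 1.0],
}

for lam in (0.0, 1.0):
    query = transformed_query(E_IMG_DOG_SITTING, E_TXT_DOG, E_TXT_CAT, lam)
    print(lam, retrieve(query, GALLERY))
```

With $\lambda = 0$ the query is the unmodified source image and retrieval returns the source itself; with $\lambda = 1$ the "dog" component is swapped for "cat" while the "sitting" component is preserved, which is the behaviour the analogy benchmarks test for.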

Empirical results on the SIMAT benchmark show that, out of the box, CLIP embeddings are not robust to such vector arithmetic, whereas simple COCO fine-tuning restores the geometry needed for effective analogy-structured image retrieval. Notably, leveraging universal sentence encoders confers little advantage unless $\Delta E_{\text{txt}}$ is computed over full sentences, highlighting the importance of preserving linear regularity at the sentence or phrase level (Couairon et al., 2021).

In relational arithmetic, the IRPD benchmark formalizes two-term subtraction and three-term analogy queries in joint spaces (Xu et al., 21 Apr 2026). Rather than relying on direct vector arithmetic, state-of-the-art performance is achieved by prompting vision–language models to reason explicitly over these relationships with “reasoning chains” reinforced via verifiable reward functions (accuracy and semantic similarity).
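A verifiable reward that combines accuracy with semantic similarity can be sketched as a weighted sum of an exact-match term and a cosine-similarity term. The bag-of-words embedding is a toy stand-in for a real sentence encoder, and the weights `w_acc` and `w_sim` are hypothetical; neither is taken from the cited work.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy text embedding (assumption): lower-cased token counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def reward(prediction: str, reference: str,
           w_acc: float = 1.0, w_sim: float = 0.5) -> float:
    """Weighted sum of exact-match accuracy and semantic similarity."""
    exact = 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0
    sim = cosine(bag_of_words(prediction), bag_of_words(reference))
    return w_acc * exact + w_sim * sim


print(reward("capital of", "capital of"))     # exact match plus full similarity
print(reward("is capital of", "capital of"))  # partial credit from similarity only
print(reward("painted by", "capital of"))     # no overlap, zero reward
```

The similarity term gives graded credit to near-miss relation labels, while the exact-match term keeps the reward anchored to a checkable outcome.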

3. Benchmarks, Datasets, and Evaluation Protocols

Multimodal arithmetic is systematically assessed via a spectrum of datasets:

Direct Arithmetic Recognition and Computation: MATH-V, CMM-Math, and VisioMath evaluate models on tasks requiring extraction and manipulation of numbers encountered as printed, handwritten, or pictorial content alongside text (Wang et al., 2024, Liu et al., 2024, Li et al., 7 Jun 2025). Benchmark statistics: MATH-V’s Arithmetic subset (on the order of 190 items) covers all four elementary operations, with images ranging from simple expressions to context-rich diagrams; CMM-Math offers arithmetic-rich problems across all grades, many with bar charts, pie charts, or tables.
Compositional and Multi-Visual Reasoning: MV-MATH extends the challenge to multi-image input, often with narrative text interleaved, forcing models to cross-reference and align quantitative entries from multiple related visuals (Wang et al., 28 Feb 2025). CLEVR-Math probes compositionality: single-step reasoning tasks are tractable, but ‘multi-hop’ problems (sequential chains of arithmetic operations) expose the failure of neural and neuro-symbolic baselines to generalize structural regularity (Lindström et al., 2022).
Embedding Arithmetic Benchmarks: SIMAT and IRPD target structured, analogy-based transformation and relation recovery, measuring both top-K retrieval and classification/recall@5 for analogy completion (Couairon et al., 2021, Xu et al., 21 Apr 2026).
Rich Modality Pairings: Recent work introduces synchronized benchmarks varying both notation and presentation modality (e.g., multiplication problems expressed as numerals, words, images, and audio), carefully controlling for “arithmetic load”—the computational burden dictated by digit count and sparsity (Balter et al., 20 Apr 2026).

Performance is commonly reported as (1) exact-match or top-1 accuracy, (2) process- or step-level correctness, e.g., “step accuracy rate” (SAR) or process accuracy across a multistep chain (Wang et al., 28 Feb 2025, Yang et al., 9 Mar 2026), and (3) downstream execution correctness—fraction of outputs producing the expected result via a symbolic verifier. Chain-of-thought prompting has a marginal effect and may even degrade arithmetic accuracy if reasoning is not verifiable (Wang et al., 2024, Wang et al., 28 Feb 2025).
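The three reporting levels can be sketched as follows. The position-by-position definition of step accuracy is one plausible reading rather than the exact definition used by any cited benchmark, and the small expression evaluator is a hypothetical stand-in for a symbolic verifier.

```python
import ast
import operator
from typing import Sequence

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def exact_match_accuracy(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Answer level: fraction of final answers identical to the reference."""
    if not golds:
        return 0.0
    return sum(p.strip() == g.strip() for p, g in zip(preds, golds)) / len(golds)


def step_accuracy_rate(pred_steps: Sequence[str], gold_steps: Sequence[str]) -> float:
    """Process level (assumed definition): fraction of reference steps matched in place."""
    if not gold_steps:
        return 0.0
    return sum(p.strip() == g.strip() for p, g in zip(pred_steps, gold_steps)) / len(gold_steps)


def safe_eval(expr: str) -> float:
    """Evaluate +, -, *, / over numeric literals without calling eval()."""
    def ev(node: ast.AST) -> float:
        if (isinstance(node, ast.Constant) and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)


def execution_correct(expr: str, expected: float, tol: float = 1e-9) -> bool:
    """Execution level: does the emitted expression evaluate to the expected value?"""
    try:
        return abs(safe_eval(expr) - expected) <= tol
    except (ValueError, SyntaxError, ZeroDivisionError):
        return False


print(exact_match_accuracy(["7", "12", "5"], ["7", "13", "5"]))
print(step_accuracy_rate(["3+4=7", "7*2=15"], ["3+4=7", "7*2=14"]))
print(execution_correct("(3 + 4) * 2", 14))
```

Separating the levels matters because a model can reach the right final answer through an invalid chain, or produce a valid chain that it then fails to execute.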

4. Error Analysis and Model Limitations

Systematic error analysis identifies several persistent limitations:

Perceptual Failures: Misreading digits, mislocating numbers in charts, and symbol misrecognition (especially of decimals, non-standard fonts, or multi-digit expressions) remain common in vision–language models (Wang et al., 2024, Liu et al., 2024, Wang et al., 28 Feb 2025). MATH-V and CMM-Math report up to 32% and 42% errors at the perception stage, respectively.
Reasoning and Compositional Failures: Multihop or chained arithmetic exposes fragilities in both neural and neuro-symbolic approaches: failure to correctly parse and execute sequences of symbolic operations, or to apply the correct combinatorial logic. In MV-MATH, the multi-step question completeness rate (QCR) is <10% for top models, compared to ~66% for humans (Wang et al., 28 Feb 2025). CLEVR-Math highlights this in the steep accuracy drop on multihop test cases not seen in training (Lindström et al., 2022).
Alignment and Modality Gap: Insufficient alignment between visual facts and textual tokens impedes fusion. Fine-tuned models with structured perception–alignment–reasoning architectures, as in the PAR paradigm, mediate these errors, but misalignment between tokens and regions still accounts for a substantial fraction of wrong answers (Yang et al., 9 Mar 2026).
Arithmetic Computation Bottlenecks: In tightly controlled experiments, failures in multi-digit multiplication are traced to computation rather than perception, with model accuracy sharply constrained by the ‘arithmetic load’ $C$. Key finding: perception accuracy stays above 99% even as arithmetic performance drops to near zero for high $C$ (Balter et al., 20 Apr 2026).
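Separating perception errors from computation errors, as in the last item above, can be sketched by comparing what a model transcribed with what it computed. The `Trial` record assumes the model is asked to transcribe the operands as well as answer, and the total digit count is only a crude proxy for arithmetic load; neither is the cited study's actual protocol or its definition of $C$.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class Trial:
    """One multiplication item (hypothetical record layout)."""
    true_operands: Tuple[int, int]
    read_operands: Tuple[int, int]  # operands as transcribed by the model
    predicted_product: int


def classify(t: Trial) -> str:
    """Attribute the outcome to perception, computation, or neither."""
    if t.read_operands != t.true_operands:
        return "perception"
    a, b = t.true_operands
    return "correct" if t.predicted_product == a * b else "computation"


def total_digits(operands: Tuple[int, int]) -> int:
    """Crude load proxy (assumption): total number of digits across operands."""
    return sum(len(str(abs(x))) for x in operands)


def accuracy_by_digits(trials: Iterable[Trial]) -> Dict[int, float]:
    """End-to-end accuracy grouped by the digit-count proxy."""
    totals, hits = Counter(), Counter()
    for t in trials:
        d = total_digits(t.true_operands)
        totals[d] += 1
        hits[d] += classify(t) == "correct"
    return {d: hits[d] / totals[d] for d in totals}


trials = [
    Trial((12, 34), (12, 34), 408),        # read and computed correctly
    Trial((12, 34), (12, 84), 1008),       # misread an operand
    Trial((123, 456), (123, 456), 56000),  # read correctly, wrong product
]
print(Counter(classify(t) for t in trials))
print(accuracy_by_digits(trials))
```

A breakdown dominated by the "computation" label at higher digit counts, with few "perception" errors, is the pattern the controlled experiments report.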

5. Model Architectures, Training Strategies, and Methodological Innovations

Modeling multimodal arithmetic tasks explicitly involves staged pipelines or end-to-end multimodal transformers:

Pipeline Approaches: The Perception–Alignment–Reasoning (PAR) paradigm decomposes the process into (1) structured perception (object, number, symbol extraction), (2) explicit alignment (text tokens $\leftrightarrow$ visual regions), and (3) verifiable reasoning, either via neural autoregressive step generation with program supervision or via symbolic executors for arithmetic step verification (Yang et al., 9 Mar 2026). This allows process-level debugging and iterative refinement.
Embedding Arithmetic and Fine-Tuning: Embedding arithmetic methods rely on joint space regularity. CLIP, FastText, LaBSE, and LASER encoders are evaluated under contrastive objectives (InfoNCE), and fine-tuning on image–caption corpora (COCO) is essential for restoring analogy-like behaviour. The best transfer is reported for a tuned delta weighting $\lambda$ and contrastive temperature (Couairon et al., 2021).
Reinforcement Learning for Relational Reasoning: SAri-RFT employs group-normalized, PPO-style losses with verifiable reward functions (exact match and cosine similarity) to train large vision–language models (LVLMs) on visual semantic arithmetic (relation and analogy recovery) (Xu et al., 21 Apr 2026). This greatly outperforms both zero-shot base models and supervised fine-tuning for cross-modal relation induction.
Curriculum and Post-Training Alignment: CogAlign proposes invariance-based post-training (inspired by Piaget’s concrete operational stage) to train VLM decoders to recognize invariant properties (length, angle, count) under visual transformation, combined with Direct Preference Optimization (DPO). This method is shown to improve accuracy on downstream chart and geometry tasks, even with limited synthetic secondary training (Huang et al., 17 Feb 2025).
LoRA and Adapter Merging for Task Expansion: Task arithmetic in parameter space, as used for rapid language expansion in multimodal ST/MT tasks, enables new translation pairs via the addition (or analogy) of task vectors in LoRA-adapted neural networks. Language control adapters mitigate interference among targets (Cheng et al., 2024).
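Task arithmetic in parameter space, as mentioned in the last item above, can be sketched with toy parameter dictionaries. A task vector is taken here as the difference between fine-tuned and base weights, which is the common formulation; the scalar "weights", the `scale` factor, and the language-pair names in the comments are illustrative assumptions, and a real LoRA setup would operate on low-rank adapter matrices instead.

```python
from typing import Dict, Iterable

Params = Dict[str, float]  # toy model: one scalar per named parameter


def task_vector(finetuned: Params, base: Params) -> Params:
    """tau = theta_finetuned - theta_base."""
    return {k: finetuned[k] - base[k] for k in base}


def apply_task_vectors(base: Params, vectors: Iterable[Params],
                       scale: float = 1.0) -> Params:
    """theta_new = theta_base + scale * sum of task vectors (addition)."""
    merged = dict(base)
    for vec in vectors:
        for k, v in vec.items():
            merged[k] += scale * v
    return merged


def analogy(tau_a: Params, tau_b: Params, tau_c: Params) -> Params:
    """Analogy-style combination: tau_c + (tau_b - tau_a)."""
    return {k: tau_c[k] + (tau_b[k] - tau_a[k]) for k in tau_c}


base = {"w1": 1.0, "w2": 2.0}
ft_a = {"w1": 1.5, "w2": 2.0}   # e.g. fine-tuned for one target language
ft_b = {"w1": 1.0, "w2": 2.25}  # e.g. fine-tuned for another target language

tau_a = task_vector(ft_a, base)
tau_b = task_vector(ft_b, base)
print(apply_task_vectors(base, [tau_a, tau_b]))
```

Plain addition assumes the task vectors touch largely separate parameters; when they overlap, the merged model can suffer interference, which is the problem the language control adapters are described as mitigating.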

6. Main Empirical Findings and Open Challenges

Most state-of-the-art multimodal models, including large open-source and proprietary LMMs, remain substantially below human performance on arithmetic subdomains. For instance, GPT-4V achieves only 35.7% on MATH-V’s arithmetic subset (human: 100%) and 45.9% on VisioMath for figure-based arithmetic reasoning, with even the best open-source models under 35% on translation-rich Chinese tasks (CMM-Math) (Wang et al., 2024, Li et al., 7 Jun 2025, Liu et al., 2024). CLEVR-Math’s multihop compositional challenge remains open, with neural and neuro-symbolic systems failing to generalize action sequences (Lindström et al., 2022).

Error types include: (1) perception (misreading numerals), (2) stepwise reasoning, (3) semantic misalignment, and (4) symbolic computation failure under high load. Integration of explicit symbolic modules, richer vision-language alignment, curriculum self-consistency, and structured program supervision are central recommendations across studies.

Recent contributions propose robust evaluation schemes (e.g., answer/process/executable-level metrics), verifiable reward shaping for RL, and datasets that decouple perception and computation. Future work suggests more systematic symbolic–neural fusion, cross-modal chain-of-thought, richer data augmentations, and expansion of benchmarks to include challenging, real-world arithmetic sequences embedded within diverse multi-image, multi-modal contexts (Yang et al., 9 Mar 2026, Wang et al., 2024, Wang et al., 28 Feb 2025, Couairon et al., 2021, Xu et al., 21 Apr 2026).