NUMINA-Flow: 3D Numerical Inference Pipeline

Updated 27 September 2025

NUMINA-Flow is an automated annotation pipeline that constructs the NUMINA benchmark, enabling comprehensive evaluation of 3D numerical reasoning in multimodal models.
It integrates geometric ground truth extraction from high-quality 3D scenes with LLM-driven QA generation and robust bias-mitigation techniques.
Empirical results reveal that current 3D-aware MLLMs struggle with fine-grained numerical inference, motivating innovations in spatial reasoning architectures.

NUMINA-Flow is an automated annotation pipeline developed for constructing the NUMINA benchmark, designed explicitly to evaluate and advance multimodal LLMs (MLLMs) in the domains of three-dimensional perception, spatial reasoning, and fine-grained numerical inference. NUMINA-Flow integrates geometric ground-truth extraction from reconstructed 3D scenes, automated LLM–driven question–answer generation, rigorous bias-mitigation and task-formatting routines, as well as rule-based and human-in-the-loop verification, thereby enabling the synthesis of comprehensive, numerically precise QA-pair datasets for robust evaluation in 3D settings. The pipeline underpins the NUMINA benchmark, which exposes current limitations in 3D-aware MLLMs—particularly their inability to perform fine-scale spatial measurements or numerical computations—highlighting a critical need for model innovations in 3D geometric understanding (Zeng et al., 20 Sep 2025).

1. Pipeline Architecture and Process

NUMINA-Flow encompasses a multi-stage, automated workflow tailored to ensure the accuracy, diversity, and bias-resilience of its output QA-pairs:

Extraction of Numerical Ground Truth (NGT):
- The process begins with high-quality 3D scenes from the ScanNet dataset, extracting for each object instance: centroid (center-of-mass) coordinates, axis-aligned bounding box dimensions, and extremal (min/max) positions along $x$, $y$, and $z$.
- Pairwise convex hull distances are computed using established computational geometry algorithms (e.g., Bentley et al., 1982). The convex hull distance, defined as the minimal Euclidean distance between the convex hulls of any two object point sets, provides a spatial metric that closely matches human spatial intuition.
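
As a rough illustration of this extraction step, the sketch below computes the centroid, axis-aligned bounding box, and extremal coordinates of an object point set, and estimates the distance between two convex hulls. It is a minimal pure-Python sketch under stated assumptions, not the pipeline's implementation: the names `object_stats` and `hull_distance` are hypothetical, the centroid is taken as the unweighted mean of the points, and the hull distance uses a simple Frank-Wolfe (Gilbert-style) iteration on the Minkowski difference of the two point sets rather than the algorithm cited above.

```python
import math
from itertools import product

def object_stats(points):
    """Centroid, AABB dimensions, and per-axis min/max for (x, y, z) points."""
    n = len(points)
    mins = tuple(min(p[i] for p in points) for i in range(3))
    maxs = tuple(max(p[i] for p in points) for i in range(3))
    centroid = tuple(sum(p[i] for p in points) / n for i in range(3))
    dims = tuple(hi - lo for lo, hi in zip(mins, maxs))
    return {"centroid": centroid, "dims": dims, "min": mins, "max": maxs}

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hull_distance(A, B, max_iter=1000, tol=1e-12):
    """Approximate min distance between conv(A) and conv(B) (illustrative only)."""
    # z is always a point of conv(A) - conv(B); we shrink its norm step by step.
    z = tuple(a - b for a, b in zip(A[0], B[0]))
    for _ in range(max_iter):
        # Vertex of the Minkowski difference that lies furthest in direction -z.
        a = min(A, key=lambda p: _dot(z, p))
        b = max(B, key=lambda p: _dot(z, p))
        s = tuple(x - y for x, y in zip(a, b))
        d = tuple(si - zi for si, zi in zip(s, z))
        gap = -_dot(z, d)  # Frank-Wolfe gap; near zero once z is optimal
        if gap <= tol:
            break
        t = min(1.0, gap / _dot(d, d))  # exact line search on the segment z -> s
        z = tuple(zi + t * di for zi, di in zip(z, d))
    return math.sqrt(_dot(z, z))

# Two unit cubes, the second shifted by 3 along x.
cube_a = list(product([0.0, 1.0], repeat=3))
cube_b = [(x + 3.0, y, z) for x, y, z in cube_a]
stats = object_stats(cube_a)
gap_ab = hull_distance(cube_a, cube_b)
```

Because only hull vertices can be selected as support points, the raw object points can be passed in directly without building the hulls first. The iteration converges slowly when the hulls overlap or nearly touch, so a production pipeline would more likely rely on a dedicated geometry library.
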
QA Template Generation and LLM-Based Rewriting:
- NUMINA-Flow utilizes advanced LLMs (notably GPT-4o for template creation and Qwen2.5-72B for paraphrasing) to synthesize a broad range of question templates per category. For each task type—Fact Validation (FV), Prompt Matching (PM), and Numerical Inference (NI)—around ten syntactic variants are crafted.
- Reserved placeholders (e.g., {OBJ1}, {x}, {TYPE}) in templates are systematically bound with corresponding NGT values, tightly coupling language with spatial measurement.
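
The placeholder binding described above can be sketched with Python's built-in string formatting. This is an assumption-laden illustration: the template text, the `{OBJ2}` and `{UNIT}` placeholders, and the helper `bind_template` are hypothetical and are not taken from the NUMINA templates.

```python
import string

# Hypothetical template; only {OBJ1}-style placeholders are attested in the text.
TEMPLATE = "What is the distance between the {OBJ1} and the {OBJ2} in {UNIT}?"

def bind_template(template, values):
    """Fill every placeholder; fail loudly if a required value is missing."""
    needed = {name for _, name, _, _ in string.Formatter().parse(template) if name}
    missing = needed - set(values)
    if missing:
        raise KeyError(f"unbound placeholders: {sorted(missing)}")
    return template.format(**values)

question = bind_template(
    TEMPLATE, {"OBJ1": "chair", "OBJ2": "table", "UNIT": "centimeters"}
)
```

Checking for unbound placeholders before formatting keeps a partially filled template from slipping into the dataset.
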
Categorical Formatting and Bias Mitigation:
- FV tasks prompt a forced Yes/No format; PM tasks employ five-option multiple choice (A–E), uniformly distributing correct options to eliminate positional bias; NI tasks enforce numerically precise answers with correct units.
- The pipeline monitors answer distributions across the dataset, ensuring equal frequency for binary outcomes and uniform selection across PM options.
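
One simple way to obtain a uniform distribution of correct options is to assign the correct answer's slot in round-robin order. The sketch below shows this under stated assumptions: `build_pm_items` and its input layout are hypothetical, and the source does not say which balancing scheme the pipeline actually uses.

```python
import random
from collections import Counter

LETTERS = "ABCDE"

def build_pm_items(items, seed=0):
    """items: list of (question, correct_answer, four_distractors)."""
    rng = random.Random(seed)
    built = []
    for i, (question, correct, distractors) in enumerate(items):
        slot = i % len(LETTERS)      # round-robin slot for the correct answer
        options = list(distractors)
        rng.shuffle(options)         # randomize distractor order
        options.insert(slot, correct)
        built.append({
            "question": question,
            "options": dict(zip(LETTERS, options)),
            "answer": LETTERS[slot],
        })
    return built

items = [(f"Q{i}", "right", ["w1", "w2", "w3", "w4"]) for i in range(10)]
pm = build_pm_items(items)
counts = Counter(item["answer"] for item in pm)
```

With ten items, each of the five letters is correct exactly twice. Since the slot follows the item index, the item order would need to be shuffled afterwards so that the pattern is not predictable.
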
Automated and Human Verification:
- A rule-based filter checks for hallucinations (e.g., non-existent objects, malformed options) and structural QA errors, with auto-regeneration (up to five times) upon failure.
- Human inspection is performed on sampled subsets, with reported QA-pair correctness of up to 99.5%.
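
The rule-based filter and retry loop can be sketched as follows. The specific rules, the dictionary layout of a QA sample, and the names `check_qa` and `generate_with_retry` are hypothetical; the sketch also assumes that "up to five times" means at most five generation attempts in total.

```python
def check_qa(qa, scene_objects):
    """Return a list of rule violations; an empty list means the sample passes."""
    errors = []
    for name in qa.get("objects", []):
        if name not in scene_objects:  # object not present in the scene
            errors.append(f"unknown object: {name}")
    if qa.get("type") == "PM":
        options = qa.get("options", {})
        if sorted(options) != list("ABCDE"):
            errors.append("PM item must have exactly options A-E")
        elif qa.get("answer") not in options:
            errors.append("answer is not one of the options")
    if qa.get("type") == "FV" and qa.get("answer") not in ("Yes", "No"):
        errors.append("FV answer must be Yes or No")
    return errors

def generate_with_retry(generate, scene_objects, max_attempts=5):
    """Call generate() until a sample passes the checks or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        qa = generate()
        if not check_qa(qa, scene_objects):
            return qa, attempt
    return None, max_attempts

# Stand-in generator: the first sample mentions an object that is not in the scene.
samples = iter([
    {"type": "FV", "objects": ["sofa", "unicorn"], "answer": "Yes"},
    {"type": "FV", "objects": ["sofa", "chair"], "answer": "No"},
])
qa, attempts = generate_with_retry(lambda: next(samples), {"sofa", "chair"})
```

Here the first sample is rejected and the second is accepted on attempt 2. A sample that still fails after the last attempt is returned as `None` so the caller can drop it.
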
Multi-Scale Annotation and Aggregation:
- By combining instance-level geometric annotation and LLM-generated QA, NUMINA-Flow achieves a multi-scale, multi-style dataset—spanning simple object recognition queries to complex geometric computations—culminating in 74,526 QA pairs with controlled diversity and difficulty.

| Processing Stage | Key Output | Methodological Detail |
|---|---|---|
| NGT Extraction | Centroids, bounding boxes, convex hull distances | ScanNet + geometric algorithms |
| Template and Linguistic Generation | QA templates, LLM rewrites | GPT-4o / Qwen2.5-72B |
| Formatting and Bias Mitigation | Balanced, formatted QA | Automated positional/answer balancing |
| Rule-based Verification | QA integrity check, auto-rewrite | Option counting, hallucination detection |
| Human Review | Sample-level validation | Manual inspection |

2. Numerical Reasoning Grounding

NUMINA-Flow distinguishes itself by enforcing a direct link between the extracted physical world geometry and linguistic output:

- Each generated QA sample that references numeric attributes (distance, size, volume) is programmatically grounded in true scene metrics (e.g., bounding box volume is computed as $w \times h \times d$ from the dimensions of the axis-aligned box; object pair distances use convex hull proximity).
- Numerical questions are formatted to demand both value and unit (e.g., “What is the distance between the two chairs in centimeters?”), with direct value substitution to prevent discrepancies between the question context and the underlying ground truth.

For evaluating numerical predictions, NUMINA-Flow employs a threshold accuracy (TA) metric, which counts a prediction as correct when its absolute error with respect to the ground truth falls below a threshold:

$TA = I\left(|d_{pred} - d_{true}| < \mathrm{Threshold}\right)$

where $I$ is the indicator function (1 if the threshold condition is met, 0 otherwise), $d_{pred}$ is the predicted value, and $d_{true}$ is the ground-truth value.
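
The TA formula translates directly into code. In this minimal sketch the function names and the example values are hypothetical, and averaging TA over a set of samples is an assumption about how per-sample scores would be aggregated.

```python
def threshold_accuracy(d_pred, d_true, threshold):
    """TA for one sample: 1 if the absolute error is strictly below threshold."""
    return 1 if abs(d_pred - d_true) < threshold else 0

def mean_threshold_accuracy(pairs, threshold):
    """Average TA over (d_pred, d_true) pairs."""
    return sum(threshold_accuracy(p, t, threshold) for p, t in pairs) / len(pairs)

# Made-up predictions and ground-truth distances in meters.
pairs = [(1.52, 1.50), (0.80, 1.00), (2.00, 2.04)]
score = mean_threshold_accuracy(pairs, threshold=0.05)
```

With a threshold of 0.05, the first and third predictions count as correct and the second does not, so two of the three samples pass.
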

This approach ensures precise anchoring of language with numerically defined reality, setting a rigorous standard for evaluation that most current MLLMs fail to meet.

3. Diversity and Bias Control

NUMINA-Flow incorporates explicit procedures to avoid systematic bias and to enforce comprehensive coverage:

- Uniform distribution of correct answers in PM tasks (each option A–E is correct 20% of the time) and balanced Yes/No answers in FV tasks.
- LLM-driven paraphrasing via Qwen2.5-72B to increase linguistic variance and dilute the stylistic quirks of any single LLM.
- Generation and balance checks that are repeated iteratively until each dataset split achieves the desired distributional properties.
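
A check-and-repeat loop of this kind might look like the sketch below. It is illustrative only: `max_deviation`, `rebalance`, and the tolerance value are hypothetical, and relabeling an item here merely stands in for regenerating that QA pair with a different target answer.

```python
import random
from collections import Counter

def max_deviation(answers, labels):
    """Largest gap between an observed label share and the uniform share."""
    counts = Counter(answers)
    target = 1 / len(labels)
    return max(abs(counts[label] / len(answers) - target) for label in labels)

def rebalance(answers, labels, tolerance=0.02, max_rounds=1000, seed=0):
    """Shift items from the most to the least frequent label until balanced."""
    rng = random.Random(seed)
    answers = list(answers)
    for _ in range(max_rounds):
        if max_deviation(answers, labels) <= tolerance:
            break
        counts = Counter(answers)
        most = max(labels, key=lambda label: counts[label])
        least = min(labels, key=lambda label: counts[label])
        idx = rng.choice([i for i, a in enumerate(answers) if a == most])
        answers[idx] = least  # placeholder for regenerating this item
    return answers

# A skewed FV split: 70 "Yes" answers and 30 "No" answers.
fv_answers = ["Yes"] * 70 + ["No"] * 30
balanced = rebalance(fv_answers, ["Yes", "No"])
```

The same functions work for PM splits by passing the five option letters as `labels`.
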

A plausible implication is that this reduces single-option or phrasing bias in numerical/spatial QA, increasing the benchmark’s discriminative power for downstream model analysis.

4. Rule-Based and Human-in-the-Loop Self-Verification

NUMINA-Flow’s fidelity relies on an integrated two-stage verification process:

- Automated rule checks monitor for hallucinations, malformed answer options, template adherence, and duplicate questions.
- An automated retry mechanism allows up to five regeneration attempts for any sample that fails the heuristic checks.
- Manual review of a stratified sample is used to keep the overall output error rate below 0.5%, a level of reliability suitable for benchmarking.
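
Drawing a stratified sample for manual review can be sketched as follows. The source does not state which strata or sampling rate are used, so stratifying by task type, the 10% fraction, and the helper `stratified_sample` are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(qa_pairs, key, fraction, seed=0):
    """Draw the same fraction (at least one item) from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for qa in qa_pairs:
        strata[key(qa)].append(qa)
    sample = []
    for name in sorted(strata):  # fixed order keeps the draw reproducible
        group = strata[name]
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Toy dataset: 40 FV, 40 PM, and 20 NI items.
task_types = ["FV"] * 40 + ["PM"] * 40 + ["NI"] * 20
qa_pairs = [{"id": i, "type": t} for i, t in enumerate(task_types)]
review = stratified_sample(qa_pairs, key=lambda qa: qa["type"], fraction=0.1)
```

Sampling per stratum ensures that smaller categories, such as NI in this toy example, are still represented in the review set.
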

This verification regime addresses both semantic consistency and technical fidelity, serving as a safeguard against both LLM hallucination and trivial programmatic error.

5. Empirical Impact and Revealed Challenges

Deploying NUMINA-Flow for the NUMINA benchmark exposes critical limitations in state-of-the-art 3D-aware MLLMs:

- When models are evaluated using the Chat-Scene framework (processing 3D point clouds, images, and text), performance in non-numerical tasks (e.g., object category, color) is significantly higher than in strictly numerical inference tasks.
- For strict thresholds (e.g., 5% error tolerance), accuracy in numerical inference tasks (distance/volume estimation) is below 3% for all tested models.
- This suggests that current models lack both magnitude sensitivity and robust geometric inductive bias. Most models treat numbers as tokens rather than quantitative entities and lack dedicated 3D spatial reasoning modules.

A plausible implication is that high performance on “visual question answering” benchmarks does not translate to genuine numerical reasoning capability unless the annotation/testing pipeline achieves the precision and diversity standards exemplified by NUMINA-Flow.

6. Prospects for Model Development and Benchmark Expansion

Empirical results from NUMINA-Flow–based evaluation motivate several research directions:

- Incorporation of specialized geometric reasoning modules or architectural modifications (e.g., operations directly on 3D point clouds, geometric pretraining, or explicit propagation of metric properties).
- Enhanced training regimes, entailing explicit 3D spatial supervision and targeted numerical tasks, to imbue models with spatial magnitude awareness and geometric consistency.
- Expansion of benchmarks along the NUMINA-Flow paradigm to include broader scene types, outdoor environments, or more complex spatial/numerical inferences involving object interactions.

A plausible implication is that annotation pipelines with the rigor and granularity of NUMINA-Flow are necessary prerequisites for advancing numerical reasoning in multimodal AI and for the rigorous evaluation of future 3D-aware MLLMs.

7. Summary

NUMINA-Flow constitutes a comprehensive, scalable pipeline for synthesizing multi-scale, numerically grounded QA-pair datasets based on 3D scene analysis. Its integration of LLM-based QA generation, rule-based and human verification, balanced answer formatting, and direct geometric-metric grounding achieves high annotation fidelity and dataset diversity. Empirical findings underscore that current multimodal models exhibit profound shortcomings in 3D numerical reasoning, motivating significant research into architectures that better couple linguistic, spatial, and quantitative understanding (Zeng et al., 20 Sep 2025).


References

Zeng et al. (20 Sep 2025). NUMINA: A Natural Understanding Benchmark for Multi-dimensional Intelligence and Numerical Reasoning Abilities.
