MulTaBench: Multimodal Tabular Benchmark
- MulTaBench is a curated collection of 40 multimodal tabular prediction datasets that combine structured data with complementary text or image inputs.
- It introduces Target-Aware Representations (TAR) by fine-tuning pretrained encoders to capture task-relevant details, outperforming frozen embeddings.
- The benchmark evaluates models on unified metrics like ROC AUC and R² across diverse domains, emphasizing the impact of multimodal fusion strategies.
MulTaBench is a benchmark for multimodal tabular learning introduced in “MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image” (Arazi et al., 11 May 2026). It is a curated collection of 40 multimodal tabular prediction datasets (20 image–tabular and 20 text–tabular) selected to require joint multimodal modeling and Target-Aware Representations (TAR) of unstructured inputs. The benchmark focuses on predictive tasks in which text or image inputs provide complementary predictive signal beyond numeric and categorical columns, and in which generic frozen embeddings lose critical information. In the contemporary literature, closely related names create substantial ambiguity: “MulTaBench” is also used informally for MultiTab or MultiTab-Bench in tabular evaluation and multitask tabular learning contexts (Lee et al., 20 May 2025, Sinodinos et al., 13 Nov 2025), while some discussions also use similar labels for MMTBench or MTBench in unrelated domains (Titiya et al., 27 May 2025, Joshi et al., 31 Jul 2025). In the strict canonical sense, however, MulTaBench denotes the multimodal tabular benchmark of Arazi and collaborators (Arazi et al., 11 May 2026).
1. Naming, scope, and disambiguation
MulTaBench, in its canonical usage, refers to the benchmark introduced for multimodal tabular learning with text and image inputs (Arazi et al., 11 May 2026). The benchmark is explicitly framed around multimodal tabular prediction, rather than standard supervised tabular learning, synthetic multitask tabular generation, multimodal table question answering, or multi-task robotics reinforcement learning.
The surrounding naming landscape is unusually ambiguous. In “MultiTab: A Comprehensive Benchmark Suite for Multi-Dimensional Evaluation in Tabular Domains” (Lee et al., 20 May 2025), “MulTaBench” is described as an informal community name for MultiTab, whose canonical repository and documentation are labeled MultiTab. In “MultiTab: A Scalable Foundation for Multitask Learning on Tabular Data” (Sinodinos et al., 13 Nov 2025), the official synthetic generator is called “MultiTab-Bench,” and “MulTaBench” is stated to refer to the same component in some usage. By contrast, the papers “MMTBENCH: A Unified Benchmark for Complex Multimodal Table Reasoning” (Titiya et al., 27 May 2025) and “Benchmarking Massively Parallelized Multi-Task Reinforcement Learning for Robotics Tasks” (Joshi et al., 31 Jul 2025) explicitly note that “MulTaBench” is not their official benchmark name.
This suggests that the term has become a polysemous shorthand in community discussions. A precise interpretation therefore depends on domain: in multimodal tabular prediction it denotes the 40-dataset benchmark of (Arazi et al., 11 May 2026), whereas in related tabular or multitask contexts it may be used informally for MultiTab or MultiTab-Bench (Lee et al., 20 May 2025, Sinodinos et al., 13 Nov 2025).
2. Benchmark objective and conceptual basis
MulTaBench was designed to address a specific limitation in multimodal tabular learning benchmarks: many prior datasets contain text or images that merely co-occur with structured features, rather than contributing genuinely complementary predictive information (Arazi et al., 11 May 2026). Under those conditions, benchmarks can obscure whether multimodal learning is helping, and can mask the effect of adapting unstructured encoders to the downstream target.
The central concept is Target-Aware Representations. In MulTaBench, TAR means tuning a pretrained vision or text encoder on the downstream prediction label so that task-specific signal is surfaced before fusion with the tabular learner (Arazi et al., 11 May 2026). The paper argues that fixed embeddings are lossy summaries: compressing text or image inputs into frozen vectors optimized for global semantics can discard details that matter for the target, such as localized visual anomalies or exact phrasing. MulTaBench therefore isolates tasks where target-aware adaptation is consequential.
The benchmark formalizes this motivation through four conditions evaluated per dataset and per model: Unimodal Structured, Unimodal Unstructured, Joint Frozen, and Joint TAR (Arazi et al., 11 May 2026). Two gains are then computed. The first is a joint-signal criterion requiring the joint frozen model to outperform either unimodal baseline. The second is a task-awareness criterion requiring TAR to improve upon the frozen joint model. The acceptance rule is
$\text{Accept}(\mathcal{D}) \iff \Big|\{m:\,\Delta_{\text{Joint}}(m)>\delta \,\land\, \Delta_{\text{Awareness}}(m)>\delta\}\Big|\ \ge\ \rho\cdot|\mathcal{M}|$
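where $\delta$ is the minimum gain threshold and $\rho$ is the required fraction of the learner set $\mathcal{M}$, both fixed in the paper (Arazi et al., 11 May 2026). This curation rule operationalizes the benchmark’s core premise: accepted datasets must reward both multimodal complementarity and target-aware tuning.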
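The acceptance rule can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function name `accept_dataset`, the score dictionary layout, and the example values of `delta` and `rho` are hypothetical, and the joint gain is assumed here to be measured against the stronger of the two unimodal baselines.

```python
from typing import Dict

def accept_dataset(scores: Dict[str, Dict[str, float]], delta: float, rho: float) -> bool:
    """Apply the acceptance rule to one dataset.

    scores[model] holds one metric value per condition, under the keys
    'structured', 'unstructured', 'joint_frozen', and 'joint_tar'.
    """
    passing = 0
    for s in scores.values():
        # Assumption: joint gain is relative to the better unimodal baseline.
        gain_joint = s["joint_frozen"] - max(s["structured"], s["unstructured"])
        # Awareness gain: TAR versus frozen embeddings in the joint setting.
        gain_awareness = s["joint_tar"] - s["joint_frozen"]
        if gain_joint > delta and gain_awareness > delta:
            passing += 1
    # Accept when at least a fraction rho of the learners pass both criteria.
    return passing >= rho * len(scores)

# Hypothetical scores for three learners on one candidate dataset.
example = {
    "LightGBM": {"structured": 0.70, "unstructured": 0.72, "joint_frozen": 0.78, "joint_tar": 0.82},
    "CatBoost": {"structured": 0.71, "unstructured": 0.73, "joint_frozen": 0.79, "joint_tar": 0.83},
    "TabM": {"structured": 0.69, "unstructured": 0.74, "joint_frozen": 0.74, "joint_tar": 0.80},
}
print(accept_dataset(example, delta=0.01, rho=0.6))
```

In this example two of three learners pass both criteria, so the dataset is accepted at `rho=0.6` but would be rejected at `rho=0.8`.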
3. Dataset composition and curation procedure
MulTaBench contains 40 datasets split equally between image–tabular and text–tabular tasks, balancing classification and regression (Arazi et al., 11 May 2026). The domains include healthcare, e-commerce, social media, entertainment, scientific and academic metadata, and restaurant review data. Example image–tabular classification datasets include CheXpert, CBIS-DDSM, Glaucoma SMDG, and PetFinder; image–tabular regression examples include Amazon Packages, Painting Price, and Mango Mass. Text–tabular classification examples include Jigsaw Toxicity, Michelin Guide, Spotify Genres, and Zomato Restaurants, while text–tabular regression examples include Mercari Marketplace, Book Readability, and Video Game Sales (Arazi et al., 11 May 2026).
The benchmark specifies broad target and input ranges. Classification targets span 2–114 classes, while regression targets include price, rating, mass, and annual salary. Structured inputs comprise 1–245 numeric or categorical features. Unstructured inputs consist of one image per row for image–tabular tasks or multiple free-text columns per row for text–tabular tasks (Arazi et al., 11 May 2026).
The curation pipeline filtered candidate datasets using the acceptance rule above. Among 56 candidate text–tabular datasets aggregated from AutoML Multimodal, CARTE, Grinsztajn et al., and TextTabBench, 23% failed Joint Signal, a further 36% failed Task-awareness, and the remaining 41% passed both. The image–tabular pool had 16 unique candidates from MuG, BAG, TIME, and MultiModalTabPFN; only 5 passed, which led to additional manual curation from Kaggle to reach 20 accepted image–tabular datasets (Arazi et al., 11 May 2026).
MulTaBench also includes trimodal analysis. Eight image–tabular datasets additionally contain text, and two of them, PetFinder and Amazon Packages, pass a trimodal curation criterion in which both text and image provide Joint Signal and both benefit from TAR (Arazi et al., 11 May 2026). For both datasets, the paper reports that applying TAR to both modalities performs best (Arazi et al., 11 May 2026). A plausible implication is that the benchmark is not restricted to bimodal fusion, even though its default presentation is organized around image–tabular and text–tabular tasks.
4. Data protocol, preprocessing, and target-aware adaptation
MulTaBench provides a reproducible pipeline spanning data acquisition, preprocessing, encoder adaptation, and downstream evaluation (Arazi et al., 11 May 2026). The text–tabular datasets are drawn from published benchmarks such as AutoML Multimodal, CARTE, TextTabBench, and Grinsztajn et al. The image–tabular datasets are curated primarily from Kaggle and public datasets, and all 20 preprocessed image–tabular datasets are uploaded under a unified Kaggle API (Arazi et al., 11 May 2026).
Two pretrained encoder families are used. For text, the benchmark uses e5-v2-small with 384-dimensional embeddings and also evaluates e5-large with 1024-dimensional embeddings. For images, it uses DINO-v3-small ViT-S/16 with 384-dimensional embeddings and also evaluates DINO-v3-large ViT-L/16 with 1024-dimensional embeddings (Arazi et al., 11 May 2026).
Target-aware tuning is implemented as LoRA on the top 3 transformer layers, with dropout $0.1$ and the remaining LoRA hyperparameters, including the rank, set as specified in the paper (Arazi et al., 11 May 2026). Optimization uses AdamW with encoder-specific learning rates for e5 and DINO, weight decay $0.01$, a fixed batch size, and early stopping after 3 epochs without validation improvement. DINO is trained for up to 100 epochs and e5 for up to 50 (Arazi et al., 11 May 2026). For regression, the benchmark discretizes continuous targets into 20 equal-frequency bins and optimizes cross-entropy over bins, which the paper states is more stable than direct regression finetuning. When multiple text columns are present, one shared e5 TAR model is trained jointly across columns using “col_name: col_val”–to–target pairs (Arazi et al., 11 May 2026).
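Two of the steps described above, equal-frequency target binning and per-column text serialization, are simple enough to sketch with NumPy. The helper names below are hypothetical, and details such as how ties at bin edges are handled are assumptions rather than the paper's implementation.

```python
import numpy as np

def equal_frequency_bins(y_train, n_bins=20):
    """Interior bin edges taken from quantiles of the training targets."""
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(np.asarray(y_train, dtype=float), quantiles)

def to_bin_labels(y, edges):
    """Map continuous targets to integer class labels 0..len(edges)."""
    return np.searchsorted(edges, np.asarray(y, dtype=float), side="right")

def serialize_text_columns(row, text_columns):
    """Build one 'col_name: col_val' string per text column of a row."""
    return [f"{col}: {row[col]}" for col in text_columns]

# Edges are fitted on training targets only, then reused for any split.
y_train = np.arange(1000)
edges = equal_frequency_bins(y_train)
labels = to_bin_labels(y_train, edges)
counts = np.bincount(labels, minlength=20)

row = {"title": "Red shoes", "brand": "Acme", "price": 19.9}
pairs = serialize_text_columns(row, ["title", "brand"])
```

Each serialized string would be paired with the row's (binned) target, so that a single shared text encoder sees examples from every text column.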
Strict split hygiene is emphasized. TAR is trained only on the training split, with an internal 90/10 stratified train/validation split for checkpoint selection, and the test split is never used during TAR training (Arazi et al., 11 May 2026). Embedding features are then reduced using PCA to 30 components, while 15, 60, and no-PCA variants are also evaluated. Structured features undergo minimal cleaning, with columns removed if they trivially leak or dominate the target (Arazi et al., 11 May 2026).
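The split hygiene described above extends naturally to the PCA step: the projection is fitted on training embeddings and only applied to test embeddings. The sketch below uses a plain NumPy SVD in place of any particular PCA library; the function names and the random stand-in embeddings are assumptions for illustration.

```python
import numpy as np

def fit_pca(train_emb, n_components=30):
    """Fit PCA on training embeddings only; return the mean and components."""
    mean = train_emb.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(train_emb - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(emb, mean, components):
    """Project embeddings with statistics learned from the training split."""
    return (emb - mean) @ components.T

rng = np.random.default_rng(0)
train_emb = rng.normal(size=(200, 384))  # stand-in for 384-dim encoder outputs
test_emb = rng.normal(size=(50, 384))

mean, comps = fit_pca(train_emb, n_components=30)
train_red = apply_pca(train_emb, mean, comps)
test_red = apply_pca(test_emb, mean, comps)  # test data never influences the fit
```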
The paper’s training objectives are given explicitly. For TAR, classification and quantized regression use cross-entropy:
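$\mathcal{L}_{\text{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log \hat{p}_{i,c}$
where $N$ is the number of examples, $C$ the number of classes or bins, $y_{i,c}$ the one-hot target, and $\hat{p}_{i,c}$ the predicted probability.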
For vanilla regression baselines, the paper also reports mean squared error and root mean squared error:
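$\text{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$
$\text{RMSE} = \sqrt{\text{MSE}}$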
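For concreteness, the three objectives named above (cross-entropy, mean squared error, and root mean squared error) can be written in NumPy using their standard definitions. The function names and the probability clipping constant are illustrative choices, not taken from the paper.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class (or bin)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    picked = probs[np.arange(len(labels)), labels]
    # Clip to avoid log(0); the constant is an arbitrary small value.
    return float(-np.mean(np.log(np.clip(picked, 1e-12, 1.0))))

def mse(y_true, y_pred):
    """Mean squared error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(diff ** 2))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(mse(y_true, y_pred)))

uniform_ce = cross_entropy([[0.5, 0.5], [0.5, 0.5]], [0, 1])  # equals log(2)
```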
5. Models, fusion regimes, and evaluation methodology
MulTaBench distinguishes between curation learners, an extended robustness suite, and end-to-end multimodal baselines (Arazi et al., 11 May 2026). The five curation learners are LightGBM, CatBoost, TabM, TabPFNv2, and TabPFN-2.5. The extended suite adds XGBoost, RandomForest, RealMLP, TabDPT, and TabICLv2. End-to-end multimodal baselines include AG-MM, TabSTAR, and ConTextTab (Arazi et al., 11 May 2026).
The benchmark’s default fusion strategy is late fusion: PCA-compressed TAR or frozen embeddings are concatenated with tabular columns, and then standard tabular learners are trained on the combined features (Arazi et al., 11 May 2026). End-to-end models instead natively fuse unstructured encoders with tabular backbones. ConTextTab is described as using frozen text embeddings within a table-native in-context-learning transformer (Arazi et al., 11 May 2026).
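Late fusion by concatenation can be illustrated as follows. The data are synthetic, and a closed-form ridge regression stands in for the tabular learner purely to keep the sketch dependency-free; it is not one of the benchmark's learners, and all names here are hypothetical.

```python
import numpy as np

def late_fuse(tabular, reduced_emb):
    """Late fusion: append embedding features to the tabular columns."""
    return np.hstack([tabular, reduced_emb])

def fit_ridge(X, y, l2=1.0):
    """Closed-form ridge regression with a bias column (stand-in learner)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + l2 * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)

def predict_ridge(X, w):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w

rng = np.random.default_rng(0)
tabular = rng.normal(size=(300, 5))        # structured columns
reduced_emb = rng.normal(size=(300, 30))   # e.g. PCA-compressed embeddings
# Synthetic target that depends on both modalities.
y = tabular[:, 0] + reduced_emb[:, 0] + 0.1 * rng.normal(size=300)

fused = late_fuse(tabular, reduced_emb)
mse_tab = float(np.mean((predict_ridge(tabular, fit_ridge(tabular, y)) - y) ** 2))
mse_fused = float(np.mean((predict_ridge(fused, fit_ridge(fused, y)) - y) ** 2))
```

Because the synthetic target uses a signal that only the embedding carries, the fused features fit it far better than the tabular columns alone, which mirrors the complementarity the benchmark selects for.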
Evaluation is performed over 5 random seeds per dataset, model, and condition, with subsampling up to 10,000 examples per run for cost efficiency (Arazi et al., 11 May 2026). Classification performance is measured by ROC AUC,
$\text{AUC} = \int_{0}^{1} \text{TPR}\; d(\text{FPR})$
and regression performance by coefficient of determination,
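$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2}$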
(Arazi et al., 11 May 2026). For pooled comparisons across classification and regression tasks, the paper applies min–max normalization to bring AUC and $R^2$ onto a common scale and reports confidence intervals. TAR win-rates are computed as the fraction of dataset–fold pairs for which TAR outperforms frozen embeddings (Arazi et al., 11 May 2026).
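The two metrics and the win-rate statistic can be sketched in NumPy from their standard definitions. The AUC below covers only the binary case, using the pairwise-ranking formulation; how multiclass AUC is aggregated is not covered here, and the function names are illustrative.

```python
import numpy as np

def roc_auc(y_true, scores):
    """Binary ROC AUC as P(score_pos > score_neg), with ties counted as 1/2."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    diff = pos[:, None] - neg[None, :]
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg)))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def win_rate(tar_scores, frozen_scores):
    """Fraction of paired runs in which TAR strictly beats frozen embeddings."""
    return float(np.mean(np.asarray(tar_scores) > np.asarray(frozen_scores)))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
r2 = r2_score([3, -0.5, 2, 7], [2.5, 0.0, 2, 8])
wins = win_rate([0.80, 0.70, 0.90, 0.60], [0.75, 0.72, 0.85, 0.50])
```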
A concise summary of the benchmark’s main evaluation components is given below.
| Component | Specification | Source |
|---|---|---|
| Dataset count | 40 datasets: 20 image–tabular, 20 text–tabular | (Arazi et al., 11 May 2026) |
| Curation learners | LightGBM, CatBoost, TabM, TabPFNv2, TabPFN-2.5 | (Arazi et al., 11 May 2026) |
| Default fusion | Concatenate PCA-compressed TAR/frozen embeddings with tabular columns | (Arazi et al., 11 May 2026) |
| Seeds | 5 random seeds per dataset/model/condition | (Arazi et al., 11 May 2026) |
| Classification metric | ROC AUC | (Arazi et al., 11 May 2026) |
| Regression metric | $R^2$ | (Arazi et al., 11 May 2026) |
6. Empirical results and methodological implications
The central empirical finding is that TAR consistently improves over frozen embeddings across modalities, learners, encoder sizes, and embedding dimensions (Arazi et al., 11 May 2026). Averaged over all datasets and learners, the reported gains are +0.022 mean normalized score for image–tabular tasks and +0.018 for text–tabular tasks (Arazi et al., 11 May 2026).
The paper reports model-level TAR win-rates together with confidence intervals. For image–tabular tasks, the win-rates are 84% for LightGBM, 90% for CatBoost, 77% for TabPFNv2, 82% for TabM, and 55% for TabICLv2. For text–tabular tasks, the corresponding numbers are 93% for LightGBM, 93% for CatBoost, 84% for TabPFNv2, 77% for TabM, and 75% for TabICLv2 (Arazi et al., 11 May 2026). ConTextTab is reported to be substantially outperformed by AG-MM and TabSTAR on MulTaBench, which the paper interprets as evidence that the tasks reward target-aware text modeling rather than frozen text embeddings (Arazi et al., 11 May 2026).
Several robustness analyses refine this picture. Larger encoders improve both frozen and TAR performance, but TAR remains superior; notably, TAR-small often surpasses Frozen-large, indicating that scaling capacity alone does not ensure retention of task-relevant information (Arazi et al., 11 May 2026). Likewise, TAR gains persist across PCA sizes of 15, 30, and 60 and even without PCA, which the paper uses to argue that the improvements are not an artifact of compression (Arazi et al., 11 May 2026). Image attention maps show that TAR shifts DINO-v3 attention from global or background regions toward target-relevant areas such as lung regions in CheXpert, the optic disc in Glaucoma, animal ears and eyes in PetFinder, and facial features in CelebA (Arazi et al., 11 May 2026).
The benchmark also quantifies compute trade-offs. Relative to frozen embeddings, TAR multiplies runtime for both image tasks with small DINO and text tasks with small e5, and e5-large TAR can approach multi-hour runs on larger datasets (Arazi et al., 11 May 2026). The authors state that this overhead renders naive hyperparameter optimization with per-fold TAR impractical, and they note that the reported gains are conservative because no per-dataset TAR tuning was performed (Arazi et al., 11 May 2026).
These findings motivate a specific practical recipe. The paper recommends starting with LightGBM or CatBoost plus PCA(30) on embeddings and e5-small or DINO-small for extraction, then applying LoRA TAR on the last 3 transformer layers and tuning only on the training split with early stopping (Arazi et al., 11 May 2026). It further recommends discretizing regression targets into 20 equal-frequency bins and optimizing cross-entropy, using one shared encoder for multiple text columns, and validating complementarity by comparing unimodal baselines, Joint Frozen, and Joint TAR in sequence (Arazi et al., 11 May 2026).
7. Relation to adjacent benchmarks and limitations
MulTaBench occupies a specific niche among recent tabular benchmarks. It differs from MultiTab, which evaluates supervised tabular learning across 196 datasets and seven regime axes using normalized error and explicit thresholds (Lee et al., 20 May 2025). It also differs from MultiTab-Bench, the synthetic multitask tabular dataset generator associated with MultiTab-Net, which controls task count, pairwise task correlations, relative task complexity, and task-specific noise for multitask regression experiments (Sinodinos et al., 13 Nov 2025). Although the names are similar and the alias “MulTaBench” is sometimes applied to those resources, their goals are distinct: MultiTab studies inductive bias across tabular data regimes, MultiTab-Bench studies multitask dynamics in synthetic tabular settings, and MulTaBench studies multimodal tabular prediction with target-aware representations (Lee et al., 20 May 2025, Sinodinos et al., 13 Nov 2025, Arazi et al., 11 May 2026).
It also differs from MMTBench, which evaluates question answering and reasoning over complex multimodal tables with interleaved charts, maps, logos, and other images (Titiya et al., 27 May 2025), and from MTBench, which targets massively parallelized multi-task reinforcement learning for robotics (Joshi et al., 31 Jul 2025). The overlap is thus nominal rather than methodological.
The limitations stated for MulTaBench are consequential. The curation process entangles dataset selection with a specific algorithmic solution, namely LoRA TAR on the last 3 encoder layers, so models used in curation cannot be fairly ranked on a benchmark whose datasets were selected for their success criteria (Arazi et al., 11 May 2026). TAR also imposes notable runtime and memory costs, especially for text with multiple columns, making full hyperparameter optimization with per-fold TAR impractical. Despite being described as the largest image–tabular benchmark to date, the benchmark still underrepresents many domains such as audio, video, and genomics, and more trimodal datasets are needed (Arazi et al., 11 May 2026).
The future directions proposed in the paper center on “true Multimodal Tabular Foundation Models” that jointly model text, image, and tabular data while retaining the robustness characteristics associated with PFN and ICL approaches (Arazi et al., 11 May 2026). Promising paths include architectures that couple PFN-style in-context tabular learning with target-aware multimodal encoders, synthetic training priors extended to multimodal features, and dedicated text–image–tabular benchmarks with cross-tabular self-supervised pretraining (Arazi et al., 11 May 2026). This suggests that MulTaBench functions both as a benchmark and as a stress test for whether multimodal tabular methods genuinely exploit complementary unstructured signal rather than merely appending frozen embeddings.