Orion-MSP: Scalable Tabular ICL

Updated 9 November 2025
  • Orion-MSP is a tabular in-context learning architecture that integrates multi-scale sparse attention and a Perceiver-style memory to overcome traditional computational bottlenecks.
  • Its hierarchical design and block-sparse attention reduce complexity from quadratic to near-linear scaling, enabling efficient processing of high-dimensional tables.
  • Empirical results on benchmarks like TALENT, OpenML-CC18, and TabZilla demonstrate competitive accuracy and robust performance across diverse datasets.

Orion-MSP is a tabular in-context learning (ICL) architecture that addresses core representational and computational bottlenecks in table-native neural models. It integrates multi-scale sparse attention mechanisms and a Perceiver-style memory module to enable efficient, accurate, and scalable modeling of tabular data prompts. Orion-MSP processes a dataset composed of both training and test rows in a single forward pass, yielding competitive results without task-specific fine-tuning across a broad set of benchmarks.

1. Foundations and Motivation

Tabular in-context learning (ICL) reformulates the standard supervised learning paradigm for tabular data by presenting the model with a prompt consisting of context rows (training examples) and query rows (test instances), then requiring direct prediction of the test labels

$$p(y \mid x, \mathcal{C}) \qquad \forall x \in \mathcal{Q},$$

where $\mathcal{C} = \{(x_i, y_i)\}_{i=1}^{n_{\rm train}}$ is the context set and $\mathcal{Q} = \{x_i\}_{i=n_{\rm train}+1}^{n_{\rm train}+n_{\rm test}}$ is the query set. Prior models such as TabPFN and TabICL have demonstrated strong performance but are limited by single-scale feature processing, dense attention patterns with quadratic scaling in table width, and strictly sequential pipelines that prohibit cross-component feedback.
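
To make the interface concrete, the toy sketch below estimates $p(y \mid x, \mathcal{C})$ for every query row directly from the context set, with no parameter updates. A k-nearest-neighbour rule stands in for the neural ICL model purely for illustration; it is not Orion-MSP and all names here are hypothetical.

import numpy as np

def icl_predict_proba(X_ctx, y_ctx, X_qry, n_classes, k=15):
    """Estimate p(y | x, C) for every query row x in Q directly from the context set C."""
    probs = np.zeros((len(X_qry), n_classes))
    for i, x in enumerate(X_qry):
        dist = np.linalg.norm(X_ctx - x, axis=1)              # distance to every context row
        nearest = y_ctx[np.argsort(dist)[:k]]                 # labels of the k closest context rows
        probs[i] = np.bincount(nearest, minlength=n_classes) / k
    return probs

rng = np.random.default_rng(0)
X_ctx, y_ctx = rng.normal(size=(500, 20)), rng.integers(0, 3, size=500)   # context set C
X_qry = rng.normal(size=(100, 20))                                        # query set Q
print(icl_predict_proba(X_ctx, y_ctx, X_qry, n_classes=3).shape)          # (100, 3)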

Orion-MSP targets these deficiencies by introducing hierarchical feature interaction via multi-scale grouping, block-sparse attention for computational efficiency at scale, and a latent memory module allowing (ICL-safe) bidirectional information flow.

2. Architectural Components

Orion-MSP’s architectural pipeline comprises four principal components: column-wise embedding, multi-scale sparse row interaction, a cross-component Perceiver-style memory, and an ICL predictor. Each stage is constructed to address specific limitations of previous tabular ICL frameworks.

2.1 Column-Wise Embedding

Utilizing a Set Transformer ($\mathrm{TF}_{\rm col}$) with $k$ inducing points per column, Orion-MSP produces a context-aware embedding tensor $\mathbf{E} \in \mathbb{R}^{n \times (m+C) \times d}$. This method is consistent with TabICL and leverages ISAB (Induced Set Attention Block) layers to encode distributional summaries observed from training rows, and to decode sample-wise embeddings. The result is per-cell affine feature representations, facilitating context-dependent adaptation.
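
To make the ISAB encode/decode pattern concrete, the PyTorch sketch below embeds each column independently through $k$ inducing points. Class names, depth, and layer widths are illustrative assumptions rather than the released implementation.

import torch
import torch.nn as nn

class ISAB(nn.Module):
    """Induced Set Attention Block: route n cell values through k inducing points."""
    def __init__(self, d, k, n_heads=4):
        super().__init__()
        self.inducing = nn.Parameter(torch.randn(k, d))
        self.encode = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.decode = nn.MultiheadAttention(d, n_heads, batch_first=True)
    def forward(self, x):                                     # x: (batch, n, d)
        ind = self.inducing.unsqueeze(0).expand(x.shape[0], -1, -1)
        summary, _ = self.encode(ind, x, x)                   # distributional summary, (batch, k, d)
        out, _ = self.decode(x, summary, summary)             # per-cell embeddings, (batch, n, d)
        return out

class ColumnEmbedder(nn.Module):
    """Embed each of m columns independently: (n, m) table -> (n, m, d) cell embeddings."""
    def __init__(self, d=64, k=16):
        super().__init__()
        self.lift = nn.Linear(1, d)
        self.isab = ISAB(d, k)
    def forward(self, X):                                     # X: (n, m) numeric table
        cols = X.t().unsqueeze(-1)                            # (m, n, 1): one value set per column
        E = self.isab(self.lift(cols))                        # (m, n, d)
        return E.permute(1, 0, 2)                             # (n, m, d)

E = ColumnEmbedder()(torch.randn(200, 12))                    # -> torch.Size([200, 12, 64])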

2.2 Multi-Scale Feature Grouping and Encoding

For hierarchical modeling, feature tokens are grouped at multiple scales $\mathcal{S} = \{s_1, \ldots, s_M\}$ (e.g., $\{1, 4, 16\}$). At each scale $s$ (a minimal sketch follows this list):

  • The $m$ feature tokens are partitioned into $K_s = \lceil m/s \rceil$ groups by Pooling by Multihead Attention (PMA).
  • Each scale’s sequence is prepended with $N_{\rm cls}$ [CLS] tokens and $N_{\rm global}$ [GLOBAL] tokens, yielding total length $L_s = N_{\rm cls} + N_{\rm global} + K_s$.
  • Dedicated sparse-attention Transformers encode the grouped representations.
  • Outputs are aggregated across scales, then flattened and layer-normalized: $\mathbf{H} = \mathrm{LayerNorm}\big(\mathrm{Flatten}\big(\frac{1}{M}\sum_{s \in \mathcal{S}} \mathbf{H}_s\big)\big)$, where $\mathbf{H}_s$ is derived from the [CLS] outputs at each scale.
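
The sketch below illustrates this grouping-and-aggregation path in PyTorch. The PMA seeds derived from mean-pooled groups, the omission of [GLOBAL] tokens and of the sparse mask, and all layer sizes are simplifying assumptions for readability; only the overall data flow (group per scale, encode, average the [CLS] outputs, flatten, normalize) mirrors the description above.

import math
import torch
import torch.nn as nn

class PMA(nn.Module):
    """Pooling by Multihead Attention: K query seeds attend over the token set."""
    def __init__(self, d, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
    def forward(self, seeds, tokens):               # seeds: (B, K, d), tokens: (B, m, d)
        pooled, _ = self.attn(seeds, tokens, tokens)
        return pooled                               # (B, K, d)

class MultiScaleRowEncoder(nn.Module):
    """Group feature tokens at several scales, encode each, average the [CLS] outputs."""
    def __init__(self, d=64, scales=(1, 4, 16), n_cls=4, n_heads=4):
        super().__init__()
        self.scales, self.n_cls = scales, n_cls
        self.cls = nn.Parameter(torch.randn(1, n_cls, d))
        self.seed_proj = nn.Linear(d, d)            # derives PMA seeds from mean-pooled groups
        self.pma = nn.ModuleList(PMA(d, n_heads) for _ in scales)
        self.enc = nn.ModuleList(
            nn.TransformerEncoderLayer(d, n_heads, batch_first=True) for _ in scales)
        self.norm = nn.LayerNorm(n_cls * d)
    def forward(self, E):                           # E: (B, m, d) feature tokens per row
        B, m, d = E.shape
        outs = []
        for pma, enc, s in zip(self.pma, self.enc, self.scales):
            K_s = math.ceil(m / s)
            pad = K_s * s - m                       # pad so tokens split evenly into K_s groups
            Ep = nn.functional.pad(E, (0, 0, 0, pad))
            seeds = self.seed_proj(Ep.view(B, K_s, s, d).mean(dim=2))   # (B, K_s, d)
            G_s = pma(seeds, E)                     # grouped feature tokens, (B, K_s, d)
            X_s = torch.cat([self.cls.expand(B, -1, -1), G_s], dim=1)   # prepend [CLS]
            Z_s = enc(X_s)                          # dense encoder stands in for the sparse one
            outs.append(Z_s[:, :self.n_cls, :])     # keep the [CLS] outputs of this scale
        H = torch.stack(outs).mean(dim=0)           # average across scales
        return self.norm(H.flatten(1))              # (B, n_cls * d)

rows = MultiScaleRowEncoder()(torch.randn(32, 100, 64))   # -> torch.Size([32, 256])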

2.3 Block-Sparse Attention Mechanism

For each scale, a block-sparse self-attention mask $\mathbf{M}_s \in \mathbb{R}^{L_s \times L_s}$ is imposed:

  1. [CLS] and [GLOBAL] tokens are densely connected amongst themselves and to all other tokens.
  2. Feature tokens interact locally via a sliding window of radius $w$, yielding efficient modeling of local dependencies.
  3. Random long-range connections (random links) supplement local and global communication, with $r$ such random indices per token.

The resultant sparse attention mechanism reduces per-layer computational complexity from $O(m^2)$ to approximately $O(m(w+g+r))$, with $(w+g+r) \ll m$. The multi-head attention operation for each head $\ell$ is

$$\text{head}_\ell = \mathrm{softmax}\Big(\frac{Q_\ell K_\ell^\top + \mathbf{M}_s}{\sqrt{d_k}}\Big)\, V_\ell.$$
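
The mask construction can be made concrete with a short sketch. The helper below (a hypothetical build_block_sparse_mask, loosely mirroring BuildBlockSparseMask in Algorithm 1) assembles the three connectivity patterns as a boolean matrix and converts it to the additive form used inside the softmax; parameter defaults are illustrative assumptions, not the paper's configuration.

import torch

def build_block_sparse_mask(L, n_special, w=8, r=4):
    """True = attention permitted. Token layout: [special tokens | feature tokens]."""
    allowed = torch.zeros(L, L, dtype=torch.bool)
    # 1. [CLS]/[GLOBAL] tokens attend everywhere and are attended to by every token.
    allowed[:n_special, :] = True
    allowed[:, :n_special] = True
    # 2. Sliding window of radius w among the feature tokens.
    idx = torch.arange(n_special, L)
    dist = (idx.unsqueeze(1) - idx.unsqueeze(0)).abs()
    allowed[n_special:, n_special:] |= dist <= w
    # 3. r random long-range links per feature token.
    n_feat = L - n_special
    if r > 0 and n_feat > 1:
        rand = torch.randint(n_special, L, (n_feat, r))
        allowed[idx.unsqueeze(1), rand] = True
    return allowed

def to_additive(allowed):
    """0 where attention is allowed, -inf where it is masked (as in the head_l equation)."""
    return torch.zeros(allowed.shape).masked_fill(~allowed, float("-inf"))

M_s = to_additive(build_block_sparse_mask(L=64, n_special=6))   # (64, 64) additive mask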

2.4 Perceiver-Style Memory for Bidirectional Cross-Component Communication

A latent memory $\mathbf{L}_0 \in \mathbb{R}^{P \times d_H}$ enables bidirectional interaction while guaranteeing ICL safety (a minimal sketch follows this list):

  • Write phase (training rows only): Latent memory attends to embeddings of context rows via a number of CrossAttnBlock updates.
  • Read phase (all rows): Each row embedding attends to the final memory, refining its representation.
  • Embeddings are then propagated to the final prediction head.
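
A minimal PyTorch sketch of the write/read phases is given below. The CrossAttnBlock internals, latent count $P$, and iteration counts are illustrative assumptions; the property illustrated is that the memory is written from context (training) rows only and then read by all rows, preserving ICL safety.

import torch
import torch.nn as nn

class CrossAttnBlock(nn.Module):
    """Cross-attention followed by a feed-forward update (illustrative internals)."""
    def __init__(self, d, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
    def forward(self, q, kv):
        h, _ = self.attn(q, kv, kv)
        q = self.norm1(q + h)
        return self.norm2(q + self.ff(q))

class PerceiverMemory(nn.Module):
    def __init__(self, d, P=32, n_write=2, n_read=1):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(P, d))          # L0, shared initial memory
        self.write = nn.ModuleList(CrossAttnBlock(d) for _ in range(n_write))
        self.read = nn.ModuleList(CrossAttnBlock(d) for _ in range(n_read))
    def forward(self, H, n_train):
        # H: (B, n, d) row embeddings; rows [0, n_train) are the context (training) rows.
        L = self.latents.unsqueeze(0).expand(H.shape[0], -1, -1)
        for blk in self.write:             # write phase: latents attend to context rows only
            L = blk(L, H[:, :n_train, :])
        for blk in self.read:              # read phase: every row (train and test) reads the memory
            H = blk(H, L)
        return H

H = PerceiverMemory(d=64)(torch.randn(8, 120, 64), n_train=100)   # -> (8, 120, 64)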

2.5 ICL Predictor: Dataset-Level Decoding

A split-masked Transformer processes the label-injected final row representations over both training and test rows, using an attention mask that strictly prevents any test-to-train leakage (a sketch of one plausible mask construction follows this list):

  • One-hot label embeddings are injected into training row representations.
  • Split-masked attention ensures independence across training and test sets during decoding.
  • The output for test rows is mapped to class logits by an MLP decoder.
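
One plausible construction of the split mask, consistent with the description above, is sketched below: context rows attend only to other context rows, and each test row attends to the context rows plus itself, so no information flows from test rows into the context representations. The released implementation may use a different but equivalent pattern.

import torch

def build_split_mask(n, n_train):
    """True = attention permitted (query row -> key row)."""
    allowed = torch.zeros(n, n, dtype=torch.bool)
    allowed[:, :n_train] = True              # every row may attend to the context rows
    test = torch.arange(n_train, n)
    allowed[test, test] = True               # each test row additionally attends to itself
    return allowed                           # context rows never attend to test rows

print(build_split_mask(5, 3).int())          # 3 context rows + 2 test rows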

3. Computational Complexity and Scalability

Traditional dense self-attention mechanisms over $m$ feature tokens incur $O(B\,n\,m^2\,d)$ computation per layer (where $B$ is batch size and $n$ is the number of rows). The block-sparse attention in Orion-MSP reduces this to

$$O\bigl(B\,n\,m\,(w+g+r)\,d\,N_{\rm blocks}^{\rm row}\bigr),$$

where $N_{\rm blocks}^{\rm row}$ is the number of Transformer blocks per scale and $(w+g+r) \ll m$. This design achieves near-linear scaling with respect to table width $m$, facilitating application to high-dimensional datasets that previously would lead to out-of-memory (OOM) failures in dense ICL models.
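
A quick back-of-the-envelope comparison makes the scaling gap tangible; the values of $w$, $g$, and $r$ below are illustrative assumptions, not the paper's configuration.

def attention_cost(m, w=8, g=4, r=4):
    dense = m * m                    # dense: every feature token attends to every other
    sparse = m * (w + g + r)         # block-sparse: window + global + random links per token
    return dense, sparse

for m in (50, 200, 1000):
    dense, sparse = attention_cost(m)
    print(f"m={m:5d}  dense={dense:9d}  sparse={sparse:7d}  ratio≈{dense / sparse:.0f}x")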

4. Empirical Results and Comparative Performance

Orion-MSP was evaluated on the TALENT (154 datasets), OpenML-CC18 (63 datasets), and TabZilla (27 datasets) benchmarks against neural baselines (TabPFN, TabICL, TabDPT, ContextTab, Mitra) and classical ensemble baselines (XGBoost, LightGBM, CatBoost, Random Forest). Principal aggregate results are shown below:

Model        Overall mean rank   TALENT (ACC / F1)    OpenML-CC18 (ACC / F1)   TabZilla (ACC / F1)
Orion-MSP    3.58                0.8461 / 0.8360      0.8722 / 0.8676          0.8821 / 0.8786
TabPFN       4.61                0.8514 / 0.8412      0.8714 / 0.8663          0.8752 / 0.8716
TabICL       4.96                0.8471 / 0.8379      0.8667 / 0.8623          0.8734 / 0.8698
XGBoost      6.70                0.8403 / 0.8360      0.8558 / 0.8537          0.8612 / 0.8326

Notable findings include:

  • Orion-MSP matches or surpasses the state-of-the-art in mean rank and accuracy across all benchmarks.
  • Performance is robust across small (<1k), medium (1k–10k), and large (>10k) datasets.
  • Superior scalability: Orion-MSP maintains top performance for $m > 100$, whereas other ICL models encounter memory limitations on wide tabular inputs.
  • Notably strong performance in underrepresented-class regimes (ranked #2 on class-imbalanced splits) and in the Medical and Finance domains.

5. Component Analysis and Ablation Insights

Systematic ablation studies illuminate critical components:

  • Perceiver Memory: Disabling memory ($P = 0$) increases mean rank by ~0.5 and reduces accuracy by 0.5–1%, confirming the benefit of latent cross-dataset communication.
  • Multi-Scale vs. Single-Scale: Restricting to a single-scale (TabICL-style) grouping yields a 0.4% accuracy drop on wide tables, highlighting the utility of hierarchical grouping for modeling complex feature interactions.
  • Sparse vs. Dense Attention: Dense attention models provide no accuracy benefit for narrow tables ($m < 100$) but double computational resource usage.
  • Random Links: Setting the number of random links ($r$) to zero increases variance in long-range dependencies and decreases accuracy by 0.2%.
  • Global Tokens: Omitting [GLOBAL] tokens reduces the model’s ability to capture dependencies among dispersed feature subsets, lowering top-rank scores on challenging datasets.

6. Algorithmic Outline

The practical workflow of Orion-MSP consists of three high-level modules:

Algorithm 1: Multi-Scale Sparse Row Interaction

Input:  E ∈ R^{B×n×(m+C)×d}, valid_features d_valid, scales S, window w, random_links r
Output: H ∈ R^{B×n×(N_cls·d)}
Initialize [CLS] and [GLOBAL] tokens
H_all = []
for s in S:
    K_s = ceil(m / s)
    G_s = PMA(E[:, :, C:, :], K_s)            # group feature tokens via pooling by multihead attention
    X_s = concatenate([CLS, GLOBAL, G_s])     # prepend special tokens; total length L_s
    M_s = BuildBlockSparseMask(L_s, N_special, w, r)
    Z_s = Encoder_s(X_s, M_s)                 # scale-specific block-sparse Transformer
    H_s = Z_s[:, :, 1:N_cls, :]               # keep the [CLS] outputs for this scale
    H_all.append(H_s)
H_agg = average(H_all)                        # mean across scales
H_flat = Flatten(H_agg)
H = LayerNorm(H_flat)
return H

Algorithm 2: ICL with Perceiver Memory

Input:  H ∈ R^{B×n×d_H}, y_train ∈ R^{B×n_train}
Output: logits for y_test
if P > 0:                                     # Perceiver memory enabled
    for b in 1..B:
        H_tr = H[b, 1:n_train, :]             # context (training) rows only
        L = L0                                # latent memory, P×d_H
        for i in 1..N_write:                  # write phase: memory attends to context rows
            L = CrossAttnBlock(Q=L, KV=H_tr)
        R = H[b, :, :]
        for i in 1..N_read:                   # read phase: every row attends to the memory
            R = CrossAttnBlock(Q=R, KV=L)
        H[b, :, :] = R
e_y = OneHot(y_train) · W_label               # label embeddings
H[:, 1:n_train, :] += e_y                     # inject labels into training rows
M_split = BuildSplitMask(n, n_train)          # block test-to-train attention
H_prime = TF_icl(H, attn_mask=M_split)
H_norm = LayerNorm(H_prime)
H_test = H_norm[:, n_train+1:n, :]            # test-row representations
z = GELU(H_test · W1 + b1)
logits = z · W2 + b2
return logits[:, :, :K]                       # K class logits

Algorithm 3: End-to-End Forward Pass

def OrionMSP_Forward(X, y_train):
    E = TF_col(X)                         # column-wise embedding (Section 2.1)
    H = MultiScaleRow(E)                  # multi-scale sparse row interaction (Algorithm 1)
    logits = ICL_with_Memory(H, y_train)  # Perceiver memory + ICL predictor (Algorithm 2)
    return logits

7. Availability and Implementation Considerations

Orion-MSP is available at https://github.com/Lexsi-Labs/Orion-MSP and is designed for extensibility to a range of tabular classification and regression tasks. The architecture’s reliance on block-sparse attention confers significant memory and computational advantages, especially for wide tables. Preservation of the ICL safety constraint in the memory module, as enforced by the phased read/write scheme, is critical for legitimate evaluation in few-shot and prompt-based learning settings.

A plausible implication is that as tabular benchmarks increase in both row and column dimension, Orion-MSP’s architectural innovations will be increasingly important for practical deployment and for enabling broader generalization across tabular domains. Extensions to regression or structured outputs would require adaptation of the label-injection and prediction heads, but the core architectural primitives remain general.

The notation and component names herein follow "Orion-MSP: Multi-Scale Sparse Attention for Tabular In-Context Learning" (Bouadi et al., 4 Nov 2025).
