Orion-MSP: Scalable Tabular ICL

Updated 9 November 2025
  • Orion-MSP is a tabular in-context learning architecture that integrates multi-scale sparse attention and a Perceiver-style memory to overcome traditional computational bottlenecks.
  • Its hierarchical design and block-sparse attention reduce complexity from quadratic to near-linear scaling, enabling efficient processing of high-dimensional tables.
  • Empirical results on benchmarks like TALENT, OpenML-CC18, and TabZilla demonstrate competitive accuracy and robust performance across diverse datasets.

Orion-MSP is a tabular in-context learning (ICL) architecture that addresses core representational and computational bottlenecks in table-native neural models. It integrates multi-scale sparse attention mechanisms and a Perceiver-style memory module to enable efficient, accurate, and scalable modeling of tabular data prompts. Orion-MSP processes a dataset composed of both training and test rows in a single forward pass, yielding competitive results without task-specific fine-tuning across a broad set of benchmarks.

1. Foundations and Motivation

Tabular in-context learning (ICL) reformulates the standard supervised learning paradigm for tabular data by presenting the model with a prompt consisting of context rows (training examples) and query rows (test instances), then requiring direct prediction of the test labels

$$p(y \mid x, \mathcal{C}) \qquad \forall x \in \mathcal{Q},$$

where $\mathcal{C} = \{(x_i, y_i)\}_{i=1}^{n_{\rm train}}$ is the context set and $\mathcal{Q} = \{x_i\}_{i=n_{\rm train}+1}^{n_{\rm train}+n_{\rm test}}$ is the query set. Prior models such as TabPFN and TabICL have demonstrated strong performance but are limited by single-scale feature processing, dense attention patterns with quadratic scaling in table width, and strictly sequential pipelines that prohibit cross-component feedback.
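
To make the interface concrete, the toy sketch below estimates $p(y \mid x, \mathcal{C})$ for every query row directly from the context set, with no parameter updates. A k-nearest-neighbour rule stands in for the neural ICL model purely for illustration; it is not Orion-MSP and all names here are hypothetical.

import numpy as np

def icl_predict_proba(X_ctx, y_ctx, X_qry, n_classes, k=15):
    """Estimate p(y | x, C) for every query row x in Q directly from the context set C."""
    probs = np.zeros((len(X_qry), n_classes))
    for i, x in enumerate(X_qry):
        dist = np.linalg.norm(X_ctx - x, axis=1)              # distance to every context row
        nearest = y_ctx[np.argsort(dist)[:k]]                 # labels of the k closest context rows
        probs[i] = np.bincount(nearest, minlength=n_classes) / k
    return probs

rng = np.random.default_rng(0)
X_ctx, y_ctx = rng.normal(size=(500, 20)), rng.integers(0, 3, size=500)   # context set C
X_qry = rng.normal(size=(100, 20))                                        # query set Q
print(icl_predict_proba(X_ctx, y_ctx, X_qry, n_classes=3).shape)          # (100, 3)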

Orion-MSP targets these deficiencies by introducing hierarchical feature interaction via multi-scale grouping, block-sparse attention for computational efficiency at scale, and a latent memory module allowing (ICL-safe) bidirectional information flow.

2. Architectural Components

Orion-MSP’s architectural pipeline comprises four principal components: column-wise embedding, multi-scale sparse row interaction, a cross-component Perceiver-style memory, and an ICL predictor. Each stage is constructed to address specific limitations of previous tabular ICL frameworks.

2.1 Column-Wise Embedding

Utilizing a Set Transformer ($\mathrm{TF}_{\rm col}$) with $k$ inducing points per column, Orion-MSP produces a context-aware embedding tensor $\mathbf{E} \in \mathbb{R}^{n \times (m+C) \times d}$. This method is consistent with TabICL and leverages ISAB (Induced Set Attention Block) layers to encode distributional summaries observed from training rows, and to decode sample-wise embeddings. The result is per-cell affine feature representations, facilitating context-dependent adaptation.
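
To make the ISAB encode/decode pattern concrete, the PyTorch sketch below embeds each column independently through $k$ inducing points. Class names, depth, and layer widths are illustrative assumptions rather than the released implementation.

import torch
import torch.nn as nn

class ISAB(nn.Module):
    """Induced Set Attention Block: route n cell values through k inducing points."""
    def __init__(self, d, k, n_heads=4):
        super().__init__()
        self.inducing = nn.Parameter(torch.randn(k, d))
        self.encode = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.decode = nn.MultiheadAttention(d, n_heads, batch_first=True)
    def forward(self, x):                                     # x: (batch, n, d)
        ind = self.inducing.unsqueeze(0).expand(x.shape[0], -1, -1)
        summary, _ = self.encode(ind, x, x)                   # distributional summary, (batch, k, d)
        out, _ = self.decode(x, summary, summary)             # per-cell embeddings, (batch, n, d)
        return out

class ColumnEmbedder(nn.Module):
    """Embed each of m columns independently: (n, m) table -> (n, m, d) cell embeddings."""
    def __init__(self, d=64, k=16):
        super().__init__()
        self.lift = nn.Linear(1, d)
        self.isab = ISAB(d, k)
    def forward(self, X):                                     # X: (n, m) numeric table
        cols = X.t().unsqueeze(-1)                            # (m, n, 1): one value set per column
        E = self.isab(self.lift(cols))                        # (m, n, d)
        return E.permute(1, 0, 2)                             # (n, m, d)

E = ColumnEmbedder()(torch.randn(200, 12))                    # -> torch.Size([200, 12, 64])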

2.2 Multi-Scale Feature Grouping and Encoding

For hierarchical modeling, feature tokens are grouped at multiple scales $\mathcal{S} = \{s_1, \ldots, s_M\}$ (e.g., $\{1, 4, 16\}$). At each scale $s$ (a minimal sketch follows this list):

  • The $m$ feature tokens are partitioned into $K_s = \lceil m/s \rceil$ groups by Pooling by Multihead Attention (PMA).
  • Each scale’s sequence is prepended with $N_{\rm cls}$ [CLS] tokens and $N_{\rm global}$ [GLOBAL] tokens, yielding total length $L_s = N_{\rm cls} + N_{\rm global} + K_s$.
  • Dedicated sparse-attention Transformers encode the grouped representations.
  • Outputs are aggregated across scales, then flattened and layer-normalized: $\mathbf{H} = \mathrm{LayerNorm}\big(\mathrm{Flatten}\big(\frac{1}{M}\sum_{s \in \mathcal{S}} \mathbf{H}_s\big)\big)$, where $\mathbf{H}_s$ is derived from the [CLS] outputs at each scale.
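
The sketch below illustrates this grouping-and-aggregation path in PyTorch. The PMA seeds derived from mean-pooled groups, the omission of [GLOBAL] tokens and of the sparse mask, and all layer sizes are simplifying assumptions for readability; only the overall data flow (group per scale, encode, average the [CLS] outputs, flatten, normalize) mirrors the description above.

import math
import torch
import torch.nn as nn

class PMA(nn.Module):
    """Pooling by Multihead Attention: K query seeds attend over the token set."""
    def __init__(self, d, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
    def forward(self, seeds, tokens):               # seeds: (B, K, d), tokens: (B, m, d)
        pooled, _ = self.attn(seeds, tokens, tokens)
        return pooled                               # (B, K, d)

class MultiScaleRowEncoder(nn.Module):
    """Group feature tokens at several scales, encode each, average the [CLS] outputs."""
    def __init__(self, d=64, scales=(1, 4, 16), n_cls=4, n_heads=4):
        super().__init__()
        self.scales, self.n_cls = scales, n_cls
        self.cls = nn.Parameter(torch.randn(1, n_cls, d))
        self.seed_proj = nn.Linear(d, d)            # derives PMA seeds from mean-pooled groups
        self.pma = nn.ModuleList(PMA(d, n_heads) for _ in scales)
        self.enc = nn.ModuleList(
            nn.TransformerEncoderLayer(d, n_heads, batch_first=True) for _ in scales)
        self.norm = nn.LayerNorm(n_cls * d)
    def forward(self, E):                           # E: (B, m, d) feature tokens per row
        B, m, d = E.shape
        outs = []
        for pma, enc, s in zip(self.pma, self.enc, self.scales):
            K_s = math.ceil(m / s)
            pad = K_s * s - m                       # pad so tokens split evenly into K_s groups
            Ep = nn.functional.pad(E, (0, 0, 0, pad))
            seeds = self.seed_proj(Ep.view(B, K_s, s, d).mean(dim=2))   # (B, K_s, d)
            G_s = pma(seeds, E)                     # grouped feature tokens, (B, K_s, d)
            X_s = torch.cat([self.cls.expand(B, -1, -1), G_s], dim=1)   # prepend [CLS]
            Z_s = enc(X_s)                          # dense encoder stands in for the sparse one
            outs.append(Z_s[:, :self.n_cls, :])     # keep the [CLS] outputs of this scale
        H = torch.stack(outs).mean(dim=0)           # average across scales
        return self.norm(H.flatten(1))              # (B, n_cls * d)

rows = MultiScaleRowEncoder()(torch.randn(32, 100, 64))   # -> torch.Size([32, 256])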

2.3 Block-Sparse Attention Mechanism

For each scale, a block-sparse self-attention mask $\mathbf{M}_s \in \mathbb{R}^{L_s \times L_s}$ is imposed:

  1. [CLS] and [GLOBAL] tokens are densely connected amongst themselves and to all other tokens.
  2. Feature tokens interact locally via a sliding window of radius $w$, yielding efficient modeling of local dependencies.
  3. Random long-range connections (random links) supplement local and global communication, with $r$ such random indices per token.

The resultant sparse attention mechanism reduces per-layer computational complexity from $O(m^2)$ to approximately $O(m(w+g+r))$, with $(w+g+r) \ll m$. The multi-head attention operation for each head $\ell$ is

$$\text{head}_\ell = \mathrm{softmax}\Big(\frac{Q_\ell K_\ell^\top + \mathbf{M}_s}{\sqrt{d_k}}\Big)\, V_\ell.$$
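
The mask construction can be made concrete with a short sketch. The helper below (a hypothetical build_block_sparse_mask, loosely mirroring BuildBlockSparseMask in Algorithm 1) assembles the three connectivity patterns as a boolean matrix and converts it to the additive form used inside the softmax; parameter defaults are illustrative assumptions, not the paper's configuration.

import torch

def build_block_sparse_mask(L, n_special, w=8, r=4):
    """True = attention permitted. Token layout: [special tokens | feature tokens]."""
    allowed = torch.zeros(L, L, dtype=torch.bool)
    # 1. [CLS]/[GLOBAL] tokens attend everywhere and are attended to by every token.
    allowed[:n_special, :] = True
    allowed[:, :n_special] = True
    # 2. Sliding window of radius w among the feature tokens.
    idx = torch.arange(n_special, L)
    dist = (idx.unsqueeze(1) - idx.unsqueeze(0)).abs()
    allowed[n_special:, n_special:] |= dist <= w
    # 3. r random long-range links per feature token.
    n_feat = L - n_special
    if r > 0 and n_feat > 1:
        rand = torch.randint(n_special, L, (n_feat, r))
        allowed[idx.unsqueeze(1), rand] = True
    return allowed

def to_additive(allowed):
    """0 where attention is allowed, -inf where it is masked (as in the head_l equation)."""
    return torch.zeros(allowed.shape).masked_fill(~allowed, float("-inf"))

M_s = to_additive(build_block_sparse_mask(L=64, n_special=6))   # (64, 64) additive mask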

2.4 Perceiver-Style Memory for Bidirectional Cross-Component Communication

A latent memory $\mathbf{L}_0 \in \mathbb{R}^{P \times d_H}$ enables bidirectional interaction while guaranteeing ICL safety (a minimal sketch follows this list):

  • Write phase (training rows only): Latent memory attends to embeddings of context rows via a number of CrossAttnBlock updates.
  • Read phase (all rows): Each row embedding attends to the final memory, refining its representation.
  • Embeddings are then propagated to the final prediction head.
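
A minimal PyTorch sketch of the write/read phases is given below. The CrossAttnBlock internals, latent count $P$, and iteration counts are illustrative assumptions; the property illustrated is that the memory is written from context (training) rows only and then read by all rows, preserving ICL safety.

import torch
import torch.nn as nn

class CrossAttnBlock(nn.Module):
    """Cross-attention followed by a feed-forward update (illustrative internals)."""
    def __init__(self, d, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
    def forward(self, q, kv):
        h, _ = self.attn(q, kv, kv)
        q = self.norm1(q + h)
        return self.norm2(q + self.ff(q))

class PerceiverMemory(nn.Module):
    def __init__(self, d, P=32, n_write=2, n_read=1):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(P, d))          # L0, shared initial memory
        self.write = nn.ModuleList(CrossAttnBlock(d) for _ in range(n_write))
        self.read = nn.ModuleList(CrossAttnBlock(d) for _ in range(n_read))
    def forward(self, H, n_train):
        # H: (B, n, d) row embeddings; rows [0, n_train) are the context (training) rows.
        L = self.latents.unsqueeze(0).expand(H.shape[0], -1, -1)
        for blk in self.write:             # write phase: latents attend to context rows only
            L = blk(L, H[:, :n_train, :])
        for blk in self.read:              # read phase: every row (train and test) reads the memory
            H = blk(H, L)
        return H

H = PerceiverMemory(d=64)(torch.randn(8, 120, 64), n_train=100)   # -> (8, 120, 64)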

2.5 ICL Predictor: Dataset-Level Decoding

A split-masked Transformer processes the label-injected final row representations over both training and test rows, using an attention mask that strictly prevents any test-to-train leakage (a sketch of one plausible mask construction follows this list):

  • One-hot label embeddings are injected into training row representations.
  • Split-masked attention ensures independence across training and test sets during decoding.
  • The output for test rows is mapped to class logits by an MLP decoder.
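
One plausible construction of the split mask, consistent with the description above, is sketched below: context rows attend only to other context rows, and each test row attends to the context rows plus itself, so no information flows from test rows into the context representations. The released implementation may use a different but equivalent pattern.

import torch

def build_split_mask(n, n_train):
    """True = attention permitted (query row -> key row)."""
    allowed = torch.zeros(n, n, dtype=torch.bool)
    allowed[:, :n_train] = True              # every row may attend to the context rows
    test = torch.arange(n_train, n)
    allowed[test, test] = True               # each test row additionally attends to itself
    return allowed                           # context rows never attend to test rows

print(build_split_mask(5, 3).int())          # 3 context rows + 2 test rows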

3. Computational Complexity and Scalability

Traditional dense self-attention mechanisms over $m$ feature tokens incur $O(B\,n\,m^2\,d)$ computation per layer (where $B$ is batch size and $n$ is the number of rows). The block-sparse attention in Orion-MSP reduces this to

$$O\bigl(B\,n\,m\,(w+g+r)\,d\,N_{\rm blocks}^{\rm row}\bigr),$$

where $N_{\rm blocks}^{\rm row}$ is the number of Transformer blocks per scale and $(w+g+r) \ll m$. This design achieves near-linear scaling with respect to table width $m$, facilitating application to high-dimensional datasets that previously would lead to out-of-memory (OOM) failures in dense ICL models.
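
A quick back-of-the-envelope comparison makes the scaling gap tangible; the values of $w$, $g$, and $r$ below are illustrative assumptions, not the paper's configuration.

def attention_cost(m, w=8, g=4, r=4):
    dense = m * m                    # dense: every feature token attends to every other
    sparse = m * (w + g + r)         # block-sparse: window + global + random links per token
    return dense, sparse

for m in (50, 200, 1000):
    dense, sparse = attention_cost(m)
    print(f"m={m:5d}  dense={dense:9d}  sparse={sparse:7d}  ratio≈{dense / sparse:.0f}x")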

4. Empirical Results and Comparative Performance

Orion-MSP was evaluated on the TALENT (154 datasets), OpenML-CC18 (63 datasets), and TabZilla (27 datasets) benchmarks against neural baselines (TabPFN, TabICL, TabDPT, ContextTab, Mitra) and classical ensemble baselines (XGBoost, LightGBM, CatBoost, Random Forest). Principal aggregate results are shown below:

Model        Overall mean rank   TALENT (ACC / F1)    OpenML-CC18 (ACC / F1)   TabZilla (ACC / F1)
Orion-MSP    3.58                0.8461 / 0.8360      0.8722 / 0.8676          0.8821 / 0.8786
TabPFN       4.61                0.8514 / 0.8412      0.8714 / 0.8663          0.8752 / 0.8716
TabICL       4.96                0.8471 / 0.8379      0.8667 / 0.8623          0.8734 / 0.8698
XGBoost      6.70                0.8403 / 0.8360      0.8558 / 0.8537          0.8612 / 0.8326

Notable findings include:

  • Orion-MSP matches or surpasses the state-of-the-art in mean rank and accuracy across all benchmarks.
  • Performance is robust across small (<1k), medium (1k–10k), and large (>10k) datasets.
  • Superior scalability: Orion-MSP maintains top performance for $m > 100$, whereas other ICL models encounter memory limitations on wide tabular inputs.
  • Notably strong performance in underrepresented-class regimes (ranked #2 on class-imbalanced splits) and in the Medical and Finance domains.

5. Component Analysis and Ablation Insights

Systematic ablation studies illuminate critical components:

  • Perceiver Memory: Disabling memory ($P = 0$) increases mean rank by ~0.5 and reduces accuracy by 0.5–1%, confirming the benefit of latent cross-dataset communication.
  • Multi-Scale vs. Single-Scale: Restricting to a single-scale (TabICL-style) grouping yields a 0.4% accuracy drop on wide tables, highlighting the utility of hierarchical grouping for modeling complex feature interactions.
  • Sparse vs. Dense Attention: Dense attention models provide no accuracy benefit for narrow tables ($m < 100$) but double computational resource usage.
  • Random Links: Setting the number of random links ($r$) to zero increases variance in long-range dependencies and decreases accuracy by 0.2%.
  • Global Tokens: Omitting [GLOBAL] tokens reduces the model’s ability to capture dependencies among dispersed feature subsets, lowering top-rank scores on challenging datasets.

6. Algorithmic Outline

The practical workflow of Orion-MSP consists of three high-level modules:

Algorithm 1: Multi-Scale Sparse Row Interaction

Input:  E ∈ R^{B×n×(m+C)×d}, valid_features d_valid, scales S, window w, random_links r
Output: H ∈ R^{B×n×(N_cls·d)}
Initialize [CLS] and [GLOBAL] tokens
H_all = []
for s in S:
    K_s = ceil(m / s)
    G_s = PMA(E[:, :, C:, :], K_s)            # group feature tokens via pooling by multihead attention
    X_s = concatenate([CLS, GLOBAL, G_s])     # prepend special tokens; total length L_s
    M_s = BuildBlockSparseMask(L_s, N_special, w, r)
    Z_s = Encoder_s(X_s, M_s)                 # scale-specific block-sparse Transformer
    H_s = Z_s[:, :, 1:N_cls, :]               # keep the [CLS] outputs for this scale
    H_all.append(H_s)
H_agg = average(H_all)                        # mean across scales
H_flat = Flatten(H_agg)
H = LayerNorm(H_flat)
return H

Algorithm 2: ICL with Perceiver Memory

Input:  H ∈ R^{B×n×d_H}, y_train ∈ R^{B×n_train}
Output: logits for y_test
if P > 0:                                     # Perceiver memory enabled
    for b in 1..B:
        H_tr = H[b, 1:n_train, :]             # context (training) rows only
        L = L0                                # latent memory, P×d_H
        for i in 1..N_write:                  # write phase: memory attends to context rows
            L = CrossAttnBlock(Q=L, KV=H_tr)
        R = H[b, :, :]
        for i in 1..N_read:                   # read phase: every row attends to the memory
            R = CrossAttnBlock(Q=R, KV=L)
        H[b, :, :] = R
e_y = OneHot(y_train) · W_label               # label embeddings
H[:, 1:n_train, :] += e_y                     # inject labels into training rows
M_split = BuildSplitMask(n, n_train)          # block test-to-train attention
H_prime = TF_icl(H, attn_mask=M_split)
H_norm = LayerNorm(H_prime)
H_test = H_norm[:, n_train+1:n, :]            # test-row representations
z = GELU(H_test · W1 + b1)
logits = z · W2 + b2
return logits[:, :, :K]                       # K class logits

Algorithm 3: End-to-End Forward Pass

def OrionMSP_Forward(X, y_train):
    E = TF_col(X)                         # column-wise embedding (Section 2.1)
    H = MultiScaleRow(E)                  # multi-scale sparse row interaction (Algorithm 1)
    logits = ICL_with_Memory(H, y_train)  # Perceiver memory + ICL predictor (Algorithm 2)
    return logits

7. Availability and Implementation Considerations

Orion-MSP is available at https://github.com/Lexsi-Labs/Orion-MSP and is designed for extensibility to a range of tabular classification and regression tasks. The architecture’s reliance on block-sparse attention confers significant memory and computational advantages, especially for wide tables. Preservation of the ICL safety constraint in the memory module, as enforced by the phased read/write scheme, is critical for legitimate evaluation in few-shot and prompt-based learning settings.

A plausible implication is that as tabular benchmarks increase in both row and column dimension, Orion-MSP’s architectural innovations will be increasingly important for practical deployment and for enabling broader generalization across tabular domains. Extensions to regression or structured outputs would require adaptation of the label-injection and prediction heads, but the core architectural primitives remain general.

The notation and component names herein follow "Orion-MSP: Multi-Scale Sparse Attention for Tabular In-Context Learning" (Bouadi et al., 4 Nov 2025).
