Orion-MSP: Scalable Tabular ICL
- Orion-MSP is a tabular in-context learning architecture that integrates multi-scale sparse attention and a Perceiver-style memory to overcome traditional computational bottlenecks.
- Its hierarchical design and block-sparse attention reduce complexity from quadratic to near-linear scaling, enabling efficient processing of high-dimensional tables.
- Empirical results on benchmarks like TALENT, OpenML-CC18, and TabZilla demonstrate competitive accuracy and robust performance across diverse datasets.
Orion-MSP is a tabular in-context learning (ICL) architecture that addresses core representational and computational bottlenecks in table-native neural models. It integrates multi-scale sparse attention mechanisms and a Perceiver-style memory module to enable efficient, accurate, and scalable modeling of tabular data prompts. Orion-MSP processes a dataset composed of both training and test rows in a single forward pass, yielding competitive results without task-specific fine-tuning across a broad set of benchmarks.
1. Foundations and Motivation
Tabular in-context learning (ICL) reformulates the standard supervised learning paradigm for tabular data by presenting the model with a prompt consisting of context rows (training examples) and query rows (test instances), then requiring direct prediction of the test labels $p(y_q \mid x_q, \mathcal{D}_{\mathrm{ctx}})$, where $\mathcal{D}_{\mathrm{ctx}} = \{(x_i, y_i)\}_{i=1}^{n_{\mathrm{train}}}$ is the context set and $\mathcal{D}_{\mathrm{query}} = \{x_q\}_{q=1}^{n_{\mathrm{test}}}$ is the query set. Prior models such as TabPFN and TabICL have demonstrated strong performance but are limited by single-scale feature processing, dense attention patterns with quadratic scaling in table width, and strictly sequential pipelines that prohibit cross-component feedback.
Orion-MSP targets these deficiencies by introducing hierarchical feature interaction via multi-scale grouping, block-sparse attention for computational efficiency at scale, and a latent memory module allowing (ICL-safe) bidirectional information flow.
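To make the ICL contract concrete, the sketch below shows the shape of the problem: labeled context rows and unlabeled query rows are packed into one prompt and predicted in a single forward pass, with no gradient updates at prediction time. The function names and the toy nearest-neighbour stand-in model are illustrative, not the Orion-MSP API.

```python
# Illustrative sketch of the tabular ICL prediction contract (hypothetical
# names, not the released Orion-MSP interface).
import numpy as np

def icl_predict(model_fn, X_train, y_train, X_test):
    """model_fn consumes the whole prompt at once and returns test logits."""
    prompt = {
        "context_X": X_train,   # (n_train, m) features of labeled rows
        "context_y": y_train,   # (n_train,)  labels injected into the prompt
        "query_X": X_test,      # (n_test, m) rows to classify
    }
    return model_fn(prompt)     # (n_test, K) class logits, no fine-tuning

# Toy stand-in model: 1-nearest-neighbour over the context set, just to show
# that the interface requires no task-specific training.
def toy_model(prompt):
    ctx_X, ctx_y, qry_X = prompt["context_X"], prompt["context_y"], prompt["query_X"]
    d = ((qry_X[:, None, :] - ctx_X[None, :, :]) ** 2).sum(-1)  # pairwise distances
    nearest = ctx_y[d.argmin(axis=1)]
    K = int(ctx_y.max()) + 1
    return np.eye(K)[nearest]   # one-hot "logits"

rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(32, 8)), rng.integers(0, 3, size=32)
X_te = rng.normal(size=(5, 8))
print(icl_predict(toy_model, X_tr, y_tr, X_te).shape)  # (5, 3)
```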
2. Architectural Components
Orion-MSP’s architectural pipeline comprises four principal components: column-wise embedding, multi-scale sparse row interaction, a cross-component Perceiver-style memory, and an ICL predictor. Each stage is constructed to address specific limitations of previous tabular ICL frameworks.
2.1 Column-Wise Embedding
Utilizing a Set Transformer (TF_col in Algorithm 3) with learned inducing points per column, Orion-MSP produces a context-aware embedding tensor $E$ containing one $d$-dimensional representation per table cell. This method is consistent with TabICL and leverages ISAB (Induced Set Attention Block) layers to encode distributional summaries observed from the training rows and to decode sample-wise embeddings. The result is a set of per-cell, distribution-aware feature representations, facilitating context-dependent adaptation.
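A minimal sketch of an ISAB layer is given below, assuming the standard Set Transformer formulation (learned inducing points summarize a column's values, then each cell attends back to that summary). The class name, hyperparameters, and residual/LayerNorm arrangement are illustrative choices, not the paper's exact implementation.

```python
# Hedged sketch of an Induced Set Attention Block (ISAB); sizes are illustrative.
import torch
import torch.nn as nn

class ISAB(nn.Module):
    def __init__(self, dim: int, num_inducing: int, num_heads: int = 4):
        super().__init__()
        self.inducing = nn.Parameter(torch.randn(1, num_inducing, dim))
        self.attn_encode = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_decode = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, set_size, dim) -- e.g. all cells of one column treated as a set.
        ind = self.inducing.expand(x.size(0), -1, -1)
        # Encode: inducing points summarize the column's value distribution.
        summary, _ = self.attn_encode(ind, x, x)
        summary = self.norm1(ind + summary)
        # Decode: every cell attends back to the summary for a context-aware embedding.
        out, _ = self.attn_decode(x, summary, summary)
        return self.norm2(x + out)

# Toy usage: 2 tables (batch), 128 rows treated as a set, 64-dim cell embeddings.
cells = torch.randn(2, 128, 64)
print(ISAB(dim=64, num_inducing=16)(cells).shape)  # torch.Size([2, 128, 64])
```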
2.2 Multi-Scale Feature Grouping and Encoding
For hierarchical modeling, feature tokens are grouped at multiple scales $s \in \mathcal{S}$, with larger $s$ producing coarser groups; a grouping sketch follows this list. At each scale $s$:
- The feature tokens are partitioned into $K_s = \lceil m/s \rceil$ groups by Pooling by Multihead Attention (PMA).
- Each scale’s sequence is prepended with $N_{\mathrm{cls}}$ [CLS] tokens and $N_{\mathrm{glob}}$ [GLOBAL] tokens, yielding total length $L_s = N_{\mathrm{cls}} + N_{\mathrm{glob}} + K_s$.
- Dedicated sparse-attention Transformers encode the grouped representations.
- Outputs are aggregated across scales, then flattened and layer-normalized: $H = \mathrm{LayerNorm}\big(\mathrm{Flatten}\big(\tfrac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} H_s\big)\big)$, where $H_s$ is derived from the [CLS] outputs at scale $s$.
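The sketch below illustrates the PMA grouping step across several scales: a set of $K_s$ learnable seed vectors attends to the $m$ feature tokens and pools them into $K_s$ group tokens. The scale set, dimensions, and module layout are hypothetical placeholders chosen only to show the shapes involved.

```python
# Hedged sketch of Pooling by Multihead Attention (PMA) at multiple scales;
# scale values and dimensions are illustrative, not the paper's settings.
import math
import torch
import torch.nn as nn

class PMA(nn.Module):
    def __init__(self, dim: int, num_seeds: int, num_heads: int = 4):
        super().__init__()
        self.seeds = nn.Parameter(torch.randn(1, num_seeds, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, m, dim) feature tokens of a row; output: (batch, K_s, dim).
        seeds = self.seeds.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(seeds, tokens, tokens)
        return pooled

m, dim = 64, 32
row_tokens = torch.randn(8, m, dim)      # 8 rows, 64 feature tokens each
for s in (1, 4, 16):                     # hypothetical scale set
    K_s = math.ceil(m / s)               # group count at this scale (Algorithm 1)
    groups = PMA(dim, num_seeds=K_s)(row_tokens)   # untrained, shapes only
    print(f"scale={s:2d}  K_s={K_s:2d}  grouped shape={tuple(groups.shape)}")
```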
2.3 Block-Sparse Attention Mechanism
For each scale, a block-sparse self-attention mask is imposed:
- [CLS] and [GLOBAL] tokens are densely connected amongst themselves and to other tokens.
- Feature tokens interact locally via a sliding window of radius $w$, yielding efficient modeling of local dependencies.
- Random long-range connections (random links) supplement local and global communication, with $r$ such random indices per token.
The resultant sparse attention mechanism reduces per-layer computational complexity from $O(m^2 d)$ to approximately $O\big(m\,(w + r + g)\,d\big)$, where $g$ is the number of [GLOBAL] tokens and $w + r + g \ll m$. The multi-head attention operation for each head $i$ is the standard masked attention $\mathrm{head}_i = \mathrm{softmax}\!\big(\tfrac{Q_i K_i^{\top}}{\sqrt{d_k}} + M_s\big) V_i$, where the mask $M_s$ assigns $-\infty$ to disallowed positions.
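The following sketch builds a boolean version of such a mask, mirroring the role of BuildBlockSparseMask in Algorithm 1: dense rows and columns for the special ([CLS]/[GLOBAL]) tokens, a sliding window of radius $w$ among the grouped feature tokens, and $r$ random long-range links per token. The function name and exact link-sampling policy are assumptions for illustration.

```python
# Illustrative block-sparse mask construction (True = attention allowed).
import numpy as np

def build_block_sparse_mask(seq_len: int, n_special: int, w: int, r: int,
                            seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    # Special tokens attend everywhere and are attended to by every token.
    mask[:n_special, :] = True
    mask[:, :n_special] = True
    # Sliding window of radius w among the grouped feature tokens.
    for i in range(n_special, seq_len):
        lo, hi = max(n_special, i - w), min(seq_len, i + w + 1)
        mask[i, lo:hi] = True
    # r random long-range links per feature token.
    for i in range(n_special, seq_len):
        links = rng.choice(np.arange(n_special, seq_len),
                           size=min(r, seq_len - n_special), replace=False)
        mask[i, links] = True
    return mask

mask = build_block_sparse_mask(seq_len=40, n_special=4, w=2, r=3)
print(mask.shape, f"density={mask.mean():.2f}")   # far sparser than a dense mask
```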
2.4 Perceiver-Style Memory for Bidirectional Cross-Component Communication
A latent memory $L \in \mathbb{R}^{P \times d_H}$ of $P$ learnable slots enables bidirectional interaction while guaranteeing ICL safety (a sketch follows this list):
- Write phase (training rows only): The latent memory attends to the embeddings of context rows via $N_{\mathrm{write}}$ CrossAttnBlock updates.
- Read phase (all rows): Each row embedding attends to the final memory state via $N_{\mathrm{read}}$ CrossAttnBlock updates, refining its representation.
- Embeddings are then propagated to the final prediction head.
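A compact PyTorch sketch of the write/read pattern is shown below. The CrossAttnBlock layout (cross-attention plus feed-forward with residuals) and all sizes are illustrative assumptions; only the phase structure, memory attending to training rows, then all rows reading from memory, follows the description above.

```python
# Hedged sketch of the latent-memory write/read phases (illustrative sizes).
import torch
import torch.nn as nn

class CrossAttnBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, q, kv):
        h, _ = self.attn(q, kv, kv)   # queries attend to the key/value set
        q = self.n1(q + h)
        return self.n2(q + self.ff(q))

dim, P, n_train, n_test = 64, 8, 100, 20
write_blk, read_blk = CrossAttnBlock(dim), CrossAttnBlock(dim)
latent = torch.randn(1, P, dim)                # learned memory slots (L0)
rows = torch.randn(1, n_train + n_test, dim)   # all row embeddings H

# Write phase: memory attends only to the training rows (ICL-safe).
latent = write_blk(latent, rows[:, :n_train, :])
# Read phase: every row (train and test) attends to the finalized memory.
rows = read_blk(rows, latent)
print(rows.shape)  # torch.Size([1, 120, 64])
```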
2.5 ICL Predictor: Dataset-Level Decoding
A split-masked Transformer processes the label-injected final row representations over both training and test rows, using an attention mask that strictly prevents any test-to-train leakage:
- One-hot label embeddings are injected into training row representations.
- Split-masked attention ensures independence across training and test sets during decoding.
- The output for test rows is mapped to class logits by an MLP decoder.
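The split mask referenced above can be pictured as follows: training rows attend only to training rows, and each test row attends to the training rows plus itself, so information never flows from test rows to training rows or between test rows. The function name mirrors BuildSplitMask in Algorithm 2; the exact masking convention is an assumption consistent with the no-leakage constraint stated above.

```python
# Hedged sketch of the split attention mask used by the ICL predictor.
import numpy as np

def build_split_mask(n: int, n_train: int) -> np.ndarray:
    mask = np.zeros((n, n), dtype=bool)   # True = attention allowed
    mask[:, :n_train] = True              # every row may attend to training rows
    idx = np.arange(n_train, n)
    mask[idx, idx] = True                 # each test row also attends to itself
    return mask                           # test->test and train->test stay blocked

print(build_split_mask(n=6, n_train=4).astype(int))
```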
3. Computational Complexity and Scalability
Traditional dense self-attention over feature tokens incurs $O(B\,n\,m^2\,d)$ computation per layer (where $B$ is the batch size and $n$ is the number of rows). The block-sparse attention in Orion-MSP reduces this to approximately $O\big(B\,n \sum_{s \in \mathcal{S}} N_s\, L_s\,(w + r + g)\,d\big)$, where $N_s$ is the number of Transformer blocks per scale and $(w + r + g) \ll m$. This design achieves near-linear scaling with respect to table width $m$, facilitating application to high-dimensional datasets that previously would lead to out-of-memory (OOM) failures in dense ICL models.
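A back-of-the-envelope comparison of the two regimes is sketched below, counting attended key positions per query token and per row while ignoring constant factors such as $B$, $n$, $d$, and the block count $N_s$. All numeric settings ($w$, $r$, $g$, $N_{\mathrm{cls}}$, the scale set) are hypothetical and chosen only to show the scaling trend, not the paper's configuration.

```python
# Toy cost comparison: dense vs. block-sparse attention (illustrative sizes).
import math

def dense_pairs(m: int) -> int:
    return m * m   # every feature token attends to every other feature token

def sparse_pairs(m: int, scales, w: int, r: int, g: int, n_cls: int) -> int:
    total = 0
    for s in scales:
        K_s = math.ceil(m / s)
        L_s = n_cls + g + K_s                        # special tokens + grouped features
        total += L_s * (2 * w + 1 + r + g + n_cls)   # window + random + global/CLS links
    return total

for m in (128, 512, 2048, 8192):
    d_cost = dense_pairs(m)
    s_cost = sparse_pairs(m, scales=(1, 4, 16), w=8, r=4, g=2, n_cls=4)
    print(f"m={m:5d}  dense≈{d_cost:>11,}  sparse≈{s_cost:>9,}  ratio={d_cost / s_cost:6.1f}x")
```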
4. Empirical Results and Comparative Performance
Orion-MSP was evaluated on the TALENT (154 datasets), OpenML-CC18 (63 datasets), and TabZilla (27 datasets) benchmarks against neural baselines (TabPFN, TabICL, TabDPT, ContextTab, Mitra) and classical ensemble baselines (XGBoost, LightGBM, CatBoost, Random Forest). Principal aggregate results are shown below:
| Model | Mean rank (all benchmarks, lower is better) | TALENT (Acc / F1) | OpenML-CC18 (Acc / F1) | TabZilla (Acc / F1) |
|---|---|---|---|---|
| Orion-MSP | 3.58 | 0.8461 / 0.8360 | 0.8722 / 0.8676 | 0.8821 / 0.8786 |
| TabPFN | 4.61 | 0.8514 / 0.8412 | 0.8714 / 0.8663 | 0.8752 / 0.8716 |
| TabICL | 4.96 | 0.8471 / 0.8379 | 0.8667 / 0.8623 | 0.8734 / 0.8698 |
| XGBoost | 6.70 | 0.8403 / 0.8360 | 0.8558 / 0.8537 | 0.8612 / 0.8326 |
Notable findings include:
- Orion-MSP matches or surpasses the state-of-the-art in mean rank and accuracy across all benchmarks.
- Performance is robust across small (<1k rows), medium (1k–10k rows), and large (>10k rows) datasets.
- Superior scalability: Orion-MSP maintains top performance as the number of features $m$ grows, whereas other ICL models encounter memory limitations on wide tabular inputs.
- Strong outlier performance in underrepresented-class regimes (ranked #2 in class-imbalanced splits) and in Medical and Finance domains.
5. Component Analysis and Ablation Insights
Systematic ablation studies illuminate critical components:
- Perceiver Memory: Disabling the memory (setting the number of latent slots $P = 0$) increases mean rank by ~0.5 and reduces accuracy by 0.5–1%, confirming the benefit of latent cross-component communication.
- Multi-Scale vs. Single-Scale: Restricting to a single-scale (TabICL-style) grouping yields a 0.4% accuracy drop on wide tables, highlighting the utility of hierarchical grouping for modeling complex feature interactions.
- Sparse vs. Dense Attention: Dense attention provides no accuracy benefit on narrow tables (small $m$) yet roughly doubles computational resource usage.
- Random Links: Setting the number of random links to zero ($r = 0$) increases variance in long-range dependency modeling and decreases accuracy by 0.2%.
- Global Tokens: Omitting [GLOBAL] tokens reduces the model’s ability to capture dependencies among dispersed feature subsets, lowering top-rank scores on challenging datasets.
6. Algorithmic Outline
The practical workflow of Orion-MSP consists of three high-level modules:
Algorithm 1: Multi-Scale Sparse Row Interaction
```
Input:  E ∈ R^{B×n×(m+C)×d}, valid feature count d_valid, scales S, window w, random_links r
Output: H ∈ R^{B×n×(N_cls·d)}

Initialize CLS, GLOBAL tokens
H_all = []
for s in S:
    K_s = ceil(m / s)
    G_s = PMA(E[:, :, C:, :], K_s)              # pooling via multihead attention
    X_s = concatenate([CLS, GLOBAL, G_s])
    M_s = BuildBlockSparseMask(L_s, N_special, w, r)
    Z_s = Encoder_s(X_s, M_s)                   # sparse Transformer
    H_s = Z_s[:, :, 1:N_cls, :]                 # keep the [CLS] outputs at this scale
    H_all.append(H_s)
H_agg  = average(H_all)                         # aggregate across scales
H_flat = Flatten(H_agg)
H      = LayerNorm(H_flat)
return H
```
Algorithm 2: ICL with Perceiver Memory
```
Input:  H ∈ R^{B×n×d_H}, y_train ∈ R^{B×n_train}
Output: logits for y_test

if P > 0:                                       # latent memory enabled
    for b in 1..B:
        H_tr = H[b, 1:n_train, :]
        L = L0                                  # P×d_H learned latent slots
        for i in 1..N_write:                    # write phase (training rows only)
            L = CrossAttnBlock(Q=L, KV=H_tr)
        R = H[b, :, :]
        for i in 1..N_read:                     # read phase (all rows)
            R = CrossAttnBlock(Q=R, KV=L)
        H[b, :, :] = R
e_y = OneHot(y_train) · W_label                 # label injection for training rows
H[:, 1:n_train, :] += e_y
M_split = BuildSplitMask(n, n_train)
H_prime = TF_icl(H, attn_mask=M_split)
H_norm  = LayerNorm(H_prime)
H_test  = H_norm[:, n_train+1:n, :]
z       = GELU(H_test · W1 + b1)
logits  = z · W2 + b2
return logits[:, :, 0:K]
```
Algorithm 3: End-to-End Forward Pass
```
def OrionMSP_Forward(X, y_train):
    E = TF_col(X)                         # column-wise embedding (Section 2.1)
    H = MultiScaleRow(E)                  # multi-scale sparse row interaction (Algorithm 1)
    logits = ICL_with_Memory(H, y_train)  # Perceiver memory + ICL predictor (Algorithm 2)
    return logits
```
7. Availability and Implementation Considerations
Orion-MSP is available at https://github.com/Lexsi-Labs/Orion-MSP and is designed for extensibility to a range of tabular classification and regression tasks. The architecture’s reliance on block-sparse attention confers significant memory and computational advantages, especially for wide tables. Preservation of the ICL safety constraint in the memory module, enforced by the phased write/read scheme, is critical for legitimate evaluation in few-shot and prompt-based learning settings.
A plausible implication is that as tabular benchmarks increase in both row and column dimension, Orion-MSP’s architectural innovations will be increasingly important for practical deployment and for enabling broader generalization across tabular domains. Extensions to regression or structured outputs would require adaptation of the label-injection and prediction heads, but the core architectural primitives remain general.
The notation and component names herein follow the source paper, "Orion-MSP: Multi-Scale Sparse Attention for Tabular In-Context Learning" (Bouadi et al., 4 Nov 2025).