Missing Part Sensitive Transformer
- MPSTs are deep learning architectures that represent missing data using learnable mask tokens and explicit attention masking, avoiding traditional imputation.
- They combine masked attention with learnable fusion tokens to robustly aggregate available features in domains such as multimodal sensor fusion, 3D point clouds, seismic data, and tabular data.
- Empirical results demonstrate high accuracy and resilience under extensive missingness, substantiating MPSTs' significance in robust prediction and completion tasks.
A Missing Part Sensitive Transformer (MPST) is a family of deep learning architectures based on the transformer paradigm, specifically designed to achieve robust prediction, completion, or classification in the presence of missing, incomplete, or partial input data. Unlike standard transformers, which typically degrade when faced with occluded, absent, or dropped data modalities, regions, or features, MPSTs employ architectural innovations—such as masking strategies, learnable placeholder tokens, and explicit alignment mechanisms—to ensure model behavior remains accurate and reliable under varying missing-data patterns. Instances of MPSTs have been proposed in diverse domains including multimodal sensor fusion, point cloud completion, seismic data interpolation, and tabular modeling, each adapting the MPST principle to respect the particular structure and semantics of missingness central to the application (Wen et al., 12 Dec 2025, Li et al., 2023, Cheng et al., 2024, Caruso et al., 2024).
1. Core Design Principles of Missing Part Sensitive Transformers
MPSTs are unified by several underlying principles for handling missing or partial information:
- Explicit Missing-Data Representation: Missing modalities, features, or regions are not imputed but represented either by learnable mask tokens (modality placeholders), zero-vector embeddings (“pad” tokens), or random position codes. This ensures the transformer recognizes, rather than fills, missingness at the input level.
- Masked Attention Mechanisms: Attention masks are engineered to prevent the propagation of noise or spurious signals arising from missing tokens. In multimodal contexts, modality-aware binary masks block cross-modal attention links from or to absent sources (Wen et al., 12 Dec 2025). In tabular and sequence contexts, double-sided masking ensures missing entries neither send nor receive attention (Caruso et al., 2024).
- Learnable Fusion and Querying: MPSTs introduce fusion tokens designed to adaptively aggregate information only from available modalities, regions, or features using attention. The fusion mechanism queries the collection of available embeddings, distilling a coherent global feature or prediction (Wen et al., 12 Dec 2025).
- Alignment and Contrastive Objectives: To ensure robustness and invariance, some MPSTs incorporate modules (e.g., class-former-aided modality alignment in multimodal fusion, proxy alignment in point cloud prediction) that use supervised contrastive or proxy alignment losses to align available and fused representations during training (Wen et al., 12 Dec 2025, Li et al., 2023).
- Data-Augmentation by Random Masking: Training procedures often include augmentation by random masking to ensure the model learns missing-data-agnostic representations, crucial for generalization from fully-observed to partially-observed test conditions (Caruso et al., 2024).
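The first three principles can be sketched together: substitute missing-modality tokens with learnable placeholders, block attention to missing sources, and let a fusion token aggregate only what is available. The following minimal numpy sketch is illustrative only; names such as `fuse_available` and the one-token-per-modality setup are assumptions, not the papers' actual APIs.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_available(tokens, available, mask_tokens, fusion_query):
    """Sketch of MPST-style fusion under missing modalities.

    tokens:       (M, d) one embedding per modality (illustrative).
    available:    (M,) binary availability vector.
    mask_tokens:  (M, d) learnable placeholders for missing modalities.
    fusion_query: (d,) learnable fusion-token query vector.
    """
    avail = available.astype(bool)
    # 1. Explicit missing-data representation: substitute, don't impute.
    x = np.where(avail[:, None], tokens, mask_tokens)
    # 2. Masked attention: score tokens, then block missing sources.
    scores = x @ fusion_query / np.sqrt(x.shape[1])
    scores = np.where(avail, scores, -np.inf)
    weights = softmax(scores)
    # 3. Learnable fusion: aggregate available modalities only.
    return weights @ x, weights
```

Because the masked scores are `-inf` before the softmax, missing modalities receive exactly zero attention weight, so their placeholder content cannot leak into the fused representation.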
2. Architectural Realizations Across Domains
MPST design adapts to domain requirements, yielding significant architectural diversity:
| Domain | Key MPST Mechanism | Reference |
|---|---|---|
| Multimodal Fusion | Modality tokens, masked cross-attn, fusion token, class-former alignment | (Wen et al., 12 Dec 2025) |
| Point Cloud Completion | Existing/missing proxies, attention-based random→positional mapping, proxy alignment MSE | (Li et al., 2023) |
| Seismic Interpolation | Encoder-decoder transformers, U-shaped Swin blocks, SSIM+L1 loss | (Cheng et al., 2024) |
| Tabular Data Prediction | Feature embeddings with pad index, double-sided self-attn masking, random feature cutout | (Caruso et al., 2024) |
In multimodal beam prediction (Wen et al., 12 Dec 2025), each modality (image, LiDAR, radar, GPS, beam history) is represented by learnable tokens, with missing modalities replaced by dedicated placeholders. A missing-modality-aware mask pervades all self- and cross-attention layers. The fusion token aggregates available information, and a class-former–aided module uses a contrastive loss for semantic alignment. In 3D point cloud completion (Li et al., 2023), input (existing) proxies and generated random missing proxies are aligned with ground truth via an MPST acting on proxy features, with a loss combining Chamfer distances and proxy alignment.
For seismic data interpolation (Cheng et al., 2024), the transformer integrates encoder/decoder stacks with a U-shaped Swin Transformer subnetwork, allowing reconstructions even with extensive consecutive missing traces. For tabular data, NAIM (Caruso et al., 2024) processes each feature as a token, assigning missing entries a zero vector and fully masking their influence during attention.
3. Mechanisms for Handling Missing Data
The distinguishing factor for all MPSTs is the treatment of missing inputs. In multimodal and tabular MPSTs, the availability vector or mask determines both input substitution and subsequent attention masking. For instance, given a binary missing-modality indicator $a_m$ (in multimodal fusion), tokens from missing modalities are replaced with learnable placeholder vectors. The mask within every attention layer is set to $-\infty$ whenever a token derives from a missing (i.e., $a_m = 0$) modality or feature, ensuring queries cannot attend to these entries.
In tabular NAIM, the double-sided attention mask is computed as

$$
M_{ij} = \begin{cases} -\infty & \text{if } m_i = 0 \text{ or } m_j = 0, \\ 0 & \text{otherwise}, \end{cases}
$$

where the binary vector $m$ encodes missingness ($m_i = 0$ for a missing feature), blocking both queries from and to missing features.
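A double-sided mask of this kind can be constructed in a few lines; this is a sketch of the idea rather than NAIM's actual implementation, and `double_sided_mask` is a hypothetical helper name.

```python
import numpy as np

def double_sided_mask(missing):
    """Double-sided additive attention mask (sketch): entry (i, j) is
    -inf whenever feature i or feature j is missing, so a missing
    feature neither sends nor receives attention."""
    m = np.asarray(missing, dtype=bool)
    # Broadcasting builds the full (d, d) blocked grid: row i is blocked
    # if feature i is missing, column j if feature j is missing.
    blocked = m[:, None] | m[None, :]
    return np.where(blocked, -np.inf, 0.0)
```

Adding this matrix to the raw attention logits before the softmax zeroes out every missing row and column simultaneously, which is the "double-sided" property.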
MPSTs for 3D completion use random Gaussian noise to initialize missing position proxies, allowing the transformer to learn mappings from noise to probable missing part locations and shapes, guided by explicit loss alignment with true missing part proxies.
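The noise-initialized proxy idea reduces to two small pieces: sampling Gaussian position proxies and penalizing their deviation from the ground-truth missing-part proxies. The sketch below is illustrative; the function names and shapes are assumptions, not ProxyFormer's API.

```python
import numpy as np

def init_missing_proxies(num_missing, dim, rng):
    """Initialize missing-part position proxies from Gaussian noise;
    the transformer then learns to map noise to plausible missing-part
    locations and shapes (shapes here are illustrative)."""
    return rng.standard_normal((num_missing, dim))

def proxy_alignment_loss(pred_proxies, true_proxies):
    """MSE alignment between predicted and ground-truth missing proxies,
    used alongside Chamfer-distance terms in the full objective."""
    return float(np.mean((pred_proxies - true_proxies) ** 2))
```

During training the alignment loss pulls the transformed noise proxies toward the true missing-part proxies, so at test time fresh noise samples land near plausible completions.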
4. Training Objectives and Regularization
MPSTs typically combine data-fitting objectives with additional alignment and regularization losses:
- Classification/Prediction Loss: For beam prediction, a focal loss variant is employed, incorporating soft-label cross-entropy penalties based on correct class probability and a class imbalance factor (Wen et al., 12 Dec 2025).
- Contrastive Alignment/Proxy Alignment Loss: In multimodal and point cloud settings, a supervised contrastive loss or mean squared error between predicted and ground-truth missing proxies ensures semantic and geometric consistency (Wen et al., 12 Dec 2025, Li et al., 2023).
- Structural & Pixelwise Loss: For signal or image reconstruction (e.g., seismic gap filling), loss functions combine pixelwise L1 error with structural similarity index (SSIM) (Cheng et al., 2024).
- Masking Regularization: Random augmentation by simulated missingness (feature cutout) ensures models train under a distribution matching test-time missingness and suppress overfitting to observed training patterns (Caruso et al., 2024).
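The masking-regularization idea above can be sketched as a feature-cutout augmentation step; `feature_cutout` is a hypothetical helper, not the NAIM codebase's function, and the zero-fill mirrors its "pad" embedding convention.

```python
import numpy as np

def feature_cutout(x, observed, drop_prob, rng):
    """Simulated-missingness augmentation (sketch): randomly flip some
    observed features to missing so the training distribution covers
    test-time missingness. Returns masked features and the new mask."""
    drop = rng.random(observed.shape) < drop_prob
    new_observed = observed.astype(bool) & ~drop
    # Newly-missing entries are zeroed, mirroring a "pad"/mask-token input.
    return np.where(new_observed, x, 0.0), new_observed
```

Applying a fresh random cutout each epoch exposes the model to many missingness patterns per sample, which is what lets a model trained on fully-observed data generalize to partially-observed inputs.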
5. Empirical Results and Comparative Performance
MPSTs have set benchmarks for robustness and accuracy under missingness across diverse tasks:
- In multimodal mmWave beam prediction (Wen et al., 12 Dec 2025), MPST achieved 89.07% Top-3 accuracy with full data, and its Top-1 accuracy degraded by less than 2% with up to 75% missing probability for a modality, significantly outperforming LSTM-based fusion models.
- ProxyFormer’s MPST outperformed point cloud completion baselines on standard datasets and achieved fastest inference among contemporary models by using an explicit proxy mechanism (Li et al., 2023).
- In seismic interpolation (Cheng et al., 2024), the MPST achieved a 68.6 dB SNR improvement (versus 24.6 dB for U-Net and 45.3 dB for pure Swin-Transformer) and SSIM ≈ 0.999. Ablations revealed that both attention and multi-scale Swin submodules contribute to peak performance.
- In tabular data, NAIM’s double-masked transformer surpassed both classical ML and deep learning methods paired with imputation in over 50% of missingness regimes, with AUC drops of less than 1% across extensive missingness patterns (Caruso et al., 2024). Ablations confirmed that the masking and augmentation mechanisms are necessary for robust generalization.
6. Implementation Details and Hyperparameter Choices
MPSTs are generally parameter-efficient and suitable for real-time or large-scale deployment. Detailed implementation choices include:
- Parameter and FLOP Efficiency: In AMBER (Wen et al., 12 Dec 2025), the MPST has 46.5M parameters and 47.3 GFLOPs, outperforming larger baselines (66M parameters for TII-Transfuser).
- Transformer Hyperparameters: Depth (typically 6–8 layers), number of heads (3–8), feed-forward widths (often 4× model dimension), per-domain specificities (e.g., 384 channels for ProxyFormer’s point cloud MPST (Li et al., 2023), 64 feature channels and 6 heads for seismic MPST (Cheng et al., 2024), 6 heads and 6 layers for NAIM (Caruso et al., 2024)).
- Augmentation and Training: Use of per-epoch random masking for NAIM, batch-level randomly sampled Gaussian position encodings in ProxyFormer, and data-specific module initialization and scheduling appropriate to input structure.
7. Applications, Limitations, and Significance
MPSTs provide a principled framework for robust learning when faced with incomplete input. They underpin:
- Multimodal Sensor Fusion: Allowing flexible deployment in environments with frequent sensor outage or occlusion (Wen et al., 12 Dec 2025).
- Point Cloud and 3D Shape Completion: Predicting missing geometry with high fidelity using information transfer from existing structure (Li et al., 2023).
- Signal and Image Inpainting: Effective interpolation of long missing segments in sequential/geospatial data (Cheng et al., 2024).
- Tabular Modeling: Avoidance of imputation and resilience to various missing-data regimes (Caruso et al., 2024).
A key implication is that MPSTs, by explicitly encoding missingness rather than imputing, better preserve the statistical structure of observed data and can exploit available information more efficiently. Limitations include reliance on architectural masking (which may not capture all nuances of informative missingness) and the necessity of careful alignment losses or augmentation strategies. Continued evolution is expected in model capacity, scalability to higher input dimension, and adaptation to domain-specific forms of missing-data semantics.
References
- AMBER: An Adaptive Multimodal Mask Transformer for Beam Prediction with Missing Modalities (Wen et al., 12 Dec 2025)
- ProxyFormer: Proxy Alignment Assisted Point Cloud Completion with Missing Part Sensitive Transformer (Li et al., 2023)
- Seismic Interpolation Transformer for Consecutively Missing Data: A Case Study in DAS-VSP Data (Cheng et al., 2024)
- Not Another Imputation Method: A Transformer-based Model for Missing Values in Tabular Datasets (Caruso et al., 2024)