M2T2: Multi-Task Masked Transformer

Updated 27 February 2026

The paper introduces M2T2, which unifies grasp and placement tasks by simultaneously predicting contact points and 6-DoF poses in a single inference pass.
The architecture employs multi-scale 3D scene encoding and masked multi-head attention to robustly handle a dense 16,000-point cloud without needing explicit object proposals.
Experimental results show state-of-the-art sim2real transfer, especially in challenging reorientation tasks, substantially outperforming previous models.

M2T2 (Multi-Task Masked Transformer) is a unified Transformer-based framework for object-centric 6-DoF manipulation, notably pick and place, that generalizes to arbitrary objects in cluttered scenes and supports multiple low-level action modes from a single inference pass. Unlike traditional task-specific pipelines, M2T2 simultaneously predicts contact points and valid gripper poses for diverse manipulation modes, using a masked attention mechanism and multi-scale 3D scene encoding. The model demonstrates state-of-the-art zero-shot sim2real transfer, substantially outperforming prior approaches in challenging real-world and simulated environments (Yuan et al., 2023). The M2T2 architecture is conceptually related to parallel, masked-sequence decoding approaches later developed for vision generalists (Qiu et al., 2024).

1. Model Architecture

M2T2 comprises a two-stage architecture: a scene encoder and a contact/action decoder. The input is a single dense point cloud (typically 16,000 points), generated from high-resolution depth images without requiring explicit instance segmentation or object proposals.

Scene Encoder: The backbone is a PointNet++ architecture with four set-abstraction layers. The encoder hierarchically down-samples input points, then applies feature-propagation layers to obtain per-point feature descriptors. This results in four multi-scale feature maps $\{F^1, F^2, F^3, F^4\}$, each maintaining explicit grounding to 3D coordinates.
Contact Decoder (Masked Transformer): The decoder employs multiple Transformer blocks, each comprising:
1. Cross-attention between a fixed number of query tokens and one multi-scale feature map, modulated by token-specific spatial masks.
2. Self-attention among all query tokens (grasp, place, and, optionally, language tokens).
3. Feed-forward MLP and layer normalization.
Query Tokens: There are $G=100$ grasp tokens, $P=64$ place tokens (each linked to a discrete tabletop planar rotation), and, for language-conditioned tasks, $L$ CLIP-embedded language tokens.
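
As an illustration of how place tokens could map to discrete planar rotations, the sketch below builds one rotation matrix per token about the table normal. Uniform angular spacing over a full turn, the z-axis as the table normal, and the function name `planar_rotation_bins` are assumptions made here; the description above states only that each of the $P=64$ place tokens is linked to a discrete rotation.

```python
import numpy as np

def planar_rotation_bins(num_bins=64):
    """Return (num_bins, 3, 3) rotations about the z-axis (assumed table normal).

    Uniform spacing over [0, 2*pi) is an assumption for illustration.
    """
    angles = np.arange(num_bins) * (2.0 * np.pi / num_bins)
    cos, sin = np.cos(angles), np.sin(angles)
    rotations = np.zeros((num_bins, 3, 3))
    rotations[:, 0, 0] = cos
    rotations[:, 0, 1] = -sin
    rotations[:, 1, 0] = sin
    rotations[:, 1, 1] = cos
    rotations[:, 2, 2] = 1.0
    return rotations

# One rotation per place token; token 0 is the identity under this convention.
bins = planar_rotation_bins()
print(bins.shape)
```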

After each transformer block, mask heads produce token-to-point spatial masks, enabling supervision and focused attention. A dedicated objectness MLP scores each grasp token, and a shared action decoder MLP regresses per-point manipulation parameters: approach direction, contact direction, grasp width, and confidence.

2. Multi-Task Manipulation and Action Modes

M2T2 explicitly unifies several manipulation primitives as parallel outputs:

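Grasp Generation: Each grasp token predicts a contact mask and, via the action decoder, a 6-DoF grasp pose computed from a sampled contact point $p$, the regressed approach direction $a$, contact direction $c$, and grasp width $w$, with $d$ an offset along the approach direction. The final grasp pose is formed as $t_{\mathrm{grasp}} = p + (w/2) \cdot c + d \cdot a$ and $R_{\mathrm{grasp}} = [c, a \times c, a] \in SO(3)$, where the middle column is written as $a \times c$ so that the frame is right-handed for orthonormal $c$ and $a$.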
Placement Generation: Each place token yields masks for candidate placement regions, supporting discrete re-orientations about the table normal. Placement poses use geometric reasoning from contact points, gripper pose, and token-associated rotations.
Language-Conditioned Tasks: For RLBench scenarios, text instructions are encoded as CLIP vectors and concatenated to the token set, enabling language-guided action specialization.

No explicit proposals, segmentation, or candidate filtering are needed: the Transformer resolves token allocation, spatial region prediction, and mode decoding in parallel.
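
The grasp-pose construction can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `grasp_pose`, the orthogonalization of $a$ against $c$, and the example values are assumptions. The middle column of the rotation is computed as $a \times c$, which gives determinant $+1$ for orthonormal $c$ and $a$, as membership in $SO(3)$ requires.

```python
import numpy as np

def grasp_pose(p, c, a, w, d):
    """Assemble a 4x4 grasp pose from per-point predictions.

    p: contact point (3,), c: contact direction (3,), a: approach direction (3,),
    w: grasp width, d: offset along the approach direction.
    """
    c = c / np.linalg.norm(c)
    # Assumed step: remove the component of a along c so the columns are orthogonal.
    a = a - np.dot(a, c) * c
    a = a / np.linalg.norm(a)
    pose = np.eye(4)
    pose[:3, :3] = np.stack([c, np.cross(a, c), a], axis=1)  # right-handed frame
    pose[:3, 3] = p + (w / 2.0) * c + d * a                  # t = p + (w/2) c + d a
    return pose

# Hypothetical values: contact at the origin, 8 cm width, 10 cm offset.
pose = grasp_pose(np.zeros(3), np.array([1.0, 0.0, 0.0]),
                  np.array([0.0, 0.0, 1.0]), w=0.08, d=0.10)
print(pose)
```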

3. Masked Attention and Training Losses

The core technical innovation is the application of masked multi-head attention with spatially grounded, token-wise attention masks. At each transformer stage, interim masks are produced by scoring tokens against each point feature. High-confidence points are selected, and their indices are used to mask subsequent cross-attention operations, promoting spatial locality and object-centric focus.
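
The mask-construction step described above can be sketched as follows. The dot-product scoring, the sigmoid threshold of 0.5, and the fallback for tokens with no confident points are assumptions chosen for illustration; the function name `attention_mask_from_scores` is hypothetical.

```python
import numpy as np

def attention_mask_from_scores(tokens, point_feats, threshold=0.5):
    """Build an additive attention mask with entries in {0, -inf}, shape (m, n).

    tokens: (m, d) query-token features, point_feats: (n, d) per-point features.
    """
    logits = tokens @ point_feats.T            # (m, n) token-to-point scores
    probs = 1.0 / (1.0 + np.exp(-logits))      # sigmoid
    keep = probs > threshold                   # high-confidence points per token
    # Assumed fallback: a token with no confident points attends to all points,
    # which avoids a softmax row made entirely of -inf.
    keep[~keep.any(axis=1)] = True
    return np.where(keep, 0.0, -np.inf)

tokens = np.array([[2.0, 0.0], [-2.0, 0.0]])
point_feats = np.array([[1.0, 0.0], [-1.0, 0.0], [0.5, 0.0]])
print(attention_mask_from_scores(tokens, point_feats))
```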

Mathematically, masked multi-head attention is formalized as:

$\mathrm{Attention}(Q, K, V; M) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V,$

where $M \in \{0, -\infty\}^{m \times n}$ is the attention mask.
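
A single-head NumPy sketch of this formula is given below for illustration. It assumes every row of $M$ contains at least one zero entry, since a row consisting only of $-\infty$ would make the softmax undefined; multi-head projection is omitted.

```python
import numpy as np

def masked_attention(Q, K, V, M):
    """Compute softmax(Q K^T / sqrt(d_k) + M) V for a single head.

    Q: (m, d_k), K: (n, d_k), V: (n, d_v), M: (m, n) with entries in {0, -inf}.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + M
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)                               # masked entries become 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

Q = np.array([[1.0, 0.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0]])
V = np.array([[1.0, 2.0], [3.0, 4.0]])
M = np.array([[0.0, -np.inf]])   # the query may only attend to the first key
print(masked_attention(Q, K, V, M))
```

Because the second key is masked out, the output for this query equals the first row of `V`.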

The total loss is:

$\mathcal{L} = \mathcal{L}_{\mathrm{grasp}} + \lambda_{\mathrm{place}} \mathcal{L}_{\mathrm{placing}},$

with $\lambda_{\mathrm{place}} = 1$. Individual terms include:

Objectness BCE loss for grasp token activation.
Mask segmentation loss (binary cross-entropy + Dice loss) for token/point contact regions.
ADD-S distance regression loss for accurate 6-DoF pose prediction, pairing predicted and ground-truth poses via nearest-neighbor assignment.
Placing loss parallels mask segmentation for place tokens.
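
The mask segmentation term (binary cross-entropy plus Dice) can be sketched for a single token as follows. The smoothing constant `eps`, the equal weighting of the two terms, and the function name `mask_loss` are assumptions for illustration, not values taken from the paper.

```python
import numpy as np

def mask_loss(logits, target, eps=1.0):
    """Binary cross-entropy plus Dice loss for one token's per-point mask.

    logits: (n,) raw scores, target: (n,) values in {0, 1}.
    """
    probs = 1.0 / (1.0 + np.exp(-logits))
    # Numerically stable BCE on logits: max(x, 0) - x*y + log(1 + exp(-|x|))
    bce = np.mean(np.maximum(logits, 0) - logits * target
                  + np.log1p(np.exp(-np.abs(logits))))
    overlap = np.sum(probs * target)
    dice = 1.0 - (2.0 * overlap + eps) / (np.sum(probs) + np.sum(target) + eps)
    return bce + dice

target = np.array([1.0, 0.0, 1.0])
print(mask_loss(np.array([10.0, -10.0, 10.0]), target))   # confident and correct
print(mask_loss(np.array([-10.0, 10.0, -10.0]), target))  # confident and wrong
```

The first call should give a loss close to zero and the second a much larger value.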

Hungarian matching is used to pair tokens with ground-truth targets one-to-one, so that the loss does not depend on the order in which tokens produce their predictions.
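
A minimal sketch of the matching step, using SciPy's `linear_sum_assignment`, is shown below. The cost of $1 - \mathrm{IoU}$ between binary masks is an assumed stand-in for the actual matching cost, and the function name `match_tokens` is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_tokens(pred_masks, gt_masks):
    """Pair predicted masks (m, n) with ground-truth masks (k, n), k <= m.

    Returns a list of (token_index, gt_index) pairs minimizing total cost.
    """
    pred = pred_masks.astype(bool)[:, None, :]
    gt = gt_masks.astype(bool)[None, :, :]
    inter = np.logical_and(pred, gt).sum(axis=-1)
    union = np.logical_or(pred, gt).sum(axis=-1)
    cost = 1.0 - inter / np.maximum(union, 1)   # assumed cost: 1 - IoU
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))

pred_masks = np.array([[1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1]])
gt_masks = np.array([[0, 0, 1, 1], [1, 1, 0, 0]])
print(match_tokens(pred_masks, gt_masks))
```

Tokens left unmatched (token 1 in this example) could then serve as negatives for the objectness term.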

4. Dataset, Training Regimen, and Implementation

M2T2 is trained entirely on synthetic data generated using the ACRONYM 3D object set (252 categories, 8,800 models), covering broad shape, pose, and scene variability. The training set comprises 64,000 synthetic scenes, while specialized test sets assess pick, place, and category OOD generalization. Objects per scene range from 1 to 15, and ground-truth labels (contact regions, grasp/placement poses) are computed offline by collision-aware filtering.

Training uses AdamW (learning rate $8 \times 10^{-4}$, weight decay $10^{-2}$), a global batch size of 128, and standard augmentation (random rotations, jitter, sub-sampling). No learning-rate decay is used. The model converges within approximately two days (160 epochs) without overfitting.
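
The three augmentations named above can be sketched for a single point cloud as follows. Rotation about the z-axis only, the jitter standard deviation, and the function name `augment_point_cloud` are illustrative assumptions; in a real training pipeline the same rotation would also have to be applied to the ground-truth poses.

```python
import numpy as np

def augment_point_cloud(points, num_samples, jitter_std=0.005, rng=None):
    """Apply random rotation, jitter, and sub-sampling to an (n, 3) point cloud."""
    rng = np.random.default_rng() if rng is None else rng
    # Random rotation about the z-axis (assumed vertical axis).
    theta = rng.uniform(0.0, 2.0 * np.pi)
    cos, sin = np.cos(theta), np.sin(theta)
    rot_z = np.array([[cos, -sin, 0.0], [sin, cos, 0.0], [0.0, 0.0, 1.0]])
    rotated = points @ rot_z.T
    # Gaussian jitter on every coordinate.
    jittered = rotated + rng.normal(0.0, jitter_std, size=rotated.shape)
    # Sub-sample; draw with replacement only if the cloud is too small.
    idx = rng.choice(len(points), size=num_samples,
                     replace=len(points) < num_samples)
    return jittered[idx]

cloud = np.random.default_rng(0).normal(size=(1000, 3))
print(augment_point_cloud(cloud, 256, rng=np.random.default_rng(1)).shape)
```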

5. Evaluation and Empirical Results

The model achieves strong zero-shot sim2real transfer on a Franka Panda robot, without domain adaptation. Evaluation in 21 multi-object pick-and-place sequences (held-out categories and scenes) yields:

Metric	M2T2	Baseline (Contact-GraspNet + CabiNet)
Pick	85.7%	76.2%
Place	72.2%	56.2%
Place-Rot	62.5%	25.0%
Overall	61.9%	42.9%

M2T2 thus outperforms the previous state-of-the-art baseline by substantial margins, particularly in challenging reorientation settings (+37.5 percentage points for Place-Rot). Inference is efficient ($\sim$0.1 s per frame, no batching) and requires only a single raw point cloud per scene.
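
The absolute margins follow directly from the table; the short snippet below computes them in percentage points. The dictionary layout is only a convenience for this illustration.

```python
# Success rates (%) copied from the table above: (M2T2, baseline).
results = {
    "Pick": (85.7, 76.2),
    "Place": (72.2, 56.2),
    "Place-Rot": (62.5, 25.0),
    "Overall": (61.9, 42.9),
}

# Absolute gap in percentage points for each row.
gaps = {task: round(m2t2 - base, 1) for task, (m2t2, base) in results.items()}

for task, gap in gaps.items():
    print(f"{task}: +{gap} percentage points")
```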

For RLBench language-conditioned tasks (e.g., "open drawer," "turn tap"), M2T2 achieves superior success rates, surpassing the PerAct baseline by up to $\sim$10%.

6. Connections to Masked Sequence Models in Vision

M2T2’s architecture shares key design principles with masked sequence modeling strategies applied to vision generalists such as MAD (Masked AutoDecoder) (Qiu et al., 2024). Shared concepts include parallel, bi-directional decoding, task-token-based prompting, and the use of masked supervision to enhance inter-token and inter-task contextual learning. In contrast to classical autoregressive vision-language transformers, M2T2 and MAD use full self-attention and parallel inference over all tokens, which improves efficiency and enables stronger multi-task coupling.

A plausible implication is that this unified masked transformer paradigm could facilitate generalist manipulation beyond pick and place, as well as support richer forms of multi-modal conditioning, leveraging recent insights from the vision generalist community.

7. Significance and Prospects

M2T2 establishes a new scalable template for multi-task manipulation: a masked Transformer with object-centric query tokens and explicit spatial mask conditioning, trained on diverse scenes and tasks via dense synthetic supervision. This design eliminates heavy reliance on task-specific heuristics, explicit object proposals, or handcrafted policies, delegating multi-mode action reasoning to learned attention and masking.

Demonstrated robustness in sim2real transfer and strong performance on both spatial and language-conditioned tasks validate the effectiveness and generality of this approach. Extensions may include scaling the paradigm to more complex skills, integrating richer sensorimotor modalities, and further leveraging advances from unified vision-language masked modeling (Yuan et al., 2023; Qiu et al., 2024).


References

Yuan et al. (2023). M2T2: Multi-Task Masked Transformer for Object-centric Pick and Place.

Qiu et al. (2024). Masked AutoDecoder is Effective Multi-Task Vision Generalist.
