Tangram Assembly Task: Visual & Robotic Benchmark
- Tangram Assembly Task is a set of geometric and visual reasoning challenges in which seven standard tans are arranged to form target silhouettes.
- Solution methods span combinatorial search, reinforcement learning, and supervised learning pipelines that optimize piece selection, placement, and coverage metrics.
- The task serves as a benchmark in robotics and computer vision and as a structural model for quantum circuit embedding, with both practical and theoretical implications.
Tangram Assembly Task refers to a class of geometric, visual reasoning, and manipulation problems centered on the arrangement of planar fragments (typically the seven standard “tans”) to form a target silhouette. It unifies cognitive, computational, vision, and robotic assembly tasks under abstract rules of piece selection, transformation, and placement. In research contexts, it serves both as a benchmark for shape reasoning and as a proxy for complex multi-object assembly in robotics, computer vision, and even quantum circuit embedding.
1. Formal Problem Definition
Tangram assembly is most commonly formalized either as a combinatorial search over transformations of standard polygons, a Markov Decision Process (MDP) in robotics, or a supervised learning pipeline.
Piece Set and Target Representation:
- Standard tangram comprises seven “tans”—planar polygons with canonical dimensions (large/medium/small triangles, square, parallelogram).
- Silhouette prompt: a binary raster mask $S \in \{0,1\}^{H \times W}$ delineating the target shape.
Transformations:
- Each piece can be translated, rotated (typically discretized), and reflected.
- Full assembly is the non-overlapping union of the transformed pieces:

$$A \;=\; \bigcup_{i=1}^{N} g_i(P_i), \qquad \operatorname{int}\!\big(g_i(P_i)\big) \cap \operatorname{int}\!\big(g_j(P_j)\big) = \emptyset \quad (i \neq j),$$

where $g_i$ denotes the (possibly reflected) rigid transformation applied to piece $P_i$.
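The combinatorial formulation above can be made concrete in a few lines of Python. The following minimal sketch uses the shapely library; the piece coordinates and transformation parameters are chosen purely for illustration and are not drawn from any of the cited papers.

```python
# Minimal sketch of the combinatorial formulation, using shapely.
# Piece geometry and transformation parameters are illustrative assumptions.
from shapely.geometry import Polygon
from shapely.affinity import rotate, scale, translate
from shapely.ops import unary_union

# Two of the seven tans, in illustrative canonical coordinates.
small_triangle = Polygon([(0, 0), (1, 0), (0, 1)])
square = Polygon([(0, 0), (0.5, 0.5), (0, 1), (-0.5, 0.5)])

def place(piece, angle_deg, dx, dy, reflect=False):
    """Apply the allowed transformations: optional reflection,
    discretized rotation, then translation."""
    if reflect:
        piece = scale(piece, xfact=-1, yfact=1, origin=(0, 0))
    piece = rotate(piece, angle_deg, origin=(0, 0))
    return translate(piece, xoff=dx, yoff=dy)

placed = [place(small_triangle, 90, 1.0, 0.0), place(square, 45, 0.0, 0.0)]

# Non-overlap test: the union area must equal the summed piece areas
# (up to numerical tolerance), i.e. interiors are pairwise disjoint.
union = unary_union(placed)
assert abs(union.area - sum(p.area for p in placed)) < 1e-9
```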
Robotic Formulation (MDP as in MRChaos):
- State combines target silhouette, current workspace top-down image, and ordered piece sequence.
- Action specifies relative gripper-pose displacement for the next piece.
- Transition: apply physics, update image, increment placed set.
- Reward: pixel-wise coverage,

$$r_t \;=\; \frac{\lvert M_t \cap S_t \rvert}{\lvert S_t \rvert},$$

where $M_t$ is the visible mask of the just-placed piece and $S_t$ is the silhouette region assigned to it (Zhao et al., 17 May 2025).
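A minimal numpy rendering of this reward, with mask names matching the formula above (array shapes and the toy example are illustrative):

```python
import numpy as np

def coverage_reward(visible_mask: np.ndarray, assigned_region: np.ndarray) -> float:
    """Pixel-wise coverage reward r_t = |M_t ∩ S_t| / |S_t|.

    visible_mask: binary mask of the just-placed piece (M_t).
    assigned_region: binary mask of the silhouette region
        assigned to that piece (S_t).
    """
    target = assigned_region.astype(bool)
    if target.sum() == 0:
        return 0.0
    overlap = np.logical_and(visible_mask.astype(bool), target)
    return float(overlap.sum() / target.sum())

# Toy example: a 4x4 workspace where the piece covers 3 of 4 target pixels.
M = np.zeros((4, 4), dtype=np.uint8); M[0, :3] = 1
S = np.zeros((4, 4), dtype=np.uint8); S[0, :4] = 1
print(coverage_reward(M, S))  # 0.75
```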
Variants and Generalizations:
- Arbitrary numbers and shapes of fragments (Lee et al., 2022).
- Annotated part-segmentation and linguistic labels for shape abstraction (Ji et al., 2022).
2. Architectures and Solution Methodologies
2.1 Reinforcement Learning for Robotic Assembly
MRChaos introduces a vision-based PPO agent trained entirely by self-exploration in simulation:
- Policy network takes the silhouette-conditioned workspace observation, passes it through a 3-layer convolutional encoder and a 2-layer MLP head, and outputs the mean/std of a Gaussian action distribution together with a value estimate (see the sketch after this list).
- No geometric or kinematic models required; reward derived directly from visual change.
- Silhouette prompt is concatenated to workspace observation, conditioning both policy and reward (Zhao et al., 17 May 2025).
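A minimal PyTorch sketch of the described actor-critic architecture follows; the channel counts, kernel sizes, input resolution, and action dimensionality are illustrative assumptions, not the paper's reported hyperparameters.

```python
# Minimal actor-critic sketch matching the described 3-conv encoder +
# 2-layer MLP head. Channel counts, kernel sizes, input resolution, and
# action dimension are illustrative assumptions, not the paper's values.
import torch
import torch.nn as nn

class TangramPolicy(nn.Module):
    def __init__(self, in_channels=4, action_dim=3):
        super().__init__()
        # in_channels stacks the workspace RGB image with the
        # silhouette-prompt mask (observation conditioning).
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.mlp = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.mu = nn.Linear(128, action_dim)                   # Gaussian mean
        self.log_std = nn.Parameter(torch.zeros(action_dim))   # Gaussian std
        self.value = nn.Linear(128, 1)                         # critic estimate

    def forward(self, obs):
        h = self.mlp(self.encoder(obs))
        return self.mu(h), self.log_std.exp(), self.value(h)

obs = torch.randn(2, 4, 96, 96)  # batch of silhouette-conditioned observations
mu, std, v = TangramPolicy()(obs)
```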
2.2 Supervised Deep Learning for Assembly Prediction
FAN (Fragment Assembly Network) utilizes two heads:
- FAN-Select: chooses the next fragment via cross-entropy with respect to ground-truth order.
- FAN-Pose: predicts pixel-wise probability map for placement and discrete rotation bin via fully convolutional encoder-decoder.
- FRAM (Fragment Relation Attention Module) applies transformer-style multi-head attention across per-fragment feature vectors, facilitating context-sensitive selection and pose inference (Lee et al., 2022).
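A FRAM-style relational attention block can be sketched with PyTorch's built-in multi-head attention; the feature dimension, head count, and residual/norm arrangement here are illustrative assumptions, not the exact published module.

```python
import torch
import torch.nn as nn

class FragmentRelationAttention(nn.Module):
    """Transformer-style self-attention across per-fragment feature
    vectors, so each fragment's embedding is refined in the context of
    all others (permutation-equivariant by construction)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frag_feats, pad_mask=None):
        # frag_feats: (batch, n_fragments, dim); pad_mask marks absent
        # fragments so variable fragment counts are handled.
        ctx, _ = self.attn(frag_feats, frag_feats, frag_feats,
                           key_padding_mask=pad_mask)
        return self.norm(frag_feats + ctx)  # residual + norm

feats = torch.randn(2, 8, 256)          # 8 candidate fragments per sample
refined = FragmentRelationAttention()(feats)
# A selection head (e.g. a linear layer + softmax over fragments) can
# then score `refined` to choose the next fragment to place.
```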
2.3 Vision-Language Reasoning and Pre-Training
KiloGram dataset supports abstract visual reasoning by pairing shapes and rich part/whole descriptions:
- A k-way reference game evaluates the compatibility between text and image via models such as CLIP and ViLT; a minimal scoring sketch follows this list.
- Joint part-segmentation and linguistic annotation are crucial for accurate model reasoning; fine-tuned ViLT can exceed human performance especially when description and segmentation are both present (Ji et al., 2022).
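A hedged sketch of scoring such a k-way reference game with an off-the-shelf CLIP checkpoint; the checkpoint name, image filenames, and description text are placeholders, and KiloGram's own evaluation harness may differ in preprocessing and prompt format.

```python
# Sketch of a k-way reference game scored with off-the-shelf CLIP.
# Checkpoint, image files, and description text are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

description = "a dancing person with outstretched arms"  # whole/part text
candidates = [Image.open(f"tangram_{i}.png") for i in range(10)]  # k = 10

inputs = processor(text=[description], images=candidates,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_text  # shape (1, k)
prediction = logits.argmax(dim=-1).item()     # index of the chosen shape
```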
Zhao et al.'s pipeline focuses on pre-training vision features using assembly trajectories, completeness signals, and semantic word-embeddings. These transfer effectively to aesthetic layout and classification tasks (Zhao et al., 2021).
2.4 Quantum Circuit Embedding
The MBQC-tangram paradigm encodes gates as colored polyominoes (tiles) mapped onto a 2D lattice, with strict in/out adjacency and rotation/deformation rules. The minimal-area embedding of all needed gates onto the grid is NP-hard (Patil et al., 2022).
3. Evaluation Metrics and Experimental Protocols
Coverage and Intersection-over-Union (IoU):

$$\mathrm{IoU}(A, S) \;=\; \frac{\lvert A \cap S \rvert}{\lvert A \cup S \rvert},$$

computed between the assembled mask $A$ and the target silhouette $S$.
Piece-wise and Final Coverage:
- MRChaos tracks both average per-piece coverage (Relative) and final silhouette coverage (Final) across random and human-designed sets.
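Both metrics reduce to binary-mask arithmetic, as in this minimal numpy sketch (function and argument names are illustrative):

```python
import numpy as np

def iou(assembly: np.ndarray, silhouette: np.ndarray) -> float:
    """IoU(A, S) = |A ∩ S| / |A ∪ S| over binary masks."""
    a, s = assembly.astype(bool), silhouette.astype(bool)
    union = np.logical_or(a, s).sum()
    return float(np.logical_and(a, s).sum() / union) if union else 1.0

def final_coverage(assembly: np.ndarray, silhouette: np.ndarray) -> float:
    """Fraction of target-silhouette pixels covered by the assembly."""
    s = silhouette.astype(bool)
    return float(np.logical_and(assembly.astype(bool), s).sum() / s.sum())
```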
Reference Game Accuracy:
- KiloGram benchmarks human and model accuracy in selecting the correct shape from textual or part-based descriptions.
Assembly Robustness:
- Handling of missing, eroded, or distorted pieces is rigorously tested (FAN) (Lee et al., 2022).
Generalization:
- MRChaos and FAN are validated on unseen targets; MRChaos demonstrates zero-shot ability on complex human-designed silhouettes and transfers directly to real-world robotic setups (Zhao et al., 17 May 2025).
| Task/Method | Metric | Result(s) |
|---|---|---|
| MRChaos (Sim-Random) | Final Cov | 88.7% |
| MRChaos (Real-Hard) | Final Cov | 62.4% |
| FAN (Square, 8 frags) | IoU@0.5 | 0.470 |
| KiloGram (ViLT FT) | Ref Acc (p+c) | 75.2% |
4. Generalization, Reasoning, and Representational Insights
Tangram assembly challenges both agents and models in visual parsing, local/global reasoning, and strategic piece placement:
- MRChaos demonstrates generalization from random assemblies to complex artistic forms, enabled by conditioning policy on silhouette prompts and closed-loop visual rewards (Zhao et al., 17 May 2025).
- Vision backbones pre-trained on completeness ordering and semantic supervision (tangram assembly sequences) encode fine-grained progression from partial to complete shapes, transferable to few-shot and layout tasks (Zhao et al., 2021); a minimal sketch of such a completeness-ordering objective follows this list.
- Abstract reasoning about shape-parts and object segments, when jointly annotated visually and linguistically, substantially increases multimodal model performance on ambiguous shape matching (Ji et al., 2022).
- FAN-style architectures handle arbitrary fragment counts, rotations, missing/distorted parts, and textured domains via permutation-equivariant attention and multi-scale pixelwise loss (Lee et al., 2022).
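One plausible realization of the completeness-ordering signal is a pairwise ranking objective over consecutive assembly frames, as in this sketch; the toy encoder and margin value are assumptions, not the cited paper's exact objective.

```python
import torch
import torch.nn as nn

# Hedged sketch of a completeness-ordering objective: later assembly
# steps should score higher than earlier ones. The toy encoder and the
# margin are illustrative; the cited paper's objective may differ.
encoder = nn.Sequential(  # image encoder -> scalar completeness score
    nn.Conv2d(1, 16, 3, stride=2), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 1),
)
rank_loss = nn.MarginRankingLoss(margin=0.1)

# trajectory: (T, 1, H, W) partial-assembly frames, ordered by step.
trajectory = torch.randn(7, 1, 64, 64)
scores = encoder(trajectory).squeeze(-1)   # (T,) completeness scores
later, earlier = scores[1:], scores[:-1]
target = torch.ones_like(later)            # enforce "later > earlier"
loss = rank_loss(later, earlier, target)
loss.backward()
```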
5. Comparative Analysis with Baselines and Failure Modes
- RL and deep learning methods decisively outperform combinatorial search (simulated annealing, Bayesian optimization) and GAN-based baselines in both coverage and speed (Lee et al., 2022).
- Pure behavior cloning fails (<20% coverage) in MRChaos, underscoring the necessity of trial-and-error or goal-conditioned exploration (Zhao et al., 17 May 2025).
- Failure cases include high-density assemblies ("H-Fiendish"), early piece placements that prune the set of feasible solutions, and backbone limitations on textured or colored inputs.
- In the MBQC variant, minimal-area embedding remains open and computationally hard: only brute-force or greedy heuristic approaches are used (Patil et al., 2022).
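For the MBQC variant, a greedy first-fit placement heuristic of the kind alluded to can be sketched as follows; the grid size and gate tile are illustrative, and a real embedding must additionally respect the in/out adjacency and rotation rules.

```python
import numpy as np

def first_fit(grid: np.ndarray, tile: np.ndarray):
    """Greedy first-fit: scan row-major for the first position where the
    polyomino tile fits without overlap; place it and return coordinates."""
    H, W = grid.shape
    th, tw = tile.shape
    for r in range(H - th + 1):
        for c in range(W - tw + 1):
            if not np.any(grid[r:r+th, c:c+tw] & tile):
                grid[r:r+th, c:c+tw] |= tile
                return r, c
    return None  # no feasible placement on this grid

grid = np.zeros((6, 6), dtype=np.uint8)
L_tile = np.array([[1, 0], [1, 0], [1, 1]], dtype=np.uint8)  # illustrative gate tile
print(first_fit(grid, L_tile), first_fit(grid, L_tile))  # (0, 0) (0, 2)
```

A complete embedder would also enumerate tile rotations and enforce the adjacency constraints; the NP-hardness result concerns minimizing the total occupied grid area.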
6. Extensions, Applications, and Future Research Directions
Broader Application Domains:
- The MRChaos framework extends to cutlery and soda-can assembly tasks: re-training with 5 (cutlery) or 3 (cans) pieces yields high final coverage on unseen silhouettes (Zhao et al., 17 May 2025).
- Features and pipelines pre-trained via tangram assembly are repurposed for folding, room layout, icon recognition, and handwriting classification (Zhao et al., 2021).
Proposed Extensions:
- Joint planning of piece ordering and pose, moving beyond fixed order (Zhao et al., 17 May 2025, Lee et al., 2022).
- Dexterous manipulation with full 6-DoF, non-planar assemblies, and 3D link-based constructions.
- For MBQC, efficient algorithms for minimal-overhead tangram embeddings pose an ongoing challenge (Patil et al., 2022).
Insights into Reasoning:
- Step-wise completeness signals and semantic alignment govern transferability to downstream tasks.
- Human-model comparison reveals that fine-tuned vision-language models are sensitive to joint part/text input, approaching attentive human-level performance (Ji et al., 2022).
7. Historical, Cognitive, and Cross-Domain Significance
Tangram assembly has evolved from cognitive psychology stimuli to a rigorous computational and robotic benchmark. Its structure—piece transformation, spatial reasoning, compositional assembly—has been leveraged to probe abstract reasoning, perceptual categorization, and planning under uncertainty. Recent work demonstrates that learning “from chaos”—random fragments and silhouettes—enables models and agents to master general rules for assembly applicable to both natural and artificial domains, including robotic manipulation and quantum circuit compilation (Zhao et al., 17 May 2025, Patil et al., 2022).
A plausible implication is that self-supervised, cue-conditioned assembly—in the absence of human priors or engineered rules—can be a universal paradigm for developing general-purpose shape reasoning in both physical and simulated environments.