
Affostruction: 3D Affordance Reconstruction

Updated 21 January 2026
  • Affostruction is a generative framework for 3D affordance grounding that reconstructs complete object geometry from partial RGB-D observations.
  • It employs sparse voxel fusion and a Flow Transformer to integrate multi-view data, achieving significant improvements in reconstruction accuracy and affordance localization.
  • Affordance-driven active view selection guides intelligent viewpoint acquisition, enabling the system to capture diverse, multimodal functional region distributions.

Affostruction is a generative framework for 3D affordance grounding that jointly reconstructs complete object geometry from partial RGB-D observations and localizes open-vocabulary affordance regions, including those on unobserved surfaces. Unlike prior methods that restrict affordance predictions to visible surfaces, Affostruction performs generative multi-view reconstruction using sparse voxel fusion to extrapolate the object’s full geometry and applies flow-based affordance grounding to capture multimodal functional region distributions. The system incorporates affordance-driven active view selection, using predicted affordance maps to guide intelligent viewpoint acquisition. Empirical evaluation demonstrates substantial improvements in both reconstruction and affordance localization accuracy over previous baselines (Park et al., 14 Jan 2026).

1. Formal Problem Definition

Affostruction addresses the problem of affordance grounding from a sequence of RGB-D images $\{\mathcal V_i\}_{i=1}^N$, with each observation $\mathcal V_i = (I_i, D_i, K_i, T_i)$ consisting of a color image $I_i \in \mathbb{R}^{H \times W \times 3}$, a depth map $D_i \in \mathbb{R}^{H \times W}$, camera intrinsics $K_i$, and extrinsics $T_i$. These inputs are used to back-project observed pixels into a sparse 3D point cloud or voxel set

$$\bar{\mathcal P} = \bigcup_{i=1}^N \left\{ (p_{ij}, f_{ij}) \;\middle|\; p_{ij} = T_i K_i^{-1} [u,v,1]^\top D_i(u,v) \right\},$$

where $f_{ij}$ denotes per-pixel DINOv2 features. The principal objectives are: (a) reconstruction of the complete 3D shape $\mathcal S = \{p_m\}_{m=1}^M$; (b) grounding of an open-vocabulary affordance query $q$ (e.g., "where to grasp") to a probabilistic heatmap $a : \mathcal S \rightarrow [0,1]$ reflecting the probability that point $p_m$ affords the queried action, $a(p_m) \approx \Pr(\text{affordance} = 1 \mid p_m, q)$. Ground-truth annotation $\mathcal A^* = \{(p_m, a^*_m)\}_{m=1}^M$ is supplied for supervision. The framework jointly minimizes reconstruction error and affordance localization loss.
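As a concrete illustration of the back-projection step, the following minimal NumPy sketch maps a depth map into world-space points $p_{ij} = T_i K_i^{-1} [u,v,1]^\top D_i(u,v)$. The function name and array conventions are illustrative, not from the paper:

```python
import numpy as np

def backproject(depth, K, T):
    """Back-project a depth map into world-space 3D points.

    depth: (H, W) depth map D_i
    K:     (3, 3) camera intrinsics K_i
    T:     (4, 4) camera-to-world extrinsics T_i
    Returns an (H*W, 3) array of points p_ij = T_i K_i^{-1} [u, v, 1]^T D_i(u, v).
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))           # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K).T                          # K^{-1} [u, v, 1]^T per pixel
    cam = rays * depth.reshape(-1, 1)                        # scale each ray by its depth
    cam_h = np.concatenate([cam, np.ones((cam.shape[0], 1))], axis=1)
    return (cam_h @ T.T)[:, :3]                              # apply extrinsics T_i
```

With identity intrinsics and extrinsics and unit depth, each pixel $(u, v)$ maps to the point $(u, v, 1)$, which is a quick sanity check of the conventions.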

2. Generative Multi-View Reconstruction via Sparse Voxel Fusion

Affostruction employs a sparse voxel fusion procedure to synthesize dense geometric representations from partial multi-view input. Each input view's DINOv2 per-pixel features are back-projected into 3D and merged across the $N$ views by averaging features at corresponding voxel grid positions (grid cell size $r^{-1}$). The fused set is

$$\bar{\mathcal V} = \{(p_m, \bar f_m)\}_{m=1}^M, \qquad \bar f_m = \frac{1}{K} \sum_{(i,j)\mapsto m} f_{ij},$$

where $K$ is the number of pixels $(i,j)$ mapping to voxel $m$. A positional encoding $\mathrm{PE}_{3D}(p_m)$ is appended to each $\bar f_m$ to form conditioning tokens $\mathcal C_{\rm voxel}$.
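The fusion step amounts to a scatter-mean over voxel indices. This NumPy snippet is a simplified stand-in (point coordinates assumed normalized to the unit cube; features are arbitrary vectors rather than DINOv2 embeddings):

```python
import numpy as np

def fuse_voxels(points, feats, r=16):
    """Fuse per-pixel features into a sparse voxel grid by averaging.

    points: (P, 3) back-projected points, assumed normalized to [0, 1)^3
    feats:  (P, C) per-pixel features
    r:      voxel grid resolution (cell size 1/r)
    Returns (coords, fused): the M occupied voxel indices and the mean
    feature over all pixels mapping to each voxel.
    """
    idx = np.clip((points * r).astype(int), 0, r - 1)        # voxel index per point
    keys = idx[:, 0] * r * r + idx[:, 1] * r + idx[:, 2]     # flatten to a 1D key
    uniq, inverse, counts = np.unique(keys, return_inverse=True, return_counts=True)
    fused = np.zeros((len(uniq), feats.shape[1]))
    np.add.at(fused, inverse, feats)                         # sum features per voxel
    fused /= counts[:, None]                                 # average
    coords = np.stack([uniq // (r * r), (uniq // r) % r, uniq % r], axis=1)
    return coords, fused
```

Because the output is indexed by occupied voxels rather than by input pixels, adding more views changes only the averages, not the number of tokens, which is the source of the constant token complexity noted below.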

The core mapping uses a Flow Transformer architecture, specifically a 12-block DiT backbone with hidden size 768, cross-attending into the fused features. The input is a dense noise tensor $X \in \mathbb{R}^{r^3 \times C}$ (with $r = 16$, $C = 8$); the output is denoised and decoded by a frozen sparse-structure VAE, producing the reconstructed 3D point set $\{p_m\}_{m=1}^L$. Rectified flow matching supervises the generative process:

$$X_t = (1-t)\,X_0 + t\,\epsilon, \qquad \mathcal L_{\rm recon} = \mathbb{E}_{t, X_0, \epsilon} \left\| v_\theta(X_t, \mathcal C_{\rm voxel}, t) - (\epsilon - X_0) \right\|_2^2,$$

with $t \in [0,1]$ and $\epsilon \sim \mathcal N(0, I)$. Because fusion happens before the transformer, the token complexity is a constant $O(r^3)$, independent of the number of input views $N$.
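The rectified-flow objective reduces to a regression on velocities. A minimal NumPy sketch, with the DiT backbone abstracted to an arbitrary callable `v_theta` and the voxel conditioning omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def rectified_flow_loss(v_theta, x0, t=None, eps=None):
    """One Monte Carlo sample of the rectified-flow objective L_recon.

    Forms X_t = (1 - t) X_0 + t * eps and regresses the velocity
    prediction v_theta(X_t, t) onto the target (eps - X_0). v_theta is
    any callable; in the paper it is a 12-block DiT that additionally
    cross-attends into the fused voxel tokens C_voxel (omitted here).
    """
    if eps is None:
        eps = rng.standard_normal(x0.shape)   # eps ~ N(0, I)
    if t is None:
        t = rng.uniform()                     # t ~ U[0, 1]
    x_t = (1.0 - t) * x0 + t * eps            # noisy interpolant X_t
    target = eps - x0                         # ground-truth velocity
    return float(np.mean((v_theta(x_t, t) - target) ** 2))
```

An "oracle" model that outputs exactly $\epsilon - X_0$ drives this loss to zero, which is the fixed point the training objective pulls the network toward.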

3. Flow-Based Affordance Grounding

Affordance grounding is structured as a flow-based stochastic mapping on the reconstructed voxels. A Sparse Flow Transformer operates on the $L$ occupied voxels, denoising an initial noise vector $A_t \in \mathbb{R}^L$ into per-voxel affordance logits $A_0$. The process is conditioned on the CLIP text embedding of the query: the architecture inherits the DiT backbone (12 blocks, hidden size 768) and cross-attends into a 768-dimensional CLIP-ViT-L/14 text token.

The loss function combines voxel-wise binary cross-entropy and Dice losses,

$$\mathcal L_{\rm mask}(A', A) = \mathcal L_{\rm BCE}(A', A) + \mathcal L_{\rm Dice}(A', A),$$

applied through flow-matching supervision:

$$\mathcal L_{\rm flow} = \mathbb{E}_{t, A_0, \epsilon}\left[ \mathcal L_{\rm mask}\left(\epsilon - v_\phi(A_t, \mathcal C_{\rm text}, t),\; A_0\right)\right].$$

At inference, diverse affordance heatmaps are generated from noise, capturing multimodal ambiguity in functional region assignment.
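A straightforward NumPy version of the combined mask loss, assuming per-voxel probabilities in $[0, 1]$ (the `eps` smoothing constant is an implementation convenience, not specified in the paper):

```python
import numpy as np

def mask_loss(pred, target, eps=1e-6):
    """Voxel-wise mask loss L_mask = L_BCE + L_Dice.

    pred, target: (L,) per-voxel affordance probabilities in [0, 1].
    """
    p = np.clip(pred, eps, 1 - eps)                     # avoid log(0)
    bce = -np.mean(target * np.log(p) + (1 - target) * np.log(1 - p))
    dice = 1 - (2 * np.sum(p * target) + eps) / (np.sum(p) + np.sum(target) + eps)
    return float(bce + dice)
```

The Dice term counteracts the class imbalance typical of affordance masks, where the functional region covers only a small fraction of the object surface.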

4. Affordance-Driven Active View Selection

To acquire information optimal for affordance localization, Affostruction implements an active view selection policy. Given a reconstructed mesh $\mathcal M$ colored by the latest affordance heatmap, $K$ candidate camera poses $\Pi = \{\pi_1, \dots, \pi_K\}$ are sampled (e.g., uniformly from a hemisphere). For each $\pi$, a rendered 2D affordance map $A_{\rm render}(\cdot\,; \pi)$ is produced and scored as

$$S(\pi) = \sum_{u,v} A_{\rm render}(u, v; \pi),$$

with the next view $\pi^*$ selected by maximizing $S(\pi)$. This policy prioritizes functional regions and empirically converges faster (fewer viewpoints needed) than random or sequential sampling. While the procedure does not explicitly maximize information gain (entropy reduction), summing affordance probabilities empirically approximates optimal coverage.
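The scoring rule itself is trivial once rendering is done; this sketch assumes the renderer has already produced the $K$ candidate affordance maps:

```python
import numpy as np

def select_next_view(rendered_maps):
    """Score K candidate poses by summed rendered affordance probability,
    S(pi) = sum over (u, v) of A_render(u, v; pi), and return the argmax.

    rendered_maps: list of (H, W) affordance renderings, one per pose.
    Returns (best_index, scores).
    """
    scores = [float(np.sum(a)) for a in rendered_maps]
    return int(np.argmax(scores)), scores
```

All of the policy's intelligence lives in the rendering step, which projects the current 3D affordance belief into each candidate view before this simple summation ranks them.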

5. Experimental Design and Results

Training and Testing Protocols:

  • Reconstruction training utilized 3D-FUTURE, HSSD, and ABO datasets; affordance training relied on Affogato’s train split with generated annotations.
  • Reconstruction was tested on 1,250 objects from Toys4k; affordance on the Affogato test split (first view/query).

Key Metrics:

  • 3D reconstruction: Volumetric IoU, Chamfer Distance (CD), F-score @0.05, PSNR/LPIPS for normals and color.
  • Complete affordance: aIoU, AUC, SIM, MAE.
  • Partial affordance (introduced metric): multi-threshold aIoU and aCD, coupling reconstruction with affordance.

Quantitative Results:

| Task | Baseline (TRELLIS/Espresso-3D) | Affostruction | Improvement |
|------|-------------------------------|---------------|-------------|
| Reconstruction IoU | 19.49 | 32.67 | +67.7% |
| Chamfer Distance | 0.3694 | 0.2427 | |
| Complete affordance aIoU | 13.6 | 19.1 | +40.4% |
| Partial affordance aIoU (1 view) | 0.60 | 9.26 | +14.66 |
| Partial affordance aCD (1 view) | 0.2885 | 0.1044 | |
| aIoU (active view, 4 views) | | 12.46 | |

Ablations confirm the impact of sparse voxel fusion (superior IoU versus image-patch conditioning) and show that multi-view stochastic training is essential for leveraging supplementary frames. Affordance-driven view selection outperformed sequential and random sampling in aIoU convergence. Qualitative results corroborate these gains, showing reconstruction of occluded functional geometry and diverse, multimodal affordance region maps (Park et al., 14 Jan 2026).

6. Innovations, Limitations, and Prospective Directions

The core innovations of Affostruction are:

  • Sparse voxel fusion enabling constant computational complexity across variable observation count while supporting depth-guided, multi-view synthesis.
  • Flow-based affordance grounding that enables distributional, multimodal region assignment and is driven by a rectified flow objective targeting binary functional regions.
  • Affordance-driven view selection that ties scene exploration to the accumulation of functionally salient information.

Empirical evaluations report substantial improvements over existing methods on both geometric reconstruction and affordance localization. The current system is limited to single-object scenes and leverages pretrained generative priors (e.g., TRELLIS VAE decoders). Future work is anticipated in the extension to multi-object or cluttered scenes, the introduction of richer interaction priors, optimization-based or mutual-information-driven view planning, and closed-loop evaluation with physical robots (Park et al., 14 Jan 2026).

A plausible implication is that integrating explicit information-theoretic policies or interaction-specific priors could further advance the quality and efficiency of affordance grounding in robotic perceptual systems.
