
Affostruction: 3D Affordance Reconstruction

Updated 21 January 2026
  • Affostruction is a generative framework for 3D affordance grounding that reconstructs complete object geometry from partial RGB-D observations.
  • It employs sparse voxel fusion and a Flow Transformer to integrate multi-view data, achieving significant improvements in reconstruction accuracy and affordance localization.
  • Affordance-driven active view selection guides intelligent viewpoint acquisition, enabling the system to capture diverse, multimodal functional region distributions.

Affostruction is a generative framework for 3D affordance grounding that jointly reconstructs complete object geometry from partial RGB-D observations and localizes open-vocabulary affordance regions, including those on unobserved surfaces. Unlike prior methods that restrict affordance predictions to visible surfaces, Affostruction performs generative multi-view reconstruction using sparse voxel fusion to extrapolate the object’s full geometry and applies flow-based affordance grounding to capture multimodal functional region distributions. The system incorporates affordance-driven active view selection, using predicted affordance maps to guide intelligent viewpoint acquisition. Empirical evaluation demonstrates substantial improvements in both reconstruction and affordance localization accuracy over previous baselines (Park et al., 14 Jan 2026).

1. Formal Problem Definition

Affostruction addresses the problem of affordance grounding from a sequence of RGB-D images $\{\mathcal V_i\}_{i=1}^N$, with each observation $\mathcal V_i = (I_i, D_i, K_i, T_i)$ consisting of a color image $I_i \in \mathbb{R}^{H \times W \times 3}$, a depth map $D_i \in \mathbb{R}^{H \times W}$, camera intrinsics $K_i$, and extrinsics $T_i$. These inputs are used to back-project observed pixels into a sparse 3D point cloud or voxel set

$$\bar{\mathcal P} = \bigcup_{i=1}^N \left\{ (p_{ij}, f_{ij}) \;\middle|\; p_{ij} = T_i K_i^{-1} [u,v,1]^\top D_i(u,v) \right\},$$

where $f_{ij}$ denotes per-pixel DINOv2 features. The principal objectives are: (a) reconstruction of the complete 3D shape $\mathcal S = \{p_m\}_{m=1}^M$; (b) grounding of an open-vocabulary affordance query $q$ (e.g., "where to grasp") to a probabilistic heatmap $a : \mathcal S \rightarrow [0,1]$ reflecting the probability that point $p_m$ affords the queried action, $a(p_m) \approx \Pr(\text{affordance} = 1 \mid p_m, q)$. Ground-truth annotation $\mathcal A^* = \{(p_m, a^*_m)\}_{m=1}^M$ is supplied for supervision. The framework jointly minimizes reconstruction error and affordance localization loss.
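As a concrete illustration of the back-projection step, the following minimal NumPy sketch maps a depth map into world-space points $p_{ij} = T_i K_i^{-1} [u,v,1]^\top D_i(u,v)$. The function name and array conventions are illustrative, not from the paper:

```python
import numpy as np

def backproject(depth, K, T):
    """Back-project a depth map into world-space 3D points.

    depth: (H, W) depth map D_i
    K:     (3, 3) camera intrinsics K_i
    T:     (4, 4) camera-to-world extrinsics T_i
    Returns an (H*W, 3) array of points p_ij = T_i K_i^{-1} [u, v, 1]^T D_i(u, v).
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))           # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K).T                          # K^{-1} [u, v, 1]^T per pixel
    cam = rays * depth.reshape(-1, 1)                        # scale each ray by its depth
    cam_h = np.concatenate([cam, np.ones((cam.shape[0], 1))], axis=1)
    return (cam_h @ T.T)[:, :3]                              # apply extrinsics T_i
```

With identity intrinsics and extrinsics and unit depth, each pixel $(u, v)$ maps to the point $(u, v, 1)$, which is a quick sanity check of the conventions.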

2. Generative Multi-View Reconstruction via Sparse Voxel Fusion

Affostruction employs a sparse voxel fusion procedure to synthesize dense geometric representations from partial multi-view input. Each input view's DINOv2 per-pixel features are back-projected into 3D and merged across the $N$ views by averaging features at corresponding voxel grid positions (grid cell size $r^{-1}$). The fused set is

$$\bar{\mathcal V} = \{(p_m, \bar f_m)\}_{m=1}^M, \qquad \bar f_m = \frac{1}{K} \sum_{(i,j)\mapsto m} f_{ij},$$

where $K$ is the number of pixels $(i,j)$ mapping to voxel $m$. A positional encoding $\mathrm{PE}_{3D}(p_m)$ is appended to each $\bar f_m$ to form conditioning tokens $\mathcal C_{\rm voxel}$.
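The fusion step amounts to a scatter-mean over voxel indices. This NumPy snippet is a simplified stand-in (point coordinates assumed normalized to the unit cube; features are arbitrary vectors rather than DINOv2 embeddings):

```python
import numpy as np

def fuse_voxels(points, feats, r=16):
    """Fuse per-pixel features into a sparse voxel grid by averaging.

    points: (P, 3) back-projected points, assumed normalized to [0, 1)^3
    feats:  (P, C) per-pixel features
    r:      voxel grid resolution (cell size 1/r)
    Returns (coords, fused): the M occupied voxel indices and the mean
    feature over all pixels mapping to each voxel.
    """
    idx = np.clip((points * r).astype(int), 0, r - 1)        # voxel index per point
    keys = idx[:, 0] * r * r + idx[:, 1] * r + idx[:, 2]     # flatten to a 1D key
    uniq, inverse, counts = np.unique(keys, return_inverse=True, return_counts=True)
    fused = np.zeros((len(uniq), feats.shape[1]))
    np.add.at(fused, inverse, feats)                         # sum features per voxel
    fused /= counts[:, None]                                 # average
    coords = np.stack([uniq // (r * r), (uniq // r) % r, uniq % r], axis=1)
    return coords, fused
```

Because the output is indexed by occupied voxels rather than by input pixels, adding more views changes only the averages, not the number of tokens, which is the source of the constant token complexity noted below.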

The core mapping uses a Flow Transformer architecture, specifically a 12-block DiT backbone with hidden size 768, cross-attending into the fused features. The input is a dense noise tensor $X \in \mathbb{R}^{r^3 \times C}$ (with $r = 16$, $C = 8$); the output is denoised and decoded by a frozen sparse-structure VAE, producing the reconstructed 3D point set $\{p_m\}_{m=1}^L$. Rectified flow matching supervises the generative process:

$$X_t = (1-t)\,X_0 + t\,\epsilon, \qquad \mathcal L_{\rm recon} = \mathbb{E}_{t, X_0, \epsilon} \left\| v_\theta(X_t, \mathcal C_{\rm voxel}, t) - (\epsilon - X_0) \right\|_2^2,$$

with $t \in [0,1]$ and $\epsilon \sim \mathcal N(0, I)$. Because fusion happens before the transformer, the token complexity is a constant $O(r^3)$, independent of the number of input views $N$.
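The rectified-flow objective reduces to a regression on velocities. A minimal NumPy sketch, with the DiT backbone abstracted to an arbitrary callable `v_theta` and the voxel conditioning omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def rectified_flow_loss(v_theta, x0, t=None, eps=None):
    """One Monte Carlo sample of the rectified-flow objective L_recon.

    Forms X_t = (1 - t) X_0 + t * eps and regresses the velocity
    prediction v_theta(X_t, t) onto the target (eps - X_0). v_theta is
    any callable; in the paper it is a 12-block DiT that additionally
    cross-attends into the fused voxel tokens C_voxel (omitted here).
    """
    if eps is None:
        eps = rng.standard_normal(x0.shape)   # eps ~ N(0, I)
    if t is None:
        t = rng.uniform()                     # t ~ U[0, 1]
    x_t = (1.0 - t) * x0 + t * eps            # noisy interpolant X_t
    target = eps - x0                         # ground-truth velocity
    return float(np.mean((v_theta(x_t, t) - target) ** 2))
```

An "oracle" model that outputs exactly $\epsilon - X_0$ drives this loss to zero, which is the fixed point the training objective pulls the network toward.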

3. Flow-Based Affordance Grounding

Affordance grounding is structured as a flow-based stochastic mapping on the reconstructed voxels. A Sparse Flow Transformer operates on the $L$ occupied voxels, denoising an initial noise vector $A_t \in \mathbb{R}^L$ into per-voxel affordance logits $A_0$. The process is conditioned on the CLIP text embedding of the query: the architecture inherits the DiT backbone (12 blocks, hidden size 768) and cross-attends into a 768-dimensional CLIP-ViT-L/14 text token.

The loss function combines voxel-wise binary cross-entropy and Dice losses,

$$\mathcal L_{\rm mask}(A', A) = \mathcal L_{\rm BCE}(A', A) + \mathcal L_{\rm Dice}(A', A),$$

applied through flow-matching supervision:

$$\mathcal L_{\rm flow} = \mathbb{E}_{t, A_0, \epsilon}\left[ \mathcal L_{\rm mask}\left(\epsilon - v_\phi(A_t, \mathcal C_{\rm text}, t),\; A_0\right)\right].$$

At inference, diverse affordance heatmaps are generated from noise, capturing multimodal ambiguity in functional region assignment.
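A straightforward NumPy version of the combined mask loss, assuming per-voxel probabilities in $[0, 1]$ (the `eps` smoothing constant is an implementation convenience, not specified in the paper):

```python
import numpy as np

def mask_loss(pred, target, eps=1e-6):
    """Voxel-wise mask loss L_mask = L_BCE + L_Dice.

    pred, target: (L,) per-voxel affordance probabilities in [0, 1].
    """
    p = np.clip(pred, eps, 1 - eps)                     # avoid log(0)
    bce = -np.mean(target * np.log(p) + (1 - target) * np.log(1 - p))
    dice = 1 - (2 * np.sum(p * target) + eps) / (np.sum(p) + np.sum(target) + eps)
    return float(bce + dice)
```

The Dice term counteracts the class imbalance typical of affordance masks, where the functional region covers only a small fraction of the object surface.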

4. Affordance-Driven Active View Selection

To acquire information optimal for affordance localization, Affostruction implements an active view selection policy. Given a reconstructed mesh $\mathcal M$ colored by the latest affordance heatmap, $K$ candidate camera poses $\Pi = \{\pi_1, \dots, \pi_K\}$ are sampled (e.g., uniformly from a hemisphere). For each $\pi$, a rendered 2D affordance map $A_{\rm render}(\cdot\,; \pi)$ is produced and scored as

$$S(\pi) = \sum_{u,v} A_{\rm render}(u, v; \pi),$$

with the next view $\pi^*$ selected by maximizing $S(\pi)$. This policy prioritizes functional regions and empirically converges faster (fewer viewpoints needed) than random or sequential sampling. While the procedure does not explicitly maximize information gain (entropy reduction), summing affordance probabilities empirically approximates optimal coverage.
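The scoring rule itself is trivial once rendering is done; this sketch assumes the renderer has already produced the $K$ candidate affordance maps:

```python
import numpy as np

def select_next_view(rendered_maps):
    """Score K candidate poses by summed rendered affordance probability,
    S(pi) = sum over (u, v) of A_render(u, v; pi), and return the argmax.

    rendered_maps: list of (H, W) affordance renderings, one per pose.
    Returns (best_index, scores).
    """
    scores = [float(np.sum(a)) for a in rendered_maps]
    return int(np.argmax(scores)), scores
```

All of the policy's intelligence lives in the rendering step, which projects the current 3D affordance belief into each candidate view before this simple summation ranks them.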

5. Experimental Design and Results

Training and Testing Protocols:

  • Reconstruction training utilized 3D-FUTURE, HSSD, and ABO datasets; affordance training relied on Affogato’s train split with generated annotations.
  • Reconstruction was tested on 1,250 objects from Toys4k; affordance on the Affogato test split (first view/query).

Key Metrics:

  • 3D reconstruction: Volumetric IoU, Chamfer Distance (CD), F-score @0.05, PSNR/LPIPS for normals and color.
  • Complete affordance: aIoU, AUC, SIM, MAE.
  • Partial affordance (introduced metric): multi-threshold aIoU and aCD, coupling reconstruction with affordance.

Quantitative Results:

| Task | Baseline (TRELLIS/Espresso-3D) | Affostruction | Improvement |
|------|-------------------------------|---------------|-------------|
| Reconstruction IoU | 19.49 | 32.67 | +67.7% |
| Chamfer Distance | 0.3694 | 0.2427 | |
| Complete affordance aIoU | 13.6 | 19.1 | +40.4% |
| Partial affordance aIoU (1 view) | 0.60 | 9.26 | +14.66 |
| Partial affordance aCD (1 view) | 0.2885 | 0.1044 | |
| aIoU (active view, 4 views) | | 12.46 | |

Ablations confirm the impact of sparse voxel fusion (superior IoU versus image-patch conditioning) and show that multi-view stochastic training is essential for leveraging supplementary frames. Affordance-driven view selection outperformed sequential and random sampling in aIoU convergence. Qualitative results corroborate these gains, showing reconstruction of occluded functional geometry and diverse, multimodal affordance region maps (Park et al., 14 Jan 2026).

6. Innovations, Limitations, and Prospective Directions

The core innovations of Affostruction are:

  • Sparse voxel fusion enabling constant computational complexity across variable observation count while supporting depth-guided, multi-view synthesis.
  • Flow-based affordance grounding that enables distributional, multimodal region assignment and is driven by a rectified flow objective targeting binary functional regions.
  • Affordance-driven view selection that ties scene exploration to the accumulation of functionally salient information.

Empirical evaluations report substantial improvements over existing methods on both geometric reconstruction and affordance localization. The current system is limited to single-object scenes and leverages pretrained generative priors (e.g., TRELLIS VAE decoders). Future work is anticipated in the extension to multi-object or cluttered scenes, the introduction of richer interaction priors, optimization-based or mutual-information-driven view planning, and closed-loop evaluation with physical robots (Park et al., 14 Jan 2026).

A plausible implication is that integrating explicit information-theoretic policies or interaction-specific priors could further advance the quality and efficiency of affordance grounding in robotic perceptual systems.
