Affostruction: 3D Affordance Reconstruction
- Affostruction is a generative framework for 3D affordance grounding that reconstructs complete object geometry from partial RGB-D observations.
- It employs sparse voxel fusion and a Flow Transformer to integrate multi-view data, achieving significant improvements in reconstruction accuracy and affordance localization.
- Affordance-driven active view selection steers viewpoint acquisition toward functionally salient regions, while flow-based grounding captures diverse, multimodal functional region distributions.
Affostruction is a generative framework for 3D affordance grounding that jointly reconstructs complete object geometry from partial RGB-D observations and localizes open-vocabulary affordance regions, including those on unobserved surfaces. Unlike prior methods that restrict affordance predictions to visible surfaces, Affostruction performs generative multi-view reconstruction using sparse voxel fusion to extrapolate the object’s full geometry and applies flow-based affordance grounding to capture multimodal functional region distributions. The system incorporates affordance-driven active view selection, using predicted affordance maps to guide intelligent viewpoint acquisition. Empirical evaluation demonstrates substantial improvements in both reconstruction and affordance localization accuracy over previous baselines (Park et al., 14 Jan 2026).
1. Formal Problem Definition
Affostruction addresses the problem of affordance grounding from a sequence of RGB-D images $\{O_t\}_{t=1}^{T}$, with each observation $O_t = (I_t, D_t, K_t, E_t)$ consisting of a color image $I_t$, a depth map $D_t$, camera intrinsics $K_t$, and extrinsics $E_t$. These inputs are used to back-project observed pixels into a sparse 3D point cloud or voxel set $\mathcal{P} = \{(p_i, f_i)\}$, where $f_i$ denotes per-pixel DINOv2 features.
The principal objectives are: (a) reconstruction of the complete 3D shape $\hat{S}$; (b) grounding of an open-vocabulary affordance query $q$ (e.g., "where to grasp") to a probabilistic heatmap $A: \hat{S} \to [0,1]$, where $A(p)$ reflects the probability that point $p$ affords the queried action. Ground-truth annotations $A^{*}$ are supplied for supervision, and the framework jointly minimizes reconstruction error and affordance localization loss.
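The back-projection step can be sketched concretely; the pinhole-camera convention and the names below (`backproject`, `K`, `R`, `t`) are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def backproject(depth: np.ndarray, K: np.ndarray,
                R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Lift every valid depth pixel into a world-space 3D point."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0
    z = depth[valid]
    # Pixel -> camera coordinates via the inverse pinhole intrinsics.
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z], axis=-1)
    # Camera -> world for extrinsics [R | t]: p_w = R^T (p_c - t).
    return (pts_cam - t) @ R

# One valid pixel at the principal point maps to (0, 0, depth).
K = np.array([[50.0, 0.0, 32.0], [0.0, 50.0, 32.0], [0.0, 0.0, 1.0]])
depth = np.zeros((64, 64))
depth[32, 32] = 2.0
pts = backproject(depth, K, np.eye(3), np.zeros(3))
print(pts)  # [[0. 0. 2.]]
```

In the full pipeline, each lifted point would additionally carry its per-pixel DINOv2 feature.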
2. Generative Multi-View Reconstruction via Sparse Voxel Fusion
Affostruction employs a sparse voxel fusion procedure to synthesize dense geometric representations from partial multi-view input. Each input view's DINOv2 per-pixel features are back-projected into 3D and merged across views by averaging features that fall into the same voxel grid cell. The fused set is $\mathcal{V} = \{(v_j, \bar{f}_j)\}_{j=1}^{M}$, where $\bar{f}_j$ is the mean of all view features mapped to voxel $v_j$.
A positional encoding $\mathrm{PE}(v_j)$ is appended to each fused feature to form conditioning tokens $c_j = [\bar{f}_j \,;\, \mathrm{PE}(v_j)]$.
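Under the stated fusion rule, averaging per-point features into shared voxels admits a minimal sketch (the voxel size and function names here are assumptions):

```python
import numpy as np

def fuse_voxels(points: np.ndarray, feats: np.ndarray, voxel_size: float):
    """Return unique voxel indices and the mean feature per voxel."""
    idx = np.floor(points / voxel_size).astype(np.int64)
    # Collapse duplicate voxel indices across all views.
    keys, inverse = np.unique(idx, axis=0, return_inverse=True)
    fused = np.zeros((len(keys), feats.shape[1]))
    counts = np.zeros(len(keys))
    np.add.at(fused, inverse, feats)   # unbuffered scatter-add
    np.add.at(counts, inverse, 1)
    return keys, fused / counts[:, None]

# Two points share a voxel (features 1.0 and 3.0 average to 2.0).
pts = np.array([[0.01, 0.0, 0.0], [0.02, 0.0, 0.0], [0.9, 0.9, 0.9]])
feats = np.array([[1.0], [3.0], [5.0]])
keys, fused = fuse_voxels(pts, feats, voxel_size=0.1)
print(fused.ravel())  # [2. 5.]
```

Because fusion happens before tokenization, the downstream transformer sees one token per occupied voxel regardless of how many views contributed.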
The core mapping utilizes a Flow Transformer architecture, specifically a 12-block DiT backbone with hidden size 768 that cross-attends into the fused voxel tokens $\mathcal{C}$. The input is a dense noise tensor $z_0 \sim \mathcal{N}(0, I)$; the output is denoised and decoded by a frozen sparse-structure VAE, producing the reconstructed 3D point set $\hat{S}$. Rectified flow matching supervises the generative process: $\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t, z_0, z_1} \left\| v_\theta(z_t, t, \mathcal{C}) - (z_1 - z_0) \right\|^2$, with $z_t = (1 - t)\, z_0 + t\, z_1$, $t \sim \mathcal{U}[0, 1]$, and $z_1$ the clean latent. The architecture's token count is fixed by the number of fused voxels, independent of the number of input views $T$, due to voxel fusion.
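The rectified-flow objective itself is compact; in the sketch below a toy linear map stands in for the DiT velocity network, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def rectified_flow_loss(velocity_fn, z1: np.ndarray) -> float:
    """Evaluate the rectified-flow matching objective once."""
    z0 = rng.standard_normal(z1.shape)      # noise endpoint
    t = rng.uniform(size=(z1.shape[0], 1))  # per-sample time in [0, 1]
    zt = (1 - t) * z0 + t * z1              # linear interpolation path
    target = z1 - z0                        # constant velocity target
    return float(np.mean((velocity_fn(zt, t) - target) ** 2))

# Toy linear "velocity network" standing in for the DiT backbone.
W = rng.standard_normal((9, 8)) * 0.1
velocity = lambda z, t: np.concatenate([z, t], axis=-1) @ W
loss = rectified_flow_loss(velocity, rng.standard_normal((4, 8)))
print(loss > 0)  # True
```

Training drives the predicted velocity toward the straight-line displacement between noise and data, which is what allows few-step sampling at inference.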
3. Flow-Based Affordance Grounding
Affordance grounding is structured as a flow-based stochastic mapping on the reconstructed voxels. The approach uses a Sparse Flow Transformer operating on the occupied voxels, where an initial noise vector $a_0 \sim \mathcal{N}(0, I)$ is denoised to produce per-voxel affordance logits $\hat{a}$. The process is conditioned on the CLIP text embedding of the query: the architecture inherits the DiT backbone (12 blocks, hidden size 768) and cross-attends into a 768-dimensional CLIP-ViT-L/14 text token.
The loss function combines voxel-wise binary cross-entropy and Dice terms with flow-matching supervision: $\mathcal{L} = \mathcal{L}_{\mathrm{FM}} + \mathcal{L}_{\mathrm{BCE}} + \mathcal{L}_{\mathrm{Dice}}$. At inference, diverse affordance heatmaps are generated from different noise samples, capturing multimodal ambiguity in functional region assignment.
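A minimal version of the combined binary cross-entropy and Dice objective (unit loss weights assumed; the paper's exact weighting is not specified here) can be written as:

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def bce_dice_loss(logits: np.ndarray, target: np.ndarray,
                  eps: float = 1e-6) -> float:
    """Sum of voxel-wise BCE and soft Dice over binary affordance masks."""
    p = sigmoid(logits)
    bce = -np.mean(target * np.log(p + eps)
                   + (1 - target) * np.log(1 - p + eps))
    # Soft Dice: penalizes low overlap between prediction and mask.
    dice = 1.0 - (2.0 * np.sum(p * target) + eps) / (np.sum(p)
                                                     + np.sum(target) + eps)
    return float(bce + dice)

# Near-perfect logits give a loss close to zero.
logits = np.array([8.0, -8.0, 8.0])
target = np.array([1.0, 0.0, 1.0])
print(bce_dice_loss(logits, target) < 0.01)  # True
```

The Dice term counteracts the class imbalance typical of affordance masks, where positive voxels are a small fraction of the object.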
4. Affordance-Driven Active View Selection
To acquire information optimal for affordance localization, Affostruction implements an active view selection policy. Given a reconstructed mesh colored by the latest affordance heatmap, candidate camera poses $\{c_k\}_{k=1}^{K}$ are sampled (e.g., uniformly from a hemisphere). For each $c_k$, a rendered 2D affordance map $\hat{A}_{c_k}$ is produced and scored as $s(c_k) = \sum_{u} \hat{A}_{c_k}(u)$, the sum of rendered affordance probabilities over pixels $u$; the next view is selected as $c^{*} = \arg\max_k s(c_k)$. This policy prioritizes functional regions and empirically converges with fewer viewpoints than random or sequential sampling. While the procedure does not explicitly maximize information gain (entropy reduction), summing affordance probabilities empirically approximates coverage of functionally salient regions.
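The scoring rule reduces to summing each rendered map and taking an argmax; a toy sketch, with precomputed candidate maps standing in for a hypothetical mesh renderer, is:

```python
import numpy as np

def select_next_view(candidate_maps: list) -> int:
    """Pick the candidate whose rendered affordance mass is largest."""
    scores = [float(m.sum()) for m in candidate_maps]
    return int(np.argmax(scores))

# Three candidate viewpoints; the middle one sees the most affordance.
maps = [np.full((4, 4), 0.1), np.full((4, 4), 0.5), np.full((4, 4), 0.2)]
print(select_next_view(maps))  # 1
```

In the actual system the candidate maps would come from rendering the affordance-colored mesh at each sampled pose, which is where the cost of the policy lies.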
5. Experimental Design and Results
Training and Testing Protocols:
- Reconstruction training utilized 3D-FUTURE, HSSD, and ABO datasets; affordance training relied on Affogato’s train split with generated annotations.
- Reconstruction was tested on 1,250 objects from Toys4k; affordance on the Affogato test split (first view/query).
Key Metrics:
- 3D reconstruction: Volumetric IoU, Chamfer Distance (CD), F-score @0.05, PSNR/LPIPS for normals and color.
- Complete affordance: aIoU, AUC, SIM, MAE.
- Partial affordance (introduced metric): multi-threshold aIoU and aCD, coupling reconstruction with affordance.
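For illustration, a single-threshold version of the affordance IoU (aIoU) metric can be computed as below; the threshold value is an assumption, and the multi-threshold variant would average this quantity over several thresholds:

```python
import numpy as np

def aiou(pred: np.ndarray, gt: np.ndarray, thresh: float = 0.5) -> float:
    """IoU between binarized predicted and ground-truth affordance maps."""
    p, g = pred >= thresh, gt >= thresh
    union = np.logical_or(p, g).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(np.logical_and(p, g).sum() / union)

pred = np.array([0.9, 0.8, 0.1, 0.2])
gt = np.array([1.0, 0.0, 0.0, 1.0])
print(aiou(pred, gt))  # intersection 1, union 3 -> 0.333...
```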
Quantitative Results:
| Task | Baseline (TRELLIS/Espresso-3D) | Affostruction | Improvement |
|---|---|---|---|
| Reconstruction IoU | 19.49 | 32.67 | +67.7% |
| Chamfer Distance (lower is better) | 0.3694 | 0.2427 | |
| Complete Affordance aIoU | 13.6 | 19.1 | +40.4% |
| Partial Affordance aIoU (1 view) | 0.60 | 9.26 | +14.66 |
| Partial Affordance aCD (1 view, lower is better) | 0.2885 | 0.1044 | |
| aIoU (active view, 4 views) | | 12.46 | |
Ablations confirm the impact of sparse voxel fusion (superior IoU versus image-patch conditioning) and show that multi-view stochastic training is essential for leveraging supplementary frames. Affordance-driven view selection outperformed sequential and random sampling in aIoU convergence. These quantitative gains are corroborated by qualitative results showing reconstruction of occluded functional geometry and diverse, multimodal affordance region maps (Park et al., 14 Jan 2026).
6. Innovations, Limitations, and Prospective Directions
The core innovations of Affostruction are:
- Sparse voxel fusion enabling constant computational complexity across variable observation count while supporting depth-guided, multi-view synthesis.
- Flow-based affordance grounding that enables distributional, multimodal region assignment and is driven by a rectified flow objective targeting binary functional regions.
- Affordance-driven view selection that ties scene exploration to the accumulation of functionally salient information.
Empirical evaluations report substantial improvements over existing methods on both geometric reconstruction and affordance localization. The current system is limited to single-object scenes and leverages pretrained generative priors (e.g., TRELLIS VAE decoders). Future work is anticipated in the extension to multi-object or cluttered scenes, the introduction of richer interaction priors, optimization-based or mutual-information-driven view planning, and closed-loop evaluation with physical robots (Park et al., 14 Jan 2026).
A plausible implication is that integrating explicit information-theoretic policies or interaction-specific priors could further advance the quality and efficiency of affordance grounding in robotic perceptual systems.