Affordance-Aware Partial Models

Updated 4 July 2026

The paper introduces a partial modeling approach that restricts predictions to affordance-relevant subsets, demonstrating improved efficiency and precision in complex tasks.
The method employs a modular decomposition that separates high-level reasoning from low-level grounding, enabling clear attribution of failures and enhanced interpretability.
Empirical benchmarks show that these models reduce computational overhead and prediction errors, leading to better generalization in robotic grasping and 3D interaction.

Affordance-aware partial models are computational formulations that restrict prediction to the state–action pairs, object parts, geometric cues, or latent variables that are functionally relevant to an intended interaction. In the strongest formalization, a partial world model is defined only on an affordance set $\mathcal A \subseteq \mathcal S \times \mathcal O$, so that prediction outside $\mathcal A$ is never queried (Khetarpal et al., 11 Feb 2026). In perceptual and embodied settings, the same principle appears as modular decomposition: functional reasoning, perceptual alignment, and physical retrievability are modeled separately and then composed, rather than learned in a single monolithic predictor (Chen et al., 3 Dec 2025). Related systems further split affordance prediction into high-level reasoning over “what” to interact with and low-level grounding over “where,” or isolate geometry and interaction as separate components whose fusion yields an affordance map (Zhang et al., 16 Dec 2025, Zhang et al., 24 Feb 2026). Taken together, the literature treats affordances not merely as labels on objects, but as a mechanism for defining where incomplete yet high-value prediction is sufficient for action.

1. Formal meaning and conceptual scope

The formal statement of affordance-aware partial world modeling begins with the contrast between a full world model and a partial one. A full model is written as $\hat P(s' \mid s,o)$ for all $(s,o)\in\mathcal S\times\mathcal O$, whereas an affordance-aware partial model is defined by restricting that predictor to an affordance set $\mathcal A$:

$\hat P_{\mathcal A}(s' \mid s,o)= \begin{cases} \hat P(s' \mid s,o), & (s,o)\in\mathcal A \\ \bot, & \text{otherwise.} \end{cases}$

The stated motivation is that in large or continuous action spaces, querying a full world model for all actions is computationally prohibitive and prone to compounding prediction errors, while affordances identify the state–action pairs likely to achieve desired intents (Khetarpal et al., 11 Feb 2026).
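The restriction above can be sketched as a thin wrapper around a full predictor. This is a minimal illustration, not the cited paper's implementation: the names `make_partial_model`, `toy_full_model`, and `toy_affordances` are hypothetical, the model is deterministic for brevity (the formal definition is over a distribution $\hat P$), and `None` stands in for $\bot$.

```python
from typing import Callable, Hashable, List, Optional, Set, Tuple

State = Hashable
Option = Hashable


def make_partial_model(
    full_model: Callable[[State, Option], State],
    affordances: Set[Tuple[State, Option]],
) -> Callable[[State, Option], Optional[State]]:
    """Restrict full_model to the affordance set; None plays the role of ⊥."""

    def partial_model(s: State, o: Option) -> Optional[State]:
        if (s, o) not in affordances:
            return None  # outside the affordance set: the full model is never queried
        return full_model(s, o)

    return partial_model


# Toy setup (assumption): integer states, options shift the state by one.
calls: List[Tuple[State, Option]] = []


def toy_full_model(s, o):
    calls.append((s, o))  # record every query to the (expensive) full model
    return s + {"left": -1, "right": 1}[o]


toy_affordances = {(0, "right"), (1, "right"), (1, "left")}
p_hat_A = make_partial_model(toy_full_model, toy_affordances)

print(p_hat_A(0, "right"))  # afforded pair: delegated to the full model
print(p_hat_A(0, "left"))   # non-afforded pair: undefined
print(calls)                # only the afforded pair reached the full model
```

The point of the wrapper is that the membership test happens before the expensive call, so a planner that iterates over many options pays only for the afforded ones.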

That formal view coexists with a second, compositional notion of partial modeling. In CRAFT-E, affordance grounding is decomposed into three partial models: a symbolic verb–property–object affordance score, a visual–language alignment score, and a grasp feasibility score. These are combined in a unified energy

$E(v,r_i)=\alpha E_{\rm grasp}(\mathcal G_i)+\beta E_{\rm aff}(v,o_{r_i})+\gamma E_{\rm align}(r_i,v),$

with $\alpha=\beta=\gamma=1$ by default, and selection is performed by minimizing the summed energy over candidate regions (Chen et al., 3 Dec 2025). The same paper explicitly describes this as a modular “partial-model” approach in which each module specializes in one aspect of the decision.
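As a rough sketch of the selection rule, the snippet below sums three per-candidate energies with unit default weights and picks the minimizer. The `Candidate` class, the region names, and all energy values are made up for illustration; how CRAFT-E actually computes each term is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str        # hypothetical region label
    e_grasp: float   # grasp feasibility energy (lower is better)
    e_aff: float     # symbolic affordance energy
    e_align: float   # visual-language alignment energy


def total_energy(c: Candidate, alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """E = alpha * E_grasp + beta * E_aff + gamma * E_align."""
    return alpha * c.e_grasp + beta * c.e_aff + gamma * c.e_align


def select_region(candidates: List[Candidate], **weights: float) -> Candidate:
    """Return the candidate region with minimal summed energy."""
    return min(candidates, key=lambda c: total_energy(c, **weights))


candidates = [
    Candidate("mug_handle", e_grasp=0.2, e_aff=0.1, e_align=0.3),
    Candidate("mug_rim", e_grasp=0.6, e_aff=0.4, e_align=0.2),
    Candidate("table_edge", e_grasp=0.1, e_aff=0.9, e_align=0.8),
]

best = select_region(candidates)
print(best.name, round(total_energy(best), 3))

# Re-weighting shows how each term can be inspected or emphasized separately.
print(select_region(candidates, alpha=10.0, beta=0.0).name)
```

Because the terms stay separate until the final sum, each candidate's score can be broken down per module, which is the interpretability property the modular formulation relies on.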

A third formulation isolates orthogonal perceptual capacities. “Probing and Bridging Geometry-Interaction Cues for Affordance Reasoning in Vision Foundation Models” separates geometry perception from interaction perception, then composes them without fine-tuning. Geometry defines structurally valid parts; interaction defines which of those parts are functionally relevant to a verb; their product yields the affordance region (Zhang et al., 24 Feb 2026). This suggests that “partial” does not only mean incomplete coverage of the world model; it can also mean factorized coverage of distinct explanatory dimensions.

A common misunderstanding is that a partial model is simply a weaker approximation to a full model. The formal and modular papers argue for a different interpretation. In the formal account, the objective is not universal coverage but high-quality prediction on the subset linked through affordances that achieve user intents (Khetarpal et al., 11 Feb 2026). In the neuro-symbolic and vision-foundation formulations, the objective is transparent balancing of functional plausibility, perceptual evidence, and embodiment constraints, not maximal end-to-end compression into one latent representation (Chen et al., 3 Dec 2025, Zhang et al., 24 Feb 2026).

2. Modular affordance reasoning in 2D perception and grounding

A prominent line of work uses partial decomposition to separate high-level reasoning from low-level localization in image-based affordance prediction. A4-Agent makes this separation explicit by casting affordance prediction as

$A_{ff}=F(I,T)=Ground(Reason(I,T)),$

where $I\in\mathbb R^{H\times W\times 3}$ is an RGB image, $T$ is a natural-language task instruction, and $A_{ff}$ is the set of predicted affordance regions (Zhang et al., 16 Dec 2025). The framework then instantiates this decomposition as a three-stage, training-free pipeline: a Dreamer produces a synthetic interaction image, a Thinker outputs a structured textual description of the actionable part, and a Spotter invokes open-vocabulary detection and SAM segmentation to localize the region. The system is explicitly described as decoupling “how,” “what,” and “where.”

This modularization targets a failure mode attributed to prevailing end-to-end models: coupling high-level reasoning and low-level grounding into a single monolithic pipeline trained on annotated datasets leads to poor generalization on novel objects and unseen environments (Zhang et al., 16 Dec 2025). The Dreamer uses a vision-language model to derive an image-editing prompt, then a generative model produces a plausible interaction image. The Thinker then emits a JSON string such as “the [object_part] of the [object_name].” The Spotter translates that description into boxes, keypoints, and masks using Rex-Omni and SAM2-Large.
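The staged composition can be sketched as three interchangeable callables with a trace of intermediate outputs. Everything below is a hypothetical stand-in: the stage signatures, the assumption that the Thinker sees the image, the task, and the imagined image, and the stub return values are illustrative only, and no real detector or segmentation API is called.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional, Tuple


@dataclass
class Trace:
    stages: Dict[str, Any] = field(default_factory=dict)  # intermediate outputs per stage
    failed_stage: Optional[str] = None                    # first stage that returned None


def run_pipeline(image: Any, task: str, dreamer: Callable, thinker: Callable,
                 spotter: Callable) -> Tuple[Optional[Any], Trace]:
    """Compose the stages in order and record where a failure first occurs."""
    trace = Trace()
    steps = [
        ("dreamer", lambda: dreamer(image, task)),
        ("thinker", lambda: thinker(image, task, trace.stages["dreamer"])),
        ("spotter", lambda: spotter(image, trace.stages["thinker"])),
    ]
    for name, step in steps:
        out = step()
        if out is None:
            trace.failed_stage = name
            return None, trace
        trace.stages[name] = out
    return trace.stages["spotter"], trace


# Hypothetical stubs standing in for the generative model, the VLM, and the detector.
def stub_dreamer(image, task):
    return f"imagined interaction for: {task}"


def stub_thinker(image, task, imagined):
    return {"object_part": "handle", "object_name": "mug"}


def stub_spotter(image, description):
    return [(10, 20, 50, 60)] if description["object_part"] == "handle" else None


regions, trace = run_pipeline("image.png", "pick up the mug", stub_dreamer, stub_thinker, stub_spotter)
print(regions, trace.failed_stage)

_, bad_trace = run_pipeline("image.png", "pick up the mug", stub_dreamer,
                            lambda *args: None, stub_spotter)
print(bad_trace.failed_stage)
```

Keeping the intermediate outputs in a trace is what allows a failure to be attributed to imagination, reasoning, or grounding rather than to the pipeline as a whole.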

Weakly supervised affordance grounding work adopts a different but related form of partialization: rather than learning directly from action labels alone, it inserts part-level semantic priors between verbs and regions. “Weakly-Supervised Affordance Grounding Guided by Part-Level Semantic Priors” defines a mapping from affordances to target parts, uses an open-vocabulary part detector and SAM to create pseudo-labels, and then trains a dense grounder with a label refining stage, a fine-grained feature alignment process, and a lightweight reasoning module (Xu et al., 30 May 2025). The partial model here is the explicit decomposition of affordance grounding into part retrieval, pseudo-label generation, exocentric alignment, and noun/part reasoning.

“Selective Contrastive Learning for Weakly Supervised Affordance Grounding” pushes the same idea further by distinguishing object-level and part-level contrastive structure depending on the reliability of the part proposal (Moon et al., 11 Aug 2025). Its selective prototypical and pixel contrastive objectives are “adaptive” in the precise sense that they use part prototypes when exocentric part proposals pass a reliability check and fall back to object prototypes otherwise. This is a partial model of supervision: the model does not assume dense part labels are always available or trustworthy, and instead modulates the granularity of the learned cue.
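The fallback logic described above can be sketched in a few lines. This is only an illustration of the selection step under stated assumptions: the reliability score, the threshold of 0.5, and mean-pooled prototypes are hypothetical choices, and the contrastive losses themselves are not shown.

```python
from typing import List, Sequence, Tuple


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a non-empty list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def select_prototype(
    part_feats: Sequence[Sequence[float]],
    object_feats: Sequence[Sequence[float]],
    reliability: float,
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Tuple[str, List[float]]:
    """Use the part-level prototype when the proposal is reliable, else the object-level one."""
    if part_feats and reliability >= threshold:
        return "part", mean_vector(part_feats)
    return "object", mean_vector(object_feats)


part_feats = [[1.0, 0.0], [3.0, 0.0]]   # toy features inside a part proposal
object_feats = [[0.0, 2.0]]             # toy features over the whole object

print(select_prototype(part_feats, object_feats, reliability=0.9))
print(select_prototype(part_feats, object_feats, reliability=0.2))
```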

Across these systems, a recurring architectural principle is that affordance reasoning is more stable when “what part is relevant,” “which pixels correspond to it,” and “how interaction semantics should bias perception” are not forced into a single inference step. This suggests that modular affordance grounding is one operational interpretation of affordance-aware partial modeling in 2D vision.

3. Partial modeling in 3D objects, parts, and articulated interaction

In 3D, affordance-aware partial models often operate by restricting prediction to functionally meaningful parts rather than entire shapes. PartAfford defines the task of part-level affordance discovery from 3D objects under sparse supervision: given only an affordance set per object, the model must decompose a voxelized shape into parts and assign each discovered part an affordance from a fixed taxonomy (Xu et al., 2022). The architecture combines an abstraction encoder with slot attention and an affordance decoder with part reconstruction, affordance prediction, and cuboidal primitive regularization. The partiality lies in discovering part hypotheses as slots rather than learning a dense object-wide labeling directly from annotated part masks.

The slot-based decomposition is technical rather than purely heuristic. The encoder produces a set of slot vectors, and the decoder predicts per-slot occupancies, masks, and affordance distributions. Since supervision is only an affordance set per object, assignment between slots and affordances is handled by Hungarian matching over a set-prediction loss (Xu et al., 2022). This makes the part decomposition itself a latent partial model of the object, with explicit pressure toward geometric simplicity through the cuboidal primitive loss.

AdaAfford addresses a different 3D setting: articulated objects with hidden kinematic and dynamic constraints. It begins from an affordance prior network inherited from Where2Act, operating on a partial 3D scan, and then performs a small number of test-time probing interactions to adapt this prior into an instance-specific posterior (Wang et al., 2021). The Adaptive Affordance Prediction module maps a set of interactions into a latent summary, which conditions updated per-point actionability and action-success predictions. The Adaptive Interaction Proposal module then selects the next probe expected to maximally reduce residual uncertainty.
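To make the set-level assignment concrete, the sketch below matches slots to the affordances in an object's label set by minimizing a summed negative log-likelihood. It deliberately uses brute-force search over permutations as a stand-in for Hungarian matching (both return a minimum-cost assignment; brute force is only practical for a handful of slots). The slot probabilities and the cost definition are illustrative assumptions.

```python
import math
from itertools import permutations
from typing import Dict, Iterable, List, Tuple


def match_slots(
    slot_probs: List[Dict[str, float]],
    target_affordances: Iterable[str],
) -> Tuple[Dict[str, int], float]:
    """Assign each target affordance to a distinct slot with minimal total cost.

    Cost of giving affordance a to slot s is -log p_s(a). Brute-force stand-in
    for Hungarian matching; assumes at least as many slots as targets.
    """
    targets = sorted(target_affordances)
    best_cost, best_perm = math.inf, None
    for perm in permutations(range(len(slot_probs)), len(targets)):
        cost = sum(-math.log(slot_probs[s][a] + 1e-9) for s, a in zip(perm, targets))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return dict(zip(targets, best_perm)), best_cost


# Toy per-slot affordance distributions (made-up numbers).
slot_probs = [
    {"sittable": 0.1, "support": 0.8, "openable": 0.1},
    {"sittable": 0.7, "support": 0.2, "openable": 0.1},
    {"sittable": 0.3, "support": 0.3, "openable": 0.4},
]

assignment, cost = match_slots(slot_probs, {"sittable", "support"})
print(assignment, round(cost, 3))
```

The unmatched slot (index 2 in this toy case) carries no affordance supervision, which is how a set-level label can train a part-level predictor without part masks.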

This is a partial model in two senses. First, it is conditioned on partial observation: a single partial scan. Second, it updates only those aspects of the affordance prior that are informative for the current instance, using one to four interactions rather than learning a new full model (Wang et al., 2021). The paper explicitly contrasts this with passive observation, which misses hidden joint location, joint limits, friction, and restitution.

Open-vocabulary 3D affordance grounding introduces a further variation. “Part-Aware Open-Vocabulary 3D Affordance Grounding via Prototypical Semantic and Geometric Alignment” decomposes the task into semantic recovery and geometric alignment (Gou et al., 18 Mar 2026). In the first stage, an LLM transforms a raw query into a structured, part-aware instruction such as “An {object} consists of {P$_1$, P$_2$, …}. For the affordance '{action}', focus on the {part}.” In the second stage, Intra-Object Relational Modeling refines geometric differentiation within the object, while Affordance Prototype Aggregation maintains a learnable prototype bank that captures cross-object geometric consistency for each affordance. This two-stage design is explicitly motivated by open-vocabulary generalization, fine-grained geometric alignment, and part-level semantic consistency.

The 3D literature therefore uses affordance-aware partial models to isolate parts, relations, or latent prototype structure that are sufficient for action, instead of requiring a dense, fully supervised mapping from whole shape to whole-scene behavior.

4. Generative and energy-based partial models

A distinct family of methods uses affordances to constrain generative inference over only the ambiguous or functionally critical variables. “Affordance-Guided Diffusion Prior for 3D Hand Reconstruction” presents a diffusion-based generative prior conditioned on affordance-aware textual descriptions of hand-object interactions, inferred from a vision-language model and summarized by an LLM (Suzuki et al., 1 Oct 2025). The method reconstructs 3D hand poses under severe occlusion by refining only occluded joints while keeping visible joints fixed to the initial estimate. The paper explicitly describes AGDP as acting as a “partial model”: by conditioning the diffusion prior only on the occluded joints and rich textual affordance cues, it fills in missing parts of the hand in a functionally coherent way.

The conditioning variable is produced from six parsed captions (object category, object shape, object size, interaction type, intention, and grasp taxonomy), which are then embedded via a CLIP text encoder. The diffusion model follows a standard noising process and a simplified denoising objective, but the key affordance-aware design choice is not the diffusion formalism itself; it is the restriction of denoising to the occluded subset of joints and the use of functionally meaningful text conditions (Suzuki et al., 1 Oct 2025). This is partial modeling at the level of latent state completion.
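The masking idea can be illustrated independently of any particular diffusion sampler. In the sketch below, `denoise_step` is a hypothetical callable standing in for one conditioned reverse step; joints are reduced to scalars for brevity, and the toy denoiser simply moves values toward a fixed target. It is not the AGDP sampler, only the "update occluded, freeze visible" pattern.

```python
from typing import Callable, List, Sequence


def masked_refine(
    initial: Sequence[float],
    occluded: Sequence[bool],
    denoise_step: Callable[[List[float], int], List[float]],
    num_steps: int = 10,
) -> List[float]:
    """Refine only occluded entries; visible entries are reset to the initial estimate each step."""
    x = list(initial)
    for t in reversed(range(num_steps)):
        proposal = denoise_step(x, t)
        # Keep the proposal where the joint is occluded, the initial estimate elsewhere.
        x = [p if occ else x0 for p, occ, x0 in zip(proposal, occluded, initial)]
    return x


# Toy denoiser (assumption): pull every value halfway toward a fixed target pose.
TARGET = [10.0, 10.0, 10.0, 10.0]


def toy_denoise_step(x: List[float], t: int) -> List[float]:
    return [xi + 0.5 * (ti - xi) for xi, ti in zip(x, TARGET)]


initial_pose = [0.0, 1.0, 2.0, 3.0]
occluded_mask = [False, True, False, True]

refined = masked_refine(initial_pose, occluded_mask, toy_denoise_step, num_steps=3)
print(refined)
```

Visible joints come out bit-identical to the initial estimate, while occluded joints follow the denoiser, which is the sense in which the prior acts only on the missing subset.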

CRAFT-E provides the energy-based counterpart. Its verb–property–object knowledge graph is fixed and cached offline, with no LLM calls at inference time, and functional affordance is scored by aggregating two-hop verb–property–object paths in the graph (Chen et al., 3 Dec 2025). Visual–language alignment and grasp energy are then added to that symbolic score. The resulting decision can be exposed as an interpretable path from verb to properties to object hypothesis to region to grasp. This differs from AGDP’s generative prior, but both systems narrow inference to the variables that matter: one to occluded hand joints, the other to physically actionable candidate regions.

“Probing and Bridging Geometry-Interaction Cues for Affordance Reasoning in Vision Foundation Models” shows that even training-free zero-shot affordance estimation can be constructed from two partial sources: PCA-based geometric prototypes from DINOv3 and verb-conditioned spatial attention maps from FLUX.1 Kontext (Zhang et al., 24 Feb 2026). The fusion pipeline extracts noun and verb attention from FLUX, crops DINO features within the noun region, computes a set of geometric bases, scores each basis against the verb attention via NSS, selects the best-aligned component, and multiplies the corresponding part map with the interaction map. The result is a mechanistic argument that pre-trained vision and generative models already contain complementary partial models of affordance, even if no single component is itself a full affordance predictor.
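The select-then-multiply step of the fusion can be sketched on flattened toy maps. Assumptions to flag: the part maps are given directly (the PCA over DINO features is not shown), NSS is computed as the mean z-scored part-map value over pixels where the verb attention exceeds a threshold of 0.5, and all values are invented. The paper's exact scoring details may differ.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def nss(saliency: Sequence[float], fixation_mask: Sequence[bool]) -> float:
    """Normalized scanpath saliency: mean z-scored saliency at the masked locations."""
    mu, sigma = mean(saliency), pstdev(saliency)
    if sigma == 0 or not any(fixation_mask):
        return 0.0
    return mean((s - mu) / sigma for s, f in zip(saliency, fixation_mask) if f)


def fuse(
    part_maps: List[Sequence[float]],
    interaction_map: Sequence[float],
    threshold: float = 0.5,  # assumed binarization threshold for the verb attention
) -> Tuple[int, List[float]]:
    """Pick the part map best aligned with the interaction map, then multiply the two."""
    fixations = [v >= threshold for v in interaction_map]
    scores = [nss(p, fixations) for p in part_maps]
    best = max(range(len(part_maps)), key=scores.__getitem__)
    fused = [p * v for p, v in zip(part_maps[best], interaction_map)]
    return best, fused


# Flattened 1x4 toy "image": two candidate geometric parts and one verb attention map.
part_maps = [[1, 1, 0, 0], [0, 0, 1, 1]]
interaction_map = [0.1, 0.2, 0.9, 0.8]

best_idx, affordance_map = fuse(part_maps, interaction_map)
print(best_idx, affordance_map)
```

The product suppresses pixels that are structurally valid but not interaction-relevant, and vice versa, so neither cue alone determines the final map.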

These works indicate that affordance-aware partial modeling is compatible with several inference paradigms—diffusion refinement, energy minimization, and training-free feature fusion—provided the model is restricted to functionally consequential latent variables or cues.

5. Learning regimes, benchmarks, and reported empirical behavior

The empirical literature spans training-free zero-shot systems, weakly supervised grounders, few-shot adaptive interaction models, neuro-symbolic energy models, and formal LLM-based planners. Their reported gains are diverse, but they are consistently tied to the same structural claim: restricting inference to affordance-relevant subsets improves grounding, robustness, or search efficiency.

Paper	Setting	Reported metrics
A4-Agent (Zhang et al., 16 Dec 2025)	ReasonAff zero-shot	gIoU, cIoU, P@50, P@50–95
AGDP (Suzuki et al., 1 Oct 2025)	HOGraspNet high occlusion	PA-MPJPE reduction (mm)
Agentic RAG-VLM (Chen et al., 30 Jun 2026)	12-task grasp benchmark	Overall success
Open-vocabulary 3D grounding (Gou et al., 18 Mar 2026)	OpenAfford open-set full-view	aIoU, AUC, SIM, MAE
Selective CL (Moon et al., 11 Aug 2025)	AGD20K-Unseen	KLD, SIM, NSS
CRAFT-E (Chen et al., 3 Dec 2025)	ImageNet functional retrieval	top-1, MRR, nDCG

A4-Agent reports zero-shot performance across ReasonAff, RAGNet-3DOI, RAGNet-HANDAL, UMD Part Affordance, and an open-world qualitative set (Zhang et al., 16 Dec 2025). On ReasonAff, it improves over Affordance-R1 in gIoU and matches it on P@50–95. On the RAGNet subsets, it reports gIoU and cIoU on 3DOI and on HANDAL-hard, and on UMD cross-category it reports gIoU and P@50. Its ablations further show that replacing SAM2-Large with SAM2-Tiny drops gIoU only slightly, and that adding the Dreamer improves gIoU for both GPT-4o and Qwen-2.5-VL backbones (Zhang et al., 16 Dec 2025).
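For readers unfamiliar with the mask metrics named here, the sketch below follows the convention common in reasoning-segmentation benchmarks: gIoU as the mean of per-image IoUs, cIoU as cumulative intersection over cumulative union, and P@50 as the fraction of images whose IoU exceeds 0.5. Whether the cited papers use exactly these definitions is an assumption; masks are represented as sets of pixel indices for simplicity.

```python
from typing import List, Set


def iou(pred: Set[int], gt: Set[int]) -> float:
    """Intersection over union of two pixel-index sets (1.0 if both are empty, by convention here)."""
    union = len(pred | gt)
    return len(pred & gt) / union if union else 1.0


def g_iou(preds: List[Set[int]], gts: List[Set[int]]) -> float:
    """Mean of per-image IoUs."""
    return sum(iou(p, g) for p, g in zip(preds, gts)) / len(preds)


def c_iou(preds: List[Set[int]], gts: List[Set[int]]) -> float:
    """Cumulative intersection over cumulative union across the dataset."""
    inter = sum(len(p & g) for p, g in zip(preds, gts))
    union = sum(len(p | g) for p, g in zip(preds, gts))
    return inter / union if union else 1.0


def precision_at(preds: List[Set[int]], gts: List[Set[int]], threshold: float = 0.5) -> float:
    """Fraction of images whose IoU exceeds the threshold."""
    return sum(iou(p, g) > threshold for p, g in zip(preds, gts)) / len(preds)


# Two toy images: a half-overlapping large mask and a perfect small mask.
preds = [{1, 2, 3, 4}, {1}]
gts = [{3, 4, 5, 6}, {1}]

print(g_iou(preds, gts), c_iou(preds, gts), precision_at(preds, gts))
```

The toy case shows why the two IoU variants can disagree: cIoU is dominated by large masks, while gIoU weights every image equally.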

The formal planning paper evaluates MCTS in a PyBullet tabletop manipulation domain on the task “move blocks close to each other,” comparing a full LLM world model, an affordance-restricted partial model, and an oracle-affordance variant (Khetarpal et al., 11 Feb 2026). On the 3-block task with 4 MCTS simulations and depth-10 rollouts, it reports an MC-Score for each of the full model, the partial model, and the oracle-affordance model. The stated interpretation is that partial models find rewarding states under tight MCTS budgets whereas full models often fail.

In robotic grasping, Agentic RAG-VLM frames affordance-aware retrieval itself as a partialization of the strategy space (Chen et al., 30 Jun 2026). Its Hierarchical Affordance-Aware RAG encodes four-dimensional affordance descriptors (type, material, fragility, and graspable region) and retrieves top-3 experiences by a score that combines category filtering, affordance similarity, and visual similarity. The overall success of the full system is reported against a VLM-only configuration. Removing recovery reduces overall success, as does removing scene graph reasoning. Retrieval quality is also reported for HAA-RAG in terms of P@1, P@3, MRR, and affordance-match (Chen et al., 30 Jun 2026).
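A retrieval score of this general shape can be sketched as follows. The concrete choices are assumptions made for illustration: category acts as a hard filter, affordance similarity is the fraction of matching descriptor fields, visual similarity is cosine similarity, and the two are mixed with equal weights. The memory entries are invented, and the paper's actual scoring function is not reproduced.

```python
import math
from typing import Dict, List, Sequence

AFFORDANCE_KEYS = ("type", "material", "fragility", "graspable_region")


def affordance_similarity(a: Dict[str, str], b: Dict[str, str]) -> float:
    """Fraction of the four descriptor fields that match exactly."""
    return sum(a[k] == b[k] for k in AFFORDANCE_KEYS) / len(AFFORDANCE_KEYS)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    nu, nv = math.hypot(*u), math.hypot(*v)
    return sum(x * y for x, y in zip(u, v)) / (nu * nv) if nu and nv else 0.0


def retrieve(query: dict, memory: List[dict], k: int = 3,
             w_aff: float = 0.5, w_vis: float = 0.5) -> List[dict]:
    """Filter by category, then rank by a weighted sum of affordance and visual similarity."""
    same_category = [m for m in memory if m["category"] == query["category"]]

    def score(m: dict) -> float:
        return (w_aff * affordance_similarity(query["affordance"], m["affordance"])
                + w_vis * cosine(query["visual"], m["visual"]))

    return sorted(same_category, key=score, reverse=True)[:k]


def entry(eid, category, material, fragility, region, visual):
    return {"id": eid, "category": category, "visual": visual,
            "affordance": {"type": "container", "material": material,
                           "fragility": fragility, "graspable_region": region}}


query = entry("q", "cup", "ceramic", "high", "handle", [1.0, 0.0])
memory = [
    entry("e1", "cup", "ceramic", "high", "handle", [1.0, 0.0]),
    entry("e2", "cup", "glass", "high", "handle", [0.0, 1.0]),
    entry("e3", "cup", "plastic", "low", "handle", [1.0, 0.0]),
    entry("e4", "bottle", "ceramic", "high", "handle", [1.0, 0.0]),
    entry("e5", "cup", "metal", "low", "body", [0.0, 1.0]),
]

top3 = retrieve(query, memory)
print([m["id"] for m in top3])
```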

Weakly supervised methods report gains from inserting part-level or cross-view partial structure into the learning process. Selective CL reports KLD/SIM/NSS on both AGD20K-Seen and AGD20K-Unseen (Moon et al., 11 Aug 2025). The part-semantic-prior method reports on AGD20K that its full model improves over its baseline in KLD, SIM, and NSS on the Seen split; on Unseen, KLD drops while SIM and NSS rise (Xu et al., 30 May 2025).

The reported pattern across benchmarks is not that every partial model dominates every baseline in every metric. PartAfford, for example, reports per-category AP on “sittable,” “support,” and “openable,” while noting that tiny handles and strongly non-cuboidal parts remain difficult (Xu et al., 2022). AdaAfford shows especially large gains on ambiguous tasks such as pulling closed cabinet doors and pushing faucets after only a handful of interactions (Wang et al., 2021). The empirical evidence therefore supports a narrower claim: affordance-aware restrictions are particularly valuable under ambiguity, open-vocabulary variation, partial observation, and high branching factors.

6. Interpretability, limitations, and open directions

One of the strongest arguments for affordance-aware partial models is interpretability. CRAFT-E exposes the supporting property set, the individual affordance energy contributions, the alignment cosine, and grasp confidence at runtime, and reports component-level diagnostics such as “3D feasibility,” “Detection IoU,” “Alignment score,” and “Affordance energy” (Chen et al., 3 Dec 2025). A4-Agent’s decomposition into Dreamer, Thinker, and Spotter likewise makes it possible to attribute failure to imagination, reasoning, or grounding rather than to a single opaque affordance head (Zhang et al., 16 Dec 2025). This has made modularity and transparency central themes in assistive robotics and human-facing applications.

At the same time, the literature repeatedly documents limitations. AGDP notes that novel objects or previously unseen affordances may fail if the VLM or LLM does not produce accurate descriptions, and that MANO restricts hand shape variation (Suzuki et al., 1 Oct 2025). AdaAfford states that it considers only two primitive actions, short-horizon single probes, and abstracts away robot-arm kinematics via a “flying gripper” (Wang et al., 2021). PartAfford identifies tiny handles and parts that strongly violate the cuboid prior as failure modes (Xu et al., 2022). The formal LLM planning account describes prompt engineering for task-agnostic intents as brittle and assumes static affordances (Khetarpal et al., 11 Feb 2026). Weakly supervised pseudo-labeling methods continue to rely on manual or LLM-constructed affordance-to-part mappings and ad hoc refinement stages (Xu et al., 30 May 2025).

There is also an unresolved methodological tension between monolithic and modular approaches. A4-Agent explicitly argues against coupling high-level reasoning and low-level grounding into a single pipeline (Zhang et al., 16 Dec 2025), while CRAFT-E positions itself as an interpretable and customizable alternative to black-box models (Chen et al., 3 Dec 2025). By contrast, some open-vocabulary 3D grounding methods retain a unified trainable backbone while inserting part-aware instructions, prototype banks, or relational modeling inside it (Gou et al., 18 Mar 2026). This suggests that the main divide is not strictly “modular versus end-to-end,” but whether affordance structure is represented as an explicit intermediate object—an affordance set, a part-aware instruction, a retrieved episode set, a symbolic path, a prototype, or a masked latent subset.

Several future directions are stated directly in the source literature. AGDP proposes conditioning on direct 3D object geometry, extending to other articulations, and co-diffusing hand and object deformations (Suzuki et al., 1 Oct 2025). AdaAfford proposes extension to multi-step manipulation trajectories, full 7-DoF manipulators, and vision-only partial observations (Wang et al., 2021). The geometry–interaction probing work suggests video-based diffusion or transformer models as possible unified geometry-plus-interaction priors (Zhang et al., 24 Feb 2026). The formal partial world model work proposes automating intent discovery via program synthesis and integrating online learning to correct the partial model (Khetarpal et al., 11 Feb 2026).

A plausible implication is that affordance-aware partial models will remain important even if larger foundation models continue to improve. The surveyed papers repeatedly show that the value of affordances is not only to supply more semantics, but to delimit inference: to specify which actions deserve rollout, which parts deserve grounding, which latent variables deserve denoising, and which retrieved experiences deserve consideration. In that sense, affordance-aware partial models define a research program in which action-relevant structure, rather than exhaustive representation, is the primary organizing principle.