Object Affordance Knowledge Base (Oak)
- Oak is a structured knowledge base that captures how objects and their parts afford specific interactions, supporting both static properties and dynamic task sequences.
- It integrates graph-based and tripartite representations to formalize affordance relationships, enabling zero-shot inference and effective planning in robotic systems.
- Oak employs multimodal embeddings and language-driven techniques to enrich affordance annotations, yielding measurable improvements in detection and manipulation tasks.
An Object Affordance Knowledge Base (Oak) encodes structured knowledge about the possible interactions—affordances—between agents and objects, supporting affordance-aware perception, reasoning, and task completion in both robotic and cognitive systems. Oak formalizes these affordances, often down to the level of object parts, capturing not only static object properties but also the dynamic, goal-conditioned sequences of actions required for functional object manipulation and recognition.
1. Core Formalisms and Structure
Oak implementations adopt diverse but clearly formalized representations:
- Graph-Based KBs: Nodes represent objects, semantic attributes, contexts, affordance classes, primitive tasks, or effects; edges encode hierarchical or sequential dependencies with probabilistic weights or deterministic logic (Ardón et al., 2019, Zhan et al., 28 Mar 2024).
- Tripartite Hierarchies: Recent OAKINK2-style formalisms structure affordance knowledge in three tiers: object part-level affordances, primitive tasks fulfilling those affordances, and complex tasks as directed acyclic graphs of primitives (Zhan et al., 28 Mar 2024).
Formally, the object-to-affordance mapping is expressed as $\mathcal{M}(o) = \{(p_i, v_i)\}_{i=1}^{K_o}$, where each entry encodes a part mask $p_i$ and a descriptive verb phrase $v_i$, e.g., for "knife": ("blade", "cut sth").
For action-oriented affordance reasoning, graph structures progressively link semantic features (shape, texture, etc.), context nodes, affordance classes (e.g., to_eat, to_contain), and binary effect nodes (success, failure), with edge weights representing learned posterior probabilities (Ardón et al., 2019).
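To make the two formalisms above concrete, the following minimal Python sketch encodes a part-level object-to-affordance mapping and a small weighted attribute–affordance–effect graph. All structures and names are illustrative assumptions, not an API from the cited papers.

```python
# Illustrative sketch (not the cited papers' code): a part-level
# object-to-affordance mapping plus a small weighted affordance graph.
from dataclasses import dataclass, field

# Object -> {(part, descriptive verb phrase)}, mirroring M(o) above.
part_affordances: dict[str, set[tuple[str, str]]] = {
    "knife":  {("blade", "cut sth"), ("handle", "handled by sth")},
    "teapot": {("spout", "pour out sth"), ("body", "contain sth")},
}

@dataclass
class AffordanceGraph:
    """Weighted edges linking attribute/context, affordance, and effect nodes."""
    edges: dict[tuple[str, str], float] = field(default_factory=dict)

    def add_edge(self, src: str, dst: str, prob: float) -> None:
        self.edges[(src, dst)] = prob  # learned posterior probability

    def successors(self, node: str) -> list[tuple[str, float]]:
        return [(dst, p) for (src, dst), p in self.edges.items() if src == node]

g = AffordanceGraph()
g.add_edge("shape:concave", "to_contain", 0.82)   # attribute -> affordance class
g.add_edge("to_contain", "effect:success", 0.74)  # affordance -> binary effect
print(g.successors("shape:concave"))              # [('to_contain', 0.82)]
```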
2. Taxonomies, Schemas, and Data Annotation
OakInk and OakInk2 Schema
- Objects: a set $\mathcal{O}$ of objects, each with a category label drawn from 32 leaf categories organized under two superclasses (Yang et al., 2022).
- Affordance Vocabulary: a vocabulary $\mathcal{V}$ of 30 distinct part-level affordance phrases (e.g., "cut sth", "handled by sth", "pump out sth"), formalized as ⟨verb(+prep), sth⟩ tuples (Yang et al., 2022).
- Part-Affordance Mapping: each object is segmented into parts $\{p_1, \dots, p_K\}$, and the mapping $\mathcal{M}$ assigns one or more affordance phrases from $\mathcal{V}$ to each part (see the schema sketch after this list).
- Quality Control: Affordance phrases are elicited and validated by multi-annotator majority vote; disagreement rates are <5% for part-affordance assignments, and <4% overall in random spot checks (Yang et al., 2022).
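The schema above can be rendered as a few plain Python dataclasses; class and field names below are hypothetical and chosen only to mirror the object/part/affordance-phrase structure, not the dataset's released tooling.

```python
# Hypothetical rendering of the OakInk-style schema (illustration only).
from dataclasses import dataclass, field

@dataclass(frozen=True)
class AffordancePhrase:
    verb: str              # verb(+prep), e.g. "cut", "handled by", "pump out"
    complement: str = "sth"

@dataclass
class ObjectPart:
    name: str                                  # e.g. "blade"
    affordances: list[AffordancePhrase] = field(default_factory=list)

@dataclass
class OakObject:
    name: str                                  # e.g. "knife"
    leaf_category: str                         # one of the 32 leaf categories
    parts: list[ObjectPart] = field(default_factory=list)

knife = OakObject(
    name="knife",
    leaf_category="knife",
    parts=[ObjectPart("blade", [AffordancePhrase("cut")]),
           ObjectPart("handle", [AffordancePhrase("handled by")])],
)
```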
OAKINK2 Extensions
- Primitive Tasks ($\mathcal{T}_P$): each primitive task is a minimal unit fulfilling a single affordance, formalized as a tuple $(s_{\text{init}}, s_{\text{end}}, \tau)$ of initial and final world states together with a hand-object trajectory segment $\tau$ (Zhan et al., 28 Mar 2024).
- Complex Tasks ($\mathcal{T}_C$): encoded as directed acyclic graphs (DAGs) of primitive-task nodes with temporal dependency edges. The global knowledge base thus becomes a tripartite graph linking object-part affordances, primitive tasks, and complex tasks (Zhan et al., 28 Mar 2024).
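The two tiers can be illustrated with a short sketch: a primitive task bundles an affordance, boundary world states, and a trajectory segment, while a complex task is a DAG of primitives with temporal dependencies. The layout below is an assumption for illustration, not OAKINK2's actual data format.

```python
# Assumed layout for the two OAKINK2-style tiers described above.
from dataclasses import dataclass, field

@dataclass
class PrimitiveTask:
    affordance: tuple[str, str]   # (object part, verb phrase) it fulfils
    initial_state: dict           # symbolic world state before execution
    final_state: dict             # symbolic world state after execution
    trajectory: list              # hand-object pose segment (placeholder)

@dataclass
class ComplexTask:
    """DAG of primitive-task nodes with temporal dependency edges."""
    primitives: dict[str, PrimitiveTask] = field(default_factory=dict)
    deps: dict[str, set[str]] = field(default_factory=dict)  # node -> prerequisites

    def executable(self, done: set[str]) -> list[str]:
        """Primitives whose temporal prerequisites are all satisfied."""
        return [n for n, pre in self.deps.items() if n not in done and pre <= done]

pour_tea = ComplexTask(
    primitives={"grasp_teapot": PrimitiveTask(("handle", "held by sth"), {}, {}, []),
                "pour":         PrimitiveTask(("spout", "pour out sth"), {}, {}, [])},
    deps={"grasp_teapot": set(), "pour": {"grasp_teapot"}},
)
print(pour_tea.executable(done=set()))  # ['grasp_teapot']
```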
3. Computational Construction and Learning
Attribute Graph Learning for Perception
- Four independent CNNs (ResNet-50 backbones), one each for shape, texture, categorical, and environment attributes, produce softmax probability vectors that serve as node activations in the hierarchical KB graph (Ardón et al., 2019).
- The composite feature vector $x = [x_{\text{shape}}; x_{\text{texture}}; x_{\text{category}}; x_{\text{environment}}]$ is scored against affordance classes by a linear or tree-based predictor, $\hat{a} = \arg\max_a f_\theta(x)$, where $f_\theta$ is learned via multinomial logistic regression or gradient boosting (Ardón et al., 2019).
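A minimal sketch of this scoring step, assuming the four per-attribute softmax vectors are already available; the dimensions, class counts, and toy training data below are invented for illustration.

```python
# Sketch of scoring a composite attribute feature against affordance classes.
# Dimensions, class counts, and training data are toy placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def composite_feature(shape_p, texture_p, category_p, environment_p):
    """Concatenate the four per-attribute softmax vectors into x."""
    return np.concatenate([shape_p, texture_p, category_p, environment_p])

# Toy training set: 200 samples, four 8-way attribute heads -> 32-dim features,
# labelled with one of 5 affordance classes (e.g. to_eat, to_contain, ...).
X = rng.random((200, 32))
y = rng.integers(0, 5, size=200)

clf = LogisticRegression(max_iter=500).fit(X, y)  # multinomial fit (lbfgs solver)

x_new = composite_feature(*(rng.dirichlet(np.ones(8)) for _ in range(4)))
affordance_class = clf.predict(x_new[None, :])[0]
posterior = clf.predict_proba(x_new[None, :])[0]  # distribution over affordances
```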
Language and Multimodal Embeddings
- For textual affordance knowledge, verb–object co-occurrence matrices are extracted from dependency-parsed corpora, converted to PPMI matrices, then factorized with sparse nonnegative matrix factorization (NMF) to yield interpretable, low-dimensional object and verb affordance embeddings (Lam et al., 2020).
- These embeddings enable direct ranking of verbs for each object and effective prediction of categorical/functional human-judged properties (e.g., SPoSE embedding dimensions), with correlations up to 0.84 for object classes such as "animal" and "food" (Lam et al., 2020).
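The pipeline above (verb–object counts → PPMI → sparse NMF) can be sketched as follows; the toy co-occurrence counts are fabricated, and scikit-learn's L1-regularized NMF stands in for the paper's exact sparse factorization.

```python
# Toy version of the text pipeline: counts -> PPMI -> (approximately sparse) NMF.
import numpy as np
from sklearn.decomposition import NMF

objects = ["knife", "cup", "apple"]
verbs = ["cut", "hold", "drink", "eat"]
counts = np.array([[40., 10., 0., 1.],    # knife
                   [0., 30., 25., 0.],    # cup
                   [15., 5., 0., 50.]])   # apple

def ppmi(C):
    total = C.sum()
    p_xy = C / total
    p_x = C.sum(axis=1, keepdims=True) / total
    p_y = C.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore"):
        pmi = np.log(p_xy / (p_x * p_y))
    return np.maximum(pmi, 0.0)           # clip negatives (and -inf) to zero

M = ppmi(counts)
nmf = NMF(n_components=2, l1_ratio=1.0, alpha_W=0.01, alpha_H=0.01,
          init="nndsvda", max_iter=1000, random_state=0)
obj_emb = nmf.fit_transform(M)            # object affordance embeddings
verb_emb = nmf.components_.T              # verb affordance embeddings

# Rank verbs for an object by reconstructed affinity:
scores = obj_emb[objects.index("cup")] @ nmf.components_
print([verbs[i] for i in np.argsort(-scores)])
```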
Knowledge Base Enrichment by LLMs
- MLCoT (Multi-level Chain-of-Thought) prompting is used to build affordance KBs for task-driven detection: LLMs are prompted to enumerate task-relevant object examples, explain rationales in terms of visual features, then summarize minimal visual attributes ("affordances"). These MLCoT units feed directly into knowledge-conditioned detection models such as CoTDet (Tang et al., 2023).
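A hedged sketch of an MLCoT-style prompting loop: three prompt levels (examples, rationales, attribute summary) whose outputs populate one KB entry per task. The prompt wording and the `call_llm` helper are assumptions, not CoTDet's released prompts or code.

```python
# Hypothetical MLCoT-style prompt levels for one task (illustration only).
def build_mlcot_prompts(task: str) -> list[str]:
    return [
        # Level 1: enumerate task-relevant object examples.
        f"List common objects a person could use to {task}.",
        # Level 2: explain rationales in terms of visual features.
        f"For each object, explain which visual features make it suitable to {task}.",
        # Level 3: summarize the minimal visual attributes (the 'affordance').
        f"Summarize the minimal visual attributes an object must have to {task}.",
    ]

def build_affordance_entry(task: str, call_llm) -> dict:
    """Run the three prompt levels and assemble one KB entry for the task."""
    objects, rationales, attributes = (call_llm(p) for p in build_mlcot_prompts(task))
    return {"task": task, "objects": objects,
            "rationales": rationales, "affordance_attributes": attributes}

# Usage, given any text-completion function bound to call_llm:
# entry = build_affordance_entry("open a bottle of beer", call_llm)
```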
4. Knowledge Utilization in Robotic and Recognition Systems
Perception and Planning
- The Oak KB supports zero-shot affordance inference: the attribute CNN pipeline is run on an unseen object, the composite feature vector $x$ is extracted, and the most plausible affordance class $\hat{a}$ is inferred. Predicted affordances then inform geometric grasp selection, e.g., central grasps for "to_contain" objects and density-based filters for other tasks (Ardón et al., 2019); a toy sketch follows this list.
- Part-level mappings (OakInk) enable fine-grained hand-object interaction generation by associating grasp points or manipulation actions with specific object parts and their labeled affordances (Yang et al., 2022).
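A toy sketch of the affordance-conditioned grasp selection just described; `infer_affordance` stands in for the CNN-plus-KB pipeline, and the point-cloud handling (centroid grasp, crude density surrogate) is a deliberately simplified assumption.

```python
# Toy affordance-conditioned grasp selection (simplified heuristics).
import numpy as np

def infer_affordance(composite_feature: np.ndarray, classifier) -> str:
    """Most plausible affordance class for an unseen object (zero-shot)."""
    return classifier.predict(composite_feature[None, :])[0]

def select_grasp(points: np.ndarray, affordance: str) -> np.ndarray:
    """Pick a grasp point on an object point cloud given its affordance."""
    centroid = points.mean(axis=0)
    if affordance == "to_contain":
        return centroid                       # central grasp keeps the opening usable
    # Otherwise favour a dense region of the cloud (crude density surrogate).
    dists = np.linalg.norm(points - centroid, axis=1)
    densest = points[np.argsort(dists)[: max(1, len(points) // 10)]]
    return densest.mean(axis=0)
```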
Task-Driven Detection and Reasoning
- CoTDet integrates affordance knowledge extracted via MLCoT, conditioning its deformable DETR decoder with affordance attributes and rationale-aware query fusion, yielding +15.6 box AP over prior knowledge-based detectors (Tang et al., 2023).
- In open-world perception scenarios, affordance KBs are linked with vision-language models (e.g., GLIP) via prompt engineering and entity label mapping, with error correction from a human-in-the-loop process leveraging kNN over CLIP embeddings and spatial logic pruning (Burghouts et al., 18 Jul 2024).
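One way to realize the kNN-based correction step, sketched under assumptions: a few manual relabels are propagated to detections whose CLIP-style embeddings lie near a corrected example. The function signature and distance threshold are hypothetical.

```python
# Sketch of propagating a few manual relabels via nearest neighbours over
# CLIP-style embeddings; signature and threshold are assumptions.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def propagate_relabels(embeddings: np.ndarray,
                       predicted: list[str],
                       corrections: dict[int, str],
                       radius: float = 0.15) -> list[str]:
    """Copy each manual relabel to detections whose embedding lies close by."""
    if not corrections:
        return list(predicted)
    idx = np.array(list(corrections.keys()))
    new_labels = [corrections[i] for i in idx]
    nn = NearestNeighbors(n_neighbors=1).fit(embeddings[idx])
    dist, nearest = nn.kneighbors(embeddings)          # nearest corrected example
    fixed = list(predicted)
    for i, (d, j) in enumerate(zip(dist[:, 0], nearest[:, 0])):
        if d <= radius:                                # similar enough: inherit label
            fixed[i] = new_labels[j]
    return fixed

# Usage with hypothetical unit-normalized CLIP embeddings and GLIP labels:
# fixed = propagate_relabels(clip_emb, glip_labels, {12: "door_handle", 40: "push_bar"})
```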
Hierarchical Task Planning
- OAKINK2 enables symbolic planning whereby LLMs decompose textual goals and scenes into sequences of primitive tasks, mapping them to concrete object-part affordances and retrieving ground-truth hand/object trajectories for motion synthesis via diffusion models (Zhan et al., 28 Mar 2024).
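A hedged sketch of this planning loop: an LLM proposes a primitive-task sequence, each primitive is grounded to a part-level affordance in the KB, and a stored reference trajectory is retrieved for downstream motion synthesis. `llm_decompose`, the KB layout, and `retrieve_trajectory` are assumed interfaces, not OAKINK2's actual code.

```python
# Sketch of goal decomposition and affordance grounding (assumed interfaces).
def plan_and_ground(goal: str, scene: list[str], kb: dict,
                    llm_decompose, retrieve_trajectory) -> list[dict]:
    """Decompose a textual goal into grounded primitive tasks."""
    plan = llm_decompose(goal=goal, objects=scene)   # e.g. ["grasp teapot handle", "pour tea"]
    grounded = []
    for step in plan:
        part, phrase = kb["primitive_to_affordance"][step]  # object part + verb phrase
        grounded.append({
            "primitive": step,
            "part": part,
            "affordance": phrase,
            "trajectory": retrieve_trajectory(step),        # reference hand/object motion
        })
    return grounded  # handed to a diffusion-based motion synthesizer downstream
```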
5. Evaluation Methodologies and Empirical Results
| Paper/Resource | Task | Key Metrics/Results |
|---|---|---|
| OakInk (Yang et al., 2022) | Grasp region/intent query | 1,800 objects, 30 affordances, 32 categories |
| OAKINK2 (Zhan et al., 28 Mar 2024) | Bimanual manipulation | 627 sequences, 75 objects, 39 affordance types |
| Ardón et al. (Ardón et al., 2019) | Attribute/affordance prediction | 96.81% affordance accuracy with environment, 81.3% zero-shot, 88% grasp-region match |
| Lam et al. (Lam et al., 2020) | Human verb–object plausibility | AUC = 0.88 on WTAction, 0.77 on MSCOCO; SPoSE correlation up to 0.84 |
| CoTDet (Tang et al., 2023) | Task-driven object detection | box mAP = 56.9, mask mAP = 53.6 |
| GLIP+KB (Burghouts et al., 18 Jul 2024) | Door-opener detection | mAP boosts from 0.04–0.10 to 0.60–0.91 post-correction |
These results collectively demonstrate that explicit KB-based affordance modeling, especially when it incorporates context, reasons over graphical or language-elicited knowledge structures, and integrates error correction, yields substantial gains in both traditional perception metrics and practical affordance fulfillment.
6. Extensions, Multimodal Integration, and Scalability
- Vocabulary Expansion: Oak-derived KBs can be scaled by mining broader object and verb sets via WordNet synsets, large image naming corpora, or multimodal knowledge graphs (e.g., robotics video datasets) (Lam et al., 2020).
- Symbolic–Neural Integration: Modular pipelines allow joint factorization of visual and textual data; object affordance embeddings can be coupled with CNN or VLM features, supporting cross-modal transfer and robust generalization (Lam et al., 2020, Burghouts et al., 18 Jul 2024).
- Task Graphs for Planning: By embedding complex manipulations as DAGs over primitive tasks and admitting LLMs as planners, OAKINK2 provides a foundation for full-scene task-completion engines grounded in explicitly labeled affordances (Zhan et al., 28 Mar 2024).
- Human-in-the-Loop Correction: Lightweight feedback, e.g., a few manual relabels propagated via kNN in feature space, effectively eliminates systematic errors in fine-grained affordance labels in open-world VLM pipelines (Burghouts et al., 18 Jul 2024).
A plausible implication is that Oak-style KBs, with rich multimodal structure and layered affordance codebooks, are positioned as central infrastructure for explainable, generalizable action understanding and task completion in next-generation embodied AI and cognitive modeling systems.