
YOLOA: Unified Affordance Detection

Updated 6 December 2025
  • YOLOA is a real-time affordance detection framework that jointly addresses the 'what', 'where', and 'how' challenges using a unified architecture.
  • It features dual branches for object detection and pixel-wise affordance learning, enhanced during training by an LLM adapter that refines inter-branch interactions.
  • Empirical evaluations on ADG-Det and IIT-Heat benchmarks demonstrate YOLOA’s superior detection accuracy and inference speed compared to previous methods.

YOLO Affordance (YOLOA) is a real-time affordance detection framework designed to jointly resolve the "what–where–how" challenge in embodied AI. Unlike prior art that treats object detection and affordance learning separately or that focuses solely on encoding affordance-related cues, YOLOA unifies the tasks of object classification, spatial localization, and pixel-wise affordance estimation within a single architecture, utilizing an LLM adapter to enable sophisticated inter-branch interaction during training. This results in state-of-the-art detection and efficiency, as demonstrated on the re-annotated ADG-Det and IIT-Heat benchmarks, where YOLOA substantially outperforms previous methods both in detection accuracy (mAP) and inference speed (Ji et al., 3 Dec 2025).

1. Unified Model Architecture

YOLOA's architecture extends the one-stage YOLOv11 backbone by decoupling its head into two parallel branches:

  • Object Detection Branch ($\mathcal{F}_\text{det}$): Predicts class logits ($\hat c$) and bounding box offsets ($\hat b$) following standard YOLO design.
  • Affordance Learning Branch ($\mathcal{F}_\text{aff}$): Outputs a dense, pixel-wise heatmap ($\hat a$) indicating affordance regions for detected objects.

Given an input image $x_i$, multi-scale features $p_i = \phi(x_i)$ are extracted via a DarkNet backbone. The detection branch produces preliminary object predictions, while the affordance branch, implemented as a lightweight MLP block over convolved, batch-normalized, dilated features, estimates affordance masks:

$(\hat c_i, \hat b_i) = \mathcal{F}_\text{det}(p_i), \quad \hat a_i = \mathcal{F}_\text{aff}(p_i)$

The affordance branch design, $\mathcal{F}_\text{aff}(p)=\mathrm{MLP}(\mathrm{SiLU}(\mathrm{BN}(\mathrm{DConv}(\mathrm{Conv}(p)))))$, prioritizes both computational tractability and spatial resolution.
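
To make the branch composition concrete, the following is a minimal PyTorch sketch of the affordance branch formula above. The channel widths, dilation rate, and MLP depth are illustrative assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn

class AffordanceBranch(nn.Module):
    """Sketch of F_aff(p) = MLP(SiLU(BN(DConv(Conv(p))))).

    Channel sizes, dilation, and MLP width are illustrative assumptions.
    """
    def __init__(self, in_ch=256, mid_ch=128, num_affordances=36):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1)
        # Dilated convolution widens the receptive field without downsampling.
        self.dconv = nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=2, dilation=2)
        self.bn = nn.BatchNorm2d(mid_ch)
        self.act = nn.SiLU()
        # Per-pixel MLP realized as 1x1 convolutions over the feature map.
        self.mlp = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, kernel_size=1),
            nn.SiLU(),
            nn.Conv2d(mid_ch, num_affordances, kernel_size=1),
        )

    def forward(self, p):
        # p: backbone feature map of shape (B, C, H, W); output is a
        # per-pixel affordance heatmap with one channel per affordance class.
        return self.mlp(self.act(self.bn(self.dconv(self.conv(p)))))

# Example: a single feature level from the backbone.
heatmap = AffordanceBranch()(torch.randn(1, 256, 80, 80))  # (1, 36, 80, 80)
```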

2. LLM Adapter Integration

A core innovation of YOLOA is the LLM Adapter mechanism, active only during training, that leverages a frozen, LoRA-tuned LLaMA-3 model (8B parameters) to refine interaction between detection and affordance branches. This adapter operates as follows:

  • Visual Embedding ($E_\text{vis}$): Constructed via ROIAlign applied to preliminary detection boxes overlaid on input and affordance mask channels, followed by a linear projection.
  • Textual Embedding ($E_\text{text}$): Formed from question-form prompts (“What can the [category] at [location] be used for?”) associated with detections, then encoded via the LLM.

These embeddings are concatenated and input to the LLM to obtain token-level features $\{h_i^t\}_{t=1}^T$. The hidden state of the final token, $h_i^T$, is then fed through lightweight heads to generate:

  • Class priors: $\overline c_i^k = \mathcal{F}_\text{adapter}^\text{cls}(h_i^T)$
  • Box offsets: $\overline b_i^k = \mathcal{F}_\text{adapter}^\text{box}(h_i^T)$
  • Affordance gates: $\overline a_i = \mathcal{F}_\text{adapter}^\text{aff}(h_i^T, \hat b_i^k)$

These signals are injected back into the preliminary predictions for both branches (with learned scaling factors), yielding final refined outputs:

$\hat c_i^k \leftarrow \hat c_i^k + \alpha\,\overline c_i^k, \qquad \hat b_i^k \leftarrow \hat b_i^k + \beta\,\overline b_i^k, \qquad \hat a_i \leftarrow \hat a_i + \gamma\,\mathrm{logit}(\overline a_i)$

At inference, the LLM Adapter is disabled (“YOLOA-light”), and both branches execute using only vision features, resulting in significantly increased throughput.
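
The refinement step can be sketched as follows, assuming the "learned scaling factors" $\alpha$, $\beta$, $\gamma$ are scalar parameters and that the adapter's affordance gates are probabilities; the LLM itself, ROIAlign pooling, and prompt encoding are omitted.

```python
import torch
import torch.nn as nn

class AdapterRefinement(nn.Module):
    """Training-time injection of adapter outputs into the preliminary
    predictions; skipped entirely at inference (YOLOA-light)."""
    def __init__(self):
        super().__init__()
        # Assumed parameterization of the learned scaling factors.
        self.alpha = nn.Parameter(torch.zeros(1))
        self.beta = nn.Parameter(torch.zeros(1))
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, hat_c, hat_b, hat_a, bar_c, bar_b, bar_a):
        if not self.training:
            # Inference path: vision-only predictions, no LLM in the loop.
            return hat_c, hat_b, hat_a
        hat_c = hat_c + self.alpha * bar_c                         # class priors
        hat_b = hat_b + self.beta * bar_b                          # box offsets
        hat_a = hat_a + self.gamma * torch.logit(bar_a, eps=1e-6)  # gates
        return hat_c, hat_b, hat_a
```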

3. Mathematical Objectives and Loss Structure

YOLOA is trained end-to-end under a unified objective:

$\mathcal{L}_\text{total} = \mathcal{L}_\text{det} + \mathcal{L}_\text{aff} + \mathcal{L}_\text{adapter}$

with:

  • Detection loss ($\mathcal{L}_\text{det}$): Combines binary cross-entropy (BCE) loss on $\hat c_i$ with IoU and distribution focal loss (DFL) terms on $\hat b_i$.
  • Affordance loss ($\mathcal{L}_\text{aff}$): BCE between $\hat a_i$ and the ground-truth heatmap $a_i$.
  • Adapter loss ($\mathcal{L}_\text{adapter}$): Weighted sum of BCE on class priors, smooth-$\ell_1$ loss on box offsets, and BCE on affordance gates for the adapter-refined predictions.

$\mathcal{L}_\text{adapter} = \lambda_1\,\mathcal{L}_\text{cls-priors} + \lambda_2\,\mathcal{L}_\text{box-offsets} + \lambda_3\,\mathcal{L}_\text{aff-gates}$

This formulation enforces mutual regularization and semantic alignment between detection and affordance predictions.
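
A minimal sketch of the combined objective is given below, assuming the IoU and DFL box terms can be stood in for by a generic box-regression loss and that the $\lambda$ weights are hyperparameters; the paper's exact loss implementation may differ.

```python
import torch.nn.functional as F

def yoloa_loss(hat_c, hat_b, hat_a, bar_c, bar_b, bar_a,
               gt_c, gt_b, gt_a, lambdas=(1.0, 1.0, 1.0)):
    # Detection loss: BCE on class logits plus a box-regression term
    # (IoU + DFL in the paper; smooth-L1 used here as a stand-in).
    l_det = F.binary_cross_entropy_with_logits(hat_c, gt_c) \
          + F.smooth_l1_loss(hat_b, gt_b)

    # Affordance loss: BCE between predicted and ground-truth heatmaps.
    l_aff = F.binary_cross_entropy_with_logits(hat_a, gt_a)

    # Adapter loss: BCE on class priors, smooth-L1 on box offsets,
    # BCE on affordance gates (probabilities in [0, 1]).
    l1, l2, l3 = lambdas
    l_adapter = (l1 * F.binary_cross_entropy_with_logits(bar_c, gt_c)
               + l2 * F.smooth_l1_loss(bar_b, gt_b)
               + l3 * F.binary_cross_entropy(bar_a.clamp(1e-6, 1 - 1e-6), gt_a))

    return l_det + l_aff + l_adapter
```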

4. Data Regimes, Training, and Computational Considerations

YOLOA is benchmarked on two re-annotated datasets:

| Dataset  | Images | Object Classes | Affordance Categories | Annotation      |
|----------|--------|----------------|-----------------------|-----------------|
| ADG-Det  | 2,000  | 50             | 36                    | Dense heatmaps  |
| IIT-Heat | 8,835  | 10             | 9                     | Sparse heatmaps |

ADG-Det is derived from ADG20K by adding manual box annotations across exocentric and egocentric views; IIT-Heat is created by converting IIT-AFF segmentation masks into keypoint-based sparse heatmaps.

Training incorporates SGD with momentum 0.9, weight decay $5\times10^{-4}$, and batch size 64 for 800 epochs (10 warmup epochs). The learning rate follows a cosine schedule between $1\times10^{-8}$ and $2\times10^{-3}$, and training runs on four A100 GPUs.
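
For reference, the reported optimizer and schedule might be set up roughly as follows; the placeholder model and the precise warmup and scheduler arguments are assumptions.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # placeholder for the full YOLOA network
optimizer = torch.optim.SGD(model.parameters(), lr=2e-3,
                            momentum=0.9, weight_decay=5e-4)
# Cosine schedule over 800 epochs, bottoming out near 1e-8; the 10 warmup
# epochs would be handled separately (e.g., by a linear ramp).
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=800, eta_min=1e-8)

for epoch in range(800):
    # ... one pass over the batch-size-64 training loader ...
    scheduler.step()
```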

Detection is primarily evaluated using mean Average Precision (mAP) and Average Recall (AR); affordances are evaluated with Kullback-Leibler divergence (KLD), Similarity Metric (SIM), and Normalized Scanpath Saliency (NSS).
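
The heatmap metrics follow the usual saliency-evaluation formulations; a sketch under that assumption (conventions, such as the direction of the KL divergence, can vary between papers):

```python
import numpy as np

def kld(pred, gt, eps=1e-12):
    """KL divergence between normalized heatmaps, KL(gt || pred)."""
    p = pred / (pred.sum() + eps)
    g = gt / (gt.sum() + eps)
    return float(np.sum(g * np.log(g / (p + eps) + eps)))

def sim(pred, gt, eps=1e-12):
    """Similarity metric (histogram intersection) of normalized heatmaps."""
    p = pred / (pred.sum() + eps)
    g = gt / (gt.sum() + eps)
    return float(np.sum(np.minimum(p, g)))

def nss(pred, keypoint_mask, eps=1e-12):
    """Normalized Scanpath Saliency: mean standardized prediction value
    at binary ground-truth keypoint locations."""
    z = (pred - pred.mean()) / (pred.std() + eps)
    return float(z[keypoint_mask > 0].mean())
```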

5. Empirical Results and Comparative Performance

On ADG-Det and IIT-Heat, YOLOA achieves the following unified results:

ADG-Det:

| Model       | mAP  | AR   | KLD  | SIM  | NSS  | FPS   |
|-------------|------|------|------|------|------|-------|
| YOLOA       | 52.8 | 49.7 | 1.53 | 0.34 | 1.35 | 73.8  |
| YOLOA-light | 51.7 | 47.8 | 1.66 | 0.31 | 1.31 | 470.6 |

IIT-Heat:

| Model       | mAP  | AR   | KLD  | SIM  | NSS  | FPS   |
|-------------|------|------|------|------|------|-------|
| YOLOA       | 73.1 | 74.8 | 2.55 | 0.19 | 1.84 | 89.8  |
| YOLOA-light | 72.6 | 73.6 | 2.68 | 0.17 | 1.53 | 846.2 |

Compared to previous state-of-the-art methods (AffordanceNet, CoTDet, AffordanceLLM, WSMA), YOLOA increases detection mAP by 4.5–13 points while retaining real-time speed (73–89 FPS). YOLOA-light sustains more than 98% of full-model accuracy at 5–11× the frame rate, with up to 846 FPS on IIT-Heat (Ji et al., 3 Dec 2025).

6. Ablation Studies and Component Analysis

YOLOA's dual-branch and adapter design demonstrates significant interaction effects:

  • Without LLM Adapter: mAP drops to 49.6, AR to 43.7, KLD rises to 1.812, indicating the adapter's key contribution.
  • Branch removal: Removing affordance or detection branches yields mAP or KLD degradation, confirming cross-task dependence.
  • Adapter path ablations: Individual adapter heads (CLS, BOX, AFF) improve their respective metrics, enabling tighter affordance masks and more robust detection, with all three combined providing the highest accuracy (e.g., mAP = 52.8, KLD = 1.528).

Optimal contextual interaction is achieved by setting the LLM Adapter's top-$k$ detections to $k=5$ (ADG-Det) and $k=6$ (IIT-Heat), thus balancing semantic context with noise suppression.
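
As an illustration of this hyperparameter, top-$k$ selection of preliminary detections for the adapter prompt could look like the snippet below; the tensor shapes and score source are assumptions.

```python
import torch

scores = torch.rand(100)      # confidence of each preliminary detection
boxes = torch.rand(100, 4)    # corresponding box predictions
k = 5                         # k = 5 for ADG-Det, 6 for IIT-Heat

top_scores, idx = torch.topk(scores, k)
top_boxes = boxes[idx]        # only these detections are prompted and pooled
```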

7. Qualitative Analysis and Feature Alignment

Qualitative examples (Figs. 4 & 6 in (Ji et al., 3 Dec 2025)) illustrate YOLOA's precise affordance localization—for example, "cut" affordance for knives and scissors, and multi-region affordance on complex objects (e.g., "contain" for bowls, "pound" for hammers in cluttered scenes)—outperforming both vision-only and text-conditioned baselines. Feature t-SNE visualization evidences adapter-induced semantic compactness in representation space, supporting improved cross-modal alignment.

A plausible implication is that adapter-mediated cross-modal guidance not only enhances empirical detection performance but also induces more distinguishable and semantically meaningful intermediate features, beneficial for downstream tasks in embodied AI.


YOLOA advances real-time affordance detection by tightly coupling object detection and affordance estimation via LLM-guided semantic refinement, setting new baselines for unified "what–where–how" inference both in accuracy and throughput on established benchmarks (Ji et al., 3 Dec 2025).
