YOLOA: Unified Affordance Detection
- YOLOA is a real-time affordance detection framework that jointly addresses the 'what', 'where', and 'how' challenges using a unified architecture.
- It features dual branches for object detection and pixel-wise affordance learning, enhanced during training by an LLM adapter that refines inter-branch interactions.
- Empirical evaluations on ADG-Det and IIT-Heat benchmarks demonstrate YOLOA’s superior detection accuracy and inference speed compared to previous methods.
YOLO Affordance (YOLOA) is a real-time affordance detection framework designed to jointly resolve the "what–where–how" challenge in embodied AI. Unlike prior art that treats object detection and affordance learning separately, or that focuses solely on encoding affordance-related cues, YOLOA unifies object classification, spatial localization, and pixel-wise affordance estimation within a single architecture, using an LLM adapter to enable rich inter-branch interaction during training. This yields state-of-the-art detection and efficiency, as demonstrated on the re-annotated ADG-Det and IIT-Heat benchmarks, where YOLOA substantially outperforms previous methods in both detection accuracy (mAP) and inference speed (Ji et al., 3 Dec 2025).
1. Unified Model Architecture
YOLOA's architecture extends the one-stage YOLOv11 backbone by decoupling its head into two parallel branches:
- Object Detection Branch ($B_{\text{det}}$): Predicts class logits ($\hat{c}$) and bounding-box offsets ($\hat{b}$) following standard YOLO design.
- Affordance Learning Branch ($B_{\text{aff}}$): Outputs a dense, pixel-wise heatmap ($\hat{A}$) indicating affordance regions for detected objects.
Given an input image $I$, multi-scale features are extracted via a DarkNet backbone. The detection branch produces the preliminary object predictions ($\hat{c}$, $\hat{b}$), while the affordance branch, implemented as a lightweight MLP block over convolved, batch-normalized, dilated features, estimates the affordance heatmap $\hat{A}$.
The affordance branch design (dilated convolutions with batch normalization feeding a lightweight MLP head) prioritizes both computational tractability and spatial resolution.
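The following minimal PyTorch sketch illustrates the dual-branch head described above; the channel widths, activation choice, and exact layer stack are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch (not the authors' code): a dual-branch head on a shared feature map,
# mirroring the detection/affordance split described above.
import torch
import torch.nn as nn

class DualBranchHead(nn.Module):
    def __init__(self, in_ch=256, num_classes=50, num_affordances=36):
        super().__init__()
        # Detection branch: class logits and box offsets (standard YOLO-style 1x1 heads).
        self.cls_head = nn.Conv2d(in_ch, num_classes, kernel_size=1)
        self.box_head = nn.Conv2d(in_ch, 4, kernel_size=1)
        # Affordance branch: dilated conv + BN, then a lightweight MLP (1x1 convs)
        # producing a per-pixel affordance heatmap.
        self.aff_branch = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=2, dilation=2),
            nn.BatchNorm2d(in_ch),
            nn.SiLU(),
            nn.Conv2d(in_ch, in_ch // 2, kernel_size=1),
            nn.SiLU(),
            nn.Conv2d(in_ch // 2, num_affordances, kernel_size=1),
        )

    def forward(self, feats):
        cls_logits = self.cls_head(feats)      # (B, C, H, W)
        box_offsets = self.box_head(feats)     # (B, 4, H, W)
        aff_heatmap = self.aff_branch(feats)   # (B, A, H, W)
        return cls_logits, box_offsets, aff_heatmap

# Example: run the head on a dummy feature map.
feats = torch.randn(1, 256, 80, 80)
cls_logits, box_offsets, aff_heatmap = DualBranchHead()(feats)
```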
2. LLM Adapter Integration
A core innovation of YOLOA is the LLM Adapter mechanism, active only during training, which leverages a LLaMA-3 (8B) model, kept frozen apart from LoRA adapters, to refine the interaction between the detection and affordance branches. The adapter operates as follows:
- Visual Embedding ($E_v$): Constructed via ROIAlign applied to the preliminary detection boxes over the input-image and affordance-mask channels, followed by a linear projection.
- Textual Embedding ($E_t$): Formed from question-form prompts (“What can the [category] at [location] be used for?”) associated with the detections, then encoded via the LLM (see the prompt-construction sketch below).
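A minimal sketch of the prompt construction, assuming a simple detection record with a category name and a box whose center stands in for “[location]”; the exact location encoding used by the authors is not specified here.

```python
# Build question-form prompts from preliminary detections (illustrative only).
def build_prompts(detections):
    prompts = []
    for det in detections:
        x1, y1, x2, y2 = det["box"]
        # Assumption: "[location]" is rendered as the box center in pixel coordinates.
        location = f"({(x1 + x2) / 2:.0f}, {(y1 + y2) / 2:.0f})"
        prompts.append(f"What can the {det['category']} at {location} be used for?")
    return prompts

print(build_prompts([{"category": "knife", "box": (120, 40, 260, 90)}]))
# -> ['What can the knife at (190, 65) be used for?']
```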
These embeddings are concatenated and input to the LLM to obtain token-level features. The final hidden token is then fed through lightweight heads to generate:
- Class priors ($\hat{c}_{\text{LLM}}$)
- Box offsets ($\hat{b}_{\text{LLM}}$)
- Affordance gates ($\hat{g}_{\text{LLM}}$)
These signals are injected back into the preliminary predictions of both branches with learned scaling factors, yielding the final refined outputs $\hat{c}'$, $\hat{b}'$, and $\hat{A}'$.
At inference, the LLM Adapter is disabled (“YOLOA-light”), and both branches execute using only vision features, resulting in significantly increased throughput.
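A hedged sketch of the adapter-side refinement under these definitions: lightweight heads map the LLM's final hidden state to the three signals, which are injected into the preliminary predictions with learned scaling factors. The head shapes, the additive/gating forms, and the tensor layout are assumptions rather than the paper's exact formulation; at inference (YOLOA-light) this module is simply skipped.

```python
# Illustrative adapter heads and injection (assumed forms; training-time only).
import torch
import torch.nn as nn

class AdapterHeads(nn.Module):
    def __init__(self, hidden_dim=4096, num_classes=50):
        super().__init__()
        self.cls_head = nn.Linear(hidden_dim, num_classes)  # class priors
        self.box_head = nn.Linear(hidden_dim, 4)             # box offsets
        self.aff_head = nn.Linear(hidden_dim, 1)              # affordance gate
        # Learned scaling factors for injecting the adapter signals.
        self.alpha = nn.Parameter(torch.zeros(1))
        self.beta = nn.Parameter(torch.zeros(1))
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, llm_hidden, cls_logits, box_offsets, aff_rois):
        # llm_hidden: (N, hidden_dim) final hidden token per detection;
        # cls_logits: (N, C), box_offsets: (N, 4), aff_rois: (N, A, h, w).
        c_llm = self.cls_head(llm_hidden)
        b_llm = self.box_head(llm_hidden)
        g_llm = torch.sigmoid(self.aff_head(llm_hidden))      # (N, 1) gates in [0, 1]
        cls_refined = cls_logits + self.alpha * c_llm
        box_refined = box_offsets + self.beta * b_llm
        aff_refined = aff_rois * (1 + self.gamma * g_llm.view(-1, 1, 1, 1))
        return cls_refined, box_refined, aff_refined
```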
3. Mathematical Objectives and Loss Structure
YOLOA is trained end-to-end under a unified objective
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{det}} + \mathcal{L}_{\text{aff}} + \mathcal{L}_{\text{adapter}},$$
with:
- Detection loss ($\mathcal{L}_{\text{det}}$): Combines binary cross-entropy (BCE) on the class logits $\hat{c}$ with IoU and distribution focal loss (DFL) terms on the box offsets $\hat{b}$.
- Affordance loss ($\mathcal{L}_{\text{aff}}$): BCE between the predicted heatmap $\hat{A}$ and the ground-truth heatmap $A$.
- Adapter loss ($\mathcal{L}_{\text{adapter}}$): Weighted sum of BCE on the class priors, Smooth-$\ell_1$ loss on the box offsets, and BCE on the affordance gates of the adapter-refined predictions.
This formulation enforces mutual regularization and semantic alignment between detection and affordance predictions.
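The sketch below composes the three terms as a plain sum, consistent with the description above; any per-term weighting, the exact IoU variant, and the DFL term are simplified (a plain 1 − IoU surrogate stands in), and the dictionary keys are illustrative.

```python
# Illustrative composition of the unified training objective (not the authors' code).
import torch
import torch.nn.functional as F

def iou_loss(pred_boxes, gt_boxes):
    """Simple 1 - IoU surrogate for (x1, y1, x2, y2) boxes; the paper also uses DFL."""
    lt = torch.maximum(pred_boxes[:, :2], gt_boxes[:, :2])
    rb = torch.minimum(pred_boxes[:, 2:], gt_boxes[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    iou = inter / (area_p + area_g - inter + 1e-7)
    return (1.0 - iou).mean()

def total_loss(preds, targets):
    # Detection: BCE on class logits + IoU-based box regression.
    l_det = F.binary_cross_entropy_with_logits(preds["cls"], targets["cls"]) \
            + iou_loss(preds["box"], targets["box"])
    # Affordance: BCE between predicted and ground-truth heatmaps.
    l_aff = F.binary_cross_entropy_with_logits(preds["aff"], targets["aff"])
    # Adapter: BCE on class priors, Smooth-L1 on box offsets, BCE on affordance gates
    # (gate predictions assumed to be sigmoid probabilities in [0, 1]).
    l_adapter = F.binary_cross_entropy_with_logits(preds["cls_llm"], targets["cls"]) \
                + F.smooth_l1_loss(preds["box_llm"], targets["box"]) \
                + F.binary_cross_entropy(preds["gate_llm"], targets["gate"])
    return l_det + l_aff + l_adapter
```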
4. Data Regimes, Training, and Computational Considerations
YOLOA is benchmarked on two re-annotated datasets:
| Dataset | Images | Obj. Classes | Afford. Cats. | Annotation |
|---|---|---|---|---|
| ADG-Det | 2,000 | 50 | 36 | Dense heatmaps |
| IIT-Heat | 8,835 | 10 | 9 | Sparse heatmaps |
ADG-Det is derived from ADG20K by adding manual box annotations across exocentric and egocentric views; IIT-Heat is created by converting IIT-AFF segmentation masks into keypoint-based sparse heatmaps.
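One plausible way to perform the IIT-Heat conversion described above is to take the centroid of each connected component of an affordance segmentation mask and splat a small Gaussian there; this is an assumed recipe (the component labeling, centroid choice, and Gaussian width are not taken from the paper).

```python
# Assumed mask-to-sparse-heatmap conversion (illustrative, not the authors' pipeline).
import numpy as np
from scipy import ndimage

def mask_to_sparse_heatmap(mask, sigma=8.0):
    """mask: (H, W) binary array -> (H, W) float heatmap with Gaussian keypoints."""
    h, w = mask.shape
    heatmap = np.zeros((h, w), dtype=np.float32)
    labeled, n = ndimage.label(mask)                       # connected components
    ys, xs = np.mgrid[0:h, 0:w]
    for cy, cx in ndimage.center_of_mass(mask, labeled, range(1, n + 1)):
        blob = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, blob)                # keep the strongest response
    return heatmap
```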
Training uses SGD with momentum 0.9 and weight decay, batch size 64, for 800 epochs (including 10 warmup epochs), with a cosine learning-rate decay schedule, on four A100 GPUs.
Detection is primarily evaluated using mean Average Precision (mAP) and Average Recall (AR); affordances are evaluated with Kullback-Leibler divergence (KLD), Similarity Metric (SIM), and Normalized Scanpath Saliency (NSS).
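For reference, the three affordance metrics can be computed as below using their standard saliency-benchmark definitions; normalization details may differ slightly from the paper's evaluation code.

```python
# Standard definitions of KLD, SIM, and NSS over 2D heatmaps.
import numpy as np

EPS = 1e-12

def kld(pred, gt):
    """KL divergence of the prediction from the ground-truth distribution (lower is better)."""
    p = pred / (pred.sum() + EPS)
    g = gt / (gt.sum() + EPS)
    return float(np.sum(g * np.log(g / (p + EPS) + EPS)))

def sim(pred, gt):
    """Histogram-intersection similarity between normalized heatmaps (higher is better)."""
    p = pred / (pred.sum() + EPS)
    g = gt / (gt.sum() + EPS)
    return float(np.minimum(p, g).sum())

def nss(pred, fixation_mask):
    """Normalized Scanpath Saliency at binary ground-truth points (higher is better)."""
    p = (pred - pred.mean()) / (pred.std() + EPS)
    return float(p[fixation_mask > 0].mean())
```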
5. Empirical Results and Comparative Performance
On ADG-Det and IIT-Heat, YOLOA achieves the following unified results:
| Model | Dataset | mAP | AR | KLD | SIM | NSS | FPS |
|---|---|---|---|---|---|---|---|
| YOLOA | ADG-Det | 52.8 | 49.7 | 1.53 | 0.34 | 1.35 | 73.8 |
| YOLOA-light | ADG-Det | 51.7 | 47.8 | 1.66 | 0.31 | 1.31 | 470.6 |
| YOLOA | IIT-Heat | 73.1 | 74.8 | 2.55 | 0.19 | 1.84 | 89.8 |
| YOLOA-light | IIT-Heat | 72.6 | 73.6 | 2.68 | 0.17 | 1.53 | 846.2 |
Compared to previous state-of-the-art methods (AffordanceNet, CoTDet, AffordanceLLM, WSMA), YOLOA improves detection mAP by 4.5–13 points while retaining real-time speed (73.8–89.8 FPS). YOLOA-light retains about 98% of the full model's accuracy at several times the frame rate (roughly 6–9× per the table above), reaching up to 846 FPS on IIT-Heat (Ji et al., 3 Dec 2025).
6. Ablation Studies and Component Analysis
YOLOA's dual-branch and adapter design demonstrates significant interaction effects:
- Without LLM Adapter: mAP drops to 49.6, AR to 43.7, KLD rises to 1.812, indicating the adapter's key contribution.
- Branch removal: Removing either the affordance or the detection branch degrades mAP or KLD, confirming cross-task dependence.
- Adapter path ablations: The individual adapter heads (CLS, BOX, AFF) each improve their respective metrics, enabling tighter affordance masks and more robust detection; combining all three yields the highest accuracy (e.g., mAP = 52.8, KLD = 1.528).
Optimal contextual interaction is achieved by tuning the number of top-$K$ detections passed to the LLM Adapter separately for ADG-Det and IIT-Heat, balancing semantic context against noise suppression (see the selection sketch below).
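A minimal sketch of such top-K gating, assuming detections are ranked by their maximum class confidence; the function name, tensor layout, and the value of K are illustrative.

```python
# Keep only the K most confident preliminary detections for the LLM Adapter.
import torch

def select_topk_detections(cls_logits, boxes, k):
    """cls_logits: (N, C), boxes: (N, 4) -> the top-k rows by max class confidence."""
    scores = cls_logits.sigmoid().max(dim=1).values
    k = min(k, scores.numel())
    idx = scores.topk(k).indices
    return cls_logits[idx], boxes[idx], idx
```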
7. Qualitative Analysis and Feature Alignment
Qualitative examples (Figs. 4 & 6 in (Ji et al., 3 Dec 2025)) illustrate YOLOA's precise affordance localization—for example, "cut" affordance for knives and scissors, and multi-region affordance on complex objects (e.g., "contain" for bowls, "pound" for hammers in cluttered scenes)—outperforming both vision-only and text-conditioned baselines. Feature t-SNE visualization evidences adapter-induced semantic compactness in representation space, supporting improved cross-modal alignment.
A plausible implication is that adapter-mediated cross-modal guidance not only enhances empirical detection performance but also induces more distinguishable and semantically meaningful intermediate features, beneficial for downstream tasks in embodied AI.
YOLOA advances real-time affordance detection by tightly coupling object detection and affordance estimation via LLM-guided semantic refinement, setting new baselines for unified "what–where–how" inference both in accuracy and throughput on established benchmarks (Ji et al., 3 Dec 2025).