Weakly-Supervised Learning of Dense Functional Correspondences

Published 4 Sep 2025 in cs.CV | (2509.03893v1)

Abstract: Establishing dense correspondences across image pairs is essential for tasks such as shape reconstruction and robot manipulation. In the challenging setting of matching across different categories, the function of an object, i.e., the effect that an object can cause on other objects, can guide how correspondences should be established. This is because object parts that enable specific functions often share similarities in shape and appearance. We derive the definition of dense functional correspondence based on this observation and propose a weakly-supervised learning paradigm to tackle the prediction task. The main insight behind our approach is that we can leverage vision-LLMs to pseudo-label multi-view images to obtain functional parts. We then integrate this with dense contrastive learning from pixel correspondences to distill both functional and spatial knowledge into a new model that can establish dense functional correspondence. Further, we curate synthetic and real evaluation datasets as task benchmarks. Our results demonstrate the advantages of our approach over baseline solutions consisting of off-the-shelf self-supervised image representations and grounded vision LLMs.

Abstract PDF Upgrade to Chat

Authors (5)

Summary

The paper introduces a weakly-supervised framework leveraging vision-language models to generate dense functional correspondences across object categories.
It employs a novel combination of functional and spatial contrastive losses, ensuring semantic alignment and structural consistency in correspondences.
Experimental results show superior performance over traditional keypoint methods, enabling accurate pixel-level mapping in both synthetic and real-world datasets.

Weakly-Supervised Learning of Dense Functional Correspondences

Introduction and Motivation

The paper addresses the problem of establishing dense pixel-level correspondences between object instances based on functional similarity, particularly across different object categories. This is a critical capability for applications such as shape reconstruction, robotic manipulation, and imitation learning, where transferring knowledge or actions between objects with similar functions but disparate appearances is required. The central challenge is that functionally similar parts may be visually dissimilar, necessitating both semantic and structural reasoning to identify and align these regions.

Figure 1: Dense functional correspondence requires both semantic and structural understanding to align functionally similar parts across visually distinct objects.

Problem Formulation and Dataset Construction

The authors formalize dense functional correspondence as the mapping between pixels of functionally equivalent parts of two objects, such that the corresponding surface points are spatially close in 3D when the objects are aligned for a given function. This is operationalized by leveraging 3D object assets and defining correspondences via 3D alignment of functional parts, followed by projection to 2D image space.

Figure 2: Ground-truth dense correspondences are derived by aligning functional parts in 3D and projecting to 2D, enabling large-scale evaluation without manual annotation.

To support both training and evaluation, the paper introduces two datasets:

Synthetic Evaluation Dataset: 950 pairs of Objaverse assets, spanning 24 functions, with 1,800+ rendered image pairs and dense ground-truth correspondences.
Real Evaluation Dataset: 190 pairs from the HANDAL dataset, covering 13 functions, with 500+ real image pairs and ground-truth correspondences.

Weakly-Supervised Data Curation via Vision-LLMs

A key contribution is the scalable curation of pseudo-labeled training data using vision-LLMs (VLMs). The pipeline involves:

Using GPT-4 to generate function-specific prompts and object category lists.
Filtering and matching Objaverse assets via Caption3D and Llama3 embeddings.
Employing CogVLM to generate bounding box proposals for functional parts across multi-view renderings, which are aggregated onto 3D point clouds and post-processed to produce pixel-level functional part masks.
Figure 3: Automated pseudo-labeling pipeline leverages VLMs to annotate functional parts at scale, minimizing human supervision.

This approach enables the generation of a large, diverse dataset of functional part segmentations, facilitating weakly-supervised learning without expensive manual annotation.

Model Architecture and Training Objectives

The proposed model builds on frozen DINOv2 and CLIP backbones, extracting local image features and conditioning them on function text embeddings. A 3-layer MLP processes the concatenated features to produce function-conditioned pixel embeddings. Optionally, an additional layer predicts functional part masks.

Figure 4: Local functional feature extraction via an MLP conditioned on both spatial and functional information.

Training employs a combination of contrastive objectives:

Functional Part Contrastive Loss: Encourages embeddings of functionally relevant regions across objects to be similar, while separating irrelevant regions.
Spatial Contrastive Loss: Enforces structural consistency by matching corresponding pixels across multi-view images of the same object, preventing mode collapse.
Mask Loss (optional): Supervises functional part mask prediction via binary cross-entropy.
Figure 5: Joint functional and spatial contrastive objectives distill semantic and structural knowledge into the learned feature space.

Implementation uses DINOv2-B as backbone, 224×224 images, 3-layer MLPs with 1024 hidden units, Adam optimizer, and batch sizes of 50 image pairs. Hyperparameters for loss weights are tuned via sensitivity analysis.

Experimental Results

Quantitative evaluation demonstrates that the proposed model, trained solely on synthetic data, generalizes effectively to real images. The model consistently outperforms baselines including self-supervised features (DINOv2, Stable Diffusion), VLM-based grounding (CogVLM, ManipVQA), and their combinations. Notably, the model achieves superior performance in both label transfer and correspondence discovery tasks, especially for cross-category pairs, where self-supervised features struggle.

Figure 6: The proposed method retrieves functionally relevant correspondences with higher precision than DINOv2, particularly in cross-category scenarios.

Ablation studies reveal that:

Functional loss alone is insufficient; spatial loss is necessary to avoid feature collapse.
Mask loss provides marginal gains, particularly in unconstrained real-world settings.
Larger ViT backbones and reduced stride sizes improve performance, albeit with increased computational cost.

Qualitative results further illustrate the model's ability to focus on functionally relevant regions and maintain fine-grained structural alignment, even between objects with significant visual differences.

Comparison with Prior Work

The approach contrasts with keypoint-based methods such as FunKPoint [lai2021functional], which rely on sparse, manually defined keypoints and are limited in capturing nuanced part-level correspondences across categories. The dense formulation here enables higher precision and scalability, with minimal human input.

Figure 7: Visual comparison with FunKPoint [lai2021functional] highlights the increased granularity and precision of dense correspondences.

Limitations and Future Directions

The method faces challenges in cases of ambiguity, such as objects with multiple functionally relevant parts or radially symmetric regions. Addressing these may require probabilistic modeling or additional conditioning. The reliance on segmentation assumes object masks are available, which may not hold in all real-world scenarios.

Potential future directions include:

Extending to multi-modal or probabilistic correspondence prediction.
Integrating with robotic manipulation pipelines for demonstration transfer.
Scaling to larger, more diverse datasets via improved pseudo-labeling.

Conclusion

This work introduces a principled, weakly-supervised framework for learning dense functional correspondences across object categories. By distilling semantic knowledge from VLMs and structural priors from self-supervised features, the model achieves state-of-the-art performance in both synthetic and real-world benchmarks. The approach is highly scalable and generalizes well, providing a foundation for advanced object understanding and manipulation in vision and robotics.

Markdown Report Issue