Instance-specific object search capability of CLIP-Fields

Determine whether CLIP-Fields, which constructs a 3D semantic map by aligning point-cloud features with CLIP image encoder features, can accurately locate a specific object instance in a 3D environment when given a query image. The Instance-Specific Image Goal Navigation task requires this capability, which goes beyond coarse localization of the image capture location.

Background

CLIP-Fields learns a 3D semantic map by matching feature vectors of points in a 3D point cloud to features from CLIP’s image encoder, enabling text- and image-based retrieval over the map. Prior work suggests CLIP-Fields can coarsely identify where a query image was taken, but its ability to perform instance-specific object search—central to Instance-Specific Image Goal Navigation (InstanceImageNav)—has not been established.
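The retrieval step described above can be illustrated with a minimal sketch. It assumes each map point already carries a feature vector in the same space as the query embedding and ranks points by cosine similarity. The function `retrieve_location`, the toy coordinates, and the hand-made 3-dimensional features are hypothetical stand-ins for illustration, not the CLIP-Fields API or real CLIP embeddings.

```python
import numpy as np


def retrieve_location(point_xyz, point_feats, query_feat, top_k=1):
    """Return coordinates and scores of the top_k map points closest to the query.

    Hypothetical helper: assumes per-point features and the query embedding
    live in the same feature space.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    feats = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    query = query_feat / np.linalg.norm(query_feat)
    scores = feats @ query
    # Sort by descending similarity and keep the best top_k points.
    order = np.argsort(-scores)[:top_k]
    return point_xyz[order], scores[order]


# Toy map: four points with hand-made features (stand-ins for learned embeddings).
point_xyz = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.5],
    [2.0, 1.0, 0.5],
    [3.0, 1.0, 1.0],
])
point_feats = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],   # first instance of some category
    [0.0, 0.9, 0.1],   # second, similar-looking instance of that category
    [0.0, 0.0, 1.0],
])
query_feat = np.array([0.0, 1.0, 0.05])  # assumed embedding of the query image

best_xyz, best_scores = retrieve_location(point_xyz, point_feats, query_feat, top_k=2)
```

In this toy setup the two best-scoring points belong to different instances with nearly identical features, so their similarity scores differ only slightly. That is the kind of ambiguity the open question concerns: ranking by feature similarity can find the right category yet leave instances poorly separated.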

The paper argues that methods relying on CLIP’s multimodal pretraining are strong for category-level tasks but may be weak for fine-grained, instance-level retrieval. Whether CLIP-Fields can locate specific object instances therefore remains an open question for map-based object retrieval.

References

In addition, Shafiullah et al. suggests that it is possible to coarsely identify the location where a given image query was taken using CLIP-Fields. However, whether it is possible to search for the location of a specific object as required by InstanceImageNav has not been verified.

— Object Instance Retrieval in Assistive Robotics: Leveraging Fine-Tuned SimSiam with Multi-View Images Based on 3D Semantic Map (2404.09647 - Sakaguchi et al., 2024) in Section 3.2 Object Retrieval (Related Work)
