Emma
Summary:
Researchers have developed a new method for training language and vision models, called LERF, that uses 3D representations of scenes to improve performance. The method uses 3D CLIP embeddings, multi-scale supervision, DINO regularization, and natural language interaction to allow models to better understand the 3D world and perform tasks such as cleaning up a coffee spill in a
kitchen.