- The paper introduces a self-supervised framework that learns dense visual object descriptors from RGBD data to aid robotic manipulation tasks.
- It employs a fully convolutional network trained with pixelwise contrastive loss and hard-negative scaling to ensure consistent matching across viewpoints and deformations.
- The method leverages 3D reconstructions, object masking, and synthetic training scenes to achieve robust performance on both rigid and non-rigid objects.
This paper introduces Dense Object Nets (DON), a method for learning dense visual object descriptors directly from RGBD video data using self-supervision, specifically targeting robotic manipulation tasks (arXiv:1806.08756). The goal is to create object representations that are task-agnostic, applicable to both rigid and non-rigid objects, leverage 3D information, and require no manual labeling.
The core idea is to train a deep fully convolutional neural network (FCN) that maps an input RGB image (I) to a dense descriptor image (f(I)), where each pixel u in the input image has a corresponding D-dimensional descriptor vector f(I)(u). The network is trained such that the L2 distance between descriptors of pixels corresponding to the same physical point on an object's surface is minimized, even across different camera viewpoints or object deformations. Conversely, the distance between descriptors of pixels corresponding to different physical points is maximized.
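To make the mapping concrete, here is a minimal PyTorch sketch of how dense descriptors are indexed and compared; the 1x1 convolution is only a stand-in for the trained FCN f (the ResNet-34-based architecture actually used is summarized under Implementation Details), and the pixel coordinates are arbitrary examples.

```python
import torch
import torch.nn as nn

# Stand-in for the trained FCN f(.): any network mapping (B, 3, H, W) -> (B, D, H, W).
D = 3                                        # descriptor dimension
f = nn.Conv2d(3, D, kernel_size=1)           # placeholder for the real descriptor network

I_a = torch.rand(1, 3, 480, 640)             # RGB image I_a
I_b = torch.rand(1, 3, 480, 640)             # RGB image I_b (another viewpoint / deformation)

f_Ia, f_Ib = f(I_a), f(I_b)                  # dense descriptor images f(I_a), f(I_b)

u_a = (120, 300)                             # pixel (row, col) in I_a
u_b = (140, 310)                             # candidate pixel in I_b

# L2 distance between the two pixels' descriptors: driven toward zero for matches,
# pushed beyond a margin M for non-matches during training.
dist = torch.norm(f_Ia[0, :, u_a[0], u_a[1]] - f_Ib[0, :, u_b[0], u_b[1]], p=2)
```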
Self-Supervised Training Framework
The training process leverages pairs of RGB images (Ia,Ib) extracted from an RGBD video sequence of a static scene. A dense 3D reconstruction of the scene (e.g., using TSDF fusion with camera poses from robot kinematics or SLAM) provides the geometric ground truth for correspondences.
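As one possible implementation of this reconstruction step (the paper does not prescribe a specific library), the sketch below fuses posed RGBD frames into a TSDF volume with Open3D; the `frames` iterable, intrinsics, and voxel parameters are placeholders.

```python
import numpy as np
import open3d as o3d

# TSDF volume for fusing a static scene from posed RGBD frames.
volume = o3d.pipelines.integration.ScalableTSDFVolume(
    voxel_length=0.005,                       # 5 mm voxels (illustrative)
    sdf_trunc=0.02,
    color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8,
)

intrinsic = o3d.camera.PinholeCameraIntrinsic(640, 480, 540.0, 540.0, 320.0, 240.0)

# Placeholder: fill with (HxWx3 uint8 color, HxW uint16 depth, 4x4 world-from-camera pose),
# e.g. with camera poses from robot forward kinematics as in the paper.
frames = []

for color, depth, T_world_cam in frames:
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        o3d.geometry.Image(color), o3d.geometry.Image(depth),
        depth_scale=1000.0, convert_rgb_to_intensity=False)
    # Open3D expects the camera-from-world extrinsic.
    volume.integrate(rgbd, intrinsic, np.linalg.inv(T_world_cam))

mesh = volume.extract_triangle_mesh()         # dense reconstruction used for correspondences
```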
- Generating Matches and Non-Matches: For a pixel ua in image Ia, its corresponding 3D point is found by raycasting into the reconstruction. This 3D point is then projected into image Ib to find the matching pixel ub. Occlusion and field-of-view checks are performed. This pair (ua,ub) constitutes a "match". "Non-matches" are pairs of pixels (ua,ub′) where ua and ub′ correspond to different 3D points (a projection sketch follows this list).
- Pixelwise Contrastive Loss: A Siamese network architecture is used, applying the same FCN f(⋅) to both Ia and Ib. The contrastive loss function aims to:
- Minimize $D(I_a, u_a, I_b, u_b)^2 = \|f(I_a)(u_a) - f(I_b)(u_b)\|_2^2$ for matches.
- Maximize $D(I_a, u_a, I_b, u'_b)$ for non-matches, specifically penalizing pairs whose distance is less than a margin $M$: $\max(0, M - D(I_a, u_a, I_b, u'_b))^2$.
The total loss combines the average loss over many matches and non-matches sampled from the image pair.
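The geometric half of match generation can be sketched as a pinhole reprojection; `project_to_image_b`, the intrinsics `K`, and the example 3D point are illustrative, and the raycast that produces the 3D point plus the depth-based occlusion test are elided.

```python
import numpy as np

def project_to_image_b(p_world, K, T_world_cam_b, image_shape):
    """Project a 3D world point (e.g., the raycast hit for pixel u_a) into image b;
    returns the matching pixel u_b = (row, col), or None if it fails the checks."""
    # Transform the world point into camera b's frame.
    T_cam_b_world = np.linalg.inv(T_world_cam_b)
    p_cam = T_cam_b_world[:3, :3] @ p_world + T_cam_b_world[:3, 3]
    if p_cam[2] <= 0:                          # behind the camera
        return None
    # Pinhole projection with intrinsics K -> homogeneous pixel coordinates.
    uv = K @ (p_cam / p_cam[2])
    u, v = uv[0], uv[1]
    h, w = image_shape
    if not (0.0 <= u < w and 0.0 <= v < h):    # field-of-view check
        return None
    # An occlusion check (compare p_cam[2] against the reconstruction's rendered
    # depth at (u, v)) would go here; it is omitted in this sketch.
    return int(round(v)), int(round(u))

# Hypothetical intrinsics and identity pose; p_world comes from raycasting u_a.
K = np.array([[540.0, 0.0, 320.0],
              [0.0, 540.0, 240.0],
              [0.0, 0.0, 1.0]])
u_b = project_to_image_b(np.array([0.1, 0.0, 0.6]), K, np.eye(4), (480, 640))
```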
Key Implementation Techniques for Object-Centric Descriptors
While the basic self-supervised framework applies to static scenes (or to dynamic scenes if a dynamic reconstruction is available), the paper introduces techniques for learning consistent descriptors of potentially non-rigid objects using only static-scene reconstructions, which are often easier to obtain reliably.
- Object Masking via 3D Change Detection: To focus the network on the object(s) of interest rather than the background, an object mask is automatically generated. This is done by comparing the 3D reconstruction of the scene containing the object(s) with a reconstruction of the empty background scene (e.g., just the table). Points present in the object scene but not the background scene are identified as belonging to the object. Projecting this 3D object geometry back into the 2D images yields pixel-wise object masks. Matches are sampled only from object pixels, while non-matches can be sampled from the entire image. This significantly improves performance and enables other techniques.
- Background Domain Randomization: Using the object masks, the background pixels in training images are replaced with random textures or colors. This forces the network to rely solely on the object's appearance for generating descriptors, improving robustness and cross-scene generalization, especially for low-texture objects or smaller datasets.
- Hard-Negative Scaling: Instead of normalizing the non-match loss by the total number of sampled non-matches (Nnon-matches), the paper proposes normalizing by the number of hard non-matches (Nhard-negatives), i.e., those non-matches whose descriptor distance is currently less than the margin M. This adaptive scaling prevents the loss from being dominated by easy negatives early in training and focuses learning on distinguishing difficult cases (a code sketch of the scaled loss follows this list).
$N_{\text{hard-negatives}} = \sum_{N_{\text{non-matches}}} \mathbb{1}\left(M - D(I_a, u_a, I_b, u'_b) > 0\right)$
$L_{\text{non-matches}} = \frac{1}{N_{\text{hard-negatives}}} \sum_{N_{\text{non-matches}}} \max\left(0, M - D(I_a, u_a, I_b, u'_b)\right)^2$
- Data Diversification and Augmentation: Diverse training data is collected by using a robot arm to capture objects from various viewpoints, orientations, distances, and under different lighting conditions. Synthetic 180-degree image rotations are also applied during training.
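A PyTorch sketch of the pixelwise contrastive loss with the hard-negative scaling described above; the tensor layout and the sampling interface are assumptions of this sketch rather than the paper's exact implementation.

```python
import torch

def pixelwise_contrastive_loss(desc_a, desc_b, matches_a, matches_b,
                               nonmatches_a, nonmatches_b, margin=0.5):
    """desc_a, desc_b: (D, H, W) descriptor images f(I_a), f(I_b).
    matches_* / nonmatches_*: (N, 2) long tensors of (row, col) pixel coordinates."""
    def gather(desc, uv):                       # descriptors at the given pixels -> (N, D)
        return desc[:, uv[:, 0], uv[:, 1]].t()

    # Match term: pull descriptors of corresponding pixels together.
    d_match = torch.norm(gather(desc_a, matches_a) - gather(desc_b, matches_b), dim=1)
    match_loss = (d_match ** 2).mean()

    # Non-match term: hinge at margin M, normalized by the number of *hard*
    # negatives (non-matches currently closer than the margin), per the paper.
    d_non = torch.norm(gather(desc_a, nonmatches_a) - gather(desc_b, nonmatches_b), dim=1)
    hinge = torch.clamp(margin - d_non, min=0.0) ** 2
    n_hard = (hinge > 0).sum().clamp(min=1)     # guard against division by zero
    nonmatch_loss = hinge.sum() / n_hard.float()

    return match_loss + nonmatch_loss
```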
Multi-Object Descriptors and Class Generalization
The framework is extended to handle multiple objects and object classes:
- Multi-Object Distinctness: To ensure descriptors for different objects occupy separate regions in the descriptor space, several strategies are used:
- Cross-Object Loss: When training on multiple distinct objects, non-match loss is explicitly applied between pixels sampled from images of different objects. This requires object masks to know which pixels belong to which object.
- Direct Training on Multi-Object Scenes: The system can train directly on cluttered scenes containing multiple objects. The 3D geometry still provides valid matches/non-matches within and across objects without needing explicit instance segmentation during this phase (though masks are needed for cross-object loss).
- Synthetic Multi-Object Scenes: New training scenes are created by synthetically layering masked object images from single-object captures. Correspondences that become occluded during layering are pruned. This combinatorially increases the variety of multi-object configurations seen during training (see the layering sketch after this list).
- Selective Class Generalization vs. Instance Specificity:
- Class Generalization: When training a single DON model on multiple instances of the same object class (e.g., different shoes, different mugs) using only within-scene and within-instance matches/non-matches (training mode "consistent"), the learned descriptors surprisingly generalize across instances. Corresponding semantic points (e.g., the heel of different shoes) map to similar descriptor values.
- Instance Specificity: If the goal is to distinguish between instances of the same class, the multi-object training techniques (cross-object loss applied between instances, training mode "specific") can be used. This forces the descriptors for similar points on different instances (e.g., the handle of mug A vs. mug B) to be distinct.
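A sketch of the synthetic multi-object layering described in the list above: a masked single-object capture is composited over another, and correspondences of the lower object that become occluded are pruned. The `layer_objects` helper and array conventions are illustrative.

```python
import numpy as np

def layer_objects(rgb_top, mask_top, rgb_bottom, mask_bottom, matches_bottom):
    """Composite the masked "top" object onto the "bottom" single-object image.
    rgb_*: (H, W, 3) uint8 images; mask_*: (H, W) bool object masks;
    matches_bottom: (N, 2) (row, col) match pixels belonging to the bottom object."""
    # Layer the top object's pixels over the bottom image.
    composite = rgb_bottom.copy()
    composite[mask_top] = rgb_top[mask_top]

    # The object mask of the synthetic scene is the union of both object masks.
    composite_mask = mask_top | mask_bottom

    # Prune bottom-object correspondences that are now hidden behind the top object.
    rows, cols = matches_bottom[:, 0], matches_bottom[:, 1]
    still_visible = ~mask_top[rows, cols]
    return composite, composite_mask, matches_bottom[still_visible]
```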
Robotic Manipulation Application: Grasping Specific Points
The primary application demonstrated is grasping a specific point on an object, identified by a user clicking a single pixel ua∗ in a reference image Ia.
1. User provides (Ia,ua∗).
2. Robot observes the object in a new scene, capturing image Ib and corresponding depth data.
3. The system computes the dense descriptor map f(Ib).
4. It finds the pixel $\hat{u}_b$ in Ib whose descriptor is closest to the reference descriptor $f(I_a)(u_a^*)$:
$\hat{u}_b = \arg\min_{u_b \in I_b} \|f(I_a)(u_a^*) - f(I_b)(u_b)\|_2$
5. A threshold on the minimum descriptor distance is used to determine if a valid match exists.
6. If a match $\hat{u}_b$ is found, its corresponding 3D location is retrieved using the depth map associated with Ib (see the sketch after this list).
7. This 3D location becomes the target for a grasp planner (the paper uses a simple geometric point cloud-based planner centered on the target).
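A minimal sketch of steps 4-6: a dense nearest-descriptor lookup, a match-quality threshold, and deprojection through the depth map. The `find_grasp_target` helper, the threshold value, and the intrinsics handling are illustrative assumptions, not the paper's code.

```python
import numpy as np

def find_grasp_target(desc_ref, u_star, desc_b, depth_b, K, match_threshold=0.25):
    """desc_ref, desc_b: (D, H, W) descriptor maps for I_a and I_b;
    u_star: user-clicked (row, col) in I_a; depth_b: (H, W) depth in metres;
    K: 3x3 pinhole intrinsics for the camera that captured I_b."""
    d_ref = desc_ref[:, u_star[0], u_star[1]]                 # f(I_a)(u_a*)

    # L2 distance from the reference descriptor to every pixel of f(I_b).
    dist = np.linalg.norm(desc_b - d_ref[:, None, None], axis=0)
    row, col = np.unravel_index(np.argmin(dist), dist.shape)

    # Threshold on the best distance decides whether a valid match exists.
    if dist[row, col] > match_threshold:
        return None

    # Deproject the matched pixel to 3D using the depth map associated with I_b.
    z = depth_b[row, col]
    x = (col - K[0, 2]) * z / K[0, 0]
    y = (row - K[1, 2]) * z / K[1, 1]
    return np.array([x, y, z])        # grasp target in I_b's camera frame
```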
- Demonstrated Capabilities:
- Grasping specific points (e.g., tail, ear) on a deformable object (caterpillar toy) across different configurations.
- Grasping semantically corresponding points (e.g., heel) across different instances of shoes, using class-general descriptors. This works even for unseen shoe instances.
- Grasping a specific point on a target shoe instance within a cluttered pile of other shoes, using instance-specific descriptors trained with cross-object/instance loss.
Implementation Details
- Hardware: Kuka IIWA 7-DOF arm, Schunk WSG 50 gripper, Primesense Carmine 1.09 RGBD sensor.
- Data Collection: Automated scanning patterns capture ~2100 RGBD frames per scene (~70 secs), downsampled to ~315 frames based on camera pose difference. Robot kinematics used for camera poses. TSDF fusion for 3D reconstruction. Object rearrangement between scenes can also be automated.
- Network: ResNet-34 backbone (ImageNet pretrained), modified for stride-8 output, followed by bilinear upsampling to full resolution (640x480). Descriptor dimension D is typically small (e.g., 3-16); a network sketch follows this list.
- Training: Adam optimizer, ~3500 steps, takes ~13 minutes on a single GPU (e.g., Nvidia 1080 Ti). Total time including data collection for a new object is ~20 minutes. Training involves sampling ~1 million matches/non-matches per step from image pairs chosen according to specified probabilities (e.g., within-scene, across-scene, cross-object, synthetic).
- Performance: Quantitative evaluation using human-annotated keypoints shows significant improvement in correspondence accuracy (measured by pixel distance error and rank of true match) using the proposed techniques (masking, hard-negative scaling) compared to baseline methods.
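A sketch of a DON-style network along the lines of the Network bullet above: an ImageNet-pretrained ResNet-34 truncated where its feature maps reach stride 8, a 1x1 projection to D descriptor dimensions, and bilinear upsampling back to 640x480. The paper's exact stride-8 FCN head differs in detail, so treat this as an approximation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class DenseObjectNet(nn.Module):
    def __init__(self, descriptor_dim=3):
        super().__init__()
        resnet = torchvision.models.resnet34(weights="IMAGENET1K_V1")
        # conv1 .. layer2 of ResNet-34 downsample the input by a factor of 8.
        self.backbone = nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2,
        )
        self.head = nn.Conv2d(128, descriptor_dim, kernel_size=1)

    def forward(self, image):                      # (B, 3, 480, 640) RGB
        feat = self.head(self.backbone(image))     # (B, D, 60, 80) at stride 8
        # Bilinear upsampling back to the input resolution.
        return F.interpolate(feat, size=image.shape[-2:],
                             mode="bilinear", align_corners=False)

model = DenseObjectNet(descriptor_dim=3)
descriptors = model(torch.rand(1, 3, 480, 640))    # (1, 3, 480, 640) descriptor image
```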
In summary, Dense Object Nets provide a practical, self-supervised method for learning rich, dense visual representations of objects suitable for various robotic manipulation tasks that require identifying specific points or regions on objects, even under deformation or across class instances. The key innovations lie in techniques like automatic masking, background randomization, hard-negative scaling, and cross-object loss, which enable robust learning from readily available RGBD data.