- The paper introduces TeGA, a method for text-guided synthetic 3D data augmentation using text-to-3D generative models to improve zero-shot 3D classification performance by addressing data scarcity.
- TeGA includes a consistency filtering strategy that uses multimodal models (BLIP, GPT-4, CLIP) to remove noisy generated samples, ensuring geometric and semantic alignment with text prompts.
- Experimental results show that augmenting datasets with TeGA-generated data significantly boosts zero-shot 3D classification accuracy on benchmarks like ModelNet40, ScanObjectNN, and Objaverse-LVIS.
The paper introduces Text-guided Geometric Augmentation (TeGA), a synthetic 3D dataset expansion method designed to enhance zero-shot 3D classification by leveraging generative text-to-3D models. The primary goal is to address the challenge of limited 3D data availability, which hinders the performance of zero-shot recognition models in 3D vision compared to their 2D counterparts. TeGA uses text prompts to guide the generation of synthetic 3D data, which is then used to augment existing 3D datasets. A consistency filtering strategy is also introduced to remove noisy or misaligned samples, ensuring that the generated data aligns semantically and geometrically with the input text.
The paper details the methodology, related work, experimental setup, and results, offering a comprehensive analysis of TeGA's effectiveness. The key components and findings are:
Methodology:
- TeGA leverages text-to-3D generative models to create synthetic 3D data from text prompts.
- It uses Point-E to generate point clouds from text. The point clouds are converted to meshes using the Ball Pivoting Algorithm, and images are rendered from multiple viewpoints (see the generation sketch after this list).
- A consistency filtering strategy is applied to remove noisy samples whose semantics or geometry do not match the text. BLIP generates captions from the rendered images, GPT-4 summarizes these captions into a unified text description, and this summary is compared against the original text prompt using both word-level matching and concept-level matching (see the filtering sketch after this list).
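To make the generation step concrete, here is a minimal sketch using the open-source point-e and Open3D libraries. The prompt, checkpoint names, and ball-pivoting radii are illustrative assumptions, not the paper's exact configuration; the guidance scale of 3.0 is the value the paper reports as optimal.

```python
import torch
import open3d as o3d
from point_e.diffusion.configs import DIFFUSION_CONFIGS, diffusion_from_config
from point_e.diffusion.sampler import PointCloudSampler
from point_e.models.configs import MODEL_CONFIGS, model_from_config
from point_e.models.download import load_checkpoint

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Text-conditioned base model plus upsampler, as in the point-e examples.
base = model_from_config(MODEL_CONFIGS["base40M-textvec"], device).eval()
base.load_state_dict(load_checkpoint("base40M-textvec", device))
up = model_from_config(MODEL_CONFIGS["upsample"], device).eval()
up.load_state_dict(load_checkpoint("upsample", device))

sampler = PointCloudSampler(
    device=device,
    models=[base, up],
    diffusions=[diffusion_from_config(DIFFUSION_CONFIGS["base40M-textvec"]),
                diffusion_from_config(DIFFUSION_CONFIGS["upsample"])],
    num_points=[1024, 4096 - 1024],
    aux_channels=["R", "G", "B"],
    guidance_scale=[3.0, 0.0],  # 3.0 is the paper's reported best guidance scale
    model_kwargs_key_filter=("texts", ""),  # the upsampler is unconditioned
)

# Sample a point cloud from a text prompt (prompt is illustrative).
samples = None
for x in sampler.sample_batch_progressive(batch_size=1,
                                          model_kwargs=dict(texts=["an office chair"])):
    samples = x
pc = sampler.output_to_point_clouds(samples)[0]

# Convert the point cloud to a mesh with the Ball Pivoting Algorithm.
pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(pc.coords))
pcd.estimate_normals()
radii = o3d.utility.DoubleVector([0.02, 0.04, 0.08])  # radii are assumptions
mesh = o3d.geometry.TriangleMesh.create_from_point_cloud_ball_pivoting(pcd, radii)
# Multi-view images can then be rendered from the mesh, e.g. with
# o3d.visualization.rendering.OffscreenRenderer, for the filtering stage below.
```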
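The filtering stage can likewise be sketched with off-the-shelf components: BLIP for captioning (via Hugging Face transformers), the OpenAI API for summarization, and CLIP's text encoder for concept-level similarity. The checkpoint names, summarization prompt, similarity threshold, and the rule that both checks must pass are assumptions for illustration.

```python
import torch
from PIL import Image
from openai import OpenAI
from transformers import (BlipProcessor, BlipForConditionalGeneration,
                          CLIPModel, CLIPTokenizer)

blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
client = OpenAI()

def caption_views(image_paths):
    """BLIP caption for each rendered viewpoint."""
    captions = []
    for path in image_paths:
        inputs = blip_proc(images=Image.open(path), return_tensors="pt")
        out = blip.generate(**inputs, max_new_tokens=30)
        captions.append(blip_proc.decode(out[0], skip_special_tokens=True))
    return captions

def summarize(captions):
    """GPT-4 merges per-view captions into one description (prompt wording is an assumption)."""
    msg = ("Summarize these captions of one 3D object, taken from different "
           "viewpoints, into a single short description:\n" + "\n".join(captions))
    resp = client.chat.completions.create(model="gpt-4",
                                          messages=[{"role": "user", "content": msg}])
    return resp.choices[0].message.content

def word_match(summary, prompt):
    """Word-level check: the prompt's content words appear in the summary."""
    stop = {"a", "an", "the", "of"}
    return all(w in summary.lower() for w in prompt.lower().split() if w not in stop)

@torch.no_grad()
def concept_match(summary, prompt, thresh=0.8):
    """Concept-level check: cosine similarity of CLIP text embeddings."""
    toks = clip_tok([summary, prompt], padding=True, return_tensors="pt")
    feats = clip.get_text_features(**toks)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return (feats[0] @ feats[1]).item() >= thresh

def keep_sample(image_paths, prompt):
    summary = summarize(caption_views(image_paths))
    return word_match(summary, prompt) and concept_match(summary, prompt)
```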
Language-Image-3D Contrastive Learning:
- The paper adopts a language-image-3D contrastive learning approach to align 3D embeddings with the feature spaces of images and text, leveraging CLIP's knowledge as a shared embedding space.
- The contrastive objective maximizes the similarity of matched cross-modal pairs relative to mismatched pairs within the shared embedding space (a PyTorch sketch follows the symbol definitions below):
$$\mathcal{L}_{\mathrm{All}} = -\frac{1}{2N}\sum_{i=1}^{N}\sum_{(A,B)\in S}\left(\log\frac{\exp(h_i^A \cdot h_i^B/\tau)}{\sum_{j}\exp(h_i^A \cdot h_j^B/\tau)} + \log\frac{\exp(h_i^B \cdot h_i^A/\tau)}{\sum_{j}\exp(h_i^B \cdot h_j^A/\tau)}\right)$$
where:
- $N$ is the number of samples.
- $A$ and $B$ denote modalities (image $I$, text $T$, or point cloud $P$).
- $S=\{(I,T),(P,I),(P,T)\}$ is the set of modality pairs: image-text, point cloud-image, and point cloud-text.
- $h_i^A$ and $h_i^B$ are the normalized features of sample $i$ in modalities $A$ and $B$, respectively.
- $\tau$ is a learnable temperature parameter.
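Because each term of $\mathcal{L}_{\mathrm{All}}$ is a symmetric InfoNCE loss over one modality pair, it maps directly onto cross-entropy over a similarity matrix. A minimal PyTorch sketch follows; the tensor names and the log-temperature parameterization are assumptions.

```python
import torch
import torch.nn.functional as F

def pair_loss(a, b, tau):
    """Symmetric InfoNCE between two batches of L2-normalized features (N, D)."""
    logits = a @ b.t() / tau                             # (N, N) pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)   # matched pairs on the diagonal
    # cross_entropy averages over N and supplies the negative log, so this
    # is the bracketed term of L_All for one (A, B) pair.
    return F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)

def l_all(h_img, h_txt, h_pc, log_tau):
    """L_All over the pairs S = {(I, T), (P, I), (P, T)}."""
    tau = log_tau.exp()  # learnable temperature, parameterized in log space
    h = {"I": F.normalize(h_img, dim=-1),
         "T": F.normalize(h_txt, dim=-1),
         "P": F.normalize(h_pc, dim=-1)}
    pairs = [("I", "T"), ("P", "I"), ("P", "T")]
    loss = sum(pair_loss(h[a], h[b], tau) for a, b in pairs)
    return loss / 2.0  # matches the 1/(2N) factor (the 1/N comes from cross_entropy)
```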
Experimental Setup:
- The experiments were conducted using ShapeNet as the primary training dataset, augmented with synthetic data generated by TeGA.
- ModelNet40, ScanObjectNN, and Objaverse-LVIS were used as evaluation benchmarks for zero-shot 3D classification (a minimal classification sketch follows this list).
- MixCon3D was trained with the augmented dataset and evaluated on the benchmark datasets.
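At evaluation time, zero-shot classification reduces to a nearest-neighbor search between the point-cloud embedding and CLIP text embeddings of the candidate class names. Below is a sketch under assumed encoder interfaces (`point_encoder`, `clip_text_encoder`) and an assumed prompt template.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(point_encoder, clip_text_encoder, pc, class_names):
    """Pick the class whose text embedding is closest to the point-cloud
    embedding in the shared CLIP space. Encoder interfaces are illustrative."""
    prompts = [f"a 3D model of a {c}" for c in class_names]  # template is an assumption
    txt = F.normalize(clip_text_encoder(prompts), dim=-1)    # (C, D)
    feat = F.normalize(point_encoder(pc), dim=-1)            # (1, D)
    return class_names[(feat @ txt.t()).argmax(dim=-1).item()]
```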
Results:
The paper includes several key experimental results:
- TeGA improves zero-shot performance with gains of 3.0% on Objaverse-LVIS, 4.6% on ScanObjectNN, and 8.7% on ModelNet40.
- Consistency filtering improves performance compared to using unfiltered data.
- The optimal guidance scale of Point-E is 3.0, balancing geometric diversity and alignment with text prompts.
Ablation Studies:
The paper analyzes the impact of different factors on TeGA's performance:
- Mixing Ratio of Synthetic 3D Data: To test whether synthetic data can substitute for real data, portions of ShapeNet were replaced with TeGA-generated samples while keeping the total sample count constant (see the mixing sketch after this list). The best performance was observed when 25% of the ShapeNet data was replaced; beyond that point performance deteriorates, and training fails entirely when only synthetic data is used.
- Scalability of Synthetic 3D Data: To assess how scaling the amount of synthetic data added to ShapeNet affects MixCon3D, Point-E-generated data was scaled to 0.1x, 1x, and 2x the size of the original ShapeNet data. Accuracy on Objaverse-LVIS and ModelNet40 improved as more synthetic data was added, while ScanObjectNN declined, which the paper attributes to its sensitivity to noise.
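A constant-size mix like the one in the mixing-ratio ablation can be expressed in a few lines; the function below is a hypothetical helper, with the 25% default taken from the paper's best-performing setting.

```python
import random

def mix_datasets(real_samples, synthetic_samples, replace_ratio=0.25, seed=0):
    """Replace a fraction of the real training set with synthetic samples,
    keeping the total sample count constant."""
    rng = random.Random(seed)
    n_replace = int(len(real_samples) * replace_ratio)
    mixed = (rng.sample(real_samples, len(real_samples) - n_replace)
             + rng.sample(synthetic_samples, n_replace))
    rng.shuffle(mixed)
    return mixed
```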
Additional Experimental Results:
- Multi-modal pretraining ablations show that training succeeds even without text-point cloud contrastive learning, but incorporating it improves accuracy, confirming its contribution.
- t-SNE visualizations of the learned features show clear separation between classes, as well as a visible separation between ShapeNet samples and synthetic samples of the same class (a plotting sketch follows this list).
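A t-SNE plot of this kind can be produced with scikit-learn and matplotlib; the interface below (feature matrix, integer class labels, and a boolean real/synthetic mask as NumPy arrays) is an assumption about how the embeddings would be stored.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, is_synthetic, out_path="tsne.png"):
    """Project embeddings to 2D, coloring by class and marking
    real (ShapeNet) vs. synthetic (TeGA) samples with different markers."""
    emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(features)
    for syn, marker in [(False, "o"), (True, "^")]:
        m = is_synthetic == syn
        plt.scatter(emb[m, 0], emb[m, 1], c=labels[m], cmap="tab20",
                    marker=marker, s=8, alpha=0.7)
    plt.savefig(out_path, dpi=200)
```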
In summary, the paper demonstrates that TeGA is an effective method for expanding 3D datasets and improving the performance of zero-shot 3D classification models. The combination of text-guided synthetic data generation and consistency filtering addresses the challenges of data scarcity in 3D vision and paves the way for more robust and generalizable 3D models.