3D Discriminative Autoencoder
- The paper introduces a novel framework that leverages unsupervised adversarial training and differentiable rendering to learn realistic 3D object surfaces without annotations.
- It employs a convolutional generator to produce explicit 3D mesh geometry, vertex-level textures, and background images, integrated with mesh-smoothness regularization.
- Experimental results on ShapeNet and CelebA demonstrate effective 3D reconstruction and pose estimation, while highlighting challenges like the hollow-mask illusion.
A 3D discriminative autoencoder is a framework that learns 3D object surfaces, textures, and viewpoints directly from unannotated image collections. In this architecture, a convolutional neural network generator outputs explicit 3D mesh geometry and corresponding texture maps, alongside a background image, which are rendered into 2D images using a differentiable renderer. The principal innovation is unsupervised learning: the model is trained adversarially such that if the generated image is indistinguishable from real images, the underlying 3D representation must be realistic as well. This is achieved without using annotations such as object pose, landmarks, or masks. The framework can pair the generative model with an encoder to enable direct 3D reconstruction and pose estimation for single images, demonstrating that truly unsupervised 3D mesh learning is feasible from in-the-wild data (Szabó et al., 2018).
1. Architecture and Workflow
The 3D discriminative autoencoder consists of an encoder (E), generator (G), differentiable renderer (R), and discriminator/critic (D).
- Encoder (E): Accepts a single image $x$ and produces a latent code $z$ and a viewpoint estimate $v$, where $v$ typically models Euler angles (azimuth, elevation, yaw).
- Generator (G): Receives the latent code $z$, partitions it into $(z_o, z_b)$, decodes $z_o$ to a fixed-topology mesh (vertex positions $V$) and texture (RGB colors $C$), and decodes $z_b$ to a background $B$.
- Random Viewpoint (v): For synthetic samples, $v \sim p(v)$ (uniform over azimuth and elevation).
- Differentiable Renderer (R): Projects geometry and texture under a perspective camera with smooth silhouette blending, outputting an image $\hat{x} = R(G(z), v)$ with exact gradients.
- Training Workflow:
  - GAN-style phase: G synthesizes samples from $z \sim p(z)$ and $v \sim p(v)$, and D distinguishes $\hat{x} = R(G(z), v)$ (fake) from $x$ (real).
  - Autoencoder phase: E encodes a real image $x$ and produces $(z, v) = E(x)$, which is forwarded to the fixed G and R to obtain $\hat{x} = R(G(z), v)$; optimization minimizes the reconstruction loss $\mathcal{L}_{\mathrm{rec}}$.
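The generator interface described above can be sketched in a few lines. This is only a shape-level illustration under loud assumptions: `toy_generator`, its parameters, and the random linear decoders are hypothetical stand-ins for the convolutional network, and the split point `n_obj` between the object and background parts of the latent code is chosen arbitrarily.

```python
import numpy as np

def toy_generator(z, n_obj, n_vertices=8, bg_shape=(4, 4, 3), seed=0):
    """Hypothetical stand-in for G: random linear decoders, not a CNN."""
    rng = np.random.default_rng(seed)          # fixed weights for repeatability
    z_obj, z_bg = z[:n_obj], z[n_obj:]         # partition the latent code
    w_shape = rng.normal(size=(n_vertices * 3, z_obj.size))
    w_color = rng.normal(size=(n_vertices * 3, z_obj.size))
    w_bg = rng.normal(size=(int(np.prod(bg_shape)), z_bg.size))

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    vertices = (w_shape @ z_obj).reshape(n_vertices, 3)       # mesh geometry
    colors = sigmoid(w_color @ z_obj).reshape(n_vertices, 3)  # per-vertex RGB in [0, 1]
    background = sigmoid(w_bg @ z_bg).reshape(bg_shape)       # background image
    return vertices, colors, background
```

Because the object and background parts are decoded separately, changing only the background part of `z` leaves the mesh and its colors untouched, which is the separation the architecture relies on.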
2. Mathematical Formulation of Objectives
The training regime leverages both adversarial and reconstruction losses, with mesh regularization.
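- Reconstruction Loss:
$$\mathcal{L}_{\mathrm{rec}} = \mathbb{E}_{x}\,\lVert x - \hat{x} \rVert,$$
where $\hat{x} = R(G(z), v)$ for encoded $(z, v) = E(x)$.
- Adversarial (Wasserstein-GAN with Gradient Penalty) Loss, in the standard WGAN-GP form:
$$\mathcal{L}_{D} = \mathbb{E}_{\hat{x}}\big[D(\hat{x})\big] - \mathbb{E}_{x}\big[D(x)\big] + \lambda_{\mathrm{gp}}\,\mathbb{E}_{\tilde{x}}\Big[\big(\lVert \nabla_{\tilde{x}} D(\tilde{x}) \rVert_2 - 1\big)^2\Big],$$
where fake samples are $\hat{x} = R(G(z), v)$ with $z \sim p(z)$, $v \sim p(v)$, and $\tilde{x}$ is sampled along straight lines between real and fake images.
- Generator minimizes:
$$\mathcal{L}_{G} = -\mathbb{E}_{\hat{x}}\big[D(\hat{x})\big] + \lambda_{\mathrm{smooth}}\,\mathcal{L}_{\mathrm{smooth}}.$$
- Mesh-Smoothness Regularization: $\mathcal{L}_{\mathrm{smooth}}$ is computed over pairs of adjacent triangles, penalizing normal flips between them.
- Autoencoder Training (with fixed G):
$$\min_{E}\; \mathbb{E}_{x}\,\lVert x - R(G(z), v) \rVert, \qquad (z, v) = E(x),$$
with constraints ensuring $z$ lies in the generator's support region.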
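The exact form of the mesh-smoothness term is not given here, so the following is a minimal sketch of one plausible choice, not the paper's definition. It assumes the penalty is the mean of $1 - \cos$ of the angle between unit normals of triangles that share an edge; the function names are hypothetical.

```python
import numpy as np

def face_normals(vertices, faces):
    """Unit normals of triangles; vertices is (N, 3), faces is (F, 3) of indices."""
    a, b, c = (vertices[faces[:, i]] for i in range(3))
    n = np.cross(b - a, c - a)
    return n / (np.linalg.norm(n, axis=1, keepdims=True) + 1e-12)

def adjacent_face_pairs(faces):
    """Index pairs of faces that share an edge (manifold edges only)."""
    edge_to_faces = {}
    for f_idx, (i, j, k) in enumerate(faces.tolist()):
        for u, v in ((i, j), (j, k), (k, i)):
            edge_to_faces.setdefault((min(u, v), max(u, v)), []).append(f_idx)
    pairs = [fs for fs in edge_to_faces.values() if len(fs) == 2]
    return np.array(pairs, dtype=int).reshape(-1, 2)

def smoothness_penalty(vertices, faces):
    """Assumed penalty: mean of (1 - cos) between adjacent face normals.

    It is 0 for a flat patch and reaches 2 when a neighbor's normal is flipped.
    """
    n = face_normals(vertices, faces)
    pairs = adjacent_face_pairs(faces)
    cos = np.sum(n[pairs[:, 0]] * n[pairs[:, 1]], axis=1)
    return float(np.mean(1.0 - cos))
```

A flat two-triangle square gives a penalty of about 0, while folding one triangle back onto the other flips its normal and gives about 2, so the term grows exactly where the surface folds over itself.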
3. Representation: Shape, Texture, Background, Viewpoint
Object geometry is parameterized as a fixed-topology mesh with $N$ vertices, where spherical–radial coordinates are predicted and mapped to Cartesian positions. Texture is assigned as per-vertex RGB, so color inside each rendered triangle is interpolated using barycentric weights.
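The mapping from spherical-radial coordinates to Cartesian vertex positions can be illustrated as follows. This sketch assumes that the angular directions are a fixed latitude-longitude template and that only one radius per vertex is predicted; the source does not confirm that split, and `sphere_directions` and `radial_to_cartesian` are hypothetical names.

```python
import numpy as np

def sphere_directions(n_lat, n_lon):
    """Fixed unit directions on a latitude-longitude grid (assumed template)."""
    polar = np.linspace(0.0, np.pi, n_lat + 2)[1:-1]            # skip both poles
    azim = np.linspace(0.0, 2.0 * np.pi, n_lon, endpoint=False)
    p, a = np.meshgrid(polar, azim, indexing="ij")
    d = np.stack([np.sin(p) * np.cos(a),
                  np.sin(p) * np.sin(a),
                  np.cos(p)], axis=-1)
    return d.reshape(-1, 3)

def radial_to_cartesian(radii, directions):
    """Scale each unit direction by its predicted radius to get (N, 3) vertices."""
    return radii[:, None] * directions
```

With all radii equal the result is a sphere, and varying the radii deforms it without changing the mesh connectivity, which is what makes a fixed topology possible.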
Background images are modeled on a distant sphere, so viewpoint shifts induce planar background motion, decoupling object and background parallax.
The viewpoint $v$ is parameterized by azimuth $\theta$, elevation $\phi$, and optionally yaw $\psi$. In GAN training, $\theta$ and $\phi$ are sampled from a known prior $p(v)$.
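Sampling a viewpoint and turning it into a rotation can be sketched as below. The angle ranges, the axis convention (azimuth about the vertical y-axis, then elevation about the x-axis), and the function names are assumptions made for illustration; the source only states that the prior is uniform over azimuth and elevation.

```python
import numpy as np

def sample_viewpoint(rng, az_range=(-np.pi / 2, np.pi / 2),
                     el_range=(-np.pi / 6, np.pi / 6)):
    """Draw (azimuth, elevation) uniformly; the ranges are hypothetical."""
    return rng.uniform(*az_range), rng.uniform(*el_range)

def rotation_from_viewpoint(azimuth, elevation):
    """Rotation matrix R_x(elevation) @ R_y(azimuth) under the assumed convention."""
    ca, sa = np.cos(azimuth), np.sin(azimuth)
    ce, se = np.cos(elevation), np.sin(elevation)
    r_y = np.array([[ca, 0.0, sa],
                    [0.0, 1.0, 0.0],
                    [-sa, 0.0, ca]])
    r_x = np.array([[1.0, 0.0, 0.0],
                    [0.0, ce, -se],
                    [0.0, se, ce]])
    return r_x @ r_y
```

Vertices stored as rows of an `(N, 3)` array would be rotated with `vertices @ rotation.T` before projection.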
4. Differentiable Rendering Layer
Rendering employs a high-focal-length perspective transformation. Projected meshes are tiled onto the image plane; color within each triangle is interpolated per-vertex via barycentric weights. A soft silhouette model blends foreground and background at triangle and occlusion edges over Gaussian or linear ramp bands, which is crucial for ensuring nonzero gradients with respect to mesh vertices and colors. Importantly, the model omits explicit shading or illumination: surfaces are treated as view-independent (Lambertian-like), with vertex colors emitted directly, without simulated light or shadow.
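Two ingredients of this layer, barycentric color interpolation and soft edge blending, can be sketched for a single pixel. This is an illustrative fragment rather than the renderer itself: the linear ramp centered on the edge, the signed-distance input (positive inside the silhouette), and all function names are assumptions.

```python
import numpy as np

def barycentric_weights(p, a, b, c):
    """Barycentric weights of 2D point p with respect to triangle (a, b, c)."""
    v0, v1, v2 = b - a, c - a, p - a
    den = v0[0] * v1[1] - v1[0] * v0[1]          # twice the signed triangle area
    w1 = (v2[0] * v1[1] - v1[0] * v2[1]) / den
    w2 = (v0[0] * v2[1] - v2[0] * v0[1]) / den
    return np.array([1.0 - w1 - w2, w1, w2])

def interpolate_color(p, tri_xy, tri_rgb):
    """Blend the three vertex colors (rows of tri_rgb) at pixel position p."""
    return barycentric_weights(p, *tri_xy) @ tri_rgb

def soft_coverage(signed_distance, band):
    """Assumed linear ramp: 0 outside, 1 inside, smooth over a band of width `band`."""
    return np.clip(0.5 + signed_distance / band, 0.0, 1.0)

def composite(foreground, background, coverage):
    """Blend foreground over background by the soft coverage value."""
    return coverage * foreground + (1.0 - coverage) * background
```

Inside the ramp band the coverage varies linearly with distance to the edge, so moving a vertex changes nearby pixel values continuously instead of in a step, which is what gives the silhouette a usable gradient.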
5. Training Regime and Hyperparameters
Training proceeds in two distinct phases:
- GAN Phase: G and D are trained with random latent vectors and viewpoints. D is updated according to the WGAN-GP objective with gradient-penalty weight $\lambda_{\mathrm{gp}}$; the objective for G includes the mesh-smoothness term with weight $\lambda_{\mathrm{smooth}}$. The Adam optimizer is used with learning rate $\alpha$ and moment coefficients $\beta_1$, $\beta_2$. Batch size: 16–64.
- Autoencoder Phase: With G frozen, E is optimized for $\ell_1$ or $\ell_2$ reconstruction over ~50k steps. Optionally, a small regularization is applied to the latent code magnitude.
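The critic update in the GAN phase can be illustrated with a toy example of the standard WGAN-GP critic loss. To stay self-contained without an autodiff library, the sketch assumes a hypothetical critic $D(x) = \tanh(w \cdot x)$ on flattened inputs, whose input gradient is known in closed form; a real critic would be a convolutional network differentiated automatically, and `gp_weight` is left as a required argument because its value is not given here.

```python
import numpy as np

def critic(x, w):
    """Toy critic D(x) = tanh(w . x) on flattened inputs (hypothetical)."""
    return np.tanh(x @ w)

def critic_input_grad(x, w):
    """Closed-form gradient of the toy critic with respect to its input."""
    return (1.0 - np.tanh(x @ w) ** 2)[:, None] * w[None, :]

def wgan_gp_critic_loss(real, fake, w, gp_weight, rng):
    """E[D(fake)] - E[D(real)] + gp_weight * E[(||grad D(interp)|| - 1)^2]."""
    eps = rng.uniform(size=(real.shape[0], 1))
    interp = eps * real + (1.0 - eps) * fake     # points between real and fake
    grad_norm = np.linalg.norm(critic_input_grad(interp, w), axis=1)
    penalty = np.mean((grad_norm - 1.0) ** 2)
    return critic(fake, w).mean() - critic(real, w).mean() + gp_weight * penalty
```

The penalty pushes the critic's gradient norm toward 1 on points between real and rendered images, which is how WGAN-GP softly enforces the Lipschitz constraint required by the Wasserstein objective.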
6. Empirical Outcomes and Analysis
Evaluation is performed on synthetic ShapeNet classes and real face datasets (CelebA). On ShapeNet, geometry coverage is assessed via generated surface normals. For CelebA, results are qualitatively compared to conventional supervised 3D morphable model methods, specifically examining multi-azimuth renderings and normal maps.
Findings:
- The generator yields plausible 3D facial features (nose, brow, lips) when rendered at novel azimuths.
- Mesh-smoothness regularization eliminates high-frequency geometric spikes but, if weighted too heavily, causes degenerate ellipsoid-like output.
- Autoencoder reconstructions preserve large-scale geometry and pose; fine detail is coarsely reproduced.
- Failure Modes:
- Hollow-mask illusion: for limited viewpoint data, concave facial reconstructions become plausible. Mitigation involves stricter object size constraints or expanded viewing angle priors.
- Reference ambiguity: the generator's canonical orientation can arbitrarily rotate with respect to Euler axes. Reconstructions are plausible but axis alignment is arbitrary.
- Deficient texture variation on features outside the mesh scope (mouth interiors, ears).
A plausible implication is that fully unsupervised learning of explicit 3D meshes and textures from uncontrolled images is tractable when adversarial realism constraints are enforced in 2D render space, and inversion via autoencoding enables direct estimation of shape and pose for novel inputs (Szabó et al., 2018).