Content Deformation Fields (CoDeF)
- Content Deformation Fields (CoDeF) are a neural representation that decomposes visual data into a canonical content field and deformation fields that encode the dynamic transformations relating each observation to the canonical frame.
- They enable applications such as spatial retargeting, video-to-video translation, and generative modeling by transferring image edits consistently across frames.
- CoDeF leverages coupled neural architectures and tailored regularization, achieving improved reconstruction fidelity, reduced temporal flicker, and enhanced keypoint tracking.
Content Deformation Fields (CoDeF) provide a principled neural representation for the content and motion in visual data, supporting spatial retargeting, temporally consistent video processing, and generative modeling. A CoDeF representation decomposes a scene or video into a canonical content field—an atlas that contains static scene information—and a deformation field or sequence thereof, encoding the transformation from the canonical coordinate system to each observation. This paradigm enables neural fields to not only reconstruct video content more robustly than previous approaches, but also to transfer image-space edits and features to temporally coherent video results via simple warping. Recent formulations extend the approach to diverse applications, spanning content-aware image retargeting, video-to-video translation, generative video synthesis, and efficient keypoint tracking.
1. Mathematical Definition and Formulation
The foundational structure of CoDeF consists of two coupled neural fields. The canonical content field, $C$, encodes RGB appearance as a continuous function, typically parameterized by a multi-resolution hash grid and an MLP. The temporal or spatial deformation field, $D$, maps per-pixel video/frame coordinates $(x, t)$ to positions in the canonical atlas. Frame reconstruction is then expressed as $I_t(x) = C(D(x, t))$ (Ouyang et al., 2023).
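In code, this composition is a direct chaining of two fields. The sketch below is a minimal PyTorch illustration, replacing the hash-grid encoders with plain sinusoidal positional encodings for self-containedness; module names and layer sizes are illustrative and not taken from the official implementation.

```python
# Minimal sketch of the CoDeF composition I_t(x) = C(D(x, t)).
# Hash-grid encoders are replaced by sinusoidal positional encodings;
# all names and layer sizes are illustrative, not the official code.
import torch
import torch.nn as nn

def posenc(x, n_freqs=8):
    """Sinusoidal positional encoding of coordinates (last dim = coordinate dim)."""
    freqs = 2.0 ** torch.arange(n_freqs, device=x.device) * torch.pi
    ang = x[..., None] * freqs                                  # (..., dim, n_freqs)
    return torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1).flatten(-2)

class CanonicalField(nn.Module):
    """C: canonical 2D coordinate -> RGB."""
    def __init__(self, n_freqs=8, width=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * 2 * n_freqs, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, 3), nn.Sigmoid())
    def forward(self, xy_canonical):
        return self.mlp(posenc(xy_canonical))

class DeformationField(nn.Module):
    """D: (x, y, t) -> canonical coordinates (x', y')."""
    def __init__(self, n_freqs=8, width=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * 2 * n_freqs, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, 2))
    def forward(self, xyt):
        # Predict a residual offset so that D starts close to the identity map.
        return xyt[..., :2] + self.mlp(posenc(xyt))

C, D = CanonicalField(), DeformationField()
xyt = torch.rand(4096, 3) * 2 - 1      # sampled (x, y, t), normalized to [-1, 1]
rgb = C(D(xyt))                         # reconstructed colors I_t(x) = C(D(x, t))
```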
In generative extensions such as GenDeF, the canonical content $I_c = G_c(z)$ is produced by a generator $G_c$, and each per-frame deformation field is produced by a deformation generator $G_d$ conditioned on content features and motion codes, outputting per-pixel 2D offsets $f_t$. Video frame synthesis is performed by warping the canonical image, $I_t = \mathrm{Warp}(I_c, f_t)$, using differentiable bilinear sampling. Optical flow between any two frames follows directly by composing their deformation fields through the shared canonical image (Wang et al., 2023).
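The warping step is standard differentiable bilinear sampling over an identity grid plus the predicted offsets. A minimal sketch, assuming the deformation generator has already produced a dense per-pixel offset field and omitting the generator architectures entirely (names and shapes are illustrative):

```python
# Sketch of GenDeF-style frame synthesis: warp a generated canonical image
# with a per-frame offset field via differentiable bilinear sampling.
# Generator architectures are omitted; tensor shapes are illustrative.
import torch
import torch.nn.functional as F

def warp_canonical(canonical, offsets):
    """canonical: (B, 3, H, W) image; offsets: (B, H, W, 2) in normalized coords."""
    B, _, H, W = canonical.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).expand(B, H, W, 2)   # identity sampling grid
    grid = base + offsets                                      # where each output pixel samples from
    return F.grid_sample(canonical, grid, mode="bilinear", align_corners=True)

canonical = torch.rand(1, 3, 64, 64)        # stand-in for the generated canonical image
offsets = 0.05 * torch.randn(1, 64, 64, 2)  # stand-in for the deformation generator output
frame = warp_canonical(canonical, offsets)  # one synthesized video frame
```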
For single images or 3D data, CoDeF reduces to a deformation field $d: \Omega \to \mathbb{R}$, where $\Omega$ is the visual domain (image plane or 3D space), and each point $x \in \Omega$ is displaced along a prescribed axis $a$ by $d(x)$, yielding $x' = x + d(x)\,a$ (Elsner et al., 2023).
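A toy sketch of this single-axis displacement, with a hand-written stub standing in for the learned displacement field (all names are hypothetical):

```python
# Toy sketch of the single-axis deformation x' = x + d(x) * a used for
# content-aware retargeting: points move only along the chosen axis a.
# The learned displacement field is stubbed out; names are illustrative.
import torch

def displace(points, d_fn, axis=0):
    """points: (N, dim); d_fn: scalar displacement per point; axis: prescribed axis."""
    a = torch.zeros(points.shape[-1])
    a[axis] = 1.0                                   # unit vector along the retargeting axis
    return points + d_fn(points)[..., None] * a     # x' = x + d(x) * a

# Stub displacement: shift points in the right half of the image plane leftward.
d_fn = lambda p: -0.2 * torch.clamp(p[..., 0] - 0.5, min=0.0)
pts = torch.rand(1000, 2)                           # sampled image-plane coordinates
warped = displace(pts, d_fn, axis=0)
```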
2. Objective Functions and Regularization Strategies
CoDeF-based approaches utilize reconstruction and regularization losses tailored to the smoothness, physical plausibility, and semantic structure of warps:
- Pixelwise or Perceptual Reconstruction Loss: Supervises the field-predicted frame against ground truth using an $L_2$ penalty or task-appropriate divergences, e.g., $\mathcal{L}_{\mathrm{rec}} = \sum_{x,t} \lVert C(D(x,t)) - I_t(x) \rVert_2^2$ (Ouyang et al., 2023).
- Content-aware Deformation Regularization: In image/3D retargeting, an energy-weighted stretch loss, shear loss, monotonicity constraint, and boundary adherence loss ensure that deformations concentrate in low-information areas, remain smooth, and respect boundary conditions (Elsner et al., 2023).
- Optical Flow-guided Smoothness: Video CoDeF incorporates a flow-matching loss encouraging pixels linked by optical flow to map to the same canonical position, $\mathcal{L}_{\mathrm{flow}} = \sum_{x,t} M_t(x)\,\lVert D(x, t) - D(x + F_{t \to t+1}(x),\, t{+}1) \rVert_1$, where $F$ is a RAFT-based optical flow and the mask $M_t$ selects high-confidence pixels (Ouyang et al., 2023).
- Adversarial and Structural Video Regularization: In generative settings, adversarial losses on the output video, as well as flow temporal smoothness regularizers (edge-aware or Huber penalties), enforce both realism and interframe motion consistency (Wang et al., 2023).
- Auxiliary Background and Semantic Consistency: When the field is stratified into semantic layers or regions, layer-specific losses penalize deviations outside semantic masks (Ouyang et al., 2023).
The overall objective is a weighted sum, $\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \sum_i \lambda_i \mathcal{L}_i$, with additional terms as dictated by the application.
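A hedged sketch of such a weighted objective, combining the reconstruction and flow-consistency terms described above; the field callables, optical flow, confidence mask, and per-frame time step below are placeholders rather than values from the cited papers:

```python
# Sketch of a weighted CoDeF-style objective: L = L_rec + lambda_flow * L_flow.
# C, D, the flow, and the confidence mask are placeholders, not the official code.
import torch

def codef_loss(C, D, xyt, rgb_gt, flow, mask, lambda_flow=1.0):
    """
    C: canonical field, (N, 2) -> (N, 3);  D: deformation field, (N, 3) -> (N, 2)
    xyt:    (N, 3) sampled pixel coordinates and times
    rgb_gt: (N, 3) ground-truth colors at xyt
    flow:   (N, 2) precomputed optical flow from frame t to t+1 at xyt
    mask:   (N,)   confidence weights for the flow
    """
    # Reconstruction: the composed fields should reproduce the observed frames.
    loss_rec = ((C(D(xyt)) - rgb_gt) ** 2).mean()

    # Flow consistency: pixels linked by optical flow should map to (nearly)
    # the same canonical position one frame later.
    dt = torch.full_like(xyt[..., 2:], 1.0 / 100.0)       # assumed per-frame time step
    xyt_next = xyt + torch.cat([flow, dt], dim=-1)
    loss_flow = (mask * (D(xyt) - D(xyt_next)).abs().sum(-1)).mean()

    return loss_rec + lambda_flow * loss_flow

# Placeholder fields and data, just to show the call signature.
C = lambda xy: torch.tanh(xy @ torch.randn(2, 3))
D = lambda xyt: xyt[..., :2]
xyt, rgb_gt = torch.rand(256, 3), torch.rand(256, 3)
flow, mask = 0.01 * torch.randn(256, 2), torch.ones(256)
loss = codef_loss(C, D, xyt, rgb_gt, flow, mask)
```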
3. Network Architectures and Parameterization
The canonical content field $C$ is typically parameterized by a 2D multi-resolution hash grid followed by a compact MLP predicting RGB values. Temporal deformation fields $D$ are learned by an MLP over a 3D hash embedding of $(x, y, t)$, enabling time-continuous, pixelwise 2D warps (Ouyang et al., 2023). For content-aware retargeting, all fields (image-to-RGB, energy, cumulative energy, and deformation) use fully connected MLPs with positional encoding (32 frequencies), 4–5 layers, widths 64–192 (task-dependent), and LeakyReLU activations; residual connections are inserted every two layers. The Adam optimizer with standard hyperparameters and a default learning rate is used throughout, with a reduced rate for image expansion in retargeting (Elsner et al., 2023).
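The retargeting field structure described above (positional encoding with 32 frequencies, LeakyReLU activations, a residual connection every two layers) can be sketched as follows; depth, width, and learning rate are picked from the stated ranges or chosen purely for illustration, not taken from the authors' code:

```python
# Sketch of a retargeting-style field MLP: 32-frequency positional encoding,
# LeakyReLU activations, and a residual connection every two layers.
import torch
import torch.nn as nn

class FieldMLP(nn.Module):
    def __init__(self, in_dim=2, out_dim=1, width=128, n_blocks=2, n_freqs=32):
        super().__init__()
        self.n_freqs = n_freqs
        self.inp = nn.Linear(in_dim * 2 * n_freqs, width)
        # Each block is two Linear+LeakyReLU layers joined by a skip connection.
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Linear(width, width), nn.LeakyReLU(0.2),
                          nn.Linear(width, width), nn.LeakyReLU(0.2))
            for _ in range(n_blocks)])
        self.out = nn.Linear(width, out_dim)

    def encode(self, x):
        freqs = 2.0 ** torch.arange(self.n_freqs, device=x.device) * torch.pi
        ang = x[..., None] * freqs
        return torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1).flatten(-2)

    def forward(self, x):
        h = self.inp(self.encode(x))
        for block in self.blocks:
            h = h + block(h)          # residual every two layers
        return self.out(h)

# e.g. a scalar energy field e(x, y) over the image plane
energy_field = FieldMLP(in_dim=2, out_dim=1)
opt = torch.optim.Adam(energy_field.parameters(), lr=1e-3)   # lr illustrative
```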
In video generation (GenDeF), the canonical generator is a StyleGAN-style upsampling stack taking a latent code input; the deformation generator receives motion codes and canonical-content features, applying conv-based upsampling and “ToFlow” heads to output per-pixel offsets. Conditioning on canonical features during deformation synthesis ensures appearance-coherent motion fields (Wang et al., 2023).
4. Rendering, Algorithm Lifting, and Applications
With learned $C$ and $D$, any frame is reconstructed by querying $D$ to find each pixel's coordinate in the canonical image and evaluating $C$ at the mapped position. This explicit decomposition enables off-the-shelf image-space algorithms (segmentation, style transfer, keypoints, editing) to be applied once to the canonical image, then “pushed forward” to all frames via $D$. This algorithm lifting achieves superior cross-frame consistency for video translation and tracking tasks, with reduced flicker and improved semantic reliability over per-frame baselines (Ouyang et al., 2023).
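Algorithm lifting therefore amounts to one image-space operation on the canonical image followed by per-frame resampling at the canonical coordinates. A minimal sketch, assuming the canonical coordinates of every frame pixel have already been evaluated from $D$ on a dense grid and normalized to $[-1, 1]$ (names and shapes are illustrative):

```python
# Sketch of "algorithm lifting": run an image-space operator once on the
# canonical image, then pull the result into every frame by sampling it at
# the canonical coordinates D(x, t).
import torch
import torch.nn.functional as F

def lift_to_frames(edited_canonical, canon_coords):
    """
    edited_canonical: (1, C, Hc, Wc) result of any image-space algorithm
                      (stylization, segmentation, ...) on the canonical image.
    canon_coords:     (T, H, W, 2) canonical (x', y') for every pixel of every
                      frame, i.e. D evaluated on a dense grid in [-1, 1].
    """
    T = canon_coords.shape[0]
    canonical = edited_canonical.expand(T, -1, -1, -1)
    return F.grid_sample(canonical, canon_coords, mode="bilinear",
                         align_corners=True)             # (T, C, H, W) edited frames

edited = torch.rand(1, 3, 128, 128)                       # e.g. a stylized canonical image
coords = torch.rand(8, 64, 64, 2) * 2 - 1                 # stand-in for D over 8 frames
edited_frames = lift_to_frames(edited, coords)
```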
In video generation, manipulations or edits on the canonical image (painting, region-masking, scribbles) are naturally propagated across all frames through the warp fields, supporting temporally consistent video editing, segmentation, and keypoint tracking via the canonical-to-frame correspondence in GenDeF (Wang et al., 2023).
In the context of geometric retargeting (images, NeRFs, polygon meshes), the neural deformation fields generalize seam carving and graph-cut methods: the continuous deformation subsumes discrete seam removals, allowing for arbitrary axis-alignment and global optimization, with distortion concentrated in low-information regions (Elsner et al., 2023).
5. Evaluation Protocols and Empirical Results
CoDeF approaches are evaluated across cross-frame consistency, reconstruction fidelity, and tracking accuracy. Key quantitative measures include the following (a minimal metric sketch appears after this list):
- PSNR: Evaluates frame reconstruction against ground truth. CoDeF achieves a 4.4 dB improvement over Neural Atlas baselines (Ouyang et al., 2023).
- Temporal Flicker: Computes mean differences between frames; flicker is reduced by over 20% versus per-frame ControlNet (Ouyang et al., 2023).
- Keypoint Tracking Error: Measures deviation between warped canonical keypoints and ground truth, reducing errors by over 30% compared to optical-flow trackers in non-rigid scenes (Ouyang et al., 2023).
- FID: Used for image retargeting; lower FID is reported for CoDeF compared to seam carving (e.g., mean FID 46.68 versus 52.57 on RetargetMe, $x$-axis shrinking) (Elsner et al., 2023).
- User studies: Subjects preferred CoDeF video retargeting over seam carving (44.6% vs 25.1% preference for 2D; 96–100% for 3D NeRF) (Elsner et al., 2023).
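A minimal sketch of two of these measures under simple assumed definitions (per-video PSNR and a mean adjacent-frame difference as a flicker proxy); the exact flicker and tracking metrics used in the cited papers may differ:

```python
# Simple metric sketches: PSNR for reconstruction fidelity and an
# adjacent-frame difference as a rough flicker proxy. Treat both as
# illustrative proxies, not the papers' exact definitions.
import torch

def psnr(pred, target, max_val=1.0):
    mse = ((pred - target) ** 2).mean()
    return 10.0 * torch.log10(max_val ** 2 / mse)

def flicker_proxy(video):
    """video: (T, C, H, W); mean absolute difference between consecutive frames."""
    return (video[1:] - video[:-1]).abs().mean()

video_pred = torch.rand(16, 3, 64, 64)
video_gt = torch.rand(16, 3, 64, 64)
print(psnr(video_pred, video_gt).item(), flicker_proxy(video_pred).item())
```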
Training times are consistently reduced compared to prior neural atlas approaches (5 minutes versus 10 hours), with practical runtimes (5–10 s for images, minutes for NeRF) (Ouyang et al., 2023, Elsner et al., 2023).
6. Extensions, Limitations, and Future Directions
The primary limitation is the reliance on per-scene optimization, requiring up to 10 minutes per video reconstruction; real-time, feed-forward models are not yet incorporated. The canonical content assumption may degrade under extreme viewpoint or illumination changes, and handling of large non-rigid deformations or occlusions is constrained by the expressiveness of the canonical field (Ouyang et al., 2023). Extensions include learning feed-forward predictors of $(C, D)$ from small sets of video inputs (as in IBRNet, PixelNeRF), or combining with deformation-aware diffusion models for direct video generation (Ouyang et al., 2023).
In geometric contexts, only modest adaptation is required to accommodate new modalities such as neural radiance fields or polygon meshes; the deformation field architecture and energy-based regularization are directly transferable by redefining the per-point energy or sampling strategy (Elsner et al., 2023).
7. Relationship to Prior Methods and Conceptual Significance
CoDeF generalizes discrete operations such as seam carving and graph cuts. Discrete seam-carving solutions are subcases of the continuous, globally optimized deformation field $d$, and CoDeF's energy-weighted stretch loss extends the per-seam energy cost to a continuous integral, while the shear term is analogous to total-variation or graph-cut regularization (Elsner et al., 2023). As such, CoDeF offers a unified, neural-field-based formalism that accommodates not only content-aware editing but also smooth, spatio-temporally consistent transformation of video and geometric data.
By decoupling “what” (canonical content) from “where/when” (deformation), CoDeF supplies an atlas-and-warp paradigm supporting efficient, consistent video processing, generative modeling, and cross-modal applications, substantially advancing the landscape of neural scene and video representations (Ouyang et al., 2023, Elsner et al., 2023, Wang et al., 2023).