Real-Time Image Synthesis Techniques
- Real-time image synthesis is a computational framework that generates novel, photorealistic images interactively under sub-second latency constraints.
- Core paradigms include neural view synthesis, per-pixel 3D Gaussian splatting, MPI-based modeling, and diffusion approaches, each balancing quality with compute and memory efficiency.
- These methods are applied in immersive, resource-constrained settings, achieving high frame rates at resolutions ranging from 512×512 to 2K×2K.
Real-time image synthesis refers to computational frameworks and algorithms that generate novel, photorealistic or semantically controlled images from input measurements, inferences, or minimal user guidance under strict sub-second latency constraints (typically frame rates of at least 15–30 Hz), suitable for interactive, immersive, or resource-constrained settings. Unlike offline or batch-based image synthesis, real-time approaches must optimize compute, memory, and sometimes data transfer to deliver visual results at rates compatible with human interaction or device actuation cycles.
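The relation between a target frame rate and the per-frame latency budget can be made concrete with a short calculation. The following is a minimal sketch; the function names and the example stage timings are illustrative assumptions, not values from any cited system.

```python
def frame_budget_ms(target_hz: float) -> float:
    """Per-frame time budget in milliseconds for a target refresh rate."""
    if target_hz <= 0:
        raise ValueError("target_hz must be positive")
    return 1000.0 / target_hz


def meets_budget(stage_latencies_ms, target_hz: float) -> bool:
    """True if the summed stage latencies fit inside one frame interval.

    stage_latencies_ms is a hypothetical list of per-stage timings
    (for example: encode, synthesize, display) measured in milliseconds.
    """
    return sum(stage_latencies_ms) <= frame_budget_ms(target_hz)
```

At 30 Hz the whole pipeline has roughly 33 ms per frame, and at 15 Hz roughly 67 ms, so every stage (including any data transfer) has to fit inside that window.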
1. Core Algorithmic Paradigms
Real-time image synthesis encompasses multiple algorithmic domains, each addressing distinct problem settings and trade-offs between quality, control, and latency.
Neural View Synthesis: Methods such as Quark (Flynn et al., 2024) and FWD (Cao et al., 2022) generate novel views of a 3D scene from a sparse set of calibrated images at real-time frame rates of around 30 FPS at 1080p (Quark: 1920×1080 at 30 FPS on an A100). Quark’s architecture revolves around a multi-scale, UNet-style “render-and-refine” pipeline that estimates a layered depth map (LDM) for the target view, combining transformer-based cross-view attention with efficient volume-to-layer compute scaling. FWD employs explicit depth predictors and differentiable forward warping to render and blend source features into the target camera, using a lightweight transformer for inter-view feature fusion and CNN-based inpainting to complete occluded regions.
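The depth-based forward warping step described for FWD can be illustrated with a small NumPy sketch. This is a simplified, non-differentiable nearest-pixel splat with a z-buffer, not the method's actual renderer; the function name `forward_warp`, the pinhole intrinsics `K` (assumed to have last row `[0, 0, 1]`), and the source-to-target pose `(R, t)` are assumptions introduced for illustration.

```python
import numpy as np


def forward_warp(src_feat, src_depth, K, R, t):
    """Splat source features into a target view using per-pixel depth.

    src_feat: (H, W, C) features, src_depth: (H, W) depths,
    K: (3, 3) shared intrinsics, R: (3, 3) and t: (3,) map source-camera
    coordinates to target-camera coordinates.
    Returns the warped features and a boolean mask of unfilled pixels.
    """
    H, W, C = src_feat.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(np.float64)

    # Back-project pixels to 3D, then move them into the target camera frame.
    pts_src = (np.linalg.inv(K) @ pix) * src_depth.reshape(1, -1)
    pts_tgt = R @ pts_src + np.asarray(t, dtype=np.float64).reshape(3, 1)

    z = pts_tgt[2]
    front = z > 1e-6  # keep only points in front of the target camera
    proj = K @ pts_tgt
    ut = np.zeros(z.shape, dtype=np.int64)
    vt = np.zeros(z.shape, dtype=np.int64)
    ut[front] = np.round(proj[0, front] / z[front]).astype(np.int64)
    vt[front] = np.round(proj[1, front] / z[front]).astype(np.int64)
    inside = front & (ut >= 0) & (ut < W) & (vt >= 0) & (vt < H)
    idx = np.flatnonzero(inside)

    # Z-buffer: the nearest point landing on each target pixel wins.
    zbuf = np.full((H, W), np.inf)
    np.minimum.at(zbuf, (vt[idx], ut[idx]), z[idx])
    winners = idx[z[idx] == zbuf[vt[idx], ut[idx]]]

    out = np.zeros_like(src_feat)
    out[vt[winners], ut[winners]] = src_feat.reshape(-1, C)[winners]
    return out, np.isinf(zbuf)
```

The returned hole mask marks disoccluded pixels that received no source sample; these are the regions that an inpainting stage would have to complete.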
Pixel-wise 3D Gaussian Splatting: GPS-Gaussian (Zheng et al., 2023) formulates per-pixel 3D Gaussian primitives regressed directly from source-view pixels and depths, enabling 2K-resolution, frame-accurate human novel view synthesis (25 FPS at 2048×2048, RTX 3090). This approach avoids per-subject optimization via end-to-end regression of Gaussian means, covariances, rotations, scales, and opacities, followed by GPU-accelerated, differentiable Gaussian splatting for instantaneous rendering.
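To show how per-pixel predictions turn into 3D Gaussian primitives, the sketch below uses the parameterization common in 3D Gaussian splatting, where the covariance is assembled from a rotation and per-axis scales as Σ = R S Sᵀ Rᵀ and the mean comes from unprojecting a pixel with its depth. The helper names and the pinhole intrinsics `K` are illustrative assumptions and do not reproduce GPS-Gaussian's network or renderer.

```python
import numpy as np


def quat_to_rotmat(q):
    """Rotation matrix from a quaternion given as (w, x, y, z)."""
    w, x, y, z = np.asarray(q, dtype=np.float64) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])


def gaussian_covariance(q, scale):
    """Covariance Sigma = R S S^T R^T, symmetric positive semi-definite by construction."""
    M = quat_to_rotmat(q) @ np.diag(np.asarray(scale, dtype=np.float64))
    return M @ M.T


def unproject_pixel(u, v, depth, K):
    """Gaussian mean: 3D camera-space point for pixel (u, v) at the given depth."""
    return depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))
```

Building the covariance from a rotation and scales, rather than regressing its six free entries directly, keeps it valid for any network output.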
MPI and Neural Basis Expansion: NeX (Wizadwongsa et al., 2021) generalizes the multiplane image (MPI) model by parameterizing each plane’s color as a neural basis expansion function of viewing direction, enabling real-time synthesis with view-dependent effects (300 FPS at ~1 MPix). This hybrid implicit-explicit paradigm allows high-frequency reflectance effects, such as specularities and caustics, to be represented compactly.
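The combination of a basis expansion for view-dependent color with standard MPI compositing can be sketched as follows. Each plane pixel's color is modeled as a base color plus a weighted sum of basis values evaluated at the viewing direction, and planes are then alpha-composited front to back. The array shapes, the function names, and the use of a single viewing direction for the whole image are simplifying assumptions for illustration.

```python
import numpy as np


def view_dependent_color(k0, k, basis):
    """Color per plane pixel: k0 + sum_n k_n * H_n(v).

    k0: (D, H, W, 3) base colors, k: (D, H, W, N, 3) coefficients,
    basis: (N,) basis values H_n(v) for one viewing direction v.
    """
    return k0 + np.einsum("dhwnc,n->dhwc", k, basis)


def composite_mpi(colors, alphas):
    """Front-to-back 'over' compositing of D planes (index 0 is nearest).

    colors: (D, H, W, 3), alphas: (D, H, W) with values in [0, 1].
    """
    # Transmittance reaching plane d is the product of (1 - alpha) of planes in front.
    trans = np.cumprod(1.0 - alphas, axis=0)
    trans = np.concatenate([np.ones_like(alphas[:1]), trans[:-1]], axis=0)
    weights = alphas * trans
    return (weights[..., None] * colors).sum(axis=0)
```

Because only the basis values depend on the viewing direction, the per-pixel coefficients can be stored explicitly and a new view needs only a small basis evaluation followed by compositing.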
Diffusion and Latent Flow Matching: Muon-AD (Chen et al., 11 Apr 2025) and RPFM (Jeong et al., 6 May 2025) demonstrate real-time text-to-image and pose-guided image synthesis on edge devices through attention distillation, gradient orthogonalization, mixed-precision quantization, and nonlinear ODE-based flow matching in latent space. Muon-AD integrates a Muon optimizer (orthogonal updates), dynamic attention pruning, and curriculum learning, achieving 24 FPS synthesis on Jetson Orin NX with <7 GB of memory. RPFM uses latent-space flow matching for pose-guided person image synthesis, trading a minor degradation in SSIM/FID for a ≥2× sampling speedup and supporting real-time sign-language video generation.
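Sampling from a flow-matching model amounts to integrating an ODE dx/dt = v(x, t) from noise at t = 0 to a sample at t = 1, and the step count directly controls latency. The sketch below is a generic fixed-step Euler integrator, not RPFM's actual sampler; `velocity_fn` stands in for a trained latent-space network and is an assumption.

```python
import numpy as np


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps.

    velocity_fn is a hypothetical stand-in for a learned velocity field;
    x0 is the initial (noise) latent.
    """
    x = np.array(x0, dtype=np.float64)
    h = 1.0 / num_steps
    for step in range(num_steps):
        t = step * h
        x = x + h * velocity_fn(x, t)
    return x
```

As a toy check, the velocity `(target - x) / (1 - t)` transports any starting point onto `target`, and the Euler scheme reaches it regardless of the number of steps. Halving `num_steps` halves the number of network evaluations, which is the kind of quality-for-speed trade described above.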
Domain-Specific Real-Time Generation: Efficient state-conditioned modulation and domain metrics are exemplified in real-time food-cooking progression synthesis (Gupta et al., 21 Nov 2025), where a lightweight FiLM-modulated U-Net leverages recipe/state embeddings and a domain-specific similarity metric (CIS) to produce temporally plausible, state-conditional outputs at sub-second rates on embedded NPUs.
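FiLM modulation applies a per-channel scale and shift, both computed from a conditioning embedding, to a feature map. The following is a minimal sketch with linear projections; the weight names, shapes, and the idea of passing a single recipe/state embedding vector are illustrative assumptions rather than the cited model's actual layers.

```python
import numpy as np


def film(features, cond, W_gamma, b_gamma, W_beta, b_beta):
    """Feature-wise linear modulation: gamma(cond) * features + beta(cond).

    features: (C, H, W) feature map, cond: (E,) conditioning embedding,
    W_gamma, W_beta: (C, E) projection weights, b_gamma, b_beta: (C,) biases.
    """
    gamma = W_gamma @ cond + b_gamma  # per-channel scale
    beta = W_beta @ cond + b_beta     # per-channel shift
    return gamma[:, None, None] * features + beta[:, None, None]
```

The modulation adds only two small matrix-vector products per layer, which is one reason it suits lightweight, embedded deployments.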
View Synthesis without Explicit Geometry: Position-aware MLPs, as used in PLFNet+ (Gond et al., 2024), eschew explicit depth or geometric warping in favor of purely learned, pose-conditioned feature encoding, achieving >100 FPS at 512×512 with low memory and compute demand.
2. Architectures and Representations
Efficient real-time synthesis builds upon representations and architectures optimized for both quality and compute.
| Representation | Synthesis Target | Example Method | Typical FPS/Res | Notable Feature |
|---|---|---|---|---|
| Layered Depth Maps | General Scenes | Quark (Flynn et al., 2024) | 30 FPS @ 1080p | Dynamic, scene-adaptive |
| Per-pixel 3D Gaussian | Human Modeling | GPS-Gaussian (Zheng et al., 2023) | 25 FPS @ 2K×2K | Full explicit splatting |