
Scale-Wise Conditional Upsampling

Updated 3 January 2026
  • Scale-wise conditional upsampling methods introduce dynamic kernel generation and neural implicit functions that enable arbitrary-scale upsampling with consistent, artifact-free outputs.
  • They employ conditioning strategies and architectures such as hyper-networks, implicit coordinate decoders, and encoder–decoder feature fusion to adapt internal representations across scales.
  • Applications range from image super-resolution and point cloud densification to climate downscaling, demonstrating significant parameter and computational savings.

Scale-wise conditional upsampling refers to methodologies that enable a single neural model to perform upsampling (super-resolution or density enhancement) at arbitrary, user-specified scales, both integer and fractional, by conditioning the model on the target scale. Unlike traditional fixed-scale networks, scale-wise conditional upsamplers are designed to generalize across a continuum of upsampling factors, adapt their internal representations or operations to the requested scale at inference, and ensure scale-consistent, artifact-free outputs across the entire scale range. This unification of multi-scale support into a single network offers significant advances in efficiency, flexibility, and often fidelity, in both generative and regression-based settings across images, point clouds, and related domains.

1. Foundational Principles and Mathematical Framework

The core principle behind scale-wise conditional upsampling is explicit conditioning of the feature extraction, the upsampling kernels, or both, on the input scale factor $r$ (for 2D SR) or $s$ (for point clouds/statistical fields). This conditioning is realized in various ways, detailed by the architectural patterns in Section 2.

The mathematical expressions typically involve fusing kernels or features with scale-dependent weights, and for neural fields, defining a mapping $f(\mathbf{x}, r) \to \mathbf{y}$, where $\mathbf{x}$ is a (possibly normalized) coordinate and $r$ is the target scale.
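
As a concrete illustration, below is a minimal sketch of such a scale-conditioned neural field in PyTorch (layer sizes and the plain concatenation of the scale are illustrative assumptions, not any of the cited architectures): the coordinate and the target scale are concatenated and decoded by an MLP, so the same weights serve every scale.

```python
import torch
import torch.nn as nn

class ScaleConditionedField(nn.Module):
    """Minimal f(x, r) -> y: an MLP over (coordinate, scale) pairs."""
    def __init__(self, coord_dim=2, out_dim=3, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(coord_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, coords, scale):
        # coords: (N, coord_dim) in [-1, 1]; scale: scalar target factor r
        r = torch.full_like(coords[:, :1], float(scale))
        return self.net(torch.cat([coords, r], dim=-1))

# Query the same model at two different target scales.
field = ScaleConditionedField()
xy = torch.rand(1024, 2) * 2 - 1
rgb_x2 = field(xy, scale=2.0)
rgb_x3_5 = field(xy, scale=3.5)
```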

2. Architectural Patterns and Representative Algorithms

A. Dynamic and Hyper-network-based Convolutions.

SAD-Conv layers in SADN (Wu et al., 2021) maintain $K$ convolution kernels $\{W_k\}$; at runtime, scale-conditioned attention coefficients $\alpha_k(r, y)$ are computed and the kernels are fused:

$$W(r, y) = \sum_{k=1}^{K} \alpha_k(r, y)\, W_k$$

This enables the feature extractor to operate with scale-adapted receptive fields and filter responses.
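
A minimal sketch of this kind of scale-conditioned kernel fusion follows; note that the attention branch here conditions on the scale alone, whereas SAD-Conv also conditions on the input features $y$, and all sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAdaptiveConv(nn.Module):
    """Fuse K candidate kernels with scale-conditioned attention weights."""
    def __init__(self, in_ch, out_ch, k=3, num_kernels=4):
        super().__init__()
        self.kernels = nn.Parameter(torch.randn(num_kernels, out_ch, in_ch, k, k) * 0.02)
        self.attn = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, num_kernels))

    def forward(self, x, scale):
        # alpha_k(r): softmax over the K kernels, conditioned on the target scale r
        alpha = torch.softmax(self.attn(torch.tensor([[float(scale)]], device=x.device)), dim=-1)
        # W(r) = sum_k alpha_k(r) * W_k
        w = (alpha.view(-1, 1, 1, 1, 1) * self.kernels).sum(dim=0)
        return F.conv2d(x, w, padding=w.shape[-1] // 2)

conv = ScaleAdaptiveConv(64, 64)
feat = torch.randn(1, 64, 32, 32)
out = conv(feat, scale=2.7)   # same layer, different scale -> different fused kernel
```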

The recently proposed IGConv (Lee et al., 2024) and CUF (Vasconcelos et al., 2022) replace traditional sub-pixel convolution (SPConv) heads with compact hyper-networks that synthesize the upsampling filters for any requested scale $r$ or fractional offset, using Fourier features or MLP-based kernel generators.
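
The hyper-network idea can be sketched as follows, under the assumption of a Fourier-feature embedding of the scale and sub-pixel offset feeding a small MLP that emits a 1×1 projection kernel; the embedding, layer sizes, and kernel shape are illustrative and do not reproduce the published IGConv or CUF heads.

```python
import math
import torch
import torch.nn as nn

def fourier_embed(v, num_freqs=8):
    # v: (N, d) -> (N, d * 2 * num_freqs) sinusoidal embedding
    freqs = 2.0 ** torch.arange(num_freqs, device=v.device) * math.pi
    ang = v.unsqueeze(-1) * freqs                       # (N, d, num_freqs)
    return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(1)

class KernelHyperNet(nn.Module):
    """Synthesize a per-offset 1x1 upsampling kernel for any scale r."""
    def __init__(self, in_ch=64, out_ch=3, num_freqs=8, hidden=128):
        super().__init__()
        emb_dim = 2 * 2 * num_freqs                     # (scale, offset), sin + cos per frequency
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, in_ch * out_ch),
        )
        self.in_ch, self.out_ch = in_ch, out_ch

    def forward(self, scale, offset):
        # offset: fractional sub-pixel position in [0, 1); scale: target factor r
        q = torch.tensor([[float(scale), float(offset)]])
        w = self.mlp(fourier_embed(q))
        return w.view(self.out_ch, self.in_ch, 1, 1)

hyper = KernelHyperNet()
w_half = hyper(scale=3.3, offset=0.5)                   # kernel for one sub-pixel phase
```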

B. Implicit Coordinate Decoders and Bilinear Functions.

The CSUM+MBLIF in SADN (Wu et al., 2021), continuous upsampling filters in CUF (Vasconcelos et al., 2022), and the implicit neural head in latent diffusion + INR pipelines (Kim et al., 2024) interpolate across multi-scale feature volumes using local bilinear or MLP-based functions:

$$f_\theta(p_{HR}; \{M_t\}, r) = F_\theta\!\left( \left[ \alpha_1(r) \cdot \mathrm{Bilinear}(M_1, \cdot),\ \ldots,\ \alpha_T(r) \cdot \mathrm{Bilinear}(M_T, \cdot) \right] \right)$$

where the feature aggregation and attention are continuous in $r$ and $p_{HR}$.
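
A simplified sketch of this pattern, assuming T pre-computed feature maps, a softmax scale-attention over them, and an illustrative MLP decoder (not the exact SADN, CUF, or INR heads):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitDecoder(nn.Module):
    """Decode an HR pixel from bilinearly sampled multi-scale features."""
    def __init__(self, feat_ch=64, num_levels=3, out_ch=3):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, num_levels))
        self.decoder = nn.Sequential(
            nn.Linear(feat_ch * num_levels, 256), nn.ReLU(), nn.Linear(256, out_ch))

    def forward(self, feats, coords, scale):
        # feats: list of T maps (1, C, H_t, W_t); coords: (N, 2) HR positions in [-1, 1]
        alpha = torch.softmax(self.attn(torch.tensor([[float(scale)]])), dim=-1)[0]
        grid = coords.view(1, -1, 1, 2)                    # (1, N, 1, 2)
        sampled = [
            a * F.grid_sample(m, grid, mode='bilinear', align_corners=False)[0, :, :, 0].t()
            for a, m in zip(alpha, feats)                  # each entry -> (N, C)
        ]
        return self.decoder(torch.cat(sampled, dim=-1))    # (N, out_ch)

feats = [torch.randn(1, 64, 32 * 2**t, 32 * 2**t) for t in range(3)]
coords = torch.rand(4096, 2) * 2 - 1
rgb = ImplicitDecoder()(feats, coords, scale=2.4)
```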

C. Scale-aware Generative Models.

In conditional DDPMs for point cloud and image upsampling (Qu et al., 2023, Bang et al., 9 Jun 2025), the upsampling rate is encoded as an embedding or injected into attention/normalization blocks, allowing the denoising process to remain consistent across arbitrary target densities or image sizes.
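
One common way to realize such conditioning is sketched below, assuming FiLM-style modulation of a normalized feature map inside each denoising block; this illustrates the described idea but is not the exact architecture of the cited diffusion models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleConditionedBlock(nn.Module):
    """Denoiser block whose normalized features are modulated by a scale embedding."""
    def __init__(self, ch=128, emb_dim=64):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(1, emb_dim), nn.SiLU(), nn.Linear(emb_dim, 2 * ch))
        self.norm = nn.GroupNorm(8, ch)
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, h, scale):
        # Predict per-channel (gamma, beta) from the target upsampling rate.
        gamma, beta = self.embed(torch.tensor([[float(scale)]], device=h.device)).chunk(2, dim=-1)
        mod = self.norm(h) * (1 + gamma.view(1, -1, 1, 1)) + beta.view(1, -1, 1, 1)
        return h + self.conv(F.silu(mod))   # residual update, conditioned on the rate

block = ScaleConditionedBlock()
x = torch.randn(1, 128, 16, 16)
y = block(x, scale=4.0)                    # the same block adapts to scale = 2.3, 4.0, ...
```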

D. Encoder–Decoder Feature Fusion.

FADE (Lu et al., 2024) fuses encoder and decoder features during upsampling-kernel generation via a semi-shift convolution mechanism, producing spatially and scale-varying kernels with per-pixel gating for the detail/semantic trade-off.
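
A simplified stand-in for this pattern is sketched below; it replaces the semi-shift convolution with ordinary convolutions over concatenated encoder/decoder features and fixes the ratio to the encoder's 2× resolution, so it illustrates the kernel-generation-plus-gating idea rather than FADE itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedKernelUpsampler(nn.Module):
    """Generic encoder/decoder-fused kernel generation with per-pixel gating.

    Fused features predict a per-pixel k*k reassembly kernel and a gate that
    blends encoder detail with decoder semantics. Shapes are illustrative.
    """
    def __init__(self, enc_ch, dec_ch, k=5):
        super().__init__()
        self.k = k
        self.kernel_head = nn.Conv2d(enc_ch + dec_ch, k * k, 3, padding=1)
        self.gate_head = nn.Conv2d(enc_ch + dec_ch, 1, 3, padding=1)
        self.skip_proj = nn.Conv2d(enc_ch, dec_ch, 1)

    def forward(self, enc, dec):
        # enc: (B, Ce, 2H, 2W) high-res skip; dec: (B, Cd, H, W) decoder feature
        dec_up = F.interpolate(dec, size=enc.shape[-2:], mode='nearest')
        fused = torch.cat([enc, dec_up], dim=1)
        kernels = torch.softmax(self.kernel_head(fused), dim=1)   # (B, k*k, 2H, 2W)
        gate = torch.sigmoid(self.gate_head(fused))               # (B, 1, 2H, 2W)
        # Reassemble each output pixel from a k*k neighborhood of dec_up.
        B, C, H, W = dec_up.shape
        patches = F.unfold(dec_up, self.k, padding=self.k // 2).view(B, C, self.k * self.k, H, W)
        semantic = (patches * kernels.unsqueeze(1)).sum(dim=2)    # (B, Cd, 2H, 2W)
        return gate * self.skip_proj(enc) + (1 - gate) * semantic

up = FusedKernelUpsampler(enc_ch=64, dec_ch=128)
out = up(torch.randn(1, 64, 64, 64), torch.randn(1, 128, 32, 32))  # -> (1, 128, 64, 64)
```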

E. Group-equivariant Operations.

SEU-Net (Sangalli et al., 2022) defines upsampling and downsampling operators as linear maps over the semigroup $G = S_\gamma \times \mathbb{Z}^2$, implementing scale-channels and cross-correlation to maintain equivariance.

3. Conditioning Strategies and Implementation Details

A variety of conditioning strategies exist:

  • Explicit scale vector/embedding: e.g., in Meta-PU (Ye et al., 2021), the upsampling factor $R$ is encoded as an augmented “scale vector” $\widetilde{R} \in \mathbb{R}^{2R_{\max}}$ and used as input to a meta-subnetwork that generates the dynamic weights.
  • Positional embedding: In CUF (Vasconcelos et al., 2022), the scale $s$, fractional offset $\rho$, and kernel index are jointly mapped via DCT-style embeddings, concatenated, and used to parameterize the kernel hyper-network.
  • Coordinate-adapter functions: In CasArbi (Bang et al., 9 Jun 2025), the $x, y$ spatial coordinates, scale $s$, and diffusion time $t$ are jointly embedded via a Fourier-style coordinate adapter and fused into each block of the denoising U-Net (a generic sketch of such an embedding appears after this list).
  • Control-image feature fusion: In SCALAR (Xu et al., 26 Jul 2025), supplementary information, such as Canny edge, sketch, or semantic/depth control maps, is encoded via a frozen backbone and injected at each scale via per-layer projections.
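
As a generic illustration of the embeddings used by the positional and coordinate-adapter strategies above, the sketch below expands position, scale, and diffusion time into sinusoidal features; the frequency schedule and output layout are assumptions, not the published CasArbi or CUF embeddings.

```python
import math
import torch

def coordinate_scale_embedding(x, y, scale, t, num_freqs=6):
    """Fourier-style embedding of spatial position, target scale, and diffusion time.

    Each scalar is expanded into sinusoids at geometrically spaced frequencies.
    In practice the resulting vector would be projected (e.g., by a small MLP)
    and fused into each network block.
    """
    vals = torch.tensor([x, y, scale, t], dtype=torch.float32)   # (4,)
    freqs = 2.0 ** torch.arange(num_freqs) * math.pi             # (num_freqs,)
    ang = vals.unsqueeze(-1) * freqs                             # (4, num_freqs)
    return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten()   # (4 * 2 * num_freqs,)

emb = coordinate_scale_embedding(x=0.25, y=-0.5, scale=3.0, t=0.1)   # shape (48,)
```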

4. Quantitative Performance and Empirical Outcomes

Across major benchmarks, scale-wise conditional upsampling achieves or surpasses the best single-scale models:

  • SADN (Wu et al., 2021): On Set5/Set14/Urban100, matches or exceeds fixed-scale SR networks (e.g., SAN, RCAN) with fewer parameters (7.6M vs. 22.3M for RDN).
  • CUF (Vasconcelos et al., 2022): Matches or outperforms MetaSR, LIIF, and LTE in PSNR on DIV2K, with a 40× reduction in parameters and up to 10× lower FLOPs at high scales.
  • IGConv⁺ (Lee et al., 2024): Improves PSNR over SPConv-based networks by +0.21 dB on Urban100×4, with a threefold reduction in training budget and parameters.
  • Meta-PU (Ye et al., 2021) & PU-EVA (Luo et al., 2022): On point cloud tasks, a single model supports continuous/flexible $1.1\times$–$16\times$ upsampling with uniformly low CD/EMD errors, beating fixed-rate methods.
  • Diffusion-based models (Qu et al., 2023, Bang et al., 9 Jun 2025): Retain high fidelity and robustness on arbitrary-scale upsampling and generation, with performance holding or degrading gracefully OOD.
  • Climate downscaling (Winkler et al., 2024): Conditional normalizing flows yield lower MAE/RMSE than bicubic or GAN baselines, and uncertainty quantification is maintained at all scales (2×, 4×).

Ablations in each work confirm contributions of scale-conditioned attention, multi-scale feature fusion, and continuous implicit functions to improved results and artifact suppression.

5. Avoidance of Artifacts and Generalization Properties

Checkerboard and aliasing artifacts, a recurring problem in naive or fixed-scale upsamplers (especially those using transposed convolution), are mitigated:

  • In SADN (Wu et al., 2021), use of bilinear interpolation and implicit functions guarantees continuity in spatial/scale output.
  • In CUF (Vasconcelos et al., 2022)/IGConv (Lee et al., 2024), direct parameterization of kernels as neural fields/hyper-nets allows smooth transitions between integer and non-integer scales without hard-coded resampling.
  • SEU-Net (Sangalli et al., 2022), through approximate scale-equivariance and "scale dropout", achieves dramatic improvements in generalization to unseen scales ($\mathrm{IoU} > 0.60$ at $\pm 2$ octaves vs. $< 0.40$ for a vanilla U-Net).
  • In diffusion and GAN-based models (Qu et al., 2023, 2806.07813), explicit scale conditioning in the generative process ensures no “jumps” or “holes” in upsampled outputs.

Scale randomization or stratified scale sampling during training is almost universally adopted, ensuring the model sees both typical (integer) and atypical (arbitrary/fractional) scales on every batch and improving interpolation quality.
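
A minimal sketch of stratified scale sampling for such a training loop, assuming a continuous scale range of [1, 4] and one stratum per batch element; both choices are illustrative.

```python
import random

def sample_scales_stratified(batch_size, s_min=1.0, s_max=4.0):
    """Draw one scale per stratum so every batch spans the full training range."""
    edges = [s_min + (s_max - s_min) * i / batch_size for i in range(batch_size + 1)]
    return [random.uniform(edges[i], edges[i + 1]) for i in range(batch_size)]

# Each batch mixes near-integer and fractional factors, e.g. [1.3, 1.8, 2.6, 3.9];
# the per-sample scale is then fed to the model as its conditioning input.
print(sample_scales_stratified(batch_size=4))
```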

6. Application Domains and Model Efficiency

Scale-wise conditional upsampling has demonstrated strong performance across image super-resolution, point cloud densification, dense prediction with task-agnostic upsamplers, and climate downscaling.

From an efficiency standpoint, hyper-net and neural field approaches achieve parameter and computational savings proportional to the number of traditional heads replaced (e.g., up to 40× fewer parameters for CUF vs. direct multi-scale heads). Recent works enable inference-time “instantiation” of continuous heads to minimize runtime cost (Vasconcelos et al., 2022, Lee et al., 2024).
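
A sketch of this instantiation step is shown below: for a fixed integer scale, a kernel hyper-network (such as the illustrative one sketched in Section 2.A) is queried once per sub-pixel phase and the results are cached as an ordinary 1×1 convolution followed by pixel shuffle, so the per-pixel hyper-network cost disappears at runtime. The phase layout and interface are assumptions for illustration.

```python
import torch
import torch.nn as nn

def instantiate_integer_scale_head(hypernet, scale, in_ch, out_ch):
    """Cache hyper-network kernels for one integer scale as a static conv head.

    `hypernet(scale, offset)` is assumed to return an (out_ch, in_ch, 1, 1)
    kernel per sub-pixel phase; the s*s phases are interleaved to match
    PixelShuffle's channel layout, so a plain 1x1 conv + pixel shuffle stands
    in for the continuous head at inference time.
    """
    s = int(scale)
    with torch.no_grad():
        kernels = [hypernet(scale, (i + 0.5) / (s * s)) for i in range(s * s)]
        weight = torch.stack(kernels, dim=1).reshape(out_ch * s * s, in_ch, 1, 1)
        head = nn.Conv2d(in_ch, out_ch * s * s, kernel_size=1, bias=False)
        head.weight.copy_(weight)
    return nn.Sequential(head, nn.PixelShuffle(s))   # (B, in_ch, H, W) -> (B, out_ch, sH, sW)

# Example with any callable kernel generator, e.g. the KernelHyperNet sketch above:
# head_x3 = instantiate_integer_scale_head(KernelHyperNet(64, 3), scale=3, in_ch=64, out_ch=3)
```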

Task-agnostic upsamplers (FADE (Lu et al., 2024)) demonstrate robust improvements for both region- and detail-sensitive tasks. In point clouds, meta-learning and edge-vector approximations decouple the architecture from the upsampling ratio.

7. Limitations, Open Questions, and Future Directions

Current scale-wise conditional upsampling approaches exhibit some limitations:

  • Many methods require explicit enumeration or branching for each scale in certain architectures (e.g., “tail heads” in ICF-SRSR (Neshatavar et al., 2023)), constraining full continuity of scale support.
  • Per-pixel implicit function evaluation at high resolutions still involves non-trivial computation, though this can often be amortized by pre-instantiation for integer scales (Vasconcelos et al., 2022, Lee et al., 2024).
  • Some equivariant frameworks resort to approximations due to the discrete grid, with small residual artifacts or loss of strict group properties (Sangalli et al., 2022).
  • Conditioning on untrained or extremely large scales leads to graceful (but non-negligible) degradation of fidelity beyond training distributions (Wu et al., 2021).

Future directions include migration to arbitrary geometric transformations (not just uniform upsampling), fusion with more sophisticated generative priors (e.g., cascading multiple conditional flows or diffusion stages), extension to spatio-temporal tasks, and custom hardware acceleration for coordinate-based neural field evaluator submodules.
