Zero-Pretraining Deep Learning Methods
- Zero-pretraining deep learning methods are defined by the exclusion of pretrained weights, external datasets, and prior gradient information, relying solely on random initialization and task-specific loss functions.
- These approaches employ architectural inductive biases—such as Deep Image Prior, denormalization operators, and adaptive prompt generators—to improve model plasticity and avoid negative transfer.
- Empirical results show that methods like Denorm-DRFWI, ZeroRF, and DeepZero achieve faster convergence and competitive accuracy in applications including full waveform inversion, sparse-view reconstruction, and incremental learning.
Zero-pretraining deep learning methods are a class of algorithms and frameworks that eschew reliance on pretrained weights, external large-scale datasets, or gradient information obtained from previous optimization stages. Instead, they either begin with random initialization, use architectural inductive biases, or utilize function-value queries alone. This paradigm is gaining traction in several domains—including network-reparameterized full waveform inversion, neural field reconstruction, incremental learning, and optimization—largely due to its workflow simplicity, robustness to data gaps, and ability to circumvent negative transfer commonly observed when pretraining and downstream objectives misalign.
1. Key Principles and Formal Definitions
Zero-pretraining refers to the complete avoidance of any pretrained parameters, dataset-dependent priors, or gradient information carried over from previous optimization stages. Formally, given a deep model parameterized by weights θ, a zero-pretraining workflow entails:
- Random initialization of θ;
- Absence of any supervised or unsupervised fitting stage before deployment;
- No incorporation of external large-scale pretraining datasets or hand-crafted regularizers;
- Training is guided strictly by the task-specific loss function.
In network-reparameterized FWI, this is realized by a denormalization operator that imposes the initial model only as a fixed architectural bias rather than fitting the network to it (Chen et al., 5 Jun 2025). In sparse-view 3D reconstruction, all inductive bias is provided solely by the generator architecture (e.g., a Deep Image Prior) without pretrained features (Shi et al., 2023). In incremental learning, task progression is managed by Adaptive Prompt Generation rather than retrieval from a pretrained pool (Tang et al., 2023).
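The recipe above can be sketched in a few lines. This is a deliberately minimal, illustrative example (the toy task and hyperparameters are my own, not from any cited paper): parameters start from random initialization and are updated only by the gradient of the task-specific loss, with no pretraining stage and no external data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: recover y = 2x + 1 from scratch, guided strictly by the task loss.
X = rng.normal(size=(64, 1))
y = 2.0 * X[:, 0] + 1.0

theta = rng.normal(scale=0.1, size=2)   # random initialization of theta = (w, b)

def task_loss(theta):
    pred = theta[0] * X[:, 0] + theta[1]
    return float(np.mean((pred - y) ** 2))

lr = 0.1
for _ in range(500):
    # Analytic gradient of the MSE task loss; no pretrained weights anywhere.
    pred = theta[0] * X[:, 0] + theta[1]
    err = pred - y
    grad = np.array([np.mean(2 * err * X[:, 0]), np.mean(2 * err)])
    theta -= lr * grad

print(theta)  # approaches (2, 1)
```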
2. Methodologies: Model Construction and Training Strategies
Zero-pretraining deep learning is instantiated in several research directions:
Denorm-DRFWI for Full Waveform Inversion (Chen et al., 5 Jun 2025):
- The model learns perturbations about a fixed initial model, eliminating any pretraining or fine-tuning stage.
- Training proceeds via a single-stage, physics-guided inversion loss measuring the data misfit between simulated and observed records.
- Variants include Static Denorm (the denormalization component is held fixed) and Adaptive Denorm (it is updated by gradient steps).
- Advantages over two-stage pretraining include preserved network plasticity, mitigation of spectral bias, and reduction of workflow hyperparameter sensitivity.
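The denormalization idea can be sketched as follows. All shapes and the specific operator are hypothetical stand-ins, not the authors' exact formulation: the network emits a bounded, normalized perturbation, and a denormalization step rescales it by statistics of the fixed initial model, so the initial model enters only as an architectural bias and is never fit by a pretraining stage.

```python
import numpy as np

rng = np.random.default_rng(1)

nz, nx = 32, 64
# Smooth starting model (assumed shape/values, for illustration only).
m0 = 1.5 + 0.01 * np.arange(nz)[:, None] * np.ones((1, nx))

def denormalize(delta_norm, m0):
    # Rescale the unit-range network output by the spread of the initial model.
    return m0.std() * delta_norm

def network(theta, grid):
    # Stand-in for a randomly initialized deep network: a linear map squashed
    # to (-1, 1), mimicking a bounded, normalized perturbation field.
    return np.tanh(grid @ theta)

theta = rng.normal(scale=0.1, size=(nx, nx))  # random init, no pretraining
grid = rng.normal(size=(nz, nx))

delta_norm = network(theta, grid)
m = m0 + denormalize(delta_norm, m0)          # reparameterized model estimate

# A single-stage, physics-guided inversion loss would then compare simulated
# and observed data through a wave-equation forward operator F:
#   L(theta) = || F(m0 + denormalize(f_theta)) - d_obs ||^2
print(m.shape)
```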
ZeroRF for Sparse View 360° Reconstruction (Shi et al., 2023):
- Neural radiance field factors are generated via a convolutional generator fed with fixed Gaussian noise, enforcing a Deep Image Prior effect.
- No hand-crafted or dataset-trained priors are used; generator weights are randomly initialized.
- Fitting is per-scene; optimization addresses photometric reconstruction loss only.
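The Deep-Image-Prior-style setup can be sketched in plain numpy. Names, shapes, and the toy generator are illustrative, not ZeroRF's actual architecture: a *fixed* Gaussian noise tensor is passed through a randomly initialized generator, and only the generator weights would be fit, per scene, against a photometric reconstruction loss.

```python
import numpy as np

rng = np.random.default_rng(2)

z = rng.normal(size=(8, 8))            # fixed noise input, never optimized

def upsample2x(x):
    # Nearest-neighbor upsampling stand-in; real generators use bilinear.
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def generator(weights, z):
    h = upsample2x(z)                   # 8x8 -> 16x16
    h = np.tanh(weights["w1"] * h + weights["b1"])
    h = upsample2x(h)                   # 16x16 -> 32x32
    return weights["w2"] * h + weights["b2"]

# Randomly initialized generator weights (scalars here, conv kernels in practice).
weights = {k: rng.normal(scale=0.1) for k in ("w1", "b1", "w2", "b2")}

target = rng.normal(size=(32, 32))     # stand-in for observed pixels

def photometric_loss(weights):
    # The only objective: per-scene photometric reconstruction.
    return float(np.mean((generator(weights, z) - target) ** 2))

print(photometric_loss(weights))
```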
Incremental Learning with Adaptive Prompt Generator (Tang et al., 2023):
- A Vision Transformer backbone is randomly initialized, trained solely on the first task, and then frozen.
- Task-specific prompts are produced dynamically via cross-attention to a growing pool of learned prompt candidates, not by retrieval from a static pretraining source.
- A knowledge pool regularizes the prompt generator and classifier by maintaining statistical summaries per class across tasks, ensuring resistance to catastrophic forgetting.
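Prompt generation via cross-attention can be sketched as follows. Dimensions and weight names are hypothetical: the current sample's intermediate features act as queries over a growing pool of learned prompt candidates, producing an input-conditioned prompt instead of retrieving a fixed one by cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(3)

d = 16                                  # feature / prompt dimension (assumed)
pool = rng.normal(size=(5, d))          # learned prompt candidates (grows per task)
features = rng.normal(size=(1, d))      # intermediate features of current input

Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

q = features @ Wq                       # (1, d) query from input features
k = pool @ Wk                           # (5, d) keys from prompt candidates
v = pool @ Wv                           # (5, d) values from prompt candidates

attn = softmax(q @ k.T / np.sqrt(d))    # (1, 5) weights over candidates
prompt = attn @ v                       # (1, d) dynamically generated prompt

print(prompt.shape)
```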
Zeroth-Order Deep Optimization (DeepZero) (Chen et al., 2023):
- All optimization is conducted via function-value queries alone; gradients are estimated by finite-difference coordinatewise perturbations.
- Sparsity-inducing pruning techniques reduce computation by targeting coordinates with significant gradient signal.
- Feature reuse and forward parallelization techniques accelerate training time and allow scalability for deep architectures.
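The core zeroth-order primitive is coordinate-wise finite-difference gradient estimation. The toy version below omits DeepZero's sparsity, feature-reuse, and parallelization machinery; the quadratic test problem is my own.

```python
import numpy as np

def zo_gradient(f, theta, mu=1e-4):
    """Estimate grad f(theta) using function-value queries only."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = mu
        # Central difference along coordinate i: two function queries.
        grad[i] = (f(theta + e) - f(theta - e)) / (2 * mu)
    return grad

# Example: minimize a quadratic without ever computing analytic gradients.
def loss(theta):
    return float(np.sum((theta - np.array([1.0, -2.0])) ** 2))

theta = np.zeros(2)
for _ in range(200):
    theta -= 0.1 * zo_gradient(loss, theta)

print(theta)  # approaches [1, -2]
```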
3. Empirical Performance and Comparative Results
Zero-pretraining frameworks have demonstrated state-of-the-art convergence, robustness, and fidelity across several domains:
| Task/Setting | Method | Acc./MSE/SSIM | Comments |
|---|---|---|---|
| FWI (Marmousi, smooth) | Pretrain-DRFWI | MSE=0.1839, SSIM=0.8016 | Two-stage, highest error, slow convergence |
| FWI (Marmousi, smooth) | S-Denorm | MSE=0.1605, SSIM=0.8268 | Faster convergence, improved accuracy |
| FWI (Marmousi, smooth) | A-Denorm | MSE=0.1517, SSIM=0.8363 | Fastest, best accuracy |
| NeRF-Synthetic (4 view) | ZeroRF | PSNR=21.94, SSIM=0.856 | Outperforms baselines in comp. time |
| CIFAR-100 B50-T10 | L2P (no pretrain) | 36.55% | Prompt-retrieval (severely degraded) |
| CIFAR-100 B50-T10 | APG | 66.68% | Prompt-generation (zero pretraining) |
| ResNet-20 (CIFAR-10) | DeepZero | 86.94% | ZO optimization approaching first-order (FO) baselines |
Denorm-DRFWI excels in both speed and high-frequency model recovery; ZeroRF provides state-of-the-art 3D reconstruction in minute-scale timeframes unattainable by alternatives; APG achieves nearly double the accuracy of prompt retrieval on sequential CIFAR-100 splits; and DeepZero brings deep CNN training close to FO baselines, with unique applicability in black-box certified adversarial defense (Chen et al., 2023) and PDE error correction.
4. Architectural and Algorithmic Inductive Biases
Zero-pretraining leverages implicit architectural biases in lieu of data-dependent or handcrafted priors:
- Deep Image Prior (Shi et al., 2023): Convolutional generator architectures (e.g., ResNet-style blocks, bilinear upsampling) naturally suppress high-frequency "speckle," regularizing the solution search space without any external regularizer.
- Denormalization Operator (Chen et al., 5 Jun 2025): Forces the network to model only the high-wavenumber perturbation. This targets frequency bias and accelerates high-frequency detail recovery.
- Feature Reuse (Chen et al., 2023): In ZO optimization, intermediate network activations are cached, minimizing redundant computation for parameter perturbations.
- Prompt Generation via Cross-Attention (Tang et al., 2023): Dynamically adapts prompt vectors to intermediate features, instead of static selection by cosine similarity, bridging domain gaps and distribution shifts inherent in incremental tasks.
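The feature-reuse bias can be made concrete with a small sketch (illustrative, not DeepZero's implementation): when a zeroth-order query perturbs only one layer's parameters, the activations of all earlier layers are unchanged, so they can be computed once and cached rather than recomputed per query.

```python
import numpy as np

rng = np.random.default_rng(5)

calls = {"layer1": 0}

def layer1(W1, x):
    calls["layer1"] += 1                 # count forward passes through layer 1
    return np.maximum(W1 @ x, 0.0)       # ReLU hidden features

W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(2, 8))
x = rng.normal(size=4)

h = layer1(W1, x)                        # hidden features computed once, cached

# 100 function-value queries that perturb W2 only: layer 1 is never re-run.
outputs = [(W2 + 1e-4 * rng.normal(size=W2.shape)) @ h for _ in range(100)]

print(calls["layer1"])  # 1
```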
A plausible implication is that architectural choices, such as the generator design in ZeroRF, MLP depth and activation function (e.g., sine activations in Denorm-DRFWI), and prompt composition in APG, directly dictate the attainable inductive bias and thus final model fidelity.
5. Workflow Simplification, Pitfalls, and Best Practices
Zero-pretraining eliminates mismatched objectives, hyperparameter sensitivity, and negative transfer observed in pretraining regimes (Chen et al., 5 Jun 2025). Recommended protocols include:
- Single-stage training with strictly task-driven loss.
- Avoidance of additional supervised fitting; let domain constraints (physics, photometric, cross-entropy) guide solution search.
- Adam/AdamW optimization with standard hyperparameters, plus batch/layer normalization for stability (Ponti et al., 2021).
- Monitoring both error-based and structural metrics (e.g., MSE and SSIM in FWI / neural fields) for convergence and overfitting detection.
- Ensemble or test-time augmentation to mitigate data sample perturbations.
- Careful data preprocessing and augmentation for scarce, noisy, or imbalanced regimes.
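The single-stage protocol above can be sketched end-to-end. The Adam hyperparameters are common defaults rather than values from any cited paper, and Pearson correlation stands in for SSIM as a structural metric on this 1-D toy problem: one task loss drives training, while both an error metric and a structural metric are tracked.

```python
import numpy as np

def adam_step(theta, grad, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # One Adam update with bias-corrected first and second moments.
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad ** 2
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

rng = np.random.default_rng(4)
X = rng.normal(size=(128, 3))
true_w = np.array([1.0, -1.0, 0.5])
y = X @ true_w

theta = rng.normal(scale=0.1, size=3)   # random init, single-stage training
state = {"t": 0, "m": np.zeros(3), "v": np.zeros(3)}

for step in range(2000):
    pred = X @ theta
    grad = 2 * X.T @ (pred - y) / len(y)  # gradient of the task MSE loss
    theta = adam_step(theta, grad, state, lr=0.01)

pred = X @ theta
mse = float(np.mean((pred - y) ** 2))                # error-based metric
structural = float(np.corrcoef(pred, y)[0, 1])       # structural proxy (SSIM stand-in)
print(mse, structural)
```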
A plausible implication is that, while zero-pretraining theoretically affords maximal plasticity, architectural instabilities (e.g., badly calibrated initialization, lack of inductive bias) must still be constrained through thoughtful design and regularization-by-architecture.
6. Applications, Extensions, and Limitations
Zero-pretraining methods have been validated in:
- Unsupervised physics inversion (FWI) with robust recovery from inaccurate starting models.
- Sparse-view 3D reconstruction and editing, mesh texturing, and text/image-to-3D applications (Shi et al., 2023).
- Black-box optimization and certified model robustness against adversarial attacks (DeepZero).
- Class-incremental learning with domain-gap resilience (Tang et al., 2023).
Limitations include propagation of structural grid biases (ZeroRF), potential architectural instability in unbounded domains (ZeroRF), and requirement for strong task-specific inductive biases in the absence of external priors. In particular, for incremental learning, APG’s effectiveness stems from its dynamic prompt generation by cross-attention, whereas standard prompt retrieval collapses when pretraining-domain alignment is poor.
7. Context and Future Directions
Zero-pretraining is not merely an engineering convenience—it offers a systematic avenue for model robustness, plasticity, and rapid deployment when external data or gradients are unavailable or unreliable. Workflows such as denormalization-based parameterization, Deep Image Prior generators, cross-attentive prompt generators, and ZO finite-difference methods all represent concrete operationalizations of this paradigm.
Research directions include further scalability of zeroth-order optimization for large-scale models (Chen et al., 2023), architectural adaptations for unbounded or highly non-Euclidean domains, and new forms of regularization-by-structure that further reduce reliance on external data. In incremental and continual learning, adaptive architectures such as APG may continue to close the distribution-gap previously untraversable without strong pretraining.
Zero-pretraining approaches mark a foundational shift toward direct, architecture-driven learning, circumventing the pitfalls and mismatches introduced by prior-dependent deep learning workflows.