PowerTrain: Fast DNN Training & Power Optimization
- PowerTrain is a machine learning-based methodology that predicts DNN training time and power consumption on GPU edge devices by searching a high-dimensional space of power modes.
- It employs a one-time, broad offline profiling phase plus transfer learning to achieve prediction errors under 15% for time and 6% for power on new workloads.
- Its Pareto-optimal configuration selection method ensures efficient adaptation to power, latency, and workload constraints for real-time DNN training optimization.
PowerTrain refers to a machine learning–based methodology for the fast, generalizable prediction and principled optimization of deep neural network (DNN) training time and board-level power consumption on highly configurable, power-constrained GPU-accelerated edge devices. PowerTrain enables efficient selection of device-level “power modes” (combinations of CPU core count, CPU frequency, GPU frequency, and memory frequency) for arbitrary DNN workloads, by leveraging transfer learning from an initial, comprehensive profiling on a reference workload. This approach is designed for the specific high-dimensional configuration spaces exposed by devices such as the NVIDIA Jetson Orin AGX platform, in which the cardinality of feasible power modes exceeds 10,000 and manual profiling becomes intractable for real-time adaptation (K. et al., 2024).
1. Motivation and Problem Definition
PowerTrain addresses the core challenge of dynamic DNN training under device- and scenario-specific constraints on battery power and latency. Modern edge AI accelerators, exemplified by Jetson-class devices with >1,000 compute cores, offer fine-grained hardware control across multiple axes: CPU/GPU/memory frequencies and active core counts. The resulting power-performance trade-off space is combinatorially large. Realistic optimization requires rapid prediction of DNN training performance (time per minibatch) and energy use (mean power), without costly brute-force empirical profiling for each new workload or device instance (K. et al., 2024).
Key requirements set by this context:
- Configurational complexity: Orin AGX exposes approximately 18,000 power modes.
- Transferability: Users must support new workloads and devices with minimal extra measurement.
- Principled optimization: Support constraint-driven selection (e.g., minimize training time under a power budget) rather than ad hoc tuning.
PowerTrain is the first method to meet these requirements using a transfer-learned neural predictor architecture (K. et al., 2024).
2. Offline Profiling and Reference Model Training
PowerTrain begins with a one-time, extensive offline profiling phase:
- Reference workload: ResNet-18 on ImageNet, chosen for operational representativeness.
- Power mode enumeration: Each mode is defined by a tuple [CPU core count, CPU frequency, GPU frequency, memory frequency].
- Data collection procedure:
  - Disable dynamic voltage/frequency scaling (DVFS).
  - For each power mode, apply the hardware configuration.
  - Discard the first minibatch (it absorbs kernel-selection overhead).
  - Measure per-minibatch time (via `torch.cuda.Event` in PyTorch) and average board power (via `jtop`, sampled at 1 Hz) across 40 stabilized minibatches.
- Corpus: The profiling procedure samples 4,368 representative modes (all GPU and memory frequencies, alternate (even-numbered) CPU frequencies, and 6 CPU core counts), covering the non-linear structure of the configuration space (K. et al., 2024).
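The grid arithmetic behind these counts can be reproduced directly. The axis cardinalities below (29 CPU frequencies, 12 core counts, 13 GPU frequencies, 4 memory frequencies) are assumptions chosen to be consistent with the 18,096-mode full grid and 4,368-mode sample quoted in this article, not values taken from the source:

```python
from itertools import product

# Hypothetical frequency tables; only the cardinalities matter here.
cpu_freqs   = list(range(1, 30))  # 29 CPU frequency steps (assumed)
gpu_freqs   = list(range(1, 14))  # 13 GPU frequency steps (assumed)
mem_freqs   = list(range(1, 5))   # 4 memory frequency steps (assumed)
core_counts = list(range(1, 13))  # 12 active-core options (assumed)

# Full configuration space: every combination of the four axes.
full_grid = list(product(core_counts, cpu_freqs, gpu_freqs, mem_freqs))

# Sampled corpus: all GPU and memory frequencies, every other CPU
# frequency (14 of 29), and 6 of the 12 core-count options.
sampled = list(product(core_counts[1::2], cpu_freqs[1::2],
                       gpu_freqs, mem_freqs))

print(len(full_grid), len(sampled))  # 18096 4368
```

The product of the four axis sizes (12 × 29 × 13 × 4) reproduces the 18,096 figure, and the thinned axes (6 × 14 × 13 × 4) reproduce the 4,368-mode profiling corpus.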
Two separate neural networks are then trained over this corpus:
- Input features: Normalized [CPU core count, CPU freq., GPU freq., memory freq.].
- Targets: (i) Per-minibatch time [ms], (ii) Average board power [mW].
- Model architecture: Input → Dense(256)-ReLU-Dropout → Dense(128)-ReLU-Dropout → Dense(64)-ReLU → Output(Dense(1)-Linear).
- Optimization: Mean-squared error (MSE) loss, Adam optimizer (lr=0.001), early stopping.
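A minimal PyTorch sketch of one such predictor follows (one network is fit for time, a second for power). The layer widths, MSE loss, and Adam learning rate come from the description above; the dropout rate, batch size, and random targets are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PowerTrainRegressor(nn.Module):
    """Dense(256)-ReLU-Dropout -> Dense(128)-ReLU-Dropout ->
    Dense(64)-ReLU -> Dense(1)-Linear, over 4 normalized inputs."""
    def __init__(self, p_drop: float = 0.2):  # dropout rate assumed
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, 256), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),  # linear output: time [ms] or power [mW]
        )

    def forward(self, x):
        return self.net(x)

model = PowerTrainRegressor()
opt = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.MSELoss()

# One illustrative training step on random normalized features:
# [core count, CPU freq, GPU freq, mem freq], each scaled to [0, 1].
x = torch.rand(32, 4)
y = torch.rand(32, 1)
opt.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
opt.step()

print(model(x).shape)  # torch.Size([32, 1])
```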
These reference models encapsulate the intricate, non-linear resource dependencies for DNN training at the system level for the chosen device.
3. Transfer Learning for New Workloads and Devices
To generalize PowerTrain’s predictions to new DNN architectures, datasets, or Jetson devices:
- Transfer protocol:
  - Remove the reference network’s output (Dense(1)) layer.
  - Replace it with a newly initialized output layer.
  - Collect approximately 50 measurements for the new workload and/or device, distributed across the configuration space.
  - Retrain only the new output layer (optionally fine-tuning earlier layers with a lower learning rate).
  - Loss: MSE, as above.
This protocol produces adapted neural predictors of per-minibatch time and mean power. After transfer, prediction accuracy is robust: mean absolute percentage error (MAPE) is typically <6% (power) and <15% (time) across ≈4,400 evaluated modes on unseen workloads, and <11% (power), <14.5% (time) when porting to new Jetson devices (K. et al., 2024).
Key result: Transfer learning with ≈50 new-mode samples achieves lower error than training a new neural model from scratch on the same data volume.
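The head-replacement step above can be sketched in PyTorch. The random weights stand in for a trained reference model, and the epoch count is an illustrative assumption; only the new output layer's 65 parameters (64 weights + 1 bias) are updated:

```python
import torch
import torch.nn as nn

# Reference predictor with the layer sizes from the text; random weights
# here are stand-ins for a trained reference model.
reference = nn.Sequential(
    nn.Linear(4, 256), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

# Transfer protocol: freeze the trunk, re-initialize the output layer,
# and fit only the new head on ~50 profiled modes of the new workload.
for p in reference.parameters():
    p.requires_grad = False
reference[-1] = nn.Linear(64, 1)  # fresh head, trainable by default

opt = torch.optim.Adam(reference[-1].parameters(), lr=0.001)
loss_fn = nn.MSELoss()

x_new = torch.rand(50, 4)   # ~50 normalized power-mode samples
y_new = torch.rand(50, 1)   # measured time or power targets
for _ in range(100):        # short small-data head fit (epochs assumed)
    opt.zero_grad()
    loss = loss_fn(reference(x_new), y_new)
    loss.backward()
    opt.step()

trainable = sum(p.numel() for p in reference.parameters() if p.requires_grad)
print(trainable)  # 65
```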
4. Pareto-Optimal Configuration Selection
PowerTrain supports principled, constraint-driven optimization over the predicted (time, power) performance:
- For the set of all feasible power modes, the adapted neural models predict time and power.
- The Pareto frontier F is constructed from these predictions as the set of non-dominated modes: a mode lies on F if no other mode has both lower predicted time and lower predicted power.
- Given a user constraint (e.g., a power budget), PowerTrain recommends the configuration on F with the lowest predicted time among modes meeting the budget (K. et al., 2024).
In practice, PowerTrain’s predicted Pareto front yields a median time penalty of ≈1% over the true optimum, and power-budget violations exceed 1 W in only ≈25% of budgeted cases.
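The selection rule reduces to a short dominance scan; this sketch uses toy predicted (time, power) values rather than real model outputs:

```python
def pareto_front(points):
    """Keep (time, power) points not dominated by any other point,
    i.e., no other point is <= in both objectives and < in one."""
    front = []
    for i, (t, p) in enumerate(points):
        dominated = any(
            t2 <= t and p2 <= p and (t2 < t or p2 < p)
            for j, (t2, p2) in enumerate(points) if j != i
        )
        if not dominated:
            front.append((t, p))
    return front

def pick_mode(front, power_budget):
    """Lowest predicted time among frontier points within the budget."""
    feasible = [pt for pt in front if pt[1] <= power_budget]
    return min(feasible, key=lambda pt: pt[0]) if feasible else None

# Toy predictions (time ms, power W) for a handful of power modes.
preds = [(120, 18), (95, 25), (80, 33), (150, 12), (95, 40), (70, 45)]
front = pareto_front(preds)
print(sorted(front))        # [(70, 45), (80, 33), (95, 25), (120, 18), (150, 12)]
print(pick_mode(front, 30)) # (95, 25)
```

In PowerTrain the `preds` list would be the neural models' predictions over all feasible power modes, so the scan is one pass per deployment decision.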
5. Empirical Results and Deployment
PowerTrain’s validation spans:
- Workloads: Computer vision (MobileNet v3, YOLO v8n), NLP (BERT-base on SQuAD, LSTM on WikiText), cross-dataset variants.
- Devices: Orin AGX, Xavier AGX, Orin Nano.
- Accuracy: Transfer to new CV/NLP workloads (on Orin) achieves power MAPE <6%, time MAPE <15% (50-mode adaptation). Inter-device transfer yields power MAPE of 6–11% and time MAPE of 9–14% (K. et al., 2024).
- Optimization impact: Relative to common baselines (MAXN, random sampling, naive neural predictor with 50 modes), PowerTrain achieves up to 45% faster training or 88% lower power under identical constraints.
PowerTrain’s approach—single exhaustive profiling, lightweight transfer, fast neural inference, and Pareto-optimal selection—is suited to real-time dynamic workload and SLA (service-level agreement) compliance in heterogeneous, federated-learning, or edge fleet deployments.
6. Methodological Insights and Future Directions
The PowerTrain methodology supplies several key technical insights:
- Board-level power and time for DNN training on modern system-on-modules are highly non-linear in device control parameters. Neural regression is necessary to capture this structure.
- Fine-grained transfer learning reduces adaptation cost per new context, outperforming “from-scratch” approaches in both data efficiency and prediction accuracy.
- The architecture is robust to changes in workload category (CV, NLP, RNN vs CNN), dataset scale, and batch size.
- Hypothetically, future extensions could include in-band early-epoch profiling, reinforcement learning for non-stationary workload optimization, or adaptation to non-DNN or large-server-GPU scenarios. This suggests a plausible path toward universal power-performance prediction in edge ML orchestration.
7. Summary Table: Key Steps and Metrics
| Phase | Operation | Quantitative Footprint |
|---|---|---|
| Offline | Profile 4,368 sampled modes (of 18,096 total) on reference DNN | One-time, ≈hours (per hardware) |
| Training | NN fit for time & power | MSE loss, Dense(256→128→64→1), <100 epochs |
| Transfer | 50 samples, fine-tune output layer | <15 min per new DNN/device |
| Prediction | ~4,000 configurations per workload/device | <6% (power), <15% (time) MAPE after transfer |
| Optimization | Pareto-front selection, one scan | Median <1% time penalty vs. true optimum |
| Deployment | Adapted model, real-time per-epoch evaluation | O(1) NN inference per power mode |
PowerTrain is thus a scalable, neural network–based transfer-learning framework for time/power prediction and configuration optimization in DNN training on accelerated, power-constrained edge devices (K. et al., 2024).