ZAPS-DA: Zero-Phase Action Policy Smoothing with Decoupled Actor for Continuous Control in Reinforcement Learning

Published 28 May 2026 in cs.RO, cs.LG, and eess.SY | (2605.30612v1)

Abstract: Continuous control policies trained with off-policy reinforcement learning frequently exhibit high-frequency action jitter, rendering direct deployment on physical actuators impractical. Post-hoc filtering attenuates jitter but introduces phase lag; embedding smoothness penalties in the actor's loss couples them with the RL gradient and conflates reward regression with over-aggressive smoothing. We present ZAPS-DA, a framework that reduces action jitter at deployment with negligible phase lag and no post-processing. ZAPS-DA pairs an unmodified main actor (trained by the base RL loss) with a separate decoupled actor trained via supervised imitation of zero-phase filtered targets stored in the replay buffer. The deployed policy is the decoupled actor: a feed-forward map from the current observation to a smooth action, with no inference-time filter and no action-history input -- a mechanism we term causal distillation of a non-causal filter. A magnitude-matched MSE loss provides zero-hyperparameter portability across optimizer classes. Validated with Soft Actor-Critic and a Savitzky--Golay filter in two driving simulators using paired n=150 evaluation protocols: on MetaDrive, ZAPS-DA reduces steering jitter by 14--21x and throttle jitter by 3--5x (all $p < 10^{-4}$, Bonferroni-corrected) while matching task-completion (p=0.28 success, p=0.31 crash) at a 6.3% reward cost; on a custom Webots adaptive cruise control environment, the same SG configuration produces a Pareto improvement -- reward parity (p=0.121), 8--45x steering jitter reduction, and total task-failure rate reduced from 2.0% to 0.7%.

Abstract PDF Upgrade to Chat

Authors (1)

Faiq Shamass

Summary

The paper’s main contribution is its dual-actor architecture that decouples reward maximization from smoothness supervision to achieve zero-phase action smoothing.
It leverages a Savitzky–Golay filter and magnitude-matched MSE loss to obtain up to 21× steering jitter reduction while maintaining near-parity in rewards.
Empirical results on MetaDrive and Webots ACC demonstrate that ZAPS-DA can effectively suppress temporal jitter, providing a robust and scalable solution for continuous control tasks.

ZAPS-DA: Architectural Decoupling for Zero-Phase Jitter Suppression in RL

Introduction and Motivation

Temporal inconsistency in action sequences—commonly referred to as action jitter—is a persistent failure mode of off-policy RL algorithms in continuous control, particularly when deploying feed-forward policies trained with agents such as Soft Actor-Critic (SAC). The paper "ZAPS-DA: Zero-Phase Action Policy Smoothing with Decoupled Actor for Continuous Control in Reinforcement Learning" (2605.30612) addresses this issue by proposing ZAPS-DA, a training-time architectural framework that leverages non-causal smoothness oracles—zero-phase filters—while maintaining causal, low-latency action selection at deployment.

Conventional approaches, such as applying low-pass filters post-hoc or directly regularizing the actor loss, are criticized for introducing undesirable phase lag or coupling reward maximization and smoothness objectives in a way that either saturates the dual variable or destabilizes learning. The core contribution of ZAPS-DA is the strict decoupling of the RL reward objective and the smoothness supervision, allowing the separately trained decoupled actor to distill zero-phase smoothness without requiring future context or additional filtering infrastructure at deployment.

ZAPS-DA Framework

ZAPS-DA introduces a dual-actor architecture in which the main actor $\pi_\phi$ is trained using vanilla SAC, and a separate decoupled actor $\pi_\psi$ is trained via supervised regression to zero-phase-filtered action targets. The deployment strategy shifts entirely to the decoupled actor, which operates as a standard feed-forward policy from current observation to action—crucially, it does not require action history or lookahead filtering at inference.

This is enabled by maintaining a replay buffer in which each transition is augmented with a zero-phase-filtered action label, produced using a Savitzky–Golay (SG) filter operating non-causally over a centered window within a fixed-length history buffer.

Figure 1: ZAPS-DA training-time data flow, with the environment wrapper appending transitions to a history buffer and pushing SG-filtered actions as supervisory targets.

The decoupled actor is trained with a magnitude-matched MSE loss, using a dynamically computed scaling factor that ensures robust optimization across both adaptive (e.g., Adam) and non-adaptive (e.g., SGD) optimizers, thus guaranteeing hyperparameter-free portability and avoiding training instabilities from mismatched gradient magnitudes.

Empirical Evaluation

The dual-actor ZAPS-DA framework is evaluated on two distinct continuous-control environments: MetaDrive (procedural autonomous driving) and a custom Webots adaptive cruise control (ACC) simulation. Both environments lack explicit smoothness rewards in their native configurations, with Webots exhibiting mildly smoothness-friendly reward shaping. The empirical protocol utilizes strict paired-seed evaluation to control for rollout stochasticity, and reports results over large batched episode samples ( $n=150$ , multiple seeds).

MetaDrive Results

In MetaDrive, ZAPS-DA achieves steering jitter reductions of $14$– $21\times$ and throttle jitter reductions of $3$– $5\times$ across all step-difference metrics (JE, MAD, MDD, variance) with maintained task completion statistics and a marginal reward penalty ( $6.3\%$ ), which is shown to be sensitive to the SG window size—effectively a Pareto knob between smoothness and fidelity to the SAC baseline.

The qualitative effect is visualized by inspecting command traces:

Figure 2: Decoupled actor steering commands attenuate high-frequency oscillations while closely tracking the mean intent of the main actor.

Figure 3: The throttle control trace shows reduced rapid reversals without introducing observable lag, maintaining the envelope of vehicle acceleration and deceleration.

Spectral analysis further corroborates the frequency-domain attenuation:

Figure 4: Steering action power spectral density reveals 10–20 dB attenuation of high-frequency components with the decoupled actor.

Webots ACC Results

On the Webots ACC task, ZAPS-DA yields reward parity (+0.9\%), steering jitter reductions up to $45\times$ , and a task-failure rate reduction from $2.0\%$ to $\pi_\psi$ 0. This demonstrates that when environmental rewards are mildly smoothness-aligned, ZAPS-DA can deliver a strict Pareto improvement.

Ablation Studies

Smoothness–Reward Tradeoff

SG window sweep experiments confirm the existence of a Pareto frontier: increasing window size monotonically increases smoothness at the cost of reward until excessive windowing induces negative side-lobe artifacts, deteriorating both metrics. The necessity of narrower smoothing on throttle is empirically validated; over-smoothing this axis directly increases collision rates due to under-responsiveness.

Magnitude-Matching Portability

Optimizing the decoupled actor with SGD without magnitude-matching results in catastrophic training collapse, confirming the scaling mechanism’s essentiality for optimizer-agnostic deployment. In contrast, Adam's effective scale-invariance renders the matching step redundant, but including it incurs no negative side-effects.

Theoretical and Practical Implications

ZAPS-DA’s architecture formally demonstrates that distilling a non-causal oracle into a causally deployed policy is practical and effective for jitter suppression, without compromising step-wise reactiveness or incurring deployment latency. This decoupling paves the way for leveraging more sophisticated temporal-smoothing or lookahead supervision strategies—including learned kernel filters or offline optimal control signals—extending far beyond hand-crafted smoothness losses or action-conditioned architectures.

From a safety and reliability standpoint, the framework removes the need to artificially tune smoothness–reward coefficients or to develop architectures that maintain action history, offering a robust, interpretable, and easily retrofittable path toward actuator-friendly policy deployment in physical systems.

Limitations and Future Directions

The study is confined to SAC and feed-forward MLPs in simulation, with smoothness supervision restricted to a fixed SG filter. Generalization to other zero-lag temporal operators (e.g., Butterworth via filtfilt, learned filters), broader action domains (manipulation, legged locomotion), and deployment on real hardware remain open areas for future exploration. Real-world validation would need to account for sensor noise, actuator dynamics, and unmodeled degradation, potentially revealing new challenges or necessitating additional robustness mechanisms.

The potential interplay between the zero-phase filter and more complex reward structures or adversarial environments also warrants further investigation.

Conclusion

ZAPS-DA provides a rigorous and extensible methodological advance for mitigating action jitter in continuous-control policies trained by off-policy RL, achieving state-of-the-art suppression while retaining reward-optimality parity (or achieving a favorable smoothness–performance tradeoff). Its architectural principle—that non-causal supervision can be distilled via decoupled actors into causal, real-time policies—constitutes a generalizable paradigm for embedding trajectory-level structure into deployed agents. This decoupling enables flexible application of sophisticated temporal or domain-specific supervision during training, without sacrificing simplicity and responsiveness at deployment.

Markdown Report Issue