Segment Hysteresis RL for ISAC Systems
- Segment Hysteresis Based Reinforcement Learning is a novel algorithm that integrates a hysteresis mechanism to stabilize segment assignments in SWAN-ISAC systems.
- It employs an Advantage Actor-Critic framework within an MDP formulation to jointly optimize transmit beamforming, segment selection, and antenna positioning.
- The hysteresis mechanism mitigates rapid assignment oscillations and non-stationarity, yielding up to 70% higher reward ceilings compared to fixed-update strategies.
Segment Hysteresis Based Reinforcement Learning (SHRL) is a reinforcement learning algorithm specifically designed for integrated sensing and communication (ISAC) optimization within segmented waveguide-enabled pinching-antenna array (SWAN) systems. SHRL addresses the challenge of dynamic segment selection—critical for both communication throughput and sensing performance—by introducing a hysteresis mechanism that governs when segment allocations are updated, mitigating instability and improving overall system reward. The algorithm is formalized within a Markov decision process (MDP) framework, jointly optimizing transmit beamforming, segment selection, and antenna positioning (Gao et al., 28 Jan 2026).
1. Markov Decision Process Formulation
SHRL frames the SWAN-ISAC control problem as an MDP comprising the following components:
- State Space: At timestep t, the state sₜ comprises the instantaneous channel state information (CSI) for all communication users and sensing targets, the previous pinching-antenna pose vector ψ_{t−1}, and the last segment-selection assignments φ̃_{t−1}.
- Action Space: The agent outputs a tuple aₜ = (ψₜ, φₜ, Wₜ), with ψₜ the antenna positions, φₜ the raw segment-selection logits, and Wₜ the beamforming weights. Segment assignments are executed as φ̃ₜ after the raw logits pass through the hysteresis gate.
- Transition Dynamics: Executing (ψₜ, φ̃ₜ, Wₜ) updates the system, yielding new CSI, illumination, and data-rate metrics and the next state s_{t+1}. Channel evolution follows deterministic (e.g., line-of-sight) or stochastic models.
- Reward Function:
  rₜ = Σₖ Rₖ − Σⱼ 1[Iⱼ < Γ],
  where Rₖ is the communication rate for user k, Iⱼ is the expected illumination power for sensing target j, and Γ is an illumination threshold; each target that violates the threshold incurs a unit penalty.
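As a minimal sketch of this reward structure (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def isac_reward(rates, illum_powers, threshold):
    """Sum of per-user communication rates minus a unit penalty for each
    sensing target whose expected illumination power falls below the
    threshold. Names are illustrative placeholders."""
    comm_term = float(np.sum(rates))
    violations = float(np.sum(np.asarray(illum_powers) < threshold))
    return comm_term - violations

# Two users, two targets, one of which violates the threshold:
r = isac_reward(rates=[3.1, 2.4], illum_powers=[0.7, 0.3], threshold=0.5)  # 5.5 - 1 = 4.5
```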
2. Segment Hysteresis Mechanism
The defining feature of SHRL is the segment hysteresis mechanism. Rather than immediately reassigning segments upon every change in the policy network's output logits φₜ, SHRL applies a probabilistic or threshold-based gate. This mechanism stabilizes assignments and filters out minor or spurious fluctuations, suppressing rapid remapping:
- Probabilistic Gate: For each segment m,
  φ̃_{m,t} = φ_{m,t} with probability p_update, otherwise φ̃_{m,t} = φ̃_{m,t−1},
  where p_update regulates update aggressiveness; smaller values yield more persistent segment allocations.
- Threshold (Dead-band) Variant:
  φ̃_{m,t} = φ_{m,t} if |φ_{m,t} − φ̃_{m,t−1}| > δ, otherwise φ̃_{m,t} = φ̃_{m,t−1}.
  This enforces assignment updates only for changes exceeding a hysteresis gap δ.
This mechanism avoids structural oscillation in segment-to-antenna mapping, which alleviates non-stationarity in the environment as perceived by the RL agent.
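Both gate variants are only a few lines of code. The following NumPy sketch assumes per-segment logit vectors; all names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def probabilistic_gate(phi_new, phi_prev, p_update):
    """Keep each segment's previous logit with probability 1 - p_update;
    smaller p_update yields more persistent assignments."""
    keep = rng.random(phi_new.shape) >= p_update
    return np.where(keep, phi_prev, phi_new)

def deadband_gate(phi_new, phi_prev, delta):
    """Accept a new logit only when it differs from the stored one
    by more than the hysteresis gap delta."""
    return np.where(np.abs(phi_new - phi_prev) > delta, phi_new, phi_prev)
```

With p_update = 0 the probabilistic gate freezes assignments entirely; with p_update = 1 it reduces to plain per-step reassignment.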
3. Learning Algorithm and Update Rules
SHRL employs the Advantage Actor–Critic (A2C) framework:
- Actor (π_θ): Parameterized by θ, maps states sₜ to actions aₜ.
- Critic (V_ψ): Parameterized by ψ, estimates the state-value function V_ψ(sₜ).
Update rules are as follows:
- Critic Loss: L_critic(ψ) = (V_ψ(sₜ) − R̂ₜ)², with bootstrapped target R̂ₜ = rₜ + γ V_ψ(s_{t+1}).
- Actor Loss: L_actor(θ) = −log π_θ(aₜ|sₜ) · Aₜ, with advantage Aₜ = R̂ₜ − V_ψ(sₜ).
Gradients with respect to θ and ψ update the respective networks toward higher-reward policies.
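A scalar sketch of these quantities (the actual implementation uses neural networks and automatic differentiation; names here are illustrative):

```python
def a2c_losses(r_t, v_next, v_curr, log_prob, gamma=0.99):
    """Bootstrapped target, advantage, and the two A2C losses.
    v_next / v_curr stand in for critic evaluations V(s_{t+1}), V(s_t)."""
    target = r_t + gamma * v_next         # bootstrapped return R-hat
    advantage = target - v_curr           # advantage A_t
    critic_loss = (v_curr - target) ** 2  # squared TD error
    actor_loss = -log_prob * advantage    # policy-gradient surrogate
    return critic_loss, actor_loss

losses = a2c_losses(r_t=1.0, v_next=2.0, v_curr=1.5, log_prob=-1.0, gamma=0.5)  # (0.25, 0.5)
```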
Pseudocode for main steps (abridged):
```
Initialize θ, ψ; set memory φ̃₀ (e.g., uniform)
for episode = 1 … N_episodes:
    reset environment, observe s₀
    for t = 0 … T_max:
        aₜ = π_θ(sₜ)
        extract raw logits φₜ from aₜ
        for each segment m:
            with prob. p_update: φ̃_{m,t} = φ_{m,t}
            else:                φ̃_{m,t} = φ̃_{m,t-1}
        execute (ψₜ, φ̃ₜ, Wₜ) in env; observe rₜ, s_{t+1}
        R̂ₜ = rₜ + γ V_ψ(s_{t+1})
        Aₜ = R̂ₜ − V_ψ(sₜ)
        update critic via ∇_ψ (V_ψ(sₜ) − R̂ₜ)²
        update actor via ∇_θ [ −log π_θ(aₜ|sₜ) · Aₜ ]
return θ*, ψ*
```
4. Hyperparameters and Training Protocol
SHRL is trained using the following key hyperparameters and procedural setup:
| Parameter | Value/Range | Description |
|---|---|---|
| Actor LR | | Step size for actor network |
| Critic LR | | Step size for critic network |
| Discount (γ) | 0.99 | Return discount factor |
| Hysteresis (p_update) | {0.05, 0.1, 0.2} | Grid-searched gate update probability |
| Exploration | Additive Gaussian noise | Applied to actor outputs during training |
| Episode length | T_max steps | Per episode, or until fixed-geometry evaluation |
| Convergence detection | Reward improvement over 200 episodes | Moving-average (window = 50) |
Episodes are initialized with randomized user and target positions. Convergence is declared when the moving-average reward plateaus under a specified threshold for a sustained interval.
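The plateau test can be sketched as follows; the improvement threshold `eps` is an assumed placeholder, since the exact value is not reproduced here:

```python
import numpy as np

def converged(episode_rewards, window=50, span=200, eps=1e-2):
    """True once the moving-average reward (window=50) has improved by
    less than eps over the last `span` episodes."""
    if len(episode_rewards) < span + window:
        return False  # not enough history yet
    ma = np.convolve(episode_rewards, np.ones(window) / window, mode="valid")
    return bool(ma[-1] - ma[-span] < eps)
```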
5. Comparative Performance and Ablation
The effectiveness of SHRL is evaluated against several baselines: standard A2C (without hysteresis), SPRL (periodic segment-update every 5 steps), PPO (proximal policy optimization), and a random action agent.
Key evaluation highlights:
- SHRL achieved the highest final reward (≈13.2), outperforming A2C (≈9.9), SPRL (≈7.7), PPO (≈6.4), and random policies (≈3.0).
- SHRL converged within approximately 4000 episodes, exhibiting the smallest reward variance.
- The hysteresis mechanism prevented the reward collapses and non-stationarity seen in standard A2C/PPO.
- For the evaluated waveguide configuration, SHRL yielded a rate of 13.15 bps/Hz while meeting the illumination requirement, outperforming all tested alternatives; as the number of waveguides increased, SHRL continued to improve the rate-illumination trade-off.
- Compared to A2C, SHRL achieved approximately 30% higher final rate and required about 50% fewer episodes to converge.
- Compared to SPRL with fixed update intervals, SHRL’s adaptive hysteresis delivered a ~70% higher reward ceiling and smoother training.
6. Significance and Implications
Segment Hysteresis Based Reinforcement Learning introduces a principled module for stabilizing assignment dynamics in RL-based resource control for ISAC, particularly when assignment policies are sensitive to minor environmental or agent perturbations. The probabilistic hysteresis gate, governed by the update probability p_update, effectively reduces environmental non-stationarity as perceived by the RL agent. This yields efficient convergence and robust performance, particularly relevant in complex multi-objective wireless systems where aggressive segment remapping is inefficient or detrimental (Gao et al., 28 Jan 2026).
A plausible implication is that similar hysteresis primitives may generalize to other RL-controlled resource allocation problems plagued by instability or oscillatory structural decisions.
7. Related and Future Directions
SHRL represents a step beyond baseline A2C and PPO approaches by incorporating memory into the segment assignment policy via hysteresis. This approach is specifically tailored for SWAN-ISAC system design but may be extensible to other domains requiring persistent control actions or delayed adaptation. Future research may investigate the generalization of hysteresis-based RL modules to broader classes of resource allocation, the exploration of adaptive or learned hysteresis parameters, and integration with more advanced actor-critic architectures.