Segment Hysteresis RL for ISAC Systems
- Segment Hysteresis Based Reinforcement Learning is a novel algorithm that integrates a hysteresis mechanism to stabilize segment assignments in SWAN-ISAC systems.
- It employs an Advantage Actor-Critic framework within an MDP formulation to jointly optimize transmit beamforming, segment selection, and antenna positioning.
- The hysteresis mechanism mitigates rapid assignment oscillations and non-stationarity, yielding up to 70% higher reward ceilings compared to fixed-update strategies.
Segment Hysteresis Based Reinforcement Learning (SHRL) is a reinforcement learning algorithm specifically designed for integrated sensing and communication (ISAC) optimization within segmented waveguide-enabled pinching-antenna array (SWAN) systems. SHRL addresses the challenge of dynamic segment selection—critical for both communication throughput and sensing performance—by introducing a hysteresis mechanism that governs when segment allocations are updated, mitigating instability and improving overall system reward. The algorithm is formalized within a Markov decision process (MDP) framework, jointly optimizing transmit beamforming, segment selection, and antenna positioning (Gao et al., 28 Jan 2026).
1. Markov Decision Process Formulation
SHRL frames the SWAN-ISAC control problem as an MDP comprising the following components:
- State Space: At timestep t, the state sₜ comprises the instantaneous channel state information (CSI) for all communication users and sensing targets, the previous pinching-antenna pose vector ψ_{t−1}, and the last segment-selection assignments φ̃_{t−1}.
- Action Space: The agent outputs a tuple aₜ = (ψₜ, φₜ, Wₜ), with ψₜ the antenna positions, φₜ the raw segment-selection logits, and Wₜ the beamforming weights. Segment assignments are executed as φ̃ₜ after the raw logits pass through the hysteresis gate.
- Transition Dynamics: Executing (ψₜ, φ̃ₜ, Wₜ) updates the system, yielding new CSI, illumination, and data-rate metrics and the next state s_{t+1}. Channel evolution follows deterministic (e.g., line-of-sight) or stochastic models.
- Reward Function:
  rₜ = Σₖ Rₖ − Σⱼ 1[Iⱼ < Γ],
  where Rₖ is the communication rate for user k, Iⱼ is the expected illumination power for sensing target j, and Γ is an illumination threshold; each target that violates the threshold incurs a unit penalty.
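As a minimal sketch of this reward structure (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def isac_reward(rates, illum_powers, threshold):
    """Sum of per-user communication rates minus a unit penalty for each
    sensing target whose expected illumination power falls below the
    threshold. Names are illustrative placeholders."""
    comm_term = float(np.sum(rates))
    violations = float(np.sum(np.asarray(illum_powers) < threshold))
    return comm_term - violations

# Two users, two targets, one of which violates the threshold:
r = isac_reward(rates=[3.1, 2.4], illum_powers=[0.7, 0.3], threshold=0.5)  # 5.5 - 1 = 4.5
```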
2. Segment Hysteresis Mechanism
The defining feature of SHRL is the segment hysteresis mechanism. Rather than immediately reassigning segments upon every change in the policy network's output logits φₜ, SHRL applies a probabilistic or threshold-based gate. This mechanism stabilizes assignments and filters out minor or spurious fluctuations, suppressing rapid remapping:
- Probabilistic Gate: For each segment m,
  φ̃_{m,t} = φ_{m,t} with probability p_update, otherwise φ̃_{m,t} = φ̃_{m,t−1},
  where p_update regulates update aggressiveness; smaller values yield more persistent segment allocations.
- Threshold (Dead-band) Variant:
  φ̃_{m,t} = φ_{m,t} if |φ_{m,t} − φ̃_{m,t−1}| > δ, otherwise φ̃_{m,t} = φ̃_{m,t−1}.
  This enforces assignment updates only for changes exceeding a hysteresis gap δ.
This mechanism avoids structural oscillation in segment-to-antenna mapping, which alleviates non-stationarity in the environment as perceived by the RL agent.
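Both gate variants are only a few lines of code. The following NumPy sketch assumes per-segment logit vectors; all names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def probabilistic_gate(phi_new, phi_prev, p_update):
    """Keep each segment's previous logit with probability 1 - p_update;
    smaller p_update yields more persistent assignments."""
    keep = rng.random(phi_new.shape) >= p_update
    return np.where(keep, phi_prev, phi_new)

def deadband_gate(phi_new, phi_prev, delta):
    """Accept a new logit only when it differs from the stored one
    by more than the hysteresis gap delta."""
    return np.where(np.abs(phi_new - phi_prev) > delta, phi_new, phi_prev)
```

With p_update = 0 the probabilistic gate freezes assignments entirely; with p_update = 1 it reduces to plain per-step reassignment.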
3. Learning Algorithm and Update Rules
SHRL employs the Advantage Actor–Critic (A2C) framework:
- Actor (π_θ): Parameterized by θ, maps states sₜ to actions aₜ.
- Critic (V_ψ): Parameterized by ψ, estimates the state-value function V_ψ(sₜ).
Update rules are as follows:
- Critic Loss: L_critic(ψ) = (V_ψ(sₜ) − R̂ₜ)², with bootstrapped target R̂ₜ = rₜ + γ V_ψ(s_{t+1}).
- Actor Loss: L_actor(θ) = −log π_θ(aₜ|sₜ) · Aₜ, with advantage Aₜ = R̂ₜ − V_ψ(sₜ).
Gradients with respect to θ and ψ update the respective networks toward higher-reward policies.
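A scalar sketch of these quantities (the actual implementation uses neural networks and automatic differentiation; names here are illustrative):

```python
def a2c_losses(r_t, v_next, v_curr, log_prob, gamma=0.99):
    """Bootstrapped target, advantage, and the two A2C losses.
    v_next / v_curr stand in for critic evaluations V(s_{t+1}), V(s_t)."""
    target = r_t + gamma * v_next         # bootstrapped return R-hat
    advantage = target - v_curr           # advantage A_t
    critic_loss = (v_curr - target) ** 2  # squared TD error
    actor_loss = -log_prob * advantage    # policy-gradient surrogate
    return critic_loss, actor_loss

losses = a2c_losses(r_t=1.0, v_next=2.0, v_curr=1.5, log_prob=-1.0, gamma=0.5)  # (0.25, 0.5)
```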
Pseudocode for main steps (abridged):
```
Initialize θ, ψ; set memory φ̃₀ (e.g., uniform)
for episode = 1 … N_episodes:
    reset environment, observe s₀
    for t = 0 … T_max:
        aₜ = π_θ(sₜ)
        extract raw logits φₜ from aₜ
        for each segment m:
            with prob. p_update: φ̃_{m,t} = φ_{m,t}
            else:                φ̃_{m,t} = φ̃_{m,t-1}
        execute (ψₜ, φ̃ₜ, Wₜ) in env; observe rₜ, s_{t+1}
        R̂ₜ = rₜ + γ V_ψ(s_{t+1})
        Aₜ = R̂ₜ − V_ψ(sₜ)
        update critic via ∇_ψ (V_ψ(sₜ) − R̂ₜ)²
        update actor via ∇_θ [ −log π_θ(aₜ|sₜ) · Aₜ ]
return θ*, ψ*
```
4. Hyperparameters and Training Protocol
SHRL is trained using the following key hyperparameters and procedural setup:
| Parameter | Value/Range | Description |
|---|---|---|
| Actor LR | | Step size for actor network |
| Critic LR | | Step size for critic network |
| Discount (γ) | 0.99 | Return discount factor |
| Hysteresis (p_update) | {0.05, 0.1, 0.2} | Grid-searched gate update probability |
| Exploration | Additive Gaussian noise | Applied to actor outputs during training |
| Episode length | T_max steps | Per episode, or until fixed-geometry evaluation |
| Convergence detection | Reward improvement over 200 episodes | Moving-average (window = 50) |
Episodes are initialized with randomized user and target positions. Convergence is declared when the moving-average reward plateaus under a specified threshold for a sustained interval.
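The plateau test can be sketched as follows; the improvement threshold `eps` is an assumed placeholder, since the exact value is not reproduced here:

```python
import numpy as np

def converged(episode_rewards, window=50, span=200, eps=1e-2):
    """True once the moving-average reward (window=50) has improved by
    less than eps over the last `span` episodes."""
    if len(episode_rewards) < span + window:
        return False  # not enough history yet
    ma = np.convolve(episode_rewards, np.ones(window) / window, mode="valid")
    return bool(ma[-1] - ma[-span] < eps)
```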
5. Comparative Performance and Ablation
The effectiveness of SHRL is evaluated against several baselines: standard A2C (without hysteresis), SPRL (periodic segment-update every 5 steps), PPO (proximal policy optimization), and a random action agent.
Key evaluation highlights:
- SHRL achieved the highest final reward (≈13.2), outperforming A2C (≈9.9), SPRL (≈7.7), PPO (≈6.4), and random policies (≈3.0).
- SHRL converged within approximately 4000 episodes, exhibiting the smallest reward variance.
- The hysteresis mechanism prevented the reward collapses and non-stationarity seen in standard A2C/PPO.
- For the evaluated waveguide configuration, SHRL yielded a rate of 13.15 bps/Hz while meeting the illumination requirement, outperforming all tested alternatives; as the number of waveguides increased, SHRL continued to improve the rate-illumination trade-off.
- Compared to A2C, SHRL achieved approximately 30% higher final rate and required about 50% fewer episodes to converge.
- Compared to SPRL with fixed update intervals, SHRL’s adaptive hysteresis delivered a ~70% higher reward ceiling and smoother training.
6. Significance and Implications
Segment Hysteresis Based Reinforcement Learning introduces a principled module for stabilizing assignment dynamics in RL-based resource control for ISAC, particularly when assignment policies are sensitive to minor environmental or agent perturbations. The probabilistic hysteresis gate, governed by the update probability p_update, effectively reduces environmental non-stationarity as perceived by the RL agent. This yields efficient convergence and robust performance, particularly relevant in complex multi-objective wireless systems where aggressive segment remapping is inefficient or detrimental (Gao et al., 28 Jan 2026).
A plausible implication is that similar hysteresis primitives may generalize to other RL-controlled resource allocation problems plagued by instability or oscillatory structural decisions.
7. Related and Future Directions
SHRL represents a step beyond baseline A2C and PPO approaches by incorporating memory into the segment assignment policy via hysteresis. This approach is specifically tailored for SWAN-ISAC system design but may be extensible to other domains requiring persistent control actions or delayed adaptation. Future research may investigate the generalization of hysteresis-based RL modules to broader classes of resource allocation, the exploration of adaptive or learned hysteresis parameters, and integration with more advanced actor-critic architectures.