Weighted MMSE-DDPG for PA Placement & Beamforming
- The paper presents a hybrid framework combining the classical WMMSE method with DDPG to jointly optimize PA placement and beamforming in blockage-rich environments.
- It leverages deterministic obstacle modeling and a black-box DDPG actor to adaptively deploy pinching antennas while satisfying rate and power constraints.
- Empirical results demonstrate rapid convergence and improved sum-rate throughput, highlighting the framework's potential for advanced indoor wireless network design.
Weighted Minimum Mean Square Error Integrated Deep Deterministic Policy Gradient (WMMSE-DDPG) is a hybrid optimization framework designed for the joint placement and beamforming of pinching-antenna (PA) systems in indoor wireless environments that feature line-of-sight (LoS) blockages. The approach leverages the deterministic modeling of obstacles and integrates the classical Weighted Minimum Mean Square Error (WMMSE) methodology with the Deep Deterministic Policy Gradient (DDPG) algorithm from deep reinforcement learning, effectively addressing the non-smooth transitions induced by binary blockage conditions. The interplay allows for adaptive, blockage-aware deployment of PAs and beam patterns that maximize throughput while respecting rate and power constraints (Xie et al., 3 Jan 2026).
1. Problem Formulation and Reformulation
The primary objective is the maximization of the aggregate user sum-rate under physical and quality-of-service (QoS) constraints:

$$\max_{\mathbf{x}, \mathbf{W}} \; \sum_{k=1}^{K} R_k(\mathbf{x}, \mathbf{W})$$

subject to $R_k \geq R_{\min}$ for all $k$, $\sum_k \|\mathbf{w}_k\|^2 \leq P_{\max}$, and $x_n \in [0, L]$, where $\mathbf{x}$ denotes the horizontal positions of the PAs along the waveguide and $\mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_K]$ the beamformers.
This is reformulated as a WMMSE minimization:

$$\min_{\mathbf{x}, \mathbf{W}, \{u_k\}, \{\lambda_k\}} \; \sum_{k=1}^{K} \left( \lambda_k e_k - \log \lambda_k \right)$$

subject to the same power and placement constraints. The mean-square error per user $e_k$ captures the PA-dependent channel characteristics:

$$e_k = \left|1 - u_k^{*}\, \mathbf{h}_k^{H}(\mathbf{x})\, \mathbf{w}_k\right|^2 + \sum_{j \neq k} \left|u_k^{*}\, \mathbf{h}_k^{H}(\mathbf{x})\, \mathbf{w}_j\right|^2 + \sigma^2 |u_k|^2,$$

where $\mathbf{h}_k(\mathbf{x})$ encodes the blockage-aware channel structure through the deterministic LoS-blockage indicator $b_k(\mathbf{x}) \in \{0, 1\}$.
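To make the deterministic blockage model concrete, the sketch below builds a blockage-aware channel vector for one user from candidate PA positions along a waveguide. It is a minimal illustration only: the free-space phase/attenuation form, the segment-circle blockage test, and all numerical parameters (wavelength, mounting height) are assumptions, not values from the paper.

```python
import numpy as np

def los_blocked(pa_xy, user_xy, centers, radii):
    """Deterministic LoS indicator: 1 if the PA-user segment misses
    every circular obstacle, else 0."""
    p, u = np.asarray(pa_xy, float), np.asarray(user_xy, float)
    d = u - p
    for c, r in zip(centers, radii):
        c = np.asarray(c, float)
        # Closest point on the segment p -> u to the obstacle center c.
        t = np.clip(np.dot(c - p, d) / np.dot(d, d), 0.0, 1.0)
        if np.linalg.norm(p + t * d - c) < r:
            return 0
    return 1

def channel(pa_positions, user_xy, centers, radii,
            wavelength=0.01, height=3.0):
    """Blockage-aware LoS channel vector h_k(x); wavelength and
    ceiling height are illustrative assumptions."""
    h = np.zeros(len(pa_positions), dtype=complex)
    for n, x in enumerate(pa_positions):
        pa_xy = np.array([x, 0.0])       # PA on a waveguide along the x-axis
        b = los_blocked(pa_xy, user_xy, centers, radii)
        dist = np.sqrt(np.sum((pa_xy - np.asarray(user_xy, float)) ** 2)
                       + height ** 2)
        h[n] = b * np.exp(-2j * np.pi * dist / wavelength) / dist
    return h
```

An obstacle sitting on the PA-user segment zeroes the corresponding channel entry, which is exactly the discontinuity that motivates the DDPG treatment below.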
2. Integration of DDPG for Non-Smooth Placement Optimization
The non-smooth, discontinuous dependence of $\mathbf{h}_k(\mathbf{x})$ on LoS connectivity (due to binary blockages) renders gradient-based policy optimization ineffective. DDPG, a model-free off-policy actor-critic algorithm for continuous action spaces, is used to treat the PA placement as a black-box control task. The DDPG module defines:
- State Space $\mathcal{S}$: $\mathbf{s} = [\text{user coordinates}, \text{obstacle centers}, \text{obstacle radii}]$, optionally augmented with the previous PA positions $\mathbf{x}_{\text{prev}}$.
- Action Space $\mathcal{A}$: $\mathbf{a} = \mathbf{x} \in [0, L]^{N}$, with the actor outputting continuous waveguide positions for each PA.
- Reward Function:
$$r(\mathbf{s}, \mathbf{a}) = \sum_{k} R_k\big(\mathbf{x}, \mathbf{W}^{\star}(\mathbf{x})\big) - \beta \sum_{k} \big[R_{\min} - R_k\big]_{+},$$
where the hinge term applies a soft penalty, weighted by $\beta > 0$, for sub-threshold rates.
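The soft QoS penalty can be sketched as a hinge on the per-user rate shortfall; `beta` here is an assumed penalty weight (the source anneals it over training):

```python
import numpy as np

def qos_reward(rates, r_min, beta=1.0):
    """Sum-rate reward minus a soft hinge penalty on QoS violations.
    beta is an assumed penalty weight, annealed during training."""
    rates = np.asarray(rates, dtype=float)
    shortfall = np.maximum(r_min - rates, 0.0)   # [R_min - R_k]_+
    return float(rates.sum() - beta * shortfall.sum())
```

Because the penalty is soft rather than a hard constraint, the reward stays finite for infeasible placements, which keeps the critic's regression target well behaved.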
3. WMMSE Algorithm as a Beamforming Subroutine
For any fixed PA configuration $\mathbf{x}$, beamforming is solved via standard WMMSE iterations internal to each DDPG step:
- Equalizer Update:
$$u_k = \frac{\mathbf{h}_k^{H} \mathbf{w}_k}{\sum_j |\mathbf{h}_k^{H} \mathbf{w}_j|^2 + \sigma^2}$$
- Weight Update:
$$\lambda_k = e_k^{-1}, \quad \text{with } e_k = 1 - \operatorname{Re}\!\left(u_k^{*}\, \mathbf{h}_k^{H} \mathbf{w}_k\right)$$
- Beamformer Update:
$$\mathbf{w}_k = \lambda_k u_k \left( \sum_j \lambda_j |u_j|^2\, \mathbf{h}_j \mathbf{h}_j^{H} + \mu \mathbf{I} \right)^{-1} \mathbf{h}_k,$$
subject to dual-variable updates enforcing the constraints: $\mu \geq 0$ is chosen (e.g., by bisection) so that $\sum_k \|\mathbf{w}_k\|^2 \leq P_{\max}$ holds with complementary slackness, and per-user multipliers enforce the rate constraints $R_k \geq R_{\min}$.
Iterations continue until convergence of the weighted-MSE objective and the rates $R_k$.
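One such iteration can be sketched in numpy for a MISO downlink with equal user priorities. This is a textbook WMMSE step under those assumptions, not the paper's exact implementation; the power-constraint dual variable is bracketed and then bisected.

```python
import numpy as np

def wmmse_step(H, W, sigma2, p_max):
    """One WMMSE iteration for a MISO downlink with equal priorities.
    H: (K, N) matrix whose k-th row is h_k^H; W: (N, K) beamformers
    w_k as columns. The dual variable mu for the sum-power constraint
    sum_k ||w_k||^2 <= p_max is found by bisection."""
    K, N = H.shape
    G = H @ W                                    # G[k, j] = h_k^H w_j
    # Equalizer: u_k = h_k^H w_k / (sum_j |h_k^H w_j|^2 + sigma2)
    u = np.diag(G) / (np.sum(np.abs(G) ** 2, axis=1) + sigma2)
    # Weight: lambda_k = 1/e_k with MMSE e_k = 1 - Re(u_k^* h_k^H w_k)
    lam = 1.0 / (1.0 - np.real(np.conj(u) * np.diag(G)))
    # Beamformer: w_k = lam_k u_k (A + mu I)^{-1} h_k
    A = (H.conj().T * (lam * np.abs(u) ** 2)) @ H
    def solve(mu):
        return np.linalg.inv(A + mu * np.eye(N)) @ (H.conj().T * (lam * u))
    hi = 1.0
    while np.sum(np.abs(solve(hi)) ** 2) > p_max:  # bracket the dual variable
        hi *= 2.0
    lo = 0.0
    for _ in range(50):                            # bisection on mu
        mid = 0.5 * (lo + hi)
        if np.sum(np.abs(solve(mid)) ** 2) > p_max:
            lo = mid
        else:
            hi = mid
    return solve(hi)
```

Repeating `wmmse_step` until the weighted-MSE objective stabilizes yields the beamformers $\mathbf{W}^{\star}(\mathbf{x})$ that feed the DDPG reward.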
4. DDPG Network Architecture and Training Procedure
The WMMSE-DDPG scheme employs neural networks for both actor and critic:
- Actor $\pi_{\theta}(\mathbf{s})$:
  - Input size $\dim(\mathbf{s})$
  - Two hidden layers (256 ReLU units each)
  - Output: $N$ continuous values, mapped to PA positions in $[0, L]$ via a bounded activation (e.g., scaled $\tanh$)
- Critic $Q_{\phi}(\mathbf{s}, \mathbf{a})$:
  - Input: concatenated $(\mathbf{s}, \mathbf{a})$
  - Two hidden layers (256 ReLU units each)
  - Output: scalar Q-value
Training per DDPG step:
- Critic update: minimize the squared error against the one-step target,
$$\mathcal{L}(\phi) = \mathbb{E}\left[\left(Q_{\phi}(\mathbf{s}, \mathbf{a}) - y\right)^{2}\right], \qquad y = r,$$
where the target reduces to the immediate reward in the single-step setting.
- Actor update: ascend the deterministic policy gradient,
$$\nabla_{\theta} J = \mathbb{E}\left[\left.\nabla_{\mathbf{a}} Q_{\phi}(\mathbf{s}, \mathbf{a})\right|_{\mathbf{a} = \pi_{\theta}(\mathbf{s})} \nabla_{\theta} \pi_{\theta}(\mathbf{s})\right].$$
Gaussian or Ornstein–Uhlenbeck noise is added for exploration. The reward is episodic and essentially single-step (contextual bandit), with or without target networks.
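The actor's forward pass under the layer sizes above can be sketched in numpy. The scaled-tanh mapping to the waveguide interval and all dimensions here are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """He-initialized weight/bias pairs for a ReLU MLP."""
    return [(rng.normal(0.0, np.sqrt(2.0 / m), (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def actor_forward(params, s, waveguide_len):
    """pi_theta(s): two 256-unit ReLU layers; a scaled-tanh head
    (an assumption) maps the output to PA positions in (0, L)."""
    h = np.asarray(s, dtype=float)
    for Wt, b in params[:-1]:
        h = np.maximum(h @ Wt + b, 0.0)        # ReLU hidden layers
    Wt, b = params[-1]
    return 0.5 * waveguide_len * (np.tanh(h @ Wt + b) + 1.0)

STATE_DIM, N_PAS, L_WG = 12, 4, 10.0           # illustrative dimensions
actor = init_mlp([STATE_DIM, 256, 256, N_PAS])
a = actor_forward(actor, rng.normal(size=STATE_DIM), L_WG)
```

The bounded head guarantees feasible actions by construction, so no projection step is needed after exploration noise is clipped.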
5. Algorithm Pseudocode and Workflow
The high-level workflow proceeds as follows:
- Initialize actor $\pi_{\theta}$, critic $Q_{\phi}$, and replay buffer $\mathcal{D}$.
- For each episode:
  - Observe the obstacle and user layout (state $\mathbf{s}$).
  - Actor outputs a noisy position proposal $\mathbf{a} = \pi_{\theta}(\mathbf{s}) + \mathbf{n}$.
  - Construct the blockage-aware channels $\mathbf{h}_k(\mathbf{a})$.
  - Run WMMSE to solve for the beamformers $\mathbf{W}^{\star}(\mathbf{a})$.
  - Compute and store the one-step reward $r$.
  - Sample a minibatch from $\mathcal{D}$ and update the critic and actor networks.
  - Optionally update target networks.
- Repeat for the prescribed number of episodes.
Parameter settings from the source fix the learning rates and minibatch size, with a bounded output activation enforcing the spatial constraints $x_n \in [0, L]$.
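The single-step episode structure above can be sketched as follows. The state, action dimension, and reward here are stand-ins (the quadratic `evaluate` replaces the actual WMMSE solve), purely to show the collect-and-replay pattern:

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(1)

def run_episode(policy, evaluate, buffer, noise_std=0.5, l_wg=10.0):
    """One single-step (contextual-bandit) episode: observe a layout,
    propose noisy PA positions, score them, store the transition."""
    s = rng.normal(size=6)                        # user/obstacle layout (stub)
    a = np.clip(policy(s) + rng.normal(0.0, noise_std, 2), 0.0, l_wg)
    r = evaluate(s, a)                            # stands in for the WMMSE solve
    buffer.append((s, a, r))
    return r

replay = deque(maxlen=10_000)                     # replay buffer D
policy = lambda s: np.full(2, 5.0)                # placeholder actor
evaluate = lambda s, a: -float(np.sum((a - 4.0) ** 2))  # stand-in reward
for _ in range(32):
    run_episode(policy, evaluate, replay)
```

Because each episode is one step, transitions are (state, action, reward) triples with no bootstrapped next-state term, which is what lets the critic target collapse to the immediate reward.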
6. Empirical Convergence and Implementation Considerations
Simulation experiments demonstrate rapid convergence of both actor and critic losses, with the sum-rate reward stabilizing after several thousand gradient steps. Key practical strategies include:
- Pre-computation of obstacle blocking maps for candidate PA positions to accelerate the channel model.
- Warm-starting the WMMSE beamforming subroutine from the previous solution to reduce computation.
- Annealing the penalty weight $\beta$ in the reward for stricter QoS enforcement over training.
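The blocking-map pre-computation can be sketched as a one-time sweep over candidate positions, after which the channel model reduces to a table lookup. The grid, geometry, and segment-circle test are illustrative assumptions:

```python
import numpy as np

def blockage_map(grid_x, user_xy, centers, radii):
    """LoS indicator b(x) in {0, 1} for each candidate PA position x
    on the grid (PA assumed on a waveguide along the x-axis)."""
    u = np.asarray(user_xy, float)
    out = np.ones(len(grid_x), dtype=int)
    for i, x in enumerate(grid_x):
        p = np.array([x, 0.0])
        d = u - p
        for c, r in zip(centers, radii):
            c = np.asarray(c, float)
            # Distance from obstacle center to the PA-user segment.
            t = np.clip(np.dot(c - p, d) / np.dot(d, d), 0.0, 1.0)
            if np.linalg.norm(p + t * d - c) < r:
                out[i] = 0
                break
    return out

grid = np.linspace(0.0, 10.0, 101)
bmap = blockage_map(grid, [5.0, 4.0], [[5.0, 2.0]], [0.4])
```

Since the obstacle layout is fixed within an episode, this map is computed once per layout and reused across every reward evaluation.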
Handling abrupt changes in LoS connectivity, where small positional shifts toggle blockage, relies on the black-box nature of the WMMSE-DDPG alternation: the DDPG actor learns non-smooth placement policies through experience replay and bounded activation squashing.
7. Context and Significance
WMMSE-DDPG transforms PA deployment in blockage-rich indoor wireless environments. By encapsulating the rapidly convergent WMMSE beamforming subroutine within a DDPG agent, the methodology enables direct learning of PA placement policies that exploit the deterministic obstacle layout. Notably, simulation results in the referenced work (Xie et al., 3 Jan 2026) indicate significant improvements in system throughput and LoS connectivity over baseline approaches. Additionally, pinching-antenna systems can harness physical obstacles to attenuate co-channel interference, thus converting blockages from liabilities into strategic assets. The framework addresses the core optimization challenge of jointly allocating spatial and signal-processing resources in the presence of discrete, combinatorial environmental effects. A plausible implication is its applicability to other blockage-aware network design problems where gradient-based methods are ineffective due to non-smooth physical constraints.