Voxel-Level A3C Module for 3D Segmentation

Updated 14 January 2026

The paper presents a voxel-level A3C method that treats each voxel as an independent agent using composite rewards to correct noisy labels.
It leverages a shared encoder with multiple decoders to output segmentation, value, and policy predictions, ensuring stable asynchronous updates.
Empirical results demonstrate Dice score improvements of up to +3.72% on challenging datasets, confirming the module's robust performance.

The voxel-level Asynchronous Advantage Actor-Critic (vA3C) module is a reinforcement learning component formulated for robust 3D medical image segmentation in the presence of noisy annotations. It is a core innovation of the Staged Voxel-Level Deep Reinforcement Learning (SVL-DRL) framework, which frames each image voxel as an autonomous agent operating asynchronously in parallel. Unlike conventional sample-level or patch-level denoising approaches, vA3C exploits local policy adjustments driven by composite rewards that fuse segmentation accuracy metrics with anatomical constraints, thereby incrementally rectifying labeling inaccuracies at voxel resolution (Fu et al., 7 Jan 2026).

1. Voxel-wise Reinforcement Learning Agent Formulation

In the vA3C formulation, each voxel $i$ of a 3D image volume $I \in \mathbb{R}^{H \times W \times D}$, which contains $N = H \cdot W \cdot D$ voxels, is modeled as an independent agent. The state of agent $i$ at step $t$ is denoted $s_i^{(t)}$. It is initialized to the voxel's raw intensity $I_i$ and, in practice, set to the feature vector at position $i$ of a global feature map $F^{(t)}$ output by a shared encoder, so that it captures both local and contextual neighborhood information. The collective system state is $S^{(t)} = (s_1^{(t)}, \ldots, s_N^{(t)})$, allowing all voxels to update policy-relevant representations simultaneously.

2. Discrete Action Space and Voxel Manipulations

Each voxel-agent samples from a finite action set $\mathcal{A} = \{0, 1, 2\}$ :

$a=0$ : do nothing,
$a=1$ : enhance tissue/lesion,
$a=2$ : weaken tissue/lesion.

The selected action modulates the current voxel value via: $I_{new} = \begin{cases} I_{orig} & a=0 \\ \operatorname{clip}(I_{orig}\cdot(1+0.3\cdot\epsilon),0,1) & a=1 \\ \operatorname{clip}(I_{orig}\cdot(1-0.3\cdot\epsilon),0,1) & a=2 \end{cases}$ where $\epsilon \sim \operatorname{Uniform}(0,1)$. This formulation imposes controlled perturbations, simulating potential medical effects and facilitating error correction in the segmentation process.
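As an illustration, the update rule above can be sketched in plain Python for intensities normalized to $[0, 1]$. The function names `apply_action` and `apply_actions`, the flat-list volume layout, and the `scale` parameter name are assumptions made for this sketch; only the factor 0.3, the uniform noise, and the clipping range come from the formula.

```python
import random


def apply_action(intensity, action, rng=random, scale=0.3):
    """Perturb one normalized voxel intensity according to a discrete action.

    action 0: leave unchanged, 1: enhance, 2: weaken.
    `scale` mirrors the 0.3 factor in the formula; the name is hypothetical.
    """
    if action == 0:
        return intensity
    eps = rng.random()  # epsilon drawn uniformly from [0, 1)
    if action == 1:
        new_value = intensity * (1.0 + scale * eps)
    elif action == 2:
        new_value = intensity * (1.0 - scale * eps)
    else:
        raise ValueError(f"unknown action: {action}")
    return min(max(new_value, 0.0), 1.0)  # clip to [0, 1]


def apply_actions(volume, actions, rng=random):
    """Apply one action per voxel; both arguments are flat lists of length N."""
    return [apply_action(v, a, rng) for v, a in zip(volume, actions)]
```

For example, `apply_actions([0.5, 0.5, 0.5], [0, 1, 2], random.Random(0))` leaves the first voxel at 0.5, raises the second to at most 0.65, and lowers the third to no less than 0.35.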

3. Shared-Encoder and Multi-Decoder Architecture

The network backbone consists of a shared Swin-Unetr encoder, which generates hierarchical feature maps from the entire input volume $X^{(t)}$ . This is followed by three parallel decoder branches:

Segmentation head $f_{\theta_s}$ : produces probability map $P^{(t)}$ for voxel-wise segmentation.
Value head $V_{\theta_v}$ : estimates the expected return $V(s^{(t)})$ for each voxel.
Policy head $\pi_{\theta_p}$ : outputs action logits for each voxel, parameterizing $\pi(a|s)$ .

Encoder parameters are shared, while decoders for segmentation, value, and policy are independent, collectively parameterized by $\theta = \{\theta_s, \theta_v, \theta_p\}$ .

4. Voxel-Adaptive A3C: Workflow and Loss Formulations

The conventional A3C framework—where multiple actors interact with independent environments—is adapted such that every voxel is an agent in a shared environment, executed in parallel in each forward pass. Training proceeds in three main stages:

Warmup: Pure supervised segmentation,
Transition: Mixed Dice and value-MSE loss,
Full RL: RL loss with segmentation, value, and policy terms.

At each RL step:

Actions are sampled per voxel: $a_i^{(t)} \sim \pi_{\theta_p}(\cdot|s_i^{(t)};\tau)$ .
The action effect is applied to yield $X^{(t+1)}$ .
Step rewards $r_i^{(t)}$ are computed (see Section 5).
$n$ -step returns are accumulated: $R^{(t)} = \sum_{k=0}^{n-1} \gamma^k \bar{r}^{(t+k)}$ , where $\bar{r}^{(t)} = N^{-1}\sum_i r_i^{(t)}$ .
Advantages $A^{(t)} = R^{(t)} - V_{\theta_v}(s^{(t)})$ are used in policy gradients.
Gradients for segmentation, value, and policy loss are aggregated and used to update the parameters.
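The return and advantage computations in the steps above can be sketched as follows. This is a minimal illustration on scalar quantities: the helper names are hypothetical, the discount factor `gamma` is not specified in the text and must be chosen by the user, and, following the formula as written, no bootstrapped value term is appended to the window.

```python
def mean_reward(voxel_rewards):
    """Average per-voxel rewards r_i into the scalar r_bar for one step."""
    return sum(voxel_rewards) / len(voxel_rewards)


def n_step_return(mean_rewards, gamma):
    """Compute R = sum_k gamma^k * r_bar^(t+k) over the supplied window.

    `mean_rewards` holds r_bar for steps t, t+1, ..., t+n-1 in order.
    """
    ret = 0.0
    for r in reversed(mean_rewards):  # Horner-style accumulation
        ret = r + gamma * ret
    return ret


def advantage(ret, value):
    """A = R - V(s), the quantity that scales the policy gradient."""
    return ret - value
```

With mean rewards `[1.0, 0.5, 0.25]` and `gamma=0.5`, the return is $1 + 0.5 \cdot 0.5 + 0.25 \cdot 0.25 = 1.3125$; a value estimate of 1.0 then gives an advantage of 0.3125.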

The staged loss functions are

$\begin{aligned} \mathcal{L}_{warm} &= 1 - \operatorname{Dice}(P, G) \\ \mathcal{L}_{trans} &= (1-\lambda)\mathcal{L}_{dice} + \lambda (R - V(s))^2 \\ \mathcal{L}_{full} &= (1-\alpha-\beta)\mathcal{L}_{dice} + \alpha (R-V(s))^2 + \beta ( -\log \pi(a|s) A ) \end{aligned}$

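The weights are set to $\alpha = 0.2$, $\beta = 0.2$, and $\lambda = 0.3$, with $\mathcal{L}_{dice}$ denoting the Dice loss $1 - \operatorname{Dice}(P, G)$ used in the warmup stage.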
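To make the weighting concrete, the three staged losses can be evaluated on scalar inputs as in the sketch below. The function name `staged_loss`, the stage labels, and the use of scalars are assumptions for illustration; in a real model these terms would be per-voxel tensors reduced over the volume.

```python
ALPHA, BETA, LAMBDA = 0.2, 0.2, 0.3  # weights reported in the text


def staged_loss(stage, dice, ret=0.0, value=0.0, log_prob=0.0, adv=0.0):
    """Return the scalar loss for one training stage.

    dice     : Dice(P, G) between prediction and label
    ret      : n-step return R
    value    : critic estimate V(s)
    log_prob : log pi(a|s) of the action taken
    adv      : advantage A, treated as a constant
    """
    l_dice = 1.0 - dice
    value_err = (ret - value) ** 2
    if stage == "warmup":
        return l_dice
    if stage == "transition":
        return (1.0 - LAMBDA) * l_dice + LAMBDA * value_err
    if stage == "full":
        return ((1.0 - ALPHA - BETA) * l_dice
                + ALPHA * value_err
                + BETA * (-log_prob * adv))
    raise ValueError(f"unknown stage: {stage}")
```

For `dice=0.8`, `ret=1.0`, `value=0.5`, `log_prob=-1.0`, and `adv=0.5`, the losses are approximately 0.2 (warmup), $0.7 \cdot 0.2 + 0.3 \cdot 0.25 = 0.215$ (transition), and $0.6 \cdot 0.2 + 0.2 \cdot 0.25 + 0.2 \cdot 0.5 = 0.27$ (full RL).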

5. Advantage Policy Gradient and Composite Reward Function

The advantage function follows the A3C standard: $A(s^{(t)}, a^{(t)}) = R^{(t)} - V(s^{(t)})$ . The policy gradient for voxel $i$ is

$\nabla_{\theta_p} \mathcal{L}_{policy} = -\nabla_{\theta_p} [ \log \pi_{\theta_p}(a_i^{(t)}|s_i^{(t)}) A(s_i^{(t)}, a_i^{(t)}) ]$

aggregated over all voxels and $n$ -step windows.
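For a single voxel with a softmax policy over the three actions, the gradient of $-\log \pi(a|s)\,A$ with respect to the logits has the closed form $A\,(\pi - \mathbf{1}_a)/\tau$. The sketch below illustrates this under two assumptions: that the temperature $\tau$ divides the logits before the softmax, and that the advantage is treated as a constant. All function names are hypothetical.

```python
import math
import random


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp((z - m) / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng=random):
    """Draw one action index according to the policy probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def policy_loss_and_grad(logits, action, adv, tau=1.0):
    """Loss -log pi(a|s) * A and its gradient with respect to the logits."""
    probs = softmax(logits, tau)
    loss = -math.log(probs[action]) * adv
    grad = [adv * (p - (1.0 if j == action else 0.0)) / tau
            for j, p in enumerate(probs)]
    return loss, grad
```

With uniform logits `[0.0, 0.0, 0.0]`, action 1, and advantage 2.0, the loss is $2\ln 3$ and the gradient is $(2/3, -4/3, 2/3)$: a positive advantage pushes the chosen action's logit up under gradient descent and the others down.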

The reward at each step is: $r^{(t)} = \Delta\operatorname{Dice} + \mathcal{C}$ where

$\Delta\operatorname{Dice} = \operatorname{Dice}(f^{(t)}, G) - \operatorname{Dice}(f^{(t-1)}, G)$

and the anatomical constraint

$\mathcal{C}(f) = \max(N_{cc}(f) - 1, 0) + \sum_{i,j} |\nabla f_{i,j}|$

with $N_{cc}(f)$ the number of connected components in the binary mask and $\sum_{i,j} |\nabla f_{i,j}|$ the total variation of the segmentation. Lower values of $\mathcal{C}$ correspond to more anatomically plausible masks, since both terms grow as the output becomes fragmented or irregular.
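The reward components can be sketched on small binary masks stored as nested lists indexed `[z][y][x]`. Several choices here are assumptions rather than details from the source: 6-connectivity for component counting, total variation measured as the sum of absolute differences between axis-adjacent voxels, a Dice of 1.0 for two empty masks, and, most importantly, the `constraint_sign=-1.0` default, which treats $\mathcal{C}$ as a penalty in line with the statement that lower values are preferable.

```python
from collections import deque

STEPS = ((1, 0, 0), (0, 1, 0), (0, 0, 1))
NEIGHBORS = STEPS + tuple((-a, -b, -c) for a, b, c in STEPS)


def dice(pred, gt):
    """Dice overlap of two binary masks of identical shape."""
    inter = total = 0
    for p_slice, g_slice in zip(pred, gt):
        for p_row, g_row in zip(p_slice, g_slice):
            for p, g in zip(p_row, g_row):
                inter += p * g
                total += p + g
    return 1.0 if total == 0 else 2.0 * inter / total


def _shape(mask):
    return len(mask), len(mask[0]), len(mask[0][0])


def count_components(mask):
    """Number of 6-connected foreground components (breadth-first search)."""
    d, h, w = _shape(mask)
    seen, count = set(), 0
    for z in range(d):
        for y in range(h):
            for x in range(w):
                if not mask[z][y][x] or (z, y, x) in seen:
                    continue
                count += 1
                seen.add((z, y, x))
                queue = deque([(z, y, x)])
                while queue:
                    cz, cy, cx = queue.popleft()
                    for dz, dy, dx in NEIGHBORS:
                        n = (cz + dz, cy + dy, cx + dx)
                        inside = 0 <= n[0] < d and 0 <= n[1] < h and 0 <= n[2] < w
                        if inside and n not in seen and mask[n[0]][n[1]][n[2]]:
                            seen.add(n)
                            queue.append(n)
    return count


def total_variation(mask):
    """Sum of absolute differences between axis-adjacent voxels."""
    d, h, w = _shape(mask)
    tv = 0
    for z in range(d):
        for y in range(h):
            for x in range(w):
                for dz, dy, dx in STEPS:
                    nz, ny, nx = z + dz, y + dy, x + dx
                    if nz < d and ny < h and nx < w:
                        tv += abs(mask[nz][ny][nx] - mask[z][y][x])
    return tv


def anatomical_constraint(mask):
    """C(f) = max(N_cc(f) - 1, 0) + total variation."""
    return max(count_components(mask) - 1, 0) + total_variation(mask)


def step_reward(prev, cur, gt, constraint_sign=-1.0):
    """Delta-Dice plus the signed constraint term (the sign is an assumption)."""
    return dice(cur, gt) - dice(prev, gt) + constraint_sign * anatomical_constraint(cur)
```

On a one-row volume, the fragmented mask `[[[1, 0, 1, 0, 0]]]` has two components and a total variation of 3, so $\mathcal{C} = 4$, whereas the contiguous mask `[[[1, 1, 1, 0, 0]]]` has $\mathcal{C} = 1$.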

6. Empirical Validation: Ablation and Robustness under Noisy Annotations

Ablation studies (Tables 7–8 in the source) quantify vA3C’s contribution to segmentation accuracy under synthetic noise. For example, on the LA dataset with 50% SFDA-Noise, Dice scores improve from baseline $83.37\%$ to $88.65\%$ with full vA3C (+$3.72\%$); for Pancreas-CT with SFDA-Noise, performance increases from $74.31\%$ to $78.64\%$ (+$2.06\%$). Removing any of the RL training stages degrades performance. These ablations support the claim that voxel-level asynchronous policy updates contribute substantially to robust learning under annotation noise (Fu et al., 7 Jan 2026).

7. Theoretical Rationale for Voxel-Level Asynchronous Actor-Critic

Treating each voxel as an agent enables correction of mislabeled regions without requiring exclusion of entire volumes or patches. Asynchronous updates across spatially disjoint voxels decorrelate gradients, promoting stable optimization in high-dimensional settings. The composite reward, combining localized accuracy gain (Dice delta) and global anatomical regularity, incentivizes the model to incrementally ameliorate noisy labels. The staged training regimen ensures effective initialization, mitigating policy collapse. Overall, vA3C's granularity and parallel reinforcement maximize resilience to global annotation noise and accelerate convergence, outperforming conventional supervised baselines and sample-filtering methods (Fu et al., 7 Jan 2026).

References (1)

Fu et al. (2026). Staged Voxel-Level Deep Reinforcement Learning for 3D Medical Image Segmentation with Noisy Annotations.
