Voxel-Level A3C Module for 3D Segmentation

Updated 14 January 2026
  • The paper presents a voxel-level A3C method that treats each voxel as an independent agent using composite rewards to correct noisy labels.
  • It leverages a shared encoder with multiple decoders to output segmentation, value, and policy predictions, ensuring stable asynchronous updates.
  • Empirical results demonstrate Dice score improvements of up to +3.72% on challenging datasets, confirming the module's robust performance.

The voxel-level Asynchronous Advantage Actor-Critic (vA3C) module is a reinforcement learning component formulated for robust 3D medical image segmentation in the presence of noisy annotations. It is a core innovation of the Staged Voxel-Level Deep Reinforcement Learning (SVL-DRL) framework, which frames each image voxel as an autonomous agent operating asynchronously in parallel. Unlike conventional sample-level or patch-level denoising approaches, vA3C exploits local policy adjustments driven by composite rewards that fuse segmentation accuracy metrics with anatomical constraints, thereby incrementally rectifying labeling inaccuracies at voxel resolution (Fu et al., 7 Jan 2026).

1. Voxel-wise Reinforcement Learning Agent Formulation

In the vA3C formulation, each voxel $i$ of a 3D image volume $I \in \mathbb{R}^{H \times W \times D}$, where $N = H \cdot W \cdot D$, is modeled as an independent agent. The state of agent $i$ at step $t$ is denoted $s_i^{(t)}$, initialized to the voxel's raw intensity $I_i$ and, in practice, set to the feature vector at $i$ within a global feature map $F^{(t)}$ output by a shared encoder. This state captures both local and contextual neighborhood information. The collective system state is $S^{(t)} = (s_1^{(t)}, \ldots, s_N^{(t)})$, allowing all voxels to update policy-relevant representations simultaneously.

2. Discrete Action Space and Voxel Manipulations

Each voxel-agent samples from a finite action set $\mathcal{A} = \{0, 1, 2\}$:

  • $a=0$: do nothing,
  • $a=1$: enhance tissue/lesion,
  • $a=2$: weaken tissue/lesion.

The selected action modulates the current voxel value via:

$$I_{new} = \begin{cases} I_{orig} & a=0 \\ \operatorname{clip}(I_{orig}\cdot(1+0.3\cdot\epsilon),0,1) & a=1 \\ \operatorname{clip}(I_{orig}\cdot(1-0.3\cdot\epsilon),0,1) & a=2 \end{cases}$$

where $\epsilon \sim \operatorname{Uniform}(0,1)$. This formulation imposes controlled perturbations, simulating potential medical effects and facilitating error correction in the segmentation process.
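The piecewise rule above can be sketched in plain Python on a flattened volume. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, intensities are assumed to be normalized to [0, 1], and drawing one independent $\epsilon$ per voxel is an assumption, since the text does not say whether $\epsilon$ is shared across voxels.

```python
import random

SCALE = 0.3  # perturbation magnitude from the piecewise rule


def clip(x, lo=0.0, hi=1.0):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def apply_action(intensity, action, eps):
    """Return the updated intensity for one voxel.

    action: 0 = do nothing, 1 = enhance, 2 = weaken.
    eps:    a draw from Uniform(0, 1).
    """
    if action == 0:
        return intensity
    if action == 1:
        return clip(intensity * (1.0 + SCALE * eps))
    if action == 2:
        return clip(intensity * (1.0 - SCALE * eps))
    raise ValueError(f"unknown action {action!r}")


def apply_actions(voxels, actions, rng):
    """Apply one action per voxel; assumes an independent eps per voxel."""
    return [apply_action(v, a, rng.random()) for v, a in zip(voxels, actions)]


rng = random.Random(0)  # seeded only so the sketch is repeatable
voxels = [0.10, 0.50, 0.90, 0.95]
actions = [0, 1, 2, 1]
updated = apply_actions(voxels, actions, rng)
```

Because $\epsilon \le 1$, an enhanced or weakened voxel changes by at most 30% of its current value before clipping, and a voxel with action 0 is returned unchanged.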

3. Shared-Encoder and Multi-Decoder Architecture

The network backbone consists of a shared Swin UNETR encoder, which generates hierarchical feature maps from the entire input volume $X^{(t)}$. This is followed by three parallel decoder branches:

  • Segmentation head $f_{\theta_s}$: produces probability map $P^{(t)}$ for voxel-wise segmentation.
  • Value head $V_{\theta_v}$: estimates the expected return $V(s^{(t)})$ for each voxel.
  • Policy head $\pi_{\theta_p}$: outputs action logits for each voxel, parameterizing $\pi(a|s)$.

Encoder parameters are shared, while decoders for segmentation, value, and policy are independent, collectively parameterized by $\theta = \{\theta_s, \theta_v, \theta_p\}$.
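To make the role of the policy head concrete, the sketch below turns the three action logits of a single voxel into a categorical distribution and samples from it. It is an illustration in plain Python under stated assumptions: the function names are hypothetical, the `temperature` argument is an assumed way of softening the distribution, and in the actual network the logits would come from the policy decoder rather than a hand-written list.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; subtracts the max for numerical stability."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, rng, temperature=1.0):
    """Sample an action index and return it with its log-probability."""
    probs = softmax(logits, temperature)
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, math.log(probs[action])


rng = random.Random(0)  # seeded only so the sketch is repeatable
action, log_prob = sample_action([2.0, 0.0, -1.0], rng, temperature=1.0)
```

The returned log-probability is the quantity that a policy-gradient loss would later weight by the advantage.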

4. Voxel-Adaptive A3C: Workflow and Loss Formulations

The conventional A3C framework, in which multiple actors interact with independent environments, is adapted so that every voxel is an agent in a shared environment and all agents act in parallel within each forward pass. Training proceeds in three main stages:

  1. Warmup: Pure supervised segmentation,
  2. Transition: Mixed Dice and value-MSE loss,
  3. Full RL: RL loss with segmentation, value, and policy terms.

At each RL step:

  • Actions are sampled per voxel: $a_i^{(t)} \sim \pi_{\theta_p}(\cdot|s_i^{(t)};\tau)$.
  • The action effect is applied to yield $X^{(t+1)}$.
  • Step rewards $r_i^{(t)}$ are computed (see Section 5).
  • $n$-step returns are accumulated: $R^{(t)} = \sum_{k=0}^{n-1} \gamma^k \bar{r}^{(t+k)}$, where $\bar{r}^{(t)} = N^{-1}\sum_i r_i^{(t)}$.
  • Advantages $A^{(t)} = R^{(t)} - V_{\theta_v}(s^{(t)})$ are used in policy gradients.
  • Gradients for segmentation, value, and policy loss are aggregated and used to update the parameters.
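The return and advantage computations in the steps above reduce to a few lines of arithmetic. The following sketch uses made-up reward values and an assumed discount of $\gamma = 0.9$ (the text does not state $\gamma$); the function names are hypothetical. It follows the return formula as written, which has no bootstrapped value term at the end of the window.

```python
def mean_reward(voxel_rewards):
    """Average per-voxel rewards r_i into the scalar step reward r-bar."""
    return sum(voxel_rewards) / len(voxel_rewards)


def n_step_return(step_rewards, gamma):
    """R = sum over k of gamma^k * r-bar^(t+k) for a window of n step rewards."""
    return sum((gamma ** k) * r for k, r in enumerate(step_rewards))


def advantage(n_step_ret, value_estimate):
    """A = R - V(s)."""
    return n_step_ret - value_estimate


# Three steps with four voxels each (illustrative numbers only).
per_voxel = [
    [0.2, 0.0, 0.1, 0.1],
    [0.0, 0.0, 0.4, 0.0],
    [0.1, 0.1, 0.1, 0.1],
]
rbar = [mean_reward(r) for r in per_voxel]  # each step averages to about 0.1
R = n_step_return(rbar, gamma=0.9)          # 0.1 * (1 + 0.9 + 0.81)
A = advantage(R, value_estimate=0.2)        # positive: better than predicted
```

A positive advantage raises the probability of the sampled actions under the policy gradient, and a negative one lowers it.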

The staged loss functions are

$$\begin{aligned} \mathcal{L}_{warm} &= 1 - \operatorname{Dice}(P, G) \\ \mathcal{L}_{trans} &= (1-\lambda)\mathcal{L}_{dice} + \lambda (R - V(s))^2 \\ \mathcal{L}_{full} &= (1-\alpha-\beta)\mathcal{L}_{dice} + \alpha (R-V(s))^2 + \beta ( -\log \pi(a|s) A ) \end{aligned}$$

The weights are set to $\alpha = 0.2$, $\beta = 0.2$, and $\lambda = 0.3$.
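As a worked instance of the three staged losses, the sketch below combines scalar loss terms with the stated weights. The stage labels `"warm"`, `"trans"`, and `"full"` and the function name are hypothetical, and the inputs are scalars for readability; a real implementation would operate on tensors and aggregate over voxels.

```python
import math

ALPHA, BETA, LAMBDA = 0.2, 0.2, 0.3  # weights stated in the text


def staged_loss(stage, dice_loss, ret, value, log_prob=0.0):
    """Combine loss terms for one of the three training stages.

    stage:     "warm", "trans", or "full" (hypothetical labels).
    dice_loss: 1 - Dice(P, G).
    ret:       n-step return R.
    value:     critic estimate V(s).
    log_prob:  log pi(a|s) of the sampled action (used only in "full").
    """
    value_term = (ret - value) ** 2
    if stage == "warm":
        return dice_loss
    if stage == "trans":
        return (1 - LAMBDA) * dice_loss + LAMBDA * value_term
    if stage == "full":
        adv = ret - value  # advantage A = R - V(s)
        policy_term = -log_prob * adv
        return (1 - ALPHA - BETA) * dice_loss + ALPHA * value_term + BETA * policy_term
    raise ValueError(f"unknown stage {stage!r}")


loss = staged_loss("full", dice_loss=0.25, ret=0.5, value=0.25,
                   log_prob=math.log(0.5))
```

With these weights the Dice term keeps a share of 0.7 in the transition stage and 0.6 in the full RL stage, so supervised segmentation remains the largest component throughout.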

5. Advantage Policy Gradient and Composite Reward Function

The advantage function follows the A3C standard: $A(s^{(t)}, a^{(t)}) = R^{(t)} - V(s^{(t)})$. The policy gradient for voxel $i$ is

$$\nabla_{\theta_p} \mathcal{L}_{policy} = -\nabla_{\theta_p} \left[ \log \pi_{\theta_p}(a_i^{(t)}|s_i^{(t)}) \, A(s_i^{(t)}, a_i^{(t)}) \right]$$

aggregated over all voxels and $n$-step windows.

The reward at each step is $r^{(t)} = \Delta\operatorname{Dice} + \mathcal{C}$, where

$$\Delta\operatorname{Dice} = \operatorname{Dice}(f^{(t)}, G) - \operatorname{Dice}(f^{(t-1)}, G)$$

and the anatomical constraint is

$$\mathcal{C}(f) = \max(N_{cc}(f) - 1, 0) + \sum_{i,j} |\nabla f_{i,j}|$$

with $N_{cc}(f)$ the number of connected components in the binary mask and $\sum |\nabla f|$ the total variation of the segmentation. Lower values of $\mathcal{C}$ indicate greater anatomical plausibility; the term penalizes fragmented or irregular outputs.
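The reward components can be illustrated on a small 2D binary mask in plain Python. Several choices here are assumptions of the sketch rather than details from the source: 4-connectivity for counting components, forward differences for the total variation, and a hypothetical `constraint_weight` argument. The formula as written adds $\mathcal{C}$ directly (weight +1), while the text describes $\mathcal{C}$ as a penalty; the source excerpt does not settle the sign or scale, so the weight is left explicit.

```python
from collections import deque


def dice(pred, target):
    """Dice overlap of two equally sized binary masks (nested lists of 0/1)."""
    p = [v for row in pred for v in row]
    g = [v for row in target for v in row]
    inter = sum(a * b for a, b in zip(p, g))
    denom = sum(p) + sum(g)
    return 1.0 if denom == 0 else 2.0 * inter / denom


def count_components(mask):
    """Number of 4-connected foreground components (breadth-first search)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for i in range(h):
        for j in range(w):
            if mask[i][j] and not seen[i][j]:
                count += 1
                seen[i][j] = True
                queue = deque([(i, j)])
                while queue:
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


def total_variation(mask):
    """Sum of absolute forward differences along both axes."""
    h, w = len(mask), len(mask[0])
    tv = 0
    for i in range(h):
        for j in range(w):
            if i + 1 < h:
                tv += abs(mask[i + 1][j] - mask[i][j])
            if j + 1 < w:
                tv += abs(mask[i][j + 1] - mask[i][j])
    return tv


def constraint(mask):
    """C(f) = max(N_cc(f) - 1, 0) + total variation."""
    return max(count_components(mask) - 1, 0) + total_variation(mask)


def step_reward(prev_mask, curr_mask, target, constraint_weight):
    """Delta-Dice plus a weighted constraint term (the weight is an assumption)."""
    delta = dice(curr_mask, target) - dice(prev_mask, target)
    return delta + constraint_weight * constraint(curr_mask)


target = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
prev = [[1, 0, 0, 0], [0, 1, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]  # stray voxel
reward = step_reward(prev, target, target, constraint_weight=-0.01)
```

In this example the previous mask has a stray corner voxel and a missing voxel, so it scores a higher $\mathcal{C}$ than the clean square; moving to the clean mask yields a positive Dice delta.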

6. Empirical Validation: Ablation and Robustness under Noisy Annotations

Ablation studies (Tables 7–8 in the source) quantify vA3C's contribution to segmentation accuracy under synthetic noise. For example, on the LA dataset with 50% SFDA-Noise, the Dice score rises from a baseline of 83.37% to 88.65% with full vA3C (reported gain: +3.72%); on Pancreas-CT with SFDA-Noise, it rises from 74.31% to 78.64% (reported gain: +2.06%). Removing any of the RL training stages further degrades performance. The source interprets these results as evidence that voxel-level asynchronous policy updates are the dominant factor for robust learning under annotation noise (Fu et al., 7 Jan 2026).

7. Theoretical Rationale for Voxel-Level Asynchronous Actor-Critic

Treating each voxel as an agent enables correction of mislabeled regions without requiring exclusion of entire volumes or patches. Asynchronous updates across spatially disjoint voxels decorrelate gradients, promoting stable optimization in high-dimensional settings. The composite reward, combining localized accuracy gain (Dice delta) and global anatomical regularity, incentivizes the model to incrementally ameliorate noisy labels. The staged training regimen ensures effective initialization, mitigating policy collapse. Overall, vA3C's granularity and parallel reinforcement maximize resilience to global annotation noise and accelerate convergence, outperforming conventional supervised baselines and sample-filtering methods (Fu et al., 7 Jan 2026).
