Full-Codebook Random Masking

Updated 3 April 2026

Full-Codebook Random Masking uniformly applies random masks over every index of a data representation, effectively hiding sensitive information for model regularization and privacy.
Empirical results in sequence models show improved convergence rates and lower error metrics compared to traditional per-layer masking techniques.
The approach also offers strong cryptographic and differential privacy guarantees, although it entails higher computational overhead.

Full-Codebook Random Masking refers to a family of masking strategies that operate over the entire set of indices (“codebook”) of a data representation in order to hide information, whether for model regularization, cryptographic security, or data privacy. The term encompasses techniques such as dense token masking for sequence models, algebraic masking in cryptographic implementations, and matrix masking in privacy-preserving data analysis. The core operational principle is to inject statistically uniform or random noise or masks over all codebook positions at each step, so that the masked intermediate representations are decorrelated as far as possible from any sensitive or reconstructible information.

1. Mathematical Formulations

In discrete sequence modeling, let $X\in\{1,\ldots,K\}^{T\times C}$ denote a ground-truth matrix of tokens indexed by time ($T$ steps) and codebook/channel ($C$ codebooks). A full-codebook random masking strategy constructs a binary mask $M\in\{0,1\}^{T\times C}$, where each entry is drawn independently as $M_{t,c}\sim \mathrm{Bernoulli}(p_t)$ and the masking probability $p_t\sim\mathrm{Uniform}(0,1)$ is a single scalar resampled for each training instance. The masked input is

$X^{\mathrm{mask}}_{t,c} = \begin{cases} [\mathrm{M}] & \text{if } M_{t,c}=1, \\ X_{t,c} & \text{if } M_{t,c}=0, \end{cases}$

where $[\mathrm{M}]$ is a special mask token shared across all codebooks. This approach masks a dense, randomly chosen subset of tokens from all codebooks within each training sample, yielding on average half of $T\cdot C$ entries masked per example ( $\mathbb{E}[p_t]=0.5$ ) (Zhu et al., 1 Apr 2026).
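The mask construction above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `full_codebook_mask`, the choice of `0` as the id of the $[\mathrm{M}]$ token, and the toy sizes are all assumptions made for the example.

```python
import numpy as np

MASK_ID = 0  # hypothetical id reserved for the [M] token; real tokens are 1..K


def full_codebook_mask(tokens, rng):
    """Mask every (t, c) entry independently using one per-sample ratio."""
    p = rng.uniform(0.0, 1.0)                 # masking probability, resampled per sample
    mask = rng.random(tokens.shape) < p       # Bernoulli(p) over all T x C entries
    masked = np.where(mask, MASK_ID, tokens)  # replace selected entries with [M]
    return masked, mask, p


rng = np.random.default_rng(0)
T, C, K = 6, 4, 1024                          # toy sizes, chosen for illustration
tokens = rng.integers(1, K + 1, size=(T, C))  # ground-truth tokens in {1, ..., K}
masked, mask, p = full_codebook_mask(tokens, rng)
```

Because the same `[M]` id is written into every codebook column, the sketch matches the description of a mask token shared across codebooks.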

In algebraic cryptographic masking, the sensitive value is an element of a finite field, a full-codebook random mask is chosen uniformly at random from that field, and the masked value is obtained by combining the two. Periodic mask refresh may update the mask and adjust the masked value accordingly. In threshold implementations, this is interpretable as a single-share split in which one share is the uniformly random mask (Ramezanpour et al., 2019).

For differential privacy via matrix masking, the data form a matrix with one row per individual. A random orthogonal matrix and Gaussian noise are generated, and the released pseudo-data is the noise-perturbed data matrix multiplied by the orthogonal matrix (Ding et al., 2022).

2. Algorithmic Procedures and Training Integration

In discrete diffusion models such as OmniVoice, the full-codebook random masking is applied to acoustic token matrices. For each gradient update:

Sample a masking probability $p_t\sim\mathrm{Uniform}(0,1)$.
Independently mask each token $X_{t,c}$ with probability $p_t$.
Construct $X^{\mathrm{mask}}$ accordingly.
Encode the input text together with $X^{\mathrm{mask}}$.
For masked entries, compute the model's predicted token distributions, and aggregate the cross-entropy loss only over masked positions.

This represents a “single-stage, non-autoregressive” training step reminiscent of a discrete corruption or diffusion process, but without a fine-grained diffusion schedule; randomness is entirely per-sample and per-token (Zhu et al., 1 Apr 2026).
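The loss aggregation in the last training step can be illustrated as follows. This is a sketch under stated assumptions, not the OmniVoice code: the model is replaced by random logits, token classes are 0-indexed for array indexing, and the name `masked_cross_entropy` is hypothetical.

```python
import numpy as np


def masked_cross_entropy(logits, targets, mask):
    """Mean cross-entropy over masked positions only.

    logits:  (T, C, K) unnormalized scores, one K-way distribution per entry
    targets: (T, C) integer class indices in [0, K)
    mask:    (T, C) boolean, True where the input token was replaced by [M]
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    n_masked = mask.sum()
    if n_masked == 0:  # a masking probability near 0 can leave nothing masked
        return 0.0
    return float((nll * mask).sum() / n_masked)


rng = np.random.default_rng(0)
T, C, K = 5, 3, 8
targets = rng.integers(0, K, size=(T, C))
mask = rng.random((T, C)) < rng.uniform()  # per-sample ratio, per-entry draw
logits = rng.normal(size=(T, C, K))        # stand-in for model outputs
loss = masked_cross_entropy(logits, targets, mask)
```

Unmasked positions contribute nothing to the loss, so changing the logits there leaves the value unchanged.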

In RS-Mask cryptographic implementations, the procedure for each sensitive variable is:

Draw a fresh, uniformly random mask.
Compute the masked value from the sensitive variable and the mask.
Forward the masked value and the mask separately through nonlinear circuit elements.
Optionally, refresh the mask as needed for forward security.
Error-detecting infective variants inject redundancy for DFA/DFIA resistance (Ramezanpour et al., 2019).
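The masking and refresh steps can be illustrated in software for a single byte. The sketch below assumes Boolean (XOR) masking over the 256 byte values, which is one common choice and is not confirmed by the text above as the RS-Mask operation; the function names are hypothetical, and the nonlinear circuit stage and the infective error detection are not modeled.

```python
import secrets


def mask_byte(x):
    """Split a secret byte into (masked value, mask) using a fresh uniform mask."""
    r = secrets.randbelow(256)  # uniform over all 256 byte values
    return x ^ r, r


def refresh(masked, r):
    """Re-randomize both shares without ever recombining the secret."""
    s = secrets.randbelow(256)
    return masked ^ s, r ^ s


def unmask_byte(masked, r):
    """Recombine the two shares."""
    return masked ^ r


secret = 0x3A
masked, r = mask_byte(secret)
masked, r = refresh(masked, r)
recovered = unmask_byte(masked, r)

# For a fixed secret, sweeping the mask over every byte value hits each
# output exactly once, so a uniform mask makes the masked share uniform.
images = sorted(secret ^ m for m in range(256))
```

The final sweep is the small-scale version of the uniformity property: each possible masked value arises from exactly one mask, whatever the secret is.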

For matrix masking in DP:

For each individual, add Gaussian noise to their data.
Draw a global random orthogonal matrix.
Publish the orthogonally transformed noisy data; equivalently, the transform can be applied to the raw data first and iid Gaussian noise added afterwards, because an orthogonal transform of iid Gaussian noise is again iid Gaussian (Ding et al., 2022).
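A minimal NumPy sketch of these three steps follows. It assumes the data matrix has one row per individual and that the orthogonal matrix multiplies from the left; the names `random_orthogonal` and `matrix_mask` and the noise scale `sigma` are illustrative, and `sigma` is not calibrated to any privacy budget here.

```python
import numpy as np


def random_orthogonal(n, rng):
    """Draw an n x n orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))  # column sign fix for a uniform (Haar) draw


def matrix_mask(data, sigma, rng):
    """Add per-record Gaussian noise, then rotate all records jointly."""
    noisy = data + rng.normal(scale=sigma, size=data.shape)
    q = random_orthogonal(data.shape[0], rng)
    return q @ noisy


rng = np.random.default_rng(0)
data = rng.normal(size=(50, 3))  # 50 records, 3 attributes (toy data)
released = matrix_mask(data, sigma=0.1, rng=rng)
```

Since the rotation is orthogonal, the second-moment matrix of the noisy data is unchanged by it, which is why statistics built from cross-products remain computable from the release.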

3. Empirical Results and Security/Privacy Guarantees

In sequence modeling, ablation results demonstrate that full-codebook random masking (as opposed to per-layer masking strategies) yields:

Higher convergence speed due to a denser training signal ( $M\in\{0,1\}^{T\times C}$ 7-fold increase over per-layer selection).
Reduction in word error rate (WER) and improvement in mean opinion scores (MOS). For instance:
- Full-codebook mask: SIM-o $M\in\{0,1\}^{T\times C}$ 8, WER $M\in\{0,1\}^{T\times C}$ 9, UTMOS $M_{t,c}\sim \mathrm{Bernoulli}(p_t)$ 0
- Per-layer mask (SoundStorm): SIM-o $M_{t,c}\sim \mathrm{Bernoulli}(p_t)$ 1, WER $M_{t,c}\sim \mathrm{Bernoulli}(p_t)$ 2, UTMOS $M_{t,c}\sim \mathrm{Bernoulli}(p_t)$ 3 (Zhu et al., 1 Apr 2026).
Omnilingual TTS models employing full-codebook masking achieve WER $M_{t,c}\sim \mathrm{Bernoulli}(p_t)$ 4 and UTMOS $M_{t,c}\sim \mathrm{Bernoulli}(p_t)$ 5 on LibriSpeech-PC, outperforming per-layer masked models (Zhu et al., 1 Apr 2026).

In block ciphers, the RS-Mask yields perfect statistical independence between outputs and secrets, per the theorem: if the mask is uniformly distributed over the field and independent of the secret, then the masked output is statistically independent of the secret, conferring immunity to first-order side-channel and fault attacks. FPGA implementation on AES-128 shows the following (3-share):

Max clock: 218 MHz (vs. 242 MHz unprotected)
LUTs: 2273 (vs. 508)
Throughput: 116.8 Mbps (vs. 151.1 Mbps)
Energy/bit: 4.14 nJ (vs. 2.55 nJ) (Ramezanpour et al., 2019).
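The relative overheads implied by these figures can be computed directly. The snippet below uses only the numbers quoted above; the dictionary keys are labels chosen for the example.

```python
# Figures quoted above, as (protected, unprotected) pairs.
metrics = {
    "max_clock_mhz": (218.0, 242.0),
    "luts": (2273.0, 508.0),
    "throughput_mbps": (116.8, 151.1),
    "energy_nj_per_bit": (4.14, 2.55),
}

# Ratio of the protected design to the unprotected baseline.
ratios = {name: prot / base for name, (prot, base) in metrics.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x of unprotected")
```

The protected design therefore uses roughly 4.5 times the LUTs and about 62% more energy per bit, while running at about 90% of the clock rate and 77% of the throughput of the unprotected core.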

In differential privacy, combining masking with Gaussian noise yields a substantial reduction in variance requirements for achieving $M_{t,c}\sim \mathrm{Bernoulli}(p_t)$ 9-DP:

Without masking: $p_t\sim\mathrm{Uniform}(0,1)$ 0
With matrix masking: $p_t\sim\mathrm{Uniform}(0,1)$ 1 for $p_t\sim\mathrm{Uniform}(0,1)$ 2 This reduction directly increases downstream analytic utility (Ding et al., 2022).

4. Security and Privacy Analysis

Theoretical guarantees for RS-Mask (random-space/full-codebook random masking) rest on the uniformity of the masking distribution:

For any intermediate variable or fault-induced output, the masking ensures a perfectly uniform distribution over the full field, eliminating statistical distinguishability.
The information-theoretic proof (Theorem 1 in Ramezanpour et al., 2019) establishes that the mutual information between the secret and the masked intermediate vanishes when the mask is uniform.
Infective variants using redundant encodings and multiplicative propagation of mask error further defeat DFA/DFIA, as any difference is uniformly masked and reveals no exploitable bias.
In DP, random orthogonal masking spreads out any change in an individual's data across all rows/directions, suppressing the influence of individual entries on the released pseudo-data and substantially tightening tail bounds on the log-density ratio (Ding et al., 2022).
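The uniformity argument for the cryptographic case can be written out in one line. The derivation below is the standard one-time-pad calculation, given here as an illustration under assumed notation rather than as a restatement of Theorem 1: the secret $X$ and mask $R$ take values in a finite group $(G,+)$, $R$ is uniform and independent of $X$, and the masked value is $Z = X + R$.

$P(Z=z \mid X=x) = P(R = z - x) = \frac{1}{|G|} \quad \text{for all } x, z \in G$

$P(Z=z) = \sum_{x\in G} P(X=x)\,\frac{1}{|G|} = \frac{1}{|G|}$

$I(X;Z) = \sum_{x,z} P(x,z)\,\log\frac{P(z\mid x)}{P(z)} = \sum_{x,z} P(x,z)\,\log 1 = 0$

The conditional distribution of $Z$ does not depend on $x$, so observing the masked value alone conveys nothing about the secret. The argument fails as soon as $R$ is non-uniform or correlated with $X$.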

5. Comparison to Alternative Masking Strategies

Contrasted with partial or per-layer masking:

In sequence models, per-layer (e.g., SoundStorm, MaskGCT) masking only affects a single codebook per training step, providing a sparse learning signal.
Full-codebook random masking masks more densely, which improves gradient flow, speeds up convergence, and makes broader use of context across codebooks and time.
In cryptographic masking, high-order threshold implementations split secrets into $d+1$ shares for $d$th-order security; RS-Mask provides equivalent side-channel resistance by using one share as a uniformly random mask over the full field, resulting in similar or slightly increased area and power overhead in hardware, but with proven maximal security (Ramezanpour et al., 2019).
In DP, masking via random rotations outperforms classic Gaussian noise addition by reducing required noise, due to effective spread of sensitivity across a higher-dimensional space (Ding et al., 2022).

6. Implementation Considerations and Limitations

Computational overhead: In DP, generating a full-rank random orthogonal matrix is $p_t\sim\mathrm{Uniform}(0,1)$ 8, though block-diagonal masks or fast transforms can mitigate this cost (Ding et al., 2022).
Parameter selection: Effective masking requires resampling the mask for each sample or operation, and, in DP, maintaining $p_t\sim\mathrm{Uniform}(0,1)$ 9 for proper privacy guarantees.
The noise variance constraint in DP remains weakly dependent on the number of attributes $X^{\mathrm{mask}}_{t,c} = \begin{cases} [\mathrm{M}] & \text{if } M_{t,c}=1, \ X_{t,c} & \text{if } M_{t,c}=0, \end{cases}$ 0, and achieving p-independent privacy with full utility is an open problem.
In RS-Mask for cryptography, the area and energy overheads are substantial compared to unprotected implementations but modest compared to other high-order countermeasures.
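The block-diagonal mitigation mentioned under computational overhead can be sketched as follows. This is an illustration only: the block size, the function names, and the use of QR to draw each block are assumptions, and the sketch says nothing about how a block-diagonal mask affects the privacy guarantee, which has to be checked against the cited analysis.

```python
import numpy as np


def random_orthogonal(n, rng):
    """Draw an n x n orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))


def block_diagonal_mask(noisy, block, rng):
    """Rotate consecutive groups of `block` rows with independent orthogonal draws."""
    out = np.empty_like(noisy)
    for start in range(0, noisy.shape[0], block):
        rows = slice(start, start + block)
        chunk = noisy[rows]  # the last chunk may be shorter than `block`
        out[rows] = random_orthogonal(chunk.shape[0], rng) @ chunk
    return out


rng = np.random.default_rng(0)
noisy = rng.normal(size=(10, 2))  # stands in for data with noise already added
released = block_diagonal_mask(noisy, block=4, rng=rng)
```

With block size $b$ and $n$ rows, each QR factorization costs on the order of $b^3$ operations and there are about $n/b$ of them, so drawing the mask costs on the order of $n b^2$ rather than the $n^3$ of one dense $n\times n$ factorization.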

7. Instantiation Guidelines and Application Scope

In TTS and sequence models: Apply masking independently to every entry of the multi-codebook token matrix for each training instance, with a randomly drawn global masking probability; compute losses over all codebooks and apply masked modeling to every sequence/vocabulary dimension (Zhu et al., 1 Apr 2026).
In symmetric cryptographic designs: Implement encode/decode wrappers around every nonlinear operation, carry fresh full-codebook random masks in parallel datapaths, and incorporate mask refreshing per pipeline boundary or nonce/IV update. Infective protection is achieved via small error-detection blocks and redundancy propagation (Ramezanpour et al., 2019).
In differential privacy: Add Gaussian noise locally to each record, compose them, and apply a global random orthogonal transformation prior to publishing the output. Adjust the noise scale per specified theorems (Ding et al., 2022).

Full-codebook random masking is a unifying conceptual framework for robust information hiding, distinguished by its use of dense, uniformly random transformations over the entire underlying algebraic or codebook structure. Its empirical and theoretical guarantees span privacy, security, and improved model learning.