Payload Splitting Mechanisms

Updated 4 July 2026

Payload splitting is a decomposition operator that divides data into smaller, manageable subunits to reduce system burden while preserving overall performance.
In audiovisual steganography, it partitions secret messages between audio and video streams using ratio-based allocation to balance distortion and concealment.
In federated recommender systems and packet anomaly detection, it optimizes communication and feature extraction through selective factor transmission and sliding-window block extraction, respectively.

Payload splitting denotes a family of operations in which a payload is decomposed into smaller units before embedding, transmission, or representation learning. In the cited literature, the term refers to at least three distinct but structurally related procedures: split-payload audiovisual steganography, where a secret message $M$ is divided across video and audio carriers; payload optimization in federated recommender systems, where only a selected subset of the global item-factor payload is transmitted each communication round; and packet-payload block splitting, where a raw byte payload $P$ is segmented into overlapping blocks prior to embedding and classification (Paudel et al., 7 Jun 2026, Khan et al., 2021, Liu et al., 2019). Across these settings, the central objective is to reduce burden on any single carrier, round, or representation unit while preserving task performance or concealment.

1. Formalizations of payload splitting

The three formulations operate on different payload objects. In audiovisual steganography, the payload is a secret message of $B$ bits. In federated recommender systems, the payload is the server-to-client model component $Q\in\mathbb{R}^{K\times M}$. In packet anomaly detection, the payload is the raw packet byte sequence $P=(p_1,\dots,p_M)$.

Domain	Payload object	Splitting mechanism
Audiovisual steganography	Secret message $M$	Partition into $M_v$ and $M_a$
Federated recommender systems	Item-factor matrix $Q$	Select $M_s$ item factors per round
Packet anomaly detection	Packet payload $P$	Sliding-window block sequence extraction

In split-payload audiovisual steganography, the message is partitioned by an allocation ratio that fixes how the $B$ bits are divided between the video share $M_v$ and the audio share $M_a$, and hence

$$|M_v| + |M_a| = B.$$

An equivalent per-slice formulation divides the clip into temporal slices, draws a clip-level modulation vector over those slices, and sets the per-slice payloads accordingly.
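The ratio-based partition can be illustrated with a minimal sketch. It assumes, for illustration only, that the ratio gives the fraction of bits assigned to the video share and that the message is cut into a contiguous prefix and suffix; the source does not state either detail, and `split_message` and `join_message` are hypothetical names.

```python
def split_message(bits: str, ratio: float) -> tuple[str, str]:
    """Split a bit string into a video share and an audio share.

    Assumption: `ratio` is the fraction of bits given to the video share,
    and the cut is a simple prefix/suffix split.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    cut = round(ratio * len(bits))
    return bits[:cut], bits[cut:]


def join_message(video_share: str, audio_share: str) -> str:
    """Recover the original message from the two shares."""
    return video_share + audio_share


message = "1011001110001011"          # B = 16 bits
m_v, m_a = split_message(message, 0.75)
print(len(m_v), len(m_a))             # sizes of the two shares
print(join_message(m_v, m_a) == message)
```

Whatever the ratio, the two share lengths always sum to the message length, which is the only property the partition itself guarantees.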

In the federated recommender formulation, payload splitting is implemented as selection rather than arithmetic partition. Each item is treated as a separate arm in a batched bandit, with a binary action indicating whether the corresponding item factor (a column of $Q$) is included in the transmitted payload. Exactly $M_s$ arms are selected per round, so the payload reduction is controlled directly by the selected subset size (Khan et al., 2021).

In packet anomaly detection, splitting is a sequence transformation. A payload $P=(p_1,\dots,p_M)$ is converted into overlapping blocks of fixed length $\ell$ and stride $s$, with

$$b_j = \big(p_{(j-1)s+1},\dots,p_{(j-1)s+\ell}\big), \qquad j = 1,\dots,\left\lfloor \frac{M-\ell}{s} \right\rfloor + 1.$$

The resulting block sequence is then filtered by a top-$k$ vocabulary and embedded into vectors for downstream modeling (Liu et al., 2019).
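The sliding-window step can be sketched as follows. The function name `split_blocks` is hypothetical, and dropping a trailing partial block is an assumption made here so that every block has the same fixed length.

```python
def split_blocks(payload: bytes, length: int, stride: int) -> list[bytes]:
    """Cut a payload into fixed-length blocks taken every `stride` bytes.

    Assumption: a trailing window shorter than `length` is discarded.
    """
    if length <= 0 or stride <= 0:
        raise ValueError("length and stride must be positive")
    last_start = len(payload) - length
    return [payload[i:i + length] for i in range(0, last_start + 1, stride)]


payload = b"GET /index"                      # 10 bytes
blocks = split_blocks(payload, length=3, stride=1)
print(blocks[:3])
print(len(blocks))                           # floor((10 - 3) / 1) + 1 blocks
```

With a stride smaller than the block length, consecutive blocks share bytes, which is what makes the representation tolerant to small shifts in where a pattern starts.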

These formulations suggest that “payload splitting” is best understood as a domain-specific decomposition operator rather than a single algorithmic primitive.

2. Split-payload audiovisual steganography

In the audiovisual setting, payload splitting is motivated by the availability of audio and video streams as concurrent carriers. The hidden message is divided between the two modalities so that the embedding burden on either individual carrier is reduced. The video branch uses the WOW algorithm: directional high-pass residuals are computed for a bank of filters, and a per-pixel embedding cost is derived from them.

Embedding then uses a syndrome-trellis code (STC) or Layer-Wise Optimized Wavelet embedding to modify least-significant bits at the video payload rate, expressed in bits per pixel. The audio branch computes local RMS energy over a window of $W$ samples,

$$E_n = \sqrt{\frac{1}{W}\sum_{m\in\mathcal{W}_n} x_m^2},$$

where $\mathcal{W}_n$ is the window associated with sample $n$ and $x_m$ are the PCM samples. It then assigns a per-sample cost that decreases as $E_n$ grows and applies STC embedding to PCM samples at the audio payload rate, expressed in bits per sample, concentrating changes in high-energy regions (Paudel et al., 7 Jun 2026).

The paper further introduces distortion equalization. An audio payload rate is chosen by binary search so that video distortion $D_v$ and audio distortion $D_a$, measured in PESQ-normalized form, satisfy

$$D_a \approx D_v.$$

This equalization constrains the split so that one modality does not dominate purely because of a distortion mismatch.
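The binary search over the audio rate can be sketched as a plain bisection. This is an illustration under stated assumptions, not the paper's procedure: the audio distortion is assumed to be a non-decreasing function of the rate, `equalize_audio_rate` is a hypothetical name, and the quadratic distortion curve is a made-up stand-in for a real measurement.

```python
from typing import Callable


def equalize_audio_rate(
    audio_distortion: Callable[[float], float],
    video_distortion: float,
    lo: float = 0.0,
    hi: float = 1.0,
    tol: float = 1e-4,
    max_iter: int = 60,
) -> float:
    """Bisect for the rate at which audio distortion matches video distortion.

    Assumption: `audio_distortion` is non-decreasing on [lo, hi].
    """
    if audio_distortion(lo) > video_distortion or audio_distortion(hi) < video_distortion:
        raise ValueError("target distortion is not bracketed by [lo, hi]")
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        d = audio_distortion(mid)
        if abs(d - video_distortion) <= tol:
            return mid
        if d < video_distortion:
            lo = mid          # too little distortion: raise the rate
        else:
            hi = mid          # too much distortion: lower the rate
    return (lo + hi) / 2.0


# Hypothetical distortion curve, used only to exercise the search.
rate = equalize_audio_rate(lambda r: r ** 2, video_distortion=0.09)
print(round(rate, 3))
```

Bisection is a natural fit here because each distortion evaluation is expensive and the search only needs a monotone relationship, not a closed-form model.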

A second axis of the formulation concerns temporal coordination. Both modalities divide the clip into the same temporal slices, each with its own seed and payload rate. In the synchronized condition, the video embedder and the audio embedder draw from the same slice seed, which induces statistical dependence between the two modification patterns.

In the asynchronous condition, the two embedders use independent seeds, so the two modification patterns are statistically independent.

Because the per-slice rate vector is shared, any performance gap between synchronized and asynchronous conditions reflects seed-level correlation rather than payload volume. The split ratio is selected to place each modality below its unimodal detector’s threshold.
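The difference between the two conditions can be illustrated with a toy model. It assumes, purely for illustration, that each embedder turns a slice seed into a set of modified positions on a common normalized grid; real video and audio carriers have different position spaces, and the names `slice_positions` and `mean_overlap` are hypothetical.

```python
import random


def slice_positions(seed: int, n_positions: int, n_changes: int) -> set[int]:
    """Pseudo-randomly pick which positions an embedder modifies in one slice."""
    rng = random.Random(seed)
    return set(rng.sample(range(n_positions), n_changes))


def mean_overlap(video_seeds, audio_seeds, n_positions=1000, n_changes=50) -> float:
    """Average fraction of modified positions shared by the two carriers."""
    overlaps = []
    for sv, sa in zip(video_seeds, audio_seeds):
        v = slice_positions(sv, n_positions, n_changes)
        a = slice_positions(sa, n_positions, n_changes)
        overlaps.append(len(v & a) / n_changes)
    return sum(overlaps) / len(overlaps)


slice_seeds = [11, 22, 33, 44]                       # one seed per slice
independent_audio_seeds = [s + 1000 for s in slice_seeds]

sync_overlap = mean_overlap(slice_seeds, slice_seeds)                 # shared seeds
async_overlap = mean_overlap(slice_seeds, independent_audio_seeds)    # separate seeds
print(sync_overlap, async_overlap)
```

The number of changes per slice is identical in both runs, so any difference a detector sees between them comes from the shared randomness and not from the payload volume.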

The significance of this construction lies in its attempt to create concealment through coordinated under-threshold embedding rather than through stronger perturbation in a single carrier.

3. Detection architectures, metrics, and empirical interpretation

The evaluated detector family contains unimodal and multimodal models. The unimodal video detector is SRNet-style, with a convolutional stem using high-pass filters and absolute-value activation to extract LSB residuals, followed by several residual blocks at full spatial resolution and nearest-neighbor pooling to fixed-size vectors per frame. The unimodal audio detector is a CNN-LSTM with a dual-polarity high-pass filter bank, absolute activation, multiscale convolutional branches, and a bidirectional LSTM over samples within each slice. Both unimodal and multimodal systems use a shared temporal module consisting of a bidirectional LSTM over the temporal slices, multi-head attention pooling, and a classification MLP. The multimodal detector adds two modality branches processed by self-attention Transformer layers and three cross-attention layers, with shared positional embeddings used to align slice indices; the final classifier receives concatenated pooled embeddings. Training uses cross-entropy with label smoothing (Paudel et al., 7 Jun 2026).

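Evaluation is reported with probability of error,

$$P_E = \frac{1}{2}\left(P_{FA} + P_{MD}\right),$$

where

$$P_{FA} = \frac{FP}{FP+TN}, \qquad P_{MD} = \frac{FN}{FN+TP},$$

along with accuracy, ROC curve, AUC, true positive rate, false positive rate, and optionally McNemar’s test,

$$\chi^2 = \frac{(b-c)^2}{b+c},$$

where $b$ and $c$ count the test samples on which exactly one of the two compared detectors is correct.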
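A short sketch of these metrics, assuming the common steganalysis definition of the error probability as the mean of the false-alarm and missed-detection rates, and the uncorrected McNemar statistic over discordant pairs. Both function names are hypothetical, and the counts are invented for the example.

```python
def error_probability(tp: int, fp: int, tn: int, fn: int) -> float:
    """Mean of the false-alarm rate and the missed-detection rate."""
    p_fa = fp / (fp + tn)      # covers flagged as stego
    p_md = fn / (fn + tp)      # stego samples missed
    return 0.5 * (p_fa + p_md)


def mcnemar_statistic(b: int, c: int) -> float:
    """Uncorrected McNemar chi-square from the two discordant counts.

    b: samples only detector A classifies correctly
    c: samples only detector B classifies correctly
    """
    if b + c == 0:
        return 0.0
    return (b - c) ** 2 / (b + c)


# Invented confusion counts on a balanced test set of 200 samples.
p_e = error_probability(tp=90, fp=5, tn=95, fn=10)
accuracy = (90 + 95) / 200
print(p_e, 1 - accuracy)       # equal on a balanced set
print(mcnemar_statistic(30, 10))
```

On a balanced test set this error probability equals one minus the accuracy, which matches the accuracy and error columns of the results table.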

For a balanced split between the video payload (in bpp) and the audio payload (in bps), the reported test performance is as follows.

Detector	Embedding mode	Accuracy / $P_E$ / AUC
Cross-attention	synchronized	93.62% / 0.064 / 0.906
Cross-attention	asynchronous	82.75% / 0.173 / 0.995
Video-only	either	50.00% / 0.500 / [illegible]
Audio-only	either	50.00% / 0.500 / [illegible]

The synchronized split-payload ablation is especially important. Zeroing the audio input yields 93.50% accuracy; zeroing the video input yields 50.00% accuracy; and shuffling audio-video pairs yields 93.60% accuracy. The paper interprets these results as showing that the apparent multimodal success comes mostly from the video stream rather than from genuine cross-modal learning. At the same time, asynchronous embedding reduces multimodal accuracy relative to synchronized embedding, which indicates that seed correlation provides some advantage.

A common misconception would be to treat a higher multimodal score as direct evidence of learned audiovisual dependence. The ablation results directly challenge that interpretation: the detector can appear strong while functionally relying on only one branch.
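The logic of these ablations can be reproduced on a toy example. The detector below is hypothetical and ignores the audio by construction, and the data are synthetic; the point is only to show how modality masking and same-label pair shuffling expose reliance on a single branch.

```python
import random


def toy_detector(video: list[float], audio: list[float]) -> int:
    """Hypothetical detector that, by construction, only looks at the video."""
    return 1 if sum(video) / len(video) > 0.5 else 0


def accuracy(detector, videos, audios, labels) -> float:
    hits = sum(detector(v, a) == y for v, a, y in zip(videos, audios, labels))
    return hits / len(labels)


def zeroed(batch):
    """Modality masking: replace every input of one modality with zeros."""
    return [[0.0] * len(x) for x in batch]


def shuffle_within_label(audios, labels, rng):
    """Re-pair audio clips with other samples that carry the same label."""
    shuffled = list(audios)
    for label in set(labels):
        idx = [i for i, y in enumerate(labels) if y == label]
        permuted = idx[:]
        rng.shuffle(permuted)
        for src, dst in zip(idx, permuted):
            shuffled[dst] = audios[src]
    return shuffled


labels = [0, 1] * 50
videos = [[float(y)] * 8 for y in labels]
audios = [[float(y) + 0.1 * i] * 8 for i, y in enumerate(labels)]
rng = random.Random(0)

baseline = accuracy(toy_detector, videos, audios, labels)
no_audio = accuracy(toy_detector, videos, zeroed(audios), labels)
no_video = accuracy(toy_detector, zeroed(videos), audios, labels)
shuffled = accuracy(toy_detector, videos, shuffle_within_label(audios, labels, rng), labels)
print(baseline, no_audio, no_video, shuffled)
```

A detector that genuinely used cross-modal structure would lose accuracy under both audio masking and pair shuffling; unchanged scores under those two interventions, combined with chance-level accuracy when the video is masked, indicate a video-only decision rule.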

4. Payload optimization in federated recommender systems

In federated collaborative filtering, payload splitting addresses communication cost rather than concealment. The global model is the item-factor matrix

$$Q\in\mathbb{R}^{K\times M},$$

which is broadcast to clients and updated through returned item-gradient updates. The payload size in bytes is approximately

$$K \times M \times b,$$

where $b$ is the number of bytes used to store each entry.

As the number of items $M$ grows, this payload can reach tens or hundreds of MB and ultimately the GB range, which is infeasible for resource-constrained devices or low-bandwidth links (Khan et al., 2021).
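The payload arithmetic can be made concrete with a small sketch. The factor count of 20, the 8 bytes per entry, and the item count are illustrative assumptions rather than values taken from the paper, and both function names are hypothetical.

```python
def payload_bytes(num_items: int, num_factors: int, bytes_per_value: int = 8) -> int:
    """Size of a dense item-factor matrix with one value per (factor, item)."""
    return num_items * num_factors * bytes_per_value


def reduced_payload_bytes(
    num_items: int, num_factors: int, reduction: float, bytes_per_value: int = 8
) -> int:
    """Size after keeping only a (1 - reduction) fraction of the item columns."""
    kept_items = round(num_items * (1.0 - reduction))
    return payload_bytes(kept_items, num_factors, bytes_per_value)


full = payload_bytes(num_items=1_000_000, num_factors=20)
cut = reduced_payload_bytes(num_items=1_000_000, num_factors=20, reduction=0.90)
print(full / 1e6, "MB per round")     # full broadcast
print(cut / 1e6, "MB per round")      # after a 90% payload reduction
```

Because the size is linear in the number of transmitted columns, selecting a subset of item factors reduces the per-round download by exactly the selected fraction.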

The proposed solution treats each item as an arm in a batched multi-armed bandit. At each iteration, every arm corresponds to one item factor, a binary action indicates whether that factor is included in the payload, and the state records past feedback. The reward for an item is computed from its current gradient and its past gradients, combining an absolute-change term with a cosine-similarity term under a weighting that shifts over the course of training.

Early in training, the absolute-change term dominates and favors rapidly changing items; later, the cosine term dominates and favors alignment with past update direction.

Selection is performed by Bayesian Thompson Sampling. For each arm, a posterior over the mean reward is maintained and updated from the number of times the arm has been selected and the average reward it has received.

The server samples from these posteriors, selects the top-$M_s$ arms, transmits only the corresponding columns of $Q$, collects gradients, updates the model if enough client responses are received, and then updates per-item rewards and posteriors.
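The sample-then-select loop can be sketched with a conjugate Gaussian model. The specific model here (a normal prior on each arm's mean reward and known reward noise, both with unit precision by default) is an assumption chosen for illustration and may differ from the paper's parameterization; `ArmPosterior` and `select_payload` are hypothetical names.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class ArmPosterior:
    """Running statistics for one item (arm)."""
    count: int = 0
    mean_reward: float = 0.0

    def update(self, reward: float) -> None:
        self.count += 1
        self.mean_reward += (reward - self.mean_reward) / self.count

    def params(self, prior_mean=0.0, prior_precision=1.0, noise_precision=1.0):
        """Posterior mean and variance under the assumed Gaussian model."""
        precision = prior_precision + self.count * noise_precision
        mean = (prior_precision * prior_mean
                + noise_precision * self.count * self.mean_reward) / precision
        return mean, 1.0 / precision

    def sample(self, rng: random.Random) -> float:
        mean, var = self.params()
        return rng.gauss(mean, math.sqrt(var))


def select_payload(arms: list[ArmPosterior], k: int, rng: random.Random) -> list[int]:
    """Draw one sample per arm and keep the indices of the k largest draws."""
    draws = [(arm.sample(rng), i) for i, arm in enumerate(arms)]
    draws.sort(reverse=True)
    return sorted(i for _, i in draws[:k])


rng = random.Random(0)
arms = [ArmPosterior() for _ in range(5)]
for _ in range(1000):
    arms[2].update(10.0)              # item 2 has been consistently rewarding
chosen = select_payload(arms, k=2, rng=rng)
print(chosen)                         # column indices to transmit this round
```

Sampling from the posterior, instead of ranking by the posterior mean, keeps rarely selected items in play: their wide posteriors occasionally produce large draws, which is how the method explores.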

Experiments use Movielens-1M, Last-FM, and MIND-small, with top-100 Precision, Recall, $F_1$, and MAP normalized with respect to the best-possible offline oracle. Payload-reduction regimes range from 25% to 98%. At 90% reduction, the method yields approximately a 20% drop in metrics on Movielens relative to the full model but remains 27–60% better than random; on Last-FM and MIND, it incurs only 4–8% degradation while remaining 160–350% better than random. On the sparse datasets, it reaches within 98–99% of full-model performance in approximately 400–450 rounds, versus approximately 200 rounds for the full model. The practical trade-off is explicit: in highly sparse domains, it is reported to be safe to cut payload by up to 90% with less than 8% accuracy loss, whereas on moderately dense data a 75% cut retains approximately 80% of original accuracy.

5. Packet-payload block splitting for anomaly detection

In packet anomaly detection, payload splitting is a representation-learning preprocessing stage. A raw payload $P=(p_1,\dots,p_M)$ is viewed as a sequence of bytes, each in $\{0,\dots,255\}$. The pipeline first extracts fixed-length overlapping blocks with a sliding window, then builds a block vocabulary from the training set, filters each packet’s block sequence to elements in that vocabulary, maps retained blocks to indices, and finally embeds them through a learnable embedding matrix before passing the resulting vector sequence to an LSTM-CNN model (Liu et al., 2019).

The paper reports an optimal hyper-parameter setting covering, among others, the block length, the stride, and the vocabulary size. The rationale is explicit. A larger block length captures higher-order byte patterns and longer local subsequences, but if too large it mixes unrelated bytes and dilutes anomaly signals; empirically, an intermediate block length gives the best trade-off. Full overlap between consecutive blocks preserves local shifts and yields the best performance. A vocabulary that is too small misses important blocks, while one that is too large introduces spurious tokens; the detection rate peaks at an intermediate vocabulary size.

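After block extraction, a frequency dictionary is built over the training data, the top-$k$ blocks are retained as vocabulary $V$, and each new packet is filtered accordingly. Let $\phi: V\to\{1,\dots,k\}$ denote the block-to-index bijection. If a block $b$ satisfies $b\in V$, it is kept as index $\phi(b)$; otherwise it is dropped. The embedded sequence is then

$$\big(E_{\phi(b_{j_1})},\, E_{\phi(b_{j_2})},\, \dots\big),$$

where $b_{j_1}, b_{j_2}, \dots$ are the retained blocks in their original order and $E_i$ is the $i$-th row of the learnable embedding matrix $E$.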

No pre-training is used; the embedding matrix is initialized randomly and optimized end-to-end by the final cross-entropy objective.
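The vocabulary, filtering, and lookup steps can be sketched in plain Python. All function names are hypothetical, indices start at 0 here instead of 1, and the embedding table is a randomly initialized list of vectors standing in for the trainable matrix; no training is performed.

```python
import random
from collections import Counter


def split_blocks(payload: bytes, length: int, stride: int) -> list[bytes]:
    """Fixed-length sliding-window blocks (trailing partial window dropped)."""
    return [payload[i:i + length] for i in range(0, len(payload) - length + 1, stride)]


def build_vocab(train_payloads, length: int, stride: int, k: int) -> dict[bytes, int]:
    """Map the k most frequent training blocks to indices 0..k-1."""
    counts = Counter(b for p in train_payloads for b in split_blocks(p, length, stride))
    return {block: idx for idx, (block, _) in enumerate(counts.most_common(k))}


def encode(payload: bytes, vocab: dict[bytes, int], length: int, stride: int) -> list[int]:
    """Keep in-vocabulary blocks as indices; drop everything else."""
    return [vocab[b] for b in split_blocks(payload, length, stride) if b in vocab]


def init_embeddings(vocab_size: int, dim: int, seed: int = 0) -> list[list[float]]:
    """Random initial embedding table (would be trained end-to-end)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]


train = [b"GET /a", b"GET /b"]
vocab = build_vocab(train, length=3, stride=1, k=3)
table = init_embeddings(len(vocab), dim=4)

indices = encode(b"GET /c", vocab, length=3, stride=1)
embedded = [table[i] for i in indices]      # one vector per retained block
print(indices, len(embedded))
```

The unseen block at the end of the new packet is dropped during filtering, so the encoded sequence is shorter than the raw block sequence; this is why downstream stages must handle variable sequence lengths.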

Sequence modeling is performed by an LSTM with hidden size $h$. Rather than use only the final hidden state, the method selects $n$ equally spaced hidden states, forming a matrix $H\in\mathbb{R}^{n\times h}$. This matrix is reshaped as a feature map and processed by a CNN with two convolutional layers: 32 filters followed by max-pooling, then 64 filters followed by max-pooling. The final classifier is an MLP with one hidden layer of size 128, ReLU, dropout(0.1), and two output units. Because exactly $n$ hidden states are selected, embedding-level padding or truncation is unnecessary in the standard case; if the sequence yields fewer than $n$ hidden states, zero-padding hidden states is described as a practical choice.
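One way to pick equally spaced hidden states is sketched below. The index rule (evenly spaced positions ending at the final state) and the choice to append zero vectors at the end for short sequences are assumptions made for illustration; the source does not preserve the exact rule, and `select_states` is a hypothetical name.

```python
def select_states(hidden_states: list[list[float]], n: int) -> list[list[float]]:
    """Return exactly n hidden-state vectors from a variable-length sequence.

    Assumptions: selection ends at the last state; short sequences are
    padded with zero vectors after the real states.
    """
    t = len(hidden_states)
    if t == 0:
        raise ValueError("at least one hidden state is required")
    if t < n:
        width = len(hidden_states[0])
        padding = [[0.0] * width for _ in range(n - t)]
        return [list(h) for h in hidden_states] + padding
    indices = [((j + 1) * t) // n - 1 for j in range(n)]
    return [hidden_states[i] for i in indices]


# Stand-in hidden states: ten 1-dimensional vectors with values 1.0 .. 10.0.
states = [[float(i)] for i in range(1, 11)]
picked = select_states(states, n=5)
padded = select_states(states[:2], n=4)
print(picked)
print(padded)
```

Fixing the number of selected states gives the CNN a constant-size input regardless of packet length, which is the property the text relies on to avoid embedding-level padding.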

The significance of this formulation is that splitting occurs before classification rather than during transmission. The split reorganizes local byte content into a vocabulary-constrained sequence that can be modeled for both high-dimensional information and underlying sequential information.

6. Comparative themes, limitations, and methodological cautions

Across these three settings, payload splitting serves different immediate objectives: concealment in audiovisual steganography, communication reduction in federated optimization, and structured representation extraction in packet anomaly detection. The mechanisms also differ: ratio-based partitioning across modalities, subset selection over item factors, and sliding-window decomposition over byte strings (Paudel et al., 7 Jun 2026, Khan et al., 2021, Liu et al., 2019).

Several recurring methodological cautions emerge. In the multimodal steganalysis case, a detector may appear effective without learning the intended joint signal; the reported post-hoc checks are modality masking and same-label shuffling, and the recommended remedies include richer fusion mechanisms that force joint representation learning, larger multi-speaker and multi-scene corpora, and statistical tests such as McNemar’s test. In the federated recommender case, the trade-off is between payload reduction and recommendation degradation, with a tunable parameter controlling exploration versus exploitation and with practical model selection framed as choosing the “knee” point on accuracy-versus-payload curves. In the packet-payload case, performance depends sensitively on block length, vocabulary size, and the number of selected LSTM states; too small a setting loses context, while too large a setting introduces noise or overhead.

A plausible cross-domain implication is that payload splitting is most useful when the split places each subproblem in a regime that is easier to manage than the unsplit original. In audiovisual steganography, the regime is below unimodal detector thresholds. In federated recommendation, it is a reduced communication budget that still preserves informative updates. In packet anomaly detection, it is a vocabulary-filtered sequence length and granularity that remain expressive for downstream modeling.

Another plausible implication is that empirical gains from payload splitting should not be interpreted uniformly. In one setting, the main effect may be evasion of unimodal steganalysis; in another, it may be communication savings with delayed convergence; in a third, it may be improved feature organization rather than reduced burden. The literature therefore treats payload splitting not as an intrinsically beneficial operation, but as a controlled structural intervention whose value depends on how the split interacts with distortion, informativeness, alignment, and model inductive bias.