Cross-Modality Editing Module (CEM)

Updated 4 July 2026

Cross-Modality Editing Module (CEM) refers to a family of multimodal mechanisms that enable coherent, controlled edits across different data modalities.
These mechanisms use techniques such as cross-attention, delta denoising, and evidence gating to align and propagate edits while preserving source structure.
CEM frameworks are applied in tasks ranging from zero-shot audio-video editing to RGB-D salient object detection and multimodal knowledge editing in unified models.

Cross-modality Editing Module (CEM) denotes an abstracted multimodal design pattern in which an editing signal is transferred, aligned, or selectively propagated across modalities rather than being confined to a single stream. In the cited literature, the term does not name one canonical layer. Instead, it is used to describe concrete mechanisms that couple modalities during editing or refinement: AvED’s cross-modal delta denoising for zero-shot audio-video editing, ScopeEdit’s two-branch online writer for multimodal LLMs, the cmMS block for RGB-D salient object detection, and reasoning-augmented activation for cross-modal knowledge editing in unified multimodal models (Lin et al., 26 Mar 2025, Li et al., 2 Jul 2026, Li et al., 2020, Gao et al., 30 May 2026). Across these formulations, a CEM is characterized by shared conditioning, modality-aware selection or gating, and an explicit mechanism for balancing edit propagation against preservation.

1. Conceptual scope and architectural forms

A CEM is most usefully understood as a family of mechanisms rather than a single architecture. In AvED, there is “no single layer named ‘CEM’,” but there is a “cross-modal editing mechanism” composed of cross-attention blocks in diffusion U-Nets, a prompt-relevance scoring module, a cross-modal contrastive fusion module, and the DDS delta objective. In ScopeEdit, the CEM is a “two-branch edit writer” that decomposes each update into a modality-local absorption branch and an evidence-gated shared generalization branch. In RGB-D salient object detection, the cmMS block—composed of cmFM, AFS, and sg-PEA—can be treated as a CEM because it modulates, selects, and refines features across RGB and depth. In unified multimodal models, UniKE identifies the cross-modal problem that a CEM would need to solve: text-side parameter edits must survive the conditioning pathway into visual generation (Lin et al., 26 Mar 2025, Li et al., 2 Jul 2026, Li et al., 2020, Gao et al., 30 May 2026).

System	Modalities	CEM-like mechanism
AvED	audio-video-text	cross-modal delta denoising
ScopeEdit	image-text in MLLMs	modality-local + shared gated writer
cmMS	RGB-depth	modulation, selection, position-edge attention
UniKE setting	text-to-image in UMMs	reasoning-augmented activation over conditioning

This diversity makes the term broader than cross-attention alone. A CEM may operate in latent diffusion space, FFN key space, or feature pyramid space. It may edit via contrastive coupling, low-rank writes, affine modulation, or prompt-side activation. What is shared is the objective of making edits or refinements cross-modally coherent while limiting irrelevant drift.

2. Diffusion-based CEMs for zero-shot audio-video editing

AvED formalizes “zero-shot audio-video editing” as the task of transforming synchronized audio-visual content to align with a target text prompt “without additional model training.” The inputs are a real video with synchronized audio $(v,a)$, an optional source prompt $y_{\text{src}}$, and a target prompt $y_{\text{trg}}$; the outputs are an edited video $v'$ and edited audio $a'$ that match the target text, preserve appropriate source structure, and remain synchronized across modalities (Lin et al., 26 Mar 2025).

The framework is built on Delta Denoising Score (DDS), an extension of Score Distillation Sampling used in a two-branch form. The source branch contains original content with $y_{\text{src}}$, and the target branch contains the editable latent with $y_{\text{trg}}$. Classifier-free guidance is written as

$\epsilon_{\phi}^{\omega}(\mathbf{z}_{t}, y, t) = (1+\omega)\, \epsilon_{\phi}(\mathbf{z}_{t}, y, t) - \omega \, \epsilon_{\phi}(\mathbf{z}_{t}, \emptyset, t),$

and the per-modality DDS loss is

$\mathcal{L}_{\mathrm{DDS}}(\theta; y_{\text{trg}}) = \big\| \epsilon_{\phi}^{\omega}(\mathbf{z}_t(\theta), y_{\text{trg}}, t) - \epsilon_{\phi}^{\omega}(\mathbf{z}_t, y_{\text{src}}, t) \big\|^2.$

This defines the intra-modal edit delta: the target latent is nudged toward the target prompt while preserving source structure.
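The two formulas above can be illustrated with a minimal NumPy sketch. It assumes a hypothetical noise predictor, `toy_eps`, in place of a pretrained diffusion model, and it assumes that both branches are evaluated at the same timestep; the function names are illustrative and are not taken from AvED.

```python
import numpy as np

def cfg_eps(eps_fn, z_t, cond, t, omega):
    """Classifier-free guidance in the (1 + omega) / omega form written above."""
    return (1.0 + omega) * eps_fn(z_t, cond, t) - omega * eps_fn(z_t, None, t)

def dds_delta(eps_fn, z_trg_t, z_src_t, y_trg, y_src, t, omega):
    """Return the edit delta between the guided target and source predictions,
    together with its squared norm (the per-modality DDS loss above)."""
    delta = (cfg_eps(eps_fn, z_trg_t, y_trg, t, omega)
             - cfg_eps(eps_fn, z_src_t, y_src, t, omega))
    return delta, float(np.sum(delta ** 2))

def toy_eps(z_t, cond, t):
    # Hypothetical stand-in for a pretrained noise predictor:
    # the "prompt" is a scalar that simply shifts the prediction.
    shift = 0.0 if cond is None else float(cond)
    return 0.1 * z_t + shift

z = np.ones((2, 2))
delta, loss = dds_delta(toy_eps, z, z, y_trg=1.0, y_src=0.0, t=0, omega=2.0)
print(delta, loss)
```

When the two prompts coincide and the latents are identical, the delta vanishes, which is the sense in which the source branch anchors the edit to the original content.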

The specifically cross-modal component is derived from diffusion U-Net cross-attention. AvED computes text relevance for audio and video patches from query-key products,

$\mathbf{S}_i^{a} = \max_j \big( \mathbf{Q}_a \mathbf{K}_a^\top \big)_{i,j}, \qquad \mathbf{S}_i^{v} = \max_j \big( \mathbf{Q}_v \mathbf{K}_v^\top \big)_{i,j},$

normalizes these scores, and thresholds them to separate prompt-relevant patches from irrelevant ones. Hidden states from the same layers provide patch embeddings. AvED then constructs target relevant sets and source/target irrelevant sets, and applies a standard contrastive loss over them as a cross-modal delta denoising term. The final objective adds this term, scaled by a weight factor, to the DDS loss.

Operationally, AvED encodes video frames and audio spectrograms with VAEs, arranges the frames into a grid so that a 2D image diffusion model can process them, and optimizes the target latents by gradient descent. The initial steps use only the DDS loss; the remaining steps add the cross-modal delta denoising term with a weight factor. The pretrained backbones are Stable Diffusion 2.1 for video and AudioLDM2-Large for audio. AvED-Bench contains “110 videos,” each “10 seconds,” spanning “11 categories from VGGSound,” and the method is also evaluated on OAVE (Lin et al., 26 Mar 2025).

Within this formulation, the CEM is the mechanism that reads text-conditioned attention, identifies where edits should occur, and couples the audio and video editing trajectories during denoising. Synchronization is maintained through common timesteps, shared textual conditioning, and patch-indexed positive and negative pairs.
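The relevance scoring and contrastive coupling described in this section can be sketched as follows. This is an illustration under stated assumptions rather than AvED's implementation: min-max normalization, the threshold `tau`, and the InfoNCE form of the contrastive loss are all assumed, since the text above specifies only that scores are normalized, thresholded, and used in a standard contrastive loss.

```python
import numpy as np

def relevance_scores(Q, K):
    """Per-patch relevance: maximum query-key product over text tokens.
    Q has shape (patches, dim); K has shape (tokens, dim)."""
    return (Q @ K.T).max(axis=1)

def split_patches(scores, tau=0.5):
    """Min-max normalize the scores and threshold them (tau is hypothetical)."""
    lo, hi = scores.min(), scores.max()
    norm = (scores - lo) / (hi - lo + 1e-8)
    return np.flatnonzero(norm >= tau), np.flatnonzero(norm < tau)

def info_nce(anchor, positives, negatives, temperature=0.1):
    """Generic InfoNCE loss on cosine similarities (assumed contrastive form)."""
    def cos(a, B):
        a = a / (np.linalg.norm(a) + 1e-8)
        B = B / (np.linalg.norm(B, axis=1, keepdims=True) + 1e-8)
        return B @ a
    pos = np.exp(cos(anchor, positives) / temperature).sum()
    neg = np.exp(cos(anchor, negatives) / temperature).sum()
    return float(-np.log(pos / (pos + neg)))

# Toy example: three patches, two text tokens.
Q = np.array([[1.0, 0.0], [0.0, 1.0], [0.1, 0.1]])
K = np.array([[2.0, 0.0], [0.0, 0.5]])
relevant, irrelevant = split_patches(relevance_scores(Q, K))
print(relevant, irrelevant)
```

In a cross-modal setting, the anchor would be a relevant patch embedding from one modality, with positives drawn from relevant patches of the other modality and negatives from the irrelevant sets; that pairing is what couples the audio and video trajectories.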

3. Online recursive CEMs for multimodal knowledge editing

ScopeEdit addresses “online multimodal knowledge editing” in transformer-based MLLMs under a bounded-overhead constraint. The editable parameters are the FFN output matrices in selected layers, while the remaining parameters are frozen. Edits arrive as a stream, and each edit is applied as an update to those matrices (Li et al., 2 Jul 2026).

The paper introduces Edit-Scoped Generalization, defined as the joint requirement of “in-scope cross-modal generalization” and “out-of-scope locality preservation.” Its pilot study identifies a “scope gap” even when the original edit instance is already correct: only 62.20% of such edits exhibit “proper propagation,” 28.60% show “under-generalization,” 7.20% show “over-generalization,” and 2.00% show “entangled failure” (Li et al., 2 Jul 2026).

ScopeEdit’s CEM is a layerwise decomposition of each update into two branches.

The modality-local absorption branch uses textual FFN keys only, is always active, and is intended to support reliable correction and locality. The evidence-gated shared generalization branch uses joint visual-text keys and propagates edits across modalities only when visual and textual evidence are aligned. In the shared branch, text-only and visual-only keys are projected into a shared low-rank space. Directional agreement between the projected keys is measured by cosine similarity, bilateral support is measured by norm balance, and the two quantities together determine the gate.

If the modalities are aligned and both strong, the gate opens and the shared branch is active; if they disagree or one modality is weak, the shared branch is suppressed.
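The gating behavior can be made concrete with a small sketch. Everything specific in it is an assumption made for illustration: the projection matrix `P`, the particular norm-balance formula, the product of agreement and balance, and the sigmoid with `sharpness` and `threshold` are not taken from ScopeEdit, which is described here only as combining cosine similarity and norm balance.

```python
import numpy as np

def evidence_gate(k_text, k_vis, P, sharpness=10.0, threshold=0.5, eps=1e-8):
    """Hypothetical evidence gate in a shared low-rank key space.

    P projects text-only and visual-only keys into the shared space.
    The gate is high only if the projected keys point the same way
    (agreement) and have comparable magnitude (balance).
    """
    u, v = P @ k_text, P @ k_vis
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    agreement = float(u @ v) / (nu * nv + eps)        # cosine similarity
    balance = 2.0 * nu * nv / (nu**2 + nv**2 + eps)   # 1 if equal norms, near 0 if one is weak
    evidence = max(agreement, 0.0) * balance
    return 1.0 / (1.0 + np.exp(-sharpness * (evidence - threshold)))

P = np.eye(2, 4)  # toy projection from a 4-d key space to a 2-d shared space
k = np.array([1.0, 0.0, 0.0, 0.0])
print(evidence_gate(k, k, P))          # aligned and equally strong
print(evidence_gate(k, -k, P))         # opposite directions
print(evidence_gate(k, 0.01 * k, P))   # aligned, but one modality is weak
```

The first call yields a gate close to one and the other two yield gates close to zero, matching the qualitative behavior described above.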

The write geometry is explicitly low-rank and orthogonal: the local and shared branches write through low-rank bases that satisfy orthogonality constraints. Branchwise proximal updates use inverse preconditioners, and these preconditioners are updated recursively with the Sherman–Morrison identity, yielding constant per-edit overhead (Li et al., 2 Jul 2026).
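The constant per-edit cost follows from the Sherman–Morrison identity, which updates an inverse after a rank-one change without re-inverting the matrix. The sketch below shows the identity and, as an assumed illustration, a recursive-least-squares-style write that uses it; the function names and the exact form of the write are hypothetical and are not ScopeEdit's branchwise update.

```python
import numpy as np

def sherman_morrison_update(P, k):
    """Given symmetric P = A^{-1}, return (A + k k^T)^{-1} in O(d^2) time."""
    Pk = P @ k
    return P - np.outer(Pk, Pk) / (1.0 + k @ Pk)

def rls_write(W, P, k, v_target):
    """Recursive-least-squares-style edit (illustrative assumption):
    move W @ k toward v_target, then refresh the inverse preconditioner."""
    Pk = P @ k
    gain = Pk / (1.0 + k @ Pk)
    W_new = W + np.outer(v_target - W @ k, gain)
    return W_new, sherman_morrison_update(P, k)

# Check the identity against a direct inverse.
rng = np.random.default_rng(0)
A = 2.0 * np.eye(4)
k = rng.normal(size=4)
P_new = sherman_morrison_update(np.linalg.inv(A), k)
print(np.allclose(P_new, np.linalg.inv(A + np.outer(k, k))))
```

Each call costs a fixed number of matrix-vector and outer products, so the work per edit does not grow with the number of edits already applied.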

A distinct but closely related problem is identified in unified multimodal models. UniKE shows that text-side efficacy can reach “approximately 92%,” whereas “the best overall VQA accuracy under direct image generation is only 18.5%.” In the direct setting, BLIP3o-4B + PMET achieves overall text-side efficacy of 76.30% and overall VQA accuracy of 18.51%; Ovis-U1 + PMET achieves 72.18% and 9.71%; OmniGen2 + AlphaEdit achieves 76.37% and 11.50%. The paper attributes this modality gap to partial alignment between edited textual representations and the conditioning pathways for visual generation. It proposes Reasoning-augmented Parameter Editing, which “explicitly activates edited knowledge before generation” and improves overall VQA accuracy for all evaluated model-editor pairs, for example raising Ovis-U1 + PMET from 9.71% to 28.32% (Gao et al., 30 May 2026).

Together, these results establish that a CEM for multimodal knowledge editing must control not only whether an edit is written, but also the semantic boundary of its propagation and the pathway by which that edit reaches another modality.

4. Modulation-and-selection CEMs in RGB-D perception

In RGB-D salient object detection, the cmMS block provides a non-generative but structurally clear CEM formulation. It consists of cross-modality feature modulation (cmFM), adaptive feature selection (AFS), and saliency-guided position-edge attention (sg-PEA), and is applied in a coarse-to-fine manner over VGG-16 feature levels (Li et al., 2020).

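cmFM treats depth as a prior that modulates RGB features through a pixel-wise affine transform. Given an RGB feature map and the corresponding depth feature map, a small CNN with two parallel branches predicts a scale map and a shift map from the depth features, and the modulated feature is obtained by multiplying the RGB features by the scale and adding the shift. The modulation is explicitly pixel-wise: every spatial position has its own scale and shift.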

This mechanism is motivated by the observation that depth is complementary but can also be noisy or misaligned; cmFM therefore learns where depth should amplify or shift RGB responses.
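A minimal sketch of depth-conditioned affine modulation is given below. The names `gamma` and `beta`, the channel-first layout, and the use of per-pixel linear maps (1x1 convolutions) in place of the small two-branch CNN are assumptions made for illustration, not details of the cmMS implementation.

```python
import numpy as np

def predict_affine(f_depth, w_gamma, w_beta):
    """Two parallel 1x1-convolution branches (hypothetical stand-in for the
    small CNN) mapping depth features (C, H, W) to scale and shift maps."""
    gamma = np.einsum("oc,chw->ohw", w_gamma, f_depth)
    beta = np.einsum("oc,chw->ohw", w_beta, f_depth)
    return gamma, beta

def cmfm_modulate(f_rgb, gamma, beta):
    """Pixel-wise affine modulation: each entry gets its own scale and shift."""
    assert f_rgb.shape == gamma.shape == beta.shape
    return gamma * f_rgb + beta

rng = np.random.default_rng(0)
f_rgb = rng.normal(size=(2, 3, 3))
f_depth = rng.normal(size=(2, 3, 3))
gamma, beta = predict_affine(f_depth, np.eye(2), np.zeros((2, 2)))
out = cmfm_modulate(f_rgb, gamma, beta)
print(out.shape)
```

Because the scale and shift vary per position, unreliable depth regions can in principle be given a near-neutral transform while reliable regions amplify or shift the RGB response.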

AFS performs cross-modality “editing” in channel and spatial dimensions. For each input stream, global average pooling and an SE mapping produce channel weights, followed by channel rescaling. After this self-modality selection, the four streams are concatenated and passed through a second SE mapping to produce channel-attention-on-channel-attention (CACA). In parallel, gated spatial fusion predicts modality-specific confidence maps and uses them to fuse the modality features position by position. The final AFS output combines the channel-selected and spatially fused features.
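The two ingredients of AFS, squeeze-and-excitation channel weighting and confidence-gated spatial fusion, can be sketched as follows. The sketch uses two streams instead of four, and the bottleneck weights `W1`, `W2` and the per-pixel softmax over confidence logits are assumptions for illustration; the normalization used in cmMS may differ.

```python
import numpy as np

def se_channel_weights(f, W1, W2):
    """SE mapping: global average pooling, ReLU bottleneck, sigmoid.
    f has shape (C, H, W); the result has shape (C,)."""
    z = f.mean(axis=(1, 2))
    h = np.maximum(W1 @ z, 0.0)
    return 1.0 / (1.0 + np.exp(-(W2 @ h)))

def rescale_channels(f, w):
    """Channel rescaling with the SE weights."""
    return w[:, None, None] * f

def gated_spatial_fusion(f_a, f_b, logit_a, logit_b):
    """Fuse two modality features with per-pixel confidence maps obtained by
    a softmax over the two (H, W) logit maps (assumed normalization)."""
    m = np.maximum(logit_a, logit_b)          # subtract the max for stability
    e_a, e_b = np.exp(logit_a - m), np.exp(logit_b - m)
    g_a = e_a / (e_a + e_b)
    return g_a[None] * f_a + (1.0 - g_a)[None] * f_b

rng = np.random.default_rng(0)
f_rgb, f_depth = rng.normal(size=(2, 4, 3, 3))
w = se_channel_weights(f_rgb, rng.normal(size=(2, 4)), rng.normal(size=(4, 2)))
fused = gated_spatial_fusion(rescale_channels(f_rgb, w), f_depth,
                             np.zeros((3, 3)), np.zeros((3, 3)))
print(w.shape, fused.shape)
```

With equal confidence logits the fusion reduces to a plain average, and as one logit dominates at a position the output at that position approaches the corresponding modality's feature.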

sg-PEA then refines these saliency-related features with task-guided attention. Position attention uses the up-sampled saliency map from the next coarser level, and edge attention uses a predicted edge map. Training uses deep supervision at every level (Li et al., 2020).

This formulation broadens the meaning of CEM. The module is not editing knowledge or diffusion latents; rather, it edits feature salience by using one modality as a prior, then adaptively selecting and refining cross-modal evidence.

5. Benchmarks, evaluation criteria, and empirical signatures

The empirical study of CEM-like systems is strongly benchmark-dependent. AvED-Bench contains 110 videos, each 10 seconds, spanning 11 categories from VGGSound, and is explicitly designed for “zero-shot audio-video editing.” OAVE contains 44 categories, “10 images per category (clips) + separate audio,” and “25 prompt templates that modify sounding object or environment” (Lin et al., 26 Mar 2025).

For zero-shot audio-video editing, evaluation is divided into video-only, audio-only, and joint audio-video metrics. Video-only metrics are CLIP-F, CLIP-T, DINO, and Obj; audio-only metrics are CLAP and LPAPS; joint metrics are IB, AV-Align, and ACC. These metrics separate source preservation, text alignment, high-level cross-modal coherence, and low-level temporal synchronization (Lin et al., 26 Mar 2025).

ScopeEdit evaluates edited instance reliability, text-conditioned and image-conditioned generalization, and text/image locality on E-VQA, E-IC, and VLKEB. On BLIP2-OPT E-IC @100, M-ORE records M-Gen 66.31, M-Loc 94.97, whereas ScopeEdit records M-Gen 85.48, M-Loc 97.17. On LLaVA-v1.5 E-IC @100, M-ORE records M-Gen 78.29, M-Loc 89.21, whereas ScopeEdit records M-Gen 86.35, M-Loc 92.26 (Li et al., 2 Jul 2026).

UniKE supplies a complementary benchmark for cross-modality knowledge editing in unified multimodal models. It contains 2,971 edit subjects—964 attribute edits and 2,007 relation edits—expanded into 5,535 evaluation instances. Attribute edits are organized into four stages, with counts after filtering of 959, 874, 858, and 837. Visual success is assessed by VQA-based verification with Qwen3-VL-235B-A22B-Instruct, and the benchmark reports text-side Efficacy, Reasoning Accuracy, and VQA Accuracy (Gao et al., 30 May 2026).

A recurrent empirical signature across these benchmarks is that unimodal success does not guarantee multimodal transfer. In AvED, existing zero-shot audio and video editing methods are limited by synchronization and coherence. In ScopeEdit, reliable instance correction does not guarantee correct scope. In UniKE, text edits often fail to manifest visually. This convergence across task families is one of the strongest arguments for treating CEM as a distinct design problem rather than a minor extension of unimodal editing.

6. Limitations, misconceptions, and design implications

A common misconception is that cross-modal editing is achieved whenever a model contains cross-attention. The surveyed systems do not support that view. AvED requires not only text-conditioned cross-attention but also relevance scoring, thresholded patch selection, and a cross-modal contrastive loss. ScopeEdit requires branch separation, orthogonal write geometries, and evidence-gated propagation. cmMS requires modulation, channel selection, spatial confidence estimation, and task-guided refinement. UniKE shows that even successful text-side parameter edits may remain too weak or misaligned to steer image synthesis (Lin et al., 26 Mar 2025, Li et al., 2 Jul 2026, Li et al., 2020, Gao et al., 30 May 2026).

A second misconception is that edit reliability on the original instance implies correct cross-modal behavior. ScopeEdit’s “scope gap” and UniKE’s modality gap directly contradict this. Reliable edits can still under-generalize, over-generalize, or fail to transfer visually. UniKE’s mechanistic analysis localizes part of this problem to the conditioning interface: for Ovis-U1, the frozen projection preserves only 34.82% of edit perturbation energy in the top-1536 right singular vectors, close to the theoretical 37.50% for a random vector, and the paper concludes that the bottleneck is before the DiT rather than inside it (Li et al., 2 Jul 2026, Gao et al., 30 May 2026).
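The 37.50% baseline is consistent with a simple geometric fact: an isotropic random vector in d dimensions places, in expectation, a fraction k/d of its energy in any fixed k-dimensional subspace. With k = 1536, a value of 37.50% corresponds to d = 4096; this dimension is inferred from the ratio and is not stated in the text above. The sketch below measures the energy fraction for random vectors using small hypothetical sizes (d = 64, k = 24, giving the same ratio) and a random matrix standing in for the frozen projection.

```python
import numpy as np

def energy_in_top_k(M, delta, k):
    """Fraction of ||delta||^2 that lies in the span of the top-k right
    singular vectors of M (singular values are returned in descending order)."""
    _, _, Vt = np.linalg.svd(M, full_matrices=True)
    coords = Vt[:k] @ delta
    return float(coords @ coords / (delta @ delta))

rng = np.random.default_rng(0)
d, k = 64, 24                      # hypothetical sizes with k / d = 0.375
M = rng.normal(size=(d, d))        # stand-in for the frozen projection
fractions = [energy_in_top_k(M, rng.normal(size=d), k) for _ in range(2000)]
print(np.mean(fractions), k / d)
```

A measured fraction near k/d therefore indicates that the edit perturbation is oriented almost at random relative to the dominant input directions of the projection, which is the reading the paper gives to the 34.82% figure.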

Each formulation also exposes domain-specific failure modes. AvED reports that complex edits with multiple objects or events can cause temporal inconsistencies or partial misalignment, that poorly tuned thresholding or random selection can degrade DINO, LPAPS, and AV-Align, and that larger grids improve frame consistency but can slightly hurt text/object fidelity and AV alignment. ScopeEdit notes that gating in low-rank key space may not capture all semantics, and that rank and gate hyperparameters require tuning. cmMS reports that removing cmFM, AFS, or sg-PEA reduces performance, and that naïve addition or concatenation is inferior to depth-conditioned modulation. UniKE emphasizes that direct text-backbone editing is insufficient when conditioning bottlenecks attenuate the signal (Lin et al., 26 Mar 2025, Li et al., 2 Jul 2026, Li et al., 2020, Gao et al., 30 May 2026).

These results suggest several stable design principles. A plausible implication is that a CEM should explicitly separate where to edit from where to propagate: AvED separates relevant from irrelevant patches; ScopeEdit separates local absorption from shared propagation; cmMS separates modulation, selection, and task-guided refinement; UniKE separates parameter editing from activation before generation. A second plausible implication is that shared conditioning must be accompanied by some form of alignment control—contrastive pairing, evidence gating, confidence maps, or reasoning scaffolds—otherwise cross-modal transfer is either weak or poorly localized. A third plausible implication is that evaluation should be inherently cross-modal: text accuracy, per-instance reliability, or feature salience alone do not characterize whether the edited knowledge or refined representation has actually propagated to the other modality in a semantically appropriate way.