
Depth Copy Paste: Augmentation Techniques

Updated 16 December 2025
  • Depth Copy Paste is a data augmentation technique that leverages pixel- or token-level copying informed by depth cues to ensure semantic and spatial consistency.
  • It integrates depth-guided placement, occlusion-aware segmentation, and position-indexed operations to produce realistic and structured outputs.
  • Quantitative evaluations demonstrate improved model performance in face detection, urban segmentation, and language model tasks compared to baseline methods.

Depth Copy Paste (DCP) refers to a family of data manipulation and augmentation techniques that leverage pixel- or token-level copying and pasting informed by spatial, semantic, or positional “depth.” Across computer vision and language modeling, DCP enables the controlled recombination of content from sources into new contexts while preserving meaningful structure—such as geometric depth in images or positional indices in text. Modern DCP frameworks address the limitations of naïve copy-paste by integrating depth-aware region selection, compositional logic, or positional toolchains, producing more coherent, robust, and controllable outputs. Principal applications include data augmentation for visual detection/segmentation and the manipulation of LLM contexts with explicit position-based copy-paste operations.

1. Motivation and Core Problems

Classic copy-paste augmentation improves data variability by inserting foreground instances into new backgrounds. However, the absence of physical, semantic, or structural consistency can produce implausible composites that degrade model performance. Key limitations addressed by DCP methods include:

  • Semantic inconsistency: Randomly chosen backgrounds are often contextually misaligned with the pasted foreground (e.g., pasting an indoor portrait onto an outdoor scene), reducing the effectiveness of augmented samples for robust model training (Guo, 12 Dec 2025).
  • Occlusion and artifact generation: Naive segmentation fails to account for partial visibility, potentially including occluded or corrupted regions in the composite (e.g., hair behind a hand leading to ghosting artifacts).
  • Geometric misalignment: Lack of spatial or scale reasoning causes pasted content to float, overlap unnaturally, or appear at an unrepresentative scale—all of which reduce data realism.
  • Positional ambiguity in language: Without explicit position indices, LLMs struggle to replicate copy-paste operations deterministically, particularly for hierarchical or length-constrained tasks (Wang et al., 9 Oct 2024).

These challenges motivate explicit depth, semantics, or position modeling during copy-paste, producing physically and structurally plausible synthetic data for downstream learning tasks.

2. Methods: Image-Based Depth Copy Paste Pipelines

The DCP workflow for images, applied in particular to robust face detection and urban-scene segmentation, comprises three main modules: semantic retrieval, foreground segmentation with occlusion handling, and depth-guided placement.

2.1 Semantic and Visual Coherence

A foreground image $I_f$ (e.g., a person or face) is paired with backgrounds $B_i$ that are both semantically and visually compatible. BLIP is used to caption $I_f$, yielding a contextual string $C_f$ such as “a smiling person indoors.” Both the caption and the image are embedded with CLIP:

  • $e_t = \mathrm{CLIP}_{\text{text}}(C_f)$ (text embedding)
  • $e_v = \mathrm{CLIP}_{\text{img}}(I_f)$ (visual embedding)
  • For each background $B_i$, $b_i = \mathrm{CLIP}_{\text{img}}(B_i)$

Cosine similarities are computed for both channels:

  • $s_i^{(v)} = \langle e_v, b_i \rangle / (\|e_v\| \cdot \|b_i\|)$ (visual)
  • $s_i^{(t)} = \langle e_t, b_i \rangle / (\|e_t\| \cdot \|b_i\|)$ (semantic)

The final composite similarity score:

$$S_i = \lambda s_i^{(v)} + (1 - \lambda) s_i^{(t)}$$

Top-scoring backgrounds are selected for compositing, with $\lambda = 0.5$ by default (Guo, 12 Dec 2025).
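This fusion step can be sketched as follows. The snippet assumes the CLIP embeddings $e_v$, $e_t$ and the background embeddings $b_i$ have already been computed (the BLIP and CLIP encoder calls are omitted); all names and defaults are illustrative rather than taken from a reference implementation.

```python
import numpy as np

def rank_backgrounds(e_v, e_t, bg_embeds, lam=0.5):
    """Rank candidate backgrounds by the fused similarity S_i.

    e_v       : (d,) CLIP image embedding of the foreground I_f
    e_t       : (d,) CLIP text embedding of the BLIP caption C_f
    bg_embeds : (n, d) CLIP image embeddings b_i of candidate backgrounds
    lam       : fusion weight lambda between visual and semantic similarity
    """
    def cos(a, B):
        # cosine similarity between a single vector and each row of B
        return (B @ a) / (np.linalg.norm(B, axis=1) * np.linalg.norm(a) + 1e-8)

    s_v = cos(e_v, bg_embeds)                 # visual similarities s_i^(v)
    s_t = cos(e_t, bg_embeds)                 # semantic similarities s_i^(t)
    scores = lam * s_v + (1 - lam) * s_t      # fused scores S_i
    return np.argsort(scores)[::-1], scores   # indices sorted best-first

# Usage: ranked, S = rank_backgrounds(e_v, e_t, bg_embeds); top_k = ranked[:5]
```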

2.2 Occlusion-Aware Foreground Extraction

Foreground masks are constructed using a combination of segmentation and depth-based occlusion detection. SAM3 provides a binary mask $M_\text{sam}$ outlining the subject. A per-pixel depth map $D_f(p)$ is estimated via Depth-Anything. The local depth deviation for each foreground pixel $p$ is:

$$\delta(p) = \Bigl| D_f(p) - \frac{1}{|\mathcal{N}(p)|} \sum_{q \in \mathcal{N}(p)} D_f(q) \Bigr|$$

Thresholding $\delta(p)$ yields a visibility mask $M_\text{vis}$, which is intersected with $M_\text{sam}$ to obtain the final mask $M_\text{fg}$ for compositing. This process systematically excludes occluded or noisy regions, preserving only the valid, visible surface (Guo, 12 Dec 2025).
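A minimal sketch of this visibility-mask computation is given below; it assumes a depth map normalized to [0, 1], and the neighbourhood size and threshold are placeholder values rather than settings reported in the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def occlusion_aware_mask(mask_sam, depth_f, win=7, tau=0.05):
    """Intersect a segmentation mask with a depth-based visibility mask.

    mask_sam : (H, W) bool, subject mask from the segmenter (M_sam)
    depth_f  : (H, W) float, foreground depth map D_f, assumed in [0, 1]
    win      : neighbourhood size |N(p)| for the local mean (placeholder value)
    tau      : threshold on the local depth deviation delta(p) (placeholder value)
    """
    local_mean = uniform_filter(depth_f, size=win)   # mean depth over N(p)
    delta = np.abs(depth_f - local_mean)             # delta(p)
    mask_vis = delta < tau                           # depth-consistent (visible) pixels
    return mask_sam & mask_vis                       # final compositing mask M_fg
```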

2.3 Depth-Guided Placement

The masked foreground is spatially aligned with the target background by sliding a window over the normalized background depth map $\hat{D}_b$, seeking a location with matched mean depth, variance, and depth gradient smoothness. For window $W_{x,y}$:

  • Mean/variance: $\mu_{x,y}$, $\sigma_{x,y}$ (window), $\mu_f$, $\sigma_f$ (foreground)
  • Deviations: $\Delta_\text{mean}(x,y)$, $\Delta_\text{var}(x,y)$
  • Smoothness: $S_\text{smooth}(x,y)$ (average gradient magnitude)
  • Combined score:

$$S(x, y) = \alpha \Bigl(1 - \frac{\Delta_\text{mean}(x, y)}{\max \Delta_\text{mean}}\Bigr) + \beta \Bigl(1 - \frac{\Delta_\text{var}(x, y)}{\max \Delta_\text{var}}\Bigr) + \gamma \Bigl(1 - \frac{S_\text{smooth}(x, y)}{\max S_\text{smooth}}\Bigr)$$

The optimal paste location is $(x^*, y^*) = \arg\max_{(x, y)} S(x, y)$. Typical weights are $\alpha = 0.4$, $\beta = 0.4$, $\gamma = 0.2$; stride $s = 16$ px (Guo, 12 Dec 2025). This ensures local depth continuity and realistic integration.
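The sliding-window search can be sketched as a brute-force scan over the background depth map; the small epsilon terms guard against division by zero and are not part of the published formulation, while the weight and stride defaults mirror the values quoted above.

```python
import numpy as np

def best_paste_location(depth_bg, mu_f, var_f, h, w,
                        alpha=0.4, beta=0.4, gamma=0.2, stride=16):
    """Return the top-left corner (x*, y*) of the best-matching (h, w) window."""
    H, W = depth_bg.shape
    gy, gx = np.gradient(depth_bg)
    grad_mag = np.hypot(gx, gy)                          # per-pixel gradient magnitude

    xs, ys, d_mean, d_var, smooth = [], [], [], [], []
    for y in range(0, H - h + 1, stride):
        for x in range(0, W - w + 1, stride):
            win = depth_bg[y:y + h, x:x + w]
            d_mean.append(abs(win.mean() - mu_f))             # Delta_mean(x, y)
            d_var.append(abs(win.var() - var_f))              # Delta_var(x, y)
            smooth.append(grad_mag[y:y + h, x:x + w].mean())  # S_smooth(x, y)
            xs.append(x); ys.append(y)

    d_mean, d_var, smooth = map(np.asarray, (d_mean, d_var, smooth))
    score = (alpha * (1 - d_mean / (d_mean.max() + 1e-8))
             + beta * (1 - d_var / (d_var.max() + 1e-8))
             + gamma * (1 - smooth / (smooth.max() + 1e-8)))
    best = int(np.argmax(score))
    return xs[best], ys[best]
```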

3. Methods: Depth Region Copy-Paste for Contrastive Segmentation

For contrastive urban-scene segmentation, depth-coherent regions are extracted and pasted to encourage context-invariant representations (Zeng et al., 2022).

  • Superpixels are generated (SLIC) and grouped via a region-adjacency graph weighted by occlusion and support distances (using average 3D coordinates and normals), then cluster assignment is resolved via iterative InfoMap.
  • Only sufficiently large, spatially coherent depth regions are retained.
  • Copy-paste involves random geometric/photometric augmentation, controlled offset sampling, and occlusion-aware compositing using the “DepthMix” rule, which resolves paste-mask pixels by minimum depth at each location.
  • Region-level and pixel-level positive pairs across two views are tracked under the SwAV contrastive loss, with loss balance $\mathcal{L} = \lambda\,\mathcal{L}_{\rm pixel} + (1-\lambda)\,\mathcal{L}_{\rm region}$ (Zeng et al., 2022).

This unsupervised approach increases segmentation robustness by forcing feature learning on semantically consistent 3D regions under varied scene context.
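The minimum-depth compositing rule can be illustrated with a short sketch; it assumes the copied region has already been geometrically and photometrically augmented and warped into the target frame, and the names are illustrative rather than taken from the reference code.

```python
import numpy as np

def depthmix_composite(img_a, depth_a, img_b, depth_b, paste_mask):
    """Occlusion-aware compositing in the spirit of the DepthMix rule.

    Inside the paste mask, each pixel keeps whichever source is closer to the
    camera (smaller depth), so pasted content is occluded by nearer scene parts.

    img_a, depth_a : target image (H, W, 3) and its depth map (H, W)
    img_b, depth_b : source image and depth map holding the copied region
    paste_mask     : (H, W) bool mask of the region copied from img_b
    """
    use_b = paste_mask & (depth_b < depth_a)              # source wins where nearer
    out_img = np.where(use_b[..., None], img_b, img_a)
    out_depth = np.where(use_b, depth_b, depth_a)
    return out_img, out_depth
```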

4. Methods: PositionID Copy-Paste in LLMs

PositionID CP Prompting introduces an explicit position-indexed copy-paste toolchain for autoregressive LLMs, enabling deterministic substring replication within generated context (Wang et al., 9 Oct 2024).

  • A position ID $p_i = i$ is assigned to each token upon invocation of the copy tool, yielding the context $(x_1, p_1), \dots, (x_N, p_N)$.
  • The model emits special function-call tokens for copy (“<COPY>[tag] [desc] [start] [end]</COPY>”) and paste (“<PASTE>[tag]</PASTE>”).
  • Upon a <COPY> call, the substring $x_s \dots x_e$ is stored in an external clipboard keyed by tag.
  • When <PASTE>[tag]</PASTE> is emitted, the clipboard content is inserted verbatim at the current generation point.
  • Benchmarks (CP-Bench) measure copy-paste accuracy (CP Success Rate), Rouge-L, and perplexity.

Current implementations support single-level copy-paste but lack explicit hierarchical (depth) or nested references. The immediate effect is an 80.8% CP Success Rate (vs. 68.1% for few-shot prompting and 0% zero-shot), with substantial reductions in perplexity and improved subjective ratings (Wang et al., 9 Oct 2024).
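The toolchain can be emulated with a toy executor over a position-indexed context, as sketched below; the exact serialization of the tool calls is an assumption made for illustration and only loosely follows the paper's format.

```python
import re

def apply_copy_paste(tokens, generation):
    """Toy executor for <COPY>/<PASTE> tool calls over a position-indexed context.

    tokens     : list of context tokens; token i carries position ID p_i = i
    generation : raw model output that may contain tool-call spans
    """
    clipboard = {}

    # <COPY>tag desc start end</COPY> -> store tokens[start..end] (inclusive) under tag
    def do_copy(m):
        tag, start, end = m.group(1), int(m.group(3)), int(m.group(4))
        clipboard[tag] = " ".join(tokens[start:end + 1])
        return ""  # the tool call itself is removed from the visible output

    out = re.sub(r"<COPY>(\S+) (.*?) (\d+) (\d+)</COPY>", do_copy, generation)

    # <PASTE>tag</PASTE> -> insert the stored span verbatim
    out = re.sub(r"<PASTE>(\S+)</PASTE>", lambda m: clipboard.get(m.group(1), ""), out)
    return out

# Usage:
#   tokens = "the quick brown fox jumps".split()
#   apply_copy_paste(tokens, "<COPY>a span 1 3</COPY>Copied: <PASTE>a</PASTE>")
#   -> "Copied: quick brown fox"
```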

5. Quantitative Evaluation

Face detection augmentation (DCP, WIDER Face, RetinaFace backbone):

Synthetic Ratio    mAP
0% (baseline)      0.847
40%                0.866
≈50% (peak)        0.873

Augmentation Comparison (Easy/Medium/Hard, WIDER Val):

Method               Easy    Medium  Hard
Baseline (no aug.)   0.879   0.801   0.769
Cutout               0.883   0.811   0.783
CutMix               0.891   0.817   0.779
Random Copy-Paste    0.904   0.827   0.791
Depth-Copy-Paste     0.917   0.834   0.803

Ablation results show that each DCP module—multimodal semantic retrieval, occlusion-aware mask, depth-guided placement—yields measurable gains. For segmentation, “Copy-Pasting Coherent Depth Regions” surpasses previous state-of-the-art by +7.14% mIoU on Cityscapes and +6.65% on KITTI (Zeng et al., 2022).

LLM copy-paste (CP-Bench, PositionID Prompting):

  • Zero-shot: CP-S.R. = 0%
  • Few-shot: CP-S.R. = 68.1%
  • PositionID: CP-S.R. = 80.8%, Rouge-L = 18.4, PPL = 8.4

6. Qualitative Effects and Analysis

Depth Copy Paste in images results in pasted objects that “stand naturally on the floor or ground plane,” with preserved contours and no “floating” artifacts. Occluded or corrupted regions (e.g., parts of hair occluded by a hand) are omitted, avoiding texture corruption. In language, explicit position-indexed copy-paste eliminates substring mismatch and random errors prevalent in unconditioned copying (Wang et al., 9 Oct 2024).

Common failure cases include:

  • Misestimated depth leading to slight scale mismatch in composited images or misplacement near complex background regions.
  • For LLMs, absence of hierarchical copy-paste prevents nested edit operations or multi-level clipboard logic.

A plausible implication is that further generalization could enable multi-span edits, hierarchical structural manipulation, or video-consistent augmentations.

7. Limitations and Future Directions

Noted limitations of existing DCP frameworks include:

  • Monocular noise: Depth estimation is less accurate in featureless or occluded regions, occasionally producing suboptimal paste locations or artifacts (Guo, 12 Dec 2025; Zeng et al., 2022).
  • Computational overhead: Joint use of semantic and depth estimators increases augmentation synthesis cost.
  • Fixed weights: Default weighting parameters ($\lambda$, $\alpha$, $\beta$, $\gamma$) are handcrafted; learning these parameters from validation data may improve compositing quality.
  • Hierarchical copy-paste: Current PositionID copy-paste supports only single-level (“flat”) operations; future work could extend to “COPY inside COPY” or stack-based editors.
  • Scalability: The explicit use of position IDs doubles context length in language modeling, with corresponding efficiency tradeoffs (Wang et al., 9 Oct 2024).

Future development directions include task-adaptive fusion weighting, generalization to other object categories and temporal consistency for videos, generative inpainting/refinement at paste boundaries, and hierarchical clipboard logic for programmatic editing.


References:

  • "Depth-Copy-Paste: Multimodal and Depth-Aware Compositing for Robust Face Detection" (Guo, 12 Dec 2025)
  • "Copy-Pasting Coherent Depth Regions Improves Contrastive Learning for Urban-Scene Segmentation" (Zeng et al., 2022)
  • "PositionID: LLMs can Control Lengths, Copy and Paste with Explicit Positional Awareness" (Wang et al., 9 Oct 2024)
