
Depth Copy Paste: Augmentation Techniques

Updated 16 December 2025
  • Depth Copy Paste is a data augmentation technique that leverages pixel- or token-level copying informed by depth cues to ensure semantic and spatial consistency.
  • It integrates depth-guided placement, occlusion-aware segmentation, and position-indexed operations to produce realistic and structured outputs.
  • Quantitative evaluations demonstrate improved model performance in face detection, urban segmentation, and language model tasks compared to baseline methods.

Depth Copy Paste (DCP) refers to a family of data manipulation and augmentation techniques that leverage pixel- or token-level copying and pasting informed by spatial, semantic, or positional “depth.” Across computer vision and language modeling, DCP enables the controlled recombination of content from sources into new contexts while preserving meaningful structure—such as geometric depth in images or positional indices in text. Modern DCP frameworks address the limitations of naïve copy-paste by integrating depth-aware region selection, compositional logic, or positional toolchains, producing more coherent, robust, and controllable outputs. Principal applications include data augmentation for visual detection/segmentation and the manipulation of LLM contexts with explicit position-based copy-paste operations.

1. Motivation and Core Problems

Classic copy-paste augmentation improves data variability by inserting foreground instances into new backgrounds. However, the absence of physical, semantic, or structural consistency can produce implausible composites that degrade model performance. Key limitations addressed by DCP methods include:

  • Semantic inconsistency: Randomly chosen backgrounds are often contextually misaligned with the pasted foreground (e.g., pasting an indoor portrait onto an outdoor scene), reducing the effectiveness of augmented samples for robust model training (Guo, 12 Dec 2025).
  • Occlusion and artifact generation: Naive segmentation fails to account for partial visibility, potentially including occluded or corrupted regions in the composite (e.g., hair behind a hand leading to ghosting artifacts).
  • Geometric misalignment: Lack of spatial or scale reasoning causes pasted content to float, overlap unnaturally, or appear at an unrepresentative scale—all of which reduce data realism.
  • Positional ambiguity in language: Without explicit position indices, LLMs struggle to replicate copy-paste operations deterministically, particularly for hierarchical or length-constrained tasks (Wang et al., 9 Oct 2024).

These challenges motivate explicit depth, semantics, or position modeling during copy-paste, producing physically and structurally plausible synthetic data for downstream learning tasks.

2. Methods: Image-Based Depth Copy Paste Pipelines

The DCP workflow for images, applied in particular to robust face detection and urban-scene segmentation, comprises three main modules: semantic retrieval, foreground segmentation with occlusion handling, and depth-guided placement.

2.1 Semantic and Visual Coherence

A foreground image $I_f$ (e.g., a person or face) is paired with backgrounds $B_i$ that are both semantically and visually compatible. BLIP is used to caption $I_f$, yielding a contextual string $C_f$ such as “a smiling person indoors.” Both the caption and the image are embedded with CLIP:

  • $e_t = \mathrm{CLIP}_{\text{text}}(C_f)$ (text embedding)
  • $e_v = \mathrm{CLIP}_{\text{img}}(I_f)$ (visual embedding)
  • For each background $B_i$, $b_i = \mathrm{CLIP}_{\text{img}}(B_i)$

Cosine similarities are computed for both channels:

  • $s_i^{(v)} = \langle e_v, b_i \rangle / (\|e_v\| \cdot \|b_i\|)$ (visual)
  • $s_i^{(t)} = \langle e_t, b_i \rangle / (\|e_t\| \cdot \|b_i\|)$ (semantic)

The final composite similarity score:

$$S_i = \lambda s_i^{(v)} + (1 - \lambda) s_i^{(t)}$$

Top-scoring backgrounds are selected for compositing, with $\lambda = 0.5$ by default (Guo, 12 Dec 2025).
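This fusion step can be sketched as follows. The snippet assumes the CLIP embeddings $e_v$, $e_t$ and the background embeddings $b_i$ have already been computed (the BLIP and CLIP encoder calls are omitted); all names and defaults are illustrative rather than taken from a reference implementation.

```python
import numpy as np

def rank_backgrounds(e_v, e_t, bg_embeds, lam=0.5):
    """Rank candidate backgrounds by the fused similarity S_i.

    e_v       : (d,) CLIP image embedding of the foreground I_f
    e_t       : (d,) CLIP text embedding of the BLIP caption C_f
    bg_embeds : (n, d) CLIP image embeddings b_i of candidate backgrounds
    lam       : fusion weight lambda between visual and semantic similarity
    """
    def cos(a, B):
        # cosine similarity between a single vector and each row of B
        return (B @ a) / (np.linalg.norm(B, axis=1) * np.linalg.norm(a) + 1e-8)

    s_v = cos(e_v, bg_embeds)                 # visual similarities s_i^(v)
    s_t = cos(e_t, bg_embeds)                 # semantic similarities s_i^(t)
    scores = lam * s_v + (1 - lam) * s_t      # fused scores S_i
    return np.argsort(scores)[::-1], scores   # indices sorted best-first

# Usage: ranked, S = rank_backgrounds(e_v, e_t, bg_embeds); top_k = ranked[:5]
```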

2.2 Occlusion-Aware Foreground Extraction

Foreground masks are constructed using a combination of segmentation and depth-based occlusion detection. SAM3 provides a binary mask $M_\text{sam}$ outlining the subject. A per-pixel depth map $D_f(p)$ is estimated via Depth-Anything. The local depth deviation for each foreground pixel $p$ is:

$$\delta(p) = \Bigl| D_f(p) - \frac{1}{|\mathcal{N}(p)|} \sum_{q \in \mathcal{N}(p)} D_f(q) \Bigr|$$

Thresholding $\delta(p)$ yields a visibility mask $M_\text{vis}$, which is intersected with $M_\text{sam}$ to obtain the final mask $M_\text{fg}$ for compositing. This process systematically excludes occluded or noisy regions, preserving only the valid, visible surface (Guo, 12 Dec 2025).
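A minimal sketch of this visibility-mask computation is given below; it assumes a depth map normalized to [0, 1], and the neighbourhood size and threshold are placeholder values rather than settings reported in the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def occlusion_aware_mask(mask_sam, depth_f, win=7, tau=0.05):
    """Intersect a segmentation mask with a depth-based visibility mask.

    mask_sam : (H, W) bool, subject mask from the segmenter (M_sam)
    depth_f  : (H, W) float, foreground depth map D_f, assumed in [0, 1]
    win      : neighbourhood size |N(p)| for the local mean (placeholder value)
    tau      : threshold on the local depth deviation delta(p) (placeholder value)
    """
    local_mean = uniform_filter(depth_f, size=win)   # mean depth over N(p)
    delta = np.abs(depth_f - local_mean)             # delta(p)
    mask_vis = delta < tau                           # depth-consistent (visible) pixels
    return mask_sam & mask_vis                       # final compositing mask M_fg
```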

2.3 Depth-Guided Placement

The masked foreground is spatially aligned with the target background by sliding a window over the normalized background depth map $\hat{D}_b$, seeking a location with matched mean depth, variance, and depth gradient smoothness. For window $W_{x,y}$:

  • Mean/variance: $\mu_{x,y}$, $\sigma_{x,y}$ (window), $\mu_f$, $\sigma_f$ (foreground)
  • Deviations: $\Delta_\text{mean}(x,y)$, $\Delta_\text{var}(x,y)$
  • Smoothness: $S_\text{smooth}(x,y)$ (average gradient magnitude)
  • Combined score:

$$S(x, y) = \alpha \Bigl(1 - \frac{\Delta_\text{mean}(x, y)}{\max \Delta_\text{mean}}\Bigr) + \beta \Bigl(1 - \frac{\Delta_\text{var}(x, y)}{\max \Delta_\text{var}}\Bigr) + \gamma \Bigl(1 - \frac{S_\text{smooth}(x, y)}{\max S_\text{smooth}}\Bigr)$$

The optimal paste location is $(x^*, y^*) = \arg\max_{(x, y)} S(x, y)$. Typical weights are $\alpha = 0.4$, $\beta = 0.4$, $\gamma = 0.2$; stride $s = 16$ px (Guo, 12 Dec 2025). This ensures local depth continuity and realistic integration.
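The sliding-window search can be sketched as a brute-force scan over the background depth map; the small epsilon terms guard against division by zero and are not part of the published formulation, while the weight and stride defaults mirror the values quoted above.

```python
import numpy as np

def best_paste_location(depth_bg, mu_f, var_f, h, w,
                        alpha=0.4, beta=0.4, gamma=0.2, stride=16):
    """Return the top-left corner (x*, y*) of the best-matching (h, w) window."""
    H, W = depth_bg.shape
    gy, gx = np.gradient(depth_bg)
    grad_mag = np.hypot(gx, gy)                          # per-pixel gradient magnitude

    xs, ys, d_mean, d_var, smooth = [], [], [], [], []
    for y in range(0, H - h + 1, stride):
        for x in range(0, W - w + 1, stride):
            win = depth_bg[y:y + h, x:x + w]
            d_mean.append(abs(win.mean() - mu_f))             # Delta_mean(x, y)
            d_var.append(abs(win.var() - var_f))              # Delta_var(x, y)
            smooth.append(grad_mag[y:y + h, x:x + w].mean())  # S_smooth(x, y)
            xs.append(x); ys.append(y)

    d_mean, d_var, smooth = map(np.asarray, (d_mean, d_var, smooth))
    score = (alpha * (1 - d_mean / (d_mean.max() + 1e-8))
             + beta * (1 - d_var / (d_var.max() + 1e-8))
             + gamma * (1 - smooth / (smooth.max() + 1e-8)))
    best = int(np.argmax(score))
    return xs[best], ys[best]
```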

3. Methods: Depth Region Copy-Paste for Contrastive Segmentation

For contrastive urban-scene segmentation, depth-coherent regions are extracted and pasted to encourage context-invariant representations (Zeng et al., 2022).

  • Superpixels are generated (SLIC) and grouped via a region-adjacency graph weighted by occlusion and support distances (using average 3D coordinates and normals), then cluster assignment is resolved via iterative InfoMap.
  • Only sufficiently large, spatially coherent depth regions are retained.
  • Copy-paste involves random geometric/photometric augmentation, controlled offset sampling, and occlusion-aware compositing using the “DepthMix” rule, which resolves paste-mask pixels by minimum depth at each location.
  • Region-level and pixel-level positive pairs across two views are tracked under the SwAV contrastive loss, with loss balance $\mathcal{L} = \lambda\,\mathcal{L}_{\rm pixel} + (1-\lambda)\,\mathcal{L}_{\rm region}$ (Zeng et al., 2022).

This unsupervised approach increases segmentation robustness by forcing feature learning on semantically consistent 3D regions under varied scene context.
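The minimum-depth compositing rule can be illustrated with a short sketch; it assumes the copied region has already been geometrically and photometrically augmented and warped into the target frame, and the names are illustrative rather than taken from the reference code.

```python
import numpy as np

def depthmix_composite(img_a, depth_a, img_b, depth_b, paste_mask):
    """Occlusion-aware compositing in the spirit of the DepthMix rule.

    Inside the paste mask, each pixel keeps whichever source is closer to the
    camera (smaller depth), so pasted content is occluded by nearer scene parts.

    img_a, depth_a : target image (H, W, 3) and its depth map (H, W)
    img_b, depth_b : source image and depth map holding the copied region
    paste_mask     : (H, W) bool mask of the region copied from img_b
    """
    use_b = paste_mask & (depth_b < depth_a)              # source wins where nearer
    out_img = np.where(use_b[..., None], img_b, img_a)
    out_depth = np.where(use_b, depth_b, depth_a)
    return out_img, out_depth
```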

4. Methods: PositionID Copy-Paste in LLMs

PositionID CP Prompting introduces an explicit position-indexed copy-paste toolchain for autoregressive LLMs, enabling deterministic substring replication within generated context (Wang et al., 9 Oct 2024).

  • A position ID $p_i = i$ is assigned to each token upon invocation of the copy tool, yielding the context $(x_1, p_1), \dots, (x_N, p_N)$.
  • The model emits special function-call tokens for copy (“<COPY>[tag] [desc] [start] [end]</COPY>”) and paste (“<PASTE>[tag]</PASTE>”).
  • Upon a <COPY> call, the substring $x_s \dots x_e$ is stored in an external clipboard keyed by tag.
  • When <PASTE>[tag]</PASTE> is emitted, the clipboard content is inserted verbatim at the current generation point.
  • Benchmarks (CP-Bench) measure copy-paste accuracy (CP Success Rate), Rouge-L, and perplexity.

Current implementations support single-level copy-paste but lack explicit hierarchical (depth) or nested references. The immediate effect is an 80.8% CP Success Rate (vs. 68.1% for few-shot prompting and 0% zero-shot), with substantial reductions in perplexity and improved subjective ratings (Wang et al., 9 Oct 2024).
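The toolchain can be emulated with a toy executor over a position-indexed context, as sketched below; the exact serialization of the tool calls is an assumption made for illustration and only loosely follows the paper's format.

```python
import re

def apply_copy_paste(tokens, generation):
    """Toy executor for <COPY>/<PASTE> tool calls over a position-indexed context.

    tokens     : list of context tokens; token i carries position ID p_i = i
    generation : raw model output that may contain tool-call spans
    """
    clipboard = {}

    # <COPY>tag desc start end</COPY> -> store tokens[start..end] (inclusive) under tag
    def do_copy(m):
        tag, start, end = m.group(1), int(m.group(3)), int(m.group(4))
        clipboard[tag] = " ".join(tokens[start:end + 1])
        return ""  # the tool call itself is removed from the visible output

    out = re.sub(r"<COPY>(\S+) (.*?) (\d+) (\d+)</COPY>", do_copy, generation)

    # <PASTE>tag</PASTE> -> insert the stored span verbatim
    out = re.sub(r"<PASTE>(\S+)</PASTE>", lambda m: clipboard.get(m.group(1), ""), out)
    return out

# Usage:
#   tokens = "the quick brown fox jumps".split()
#   apply_copy_paste(tokens, "<COPY>a span 1 3</COPY>Copied: <PASTE>a</PASTE>")
#   -> "Copied: quick brown fox"
```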

5. Quantitative Evaluation

Face detection augmentation (DCP, WIDER Face, RetinaFace backbone):

Synthetic Ratio    mAP
0% (baseline)      0.847
40%                0.866
≈50% (peak)        0.873

Augmentation Comparison (Easy/Medium/Hard, WIDER Val):

Method               Easy    Medium  Hard
Baseline (no aug.)   0.879   0.801   0.769
Cutout               0.883   0.811   0.783
CutMix               0.891   0.817   0.779
Random Copy-Paste    0.904   0.827   0.791
Depth-Copy-Paste     0.917   0.834   0.803

Ablation results show that each DCP module—multimodal semantic retrieval, occlusion-aware mask, depth-guided placement—yields measurable gains. For segmentation, “Copy-Pasting Coherent Depth Regions” surpasses previous state-of-the-art by +7.14% mIoU on Cityscapes and +6.65% on KITTI (Zeng et al., 2022).

LLM copy-paste (CP-Bench, PositionID Prompting):

  • Zero-shot: CP-S.R. = 0%
  • Few-shot: CP-S.R. = 68.1%
  • PositionID: CP-S.R. = 80.8%, Rouge-L = 18.4, PPL = 8.4

6. Qualitative Effects and Analysis

Depth Copy Paste in images results in pasted objects that “stand naturally on the floor or ground plane,” with preserved contours and no “floating” artifacts. Occluded or corrupted regions (e.g., parts of hair occluded by a hand) are omitted, avoiding texture corruption. In language, explicit position-indexed copy-paste eliminates substring mismatch and random errors prevalent in unconditioned copying (Wang et al., 9 Oct 2024).

Common failure cases include:

  • Misestimated depth leading to slight scale mismatch in composited images or misplacement near complex background regions.
  • For LLMs, absence of hierarchical copy-paste prevents nested edit operations or multi-level clipboard logic.

A plausible implication is that further generalization could enable multi-span edits, hierarchical structural manipulation, or video-consistent augmentations.

7. Limitations and Future Directions

Noted limitations of existing DCP frameworks include:

  • Monocular noise: Depth estimation is less accurate in featureless or occluded regions, occasionally producing suboptimal paste locations or artifacts (Guo, 12 Dec 2025; Zeng et al., 2022).
  • Computational overhead: Joint use of semantic and depth estimators increases augmentation synthesis cost.
  • Fixed weights: Default weighting parameters ($\lambda$, $\alpha$, $\beta$, $\gamma$) are handcrafted; learning these parameters from validation data may improve compositing quality.
  • Hierarchical copy-paste: Current PositionID copy-paste supports only single-level (“flat”) operations; future work could extend to “COPY inside COPY” or stack-based editors.
  • Scalability: The explicit use of position IDs doubles context length in language modeling, with corresponding efficiency tradeoffs (Wang et al., 9 Oct 2024).

Future development directions include task-adaptive fusion weighting, generalization to other object categories and temporal consistency for videos, generative inpainting/refinement at paste boundaries, and hierarchical clipboard logic for programmatic editing.


References:

  • "Depth-Copy-Paste: Multimodal and Depth-Aware Compositing for Robust Face Detection" (Guo, 12 Dec 2025)
  • "Copy-Pasting Coherent Depth Regions Improves Contrastive Learning for Urban-Scene Segmentation" (Zeng et al., 2022)
  • "PositionID: LLMs can Control Lengths, Copy and Paste with Explicit Positional Awareness" (Wang et al., 9 Oct 2024)
