Point-in-Context (PIC): 3D In-Context Learning

Updated 18 January 2026
  • Point-in-Context (PIC) is an in-context learning framework tailored for 3D point cloud tasks, addressing challenges like information leakage via masked coordinate regression.
  • It deploys a joint sampling module that aligns input and target sequences, ensuring robust coordinate regression and effective analogical reasoning from prompt examples.
  • Extensions such as DG-PIC and MEPIC demonstrate PIC's versatility in achieving domain generalization and memory-efficient transformer inference across varied applications.

Point-in-Context (PIC) is a suite of machine learning paradigms, models, and systems leveraging the principle of “in-context learning,” where input–output examples are provided as prompts to condition model predictions in lieu of parameter updates. While the terminology “Point-in-Context” spans areas including 3D point cloud understanding and scalable transformer inference, the underlying methods unite around the idea of direct, context-driven adaptation at inference time. This article provides a comprehensive technical overview of PIC, with a primary focus on its instantiation for 3D point cloud understanding and its extensions to both domain generalization and memory-efficient model serving.

1. Origins and Foundational Motivation

The PIC framework for 3D point cloud understanding was developed to address the challenge of leveraging in-context learning—previously successful in NLP (e.g. GPT-3 “prompting”) and 2D masked modeling (e.g. Painter, Visual Prompting)—for unstructured 3D data. Unlike images where discrete patches map naturally to tokens, 3D point clouds are unordered sets of coordinates; masking approaches used in images lead to severe information leakage and fail to honor the spatial indeterminacy of point sets. Classical Masked Point Modeling (MPM) used positional embeddings even for masked regions, unintentionally leaking target positions and rendering masked reconstruction ill-posed for pure in-context learning (Fang et al., 2023, Liu et al., 2024).

PIC’s core innovation is to treat all 3D tasks (reconstruction, denoising, registration, segmentation) as coordinate regression: both inputs and outputs for every example are point sets in ℝ³, with no conversion to discrete labels or class tokens. Each task is specified at inference time by concatenating a prompt example (input–target pair) with a query input whose target is masked; the model must infer the query target solely by analogical reasoning from the prompt, not from any explicit parameter update or task head.

2. Architectural Principles and Joint Sampling

Masked Coordinate Modeling and Prompting

  • Each model input comprises two pairs: (Pᵢ, Tᵢᵏ) serves as the prompt and (Pⱼ, Tⱼᵏ) as the query, where P, T ∈ ℝ^{3×N} denote a point cloud and its target, respectively.
  • The model applies a random mask to subsets of the target coordinates in both prompt and query (masking ratio typically ≈70%), replacing masked tokens with a learned mask point.
  • Both prompt and query are embedded (either concatenated in PIC-Cat or encoded in parallel streams in PIC-Sep) and passed through a transformer to regress the masked coordinates.
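As a concrete sketch of the masking scheme above (a hypothetical helper, not the authors' code; it assumes (N, 3) arrays and uses a zero placeholder where the model would use a learned mask token):

```python
import numpy as np

def build_masked_input(prompt_pts, prompt_tgt, query_pts, query_tgt,
                       mask_ratio=0.7, rng=None):
    """Assemble one PIC-style input: the targets of both prompt and query
    are masked at roughly the given ratio, with masked coordinates
    replaced by a shared mask point (zeros here, learned in the model)."""
    if rng is None:
        rng = np.random.default_rng(0)
    mask_point = np.zeros(3)

    def mask_target(tgt):
        m = rng.random(tgt.shape[0]) < mask_ratio  # which target tokens to hide
        out = tgt.copy()
        out[m] = mask_point
        return out, m

    masked_prompt_tgt, _ = mask_target(prompt_tgt)
    masked_query_tgt, query_mask = mask_target(query_tgt)
    # PIC-Cat style: concatenate all four point sets along the token axis
    tokens = np.concatenate(
        [prompt_pts, masked_prompt_tgt, query_pts, masked_query_tgt], axis=0)
    return tokens, query_mask
```

In PIC-Sep the prompt and query streams would instead be kept separate and encoded in parallel; only the concatenated (PIC-Cat) layout is shown here.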

Joint Sampling Module

The Joint Sampling (JS) module ensures strict token alignment between input and target sequences, which is critical for avoiding information leakage from positional embeddings:

  • N patch centers are sampled from the input cloud (by Farthest Point Sampling or random sampling). For each center, KNN forms a local patch from both the input and corresponding target.
  • Masked centers are never embedded. Patch indices are derived solely from input indices, guaranteeing identical ordering between input and target token sequences (Fang et al., 2023, Liu et al., 2024).
  • The output sequence is reconstructed using the Chamfer Distance as the loss function:

\mathcal{L}(\hat{P}, G) = \sum_{p \in \hat{P}} \min_{g \in G} \|p - g\|_2^2 + \sum_{g \in G} \min_{p \in \hat{P}} \|p - g\|_2^2
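A minimal NumPy implementation of this loss, assuming point sets stored as (N, 3) arrays:

```python
import numpy as np

def chamfer_distance(pred, gt):
    """Symmetric squared Chamfer Distance between point sets of shape
    (N, 3) and (M, 3): for each point, the squared distance to its
    nearest neighbor in the other set, summed over both directions."""
    d = ((pred[:, None, :] - gt[None, :, :]) ** 2).sum(-1)  # (N, M) pairwise squared distances
    return d.min(axis=1).sum() + d.min(axis=0).sum()
```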

3. Unified Multitask Modeling and Segmentation by Coordinates

Unlike conventional multitask networks requiring task-specific heads, PIC unifies:

  • Reconstruction: Densify a sparse cloud, target is a denser set of points.
  • Denoising: Recover clean coordinates from noisy input.
  • Registration: Predict canonical pose from a rotated input.
  • Segmentation: Each part/category is mapped to a cluster of 3D “label points.” Segmentation is reduced to regressing coordinates closest to these label points.

Early PIC systems use fixed label-coordinate assignments for segmentation:

  • Each category is represented by a pre-chosen point in ℝ³ (label bank L = {ℓ₁, …, ℓ_C}).
  • Each query point is labeled as yᵢ = argmin_k ‖p̂ᵢ − ℓ_k‖₂.
  • However, fixed labels constrain generalization: new classes cannot be accommodated, and model performance degrades with class crowding (Liu et al., 2024).
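The nearest-label-point decoding rule above can be sketched as follows (the label-bank coordinates here are purely illustrative, not the values used in the papers):

```python
import numpy as np

# Hypothetical label bank: C fixed "label points" in R^3, one per part/category.
label_bank = np.array([[0., 0., 0.],
                       [1., 0., 0.],
                       [0., 1., 0.],
                       [0., 0., 1.]])

def assign_labels(pred_points, label_bank):
    """y_i = argmin_k ||p̂_i - l_k||_2: each regressed point takes the
    index of its nearest label point."""
    d = np.linalg.norm(pred_points[:, None, :] - label_bank[None, :, :], axis=-1)
    return d.argmin(axis=1)
```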

PIC-S: In-Context Labeling and Enhancing

PIC-S (Point-In-Context-Segmenter) addresses fixed-label limitations by:

  • In-Context Labeling (ICL): A global label-point bank B is generated; for each instance, a random subset Bᵢ is assigned as labels, forcing the model to learn prompt-specific label mappings.
  • In-Context Enhancing (ICE): Additional training pairs are formed using random augmentations (noise, deformation, occlusion), strengthening the model’s context-dependent reasoning (Liu et al., 2024).
  • The segmentation loss combines Chamfer Distance with a Smooth-ℓ₁ term.
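The in-context labeling step can be sketched as below; the exact subset-sampling mechanics are an assumption for illustration, but the key property matches the description: the part-to-coordinate mapping changes per instance, so it must be read off the prompt.

```python
import numpy as np

def sample_instance_labels(global_bank, num_parts, rng):
    """In-Context Labeling sketch: draw a random subset B_i of the
    global label-point bank B and assign one label point per part, so
    the mapping differs across instances."""
    idx = rng.choice(len(global_bank), size=num_parts, replace=False)
    return global_bank[idx]  # (num_parts, 3) per-instance label points
```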

4. Domain Generalization with DG-PIC

DG-PIC extends PIC to multi-domain, multi-task settings, tackling out-of-distribution generalization:

  • Dual-Level Source Prototype Estimation: For each source domain, DG-PIC computes global (shape-level) and local (patch-level) prototypes from training data.
  • Dual-Level Test-Time Feature Shifting: At inference, features for an unseen test sample are softly shifted towards source domain prototypes at both global and local levels, using macro-level (domain semantic) and micro-level (patch positional) attention weights.
  • Unlike standard DG methods or classic in-context learning, no model update occurs at test time; all adaptation is via feature-space realignment. This enables efficient zero-shot generalization across tasks and domains (Jiang et al., 2024).
|                      | Classic PIC        | DG-PIC                                 |
|----------------------|--------------------|----------------------------------------|
| Domains              | Single-domain      | Multi-domain                           |
| Task heads           | None               | None                                   |
| Test-time adaptation | Prompt only        | Feature shifting (no parameter update) |
| Benchmarks           | ShapeNet, ModelNet | ShapeNet, ScanNet, ScanObjectNN        |
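A heavily simplified sketch of the test-time feature shifting described above (the attention mechanics and the `alpha`/`temp` parameters are assumptions for illustration; DG-PIC operates at both global and local levels, collapsed to one level here):

```python
import numpy as np

def shift_features(test_feat, prototypes, alpha=0.5, temp=1.0):
    """Softly pull a test-sample feature toward an attention-weighted
    combination of source-domain prototypes. No parameters are updated;
    adaptation is purely a feature-space realignment."""
    sims = prototypes @ test_feat / temp           # similarity to each prototype
    w = np.exp(sims - sims.max())
    w /= w.sum()                                   # softmax attention weights
    target = w @ prototypes                        # weighted source prototype
    return (1 - alpha) * test_feat + alpha * target
```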

5. Empirical Results and Comparative Analysis

3D Point Cloud Tasks

  • Datasets: ShapeNet, ShapeNetPart, Human3D, BEHAVE, ModelNet40, ScanNet, ScanObjectNN (Fang et al., 2023, Liu et al., 2024, Jiang et al., 2024).
  • Main metrics: Chamfer Distance for reconstruction/denoising/registration; mIoU for segmentation.
  • PIC-Cat achieves reconstruction CD ≈ 4.3×10⁻³ and segmentation mIoU ≈ 78.95%, outperforming multitask baselines and matching some task-specific networks.
  • PIC-S-Sep achieves mIoU ≈ 85% in-domain and ≈ 63% zero-shot on AKB-48, versus ≈ 40% for the best non-PIC baseline.
  • DG-PIC achieves order-of-magnitude lower CD than both standard PIC and domain generalization baselines in OOD settings; e.g., on ScanObjectNN, DG-PIC achieves CD = 4.1 vs. standard PIC’s CD = 73 (Jiang et al., 2024).

Ablations and Analysis

  • Removal of Joint Sampling leads to collapse in all tasks.
  • Feature-shifting ablation: Full macro+micro DG-PIC outperforms global/local only or naïve averaging.
  • Mask ratio: Optimal performance at ≈70%; low ratios lead to trivial solutions.
  • Prompt selection based on minimal Chamfer Distance to the query (“CD-aware”) reduces error by up to ≈30% for registration.
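The CD-aware retrieval policy in the last bullet reduces to a nearest-neighbor search under Chamfer Distance, sketched here with a self-contained helper:

```python
import numpy as np

def chamfer(a, b):
    """Symmetric squared Chamfer Distance between (N, 3) point sets."""
    d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return d.min(axis=1).sum() + d.min(axis=0).sum()

def select_prompt(query_pts, candidate_prompts):
    """CD-aware prompt retrieval: pick the candidate whose input cloud
    is closest to the query input under Chamfer Distance."""
    costs = [chamfer(query_pts, p) for p in candidate_prompts]
    return int(np.argmin(costs))
```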

6. Extensions to Transformer Inference and Other Domains

In LLM serving for NLP, Position-Independent Caching (also abbreviated PIC) addresses memory bottlenecks arising from repeated context chunk reuse in long prompts (Wang et al., 18 Dec 2025):

  • Classic prefix caching is limited to exact-match prefixes; PIC enables chunk reuse at arbitrary positions, but previous systems suffered from memory and alignment inefficiencies.
  • Memory Efficient PIC (MEPIC) introduces block-level recomputation, page-aligned KV storage, and RoPE (rotary position embedding) fusion in the attention kernel.
  • MEPIC achieves up to 2–5× HBM savings and improved throughput versus prior PIC implementations, with no changes to the underlying network.
  • The underlying PIC idea remains: generalize reuse or inference by leveraging “context chunks” directly, regardless of position.
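The core caching idea can be sketched as a content-keyed store (a conceptual toy, not the MEPIC system: real implementations must additionally handle page-aligned KV layout and RoPE re-rotation for the new position, both omitted here):

```python
import hashlib

class ChunkKVCache:
    """Position-Independent Caching sketch: KV entries are keyed by chunk
    *content* rather than prefix position, so a chunk reused at a
    different offset in a new prompt still hits the cache."""

    def __init__(self):
        self._store = {}

    def _key(self, chunk_tokens):
        return hashlib.sha256(repr(tuple(chunk_tokens)).encode()).hexdigest()

    def get_or_compute(self, chunk_tokens, compute_kv):
        """Return cached KV for this chunk, computing it on first sight."""
        k = self._key(chunk_tokens)
        if k not in self._store:
            self._store[k] = compute_kv(chunk_tokens)
        return self._store[k]
```

Classic prefix caching would only hit when the chunk appears at the exact same prefix position; keying on content is what enables reuse at arbitrary positions.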

7. Future Directions and Open Challenges

  • 3D PIC: Scaling to more complex tasks such as scene-level detection, panoptic segmentation, or multi-modal fusion (RGB, LiDAR) in a unified prompt-driven framework.
  • Dynamic prompt selection and retrieval policies (metric- or learning-based) to optimize analogy transfer during test time.
  • Augmenting PIC with cross-domain, cross-modal, or privacy-preserving extensions (e.g., in DG-PIC or encrypted-key PIC for LLM serving).
  • For memory-efficient inference, integrating with quantization, pruning, heat-aware eviction, and adaptive chunking schemes.

Point-in-Context methods have established a new paradigm for in-context learning and efficient multi-task or multi-domain adaptation in point cloud understanding and beyond. By abstracting inputs and outputs in a unified coordinate or token space and using careful alignment and context-driven mechanisms, PIC offers both empirical robustness and architectural generality across domains (Fang et al., 2023, Liu et al., 2024, Jiang et al., 2024, Wang et al., 18 Dec 2025).
