Semantic-Contact Fields (SCFields)

Updated 4 July 2026

Semantic-Contact Fields (SCFields) are dense 3D representations that attach semantic labels and physical force vectors to each point in a geometric domain.
They combine explicit contact probabilities with object or part-level semantics, enabling advanced tool manipulation and human-scene interactions.
SCFields enhance performance in tasks like tactile scraping, dexterous grasping, and pose estimation through simulation pre-training and real-world alignment.

Semantic-Contact Fields (SCFields) are field-based representations of contact that attach semantically meaningful contact attributes to a geometric domain rather than treating contact as an isolated binary event. In the most explicit formulation, SCFields are a dense 3D field over a tool surface that fuses category- or part-level visual semantics with extrinsic contact probability and force, yielding per-point features of the form $x_i = [\,\mathbf{f}^{\text{ext}}_i \Vert c_i \Vert S_i\,]$ on a tool point cloud (Ma et al., 14 Feb 2026). Closely related formulations appear in human-scene contact prediction, where DecoDINO produces per-vertex binary contact probabilities and semantic object labels on the SMPL mesh and is described as learning a discrete Semantic-Contact Field on the human surface (Bierling et al., 27 Oct 2025). Other grasping systems do not always use the term literally, but implement field-like semantic contact representations over object surfaces, point clouds, or 2D-to-3D lifted contact maps (Shin et al., 13 May 2026, Li et al., 2024).

1. Conceptual scope and definitions

SCFields are not a single canonical object; the term spans several closely related representational choices. In the tool-manipulation formulation, the field is defined over the tool surface point cloud $P_{\text{obj}} = \{p_i \in \mathbb{R}^3\}_{i=1}^N$ and maps each point to a contact-force vector, a contact probability, and a semantic feature:

$f: \mathbb{R}^3 \to \mathbb{R}^{3} \times [0,1] \times \mathbb{R}^{d_S}, \quad p_i \mapsto (\mathbf{f}^{\text{ext}}_i, c_i, S_i).$

This makes each tool-surface point encode both “what it is” semantically and “how it is currently interacting physically” (Ma et al., 14 Feb 2026).
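The per-point feature $x_i = [\,\mathbf{f}^{\text{ext}}_i \Vert c_i \Vert S_i\,]$ can be illustrated with a minimal sketch. The function name `assemble_scfield`, the array sizes, and the random placeholder values are assumptions made for illustration; the source specifies only the three channels and their concatenation.

```python
import numpy as np

def assemble_scfield(force_ext, contact_prob, semantic_feat):
    """Stack per-point channels into x_i = [f_ext_i || c_i || S_i]."""
    force_ext = np.asarray(force_ext, dtype=np.float32)          # (N, 3)
    contact_prob = np.asarray(contact_prob, dtype=np.float32)    # (N,)
    semantic_feat = np.asarray(semantic_feat, dtype=np.float32)  # (N, d_S)
    n = force_ext.shape[0]
    if force_ext.shape != (n, 3):
        raise ValueError("force_ext must have shape (N, 3)")
    if contact_prob.shape != (n,):
        raise ValueError("contact_prob must have shape (N,)")
    if semantic_feat.ndim != 2 or semantic_feat.shape[0] != n:
        raise ValueError("semantic_feat must have shape (N, d_S)")
    if contact_prob.min() < 0.0 or contact_prob.max() > 1.0:
        raise ValueError("contact probabilities must lie in [0, 1]")
    # Column layout: force (3) | contact probability (1) | semantics (d_S).
    return np.concatenate(
        [force_ext, contact_prob[:, None], semantic_feat], axis=1
    )

# Hypothetical sizes and random placeholder values, for illustration only.
rng = np.random.default_rng(0)
num_points, d_s = 1024, 8
scfield = assemble_scfield(
    rng.normal(size=(num_points, 3)),
    rng.uniform(size=num_points),
    rng.normal(size=(num_points, d_s)),
)
```

Each row of `scfield` then has $3 + 1 + d_S$ entries, so a downstream policy can read physical and semantic channels from a single per-point vector.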

In DecoDINO, the domain is instead the human body surface, instantiated as the $N=6890$ vertices of the SMPL mesh. The model predicts, for each vertex, a binary contact label and a semantic label of the contacted object. In field notation, this corresponds to a binary contact field $f_{\text{bin}}: V \to \{0,1\}$ and a semantic contact field $f_{\text{sem}}: V \to \{1,\dots,C\}$, or probabilistically to per-vertex distributions $p_{\text{bin}}(y_i=1\mid v_i)$ and $p_{\text{sem}}(c\mid v_i)$ (Bierling et al., 27 Oct 2025). The semantics here are object categories such as floor, couch or sofa, bench, motorcycle, and generic supporting surfaces, rather than contact modes or material classes.
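A minimal sketch of how per-vertex logits could be turned into the two fields is given here. It assumes a 0.5 decision threshold, a hypothetical class count `NUM_CLASSES`, and the convention that label 0 marks vertices without contact; none of these details is stated in the source, and the random logits stand in for real network outputs.

```python
import numpy as np

NUM_VERTICES = 6890   # SMPL vertex count
NUM_CLASSES = 4       # hypothetical number of object categories C

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def discrete_contact_field(bin_logits, sem_logits, threshold=0.5):
    """Return p_bin, p_sem and the hard fields f_bin, f_sem per vertex."""
    p_bin = sigmoid(bin_logits)                      # (N,)
    p_sem = softmax(sem_logits)                      # (N, C)
    f_bin = (p_bin >= threshold).astype(np.int64)    # values in {0, 1}
    # Labels 1..C on contacted vertices; 0 (assumed) where there is no contact.
    f_sem = np.where(f_bin == 1, p_sem.argmax(axis=1) + 1, 0)
    return p_bin, p_sem, f_bin, f_sem

# Random logits as placeholders for network outputs.
rng = np.random.default_rng(0)
p_bin, p_sem, f_bin, f_sem = discrete_contact_field(
    rng.normal(size=NUM_VERTICES),
    rng.normal(size=(NUM_VERTICES, NUM_CLASSES)),
)
```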

A broader reading of the literature shows that “semantic” can refer to different label spaces. In SECOND-Grasp, semantic contact is intent- and part-conditioned: semantic regions such as “handle”, “body”, “base”, or “lens” are projected into 3D contact maps and refined by cross-view consistency and local convexity (Shin et al., 13 May 2026). In ClickDiff, the Semantic Contact Map (SCM) is object-anchored and finger-indexed, with a tensor $SCM \in \mathbb{R}^{N \times 5}$ specifying which object points are touched by which fingers (Li et al., 2024). This suggests that SCFields are better understood as a family of surface- or volume-defined fields whose codomain combines contact structure with a task-relevant semantic factorization.
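The finger-indexed tensor $SCM \in \mathbb{R}^{N \times 5}$ can be illustrated with a small sketch that marks object points near a clicked location for a chosen finger. The radius-based assignment, the finger ordering in `FINGERS`, and the function name `build_scm` are assumptions for illustration and are not taken from ClickDiff.

```python
import numpy as np

# Assumed column order of the five finger channels.
FINGERS = ("thumb", "index", "middle", "ring", "little")

def build_scm(points, clicks, radius):
    """Build an (N, 5) map: entry [i, f] is 1 if point i is assigned to finger f.

    clicks is a list of (finger_name, xyz) pairs; every object point within
    `radius` of a click is marked for that finger.
    """
    points = np.asarray(points, dtype=float)
    scm = np.zeros((points.shape[0], len(FINGERS)))
    for finger, center in clicks:
        dist = np.linalg.norm(points - np.asarray(center, dtype=float), axis=1)
        scm[dist <= radius, FINGERS.index(finger)] = 1.0
    return scm

# Toy object with four points and two clicks.
points = np.array([[0.0, 0, 0], [0.01, 0, 0], [0.10, 0, 0], [0.11, 0, 0]])
clicks = [("thumb", [0.0, 0, 0]), ("index", [0.10, 0, 0])]
scm = build_scm(points, clicks, radius=0.02)
```

In this toy case the first two points are assigned to the thumb and the last two to the index finger, while the remaining three columns stay empty.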

2. Domains, codomains, and representational forms

The main variants differ primarily in the domain on which contact is defined and in the semantic payload attached to each geometric element.

System	Domain	Field values
DecoDINO	SMPL mesh vertices	Binary contact probability and semantic object-label distribution
SCFields for tool manipulation	Tool surface point cloud	Extrinsic force vector, contact probability, semantic feature
SECOND-Grasp	Object surface points from multi-view lifting	Semantic intent label and confidence
ClickDiff	Object point cloud	Finger-indexed semantic contact tensor and scalar contact map
Lightning Grasp	$\mathbb{R}^3 \times \mathbb{S}^2$ contact pose space	Feasible position-normal contact states
Eulerian phase-field contact	Fixed Eulerian volume	Phase fields, overlap field, normals, stress-related contact state

For human-scene interaction, the field is discrete over a fixed human template. DecoDINO’s SCField is “defined over vertices” and can be viewed as a multi-channel field

$(f_{\text{bin}}, f_{\text{sem}}): V \to \{0,1\} \times \{1,\dots,C\},$

which extends DECO’s earlier binary-only mapping $f_{\text{bin}}: V \to \{0,1\}$ (Bierling et al., 27 Oct 2025).

For tool manipulation, the representation is point-based rather than mesh-vertex-based. The geometry is a point cloud of tool points, environment points, and tactile points, and the contact estimator predicts dense extrinsic contact fields only on the tool points $P_{\text{obj}}$, while the semantics $S_i$ are supplied by a separate 3D semantic field module derived from GenDP-style representations (Ma et al., 14 Feb 2026). The field is object-centric and explicitly combines semantics with physically grounded force estimates.

For dexterous grasping, the field becomes more weakly or indirectly defined. SECOND-Grasp represents semantic contact as a subset of surface points with semantic labels and confidence scores, obtained from VLM proposals, SAM2 segmentation, multi-view reprojection, and convexity-based filtering (Shin et al., 13 May 2026). ClickDiff uses an object-surface point cloud with finger-level semantics, where the continuous part of the representation is the scalar contact map and the semantic part is the discrete SCM tensor (Li et al., 2024). Lightning Grasp’s Contact Field is not semantic in itself; it is a 6D set of feasible contact vectors, each a position-normal pair in $\mathbb{R}^3 \times \mathbb{S}^2$, describing where a hand surface element can appear under forward kinematics. A plausible implication is that it provides a geometric substrate onto which semantic labels or task-conditioned preferences could be attached (Yin et al., 10 Nov 2025).

A further generalization appears in the Eulerian phase-field contact framework, where contact is not attached to an explicit surface mesh at all. Each solid has its own phase field, and contact between two bodies is encoded by an overlap field computed from their phase fields. This is not presented as an SCField in name, but it gives a field-based contact semantics in which occupancy, interface normals, and overlap all become local state variables on a fixed Eulerian mesh (Lorez et al., 2023).

3. Architectural patterns and computational pipelines

Despite their heterogeneity, SCField systems share a recurring architecture: a geometry-aligned representation, a semantic channel, and a mechanism for turning local contact predictions into downstream control or generation signals.

DecoDINO preserves the three-branch DECO structure: a scene-context branch, a part-context branch, and a contact branch. It replaces the original encoders with two independent DINOv2 ViT-g/14 backbones, LoRA-finetunes them, and replaces class-level cross-attention with patch-level cross-attention. The resulting patch features are pooled via a learnable attention mechanism and passed to a vertex-level MLP with sigmoid and softmax outputs for binary contact and semantic object labels, respectively (Bierling et al., 27 Oct 2025). The architectural emphasis is on finer local reasoning between body and scene tokens while preserving a per-vertex output space.

The tool-manipulation SCFields system uses a unified tactile-geometry point cloud that merges tool points, environment points, and tactile points. Tactile points carry a 15-dimensional history of marker displacements, and all points carry a type indicator. A PointNet++ encoder-decoder predicts a dense contact probability field and a dense force field on the tool surface, while a pretrained 3D semantic field module produces per-point semantic features from RGB-D observations. These are concatenated to form the SCField used by a diffusion policy (Ma et al., 14 Feb 2026). The pipeline is explicitly two-stage: large-scale simulation pre-training, followed by real-world alignment using pseudo-labels derived from geometric heuristics and force optimization.
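One way to picture the unified tactile-geometry input is the following sketch, which stacks the three point groups into a single array. The one-hot type indicator, the zero-padding of the tactile channel for non-tactile points, and the function name `build_unified_cloud` are assumptions; the source states only that tactile points carry a 15-dimensional marker-displacement history and that all points carry a type indicator.

```python
import numpy as np

TACTILE_DIM = 15  # length of the marker-displacement history per tactile point

def build_unified_cloud(tool_pts, env_pts, tactile_pts, tactile_hist):
    """Return an (N_total, 3 + 3 + 15) array: xyz | type one-hot | tactile history."""
    groups = [np.asarray(g, dtype=np.float32) for g in (tool_pts, env_pts, tactile_pts)]
    hist = np.asarray(tactile_hist, dtype=np.float32)
    if hist.shape != (groups[2].shape[0], TACTILE_DIM):
        raise ValueError("tactile_hist must have shape (N_tactile, 15)")
    rows = []
    for type_id, pts in enumerate(groups):   # 0 = tool, 1 = environment, 2 = tactile
        one_hot = np.zeros((pts.shape[0], 3), dtype=np.float32)
        one_hot[:, type_id] = 1.0
        # Non-tactile points get an all-zero tactile channel (assumed padding).
        if type_id == 2:
            feat = hist
        else:
            feat = np.zeros((pts.shape[0], TACTILE_DIM), dtype=np.float32)
        rows.append(np.concatenate([pts, one_hot, feat], axis=1))
    return np.concatenate(rows, axis=0)

# Random placeholder geometry: 3 tool, 2 environment, and 4 tactile points.
rng = np.random.default_rng(0)
tactile_hist = rng.normal(size=(4, TACTILE_DIM)).astype(np.float32)
cloud = build_unified_cloud(
    rng.normal(size=(3, 3)), rng.normal(size=(2, 3)),
    rng.normal(size=(4, 3)), tactile_hist,
)
```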

SECOND-Grasp starts from 2D semantic proposals rather than dense tactile contact sensing. A VLM, specifically Qwen3-VL-8B-Instruct, predicts between 2 and 4 graspable intentions and associated 2D boxes for four fixed side views. SAM2 segments these boxes; cross-view semantic refinement uses depth-consistent reprojections to accumulate support; local convexity filtering in 3D removes geometrically invalid regions. The refined 3D contact map is then converted into a pseudo hand pose through inverse kinematics and used to supervise policy learning (Shin et al., 13 May 2026).

ClickDiff decomposes controllable grasp generation into two diffusion stages. The Semantic Conditional Module maps object geometry and SCM to a scalar contact map, and the Contact Conditional Module maps the object and contact map to MANO parameters. The SCM can be produced algorithmically from ground-truth geometry or specified interactively by clicks that define which finger should touch which object region (Li et al., 2024). The decomposition is intended to separate semantic contact specification from high-DOF hand generation.

Lightning Grasp shifts the emphasis from prediction to procedural search. Its Contact Field is sampled offline from hand geometry and kinematics, stored as a BVH over spatial boxes with associated normal sets, and queried against object-side contact vectors, given as position-normal pairs, to generate feasible contact domains (Yin et al., 10 Nov 2025). This is not itself a semantic field, but it formalizes contact as a reusable field in $\mathbb{R}^3 \times \mathbb{S}^2$ and demonstrates how geometry can be decoupled from the online search loop.
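The idea of an offline-sampled, queryable contact field can be sketched in a much simplified form. The example below swaps the BVH for a plain voxel dictionary and assumes that a contact is feasible when a stored hand normal opposes the object normal within an angular tolerance; the class name `ContactField`, the voxel size, and the tolerance are all hypothetical and are not the published algorithm.

```python
import numpy as np
from collections import defaultdict

class ContactField:
    """Toy position-normal lookup: voxel index -> list of unit hand normals."""

    def __init__(self, voxel=0.01):
        self.voxel = voxel
        self.cells = defaultdict(list)

    def _key(self, p):
        idx = np.floor(np.asarray(p, dtype=float) / self.voxel).astype(int)
        return tuple(int(i) for i in idx)

    def add(self, p, n):
        """Store one offline sample: a reachable position with its surface normal."""
        n = np.asarray(n, dtype=float)
        self.cells[self._key(p)].append(n / np.linalg.norm(n))

    def feasible(self, p, n_obj, max_angle_deg=30.0):
        """True if a stored normal in p's voxel opposes n_obj within the tolerance."""
        normals = self.cells.get(self._key(p))
        if not normals:
            return False
        n_obj = np.asarray(n_obj, dtype=float)
        n_obj = n_obj / np.linalg.norm(n_obj)
        cos_thresh = np.cos(np.deg2rad(max_angle_deg))
        return bool(np.any(-(np.stack(normals) @ n_obj) >= cos_thresh))

field = ContactField(voxel=0.01)
field.add([0.005, 0.005, 0.005], [0.0, 0.0, 1.0])       # one offline hand sample
hit = field.feasible([0.006, 0.002, 0.001], [0.0, 0.0, -1.0])   # opposing normal
miss = field.feasible([0.006, 0.002, 0.001], [0.0, 0.0, 1.0])   # same-direction normal
```

Because the field is built once per hand, the online loop reduces to lookups of this kind, which is the decoupling the paragraph describes.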

4. Learning objectives, geometric constraints, and physical grounding

SCField systems differ substantially in how they supervise semantics and how strongly they encode mechanics.

DecoDINO retains DECO’s multi-task loss structure and adds semantic supervision. The vertex-level binary contact objective is a BCE modified by a per-vertex positive class weight

$\mathcal{L}_{\text{bin}} = -\frac{1}{N} \sum_{i=1}^{N} \big[\, w_i\, y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \,\big],$

with $\hat{p}_i = p_{\text{bin}}(y_i = 1 \mid v_i)$ and $w_i$ the positive class weight of vertex $v_i$. The weights are rescaled so that their mean equals the average negative-to-positive vertex ratio and are clipped to avoid extreme values. This weighting addresses severe imbalance caused by dominant foot contacts and is paired with a standard cross-entropy loss over semantic classes for contacted vertices (Bierling et al., 27 Oct 2025). The stated effect is to reduce the “feet-on-ground” bias and improve semantic-contact quality in rarer contact regimes.
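A positively weighted BCE of the kind described above can be sketched as follows. The function names, the clipping bounds, and the way the raw weights are supplied are assumptions; the source states only that the weights are per-vertex, rescaled to a target mean, and clipped.

```python
import numpy as np

def rescale_and_clip(raw_weights, target_mean, lo=1.0, hi=50.0):
    """Rescale weights to the target mean, then clip (bounds are hypothetical)."""
    raw_weights = np.asarray(raw_weights, dtype=float)
    scaled = raw_weights * target_mean / raw_weights.mean()
    return np.clip(scaled, lo, hi)

def weighted_bce(probs, labels, pos_weight, eps=1e-7):
    """BCE in which only the positive (contact) term is scaled by pos_weight.

    probs, labels: (B, N) arrays; pos_weight: (N,) per-vertex weights.
    """
    p = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    y = np.asarray(labels, dtype=float)
    loss = -(pos_weight * y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return float(loss.mean())

# Two vertices, one sample: vertex 0 in contact, vertex 1 not in contact.
probs = np.array([[0.5, 0.5]])
labels = np.array([[1.0, 0.0]])
plain = weighted_bce(probs, labels, pos_weight=np.array([1.0, 1.0]))
weighted = weighted_bce(probs, labels, pos_weight=np.array([3.0, 1.0]))
weights = rescale_and_clip([1.0, 3.0], target_mean=4.0, lo=0.0, hi=100.0)
```

With unit weights the loss reduces to ordinary BCE; raising the weight on a rarely contacted vertex increases the penalty for missing a contact there without changing the penalty for false positives.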

The tool-manipulation SCFields paper uses a more explicitly physical objective. In simulation, contact probability labels $c_i$ are obtained from SDF-based penetration depth, with the parameters of the mapping calibrated at 5 mm penetration. Dense extrinsic forces are synthesized from PyBullet contact points using distance weighting and depth modulation. Training combines Focal Loss for contact probability, weighted force-magnitude regression, and cosine force-direction loss, combined with fixed weighting coefficients (Ma et al., 14 Feb 2026). In the real-world alignment stage, dense force pseudo-labels are obtained by solving a second-order cone program (SOCP) with ECOS that matches a tactile wrench while enforcing a friction-cone-like constraint.
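The three training terms named above can be sketched in NumPy. The focal-loss defaults ($\gamma = 2$, $\alpha = 0.25$), the binarised contact targets, the use of the contact mask as the magnitude weight, and the unit coefficients `lam_c`, `lam_m`, `lam_d` are assumptions for illustration and are not the values used in the paper.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t), averaged."""
    p = np.clip(p, eps, 1.0 - eps)
    p_t = np.where(y == 1, p, 1.0 - p)
    a_t = np.where(y == 1, alpha, 1.0 - alpha)
    return float(np.mean(-a_t * (1.0 - p_t) ** gamma * np.log(p_t)))

def magnitude_loss(f_pred, f_true, weights):
    """Weighted squared error between predicted and target force magnitudes."""
    diff = np.linalg.norm(f_pred, axis=1) - np.linalg.norm(f_true, axis=1)
    return float(np.sum(weights * diff ** 2) / max(float(np.sum(weights)), 1e-8))

def direction_loss(f_pred, f_true, mask, eps=1e-8):
    """Mean of (1 - cosine similarity) over points selected by mask."""
    dot = np.sum(f_pred * f_true, axis=1)
    denom = np.linalg.norm(f_pred, axis=1) * np.linalg.norm(f_true, axis=1) + eps
    return float(np.sum((1.0 - dot / denom) * mask) / max(float(np.sum(mask)), 1.0))

def total_loss(p, y, f_pred, f_true, lam_c=1.0, lam_m=1.0, lam_d=1.0):
    mask = (y == 1).astype(float)   # supervise forces only on contact points
    return (lam_c * focal_loss(p, y)
            + lam_m * magnitude_loss(f_pred, f_true, mask)
            + lam_d * direction_loss(f_pred, f_true, mask))

# Two points: the first in contact with a vertical force, the second free.
p = np.array([0.9, 0.1])
y = np.array([1, 0])
f_pred = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]])
f_true = np.array([[0.0, 0.0, 2.0], [0.0, 0.0, 0.0]])
loss = total_loss(p, y, f_pred, f_true)
```

In this toy case the predicted force has the right direction but half the target magnitude, so the direction term is close to zero and the magnitude term dominates.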

SECOND-Grasp’s contact field is regularized first semantically and then geometrically. Cross-view refinement increases confidence for pixels whose 3D reprojections agree across views under a depth threshold, and the final 3D contact set retains only points that form a local convex pair with at least one top-10% confidence seed point (Shin et al., 13 May 2026). The contact map then shapes both an IK objective and a policy reward with pose-guidance and contact-conditioned terms.
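The seed-based convexity filter can be sketched under explicit assumptions. The pairwise test used here, $(p_i - p_j) \cdot (n_i - n_j) \ge 0$ for outward normals, is a common local-convexity criterion and is swapped in for illustration; the neighbourhood `radius` and the function name `convexity_filter` are likewise hypothetical, and the paper's exact test may differ.

```python
import numpy as np

def convexity_filter(points, normals, conf, seed_frac=0.10, radius=0.05):
    """Keep points that form a locally convex pair with at least one seed.

    Seeds are the top `seed_frac` fraction of points by confidence.
    """
    points = np.asarray(points, dtype=float)
    normals = np.asarray(normals, dtype=float)
    conf = np.asarray(conf, dtype=float)
    k = max(1, int(np.ceil(seed_frac * len(points))))
    seeds = np.argsort(conf)[-k:]                       # indices of top-k confidence
    dp = points[:, None, :] - points[None, seeds, :]    # (N, k, 3)
    dn = normals[:, None, :] - normals[None, seeds, :]  # (N, k, 3)
    convex = np.sum(dp * dn, axis=2) >= 0.0             # assumed convexity test
    local = np.linalg.norm(dp, axis=2) <= radius        # restrict to nearby seeds
    return (convex & local).any(axis=1)

# Four points on a unit sphere with outward normals, plus one facing surface.
points = np.array([[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0], [-1.0, 0, 0], [2.0, 0, 0]])
normals = np.array([[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0], [-1.0, 0, 0], [-1.0, 0, 0]])
conf = np.array([0.9, 0.5, 0.4, 0.3, 0.1])
keep = convexity_filter(points, normals, conf, seed_frac=0.10, radius=5.0)
```

The last point faces the seed across a gap, which makes the pair concave, so it is removed while the points on the convex sphere are retained.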

ClickDiff’s SCM is supervised indirectly. The Semantic Conditional Module reconstructs the scalar contact map with a regression loss plus a thresholded sparsity term, while the Contact Conditional Module uses MANO-parameter and hand-vertex reconstruction losses together with the Tactile-Guided Constraint, which draws finger-specific hand centroids toward SCM-indicated object points (Li et al., 2024). Here the semantic contact field functions primarily as a structured conditioning signal and as a source of geometric constraints.
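A constraint that draws finger-specific centroids toward SCM-indicated object points can be sketched as a simple distance penalty. The mean-distance form, the per-vertex finger labels `finger_ids`, and the function name `tactile_guided_loss` are assumptions for illustration; the exact published formulation is not reproduced here.

```python
import numpy as np

def tactile_guided_loss(hand_verts, finger_ids, obj_pts, scm):
    """Average, over active fingers, of the mean distance between the finger's
    vertex centroid and the object points the SCM assigns to that finger."""
    hand_verts = np.asarray(hand_verts, dtype=float)
    finger_ids = np.asarray(finger_ids)
    obj_pts = np.asarray(obj_pts, dtype=float)
    total, used = 0.0, 0
    for f in range(scm.shape[1]):
        target = obj_pts[scm[:, f] > 0]          # points assigned to finger f
        finger = hand_verts[finger_ids == f]     # hand vertices of finger f
        if len(target) == 0 or len(finger) == 0:
            continue                             # finger not constrained
        centroid = finger.mean(axis=0)
        total += np.linalg.norm(target - centroid, axis=1).mean()
        used += 1
    return total / used if used else 0.0

# Toy hand: two vertices on finger 0, one on finger 1.
hand_verts = np.array([[0.0, 0, 0], [0, 0, 2.0], [1.0, 1.0, 1.0]])
finger_ids = np.array([0, 0, 1])
obj_pts = np.array([[0.0, 0, 1.0], [3.0, 4.0, 1.0], [9.0, 9.0, 9.0]])
scm = np.zeros((3, 5))
scm[0, 0] = 1.0   # finger 0 should touch the first object point
scm[1, 0] = 1.0   # and the second one
loss = tactile_guided_loss(hand_verts, finger_ids, obj_pts, scm)
```

Only finger 0 is constrained in this example; its centroid sits on the first target and 5 units from the second, giving a mean distance of 2.5.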

In the Eulerian phase-field framework, physical grounding is intrinsic to the representation. Contact is enforced by volumetric penalty body forces that act where the phase fields of two bodies overlap (Lorez et al., 2023). This suggests a broader SCField design principle: contact fields can be interpreted not only as predictive outputs but also as energy- or penalty-derived state fields that directly generate mechanical interactions.

5. Applications and empirical behavior

The best-developed explicit SCField application is category-level tactile tool manipulation. On unseen tools in scraping, SCFields achieve SR 79.6%, Eff 73.5%, and Eff Norm 84.7%, compared with Vision-only (GenDP) at SR 35.1%, Eff 25.4%, and Eff Norm 35.1%, and Raw tactile at SR 50.0%, Eff 23.3%, and Eff Norm 27.3%. In crayon drawing on unseen crayons, SCFields reach a drawing consistency score of 0.78, versus 0.60 for Vision-only and 0.61 for Raw tactile. In peeling with unseen peelers, SCFields achieve Contact 90%, Cut-in 73.3%, and Peel 4.52 cm, compared with Vision-only at Contact 50%, Cut-in 33.3%, and Peel 1.12 cm (Ma et al., 14 Feb 2026). The same study reports that a no-force ablation performs poorly in scraping and peeling, which supports the claim that explicit force channels are central rather than incidental.

DecoDINO applies the SCField idea to human-scene contact understanding. On DAMON, the full model improves binary-contact F1 from 56.42% to 62.54%, precision from 54.27% to 67.04%, and geodesic error from 18.68 to 15.89 cm, while recall changes from 72.94% to 67.35%. For semantic classification over contacted vertices, DecoDINO reports F1 28.77%, Precision 17.55%, and Recall 79.81% (Bierling et al., 27 Oct 2025). The paper characterizes this as a high-recall, low-precision semantic-contact field and notes improved behavior on soft surfaces, under occlusion, and for complex poses, while also documenting residual false-positive foot contacts.
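As a quick arithmetic check, the reported semantic scores are consistent with F1 computed as the harmonic mean of precision and recall. The helper name `f1_score` is introduced here for illustration only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2.0 * precision * recall / (precision + recall)

# Semantic-contact precision and recall reported for DecoDINO, in percent.
semantic_f1 = f1_score(17.55, 79.81)
```

The harmonic mean stays close to the smaller of the two inputs, which is why a recall near 80% still yields an F1 below 30% when precision is under 18%.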

SECOND-Grasp shows how a field-like semantic contact representation can guide dexterous grasping. On DexGraspNet, the state-based policy achieves Grasp Success Rate 98.2% on seen categories and 97.7% on unseen categories, while intent-aware evaluation shows improvements of 12.8% and 26.2% (Shin et al., 13 May 2026). Cross-dataset results are also reported, including 79.80% on DGA, 96.68% on EGAD, 77.84% on Omni6DPose, 87.89% on ModelNet40, and 94.14% on VisualDexterity. The paper attributes these gains to geometrically grounded contact maps and pseudo-pose supervision rather than to direct runtime inference of a semantic contact field.

ClickDiff uses a semantic contact field in the narrower sense of a controllable semantic contact map. On GRAB unseen objects, it reports MPJPE 40.57 mm, CDev 52.05 mm, and SR 72.85%, compared with substantially worse values for GOAL and ContactGen in the cited averages (Li et al., 2024). Ablations indicate that SCM improves MPJPE and CDev over binary-map and Gaussian-map contact conditions and that Dual+SCM improves SR by approximately 4.6% and reduces CDev by approximately 7.9 mm relative to the best single-stage SCM-conditioned model.

Lightning Grasp demonstrates that field-based contact abstractions can drastically change computational performance even without explicit semantics. It reports 2–5 seconds on an A100 GPU to generate 1000–10,000 valid grasps, with 300–1000 effective samples per second and trimmed-mean throughput of 1091 SPS for Allegro, 420 SPS for LEAP, 559 SPS for Shadow, and 249 SPS for DClaw in Table 1 (Yin et al., 10 Nov 2025). This suggests that SCFields intended for task-aware grasp synthesis need not be neural by default; high-performance procedural fields remain competitive when the central problem is feasible-contact enumeration.

A recurring misconception is that SCFields denote only one representation: a semantic contact classifier on object surfaces. The literature is broader. In human-scene contact, the semantic channel is object category attached to human vertices (Bierling et al., 27 Oct 2025). In tactile tool manipulation, it is part-level or functional semantics fused with dense force and contact probability on the tool (Ma et al., 14 Feb 2026). In ClickDiff, semantics reduce to finger identity on object points (Li et al., 2024). In SECOND-Grasp, semantics are grasp intentions such as “handle” or “body” coupled to confidence and geometric refinement (Shin et al., 13 May 2026). The common invariant is not the label set but the coupling of geometry-aligned contact state with a semantically structured codomain.

A second misconception is that semantics alone suffice. The most successful SCField instantiations couple semantics to explicit geometry or mechanics: patch-level body-scene reasoning in DecoDINO, dense extrinsic force regression and sim-to-real alignment in tool manipulation, multi-view depth consistency and local convexity in SECOND-Grasp, or finger-centroid geometric constraints in ClickDiff (Bierling et al., 27 Oct 2025, Ma et al., 14 Feb 2026, Shin et al., 13 May 2026, Li et al., 2024). The field is useful insofar as it mediates between semantic abstraction and physically feasible contact.

The current limitations are correspondingly diverse. DecoDINO reports high semantic recall but low precision and notes persistent foot-contact priors and failures when no person is present (Bierling et al., 27 Oct 2025). The tool-manipulation SCFields pipeline is imitation-only, depends on a structured alignment task for real pseudo-labels, incurs PointNet++ and dense-field inference cost, and may require new alignment data when tactile hardware changes (Ma et al., 14 Feb 2026). SECOND-Grasp uses a discrete point-cloud field, not a continuous implicit neural field, and does not use the contact field directly as a runtime input (Shin et al., 13 May 2026). ClickDiff’s SCM is discrete, threshold-based, and lacks explicit physical simulation (Li et al., 2024). Lightning Grasp lacks explicit semantics and struggles with highly non-convex objects such as cups (Yin et al., 10 Nov 2025). The Eulerian framework addresses only frictionless normal contact and, as a penalty method, permits small residual penetration (Lorez et al., 2023).

Several future directions recur across the papers. These include more and better semantic annotations, better VLM integration, continuous neural SCFields that output contact probability and semantic class directly, tighter coupling with reinforcement learning or world models, extension to friction, adhesion, or damage semantics, and task-conditioned or language-conditioned semantic contact selection (Bierling et al., 27 Oct 2025, Ma et al., 14 Feb 2026, Shin et al., 13 May 2026, Lorez et al., 2023). Taken together, these directions indicate that SCFields are evolving from a descriptive contact label space into a general interface between perception, semantics, mechanics, and policy.