Explicit Graph Representations for Parts

Updated 27 December 2025

Explicit graph representations for parts structure image regions as nodes with edges that encode spatial and statistical relationships, offering a clear part-whole interpretation.
These frameworks integrate methods like over-segmentation, MDL-based subgraph discovery, and learned graph pooling to capture both compositional and physical attributes.
Reported results show strong performance on synthetic and rendered benchmarks and robustness under occlusion and out-of-distribution conditions, with markedly lower scores on real-world datasets.

Explicit graph representations for parts are frameworks in which image regions or shape components ("parts") are structured as nodes in a graph, with edges encoding their spatial, statistical, or physical relationships. In contrast to implicit or latent-object models, this approach makes part–whole structure directly accessible to computation and interpretation. Recent work in this direction links explicit part graphs to gains in robustness, compositionality, and transferability for visual perception and object-centric modeling.

1. Formalization of Graph-Based Part Representations

Explicit part graphs are formalized by associating discrete image parts or shape primitives with graph nodes, while edges express spatial or statistical relations among these parts. In "Multi-Part Object Representations via Graph Structures and Co-Part Discovery" (ECO-Net) (Foo et al., 20 Dec 2025), an undirected graph $G = (V, E)$ is constructed from over-segmented images: nodes $V = \{1, \ldots, M\}$ correspond to $M$ contiguous pixel parts obtained by Felzenszwalb's segmentation, and edges $E$ connect adjacent parts. Each node is equipped with a shape embedding derived from $K$ boundary pixels,

$\mathbf{V}_i = \Big[ \mathbf{x}^i_1 - \mathbf{x}^i_c, \ldots, \mathbf{x}^i_K - \mathbf{x}^i_c \Big] \in \mathbb{R}^{K \times 2},$

where $\mathbf{x}^i_c$ is the part centroid and $\mathbf{x}_k^i$ are boundary samples. Optionally, high-level descriptors such as DINO-ViT features can be appended.
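The embedding above can be sketched in a few lines. This is a minimal illustration, not ECO-Net's implementation: it assumes the boundary is already available as an ordered list of pixel coordinates and that the $K$ samples are taken at uniform index spacing. The function names `centroid` and `shape_embedding` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def centroid(pixels: List[Point]) -> Point:
    """Mean (x, y) position of all pixels belonging to a part."""
    n = len(pixels)
    return (sum(p[0] for p in pixels) / n, sum(p[1] for p in pixels) / n)


def shape_embedding(boundary: List[Point], center: Point, k: int) -> List[Point]:
    """Return K centroid-relative boundary offsets (a K x 2 array as a list).

    Assumption: boundary points are ordered along the contour and are
    subsampled at uniform index spacing.
    """
    if k > len(boundary):
        raise ValueError("need at least k boundary points")
    step = len(boundary) / k
    samples = [boundary[int(i * step)] for i in range(k)]
    return [(x - center[0], y - center[1]) for x, y in samples]


# A 3x3 square part: all nine pixels, and its eight boundary pixels in order.
pixels = [(x, y) for x in range(3) for y in range(3)]
boundary = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
emb = shape_embedding(boundary, centroid(pixels), k=4)
```

Because every offset is taken relative to the centroid, the embedding is unchanged when the part is translated within the image.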

Edges encode relative part displacements,

$\mathbf{e}_{ij} = \mathbf{x}_c^j - \mathbf{x}_c^i,$

thus providing explicit spatial context for each adjacency.
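As an illustration of the graph construction, the sketch below derives part centroids, adjacency, and the displacement $\mathbf{e}_{ij}$ from a segment label map. It assumes that two parts are adjacent when any of their pixels are 4-connected neighbours and that each undirected edge is stored once with $i < j$; both choices, and the name `build_part_graph`, are assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple


def build_part_graph(labels: List[List[int]]):
    """Return part centroids and centroid displacements for adjacent parts."""
    sums: Dict[int, List[float]] = {}
    edges: Set[Tuple[int, int]] = set()
    height, width = len(labels), len(labels[0])
    for y in range(height):
        for x in range(width):
            a = labels[y][x]
            acc = sums.setdefault(a, [0.0, 0.0, 0.0])
            acc[0] += x
            acc[1] += y
            acc[2] += 1
            # Look right and down; together these cover every 4-connected pair once.
            for nx, ny in ((x + 1, y), (x, y + 1)):
                if nx < width and ny < height and labels[ny][nx] != a:
                    b = labels[ny][nx]
                    edges.add((min(a, b), max(a, b)))
    centroids = {i: (s[0] / s[2], s[1] / s[2]) for i, s in sums.items()}
    # e_ij = x_c^j - x_c^i for each adjacent pair (i, j) with i < j.
    displacements = {
        (i, j): (centroids[j][0] - centroids[i][0], centroids[j][1] - centroids[i][1])
        for i, j in sorted(edges)
    }
    return centroids, displacements


# Toy label map with three parts: two squares on top of a horizontal bar.
labels = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 3, 3],
]
centroids, displacements = build_part_graph(labels)
```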

In the Compositional Hierarchy of Parts (CHOP) architecture (Aktas et al., 2015), a part at hierarchy layer $l$ is a labeled directed random graph,

$\mathcal{P}_i^l = \big(\mathcal{G}_i^l,\, \mathcal{Y}_i^l\big),$

where $\mathcal{G}_i^l = (\mathcal{V}_i^l, \mathcal{E}_i^l)$ is a graph with node set $\mathcal{V}_i^l$ and edge set $\mathcal{E}_i^l$, and $\mathcal{Y}_i^l$ is a random variable serving as the part label. Object graphs are unions of part graphs at each layer, with edge features $\phi_{ab}^l$ encoding discretized spatial relations via quantization schemes derived from Minimum Conditional Entropy Clustering.

Physical Scene Graphs (PSGs) (Bear et al., 2020) extend the explicit part-graph paradigm to hierarchies. Nodes at successive levels correspond to image regions aggregated from pixels to parts to whole objects; edges encode “physical bonds” (surface continuity, joint motion); and node attributes $A_l(v) \in \mathbb{R}^{C_l}$ represent learned appearance and geometric properties.

2. Part Graph Construction and Learning Methodologies

Methods for building explicit part graphs unify spatial grouping, generative modeling, and description length-based selection principles.

ECO-Net (Foo et al., 20 Dec 2025) constructs part graphs through over-segmentation and assigns boundary-based node features as above. The co-part discovery algorithm performs iterative clustering: parts $P_i, P_j$ are grouped if their shape embeddings exhibit $K$-point average cosine similarity above threshold $\epsilon \approx 0.99$, and their neighborhoods also exhibit high similarity. The process enforces robustness via transitive closure and naturally scales to multi-part object discovery. No neural loss is involved at this stage.
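The thresholded grouping with transitive closure can be sketched as follows. This is a simplified illustration under stated assumptions: the $K$-point average cosine similarity is taken to be the mean cosine between corresponding boundary offsets, the neighborhood-similarity test is omitted, and a union-find structure stands in for whatever closure mechanism the original method uses. The names `avg_cosine` and `group_parts` are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Embedding = List[Tuple[float, float]]


def avg_cosine(a: Embedding, b: Embedding) -> float:
    """Mean cosine similarity between corresponding boundary offsets."""
    total = 0.0
    for (ax, ay), (bx, by) in zip(a, b):
        na, nb = math.hypot(ax, ay), math.hypot(bx, by)
        total += (ax * bx + ay * by) / (na * nb) if na and nb else 0.0
    return total / len(a)


def group_parts(embs: Dict[int, Embedding], eps: float = 0.99) -> List[List[int]]:
    """Union parts whose similarity exceeds eps; union-find gives transitive closure."""
    parent = {i: i for i in embs}

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    ids = sorted(embs)
    for n, i in enumerate(ids):
        for j in ids[n + 1:]:
            if avg_cosine(embs[i], embs[j]) > eps:
                parent[find(j)] = find(i)

    groups: Dict[int, List[int]] = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


square = [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)]
big_square = [(2 * x, 2 * y) for x, y in square]              # same shape, larger scale
flat = [(-3.0, -0.2), (3.0, -0.2), (3.0, 0.2), (-3.0, 0.2)]   # elongated part
groups = group_parts({0: square, 1: big_square, 2: flat})
```

In this toy case the two squares fall into one group and the elongated part into another. Cosine similarity ignores vector length, so the measure as sketched here is insensitive to part scale. The double loop also makes the quadratic cost in the number of parts explicit.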

CHOP (Aktas et al., 2015) alternates two steps at each hierarchy layer: (i) a generative Minimum Conditional Entropy Clustering to capture frequent spatial relations between parts, producing spatial relation codes $M_{ijk} = (i, j, \mathbf{c}_{ijk}, Z_{ijk})$, and (ii) a descriptive Minimum Description Length (MDL) subgraph discovery, which selects part subgraphs maximizing compression of the object graph. Star-shaped subgraphs are promoted to new parts if they yield lower MDL, recursively constructing the multi-layer compositional vocabulary.
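The MDL criterion can be made concrete with a toy bit count. The encoding below is an assumption chosen for illustration and is not CHOP's actual code-length formula: each node costs one label, each edge costs two endpoint indices plus one relation code, and every non-overlapping occurrence of a star subgraph is collapsed into a single node carrying a new label. Edges linking an occurrence to the rest of the graph are assumed to survive unchanged.

```python
import math


def graph_bits(n_nodes: int, n_edges: int, n_labels: int, n_relations: int) -> float:
    """Toy code length: a label per node, two endpoints and a relation per edge."""
    node_bits = n_nodes * math.log2(n_labels)
    edge_bits = n_edges * (2 * math.log2(n_nodes) + math.log2(n_relations))
    return node_bits + edge_bits


def compressed_bits(n_nodes: int, n_edges: int, n_labels: int, n_relations: int,
                    star_leaves: int, occurrences: int) -> float:
    """Bits for the star definition plus the graph with each occurrence collapsed."""
    star_nodes, star_edges = star_leaves + 1, star_leaves
    definition = graph_bits(star_nodes, star_edges, n_labels, n_relations)
    # Each occurrence replaces (leaves + 1) nodes by one node and drops its internal edges.
    new_nodes = n_nodes - occurrences * star_leaves
    new_edges = n_edges - occurrences * star_edges
    rewritten = graph_bits(new_nodes, new_edges, n_labels + 1, n_relations)
    return definition + rewritten


# Hypothetical object graph: 60 nodes, 90 edges, 8 part labels, 4 relation codes.
original = graph_bits(60, 90, 8, 4)
with_star = compressed_bits(60, 90, 8, 4, star_leaves=3, occurrences=10)
promote = with_star < original  # promote the star to a new part if it compresses
```

Under this accounting, a star that never occurs only adds its definition cost, so it is rejected. A frequently recurring star shortens the total description and would be promoted.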

PSGNet (Bear et al., 2020) introduces a learned graph construction stage embedded within a differentiable network. Hierarchical graph pooling is performed using learned affinity functions—feature similarity, co-occurrence (parameterized by VAEs), and motion-based grouping—followed by clustering and vectorization of each segment's local subregions. This endows PSGs with hierarchical, object-centric structure supporting downstream rendering, segmentation, and property estimation.
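One level of affinity-based pooling might look like the sketch below. It is a schematic stand-in rather than PSGNet's method: a fixed function `exp(-distance)` replaces the learned affinities, connected components over thresholded edges replace the clustering step, and attribute averaging replaces the learned vectorization. All names and the threshold value are assumptions.

```python
import math
from typing import Dict, List, Tuple

Vec = List[float]


def pool_graph(attrs: Dict[int, Vec], edges: List[Tuple[int, int]], thresh: float):
    """Merge nodes joined by high-affinity edges and average their attributes."""

    def affinity(a: Vec, b: Vec) -> float:
        # Hand-written stand-in for a learned affinity function.
        return math.exp(-math.dist(a, b))

    parent = {v: v for v in attrs}

    def find(v: int) -> int:
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        if affinity(attrs[u], attrs[v]) > thresh:
            parent[find(v)] = find(u)

    members: Dict[int, List[int]] = {}
    for v in sorted(attrs):
        members.setdefault(find(v), []).append(v)

    # Each cluster becomes one node at the next level of the hierarchy.
    pooled = {}
    for new_id, group in enumerate(sorted(members.values())):
        dim = len(attrs[group[0]])
        mean = [sum(attrs[v][d] for v in group) / len(group) for d in range(dim)]
        pooled[new_id] = (group, mean)
    return pooled


attrs = {0: [0.0, 0.0], 1: [0.1, 0.0], 2: [5.0, 5.0], 3: [5.0, 5.2]}
edges = [(0, 1), (1, 2), (2, 3)]
level1 = pool_graph(attrs, edges, thresh=0.5)
```

Applying `pool_graph` repeatedly to its own output, with new edges between the pooled nodes, would yield a hierarchy of the kind described above.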

3. Model Architectures and Computational Properties

The surveyed graph-based part models diverge in their training and architectural paradigms:

ECO-Net does not utilize end-to-end learning; its modules are strictly algorithmic: segmentation, feature extraction, pairwise clustering, and a memory update stage for discovered objects. No ground-truth object masks or neural networks are required, making the method robust to occlusion and background variability. Computational complexity is $O(M^2 \log M)$ per image, with optimization possible via graph-theoretic heuristics and approximate nearest neighbor search (Foo et al., 20 Dec 2025).
CHOP implements hierarchical vocabulary learning, beam-search-based subgraph discovery, and indexing/pruning for efficient subgraph isomorphism detection. Empirical inference times per image are $0.5$–$3$ seconds on CPU, with sublinear vocabulary growth in category/instance count, enabled by part shareability (Aktas et al., 2015).
PSGNet is a neural architecture composed of: (a) ConvRNN feature extraction with feedback, (b) hierarchical graph grouping and attribute learning via graph convolutions, and (c) parameter-free quadratic rendering decoders. Graph construction is learnable and leverages multiple perceptual and physical grouping cues, in contrast to the purely algorithmic approaches above (Bear et al., 2020).

4. Evaluation Benchmarks and Experimental Results

Explicit graph-based part representations have been evaluated on synthetic, real-world, and challenging occlusion/out-of-distribution (OOD) benchmarks.

ECO-Net (Foo et al., 20 Dec 2025) introduces a unified evaluation regime covering:

Simulated: Tetrominoes and AbsScene (ARI, Dice, and IoU all $> 97.7$),
Realistic: GSO (ARI=99.1, mDice=94.5), SKU-110K (ARI=7.8; small objects are difficult),
Real-World: PASCAL VOC (ARI=19.0), MS COCO (ARI=38.9),
Occlusion: AbsScene-O (mDice=93.1), GSO-O (mDice=90.3),
OOD: AbsScene-C (ARI=93.5, mDice=97.3).
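For reference, the three segmentation metrics named above can be computed as in the following sketch. It uses the standard definitions of Dice, IoU, and the adjusted Rand index (ARI); the scores in the list appear to be reported on a 0 to 100 scale, whereas these functions return values on a 0 to 1 scale. How the source averages over objects and images is not specified here, so only the per-mask and per-labelling computations are shown.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence, Set, Tuple

Pixel = Tuple[int, int]


def dice(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Dice coefficient between two masks given as sets of pixel coordinates."""
    return 2 * len(a & b) / (len(a) + len(b))


def iou(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Intersection over union between two masks."""
    return len(a & b) / len(a | b)


def adjusted_rand_index(true: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """ARI between two per-pixel labellings, computed from the contingency table."""
    n = len(true)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(true, pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(true).values())
    sum_b = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


mask_a = {(0, 0), (0, 1), (1, 0), (1, 1)}
mask_b = {(0, 1), (1, 1), (0, 2), (1, 2)}
d, j = dice(mask_a, mask_b), iou(mask_a, mask_b)   # 0.5 and 1/3
ari = adjusted_rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2])
```

ARI is invariant to how cluster identifiers are named, which is why it suits unsupervised object discovery, where predicted segment indices carry no fixed meaning.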

Downstream property prediction via regression from part-graph embeddings yields $R^2>98\%$ on shape and coordinate attributes. Ablation shows catastrophic failure if co-part clustering is omitted.

CHOP (Aktas et al., 2015) is validated on shape retrieval datasets such as MPEG-7, ETHZ, Tools-40, Tools-35, and Myth. Retrieval metrics (Top-1, Top-4, Bullseye) indicate that explicit part-graphs with learned spatial relations and shareable subparts outperform previous contour or shape-context methods, e.g., Bullseye 87.9–93.3% on Tools-35/Myth.
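The Bullseye score can be sketched as below, following the convention commonly used with MPEG-7: for each query, count the same-class shapes among the $2c$ nearest retrievals, where $c$ is the class size and the query itself is included, then divide by $c$ and average over queries. Whether the cited evaluation uses exactly this protocol on every dataset is an assumption.

```python
from typing import List, Sequence


def bullseye(dist: List[List[float]], labels: Sequence[str]) -> float:
    """Mean fraction of same-class items found in the top 2 * class_size retrievals."""
    n = len(labels)
    total = 0.0
    for q in range(n):
        class_size = sum(1 for lab in labels if lab == labels[q])
        ranked = sorted(range(n), key=lambda idx: dist[q][idx])
        hits = sum(labels[idx] == labels[q] for idx in ranked[: 2 * class_size])
        total += hits / class_size
    return total / n


# Toy 1-D "shape descriptors"; the last item is class c but lies near class a.
features = [0, 1, 10, 11, 20, 2]
labels = ["a", "a", "b", "b", "c", "c"]
dist = [[abs(p - q) for q in features] for p in features]
score = bullseye(dist, labels)
```

In this example five of the six queries retrieve both members of their class within the top four, while the outlier retrieves only itself, giving a score of 5.5 / 6.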

PSGNet (Bear et al., 2020) reports a recall of 0.70, along with strong mIoU/BoundF, on the Playroom dataset. Ablations show that removing recurrence, border aggregation, or graph convolution significantly degrades performance (recall drops to $\leq 0.59$). Learned vector affinities are critical: with binary affinities, recall collapses to 0.05.

5. Interpretability, Compositionality, and Physical Reasoning

Explicit part graphs offer direct interpretability: nodes correspond to discrete spatial entities, edges to explicit relations or bonds, and embeddings to measurable physical or statistical properties.

In ECO-Net (Foo et al., 20 Dec 2025), explicit spatial encoding allows robust object discovery under occlusion and OOD; discovered objects can be mapped to interpretable property vectors for downstream tasks without implicit neural bottlenecks.

CHOP (Aktas et al., 2015) builds a hierarchy in which low-level shape primitives compose recurrent, shareable mid-level motifs and ultimately complex objects, supporting both compositional inference and efficient shape retrieval. The vocabulary stabilizes as common substructures are captured.

PSGNet (Bear et al., 2020) further demonstrates compositional reasoning: symbolic object editing (removing, translating, scaling nodes), segment-wise manipulation of attributes (color, shape), and tracking across time via spatiotemporal bonds. The hierarchy supports disentanglement of surface, texture, and geometric factors.
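Symbolic editing of this kind amounts to simple operations on node attributes, after which a decoder would re-render the scene. The sketch below illustrates only the editing step on a hypothetical flat scene representation. The `Node` fields and function names are assumptions, not PSGNet's attribute layout, and rendering is out of scope.

```python
from dataclasses import dataclass, replace
from typing import Dict, Tuple


@dataclass(frozen=True)
class Node:
    center: Tuple[float, float]
    size: float
    color: Tuple[int, int, int]


Scene = Dict[str, Node]


def remove_node(scene: Scene, name: str) -> Scene:
    """Return a copy of the scene without the named object node."""
    return {k: v for k, v in scene.items() if k != name}


def translate_node(scene: Scene, name: str, dx: float, dy: float) -> Scene:
    """Return a copy of the scene with one node's center shifted by (dx, dy)."""
    node = scene[name]
    moved = replace(node, center=(node.center[0] + dx, node.center[1] + dy))
    return {**scene, name: moved}


def scale_node(scene: Scene, name: str, factor: float) -> Scene:
    """Return a copy of the scene with one node's size multiplied by factor."""
    return {**scene, name: replace(scene[name], size=scene[name].size * factor)}


scene = {
    "cube": Node(center=(2.0, 3.0), size=1.0, color=(255, 0, 0)),
    "ball": Node(center=(5.0, 1.0), size=0.5, color=(0, 0, 255)),
}
edited = scale_node(translate_node(remove_node(scene, "ball"), "cube", 1.0, -1.0), "cube", 2.0)
```

Each edit returns a new scene and leaves the input untouched, so edits compose and the original graph remains available for comparison.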

6. Limitations and Open Directions

Explicit graph-based part representations present several open challenges:

Parameter Sensitivity: ECO-Net requires a hand-tuned similarity threshold $\epsilon$ ; automatic or learned metric selection is a desirable extension (Foo et al., 20 Dec 2025).
Scalability: $O(M^2)$ pairwise clustering in ECO-Net limits applicability to large-scale images and video; faster matching algorithms, or learned graph neural networks, are suggested remedies.
Richer Semantics: Current approaches, while effective for relatively simple part structure, are limited in handling articulated, hierarchical, or semantically complex parts (e.g., non-star or cyclic configurations, joints, and attachments) (Foo et al., 20 Dec 2025, Aktas et al., 2015).
End-to-End Learning: Integrating explicit part-graph formation and object discovery into fully differentiable, supervised or self-supervised pipelines with learnable grouping and relation modules remains an active area, exemplified by the architecture of PSGNet (Bear et al., 2020).
Physical Realism: While PSGNet’s edges encode “physical bonds” in a latent sense, fine-grained modeling of forces, contact mechanics, or articulated kinematics is not yet achieved.

A plausible implication is that advances in hierarchical graph construction, unsupervised metric learning, and the integration of physical constraints will further enable explicit part graphs to support both robust perception and physical reasoning in complex environments.

7. Summary and Perspectives

Explicit graph representations for parts constitute a foundational methodology for object-centric vision and scene understanding. By encoding part compositionality, spatial and physical context, and statistical recurrence directly in graph-based structures, these models have demonstrated gains in robustness, interpretability, and downstream task performance, especially in the presence of occlusion and unseen backgrounds. Ongoing research is expanding their expressivity, scalability, and integration with end-to-end differentiable architectures, with the goal of advancing compositional, interpretable, and physically grounded scene representations across a range of vision and robotics domains (Foo et al., 20 Dec 2025, Aktas et al., 2015, Bear et al., 2020).