OCCAM#1: Object-Centric Masked Classification

Updated 9 March 2026

The paper demonstrates that applied object masks can effectively disentangle foreground objects from background cues to achieve state-of-the-art out-of-distribution performance.
It employs techniques like mask-driven cropping, alpha-channel augmentation, and prompt-guided dynamic masking to isolate and encode objects within the image.
Empirical benchmarks show significant improvements over traditional methods on datasets such as ImageNet-D, UrbanCars, and Waterbirds.

Object-Centric Classification with Applied Masks (OCCAM#1) is a paradigm and experimental toolbox that uses explicit object masks, either generated by segmentation models or computed dynamically, to drive object-centric representation, enable robust classification, and disentangle objects from spurious background cues. Unlike earlier approaches that focus mainly on slot-space or latent-space object discovery, OCCAM#1 operates directly in pixel space, exploiting off-the-shelf or learned masks to isolate and encode objects before classification. The approach achieves state-of-the-art performance on out-of-distribution (OOD) generalization and robust recognition tasks through mask-driven cropping, attention mechanisms, or explicit region-level fusion. The workflow is entirely training-free when it uses frozen segmentation and encoder backbones, and it can be extended with prompt-guided dynamic masking or compositional inference for more complex scenarios.

1. Object-Centric Mask Generation and Application

OCCAM#1 begins by generating a set of object-centric masks from a given image. These masks may come from pretrained, class-agnostic segmentation models such as HQES (EntitySeg), SAM, or FT-Dinosaur, each outputting $K$ binary masks $m_i \in \{0,1\}^{H \times W}$. The main application strategies include:

Gray BG + Crop: All non-masked pixels are set to a neutral color (e.g., gray), the masked object is cropped to its bounding box, and resized as needed before encoding.
Alpha-channel: The mask $m_i$ augments the input as a fourth channel, facilitating architectures that can directly process multi-modal or masked input forms.

These applied masks define the input for the subsequent encoding and classification stages, ensuring that each candidate object is encoded independently of background or other objects present in the scene (Rubinstein et al., 9 Apr 2025).
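The two application strategies above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names `gray_bg_crop` and `alpha_channel` are hypothetical, the gray value 128 is an assumption (the text only says "neutral color"), and resizing to the encoder's input size is omitted.

```python
import numpy as np

def gray_bg_crop(image, mask, gray=128):
    """Gray out non-masked pixels, then crop to the mask's bounding box.

    image: (H, W, 3) uint8 array; mask: (H, W) array of 0/1.
    gray=128 is an assumed neutral value.
    """
    mask = mask.astype(bool)
    if not mask.any():
        raise ValueError("mask is empty")
    out = np.full_like(image, gray)
    out[mask] = image[mask]                      # keep object pixels only
    rows = np.flatnonzero(mask.any(axis=1))      # rows touched by the mask
    cols = np.flatnonzero(mask.any(axis=0))      # columns touched by the mask
    return out[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

def alpha_channel(image, mask):
    """Append the mask as a fourth channel, giving an (H, W, 4) array."""
    alpha = (mask.astype(image.dtype) * 255)[..., None]
    return np.concatenate([image, alpha], axis=-1)

# Toy example: a 4x5 image with an L-shaped mask.
image = np.arange(4 * 5 * 3, dtype=np.uint8).reshape(4, 5, 3)
mask = np.zeros((4, 5), dtype=np.uint8)
mask[1:3, 2:4] = 1
mask[2, 2] = 0                                   # pixel inside the box but outside the mask
crop = gray_bg_crop(image, mask)
rgba = alpha_channel(image, mask)
```

Here `crop` covers the 2x2 bounding box of the mask, with the one non-masked pixel inside the box set to gray, while `rgba` keeps the full image and carries the mask in its last channel.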

2. Mask-Driven Object Encoding and Classification Pipeline

For each generated mask, a frozen image encoder $\psi$ (typically ViT-L/14 CLIP or similar) maps the masked crop $a(x, m_i)$ into a $d$-dimensional embedding $z_i = \psi(a(x,m_i))$. The classifier head $f$ then produces logits, and $y = \mathrm{Softmax}(f(z_i))$ gives the class posterior.

Foreground mask selection is governed by a mask scoring function. Notable strategies include:

Ensemble entropy: Measures the mean entropy of predictions across a CLIP ensemble and favors masks with low entropy, i.e., confident predictions on a plausible target class.
Class-aided oracle: Selects masks recognized as most likely for the ground-truth class (used as an oracle for diagnostic purposes).

The mask with the highest score, $m^*$, is used for the final prediction. For multiclass images, this yields robust, object-centric predictions that are less sensitive to background confounds and spurious cues (Rubinstein et al., 9 Apr 2025).
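As a rough sketch of entropy-based mask selection, the snippet below scores each mask by the negative mean entropy of an ensemble's predictions and picks the highest-scoring one. Several choices are assumptions made for illustration: the logits are taken as given (one `(E, C)` array per mask for `E` ensemble members and `C` classes), the score is defined as negated entropy so that "highest score" matches "lowest entropy", and the final posterior is the ensemble average. The function names are hypothetical.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def ensemble_entropy_score(logits):
    """Score one mask from ensemble logits of shape (E, C).

    Returns the negative mean entropy over ensemble members, so a
    higher score means more confident predictions.
    """
    probs = softmax(logits)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    return -entropy.mean()

def select_mask(per_mask_logits):
    """per_mask_logits: (K, E, C). Returns (index of m*, class posterior)."""
    scores = np.array([ensemble_entropy_score(l) for l in per_mask_logits])
    best = int(scores.argmax())
    posterior = softmax(per_mask_logits[best]).mean(axis=0)  # assumed: average over ensemble
    return best, posterior

# Toy example: K=2 masks, E=2 ensemble members, C=3 classes.
per_mask_logits = np.array([
    [[0.1, 0.0, 0.2], [0.0, 0.1, 0.0]],   # mask 0: near-uniform predictions
    [[0.0, 0.0, 6.0], [0.5, 0.0, 5.0]],   # mask 1: peaked on class 2
])
best, posterior = select_mask(per_mask_logits)
```

In this toy case the second mask is selected because both ensemble members are confident about the same class.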

3. Variants: Dynamic and Prompt-Driven Masking

Beyond fixed mask generators, OCCAM#1 encompasses approaches where masks are learned or dynamically computed according to auxiliary cues:

Prompt-Driven Dynamic Object-Centric Perception: A ResNet-18 backbone is augmented with (i) a Prompt-Based Object-Centric Gating Module and (ii) a Dynamic Selective Module. Scene prompts (e.g., "An image taken in {ArtPaintings}") encoded by CLIP's text encoder are fused with visual features via Slot-Attention, yielding a binary gating mask $\mathbf{M}\in\{0,1\}^{H\times W\times C}$. The Dynamic Selective Module then sparsifies features both across channels and spatial regions:

$F'_c = M_c \odot F, \quad F'_s = M_s \odot F, \quad F' = F'_c + F'_s = (M_c + M_s) \odot F$

where marginals $M_c(c) = \max_{i,j} M_{ijc}$ and $M_s(i,j) = \max_c M_{ijc}$ represent channel-wise and spatial-wise mask selectors. Training includes standard cross-entropy loss and bound losses on the density of activated mask regions, encouraging sparsity and controlled selectivity. Empirically, this achieves improved generalization on PACS and Diverse-Weather benchmarks, with dynamic gating alone yielding +9.3% over static baselines, and CLIP-prompt + Slot-Attention fusion an additional +6.8% gain (Li et al., 2024).
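The channel-wise and spatial-wise selection can be written directly from the formulas above. The sketch below assumes a feature map `F` and a binary mask `M` of the same shape `(H, W, C)` are already available; the gating network that produces `M` and the bound losses are not modeled, and the function name is hypothetical.

```python
import numpy as np

def dynamic_select(features, mask):
    """Apply channel-wise and spatial-wise mask marginals to a feature map.

    features, mask: arrays of shape (H, W, C); mask is binary.
    Returns F' = (M_c + M_s) * F together with both marginals.
    """
    m_c = mask.max(axis=(0, 1))              # M_c(c) = max over i, j -> shape (C,)
    m_s = mask.max(axis=2)                   # M_s(i, j) = max over c -> shape (H, W)
    f_c = features * m_c[None, None, :]      # F'_c: keep active channels
    f_s = features * m_s[:, :, None]         # F'_s: keep active locations
    return f_c + f_s, m_c, m_s

# Toy example: a single active mask entry at location (0, 1), channel 2.
features = np.ones((2, 2, 3))
mask = np.zeros((2, 2, 3))
mask[0, 1, 2] = 1
out, m_c, m_s = dynamic_select(features, mask)
```

Note that the combined selector $M_c + M_s$ takes the value 2 where both an active channel and an active location coincide, so such entries are emphasized rather than merely kept.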

4. Mask Integration Strategies in Traditional CNNs

OCCAM#1 subsumes methods where a known ROI mask is incorporated via architectural fusion:

Side-branch attention: Eppel (2018) introduces a side-branch that convolves the input ROI mask to yield an attention map, which is merged with the base network's features either by addition or multiplication. Fusion at the earliest layer (conv1) provides the best spatial resolution and preserves background context, crucial for small or ambiguous ROIs.
Results: On COCO (mean class accuracy, small ROIs), side-branch attention improves performance over hard blackout masking, with fusion at conv1 yielding 68-83% versus 48-72% for blackout, and addition marginally outperforming multiplication.

This integration method is robust to noisy masks and requires minimal extra parameters (one convolution per fusion point), establishing a flexible template for OCCAM-style mask usage.

Fusion Variant	Layer	Mean Class Accuracy (All Sizes, COCO)
Blackout (hard)	–	72%
Add @ conv3_x	conv3_x	77%
Mul @ conv1	conv1	83%
Add @ conv1	conv1	83%
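The side-branch fusion compared in the table can be sketched in NumPy as a single convolution over the ROI mask followed by an element-wise merge. This is an illustrative sketch only: the kernel size of 3x3, the hand-written convolution, and the function names are assumptions, and in a real network the kernels and bias would be learned jointly with the backbone.

```python
import numpy as np

def conv3x3(mask, kernels, bias):
    """'Same' 3x3 cross-correlation of an (H, W) mask with (C, 3, 3) kernels.

    Returns an attention map of shape (H, W, C). Zero padding is used.
    """
    h, w = mask.shape
    padded = np.pad(mask.astype(float), 1)
    out = np.empty((h, w, kernels.shape[0]))
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + 3, j:j + 3]
            out[i, j] = (kernels * patch).sum(axis=(1, 2)) + bias
    return out

def fuse(features, mask, kernels, bias, mode="add"):
    """Merge the mask-derived attention map with (H, W, C) features."""
    attention = conv3x3(mask, kernels, bias)
    if mode == "add":
        return features + attention
    if mode == "mul":
        return features * attention
    raise ValueError(f"unknown fusion mode: {mode}")

# Toy example: one ROI pixel, two feature channels.
features = np.ones((4, 4, 2))
mask = np.zeros((4, 4))
mask[1, 1] = 1
kernels = np.stack([np.ones((3, 3)), np.zeros((3, 3))])  # hypothetical weights
bias = np.zeros(2)
fused_add = fuse(features, mask, kernels, bias, mode="add")
fused_mul = fuse(features, mask, kernels, bias, mode="mul")
```

Additive fusion leaves features outside the ROI intact (background context is preserved), whereas multiplicative fusion with a zero attention value suppresses them, which is closer in effect to hard blackout.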

5. Compositional Models and Robust Inference under Occlusion

OCCAM#1 also encompasses composite systems in which explicit part masks, learned from internal CNN features, are used for robust classification—especially under partial occlusion:

Compositional model branch: After standard DCNN training, internal feature descriptors for all spatial locations are clustered, yielding $K$ part prototypes $d_k$. Detection maps $b_{p,k}$ highlight presence of each part at location $p$ via cosine similarity thresholding, forming binary activation maps $B\in \{0,1\}^{H\times W\times K}$. For each class $y$, Bernoulli probabilities $\alpha_{p,k,y}$ model expected part distributions, optionally as mixtures over $M$ viewpoint clusters.
Occlusion-aware scoring: At test time, visibility variables $z_p$ select between object and background generative models at each spatial location. Classification proceeds via either the DCNN (if confident; $p(\hat y_{dcnn}|I;W) \geq \tau$) or the compositional model, which ignores occluded regions using $z_p$ and maximizes log-likelihood over part configurations. This hybrid yields strong accuracy on unoccluded data, with far better generalization to novel occlusions even without exposure to corrupted images during training (Kortylewski et al., 2019).
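The two steps above can be illustrated with a simplified NumPy sketch. It assumes a single viewpoint cluster (no mixtures), a position-independent Bernoulli background model `beta`, and a per-location maximization over $z_p$; the threshold values and all function names are hypothetical and not taken from the cited work.

```python
import numpy as np

def detect_parts(features, prototypes, threshold=0.5):
    """Cosine-similarity part detection: (H, W, D) x (K, D) -> binary (H, W, K)."""
    f = features / (np.linalg.norm(features, axis=-1, keepdims=True) + 1e-12)
    d = prototypes / (np.linalg.norm(prototypes, axis=-1, keepdims=True) + 1e-12)
    return (f @ d.T > threshold).astype(np.uint8)

def bernoulli_loglik(b, alpha):
    """Per-location Bernoulli log-likelihood, summed over the K parts."""
    b = b.astype(float)
    alpha = np.clip(alpha, 1e-6, 1 - 1e-6)
    return (b * np.log(alpha) + (1 - b) * np.log(1 - alpha)).sum(axis=-1)

def occlusion_aware_score(b, alpha_y, beta):
    """Class score where each location picks the better of object/background."""
    obj = bernoulli_loglik(b, alpha_y)       # object model, shape (H, W)
    bg = bernoulli_loglik(b, beta)           # background model, shape (H, W)
    visible = obj >= bg                      # z_p: True where the object explains the data
    return np.where(visible, obj, bg).sum(), visible

def hybrid_predict(dcnn_probs, b, alphas, beta, tau=0.9):
    """Use the DCNN if it is confident, otherwise the compositional model."""
    if dcnn_probs.max() >= tau:
        return int(dcnn_probs.argmax())
    scores = [occlusion_aware_score(b, a, beta)[0] for a in alphas]
    return int(np.argmax(scores))

# Toy example: 1x3 feature map, K=2 parts, 2 classes.
prototypes = np.array([[1.0, 0.0], [0.0, 1.0]])
features = np.array([[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]])
b = detect_parts(features, prototypes)
alphas = np.stack([np.tile([0.9, 0.1], (1, 3, 1)),    # class 0 expects part 0
                   np.tile([0.1, 0.9], (1, 3, 1))])   # class 1 expects part 1
beta = np.array([0.5, 0.5])
score0, visible0 = occlusion_aware_score(b, alphas[0], beta)
score1, _ = occlusion_aware_score(b, alphas[1], beta)
pred_uncertain = hybrid_predict(np.array([0.4, 0.6]), b, alphas, beta)
pred_confident = hybrid_predict(np.array([0.05, 0.95]), b, alphas, beta)
```

In the toy case, the third location contradicts class 0 and is therefore treated as occluded for that class, so it contributes only the background likelihood instead of penalizing the class score.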

6. Mask Classification, Region-Level Decisions, and Outlier Handling

OCCAM#1 generalizes to mask-classification architectures, as in Mask2Former and RbA scoring (Nayal et al., 2022):

Region-wise mask classification: A Transformer decoder maintains $N$ learnable “object queries.” Each predicts a binary mask $M_n(x)$ and class logits $p_n(x)$. Each query empirically behaves as a one-vs-all classifier for its specialist class.
Outlier detection (RbA scoring): For each pixel or region $x$,

$RbA(x) = -\sum_{k=1}^K \sigma(L_k(x))$

where $L_k(x)$ is the logit for known class $k$ and $\sigma$ is the tanh function. The score quantifies the degree to which a region is rejected by all known classes ("rejected by all"), thus identifying unknown objects or regions.
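The RbA formula translates directly into a few lines of NumPy. The sketch below assumes the per-pixel class logits are already computed and stored with the class dimension last; how those logits are aggregated from the object queries is not modeled here, and the example values are made up.

```python
import numpy as np

def rba_score(class_logits):
    """RbA(x) = -sum_k tanh(L_k(x)) for logits of shape (..., K)."""
    return -np.tanh(class_logits).sum(axis=-1)

# Toy example: two pixels, K=3 known classes.
logits = np.array([
    [4.0, 0.0, 0.0],   # pixel 0: strongly claimed by class 0
    [0.0, 0.0, 0.0],   # pixel 1: claimed by no class
])
scores = rba_score(logits)
```

A pixel that no known class claims receives a higher score than one with a strong class response, so thresholding the score yields an outlier map.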

Region-level benefits: Empirical evaluation shows that region-level mask classification (vs. per-pixel) yields smoother, more object-coherent outlier maps, better false-positive reduction at boundaries, and supports efficient region-level open-vocabulary expansion and outlier fine-tuning.
Selective fine-tuning: Only the final mask/class heads are updated with a squared-hinge loss on synthetic outlier regions, preserving closed-set segmentation accuracy and improving open-set detection.

Method / Component	Main Property	Region-Level Impact
Mask2Former+RbA	One-vs-all queries, analytic OoD scoring	Superior object-centric maps
Fine-tune mask head	Loss on synthetic outliers, rest frozen	Preserves inlier accuracy
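One possible form of the squared-hinge objective used for selective fine-tuning is sketched below. The exact loss in the cited work may differ: the one-sided form, the `margin` parameter, and the function name are assumptions made for illustration. The idea is to penalize synthetic outlier pixels whose RbA score falls below a margin.

```python
import numpy as np

def outlier_squared_hinge(rba, outlier_mask, margin=0.0):
    """Mean of max(0, margin - RbA)^2 over pixels flagged as outliers.

    rba: array of RbA scores; outlier_mask: boolean array of the same shape.
    margin is a hypothetical hyperparameter.
    """
    if not outlier_mask.any():
        return 0.0
    hinge = np.maximum(0.0, margin - rba[outlier_mask])
    return float((hinge ** 2).mean())

# Toy example: three pixels, the last two are synthetic outliers.
rba = np.array([-1.0, -0.2, 0.5])
outlier_mask = np.array([False, True, True])
loss = outlier_squared_hinge(rba, outlier_mask)
```

Only the outlier pixel with a score below the margin contributes to the loss; in training, gradients from this term would reach only the final mask and class heads, with the rest of the network frozen.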

7. Comparative Performance and Empirical Findings

Across benchmarks, OCCAM#1 demonstrates substantial gains in robust classification and object discovery:

OOD Generalization: OCCAM#1 using HQES masks achieves 68.0% on ImageNet-D (background) vs. 23.5% for CLIP; 100% on UrbanCars (ResNet50 CLIP baseline: 64.8%; slot-based CoBalT: 80.0%); 95.2% on ImageNet-9 (CLIP: 91.9%); and 96.0% worst-group accuracy (WGA) on Waterbirds.
Unsupervised Discovery: HQES yields 79.3% FG-ARI on MOVi-C vs. 66.9% for SlotDiffusion.
Ablative insights: Prompt-driven dynamic masking augments static networks (+9.3-16.1% accuracy gain), while region-level mask classification (RbA) enhances OOD and boundary performance over pixelwise methods.

8. Open Challenges and Frontiers

Current OCCAM#1 pipelines leave several research challenges open:

Foreground mask selection: Determining which mask corresponds to the target object remains a bottleneck. Existing scoring heuristics (ensemble entropy, class-aided) are not fully satisfactory for realistic scenarios.
Dynamic and multimodal cues: Current pipelines rely on static appearance segmentation; integrating motion, depth, or cross-modal cues is an open problem.
Benchmarks and objective evaluation: New downstream tasks are needed to benchmark object-centric reasoning beyond unsupervised segmentation metrics.
Extension to open-world and compositional settings: OCCAM#1 provides a principled basis for scalable region-level incremental learning, where new object categories can be added as new queries or masks.

References

"Are We Done with Object-Centric Learning?" (Rubinstein et al., 9 Apr 2025)
"Prompt-Driven Dynamic Object-Centric Learning for Single Domain Generalization" (Li et al., 2024)
"Classifying a specific image region using convolutional nets with an ROI mask as input" (Eppel, 2018)
"Combining Compositional Models and Deep Networks For Robust Object Classification under Occlusion" (Kortylewski et al., 2019)
"RbA: Segmenting Unknown Regions Rejected by All" (Nayal et al., 2022)