Alterbute: Diffusion-based Object Attribute Editing
- Alterbute is a diffusion-based framework that enables fine-grained editing of intrinsic object attributes, including color, texture, material, and shape.
- It employs a novel four-way conditioning scheme using identity, text prompts, background, and masks to precisely guide the diffusion process.
- Performance evaluations demonstrate state-of-the-art attribute alignment (highest CLIP-T) alongside strong identity preservation, validated by metrics including CLIP-T, DINO, and CLIP-I.
Alterbute is a unified, diffusion-based framework for editing the intrinsic attributes of a selected object in a real image—including color, texture, material, and shape—while rigorously preserving both object identity and all extrinsic scene factors. It is distinguished by its combination of a flexible training objective, a novel deployment of Visual Named Entities (VNEs) for scalable, fine-grained supervision, and an architecture that conditions the diffusion process on identity, textual attribute prompts, background, and mask information. Alterbute advances the state of the art in identity-preserving attribute editing, enabling, for the first time, scalable, high-fidelity modification of intrinsic object properties in the wild (Reiss et al., 15 Jan 2026).
1. Formulation of the Intrinsic Attribute Editing Problem
Alterbute formalizes the task of editing an object's intrinsic attributes as transforming an input image $x$, which depicts a physical scene and a target object of identity $o$, into a new image $\hat{x}$ where only the object's intrinsic attributes (color, texture, material, shape) have changed, while preserving both the object's perceived identity and all extrinsic scene factors (background, lighting, camera pose). This is abstractly expressed via an (idealized) renderer $R$ as follows:

$$x = R(o, A, E)$$

Here, $A$ denotes the set of intrinsic attributes and $E$ the extrinsic factors. The ideal operator for intrinsic-attribute editing, given a prompt $p$ describing new attributes $A'$, yields

$$\hat{x} = R(o, A', E)$$

Since $R$ and the underlying factors $(A, E)$ are unknown for real images, Alterbute learns a parametric diffusion model to approximate this mapping.
2. Diffusion-Based Architecture and Conditioning Scheme
Alterbute leverages a pretrained latent diffusion UNet backbone (SDXL, $7$B parameters), fine-tuned to support a four-way conditioning protocol:
- Identity reference image ($x_{\text{id}}$): Contains only the object foreground (background zeroed via the mask).
- Text prompt ($p$): Encodes the desired intrinsic attributes (e.g., `{color: red, material: wood}`).
- Background image ($x_{\text{bg}}$): Formed by masking out the object in $x$ and filling with a constant gray to fix the extrinsic context.
- Binary object mask ($m$): Indicates the spatial support of the object.
During training, the noisy target latent $z_t$ and the encoded identity reference $z_{\text{id}}$ are tiled side-by-side into a two-panel image grid, enabling self-attention to propagate identity features from the reference into the target. The background and mask conditions are concatenated along the channel dimension, while the text prompt $p$ is injected via cross-attention. The input at diffusion step $t$ is

$$x^{\text{in}}_t = \big[\,z_t \,\|\, z_{\text{id}}\,\big] \oplus c_{\text{bg}} \oplus m,$$

where $[\cdot \,\|\, \cdot]$ denotes the side-by-side tiling, $\oplus$ channel-wise concatenation, and $c_{\text{bg}}$ the encoded background.
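A minimal sketch of this four-way conditioning, using plain PyTorch and a hypothetical `unet` callable; the exact latent shapes, encoders, and argument names are assumptions rather than the paper's released interface:

```python
import torch

def build_conditioned_input(z_t, z_id, z_bg, mask, t, text_emb, unet):
    """Assemble the four-way conditioned UNet input (illustrative sketch).

    z_t      : noisy target latent             (B, C, H, W)
    z_id     : encoded identity reference      (B, C, H, W)
    z_bg     : encoded gray-filled background  (B, C, H, W)
    mask     : binary object mask, latent res. (B, 1, H, W)
    text_emb : attribute-prompt embeddings, injected via cross-attention
    """
    # Tile target and identity latents side-by-side so self-attention can
    # propagate identity features from the reference half into the target half.
    grid = torch.cat([z_t, z_id], dim=-1)                        # (B, C, H, 2W)

    # Background and mask are conditioning-only; pad their identity half with
    # zeros and concatenate everything along the channel dimension.
    bg_grid = torch.cat([z_bg, torch.zeros_like(z_bg)], dim=-1)
    mask_grid = torch.cat([mask, torch.zeros_like(mask)], dim=-1)
    x_in = torch.cat([grid, bg_grid, mask_grid], dim=1)          # channel concat

    return unet(x_in, t, encoder_hidden_states=text_emb)         # text via cross-attn
```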
3. Relaxed Training Objective and Data Pipeline
Supervised pairs that differ *only* in intrinsic attributes while keeping extrinsic factors fixed are rare at scale. Alterbute circumvents this by adopting a “relaxed” objective, allowing changes in both intrinsic and extrinsic properties during training. Triplets $(x_{\text{tgt}}, x_{\text{id}}, p)$ are mined from VNE clusters (see Section 5) such that $x_{\text{tgt}}$ and $x_{\text{id}}$ share the same VNE identity but may differ in background, lighting, or pose. The model is trained to reconstruct $x_{\text{tgt}}$ conditioned on $(x_{\text{id}}, p, x_{\text{bg}}, m)$, using an $L_2$ diffusion denoising loss applied only to the target half of the tiled grid:

$$\mathcal{L} = \mathbb{E}_{t,\epsilon}\Big[\, \big\| m_{\text{tgt}} \odot \big(\epsilon - \epsilon_\theta(x^{\text{in}}_t, t, p)\big) \big\|_2^2 \,\Big],$$

where $m_{\text{tgt}}$ selects the target half of the grid. This encourages the model to use the prompt $p$ to alter intrinsic attributes, the background $x_{\text{bg}}$ and mask $m$ to set extrinsic factors, and the identity reference $x_{\text{id}}$ to preserve identity.
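A sketch of the relaxed objective's loss, assuming a standard $\epsilon$-prediction formulation and that the identity reference occupies the right half of the tiled grid; the helper below is illustrative, not the paper's code:

```python
import torch
import torch.nn.functional as F

def denoising_loss_target_half(eps_pred, eps_true):
    """L2 denoising loss applied only to the target (left) half of the grid.

    eps_pred, eps_true : (B, C, H, 2W) noise prediction / ground-truth noise
    over the tiled grid; the left W columns correspond to the target image,
    the right W columns to the identity reference, which is not supervised.
    """
    W = eps_pred.shape[-1] // 2
    return F.mse_loss(eps_pred[..., :W], eps_true[..., :W])
```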
4. Inference-time Extrinsic Factor Constraint
At inference, the objective is to edit the input image $x$ only along the intrinsic attribute directions specified by the prompt $p$, without changing the extrinsic context $E$. This is achieved by extracting the mask $m$ via a pretrained segmenter, forming the background $x_{\text{bg}}$ from $x$ by masking out the object and filling with gray, and setting the identity reference $x_{\text{id}}$ by cropping and masking the original object. The diffusion process then operates with these fixed inputs for all timesteps, mathematically pinning down extrinsic factors: the UNet receives no alternate background or mask and can thus only alter the degrees of freedom described in $p$ and any intrinsic defaults inherited from $x_{\text{id}}$.
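An illustrative inference loop under these fixed conditions; the `scheduler` follows a diffusers-style interface and, together with the encoders, is a hypothetical stand-in rather than the actual implementation:

```python
import torch

@torch.no_grad()
def edit_intrinsics(unet, scheduler, z_id, z_bg, mask, text_emb, steps=50):
    """Sample an edited latent with extrinsics pinned by (z_bg, mask).

    The background latent and mask are held fixed at every timestep, so the
    only remaining degrees of freedom are the intrinsic attributes described
    by the text prompt and the identity carried by z_id.
    """
    z_t = torch.randn_like(z_id)             # target half starts as pure noise
    scheduler.set_timesteps(steps)
    for t in scheduler.timesteps:
        grid = torch.cat([z_t, z_id], dim=-1)
        bg_grid = torch.cat([z_bg, torch.zeros_like(z_bg)], dim=-1)
        m_grid = torch.cat([mask, torch.zeros_like(mask)], dim=-1)
        x_in = torch.cat([grid, bg_grid, m_grid], dim=1)

        eps = unet(x_in, t, encoder_hidden_states=text_emb)
        W = z_t.shape[-1]
        z_t = scheduler.step(eps[..., :W], t, z_t).prev_sample  # update target half only
    return z_t
```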
5. Visual Named Entities (VNEs) and Scalable Identity Supervision
A critical challenge for intrinsic attribute editing is the precise and scalable definition of object identity—fine-grained enough to discriminate subtle shape and material features, yet tolerant of intrinsic variation. Alterbute introduces the Visual Named Entity (VNE): a fine-grained, textually expressible category (e.g., “Porsche 911 Carrera,” “IKEA LACK table”) signifying shared identity-defining features across real images.
VNEs are mined by processing all $16M$ object crops in OpenImages v4 using Gemini 2.0 Flash, a vision–LLM, to infer candidate VNE labels. Only “High”-confidence assignments are retained, yielding approximately $1.5M$ labeled objects grouped into $69,744$ non-singleton clusters (each a VNE category). Cluster members naturally vary in intrinsic attributes and extrinsics, supporting identity-preserving edits. For each object, Gemini provides intrinsic attributes in structured JSON form, supplying the training prompts $p$.
The automated pipeline thereby generates, at scale and without manual labeling: identity references, attribute prompts, backgrounds and masks, and target images across identities.
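A schematic of the VNE mining loop described above; `query_vlm` and its JSON schema are hypothetical placeholders for the Gemini 2.0 Flash labeling step, while the confidence filtering and singleton removal mirror the pipeline's stated rules:

```python
import json
from collections import defaultdict

def mine_vne_clusters(object_crops, query_vlm):
    """Group object crops into VNE clusters using a vision-LLM labeler (sketch).

    query_vlm(crop) is assumed to return a JSON string such as
      {"vne": "Porsche 911 Carrera", "confidence": "High",
       "attributes": {"color": "red", "material": "metal"}}
    """
    clusters = defaultdict(list)
    for crop in object_crops:
        label = json.loads(query_vlm(crop))
        if label.get("confidence") != "High":
            continue                          # keep only high-confidence VNE labels
        clusters[label["vne"]].append((crop, label["attributes"]))
    # Drop singleton clusters: identity-preserving supervision needs >= 2 members.
    return {vne: items for vne, items in clusters.items() if len(items) > 1}
```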
6. Evaluation and Comparative Performance
Alterbute is evaluated on $30$ held-out objects and $100$ test cases spanning both standard and under-represented categories. Baselines include general-purpose editors (FlowEdit, InstructPix2Pix, OmniGen, UltraEdit, Diptych) and attribute-specific methods (MimicBrush, MaterialFusion).
User studies show Alterbute is preferred over every baseline in the majority of pairwise comparisons. Automated vision–LLM judges (Gemini, GPT-4o, Claude 3.7) independently confirm these outcomes with statistically significant agreement.
Conventional metrics are summarized as follows:
| Method | DINO ↑ | CLIP-I ↑ | CLIP-T ↑ |
|---|---|---|---|
| FlowEdit | 0.813 | 0.900 | 0.294 |
| InstructPix2Pix | 0.772 | 0.877 | 0.302 |
| OmniGen | 0.823 | 0.912 | 0.305 |
| UltraEdit | 0.841 | 0.922 | 0.303 |
| Diptych | 0.794 | 0.901 | 0.313 |
| Ours (Alterbute) | 0.815 | 0.914 | 0.321 |
Alterbute achieves state-of-the-art attribute alignment (highest CLIP-T) without a significant loss in identity (CLIP-I, DINO). Qualitative analysis demonstrates reliable shape edits and multi-attribute transformations beyond the reach of previous methods. It is noted that high scores on CLIP-I and DINO can sometimes be misleading, as identity-preserving no-ops also rate highly.
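CLIP-I and CLIP-T are cosine similarities in CLIP embedding space (image–image and image–text, respectively), with DINO computed analogously from a DINO ViT backbone. A sketch using the Hugging Face `transformers` CLIP implementation; the checkpoint choice is an assumption rather than the paper's stated protocol:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_scores(edited_image, reference_image, attribute_prompt):
    """Return (CLIP-I, CLIP-T): edited-vs-reference image similarity and
    edited-image-vs-attribute-prompt similarity."""
    inputs = processor(text=[attribute_prompt],
                       images=[edited_image, reference_image],
                       return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    clip_i = (img[0] @ img[1]).item()   # edited vs. reference image
    clip_t = (img[0] @ txt[0]).item()   # edited image vs. attribute prompt
    return clip_i, clip_t
```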
7. Implementation Protocols and Reproducibility
Key implementation details include:
- Diffusion backbone: SDXL text-to-image, $7$B parameters.
- Mask extraction: SAM 2.0 segmentation (Ravi et al. 2024).
- Training regimen: $100$k steps, batch size $128$, side-by-side target/identity image grid, on 128 TPU-v4 cores (approximately 24 hours).
- Classifier-free guidance: Scale $7.5$ for text, $2.0$ for image (see the sketch after this list).
- VNE mining: OpenImages v4 and Gemini 2.0 Flash filtered for “High”-confidence, discarding singleton clusters.
- Intrinsic attribute prompts: Structured JSON from Gemini.
- Random condition drops: Individual conditioning inputs are randomly dropped during a fraction of training steps to support scene- and identity-conditional learning.
- Segmentation granularity: Training uses both precise segmentation masks and coarse bounding boxes (for shape-edit robustness).
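A minimal sketch of how the two guidance scales above could be combined at sampling time, following the common multi-condition classifier-free guidance recipe; the decomposition into unconditional, image-conditioned, and fully conditioned passes is an assumption based on the reported scales:

```python
def dual_cfg(eps_uncond, eps_img, eps_full, s_img=2.0, s_text=7.5):
    """Combine noise predictions from three UNet passes (sketch):
    eps_uncond : no conditioning
    eps_img    : image (identity/background/mask) conditioning only
    eps_full   : image + text conditioning
    """
    return (eps_uncond
            + s_img * (eps_img - eps_uncond)
            + s_text * (eps_full - eps_img))
```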
This configuration enables Alterbute to generalize edit capabilities while maintaining robustness to variation in segmentation precision, attribute specification, and identity reference.
Alterbute establishes a new paradigm for fine-grained, identity-preserving editing of object intrinsic attributes at scale, facilitated by automated VNE-based supervision and a principled diffusion modeling approach (Reiss et al., 15 Jan 2026).