Alterbute: Diffusion-based Object Attribute Editing

Updated 16 January 2026
  • Alterbute is a diffusion-based framework that enables fine-grained editing of intrinsic object attributes, including color, texture, material, and shape.
  • It employs a novel four-way conditioning scheme using identity, text prompts, background, and masks to precisely guide the diffusion process.
  • Performance evaluations demonstrate state-of-the-art identity preservation and attribute alignment, validated by metrics like CLIP-T, DINO, and CLIP-I.

Alterbute is a unified, diffusion-based framework for editing the intrinsic attributes of a selected object in a real image—including color, texture, material, and shape—while rigorously preserving both object identity and all extrinsic scene factors. It is distinguished by the combination of a flexible training objective, a novel use of Visual Named Entities (VNEs) for scalable, fine-grained supervision, and an architecture that conditions the diffusion process on identity, textual attribute prompts, background, and mask information. Alterbute advances the state of the art in identity-preserving attribute editing, enabling, for the first time, scalable, high-fidelity modification of intrinsic object properties in the wild (Reiss et al., 15 Jan 2026).

1. Formulation of the Intrinsic Attribute Editing Problem

Alterbute formalizes the task of editing an object's intrinsic attributes as transforming an input image $y \in \mathbb{R}^{H \times W \times 3}$, which depicts a physical scene $s$ and a target object $o$ of identity $\mathrm{id}$, into a new image $y'$ in which only the object's intrinsic attributes (color, texture, material, shape) have changed, while both the object's perceived identity and all extrinsic scene factors (background, lighting, camera pose) are preserved. This is expressed abstractly via an (idealized) renderer $G_\mathrm{physics}$ as follows:

$$y = G_\mathrm{physics}(o, s), \quad o = O(\mathrm{id}, a_\mathrm{int}, s)$$

Here, $a_\mathrm{int}$ denotes the set of intrinsic attributes. The ideal operator for intrinsic-attribute editing, given a prompt $p$ describing new attributes $a'_\mathrm{int}$, yields

$$y' = G_\mathrm{physics}(O(\mathrm{id}, a'_\mathrm{int}, s), s)$$

Since $G_\mathrm{physics}$ and $O$ are unknown, Alterbute learns a parametric diffusion model $D_\theta$ to approximate this mapping.
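Viewed as an input/output contract, the learned approximation takes the original image, an object mask, and an attribute prompt, and returns the edited image. A minimal interface sketch (all names are illustrative, not from the paper's released code):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class EditRequest:
    """Inputs to the learned intrinsic-attribute editor D_theta."""
    image: np.ndarray        # y, H x W x 3 source photograph
    object_mask: np.ndarray  # H x W binary mask selecting the target object o
    prompt: str              # p, text describing the new intrinsic attributes a'_int

def edit_intrinsic_attributes(request: EditRequest, denoiser) -> np.ndarray:
    """Approximates y' = G_physics(O(id, a'_int, s), s).

    `denoiser` stands in for the fine-tuned diffusion model D_theta; how its
    conditioning inputs are built and sampled is detailed in Sections 2-4.
    """
    raise NotImplementedError("See the inference sketch in Section 4.")
```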

2. Diffusion-Based Architecture and Conditioning Scheme

Alterbute leverages a pretrained latent diffusion UNet backbone (SDXL, $7$B parameters), fine-tuned to support a four-way conditioning protocol (the construction of these inputs is sketched after the list):

  • Identity reference image ($\mathrm{id}$): Contains only the object foreground (background zeroed via the mask).
  • Text prompt ($p$): Encodes the desired intrinsic attributes (e.g., {color: red, material: wood}).
  • Background image ($\mathrm{bg}$): Formed by masking out the object in $y$ and filling the hole with constant gray, fixing the extrinsic context.
  • Binary object mask ($m$): Indicates the spatial support of the object.
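
All four conditions can be derived from the source image, the object mask, and the prompt. A minimal NumPy sketch of this construction, assuming a mid-gray fill value (the exact constant is not specified here):

```python
import numpy as np

def build_conditioning(y: np.ndarray, m: np.ndarray, prompt: str) -> dict:
    """Construct the four conditioning inputs (id, p, bg, m) from image y and mask m.

    y: H x W x 3 float image in [0, 1]; m: H x W binary object mask.
    """
    m3 = m[..., None].astype(y.dtype)        # broadcastable H x W x 1 mask
    id_ref = y * m3                          # identity reference: background zeroed out
    gray = np.full_like(y, 0.5)              # constant gray fill (assumed value)
    bg = y * (1.0 - m3) + gray * m3          # background image: object region grayed out
    return {"id": id_ref, "p": prompt, "bg": bg, "m": m}
```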

During training, the noisy target latent and the identity reference are tiled side by side into an image grid of size $H \times 2W$, enabling identity features to propagate through self-attention. The conditioning inputs $(\mathrm{id}, \mathrm{bg}, m)$ are concatenated along the channel dimension, while $p$ is injected via cross-attention. The input at diffusion step $\tau$ is

$$x_\tau = \alpha_\tau y + \sigma_\tau \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
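
A short PyTorch-style sketch of this input assembly, assuming the identity half of the grid is kept clean while only the target latent is noised (shapes and the downstream denoiser wiring are simplified):

```python
import torch

def assemble_training_input(y, id_ref, alpha_tau, sigma_tau):
    """Noise the target latent and tile it next to the clean identity reference.

    y, id_ref: (B, C, H, W) latents; alpha_tau, sigma_tau: scalar schedule coefficients.
    Returns a (B, C, H, 2W) grid [noisy target | identity reference] and the noise.
    """
    eps = torch.randn_like(y)                  # epsilon ~ N(0, I)
    x_tau = alpha_tau * y + sigma_tau * eps    # x_tau = alpha_tau * y + sigma_tau * eps
    grid = torch.cat([x_tau, id_ref], dim=-1)  # side-by-side tiling along width -> H x 2W
    return grid, eps
```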

3. Relaxed Training Objective and Data Pipeline

Supervised pairs that differ only in intrinsic attributes, with all extrinsic factors held fixed, are rare at scale. Alterbute circumvents this by adopting a “relaxed” objective, allowing changes in both intrinsic and extrinsic properties during training. Training tuples $(y_i, \mathrm{id}_i, s_i, p_i)$ are mined from VNE clusters (see Section 5) such that $y_i$ and $\mathrm{id}_i$ share the same VNE identity but may differ in background, lighting, or pose. The model is trained to reconstruct $y_i$ conditioned on $(\mathrm{id}_i, s_i, p_i)$, using an $L_2$ diffusion denoising loss applied only to the target half of the grid:

$$\mathcal{L}(\theta) = \mathbb{E}_{i,\tau,\epsilon} \left\| D_\theta\big(\alpha_\tau y_i + \sigma_\tau \epsilon,\ \mathrm{id}_i,\ p_i,\ (\mathrm{bg}_i, m_i),\ \tau\big) - \epsilon \right\|^2$$

This encourages the model to use $p_i$ to alter intrinsic attributes, $(\mathrm{bg}_i, m_i)$ to set extrinsic attributes, and $\mathrm{id}_i$ to preserve identity.
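
In code, the relaxed objective is a standard epsilon-prediction loss evaluated on the target half of the tiled grid. A schematic training step, in which the denoiser call signature and the exact wiring of the spatial conditions are assumptions rather than the authors' API:

```python
import torch
import torch.nn.functional as F

def relaxed_training_step(denoiser, y, id_ref, bg, m, prompt_emb, alpha_tau, sigma_tau):
    """One step of the relaxed objective: reconstruct y_i from (id_i, bg_i, m_i, p_i)."""
    eps = torch.randn_like(y)
    x_tau = alpha_tau * y + sigma_tau * eps       # noise only the target latent
    grid = torch.cat([x_tau, id_ref], dim=-1)     # H x 2W tiling with the clean identity half
    spatial_cond = torch.cat([bg, m], dim=1)      # channel-wise spatial conditions (assumed layout)
    eps_pred = denoiser(grid, spatial_cond=spatial_cond, text=prompt_emb)
    w = eps_pred.shape[-1] // 2
    # L2 denoising loss applied only to the target (left) half of the grid.
    return F.mse_loss(eps_pred[..., :w], eps)
```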

4. Inference-time Extrinsic Factor Constraint

At inference, the objective is to edit image $y$ only along the intrinsic attribute directions specified by $p_\text{test}$, without changing the extrinsic context $s_\text{test} = (\mathrm{bg}_\text{test}, m_\text{test})$. This is achieved by extracting $m_\text{test}$ via a pretrained segmenter, forming $\mathrm{bg}_\text{test}$ from $y$, and setting $\mathrm{id}_\text{test}$ by cropping and masking the original object. The diffusion process then operates with these fixed inputs at all timesteps, mathematically pinning down the extrinsic factors: the UNet receives no alternate background or mask and can thus only alter the degrees of freedom described by $p_\text{test}$ and any intrinsic defaults defined by $\mathrm{id}_\text{test}$.
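
A high-level sketch of this inference procedure. The `segmenter` and `sampler` arguments are placeholders for the pretrained segmentation model and the conditional diffusion sampling loop, which are not spelled out here:

```python
import numpy as np

def edit_image(y, prompt, segmenter, sampler, num_steps=50):
    """Edit intrinsic attributes of the object in y per `prompt`, keeping extrinsics fixed."""
    m = segmenter(y)                        # m_test: binary mask of the target object
    m3 = m[..., None].astype(y.dtype)
    id_ref = y * m3                         # id_test: original object, background zeroed
    bg = y * (1.0 - m3) + 0.5 * m3          # bg_test: object region filled with gray (assumed value)
    # The conditions stay fixed at every timestep, so only the intrinsic degrees
    # of freedom named in the prompt can change.
    return sampler(id_ref=id_ref, bg=bg, mask=m, prompt=prompt, steps=num_steps)
```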

5. Visual Named Entities (VNEs) and Scalable Identity Supervision

A critical challenge for intrinsic attribute editing is the precise and scalable definition of object identity—fine-grained enough for discriminating subtle shape/material features but tolerant to intrinsic variations. Alterbute introduces the Visual Named Entity (VNE): a fine-grained, textually expressible category (e.g., “Porsche 911 Carrera,” “IKEA LACK table”) signifying shared identity-defining features across real images.

VNEs are mined by processing all $16M$ object crops in OpenImages v4 with Gemini 2.0 Flash, a vision–LLM, to infer candidate VNE labels. Only “High”-confidence assignments are retained, yielding approximately $1.5M$ labeled objects grouped into $69,744$ non-singleton clusters (each a VNE category). Cluster members naturally vary in intrinsic attributes and extrinsics, supporting identity-preserving edits. For each object, Gemini provides intrinsic attributes in structured JSON form, supplying the training prompts $p_i$.

The automated pipeline thereby generates, at scale and without manual labeling, identity references, attribute prompts, backgrounds, masks, and target images across $\sim$70K identities.
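
The mining pipeline reduces to label, filter, and cluster. A schematic sketch in which `label_with_vlm` abstracts the Gemini query; its output schema is an assumption based on the description above:

```python
from collections import defaultdict

def mine_vne_clusters(object_crops, label_with_vlm):
    """Group object crops by Visual Named Entity, keeping confident, non-singleton clusters.

    `label_with_vlm(crop)` is a placeholder for the vision-LLM call; it is assumed to
    return a dict such as {"vne": "Porsche 911 Carrera", "confidence": "High",
    "attributes": {"color": "red", "material": "metal"}}.
    """
    clusters = defaultdict(list)
    for crop in object_crops:
        label = label_with_vlm(crop)
        if label["confidence"] != "High":   # keep only High-confidence assignments
            continue
        clusters[label["vne"]].append((crop, label["attributes"]))
    # Discard singleton clusters: identity-preserving training pairs need >= 2 members.
    return {vne: members for vne, members in clusters.items() if len(members) > 1}
```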

6. Evaluation and Comparative Performance

Alterbute is evaluated on $30$ held-out objects and $100$ test cases spanning both standard and under-represented categories. Baselines include general-purpose editors (FlowEdit, InstructPix2Pix, OmniGen, UltraEdit, Diptych) and attribute-specific methods (MimicBrush, MaterialFusion).

User studies show Alterbute is preferred over every baseline in at least $76\%$ of comparisons. Automated vision–LLM judges (Gemini, GPT-4o, Claude 3.7) independently confirm these outcomes with statistically significant agreement ($p < 10^{-10}$).

Conventional metrics are summarized as follows:

Method            DINO ↑   CLIP-I ↑   CLIP-T ↑
FlowEdit          0.813    0.900      0.294
InstructPix2Pix   0.772    0.877      0.302
OmniGen           0.823    0.912      0.305
UltraEdit         0.841    0.922      0.303
Diptych           0.794    0.901      0.313
Ours (Alterbute)  0.815    0.914      0.321

Alterbute achieves state-of-the-art attribute alignment (highest CLIP-T) without a significant loss in identity (CLIP-I, DINO). Qualitative analysis demonstrates reliable shape edits and multi-attribute transformations beyond the reach of previous methods. It is noted that high scores on CLIP-I and DINO can sometimes be misleading, as identity-preserving no-ops also rate highly.
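
For context, CLIP-I and CLIP-T are cosine similarities in CLIP embedding space (edited image vs. reference image, and edited image vs. attribute prompt, respectively), while DINO compares self-supervised image features. A hedged sketch of the CLIP-based scores using the Hugging Face CLIP implementation; the model choice and preprocessing are illustrative and the paper's exact evaluation protocol may differ:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_scores(edited_image, reference_image, prompt):
    """Compute CLIP-I (image-image) and CLIP-T (image-text) cosine similarities."""
    with torch.no_grad():
        imgs = processor(images=[edited_image, reference_image], return_tensors="pt")
        img_feats = model.get_image_features(**imgs)
        txt = processor(text=[prompt], return_tensors="pt", padding=True)
        txt_feats = model.get_text_features(**txt)
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    clip_i = float(img_feats[0] @ img_feats[1])   # edited vs. reference image
    clip_t = float(img_feats[0] @ txt_feats[0])   # edited image vs. attribute prompt
    return clip_i, clip_t
```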

7. Implementation Protocols and Reproducibility

Key implementation details include:

  • Diffusion backbone: SDXL text-to-image, $7$B parameters.
  • Mask extraction: SAM 2.0 segmentation (Ravi et al. 2024).
  • Training regimen: $100$k steps, learning rate $1\times10^{-5}$, batch size $128$, image grid $512 \times 1024$, on 128 TPU-v4 cores (24 hours).
  • Classifier-free guidance: Scale $7.5$ for text, $2.0$ for image.
  • VNE mining: OpenImages v4 and Gemini 2.0 Flash filtered for “High”-confidence, discarding singleton clusters.
  • Intrinsic attribute prompts: Structured JSON from Gemini.
  • Random drops: $10\%$ of steps drop $\mathrm{id}_i$, $10\%$ drop $p_i$, to support scene- and identity-conditional learning.
  • Segmentation granularity: $50\%$ precise segmentation masks, $50\%$ coarse bounding boxes (for shape-edit robustness).

This configuration enables Alterbute to generalize edit capabilities while maintaining robustness to variation in segmentation precision, attribute specification, and identity reference.
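
For reference, the listed hyperparameters can be collected into one configuration object. A plain Python summary, with values copied from the list above and field names that are merely illustrative:

```python
TRAINING_CONFIG = {
    "backbone": "SDXL text-to-image UNet",
    "steps": 100_000,
    "learning_rate": 1e-5,
    "batch_size": 128,
    "grid_size": (512, 1024),          # H x 2W target/identity tiling
    "hardware": "128 TPU-v4 cores, ~24 hours",
    "cfg_scale_text": 7.5,
    "cfg_scale_image": 2.0,
    "drop_identity_prob": 0.10,        # drop id_i on 10% of training steps
    "drop_prompt_prob": 0.10,          # drop p_i on 10% of training steps
    "precise_mask_ratio": 0.50,        # 50% precise masks, 50% coarse bounding boxes
}
```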


Alterbute establishes a new paradigm for fine-grained, identity-preserving editing of object intrinsic attributes at scale, facilitated by automated VNE-based supervision and a principled diffusion modeling approach (Reiss et al., 15 Jan 2026).
