Alterbute: Diffusion-based Object Attribute Editing
- Alterbute is a diffusion-based framework that enables fine-grained editing of intrinsic object attributes, including color, texture, material, and shape.
- It employs a novel four-way conditioning scheme using identity, text prompts, background, and masks to precisely guide the diffusion process.
- Performance evaluations demonstrate state-of-the-art attribute alignment (highest CLIP-T) alongside strong identity preservation, validated by metrics including CLIP-T, DINO, and CLIP-I.
Alterbute is a unified, diffusion-based framework for editing the intrinsic attributes of a selected object in a real image—including color, texture, material, and shape—while rigorously preserving both object identity and all extrinsic scene factors. It is distinguished by its combination of a flexible training objective, a novel deployment of Visual Named Entities (VNEs) for scalable, fine-grained supervision, and an architecture that conditions the diffusion process on identity, textual attribute prompts, background, and mask information. Alterbute advances the state of the art in identity-preserving attribute editing, enabling, for the first time, scalable, high-fidelity modification of intrinsic object properties in the wild (Reiss et al., 15 Jan 2026).
1. Formulation of the Intrinsic Attribute Editing Problem
Alterbute formalizes the task of editing an object's intrinsic attributes as transforming an input image $x$, which depicts a physical scene and a target object of identity $o$, into a new image $\hat{x}$ where only the object's intrinsic attributes (color, texture, material, shape) have changed, while preserving both the object's perceived identity and all extrinsic scene factors (background, lighting, camera pose). This is abstractly expressed via an (idealized) renderer $R$ as follows:

$$x = R(o, A, E)$$

Here, $A$ denotes the set of intrinsic attributes and $E$ the extrinsic factors. The ideal operator for intrinsic-attribute editing, given a prompt $p$ describing new attributes $A'$, yields

$$\hat{x} = R(o, A', E)$$

Since $R$ and the underlying factors $(A, E)$ are unknown for real images, Alterbute learns a parametric diffusion model to approximate this mapping.
2. Diffusion-Based Architecture and Conditioning Scheme
Alterbute leverages a pretrained latent diffusion UNet backbone (SDXL, $7$B parameters), fine-tuned to support a four-way conditioning protocol:
- Identity reference image ($x_{\text{id}}$): Contains only the object foreground (background zeroed via the mask).
- Text prompt ($p$): Encodes the desired intrinsic attributes (e.g., `{color: red, material: wood}`).
- Background image ($x_{\text{bg}}$): Formed by masking out the object in $x$ and filling with a constant gray to fix the extrinsic context.
- Binary object mask ($m$): Indicates the spatial support of the object.
During training, the noisy target latent $z_t$ and the encoded identity reference $z_{\text{id}}$ are tiled side-by-side into a two-panel image grid, enabling self-attention to propagate identity features from the reference into the target. The background and mask conditions are concatenated along the channel dimension, while the text prompt $p$ is injected via cross-attention. The input at diffusion step $t$ is

$$x^{\text{in}}_t = \big[\,z_t \,\|\, z_{\text{id}}\,\big] \oplus c_{\text{bg}} \oplus m,$$

where $[\cdot \,\|\, \cdot]$ denotes the side-by-side tiling, $\oplus$ channel-wise concatenation, and $c_{\text{bg}}$ the encoded background.
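A minimal sketch of this four-way conditioning, using plain PyTorch and a hypothetical `unet` callable; the exact latent shapes, encoders, and argument names are assumptions rather than the paper's released interface:

```python
import torch

def build_conditioned_input(z_t, z_id, z_bg, mask, t, text_emb, unet):
    """Assemble the four-way conditioned UNet input (illustrative sketch).

    z_t      : noisy target latent             (B, C, H, W)
    z_id     : encoded identity reference      (B, C, H, W)
    z_bg     : encoded gray-filled background  (B, C, H, W)
    mask     : binary object mask, latent res. (B, 1, H, W)
    text_emb : attribute-prompt embeddings, injected via cross-attention
    """
    # Tile target and identity latents side-by-side so self-attention can
    # propagate identity features from the reference half into the target half.
    grid = torch.cat([z_t, z_id], dim=-1)                        # (B, C, H, 2W)

    # Background and mask are conditioning-only; pad their identity half with
    # zeros and concatenate everything along the channel dimension.
    bg_grid = torch.cat([z_bg, torch.zeros_like(z_bg)], dim=-1)
    mask_grid = torch.cat([mask, torch.zeros_like(mask)], dim=-1)
    x_in = torch.cat([grid, bg_grid, mask_grid], dim=1)          # channel concat

    return unet(x_in, t, encoder_hidden_states=text_emb)         # text via cross-attn
```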
3. Relaxed Training Objective and Data Pipeline
Supervised pairs that differ *only* in intrinsic attributes while keeping extrinsic factors fixed are rare at scale. Alterbute circumvents this by adopting a “relaxed” objective, allowing changes in both intrinsic and extrinsic properties during training. Triplets $(x_{\text{tgt}}, x_{\text{id}}, p)$ are mined from VNE clusters (see Section 5) such that $x_{\text{tgt}}$ and $x_{\text{id}}$ share the same VNE identity but may differ in background, lighting, or pose. The model is trained to reconstruct $x_{\text{tgt}}$ conditioned on $(x_{\text{id}}, p, x_{\text{bg}}, m)$, using an $L_2$ diffusion denoising loss applied only to the target half of the tiled grid:

$$\mathcal{L} = \mathbb{E}_{t,\epsilon}\Big[\, \big\| m_{\text{tgt}} \odot \big(\epsilon - \epsilon_\theta(x^{\text{in}}_t, t, p)\big) \big\|_2^2 \,\Big],$$

where $m_{\text{tgt}}$ selects the target half of the grid. This encourages the model to use the prompt $p$ to alter intrinsic attributes, the background $x_{\text{bg}}$ and mask $m$ to set extrinsic factors, and the identity reference $x_{\text{id}}$ to preserve identity.
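A sketch of the relaxed objective's loss, assuming a standard $\epsilon$-prediction formulation and that the identity reference occupies the right half of the tiled grid; the helper below is illustrative, not the paper's code:

```python
import torch
import torch.nn.functional as F

def denoising_loss_target_half(eps_pred, eps_true):
    """L2 denoising loss applied only to the target (left) half of the grid.

    eps_pred, eps_true : (B, C, H, 2W) noise prediction / ground-truth noise
    over the tiled grid; the left W columns correspond to the target image,
    the right W columns to the identity reference, which is not supervised.
    """
    W = eps_pred.shape[-1] // 2
    return F.mse_loss(eps_pred[..., :W], eps_true[..., :W])
```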
4. Inference-time Extrinsic Factor Constraint
At inference, the objective is to edit the input image $x$ only along the intrinsic attribute directions specified by the prompt $p$, without changing the extrinsic context $E$. This is achieved by extracting the mask $m$ via a pretrained segmenter, forming the background $x_{\text{bg}}$ from $x$ by masking out the object and filling with gray, and setting the identity reference $x_{\text{id}}$ by cropping and masking the original object. The diffusion process then operates with these fixed inputs for all timesteps, mathematically pinning down extrinsic factors: the UNet receives no alternate background or mask and can thus only alter the degrees of freedom described in $p$ and any intrinsic defaults inherited from $x_{\text{id}}$.
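An illustrative inference loop under these fixed conditions; the `scheduler` follows a diffusers-style interface and, together with the encoders, is a hypothetical stand-in rather than the actual implementation:

```python
import torch

@torch.no_grad()
def edit_intrinsics(unet, scheduler, z_id, z_bg, mask, text_emb, steps=50):
    """Sample an edited latent with extrinsics pinned by (z_bg, mask).

    The background latent and mask are held fixed at every timestep, so the
    only remaining degrees of freedom are the intrinsic attributes described
    by the text prompt and the identity carried by z_id.
    """
    z_t = torch.randn_like(z_id)             # target half starts as pure noise
    scheduler.set_timesteps(steps)
    for t in scheduler.timesteps:
        grid = torch.cat([z_t, z_id], dim=-1)
        bg_grid = torch.cat([z_bg, torch.zeros_like(z_bg)], dim=-1)
        m_grid = torch.cat([mask, torch.zeros_like(mask)], dim=-1)
        x_in = torch.cat([grid, bg_grid, m_grid], dim=1)

        eps = unet(x_in, t, encoder_hidden_states=text_emb)
        W = z_t.shape[-1]
        z_t = scheduler.step(eps[..., :W], t, z_t).prev_sample  # update target half only
    return z_t
```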
5. Visual Named Entities (VNEs) and Scalable Identity Supervision
A critical challenge for intrinsic attribute editing is the precise and scalable definition of object identity—fine-grained enough to discriminate subtle shape and material features, yet tolerant of intrinsic variation. Alterbute introduces the Visual Named Entity (VNE): a fine-grained, textually expressible category (e.g., “Porsche 911 Carrera,” “IKEA LACK table”) signifying shared identity-defining features across real images.
VNEs are mined by processing all $16M$ object crops in OpenImages v4 using Gemini 2.0 Flash, a vision–LLM, to infer candidate VNE labels. Only “High”-confidence assignments are retained, yielding approximately $1.5M$ labeled objects grouped into $69,744$ non-singleton clusters (each a VNE category). Cluster members naturally vary in intrinsic attributes and extrinsics, supporting identity-preserving edits. For each object, Gemini provides intrinsic attributes in structured JSON form, supplying the training prompts $p$.
The automated pipeline thereby generates, at scale and without manual labeling: identity references, attribute prompts, backgrounds and masks, and target images across identities.
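A schematic of the VNE mining loop described above; `query_vlm` and its JSON schema are hypothetical placeholders for the Gemini 2.0 Flash labeling step, while the confidence filtering and singleton removal mirror the pipeline's stated rules:

```python
import json
from collections import defaultdict

def mine_vne_clusters(object_crops, query_vlm):
    """Group object crops into VNE clusters using a vision-LLM labeler (sketch).

    query_vlm(crop) is assumed to return a JSON string such as
      {"vne": "Porsche 911 Carrera", "confidence": "High",
       "attributes": {"color": "red", "material": "metal"}}
    """
    clusters = defaultdict(list)
    for crop in object_crops:
        label = json.loads(query_vlm(crop))
        if label.get("confidence") != "High":
            continue                          # keep only high-confidence VNE labels
        clusters[label["vne"]].append((crop, label["attributes"]))
    # Drop singleton clusters: identity-preserving supervision needs >= 2 members.
    return {vne: items for vne, items in clusters.items() if len(items) > 1}
```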
6. Evaluation and Comparative Performance
Alterbute is evaluated on $30$ held-out objects and $100$ test cases spanning both standard and under-represented categories. Baselines include general-purpose editors (FlowEdit, InstructPix2Pix, OmniGen, UltraEdit, Diptych) and attribute-specific methods (MimicBrush, MaterialFusion).
User studies show Alterbute is preferred over every baseline in the majority of pairwise comparisons. Automated vision–LLM judges (Gemini, GPT-4o, Claude 3.7) independently confirm these outcomes with statistically significant agreement.
Conventional metrics are summarized as follows:
| Method | DINO ↑ | CLIP-I ↑ | CLIP-T ↑ |
|---|---|---|---|
| FlowEdit | 0.813 | 0.900 | 0.294 |
| InstructPix2Pix | 0.772 | 0.877 | 0.302 |
| OmniGen | 0.823 | 0.912 | 0.305 |
| UltraEdit | 0.841 | 0.922 | 0.303 |
| Diptych | 0.794 | 0.901 | 0.313 |
| Ours (Alterbute) | 0.815 | 0.914 | 0.321 |
Alterbute achieves state-of-the-art attribute alignment (highest CLIP-T) without a significant loss in identity (CLIP-I, DINO). Qualitative analysis demonstrates reliable shape edits and multi-attribute transformations beyond the reach of previous methods. It is noted that high scores on CLIP-I and DINO can sometimes be misleading, as identity-preserving no-ops also rate highly.
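CLIP-I and CLIP-T are cosine similarities in CLIP embedding space (image–image and image–text, respectively), with DINO computed analogously from a DINO ViT backbone. A sketch using the Hugging Face `transformers` CLIP implementation; the checkpoint choice is an assumption rather than the paper's stated protocol:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_scores(edited_image, reference_image, attribute_prompt):
    """Return (CLIP-I, CLIP-T): edited-vs-reference image similarity and
    edited-image-vs-attribute-prompt similarity."""
    inputs = processor(text=[attribute_prompt],
                       images=[edited_image, reference_image],
                       return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    clip_i = (img[0] @ img[1]).item()   # edited vs. reference image
    clip_t = (img[0] @ txt[0]).item()   # edited image vs. attribute prompt
    return clip_i, clip_t
```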
7. Implementation Protocols and Reproducibility
Key implementation details include:
- Diffusion backbone: SDXL text-to-image, $7$B parameters.
- Mask extraction: SAM 2.0 segmentation (Ravi et al. 2024).
- Training regimen: $100$k steps, batch size $128$, side-by-side target/identity image grid, on 128 TPU-v4 cores (approximately 24 hours).
- Classifier-free guidance: Scale $7.5$ for text, $2.0$ for image (see the sketch after this list).
- VNE mining: OpenImages v4 and Gemini 2.0 Flash filtered for “High”-confidence, discarding singleton clusters.
- Intrinsic attribute prompts: Structured JSON from Gemini.
- Random condition drops: Individual conditioning inputs are randomly dropped during a fraction of training steps to support scene- and identity-conditional learning.
- Segmentation granularity: Training uses both precise segmentation masks and coarse bounding boxes (for shape-edit robustness).
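A minimal sketch of how the two guidance scales above could be combined at sampling time, following the common multi-condition classifier-free guidance recipe; the decomposition into unconditional, image-conditioned, and fully conditioned passes is an assumption based on the reported scales:

```python
def dual_cfg(eps_uncond, eps_img, eps_full, s_img=2.0, s_text=7.5):
    """Combine noise predictions from three UNet passes (sketch):
    eps_uncond : no conditioning
    eps_img    : image (identity/background/mask) conditioning only
    eps_full   : image + text conditioning
    """
    return (eps_uncond
            + s_img * (eps_img - eps_uncond)
            + s_text * (eps_full - eps_img))
```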
This configuration enables Alterbute to generalize edit capabilities while maintaining robustness to variation in segmentation precision, attribute specification, and identity reference.
Alterbute establishes a new paradigm for fine-grained, identity-preserving editing of object intrinsic attributes at scale, facilitated by automated VNE-based supervision and a principled diffusion modeling approach (Reiss et al., 15 Jan 2026).