FLUX.1 Kontext

Updated 4 July 2026
  • FLUX.1 Kontext is a unified model that integrates text-to-image generation and image-conditioned editing through a single transformer using sequence concatenation.
  • It employs generative flow matching in latent space with a rectified flow objective and Flux-VAE, achieving competitive reconstruction metrics and rapid 1024×1024 output in 3–5 seconds.
  • The system supports iterative multi-turn editing by reusing previous outputs as context, ensuring improved object and character preservation in diverse editing workflows.

FLUX.1 Kontext is a unified, in-context image generation and editing model built on generative flow matching in latent space. It is designed to perform, within a single architecture and interface, both text-to-image generation and image-conditioned editing and synthesis spanning local edits, global scene changes, character and style reference, and text editing. Its central mechanism is a simple sequence concatenation of text tokens and image context tokens, so that “from scratch” synthesis and image-to-image transformation are handled by the same network rather than by task-specific architectural switches. The model is also presented as a multi-turn system: previous outputs can be reintroduced as context for subsequent edits, and the accompanying evaluation reports improved preservation of objects and characters together with interactive 1024 × 1024 generation in about 3–5 seconds after distillation (Labs et al., 17 Jun 2025).

1. Unified task model

FLUX.1 Kontext is defined by task unification. The model accepts a text instruction or prompt together with zero, one, or more context images, and uses the same transformer to handle local editing, global editing, character reference, style reference, text editing, and unconstrained text-to-image synthesis. Image-conditioned editing and generation are therefore not separate operating modes; they differ only in whether context image tokens are present in the sequence. Multi-turn workflows are realized by feeding the previous output image back as the next step’s context, concatenated with a new instruction. The system also supports additional visual cues, such as bounding boxes or red ellipses, injected as part of the input image and interpreted as localized edit hints (Labs et al., 17 Jun 2025).

This formulation is notable because it treats reference consumption and target synthesis as operations in a single attention space. A common misconception in image editing systems is that local editing, personalized generation, and iterative revision necessarily require separate heads, adapters, or task-specific modules. In FLUX.1 Kontext, the intended alternative is a single token-level conditioning mechanism that can “read” a reference and “write” an edit or a new image with the same network. This suggests that the model’s notion of editing is structurally close to conditional generation rather than to a separately engineered post-processing stage.

2. Latent architecture and representation

The model is a rectified flow transformer operating in the latent space of a learned image autoencoder. That autoencoder, Flux-VAE, is a convolutional encoder–decoder trained from scratch with an adversarial objective, scaled up in compute, and configured with 16 latent channels. On reconstruction metrics reported against common VAEs used in image synthesis, Flux-VAE attains PDist 0.332 ± 0.003, SSIM 0.896 ± 0.004, and PSNR 31.1 ± 0.08, compared with SD3-VAE at PDist 0.452 ± 0.004, SSIM 0.858 ± 0.005, and PSNR 29.6 ± 0.07; SD3-TAE at PDist 0.746 ± 0.004, SSIM 0.774 ± 0.014, and PSNR 27.9 ± 0.06; SDXL-VAE at PDist 0.890 ± 0.005, SSIM 0.748 ± 0.006, and PSNR 25.9 ± 0.07; and SD-VAE at PDist 0.949 ± 0.005, SSIM 0.720 ± 0.004, and PSNR 25.0 ± 0.07 (Labs et al., 17 Jun 2025).

The generator backbone follows a diffusion-transformer-style rectified flow design composed of double-stream and single-stream blocks. In the double-stream stage, image and text token streams use separate weights, while cross-modal interaction is implemented by attention over the concatenated token sequence. After this stage, image and text tokens are concatenated and processed by 38 single-stream transformer blocks. Text tokens are then discarded, and image tokens are decoded by the VAE decoder.

To improve throughput and reduce latency, the architecture uses fused DiT feed-forward blocks, which fuse attention in/out projections with the MLP and reduce modulation parameters, together with factorized 3D rotary positional embeddings (3D-RoPE). Every latent token is indexed by space–time coordinates $(t, h, w)$. For single-image inputs, $t \equiv 0$; for multi-image conditioning, disjoint $t$ offsets are assigned per context image. A plausible implication is that the latent-space design and strong reconstruction properties of Flux-VAE are not merely auxiliary components: they help determine how much information survives repeated edit cycles.

3. Conditioning mechanism and flow-matching formulation

Conditioning is implemented by sequence concatenation. Text is tokenized into the text stream in the usual way. A context image $y$ is encoded by the Flux-VAE into latent tokens, which are appended to the target image token sequence $x$ on the visual stream. Additional context images $y_1, \ldots, y_N$ can also be appended, although training in this release focuses on single-image context. Positional separation is handled through 3D-RoPE offsets: if a token position is $u = (t, h, w)$, target tokens use $u_x = (0, h, w)$ and context tokens use $u_{y_i} = (i, h, w)$ for $i = 1, \ldots, N$. The authors report that channel-wise latent concatenation was also tried and underperformed in initial experiments relative to sequence concatenation (Labs et al., 17 Jun 2025).
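The position-offset scheme can be illustrated with a minimal sketch. It only builds the $(t, h, w)$ index triples for a concatenated visual sequence and does not compute rotary embeddings; the helper names `grid_positions` and `build_visual_positions`, the row-major token order, and the tiny 2 × 2 latent grids are assumptions made for illustration, not details from the paper.

```python
from typing import List, Tuple

Position = Tuple[int, int, int]  # (t, h, w) index consumed by a 3D rotary embedding


def grid_positions(t_index: int, height: int, width: int) -> List[Position]:
    """Return (t, h, w) coordinates for one latent grid in row-major order."""
    return [(t_index, h, w) for h in range(height) for w in range(width)]


def build_visual_positions(
    target_hw: Tuple[int, int],
    context_hws: List[Tuple[int, int]],
) -> List[Position]:
    """Positions for the visual stream: target tokens first, then each context image.

    Target tokens use t = 0 and the i-th context image (1-based) uses t = i,
    so tokens that share the same (h, w) remain distinguishable.
    """
    positions = grid_positions(0, *target_hw)
    for i, hw in enumerate(context_hws, start=1):
        positions += grid_positions(i, *hw)
    return positions


# Text-to-image: no context image, so only target positions are present.
t2i = build_visual_positions((2, 2), [])
# Editing: one context image is appended after the target tokens.
edit = build_visual_positions((2, 2), [(2, 2)])

print(len(t2i), len(edit))  # 4 8
print(edit[0], edit[4])     # (0, 0, 0) (1, 0, 0)
```

In this sketch, text-to-image and editing differ only in the length of the position list, which mirrors the claim that the two tasks differ only in whether context tokens are present.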

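The underlying generative formulation is rectified flow matching in latent space. With target latent $x$, Gaussian noise $\varepsilon \sim \mathcal{N}(0, I)$, text conditioning $c$, and image context $y$, the probability flow ODE is

$$\frac{dz_t}{dt} = v_\theta(z_t, t, y, c),$$

and a linear path example is

$$z_t = (1 - t)\,x + t\,\varepsilon, \qquad t \in [0, 1].$$

Under conditional flow matching, an ideal target velocity satisfies

$$u_t(z_t \mid x, \varepsilon) = \frac{dz_t}{dt} = \varepsilon - x,$$

with training loss

$$\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{t,\, x,\, \varepsilon,\, y,\, c}\left[\lVert v_\theta(z_t, t, y, c) - u_t(z_t \mid x, \varepsilon) \rVert_2^2\right].$$

The instantiated rectified-flow objective uses a velocity predictor $v_\theta(z_t, t, y, c)$ on the linear path

$$z_t = (1 - t)\,x + t\,\varepsilon$$

and minimizes

$$\mathcal{L}_\theta = \mathbb{E}_{t \sim p(t),\, x,\, y,\, c}\left[\lVert v_\theta(z_t, t, y, c) - (\varepsilon - x) \rVert_2^2\right].$$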
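A minimal sketch of the rectified-flow objective on a single sample may help make the pieces concrete. It assumes the linear path $z_t = (1 - t)x + t\varepsilon$ with velocity target $\varepsilon - x$; the function names, the plain-list vectors, and the two toy predictors are illustrative assumptions, and the conditioning on the context image and text is folded into the `predict` callable rather than modeled.

```python
import random
from typing import Callable, List

Vector = List[float]
Predictor = Callable[[Vector, float], Vector]  # stands in for the conditioned velocity network


def interpolate(x: Vector, eps: Vector, t: float) -> Vector:
    """Linear path z_t = (1 - t) * x + t * eps."""
    return [(1.0 - t) * xi + t * ei for xi, ei in zip(x, eps)]


def target_velocity(x: Vector, eps: Vector) -> Vector:
    """Time derivative of the linear path, eps - x (independent of t)."""
    return [ei - xi for xi, ei in zip(x, eps)]


def flow_matching_loss(predict: Predictor, x: Vector, eps: Vector, t: float) -> float:
    """Squared L2 error between predicted and target velocity for one sample."""
    z_t = interpolate(x, eps, t)
    v_pred = predict(z_t, t)
    v_true = target_velocity(x, eps)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true))


rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(4)]    # toy "clean latent"
eps = [rng.gauss(0.0, 1.0) for _ in range(4)]  # Gaussian noise
t = 0.3


def oracle(z_t: Vector, t: float) -> Vector:
    """Cheating predictor that knows x: on the linear path, eps - x = (z_t - x) / t."""
    return [(zi - xi) / t for zi, xi in zip(z_t, x)]


def zero(z_t: Vector, t: float) -> Vector:
    """Trivial predictor that always outputs zero velocity."""
    return [0.0] * len(z_t)


oracle_loss = flow_matching_loss(oracle, x, eps, t)
zero_loss = flow_matching_loss(zero, x, eps, t)
print(oracle_loss, zero_loss)
```

The oracle attains a loss of essentially zero because the target velocity is fully determined by $x$ and $z_t$ on a straight path; a trained network has to approximate this quantity without access to $x$.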

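The appendix gives a more general forward noising process,

$$z_t = a_t\,x_0 + b_t\,\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I),$$

with conditional flow matching loss

$$\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{t,\, x_0,\, \varepsilon}\left[\lVert v_\theta(z_t, t) - u_t(z_t \mid \varepsilon) \rVert_2^2\right].$$

For rectified flows, $a_t = 1 - t$ and $b_t = t$, reducing the velocity target to $\varepsilon - x_0$ up to sign.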
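The reduction can be checked in one step. The derivation below assumes the general forward process has the form $z_t = a_t x_0 + b_t \varepsilon$ with differentiable coefficients $a_t$ and $b_t$; this notation is an assumption chosen for the sketch. Differentiating in $t$ and substituting the rectified-flow coefficients gives a constant velocity:

```latex
\begin{aligned}
z_t &= a_t\,x_0 + b_t\,\varepsilon, \\
\frac{dz_t}{dt} &= \dot a_t\,x_0 + \dot b_t\,\varepsilon, \\
a_t = 1 - t,\quad b_t = t \;&\Longrightarrow\; \frac{dz_t}{dt} = -x_0 + \varepsilon = \varepsilon - x_0 .
\end{aligned}
```

The "up to sign" qualifier reflects the time convention: if time is reversed so that $t = 0$ is noise and $t = 1$ is data, the same calculation yields $x_0 - \varepsilon$.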

The timestep variable $t$ is sampled from a Logit-Normal distribution $p(t; \mu, \sigma)$ with $\sigma = 1.0$ and $\mu$ set per resolution. The reported equivalence between the high-resolution $\alpha$-shift and the Logit-Normal parameter is

$$\mu = \log \alpha,$$

and the generalized timestep redistribution is

$$t' = \frac{e^{\mu}}{e^{\mu} + \left(\tfrac{1}{t} - 1\right)^{\sigma}}.$$
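The link between a resolution-dependent timestep shift and the Logit-Normal location parameter can be checked numerically. The sketch below assumes the shift takes the common form $t' = \alpha t / (1 + (\alpha - 1) t)$ and that a Logit-Normal sample is $\mathrm{sigmoid}(u)$ with $u$ Gaussian; the names `shift`, `alpha`, and `sample_logit_normal` are illustrative assumptions rather than the paper's notation. Under those assumptions, shifting by `alpha` is the same as adding $\log \alpha$ in logit space.

```python
import math
import random


def logit(t: float) -> float:
    """Inverse of the sigmoid, defined for t in (0, 1)."""
    return math.log(t / (1.0 - t))


def sigmoid(u: float) -> float:
    return 1.0 / (1.0 + math.exp(-u))


def shift(t: float, alpha: float) -> float:
    """Assumed shift form; for alpha > 1 it moves t toward 1 (the noisier end)."""
    return alpha * t / (1.0 + (alpha - 1.0) * t)


def sample_logit_normal(mu: float, sigma: float, rng: random.Random) -> float:
    """Draw t in (0, 1) as sigmoid(u) with u ~ Normal(mu, sigma)."""
    return sigmoid(rng.gauss(mu, sigma))


# Shifting a timestep equals adding log(alpha) to its logit.
max_gap = 0.0
for t in (0.1, 0.25, 0.5, 0.75, 0.9):
    for alpha in (1.0, 2.0, 3.0):
        via_shift = shift(t, alpha)
        via_logit = sigmoid(logit(t) + math.log(alpha))
        max_gap = max(max_gap, abs(via_shift - via_logit))
print(max_gap)

# Hence sampling with mu = log(alpha) matches shifting samples drawn with mu = 0.
rng = random.Random(0)
samples = [sample_logit_normal(math.log(3.0), 1.0, rng) for _ in range(5)]
print(samples)
```

The identity follows from $\alpha t / (1 + (\alpha - 1)t) = 1 / (1 + e^{-u}/\alpha)$ when $t = \mathrm{sigmoid}(u)$, which is why a larger shift corresponds to a larger location parameter.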

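Classifier-free guidance is applied on velocity during ODE integration:

$$\hat v_\theta(z_t, t, y, c) = v_\theta(z_t, t, \varnothing) + w\,\bigl(v_\theta(z_t, t, y, c) - v_\theta(z_t, t, \varnothing)\bigr),$$

where $w$ controls guidance strength. The paper notes that naïve multi-step guided sampling, such as 50–250 evaluations, is accurate but slow and can introduce artifacts at high $w$.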
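Guided sampling can be sketched with a fixed-step Euler integrator. The source does not specify the solver, so Euler is an assumption here, as are the function names, the guidance symbol `w`, and the toy velocity field, which is the exact rectified-flow velocity for a data distribution concentrated on a single point. The sketch integrates from $t = 1$ (noise) to $t = 0$ (data).

```python
from typing import Callable, List, Optional

Vector = List[float]
Velocity = Callable[[Vector, float, Optional[str]], Vector]


def guided_velocity(v: Velocity, z: Vector, t: float, cond: str, w: float) -> Vector:
    """Classifier-free guidance on velocity: v_uncond + w * (v_cond - v_uncond)."""
    v_c = v(z, t, cond)
    v_u = v(z, t, None)
    return [u + w * (c - u) for c, u in zip(v_c, v_u)]


def euler_sample(v: Velocity, z1: Vector, cond: str, w: float, steps: int) -> Vector:
    """Integrate dz/dt = v from t = 1 (noise) down to t = 0 (data) with Euler steps."""
    z = list(z1)
    dt = 1.0 / steps
    for k in range(steps):
        t = 1.0 - k * dt
        g = guided_velocity(v, z, t, cond, w)
        z = [zi - dt * gi for zi, gi in zip(z, g)]
    return z


TARGET = [1.0, -2.0]  # toy conditional "image"; the unconditional one is the origin


def toy_velocity(z: Vector, t: float, cond: Optional[str]) -> Vector:
    """Exact velocity (z - x*) / t for a point-mass data distribution at x*."""
    x_star = TARGET if cond is not None else [0.0, 0.0]
    return [(zi - xi) / t for zi, xi in zip(z, x_star)]


noise = [0.3, 0.7]
plain = euler_sample(toy_velocity, noise, "prompt", w=1.0, steps=4)
strong = euler_sample(toy_velocity, noise, "prompt", w=2.0, steps=4)
print(plain)   # close to [1.0, -2.0]
print(strong)  # close to [2.0, -4.0]
```

With `w = 1.0` the sampler lands on the conditional target, while `w = 2.0` overshoots to twice the target in this toy setting. That is a simplified analogue of the over-saturation that strong guidance can cause, and it says nothing quantitative about the real model.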

4. Training methodology, distillation, and model variants

Training begins from a FLUX.1 text-to-image rectified flow checkpoint. The authors then curate millions of relational tuples $(x \mid y, c)$ and jointly fine-tune on image-to-image and text-to-image tasks under the rectified-flow loss. Although the architecture supports multiple context images, training is concentrated on single-image context in this release. Three named variants are described. FLUX.1 Kontext [pro] is trained with the flow objective followed by latent adversarial diffusion distillation (LADD). FLUX.1 Kontext [dev] is obtained by guidance-distillation into a 12B diffusion transformer (Meng et al., 2023) and is trained exclusively on image-to-image tasks. FLUX.1 Kontext [max] uses more compute to further improve generative performance (Labs et al., 17 Jun 2025).

LADD is used to distill multi-step guided flow sampling into a faster process while simultaneously improving perceptual quality through adversarial training. The reported consequence is a substantial reduction in the number of solver steps required at inference. This provides the basis for the model’s interactive runtime characteristics and is central to its quality/speed trade-off.

The training stack emphasizes systems efficiency. The authors use FSDP2 in mixed precision, specifically bf16 all-gather and fp32 reduce-scatter, together with selective activation checkpointing, FlashAttention-3, and regional compilation of transformer blocks to improve throughput and reduce VRAM. Safety mitigations include classifier-based filtering and adversarial training intended to mitigate NCII/CSAM generation. These operational details are significant because they tie the model’s architectural simplicity to a concrete deployment strategy rather than leaving interactivity as an abstract aspiration.

5. Inference and iterative editing workflows

At inference time, FLUX.1 Kontext solves the rectified probability flow ODE with classifier-free guidance. The detailed choice of ODE solver is not foregrounded, but the baseline guided procedure requires 50–250 evaluations; LADD distills this into a substantially faster sampler. After distillation, the model produces 1024 × 1024 images in about 3–5 seconds for both text-to-image and image-to-image workflows (Labs et al., 17 Jun 2025).

Iterative, multi-turn editing is implemented by re-encoding the previous output as the next context image $y$ and pairing it with a new instruction $c$. Because the unification occurs at the token-sequence level, there is no mode switch between “edit” and “generate,” and local and global refinements can be chained within the same procedure. The reported demonstrations preserve character identity, pose, lighting, and style across sequences involving object removal, relighting or weather changes, scene relocation, apparel or style changes, and text or logo edits.
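The multi-turn procedure amounts to a loop in which each output becomes the next context image. The sketch below shows only that control flow: `run_multi_turn`, the `EditFn` signature, and the string-based `toy_model` are hypothetical stand-ins and do not correspond to any published FLUX.1 Kontext API.

```python
from typing import Callable, List, Optional

Image = str  # placeholder type; a real system would pass pixel arrays or latents
EditFn = Callable[[Optional[Image], str], Image]  # (context image or None, instruction) -> image


def run_multi_turn(
    edit_fn: EditFn,
    instructions: List[str],
    initial: Optional[Image] = None,
) -> List[Image]:
    """Chain edits so that each output is fed back as the next turn's context."""
    outputs: List[Image] = []
    context = initial
    for instruction in instructions:
        context = edit_fn(context, instruction)
        outputs.append(context)
    return outputs


def toy_model(context: Optional[Image], instruction: str) -> Image:
    """Stand-in model: records the edit history instead of generating pixels."""
    base = "blank" if context is None else context
    return f"{base} -> [{instruction}]"


history = run_multi_turn(
    toy_model,
    ["a cat on a sofa", "make it night", "add a red hat"],
)
for turn, image in enumerate(history, start=1):
    print(turn, image)
```

The first turn receives no context image, which corresponds to plain text-to-image generation; later turns use the same call with a context image, so no separate edit mode is needed in the loop.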

The deployment implications follow directly from this mechanism. The architecture’s simple concatenation permits straightforward integration into interactive applications: new turns are executed by re-encoding the last output as context and updating the instruction. The paper also states that the approach is compatible with multiple input images and visual cues. A plausible implication is that the interface contract for downstream products is unusually stable: iterative editing can be exposed as repeated reconditioning rather than as a collection of task-specific tools.

6. Benchmarks and empirical performance

To evaluate the model, the authors introduce KontextBench, a 1,026-example benchmark collected from real-world use cases with 108 base images drawn from personal photos, CC-licensed or public-domain data, and AI-generated images. The benchmark spans five categories: local instruction editing with 416 examples, global instruction editing with 262, character reference (CREF) with 193, style reference (SREF) with 63, and text editing with 92. Across image-to-image tasks, FLUX.1 Kontext [max] and [pro] are reported as the top systems in local editing, text editing, and general character reference; for global editing and SREF they are second to GPT-Image-1 and Gen-4 References, respectively. Median 1024 × 1024 latency figures place FLUX.1 Kontext among the fastest models for both text-to-image and image-to-image, and the text reports an order-of-magnitude speedup over some competing APIs (Labs et al., 17 Jun 2025).

Multi-turn identity preservation is quantified on a dedicated track using AuraFace similarity between the original face and the face after each edit. For FLUX.1 Kontext [pro], the similarity is 0.9938 for steps 0→1, 0.9850 for 0→2, 0.9328 for 0→3, 0.8857 for 0→4, and 0.7427 for 0→5, with an average of 0.908. Runway Gen-4 records 0.9561, 0.8992, 0.7701, 0.7476, and 0.4986, averaging 0.774. GPT-4o High, within the GPT-Image-1 family, records 0.7109, 0.4720, 0.2819, 0.3249, and 0.2915, averaging 0.416. The reported interpretation is markedly slower identity drift for FLUX.1 Kontext across multiple edit turns.
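The averages quoted above can be reproduced directly from the per-step values. The short script below uses only the numbers reported in the text; the dictionary layout and the additional "drop" figure (similarity at step 1 minus similarity at step 5) are illustrative choices, not metrics from the paper.

```python
from statistics import mean

# AuraFace similarity to the original face after edit steps 1 through 5.
similarity = {
    "FLUX.1 Kontext [pro]": [0.9938, 0.9850, 0.9328, 0.8857, 0.7427],
    "Runway Gen-4": [0.9561, 0.8992, 0.7701, 0.7476, 0.4986],
    "GPT-4o High": [0.7109, 0.4720, 0.2819, 0.3249, 0.2915],
}

averages = {name: round(mean(values), 3) for name, values in similarity.items()}
drops = {name: round(values[0] - values[-1], 4) for name, values in similarity.items()}

for name in similarity:
    print(f"{name}: average={averages[name]}, drop from step 1 to step 5={drops[name]}")
```

The computed averages are 0.908, 0.774, and 0.416, matching the reported figures.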

Text-to-image performance is also assessed on an internal 1,000-prompt suite, Internal-T2I-Bench, built from DrawBench, PartiPrompts, and real user prompts. The evaluation decomposes quality into prompt following, aesthetics, realism, typography, and speed. FLUX.1 Kontext is described as achieving balanced, competitive performance across all dimensions, improving over FLUX-1.1 [pro], with further gains for the [max] variant. Complementary evaluations on GenAI Bench are said to corroborate these trends, although the figures report relative ELO-style outcomes rather than absolute numeric scores.

7. Position within the editing literature and reported limitations

The paper situates FLUX.1 Kontext against several classes of prior systems. InstructPix2Pix, Emu-Edit, OmniGen, HiDream-E1, and ICEdit are described as relying on synthetic instruction pairs or specialized data or heads, and as often suffering identity drift over multi-turn edits. IP-Adapter-style methods and in-context LoRAs are characterized as requiring adapters or per-task LoRAs for personalization and style guidance. By contrast, FLUX.1 Kontext performs personalization and style reference directly from concatenated context tokens, without extra heads or adapters, and in principle supports multiple context images. Strong proprietary systems such as GPT-Image-1/GPT-4o High and Runway Gen-4 are presented as offering high quality but as being slower or more prone to drift over edits; the reported comparison is that FLUX.1 Kontext matches or surpasses them on character preservation and text editing while being up to an order of magnitude faster in API latency at 1024 × 1024 (Labs et al., 17 Jun 2025).

The limitations are explicit. Identity drift can still occur, especially under challenging prompts or after many iterations, and artifacts become visible after six or more sequential edits in some cases. Instruction following can fail for precise geometric edits; the example given is “move the coffee to the left” producing unintended scene modifications. Fine-grained constraints may be partially ignored. Distillation can introduce artifacts, and very high classifier-free guidance can induce over-saturation or stylization. Although the architecture supports multiple references, multi-reference performance was not the focus of this release because the current training emphasis is single-image context.

These caveats help delimit what the unification claim does and does not imply. The model collapses several previously separated workflows into a single transformer and tokenization scheme, but it does not remove the standard failure modes associated with iterative generative editing, hard spatial control, or aggressive guidance. A plausible implication is that FLUX.1 Kontext is best understood not as a fully solved editing framework, but as a rectified-flow-based consolidation of generation, reference conditioning, and iterative revision into one latent-space interface.
