EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture

Published 4 Dec 2025 in cs.CV | (2512.04810v2)

Abstract: We propose EMMA, an efficient and unified architecture for multimodal understanding, generation and editing. Specifically, EMMA primarily consists of 1) An efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation. This also ensures the training balance between understanding and generation tasks by applying the same compression ratio to images. 2) Channel-wise concatenation instead of token-wise concatenation among visual understanding and generation tokens, which further reduces the visual tokens in unified architectures. 3) A shared-and-decoupled network that enables mutual improvements across tasks while meeting the task-specific modeling requirements. 4) A mixture-of-experts mechanism adopted for visual understanding encoder, which substantially improves perceptual capabilities with a few parameters increase. Extensive experiments have shown that EMMA-4B can significantly outperform state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, while also achieving competitive results compared to recent multimodal understanding and generation experts (e.g., Qwen3-VL and Qwen-Image). We believe that EMMA lays a solid foundation for the future development of unified multimodal architectures.

Summary

  • The paper presents EMMA, a unified multimodal model that improves efficiency via high-compression visual encoding and channel-wise token concatenation.
  • It leverages a shared-and-decoupled backbone and a mixture-of-experts visual understanding encoder to balance semantic understanding and fine-grained generative detail.
  • Experimental results show EMMA’s superior performance on multimodal benchmarks while using significantly fewer tokens compared to previous models.

EMMA: A Unified and Efficient Architecture for Multimodal Understanding, Generation, and Editing

Introduction and Motivation

Unified multimodal architectures have become pivotal for advancing research at the intersection of vision and language, with significant implications for both embodied AI and AGI. Despite the proliferation of unified frameworks, existing approaches are frequently bottlenecked by inefficient token utilization, imbalanced optimization across modalities, and suboptimal architectural unification. The paper "EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture" (2512.04810) addresses these deficiencies through a jointly optimized architecture—EMMA—designed for scalable multimodal understanding, generation, and editing capabilities.

Architectural Innovations

EMMA’s core advances are realized through system-level architectural decisions targeting both efficiency and expressivity:

  • Efficient Visual Encoding: EMMA deploys a high-compression autoencoder (DCAE) with a 32× compression ratio for the generation branch and matches the same 32× ratio in the understanding branch (SigLIP2 features with 2×2 pixel shuffle). This ensures cross-task balance in training and drastically lowers the number of processed visual tokens, significantly improving computational efficiency.
  • Channel-wise Token Concatenation: In contrast to prior unified models (e.g., BAGEL) that concatenate visual tokens along the sequence dimension, EMMA achieves fusion via channel-wise concatenation. This design reduces context token counts without diminishing cross-modal feature interaction, enabling stronger performance with lower computation (a minimal code sketch of this fusion follows Figure 1).
  • Shared-and-Decoupled Backbone: The model backbone is configured to share shallower layers across tasks, leveraging mutual inductive bias, while deeper layers are decoupled to optimize for the idiosyncratic requirements of understanding (semantic modeling) and generation (semantic plus high-frequency detail modeling). This enables parameter sharing without negative cross-task interference.
  • Mixture-of-Experts (MoE) Visual Encoder: Incorporating a router-augmented MoE into the SigLIP2-based visual understanding encoder allows dynamic routing to a STEM-domain expert or a general expert submodule. This accommodates the diversity of multimodal data (notably STEM images), enhancing overall perceptual capability with a minimal parameter increase.

    Figure 1: The EMMA architecture leverages adapters and a decoupled backbone to balance understanding and generation tasks within a single, unified multimodal model.
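
To make the token-budget difference of channel-wise fusion concrete, here is a minimal PyTorch-style sketch of the two strategies. The shapes, dimensions, and projection layers are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

# Illustrative setup: a 1024x1024 image at 32x compression gives a 32x32
# latent grid, i.e. 1024 visual tokens per branch (assumed shapes).
num_tokens, d_und, d_gen, d_model = 1024, 1152, 128, 2048

und_tokens = torch.randn(1, num_tokens, d_und)  # understanding features (SigLIP2-like)
gen_tokens = torch.randn(1, num_tokens, d_gen)  # generation latents (DCAE-like)

proj_und = nn.Linear(d_und, d_model)
proj_gen = nn.Linear(d_gen, d_model)

# Token-wise concatenation (BAGEL-style): the sequence length doubles.
token_wise = torch.cat([proj_und(und_tokens), proj_gen(gen_tokens)], dim=1)
print(token_wise.shape)  # torch.Size([1, 2048, 2048]) -> twice the context tokens

# Channel-wise concatenation (EMMA-style): features are fused per spatial
# position, so the sequence length stays at num_tokens.
fuse = nn.Linear(d_und + d_gen, d_model)  # hypothetical fusion adapter
channel_wise = fuse(torch.cat([und_tokens, gen_tokens], dim=-1))
print(channel_wise.shape)  # torch.Size([1, 1024, 2048]) -> same token count, wider channels
```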

Data Construction Pipelines

EMMA’s results hinge on large, diverse, and carefully curated multimodal corpora for understanding, text-to-image (T2I) generation, and image editing:

  • General Editing Pipeline: For image editing, instruction-image pairs are synthesized by using vision-language models (VLMs) to generate and validate editing instructions, with downstream image-editing models producing the edited outputs. Post-filtering, including face-similarity checks for portraits, ensures instruction fidelity.

    Figure 2: The pipeline for constructing general image editing data tightly couples VLM-based instruction synthesis with model output validation.

  • Text Editing Pipeline: For images featuring text, EMMA’s pipeline uses OCR to localize text, synthesizes plausible edit instructions (random replacement or removal), and validates edit completion, yielding challenging text-edit data for both training and evaluation (a simplified sketch of this validation stage follows Figure 3).

    Figure 3: The text editing data pipeline integrates text detection, instruction generation, edit synthesis, and rigorous OCR-based validation.
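
The sketch below shows what such a post-filtering stage could look like in code. All thresholds, helper functions, and field names are hypothetical stand-ins for the VLM judges, face-similarity models, and OCR checks described above:

```python
from dataclasses import dataclass

# Hypothetical thresholds; the paper does not specify exact values.
FACE_SIM_THRESHOLD = 0.6
OCR_MATCH_THRESHOLD = 0.9

@dataclass
class EditSample:
    source_image: str   # path to the original image
    instruction: str    # synthesized editing instruction
    edited_image: str   # output of the image-editing model
    is_portrait: bool
    has_text_edit: bool

def vlm_judges_fidelity(sample: EditSample) -> bool:
    return True   # stub: would ask a VLM whether the edit follows the instruction

def face_similarity(a: str, b: str) -> float:
    return 1.0    # stub: would compare face embeddings (e.g. an ArcFace-style model)

def ocr_match(image: str, instruction: str) -> float:
    return 1.0    # stub: would OCR the edited image and score the requested text change

def keep(sample: EditSample) -> bool:
    """Post-filtering sketch: keep only samples that pass all applicable checks."""
    if not vlm_judges_fidelity(sample):
        return False
    if sample.is_portrait and face_similarity(sample.source_image, sample.edited_image) < FACE_SIM_THRESHOLD:
        return False
    if sample.has_text_edit and ocr_match(sample.edited_image, sample.instruction) < OCR_MATCH_THRESHOLD:
        return False
    return True

sample = EditSample("src.jpg", "remove the lamp", "edited.jpg", is_portrait=False, has_text_edit=False)
print(keep(sample))  # True with the stub checks above
```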

Experimental Results

EMMA-4B demonstrates a strong empirical profile across a comprehensive suite of multimodal benchmarks:

  • Multimodal Understanding: EMMA achieves an MMVet score of 73.0 and MMBench score of 85.8, outperforming BAGEL-7B (67.2 MMVet, 85.0 MMBench) despite using significantly fewer parameters and one-fifth the visual context tokens.
  • Text-to-Image Generation: EMMA delivers a GenEval score of 0.91 (0.93 with prompt rewriting), surpassing BAGEL-7B (0.82/0.88) and Qwen-Image (0.87), and attains 85.6 on DPG-Bench, rivaling models such as Qwen-Image.
  • Image Editing: On GEdit-Bench-EN, EMMA achieves 6.53 overall, exceeding or matching prior unified models while using only one-fifth of the visual tokens of competing approaches.
  • Generalization: Despite being trained predominantly on English instruction-editing pairs, EMMA natively supports editing instructions in Chinese and produces consistent results even under complex multi-step instructions.

    Figure 4: EMMA demonstrates high-fidelity text-to-image generation given complex and open-vocabulary textual prompts.

    Figure 5: EMMA’s editing results maintain subject consistency and instruction fidelity, even with multilingual or composite instructions—despite a training focus on single-English instructions.

Emergent Capabilities and Analysis

Two notable emergent phenomena are detailed:

  1. Cross-Lingual and Complex Instruction Adherence: Without explicit Chinese editing or compositional instruction data, EMMA generalizes robustly to these domains. This is attributed to the architecture's end-to-end unification allowing robust transfer from understanding-heavy training data distributions.
  2. Token Efficiency and Performance Tradeoff: EMMA’s reduced token count (enabled by high compression and channel concatenation) does not sacrifice performance. In image editing, EMMA uses one-fifth the token count of older approaches while attaining superior or equivalent evaluation scores, directly contradicting the widespread assumption that effective multimodal systems require extensive context tokens.

Theoretical and Practical Implications

EMMA establishes new precedents for efficiency, generalization, and modularity in unified multimodal models. The shared-and-decoupled architecture presents a compelling blueprint for optimization in scenarios with disparate task-specific characteristics, reminiscent of domain-adaptive transformer advances. The MoE design for the vision encoder confirms that methodologically lightweight expert routing suffices for robust generalization across visual domains, including STEM cases that typically degrade model-wide performance in large-scale VLMs.
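
As a rough illustration of this kind of lightweight expert routing, the sketch below performs top-1, image-level routing between a general expert and a STEM expert. The dimensions, routing granularity, and MLP experts are assumptions rather than EMMA's published module:

```python
import torch
import torch.nn as nn

class TwoExpertVisionMoE(nn.Module):
    """Minimal sketch: route each image's tokens to a general or a STEM expert."""

    def __init__(self, dim: int = 1152, hidden: int = 4608):
        super().__init__()
        self.router = nn.Linear(dim, 2)  # scores for [general, STEM]
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(2)
        ])

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim); route per image using pooled features.
        scores = self.router(tokens.mean(dim=1))   # (batch, 2)
        choice = scores.argmax(dim=-1)             # (batch,) expert index per image
        out = torch.empty_like(tokens)
        for i, expert_idx in enumerate(choice.tolist()):
            out[i] = self.experts[expert_idx](tokens[i])
        return out

moe = TwoExpertVisionMoE()
x = torch.randn(2, 1024, 1152)
print(moe(x).shape)  # torch.Size([2, 1024, 1152])
```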

Practically, the architecture enables deployment of competitively performant multimodal systems in both server and edge settings due to lower memory and computational footprints. The compositional training pipeline, bolstered by synthetic data generation and rigorous automated filtering, demonstrates a scalable recipe for future unified architectures addressing broader domains (e.g., video-text, multimodal RL).

Future Directions

EMMA signals several avenues for future research:

  • Expanding MoE Expert Pools: The pipeline for dynamically routing visual tokens to downstream domain-specific experts beyond STEM can be generalized for continual learning and robust in-the-wild deployment.
  • Improving Editing Consistency Metrics: The current image-editing metrics exhibit limitations, particularly in measuring subject consistency. New objective criteria and benchmarks for region-based editing are warranted.
  • Unified Training Paradigms: The cross-paradigm architecture, wherein next-token prediction and diffusion-based generative objectives co-exist, may inspire even broader joint training paradigms, unifying modality conditioning in a more principled fashion across both discrete and continuous domains.

Conclusion

EMMA delivers an integrated approach for large unified multimodal models, balancing token efficiency, empirical performance, and architectural flexibility. The innovations in visual token compression, channel-wise fusion, and mixture-of-experts routing, validated by consistent outperformance and cross-lingual generalization, make EMMA an instructive model for future work aiming to unify multimodal understanding, generation, and editing in a resource-efficient and extensible manner.

Explain it Like I'm 14

Overview

This paper introduces EMMA, a single AI system that can understand images, create images from text, and edit images—all inside one unified model. Think of it as a smart “all-in-one” tool that reads, draws, and edits pictures efficiently and well.

Objectives

The researchers wanted to solve a big problem: it’s hard to train one model to do both understanding (like answering questions about a picture) and generation (like making or editing a picture) without wasting time and computer power. Their goals were to:

  • Build one model that handles understanding, generation, and editing together.
  • Make it fast and efficient by reducing how much information the model needs to process.
  • Keep or improve quality compared to larger, specialized models.

Methods and Approach

To reach these goals, EMMA uses a few clever design ideas. Here’s what they did, explained with simple analogies:

Key ideas in simple terms

  • High compression for images: They use an “autoencoder,” which is like a shrink-and-expand machine for pictures. It squashes a big image into a small, compact form (32× smaller) and later rebuilds it. This makes the model much faster because it handles fewer “pieces” of information (see the quick math after this list).
  • Fewer “pieces” to process (tokens): Tokens are like tiny puzzle pieces of information the model reads. EMMA reduces the number of visual tokens needed, so it can work quicker without losing important details.
  • Smart stacking of information (channel-wise concatenation): Imagine stacking two layers of a sandwich instead of making two full sandwiches. EMMA stacks the understanding info (what’s in the picture) with the generation info (how to draw details) without increasing the number of tokens. This keeps things efficient and fused together.
  • Shared-and-decoupled network: Early layers in the model are shared—like basic classes all students take. Later layers split into specialized paths—like choosing a major—so the understanding part and the generation part can learn what they each need best.
  • Mixture of Experts (MoE) for tricky images: The model includes specialists for certain image types (like math and science diagrams). A “router” picks the right expert when needed, like a guidance counselor directing you to the right tutor.
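
To see why compression matters so much, here is some quick back-of-the-envelope arithmetic. The exact numbers depend on the model’s settings, so treat this purely as an illustration:

```python
# Rough token-count arithmetic for a 1024x1024 image (illustrative numbers).
image_side = 1024

for compression in (8, 16, 32):
    latent_side = image_side // compression
    tokens = latent_side * latent_side
    print(f"{compression}x compression -> {latent_side}x{latent_side} grid = {tokens} visual tokens")

# Prints:
# 8x compression -> 128x128 grid = 16384 visual tokens
# 16x compression -> 64x64 grid = 4096 visual tokens
# 32x compression -> 32x32 grid = 1024 visual tokens
```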

How it learns:

  • For understanding (I2T), it learns by predicting the next word, similar to how chatbots learn to continue a sentence.
  • For generating images (T2I and editing), it uses a “flow” method, which is like planning a smooth path to turn text into a final picture step by step (a simplified code sketch follows this list).
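
For readers who want to peek under the hood, here is a tiny, simplified sketch of what a generic flow-matching training step with velocity prediction can look like. It is an illustrative reconstruction of the general technique, not EMMA’s actual training code:

```python
import torch

def flow_matching_loss(model, x1: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    """Generic flow-matching step with velocity prediction (illustrative).

    x1:   clean image latents, shape (batch, tokens, dim)
    cond: conditioning features (e.g. text embeddings), passed through to the model
    """
    x0 = torch.randn_like(x1)                 # pure noise
    t = torch.rand(x1.shape[0], 1, 1)         # random time in [0, 1)
    xt = (1 - t) * x0 + t * x1                # point on the straight path from noise to data
    target_velocity = x1 - x0                 # constant velocity along that path
    predicted_velocity = model(xt, t, cond)   # the network predicts the velocity
    return ((predicted_velocity - target_velocity) ** 2).mean()

# Hypothetical usage with a stand-in model:
model = lambda xt, t, cond: torch.zeros_like(xt)
latents = torch.randn(2, 1024, 128)
text_emb = torch.randn(2, 77, 2048)
print(flow_matching_loss(model, latents, text_emb))  # a scalar loss tensor
```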

Training data and steps:

  • EMMA trains on huge sets of multimodal data (images with text), high-quality image generation data, and image editing examples.
  • It trains in stages: align visual features, pre-train, fine-tune, polish quality, then add and train the special STEM expert and the router.

Main Findings and Why They’re Important

The paper reports strong results:

  • Better performance with a smaller model: EMMA uses a 4-billion-parameter LLM but beats larger unified models. For example, it scores 73.0 on MMVet (a tough vision test) versus 67.2 for BAGEL-7B.
  • Strong image generation: On GenEval (a test of how well images match the text), EMMA scores 0.91 without tricks like prompt rewriting, and 0.93 with it—better than many unified models and competitive with specialized image generators.
  • Efficient image editing: On GEdit, EMMA slightly edges out similar unified models while using only about 20% of the visual tokens they need (around 5× fewer tokens for editing tasks), making it faster and cheaper to run.

Why this matters:

  • EMMA shows you can have one unified model that does multiple hard tasks well, without being huge or slow.
  • It proves smart design (compression, stacking, shared-then-specialized layers, experts) can beat brute force size.

Implications and Potential Impact

  • Faster, cheaper AI systems: Because EMMA needs fewer tokens and smarter compression, it can run more efficiently—great for real-world apps, especially on limited hardware.
  • One model for many jobs: EMMA’s unified design hints at future AI that can read, reason, draw, and edit—moving toward more general-purpose AI.
  • Better cross-modal tools: EMMA can handle complex instructions and even support other languages (like Chinese) in generation and editing, showing useful “emergent” abilities.
  • Raises the bar for evaluation: The authors point out current editing tests don’t fully measure “subject consistency” (keeping the same person/object intact), which suggests we need better benchmarks to fairly judge image editing quality.

In short, EMMA is a practical step toward powerful, unified multimodal AI that’s both efficient and capable—helping bridge the gap between understanding and creating visual content in one model.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a focused list of what remains missing, uncertain, or unexplored in the paper, written to be actionable for follow‑up research.

  • Quantify the trade‑offs of the 32× compression in both understanding and generation:
    • Systematically ablate compression ratios (e.g., 8×/16×/32×) at fixed compute and measure effects on high‑frequency detail, small objects, text rendering accuracy, and spatial localization.
    • Evaluate identity/subject preservation in generation and editing under different compression settings using face similarity (e.g., ArcFace/CosFace), LPIPS, FID/KID, and pick‑score/human studies.
  • Validate spatial alignment assumptions behind channel‑wise concatenation:
    • Confirm that Und‑Enc (SigLIP2 + 2×2 pixel shuffle) and Gen‑Enc (DCAE) latents align to the same spatial grid across native resolutions and aspect ratios.
    • Investigate learned alignment modules (e.g., deformable alignment, cross‑attention alignment) versus raw channel concatenation, and compare against token‑wise fusion, cross‑modal attention, and gated fusion.
  • Provide ablations for the shared‑and‑decoupled architecture:
    • Explore how the depth/extent of shared layers affects task performance, stability, and negative transfer.
    • Examine gradient interference between next‑token prediction (Und) and flow‑matching (Gen) in shared layers; test gradient surgery, loss weighting, alternating schedules, or adapters to mitigate conflicts.
  • Clarify and assess the MoE routing and expert design:
    • Specify router features, training signals, and misrouting rates; study calibration and uncertainty.
    • Evaluate more domains beyond STEM (e.g., medical, documents, charts, OCR, UI) and consider token‑/region‑level routing versus image‑level routing.
    • Analyze expert load balancing, memory/latency overhead, and dynamic expert expansion policies.
  • Positional encoding choices remain underexplored:
    • Ablate the combination of 2D positional encoding for visual tokens with 1D RoPE in the LLM, and compare alternatives (pure 2D RoPE, rotary hybrids, learned 2D embeddings).
    • Measure impacts on spatial reasoning (e.g., MMVet spatial subsets, synthetic layout tasks) and on variable‑resolution/aspect‑ratio inputs.
  • Attention masking strategy merits deeper analysis:
    • Study the impact of hybrid attention (visual tokens attending each other within an image for Gen) on multi‑image contexts, multi‑turn editing, and long visual sequences.
    • Evaluate alternative masking patterns (e.g., block causal, bidirectional windows) on both alignment (GenEval) and image quality (DPG‑Bench, aesthetic metrics).
  • Editing evaluation and metrics are insufficiently addressed:
    • Propose and validate metrics for subject/identity consistency, localized accuracy (mask fidelity), and instruction adherence beyond VLM judgments (e.g., segmentation overlap, localized CLIP‑based checks, OCR for text edits).
    • Create a benchmark with controlled identity preservation scenarios and region‑based edits to avoid metric gaming and to reflect real‑world editing constraints.
  • Data transparency and potential leakage:
    • Document internal datasets, deduplication procedures, and steps taken to prevent test set contamination (e.g., MMVet/GenEval/GEdit).
    • Analyze dataset composition biases (e.g., portrait vs. non‑portrait, languages, domains) and their effect on performance and fairness.
  • Efficiency claims lack end‑to‑end measurements:
    • Report wall‑clock training/inference latency, memory footprint, token throughput, and energy usage versus baselines (e.g., BAGEL‑7B, Qwen‑Image) under matched settings.
    • Quantify how much of the observed gains come specifically from token reduction versus architectural or data differences.
  • High‑resolution generation/editing scalability is unclear:
    • Evaluate performance and stability at 2K–8K resolutions (T2I and IT2I), including tiling/consistency across tiles, detail recovery, and runtime memory behavior.
  • Role of the frozen DCAE in overall capability:
    • Examine end‑to‑end fine‑tuning of DCAE versus keeping it frozen; quantify impacts on fidelity, robustness, and editing precision.
    • Compare different AEs/VAEs trained for 32× with reconstruction/regularization trade‑offs.
  • Cross‑lingual capabilities are anecdotal:
    • Conduct systematic multilingual evaluations (Chinese and additional languages) for understanding, T2I, and IT2I, including text rendering quality, instruction adherence, and editing correctness.
  • Safety, misuse, and provenance are not discussed:
    • Assess content safety (NSFW, harmful edits), demographic bias in generation/editing, watermarking/provenance, and mechanisms to prevent identity manipulation misuse.
  • Scaling laws and model size effects remain unexplored:
    • Characterize how performance scales with LLM size and visual encoder capacity; identify diminishing returns and optimal parameter allocations between Und/Gen branches and experts.
  • Negative transfer and synergy claims need quantification:
    • Provide controlled experiments showing when joint training improves one task via the other (and when it hurts), including ablations that remove Und‑Enc or Gen‑Enc, or train tasks separately before merging.
  • Robustness to challenging inputs is untested:
    • Evaluate performance under noise, blur, compression, occlusion, low‑light, small objects, adversarial perturbations, and domain shifts (e.g., medical, satellite).
  • Multi‑image/multi‑modal extensions are not addressed:
    • Investigate unified handling of video, audio, depth/3D, and multi‑image composition tasks (e.g., image‑pair reasoning, collage generation) within the same architecture and attention scheme.
  • Training curricula and sampling strategies need study:
    • Ablate the 1:1:1 mixture in SFT/QT, portrait/general balance, and STEM/general balance; explore curriculum learning and adaptive sampling based on performance gradients.
  • Reproducibility and open‑sourcing gaps:
    • Provide code, training recipes, checkpoints, and detailed hyperparameters (e.g., flow matching schedules, bucket sizes, augmentations) to enable third‑party replication and fair comparisons.

Practical Applications

Immediate Applications

Below are actionable, sector-linked use cases that can be deployed now, given EMMA’s reported performance, efficiency (32× compression, channel-wise fusion), and unified understanding–generation–editing capabilities.

  • Software/Creative tools (software, media/advertising)
    • Natural-language image editor and generator: integrate EMMA’s IT2I and T2I into design suites (e.g., plugins for Photoshop, Figma) to perform object addition/removal/replacement, background/tone transfer, text rendering, and multi-step edits from instructions.
    • Brand-safe creative co-pilot: generate on-brief concepts and edit assets while preserving subject consistency (leveraging EMMA’s better alignment scores without RL/prompt rewriting).
    • Assumptions/dependencies: GPU/accelerator availability, integration via an SDK/API, human review for brand safety, subject-consistency checks; licensing and access to EMMA weights and DCAE/SigLIP2 components.
  • E-commerce and retail (commerce, marketing)
    • Product image workflows: automatic background transfer, color correction, layout/position adherence, and text overlays that match prompt specifications; support “virtual try-on” style edits from catalog instructions.
    • Assumptions/dependencies: high-resolution inputs (supported buckets at ~1K), fairness/accuracy safeguards to avoid misrepresentation, batch processing pipelines, face-similarity filters for portraits (as in the paper’s editing data pipeline).
  • Document AI and analytics (finance, enterprise software)
    • Robust document and chart understanding: deploy EMMA for DocVQA, ChartQA, OCR-driven extraction, and visual reasoning on forms/tables to convert scanned documents into structured data and answers.
    • Assumptions/dependencies: domain-specific fine-tuning for document types, privacy/compliance controls (PII handling), reliable OCR integration and native-resolution support.
  • Education and training (education, edtech)
    • Visual tutoring and content generation: use EMMA’s math/diagram reasoning (MathVista/MMMU competitiveness) to explain problems with image context, generate illustrative images, and follow multi-step instructions.
    • Assumptions/dependencies: curriculum alignment, content moderation, multi-language support where needed (EMMA shows emergent Chinese instruction handling via its understanding branch).
  • Accessibility and assistive tech (public sector, consumer apps)
    • Image-to-text descriptions and alt-text generation: on-device or low-latency services that describe scenes, diagrams, and documents for visually impaired users; support Chinese and English instructions.
    • Assumptions/dependencies: model quantization for edge devices, latency targets, user privacy safeguards.
  • Customer support and field service (enterprise, IoT/edge)
    • Visual ticket triage and guided fixes: interpret user-submitted images, annotate issues, and generate edited examples showing corrective steps (e.g., highlight components, overlay instructions).
    • Assumptions/dependencies: domain fine-tuning (equipment, components), secure handling of customer data, workflow integration with ticketing systems.
  • Content moderation and compliance (policy, platform safety)
    • Prompt–image alignment checks: use EMMA’s alignment strengths to pre-screen generated assets for adherence to specified attributes (colors, positions, counts) before publishing.
    • Assumptions/dependencies: threshold calibration, audit logs, bias testing; a governance layer to flag risky edits (e.g., identity manipulation).
  • Synthetic data creation pipelines (software, ML ops)
    • High-quality dataset generation: reproduce the paper’s editing data factory (VLM-instruction + editing + filtering + reversed pairs) to synthesize training data for text rendering, editing, and domain-specific tasks.
    • Assumptions/dependencies: filtering via VLM and face-similarity/OCR modules, license checks on source images, careful exclusion of datasets that harm subject consistency (e.g., GPT-Image-Edit-1.5M).
  • Edge/mobile deployment for multimodal tasks (devices, telecom)
    • Low-token inference for on-device understanding/editing: leverage 32× compression and channel-wise fusion to enable lightweight image captioning, simple edits, and localized generation on mobile/AR devices.
    • Assumptions/dependencies: quantization, hardware acceleration (NPUs), energy constraints, caching for positional encodings; acceptable quality at reduced context size.

Long-Term Applications

These use cases will likely require further research, scaling, domain adaptation, or ecosystem development (e.g., expanded MoE experts, improved metrics, video/3D extension).

  • Robotics and embodied AI (robotics)
    • Unified perception–instruction following: extend EMMA to 3D/video inputs for scene understanding and task execution; use generation/editing to plan or simulate changes (simulation-to-reality).
    • Assumptions/dependencies: temporal modeling, 3D perception, safety verification, low-latency control loops; additional experts (MoE) for robotics visuals.
  • AR/VR and spatial computing (software, consumer hardware)
    • Real-time generative overlays and environment edits: live augmentation of scenes (e.g., instructional overlays, visual corrections) guided by natural-language prompts.
    • Assumptions/dependencies: stringent latency budgets, streaming token pipelines, robust spatial priors beyond 2D positional encodings, device-side acceleration.
  • Healthcare imaging and reporting (healthcare)
    • Medical visual understanding and report assistance: domain-specific MoE experts to interpret radiology/pathology images and align with clinical narratives; careful use of editing for anonymization/annotation.
    • Assumptions/dependencies: medical-grade datasets, regulatory compliance (HIPAA/GDPR), rigorous validation; avoid generative edits that could mislead diagnostics.
  • Industrial inspection and energy (manufacturing, energy)
    • Automated reading of gauges/diagrams and guided maintenance: convert visual telemetry/diagrams into actionable instructions; generate annotated edits showing steps or risks.
    • Assumptions/dependencies: domain data, ruggedized edge deployment, safety certification, integration with SCADA/CMMS systems.
  • Finance and business intelligence (finance, enterprise analytics)
    • Chart reasoning and narrative generation at scale: explain financial charts, detect inconsistencies, and auto-generate dashboards with visuals created/edited per specifications.
    • Assumptions/dependencies: auditability, accuracy guarantees, provenance tracking of generated assets, compliance with disclosure standards.
  • Standards and policy for image editing (policy, governance)
    • Subject-consistency metrics and certification: develop benchmarks that evaluate identity/subject preservation; mandate watermarking and provenance for edited media.
    • Assumptions/dependencies: community adoption, regulator involvement, platform enforcement mechanisms; tooling to measure and report consistency.
  • Multimodal IDEs and agent platforms (software, developer tools)
    • Unified agent that handles understanding, generation, and editing in a single toolchain (reducing orchestration complexity across separate models).
    • Assumptions/dependencies: robust APIs, plugin ecosystems, strong guardrails, configurable attention strategies for mixed tasks.
  • Video generation and editing (media/entertainment)
    • Extend EMMA’s architecture to temporal data using similar compression and fusion strategies for high-frame-rate video editing/generation and instruction-based transformations.
    • Assumptions/dependencies: temporal coherence models, compute scaling, new loss functions (flow/diffusion) tuned for video; benchmarking beyond still images.
  • Multilingual expansion with task-aware experts (global markets, education)
    • Router-driven MoE for language-specific visual tasks (e.g., diagrams/text in diverse scripts), enabling native T2I/IT2I beyond English/Chinese.
    • Assumptions/dependencies: diverse multilingual datasets, language-aware routing, OCR/text rendering across scripts, safety for culturally sensitive content.
  • Privacy-preserving/federated deployments (public sector, enterprise)
    • On-premise EMMA variants with MoE tuned to local domains; federated learning for updates without moving sensitive images off-site.
    • Assumptions/dependencies: federated training infrastructure, secure adapters, differential privacy, compliance audits.

These applications build directly on EMMA’s innovations—32× compression via autoencoder, channel-wise token fusion, shared-and-decoupled architecture, and MoE vision experts—translating architectural efficiency and performance gains into concrete tools, workflows, and sectoral value. Each long-term item depends on further research or ecosystem maturation (e.g., domain data, metrics, safety), while immediate items can be productized with existing integration and governance practices.

Glossary

  • 1D RoPE: A rotary positional embedding applied along a single dimension to encode token order within transformer models. "processed using the positional embedding of 1D RoPE~\cite{rope}."
  • 2D positional encoding: A spatial positional encoding that injects 2D location information into visual tokens. "before feeding visual tokens into the LLM, the 2D positional encoding is applied to incorporate spatial priors."
  • 2x2 token merging strategy: A heuristic that merges nearby visual tokens in a 2x2 grid to reduce sequence length for generation models. "and 2x2 token merging strategy, EMMA only requires 1/4 visual tokens for the generation task."
  • Adapter: A lightweight projection module used to interface modality-specific encoders with a unified backbone. "only the adapter in Und branch is tuned."
  • Autoencoder (AE): A neural architecture that compresses and reconstructs images, often used to tokenize high-resolution visuals for generation. "the autoencoder (AE) with an 8x compression ratio (e.g., the AE in FLUX~\cite{flux2024})"
  • Casual masking: A masking scheme (commonly “causal”) where tokens only attend to previous positions to preserve autoregressive order. "Specifically, pure casual masking is utilized to both textual and visual tokens in Und tasks."
  • Channel dimension: The feature axis along which per-token vectors are concatenated to fuse information without increasing token count. "concatenated along the channel dimension instead of the token level like BAGEL~\cite{deng2025BAGEL}."
  • Channel-wise concatenation: Concatenating features along channels (not across tokens) to fuse understanding and generation features efficiently. "Channel-wise concatenation instead of token-wise concatenation among visual understanding and generation tokens, which further reduces the visual tokens in unified architectures."
  • Compression ratio: The factor by which an encoder reduces image size or token count, balancing efficiency and fidelity. "An efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation."
  • Cross-task parameter sharing: Sharing subsets of parameters across tasks to encourage mutual improvement while preserving task-specific behavior. "introduces cross-task parameter sharing at shallow layers to facilitate mutual improvements between multimodal tasks,"
  • DCAE: A deep compression autoencoder optimized for high-resolution diffusion models to produce compact yet informative visual latents. "EMMA first uses a higher-compression autoencoder (DCAE~\cite{chen2024deep}, 32x compression), that aligns the compression ratio in the visual understanding branch."
  • Flow matching: A training objective for generative models that learns a velocity field to transform noise into data distributions. "For Gen tasks, EMMA utilizes flow matching with velocity prediction."
  • Gen-Enc (generation encoder): The visual encoder branch specialized for image generation, producing tokens optimized for synthesis. "For Gen-Enc, we employ a high-compression autoencoder, DCAE~\cite{dcae}, with a 32× compression ratio."
  • High-frequency details: Fine-grained visual information (textures/edges) critical for photorealistic image synthesis. "generation (semantics and high-frequency details modeling~\cite{ma2025deco}) tasks."
  • Hybrid attention strategy: An attention configuration that mixes different masking rules across modalities or token types. "Following Transfusion~\cite{zhou2024transfusion}, EMMA adopts a hybrid attention strategy."
  • LLM: A large language model; in EMMA, the transformer backbone that processes visual tokens and text jointly for multimodal tasks. "followed by a 2x2 pixel shuffle strategy~\cite{internvl15} to further reduce these tokens to one quarter before feeding them into the LLM."
  • Mix-of-experts (MoE): A modular architecture where a router selects among specialized experts to process inputs more effectively. "EMMA also integrates a mix-of-experts (MoE) strategy into the visual understanding encoder"
  • Next-token prediction: An autoregressive learning paradigm where models predict the subsequent token in a sequence. "As for Und tasks, EMMA utilizes the next-token prediction mechanism to guide overall learning."
  • Parameter decoupling: Keeping subsets of parameters independent across tasks to satisfy distinct modeling requirements. "employing parameter decoupling at deeper layers to meet the distinct modeling requirements of understanding and generation."
  • Patchfy operation: The (patchify) process that splits images into fixed-size patches to form input tokens for vision transformers. "Und-Enc, through the patchfy operation in SigLIP2 and pixel shuffling strategy, achieves a 32× compression ratio of the input image."
  • Pixel shuffle: A transformation that reorganizes spatial information to reduce token count while preserving structure (e.g., 2x2). "followed by a 2x2 pixel shuffle strategy~\cite{internvl15} to further reduce these tokens to one quarter"
  • Positional embeddings: Learned vectors added to tokens to encode their positions in sequences or grids. "we extend its capability to support native resolutions of input images by interpolating the positional embeddings."
  • Router module: A gating component that routes inputs to the most suitable expert in a mixture-of-experts system. "with a router module dynamically selecting the appropriate expert for each input."
  • Self-attention layers: Transformer layers that compute token-to-token dependencies for cross-modal interaction. "Moreover, cross-modal interactions are achieved through self-attention layers."
  • SigLIP2: A modern vision-language encoder variant of SigLIP with improved semantic understanding and localization capabilities. "we directly utilize SigLIP2-so400m-patch16-512~\cite{siglip2} as Und-Enc."
  • STEM expert: A specialized MoE expert tailored to scientific/technical visual data (charts, equations, diagrams). "EMMA additionally introduces a STEM (science, technology, engineering, and math) expert to process STEM images,"
  • Token-wise concatenation: Concatenating along the token axis, increasing sequence length and computational cost. "Channel-wise concatenation instead of token-wise concatenation among visual understanding and generation tokens,"
  • Und-Enc (understanding encoder): The visual encoder branch focused on extracting semantic features for multimodal understanding. "For Und-Enc, recent works generally adopt SigLIP~\cite{siglip,siglip2} to encode images as visual tokens"
  • Visual tokens: Discrete vectors representing image patches/features for integration with LLMs. "recent works generally adopt SigLIP~\cite{siglip,siglip2} to encode images as visual tokens"
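
To ground the pixel-shuffle and 2x2 token-merging entries above, here is a minimal PyTorch illustration; the feature-map shape is an assumed example, not EMMA's exact configuration:

```python
import torch
import torch.nn.functional as F

# A 2x2 pixel (un)shuffle trades spatial resolution for channels:
# (B, C, H, W) -> (B, 4*C, H/2, W/2), i.e. 4x fewer spatial tokens.
features = torch.randn(1, 1152, 32, 32)                 # assumed SigLIP2-style feature map
merged = F.pixel_unshuffle(features, downscale_factor=2)
print(features.shape, "->", merged.shape)               # (1, 1152, 32, 32) -> (1, 4608, 16, 16)

tokens_before = features.shape[2] * features.shape[3]   # 1024 tokens
tokens_after = merged.shape[2] * merged.shape[3]        # 256 tokens, i.e. one quarter
print(tokens_before, "->", tokens_after)
```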
