
Groma: MLLMs & Dislocation Models

Updated 21 February 2026
  • Groma is a term that encompasses two distinct frameworks: a multimodal LLM with integrated region tokenization for precise visual grounding and a set of nonlocal PDE models for dislocation densities in crystal plasticity.
  • The multimodal variant leverages state-of-the-art vision encoders, deformable detectors, and unified tokenization to enhance region captioning, visual grounding, and conversational VQA.
  • The Groma–Balogh model uses nonlocal transport equations with entropy methods to bridge discrete dislocation dynamics and continuum crystal plasticity, enabling rigorous mean-field analysis.

Groma designates two distinct concepts in the scientific literature: (1) a state-of-the-art framework for grounding and region-level visual perception in Multimodal LLMs (MLLMs), and (2) a family of nonlocal transport equations for dislocation densities in crystal plasticity, often called the "Groma–Balogh model." Both have provided foundational progress in their respective domains—multimodal AI and mathematical materials science—by innovating over standard architectures or mean-field approaches and offering new mathematical or algorithmic structures for fine-grained representation and prediction.

1. Groma in Multimodal LLMs

In the context of vision-LLMs, Groma is a Multimodal LLM with intrinsic region-level perception and grounding capability, specifically engineered to decompose visual inputs into localized tokens and jointly align these with language (Ma et al., 2024). The system addresses limitations in previous MLLMs that attempt localization post-hoc or through auxiliary modules, instead "lifting" the act of region localization into the visual tokenization pipeline itself. The unified architecture supports region captioning, visual grounding, and visually grounded dialogues with explicit referential correspondence between regions and text.

Architectural Components

  • Vision Encoder: Utilizes DINOv2-L/14 (448×448 input) to generate patch-level embeddings.
  • Region Proposer: Builds on a class-agnostic Deformable DETR (DDETR) head extracting up to 100 high-quality bounding box proposals via non-max suppression and objectness thresholding.
  • Region Encoder: Applies multi-scale RoIAlign over the DINOv2 feature pyramid, producing region tokens $R_i = \mathrm{ROIAlign}(\{F_k\}, b_i)$ matched in embedding dimension to patch tokens.
  • Connector and LLM: An MLP projects both region and patch tokens to the LLM space (Vicuna-7B v1.5, $H = 4096$). The tokens are concatenated with textual tokens and passed as a single context sequence; cross-modal self-attention remains unchanged.
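The components above can be sketched end to end. In the following toy example, plain average pooling stands in for multi-scale RoIAlign and a single linear map stands in for the MLP connector; all names and sizes are illustrative assumptions, not Groma's released code:

```python
import numpy as np

def region_tokens(feature_map, boxes, w_proj, b_proj):
    """Toy stand-in for Groma's region encoder: pool patch features
    inside each proposed box, then project to the LLM embedding size.

    feature_map: (H, W, C) patch-feature grid
    boxes: list of (x0, y0, x1, y1) in grid coordinates
    w_proj, b_proj: linear projection parameters, C -> H_llm
    """
    tokens = []
    for (x0, y0, x1, y1) in boxes:
        crop = feature_map[y0:y1, x0:x1]          # features under the box
        pooled = crop.mean(axis=(0, 1))           # crude RoI pooling
        tokens.append(pooled @ w_proj + b_proj)   # match patch-token dim
    return np.stack(tokens)

# Toy sizes: 32x32 patch grid, C=64 features, projected to H_llm=128
rng = np.random.default_rng(0)
feats = rng.normal(size=(32, 32, 64))
W, b = rng.normal(size=(64, 128)) * 0.01, np.zeros(128)
R = region_tokens(feats, [(2, 2, 10, 10), (5, 0, 20, 8)], W, b)
print(R.shape)  # (2, 128): one token per region, in LLM embedding space
```

The key design point this mirrors is that region tokens come out in the same embedding space as patch tokens, so the LLM can attend over both without architectural changes.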

Tokenization and Grounding Mechanisms

  • Region Tokens and Proxy Vocabulary: Each region token is paired with a discrete "proxy" text token (e.g., <r_i>) in the LLM vocabulary, binding lexical references to visual regions in both user instructions and model responses.
  • Unified Input Sequence: The transformer stack's context comprises $[T; V; R']$, where $T$ = text tokens, $V$ = image tokens, $R'$ = region tokens.
  • Output Grounding: The LLM emits <r_i> tokens when referring to regions; grounded output sequences employ tags (e.g., <roi><r_4></roi>) to index localized phrases.
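A minimal sketch of assembling the unified sequence $[T; V; R']$ with proxy tokens, using plain strings in place of real embeddings (the function and token names are hypothetical, for illustration only):

```python
def build_context(text_tokens, image_tokens, region_tokens):
    """Assemble a unified sequence in the spirit of [T; V; R'].

    Each region token is paired with a discrete proxy token <r_i> so
    the LLM can emit a lexical reference that indexes a visual region.
    Token contents here are strings purely for illustration.
    """
    proxies = [f"<r_{i}>" for i in range(len(region_tokens))]
    # In the real model, region embeddings and proxy tokens are bound
    # through the vocabulary; here we simply interleave their names.
    return text_tokens + image_tokens + [
        tok for pair in zip(proxies, region_tokens) for tok in pair
    ]

seq = build_context(["describe", "the", "scene"],
                    ["<img_0>", "<img_1>"],
                    ["region_emb_0", "region_emb_1"])
print(seq[-4:])  # ['<r_0>', 'region_emb_0', '<r_1>', 'region_emb_1']
```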

Training Pipeline

The model is trained in three stages:

  1. Detection Pretraining: Over 5.7M box annotations (COCO, Objects365, OpenImages, etc.), with the DDETR head trained against $\ell_1$ and generalized IoU losses; the backbone is frozen.
  2. Vision–Language Alignment: 3.2M samples covering image-caption, grounded caption, region caption, and referring expression comprehension (REC) tasks.
  3. Instruction Finetuning: 857K samples, including bespoke "Groma Instruct" dialogues generated via GPT-4V with explicit region referencing.
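The generalized IoU term used in stage 1 can be illustrated with a generic reimplementation of the standard GIoU formula (this is not code from Groma):

```python
def generalized_iou(a, b):
    """Generalized IoU between two boxes (x0, y0, x1, y1):
    GIoU = IoU - |C \\ (A u B)| / |C|, where C is the smallest
    enclosing box. GIoU stays informative (negative) even for
    disjoint boxes, unlike plain IoU, which is why DETR-style
    detectors pair it with an l1 box loss.
    """
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest enclosing box C
    c_area = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return iou - (c_area - union) / c_area

print(generalized_iou((0, 0, 2, 2), (0, 0, 2, 2)))      # 1.0 for identical boxes
print(generalized_iou((0, 0, 1, 1), (3, 3, 4, 4)) < 0)  # True: disjoint boxes penalized
```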

Empirical Performance

On standard region-level and grounding benchmarks, Groma outperforms prior generalist MLLMs and rivals specialist detection systems:

  • Referring Expression Comprehension: 86.52% average accuracy on RefCOCO/RefCOCO+/RefCOCOg, surpassing Qwen-VL (85.83%) and close to UNINEXT-L (86.95%).
  • LVIS-Ground: Average recall (AR) @ [.5:.95] of 28.8, with strong performance on large and medium objects.
  • Region Captioning: METEOR/CIDEr scores comparable to or better than state-of-the-art without task-specific tuning.
  • Conversational VQA: High scores across conversation, description, and reasoning tasks.

Ablation shows that most region-aware knowledge is embedded in the region proposer/encoder, and freezing the LLM except for brief finetuning suffices for state-of-the-art grounding (Ma et al., 2024).

2. The Groma–Balogh Model of Dislocation Densities

The Groma–Balogh equations are a class of non-local, nonlinear PDEs modeling the evolution of positive and negative dislocation densities in crystalline solids (0901.0219, Patrizi et al., 2022, Garroni et al., 2018, 0903.1559). This framework bridges discrete dislocation dynamics and macroscopic crystal plasticity, offering a rigorous description of the kinematics and interactions of extended defects.

Canonical Formulation

For $d = 1, 2$, the system describes two densities $\rho^+(x,t)$ and $\rho^-(x,t)$ (for positive/negative dislocations) evolved by nonlocal transport:

$$\begin{aligned} \partial_t \rho^+ &= \nabla \cdot \left( \rho^+ \, \nabla (V * (\rho^+ - \rho^-)) \right), \\ \partial_t \rho^- &= -\nabla \cdot \left( \rho^- \, \nabla (V * (\rho^+ - \rho^-)) \right), \end{aligned}$$

where $V(x) = -\log|x|$ (modulo periodicity) and the convolution is taken on the flat torus $\mathbb{T}^d$ (Garroni et al., 2018). The driving field $\sigma(x,t)$ is a nonlocal stress, given by Riesz or Hilbert transforms of the net density. In 2D, the transport velocity is $u(x, t) = R_1^2 R_2^2 (\rho^+ - \rho^-)(x, t)$, with $R_i$ the Riesz transforms.
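A minimal numerical sketch of one explicit-Euler step of the 1D system on the periodic domain, evaluated spectrally. The Fourier multiplier $\hat V(k) = 1/k^2$ used below is an assumed stand-in for the periodic $-\log|x|$ kernel, and the signs follow the transport equations quoted above:

```python
import numpy as np

def groma_balogh_step(rho_p, rho_m, dt, L=1.0):
    """One explicit-Euler step of the 1D nonlocal transport system,
    with all derivatives and the convolution computed by FFT."""
    n = rho_p.size
    k = 2 * np.pi * np.fft.fftfreq(n, d=L / n)
    vhat = np.zeros(n)
    vhat[1:] = 1.0 / k[1:] ** 2          # assumed kernel symbol (k != 0)
    net = np.fft.fft(rho_p - rho_m)
    grad_field = np.real(np.fft.ifft(1j * k * vhat * net))  # d/dx (V * net)

    def ddx(f):  # spectral derivative on the torus
        return np.real(np.fft.ifft(1j * k * np.fft.fft(f)))

    rho_p_new = rho_p + dt * ddx(rho_p * grad_field)
    rho_m_new = rho_m - dt * ddx(rho_m * grad_field)
    return rho_p_new, rho_m_new

x = np.linspace(0, 1, 128, endpoint=False)
rp = 1.0 + 0.2 * np.cos(2 * np.pi * x)
rm = 1.0 + 0.2 * np.sin(2 * np.pi * x)
rp1, rm1 = groma_balogh_step(rp, rm, dt=1e-3)
# Divergence-form updates conserve total dislocation content of each sign:
print(np.isclose(rp1.mean(), rp.mean()), np.isclose(rm1.mean(), rm.mean()))
```

Because both equations are in divergence form, the zero-frequency mode of each update vanishes, so the scheme preserves the total positive and negative dislocation content exactly, mirroring the conservation structure of the continuum model.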

Derivation from Physical Models

The 1D Groma–Balogh system is formally derived from the Peierls–Nabarro phase-field model via a multi-scale expansion: phase-field equations with fractional Laplacian nonlocality and Peierls potential converge, in the many-dislocation and small-core limit, to a transport system for signed densities (Patrizi et al., 2022). Dislocation interactions arise from the collective effect of long-range elastic forces, with opposite-sign pairs undergoing mutual attraction and possible annihilation—represented in continuum by the sign structure of velocities and not explicit sink terms.

Mathematical Structure and Analysis

Entropy and A Priori Estimates

  • Entropy-dissipation structure: The system admits an entropy (free-energy) functional in the densities $\theta^+ = \partial_1 \rho^+$, $\theta^- = \partial_1 \rho^-$:

$$E_{\text{ent}}(t) = \int_{\mathbb{T}^2} \left[ \theta^+ \ln \theta^+ + \theta^- \ln \theta^- \right] dx,$$

together with the entropy dissipation inequality

$$E_{\text{ent}}(t) + \int_0^t \int_{\mathbb{T}^2} \left[ R_1 R_2 (\theta^+ - \theta^-) \right]^2 dx \, ds \leq E_{\text{ent}}(0).$$

  • A priori control: Uniform-in-time estimates are established for the $L^2$ norms of the densities, their $L \log L$ norms, time derivatives, and the nonlocal fields (0901.0219).

Existence and Uniqueness

  • Global existence: Under periodicity conditions and regularity/monotonicity on initial data, the existence of global-in-time distributional solutions is established for the 2D periodic Groma–Balogh system (0901.0219).
  • Local well-posedness: For initial data in Hölder–Zygmund spaces $C^r \cap L^p$ with $r > 1$, the system has unique local-in-time solutions via commutator estimates and Littlewood–Paley theory (0903.1559).
  • Gradient flow formulation: The system can be interpreted as a $\lambda$-convex Wasserstein gradient flow of a (regularized) interaction energy, facilitating mean-field convergence from discrete particle dynamics (Garroni et al., 2018).

Discrete-to-Continuum Limit and Regularization Effects

  • Convergence: When the interaction potential is regularized at a lengthscale $\delta_n \to 0$ slowly enough, the empirical measures of discrete particle systems converge to the Groma–Balogh PDE solution in the Wasserstein-2 distance.
  • Non-convergence: If $\delta_n \ll n^{-1/d}$, dipoles form and become effectively immobilized, leading to empirical measures that do not solve the Groma–Balogh equation (Garroni et al., 2018). The rate of potential regularization is thus critical for the mean-field passage.
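The regularized particle dynamics underlying this limit can be sketched as follows. The smooth kernel $V_\delta(x) = -\tfrac{1}{2}\log(x^2 + \delta^2)$ is a common stand-in for $-\log|x|$, not necessarily the regularization of the cited work, and mobility constants are absorbed into the timestep:

```python
import numpy as np

def dislocation_step(x, b, dt, delta):
    """One Euler step of n signed dislocations under the gradient flow
    of E = sum_{i != j} b_i b_j V_delta(x_i - x_j), with the smooth
    kernel V_delta(x) = -0.5 * log(x^2 + delta^2). Same-sign pairs
    repel; opposite-sign pairs attract (and may form dipoles)."""
    dx = x[:, None] - x[None, :]                   # pairwise separations
    grad_v = -dx / (dx ** 2 + delta ** 2)          # V_delta'(x_i - x_j)
    np.fill_diagonal(grad_v, 0.0)                  # no self-interaction
    velocity = -b * (grad_v @ b)                   # gradient-descent drift
    return x + dt * velocity

x = np.array([0.4, 0.6])       # one positive, one negative dislocation
b = np.array([1.0, -1.0])
x1 = dislocation_step(x, b, dt=1e-3, delta=0.05)
print(x1[1] - x1[0] < x[1] - x[0])  # True: the opposite-sign pair attracts
```

Shrinking `delta` toward zero sharpens the attraction between the two particles, which is exactly the regime where dipole locking can obstruct convergence to the continuum equation.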

Table: Mathematical Features of Groma–Balogh PDEs

Feature            | 1D/2D Case                     | Mathematical Structure
Nonlocal term      | Hilbert/Riesz transforms       | $V(x) = -\log|x|$; $R_i$, $H$
Entropy functional | Yes                            | $E_{\text{ent}}$, $L \log L$ control
Existence          | Global (monotone data); local  | Compactness; a priori $L^2$ and entropy bounds
Gradient flow      | Yes                            | Wasserstein, $\lambda$-convex
Discrete limit     | Yes (critical $\delta_n$)      | Evolutionary convergence and counterexamples

3. Practical Applications

The Groma MLLM advances fine-grained visual-language reasoning, enabling:

  • Region captioning: Explicit grounding of descriptions to image subregions.
  • Visual grounding and referring expression comprehension: Interpreting and generating instructions localized to image regions, critical for human-robot interaction and assistive computer vision.
  • Long-form, grounded dialogues: Multiregion reasoning and response for both VQA and narrative tasks, including dense and occluded scenarios (Ma et al., 2024).

The Groma–Balogh equations are central to:

  • Crystal plasticity theory: Modeling the mesoscale evolution of dislocation structures in response to applied stress.
  • Bridging scales: Rigorously justifying macroscopic plastic flow laws from phase-field or particle-level crystal defect descriptions.
  • Benchmarking mean-field and statistical physics approaches in materials science.

4. Comparison with Related Approaches

  • Prior MLLMs (OFA, Shikra, Qwen-VL, MiniGPT-v2, Ferret): Groma uniquely integrates region proposal into tokenization, achieving superior region-level performance without adding complexity to the LLM head or relying on external localization modules. This decouples perceptual localization from reasoning and improves scalability to high-resolution inputs (Ma et al., 2024).
  • Specialist detectors (MDETR, G-DINO, UNINEXT-L): Groma competes on par with these systems in region-level tasks while retaining generalist multimodal dialogue capabilities.
  • Plasticity models (Kocks–Mecking, Nye–Kröner): Groma–Balogh provides a nonlocal, sign-sensitive mean-field evolution, capturing the repulsion/attraction dichotomy and pair annihilation without ad hoc sink terms (Patrizi et al., 2022).

5. Open Problems and Extensions

  • LLM adaptation: Further optimization of region-linguistic alignment, and robustness of grounding under domain adaptation or occlusion, remain open.
  • Continuum limits: Full global uniqueness, characterization of singularity-formation, and the influence of initial condition structure in the Groma–Balogh system are active research topics, especially in higher dimensions.
  • Rate of convergence: The interplay between discrete-particle regularization and continuum equation validity continues to produce subtle non-convergence scenarios requiring refined mathematical tools (Garroni et al., 2018).
  • Extension to 3D and complex slip geometries: Existing results focus primarily on 1D/2D and straight parallel dislocation lines.

A plausible implication is that both "Groma" architectures—whether referential visual tokenization in MLLMs or nonlocal PDEs for defect densities—represent a shift toward explicit, fine-grained modeling of localized structures within broader data modalities, providing templates for future research in their respective areas.
