
Groma: MLLMs & Dislocation Models

Updated 21 February 2026
  • Groma is a term that encompasses two distinct frameworks: a multimodal LLM with integrated region tokenization for precise visual grounding and a set of nonlocal PDE models for dislocation densities in crystal plasticity.
  • The multimodal variant leverages state-of-the-art vision encoders, deformable detectors, and unified tokenization to enhance region captioning, visual grounding, and conversational VQA.
  • The Groma–Balogh model uses nonlocal transport equations with entropy methods to bridge discrete dislocation dynamics and continuum crystal plasticity, enabling rigorous mean-field analysis.

Groma designates two distinct concepts in the scientific literature: (1) a state-of-the-art framework for grounding and region-level visual perception in Multimodal LLMs (MLLMs), and (2) a family of nonlocal transport equations for dislocation densities in crystal plasticity, often called the "Groma–Balogh model." Both have provided foundational progress in their respective domains—multimodal AI and mathematical materials science—by innovating over standard architectures or mean-field approaches and offering new mathematical or algorithmic structures for fine-grained representation and prediction.

1. Groma in Multimodal LLMs

In the context of vision-LLMs, Groma is a Multimodal LLM with intrinsic region-level perception and grounding capability, specifically engineered to decompose visual inputs into localized tokens and jointly align these with language (Ma et al., 2024). The system addresses limitations in previous MLLMs that attempt localization post-hoc or through auxiliary modules, instead "lifting" the act of region localization into the visual tokenization pipeline itself. The unified architecture supports region captioning, visual grounding, and visually grounded dialogues with explicit referential correspondence between regions and text.

Architectural Components

  • Vision Encoder: Utilizes DINOv2-L/14 (448×448 input) to generate patch-level embeddings.
  • Region Proposer: Builds on a class-agnostic Deformable DETR (DDETR) head extracting up to 100 high-quality bounding box proposals via non-max suppression and objectness thresholding.
  • Region Encoder: Applies multi-scale RoIAlign over the DINOv2 feature pyramid, producing region tokens $R_i = \mathrm{ROIAlign}(\{F_k\}, b_i)$ matched in embedding dimension to patch tokens.
  • Connector and LLM: An MLP projects both region and patch tokens to the LLM space (Vicuna-7B v1.5, $H = 4096$). The tokens are concatenated with textual tokens and passed as a single context sequence; cross-modal self-attention remains unchanged.
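The components above can be sketched end to end. In the following toy example, plain average pooling stands in for multi-scale RoIAlign and a single linear map stands in for the MLP connector; all names and sizes are illustrative assumptions, not Groma's released code:

```python
import numpy as np

def region_tokens(feature_map, boxes, w_proj, b_proj):
    """Toy stand-in for Groma's region encoder: pool patch features
    inside each proposed box, then project to the LLM embedding size.

    feature_map: (H, W, C) patch-feature grid
    boxes: list of (x0, y0, x1, y1) in grid coordinates
    w_proj, b_proj: linear projection parameters, C -> H_llm
    """
    tokens = []
    for (x0, y0, x1, y1) in boxes:
        crop = feature_map[y0:y1, x0:x1]          # features under the box
        pooled = crop.mean(axis=(0, 1))           # crude RoI pooling
        tokens.append(pooled @ w_proj + b_proj)   # match patch-token dim
    return np.stack(tokens)

# Toy sizes: 32x32 patch grid, C=64 features, projected to H_llm=128
rng = np.random.default_rng(0)
feats = rng.normal(size=(32, 32, 64))
W, b = rng.normal(size=(64, 128)) * 0.01, np.zeros(128)
R = region_tokens(feats, [(2, 2, 10, 10), (5, 0, 20, 8)], W, b)
print(R.shape)  # (2, 128): one token per region, in LLM embedding space
```

The key design point this mirrors is that region tokens come out in the same embedding space as patch tokens, so the LLM can attend over both without architectural changes.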

Tokenization and Grounding Mechanisms

  • Region Tokens and Proxy Vocabulary: Each region token is paired with a discrete "proxy" text token (e.g., <r_i>) in the LLM vocabulary, binding lexical references to visual regions in both user instructions and model responses.
  • Unified Input Sequence: The transformer stack's context comprises $[T; V; R']$, where $T$ = text tokens, $V$ = image tokens, $R'$ = region tokens.
  • Output Grounding: The LLM emits <r_i> tokens when referring to regions; grounded output sequences employ tags (e.g., <roi><r_4></roi>) to index localized phrases.
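A minimal sketch of assembling the unified sequence $[T; V; R']$ with proxy tokens, using plain strings in place of real embeddings (the function and token names are hypothetical, for illustration only):

```python
def build_context(text_tokens, image_tokens, region_tokens):
    """Assemble a unified sequence in the spirit of [T; V; R'].

    Each region token is paired with a discrete proxy token <r_i> so
    the LLM can emit a lexical reference that indexes a visual region.
    Token contents here are strings purely for illustration.
    """
    proxies = [f"<r_{i}>" for i in range(len(region_tokens))]
    # In the real model, region embeddings and proxy tokens are bound
    # through the vocabulary; here we simply interleave their names.
    return text_tokens + image_tokens + [
        tok for pair in zip(proxies, region_tokens) for tok in pair
    ]

seq = build_context(["describe", "the", "scene"],
                    ["<img_0>", "<img_1>"],
                    ["region_emb_0", "region_emb_1"])
print(seq[-4:])  # ['<r_0>', 'region_emb_0', '<r_1>', 'region_emb_1']
```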

Training Pipeline

The model is trained in three stages:

  1. Detection Pretraining: Over 5.7M box annotations (COCO, Objects365, OpenImages, etc.), with the DDETR head trained against $\ell_1$ and generalized IoU losses; the backbone is frozen.
  2. Vision–Language Alignment: 3.2M samples covering image-caption, grounded caption, region caption, and referring expression comprehension (REC) tasks.
  3. Instruction Finetuning: 857K samples, including bespoke "Groma Instruct" dialogues generated via GPT-4V with explicit region referencing.
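The generalized IoU term used in stage 1 can be illustrated with a generic reimplementation of the standard GIoU formula (this is not code from Groma):

```python
def generalized_iou(a, b):
    """Generalized IoU between two boxes (x0, y0, x1, y1):
    GIoU = IoU - |C \\ (A u B)| / |C|, where C is the smallest
    enclosing box. GIoU stays informative (negative) even for
    disjoint boxes, unlike plain IoU, which is why DETR-style
    detectors pair it with an l1 box loss.
    """
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest enclosing box C
    c_area = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return iou - (c_area - union) / c_area

print(generalized_iou((0, 0, 2, 2), (0, 0, 2, 2)))      # 1.0 for identical boxes
print(generalized_iou((0, 0, 1, 1), (3, 3, 4, 4)) < 0)  # True: disjoint boxes penalized
```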

Empirical Performance

On standard region-level and grounding benchmarks, Groma outperforms prior generalist MLLMs and rivals specialist detection systems:

  • Referring Expression Comprehension: 86.52% average accuracy on RefCOCO/RefCOCO+/RefCOCOg, surpassing Qwen-VL (85.83%) and close to UNINEXT-L (86.95%).
  • LVIS-Ground: Average recall (AR) @ [.5:.95] of 28.8, with strong performance on large and medium objects.
  • Region Captioning: METEOR/CIDEr scores comparable to or better than state-of-the-art without task-specific tuning.
  • Conversational VQA: High scores across conversation, description, and reasoning tasks.

Ablation shows that most region-aware knowledge is embedded in the region proposer/encoder, and freezing the LLM except for brief finetuning suffices for state-of-the-art grounding (Ma et al., 2024).

2. The Groma–Balogh Model of Dislocation Densities

The Groma–Balogh equations are a class of non-local, nonlinear PDEs modeling the evolution of positive and negative dislocation densities in crystalline solids (0901.0219, Patrizi et al., 2022, Garroni et al., 2018, 0903.1559). This framework bridges discrete dislocation dynamics and macroscopic crystal plasticity, offering a rigorous description of the kinematics and interactions of extended defects.

Canonical Formulation

For $d = 1, 2$, the system describes two densities $\rho^+(x,t)$ and $\rho^-(x,t)$ (for positive/negative dislocations) evolved by nonlocal transport:

$$\begin{aligned} \partial_t \rho^+ &= \nabla \cdot \left( \rho^+ \, \nabla (V * (\rho^+ - \rho^-)) \right), \\ \partial_t \rho^- &= -\nabla \cdot \left( \rho^- \, \nabla (V * (\rho^+ - \rho^-)) \right), \end{aligned}$$

where $V(x) = -\log|x|$ (modulo periodicity) and the convolution is taken on the flat torus $\mathbb{T}^d$ (Garroni et al., 2018). The driving field $\sigma(x,t)$ is a nonlocal stress, given by Riesz or Hilbert transforms of the net density. In 2D, the transport velocity is $u(x, t) = R_1^2 R_2^2 (\rho^+ - \rho^-)(x, t)$, with $R_i$ the Riesz transforms.
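A minimal numerical sketch of one explicit-Euler step of the 1D system on the periodic domain, evaluated spectrally. The Fourier multiplier $\hat V(k) = 1/k^2$ used below is an assumed stand-in for the periodic $-\log|x|$ kernel, and the signs follow the transport equations quoted above:

```python
import numpy as np

def groma_balogh_step(rho_p, rho_m, dt, L=1.0):
    """One explicit-Euler step of the 1D nonlocal transport system,
    with all derivatives and the convolution computed by FFT."""
    n = rho_p.size
    k = 2 * np.pi * np.fft.fftfreq(n, d=L / n)
    vhat = np.zeros(n)
    vhat[1:] = 1.0 / k[1:] ** 2          # assumed kernel symbol (k != 0)
    net = np.fft.fft(rho_p - rho_m)
    grad_field = np.real(np.fft.ifft(1j * k * vhat * net))  # d/dx (V * net)

    def ddx(f):  # spectral derivative on the torus
        return np.real(np.fft.ifft(1j * k * np.fft.fft(f)))

    rho_p_new = rho_p + dt * ddx(rho_p * grad_field)
    rho_m_new = rho_m - dt * ddx(rho_m * grad_field)
    return rho_p_new, rho_m_new

x = np.linspace(0, 1, 128, endpoint=False)
rp = 1.0 + 0.2 * np.cos(2 * np.pi * x)
rm = 1.0 + 0.2 * np.sin(2 * np.pi * x)
rp1, rm1 = groma_balogh_step(rp, rm, dt=1e-3)
# Divergence-form updates conserve total dislocation content of each sign:
print(np.isclose(rp1.mean(), rp.mean()), np.isclose(rm1.mean(), rm.mean()))
```

Because both equations are in divergence form, the zero-frequency mode of each update vanishes, so the scheme preserves the total positive and negative dislocation content exactly, mirroring the conservation structure of the continuum model.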

Derivation from Physical Models

The 1D Groma–Balogh system is formally derived from the Peierls–Nabarro phase-field model via a multi-scale expansion: phase-field equations with fractional Laplacian nonlocality and Peierls potential converge, in the many-dislocation and small-core limit, to a transport system for signed densities (Patrizi et al., 2022). Dislocation interactions arise from the collective effect of long-range elastic forces, with opposite-sign pairs undergoing mutual attraction and possible annihilation—represented in continuum by the sign structure of velocities and not explicit sink terms.

Mathematical Structure and Analysis

Entropy and A Priori Estimates

  • Entropy-dissipation structure: The system admits an entropy (free-energy) functional in the densities $\theta^+ = \partial_1 \rho^+$, $\theta^- = \partial_1 \rho^-$:

$$E_{\text{ent}}(t) = \int_{\mathbb{T}^2} \left[ \theta^+ \ln \theta^+ + \theta^- \ln \theta^- \right] dx,$$

together with the entropy dissipation inequality

$$E_{\text{ent}}(t) + \int_0^t \int_{\mathbb{T}^2} \left[ R_1 R_2 (\theta^+ - \theta^-) \right]^2 dx \, ds \leq E_{\text{ent}}(0).$$

  • A priori control: Uniform-in-time estimates are established for the $L^2$ norms of the densities, their $L \log L$ norms, time derivatives, and the nonlocal fields (0901.0219).

Existence and Uniqueness

  • Global existence: Under periodicity conditions and regularity/monotonicity on initial data, the existence of global-in-time distributional solutions is established for the 2D periodic Groma–Balogh system (0901.0219).
  • Local well-posedness: For initial data in Hölder–Zygmund spaces $C^r \cap L^p$ with $r > 1$, the system has unique local-in-time solutions via commutator estimates and Littlewood–Paley theory (0903.1559).
  • Gradient flow formulation: The system can be interpreted as a $\lambda$-convex Wasserstein gradient flow of a (regularized) interaction energy, facilitating mean-field convergence from discrete particle dynamics (Garroni et al., 2018).

Discrete-to-Continuum Limit and Regularization Effects

  • Convergence: When the interaction potential is regularized at a lengthscale $\delta_n \to 0$ slowly enough, the empirical measures of discrete particle systems converge to the Groma–Balogh PDE solution in the Wasserstein-2 distance.
  • Non-convergence: If $\delta_n \ll n^{-1/d}$, dipoles form and become effectively immobilized, leading to empirical measures that do not solve the Groma–Balogh equation (Garroni et al., 2018). The rate of potential regularization is thus critical for the mean-field passage.
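The regularized particle dynamics underlying this limit can be sketched as follows. The smooth kernel $V_\delta(x) = -\tfrac{1}{2}\log(x^2 + \delta^2)$ is a common stand-in for $-\log|x|$, not necessarily the regularization of the cited work, and mobility constants are absorbed into the timestep:

```python
import numpy as np

def dislocation_step(x, b, dt, delta):
    """One Euler step of n signed dislocations under the gradient flow
    of E = sum_{i != j} b_i b_j V_delta(x_i - x_j), with the smooth
    kernel V_delta(x) = -0.5 * log(x^2 + delta^2). Same-sign pairs
    repel; opposite-sign pairs attract (and may form dipoles)."""
    dx = x[:, None] - x[None, :]                   # pairwise separations
    grad_v = -dx / (dx ** 2 + delta ** 2)          # V_delta'(x_i - x_j)
    np.fill_diagonal(grad_v, 0.0)                  # no self-interaction
    velocity = -b * (grad_v @ b)                   # gradient-descent drift
    return x + dt * velocity

x = np.array([0.4, 0.6])       # one positive, one negative dislocation
b = np.array([1.0, -1.0])
x1 = dislocation_step(x, b, dt=1e-3, delta=0.05)
print(x1[1] - x1[0] < x[1] - x[0])  # True: the opposite-sign pair attracts
```

Shrinking `delta` toward zero sharpens the attraction between the two particles, which is exactly the regime where dipole locking can obstruct convergence to the continuum equation.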

Table: Mathematical Features of Groma–Balogh PDEs

Feature            | 1D/2D Case                     | Mathematical Structure
Nonlocal term      | Hilbert/Riesz transforms       | $V(x) = -\log|x|$; $R_i$, $H$
Entropy functional | Yes                            | $E_{\text{ent}}$, $L \log L$ control
Existence          | Global (monotone data); local  | Compactness; a priori $L^2$ and entropy bounds
Gradient flow      | Yes                            | Wasserstein, $\lambda$-convex
Discrete limit     | Yes (critical $\delta_n$)      | Evolutionary convergence and counterexamples

3. Practical Applications

The Groma MLLM advances fine-grained visual-language reasoning, enabling:

  • Region captioning: Explicit grounding of descriptions to image subregions.
  • Visual grounding and referring expression comprehension: Interpreting and generating instructions localized to image regions, critical for human-robot interaction and assistive computer vision.
  • Long-form, grounded dialogues: Multiregion reasoning and response for both VQA and narrative tasks, including dense and occluded scenarios (Ma et al., 2024).

The Groma–Balogh equations are central to:

  • Crystal plasticity theory: Modeling the mesoscale evolution of dislocation structures in response to applied stress.
  • Bridging scales: Rigorously justifying macroscopic plastic flow laws from phase-field or particle-level crystal defect descriptions.
  • Benchmarking mean-field and statistical physics approaches in materials science.

4. Comparison with Related Approaches

  • Prior MLLMs (OFA, Shikra, Qwen-VL, MiniGPT-v2, Ferret): Groma uniquely integrates region proposal into tokenization, achieving superior region-level performance without adding complexity to the LLM head or relying on external localization modules. This decouples perceptual localization from reasoning and improves scalability to high-resolution inputs (Ma et al., 2024).
  • Specialist detectors (MDETR, G-DINO, UNINEXT-L): Groma competes on par with these systems in region-level tasks while retaining generalist multimodal dialogue capabilities.
  • Plasticity models (Kocks–Mecking, Nye–Kröner): Groma–Balogh provides a nonlocal, sign-sensitive mean-field evolution, capturing the repulsion/attraction dichotomy and pair annihilation without ad hoc sink terms (Patrizi et al., 2022).

5. Open Problems and Extensions

  • LLM adaptation: Further optimization of region-linguistic alignment, and robustness of grounding under domain adaptation or occlusion, remain open.
  • Continuum limits: Full global uniqueness, characterization of singularity-formation, and the influence of initial condition structure in the Groma–Balogh system are active research topics, especially in higher dimensions.
  • Rate of convergence: The interplay between discrete-particle regularization and continuum equation validity continues to produce subtle non-convergence scenarios requiring refined mathematical tools (Garroni et al., 2018).
  • Extension to 3D and complex slip geometries: Existing results focus primarily on 1D/2D and straight parallel dislocation lines.

A plausible implication is that both "Groma" architectures—whether referential visual tokenization in MLLMs or nonlocal PDEs for defect densities—represent a shift toward explicit, fine-grained modeling of localized structures within broader data modalities, providing templates for future research in their respective areas.
