Multi-Region Infilling & Constrained Decoding
- Multi-region infilling constrained decoding is a framework for editing multiple noncontiguous spans while rigorously enforcing syntactic, semantic, and factual constraints.
- It leverages fusion mechanisms such as graph-neural-network message passing, together with incremental parsing and formal grammar checks, to ensure cross-sequence consistency and syntactic validity.
- Recent developments integrate semantic frames and multi-modal evidence, mitigating hallucinations and enhancing the fidelity of structured generation.
Multi-region infilling constrained decoding refers to the class of methods and frameworks that enable language models or vision-language models to generate, fill, or edit multiple noncontiguous spans (“regions”) in a structured context, with strict adherence to external constraints. These constraints may be syntactic (from formal grammars), semantic (from frames or evidence), or factual (from multi-modal input). The field spans neural architecture extensions, grammar-based verification, prompt engineering, bidirectional context integration, and specialized fusion strategies. The following sections survey the main methodologies, mechanisms, and evaluation results in this domain.
1. Architectural Foundations and Fusion Mechanisms
Multi-region infilling typically extends conventional left-to-right decoding by enabling bidirectional or multi-sequence context aggregation. In “Consistent Multiple Sequence Decoding” (Xu et al., 2020), the approach models the decoder for each sequence (or region) as a node in a graph. At each time step, two sources of context enter each decoder:
- The local autoregressive history (previously generated tokens for this sequence)
- Fused global context (aggregated from related sequences/regions via a consistency fusion mechanism).
The fusion mechanism is realized through message passing in a Graph Neural Network (GNN), with each node updated by aggregating neighbor features under adaptive self-attention. The self-attention weights (computed by a leaky-ReLU-activated MLP and softmaxed over neighbors) ensure context aggregation is locally adaptive at every region and decoding step.
This fusion enables cross-sequence consistency, vital for tasks like dense relational captioning or multi-span text/vision infilling, where independently generated outputs risk semantic or referential incoherence.
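A minimal PyTorch sketch of one such fusion step follows: each region's decoder state attends over neighboring regions' states, with attention weights produced by a leaky-ReLU-activated MLP and softmaxed over neighbors. The layer sizes, the single-step update, and the `ConsistencyFusion` module are illustrative assumptions rather than the paper's exact parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConsistencyFusion(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        # Scores a (node, neighbor) pair of decoder states.
        self.attn_mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, 1),
        )
        self.update = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (num_regions, hidden_dim) decoder states at the current step
        # adj: (num_regions, num_regions) adjacency; assumed to include
        # self-loops so every row has at least one neighbor
        n = h.size(0)
        pairs = torch.cat(
            [h.unsqueeze(1).expand(n, n, -1), h.unsqueeze(0).expand(n, n, -1)],
            dim=-1,
        )                                          # (n, n, 2 * hidden_dim)
        scores = self.attn_mlp(pairs).squeeze(-1)  # (n, n) pairwise scores
        scores = scores.masked_fill(adj == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)        # adaptive per-node weights
        context = weights @ h                      # fused global context
        return torch.tanh(self.update(torch.cat([h, context], dim=-1)))

fusion = ConsistencyFusion(hidden_dim=8)
h = torch.randn(3, 8)              # three regions' current decoder states
adj = torch.ones(3, 3)             # fully connected graph with self-loops
print(fusion(h, adj).shape)        # torch.Size([3, 8])
```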
2. Grammar-Constrained and Verification-Based Decoding
Formal languages play an essential role in guaranteeing validity in code and structured output infilling. The paradigm is extended to fill-in-the-middle and multi-region settings by integrating incremental parsing, quotient computation, and intersection testing of languages.
In “Constrained Decoding for Fill-in-the-Middle Code LLMs via Efficient Left and Right Quotienting of Context-Sensitive Grammars” (Melcer et al., 28 Feb 2024), the key constraint is enforced by maintaining an incremental parse state (an Earley chart) and, at each token insertion, checking that a valid suffix still exists under which the current prefix can be completed to a program in the language of the context-free grammar G.
For multi-region infilling, the method constructs a quotient grammar encapsulating all possible completions, and the incremental parser rapidly rejects invalid branches. Extensions accommodate language-specific features such as indentation (Python), parenthesis matching, or context-sensitive lexing.
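To make the prefix-viability check concrete, the sketch below masks next tokens over a toy Dyck (balanced-parentheses) language. Here `viable_prefix` is a stand-in for the paper's Earley-chart and quotient machinery, and the right (suffix) context that the full method also quotients against is ignored.

```python
def viable_prefix(tokens: list[str]) -> bool:
    """A prefix is completable to a balanced string iff the running
    open-paren depth never goes negative."""
    depth = 0
    for t in tokens:
        depth += 1 if t == "(" else -1
        if depth < 0:
            return False
    return True

def mask_next_tokens(prefix: list[str], vocab: list[str]) -> list[str]:
    """Keep only vocabulary items whose addition leaves the prefix
    completable; the decoder's logits for the rest would be set to -inf."""
    return [t for t in vocab if viable_prefix(prefix + [t])]

print(mask_next_tokens(["(", "("], ["(", ")"]))  # ['(', ')']: both viable
print(mask_next_tokens([")"], ["(", ")"]))       # []: prefix already invalid
```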
“Constrained Decoding of Diffusion LLMs with Context-Free Grammars” (Mündler et al., 13 Aug 2025) generalizes this: the core step after each token addition is to determine whether some assignment of fillers to all regions yields an output in the grammar's language L(G). This is reduced to language intersection testing: for the context-free language L(G) and a regular completion language R, check whether L(G) ∩ R is nonempty via an implicit intersection grammar, early-pruning non-generating nonterminals.
These approaches jointly ensure syntactic (and often semantic) well-formedness under multi-region infilling, and scale to unordered, additive infilling common in diffusion-model driven LLMs.
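The semantics of this intersection test can be illustrated with a deliberately naive sketch: enumerate filler assignments for all regions and test membership in a toy context-free language (balanced parentheses again). The real method avoids this enumeration via the implicit intersection grammar with early pruning; `any_assignment_valid` and the tiny filler set are illustrative only.

```python
from itertools import product

def in_language(s: str) -> bool:
    """Membership test for the toy balanced-parentheses language."""
    depth = 0
    for c in s:
        depth += 1 if c == "(" else -1
        if depth < 0:
            return False
    return depth == 0

def any_assignment_valid(template: list, fillers: list[str]) -> bool:
    """Does some assignment of fillers to the None holes yield a valid string?"""
    holes = [i for i, piece in enumerate(template) if piece is None]
    for choice in product(fillers, repeat=len(holes)):
        filled = list(template)
        for i, f in zip(holes, choice):
            filled[i] = f
        if in_language("".join(filled)):
            return True
    return False

# Two unordered infill regions (None) around fixed context:
print(any_assignment_valid(["((", None, "(", None], ["", ")", "))"]))  # True
```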
3. Semantic, Evidence, and Lexical Constraint Integration
A different axis of constraint leverages high-level semantic frames, structured evidence, or explicit lexical requirements. In “InFillmore: Frame-Guided Language Generation” (Ou et al., 2021), two mechanisms are detailed:
- Fine-tuning with Frame Control (FFL): During training, gold infill spans are prepended with frame ID tokens representing thematic or action semantics. The model learns to associate these with matching lexical/semantic realizations.
- Disjunctive Lexically Constrained Decoding (LCD): During decoding, each span's constraint set is a disjunction of FrameNet lexical units. Beam search is restricted so that each constraint set is satisfied within its region, realized as a dynamic trie structure per constraint with explicit tracking of fulfillment (a minimal sketch appears below).
This explicit semantic or lexical constraint approach allows direct user or system control over the "meaning" or word choice of each infilled region, supporting applications requiring interpretability or controllable generation (e.g., story editing, legal draft infilling, structured counterfactual generation).
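The trie-based disjunctive tracking can be sketched compactly. In the following Python sketch, the `DisjunctiveTrie` and `ConstraintTracker` names, the integer token IDs, and the restart-at-root handling are illustrative simplifications; a full implementation would maintain one tracker per constraint set and beam hypothesis inside beam search.

```python
class DisjunctiveTrie:
    """Trie over the token sequences of one disjunctive constraint set."""
    def __init__(self, alternatives: list[list[int]]):
        self.root: dict = {}
        for seq in alternatives:          # one branch per lexical unit
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})
            node[None] = True             # end-of-alternative marker

class ConstraintTracker:
    """Tracks fulfillment of a disjunctive constraint as tokens arrive."""
    def __init__(self, trie: DisjunctiveTrie):
        self.root = trie.root
        self.node = trie.root
        self.done = False

    def advance(self, token: int) -> None:
        if self.done:
            return
        # Follow the trie edge if possible; otherwise try restarting at root.
        src = self.node if token in self.node else self.root
        self.node = src[token] if token in src else self.root
        if None in self.node:             # a complete alternative was emitted
            self.done = True

# Constraint: emit "eat" (token 5) OR "devour" (tokens [7, 8]).
trie = DisjunctiveTrie([[5], [7, 8]])
tracker = ConstraintTracker(trie)
for tok in [3, 7, 8, 2]:
    tracker.advance(tok)
print(tracker.done)  # True: the [7, 8] alternative was fulfilled
```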
Similarly, “STOC-TOT: Stochastic Tree-of-Thought with Constrained Decoding” (Bi et al., 4 Jul 2024) integrates a vocabulary bank V (derived from the evidence and the prompt) and restricts answer generation to words in V, minimizing hallucination and grounding answers in retrieved evidence across multiple regions (passages).
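A minimal sketch of such vocabulary-bank restriction follows, assuming a whitespace tokenizer and a toy vocabulary as stand-ins for a real tokenizer and model.

```python
import torch

def build_bank(evidence: str, prompt: str, vocab: dict[str, int]) -> set[int]:
    """Collect token IDs for every word appearing in the evidence or prompt."""
    words = {w.strip(".,?!").lower() for w in (evidence + " " + prompt).split()}
    return {i for w, i in vocab.items() if w in words}

def restrict_logits(logits: torch.Tensor, bank: set[int]) -> torch.Tensor:
    """Mask logits of tokens outside the bank to -inf before sampling."""
    mask = torch.full_like(logits, float("-inf"))
    mask[torch.tensor(sorted(bank))] = 0.0
    return logits + mask

vocab = {"paris": 0, "london": 1, "capital": 2, "france": 3}
bank = build_bank("Paris is the capital of France.", "capital of France?", vocab)
logits = torch.tensor([1.0, 3.0, 0.5, 0.2])    # model (wrongly) prefers "london"
print(restrict_logits(logits, bank).argmax())  # tensor(0): grounded "paris"
```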
4. Advances in Bidirectional and Autoregressive Infill Models
Bidirectional context integration (infilling between a prefix and suffix) is often hampered by computational overhead or architectural limitations. “Enabling Autoregressive Models to Fill In Masked Tokens” (Israel et al., 9 Feb 2025) introduces MARIA, a hybrid that combines an AR model and an MLM via a linear decoder over their concatenated hidden states.
The method matches multi-span, masked infilling quality with MLMs while retaining AR’s inference efficiency and KV caching. Empirical results show MARIA outperforms discrete diffusion and other conventional approaches on perplexity and qualitative metrics, especially in high-masking or multi-region settings.
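A minimal sketch of the combiner idea, assuming illustrative hidden sizes and a single trained linear head over the concatenated AR and MLM states (the actual MARIA configuration may differ):

```python
import torch
import torch.nn as nn

class InfillHead(nn.Module):
    """Linear decoder over concatenated AR and MLM hidden states."""
    def __init__(self, ar_dim: int, mlm_dim: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(ar_dim + mlm_dim, vocab_size)

    def forward(self, h_ar: torch.Tensor, h_mlm: torch.Tensor) -> torch.Tensor:
        # h_ar: (batch, seq, ar_dim) left-context AR states (KV-cacheable)
        # h_mlm: (batch, seq, mlm_dim) bidirectional MLM states
        return self.proj(torch.cat([h_ar, h_mlm], dim=-1))

head = InfillHead(ar_dim=768, mlm_dim=768, vocab_size=50_000)
logits = head(torch.randn(1, 16, 768), torch.randn(1, 16, 768))
print(logits.shape)  # torch.Size([1, 16, 50000])
```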
Other works, such as “Self-Infilling Code Generation” (Zheng et al., 2023), introduce non-monotonic generation: upon encountering uncertainty, the model can interrupt left-to-right decoding, generate a suffix first, and then return to infill the skipped content, cycling through refinement (looping) steps for improved logical consistency; the process accommodates multiple regions recursively.
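The non-monotonic control flow can be sketched as follows. `ToyModel` and its three methods are hypothetical stand-ins for the paper's interruption, suffix-first generation, and infilling mechanisms; only the order of operations is meaningful here.

```python
class ToyModel:
    """Hypothetical stand-in whose methods mimic self-infilling behavior."""
    def generate_until_uncertain(self, prefix, suffix):
        if "def f():" not in prefix:          # pretend we get stuck once,
            return prefix + "def f():", True  # right after a function header
        return prefix, False

    def generate_suffix(self, prefix):
        return "    return x\n"               # commit the ending first

    def infill_between(self, prefix, suffix):
        return "\n    x = 42\n"               # fill the skipped middle last

def self_infill(prompt, model, max_loops=3):
    prefix, suffix = prompt, ""
    for _ in range(max_loops):
        # 1. Decode left-to-right until the model signals low confidence.
        prefix, interrupted = model.generate_until_uncertain(prefix, suffix)
        if not interrupted:
            break
        # 2. Suffix-first: pin down an ending before the uncertain middle.
        suffix = model.generate_suffix(prefix)
        # 3. Loop back and infill the skipped middle against the fixed suffix.
        prefix += model.infill_between(prefix, suffix)
    return prefix + suffix

print(self_infill("", ToyModel()))  # def f():\n    x = 42\n    return x
```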
5. Token-Level, Span-Level, and Character-Level Infilling Constraints
Constrained infilling at the sub-token or character level presents unique challenges. In “Empowering Character-level Text Infilling by Eliminating Sub-Tokens” (Ren et al., 27 May 2024), the FIM-SE method transforms character-level infilling into a constrained line-level task:
- For a split at arbitrary character positions, the last line of the prefix and the first line of the suffix are marked with special tokens (<START>, <END>), and the model is required to generate an infilled segment that begins and ends precisely with these patterns.
- This eliminates sub-token fragmentation and label ambiguity, reducing perplexity and error propagation at boundaries, and improving span- and multi-region accuracy on tasks like code and natural language infilling.
The general framework supplied by FIM-SE extends naturally to any multi-region scenario where boundary conditions must be strictly enforced.
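A minimal sketch of the line-level rewrite, assuming illustrative marker names and a plain string interface in place of a real tokenizer:

```python
def fim_se_prompt(text: str, start: int, end: int) -> dict:
    """Lift a character-level split into a line-level infilling task."""
    prefix, middle, suffix = text[:start], text[start:end], text[end:]
    head, _, last_line = prefix.rpartition("\n")  # complete lines vs. fragment
    first_line, _, tail = suffix.partition("\n")
    return {
        "prefix": head,              # only complete lines remain as context
        "start_marker": last_line,   # <START>: output must begin with this
        "end_marker": first_line,    # <END>: output must end with this
        "suffix": tail,
        "target": last_line + middle + first_line,  # what the model generates
    }

code = "def add(a, b):\n    total = a + b\n    return total\n"
p = fim_se_prompt(code, start=25, end=40)
print(repr(p["start_marker"]), "|", repr(p["end_marker"]))
# '    total ' | 'urn total'
```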
6. Multi-modal Multi-region Infilling and Hallucination Mitigation
Recently, vision-LLMs (LVLMs) have adopted multi-region fusion techniques for mitigating hallucinations. “MRFD: Multi-Region Fusion Decoding with Self-Consistency” (Ge et al., 14 Aug 2025) uses cross-attention maps to extract salient regions from images corresponding to a text prompt:
- For each region, independent responses are generated.
- Jensen–Shannon Divergence (JSD) between the regions' next-token probability distributions is used to compute reliability weights based on agreement, and the final logits are fused as a reliability-weighted combination of the per-region logits (see the sketch after this list).
- Region-aware prompts highlight each region's specific evidence, grounding the outputs.
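A minimal NumPy sketch of the agreement-weighted fusion follows; the temperature and the exponential weighting of mean JSD are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def jsd(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Jensen-Shannon divergence between two probability vectors."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def fuse_region_logits(logits: np.ndarray, tau: float = 0.1) -> np.ndarray:
    # logits: (num_regions, vocab) next-token logits, one row per region
    probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
    n = len(probs)
    disagreement = np.array([
        np.mean([jsd(probs[i], probs[j]) for j in range(n) if j != i])
        for i in range(n)
    ])
    weights = np.exp(-disagreement / tau)
    weights /= weights.sum()           # reliability weights from agreement
    return (weights[:, None] * logits).sum(0)

region_logits = np.array([[2.0, 0.1, 0.1],   # two regions agree;
                          [1.8, 0.2, 0.1],
                          [0.1, 2.5, 0.3]])  # the outlier is down-weighted
print(fuse_region_logits(region_logits).argmax())  # 0
```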
Experiments demonstrate marked reduction in hallucination rates and improvements in factuality, especially in complex or adversarial multimodal benchmarks.
7. Evaluation Metrics, Applications, and Research Directions
Empirical validation spans language, code, and multimodal benchmarks:
- “Consistent Multiple Sequence Decoding” (Xu et al., 2020): +5.2% mAP, +9.5% BLEU-1 consistency gain in dense captioning; diversity maintained.
- Grammar-constrained approaches (Melcer et al., 28 Feb 2024; Mündler et al., 13 Aug 2025): up to 31.5% improvement in syntactic correctness for multi-span code infilling, with accompanying gains in Pass@1 functional accuracy.
- Character-level (Ren et al., 27 May 2024): FIM-SE yields 8.8–11.5% accuracy improvement in multi-span and single/multi-line code infilling.
- MRFD (Ge et al., 14 Aug 2025): Improves hallucination metrics on POPE, CHAIR, MME-Hallucination across multiple LVLMs.
- MARIA (Israel et al., 9 Feb 2025): Superior perplexity, throughput, and sample quality across masking rates and domains.
Applications include program synthesis, structured document editing, story generation, multimodal VQA, and grounded factual QA.
Research directions include: further integration of semantic/statistical checks with grammatical constraints (e.g., preventing context escape and semantic errors), domain-adaptive constraint formalizations, efficiency optimizations (for grammar intersection and incremental parsing), fusion of multi-source evidence, and expanding to dynamic or temporally structured input (e.g., video).
Multi-region infilling constrained decoding encompasses techniques that rigorously enforce structure and semantics in multi-span generative modeling. The ongoing convergence of graph-based fusion, formal grammar verification, semantic constraint integration, and multi-modal evidence aggregation continues to advance the reliability of structured AI generation across diverse settings.