Slot-Level Plan-and-Infill Paradigm
- The slot-level plan-and-infill paradigm partitions sequences into meaningful slots and decouples planning from infilling for targeted content generation.
- It overcomes classical span-enumeration bottlenecks by enabling parallel decoding, replacing quadratic span enumeration with one query per slot type, and shrinking exponential masking configurations to factorial slot orderings.
- Empirical results show significant speedups (up to 18.4×) and improved performance in few-shot slot tagging and diffusion-based text generation.
The slot-level plan-and-infill paradigm is a modeling and inference strategy that partitions a sequence or input instance into semantically or structurally meaningful “slots,” then iteratively selects and fills these slots in a planning and infilling loop. This paradigm has achieved substantial efficiency and performance gains in diverse settings—most notably, few-shot slot tagging with LLMs, and large-scale text generation using diffusion-based decoding at the slot rather than token level. Core to this approach is the decoupling of prediction (“plan”) and completion (“infill”) at the slot granularity, which allows for both targeted handling of local dependencies and for computational efficiency via parallelization and cache reuse.
1. Formal Setup and Conceptual Foundations
Let x = (x_1, …, x_n) represent a sequence (utterance or text), partitioned into contiguous, non-overlapping slots s_1, …, s_K of fixed length B (for text generation), or, for structured prediction, let T = {t_1, …, t_m} represent predefined slot types such as information extraction fields (“departure,” “time,” etc.). In each setting, a valid output is a mapping of slot types to value spans or a reconstruction of all slots (respectively). In the plan-and-infill approach, inference is organized as an alternating sequence of:
- Planning: Identify which slots should be filled and/or what content is tentatively appropriate for them, often through parallel sampling or scoring mechanisms.
- Infilling: Complete each selected slot (possibly in parallel), optionally conditioned on partial generations for other slots and previously filled content.
In classic slot tagging, this approach contrasts with inefficient span enumeration (O(n²) candidate spans per slot type), by directly mapping slot type → value span, requiring only O(|T|) LLM queries. In slot-level diffusion LMs, slotwise infilling reduces the combinatorial complexity otherwise arising from arbitrary masking patterns.
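The alternating loop above can be sketched generically; `plan`, `infill`, and `is_complete` are hypothetical callbacks standing in for the model-specific components:

```python
def plan_and_infill(slots, plan, infill, is_complete):
    """Generic plan-and-infill loop: repeatedly select unfilled slot indices
    (planning) and complete each one (infilling) until every slot is done."""
    filled = {}
    while not is_complete(filled, slots):
        for j in plan(filled, slots):             # planning: choose target slots
            filled[j] = infill(j, filled, slots)  # infilling: complete them
    return [filled[j] for j in range(len(slots))]

# Toy instantiation: fill slots left to right by uppercasing their drafts.
result = plan_and_infill(
    ["foo", "bar", "baz"],
    plan=lambda filled, slots: [min(j for j in range(len(slots)) if j not in filled)],
    infill=lambda j, filled, slots: slots[j].upper(),
    is_complete=lambda filled, slots: len(filled) == len(slots),
)
print(result)  # ['FOO', 'BAR', 'BAZ']
```

Real systems differ in how `plan` scores slots and whether `infill` runs in parallel, but the control flow is the same.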
2. Classic Approaches and Bottlenecks
For few-shot slot tagging, “classic” prompt-based classification enumerates all possible contiguous spans in input x and, for each slot type t, embeds each span into a cloze-style prompt and queries the LLM for the probability that the span fills type t. This process requires a prohibitive number of model evaluations (O(n² · |T|) per sentence), limiting scalability for practical systems and long inputs. Analogous limitations arise in token-level masked diffusion models for generation, where the number of possible masking configurations is exponential: 2^L for sequence length L.
This highlights the necessity for slot-level reasoning and decoding that can bypass these bottlenecks by aligning computational granularity with semantic or structural chunks.
3. Slot-Level Inverse Prompting and Plan-and-Infill in Few-Shot Slot Tagging
The slot-level plan-and-infill approach for slot tagging reverses the standard paradigm of predicting labels for spans. Instead, for each slot type t ∈ T, a prompt p(x, t) is constructed. The LLM directly generates slot values as contiguous subspans of x, with probability factorized autoregressively over the value tokens: P(v | x, t) = ∏_i P(v_i | x, t, v_{&lt;i}).
At inference time, this reduces computational cost to O(|T|) queries per sentence, a dramatic efficiency improvement. Special control tokens such as NONE and SEP handle absent or multi-span slots; decoding is constrained to subspans of x.
Slotwise outputs can be reconciled with sequence labeling (e.g., BIO format) by aligning generated values back to input positions.
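Aligning generated values back to BIO tags can be done by string matching against the input tokens; a minimal sketch (`values_to_bio` is a hypothetical helper, assuming each value is a contiguous token run):

```python
def values_to_bio(tokens, slot_values):
    """Map generated slot values (contiguous subspans of the input) back to
    token positions as BIO tags. First match wins; 'none' values are skipped."""
    tags = ["O"] * len(tokens)
    for slot_type, value in slot_values.items():
        if value == "none":
            continue
        span = value.split()
        for i in range(len(tokens) - len(span) + 1):
            if tokens[i:i + len(span)] == span:
                tags[i] = f"B-{slot_type}"
                for j in range(i + 1, i + len(span)):
                    tags[j] = f"I-{slot_type}"
                break
    return tags

tokens = "book a flight from beijing to new york".split()
tags = values_to_bio(tokens, {"departure": "beijing", "arrival": "new york", "price": "none"})
print(tags)  # ['O', 'O', 'O', 'O', 'B-departure', 'O', 'B-arrival', 'I-arrival']
```

A production version would also need to resolve ambiguous matches (a value occurring more than once in the input).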
4. Iterative Refinement: Dependency-Aware Slot Generation
In both slot tagging and diffusion-based LMs, a single pass may overlook dependencies between slots (e.g., an “arrival” value that depends on “departure,” or a time expression that helps localize a price). The iterative plan-and-infill strategy introduces multiple decoding passes:
- First round: Independently decode each slot, yielding v_1[t] for each t ∈ T.
- Second round (refinement): For each slot t, construct context C_t as the set of pairs (t′, v_1[t′]) for t′ ≠ t, excluding “none” values. The new prompt p′(x, t, C_t) encodes the context by prepending statements of the form “t′ refers to v_1[t′]” (for all t′ in C_t) to the original slot prompt; the LLM then produces the refined value v_2[t].
Pseudocode for this iterative refinement is as follows:

```
for t in T:
    v_1[t] = argmax_v P(v | x, t)
for t in T:
    context = { (t2, v_1[t2]) for t2 in T if t2 != t and v_1[t2] != 'none' }
    v_2[t] = argmax_v P(v | x, t, context)
```
This refinement can be trained with a joint loss over both passes. Empirically, this procedure led to considerable performance improvements: on MIT-Restaurant (10-shot), a 6.1-point F₁ gain over the template-based baseline, and 2.75 points over one-pass inverse prompting, alongside a substantial inference speedup (Hou et al., 2022).
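The two-pass loop can be made concrete with a stub in place of the LLM's argmax decode (`toy_score` and its returned values are purely illustrative):

```python
def two_pass_infill(x, slot_types, score):
    """Two-pass dependency-aware decoding: decode each slot independently,
    then re-decode each slot conditioned on the others' first-pass values."""
    v1 = {t: score(x, t, {}) for t in slot_types}             # first pass
    v2 = {}
    for t in slot_types:                                      # refinement pass
        context = {t2: v for t2, v in v1.items() if t2 != t and v != "none"}
        v2[t] = score(x, t, context)
    return v2

def toy_score(x, t, context):
    """Stand-in for the LLM's argmax decode (illustrative values only)."""
    first_pass = {"departure": "beijing", "arrival": "new york",
                  "time": "tomorrow morning", "price": "none"}
    if t == "price" and context:   # price only resolvable given the other slots
        return "299 USD"
    return first_pass[t]

values = two_pass_infill("book a flight from beijing to new york tomorrow morning",
                         ["departure", "arrival", "time", "price"], toy_score)
print(values["price"])  # 299 USD
```

The stub makes the dependency explicit: “price” is undecodable in isolation but resolves once the other slots are in context.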
5. Slot-Level Plan-and-Infill for Parallel Decoding in Diffusion LLMs
In the context of text generation, ReFusion implements plan-and-infill by elevating masked diffusion from the token to the slot level (Li et al., 15 Dec 2025). Consider a sequence x of length L, partitioned into slots s_1, …, s_K, each of length B. At each iteration:
- Planning (diffusion-style slot selection): Draft samples are produced for each masked slot. Slot-certainty scores c_j are computed to quantify how predictable each slot is from the current context (e.g., an aggregate of the model's probabilities for the drafted tokens).
The subset of slots whose certainty exceeds a threshold τ (or the single most certain slot, if none meets the threshold) is selected for infilling.
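A sketch of this planning step, scoring each masked slot by the mean log-probability of its drafted tokens (one plausible certainty measure; the exact score used by ReFusion may differ):

```python
def slot_certainty(draft_logprobs, threshold=-1.0):
    """Select slots for infilling by contextual predictability.

    draft_logprobs: {slot_index: [log P(token) for each drafted token]}
    Returns the chosen slot indices and the per-slot scores; falls back to
    the single most certain slot so the loop always makes progress."""
    scores = {j: sum(lps) / len(lps) for j, lps in draft_logprobs.items()}
    chosen = [j for j, s in scores.items() if s > threshold]
    if not chosen:
        chosen = [max(scores, key=scores.get)]
    return chosen, scores

# Slot 0 is highly predictable from context, slot 1 is not.
chosen, scores = slot_certainty({0: [-0.1, -0.2], 1: [-3.0, -2.0]}, threshold=-1.0)
print(chosen)  # [0]
```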
- Infilling: For each selected slot, autoregressive verification is conducted in two phases:
- Global verification: Jointly compute likelihoods for the concatenated draft tokens of the selected slots. The slots fully covered by the longest accepted prefix are moved to the completed set.
- Parallel Iterative Completion: Incomplete slots are refilled token-wise via local autoregressive sampling and verification, independently and in parallel, until completion.
The process repeats, shrinking the set of masked slots, until the full sequence is generated.
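The global-verification phase amounts to accepting the longest draft prefix the autoregressive model agrees with; a minimal sketch (`verify_token` is a hypothetical stand-in for comparing a draft token against the AR model's own prediction at that position):

```python
def longest_accepted_prefix(draft_tokens, verify_token):
    """Walk the drafted tokens left to right and return the length of the
    longest prefix the autoregressive verifier accepts. Tokens after the
    first rejection must be re-generated in the completion phase."""
    accepted = 0
    for i, tok in enumerate(draft_tokens):
        if not verify_token(i, tok):
            break
        accepted += 1
    return accepted

# Verifier accepts a draft token iff it matches the AR model's choice.
n = longest_accepted_prefix(["ok", "ok", "bad", "ok"], lambda i, t: t == "ok")
print(n)  # 2
```

Note that the trailing "ok" is discarded even though it would verify: acceptance is prefix-based, since the AR model's predictions beyond a rejection are no longer valid.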
This approach permits full key-value cache reuse (no repeated recomputation of earlier tokens), and the learning problem shrinks from 2^L tokenwise masking configurations to K! slot orderings (with K = L/B). For typical sequence lengths and slot sizes, this reduces the theoretical complexity by many orders of magnitude.
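The reduction in the space of decoding orders can be checked directly (the concrete L and B below are illustrative assumptions, not values from the paper):

```python
import math

def masking_configs(L):
    """Token-level masked diffusion: any subset of the L positions may be
    masked, giving 2^L configurations."""
    return 2 ** L

def slot_orderings(L, B):
    """Slot-level decoding: only the order of the K = L // B slots matters,
    giving K! orderings."""
    return math.factorial(L // B)

# Illustrative: L = 256, B = 32 -> 2**256 configurations vs 8! = 40320 orderings.
print(masking_configs(256) > slot_orderings(256, 32))  # True
```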
6. Empirical Performance and Speedup
The slot-level plan-and-infill paradigm yields both accuracy and efficiency benefits across tasks. In slot tagging, inverse plan-and-infill with iterative refinement outperforms template-based prompting by over 6 F₁ points at 10-shot and runs substantially faster (Hou et al., 2022).
In diffusion-based text generation, ReFusion's slot-level plan-and-infill outperforms strong masked-diffusion and autoregressive models in both accuracy and throughput:
| Benchmark | LLaDA-8B-Instruct | Dream-7B-Instruct | ReFusion |
|---|---|---|---|
| MBPP pass@1 | 50.40% | 68.20% | 78.66% |
| MBPP TPS (tokens/s) | 12.42 | 53.00 | 92.09 |
| GSM8K accuracy | 76.35% | 76.42% | 84.91% |
| GSM8K TPS (tokens/s) | 23.93 | 18.99 | 81.77 |
The average speedup versus Dream/LLaDA, and versus strong ARMs, is substantial (Li et al., 15 Dec 2025). This is attributed to parallel slot infilling, reduced decoding iterations (2–10 steps, versus L steps for an ARM), and full cache reuse.
7. Representative Examples and Workflow
In slot tagging, for the utterance “book a flight from beijing to new york tomorrow morning” and slot types {departure, arrival, time, price}:
- First-pass prompts and generations:
- p(departure): “book a flight ... departure refers to __.” → “beijing.”
- p(arrival): “... arrival refers to __.” → “new york.”
- p(time): “... time refers to __.” → “tomorrow morning.”
- p(price): “... price refers to __.” → “none.”
- Second-pass (if needed for price):
- p′(price): “... departure refers to beijing. time refers to tomorrow morning. arrival refers to new york. price refers to __.” → possibly “299 USD.” or “none.”
Output values are then aligned to the input for BIO-tag production.
In diffusion LMs, slots are selected and generated as non-overlapping contiguous subblocks, with planning and infilling repeated until the sequence is complete, then reassembled in the original order.
The slot-level plan-and-infill paradigm combines slotwise reasoning with iterative dependency-aware generation, producing both accuracy and efficiency gains by matching computational structure to problem semantics. Its utility is demonstrated in both structured prediction and high-throughput text generation, where it outperforms both traditional enumeration-based and tokenwise diffusion approaches (Hou et al., 2022, Li et al., 15 Dec 2025).