
Slot-Level Plan-and-Infill Paradigm

Updated 18 December 2025
  • Slot-level plan-and-infill paradigm is a strategy that partitions sequences into meaningful slots and decouples planning from infilling for targeted content generation.
  • It overcomes the classical span-enumeration bottleneck by enabling parallel decoding, reducing cost from quadratic span enumeration to linear in the number of slot types (tagging), and from exponentially many masking configurations to factorially many slot orderings (diffusion decoding).
  • Empirical results show significant speedups (up to 18.4×) and improved performance in few-shot slot tagging and diffusion-based text generation.

The slot-level plan-and-infill paradigm is a modeling and inference strategy that partitions a sequence or input instance into semantically or structurally meaningful “slots,” then iteratively selects and fills these slots in a planning and infilling loop. This paradigm has achieved substantial efficiency and performance gains in diverse settings—most notably, few-shot slot tagging with LLMs, and large-scale text generation using diffusion-based decoding at the slot rather than token level. Core to this approach is the decoupling of prediction (“plan”) and completion (“infill”) at the slot granularity, which allows for both targeted handling of local dependencies and for computational efficiency via parallelization and cache reuse.

1. Formal Setup and Conceptual Foundations

Let x = (x_1, ..., x_n) represent a sequence (utterance or text), partitioned into K contiguous, non-overlapping slots s_i = (x_{(i-1)k+1}, ..., x_{ik}) of fixed length k (for text generation); or, for structured prediction, let T = {t_1, ..., t_m} represent m predefined types such as information-extraction fields (“departure,” “time,” etc.). In each setting, a valid output is a mapping of slot types to value spans or a reconstruction of all slots, respectively. In the plan-and-infill approach, inference is organized as an alternating sequence of:

  • Planning: Identify which slots should be filled and/or what content is tentatively appropriate for them, often through parallel sampling or scoring mechanisms.
  • Infilling: Complete each selected slot (possibly in parallel), optionally conditioned on partial generations for other slots and previously filled content.
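A minimal sketch of the fixed-length slot partition s_i = (x_{(i-1)k+1}, ..., x_{ik}) used in the generation setting (function name illustrative):

```python
def partition_into_slots(tokens, k):
    """Split a token sequence into K contiguous, non-overlapping slots of length k.

    A trailing slot may be shorter if len(tokens) is not a multiple of k.
    """
    return [tokens[i:i + k] for i in range(0, len(tokens), k)]

x = list("abcdefgh")              # a toy sequence of n = 8 "tokens"
slots = partition_into_slots(x, k=4)
print(slots)                      # [['a', 'b', 'c', 'd'], ['e', 'f', 'g', 'h']]
```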

In classic slot tagging, this approach contrasts with inefficient span enumeration (O(n^2)) by directly mapping slot type → span, requiring only O(m) operations. In slot-level diffusion LMs, slotwise infilling reduces the combinatorial complexity that otherwise arises from arbitrary masking patterns.

2. Classic Approaches and Bottlenecks

For few-shot slot tagging, “classic” prompt-based classification enumerates all n(n+1)/2 contiguous spans s_{i:j} of input x and, for each slot type t ∈ T, embeds each into a cloze-style prompt (e.g., “[x] ‘{s}’ is a [MASK] entity.”) and queries the LLM for P(label = t | p(x, s)). This requires a prohibitive number of model evaluations (O(n^2 · m) per sentence), limiting scalability for practical systems and long inputs. Analogous limitations arise in token-level masked diffusion models for generation, where the number of possible masking configurations is exponential: 2^L for sequence length L.
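To make the bottleneck concrete, the model-call counts under both schemes can be tallied for hypothetical sizes (n = 50 tokens, m = 10 slot types):

```python
def span_enumeration_calls(n, m):
    # One cloze prompt per contiguous span per slot type: O(n^2 * m).
    return (n * (n + 1) // 2) * m

def slot_level_calls(m):
    # One inverse prompt per slot type: O(m).
    return m

n, m = 50, 10                        # hypothetical utterance length and type count
print(span_enumeration_calls(n, m))  # 12750 model evaluations
print(slot_level_calls(m))           # 10 model evaluations
```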

This highlights the necessity for slot-level reasoning and decoding that can bypass these bottlenecks by aligning computational granularity with semantic or structural chunks.

3. Slot-Level Inverse Prompting and Plan-and-Infill in Few-Shot Slot Tagging

The slot-level plan-and-infill approach for slot tagging reverses the standard paradigm of predicting labels for spans. Instead, for each slot type t ∈ T, a prompt p_t(x) = “[x] t refers to __.” is constructed. The LLM directly generates slot values v_t as contiguous subspans of x, with probability

P(v_t | t, x) = ∏_{k=1}^{|v_t|} P(v_{t,k} | t, x, v_{t,1:k-1}).

At inference time, this reduces computational cost to O(m · L), a dramatic efficiency improvement. Special control tokens such as <NONE> and <SEP> handle absent or multi-span slots; decoding is constrained to subspans of x.

Slotwise outputs can be reconciled with sequence labeling (e.g., BIO format) by aligning generated values v_t back to input positions.
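A minimal sketch of this alignment step, assuming generated values appear verbatim as contiguous token subspans of the input (names illustrative):

```python
def align_to_bio(tokens, slot_values):
    """Map generated slot values back to input positions as BIO tags.

    slot_values: {slot_type: value_string or "none"}; each non-"none" value
    is assumed to occur as a contiguous token subspan of the input.
    """
    tags = ["O"] * len(tokens)
    for slot, value in slot_values.items():
        if value == "none":
            continue
        span = value.split()
        # Find the first occurrence of the value span and tag it B-/I-.
        for i in range(len(tokens) - len(span) + 1):
            if tokens[i:i + len(span)] == span:
                tags[i] = f"B-{slot}"
                for j in range(i + 1, i + len(span)):
                    tags[j] = f"I-{slot}"
                break
    return tags

tokens = "book a flight from beijing to new york".split()
print(align_to_bio(tokens, {"departure": "beijing", "arrival": "new york"}))
# ['O', 'O', 'O', 'O', 'B-departure', 'O', 'B-arrival', 'I-arrival']
```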

4. Iterative Refinement: Dependency-Aware Slot Generation

In both slot tagging and diffusion-based LMs, a single pass may overlook dependencies between slots (e.g., “arrival” depending on “departure,” or a time expression localizing a price). The iterative plan-and-infill strategy therefore introduces multiple decoding passes:

  • First round: Independently decode each slot.
  • Second round (refinement): For each slot t, construct the context C as the set of pairs (t′, V[t′]) for t′ ≠ t and their decoded values (excluding “none” values). The new prompt encodes this context: “[x]” + “t′ refers to v.” (for each (t′, v) in C) + “t refers to __.” The LLM then produces v_t^(2) = argmax_v P(v | t, x, C).

Pseudocode for this iterative reasoning is as follows:

for t in T:
    v_1[t] = argmax_v P(v | x, t)            # first pass: decode each slot independently
for t in T:
    context = {(t', v_1[t']) for t' != t if v_1[t'] != 'none'}
    v_2[t] = argmax_v P(v | x, t, context)   # second pass: dependency-aware refinement
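A runnable toy version of the two-pass loop, with a hypothetical `best_value` function standing in for the LLM's constrained argmax:

```python
def two_pass_decode(x, slot_types, best_value):
    """Two-pass plan-and-infill: independent decoding, then context-conditioned refinement.

    best_value(x, t, context) stands in for argmax_v P(v | x, t, context);
    any caller-supplied scorer (mocked or a real LLM) can be plugged in.
    """
    # First pass: decode each slot independently.
    v1 = {t: best_value(x, t, {}) for t in slot_types}
    # Second pass: condition each slot on the other slots' non-"none" values.
    v2 = {}
    for t in slot_types:
        context = {u: v for u, v in v1.items() if u != t and v != "none"}
        v2[t] = best_value(x, t, context)
    return v2

# Toy scorer (illustrative): "arrival" only resolves once context is available,
# mimicking a slot that benefits from the refinement pass.
def mock_best_value(x, t, context):
    table = {"departure": "beijing", "time": "tomorrow morning"}
    if t == "arrival":
        return "new york" if context else "none"
    return table.get(t, "none")

x = "book a flight from beijing to new york tomorrow morning"
print(two_pass_decode(x, ["departure", "arrival", "time"], mock_best_value))
# {'departure': 'beijing', 'arrival': 'new york', 'time': 'tomorrow morning'}
```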

This refinement can be trained with a joint loss. Empirically, the procedure led to considerable performance improvements: on MIT-Restaurant (10-shot), a 6.1-point F₁ gain over the template-based baseline and 2.75 points over one-pass inverse prompting, alongside a 3.4–8× speedup (Hou et al., 2022).

5. Slot-Level Plan-and-Infill for Parallel Decoding in Diffusion LLMs

In the context of text generation, ReFusion implements plan-and-infill by elevating masked diffusion from the token to the slot level (Li et al., 15 Dec 2025). Consider a sequence x = (x_1, ..., x_L), partitioned into K slots s_i, each of length k. At iteration t:

  • Planning (diffusion-style slot selection): Draft samples are produced for each masked slot, and slot-certainty scores C(s_i) are computed to quantify contextual predictability:

C(s_i) = p_θ(d_{i,1} | p_0, S_clean, S_{masked, <(i,1)})

The subset of slots whose certainty exceeds a threshold T_slot (or the single slot with maximal certainty, if none meets the threshold) is selected for infilling.

  • Infilling: For each selected slot, autoregressive verification is conducted in two phases:

    1. Global verification: Jointly compute likelihoods for the concatenated draft tokens. If the longest accepted prefix covers at least k tokens, the corresponding fully verified slots are moved to S_clean.
    2. Parallel Iterative Completion: Incomplete slots are refilled token-wise via local autoregressive sampling and verification, independently and in parallel, until completion.
  • The process repeats, shrinking S_masked, until the full sequence is generated.
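The planning (threshold selection) and global-verification (longest accepted prefix) steps above can be sketched as follows; `accept` is a hypothetical stand-in for the model's per-token likelihood check:

```python
def plan_slots(certainty, t_slot):
    """Planning: select all masked slots whose certainty C(s_i) exceeds the
    threshold T_slot, or the single most certain slot if none does."""
    selected = [i for i, c in certainty.items() if c > t_slot]
    return selected or [max(certainty, key=certainty.get)]

def longest_accepted_prefix(draft_tokens, accept):
    """Global verification: length of the longest draft prefix whose tokens
    all pass the per-token check accept(i, tok)."""
    n = 0
    for i, tok in enumerate(draft_tokens):
        if not accept(i, tok):
            break
        n += 1
    return n

# Planning: slots 0 and 2 clear the threshold and are infilled in parallel.
print(plan_slots({0: 0.91, 1: 0.40, 2: 0.85}, t_slot=0.8))   # [0, 2]

# Verification over two drafted slots of length k = 4; token 5 fails, so the
# accepted prefix covers 5 tokens and only the first full slot moves to S_clean.
k = 4
n = longest_accepted_prefix(list(range(8)), lambda i, tok: i != 5)
print(n, n // k)   # 5 1
```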

This approach permits full key-value cache reuse (no repeated recomputation of earlier tokens), and the learning problem shrinks from 2^L tokenwise masking configurations to K! slot orderings. For L = 4096 and k = 8 (hence K = 512), this reduces the theoretical configuration space by orders of magnitude.
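A quick log-space check of this reduction for the stated L = 4096 and k = 8 (a back-of-the-envelope sketch, not ReFusion's implementation):

```python
import math

def log2_masking_configs(L):
    # Token-level masked diffusion: 2^L masking configurations, i.e. L bits.
    return float(L)

def log2_slot_orderings(L, k):
    # Slot-level: K! orderings of K = L // k slots; log2(K!) via lgamma.
    K = L // k
    return math.lgamma(K + 1) / math.log(2)

L, k = 4096, 8
print(log2_masking_configs(L))           # 4096.0 bits
print(round(log2_slot_orderings(L, k)))  # ~3875 bits, so 2^L / K! ≈ 2^221
```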

6. Empirical Performance and Speedup

The slot-level plan-and-infill paradigm yields both accuracy and efficiency benefits across tasks. In slot tagging, inverse plan-and-infill with iterative refinement outperforms template-based prompting by over 6 F₁ points at 10-shot and runs 3.4–8× faster (Hou et al., 2022).

In diffusion-based text generation, ReFusion's slot-level plan-and-infill outperforms strong masked-diffusion and autoregressive models in both accuracy and throughput:

| Benchmark      | LLaDA-8B-Instruct | Dream-7B-Instruct | ReFusion |
|----------------|-------------------|-------------------|----------|
| MBPP pass@1    | 50.40%            | 68.20%            | 78.66%   |
| TPS (tokens/s) | 12.42             | 53.00             | 92.09    |
| GSM8K accuracy | 76.35%            | 76.42%            | 84.91%   |
| TPS (tokens/s) | 23.93             | 18.99             | 81.77    |

Average speedup versus Dream/LLaDA is 18.4×; versus strong ARMs, 2.33× (Li et al., 15 Dec 2025). This is attributed to parallel slot infilling, fewer decoding iterations (2–10 steps, versus L/k for an ARM), and full cache reuse.

7. Representative Examples and Workflow

In slot tagging, for x = “book a flight from beijing to new york tomorrow morning” and T = {departure, arrival, time, price}:

  • First-pass prompts and generations:
    • p(departure): “book a flight ... departure refers to __.” → “beijing.”
    • p(arrival): “... arrival refers to __.” → “new york.”
    • p(time): “... time refers to __.” → “tomorrow morning.”
    • p(price): “... price refers to __.” → “none.”
  • Second-pass (if needed for price):
    • p′(price): “... departure refers to beijing. time refers to tomorrow morning. arrival refers to new york. price refers to __.” Possibly → “299 USD.” or “none.”

Output values are then aligned to the input for BIO-tag production.
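The second-pass prompt in this example can be assembled mechanically (prompt wording as above; function name illustrative):

```python
def second_pass_prompt(x, target_slot, context):
    """Build the refinement prompt: the input, then each already-decoded
    (slot, value) pair as a statement, then the target slot's cloze."""
    parts = [x]
    parts += [f"{t} refers to {v}." for t, v in context.items()]
    parts.append(f"{target_slot} refers to __.")
    return " ".join(parts)

x = "book a flight from beijing to new york tomorrow morning"
ctx = {"departure": "beijing", "time": "tomorrow morning", "arrival": "new york"}
print(second_pass_prompt(x, "price", ctx))
```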

In diffusion LMs, slots are selected and generated as non-overlapping contiguous subblocks, with planning and infilling repeated until the sequence is complete; completed slots are assembled back into their original positions.


The slot-level plan-and-infill paradigm combines slotwise reasoning with iterative dependency-aware generation, producing both accuracy and efficiency gains by matching computational structure to problem semantics. Its utility is demonstrated in both structured prediction and high-throughput text generation, where it outperforms both traditional enumeration-based and tokenwise diffusion approaches (Hou et al., 2022, Li et al., 15 Dec 2025).
