Slot-Level Plan-and-Infill Paradigm
- The slot-level plan-and-infill paradigm partitions sequences into meaningful slots and decouples planning from infilling for targeted content generation.
- It overcomes classical span-enumeration bottlenecks by enabling parallel decoding, replacing quadratic span enumeration with one query per slot type, and shrinking exponential masking configurations to factorial slot orderings.
- Empirical results show significant speedups (up to 18.4×) and improved performance in few-shot slot tagging and diffusion-based text generation.
The slot-level plan-and-infill paradigm is a modeling and inference strategy that partitions a sequence or input instance into semantically or structurally meaningful “slots,” then iteratively selects and fills these slots in a planning and infilling loop. This paradigm has achieved substantial efficiency and performance gains in diverse settings—most notably, few-shot slot tagging with LLMs, and large-scale text generation using diffusion-based decoding at the slot rather than token level. Core to this approach is the decoupling of prediction (“plan”) and completion (“infill”) at the slot granularity, which allows for both targeted handling of local dependencies and for computational efficiency via parallelization and cache reuse.
1. Formal Setup and Conceptual Foundations
Let x = (x_1, …, x_n) represent a sequence (utterance or text), partitioned into contiguous, non-overlapping slots s_1, …, s_K of fixed length B (for text generation), or, for structured prediction, let T = {t_1, …, t_m} represent predefined slot types such as information extraction fields (“departure,” “time,” etc.). In each setting, a valid output is a mapping of slot types to value spans or a reconstruction of all slots (respectively). In the plan-and-infill approach, inference is organized as an alternating sequence of:
- Planning: Identify which slots should be filled and/or what content is tentatively appropriate for them, often through parallel sampling or scoring mechanisms.
- Infilling: Complete each selected slot (possibly in parallel), optionally conditioned on partial generations for other slots and previously filled content.
In classic slot tagging, this approach contrasts with inefficient span enumeration (O(n²) candidate spans per slot type), by directly mapping slot type → value span, requiring only O(|T|) LLM queries. In slot-level diffusion LMs, slotwise infilling reduces the combinatorial complexity otherwise arising from arbitrary masking patterns.
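The alternating loop above can be sketched generically; `plan`, `infill`, and `is_complete` are hypothetical callbacks standing in for the model-specific components:

```python
def plan_and_infill(slots, plan, infill, is_complete):
    """Generic plan-and-infill loop: repeatedly select unfilled slot indices
    (planning) and complete each one (infilling) until every slot is done."""
    filled = {}
    while not is_complete(filled, slots):
        for j in plan(filled, slots):             # planning: choose target slots
            filled[j] = infill(j, filled, slots)  # infilling: complete them
    return [filled[j] for j in range(len(slots))]

# Toy instantiation: fill slots left to right by uppercasing their drafts.
result = plan_and_infill(
    ["foo", "bar", "baz"],
    plan=lambda filled, slots: [min(j for j in range(len(slots)) if j not in filled)],
    infill=lambda j, filled, slots: slots[j].upper(),
    is_complete=lambda filled, slots: len(filled) == len(slots),
)
print(result)  # ['FOO', 'BAR', 'BAZ']
```

Real systems differ in how `plan` scores slots and whether `infill` runs in parallel, but the control flow is the same.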
2. Classic Approaches and Bottlenecks
For few-shot slot tagging, “classic” prompt-based classification enumerates all possible contiguous spans in input x and, for each slot type t, embeds each span into a cloze-style prompt and queries the LLM for the probability that the span fills type t. This process requires a prohibitive number of model evaluations (O(n² · |T|) per sentence), limiting scalability for practical systems and long inputs. Analogous limitations arise in token-level masked diffusion models for generation, where the number of possible masking configurations is exponential: 2^L for sequence length L.
This highlights the necessity for slot-level reasoning and decoding that can bypass these bottlenecks by aligning computational granularity with semantic or structural chunks.
3. Slot-Level Inverse Prompting and Plan-and-Infill in Few-Shot Slot Tagging
The slot-level plan-and-infill approach for slot tagging reverses the standard paradigm of predicting labels for spans. Instead, for each slot type t ∈ T, a prompt p(x, t) is constructed. The LLM directly generates slot values as contiguous subspans of x, with probability factorized autoregressively over the value tokens: P(v | x, t) = ∏_i P(v_i | x, t, v_{&lt;i}).
At inference time, this reduces computational cost to O(|T|) queries per sentence, a dramatic efficiency improvement. Special control tokens such as NONE and SEP handle absent or multi-span slots; decoding is constrained to subspans of x.
Slotwise outputs can be reconciled with sequence labeling (e.g., BIO format) by aligning generated values back to input positions.
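Aligning generated values back to BIO tags can be done by string matching against the input tokens; a minimal sketch (`values_to_bio` is a hypothetical helper, assuming each value is a contiguous token run):

```python
def values_to_bio(tokens, slot_values):
    """Map generated slot values (contiguous subspans of the input) back to
    token positions as BIO tags. First match wins; 'none' values are skipped."""
    tags = ["O"] * len(tokens)
    for slot_type, value in slot_values.items():
        if value == "none":
            continue
        span = value.split()
        for i in range(len(tokens) - len(span) + 1):
            if tokens[i:i + len(span)] == span:
                tags[i] = f"B-{slot_type}"
                for j in range(i + 1, i + len(span)):
                    tags[j] = f"I-{slot_type}"
                break
    return tags

tokens = "book a flight from beijing to new york".split()
tags = values_to_bio(tokens, {"departure": "beijing", "arrival": "new york", "price": "none"})
print(tags)  # ['O', 'O', 'O', 'O', 'B-departure', 'O', 'B-arrival', 'I-arrival']
```

A production version would also need to resolve ambiguous matches (a value occurring more than once in the input).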
4. Iterative Refinement: Dependency-Aware Slot Generation
In both slot tagging and diffusion-based LMs, a single pass may overlook dependencies between slots (e.g., an “arrival” value that depends on “departure,” or a time expression that helps localize a price). The iterative plan-and-infill strategy introduces multiple decoding passes:
- First round: Independently decode each slot, yielding v_1[t] for each t ∈ T.
- Second round (refinement): For each slot t, construct context C_t as the set of pairs (t′, v_1[t′]) for t′ ≠ t, excluding “none” values. The new prompt p′(x, t, C_t) encodes the context by prepending statements of the form “t′ refers to v_1[t′]” (for all t′ in C_t) to the original slot prompt; the LLM then produces the refined value v_2[t].
Pseudocode for this iterative refinement is as follows:

```
for t in T:
    v_1[t] = argmax_v P(v | x, t)
for t in T:
    context = { (t2, v_1[t2]) for t2 in T if t2 != t and v_1[t2] != 'none' }
    v_2[t] = argmax_v P(v | x, t, context)
```
This refinement can be trained with a joint loss over both passes. Empirically, this procedure led to considerable performance improvements: on MIT-Restaurant (10-shot), a 6.1-point F₁ gain over the template-based baseline, and 2.75 points over one-pass inverse prompting, alongside a substantial inference speedup (Hou et al., 2022).
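The two-pass loop can be made concrete with a stub in place of the LLM's argmax decode (`toy_score` and its returned values are purely illustrative):

```python
def two_pass_infill(x, slot_types, score):
    """Two-pass dependency-aware decoding: decode each slot independently,
    then re-decode each slot conditioned on the others' first-pass values."""
    v1 = {t: score(x, t, {}) for t in slot_types}             # first pass
    v2 = {}
    for t in slot_types:                                      # refinement pass
        context = {t2: v for t2, v in v1.items() if t2 != t and v != "none"}
        v2[t] = score(x, t, context)
    return v2

def toy_score(x, t, context):
    """Stand-in for the LLM's argmax decode (illustrative values only)."""
    first_pass = {"departure": "beijing", "arrival": "new york",
                  "time": "tomorrow morning", "price": "none"}
    if t == "price" and context:   # price only resolvable given the other slots
        return "299 USD"
    return first_pass[t]

values = two_pass_infill("book a flight from beijing to new york tomorrow morning",
                         ["departure", "arrival", "time", "price"], toy_score)
print(values["price"])  # 299 USD
```

The stub makes the dependency explicit: “price” is undecodable in isolation but resolves once the other slots are in context.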
5. Slot-Level Plan-and-Infill for Parallel Decoding in Diffusion LLMs
In the context of text generation, ReFusion implements plan-and-infill by elevating masked diffusion from the token to the slot level (Li et al., 15 Dec 2025). Consider a sequence x of length L, partitioned into slots s_1, …, s_K, each of length B. At each iteration:
- Planning (diffusion-style slot selection): Draft samples are produced for each masked slot. Slot-certainty scores c_j are computed to quantify how predictable each slot is from the current context (e.g., an aggregate of the model's probabilities for the drafted tokens).
The subset of slots whose certainty exceeds a threshold τ (or the single most certain slot, if none meets the threshold) is selected for infilling.
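A sketch of this planning step, scoring each masked slot by the mean log-probability of its drafted tokens (one plausible certainty measure; the exact score used by ReFusion may differ):

```python
def slot_certainty(draft_logprobs, threshold=-1.0):
    """Select slots for infilling by contextual predictability.

    draft_logprobs: {slot_index: [log P(token) for each drafted token]}
    Returns the chosen slot indices and the per-slot scores; falls back to
    the single most certain slot so the loop always makes progress."""
    scores = {j: sum(lps) / len(lps) for j, lps in draft_logprobs.items()}
    chosen = [j for j, s in scores.items() if s > threshold]
    if not chosen:
        chosen = [max(scores, key=scores.get)]
    return chosen, scores

# Slot 0 is highly predictable from context, slot 1 is not.
chosen, scores = slot_certainty({0: [-0.1, -0.2], 1: [-3.0, -2.0]}, threshold=-1.0)
print(chosen)  # [0]
```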
- Infilling: For each selected slot, autoregressive verification is conducted in two phases:
- Global verification: Jointly compute likelihoods for the concatenated draft tokens of the selected slots. The slots fully covered by the longest accepted prefix are moved to the completed set.
- Parallel Iterative Completion: Incomplete slots are refilled token-wise via local autoregressive sampling and verification, independently and in parallel, until completion.
The process repeats, shrinking the set of masked slots, until the full sequence is generated.
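The global-verification phase amounts to accepting the longest draft prefix the autoregressive model agrees with; a minimal sketch (`verify_token` is a hypothetical stand-in for comparing a draft token against the AR model's own prediction at that position):

```python
def longest_accepted_prefix(draft_tokens, verify_token):
    """Walk the drafted tokens left to right and return the length of the
    longest prefix the autoregressive verifier accepts. Tokens after the
    first rejection must be re-generated in the completion phase."""
    accepted = 0
    for i, tok in enumerate(draft_tokens):
        if not verify_token(i, tok):
            break
        accepted += 1
    return accepted

# Verifier accepts a draft token iff it matches the AR model's choice.
n = longest_accepted_prefix(["ok", "ok", "bad", "ok"], lambda i, t: t == "ok")
print(n)  # 2
```

Note that the trailing "ok" is discarded even though it would verify: acceptance is prefix-based, since the AR model's predictions beyond a rejection are no longer valid.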
This approach permits full key-value cache reuse (no repeated recomputation of earlier tokens), and the learning problem shrinks from 2^L tokenwise masking configurations to K! slot orderings (with K = L/B). For typical sequence lengths and slot sizes, this reduces the theoretical complexity by many orders of magnitude.
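The reduction in the space of decoding orders can be checked directly (the concrete L and B below are illustrative assumptions, not values from the paper):

```python
import math

def masking_configs(L):
    """Token-level masked diffusion: any subset of the L positions may be
    masked, giving 2^L configurations."""
    return 2 ** L

def slot_orderings(L, B):
    """Slot-level decoding: only the order of the K = L // B slots matters,
    giving K! orderings."""
    return math.factorial(L // B)

# Illustrative: L = 256, B = 32 -> 2**256 configurations vs 8! = 40320 orderings.
print(masking_configs(256) > slot_orderings(256, 32))  # True
```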
6. Empirical Performance and Speedup
The slot-level plan-and-infill paradigm yields both accuracy and efficiency benefits across tasks. In slot tagging, inverse plan-and-infill with iterative refinement outperforms template-based prompting by over 6 F₁ points at 10-shot and runs substantially faster (Hou et al., 2022).
In diffusion-based text generation, ReFusion's slot-level plan-and-infill outperforms strong masked-diffusion and autoregressive models in both accuracy and throughput:
| Benchmark | LLaDA-8B-Instruct | Dream-7B-Instruct | ReFusion |
|---|---|---|---|
| MBPP pass@1 | 50.40% | 68.20% | 78.66% |
| MBPP TPS (tokens/s) | 12.42 | 53.00 | 92.09 |
| GSM8K accuracy | 76.35% | 76.42% | 84.91% |
| GSM8K TPS (tokens/s) | 23.93 | 18.99 | 81.77 |
The average speedup versus Dream/LLaDA, and versus strong ARMs, is substantial (Li et al., 15 Dec 2025). This is attributed to parallel slot infilling, reduced decoding iterations (2–10 steps, versus L steps for an ARM), and full cache reuse.
7. Representative Examples and Workflow
In slot tagging, for the utterance “book a flight from beijing to new york tomorrow morning” and slot types {departure, arrival, time, price}:
- First-pass prompts and generations:
- p(departure): “book a flight ... departure refers to __.” → “beijing.”
- p(arrival): “... arrival refers to __.” → “new york.”
- p(time): “... time refers to __.” → “tomorrow morning.”
- p(price): “... price refers to __.” → “none.”
- Second-pass (if needed for price):
- p′(price): “... departure refers to beijing. time refers to tomorrow morning. arrival refers to new york. price refers to __.” → possibly “299 USD.” or “none.”
Output values are then aligned to the input for BIO-tag production.
In diffusion LMs, slots are selected and generated as non-overlapping contiguous subblocks, with planning and infilling repeated until the sequence is complete, then reassembled in the original order.
The slot-level plan-and-infill paradigm combines slotwise reasoning with iterative dependency-aware generation, producing both accuracy and efficiency gains by matching computational structure to problem semantics. Its utility is demonstrated in both structured prediction and high-throughput text generation, where it outperforms both traditional enumeration-based and tokenwise diffusion approaches (Hou et al., 2022, Li et al., 15 Dec 2025).