Screening Is Enough
Abstract: A core limitation of standard softmax attention is that it does not define a notion of absolute query--key relevance: attention weights are obtained by redistributing a fixed unit mass across all keys according to their relative scores. As a result, relevance is defined only relative to competing keys, and irrelevant keys cannot be explicitly rejected. We introduce Multiscreen, a language-model architecture built around a mechanism we call screening, which enables absolute query--key relevance. Instead of redistributing attention across all keys, screening evaluates each key against an explicit threshold, discarding irrelevant keys and aggregating the remaining keys, thereby removing global competition among keys. Across experiments, Multiscreen achieves comparable validation loss with approximately 40% fewer parameters than a Transformer baseline, enables stable optimization at substantially larger learning rates, maintains strong performance in long-context perplexity, shows little to no degradation in retrieval performance even far beyond the training context length, and reduces inference latency by up to 3.2× at 100K context length.
Explain it Like I'm 14
Overview
This paper introduces a new way for LLMs to “pay attention” to the right information in long texts. The authors argue that the usual method (used in Transformers) spreads attention across everything, even irrelevant parts, which gets worse as the text gets longer. Their new model, called Multiscreen, uses a process they call screening. Screening lets the model decide—using a clear cutoff—which pieces of text are relevant and which aren’t, and then ignore the rest. This helps the model remember and use important information better, especially in very long documents.
Key Questions the Paper Tries to Answer
- Can an LLM decide what's relevant using an absolute rule (like a pass/fail test) instead of always comparing everything to everything else?
- Will this help models handle very long contexts better, without forgetting or “diluting” the important bits?
- Can it make models smaller, faster, and easier to train while keeping good performance?
- How can we measure pure “retrieval”—finding the right piece of information—without confusing it with language tricks or prompt wording?
How Multiscreen Works (Explained Simply)
Imagine you’re reading a long notebook and trying to answer a question using earlier pages:
- The usual Transformer attention is like having to give a little bit of your focus to every page, even the useless ones, because attention is spread across all pages. If the notebook gets longer, each page gets a smaller slice of your focus.
- Screening is like having a smart filter or a “bouncer.” For each question, it checks every page one by one and asks: “Is this page similar enough to what I need?” If not, it’s tossed out. If yes, it’s kept. There’s a clear threshold that decides what counts as “relevant.”
Here are the main ideas, in everyday terms:
- Absolute relevance instead of relative attention: Multiscreen doesn’t force all focus to add up to one and be shared across everything. Each past token is tested on its own against a threshold; if it doesn’t pass, it contributes nothing.
- Screening windows: Each part of the model learns how far back it should look. If a piece learns it only needs recent info, it doesn’t waste time scanning faraway text. If it needs long-range info, it can open its window wider.
- Minimal positional encoding: The model adds a tiny sense of “order” only when it’s looking locally. When it opens the window wide for long distances, that extra position trick turns off, so the model doesn’t rely on guessing position patterns it never saw during training.
- Normalization and safety checks: Before comparing things, the model normalizes vectors so comparisons are fair. It also keeps outputs from getting too large with a gentle limiter (called TanhNorm) and uses a “gate” (like a volume knob) to decide how much of the retrieved information to use.
Analogy: Think of screening like using a spam filter with a strict rule. Each email is checked separately; only emails that clearly pass the rule get into your inbox. You’re not forced to pick “some” emails if none look good—you can end up with zero, which is exactly what you want when nothing is relevant.
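The screening pipeline described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the paper's implementation: the exact functional forms of the Trim-and-Square transform, the cosine softmask, and TanhNorm, as well as the parameter names `r` (inverse acceptance width) and `w` (screening window), are assumptions based on the descriptions in this summary.

```python
import numpy as np

def screening(q, K, V, r=4.0, w=64.0):
    """Sketch of one screening unit (functional forms are assumptions).

    q: (d,) query; K, V: (n, d) keys/values at positions j = 0..n-1,
    with the query at position i = n-1.
    """
    # Normalize query, keys, and values to unit length so that
    # similarities are bounded in [-1, 1].
    q = q / np.linalg.norm(q)
    K = K / np.linalg.norm(K, axis=1, keepdims=True)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)

    sim = K @ q  # cosine similarity per key, in [-1, 1]

    # Trim-and-Square (assumed form): keys whose similarity falls more
    # than the acceptance width 1/r below the maximum of 1 contribute
    # exactly zero; surviving similarities are squared.
    rel = np.clip(r * (sim - 1.0) + 1.0, 0.0, None) ** 2

    # Causal, cosine-shaped softmask (assumed form): weights decay
    # smoothly to zero at the edge of the screening window w.
    i = len(K) - 1
    dist = i - np.arange(len(K))  # i - j >= 0 for causal positions
    mask = np.where(dist < w, 0.5 * (1.0 + np.cos(np.pi * dist / w)), 0.0)

    weights = rel * mask  # no softmax: keys do not compete for mass
    out = weights @ V     # may be the zero vector if nothing passes

    # TanhNorm: cap the output norm at 1 while preserving direction.
    norm = np.linalg.norm(out)
    if norm > 0:
        out = out * np.tanh(norm) / norm
    return out
```

Note the contrast with softmax: if no key passes the threshold, the output is simply the zero vector, which is exactly the "zero relevant emails" behavior from the spam-filter analogy.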
What the Researchers Did to Test It
- They trained Multiscreen and regular Transformers on the same data and compared:
- How well they predict the next word (validation loss/perplexity).
- How well they can find a specific piece of information inside very long texts.
- How stable training is at different learning rates.
- How fast they run when the context is extremely long.
- They also created a new, simple benchmark called ABCDigits to test pure retrieval. It shows a shuffled list like “A=123456, B=987654, …” and then asks the model to complete something like “L=”. There’s exactly one correct answer in the text, and there are always 26 keys (A–Z), so the task cleanly measures whether the model can find the right match without relying on language tricks or special prompts.
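A generator for ABCDigits-style examples can be sketched as below. The exact format strings (separators, digit length, prompt suffix) are assumptions based on the description above; only the structure is from the paper: 26 shuffled keys A–Z, one uniquely determined completion.

```python
import random
import string

def make_abcdigits_example(rng, digits=6):
    """Generate one ABCDigits-style retrieval example.

    All 26 keys A-Z appear exactly once in shuffled order, each mapped
    to a random digit string; the prompt ends with one key and '=',
    and the target is that key's value.
    """
    keys = list(string.ascii_uppercase)
    rng.shuffle(keys)
    values = {k: "".join(rng.choice("0123456789") for _ in range(digits))
              for k in keys}
    context = ", ".join(f"{k}={values[k]}" for k in keys)
    query = rng.choice(keys)
    prompt = f"{context}, {query}="
    return prompt, values[query]

# Usage: a reproducible example
prompt, answer = make_abcdigits_example(random.Random(0))
```

Because the number of keys is fixed at 26 regardless of context length, accuracy on this task measures matching ability directly rather than tolerance to a growing key set.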
Main Findings and Why They Matter
Here are the most important results:
- Fewer parameters for similar quality: Multiscreen matched a Transformer’s next-word prediction quality with about 40% fewer parameters. This means smaller models can perform similarly well.
- More stable training at higher learning rates: Multiscreen trained reliably even with much larger learning rates than Transformers can handle. That usually means faster and easier training.
- Better at long contexts: On very long texts, Multiscreen kept good perplexity (a measure of prediction quality) without breaking down when texts got longer than what it saw during training. Transformers often spiked in perplexity beyond their trained length.
- Strong retrieval, even far beyond training length: On the ABCDigits test, Multiscreen stayed very accurate at finding the right match—even at lengths much longer than it was trained on. It also beat Transformers clearly, including within the training length. Even a much smaller Multiscreen model outperformed a larger Transformer in retrieval.
- Faster at very long inputs: For 100,000-token contexts, Multiscreen cut inference time by up to 3.2× compared to Transformers. That’s a big speedup for long documents.
Why this matters:
- Real-world use often involves long documents, code files, or chats. Being able to find the right info quickly and reliably in long contexts is crucial.
- Smaller, faster, and more stable models are cheaper to train and run, and they’re easier to deploy.
What This Could Mean Going Forward
The paper suggests that to handle long inputs well, models should move from “redistributing attention across everything” to “explicitly selecting what matters” using clear, absolute rules. Screening shows that:
- We can build models that are lighter, faster, and more reliable at long-range retrieval.
- Training can be made more stable with bigger learning rates, helping efficiency.
- Models can avoid the common “attention dilution” problem where important info gets buried in very long contexts.
In short, screening changes the model’s mindset from “spread attention everywhere and hope the good stuff stands out” to “check each piece and keep only the good stuff.” That simple switch has big benefits for accuracy, speed, and robustness on long inputs.
Knowledge Gaps
Below is a consolidated list of what remains missing, uncertain, or unexplored in the paper, phrased to be concrete and actionable for future work:
- External validity to natural-language tasks: No evaluation on real long-context tasks (e.g., long-document QA/summarization, multi-hop retrieval) to confirm that ABCDigits gains transfer to semantically rich settings.
- Breadth of benchmarks: Absence of comparisons on standard long-context suites (e.g., LongBench, RULER, GovReport/BookSum, NIAH variants) and memory-intensive reasoning tasks to triangulate retrieval and compositional use of retrieved content.
- Instruction tuning and alignment: Unclear how screening interacts with instruction-following (SFT/RLHF) and whether absolute-relevance selection is robust under instruction formats, chain-of-thought prompting, or tool-use scenarios.
- Generalization across domains: Results are limited to English web corpora; no tests on code, math, multilingual text, or noisy OCR-like documents that stress different positional and retrieval behaviors.
- Upper-scale viability: Largest reported model (~4B) is trained via an architectural conversion mid-run; no clean, apples-to-apples scaling beyond 1B to tens of billions of parameters, leaving open whether advantages persist at frontier scales.
- Compute-parity and fairness: Token budgets are matched, but per-token compute/FLOPs and memory traffic likely differ; no end-to-end training throughput, memory footprint, or energy comparisons to ensure fair efficiency claims.
- Baseline breadth: Missing head-to-head comparisons against strong long-context baselines (e.g., LongRoPE, ALiBi, position-interpolation variants), sparse/entmax/sparsemax attention, retrieval-augmented or block-sparse attention, and efficient backbones (Mamba, RetNet, Hyena) under matched training and evaluation.
- Ablations of architectural components: No systematic study isolating the effects of unit-length normalization, Trim-and-Square vs alternative transforms, Softmask shape, TanhNorm, MiPE, GLU-style gating, or tied/normalized embeddings on stability, retrieval, and loss.
- Sensitivity analysis: Lack of hyperparameter sweeps for key design choices (e.g., d_K, d_V, w_th, initialization scales, s_O, value normalization on/off), and no robustness analysis across seeds beyond small models.
- Query-adaptive thresholds: Screening uses two learned scalars per unit (s_w, s_r) rather than query-conditioned thresholds; it is unknown if per-query or per-token adaptive thresholds would improve recall/precision trade-offs.
- Window-learning differentiability: The Softmask uses a hard zero outside the learned window (-w < j - i <= 0), creating a non-differentiable inclusion boundary; the impact on learning window growth and gradient flow is not analyzed.
- Behavior near w_th: MiPE is disabled via a piecewise function of w with a hard cutoff; potential optimization instabilities or behavioral discontinuities near the threshold are not studied.
- Value normalization trade-offs: Normalizing values to unit length eliminates magnitude information; no evidence is provided on whether this harms tasks that rely on graded contributions or calibrated attenuation.
- TanhNorm effects: The saturating norm bound may impede gradient flow or suppress additive aggregation when many relevant keys exist; no ablation or analysis of alternative norm control (e.g., RMSNorm, LayerNorm-on-values) is provided.
- Multi-evidence aggregation: Screening can zero out many keys; it is unclear how well the model aggregates numerous moderately relevant pieces (e.g., summarization, entailment) versus a few highly relevant ones, and whether a no-competition design hurts fine-grained weighting.
- False negative vs false positive balance: No calibration analysis on the acceptance width (1/r) and its effect on rejecting slightly-relevant keys, especially under noise, distractors, or adversarial prompts.
- Long-context extrapolation mechanism: When w exceeds the training max, inference forces w = ∞; the ablation of this intervention and its effect on quality and compute is not provided.
- KV-caching and incremental decoding: The paper does not explain whether screening supports efficient caching analogous to attention KV-caches, nor the per-token decoding complexity and memory under varying learned windows.
- Inference compute distribution: No statistics on the learned window size distribution across layers/tiles, nor the fraction of tiles operating effectively in linear-time vs quadratic-time at inference.
- Training stability generality: Learning-rate stability is shown at small scales (28–45M); it is unknown whether the stability margin holds at larger scales, different batch sizes, and across optimizer/weight decay/clipping settings.
- Calibration and uncertainty: Non-normalized relevance may change confidence calibration; no calibration metrics (e.g., ECE), logit scaling behavior, or downstream impact on hallucination/control are reported.
- Safety and robustness: No adversarial, noisy, or distribution-shift robustness studies (e.g., distractor density, bursty repetitions, specious correlations) to test screening under challenging retrieval conditions.
- Positional expressivity: MiPE rotates only the first two coordinates and is inactive for large w; the sufficiency of such minimal positional signal for order-sensitive reasoning is untested.
- Theoretical analysis: Lack of formal guarantees or analyses (e.g., stability, Lipschitz constants, gradient norms) explaining why removing competition stabilizes optimization and how screening behaves in the limit of long contexts.
- Latency reporting scope: Inference latency gains are reported at 100K tokens but without hardware, kernel, and caching details, nor latency–quality trade-offs across lengths and batch sizes.
- Reproducibility artifacts: No explicit release of code, kernels, or ABCDigits generator details; exact reproduction of screening unit kernels and fused ops may be non-trivial without reference implementations.
- Compatibility with ecosystem improvements: It is unknown how screening integrates with MoE, retrieval-augmented generation, spec-decoding, speculative cache reuse, or adaptive computation (early exiting).
- Task-specific training objectives: The paper shows retrieval–loss mismatch (better retrieval with higher validation loss) but does not explore alternative training objectives or auxiliary losses that explicitly encourage retrieval with screening.
Practical Applications
Immediate Applications
Below are actionable use cases that can be deployed now, derived from the paper’s findings on screening-based relevance, learned windows, minimal positional encoding, improved training stability, and long-context efficiency.
- Improved long-context document processing in enterprise workflows (legal, finance, healthcare, research)
- What: Replace softmax attention with Multiscreen to reliably retrieve relevant information across long contracts, filings, patient records, and literature without “lost-in-the-middle” degradation.
- Tools/products/workflows: PyTorch/DeepSpeed modules implementing gated screening tiles; inference configurations that honor learned screening windows; evaluation pipelines including ABCDigits for retrieval QA; integrated long-context summarization and cross-referencing.
- Assumptions/dependencies: Latency and retrieval gains demonstrated at 100K-token contexts; generalization from SlimPajama pretraining to domain-specific corpora may require finetuning; robust GPU kernels for screening/window-skip execution.
- Faster long-context chat and assistant experiences (software, consumer tech)
- What: Deploy Multiscreen-backed chat systems with sustained 100K-token sessions, reducing latency by 2.3–3.2× and maintaining retrieval fidelity across extended conversations.
- Tools/products/workflows: Streaming inference with dynamic window scheduling; token caching across turns; CI tests using semantics-free retrieval (ABCDigits) to calibrate model updates.
- Assumptions/dependencies: Real user prompts must benefit from absolute relevance; product telemetry should validate speedups at operational context lengths; kernel fusion and window-skipping need production-grade implementations.
- Parameter-efficient LLMs for constrained environments (edge/IoT, education, startups)
- What: Achieve comparable validation loss with ≈40% fewer parameters; enable smaller models on limited hardware while preserving retrieval behavior.
- Tools/products/workflows: Weight-tying (shared normalized embeddings), small key/value dimensions, Multiscreen tiles; deployment to single-GPU/edge devices; curriculum finetuning for domain adaptation.
- Assumptions/dependencies: Reported efficiency holds up to tested scales (≤4B parameters); memory footprint still scales with context length—ensure window learning keeps most tiles finite.
- Stable high-learning-rate training pipelines (industry ML ops, academia)
- What: Use substantially larger LRs (e.g., constant after warmup), omit weight decay and gradient clipping, and reduce LR tuning burden; deploy faster and more robust training.
- Tools/products/workflows: Training recipes mirroring paper (AdamW, large batch, fixed LR post-warmup); LR sweeps extending into higher ranges; gradient-norm monitoring dashboards.
- Assumptions/dependencies: Stability verified at small-to-mid scales and specific optimizer settings; extreme-scale training may still require guardrails; dataset composition affects LR tolerance.
- Retrieval reliability QA with ABCDigits (academia, industry QA, policy auditing)
- What: Adopt the semantics-free, completion-based key–value retrieval benchmark to isolate and quantify retrieval ability independently of instruction-following and semantic cues.
- Tools/products/workflows: Synthetic dataset generator; CI gates for retrieval regressions; standardized reporting over context length vs. depth grids; integration with model cards.
- Assumptions/dependencies: Synthetic metrics should correlate with downstream retrieval tasks; complement with domain-specific retrieval tests to avoid overfitting to synthetic patterns.
- Safer context handling via absolute rejection (policy, safety, alignment)
- What: Leverage thresholded screening to explicitly discard irrelevant keys, reducing spurious context contributions and enabling auditable relevance maps for safety-critical deployments.
- Tools/products/workflows: Logging of per-key relevance values; runtime thresholds for conservative screening; post-hoc analysis tools; governance audits.
- Assumptions/dependencies: Threshold calibration must balance recall and precision; user data privacy and explainability policies require careful logging design.
- Long-document summarization and personal knowledge-base assistants (daily life, education)
- What: Build apps that can ingest and reliably retrieve from long notes/books/journals for study, research, or personal memory—without positional extrapolation fragility.
- Tools/products/workflows: Local/offline assistants using Multiscreen; long-context note ingestion; retrieval diagnostics with ABCDigits-like tests; streamlined positional setup via MiPE.
- Assumptions/dependencies: Consumer hardware must handle long contexts; real-world content diversity may require finetuning; UX design to surface retrieval confidence.
- Simplified positional handling without RoPE extrapolation (LLM platforms)
- What: Use minimal positional encoding (MiPE) that activates only for small windows and disables itself for long-range access, avoiding brittle RoPE scaling at inference.
- Tools/products/workflows: Conditional positional modules tied to the learned window parameter w; removal of RoPE scaling factors from deployment configs.
- Assumptions/dependencies: Some tasks may rely on richer positional structure; validation on code and math tasks recommended.
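The conditional positional module described in this use case can be sketched as follows. The rotation frequency, the threshold value w_th, and the hard-cutoff gating are assumptions; the text specifies only that MiPE rotates the first two coordinates and disables itself for large windows.

```python
import numpy as np

def mipe(x, pos, w, w_th=128.0, base=10000.0):
    """Sketch of minimal positional encoding (MiPE); gating rule and
    rotation frequency are assumptions.

    Rotates only the first two coordinates of x by a position-dependent
    angle, and acts as the identity when the learned window w exceeds
    the threshold w_th (i.e., no positional signal for long-range access).
    """
    if w > w_th:          # hard cutoff: positional encoding disabled
        return x
    theta = pos / base    # single assumed rotation frequency
    out = x.copy()
    c, s = np.cos(theta), np.sin(theta)
    out[0] = c * x[0] - s * x[1]
    out[1] = s * x[0] + c * x[1]
    return out
```

Because the rotation touches only two coordinates and vanishes at long range, there is no RoPE frequency schedule to rescale at deployment time, which is the simplification this use case relies on.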
- Reduced compute and energy costs at long contexts (cloud ops, energy)
- What: Exploit learned finite screening windows for effectively linear-time tiles; skip computations outside the learned window to lower total FLOPs and energy consumption during long-context inference.
- Tools/products/workflows: Scheduler that encourages/monitors finite windows across tiles; energy/perf dashboards; auto-tuning of window initialization.
- Assumptions/dependencies: Real gains depend on the proportion of tiles learning finite windows; if many windows go infinite, costs approach full causal interaction.
- Better RAG integration via internal screening (software)
- What: Combine external retrieval with internal absolute relevance to filter noisy retrieved passages, improving end-to-end generation quality.
- Tools/products/workflows: RAG pipelines with screening-aware readers; retrieval confidence weighting; evaluation on long multi-document queries using ABCDigits-style controls.
- Assumptions/dependencies: Gains depend on retrieval quality and reader–retriever synergy; benchmark under realistic corpus noise.
Long-Term Applications
The following use cases require further research, scaling, engineering, or validation to reach production maturity.
- Extreme long-context LLMs (≥1M tokens) for whole-codebase analysis, legal discovery, and longitudinal EHR review (software, legal, healthcare)
- What: Push robust retrieval and latency benefits to million-token contexts without positional extrapolation mismatch.
- Tools/products/workflows: Memory-optimized kernels; streaming/chunked training; hierarchical screening across sections.
- Assumptions/dependencies: Hardware memory constraints; screening must remain effective at scale; careful curriculum for length generalization.
- Hybrid backbones pairing Multiscreen with linear-time sequence models (Mamba/Hyena/RetNet) (software, robotics)
- What: Combine efficient state-space/convolutional cores for bulk context processing with screening for precise recall.
- Tools/products/workflows: Architectural research; co-training pipelines; routing between modules based on learned windows/gates.
- Assumptions/dependencies: Integration complexity; training stability across heterogeneous modules; task-dependent routing policy.
- Sector standards for long-context reliability (policy, governance, procurement)
- What: Establish ABCDigits-like evaluation as part of compliance checklists for LLM procurement and certification (e.g., government, finance, healthcare).
- Tools/products/workflows: Standardized test suites; reporting templates; thresholds for acceptable retrieval performance at specified context lengths.
- Assumptions/dependencies: Multi-stakeholder consensus; risk of metric gaming; need for domain-specific complements.
- Transparent, auditable relevance maps for safety-critical decisions (healthcare, finance, public sector)
- What: Use per-key relevance logging to explain model decisions over long contexts, aiding audits and error analysis.
- Tools/products/workflows: Secure logging; visualization tools; integration with model governance platforms.
- Assumptions/dependencies: Privacy constraints; acceptable trade-offs between transparency and performance.
- On-device assistants with persistent lifetime memory (daily life, privacy-preserving AI)
- What: Private, long-memory assistants managing diaries, emails, documents over years, with robust retrieval and acceptable latency on consumer hardware.
- Tools/products/workflows: Incremental memory ingestion; compaction via learned windows; local inference engines.
- Assumptions/dependencies: Device compute and storage; battery and thermal constraints; user consent and data governance.
- Energy-efficient data centers via screening-driven scheduling (energy, sustainability)
- What: Dynamically adjust screening windows to minimize compute for long-context inference across fleets, lowering carbon footprint.
- Tools/products/workflows: Fleet-wide window telemetry; auto-schedulers; carbon accounting dashboards.
- Assumptions/dependencies: Predictive control of window sizes; negligible quality loss under aggressive compute reduction.
- Education: longitudinal learning analytics and tutoring (education)
- What: Track student progress across multi-year artifacts (assignments, notes) and retrieve misconceptions or milestones reliably.
- Tools/products/workflows: Secure student data pipelines; long-context tutor models; retrieval confidence reporting.
- Assumptions/dependencies: Strong privacy and consent frameworks; domain finetuning; validation on diverse curricula.
- Financial compliance and risk analysis over large corpora (finance)
- What: Automate prospectus and regulatory document review across thousands of pages with robust retrieval and explanations.
- Tools/products/workflows: Ingestion pipelines; screening-enhanced readers; audit trails of relevance decisions.
- Assumptions/dependencies: Domain adaptation; legal review requirements; model robustness to structured financial language.
- Scientific literature copilots at enterprise scale (academia, pharma, R&D)
- What: Aggregate and reason over tens of thousands of papers, reliably retrieving methods, results, and contradictions.
- Tools/products/workflows: Long-context literature ingestion; citation-aware screening; synthesis and conflict detection.
- Assumptions/dependencies: Continual pretraining on scientific corpora; hallucination control; provenance tracking.
- Robotics and autonomous systems memory (robotics)
- What: Retrieve mission-critical past states and logs across long operational histories for planning and diagnostics.
- Tools/products/workflows: Tokenization of multimodal logs; screening-aware memory modules; offline analysis and real-time recall.
- Assumptions/dependencies: Robust mapping of non-text signals to token sequences; latency constraints in real-time systems.
Glossary
- ABCDigits: A semantics-free, completion-based benchmark to evaluate key–value retrieval in long contexts. "we introduce ABCDigits, a synthetic completion-based key--value retrieval benchmark that removes natural-language semantics, fixes the number of keys across context lengths, and ensures that the target output is uniquely determined without relying on instruction-following or semantic cues."
- acceptance threshold: The effective similarity cutoff after the Trim-and-Square transform above which keys are considered relevant. "illustrating the effective acceptance threshold."
- acceptance width: The inverse-width parameter controlling how far below maximum similarity a key can be while still being accepted. "where $w$ is the screening window and $1/r$ is the acceptance width for similarity."
- AdamW: An optimizer with decoupled weight decay commonly used to train large neural networks. "All models are optimized using AdamW"
- ALiBi-style: A class of positional extrapolation methods that adjust attention biases for long contexts. "ALiBi-style or RoPE-based extrapolation methods"
- attention-fading effect: The dilution of attention over many tokens as context length grows. "Scalable-Softmax (SSMax) targets the attention-fading effect by sharpening the attention distribution as context length increases"
- associative recall: Synthetic tasks probing a model’s ability to retrieve values associated with keys within a sequence. "Synthetic associative recall and key--value retrieval tasks have long been used to study memory in sequence models"
- causal mask: A mask that prevents a token from attending to future positions. "In this limit, the softmask reduces to a standard full causal mask."
- continual pretraining: Further pretraining a model, often at longer sequence lengths, starting from an existing checkpoint. "we further perform continual pretraining with a sequence length of "
- distance-aware relevance: The relevance score modulated by positional distance via the softmask. "The distance-aware relevance is"
- distance-unaware relevance: The content-based relevance computed from query–key similarity before applying distance weighting. "We then define a distance-unaware relevance using a Trim-and-Square transform"
- entmax: A sparse alternative to softmax that can yield sparse attention distributions. "such as sparsemax, entmax, and their variants"
- FIRE: A functional relative position encoding method for length generalization. "such as FIRE"
- gated screening tile: The head-level module that performs screening-based aggregation followed by multiplicative gating and projection. "A gated screening tile is the head-level module illustrated in \cref{fig:gscrn}."
- GLU-style multiplicative gating: A gating mechanism that modulates features via elementwise multiplication inspired by Gated Linear Units. "modulates the retrieved representation with a nonlinear gate inspired by GLU-style multiplicative gating"
- gradient clipping: A stabilization technique that caps gradient norms to prevent exploding gradients. "gradient clipping (threshold $1.0$)"
- Hyena: An architecture using long convolutions to model long-range dependencies efficiently. "Architectures such as Mamba, Hyena, and RetNet"
- inference latency: The time required for a model to generate outputs at inference. "reduces inference latency by 2.3--3.2× relative to the Transformer baseline."
- language-modeling head: The final projection from hidden states to vocabulary logits used for next-token prediction. "The input embedding matrix is normalized and shared with the language-modeling head"
- learning-rate sweep: A systematic evaluation across multiple learning rates to assess training stability and performance. "we conduct a learning-rate sweep"
- LLaMA-style architecture: A Transformer configuration family popularized by LLaMA models. "we adopt a LLaMA-style architecture"
- LongRoPE: A method that adapts RoPE for improved performance at extended context lengths. "methods such as LongRoPE that explicitly retune positional behavior for longer contexts"
- lost-in-the-middle phenomena: A retrieval failure mode where models struggle to recall information located in the middle of long contexts. "including lost-in-the-middle phenomena"
- Mamba: A sequence model based on selective state spaces enabling efficient long-range modeling. "Architectures such as Mamba, Hyena, and RetNet"
- minimal positional encoding (MiPE): A RoPE-like rotation applied to only two coordinates and activated only for small windows. "we introduce minimal positional encoding (MiPE), a RoPE-like rotation"
- Multiscreen: A language-model architecture that replaces softmax attention with screening to enable absolute relevance. "We introduce Multiscreen, a language-model architecture that enables absolute query--key relevance through a mechanism we call screening."
- needle-in-a-haystack: An evaluation setup where a small piece of relevant information must be retrieved from a long context. "needle-in-a-haystack and passkey-style evaluations for long-context retrieval"
- NoPE: An approach that removes explicit positional encodings while analyzing length generalization behavior. "including NoPE and subsequent analyses of its length generalization behavior"
- perplexity: A standard metric for language modeling quality, measuring how well the model predicts text. "maintains strong performance in long-context perplexity"
- Pythia: A suite of standardized Transformer configurations and training settings used for benchmarking. "based on those used in Pythia"
- relative position schemes: Positional encoding methods that depend on relative rather than absolute positions. "learned or function-based relative position schemes for length generalization"
- RetNet: A model using recurrent retention mechanisms to handle long sequences. "Architectures such as Mamba, Hyena, and RetNet"
- retrieval-based attention: Mechanisms that first select a subset of keys to attend for efficiency before applying attention over them. "sparse or retrieval-based attention mechanisms that restrict the set of attended keys"
- recurrent-retention mechanisms: Recurrence-like mechanisms that retain and update summaries of past context. "or recurrent-retention mechanisms"
- RoPE: Rotary Positional Embeddings, a method for encoding relative positions via rotations in embedding space. "we use RoPE with "
- RoPE scaling factor: A multiplier applied to RoPE frequencies to extrapolate to longer context lengths. "we test multiple RoPE scaling factors"
- RoPE-like rotation: A rotation-based positional encoding akin to RoPE, here applied minimally in MiPE. "a RoPE-like rotation"
- row-wise normalization to unit length (RSS): Normalizing each row vector to have unit norm. "'/RSS' denotes row-wise normalization to unit length."
- Scalable-Softmax (SSMax): A modified softmax that sharpens attention as context length increases to counteract attention fading. "Scalable-Softmax (SSMax) targets the attention-fading effect by sharpening the attention distribution as context length increases"
- screening: A mechanism that evaluates each key against an absolute threshold and aggregates only the relevant ones. "we propose a mechanism called screening that enables absolute query--key relevance."
- screening unit: The module that computes similarities, thresholds them, applies distance weighting, aggregates surviving values, and normalizes. "We now describe the screening unit shown in \cref{fig:scrn}."
- screening window: The learned width controlling how far in the sequence a screening unit attends. "where $w$ is the screening window"
- selective state spaces: State-space models that selectively update and propagate information for long-range dependencies. "including selective state spaces, long convolutions, or recurrent-retention mechanisms"
- Selective Attention: A softmax-based variant introducing query- and position-dependent temperature scaling. "Selective Attention introduces query- and position-dependent temperature scaling within the softmax framework"
- semantic masking: Hiding semantic cues in prompts to isolate retrieval behavior. "obscured by semantic masking"
- SiLU: An activation function (Sigmoid Linear Unit), also known as swish, used here inside the gate. "we use the elementwise SiLU nonlinearity for gating"
- SlimPajama: A large-scale pretraining dataset derived from RedPajama. "We pretrain all models on the SlimPajama~\cite{cerebras2023slimpajama} dataset"
- softmask: A cosine-shaped, causal, distance-aware weighting that smoothly decays to zero at the window boundary. "We next apply a causal and distance-aware softmask:"
- sparsemax: A sparse alternative to softmax that can produce zero probabilities exactly. "such as sparsemax, entmax, and their variants"
- supraparameter: A single scaling hyperparameter controlling multiple model dimensions simultaneously. "scaling with the supraparameter ."
- TanhNorm: A norm-bounding function that preserves direction while smoothly capping vector norms at 1. "we apply a normalization function that we introduce as TanhNorm"
- tied and normalized input-output embedding: Sharing and normalizing the input embedding matrix with the output head. "The model uses a tied and normalized input-output embedding structure"
- Trim-and-Square transform: A threshold-and-squaring mapping from similarity to relevance that zeros out low similarities. "We then define a distance-unaware relevance using a Trim-and-Square transform"
- unit-length normalization: Normalizing queries, keys, and values so their norms are one to bound similarities in [-1, 1]. "We first normalize queries, keys, and values to unit length:"
- validation loss: The cross-entropy loss on held-out data used to assess model quality during training. "Multiscreen achieves comparable validation loss"
- weight decay: L2-like regularization applied to weights during optimization to reduce overfitting. "weight decay ($0.1$)"
- weight tying: Using the same parameters for input embeddings and the output projection to improve efficiency. "and apply weight tying between the input embedding and the language modeling head"