Variable-Width Transformers
Abstract: Scaling model size, specifically depth and width, has driven significant progress in transformer-based LLMs. However, most architectures maintain a constant width across all layers, allocating a fixed parameter and computation budget evenly despite different layers potentially playing distinct computational roles. In this work, we empirically investigate nonuniform capacity allocation across network depth by proposing a $\times$-shaped ><former architecture. This design maintains wider early and late layers while narrowing the middle layers, utilizing a parameter-free residual resizing mechanism. Across decoder-only LLMs ranging from 200M to 2B parameters (dense) and 3B parameters (MoE), our ><former consistently outperforms parameter-matched uniform baselines on language modeling loss. By reducing the average layer width, this architecture also requires fewer overall FLOPs (22% reduction under fitted loss-matched scaling curves) and smaller KV cache memory and I/O cost (15% reduction). In analysis, we show that this bottleneck structure results in qualitatively different representations in residual streams. Overall, our results demonstrate that nonuniform width allocation can result in more resource-optimal scaling of LLMs.
Explain it Like I'm 14
A Simple Explanation of “Variable-Width Transformers” (the ><former)
1) What this paper is about
This paper asks a simple question about the way many AI language models (Transformers) are built: do all layers really need to be the same size? Instead of keeping every layer equally “wide,” the authors design a new model shape that’s wide at the start and end but narrow in the middle, like an hourglass or an “X.” They call it the “><former.” They show that this shape can make models both better and more efficient.
2) The main questions they ask
- If we have a fixed budget of model size and compute, is it better to give every layer the same amount of capacity, or to give different layers different amounts?
- Which shape works best: always widening, always narrowing, narrow-then-wide, or wide-then-narrow?
- Can we change layer widths without breaking how information is passed between layers?
- Does this design actually improve performance and save computing power and memory?
- Where should the narrow “bottleneck” go, and how narrow should it be?
3) How they approached it (in everyday terms)
Think of a Transformer as a multi-step assembly line for understanding text:
- Depth = how many steps (layers) there are.
- Width = how many “lanes” each step has to carry information.
Most models give every step the same number of lanes. This paper tries a different plan:
- Make early and late steps wide (many lanes), but squeeze the middle steps (fewer lanes). Picture an hourglass or the shape of an “X”: wide → narrow → wide.
The tricky part is passing information through layers of different widths. The authors use a simple, no-new-parameters trick:
- Imagine a big shared notebook (the “residual stream”) that carries information forward through the whole model.
- Each layer only writes to and reads from part of this notebook (a slice).
- When a layer gets narrower, it just uses fewer pages; when a later layer gets wider again, it “copies back” the last known values for the extra pages it needs. No extra learned weights are added—just smart copying and slicing.
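The notebook analogy above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes each layer reads and writes the leading `width` coordinates of a fixed-size stream, and `toy_block` is a hypothetical stand-in for a real attention + MLP block. The names `apply_block`, `toy_block`, and `run` are invented for this example.

```python
from typing import Callable, List

Vector = List[float]


def apply_block(stream: Vector, width: int,
                block: Callable[[Vector], Vector]) -> Vector:
    """Run one residual block that only sees the first `width` coordinates.

    Assumption: the layer's slice is the leading part of the stream.
    """
    if not 0 < width <= len(stream):
        raise ValueError("width must be in 1..len(stream)")
    visible = stream[:width]                  # slice this layer reads
    update = block(visible)                   # stand-in for attention + MLP
    if len(update) != width:
        raise ValueError("block must return exactly `width` values")
    written = [x + u for x, u in zip(visible, update)]  # residual add
    # Coordinates beyond `width` are left untouched, so a later, wider
    # layer simply finds the last values written there (carry-forward).
    return written + stream[width:]


def toy_block(x: Vector) -> Vector:
    """Hypothetical block: contributes +1.0 to every coordinate it sees."""
    return [1.0 for _ in x]


def run(stream: Vector, widths: List[int]) -> Vector:
    """Apply one toy block per entry in `widths`."""
    for w in widths:
        stream = apply_block(stream, w, toy_block)
    return stream


if __name__ == "__main__":
    # Wide -> narrow -> wide over a stream of six coordinates.
    print(run([0.0] * 6, widths=[6, 4, 2, 4, 6]))
```

In this toy run, the last two coordinates are updated only by the first and last layers, while the first two are updated by all five. No projection weights are involved when the width changes, which is the sense in which the resizing is parameter-free.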
They tested several shapes (A, V, <> and X) and trained models of different sizes (200M to 2B parameters, plus a 3B Mixture-of-Experts model) on a large text dataset. They measured:
- Language modeling quality (loss/perplexity: lower is better).
- Compute used (FLOPs).
- Memory related to attention (KV cache), which affects how much memory and data movement you need when the model runs.
Why this can save compute and memory:
- Parameters in each layer grow with “width × width,” so if you keep the total number of parameters the same but allow some narrow layers, the average width goes down.
- Attention compute and KV cache grow roughly with “width,” so lower average width means less attention compute and smaller KV cache.
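The two bullets above amount to a small piece of arithmetic, sketched here under simplifying assumptions: per-layer parameters are taken as proportional to width squared (ignoring embeddings and constant factors), and attention/KV cost as proportional to width. The width profile in the example is illustrative and not taken from the paper; the function names are invented for this sketch.

```python
import math
from typing import Sequence


def matched_uniform_width(widths: Sequence[float]) -> float:
    """Uniform width with the same sum of width**2 (parameter proxy)."""
    return math.sqrt(sum(w * w for w in widths) / len(widths))


def average_width(widths: Sequence[float]) -> float:
    """Mean width, the proxy for attention compute and KV-cache size."""
    return sum(widths) / len(widths)


def width_cost_saving(widths: Sequence[float]) -> float:
    """Fractional drop in width-proportional cost vs. the matched uniform model."""
    return 1.0 - average_width(widths) / matched_uniform_width(widths)


if __name__ == "__main__":
    profile = [1024, 768, 512, 768, 1024]  # illustrative widths only
    print("matched uniform width:", round(matched_uniform_width(profile), 1))
    print("average width:", average_width(profile))
    print("relative saving:", round(width_cost_saving(profile), 4))
```

Because the mean of a set of numbers never exceeds their root-mean-square, any non-uniform profile has a lower average width than the uniform model with the same parameter proxy, and the saving is zero only when all widths are equal.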
4) What they found and why it matters
Across all the tested sizes, the X-shaped design (wide–narrow–wide) worked best. Here are the big takeaways, phrased simply:
- Better performance: The ><former consistently had lower language modeling loss (which means better predictions), typically around a few percent better perplexity than same-size, uniform-width models.
- Less compute and memory (at the same parameter budget):
- In direct experiments, training used about 2–4% fewer FLOPs and needed roughly 10–11% less KV-cache memory on average.
- Using fitted scaling curves (a way to predict performance at larger scales), the model could reach the same quality with about 22% fewer FLOPs and about 15% less KV-cache size.
- Works beyond standard models: The benefits also showed up in a Mixture-of-Experts model.
- Better use of middle layers: Deeper models often “collapse” in the middle and stop using their full capacity there. The ><former’s built-in bottleneck seems to prevent that. It spreads out work more evenly, keeps the middle healthier, and uses features more evenly across dimensions.
- Smoother thinking, sharper finishing: Looking at intermediate predictions (with a tool called a “logit lens”), the ><former changes its mind more gradually through the middle, then locks in the right answer near the end. That’s a good sign of steady, focused processing.
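One way to quantify how gradually a model "changes its mind" is the KL divergence between the logit-lens distributions of adjacent layers. The sketch below assumes you already have per-layer logits (in practice obtained by decoding each intermediate hidden state through the final normalization and unembedding matrix). The direction KL(next layer || current layer) is an assumption of this example, and the function names are invented here.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float],
                  eps: float = 1e-12) -> float:
    """KL(p || q) in nats; `eps` guards against log of zero."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def layerwise_kl(per_layer_logits: Sequence[Sequence[float]]) -> List[float]:
    """KL between each layer's decoded distribution and the previous layer's."""
    probs = [softmax(logits) for logits in per_layer_logits]
    return [kl_divergence(probs[i + 1], probs[i])
            for i in range(len(probs) - 1)]


if __name__ == "__main__":
    # Toy logits for three layers over a three-token vocabulary.
    toy = [[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [3.0, 0.0, 0.0]]
    print(layerwise_kl(toy))
```

Small values through the middle followed by a larger jump near the end would correspond to the "gradual, then sharp" pattern described above.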
5) Why this could matter in the future
This study suggests we shouldn’t just think about “how big” a Transformer is, but also “how its size is distributed across layers.” By giving early and late layers more space and squeezing the middle, we can:
- Get better results with the same number of parameters.
- Use less compute and memory, which can save money, energy, and time.
- Build models that use their internal space more wisely and avoid wasteful middle-layer collapse.
There is a catch: training efficient variable-width models is harder with today’s software and hardware, which are tuned for uniform-width layers. But that’s an engineering problem, not a scientific roadblock. If toolmakers add good support, this “variable-width” idea could become a practical way to build faster, smarter LLMs.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, consolidated list of concrete gaps and unresolved questions that future work could address:
- Scaling beyond few-billion parameters: Do x-shaped variable-width gains persist or widen at larger dense and MoE scales (e.g., 7B–70B–100B+) under realistic training budgets and sequence lengths? Are the fitted scaling-law exponent differences robust at those scales?
- Multiple runs and uncertainty: How sensitive are the reported gains to random seed, data order, and minor training hyperparameter changes? Provide variance estimates, confidence intervals, and significance testing across multiple replicates.
- Compute-matching criteria: Results match parameter count (not FLOPs) and report PFLOP/s-days analytically; how do conclusions change if models are matched on (a) total training FLOPs, (b) wall-clock time on modern systems, or (c) carbon/energy? Is there a Pareto frontier across params/FLOPs/time/memory?
- End-to-end systems performance: What are the realized training/inference throughput and latency on current hardware (A100/H100, TPUv5), including kernel launch overhead, fragmentation, memory bandwidth, and pipeline/tensor/sequence parallelism overheads? Can purpose-built kernels close the theory–practice gap, and by how much?
- KV cache savings in practice: The average-width proxy estimates KV cache reductions, but how do actual memory footprints and I/O change under common implementations (e.g., MQA/GQA, paged KV, cache compression, quantized KV, CPU offloading)? Does width variability complicate cache packing or paging policies?
- Inference latency vs memory trade-offs: Do narrower mid-layers reduce time-to-first-token or tokens/sec given per-layer heterogeneity? What is the latency impact under speculative decoding, assisted decoding, or chunked attention?
- Bottleneck schedule robustness: How stable is performance across datasets, tokenizers, context lengths, depths, numbers of heads, and MLP multipliers when using the proposed ratio-based schedule (l* ≈ 0.75L, d* ≈ 0.3d)? Are alternative schedules (arithmetic, piecewise, multi-bottleneck, asymmetric endpoints) better for certain regimes?
- Automatic schedule discovery: Can we learn per-layer widths via meta-learning, bilevel optimization, neural architecture search, or differentiable resource allocation under parameter/FLOP/KV constraints, rather than hand-tuning two ratios?
- Theoretical justification: Why is an x-shape near-optimal? Can tensor-program analyses, dynamical systems, information bottleneck theory, or effective rank theory predict optimal width profiles as a function of depth, data, and compute?
- Interaction with attention/head structure: What is the best way to co-design variable width with number of heads, head dimension, or per-layer head counts? Does head reallocation alongside width yield further gains or reduce kernel fragmentation?
- Residual-stream resizing mechanisms: The carry-forward copy outperforms zero-padding and learned projections here; do lightweight learned mixers (e.g., low-rank, gated, sparsely parameterized, or shared across layers) bridge performance while remaining stable? What stabilizes learned projections that currently diverge?
- Gradient flow and truncation effects: Does dimension truncation induce gradient starvation or delayed learning for bypassed coordinates? How do per-layer widths influence signal propagation, conditioning, and optimizer hyperparameters (e.g., under non-μP training)?
- Training stability at greater depths: Does variable width mitigate or exacerbate instabilities (e.g., attention sinks, compression valleys) in deeper models (L>80/120)? Are special normalizations or residual scaling needed as depth grows?
- Generality across architectures: Do results transfer to encoder-decoder models, bidirectional encoders (BERT-like), retrieval-augmented models, multimodal transformers (vision–language, audio), and diffusion transformers?
- Task coverage beyond perplexity: How do variable-width models fare on instruction following, few-/zero-shot reasoning, code/math, multilingual, long-context QA/summarization, Toolformer-style tool use, and safety alignment after SFT/RLHF?
- Fine-tuning and transfer: Does the width schedule that is best for pretraining remain optimal for domain-specific fine-tuning, continual learning, or post-training methods (DPO, PPO, LoRA)? Are adapter and LoRA placements affected by mid-layer bottlenecks?
- Mixture-of-experts design: How should expert routing, expert size/capacity, and number of experts interact with per-layer width variability? Is it preferable to vary active expert width or the number of active experts across depth?
- Compatibility with efficiency methods: What are interactions with FlashAttention, sequence parallelism, activation checkpointing/recomputation, activation/weight quantization, sparsity, blockwise pruning, and low-rank adapters under variable widths?
- Memory layout and quantization: Do per-layer width changes complicate tensor layout, fused kernels, or quantization groupings (per-channel/per-tensor)? Are there accuracy or efficiency regressions for 4–8 bit weight/activation/KV quantization in variable-width layers?
- Long-context behavior: Does an x-shaped profile alter scaling with context length (e.g., 32k–1M tokens) for retrieval, needle-in-a-haystack, and streaming tasks, especially with RoPE/ALiBi variants?
- Data dependence: Are optimal bottleneck location and magnitude dataset-dependent (web, code, scientific text, multilingual)? Do noisy vs curated corpora push toward different profiles?
- Multi-bottleneck and hierarchical designs: Would two or more bottlenecks or hourglass-like repeated narrowing/widening outperform a single x-shaped profile? Is there a depth-dependent hierarchy that aligns with emergent capability phases?
- Analysis methodology robustness: The representational entropy and logit-lens analyses use WikiText-2 and a specific “stream view”; do conclusions hold with alternative datasets, effective-dimension metrics, centering/whitening choices, and decoding views?
- Fairness of hidden-state comparisons: Treating the “wide residual stream” as the effective hidden state for ><former may bias comparisons; can we construct width-normalized probes or subspace-matched evaluations to ensure apples-to-apples representational analysis?
- Practical recipe completeness: The prescription rounds widths to multiples of 32 (16 heads × 2 for RoPE); do different head counts, positional encodings, or rotary bases require different rounding or degrade performance?
- Lifecycle costs and engineering complexity: What is the net engineering overhead for kernel development, scheduling, and serving at scale relative to realized gains? Is there a threshold model size where complexity amortizes?
- Constraints trade-offs: Since param and FLOP matching cannot be achieved simultaneously under current assumptions, what alternative constraints (e.g., match FLOPs and KV, or KV and time) yield the best utility for common deployment profiles?
- Distillation and conversion: Can a pretrained uniform-width model be distilled or converted into a variable-width one (or vice versa) to harvest KV/memory benefits without full retraining? What are accuracy losses and conversion costs?
- Safety and robustness: Do variable-width models differ in adversarial robustness, calibration, uncertainty, or safety behaviors post-alignment? Any shifts in toxicity/bias metrics relative to uniform-width baselines?
Practical Applications
Immediate Applications
Below are concrete ways the paper’s “x‑shaped variable‑width Transformer” (><former) can be used today, with sector links, potential tools/products/workflows, and key assumptions/dependencies.
- Lower-cost LLM serving via reduced KV-cache and memory I/O (software/cloud)
- What: Swap constant-width decoder-only LMs for ><formers to cut average layer width ≈10–11% and KV-cache/I/O ≈10–15%, with similar or better loss at matched parameters.
- Tools/workflows: Integrate variable-width blocks and fixed global residual stream with “carry-forward” copying into inference engines (e.g., vLLM, TensorRT-LLM, FasterTransformer); adjust memory planners to allocate per-layer KV sizes; export via ONNX/TensorRT with shape metadata.
- Assumptions/dependencies: Speedups depend on kernel support for per-layer width heterogeneity and fused slicing/copying; gains are largest for long contexts where KV dominates; benefits shown at 200M–3B scales and should be validated at larger scales.
- Longer context windows or larger batch sizes on the same hardware (productivity/coding/legal SaaS)
- What: Reinvest KV-cache savings into longer max sequence length or higher concurrency without adding GPUs.
- Tools/workflows: Re-tune context-length limits; update allocator to exploit lower average width; pair with RoPE scaling and MQA/GQA.
- Assumptions/dependencies: Memory, not compute, is the limiter; ensure model remains stable at longer contexts; verify interplay with MQA/GQA and quantization.
- Pre-training cost and carbon reduction (academia/startups/cloud)
- What: Achieve equal or better loss with ≈2.5–4% fewer pre-training FLOPs at 200M–2B, with tighter scaling-law fits suggesting potential widening gaps.
- Tools/workflows: Adopt the provided recipe (geometric schedule; bottleneck at ~0.75L; bottleneck width ~0.3d; width rounded to head-multiple) in PyTorch/JAX training code; keep embeddings fixed at d; use parameter-matched baselines and μP-aware training as in the paper.
- Assumptions/dependencies: Engineering overhead for variable-width kernels; retuning may be needed for different datasets/optimizers; carbon savings in practice depend on realized, not theoretical, efficiency.
- MoE model efficiency improvements (cloud providers/enterprise)
- What: Combine ><former with MoE to cut active parameters and KV footprint while improving perplexity (paper shows wins despite ~3% fewer active params when matching total params).
- Tools/workflows: Apply variable-width to shared trunk; maintain expert routing; adjust expert MLP sizes consistently; test token-level metrics and serving memory.
- Assumptions/dependencies: Infrastructure for MoE must tolerate heterogeneous per-layer shapes; parameter-matching policy (total vs. active) affects comparisons.
- On-device/offline assistants with better battery and RAM use (consumer electronics/automotive)
- What: Deploy ~1B-class ><formers on mobile/edge for voice, note-taking, and offline copilot features, leveraging lower KV/memory bandwidth.
- Tools/workflows: Export to CoreML/NNAPI/TFLite; quantize to 4–8 bits; profile memory bandwidth; adjust schedulers for per-layer dims.
- Assumptions/dependencies: Mobile accelerators need kernels for variable widths; validate that benefits persist with aggressive quantization and on-device attention variants.
- Fine-tuning and distillation to memory-friendlier students (healthcare/finance/on‑prem)
- What: Distill from a uniform teacher into a ><former student to retain quality while shrinking KV-cache for on-prem and privacy-preserving deployments.
- Tools/workflows: KD pipelines (logits/features); LoRA/adapter fine-tuning atop a variable-width base; evaluation on perplexity and domain tasks.
- Assumptions/dependencies: Distillation recipes may need adaptation due to width heterogeneity; ensure tokenizer and embedding interfaces remain consistent.
- Better representation diagnostics and interpretability during training (academia/ML safety)
- What: Use ><former to mitigate mid-layer compression valleys and monitor normalized matrix entropy and MLP activation utilization as health signals.
- Tools/workflows: Add logit-lens and entropy dashboards to training (e.g., Weights & Biases); track activation density and layer-to-layer KL.
- Assumptions/dependencies: Improved representational metrics correlate with downstream quality but are not guarantees; analysis requires extra logging/compute.
- ESG reporting and procurement decisions (policy/corporate sustainability)
- What: Update energy/emissions accounting to reflect lower FLOPs and memory I/O per token, informing greener architecture procurement.
- Tools/workflows: Integrate architecture-aware factors into emissions calculators; include KV/I/O metrics in internal benchmarks.
- Assumptions/dependencies: Realized efficiency depends on software stack maturity; requires standardized measurement methodology.
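Several items above mention memory planners that allocate per-layer KV sizes. A back-of-the-envelope estimate might look like the sketch below. It assumes standard multi-head attention where each layer caches keys and values at its own width (no MQA/GQA, paging, or quantization), with 2 bytes per value as in bfloat16. The function names and the example widths are illustrative, not from the paper.

```python
from typing import List, Sequence


def per_layer_kv_bytes(widths: Sequence[int], seq_len: int,
                       batch_size: int = 1,
                       bytes_per_value: int = 2) -> List[int]:
    """KV-cache bytes per layer: K and V, each seq_len x width per sequence."""
    return [2 * w * seq_len * batch_size * bytes_per_value for w in widths]


def total_kv_bytes(widths: Sequence[int], seq_len: int,
                   batch_size: int = 1, bytes_per_value: int = 2) -> int:
    """Total KV-cache bytes summed over all layers."""
    return sum(per_layer_kv_bytes(widths, seq_len, batch_size, bytes_per_value))


if __name__ == "__main__":
    variable = [1024, 768, 512, 768, 1024]   # illustrative x-shaped profile
    uniform = [1024] * len(variable)         # same depth, constant width
    seq_len = 8192
    print("variable:", total_kv_bytes(variable, seq_len), "bytes")
    print("uniform: ", total_kv_bytes(uniform, seq_len), "bytes")
```

Note that this compares against a uniform model of the same maximum width purely for illustration; the paper's comparisons are against parameter-matched baselines, where the uniform width is smaller than the ><former's widest layers.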
Long-Term Applications
The following opportunities require additional research, scaling studies, or significant engineering (kernels, frameworks, or hardware).
- Variable-width-aware kernels and parallelism in mainstream frameworks (AI infrastructure vendors)
- What: Add fused ops for residual slicing/copying and heterogeneous per-layer GEMMs; redesign tensor/pipeline parallelism for non-uniform widths.
- Tools/products: PyTorch/JAX/Triton ops; TensorRT-LLM plugins; vLLM schedulers that track per-layer widths; memory planners that exploit smaller average dims.
- Assumptions/dependencies: Collaboration with hardware vendors; backward compatibility with existing checkpoints.
- Hardware co-design for elastic width and KV-aware memory (semiconductors)
- What: Architect accelerators that schedule variable GEMM shapes efficiently and reduce KV-cache I/O overheads; smarter on-chip SRAM partitioning.
- Tools/products: ISA/library support for dynamic tiling; KV-prefetch and compression units; firmware for width-aware kernels.
- Assumptions/dependencies: Clear market demand; software stacks must exploit hardware features.
- Elastic-width inference and training policies (cloud/edge)
- What: Runtime controllers that adjust width profiles by task difficulty, latency SLAs, or thermal headroom; pair with early-exit and budgeted decoding.
- Tools/products: Width controllers, calibration datasets, and confidence estimators; multi-profile training (or LoRA that modulates width usage).
- Assumptions/dependencies: Need stability across width profiles; requires training-time exposure to multiple schedules or robust zero-shot generalization.
- Cross-modal extensions (vision/speech/robotics)
- What: Apply x-shaped schedules to ViTs, conformers, and policy transformers to reduce memory and I/O in high-throughput or on-robot settings.
- Tools/products: Modify residual streams in encoders/encoder-decoders; carry-forward schemes for multi-branch architectures; validate on tasks like ASR, video understanding, or planning.
- Assumptions/dependencies: Empirical validation needed beyond decoder-only LMs; token vs. channel/feature semantics differ across modalities.
- KV-cache–centric commercial offerings (SaaS/ML platforms)
- What: “XFormer mode” in hosted inference for longer contexts/cheaper tiers; context extension products that leverage lower average width.
- Tools/products: Managed endpoints with variable-width backends; pricing/capacity tiers based on KV savings.
- Assumptions/dependencies: Customer acceptance of new checkpoints; migration paths from existing constant-width models.
- Standards and benchmarks emphasizing I/O and memory efficiency (policy/consortia)
- What: Require parameter- or FLOP-matched comparisons that also report KV-cache and I/O; create energy-efficiency leaderboards that reward architectures like ><former.
- Tools/products: Benchmark suites and reporting templates; third-party audits.
- Assumptions/dependencies: Community consensus on metrics; support from conferences/funding agencies.
- Training-time regularization via architectural bottlenecks (ML research)
- What: Treat variable width as a structural regularizer to reduce representation collapse; study links to generalization and catastrophic forgetting.
- Tools/products: Research protocols combining ><former with dropout/layerdrop, data curricula, and sparsity.
- Assumptions/dependencies: Effects may vary with size, data, and optimizer; requires controlled ablations at >10B scales.
- Safety/monitoring products leveraging residual-entropy profiles (ML Ops)
- What: Use higher, more even residual entropy as a health indicator and early warning for degradation; integrate into model validation pipelines.
- Tools/products: Entropy/activation analytics APIs; acceptance gates in CI for model releases.
- Assumptions/dependencies: Need stronger causal links between these metrics and failure modes; overhead must be acceptable.
- Curriculum and education technology at scale (education/consumer)
- What: On-device tutors/readers with longer contexts and privacy-preserving operation enabled by memory savings.
- Tools/products: Classroom devices with local LMs; e-readers summarizing/answering questions offline.
- Assumptions/dependencies: Kernel maturity on low-power hardware; content licensing and privacy compliance.
Notes across applications:
- The paper’s recipe (x-shape, fixed global residual with carry-forward copying, bottleneck at ~0.75L and ~0.3d) is the current best-known configuration; further tuning may be needed per model family.
- Reported gains were demonstrated on decoder-only LMs at 200M–3B (including a 3B/1B MoE). Extrapolation to larger scales is promising but requires validation.
- Real-world efficiency depends on software/hardware support for heterogeneous per-layer widths; until kernels are optimized, some theoretical gains may not fully materialize.
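The recipe summarized in these notes (geometric schedule, bottleneck at ~0.75L with width ~0.3d, widths rounded to a multiple of 32) can be turned into a width list roughly as follows. This is a sketch under stated assumptions, not the paper's exact formula: it assumes the first and last layers have full width d, interpolates geometrically on both sides of the bottleneck, and maps the 0.75L position to a 0-based layer index. It does not rescale d to match a uniform baseline's parameter count. All function and parameter names are invented for this example.

```python
from typing import List


def round_to_multiple(x: float, multiple: int) -> int:
    """Round to the nearest positive multiple of `multiple`."""
    return max(multiple, int(round(x / multiple)) * multiple)


def x_shaped_widths(num_layers: int, d_model: int,
                    pos_ratio: float = 0.75, width_ratio: float = 0.3,
                    multiple: int = 32) -> List[int]:
    """Per-layer widths: d_model -> width_ratio * d_model -> d_model."""
    if num_layers < 3:
        raise ValueError("need at least 3 layers for a bottleneck")
    # Assumed mapping of the ~0.75L position to a 0-based interior index.
    bottleneck = int(round(pos_ratio * (num_layers - 1)))
    bottleneck = min(max(bottleneck, 1), num_layers - 2)
    widths = []
    for layer in range(num_layers):
        if layer <= bottleneck:
            t = layer / bottleneck                      # 0 -> 1 going down
        else:
            t = (num_layers - 1 - layer) / (num_layers - 1 - bottleneck)
        # Geometric interpolation: width changes multiplicatively with t.
        w = d_model * (width_ratio ** t)
        widths.append(round_to_multiple(w, multiple))
    return widths


if __name__ == "__main__":
    print(x_shaped_widths(num_layers=12, d_model=1024))
```

A parameter-matched comparison would additionally require choosing d so that the total parameter count equals that of the uniform baseline, which this sketch leaves out.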
Glossary
- A-shaped model: A transformer width profile that narrows with depth (wide early, narrow late). "we obtain a V or A-shaped model."
- AdamW: An optimizer that decouples weight decay from gradient updates, commonly used for training transformers. "with the same AdamW hyperparameters across scales"
- Active parameters: The subset of parameters actually used per token in a Mixture-of-Experts model. "3B total/1B active parameters"
- Attention dot-products: The core computations in self-attention where queries are dotted with keys; their cost scales with hidden size and sequence length. "For attention dot-products, their FLOPs scale linearly with the hidden dimension:"
- bfloat16 precision: A 16-bit floating-point format with wider exponent range than FP16, used to stabilize large-model training. "All models are trained in bfloat16 precision."
- Bottleneck dimension: The reduced hidden size at the narrowest layer in a variable-width model. "we need to choose a bottleneck layer index l* and the bottleneck dimension d_{l*}"
- Bottleneck layer: The specific layer at which the model width is minimized. "the layer index of a bottleneck layer"
- Carry-forward (features): A parameter-free expansion method that copies previously computed coordinates forward through the residual stream. "carry forward features by copying coordinates through the residual stream"
- Chinchilla-optimal: A compute–data scaling rule suggesting an optimal ratio between model size and training tokens. "2.5x Chinchilla-optimal (Hoffmann et al., 2022)"
- cl100k_base: A specific tokenizer vocabulary used for preparing inputs. "Inputs are tokenized with OpenAI's cl100k_base."
- Compression valleys: Regions in depth where representations collapse to a low-rank subspace, reducing effective capacity. "frequently developing 'compression valleys' where their middle layers collapse"
- Constant-width transformer: A transformer whose hidden dimension is the same at every layer. "Beyond a regular constant-width transformer, we experiment with a <> shape, a x shape, a V shape, and a A shape."
- Decoder-only transformer: An autoregressive transformer architecture that uses only decoder blocks for language modeling. "decoder-only transformer LMs"
- Effective dimension: A notion of representational dimensionality related to the spread of singular values (effective rank). "Closely related to the effective dimension metric"
- FLOPs: Floating point operations; a measure of computational cost. "We also report pre-training FLOPs in PFLOP/s-days"
- Geometric layer width schedule: A layer-width plan where widths change multiplicatively across layers. "A geometric layer width schedule outperforms an arithmetic one in preliminary experiments."
- KL divergence (layer-to-layer): A measure of how much the decoded distributions differ between adjacent layers. "the layer-to-layer KL divergence between adjacent logit-lens distributions."
- KV cache: The stored key/value tensors used during autoregressive inference to avoid recomputing attention. "KV cache memory and I/O cost"
- Logit lens: A technique for decoding intermediate hidden states into vocabulary logits to probe model behavior. "the logit lens (nostalgebraist, 2020)"
- Maximal update parametrization: A training parametrization that stabilizes learning dynamics across width/depth scales. "All models are trained with maximal update parametrization (μP; Yang et al., 2024)."
- Mixture-of-Experts (MoE): An architecture that routes tokens to a subset of expert MLPs to increase parameter count without proportional compute. "We also consider a Mixture-of-Experts (MoE) model with 3B total/1B active parameters."
- Normalized accuracy: Accuracy adjusted to remove answer-length biases in multiple-choice evaluation. "we report normalized accuracy when available"
- Normalized matrix entropy: An entropy-based measure of how evenly singular values (and thus representation capacity) are distributed. "we track the normalized matrix entropy of the residual stream across all layers:"
- Parameter-matched: Having the same total number of parameters when comparing different architectures. "parameter-matched constant-width baselines"
- Parameter-free approach: A method that changes representation size without adding learned projection parameters. "We consider a parameter-free approach."
- Perplexity: An intrinsic language modeling metric measuring how well a model predicts tokens (lower is better). "perplexity-based tasks"
- PFLOP/s-days: A compute accounting unit equal to running at 1 PFLOP/s for one day. "We also report pre-training FLOPs in PFLOP/s-days"
- QKV projections: Linear layers that produce queries, keys, and values for attention. "The QKV projections of the first layer"
- Residual connection: A skip connection that adds a block’s output to its input to ease optimization. "The residual connection in the final layer MLP is truncated accordingly."
- Residual stream: The vector pathway that carries token representations across layers, to which blocks add their outputs. "allow each block to read from and write to a layer-specific slice of the residual stream."
- RoPE (rotary position embedding): A positional encoding method that rotates query/key vectors to encode positions. "for RoPE (Su et al., 2024)"
- RMS normalization: A normalization that rescales activations by their root-mean-square value. "applying the final RMS normalization"
- Scaling law curve: An empirical power-law relationship between performance and resources (e.g., FLOPs or width). "we fit a scaling law curve on loss vs. pre-training FLOPs"
- SwiGLU: An activation function variant that improves transformer performance over ReLU/GELU in MLPs. "We use the SwiGLU activation (Shazeer, 2020)."
- Unembedding matrix: The final linear map from hidden states back to vocabulary logits. "followed by the unembedding matrix."
- Variable-width transformer: A transformer whose hidden dimension changes across layers. "variable-width transformers"
- V-shaped model: A width profile that grows with depth (narrow early, wide late). "growing (V-shaped)"
- X-shaped model: A width profile that narrows then widens across depth (wide early/late, narrow middle). "The x-shaped model performs the best."
- Zero-shot setting: Evaluating without task-specific fine-tuning or examples. "in the zero-shot setting."