Avey-B
Abstract: Compact pretrained bidirectional encoders remain the backbone of industrial NLP under tight compute and memory budgets. Their effectiveness stems from self-attention's ability to deliver high-quality bidirectional contextualization with sequence-level parallelism, as popularized by BERT-style architectures. Recently, Avey was introduced as an autoregressive, attention-free alternative that naturally admits an encoder-only adaptation. In this paper, we reformulate Avey for the encoder-only paradigm and propose several innovations to its architecture, including decoupled static and dynamic parameterizations, stability-oriented normalization, and neural compression. Results show that this reformulated architecture compares favorably to four widely used Transformer-based encoders, consistently outperforming them on standard token-classification and information-retrieval benchmarks while scaling more efficiently to long contexts.
Explain it Like I'm 14
What is this paper about?
This paper introduces Avey-B, a new kind of language-model “encoder” that understands text by looking both left and right in a sentence at the same time (like BERT), but without using the usual “attention” mechanism. Instead, it looks at text in chunks, quickly finds the most relevant other chunks, and mixes their information using simple, efficient steps. The goal is to keep high accuracy while being faster and lighter on memory, especially for long documents.
What questions were the researchers asking?
- Can we build a strong bidirectional encoder (like BERT) without using attention?
- If we separate two ways of mixing information—fixed rules vs. similarity-based rules—does the model become more stable and accurate?
- Can we compress the “extra context” we gather so the model stays fast even when it looks at lots of relevant text?
- How does this new design compare to popular encoders such as BERT, RoBERTa, ModernBERT, and NeoBERT on common tasks?
How does their approach work?
Think of reading a long article with sticky notes:
- Instead of re-reading everything, you pick the few pages most relevant to the current page you’re studying.
- You then write a short, neat summary of those pages and attach it to the current page.
- You repeat this for each chunk, building a full understanding faster and without flipping through the entire article every time.
Avey-B follows that idea.
The big idea: rank and retrieve only what matters
- The model splits the input text into equal-sized chunks (“splits”).
- For each chunk, a “ranker” finds the top-k most relevant other chunks, using a simple similarity score (you can think of it as “how alike are these parts?”).
- This avoids wasting time mixing in unhelpful pieces of text.
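The ranking step above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes a ColBERT-style MaxSim score (for each token in the target split, take its best cosine match in the candidate split, then sum), and it scores every other split for brevity. The helper names (`max_sim`, `split_tokens`, `top_k_splits`) and the toy embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def max_sim(target, candidate):
    """Assumed MaxSim form: best match per target token, summed over the split."""
    return sum(max(cosine(t, c) for c in candidate) for t in target)

def split_tokens(tokens, split_size):
    """Cut a sequence of token embeddings into consecutive equal-sized splits."""
    return [tokens[i:i + split_size] for i in range(0, len(tokens), split_size)]

def top_k_splits(splits, target_idx, k):
    """Indices of the k other splits with the highest MaxSim score."""
    scored = [(max_sim(splits[target_idx], s), j)
              for j, s in enumerate(splits) if j != target_idx]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [j for _, j in scored[:k]]

# Toy input: 8 two-dimensional token embeddings, split size 2, so 4 splits.
tokens = [[1.0, 0.0], [0.9, 0.1],    # split 0 (the target)
          [0.0, 1.0], [0.1, 0.9],    # split 1 (mostly unrelated)
          [1.0, 0.1], [0.8, 0.0],    # split 2 (very similar to split 0)
          [-1.0, 0.0], [0.0, -1.0]]  # split 3 (opposite direction)
splits = split_tokens(tokens, 2)
best = top_k_splits(splits, target_idx=0, k=2)
print(best)
```

With these toy vectors, split 2 ranks first because its tokens point in nearly the same direction as the target's, which is the behavior the "how alike are these parts?" description suggests.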
The two main parts
- Ranker
  - Finds and scores relevant chunks from the whole text.
  - New in Avey-B: a “neural compressor” turns the current chunk plus its retrieved chunks back into a single chunk-sized summary. This keeps the later steps fast.
- Neural Processor
  - Improves each chunk’s token representations with three mini-steps:
    - Enricher: expands features with a small per-token neural network.
    - Contextualizer: mixes information between tokens, based on either learned weights or similarity scores.
    - Fuser: blends the original and contextualized features and returns to the usual embedding size.
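To make the compressor concrete, here is a hedged sketch of one way a linear compression step could look. It assumes the compressor is a single matrix `P` of shape S × (k+1)S applied along the token axis, followed by adding the original chunk back as a residual. The weights in `P` are hand-picked for illustration (in a real model they would be learned), and the function names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def compress(target, retrieved, P):
    """Map (k+1)*S token rows to S rows with a linear map P,
    then add the target chunk back as a residual (assumed form)."""
    stacked = target + [tok for chunk in retrieved for tok in chunk]
    assert len(P) == len(target) and len(P[0]) == len(stacked)
    mixed = matmul(P, stacked)
    return [[m + t for m, t in zip(m_row, t_row)]
            for m_row, t_row in zip(mixed, target)]

# Toy sizes: S = 2 tokens per chunk, k = 1 retrieved chunk, 2 features per token.
target = [[1.0, 0.0], [0.0, 1.0]]
retrieved = [[[2.0, 2.0], [4.0, 4.0]]]
# Hypothetical weights: each output token averages the aligned token
# of the target chunk and of the retrieved chunk.
P = [[0.5, 0.0, 0.5, 0.0],
     [0.0, 0.5, 0.0, 0.5]]
summary = compress(target, retrieved, P)
print(summary)  # 2 rows, the same size as one chunk
```

The point of the sketch is the shape change: four token rows go in, two come out, so every later step works on a chunk-sized input no matter how many chunks were retrieved.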
What’s new in Avey-B (and why it helps)
- Bidirectional understanding: It removes the “look only forward” rule, so each token can use both left and right context—like BERT encoders.
- Decoupled mixing layers: The model alternates two kinds of layers:
  - Static layers use fixed learned weights (like a teacher’s consistent guidance).
  - Dynamic layers use similarity scores (like a popularity vote among tokens).
  - This keeps similarity-based steps fair: if token A is more similar than token B, it won’t accidentally count less.
- Stable similarity scores: In dynamic layers, similarity scores are normalized so they don’t blow up or become unstable during training.
- Neural compression: Instead of processing the current chunk plus all retrieved chunks together (which would slow things down), Avey-B compresses them into one chunk-sized summary first. This keeps the compute per chunk steady and speeds up inference.
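The difference between static and dynamic layers, and the normalization that keeps similarity scores bounded, can be illustrated with a small sketch. It rests on assumptions: the dynamic weights are cosine similarities divided by the row-wise sum of their absolute values plus a small ε (the exact formula and the value of ε are not given here), and the static matrix `V` is a hand-written stand-in for learned weights. All function names are hypothetical.

```python
import math

def l2_normalize(row):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm > 0 else list(row)

def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def static_mix(V, Z):
    """Static layer: mix tokens with fixed weights V that ignore the input."""
    return matmul(V, Z)

def similarity_weights(Z, eps=1e-6):
    """Cosine similarities between token rows, scaled so each row's
    absolute values sum to at most one (assumed normalization)."""
    Zn = [l2_normalize(r) for r in Z]
    sim = [[sum(a * b for a, b in zip(zi, zj)) for zj in Zn] for zi in Zn]
    return [[s / (sum(abs(t) for t in row) + eps) for s in row] for row in sim]

def dynamic_mix(Z, eps=1e-6):
    """Dynamic layer: mix tokens using only input-dependent similarity weights."""
    return matmul(similarity_weights(Z, eps), Z)

Z = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]   # three toy token embeddings
V = [[0.5, 0.5, 0.0],                       # hypothetical "learned" weights
     [0.0, 1.0, 0.0],
     [0.0, 0.5, 0.5]]
W = similarity_weights(Z)
static_out = static_mix(V, Z)
dynamic_out = dynamic_mix(Z)
print(W[0])
```

Because each row is divided by a single positive number, the ordering of similarities is preserved: in the first row, the token most similar to token 0 still gets the largest weight. Keeping learned weights in a separate layer is what prevents them from reshuffling that ordering.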
What did they find?
Across many standard tasks, Avey-B performed strongly and ran fast:
- Accuracy
  - It beat BERT and NeoBERT across all tested benchmarks.
  - It consistently outperformed RoBERTa and ModernBERT on token-level tasks (like tagging words with labels) and information retrieval (like matching questions to relevant passages).
  - Even the smaller Avey-B sometimes matched or exceeded larger Transformer encoders on these tasks, despite being trained on far fewer tokens than, for example, ModernBERT.
- Speed and scaling
  - Avey-B processed long inputs faster than the Transformer-based encoders tested.
  - As documents get longer, Avey-B’s throughput drops much more slowly than ModernBERT and NeoBERT.
  - At very long lengths (e.g., around 96,000 tokens), Avey-B was several times faster than the baselines (for example, about 3.4× faster than ModernBERT and about 11.6× faster than NeoBERT in their setup).
Why this matters: token tagging and retrieval often show up in real-world systems (like search, recommendation, and data labeling), where speed and memory are tight. Avey-B’s design fits those needs well.
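One way to put a number on "throughput drops more slowly" is to fit a power-law decay, T(N) ∝ N^(-α), where a smaller α means throughput holds up better as inputs grow. The sketch below fits α by least squares in log-log space. The two curves are synthetic, generated from exponents chosen for illustration; they are not measurements from the paper, and `fit_power_law` is a hypothetical helper.

```python
import math

def fit_power_law(lengths, throughputs):
    """Least-squares fit of log T = log c - alpha * log N; returns (c, alpha)."""
    xs = [math.log(n) for n in lengths]
    ys = [math.log(t) for t in throughputs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic throughput curves (made-up exponents, NOT results from the paper).
lengths = [1024, 4096, 16384, 65536]
slow_decay = [1e6 * n ** -0.4 for n in lengths]   # degrades gently with length
fast_decay = [1e6 * n ** -0.9 for n in lengths]   # degrades steeply with length
_, alpha_slow = fit_power_law(lengths, slow_decay)
_, alpha_fast = fit_power_law(lengths, fast_decay)
print(round(alpha_slow, 3), round(alpha_fast, 3))
```

On real benchmark data the same fit would be applied to measured tokens-per-second at each sequence length, and the fitted exponents compared across models.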
Why does this matter, and what could it impact?
- Attention isn’t the only way: This work shows you can build strong, bidirectional text encoders without self-attention, opening new paths for efficient models.
- Better for long documents: Because Avey-B ranks and compresses relevant context, it stays fast as inputs grow, which is great for search engines, document analysis, and other large-scale applications.
- Industry-ready efficiency: Encoders often power production systems where latency and cost matter. Avey-B’s higher throughput and good accuracy could reduce serving costs and speed up user-facing applications.
- Research and reproducibility: The authors released code and pretrained models, making it easier for others to build on their approach.
In short, Avey-B shows a practical, attention-free way to get high-quality text understanding—often faster and with less compute—especially when handling long inputs or doing tasks like tagging tokens and retrieving relevant information.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, consolidated list of the paper’s unresolved issues and avenues for future research.
- Bidirectional retrieval design: Avey-B’s ranker is intentionally unidirectional (matches only preceding splits). Quantify how truly bidirectional retrieval (left and right splits) affects SC/QA performance, cost, and stability, and identify tasks where bidirectional retrieval is necessary.
- Ranker complexity and scalability: The ranker’s O(N²d) matching cost remains a bottleneck for very long inputs and large d. Investigate approximate or hierarchical retrieval (e.g., ANN, locality-sensitive hashing, multi-stage pruning) that preserves quality while reducing complexity to near-linear or N log N.
- Inference-time ranker overhead: The paper measures throughput dominated by the neural processor; however, the ranker still runs per pass. Provide a clear accounting of inference complexity including ranker cost, and explore incremental or cached ranking for streaming/online encoders.
- Retrieval granularity: The ranker operates at split-level granularity. Evaluate token-level, phrase-level, or hierarchical (paragraph/section) retrieval to refine relevance selection and reduce false positives/negatives in the retrieved context.
- MaxSim as the relevance signal: Assess alternative relevance scoring functions (e.g., learned metrics, mutual information, cross-encoder scoring, hybrid lexical-semantic signals) and their effect on precision/recall of retrieved splits.
- Neural compression fidelity: The compressor linearly reduces (k+1)S tokens to S tokens via P. Characterize information loss (e.g., probing tasks, alignment metrics), derive bounds on compression-induced degradation, and compare against non-linear compressors (attention pooling, low-rank adapters, learned pooling trees).
- Adaptive k and S: k=3 and S=256 are fixed. Study adaptive per-split k and split size S based on uncertainty/relevance/difficulty to improve quality under fixed compute budgets.
- Theoretical properties of decoupled parametrization: Beyond monotonicity per dynamic layer, analyze global expressivity, convergence properties, and optimization dynamics of interleaved static/dynamic layers; identify regimes where decoupling harms or helps stability or representational power.
- Normalization choice in dynamic layers: Provide formal spectral/conditioning analyses for row-wise sum normalization (ε-stabilization), compare to double-stochastic or Sinkhorn normalizations, and quantify gradients’ behavior with depth (e.g., singular value bounds).
- Sensitivity to ε: The stabilizer ε is not specified or analyzed. Determine sensitivity ranges, stability thresholds, and principled methods to set ε (e.g., adaptive ε per layer or per batch).
- Positional information across splits: The contextualizer claims no extra positional encoding due to V; clarify how cross-split order is represented post-compression, and test tasks where precise global order matters (e.g., logical reasoning, long-context discourse).
- Failure mode analysis: Avey-B lags RoBERTa/ModernBERT on MNLI and ReCoRD/SQuAD. Perform granular error analysis (e.g., contradiction vs entailment, entity-centric vs relational QA) to pinpoint architectural causes and guide targeted fixes (e.g., bidirectional ranker, richer positional cues).
- Pretraining objective alignment: The model uses MLM with 20% masking. Explore retrieval-aware or contrastive pretraining (e.g., ELECTRA/RTD, span corruption, denoising with retrieved context) tailored to Avey-B’s retrieval/compression inductive biases.
- Scaling laws: Provide systematic scaling studies (parameters, depth/width, pretraining tokens, k/S/N) and fit empirical laws for quality and efficiency; identify optimal depth-width interleaving schedules across scales.
- Long-context quality on real tasks: Beyond synthetic NIAH and efficiency curves, evaluate accuracy on real long-context tasks (book-length QA, long-range summarization, legal/scientific IR) to validate selective retrieval/compression at 32k–96k tokens.
- Memory and energy profiling: Report peak memory, activation footprint, and energy/latency trade-offs for both training and inference (with/without torch.compile) relative to FlashAttention-optimized baselines; quantify the impact of fused kernels once available.
- Kernel-level optimization gap: Avey-B lacks fused CUDA/Triton kernels. Build and benchmark fused kernels to confirm whether the observed throughput advantage persists or increases, and characterize the proportion of gains attributable to architecture vs kernel optimization.
- Hardware and batch-size sensitivity: Throughput results use batch size 8 on specific GPUs (the paper mentions both H200 and B200, inconsistently). Study performance across diverse hardware (A100/H100/H200/B200, consumer GPUs, CPUs) and batch sizes; ensure conclusions generalize.
- Robustness and domain generalization: Evaluate resilience to noise/adversarial perturbations, domain shift (biomedical, legal), and multilingual/low-resource settings; test whether ranker+compressor degrades under distribution shifts.
- Fine-tuning protocols: Benchmarks use limited epochs and fixed LR sweeps. Examine whether Avey-B benefits from task-specific schedules (e.g., layerwise LR, retrieval-aware regularizers), and whether longer fine-tuning narrows gaps on MNLI/QA.
- Differentiable retrieval: The ranker’s top-k selection is non-differentiable. Explore continuous relaxations (e.g., Gumbel-top-k, SoftTopK), end-to-end training of ranker scoring, or joint learning with compressor to improve selection quality.
- Static/dynamic interleaving schedule search: The chosen S→D pattern is empirically best; develop principled or learned schedule selection (e.g., reinforcement learning, NAS) and test deeper variants (D→S blocks, multiple D in a row) for different tasks.
- Information-theoretic view of bypass/gating: Formalize how partial-embedding bypass and gating mitigate over-smoothing; quantify retained token-specific information across depth and relate to generalization/robustness.
- Retrieval noise and distractors: Measure the rate and impact of irrelevant split selection, and design mechanisms (confidence thresholds, re-ranking, iterative retrieval) to reduce distractors without increasing compute.
- Comprehensive complexity reconciliation: The paper mentions both quadratic (training) and linear (inference) scaling claims. Provide a unified complexity model that includes ranker, compressor, and processor under both training and inference for encoder-only usage, clarifying when each term dominates.
- Security/privacy considerations: Retrieval over splits may surface sensitive content. Audit leakage risks and design privacy-preserving retrieval/compression (e.g., differential privacy, secure indexing).
- Integration with downstream systems: Assess compatibility with production IR pipelines (indexing, sharding, latency SLAs), and study how compressor/ranker interact with caching and streaming encoders in real deployments.
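Several items above turn on the ranker's O(N²d) matching cost and on the fixed k = 3 and S = 256. A back-of-the-envelope count shows where the quadratic term comes from: with N/S splits there are (N/S)² split pairs, and scoring one pair token-by-token takes S² dot products of length d, giving (N/S)² · S² · d = N²d. The sketch below only counts multiply-adds under that assumption (exhaustive matching, self-pairs included); constant factors in a real implementation will differ, and the function names are hypothetical.

```python
def ranker_cost(n_tokens, split_size, dim):
    """Multiply-adds for exhaustive split-to-split token matching (assumed model)."""
    n_splits = -(-n_tokens // split_size)        # ceiling division
    pair_cost = split_size * split_size * dim    # all token pairs for one split pair
    return n_splits * n_splits * pair_cost

def context_tokens(split_size, k):
    """Tokens the compressor sees per split: the split itself plus k retrieved splits."""
    return (k + 1) * split_size

S, k = 256, 3
print(context_tokens(S, k))                # tokens reduced to S by the compressor
print(ranker_cost(4096, S, dim=1))         # equals 4096 squared
print(ranker_cost(8192, S, 64) // ranker_cost(4096, S, 64))  # doubling N quadruples cost
```

This is why approximate or hierarchical retrieval is attractive for very long inputs: the per-split processing stays fixed at S tokens, but the matching term grows with the square of the sequence length.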
Practical Applications
Immediate Applications
Below are specific, deployable uses that directly leverage Avey-B’s bidirectional, attention-free encoder, its superior token-classification and information-retrieval performance, and its high-throughput long-context scaling.
- Compact drop-in encoder for production NLP pipelines (software, enterprise search, e-commerce, customer support)
- Use Avey-B as a replacement for BERT/RoBERTa in existing encoder pipelines (CLS pooling, mean pooling) for tasks like sentiment analysis, intent detection, topic classification, and email/ticket routing.
- Benefits: lower latency and higher throughput at short and long contexts; better accuracy on token classification and dense retrieval vs. common Transformer baselines.
- Tools/workflows: plug into Detext-, ColBERT-, or Sentence-Transformer-style pipelines; integrate with FAISS/Milvus/Weaviate for vector search; swap model weights/config in existing services.
- Dependencies/assumptions: compatibility with current tokenization/serving stack; validation on in-domain data; GPU/CPU deployment tuned for split size S and top-k; current lack of fused kernels (uses torch.compile) may limit peak efficiency, though results are already favorable.
- Enterprise and consumer search with long-document encoding (software, media, knowledge management)
- Use Avey-B to embed and index long documents (manuals, wikis, PDFs, reports) with better throughput and long-context scaling (validated up to ~96k tokens).
- Benefits: faster ingestion; more robust retrieval of relevant passages in lengthy files; improved NDCG@10 on IR benchmarks relative to Transformer baselines.
- Tools/workflows: document chunker that aligns to Avey-B’s split paradigm; indexing with vector DBs; retrieval-augmented generation (RAG) pipelines that use Avey-B as the document and/or query encoder.
- Dependencies/assumptions: choice of S and k tuned for collection characteristics; ANN-backed retrieval for scalability; RAG quality still depends on downstream generator.
- PII detection and automated redaction (healthcare, finance, public sector, legal, daily productivity)
- Apply Avey-B to token-level PII detection (names, addresses, IDs) and redact content in documents, emails, and logs, benefiting from top token-classification performance.
- Benefits: improved recall/precision on span detection; faster processing of long records (EHR notes, legal exhibits, compliance documents).
- Tools/workflows: DLP pipelines; data sanitization filters in ETL; email gateways and document processors with real-time redaction.
- Dependencies/assumptions: domain fine-tuning data and validation; jurisdiction-specific policy constraints; careful thresholding to balance over/under-redaction.
- Clinical/biomedical entity extraction and coding (healthcare)
- Use Avey-B for NER and span tagging (problems, meds, procedures) and assist with ICD/CPT coding from free-text EHR notes.
- Benefits: strong short-span accuracy and fast processing for long patient notes or multi-visit summaries.
- Tools/workflows: integration into clinical NLP platforms; batch processing of EHR corpora; assistance tools for medical coders.
- Dependencies/assumptions: domain adaptation to clinical language; HIPAA-compliant deployment; local/on-prem inference may be required.
- Legal and e-discovery triage (legal, public sector)
- Token-level tagging and retrieval across large corpora (contracts, case law, discovery sets) for clause detection, privilege detection, and issue-focused retrieval.
- Benefits: scalability to very long documents; improved ranking and tagging speeds for tight review timelines.
- Tools/workflows: pipeline that embeds documents using Avey-B; filtering/triage UI for legal teams; contract analytics.
- Dependencies/assumptions: tuning on legal text; explanation/audit trails for defensibility; integration with e-discovery platforms and metadata pipelines.
- Financial document analysis and monitoring (finance)
- Classify and extract spans from 10-K/10-Q filings, research reports, and news; power retrieval for due diligence and risk surveillance.
- Benefits: long-context encoding matches the length of filings; better IR performance for topic/event retrieval.
- Tools/workflows: long-document encoders for research platforms; surveillance pipelines for email/chat monitoring with token-level tagging.
- Dependencies/assumptions: domain lexicon adaptation; governance and model-risk controls; throughput tuning for bulk ingestion windows.
- Education: long-form content search and grading assistance (education, EdTech)
- Retrieval from textbooks/course materials; token-level rubric alignment for short-answer/essay components.
- Benefits: faster indexing of large course materials; improved extraction-based scoring aids for graders.
- Tools/workflows: course content indexers; grading assistance dashboards using Avey-B for extraction and rationale highlighting.
- Dependencies/assumptions: institution policies on automated grading; calibration against rubrics; fairness/consistency checks.
- On-device or edge inference for classification and extraction (mobile, IoT, daily life)
- Compact encoder for offline email/app content classification (priority, spam/ham, categories) and redaction of sensitive text on-device.
- Benefits: budget-friendly latency and memory; better privacy by keeping inference local.
- Tools/workflows: quantized Avey-B variants; mobile runtimes (CoreML, NNAPI) once available; batched inference for message streams.
- Dependencies/assumptions: availability of export/quantization toolchains; reduced-precision evaluation; memory limits dictate S/k choices.
- Log and incident report triage (software operations, energy, manufacturing)
- Token-level tagging of anomalies and IR-based retrieval on long logs/incident writeups.
- Benefits: handles lengthy logs; improves triage and root-cause search.
- Tools/workflows: pipeline to embed logs and attach token tags; integrate with observability stacks for faster incident response.
- Dependencies/assumptions: domain fine-tuning on log formats; throughput under spike loads; alerting thresholds.
- Public-sector document intake and comment analysis (policy, governance)
- Classify and extract spans (issues, stakeholders, policy topics) from long public comments and regulations; enable targeted search for policymakers.
- Benefits: high-throughput long-document processing; more precise span tagging.
- Tools/workflows: agency intake pipelines; dashboards for thematic retrieval and summarization inputs.
- Dependencies/assumptions: transparency and bias audits; multilingual coverage if required; on-premise deployment constraints.
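Many of the workflows above share one pattern: pool an encoder's token embeddings into a single vector per document, then rank documents by cosine similarity to a query vector. The sketch below shows that pattern with mean pooling and a brute-force search. It makes no assumptions about Avey-B's actual API: the token embeddings are hand-written toy values standing in for encoder output, and the brute-force loop stands in for a vector index such as FAISS.

```python
import math

def mean_pool(token_embeddings):
    """Average token vectors into one fixed-size document vector."""
    n = len(token_embeddings)
    return [sum(column) / n for column in zip(*token_embeddings)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def search(query_vec, doc_vecs, top_n):
    """Brute-force nearest-neighbor search; returns document indices, best first."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: cosine(query_vec, doc_vecs[i]),
                   reverse=True)
    return order[:top_n]

# Toy stand-ins for per-token encoder outputs (hypothetical values).
doc_tokens = [
    [[1.0, 0.0], [1.0, 0.2]],   # document 0
    [[0.0, 1.0], [0.2, 1.0]],   # document 1
    [[1.0, 1.0], [0.8, 1.2]],   # document 2
]
query_tokens = [[0.9, 0.1], [1.0, 0.0]]

doc_vecs = [mean_pool(tokens) for tokens in doc_tokens]
query_vec = mean_pool(query_tokens)
hits = search(query_vec, doc_vecs, top_n=2)
print(hits)
```

In a production setting the document vectors would be computed once at ingestion time and stored in an approximate-nearest-neighbor index, so only the query needs to be encoded at request time.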
Long-Term Applications
These applications are promising but may require additional research, scaling, engineering, or ecosystem maturity (e.g., fused kernels, ANN rankers, domain pretraining).
- Ultra-long document analytics and agents (software, legal, healthcare, finance, public sector)
- End-to-end systems that read, retrieve, and ground decisions across hundreds of thousands to millions of tokens (e.g., cross-document legal discovery, longitudinal patient timelines, multi-year financial analyses).
- Potential products: “LongDoc Encoder API,” document-grounded agents that rely on Avey-B for retrieval/extraction at scale.
- Dependencies/assumptions: approximate ranker for per-split retrieval to reduce O(N²d) matching costs; memory-optimized serving; domain pretraining; evaluation for stability at extreme lengths.
- Fused-kernel and hardware-accelerated deployments
- Triton/CUDA fused kernels for the neural processor, ranker, and compressor; future ASIC/FPGA paths for attention-free encoders.
- Benefits: substantial throughput and latency gains beyond torch.compile; lower energy per token.
- Dependencies/assumptions: engineering investment; kernel generalization across batch/sequence regimes; vendor ecosystem support.
- Encoder-decoder pipelines for document-grounded generation (software, media, education)
- Use Avey-B as the encoder in encoder–decoder systems to provide long-context evidence to a generator (e.g., RAG 2.0 with selective split retrieval/compression).
- Potential workflows: Avey-B encodes documents into relevance-aware representations; decoder consumes retrieved spans for faithful answers.
- Dependencies/assumptions: tight coupling interfaces; training objectives for faithfulness; task-specific datasets.
- Advanced retrieval architectures: “Avey-B ColBERT-style” and compressed indexing
- Late interaction systems where Avey-B’s split-aware representations and cosine-based dynamic layers power efficient passage-level matching; leverage learned compression to shrink index size.
- Benefits: improved long-doc retrieval with smaller storage and faster query-time computation.
- Dependencies/assumptions: ANN indices tuned to split embeddings; index maintenance for dynamic corpora; careful evaluation of recall vs. compression trade-offs.
- Multilingual and cross-domain models
- Avey-B variants pretrained for multilingual corpora and specialized domains (clinical, legal, finance, code).
- Benefits: broader applicability and reduced domain adaptation effort.
- Dependencies/assumptions: large-scale domain/multilingual corpora; cross-lingual alignment; fairness and regional compliance testing.
- Interpretability and compliance-by-design encoders (policy, regulated industries)
- Leverage monotonicity within dynamic layers (decoupled parameterization) and row-wise normalization to develop more interpretable evidence aggregation for audits.
- Potential tools: span-attribution visualizers that reflect similarity-respecting updates; regulator-friendly reports.
- Dependencies/assumptions: formal interpretability methods built on Avey-B internals; user studies; integration with GRC systems.
- Robotics and IoT instruction retrieval from long manuals (robotics, manufacturing)
- On-device retrieval and extraction from long equipment manuals for task guidance; token-level hazard detection in SOPs.
- Benefits: offline operation; efficient long-context access for step-by-step instructions.
- Dependencies/assumptions: ruggedized, quantized models; multimodal extension if diagrams are needed; safety validation.
- Energy and industrial maintenance intelligence
- Analyze long maintenance logs and incident histories; IR for similar past failures; span tagging of root-cause descriptors.
- Benefits: better MRO (maintenance, repair, operations) efficiency and knowledge transfer.
- Dependencies/assumptions: domain adaptation; interoperability with CMMS/EAM systems; throughput for large historical corpora.
- Curriculum-scale retrieval and tutoring (education)
- Systems that navigate an entire curriculum corpus to retrieve precise excerpts for personalized help, assessments, and content linking.
- Benefits: fine-grained retrieval over long materials; improved study assistance grounded in exact text.
- Dependencies/assumptions: aligned pedagogy; bias and accessibility considerations; generator integration for tutoring.
- Privacy-preserving local knowledge bases (daily life, enterprise)
- Local-first knowledge apps that index personal or departmental corpora (notes, emails, documents) and provide fast, accurate retrieval/extraction without cloud upload.
- Benefits: improved privacy and latency; long-context support for large personal archives.
- Dependencies/assumptions: efficient on-device indices; lightweight inference stacks; user controls for data governance.
Cross-cutting assumptions and dependencies
- Data and fine-tuning: High-quality, domain-specific fine-tuning sets are often required to realize gains in specialized sectors (healthcare, legal, finance, logs).
- Serving stack: Current implementation lacks custom fused kernels; while Avey-B already outperforms optimized Transformers in many regimes, further gains likely require kernel engineering.
- Ranker scaling: Split-to-split matching is O(N²d) per pass; for very long inputs, approximate nearest neighbor (ANN) schemes or hierarchical ranking may be needed.
- Governance: Bias, fairness, and auditability must be addressed, especially in regulated domains; interpretability benefits from the dynamic-layer monotonicity should be validated in practice.
- Hardware and precision: Performance claims are measured on modern NVIDIA GPUs with mixed precision; CPU/mobile deployment requires quantization/distillation and careful benchmarking.
- Multilingual/generalization: The released checkpoints (pretrained on FineWeb) may require additional training for multilingual and non-Web domains; licensing/usage constraints should be checked.
Glossary
- ALiBi positional biases: A technique that adds linear positional bias to attention scores to encode relative position without explicit embeddings. "ALiBi positional biases (Press et al., 2022)"
- alternating global/local attention: An attention pattern that alternates between full-sequence (global) and windowed (local) attention to improve efficiency. "alternating global/local attention"
- autoregressive mask: A masking scheme that prevents tokens from attending to future positions, enforcing causal direction. "Avey-B drops the autoregressive mask in Avey's contextualizer"
- Avey-B: A bidirectional, encoder-only reformulation of the Avey architecture that decouples static and dynamic parameterizations and introduces neural compression. "We propose Avey-B, a bidirectional encoder architecture that capitalizes on Avey"
- BF16: The bfloat16 mixed-precision numeric format used to accelerate training and inference with minimal accuracy loss. "with mixed precision (BF16)"
- causal language modeling (CLM): A pretraining objective where models predict the next token given past context only. "GPT optimized a causal language modeling (CLM) objective"
- contextualizer: A module in the neural processor that mixes token embeddings using dynamic similarities and learned transformations. "The contextualizer is an embedding-wise neural network with dynamic parameterization and cosine-similarity-based selectivity"
- cosine similarity: A measure of similarity between two vectors based on the cosine of the angle between them. "N(Z_tr) N(Z_tr)ᵀ computes cosine similarities between embeddings"
- decoupled static and dynamic parameterizations: A design where learned linear transformations (static) and similarity-based mixing (dynamic) occur in separate layers to preserve monotonicity. "including: (1) decoupled static and dynamic parameterizations"
- disentangled attention: An attention mechanism that separates content and positional information into distinct matrices. "introduced disentangled attention, which separates content and positional information into distinct attention matrices"
- encoder-only paradigm: An architecture that consists solely of a bidirectional encoder, with no decoder component (as in BERT-style models). "reformulate Avey for the encoder-only paradigm"
- FlashAttention: An IO-aware, memory-efficient exact attention algorithm that speeds up Transformer attention. "FlashAttention (Dao et al., 2022)"
- gated linear units (GLU): An activation mechanism that gates linear transformations to improve expressiveness and training stability. "gated linear units (GLU) (Dauphin et al., 2017; Shazeer, 2020)"
- Hadamard multiplication: Element-wise multiplication of matrices or vectors used within the contextualizer. "⊙ denotes element-wise (Hadamard) multiplication"
- masked language modeling (MLM): A pretraining objective that reconstructs randomly masked tokens in an input sequence. "masked language modeling (MLM), which reconstructs randomly masked tokens in an input sequence"
- MaxSim operator: A relevance-scoring function that selects splits based on maximum similarity of their token embeddings. "using the MaxSim operator (Khattab & Zaharia, 2020)"
- NDCG@10: A ranking evaluation metric (Normalized Discounted Cumulative Gain) computed over the top 10 results. "IR with NDCG@10."
- needle-in-a-haystack (NIAH): A synthetic long-context benchmark designed to test retrieval of a small signal within a large sequence. "Appendix M evaluates the long-context capabilities of Avey-B on a synthetic needle-in-a-haystack (NIAH) benchmark"
- neural compression: A learned mechanism that compresses retrieved context into a fixed-size representation before processing. "we introduce a neural compression scheme in the ranker"
- neural processor: The data-dependent Avey module with enricher, contextualizer, and fuser that performs token contextualization. "The neural processor comprises three modules, an enricher, a contextualizer, and a fuser."
- partial-embedding bypassing: A technique that preserves a subset of raw token features across layers to mitigate over-smoothing. "This partial-embedding bypassing technique preserves raw token-specific features"
- power-law decay model: A scaling model where throughput decreases as a power of sequence length. "We characterize long-context throughput using a power-law decay model, T(N) ∝ N^(-α)"
- ranker: The component that selects top-k relevant splits for each target split based on similarity scores. "For each target split, the ranker selects the top-k most relevant splits from the input sequence"
- residual connection: A skip connection that adds original inputs to processed outputs to stabilize training. "Avey-B adds a residual connection between the compressor output and the split's original S tokens"
- RMSNorm: A normalization method that normalizes activations by their root mean square without centering. "RMSNorm (Zhang & Sennrich, 2019)"
- RoPE positional encoding: Rotary positional embeddings that encode position via rotations in embedding space. "RoPE positional encoding (Su et al., 2021)"
- row-stochastic similarity operator: A similarity matrix whose rows sum to at most one, bounding per-row gains. "This row-wise normalization yields a row-stochastic similarity operator (row sums ≤ 1)"
- row-wise ℓ2 normalization: Normalizing each row vector to unit ℓ2 norm to compute cosine similarities. "N(·) applies row-wise ℓ2 normalization"
- row-wise normalization: Per-row scaling of similarity scores to stabilize training and control gains. "This row-wise normalization yields a row-stochastic similarity operator"
- row-wise sum normalization: A normalization that divides similarity scores by their row-wise sums. "(row-wise sum normalization)"
- SwiGLU activations: An activation variant combining gating and smooth nonlinearities for improved performance. "SwiGLU activations (Shazeer, 2020)"
- torch.compile: A PyTorch compilation API for graph capture and backend code generation to optimize execution. "using torch.compile, which performs graph capture and backend code generation"
- virtual adversarial training: A regularization technique that improves fine-tuning stability via adversarial perturbations. "improved fine-tuning stability through virtual adversarial training"
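As a worked example for one glossary entry, NDCG@10 can be computed in a few lines. The sketch uses the linear-gain form, where a result with relevance `rel` at 1-based rank r contributes rel / log2(r + 1); some evaluations use 2^rel - 1 as the gain instead. As a simplification, the ideal ordering is taken from the judged list passed in rather than from all relevant documents in the collection, and the relevance labels are made up for illustration.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2)       # rank is 0-based here
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG divided by the DCG of the best possible ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

perfect = ndcg_at_k([3, 2, 1, 0])      # already in the best order
swapped = ndcg_at_k([0, 1])            # the only relevant result is at rank 2
nothing = ndcg_at_k([0, 0, 0])         # no relevant results at all
print(perfect, round(swapped, 4), nothing)
```

A score of 1.0 means the system returned results in the best possible order; pushing a relevant result down the list lowers the score because of the logarithmic discount.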