Avey-B

Published 17 Feb 2026 in cs.CL and cs.AI | (2602.15814v1)

Abstract: Compact pretrained bidirectional encoders remain the backbone of industrial NLP under tight compute and memory budgets. Their effectiveness stems from self-attention's ability to deliver high-quality bidirectional contextualization with sequence-level parallelism, as popularized by BERT-style architectures. Recently, Avey was introduced as an autoregressive, attention-free alternative that naturally admits an encoder-only adaptation. In this paper, we reformulate Avey for the encoder-only paradigm and propose several innovations to its architecture, including decoupled static and dynamic parameterizations, stability-oriented normalization, and neural compression. Results show that this reformulated architecture compares favorably to four widely used Transformer-based encoders, consistently outperforming them on standard token-classification and information-retrieval benchmarks while scaling more efficiently to long contexts.

Authors (2)

Summary

The paper introduces an attention-free, bidirectional encoder that decouples static and dynamic parameterization, yielding sharper and more diverse token representations.
It employs row-normalized cosine similarity and neural compression with residual integration, significantly improving throughput and long-context performance compared to Transformer baselines.
Empirical evaluations show Avey-B achieves superior token classification and information retrieval with reduced pretraining requirements and enhanced scalability.

Avey-B: Bidirectional Encoder Architecture Without Attention

Architectural Contributions

Avey-B is a novel bidirectional, attention-free encoder architecture, extending the original autoregressive Avey design to the encoder-only paradigm for industrial NLP under compute and memory constraints. The model's core architectural innovations include:

Decoupled Static and Dynamic Parameterization: Unlike the original Avey's element-wise coupling of learned projections and similarity scores, Avey-B alternates layers with purely static (learned cross-embedding transformations) and purely dynamic (cosine similarity-weighted contextualization) updates. This guarantees monotonicity with respect to token relevance in dynamic layers, avoids destructive interactions (e.g., relevance inversion), and permits inhibitory effects in static layers, yielding sharper and more diverse representations.
Row-Normalized Similarity Scores: Dynamic layers normalize cosine similarity scores by their row sums, producing row-stochastic operators that stabilize forward activation and gradient flow throughout depth, counteracting the amplification failures seen in attention-based architectures. Empirical ablations show this normalization is superior to softmax and RMS-based alternatives.
Neural Compression: To circumvent the scalability bottleneck introduced by concatenating top-k retrieved splits for bidirectional contextualization, Avey-B compresses each block (current plus top-k splits) back to the original split size through a learned linear transformation. This maintains computation cost per split independent of k and preserves relevant context while drastically improving throughput.
Residual Integration in Compression: A residual connection from the original split tokens to the compressed block output preserves local information and enhances downstream effectiveness, especially on tasks requiring fine-grained evidence selection.
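As a rough illustration of the row-normalized dynamic update and its alternation with static layers, the NumPy sketch below may help. It is not the authors' implementation: the function names, the clipping of negative cosine scores to zero, the value of `eps`, and the random stand-in for the learned static matrix `V` are all assumptions made for this example.

```python
import numpy as np

def row_normalized_cosine(X, eps=1e-8):
    """Cosine similarity between token rows of X, normalized by row sums."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    U = X / np.maximum(norms, eps)
    sim = U @ U.T
    # Assumption: clip negatives so each row is a valid set of mixing weights.
    sim = np.clip(sim, 0.0, None)
    return sim / (sim.sum(axis=1, keepdims=True) + eps)

def dynamic_layer(X, eps=1e-8):
    """Purely dynamic update: each token becomes a similarity-weighted average."""
    return row_normalized_cosine(X, eps) @ X

def static_layer(X, V):
    """Purely static update: a learned, signed token-mixing matrix V (S x S)."""
    return V @ X

rng = np.random.default_rng(0)
S, d = 6, 4                            # hypothetical split size and embedding width
X = rng.normal(size=(S, d))
V = rng.normal(size=(S, S)) / S        # random stand-in for learned weights

A = row_normalized_cosine(X)           # row-stochastic mixing operator
Y = dynamic_layer(static_layer(X, V))  # static layer first, then dynamic
```

Because each row of `A` is non-negative and sums to (approximately) one, a more similar token never receives a smaller weight than a less similar one, which is the monotonicity property described above. The static layer, by contrast, may carry negative weights and so can express inhibition.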

Empirical Evaluation

Avey-B was benchmarked against several leading Transformer-based bidirectional encoders—BERT, RoBERTa, NeoBERT, and ModernBERT—on a comprehensive suite of tasks: Sequence Classification (SC), Token Classification (TC), Question Answering (QA), and Information Retrieval (IR). Key findings include:

Token Classification and Information Retrieval Superiority: Across all scales and token budgets, Avey-B consistently outperforms all Transformer-based baselines in TC and IR, often winning even against larger, more heavily pretrained models.
Generalization With Reduced Pretraining: Despite being pretrained on only $180$ billion tokens (roughly $11\times$ fewer than ModernBERT), Avey-B achieves comparable or superior effectiveness, indicating strong inductive biases from its architecture.
Scaling Efficiency: For sequence lengths up to $96$K tokens, Avey-B exhibits faster throughput and lower latency than all Transformer baselines, with scaling exponents (power-law fits) less than half those of ModernBERT and NeoBERT. Throughput degrades sublinearly with sequence length, maintaining practical deployment efficiency even at extreme context windows.
Long-Range Reasoning: On synthetic needle-in-a-haystack QA benchmarks up to $96$K tokens, Avey-B maintains high accuracy and robustness while ModernBERT and NeoBERT collapse or become intractable due to memory constraints. This suggests that its retrieval-based design can extrapolate far beyond the trained context window.
Optimization Stability: Avey-B demonstrates low cross-seed variance, ranking among the most robust encoders in the pool. The architectural principles—decoupling, normalization, compression, and shallow-embedding retrieval—contribute to consistent optimization stability.
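The scaling exponents mentioned under Scaling Efficiency come from power-law fits. One common way to obtain such an exponent is a straight-line fit in log-log space, sketched below. The measurements are synthetic and made up for illustration, and `fit_power_law` is a hypothetical helper, not something taken from the paper.

```python
import numpy as np

def fit_power_law(lengths, values):
    """Fit values ~= c * lengths**alpha by least squares in log-log space."""
    alpha, log_c = np.polyfit(np.log(lengths), np.log(values), 1)
    return alpha, np.exp(log_c)

# Synthetic, made-up measurements (not from the paper): latency = 0.5 * n**1.2
lengths = np.array([1024, 2048, 4096, 8192, 16384], dtype=float)
latency = 0.5 * lengths ** 1.2

alpha, c = fit_power_law(lengths, latency)  # alpha is the scaling exponent
```

A smaller `alpha` means latency grows more slowly with sequence length, so comparing exponents across encoders summarizes how each one scales.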

Ablation and Design Analysis

Systematic ablation studies corroborate the necessity of each core architectural element:

Decoupling: Removing decoupling leads to monotonicity violations and performance drops ($>$3–7 points) across all tasks.
Normalization: Row-wise normalization is critical; its removal degrades effectiveness by up to 15 points and destabilizes training.
Compression and Residual: The neural compressor enables a $4.37\times$ throughput improvement with only modest losses in QA and IR. The residual further ensures local span fidelity.
Retriever Choice: Empirical results show that a unidirectional ranker (retrieving only preceding splits) is preferable both for efficacy and for preserving alignment with discourse structure.
Static Layer Sign Handling: Permitting signed weights in static layers yields better effectiveness, especially for QA and tasks that require inhibitory contextualization.
Layer Arrangement: Alternating static and dynamic layers, with a static layer preceding a dynamic one, provides the most stable and effective contextualization, especially for token classification and downstream discriminative tasks.
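To make the monotonicity violation in the Decoupling ablation concrete, here is a tiny worked example. The numbers are hypothetical and chosen only to show how multiplying learned weights element-wise with similarity scores can invert relevance.

```python
import numpy as np

sim = np.array([0.9, 0.2])      # hypothetical cosine scores for two neighbors
learned = np.array([0.1, 1.0])  # hypothetical learned weights for the same pair

coupled = learned * sim         # element-wise coupling: [0.09, 0.2]
decoupled = sim / sim.sum()     # dynamic-only, row-normalized weights

# Coupling lets the less similar neighbor dominate; decoupling keeps the order.
inverted = bool(np.argmax(coupled) != np.argmax(sim))
preserved = bool(np.argmax(decoupled) == np.argmax(sim))
```

In the coupled case, the neighbor with similarity 0.2 ends up with the larger effective weight, which is the relevance inversion that separating static and dynamic layers is meant to rule out.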

Practical and Theoretical Implications

Avey-B demonstrates that self-attention is not the only viable paradigm for high-quality bidirectional contextualization. Eliminating attention yields:

Memory and Throughput Gains: Quadratic complexity is avoided, facilitating cost-effective deployment in industrial and edge scenarios, and enabling practical extension to extremely long contexts.
Robustness and Extrapolation: Retrieval-based contextualization allows token representations to condition on select splits, scaling context width independently from sequence length and supporting generalization far beyond the training window.
Model Design Flexibility: Decoupling static/dynamic parameterization and compressive retrieval mechanisms can be incorporated into other encoder architectures, suggesting broader applicability.
Potential for New Architectures: The demonstrated effectiveness of attention-free, retrieval-based bidirectional encoders motivates exploration of similar architectures in areas such as dense retrieval, document-level QA, and large-context language modeling.

Future developments may focus on integrating fused-kernel implementations for additional speedups, refining retrieval selection with task-aware relevance criteria, or hybridizing attention-free encoders with memory-efficient state-space models.

Conclusion

Avey-B offers an attention-free, bidirectional alternative to Transformer-based encoders, delivering superior efficiency, robustness, and token-level accuracy, especially in long-context and resource-constrained settings. Its architectural decoupling, normalization, and neural compression strategies empirically and theoretically advance the viability of retrieval-conditioned, non-attention encoders as a new foundation in industrial and academic NLP. The model's released implementation and checkpoints will facilitate further study and potential adoption across diverse language understanding tasks (2602.15814).

Explain it Like I'm 14

What is this paper about?

This paper introduces Avey-B, a new kind of language-model “encoder” that understands text by looking both left and right in a sentence at the same time (like BERT), but without using the usual “attention” mechanism. Instead, it looks at text in chunks, quickly finds the most relevant other chunks, and mixes their information using simple, efficient steps. The goal is to keep high accuracy while being faster and lighter on memory, especially for long documents.

What questions were the researchers asking?

Can we build a strong bidirectional encoder (like BERT) without using attention?
If we separate two ways of mixing information—fixed rules vs. similarity-based rules—does the model become more stable and accurate?
Can we compress the “extra context” we gather so the model stays fast even when it looks at lots of relevant text?
How does this new design compare to popular encoders such as BERT, RoBERTa, ModernBERT, and NeoBERT on common tasks?

How does their approach work?

Think of reading a long article with sticky notes:

Instead of re-reading everything, you pick the few pages most relevant to the current page you’re studying.
You then write a short, neat summary of those pages and attach it to the current page.
You repeat this for each chunk, building a full understanding faster and without flipping through the entire article every time.

Avey-B follows that idea.

The big idea: rank and retrieve only what matters

The model splits the input text into equal-sized chunks (“splits”).
For each chunk, a “ranker” finds the top-k most relevant other chunks, using a simple similarity score (you can think of it as “how alike are these parts?”).
This avoids wasting time mixing in unhelpful pieces of text.
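The split-and-rank idea can be sketched in a few lines of NumPy. This is a simplified illustration rather than the paper's ranker: the sum-of-max-cosine form of MaxSim, the helper names, the `preceding_only` switch, and the no-padding restriction are all assumptions for this example.

```python
import numpy as np

def split_tokens(E, S):
    """Reshape (N, d) token embeddings into (N // S, S, d) equal-sized splits."""
    N, d = E.shape
    if N % S != 0:
        raise ValueError("this sketch assumes N is a multiple of S (no padding)")
    return E.reshape(N // S, S, d)

def maxsim(A, B, eps=1e-8):
    """Sum over tokens in A of the best cosine match among tokens in B."""
    A = A / np.maximum(np.linalg.norm(A, axis=1, keepdims=True), eps)
    B = B / np.maximum(np.linalg.norm(B, axis=1, keepdims=True), eps)
    return float((A @ B.T).max(axis=1).sum())

def top_k_splits(splits, i, k, preceding_only=True):
    """Indices of the k best-scoring candidate splits for split i."""
    if preceding_only:
        candidates = list(range(i))
    else:
        candidates = [j for j in range(len(splits)) if j != i]
    ranked = sorted(candidates, key=lambda j: maxsim(splits[i], splits[j]),
                    reverse=True)
    return ranked[:k]

rng = np.random.default_rng(0)
S, d = 4, 8                      # toy sizes for illustration
E = rng.normal(size=(4 * S, d))
E[2 * S:3 * S] = E[0:S]          # make split 2 an exact copy of split 0
splits = split_tokens(E, S)
best = top_k_splits(splits, i=2, k=1)  # split 0 should be the top match
```

With `preceding_only=True` the first split has no candidates and gets an empty list, which matches the unidirectional ranker described elsewhere on this page.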

The two main parts

Ranker

Finds and scores relevant chunks from the whole text.
New in Avey-B: a “neural compressor” turns the current chunk plus its retrieved chunks back into a single chunk-sized summary. This keeps the later steps fast.

Neural Processor

Improves each chunk’s token representations with three mini-steps:
- Enricher: expands features with a small per-token neural network.
- Contextualizer: mixes information between tokens, based on either learned weights or similarity scores.
- Fuser: blends the original and contextualized features and returns to the usual embedding size.

What’s new in Avey-B (and why it helps)

Bidirectional understanding: It removes the “look only forward” rule, so each token can use both left and right context—like BERT encoders.
Decoupled mixing layers: The model alternates two kinds of layers:
- Static layers use fixed learned weights (like a teacher’s consistent guidance).
- Dynamic layers use similarity scores (like a popularity vote among tokens).
- Keeping the two apart keeps similarity-based steps fair: if token A is more similar than token B, it won’t accidentally count less.
Stable similarity scores: In dynamic layers, similarity scores are normalized so they don’t blow up or become unstable during training.
Neural compression: Instead of processing the current chunk plus all retrieved chunks together (which would slow things down), Avey-B compresses them into one chunk-sized summary first. This keeps the compute per chunk steady and speeds up inference.
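The compression step can be pictured as one learned matrix that maps the (k+1)·S tokens of a block back to S tokens, followed by adding the current chunk back in. The sketch below assumes that form. The concatenation order, the exact placement of the residual, and the random stand-in for the learned projection `P` are assumptions, not details confirmed here.

```python
import numpy as np

def compress_block(current, retrieved, P):
    """Project (k+1)*S tokens back to S tokens, then add the current split.

    current:   (S, d) tokens of the split being processed
    retrieved: list of k arrays, each (S, d)
    P:         (S, (k+1)*S) projection over the token axis (learned in practice)
    """
    block = np.concatenate([current] + list(retrieved), axis=0)  # ((k+1)*S, d)
    return P @ block + current                                   # (S, d)

rng = np.random.default_rng(0)
S, d, k = 4, 8, 3                                      # toy sizes
current = rng.normal(size=(S, d))
retrieved = [rng.normal(size=(S, d)) for _ in range(k)]
P = rng.normal(size=(S, (k + 1) * S)) / ((k + 1) * S)  # stand-in for learned weights

out = compress_block(current, retrieved, P)            # always (S, d)
```

Whatever the value of `k`, the output has the same shape as a single chunk, which is why the later layers cost the same per chunk no matter how much context was retrieved.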

What did they find?

Across many standard tasks, Avey-B performed strongly and ran fast:

Accuracy
- It beat BERT and NeoBERT across all tested benchmarks.
- It consistently outperformed RoBERTa and ModernBERT on token-level tasks (like tagging words with labels) and information retrieval (like matching questions to relevant passages).
- Even the smaller Avey-B sometimes matched or exceeded larger Transformer encoders on these tasks, despite being trained on far fewer tokens than, for example, ModernBERT.
Speed and scaling
- Avey-B processed long inputs faster than the Transformer-based encoders tested.
- As documents get longer, Avey-B’s throughput drops much more slowly than ModernBERT and NeoBERT.
- At very long lengths (e.g., around 96,000 tokens), Avey-B was several times faster than the baselines (for example, about 3.4× faster than ModernBERT and about 11.6× faster than NeoBERT in their setup).

Why this matters: token tagging and retrieval often show up in real-world systems (like search, recommendation, and data labeling), where speed and memory are tight. Avey-B’s design fits those needs well.

Why does this matter, and what could it impact?

Attention isn’t the only way: This work shows you can build strong, bidirectional text encoders without self-attention, opening new paths for efficient models.
Better for long documents: Because Avey-B ranks and compresses relevant context, it stays fast as inputs grow, which is great for search engines, document analysis, and other large-scale applications.
Industry-ready efficiency: Encoders often power production systems where latency and cost matter. Avey-B’s higher throughput and good accuracy could reduce serving costs and speed up user-facing applications.
Research and reproducibility: The authors released code and pretrained models, making it easier for others to build on their approach.

In short, Avey-B shows a practical, attention-free way to get high-quality text understanding—often faster and with less compute—especially when handling long inputs or doing tasks like tagging tokens and retrieving relevant information.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, consolidated list of the paper’s unresolved issues and avenues for future research.

Bidirectional retrieval design: Avey-B’s ranker is intentionally unidirectional (matches only preceding splits). Quantify how truly bidirectional retrieval (left and right splits) affects SC/QA performance, cost, and stability, and identify tasks where bidirectional retrieval is necessary.
Ranker complexity and scalability: The ranker’s O(N² d) matching cost remains a bottleneck for very long inputs and large d. Investigate approximate or hierarchical retrieval (e.g., ANN, locality-sensitive hashing, multi-stage pruning) that preserves quality while reducing complexity to near-linear or N log N.
Inference-time ranker overhead: The paper measures throughput dominated by the neural processor; however, the ranker still runs per pass. Provide a clear accounting of inference complexity including ranker cost, and explore incremental or cached ranking for streaming/online encoders.
Retrieval granularity: The ranker operates at split-level granularity. Evaluate token-level, phrase-level, or hierarchical (paragraph/section) retrieval to refine relevance selection and reduce false positives/negatives in the retrieved context.
MaxSim as the relevance signal: Assess alternative relevance scoring functions (e.g., learned metrics, mutual information, cross-encoder scoring, hybrid lexical-semantic signals) and their effect on precision/recall of retrieved splits.
Neural compression fidelity: The compressor linearly reduces (k+1)S tokens to S tokens via P. Characterize information loss (e.g., probing tasks, alignment metrics), derive bounds on compression-induced degradation, and compare against non-linear compressors (attention pooling, low-rank adapters, learned pooling trees).
Adaptive k and S: k=3 and S=256 are fixed. Study adaptive per-split k and split size S based on uncertainty/relevance/difficulty to improve quality under fixed compute budgets.
Theoretical properties of decoupled parametrization: Beyond monotonicity per dynamic layer, analyze global expressivity, convergence properties, and optimization dynamics of interleaved static/dynamic layers; identify regimes where decoupling harms or helps stability or representational power.
Normalization choice in dynamic layers: Provide formal spectral/conditioning analyses for row-wise sum normalization (ε-stabilization), compare to double-stochastic or Sinkhorn normalizations, and quantify gradients’ behavior with depth (e.g., singular value bounds).
Sensitivity to ε: The stabilizer ε is not specified or analyzed. Determine sensitivity ranges, stability thresholds, and principled methods to set ε (e.g., adaptive ε per layer or per batch).
Positional information across splits: The paper claims the contextualizer needs no extra positional encoding due to V; clarify how cross-split order is represented post-compression, and test tasks where precise global order matters (e.g., logical reasoning, long-context discourse).
Failure mode analysis: Avey-B lags RoBERTa/ModernBERT on MNLI and ReCoRD/SQuAD. Perform granular error analysis (e.g., contradiction vs entailment, entity-centric vs relational QA) to pinpoint architectural causes and guide targeted fixes (e.g., bidirectional ranker, richer positional cues).
Pretraining objective alignment: The model uses MLM with 20% masking. Explore retrieval-aware or contrastive pretraining (e.g., ELECTRA/RTD, span corruption, denoising with retrieved context) tailored to Avey-B’s retrieval/compression inductive biases.
Scaling laws: Provide systematic scaling studies (parameters, depth/width, pretraining tokens, k/S/N) and fit empirical laws for quality and efficiency; identify optimal depth-width interleaving schedules across scales.
Long-context quality on real tasks: Beyond synthetic NIAH and efficiency curves, evaluate accuracy on real long-context tasks (book-length QA, long-range summarization, legal/scientific IR) to validate selective retrieval/compression at 32k–96k tokens.
Memory and energy profiling: Report peak memory, activation footprint, and energy/latency trade-offs for both training and inference (with/without torch.compile) relative to FlashAttention-optimized baselines; quantify the impact of fused kernels once available.
Kernel-level optimization gap: Avey-B lacks fused CUDA/Triton kernels. Build and benchmark fused kernels to confirm whether the observed throughput advantage persists or increases, and characterize the proportion of gains attributable to architecture vs kernel optimization.
Hardware and batch-size sensitivity: Throughput results use batch size 8 on specific GPUs (H200/B200; the paper's mentions are inconsistent). Study performance across diverse hardware (A100/H100/H200/B200, consumer GPUs, CPUs) and batch sizes; ensure conclusions generalize.
Robustness and domain generalization: Evaluate resilience to noise/adversarial perturbations, domain shift (biomedical, legal), and multilingual/low-resource settings; test whether ranker+compressor degrades under distribution shifts.
Fine-tuning protocols: Benchmarks use limited epochs and fixed LR sweeps. Examine whether Avey-B benefits from task-specific schedules (e.g., layerwise LR, retrieval-aware regularizers), and whether longer fine-tuning narrows gaps on MNLI/QA.
Differentiable retrieval: The ranker’s top-k selection is non-differentiable. Explore continuous relaxations (e.g., Gumbel-top-k, SoftTopK), end-to-end training of ranker scoring, or joint learning with compressor to improve selection quality.
Static/dynamic interleaving schedule search: The chosen S→D pattern is empirically best; develop principled or learned schedule selection (e.g., reinforcement learning, NAS) and test deeper variants (D→S blocks, multiple D in a row) for different tasks.
Information-theoretic view of bypass/gating: Formalize how partial-embedding bypass and gating mitigate over-smoothing; quantify retained token-specific information across depth and relate to generalization/robustness.
Retrieval noise and distractors: Measure the rate and impact of irrelevant split selection, and design mechanisms (confidence thresholds, re-ranking, iterative retrieval) to reduce distractors without increasing compute.
Comprehensive complexity reconciliation: The paper mentions both quadratic (training) and linear (inference) scaling claims. Provide a unified complexity model that includes ranker, compressor, and processor under both training and inference for encoder-only usage, clarifying when each term dominates.
Security/privacy considerations: Retrieval over splits may surface sensitive content. Audit leakage risks and design privacy-preserving retrieval/compression (e.g., differential privacy, secure indexing).
Integration with downstream systems: Assess compatibility with production IR pipelines (indexing, sharding, latency SLAs), and study how compressor/ranker interact with caching and streaming encoders in real deployments.
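The O(N² d) ranker cost raised in the list above can be made tangible with a back-of-the-envelope count. The sketch assumes every candidate split pair is scored by comparing all S×S token pairs across d dimensions. The function name and the value of `d` are hypothetical, while S=256 follows the text.

```python
def ranker_multiply_adds(N, S, d, preceding_only=True):
    """Rough multiply-add count for scoring every candidate split pair."""
    M = N // S                                   # number of splits
    pairs = M * (M - 1) // 2 if preceding_only else M * (M - 1)
    return pairs * S * S * d                     # S*S token pairs, d dims each

S, d = 256, 768   # S follows the text; d is a hypothetical embedding width
small = ranker_multiply_adds(8192, S, d)
large = ranker_multiply_adds(16384, S, d)
ratio = large / small                            # close to 4 when N doubles
```

Doubling the input length roughly quadruples the matching work, which is why approximate or hierarchical retrieval is suggested for very long inputs.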

Practical Applications

Immediate Applications

Below are specific, deployable uses that directly leverage Avey-B’s bidirectional, attention-free encoder, its superior token-classification and information-retrieval performance, and its high-throughput long-context scaling.

Compact drop-in encoder for production NLP pipelines (software, enterprise search, e-commerce, customer support)
- Use Avey-B as a replacement for BERT/RoBERTa in existing encoders (CLS pooling, mean pooling) for tasks like sentiment analysis, intent detection, topic classification, and email/ticket routing.
- Benefits: lower latency and higher throughput at short and long contexts; better accuracy on token classification and dense retrieval vs. common Transformer baselines.
- Tools/workflows: plug into Detext-, ColBERT-, or Sentence-Transformer-style pipelines; integrate with FAISS/Milvus/Weaviate for vector search; swap model weights/config in existing services.
- Dependencies/assumptions: compatibility with current tokenization/serving stack; validation on in-domain data; GPU/CPU deployment tuned for split size S and top-k; current lack of fused kernels (uses torch.compile) may limit peak efficiency, though results are already favorable.
Enterprise and consumer search with long-document encoding (software, media, knowledge management)
- Use Avey-B to embed and index long documents (manuals, wikis, PDFs, reports) with better throughput and long-context scaling (validated up to ~96k tokens).
- Benefits: faster ingestion; more robust retrieval of relevant passages in lengthy files; improved NDCG@10 on IR benchmarks relative to Transformer baselines.
- Tools/workflows: document chunker that aligns to Avey-B’s split paradigm; indexing with vector DBs; retrieval-augmented generation (RAG) pipelines that use Avey-B as the document and/or query encoder.
- Dependencies/assumptions: choice of S and k tuned for collection characteristics; ANN-backed retrieval for scalability; RAG quality still depends on downstream generator.
PII detection and automated redaction (healthcare, finance, public sector, legal, daily productivity)
- Apply Avey-B to token-level PII detection (names, addresses, IDs) and redact content in documents, emails, and logs, benefiting from top token-classification performance.
- Benefits: improved recall/precision on span detection; faster processing of long records (EHR notes, legal exhibits, compliance documents).
- Tools/workflows: DLP pipelines; data sanitization filters in ETL; email gateways and document processors with real-time redaction.
- Dependencies/assumptions: domain fine-tuning data and validation; jurisdiction-specific policy constraints; careful thresholding to balance over/under-redaction.
Clinical/biomedical entity extraction and coding (healthcare)
- Use Avey-B for NER and span tagging (problems, meds, procedures) and assist with ICD/CPT coding from free-text EHR notes.
- Benefits: strong short-span accuracy and fast processing for long patient notes or multi-visit summaries.
- Tools/workflows: integration into clinical NLP platforms; batch processing of EHR corpora; assistance tools for medical coders.
- Dependencies/assumptions: domain adaptation to clinical language; HIPAA-compliant deployment; local/on-prem inference may be required.
Legal and e-discovery triage (legal, public sector)
- Token-level tagging and retrieval across large corpora (contracts, case law, discovery sets) for clause detection, privilege detection, and issue-focused retrieval.
- Benefits: scalability to very long documents; improved ranking and tagging speeds for tight review timelines.
- Tools/workflows: pipeline that embeds documents using Avey-B; filtering/triage UI for legal teams; contract analytics.
- Dependencies/assumptions: tuning on legal text; explanation/audit trails for defensibility; integration with e-discovery platforms and metadata pipelines.
Financial document analysis and monitoring (finance)
- Classify and extract spans from 10-K/10-Q filings, research reports, and news; power retrieval for due diligence and risk surveillance.
- Benefits: long-context encoding matches the length of filings; better IR performance for topic/event retrieval.
- Tools/workflows: long-document encoders for research platforms; surveillance pipelines for email/chat monitoring with token-level tagging.
- Dependencies/assumptions: domain lexicon adaptation; governance and model-risk controls; throughput tuning for bulk ingestion windows.
Education: long-form content search and grading assistance (education, EdTech)
- Retrieval from textbooks/course materials; token-level rubric alignment for short-answer/essay components.
- Benefits: faster indexing of large course materials; improved extraction-based scoring aids for graders.
- Tools/workflows: course content indexers; grading assistance dashboards using Avey-B for extraction and rationale highlighting.
- Dependencies/assumptions: institution policies on automated grading; calibration against rubrics; fairness/consistency checks.
On-device or edge inference for classification and extraction (mobile, IoT, daily life)
- Compact encoder for offline email/app content classification (priority, spam/ham, categories) and redaction of sensitive text on-device.
- Benefits: budget-friendly latency and memory; better privacy by keeping inference local.
- Tools/workflows: quantized Avey-B variants; mobile runtimes (CoreML, NNAPI) once available; batched inference for message streams.
- Dependencies/assumptions: availability of export/quantization toolchains; reduced-precision evaluation; memory limits dictate S/k choices.
Log and incident report triage (software operations, energy, manufacturing)
- Token-level tagging of anomalies and IR-based retrieval on long logs/incident writeups.
- Benefits: handles lengthy logs; improves triage and root-cause search.
- Tools/workflows: pipeline to embed logs and attach token tags; integrate with observability stacks for faster incident response.
- Dependencies/assumptions: domain fine-tuning on log formats; throughput under spike loads; alerting thresholds.
Public-sector document intake and comment analysis (policy, governance)
- Classify and extract spans (issues, stakeholders, policy topics) from long public comments and regulations; enable targeted search for policymakers.
- Benefits: high-throughput long-document processing; more precise span tagging.
- Tools/workflows: agency intake pipelines; dashboards for thematic retrieval and summarization inputs.
- Dependencies/assumptions: transparency and bias audits; multilingual coverage if required; on-premise deployment constraints.
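Several of the retrieval benefits above are stated in terms of NDCG@10. For teams validating a swap on in-domain data, a minimal sketch of that metric follows. It uses linear gain with a log2 rank discount and computes the ideal ordering from the supplied labels only. Both are common conventions assumed here, not choices confirmed by the paper.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance labels, listed in the order a system ranked them.
perfect = ndcg_at_k([3, 2, 1, 0])    # already in ideal order
shuffled = ndcg_at_k([0, 1, 2, 3])   # worst order for these labels
```

A ranking already in ideal order scores 1.0, and any other ordering of the same labels scores lower, so the metric rewards placing the most relevant passages near the top.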

Long-Term Applications

These applications are promising but may require additional research, scaling, engineering, or ecosystem maturity (e.g., fused kernels, ANN rankers, domain pretraining).

Ultra-long document analytics and agents (software, legal, healthcare, finance, public sector)
- End-to-end systems that read, retrieve, and ground decisions across hundreds of thousands to millions of tokens (e.g., cross-document legal discovery, longitudinal patient timelines, multi-year financial analyses).
- Potential products: “LongDoc Encoder API,” document-grounded agents that rely on Avey-B for retrieval/extraction at scale.
- Dependencies/assumptions: approximate ranker for per-split retrieval to reduce O(N²d) matching costs; memory-optimized serving; domain pretraining; evaluation for stability at extreme lengths.
Fused-kernel and hardware-accelerated deployments
- Triton/CUDA fused kernels for the neural processor, ranker, and compressor; future ASIC/FPGA paths for attention-free encoders.
- Benefits: substantial throughput and latency gains beyond torch.compile; lower energy per token.
- Dependencies/assumptions: engineering investment; kernel generalization across batch/sequence regimes; vendor ecosystem support.
Encoder-decoder pipelines for document-grounded generation (software, media, education)
- Use Avey-B as the encoder in encoder–decoder systems to provide long-context evidence to a generator (e.g., RAG 2.0 with selective split retrieval/compression).
- Potential workflows: Avey-B encodes documents into relevance-aware representations; decoder consumes retrieved spans for faithful answers.
- Dependencies/assumptions: tight coupling interfaces; training objectives for faithfulness; task-specific datasets.
Advanced retrieval architectures: “Avey-B ColBERT-style” and compressed indexing
- Late interaction systems where Avey-B’s split-aware representations and cosine-based dynamic layers power efficient passage-level matching; leverage learned compression to shrink index size.
- Benefits: improved long-doc retrieval with smaller storage and faster query-time computation.
- Dependencies/assumptions: ANN indices tuned to split embeddings; index maintenance for dynamic corpora; careful evaluation of recall vs. compression trade-offs.
Multilingual and cross-domain models
- Avey-B variants pretrained for multilingual corpora and specialized domains (clinical, legal, finance, code).
- Benefits: broader applicability and reduced domain adaptation effort.
- Dependencies/assumptions: large-scale domain/multilingual corpora; cross-lingual alignment; fairness and regional compliance testing.
Interpretability and compliance-by-design encoders (policy, regulated industries)
- Leverage monotonicity within dynamic layers (decoupled parameterization) and row-wise normalization to develop more interpretable evidence aggregation for audits.
- Potential tools: span-attribution visualizers that reflect similarity-respecting updates; regulator-friendly reports.
- Dependencies/assumptions: formal interpretability methods built on Avey-B internals; user studies; integration with GRC systems.
Robotics and IoT instruction retrieval from long manuals (robotics, manufacturing)
- On-device retrieval and extraction from long equipment manuals for task guidance; token-level hazard detection in SOPs.
- Benefits: offline operation; efficient long-context access for step-by-step instructions.
- Dependencies/assumptions: ruggedized, quantized models; multimodal extension if diagrams are needed; safety validation.
Energy and industrial maintenance intelligence
- Analyze long maintenance logs and incident histories; IR for similar past failures; span tagging of root-cause descriptors.
- Benefits: better MRO (maintenance, repair, operations) efficiency and knowledge transfer.
- Dependencies/assumptions: domain adaptation; interoperability with CMMS/EAM systems; throughput for large historical corpora.
Curriculum-scale retrieval and tutoring (education)
- Systems that navigate an entire curriculum corpus to retrieve precise excerpts for personalized help, assessments, and content linking.
- Benefits: fine-grained retrieval over long materials; improved study assistance grounded in exact text.
- Dependencies/assumptions: aligned pedagogy; bias and accessibility considerations; generator integration for tutoring.
Privacy-preserving local knowledge bases (daily life, enterprise)
- Local-first knowledge apps that index personal or departmental corpora (notes, emails, documents) and provide fast, accurate retrieval/extraction without cloud upload.
- Benefits: improved privacy and latency; long-context support for large personal archives.
- Dependencies/assumptions: efficient on-device indices; lightweight inference stacks; user controls for data governance.

Cross-cutting assumptions and dependencies

Data and fine-tuning: High-quality, domain-specific fine-tuning sets are often required to realize gains in specialized sectors (healthcare, legal, finance, logs).
Serving stack: Current implementation lacks custom fused kernels; while Avey-B already outperforms optimized Transformers in many regimes, further gains likely require kernel engineering.
Ranker scaling: Split-to-split matching is O(N²d) per pass; for very long inputs, approximate nearest neighbor (ANN) schemes or hierarchical ranking may be needed.
Governance: Bias, fairness, and auditability must be addressed, especially in regulated domains; interpretability benefits from the dynamic-layer monotonicity should be validated in practice.
Hardware and precision: Reported performance was measured on modern NVIDIA GPUs with mixed precision; CPU/mobile deployment requires quantization/distillation and careful benchmarking.
Multilingual/generalization: The released checkpoints (pretrained on FineWeb) may require additional training for multilingual and non-Web domains; licensing/usage constraints should be checked.
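
The quadratic cost in the ranker-scaling note can be made concrete with a small sketch. It scores every split against every other split with a ColBERT-style MaxSim (Khattab & Zaharia, 2020) and keeps the top-k. This is an illustration under assumptions, not the paper's implementation: the function names (`l2_normalize`, `maxsim`, `rank_splits`), the toy embeddings, and the exact way Avey-B aggregates MaxSim scores are all hypothetical.

```python
import math
from typing import List

Vector = List[float]
Split = List[Vector]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def maxsim(target: Split, candidate: Split) -> float:
    """For each target token take its best cosine match in the candidate, then sum."""
    total = 0.0
    for t in target:
        total += max(sum(a * b for a, b in zip(t, c)) for c in candidate)
    return total


def rank_splits(tokens: List[Vector], split_size: int, k: int) -> List[List[int]]:
    """Return, for each split, the indices of its top-k most similar other splits.

    Cost: (N/S)^2 split pairs x S^2 token pairs x d multiply-adds = O(N^2 d),
    where N is the number of tokens, S the split size, and d the embedding width.
    """
    unit = [l2_normalize(v) for v in tokens]
    splits = [unit[i:i + split_size] for i in range(0, len(unit), split_size)]
    ranked = []
    for i, target in enumerate(splits):
        scores = [(maxsim(target, cand), j) for j, cand in enumerate(splits) if j != i]
        scores.sort(key=lambda p: (-p[0], p[1]))  # best score first, ties by index
        ranked.append([j for _, j in scores[:k]])
    return ranked


# Toy input (made up): 6 tokens with d = 2, grouped into 3 splits of 2 tokens.
toy_tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.1], [0.9, 0.0]]
print(rank_splits(toy_tokens, split_size=2, k=1))  # [[2], [2], [0]]
```

Because every split is compared with every other split, doubling the sequence length quadruples the work. That is why the note above points to ANN schemes or hierarchical ranking for very long inputs.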


Glossary

ALiBi positional biases: A technique that adds linear positional bias to attention scores to encode relative position without explicit embeddings. "ALiBi positional biases (Press et al., 2022)"
alternating global/local attention: An attention pattern that alternates between full-sequence (global) and windowed (local) attention to improve efficiency. "alternating global/local attention"
autoregressive mask: A masking scheme that prevents tokens from attending to future positions, enforcing causal direction. "Avey-B drops the autoregressive mask in Avey's contextualizer"
Avey-B: A bidirectional, encoder-only reformulation of the Avey architecture that decouples static and dynamic parameterizations and introduces neural compression. "We propose Avey-B, a bidirectional encoder architecture that capitalizes on Avey"
BF16: The bfloat16 16-bit floating-point format, used in mixed-precision training and inference to improve speed with minimal accuracy loss. "with mixed precision (BF16)"
causal language modeling (CLM): A pretraining objective where models predict the next token given past context only. "GPT optimized a causal language modeling (CLM) objective"
contextualizer: A module in the neural processor that mixes token embeddings using dynamic similarities and learned transformations. "The contextualizer is an embedding-wise neural network with dynamic parameterization and cosine-similarity-based selectivity"
cosine similarity: A measure of similarity between two vectors based on the cosine of the angle between them. "N(Z_tr) N(Z_tr)^T computes cosine similarities between embeddings"
decoupled static and dynamic parameterizations: A design where learned linear transformations (static) and similarity-based mixing (dynamic) occur in separate layers to preserve monotonicity. "including: (1) decoupled static and dynamic parameterizations"
disentangled attention: An attention mechanism that separates content and positional information into distinct matrices. "introduced disentangled attention, which separates content and positional information into distinct attention matrices"
encoder-only paradigm: An architecture that uses only a bidirectional encoder stack (as in BERT-style Transformer encoders), with no decoder component. "reformulate Avey for the encoder-only paradigm"
FlashAttention: An IO-aware, memory-efficient exact attention algorithm that speeds up Transformer attention. "FlashAttention (Dao et al., 2022)"
gated linear units (GLU): An activation mechanism that gates linear transformations to improve expressiveness and training stability. "gated linear units (GLU) (Dauphin et al., 2017; Shazeer, 2020)"
Hadamard multiplication: Element-wise multiplication of matrices or vectors used within the contextualizer. "⊙ denotes element-wise (Hadamard) multiplication"
masked language modeling (MLM): A pretraining objective that reconstructs randomly masked tokens in an input sequence. "masked language modeling (MLM), which reconstructs randomly masked tokens in an input sequence"
MaxSim operator: A relevance-scoring function that selects splits based on maximum similarity of their token embeddings. "using the MaxSim operator (Khattab & Zaharia, 2020)"
NDCG@10: A ranking evaluation metric (Normalized Discounted Cumulative Gain) computed over the top 10 results. "IR with NDCG@10."
needle-in-a-haystack (NIAH): A synthetic long-context benchmark designed to test retrieval of a small signal within a large sequence. "Appendix M evaluates the long-context capabilities of Avey-B on a synthetic needle-in-a-haystack (NIAH) benchmark"
neural compression: A learned mechanism that compresses retrieved context into a fixed-size representation before processing. "we introduce a neural compression scheme in the ranker"
neural processor: The data-dependent Avey module with enricher, contextualizer, and fuser that performs token contextualization. "The neural processor comprises three modules, an enricher, a contextualizer, and a fuser."
partial-embedding bypassing: A technique that preserves a subset of raw token features across layers to mitigate over-smoothing. "This partial-embedding bypassing technique preserves raw token-specific features"
power-law decay model: A scaling model where throughput decreases as a power of sequence length. "We characterize long-context throughput using a power-law decay model, T(N) ∝ N^(-α)"
ranker: The component that selects top-k relevant splits for each target split based on similarity scores. "For each target split, the ranker selects the top-k most relevant splits from the input sequence"
residual connection: A skip connection that adds original inputs to processed outputs to stabilize training. "Avey-B adds a residual connection between the compressor output and the split's original S tokens"
RMSNorm: A normalization method that normalizes activations by their root mean square without centering. "RMSNorm (Zhang & Sennrich, 2019)"
RoPE positional encoding: Rotary positional embeddings that encode position via rotations in embedding space. "RoPE positional encoding (Su et al., 2021)"
row-stochastic similarity operator: A similarity matrix whose rows sum to at most one, bounding per-row gains. "This row-wise normalization yields a row-stochastic similarity operator (row sums ≤ 1)"
row-wise ℓ2 normalization: Normalizing each row vector to unit ℓ2 norm to compute cosine similarities. "N(·) applies row-wise ℓ2 normalization"
row-wise normalization: Per-row scaling of similarity scores to stabilize training and control gains. "This row-wise normalization yields a row-stochastic similarity operator"
row-wise sum normalization: A normalization that divides similarity scores by their row-wise sums. "(row-wise sum normalization)"
SwiGLU activations: An activation variant combining gating and smooth nonlinearities for improved performance. "SwiGLU activations (Shazeer, 2020)"
torch.compile: A PyTorch compilation API for graph capture and backend code generation to optimize execution. "using torch.compile, which performs graph capture and backend code generation"
virtual adversarial training: A regularization technique that improves fine-tuning stability via adversarial perturbations. "improved fine-tuning stability through virtual adversarial training"
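
As a worked illustration of the power-law decay model entry, the sketch below estimates the exponent α in T(N) ∝ N^(-α) by ordinary least squares in log-log space. The function name `fit_power_law` and the sample measurements are hypothetical; the numbers are synthetic and are not results from the paper.

```python
import math
from typing import List, Tuple


def fit_power_law(lengths: List[int], throughputs: List[float]) -> Tuple[float, float]:
    """Fit T(N) = c * N**(-alpha) and return (alpha, c).

    Taking logs gives log T = log c - alpha * log N, a straight line whose
    slope and intercept are found with ordinary least squares.
    """
    xs = [math.log(n) for n in lengths]
    ys = [math.log(t) for t in throughputs]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic throughput values generated from T(N) = 1000 * N**(-0.5).
sample_lengths = [512, 1024, 2048, 4096]
sample_throughputs = [1000.0 * n ** -0.5 for n in sample_lengths]
alpha, scale = fit_power_law(sample_lengths, sample_throughputs)
print(f"alpha={alpha:.3f}, c={scale:.1f}")
```

A smaller α means throughput degrades more slowly as the sequence grows, which is the sense in which one encoder scales more efficiently to long contexts than another.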
