
SENTRA: Adaptive Next-Token Transformer

Updated 22 September 2025
  • SENTRA is a Transformer architecture that augments standard models with adaptive token selection and semantic gating for efficient next-token prediction.
  • It employs layer-wise gating, threshold selectors, and contrastive pre-training to ensure robust and interpretable performance for edge communication and LLM text detection.
  • Its resource-adaptive design balances inference accuracy against bandwidth, outperforming baselines in compression, classification, and detection tasks.

SElected-Next-Token tRAnsformer (SENTRA) is a Transformer-based architecture whose defining feature is the integration of token selection mechanisms or selected-next-token probability information into the core of next-token prediction and related downstream tasks. Different implementations of SENTRA have been introduced in the context of semantic communication for edge intelligence, as well as in text detection to distinguish LLM-generated text. Across these instantiations, SENTRA typically augments standard Transformer architectures with token-level selection, gating, and/or semantic probability processing to produce robust, efficient, and interpretable next-token predictors and detectors.

1. Core Principles and Architectural Overview

SENTRA is constructed around the principle of “selected next-token” processing. In contrast to conventional Transformer models which process all tokens in a uniform manner, SENTRA architectures often incorporate explicit token selection modules or operate directly on selected-next-token-probability (SNTP) sequences. In communication-centric instantiations, such as for edge inference over semantic channels, this selection is implemented using budget-aware gating mechanisms that determine which semantic tokens (features, patch representations, or probability sequences) are retained for further computation or transmission. In text detection tasks, the architecture ingests SNTP sequences—lists of token-level log-probabilities assigned by an LLM—and encodes them via a bidirectional Transformer to produce a sequence-level classification.

The high-level SENTRA workflow involves:

  • Extraction or computation of token-level features, with optional inclusion of a “budget token” specifying the resource or semantic importance budget.
  • Layer-wise gating or masking, implemented by trainable modules, to score and select tokens according to semantic importance, probability, or explicit user/resource constraints.
  • Transformer-based refinement of selected tokens or SNTP sequences.
  • Downstream modules—such as joint source-channel coding blocks or classification heads—depending on application.
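The workflow above can be sketched end to end. This is a minimal illustration with stand-in scoring, refinement, and pooling steps (none of which are the published implementation); the budget value directly sets the fraction of tokens retained per layer:

```python
import numpy as np

def sentra_forward(tokens, budget, num_layers=2):
    """Minimal sketch of the SENTRA workflow (hypothetical stand-ins).

    tokens: (n, d) array of token-level features
    budget: the "budget token" alpha in [0, 1]; here it directly sets
            the fraction of tokens retained per layer
    """
    x = np.asarray(tokens, dtype=float)
    for _ in range(num_layers):
        # layer-wise gating: score tokens and keep the top alpha-fraction
        scores = np.abs(x).mean(axis=1)               # stand-in importance score
        keep = max(1, int(np.ceil(budget * len(x))))
        x = x[np.sort(np.argsort(scores)[-keep:])]
        # stand-in "Transformer refinement": mix in global context
        x = x + 0.1 * x.mean(axis=0, keepdims=True)
    # downstream head: pooled representation for coding / classification
    return x.mean(axis=0)
```

With `budget=0.5` and two layers, an 8-token input is reduced to 4 and then 2 surviving tokens before pooling, illustrating how a single model serves a continuum of budgets.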

2. Token Selection Mechanisms and Semantic Gating

A critical component of SENTRA systems for communication tasks is the adaptive token selection module. Each transformer block is prefaced by a semantic gating mechanism that receives both the sequence of token embeddings and a budget token reflecting the desired transmission or computation budget (α ∈ [0, 1]). Two submodules are present at each block:

  • Threshold selector: Processes the budget token to yield a per-layer gating threshold γₖ.
  • Token gate: Evaluates each input token to compute gating values.

A halting score sᵢ for token i is computed as:

sᵢ = σ(δ · gᵢ + β)

where δ, β are scalar parameters, σ is the sigmoid function, and gᵢ is the token-level gate output. Tokens with sᵢ below the threshold γₖ are discarded. This design enables per-input, per-layer, and per-budget adaptation, offering fine-grained control over the volume of information preserved and passed to subsequent computation stages.

This explicit, trainable selection yields models that can efficiently operate across a continuum of resource constraints—in terms of latency, bandwidth, or computational cost—without requiring separate models for each constraint.
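The gating rule above is straightforward to express in code. The sketch below computes the halting score sᵢ = σ(δ · gᵢ + β) and the keep/discard mask against a layer threshold γₖ; the particular δ and β values are illustrative defaults, not values from the papers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate_tokens(gate_outputs, gamma_k, delta=5.0, beta=0.0):
    """Per-layer token gating: a halting score s_i = sigmoid(delta * g_i + beta)
    is computed for each token, and tokens with s_i below the layer
    threshold gamma_k are discarded. delta/beta here are illustrative.
    """
    s = sigmoid(delta * np.asarray(gate_outputs, dtype=float) + beta)
    keep_mask = s >= gamma_k
    return s, keep_mask
```

For gate outputs `[-1.0, 0.2, 2.0]` and γₖ = 0.5, only the last two tokens survive, since σ(−5) ≈ 0.007 falls below the threshold.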

3. Transformer Layer Roles and Semantic Refinement

The Transformer backbone in SENTRA serves to contextually refine the selected tokens or SNTP input sequences. The self-attention layers allow retained tokens to interact and aggregate global information, enabling robust downstream inference or classification despite significant sparsification. For edge communication, this enables transmission of only the most task-relevant features, while in SNTP-based LLM text detection, the layers distill complex log-probability patterns into a unified sequence representation.

Transformer layers in SENTRA may also be responsible for adapting to the varying number of tokens produced by semantic selection. The architecture therefore must be robust to dynamic sequence lengths and highly effective at learning contextual dependencies among only the selected tokens.
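One standard way to make attention robust to a dynamic set of selected tokens is to mask discarded positions out of the softmax, so only surviving tokens interact. The single-head, projection-free sketch below illustrates the idea; it is a simplification, not the architecture's actual attention implementation:

```python
import numpy as np

def masked_self_attention(x, keep_mask):
    """Self-attention restricted to selected tokens: dropped positions are
    excluded from the softmax and produce no output. Single head, no
    learned projections, for illustration only.
    """
    x = np.asarray(x, dtype=float)
    keep_mask = np.asarray(keep_mask, dtype=bool)
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores[:, ~keep_mask] = -np.inf                      # block attention to dropped tokens
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    out = weights @ x
    out[~keep_mask] = 0.0                                # dropped tokens carry no output
    return out
```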

4. Downstream Modules and Resource-Adaptive Communication

In edge intelligence and semantic communication applications, SENTRA incorporates a joint source-channel coding (JSCC) module downstream of token selection. The selected tokens are further compressed by an autoencoder (often generating complex-valued codewords) for robust transmission over noisy wireless channels. The JSCC encoder and decoder pair ensures that the most essential semantic information is retained and can be reconstructed with high fidelity from the received signal.

Resource allocation is governed by an algorithm based on Lyapunov stochastic optimization, which dynamically tunes both the token selection budget (α) and the per-token compression ratio, balancing inference accuracy against bandwidth constraints. This mechanism is particularly effective in environments where channel conditions fluctuate, as it adaptively reconfigures the transmission strategy at each time step.
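A generic drift-plus-penalty controller of this kind maintains a virtual queue tracking cumulative bandwidth overshoot and, at each step, picks the budget that trades accuracy against queue backlog. The sketch below shows the general Lyapunov pattern with hypothetical per-budget accuracy and bandwidth models (`acc_fn`, `bw_fn`); it is not the paper's specific algorithm:

```python
import numpy as np

def lyapunov_step(queue, alphas, acc_fn, bw_fn, bw_cap, v=1.0):
    """One drift-plus-penalty step: choose the budget alpha minimizing
    V * (accuracy loss) + Q * (bandwidth use), then update the virtual
    queue Q tracking cumulative bandwidth overshoot past bw_cap.
    acc_fn/bw_fn are hypothetical per-alpha models.
    """
    costs = [v * (1.0 - acc_fn(a)) + queue * bw_fn(a) for a in alphas]
    alpha = alphas[int(np.argmin(costs))]
    new_queue = max(queue + bw_fn(alpha) - bw_cap, 0.0)
    return alpha, new_queue
```

With an empty queue the controller maximizes accuracy; as the backlog grows, the bandwidth term dominates and the chosen budget shrinks, which is how the strategy reconfigures itself as channel conditions fluctuate.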

5. Methodology in LLM Text Detection

For LLM-generated text detection, SENTRA approaches the problem by encoding the SNTP sequence for a candidate document:

  • For input text t₁, ..., t_T, the sequence of negative log-probabilities −log qᵢ(θ) is computed for each token with respect to one or several frozen LLMs.
  • These SNTP vectors are concatenated and projected through a fully-connected layer with positional embeddings to form the input sequence for SENTRA’s Transformer encoder.
  • A [CLS] token is prepended; the sequence is processed through a bidirectional Transformer, and the output corresponding to the [CLS] token is used as the pooled representation Rₗ for sequence classification.
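The input-construction steps above can be sketched as follows. The projection weights, positional perturbation, and [CLS] vector here are random stand-ins for the learned parameters, and the dimensions are arbitrary:

```python
import numpy as np

def build_sntp_input(logprobs_per_llm, d_model=16, seed=0):
    """Sketch of the SNTP input pipeline: per-token negative log-probabilities
    from one or more frozen LLMs are stacked, linearly projected, given
    positional embeddings, and prefixed with a [CLS] vector whose encoder
    output serves as the pooled representation R_l. All weights here are
    random stand-ins for learned parameters.
    """
    rng = np.random.default_rng(seed)
    sntp = np.stack(logprobs_per_llm, axis=1)        # (T, num_llms)
    w = rng.standard_normal((sntp.shape[1], d_model))
    x = sntp @ w                                      # fully-connected projection
    x += 0.01 * rng.standard_normal(x.shape)          # stand-in positional embedding
    cls = rng.standard_normal((1, d_model))
    return np.concatenate([cls, x], axis=0)           # [CLS] prepended: (T+1, d_model)
```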

A key contribution is the use of contrastive pre-training: SENTRA’s SNTP-derived embeddings are aligned (via a contrastive loss) with embeddings from a frozen pre-trained text encoder, enhancing domain robustness and generalization across diverse genres or synthetic LLM outputs.
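A standard choice for this kind of alignment objective is an InfoNCE-style loss, where matching (SNTP embedding, text embedding) pairs are positives and other in-batch pairs are negatives. The sketch below assumes that form and an illustrative temperature; the paper's exact loss may differ:

```python
import numpy as np

def info_nce(sntp_emb, text_emb, tau=0.07):
    """InfoNCE-style contrastive loss aligning SNTP-derived embeddings with
    a frozen text encoder's embeddings. Diagonal pairs are positives;
    off-diagonal in-batch pairs are negatives. tau is illustrative.
    """
    a = sntp_emb / np.linalg.norm(sntp_emb, axis=1, keepdims=True)
    b = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ b.T / tau
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))               # cross-entropy on positives
```

When the two embedding sets already match exactly, the loss approaches zero; mismatched pairs drive it up, pulling the SNTP representation toward the frozen encoder's space.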

6. Theoretical Foundations and Model Capacity

The SENTRA framework is informed by several results in the next-token prediction literature:

  • Rigorous upper and lower bounds exist on the prediction capacity of single-layer (and by extension, deeper) decoder-only transformers, scaling as O(k/(ω − 1)), where k is the parameter count and ω the vocabulary size (Madden et al., 2024). Maintaining injectivity in the selection and self-attention mechanisms is essential for full model capacity.
  • Self-attention dynamics are interpretable through a two-step mechanism: hard retrieval (selecting highest-priority tokens according to a directed token-priority graph) and soft composition (blending selected tokens) (Li et al., 2024). This dovetails with SENTRA’s explicit token selection and composition approach.
  • Empirically, a minimal “bigram subnetwork” in the first transformer MLP layer is both necessary and sufficient for efficient next-token transformation, offering mechanistic insight into how token selection and early-stage rotation enable robust token-level prediction (Chang et al., 21 Apr 2025).

7. Empirical Results and Applications

SENTRA-derived models achieve significant performance improvements over baselines across domains:

  • LLM-generated text detection: On RAID, M4GT, and MAGE datasets, SENTRA surpasses zero-shot, embedding-based, and supervised detectors, especially in out-of-domain and out-of-LLM settings. Contrastive pre-training further boosts both average and worst-case AUROC performance (Plyler et al., 15 Sep 2025).
  • Edge communication tasks: SENTRA achieves one to two orders of magnitude improvement in data compression for a given inference accuracy, consistently outperforming ViT, MobileNet, and digital baselines in classification accuracy when transmitting under severe bandwidth and SNR constraints (Devoto et al., 23 May 2025).
  • Interpretability: The explicit selection masks and gating thresholds offer clear, visualizable mechanisms for understanding which parts of the input contribute to inference or detection outcomes.

In summary, SElected-Next-Token tRAnsformer (SENTRA) generalizes standard Transformer models through adaptive, task-driven token selection operating at inference and/or probability sequence levels. By unifying resource-efficient selection, deep representation refinement, and robust contrastive learning, SENTRA outperforms traditional transformers in both edge communication and robust LLM text detection. Its design is grounded in rigorous theoretical analysis of transformer capacity and dynamics, and its practical success is attributable to its explicit modeling of token importance and data efficiency under real-world constraints.
