Deep Char/Byte-Level Transformers

Updated 19 March 2026

Deep character/byte-level transformers are architectures that process raw character or byte streams, bypassing conventional tokenization for open-vocabulary and noise-robust modeling.
They employ innovations like convolutional downsampling, hierarchical patching, and shifted attention to efficiently manage the increased sequence lengths from character-level inputs.
Empirical studies show these models achieve competitive accuracy in tasks like language modeling, machine translation, and multimodal processing, despite slower inference on short outputs.

Deep character/byte-level transformers are architectures that operate directly on raw character or byte sequences, bypassing the need for hand-engineered linguistic tokenization such as subwords or words. Building upon the self-attention mechanism of the standard Transformer, these models have demonstrated competitive or superior results in a variety of domains including natural language modeling, machine translation, multimodal learning, and data compression. They offer advantages in open-vocabulary support, robustness to noise and corruption, cross-lingual usability, and pipeline simplicity, while imposing specific computational and modeling challenges due to the increased sequence lengths inherent in character-level representations.

1. Architectural Foundations

Deep character/byte-level transformers replace tokenized input sequences with raw character or byte streams. The canonical instantiation is ByT5, which processes sequences as UTF-8 bytes using a 259-entry vocabulary (256 bytes plus 3 reserved IDs: PAD, EOS, UNK). Embedding is performed by a learnable lookup table: $x_b = E[b] \in \mathbb{R}^{d_\mathrm{model}}$. The rest of the architecture follows the T5 backbone as adapted in mT5, with multi-layer, multi-head self-attention, relative positional biases, and gated-GELU feed-forward networks. Crucially, the encoder is three times as deep as the decoder to balance the increased input sequence length (Xue et al., 2021).
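As a minimal sketch of the byte-to-ID mapping described above, the following assumes the three reserved IDs occupy positions 0 to 2 in the order PAD, EOS, UNK, so that byte value b maps to ID b + 3. The constant and function names are illustrative and are not taken from any particular library.

```python
# Assumed layout of the 259-entry vocabulary: 3 reserved IDs, then 256 byte values.
PAD_ID, EOS_ID, UNK_ID = 0, 1, 2   # assumed ordering of the reserved IDs
NUM_SPECIAL = 3                    # byte value b is stored as ID b + 3
VOCAB_SIZE = 256 + NUM_SPECIAL     # 259


def encode(text: str, add_eos: bool = True) -> list[int]:
    """Map text to byte-level IDs via UTF-8; no vocabulary file is needed."""
    ids = [b + NUM_SPECIAL for b in text.encode("utf-8")]
    if add_eos:
        ids.append(EOS_ID)
    return ids


def decode(ids: list[int]) -> str:
    """Drop reserved and out-of-range IDs, then decode the remaining bytes."""
    raw = bytes(i - NUM_SPECIAL for i in ids if NUM_SPECIAL <= i < VOCAB_SIZE)
    return raw.decode("utf-8", errors="replace")


def pad(ids: list[int], length: int) -> list[int]:
    """Truncate or right-pad a sequence to a fixed length."""
    return ids[:length] + [PAD_ID] * max(0, length - len(ids))


if __name__ == "__main__":
    ids = encode("héllo")
    print(len("héllo"), len(ids))  # 5 characters become 6 bytes plus EOS
    print(decode(ids))
```

Non-ASCII characters expand to several bytes under UTF-8, which is one reason byte sequences are longer than character sequences for many languages.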

Character-level transformers (e.g., Al-Rfou et al., 2018; Gupta et al., 2019; Gao et al., 2020) often employ model dimensions $d_\mathrm{model}$ in the 256–1536 range and stack from 6 layers (typical for translation) up to 64 layers (for language modeling). Learning long-range dependencies at the byte scale motivates distinctive architectural choices: per-layer learned positional encodings in deep models, auxiliary losses at intermediate positions and layers, and, in some cases, explicit local context via convolutional or highway modules (Al-Rfou et al., 2018).

Table: Selected Model Hyperparameters

Model	Input Unit	d_model	#Layers (Enc/Dec)	Max Seq Len	Vocab Size
ByT5-Base	Byte	1536	18 / 6	1024	259
CharTransformer	Char	128/512	6 / 6	5× subword	~300
Deep LM	Byte	512	64	512	256

Character and byte-level Transformers for images, audio, and multimodal data (e.g., ByteFormer (Horton et al., 2023), MEGABYTE (Yu et al., 2023)) use similar embedding/attention modules, relying on downsampling layers such as Conv1D or patching to control the otherwise prohibitive sequence length.

2. Sequence Length Compression and Efficiency

Operating on characters or bytes makes sequences 4–10× longer than their subword representations. Because self-attention cost grows quadratically with sequence length, models rely on the following architectural innovations to keep computation tractable:

Downsampling Front-Ends: CharTransformer and Charformer (Banar et al., 2020, Tay et al., 2021) use convolutional and pooling stages or gradient-based subword tokenization (GBST) to group adjacent characters into blocks, reducing sequence length by factors of 2–5, and thus reducing attention complexity by up to 25×.
Hierarchical and Patch-Based Models: MEGABYTE (Yu et al., 2023) partitions long byte streams into patches, processing global dependencies with a global Transformer (patch-level tokens) and local structure with smaller local Transformers, achieving sub-quadratic attention cost ($O(N^{4/3})$ with an optimally chosen patch size, where $N$ is the total number of bytes).
Shifted/Windowed Attention: ByteFormer (Horton et al., 2023) achieves tractable computation on sequences with 10,000+ tokens using shifted window attention and strided Conv1D downsampling between layers.
Block/Compressed Representations: TEMPEST (Alcazar et al., 2025) leverages the inherent block structure of compressed file formats (e.g., JPEG, MP3) to group bytes and reduce token count by up to 99%, yielding efficiency gains without explicit reconstruction of uncompressed data.
GBST Causality Patching: For generative decoding, block-wise groupings must avoid information leaks. Causal downsampling is achieved by preventing convolution/n-gram pooling from spanning block boundaries and eschewing convolutional positional encodings in the decoder (Edman et al., 2022).
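The savings claimed for downsampling and patching in the list above can be illustrated with a deliberately simplified cost model. The sketch below counts only pairwise attention interactions and ignores model width, feed-forward layers, and constant factors; the function names are illustrative assumptions.

```python
def full_attention_cost(n: int) -> int:
    """Pairwise interactions for dense self-attention over n positions."""
    return n * n


def downsampled_cost(n: int, rate: int) -> int:
    """Dense attention after grouping every `rate` positions into one block."""
    m = -(-n // rate)  # ceiling division
    return m * m


def patched_cost(n: int, patch: int) -> int:
    """Global attention over patches plus local attention inside each patch."""
    num_patches = -(-n // patch)
    global_cost = num_patches ** 2
    local_cost = num_patches * patch ** 2
    return global_cost + local_cost


def best_patch_size(n: int, max_patch: int = 4096) -> int:
    """Brute-force search for the patch size that minimizes patched_cost."""
    return min(range(1, max_patch + 1), key=lambda p: patched_cost(n, p))


if __name__ == "__main__":
    n = 1_000_000
    p = best_patch_size(n)
    print("downsample 5x saves", full_attention_cost(n) // downsampled_cost(n, 5), "x")
    print("best patch size:", p)
    print("patched / full cost:", patched_cost(n, p) / full_attention_cost(n))
```

Under this model a downsampling rate of 5 cuts the attention term by a factor of 25, and the best patch size grows far more slowly than the sequence length, which is why the combined global and local cost stays well below the dense quadratic cost.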

3. Training Paradigms and Empirical Properties

Character/byte-level transformers are trained using objectives and schedules analogous to their token-based counterparts, with scaling adapted to longer sequences:

Objective: Plain cross-entropy next-token prediction at the byte/character granularity or masked span corruption (mean length 20 bytes for ByT5).
Optimization: Large batch sizes are critical (e.g., batch ≥ 128 for competitive accuracy in transduction tasks (Wu et al., 2020)) to stabilize optimization on long sequences.
Auxiliary Losses: Deep models (e.g., 64-layer LM (Al-Rfou et al., 2018)) include per-layer and per-position auxiliary losses to improve gradient flow and accelerate convergence.
Scaling: Empirical evidence indicates that for model sizes <1B, byte-level models are both more parameter-efficient and competitive in accuracy. For pure classification at very large scales, subword models retain a marginal edge (Xue et al., 2021).
Downsampling: Reduces per-step FLOPs by up to 67% and speeds up training and inference by 28–100% without sacrificing quality (Tay et al., 2021, Lees et al., 2022).
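The masked span corruption objective listed above can be sketched as follows. This is a simplified illustration, not the ByT5 implementation: spans have a fixed length equal to the stated mean of 20 bytes, the 15% corruption rate is an assumed default, and the sentinel IDs (starting at a hypothetical `first_sentinel`) are placeholders.

```python
import random


def sample_spans(n, mean_len=20, rate=0.15, seed=0):
    """Pick non-overlapping (start, length) spans covering roughly `rate` of n.

    Simplification: every span has exactly `mean_len` positions, and one span
    is placed at a random offset inside each equal-sized segment.
    """
    rng = random.Random(seed)
    num_spans = max(1, round(n * rate / mean_len))
    seg = n // num_spans
    length = min(mean_len, seg)
    spans = []
    for k in range(num_spans):
        slack = seg - length
        offset = rng.randrange(slack) if slack > 0 else 0
        spans.append((k * seg + offset, length))
    return spans


def corrupt_spans(ids, spans, first_sentinel):
    """Replace each span with one sentinel ID; targets list the removed bytes."""
    inputs, targets = [], []
    cursor = 0
    for k, (start, length) in enumerate(sorted(spans)):
        sentinel = first_sentinel + k  # hypothetical sentinel numbering
        inputs.extend(ids[cursor:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(ids[start:start + length])
        cursor = start + length
    inputs.extend(ids[cursor:])
    return inputs, targets


if __name__ == "__main__":
    ids = list(range(400))  # stand-in for 400 byte IDs
    inputs, targets = corrupt_spans(ids, sample_spans(len(ids)), first_sentinel=1000)
    print(len(inputs), len(targets))
```

The encoder sees the shortened `inputs` and the decoder is trained to emit `targets`, so long spans keep the decoder sequence meaningful even though each unit is a single byte.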

Table: Downsampling Impact in Charformer (Tay et al., 2021)

Metric	No Downsampling	Downsampling (2×)	Downsampling (3×)
Steps/sec (Base)	9.3	11	15
Peak Mem (GB)	3.1	1.95	1.63
Pretrain FLOPS	1.1e13	1.6e13	--

4. Empirical Performance and Robustness

Deep character/byte-level transformers exhibit competitive accuracy and superior robustness across tasks and modalities:

Language Modeling: A 64-layer character-level LM achieved state-of-the-art results at publication, with 1.13 bpc on text8 and 1.06 bpb on enwik8, outperforming deep LSTMs by ~15–17% (Al-Rfou et al., 2018).
Machine Translation: Character-level NMT models reduce the BLEU gap to BPE baselines to ≈0.3 (32-layer char encoder, pre-norm, transparent attention) and vastly improve out-of-domain and noise robustness, showing losses 20–40% smaller than BPE under synthetic noise (Gupta et al., 2019).
Data Compression: Byte-level Transformers trained on 165GB data (text, image, audio) yield adjusted compression ratios that outperform both general-purpose (gzip, LZMA2) and domain-specific (PNG, FLAC) compressors, e.g., r=0.49 on OOD audio (FLAC: 0.54) (Heurtel-Depeiges et al., 2024).
Multimodal and File Classification: ByteFormer matches or exceeds modality-specific ViTs and audio CNNs, and enables joint multimodal classification without any modality-specific pre-processing (Horton et al., 2023).
Morphology and Spelling Tasks: ByT5 increases accuracy on grapheme-to-phoneme, transliteration, and inflection tasks by >30 points over subword baselines in some settings (Xue et al., 2021).
Downstream Tasks and Robustness: Charformer-based systems provide superior AUC on multilingual toxic comment classification, are robust to code-switching, emoji-based hate, and obfuscated text, and maintain high performance with simulated character noise (Lees et al., 2022).
Hierarchical Models: HAT (Hierarchy of character-to-word-to-character modules) matches the downstream accuracy of subword-token baseline transformers up to 7B parameters, with 30–50% smaller accuracy drops under character perturbations (Neitemeier et al., 2025).
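The bpc and bpb figures quoted above are average negative log-likelihoods in base 2 per character or byte, and they connect directly to compression. The sketch below shows the conversions; the function names are illustrative. The ideal ratio ignores coder overhead and model size, so it is only a lower bound and need not match the adjusted ratios cited above.

```python
import math


def bits_per_unit(true_probs):
    """Average -log2 p over the probabilities a model gave to the true symbols."""
    return -sum(math.log2(p) for p in true_probs) / len(true_probs)


def nats_to_bits(mean_nll_nats: float) -> float:
    """Convert a mean cross-entropy in nats (natural log) to bits."""
    return mean_nll_nats / math.log(2)


def ideal_compression_ratio(bits_per_byte: float) -> float:
    """Compressed size over raw size for an ideal entropy coder (8 bits per raw byte)."""
    return bits_per_byte / 8.0


if __name__ == "__main__":
    print(bits_per_unit([0.5, 0.5, 0.25, 0.25]))
    print(ideal_compression_ratio(1.06))
```

For example, a model at 1.06 bits per byte would correspond to an ideal ratio of about 0.13 under these assumptions.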

5. Innovative Model Variants and Hybrid Approaches

Gradient-Based Subword Tokenization (GBST): Charformer introduces a differentiable module that produces a weighted mixture over blocks of different character spans, learning subword-like groupings end-to-end and supporting downsampling (Tay et al., 2021).
Hierarchical Autoregressive Transformer (HAT): Combines a lightweight character-level encoder, a main word-level transformer backbone, and a small character-level decoder, conferring both the sequence compression benefits of word-level modeling and robustness/flexibility of character-level inputs (Neitemeier et al., 2025).
Multiscale and Patch-Based Models: MEGABYTE employs patch-based factorization and local-global modeling to handle sequences of over a million bytes with sub-quadratic complexity, attaining state-of-the-art image and language modeling results (Yu et al., 2023).
Compressed Domain Processing: TEMPEST achieves significant token-count reductions and computational savings by treating blocks of compressed files (e.g., MP3, JPEG) as atomic sequence units, substantially lowering FLOPs and memory (Alcazar et al., 2025).
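The GBST module described at the top of this list can be sketched in plain Python. This is a simplified reading of the idea, assuming mean pooling for block formation, a single linear scoring vector (`score_weights`, a stand-in for learned parameters), and mean-pool downsampling. It omits refinements such as score calibration or a convolution before pooling.

```python
import math


def mean_vec(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def gbst(embeddings, score_weights, max_block=4, downsample=2):
    """Soft mixture over block sizes per position, followed by downsampling."""
    length = len(embeddings)

    # 1) For each block size b, mean-pool non-overlapping blocks and copy each
    #    block vector back to every position it covers.
    candidates = []
    for b in range(1, max_block + 1):
        per_position = []
        for start in range(0, length, b):
            block = embeddings[start:start + b]
            per_position.extend([mean_vec(block)] * len(block))
        candidates.append(per_position)

    # 2) At each position, score the candidate blocks, softmax over block
    #    sizes, and take the weighted sum of the candidate vectors.
    mixed = []
    for i in range(length):
        vecs = [cand[i] for cand in candidates]
        scores = [sum(w * x for w, x in zip(score_weights, v)) for v in vecs]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        probs = [e / total for e in exps]
        dim = len(vecs[0])
        mixed.append([sum(p * v[k] for p, v in zip(probs, vecs)) for k in range(dim)])

    # 3) Shorten the sequence by mean-pooling groups of `downsample` positions.
    return [mean_vec(mixed[s:s + downsample]) for s in range(0, length, downsample)]


if __name__ == "__main__":
    chars = [[float(i), 1.0] for i in range(8)]  # 8 positions, d_model = 2
    out = gbst(chars, score_weights=[0.1, -0.2])
    print(len(out), "positions after downsampling")
```

Because every step is a weighted average, gradients can flow to the scoring weights, which is what lets the block choice be learned end-to-end rather than fixed by a tokenizer.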

6. Limitations and Open Challenges

Despite their promise and efficiency advances, byte/character-level transformers impose specific trade-offs:

Inference Latency: Byte-level inference is substantially slower: up to 10× slowdown for tasks with short outputs (e.g., XNLI, summarization), despite competitive throughput on generation and long-output tasks (Xue et al., 2021).
Data Exposure: At a fixed token budget, byte-level models are exposed to ≈4× less raw text than subword-token models, which can limit learning at extreme scales (Xue et al., 2021).
Capacity and Context: Very deep models, extensive downsampling, or hybrid architectures are required to match the effective modeling capacity and context window of token-based approaches (Al-Rfou et al., 2018, Tay et al., 2021).
Decoding Efficiency and Information Leakage: Downsampling in decoders requires strict causality, and improper design can result in future-information leaks (Edman et al., 2022).
Modality Transfer: Cross-modality transfer in multimodal byte-level transformers is weak for unseen domains, and optimal context length/model size depend critically on data modality (Heurtel-Depeiges et al., 2024).
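The information-leakage risk noted under Decoding Efficiency and Information Leakage can be checked mechanically. The sketch below uses scalar stand-ins for hidden states and compares two hypothetical pooling schemes: one that gives each position the mean of its own block (which includes later positions), and one that gives each position only the mean of the previous block. It illustrates the causality requirement and is not the specific design of Edman et al. (2022).

```python
def block_means(values, block):
    """Mean of each non-overlapping block of `block` consecutive values."""
    return [
        sum(values[s:s + block]) / len(values[s:s + block])
        for s in range(0, len(values), block)
    ]


def same_block_context(values, block):
    """Each position receives its own block's mean (sees later positions)."""
    means = block_means(values, block)
    return [means[i // block] for i in range(len(values))]


def previous_block_context(values, block):
    """Each position receives the previous block's mean; the first block gets 0."""
    means = block_means(values, block)
    return [means[i // block - 1] if i >= block else 0.0 for i in range(len(values))]


def leaks_future(context_fn, values, block):
    """True if changing position t alters the context of any position before t."""
    base = context_fn(values, block)
    for t in range(len(values)):
        changed = list(values)
        changed[t] += 1.0
        out = context_fn(changed, block)
        if any(out[i] != base[i] for i in range(t)):
            return True
    return False


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    print(leaks_future(same_block_context, x, 2))
    print(leaks_future(previous_block_context, x, 2))
```

A perturbation test of this kind is a cheap sanity check for any downsampled decoder: if an earlier output changes when a later input is modified, teacher-forced training will see information that is unavailable at generation time.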

7. Future Directions

Current trends and open research lines include:

Efficient Attention and Novel Downsampling: Sparse, local, and hashing-based attention as well as hierarchical and dynamic block grouping to mitigate quadratic costs (Xue et al., 2021, Tay et al., 2021, Alcazar et al., 2025).
Learned or Dynamic Tokenization: Soft/varying subword splits or block lengths learned end-to-end or adaptively per input (Tay et al., 2021, Neitemeier et al., 2025).
Large-Scale Multimodal Pretraining: Unified models trained from bytes across text, images, and audio, exploiting block/patched representations for efficient universal modeling (Heurtel-Depeiges et al., 2024, Horton et al., 2023).
Robustness and Domain Adaptation: Exploiting open-vocabulary and character-level modeling for rapid adaptation to new domains/languages and for robustness against noise, code-switching, and adversarial perturbations (Neitemeier et al., 2025, Lees et al., 2022).
Deployment and Productionization: Real-time, token-free, multilingual byte-level models integrated into production systems for classification, generation, and content moderation (Lees et al., 2022).

Deep character/byte-level transformers remove rigid tokenization and use architectural innovations to manage sequence length and compute. The result is robust, flexible, multilingual, and multimodal modeling across a diverse range of data and tasks, and ongoing work continues to improve both computational and modeling efficiency.