BitMar: Efficient Multimodal Transformer
- BitMar is a quantized multimodal transformer architecture that combines low-bit text and vision encoders, episodic memory, and sliding-window attention for efficient edge deployment.
- It employs 1.58-bit quantization and a compact BitNet-based decoder to achieve low-latency, energy-efficient performance without severe accuracy loss.
- The external episodic memory and per-layer memory conditioning bolster long-range contextual understanding, enhancing tasks like captioning and multimodal QA.
BitMar is a quantized multimodal transformer architecture designed for vision-language generation on resource-constrained edge devices. It integrates low-bit-weight encoders for both text and vision, efficient multimodal fusion, an external episodic memory module, and a compact BitNet-based decoder with per-layer memory conditioning and sliding-window attention. The design aims to provide efficient, low-latency multimodal understanding and generation while minimizing compute and energy consumption, making BitMar well suited to hardware-limited or embedded settings (Aman et al., 12 Oct 2025).
1. Model Architecture and Quantization
BitMar implements a four-stage architecture: low-bit encoding, cross-modal fusion, episodic memory, and autoregressive decoding. The encoding stage is described here; the remaining stages are covered in Sections 2–4.
- Low-bit Encoders: Both text and image modalities are encoded with extremely compressed representations.
- Text: Four-layer BitNet Transformer, 128 hidden units, four attention heads, support for sequences up to 256 tokens.
- Vision: A DINOv2 backbone extracts 768-dimensional patch features, which are mean-pooled and projected via a 2-layer MLP (with ReLU and dropout) to 128-dimensional compact tokens (a minimal sketch of this projection follows this list).
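For concreteness, here is a minimal sketch of the vision projection in PyTorch. The hidden width and dropout rate are illustrative placeholders (the paper does not fix them here), and the DINOv2 backbone is assumed to be supplied upstream.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Pools DINOv2 patch features and projects them into the shared
    128-D token space via a 2-layer MLP with ReLU and dropout."""

    def __init__(self, in_dim: int = 768, hidden_dim: int = 256,
                 out_dim: int = 128, dropout: float = 0.1):
        super().__init__()
        # hidden_dim and dropout are illustrative placeholders, not paper values.
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, 768) from the DINOv2 backbone
        pooled = patch_feats.mean(dim=1)   # mean pooling over patches -> (batch, 768)
        return self.mlp(pooled)            # compact 128-D visual token -> (batch, 128)
```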
All attention and feed-forward weights are ternary, taking values in $\{-1, 0, +1\}$ with per-layer scaling factors, which yields an effective quantization of 1.58 bits per weight. Activations are quantized to 8 bits via per-token max–abs scaling, and quantization noise is regulated to maintain representational expressiveness. A sketch of this scheme follows the table below.
| Component | Quantization | Representation |
|---|---|---|
| Text Encoder | 1.58-bit ternary | 128-D tokens |
| Vision Encoder | 1.58-bit ternary + 8-bit | 128-D tokens* |
*Projected down from 768-D patch features.
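The following sketch illustrates such a quantization scheme in PyTorch, assuming the common BitNet-b1.58 recipe (absmean per-layer weight scaling, per-token max–abs 8-bit activations, and a straight-through estimator during training). The absmean choice and the STE are assumptions; BitMar's actual kernels are not reproduced here.

```python
import torch

def quantize_weights_ternary(w: torch.Tensor):
    """Ternarize a weight tensor to {-1, 0, +1} with one per-layer scale
    (absmean scaling, as in BitNet b1.58 — assumed here)."""
    scale = w.abs().mean().clamp(min=1e-5)      # per-layer scaling factor
    w_q = (w / scale).round().clamp(-1, 1)      # ternary weights
    return w_q, scale                           # effective weight = w_q * scale

def quantize_activations_int8(x: torch.Tensor):
    """Quantize activations to 8 bits with per-token max-abs scaling."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / 127.0
    x_q = (x / scale).round().clamp(-127, 127)  # int8 range
    return x_q, scale

def ste_ternary(w: torch.Tensor) -> torch.Tensor:
    """Straight-through estimator: the forward pass uses quantized weights,
    the backward pass sees full-precision weights (a common BitNet training
    trick, shown as an assumption rather than a quote of the BitMar code)."""
    w_q, scale = quantize_weights_ternary(w)
    return w + (w_q * scale - w).detach()
```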
2. Multimodal Fusion: Cross-Attention and Pooling
Following encoding, the two modalities are aligned using cross-attention, with text queries attending to visual keys/values to yield a fused representation $F$. Aggregation (mean pooling or learned pooling) then reduces $F$ to a single vector $z_t$ representing the episode's latent multimodal context (a sketch follows the table below).
| Step | Operation |
|---|---|
| Alignment | Cross-attention fusion (text queries, visual keys/values) $\to F$ |
| Aggregation | Mean or learned pooling of $F \to z_t$ |
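A minimal PyTorch sketch of this fusion step, using nn.MultiheadAttention as a stand-in for the quantized BitNet attention block and mean pooling for aggregation:

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Cross-attention with text queries over visual keys/values, followed
    by mean pooling into a single episode-level context vector z_t."""

    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_tokens: torch.Tensor, vision_tokens: torch.Tensor):
        # text_tokens:   (batch, seq_len, 128) from the BitNet text encoder
        # vision_tokens: (batch, n_vis, 128); n_vis may be 1 if features were pooled
        fused, _ = self.cross_attn(query=text_tokens,
                                   key=vision_tokens,
                                   value=vision_tokens)
        z_t = fused.mean(dim=1)            # pooled multimodal context, (batch, 128)
        return fused, z_t
```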
3. Episodic Memory Module
The episodic memory is a fixed-size key–value store, a matrix $M \in \mathbb{R}^{N \times d}$ holding $N$ slots of dimension $d$. This module enables persistent conditioning of the generative process.
- Writing: At each step $t$, the pooled query $z_t$ and learned write weights softly update the memory slots; the sketch below shows one illustrative form of such an update.
- Reading: Reads use softmax addressing over the similarity between the memory slots and the query, $a_t = \operatorname{softmax}(M z_t)$.
The memory readout is the weighted sum $m_t = M^{\top} a_t$, which is injected into the decoder at each step.
This external memory introduces persistent, trainable context across multiple samples and enhances long-term consistency in multimodal generation.
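A minimal sketch of such a memory module in PyTorch. The softmax read follows the description above; the convex-blend write rule, slot count, dimension, and write rate are illustrative assumptions, since the paper's exact update rule and default sizes are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EpisodicMemory(nn.Module):
    """Fixed-size slot memory with softmax addressing.

    Slot count, dimension, and write rate are placeholders; the write rule
    below is one illustrative realization of a "soft update driven by the
    pooled query and learned weights", not the paper's exact formula."""

    def __init__(self, num_slots: int = 32, dim: int = 128, write_rate: float = 0.1):
        super().__init__()
        self.memory = nn.Parameter(torch.zeros(num_slots, dim))
        self.write_rate = write_rate

    def address(self, z: torch.Tensor) -> torch.Tensor:
        # Softmax over the similarity between the query and every memory slot.
        # z: (batch, dim) pooled multimodal context from the fusion stage
        return F.softmax(z @ self.memory.t(), dim=-1)     # (batch, num_slots)

    def read(self, z: torch.Tensor) -> torch.Tensor:
        a = self.address(z)
        return a @ self.memory                            # readout m_t, (batch, dim)

    @torch.no_grad()
    def write(self, z: torch.Tensor) -> None:
        a = self.address(z)                               # soft write addresses
        update = a.t() @ z / z.shape[0]                   # (num_slots, dim)
        # Convex blend of old slot contents with the newly addressed content.
        self.memory.mul_(1.0 - self.write_rate).add_(self.write_rate * update)
```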
4. BitNet-Based Decoder with Sliding-Window Attention
The decoder is an autoregressive BitNet Transformer (four layers, 128 units, four heads). Two technical innovations are incorporated:
- Sliding-Window Attention with Sinks: To process long or streaming sequences efficiently, the decoder retains a small fixed set of "sink" tokens for global context plus a window of the most recent tokens, discarding the oldest. This keeps the cache size bounded and memory usage constant during inference.
- Per-Layer Memory Conditioning: At each decoder layer $\ell$, the retrieved memory vector $m_t$ is combined with the token states via a learned projection or residual addition, e.g. $h^{(\ell)} \leftarrow h^{(\ell)} + W_m^{(\ell)} m_t$ (see the sketch after this list).
This allows hierarchical integration of episodic memory, enhancing contextual relevance across the network depth.
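The sketch below illustrates both mechanisms in PyTorch: a KV cache that retains sink tokens plus a sliding window of recent tokens, and a decoder layer that adds a projected memory readout residually before self-attention. Sink/window sizes, layer widths, and the exact conditioning form are assumptions; layer norms and the cache-to-attention plumbing are omitted for brevity.

```python
import torch
import torch.nn as nn

class SlidingWindowKVCache:
    """KV cache that keeps a few attention-sink tokens plus a sliding window
    of recent tokens, so cache size and per-token compute stay bounded."""

    def __init__(self, num_sinks: int = 4, window: int = 128):
        # Sink/window sizes are placeholders, not the paper's settings.
        self.num_sinks, self.window = num_sinks, window
        self.k = None
        self.v = None

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor):
        # k_new, v_new: (batch, heads, new_tokens, head_dim)
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=2)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=2)
        if self.k.shape[2] > self.num_sinks + self.window:
            start_recent = self.k.shape[2] - self.window    # evict the oldest
            self.k = torch.cat([self.k[:, :, :self.num_sinks],
                                self.k[:, :, start_recent:]], dim=2)
            self.v = torch.cat([self.v[:, :, :self.num_sinks],
                                self.v[:, :, start_recent:]], dim=2)
        return self.k, self.v

class MemoryConditionedDecoderLayer(nn.Module):
    """Decoder layer with per-layer memory conditioning: the readout m_t is
    projected and added residually to the token states before attention."""

    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.mem_proj = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, h: torch.Tensor, m_t: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, dim) token states; m_t: (batch, dim) memory readout
        h = h + self.mem_proj(m_t).unsqueeze(1)             # residual conditioning
        causal = torch.triu(torch.ones(h.shape[1], h.shape[1],
                                       dtype=torch.bool, device=h.device), 1)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        h = h + attn_out
        return h + self.ffn(h)
```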
5. Edge-Oriented Efficiency and Fusion Properties
- Low-bit quantization results in a small footprint and low latency without severe accuracy penalties.
- Both modalities are projected into a 128-dimensional latent space, minimizing fusion cost.
- Episodic memory enables the model to recall structured information over time, boosting performance on tasks involving long-range context.
- Sliding-window attention ensures predictable, low compute/memory per token.
- Empirical evaluation demonstrates that BitMar’s quantization (fraction of zero weights stabilizing near 42.8%) does not substantially degrade embedding quality or throughput.
6. Experimental Results and Edge Suitability
Benchmarks confirm the efficacy of BitMar for edge deployment:
- Compactness: BitMar–14M, with 14 million parameters, competes with larger low-bit models on BoolQ and WinoGrande.
- Task Benefit from Episodic Memory: Ablations show 3–4 percentage point gains on tracking/multimodal QA tasks with memory enabled.
- Throughput and Power: With memory enabled, throughput reaches 57.3 tokens/s and energy use drops compared to memory-disabled mode.
- Quantization Stability: Quantization effectiveness is maintained throughout training, verifying controlled compression.
While performance on some knowledge-heavy benchmarks trails full-scale models, the design demonstrates a favorable quality–speed trade-off on captioning and understanding tasks within the latency and memory limits of edge processors.
7. Practical Implications and Future Directions
BitMar demonstrates a blueprint for highly compressed, memory-augmented, multimodal transformers deployable outside of datacenter settings. The combination of episodic memory, per-layer low-bit conditioning, and efficient long-context handling via sliding-window mechanisms provides on-device models with capabilities previously limited to full-precision, server-scale systems.
A plausible implication is that further architectural refinements (e.g., memory scaling, adaptive precision) and hardware–software co-design could extend BitMar’s approach to domains requiring robust multimodal reasoning, privacy-sensitive edge inference, and sustained generation over long contexts. Future work may also explore domain adaptation of the episodic memory mechanism for other resource-constrained generative settings.
In summary, BitMar achieves competitive multimodal generative performance on edge hardware by integrating 1.58-bit quantized encoders, cross-modal fusion, an external episodic memory, and a BitNet-based autoregressive decoder with per-layer conditioning and sliding-window attention (Aman et al., 12 Oct 2025). Experimental evidence supports its suitability for low-latency, low-power deployment in scenarios requiring efficient image–text generation and understanding.