BitMar: Efficient Multimodal Transformer
- BitMar is a quantized multimodal transformer architecture that combines low-bit text and vision encoders, episodic memory, and sliding-window attention for efficient edge deployment.
- It employs 1.58-bit quantization and a compact BitNet-based decoder to achieve low-latency, energy-efficient performance without severe accuracy loss.
- The external episodic memory and per-layer memory conditioning bolster long-range contextual understanding, enhancing tasks like captioning and multimodal QA.
BitMar is a quantized multimodal transformer architecture designed for vision-language generation on resource-constrained edge devices. It integrates low-bit-weight encoders for both text and vision, efficient multimodal fusion, an external episodic memory module, and a compact BitNet-based decoder with per-layer memory conditioning and sliding-window attention. The design aims to provide efficient, low-latency multimodal understanding and generation while minimizing compute and energy consumption, making BitMar well suited to hardware-limited or embedded settings (Aman et al., 12 Oct 2025).
1. Model Architecture and Quantization
BitMar implements a four-stage architecture: low-bit encoding, cross-modal fusion, episodic memory, and autoregressive decoding. The encoding stage is described here; the remaining stages are covered in Sections 2–4.
- Low-bit Encoders: Both text and image modalities are encoded with extremely compressed representations.
- Text: Four-layer BitNet Transformer, 128 hidden units, four attention heads, support for sequences up to 256 tokens.
- Vision: A DINOv2 backbone extracts 768-dimensional patch features, which are mean-pooled and projected via a 2-layer MLP (with ReLU and dropout) to 128-dimensional compact tokens (a minimal sketch of this projection follows this list).
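For concreteness, here is a minimal sketch of the vision projection in PyTorch. The hidden width and dropout rate are illustrative placeholders (the paper does not fix them here), and the DINOv2 backbone is assumed to be supplied upstream.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Pools DINOv2 patch features and projects them into the shared
    128-D token space via a 2-layer MLP with ReLU and dropout."""

    def __init__(self, in_dim: int = 768, hidden_dim: int = 256,
                 out_dim: int = 128, dropout: float = 0.1):
        super().__init__()
        # hidden_dim and dropout are illustrative placeholders, not paper values.
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, 768) from the DINOv2 backbone
        pooled = patch_feats.mean(dim=1)   # mean pooling over patches -> (batch, 768)
        return self.mlp(pooled)            # compact 128-D visual token -> (batch, 128)
```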
All attention and feed-forward weights are ternary, taking values in $\{-1, 0, +1\}$ with per-layer scaling factors, which yields an effective quantization of 1.58 bits per weight. Activations are quantized to 8 bits via per-token max–abs scaling, and quantization noise is regulated to maintain representational expressiveness. A sketch of this scheme follows the table below.
| Component | Quantization | Representation |
|---|---|---|
| Text Encoder | 1.58-bit ternary | 128-D tokens |
| Vision Encoder | 1.58-bit ternary + 8-bit | 128-D tokens* |
*Projected down from 768-D patch features.
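The following sketch illustrates such a quantization scheme in PyTorch, assuming the common BitNet-b1.58 recipe (absmean per-layer weight scaling, per-token max–abs 8-bit activations, and a straight-through estimator during training). The absmean choice and the STE are assumptions; BitMar's actual kernels are not reproduced here.

```python
import torch

def quantize_weights_ternary(w: torch.Tensor):
    """Ternarize a weight tensor to {-1, 0, +1} with one per-layer scale
    (absmean scaling, as in BitNet b1.58 — assumed here)."""
    scale = w.abs().mean().clamp(min=1e-5)      # per-layer scaling factor
    w_q = (w / scale).round().clamp(-1, 1)      # ternary weights
    return w_q, scale                           # effective weight = w_q * scale

def quantize_activations_int8(x: torch.Tensor):
    """Quantize activations to 8 bits with per-token max-abs scaling."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / 127.0
    x_q = (x / scale).round().clamp(-127, 127)  # int8 range
    return x_q, scale

def ste_ternary(w: torch.Tensor) -> torch.Tensor:
    """Straight-through estimator: the forward pass uses quantized weights,
    the backward pass sees full-precision weights (a common BitNet training
    trick, shown as an assumption rather than a quote of the BitMar code)."""
    w_q, scale = quantize_weights_ternary(w)
    return w + (w_q * scale - w).detach()
```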
2. Multimodal Fusion: Cross-Attention and Pooling
Following encoding, the two modalities are aligned using cross-attention, with text queries attending to visual keys/values to yield a fused representation $F$. Aggregation (mean pooling or learned pooling) then reduces $F$ to a single vector $z_t$ representing the episode's latent multimodal context (a sketch follows the table below).
| Step | Operation |
|---|---|
| Alignment | Cross-attention fusion (text queries, visual keys/values) $\to F$ |
| Aggregation | Mean or learned pooling of $F \to z_t$ |
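A minimal PyTorch sketch of this fusion step, using nn.MultiheadAttention as a stand-in for the quantized BitNet attention block and mean pooling for aggregation:

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Cross-attention with text queries over visual keys/values, followed
    by mean pooling into a single episode-level context vector z_t."""

    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_tokens: torch.Tensor, vision_tokens: torch.Tensor):
        # text_tokens:   (batch, seq_len, 128) from the BitNet text encoder
        # vision_tokens: (batch, n_vis, 128); n_vis may be 1 if features were pooled
        fused, _ = self.cross_attn(query=text_tokens,
                                   key=vision_tokens,
                                   value=vision_tokens)
        z_t = fused.mean(dim=1)            # pooled multimodal context, (batch, 128)
        return fused, z_t
```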
3. Episodic Memory Module
The episodic memory is a fixed-size key–value store, a matrix $M \in \mathbb{R}^{N \times d}$ holding $N$ slots of dimension $d$. This module enables persistent conditioning of the generative process.
- Writing: At each step $t$, the pooled query $z_t$ and learned write weights softly update the memory slots; the sketch below shows one illustrative form of such an update.
- Reading: Reads use softmax addressing over the similarity between the memory slots and the query, $a_t = \operatorname{softmax}(M z_t)$.
The memory readout is the weighted sum $m_t = M^{\top} a_t$, which is injected into the decoder at each step.
This external memory introduces persistent, trainable context across multiple samples and enhances long-term consistency in multimodal generation.
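A minimal sketch of such a memory module in PyTorch. The softmax read follows the description above; the convex-blend write rule, slot count, dimension, and write rate are illustrative assumptions, since the paper's exact update rule and default sizes are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EpisodicMemory(nn.Module):
    """Fixed-size slot memory with softmax addressing.

    Slot count, dimension, and write rate are placeholders; the write rule
    below is one illustrative realization of a "soft update driven by the
    pooled query and learned weights", not the paper's exact formula."""

    def __init__(self, num_slots: int = 32, dim: int = 128, write_rate: float = 0.1):
        super().__init__()
        self.memory = nn.Parameter(torch.zeros(num_slots, dim))
        self.write_rate = write_rate

    def address(self, z: torch.Tensor) -> torch.Tensor:
        # Softmax over the similarity between the query and every memory slot.
        # z: (batch, dim) pooled multimodal context from the fusion stage
        return F.softmax(z @ self.memory.t(), dim=-1)     # (batch, num_slots)

    def read(self, z: torch.Tensor) -> torch.Tensor:
        a = self.address(z)
        return a @ self.memory                            # readout m_t, (batch, dim)

    @torch.no_grad()
    def write(self, z: torch.Tensor) -> None:
        a = self.address(z)                               # soft write addresses
        update = a.t() @ z / z.shape[0]                   # (num_slots, dim)
        # Convex blend of old slot contents with the newly addressed content.
        self.memory.mul_(1.0 - self.write_rate).add_(self.write_rate * update)
```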
4. BitNet-Based Decoder with Sliding-Window Attention
The decoder is an autoregressive BitNet Transformer (four layers, 128 units, four heads). Two technical innovations are incorporated:
- Sliding-Window Attention with Sinks: To process long or streaming sequences efficiently, the decoder retains a small fixed set of "sink" tokens for global context plus a window of the most recent tokens, discarding the oldest. This keeps the cache size bounded and memory usage constant during inference.
- Per-Layer Memory Conditioning: At each decoder layer $\ell$, the retrieved memory vector $m_t$ is combined with the token states via a learned projection or residual addition, e.g. $h^{(\ell)} \leftarrow h^{(\ell)} + W_m^{(\ell)} m_t$ (see the sketch after this list).
This allows hierarchical integration of episodic memory, enhancing contextual relevance across the network depth.
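The sketch below illustrates both mechanisms in PyTorch: a KV cache that retains sink tokens plus a sliding window of recent tokens, and a decoder layer that adds a projected memory readout residually before self-attention. Sink/window sizes, layer widths, and the exact conditioning form are assumptions; layer norms and the cache-to-attention plumbing are omitted for brevity.

```python
import torch
import torch.nn as nn

class SlidingWindowKVCache:
    """KV cache that keeps a few attention-sink tokens plus a sliding window
    of recent tokens, so cache size and per-token compute stay bounded."""

    def __init__(self, num_sinks: int = 4, window: int = 128):
        # Sink/window sizes are placeholders, not the paper's settings.
        self.num_sinks, self.window = num_sinks, window
        self.k = None
        self.v = None

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor):
        # k_new, v_new: (batch, heads, new_tokens, head_dim)
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=2)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=2)
        if self.k.shape[2] > self.num_sinks + self.window:
            start_recent = self.k.shape[2] - self.window    # evict the oldest
            self.k = torch.cat([self.k[:, :, :self.num_sinks],
                                self.k[:, :, start_recent:]], dim=2)
            self.v = torch.cat([self.v[:, :, :self.num_sinks],
                                self.v[:, :, start_recent:]], dim=2)
        return self.k, self.v

class MemoryConditionedDecoderLayer(nn.Module):
    """Decoder layer with per-layer memory conditioning: the readout m_t is
    projected and added residually to the token states before attention."""

    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.mem_proj = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, h: torch.Tensor, m_t: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, dim) token states; m_t: (batch, dim) memory readout
        h = h + self.mem_proj(m_t).unsqueeze(1)             # residual conditioning
        causal = torch.triu(torch.ones(h.shape[1], h.shape[1],
                                       dtype=torch.bool, device=h.device), 1)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        h = h + attn_out
        return h + self.ffn(h)
```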
5. Edge-Oriented Efficiency and Fusion Properties
- Low-bit quantization results in a small footprint and low latency without severe accuracy penalties.
- Both modalities are projected into a 128-dimensional latent space, minimizing fusion cost.
- Episodic memory enables the model to recall structured information over time, boosting performance on tasks involving long-range context.
- Sliding-window attention ensures predictable, low compute/memory per token.
- Empirical evaluation demonstrates that BitMar’s quantization (fraction of zero weights stabilizing near 42.8%) does not substantially degrade embedding quality or throughput.
6. Experimental Results and Edge Suitability
Benchmarks confirm the efficacy of BitMar for edge deployment:
- Compactness: BitMar–14M, with 14 million parameters, competes with larger low-bit models on BoolQ and WinoGrande.
- Task Benefit from Episodic Memory: Ablations show 3–4 percentage point gains on tracking/multimodal QA tasks with memory enabled.
- Throughput and Power: With memory enabled, throughput reaches 57.3 tokens/s and energy use drops compared to memory-disabled mode.
- Quantization Stability: Quantization effectiveness is maintained throughout training, verifying controlled compression.
While performance on some knowledge-heavy benchmarks trails full-scale models, the design demonstrates a favorable quality–speed trade-off on captioning and understanding tasks within the latency and memory limits of edge processors.
7. Practical Implications and Future Directions
BitMar demonstrates a blueprint for highly compressed, memory-augmented, multimodal transformers deployable outside of datacenter settings. The combination of episodic memory, per-layer low-bit conditioning, and efficient long-context handling via sliding-window mechanisms provides on-device models with capabilities previously limited to full-precision, server-scale systems.
A plausible implication is that further architectural refinements (e.g., memory scaling, adaptive precision) and hardware–software co-design could extend BitMar’s approach to domains requiring robust multimodal reasoning, privacy-sensitive edge inference, and sustained generation over long contexts. Future work may also explore domain adaptation of the episodic memory mechanism for other resource-constrained generative settings.
In summary, BitMar achieves competitive multimodal generative performance on edge hardware by integrating 1.58-bit quantized encoders, cross-modal fusion, an external episodic memory, and a BitNet-based autoregressive decoder with per-layer conditioning and sliding-window attention (Aman et al., 12 Oct 2025). Experimental evidence supports its suitability for low-latency, low-power deployment in scenarios requiring efficient image–text generation and understanding.