BitMar: Efficient Multimodal Transformer

Updated 19 October 2025
  • BitMar is a quantized multimodal transformer architecture that combines low-bit text and vision encoders, episodic memory, and sliding-window attention for efficient edge deployment.
  • It employs 1.58-bit quantization and a compact BitNet-based decoder to achieve low-latency, energy-efficient performance without severe accuracy loss.
  • The external episodic memory and per-layer memory conditioning bolster long-range contextual understanding, enhancing tasks like captioning and multimodal QA.

BitMar is a quantized multimodal transformer architecture designed for vision-language generation on resource-constrained edge devices. It integrates low-bit-weight encoders for both text and vision, efficient multimodal fusion, an external episodic memory module, and a compact BitNet-based decoder with per-layer memory conditioning and sliding-window attention. The entire design is engineered to provide efficient, low-latency multimodal understanding and generation while minimizing computational demands and energy consumption, making BitMar particularly suitable for hardware-limited or embedded settings (Aman et al., 12 Oct 2025).

1. Model Architecture and Quantization

BitMar implements a four-stage architecture:

  1. Low-bit Encoders: Both text and image modalities are encoded with extremely compressed representations.
    • Text: Four-layer BitNet Transformer, 128 hidden units, four attention heads, support for sequences up to 256 tokens.
    • Vision: DINOv2 backbone extracts 768-dimensional patch features, which are mean-pooled and projected via a 2-layer MLP (with ReLU and dropout) to compact 128-dimensional tokens (see the projection sketch below).
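
The following is a minimal PyTorch sketch of the vision projection described above. The hidden width, dropout rate, and the choice to pool before projecting are illustrative assumptions rather than details confirmed by the paper.

```python
import torch
import torch.nn as nn

class VisionProjection(nn.Module):
    """Project DINOv2 patch features (768-D) into BitMar's 128-D latent space.

    Sketch only: hidden width (256) and dropout rate (0.1) are assumptions;
    the 768 -> 128 two-layer MLP with ReLU and dropout follows the text above.
    """
    def __init__(self, in_dim: int = 768, hidden_dim: int = 256,
                 out_dim: int = 128, dropout: float = 0.1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, 768). Mean-pool over patches,
        # then project to a compact 128-D visual token used for fusion.
        pooled = patch_feats.mean(dim=1)
        return self.mlp(pooled)

# Usage: VisionProjection()(torch.randn(2, 256, 768))  # -> shape (2, 128)
```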

All attention and feed-forward weights are ternary ($w \in \{-1, 0, +1\}$), with per-layer scaling factors, providing an effective quantization of 1.58 bits per weight. Activations are quantized to 8 bits via per-token max-abs scaling. Quantization noise is regulated so that representational expressiveness is maintained.

Component      | Quantization              | Representation
Text Encoder   | 1.58-bit ternary          | 128-D tokens
Vision Encoder | 1.58-bit ternary + 8-bit  | 128-D tokens*

*Projected down from 768-D patch features.
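
A minimal sketch of the weight and activation quantizers described above, assuming the standard BitNet b1.58 recipe (mean-absolute-value scaling for ternary weights, per-token max-abs scaling for 8-bit activations); the exact clipping ranges and any straight-through training details are assumptions, not specifics from the paper.

```python
import torch

def quantize_weights_ternary(w: torch.Tensor, eps: float = 1e-5):
    """Ternarize a weight matrix to {-1, 0, +1} with a per-layer scale.

    Sketch of the BitNet b1.58-style recipe assumed here: scale by the mean
    absolute weight, then round and clip to the ternary set (~1.58 bits/weight).
    """
    scale = w.abs().mean().clamp(min=eps)             # per-layer scaling factor
    w_q = torch.clamp(torch.round(w / scale), -1, 1)
    return w_q, scale                                 # effective weight ~= w_q * scale

def quantize_activations_int8(x: torch.Tensor, eps: float = 1e-5):
    """8-bit activation quantization via per-token max-abs scaling."""
    max_abs = x.abs().amax(dim=-1, keepdim=True).clamp(min=eps)
    x_q = torch.clamp(torch.round(x * 127.0 / max_abs), -127, 127)
    return x_q, max_abs / 127.0                       # dequantize with x_q * scale

# Example: a 128x128 projection compressed to ternary values.
w_q, w_scale = quantize_weights_ternary(torch.randn(128, 128))
print(w_q.unique())                                   # tensor([-1., 0., 1.])
```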

2. Multimodal Fusion: Cross-Attention and Pooling

Following encoding, the two modalities are aligned using cross-attention, with text queries and visual keys/values yielding a fused representation $F \in \mathbb{R}^{n_t \times 128}$. Aggregation (mean pooling or learned pooling) reduces $F$ to a single vector $q_{\text{mem}} \in \mathbb{R}^{128}$ representing the episode's latent multimodal context.

Step        | Operation
Alignment   | Cross-attention fusion
Aggregation | (Learned) pooling to $q_{\text{mem}}$
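
The fusion step can be sketched with a standard multi-head cross-attention layer (text tokens as queries, visual tokens as keys/values) followed by mean pooling into $q_{\text{mem}}$. The full-precision nn.MultiheadAttention used here is an illustrative stand-in for BitMar's quantized attention.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Cross-attention fusion: text queries attend over visual keys/values.

    Sketch only: full-precision attention stands in for BitMar's low-bit
    layers; dimensions follow the shared 128-D latent space.
    """
    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_tokens: torch.Tensor, vision_tokens: torch.Tensor):
        # text_tokens: (batch, n_t, 128); vision_tokens: (batch, n_v, 128)
        fused, _ = self.cross_attn(query=text_tokens,
                                   key=vision_tokens,
                                   value=vision_tokens)   # F: (batch, n_t, 128)
        q_mem = fused.mean(dim=1)                          # mean pooling -> (batch, 128)
        return fused, q_mem

# Usage: F_fused, q_mem = CrossModalFusion()(torch.randn(2, 32, 128), torch.randn(2, 8, 128))
```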

3. Episodic Memory Module

The episodic memory is a fixed-size key–value matrix $M \in \mathbb{R}^{K \times C}$ (default $K = 512$, $C = 128$). This module enables persistent conditioning of the generative process.

  • Writing: At each step $t$, a pooled query $q_t$ and learned weights $W_w \in \mathbb{R}^K$ are used for a soft update:

$$M \leftarrow M + \alpha \, W_w \, q_t^T$$

  • Reading: Reads are performed by softmax addressing on the similarity between memory and $q_t$:

$$W_r = \mathrm{softmax}(M q_t)$$

The memory readout is $M_r = W_r^T M$, which is injected into the decoder at each step.
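
A compact sketch of the write and read rules above; treating the write weights $W_w$ and the rate $\alpha$ as fixed module attributes, and skipping gradient flow through the persistent buffer, are simplifications for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EpisodicMemory(nn.Module):
    """Fixed-size key-value memory M in R^{K x C} with soft write/read.

    Sketch of the update and addressing rules above (K=512, C=128 defaults);
    how BitMar trains W_w and schedules alpha is not specified here.
    """
    def __init__(self, num_slots: int = 512, dim: int = 128, alpha: float = 0.1):
        super().__init__()
        self.register_buffer("M", torch.zeros(num_slots, dim))   # memory matrix
        self.W_w = nn.Parameter(0.01 * torch.randn(num_slots))   # learned write weights
        self.alpha = alpha                                       # assumed write rate

    def write(self, q_t: torch.Tensor) -> None:
        # Soft additive update: M <- M + alpha * W_w q_t^T (outer product, K x C).
        # Gradient plumbing through the persistent buffer is omitted in this sketch.
        with torch.no_grad():
            self.M += self.alpha * torch.outer(self.W_w, q_t)

    def read(self, q_t: torch.Tensor) -> torch.Tensor:
        # Softmax addressing over slot similarities, then weighted readout M_r.
        W_r = F.softmax(self.M @ q_t, dim=0)   # (K,)
        return W_r @ self.M                    # M_r = W_r^T M, shape (C,)

# Usage: mem = EpisodicMemory(); mem.write(torch.randn(128)); m_r = mem.read(torch.randn(128))
```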

This external memory introduces persistent, trainable context across multiple samples and enhances long-term consistency in multimodal generation.

4. BitNet-Based Decoder with Sliding-Window Attention

The decoder is an autoregressive BitNet Transformer (four layers, 128 units, four heads). Two technical innovations are incorporated:

  • Sliding-Window Attention with Sinks: To process long/streaming sequences efficiently, the decoder maintains a fixed set of "sink" tokens (e.g., $S = 4$) for global context and a window of recent tokens ($W = 1020$), discarding the oldest. This ensures constant memory usage and allows inference with a bounded cache size (see the cache sketch below).
  • Per-Layer Memory Conditioning: At each layer, the retrieved memory vector $M_r$ is combined with token embeddings via projection or residual addition:

$$x_t \leftarrow x_t + M_r$$

This allows hierarchical integration of episodic memory, enhancing contextual relevance across the network depth.
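
A minimal sketch of the bounded cache implied by sink tokens plus a sliding window, and of residual memory conditioning; the eviction policy and class names here are illustrative, not BitMar's actual implementation.

```python
import torch

class SinkWindowCache:
    """Bounded key/value cache: S attention-sink entries plus the W most recent.

    Sketch of the policy described above: once the cache exceeds S + W
    entries, the oldest non-sink entry is dropped, keeping per-token memory
    and compute constant during streaming generation.
    """
    def __init__(self, num_sinks: int = 4, window: int = 1020):
        self.num_sinks = num_sinks
        self.window = window
        self.keys: list[torch.Tensor] = []
        self.values: list[torch.Tensor] = []

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        self.keys.append(k)
        self.values.append(v)
        if len(self.keys) > self.num_sinks + self.window:
            # Keep the sink tokens at the front; evict the oldest windowed entry.
            del self.keys[self.num_sinks]
            del self.values[self.num_sinks]

def condition_on_memory(x_t: torch.Tensor, m_r: torch.Tensor) -> torch.Tensor:
    """Per-layer memory conditioning via residual addition: x_t <- x_t + M_r."""
    return x_t + m_r

# Usage: cache = SinkWindowCache(); cache.append(torch.randn(128), torch.randn(128))
```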

5. Edge-Oriented Efficiency and Fusion Properties

  • Low-bit quantization results in a small footprint and low latency without severe accuracy penalties (a back-of-envelope footprint estimate follows this list).
  • Both modalities are projected into a 128-dimensional latent space, minimizing fusion cost.
  • Episodic memory enables the model to recall structured information over time, boosting performance on tasks involving long-range context.
  • Sliding-window attention ensures predictable, low compute/memory per token.
  • Empirical evaluation demonstrates that BitMar’s quantization (fraction of zero weights stabilizing near 42.8%) does not substantially degrade embedding quality or throughput.
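
As a back-of-envelope illustration of what 1.58-bit weights imply for on-device footprint (an estimate, not a figure reported for BitMar; scaling factors, activations, and the memory buffer are ignored):

```python
# Illustrative estimate only: weight storage for a 14M-parameter model at
# 1.58 bits/weight versus 16-bit floats.
params = 14_000_000
ternary_mb = params * 1.58 / 8 / 1e6   # ~2.8 MB
fp16_mb = params * 16 / 8 / 1e6        # ~28 MB
print(f"ternary: {ternary_mb:.1f} MB, fp16: {fp16_mb:.1f} MB, "
      f"~{fp16_mb / ternary_mb:.1f}x smaller")
```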

6. Experimental Results and Edge Suitability

Benchmarks confirm the efficacy of BitMar for edge deployment:

  • Compactness: BitMar–14M, with 14 million parameters, competes with larger low-bit models on BoolQ and WinoGrande.
  • Task Benefit from Episodic Memory: Ablations show 3–4 percentage point gains on tracking/multimodal QA tasks with memory enabled.
  • Throughput and Power: With memory enabled, throughput reaches 57.3 tokens/s and energy use drops compared to memory-disabled mode.
  • Quantization Stability: Quantization effectiveness $E_q$ is maintained throughout training, verifying controlled compression.

While performance on some knowledge-intensive benchmarks trails that of full-scale models, the design demonstrates a favorable quality–speed trade-off on captioning and understanding tasks within the latency and memory limits of edge processors.

7. Practical Implications and Future Directions

BitMar demonstrates a blueprint for highly compressed, memory-augmented, multimodal transformers deployable outside of datacenter settings. The combination of episodic memory, per-layer low-bit conditioning, and efficient long-context handling via sliding-window mechanisms provides on-device models with capabilities previously limited to full-precision, server-scale systems.

A plausible implication is that further architectural refinements (e.g., memory scaling, adaptive precision) and hardware–software co-design could extend BitMar’s approach to domains requiring robust multimodal reasoning, privacy-sensitive edge inference, and sustained generation over long contexts. Future work may also explore domain adaptation of the episodic memory mechanism for other resource-constrained generative settings.


In summary, BitMar achieves competitive multimodal generative performance on edge hardware by integrating 1.58-bit quantized encoders, cross-modal fusion, an external episodic memory, and a BitNet-based autoregressive decoder with per-layer conditioning and sliding-window attention (Aman et al., 12 Oct 2025). Experimental evidence supports its suitability for low-latency, low-power deployment in scenarios requiring efficient image–text generation and understanding.
