
Jamba-1.5: Hybrid LLM & Janus Monolayer Insight

Updated 30 June 2026
  • Jamba-1.5 is a dual-domain innovation encompassing a scalable hybrid LLM architecture and a Janus monolayer material with tailored physical properties.
  • The LLM component employs a Transformer-Mamba-MoE design with ExpertsInt8 quantization to achieve a record 256K context window and efficient memory usage.
  • The Janus monolayer CrBr1.5I1.5 exhibits robust ferromagnetism and an exceptionally high out-of-plane piezoelectric response, highlighting its potential in spintronics and multifunctional devices.

Jamba-1.5 refers to two distinct advanced scientific systems: (1) a next-generation LLM architecture, combining Transformer, Mamba state-space, and Mixture-of-Experts (MoE) techniques at high scale (Team et al., 2024); and (2) a two-dimensional Janus monolayer material, $\mathrm{CrBr_{1.5}I_{1.5}}$, exhibiting both robust ferromagnetism and an exceptionally large out-of-plane piezoelectric response (Guo et al., 2021). In both AI and 2D materials research, “Jamba-1.5” denotes high performance through hybridization, whether architectural or compositional, with properties exceeding prior systems in key respects.

1. Hybrid Transformer-Mamba-MoE Architecture (LLM)

Jamba-1.5, developed by AI21, preserves the key design of earlier Jamba models: a tightly fused hybrid of full self-attention (Transformer) layers, efficient state-space (Mamba) layers, and sparsely activated Mixture-of-Experts modules. The figures in this paragraph are those of Jamba-1.5-Large. The architecture uses one Transformer attention layer for every seven Mamba state-space layers within each Jamba “block” (8 layers per block, 9 blocks total). Crucially, every second layer replaces the conventional MLP with an MoE module of $n = 16$ experts (hidden size 8192, top-$K$ routing with $K = 2$ experts per token). Attention in each block operates with 64 query heads and 8 key/value heads. This yields a total of 72 layers, of which 9 are full-attention, so the key–value (KV) cache grows much more slowly with context length than in dense Transformers.
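The layer counts described above can be checked with a minimal sketch. The position of the attention layer inside a block (`ATTN_OFFSET`) and the parity of the MoE layers (`MOE_OFFSET`) are assumptions made for illustration; the text only fixes the ratios.

```python
from collections import Counter

N_BLOCKS = 9          # blocks in the model, per the text
LAYERS_PER_BLOCK = 8  # 1 attention + 7 Mamba layers per block
MOE_EVERY = 2         # an MoE module replaces the MLP every 2 layers
ATTN_OFFSET = 4       # ASSUMPTION: index of the attention layer inside a block
MOE_OFFSET = 1        # ASSUMPTION: which layer parity carries the MoE module


def build_layout():
    """Return a list of (mixer, feed_forward) tuples, one per layer."""
    layout = []
    for _ in range(N_BLOCKS):
        for i in range(LAYERS_PER_BLOCK):
            mixer = "attention" if i == ATTN_OFFSET else "mamba"
            ffn = "moe" if i % MOE_EVERY == MOE_OFFSET else "mlp"
            layout.append((mixer, ffn))
    return layout


layout = build_layout()
mixers = Counter(m for m, _ in layout)
ffns = Counter(f for _, f in layout)
print(len(layout), dict(mixers), dict(ffns))
```

Whatever offsets are chosen, the totals are the same: 72 layers, 9 of them attention, 63 Mamba, and 36 carrying an MoE module.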

The gating mechanism for each MoE block is given by

$$g_i(x) = \frac{\exp(w_i^\top x)}{\sum_{j=1}^{n} \exp(w_j^\top x)}, \quad i = 1, \ldots, n$$

for token representation $x \in \mathbb{R}^d$. The top-2 experts ranked by $g_i(x)$ are selected, with their outputs sparsely weighted, concatenated, and projected back to dimension $d$.
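A minimal pure-Python sketch of the softmax gate and top-2 selection described above. The toy sizes (4 experts, dimension 3), the matrix `W`, and the helper names are hypothetical. Whether the selected gate weights are renormalized before combining expert outputs is not stated here, so the sketch returns them unchanged.

```python
import math


def softmax_gates(x, W):
    """g_i(x) = exp(w_i . x) / sum_j exp(w_j . x) for each row w_i of W."""
    logits = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(x, W, k=2):
    """Return [(expert_index, gate_weight), ...] for the k largest gates."""
    gates = softmax_gates(x, W)
    top = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
    return [(i, gates[i]) for i in top]


# Hypothetical toy example: n = 4 experts, d = 3 (the model uses n = 16).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
x = [0.5, 2.0, -1.0]
selected = route_top_k(x, W, k=2)
print(selected)
```

Here the logits are 0.5, 2.0, -1.0 and 2.5, so experts 3 and 1 are selected, in that order.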

Active parameter count (parameters used for a single-token path) and total parameters (including all experts) are summarized below:

| Model | Total Params | Active Params | KV Cache @ 256K |
|---|---|---|---|
| Jamba-1.5-Mini | 52 B | 12 B | 4 GB |
| Jamba-1.5-Large | 398 B | 94 B | 9 GB |

For context, LLaMA-3.1-70B requires 80 GB and Mistral-Large-2 requires 88 GB for the KV cache at equivalent context length.
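The KV cache figures can be reproduced with a back-of-the-envelope calculation: two tensors (keys and values) per attention layer, each of size KV heads × head dimension × sequence length, stored at 2 bytes per value. This is a sketch under stated assumptions: the head dimension of 128 for Jamba-1.5-Large is inferred as 8192 / 64 rather than taken from the text, and the function name is hypothetical.

```python
def kv_cache_gib(n_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the KV cache for one sequence, in GiB.

    The leading factor 2 accounts for storing both keys and values.
    """
    n_bytes = 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_value
    return n_bytes / 2**30


SEQ_LEN = 256 * 1024  # 256K tokens

# Jamba-1.5-Large: 9 attention layers, 8 KV heads (from the text);
# head_dim = 8192 / 64 = 128 is an ASSUMPTION.
jamba_large = kv_cache_gib(9, 8, 8192 // 64, SEQ_LEN)

# LLaMA-3.1-70B: 80 attention layers, 8 KV heads, head_dim 128 (public config).
llama_70b = kv_cache_gib(80, 8, 128, SEQ_LEN)

print(f"Jamba-1.5-Large: {jamba_large:.1f} GiB, LLaMA-3.1-70B: {llama_70b:.1f} GiB")
```

Under these assumptions the sketch gives 9.0 GiB and 80.0 GiB, in line with the quoted figures. Setting `n_attn_layers` to 72 with the Jamba shape would give 72 GiB, the factor of 8 between a dense stack and one attention layer per block.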

2. Scaling to 256K Effective Context with Efficient Memory Usage

Jamba-1.5 achieves an open-weight record of 256K effective context window by leveraging three factors: (1) sparsity of attention (only 1 in 8 layers uses full attention), (2) fixed-state Mamba layers that do not require cumulative KV caches, and (3) sequence-parallel serving and paged attention orchestration through vLLM. The architecture does not rely on explicit positional embeddings such as RoPE; positional information is carried implicitly by the Mamba layers.

The combined design results in KV cache growth that is 8× slower than in pure Transformer models. End-to-end batch throughput degrades as $\mathcal{O}(L)$ with context length $L$ only in the rare attention layers, rather than in all layers, yielding more robust performance at extreme context lengths.

3. ExpertsInt8 Quantization and Resource Efficiency

A key deployment innovation is ExpertsInt8 quantization: over 85% of model parameters reside in MoE layers (more than 90% when MLP layers are included). These weights are quantized to INT8 at load time, each expert with its own scale $s$:

$$Q(w) = \mathrm{round}\left(\frac{w}{s}\right), \qquad w \approx s\,Q(w)$$

During inference, these INT8 values are dequantized to BF16 inside the fused_moe vLLM kernel, with the dequantized data staying in on-chip SRAM, which lowers latency. This method incurs negligible precision loss, requires no calibration, and often reduces latency because less data is transferred from high-bandwidth memory. Storing MoE/MLP weights in INT8 rather than BF16 halves their memory footprint, which supports inference of Jamba-1.5-Large at 256K context on 8×80GB GPUs; latency matches FP8 on H100 GPUs and is lower than GPTQ on A100.
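The quantize/dequantize round trip can be illustrated with a small pure-Python sketch. It is not the fused_moe kernel: the choice of scale (largest absolute weight divided by 127), the function names, and the sample weights are all assumptions made for illustration.

```python
def quantize_int8(weights):
    """Symmetric INT8 quantization of one expert's weights with a single scale.

    ASSUMPTION: the scale is max|w| / 127; the text only says each expert
    has its own scale s.
    """
    max_abs = max(abs(w) for w in weights)
    s = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / s))) for w in weights]
    return q, s


def dequantize(q, s):
    """Recover approximate weights: w ~= s * Q(w)."""
    return [s * qi for qi in q]


# Hypothetical weights for one expert.
expert_weights = [0.02, -0.5, 0.31, 0.127, -0.004]
q, s = quantize_int8(expert_weights)
restored = dequantize(q, s)
max_err = max(abs(w - r) for w, r in zip(expert_weights, restored))
print(q, s, max_err)
```

With this scale, the reconstruction error of each weight is bounded by half a quantization step, `s / 2`, which is one way to see why no calibration data is required.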

4. Instruction Tuning, Fine-Tuning, and Training Regimen

Pre-training uses a mixture of multilingual web text, code, books, and scientific literature, followed by mid-training emphasizing long documents. Supervised fine-tuning (“post-training”) focuses on curated high-quality conversational data, skill-specific data (e.g., structured QA and function calling), and long-context data including synthetic “needle-in-a-haystack” tasks. Most fine-tuning samples are synthesized via LLM prompting and automatically validated. The training objective augments cross-entropy loss with a minor “Activation Loss”:

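$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \alpha \cdot \operatorname{mean}\left(a^2\right)$$

where $a$ denotes the forward-pass activations and $\alpha$ is a small coefficient. The extra term penalizes extreme activations, ensuring numerical safety in FP16. No reinforcement learning (PPO or DPO) is applied; data synthesis and filtering suffice for model quality.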
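A minimal sketch of such an auxiliary penalty, assuming it takes the form of a small coefficient times the mean square of the activations. The functional form, the default `alpha`, and the sample values are illustrative assumptions. For context, the largest finite FP16 value is 65504, so very large activations overflow in that format.

```python
def activation_loss(activations, alpha=1e-5):
    """alpha * mean(a^2) over a flat list of activation values.

    ASSUMPTION: both the mean-square form and the default alpha are
    illustrative choices, not values taken from the text.
    """
    return alpha * sum(a * a for a in activations) / len(activations)


def total_loss(cross_entropy, activations, alpha=1e-5):
    """Cross-entropy plus the auxiliary activation penalty."""
    return cross_entropy + activation_loss(activations, alpha)


# Hypothetical activations: the second set contains one extreme outlier.
normal = [0.5, -1.2, 0.8, 2.0]
spiky = [0.5, -1.2, 0.8, 2000.0]
print(total_loss(2.3, normal), total_loss(2.3, spiky))
```

The penalty is negligible for ordinary activations and grows quadratically with outliers, so it steers training away from extreme values without noticeably changing the main objective.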

5. Benchmarking, Quantitative Results, and Model Release

Jamba-1.5 benchmarks competitively on broad academic and chatbot evaluations. On MMLU, BBH, ARC-C, GSM8K, HumanEval, and TruthfulQA, Jamba-1.5-Large (80.0% 5-shot MMLU) is close to LLaMA-3.1-70B (83.6%) and Mistral-Large-2 (82.5%). In chatbot evaluations, Jamba-1.5-Large scores 65.4% on Arena-Hard (GPT-4-Turbo judgments) and 48.5% on WildBench, versus 55.7% and 49.8% for LLaMA-3.1-70B. On long-context benchmarks, Jamba-1.5-Large uniquely preserves near-100% retrieval/aggregation performance up to 256K (RULER average: 95.7%), outperforming other open-weight models on ∞Bench for both multiple-choice (80.4%) and QA (34.9%) at 100K tokens. On multilingual MMLU, Jamba-1.5-Mini reaches 64.3% (vs 56.8% for LLaMA-3.1-8B) and Jamba-1.5-Large 73.9% (vs 77.8% for LLaMA-3.1-70B).

Model weights for both versions are available under the Jamba Open Model License on Hugging Face, with full ExpertsInt8 source code released. Jamba-1.5-Large operates efficiently on 8×80GB GPUs using FSDP and parallel serving, with throughputs up to 160 tokens/sec (256K context); Jamba-1.5-Mini is optimized for 2×80GB setups.
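A rough weight-memory estimate shows why INT8 experts matter for the 8×80GB configuration. This is a sketch that counts weights only and ignores the KV cache, Mamba state, and activations. It uses 85% as a lower bound for the quantized fraction, since the text says "over 85%", and the function name is hypothetical.

```python
TOTAL_PARAMS = 398e9       # Jamba-1.5-Large total parameters (from the text)
QUANTIZED_FRACTION = 0.85  # "over 85%" of parameters sit in MoE layers; lower bound
GPU_MEMORY_GB = 8 * 80     # one 8x80GB node


def weight_footprint_gb(total_params, int8_fraction):
    """Weight memory in GB with int8_fraction of params at 1 byte, the rest at 2."""
    int8_bytes = total_params * int8_fraction * 1
    bf16_bytes = total_params * (1 - int8_fraction) * 2
    return (int8_bytes + bf16_bytes) / 1e9


bf16_only = weight_footprint_gb(TOTAL_PARAMS, 0.0)
experts_int8 = weight_footprint_gb(TOTAL_PARAMS, QUANTIZED_FRACTION)
print(f"BF16: {bf16_only:.0f} GB, ExpertsInt8: {experts_int8:.0f} GB, node: {GPU_MEMORY_GB} GB")
```

Under these assumptions the BF16 weights alone (about 796 GB) exceed the 640 GB of a single node, while the mixed INT8/BF16 layout (about 458 GB) leaves room for the KV cache and runtime buffers.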

6. Janus Monolayer $\mathrm{CrBr_{1.5}I_{1.5}}$: Structure, Magnetism, and Piezoelectricity

In two-dimensional materials science, Jamba-1.5 indicates a Janus monolayer $\mathrm{CrBr_{1.5}I_{1.5}}$, a derivative of $\mathrm{CrI_3}$ in which one side’s I atoms are replaced by Br, producing an asymmetric (Br–Cr–I) trilayer (Guo et al., 2021). The material exhibits point group 3m symmetry, a lattice constant of […] Å, and a substantial built-in out-of-plane dipole.

Stability is evidenced by (1) phonon spectra with no imaginary frequencies, (2) mechanical properties ([…] N/m, […] N/m, […] N/m, […] N/m, and […]) satisfying the Born criteria, and (3) thermal stability in ab initio molecular dynamics at 300 K.
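For a 2D hexagonal lattice, the Born criteria and the derived moduli follow from two independent elastic constants. The sketch below uses the standard relations for this symmetry. The numeric inputs are hypothetical placeholders, not the values reported by Guo et al., and the function name is an assumption.

```python
def check_2d_hexagonal(c11, c12):
    """Born stability and derived moduli for a 2D hexagonal lattice.

    For this symmetry C66 = (C11 - C12) / 2, and stability requires
    C11 > |C12| (which implies C11 > 0 and C66 > 0).
    """
    c66 = (c11 - c12) / 2.0
    stable = c11 > 0 and c11 > abs(c12)
    youngs = (c11**2 - c12**2) / c11   # in-plane Young's modulus, N/m
    poisson = c12 / c11                # Poisson's ratio, dimensionless
    return {"C66": c66, "stable": stable, "C2D": youngs, "nu": poisson}


# HYPOTHETICAL elastic constants in N/m, not the values from Guo et al.
result = check_2d_hexagonal(c11=40.0, c12=10.0)
print(result)
```

With these placeholder inputs the lattice is mechanically stable, with C66 = 15 N/m, a Young's modulus of 37.5 N/m, and a Poisson's ratio of 0.25.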

7. Electronic, Magnetic, and Piezoelectric Phenomena in Jamba-1.5 Monolayers

Jamba-1.5 is an indirect-gap semiconductor ([…] eV), with a half-semiconducting band structure: both valence and conduction bands are 100% spin-up polarized. The Cr ion achieves a local moment of 2.985 μB; the monolayer favors an out-of-plane easy axis (MAE 356 μeV/Cr). Ferromagnetism arises from Cr–X–Cr superexchange (90°, Goodenough–Kanamori mechanism).

Piezoelectrically, the Janus symmetry (broken inversion and horizontal mirror) allows both in-plane and out-of-plane responses. In Voigt notation, the out-of-plane piezoelectric strain coefficient is $d_{31}$ = […] pm/V in the FM ground state. This exceeds values in other 2D materials, such as MoSSe ([…] pm/V), MXenes (0.40–0.78 pm/V), and $\alpha$-In$_2$Se$_3$ (0.415 pm/V).
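For reference, a sketch of how the strain coefficients are usually obtained for a monolayer with 3m symmetry, assuming the standard relation $e_{ik} = \sum_j d_{ij} C_{jk}$ between the piezoelectric stress tensor $e$, the piezoelectric strain tensor $d$, and the elastic tensor $C$ in Voigt notation. The symbols $e_{11}$, $e_{31}$, $d_{11}$, $C_{11}$, and $C_{12}$ are introduced here for illustration:

```latex
\begin{aligned}
e_{11} &= d_{11}\,(C_{11} - C_{12}), &\qquad e_{31} &= d_{31}\,(C_{11} + C_{12}),\\
d_{11} &= \frac{e_{11}}{C_{11} - C_{12}}, &\qquad d_{31} &= \frac{e_{31}}{C_{11} + C_{12}}.
\end{aligned}
```

Under this relation, the out-of-plane response depends on the sum $C_{11} + C_{12}$, so strain that softens the lattice can raise $d_{31}$ even when $e_{31}$ changes little.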

Strain engineering permits tuning of magnetic and piezoelectric properties. Under 5% biaxial compression ([…]), a transition to AF-Néel order occurs. Compressive strain enhances […] up to 0.993 pm/V; tensile strain increases […] to 1.545 pm/V (at […]). In the AFM phase ([…]), […] remains high (0.999 pm/V).

Applications discussed include 2D spintronic sensors, magneto-piezotronic transducers, and strain-controlled FM/AFM memories. Realization depends on successful layer-selective halogen exchange and stabilization strategies (e.g., h-BN encapsulation) during fabrication.

8. Synthesis, Device Integration, and Outlook

For the $\mathrm{CrBr_{1.5}I_{1.5}}$ monolayer, synthesis challenges include maintaining magnetic order during halogen exchange and ensuring chemical stability. The strong out-of-plane piezoelectricity ([…] pm/V) facilitates vertical field gating in FETs and heterostructure integration. The outlook suggests opportunities for multifunctional devices where strain and field control both magnetism and electronic polarization, with potential for integration into novel 2D heterostructures for spin–charge conversion and memory applications.

In neural architectures, Jamba-1.5 illustrates the efficacy of hybrid and sparse model design, supporting extreme context lengths and superior throughput while remaining accessible to the research community through open-weights and source code. Quantization advances and efficient memory design lower barriers for deployment at high scale.

Both Jamba-1.5 systems, while arising from different domains, mark advances obtained by hybridization—whether by blending state-space, attention, and MoE methods in LLMs, or by compositional control and symmetry breaking in designer 2D materials.
