Jamba-1.5: Hybrid LLM & Janus Monolayer Insight
- Jamba-1.5 is a dual-domain innovation encompassing a scalable hybrid LLM architecture and a Janus monolayer material with tailored physical properties.
- The LLM component employs a Transformer-Mamba-MoE design with ExpertsInt8 quantization to achieve a record 256K context window and efficient memory usage.
- The Janus monolayer CrBr1.5I1.5 exhibits robust ferromagnetism and an exceptionally high out-of-plane piezoelectric response, highlighting its potential in spintronics and multifunctional devices.
Jamba-1.5 refers to two distinct advanced scientific systems: (1) a next-generation LLM architecture combining Transformer, Mamba state-space, and Mixture-of-Experts (MoE) techniques at high scale (Team et al., 2024); and (2) a two-dimensional Janus monolayer material, CrBr1.5I1.5, exhibiting both robust ferromagnetism and an exceptionally large out-of-plane piezoelectric response (Guo et al., 2021). In both AI and 2D materials research, “Jamba-1.5” denotes high performance through hybridization, either architectural or compositional, with properties exceeding prior systems in key respects.
1. Hybrid Transformer-Mamba-MoE Architecture (LLM)
Jamba-1.5, developed by AI21, preserves the key design of earlier Jamba models: a tightly fused hybrid of full self-attention (Transformer) layers, efficient state-space (Mamba) layers, and sparsely activated Mixture-of-Experts modules. Each Jamba “block” contains one Transformer attention layer for every seven Mamba state-space layers (8 layers per block, 9 blocks total). Crucially, every second layer replaces the conventional MLP with an MoE module of 16 experts (hidden size 8192, top-2 routing per token). Attention in each block operates with 64 query heads and 8 key/value heads. This yields a total of 72 layers, of which 9 are full-attention, so the key–value (KV) cache grows much more slowly with context length than in dense Transformers.
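The layer counts quoted above can be checked with a small sketch that enumerates the stack. The block and layer counts come from the text; the position of the attention layer inside a block (`ATTN_INDEX`) and the choice of odd-indexed layers for MoE are assumptions made only for illustration.

```python
from collections import Counter

N_BLOCKS = 9          # blocks in the stack (from the text)
LAYERS_PER_BLOCK = 8  # 1 attention + 7 Mamba layers per block (from the text)
ATTN_INDEX = 4        # assumption: where the attention layer sits in a block
MOE_PERIOD = 2        # MoE replaces the MLP on every second layer (from the text)


def build_layout():
    """Return one (mixer, feed_forward) tuple per layer of the stack."""
    layout = []
    for _ in range(N_BLOCKS):
        for i in range(LAYERS_PER_BLOCK):
            mixer = "attention" if i == ATTN_INDEX else "mamba"
            # assumption: the MoE layers are the odd-indexed ones in each block
            ffn = "moe" if i % MOE_PERIOD == 1 else "mlp"
            layout.append((mixer, ffn))
    return layout


layout = build_layout()
mixers = Counter(m for m, _ in layout)
ffns = Counter(f for _, f in layout)
print(len(layout), mixers["attention"], mixers["mamba"], ffns["moe"])  # 72 9 63 36
```

Only the 9 attention layers contribute to the KV cache, which is where the memory saving relative to a 72-layer dense Transformer comes from.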
The gating mechanism for each MoE block is given by

$$g(x) = \mathrm{softmax}(W_g x)$$

for token representation $x \in \mathbb{R}^{d}$ and router weights $W_g$. The top-2 experts ranked by $g(x)$ are selected, with their outputs sparsely weighted, concatenated, and projected back to dimension $d$.
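As a rough illustration of top-2 routing, the sketch below scores the experts with a softmax over router logits and keeps the two highest-scoring ones. The function names, the toy router matrix, and the renormalization of the two selected weights are assumptions for this example, not details confirmed by the source.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(x, w_gate):
    """Pick the two best experts for token x.

    x: list of d floats; w_gate: one row of d floats per expert.
    Returns [(expert_index, weight), ...] with the weights summing to 1.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in w_gate]
    probs = softmax(logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)  # assumption: renormalize over the top 2
    return [(i, probs[i] / norm) for i in top2]


# Toy example: 4 hypothetical experts, 2-dimensional token.
x = [1.0, 0.0]
w_gate = [[0.0, 1.0], [2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
routes = top2_route(x, w_gate)
print(routes)
```

Only the selected experts run for a given token, which is why the active parameter count is much smaller than the total.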
Active parameter count (parameters used for a single-token path) and total parameters (including all experts) are summarized below:
| Model | Total Params | Active Params | KV Cache @ 256K |
|---|---|---|---|
| Jamba-1.5-Mini | 52 B | 12 B | 4 GB |
| Jamba-1.5-Large | 398 B | 94 B | 9 GB |
For context, LLaMA-3.1-70B requires 80 GB and Mistral-Large-2 requires 88 GB for the KV cache at equivalent context length.
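These figures can be approximately reproduced with back-of-the-envelope arithmetic. The sketch below assumes 16-bit cache entries, a head dimension of 128 (hidden size 8192 divided by 64 query heads), and that "GB" in the table means GiB. The LLaMA-3.1-70B settings (80 attention layers, 8 KV heads, head dimension 128) are taken from that model's public configuration, not from this text.

```python
def kv_cache_gib(n_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the key and value caches for one sequence, in GiB."""
    # factor 2 = one tensor for keys plus one for values
    values = 2 * n_attn_layers * n_kv_heads * head_dim * seq_len
    return values * bytes_per_value / 2**30


SEQ_LEN = 256 * 1024  # 256K tokens

jamba_large = kv_cache_gib(n_attn_layers=9, n_kv_heads=8, head_dim=128, seq_len=SEQ_LEN)
llama_70b = kv_cache_gib(n_attn_layers=80, n_kv_heads=8, head_dim=128, seq_len=SEQ_LEN)
print(jamba_large, llama_70b)  # 9.0 80.0
```

With the same KV head layout, the gap comes entirely from the number of attention layers: 9 versus 80.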
2. Scaling to 256K Effective Context with Efficient Memory Usage
Jamba-1.5 achieves an open-weight record of 256K effective context window by leveraging three factors: (1) sparsity of attention (only 1/8 layers use full attention), (2) fixed-state Mamba layers that do not require cumulative KV caches, and (3) sequence-parallel serving and paged attention orchestration through vLLM. The architecture relies on pre-existing rotary embeddings for positional information.
The combined design results in KV cache growth that is 8× slower than in pure Transformer models. End-to-end batch throughput degrades quadratically, as $O(n^2)$ in context length $n$, only in the rare attention layers rather than in all layers, yielding more robust performance at extreme context length.
3. ExpertsInt8 Quantization and Resource Efficiency
A key deployment innovation is ExpertsInt8 quantization: over 85% of model parameters reside in MoE layers (>90% in MoE + MLPs), which are quantized to INT8 at load time, each expert $e$ with its own scale $s_e$:

$$W_e \approx s_e \, Q_e, \qquad Q_e \in \mathrm{INT8}$$
During inference, these INT8 values are dequantized to BF16 inside the fused_moe vLLM kernel, with data movement restricted to on-chip SRAM, which lowers latency. This method incurs negligible precision loss, requires no calibration, and often reduces latency because less data is transferred from high-bandwidth memory. The MoE/MLP weight footprint is halved relative to BF16, supporting inference of Jamba-1.5-Large at 256K context on 8×80GB GPUs; latency matches FP8 on H100 GPUs and outperforms GPTQ on A100.
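A minimal sketch of per-expert INT8 quantization follows. It assumes a symmetric absolute-maximum scale per expert, which is a common calibration-free choice but is not confirmed by the source as the exact ExpertsInt8 rule. The helper names and toy weights are hypothetical, and plain Python floats stand in for BF16.

```python
def quantize_expert(weights):
    """Symmetric INT8 quantization of one expert's weights with a single scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # assumption: absmax scaling
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_expert(q, scale):
    """Recover approximate weights; the real kernel would produce BF16 here."""
    return [v * scale for v in q]


# Two hypothetical experts with very different weight ranges.
experts = {0: [0.50, -1.27, 0.03, 0.90], 1: [0.010, -0.004, 0.002, 0.008]}
quantized = {e: quantize_expert(w) for e, w in experts.items()}
restored = {e: dequantize_expert(q, s) for e, (q, s) in quantized.items()}
max_err = {
    e: max(abs(a - b) for a, b in zip(experts[e], restored[e])) for e in experts
}
print(max_err)
```

Giving each expert its own scale keeps the rounding error proportional to that expert's weight range, so an expert with small weights is not swamped by the range of a larger one.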
4. Instruction Tuning, Fine-Tuning, and Training Regimen
Pre-training utilizes a mixture of multi-language web text, code, books, and scientific literature, followed by mid-training emphasizing long documents. Supervised fine-tuning (“post-training”) focuses on curated high-quality conversational data, skill-specific data (e.g., structured QA and function-calling), and long-context data including synthetic “needle-in-haystack” tasks. Most fine-tuning samples are synthesized via LLM prompting and automatically validated. The training objective augments cross-entropy loss with a minor “Activation Loss”:
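$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \alpha \cdot \frac{1}{N}\sum_{i=1}^{N} a_i^2$$

where $a_i$ are the penalized activations and $\alpha$ is a small weighting factor. The term penalizes extreme activations, ensuring numerical safety in FP16 domains. No reinforcement learning (PPO or DPO) is applied; data synthesis and filtering suffice for model quality.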
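The combined objective can be sketched as follows, assuming the penalty is the mean square of a set of activations scaled by a small factor. The function names and the value `alpha=1e-5` are illustrative assumptions, not settings taken from this text.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one prediction, via a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def activation_loss(activations):
    """Mean square of the activations (the assumed form of the penalty)."""
    return sum(a * a for a in activations) / len(activations)


def total_loss(logits, target, activations, alpha=1e-5):
    """Cross-entropy plus a small activation penalty; alpha is hypothetical."""
    return cross_entropy(logits, target) + alpha * activation_loss(activations)


logits, target = [2.0, 0.0], 0
activations = [3.0, -4.0]
ce = cross_entropy(logits, target)
penalty = activation_loss(activations)  # (9 + 16) / 2 = 12.5
loss = total_loss(logits, target, activations)
print(ce, penalty, loss)
```

Because the factor is small, the penalty barely changes the loss at normal activation magnitudes and only matters when activations grow large enough to threaten overflow.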
5. Benchmarking, Quantitative Results, and Model Release
Jamba-1.5 benchmarks competitively on broad academic and chatbot evaluations. Across MMLU, BBH, ARC-C, GSM8K, HumanEval, and TruthfulQA, Jamba-1.5-Large is close to its peers; on 5-shot MMLU it scores 80.0%, versus 83.6% for LLaMA-3.1-70B and 82.5% for Mistral-Large-2. In arena-style chatbot evaluations (e.g., Arena-Hard with GPT-4-Turbo judgments), Jamba-1.5-Large scores 65.4% and 48.5% on the two reported benchmarks, versus 55.7% and 49.8% for the LLaMA baseline. On long-context benchmarks, Jamba-1.5-Large uniquely preserves near-100% retrieval/aggregation performance up to 256K (RULER average: 95.7%) and outperforms other open-weight models on ∞Bench for both multiple-choice (80.4%) and QA (34.9%) at 100K tokens. On multilingual MMLU, Jamba-1.5-Mini reaches 64.3% (vs 56.8% for LLaMA-3.1-8B) and Jamba-1.5-Large 73.9% (vs 77.8% for LLaMA-3.1-70B).
Model weights for both versions are available under the Jamba Open Model License on Hugging Face, with full ExpertsInt8 source code released. Jamba-1.5-Large operates efficiently on 8×80GB GPUs using FSDP and parallel serving, with throughputs up to 160 tokens/sec (256K context); Jamba-1.5-Mini is optimized for 2×80GB setups.
6. Janus Monolayer CrBr1.5I1.5: Structure, Magnetism, and Piezoelectricity
In two-dimensional materials science, Jamba-1.5 indicates a Janus monolayer CrBr1.5I1.5, a derivative of CrI3 in which the I atoms on one side are replaced by Br, producing an asymmetric (Br–Cr–I) trilayer (Guo et al., 2021). The material has point group 3m symmetry and a substantial built-in out-of-plane dipole; its optimized lattice constant is reported by Guo et al. (2021).
Stability is evidenced by (1) phonon spectra with no imaginary frequencies, (2) elastic constants and derived moduli (in N/m) that satisfy the Born criteria for mechanical stability, and (3) thermal stability in ab initio molecular dynamics at 300 K.
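For a hexagonal 2D lattice, the Born criteria reduce to $C_{11} > 0$ and $C_{11} > |C_{12}|$, and the usual in-plane moduli follow from the same two constants. The sketch below applies these standard relations; the input values are hypothetical placeholders, not the constants reported by Guo et al.

```python
def hexagonal_2d_mechanics(c11, c12):
    """Born stability check and derived moduli for a hexagonal 2D lattice.

    c11, c12: independent elastic constants in N/m.
    """
    stable = c11 > 0 and c11 > abs(c12)
    return {
        "stable": stable,
        "youngs_modulus": (c11**2 - c12**2) / c11,  # C_2D, in N/m
        "shear_modulus": (c11 - c12) / 2,           # G_2D, in N/m
        "poisson_ratio": c12 / c11,                 # dimensionless
    }


# Hypothetical elastic constants, chosen only to exercise the formulas.
result = hexagonal_2d_mechanics(c11=30.0, c12=9.0)
print(result)
```

A soft Young's modulus, as is typical for chromium trihalide monolayers, also makes strain tuning experimentally easier.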
7. Electronic, Magnetic, and Piezoelectric Phenomena in Jamba-1.5 Monolayers
Jamba-1.5 is an indirect-gap semiconductor with a half-semiconducting band structure: both valence and conduction bands are 100% spin-up polarized. The Cr ion carries a local moment of about 2.985 μB, and the monolayer favors an out-of-plane easy axis (MAE 356 μeV/Cr). Ferromagnetism arises from Cr–X–Cr superexchange at a bond angle close to 90° (Goodenough–Kanamori mechanism).
Piezoelectrically, the Janus symmetry (broken inversion and horizontal mirror) allows both in-plane and out-of-plane responses. In Voigt notation, the out-of-plane piezoelectric strain coefficient $d_{31}$ in the FM ground state exceeds the values reported for other 2D materials, such as MoSSe, MXenes (0.40–0.78 pm/V), and α-In2Se3 (0.415 pm/V).
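For a monolayer with 3m symmetry, the strain coefficients are commonly obtained from the piezoelectric stress coefficients and elastic constants as $d_{\text{in}} = e_{\text{in}}/(C_{11} - C_{12})$ for the in-plane response and $d_{31} = e_{31}/(C_{11} + C_{12})$ for the out-of-plane response. The sketch below applies these relations with a unit conversion; all input values are hypothetical and are not taken from Guo et al.

```python
def piezo_strain_coefficients(e_in, e31, c11, c12):
    """Convert piezoelectric stress coefficients to strain coefficients.

    e_in, e31: stress coefficients in units of 1e-10 C/m.
    c11, c12:  elastic constants in N/m.
    Returns (d_in, d31) in pm/V.
    """
    # (1e-10 C/m) / (N/m) = 1e-10 C/N = 1e-10 m/V = 100 pm/V
    to_pm_per_v = 100.0
    d_in = to_pm_per_v * e_in / (c11 - c12)
    d31 = to_pm_per_v * e31 / (c11 + c12)
    return d_in, d31


# Hypothetical inputs chosen only to exercise the formulas.
d_in, d31 = piezo_strain_coefficients(e_in=0.20, e31=0.39, c11=30.0, c12=9.0)
print(round(d_in, 3), round(d31, 3))
```

The denominators show why strain changes the response: anything that softens $C_{11} + C_{12}$ or raises $e_{31}$ increases $d_{31}$.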
Strain engineering permits tuning of magnetic and piezoelectric properties. Under about 5% biaxial compression, a transition to AF-Néel order occurs. Compressive strain enhances the in-plane coefficient up to 0.993 pm/V; tensile strain increases $d_{31}$ to 1.545 pm/V. In the AFM phase, $d_{31}$ remains high (0.999 pm/V).
Applications discussed include 2D spintronic sensors, magneto-piezotronic transducers, and strain-controlled FM/AFM memories. Realization depends on successful layer-selective halogen exchange and stabilization strategies (e.g., h-BN encapsulation) during fabrication.
8. Synthesis, Device Integration, and Outlook
For the CrBr1.5I1.5 monolayer, synthesis challenges include maintaining magnetic order during halogen exchange and ensuring chemical stability. The strong out-of-plane piezoelectricity facilitates vertical field gating in FETs and heterostructure integration. Outlook suggests opportunities for multifunctional devices where strain and field control both magnetism and electronic polarization, with potential for integration into novel 2D heterostructures for spin–charge conversion and memory applications.
In neural architectures, Jamba-1.5 illustrates the efficacy of hybrid and sparse model design, supporting extreme context lengths and superior throughput while remaining accessible to the research community through open-weights and source code. Quantization advances and efficient memory design lower barriers for deployment at high scale.
Both Jamba-1.5 systems, while arising from different domains, mark advances obtained by hybridization—whether by blending state-space, attention, and MoE methods in LLMs, or by compositional control and symmetry breaking in designer 2D materials.