- The paper introduces a novel LLM architecture that integrates brain-inspired hybrid attention, MoE, and adaptive spiking neurons for efficient long-context processing.
- It employs a conversion-based training paradigm with continual pre-training and multi-stage fine-tuning to achieve over 100× speedup in time to first token on 4M-token sequences alongside competitive benchmark performance.
- The models demonstrate significant energy savings and scalability, reducing per-operation energy consumption by 97.7% compared to FP16 MAC and 85.2% compared to INT8 MAC.
SpikingBrain: Brain-Inspired Large Models for Efficient Long-Context Training and Inference
Introduction
SpikingBrain introduces a family of LLMs that integrate brain-inspired mechanisms—hybrid efficient attention, Mixture-of-Experts (MoE) modules, and spike encoding—into their architecture. The models are designed to address the computational and memory bottlenecks of mainstream Transformer-based LLMs, particularly for long-context processing and deployment on non-NVIDIA hardware. SpikingBrain leverages a conversion-based training pipeline, adaptive spiking neurons, and system-level optimizations for the MetaX GPU cluster, demonstrating stable large-scale training and inference with less than 2% of the data typically required for comparable open-source models.
Figure 1: Overview of SpikingBrain, highlighting hybrid attention, MoE, spike encoding, and hardware adaptation for efficient training and inference on MetaX clusters.
Model Architecture
Hybrid Attention Mechanisms
SpikingBrain models employ a combination of linear attention, sliding window attention (SWA), and full softmax attention modules. Linear attention provides O(n) complexity and constant memory usage, while SWA captures local dependencies efficiently. Hybridization is realized in two paradigms:
- Inter-layer sequential hybridization (SpikingBrain-7B): Alternates linear and SWA layers for purely linear complexity.
- Intra-layer parallel hybridization (SpikingBrain-76B): Combines linear, SWA, and full attention within layers, with outputs normalized for stability.
The Gated Linear Attention (GLA) module enhances expressivity via gating vectors, supporting recurrent state updates analogous to dendritic dynamics in biological neurons.
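As a concrete reference, the following is a minimal single-head sketch of a gated linear attention recurrence in PyTorch. The gating parameterization, absence of normalization, and tensor shapes are simplifying assumptions for illustration, not the paper's implementation.

```python
import torch

def gla_recurrent(q, k, v, g):
    """Minimal single-head GLA recurrence: the state is decayed by a gating
    vector and updated with the key-value outer product, giving O(n) time and
    constant memory in sequence length."""
    seq_len, d = q.shape
    state = torch.zeros(d, d)                      # recurrent state, fixed size
    outputs = []
    for t in range(seq_len):
        # gate decays the state along the key dimension, then the new
        # key-value outer product is accumulated
        state = g[t].unsqueeze(-1) * state + torch.outer(k[t], v[t])
        outputs.append(q[t] @ state)               # read out with the query
    return torch.stack(outputs)

q = k = v = torch.randn(16, 64)
g = torch.sigmoid(torch.randn(16, 64))             # gating vector in (0, 1)
print(gla_recurrent(q, k, v, g).shape)             # torch.Size([16, 64])
```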
Mixture-of-Experts (MoE)
SpikingBrain-76B incorporates sparse MoE layers, with 16 routed experts and 1 shared expert per layer, activating only a subset of parameters per token. Upcycling techniques replicate dense FFN weights across experts, maintaining initial equivalence and enabling specialization during training. Seven dense FFN layers are retained for stability.
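A simplified sketch of the routed-plus-shared expert layout is shown below; the hidden sizes, top-k value, and expert FFN structure are placeholder assumptions rather than the SpikingBrain-76B configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Illustrative sparse MoE layer: routed experts selected per token by a
    softmax router, plus one always-active shared expert."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=16, top_k=2):
        super().__init__()
        ffn = lambda: nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(),
                                    nn.Linear(d_ff, d_model))
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(ffn() for _ in range(n_experts))
        self.shared = ffn()
        self.top_k = top_k

    def forward(self, x):                                # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)
        weights, idx = gate.topk(self.top_k, dim=-1)     # top-k experts per token
        routed = torch.zeros_like(x)
        for t in range(x.size(0)):
            routed[t] = sum(w * self.experts[int(e)](x[t])
                            for w, e in zip(weights[t], idx[t]))
        return self.shared(x) + routed                   # shared expert is always active

moe = SparseMoE()
print(moe(torch.randn(8, 512)).shape)                    # torch.Size([8, 512])
```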
Adaptive Spiking Neurons
The models utilize adaptive-threshold spiking neurons, simplifying the LIF model by removing the decay factor and introducing a dynamic threshold proportional to the mean absolute membrane potential. This ensures balanced firing rates, prevents over-excitation/quiescence, and supports efficient conversion of activations to integer spike counts.
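The thresholding idea can be sketched in a few lines; the proportionality constant, per-token averaging, and rounding scheme below are assumptions made for illustration.

```python
import torch

def adaptive_spike_counts(membrane, k=1.0):
    """Convert activations to integer spike counts with an adaptive threshold.

    The LIF decay term is dropped and the threshold is set proportional to the
    mean absolute membrane potential, keeping firing rates balanced across
    channels."""
    theta = k * membrane.abs().mean(dim=-1, keepdim=True)    # dynamic threshold
    counts = torch.round(membrane / theta.clamp_min(1e-6))   # integer spike counts
    return counts.to(torch.int32), theta

counts, theta = adaptive_spike_counts(torch.randn(4, 64))
print(counts.dtype, counts.abs().float().mean())             # int32, moderate counts
```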
Figure 2: Integrated architectures of SpikingBrain-7B (linear, inter-layer hybridization) and SpikingBrain-76B (hybrid-linear MoE, intra-layer hybridization), with spike coding for hardware compatibility.
Training Paradigm
Conversion-Based Training
SpikingBrain leverages a multi-stage conversion pipeline:
- Continual Pre-Training (CPT): Transfers attention patterns from pre-trained Transformer checkpoints to efficient attention variants, progressively extending context length (8k → 32k → 128k) with only ~150B tokens.
- Supervised Fine-Tuning (SFT): Three-stage alignment for general knowledge, dialogue, and reasoning, using domain-specific datasets.
Attention map correspondence enables direct initialization of QKV projections from softmax attention models, with lightweight training adapting to local/low-rank variants. Non-negative activations and low-rank normalization ensure stable convergence.
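The initialization step can be pictured as a straightforward weight transfer; the attribute names and module layout below are illustrative assumptions, not the paper's conversion code.

```python
import torch

@torch.no_grad()
def init_from_softmax_attention(src_attn, dst_attn):
    """Hedged sketch of conversion-based initialization: reuse the Q/K/V and
    output projections of a pre-trained softmax-attention block as the starting
    point for an efficient-attention (linear or SWA) block."""
    for name in ("q_proj", "k_proj", "v_proj", "o_proj"):   # assumed attribute names
        getattr(dst_attn, name).weight.copy_(getattr(src_attn, name).weight)
    # Parameters unique to the efficient variant (e.g. gating or low-rank
    # projections) keep their fresh initialization and are adapted during
    # continual pre-training.
```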
MoE Upcycling
Dense FFN weights are replicated across experts, with output scaling to maintain consistency. Stochastic routing and data noise drive expert specialization, while shared experts stabilize training.
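A minimal sketch of upcycling follows, assuming the shared expert is also seeded from the dense FFN (the text above only states that dense weights are replicated across the routed experts):

```python
import copy
import torch.nn as nn

def upcycle_dense_ffn(dense_ffn: nn.Module, n_routed: int = 16):
    """Upcycling sketch: every routed expert starts as a copy of the
    pre-trained dense FFN, so the MoE layer initially reproduces the dense
    layer once the router's combined output is rescaled to sum to one."""
    routed_experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(n_routed))
    shared_expert = copy.deepcopy(dense_ffn)   # assumption: shared expert reuses dense weights
    return routed_experts, shared_expert
```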
Spiking-Driven LLMs
Spiking Scheme
Activations are converted to integer spike counts via adaptive-threshold neurons, then expanded into sparse spike trains for event-driven computation. Three spike coding schemes are supported (see the sketch after this list):
- Binary ({0,1}): Simple, low overhead, but high firing rate for large counts.
- Ternary ({−1,0,1}): Bidirectional, halves firing rate and time steps, aligns with biological excitation/inhibition.
- Bitwise: Expands counts into binary bits, compresses time dimension, optimal for high-precision, low-power scenarios.
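A toy encoding of integer spike counts under the three schemes is shown below; sign handling, time-window length, and the exact way ternary coding halves firing activity are simplified assumptions for illustration.

```python
import torch

def binary_train(counts, t_max):
    # {0,1} coding: a count of c is emitted as c unit spikes over t_max steps
    steps = torch.arange(t_max).view(-1, 1)
    return (steps < counts.abs().view(1, -1)).to(torch.int8)

def ternary_train(counts, t_max):
    # {-1,0,1} coding: spike polarity carries the sign (excitation/inhibition)
    return binary_train(counts, t_max) * counts.sign().view(1, -1).to(torch.int8)

def bitwise_train(counts, n_bits):
    # bitwise coding: |count| is expanded into n_bits binary planes,
    # compressing the time dimension to n_bits steps
    bits = torch.arange(n_bits).view(-1, 1)
    return ((counts.abs().view(1, -1) >> bits) & 1).to(torch.int8)

counts = torch.tensor([3, -2, 0, 5])       # integer spike counts per neuron
print(binary_train(counts, 6))             # 6 time steps of unsigned spikes
print(ternary_train(counts, 6))            # signed spikes
print(bitwise_train(counts, 3))            # 3 bit-planes cover |count| <= 7
```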
Figure 3: Schematic of binary, ternary, and bitwise spike coding schemes, illustrating temporal compression and sparsity.
Hardware Adaptation
Spike coding is compatible with GPU execution (single-step integer formulation) and event-driven neuromorphic hardware (expanded spike trains). Deployment on asynchronous hardware maximizes energy efficiency, as computation is triggered only by spike events.
Distributed Training and Operator Adaptation
MetaX-specific optimizations cover customized operator adaptation and the distributed parallel topologies outlined below.
Parallel Topologies
SpikingBrain models employ data, pipeline, expert, and sequence parallelism, with ZeRO and activation recomputation for memory efficiency. Sequence parallelism (DeepSpeed Ulysses, ZeCO, LASP-2) enables scalable long-context training and inference.
Results
SpikingBrain-7B recovers ~90% of base model performance across benchmarks (MMLU, CMMLU, ARC-C, HellaSwag, C-Eval), matching advanced linear and hybrid models. SpikingBrain-76B closes the gap further, achieving results comparable to Llama2-70B, Mixtral-8×7B, and Gemma2-27B, despite activating fewer parameters.
Long-Context Efficiency
SpikingBrain-7B achieves over 100× speedup in time to first token (TTFT) for 4M-token sequences under sequence parallelism, with near-constant time overhead as sequence length and GPU count scale. Inference latency and throughput are consistently superior to baselines across frameworks and hardware.
Figure 5: TTFT comparison under sequence parallelism, demonstrating SpikingBrain-7B's scalability and efficiency for ultra-long sequences.
CPU-Side Inference
Compressed 1B-scale SpikingBrain models deployed on CPUs (llama.cpp backend) maintain constant decoding speed as output length increases, outperforming Llama3.2-1B by up to 15.39× at 256k sequence length.
Figure 6: Overview of the CPU-side inference pipeline, detailing conversion, registration, optimization, and quantized inference steps.
Figure 7: Decoding speed comparison for SpikingBrain-1B and Llama3.2-1B on CPU, highlighting stable throughput and memory efficiency.
Spiking Scheme Analysis
Bitwise-ternary spike coding yields 69.15% sparsity, with 18.4% of channels inactive during inference. Performance degradation from spiking and INT8 quantization is limited to ~2%. Energy consumption is reduced by 97.7% (vs. FP16 MAC) and 85.2% (vs. INT8 MAC), with average MAC energy cost of 0.034 pJ.
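As a back-of-envelope check, the reported savings are consistent with commonly cited per-operation reference energies (roughly 1.5 pJ per FP16 MAC and 0.23 pJ per INT8 MAC at 45 nm); these reference values are assumptions, not figures from the paper.

```python
# Reported average spiking MAC energy versus assumed FP16/INT8 reference costs.
avg_spike_mac_pj = 0.034                      # reported average MAC energy
fp16_mac_pj, int8_mac_pj = 1.5, 0.23          # assumed 45 nm reference energies

saving_vs_fp16 = 1 - avg_spike_mac_pj / fp16_mac_pj
saving_vs_int8 = 1 - avg_spike_mac_pj / int8_mac_pj
print(f"vs FP16 MAC: {saving_vs_fp16:.1%}, vs INT8 MAC: {saving_vs_int8:.1%}")
# vs FP16 MAC: 97.7%, vs INT8 MAC: 85.2%
```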

Figure 8: Spike counts distribution for SpikingBrain-7B and SpikingBrain-76B under bitwise spike coding.
Figure 9: Time–neuron firing maps for binary, ternary, and bitwise spike coding, visualizing sparsity and temporal compression.
Conclusion
SpikingBrain demonstrates the practical integration of brain-inspired mechanisms into large model architectures, achieving efficient long-context training and inference on non-NVIDIA hardware. The models deliver competitive performance with minimal data, order-of-magnitude speedups, and substantial energy savings via spiking-driven computation. These advances provide a reference for future neuromorphic hardware and scalable deployment of efficient LLMs, with implications for both theoretical model design and real-world applications in resource-constrained environments.