
Falcon-H1R: 7B Reasoning-Optimized Model

Updated 7 January 2026
  • Falcon-H1R is a 7-billion-parameter small language model that integrates Transformer and Mamba state-space blocks for efficient, high-accuracy multi-step reasoning.
  • It employs a hybrid-parallel architecture with data, tensor, and sequence parallelism to manage long contexts (up to 48K tokens) while reducing inference costs.
  • Innovative training and chain-of-thought management techniques enable Falcon-H1R to outperform larger models on diverse benchmarks in mathematics, coding, and general reasoning.

Falcon-H1R is a 7-billion-parameter reasoning-optimized small language model (SLM) designed to achieve state-of-the-art performance on multi-step reasoning tasks while maintaining high efficiency in both inference cost and test-time scaling. Distinguished by a hybrid-parallel Transformer–Mamba (state-space) architecture and a targeted training methodology, Falcon-H1R establishes that, with careful engineering, SLMs can match or outperform models $2\times$ to $7\times$ larger on diverse benchmarks in mathematics, code, and general reasoning. Its parameter and token efficiency, coupled with innovations in parallelism and test-time chain-of-thought (CoT) management, position Falcon-H1R as a practical foundation for advanced reasoning systems across a wide array of technical domains (Team et al., 5 Jan 2026).

1. Model Architecture and Hybrid-Parallelism

Falcon-H1R-7B employs a layerwise hybrid architecture in which each layer interleaves:

  • A standard multi-head Transformer block, and
  • A Mamba Structured State Space Model (SSM) block.

Key architectural features include:

  • Two head types per layer: 12 standard attention (Q/K/V) heads and 2 long-range Mamba heads, each with head dimension 128.
  • State-space recurrence: SSM blocks use a state dimension $d_\text{state} = 256$, so the recurrence scales as $\mathcal{O}(L \cdot d_\text{state})$, trading quadratic $\mathcal{O}(L^2)$ attention cost for linear scaling in the sequence length $L$.
  • Parallelism mechanisms:
    • Data- and tensor-parallelism (TP) across GPUs for the attention and feed-forward sublayers.
    • Sequence-parallel (SP) “Ulysses” partitioning for managing the SSM over long contexts (up to 256K tokens) with scatter/gather communications, maintaining memory efficiency.
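
The layer layout can be illustrated with the PyTorch sketch below. The sequential attention-then-SSM residual wiring, the toy diagonal SSM, and `d_model = 1536` (inferred from 12 heads × 128 dims) are assumptions for illustration, not the released architecture; a causal attention mask and Mamba's input gating are omitted for brevity.

```python
import torch
import torch.nn as nn

class ToySSM(nn.Module):
    """Linear-time stand-in for a Mamba block: a diagonal state-space
    recurrence with a d_state-dimensional hidden state per channel."""
    def __init__(self, dim, d_state=256):
        super().__init__()
        self.log_decay = nn.Parameter(torch.rand(dim, d_state) * -0.5)
        self.B = nn.Parameter(torch.randn(dim, d_state) * 0.02)
        self.C = nn.Parameter(torch.randn(dim, d_state) * 0.02)

    def forward(self, x):                       # x: (batch, seq, dim)
        b, L, d = x.shape
        h = x.new_zeros(b, d, self.B.shape[1])
        decay = torch.exp(self.log_decay)       # elementwise decay in (0, 1]
        out = []
        for t in range(L):                      # O(L * d_state) scan
            h = h * decay + x[:, t, :, None] * self.B
            out.append((h * self.C).sum(-1))
        return torch.stack(out, dim=1)

class HybridLayer(nn.Module):
    """One hybrid layer interleaving an attention block and an SSM block
    as residual sublayers (12 heads x 128 dims -> d_model = 1536)."""
    def __init__(self, d_model=1536, n_heads=12, d_state=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ssm = ToySSM(d_model, d_state)

    def forward(self, x):
        y = self.norm1(x)
        a, _ = self.attn(y, y, y, need_weights=False)
        x = x + a                               # attention sublayer
        return x + self.ssm(self.norm2(x))      # SSM sublayer

x = torch.randn(2, 16, 1536)
print(HybridLayer()(x).shape)                   # torch.Size([2, 16, 1536])
```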

Resource characteristics are as follows:

  • Total parameters: $P = 7.59 \times 10^9$.
  • FLOPs per token: $C_\text{attn} \approx 2PL/\text{heads}$ and $C_\text{ssm} \sim \mathcal{O}(d_\text{state}^2 L)$.
  • Inference cost: $C_\text{phase}(L) = \mathcal{O}(PL)$, linear in context length for long sequences.
  • Peak memory: $M_\text{mem} \sim \mathcal{O}(P + L d_\text{model})$, typically $\sim$70 GB for $L = 32$K tokens under hybrid SP+TP.
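
As a back-of-envelope check of these relations, consider the hypothetical calculator below; `d_model` is an illustrative value (not a published figure), and activations, KV cache, and optimizer state are not modeled, which is why the result undershoots the ~70 GB peak reported above:

```python
def inference_cost_estimate(L, P=7.59e9, d_model=1536, d_state=256,
                            bytes_per_param=2):
    """Rough cost model following the scaling relations above."""
    flops = 2 * P * L + d_state ** 2 * L           # O(P*L) matmuls + O(d_state^2 * L) SSM
    weight_gb = P * bytes_per_param / 1e9          # bf16 weights
    stream_gb = L * d_model * bytes_per_param / 1e9  # O(L * d_model) residual stream
    return flops, weight_gb + stream_gb

flops, mem_gb = inference_cost_estimate(L=32_000)
print(f"~{flops:.2e} FLOPs for one 32K-token pass; ~{mem_gb:.1f} GB weights+stream")
```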

2. Training Data and Optimization

2.1 Data Curation and Preprocessing

Falcon-H1R’s training corpus encompasses domains such as:

  • Mathematics (with verified ground truth and log-normal token-length distribution),
  • Coding (Python/C++ with functional tests),
  • Science (factual, multi-step reasoning),
  • Other (instruction-based, chat, tool use).

Data filtering includes:

  • Removing empty/malformed reasoning traces,
  • Math answer verification via LaTeX matching with a math-verify fallback,
  • Code validation through the Sandbox-Fusion harness,
  • Difficulty-aware weighting of samples (easy: $0.5\times$, medium: $1\times$, hard: $1.25\times$–$1.75\times$).
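
A minimal sketch of the filtering-and-weighting step follows; the record fields and the single 1.5× value for hard samples (within the paper's 1.25×–1.75× range) are illustrative placeholders, not the actual pipeline:

```python
# Hypothetical filtering-and-weighting pass; field names are placeholders.
DIFFICULTY_WEIGHTS = {"easy": 0.5, "medium": 1.0, "hard": 1.5}  # hard: 1.25-1.75x

def filter_and_weight(samples):
    kept = []
    for s in samples:
        if not s.get("reasoning_trace", "").strip():
            continue                         # drop empty/malformed traces
        w = DIFFICULTY_WEIGHTS.get(s.get("difficulty", "medium"), 1.0)
        kept.append({**s, "loss_weight": w})
    return kept

print(filter_and_weight([
    {"reasoning_trace": "step 1 ...", "difficulty": "hard"},
    {"reasoning_trace": ""},                 # filtered out
]))
```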

2.2 Supervised Fine-Tuning (SFT)

  • Starting from the Falcon-H1-7B pretrained base,
  • Learning rate schedule: $\eta(t) = \eta_0 (1 - t/T)$ with $\eta_0 = 1024 \times 10^{-6}$,
  • Batch size $B = 512$, 3 epochs over 3.1M examples, contexts up to 36K tokens, with some up to 48K right-trimmed,
  • Optimizer: AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.95$), weight decay $0.01$, grad clip $1.0$, bfloat16 precision.
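
A PyTorch sketch of this optimizer setup is shown below; the model stub, dummy loss, and total step count `T` are placeholders, not the authors' training code:

```python
import torch

model = torch.nn.Linear(8, 8)        # stand-in for the Falcon-H1-7B base
T = 10_000                           # total SFT steps (placeholder)
opt = torch.optim.AdamW(model.parameters(), lr=1024e-6,
                        betas=(0.9, 0.95), weight_decay=0.01)
# Linear decay eta(t) = eta_0 * (1 - t/T), matching the schedule above.
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lambda t: 1 - t / T)

# One optimization step (skeleton): loss, backward, clip to 1.0, step.
loss = model(torch.randn(4, 8)).pow(2).mean()   # dummy loss for illustration
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
sched.step()
opt.zero_grad()
```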

Balanced DP token normalization ensures equal global gradient contributions across all valid tokens:

$$\mathcal{L}_\text{balanced}^{(r)} = \frac{\sum_i \ell_i^{(r)} m_i^{(r)}}{\epsilon + \sum_{r'=1}^{R} \sum_i m_i^{(r')}} \cdot R$$

for $R$ data-parallel ranks, where $\ell_i^{(r)}$ is the per-token loss and $m_i^{(r)}$ the valid-token mask on rank $r$.
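
In a data-parallel loop this amounts to dividing each rank's masked loss sum by the global valid-token count (obtained via all-reduce) rather than the local count. A PyTorch sketch, not the authors' implementation, with a single-process fallback so it runs standalone:

```python
import torch
import torch.distributed as dist

def balanced_token_loss(token_losses, mask, eps=1e-8):
    """Balanced DP normalization: each valid token contributes equally to the
    global gradient regardless of how tokens are split across ranks.

    token_losses, mask: (batch, seq) tensors; mask is 1 on valid tokens.
    """
    local_sum = (token_losses * mask).sum()
    global_tokens = mask.sum()
    if dist.is_available() and dist.is_initialized():
        # Sum valid-token counts over all R data-parallel ranks.
        dist.all_reduce(global_tokens, op=dist.ReduceOp.SUM)
        R = dist.get_world_size()
    else:
        R = 1                                  # single-process fallback
    # L_balanced = (sum_i l_i * m_i) / (eps + global token count) * R
    return local_sum / (eps + global_tokens) * R

# Single-process check: reduces to the ordinary masked mean.
losses = torch.rand(2, 5)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]], dtype=torch.float)
print(balanced_token_loss(losses, mask))
```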

2.3 Reinforcement Learning (RL) via GRPO

  • Uses a generalized clipped policy-gradient objective:

$$J(\theta) = \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\left( r_{i,t} A_{i,t},\ \mathrm{clip}(r_{i,t},\, 1-\epsilon,\, 1+\epsilon)\, A_{i,t} \right)$$

  • Importance ratio $r_{i,t} = \pi_\theta(o_{i,t}\mid\ldots)\,/\,\pi_{\theta_\mathrm{old}}(o_{i,t}\mid\ldots)$,
  • Group-relative advantage $A_{i,t} = \frac{R_\mathrm{final}(q, o_i) - \mu_R}{\sigma_R}$, with group mean $\mu_R$ and standard deviation $\sigma_R$,
  • No KL or entropy penalties,
  • Hyperparameters: $G = 16$ rollouts, temperature $\tau = 0.85$, max sequence $L_\text{max} = 48$K tokens, learning rate $2 \times 10^{-6}$, PPO batch 128.
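
A self-contained sketch of this objective for one prompt's group of rollouts follows; the clipping constant `eps_clip = 0.2` is a conventional PPO-style choice, not a value reported for Falcon-H1R:

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, mask, eps_clip=0.2):
    """GRPO objective for one prompt's group of G rollouts.

    logp_new, logp_old: (G, T) per-token log-probs under current/old policy.
    rewards: (G,) final scalar rewards; mask: (G, T) valid-token mask.
    No KL or entropy penalty, as described above.
    """
    # Group-relative advantage: standardize final rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)      # (G,)
    adv = adv[:, None]                                             # broadcast over tokens
    ratio = torch.exp(logp_new - logp_old)                         # importance ratio r_{i,t}
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * adv
    per_token = torch.min(unclipped, clipped) * mask
    # Mean over tokens within each rollout (1/|o_i|), then over the group (1/G).
    per_rollout = per_token.sum(1) / mask.sum(1).clamp(min=1)
    return -per_rollout.mean()    # negate: maximize J(theta) by minimizing loss

G, T = 16, 32
loss = grpo_loss(torch.randn(G, T), torch.randn(G, T),
                 torch.rand(G), torch.ones(G, T))
print(loss)
```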

3. Empirical Results and Benchmark Comparison

3.1 Reasoning, Coding, and General Evaluation

Falcon-H1R’s single-chain pass@1 accuracies on reasoning benchmarks:

| Task | Score | Next Best (Δ) | Notes |
|---|---|---|---|
| AIME24 | 88.1% | Qwen3-32B (+8.7pp) | |
| AIME25 | 83.1% | 2nd place | |
| HMMT25 | 64.9% | GPT-OSS-20B | |
| AMO-Bench | 36.3% | +10pp over next | |
| Math500 | 97.4% | | Comprehensive maths eval |
| LCB v6 | 68.6% | 2nd | Code performance |
| SciCode | 28.3 / 3.9% | | Code, multiple splits |
| $\tau^2$-Telecom | 25.4% | | Telecom code |
| TB Hard | 4.9% | | Hard coding benchmark |
| GPQA-Diamond | 61.3% | | General knowledge |
| MMLU-Pro | 72.1% | | General reasoning |
| HL Exam | 11.1% | 2nd | General reasoning |
| IFBench | 53.4% | | Instruction-following benchmark |

Falcon-H1R-7B matches or exceeds 14–32B competitors by 2–10 percentage points on reasoning tasks. For example, on AMO-Bench:

$$\Delta_\mathrm{acc}(7\text{B} \to 14\text{B}) = \mathrm{acc}_{\text{H1R-7B}} - \mathrm{acc}_{\text{Phi4+14B}} \approx +6.4\%$$

3.2 Test-Time Scaling and DeepConf

Falcon-H1R incorporates the DeepConf approach for adaptive, parallel CoT test-time scaling:

  • Adaptive filtering: discards chains with low predicted confidence (threshold: the 10th percentile over a 2048-token window).
  • Early stopping: aborts traces when the moving-window confidence falls below the threshold (sketched at the end of this subsection).
  • Efficiency metrics:
    • Speedup: $\mathrm{Speedup}(n) = T_1 / T_n$, where $T_n$ is the wall-clock time for $n$ parallel chains.
    • Token reduction ratio: the fraction of output tokens saved versus the baseline.

| Model | AIME24 Acc | Tok (M) | AIME25 Acc | Tok (M) | AMO-Bench Acc | Tok (M) |
|---|---|---|---|---|---|---|
| Qwen3-8B | 80.0% | 138.3 | 80.0% | 177.2 | 15.4% | 320.0 |
| DeepSeek-R1-8B | 90.0% | 145.5 | 82.8% | 174.5 | 25.6% | 487.9 |
| Phi-4-Plus-14B | 86.7% | 123.9 | 83.3% | 145.9 | 20.5% | 276.9 |
| Qwen3-32B | 86.7% | 134.4 | 86.7% | 174.8 | 28.2% | 364.8 |
| Falcon-H1R-7B | 96.7% | 89.8 | 96.7% | 95.1 | 35.9% | 216.8 |

Falcon-H1R-7B achieves:

  • 38–51% token reduction versus larger baselines at equal or higher accuracy,
  • Lower batch latency due to the Mamba SSM's linear scaling,
  • Increased up-front confidence estimation complexity, offset by robust early chain termination.
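
A minimal sketch of the confidence gating described in this subsection (the window size and 10th-percentile calibration follow the bullets above; the function interfaces are hypothetical):

```python
import numpy as np

def should_stop(token_confidences, threshold, window=2048):
    """DeepConf-style early stop: abort a chain when its moving-average
    confidence over the last `window` tokens drops below the threshold."""
    if len(token_confidences) < window:
        return False
    return float(np.mean(token_confidences[-window:])) < threshold

def calibrate_threshold(warmup_chain_confs, pct=10):
    """Set the threshold at the 10th percentile of windowed confidences
    observed on warmup chains (a sketch of the adaptive filter)."""
    return float(np.percentile(warmup_chain_confs, pct))

# Usage: stop a chain once its recent confidence dips below the calibrated bar.
thr = calibrate_threshold([0.91, 0.84, 0.77, 0.95, 0.62])
print(should_stop([0.9] * 2048, thr))   # False: confidence still above threshold
```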

4. Chain-of-Thought Generation and Practical Utility

Falcon-H1R’s hybrid SSM architecture supports scalable chain-of-thought (CoT) generation for extended context lengths (up to 48K tokens) at low per-token computational cost. This makes Falcon-H1R suitable for multi-step reasoning in mathematics, science, software engineering, and sequential planning domains. At 7B parameters, the model can be deployed on conventional multi-GPU clusters, delivering throughput 20–100% higher than comparable 8B pure-Transformer models, which makes it well suited to interactive reasoning assistants and resource-conscious deployments.

The model’s output includes high-quality, auditable CoT traces and calibrated confidence estimates, supporting downstream scenarios such as automated grading, code verification, and interactive tutoring.

5. Architectural Implications and Efficiency Trade-offs

Falcon-H1R demonstrates that compact models, via hybrid-parallel design and dedicated scaling strategies, can deliver reasoning performance exceeding that of 14–32B-parameter models while drastically reducing resource demands. The two-level parallelism (TP and SP), balanced DP token normalization, and DeepConf test-time scaling afford several advantages:

  • Per-token inference cost linear in context length,
  • Sustained memory efficiency for very long sequences,
  • Reliable confidence-based early stopping in multi-chain parallelism,
  • Minimal accuracy trade-off versus larger models.

A plausible implication is that future reasoning-focused SLMs may replicate or extend Falcon-H1R’s architectural and scaling approaches to further push the limits of parameter efficiency.

6. Summary and Impact

Falcon-H1R substantiates that strategic hybridization of Transformer and state-space components, paired with advanced data curation, SFT, and generalized RL policy optimization, can vault 7B-parameter models to state-of-the-art reasoning ability on challenging multi-domain benchmarks. The model’s efficiency gains in both tokens and wall-clock latency enable scalable deployment and interactive real-time applications where advanced reasoning is required, with competitive or superior accuracy compared to much larger models (Team et al., 5 Jan 2026).
