Falcon-H1R: 7B Reasoning-Optimized Model
- Falcon-H1R is a 7-billion-parameter small language model that integrates Transformer and Mamba state-space blocks for efficient, high-accuracy multi-step reasoning.
- It employs a hybrid-parallel architecture with data, tensor, and sequence parallelism to manage long contexts (up to 48K tokens) while reducing inference costs.
- Innovative training and chain-of-thought management techniques enable Falcon-H1R to outperform larger models on diverse benchmarks in mathematics, coding, and general reasoning.
Falcon-H1R is a 7-billion-parameter reasoning-optimized small LLM (SLM) designed to achieve state-of-the-art performance on multi-step reasoning tasks while maintaining high efficiency in both inference cost and test-time scaling. Distinguished by a hybrid-parallel Transformer–Mamba (state-space) architecture and a targeted training methodology, Falcon-H1R establishes that, with careful engineering, SLMs can match or outperform substantially larger models on diverse benchmarks in mathematics, code, and general reasoning. Its parameter and token efficiency, coupled with innovations in parallelism and test-time chain-of-thought (CoT) management, position Falcon-H1R as a practical foundation for advanced reasoning systems across a wide array of technical domains (Team et al., 5 Jan 2026).
1. Model Architecture and Hybrid-Parallelism
Falcon-H1R-7B employs a layerwise hybrid architecture in which each layer interleaves standard self-attention with Mamba state-space (SSM) heads on a shared residual stream.
Key architectural features include (a minimal sketch of such a hybrid block follows the list):
- Two head types per layer: 12 standard query/key/value (Q/K/V) attention heads and 2 long-range Mamba heads, each with head dimension 128.
- State-space recurrence: SSM blocks use a fixed state dimension $d_{\mathrm{state}}$, giving a recurrence cost of $O(L\,d_{\mathrm{state}})$ and trading the $O(L^2)$ cost of full attention for scaling that is linear in the sequence length $L$.
- Parallelism mechanisms:
- Data- and tensor-parallelism (TP) across GPUs for the attention and feed-forward sublayers.
- Sequence-parallel (SP) “Ulysses” partitioning for managing SSM over long contexts (up to 256 K tokens) with scatter/gather communications, maintaining memory efficiency.
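To make the layerwise hybrid concrete, the following is a minimal PyTorch sketch rather than the released Falcon-H1R implementation: it pairs 12 attention heads with 2 simplified SSM heads (head dimension 128) on a shared residual stream. The diagonal linear recurrence stands in for the full Mamba selective scan, and the model width `d_model` is an arbitrary placeholder.

```python
import torch
import torch.nn as nn


class SimpleSSMHead(nn.Module):
    """Minimal diagonal linear recurrence: h_t = a * h_{t-1} + W_in x_t, y_t = W_out h_t."""

    def __init__(self, dim: int):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        # Learnable per-channel decay, squashed into (0, 1).
        self.log_a = nn.Parameter(torch.full((dim,), -1.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, L, dim)
        a = torch.sigmoid(self.log_a)
        u = self.in_proj(x)
        h = torch.zeros_like(u[:, 0])
        ys = []
        for t in range(u.shape[1]):  # O(L) sequential scan, written as a loop for clarity
            h = a * h + u[:, t]
            ys.append(h)
        return self.out_proj(torch.stack(ys, dim=1))


class HybridBlock(nn.Module):
    """One layer combining attention heads and SSM heads on a shared residual stream."""

    def __init__(self, d_model: int = 1792, n_attn_heads: int = 12,
                 n_ssm_heads: int = 2, head_dim: int = 128):
        super().__init__()
        attn_dim = n_attn_heads * head_dim   # 12 * 128 = 1536
        ssm_dim = n_ssm_heads * head_dim     # 2 * 128 = 256
        self.norm1 = nn.LayerNorm(d_model)
        self.attn_in = nn.Linear(d_model, attn_dim)
        self.attn = nn.MultiheadAttention(attn_dim, n_attn_heads, batch_first=True)
        self.ssm_in = nn.Linear(d_model, ssm_dim)
        self.ssm = SimpleSSMHead(ssm_dim)
        self.mix_out = nn.Linear(attn_dim + ssm_dim, d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, L, d_model)
        h = self.norm1(x)
        L = x.shape[1]
        # Boolean causal mask: True marks positions that may not be attended to.
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool, device=x.device), diagonal=1)
        a = self.attn_in(h)
        attn_out, _ = self.attn(a, a, a, attn_mask=causal, need_weights=False)
        ssm_out = self.ssm(self.ssm_in(h))
        x = x + self.mix_out(torch.cat([attn_out, ssm_out], dim=-1))
        return x + self.mlp(self.norm2(x))
```

The explicit Python loop in the scan is for readability only; a production kernel would use a parallel or chunked scan to realize the linear-in-$L$ cost in practice.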
Resource characteristics are as follows:
- Total parameters: approximately 7B.
- Inference cost: linear in context length for long sequences, in contrast to the quadratic cost of pure-attention models.
- Peak memory: typically around 70 GB at long context lengths under hybrid SP+TP.
2. Training Data and Optimization
2.1 Data Curation and Preprocessing
Falcon-H1R’s training corpus encompasses domains such as:
- Mathematics (with verified ground truth and log-normal token-length distribution),
- Coding (Python/C++ with functional tests),
- Science (factual, multi-step reasoning),
- Other (instruction-based, chat, tool use).
Data filtering includes (a minimal sketch follows this list):
- Removing empty/malformed reasoning traces,
- Math answer verification via LaTeX matching and math-verify fallback,
- Code validation through Sandbox-Fusion harness,
- Difficulty-aware weighting of samples, with weights increasing by difficulty tier (hard samples weighted at roughly $1.25$ and above).
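A hypothetical Python sketch of this filtering-and-weighting pass (field names, the stub verifier, and the easy/medium weight values are illustrative assumptions; in the actual pipeline, math answers are checked with LaTeX matching plus a math-verify fallback and code with the Sandbox-Fusion harness):

```python
from dataclasses import dataclass

# Illustrative difficulty weights: only the hard-tier value (~1.25) appears in the
# text above; the easy/medium values are placeholders for this sketch.
DIFFICULTY_WEIGHTS = {"easy": 0.75, "medium": 1.0, "hard": 1.25}


@dataclass
class Sample:
    prompt: str
    reasoning_trace: str
    answer: str
    gold: str
    domain: str          # "math", "code", "science", "other"
    difficulty: str      # "easy" | "medium" | "hard"
    weight: float = 1.0


def answer_verified(sample: Sample) -> bool:
    """Stub verifier: exact match after whitespace normalization. The real pipeline
    would use LaTeX-aware matching with a math-verify fallback for math answers and
    a sandboxed functional-test harness (Sandbox-Fusion) for code."""
    return sample.answer.strip() == sample.gold.strip()


def filter_and_weight(samples: list) -> list:
    kept = []
    for s in samples:
        if not s.reasoning_trace.strip():                  # drop empty/malformed traces
            continue
        if s.domain in ("math", "code") and not answer_verified(s):
            continue                                       # drop unverifiable answers
        s.weight = DIFFICULTY_WEIGHTS.get(s.difficulty, 1.0)
        kept.append(s)
    return kept
```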
2.2 Supervised Fine-Tuning (SFT)
- Starting from the Falcon-H1-7B pretrained base,
- A step-dependent learning-rate schedule,
- A fixed global batch size, 3 epochs over 3.1M examples, contexts up to 36K tokens (with some up to 48K, right-trimmed),
- Optimizer: AdamW with weight decay $0.01$, gradient clipping at $1.0$, and bfloat16 precision.
Balanced DP token normalization ensures equal global gradient contributions from all valid tokens: per-token losses are summed on each rank and divided by the global count of valid tokens across data-parallel ranks,
$$\mathcal{L} \;=\; \frac{1}{\sum_{r=1}^{R} T_r}\,\sum_{r=1}^{R}\sum_{i=1}^{T_r} \ell_{r,i},$$
where $R$ is the number of data-parallel ranks, $T_r$ the number of valid (non-padding) tokens on rank $r$, and $\ell_{r,i}$ the loss of token $i$ on rank $r$.
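A minimal PyTorch sketch of this normalization, assuming a standard torch.distributed data-parallel setup (the helper name and signature are illustrative, not the training code):

```python
import torch
import torch.distributed as dist


def balanced_token_loss(token_losses: torch.Tensor, valid_mask: torch.Tensor) -> torch.Tensor:
    """token_losses, valid_mask: (batch, seq_len); valid_mask is 1.0 for real tokens, 0.0 for padding."""
    local_loss_sum = (token_losses * valid_mask).sum()
    global_token_count = valid_mask.float().sum()
    if dist.is_available() and dist.is_initialized():
        # All-reduce the valid-token count so every rank divides by the same global
        # denominator (the sum over T_r in the formula above).
        dist.all_reduce(global_token_count, op=dist.ReduceOp.SUM)
        # Note: if the DP wrapper averages gradients across ranks, multiply by the
        # world size here to recover an exact global-token mean.
    return local_loss_sum / global_token_count.clamp(min=1.0)
```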
2.3 Reinforcement Learning (RL) via GRPO
- Uses a generalized clipped policy-gradient (GRPO) objective (a sketch follows the list):
$$J(\theta) = \mathbb{E}\Big[\tfrac{1}{G}\textstyle\sum_{i=1}^{G}\min\big(r_i(\theta)\,\hat{A}_i,\ \operatorname{clip}(r_i(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\big)\Big],$$
- Importance ratio $r_i(\theta) = \pi_\theta(o_i \mid q)\,/\,\pi_{\theta_{\text{old}}}(o_i \mid q)$,
- Group-relative advantage $\hat{A}_i = R_i - \bar{R}$, with group mean $\bar{R} = \tfrac{1}{G}\sum_{j=1}^{G} R_j$,
- No KL or entropy penalties,
- Hyperparameters: multiple rollouts per prompt, a fixed sampling temperature and learning rate, a maximum sequence length in the tens of thousands of tokens, and a PPO batch size of 128.
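The following is an illustrative PyTorch sketch of this objective for a single group of rollouts (a hypothetical helper, not the released training code; the clipping range `clip_eps` is an assumed placeholder value):

```python
import torch


def grpo_loss(logp_new: torch.Tensor,   # (G, T) token log-probs under the current policy
              logp_old: torch.Tensor,   # (G, T) token log-probs under the rollout policy
              rewards: torch.Tensor,    # (G,)   scalar reward per rollout in the group
              mask: torch.Tensor,       # (G, T) 1.0 for generated tokens, 0.0 for padding
              clip_eps: float = 0.2) -> torch.Tensor:
    # Group-relative advantage: each rollout's reward minus the group mean.
    advantages = (rewards - rewards.mean()).unsqueeze(-1)             # (G, 1)

    # Per-token importance ratio between current and rollout policies.
    ratio = torch.exp(logp_new - logp_old)                            # (G, T)

    # PPO-style clipped surrogate; no KL or entropy penalty terms are added.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    per_token = torch.minimum(unclipped, clipped) * mask

    # Negate: minimizing this loss maximizes the clipped surrogate objective.
    return -per_token.sum() / mask.sum().clamp(min=1.0)
```

Because the advantage is a group-relative reward offset rather than the output of a learned value function, no critic network is required.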
3. Empirical Results and Benchmark Comparison
3.1 Reasoning, Coding, and General Evaluation
Falcon-H1R’s single-chain pass@1 accuracies on reasoning benchmarks:
| Task | Score (pass@1) | Next Best (Δ) | Notes |
|---|---|---|---|
| AIME24 | 88.1% | Qwen3-32B (+8.7pp) | |
| AIME25 | 83.1% | 2nd place | |
| HMMT25 | 64.9% | ≈ GPT-OSS-20B | |
| AMO-Bench | 36.3% | +10pp over next | |
| Math500 | 97.4% | | Comprehensive maths eval |
| LCB v6 | 68.6% | 2nd | Code performance |
| SciCode | 28.3/3.9% | | Code, multiple splits |
| -Telecom | 25.4% | | Telecom code |
| TB Hard | 4.9% | | Hard coding benchmark |
| GPQA-Diamond | 61.3% | | General knowledge |
| MMLU-Pro | 72.1% | | General reasoning |
| HL Exam | 11.1% | 2nd | General reasoning |
| IFBench | 53.4% | | Instruction following |
Falcon-H1R-7B matches or exceeds 14–32B competitors by 2–10 percentage points on reasoning tasks; on AMO-Bench, for example, it leads the next-best model by roughly 10 percentage points.
3.2 Test-Time Scaling and DeepConf
Falcon-H1R incorporates the DeepConf approach for adaptive, parallel CoT test-time scaling (a schematic sketch follows the list):
- Adaptive filtering: Discards chains with low predicted confidence (threshold: 10th percentile over 2048-token window).
- Early stopping: Aborts traces when moving confidence falls below threshold.
- Efficiency metrics:
- Speedup $S = T_{\text{baseline}} / T_{\text{DeepConf}}$, where $T$ is the wall-clock time to generate the full set of parallel chains.
- Token reduction ratio $\rho = 1 - N_{\text{tok}}^{\text{DeepConf}} / N_{\text{tok}}^{\text{baseline}}$, quantifying how many fewer output tokens are produced relative to the baseline.
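A schematic Python sketch of the two gating mechanisms (illustrative only: the confidence proxy, helper names, and filtering rule are assumptions; only the 2048-token window and 10th-percentile threshold come from the description above):

```python
from collections import deque


def windowed_confidence(token_logprobs, window=2048):
    """Yield the running mean token log-probability over the last `window` tokens."""
    buf, total = deque(), 0.0
    for lp in token_logprobs:
        buf.append(lp)
        total += lp
        if len(buf) > window:
            total -= buf.popleft()
        yield total / len(buf)


def should_stop_early(token_logprobs, threshold, window=2048):
    """Abort a chain as soon as its moving confidence falls below the threshold."""
    return any(c < threshold for c in windowed_confidence(token_logprobs, window))


def filter_chains(chains, percentile=10, window=2048):
    """Discard chains whose final windowed confidence falls in the bottom `percentile`."""
    scored = [(list(windowed_confidence(c, window))[-1], c) for c in chains if c]
    if not scored:
        return []
    ordered = sorted(s for s, _ in scored)
    cutoff = ordered[min(len(ordered) - 1, int(len(ordered) * percentile / 100))]
    return [c for s, c in scored if s >= cutoff]
```

In practice the windowed confidence would be computed incrementally during decoding, so that low-confidence chains can be aborted before they consume their full token budget.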
| Model | AIME24 Acc | Tok (M) | AIME25 Acc | Tok (M) | AMO-Bench Acc | Tok (M) |
|---|---|---|---|---|---|---|
| Qwen3-8B | 80.0% | 138.3 | 80.0% | 177.2 | 15.4% | 320.0 |
| DeepSeek-R1-8B | 90.0% | 145.5 | 82.8% | 174.5 | 25.6% | 487.9 |
| Phi-4-Plus-14B | 86.7% | 123.9 | 83.3% | 145.9 | 20.5% | 276.9 |
| Qwen3-32B | 86.7% | 134.4 | 86.7% | 174.8 | 28.2% | 364.8 |
| Falcon-H1R-7B | 96.7% | 89.8 | 96.7% | 95.1 | 35.9% | 216.8 |
Falcon-H1R-7B achieves:
- 38–51% token reduction versus larger baselines at equal or higher accuracy (see the worked example below),
- Lower batch latency due to the Mamba SSM's linear scaling in sequence length,
- Increased up-front confidence estimation complexity, offset by robust early chain termination.
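As a worked instance of the token-reduction ratio defined above, using the table values: against DeepSeek-R1-8B on AIME24, $\rho = 1 - 89.8/145.5 \approx 0.38$ (a 38% reduction at higher accuracy), and against Qwen3-32B on AIME25, $\rho = 1 - 95.1/174.8 \approx 0.46$.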
4. Chain-of-Thought Generation and Practical Utility
Falcon-H1R’s hybrid SSM architecture supports scalable chain-of-thought (CoT) generation for extended context lengths—up to 48K tokens—with low per-token computational cost. This makes Falcon-H1R suitable for multi-step reasoning in mathematics, science, software engineering, and sequential planning domains. At 7B parameters, the model can be deployed on conventional multi-GPU clusters, delivering throughput 20–100% higher than comparable 8B pure-Transformer models. This is conducive to interactive reasoning assistants and resource-conscious deployments.
The model’s output includes high-quality, auditable CoT traces and calibrated confidence estimates, supporting downstream scenarios such as automated grading, code verification, and interactive tutoring.
5. Architectural Implications and Efficiency Trade-offs
Falcon-H1R demonstrates that compact models, via hybrid-parallel design and dedicated scaling strategies, can deliver reasoning performance exceeding that of 14–32B-parameter models while drastically reducing resource demands. The two-level parallelism (TP, SP), balanced DP token normalization, and DeepConf test-time scaling afford distinct advantages:
- Per-token inference cost linear in context length,
- Sustained memory efficiency for very long sequences,
- Reliable confidence-based early stopping in multi-chain parallelism,
- Minimal accuracy trade-off versus larger models.
A plausible implication is that future reasoning-focused SLMs may replicate or extend Falcon-H1R’s architectural and scaling approaches to further push the limits of parameter efficiency.
6. Summary and Impact
Falcon-H1R substantiates that strategic hybridization of Transformer and state-space components, paired with advanced data curation, SFT, and generalized RL policy optimization, can vault 7B-parameter models to state-of-the-art reasoning ability on challenging multi-domain benchmarks. The model’s efficiency gains—in both tokens and wall-clock latency—enable scalable deployment and interactive real-time applications where advanced reasoning is required, with competitive or superior accuracy compared to much larger models (Team et al., 5 Jan 2026).