Hermes 4: Open Hybrid Reasoning LLM
- Hermes 4 is a family of hybrid reasoning language models that integrate structured multi-turn chain-of-thought processes with broad instruction-following capabilities.
- It employs innovative techniques such as loss masking, sample packing with Flex Attention, and controlled chain-of-thought truncation to optimize training and output fidelity.
- Developed on Llama 3.1 and Qwen 3 backbones, Hermes 4 is openly released with all model weights available on HuggingFace, fostering reproducibility and collaborative research.
Hermes 4 is a family of hybrid reasoning LLMs that combine structured, multi-turn reasoning capability—specifically detailed chain-of-thought processes—with broad, instruction-following aptitude. Developed atop Llama 3.1 and Qwen 3 model backbones, Hermes 4 introduces architectural and data pipeline innovations that enable high-fidelity reasoning trace generation, robust answer production in a single pass, and strong performance across mathematical, coding, knowledge, comprehension, and alignment benchmarks. The Hermes 4 series is openly released, with all model weights available via HuggingFace for reproducibility and collaborative research (Teknium et al., 25 Aug 2025).
1. Model Architecture and Training Methodology
Hermes 4 models are designed to integrate multi-step, structured reasoning with unconstrained instruction-following behavior, departing from models that exclusively rely on one paradigm. The architecture is based on Llama 3.1 checkpoints or, in the case of Hermes 4 14B, a Qwen 3–based seed. Notable features include:
- Loss masking: Only tokens generated by the “assistant” role contribute to the cross-entropy loss, preventing the model from fitting user turns and other metadata tokens (a masking sketch follows this list).
- Sample packing: To accommodate highly variable input lengths (with token counts spanning several orders of magnitude), samples are pre-packed using the First-Fit Decreasing algorithm, achieving batch packing efficiency above 99.9%. Attention is restricted to within-sample boundaries using Flex Attention.
- Parallelism and hardware utilization: Training scales across 192 NVIDIA B200 GPUs using a mixture of Distributed Data Parallelism (DDP), Tensor Parallelism (TP), and Fully Sharded Data Parallelism (FSDP) to support long contexts (16K–40,960 tokens) and large global batch sizes (384 at 16,384 tokens).
- Chain-of-thought length control: A specialized fine-tuning stage introduces an explicit </think> token at a preset reasoning length (20k–30k tokens), trained via masked loss so that gradients are applied only to this token, leading the model to learn controlled truncation of reasoning traces without degrading answer fidelity.
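As a rough illustration of these two masking schemes, the sketch below builds per-token label arrays in the Hugging Face convention, where an ignore index of -100 excludes positions from the cross-entropy loss. The helper names, the role-span representation, and the ignore index are assumptions for illustration, not the actual Hermes 4 training code.

```python
# Illustrative sketch of assistant-only loss masking and </think> length control.
# Helper names, role-span bookkeeping, and the -100 ignore index are assumptions,
# not the actual Hermes 4 training code.
from typing import List, Tuple

IGNORE_INDEX = -100  # standard ignore index for torch.nn.CrossEntropyLoss


def build_labels(token_ids: List[int],
                 role_spans: List[Tuple[str, int, int]]) -> List[int]:
    """Mask every token that was not produced by the assistant role.

    role_spans: (role, start, end) index ranges over token_ids.
    """
    labels = [IGNORE_INDEX] * len(token_ids)
    for role, start, end in role_spans:
        if role == "assistant":
            labels[start:end] = token_ids[start:end]
    return labels


def build_length_control_labels(token_ids: List[int],
                                think_close_pos: int) -> List[int]:
    """Length-control stage: gradient flows only through the inserted </think>
    token placed at the preset reasoning-length position."""
    labels = [IGNORE_INDEX] * len(token_ids)
    labels[think_close_pos] = token_ids[think_close_pos]
    return labels
```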
2. Data Curation, Synthesis, and Quality Control
Hermes 4 training data consists of approximately 5 million samples with over 19 billion tokens, combining reasoning-intensive and general instruction-following examples. The data pipeline emphasizes both diversity and rigorous quality control:
- Synthetic graph-based data generation: DataForge is used to define directed acyclic graphs (DAGs) for reasoning tasks, encoding both complex instructions and rich intermediate traces.
- Deduplication: ModernBERT embedding similarity (cosine threshold 0.7) is used for strict deduplication across the dataset, minimizing semantic overlap and leakage.
- Rejection sampling: Candidate samples are filtered using over one thousand task-specific verifiers in the Atropos RL environment, ensuring that only high-quality, verifiable trajectories are retained for training (see the combined deduplication-and-filtering sketch below).
- Data heterogeneity handling: The sample packing and masked batch construction methods are crucial in managing the extreme diversity in sample length and complexity.
A plausible implication is that this approach allows Hermes 4 to generalize across a broader range of reasoning styles and problem types, while maintaining high trace and format fidelity.
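As a rough sketch of how the deduplication and rejection-sampling stages described above could be combined, the snippet below embeds candidate samples, greedily drops near-duplicates above a 0.7 cosine threshold, and keeps only samples accepted by a verifier callback. The embedding model identifier, the greedy single-pass strategy, and the verify function are illustrative assumptions; the actual DataForge and Atropos tooling is not reproduced here.

```python
# Illustrative deduplication + rejection-sampling filter.
# The embedding model name, the greedy 0.7 cosine threshold, and the
# `verify` callback are assumptions for illustration only.
from typing import Callable, List

import numpy as np
from sentence_transformers import SentenceTransformer


def quality_filter(samples: List[str],
                   verify: Callable[[str], bool],
                   threshold: float = 0.7) -> List[str]:
    """Keep samples that are (a) not near-duplicates of an already-kept
    sample and (b) accepted by a task-specific verifier."""
    model = SentenceTransformer("nomic-ai/modernbert-embed-base")
    embeddings = model.encode(samples, normalize_embeddings=True)

    kept_texts, kept_vecs = [], []
    for text, vec in zip(samples, embeddings):
        if kept_vecs and np.max(np.asarray(kept_vecs) @ vec) >= threshold:
            continue  # too similar to something already kept
        if not verify(text):
            continue  # fails its task-specific verifier
        kept_texts.append(text)
        kept_vecs.append(vec)
    return kept_texts
```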
3. Training Infrastructure and Optimization Strategies
Hermes 4 training and evaluation leverage state-of-the-art infrastructure:
- Batching and efficiency: The First-Fit Decreasing algorithm is employed to efficiently pack variable-length samples into training batches, with Flex Attention ensuring that inter-sample context mixing is prevented.
- Distributed optimization: Training is conducted in mixed precision with a cosine learning rate schedule and optimizer settings chosen for stability in very-long-context language modeling.
- Sample-level attention masking: Within each packed sequence, attention is restricted to tokens from the same sample segment, preventing cross-contamination during optimization (a packing-and-masking sketch follows at the end of this section).
This model scaling strategy supports context windows of up to 40,960 tokens, necessary for multi-turn and highly elaborate reasoning tasks.
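A minimal sketch of both steps appears below: First-Fit Decreasing packing of variable-length samples into fixed-capacity bins, and a per-sample causal mask expressed with PyTorch's FlexAttention block-mask API (PyTorch 2.5+). The bin capacity, document-ID bookkeeping, and helper names are assumptions, not the project's actual TorchTitan or Axolotl configuration.

```python
# Sketch: First-Fit Decreasing packing plus a per-sample FlexAttention mask.
# Bin capacity, helper names, and the causal-within-document mask are
# illustrative assumptions (requires PyTorch >= 2.5 for FlexAttention).
from typing import List

import torch
from torch.nn.attention.flex_attention import create_block_mask


def pack_first_fit_decreasing(lengths: List[int], capacity: int) -> List[List[int]]:
    """Pack sample indices into bins of `capacity` tokens using FFD."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins, loads = [], []
    for i in order:
        for b, load in enumerate(loads):
            if load + lengths[i] <= capacity:
                bins[b].append(i)
                loads[b] += lengths[i]
                break
        else:
            bins.append([i])
            loads.append(lengths[i])
    return bins


def sample_boundary_mask(document_id: torch.Tensor, seq_len: int):
    """Causal attention restricted to tokens from the same packed sample."""
    def mask_mod(b, h, q_idx, kv_idx):
        same_doc = document_id[q_idx] == document_id[kv_idx]
        return same_doc & (q_idx >= kv_idx)
    return create_block_mask(mask_mod, B=None, H=None,
                             Q_LEN=seq_len, KV_LEN=seq_len,
                             device=str(document_id.device))
```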
4. Quantitative Evaluation and Benchmarking
Hermes 4 is benchmarked across a suite of quantitative tasks:
- Mathematical reasoning: Evaluated against MATH-500, AIME’24, AIME’25, GPQA Diamond; large Hermes 4 variants attain normalized scores in the high 70–80% range.
- Code generation: On LiveCodeBench (LCBv6 Aug2024+) and BBH, the model’s fine-tuned chain-of-thought and strict format adherence lead to strong pass@1 and functional code scores.
- Knowledge/comprehension: Benchmarks including MMLU, MMLU-Pro, DROP, and SimpleQA demonstrate broad instruction following, with competitive performance compared to state-of-the-art instruction-tuned LLMs.
- Alignment/safety: Tests using IFEval, Arena-Hard v1, RefusalBench, and RewardBench measure policy compliance, refusal capability, and reward alignment.
The evaluation harness uses OpenAI-compatible chat completions endpoints and reproducible scripts (lighteval, Atropos) for all quantitative tasks, with evaluation outputs released for sample-level scrutiny.
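To illustrate the endpoint-based setup, the sketch below queries an OpenAI-compatible chat completions server with the official openai Python client; the base URL, API key, model identifier, system prompt, and sampling parameters are placeholders rather than the actual evaluation configuration.

```python
# Minimal sketch of querying an OpenAI-compatible chat completions endpoint
# for benchmarking. Base URL, model name, system prompt, and sampling
# settings are placeholders, not the actual Hermes 4 evaluation config.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="NousResearch/Hermes-4-70B",  # placeholder model identifier
    messages=[
        {"role": "system", "content": "You are Hermes, a helpful assistant."},
        {"role": "user", "content": "If 3x + 7 = 22, what is x?"},
    ],
    temperature=0.0,
    max_tokens=2048,
)
print(response.choices[0].message.content)
```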
5. Behavioral and Qualitative Analysis
Hermes 4’s qualitative characteristics are detailed as follows:
- Persona adoption: The model exhibits reduced “policy rigidity” compared to many proprietary systems. In adversarial roleplay and meta-contexts, Hermes 4 adopts instructed personas and generates immersive in-character output, minimizing default to generic safety warnings.
- Chain-of-thought separation: Model generations distinctly partition the “thinking” segment (explanatory, stepwise reasoning or stylistic planning) from the final answer, often visibly separated by cue tokens such as </think> (see the parsing sketch at the end of this section).
- Prompt sensitivity: Adjusting system prompt cue tokens (e.g., switching from “assistant” to first-person) modulates the model’s response style, reducing sycophantic tendencies and producing more peer-like or agentive outputs.
- Format compliance: Strong adherence to tool-use and answer-formatting instructions enables reliable deployment in environments requiring precise response structure.
This suggests Hermes 4 is amenable to controllable generation schemes and user-driven behavior modulation via external prompt cues.
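A minimal parsing sketch follows, assuming the reasoning segment is terminated by the </think> cue token; the helper name and the optional opening <think> tag are assumptions for illustration.

```python
# Illustrative parser that splits a Hermes-style generation into its
# reasoning trace and final answer on the closing </think> cue token.
# The exact tag convention is an assumption based on the cue token above.
def split_reasoning(generation: str) -> tuple[str, str]:
    """Return (reasoning, answer); if no </think> is present, the whole
    generation is treated as the answer."""
    marker = "</think>"
    if marker not in generation:
        return "", generation.strip()
    reasoning, _, answer = generation.partition(marker)
    return reasoning.replace("<think>", "").strip(), answer.strip()


# Example usage with the endpoint sketch above:
# reasoning, answer = split_reasoning(response.choices[0].message.content)
```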
6. Open Release, Reproducibility, and Research Enablers
The Hermes 4 project places a strong emphasis on openness and reproducibility:
- Model and data accessibility: All weights are publicly available through the HuggingFace collection https://huggingface.co/collections/NousResearch/hermes-4-collection-68a731bfd452e20816725728, along with benchmark logs and evaluation samples for replication and sample-level analysis.
- Transparency of pipeline: The data curation process (including synthesis graphs, deduplication methods, rejection sampling, loss masking, sample packing algorithms, and chain-of-thought truncation procedures) is documented in detail, supporting independent validation and further academic study.
- Collaboration incentives: By providing both models and data under open terms, the project enables contribution from the research community in hybrid reasoning, prompt engineering, and chain-of-thought supervision research.
- Infrastructure reuse: The use of established environments like Axolotl and TorchTitan allows researchers to adopt and extend the training and evaluation pipeline for custom problem classes.
A plausible implication is that Hermes 4 represents a reference implementation for open, high-fidelity hybrid reasoning LLMs designed for multi-step problem solving, alignment, and controlled output length.
7. Significance and Future Directions
Hermes 4 advances the state of open-hybrid reasoning systems, coupling efficient, long-context training with precise control over reasoning trajectory length and output style. It demonstrates competitive or better-than-proprietary performance across diverse academic tasks, while publicly releasing all methodological and model artifacts.
Future research directions facilitated by the Hermes 4 release include:
- Expanding reasoning graph complexity and extending chain-of-thought thresholding techniques.
- Investigating emergent behavioral dynamics in open-weight LLMs under persona and alignment shifts.
- Integrating hybrid reasoning architectures into domain-specific expert systems and tool-use applications.
By adhering to open science principles and providing comprehensive technical detail, Hermes 4 serves as a model for reproducible research and collaborative progress in advanced multi-step language modeling (Teknium et al., 25 Aug 2025).