CompassMax-V3-Thinking: Advanced MoE Reasoning

Updated 9 December 2025

CompassMax-V3-Thinking is a hundred-billion-scale Mixture-of-Experts reasoning model that integrates reinforcement learning and tool-augmented techniques to address large-scale system inefficiencies.
It employs innovative methods such as Program-of-Thought and scratchpad augmentation along with RL optimizations like ZVE and ESPO to achieve state-of-the-art performance in reasoning and coding benchmarks.
Its robust design features module invocation protocols, safe tool integration, and efficient resource management, ensuring reliable multi-step reasoning and scalable inference.

CompassMax-V3-Thinking refers to a hundred-billion-scale Mixture-of-Experts (MoE) reasoning model and methodology, characterized by its integration of advanced reinforcement learning (RL) optimization schemes, tool-augmented reasoning, and a high-throughput system stack for stable, scalable training and inference. This architecture and workflow are designed to ensure that each input prompt yields a meaningful policy signal, addressing critical inefficiencies and failure modes that emerge in large-scale MoE RL systems. Empirical findings confirm that CompassMax-V3-Thinking delivers state-of-the-art performance in reasoning, coding, and multilingual benchmarks, outperforming contemporary models through innovations at both the algorithmic and system levels (Zeng et al., 8 Dec 2025, Song et al., 23 Jul 2025).

1. Tool-Augmented Reasoning Architectures

CompassMax-V3-Thinking extends traditional large reasoning models (LRMs) by incorporating explicit tool-augmentation strategies, notably Python interpreter integration and external scratchpad memory. These augmentations address the token context limits and error propagation associated with multi-step reasoning, especially on complex or lengthy tasks.

Python Interpreter Augmentation:
- Program-of-Thought (PoT): The model generates an entire Python program intended to solve the task, which is then executed in an isolated interpreter. This allows the solution of highly structured or long-horizon problems without suffering from LLM token budget constraints. Inputs include system prompts spelling out the rules and user prompts requesting, for instance, a Python function to output a handleable solution (e.g., a move list) (Song et al., 23 Jul 2025).
- Think-and-Execute (T&E): The LRM generates pseudo-code in multiple stages, parses and "executes" its own output iteratively as a form of internal substrate reasoning, dispensing with the external runtime.
Scratchpad Augmentation: An external memory state persists across multiple generation cycles, letting the model develop lengthy chains of logic in stages—each within manageable output limits. The pipeline involves iterative prompt construction (including prior scratchpad state and in-context examples), generation, parsing, and optional early stopping.

These augmentations are critical for CompassMax-V3-Thinking to scale its intermediate reasoning capacity beyond what monolithic generation or in-memory state allows.

2. RL Algorithmic Innovations for Stable MoE Training

Scaling RL to hundred-billion parameter MoE settings introduces pronounced instabilities and inefficiencies, addressed by four interlocking innovations:

Multi-Stage Zero-Variance Elimination (ZVE): Prevents policy gradient collapse due to zero-variance prompt batches (where every rollout receives an identical reward). ZVE applies at three levels: (i) adaptive sampling expansion for uninformative prompts, (ii) reward-stage reshaping with length and repetition penalties, and (iii) advantage reshaping by injecting controlled noise (Zeng et al., 8 Dec 2025).
ESPO (Entropy-adaptive Importance Sampling Policy Optimization): Interpolates between the flatness of sequence-level importance sampling and the brittleness of token-level IS. ESPO partitions sequences into entropy-coherent groups and uses group-adaptive clipping bounds:

$J_{\rm ESPO}(\theta) = \mathbb E_{x,y\sim\pi_{\rm old}}\Biggl[ \sum_{t=1}^{G}\sum_{l=1}^{L_t} \tau_t\, \min\Bigl( s_t(\theta)\,\bar A_{t,l},\; \mathrm{clip}(s_t(\theta),1-\epsilon_t,1+\epsilon_t)\,\bar A_{t,l} \Bigr) \Biggr]$

with $\epsilon_t$ adaptively scaled to the entropy of the token group.

Router Replay and Generative Reward Model (GenRM) Adjustment: Logs token-level expert routing during rollout and reuses this itinerary during policy updates to resolve engine-level log-prob discrepancies, ensuring RL determinism. Further, a ternary classifier GenRM is used for reward assignment ("better," "tie," or "worse") to enforce monotonic advantage ordering and eliminate advantage inversion.
High-Throughput RL System Engineering: Employs FP8-quantized rollouts, synchronous length-aware scheduling, multi-detokenization parallelism, and overlapped reward computation, resulting in a 1.66× throughput gain over naïve baselines.

3. Core Reasoning Pipelines and Invocation Protocols

The agent's reasoning workflow is formalized by clear pseudo-algorithmic patterns enabling reliable, modular integration of tool-augmented steps:

PoT Inference: Single-pass code generation followed by external interpreter execution.
Scratchpad Multi-Step Inference: Iterative multi-prompt generation and scratchpad updating, halting either at early stopping or maximum preset cycles (T).
Module Invocation Formalism: For $(f_\theta)$ $(f_{θ})$ the LRM,
- Without tools: $y = \arg\max_y P_\theta(y|P)$ .
- Interpreter: $code = \arg\max_c P_\theta(c|P)$ , $answer = Exec(code)$ .
- Scratchpad: $(A_t, S_{t+1}) = \arg\max_{a,s} P_\theta(a,s|P,S_t)$ , $y = \mathrm{Concat}(A_1…A_{T_{final}})$ .

These formalizations facilitate robust chaining of external computation and multi-step self-guided reasoning within CompassMax-V3-Thinking.

4. Experimental Evaluation and Statistical Insights

CompassMax-V3-Thinking demonstrates consistent and statistically significant performance gains in challenging reasoning benchmarks.

Apple "Thinking-Illusion" Benchmark: Four task classes (Tower of Hanoi, Checker Jumping, River Crossing, Blocks World) across $N\in\{3,5,7,9,11,13\}$ are used.
Models Compared: DeepSeek-V3, DeepSeek-R1, Qwen 3, and Qwen 3 Thinking. Both tool-augmented and non-augmented settings are evaluated.
Key Outcomes:
- PoT augmentation allows DeepSeek-R1 to achieve 96–100% average success on challenging configurations versus 4–5% for the best LLMs without such tool support.
- Scratchpad augmentation delivers up to 80% success in DeepSeek-R1 for hard Blocks World instances where non-tool models fail.
Statistical Testing: Paired McNemar’s test confirms p<0.001 for River/Blocks tasks when using PoT, indicating consistent, robust improvement (Song et al., 23 Jul 2025).

Performance summary for $N\geq 5$ :

Model	Direct	PoT	Scratchpad
DeepSeek-V3 (LLM)	4%	100%	15%
DeepSeek-R1 (LRM)	5%	96%	48%
Qwen 3	2%	0%	10%
Qwen 3 Thinking (LRM)	2%	0%	15%

5. System Design, Prompting, and Implementation Best Practices

Robust deployment of CompassMax-V3-Thinking requires attention to prompt engineering, safe tool invocation, systematic debugging, and resource control:

Prompt Engineering: System prompts should present concise puzzle rules and tool interface schemas; three in-context examples (small $N$ ) are recommended for multi-step scratchpad prompting. For PoT, at least one example of code generation is advised.
Safe Tool Invocation: Python interpreters must be sandboxed and version-locked. Resource limits (e.g., 20-minute execution and RAM quotas) are enforced to avoid runaway code. For T&E, careful parsing enables internal error catching.
Debugging: All intermediate code and scratchpad states are logged; failures are systematically analyzed for code validity, scratchpad consistency, and correct halting.
Resource Efficiency: Scratchpad max-steps $T$ should be dynamically tuned (e.g., $T\approx \lceil(2^N-1)/300\rceil$ for Hanoi). Token consumption is monitored; PoT reduces on-model token usage by outsourcing computation to Python.

6. Benchmarking and Comparative Performance

CompassMax-V3-Thinking achieves state-of-the-art results across a spectrum of in-house and public benchmarks:

E-commerce macro-average: 85.79 (vs. 85.14 for CompassMax-V3, 80.89 DeepSeek-R1, 79.10 Gemini-2.5-pro).
SEA Multilingual: 86.41 (vs. 85.58).
General-Ability Battery: 76.01 (vs. 64.49).
ARC Coding: HumanEval Pass@1 = 98.17; MBPP = 73.54.
ARC Reasoning: AIME24 = 83.30; HMMT = 46.70.
Alignment: IFeval strict = 85.40 (Zeng et al., 8 Dec 2025).

These results underscore the central CompassMax-V3-Thinking principle: every prompt must matter, enabled by comprehensive ZVE filtering, ESPO credit assignment, deterministic routing, and token/resource-efficient scaling.

7. Future Directions and Prospective Extensions

Empirical and theoretical considerations suggest promising extensions for CompassMax-V3-Thinking:

Adaptive scratchpad and interpreter orchestration for hybrid, context-sensitive reasoning workflows.
Integration of tool invocation confidence modeling to inform when augmentation is likely to be beneficial.
Augmented reward modeling and dynamic credit assignment for increasingly open-ended, real-world tasks.
Scalability to multi-agent or interactive settings, where prompt-adaptive pipeline steps and tool use are co-learned.

A plausible implication is that further harmonization of RL signal propagation, tool-augmented reasoning, and system-level scheduling mechanisms will facilitate even greater reliability and throughput as model and task complexity continue to increase (Song et al., 23 Jul 2025, Zeng et al., 8 Dec 2025).