XuanTie C910 RISC-V Core
- XuanTie C910 is an industrial-grade RISC-V core whose deep, superscalar, out-of-order pipeline enables high throughput.
- It employs advanced microarchitectural features including a multi-level cache hierarchy, micro-op cache, and robust branch prediction.
- Benchmark studies show superior IPC and energy efficiency while also revealing complex timing behaviors and side-channel vulnerabilities.
The XuanTie C910 is an industrial-grade, high-performance, superscalar, out-of-order (OoO) RISC-V core developed by T-Head (Alibaba). Its design targets compute-intensive and energy-sensitive domains, providing advanced microarchitectural features, cache hierarchies, and aggressive speculation typical of high-end processors. The C910 supports deep pipeline parallelism and offers superior throughput relative to mainstream open-source or commercial RISC-V implementations, while benchmarking studies have focused on both efficiency trade-offs and microarchitectural side-channel properties (Austa et al., 9 Oct 2025, Fu et al., 30 May 2025).
1. Microarchitectural Organization
The XuanTie C910’s design prioritizes superscalar, out-of-order execution on a deep pipeline.
- Pipeline Staging:
- The C910 is described with both 6-stage and 12-stage pipelines, depending on the context and configuration. One configuration, detailed in system-level benchmarks, implements a 12-stage pipeline with a deep instruction fetch front-end and extended OoO scheduling window.
- Fetch width up to 16 bytes, with up to 4 RISC-V macro-operations decoded per cycle (3-wide decode, 3-wide issue, 3-wide commit; up to 9-wide retire via macro-instruction compaction).
- Superscalar Execution:
- Instruction-level parallelism is facilitated by a 64-entry Reorder Buffer (ROB), with internal compaction holding up to 192 instructions.
- Physical register files consist of 96 integer and 64 floating-point registers.
- Separate issue/wakeup logic for distinct functional unit (FU) clusters.
- Load/Store and Queues:
- 16-entry load queue and 12-entry store queue support out-of-order memory disambiguation.
- Branch Prediction:
- Two-level branch prediction with a 16-entry fully associative L0 BTB, 4K-entry 4-way L1 BTB, a 32K-entry BHT combining local/global history, 12-entry return address stack, and 16-entry loop buffer.
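The ROB and retire figures above are mutually consistent if each ROB entry can hold up to three compacted instructions. That packing factor is inferred from the quoted numbers and is not stated explicitly in the source, so the following is only a sketch of the arithmetic:

```python
# Figures quoted in the text above.
ROB_ENTRIES = 64
ROB_INSTRUCTION_CAPACITY = 192
COMMIT_WIDTH_ENTRIES = 3        # "3-wide commit"
MAX_RETIRE_INSTRUCTIONS = 9     # "up to 9-wide retire"


def instructions_per_entry(entries: int, capacity: int) -> int:
    """Return how many instructions each entry must hold to reach `capacity`."""
    if capacity % entries != 0:
        raise ValueError("capacity is not a whole multiple of the entry count")
    return capacity // entries


# Assumption: compaction packs the same number of instructions into every entry.
per_entry = instructions_per_entry(ROB_ENTRIES, ROB_INSTRUCTION_CAPACITY)

# Committing 3 entries per cycle, each holding `per_entry` instructions.
peak_retire = COMMIT_WIDTH_ENTRIES * per_entry

print(per_entry, peak_retire)
```

Under this reading, the 192-instruction window and the 9-wide retire rate both follow from the same per-entry packing factor.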
2. Cache Hierarchy and Memory Subsystem
The C910 employs a multi-level cache hierarchy to maximize spatial and temporal locality, offering distinctive performance characteristics.
- L1 Caches:
- Instruction and data caches are each configured as 32 KB, 4-way set-associative with 64-byte lines (6-stage variant) or 64 KB, 2-way set-associative (12-stage variant), both employing physically indexed, physically tagged (PIPT) mapping.
- Write-back policy is used, with data caches supporting unaligned accesses via microcode.
- L2 Cache:
- Select C910 variants include a 256 KB, 8-way set-associative unified L2, inclusive of L1, with 12-cycle access latency beyond L1.
- Other Microarchitectural Features:
- Micro-op cache (MOCache): up to 32 decoded macro-ops, reducing decode time by approximately 2 cycles on MOCache hits.
- Store-to-load forwarding is implemented via a dedicated buffer, introducing a variable-latency path (6–8 cycles).
- Dual-issue of integer × integer or integer × load pairs.
- Aggressive branch prediction, with a misprediction penalty of 12–16 cycles.
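The cache parameters above determine how a physical address splits into tag, set index, and line offset. The sketch below works this out for the listed configurations. The class `CacheGeometry` is a hypothetical helper, and the 64-byte line size is an assumption for the 64 KB L1 variant and for the L2, since the text states it only for the 32 KB L1.

```python
from dataclasses import dataclass
from typing import Tuple

KIB = 1024


@dataclass(frozen=True)
class CacheGeometry:
    """Hypothetical helper describing a set-associative cache with power-of-two sizes."""
    size_bytes: int
    ways: int
    line_bytes: int

    @property
    def sets(self) -> int:
        return self.size_bytes // (self.ways * self.line_bytes)

    @property
    def offset_bits(self) -> int:
        return (self.line_bytes - 1).bit_length()

    @property
    def index_bits(self) -> int:
        return (self.sets - 1).bit_length()

    def split(self, paddr: int) -> Tuple[int, int, int]:
        """Split a physical address into (tag, set index, line offset)."""
        offset = paddr & (self.line_bytes - 1)
        index = (paddr >> self.offset_bits) & (self.sets - 1)
        tag = paddr >> (self.offset_bits + self.index_bits)
        return tag, index, offset


l1_small = CacheGeometry(32 * KIB, ways=4, line_bytes=64)   # line size from the text
l1_large = CacheGeometry(64 * KIB, ways=2, line_bytes=64)   # line size assumed
l2 = CacheGeometry(256 * KIB, ways=8, line_bytes=64)        # line size assumed

for name, cache in [("L1 32K", l1_small), ("L1 64K", l1_large), ("L2 256K", l2)]:
    print(name, cache.sets, cache.index_bits, cache.offset_bits)

tag, index, offset = l1_small.split(0x80001234)  # arbitrary example address
```

With these assumptions the 32 KB L1 has 128 sets, while the 64 KB L1 and the 256 KB L2 each have 512.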
3. Standards Compliance and Interfaces
Originally, the C910 utilized several proprietary extensions, which affected portability and standards compliance.
- Proprietary Features:
- Non-standard T-Head AXI-ACE interface with decrement-burst protocol extensions.
- Integrated CLINT/PLIC interrupt controllers, non-parameterizable debug module with custom CSRs and non-RISC-V-compliant debug handshake.
- Modifications for Compliance:
- Removal of on-core proprietary controllers in favor of standard, open-source CLINT and PLIC.
- Implementation of RISC-V-compliant debug CSR set (dcsr, dpc, dscratch), along with standard debug request mode, decode-time debug exceptions, and wfi-wake.
- Standardization of memory interface by removing proprietary L2 and replacing AXI-ACE decrements with AXI4 incremental bursts (Fu et al., 30 May 2025).
4. Performance, Area, and Energy Efficiency
Systematic benchmarking places the C910’s throughput and efficiency in the context of other advanced RISC-V cores (CVA6, CVA6S+).
| Core | IPC | Area (mm²) | Power (mW) | AreaEff (GOPS/mm²) | EnergyEff (GOPS/W) |
|---|---|---|---|---|---|
| CVA6 | 0.70 | 0.95 | 70 | lowest | lowest |
| CVA6S+ | 0.94 | 1.01 | 90 | highest | medium |
| C910 | 1.61 | 1.66 | 145 | 2nd | highest |
- C910 Results (≈900 MHz, GF22FDX silicon):
- IPC: 1.61 (Embench-IoT, warm L1)
- Area: 1.66 mm² (+75% over scalar CVA6)
- Power: ≈145 mW
- Area efficiency: 2nd among tested cores
- Energy efficiency: highest, ≈1.3× better than CVA6S+ and 1.8× better than CVA6
- Trade-offs:
- The C910 delivers the highest instruction throughput (+119.5% IPC vs. scalar CVA6) and leads in energy efficiency above 500 MHz, which is attributed to aggressive clock gating and deep OoO parallelism.
- At frequencies ≤500 MHz, simple in-order designs can outperform the C910 in area-energy efficiency due to reduced leakage.
This suggests that, contrary to the common view, high IPC through superscalar and OoO techniques does not always entail a disproportionate penalty in area or energy, particularly at higher operating frequencies (Fu et al., 30 May 2025).
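The table reports area and energy efficiency only as rankings. As a rough illustration of how such metrics are typically derived, the sketch below computes them for the C910 from the figures listed above. It assumes one operation per retired instruction (so throughput in GOPS equals IPC times frequency in GHz) and that the IPC, area, and power figures all refer to the same ≈900 MHz operating point. The function `efficiency` is hypothetical, and the resulting values are illustrative, not results reported by the cited study.

```python
from typing import Dict


def efficiency(ipc: float, freq_ghz: float, area_mm2: float, power_mw: float) -> Dict[str, float]:
    """Derive throughput and efficiency metrics from per-core figures.

    Assumption: one operation per instruction, so GOPS = IPC * f[GHz].
    """
    gops = ipc * freq_ghz
    return {
        "gops": gops,
        "area_eff_gops_per_mm2": gops / area_mm2,
        "energy_eff_gops_per_w": gops / (power_mw / 1000.0),
    }


# C910 figures as listed above (IPC 1.61, ~900 MHz, 1.66 mm^2, ~145 mW).
c910 = efficiency(ipc=1.61, freq_ghz=0.9, area_mm2=1.66, power_mw=145.0)

for metric, value in c910.items():
    print(f"{metric}: {value:.3f}")
```

Comparing cores this way requires each core's own operating frequency, which the table does not list, so the sketch deliberately stops at a single core.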
5. Microarchitectural Timing and Side-Channel Vulnerabilities
An in-depth assessment of the C910’s microarchitectural timing reveals a complex spectrum of memory access latencies, with direct implications for side-channel security.
- Timing Classes and Measured Latencies:
| Timing Class | Latency (cycles) |
|---|---|
| L1 D-cache hit | 4 |
| L1 miss → L2 hit | 16 |
| L2 miss → DRAM | ≈150 |
| Micro-op cache hit | 2 |
| Store-to-load forwarding | 6–8 |
| Branch mispredict penalty | 14 |
- Distinct Timing Types:
- The C910 presents 7 distinct cache/memory latency types, a richer set than comparable RISC-V cores (U54: 4; U74: 5 timing types).
- The additional timing types arise from the micro-op cache fast path, store-to-load forwarding, and deeper branch prediction (e.g., branch recovery interference).
- Side-Channel Assessment:
- Using a suite of 120 cache-timing patterns, 45 patterns (37.5%) show more than one timing cluster on the C910 and are therefore classed as “vulnerable”, while 8 patterns (6.7%) are indistinguishable across all tested cores.
- This indicates that C910’s advanced features (micro-op cache, complex forwarding, speculative execution) expose a broader timing attack surface compared to more predictable in-order pipelines (Austa et al., 9 Oct 2025).
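The "more than one timing cluster" criterion can be illustrated with a simple gap-based grouping of measured latencies. The source does not state how clusters were identified, so the method, the `gap` threshold, and the function names below are assumptions. The sample data are synthetic, loosely modeled on the L1-hit and L2-hit latencies in the table above.

```python
from typing import List, Sequence


def timing_clusters(latencies: Sequence[int], gap: int = 3) -> List[List[int]]:
    """Group latencies after sorting; a new cluster starts when the jump exceeds `gap`."""
    clusters: List[List[int]] = []
    for value in sorted(latencies):
        if clusters and value - clusters[-1][-1] <= gap:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return clusters


def is_vulnerable(latencies: Sequence[int], gap: int = 3) -> bool:
    """A pattern is flagged when its timings fall into more than one cluster."""
    return len(timing_clusters(latencies, gap)) > 1


# Synthetic cycle counts, not measurements from the cited study.
hit_vs_miss = [4, 4, 5, 16, 17, 16, 4]   # mixes ~4-cycle and ~16-cycle accesses
uniform = [4, 5, 4, 4, 5]                # all accesses look alike

print(timing_clusters(hit_vs_miss))
print(is_vulnerable(hit_vs_miss), is_vulnerable(uniform))
```

A pattern whose timings separate into distinct groups lets an observer infer which microarchitectural path was taken, which is what makes it exploitable.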
6. Countermeasures and Mitigation Strategies
Several countermeasures are recommended to address the microarchitectural timing leakage present in the C910:
- Disable/partition the micro-op cache when running code with secret-dependent branches to suppress timing variation arising from MOCache hits.
- Force critical loads through constant-latency paths by introducing dummy stores to negate store-to-load forwarding effects.
- Insert balanced memory fences (fence.i/fence) to regularly flush micro-op and L1 caches between secret-dependent regions.
- Leverage cache coloring or way-partitioning at runtime to mitigate cross-domain cache eviction channels.
- Compile security-critical routines using “flat” memory-access profiles to enforce constant-latency execution irrespective of input-dependent control or data flow.
These recommendations are drawn directly from empirical findings on timing leakage in the C910 and apply generally to RISC-V designs aiming for side-channel resistance (Austa et al., 9 Oct 2025).
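One common way to obtain a "flat" memory-access profile is to touch every table entry on each lookup and select the wanted value with a mask, so that the sequence of addresses does not depend on the secret index. The sketch below illustrates only that access pattern. It is not taken from the cited study, `flat_lookup` is a hypothetical name, and Python itself offers no timing guarantees. A production version would be written in C or assembly, with the compiled output checked.

```python
from typing import Sequence


def flat_lookup(table: Sequence[int], secret_index: int) -> int:
    """Return table[secret_index] while reading every entry in the same order.

    Assumes non-negative indices smaller than 2**63.
    """
    result = 0
    for i, entry in enumerate(table):
        # diff is 0 only for the wanted entry; (diff - 1) >> 63 is then -1
        # (all ones) and 0 for every other index, with no data-dependent branch.
        diff = i ^ secret_index
        mask = (diff - 1) >> 63
        result |= entry & mask
    return result


sbox = [10, 20, 30, 40]   # illustrative table
value = flat_lookup(sbox, 2)
print(value)
```

The cost is a full table scan per lookup, which is the usual trade-off of this style of mitigation: uniform latency in exchange for lower throughput.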
7. Comparative Analysis and Broader Implications
Compared to other open-source and industrial RISC-V cores such as SiFive’s U54 and U74 and the open-source CVA6/CVA6S+, the C910 demonstrates:
- Microarchitectural Complexity:
- The C910 is the only evaluated core with a micro-op cache and a deep OoO pipeline, which results in more distinct timing types and higher side-channel exposure.
- Performance Leadership:
- Outperforms both in-order scalar and dual-issue RISC-V cores in IPC and energy efficiency beyond mid-frequency ranges.
- Compliance and Integration:
- Required significant modification to proprietary interfaces and protocols for integration into open-source platforms and standard SoC interconnects.
A plausible implication is that as RISC-V cores adopt deeper pipelines and aggressive speculation for higher performance, extensive microarchitectural analysis, covering both performance and security, is essential to manage new design and threat trade-offs (Fu et al., 30 May 2025, Austa et al., 9 Oct 2025).