Oobleck: Low-Compromise Design for Fault Tolerant Accelerators (2506.22654v1)

Published 27 Jun 2025 in cs.AR

Abstract: Data center hardware refresh cycles are lengthening. However, increasing processor complexity is raising the potential for faults. To achieve longevity in the face of increasingly fault-prone datapaths, fault tolerance is needed, especially in on-chip accelerator datapaths. Previously researched methods for adding fault tolerance to accelerator designs require high area, lowering chip utilisation. We propose a novel architecture for accelerator fault tolerance, Oobleck, which leverages modular acceleration to enable fault tolerance without burdensome area requirements. To streamline development and enforce modular conventions, we introduce the Viscosity language, an actor-based approach to hardware-software co-design. Viscosity uses a single description of the accelerator's function and produces both hardware and software descriptions. Our high-level models of data centers indicate that our approach can decrease the number of failure-induced chip purchases inside data centers without affecting aggregate throughput, thus reducing data center costs. To show the feasibility of our approach, we present three case studies: FFT, AES, and DCT accelerators. We additionally profile performance under the key parameters affecting latency. Under a single fault, we can maintain speedups of between 1.7x and 5.16x for accelerated applications over purely software implementations. We show that further benefits can be achieved by adding hot-spare FPGAs to the chip.
