TEE for CDPUs: Trust in Accelerators
- Trusted Execution Environments (TEEs) for CDPUs are system architectures that provide strong isolation and cryptographic controls for heterogeneous accelerator workloads.
- They implement rigorous boundary management using dynamic PCIe partitioning, secure boot, and composite attestation to defend against advanced threat models.
- Performance evaluations demonstrate manageable overheads with low-latency context switches and high-throughput secure computing, enabling robust deployment in distributed cloud settings.
Trusted Execution Environments (TEEs) for Cryptographic and Domain-Specific Data Processing Units (CDPUs) constitute a class of system architectures and mechanisms dedicated to providing strong isolation and confidentiality/integrity guarantees to computation workloads that leverage not only CPUs but also heterogeneous accelerators such as GPUs, FPGAs, and custom AI engines. Unlike traditional CPU-centric TEEs, which confine protection to the main processor package, these systems address the challenge of extending trusted computing guarantees into a landscape where privacy-sensitive, compute/data-intensive operations are offloaded to highly parallel, accelerator-based architectures. Solutions in this domain produce formalized trust models, system-level boundary management, dynamic key management, and novel isolation primitives tailored for both hardware and firmware diversity, enabling secure multi-tenant high-throughput computation suitable for the contemporary disaggregated and distributed cloud setting.
1. Architectural Foundations and Models of Trust
TEE designs for CDPUs consistently recognize the inadequacy of the historical CPU-as-root-of-trust model for environments where critical workloads and their secrets must traverse and reside in accelerators owned, managed, or shared by untrusted software stacks. Proposals such as HETEE construct a system-level root of trust around a dedicated Security Controller (SC) that is physically decoupled from the host CPU and manages both the cryptographic boundary and accelerator contexts through dynamic PCIe partitioning (Zhu et al., 2019). This controller, running a minimal, verifiable OS and cryptographic toolset, serves as the anchor for key management, attestation, secure boot, and accelerator runtime multiplexing.
Composite enclave approaches generalize this further by allowing the enclave’s trust base to be composed dynamically from both CPU-protection mechanisms and measured firmware/driver logic on any directly attached accelerator, sensor, or I/O peripheral, with attestation spanning all domain members (Schneider et al., 2020). Hybrid designs introduce rack-level SCs for CDPUs lacking on-device TEE support, harnessing hardware enforcement on device interconnects (PCIe, NVMe) and in-proxy cryptography to provide equivalent guarantees at the bus or host boundary (Dhar et al., 2022).
Across all these designs, two fundamental architectural elements recur:
- Physical/cryptographic isolation of trusted and untrusted domains: By leveraging switchable PCIe topologies, IOMMU/VT-d, RISC-V PMP, or on-chip partitioned memory (BRAM), the trusted domain is rendered both physically and logically opaque to the host and other tenants.
- Authenticated task and data queues mediated by secure, in-memory or on-chip buffers: Computation offloads take the form of encrypted, signed payloads, with all critical access—DMA setup, kernel launch, result fetch—controlled by trusted code.
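As a rough illustration of the second element, the sketch below models a task queue whose entries carry a sequence number and an authentication tag. It is not taken from any of the cited systems: the class name `AuthenticatedQueue` is hypothetical, HMAC-SHA256 stands in for the authenticated encryption a real design would use, and payload encryption is omitted so that the example needs only the Python standard library.

```python
import hashlib
import hmac
import struct


class AuthenticatedQueue:
    """Toy task queue; each entry is (sequence number, payload, HMAC tag)."""

    def __init__(self, session_key: bytes):
        self._key = session_key
        self._send_seq = 0  # next sequence number the sender will use
        self._recv_seq = 0  # next sequence number the receiver will accept
        self._slots = []    # stands in for a PCIe-mapped or on-chip buffer

    def _tag(self, seq: int, payload: bytes) -> bytes:
        # Binding the tag to the sequence number lets the receiver detect
        # replayed or reordered entries as well as modified payloads.
        message = struct.pack(">Q", seq) + payload
        return hmac.new(self._key, message, hashlib.sha256).digest()

    def submit(self, payload: bytes) -> None:
        seq = self._send_seq
        self._slots.append((seq, payload, self._tag(seq, payload)))
        self._send_seq += 1

    def fetch(self) -> bytes:
        seq, payload, tag = self._slots.pop(0)
        if seq != self._recv_seq:
            raise ValueError("unexpected sequence number (replay or reorder)")
        if not hmac.compare_digest(tag, self._tag(seq, payload)):
            raise ValueError("authentication tag mismatch")
        self._recv_seq += 1
        return payload


if __name__ == "__main__":
    queue = AuthenticatedQueue(b"\x01" * 32)
    queue.submit(b"launch kernel A")
    print(queue.fetch())
```

In this model the untrusted host may still drop or delay entries, but any modification of a queued payload is rejected when the trusted side fetches it.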
2. Threat Models and Security Goals
The threat landscape for TEE in CDPU deployments encompasses (a) fully compromised host OS and hypervisors, (b) kernel-level malware capable of snooping or tampering with memory and bus traffic, (c) hardware attackers with access to PCIe or internal buses (though chip package or fuse tampering remains out of scope in most models), and (d) rogue co-tenant enclaves, unmeasured device firmware, or DMA attacks (Zhu et al., 2019, Dhar et al., 2022, Schneider et al., 2020).
Key security properties sought are:
- Confidentiality and integrity of code, data, and intermediate results within trusted domains
- Detection/elimination of accelerator context tampering or persistence upon boundary switch by means of context resets and full memory clearing
- Remote attestation of trusted software/firmware/hardware combinations
- Cryptographically enforced access controls for both vertical (platform, OS, enclave separation) and horizontal (multi-tenant) isolation
- Non-leakage of secrets into shared CPU or accelerator microarchitectural state, and resistance to transient-execution and side-channel exfiltration (Chakraborty et al., 2023)
These properties are realized via formal measurement, persistent key storage tethered to trusted hardware, signed reports bound to device identities and code hashes, and cryptographic wrapping of all off-domain memory exchanges.
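A minimal sketch of how a measurement and a signed report might fit together is shown below. All names (`composite_measurement`, `sign_report`, `verify_report`) and the report layout are assumptions made for illustration. The keyed HMAC is a symmetric stand-in for the asymmetric signature that a device-unique private key would produce in a real attestation flow.

```python
import hashlib
import hmac
import json


def measure(blob: bytes) -> str:
    """Hash one code or firmware image."""
    return hashlib.sha256(blob).hexdigest()


def composite_measurement(components: dict, connections: list) -> str:
    """Hash the per-component measurements together with the connection graph."""
    doc = {
        "components": {name: measure(blob) for name, blob in components.items()},
        # Sorting makes the result independent of the order edges are listed in.
        "connections": sorted([src, dst] for src, dst in connections),
    }
    encoded = json.dumps(doc, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(encoded).hexdigest()


def _report_body(device_id: str, measurement: str, nonce_hex: str) -> bytes:
    return f"{device_id}|{measurement}|{nonce_hex}".encode()


def sign_report(device_key: bytes, device_id: str, measurement: str, nonce: bytes) -> dict:
    """Bind a measurement to a device identity and a verifier-chosen nonce."""
    body = _report_body(device_id, measurement, nonce.hex())
    tag = hmac.new(device_key, body, hashlib.sha256).hexdigest()  # signature stand-in
    return {"device_id": device_id, "measurement": measurement,
            "nonce": nonce.hex(), "tag": tag}


def verify_report(device_key: bytes, report: dict, expected: str, nonce: bytes) -> bool:
    body = _report_body(report["device_id"], report["measurement"], report["nonce"])
    good_tag = hmac.new(device_key, body, hashlib.sha256).hexdigest()
    return (hmac.compare_digest(report["tag"], good_tag)
            and report["measurement"] == expected
            and report["nonce"] == nonce.hex())
```

A verifier that knows the expected measurement would accept a report only if the tag, the measurement, and its own fresh nonce all match, so a changed firmware image or a replayed report is rejected.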
3. Isolation and Boundary Control Mechanisms
TEE for CDPUs leverages several techniques for strong domain separation:
- Dynamic Device Boundary Management: HETEE’s use of a programmable PCIe switch allows on-demand, authenticated reallocation of accelerators between untrusted host control and the secure domain. Mode switching entails a sequence—encrypted host command, controller-initiated context wipe (e.g., full DRAM and register reset for GPUs, ~55 ms), switch reconfiguration (<2 ms), and driver (re)enumeration; total device switch latency is ~60 ms (Zhu et al., 2019).
- On-Chip and Bus-Level Isolation: FPGA-based TEEOD uses physical on-chip segmentation (BRAM hulls, per-enclave soft-core gating) such that only trusted communication agents may marshal messages or memory transfers between PS and logic, while the host OS is structurally denied even indirect enclave access. Secure boot is achieved through AES-CMAC–authenticated bitstreams rooted in eFuse/PUF hardware keys (Pereira et al., 2021).
- Configurable TCB Composition: Composite enclaves allow per-unit assignment of both hardware and software trust bases, with region-level access policies and connection graphs forming the basis for formal attestation. For instance, in the Keystone-based prototype, each peripheral or accelerator is governed by a minimal, measured firmware loaded only when needed; shared buffers are protected via PMP, and dynamic region mapping/disconnects ensure there is no stale or unprotected data at any point (Schneider et al., 2020).
- Bus and Memory Crypto-Filtering: When legacy accelerators or storage devices lack on-device protection, rack-level SCs enforce address-range checks and data plane encryption (AES-GCM) for all MMIO/DMA flows, accompanied by per-job or per-session keying. Devices with integrated SMs (e.g., AI accelerators, SSDs) implement their own Memory Protection Engines (MPEs) and Access Control Units (ACUs), with hardware-enforced command/data channel segmentation (Dhar et al., 2022).
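The mode-switch sequence described for HETEE in the first item can be sketched as a small state model. This is an illustration under stated assumptions, not HETEE's implementation: the names `Accelerator` and `switch_mode` are hypothetical, command authentication is reduced to a boolean, and the 3 ms re-enumeration figure is an assumed remainder chosen so the steps sum to the reported total of about 60 ms (the text gives only the wipe time, the switch bound, and the total).

```python
from dataclasses import dataclass, field
from enum import Enum


class Owner(Enum):
    HOST = "untrusted host"
    SECURE = "secure domain"


WIPE_MS = 55.0       # context wipe, from the text
SWITCH_MS = 2.0      # switch reconfiguration, upper bound from the text
ENUMERATE_MS = 3.0   # assumption: not reported, chosen to reach ~60 ms


@dataclass
class Accelerator:
    owner: Owner = Owner.HOST
    memory: bytearray = field(default_factory=lambda: bytearray(b"tenant data"))
    steps: list = field(default_factory=list)


def switch_mode(dev: Accelerator, target: Owner, command_authentic: bool) -> float:
    """Move a device across the trust boundary; return modeled latency in ms."""
    if not command_authentic:
        raise PermissionError("mode-switch command failed authentication")
    if dev.owner is target:
        return 0.0
    elapsed = 0.0
    # 1. Wipe device state so nothing persists across the boundary.
    dev.memory[:] = bytes(len(dev.memory))
    dev.steps.append("wipe")
    elapsed += WIPE_MS
    # 2. Reconfigure the switch so the device is attached to the new domain.
    dev.owner = target
    dev.steps.append("reconfigure")
    elapsed += SWITCH_MS
    # 3. Re-enumerate the device in the domain that now owns it.
    dev.steps.append("enumerate")
    elapsed += ENUMERATE_MS
    return elapsed


if __name__ == "__main__":
    gpu = Accelerator()
    print(switch_mode(gpu, Owner.SECURE, command_authentic=True), gpu.steps)
```

The ordering matters in this model: the wipe happens before ownership changes, so the receiving domain never observes the previous owner's data.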
4. Key Management, Remote Attestation, and Secure Boot
Robust key management in TEE for CDPUs is multi-layered:
- Device-unique private keys burned at manufacture, used for initial platform and controller boot attestation
- Session keys derived via authenticated (typically Diffie–Hellman) exchange after remote attestation
- Persistent secrets (e.g., enclave launch keys, provisioning keys) bound to cryptographic coprocessors (TPM) and supporting Enhanced Authorization Policies (EAP) and PCRs to enforce enclave identity and locality context (Chakraborty et al., 2023)
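The session-key layer can be sketched as a Diffie-Hellman exchange followed by a key-derivation step. The example below is a toy under loud assumptions: the group parameters `P` and `G` are far too small for real use, the exchange is shown without the authentication that attestation would provide, and HKDF (RFC 5869) is one common choice of derivation function that the cited papers are not stated to use.

```python
import hashlib
import hmac
import secrets

# Toy group parameters, for readability only. NOT secure.
P = 2**127 - 1  # a Mersenne prime
G = 3


def dh_keypair():
    """Return (private exponent, public value) for the toy group."""
    private = secrets.randbelow(P - 3) + 2  # uniform in [2, P - 2]
    return private, pow(G, private, P)


def dh_shared(private: int, peer_public: int) -> bytes:
    """Compute the shared secret as a fixed-width byte string."""
    return pow(peer_public, private, P).to_bytes(16, "big")


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF extract-then-expand with HMAC-SHA256."""
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()  # extract
    okm, block, counter = b"", b"", 1
    while len(okm) < length:                            # expand
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


if __name__ == "__main__":
    client_priv, client_pub = dh_keypair()
    ctrl_priv, ctrl_pub = dh_keypair()
    nonce = secrets.token_bytes(16)  # e.g. the attestation nonce, used as salt
    k_client = hkdf_sha256(dh_shared(client_priv, ctrl_pub), nonce, b"task-channel")
    k_ctrl = hkdf_sha256(dh_shared(ctrl_priv, client_pub), nonce, b"task-channel")
    print(k_client == k_ctrl)
```

Using a distinct `info` label per purpose (task channel, result channel, and so on) yields independent keys from one exchange.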
In TALUS, secrets essential to the enclave’s lifecycle (SGX launch key, QE key) are moved into the TPM, never touching CPU cache/memory. Access to these secrets is mediated solely by policy-enforcing microcode at TPM locality 4; enclave keys can only be derived for enclaves meeting EAP criteria. Key derivation, sealing/unsealing, and quote signing use hardware TPM primitives, with the enclave’s ephemeral keys delivered only into trusted CPU registers. AES-GCM–based authenticated encryption is the standard for task, data, and command packaging across all protocols (Zhu et al., 2019, Dhar et al., 2022, Chakraborty et al., 2023).
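The idea of releasing a secret only to an enclave whose measurement satisfies a policy can be illustrated with a simplified model of PCR-gated unsealing. The `ToyTPM` class below is hypothetical and does not reproduce the TPM 2.0 command set or TALUS's design; localities and Enhanced Authorization Policies are not modeled. It keeps only the extend rule (new PCR value = hash of old value concatenated with the measurement digest) and a comparison against the PCR value recorded at seal time.

```python
import hashlib
import hmac


def extend(pcr: bytes, measurement: bytes) -> bytes:
    """PCR-style extend: fold a measurement digest into the running value."""
    return hashlib.sha256(pcr + hashlib.sha256(measurement).digest()).digest()


def expected_pcr(measurements: list) -> bytes:
    """PCR value reached after extending the given measurements in order."""
    pcr = bytes(32)
    for m in measurements:
        pcr = extend(pcr, m)
    return pcr


class ToyTPM:
    """Simplified model: one PCR and a table of policy-bound secrets."""

    def __init__(self):
        self.pcr = bytes(32)
        self._sealed = {}  # name -> (required PCR value, secret)

    def extend(self, measurement: bytes) -> None:
        self.pcr = extend(self.pcr, measurement)

    def seal(self, name: str, secret: bytes, required_pcr: bytes) -> None:
        self._sealed[name] = (required_pcr, secret)

    def unseal(self, name: str) -> bytes:
        required_pcr, secret = self._sealed[name]
        if not hmac.compare_digest(required_pcr, self.pcr):
            raise PermissionError("current PCR value does not satisfy the policy")
        return secret


if __name__ == "__main__":
    tpm = ToyTPM()
    tpm.seal("launch-key", b"\x42" * 16, expected_pcr([b"enclave-v1"]))
    tpm.extend(b"enclave-v1")
    print(tpm.unseal("launch-key").hex())
```

Because the extend operation is one-way and order-sensitive, a platform that loaded a different enclave cannot reach the required PCR value and the secret stays sealed.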
5. Programming Models and Workflow
Programming abstractions for CDPU-aware TEEs are designed to minimize changes for application developers while enforcing cryptographic wrappers and task-level isolation.
- HETEE’s API: Host-side logic creates an attested session, prepares tasks (kernel binaries, input data), encrypts them, transmits via PCIe-mapped queues, and receives authenticated results, all via a standard API. Controller-side logic validates, decrypts, schedules, and atomically executes accelerator kernels; output is re-wrapped and returned (Zhu et al., 2019).
- Composite enclave model: Application enclaves and minimal driver enclaves exchange state via mapped shared regions, with lifecycle events (connect, disconnect) serialized by the security monitor, and attestation consolidating code, firmware, and connection graphs (Schneider et al., 2020).
- Middleware orchestration: In hybrid datacenter settings, job submission proceeds through mutual attestation of all TEE nodes, triggers multi-device key agreement, and distributes job-wide keys to all CDPUs and SCs. DMA/MMIO interception and mediation (e.g., using CUDA and xDMA hooks) enables transparent protection for legacy accelerator interfaces (Dhar et al., 2022).
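The composite-enclave lifecycle in the second item can be sketched as a monitor that serializes connect and disconnect events and scrubs a shared region before unmapping it. The `SecurityMonitor` class and its methods are hypothetical names for illustration; a real security monitor would enforce the mapping in hardware (for example via PMP) rather than in a Python dictionary.

```python
import threading


class SecurityMonitor:
    """Toy monitor that owns shared regions and serializes lifecycle events."""

    def __init__(self):
        self._lock = threading.Lock()  # one lifecycle or access event at a time
        self._regions = {}             # region id -> {"buf": bytearray, "peers": set}

    def _checked(self, region_id: str, who: str) -> dict:
        # Caller must hold the lock.
        region = self._regions.get(region_id)
        if region is None or who not in region["peers"]:
            raise PermissionError("region not mapped for this enclave")
        return region

    def connect(self, region_id: str, size: int, app: str, driver: str) -> None:
        with self._lock:
            if region_id in self._regions:
                raise RuntimeError("region already mapped")
            self._regions[region_id] = {"buf": bytearray(size), "peers": {app, driver}}

    def write(self, region_id: str, who: str, offset: int, data: bytes) -> None:
        with self._lock:
            buf = self._checked(region_id, who)["buf"]
            if offset < 0 or offset + len(data) > len(buf):
                raise ValueError("write outside shared region")
            buf[offset:offset + len(data)] = data

    def read(self, region_id: str, who: str, offset: int, length: int) -> bytes:
        with self._lock:
            buf = self._checked(region_id, who)["buf"]
            if offset < 0 or length < 0 or offset + length > len(buf):
                raise ValueError("read outside shared region")
            return bytes(buf[offset:offset + length])

    def disconnect(self, region_id: str) -> None:
        with self._lock:
            region = self._regions.pop(region_id)
            # Scrub before release so no stale data outlives the connection.
            region["buf"][:] = bytes(len(region["buf"]))
```

After `disconnect`, both former peers lose access and the buffer contents are zeroed, which mirrors the requirement that no stale or unprotected data remains once a connection is torn down.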
6. Performance, Scalability, and Overhead
All TEE solutions for CDPUs introduce a performance-security tradeoff:
- Throughput and Latency: HETEE’s prototype reports average inference overhead of 12.34% and training overhead of 9.87% for large-scale DNN workloads (ImageNet, AlexNet, ResNet, VGG), with most additional latency arising from cryptographic tasks—crypto engine pipelining reduces cost by 40%, and DMA–compute overlap hides 70% of data transfer delay (Zhu et al., 2019).
- Resource Utilization: FPGA-based TEEOD can host up to 6 concurrent enclaves on Ultra96-V2, with per-enclave area overhead of 7.0% LUTs, 3.8% FFs, and 15.3% BRAMs. Context switch times are on the order of 50 ms (cold load), with mailbox command invocations in the sub-millisecond range (Pereira et al., 2021).
- Multi-tenant Scaling: Composite enclaves incur ~4.6% context-switch overhead regardless of shared buffer size; large accelerators with PMP-based isolation show an 11% frequency reduction and <1% area cost per cluster (Schneider et al., 2020). Hybrid SC-based datacenter designs report 0.42–8% application overhead on AI workloads, with rack-level hardware able to sustain >900 concurrent secured AI devices (Dhar et al., 2022). Azure-wrapped FPGAs and GPUs in similar settings observe MMIO/DMA encryption overheads of ~1.5–1.9%, with cryptographic pipelines and high-bandwidth transfers serving as effective mitigations.
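A few lines of arithmetic help put the reported figures in context. The helper names below are hypothetical, and the resource estimate assumes that per-enclave cost scales linearly and ignores static logic; the source does not state which resource actually limits TEEOD to 6 enclaves.

```python
def secured_time(baseline_s: float, overhead_pct: float) -> float:
    """Run time after applying a relative overhead."""
    return baseline_s * (1 + overhead_pct / 100)


def max_enclaves(per_enclave_pct: dict) -> int:
    """Largest enclave count that fits every resource under linear scaling."""
    return min(int(100 // share) for share in per_enclave_pct.values())


def amortized_switch_pct(switch_ms: float, job_ms: float) -> float:
    """Mode-switch latency as a percentage of job duration."""
    return 100 * switch_ms / job_ms


if __name__ == "__main__":
    # A 100 s inference job at the reported 12.34% average overhead.
    print(secured_time(100.0, 12.34))
    # Per-enclave shares reported for TEEOD on the Ultra96-V2.
    print(max_enclaves({"LUT": 7.0, "FF": 3.8, "BRAM": 15.3}))
    # A ~60 ms device switch amortized over a 10 s job.
    print(amortized_switch_pct(60.0, 10_000.0))
```

Under these assumptions BRAM is the binding resource: six enclaves would use 91.8% of it and a seventh would exceed 100%, which is consistent with the reported limit of 6. The last calculation suggests that a ~60 ms switch is small relative to jobs lasting seconds, but would dominate for very short offloads.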
7. Extensions, Variants, and Case Studies
TEE research for CDPUs is trending towards increased configurability, on-demand resource allocation, and composable trust. Key case studies include:
- HETEE: Demonstrates strong isolation and secure multi-GPU orchestration without the need to modify commodity hardware, supporting both inference and training workloads at scale (Zhu et al., 2019).
- TEEOD: Demonstrates the feasibility of multi-enclave, physically separated execution on reconfigurable SoCs, and ports existing TEE application binaries with minimal changes (Pereira et al., 2021).
- Composite enclaves: Show the integration of accelerator-backed unit enclaves, including complex many-core RISC-V accelerators, into a single attested and isolated computation environment with formally minimal driver/firmware TCB (Schneider et al., 2020).
- Hybrid datacenter architecture: Illustrates a migration path for legacy infrastructure by combining device-level and rack-level TEEs, providing security policy continuity across a heterogeneous accelerator pool (Dhar et al., 2022).
- TALUS: Achieves resilience to entire classes of microarchitectural and transient-execution vulnerabilities by offloading key management to discrete cryptoprocessors, thus structurally preventing secret exfiltration even under strong attacker models (Chakraborty et al., 2023).
A plausible implication is that, as accelerator vendors incrementally deploy on-device TEEs and bus-level security primitives (e.g., hardware ACUs, MPEs), the need for centralized wrapping diminishes, allowing end-to-end cryptographic guarantees to be realized with even lower overhead and tighter security invariants.
References:
- "Enabling Privacy-Preserving, Compute- and Data-Intensive Computing using Heterogeneous Trusted Execution Environment" (Zhu et al., 2019)
- "TALUS: Reinforcing TEE Confidentiality with Cryptographic Coprocessors" (Chakraborty et al., 2023)
- "Composite Enclaves: Towards Disaggregated Trusted Execution" (Schneider et al., 2020)
- "Empowering Data Centers for Next Generation Trusted Computing" (Dhar et al., 2022)
- "Towards a Trusted Execution Environment via Reconfigurable FPGA" (Pereira et al., 2021)