Hybrid TEE–Accelerator Pipelines
- Hybrid TEE–Accelerator Pipelines are integrated systems that securely combine CPU-based TEEs with specialized accelerators to support confidential high-throughput computations.
- They partition tasks by processing sensitive operations within trusted enclaves while offloading performance-critical workloads to GPUs, FPGAs, or other accelerators.
- By employing cryptographic masking, attestation, and hardware protection, these pipelines achieve significant speedups while keeping TEE overhead minimal in real-world deployments.
Hybrid TEE–Accelerator Pipelines are integrated hardware–software systems that combine the security properties of CPU-based Trusted Execution Environments (TEEs) with the computational throughput of specialized accelerators (e.g., GPUs, FPGAs). These pipelines are motivated by the need to support compute- and data-intensive workloads, such as confidential machine learning and secure Function-as-a-Service (FaaS), in environments where the host operating system or infrastructure is not trusted. Hybrid designs use programmatic partitioning, cryptographic masking, attestation, and memory protection primitives to deliver end-to-end confidentiality and integrity, all while exploiting accelerators for major computational workloads. Several designs, such as TGh, TwinShield, Composite Enclaves, HETEE, and ACAI, have concretely realized hybrid TEE–accelerator pipelines for a range of architectures (Choncholas et al., 2023, Xue et al., 4 Jul 2025, Schneider et al., 2020, Sridhara et al., 2023, Zhu et al., 2019).
1. Architectural Principles and Pipeline Partitioning
Hybrid TEE–accelerator architectures systematically divide computation between a trusted domain (CPU-based TEE or security controller) and an untrusted, but high-throughput, accelerator. Key partitioning strategies include:
- Confidential computation inside the TEE: Initialization, input preprocessing, masking, and sensitive checksums or verifications are performed in a minimal TEE enclave (e.g., Intel SGX, Arm CCA, RISC-V Keystone).
- Performance-critical computation offloaded: Linear algebra, neural network inference, encryption, and other throughput-intensive workloads are executed on an accelerator (GPU/FPGA), often on masked or encrypted data.
- Cryptographic or hardware enforcement: Masking schemes (e.g., additive secret-sharing, garbled circuits), physical memory protection (PMP, IOMMU, SMMU), and attestation protocols ensure that data remains confidential and results are verifiable.
A canonical data flow prepares and masks data in the TEE, transfers it to the accelerator for bulk computation, and recovers or verifies the results in the TEE before releasing outputs. Table 1 summarizes representative designs:
| System | Trust Boundary | Accelerator Type | Offload Fraction | Masking Technique |
|---|---|---|---|---|
| TGh | Intel SGX/TrustZone | Host CPU | ≈Majority | Garbled circuits (GC) |
| TwinShield | Intel SGX | GPU | ~87% | Additive masking, permuted masking |
| Composite Enclaves | RISC-V Keystone | FPGA, RISC-V accel | Kernel offload | PMP-protected DMA |
| ACAI | Arm CCA Realm | GPU/FPGA | Unrestricted | Encrypted memory, SPDM attestation |
| HETEE | Security controller | GPU/TPU/FPGA | Arbitrarily high | AES-GCM/HMAC authenticated traffic |
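The canonical mask-offload-unmask flow can be made concrete for a linear layer Y = XW. Below is a minimal Python sketch of additive masking over a prime field, in the style of TwinShield-like designs; the function names and field choice are illustrative, not taken from any of the cited systems.

```python
import random

P = 2**61 - 1  # prime modulus for the masking field (illustrative choice)

def matmul(A, B):
    # schoolbook matrix multiply over Z_P; stands in for the bulk compute step
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) % P for j in range(m)]
            for i in range(n)]

def mask(X):
    # TEE side: draw a uniform one-time mask R and release only X + R (mod P)
    R = [[random.randrange(P) for _ in row] for row in X]
    masked = [[(x + r) % P for x, r in zip(xr, rr)] for xr, rr in zip(X, R)]
    return masked, R

def unmask(Y_masked, R, W):
    # TEE side: (X + R)W = XW + RW, so subtract RW to recover the true product
    RW = matmul(R, W)
    return [[(y - c) % P for y, c in zip(yr, cr)] for yr, cr in zip(Y_masked, RW)]

X = [[3, 1], [4, 1]]            # sensitive input (stays inside the TEE)
W = [[5, 9], [2, 6]]            # model weights
X_masked, R = mask(X)           # TEE: mask before leaving the enclave
Y_masked = matmul(X_masked, W)  # accelerator: bulk compute on masked data only
Y = unmask(Y_masked, R, W)      # TEE: recover the result
assert Y == matmul(X, W)
```

In a real pipeline the correction term RW is precomputed offline or offloaded under a second mask, so the accelerator only ever observes uniformly random matrices.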
2. Protocol Workflows and Trust Models
Each hybrid TEE–accelerator system employs a detailed protocol workflow to enforce confidentiality and integrity under strong adversarial assumptions.
TGh (Choncholas et al., 2023)
TGh offloads the heavy computation phase of a function to an untrusted host using a garbled-circuit protocol, with the enclave generating garbled tables, providing masked wire labels, and verifying circuit outputs. The TEE hardware (e.g., CPU and enclave page cache) is trusted, but the host OS and all application software are adversarial. The host learns only circuit structure and masked input bits.
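The enclave's role as garbler can be sketched for a single AND gate with free-XOR labels. This is a textbook toy construction (hash-based gate encryption with a zero-tag), not TGh's actual PRF/AES instantiation; all names are illustrative.

```python
import hashlib, os, random

K = 16  # wire-label length in bytes

def xor(x, y):
    return bytes(a ^ b for a, b in zip(x, y))

def H(a, b):
    # hash stands in for the PRF/AES-based gate cipher used in practice
    return hashlib.sha256(a + b).digest()  # 32 bytes: 16 for label, 16 for tag

delta = os.urandom(K)  # global free-XOR offset: label_1 = label_0 XOR delta

def new_wire():
    l0 = os.urandom(K)
    return (l0, xor(l0, delta))

def garble_and(a, b, c):
    # enclave/garbler: encrypt output label c[i AND j] under input labels a[i], b[j]
    rows = [xor(H(a[i], b[j]), c[i & j] + bytes(K)) for i in (0, 1) for j in (0, 1)]
    random.shuffle(rows)  # hide which row corresponds to which input pair
    return rows

def eval_and(table, A, B):
    # untrusted evaluator: only the correct row decrypts to label || 16 zero bytes
    for row in table:
        pt = xor(H(A, B), row)
        if pt[K:] == bytes(K):
            return pt[:K]
    raise ValueError("no row decrypted cleanly")

a, b, c = new_wire(), new_wire(), new_wire()
table = garble_and(a, b, c)
assert eval_and(table, a[1], b[1]) == c[1]   # 1 AND 1 = 1
assert eval_and(table, a[1], b[0]) == c[0]   # 1 AND 0 = 0
assert xor(a[0], b[1]) == xor(a[1], b[0])    # free-XOR: XOR gates need no table
```

The evaluator holds one label per wire and learns nothing about the underlying bits; only the enclave, which keeps the label-to-bit mapping, can interpret the output.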
TwinShield (Xue et al., 4 Jul 2025)
TwinShield's protocol uses a secure enclave to mask inputs and model parameters, offloads most linear and non-linear computations (including attention and softmax) to the GPU, and performs final integrity checks (U-Verify) in the TEE. All masking randomness and integrity parameters are generated inside the enclave; the adversary controls the OS, GPU, and interconnect, with only the TEE excluded.
Composite Enclaves (Schneider et al., 2020)
Composite Enclaves create a configurable trust boundary using chained PMP control and driver enclaves for each attached accelerator or peripheral. Only the required drivers/firmware and the RISC-V SM are included in the TCB. The threat model assumes a remote attacker with control of OS, hypervisor, and potential hot-swapped devices—hardware and firmware under enclave control are trusted.
ACAI (Sridhara et al., 2023)
ACAI on Arm CCA leverages world-level (realm/normal) isolation, hardware memory encryption (MEC/MPE), and restricted SMMU mapping to securely expose device buffers to a realm VM. An explicit device-attestation and keying protocol (SPDM + PCIe IDE) establishes exclusive accelerator assignment. The adversary encompasses the hypervisor, physical interposers, all co-tenant VMs/devices, and all non-realm system software.
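The exclusive-assignment handshake reduces to a freshness-protected challenge-response over a device measurement. The Python sketch below uses HMAC as a stand-in for SPDM's certificate-based signatures; all names and message layouts are illustrative, not the ACAI or SPDM wire format.

```python
import hmac, hashlib, os

# A pre-shared key stands in for the device's certified signing key;
# real SPDM uses X.509 certificate chains and asymmetric signatures.
device_key = os.urandom(32)

def device_attest(nonce, firmware_measurement):
    # device side: bind the fresh nonce to the reported firmware measurement
    return hmac.new(device_key, nonce + firmware_measurement, hashlib.sha256).digest()

def monitor_verify(nonce, reported_measurement, evidence, expected_measurement):
    # monitor/realm side: recompute and compare; reject stale or tampered reports
    good = hmac.new(device_key, nonce + reported_measurement, hashlib.sha256).digest()
    return (hmac.compare_digest(good, evidence)
            and reported_measurement == expected_measurement)

expected = hashlib.sha256(b"accelerator-fw-v1").digest()
nonce = os.urandom(16)
evidence = device_attest(nonce, expected)
assert monitor_verify(nonce, expected, evidence, expected)
assert not monitor_verify(os.urandom(16), expected, evidence, expected)  # replay fails
```

Only after such a check succeeds would the monitor program the SMMU to map the device's buffers into the realm and derive PCIe IDE link keys.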
HETEE (Zhu et al., 2019)
HETEE inserts a standalone security controller on the PCIe bus, mediating encrypted, authenticated data transfer between an untrusted host and dynamically re-attached accelerators. Only the controller and its OS/crypto engine are trusted. The host and attached accelerators are considered untrusted until post-reset and attestation. The system is robust against direct DMA, bus observation, and host privilege escalation (excluding physical side-channels).
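The controller-mediated channel is essentially authenticated encryption over each PCIe payload. The toy Python version below uses a SHA-256 counter keystream plus HMAC in place of the controller's hardware AES-GCM engine; it illustrates the verify-before-dispatch discipline, not HETEE's actual implementation.

```python
import hmac, hashlib, os

enc_key, mac_key = os.urandom(32), os.urandom(32)  # session keys from attestation

def keystream(nonce, n):
    # toy counter-mode keystream; HETEE's controller uses hardware AES-GCM
    out, ctr = b"", 0
    while len(out) < n:
        out += hashlib.sha256(enc_key + nonce + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return out[:n]

def seal(nonce, payload):
    # host or controller side: encrypt, then MAC nonce || ciphertext
    ct = bytes(p ^ k for p, k in zip(payload, keystream(nonce, len(payload))))
    tag = hmac.new(mac_key, nonce + ct, hashlib.sha256).digest()
    return ct, tag

def open_(nonce, ct, tag):
    # controller side: verify integrity before decrypting or dispatching to a GPU
    expect = hmac.new(mac_key, nonce + ct, hashlib.sha256).digest()
    if not hmac.compare_digest(expect, tag):
        raise ValueError("authentication failed: drop the transfer")
    return bytes(c ^ k for c, k in zip(ct, keystream(nonce, len(ct))))

nonce = os.urandom(12)
ct, tag = seal(nonce, b"model weights + batch")
assert open_(nonce, ct, tag) == b"model weights + batch"
```

Because every transfer is sealed end to end, a compromised host that observes or replays PCIe traffic sees only ciphertext and cannot forge a payload the controller will accept.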
3. Cryptographic and Hardware Protection Mechanisms
Protection mechanisms used in hybrid TEE–accelerator pipelines are designed to guarantee data confidentiality, computation integrity, and sometimes function privacy, against a powerful adversary.
- Additive Masking / Secret Sharing: In TwinShield, input matrices/vectors are randomized by adding a uniform mask over a finite field; recovery subtracts the mask. This is information-theoretically secure provided the mask remains secret.
- Garbled Circuits: TGh uses free-XOR garbled circuits, assigning random wire labels and mask bits, so the untrusted evaluator learns nothing beyond circuit topology and input/output sizes.
- Integrity Verification: Probabilistic checks such as Freivalds' algorithm (for matrix multiplication), random row hashing, and random linear coefficients provide efficient integrity verification; U-Verify synthesizes these for non-linear functions in TwinShield.
- Physical Memory Protection (PMP/IOMMU/SMMU): Composite Enclaves and ACAI configure hardware PMP or SMMU/IOMMU entries so that only the enclave or device enclave (and no software entity on the host or accelerator) can access protected memory regions.
- Authenticated Encryption (e.g., AES-GCM, HMAC): HETEE applies AES-GCM to all code/data sent over the PCIe fabric, with HMAC tags for integrity; buffer access is strictly mediated by the controller.
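The probabilistic checking idea can be made concrete with Freivalds' algorithm: to verify a claimed product C = AB, multiply a random vector through both sides; A(Br) = Cr always holds for a correct C, and a wrong C is caught except with probability about 1/p per round. A minimal sketch over a prime field (illustrative, not TwinShield's U-Verify implementation):

```python
import random

P = 2**61 - 1  # prime field; a larger field shrinks the false-accept probability

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) % P for row in M]

def freivalds(A, B, C, rounds=3):
    # verify C == A @ B in O(n^2) per round instead of recomputing the product
    n = len(C[0])
    for _ in range(rounds):
        r = [random.randrange(P) for _ in range(n)]
        if matvec(A, matvec(B, r)) != matvec(C, r):
            return False  # definitely wrong
    return True  # a wrong C slips through with probability <= (1/P)**rounds

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[19, 22], [43, 50]]      # correct product
assert freivalds(A, B, C)
C_bad = [[19, 22], [43, 51]]  # single tampered entry
assert not freivalds(A, B, C_bad)
```

This asymmetry is what makes TEE-side verification cheap: the accelerator does the O(n^3) work, while the enclave spends only a few matrix-vector products to check it.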
4. Performance and Overhead Analysis
Hybrid pipelines are designed to amortize TEE overhead by transferring the computational bulk to accelerators; these approaches are quantitatively analyzed in the literature:
- TGh (Choncholas et al., 2023): The hybrid protocol is beneficial when enclave management instructions (ecalls/evictions) exceed 0.7–0.8% of program instructions. Pure TEE execution in short-lived FaaS kernels suffers from management overheads (e.g., context switching ≈10,000–17,000 cycles; enclave initialization ≈3 ms), whereas garbled circuits offloaded to the host/accelerator avoid most TEE bottlenecks.
- TwinShield (Xue et al., 4 Jul 2025): Offloading ~87% of FLOPs yields 4.0×–6.1× speedups over prior secure inference pipelines. Latency for BERT-Base drops from 1.294 s (TEE-only) to 0.216 s (TwinShield). The computational overhead of masking and integrity checks is dominated by GPU-side execution (>9× faster per attention matmul).
- Composite Enclaves (Schneider et al., 2020): The measured context-switch overhead is ≈220 cycles (~4.7%) over stock Keystone, independent of buffer size. DMA bus bandwidth for 1 MiB buffers is ≈5 GiB/s in FPGA prototypes. Total TCB addition is ≈8 KLoC, and offloading heavy workload kernels to accelerators incurs only ∼5% system cycle overhead.
- ACAI (Sridhara et al., 2023): For high-throughput GPU/FPGA workloads, ACAI protected mode incurs only 6.8–7.0% throughput penalty beyond baseline Arm CCA, compared to ≈98% slowdown using purely software cryptography over untrusted accelerators.
- HETEE (Zhu et al., 2019): Throughput overhead is ≈12.34% for inference and ≈9.87% for training vs. unprotected execution, even with full AES-GCM/HMAC protection. Controller bottlenecks (FPGA ARM-side driver) and task queueing are the predominant contributors to latency. Scaling to multiple GPUs approaches ideal linearity.
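The offload figures above follow an Amdahl-style model: if a fraction f of the work runs k times faster on the accelerator, end-to-end speedup is 1 / ((1 - f) + f/k). A quick sanity check against TwinShield's numbers (f ≈ 0.87, per-op GPU speedup > 9×); this is illustrative arithmetic, not a model from the cited papers.

```python
def offload_speedup(f, k):
    # Amdahl's law: fraction f accelerated by factor k, the rest stays on the TEE
    return 1.0 / ((1.0 - f) + f / k)

s = offload_speedup(0.87, 9.0)   # ~4.4x
assert 4.0 < s < 4.6             # consistent with the reported 4.0-6.1x range
```

The model also explains the TGh break-even condition: offloading only pays once the avoided TEE management cost exceeds the masking and transfer overhead introduced by the protocol.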
5. Real-World Deployments and Case Studies
Concrete deployments of hybrid TEE–accelerator pipelines encompass various hardware and system configurations:
- FPGA+Arduino Peripherals and Large-Scale RISC-V Accelerators: Composite Enclaves have been prototyped with Digilent Genesys 2 FPGA and a RISC-V cluster accelerator (4,096 cores) with PMP enforcement and context isolation, achieving secure shared-memory operation for peripherals and compute offload (Schneider et al., 2020).
- PCIe-based Security Controllers: HETEE is evaluated on a system combining an ARM-based FPGA controller, four NVIDIA Tesla M40 GPUs, and a Broadcom PCIe switch; the controller mediates task offload, encrypted dataflow, and dynamic device reassignment (Zhu et al., 2019).
- ARM CCA Realms with Accelerators: ACAI leverages the Arm FVP platform with RME support, hardware-attested device assignment, world isolation, and protected stage-2 mapping for PCIe-attached accelerators (Sridhara et al., 2023).
- Confidential Transformer Inference: TwinShield enables integrity-checked, confidentiality-preserving Transformer inference on ImageNet, SST-2, and WikiText using up to 7B-parameter models, demonstrating both high fidelity (accuracy drop ≤1.9%) and throughput scaling from 1–10.7× depending on token length (Xue et al., 4 Jul 2025).
6. Challenges, Limitations, and Future Directions
Current hybrid TEE–accelerator pipelines face both practical and theoretical limitations:
- Dynamic Device Management: Most architectures (e.g., ACAI) support accelerator attach/detach only at VM creation/destruction, precluding dynamic partitioning and multi-tenant sharing without further hardware support or secure firmware design (Sridhara et al., 2023).
- Side Channel and Physical Attacks: The majority of proposals consider cache, timing, and speculative-execution side channels as out-of-scope, though some designs, such as HETEE, employ physical mesh and controller hardening (Zhu et al., 2019).
- Scalability: PCIe switch/fabric capacity, lane counts, and protocol CPU bottlenecks (on controller or TEE) present scaling challenges for very large cluster deployments (Zhu et al., 2019, Schneider et al., 2020).
- Virtualization and Fine-grained Resource Control: Multi-tenant accelerator virtualization (e.g., Nvidia MIG) requires per-slice keying and routing, which remains an open design problem for most hybrid pipelines (Sridhara et al., 2023).
- Hardware/TCB Complexity: Keeping the TCB small while supporting diverse accelerators remains difficult; Composite Enclaves achieves minimal TCB expansion by confining trust to only the necessary drivers, runtimes, and firmware, a model extendable to other bus-connected accelerators with PMP/IOMMU slices and firmware attestation (Schneider et al., 2020).
A plausible implication is that continued hardware support for secure device isolation, in-fabric attestation, and high-performance cryptographic masking (e.g., field arithmetic, vectorized noise injection) will be essential for further improving the efficiency and usability of hybrid TEE–accelerator pipelines.
7. Summary Table of Representative Hybrid TEE–Accelerator Pipelines
| Paper / System | Core Trust Model | Accelerator Usage | Security Foundations | Performance Overhead |
|---|---|---|---|---|
| TGh (Choncholas et al., 2023) | SGX/TrustZone, enclave trusted | Host CPU, GC offload | Garbled circuits, PRF/AES | Avoids TEE mgmt. cost; wins above ~0.7% mgmt. instructions |
| TwinShield (Xue et al., 4 Jul 2025) | SGX enclave only | GPU (~87% FLOPs) | Additive masking, U-Verify | 4–6× faster than prior TEE ML |
| Composite Enclaves (Schneider et al., 2020) | RISC-V SM, drivers/firmware | FPGA/RISC-V accel. | PMP, IOMMU protected memory | ~5% cycle, 220-cycle context switch |
| ACAI (Sridhara et al., 2023) | Arm CCA realm VM, SMMU/monitor | PCIe GPU/FPGA | Granule Protection, SPDM | ~7% throughput overhead (vs. plain) |
| HETEE (Zhu et al., 2019) | Security controller only | Any PCIe GPU/TPU/FPGA | AES-GCM/HMAC; PCIe switch | ~12% inference, ~10% training |