TEE-Shielded On-Device Inference
- TEE-shielded on-device inference is a secure approach that employs trusted hardware, partition-before-training, and cryptographic obfuscation to protect ML models.
- It leverages techniques such as mirror designs, one-time pad encryption, and selective tensor shielding to balance performance with strong defenses against model-stealing and membership inference attacks.
- Empirical results show that methods like TEESlice and TensorShield achieve up to 25× speedup with minimal accuracy loss, ensuring confidentiality and integrity in hostile environments.
Trusted Execution Environment (TEE)-Shielded On-Device Inference refers to the suite of hardware-software architectures, partitioning strategies, firmware modifications, and associated cryptographic and obfuscation mechanisms that enable privacy-preserving, attack-resilient execution of machine learning models directly on user devices, without exposing the full model to untrusted code, memory, or accelerators. The paradigm is motivated by two imperatives: runtime efficiency (low-latency, high-throughput inference) and robust protection against model-stealing and membership inference attacks, especially in adversarial scenarios where end devices are user-controlled and attackers have access to rich public neural model repositories.
1. Threat Model and Security Objectives
TEE-shielded on-device inference is deployed within the hardware boundary of a Trusted Execution Environment (TEE), such as Intel SGX or ARM TrustZone. The standard threat model assumes the adversary controls the entire OS, drivers, and all resources outside the TEE, and can launch both passive and active attacks, including:
- Model-Stealing (MS): Constructing surrogate models via query synthesis, leveraging public pre-trained networks to recover proprietary architectures and weights.
- Membership Inference (MIA): Determining the presence of specific records in the training set, exploiting exposed activations, gradients, or scores.
Security objectives are:
- Confidentiality: All privacy-sensitive weights, intermediate activations, and training data are protected within the TEE; adversaries gain no more than black-box (label-only) access.
- Integrity: Computation over protected model components is immune to tampering.
- Performance: Inference latency and throughput should approach that of GPU or NPU execution outside the TEE.
Techniques such as partition-before-training, hardware-attested remote key provisioning, and one-time pad cryptography are used to uphold these guarantees in the presence of powerful, knowledgeable adversaries who possess vast model and data resources (Li et al., 2024, Zhang et al., 2023).
2. Taxonomy of TEE-Shielded On-Device Inference Approaches
Approaches differ in model partitioning logic, TEE–untrusted world interface design, and hardware acceleration capabilities. Representative categories include:
- Partition-After-Training ("train-then-partition"): The full DNN is first trained, then split into sensitive and insensitive partitions. Existing TSDP (TEE-shielded DNN partition) schemes—DarkneTZ, ShadowNet, SOTER—place a suffix, prefix, or random selection of layers/weights in the TEE but empirically fail to generalize security when the adversary leverages surrogate model initialization with public weights (Zhang et al., 2023, Mo et al., 2020, Sun et al., 2020).
- Partition-Before-Training: Privacy boundary is imposed prior to training. TEESlice (Li et al., 2024) inserts private "slices" (adapters) between backbone layers, learning to absorb all privacy-related functionality into a parameter-efficient TEE-resident subset.
- Mirror or Two-Branch Design: MirrorNet (Liu et al., 2023) employs a backbone (normal world) with a lightweight TEE "companion monitor" to rectify outputs, while TBNet (Liu et al., 2024) uses two parallel branches (TEE/REE) jointly trained then independently pruned.
- Obfuscation-Based: Amulet (Mao et al., 8 Dec 2025) and ShadowNet (Sun et al., 2020) transform both model parameters and activations via random masking, enabling full GPU/accelerator inference on obfuscated representations, with only I/O masking/unmasking requiring TEE interaction.
- Critical Tensor Shielding: TensorShield (Sun et al., 28 May 2025) assigns an importance score to each tensor via XAI attention metrics, shielding only those tensors for which leakage would enable model/MI attacks exceeding black-box baseline.
- Graph, GNN, and LLM Extensions: Partition-before-training with private adapters extends to GNNVault for GNNs (Ding et al., 20 Feb 2025), and LoRA-style private adapters for LLMs (Li et al., 2024), as well as specialized memory and job schedulers for on-device TrustZone/NPU LLM protection (Wang et al., 17 Nov 2025, Nayan et al., 22 Oct 2025, Abdollahi et al., 11 Apr 2025).
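The partition-before-training idea above can be illustrated with a toy sketch: a frozen public backbone runs in the untrusted world, while small low-rank residual adapters (the "slices") run inside the TEE. The layer sizes, adapter rank, and function names here are illustrative assumptions, not TEESlice's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical public backbone: two frozen linear layers with ReLU.
# Weights are assumed to come from a public model zoo, so the attacker
# is presumed to know them.
W1_pub = rng.standard_normal((16, 8))
W2_pub = rng.standard_normal((8, 4))

# Private TEE-resident "slices": low-rank residual adapters inserted
# after each backbone layer. Only these few parameters are trained and
# kept inside the enclave (toy sizes; real schemes keep them <=5% of FLOPs).
A1, B1 = rng.standard_normal((8, 2)), rng.standard_normal((2, 8))
A2, B2 = rng.standard_normal((4, 2)), rng.standard_normal((2, 4))

def backbone_layer(x, W):
    """Runs in the normal world / on the accelerator (public weights)."""
    return np.maximum(x @ W, 0.0)  # linear + ReLU

def tee_slice(h, A, B):
    """Runs inside the TEE: small residual adapter h + h @ A @ B."""
    return h + h @ A @ B

def shielded_forward(x):
    h = backbone_layer(x, W1_pub)  # untrusted world
    h = tee_slice(h, A1, B1)       # TEE
    h = backbone_layer(h, W2_pub)  # untrusted world
    return tee_slice(h, A2, B2)    # TEE

y = shielded_forward(rng.standard_normal((3, 16)))
print(y.shape)  # (3, 4)
```

Because all privacy-related functionality is absorbed into the adapters during training, leaking the public backbone alone is intended to give an attacker no more advantage than starting from a public pre-trained model.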
3. TEE-Accelerator Partitioning, Encryption, and Integrity Mechanisms
Efficient TEE-shielded inference on modern edge devices requires careful partitioning, balancing TEE memory/computation constraints against the utility of hardware accelerators (GPU, NPU):
- Partition Location: In partition-before-training (e.g., TEESlice), only adapters and non-linearities are run in the TEE, confining ≤5% of total FLOPs, while all remaining linear layers are offloaded (Li et al., 2024). In TensorShield, selective tensors are shielded to maintain security with only ∼8% of weights in TEE (Sun et al., 28 May 2025).
- One-time Pad Feature Encryption: Linear layer offloading uses modular arithmetic or additive masking: the TEE sends quantized features x′ = x + r (mod p) to the untrusted accelerator, where r is a fresh one-time pad; because the offloaded layer W is linear, W·x′ = W·x + W·r (mod p), so the TEE restores the true result as W·x = W·x′ − W·r (mod p), with W·r precomputed (Li et al., 2024, Sun et al., 2020, Nayan et al., 22 Oct 2025).
- Obfuscation/Permutation: Amulet obfuscates each linear layer by sandwiching its weights between random invertible transformation matrices, so the accelerator observes only transformed weights and activations; non-linear layers require permutation and expansion gadgets to maintain forward correctness and information-theoretic secrecy (Mao et al., 8 Dec 2025).
- Integrity Verification: Freivalds' algorithm is used for randomly spot-checking result integrity: to verify a claimed product Y = W·X, the TEE precomputes vᵀW for a random vector v, then checks vᵀY = (vᵀW)·X, reducing verification cost from a full matrix multiplication to matrix–vector products (Li et al., 2024).
- Side-Channel and DMA Mitigation: Some systems (e.g., TZ-LLM (Wang et al., 17 Nov 2025)) relocate all TEE-buffered data, restrict DMA, and minimize TEE-exposed interfaces to limit surface area.
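The masking and verification steps above can be sketched together: the TEE one-time-pad-masks a quantized activation, the untrusted accelerator computes the linear layer on the masked input, and the TEE unmasks and Freivalds-checks the result. The modulus, shapes, and variable names are illustrative assumptions, not any system's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(42)

p = 2**16                          # modulus for quantized arithmetic
W = rng.integers(0, p, (64, 32))   # offloaded linear layer (visible to accelerator)
x = rng.integers(0, p, (32,))      # privacy-sensitive activation inside the TEE

# --- TEE side: one-time-pad masking before offloading ------------------
r = rng.integers(0, p, (32,))      # fresh one-time pad (never reused)
x_masked = (x + r) % p             # what the untrusted accelerator sees
Wr = (W @ r) % p                   # precomputable per pad, inside the TEE

# --- Untrusted accelerator: computes on the masked input ---------------
y_masked = (W @ x_masked) % p

# --- TEE side: unmasking exploits linearity of W -----------------------
y = (y_masked - Wr) % p
assert np.array_equal(y, (W @ x) % p)

# --- Freivalds' check: spot-verify a claimed batch result Y = W @ X ----
X = rng.integers(0, p, (32, 8))
Y = (W @ X) % p                                  # result claimed by accelerator
v = rng.integers(0, p, (64,))                    # random challenge vector
vW = (v @ W) % p                                 # precomputed in the TEE
ok = np.array_equal((v @ Y) % p, (vW @ X) % p)   # matrix-vector cost only
print(ok)  # True
```

A tampered Y fails the check with high probability over the random choice of v, so the TEE can spot-check a small random fraction of offloaded results instead of recomputing them.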
4. Performance, Security, and Utility Trade-offs
Empirical evaluations confirm that direct full-model execution in TEEs incurs prohibitive latency (e.g., >50× slowdown for 100% TEE FLOPs), while advanced partition-before-training and obfuscation-based schemes approach hardware-accelerated throughput:
| Scheme | %FLOPs in TEE | Speedup vs. TEE-only | Security (MS/MIA acc.) | Accuracy Drop |
|---|---|---|---|---|
| Full-TEE | 100% | 1× | Black-box | 0% |
| TEESlice | 2-4% | 18–23× | Black-box-equivalent | <0.5% |
| TensorShield | 8% (ResNet18) | up to 25.35× | Black-box | none |
| Amulet | <1% in TEE | 8–9× | Black-box (proof) | <1e-4 |
| TSDP Baselines | 45–97% | 1–2× | 3–4× > black-box | variable |
TEESlice's defense remains statistically indistinguishable from the black-box baseline for both model-stealing and membership inference attacks, while prior TSDP methods leak 3–4× more (Li et al., 2024, Zhang et al., 2023). Obfuscation-based schemes formally guarantee that observation of all mask-obfuscated weights/activations yields information-theoretic secrecy (Mao et al., 8 Dec 2025).
For LLM deployment, TZ-LLM realizes pipelined secure memory allocation and NPU time-sharing, achieving a 90.9% reduction in time-to-first-token (TTFT) and a 23.2% decoding speedup over TEE-only baselines (Wang et al., 17 Nov 2025). Memory-efficient DNN inference frameworks adapt TrustZone's memory controller to support >3× speedup and 66.5% energy reduction (Xie et al., 2024).
5. Application to Model Classes and Modalities
TEE-shielded inference methods have been demonstrated across a variety of NN and data modalities:
- CNNs and Vision: TEESlice and Amulet support AlexNet, ResNet18/34/50/101/152, VGG16, and MobileNet (CIFAR-10/100, STL10, UTKFace, ImageNet) with negligible accuracy loss and at-scale throughput (Li et al., 2024, Mao et al., 8 Dec 2025).
- Transformers/LLMs: Partitioned LoRA adapters as TEESlice slices; SecureInfer and TZ-LLM implement LLaMA and TinyLlama (hundreds of MB to multi-GB parameter sets) with SGX or TrustZone, balancing security-critical blocks in enclave and high-throughput matmuls on GPU/NPU (Li et al., 2024, Wang et al., 17 Nov 2025, Nayan et al., 22 Oct 2025).
- Graph Neural Networks: GNNVault partitions GNNs before training, using a public backbone on substitute graphs and a TEE-resident private rectifier, achieving strong resistance to link-stealing (AUC drop 0.21+ on Cora/Citeseer) and <2% accuracy degradation (Ding et al., 20 Feb 2025).
- IoT, Stream Data: EnclaveTree encodes entire Hoeffding Trees as fixed-size matrices, yielding side-channel-resilient, high-throughput stream inference on small/medium feature sets (Wang et al., 2022).
- Edge Device Applications: Deployments on Raspberry Pi 3B+ (OP-TEE/TrustZone) and HiKey960 (Cortex-A73) validate efficacy under realistic memory and compute constraints (Liu et al., 2023, Xie et al., 2024, Liu et al., 2024).
6. Limitations, Open Challenges, and Extensions
While recent methods advance both performance and security, notable limitations remain:
- Side-Channel Leakage: Most designs, including TEESlice, Amulet, and related partitioning schemes, do not directly address physical or cache-based side channels within the TEE (Li et al., 2024, Mao et al., 8 Dec 2025).
- Key and OTP Management: One-time pad exhaustion and periodic re-keying protocols demand careful engineering to avoid reuse (Li et al., 2024).
- Scaling to Larger Models: Handling >32–64 MB TEE RAM for state-of-the-art LLMs remains challenging; approaches employ pipelined restoration and partial parameter caching, or parameter-efficient representations (e.g., LoRA, dense pruning) (Wang et al., 17 Nov 2025, Li et al., 2024).
- Dynamic/Adaptive Protection: Existing partitioning logic is static post-training; open directions include dynamic adaptation to input complexity, resource-awareness, and integration with DP/MPC for composable security (Li et al., 2024, Liu et al., 2024).
- High Storage Overhead: Masking for non-linear layers in Amulet increases on-device storage, with impact mitigated by single-load design and abundant commodity DRAM (Mao et al., 8 Dec 2025).
- Strict Black-Box Reduction: Security guarantees rely on perfect hardware isolation and black-box interface; attacks exploiting external side channels or richer I/O interfaces may necessitate further hardening (Liu et al., 2023, Abdollahi et al., 11 Apr 2025).
7. Historical Evolution and Emerging Standards
TEE-shielded on-device inference evolved from coarse partitioning (entire suffix/prefix protection, e.g., DarkneTZ) to sensitivity-guided, fine-grained, and information-theoretic obfuscation. Partition-before-training (TEESlice, GNNVault) and critical tensor masking (TensorShield) reflect a shift from post-hoc to preemptive protection, often leveraging explainability metrics for partition selection (Li et al., 2024, Sun et al., 28 May 2025, Ding et al., 20 Feb 2025). Hardware support for secure NPU scheduling, pipelined memory prefetch, and new architectures such as Arm CCA further lower the overhead of confidential inference (Wang et al., 17 Nov 2025, Abdollahi et al., 11 Apr 2025).
Representative research groups have advanced each frontier, including the authors of TEESlice (Li et al., 2024), MirrorNet (Liu et al., 2023), DarkneTZ (Mo et al., 2020), GNNVault (Ding et al., 20 Feb 2025), and Amulet (Mao et al., 8 Dec 2025), providing open-source reference code, benchmarks, and deployment recipes. Best practices now include partition-before-training, minimal TEE crossing, hardware-optimized memory layout, and hybrid accelerator co-design.
References:
TEESlice (Li et al., 2024), MirrorNet (Liu et al., 2023), GNNVault (Ding et al., 20 Feb 2025), Amulet (Mao et al., 8 Dec 2025), TensorShield (Sun et al., 28 May 2025), SecureInfer (Nayan et al., 22 Oct 2025), TZ-LLM (Wang et al., 17 Nov 2025), TBNet (Liu et al., 2024), DarkneTZ (Mo et al., 2020), EnclaveTree (Wang et al., 2022), Memory-Efficient TrustZone DNN (Xie et al., 2024), Arm CCA (Abdollahi et al., 11 Apr 2025), ShadowNet (Sun et al., 2020), No Privacy Left Outside (Zhang et al., 2023).