TEE-Shielded On-Device Inference
- TEE-Shielded On-Device Inference is an approach that secures machine learning on user devices using hardware enclaves to guarantee model confidentiality, integrity, and I/O privacy.
- It employs partitioning strategies, cryptographic masking, and accelerator offloading to optimize performance while mitigating risks such as model extraction and data leakage.
- Deployment involves a hardware-software co-design using platforms like Arm TrustZone and Intel SGX, achieving low overhead and strong resistance to adversarial attacks.
Trusted Execution Environment (TEE)-Shielded On-Device Inference refers to running machine learning inference on user-controlled devices such that hardware-isolated enclaves (TEEs) provide model confidentiality, integrity, and (where applicable) input/output privacy, even when adversaries control the operating system, device firmware, or user-space software. The paradigm is motivated by the increasing prevalence of on-device deep neural networks (DNNs) and LLMs on mobile phones, edge devices, and IoT hardware, where model extraction, weight theft, and training data leakage have become acute security concerns. The state of the art in TEE-shielded on-device inference is characterized by system/software/hardware co-designs, architectural defenses, fine-grained tensor shielding, obfuscation techniques, secure-world memory management, and attestation-centric frameworks.
1. System Architectures and Threat Models
TEE-shielded on-device inference systems span several architectural approaches, reflecting device constraints (secure memory size, accelerator access, TCB minimization) and targeted threat models. Common TEE platforms include Arm TrustZone with OP-TEE, Intel SGX, and (recently) Arm Confidential Compute Architecture (CCA) realms.
- Partitioned Execution: Most designs partition inference into TEE-protected ("secure world") and REE/GPU ("normal world") segments (a minimal sketch follows this list). For example, DarkneTZ partitions a pre-trained DNN into "privacy-sensitive" layers hosted inside a TEE, with the remainder running outside. MirrorNet introduces a public BackboneNet (REE) and a mirrored, accuracy-rectifying CPM (TEE) (Mo et al., 2020, Liu et al., 2023).
- Hybrid TEE–Accelerator Pipelines: SecureInfer and TwinShield leverage heterogeneous architectures, running privacy-critical code (key parameters, non-linear ops, secret projections) in the TEE, while offloading high-throughput linear matrix multiplies to an untrusted GPU with information-theoretic masking and secret sharing (Nayan et al., 22 Oct 2025, Xue et al., 4 Jul 2025).
- Co-processor/Lightweight Secure Engine: Some frameworks offload non-linear or token-by-token operations to a small on-chip "lightweight trusted hardware" co-processor, maximizing memory/compute separation (e.g., TPM-class security chips) (Huang et al., 2022).
- Obfuscation and Encapsulation: Amulet obfuscates the complete DNN inside the TEE, stores obfuscated weights in untrusted memory, and runs inference fully outside the TEE, interacting only for input/output masking/unmasking (Mao et al., 8 Dec 2025). Similarly, ShadowNet obfuscates linear-layer weights, offloads them, and handles non-linear layers solely within the TEE (Sun et al., 2020).
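To make the secure/normal-world split concrete, the following is a minimal, self-contained sketch (plain NumPy, no real TEE) of a DarkneTZ-style depth partition: the public prefix of a toy network runs in the untrusted world, the privacy-sensitive suffix runs behind a simulated enclave boundary, and only the predicted label crosses back. The function names, layer sizes, and split point are illustrative assumptions, not taken from any cited implementation.

```python
# Illustrative sketch (not from any cited framework): a DarkneTZ-style depth
# partition where the first SPLIT layers run in the untrusted "normal world"
# and the remaining privacy-sensitive layers run inside the TEE. The TEE
# boundary is simulated by a separate function; a real deployment would cross
# into OP-TEE / SGX via a world switch or ECALL.
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Toy 4-layer MLP; layers 0..1 are public, layers 2..3 are shielded.
weights = [rng.standard_normal((d_in, d_out)) * 0.1
           for d_in, d_out in [(16, 32), (32, 32), (32, 32), (32, 10)]]
SPLIT = 2  # layers >= SPLIT stay inside the enclave

def normal_world_forward(x):
    """Runs the public prefix of the network outside the TEE."""
    for w in weights[:SPLIT]:
        x = relu(x @ w)
    return x

def secure_world_forward(x):
    """Runs the privacy-sensitive suffix; these weights never leave the enclave."""
    for w in weights[SPLIT:-1]:
        x = relu(x @ w)
    logits = x @ weights[-1]
    return logits.argmax(axis=-1)  # only labels cross the boundary back out

x = rng.standard_normal((1, 16))
intermediate = normal_world_forward(x)      # observable by the adversary
label = secure_world_forward(intermediate)  # confidential computation
print("predicted label:", label)
```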
Threat models typically assume full REE compromise: attackers may arbitrarily read/write any non-secure memory, replace model binaries, add LoRA adapters, implant side-channel hooks, or monitor all I/O. TEEs are assumed free of code/data leaks. Some frameworks also consider (or exclude) physical and microarchitectural side-channels.
2. Partitioning, Shielding, and Tensor Selection Strategies
Partitioning the model between TEE and REE/GPU is central to performance-security trade-offs. Three major strategies are observed:
- Heuristic Partitioning/Post-Training: Early work (e.g., DarkneTZ, Serdab) partitions by depth (e.g., shield only the final classifier layer) or structure (shielding all nonlinear ops) after model training. However, this leaves significant leakage pathways for model stealing (MS) and membership inference (MIA), except in cases where the final layer is the only source of information for attack (Mo et al., 2020, Zhang et al., 2023).
- Critical-Tensor Identification (XAI-Based Selection): TensorShield introduces an XAI-driven method that assigns each tensor a criticality score combining its gradient-based intrinsic importance with its Grad-CAM attention-transition distance to public models. Shielding only the few tensors with the highest criticality recovers black-box-level resistance to MS/MIA at sharply reduced TEE memory and computational cost (Sun et al., 28 May 2025); a simplified selection sketch follows at the end of this section.
- Partition-Before-Training: Recent approaches (e.g., TEESlice, GNNVault) decouple public (offloaded) and confidential (enclave) model weights at the point of training. For example, TEESlice inserts private "slices" only in the TEE-resident part and runs public, never-private-trained backbone weights in REE (Zhang et al., 2023). GNNVault trains a public backbone on a surrogate graph, then trains a private rectifier on sensitive data, scrupulously restricting all sensitive graph structure and parameters to the TEE (Ding et al., 20 Feb 2025).
This classification is critical since (as demonstrated in large-scale benchmarking (Zhang et al., 2023)), post-training partitioning often offers poor control over security-utility trade-offs, while partition-before-training can guarantee black-box ("full-shield") equivalence at orders-of-magnitude lower TEE cost.
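As a rough illustration of criticality-driven tensor selection, the sketch below scores each weight tensor of a toy network with a finite-difference proxy for gradient-based importance and shields only the top-k tensors under a TEE memory budget. It deliberately omits TensorShield's Grad-CAM attention-transition term and is not the paper's algorithm; the scoring proxy and all names are assumptions for illustration.

```python
# Hedged sketch of critical-tensor selection in the spirit of TensorShield:
# score each weight tensor by an approximate gradient-based importance
# (finite differences on a toy loss, not the paper's exact metric) and
# shield only the top-k tensors inside the TEE.
import numpy as np

rng = np.random.default_rng(1)
tensors = {f"layer{i}.weight": rng.standard_normal((8, 8)) * 0.1 for i in range(4)}
x = rng.standard_normal((1, 8))

def forward(params, x):
    h = x
    for name in sorted(params):
        h = np.tanh(h @ params[name])
    return float((h ** 2).sum())  # toy scalar loss

def importance(params, name, eps=1e-4):
    """Mean absolute finite-difference gradient of the loss w.r.t. one tensor."""
    base = forward(params, x)
    grads = []
    for idx in np.ndindex(*params[name].shape):
        perturbed = {k: v.copy() for k, v in params.items()}
        perturbed[name][idx] += eps
        grads.append(abs(forward(perturbed, x) - base) / eps)
    return float(np.mean(grads))

scores = {name: importance(tensors, name) for name in tensors}
k = 1  # TEE memory budget: shield only the single most critical tensor
shielded = sorted(scores, key=scores.get, reverse=True)[:k]
print("criticality scores:", {n: round(s, 4) for n, s in scores.items()})
print("tensors kept inside the TEE:", shielded)
```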
3. Cryptographic and Algorithmic Protections
TEE-shielded on-device inference employs a diversity of cryptographic and algorithmic methods for confidentiality and integrity.
- Weight and Activation Masking: SecureInfer, TwinShield, and ShadowNet use additive one-time pads, fixed-point masking over finite rings, permutation obfuscation, or random projections on linear transformations; all guarantee that observed offloaded tensors carry negligible or zero mutual information about model parameters or user data (Nayan et al., 22 Oct 2025, Xue et al., 4 Jul 2025, Sun et al., 2020). A minimal masking-and-restoration sketch follows this list.
- Output and Activation Obfuscation: Amulet extends algebraic masking through all layers (including non-linearities and attention blocks) using random invertible matrices and Kronecker/permutation gadgets, ensuring information-theoretic secrecy of all weight and activation material outside the TEE (Mao et al., 8 Dec 2025). A linear-layer-only sketch of this idea also follows the list.
- Execution/Restoration Protocols: SecureInfer, ShadowNet, and Amulet all execute sequences where masked tensors are pushed to the GPU/REE, results restored and unmasked inside TEE, and all intermediate state zeroized. Dual-protection integrity mechanisms like U-Verify (TwinShield) and Freivalds’ algorithm (TEESlice) embed cryptographic checksums or challenge-responses to catch tampering in GPU computation (Xue et al., 4 Jul 2025, Zhang et al., 2023).
- Watermarking for Attestation: For LLM attestation, AttestLLM introduces activation-based watermarking with quantization-aware embedding and periodic block-level challenge-response verification from TEE; cryptographic secrets for watermark extraction are kept enclave-resident to defeat model replacement and forging attacks (Zhang et al., 8 Sep 2025).
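The following sketch illustrates the masked-offload-and-restore pattern described above, combined with a Freivalds-style spot check on the returned result. It is a simplification: only the activation is hidden by an additive one-time pad (the cited frameworks also obfuscate the weights), and the pad-product precomputation, function names, and tolerances are illustrative assumptions rather than any framework's actual protocol.

```python
# Hedged sketch of additive-masking offload with integrity checking, in the
# spirit of the protocols above (not the exact SecureInfer/TwinShield/TEESlice
# constructions). The TEE masks the activation with a one-time pad before
# offloading the matrix multiply, unmasks the result using a pad product
# precomputed inside the enclave, and runs a Freivalds-style probabilistic
# check to detect a tampering GPU.
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((64, 64))          # layer weights (clear to the GPU in this toy)
x = rng.standard_normal((1, 64))           # confidential activation

# --- inside the TEE (offline): draw a fresh pad and precompute pad @ W ---
pad = rng.standard_normal(x.shape)
pad_times_W = pad @ W                      # pads must never be reused

# --- inside the TEE (online): mask, then offload ---
x_masked = x + pad                         # statistically hides x from the REE

def untrusted_gpu_matmul(a, b):
    """Stand-in for the REE/GPU; sees only the masked activation."""
    return a @ b

y_masked = untrusted_gpu_matmul(x_masked, W)

# --- back inside the TEE: unmask and verify ---
y = y_masked - pad_times_W                 # recovers x @ W

r = rng.standard_normal((64, 1))           # Freivalds-style spot check
assert np.allclose(y_masked @ r, x_masked @ (W @ r), atol=1e-6), "GPU tampering detected"
print("unmasking error:", float(np.max(np.abs(y - x @ W))))
```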
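For the algebraic-obfuscation style of protection, a linear-layer-only sketch is given below: obfuscated weights live in untrusted memory, the enclave keeps only two invertible masks, and all heavy compute runs outside the TEE. This is in the spirit of Amulet/ShadowNet but omits non-linearities, attention blocks, and the Kronecker/permutation gadgets; matrix sizes and names are illustrative assumptions.

```python
# Hedged sketch of algebraic weight/activation obfuscation for a single linear
# layer (not the cited papers' exact constructions). The enclave holds the
# invertible masks A and B; the obfuscated weights W_obf are stored in
# untrusted memory and multiplied on the untrusted side.
import numpy as np

rng = np.random.default_rng(5)
W = rng.standard_normal((64, 32))                 # confidential weights
A = rng.standard_normal((64, 64))                 # enclave-resident input mask
B = rng.standard_normal((32, 32))                 # enclave-resident output mask
W_obf = np.linalg.inv(A) @ W @ B                  # computed once inside the TEE,
                                                  # then stored in REE memory

def untrusted_linear(x_obf):
    """Runs in the REE/GPU; sees only masked activations and obfuscated weights."""
    return x_obf @ W_obf

# Inside the TEE per inference: mask the input, offload, unmask the output.
x = rng.standard_normal((1, 64))
y_obf = untrusted_linear(x @ A)
y = y_obf @ np.linalg.inv(B)

print("max reconstruction error:", float(np.max(np.abs(y - x @ W))))
```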
4. Performance, Scalability, and Hardware Co-Design
The primary challenge for on-device TEE-shielded inference is reconciling secure execution with stringent device constraints and latency requirements.
- TEE Memory Constraints: Full-model TEE storage is infeasible for models exceeding tens of MB (modern CNNs, GNNs, or 7B+ LLMs). Partitioning, slice-based minimization, and on-demand paging (pipelined restoration in TZ-LLM) are key (Wang et al., 17 Nov 2025, Xie et al., 19 Mar 2024, Sun et al., 28 May 2025); a toy double-buffered restoration sketch follows this list.
- Accelerator Utilization: SecureInfer, TwinShield, Amulet, and ShadowNet enable high-throughput GPU computation on masked/obfuscated tensors, with most (80–90%+) of FLOPs offloaded. This yields speedups of 4–10× over TEE-only baselines and end-to-end latencies as low as 1.2–2.8× that of unprotected inference (Nayan et al., 22 Oct 2025, Xue et al., 4 Jul 2025, Mao et al., 8 Dec 2025, Sun et al., 2020).
- Latency and Energy Evaluation: Quantitative data show overheads as low as +3% in DarkneTZ for single-layer TEE, +16.8–19.6% for LLM attestation (AttestLLM, INT8/INT4), 2.8–4.8× in Amulet, and up to 25.35× speedups relative to naive full-shielding in TensorShield (Zhang et al., 8 Sep 2025, Mao et al., 8 Dec 2025, Sun et al., 28 May 2025, Mo et al., 2020).
- Enhancements for Ultra-Large Models and NPUs: pKVM in AttestLLM enables TEE memory regions of ≥512 MB, supporting LLMs up to 15B parameters; TZ-LLM introduces TEE-coordinated minimal NPU drivers and pipelined parameter restoration to closely match REE decoding speeds for LLMs (Wang et al., 17 Nov 2025, Zhang et al., 8 Sep 2025). Arm CCA realms can allocate gigabytes of secure memory, removing TrustZone's primary bottleneck (Abdollahi et al., 11 Apr 2025).
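To illustrate on-demand, pipelined parameter restoration, the toy below keeps layer weights "encrypted" in untrusted memory and overlaps decryption of the next layer with computation on the current one using a one-slot prefetcher. The XOR keystream stands in for real authenticated encryption (e.g., AES-GCM with hardware-backed keys), and the double-buffering scheme is an assumption in the spirit of TZ-LLM-style designs, not their implementation.

```python
# Hedged sketch of on-demand, pipelined parameter restoration: layer weights
# live "encrypted" in untrusted memory, and the enclave overlaps decryption of
# the next layer with compute on the current one. The XOR "cipher" below is a
# placeholder; a real system would use authenticated encryption and
# hardware-backed keys.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(3)
key = rng.integers(0, 256, size=1 << 16, dtype=np.uint8)  # toy enclave-resident keystream

def encrypt(w):                       # performed once, offline, inside the TEE
    raw = w.astype(np.float32).tobytes()
    return bytes(b ^ key[i % key.size] for i, b in enumerate(raw)), w.shape

def decrypt(blob, shape):             # runs inside the enclave at inference time
    raw = bytes(b ^ key[i % key.size] for i, b in enumerate(blob))
    return np.frombuffer(raw, dtype=np.float32).reshape(shape)

layers = [encrypt(rng.standard_normal((32, 32)) * 0.1) for _ in range(6)]
x = rng.standard_normal((1, 32)).astype(np.float32)

with ThreadPoolExecutor(max_workers=1) as prefetcher:
    pending = prefetcher.submit(decrypt, *layers[0])        # prefetch layer 0
    for i in range(len(layers)):
        w = pending.result()                                 # wait for current layer
        if i + 1 < len(layers):
            pending = prefetcher.submit(decrypt, *layers[i + 1])  # overlap next decrypt
        x = np.maximum(x @ w, 0.0)                           # compute on current layer
print("output norm:", float(np.linalg.norm(x)))
```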
5. Security Guarantees, Formal Analyses, and Attestation
- Indistinguishability and Black-Box Equivalence: Amulet, TwinShield, SecureInfer, and TensorShield formalize that their obfuscation, masking, or partitioning yields indistinguishability of exposed tensors (information-theoretic secrecy); no more leakage than black-box label-only exposure is possible under their threat models (Mao et al., 8 Dec 2025, Xue et al., 4 Jul 2025, Nayan et al., 22 Oct 2025, Sun et al., 28 May 2025).
- Attack Resistance Evaluation: Large-scale evaluation in TEESlice demonstrates that prior post-training partitioning approaches leave substantial leakage (MS accuracy 3.8–4.3×, MIA 1.2–1.4× above black-box), while "partition-before-training" (TEESlice, GNNVault) can precisely match black-box upper-bound security with single-digit TEE compute costs (Zhang et al., 2023, Ding et al., 20 Feb 2025).
- Attestation and Integrity: Challenge–response protocols (AttestLLM), integrity-check rows/hashes (TwinShield), and remote attestation procedures (SGX, Arm CCA) are integrated to ensure the authenticity of executing models and enclave code, preventing model replacement, binary tampering, and side-channel code injection (Zhang et al., 8 Sep 2025, Xue et al., 4 Jul 2025, Abdollahi et al., 11 Apr 2025). A minimal challenge–response sketch follows this list.
- Safety under Model Extraction, MIA, and Link Stealing: Empirical and analytic security assessments (GNNVault, TensorShield, AttestLLM) address both classical surrogate training (MS) and more advanced link-prediction and KV-leakage attacks (KV-Shield; Yang et al., 6 Sep 2024), with information-theoretic or entropy-based quantification of attack infeasibility.
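The sketch below shows a minimal block-level challenge-response attestation in the spirit of the protocols above (not AttestLLM's watermarking scheme): at provisioning time the enclave records keyed fingerprints of each model block's response to secret probe inputs, and at runtime it replays a random probe and rejects the model if the fingerprint changes. The probe/fingerprint construction, key handling, and quantization tolerance are all illustrative assumptions.

```python
# Hedged sketch of block-level challenge-response attestation: the enclave
# keeps a secret key and probe inputs, records reference fingerprints of each
# block's probed activation, and later detects weight replacement by
# recomputing and comparing the keyed digest.
import hashlib, hmac
import numpy as np

rng = np.random.default_rng(4)
SECRET = b"enclave-resident attestation key"   # never leaves the TEE

blocks = [rng.standard_normal((16, 16)) * 0.1 for _ in range(4)]  # model blocks in REE memory
probes = [rng.standard_normal((1, 16)) for _ in blocks]           # secret probe inputs

def fingerprint(block_out):
    """Keyed digest of a block's activation, quantized to tolerate FP noise."""
    q = np.round(block_out, decimals=4).tobytes()
    return hmac.new(SECRET, q, hashlib.sha256).digest()

# Provisioning (inside the TEE): record reference fingerprints.
reference = [fingerprint(p @ b) for p, b in zip(probes, blocks)]

def attest(block_id):
    """Runtime challenge: recompute the probed activation and compare digests."""
    observed = fingerprint(probes[block_id] @ blocks[block_id])
    return hmac.compare_digest(observed, reference[block_id])

print("all blocks authentic:", all(attest(i) for i in range(len(blocks))))
blocks[2] = rng.standard_normal((16, 16))       # adversary swaps in a new block
print("block 2 passes after tampering:", attest(2))
```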
6. Practical Deployment Considerations and Limitations
- Hardware Portability and Platform Support: Frameworks such as AttestLLM (pKVM/TrustZone), SecureInfer (SGX+GPU), TwinShield (SGX+CUDA), and TZ-LLM (TrustZone+NPU) are demonstrated across ARMv8, Intel SGX FLC, and emerging Arm CCA, covering mobile, edge, and cloud TEE-capable hardware (Zhang et al., 8 Sep 2025, Nayan et al., 22 Oct 2025, Xue et al., 4 Jul 2025, Wang et al., 17 Nov 2025, Abdollahi et al., 11 Apr 2025).
- Minimal Trusted Computing Base (TCB): Designs prioritize compact TCB size (e.g., 20 KLoC in TZ-LLM, 3,367 LoC in Smart-Zone), limiting the scope for code vulnerabilities and facilitating formal audit/verification (Wang et al., 17 Nov 2025, Xie et al., 19 Mar 2024).
- Obfuscation Storage Expansion and Preprocessing Cost: Masking and algebraic obfuscation expand the model footprint (e.g., up to 200% in Amulet), and offline preprocessing adds cost but typically stays under a minute per model on commodity hardware (Mao et al., 8 Dec 2025).
- Boundary of Defenses: Many frameworks explicitly exclude physical and microarchitectural side-channel attacks; others (e.g., AttestLLM) leverage memory-layout randomization, control-flow attestation, and recalculate "evasion" probabilities per attestation round (Zhang et al., 8 Sep 2025).
- Accelerator Assignment and Emerging TEEs: Arm CCA represents an advance in fine-grained secure memory partitioning, potentially supporting multi-GB confidential model inference and better device assignment mechanisms; NPU, DSP, and GPU enclave integration remains an active engineering frontier (Abdollahi et al., 11 Apr 2025, Wang et al., 17 Nov 2025).
7. Comparative Table
| Framework | Partition Strategy | Accelerator Use | Overhead vs. Full-TEE | Security Guarantee |
|---|---|---|---|---|
| DarkneTZ (Mo et al., 2020) | Post-training, deep/shallow | CPU only | +3% (last layer) | Black-box MIA-thresholding |
| TensorShield (Sun et al., 28 May 2025) | Critical-tensor XAI-driven | GPU/CPU | up to 25.35× faster | Black-box MS/MIA equivalence |
| TEESlice (Zhang et al., 2023) | Partition-before-training | GPU/CPU | 10–15× faster | Black-box upper-bound, info-theoretic |
| SecureInfer (Nayan et al., 22 Oct 2025) | Threat-informed (sensitivity) | GPU+SGX | 4.7× faster | Info-theoretic masking, MIA/MS drop |
| TwinShield (Xue et al., 4 Jul 2025) | Cryptoprotected Transformers | GPU+SGX | 4–6.1× faster | End-to-end dual confidentiality |
| AttestLLM (Zhang et al., 8 Sep 2025) | WM-embedded block attestation | CPU (pKVM) | +17–20% per-tok | Watermark-based authenticity |
| Amulet (Mao et al., 8 Dec 2025) | Algebraic obfuscation (all layers) | GPU | 8–9× faster | Information-theoretic secrecy |
| TZ-LLM (Wang et al., 17 Nov 2025) | Pipelined restoration, NPU driver | NPU/CPU | +6% vs. non-TEE | End-to-end param/KV confidentiality |
All figures above are quoted verbatim from, or derived from, the evaluations reported in the referenced works; "faster" and "overhead" metrics are relative to full-TEE baselines unless otherwise noted.
TEE-shielded on-device inference thus represents a convergent field at the intersection of systems security, hardware co-design, neural network privacy, and adversarial robustness, where new algorithms and architecture co-designs continually expand the tractable security–performance Pareto frontier for ML on the edge.