Tensor Computing Interface (TCI)
- Tensor Computing Interface (TCI) is a standardized API that unifies tensor operations across heterogeneous platforms, enabling efficient tensor-network applications.
- It uses a zero-overhead design with compile-time templates, ensuring performance parity with native framework implementations on CPUs, GPUs, and supercomputers.
- The interface offers over 50 free functions and a BLAS-like xGETT for binary tensor contraction, promoting portability and seamless backend integration.
The Tensor Computing Interface (TCI) is a standardized, application-oriented API designed to abstract and unify tensor operations across heterogeneous software and hardware platforms. TCI enables portable, high-performance tensor-network (TN) applications in domains such as quantum science and machine learning. It addresses widespread deficiencies in interoperability and code portability by decoupling application logic from underlying tensor-computing frameworks (TCFs), exposing a lightweight, expressive set of core primitives for tensor manipulation, linear algebra, and network contraction. Complementing this, the xGETT interface delivers BLAS-like semantics for binary tensor contraction, providing a rigorous foundation for future TCI standardization (Sun et al., 30 Dec 2025, Hörnblad, 2024).
1. Motivation and Architectural Principles
Traditional tensor-network workflows are tightly coupled to framework-specific APIs—e.g. ITensor, Cytnx, cuTENSOR—leading to prohibitive costs in migration, maintenance, and hardware targeting. TCI mitigates these issues via three strategies:
- API Abstraction: Presents a minimal, expressive interface in standard C++ (C++17), supporting both object- and free-function paradigms for tensor manipulation.
- Zero-Overhead Design: Utilizes compile-time template resolution for all tensor traits and types, avoiding runtime virtual dispatch and aligning performance with native TCF backends.
- Cross-Platform Portability: Allows migration between CPUs, GPUs, supercomputer clusters, and emerging tensor-acceleration hardware (e.g. TPUs) by changing only typedefs and header includes (Sun et al., 30 Dec 2025).
This paradigm enables TN codes, such as those for quantum-circuit simulation or variational algorithms, to be developed independently of the computational back end.
2. Type System and Object Model
TCI introduces a unified type system that defines tensor objects and their metadata abstractly. The central element is the abstract tensor type TenT, together with a traits class tensor_traits<TenT>. The traits interface exposes:
| Member Type | Description | Example Type |
|---|---|---|
| ten_t | Concrete tensor type | Cytnx::UniTensor |
| shape_t | Bond-dimension tuple | std::vector<int> |
| elem_t | Scalar data type | float, double |
| context_handle_t | Backend resource handle | CPU/GPU structs |
All tensor coordinates, shapes, and accessors are indexed according to bond order. Tensor objects can represent dense or (future extension) block-sparse, symmetry-aware forms (Sun et al., 30 Dec 2025). Concrete types and aliases are provided by TCFs at compile time via the tci/tci.h header.
For a tensor $T$ of order $r$ with shape $(d_1, \dots, d_r)$ (bond dimensions listed in bond order), the following are defined:
- Elements accessed by index tuples $(i_1, \dots, i_r)$ with $0 \le i_k < d_k$, e.g. via get_elem.
3. Core API and Functional Categories
TCI supplies approximately fifty free functions, grouped as follows (Sun et al., 30 Dec 2025):
- Read-only Queries: order, shape, size, size_bytes, get_elem.
- Construction/Destruction: allocate, zeros, fill, random, eye, copy, move, clear.
- I/O Operations: load, save.
- Manipulation: reshape, transpose (bond permutations), expand/shrink, extract_sub, replace_sub, concatenate, stack, for_each (element traversal).
- Linear Algebra Primitives: diag, norm, normalize, scale, trace (partial), exp/inverse (on matricized blocks), contract (Einstein summation), linear_combine, svd, trunc_svd, qr/lq, eigvals/eigh.
- Miscellaneous/Debug: to_range, show, close (elementwise equality checks), convert (cross-backend), version.
The principal contraction routine contract(ctx, A, labels_A, B, labels_B, C, labels_C) supports Einstein-index labeling, enabling flexible contractions of arbitrary TN diagrams. All functions accept a context handle carrying backend-specific resources.
4. BLAS-like Interface for Binary Tensor Contraction
The xGETT interface, proposed as the nucleus of a future TCI standard, formalizes binary tensor contraction with strict semantics for mode extents, strides, contractions, and output permutations (Hörnblad, 2024).
The contraction formula is:
$$C_{\pi(\alpha_1,\dots,\alpha_p,\,\beta_1,\dots,\beta_q)} \;=\; \sum_{\gamma_1,\dots,\gamma_c} A_{\alpha_1,\dots,\alpha_p,\,\gamma_1,\dots,\gamma_c}\, B_{\gamma_1,\dots,\gamma_c,\,\beta_1,\dots,\beta_q},$$
where the $\alpha_i$ and $\beta_j$ are the free modes of A and B, the $\gamma_k$ are the contracted modes, and $\pi$ is the output permutation specified by PERM.
xGETT call signature: each argument configures the contraction:
- RANKA, RANKB: tensor orders.
- EXTA, EXTB: mode sizes.
- INCA, INCB, INCC: memory strides per mode.
- CONTS: number of contracted modes.
- CONTA, CONTB: arrays specifying which modes to contract.
- PERM: permutation for free indices in output.
- C: pointer to result tensor.
The PERM array reorders free output modes, supporting arbitrary output layouts. (Example: for $C = \sum_{\gamma} A_{\alpha_1\alpha_2\gamma} B_{\gamma\beta_1\beta_2}$ with default free-mode order $[\alpha_1, \alpha_2, \beta_1, \beta_2]$, PERM = [2, 0, 3, 1] produces output mode ordering $[\beta_1, \alpha_1, \beta_2, \alpha_2]$.)
Reference implementations use stride-driven coordinate loops and avoid packing, while high-performance variants block and pack sub-tensors for cache efficiency and vectorization (Hörnblad, 2024).
5. Back-End Integration and Portability
Each TCF provides a C++ header (tci/tci.h) and concrete tensor type mapping to TenT. TCI API calls are compiled to these types, yielding direct calls to the underlying TCF implementations (e.g. Cytnx::contract, cuTENSOR’s cuTensorNet). Context handles encapsulate resources for CPU or GPU computation, supporting hardware-specific scheduling.
TCI is designed to have negligible abstraction overhead (confirmed empirically: <2% in benchmarks), matching native TCF performance whether on Intel Xeon, NVIDIA H100, Apple M2 Ultra, or large-node supercomputing clusters (Sun et al., 30 Dec 2025).
Migrating TN code between back ends requires only a change of typedef and include path.
6. Performance Benchmarks and Implementation
Representative benchmarks include iTEBD ground-state simulations and 2dTNS-based belief propagation (Sun et al., 30 Dec 2025):
- iTEBD code for the transverse-field Ising model exhibits CPU performance scaling as $O(\chi^3)$ in the bond dimension $\chi$, the cost of its dominant SVD and contraction steps; TCI abstraction overhead is negligible compared to native APIs.
- For large bond dimensions χ, GPU implementations yield order-of-magnitude speedups under TCI compared to CPU.
- On supercomputers (Fugaku, A64FX), TCI-based codes perform identically to native implementations for sufficiently large bond dimensions, with only minimal divergence at very small sizes.
- Reference xGETT (single-threaded, stride-based) achieves 5 Gflop/s (memory-bound); optimized blocked code (TBLIS, 8 threads) reaches 100 Gflop/s (Hörnblad, 2024).
This demonstrates that abstraction via TCI (and xGETT) does not incur meaningful performance penalties.
7. Roadmap Toward a Full TCI Standard
The foundational xGETT work and the application-oriented TCI API are converging toward a robust, BLAS-like standard for multilinear algebra (Hörnblad, 2024, Sun et al., 30 Dec 2025). Key elements of this vision include:
- Naming conventions and kernel suite: e.g. SGETT/DGETT/CGETT/ZGETT for contractions, xTRANSP for bond permutation, xREDUCE for reduction, and higher-level decompositions (SVD, HOSVD, EIG).
- Blocking and packing hints: explicit or autotunable parameters guide backend implementations.
- Compatibility layers: link-time aliasing and wrapper classes for seamless legacy BLAS/Tensor migration.
- Bindings for high-level languages: Python, Julia, MATLAB, supporting JAX, PyTorch, ITensors.jl, TensorKit.jl integration.
- Advanced features: Asynchronous APIs, automatic differentiation, symmetry-aware tensors, and support for block-sparse layouts.
- Extensibility: Targeting emerging architectures and specialty hardware.
This suggests that widespread adoption of TCI will catalyze development of portable, efficient tensor computations across quantum, AI/ML, and scientific domains, paralleling the historical impact of BLAS in matrix-centric workflows.