Core-INC: In-Switch Collective Operations

Updated 3 February 2026
  • Core-INC is a novel architecture that embeds arithmetic reduction and broadcast operations directly into switches to streamline collective communication in AI workloads.
  • The methodology employs in-switch aggregation via programmable ASIC pipelines using tree-reduction and broadcast phases, nearly halving communication time compared to conventional techniques.
  • Empirical results demonstrate a 57% latency reduction and significant bandwidth savings, leading to notable speedups in distributed AI training systems.

Core-INC refers to a system architecture and methodology in which collective-operation primitives—such as arithmetic reduction and broadcast—are embedded directly within the data-plane of network switches. This paradigm is distinct from host- or NIC-level implementations of collective operations, and targets primary communication bottlenecks in large-scale AI workloads by reducing end-host overhead and core network traffic through in-switch processing of collective data movement and aggregation (Hoefler et al., 27 Jan 2026).

1. Architectural Principles

Core-INC implements collective operations (e.g., Allreduce) at the network switch level, leveraging switch-resident aggregation engines, packet buffers, and small ALU pipelines. The operational phases are as follows:

  • Reduction Phase: Internal switches perform associative reductions (e.g., sum) on vector partitions as they arrive from child links. Partial results are forwarded upstream, with only the reduced data traversing each link.
  • Broadcast Phase: The root switch, upon accumulating the complete reduced vector, multicasts results downstream. This replication occurs within hardware, sidestepping software or NIC-layer fan-out.

The standard algorithmic model overlays a binary or k-ary tree across the switch fabric. Each endpoint sends its data segment to a nearest leaf switch, which aggregates data up the tree. Upon completion of the reduction, the full result is then multicast back down the tree. Modern P4-capable ASICs suffice for full wire-speed implementation due to existing programmable match-action pipelines and ALUs.
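The two phases can be sketched in a few lines of Python. This is an illustrative simulation of the algorithmic model only, not the paper's switch implementation: each "switch" level sums the partial vectors of its children (reduction phase), and the root's result is then replicated to every endpoint (broadcast phase).

```python
# Illustrative sketch of tree-based Allreduce: reduce up a k-ary
# tree of switches, then broadcast the full result back down.

def tree_allreduce(values, k=2):
    """values: one vector per endpoint. Returns the elementwise sum,
    delivered to every endpoint."""
    # Reduction phase: each switch sums its children's partial
    # results; only the reduced vector traverses each upstream link.
    level = [list(v) for v in values]
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), k):
            group = level[i:i + k]
            parents.append([sum(col) for col in zip(*group)])
        level = parents
    root = level[0]
    # Broadcast phase: the root multicasts the reduced vector to all
    # endpoints (in hardware, replication happens inside the switch).
    return [list(root) for _ in values]

results = tree_allreduce([[1, 2], [3, 4], [5, 6], [7, 8]])
# every endpoint receives the elementwise sum [16, 20]
```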

| Feature | Core-INC Implementation | Conventional (Ring) Allreduce |
| --- | --- | --- |
| Aggregation location | In-switch hardware | End-host software/NIC |
| Data path | Tree (depth D) | Ring (length P) |
| Reduction pattern | Upstream sum, downstream multicast | Serial rotation among hosts |
| State per collective | Maintained within switches | At endpoints/NICs |

2. Performance and Communication Cost Models

The performance of Core-INC is described via the α–β model:

  • Point-to-point cost: $T_{p2p}(n) = \alpha + \beta n$
  • Ring Allreduce ($P$ endpoints, message size $N$):

$$T_{\text{ring}}(N) \simeq 2(P-1)\alpha + \frac{2(P-1)}{P}\beta N$$

  • Core-INC Allreduce (tree, depth $D$):

$$T_{\text{INC}}(N) \simeq 2D\alpha + 2\beta N$$

For typical fat-tree topologies with $D = O(\log P)$ and practical $P$, the bandwidth term dominates for large $N$, resulting in a nearly $2\times$ reduction in total traffic compared to ring Allreduce. A Core-INC system requires that each endpoint transmit and receive $N$ only once, unlike conventional algorithms, where segments are communicated multiple times.
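The two cost formulas can be compared numerically. The parameter values below (latency, per-byte cost, endpoint count) are illustrative assumptions, not figures from the paper; the point is that the latency term shrinks from $2(P-1)$ to $2D$ hops, while the traffic savings accrue in the fabric rather than in this simple endpoint-time model.

```python
# Numeric comparison of the alpha-beta cost formulas above.
# alpha, beta, P, and N below are assumed example values.
import math

def t_ring(n_bytes, p, alpha, beta):
    # T_ring(N) ~= 2(P-1)*alpha + 2(P-1)/P * beta * N
    return 2 * (p - 1) * alpha + 2 * (p - 1) / p * beta * n_bytes

def t_inc(n_bytes, p, alpha, beta):
    # T_INC(N) ~= 2*D*alpha + 2*beta*N, binary-tree depth D = O(log P)
    d = math.ceil(math.log2(p))
    return 2 * d * alpha + 2 * beta * n_bytes

alpha, beta = 2e-6, 1e-11        # 2 us latency, ~100 GB/s per link
p, n = 1024, 8 * 2**30           # 1024 endpoints, 8 GiB message
# The alpha term drops from 2*1023 to 2*10 hop latencies; for large N
# both formulas are bandwidth-dominated, so the times converge.
```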

3. Empirical Benefits

Empirical analysis on an 8 GiB Allreduce task demonstrates that Core-INC completes in 151 ms versus 352 ms for a ring algorithm—a 57% reduction in collective latency. This efficiency directly translates to end-to-end speedup of distributed training:

  • For collective overhead at 50% of iteration time, the application speedup is 34%.
  • At 20% overhead, the achievable speedup is 11%.
  • Even a 5% improvement is highly valuable given the cost structure of AI clusters.
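The 57% figure follows directly from the 151 ms vs. 352 ms measurement, and a first-order Amdahl-style model relates collective-time savings to iteration-level speedup. This is a generic sanity check, not the paper's exact accounting; the reported 34% and 11% figures presumably reflect additional overheads that this simple model omits.

```python
# Sanity check of the latency-reduction figure, plus a generic
# Amdahl-style speedup model (an assumed simplification, not the
# paper's exact methodology).
t_inc_ms, t_ring_ms = 151, 352
latency_reduction = 1 - t_inc_ms / t_ring_ms   # ~0.57, i.e. 57%

def speedup(f, r=t_inc_ms / t_ring_ms):
    """Iteration-level speedup when a fraction f of the old iteration
    time is collective communication that shrinks by the factor r."""
    return 1 / ((1 - f) + f * r) - 1
```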

Bandwidth consumption is also halved: in a fat-tree, in-switch Allreduce and broadcast operations markedly reduce per-link and aggregate bisection-bandwidth requirements. For Allgather or Broadcast collectives, link usage drops proportionally (e.g., 9 versus 13 links for a broadcast).

4. Obstacles to Adoption

Six major hurdles identified for production deployment of Core-INC are:

  1. Low-Precision Data Types: In-switch ALUs and data paths are ill-suited for accumulating quantized (e.g., 8- or 4-bit) tensors. Workarounds can negate bandwidth gains or add design complexity.
  2. Blocked/Vector Formats: Data formats such as block-floating-point require packetized metadata and matching logic in hardware, increasing switch design complexity.
  3. Sparse Reductions: Aggregating sparse vectors in-network risks "fill-in," where index union causes vector expansion. This is not efficiently handled in current Core-INC designs.
  4. Bitwise Reproducibility: Maintaining floating-point reduction order for bit-identical results is infeasible on distributed switch fabrics, complicating debugging and regulatory compliance.
  5. Endpoint Coordination: In-switch operations necessitate per-collective state, reduction tree management, and coordination with NIC flow control and congestion—challenging switch memory and orchestration.
  6. Encryption and Security: In-network reduction requires access to plaintext payloads, clashing with standard end-to-end encryption regimes and considerably broadening the attack surface.
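Obstacle 4 is easy to demonstrate: floating-point addition is not associative, so any change in reduction order, such as partial sums being formed inside different switches, can change the final bits of the result.

```python
# Floating-point addition is not associative, so reduction order
# affects the final bits -- the root of the reproducibility problem.
a, b, c = 0.1, 0.2, 0.3
left_first = (a + b) + c    # 0.6000000000000001
right_first = a + (b + c)   # 0.6
# The two orders agree to ~15 decimal digits but differ bitwise.
```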

5. Development Trajectory and Standardization

The transition of Core-INC from research to widespread deployment is predicted to be incremental:

  • Short-term adoption will focus on "single-switch islands" in localized AI clusters, utilizing custom top-of-rack or leaf switches for local collectives.
  • Standardization efforts (e.g., Ultra-Ethernet Transport) aim to define minimal packet and control-plane interfaces for both Core-INC and Edge-INC.
  • Advancements in switch ASIC architectures—increasing match-action table width, ALU count, and SRAM capacity—are essential to mainstreaming support for emerging quantized, sparse, and block-based data types.

Once standards solidify and switch vendors ship Core-INC–capable silicon, datacenter early adopters are expected to pilot deployments in limited domains, such as parameter-server backends. Broader, multi-switch and multi-tenant adoption depends on solutions to the aforementioned system and semantic barriers.

6. Summary and Contextual Significance

Core-INC architectures represent a concerted effort to align network switch functionalities with the communication topology and arithmetic demands of modern AI workloads. By shifting aggregation and broadcast into the switch data-plane, AI collectives can achieve up to 60% latency reductions and halve core-network traffic under practical workloads. However, uptake beyond niche deployments in HPC and AI clusters will require sustained progress in data type support, reproducibility, security, and system coordination. The likely near-term landscape features single-switch deployments, with phased scaling toward multi-hop fabrics contingent on overcoming fundamental architectural and operational constraints (Hoefler et al., 27 Jan 2026).
