Efficient and Scalable Agentic AI with Heterogeneous Systems

Published 25 Jul 2025 in cs.LG, cs.AI, and cs.DC | (2507.19635v1)

Abstract: AI agents are emerging as a dominant workload in a wide range of applications, promising to be the vehicle that delivers the promised benefits of AI to enterprises and consumers. Unlike conventional software or static inference, agentic workloads are dynamic and structurally complex. Often these agents are directed graphs of compute and IO operations that span multi-modal data input and conversion, data processing and context gathering (e.g., vector DB lookups), multiple LLM inferences, tool calls, etc. To scale AI agent usage, we need efficient and scalable deployment and agent-serving infrastructure. To tackle this challenge, in this paper, we present a system design for dynamic orchestration of AI agent workloads on heterogeneous compute infrastructure spanning CPUs and accelerators, both from different vendors and across different performance tiers within a single vendor. The system delivers several building blocks: a framework for planning and optimizing agentic AI execution graphs using cost models that account for compute, memory, and bandwidth constraints of different HW; an MLIR-based representation and compilation system that can decompose AI agent execution graphs into granular operators and generate code for different HW options; and a dynamic orchestration system that can place the granular components across a heterogeneous compute infrastructure and stitch them together while meeting an end-to-end SLA. Our design performs a system-level TCO optimization, and preliminary results show that leveraging a heterogeneous infrastructure can deliver significant TCO benefits. A surprising preliminary finding is that for some workloads a heterogeneous combination of older-generation GPUs with newer accelerators can deliver similar TCO to the latest-generation homogeneous GPU infrastructure design, potentially extending the life of deployed infrastructure.


Authors (3)

Summary

The paper introduces a system design that utilizes heterogeneous compute resources to optimize agentic AI execution graphs and reduce TCO.
It leverages MLIR-based compilation and dynamic orchestration to efficiently map AI tasks to diverse hardware resources.
Preliminary results suggest that, for some workloads, a mix of older-generation GPUs and newer accelerators can deliver TCO similar to that of a latest-generation homogeneous GPU cluster.


Introduction

The paper "Efficient and Scalable Agentic AI with Heterogeneous Systems" (2507.19635) addresses the emerging need for scalable deployment of agentic AI workloads. These workloads differ significantly from traditional software and static inference due to their dynamic and structurally complex nature, often involving directed graphs of compute and I/O operations. The authors propose a system design that leverages heterogeneous compute infrastructure to optimize the deployment and serving of AI agents, reducing Total Cost of Ownership (TCO) and allowing more flexible infrastructure usage.

System Design and Architecture

The proposed system is built around heterogeneous compute infrastructure: CPUs and accelerators, both from different vendors and across different performance tiers within a single vendor. It provides three building blocks:

Agentic AI Execution Graphs: The design includes a framework for planning and optimizing agentic AI execution graphs using cost models. These models account for the compute, memory, and bandwidth constraints of different hardware.
MLIR-Based Compilation: The system uses MLIR (Multi-Level Intermediate Representation) to decompose AI agent execution graphs into granular operators. From this representation it can generate code for different hardware options.
Dynamic Orchestration: A dynamic orchestration layer places the granular components across the heterogeneous infrastructure and stitches them together while meeting an end-to-end Service Level Agreement (SLA).
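The interplay of cost model, graph planning, and SLA-aware placement described above can be illustrated with a minimal sketch. This is not the paper's planner: the roofline-style latency estimate, the exhaustive search, the names `Device`, `Op`, `op_latency`, and `plan`, and every device and operator figure below are hypothetical assumptions chosen only for illustration. Data-transfer time between devices is ignored for brevity.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Device:
    name: str
    flops_per_s: float       # peak compute throughput (hypothetical)
    bytes_per_s: float       # memory bandwidth (hypothetical)
    dollars_per_hour: float  # hourly cost of using the device (hypothetical)


@dataclass(frozen=True)
class Op:
    name: str
    flops: float        # compute work of the operator
    bytes_moved: float  # memory traffic of the operator
    deps: tuple = ()    # names of upstream operators in the graph


def op_latency(op, dev):
    # Roofline-style estimate: the operator is limited either by
    # compute or by memory traffic, whichever takes longer.
    return max(op.flops / dev.flops_per_s, op.bytes_moved / dev.bytes_per_s)


def plan(ops, devices, sla_s):
    """Return (cost, latency, placement) for the cheapest placement whose
    critical-path latency meets the SLA, or None if none exists.
    `ops` must be listed in topological order."""
    best = None
    for choice in product(devices, repeat=len(ops)):
        finish, cost = {}, 0.0
        for op, dev in zip(ops, choice):
            lat = op_latency(op, dev)
            start = max((finish[d] for d in op.deps), default=0.0)
            finish[op.name] = start + lat
            cost += lat * dev.dollars_per_hour / 3600.0
        latency = max(finish.values())
        if latency <= sla_s and (best is None or cost < best[0]):
            placement = {op.name: dev.name for op, dev in zip(ops, choice)}
            best = (cost, latency, placement)
    return best


# Placeholder devices and a three-step agent graph (all numbers invented).
DEVICES = [
    Device("cpu", 1e12, 1e11, 1.0),
    Device("old_gpu", 1e14, 1e12, 2.0),
    Device("new_accel", 1e15, 2e12, 6.0),
]
OPS = [
    Op("retrieve", 1e9, 1e9),
    Op("prefill", 2e13, 2e10, deps=("retrieve",)),
    Op("decode", 2e12, 2e12, deps=("prefill",)),
]

if __name__ == "__main__":
    for sla in (3.0, 1.5, 0.5):
        print(sla, plan(OPS, DEVICES, sla))
```

With these invented numbers, a loose SLA lets the memory-bound steps stay on the cheaper device while the compute-bound step moves to the faster one; tightening the SLA forces more work onto the expensive device. Exhaustive search is only practical for tiny graphs, so a real planner would need a heuristic or solver-based search.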

Experimental Results

Preliminary results support the heterogeneous strategy. Notably, for some workloads, combinations of older-generation GPUs with newer accelerators such as NVIDIA's H100 and Intel's Gaudi 3 can deliver TCO similar to that of the latest homogeneous GPU setups (e.g., NVIDIA B200 clusters). This suggests that heterogeneous deployments could extend the operational life of existing GPU infrastructure.
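To show what a TCO comparison of this kind involves, here is a simplified sketch of a cost-per-request calculation. It is not the paper's TCO model. It assumes TCO per hour is amortized purchase cost plus electricity, and that a pipeline's throughput is set by its slowest stage. The names `Node`, `hourly_cost`, and `cost_per_request` and all numeric values are hypothetical placeholders, not measurements of any real hardware.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 24 * 365


@dataclass(frozen=True)
class Node:
    capex_dollars: float      # purchase cost still to amortize (0 if already written off)
    lifetime_years: float     # amortization period
    power_kw: float           # average power draw
    requests_per_hour: float  # sustained throughput for the stage it serves


def hourly_cost(node, dollars_per_kwh):
    # Amortized capital cost per hour plus energy cost per hour.
    amortized = node.capex_dollars / (node.lifetime_years * HOURS_PER_YEAR)
    return amortized + node.power_kw * dollars_per_kwh


def cost_per_request(stages, dollars_per_kwh):
    """stages: list of (node, count) pairs forming a serving pipeline.
    Throughput is limited by the slowest stage."""
    total_hourly = sum(n * hourly_cost(node, dollars_per_kwh) for node, n in stages)
    throughput = min(n * node.requests_per_hour for node, n in stages)
    return total_hourly / throughput


if __name__ == "__main__":
    # Placeholder fleet definitions (invented numbers).
    new_node = Node(87_600.0, 1.0, 10.0, 1100.0)
    old_node = Node(0.0, 1.0, 5.0, 550.0)  # already-deployed, fully depreciated
    print(cost_per_request([(new_node, 1)], 0.1))
    print(cost_per_request([(new_node, 1), (old_node, 2)], 0.1))
```

A model like this makes the trade-off explicit: already-deployed hardware carries little or no remaining capital cost, so it can stay competitive on cost per request even at lower throughput, provided it is assigned stages it handles well.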

Implications and Future Directions

The implications of this research are twofold:

Practical Deployment: The preliminary TCO results give operators a reason to consider heterogeneous infrastructure, including continued use of already-deployed hardware, as a way to scale agentic AI while controlling cost.
Theoretical Enhancements: The integration of complex execution graph planning and MLIR-based compilation poses several theoretical challenges and opportunities for further research. Continued advancements in these areas could yield even greater efficiencies and broader applicability across diverse AI workloads.

Conclusion

The paper contributes significantly to the discourse on scalable AI deployment by laying the groundwork for efficient utilization of heterogeneous compute environments. While the AI landscape rapidly evolves, systems optimizations such as those proposed here are crucial for realizing cost-effective and sustainable AI advancements. Future research is likely to build on these findings, refining cost models and orchestration strategies to further enhance the scalability and efficiency of agentic AI systems.
