Globus Compute: Cloud-Hosted HPC FaaS

Updated 18 April 2026
  • Globus Compute is a cloud-hosted Function-as-a-Service platform that runs user-defined Python functions across heterogeneous HPC clusters with endpoint-based provisioning and OAuth2 security.
  • It integrates with local resource schedulers and dynamic endpoint management to scale compute tasks efficiently and to support high-performance workflows.
  • Its architecture underpins advanced scientific applications by supporting federated inference, real-time analytics, and automated, cross-site workflow orchestration.

Globus Compute is a cloud-hosted Function-as-a-Service (FaaS) platform that enables secure, on-demand, and high-performance execution of user-defined Python functions across heterogeneous, distributed High-Performance Computing (HPC) clusters and facilities. It provides a unified interface for federated, policy-controlled compute, leveraging endpoint-based resource provisioning, centralized authentication and authorization via Globus Auth, and a cloud service for orchestrating and dispatching compute tasks. Globus Compute underpins advanced scientific applications, including real-time experimental analysis at beamlines, federated inference-as-a-service at AI-scale, and automated, cross-site workflow execution (Bicer et al., 2021, Tanikanti et al., 15 Oct 2025).

1. Architectural Components and Security Model

Globus Compute consists of a centralized cloud service, pluggable execution endpoints, and integration with identity and workflow orchestration layers:

  • Cloud Service (funcX core): Accepts user function invocations, manages registration of Python functions, maintains state, and coordinates dispatch to endpoints. All operations are gated by OAuth2 tokens issued by Globus Auth (see the client-side sketch after this list).
  • Endpoints: Installed at HPC sites (e.g., Argonne ThetaGPU, DGX A100, or federated academic clusters) as persistent daemon processes. Endpoints manage job submission to the local resource manager (Cobalt, Slurm, PBS, or Kubernetes), maintain a dynamic pool of worker processes, and enforce fine-grained function execution policies. Endpoint descriptors include capacity, concurrency limits, and node allocation policies.
  • Authentication and Authorization: Globus Auth provides single sign-on and delegated token issuance (OAuth2/OpenID Connect) for all actions: data transfers, function registration, invocation, and workflow state transitions. Endpoints accept only centrally registered functions and validate invocation tokens, preventing arbitrary code injection and enabling fine-grained, user- or group-scoped access controls (Bicer et al., 2021, Tanikanti et al., 15 Oct 2025).
  • Integration: Often embedded within higher-level orchestration frameworks such as Globus Flows (state machine–driven workflow execution) and resource gateways (e.g., FIRST) (Tanikanti et al., 15 Oct 2025).
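
As a concrete illustration of this flow, the following minimal sketch registers and invokes a function through the cloud service, assuming the funcX-era Python SDK; the endpoint UUID, function body, and file paths are placeholders, not details from the cited deployments.

```python
# A minimal sketch of the registration/dispatch cycle described above,
# assuming the funcX-era SDK. Instantiating the client performs the
# Globus Auth (OAuth2) login; all subsequent calls carry those tokens.
import time
from funcx import FuncXClient

fxc = FuncXClient()  # triggers the Globus Auth token flow on first use

def reconstruct(data_path, iterations=100):
    # Body executes on the remote endpoint's workers, not locally.
    return f"reconstructed {data_path} after {iterations} iterations"

function_id = fxc.register_function(reconstruct)  # central registration
endpoint_id = "REPLACE-WITH-ENDPOINT-UUID"        # UUID of an endpoint such as "theta-funcx"

task_id = fxc.run("/data/scan_001.h5", iterations=100,
                  endpoint_id=endpoint_id, function_id=function_id)

# get_result raises while the task is still pending, so poll until it completes.
while True:
    try:
        print(fxc.get_result(task_id))
        break
    except Exception:
        time.sleep(5)  # pending (or transient error); retry
```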

2. Endpoint Provisioning, Configuration, and Lifecycle

Endpoints must be preinstalled and configured on supported compute resources:

  • Installation and Activation: The funcx-endpoint package is installed via pip and configured with site-specific parameters. A named endpoint (e.g., "theta-funcx") can be started as a background UNIX service, which registers with the Globus Compute cloud and polls for tasks (Bicer et al., 2021).
  • Dynamic Resource Allocation: Endpoints use wrapper scripts to integrate with local schedulers. When available workers are exhausted, the endpoint submits new jobs via the scheduler (e.g., Cobalt), launches worker daemons, and advertises newly available capacity. This supports burst allocation and elasticity across both interactive and batch pooling configurations (see the configuration sketch after this list).
  • GPU and Resource Binding: On multi-GPU nodes, a file-lock–mediated mechanism ensures unique assignment of CUDA_VISIBLE_DEVICES to each worker, preventing resource conflicts. Each worker takes an fcntl lock on a shared-memory token file that tracks which devices are allocated (Bicer et al., 2021); a sketch of this mechanism follows this list.
  • Function Registration: Users register Python functions through the SDK (e.g., FuncXClient().register_function(...)). Endpoints expose only pre-registered functions, ensuring provenance and execution control.
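
The elasticity described above is expressed in the endpoint's site-specific configuration. The following sketch assumes funcX 0.3-era module paths and Parsl's CobaltProvider; queue, account, and sizing values are placeholders. The block limits implement burst allocation: no scheduler jobs until tasks arrive, then up to max_blocks concurrent Cobalt jobs as the worker pool is exhausted.

```python
# A sketch of a site-specific endpoint configuration (assumed funcX-era
# module paths; placeholder queue/account/sizing values).
from funcx_endpoint.endpoint.utils.config import Config
from funcx_endpoint.executors import HighThroughputExecutor
from parsl.providers import CobaltProvider

config = Config(
    executors=[
        HighThroughputExecutor(
            max_workers_per_node=8,      # e.g., one worker per GPU
            provider=CobaltProvider(
                queue="funcx",           # dedicated queue (see Section 6)
                account="YOUR_ALLOCATION",
                nodes_per_block=1,
                init_blocks=0,           # start with no scheduler jobs
                min_blocks=0,
                max_blocks=4,            # burst up to 4 concurrent jobs
                walltime="01:00:00",
            ),
        )
    ],
)
```

The file-lock GPU binding can likewise be sketched in a few lines. This is a minimal illustration of the mechanism, not the deployed code; the token-file path and record format are assumptions.

```python
# Each worker takes an exclusive fcntl lock on a shared token file, claims
# the first unclaimed device index, and exports CUDA_VISIBLE_DEVICES.
import fcntl
import os

TOKEN_FILE = "/dev/shm/funcx_gpu_tokens"  # shared-memory token file (assumed path)
NUM_GPUS = 8

def claim_gpu() -> int:
    with open(TOKEN_FILE, "a+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)     # serialize claims across workers
        f.seek(0)
        taken = {int(line) for line in f if line.strip()}
        device = next(i for i in range(NUM_GPUS) if i not in taken)
        f.write(f"{device}\n")            # record the claim
        f.flush()
        fcntl.flock(f, fcntl.LOCK_UN)
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device)  # pin this worker
    return device
```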

3. Workflow Orchestration and Analytics Integration

Globus Compute is commonly embedded in multi-stage scientific pipelines, orchestrated via JSON-based state machines (a flow-definition sketch appears after the list below):

  • Data Acquisition to Analysis Pipelines: At experimental facilities, a local data acquisition system detects new data sets and initiates a Globus Flow. This flow typically includes:

    1. Data Staging: Raw data is transferred via Globus Transfer to remote storage.
    2. Computation: A Globus Compute function (e.g., ptychographic reconstruction) is invoked at the remote endpoint, consuming the newly staged data and producing results.
    3. Result Return: Outputs are transferred back to the originating site. Each step in the pipeline is authorized via OAuth2 tokens specific to the required scopes (transfer, compute), providing a tightly controlled and auditable chain of operations (Bicer et al., 2021).
  • Parallelization: For tasks such as 3D ptychographic reconstruction, hundreds of flows (one per rotation angle, for instance) are executed in parallel (see the fan-out sketch after this list). High-level orchestration allows rapid scaling to utilize up to 64 GPUs across 8 nodes, reducing end-to-end analysis time from ~50,000 s (single GPU) to ~13,000 s (eight nodes) (Bicer et al., 2021).

  • Batch and Interactive Modes: In inference workloads (as in FIRST), endpoints maintain "hot" pools of model servers for low-latency execution, or submit batch jobs for high-throughput, non-interactive use (Tanikanti et al., 15 Oct 2025).
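
The staging → compute → return pipeline above can be written as the Python-dict form of a JSON flow definition. The two ActionUrls are the public Globus Transfer and Globus Compute action providers; the state names, JSONPath keys, and input fields are illustrative assumptions, not the exact schema of the cited deployments.

```python
# Sketch of a three-state Globus Flows definition (illustrative fields).
flow_definition = {
    "StartAt": "StageData",
    "States": {
        "StageData": {
            "Type": "Action",
            "ActionUrl": "https://actions.globus.org/transfer/transfer",
            "Parameters": {
                "source_endpoint_id.$": "$.input.source_ep",
                "destination_endpoint_id.$": "$.input.compute_ep",
                "transfer_items": [
                    {
                        "source_path.$": "$.input.raw_path",
                        "destination_path.$": "$.input.staged_path",
                    }
                ],
            },
            "ResultPath": "$.StageResult",
            "Next": "Reconstruct",
        },
        "Reconstruct": {
            "Type": "Action",
            "ActionUrl": "https://compute.actions.globus.org",
            "Parameters": {
                "endpoint.$": "$.input.compute_endpoint_id",
                "function.$": "$.input.function_id",
                "kwargs.$": "$.input.reconstruction_kwargs",
            },
            "ResultPath": "$.ComputeResult",
            "Next": "ReturnResults",
        },
        "ReturnResults": {
            "Type": "Action",
            "ActionUrl": "https://actions.globus.org/transfer/transfer",
            # Parameters mirror StageData with source/destination swapped.
            "Parameters": {},
            "End": True,
        },
    },
}
```

The per-angle fan-out itself reduces to submitting many independent tasks against the same registered function. The sketch below assumes the funcX-era SDK with placeholder UUIDs; in production each angle is typically an independent flow run rather than a bare task.

```python
# Per-angle parallel dispatch: one task per rotation angle.
import time
from funcx import FuncXClient

fxc = FuncXClient()
FUNCTION_ID = "REPLACE-WITH-FUNCTION-UUID"
ENDPOINT_ID = "REPLACE-WITH-ENDPOINT-UUID"

task_ids = [
    fxc.run(f"/data/angle_{angle:03d}.h5",
            endpoint_id=ENDPOINT_ID, function_id=FUNCTION_ID)
    for angle in range(180)
]

# Gather results as tasks finish; get_result raises while a task is pending.
results, pending = {}, set(task_ids)
while pending:
    for tid in list(pending):
        try:
            results[tid] = fxc.get_result(tid)
            pending.discard(tid)
        except Exception:
            pass  # still running; retry on the next sweep
    if pending:
        time.sleep(10)
```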

4. Federated Inference and Scientific AI Workflows

Globus Compute undergirds cross-cluster inference systems such as the Federated Inference Resource Scheduling Toolkit (FIRST):

  • OpenAI-Compatible API Gateway: FIRST presents an OpenAI-compatible front end (Django-Ninja), secured with Globus Auth, enabling token-level access and tracking. Incoming HTTP requests are converted to Globus Compute function calls targeted at specific endpoints (Tanikanti et al., 15 Oct 2025).
  • Endpoint Federation and Auto-scaling: Endpoints are configured in a federated mesh. The selection logic prefers endpoints where requested AI models are already "running," otherwise allocating new resources up to a configured maximum. Throughput scaling is achieved by adding scheduler jobs as the request rate $\lambda$ exceeds the per-node target capacity $\mu$. The number of instances is

$$n_\text{instances} = \min\left(N_\text{max},\ \left\lceil \lambda / \mu \right\rceil\right),$$

enabling linear scaling, subject to system and scheduler constraints (Tanikanti et al., 15 Oct 2025). A direct transcription of this rule appears after this list.

  • Security Model: Only pre-registered inference functions are executed, and endpoints authenticate each invocation against the gateway’s policy, preventing arbitrary code execution in shared HPC clusters.
  • Throughput and Latency: For Llama 3.3 70B, scaling from 1 to 4 nodes increases request throughput from 8.3 to 23.9 req/s and token throughput from 1432 to 4131 tok/s, while median latency decreases from 54.5 s to 16.0 s (Tanikanti et al., 15 Oct 2025).
  • Concurrency: FIRST sustains up to 700 concurrent interactive sessions on WebUI front ends for 8B–70B models, with near-linear token throughput scaling (e.g., 2119 tok/s, 14.7 req/s for Llama 3.1 8B at 700 sessions) (Tanikanti et al., 15 Oct 2025).
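
The scaling rule above transcribes directly into code. The following sketch is a literal implementation, with $\lambda$, $\mu$, and $N_\text{max}$ passed as parameters; the example reuses the ~8.3 req/s single-node Llama 3.3 70B figure reported above.

```python
# A direct transcription of n_instances = min(N_max, ceil(lambda / mu));
# N_max and the per-node capacity mu are deployment parameters.
import math

def n_instances(request_rate: float, per_node_capacity: float, n_max: int) -> int:
    """Instances to run: min(N_max, ceil(lambda / mu))."""
    return min(n_max, math.ceil(request_rate / per_node_capacity))

# Example: 23 req/s of demand at ~8.3 req/s per node, capped at 4 nodes
# -> ceil(23 / 8.3) = 3 instances.
assert n_instances(23.0, 8.3, 4) == 3
```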

5. Performance, Scalability, and Bottleneck Analysis

Empirical studies and analytical models reveal key scaling trends and bottlenecks:

  • Multi-GPU Scaling (Ptychography): Single-node iteration times for datasets (e.g., "Catalyst" [1.8K,128,128]) decrease from 1.0 s (1 GPU) to 0.5 s (8 GPUs). Larger datasets ("Siemens" [32K,256,256]) see times drop from 11 s (4 GPUs) to 4.5 s (8 GPUs), indicating sublinear scaling as communication overheads become non-negligible for small per-task data sizes (Bicer et al., 2021).
  • End-to-End Timing: For 8K–32K-class datasets, data transfer contributes a large share of total wall time (70–280 s for upload/return transfers of 2–8 GB, versus ~450–520 s compute for 100 iterations). Overheads from Globus Flows and job queue waits account for 10–50% of total time (Bicer et al., 2021).
  • Analytical Model: end-to-end wall time is modeled as

$$T(N) = \frac{D}{B(N)} + T_\text{compute}(N) + T_\text{overhead}$$

where $D$ is the dataset size, $B(N)$ is the effective transfer bandwidth, $T_\text{compute}(N)$ is the multi-GPU reconstruction time, and $T_\text{overhead}$ aggregates queue waits, token refreshes, and cloud service calls (Bicer et al., 2021). A direct transcription of this model appears after this list.

  • Elastic Inference: In FIRST, scaling node count $N$ against request load $\lambda$ yields aggregate capacity approaching $N\mu$, up to scheduler or networking limits. Cold-start latency remains a challenge for very large models, mitigated by keeping idle nodes hot for up to 2 hours (a "keep-alive" policy) (Tanikanti et al., 15 Oct 2025).
  • Comparison to Direct Model Serving: Globus Compute–based dispatch shows higher maximum throughput than directly exposed model servers (e.g., for Llama 3.3 70B: 9.2 req/s vs. 5.8 req/s at maximum load) with modest additional startup latencies (e.g., FIRST median latency 9.2 s vs. direct 3.0 s at low rates) (Tanikanti et al., 15 Oct 2025).
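
The analytical model above transcribes directly into a wall-time estimator. In the sketch below, the example values are placeholders chosen within the ranges reported in this section, not measured figures.

```python
# A direct transcription of T(N) = D / B(N) + T_compute(N) + T_overhead.
def wall_time(dataset_bytes: float, bandwidth_bytes_per_s: float,
              t_compute_s: float, t_overhead_s: float) -> float:
    """End-to-end wall time in seconds."""
    return dataset_bytes / bandwidth_bytes_per_s + t_compute_s + t_overhead_s

# Example: 8 GB dataset at ~50 MB/s effective bandwidth, ~500 s compute,
# ~100 s queue/token/cloud overhead -> ~760 s end to end.
print(wall_time(8e9, 50e6, 500.0, 100.0))  # 760.0
```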

6. Operational Lessons and Best Practices

Experience from both data-intensive scientific pipelines and AI inference workloads yields operational guidelines:

  • Resource Pre-allocation: Reserving a dedicated queue (e.g., "funcx queue") avoids long scheduler waits and supports bursty or interactive workloads (Bicer et al., 2021).
  • Batching and Granularity: Avoiding small per-task data sizes (<100–200 MB) on multi-GPU nodes prevents communication overhead from dominating. Grouping small "views" into larger batches improves efficiency (Bicer et al., 2021).
  • GPU Assignment: Use of file-lock–based device pinning prevents CUDA conflicts during multi-worker execution (Bicer et al., 2021).
  • Token Management: Monitoring Globus Auth token lifetime (default 1 h) is critical for long-running jobs to prevent authorization errors (Bicer et al., 2021).
  • Rate Limiting and Caching: Early deployments found that repeated credential introspection and endpoint-status polling could trigger API rate limits; caching these lookups eliminates the bottleneck (see the sketch after this list) (Tanikanti et al., 15 Oct 2025).
  • Load Patterns: Provisioning a "warm pool" of model-serving nodes improves interactive response times (<1 s latency) in AI workflows, while batch mode maximizes throughput for non-interactive jobs (Tanikanti et al., 15 Oct 2025).
  • Policy and Security: Pre-registration of executable functions and tight integration with institutional IdP/MFA via Globus Groups ensure compliance with multi-tenant security requirements (Tanikanti et al., 15 Oct 2025).
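
The caching fix above amounts to memoizing status lookups for a short time-to-live so that polling does not trip API rate limits. This is a minimal sketch; fetch_endpoint_status is a hypothetical stand-in for the real status call, and the TTL is an assumed tuning value.

```python
# TTL-cached endpoint-status lookups (illustrative; not the FIRST code).
import time

_CACHE: dict[str, tuple[float, dict]] = {}
TTL_SECONDS = 30.0

def fetch_endpoint_status(endpoint_id: str) -> dict:
    # Placeholder for the actual service call.
    return {"endpoint": endpoint_id, "status": "online"}

def cached_endpoint_status(endpoint_id: str) -> dict:
    now = time.monotonic()
    hit = _CACHE.get(endpoint_id)
    if hit is not None and now - hit[0] < TTL_SECONDS:
        return hit[1]                        # fresh: skip the API call
    status = fetch_endpoint_status(endpoint_id)
    _CACHE[endpoint_id] = (now, status)
    return status
```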

7. Scientific Impact and Future Directions

Globus Compute enables high-throughput, secure, and federated execution of scientific workflows at the intersection of experimental data acquisition, HPC-scale computation, and AI-driven analytics. By abstracting compute as a centrally managed, policy-driven service, it allows experimental facilities and research teams to instantiate robust pipelines that span the edge-to-HPC continuum, scaling from single-node to facility-scale deployments.

This suggests a template for future scientific and AI applications: serverless, fine-grained orchestration of complex, distributed computation with integrated access controls and transparent federation across resource boundaries. A plausible implication is that as data rates and model sizes increase, operationalization of "multi-endpoint, multi-site" strategies—backed by Globus Compute's protocol—will be necessary to achieve real-time feedback and interactive analytics at scale (Bicer et al., 2021, Tanikanti et al., 15 Oct 2025).
