Global Control Store (GCS) in Ray
- Global Control Store (GCS) is a distributed, fault-tolerant, and horizontally scalable key-value store designed for managing Ray's control metadata.
- It decouples the control state from schedulers and workers, supporting sub-millisecond latencies and robust lineage-based fault tolerance.
- GCS uses sharding and chain-replication techniques to ensure consistent, reactive metadata updates under massive distributed loads.
The Global Control Store (GCS) is the distributed, fault-tolerant, and horizontally scalable control-plane storage subsystem designed for the Ray execution engine. Ray targets emerging AI workloads that produce extremely high task and object throughput, such as reinforcement learning applications spawning millions of tasks per second on large-scale clusters. GCS abstracts and manages the lineage, object locations, actor metadata, and function registration, serving as the authoritative store for all control state in the Ray system. By decoupling control state from schedulers and workers, GCS eliminates scheduling bottlenecks and enables robust, fine-grained, and scalable management of distributed tasks, actors, and objects (Moritz et al., 2017).
1. Motivation, Requirements, and Design Objectives
Ray's unified task-parallel and actor-based execution model places unprecedented requirements on the control plane. Fine-grained scheduling, rapid metadata dissemination, and lineage-based system recovery drive the need for a global state store capable of sub-millisecond latencies and millions of operations per second. Centralized masters become bottlenecks under these loads, particularly in scheduling decisions and metadata lookups.
GCS was designed to address:
- High throughput and low latency: Support millions of key-value operations per second at sub-millisecond to low single-digit-millisecond latencies.
- Horizontal scalability: Achieve scalable control-plane storage through sharding, decoupling state storage from both workers and schedulers.
- Fault tolerance with strong consistency: Ensure deterministic replay and exactly-once semantics for lineage entries, while surviving shard failures with minimal client-observed delay.
A key motivation is robust support for lineage-based fault tolerance: tasks in Ray are stateless and idempotent, while actors are stateful and require checkpointing. Comprehensive, durable lineage and metadata management in GCS is essential for transparent task re-execution and actor recovery upon node or component failure, without application code changes (Moritz et al., 2017).
2. Data Model and Metadata Schema
GCS is fundamentally a sharded and replicated key-value store, augmented with publish-subscribe channels for reactive metadata updates. The schema comprises four principal tables:
| Table | Key | Value / Purpose |
|---|---|---|
| FunctionTable | FunctionID | Serialized remote-function or actor-constructor code and resource requirements |
| TaskTable | TaskID | {dependencies: List<ObjectID>, return_ids: List<ObjectID>, spec: TaskSpec, state, attempt_count} |
| ObjectTable | ObjectID | {locations: Set<NodeID>, size_bytes: Int, creation_task: TaskID}; pub-sub notification |
| ActorTable | ActorID | {home_node: NodeID, checkpoint_location: Optional<ObjectID>, last_method_seq: Int} |
The formal mappings are:
- ObjectTable: ObjectID → {locations ⊆ NodeID, size_bytes, creation_task}
- TaskTable: TaskID → {dependencies, return_ids, spec, state, attempt_count}
- ActorTable: ActorID → {home_node, checkpoint_location, last_method_seq}
Each table is logically partitioned into shards by a hash of its key, shard(k) = hash(k) mod N, where N is the number of GCS shards.
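To make the schema concrete, the records can be sketched as Python dataclasses. This is a hedged illustration: the field names follow the table above, but the types and class names (TaskEntry, ObjectEntry, ActorEntry) are assumptions for exposition, not Ray's internal definitions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class TaskEntry:
    """TaskTable row: everything needed to (re)execute a task."""
    dependencies: List[str]        # ObjectIDs the task consumes
    return_ids: List[str]          # ObjectIDs the task produces
    spec: bytes                    # serialized TaskSpec
    state: str                     # e.g. PENDING / RUNNING / FINISHED
    attempt_count: int = 0

@dataclass
class ObjectEntry:
    """ObjectTable row: where an object currently lives."""
    locations: Set[str] = field(default_factory=set)  # NodeIDs holding a copy
    size_bytes: int = 0
    creation_task: Optional[str] = None               # TaskID, for lineage

@dataclass
class ActorEntry:
    """ActorTable row: enough state to recover an actor after failure."""
    home_node: str
    checkpoint_location: Optional[str] = None  # ObjectID of last checkpoint
    last_method_seq: int = 0                   # replay methods after this point
```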
3. System Architecture and Sharding
Ray deploys a GCS cluster alongside its schedulers, object stores, and worker processes. The GCS is organized into logical shards, each implemented as a chain of Redis instances. This architecture follows a lightweight chain-replication protocol, ensuring strong consistency for each key.
Client Access Pattern:
- Each Ray process (driver, worker, or scheduler) interacts with GCS via a client library.
- The client:
  - Computes the shard index for the key: shard(k) = hash(k) mod N
  - Locates the primary (the head of the replication chain) for that shard
  - Issues a unary RPC (Get or Put) to the primary
  - For a Put, the primary forwards the request down the chain; the tail acknowledges back up the chain on completion
Subscriptions are handled through Redis pub-sub on the corresponding shard's channel.
Scalability is achieved by increasing the shard count N, redistributing key ranges as load demands. Stateless schedulers and object stores read or cache metadata as needed, without central bottlenecks (Moritz et al., 2017).
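A minimal sketch of this client-side routing logic follows. The names (GCSClient, ShardStub) and the use of CRC32 as the hash function are assumptions for illustration; each ShardStub stands in for a replication chain reached over RPC.

```python
import zlib
from typing import Dict, List, Optional

class ShardStub:
    """Stands in for one GCS shard (in Ray, a chain of Redis instances)."""
    def __init__(self) -> None:
        self.data: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self.data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self.data.get(key)

class GCSClient:
    """Routes each key to the shard that owns it: shard(k) = hash(k) mod N."""
    def __init__(self, num_shards: int = 4) -> None:
        self.shards: List[ShardStub] = [ShardStub() for _ in range(num_shards)]

    def _shard_for(self, key: str) -> ShardStub:
        # CRC32 stands in for whatever hash the real client library uses.
        return self.shards[zlib.crc32(key.encode()) % len(self.shards)]

    def put(self, key: str, value: str) -> None:
        self._shard_for(key).put(key, value)   # unary RPC to the chain head

    def get(self, key: str) -> Optional[str]:
        return self._shard_for(key).get(key)

client = GCSClient()
client.put("TaskTable:task-42", "RUNNING")
assert client.get("TaskTable:task-42") == "RUNNING"
```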
4. Distributed Algorithms: Fault Tolerance and Consistency
GCS enforces per-shard chain replication for durability and strong per-key consistency. Write and read protocols are explicitly defined:
- Write protocol (put): the client sends the write to the head of the shard's chain; each replica applies the update and forwards it to its successor; once the tail applies the write, it acknowledges back up the chain, at which point the write is durable and committed (sketched below).
- Read protocol: clients issue reads to the tail of the chain, which holds only committed state, guaranteeing up-to-date, consistent results.
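As a concrete illustration, here is a minimal in-process sketch of a chain-replicated shard. The class names (ChainReplica, ChainShard) are hypothetical, and synchronous method calls stand in for the RPCs between replicas; the point is the ordering guarantee, not Ray's actual implementation.

```python
from typing import Dict, Optional

class ChainReplica:
    """One replica in a chain; stores key-value pairs in memory."""
    def __init__(self) -> None:
        self.store: Dict[str, str] = {}
        self.successor: Optional["ChainReplica"] = None

    def put(self, key: str, value: str) -> None:
        # Apply locally, then forward down the chain; the call returns only
        # after the tail has applied the write, so a completed put implies
        # the value is present on every replica.
        self.store[key] = value
        if self.successor is not None:
            self.successor.put(key, value)

    def get(self, key: str) -> Optional[str]:
        return self.store.get(key)

class ChainShard:
    """A shard of the control store: writes enter the head, reads hit the tail."""
    def __init__(self, chain_length: int = 3) -> None:
        replicas = [ChainReplica() for _ in range(chain_length)]
        for replica, nxt in zip(replicas, replicas[1:]):
            replica.successor = nxt
        self.head, self.tail = replicas[0], replicas[-1]

    def put(self, key: str, value: str) -> None:
        self.head.put(key, value)   # write propagates head -> tail

    def get(self, key: str) -> Optional[str]:
        return self.tail.get(key)   # tail holds only committed state

shard = ChainShard()
shard.put("object:abc", "node-17")
assert shard.get("object:abc") == "node-17"
```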
This scheme enables:
- Consistent and deterministic lineage replay
- Exactly-once semantics for metadata entries
- Transparent shard/replica recovery with minimal delay
Pub-sub notification allows clients (e.g., workers expecting a particular object) to react immediately when objects become available or their locations change.
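For instance, a worker waiting for an object could subscribe to that object's channel on the owning shard. The sketch below uses the standard redis-py client; the channel and key naming scheme is illustrative, not Ray's actual convention.

```python
import json
import redis  # standard redis-py client; assumes a Redis server on localhost

r = redis.Redis(host="localhost", port=6379)

# Writer side: an object store records a new location and notifies subscribers.
def publish_object_location(object_id: str, node_id: str) -> None:
    r.sadd(f"ObjectTable:{object_id}:locations", node_id)
    r.publish(f"ObjectTable:{object_id}", json.dumps({"location": node_id}))

# Reader side: a worker blocks until the object it needs becomes available.
def wait_for_object(object_id: str) -> str:
    pubsub = r.pubsub()
    pubsub.subscribe(f"ObjectTable:{object_id}")
    # Subscribe first, then check existing locations, so a publish that
    # races with the subscription cannot be missed.
    existing = r.smembers(f"ObjectTable:{object_id}:locations")
    if existing:
        return next(iter(existing)).decode()
    for message in pubsub.listen():
        if message["type"] == "message":
            return json.loads(message["data"])["location"]
```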
5. Interaction with Ray Components and Execution Engine
Schedulers in Ray are stateless and interact with GCS to fetch live metadata required for scheduling decisions. This decoupling removes the “critical path” dependency on schedulers for task dispatch and object management, allowing rapid and distributed scaling. Upon failures, all restarted components (drivers, workers, schedulers) can read necessary lineage and object/actor metadata from the durable GCS for reconstruction and recovery.
Tasks—being stateless and idempotent—are tracked via lineage stored in the TaskTable, enabling exact replay. Actors—being stateful—are recovered via ActorTable entries, which record last checkpoint locations and execution sequence numbers. ObjectTable subscription channels notify interested parties when objects appear or disappear from the cluster, enabling event-driven distributed coordination (Moritz et al., 2017).
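A hedged sketch of how TaskTable lineage supports reconstructing a lost object: the in-memory tables and the resubmit callback below are hypothetical stand-ins for Ray's internals, but the recursive walk over creation_task entries mirrors the recovery logic described above.

```python
from typing import Callable, Dict, List, Set

# Hypothetical in-memory stand-ins for the GCS tables described above.
object_table: Dict[str, Set[str]] = {}          # ObjectID -> live locations
creation_task: Dict[str, str] = {}              # ObjectID -> TaskID that produced it
task_dependencies: Dict[str, List[str]] = {}    # TaskID -> input ObjectIDs

def reconstruct(object_id: str, resubmit: Callable[[str], None]) -> None:
    """Recursively re-execute the lineage that produced a lost object.

    Because tasks are stateless and idempotent, replaying the creating task
    (after first reconstructing any of its own missing inputs) yields an
    identical object, giving deterministic recovery without application
    code changes.
    """
    if object_table.get(object_id):
        return  # a live copy still exists somewhere in the cluster
    task_id = creation_task[object_id]
    for dep in task_dependencies.get(task_id, []):
        reconstruct(dep, resubmit)   # ensure the task's inputs exist first
    resubmit(task_id)                # re-execute; outputs repopulate the table
```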
6. Scalability and Performance Characteristics
GCS supports scaling beyond 1.8 million tasks per second with sub-millisecond to low single-digit millisecond control state operation latencies, as required for contemporary reinforcement learning workloads. Horizontal scalability is realized through the increase in shard count and chain reconfiguration. Strong consistency, fine-grained parallelism, and rapid pub-sub updates ensure the GCS does not become a limiting factor under massive distributed load.
A plausible implication is that GCS’s architectural patterns could generalize to other high-throughput distributed control-store scenarios, especially those requiring strong metadata consistency, lineage-based recovery, and fine-grained reactivity (Moritz et al., 2017).