Single Controller interface
The Single Controller provides a unified interface for managing groups of distributed workers, backed by Ray or other backends, and for executing functions across them. It simplifies dispatching tasks and collecting results, particularly for data-parallel and model-parallel workloads.
Core APIs
- class verl.single_controller.Worker(*args, **kwargs)
A distributed worker that handles initialization and configuration for distributed training.
This class manages worker initialization and configuration, and provides methods for executing distributed operations. It handles communication settings, device configuration, and worker metadata management.
- __init__(cuda_visible_devices=None) → None
Initialize the worker with environment settings and device configuration.
- Parameters:
cuda_visible_devices (str, optional) – CUDA visible devices configuration. Defaults to None.
- static __new__(cls, *args, **kwargs)
Create a new Worker instance with proper initialization based on environment settings.
- get_cuda_visible_devices()
Get the CUDA visible devices configuration.
- get_master_addr_port()
Get the master address and port for distributed communication.
- property rank
Get the rank of this worker in the distributed setup.
- property world_size
Get the total number of workers in the distributed setup.
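A minimal sketch of a custom worker built on this class. The subclass name, its constructor argument, and the describe method are purely illustrative, and the assumption that get_master_addr_port returns an (address, port) pair is not stated above; dispatching such methods across a group is shown in the RayWorkerGroup example further below.

```python
from verl.single_controller import Worker


class EchoWorker(Worker):  # hypothetical subclass, for illustration only
    def __init__(self, tag: str = "demo"):
        super().__init__()  # picks up rank, world size, and master addr/port from the environment
        self.tag = tag

    def describe(self) -> str:
        # Each worker can inspect its own position in the distributed setup.
        # get_master_addr_port is assumed here to return a 2-item (addr, port) pair.
        addr, port = self.get_master_addr_port()
        return f"[{self.tag}] rank {self.rank}/{self.world_size} via {addr}:{port}"
```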
- class verl.single_controller.WorkerGroup(resource_pool: ResourcePool, **kwargs)
Base class for managing a group of workers in a distributed system. The class provides methods for worker management, aliveness checking, and method binding.
- __init__(resource_pool: ResourcePool, **kwargs) → None
- property world_size
Number of workers in the group.
- class verl.single_controller.ClassWithInitArgs(cls, *args, **kwargs)
Wrapper class that stores constructor arguments for deferred instantiation. This class is particularly useful for remote class instantiation where the actual construction needs to happen at a different time or location.
- __call__() → Any
Instantiate the stored class with the stored arguments.
- __init__(cls, *args, **kwargs) → None
Initialize the ClassWithInitArgs instance.
- Parameters:
cls – The class to be instantiated later
*args – Positional arguments for the class constructor
**kwargs – Keyword arguments for the class constructor
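A minimal sketch of deferred instantiation with ClassWithInitArgs, using only the constructor and __call__ documented above; the Trainer class is purely illustrative.

```python
from verl.single_controller import ClassWithInitArgs


class Trainer:  # any class works; this one is illustrative
    def __init__(self, lr: float, steps: int = 100):
        self.lr = lr
        self.steps = steps


# Record the constructor and its arguments without instantiating yet.
deferred = ClassWithInitArgs(Trainer, 3e-4, steps=500)

# Later (for example on a remote process) the stored call is replayed.
trainer = deferred()  # equivalent to Trainer(3e-4, steps=500)
assert trainer.lr == 3e-4 and trainer.steps == 500
```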
- class verl.single_controller.ResourcePool(process_on_nodes=None, max_colocate_count: int = 10, n_gpus_per_node=8)
Manages a pool of resources across multiple nodes, tracking process counts and GPU allocations. The class provides methods to calculate world size, local world sizes, and local ranks across all nodes in the pool.
- __init__(process_on_nodes=None, max_colocate_count: int = 10, n_gpus_per_node=8) → None
Initialize the ResourcePool with node processes and GPU configuration.
- Parameters:
process_on_nodes (List[int], optional) – List of process counts per node. Defaults to None, which is treated as an empty list.
max_colocate_count (int, optional) – Maximum number of processes that can be colocated. Defaults to 10.
n_gpus_per_node (int, optional) – Number of GPUs available per node. Defaults to 8.
- local_rank_list() → List[int]
Returns a flat list of local ranks for all processes across all nodes.
- local_world_size_list() → List[int]
Returns a flat list where each process has its local world size.
- property world_size
Total number of processes across all nodes in the pool.
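A small sketch of ResourcePool bookkeeping. The expected values in the comments follow a plausible reading of the descriptions above (flat lists ordered node by node) rather than verified output; the node sizes are illustrative.

```python
from verl.single_controller import ResourcePool

# Two nodes running 8 and 4 processes respectively (illustrative numbers).
pool = ResourcePool(process_on_nodes=[8, 4], n_gpus_per_node=8)

print(pool.world_size)               # 12: total processes across the pool
print(pool.local_world_size_list())  # e.g. [8]*8 + [4]*4, one entry per process
print(pool.local_rank_list())        # e.g. 0..7 for node 0 followed by 0..3 for node 1
```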
- class verl.single_controller.ray.RayWorkerGroup(resource_pool: RayResourcePool = None, ray_cls_with_init: RayClassWithInitArgs = None, bin_pack: bool = True, name_prefix: str = None, detached=False, worker_names=None, worker_handles: List[ActorHandle] = None, ray_wait_register_center_timeout: int = 300, device_name='cuda', **kwargs)
A group of Ray workers that can be managed collectively.
This class extends WorkerGroup to provide Ray-specific functionality for creating and managing groups of Ray actors with specific resource requirements and scheduling strategies.
- __init__(resource_pool: RayResourcePool = None, ray_cls_with_init: RayClassWithInitArgs = None, bin_pack: bool = True, name_prefix: str = None, detached=False, worker_names=None, worker_handles: List[ActorHandle] = None, ray_wait_register_center_timeout: int = 300, device_name='cuda', **kwargs) → None
Initialize a RayWorkerGroup.
- Parameters:
resource_pool – Resource pool for worker allocation
ray_cls_with_init – Class with initialization arguments for workers
bin_pack – Whether to use strict bin packing for resource allocation
name_prefix – Prefix for worker names
detached – Whether workers should be detached
worker_names – Names of existing workers to attach to
ray_wait_register_center_timeout – Timeout for waiting on register center
**kwargs – Additional keyword arguments
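A condensed end-to-end sketch of creating a RayWorkerGroup. The RayResourcePool and RayClassWithInitArgs constructor arguments shown here (process_on_nodes, cls=ray.remote(...)) are not documented in this section and are assumptions modeled on their base classes above; MyWorker is illustrative.

```python
import ray

from verl.single_controller import Worker
from verl.single_controller.ray import (
    RayClassWithInitArgs,
    RayResourcePool,
    RayWorkerGroup,
)


class MyWorker(Worker):  # illustrative worker subclass
    def ping(self) -> str:
        return f"hello from rank {self.rank} of {self.world_size}"


ray.init()

# One node with 4 worker processes (assumed RayResourcePool signature,
# mirroring ResourcePool.process_on_nodes documented above).
pool = RayResourcePool(process_on_nodes=[4])

# Deferred construction of the worker class (assumed to accept cls=ray.remote(...)).
cls_with_init = RayClassWithInitArgs(cls=ray.remote(MyWorker))

wg = RayWorkerGroup(
    resource_pool=pool,
    ray_cls_with_init=cls_with_init,
    name_prefix="demo",
)
print(wg.world_size)  # 4 workers managed as one group
```

Worker methods are typically bound onto the group through verl's dispatch decorators so they can be invoked collectively; that mechanism is outside the scope of this section.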
- verl.single_controller.ray.create_colocated_worker_cls(class_dict: dict[str, RayClassWithInitArgs])
Returns a colocated worker class that delegates method calls to every class in class_dict, so that multiple worker roles can share the same set of worker processes.
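A hedged sketch of colocating two worker roles on the same processes with create_colocated_worker_cls. ActorWorker and CriticWorker are illustrative stubs, and the RayResourcePool and RayClassWithInitArgs arguments follow the same assumptions as the previous example.

```python
import ray

from verl.single_controller import Worker
from verl.single_controller.ray import (
    RayClassWithInitArgs,
    RayResourcePool,
    RayWorkerGroup,
    create_colocated_worker_cls,
)


class ActorWorker(Worker):  # illustrative role
    pass


class CriticWorker(Worker):  # illustrative role
    pass


ray.init()

class_dict = {
    "actor": RayClassWithInitArgs(cls=ray.remote(ActorWorker)),
    "critic": RayClassWithInitArgs(cls=ray.remote(CriticWorker)),
}

# One combined worker class whose instances host both roles, so actor and
# critic share the same processes (and therefore the same GPUs).
colocated_cls = create_colocated_worker_cls(class_dict)

pool = RayResourcePool(process_on_nodes=[8])
wg = RayWorkerGroup(resource_pool=pool, ray_cls_with_init=colocated_cls)
```

Calls made through the combined group are delegated to each underlying class, matching the delegation behaviour described above.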