Compute Resources & Hardware Accelerators
In flyte-sdk, you declare compute requirements with the Resources class, which lets you specify CPU, memory, ephemeral storage (disk), shared memory, and hardware accelerators for your tasks. You can set these requirements for every task in a TaskEnvironment, or adjust them for a single call with the .override() method.
Basic Resource Allocation
You can define CPU and memory requirements using integers, floats, or Kubernetes-style strings. To handle varying workloads, you can also provide a tuple to specify a (request, limit) range.
import flyte

# Define resources in a TaskEnvironment
env = flyte.TaskEnvironment(
    name="compute-env",
    resources=flyte.Resources(
        cpu="500m",    # 0.5 cores
        memory="1Gi",  # 1 GiB memory
        disk="10Gi",   # 10 GiB ephemeral storage
    ),
)

@env.task
async def process_data(x: int) -> int:
    return x + 1

# Override resources for a specific call
async def dynamic_task():
    # Request 1 CPU (limit 2) and 2Gi memory (limit 4Gi)
    await process_data.override(
        resources=flyte.Resources(
            cpu=(1, 2),
            memory=("2Gi", "4Gi"),
        )
    )(x=10)
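To make the string forms above concrete, here is a minimal sketch of how Kubernetes-style quantities map to plain numbers. The helpers cpu_to_cores and memory_to_bytes are hypothetical and are not part of flyte-sdk; they handle only millicores and a few binary suffixes, and exist purely for illustration.

```python
from typing import Union

# Binary (power-of-two) suffixes used by Kubernetes memory quantities.
# Illustrative subset only; Kubernetes accepts more suffixes than these.
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def cpu_to_cores(value: Union[int, float, str]) -> float:
    """Convert a CPU quantity such as 1, 0.5, or "500m" to a number of cores."""
    if isinstance(value, (int, float)):
        return float(value)
    if value.endswith("m"):  # millicores: 1000m == 1 core
        return int(value[:-1]) / 1000
    return float(value)


def memory_to_bytes(value: Union[int, str]) -> int:
    """Convert a memory quantity such as "1Gi" or 1024 to a byte count."""
    if isinstance(value, int):
        return value
    for suffix, factor in _BINARY_SUFFIXES.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)  # plain byte count given as a string


print(cpu_to_cores("500m"))    # half a core
print(memory_to_bytes("1Gi"))  # one gibibyte, in bytes
```

Read this way, the cpu="500m" setting in the environment above asks for half a core, and the ("2Gi", "4Gi") tuple in the override asks for a request of 2 GiB with a limit of 4 GiB.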
Hardware Accelerators
flyte-sdk supports a wide range of hardware accelerators, including NVIDIA GPUs, Google Cloud TPUs, AWS Neuron, AMD GPUs, and Habana Gaudi.
Simple GPU Allocation
For standard GPU requests, pass a string of the form "<type>:<quantity>", or pass a plain integer to request that many GPUs of any available type.
# Request 1 NVIDIA T4 GPU
res_t4 = flyte.Resources(gpu="T4:1")
# Request 2 NVIDIA A100 GPUs
res_a100 = flyte.Resources(gpu="A100:2")
# Request 1 of any available GPU
res_any = flyte.Resources(gpu=1)
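The "<type>:<quantity>" format can be read as a device name and a count separated by a colon. The sketch below shows one way to split such a value; parse_gpu_spec is a hypothetical helper written for this page, not a flyte-sdk function, and it does not check the device name against the supported list.

```python
from typing import Optional, Tuple, Union


def parse_gpu_spec(spec: Union[int, str]) -> Tuple[Optional[str], int]:
    """Split a GPU spec into (device type, quantity).

    An integer means "any GPU", so the device type is returned as None.
    """
    if isinstance(spec, int):
        return None, spec
    device, sep, quantity = spec.partition(":")
    if not sep or not device or not quantity.isdigit():
        raise ValueError(f"expected '<type>:<quantity>', got {spec!r}")
    return device, int(quantity)


print(parse_gpu_spec("T4:1"))
print(parse_gpu_spec("A100:2"))
print(parse_gpu_spec(1))
```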
Advanced GPU Configuration (MIG)
For NVIDIA GPUs that support Multi-Instance GPU (MIG), use the GPU helper to specify a partition. This is supported for A100, A100 80G, H100, and H200.
# Request a 1g.5gb partition on an A100
gpu_config = flyte.GPU(device="A100", quantity=1, partition="1g.5gb")
resources = flyte.Resources(gpu=gpu_config)
TPU Slices
To use Google Cloud TPUs, use the TPU helper to specify the device type and the slice topology (partition).
# Request a V5P TPU with a 2x2x1 topology
tpu_config = flyte.TPU(device="V5P", partition="2x2x1")
resources = flyte.Resources(gpu=tpu_config)
Other Accelerators
flyte-sdk provides dedicated helpers for other specialized hardware:
# AWS Neuron (Inferentia/Trainium)
neuron_res = flyte.Resources(gpu=flyte.Neuron(device="Trn1"))
# AMD GPUs
amd_res = flyte.Resources(gpu=flyte.AMD_GPU(device="MI300X"))
# Habana Gaudi
gaudi_res = flyte.Resources(gpu=flyte.HABANA_GAUDI(device="Gaudi1"))
Shared Memory
For tasks that require high-performance inter-process communication or large data loading (common in deep learning), you can configure shared memory (/dev/shm).
# Set shared memory to a specific size
res_shm = flyte.Resources(shm="16Gi")
# Automatically set shared memory to the maximum available on the node
res_auto = flyte.Resources(shm="auto")
Advanced Customization with PodTemplate
When standard Resources are insufficient, PodTemplate allows you to customize the underlying Kubernetes Pod specification directly. This is useful for adding environment variables, image pull secrets, or custom labels.
from kubernetes.client import V1Container, V1EnvVar, V1LocalObjectReference, V1PodSpec
import flyte

pod_template = flyte.PodTemplate(
    primary_container_name="primary",
    labels={"team": "ml-platform"},
    annotations={"description": "high-memory-worker"},
    pod_spec=V1PodSpec(
        containers=[
            V1Container(
                name="primary",
                env=[V1EnvVar(name="DATASET_VERSION", value="v2")],
            )
        ],
        image_pull_secrets=[V1LocalObjectReference(name="my-registry-key")],
    ),
)

env = flyte.TaskEnvironment(
    name="custom-pod-env",
    pod_template=pod_template,
)
Troubleshooting
Resource Ranges and Singular Values
While Resources supports tuples for (request, limit), some internal flyte-sdk operations require singular values. If you encounter a ValueError stating a value "can not be a list or tuple", ensure you are providing a single int, float, or str for that specific context.
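If your own code receives either form, one way to avoid this error is to normalize the value before passing it on. The sketch below is an assumption about how you might do that in user code; split_request_limit and require_single are hypothetical helpers, not flyte-sdk APIs, and the error text in require_single only imitates the message quoted above.

```python
from typing import Tuple, Union

Scalar = Union[int, float, str]


def split_request_limit(
    value: Union[Scalar, Tuple[Scalar, Scalar]]
) -> Tuple[Scalar, Scalar]:
    """Return (request, limit); a single value is used for both."""
    if isinstance(value, (tuple, list)):
        if len(value) != 2:
            raise ValueError("expected a (request, limit) pair")
        return value[0], value[1]
    return value, value


def require_single(value: Union[Scalar, tuple, list]) -> Scalar:
    """Imitate a context that accepts one value and rejects a range."""
    if isinstance(value, (tuple, list)):
        raise ValueError("value can not be a list or tuple")
    return value


# Collapse a range to its request before using it where a range is rejected.
request, limit = split_request_limit(("2Gi", "4Gi"))
print(require_single(request))
```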
GPU Validation
- Quantity: The quantity for any Device (GPU, TPU, etc.) must be at least 1. Passing 0 or negative values will trigger a ValueError.
- Partition Validation: flyte-sdk validates partitions against specific device types. For example, 1g.5gb is valid for A100 but will be rejected for T4. Similarly, TPU topologies like 2x2x1 are validated against the specific TPU version (e.g., V5P).
- Device Types: When using the string format (e.g., "T4:1"), the device name must match one of the supported types defined in flyte.Accelerators.
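The quantity and partition rules above can be pictured as checks that run when a device request is constructed. The following is a minimal sketch of that idea, not flyte-sdk's implementation: DeviceRequest and _ALLOWED_PARTITIONS are hypothetical names, and the partition table contains only the two cases mentioned in this section.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative subset only; the authoritative list lives in flyte-sdk.
_ALLOWED_PARTITIONS = {
    "A100": {"1g.5gb"},
    "T4": set(),  # no partitions accepted for T4 in this sketch
}


@dataclass(frozen=True)
class DeviceRequest:
    device: str
    quantity: int = 1
    partition: Optional[str] = None

    def __post_init__(self) -> None:
        # Rule 1: at least one device must be requested.
        if self.quantity < 1:
            raise ValueError("quantity must be at least 1")
        # Rule 2: a partition must be valid for the chosen device type.
        if self.partition is not None:
            allowed = _ALLOWED_PARTITIONS.get(self.device, set())
            if self.partition not in allowed:
                raise ValueError(
                    f"partition {self.partition!r} is not valid for {self.device}"
                )


ok = DeviceRequest(device="A100", quantity=1, partition="1g.5gb")
print(ok)

try:
    DeviceRequest(device="T4", quantity=1, partition="1g.5gb")
except ValueError as err:
    print(f"rejected: {err}")
```

Validating at construction time means a bad request fails where it is written, rather than later when the task is scheduled.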