Architecture Overview
This section contains architecture diagrams and documentation for flyte-sdk.
Available Diagrams
Flyte SDK System Context
The Flyte SDK System Context diagram illustrates the interactions between users, the Flyte SDK, and the various external systems involved in data and ML orchestration.
The flyte package (comprising both the Python library and the CLI) serves as the primary interface for Data Scientists and Developers to author, manage, and execute workflows and tasks.
Key interactions discovered in the codebase:
- Flyte Admin (Backend): The SDK communicates with the Flyte control plane via gRPC using flyteidl2 clients. This includes services for project management, task registration, execution control, and log retrieval.
- Cloud Storage: The SDK performs data I/O (S3, GCS, Azure Blob Storage) by first requesting signed URLs from the DataProxy service (part of Flyte Admin) and then performing direct HTTP uploads/downloads using httpx or fsspec.
- Container Registry: For task execution, the SDK builds container images locally (via Docker/Podman) or remotely (via a backend service) and pushes them to registries like GHCR, ECR, or GCR.
- Kubernetes Cluster: While the SDK defines Kubernetes-native objects (like Pod specs), the actual orchestration and execution on Kubernetes are managed by the Flyte backend (specifically Flyte Propeller).
- Local Runtime: The SDK includes a sandbox environment (using pydantic_monty) for local execution and testing of tasks without requiring a full Flyte deployment.
Key Architectural Findings:
- The SDK uses gRPC (via flyteidl2) to communicate with Flyte Admin services (Project, Task, Run, etc.).
- Data persistence in S3/GCS/Azure is achieved through a 'DataProxy' pattern: the SDK gets signed URLs from Admin and uploads/downloads directly to cloud storage.
- Image building is modular, supporting local Docker/Podman builds or remote builds via a backend task.
- Local execution (Sandbox) is a core feature, allowing users to run workflows locally before deploying to a cluster.
- The SDK acts as a translator, converting Python code and Kubernetes-native specs into Flyte-compatible Protobuf messages.
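The DataProxy pattern described above (ask the control plane for a signed URL, then write directly to storage) can be illustrated with a small stand-alone simulation. This is only a sketch under stated assumptions: FakeDataProxy, FakeBlobStore, the HMAC signing scheme, and the shared secret are all hypothetical stand-ins and do not reflect the actual Flyte Admin API or any cloud provider's signing format.

```python
import hashlib
import hmac
import time

# Hypothetical signing key shared by the fake control plane and the fake store.
SECRET = b"demo-secret"


def sign(path: str, expires_at: int) -> str:
    """Return an HMAC signature that binds a storage path to an expiry time."""
    message = f"{path}:{expires_at}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


class FakeDataProxy:
    """Stand-in for the service that issues signed upload locations."""

    def create_upload_location(self, path: str, ttl_seconds: int = 60) -> dict:
        expires_at = int(time.time()) + ttl_seconds
        return {"path": path, "expires_at": expires_at, "signature": sign(path, expires_at)}


class FakeBlobStore:
    """Stand-in for S3/GCS/Azure: accepts a write only with a valid signature."""

    def __init__(self) -> None:
        self.objects = {}

    def put(self, signed: dict, data: bytes) -> None:
        expected = sign(signed["path"], signed["expires_at"])
        if not hmac.compare_digest(expected, signed["signature"]):
            raise PermissionError("invalid signature")
        if time.time() > signed["expires_at"]:
            raise PermissionError("signed location expired")
        self.objects[signed["path"]] = data


def upload(proxy: FakeDataProxy, store: FakeBlobStore, path: str, data: bytes) -> str:
    """Two-step upload: request a signed location, then write directly to storage."""
    signed = proxy.create_upload_location(path)  # step 1: control plane call
    store.put(signed, data)                      # step 2: direct write, no proxying of bytes
    return signed["path"]
```

The point of the pattern is that the payload bytes never pass through the control plane; only the short-lived authorization does.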
Flyte SDK Internal Component Architecture
The Flyte SDK architecture is built around a modular structure that separates the user-facing CLI and remote client from the internal execution engine and type system.
Key Components
- flyte.cli: The primary entry point for users, providing commands for running, deploying, and managing Flyte entities. It leverages the remote client and configuration system.
- flyte.remote: A high-level client-side representation of Flyte entities (Apps, Runs, Tasks). It manages communication with the Flyte backend via a specialized ClientSet and handles async/sync bridging using syncify.
- flyte.models: Defines the core data models and serialization contexts used throughout the SDK. It acts as the "lingua franca" for all other components.
- flyte.types: Contains the TypeEngine and TypeTransformer system, which is responsible for marshalling between native Python types and Flyte's internal literal representation.
- Execution Engine: Handles the task lifecycle, including input loading, execution, and output uploading. It is used both locally and within the cluster.
- Distributed Compute Plugins: A collection of extensions that provide specialized task types (e.g., Spark, Dask, Ray) and type transformers. They integrate with the SDK via the flyte.extend layer.
- ImageBuilder: Provides logic for building and caching Docker images, supporting both local and remote build backends.
Relationships and Flow
- User Interaction: Users interact with the CLI, which uses the Remote Client to trigger actions on the Flyte backend.
- Data Marshalling: Both the Remote Client and the Execution Engine rely on the Type System to convert data.
- Plugin Integration: Plugins extend the SDK's capabilities by implementing templates provided by the Extension Layer, which in turn interacts with the Execution Engine.
- Infrastructure: The Image Builder is used during deployment to package code and dependencies into container images, often coordinated by the Remote Client.
- Storage: The Storage abstraction provides a uniform interface for the Execution Engine to read and write data to various blob stores (S3, GCS, etc.).
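To make the Data Marshalling relationship more concrete, here is a minimal sketch of a transformer registry that converts native Python values into tagged, literal-like dictionaries. It is an illustration only: the register decorator, the to_literal function, and the dictionary layout are assumptions made for this example and are not the real TypeEngine or TypeTransformer interfaces.

```python
from typing import Any, Callable, Dict

# Hypothetical registry: exact Python type -> function that builds a literal-like dict.
_TRANSFORMERS: Dict[type, Callable[[Any], dict]] = {}


def register(py_type: type):
    """Decorator that registers a transformer for one Python type."""
    def decorator(fn: Callable[[Any], dict]) -> Callable[[Any], dict]:
        _TRANSFORMERS[py_type] = fn
        return fn
    return decorator


def to_literal(value: Any) -> dict:
    """Look up the transformer by the value's exact type and apply it."""
    transformer = _TRANSFORMERS.get(type(value))
    if transformer is None:
        raise TypeError(f"no transformer registered for {type(value).__name__}")
    return transformer(value)


@register(int)
def _from_int(value: int) -> dict:
    return {"scalar": {"integer": value}}


@register(str)
def _from_str(value: str) -> dict:
    return {"scalar": {"string": value}}


@register(list)
def _from_list(value: list) -> dict:
    # Collections recurse so that nested values are converted as well.
    return {"collection": [to_literal(item) for item in value]}


@register(dict)
def _from_dict(value: dict) -> dict:
    return {"map": {key: to_literal(item) for key, item in value.items()}}
```

A plugin could extend such a registry by registering a transformer for its own type, which is the kind of extensibility the Plugin Integration bullet describes.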
Key Architectural Findings:
- The SDK uses a 'syncify' layer to bridge async gRPC/ConnectRPC calls with synchronous user code.
- 'flyte.models' is a central dependency used for data representation across all layers.
- The 'TypeEngine' in 'flyte.types' is the core of Flyte's data marshalling, used by both the runtime and plugins.
- 'flyte.remote' acts as a high-level facade over the lower-level 'ClientSet' which manages multiple service-specific gRPC clients.
- Plugins are decoupled from the core SDK but integrate through a formal 'flyte.extend' registry.
- Image building is an internal utility ('flyte._internal.imagebuild') that supports both local Docker builds and remote builds via the Flyte backend.
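The async/sync bridging mentioned above can be approximated with a background event loop. The sketch below is not the SDK's syncify implementation; the syncify_sketch decorator, the aio attribute, and get_task_name are names chosen for this illustration, and the approach (one daemon thread running a dedicated loop) is just one common way to build such a bridge.

```python
import asyncio
import functools
import threading


class _LoopThread:
    """Runs a dedicated event loop on a daemon thread."""

    def __init__(self) -> None:
        self.loop = asyncio.new_event_loop()
        thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        thread.start()

    def run(self, coro):
        # Schedule the coroutine on the background loop and block for its result.
        return asyncio.run_coroutine_threadsafe(coro, self.loop).result()


_background = _LoopThread()


def syncify_sketch(async_fn):
    """Wrap an async function so it can be called from synchronous code."""

    @functools.wraps(async_fn)
    def wrapper(*args, **kwargs):
        return _background.run(async_fn(*args, **kwargs))

    # Keep the original coroutine function reachable for async callers.
    wrapper.aio = async_fn
    return wrapper


@syncify_sketch
async def get_task_name(task_id: int) -> str:
    await asyncio.sleep(0)  # stands in for an async RPC call
    return f"task-{task_id}"
```

Synchronous callers write get_task_name(7), while code that is already inside an event loop can await get_task_name.aio(7) and avoid blocking a thread.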
Workflow Registration and Execution Flow
This sequence diagram illustrates the dual flows of Workflow Registration and Workflow Execution within the Flyte SDK.
Registration Flow
The registration flow begins when a user calls flyte.deploy(). The SDK packages the environment and its tasks into a DeploymentPlan. Each task, represented by a TaskTemplate, is translated into a wire-format TaskSpec. The ClientSet (the low-level client behind flyte.remote) then communicates with the remote Flyte Admin service via the TaskService to register the task definition.
Execution Flow
The execution flow is triggered by flyte.run(). The _Runner component manages the lifecycle of the remote execution:
- Image & Code Bundling: It ensures that the necessary container images are built and the source code is bundled (either as a .tgz or .pkl file).
- Type Validation: The TypeEngine is invoked to validate the user's native Python arguments against the task's NativeInterface. It transforms these arguments into Flyte Literal values.
- Input Upload: Before triggering the run, the SDK uploads the transformed inputs to the Flyte backend using the DataProxyService.
- Execution Trigger: Finally, the ClientSet calls RunService.create_run() to initiate the execution on the Flyte cluster. The call returns a Run object (the handle for the remote execution), which the user can use to track status, logs, and outputs.
Key architectural components discovered include the ClientSet for unified API access, the TypeEngine for extensible type transformations, and the TaskTemplate as the primary unit of definition for both tasks and workflows in this SDK version.
Key Architectural Findings:
- The SDK uses a ClientSet class as a unified entry point for all Flyte Admin services (Task, Run, DataProxy, etc.).
- TaskTemplate serves as the core definition entity, encompassing what was traditionally split between tasks and workflows.
- The TypeEngine handles the complex mapping between native Python types and Flyte's protobuf-based Literal system.
- Remote execution involves a multi-step process: image building, code bundling, input transformation, input uploading via DataProxy, and finally run creation.
- The Run class acts as the handle for an active or completed execution, providing methods to wait for completion, fetch logs, and retrieve outputs.
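The ordering of the remote execution steps listed above can be sketched as a small pipeline in which input validation happens before anything is uploaded. Everything here is hypothetical: validate_inputs, run_sketch, and the step labels are placeholders for illustration, the image, bundle, upload, and run-creation steps are only recorded rather than performed, and the validation uses the standard library's inspect module instead of the SDK's TypeEngine and NativeInterface.

```python
import inspect
from typing import get_type_hints


def validate_inputs(fn, **kwargs) -> dict:
    """Check keyword arguments against fn's signature and simple class type hints."""
    # bind() raises TypeError for missing or unexpected arguments.
    bound = inspect.signature(fn).bind(**kwargs)
    hints = get_type_hints(fn)
    for name, value in bound.arguments.items():
        expected = hints.get(name)
        # Only plain classes are checked here; generics are skipped in this sketch.
        if isinstance(expected, type) and not isinstance(value, expected):
            raise TypeError(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return dict(bound.arguments)


def run_sketch(fn, **kwargs):
    """Return the ordered step labels and the validated inputs for one run."""
    steps = ["build_image", "bundle_code"]    # placeholders, nothing is built
    inputs = validate_inputs(fn, **kwargs)    # fails before any upload happens
    steps.append("validate_inputs")
    steps.append("upload_inputs")             # placeholder for the DataProxy upload
    steps.append("create_run")                # placeholder for the run creation call
    return steps, inputs


def add(x: int, y: int) -> int:
    return x + y
```

Calling run_sketch(add, x=1, y=2) yields the five step labels in order, while run_sketch(add, x="a", y=2) raises TypeError before the upload step is reached.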
Flyte Domain Entity Model
The Flyte Domain Entity Model diagram represents the core objects used within the Flyte SDK to define and execute tasks and workflows.
Key findings from the codebase:
- Task Definition: Tasks are defined using TaskTemplate (SDK-side) which contains the NativeInterface (Python types), resources, and retry strategies.
- Interfaces: The SDK uses NativeInterface to represent Python-level type hints, which are then converted to the IDL-level TypedInterface. A TypedInterface consists of Variable objects, each mapping a name to a LiteralType.
- Execution Model: In this version of the SDK (using flyteidl2), "Workflows" are implemented as "Pure Python Workflows" where a task can call other tasks. The execution is represented by a Run, which contains a root Action. Actions can be nested, representing the call graph of the execution.
- Triggers: Automation and scheduling (traditionally LaunchPlan in Flyte) are handled via Trigger entities, which associate a task with an AutomationSpec (e.g., Cron or FixedRate).
- Data Model: Data is passed between tasks as Literal objects, which can be scalars, collections, or maps.
The diagram shows the relationships between these entities, highlighting how SDK-side definitions (TaskTemplate) relate to remote execution objects (Run, Action) and the underlying IDL types (TypedInterface, Variable, Literal).
Key Architectural Findings:
- Tasks are defined by TaskTemplate which holds a NativeInterface for Python-level type information.
- NativeInterface is serialized into the IDL-level TypedInterface, which contains Variable definitions.
- Executions are tracked via Run and Action entities; Action supports nesting to represent complex task call sequences.
- Trigger entities manage task automation, replacing or augmenting the traditional LaunchPlan concept.
- Data exchange is performed using Literal objects, which encapsulate various data types including scalars and collections.
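The relationship between a Run, its root Action, and nested Actions can be pictured as a simple tree. The dataclasses below are a hedged illustration only: ActionNode, RunSketch, and their fields are invented for this example and are much simpler than the SDK's actual Run and Action classes.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class ActionNode:
    """Hypothetical stand-in for an Action: one task invocation in the call graph."""

    name: str
    children: List["ActionNode"] = field(default_factory=list)

    def call(self, name: str) -> "ActionNode":
        """Record that this action invoked another task and return the child."""
        child = ActionNode(name)
        self.children.append(child)
        return child

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, str]]:
        """Yield (depth, name) pairs in pre-order, i.e. call order."""
        yield depth, self.name
        for child in self.children:
            yield from child.walk(depth + 1)


@dataclass
class RunSketch:
    """Hypothetical stand-in for a Run: a named execution with one root action."""

    name: str
    root: ActionNode


# Build a small call graph: main calls prepare (which calls load) and then train.
root = ActionNode("main")
prepare = root.call("prepare")
prepare.call("load")
root.call("train")
run = RunSketch("example-run", root)
tree = list(run.root.walk())
```

Walking the tree reproduces the call graph: each nested action appears one level deeper than the task that invoked it.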
Flyte SDK Deployment and Runtime Architecture
The deployment architecture of the Flyte SDK illustrates a multi-environment setup involving local development, CI/CD pipelines, and a containerized runtime on Kubernetes.
Key Components:
- Local Developer Machine: Developers use the flyte.cli to author, test, and deploy workflows. Local execution runs tasks directly on the machine, while remote execution submits them to the Flyte Control Plane.
- CI/CD Runner: Automated pipelines (GitHub Actions) handle the building and publishing of the SDK to PyPI and base Docker images to the GitHub Container Registry (GHCR).
- Flyte Control Plane: Consists of Flyte Admin, which provides the API for registration and execution management, and Flyte Propeller, the core orchestration engine that manages workflow state and schedules tasks.
- Kubernetes Cluster: The data plane where Flyte Worker Pods execute user code. It also hosts Image Builder Pods for remote image construction when local Docker is not used.
- External Infrastructure: Includes Object Storage (S3/GCS) for persisting task inputs, outputs, and code bundles, and Docker Registries for managing container images.
Data Flow:
- Developers register workflows via the CLI to Flyte Admin.
- Flyte Propeller receives execution requests and creates Kubernetes Pods.
- Pods pull required images from the Docker Registry and download code bundles/data from Object Storage.
- Task results are written back to Object Storage and tracked by the control plane.
Key Architectural Findings:
- The SDK supports both local and remote image builders, utilizing Docker Buildx locally or a specialized Flyte task for remote builds.
- Flyte Propeller acts as the primary orchestrator within the Kubernetes cluster, managing the lifecycle of worker pods.
- The 'a0' entrypoint in the SDK is the primary runtime wrapper used inside containerized tasks to load and execute user code.
- CI/CD pipelines are responsible for multi-platform image builds (Python 3.10-3.14) and publishing to PyPI.
- Object storage (S3/GCS) is used as the source of truth for task data and serialized code bundles (TGZ/PKL).
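As an illustration of what a TGZ code bundle involves, the following sketch packs source files into an in-memory gzip-compressed tarball and unpacks it again, as a worker would after downloading it. The function names, the in-memory layout, and the use of a SHA-256 digest to identify the payload are assumptions of this example; the SDK's real bundling logic is not shown in this document.

```python
import hashlib
import io
import tarfile
from typing import Dict, Tuple


def bundle_sources(files: Dict[str, bytes]) -> Tuple[bytes, str]:
    """Pack files into an in-memory .tgz and return (archive bytes, sha256 hex digest)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name in sorted(files):  # stable member order
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    payload = buffer.getvalue()
    return payload, hashlib.sha256(payload).hexdigest()


def unpack_sources(payload: bytes) -> Dict[str, bytes]:
    """Read every regular file back out of a .tgz payload."""
    with tarfile.open(fileobj=io.BytesIO(payload), mode="r:gz") as archive:
        return {
            member.name: archive.extractfile(member).read()
            for member in archive.getmembers()
            if member.isfile()
        }


sources = {"main.py": b"print('hello')\n", "pkg/util.py": b"X = 1\n"}
payload, digest = bundle_sources(sources)
restored = unpack_sources(payload)
```

In the deployment picture above, the payload would be written to object storage by the developer's machine and read back by the worker pod before user code is loaded.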
Workflow Execution State Machine
This state diagram illustrates the lifecycle of a workflow execution (referred to as a "Run" or "Action" in the SDK) as it transitions through various phases defined by the ActionPhase enum.
The execution begins in the QUEUED state upon creation. It then progresses through resource allocation (WAITING_FOR_RESOURCES) and environment setup (INITIALIZING) before entering the RUNNING state where the actual user code executes.
Transitions to terminal states can occur from various points:
- Success: The execution completes successfully from the RUNNING state.
- Failure: An error occurs during execution or initialization.
- Timeout: The execution exceeds its configured time limit.
- Abort: A user manually terminates the execution via the abort() method, which is possible from any non-terminal state.
The SDK provides mechanisms like wait() and watch() to monitor these transitions in real time, often displaying progress via a rich status interface. Terminal states are identified by the is_terminal property on the phase object.
Key Architectural Findings:
- The primary state entity is the ActionPhase enum, which maps directly to Flyte's internal execution phases.
- Execution states include QUEUED, WAITING_FOR_RESOURCES, INITIALIZING, RUNNING, SUCCEEDED, FAILED, ABORTED, and TIMED_OUT.
- Terminal states are explicitly defined in the code via the ActionPhase.is_terminal property.
- Transitions are primarily managed by the Flyte backend, but the SDK can trigger the ABORTED state through the abort() method on Run or Action objects.
- The Run and Action classes in src/flyte/remote/ provide the interface for observing state changes using wait() and watch() methods.
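The state machine described in this section can be modeled with a small enum and a transition table. This is a sketch inferred from the prose, not the SDK's ActionPhase definition: PhaseSketch, the enum values, and can_transition are hypothetical, and the table assumes that only a RUNNING execution can time out, which the text does not state explicitly.

```python
from enum import Enum


class PhaseSketch(Enum):
    """Illustrative stand-in for the phases listed above."""

    QUEUED = "queued"
    WAITING_FOR_RESOURCES = "waiting_for_resources"
    INITIALIZING = "initializing"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    ABORTED = "aborted"
    TIMED_OUT = "timed_out"

    @property
    def is_terminal(self) -> bool:
        return self in _TERMINAL


_TERMINAL = frozenset(
    {PhaseSketch.SUCCEEDED, PhaseSketch.FAILED, PhaseSketch.ABORTED, PhaseSketch.TIMED_OUT}
)

# Forward transitions taken from the prose; ABORTED is handled separately below.
_FORWARD = {
    PhaseSketch.QUEUED: {PhaseSketch.WAITING_FOR_RESOURCES},
    PhaseSketch.WAITING_FOR_RESOURCES: {PhaseSketch.INITIALIZING},
    PhaseSketch.INITIALIZING: {PhaseSketch.RUNNING, PhaseSketch.FAILED},
    # Assumption: timeouts are only modeled from RUNNING.
    PhaseSketch.RUNNING: {PhaseSketch.SUCCEEDED, PhaseSketch.FAILED, PhaseSketch.TIMED_OUT},
}


def can_transition(current: PhaseSketch, target: PhaseSketch) -> bool:
    """Return True if the sketch allows moving from current to target."""
    if current.is_terminal:
        return False  # terminal phases never change
    if target is PhaseSketch.ABORTED:
        return True   # abort is allowed from any non-terminal phase
    return target in _FORWARD.get(current, set())
```

A polling helper in the style of wait() would simply loop until the observed phase reports is_terminal as true.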