Architecture Overview

This section contains architecture diagrams and documentation for flyte-sdk.

Available Diagrams

Flyte SDK System Context

The Flyte SDK System Context diagram illustrates the interactions between users, the Flyte SDK, and the various external systems involved in data and ML orchestration.

The flyte package (comprising both the Python library and the CLI) serves as the primary interface for Data Scientists and Developers to author, manage, and execute workflows and tasks.

Key interactions discovered in the codebase:

Flyte Admin (Backend): The SDK communicates with the Flyte control plane via gRPC using flyteidl2 clients. This includes services for project management, task registration, execution control, and log retrieval.
Cloud Storage: The SDK performs data I/O (S3, GCS, Azure Blob Storage) by first requesting signed URLs from the DataProxy service (part of Flyte Admin) and then performing direct HTTP uploads/downloads using httpx or fsspec.
Container Registry: For task execution, the SDK builds container images locally (via Docker/Podman) or remotely (via a backend service) and pushes them to registries like GHCR, ECR, or GCR.
Kubernetes Cluster: While the SDK defines Kubernetes-native objects (like Pod specs), the actual orchestration and execution on Kubernetes are managed by the Flyte backend (specifically Flyte Propeller).
Local Runtime: The SDK includes a sandbox environment (using pydantic_monty) for local execution and testing of tasks without requiring a full Flyte deployment.

Key Architectural Findings:

The SDK uses gRPC (via flyteidl2) to communicate with Flyte Admin services (Project, Task, Run, etc.).
Data persistence in S3/GCS/Azure is achieved through a 'DataProxy' pattern: the SDK gets signed URLs from Admin and uploads/downloads directly to cloud storage.
Image building is modular, supporting local Docker/Podman builds or remote builds via a backend task.
Local execution (Sandbox) is a core feature, allowing users to run workflows locally before deploying to a cluster.
The SDK acts as a translator, converting Python code and Kubernetes-native specs into Flyte-compatible Protobuf messages.
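
The signed-URL pattern described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the Flyte DataProxy API: the classes FakeDataProxy and FakeBlobStore, the HMAC signing scheme, the shared secret, and the blobs.example.com host are all hypothetical stand-ins. The point is only the shape of the exchange: ask the control plane for a URL, then send bytes straight to storage.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlsplit

SECRET = b"demo-secret"  # hypothetical key shared by the proxy and the store


def sign(path: str, expires: int) -> str:
    """Return an HMAC-SHA256 signature over the object path and expiry."""
    msg = f"{path}:{expires}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()


class FakeDataProxy:
    """Stand-in for the control-plane service that hands out signed URLs."""

    def create_upload_location(self, path: str, now: int, ttl: int = 60) -> str:
        expires = now + ttl
        query = urlencode({"expires": expires, "sig": sign(path, expires)})
        return f"https://blobs.example.com{path}?{query}"


class FakeBlobStore:
    """Stand-in for S3/GCS/Azure: accepts a PUT only with a valid signature."""

    def __init__(self) -> None:
        self.objects = {}

    def put(self, url: str, body: bytes, now: int) -> None:
        parts = urlsplit(url)
        params = parse_qs(parts.query)
        expires = int(params["expires"][0])
        if now > expires:
            raise PermissionError("signed URL expired")
        if not hmac.compare_digest(params["sig"][0], sign(parts.path, expires)):
            raise PermissionError("bad signature")
        self.objects[parts.path] = body


def upload(proxy: FakeDataProxy, store: FakeBlobStore, path: str, data: bytes, now: int) -> str:
    # Step 1: ask the control plane for a signed URL.
    url = proxy.create_upload_location(path, now)
    # Step 2: send the bytes directly to storage, bypassing the control plane.
    store.put(url, data, now)
    return url
```

In this sketch the control plane never sees the payload; it only authorizes a location, which is the property the DataProxy pattern is meant to provide.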

Flyte SDK Internal Component Architecture

The Flyte SDK architecture is built around a modular structure that separates the user-facing CLI and remote client from the internal execution engine and type system.

Key Components

flyte.cli: The primary entry point for users, providing commands for running, deploying, and managing Flyte entities. It leverages the remote client and configuration system.
flyte.remote: A high-level client-side representation of Flyte entities (Apps, Runs, Tasks). It manages communication with the Flyte backend via a specialized ClientSet and handles async/sync bridging using syncify.
flyte.models: Defines the core data models and serialization contexts used throughout the SDK. It acts as the "lingua franca" for all other components.
flyte.types: Contains the TypeEngine and TypeTransformer system, which is responsible for marshalling between native Python types and Flyte's internal literal representation.
Execution Engine: Handles the task lifecycle, including input loading, execution, and output uploading. It is used both locally and within the cluster.
Distributed Compute Plugins: A collection of extensions that provide specialized task types (e.g., Spark, Dask, Ray) and type transformers. They integrate with the SDK via the flyte.extend layer.
ImageBuilder: Provides logic for building and caching Docker images, supporting both local and remote build backends.

Relationships and Flow

User Interaction: Users interact with the CLI, which uses the Remote Client to trigger actions on the Flyte backend.
Data Marshalling: Both the Remote Client and the Execution Engine rely on the Type System to convert data.
Plugin Integration: Plugins extend the SDK's capabilities by implementing templates provided by the Extension Layer, which in turn interacts with the Execution Engine.
Infrastructure: The Image Builder is used during deployment to package code and dependencies into container images, often coordinated by the Remote Client.
Storage: The Storage abstraction provides a uniform interface for the Execution Engine to read and write data to various blob stores (S3, GCS, etc.).

Key Architectural Findings:

The SDK uses a 'syncify' layer to bridge async gRPC/ConnectRPC calls with synchronous user code.
'flyte.models' is a central dependency used for data representation across all layers.
The 'TypeEngine' in 'flyte.types' is the core of Flyte's data marshalling, used by both the runtime and plugins.
'flyte.remote' acts as a high-level facade over the lower-level 'ClientSet' which manages multiple service-specific gRPC clients.
Plugins are decoupled from the core SDK but integrate through a formal 'flyte.extend' registry.
Image building is an internal utility ('flyte._internal.imagebuild') that supports both local Docker builds and remote builds via the Flyte backend.
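
One common way to bridge async calls with synchronous user code, as the syncify finding above describes, is to run a single event loop on a background thread and block on results from the caller's thread. The sketch below shows that general technique only. The names syncify_sketch, _LoopThread, get_run_phase, and the .aio attribute are illustrative assumptions, and the SDK's own syncify layer may be implemented differently.

```python
import asyncio
import functools
import threading


class _LoopThread:
    """Owns one event loop that runs forever on a daemon thread."""

    def __init__(self) -> None:
        self.loop = asyncio.new_event_loop()
        thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        thread.start()

    def run(self, coro):
        # Schedule the coroutine on the background loop and block for its result.
        future = asyncio.run_coroutine_threadsafe(coro, self.loop)
        return future.result()


_BACKGROUND = _LoopThread()


def syncify_sketch(async_fn):
    """Expose an async function as a blocking call; keep the coroutine as .aio."""

    @functools.wraps(async_fn)
    def wrapper(*args, **kwargs):
        return _BACKGROUND.run(async_fn(*args, **kwargs))

    wrapper.aio = async_fn  # async callers can still await the original
    return wrapper


@syncify_sketch
async def get_run_phase(name: str) -> str:
    await asyncio.sleep(0)  # placeholder for an awaited RPC
    return f"{name}:RUNNING"
```

Synchronous code calls get_run_phase("r1") directly, while async code awaits get_run_phase.aio("r1"). A wrapper of this kind must not be called from a coroutine already running on the background loop, because the blocking wait would deadlock it.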

Workflow Registration and Execution Flow

This sequence diagram illustrates the dual flows of Workflow Registration and Workflow Execution within the Flyte SDK.

Registration Flow

The registration flow begins when a user calls flyte.deploy(). The SDK packages the environment and its tasks into a DeploymentPlan. Each task, represented by a TaskTemplate, is translated into a wire-format TaskSpec. The ClientSet (the low-level client behind flyte.remote) then communicates with the remote Flyte Admin service via the TaskService to register the task definition.

Execution Flow

The execution flow is triggered by flyte.run(). The _Runner component manages the lifecycle of the remote execution:

Image & Code Bundling: It ensures that the necessary container images are built and the source code is bundled (either as a .tgz or .pkl file).
Type Validation: The TypeEngine is invoked to validate the user's native Python arguments against the task's NativeInterface. It then transforms these arguments into Flyte Literal values.
Input Upload: Before triggering the run, the SDK uploads the transformed inputs to the Flyte backend using the DataProxyService.
Execution Trigger: Finally, the ClientSet calls RunService.create_run() to initiate the execution on the Flyte cluster. The call returns a Run object, which the user can use to track status, logs, and outputs.

Key architectural components discovered include the ClientSet for unified API access, the TypeEngine for extensible type transformations, and the TaskTemplate as the primary unit of definition for both tasks and workflows in this SDK version.

Key Architectural Findings:

The SDK uses a ClientSet class as a unified entry point for all Flyte Admin services (Task, Run, DataProxy, etc.).
TaskTemplate serves as the core definition entity, encompassing what was traditionally split between tasks and workflows.
The TypeEngine handles the complex mapping between native Python types and Flyte's protobuf-based Literal system.
Remote execution involves a multi-step process: image building, code bundling, input transformation, input uploading via DataProxy, and finally run creation.
The Run class acts as the handle for an active or completed execution, providing methods to wait for completion, fetch logs, and retrieve outputs.
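
The type-validation step can be pictured as a registry of per-type transformers that is consulted for each argument. The following is a minimal sketch of that idea, not the SDK's TypeEngine: TypeEngineSketch, convert_inputs, and the dict-shaped literals are hypothetical, and the real system emits protobuf Literal messages rather than dictionaries.

```python
from typing import Any, Callable, Dict

# A "literal" here is a plain dict; this stands in for a protobuf message.
LiteralSketch = Dict[str, Any]


class TypeEngineSketch:
    """Registry mapping a Python type to a function that builds a literal."""

    _transformers: Dict[type, Callable[[Any], LiteralSketch]] = {}

    @classmethod
    def register(cls, py_type: type, to_literal: Callable[[Any], LiteralSketch]) -> None:
        cls._transformers[py_type] = to_literal

    @classmethod
    def to_literal(cls, value: Any, expected: type) -> LiteralSketch:
        # bool subclasses int, so compare exact types to avoid silent coercion.
        if type(value) is not expected:
            raise TypeError(f"expected {expected.__name__}, got {type(value).__name__}")
        if expected not in cls._transformers:
            raise TypeError(f"no transformer registered for {expected.__name__}")
        return cls._transformers[expected](value)


TypeEngineSketch.register(int, lambda v: {"scalar": {"integer": v}})
TypeEngineSketch.register(str, lambda v: {"scalar": {"string": v}})


def convert_inputs(interface: Dict[str, type], kwargs: Dict[str, Any]) -> Dict[str, LiteralSketch]:
    """Validate keyword arguments against an interface, then convert each one."""
    missing = interface.keys() - kwargs.keys()
    extra = kwargs.keys() - interface.keys()
    if missing or extra:
        raise TypeError(f"missing={sorted(missing)} unexpected={sorted(extra)}")
    return {name: TypeEngineSketch.to_literal(kwargs[name], t) for name, t in interface.items()}
```

Because transformers are looked up by type, a plugin could add support for a new type by calling register without touching the conversion loop, which is the extensibility property the findings above attribute to the TypeEngine.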

Flyte Domain Entity Model

The Flyte Domain Entity Model diagram represents the core objects used within the Flyte SDK to define and execute tasks and workflows.

Key findings from the codebase:

Task Definition: Tasks are defined using TaskTemplate (SDK-side) which contains the NativeInterface (Python types), resources, and retry strategies.
Interfaces: The SDK uses NativeInterface to represent Python-level type hints, which are then converted to the IDL-level TypedInterface. A TypedInterface consists of Variable objects, each mapping a name to a LiteralType.
Execution Model: In this version of the SDK (using flyteidl2), "Workflows" are implemented as "Pure Python Workflows" where a task can call other tasks. The execution is represented by a Run, which contains a root Action. Actions can be nested, representing the call graph of the execution.
Triggers: Automation and scheduling (traditionally LaunchPlan in Flyte) are handled via Trigger entities, which associate a task with an AutomationSpec (e.g., Cron or FixedRate).
Data Model: Data is passed between tasks as Literal objects, which can be scalars, collections, or maps.

The diagram shows the relationships between these entities, highlighting how SDK-side definitions (TaskTemplate) relate to remote execution objects (Run, Action) and the underlying IDL types (TypedInterface, Variable, Literal).

Key Architectural Findings:

Tasks are defined by TaskTemplate which holds a NativeInterface for Python-level type information.
NativeInterface is serialized into the IDL-level TypedInterface, which contains Variable definitions.
Executions are tracked via Run and Action entities; Action supports nesting to represent complex task call sequences.
Trigger entities manage task automation, replacing or augmenting the traditional LaunchPlan concept.
Data exchange is performed using Literal objects, which encapsulate various data types including scalars and collections.
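
The relationships among these entities can be sketched with plain dataclasses. The classes below are hypothetical simplifications named with a Sketch suffix to keep them distinct from the SDK's own types: type names are stored as strings where the IDL would use LiteralType, and the single output name "o0" is an illustrative choice.

```python
import inspect
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class NativeInterfaceSketch:
    """Python-level view of a task signature."""

    inputs: Dict[str, type]
    output: type

    @classmethod
    def from_callable(cls, fn) -> "NativeInterfaceSketch":
        sig = inspect.signature(fn)
        inputs = {name: p.annotation for name, p in sig.parameters.items()}
        return cls(inputs=inputs, output=sig.return_annotation)

    def to_typed(self) -> Dict[str, Dict[str, str]]:
        # Each variable maps a name to a type tag (a string stands in for LiteralType).
        return {
            "inputs": {n: t.__name__ for n, t in self.inputs.items()},
            "outputs": {"o0": self.output.__name__},
        }


@dataclass
class ActionSketch:
    """One node in the call graph; children are tasks called by this task."""

    name: str
    children: List["ActionSketch"] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, str]]:
        yield depth, self.name
        for child in self.children:
            yield from child.walk(depth + 1)


@dataclass
class RunSketch:
    """An execution, identified by name and anchored at a root action."""

    name: str
    root: ActionSketch


def add(x: int, y: int) -> int:
    return x + y
```

Calling NativeInterfaceSketch.from_callable(add).to_typed() yields the name-to-type mapping for the function, and RunSketch.root.walk() traverses nested actions depth first, mirroring how a task that calls other tasks produces a tree of actions.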

Flyte SDK Deployment and Runtime Architecture

The deployment architecture of the Flyte SDK illustrates a multi-environment setup involving local development, CI/CD pipelines, and a containerized runtime on Kubernetes.

Key Components:

Local Developer Machine: Developers use the flyte.cli to author, test, and deploy workflows. Local execution runs tasks directly on the machine, while remote execution submits them to the Flyte Control Plane.
CI/CD Runner: Automated pipelines (GitHub Actions) build the SDK and publish it to PyPI, and build the base Docker images and push them to the GitHub Container Registry (GHCR).
Flyte Control Plane: Consists of Flyte Admin, which provides the API for registration and execution management, and Flyte Propeller, the core orchestration engine that manages workflow state and schedules tasks.
Kubernetes Cluster: The data plane where Flyte Worker Pods execute user code. It also hosts Image Builder Pods for remote image construction when local Docker is not used.
External Infrastructure: Includes Object Storage (S3/GCS) for persisting task inputs, outputs, and code bundles, and Docker Registries for managing container images.

Data Flow:

Developers register workflows via the CLI to Flyte Admin.
Flyte Propeller receives execution requests and creates Kubernetes Pods.
Pods pull required images from the Docker Registry and download code bundles/data from Object Storage.
Task results are written back to Object Storage and tracked by the control plane.

Key Architectural Findings:

The SDK supports both local and remote image builders, utilizing Docker Buildx locally or a specialized Flyte task for remote builds.
Flyte Propeller acts as the primary orchestrator within the Kubernetes cluster, managing the lifecycle of worker pods.
The 'a0' entrypoint in the SDK is the primary runtime wrapper used inside containerized tasks to load and execute user code.
CI/CD pipelines are responsible for multi-platform image builds (Python 3.10-3.14) and publishing to PyPI.
Object storage (S3/GCS) is used as the source of truth for task data and serialized code bundles (TGZ/PKL).
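
To make the TGZ code bundle concrete, here is a small sketch of packing source files into a gzip-compressed tar archive and unpacking them again, as a worker pod would need to do after download. The function names bundle_sources and unpack_sources and the content-digest scheme are assumptions for illustration; the SDK's bundling logic is not shown in this document and may differ.

```python
import hashlib
import io
import tarfile
from typing import Dict, Tuple


def bundle_sources(files: Dict[str, bytes]) -> Tuple[bytes, str]:
    """Pack files into a .tgz archive and return (archive_bytes, content_digest)."""
    digest = hashlib.sha256()
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name in sorted(files):  # sorted so the digest ignores insertion order
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
            # Hash names and contents, not archive bytes: gzip headers embed a timestamp.
            digest.update(name.encode() + b"\0" + data)
    return buf.getvalue(), digest.hexdigest()


def unpack_sources(archive: bytes) -> Dict[str, bytes]:
    """Read every regular file out of a .tgz archive held in memory."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        return {m.name: tar.extractfile(m).read() for m in tar.getmembers() if m.isfile()}
```

A digest over names and contents gives a stable identifier for a bundle, so an uploader in this sketch could skip re-uploading code that has not changed.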

Workflow Execution State Machine

This state diagram illustrates the lifecycle of a workflow execution (referred to as a "Run" or "Action" in the SDK) as it transitions through various phases defined by the ActionPhase enum.

The execution begins in the QUEUED state upon creation. It then progresses through resource allocation (WAITING_FOR_RESOURCES) and environment setup (INITIALIZING) before entering the RUNNING state where the actual user code executes.

Transitions to terminal states can occur from various points:

Success: The execution completes successfully from the RUNNING state.
Failure: An error occurs during execution or initialization.
Timeout: The execution exceeds its configured time limit.
Abort: A user manually terminates the execution via the abort() method, which is possible from any non-terminal state.

The SDK provides mechanisms like wait() and watch() to monitor these transitions in real-time, often displaying progress via a rich status interface. Terminal states are identified by the is_terminal property on the phase object.

Key Architectural Findings:

The primary state entity is the ActionPhase enum, which maps directly to Flyte's internal execution phases.
Execution states include QUEUED, WAITING_FOR_RESOURCES, INITIALIZING, RUNNING, SUCCEEDED, FAILED, ABORTED, and TIMED_OUT.
Terminal states are explicitly defined in the code via the ActionPhase.is_terminal property.
Transitions are primarily managed by the Flyte backend, but the SDK can trigger the ABORTED state through the abort() method on Run or Action objects.
The Run and Action classes in src/flyte/remote/ provide the interface for observing state changes using wait() and watch() methods.
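
The phases and transitions described in this section can be written down as a small enum plus a transition table. This is a hypothetical mirror built from the prose, not the SDK's ActionPhase: the enum values, the can_transition and wait_until_terminal helpers, and the choice to allow TIMED_OUT only from RUNNING are assumptions.

```python
from enum import Enum


class ActionPhaseSketch(Enum):
    """Hypothetical mirror of the phases listed above."""

    QUEUED = "queued"
    WAITING_FOR_RESOURCES = "waiting_for_resources"
    INITIALIZING = "initializing"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    ABORTED = "aborted"
    TIMED_OUT = "timed_out"

    @property
    def is_terminal(self) -> bool:
        return self in _TERMINAL


P = ActionPhaseSketch
_TERMINAL = frozenset({P.SUCCEEDED, P.FAILED, P.ABORTED, P.TIMED_OUT})

# Forward edges taken from the prose; TIMED_OUT only from RUNNING is an assumption.
_FORWARD = {
    P.QUEUED: {P.WAITING_FOR_RESOURCES},
    P.WAITING_FOR_RESOURCES: {P.INITIALIZING},
    P.INITIALIZING: {P.RUNNING, P.FAILED},
    P.RUNNING: {P.SUCCEEDED, P.FAILED, P.TIMED_OUT},
}


def can_transition(src: ActionPhaseSketch, dst: ActionPhaseSketch) -> bool:
    """Return True if moving from src to dst is allowed in this sketch."""
    if src.is_terminal:
        return False  # terminal phases never change
    if dst is P.ABORTED:
        return True  # abort is possible from any non-terminal phase
    return dst in _FORWARD.get(src, set())


def wait_until_terminal(phases) -> ActionPhaseSketch:
    """Consume observed phases until a terminal one appears, then return it."""
    for phase in phases:
        if phase.is_terminal:
            return phase
    raise RuntimeError("phase stream ended before a terminal phase")
```

Passing an iterator of observed phases to wait_until_terminal stands in for polling the backend: the loop stops at the first terminal phase, which is the behavior the text attributes to wait().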
