flyte-sdk
Reliably orchestrate ML pipelines, models, and agents at scale — in pure Python.
Overview
The Flyte 2 SDK is a Python-native framework for building and orchestrating complex AI applications, machine learning workflows, and interactive services. Developers define execution environments, tasks, and workflows in standard Python syntax, and the SDK supplies the robustness and scalability that production-grade systems require.
Whether you are training a large-scale model, building a compound AI agent, or serving a real-time API, Flyte 2 SDK provides the tools to manage the entire lifecycle from local development to cloud-scale deployment. By abstracting away the complexities of infrastructure, it enables engineers to focus on writing code that is portable, reproducible, and resilient.
Key Concepts
- Execution Environments (Task Environments): The fundamental unit of execution, defining the container image, hardware accelerators (GPUs, TPUs), and dependencies your code needs to run reliably.
- Serving & Interactive Apps: Build and serve interactive applications and LLM-based agents, with first-class support for FastAPI and long-running environments.
- Data & Type System: A robust type engine that integrates with Pydantic, Pandera, and Polars to ensure data integrity and structured I/O across distributed tasks.
- Remote Orchestration & Management: Move from local execution to remote management, with built-in support for caching, retries, and state recovery.
- Ecosystem & Tooling Integrations: Extend your workflows with native plugins for distributed compute (Ray, Spark, Dask) and experiment tracking (Weights & Biases, MLflow).
- Container Image Lifecycle: Automatically build and manage container images, keeping environments consistent across development and production without hand-maintained Dockerfiles.
Common Use Cases
- ML Training Pipelines: Orchestrate multi-step training workflows with automated caching, versioning, and resource management.
- LLM Serving & Agents: Deploy and scale compound AI agents and model serving endpoints using integrated FastAPI support.
- Distributed Data Processing: Run large-scale data transformations using integrated Spark, Ray, or Dask clusters directly from Python.
- Human-in-the-Loop Workflows: Implement interactive steps that require human intervention or approval within automated processes.
- Automated Operations: Schedule recurring jobs and trigger workflows based on external events, webhooks, or data changes.
Getting Started
To get started with Flyte, see Connecting to the Remote Platform to set up your environment. Then follow the Getting Started guide to run your first task locally, and explore how to build complex workflows in the Core Execution Framework section.