The Self-Correction and Testing Loop

LLM-generated code often fails on the first attempt because of missing dependencies, subtle logic bugs, or incorrect test expectations. The _CodeGenSession in flyte-sdk implements a self-correction loop that diagnoses these failures and iterates until the code passes its own generated tests or the session reaches a maximum iteration limit.

Orchestrating the Lifecycle

The _CodeGenSession class in plugins/codegen/src/flyteplugins/codegen/auto_coder_agent.py manages the mutable state for a single code generation run. It is initialized by the AutoCoderAgent and encapsulates the retry loop, image building, and error diagnosis logic.

The core of the session is the run() method, which executes a loop up to max_iterations (defaulting to 10). Each iteration calls _attempt(), which follows a structured sequence:

1. Code Generation: Generates the solution code if needs_new_code is true.
2. Package Detection: Analyzes the generated code to identify required language and system packages.
3. Test Generation: Generates a test suite if needs_new_tests is true.
4. Environment Setup: Builds or updates a sandbox Docker image containing the detected dependencies.
5. Execution: Runs the tests within the sandbox environment.
6. Failure Handling: If tests fail, the session diagnoses the error and plans a fix.
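The sequence above can be sketched as a bounded retry loop. This is a minimal illustration, not the real _CodeGenSession: the SessionSketch and AttemptResult classes, the log list, and the rule that tests pass on the third iteration are all assumptions made so the example runs on its own.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class AttemptResult:
    """Hypothetical result type for one iteration."""
    passed: bool
    error: str = ""


@dataclass
class SessionSketch:
    """Hypothetical stand-in for the session state; not the real class."""
    max_iterations: int = 10
    needs_new_code: bool = True
    needs_new_tests: bool = True
    log: list = field(default_factory=list)

    async def _attempt(self, iteration: int) -> AttemptResult:
        if self.needs_new_code:
            self.log.append(f"{iteration}: generate code")
            self.needs_new_code = False
        self.log.append(f"{iteration}: detect packages")
        if self.needs_new_tests:
            self.log.append(f"{iteration}: generate tests")
            self.needs_new_tests = False
        self.log.append(f"{iteration}: build image")
        self.log.append(f"{iteration}: run tests")
        # Simulated outcome: the tests only pass from the third iteration on.
        if iteration < 3:
            self.log.append(f"{iteration}: diagnose failure")
            self.needs_new_code = True  # the diagnosis asks for a code fix
            return AttemptResult(False, "simulated failure")
        return AttemptResult(True)

    async def run(self) -> tuple[bool, int]:
        # Stop at the first passing attempt or after max_iterations.
        for iteration in range(1, self.max_iterations + 1):
            result = await self._attempt(iteration)
            if result.passed:
                return True, iteration
        return False, self.max_iterations


if __name__ == "__main__":
    session = SessionSketch()
    print(asyncio.run(session.run()))
```

The flags needs_new_code and needs_new_tests let an iteration regenerate only the artifact that the diagnosis blamed, so a code fix does not discard a test suite that is still considered valid.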

Dynamic Environment Construction

flyte-sdk does not rely on a static environment. Instead, it dynamically detects dependencies from the generated code using detect_and_track_packages. The session maintains state for detected_packages (language-specific) and detected_system_packages (OS-level).

If a package installation fails during the image build, _CodeGenSession implements a specific recovery loop in _build_image():

except InvalidPackageError as e:
    bad_package = e.package_name
    # ... remove bad package from state ...
    # Ask LLM for the correct package name
    replacement, in_tok, out_tok = await suggest_replacement_package(
        self.model,
        bad_package,
        e.original_error,
        solution_code,
        self.litellm_params,
    )
    if replacement and replacement not in self.detected_system_packages:
        self.detected_system_packages.append(replacement)

This allows the agent to recover from hallucinated package names or OS-specific naming differences (e.g., libmagic-dev vs libmagic).

Error Diagnosis and Reclassification

When tests fail, the session calls diagnose_and_plan_environment_fix to categorize the failure into one of three types: environment, test_error, or logic.

A critical feature of flyte-sdk is the Reclassification Logic in _reclassify_errors(). LLMs often get stuck in a "local minimum" where they repeatedly try to fix the code for a bug that actually exists in the test, or vice versa. To break this cycle, the session tracks fix attempts for specific test failures:

test_error -> logic: If a test fails multiple times with the same error after the agent attempted to fix the test code, the session reclassifies it as a logic error. This forces the LLM to stop modifying the test and instead fix the solution code to match the test's expectations.
logic -> test_error: Conversely, if logic fixes fail to resolve a persistent error, the session reclassifies it as a test_error, assuming the test's expected values might be incorrect.
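The two rules above amount to flipping a failure's category once the same kind of fix has been tried repeatedly without success. The sketch below illustrates that idea under stated assumptions: the ReclassifierSketch class, the RECLASSIFY_AFTER threshold of 2, and the dictionary shape of a failure are hypothetical, since the real threshold and data structures in _reclassify_errors() are not shown here.

```python
from collections import Counter

# Hypothetical threshold; the real value is not given in this document.
RECLASSIFY_AFTER = 2

# Only these two categories flip; "environment" failures are left alone.
FLIP = {"test_error": "logic", "logic": "test_error"}


class ReclassifierSketch:
    def __init__(self):
        # (test_name, error_type) -> number of fix attempts so far
        self.fix_attempts = Counter()

    def record_fix_attempt(self, test_name: str, error_type: str) -> None:
        self.fix_attempts[(test_name, error_type)] += 1

    def reclassify(self, failures: list[dict]) -> list[dict]:
        result = []
        for failure in failures:
            key = (failure["test_name"], failure["error_type"])
            flipped = FLIP.get(failure["error_type"])
            if flipped and self.fix_attempts[key] >= RECLASSIFY_AFTER:
                self.fix_attempts[key] = 0  # start fresh for this category
                failure = {**failure, "error_type": flipped}
            result.append(failure)
        return result


if __name__ == "__main__":
    r = ReclassifierSketch()
    failure = {"test_name": "test_sum", "error_type": "test_error"}
    r.record_fix_attempt("test_sum", "test_error")
    r.record_fix_attempt("test_sum", "test_error")
    print(r.reclassify([failure]))
```

Keying the counter on both the test name and the current category means that a flipped failure starts with its own budget of attempts in the new category.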

Multi-Stage Verification

Before accepting a patch, _CodeGenSession performs internal verification to ensure the LLM actually applied the requested fixes. In _generate_code(), the session uses verify_logic_fixes_applied to check the new solution against the diagnosis.

If verification fails, the session becomes progressively more "forceful" in its prompts:

if code_attempt == 2:
    messages.append({
        "role": "user",
        "content": "CRITICAL: The previous code generation attempt did NOT apply all the required fixes. You MUST apply EVERY SINGLE fix listed above."
    })

This multi-stage approach (Diagnose -> Patch -> Verify -> Force) ensures that the final output is not just a new version of the code, but a version that specifically addresses the identified failures.
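A simplified, synchronous sketch of the verify-and-escalate step follows. The generate_with_verification function, its max_code_attempts parameter, and the fake_generate and fake_verify stubs are hypothetical; the real _generate_code() and verify_logic_fixes_applied are likely asynchronous and call an LLM. Only the escalation message text and the code_attempt == 2 condition are taken from the excerpt.

```python
ESCALATION = (
    "CRITICAL: The previous code generation attempt did NOT apply all the "
    "required fixes. You MUST apply EVERY SINGLE fix listed above."
)


def generate_with_verification(generate, verify, messages, max_code_attempts=3):
    """Regenerate until `verify` accepts the code or attempts run out."""
    messages = list(messages)  # avoid mutating the caller's list
    code = ""
    for code_attempt in range(1, max_code_attempts + 1):
        if code_attempt == 2:
            # The first attempt was rejected, so escalate the prompt.
            messages.append({"role": "user", "content": ESCALATION})
        code = generate(messages)
        if verify(code):
            return code, code_attempt
    return code, max_code_attempts


# Stub "model": applies the fix only once it has been told forcefully.
def fake_generate(messages):
    forced = messages[-1]["content"].startswith("CRITICAL")
    return "return a + b" if forced else "return a - b"


# Stub verifier: checks that the requested fix is present in the code.
def fake_verify(code):
    return "a + b" in code


if __name__ == "__main__":
    prompt = [{"role": "user", "content": "Fix add() so it adds."}]
    print(generate_with_verification(fake_generate, fake_verify, prompt))
```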

Protected Constraints

The self-correction loop is governed by strict environment constraints. For example, the agent is explicitly forbidden from deleting, recreating, or otherwise modifying the /var/outputs directory itself, which is a pre-existing path in the Flyte sandbox; it may only write files into it. The _handle_logic_env_errors method injects these constraints into every patch request to prevent the LLM from generating code that would fail in a real Flyte environment:

error_msg = (
    "CRITICAL CONSTRAINTS:\n"
    "1. /var/outputs is a PRE-EXISTING directory. NEVER delete, recreate, or modify it. "
    "NEVER use shutil.rmtree or os.makedirs on /var/outputs. Only write files into it..."
)

This grounding ensures that the "corrected" code remains compatible with the underlying Flyte infrastructure.
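As a rough illustration of injecting constraints into every patch request, the sketch below prepends a fixed constraint block to a diagnosis. The build_patch_request function and the DIAGNOSIS label are hypothetical, and only the first constraint is reproduced because the excerpt elides the rest.

```python
# First constraint as quoted in the excerpt; the remainder is elided
# in the source, so it is not reproduced or guessed at here.
CRITICAL_CONSTRAINTS = (
    "CRITICAL CONSTRAINTS:\n"
    "1. /var/outputs is a PRE-EXISTING directory. NEVER delete, recreate, or modify it. "
    "NEVER use shutil.rmtree or os.makedirs on /var/outputs. Only write files into it."
)


def build_patch_request(diagnosis: str) -> str:
    """Hypothetical helper: put the constraints ahead of the diagnosis."""
    return f"{CRITICAL_CONSTRAINTS}\n\nDIAGNOSIS:\n{diagnosis.strip()}"


if __name__ == "__main__":
    print(build_patch_request("test_write_output failed: FileNotFoundError"))
```

Placing the constraints first in every request, rather than only in the initial prompt, keeps them in view even after many iterations of accumulated context.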
