Rich Validation Reports for DataFrames
When a Pandera validation fails in a Flyte task, the resulting traceback often lacks the granular context needed to identify which specific rows or columns violated the schema. In large-scale data processing, simply knowing that a "SchemaError" occurred is insufficient for debugging.
The flyte-sdk addresses this by automatically generating rich HTML validation reports. These reports are embedded directly into the Flyte UI, providing a visual summary of the data, schema-level errors, and data-level violations.
Enabling Validation Reports
You enable these reports by annotating your task's dataframe inputs and outputs with Pandera schemas. By default, a validation failure raises an exception. You can instead configure the task to log a warning and continue; the report is generated in either case.
from typing import Annotated

import flyte
import pandera as pa
from pandera.typing import DataFrame, Index, Series
from flyteplugins.pandera import ValidationConfig

# Tasks in flyte-sdk are registered on a TaskEnvironment; the name here is illustrative.
env = flyte.TaskEnvironment(name="pandera_example")

class EmployeeSchema(pa.DataFrameModel):
    id: Index[int]
    name: Series[str]
    age: Series[int] = pa.Field(gt=0)

@env.task
def process_data(
    # ValidationConfig(on_error="warn") allows the task to continue
    # even if validation fails, but the report will still be generated.
    df: Annotated[DataFrame[EmployeeSchema], ValidationConfig(on_error="warn")],
) -> DataFrame[EmployeeSchema]:
    return df
How Reports are Generated
The report generation logic is encapsulated in the PanderaDataFrameTransformer class within flyteplugins.pandera.transformers.base. During type conversion (both for inputs in to_python_value and outputs in to_literal), the transformer invokes the _validate method.
Internal flow of _validate:
- It calls schema.validate(data, lazy=True) to collect all errors rather than stopping at the first one.
- If errors occur, it catches SchemaErrors.
- It passes the data and the error object to a PanderaReportRenderer.
- The resulting HTML is registered with Flyte's reporting system via flyte.report.get_tab(report_title).replace(html).
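The flow above can be sketched without Flyte or Pandera installed. The following is a minimal, self-contained illustration and not the plugin's code: LazyErrors, validate_lazy, render_html, and TABS are hypothetical stand-ins for SchemaErrors, schema.validate(..., lazy=True), the report renderer, and Flyte's report tabs. It also assumes that on_error="warn" logs through Python's logging module, which the plugin may do differently.

```python
import html
import logging

class LazyErrors(Exception):
    """Hypothetical stand-in for pandera's SchemaErrors: carries every failure."""
    def __init__(self, failures):
        super().__init__(f"{len(failures)} validation failure(s)")
        self.failures = failures

# Hypothetical stand-in for Flyte report tabs: title -> HTML string.
TABS = {}

def validate_lazy(rows, checks):
    """Run every check on every row and collect all failures before raising."""
    failures = []
    for i, row in enumerate(rows):
        for column, (name, predicate) in checks.items():
            if not predicate(row[column]):
                failures.append({"index": i, "column": column,
                                 "check": name, "failure_case": row[column]})
    if failures:
        raise LazyErrors(failures)
    return rows

def render_html(rows, failures):
    """Render a tiny report; the real renderer builds far richer tables."""
    items = "".join(
        f"<li>row {f['index']}, {html.escape(f['column'])}: "
        f"{html.escape(f['check'])} failed for "
        f"{html.escape(repr(f['failure_case']))}</li>"
        for f in failures
    )
    return f"<p>{len(rows)} rows, {len(failures)} errors</p><ul>{items}</ul>"

def validate_and_report(rows, checks, report_title, on_error="raise"):
    try:
        validate_lazy(rows, checks)
        TABS[report_title] = render_html(rows, [])
    except LazyErrors as exc:
        # The report is registered before deciding whether to re-raise.
        TABS[report_title] = render_html(rows, exc.failures)
        if on_error == "raise":
            raise
        logging.getLogger(__name__).warning("%s", exc)
    return rows

rows = [{"name": "Ada", "age": 36}, {"name": "", "age": -1}]
checks = {"age": ("gt(0)", lambda v: v > 0),
          "name": ("non_empty", lambda v: len(v) > 0)}
result = validate_and_report(rows, checks, "Pandera report: input", on_error="warn")
```

With on_error="warn" the rows are returned unchanged and the tab still holds a report listing both failures from the second row. With the default on_error="raise" the same report is stored before the exception propagates.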
The Rendering Engine
The core rendering logic resides in PanderaPandasReportRenderer (found in flyteplugins.pandera.renderers.pandas). It uses the great_tables library to construct a multi-section HTML document:
- Summary: A high-level overview including the schema name, data shape, and total error counts.
- Data Preview: A snapshot of the first few rows of the dataframe (controlled by DATA_PREVIEW_HEAD).
- Schema-level Errors: Tables detailing metadata issues like missing columns or incorrect dtypes.
- Data-level Errors: Detailed breakdowns of value violations, including the percentage of valid rows and specific failure cases.
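As a rough illustration of the data-level numbers, the percentage of valid rows can be derived from the distinct row indices that appear among the failure cases. This sketch assumes that definition (a row with at least one failure counts as invalid); the renderer's exact formula may differ, and percent_valid is a hypothetical helper.

```python
def percent_valid(total_rows, failure_indices):
    """Share of rows with no recorded failure, as a percentage."""
    if total_rows == 0:
        return 100.0
    invalid = len(set(failure_indices))  # a row failing twice counts once
    return 100.0 * (total_rows - invalid) / total_rows

# Four rows; row 1 fails two checks and row 3 fails one.
pct = percent_valid(4, [1, 1, 3])
print(f"{pct:.1f}% valid")
```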
The renderer handles complex Pandera error structures, such as "long-form" failure cases, by pivoting them into a readable format via the _reshape_long_failure_cases method.
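To make the pivot concrete, here is a minimal sketch of the idea in plain Python. It assumes long-form failure cases arrive as one record per failing cell with index, column, and failure_case keys. reshape_long is a hypothetical helper written for this illustration and is not the plugin's _reshape_long_failure_cases.

```python
from collections import defaultdict

def reshape_long(long_cases):
    """Pivot one-record-per-cell failures into one record per row index."""
    wide = defaultdict(dict)
    for case in long_cases:
        wide[case["index"]][case["column"]] = case["failure_case"]
    # Sort by row index so the resulting table reads top to bottom.
    return [{"index": idx, **cells} for idx, cells in sorted(wide.items())]

long_cases = [
    {"index": 2, "column": "age", "failure_case": -5},
    {"index": 0, "column": "name", "failure_case": ""},
    {"index": 2, "column": "name", "failure_case": None},
]
wide_rows = reshape_long(long_cases)
```

Each output record then corresponds to one table row in the report, with only the offending cells filled in for that row.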
Backend-Specific Renderers
Because Flyte supports multiple dataframe engines, flyte-sdk provides specialized renderers to handle the nuances of lazy evaluation and distributed data.
Polars Support
The PanderaPolarsReportRenderer (in flyteplugins.pandera.renderers.polars) extends the base pandas renderer. Since Polars often uses LazyFrame for performance, the renderer explicitly calls .collect() on a limited subset of the data to generate the preview and failure case summaries without materializing the entire dataset.
# Internal implementation detail from PanderaPolarsReportRenderer
# (method excerpt; assumes `import polars as pl` and `from typing import Any`)
def _to_pandas(self, data: Any):
    if isinstance(data, pl.LazyFrame):
        # Only collect the head to avoid OOM
        data = data.head(DATA_PREVIEW_HEAD).collect()
    if isinstance(data, pl.DataFrame):
        return data.head(DATA_PREVIEW_HEAD).to_pandas()
    return super()._to_pandas(data)
PySpark SQL Support
The PanderaPySparkSqlReportRenderer (in flyteplugins.pandera.renderers.pyspark_sql) handles distributed Spark DataFrames. It uses .limit(DATA_PREVIEW_HEAD).toPandas() to bring at most DATA_PREVIEW_HEAD rows to the driver for rendering. Because the limit is applied before the conversion, generating a report does not trigger a large data transfer or exhaust driver memory.
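Both backend renderers follow the same pattern: cap the row count while the data is still lazy, then materialize. The stdlib sketch below mimics that pattern with a generator standing in for a LazyFrame or Spark DataFrame. The value DATA_PREVIEW_HEAD = 5 and the lazy_rows helper are assumptions made for illustration; the plugin's actual constant value is not stated here.

```python
from itertools import islice

DATA_PREVIEW_HEAD = 5  # assumed value, for illustration only
stats = {"produced": 0}  # counts how many rows the "engine" actually computed

def lazy_rows(n):
    """Stand-in for a lazy or distributed dataset of n rows."""
    for i in range(n):
        stats["produced"] += 1
        yield {"id": i, "value": i * i}

# Limit first, materialize second: only the preview rows are ever computed.
preview = list(islice(lazy_rows(1_000_000), DATA_PREVIEW_HEAD))
```

Reversing the order, for example list(lazy_rows(1_000_000))[:DATA_PREVIEW_HEAD], would compute every row before discarding almost all of them, which is the failure mode the renderers avoid.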
Report Integration in Flyte UI
The reports are not just logs; they are first-class UI components. The PanderaDataFrameTransformer uses flyte.report.get_tab to create dedicated tabs in the Flyte Console:
- Pandera report: input: Generated when the task receives data.
- Pandera report: output: Generated when the task returns data.
Even if validation succeeds, a success report is generated to provide a "Data Preview" and metadata summary, giving you confidence that the data flowing through your pipeline matches your expectations.
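The sketch below illustrates how one validation routine can serve both directions and write a report on success as well as on failure. It is a simplified stand-in under stated assumptions: SketchTransformer is hypothetical, its method signatures are far simpler than those of the real transformer, and a plain dict takes the place of Flyte's report tabs. Only the two tab titles come from the description above.

```python
class SketchTransformer:
    """Hypothetical, simplified stand-in for PanderaDataFrameTransformer."""

    def __init__(self, find_errors):
        self.find_errors = find_errors  # callable returning a list of failures
        self.tabs = {}  # stands in for Flyte report tabs: title -> HTML

    def _validate(self, data, direction):
        errors = self.find_errors(data)
        status = "passed" if not errors else f"failed with {len(errors)} error(s)"
        # A report is written on success as well as on failure.
        self.tabs[f"Pandera report: {direction}"] = (
            f"<h2>Validation {status}</h2><p>{len(data)} rows</p>"
        )
        return data

    def to_python_value(self, data):
        # Inputs are validated when the task receives data.
        return self._validate(data, "input")

    def to_literal(self, data):
        # Outputs are validated when the task returns data.
        return self._validate(data, "output")

# Toy check: negative values count as failures.
transformer = SketchTransformer(lambda values: [v for v in values if v < 0])
transformer.to_python_value([1, 2, 3])  # all valid
transformer.to_literal([1, -2, 3])      # one failure
```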