Validating and Testing DataFrames with pandera

Niels Bantilan

2023-11-03

Dealing with invalid data 😭

Data validation is important…

… but tedious 😑

“Garbage in, garbage out”

“Data-centric machine learning”

“Data as code”

“But I just want to train my model!” 😫

A day in the life of a data scientist

flowchart LR
  A[clean data] --> S[split data]
  S --> B[train model]
  B --> C[evaluate model]
  S --> C

A day in the life of a data scientist

flowchart LR
  A[clean data] --> S[split data]
  S --> B[train model]
  B --> C[evaluate model]
  S --> C

  style B fill:#e0e0e0,stroke:#7f7f7f

A day in the life of a data scientist

flowchart LR
  A[clean data] --> S[split data]
  S --> B[train model]
  B --> C[evaluate model]
  S --> C

  style B fill:#8cffb4,stroke:#2bb55b
  style C fill:#ff8c8c,stroke:#bc3131

A day in the life of a data scientist

flowchart LR
  A[clean data] --> S[split data]
  S --> B[train model]
  B --> C[evaluate model]
  S --> C

  style S fill:#FFF2CC,stroke:#e2b128
  style B fill:#8cffb4,stroke:#2bb55b
  style C fill:#ff8c8c,stroke:#bc3131

Data validation is about understanding your data

And capturing that understanding as a schema.

{
    "column1": "integer",
    "column2": "string",
    "column3": "float"
}
  • 📖 Schemas document the shape and properties of some data structure.
  • 🔍 Schemas enforce that shape and those properties programmatically.

Data validation can be fun 🎉

Data validation is like unit testing for your data

$ run pipeline
  • dataset_x_validation
  • dataset_y_validation
  • dataset_z_validation

✨🍪✨

The data validation mindset

Before:

flowchart LR
  G[Define Goal]
  E[Explore]
  I[Implement]
  S[Spot Check]
  P{Pass?}
  C[Continue]

  G --> E
  E --> I
  I --> S
  S --> P
  P -- Yes --> C
  P -- No --> E

The data validation mindset

After:

flowchart LR
  G[Define Goal]
  E[Explore]
  I[Implement]
  T[Define Schema]
  S[Validate]
  P{Pass?}
  C[Continue]

  G --> E
  E --> I
  E --> T
  I --> S
  T --> S
  S --> P
  P -- Yes --> C
  P -- No --> E

  style S fill:#8cffb4,stroke:#2bb55b
  style T fill:#FFF2CC,stroke:#e2b128

There’s no substitute for understanding your data with your own eyes 👀



pandera: a Python data validation and testing toolkit

🤷‍♂️ So What?

By using pandera in your stack, you get:

  • ⭐️ A single source of truth
  • 📖 Data documentation as code
  • 🔎 Run-time dataframe schema enforcers

⏱️ Spend less time worrying about the correctness of your data and more time analyzing, visualizing, and modeling them.

Define Goal

Predict the price of items in a produce transaction dataset

import pandas as pd

transactions = pd.DataFrame.from_records([
    {"item": "orange", "price": 0.75},
    {"item": "apple", "price": 0.50},
    {"item": "banana", "price": 0.25},
])

Explore the data

transactions.dtypes
item      object
price    float64
dtype: object


transactions.describe()
       price
count  3.000
mean   0.500
std    0.250
min    0.250
25%    0.375
50%    0.500
75%    0.625
max    0.750

Build our understanding

  • item is a categorical variable represented as a string.
  • item contains three possible values: orange, apple, and banana.
  • price is a float.
  • price is greater than or equal to zero.
  • Neither column can contain null values.

Define a schema

Pandera gives you a simple way to translate your understanding into a schema

import pandera as pa

class Schema(pa.DataFrameModel):
    item: str = pa.Field(
        isin=["apple", "orange", "banana"],
        nullable=False,
    )
    price: float = pa.Field(
        ge=0,
        nullable=False,
    )
1. item is a categorical variable represented as a string.
2. item contains three possible values: orange, apple, and banana.
3. price is a float.
4. price is greater than or equal to zero.
5. Neither column can contain null values.

Validate the data

If the data are valid, Schema.validate simply returns the valid data:

validated_transactions = Schema.validate(transactions)
     item  price
0  orange   0.75
1   apple   0.50
2  banana   0.25

Validate the data

But if not, it raises a SchemaError exception:

invalid_data = pd.DataFrame.from_records([
    {"item": "apple", "price": 0.75},
    {"item": "orange", "price": float("nan")},
    {"item": "squash", "price": -1000.0},
])
try:
    Schema.validate(invalid_data)
except pa.errors.SchemaError as exc:
    failure_cases = exc.failure_cases
   index failure_case
0      2       squash

Validate the data

lazy=True will evaluate all checks before raising a SchemaErrors exception.

try:
    Schema.validate(invalid_data, lazy=True)
except pa.errors.SchemaErrors as exc:
    failure_cases = exc.failure_cases
  schema_context column                                check check_number failure_case  index
0         Column   item  isin(['apple', 'orange', 'banana'])            0       squash      2
1         Column  price                         not_nullable         None          NaN      1
2         Column  price          greater_than_or_equal_to(0)            0      -1000.0      2

Functional Validation

Add type hints and a pandera.check_types decorator to your functions

from pandera.typing import DataFrame

@pa.check_types(lazy=True)
def clean_data(raw_data) -> DataFrame[Schema]:
    return raw_data
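
For valid inputs, the decorated function is transparent. A minimal sketch, reusing the valid transactions dataframe from earlier:

# validation passes silently; the validated dataframe is returned
validated = clean_data(transactions)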

Functional Validation

The clean_data function now validates data every time it’s called:

try:
    clean_data(invalid_data)
except pa.errors.SchemaErrors as exc:
    failure_cases = exc.failure_cases
  schema_context column                                check check_number failure_case  index
0         Column   item  isin(['apple', 'orange', 'banana'])            0       squash      2
1         Column  price                         not_nullable         None          NaN      1
2         Column  price          greater_than_or_equal_to(0)            0      -1000.0      2

Updating the Schema

“But squash is a valid item!”

class Schema(pa.DataFrameModel):
    item: str = pa.Field(
-       isin=["apple", "orange", "banana"],
+       isin=["apple", "orange", "banana", "squash"],
        nullable=False,
    )
    price: float = pa.Field(
        ge=0,
        nullable=False,
    )

Schema Options

class SchemaOptions(pa.DataFrameModel):
    ...

    class Config:
        coerce = True
        ordered = True
        strict = True
        drop_invalid_rows = True
        unique_column_names = True
1. Attempts to coerce raw data into the specified types
2. Makes sure columns are ordered as specified in the schema
3. Makes sure the dataframe contains exactly the columns specified in the schema
4. Drops rows with invalid values
5. Makes sure column names are unique
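
A minimal sketch of coerce in action, with hypothetical data (reusing the pd and pa imports from earlier): string-typed raw values are cast to the schema's types during validation rather than raising a dtype error.

class CoercingSchema(pa.DataFrameModel):
    item: str
    price: float

    class Config:
        coerce = True

# "price" arrives as strings; coerce=True casts it to float64
raw = pd.DataFrame({"item": ["apple"], "price": ["0.50"]})
coerced = CoercingSchema.validate(raw)
assert coerced["price"].dtype == "float64"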

Built-in Checks

class SchemaBuiltInChecks(pa.DataFrameModel):
    column_1: str = pa.Field(
        isin=["a", "b", "c"],
        unique_values_eq=["a", "b", "c"],
        str_matches="pattern",
    )
    column_2: float = pa.Field(
        in_range={"min_value": 0, "max_value": 100},
        le=100,
        ne=-1,
    )
1. Values are in a finite set
2. Unique set of values is equal to a finite set
3. Strings match a pattern
4. Values are within some range
5. Values are less than or equal to some maximum
6. Values are not equal to some constant
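
A minimal sketch with hypothetical data: validating lazily against SchemaBuiltInChecks reports every failing built-in check at once (the illustrative checks above are deliberately strict, so several fail here).

df = pd.DataFrame({
    "column_1": ["a", "b", "x"],
    "column_2": [50.0, -1.0, 150.0],
})
try:
    SchemaBuiltInChecks.validate(df, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc.failure_cases)  # one row per failing check and value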

Custom Checks

class SchemaCustomChecks(pa.DataFrameModel):
    column_1: float
    column_2: float

    @pa.check("column_1", "column_2")
    def mean_is_between(cls, series):
        return 0 <= series.mean() <= 100

    @pa.dataframe_check
    def col1_lt_col2(cls, df):
        return df["column_1"] < df["column_2"]
1. Custom column-level check makes sure the mean of each annotated column is within some range
2. Custom dataframe-level check makes sure column_1 is less than column_2, row by row
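
A minimal sketch with hypothetical data: both column means pass mean_is_between, but the second row violates col1_lt_col2, so validation fails.

df = pd.DataFrame({"column_1": [1.0, 5.0], "column_2": [2.0, 3.0]})
try:
    SchemaCustomChecks.validate(df)
except pa.errors.SchemaError as exc:
    print(exc.failure_cases)  # row 1: column_1 (5.0) is not < column_2 (3.0)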

Regex Column-matching

Suppose I have column names that match some pattern:

   num_col_1  num_col_2  num_col_3  num_col_n
0   1.306457  -0.000429  -0.903723  -0.047536
1   0.027807  -1.073813   0.462735  -0.297414
2  -0.920510  -0.671194  -0.975042   0.322838
class RegexSchema(pa.DataFrameModel):
    num_columns: float = pa.Field(alias="num_col_.+", regex=True)

    @pa.check("num_col_.+", regex=True)
    def custom_check(cls, series):
        ...
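
Since custom_check above is elided, here is a self-contained sketch with a hypothetical SensorSchema: a single Field definition covers every column whose name matches the pattern.

class SensorSchema(pa.DataFrameModel):
    readings: float = pa.Field(alias="num_col_.+", regex=True)

df = pd.DataFrame(
    [[1.3, -0.1, -0.9], [0.0, -1.1, 0.5]],
    columns=["num_col_1", "num_col_2", "num_col_3"],
)
SensorSchema.validate(df)  # all three columns are validated as float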

Meta Comment

The Quarto document behind this presentation is itself validated by pandera 🤯

Synthesize Test Data with Pandera

Just call Schema.example

example_data = Schema.example(size=5)
     item          price
0   apple   6.103516e-05
1  banana   3.333333e-01
2  orange   1.900000e+00
3   apple  4.940656e-324
4   apple   7.018780e+76

Unit testing

Suppose I want to add a returned column to my dataset…

# Before data processing
class Schema(pa.DataFrameModel):
    item: str = pa.Field(isin=["apple", "orange", "banana"], nullable=False)
    price: float = pa.Field(ge=0, nullable=False)


# After data processing
class ProcessedSchema(Schema):
    returned: bool

Unit testing

Defining a process_data function

from typing import List

@pa.check_types(lazy=True)
def process_data(
    data: DataFrame[Schema],
    returned: List[bool],
) -> DataFrame[ProcessedSchema]:
    return data.assign(returned=returned)

Unit testing

Our test for process_data:

from hypothesis import given, settings

size = 3

@given(data=Schema.strategy(size=size))
@settings(max_examples=3)
def test_process_data(data):
    process_data(data, returned=[True] * size)
    print("tests pass! ✅")
1. Create a mock dataset
2. Just call process_data

Unit testing

Run test_process_data

test_process_data()
tests pass! ✅
tests pass! ✅
tests pass! ✅

Catch bugs early 🐞

Suppose there’s a bug in process_data:

@pa.check_types(lazy=True)
def process_data(
    data: DataFrame[Schema],
    returned: List[bool],
) -> DataFrame[ProcessedSchema]:
    return data.assign(returnned=returned)

Catch bugs early 🐞

try:
    test_process_data()
except pa.errors.SchemaErrors as exc:
    print(exc)
Schema ProcessedSchema: A total of 1 schema errors were found.

Error Counts
------------
- SchemaErrorReason.COLUMN_NOT_IN_DATAFRAME: 1

Schema Error Summary
--------------------
                                           failure_cases  n_failure_cases
schema_context  column check                                             
DataFrameSchema <NA>   column_in_dataframe    [returned]                1

Usage Tip
---------

Directly inspect all errors by catching the exception:

```
try:
    schema.validate(dataframe, lazy=True)
except SchemaErrors as err:
    err.failure_cases  # dataframe of schema errors
    err.data  # invalid dataframe
```

Data validation is not only about testing the actual data 📊, but also the functions that produce them.

Get started with pandera in 10 minutes

Python

pip install pandera

Define and import schema

# schema.py
import pandera as pa

class Schema(pa.DataFrameModel):
    col1: int
    col2: float
    col3: str

Validate away!

python_dataframe = ...

Schema.validate(python_dataframe)
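
Putting it all together, a minimal end-to-end sketch with made-up data:

import pandas as pd
import pandera as pa

class Schema(pa.DataFrameModel):
    col1: int
    col2: float
    col3: str

# hypothetical data conforming to the schema
df = pd.DataFrame({"col1": [1], "col2": [2.0], "col3": ["three"]})
Schema.validate(df)  # returns df unchanged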




Get in touch!