Validating and Testing DataFrames with pandera
2023-11-03
In October 2012, the Harvard Business Review named data scientist the sexiest job of the 21st century. But you know what isn’t sexy?
Dealing with invalid data 😭
Dealing with invalid data can often feel like a losing battle, especially when you don’t realize that corrupted or otherwise incorrect data has passed through your pipeline. What’s worse, downstream consumers of that data are relying on you to make sure it’s clean and correct.
Data validation is important…
… but tedious 😑
“Garbage in, garbage out”
“Data-centric machine learning”
“But I just want to train my model!” 😫
Once upon a time, this was me…
A day in the life of a data scientist
flowchart LR
A[clean data] --> S[split data]
S --> B[train model]
B --> C[evaluate model]
S --> C
In one of my past jobs I had to train a model, so my pipeline looked something like this:
A day in the life of a data scientist
flowchart LR
A[clean data] --> S[split data]
S --> B[train model]
B --> C[evaluate model]
S --> C
style B fill:#e0e0e0,stroke:#7f7f7f
Now, the training step can take a long time, so I had to wait a few days for it to complete…
A day in the life of a data scientist
flowchart LR
A[clean data] --> S[split data]
S --> B[train model]
B --> C[evaluate model]
S --> C
style B fill:#8cffb4,stroke:#2bb55b
style C fill:#ff8c8c,stroke:#bc3131
A day in the life of a data scientist
flowchart LR
A[clean data] --> S[split data]
S --> B[train model]
B --> C[evaluate model]
S --> C
style S fill:#FFF2CC,stroke:#e2b128
style B fill:#8cffb4,stroke:#2bb55b
style C fill:#ff8c8c,stroke:#bc3131
Had I made some assertions about what the test data should look like at the split data step, I would have caught this data bug early before wasting all of that time training the model.
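For example, a few cheap assertions at the split data step might have surfaced the problem before any training time was spent. Here's a minimal sketch (the check_test_split helper and the specific assertions are hypothetical, not from my actual pipeline):

import pandas as pd

def check_test_split(train: pd.DataFrame, test: pd.DataFrame) -> None:
    # cheap sanity checks that run in seconds, before days of training
    assert len(test) > 0, "test set is empty"
    assert list(test.columns) == list(train.columns), "train/test columns don't match"
    assert not test.isna().values.all(), "test set contains only null values"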
Data validation is about understanding your data
And capturing that understanding as a schema.
{
    "column1": "integer",
    "column2": "string",
    "column3": "float"
}
📖 Schemas document the shape and properties of some data structure.
🔍 Schemas enforce that shape and those properties programmatically.
A schema is an artifact that serves both of these purposes.
My job, in this talk, is not to convince you that data validation is the most attractive part of data science, but I do want to convince you that…
Data validation can be fun 🎉
Data validation is like unit testing for your data
$ run pipeline
✅ dataset_x_validation
✅ dataset_y_validation
✅ dataset_z_validation
✨🍪✨
And the way you do that is to reframe data validation as unit tests for your data.
This way whenever you run your data pipelines you’ll know that your datasets are valid and you can get that little dopamine hit seeing your data tests pass.
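As a minimal sketch of the idea in pytest style (the dataset, schema, and column names below are hypothetical, made up just for illustration):

import pandas as pd
import pandera as pa

# hypothetical schema for one of the datasets in the pipeline
dataset_x_schema = pa.DataFrameSchema({
    "user_id": pa.Column(int),
    "amount": pa.Column(float, pa.Check.ge(0)),
})

def test_dataset_x_is_valid():
    dataset_x = pd.DataFrame({"user_id": [1, 2], "amount": [9.99, 0.0]})
    # validate() raises a SchemaError if the data don't conform, failing the test loudly
    dataset_x_schema.validate(dataset_x)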
The data validation mindset
Like many things in life worth doing, getting to the fun part of data validation requires some extra work and a shift in your mindset.
The shift is subtle: instead of casually spot-checking your data as you implement the functions in your pipeline, you explicitly define a schema alongside each function.
Before:
flowchart LR
G[Define Goal]
E[Explore]
I[Implement]
S[Spot Check]
P{Pass?}
C[Continue]
G --> E
E --> I
I --> S
S --> P
P -- Yes --> C
P -- No --> E
The data validation mindset
After:
flowchart LR
G[Define Goal]
E[Explore]
I[Implement]
T[Define Schema]
S[Validate]
P{Pass?}
C[Continue]
G --> E
E --> I
E --> T
I --> S
T --> S
S --> P
P -- Yes --> C
P -- No --> E
style S fill:#8cffb4,stroke:#2bb55b
style T fill:#FFF2CC,stroke:#e2b128
Data validation is a never-ending process of understanding your data through exploration, encoding that understanding as schemas, testing those schemas against live data, and re-understanding your data as it shifts around in the real world.
There’s no substitute for understanding your data with your own eyes 👀
But once you gain that understanding at a particular point in time, it can seem like a Sisyphean task to define and maintain the schemas you need to make sure your data are valid.
I built pandera to lower the barrier to creating and maintaining schemas in your codebase and my hope was that it would encourage a culture of data hygiene in the organizations that use it.
pandera
: a Python data validation and testing toolkit
Pandera is a Python package that provides a lightweight, flexible, and expressive API for validating dataframe-like objects in Python…
🤷‍♂️ So What?
By using pandera
in your stack, you get:
⭐️ A single source of truth
📖 Data documentation as code
🔎 Run-time dataframe schema enforcers
Framing data validation as fun is really a trojan horse for getting you to do something that yields a lot of practical benefits, because…
A single source of truth for your data schemas
Data documentation as code for you and your team to understand what your data looks like
Run-time dataframe enforcers in development, testing, and production contexts
⏱️ Spend less time worrying about the correctness of your data and more time analyzing, visualizing, and modeling them.
Now I’d like to take you through a mini data validation journey to show you how you can get started with pandera.
Define Goal
Predict the price of items in a produce transaction dataset
import pandas as pd

transactions = pd.DataFrame.from_records([
    {"item": "orange", "price": 0.75},
    {"item": "apple", "price": 0.50},
    {"item": "banana", "price": 0.25},
])
Explore the data
item object
price float64
dtype: object
count    3.000
mean     0.500
std      0.250
min      0.250
25%      0.375
50%      0.500
75%      0.625
max      0.750
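The summaries above could be produced with something like the following (the exact exploration commands aren't shown in the slides, so this is just a sketch):

# column data types
transactions.dtypes

# distribution of the price column
transactions["price"].describe()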
Build our understanding
item is a categorical variable represented as a string.
item contains three possible values: orange, apple, and banana.
price is a float.
price is greater than or equal to zero.
Neither column can contain null values.
Define a schema
Pandera gives you a simple way to translate your understanding into a schema
import pandera as pa

class Schema(pa.DataFrameModel):
    item: str = pa.Field(                    # (1)
        isin=["apple", "orange", "banana"],  # (2)
        nullable=False,                      # (5)
    )
    price: float = pa.Field(                 # (3)
        ge=0,                                # (4)
        nullable=False,
    )
(1) item is a categorical variable represented as a string.
(2) item contains three possible values: orange, apple, and banana.
(3) price is a float.
(4) price is greater than or equal to zero.
(5) Neither column can contain null values.
Validate the data
If the data are valid, Schema.validate simply returns the valid data:
validated_transactions = Schema.validate(transactions)
     item  price
0  orange   0.75
1   apple   0.50
2  banana   0.25
Validate the data
But if not, it raises a SchemaError exception:
invalid_data = pd.DataFrame.from_records([
    {"item": "apple", "price": 0.75},
    {"item": "orange", "price": float("nan")},
    {"item": "squash", "price": -1000.0},
])

try:
    Schema.validate(invalid_data)
except pa.errors.SchemaError as exc:
    failure_cases = exc.failure_cases
   index failure_case
0      2       squash
Validate the data
Passing lazy=True will evaluate all checks before raising a SchemaErrors exception.
try:
    Schema.validate(invalid_data, lazy=True)
except pa.errors.SchemaErrors as exc:
    failure_cases = exc.failure_cases
  schema_context column                                check check_number failure_case  index
0         Column   item  isin(['apple', 'orange', 'banana'])            0       squash      2
1         Column  price                         not_nullable         None          NaN      1
2         Column  price          greater_than_or_equal_to(0)            0      -1000.0      2
Functional Validation
Add type hints and a pandera.check_types decorator to your functions:
from pandera.typing import DataFrame

@pa.check_types(lazy=True)
def clean_data(raw_data) -> DataFrame[Schema]:
    return raw_data
Functional Validation
The clean_data function now validates data every time it’s called:
try:
    clean_data(invalid_data)
except pa.errors.SchemaErrors as exc:
    failure_cases = exc.failure_cases
  schema_context column                                check check_number failure_case  index
0         Column   item  isin(['apple', 'orange', 'banana'])            0       squash      2
1         Column  price                         not_nullable         None          NaN      1
2         Column  price          greater_than_or_equal_to(0)            0      -1000.0      2
Updating the Schema
“But squash is a valid item!”
  class Schema(pa.DataFrameModel):
      item: str = pa.Field(
-         isin=["apple", "orange", "banana"],
+         isin=["apple", "orange", "banana", "squash"],
          nullable=False,
      )
      price: float = pa.Field(
          ge=0,
          nullable=False,
      )
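With the updated schema in place, a transaction containing squash now validates cleanly (a quick sketch; the row below is made up for illustration):

Schema.validate(
    pd.DataFrame.from_records([{"item": "squash", "price": 1.25}])
)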
Schema Options
class SchemaOptions(pa.DataFrameModel):
    ...

    class Config:
        coerce = True               # (1)
        ordered = True              # (2)
        strict = True               # (3)
        drop_invalid_rows = True    # (4)
        unique_column_names = True  # (5)
(1) Attempts to coerce raw data into the specified types.
(2) Makes sure columns are ordered as specified in the schema.
(3) Makes sure the dataframe contains only the columns specified in the schema.
(4) Drops rows with invalid values.
(5) Makes sure column names are unique.
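For instance, here's a minimal sketch of how coerce and drop_invalid_rows might behave together on the produce schema from earlier (the CoercingSchema name and example data are mine, not from the talk; note that drop_invalid_rows only takes effect when validating with lazy=True):

class CoercingSchema(pa.DataFrameModel):
    item: str = pa.Field(isin=["apple", "orange", "banana"])
    price: float = pa.Field(ge=0)

    class Config:
        coerce = True             # "0.75" (string) is coerced to 0.75 (float)
        drop_invalid_rows = True  # rows that fail checks are dropped instead of raising

raw = pd.DataFrame({"item": ["apple", "squash"], "price": ["0.75", "-1.0"]})
valid = CoercingSchema.validate(raw, lazy=True)  # only the "apple" row survives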
Built-in Checks
class SchemaBuiltInChecks(pa.DataFrameModel):
    column_1: str = pa.Field(
        isin=["a", "b", "c"],                          # (1)
        unique_values_eq=["a", "b", "c"],              # (2)
        str_matches="pattern",                         # (3)
    )
    column_2: float = pa.Field(
        in_range={"min_value": 0, "max_value": 100},   # (4)
        le=100,                                        # (5)
        ne=-1,                                         # (6)
    )
(1) Values are in a finite set.
(2) The set of unique values is equal to a finite set.
(3) Strings match a pattern.
(4) Values are within some range.
(5) Values are less than or equal to some maximum.
(6) Values are not equal to some constant.
Custom Checks
class SchemaCustomChecks(pa.DataFrameModel):
    column_1: float
    column_2: float

    @pa.check("column_1", "column_2")
    def mean_is_between(cls, series):    # (1)
        return 0 <= series.mean() <= 100

    @pa.dataframe_check
    def col1_lt_col2(cls, df):           # (2)
        return df["column_1"] < df["column_2"]
(1) Custom column-level check: makes sure the mean of the column is within some range.
(2) Custom dataframe-level check: makes sure column_1 is less than column_2.
Regex Column-matching
Suppose I have column names that match some pattern:
   num_col_1  num_col_2  num_col_3  num_col_n
0   1.306457  -0.000429  -0.903723  -0.047536
1   0.027807  -1.073813   0.462735  -0.297414
2  -0.920510  -0.671194  -0.975042   0.322838
class RegexSchema(pa.DataFrameModel):
    num_columns: float = pa.Field(alias="num_col_.+", regex=True)

    @pa.check("num_col_.+", regex=True)
    def custom_check(cls, series):
        ...
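A quick usage sketch (the random dataframe below is made up for illustration, and it assumes the elided custom_check above is filled in to return a boolean, e.g. series.notna()):

import numpy as np

numeric_df = pd.DataFrame(
    np.random.randn(3, 4),
    columns=["num_col_1", "num_col_2", "num_col_3", "num_col_n"],
)
RegexSchema.validate(numeric_df)  # every column matching num_col_.+ is checked as a float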
Synthesize Test Data with Pandera
Just call Schema.example:

example_data = Schema.example(size=5)
     item          price
0   apple   6.103516e-05
1  banana   3.333333e-01
2  orange   1.900000e+00
3   apple  4.940656e-324
4   apple   7.018780e+76
Unit testing
Suppose I want to add a returned column to my dataset…
# Before data processing
class Schema(pa.DataFrameModel):
    item: str = pa.Field(isin=["apple", "orange", "banana"], nullable=False)
    price: float = pa.Field(ge=0, nullable=False)

# After data processing
class ProcessedSchema(Schema):
    returned: bool
Unit testing
Defining a process_data function:
from typing import List

@pa.check_types(lazy=True)
def process_data(
    data: DataFrame[Schema],
    returned: List[bool],
) -> DataFrame[ProcessedSchema]:
    return data.assign(returned=returned)
Unit testing
Our test for process_data:
from hypothesis import given, settings

size = 3

@given(data=Schema.strategy(size=size))          # (1)
@settings(max_examples=3)
def test_process_data(data):
    process_data(data, returned=[True] * size)   # (2)
    print("tests pass! ✅")
(1) Create a mock dataset.
(2) Just call process_data.
Unit testing
Run test_process_data
tests pass! ✅
tests pass! ✅
tests pass! ✅
Catch bugs early 🐞
Suppose there’s a bug in process_data:
@pa.check_types(lazy=True)
def process_data(
    data: DataFrame[Schema],
    returned: List[bool],
) -> DataFrame[ProcessedSchema]:
    return data.assign(returnned=returned)  # bug: the column name is misspelled
Catch bugs early 🐞
try:
    test_process_data()
except pa.errors.SchemaErrors as exc:
    print(exc)
Schema ProcessedSchema: A total of 1 schema errors were found.
Error Counts
------------
- SchemaErrorReason.COLUMN_NOT_IN_DATAFRAME: 1
Schema Error Summary
--------------------
                                              failure_cases  n_failure_cases
schema_context  column check
DataFrameSchema <NA>   column_in_dataframe       [returned]                1

Usage Tip
---------

Directly inspect all errors by catching the exception:

```
try:
    schema.validate(dataframe, lazy=True)
except SchemaErrors as err:
    err.failure_cases  # dataframe of schema errors
    err.data  # invalid dataframe
```
Data validation is not only about testing the actual data 📊, but also the functions that produce them.
Get started with pandera in 10 minutes
Python
Define and import schema
# schema.py
import pandera as pa

class Schema(pa.DataFrameModel):
    col1: int
    col2: float
    col3: str
Validate away!
python_dataframe = ...
Schema.validate(python_dataframe)
By understanding what counts as valid data, creating schemas for them, and using these schemas across your Python and R stack, you get a single source of truth for your data schemas. These schemas serve as data documentation for you and your team and as run-time validators for your data in both development and production contexts. pandera lowers the barrier to maintaining data hygiene so you can spend less time worrying about the correctness of your data and more time analyzing, visualizing, and modeling them.
And at the end of the day, I really do think our data want to be validated, so it’s up to us to do that for them.