Validating and Testing DataFrames with pandera
2023-11-03
In October 2012, the Harvard Business Review named data scientist the sexiest job of the 21st century. But you know what isn’t sexy?
Dealing with invalid data 😭
Dealing with invalid data can often feel like a losing battle, especially when you don’t realize that corrupted or otherwise incorrect data has passed through your pipeline. What’s worse, downstream consumers of that data are relying on you to make sure it’s clean and correct.
Data validation is important…
… but tedious 😑
“Garbage in, garbage out”
“Data-centric machine learning”
“But I just want to train my model!” 😫
Once upon a time, this was me…
A day in the life of a data scientist
flowchart LR
A[clean data] --> S[split data]
S --> B[train model]
B --> C[evaluate model]
S --> C
In one of my past jobs I had to train a model, so my pipeline looked something like this:
A day in the life of a data scientist
flowchart LR
A[clean data] --> S[split data]
S --> B[train model]
B --> C[evaluate model]
S --> C
style B fill:#e0e0e0,stroke:#7f7f7f
Now, the training step can take a long time, so I had to wait a few days for it to complete…
A day in the life of a data scientist
flowchart LR
A[clean data] --> S[split data]
S --> B[train model]
B --> C[evaluate model]
S --> C
style B fill:#8cffb4,stroke:#2bb55b
style C fill:#ff8c8c,stroke:#bc3131
A day in the life of a data scientist
flowchart LR
A[clean data] --> S[split data]
S --> B[train model]
B --> C[evaluate model]
S --> C
style S fill:#FFF2CC,stroke:#e2b128
style B fill:#8cffb4,stroke:#2bb55b
style C fill:#ff8c8c,stroke:#bc3131
Had I made some assertions about what the test data should look like at the split data step, I would have caught this data bug early before wasting all of that time training the model.
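For example, a few cheap assertions at the split data step might have surfaced the problem before any training time was spent. Here's a minimal sketch (the check_test_split helper and the specific assertions are hypothetical, not from my actual pipeline):

import pandas as pd

def check_test_split(train: pd.DataFrame, test: pd.DataFrame) -> None:
    # cheap sanity checks that run in seconds, before days of training
    assert len(test) > 0, "test set is empty"
    assert list(test.columns) == list(train.columns), "train/test columns don't match"
    assert not test.isna().values.all(), "test set contains only null values"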
Data validation is about understanding your data
And capturing that understanding as a schema.
{
    "column1": "integer",
    "column2": "string",
    "column3": "float"
}
📖 Schemas document the shape and properties of some data structure.
🔍 Schemas enforce that shape and those properties programmatically.
A schema is an artifact that serves both of these purposes.
My job, in this talk, is not to convince you that data validation is the most attractive part of data science, but I do want to convince you that…
Data validation can be fun 🎉
Data validation is like unit testing for your data
$ run pipeline
✅ dataset_x_validation
✅ dataset_y_validation
✅ dataset_z_validation
✨🍪✨
And the way you do that is to reframe data validation as unit tests for your data.
This way whenever you run your data pipelines you’ll know that your datasets are valid and you can get that little dopamine hit seeing your data tests pass.
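As a minimal sketch of the idea in pytest style (the dataset, schema, and column names below are hypothetical, made up just for illustration):

import pandas as pd
import pandera as pa

# hypothetical schema for one of the datasets in the pipeline
dataset_x_schema = pa.DataFrameSchema({
    "user_id": pa.Column(int),
    "amount": pa.Column(float, pa.Check.ge(0)),
})

def test_dataset_x_is_valid():
    dataset_x = pd.DataFrame({"user_id": [1, 2], "amount": [9.99, 0.0]})
    # validate() raises a SchemaError if the data don't conform, failing the test loudly
    dataset_x_schema.validate(dataset_x)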
The data validation mindset
Like many things in life worth doing, getting to the fun part of data validation requires some extra work and a shift in your mindset.
The shift is subtle: instead of casually spot-checking your data as you implement the functions in your pipeline, you explicitly define a schema alongside each function.
Before:
flowchart LR
G[Define Goal]
E[Explore]
I[Implement]
S[Spot Check]
P{Pass?}
C[Continue]
G --> E
E --> I
I --> S
S --> P
P -- Yes --> C
P -- No --> E
The data validation mindset
After:
flowchart LR
G[Define Goal]
E[Explore]
I[Implement]
T[Define Schema]
S[Validate]
P{Pass?}
C[Continue]
G --> E
E --> I
E --> T
I --> S
T --> S
S --> P
P -- Yes --> C
P -- No --> E
style S fill:#8cffb4,stroke:#2bb55b
style T fill:#FFF2CC,stroke:#e2b128
Data validation is a never-ending process of understanding your data through exploration, encoding that understanding as schemas, testing those schemas against live data, and re-understanding your data as it shifts around in the real world.
There’s no substitute for understanding your data with your own eyes 👀
But once you gain that understanding at a particular point in time, it can seem like a Sisyphean task to define and maintain the schemas you need to make sure your data are valid.
I built pandera to lower the barrier to creating and maintaining schemas in your codebase and my hope was that it would encourage a culture of data hygiene in the organizations that use it.
pandera
: a Python data validation and testing toolkit
Pandera is a Python package that provides a lightweight, flexible, and expressive API for validating dataframe-like objects in Python…
🤷‍♂️ So What?
By using pandera
in your stack, you get:
⭐️ A single source of truth
📖 Data documentation as code
🔎 Run-time dataframe schema enforcers
Framing data validation as fun is really a trojan horse for getting you to do something that yields a lot of practical benefits, because…
A single source of truth for your data schemas
Data documentation as code for you and your team to understand what your data looks like
Run-time dataframe enforcers in development, testing, and production contexts
⏱️ Spend less time worrying about the correctness of your data and more time analyzing, visualizing, and modeling them.
Now I’d like to take you through a mini data validation journey to show you how you can get started with pandera.
Define Goal
Predict the price of items in a produce transaction dataset
import pandas as pd

transactions = pd.DataFrame.from_records([
    {"item": "orange", "price": 0.75},
    {"item": "apple", "price": 0.50},
    {"item": "banana", "price": 0.25},
])
Explore the data
item object
price float64
dtype: object
count    3.000
mean     0.500
std      0.250
min      0.250
25%      0.375
50%      0.500
75%      0.625
max      0.750
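The summaries above could be produced with something like the following (the exact exploration commands aren't shown in the slides, so this is just a sketch):

# column data types
transactions.dtypes

# distribution of the price column
transactions["price"].describe()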
Build our understanding
item is a categorical variable represented as a string.
item contains three possible values: orange, apple, and banana.
price is a float.
price is greater than or equal to zero.
Neither column can contain null values.
Define a schema
Pandera gives you a simple way to translate your understanding into a schema
import pandera as pa

class Schema(pa.DataFrameModel):
    item: str = pa.Field(                    # (1)
        isin=["apple", "orange", "banana"],  # (2)
        nullable=False,                      # (5)
    )
    price: float = pa.Field(                 # (3)
        ge=0,                                # (4)
        nullable=False,
    )
(1) item is a categorical variable represented as a string.
(2) item contains three possible values: orange, apple, and banana.
(3) price is a float.
(4) price is greater than or equal to zero.
(5) Neither column can contain null values.
Validate the data
If the data are valid, Schema.validate simply returns the valid data:
validated_transactions = Schema.validate(transactions)
     item  price
0  orange   0.75
1   apple   0.50
2  banana   0.25
Validate the data
But if not, it raises a SchemaError exception:
invalid_data = pd.DataFrame.from_records([
    {"item": "apple", "price": 0.75},
    {"item": "orange", "price": float("nan")},
    {"item": "squash", "price": -1000.0},
])

try:
    Schema.validate(invalid_data)
except pa.errors.SchemaError as exc:
    failure_cases = exc.failure_cases
   index failure_case
0      2       squash
Validate the data
Passing lazy=True will evaluate all checks before raising a SchemaErrors exception.
try:
    Schema.validate(invalid_data, lazy=True)
except pa.errors.SchemaErrors as exc:
    failure_cases = exc.failure_cases
  schema_context column                                check check_number failure_case  index
0         Column   item  isin(['apple', 'orange', 'banana'])            0       squash      2
1         Column  price                         not_nullable         None          NaN      1
2         Column  price          greater_than_or_equal_to(0)            0      -1000.0      2
Functional Validation
Add type hints and a pandera.check_types decorator to your functions:
from pandera.typing import DataFrame

@pa.check_types(lazy=True)
def clean_data(raw_data) -> DataFrame[Schema]:
    return raw_data
Functional Validation
The clean_data function now validates data every time it’s called:
try:
    clean_data(invalid_data)
except pa.errors.SchemaErrors as exc:
    failure_cases = exc.failure_cases
  schema_context column                                check check_number failure_case  index
0         Column   item  isin(['apple', 'orange', 'banana'])            0       squash      2
1         Column  price                         not_nullable         None          NaN      1
2         Column  price          greater_than_or_equal_to(0)            0      -1000.0      2
Updating the Schema
“But squash is a valid item!”
  class Schema(pa.DataFrameModel):
      item: str = pa.Field(
-         isin=["apple", "orange", "banana"],
+         isin=["apple", "orange", "banana", "squash"],
          nullable=False,
      )
      price: float = pa.Field(
          ge=0,
          nullable=False,
      )
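With the updated schema in place, a transaction containing squash now validates cleanly (a quick sketch; the row below is made up for illustration):

Schema.validate(
    pd.DataFrame.from_records([{"item": "squash", "price": 1.25}])
)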
Schema Options
class SchemaOptions(pa.DataFrameModel):
    ...

    class Config:
        coerce = True               # (1)
        ordered = True              # (2)
        strict = True               # (3)
        drop_invalid_rows = True    # (4)
        unique_column_names = True  # (5)
(1) Attempts to coerce raw data into the specified types.
(2) Makes sure columns are ordered as specified in the schema.
(3) Makes sure the dataframe contains only the columns specified in the schema.
(4) Drops rows with invalid values.
(5) Makes sure column names are unique.
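For instance, here's a minimal sketch of how coerce and drop_invalid_rows might behave together on the produce schema from earlier (the CoercingSchema name and example data are mine, not from the talk; note that drop_invalid_rows only takes effect when validating with lazy=True):

class CoercingSchema(pa.DataFrameModel):
    item: str = pa.Field(isin=["apple", "orange", "banana"])
    price: float = pa.Field(ge=0)

    class Config:
        coerce = True             # "0.75" (string) is coerced to 0.75 (float)
        drop_invalid_rows = True  # rows that fail checks are dropped instead of raising

raw = pd.DataFrame({"item": ["apple", "squash"], "price": ["0.75", "-1.0"]})
valid = CoercingSchema.validate(raw, lazy=True)  # only the "apple" row survives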
Built-in Checks
class SchemaBuiltInChecks(pa.DataFrameModel):
    column_1: str = pa.Field(
        isin=["a", "b", "c"],                          # (1)
        unique_values_eq=["a", "b", "c"],              # (2)
        str_matches="pattern",                         # (3)
    )
    column_2: float = pa.Field(
        in_range={"min_value": 0, "max_value": 100},   # (4)
        le=100,                                        # (5)
        ne=-1,                                         # (6)
    )
(1) Values are in a finite set.
(2) The set of unique values is equal to a finite set.
(3) Strings match a pattern.
(4) Values are within some range.
(5) Values are less than or equal to some maximum.
(6) Values are not equal to some constant.
Custom Checks
class SchemaCustomChecks(pa.DataFrameModel):
    column_1: float
    column_2: float

    @pa.check("column_1", "column_2")
    def mean_is_between(cls, series):    # (1)
        return 0 <= series.mean() <= 100

    @pa.dataframe_check
    def col1_lt_col2(cls, df):           # (2)
        return df["column_1"] < df["column_2"]
(1) Custom column-level check: makes sure the mean of the column is within some range.
(2) Custom dataframe-level check: makes sure column_1 is less than column_2.
Regex Column-matching
Suppose I have column names that match some pattern:
   num_col_1  num_col_2  num_col_3  num_col_n
0   1.306457  -0.000429  -0.903723  -0.047536
1   0.027807  -1.073813   0.462735  -0.297414
2  -0.920510  -0.671194  -0.975042   0.322838
class RegexSchema(pa.DataFrameModel):
    num_columns: float = pa.Field(alias="num_col_.+", regex=True)

    @pa.check("num_col_.+", regex=True)
    def custom_check(cls, series):
        ...
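A quick usage sketch (the random dataframe below is made up for illustration, and it assumes the elided custom_check above is filled in to return a boolean, e.g. series.notna()):

import numpy as np

numeric_df = pd.DataFrame(
    np.random.randn(3, 4),
    columns=["num_col_1", "num_col_2", "num_col_3", "num_col_n"],
)
RegexSchema.validate(numeric_df)  # every column matching num_col_.+ is checked as a float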
Synthesize Test Data with Pandera
Just call Schema.example:

example_data = Schema.example(size=5)
     item          price
0   apple   6.103516e-05
1  banana   3.333333e-01
2  orange   1.900000e+00
3   apple  4.940656e-324
4   apple   7.018780e+76
Unit testing
Suppose I want to add a returned column to my dataset…
# Before data processing
class Schema(pa.DataFrameModel):
    item: str = pa.Field(isin=["apple", "orange", "banana"], nullable=False)
    price: float = pa.Field(ge=0, nullable=False)

# After data processing
class ProcessedSchema(Schema):
    returned: bool
Unit testing
Defining a process_data function:
from typing import List

@pa.check_types(lazy=True)
def process_data(
    data: DataFrame[Schema],
    returned: List[bool],
) -> DataFrame[ProcessedSchema]:
    return data.assign(returned=returned)
Unit testing
Our test for process_data:
from hypothesis import given, settings

size = 3

@given(data=Schema.strategy(size=size))          # (1)
@settings(max_examples=3)
def test_process_data(data):
    process_data(data, returned=[True] * size)   # (2)
    print("tests pass! ✅")
(1) Create a mock dataset.
(2) Just call process_data.
Unit testing
Run test_process_data
tests pass! ✅
tests pass! ✅
tests pass! ✅
Catch bugs early 🐞
Suppose there’s a bug in process_data:
@pa.check_types(lazy=True)
def process_data(
    data: DataFrame[Schema],
    returned: List[bool],
) -> DataFrame[ProcessedSchema]:
    return data.assign(returnned=returned)  # bug: the column name is misspelled
Catch bugs early 🐞
try:
    test_process_data()
except pa.errors.SchemaErrors as exc:
    print(exc)
Schema ProcessedSchema: A total of 1 schema errors were found.
Error Counts
------------
- SchemaErrorReason.COLUMN_NOT_IN_DATAFRAME: 1
Schema Error Summary
--------------------
                                              failure_cases  n_failure_cases
schema_context  column check
DataFrameSchema <NA>   column_in_dataframe       [returned]                1

Usage Tip
---------

Directly inspect all errors by catching the exception:

```
try:
    schema.validate(dataframe, lazy=True)
except SchemaErrors as err:
    err.failure_cases  # dataframe of schema errors
    err.data  # invalid dataframe
```
Data validation is not only about testing the actual data 📊, but also the functions that produce them.
Get started with pandera in 10 minutes
Python
Define and import schema
# schema.py
import pandera as pa

class Schema(pa.DataFrameModel):
    col1: int
    col2: float
    col3: str
Validate away!
python_dataframe = ...
Schema.validate(python_dataframe)
By understanding what counts as valid data, creating schemas for them, and using these schemas across your Python and R stack, you get a single source of truth for your data schemas. These schemas serve as data documentation for you and your team and as run-time validators for your data in both development and production contexts. pandera lowers the barrier to maintaining data hygiene so you can spend less time worrying about the correctness of your data and more time analyzing, visualizing, and modeling them.
And at the end of the day, I really do think our data want to be validated, so it’s up to us to do that for them.