Validating and Testing R DataFrames with pandera and reticulate

Niels Bantilan

2023-09-20

Dealing with invalid data 😭

Data validation is important…

… but tedious 😑

“Garbage in, garbage out”

“Data-centric machine learning”

“Data as code”

“But I just want to train my model!” 😫

A day in the life of a data scientist

flowchart LR
  A[clean data] --> S[split data]
  S --> B[train model]
  B --> C[evaluate model]
  S --> C

A day in the life of a data scientist

flowchart LR
  A[clean data] --> S[split data]
  S --> B[train model]
  B --> C[evaluate model]
  S --> C

  style B fill:#e0e0e0,stroke:#7f7f7f

A day in the life of a data scientist

flowchart LR
  A[clean data] --> S[split data]
  S --> B[train model]
  B --> C[evaluate model]
  S --> C

  style B fill:#8cffb4,stroke:#2bb55b
  style C fill:#ff8c8c,stroke:#bc3131

A day in the life of a data scientist

flowchart LR
  A[clean data] --> S[split data]
  S --> B[train model]
  B --> C[evaluate model]
  S --> C

  style S fill:#FFF2CC,stroke:#e2b128
  style B fill:#8cffb4,stroke:#2bb55b
  style C fill:#ff8c8c,stroke:#bc3131

Data validation is about understanding your data

And capturing that understanding as a schema.

{
    "column1": "integer",
    "column2": "string",
    "column3": "float"
}
  • 📖 Schemas document the shape and properties of some data structure.
  • 🔍 Schemas enforce that shape and those properties programmatically.

Data validation can be fun 🎉

Data validation is like unit testing for your data

$ run pipeline
  • dataset_x_validation
  • dataset_y_validation
  • dataset_z_validation

✨🍪✨
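
A hedged sketch of the analogy, using pandera's DataFrameModel API (covered in detail later); the schema and data here are illustrative:

import pandas as pd
import pandera as pa

class DatasetXSchema(pa.DataFrameModel):
    price: float = pa.Field(ge=0)

def test_dataset_x_validation():
    data = pd.DataFrame({"price": [0.25, 0.50, 0.75]})
    # validate() raises a SchemaError on bad data, so an invalid
    # dataset fails this test just like a failing assertion would
    DatasetXSchema.validate(data)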

The data validation mindset

Before:

flowchart LR
  G[Define Goal]
  E[Explore]
  I[Implement]
  S[Spot Check]
  P{Pass?}
  C[Continue]

  G --> E
  E --> I
  I --> S
  S --> P
  P -- Yes --> C
  P -- No --> E

The data validation mindset

After:

flowchart LR
  G[Define Goal]
  E[Explore]
  I[Implement]
  T[Define Schema]
  S[Validate]
  P{Pass?}
  C[Continue]

  G --> E
  E --> I
  E --> T
  I --> S
  T --> S
  S --> P
  P -- Yes --> C
  P -- No --> E

  style S fill:#8cffb4,stroke:#2bb55b
  style T fill:#FFF2CC,stroke:#e2b128

There’s no substitute for understanding your data with your own eyes 👀

pandera: a Python data validation and testing toolkit

reticulate: a bridge between Python and R

🤷‍♂️ So What?

By using pandera in your Python and R stacks, you get:

  • ⭐️ A single source of truth
  • 📖 Data documentation as code
  • 🔎 Run-time dataframe schema enforcers

⏱️ Spend less time worrying about the correctness of your data and more time analyzing, visualizing, and modeling them.

Define Goal

Predict the price of items in a produce transaction dataset

import pandas as pd

transactions = pd.DataFrame.from_records([
    {"item": "orange", "price": 0.75},
    {"item": "apple", "price": 0.50},
    {"item": "banana", "price": 0.25},
])

Explore the data

transactions.dtypes
item      object
price    float64
dtype: object


transactions.describe()
       price
count  3.000
mean   0.500
std    0.250
min    0.250
25%    0.375
50%    0.500
75%    0.625
max    0.750

Build our understanding

  • item is a categorical variable represented as a string.
  • item contains three possible values: orange, apple, and banana.
  • price is a float.
  • price is greater than or equal to zero.
  • neither column can contain null values.

Define a schema

Pandera gives you a simple way to translate your understanding into a schema

import pandera as pa

class Schema(pa.DataFrameModel):
    item: str = pa.Field(
        isin=["apple", "orange", "banana"],
        nullable=False,
    )
    price: float = pa.Field(
        ge=0,
        nullable=False,
    )
1. item is a categorical variable represented as a string.
2. item contains three possible values: orange, apple, and banana.
3. price is a float.
4. price is greater than or equal to zero.
5. neither column can contain null values.

Validate the data

If the data are valid, Schema.validate simply returns the valid data:

validated_transactions = Schema.validate(transactions)
     item  price
0  orange   0.75
1   apple   0.50
2  banana   0.25

Validate the data

But if not, it raises a SchemaError exception:

invalid_data = pd.DataFrame.from_records([
    {"item": "apple", "price": 0.75},
    {"item": "orange", "price": float("nan")},
    {"item": "squash", "price": -1000.0},
])
try:
    Schema.validate(invalid_data)
except pa.errors.SchemaError as exc:
    failure_cases = exc.failure_cases
   index failure_case
0      2       squash

Validate the data

lazy=True will evaluate all checks before raising a SchemaErrors exception.

try:
    Schema.validate(invalid_data, lazy=True)
except pa.errors.SchemaErrors as exc:
    failure_cases = exc.failure_cases
  schema_context column                                check check_number failure_case  index
0         Column   item  isin(['apple', 'orange', 'banana'])            0       squash      2
1         Column  price                         not_nullable         None          NaN      1
2         Column  price          greater_than_or_equal_to(0)            0      -1000.0      2

Functional Validation

Add type hints and a pandera.check_types decorator to your functions

from pandera.typing import DataFrame

@pa.check_types(lazy=True)
def clean_data(raw_data) -> DataFrame[Schema]:
    return raw_data

Functional Validation

The clean_data function now validates data every time it’s called:

try:
    clean_data(invalid_data)
except pa.errors.SchemaErrors as exc:
    failure_cases = exc.failure_cases
  schema_context column                                check check_number failure_case  index
0         Column   item  isin(['apple', 'orange', 'banana'])            0       squash      2
1         Column  price                         not_nullable         None          NaN      1
2         Column  price          greater_than_or_equal_to(0)            0      -1000.0      2

Updating the Schema

“But squash is a valid item!”

class Schema(pa.DataFrameModel):
    item: str = pa.Field(
        isin=["apple", "orange", "banana", "squash"],
        nullable=False,
    )
    price: float = pa.Field(
        ge=0,
        nullable=False,
    )

Schema Options

class SchemaOptions(pa.DataFrameModel):
    ...

    class Config:
        coerce = True
        ordered = True
        strict = True
        drop_invalid_rows = True
        unique_column_names = True
1. Attempts to coerce raw data into the specified types.
2. Makes sure columns are ordered as specified in the schema.
3. Makes sure the dataframe contains only the columns specified in the schema.
4. Drops rows that fail validation instead of raising an error.
5. Makes sure column names are unique.
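
For instance, a minimal sketch of what coerce = True does, assuming raw string data that can be parsed as integers:

import pandas as pd
import pandera as pa

class CoercingSchema(pa.DataFrameModel):
    count: int

    class Config:
        coerce = True

# the raw data arrives as strings; with coerce=True, validation
# converts them to int64 instead of raising a dtype error
raw = pd.DataFrame({"count": ["1", "2", "3"]})
validated = CoercingSchema.validate(raw)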

Built-in Checks

class SchemaBuiltInChecks(pa.DataFrameModel):
    column_1: str = pa.Field(
        isin=["a", "b", "c"],
        unique_values_eq=["a", "b", "c"],
        str_matches="pattern",
    )
    column_2: float = pa.Field(
        in_range={"min_value": 0, "max_value": 100},
        le=100,
        ne=-1,
    )
1. Values are in a finite set.
2. The unique set of values is equal to a finite set.
3. Strings match a regular expression pattern.
4. Values are within some range.
5. Values are less than or equal to some maximum.
6. Values are not equal to some constant.

Custom Checks

class SchemaCustomChecks(pa.DataFrameModel):
    column_1: float
    column_2: float

    @pa.check("column_1", "column_2")
    def mean_is_between(cls, series):
        return 0 <= series.mean() <= 100

    @pa.dataframe_check
    def col1_lt_col2(cls, df):
        return df["column_1"] < df["column_2"]
1. Custom column-level check: makes sure the mean of each annotated column is within some range.
2. Custom dataframe-level check: makes sure column_1 is less than column_2, row by row.

Regex Column-matching

Suppose I have column names that match some pattern:

   num_col_1  num_col_2  num_col_3  num_col_n
0  -0.053206   1.804979  -0.041816  -0.238918
1  -0.159643   0.432160   1.596157  -0.524850
2   1.314412  -0.520288   0.311637   0.663506
class RegexSchema(pa.DataFrameModel):
    num_columns: float = pa.Field(alias="num_col_.+", regex=True)

    @pa.check("num_col_.+", regex=True)
    def custom_check(cls, series):
        ...
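
A usage sketch of the field-level pattern (self-contained, with the elided custom check left out so it runs as-is):

import numpy as np
import pandas as pd
import pandera as pa

class RegexSchemaExample(pa.DataFrameModel):
    num_columns: float = pa.Field(alias="num_col_.+", regex=True)

df = pd.DataFrame(
    np.random.randn(3, 4),
    columns=["num_col_1", "num_col_2", "num_col_3", "num_col_n"],
)
# every column whose name matches num_col_.+ is validated
# against the float field
RegexSchemaExample.validate(df)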

Meta Comment

This presentation's Quarto source document is validated by pandera 🤯

Using pandera in R via reticulate

🐘 What about pointblank or validate?

  • If you’re already using pandera, reuse those schemas in an R runtime ♻️.
  • If the pandera programming model somehow fits better in your head 🧠.
  • You just want R and Python to get along 🤝.

library(dplyr)
library(knitr)
library(reticulate)

use_condaenv("pandera-posit-2023")

It just works! 🔥

valid_r_data <- data.frame(
    item = c("apple", "orange", "orange"),
    price = c(0.5, 0.75, 0.25)
)

validated_r_data <- py$Schema$validate(valid_r_data)
item     price
apple     0.50
orange    0.75
orange    0.25

⚠️ Warning: this hasn’t been comprehensively tested

Catch the Python Exception

invalid_r_data <- data.frame(
    item = c("applee", "orange", "orange"),
    price = c(0.5, 0.75, NaN)
)
validated_data <- tryCatch({
    py$Schema$validate(invalid_r_data, lazy=TRUE)
}, error=function(err) {
    exception <<- attr(py_last_error(), "exception")
    NULL
})
1. Use py to access variables in the Python namespace.
2. Use the py_last_error function to get the last exception raised in Python.

Get the failure cases

failure_cases <- exception$failure_cases
kable(failure_cases)
schema_context   column   check                                           check_number   failure_case   index
Column           item     isin(['apple', 'orange', 'banana', 'squash'])   0              applee         0
Column           price    not_nullable                                    NULL           NA             2

Synthesize Test Data with Pandera

Just call Schema$example

example_data <- py$Schema$example(size = as.integer(5))
item     price
banana       0
apple        0
squash       0
banana       0
squash       0
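
The same call works from Python, assuming the optional hypothesis dependency is installed (e.g. pip install 'pandera[strategies]'):

import pandera as pa

class Schema(pa.DataFrameModel):
    item: str = pa.Field(
        isin=["apple", "orange", "banana", "squash"],
        nullable=False,
    )
    price: float = pa.Field(ge=0, nullable=False)

# draws a synthetic dataframe that satisfies every schema
# constraint; the values themselves are arbitrary
example_df = Schema.example(size=5)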

Unit testing

Suppose I want to add a returned column to my dataset…

# Before data processing
class Schema(pa.DataFrameModel):
    item: str = pa.Field(isin=["apple", "orange", "banana"], nullable=False)
    price: float = pa.Field(ge=0, nullable=False)


# After data processing
class ProcessedSchema(Schema):
    returned: bool

Unit testing

Defining a process_data function

process_data <- function(data, returned) {
    transformed <- py$Schema$validate(data, lazy=TRUE) |>
        mutate(returned=returned)

    return(py$ProcessedSchema$validate(transformed, lazy=TRUE))
}

Unit testing

Our test for process_data:

test_process_data <- function() {
    mock_data <- py$Schema$example(size = as.integer(3))
    failed <- FALSE

    output <- tryCatch({
        process_data(mock_data, FALSE)
    }, error=function(err) {
        exception <<- attr(py_last_error(), "exception")
        failed <<- TRUE
        return(NULL)
    })

    if (failed) {
        print("process_data test failed ❌")
        kable(exception$failure_cases)
    } else {
        print("process_data test passes ✅")
        kable(output)
    }
}
1. Create a mock dataset.
2. Try/catch an error when calling process_data.
3. Report success or failure.

Unit testing

Run test_process_data

test_process_data()
[1] "process_data test passes ✅"
item     price          returned
apple    3.402823e+38   FALSE
orange   3.116777e+11   FALSE
apple    7.978949e+13   FALSE

Catch bugs early 🐞

Suppose there’s a bug in process_data:

process_data <- function(data, returned) {
    transformed <- py$Schema$validate(data, lazy=TRUE) |>
        mutate(returnd=returned)

    return(py$ProcessedSchema$validate(transformed, lazy=TRUE))
}

Catch bugs early 🐞

test_process_data()
[1] "process_data test failed ❌"
schema_context    column   check                 check_number   failure_case   index
DataFrameSchema   NA       column_in_dataframe   NA             returned       NA

Data validation is not only about testing the actual data, but also the functions that produce them.

Get started with pandera in 10 minutes

Python

pip install pandera

R

install.packages("reticulate")

Define and import schema

# schema.py
import pandera as pa

class Schema(pa.DataFrameModel):
    col1: int
    col2: float
    col3: str

# R: import the Python module with reticulate
schema <- import("schema")

Validate away!

# Python
python_dataframe = ...

Schema.validate(python_dataframe)

# R
library(reticulate)

r_dataframe <- ...

py$Schema$validate(r_dataframe)

💡 Got ideas?

Come talk to me!