Validating and Testing R DataFrames with pandera and reticulate

Niels Bantilan

2023-09-20

Dealing with invalid data 😭

Data validation is important…

… but tedious 😑

“Garbage in, garbage out”

“Data-centric machine learning”

“Data as code”

“But I just want to train my model!” 😫

A day in the life of a data scientist

flowchart LR
  A[clean data] --> S[split data]
  S --> B[train model]
  B --> C[evaluate model]
  S --> C

A day in the life of a data scientist

flowchart LR
  A[clean data] --> S[split data]
  S --> B[train model]
  B --> C[evaluate model]
  S --> C

  style B fill:#e0e0e0,stroke:#7f7f7f

A day in the life of a data scientist

flowchart LR
  A[clean data] --> S[split data]
  S --> B[train model]
  B --> C[evaluate model]
  S --> C

  style B fill:#8cffb4,stroke:#2bb55b
  style C fill:#ff8c8c,stroke:#bc3131

A day in the life of a data scientist

flowchart LR
  A[clean data] --> S[split data]
  S --> B[train model]
  B --> C[evaluate model]
  S --> C

  style S fill:#FFF2CC,stroke:#e2b128
  style B fill:#8cffb4,stroke:#2bb55b
  style C fill:#ff8c8c,stroke:#bc3131

Data validation is about understanding your data

And capturing that understanding as a schema.

{
    "column1": "integer",
    "column2": "string",
    "column3": "float"
}
  • 📖 Schemas document the shape and properties of some data structure.
  • 🔍 Schemas enforce that shape and those properties programmatically.

Data validation can be fun 🎉

Data validation is like unit testing for your data

$ run pipeline
  • dataset_x_validation
  • dataset_y_validation
  • dataset_z_validation

✨🍪✨
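
A hedged sketch of the analogy, using pandera's DataFrameModel API (covered in detail later); the schema and data here are illustrative:

import pandas as pd
import pandera as pa

class DatasetXSchema(pa.DataFrameModel):
    price: float = pa.Field(ge=0)

def test_dataset_x_validation():
    data = pd.DataFrame({"price": [0.25, 0.50, 0.75]})
    # validate() raises a SchemaError on bad data, so an invalid
    # dataset fails this test just like a failing assertion would
    DatasetXSchema.validate(data)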

The data validation mindset

Before:

flowchart LR
  G[Define Goal]
  E[Explore]
  I[Implement]
  S[Spot Check]
  P{Pass?}
  C[Continue]

  G --> E
  E --> I
  I --> S
  S --> P
  P -- Yes --> C
  P -- No --> E

The data validation mindset

After:

flowchart LR
  G[Define Goal]
  E[Explore]
  I[Implement]
  T[Define Schema]
  S[Validate]
  P{Pass?}
  C[Continue]

  G --> E
  E --> I
  E --> T
  I --> S
  T --> S
  S --> P
  P -- Yes --> C
  P -- No --> E

  style S fill:#8cffb4,stroke:#2bb55b
  style T fill:#FFF2CC,stroke:#e2b128

There’s no substitute for understanding your data with your own eyes 👀

pandera: a Python data validation and testing toolkit

reticulate: a bridge between Python and R

🤷‍♂️ So What?

By using pandera in your Python and R stacks, you get:

  • ⭐️ A single source of truth
  • 📖 Data documentation as code
  • 🔎 Run-time dataframe schema enforcers

⏱️ Spend less time worrying about the correctness of your data and more time analyzing, visualizing, and modeling them.

Define Goal

Predict the price of items in a produce transaction dataset

import pandas as pd

transactions = pd.DataFrame.from_records([
    {"item": "orange", "price": 0.75},
    {"item": "apple", "price": 0.50},
    {"item": "banana", "price": 0.25},
])

Explore the data

transactions.dtypes
item      object
price    float64
dtype: object


transactions.describe()
       price
count  3.000
mean   0.500
std    0.250
min    0.250
25%    0.375
50%    0.500
75%    0.625
max    0.750

Build our understanding

  • item is a categorical variable represented as a string.
  • item contains three possible values: orange, apple, and banana.
  • price is a float.
  • price is greater than or equal to zero.
  • neither column can contain null values.

Define a schema

Pandera gives you a simple way to translate your understanding into a schema

import pandera as pa

class Schema(pa.DataFrameModel):
    item: str = pa.Field(
        isin=["apple", "orange", "banana"],
        nullable=False,
    )
    price: float = pa.Field(
        ge=0,
        nullable=False,
    )
1. item is a categorical variable represented as a string.
2. item contains three possible values: orange, apple, and banana.
3. price is a float.
4. price is greater than or equal to zero.
5. neither column can contain null values.

Validate the data

If the data are valid, Schema.validate simply returns the valid data:

validated_transactions = Schema.validate(transactions)
     item  price
0  orange   0.75
1   apple   0.50
2  banana   0.25

Validate the data

But if not, it raises a SchemaError exception:

invalid_data = pd.DataFrame.from_records([
    {"item": "apple", "price": 0.75},
    {"item": "orange", "price": float("nan")},
    {"item": "squash", "price": -1000.0},
])
try:
    Schema.validate(invalid_data)
except pa.errors.SchemaError as exc:
    failure_cases = exc.failure_cases
   index failure_case
0      2       squash

Validate the data

lazy=True will evaluate all checks before raising a SchemaErrors exception.

try:
    Schema.validate(invalid_data, lazy=True)
except pa.errors.SchemaErrors as exc:
    failure_cases = exc.failure_cases
  schema_context column                                check check_number failure_case  index
0         Column   item  isin(['apple', 'orange', 'banana'])            0       squash      2
1         Column  price                         not_nullable         None          NaN      1
2         Column  price          greater_than_or_equal_to(0)            0      -1000.0      2

Functional Validation

Add type hints and a pandera.check_types decorator to your functions

from pandera.typing import DataFrame

@pa.check_types(lazy=True)
def clean_data(raw_data) -> DataFrame[Schema]:
    return raw_data

Functional Validation

The clean_data function now validates data every time it’s called:

try:
    clean_data(invalid_data)
except pa.errors.SchemaErrors as exc:
    failure_cases = exc.failure_cases
  schema_context column                                check check_number failure_case  index
0         Column   item  isin(['apple', 'orange', 'banana'])            0       squash      2
1         Column  price                         not_nullable         None          NaN      1
2         Column  price          greater_than_or_equal_to(0)            0      -1000.0      2

Updating the Schema

“But squash is a valid item!”

class Schema(pa.DataFrameModel):
    item: str = pa.Field(
        isin=["apple", "orange", "banana", "squash"],
        nullable=False,
    )
    price: float = pa.Field(
        ge=0,
        nullable=False,
    )

Schema Options

class SchemaOptions(pa.DataFrameModel):
    ...

    class Config:
        coerce = True
        ordered = True
        strict = True
        drop_invalid_rows = True
        unique_column_names = True
1. Attempts to coerce raw data into the specified types.
2. Makes sure columns are ordered as specified in the schema.
3. Makes sure the dataframe contains only the columns specified in the schema.
4. Drops rows that fail validation instead of raising an error.
5. Makes sure column names are unique.
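
For instance, a minimal sketch of what coerce = True does, assuming raw string data that can be parsed as integers:

import pandas as pd
import pandera as pa

class CoercingSchema(pa.DataFrameModel):
    count: int

    class Config:
        coerce = True

# the raw data arrives as strings; with coerce=True, validation
# converts them to int64 instead of raising a dtype error
raw = pd.DataFrame({"count": ["1", "2", "3"]})
validated = CoercingSchema.validate(raw)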

Built-in Checks

class SchemaBuiltInChecks(pa.DataFrameModel):
    column_1: str = pa.Field(
        isin=["a", "b", "c"],
        unique_values_eq=["a", "b", "c"],
        str_matches="pattern",
    )
    column_2: float = pa.Field(
        in_range={"min_value": 0, "max_value": 100},
        le=100,
        ne=-1,
    )
1. Values are in a finite set.
2. The unique set of values is equal to a finite set.
3. Strings match a regular expression pattern.
4. Values are within some range.
5. Values are less than or equal to some maximum.
6. Values are not equal to some constant.

Custom Checks

class SchemaCustomChecks(pa.DataFrameModel):
    column_1: float
    column_2: float

    @pa.check("column_1", "column_2")
    def mean_is_between(cls, series):
        return 0 <= series.mean() <= 100

    @pa.dataframe_check
    def col1_lt_col2(cls, df):
        return df["column_1"] < df["column_2"]
1. Custom column-level check: makes sure the mean of each annotated column is within some range.
2. Custom dataframe-level check: makes sure column_1 is less than column_2, row by row.

Regex Column-matching

Suppose I have column names that match some pattern:

   num_col_1  num_col_2  num_col_3  num_col_n
0  -0.053206   1.804979  -0.041816  -0.238918
1  -0.159643   0.432160   1.596157  -0.524850
2   1.314412  -0.520288   0.311637   0.663506
class RegexSchema(pa.DataFrameModel):
    num_columns: float = pa.Field(alias="num_col_.+", regex=True)

    @pa.check("num_col_.+", regex=True)
    def custom_check(cls, series):
        ...
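
A usage sketch of the field-level pattern (self-contained, with the elided custom check left out so it runs as-is):

import numpy as np
import pandas as pd
import pandera as pa

class RegexSchemaExample(pa.DataFrameModel):
    num_columns: float = pa.Field(alias="num_col_.+", regex=True)

df = pd.DataFrame(
    np.random.randn(3, 4),
    columns=["num_col_1", "num_col_2", "num_col_3", "num_col_n"],
)
# every column whose name matches num_col_.+ is validated
# against the float field
RegexSchemaExample.validate(df)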

Meta Comment

This presentation's Quarto source document is validated by pandera 🤯

Using pandera in R via reticulate

🐘 What about pointblank or validate?

  • If you’re already using pandera, reuse those schemas in an R runtime ♻️.
  • If the pandera programming model somehow fits better in your head 🧠.
  • You just want R and Python to get along 🤝.

library(dplyr)
library(knitr)
library(reticulate)

use_condaenv("pandera-posit-2023")

It just works! 🔥

valid_r_data <- data.frame(
    item = c("apple", "orange", "orange"),
    price = c(0.5, 0.75, 0.25)
)

validated_r_data <- py$Schema$validate(valid_r_data)
item     price
apple     0.50
orange    0.75
orange    0.25

⚠️ Warning: this hasn’t been comprehensively tested

Catch the Python Exception

invalid_r_data <- data.frame(
    item = c("applee", "orange", "orange"),
    price = c(0.5, 0.75, NaN)
)
validated_data <- tryCatch({
    py$Schema$validate(invalid_r_data, lazy=TRUE)
}, error=function(err) {
    exception <<- attr(py_last_error(), "exception")
    NULL
})
1. Use py to access variables in the Python namespace.
2. Use the py_last_error function to get the last exception raised in Python.

Get the failure cases

failure_cases <- exception$failure_cases
kable(failure_cases)
schema_context   column   check                                           check_number   failure_case   index
Column           item     isin(['apple', 'orange', 'banana', 'squash'])   0              applee         0
Column           price    not_nullable                                    NULL           NA             2

Synthesize Test Data with Pandera

Just call Schema$example

example_data <- py$Schema$example(size = as.integer(5))
item     price
banana       0
apple        0
squash       0
banana       0
squash       0
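
The same call works from Python, assuming the optional hypothesis dependency is installed (e.g. pip install 'pandera[strategies]'):

import pandera as pa

class Schema(pa.DataFrameModel):
    item: str = pa.Field(
        isin=["apple", "orange", "banana", "squash"],
        nullable=False,
    )
    price: float = pa.Field(ge=0, nullable=False)

# draws a synthetic dataframe that satisfies every schema
# constraint; the values themselves are arbitrary
example_df = Schema.example(size=5)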

Unit testing

Suppose I want to add a returned column to my dataset…

# Before data processing
class Schema(pa.DataFrameModel):
    item: str = pa.Field(isin=["apple", "orange", "banana"], nullable=False)
    price: float = pa.Field(ge=0, nullable=False)


# After data processing
class ProcessedSchema(Schema):
    returned: bool

Unit testing

Defining a process_data function

process_data <- function(data, returned) {
    transformed <- py$Schema$validate(data, lazy=TRUE) |>
        mutate(returned=returned)

    return(py$ProcessedSchema$validate(transformed, lazy=TRUE))
}

Unit testing

Our test for process_data:

test_process_data <- function() {
    mock_data <- py$Schema$example(size = as.integer(3))
    failed <- FALSE

    output <- tryCatch({
        process_data(mock_data, FALSE)
    }, error=function(err) {
        exception <<- attr(py_last_error(), "exception")
        failed <<- TRUE
        return(NULL)
    })

    if (failed) {
        print("process_data test failed ❌")
        kable(exception$failure_cases)
    } else {
        print("process_data test passes ✅")
        kable(output)
    }
}
1. Create a mock dataset.
2. Try/catch an error when calling process_data.
3. Report success or failure.

Unit testing

Run test_process_data

test_process_data()
[1] "process_data test passes ✅"
item     price          returned
apple    3.402823e+38   FALSE
orange   3.116777e+11   FALSE
apple    7.978949e+13   FALSE

Catch bugs early 🐞

Suppose there’s a bug in process_data:

process_data <- function(data, returned) {
    transformed <- py$Schema$validate(data, lazy=TRUE) |>
        mutate(returnd=returned)

    return(py$ProcessedSchema$validate(transformed, lazy=TRUE))
}

Catch bugs early 🐞

test_process_data()
[1] "process_data test failed ❌"
schema_context    column   check                 check_number   failure_case   index
DataFrameSchema   NA       column_in_dataframe   NA             returned       NA

Data validation is not only about testing the actual data, but also the functions that produce them.

Get started with pandera in 10 minutes

Python

pip install pandera

R

install.packages("reticulate")

Define and import schema

# schema.py
import pandera as pa

class Schema(pa.DataFrameModel):
    col1: int
    col2: float
    col3: str

# R: import the Python module with reticulate
schema <- import("schema")

Validate away!

# Python
python_dataframe = ...

Schema.validate(python_dataframe)

# R
library(reticulate)

r_dataframe <- ...

py$Schema$validate(r_dataframe)

💡 Got ideas?

Come talk to me!