This is the blog of Will Badart.

14 Jan 2020

- #data science
- #software engineering

The last three posts gradually introduced “Ghosts of Departed Proofs” (GDP) as a framework for designing safe, ergonomic data science libraries. GDP and the related approaches leverage Haskell’s expressive type system to enable the compiler to detect user errors which violate the preconditions of an algorithm.

Unfortunately, not many data scientists are using Haskell today, so to use GDP directly, most teams would have to switch languages.

This post is for teams who can’t afford to switch languages right now, but still want to make stronger arguments for the safety and correctness of their data science software. Below, we’ll be extracting the abstract approaches from the GDP system and exploring how to implement them concretely in Python.

One of the major value propositions of GDP in Haskell is the total lack of any run-time cost. We get the safety guarantees from the compiler, and our program runs the same. Such a feature simply isn’t possible in a dynamic type system, wherein any type-checking occurs at run-time.

With PEP-484, however, Python introduced what amounts to an
optional static type system. We can use an external tool
(`mypy`

seems to be the popular one) to read any recorded
type hints and check them. It’s not the same as having a compiler
categorically reject ill-typed programs (Python will happily
interpret and run a program `mypy`

finds ill-typed; it simply ignores
type hints), but having the type check as part of an automated
process like a CI pipeline is a small step in the direction of
“guaranteeing” type safety.

With PEP-484 as a foundation, we can start to implement echoes of the techniques we’ve been exploring.

As before, a smart constructor takes some data from an underlying
type to a wrapper type which communicates some property. We’ll
translate Haskell’s `newtype`

-based implementation for that idea into
a class, say, for the `SortedList`

example:

```
class SortedList:
def __init__(self, data):
self.data = list(sorted(data))
```

Now we can use a type hint with `SortedList`

to encode a “list
argument must be sorted” precondition:

```
def merge_sorted(x: SortedList, y: SortedList):
# Operate on x.data and y.data, knowing they've already been sorted
```

Anywhere we want to use a function/ method for regular lists, we can
just reference `.data`

. The problem is that `SortedList`

has a pretty
limited interface; there’s almost nothing we can do with it without
pulling back the curtain and working with `.data`

. We can’t even
print the elements without reimplementing `__repr__`

. To alleviate
this, we can have `SortedList`

inherit from `list`

.

```
class SortedList(list):
def __init__(self, data):
super().__init__(sorted(data))
```

Now, `merge_sorted`

can operate directly on its arguments, rather
than grabbing the underlying lists they wrap.

This is great for a nominal check that a particular procedure has been applied. However, two aspects of Python prevent this check from being a proper guarantee that the given property holds.

- There’s no real “private” data in Python. In the first
“composition” approach, any part of our program is free to read
and write
`.data`

. We can rename it to`._data`

and make`.data`

a read-only property, but that’s just indirection. Even though the leading underscore, by convention, politely suggests that`._data`

is “private,” we’re free to ignore that wherever we feel convenient, and there’s nothing Python or`mypy`

can do to stop us (or, more likely, our library users). - Python data is mutable. Yes, the data was definitely sorted by the
end of
`__init__`

’s body, but every line of code that follows is free to shuffle, slice, drop and chop the underlying data at will. That’s why the Python class`SortedList`

more communicates the notion that “this list was sorted at some point,” rather than the guarantee that the term is and will be sorted for the life of the program.

Now that we’re clear that these types can’t guarantee, merely suggest, a property, let’s continue on to our data science library and translate our preprocessing module:

```
from enum import Enum
from sklearn.preprocessing import StandardScaler
class Normalized:
def __init__(self, df):
self.scaler = StandardScaler()
self.df = self.scaler.fit_transform(df)
class MissingValueHandler(Enum):
DROP = pd.DataFrame.dropna
FILL = pd.DataFrame.fillna
class NoMissingVals:
def __init__(self, df, handler: MissingValueHandler, **opts):
self.df = handler(df, **opts)
```

`NoMissingVals`

is a little trickier here since there were multiple
ways to handle missing values. The use of the enum type above
constrains users to the missing value handling methods defined by
`MissingValueHandler`

(no other argument would type check). A call
would look like `NoMissingVals(df, MissingValueHandler.DROP)`

.

Here’s how these classes look as preconditions on our machine learning algorithms:

```
def kmeans(df: Normalized, k: int):
# ...
def id3_decision_tree(df: NoMissingVals):
# ...
```

With these type hints, if we forget the requisite procedure, as
below, `mypy`

(wherever in our development pipeline it’s run) will
helpfully remind us:

```
import pandas as pd
df = pd.read_csv('my_data.csv')
kmeans(df, k=4)
```

leads to

```
error: Argument 1 to "kmeans" has incompatible type "DataFrame"; expected "Normalized"
```

If we think of our parsing functions as smart constructors which may fail, then we can add that notion of failure directly to our Python translation of the smart constructor pattern. To revisit square matrices:

```
import numpy as np
class SquareMatrix:
def __init__(self, a: np.ndarray):
if not self.is_square(a):
raise ValueError(f"Can't make a square matrix from dimensions {a.shape}")
self.data = a
@staticmethod
def is_square(a: np.ndarray):
try:
n, m = a.shape
except ValueError: # a isn't 2D
return false
return n == m
```

If we’re careful not to mutate (e.g. reshape) our `SquareMatrix`

,
this type is a pretty good indication that the underlying matrix in
`.data`

is actually square. Using this parser, we can check for
squareness and *fail early*, perhaps as soon as we have our target
matrix in hand. Without it, we have to wait for whatever
implementation detail of the algorithm relies on the squareness
property to blow up with an opaque, technical error.

Again, due to the optionality of type checking, the mutability of
data, and the lack of private data, this doesn’t amount to a
*guarantee* that a term of type `SquareMatrix`

is actually square.
But it is a good smoke test for when we accidentally omit the check.

GDP associates a term with a proof by sticking the proof in a phantom
type parameter of a `newtype`

constructor. What does that mean in
Python?

Interestingly, since Python 3.5.2, the `typing`

standard library
module ships with a `NewType`

construct not entirely
dissimilar to Haskell’s. A Python `NewType`

wraps some underlying
type in a new type, to be treated as something separate by the type
checker. However, `NewType`

doesn’t admit the parameterization we
need. In Python, the notion of a type with parameters corresponds to
generics. For example, `List`

is a generic type,
meaning it is parameterized by another, type, e.g. `List[int]`

.

In GDP, the “such that” `newtype`

was our core mechanism for
associating proofs with terms. Let’s define a generic type
`SuchThat`

, parameterized by a data type and a proof type:

```
from typing import Generic, TypeVar
A = TypeVar('A') # data type
P = TypeVar('P') # proof type
class SuchThat(Generic[A, P]):
def __init__(self, a: A):
self.a = a
```

User-defined generics are implemented as above by inheriting from
`Generic`

. The parameters given to this parent (in this case, the
type variables `A`

and `P`

) correspond to the parameters of our type.
For example, for the type `SuchThat[int, Positive]`

, `int`

is the
data type and `Positive`

is the proof type. In other words, a term of
type `SuchThat[int, Positive]`

is an `int`

such that it is positive.
Note that we preserve our notion of “phantom-ness” as the
implementation (term-level code) of `SuchThat`

makes no mention of
`P`

.

Here’s the `SortedList`

example, using our new `SuchThat`

construct:

```
# As in the last post, we'll never need a value of type Sorted, so we
# don't implement a way to create Sorted values. We only need to
# reason about the type (as a proof type for SuchThat).
class Sorted:
def __init__(self):
raise NotImplementedError('This class should have no instances')
def ensure_sorted(x: List[Any]) -> SuchThat[List[Any], Sorted]:
return SuchThat(sorted(x))
def merge_sorted(
x: SuchThat[List[Any], Sorted],
y: SuchThat[List[Any], Sorted]
) -> SuchThat[List[Any], Sorted]:
# ...
```

It’s wordier than it was in Haskell, but it’s a pretty direct
translation (much more direct than I thought it would be when I first
started). Everything in Python is necessarily going to be more
flexible, and when it comes to type-level programming, a little more
ad-hoc. There’s also 1) run-time overhead to use these more
expressive types (for instance, having to access `.a`

to operate on
the underlying data of a `SuchThat`

term) and 2) extra verbosity in
source code for extracting values.

It’s up to library authors to apply these patterns in a way that hits the desired level of safety without becoming so cumbersome that users end up circumventing them. There’s still a lot of work to be done, 1) on the Python side to make the type-level constructs more workable, 2) on the Haskell side to make the data science ecosystem more complete and mainstream, and 3) on the theory side to make the more rigorous family of formal methods more accessible to workaday programmers.

Until then, I believe these approaches comprise a compelling step in the right direction for both data science and software engineering as a whole. They work with no need to adopt radically new technology, and fit into paradigms and languages we’re already using. With the right support, this technology could become the way we do data science, and in doing so, dramatically improve the safety of the innumerable real-world systems which rely on it.

Next post:
Domain-Level Errors

Previous post:
GDP for Data Science