vflow logo

A library for making stability analysis simple. Easily evaluate the effect of judgment calls on your data-science pipeline (e.g. choice of imputation strategy)!

MIT license, Python 3.9+, tests, JOSS, PyPI version

How vflow works internally

VeridicalFlow provides three main abstractions: Vset, Vfunc, and Subkey. Vfuncs run arbitrary computations on inputs. A Vset collects related Vfunc instances (which wrap user-provided functions or classes) and determines which Vfunc to apply to which inputs and how to do so. Vset outputs are dictionaries with tuple keys composed of one or more Subkey instances that help associate outputs with the Vfunc that produced them and the Vset in which that Vfunc was collected. Below we describe these three abstractions in more detail.

Vset

Fundamentally, all Vset objects generate permutations by applying their functions to the Cartesian product of input parameters. Internally, Vset functions and inputs are wrapped in dictionaries with tuple keys. These tuples contain Subkey objects, which identify the functions wrapped by the Vfunc (the Subkey "value"), as well as its originating Vset (the Subkey "origin"). The following is a simple example of what a dictionary created by vflow might look like after fitting a random forest classifier from scikit-learn to some data:

{ (X, y, RF): RandomForestClassifier(max_depth=5, n_estimators=50) }

Each of X, y, and RF is an instance of Subkey whose printed representation is the Subkey value; its origin is the name of the Vset that created it. The values may be user-supplied at Vset initialization; if they are not, they are generated in the format {vset.name}_{index}, e.g. "modeling_0", "modeling_1", etc. Since inputs must be dictionaries with this key format, all raw inputs (e.g., numpy arrays) must be initialized using helpers.init_args.
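The key structure described above can be sketched in plain Python. This is only an illustration, not vflow's implementation: ToySubkey and default_names are hypothetical names, and the "init" origin for raw inputs is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True, repr=False)
class ToySubkey:
    """Hypothetical stand-in for a Subkey: a value plus the name of its origin."""

    value: str
    origin: str

    def __repr__(self):
        # Print only the value, as in the dictionary shown above.
        return self.value


def default_names(vset_name, n_funcs):
    """Fallback names in the {vset.name}_{index} format."""
    return [f"{vset_name}_{i}" for i in range(n_funcs)]


# A tuple key: two input subkeys followed by the subkey of the function used.
key = (
    ToySubkey("X", "init"),  # "init" origin is an assumption
    ToySubkey("y", "init"),
    ToySubkey("RF", "modeling"),
)

# The value is a placeholder string standing in for a fitted model.
fitted = {key: "RandomForestClassifier(max_depth=5, n_estimators=50)"}

print(fitted)
print(default_names("modeling", 2))
```

Making the toy class frozen keeps it hashable, which is what allows tuples of subkeys to serve as dictionary keys.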

When a Vset method such as fit or predict is called with multiple input arguments, the inputs are first combined left to right by Cartesian product; the right Subkey tuple is always filtered on any matching items in the left Subkey tuple, and the two keys' remaining Subkey instances are concatenated. The values of the resulting combined dictionary are tuples containing the values of the inputs concatenated in the order in which they were passed in.
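A minimal sketch of this left-to-right combination, assuming keys are plain tuples of strings and values are already wrapped in one-element tuples (both simplifications; vflow itself works with Subkey instances). The helper names combine_two and combine_all are hypothetical.

```python
from functools import reduce
from itertools import product


def combine_two(left, right):
    """Cartesian product of two tuple-keyed dictionaries.

    Items of the right key that already appear in the left key are dropped,
    the remaining items are appended to the left key, and the value tuples
    are concatenated in argument order.
    """
    combined = {}
    for (lkey, lval), (rkey, rval) in product(left.items(), right.items()):
        new_key = lkey + tuple(item for item in rkey if item not in lkey)
        combined[new_key] = lval + rval
    return combined


def combine_all(*dicts):
    """Fold any number of input dictionaries together, two at a time."""
    return reduce(combine_two, dicts)


X = {
    ("X_train", "mean_impute"): ("X_mean",),
    ("X_train", "med_impute"): ("X_med",),
}
y = {("y_train",): ("y",)}

combined = combine_all(X, y)
for key, value in combined.items():
    print(key, "->", value)
```

With two entries in X and one in y, the combined dictionary has two entries, each holding an (X, y) pair of values.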

Next, the Vset computes the Cartesian product of the combined input dictionary with Vset.vfuncs or Vset.fitted_vfuncs (depending on the Vset method that was called), combining Subkey tuples in a similar process to determine which Vfuncs to apply to which inputs.
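The second product, functions against combined inputs, can be sketched the same way. The function apply_funcs and the toy functions below are hypothetical, and subkey matching is ignored here for brevity.

```python
from itertools import product


def apply_funcs(funcs, inputs):
    """Apply every function to every input entry.

    Both arguments are tuple-keyed dictionaries. Each output key is the
    input key followed by the function key, mirroring keys like (X, y, RF).
    """
    outputs = {}
    for (fkey, func), (ikey, args) in product(funcs.items(), inputs.items()):
        outputs[ikey + fkey] = func(*args)
    return outputs


funcs = {
    ("double",): lambda x: 2 * x,
    ("square",): lambda x: x ** 2,
}
inputs = {("a",): (3,), ("b",): (4,)}

results = apply_funcs(funcs, inputs)
for key, value in results.items():
    print(key, "->", value)
```

Two functions applied to two inputs give four outputs, one per (input, function) pair.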

output_matching

An important Vset initialization parameter is the boolean output_matching. By default, this parameter is False, but it should be set to True when the Vset is used multiple times and its outputs need to be combined regardless of when it was called. A typical example is a data cleaning Vset that is applied first to training data before model fitting and later to test data for model evaluation.

To demonstrate, if a data imputation Vset with values "mean_impute" and "med_impute" was used at an earlier step in the pipeline with output_matching=False, then the following bad matches may occur at the testing stage:

Input dictionaries:


# dictionary of fitted models
{ (X_train, mean_impute, y_train, RF): RF_fit_on_mean_imputed_train_data,
  (X_train, med_impute, y_train, RF): RF_fit_on_med_imputed_train_data }

# dictionary of imputed testing data
{ (X_test, mean_impute, y_test): mean_imputed_test_data,
  (X_test, med_impute, y_test): med_imputed_test_data }

Output dictionary:

{ (X_train, mean_impute, y_train, X_test, mean_impute, RF): # good match
    RF_fit_on_mean_imputed_train_data(mean_imputed_test_data),
  (X_train, mean_impute, y_train, X_test, med_impute, RF): # bad match!
    RF_fit_on_mean_imputed_train_data(med_imputed_test_data),
  (X_train, med_impute, y_train, X_test, med_impute, RF): # good match
    RF_fit_on_med_imputed_train_data(med_imputed_test_data),
  (X_train, med_impute, y_train, X_test, mean_impute, RF): # bad match!
    RF_fit_on_med_imputed_train_data(mean_imputed_test_data) }

Internally, when output_matching=True, new Subkey instances added to the output dictionary keys by the Vset will have an output_matching attribute with value True. This attribute is used to reject Cartesian product combinations when the Subkey origins do not match, or when they match but their values differ. See below for more info on Subkey matching.
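To make the imputation example concrete, here is a small sketch of the "same origin, different value" rejection. It is a simplified model, not vflow's code: ToySubkey, mismatch, combine, and make_inputs are hypothetical names, and only that one rejection rule is modeled.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ToySubkey:
    """Hypothetical stand-in for a Subkey with an output_matching flag."""

    value: str
    origin: str
    output_matching: bool = False


def mismatch(left_key, right_key):
    """True if two matching subkeys share an origin but differ in value."""
    return any(
        a.output_matching
        and b.output_matching
        and a.origin == b.origin
        and a.value != b.value
        for a, b in product(left_key, right_key)
    )


def combine(left, right):
    """Cartesian product of two dictionaries that skips mismatched pairs."""
    out = {}
    for (lkey, lval), (rkey, rval) in product(left.items(), right.items()):
        if mismatch(lkey, rkey):
            continue
        key = lkey + tuple(s for s in rkey if s not in lkey)
        out[key] = (lval, rval)
    return out


def make_inputs(matching):
    """Build fitted-model and test-data dictionaries for the example."""
    mean = ToySubkey("mean_impute", "impute", matching)
    med = ToySubkey("med_impute", "impute", matching)
    rf = ToySubkey("RF", "modeling")
    x_train = ToySubkey("X_train", "init")
    x_test = ToySubkey("X_test", "init")
    models = {(x_train, mean, rf): "rf_mean", (x_train, med, rf): "rf_med"}
    test_data = {(x_test, mean): "test_mean", (x_test, med): "test_med"}
    return models, test_data


unmatched = combine(*make_inputs(matching=False))
matched = combine(*make_inputs(matching=True))
print(len(unmatched), len(matched))
```

In this toy model, the non-matching run keeps all four pairings, including the two bad ones, while the matching run keeps only the two pairings whose imputation strategies agree.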

In contrast, use the default output_matching=False when:

Asynchronous and lazy computation

Two Vset initialization parameters, is_async and lazy, control how functions in the Vset are computed.

For more details on is_async and lazy, see below.

Vfunc

When a Vset is initialized, the items in the modules arg are wrapped in Vfunc objects, which are like named functions that can optionally support a fit or transform method. Initializing the Vset with is_async=True and lazy=True has the following effects:

Subkey

Subkey instances identify the user's functions in a Vset and help to correctly match inputs to other inputs and functions.

Behavior when Subkey has output_matching=False (default)

By default, Subkey objects are non-matching, meaning that vflow won't bother to look for a matching Subkey when combining data or deciding which Vfunc to apply to which inputs. As described above, Vset dictionaries are combined from left to right two-at-a-time, and if every Subkey in the keys of both dictionaries is non-matching then the result is the full Cartesian product of the two dictionaries.

Behavior when Subkey has output_matching=True

If one of the entries in a given dictionary's Subkey tuple is matching then vflow tries to find a match in the tuple keys of the other dictionary. The first dictionary's value is combined with the other dictionary's value in two cases:

There is one main way for a Subkey to be matching: it was created by a Vset that was initialized with output_matching=True, as described above.