
A library for making stability analysis simple. Easily evaluate the effect of judgment calls on your data-science pipeline (e.g., the choice of imputation strategy)!

How `vflow` works internally

VeridicalFlow provides three main abstractions: Vset, Vfunc, and Subkey. Vfuncs run arbitrary computations on inputs. A Vset collects related Vfunc instances (which wrap user-provided functions or classes) and determines which Vfunc to apply to which inputs and how to do so. Vset outputs are dictionaries with tuple keys composed of one or more Subkey instances that help associate outputs with the Vfunc that produced them and the Vset in which that Vfunc was collected. Below we describe these three abstractions in more detail.

`Vset`

Fundamentally, all Vset objects generate permutations by applying their functions to the Cartesian product of input parameters. Internally, Vset functions and inputs are wrapped in dictionaries with tuple keys. Each tuple contains Subkey objects, which identify the function wrapped by a Vfunc (the Subkey "value") and the Vset that the Vfunc came from (the Subkey "origin"). The following is a simple example of what a dictionary created by vflow might look like after fitting a random forest classifier from scikit-learn to some data:

{ (X, y, RF): RandomForestClassifier(max_depth=5, n_estimators=50) }

Each of X, y, and RF is an instance of Subkey whose printed representation is the Subkey value and whose origin is the name of the Vset that created it. The values may be supplied by the user at Vset initialization; otherwise they are generated in the format {vset.name}_{index}, e.g. "modeling_0", "modeling_1", etc. Since inputs must be dictionaries in this format, all raw inputs (e.g., numpy arrays) must be initialized using helpers.init_args.
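To make the key structure concrete, here is a minimal sketch of a Subkey-like object. ToySubkey and default_values are hypothetical names invented for this illustration, and the "init" origin label for raw inputs is an assumption; none of this is vflow's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True, repr=False)
class ToySubkey:
    """Simplified stand-in for a Subkey (hypothetical, not the vflow class)."""
    value: str                     # identifies the wrapped function or input
    origin: str                    # name of the Vset (or init step) that created it
    output_matching: bool = False  # non-matching by default

    def __repr__(self):
        # Print only the value, mirroring the dictionary shown above
        return self.value


def default_values(vset_name, n):
    """Fallback values in the '{vset.name}_{index}' format."""
    return [f"{vset_name}_{i}" for i in range(n)]


# "init" is an assumed origin label for raw inputs
X = ToySubkey("X", origin="init")
y = ToySubkey("y", origin="init")
RF = ToySubkey("RF", origin="modeling")

# A tuple of subkeys is hashable, so it can serve as a dictionary key
fitted = {(X, y, RF): "RandomForestClassifier(max_depth=5, n_estimators=50)"}
print(fitted)
print(default_values("modeling", 2))
```

Freezing the dataclass makes instances hashable, which is what allows tuples of them to be used as dictionary keys.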

When a Vset method such as fit or predict is called with multiple input arguments, the inputs are first combined left to right by Cartesian product; the right Subkey tuple is always filtered on any matching items in the left Subkey tuple, and the two keys' remaining Subkey instances are concatenated. The values of the resulting combined dictionary are tuples containing the values of the inputs concatenated in the order in which they were passed in.

Next, the Vset computes the Cartesian product of the combined input dictionary with Vset.vfuncs or Vset.fitted_vfuncs (depending on the Vset method that was called), combining Subkey tuples in a similar process to determine which Vfuncs to apply to which inputs.
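The two-stage combination described above can be sketched roughly as follows. This is a simplified model under stated assumptions: subkeys are plain strings, "matching items" means equal strings, and combine_keys and combine_dicts are hypothetical helper names rather than vflow functions.

```python
from itertools import product


def combine_keys(left_key, right_key):
    """Drop items of right_key already present in left_key, then concatenate."""
    remaining = tuple(sk for sk in right_key if sk not in left_key)
    return left_key + remaining


def combine_dicts(left, right):
    """Cartesian product of two tuple-keyed dicts; values are concatenated tuples."""
    out = {}
    for (lk, lv), (rk, rv) in product(left.items(), right.items()):
        out[combine_keys(lk, rk)] = lv + rv
    return out


# Stage 1: combine the input dictionaries left to right.
# Values are 1-tuples so that they concatenate in argument order.
X_dict = {("X_train",): ("X_train_array",)}
y_dict = {("y_train",): ("y_train_array",)}
inputs = combine_dicts(X_dict, y_dict)

# Stage 2: pair every combined input with every function in the Vset.
# Function names are strings here to keep the sketch self-contained.
vfuncs = {("RF",): "fit_random_forest", ("LR",): "fit_logistic_regression"}
plan = {
    combine_keys(in_key, fn_key): (fn, args)
    for (in_key, args), (fn_key, fn) in product(inputs.items(), vfuncs.items())
}

for key, (fn, args) in plan.items():
    print(key, "->", fn, args)
```

With one entry per input dictionary and two functions, the plan contains two entries, one per function, each keyed by the input subkeys followed by the function's subkey.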

`output_matching`

An important Vset initialization parameter is the boolean output_matching. By default, this parameter is False, but it should be set to True when the Vset is used multiple times and its outputs need to be combined regardless of when it was called, as, for example, with a data cleaning Vset that is used first on training data before model fitting and later on test data for model evaluation.

To demonstrate, if a data imputation Vset with values "mean_impute" and "med_impute" was used at an earlier step in the pipeline with output_matching=False, then the following bad matches may occur at the testing stage:

Input dictionaries:


# dictionary of fitted models
{ (X_train, mean_impute, y_train, RF): RF_fit_on_mean_imputed_train_data,
  (X_train, med_impute, y_train, RF): RF_fit_on_med_imputed_train_data }

# dictionary of imputed testing data
{ (X_test, mean_impute, y_test): mean_imputed_test_data,
  (X_test, med_impute, y_test): med_imputed_test_data }

Output dictionary:

{ (X_train, mean_impute, y_train, X_test, mean_impute, RF): # good match
    RF_fit_on_mean_imputed_train_data(mean_imputed_test_data),
  (X_train, mean_impute, y_train, X_test, med_impute, RF): # bad match!
    RF_fit_on_mean_imputed_train_data(med_imputed_test_data),
  (X_train, med_impute, y_train, X_test, med_impute, RF): # good match
    RF_fit_on_med_imputed_train_data(med_imputed_test_data),
  (X_train, med_impute, y_train, X_test, mean_impute, RF): # bad match!
    RF_fit_on_med_imputed_train_data(mean_imputed_test_data) }

Internally, when output_matching=True, new Subkey instances added to the output dictionary keys by the Vset have an output_matching attribute with value True. This attribute is used to reject Cartesian product combinations when the Subkey origins do not match, or when the origins match but the values differ. See below for more info on Subkey matching.
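The imputation example can be replayed with a small toy model to see how matching removes the bad pairs. The sketch below is not vflow code: Sk, compatible, predict_all, and build are hypothetical names, the key layout is simplified, and the rule in compatible is one literal reading of the acceptance cases listed in the Subkey section.

```python
from collections import namedtuple
from itertools import product

# Toy subkey: value, name of the originating Vset, and a matching flag
Sk = namedtuple("Sk", ["value", "origin", "matching"])


def compatible(key, other):
    """Accept unless a matching subkey in `key` has no identical matching
    counterpart in `other` (and `other` has matching subkeys at all)."""
    other_matching = [sk for sk in other if sk.matching]
    for sk in key:
        if sk.matching and other_matching and sk not in other_matching:
            return False
    return True


def predict_all(fitted, test):
    """Pair every fitted model with every test set that passes the check."""
    out = {}
    for (fk, model), (tk, data) in product(fitted.items(), test.items()):
        if compatible(fk, tk) and compatible(tk, fk):
            key = fk + tuple(sk for sk in tk if sk not in fk)
            out[key] = f"{model}({data})"
    return out


def build(matching):
    """Build the two input dictionaries, with or without output matching."""
    mean = Sk("mean_impute", "impute", matching)
    med = Sk("med_impute", "impute", matching)
    X_train, y_train, X_test, y_test = (
        Sk(v, "init", False) for v in ("X_train", "y_train", "X_test", "y_test")
    )
    RF = Sk("RF", "modeling", False)
    fitted = {(X_train, mean, y_train, RF): "rf_mean",
              (X_train, med, y_train, RF): "rf_med"}
    test = {(X_test, mean, y_test): "test_mean",
            (X_test, med, y_test): "test_med"}
    return fitted, test


unmatched = predict_all(*build(matching=False))  # full Cartesian product
matched = predict_all(*build(matching=True))     # bad matches rejected
print(sorted(unmatched.values()))
print(sorted(matched.values()))
```

With matching disabled all four model and test-set pairs survive; with matching enabled only the two pairs that share an imputation strategy remain.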

In contrast, use the default output_matching=False when:

Separate calls to the Vset result in entirely independent outputs, such as is usually the case for a Vset that does subsampling of the data. In this case, output_matching=True will result in bad matches, i.e., unnecessary matching on subsamples of the same or different datasets.
output_matching=True was already used earlier in the pipeline. For example, if you use various strategies to clean your data that must be matched at training and test time, you don't need to initialize a modeling Vset with output_matching=True, but should instead use output_matching=True when initializing the data cleaning Vset.

Asynchronous and lazy computation

There are two important Vset initialization parameters that control how functions in the Vset are computed:

is_async: when True, all functions are computed asynchronously using Ray. The resources used to distribute computation are determined by the user's call to ray.init() before applying the Vset to inputs. Default is False.
lazy: when True, functions are computed lazily, meaning that no computation occurs until their results are required downstream in the pipeline. Default is False.

For more details on is_async and lazy, see below.
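As a rough illustration of the asynchronous pattern, the sketch below submits every function first and collects the results afterwards. It deliberately uses the standard library's ThreadPoolExecutor as a stand-in rather than Ray, so it shows the general submit-then-collect idea only, not how vflow dispatches work. The imputation functions are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def mean_impute(xs):
    """Replace missing entries (None) with the mean of the known entries."""
    known = [x for x in xs if x is not None]
    fill = sum(known) / len(known)
    return [fill if x is None else x for x in xs]


def zero_impute(xs):
    """Replace missing entries (None) with zero."""
    return [0.0 if x is None else x for x in xs]


funcs = {"mean_impute": mean_impute, "zero_impute": zero_impute}
data = [1.0, None, 3.0]

# Submit all functions up front, then gather results once they are needed.
# (Stand-in for Ray: a thread pool, not a distributed cluster.)
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {name: pool.submit(fn, data) for name, fn in funcs.items()}
    results = {name: fut.result() for name, fut in futures.items()}

print(results)
```

The pool size here plays a role loosely analogous to the resources configured through ray.init(): it bounds how much work can run at once.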

`Vfunc`

When a Vset is initialized, the items in the modules arg are wrapped in Vfunc objects, which are like named functions that can optionally support a fit or transform method. Initializing the Vset with is_async=True or lazy=True has the following effects:

is_async=True: wraps user functions with AsyncVfunc objects, which compute function outputs asynchronously using Ray.
lazy=True: wraps the Vset output dictionary values with VfuncPromise objects, which are lazily evaluated. At the moment, VfuncPromise objects are only resolved if passed to a Vset with lazy=False downstream or if called manually by the user.
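The idea of lazy evaluation can be sketched with a small promise-like wrapper. ToyPromise is a hypothetical class written for this illustration and is not vflow's VfuncPromise; it only shows that no function runs until the promise is called, and that upstream promises are resolved on demand.

```python
class ToyPromise:
    """Hypothetical lazily evaluated result (not vflow's VfuncPromise)."""

    def __init__(self, fn, *args):
        self.fn = fn
        self.args = args
        self._done = False
        self._value = None

    def __call__(self):
        if not self._done:
            # Resolve any upstream promises before calling the function
            args = [a() if isinstance(a, ToyPromise) else a for a in self.args]
            self._value = self.fn(*args)
            self._done = True
        return self._value  # cached after the first call


calls = []  # records which functions have actually run


def double(x):
    calls.append("double")
    return 2 * x


def add_one(x):
    calls.append("add_one")
    return x + 1


# Building the chain runs nothing yet
promise = ToyPromise(add_one, ToyPromise(double, 5))
calls_before = list(calls)

# Calling the promise manually triggers the whole chain
result = promise()
print(calls_before, result, calls)
```

Caching the value means a second call returns the stored result without rerunning either function.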

`Subkey`

Subkey instances identify the user's functions in a Vset and help to correctly match inputs to other inputs and functions.

Behavior when `Subkey` has `output_matching=False` (default)

By default, Subkey objects are non-matching, meaning that vflow won't bother to look for a matching Subkey when combining data or deciding which Vfunc to apply to which inputs. As described above, Vset dictionaries are combined from left to right two-at-a-time, and if every Subkey in the keys of both dictionaries is non-matching then the result is the full Cartesian product of the two dictionaries.

Behavior when `Subkey` has `output_matching=True`

If one of the entries in a given dictionary's Subkey tuple is matching then vflow tries to find a match in the tuple keys of the other dictionary. The first dictionary's value is combined with the other dictionary's value in two cases:

The Subkey instances of the other dictionary's tuple key are all non-matching.
The other dictionary's key has a Subkey with output_matching=True that has the same origin and same value as the first dictionary's Subkey.

There is one main way for a Subkey to be matching: it was created by a Vset that was initialized with output_matching=True, as described above.
