Evaluate and/or summarize feature importance scores.

Evaluate the estimated feature importance scores against the true feature support. eval_feature_importance() evaluates the feature importances for each experimental replicate separately. summarize_feature_importance() summarizes the feature importances across experimental replicates.

Usage

eval_feature_importance(
  fit_results,
  vary_params = NULL,
  nested_cols = NULL,
  feature_col,
  imp_col,
  group_cols = NULL
)

summarize_feature_importance(
  fit_results,
  vary_params = NULL,
  nested_cols = NULL,
  feature_col,
  imp_col,
  group_cols = NULL,
  na_rm = FALSE,
  summary_funs = c("mean", "median", "min", "max", "sd", "raw"),
  custom_summary_funs = NULL,
  eval_id = "feature_importance"
)

Arguments

fit_results: A tibble, as returned by fit_experiment().
vary_params: A vector of the DGP or Method parameter names that are varied in the Experiment.
nested_cols: (Optional) A character string or vector specifying the name of the column(s) in fit_results that need to be unnested before evaluating results. Default is NULL, meaning no columns in fit_results need to be unnested prior to computation.
feature_col: A character string identifying the column in fit_results with the feature names or IDs.
imp_col: A character string identifying the column in fit_results with the estimated feature importance data. Each element in this column should be an array of length p, where p is the number of features and the feature order aligns with that of feature_col. Elements in this array should be numeric, where a higher magnitude indicates a more important feature.
group_cols: (Optional) A character string or vector specifying the column(s) to group rows by before evaluating metrics. This is useful for assessing within-group metrics.
na_rm: A logical value indicating whether NA values should be stripped before the computation proceeds.
summary_funs: Character vector specifying how to summarize evaluation metrics. Each element must be one of the built-in summary functions: "mean", "median", "min", "max", "sd", "raw".
custom_summary_funs: Named list of custom functions to summarize results. Each name in the list is the name of the summary function, and each value is a function that takes one argument: the values of the evaluated metrics.
eval_id: Character string. ID to be used as a suffix when naming result columns. Default is "feature_importance". If NULL, no ID is added to the column names.
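
As an illustration of the shape custom_summary_funs expects, the sketch below builds a named list of one-argument functions and applies each to a numeric vector in base R. The names q25 and range_width and the values in vals are hypothetical, and the sapply() call only mimics the idea of applying each summary to the metric values; it is not how the package itself dispatches them.

```r
# A named list of one-argument summary functions (names are hypothetical).
custom_summary_funs <- list(
  # 25th percentile of the values
  q25 = function(x) unname(stats::quantile(x, probs = 0.25)),
  # width of the observed range
  range_width = function(x) diff(range(x))
)

# Made-up importance values for a single feature across four replicates.
vals <- c(10, 0.5, -1.2, 3)

# Apply each custom summary to the values; the result is a named numeric vector.
custom_results <- sapply(custom_summary_funs, function(f) f(vals))
print(custom_results)
```

A list like this could then be passed as custom_summary_funs = custom_summary_funs to summarize_feature_importance(), alongside or instead of the built-in summary_funs.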

Value

The output of eval_feature_importance() is a tibble with the columns .rep, .dgp_name, and .method_name in addition to the columns specified by group_cols, vary_params, feature_col, and imp_col.

The output of summarize_feature_importance() is a grouped tibble containing both identifying information and the feature importance results aggregated over experimental replicates. Specifically, the identifier columns include .dgp_name, .method_name, any columns specified by group_cols and vary_params, and the column specified by feature_col. In addition, there are results columns corresponding to the requested statistics in summary_funs and custom_summary_funs. These columns end in the suffix specified by eval_id.
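
To make the aggregation described above concrete, the following is a rough sketch of a comparable summary computed by hand with dplyr and tidyr. It is not the package's implementation. The toy data use deterministic, made-up importance values, and the result column names (mean_feature_importance, sd_feature_importance) are an assumption based on the suffix rule stated above.

```r
library(dplyr)
library(tidyr)

# Toy fit_results with deterministic (made-up) importance scores.
fit_results <- tibble::tibble(
  .rep = rep(1:2, times = 2),
  .dgp_name = c("DGP1", "DGP1", "DGP2", "DGP2"),
  .method_name = "Method",
  feature_info = lapply(1:4, function(i) {
    tibble::tibble(
      feature = c("featureA", "featureB", "featureC"),
      est_importance = c(10, i, -i)
    )
  })
)

# Unnest the per-replicate feature tables, then aggregate over replicates
# within each (DGP, method, feature) combination.
manual_summary <- fit_results %>%
  unnest(cols = feature_info) %>%
  group_by(.dgp_name, .method_name, feature) %>%
  summarise(
    mean_feature_importance = mean(est_importance),
    sd_feature_importance = sd(est_importance),
    .groups = "drop"
  )

print(manual_summary)
```

Each row of manual_summary corresponds to one feature within one DGP and method, which mirrors the identifier columns listed above; the .rep column disappears because replicates have been aggregated.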

Examples

# generate example fit_results data for a feature selection problem
fit_results <- tibble::tibble(
  .rep = rep(1:2, times = 2),
  .dgp_name = c("DGP1", "DGP1", "DGP2", "DGP2"),
  .method_name = c("Method"),
  feature_info = lapply(
    1:4,
    FUN = function(i) {
      tibble::tibble(
        # feature names
        feature = c("featureA", "featureB", "featureC"),
        # estimated feature importance scores
        est_importance = c(10, runif(2, min = -2, max = 2))
      )
    }
  )
)

# evaluate feature importances (using all default metrics) for each replicate
eval_results <- eval_feature_importance(
  fit_results,
  nested_cols = "feature_info",
  feature_col = "feature",
  imp_col = "est_importance"
)
# summarize feature importances (using all default metrics) across replicates
eval_results_summary <- summarize_feature_importance(
  fit_results,
  nested_cols = "feature_info",
  feature_col = "feature",
  imp_col = "est_importance"
)
