
Evaluate and/or summarize ROC or PR curves for feature selection.
Source: R/evaluator-lib-feature-selection.R
eval_feature_selection_curve_funs.Rd

Evaluate the ROC or PR curves corresponding to the selected
features, given the true feature support and the estimated feature
importances. eval_feature_selection_curve() evaluates the ROC or PR
curve for each experimental replicate separately.
summarize_feature_selection_curve() summarizes the ROC or PR curve
across experimental replicates.
Usage
eval_feature_selection_curve(
  fit_results,
  vary_params = NULL,
  nested_cols = NULL,
  truth_col,
  imp_col,
  group_cols = NULL,
  curve = c("ROC", "PR"),
  na_rm = FALSE
)

summarize_feature_selection_curve(
  fit_results,
  vary_params = NULL,
  nested_cols = NULL,
  truth_col,
  imp_col,
  group_cols = NULL,
  curve = c("ROC", "PR"),
  na_rm = FALSE,
  x_grid = seq(0, 1, by = 0.01),
  summary_funs = c("mean", "median", "min", "max", "sd", "raw"),
  custom_summary_funs = NULL,
  eval_id = ifelse(curve == "PR", "precision", "TPR")
)

Arguments
- fit_results
A tibble, as returned by fit_experiment().
- vary_params
A vector of DGP or Method parameter names that are varied in the Experiment.
- nested_cols
(Optional) A character string or vector specifying the name of the column(s) in fit_results that need to be unnested before evaluating results. Default is NULL, meaning no columns in fit_results need to be unnested prior to computation.
- truth_col
A character string identifying the column in fit_results with the true feature support data. Each element in this column should be an array of length p, where p is the number of features. Elements in this array should be binary, with TRUE or 1 meaning the feature (corresponding to that slot) is in the support and FALSE or 0 meaning the feature is not in the support.
- imp_col
A character string identifying the column in fit_results with the estimated feature importance data. Each element in this column should be an array of length p, where p is the number of features and the feature order aligns with that of truth_col. Elements in this array should be numeric, where a higher magnitude indicates a more important feature.
- group_cols
(Optional) A character string or vector specifying the column(s) to group rows by before evaluating metrics. This is useful for assessing within-group metrics.
- curve
Either "ROC" or "PR", indicating whether to evaluate the ROC or Precision-Recall curve.
- na_rm
A logical value indicating whether NA values should be stripped before the computation proceeds.
- x_grid
Vector of values between 0 and 1 at which to evaluate the ROC or PR curve. If curve = "ROC", the provided values are the FPR values at which to evaluate the TPR, and if curve = "PR", the values are the recall values at which to evaluate the precision.
- summary_funs
Character vector specifying how to summarize evaluation metrics. Each element must be one of the built-in summary functions: "mean", "median", "min", "max", "sd", "raw".
- custom_summary_funs
Named list of custom functions to summarize results. Names in the list should correspond to the name of the summary function. Each value in the list should be a function that takes one argument, namely the values of the evaluated metrics.
- eval_id
Character string. ID to be used as a suffix when naming result columns. As shown under Usage, the default is "precision" if curve = "PR" and "TPR" otherwise. NULL does not add any ID to the column names.
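To make the expected shape of custom_summary_funs concrete, here is a minimal base R sketch of a named list of one-argument functions. The function names (q25, q75, range_width) and the metric values are hypothetical, chosen only for illustration, and the sketch applies the functions by hand rather than showing how the package applies them internally.

```r
# Hypothetical custom summaries: each function takes a single argument,
# the vector of evaluated metric values (e.g., TPR across replicates).
custom_summary_funs <- list(
  q25 = function(x) unname(stats::quantile(x, probs = 0.25)),
  q75 = function(x) unname(stats::quantile(x, probs = 0.75)),
  range_width = function(x) max(x) - min(x)
)

# Made-up metric values standing in for four experimental replicates
tpr_values <- c(0.5, 1, 0.5, 1)

# Apply each function; the list names label the resulting summaries
custom_summaries <- vapply(
  custom_summary_funs,
  function(f) f(tpr_values),
  numeric(1)
)
```

A list of this form could then be passed as the custom_summary_funs argument of summarize_feature_selection_curve(), alongside or instead of the built-in summary_funs.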
Value
The output of eval_feature_selection_curve() is a tibble with
the following columns:
- .rep
Replicate ID.
- .dgp_name
Name of DGP.
- .method_name
Name of Method.
- curve_estimate
A list of tibbles with x and y coordinate values for the ROC/PR curve for the given experimental replicate. If curve = "ROC", the tibble has the columns .threshold, FPR, and TPR for the threshold, false positive rate, and true positive rate, respectively. If curve = "PR", the tibble has the columns .threshold, recall, and precision.

as well as any columns specified by group_cols and vary_params.
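For intuition about what one curve_estimate entry holds when curve = "ROC", the following base R sketch computes threshold, FPR, and TPR by hand for a single toy replicate. It is only an illustration under stated assumptions: the inputs are made up, features are ranked by their raw scores with a "greater than or equal to the threshold" selection rule, and the package's own thresholding and tie-handling conventions may differ.

```r
# Toy inputs for p = 4 features (made up for illustration)
truth <- c(TRUE, FALSE, TRUE, FALSE)
importance <- c(10, 1.5, -0.3, -2)

# Sweep thresholds from high to low; a feature counts as selected when
# its importance is at least the threshold (an assumed convention)
thresholds <- sort(unique(importance), decreasing = TRUE)
roc_points <- do.call(rbind, lapply(thresholds, function(th) {
  selected <- importance >= th
  data.frame(
    .threshold = th,
    FPR = sum(selected & !truth) / sum(!truth),  # false positive rate
    TPR = sum(selected & truth) / sum(truth)     # true positive rate
  )
}))
```

Each row of roc_points is one point on the ROC curve. The PR analogue would report recall (the same quantity as TPR) and precision, that is, the fraction of selected features that are truly in the support.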
The output of summarize_feature_selection_curve() is a grouped
tibble containing both identifying information and the
feature selection curve results aggregated over experimental replicates.
Specifically, the identifier columns include .dgp_name,
.method_name, and any columns specified by group_cols and
vary_params. In addition, there are results columns corresponding to
the requested statistics in summary_funs and
custom_summary_funs. If curve = "ROC", these results columns
include FPR and others that end in the suffix "_TPR". If
curve = "PR", the results columns include recall and others
that end in the suffix "_precision".
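The role of x_grid in the summary can be sketched in base R as follows: each replicate's curve is evaluated at a shared grid of FPR values and then summarized pointwise across replicates. The per-replicate curves are made up, linear interpolation via stats::approx() is an assumption rather than a statement of how the package interpolates, and the column names mean_TPR and sd_TPR are chosen to follow the "_TPR" suffix pattern described above.

```r
# Coarse FPR grid for illustration (the documented default uses by = 0.01)
x_grid <- seq(0, 1, by = 0.25)

# Made-up ROC points for two experimental replicates
rep_curves <- list(
  data.frame(FPR = c(0, 0.5, 1), TPR = c(0.5, 1, 1)),
  data.frame(FPR = c(0, 0.5, 1), TPR = c(0, 0.5, 1))
)

# Interpolate each replicate's TPR at the shared grid (assumed: linear);
# the result is a matrix with one row per grid point, one column per replicate
tpr_on_grid <- sapply(rep_curves, function(cv) {
  stats::approx(x = cv$FPR, y = cv$TPR, xout = x_grid)$y
})

# Summarize pointwise across replicates
roc_summary_sketch <- data.frame(
  FPR = x_grid,
  mean_TPR = rowMeans(tpr_on_grid),
  sd_TPR = apply(tpr_on_grid, 1, stats::sd)
)
```

Evaluating all replicates on a common grid is what makes pointwise summaries such as the mean or standard deviation well defined, since the raw per-replicate curves generally have different threshold-induced x values.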
See also
Other feature_selection_funs:
eval_feature_importance_funs,
eval_feature_selection_err_funs,
plot_feature_importance(),
plot_feature_selection_curve(),
plot_feature_selection_err()
Examples
# generate example fit_results data for a feature selection problem
fit_results <- tibble::tibble(
  .rep = rep(1:2, times = 2),
  .dgp_name = c("DGP1", "DGP1", "DGP2", "DGP2"),
  .method_name = c("Method"),
  feature_info = lapply(
    1:4,
    FUN = function(i) {
      tibble::tibble(
        # feature names
        feature = c("featureA", "featureB", "featureC"),
        # true feature support
        true_support = c(TRUE, FALSE, TRUE),
        # estimated feature importance scores
        est_importance = c(10, runif(2, min = -2, max = 2))
      )
    }
  )
)

# evaluate feature selection ROC/PR curves for each replicate
roc_results <- eval_feature_selection_curve(
  fit_results,
  curve = "ROC",
  nested_cols = "feature_info",
  truth_col = "true_support",
  imp_col = "est_importance"
)
pr_results <- eval_feature_selection_curve(
  fit_results,
  curve = "PR",
  nested_cols = "feature_info",
  truth_col = "true_support",
  imp_col = "est_importance"
)

# summarize feature selection ROC/PR curves across replicates
roc_summary <- summarize_feature_selection_curve(
  fit_results,
  curve = "ROC",
  nested_cols = "feature_info",
  truth_col = "true_support",
  imp_col = "est_importance"
)
pr_summary <- summarize_feature_selection_curve(
  fit_results,
  curve = "PR",
  nested_cols = "feature_info",
  truth_col = "true_support",
  imp_col = "est_importance"
)