Evaluate and/or summarize ROC or PR curves for feature selection.
Source: R/evaluator-lib-feature-selection.R
eval_feature_selection_curve_funs.Rd
Evaluate the ROC or PR curves corresponding to the selected features, given the true feature support and the estimated feature importances. eval_feature_selection_curve() evaluates the ROC or PR curve for each experimental replicate separately. summarize_feature_selection_curve() summarizes the ROC or PR curve across experimental replicates.
Usage
eval_feature_selection_curve(
  fit_results,
  vary_params = NULL,
  nested_cols = NULL,
  truth_col,
  imp_col,
  group_cols = NULL,
  curve = c("ROC", "PR"),
  na_rm = FALSE
)

summarize_feature_selection_curve(
  fit_results,
  vary_params = NULL,
  nested_cols = NULL,
  truth_col,
  imp_col,
  group_cols = NULL,
  curve = c("ROC", "PR"),
  na_rm = FALSE,
  x_grid = seq(0, 1, by = 0.01),
  summary_funs = c("mean", "median", "min", "max", "sd", "raw"),
  custom_summary_funs = NULL,
  eval_id = ifelse(curve == "PR", "precision", "TPR")
)
Arguments
- fit_results
A tibble, as returned by fit_experiment().
- vary_params
A vector of DGP or Method parameter names that are varied in the Experiment.
- nested_cols
(Optional) A character string or vector specifying the name of the column(s) in fit_results that need to be unnested before evaluating results. Default is NULL, meaning no columns in fit_results need to be unnested prior to computation.
- truth_col
A character string identifying the column in fit_results with the true feature support data. Each element in this column should be an array of length p, where p is the number of features. Elements in this array should be binary with TRUE or 1 meaning the feature (corresponding to that slot) is in the support and FALSE or 0 meaning the feature is not in the support.
- imp_col
A character string identifying the column in fit_results with the estimated feature importance data. Each element in this column should be an array of length p, where p is the number of features and the feature order aligns with that of truth_col. Elements in this array should be numeric where a higher magnitude indicates a more important feature.
- group_cols
(Optional) A character string or vector specifying the column(s) to group rows by before evaluating metrics. This is useful for assessing within-group metrics.
- curve
Either "ROC" or "PR" indicating whether to evaluate the ROC or Precision-Recall curve.
- na_rm
A logical value indicating whether NA values should be stripped before the computation proceeds.
- x_grid
Vector of values between 0 and 1 at which to evaluate the ROC or PR curve. If curve = "ROC", the provided values are the FPR values at which to evaluate the TPR, and if curve = "PR", the values are the recall values at which to evaluate the precision.
- summary_funs
Character vector specifying how to summarize evaluation metrics. Must choose from a built-in library of summary functions - elements of the vector must be one of "mean", "median", "min", "max", "sd", "raw".
- custom_summary_funs
Named list of custom functions to summarize results. Names in the list should correspond to the name of the summary function. Values in the list should be a function that takes in one argument, that being the values of the evaluated metrics.
- eval_id
Character string. ID to be used as a suffix when naming result columns. As shown in Usage, the default is "precision" when curve = "PR" and "TPR" otherwise. Setting it to NULL does not add any ID to the column names.
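To make the roles of truth_col and imp_col concrete, here is a minimal base R sketch of how ROC points can be derived from one replicate's truth vector and importance scores. The helper roc_points() is hypothetical and not part of the package. It assumes features are ranked by the score exactly as given, with a feature counted as selected when its score meets the threshold. How the package itself handles negative scores or ties is not stated on this page.

```r
# Hypothetical helper (not part of the package): compute ROC points by
# thresholding the importance scores at each unique observed value.
roc_points <- function(truth, importance) {
  truth <- as.logical(truth)
  # Candidate thresholds, from strictest to most lenient
  thresholds <- sort(unique(importance), decreasing = TRUE)
  # A feature is "selected" at threshold t when its score is >= t
  tpr <- vapply(
    thresholds,
    function(t) sum(importance >= t & truth) / sum(truth),
    numeric(1)
  )
  fpr <- vapply(
    thresholds,
    function(t) sum(importance >= t & !truth) / sum(!truth),
    numeric(1)
  )
  # Prepend the (0, 0) point, where nothing is selected
  data.frame(
    threshold = c(Inf, thresholds),
    FPR = c(0, fpr),
    TPR = c(0, tpr)
  )
}

# Illustrative inputs (made up): features 1 and 3 are in the true support
truth <- c(TRUE, FALSE, TRUE, FALSE)
importance <- c(10, 0.5, 3, 1.2)
roc <- roc_points(truth, importance)
print(roc)
```

In this toy case both true features outrank the false ones, so the TPR reaches 1 while the FPR is still 0.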
Value
The output of eval_feature_selection_curve() is a tibble with the following columns:
- .rep
Replicate ID.
- .dgp_name
Name of DGP.
- .method_name
Name of Method.
- curve_estimate
A list of tibbles with x and y coordinate values for the ROC/PR curve for the given experimental replicate. If curve = "ROC", the tibble has the columns .threshold, FPR, and TPR for the threshold, false positive rate, and true positive rate, respectively. If curve = "PR", the tibble has the columns .threshold, recall, and precision.
as well as any columns specified by group_cols and vary_params.
The output of summarize_feature_selection_curve() is a grouped tibble containing both identifying information and the feature selection curve results aggregated over experimental replicates. Specifically, the identifier columns include .dgp_name, .method_name, and any columns specified by group_cols and vary_params. In addition, there are results columns corresponding to the requested statistics in summary_funs and custom_summary_funs. If curve = "ROC", these results columns include FPR and others that end in the suffix "_TPR". If curve = "PR", the results columns include recall and others that end in the suffix "_precision".
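The aggregation over replicates can be pictured with a small base R sketch: each replicate's curve is evaluated on a shared x_grid, and the summary functions are then applied at every grid point. This is only an illustration under stated assumptions. It uses linear interpolation via stats::approx() and keeps the largest TPR at tied FPR values, and the package's actual interpolation rule is not stated on this page. The replicate curves, the custom function range_width, and the object names are all made up for the example.

```r
# Two hypothetical replicate ROC curves (values made up for illustration)
rep1 <- data.frame(FPR = c(0, 0, 0.5, 1), TPR = c(0, 1, 1, 1))
rep2 <- data.frame(FPR = c(0, 0.5, 0.5, 1), TPR = c(0, 0, 1, 1))

# Shared grid of FPR values, playing the role of x_grid
x_grid <- seq(0, 1, by = 0.25)

# Interpolate TPR onto the grid; at tied FPR values keep the largest TPR
# (an assumption of this sketch, not a documented package behavior)
interp_tpr <- function(curve, grid) {
  stats::approx(curve$FPR, curve$TPR, xout = grid, ties = max)$y
}
tpr_mat <- cbind(interp_tpr(rep1, x_grid), interp_tpr(rep2, x_grid))

# A custom summary function in the shape custom_summary_funs describes:
# a named list whose functions take the metric values as their one argument
custom_funs <- list(range_width = function(x) diff(range(x)))

# One row per grid point, with summary columns ending in "_TPR"
roc_summary_sketch <- data.frame(
  FPR = x_grid,
  mean_TPR = rowMeans(tpr_mat),
  range_width_TPR = apply(tpr_mat, 1, custom_funs$range_width)
)
print(roc_summary_sketch)
```

Evaluating every replicate on the same grid is what makes pointwise statistics such as the mean or standard deviation well defined, since raw curves generally have breakpoints at different x values.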
See also
Other feature_selection_funs: eval_feature_importance_funs, eval_feature_selection_err_funs, plot_feature_importance(), plot_feature_selection_curve(), plot_feature_selection_err()
Examples
# generate example fit_results data for a feature selection problem
fit_results <- tibble::tibble(
  .rep = rep(1:2, times = 2),
  .dgp_name = c("DGP1", "DGP1", "DGP2", "DGP2"),
  .method_name = c("Method"),
  feature_info = lapply(
    1:4,
    FUN = function(i) {
      tibble::tibble(
        # feature names
        feature = c("featureA", "featureB", "featureC"),
        # true feature support
        true_support = c(TRUE, FALSE, TRUE),
        # estimated feature importance scores
        est_importance = c(10, runif(2, min = -2, max = 2))
      )
    }
  )
)

# evaluate feature selection ROC/PR curves for each replicate
roc_results <- eval_feature_selection_curve(
  fit_results,
  curve = "ROC",
  nested_cols = "feature_info",
  truth_col = "true_support",
  imp_col = "est_importance"
)
pr_results <- eval_feature_selection_curve(
  fit_results,
  curve = "PR",
  nested_cols = "feature_info",
  truth_col = "true_support",
  imp_col = "est_importance"
)

# summarize feature selection ROC/PR curves across replicates
roc_summary <- summarize_feature_selection_curve(
  fit_results,
  curve = "ROC",
  nested_cols = "feature_info",
  truth_col = "true_support",
  imp_col = "est_importance"
)
pr_summary <- summarize_feature_selection_curve(
  fit_results,
  curve = "PR",
  nested_cols = "feature_info",
  truth_col = "true_support",
  imp_col = "est_importance"
)