The objective of this simulation experiment is to provide a basic example of how to use simChef
and showcase the automated R Markdown-generated documentation. For the sake of illustration, this toy simulation experiment studies the performance of linear regression versus random forests under a linear data-generating process across varying noise levels.
[Typically, the objective of the simulation experiment (and this blurb) will be more scientific than instructive and will warrant additional context/background and domain knowledge.]
We simulate the covariate/design matrix \(\mathbf{X} \in \mathbb{R}^{n \times p}\) (later split into training and test sets) from a standard normal distribution and the response vector \(\mathbf{y} \in \mathbb{R}^n\) from a linear model. Specifically,
\[\begin{align*}\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon},\end{align*}\]
where \[\begin{align*}& \mathbf{X}_{ij} \stackrel{iid}{\sim} N\left(0, 1\right) \text{ for all } i = 1, \ldots, n \text{ and } j = 1, \ldots, p, \\& \boldsymbol{\epsilon}_i \stackrel{iid}{\sim} N(0, \sigma^2) \text{ for all } i = 1, \ldots, n.\end{align*}\]
Default Parameters in DGP
[In practice, documentation of DGPs should answer the questions “what” and “why”. That is, “what” is the DGP, and “why” are we using/studying it? As this simulation experiment is a contrived example, we omit the “why” here.]
Function
#> function(n_train, n_test, p, beta, noise_sd) {
#> n <- n_train + n_test
#> X <- matrix(rnorm(n * p), nrow = n, ncol = p)
#> y <- X %*% beta + rnorm(n, sd = noise_sd)
#> data_list <- list(
#> X_train = X[1:n_train, , drop = FALSE],
#> y_train = y[1:n_train],
#> X_test = X[(n_train + 1):n, , drop = FALSE],
#> y_test = y[(n_train + 1):n]
#> )
#> return(data_list)
#> }
Input Parameters
#> $n_train
#> [1] 200
#>
#> $n_test
#> [1] 200
#>
#> $p
#> [1] 2
#>
#> $beta
#> [1] 1 0
#>
#> $noise_sd
#> [1] 1
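As a quick usage sketch (outside of the simChef pipeline), the DGP function above can be called directly with the default parameters listed here. The name `linear_dgp_fun` is hypothetical and used only for illustration; the function body mirrors the one printed above.

```r
# Hypothetical name for the DGP function printed above
linear_dgp_fun <- function(n_train, n_test, p, beta, noise_sd) {
  n <- n_train + n_test
  X <- matrix(rnorm(n * p), nrow = n, ncol = p)
  y <- X %*% beta + rnorm(n, sd = noise_sd)
  list(
    X_train = X[1:n_train, , drop = FALSE],
    y_train = y[1:n_train],
    X_test = X[(n_train + 1):n, , drop = FALSE],
    y_test = y[(n_train + 1):n]
  )
}

set.seed(331)
# Simulate one dataset under the default input parameters shown above
data_list <- linear_dgp_fun(
  n_train = 200, n_test = 200, p = 2, beta = c(1, 0), noise_sd = 1
)
str(data_list)  # list with X_train, y_train, X_test, y_test
```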
Given some training covariate data \(\mathbf{X}\) and response vector \(\mathbf{y}\), we fit a linear regression model (i.e., ordinary least squares) on \(\mathbf{X}\) to predict \(\mathbf{y}\) by minimizing the following objective:
\[\begin{align*}\boldsymbol{\hat{\beta}} = \text{argmin}_{\boldsymbol{\beta}} || \mathbf{y} - \mathbf{X} \boldsymbol{\beta} ||_2^2.\end{align*}\]
Then, to make predictions given some test data \(\mathbf{X}^{\text{test}}\), we compute:
\[\begin{align*}\boldsymbol{\hat{y}}^{\text{test}} = \mathbf{X}^{\text{test}} \boldsymbol{\hat{\beta}}.\end{align*}\]
[In practice, documentation of methods should answer the questions “what” and “why”. That is, “what” is the method, and “why” are we using/studying it? As this simulation experiment is a contrived example, we omit the “why” here.]
Function
#> function(X_train, y_train, X_test, y_test) {
#> train_df <- dplyr::bind_cols(data.frame(X_train), y = y_train)
#> fit <- lm(y ~ ., data = train_df)
#> predictions <- predict(fit, data.frame(X_test))
#> return(list(predictions = predictions, y_test = y_test))
#> }
Input Parameters
#> list()
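As a usage sketch, the method function above (here given the hypothetical name `linear_reg_method`) can be applied to the single dataset `data_list` produced in the DGP sketch earlier:

```r
# Hypothetical name for the linear regression method printed above
linear_reg_method <- function(X_train, y_train, X_test, y_test) {
  train_df <- dplyr::bind_cols(data.frame(X_train), y = y_train)
  fit <- lm(y ~ ., data = train_df)
  predictions <- predict(fit, data.frame(X_test))
  return(list(predictions = predictions, y_test = y_test))
}

# `data_list` comes from the DGP usage sketch; its names match the arguments
lm_out <- do.call(linear_reg_method, data_list)
sqrt(mean((lm_out$y_test - lm_out$predictions)^2))  # test-set RMSE
```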
Given some training covariate data \(\mathbf{X}\) and response vector \(\mathbf{y}\), we fit a random forest on \(\mathbf{X}\) to predict \(\mathbf{y}\). At a high level, a random forest is an ensemble of classification or regression trees (CART) that are fitted independently of one another on bootstrapped samples of the training data. Further, each CART model is fitted by performing recursive, axis-aligned binary splits, where the optimal split at each node is chosen from a random subsample of features to maximize the decrease in an impurity criterion.
To make predictions, CART identifies the unique leaf node containing the test data point and predicts the mean response (of training data) in that node. For a random forest, the predictions are averaged across all CARTs in the forest.
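In symbols (using the standard formulation, not anything specific to this experiment), a forest of \(B\) trees predicts at a point \(\mathbf{x}\) by averaging the individual tree predictions:
\[\begin{align*}\hat{y}^{\text{RF}}(\mathbf{x}) = \frac{1}{B} \sum_{b=1}^{B} \hat{y}^{(b)}(\mathbf{x}),\end{align*}\]
where \(\hat{y}^{(b)}(\mathbf{x})\) denotes the prediction of the \(b^{\text{th}}\) tree.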
For further details, we refer to Breiman (2001).
[In practice, documentation of methods should answer the questions “what” and “why”. That is, “what” is the method, and “why” are we using/studying it? As this simulation experiment is a contrived example, we omit the “why” here.]
Function
#> function(X_train, y_train, X_test, y_test, ...) {
#> train_df <- dplyr::bind_cols(data.frame(X_train), y = y_train)
#> fit <- ranger::ranger(y ~ ., data = train_df, ...)
#> predictions <- predict(fit, data.frame(X_test))$predictions
#> return(list(predictions = predictions, y_test = y_test))
#> }
Input Parameters
#> $num.threads
#> [1] 1
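Analogously to the linear regression sketch, the random forest method (hypothetical name `rf_method`) can be applied to the same simulated dataset, forwarding the `num.threads = 1` input parameter listed above to ranger:

```r
# Hypothetical name for the random forest method printed above
rf_method <- function(X_train, y_train, X_test, y_test, ...) {
  train_df <- dplyr::bind_cols(data.frame(X_train), y = y_train)
  fit <- ranger::ranger(y ~ ., data = train_df, ...)
  predictions <- predict(fit, data.frame(X_test))$predictions
  return(list(predictions = predictions, y_test = y_test))
}

# `data_list` comes from the DGP usage sketch; extra arguments are passed to ranger
rf_out <- do.call(rf_method, c(data_list, list(num.threads = 1)))
sqrt(mean((rf_out$y_test - rf_out$predictions)^2))  # test-set RMSE
```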
Given the true responses \(\mathbf{y} \in \mathbb{R}^n\) and predicted responses \(\mathbf{\hat{y}} \in \mathbb{R}^n\) from various methods, we evaluate several prediction accuracy metrics, namely: the root mean squared error (RMSE), the mean absolute error (MAE), and the coefficient of determination (\(R^2\)).
We choose to evaluate both RMSE and MAE as these can convey different messages in the presence of outliers. Further, \(R^2\) provides a convenient normalization of RMSE that can often be more easily interpreted.
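For reference, writing \(y_i\) for the true and \(\hat{y}_i\) for the predicted test responses, \(i = 1, \ldots, n\), the first two metrics are
\[\begin{align*}\text{RMSE} &= \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}, \\ \text{MAE} &= \frac{1}{n} \sum_{i=1}^{n} \left|y_i - \hat{y}_i\right|,\end{align*}\]
and \(R^2\) is the coefficient of determination between the true and predicted responses.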
[In practice, documentation of evaluation metrics should answer the questions “what” and “why”. That is, “what” is the metric, and “why” are we using/studying it?]
Function
#> function (fit_results, vary_params = NULL, nested_cols = NULL,
#> truth_col, estimate_col, prob_cols = NULL, group_cols = NULL,
#> metrics = NULL, na_rm = FALSE, summary_funs = c("mean", "median",
#> "min", "max", "sd", "raw"), custom_summary_funs = NULL,
#> eval_id = "pred_err")
#> {
#> group_vars <- c(".dgp_name", ".method_name", vary_params,
#> group_cols, ".metric")
#> eval_tib <- eval_pred_err(fit_results = fit_results, vary_params = vary_params,
#> nested_cols = nested_cols, truth_col = truth_col, estimate_col = estimate_col,
#> prob_cols = prob_cols, group_cols = group_cols, metrics = metrics,
#> na_rm = na_rm) %>% dplyr::group_by(dplyr::across(tidyselect::any_of(group_vars)))
#> eval_summary <- eval_summarizer(eval_data = eval_tib, eval_id = eval_id,
#> value_col = ".estimate", summary_funs = summary_funs,
#> custom_summary_funs = custom_summary_funs, na_rm = na_rm)
#> return(eval_summary)
#> }
#> <environment: namespace:simChef>
Input Parameters
#> $truth_col
#> [1] "y_test"
#>
#> $estimate_col
#> [1] "predictions"
We plot the prediction accuracy between the true and predicted responses, as measured via RMSE, MAE, and \(R^2\), to understand how characteristics of the DGP affect various prediction methods.
[In practice, documentation of the plotters should answer the questions “what” and “why”. That is, “what” is the plot, and “why” are we using/studying it?]
Function
#> function (fit_results = NULL, eval_results = NULL, eval_name = NULL,
#> eval_fun = "summarize_pred_err", eval_fun_options = NULL,
#> vary_params = NULL, metrics = NULL, show = c("point", "line"),
#> ...)
#> {
#> .metric <- NULL
#> arg_list <- get_dot_args(user_args = rlang::list2(...), default_args = list(eval_id = "pred_err",
#> facet_formula = ~.metric, facet_type = "wrap", facet_args = list(scales = "free")))
#> plot_data <- get_plot_data(fit_results = fit_results, eval_results = eval_results,
#> eval_name = eval_name, eval_fun = eval_fun, eval_fun_options = c(eval_fun_options,
#> list(metrics = metrics)))
#> if (!is.null(metrics)) {
#> if (!inherits(metrics, "metric_set")) {
#> abort("Unknown metrics. metrics must be of class 'yardstick::metric_set' or NULL.")
#> }
#> metric_names <- names(attr(metrics, "metrics"))
#> plot_data <- plot_data %>% dplyr::filter(.metric %in%
#> metric_names)
#> }
#> plt <- do.call(plot_eval_constructor, args = c(list(plot_data = plot_data,
#> vary_params = vary_params, show = show), arg_list))
#> return(plt)
#> }
#> <environment: namespace:simChef>
Input Parameters
#> $eval_name
#> [1] "Prediction Accuracy"
Parameter Values
#> $dgp
#> $dgp$`Linear DGP`
#> $dgp$`Linear DGP`$noise_sd
#> [1] 0.1 0.5 1.0 2.0
#>
#>
#>
#> $method
#> list()
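For completeness, below is a hedged sketch of how an experiment like this one might be assembled and run with simChef. The component function names (`linear_dgp_fun`, `linear_reg_method`, `rf_method`) are the hypothetical names from the earlier sketches, the method names and replicate count are illustrative, and the `create_*()`/`add_*()`/`run_experiment()` calls and the `summarize_pred_err()`/`plot_pred_err()` defaults reflect my understanding of the simChef API rather than the exact code behind this document.

```r
library(simChef)
library(dplyr)  # provides the %>% pipe

# Wrap the functions from the earlier sketches into simChef components
linear_dgp <- create_dgp(
  .dgp_fun = linear_dgp_fun, .name = "Linear DGP",
  n_train = 200, n_test = 200, p = 2, beta = c(1, 0), noise_sd = 1
)
linear_reg <- create_method(.method_fun = linear_reg_method, .name = "Linear Regression")
rf <- create_method(.method_fun = rf_method, .name = "RF", num.threads = 1)
pred_err <- create_evaluator(
  .eval_fun = summarize_pred_err, .name = "Prediction Accuracy",
  truth_col = "y_test", estimate_col = "predictions"
)
pred_err_plot <- create_visualizer(
  .viz_fun = plot_pred_err, .name = "Prediction Accuracy Plot",
  eval_name = "Prediction Accuracy"
)

# Assemble the experiment and vary the noise level in the linear DGP
experiment <- create_experiment(name = "Linear Regression vs RF") %>%
  add_dgp(linear_dgp) %>%
  add_method(linear_reg) %>%
  add_method(rf) %>%
  add_evaluator(pred_err) %>%
  add_visualizer(pred_err_plot) %>%
  add_vary_across(.dgp = "Linear DGP", noise_sd = c(0.1, 0.5, 1, 2))

# Run replicates (count is illustrative) and save results for the generated report
results <- run_experiment(experiment, n_reps = 100, save = TRUE)
```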