General DGP constructor function to generate X and y data.

A general constructor for data-generating processes (DGPs) that generates X and y data for any supervised learning DGP, given user-supplied functions for simulating X, y, and the error term.

Usage

xy_dgp_constructor(
  n,
  X_fun,
  y_fun,
  err_fun = NULL,
  add_err = TRUE,
  data_split = FALSE,
  train_prop = 0.5,
  return_values = c("X", "y", "support"),
  ...
)

Arguments

n: Number of samples.
X_fun: Function to generate X data. Must take an argument n or .n which determines the number of observations to generate.
y_fun: Function to generate y data. Must take an argument X which accepts the result from X_fun.
err_fun: Function to generate error/noise data. Default NULL adds no error to the output of y_fun().
add_err: Logical. If TRUE (default), the result of err_fun() is added to the result of y_fun() to obtain the simulated response vector. If FALSE, err_fun(X, y_fun(X, ...), ...) is returned as the simulated response vector (see Details). Note that add_err = TRUE will return an error for categorical responses y.
data_split: Logical; if TRUE, splits data into training and test sets according to train_prop.
train_prop: Proportion of data in training set if data_split = TRUE.
return_values: Character vector indicating what objects to return in list. Elements in vector must be one of "X", "y", "support".
...: Additional arguments to pass to functions that generate X, y, and err. If the argument doesn't exist in one of the functions it is ignored. If two or more of the functions have an argument of the same name but with different values, then use one of the following prefixes in front of the argument name (passed via ...) to differentiate it: .X_, .y_, or .err_.
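
The prefix rule for ... is easiest to see when two of the supplied functions share an argument name. The sketch below is illustrative only: X_gauss and y_rowsum are hypothetical helpers defined here, not functions from the package, and the call is guarded so it runs only when xy_dgp_constructor() is available in the session. Both X_gauss and rnorm accept sd, so the values are passed as .X_sd and .err_sd.

```r
set.seed(1)

# Hypothetical X_fun: takes `n` as required, plus its own `sd` argument.
X_gauss <- function(n, p = 3, sd = 1) {
  matrix(rnorm(n * p, sd = sd), nrow = n, ncol = p)
}

# Hypothetical y_fun: takes `X` as required.
y_rowsum <- function(X) rowSums(X)

# Assumes the package providing xy_dgp_constructor() is already attached.
if (exists("xy_dgp_constructor")) {
  sim <- xy_dgp_constructor(
    n = 50,
    X_fun = X_gauss,
    y_fun = y_rowsum,
    err_fun = rnorm,
    .X_sd = 2,      # routed to the `sd` argument of X_gauss only
    .err_sd = 0.1,  # routed to the `sd` argument of rnorm only
    return_values = c("X", "y")
  )
}
```

Passing a bare sd = ... here would not distinguish the two functions, which is the situation the prefixes are meant to resolve.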

Value

A list of the named objects that were requested in return_values. See brief descriptions below.

X: A data.frame.
y: A response vector of length nrow(X).
support: A vector of feature indices indicating all features used in the true support of the DGP.

Note that if data_split = TRUE and "X", "y" are in return_values, then the returned list also contains slots for "Xtest" and "ytest".

Details

If add_err = TRUE, data is generated from the following additive model: $$y = \mathrm{y\_fun}(X, \ldots) + \mathrm{err\_fun}(X, \mathrm{y\_fun}(X, \ldots), \ldots), \quad \text{where } X = \mathrm{X\_fun}(\ldots).$$

If add_err = FALSE, data is generated via: $$y = \mathrm{err\_fun}(X, \mathrm{y\_fun}(X, \ldots), \ldots), \quad \text{where } X = \mathrm{X\_fun}(\ldots).$$

Note that err_fun() may depend on X, on y, on both, or on neither.
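
The two generation modes can be sketched in base R without calling the constructor. This is a conceptual illustration rather than the package's implementation: sim_X, sim_y, noise_fun, and label_fun are hypothetical stand-ins for X_fun, y_fun, and two choices of err_fun.

```r
set.seed(1)

# Hypothetical stand-ins for X_fun and y_fun.
sim_X <- function(n, p = 2) matrix(rnorm(n * p), nrow = n, ncol = p)
sim_y <- function(X) as.vector(X %*% c(1, -1))

# Hypothetical err_fun for the additive case: returns a noise vector.
noise_fun <- function(X, y, sd = 0.5) rnorm(length(y), sd = sd)

# Hypothetical err_fun for the non-additive case: returns the response itself,
# here Bernoulli draws with success probability plogis(y).
label_fun <- function(X, y) rbinom(length(y), size = 1, prob = plogis(y))

n <- 20
X <- sim_X(n)
signal <- sim_y(X)

# add_err = TRUE: y = y_fun(X) + err_fun(X, y_fun(X))
y_additive <- signal + noise_fun(X, signal)

# add_err = FALSE: y = err_fun(X, y_fun(X))
y_binary <- label_fun(X, signal)
```

The second mode is the natural fit for categorical responses, since the response is produced by err_fun() directly instead of by adding noise to a numeric signal.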

Examples

# generate X = 100 x 10 standard Gaussian, y = linear regression model
sim_data <- xy_dgp_constructor(X_fun = MASS::mvrnorm,
                               y_fun = generate_y_linear,
                               err_fun = rnorm, data_split = TRUE,
                               # shared dgp arguments
                               n = 100,
                               # arguments specifically for X_fun
                               .X_mu = rep(0, 10), .X_Sigma = diag(10),
                               # arguments specifically for y_fun
                               .y_betas = rnorm(10), .y_return_support = TRUE,
                               # arguments specifically for err_fun
                               .err_sd = 1)
# Alternatively, since the argument names of X_fun, y_fun, and err_fun are
# all distinct (apart from the shared `n`), the prefixes can be omitted:
sim_data <- xy_dgp_constructor(X_fun = MASS::mvrnorm,
                               y_fun = generate_y_linear,
                               err_fun = rnorm, data_split = TRUE,
                               # shared dgp arguments
                               n = 100,
                               # arguments specifically for X_fun
                               mu = rep(0, 10), Sigma = diag(10),
                               # arguments specifically for y_fun
                               betas = rnorm(10), return_support = TRUE,
                               # arguments specifically for err_fun
                               sd = 1)