General DGP constructor function to generate X and y data.
xy_dgp_constructor.Rd
A general DGP constructor function that generates X and y data for any supervised learning DGP, provided the functions for simulating X, y, and the additive error term.
Usage
xy_dgp_constructor(
  n,
  X_fun,
  y_fun,
  err_fun = NULL,
  add_err = TRUE,
  data_split = FALSE,
  train_prop = 0.5,
  return_values = c("X", "y", "support"),
  ...
)
Arguments
- n
Number of samples.
- X_fun
Function to generate X data. Must take an argument n or .n which determines the number of observations to generate.
- y_fun
Function to generate y data. Must take an argument X which accepts the result from X_fun.
- err_fun
Function to generate error/noise data. Default NULL adds no error to the output of y_fun().
- add_err
Logical. If TRUE (default), add result of err_fun() to result of y_fun() to obtain the simulated response vector. If FALSE, return err_fun(y_fun(...), ...) as the simulated response vector. Note that add_err = TRUE will return an error for categorical responses y.
- data_split
Logical; if TRUE, splits data into training and test sets according to train_prop.
- train_prop
Proportion of data in training set if data_split = TRUE.
- return_values
Character vector indicating what objects to return in list. Elements in vector must be one of "X", "y", "support".
- ...
Additional arguments to pass to functions that generate X, y, and err. If the argument doesn't exist in one of the functions it is ignored. If two or more of the functions have an argument of the same name but with different values, then use one of the following prefixes in front of the argument name (passed via ...) to differentiate it: .X_, .y_, or .err_.
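The prefix rule above can be illustrated with a small base-R sketch. The helper route_args() below is hypothetical and is not part of the package; it only shows one plausible way that unprefixed and prefixed arguments passed via ... could be matched to a single function, with the prefixed value taking priority. How the package resolves these arguments internally may differ.

```r
# Hypothetical helper (an assumption for illustration, not the package's code):
# select the entries of `dots` that apply to `fun`.
route_args <- function(fun, prefix, dots) {
  fun_args <- names(formals(fun))
  # Unprefixed arguments whose names match a formal argument of `fun`;
  # anything else (e.g. `mu` for rnorm) is ignored
  shared <- dots[names(dots) %in% fun_args]
  # Prefixed arguments, e.g. ".err_sd" becomes "sd"
  is_prefixed <- startsWith(names(dots), prefix)
  specific <- dots[is_prefixed]
  names(specific) <- substring(names(specific), nchar(prefix) + 1)
  # Prefixed values override unprefixed values of the same name
  utils::modifyList(shared, specific)
}

# `sd` is given twice: once unprefixed and once specifically for the error function
dots <- list(n = 5, sd = 10, .err_sd = 0.1, mu = rep(0, 2))
err_args <- route_args(rnorm, ".err_", dots)
noise <- do.call(rnorm, err_args)  # equivalent to rnorm(n = 5, sd = 0.1)
```

In this sketch, mu is dropped because rnorm() has no such argument, and the .err_ prefix decides which value of sd reaches the error function.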
Value
A list of the named objects that were requested in return_values. See brief descriptions below.
- X
A data.frame.
- y
A response vector of length nrow(X).
- support
A vector of feature indices indicating all features used in the true support of the DGP.
Note that if data_split = TRUE and "X", "y" are in return_values, then the returned list also contains slots for "Xtest" and "ytest".
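To make the shape of the returned list under data_split = TRUE concrete, here is a minimal base-R sketch. It assumes that rows are assigned to the training set by simple random sampling and that the training set holds round(train_prop * n) rows; the package's actual splitting rule is not stated on this page, so treat both as assumptions.

```r
set.seed(1)
n <- 10
train_prop <- 0.5

# Toy data standing in for the output of X_fun and y_fun
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y <- X$x1 - X$x2

# Assumption: training rows are chosen by simple random sampling
train_ids <- sample(seq_len(n), size = round(train_prop * n))

# Same slot names as described above: X, y, Xtest, ytest
out <- list(
  X = X[train_ids, , drop = FALSE],
  y = y[train_ids],
  Xtest = X[-train_ids, , drop = FALSE],
  ytest = y[-train_ids]
)
str(out, max.level = 1)
```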
Details
If add_err = TRUE, data is generated from the following additive model:
$$y = \mathrm{y\_fun}(X, \ldots) + \mathrm{err\_fun}(X, \mathrm{y\_fun}(X, \ldots), \ldots), \quad \text{where } X = \mathrm{X\_fun}(\ldots).$$
If add_err = FALSE, data is generated via:
$$y = \mathrm{err\_fun}(X, \mathrm{y\_fun}(X, \ldots), \ldots), \quad \text{where } X = \mathrm{X\_fun}(\ldots).$$
Note that while err_fun() is allowed to depend on both X and y, it is not necessary that err_fun() depend on X or y.
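The two generating schemes can be mimicked in base R without the package. In the sketch below, X_fun, y_fun, gaussian_err, and bernoulli_err are hypothetical stand-ins written for illustration, and the (X, y) signature of the error functions is an assumption based on the formulas above. The second case shows why add_err = FALSE is useful for categorical responses: the "error" function transforms the output of y_fun() rather than being added to it.

```r
set.seed(42)
n <- 200

# Hypothetical stand-ins for X_fun and y_fun
X_fun <- function(n) data.frame(x1 = rnorm(n), x2 = rnorm(n))
y_fun <- function(X) 1 + 2 * X$x1 - X$x2

X <- X_fun(n)

# Analogue of add_err = TRUE: y = y_fun(X) + err_fun(X, y_fun(X))
# This error function ignores the values of X and y, which is allowed
gaussian_err <- function(X, y, sd = 1) rnorm(nrow(X), sd = sd)
y_additive <- y_fun(X) + gaussian_err(X, y_fun(X), sd = 0.5)

# Analogue of add_err = FALSE: y = err_fun(X, y_fun(X))
# Here the linear predictor is mapped to probabilities and then to 0/1 draws
bernoulli_err <- function(X, y) rbinom(length(y), size = 1, prob = plogis(y))
y_binary <- bernoulli_err(X, y_fun(X))

table(y_binary)
```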
Examples
# generate X = 100 x 10 standard Gaussian, y = linear regression model
sim_data <- xy_dgp_constructor(X_fun = MASS::mvrnorm,
                               y_fun = generate_y_linear,
                               err_fun = rnorm, data_split = TRUE,
                               # shared dgp arguments
                               n = 100,
                               # arguments specifically for X_fun
                               .X_mu = rep(0, 10), .X_Sigma = diag(10),
                               # arguments specifically for y_fun
                               .y_betas = rnorm(10), .y_return_support = TRUE,
                               # arguments specifically for err_fun
                               .err_sd = 1)
# or alternatively (since the arguments of X_fun, y_fun, and err_fun are
# unique, with the exception of `n`)
sim_data <- xy_dgp_constructor(X_fun = MASS::mvrnorm,
                               y_fun = generate_y_linear,
                               err_fun = rnorm, data_split = TRUE,
                               # shared dgp arguments
                               n = 100,
                               # arguments specifically for X_fun
                               mu = rep(0, 10), Sigma = diag(10),
                               # arguments specifically for y_fun
                               betas = rnorm(10), return_support = TRUE,
                               # arguments specifically for err_fun
                               sd = 1)