Generate independent Gaussian covariates and linear response data.
linear_gaussian_dgp.Rd
Generate independent normally-distributed covariates (including potentially omitted variables) and linear response data with a specified error distribution.
Usage
linear_gaussian_dgp(
n,
p_obs = 0,
p_unobs = 0,
s_obs = p_obs,
s_unobs = p_unobs,
betas = NULL,
betas_unobs = NULL,
intercept = 0,
err = NULL,
data_split = FALSE,
train_prop = 0.5,
return_values = c("X", "y", "support"),
...
)
Arguments
- n
Number of samples.
- p_obs
Number of observed features.
- p_unobs
Number of unobserved (omitted) features.
- s_obs
Sparsity level of observed features. Coefficients corresponding to features after the s_obs position (i.e., positions i = s_obs + 1, ..., p_obs) are set to 0.
- s_unobs
Sparsity level of unobserved (omitted) features. Coefficients corresponding to features after the s_unobs position (i.e., positions i = s_unobs + 1, ..., p_unobs) are set to 0.
- betas
Coefficient vector for observed design matrix. If a scalar is provided, the coefficient vector is constant. If NULL (default), entries in the coefficient vector are drawn iid from N(0, betas_sd^2). Can also be a function that generates the coefficient vector; see generate_coef().
- betas_unobs
Coefficient vector for unobserved design matrix. If a scalar is provided, the coefficient vector is constant. If NULL (default), entries in the coefficient vector are drawn iid from N(0, betas_unobs_sd^2). Can also be a function that generates the coefficient vector; see generate_coef().
- intercept
Scalar intercept term.
- err
Function from which to generate simulated error vector. Default is NULL, which adds no error to the DGP.
- data_split
Logical; if TRUE, splits data into training and test sets according to train_prop.
- train_prop
Proportion of data in training set if data_split = TRUE.
- return_values
Character vector indicating what objects to return in list. Elements in vector must be one of "X", "y", "support".
- ...
Additional arguments to pass to functions that generate X, U, y, betas, betas_unobs, and err. If the argument doesn't exist in one of the functions, it is ignored. If two or more of the functions have an argument of the same name but with different values, then use one of the following prefixes in front of the argument name (passed via ...) to differentiate it: .X_, .U_, .y_, .betas_, .betas_unobs_, or .err_. For additional details, see generate_X_gaussian(), generate_y_linear(), generate_coef(), and generate_errors().
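The interaction between the sparsity arguments and the coefficient arguments can be sketched in base R. This is only an illustration of the rules described above, not the package's implementation: make_sparse_coef is a hypothetical helper, and the handling of a full-length coefficient vector when s < p is an assumption.

```r
# Hypothetical helper (not part of the package): build a length-p coefficient
# vector whose entries after position s are set to 0.
make_sparse_coef <- function(p, s = p, betas = NULL, betas_sd = 1) {
  if (is.null(betas)) {
    # NULL: draw the first s coefficients iid from N(0, betas_sd^2)
    coef <- rnorm(s, mean = 0, sd = betas_sd)
  } else if (length(betas) == 1) {
    # scalar: constant coefficient vector
    coef <- rep(betas, s)
  } else {
    # full vector: keep the first s entries (assumed behavior)
    coef <- betas[seq_len(s)]
  }
  # positions s + 1, ..., p are zero
  c(coef, rep(0, p - s))
}

set.seed(1)
b_const <- make_sparse_coef(p = 5, s = 2, betas = 3)  # constant on the support
b_rand <- make_sparse_coef(p = 5, s = 2)              # random on the support
```

In both cases only the first two entries can be nonzero, which matches the role of s_obs and s_unobs described in the arguments.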
Value
A list of the named objects that were requested in return_values. See brief descriptions below.
- X
A data.frame.
- y
A response vector of length nrow(X).
- support
A vector of feature indices indicating all features used in the true support of the DGP.
Note that if data_split = TRUE and "X", "y" are in return_values, then the returned list also contains slots for "Xtest" and "ytest".
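To make the shape of the split output concrete, here is a minimal base-R sketch. It assumes a uniformly random split of rows, which the documentation above does not confirm; split_data is a hypothetical helper and not a function from the package.

```r
# Hypothetical helper: random train/test split of a design matrix and response.
# The random row assignment is an assumption made for illustration.
split_data <- function(X, y, train_prop = 0.5) {
  n <- nrow(X)
  train_idx <- sample.int(n, size = round(train_prop * n))
  test_idx <- setdiff(seq_len(n), train_idx)
  list(
    X = X[train_idx, , drop = FALSE],
    y = y[train_idx],
    Xtest = X[test_idx, , drop = FALSE],
    ytest = y[test_idx]
  )
}

set.seed(1)
X <- as.data.frame(matrix(rnorm(40), nrow = 10))
y <- rnorm(10)
out <- split_data(X, y, train_prop = 0.7)
names(out)
```

The returned list carries the same four slot names mentioned in the note: "X", "y", "Xtest", and "ytest".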
Details
Data is generated via: $$y = intercept + betas \%*\% X + betas\_unobs \%*\% U + err(...),$$ where X, U are standard Gaussian random matrices and the true underlying support of this data is the first s_obs and s_unobs features in X and U, respectively.
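The generating model can be written out in a few lines of base R. The following is a sketch under stated assumptions rather than the package's code: simulate_linear_gaussian is a hypothetical name, coefficients are passed in as explicit vectors, and the support is taken to be the nonzero observed coefficients.

```r
# Hypothetical base-R sketch of the model in Details; not the package's code.
simulate_linear_gaussian <- function(n, p_obs, betas, p_unobs = 0,
                                     betas_unobs = NULL, intercept = 0,
                                     err = NULL, ...) {
  # observed covariates: iid standard Gaussian entries
  X <- matrix(rnorm(n * p_obs), nrow = n, ncol = p_obs)
  y <- intercept + drop(X %*% betas)
  if (p_unobs > 0) {
    # unobserved (omitted) covariates contribute to y but are not returned
    U <- matrix(rnorm(n * p_unobs), nrow = n, ncol = p_unobs)
    y <- y + drop(U %*% betas_unobs)
  }
  if (!is.null(err)) {
    # err is a function such as rnorm or rt; extra arguments go through ...
    y <- y + err(n, ...)
  }
  # assumption: support = indices of nonzero observed coefficients
  list(X = as.data.frame(X), y = y, support = which(betas != 0))
}

set.seed(1)
sim <- simulate_linear_gaussian(
  n = 50, p_obs = 4, betas = c(1, -2, 0, 0),
  p_unobs = 2, betas_unobs = c(-1, 0),
  err = rnorm, sd = 0.5
)
```

Because U is generated but never returned, its contribution behaves like omitted-variable noise from the analyst's point of view, which is the purpose of the p_unobs and betas_unobs arguments.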
Examples
# generate data from: y = betas_1 * x_1 + betas_2 * x_2 + N(0, 0.5^2), where
# betas_1, betas_2 ~ N(0, 1) and X ~ N(0, I_10)
sim_data <- linear_gaussian_dgp(n = 100, p_obs = 10, s_obs = 2, betas_sd = 1,
                                err = rnorm, sd = .5)
# generate data from: y = betas %*% X - u_1 + t(df = 1), where
# betas ~ N(0, 0.5^2), betas_unobs = [-1, 0], X ~ N(0, I_10), U ~ N(0, I_2)
sim_data <- linear_gaussian_dgp(n = 100, p_obs = 10, p_unobs = 2,
                                betas_sd = .5, betas_unobs = c(-1, 0),
                                err = rt, df = 1)