Generate independent Gaussian covariates and linear response data.
Generate independent normally-distributed covariates (including potentially omitted variables) and linear response data with a specified error distribution.
Usage
linear_gaussian_dgp(
  n,
  p_obs = 0,
  p_unobs = 0,
  s_obs = p_obs,
  s_unobs = p_unobs,
  betas = NULL,
  betas_unobs = NULL,
  intercept = 0,
  err = NULL,
  data_split = FALSE,
  train_prop = 0.5,
  return_values = c("X", "y", "support"),
  ...
)
Arguments
- n
 Number of samples.
- p_obs
 Number of observed features.
- p_unobs
 Number of unobserved (omitted) features.
- s_obs
 Sparsity level of observed features. Coefficients corresponding to features after the s_obs position (i.e., positions i = s_obs + 1, ..., p_obs) are set to 0.
- s_unobs
 Sparsity level of unobserved (omitted) features. Coefficients corresponding to features after the s_unobs position (i.e., positions i = s_unobs + 1, ..., p_unobs) are set to 0.
- betas
 Coefficient vector for observed design matrix. If a scalar is provided, the coefficient vector is constant. If NULL (default), entries in the coefficient vector are drawn iid from N(0, betas_sd^2). Can also be a function that generates the coefficient vector; see generate_coef().
- betas_unobs
 Coefficient vector for unobserved design matrix. If a scalar is provided, the coefficient vector is constant. If NULL (default), entries in the coefficient vector are drawn iid from N(0, betas_unobs_sd^2). Can also be a function that generates the coefficient vector; see generate_coef().
- intercept
 Scalar intercept term.
- err
 Function from which to generate simulated error vector. Default is NULL, which adds no error to the DGP.
- data_split
 Logical; if TRUE, splits data into training and test sets according to train_prop.
- train_prop
 Proportion of data in training set if data_split = TRUE.
- return_values
 Character vector indicating what objects to return in list. Elements in vector must be one of "X", "y", "support".
- ...
 Additional arguments to pass to functions that generate X, U, y, betas, betas_unobs, and err. If the argument doesn't exist in one of the functions, it is ignored. If two or more of the functions have an argument of the same name but with different values, then use one of the following prefixes in front of the argument name (passed via ...) to differentiate it: .X_, .U_, .y_, .betas_, .betas_unobs_, or .err_. For additional details, see generate_X_gaussian(), generate_y_linear(), generate_coef(), and generate_errors().
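The rules given above for s_obs and betas (and likewise s_unobs and betas_unobs) can be sketched in base R. This is only an illustration under stated assumptions: make_coef() is a hypothetical helper, not a function from the package, and it assumes the zeroing of trailing positions also applies to a user-supplied vector. The function-valued form of betas is not covered.

```r
# Hypothetical helper (not part of the package): builds a length-p coefficient
# vector following the documented rules for `betas` and the sparsity level `s`.
make_coef <- function(p, s = p, betas = NULL, betas_sd = 1) {
  if (is.null(betas)) {
    coefs <- rnorm(p, mean = 0, sd = betas_sd)  # iid draws from N(0, betas_sd^2)
  } else if (length(betas) == 1) {
    coefs <- rep(betas, p)                      # scalar -> constant vector
  } else {
    stopifnot(length(betas) == p)
    coefs <- betas                              # full vector used as given
  }
  if (s < p) {
    # Assumption: positions s + 1, ..., p are zeroed in every case.
    coefs[(s + 1):p] <- 0
  }
  coefs
}

set.seed(1)
make_coef(p = 5, s = 2, betas = 3)  # returns 3 3 0 0 0
make_coef(p = 5, s = 2)             # two random entries followed by three zeros
```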
Value
A list of the named objects that were requested in return_values. See brief descriptions below.
- X
 A data.frame.
- y
 A response vector of length nrow(X).
- support
 A vector of feature indices indicating all features used in the true support of the DGP.
Note that if data_split = TRUE and "X", "y"
are in return_values, then the returned list also contains slots for
"Xtest" and "ytest".
Details
Data is generated via: $$y = intercept + betas \%*\% X + betas\_unobs \%*\% U + err(...),$$ where X and U are standard Gaussian random matrices and the true underlying support of this data is the first s_obs and s_unobs features in X and U, respectively.
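As a rough illustration of this generative model, the following base R sketch simulates data by hand rather than through linear_gaussian_dgp(). The specific values (n, the coefficient vectors, and the rnorm error with sd = 0.5) are arbitrary choices for the example. It assumes samples are stored as rows, so the products are written X %*% betas rather than betas %*% X, and it records the observed support as the first s_obs indices.

```r
set.seed(42)
n <- 100       # number of samples
p_obs <- 10    # observed features
p_unobs <- 2   # unobserved (omitted) features
s_obs <- 2     # sparsity level of the observed features

# Standard Gaussian design matrices; each row is one sample.
X <- matrix(rnorm(n * p_obs), nrow = n, ncol = p_obs)
U <- matrix(rnorm(n * p_unobs), nrow = n, ncol = p_unobs)

# Only the first s_obs observed coefficients are nonzero.
betas <- c(rnorm(s_obs), rep(0, p_obs - s_obs))
betas_unobs <- c(-1, 0)
intercept <- 0

# y = intercept + X betas + U betas_unobs + error
y <- intercept + drop(X %*% betas) + drop(U %*% betas_unobs) +
  rnorm(n, sd = 0.5)

# Indices of the observed features in the true support.
support <- seq_len(s_obs)
```

Only X and y would be visible to an analyst; U enters y but is omitted from the returned design matrix, which is what makes it useful for studying omitted-variable effects.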
Examples
# generate data from: y = betas_1 * x_1 + betas_2 * x_2 + N(0, 0.5^2), where
# betas_1, betas_2 ~ N(0, 1) and X ~ N(0, I_10)
sim_data <- linear_gaussian_dgp(n = 100, p_obs = 10, s_obs = 2, betas_sd = 1,
                                err = rnorm, sd = .5)
# generate data from y = betas %*% X - u_1 + t(df = 1), where
# betas ~ N(0, .5^2), betas_unobs = [-1, 0], X ~ N(0, I_10), U ~ N(0, I_2)
sim_data <- linear_gaussian_dgp(n = 100, p_obs = 10, p_unobs = 2,
                                betas_sd = .5, betas_unobs = c(-1, 0),
                                err = rt, df = 1)