Generate independent Gaussian covariates and linear response data.

Generate independent normally-distributed covariates (including potentially omitted variables) and linear response data with a specified error distribution.

Usage

linear_gaussian_dgp(
  n,
  p_obs = 0,
  p_unobs = 0,
  s_obs = p_obs,
  s_unobs = p_unobs,
  betas = NULL,
  betas_unobs = NULL,
  intercept = 0,
  err = NULL,
  data_split = FALSE,
  train_prop = 0.5,
  return_values = c("X", "y", "support"),
  ...
)

Arguments

n: Number of samples.
p_obs: Number of observed features.
p_unobs: Number of unobserved (omitted) features.
s_obs: Sparsity level of observed features. Coefficients corresponding to features after the s_obs position (i.e., positions i = s_obs + 1, ..., p_obs) are set to 0.
s_unobs: Sparsity level of unobserved (omitted) features. Coefficients corresponding to features after the s_unobs position (i.e., positions i = s_unobs + 1, ..., p_unobs) are set to 0.
betas: Coefficient vector for observed design matrix. If a scalar is provided, the coefficient vector is constant. If NULL (default), entries in the coefficient vector are drawn iid from N(0, betas_sd^2). Can also be a function that generates the coefficient vector; see generate_coef().
betas_unobs: Coefficient vector for unobserved design matrix. If a scalar is provided, the coefficient vector is constant. If NULL (default), entries in the coefficient vector are drawn iid from N(0, betas_unobs_sd^2). Can also be a function that generates the coefficient vector; see generate_coef().
intercept: Scalar intercept term.
err: Function used to generate the simulated error vector. Default is NULL, which adds no error to the DGP.
data_split: Logical; if TRUE, splits data into training and test sets according to train_prop.
train_prop: Proportion of data in training set if data_split = TRUE.
return_values: Character vector indicating which objects to return in the list. Each element must be one of "X", "y", "support".
...: Additional arguments to pass to functions that generate X, U, y, betas, betas_unobs, and err. If the argument doesn't exist in one of the functions, it is ignored. If two or more of the functions have an argument of the same name but with different values, then use one of the following prefixes in front of the argument name (passed via ...) to differentiate it: .X_, .U_, .y_, .betas_, .betas_unobs_, or .err_. For additional details, see generate_X_gaussian(), generate_y_linear(), generate_coef(), and generate_errors().
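
The sparsity rule described for s_obs and s_unobs can be illustrated with a short base-R sketch. This is not the package's internal code: the variable names are chosen for illustration, and the standard deviation of 1 for the coefficient draw is an assumption.

```r
# Illustrative sketch only; not the package's internal implementation.
set.seed(1)
p_obs <- 5   # number of observed features
s_obs <- 2   # sparsity level: only the first s_obs coefficients are nonzero

# Draw coefficients iid from N(0, 1) (assumed standard deviation of 1).
betas <- rnorm(p_obs, mean = 0, sd = 1)

# Set the coefficients at positions s_obs + 1, ..., p_obs to 0.
betas[seq_len(p_obs) > s_obs] <- 0

# Indices of the features in the true support.
support <- which(betas != 0)
```

With s_obs = p_obs (the default), no coefficient is zeroed and every observed feature is in the support.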

Value

A list of the named objects that were requested in return_values. See brief descriptions below.

X: A data.frame.
y: A response vector of length nrow(X).
support: A vector of feature indices indicating all features used in the true support of the DGP.

Note that if data_split = TRUE and "X", "y" are in return_values, then the returned list also contains slots for "Xtest" and "ytest".
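
As a rough picture of the list returned when data_split = TRUE, the base-R sketch below splits simulated data into training and test parts. How the package assigns rows to each part is not documented here, so the uniformly random assignment and the rounding of train_prop * n are assumptions made for illustration.

```r
# Illustrative sketch only; the package's actual splitting rule may differ.
set.seed(1)
n <- 100
train_prop <- 0.5

# Stand-in data: 3 Gaussian covariates and a response.
X <- as.data.frame(matrix(rnorm(n * 3), nrow = n))
y <- rnorm(n)

# Assumption: training rows are chosen uniformly at random.
train_idx <- sample(seq_len(n), size = round(train_prop * n))

out <- list(
  X     = X[train_idx, , drop = FALSE],
  y     = y[train_idx],
  Xtest = X[-train_idx, , drop = FALSE],
  ytest = y[-train_idx]
)
```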

Details

Data is generated via: $$y = intercept + betas \%*\% X + betas_unobs \%*\% U + err(...),$$ where X, U are standard Gaussian random matrices and the true underlying support of this data is the first s_obs and s_unobs features in X and U respectively.
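
The generating equation can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: X and U are stored with one row per sample, so the products are written X %*% betas and U %*% betas_unobs, and the specific sizes, coefficient values, and N(0, 0.5^2) errors are chosen only for the example.

```r
# Illustrative sketch only; not the package's internal implementation.
set.seed(1)
n <- 100
p_obs <- 10;  s_obs <- 2     # observed features and their sparsity level
p_unobs <- 2; s_unobs <- 1   # unobserved features and their sparsity level
intercept <- 0

# Standard Gaussian design matrices, one row per sample.
X <- matrix(rnorm(n * p_obs), nrow = n)
U <- matrix(rnorm(n * p_unobs), nrow = n)

# Only the first s_obs (s_unobs) coefficients are nonzero.
betas <- c(rnorm(s_obs), rep(0, p_obs - s_obs))
betas_unobs <- c(rep(-1, s_unobs), rep(0, p_unobs - s_unobs))

# Error vector; here N(0, 0.5^2) noise (an assumed choice).
err <- rnorm(n, sd = 0.5)

# Linear response: intercept + observed signal + unobserved signal + error.
y <- as.vector(intercept + X %*% betas + U %*% betas_unobs + err)
```

Only X would be observed by an analyst; U contributes to y but plays the role of omitted variables.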

Examples

# generate data from: y = betas_1 * x_1 + betas_2 * x_2 + N(0, 0.5^2), where
# betas_1, betas_2 ~ N(0, 1) and X ~ N(0, I_10)
sim_data <- linear_gaussian_dgp(n = 100, p_obs = 10, s_obs = 2, betas_sd = 1,
                                err = rnorm, sd = .5)

# generate data from y = betas %*% X - u_1 + t(df = 1), where
# betas ~ N(0, .5^2), betas_unobs = [-1, 0], X ~ N(0, I_10), U ~ N(0, I_2)
sim_data <- linear_gaussian_dgp(n = 100, p_obs = 10, p_unobs = 2,
                                betas_sd = .5, betas_unobs = c(-1, 0),
                                err = rt, df = 1)