Generate independent normally-distributed covariates (including potentially omitted variables) and linear response data with a specified error distribution.

Usage

linear_gaussian_dgp(
  n,
  p_obs = 0,
  p_unobs = 0,
  s_obs = p_obs,
  s_unobs = p_unobs,
  betas = NULL,
  betas_unobs = NULL,
  intercept = 0,
  err = NULL,
  data_split = FALSE,
  train_prop = 0.5,
  return_values = c("X", "y", "support"),
  ...
)

Arguments

n

Number of samples.

p_obs

Number of observed features.

p_unobs

Number of unobserved (omitted) features.

s_obs

Sparsity level of observed features. Coefficients corresponding to features after the s_obs position (i.e., positions i = s_obs + 1, ..., p_obs) are set to 0.

s_unobs

Sparsity level of unobserved (omitted) features. Coefficients corresponding to features after the s_unobs position (i.e., positions i = s_unobs + 1, ..., p_unobs) are set to 0.

betas

Coefficient vector for observed design matrix. If a scalar is provided, the coefficient vector is constant. If NULL (default), entries in the coefficient vector are drawn iid from N(0, betas_sd^2). Can also be a function that generates the coefficient vector; see generate_coef().

betas_unobs

Coefficient vector for unobserved design matrix. If a scalar is provided, the coefficient vector is constant. If NULL (default), entries in the coefficient vector are drawn iid from N(0, betas_unobs_sd^2). Can also be a function that generates the coefficient vector; see generate_coef().

intercept

Scalar intercept term.

err

Function used to generate the simulated error vector. The default, NULL, adds no error to the DGP.

data_split

Logical; if TRUE, splits data into training and test sets according to train_prop.

train_prop

Proportion of data in training set if data_split = TRUE.

return_values

Character vector indicating what objects to return in list. Elements in vector must be one of "X", "y", "support".

...

Additional arguments to pass to the functions that generate X, U, y, betas, betas_unobs, and err. If an argument doesn't exist in one of the functions, it is ignored there. If two or more of the functions have an argument of the same name but need different values, then use one of the following prefixes in front of the argument name (passed via ...) to differentiate it: .X_, .U_, .y_, .betas_, .betas_unobs_, or .err_. For additional details, see generate_X_gaussian(), generate_y_linear(), generate_coef(), and generate_errors().

Value

A list of the named objects that were requested in return_values. See brief descriptions below.

X

A data.frame.

y

A response vector of length nrow(X).

support

A vector of feature indices indicating all features used in the true support of the DGP.

Note that if data_split = TRUE and "X", "y" are in return_values, then the returned list also contains slots for "Xtest" and "ytest".

Details

Data is generated via: $$y = intercept + betas \%*\% X + betas\_unobs \%*\% U + err(...),$$ where X, U are standard Gaussian random matrices and the true underlying support of this data is the first s_obs and s_unobs features in X and U, respectively.
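
The generating equation can be illustrated with a short base-R sketch. This is not the package's implementation: sketch_linear_dgp() is a hypothetical helper written only for illustration. It assumes iid N(0, 1) entries in X and U, coefficients drawn iid from a normal distribution, and a support that lists only the observed feature indices. It writes the linear term as X %*% betas, with X stored as an n-by-p matrix.

```r
# Hypothetical helper (not part of the package): a base-R sketch of the
# generating equation, assuming iid N(0, 1) entries in X and U.
sketch_linear_dgp <- function(n, p_obs, p_unobs, s_obs = p_obs,
                              s_unobs = p_unobs, betas_sd = 1,
                              betas_unobs_sd = 1, intercept = 0,
                              err = NULL, ...) {
  # Standard Gaussian design matrices: X is observed, U is omitted.
  X <- matrix(rnorm(n * p_obs), nrow = n, ncol = p_obs)
  U <- matrix(rnorm(n * p_unobs), nrow = n, ncol = p_unobs)

  # Draw coefficients, then zero out every position after the sparsity level.
  betas <- rnorm(p_obs, sd = betas_sd)
  betas[seq_len(p_obs) > s_obs] <- 0
  betas_unobs <- rnorm(p_unobs, sd = betas_unobs_sd)
  betas_unobs[seq_len(p_unobs) > s_unobs] <- 0

  # Linear response; the noise term is skipped when err is NULL.
  y <- intercept + drop(X %*% betas) + drop(U %*% betas_unobs)
  if (!is.null(err)) y <- y + err(n, ...)

  # Assumption: the support is reported as indices into X only.
  list(X = as.data.frame(X), y = y, support = seq_len(s_obs))
}

set.seed(1)
sim <- sketch_linear_dgp(n = 100, p_obs = 10, p_unobs = 2, s_obs = 3,
                         err = rnorm, sd = 0.5)
str(sim$X)
```

Because U contributes to y but is not returned, a model fitted on X alone sees the omitted features only as extra, unexplained variation in the response.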

Examples

# generate data from: y = betas_1 * x_1 + betas_2 * x_2 + N(0, 0.5^2), where
# betas_1, betas_2 ~ N(0, 1) and X ~ N(0, I_10)
sim_data <- linear_gaussian_dgp(n = 100, p_obs = 10, s_obs = 2, betas_sd = 1,
                                err = rnorm, sd = .5)

# generate data from: y = betas %*% X - u_1 + t(df = 1), where
# betas ~ N(0, .5^2), betas_unobs = [-1, 0], X ~ N(0, I_10), U ~ N(0, I_2)
sim_data <- linear_gaussian_dgp(n = 100, p_obs = 10, p_unobs = 2,
                                betas_sd = .5, betas_unobs = c(-1, 0),
                                err = rt, df = 1)