Generate normally distributed covariates, some of which may be correlated, together with linear response data under a specified error distribution.

Usage

correlated_linear_gaussian_dgp(
  n,
  p_uncorr,
  p_corr,
  s_uncorr = p_uncorr,
  s_corr = p_corr,
  corr,
  betas_uncorr = NULL,
  betas_corr = NULL,
  betas_uncorr_sd = 1,
  betas_corr_sd = 1,
  intercept = 0,
  err = NULL,
  data_split = FALSE,
  train_prop = 0.5,
  return_values = c("X", "y", "support"),
  ...
)

Arguments

n

Number of samples.

p_uncorr

Number of uncorrelated features.

p_corr

Number of features in correlated group.

s_uncorr

Sparsity level of features in uncorrelated group. Coefficients corresponding to features after the s_uncorr position (i.e., positions i = s_uncorr + 1, ..., p_uncorr) are set to 0.

s_corr

Sparsity level of features in correlated group. Coefficients corresponding to features after the s_corr position (i.e., positions i = s_corr + 1, ..., p_corr) are set to 0.

corr

Correlation between features in correlated group.

betas_uncorr

Coefficient vector for uncorrelated features. If a scalar is provided, the coefficient vector is a constant vector. If NULL (default), entries in the coefficient vector are drawn iid from N(0, betas_uncorr_sd^2).

betas_corr

Coefficient vector for correlated features. If a scalar is provided, the coefficient vector is a constant vector. If NULL (default), entries in the coefficient vector are drawn iid from N(0, betas_corr_sd^2).

betas_uncorr_sd

(Optional) SD of normal distribution from which to draw betas_uncorr. Only used if betas_uncorr argument is NULL.

betas_corr_sd

(Optional) SD of normal distribution from which to draw betas_corr. Only used if betas_corr argument is NULL.

intercept

Scalar intercept term.

err

Function used to generate the simulated error vector. Default is NULL, which adds no error to the DGP.

data_split

Logical; if TRUE, splits data into training and test sets according to train_prop.

train_prop

Proportion of data in training set if data_split = TRUE.

return_values

Character vector indicating what objects to return in list. Elements in vector must be one of "X", "y", "support".

...

Other arguments to pass to err() to generate the error vector.
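
The coefficient arguments above combine three rules: NULL draws random coefficients, a scalar is recycled into a constant vector, and entries past the sparsity position are zeroed. The following is a minimal base-R sketch of those rules, not the package's actual implementation. The helper name `resolve_betas` and its signature are hypothetical and introduced only for illustration.

```r
# Hypothetical helper (not part of the documented API): expand a coefficient
# argument into a length-p vector and zero out entries beyond position s.
resolve_betas <- function(betas, p, s, betas_sd = 1) {
  if (p == 0) return(numeric(0))
  if (is.null(betas)) {
    betas <- rnorm(p, mean = 0, sd = betas_sd)  # iid N(0, betas_sd^2)
  } else if (length(betas) == 1) {
    betas <- rep(betas, p)                      # scalar -> constant vector
  }
  stopifnot(length(betas) == p)
  if (s < p) betas[(s + 1):p] <- 0              # positions s + 1, ..., p
  betas
}

set.seed(1)
b_null   <- resolve_betas(NULL, p = 5, s = 2)     # random, last three zeroed
b_scalar <- resolve_betas(2, p = 4, s = 3)        # constant, last one zeroed
b_vector <- resolve_betas(c(-1, 0), p = 2, s = 2) # used as given
```

Under these assumptions, the sparsity level only truncates the tail of the vector; it does not reorder or rescale the leading coefficients.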

Value

A list of the named objects that were requested in return_values. See brief descriptions below.

X

A data.frame.

y

A response vector of length nrow(X).

support

A vector of feature indices indicating all features used in the true support of the DGP.

Note that if data_split = TRUE and "X", "y" are in return_values, then the returned list also contains slots for "Xtest" and "ytest".
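
To make the shape of the returned list with data_split = TRUE concrete, here is a small base-R sketch that builds a list with the same slot names. It assumes rows are assigned to the training set uniformly at random and that the training size is round(train_prop * n); the documentation does not state either detail, so treat both as assumptions.

```r
set.seed(42)
n <- 10
train_prop <- 0.5

# Toy data standing in for the generated covariates and response.
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y <- 2 * X$x1 + rnorm(n)

# Assumption: a uniformly random subset of rows forms the training set.
n_train <- round(train_prop * n)
train_ids <- sample(seq_len(n), size = n_train)

out <- list(
  X = X[train_ids, , drop = FALSE],       # training covariates
  y = y[train_ids],                       # training response
  Xtest = X[-train_ids, , drop = FALSE],  # held-out covariates
  ytest = y[-train_ids],                  # held-out response
  support = 1L                            # only x1 enters the toy response
)
```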

Details

Data is generated via: $$y = intercept + betas_uncorr \%*\% X_uncorr + betas_corr \%*\% X_corr + err(...),$$ where X_uncorr is an (uncorrelated) standard Gaussian random matrix and X_corr is a correlated Gaussian random matrix with variance 1 and Cor(X_corr_i, X_corr_j) = corr for all i ≠ j. The true underlying support of this data is the first s_uncorr features in X_uncorr and the first s_corr features in X_corr.
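
The generative model in this section can be sketched in base R as follows. This is an illustration under stated assumptions, not the function's source: it draws the correlated block with a Cholesky factor, places the uncorrelated columns before the correlated ones, and indexes the support in that combined column order. The column order and the support indexing are assumptions.

```r
set.seed(123)
n <- 5000; p_uncorr <- 3; p_corr <- 4
s_uncorr <- 2; s_corr <- 1
corr <- 0.7; intercept <- 0

# Uncorrelated block: iid standard Gaussian entries.
X_uncorr <- matrix(rnorm(n * p_uncorr), nrow = n, ncol = p_uncorr)

# Correlated block: covariance with 1s on the diagonal and corr elsewhere.
Sigma <- matrix(corr, nrow = p_corr, ncol = p_corr)
diag(Sigma) <- 1
# chol() returns an upper-triangular R with t(R) %*% R = Sigma, so the rows
# of Z %*% R have covariance Sigma when the rows of Z are standard Gaussian.
Z <- matrix(rnorm(n * p_corr), nrow = n, ncol = p_corr)
X_corr <- Z %*% chol(Sigma)

# Coefficients: iid N(0, 1), zeroed after the sparsity position.
betas_uncorr <- rnorm(p_uncorr)
betas_uncorr[(s_uncorr + 1):p_uncorr] <- 0
betas_corr <- rnorm(p_corr)
betas_corr[(s_corr + 1):p_corr] <- 0

# Linear response with N(0, 0.5^2) noise.
y <- intercept +
  drop(X_uncorr %*% betas_uncorr) +
  drop(X_corr %*% betas_corr) +
  rnorm(n, sd = 0.5)

X <- data.frame(cbind(X_uncorr, X_corr))
# Assumed indexing: uncorrelated columns first, then correlated columns.
support <- c(seq_len(s_uncorr), p_uncorr + seq_len(s_corr))

# Empirical check: off-diagonal correlations should be close to corr.
emp_cor <- cor(X_corr)
```

With a large n, the off-diagonal entries of emp_cor should sit near corr and the column variances of X_corr near 1, which is a quick sanity check on any data simulated from this model.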

Examples

# generate data from: y = betas_corr_1 * x_corr_1 + betas_corr_2 * x_corr_2 + N(0, 0.5^2),
# where betas_corr_1, betas_corr_2 ~ N(0, 1),
# Var(X_corr_i) = 1, Cor(X_corr_i, X_corr_j) = 0.7 for all i != j in 1, ..., 10
sim_data <- correlated_linear_gaussian_dgp(n = 100, p_uncorr = 0, p_corr = 10,
                                           s_corr = 2, corr = 0.7,
                                           err = rnorm, sd = .5)

# generate data from: y = betas_uncorr %*% X_uncorr - X_corr_1 + t(df = 1), where
# betas_uncorr ~ N(0, 1), betas_corr = [-1, 0], X_uncorr ~ N(0, I_10),
# X_corr ~ N(0, Sigma), Sigma has 1s on the diagonal and 0.7 elsewhere.
sim_data <- correlated_linear_gaussian_dgp(n = 100, p_uncorr = 10, p_corr = 2,
                                           corr = 0.7, betas_uncorr_sd = 1,
                                           betas_corr = c(-1, 0),
                                           err = rt, df = 1)
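
Both examples pass extra arguments (sd = .5, df = 1) that are forwarded to err(). The sketch below illustrates that forwarding pattern in base R. The wrapper `add_error` is hypothetical, and it assumes err() is called with the number of samples as its first argument, which is consistent with rnorm and rt but is not stated in the documentation.

```r
# Hypothetical wrapper showing how extra arguments might reach err().
# Assumption: err is called with the sample size as its first argument.
add_error <- function(signal, err = NULL, ...) {
  if (is.null(err)) return(signal)  # err = NULL adds no noise
  signal + err(length(signal), ...)
}

set.seed(7)
signal <- rep(1, 100)
y_clean <- add_error(signal)                        # no error term
y_norm  <- add_error(signal, err = rnorm, sd = 0.5) # Gaussian noise, SD 0.5
y_t     <- add_error(signal, err = rt, df = 1)      # heavy-tailed t noise
```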