Generate correlated Gaussian covariates and linear response data. — correlated_linear_gaussian

Generate normally-distributed covariates that are potentially correlated and linear response data with a specified error distribution.

Usage

correlated_linear_gaussian_dgp(
  n,
  p_uncorr,
  p_corr,
  s_uncorr = p_uncorr,
  s_corr = p_corr,
  corr,
  betas_uncorr = NULL,
  betas_corr = NULL,
  betas_uncorr_sd = 1,
  betas_corr_sd = 1,
  intercept = 0,
  err = NULL,
  data_split = FALSE,
  train_prop = 0.5,
  return_values = c("X", "y", "support"),
  ...
)

Arguments

n: Number of samples.
p_uncorr: Number of uncorrelated features.
p_corr: Number of features in correlated group.
s_uncorr: Sparsity level of features in uncorrelated group. Coefficients corresponding to features after the s_uncorr position (i.e., positions i = s_uncorr + 1, ..., p_uncorr) are set to 0.
s_corr: Sparsity level of features in correlated group. Coefficients corresponding to features after the s_corr position (i.e., positions i = s_corr + 1, ..., p_corr) are set to 0.
corr: Correlation between features in correlated group.
betas_uncorr: Coefficient vector for uncorrelated features. If a scalar is provided, the coefficient vector is a constant vector. If NULL (default), entries in the coefficient vector are drawn iid from N(0, betas_uncorr_sd^2).
betas_corr: Coefficient vector for correlated features. If a scalar is provided, the coefficient vector is a constant vector. If NULL (default), entries in the coefficient vector are drawn iid from N(0, betas_corr_sd^2).
betas_uncorr_sd: (Optional) SD of normal distribution from which to draw betas_uncorr. Only used if betas_uncorr argument is NULL.
betas_corr_sd: (Optional) SD of normal distribution from which to draw betas_corr. Only used if betas_corr argument is NULL.
intercept: Scalar intercept term.
err: Function used to generate the simulated error vector (e.g., rnorm or rt). Default is NULL, which adds no error to the DGP.
data_split: Logical; if TRUE, splits data into training and test sets according to train_prop.
train_prop: Proportion of data in training set if data_split = TRUE.
return_values: Character vector indicating which objects to return in the list. Each element must be one of "X", "y", "support".
...: Other arguments to pass to err() to generate the error vector.
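
The interaction between the betas_*, betas_*_sd, and s_* arguments can be illustrated with a small base R sketch. The helper make_betas() below is hypothetical and is not part of the package; it only mirrors the argument descriptions above, and it assumes that a non-scalar coefficient vector has one entry per feature.

```r
# Hypothetical helper (not part of the package): builds a length-p coefficient
# vector following the argument descriptions above.
make_betas <- function(p, s = p, betas = NULL, betas_sd = 1) {
  if (p == 0) return(numeric(0))
  if (is.null(betas)) {
    betas <- rnorm(p, mean = 0, sd = betas_sd)  # iid draws from N(0, betas_sd^2)
  } else if (length(betas) == 1) {
    betas <- rep(betas, p)                      # scalar -> constant vector
  }
  stopifnot(length(betas) == p)                 # assumption: one entry per feature
  if (s < p) betas[(s + 1):p] <- 0              # zero out positions s + 1, ..., p
  betas
}

set.seed(1)
make_betas(p = 5, s = 2)             # two random entries followed by three zeros
make_betas(p = 4, s = 3, betas = 2)  # 2 2 2 0
```

Whether the package draws coefficients for all positions before zeroing the trailing ones, as this sketch does, is an assumption; the resulting sparsity pattern is the same either way.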

Value

A list of the named objects that were requested in return_values. See brief descriptions below.

X: A data.frame.
y: A response vector of length nrow(X).
support: A vector of feature indices indicating all features used in the true support of the DGP.

Note that if data_split = TRUE and both "X" and "y" are in return_values, then the returned list also contains the elements "Xtest" and "ytest".
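
To make the shape of the split output concrete, the following base R sketch partitions a data set into the four elements named above. The helper split_xy() is hypothetical; in particular, the random assignment of rows and the rounding of train_prop * n are assumptions, since this page does not say how the package chooses the training rows.

```r
# Hypothetical helper (not the package's code): random train/test partition
# returning the same element names as described above.
split_xy <- function(X, y, train_prop = 0.5) {
  n <- nrow(X)
  train_idx <- sample.int(n, size = round(train_prop * n))  # assumed rounding rule
  test_idx <- setdiff(seq_len(n), train_idx)
  list(
    X     = X[train_idx, , drop = FALSE],
    y     = y[train_idx],
    Xtest = X[test_idx, , drop = FALSE],
    ytest = y[test_idx]
  )
}

set.seed(1)
X <- data.frame(a = 1:10, b = 11:20)
y <- 1:10
out <- split_xy(X, y, train_prop = 0.7)
names(out)        # "X" "y" "Xtest" "ytest"
nrow(out$X)       # 7
nrow(out$Xtest)   # 3
```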

Details

Data is generated via: $$y = intercept + betas_uncorr \%*\% X_uncorr + betas_corr \%*\% X_corr + err(...),$$ where X_uncorr is an (uncorrelated) standard Gaussian random matrix and X_corr is a correlated Gaussian random matrix in which every feature has variance 1 and Cor(X_corr_i, X_corr_j) = corr for all i != j. The true underlying support of this data is the first s_uncorr features in X_uncorr and the first s_corr features in X_corr.
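
The model above can be reproduced by hand in base R, which may help when checking simulated output. The sketch below is not the package's implementation: it uses a Cholesky factor to induce the equicorrelated structure, and it assumes that the uncorrelated features come first when indexing the support. Note that the equicorrelation matrix Sigma is positive definite only when -1 / (p_corr - 1) < corr < 1.

```r
set.seed(123)
n <- 200; p_uncorr <- 3; p_corr <- 4; corr <- 0.7

# Uncorrelated block: iid N(0, 1) entries.
X_uncorr <- matrix(rnorm(n * p_uncorr), nrow = n, ncol = p_uncorr)

# Correlated block: unit variances and a common pairwise correlation `corr`.
Sigma <- matrix(corr, nrow = p_corr, ncol = p_corr)
diag(Sigma) <- 1
Z <- matrix(rnorm(n * p_corr), nrow = n, ncol = p_corr)
X_corr <- Z %*% chol(Sigma)  # chol() returns R with t(R) %*% R = Sigma

# Fixed coefficients: s_uncorr = 1 and s_corr = 2 nonzero leading entries.
betas_uncorr <- c(1, 0, 0)
betas_corr <- c(-1, 2, 0, 0)
intercept <- 0
err <- rnorm(n, sd = 0.5)

y <- intercept + drop(X_uncorr %*% betas_uncorr) +
  drop(X_corr %*% betas_corr) + err

# Support indices, assuming columns are ordered as cbind(X_uncorr, X_corr).
support <- which(c(betas_uncorr, betas_corr) != 0)  # 1 4 5
round(cor(X_corr), 2)  # off-diagonal entries should be close to 0.7
```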

Examples

# generate data from: y = betas_corr_1 * x_corr_1 + betas_corr_2 * x_corr_2 + N(0, 0.5^2),
# where betas_corr_1, betas_corr_2 ~ N(0, 1),
# Var(X_corr_i) = 1, Cor(X_corr_i, X_corr_j) = 0.7 for all i != j in 1, ..., 10
sim_data <- correlated_linear_gaussian_dgp(n = 100, p_uncorr = 0, p_corr = 10,
                                           s_corr = 2, corr = 0.7,
                                           err = rnorm, sd = .5)

# generate data from y = betas_uncorr %*% X_uncorr - X_corr_1 + t(df = 1), where
# betas_uncorr ~ N(0, 1), betas_corr = [-1, 0], X_uncorr ~ N(0, I_10),
# X_corr ~ N(0, Sigma), Sigma has 1s on diagonals and 0.7 elsewhere.
sim_data <- correlated_linear_gaussian_dgp(n = 100, p_uncorr = 10, p_corr = 2,
                                           corr = 0.7, betas_uncorr_sd = 1,
                                           betas_corr = c(-1, 0),
                                           err = rt, df = 1)