Generate correlated Gaussian covariates and (binary) logistic response data.
Generate normally-distributed covariates that are potentially correlated and (binary) logistic response data.
Usage
correlated_logistic_gaussian_dgp(
  n,
  p_uncorr,
  p_corr,
  s_uncorr = p_uncorr,
  s_corr = p_corr,
  corr,
  betas_uncorr = NULL,
  betas_corr = NULL,
  betas_uncorr_sd = 1,
  betas_corr_sd = 1,
  intercept = 0,
  data_split = FALSE,
  train_prop = 0.5,
  return_values = c("X", "y", "support"),
  ...
)
Arguments
- n
Number of samples.
- p_uncorr
Number of uncorrelated features.
- p_corr
Number of features in correlated group.
- s_uncorr
Sparsity level of features in uncorrelated group. Coefficients corresponding to features after the s_uncorr position (i.e., positions i = s_uncorr + 1, ..., p_uncorr) are set to 0.
- s_corr
Sparsity level of features in correlated group. Coefficients corresponding to features after the s_corr position (i.e., positions i = s_corr + 1, ..., p_corr) are set to 0.
- corr
Correlation between features in correlated group.
- betas_uncorr
Coefficient vector for uncorrelated features. If a scalar is provided, the coefficient vector is constant. If NULL (default), entries in the coefficient vector are drawn iid from N(0, betas_uncorr_sd^2). Can also be a function that generates the coefficient vector; see generate_coef().
- betas_corr
Coefficient vector for correlated features. If a scalar is provided, the coefficient vector is constant. If NULL (default), entries in the coefficient vector are drawn iid from N(0, betas_corr_sd^2). Can also be a function that generates the coefficient vector; see generate_coef().
- betas_uncorr_sd
(Optional) SD of the normal distribution from which to draw betas_uncorr. Only used if the betas_uncorr argument is NULL or is a function, in which case betas_uncorr_sd is optionally passed to the function as sd; see generate_coef().
- betas_corr_sd
(Optional) SD of the normal distribution from which to draw betas_corr. Only used if the betas_corr argument is NULL or is a function, in which case betas_corr_sd is optionally passed to the function as sd; see generate_coef().
- intercept
Scalar intercept term.
- data_split
Logical; if TRUE, splits data into training and test sets according to train_prop.
- train_prop
Proportion of data in training set if data_split = TRUE.
- return_values
Character vector indicating what objects to return in list. Elements in vector must be one of "X", "y", "support".
- ...
Not used.
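The rules above for betas_uncorr / betas_corr, the sparsity levels, and the SD arguments can be illustrated with a small base R sketch. The helper name sketch_coef is hypothetical and is not part of the package; the actual function delegates to generate_coef(), whose internals are not described here. In particular, whether a user-supplied full vector is also zeroed after position s is an assumption of this sketch.

```r
# Hypothetical helper (NOT the package's implementation): builds a length-p
# coefficient vector following the documented rules for betas_*, s_*, betas_*_sd.
sketch_coef <- function(p, s = p, betas = NULL, sd = 1) {
  if (p == 0) return(numeric(0))
  if (is.null(betas)) {
    beta <- rnorm(p, mean = 0, sd = sd)  # NULL: iid N(0, sd^2) draws
  } else if (length(betas) == 1) {
    beta <- rep(betas, p)                # scalar: constant coefficient vector
  } else {
    stopifnot(length(betas) == p)
    beta <- betas                        # full vector: used as given
  }
  # Assumption: positions s + 1, ..., p are zeroed regardless of how beta arose.
  if (s < p) beta[(s + 1):p] <- 0
  beta
}

set.seed(1)
beta_null <- sketch_coef(p = 5, s = 2)               # two random entries, then zeros
beta_scalar <- sketch_coef(p = 4, s = 3, betas = 2)  # constant 2 in the first 3 positions
```

Under these assumptions, the nonzero positions of the returned vector are exactly the first s entries, which is what the sparsity arguments describe.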
Value
A list of the named objects that were requested in return_values. See brief descriptions below.
- X
A data.frame.
- y
A response vector of length nrow(X).
- support
A vector of feature indices indicating all features used in the true support of the DGP.
Note that if data_split = TRUE and "X", "y" are in return_values, then the returned list also contains slots for "Xtest" and "ytest".
Details
Data is generated via: $$\log\left(\frac{p}{1 - p}\right) = \mathrm{intercept} + \mathrm{betas\_uncorr} \mathbin{\%{*}\%} \mathrm{X\_uncorr} + \mathrm{betas\_corr} \mathbin{\%{*}\%} \mathrm{X\_corr},$$ where p = P(y = 1 | X), X_uncorr is an (uncorrelated) standard Gaussian random matrix, and X_corr is a correlated Gaussian random matrix with variance 1 and Cor(X_corr_i, X_corr_j) = corr for all i ≠ j. The true underlying support of this data is the first s_uncorr features in X_uncorr and the first s_corr features in X_corr.
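As a rough illustration of the model described above, the following base R sketch draws data of this form by hand. It is a sketch under stated assumptions, not the package's implementation: the correlated block is drawn via a Cholesky factor of the equicorrelation matrix (the package may use a different sampler), the products are written as X %*% beta so that the dimensions conform, and all variable names are chosen for this example only.

```r
set.seed(42)
n <- 200; p_uncorr <- 3; p_corr <- 4; corr <- 0.7; intercept <- 0

# Uncorrelated block: iid standard Gaussian entries.
X_uncorr <- matrix(rnorm(n * p_uncorr), nrow = n, ncol = p_uncorr)

# Correlated block: unit variances, constant pairwise correlation `corr`.
Sigma <- matrix(corr, nrow = p_corr, ncol = p_corr)
diag(Sigma) <- 1
Z <- matrix(rnorm(n * p_corr), nrow = n, ncol = p_corr)
# chol() returns upper-triangular R with t(R) %*% R = Sigma,
# so the rows of Z %*% R are N(0, Sigma).
X_corr <- Z %*% chol(Sigma)

# Coefficients drawn iid from N(0, 1), mirroring the NULL default.
betas_uncorr <- rnorm(p_uncorr)
betas_corr <- rnorm(p_corr)

# Linear predictor on the log-odds scale, then Bernoulli draws.
log_odds <- intercept +
  drop(X_uncorr %*% betas_uncorr) +
  drop(X_corr %*% betas_corr)
p <- plogis(log_odds)              # inverse logit: p = 1 / (1 + exp(-log_odds))
y <- rbinom(n, size = 1, prob = p)
```

With a larger n, cor(X_corr) should approach the matrix with 1s on the diagonal and corr elsewhere, which gives a quick sanity check on the correlated block.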
Examples
# generate data from: log(p / (1 - p)) = betas_corr_1 * x_corr_1 + betas_corr_2 * x_corr_2,
# where betas_corr_1, betas_corr_2 ~ N(0, 1),
# Var(X_corr_i) = 1, Cor(X_corr_i, X_corr_j) = 0.7 for all i != j in 1, ..., 10
sim_data <- correlated_logistic_gaussian_dgp(n = 100, p_uncorr = 0, p_corr = 10,
                                             s_corr = 2, corr = 0.7)
# generate data from: log(p / (1 - p)) = betas_uncorr %*% X_uncorr - X_corr_1,
# where betas_uncorr ~ N(0, 1), betas_corr = [-1, 0], X_uncorr ~ N(0, I_10),
# X_corr ~ N(0, Sigma), Sigma has 1s on diagonals and 0.7 elsewhere.
sim_data <- correlated_logistic_gaussian_dgp(n = 100, p_uncorr = 10, p_corr = 2,
                                             corr = 0.7, betas_uncorr_sd = 1,
                                             betas_corr = c(-1, 0))