Generate independent Gaussian covariates and LSS response data.

Generate independent normally-distributed covariates and LSS response data with a specified error distribution.

Usage

lss_gaussian_dgp(
  n,
  p,
  k,
  s,
  thresholds = 0,
  signs = 1,
  betas = 1,
  intercept = 0,
  overlap = FALSE,
  err = NULL,
  data_split = FALSE,
  train_prop = 0.5,
  return_values = c("X", "y", "support"),
  ...
)

Arguments

n: Number of samples.
p: Number of features.
k: Order of the interactions.
s: Number of interactions in the LSS model, or a matrix of support indices with one interaction per row and ncol = k.
thresholds: A scalar or an s x k matrix of the thresholds for each term in the LSS model.
signs: A scalar or an s x k matrix of the sign of each interaction (1 means > while -1 means <).
betas: Scalar, vector, or function to generate coefficients corresponding to interaction terms. See generate_coef().
intercept: Scalar intercept term.
overlap: If TRUE, simulate support indices with replacement; if FALSE, simulate support indices without replacement (so no overlap).
err: Function from which to generate simulated error vector. Default is NULL, which adds no error to the DGP.
data_split: Logical; if TRUE, splits data into training and test sets according to train_prop.
train_prop: Proportion of data in training set if data_split = TRUE.
return_values: Character vector indicating what objects to return in list. Elements in vector must be one of "X", "y", "support", "int_support".
...: Other arguments to pass to err() to generate the error vector.

Value

A list of the named objects that were requested in return_values. See brief descriptions below.

X: A data.frame.
y: A response vector of length nrow(X).
support: A vector of feature indices indicating all features used in the true support of the DGP.
int_support: A vector of signed feature indices in the true (interaction) support of the DGP. For example, "1+_2-" means that the interaction between high values of feature 1 and low values of feature 2 appears in the underlying DGP.
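
The signed-index encoding used by int_support can be illustrated with a small base-R sketch. The variable names below (support_row, signs_row, label) are hypothetical and chosen only for illustration; this is not the package's internal code, only one way to build a string in the "1+_2-" format described above.

```r
# Hypothetical illustration of the "1+_2-" encoding; not the package's own code.
support_row <- c(1, 2)   # feature indices in one interaction
signs_row <- c(1, -1)    # 1 means ">" (high values), -1 means "<" (low values)

# Attach "+" or "-" to each feature index and join the terms with "_"
label <- paste0(support_row, ifelse(signs_row == 1, "+", "-"), collapse = "_")
print(label)
```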

Note that if data_split = TRUE and "X", "y" are in return_values, then the returned list also contains slots for "Xtest" and "ytest".

Details

Data is generated via: $$y = \mathrm{intercept} + \sum_{i = 1}^{s} \beta_i \prod_{j = 1}^{k} 1(X_{S_j} \lessgtr \mathrm{thresholds}_{ij}) + \mathrm{err}(\dots),$$ where X is a standard Gaussian random matrix. If overlap = TRUE, then the true interaction support is randomly chosen from the p features in X. If overlap = FALSE, then the true interaction support is sequentially taken from the first s*k features in X.
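
As a rough illustration of the formula above, the following base-R sketch computes the noiseless response for the overlap = FALSE case with err = NULL. It is a minimal sketch, not the function's actual implementation. The row-wise layout of the support matrix (interaction 1 uses features 1 and 2, interaction 2 uses features 3 and 4) is an assumption chosen to match the first example in the Examples section.

```r
# Minimal base-R sketch of the LSS formula; not the package implementation.
set.seed(1)
n <- 100; p <- 10; k <- 2; s <- 2

# X is a standard Gaussian random matrix
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

# overlap = FALSE: take the support sequentially from the first s * k features,
# one interaction per row (row-wise layout is an assumption)
support <- matrix(seq_len(s * k), nrow = s, ncol = k, byrow = TRUE)

thresholds <- matrix(0, nrow = s, ncol = k)
signs <- matrix(1, nrow = s, ncol = k)
betas <- rep(1, s)
intercept <- 0

y <- rep(intercept, n)
for (i in seq_len(s)) {
  term <- rep(TRUE, n)  # running product of the k indicators
  for (j in seq_len(k)) {
    x <- X[, support[i, j]]
    # sign 1 means ">", sign -1 means "<"
    active <- if (signs[i, j] == 1) x > thresholds[i, j] else x < thresholds[i, j]
    term <- term & active
  }
  y <- y + betas[i] * term
}
```

With these settings, y equals 1(X_1 > 0, X_2 > 0) + 1(X_3 > 0, X_4 > 0), so every entry is 0, 1, or 2.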

For more details on the LSS model, see Behr, Merle, et al. "Provable Boolean Interaction Recovery from Tree Ensemble obtained via Random Forests." arXiv preprint arXiv:2102.11800 (2021).

Examples

# generate data from: y = 1(X_1 > 0, X_2 > 0) + 1(X_3 > 0, X_4 > 0), where
# X is a 100 x 10 standard Gaussian random matrix
sim_data <- lss_gaussian_dgp(n = 100, p = 10, k = 2, s = 2,
                             thresholds = 0, signs = 1, betas = 1)

# generate data from: y = 3 * 1(X_1 < 0) - 1(X_2 > 1) + N(0, 1), where
# X is a 100 x 10 standard Gaussian random matrix
sim_data <- lss_gaussian_dgp(n = 100, p = 10, k = 1, s = 2,
                             thresholds = matrix(0:1, nrow = 2),
                             signs = matrix(c(-1, 1), nrow = 2),
                             betas = c(3, -1),
                             err = rnorm)