Generate normally-distributed covariates that are potentially correlated and LSS response data with a specified error distribution.


correlated_lss_gaussian_dgp(
  n,
  p_uncorr,
  p_corr,
  s_uncorr = p_uncorr,
  s_corr = p_corr,
  corr,
  k,
  thresholds = 0,
  signs = 1,
  betas = 1,
  intercept = 0,
  overlap = FALSE,
  mixed_int = FALSE,
  err = NULL,
  data_split = FALSE,
  train_prop = 0.5,
  return_values = c("X", "y", "support"),
  ...
)



n: Number of samples.

p_uncorr: Number of uncorrelated features.

p_corr: Number of features in the correlated group.

s_uncorr: Number of interactions from features in the uncorrelated group.

s_corr: Number of interactions from features in the correlated group.

corr: Correlation between features in the correlated group.

k: Order of the interactions.

thresholds: A scalar or an s x k matrix of the thresholds for each term in the LSS model.

signs: A scalar or an s x k matrix of the sign of each interaction (1 means > while -1 means <).

betas: Scalar, vector, or function to generate coefficients corresponding to interaction terms. See generate_coef().

intercept: Scalar intercept term.

overlap: If TRUE, simulate support indices with replacement; if FALSE, simulate support indices without replacement (so no overlap).

mixed_int: If TRUE, correlated and uncorrelated variables are mixed together when constructing an interaction of order k. If FALSE, each interaction of order k is composed of only correlated variables or only uncorrelated variables.

err: Function from which to generate the simulated error vector. Default is NULL, which adds no error to the DGP.

data_split: Logical; if TRUE, splits data into training and test sets according to train_prop.

train_prop: Proportion of data in the training set if data_split = TRUE.

return_values: Character vector indicating which objects to return in the list. Elements of the vector must be one of "X", "y", "support", "int_support".

...: Other arguments to pass to err() to generate the error vector.


A list of the named objects that were requested in return_values. See brief descriptions below.


X: A data.frame of covariates.

y: A response vector of length nrow(X).

support: A vector of feature indices indicating all features used in the true support of the DGP.

int_support: A vector of signed feature indices in the true (interaction) support of the DGP. For example, "1+_2-" means that the interaction between high values of feature 1 and low values of feature 2 appears in the underlying DGP.

Note that if data_split = TRUE and "X", "y" are in return_values, then the returned list also contains slots for "Xtest" and "ytest".
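
To make the signed-index notation concrete, the following base-R sketch splits a string such as "1+_2-" into feature indices and directions. The helper parse_signed_interaction() is hypothetical and not part of the package; it assumes that entries follow the pattern of the single example given above (features separated by "_", each followed by "+" or "-").

```r
# Hypothetical helper (not part of the package): split a signed interaction
# string such as "1+_2-" into feature indices and threshold directions.
# Assumes features are separated by "_" and each ends in "+" or "-".
parse_signed_interaction <- function(x) {
  parts <- strsplit(x, "_", fixed = TRUE)[[1]]
  data.frame(
    feature = as.integer(sub("[+-]$", "", parts)),  # drop the trailing sign
    sign = ifelse(grepl("\\+$", parts), 1, -1)      # "+" -> 1 (high), "-" -> -1 (low)
  )
}

parsed <- parse_signed_interaction("1+_2-")
print(parsed)
```

Here a sign of 1 corresponds to "high values" (X > threshold) and -1 to "low values" (X < threshold), matching the convention of the signs argument.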


Data is generated via:

$$y = intercept + \sum_{i = 1}^{s} \beta_i \prod_{j = 1}^{k} 1(X_{S_j} \lessgtr thresholds_{ij}) + err(...),$$

where X = [X_uncorr, X_corr], X_uncorr is an (uncorrelated) standard Gaussian random matrix, and X_corr is a correlated Gaussian random matrix with variance 1 and Cor(X_corr_i, X_corr_j) = corr for all i, j. If overlap = TRUE, then the true interaction support is randomly chosen from the (p_uncorr + p_corr) features in X. If overlap = FALSE, then the true interaction support is sequentially taken from the first s_uncorr*k features in X_uncorr and the first s_corr*k features in X_corr.
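
The generating process described above can be sketched in base R. This is an illustration only, not the package's implementation: it assumes thresholds = 0, signs = 1, one interaction per block taken sequentially (as with overlap = FALSE), standard normal errors, and a Cholesky factor to induce the common correlation. All variable names are chosen for this sketch.

```r
# Base-R sketch of the LSS data-generating process (not the package's code).
set.seed(1)
n <- 200
p_uncorr <- 4
p_corr <- 4
corr <- 0.7
betas <- c(3, -1)   # one coefficient per interaction
intercept <- 0

# Uncorrelated block: i.i.d. standard Gaussian columns.
X_uncorr <- matrix(rnorm(n * p_uncorr), nrow = n)

# Correlated block: unit variances and common pairwise correlation `corr`.
Sigma <- matrix(corr, p_corr, p_corr)
diag(Sigma) <- 1
X_corr <- matrix(rnorm(n * p_corr), nrow = n) %*% chol(Sigma)

X <- cbind(X_uncorr, X_corr)

# Order-2 interactions on the first two features of each block:
# features {1, 2} (uncorrelated) and {5, 6} (correlated).
support <- list(1:2, p_uncorr + 1:2)

# Each term is prod_j 1(X_j > 0), i.e. 1 only if all features exceed 0.
terms <- sapply(support, function(S) {
  as.numeric(apply(X[, S, drop = FALSE] > 0, 1, all))
})

# Response: intercept + weighted sum of indicator products + N(0, 1) error.
y <- intercept + drop(terms %*% betas) + rnorm(n)
```

Multiplying i.i.d. standard normal draws by chol(Sigma) yields columns with covariance Sigma, which is one common way to obtain an equicorrelated Gaussian block.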

For more details on the LSS model, see Behr, Merle, et al. "Provable Boolean Interaction Recovery from Tree Ensemble obtained via Random Forests." arXiv preprint arXiv:2102.11800 (2021).


# generate data from: y = 1(X_1 > 0, X_2 > 0) + 1(X_3 > 0, X_4 > 0), where
# X is a 100 x 10 correlated Gaussian random matrix with
# Var(X_i) = 1 for all i and Cor(X_i, X_j) = 0.7 for all i != j
sim_data <- correlated_lss_gaussian_dgp(n = 100, p_uncorr = 0, p_corr = 10,
                                        k = 2, s_corr = 2, corr = 0.7,
                                        thresholds = 0, signs = 1, betas = 1)

# generate data from: y = 3 * 1(X_1 > 0, X_2 > 0) - 1(X_11 > 0, X_12 > 0) + N(0, 1),
# where X = [Z, U], Z is a 100 x 10 standard Gaussian random matrix,
# U is a 100 x 10 Gaussian random matrix with Var(U_i) = 1 and Cor(U_i, U_j) = 0.7
sim_data <- correlated_lss_gaussian_dgp(n = 100, p_uncorr = 10, p_corr = 10,
                                        s_uncorr = 1, s_corr = 1, corr = 0.7,
                                        k = 2, betas = c(3, -1), err = rnorm)

# generate data from: y = \sum_{i = 1}^{4} \prod_{j = 1}^{2} 1(X_{s_j} > 0),
# where s_j \in {1:4, 11:14} are randomly selected indices, X = [Z, U],
# Z is a 100 x 10 standard Gaussian random matrix, U is a 100 x 10 Gaussian
# random matrix with Var(U_i) = 1 and Cor(U_i, U_j) = 0.7
# i.e., interactions may consist of both correlated and uncorrelated features
sim_data <- correlated_lss_gaussian_dgp(n = 100, p_uncorr = 10, p_corr = 10,
                                        s_uncorr = 2, s_corr = 2, k = 2,
                                        corr = 0.7, mixed_int = TRUE)