Splits data into training, validation, and test sets

Given (X, y) data, splits the data into training, validation, and test partitions according to the specified proportions. If `stratified_by` is provided, the split is stratified (or clustered): the random partitioning is performed within each group.

split_data(
  X,
  y,
  stratified_by = NULL,
  train_prop = 0.6,
  valid_prop = 0.2,
  test_prop = 0.2
)

Arguments

X: A data matrix or data frame.
y: A response vector.
stratified_by: An optional vector of group IDs to stratify by. That is, the random partitioning occurs within each group so that the group proportions are similar across the training, validation, and test sets. Must be the same length as `y`. If NULL (default), the full data set is randomly partitioned into training, validation, and test sets without regard to groups.
train_prop: Proportion of data in training set. Default is 0.6.
valid_prop: Proportion of data in validation set. Default is 0.2.
test_prop: Proportion of data in test set. Default is 0.2.
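
To make the within-group partitioning described for `stratified_by` concrete, here is a minimal base R sketch. It is not the implementation of `split_data()`: the helper `assign_partitions()` is hypothetical, and it assumes that group sizes are rounded down for the training and validation sets, with the remainder going to the test set.

```r
# Hypothetical helper (not part of the package): labels each observation as
# "train", "validate", or "test", partitioning separately within each group.
assign_partitions <- function(groups, train_prop = 0.6, valid_prop = 0.2) {
  # Assumption: whatever is not assigned to train/validate becomes test.
  labels <- rep("test", length(groups))
  for (idx in split(seq_along(groups), groups)) {
    n <- length(idx)
    n_train <- floor(train_prop * n)
    n_valid <- floor(valid_prop * n)
    # Shuffle the row indices of this group only.
    shuffled <- idx[sample.int(n)]
    labels[shuffled[seq_len(n_train)]] <- "train"
    labels[shuffled[n_train + seq_len(n_valid)]] <- "validate"
  }
  labels
}

set.seed(1)
partition <- assign_partitions(iris$Species)
# Each species contributes 30 training, 10 validation, and 10 test rows.
table(iris$Species, partition)
```

Because the shuffle happens inside each group, every partition keeps the same group proportions as the full data set, up to rounding.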

Value

A list with two elements:

X: A list of three data matrices or data frames named `train`, `validate`, and `test` containing the training, validation, and test X partitions, respectively.
y: A list of three vectors named `train`, `validate`, and `test` containing the training, validation, and test y partitions, respectively.
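
The structure of the return value can be illustrated with a hand-built stand-in. The sketch below does not call `split_data()`; it assembles a list with the documented shape from fixed, arbitrarily chosen row indices (the name `mock_split` is hypothetical) to show how the partitions are accessed.

```r
# Stand-in for the documented return value, built by hand from arbitrary
# row indices; a real call to split_data() would choose rows at random.
X <- iris[, names(iris) != "Species"]
y <- iris$Species
rows <- list(train = 1:90, validate = 91:120, test = 121:150)

mock_split <- list(
  X = lapply(rows, function(i) X[i, , drop = FALSE]),
  y = lapply(rows, function(i) y[i])
)

names(mock_split)          # "X" "y"
names(mock_split$X)        # "train" "validate" "test"
dim(mock_split$X$train)    # 90 rows, 4 columns
length(mock_split$y$test)  # 30
```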

Examples

# `%>%` is provided by magrittr (it is also re-exported by dplyr)
library(magrittr)

# splits iris data into training (60%), validation (20%), and test (20%) sets
data_split <- split_data(X = iris %>% dplyr::select(-Species),
                         y = iris$Species,
                         train_prop = 0.6, valid_prop = 0.2, test_prop = 0.2)

# splits iris data into training, validation, and test sets while keeping
# `Species` distribution constant across partitions
stratified_data_split <- split_data(X = iris %>% dplyr::select(-Species),
                                    y = iris$Species,
                                    stratified_by = iris$Species,
                                    train_prop = 0.6, valid_prop = 0.2,
                                    test_prop = 0.2)
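
One way to check that a stratified split kept the `Species` distribution roughly constant is to tabulate class proportions per partition. The helper `class_proportions()` below is hypothetical and not part of the package; to keep the sketch self-contained it runs on a stand-in list shaped like the documented `y` element rather than on actual `split_data()` output.

```r
# Hypothetical helper (not part of the package): class proportions within
# each partition, given a list of label vectors named by partition.
class_proportions <- function(y_parts) {
  sapply(y_parts, function(v) prop.table(table(v)))
}

# Stand-in for `stratified_data_split$y`, assuming 30/10/10 rows per species.
species <- levels(iris$Species)
y_parts <- list(
  train    = factor(rep(species, each = 30), levels = species),
  validate = factor(rep(species, each = 10), levels = species),
  test     = factor(rep(species, each = 10), levels = species)
)

# One column per partition, one row per species; every entry is 1/3 here.
class_proportions(y_parts)

# With the package loaded, the same check would be:
# class_proportions(stratified_data_split$y)
```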