Plots a summary of the feature distributions in the training, validation, and test data, either for all features together or separately per feature, so that distributional shifts between partitions can be spotted quickly. Only continuous (i.e., numeric) and categorical (i.e., character or factor) features are plotted.

plot_data_split(
  train = NULL,
  valid = NULL,
  test = NULL,
  by_feature = NULL,
  plot_type = "auto",
  xlab = "Value",
  title = NULL,
  plot_heights = 1,
  theme_options = NULL,
  ...
)

Arguments

train

Training data matrix, data frame, or vector.

valid

Validation data matrix, data frame, or vector.

test

Test data matrix, data frame, or vector.

by_feature

Logical. If TRUE, plots the distribution of each feature separately. If FALSE, plots the distribution of all features together. If NULL (the default), it is set to TRUE when there are fewer than 10 features and FALSE otherwise.

plot_type

Type of plot. Default is "auto", which uses a kernel density plot for continuous features and a bar plot for categorical features. If not "auto", `plot_type` should be a list with two named elements, `continuous` and `categorical`, giving the type of plot to use for continuous and categorical features, respectively. The `continuous` element must be one of "density", "histogram", or "boxplot"; the `categorical` element must be "bar" (with more options to come).

xlab

X-axis label.

title

Plot title.

plot_heights

(Optional) numeric vector of relative row heights of subplots. Only used if both continuous and categorical features are found in the data. For example, plot_heights = c(2, 1) would make the first row twice as tall as the second row.

theme_options

(Optional) list of arguments to pass to vthemes::theme_vmodern().

...

Additional arguments to pass to ggplot2::geom_*().
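
As a minimal sketch of the non-default argument forms described above, the snippet below builds a `plot_type` list and a `plot_heights` vector and checks them against the documented options. The specific choices ("histogram", c(2, 1)) are illustrative assumptions, not recommended settings.

```r
# Illustrative values only: a histogram for continuous features and a bar
# plot for categorical features, as permitted by the plot_type description.
plot_type <- list(continuous = "histogram", categorical = "bar")

# Relative row heights: the first row is twice as tall as the second.
plot_heights <- c(2, 1)

# Check the values against the documented options before plotting.
stopifnot(
  identical(names(plot_type), c("continuous", "categorical")),
  plot_type$continuous %in% c("density", "histogram", "boxplot"),
  plot_type$categorical == "bar",
  is.numeric(plot_heights)
)
```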

Value

A ggplot object.
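
Examples

The following is a hedged usage sketch rather than an official example. It assumes the package that provides `plot_data_split()` is already attached, and it uses the built-in `iris` data with an arbitrary three-way random split chosen purely for illustration. Because `iris` has four numeric columns and one factor column, both continuous and categorical features are present, so `plot_heights` applies.

```r
# Arbitrary random split of iris into three equal partitions
# (an assumption for illustration, not a recommended splitting scheme).
set.seed(123)
partition <- sample(rep(c("train", "valid", "test"), length.out = nrow(iris)))
train <- iris[partition == "train", ]
valid <- iris[partition == "valid", ]
test <- iris[partition == "test", ]

# Only call the function if it is available in the current session,
# i.e. the package providing plot_data_split() has been attached.
if (exists("plot_data_split")) {
  p <- plot_data_split(
    train = train,
    valid = valid,
    test = test,
    by_feature = TRUE,        # one panel per feature
    plot_heights = c(2, 1)    # continuous row twice as tall as categorical row
  )
  print(p)
}
```

Setting `by_feature = FALSE` instead would plot the distribution of all features together, per the argument description.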