Plots summary of data split distributions — plot_data_split

Given the training, validation, and test data, plots a summary of the feature distributions (either together or separately per feature) to quickly examine any distributional shifts between partitions. Only continuous (i.e., numeric) and categorical (i.e., character or factor) features are used for plotting.

plot_data_split(
  train = NULL,
  valid = NULL,
  test = NULL,
  by_feature = NULL,
  plot_type = "auto",
  xlab = "Value",
  title = NULL,
  plot_heights = 1,
  theme_options = NULL,
  ...
)

Arguments

train: Training data matrix, data frame, or vector.
valid: Validation data matrix, data frame, or vector.
test: Test data matrix, data frame, or vector.
by_feature: Logical. If TRUE, plots the distribution of each feature separately. If FALSE, plots the distribution of all features together. If left as NULL (the default), it is set to TRUE when there are fewer than 10 features and FALSE otherwise.
plot_type: Type of plot. Default is "auto", which uses a kernel density plot for continuous features and a bar plot for categorical features. If not "auto", `plot_type` should be a list with two named elements, `continuous` and `categorical`, giving the type of plot to use for continuous and categorical features, respectively. The `continuous` element must be one of "density", "histogram", or "boxplot". The `categorical` element must be "bar" (with more options to come).
xlab: X-axis label.
title: Plot title.
plot_heights: (Optional) numeric vector of relative row heights of subplots. Only used if both continuous and categorical features are found in the data. For example, plot_heights = c(2, 1) would make the first row twice as tall as the second row.
theme_options: (Optional) list of arguments to pass to vthemes::theme_vmodern().
...: Additional arguments to pass to ggplot2::geom_*().
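
A minimal sketch of how the arguments above fit together. It assumes that `plot_data_split()` is already available in the session; this page does not name the package that exports it, so the call is guarded with `exists()`. The 90/30/30 split of the built-in `iris` data and the names `idx`, `plot_type`, and `p` are illustrative choices, not part of the documented interface.

```r
# Illustrative 90/30/30 split of the built-in iris data (an assumption made
# for this example; any train/validation/test partition works the same way).
set.seed(1)
idx <- sample(rep(c("train", "valid", "test"), times = c(90, 30, 30)))
train <- iris[idx == "train", ]
valid <- iris[idx == "valid", ]
test  <- iris[idx == "test", ]

# A non-"auto" plot_type: a list with the named elements `continuous` and
# `categorical`, as described in the argument list above.
plot_type <- list(continuous = "histogram", categorical = "bar")

# Guarded call: runs only if plot_data_split() is available in the session.
if (exists("plot_data_split")) {
  p <- plot_data_split(
    train = train, valid = valid, test = test,
    by_feature = TRUE,
    plot_type = plot_type,
    plot_heights = c(2, 1),  # first row twice as tall as the second row
    title = "Feature distributions by data split"
  )
  print(p)
}
```

Because `iris` contains four numeric columns and one factor column, both continuous and categorical features are present, which is the case in which `plot_heights` is used.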

Value

A ggplot object.
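
Since the return value is a ggplot object, it can in principle be extended and saved with standard ggplot2 tools. The sketch below is hedged accordingly: it assumes that `plot_data_split()` is available in the session and that ggplot2 is installed, and it uses only numeric columns so that a single panel type is involved. Whether added layers apply cleanly when continuous and categorical subplots are combined is not stated on this page. The names `num_cols`, `out_file`, and `p` are illustrative.

```r
# Numeric columns of iris only, so a single plot type is involved
# (an assumption made to keep the example simple).
num_cols <- iris[, 1:4]
train <- num_cols[1:90, ]
valid <- num_cols[91:120, ]
test  <- num_cols[121:150, ]

# Temporary output path for the saved figure.
out_file <- tempfile(fileext = ".png")

# Guarded: runs only if both plot_data_split() and ggplot2 are available.
if (exists("plot_data_split") && requireNamespace("ggplot2", quietly = TRUE)) {
  p <- plot_data_split(train = train, valid = valid, test = test)

  # Add a caption using a standard ggplot2 layer.
  p <- p + ggplot2::labs(caption = "Rows 1-90 train, 91-120 valid, 121-150 test")

  # Save the plot to disk.
  ggplot2::ggsave(out_file, plot = p, width = 8, height = 5)
}
```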