Given the large amount of computation that simulation studies
require, one of the main goals of simChef
is to make it
easy to parallelize your simulations. simChef
uses the R
package future
to
distribute simulation replicates across whatever available resources the
user specifies. All you have to do to start running your simulations in
parallel is set the future
plan before calling
run_experiment()
:
n_workers <- availableCores() - 1
plan(multisession, workers = n_workers)
The multisession
plan used here will run your simulation
experiments on a local (i.e., where R is running) Linux, macOS, or
Windows machine, in this case using all but one of the cores. This is
very convenient, but it’s important to carefully consider two aspects of
the distributed computation in order to effectively parallelize the
simulations: what plan to use and how to use it.
While simChef
works with any valid future
plan, one may be better than another for your particular set of
experiments. We recommend you carefully read the future
docs
to learn more about the default plans, as well as alternative plans in
packages like future.callr
and future.batchtools
.
Simulation tasks
When a future
plan has been set and the user calls
run_experiment
, simChef
will distribute
computation across the resources specified in the plan. Consider
n
computational “tasks” to be distributed across
p
parallel workers. In simChef
, tasks
correspond to simulation replicates, which generate data from a single
DGP
and fit that data using a single Method
,
along with associated parameters (either defaults or from those that
have been varied in the Experiment
).
Assuming each task takes approximately the same amount of time to
complete regardless of the worker assigned to the task, then with
n=100
and p=4
each worker should complete
around 25 of the tasks. In the ideal setting, the total time to complete
the 100 tasks should be around 4 times lower than the time it takes one
worker to complete them, on average.
Dealing with task heterogeneity
In more realistic scenarios–and especially in simulation experiments which often include heterogeneous methods compared under diverse data-generated processes for a range of sample sizes–tasks can be much less uniform. Different groupings of tasks can have profound implications for the overall running time. Therefore, it’s important to carefully decide how to arrange your simulation into separate experiments in order to take greatest advantage of the available parallelism.
simChef
distributes the simulation’s replicates evenly
across available future
workers, partially answering the
how question. The remainder of the answer comes from you and
your specific application, but here are a couple tips:
- In general, one should not have fewer tasks than workers and should
avoid
n>>p
very small tasks as the overhead of distributing computation to workers may outweigh the benefits of parallelism. - When tasks have unbalanced sizes, it can be helpful to group tasks
into separate experiments, each of which has tasks of roughly equal
duration. In spite of the extra overhead, you may find that using a
separate
Experiment
for each task group ends up decreasing the overall simulation running time because workers with small tasks spend less time idly waiting for workers with large tasks to finish. Using theclone_from
argument increate_experiment()
, you can copy an existing experiment and modify it so that tasks have similar sizes, repeating this process for each group of similarly-sized tasks. - You can use the
progressr
package to get updates as the experiment computation progresses. - Use
options(simChef.debug = TRUE)
to get helpful debugging output as anExperiment
works on it’s tasks, including info on memory usage. This may slow things down quite a bit, so don’t use it when you run the full simulation.
On the roadmap: nested parallelism
In the future we plan to give more control over how the user splits
the computation across workers, with nested parallelism for cases where,
e.g., DGPs
can be split across a few nodes (e.g., using one
of the plan in the package future.batchtools
)
and each node uses many cores to process the replicates in parallel
(e.g., using the future::multicore
plan).
If this is something you’re interested in, please feel free to contribute to the discussion at https://github.com/Yu-Group/simChef/issues/54.
Example
Putting aside the caveats above for now, parallelization in
simChef
works without modification other than using
future
to set a parallel backend. In the example below, we
choose the multicore
backend (not available on Windows) to
create forked R processes using all of the available cores.
This example shows how total replicates can quickly add up when
varying across DGP
or Method
parameters. By
varying across parameters of one of the DGPs
, we in effect
have 17 distinct data generating processes in the experiment (1 for
dgp1
and 16 for the combinations of parameters to
dgp2
), though in actuality there are only two
DGP
objects. Similarly, we effectively have 4 distinct
methods, though there are only 2 Method
objects. With
n_reps = 2
, this results in a total of 2 x 17 x 4 = 136
total rows in the results tibble
.
library(simChef)
library(future)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
n_cores <- availableCores(methods = "system")
n_cores
#> system
#> 4
plan(multicore, workers = n_cores)
dgp_fun1 <- function(n=100, rho=0.5, noise_level=1) {
cov_mat <- diag(nrow = 5)
cov_mat[cov_mat == 0] <- rho
X <- MASS::mvrnorm(n = n, mu = rep(0, 5), Sigma = cov_mat)
y <- cbind(1, X) %*% c(-8, 3, -1, 0, 0, 0) + rnorm(n, sd = noise_level)
return(list(X = X, y = y))
}
dgp_fun2 <- function(n=100, d=100, rho=0.5, sparsity=0.5, noise_level=1,
nonzero_coeff = c(-3, -1, 1, 3)) {
cov_mat <- diag(nrow = d)
cov_mat[cov_mat == 0] <- rho
X <- MASS::mvrnorm(n = n, mu = rep(0, d), Sigma = cov_mat)
coeff_prob <- c(sparsity, rep((1 - sparsity) / 4, times = 4))
coeff <- c(
-8, # intercept
sample(
c(0, nonzero_coeff), size = d, replace = TRUE,
prob = coeff_prob
)
)
y <- cbind(1, X) %*% coeff + rnorm(n, sd = noise_level)
return(list(X = X, y = y))
}
dgp1 <- create_dgp(dgp_fun1, .name = "dense_dgp")
dgp2 <- create_dgp(dgp_fun2, .name = "sparse_dgp")
ols <- function(X, y) {
fit <- lm(y ~ X) %>% broom::tidy()
return(fit)
}
elnet <- function(X, y, alpha=1) {
fit <- glmnet::glmnet(
x = X, y = y, family = "gaussian", alpha = alpha
) %>% broom::tidy()
return(fit)
}
method1 <- create_method(ols, .name = "ols")
method2 <- create_method(elnet, .name = "elnet")
experiment <- create_experiment(
name = "exper", future.packages = "dplyr") %>%
add_dgp(dgp1) %>%
add_dgp(dgp2) %>%
add_method(method1) %>%
add_method(method2) %>%
add_vary_across(
.dgp = "sparse_dgp",
d = c(100, 1000),
rho = c(0.2, 0.9),
sparsity = c(0.5, 0.9),
nonzero_coeff = list(c(-3, -1, 1, 3), c(-0.3, -0.1, 0.1, 0.3))
) %>%
add_vary_across(
.method = "elnet", alpha = c(0, 0.5, 1)
)
results <- experiment$fit(n_reps = 2)
#> Fitting exper...
#> 2 reps completed (totals: 2/2) | time taken: 0.562949 minutes
#> ==============================
results
#> # A tibble: 136 × 16
#> .rep .dgp_name .method_name d rho sparsity nonzero_coeff alpha term
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <list> <dbl> <list>
#> 1 1 dense_dgp elnet NA NA NA <NULL> 0 <chr>
#> 2 1 dense_dgp elnet NA NA NA <NULL> 0.5 <chr>
#> 3 1 dense_dgp elnet NA NA NA <NULL> 1 <chr>
#> 4 1 dense_dgp ols NA NA NA <NULL> NA <chr>
#> 5 1 sparse_dgp elnet 100 0.2 0.5 <dbl [4]> 0 <chr>
#> 6 1 sparse_dgp elnet 100 0.2 0.5 <dbl [4]> 0.5 <chr>
#> 7 1 sparse_dgp elnet 100 0.2 0.5 <dbl [4]> 1 <chr>
#> 8 1 sparse_dgp elnet 100 0.2 0.5 <dbl [4]> 0 <chr>
#> 9 1 sparse_dgp elnet 100 0.2 0.5 <dbl [4]> 0.5 <chr>
#> 10 1 sparse_dgp elnet 100 0.2 0.5 <dbl [4]> 1 <chr>
#> # ℹ 126 more rows
#> # ℹ 7 more variables: estimate <list>, std.error <list>, statistic <list>,
#> # p.value <list>, step <list>, lambda <list>, dev.ratio <list>
If we find lower computational resource utilization than we’d like,
as the simulation grows we might consider breaking this experiment up
into separate experiments, e.g., by DGP, method, or parameters like
sample size n
and number of covariates d
,
depending on which factors have the greatest impact on task
duration.