Here I am: finally a sunny winter morning, which is quite rare where I live, a cup of hot tea, and me dreaming about the parallel universe in which I am the best coder ever...

Then a user comes in: you know, either I'm not using the parallel package properly, or your logger is doing something wrong. Me, inside: oh my, let it be the parallel package. Out loud: OK, show me what you are doing.

This is what he was doing (a simplified version):

cl <- parallel::makeCluster(3, type = "FORK")
doParallel::registerDoParallel(cl)
on.exit(parallel::stopCluster(cl), add = TRUE, after = FALSE)

`%dopar%` <- foreach::`%dopar%`
foreach::foreach(
  i = 1:3,
  .verbose = TRUE
) %dopar% {
  logger(i)
}

The logger was a somewhat complicated function, basically writing its input into a database table. The resulting table contained 3 rows, as expected, but all the rows were identical, and the value was not the same from run to run: sometimes they all contained 1, sometimes 3, and so on.

It took me a while to understand what was happening: for technical reasons, the logger would first write the files to disk and then to the database. The names of the files were built from random strings:

random_string <- function() {
  digits <- 0:9
  out <- c(
    sample(LETTERS, 5, replace = TRUE),
    sample(digits, 4, replace = TRUE),
    sample(letters, 1, replace = TRUE)
  )
  
  out <- paste0(out, collapse = "")
  
  return(paste0("random_", out))
}

Everything was OK when running sequentially; the problem appeared only when running in parallel. After enabling some debugging flags, I realised that in the parallel case only one file was generated, and it got overwritten every time. This explained why the entries in the database were duplicates.

The open question was why this happened. Then I remembered that there was something about clusters and the seed for random number generators... As far as I understand (please correct me if I'm wrong, I'm no expert in this field), a FORK cluster creates its workers as copies of the main R process, so they all start from the same random number generator state. This means that every worker generates the same random string.
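To make this more tangible, here is a minimal sketch under a few assumptions: a Unix-like system (FORK clusters are not available on Windows), and arbitrary seed values (1 and 2021) that I picked only for illustration. It first lets three forked workers draw without any per-worker seeding, then uses parallel::clusterSetRNGStream(), the base R way to hand each worker its own L'Ecuyer-CMRG stream, and draws again. It is not the logger code, just an isolated experiment.

```r
# Sketch only: FORK clusters need a Unix-like OS.
set.seed(1)  # arbitrary seed; gives the main process an RNG state to be copied

cl <- parallel::makeCluster(3, type = "FORK")

# Each worker draws five letters; no per-worker seeding has happened yet,
# so all workers start from the state copied from the main process.
inherited <- parallel::clusterEvalQ(
  cl, paste(sample(LETTERS, 5, replace = TRUE), collapse = "")
)

# Give every worker its own L'Ecuyer-CMRG stream (arbitrary iseed), then draw again.
parallel::clusterSetRNGStream(cl, iseed = 2021)
separate <- parallel::clusterEvalQ(
  cl, paste(sample(LETTERS, 5, replace = TRUE), collapse = "")
)

parallel::stopCluster(cl)

print(unlist(inherited))
print(unlist(separate))
```

I would expect the first vector to contain the same string three times and the second one to contain three different strings.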

Here I found a nice explanation about seeds for random number generators and advice on what to do when running code in parallel. In summary, if you know that your code uses random numbers, it's better to use doRNG::`%dorng%` instead of foreach::`%dopar%`.

Here is the code to visualize this:

logger <- function(x, do_rng) {
  msg <- paste(
    "now: ", as.character(Sys.time()), 
    " | x: ", x, 
    " | do_rng: ", do_rng,
    " | random : ", random_string(), 
    "\n"
    )
  cat(msg, file = "/tmp/delme.log", append = TRUE)
}

cl <- parallel::makeCluster(3, type = "FORK")
doParallel::registerDoParallel(cl)
on.exit(parallel::stopCluster(cl), add = TRUE, after = FALSE)


`%dopar%` <- foreach::`%dopar%`
foreach::foreach(
  i = 1:3,
  .verbose = TRUE
) %dopar% {
  logger(x = i, do_rng = FALSE)
}

`%dorng%` <- doRNG::`%dorng%`
foreach::foreach(
  i = 1:3,
  .verbose = TRUE
) %dorng% {
  logger(x = i, do_rng = TRUE)
}

With doRNG::`%dorng%`, the random strings differ, since every iteration gets its own random number stream:

readLines(con = "/tmp/delme.log")
[1] "now:  2021-01-22 23:14:58  | x:  2  | do_rng:  FALSE  | random :  random_WLATF7819m "
[2] "now:  2021-01-22 23:14:58  | x:  1  | do_rng:  FALSE  | random :  random_WLATF7819m "
[3] "now:  2021-01-22 23:14:58  | x:  3  | do_rng:  FALSE  | random :  random_WLATF7819m "
[4] "now:  2021-01-22 23:14:59  | x:  1  | do_rng:  TRUE  | random :  random_NGYRW5295g " 
[5] "now:  2021-01-22 23:14:59  | x:  3  | do_rng:  TRUE  | random :  random_GXSAO7727j " 
[6] "now:  2021-01-22 23:14:59  | x:  2  | do_rng:  TRUE  | random :  random_SMFGG1453h " 
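A side effect worth knowing: as far as I can tell from the doRNG documentation, calling set.seed() before a `%dorng%` loop makes the whole loop reproducible. The sketch below assumes a default (PSOCK) cluster with 2 workers and an arbitrary seed of 123; runif(1) simply stands in for any code that uses random numbers.

```r
cl <- parallel::makeCluster(2)  # default PSOCK cluster, not FORK
doParallel::registerDoParallel(cl)

`%dorng%` <- doRNG::`%dorng%`

set.seed(123)  # arbitrary seed
first <- foreach::foreach(i = 1:3, .combine = c) %dorng% runif(1)

set.seed(123)  # same seed again
second <- foreach::foreach(i = 1:3, .combine = c) %dorng% runif(1)

parallel::stopCluster(cl)

# as.numeric() drops the bookkeeping attributes that doRNG attaches to the result
print(all.equal(as.numeric(first), as.numeric(second)))
```

So the iterations get different numbers from each other, but the same numbers again when the loop is repeated with the same seed.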

But what do you do if some of the packages you depend on contain code that uses random number generators?

The doFuture package does a very nice job here. If you do this:

doFuture::registerDoFuture()
future::plan("multisession")

`%dopar%` <- foreach::`%dopar%`
foreach::foreach(
  i = 1:3,
  .verbose = TRUE
) %dopar% {
  logger(x = i, do_rng = FALSE)
}

you get a nice warning which should wake you up:

Warning messages:
1: UNRELIABLE VALUE: One of the foreach() iterations (‘doFuture-1’) unexpectedly generated 
random numbers without declaring so. There is a risk that those random numbers are not 
statistically sound and the overall results might be invalid. To fix this, use '%dorng%' from 
the 'doRNG' package instead of '%dopar%'. This ensures that proper, parallel-safe random 
numbers are produced via the L'Ecuyer-CMRG method. To disable this check, set option 
'future.rng.onMisuse' to "ignore". 
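Following the advice in the warning, the same loop can be written with `%dorng%` while doFuture stays registered as the backend. This is only a sketch: the 2 workers are my choice, and the loop body is a shortened stand-in for random_string() so that the snippet runs on its own.

```r
doFuture::registerDoFuture()
future::plan("multisession", workers = 2)

`%dorng%` <- doRNG::`%dorng%`

# Shortened stand-in for random_string(): "random_" plus five capital letters.
res <- foreach::foreach(i = 1:3, .combine = c) %dorng% {
  paste0("random_", paste(sample(LETTERS, 5, replace = TRUE), collapse = ""))
}

future::plan("sequential")  # shut the background workers down again

print(as.character(res))
```

With this version I would expect three different strings and no warning about unreliable values.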

Something learned today. I hope it helps you too. As for me: I guess I have to postpone my jump to that parallel universe...

 
