Running parallel code in R can be much more efficient on Linux than on Windows, largely because Linux supports process forking, which mclapply() relies on. After successfully getting parallel processing to work on Windows using doSNOW, I wanted to move my code to a Linux environment and make use of doMC.
It should’ve been straightforward, but I ran into a frustrating and cryptic error when I tried to combine foreach, doMC, and parallel processing with mclapply. This blog post documents how I debugged and fixed the issue, and how you can avoid it, too.
The Initial Code
Here’s the first version of the code I tried to run on Linux:
foreachFunc <- function(Data) {
  RowFunction <- function(d) {
    chisq.test(d)$p.value
  }
  P <- as.matrix(apply(Data, 1, RowFunction))
  return(P)
}

library(doMC)
library(foreach)

number_of_cpus <- 4
cl <- makeCluster(number_of_cpus)
registerDoMC(cl)

Chunks <- c(1:NROW(Data_new)) %% 4
P <- foreach(i = 0:3, .combine = rbind, mc.cores = 4) %dopar% {
  foreachFunc(Data_new[Chunks == i, ])
}
stopCluster(cl)
The Error I Got
Error in mclapply(argsList, FUN, mc.preschedule = preschedule, mc.set.seed = set.seed, :
(list) object cannot be coerced to type 'integer'
What Went Wrong
The error message doesn’t make it obvious, but here’s what I discovered:
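- I was mixing doMC and makeCluster(), which is a big no-no. makeCluster() and stopCluster() are meant for doParallel or doSNOW. doMC is simpler and works only on Unix-like systems: it uses mclapply() internally, so no cluster setup is required. As far as I can tell, registerDoMC() expects a core count as its first argument, so passing it the cluster object meant mclapply() later tried to coerce a list to an integer, which is exactly what the error complains about.
- I also mistakenly added mc.cores = 4 inside the foreach() call, which is not valid when using %dopar% with doMC. That argument belongs to mclapply() when you call it directly. foreach() treats named arguments that don’t start with a dot as iteration variables, so mc.cores = 4 becomes a one-element iterator and the loop stops after a single iteration.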
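To see why the stray mc.cores argument matters, here is a small sketch using the sequential %do% operator, so no parallel backend is needed. It relies on foreach iterating over all of its named arguments in lockstep and stopping at the shortest one; the variable names intended and truncated are only for illustration.

```r
library(foreach)

# Intended loop: four iterations, one per chunk index.
intended <- foreach(i = 0:3, .combine = c) %do% {
  i
}

# Same loop with a stray mc.cores argument. foreach() treats it as a second
# iteration variable of length one, so only the first iteration runs.
truncated <- foreach(i = 0:3, mc.cores = 4, .combine = c) %do% {
  i
}

length(intended)   # four results
length(truncated)  # a single result
```

So even without the cluster mix-up, the original loop would have silently processed only one chunk.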
The Fix and Improved Code
Here’s the corrected and working version of the code:
library(doMC)
library(foreach)

# Register parallel backend
number_of_cpus <- 4
registerDoMC(cores = number_of_cpus)

# Simulate sample data
set.seed(123)
Data_new <- matrix(sample(1:10, 1000, replace = TRUE), nrow = 100)

# Function to apply chisq.test row-wise
foreachFunc <- function(Data) {
  RowFunction <- function(d) {
    chisq.test(d)$p.value
  }
  P <- apply(Data, 1, RowFunction)
  return(P)
}

# Split data into chunks for parallelism
Chunks <- c(1:NROW(Data_new)) %% number_of_cpus

# Run parallel computation
# drop = FALSE keeps a one-row chunk as a matrix so apply() still works
# Note: results come back in chunk order, not the original row order
P <- foreach(i = 0:(number_of_cpus - 1), .combine = c) %dopar% {
  foreachFunc(Data_new[Chunks == i, , drop = FALSE])
}
And just like that, no more errors, and the computation ran beautifully in parallel.
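One thing to watch with the modulo-based chunks: .combine = c concatenates results chunk by chunk, so P comes back as rows 4, 8, ... followed by rows 1, 5, ... rather than in the original row order. Here is a minimal sketch of one way around that, assuming contiguous chunks are acceptable for your workload. It uses parallel::splitIndices() to build the chunks; the names row_p_value, row_chunks, and P_ordered are illustrative, and the worker count of 2 is arbitrary.

```r
library(doMC)      # Unix-like systems only
library(foreach)

registerDoMC(cores = 2)

set.seed(123)
Data_new <- matrix(sample(1:10, 1000, replace = TRUE), nrow = 100)

row_p_value <- function(d) chisq.test(d)$p.value

# Contiguous blocks of row indices, one block per worker
row_chunks <- parallel::splitIndices(nrow(Data_new), 2)

# The blocks are contiguous and foreach returns results in task order,
# so concatenating them reproduces the original row order.
P_ordered <- foreach(idx = row_chunks, .combine = c) %dopar% {
  apply(Data_new[idx, , drop = FALSE], 1, row_p_value)
}

# Compare against the serial result, row for row
serial <- apply(Data_new, 1, row_p_value)
all.equal(P_ordered, serial)
```

If you prefer to keep the interleaved chunks, the alternative is to record the row indices of each chunk and reorder P afterwards.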
Extra Features I Added for Practice
Once the core code was working, I added a few useful enhancements to explore the power of parallelization further:
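Add a Progress Bar (With pbapply)
If you want to visualize the progress of your loop, pbapply is a super helpful package. Note that the call below runs serially; it shows progress but does not parallelize anything by itself: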
library(pbapply)
p_values <- pbapply(Data_new, 1, function(d) chisq.test(d)$p.value)
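The pbapply functions also accept a cl argument. According to the package documentation, passing an integer there forks that many worker processes on Unix-like systems, while on Windows the integer is ignored. Here is a hedged sketch using pbsapply() over row indices; the helper name row_p_value and the worker count of 2 are my own choices for illustration.

```r
library(pbapply)

set.seed(123)
Data_new <- matrix(sample(1:10, 1000, replace = TRUE), nrow = 100)

# Compute the p-value for row i of the simulated matrix
row_p_value <- function(i) chisq.test(Data_new[i, ])$p.value

# cl = 2L asks pbapply to fork two worker processes (Unix-like systems);
# on Windows the integer is ignored and the loop runs serially.
p_values <- pbsapply(seq_len(nrow(Data_new)), row_p_value, cl = 2L)

length(p_values)
```

With forked workers the progress bar updates in batches rather than per row, so it will look coarser than in the serial case.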
Benchmark Performance (With microbenchmark)
Let’s see whether the parallel code is actually faster. With a matrix this small, the overhead of forking workers can easily outweigh the gain, so it’s worth trying larger data too:
library(microbenchmark)
microbenchmark(
  serial = apply(Data_new, 1, function(d) chisq.test(d)$p.value),
  parallel = foreach(i = 0:(number_of_cpus - 1), .combine = c) %dopar% {
    foreachFunc(Data_new[Chunks == i, , drop = FALSE])
  },
  times = 5
)
Handle Errors Gracefully (With tryCatch)
chisq.test() can fail if the input isn’t suitable. I wrapped it in tryCatch() to avoid crashing mid-loop:
RowFunction <- function(d) {
  tryCatch({
    chisq.test(d)$p.value
  }, error = function(e) {
    NA
  })
}
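A quick way to convince yourself the wrapper behaves as intended is to feed it one valid row and one invalid row. The sketch below is illustrative only: safe_p_value, good_row, and bad_row are made-up names, and it returns NA_real_ instead of NA so the combined result stays numeric.

```r
# Hypothetical helper, not part of the original code
safe_p_value <- function(d) {
  tryCatch(
    chisq.test(d)$p.value,
    error = function(e) NA_real_  # keep the result numeric
  )
}

good_row <- c(5, 7, 6, 8)
bad_row  <- c(-1, 2, 3, 4)  # negative counts make chisq.test() stop with an error

p <- c(safe_p_value(good_row), safe_p_value(bad_row))
is.na(p)  # only the second entry is NA
```

Keep in mind that this swallows the error message; if you need to know why a row failed, log conditionMessage(e) inside the handler before returning.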
Key Takeaways
- Don’t use makeCluster() with doMC: it’s unnecessary and will cause issues.
- Avoid mc.cores in foreach(): that’s only for mclapply().
- Use registerDoMC(cores = X) as the only setup needed for doMC.
- Wrap your test functions in tryCatch() to make your pipeline resilient.
- For cross-platform compatibility, you’re better off using doParallel.
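If you are ever unsure which backend foreach will actually use, the foreach package exposes a few helper functions to check. This is a minimal sketch, assuming doMC is installed and you are on a Unix-like system; the core count of 2 is arbitrary.

```r
library(doMC)     # Unix-like systems only
library(foreach)

registerDoMC(cores = 2)

getDoParRegistered()  # TRUE once a backend is registered
getDoParName()        # name of the active backend
getDoParWorkers()     # number of workers foreach will use
```

Running these right after registration would have flagged my original mistake much earlier than the mclapply() error did.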
Cross-Platform Safe Version
If your code needs to run on both Windows and Linux, here’s a safer and more portable version using doParallel:
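library(doParallel)
library(foreach)

# foreachFunc and Data_new are defined as in the corrected version above
number_of_cpus <- parallel::detectCores()
cl <- makeCluster(number_of_cpus)
registerDoParallel(cl)

Chunks <- c(1:NROW(Data_new)) %% number_of_cpus

# drop = FALSE keeps a one-row chunk as a matrix, which matters on many-core machines
P <- foreach(i = 0:(number_of_cpus - 1), .combine = c) %dopar% {
  foreachFunc(Data_new[Chunks == i, , drop = FALSE])
}
stopCluster(cl)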
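One weakness of the version above is that stopCluster() never runs if the loop throws an error, which leaves worker processes behind. Here is a hedged sketch of one way to guard against that by wrapping the work in a function and using on.exit(). The helper name run_rowwise and its arguments are hypothetical, and the default of two workers is arbitrary.

```r
library(doParallel)  # attaches foreach and parallel as dependencies

# Hypothetical helper: run a row-wise function on a temporary cluster
run_rowwise <- function(data, fun, workers = 2L) {
  workers <- min(workers, nrow(data))  # never more workers than rows
  cl <- makeCluster(workers)
  # Runs even if the loop below throws an error
  on.exit({
    stopCluster(cl)
    registerDoSEQ()  # fall back to the sequential backend
  }, add = TRUE)
  registerDoParallel(cl)

  # Contiguous row blocks, so results come back in the original row order
  chunks <- parallel::splitIndices(nrow(data), workers)
  foreach(idx = chunks, .combine = c) %dopar% {
    apply(data[idx, , drop = FALSE], 1, fun)
  }
}

set.seed(123)
Data_new <- matrix(sample(1:10, 1000, replace = TRUE), nrow = 100)
P <- run_rowwise(Data_new, function(d) chisq.test(d)$p.value)
length(P)
```

Starting a cluster on every call has a cost, so for repeated runs you may prefer to create the cluster once and pass it in.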
Final Thought
Getting parallel computing to work across platforms can be tricky, especially when packages like doMC and doParallel behave so differently. I learned the hard way that Linux’s mclapply() is powerful but easy to misuse when you assume all parallel packages work the same.