Utilizing multiple CPUs
Luckily, using multiple CPUs in R is relatively simple. The deprecated library multicore is still available, but you shouldn't use it. The newer library parallel is recommended instead. This library provides mclapply, but that function relies on forking processes and therefore does not run in parallel on Windows, so we're not going to use it. The examples below work on Windows and Linux and do not use deprecated libraries.
A very simple example
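library(parallel)
#Leave one core free for the rest of the system, but use at least one
no_cores <- max(1, detectCores() - 1)
cl <- makeCluster(no_cores)
arr <- c("business","done","differently")
#Work on the future together
result <- parLapply(cl, arr, function(x) toupper(x))
#Shut down the worker processes once the work is done
stopCluster(cl)
#Conclusion: BUSINESS DONE DIFFERENTLY
paste(c('Conclusion:', unlist(result)), collapse = ' ')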
This is a minimal example of how you can use clustering in R. The code spawns multiple processes and handles the entries of the vector c("business","done","differently") in those separate processes. Processing in this case just means converting them to uppercase. When it is done, the results from the different processes are combined into Conclusion: BUSINESS DONE DIFFERENTLY.
If you remove the stopCluster command, multiple worker processes remain open, as I could see on my Windows machine.
After calling the stopCluster command, the number of processes is much reduced.
You can imagine that for an operation as simple as converting text to uppercase, you might as well use the regular lapply function, which saves you the overhead of spawning processes. If, however, you have more complex operations like the example below, you will benefit greatly from being able to use more computing power!
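To get a feeling for that overhead, you can time both variants yourself. The following is a minimal sketch, not a benchmark: the input size of 3000 words and the two-worker cluster are assumptions chosen for illustration, and the timings will differ per machine.

```r
library(parallel)

# Assumption: a small, cheap workload, repeated to get a measurable runtime
words <- rep(c("business", "done", "differently"), times = 1000)

# Sequential baseline with lapply
t_seq <- system.time(seq_result <- lapply(words, toupper))

# Parallel version, timed including cluster start-up and shutdown
t_par <- system.time({
  cl <- makeCluster(2)  # assumption: two workers
  par_result <- parLapply(cl, words, toupper)
  stopCluster(cl)
})

# Elapsed wall-clock time in seconds for each variant
print(t_seq["elapsed"])
print(t_par["elapsed"])
```

Both variants return the same list; only the elapsed time differs. For work this cheap, starting the cluster usually accounts for most of the parallel timing.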
A more elaborate example
You can download the code of this example from: https://github.com/MaartenSmeets/R/blob/master/htmlcrawling.R
The sample no longer works, however, because it parses Yahoo pages whose layout has since changed. It still illustrates how to do parallel processing.
Because there are separate R processes running, you need to make libraries and functions available to these processes. For example, you can make libraries available like:
#make libraries available in other nodes
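One way to do this is clusterEvalQ from the parallel package, which evaluates an expression on every worker. The following is a minimal sketch rather than the code from the linked script: the two-worker cluster, the tools package and the file names are assumptions picked for illustration.

```r
library(parallel)

# Assumption: a small two-worker cluster, just for illustration
cl <- makeCluster(2)

# Load a package on every worker; 'tools' ships with R and stands in
# for whatever packages your own worker code needs
invisible(clusterEvalQ(cl, library(tools)))

# The workers can now call functions from that package directly
exts <- parLapply(cl, c("report.pdf", "data.csv"), function(f) file_ext(f))

stopCluster(cl)
```

Without the clusterEvalQ call, the workers would not find file_ext, because a package attached in the main session is not automatically attached in the worker processes.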
And you can make functions available like
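For functions defined in your own session, clusterExport from the parallel package copies named objects to every worker. Again a minimal sketch under assumptions: the helper shout and the two-worker cluster are hypothetical and not taken from the linked script.

```r
library(parallel)

# Hypothetical helper, defined only in the main R session
shout <- function(x) paste0(toupper(x), "!")

cl <- makeCluster(2)  # assumption: two workers

# Copy the helper into the global environment of every worker
clusterExport(cl, varlist = "shout")

# The anonymous function runs on the workers and looks up 'shout' there
res <- parLapply(cl, c("business", "done"), function(x) shout(x))

stopCluster(cl)
```

The same call can export data objects as well; pass a character vector with the names of everything the workers need.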
There are several considerations (and probably more than mentioned below) when using this way of clustering:
- Work packages are divided equally over the worker processes. If, however, the work packages differ greatly in the amount of work, you can encounter situations where parLapply is waiting for one process to complete while the other processes are already done. Try to use work packages of roughly equal size to avoid this.
- If a worker process goes too long without communicating with the main process, the connection will time out. You can set the timeout (in seconds) when creating the cluster like: cl <- makeCluster(no_cores, timeout=50)
- Every process takes memory. If you process large variables in parallel, you might encounter memory limitations.
- Debugging the different processes can be difficult. I will not go into detail here.
- GPUs can also be used to do calculations. See for example: https://www.r-bloggers.com/r-gpu-programming-for-all-with-gpur/. I have not tried this, but the performance graphs online suggest that much better performance can be achieved than with CPUs.
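For the first consideration above, work packages of unequal size, the parallel package also offers clusterApplyLB, a load-balancing variant that hands the next item to whichever worker becomes free. A minimal sketch, assuming a two-worker cluster and hypothetical work packages that are simulated with sleep times:

```r
library(parallel)

cl <- makeCluster(2)  # assumption: two workers

# Hypothetical work packages: sleep times in seconds, deliberately uneven
work <- c(0.4, 0.1, 0.1, 0.1, 0.1)

# Stand-in for real work: wait for the given time, then return it
slow_task <- function(s) {
  Sys.sleep(s)
  s
}

# Each worker receives a new item as soon as it finishes the previous one
res <- clusterApplyLB(cl, work, slow_task)

stopCluster(cl)
```

The results come back in the order of the input, regardless of which worker handled which item. The trade-off is more communication between the main process and the workers, so this mainly pays off when individual work packages are large and uneven.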