Horovod with Intel(R) oneCCL
To use Horovod with the Intel(R) oneAPI Collective Communications Library (oneCCL), follow the steps below.
To install oneCCL, follow these steps.
Source setvars.sh to start using oneCCL.
Install Horovod from source code:
python setup.py build
python setup.py install
or via pip:
pip install horovod
You can specify the affinity of the Horovod background thread with the HOROVOD_THREAD_AFFINITY environment variable.
See the instructions below.
Set the Horovod background thread affinity according to the following rule: if there are N Horovod processes per node, the variable must contain one value for every local process, separated by commas, in the form c0,c1,...,c(N-1),
where c0,…,c(N-1) are the core IDs to which the background threads of the local processes are pinned.
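As a minimal sketch of the rule above, assume a node that runs N = 2 Horovod processes and has cores 53 and 71 free for the background threads. Both the process count and the core IDs are illustrative values, not requirements.

```shell
# Illustrative values: 2 Horovod processes per node (N = 2),
# so the list holds exactly 2 core IDs, one per local process.
export HOROVOD_THREAD_AFFINITY="53,71"
```

With a different number of processes per node, the list would need a matching number of entries.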
Set the number of oneCCL workers with CCL_WORKER_COUNT=X,
where X is the number of oneCCL worker threads (workers) per process that you would like to dedicate to driving communication.
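For instance, a setup that dedicates two worker threads to each process might look like the sketch below. The value 2 is only an illustration; the right number depends on how many cores you can spare for communication.

```shell
# Illustrative value: 2 oneCCL worker threads per process.
export CCL_WORKER_COUNT=2
```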
Set the oneCCL worker affinity automatically:
This is the default mode. The exact core IDs depend on the process launcher used.
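The automatic mode can also be requested explicitly, roughly as sketched below. The value auto comes from the oneCCL environment variable reference rather than from this page, so treat it as an assumption and check it against the oneCCL version you use.

```shell
# Assumption: "auto" is the value oneCCL documents for automatic worker pinning.
export CCL_WORKER_AFFINITY=auto
```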
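Set the oneCCL worker affinity explicitly with CCL_WORKER_AFFINITY=c0,c1,..,c(X-1),
where c0,c1,..,c(X-1) are the core IDs dedicated to the local oneCCL workers. Here X is the total number of workers on the node, i.e. X = CCL_WORKER_COUNT * number of processes per node.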
Please refer to Execution of Communication Operations for more information.
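The relationship between the length of the explicit affinity list and the worker count can be checked before launching a job. The sketch below assumes 2 processes per node and 2 workers per process. PROCS_PER_NODE and count_ids are hypothetical helpers introduced only for this check and are not read by oneCCL or Horovod.

```shell
# Illustrative values: 2 processes per node, 2 oneCCL workers per process.
PROCS_PER_NODE=2                 # hypothetical helper variable, not read by oneCCL
export CCL_WORKER_COUNT=2
export CCL_WORKER_AFFINITY="16,17,34,35"

# Count the comma-separated core IDs in a list (hypothetical helper).
count_ids() {
  old_ifs=$IFS
  IFS=,
  set -- $1          # split the list on commas into positional parameters
  IFS=$old_ifs
  echo "$#"
}

expected=$((CCL_WORKER_COUNT * PROCS_PER_NODE))
actual=$(count_ids "$CCL_WORKER_AFFINITY")
if [ "$actual" -eq "$expected" ]; then
  echo "OK: $actual core IDs for $expected workers"
else
  echo "Mismatch: $actual core IDs listed, $expected workers expected" >&2
fi
```

A mismatch here usually means that either the worker count or the affinity list was changed without updating the other.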
For example, suppose we have 2 nodes and each node has 2 sockets: socket0 with CPUs 0-17,36-53 and socket1 with CPUs 18-35,54-71. We dedicate the last two cores of each socket to 2 oneCCL workers and pin the Horovod background thread to one of the hyper-thread siblings of the oneCCL workers' cores. All of these cores are excluded from Intel MPI pinning using I_MPI_PIN_PROCESSOR_EXCLUDE_LIST, which dedicates them to oneCCL and Horovod tasks only and avoids conflicts with the framework's computational threads.
export CCL_WORKER_COUNT=2
export CCL_WORKER_AFFINITY="16,17,34,35"
export HOROVOD_THREAD_AFFINITY="53,71"
export I_MPI_PIN_DOMAIN=socket
export I_MPI_PIN_PROCESSOR_EXCLUDE_LIST="16,17,34,35,52,53,70,71"
mpirun -n 4 -ppn 2 -hostfile hosts python ./run_example.py
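Because the example relies on every dedicated core also appearing in the exclude list, a quick consistency check can catch typos before the job starts. The sketch below reuses the values from the example; the loop and the missing variable are illustrative additions and not part of Horovod, oneCCL, or Intel MPI.

```shell
# Values copied from the example above.
CCL_WORKER_AFFINITY="16,17,34,35"
HOROVOD_THREAD_AFFINITY="53,71"
I_MPI_PIN_PROCESSOR_EXCLUDE_LIST="16,17,34,35,52,53,70,71"

# Collect every dedicated core that is absent from the exclude list.
missing=""
for core in $(echo "$CCL_WORKER_AFFINITY,$HOROVOD_THREAD_AFFINITY" | tr ',' ' '); do
  case ",$I_MPI_PIN_PROCESSOR_EXCLUDE_LIST," in
    *",$core,"*) ;;                      # core is excluded from MPI pinning
    *) missing="$missing $core" ;;       # core would still be used by MPI ranks
  esac
done

if [ -n "$missing" ]; then
  echo "Not excluded from Intel MPI pinning:$missing" >&2
else
  echo "All dedicated cores are excluded from Intel MPI pinning"
fi
```

For the values shown, every worker and background-thread core is present in the exclude list, so the check reports no missing cores.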
Set the cache hint for oneCCL operations:
The cache hint is currently available for allreduce only. It is disabled by default.
Please refer to Caching of Communication Operations for more information.