Horovod with Intel(R) oneCCL

To use Horovod with the Intel(R) oneAPI Collective Communications Library (oneCCL), follow the steps below.

  1. Install oneCCL.

To install oneCCL, follow these steps.

Source setvars.sh to start using oneCCL.

source <install_dir>/env/setvars.sh

  2. Set the HOROVOD_CPU_OPERATIONS variable.

export HOROVOD_CPU_OPERATIONS=CCL

  3. Install Horovod from source code:

python setup.py build
python setup.py install

or via pip:

pip install horovod
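After installing, it can be useful to confirm that the build picked up oneCCL. The sketch below assumes the horovodrun launcher was installed alongside Horovod and is on PATH; its --check-build option prints the frameworks, controllers, and tensor operations compiled into the build, and CCL should appear among the tensor operations. HVD_CHECK is a hypothetical helper variable used only in this snippet.

```shell
# Optional sanity check; assumes horovodrun was installed together with Horovod.
if command -v horovodrun >/dev/null 2>&1; then
    # Prints the frameworks, controllers, and tensor operations in this build.
    horovodrun --check-build
    HVD_CHECK=ran        # hypothetical helper variable, not read by Horovod
else
    echo "horovodrun not found on PATH; is Horovod installed in this environment?"
    HVD_CHECK=missing
fi
```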

Advanced settings

Affinity

You can specify the affinity of the Horovod background thread with the HOROVOD_THREAD_AFFINITY environment variable. See the instructions below.

Set the Horovod background thread affinity according to this rule: if there are N Horovod processes per node, the variable must contain one value for each local process, separated by commas:

export HOROVOD_THREAD_AFFINITY=c0,c1,...,c(N-1)

where c0,...,c(N-1) are the IDs of the cores to which the background threads of the local processes are pinned.
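As a minimal sketch of the rule above, the following snippet builds the variable from a list of core IDs and checks that there is exactly one ID per local process. PROCS_PER_NODE and HVD_CORES are hypothetical helper variables introduced for illustration (Horovod does not read them), and the core IDs are example values.

```shell
# Hypothetical values for illustration: 2 Horovod processes per node,
# with their background threads pinned to cores 53 and 71.
PROCS_PER_NODE=2
HVD_CORES="53 71"

# Load the core IDs into the positional parameters so that $# is their count.
set -- $HVD_CORES
if [ "$#" -ne "$PROCS_PER_NODE" ]; then
    echo "expected $PROCS_PER_NODE core IDs, got $#" >&2
else
    # Join the IDs with commas, the format HOROVOD_THREAD_AFFINITY expects.
    HOROVOD_THREAD_AFFINITY=$(echo "$@" | tr ' ' ',')
    export HOROVOD_THREAD_AFFINITY
    echo "HOROVOD_THREAD_AFFINITY=$HOROVOD_THREAD_AFFINITY"
fi
```

With the values shown, the snippet exports HOROVOD_THREAD_AFFINITY=53,71.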

Set the number of oneCCL workers:

export CCL_WORKER_COUNT=X

where X is the number of oneCCL worker threads (workers) per process you’d like to dedicate to drive communication.

Set the oneCCL worker affinity automatically:

export CCL_WORKER_AFFINITY=auto

This is the default mode. The exact core IDs depend on the process launcher used.

Set the oneCCL worker affinity explicitly:

export CCL_WORKER_AFFINITY=c0,c1,...,c(X-1)

where c0,c1,...,c(X-1) are the core IDs dedicated to the local oneCCL workers. Here X is the total number of workers on the node, i.e. X = CCL_WORKER_COUNT * number of processes per node.
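The relationship between the length of the affinity list and the worker count can be checked before launching. The sketch below uses example values; PROCS_PER_NODE, expected, and actual are hypothetical helper variables for this check only and are not read by oneCCL.

```shell
# Example values: 2 workers per process, 2 processes per node.
CCL_WORKER_COUNT=2
PROCS_PER_NODE=2                      # hypothetical helper, not read by oneCCL
CCL_WORKER_AFFINITY="16,17,34,35"

# The list needs one core ID for every worker on the node.
expected=$((CCL_WORKER_COUNT * PROCS_PER_NODE))

# Split the list on commas and count the entries.
old_ifs=$IFS
IFS=,
set -- $CCL_WORKER_AFFINITY
IFS=$old_ifs
actual=$#

if [ "$actual" -eq "$expected" ]; then
    export CCL_WORKER_COUNT CCL_WORKER_AFFINITY
    echo "ok: $actual core IDs for $expected workers per node"
else
    echo "mismatch: $actual core IDs, but $expected workers per node" >&2
fi
```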

Refer to "Execution of Communication Operations" in the oneCCL documentation for more information.

For example, suppose we have 2 nodes, each with 2 sockets: socket0 CPUs 0-17,36-53 and socket1 CPUs 18-35,54-71. We dedicate the last two cores of each socket to 2 oneCCL workers and pin the Horovod background thread to one of the hyper-thread siblings of the oneCCL workers' cores. All of these cores are excluded from Intel MPI pinning with I_MPI_PIN_PROCESSOR_EXCLUDE_LIST, which dedicates them to oneCCL and Horovod tasks only and avoids conflicts with the framework's computational threads.

export CCL_WORKER_COUNT=2
export CCL_WORKER_AFFINITY="16,17,34,35"
export HOROVOD_THREAD_AFFINITY="53,71"
export I_MPI_PIN_DOMAIN=socket
export I_MPI_PIN_PROCESSOR_EXCLUDE_LIST="16,17,34,35,52,53,70,71"

mpirun -n 4 -ppn 2 -hostfile hosts python ./run_example.py
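The exclude list in this example can be derived from the worker cores. The sketch below assumes the CPU numbering described above, where each node has 36 physical cores and the hyper-thread sibling of core c is logical CPU c + 36; on other machines the numbering may differ, so check the topology first (for instance with lscpu on Linux). SIBLING_OFFSET, WORKER_CORES, siblings, and EXCLUDE_LIST are hypothetical helper variables.

```shell
# Assumption taken from the example topology: the hyper-thread sibling of
# physical core c is logical CPU c + 36.
SIBLING_OFFSET=36
WORKER_CORES="16 17 34 35"            # last two cores of each socket

# Compute the sibling of every worker core.
siblings=""
for c in $WORKER_CORES; do
    siblings="$siblings $((c + SIBLING_OFFSET))"
done

# Worker cores followed by their siblings, joined with commas.
EXCLUDE_LIST=$(echo $WORKER_CORES $siblings | tr ' ' ',')
echo "I_MPI_PIN_PROCESSOR_EXCLUDE_LIST=$EXCLUDE_LIST"
```

For the topology above this yields 16,17,34,35,52,53,70,71, the same list used in the example.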

Caching

Set the cache hint for oneCCL operations:

export HOROVOD_CCL_CACHE=0|1

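Set the value to 1 to enable caching or 0 to disable it. Caching is currently available for allreduce only and is disabled by default.

Refer to "Caching of Communication Operations" in the oneCCL documentation for more information.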