Horovod in Docker

To streamline the installation process, we have published reference Dockerfiles so you can get started with Horovod in minutes. Containers built from these Dockerfiles include the Horovod examples in the /examples directory.

Pre-built Docker containers with Horovod are available on DockerHub for GPU, CPU, and Ray.

Running on a single machine

After the container is built, run it using nvidia-docker.

Note: Instead of building the container yourself, you can replace horovod:latest with one of the pre-built Horovod Docker containers from DockerHub.

$ nvidia-docker run -it horovod:latest
root@c278c88dd552:/examples# horovodrun -np 4 -H localhost:4 python keras_mnist_advanced.py

If you don’t run your container in privileged mode, you may see the following message:

[a8c9914754d2:00040] Read -1, expected 131072, errno = 1

You can ignore this message.

Running on multiple machines

This example assumes a shared filesystem mounted at /mnt/share and a common port number, 12345, for the SSH daemon that runs in every container. The directory /mnt/share/ssh contains a typical id_rsa and authorized_keys pair that allows passwordless authentication.

Note: These are not hard requirements, but they make the example more concise. You can replace the shared filesystem by rsyncing the SSH configuration and code across machines, and you can replace the common SSH port with machine-specific ports defined in the /root/.ssh/config file.

Primary worker:

host1$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest
root@c278c88dd552:/examples# horovodrun -np 16 -H host1:4,host2:4,host3:4,host4:4 -p 12345 python keras_mnist_advanced.py

Secondary workers:

host2$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
    bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
host3$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
    bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
host4$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
    bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
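In the primary worker command above, the value passed to -np equals the sum of the slots listed in -H (4 hosts with 4 slots each gives 16). The following is a minimal sketch of how that command line could be assembled for a different set of hosts. The function name build_horovod_cmd is hypothetical and not part of Horovod, and the sketch assumes every host offers the same number of slots. It only prints the command and does not run it.

```shell
#!/usr/bin/env bash
# Sketch only: build_horovod_cmd is a hypothetical helper, not a Horovod tool.
# Usage: build_horovod_cmd SLOTS_PER_HOST SSH_PORT SCRIPT HOST [HOST...]
build_horovod_cmd() {
    local slots="$1" port="$2" script="$3"
    shift 3
    local hosts="" np=0 host
    for host in "$@"; do
        # Append "host:slots", inserting a comma before every entry but the first.
        hosts="${hosts:+${hosts},}${host}:${slots}"
        # The total process count is the sum of slots over all hosts.
        np=$((np + slots))
    done
    echo "horovodrun -np ${np} -H ${hosts} -p ${port} python ${script}"
}

# Print the command for the four-host example in this section.
build_horovod_cmd 4 12345 keras_mnist_advanced.py host1 host2 host3 host4
```

With the arguments shown, the helper prints the same horovodrun command line that is used on the primary worker.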

Adding Mellanox RDMA support

If you have Mellanox NICs, we recommend that you mount your Mellanox devices (/dev/infiniband) in the container and enable the IPC_LOCK capability for memory registration:

$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh --cap-add=IPC_LOCK --device=/dev/infiniband horovod:latest
root@c278c88dd552:/examples# ...

You need to specify these additional configuration options on both the primary and the secondary workers.
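If the same launch script is used on machines with and without Mellanox NICs, the two extra options can be added conditionally. The sketch below assumes that the presence of the /dev/infiniband path is a good enough signal that the devices are available. The helper name rdma_docker_flags is hypothetical. The final line only prints the resulting command and does not start a container.

```shell
#!/usr/bin/env bash
# Sketch only: rdma_docker_flags is a hypothetical helper.
# It prints the extra docker options when the device path exists,
# and prints nothing otherwise.
rdma_docker_flags() {
    local dev="${1:-/dev/infiniband}"
    if [ -e "$dev" ]; then
        echo "--cap-add=IPC_LOCK --device=${dev}"
    fi
}

# Print (do not run) the container command. The unquoted substitution is
# intentional so that the two options are split into separate words.
echo nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh \
    $(rdma_docker_flags) horovod:latest
```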

Running containers with different ports

To run in situations without a common SSH port (e.g., multiple containers on the same host):

  1. Configure your ~/.ssh/config file to assign a custom host name and port to each container:

    Host host1
      Port 1234
    Host host2
      Port 2345

  2. Use horovodrun directly, as though each container were a separate host with its own IP:

    $ horovodrun -np 8 -H host1:4,host2:4 python keras_mnist_advanced.py
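With more than a few containers, the Host and Port entries from step 1 can be generated rather than typed by hand. The following is a minimal sketch, assuming each container is described by a name:port pair. The helper name ssh_config_entries is hypothetical. It writes to standard output, so the result can be reviewed before it is appended to ~/.ssh/config.

```shell
#!/usr/bin/env bash
# Sketch only: ssh_config_entries is a hypothetical helper.
# Usage: ssh_config_entries NAME:PORT [NAME:PORT...]
ssh_config_entries() {
    local pair
    for pair in "$@"; do
        # ${pair%%:*} is the text before the first colon (the host alias),
        # ${pair##*:} is the text after the last colon (the port).
        printf 'Host %s\n  Port %s\n' "${pair%%:*}" "${pair##*:}"
    done
}

# Print the entries for the two containers used in this section.
ssh_config_entries host1:1234 host2:2345
```

To apply the output, redirect it with >> ~/.ssh/config after checking that the file does not already define the same hosts.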