.. inclusion-marker-start-do-not-remove

Horovod in Docker
=================

To streamline the installation process on GPU machines, we have published the reference Dockerfile so you can get started with Horovod in minutes. The container includes examples in the ``/examples`` directory.

Pre-built Docker containers with Horovod are available on DockerHub.

Building
~~~~~~~~

Before building, you can modify ``Dockerfile.gpu`` to your liking, e.g. select a different CUDA, TensorFlow or Python version.

.. code-block:: bash

    $ mkdir horovod-docker-gpu
    $ wget -O horovod-docker-gpu/Dockerfile https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.gpu
    $ docker build -t horovod:latest horovod-docker-gpu

For users without GPUs available in their environments, we've also published a CPU Dockerfile you can build and run similarly.

Running on a single machine
~~~~~~~~~~~~~~~~~~~~~~~~~~~

After the container is built, run it using nvidia-docker.

**Note**: You can replace ``horovod:latest`` with a specific pre-built Docker container with Horovod instead of building it yourself.

.. code-block:: bash

    $ nvidia-docker run -it horovod:latest
    root@c278c88dd552:/examples# horovodrun -np 4 -H localhost:4 python keras_mnist_advanced.py

If you don't run your container in privileged mode, you may see the following message:

.. code-block:: bash

    [a8c9914754d2:00040] Read -1, expected 131072, errno = 1

You can ignore this message.

Running on multiple machines
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here we describe a simple example involving a shared filesystem ``/mnt/share`` using a common port number ``12345`` for the SSH daemon that will be run on all the containers. ``/mnt/share/ssh`` would contain a typical ``id_rsa`` and ``authorized_keys`` pair that allows passwordless authentication.

**Note**: These are not hard requirements but they make the example more concise.
A shared filesystem can be replaced by ``rsyncing`` SSH configuration and code across machines, and a common SSH port can be replaced by machine-specific ports defined in the ``/root/.ssh/ssh_config`` file.

Primary worker:

.. code-block:: bash

    host1$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest
    root@c278c88dd552:/examples# horovodrun -np 16 -H host1:4,host2:4,host3:4,host4:4 -p 12345 python keras_mnist_advanced.py

Secondary workers:

.. code-block:: bash

    host2$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
        bash -c "/usr/sbin/sshd -p 12345; sleep infinity"

.. code-block:: bash

    host3$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
        bash -c "/usr/sbin/sshd -p 12345; sleep infinity"

.. code-block:: bash

    host4$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
        bash -c "/usr/sbin/sshd -p 12345; sleep infinity"

Adding Mellanox RDMA support
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you have Mellanox NICs, we recommend that you mount your Mellanox devices (``/dev/infiniband``) in the container and enable the IPC_LOCK capability for memory registration:

.. code-block:: bash

    $ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh --cap-add=IPC_LOCK --device=/dev/infiniband horovod:latest
    root@c278c88dd552:/examples# ...

You need to specify these additional configuration options on both primary and secondary workers.

.. inclusion-marker-end-do-not-remove
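
In the multi-machine example, the ``-np`` value (16) is the number of hosts times the number of slots per host, and the ``-H`` value lists each host with its slot count. As a minimal sketch of that relationship, the hypothetical script below derives both from a host list. The ``HOSTS`` and ``SLOTS`` variables are assumptions chosen to match the example (four hosts, four GPUs each), not part of Horovod; the script only prints the resulting command so it can be checked before being run inside the primary worker container.

.. code-block:: bash

    #!/usr/bin/env bash
    # Hypothetical helper: build the horovodrun arguments from a host list.
    HOSTS="host1 host2 host3 host4"   # hostnames taken from the example
    SLOTS=4                           # assumed worker processes (GPUs) per host

    HOSTS_ARG=""
    NP=0
    for h in $HOSTS; do
        # Append "host:slots", comma-separated after the first entry.
        HOSTS_ARG="${HOSTS_ARG:+$HOSTS_ARG,}$h:$SLOTS"
        NP=$((NP + SLOTS))
    done

    # Print the command instead of executing it.
    echo "horovodrun -np $NP -H $HOSTS_ARG -p 12345 python keras_mnist_advanced.py"

With the values above, the printed command matches the one shown for the primary worker. Changing ``HOSTS`` or ``SLOTS`` adjusts both arguments together, which avoids a mismatch between the process count and the host list.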