Horovod has the ability to record the timeline of its activity, called Horovod Timeline.
To record a Horovod Timeline, set the
--timeline-filename command line argument to the location of the timeline
file to be created. This file is only recorded on rank 0, but it contains information about activity of all workers.
$ horovodrun -np 4 --timeline-filename /path/to/timeline.json python train.py
You can then open the timeline file using the
chrome://tracing facility of the Chrome browser.
In the example above, you can see few tensors being reduced. There are two major phases for each tensor reduction:
Negotiation - a phase when all workers send to rank 0 signal that they’re ready to reduce the given tensor.
Each worker reporting readiness is represented by a tick under the
NEGOTIATE_ALLREDUCEbar, so you can see which workers were early and which were late.
Immediately after negotiation, rank 0 sends all other workers signal to start reducing the tensor.
Processing - a phase when the operation actually happens. It is further subdivided into multiple sub-phases:
WAIT_FOR_DATAindicates time taken to wait for GPU to finish computing input to the allreduce, allgather, or broadcast operations. This happens because TensorFlow tries to smartly interleave scheduling and GPU computation. This is only applicable to situations where the Horovod operation is placed on GPU.
WAIT_FOR_OTHER_TENSOR_DATAindicates time taken to wait for GPU to finish computing other inputs for other operations that are part of the same fusion batch.
QUEUEhappens when reduction is done with NCCL, and the previous NCCL operation did not finish yet.
MEMCPY_OUT_FUSION_BUFFERindicate time taken to copy data into and out of the fusion buffer.
MPI_BCASTindicate time taken to do the actual operation on GPU (or CPU) and highlights whether the operation was performed using NCCL or pure MPI.
In case of
NCCL_ALLREDUCEwill become a sequence or a subsequence of
Adding cycle markers¶
Horovod performs work in cycles. These cycles are used to aid Tensor Fusion. Horovod has the ability to record the moment when each cycle starts for debugging of Tensor Fusion.
Since this information makes timeline view very crowded, it is not enabled by default. To add cycle markers to the timeline, set the
$ horovodrun -np 4 --timeline-filename /path/to/timeline.json --timeline-mark-cycles python train.py