Horovod with MXNet

Horovod supports Apache MXNet in much the same way as it supports regular TensorFlow.

See the full MNIST and ImageNet training examples. The script below is a simple code skeleton based on the Apache MXNet Gluon API.

import mxnet as mx
import horovod.mxnet as hvd
from mxnet import autograd

# Initialize Horovod
hvd.init()

# Pin each process to a single GPU, selected by its local rank
context = mx.gpu(hvd.local_rank())
num_workers = hvd.size()

# Build model
model = ...
model.hybridize()

# Create optimizer
optimizer_params = ...
opt = mx.optimizer.create('sgd', **optimizer_params)

# Initialize parameters (initializer is a placeholder supplied by the user)
model.initialize(initializer, ctx=context)

# Fetch and broadcast parameters
params = model.collect_params()
if params is not None:
    hvd.broadcast_parameters(params, root_rank=0)

# Create DistributedTrainer, a subclass of gluon.Trainer
trainer = hvd.DistributedTrainer(params, opt)

# Create loss function
loss_fn = ...

# Train model (num_epoch, train_data, dtype, and batch_size are supplied by the user)
for epoch in range(num_epoch):
    train_data.reset()
    for nbatch, batch in enumerate(train_data, start=1):
        data = batch.data[0].as_in_context(context)
        label = batch.label[0].as_in_context(context)
        with autograd.record():
            output = model(data.astype(dtype, copy=False))
            loss = loss_fn(output, label)
        loss.backward()
        trainer.step(batch_size)
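
The last line of the loop, trainer.step(batch_size), is where gradients from all workers have to be combined before the weights are updated. The sketch below simulates that idea in plain Python, assuming gradients are summed across workers and rescaled by 1 / (batch_size * num_workers). The helper names allreduce_sum and averaged_step are hypothetical, and nothing here calls Horovod or MXNet or claims to mirror Horovod's internals.

```python
# Conceptual sketch only: simulates cross-worker gradient averaging.
# allreduce_sum and averaged_step are hypothetical helpers, not Horovod APIs.

def allreduce_sum(per_worker_grads):
    """Element-wise sum of one gradient vector per worker."""
    return [sum(values) for values in zip(*per_worker_grads)]


def averaged_step(weights, per_worker_grads, batch_size, lr):
    """One SGD step using gradients summed over workers and rescaled
    by 1 / (batch_size * num_workers)."""
    num_workers = len(per_worker_grads)
    total = allreduce_sum(per_worker_grads)
    scale = 1.0 / (batch_size * num_workers)
    return [w - lr * scale * g for w, g in zip(weights, total)]


# Two simulated workers; each gradient is summed over a local batch of 4 samples.
grads = [[4.0, 8.0], [12.0, 0.0]]
new_weights = averaged_step([1.0, 1.0], grads, batch_size=4, lr=0.5)
print(new_weights)
```

Under this assumption the update equals a single-process step over the combined batch of 8 samples, which is why each worker can keep passing its local batch_size.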

Note

Some MXNet versions do not work with Horovod:

MXNet 1.4.0 and earlier have GCC incompatibility issues. Use MXNet 1.4.1 or later with Horovod 0.16.2 or later to avoid these incompatibilities.
MXNet 1.5.1, 1.6.0, 1.7.0, and 1.7.0.post1 are missing MKLDNN headers, so they do not work with Horovod. Use 1.5.1.post0, 1.6.0.post0, or 1.7.0.post0 instead.
MXNet 1.6.0.post0 and 1.7.0.post0 are only available as mxnet-cu101 and mxnet-cu102.
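
The version rules in this note can be checked before training starts. The following is a minimal sketch that encodes only the versions listed above. The function names are hypothetical, "1.4.0 and earlier" is interpreted by comparing the leading three numeric components (a simplification), and the Horovod 0.16.2 requirement is not checked.

```python
import re

# Versions the note lists as missing MKLDNN headers.
MISSING_MKLDNN_HEADERS = {"1.5.1", "1.6.0", "1.7.0", "1.7.0.post1"}


def release_tuple(version):
    """Return the leading numeric release, e.g. '1.6.0.post0' -> (1, 6, 0)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def known_incompatible(version):
    """True if the note lists this MXNet version as not working with Horovod.

    Assumption: any version whose numeric release is <= 1.4.0 is treated
    as affected by the GCC incompatibility.
    """
    if version in MISSING_MKLDNN_HEADERS:
        return True
    return release_tuple(version) <= (1, 4, 0)


for candidate in ("1.4.0", "1.4.1", "1.6.0", "1.6.0.post0"):
    print(candidate, known_incompatible(candidate))
```

In a real script the argument would typically be mxnet.__version__, and the check would run before hvd.init().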
