Horovod with MXNet ================== Horovod supports Apache MXNet and regular TensorFlow in similar ways. See full training `MNIST `__ and `ImageNet `__ examples. The script below provides a simple skeleton of code block based on the Apache MXNet Gluon API. .. code-block:: python import mxnet as mx import horovod.mxnet as hvd from mxnet import autograd # Initialize Horovod hvd.init() # Pin GPU to be used to process local rank context = mx.gpu(hvd.local_rank()) num_workers = hvd.size() # Build model model = ... model.hybridize() # Create optimizer optimizer_params = ... opt = mx.optimizer.create('sgd', **optimizer_params) # Initialize parameters model.initialize(initializer, ctx=context) # Fetch and broadcast parameters params = model.collect_params() if params is not None: hvd.broadcast_parameters(params, root_rank=0) # Create DistributedTrainer, a subclass of gluon.Trainer trainer = hvd.DistributedTrainer(params, opt) # Create loss function loss_fn = ... # Train model for epoch in range(num_epoch): train_data.reset() for nbatch, batch in enumerate(train_data, start=1): data = batch.data[0].as_in_context(context) label = batch.label[0].as_in_context(context) with autograd.record(): output = model(data.astype(dtype, copy=False)) loss = loss_fn(output, label) loss.backward() trainer.step(batch_size) .. NOTE:: Some MXNet versions do not work with Horovod: - MXNet 1.4.0 and earlier have `GCC incompatibility issues `__. Use MXNet 1.4.1 or later with Horovod 0.16.2 or later to avoid these incompatibilities. - MXNet 1.5.1, 1.6.0, 1.7.0, and 1.7.0.post1 are missing MKLDNN headers, so they do not work with Horovod. Use 1.5.1.post0, 1.6.0.post0, and 1.7.0.post0, respectively. - MXNet 1.6.0.post0 and 1.7.0.post0 are only available as mxnet-cu101 and mxnet-cu102.