Utilities for distributed training.
An all_gather layer with a backward pass, useful for collecting model output embeddings from multiple GPUs so that a loss can be computed over the combined large batch, e.g. InfoNCE (SimCLR, CLIP).
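
A minimal sketch of such a layer, assuming PyTorch with torch.distributed already initialized. The class name AllGatherWithGrad is hypothetical and not taken from this module. The forward pass gathers the tensor from every rank; the backward pass sums the incoming gradients across ranks and returns the slice that belongs to the local rank, so each rank's embeddings receive gradient from every rank's loss.

```python
import torch
import torch.distributed as dist


class AllGatherWithGrad(torch.autograd.Function):
    """Hypothetical name: all_gather that also propagates gradients."""

    @staticmethod
    def forward(ctx, x):
        # Assumes every rank passes a tensor of the same shape and dtype.
        world_size = dist.get_world_size()
        gathered = [torch.zeros_like(x) for _ in range(world_size)]
        dist.all_gather(gathered, x.contiguous())
        # One output tensor per rank, ordered by rank index.
        return tuple(gathered)

    @staticmethod
    def backward(ctx, *grads):
        # grads[i] is the local loss gradient w.r.t. rank i's tensor.
        # Sum these across ranks, then keep the slot for this rank's input.
        all_grads = torch.stack(grads)
        dist.all_reduce(all_grads)  # default reduction is SUM
        return all_grads[dist.get_rank()]
```

Typical usage would be torch.cat(AllGatherWithGrad.apply(embeddings), dim=0) before computing the contrastive loss. Because gradients are summed across ranks here, any additional averaging by a wrapper such as DistributedDataParallel may need to be accounted for in the loss scaling.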