Wednesday
The Ultra-Scale Playbook: Training LLMs on GPU Clusters — Amazing: at last, a roughly 100-page open-source online book on how models are trained across multiple GPUs, with reproducible source code.
Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch — The latest paper from DeepMind on efficient, geographically distributed training that overlaps communication with computation.