In very large-scale AI training, tail latency, not average latency, is what limits synchronous pretraining performance.
Christoph Paasch will discuss how MRC addresses this by:
1. Introducing a new RDMA transport protocol that sprays traffic across many paths and actively load-balances, eliminating flow collisions.
2. Creating multi-plane Clos topologies that enable >100K-GPU training clusters as two-tier networks with high switch radix and physical redundancy.
3. Using static SRv6 source routing, which gives MRC autonomous, failure-bypassing path control.
The talk will describe experiences running MRC and static SRv6 routing in production within OpenAI and Microsoft’s largest training clusters, where they have been used to train the latest frontier models. Christoph will also show how MRC allows AI training jobs to ride out many network failures that previously would have interrupted training.
cheers, jamal