0x1A: Talk, Resilient AI Supercomputer Networking using MRC and SRv6

16 Jun 2026


In very large-scale AI training, tail latency, not average latency,
is what limits synchronous pretraining performance.
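
A rough way to see why the tail dominates: a synchronous step completes
only when its slowest flow completes. If each flow independently has a
small chance p of being delayed (for example by a path collision), the
chance that a step with N flows contains at least one delayed flow is
1 - (1 - p)^N. The sketch below just evaluates that expression; the
independence assumption, the function name, and the numbers are
illustrative and not taken from the talk.

```python
def prob_step_delayed(p_slow: float, n_flows: int) -> float:
    """Probability that at least one of n_flows independent flows is slow.

    Assumes each flow is slow with probability p_slow, independently of
    the others (a simplification made for this illustration).
    """
    return 1.0 - (1.0 - p_slow) ** n_flows


if __name__ == "__main__":
    # Hypothetical per-flow slow probability of 0.1%.
    for n in (10, 1_000, 100_000):
        p = prob_step_delayed(0.001, n)
        print(f"{n:>7} flows: P(step delayed) = {p:.4f}")
```

Under these assumptions a per-flow event that is rare on average becomes
close to certain per step once the flow count is large, which is why the
average latency says little about step time.
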
Christoph Paasch will discuss how MRC addresses this by:
1. Introducing a new RDMA transport protocol that sprays traffic
across many paths and actively load-balances, eliminating flow
collisions.
2. Building multi-plane Clos topologies that enable >100K-GPU
training clusters as two-tier networks with high switch radix and
physical redundancy.
3. Using static SRv6 source routing, which gives MRC
autonomous, failure-bypassing path control.
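
The two-tier claim in item 2 can be sanity-checked with
back-of-the-envelope arithmetic. In a non-blocking two-tier (leaf-spine)
Clos built from radix-k switches, each leaf uses k/2 ports down and k/2
ports up, and each spine can reach at most k leaves, so one plane holds
at most k * k/2 endpoints. The sketch below assumes, purely for
illustration, a switch that can run either as 64 ports or, with each
port split eight ways at lower speed, as 512 ports, with every NIC
attaching one port to each plane. These radix figures and the function
name are assumptions, not numbers from the talk.

```python
def two_tier_endpoints(radix: int) -> int:
    """Max endpoints in one non-blocking two-tier Clos plane.

    Up to `radix` leaves (bounded by spine port count), each with
    radix // 2 downlinks to endpoints.
    """
    return radix * (radix // 2)


if __name__ == "__main__":
    # Hypothetical switch: 64 fast ports, or 512 ports when split 8 ways.
    for radix in (64, 512):
        print(f"radix {radix:>3}: {two_tier_endpoints(radix):>7} endpoints per plane")
```

With these assumed numbers, a radix of 64 gives 2,048 endpoints per
plane, while a radix of 512 gives 131,072, which is how a higher
effective radix can keep a cluster above 100K GPUs within two tiers.
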
The talk will describe experiences running MRC and static SRv6 routing
in production within OpenAI and Microsoft’s largest training clusters,
where they have been used to train the latest frontier models. Christoph
will discuss operational experience showing how MRC allows AI training
jobs to ride out many network failures that previously would have
interrupted training.

cheers,
jamal