0x17: Talk, SO_TIMESTAMPING: powering fleetwide RPC monitoring - people

7 Oct 2023


Willem de Bruijn says timestamping, via SO_TIMESTAMPING, is key to
debugging network stack latency. Instead of gut-feel finger pointing
between network and kernel tribes, we can just get down to the facts of
where the latency really is.
SO_TIMESTAMPING can isolate transmission, reception and even
scheduling as sources of latency. Capturing connection state along with
timestamps further enables root cause discovery, for example when the
TCP receive window size is the limiting factor.
Capturing timestamps at more points, such as traffic shaping and NIC
hardware, expands visibility to tough issues like incast.
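
As a rough illustration of the measurement points mentioned above, the
sketch below asks the kernel for software timestamps at the packet
scheduler, at the driver handoff, on TCP ACK, and on receive. The flag
names come from <linux/net_tstamp.h>; the helper names
(tstamp_generation_flags, tstamp_socket_flags, enable_timestamping) are
made up for this example, and the flag selection is an assumption, not
the configuration used in the talk or in Fathom.

```c
#include <assert.h>
#include <sys/socket.h>
#include <linux/net_tstamp.h>

/* Generation flags: at which points in the stack a timestamp is taken. */
unsigned int tstamp_generation_flags(void)
{
    return SOF_TIMESTAMPING_TX_SCHED     /* before entering the packet scheduler */
         | SOF_TIMESTAMPING_TX_SOFTWARE  /* just before handoff to the NIC */
         | SOF_TIMESTAMPING_TX_ACK       /* TCP: when the data has been acked */
         | SOF_TIMESTAMPING_RX_SOFTWARE; /* when data enters the receive stack */
}

/* Full socket option value: generation, reporting and option flags. */
unsigned int tstamp_socket_flags(void)
{
    return tstamp_generation_flags()
         | SOF_TIMESTAMPING_SOFTWARE     /* report software timestamps */
         | SOF_TIMESTAMPING_OPT_ID       /* tag each tx timestamp with an ID */
         | SOF_TIMESTAMPING_OPT_TSONLY;  /* loop back timestamps without payload */
}

/* Hypothetical helper: enable timestamping on an existing socket.
 * Returns 0 on success, -1 on error (errno set by setsockopt). */
int enable_timestamping(int fd)
{
    unsigned int flags = tstamp_socket_flags();

    /* Generation flags alone deliver nothing; a reporting flag is needed. */
    assert(flags & SOF_TIMESTAMPING_SOFTWARE);
    return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
}
```

Transmit timestamps requested this way are read back from the socket
error queue with recvmsg() and MSG_ERRQUEUE; receive timestamps arrive
as ancillary data alongside the payload.
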
SO_TIMESTAMPING has seen iterative development to enable fleetwide RPC
monitoring at Google. They presented details of this infrastructure,
called "Fathom", at SIGCOMM 2023[1]. In this talk Willem will pick up
where that SIGCOMM paper ends. He will take a deep dive into the Linux
kernel infrastructure that makes fleetwide continuous latency analysis
and attribution possible.
API extensions include covering TCP bytestreams, capturing transport
protocol state along with events (OPT_STATS), and supporting selective
sampling (OPT_CMSG). The talk reviews the core SO_TIMESTAMPING API,
discusses non-obvious extensions (MSG_EOR, SO_RCVLOWAT), summarizes
gotchas from the field (OPT_ID_TCP), and explains how all this
combines to enable robust continuous RPC monitoring. It touches on
clock synchronization and precision. Finally, it compares this UAPI to
dynamic tracing with uprobes, kprobes, tracepoints and BPF.
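
To make the selective sampling idea a little more concrete: the kernel
timestamping documentation describes requesting transmit timestamps for
a single sendmsg() call by attaching a control message of level
SOL_SOCKET and type SO_TIMESTAMPING. The sketch below only builds such a
message. The helper name prepare_sampled_send is invented for this
example, and treating this mechanism as the one the abstract has in mind
is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/net_tstamp.h>

/* Hypothetical helper: set up msg so that one sendmsg() call requests
 * the given SOF_TIMESTAMPING_TX_* generation flags.
 * ctrl must hold at least CMSG_SPACE(sizeof(uint32_t)) bytes.
 * Returns 0 on success, -1 if the control buffer is too small. */
int prepare_sampled_send(struct msghdr *msg, struct iovec *iov,
                         void *ctrl, size_t ctrl_len, uint32_t tx_flags)
{
    struct cmsghdr *cm;

    assert(msg && iov && ctrl);
    if (ctrl_len < CMSG_SPACE(sizeof(tx_flags)))
        return -1;

    memset(msg, 0, sizeof(*msg));
    memset(ctrl, 0, ctrl_len);
    msg->msg_iov = iov;
    msg->msg_iovlen = 1;
    msg->msg_control = ctrl;
    msg->msg_controllen = CMSG_SPACE(sizeof(tx_flags));

    /* One ancillary record carrying the per-call timestamp flags. */
    cm = CMSG_FIRSTHDR(msg);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type = SO_TIMESTAMPING;
    cm->cmsg_len = CMSG_LEN(sizeof(tx_flags));
    memcpy(CMSG_DATA(cm), &tx_flags, sizeof(tx_flags));
    return 0;
}
```

A caller would pass the prepared msg to sendmsg() only for the RPCs it
wants to sample. The socket still needs a reporting flag such as
SOF_TIMESTAMPING_SOFTWARE set once via setsockopt(), since the control
message only selects where timestamps are generated.
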
Should be fun, can't wait to attend this talk!
[1] search for "fathom" at
https://conferences.sigcomm.org/sigcomm/2023/program.html
cheers,
jamal