---
title: Ace Profiling Attorney - The Case of the Missing Gbits
categories: [Programming, Profiling]
tags: [Rust, kernel, networking]
---

> **Cast**
>
> **Me:** “I rewrote a port forwarder in Rust. It works. It’s… not fast enough.”
>
> **Inner Prosecutor:** “*Objection!* ‘Not fast enough’ is not evidence. Bring numbers.”
>
> **Me:** “Fine. We’ll do this properly.”

---

## 0. The Situation

I’m building a userspace TCP port forwarder in Rust called [oxidinetd](https://github.com/DaZuo0122/oxidinetd) (the binary is named `oi`). It accepts a TCP connection, connects to an upstream server, then relays bytes in both directions.

> This post is not a “Rust vs C” piece — it’s about **profiling**, **forming hypotheses**, and **turning measurements into speed**.

### Test environment

- OS: Debian 13
- Kernel: `6.12.48+deb13-amd64`
- Runtime: `smol`
- Benchmark: single machine, network namespaces + veth

Why namespaces + veth? Loopback can hide “real networking” behavior. Namespaces and veth pairs keep the test local (and repeatable) while exercising a path closer to real routing.

---

> **Inner Prosecutor:** “You claim it’s repeatable. Prove your setup.”
>
> **Me:** “Here’s the lab.”

---

## 1. The Lab Setup

Backend server inside `ns_server`:

```bash
sudo ip netns exec ns_server iperf3 -s -p 9001
```

Client inside `ns_client`, traffic goes through `oi`:

```bash
sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8
```

> **Note**: `-P 8` matters. A forwarder might look okay under `-P 1`, then collapse when syscall pressure scales with concurrency.

### Forwarder config

`oi` listens on `10.0.1.1:9000` and connects to `10.0.0.2:9001`.

`profiling.conf`:

```text
10.0.1.1 9000 10.0.0.2 9001
```
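For reference, a namespace/veth topology like this can be built with standard `ip` commands along the following lines. This is a sketch, not the exact script used for the post: only `ns_client`, `ns_server`, `10.0.1.1`, and `10.0.0.2` appear above; the veth interface names, the addresses `10.0.1.2` and `10.0.0.1`, and the assumption that `oi` runs in the root namespace are mine.

```bash
# Sketch of the lab topology (assumed: `oi` lives in the root namespace,
# one veth pair per namespace; interface names and 10.0.1.2/10.0.0.1 are invented).
sudo ip netns add ns_client
sudo ip netns add ns_server

# Client leg: the root namespace gets 10.0.1.1 (where `oi` listens).
sudo ip link add veth-cl type veth peer name veth-cl-ns
sudo ip link set veth-cl-ns netns ns_client
sudo ip addr add 10.0.1.1/24 dev veth-cl
sudo ip link set veth-cl up
sudo ip netns exec ns_client ip addr add 10.0.1.2/24 dev veth-cl-ns
sudo ip netns exec ns_client ip link set veth-cl-ns up

# Server leg: ns_server gets 10.0.0.2 (where iperf3 listens).
sudo ip link add veth-sv type veth peer name veth-sv-ns
sudo ip link set veth-sv-ns netns ns_server
sudo ip addr add 10.0.0.1/24 dev veth-sv
sudo ip link set veth-sv up
sudo ip netns exec ns_server ip addr add 10.0.0.2/24 dev veth-sv-ns
sudo ip netns exec ns_server ip link set veth-sv-ns up
```

Because `oi` terminates TCP on each leg, no IP forwarding between the namespaces is needed; each side is a directly connected /24.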
---

## 2. The Questions

> **Inner Prosecutor:** “Alright. What exactly is the crime?”
>
> **Me:** “Throughput is lower than expected. The suspects:”
>
> 1. CPU bound vs I/O bound
> 2. Userspace overhead vs kernel TCP stack
> 3. Syscall-rate wall (too many `send/recv` calls per byte moved)
> 4. Async runtime scheduling / wakeups / locks

---

## 3. Evidence Tool #1 — `perf stat` (Macro view)

Command:

```bash
sudo perf stat -p $(pidof oi) -e \
  cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
  -- sleep 33
```

### What I’m looking for

* **Context switches** exploding → runtime contention or wake storms
* **CPU migrations** exploding → scheduler instability (bad for repeatability)
* **IPC** tanking + cache misses skyrocketing → memory/latency issues
* Otherwise: likely **kernel networking + syscalls** dominate

Output:

```text
 Performance counter stats for process id '209785':

   113,810,599,893      cpu_atom/cycles/                                         (0.11%)
   164,681,878,450      cpu_core/cycles/                                         (99.89%)
   102,575,167,734      cpu_atom/instructions/     #  0.90  insn per cycle       (0.11%)
   237,094,207,911      cpu_core/instructions/     #  1.44  insn per cycle       (99.89%)
        33,093,338      cpu_atom/cache-misses/                                   (0.11%)
         5,381,441      cpu_core/cache-misses/                                   (99.89%)
    20,012,975,873      cpu_atom/branches/                                       (0.11%)
    46,120,077,111      cpu_core/branches/                                       (99.89%)
       211,767,555      cpu_atom/branch-misses/    #  1.06% of all branches      (0.11%)
       245,969,685      cpu_core/branch-misses/    #  0.53% of all branches      (99.89%)
             1,686      context-switches
               150      cpu-migrations

      33.004363800 seconds time elapsed
```

Interpretation:

**Low context switching**:

- context-switches: 1,686 over ~33s → ~51 switches/sec
- cpu-migrations: 150 over ~33s → ~4.5/sec → very stable CPU placement

**CPU is working hard**:

- 237,094,207,911 instructions on `cpu_core` (the performance cores)
- IPC: 1.44 (instructions per cycle) → not lock-bound or stalling badly

**Clean cache and branch metrics**:

- cache-misses: ~5.4M on `cpu_core` (~33M on `cpu_atom`) — tiny compared to the instruction counts
- branch-misses: 0.53% on `cpu_core` (1.06% on `cpu_atom`)

---

> **Inner Prosecutor:** “That’s a vibe-check. Where’s the real culprit?”
>
> **Me:** “Next tool. This one tells me what kind of pain we’re paying for.”

---

## 4. Evidence Tool #2 — `strace -c` (Syscall composition)

Command:

```bash
sudo timeout 30s strace -c -f -p $(pidof oi)
```

### Why `strace -c` is lethal for forwarders

A userspace TCP forwarder often boils down to:

* `recv(...)` from one socket
* `send(...)` to the other socket

If your throughput is low and `strace -c` shows **millions** of `sendto/recvfrom` calls, you’re likely hitting a **syscall-per-byte wall**.

Output (simplified):

```text
sendto     2,190,751 calls   4.146799s  (57.6%)
recvfrom   2,190,763 calls   3.052340s  (42.4%)
total syscall time: 7.200789s
```

Interpretation:

(A) **Syscall/copy dominated:**

Almost all traced time is inside:

- sendto() (TCP send)
- recvfrom() (TCP recv)

(B) **The syscall rate is massive:**

Total send+recv calls:

- ~4,381,500 syscalls in ~32s
- → ~68k `sendto` per sec + ~68k `recvfrom` per sec
- → ~137k syscalls/sec total

That’s exactly the pattern of a forwarder doing `recv -> send -> recv -> send ...` with a relatively small buffer.
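As a side note (the measurements above don’t depend on it): `strace` can restrict counting to just the syscalls under suspicion, which keeps both the summary noise and the tracing overhead down. The `-e trace=` filter below is standard strace syntax; everything else mirrors the command above.

```bash
# Optional narrower variant: only count the two hot syscalls.
sudo timeout 30s strace -c -f -e trace=sendto,recvfrom -p $(pidof oi)
```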
---

> **Inner Prosecutor:** “So you’re saying the kernel is being spammed.”
>
> **Me:** “Exactly. Now I want to know who’s spamming it — my logic, my runtime, or my copy loop.”

---

## 5. Evidence Tool #3 — FlameGraph (Where cycles actually go)

Commands:

```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```

### What the flamegraph showed (described, not embedded)

Instead of embedding the graph, here’s the important story it told:

1. The widest “towers” were kernel TCP send/recv paths:

   * `__x64_sys_sendto` → `tcp_sendmsg_locked` → `tcp_write_xmit` → …
   * `__x64_sys_recvfrom` → `tcp_recvmsg` → …

2. My userspace frames existed, but they were thin compared to the kernel towers. That means:

   * I’m not burning CPU on complicated Rust logic.
   * I’m paying overhead at the boundary: syscalls, TCP stack, copies.

3. In the DWARF flamegraph, the *userspace* frames pointed at my forwarding implementation: the code path that ultimately calls read/write over and over.

> **Conclusion:** This is not “async is slow” in general. This is “my relay loop is forcing too many small kernel transitions.”

## 6. The Suspect: my forwarding code

Here was the original TCP relay:

```rust
// Use smol's copy helper to forward data in both directions
let mut c2s_writer = server_stream.clone();
let mut s2c_writer = client_stream.clone();

let client_to_server = io::copy(client_stream, &mut c2s_writer);
let server_to_client = io::copy(server_stream, &mut s2c_writer);

futures_lite::future::try_zip(client_to_server, server_to_client).await?;
```

> **Inner Prosecutor:** “*Objection!* That looks perfectly reasonable.”
>
> **Me:** “Yes. That’s why it’s dangerous.”

### Why this can be slow under high throughput

Generic `io::copy` helpers often use a relatively small internal buffer (commonly ~8KiB), plus abstraction layers that can increase:

* syscall frequency
* readiness polling
* per-chunk overhead

Small buffers aren’t “wrong”. They’re memory-friendly. But for a forwarder pushing tens of Gbit/s, the number of **syscalls per byte moved** becomes the real limiter.

---

## 7. The Fix: a manual `pump()` loop (and a buffer size sweep)

I replaced `io::copy` with a manual relay loop:

* allocate a buffer once per direction
* read into it
* write it out
* on EOF, propagate half-close with `shutdown(Write)`

Code (core idea):

```rust
use smol::io::{AsyncReadExt, AsyncWriteExt}; // read() / write_all() extension traits
use smol::net::TcpStream;
use std::io;

async fn pump(mut r: TcpStream, mut w: TcpStream, buf_sz: usize) -> io::Result<u64> {
    let mut buf = vec![0u8; buf_sz];
    let mut total = 0u64;

    loop {
        let n = r.read(&mut buf).await?;
        if n == 0 {
            // EOF on the read side: propagate the half-close to the write side.
            let _ = w.shutdown(std::net::Shutdown::Write);
            break;
        }
        w.write_all(&buf[..n]).await?;
        total += n as u64;
    }
    Ok(total)
}
```

And run both directions:

```rust
use futures_lite::future::try_zip;

let c2s = pump(client_stream.clone(), server_stream.clone(), BUF);
let s2c = pump(server_stream, client_stream, BUF);
try_zip(c2s, s2c).await?;
```
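To show how the two directions hang off a connection end to end, here is a hypothetical accept-loop wiring around `pump()`. This is a sketch, not oxidinetd’s actual code: `serve()`, the argument names, the error handling, and the use of `smol::spawn` are all mine; only `pump()`, `try_zip`, and the `BUF` constant come from the snippets above.

```rust
use futures_lite::future::try_zip;
use smol::net::{TcpListener, TcpStream};

const BUF: usize = 64 * 1024; // per-direction relay buffer (64KiB, the size the sweep below lands on)

// Hypothetical wiring: one task per accepted connection, each task
// driving both pump() directions until either side finishes or fails.
async fn serve(listen_addr: &str, upstream_addr: &str) -> std::io::Result<()> {
    let listener = TcpListener::bind(listen_addr).await?;
    loop {
        let (client_stream, _peer) = listener.accept().await?;
        let upstream = upstream_addr.to_string();
        smol::spawn(async move {
            match TcpStream::connect(upstream.as_str()).await {
                Ok(server_stream) => {
                    let c2s = pump(client_stream.clone(), server_stream.clone(), BUF);
                    let s2c = pump(server_stream, client_stream, BUF);
                    // If either direction errors out, the pair is torn down.
                    let _ = try_zip(c2s, s2c).await;
                }
                Err(e) => eprintln!("upstream connect failed: {e}"),
            }
        })
        .detach();
    }
}
```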
---

> **Inner Prosecutor:** “You changed ‘one helper call’ into ‘a loop’. That’s your miracle?”
>
> **Me:** “Not the loop. The *bytes per syscall*.”

---

## 8. Verification: numbers don’t lie

Same machine, same namespaces/veth, same `iperf3 -P 8`.

### Baseline (generic copy, ~8KiB internal buffer)

Throughput:

```text
17.8 Gbit/s
```

### Pump + 16KiB buffer

Throughput:

```text
28.6 Gbit/s
```

`strace -c` showed the `sendto/recvfrom` call counts dropped:

```text
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 57.80   14.590016      442121        33           epoll_wait
 28.84    7.279883           4   1771146           sendto
 13.33    3.363882           1   1771212        48 recvfrom
  0.02    0.003843          61        62        44 futex
  0.01    0.001947          12       159           epoll_ctl
...
------ ----------- ----------- --------- --------- ----------------
100.00   25.242897           7   3542787       143 total
```

### Pump + 64KiB buffer

Throughput:

```text
54.1 Gbit/s (best observed)
```

`perf stat` output:

```text
 Performance counter stats for process id '893123':

   120,859,810,675      cpu_atom/cycles/                                         (0.15%)
   134,735,934,329      cpu_core/cycles/                                         (99.85%)
    79,946,979,880      cpu_atom/instructions/     #  0.66  insn per cycle       (0.15%)
   127,036,644,759      cpu_core/instructions/     #  0.94  insn per cycle       (99.85%)
        24,713,474      cpu_atom/cache-misses/                                   (0.15%)
         9,604,449      cpu_core/cache-misses/                                   (99.85%)
    15,584,074,530      cpu_atom/branches/                                       (0.15%)
    24,796,180,117      cpu_core/branches/                                       (99.85%)
       175,778,825      cpu_atom/branch-misses/    #  1.13% of all branches      (0.15%)
       135,067,353      cpu_core/branch-misses/    #  0.54% of all branches      (99.85%)
             1,519      context-switches
                50      cpu-migrations

      33.006529572 seconds time elapsed
```

`strace -c` output:

```text
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 54.56   18.079500      463576        39           epoll_wait
 27.91    9.249443           7   1294854         2 sendto
 17.49    5.796927           4   1294919        51 recvfrom
...
------ ----------- ----------- --------- --------- ----------------
100.00   33.135377          12   2590253       158 total
```

---

## 9. “Wait — why is `epoll_wait` taking most syscall time?”

> **Inner Prosecutor:** “*Objection!* Your table says `epoll_wait` dominates time. So epoll is the bottleneck!”
>
> **Me:** “Nope. That’s a common misread.”

`strace -c` counts **time spent inside syscalls**, including time spent **blocked**.

In async runtimes, it’s normal for one thread to sit in `epoll_wait(timeout=...)` while other threads do the actual send/recv work. That blocked time is charged to `epoll_wait`, but it’s not “overhead” — it’s *waiting*.

The real signal is still:

* the `sendto/recvfrom` call counts (millions)
* the average microseconds per call
* and whether the call counts drop when the buffer size increases

That’s the syscall-per-byte story.

---
## 10. So why did 64KiB cause such a huge jump?

Two reasons:

### 1) The syscall wall is nonlinear

Throughput is roughly:

**Throughput ≈ bytes_per_syscall_pair × syscall_pairs_per_second**

As a purely illustrative example (these are not measurements): at a fixed budget of ~100k recv/send pairs per second, an 8KiB buffer caps one direction near 100,000 × 8KiB ≈ 6.6 Gbit/s, while 64KiB lifts the same syscall budget to ~52 Gbit/s.

If the syscall rate is the limiter, increasing bytes per syscall can push you past a threshold where:

* socket buffers stay fuller
* TCP windows are better utilized
* per-stream pacing is smoother
* concurrency (`-P 8`) stops fighting overhead and starts working in your favor

Once you cross that threshold, throughput can jump until the *next* ceiling (kernel TCP work, memory bandwidth, or iperf3 itself).

### 2) Less “per-chunk” overhead in userspace

A small-buffer copy loop means more iterations, more polls, more bookkeeping.
A bigger buffer means:

* fewer loop iterations per GB moved
* fewer wakeups/polls
* fewer syscall transitions per GB

Between 16KiB and 64KiB the `strace` call counts dropped by roughly a quarter (≈1.77M → ≈1.29M `sendto` calls), and throughput nearly doubled.

---

## 11. Trade-offs: buffer size is not free

> **Inner Prosecutor:** “*Hold it!* Bigger buffers mean wasted memory.”
>
> **Me:** “Correct.”

A forwarder allocates **two buffers per connection** (one per direction).

So for 64KiB:

* ~128KiB per connection (just for relay buffers)
* plus runtime + socket buffers

That’s fine for “few heavy streams”, but it matters if you handle thousands of concurrent connections.

In practice, the right move is:

* choose a good default (64KiB is common)
* make it configurable
* consider buffer pooling if connection churn is heavy (see the sketch below)
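To make that last bullet concrete, here is one minimal shape a relay-buffer pool could take. It is a sketch under assumptions, not code that ships in oxidinetd: the type name, the cap of 1,024 retained buffers, and the plain `Mutex` are illustrative choices.

```rust
use std::sync::Mutex;

/// Hypothetical pool of relay buffers, so heavy connection churn doesn't
/// pay a fresh 64KiB allocation per direction on every accept.
pub struct BufPool {
    bufs: Mutex<Vec<Vec<u8>>>,
    buf_sz: usize,
}

impl BufPool {
    pub fn new(buf_sz: usize) -> Self {
        Self { bufs: Mutex::new(Vec::new()), buf_sz }
    }

    /// Reuse a returned buffer if one is available, otherwise allocate.
    /// Contents don't need zeroing: pump() only ever writes out `&buf[..n]`.
    pub fn get(&self) -> Vec<u8> {
        self.bufs
            .lock()
            .unwrap()
            .pop()
            .unwrap_or_else(|| vec![0u8; self.buf_sz])
    }

    /// Hand a buffer back when a pump() direction finishes.
    pub fn put(&self, buf: Vec<u8>) {
        let mut bufs = self.bufs.lock().unwrap();
        if bufs.len() < 1024 {
            bufs.push(buf);
        }
    }
}
```

`pump()` would then take a `&BufPool` (or an `Arc<BufPool>`) instead of a raw size, call `get()` up front, and `put()` the buffer back on exit.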
---

## 12. Closing statement

This was a good reminder that performance work is not guessing — it’s a dialogue with the system:

1. Describe the situation
2. Ask sharp questions
3. Use tools to confirm or reject each hypothesis
4. Explain the results with low-level knowledge
5. Make one change
6. Re-measure

And the funniest part: the “clean” one-liner `io::copy` was correct, but its defaults were hiding a performance policy I didn’t want.

> **Inner Prosecutor:** “Case closed?”
>
> **Me:** “For now. Next case: buffer pooling, socket buffer tuning, and maybe a Linux-only `splice(2)` fast path — carefully, behind a safe wrapper.”

---