From 144ae4a0f39603eb492d52fbb0d5bed30c4fbdee Mon Sep 17 00:00:00 2001 From: manbo Date: Tue, 3 Feb 2026 13:41:37 +0800 Subject: [PATCH] Add: more Phoenix flavor --- ...-attorney-the-case-of-the-missing-gbits.md | 949 +++++++++--------- 1 file changed, 474 insertions(+), 475 deletions(-) diff --git a/drafts/2026-01-25-ace-profiling-attorney-the-case-of-the-missing-gbits.md b/drafts/2026-01-25-ace-profiling-attorney-the-case-of-the-missing-gbits.md index 046760a..f696043 100644 --- a/drafts/2026-01-25-ace-profiling-attorney-the-case-of-the-missing-gbits.md +++ b/drafts/2026-01-25-ace-profiling-attorney-the-case-of-the-missing-gbits.md @@ -1,476 +1,475 @@ ---- -title: Ace Profiling Attorney - The Case of the Missing Gbits -categories: [Programming, Profiling] -tags: [Rust, kernel, networking] ---- - -> **Cast** -> -> **Me:** “I rewrote a port forwarder in Rust. It works. It’s… not fast enough.” -> -> **Inner Prosecutor:** “*Objection!* ‘Not fast enough’ is not evidence. Bring numbers.” -> -> **Me:** “Fine. We’ll do this properly.” - ---- - -## 0. The Situation - -I’m building a userspace TCP port forwarder in Rust called [oxidinetd](https://github.com/DaZuo0122/oxidinetd) (The binary named `oi`). It accepts a TCP connection, connects to an upstream server, then relays bytes in both directions. - -> This post is not a “Rust vs C” piece — it’s about **profiling**, **forming hypotheses**, and **turning measurements into speed**. - -### Test environment - -- OS: Debian 13 -- Kernel: `6.12.48+deb13-amd64` -- Runtime: `smol` -- Benchmark: single machine, network namespaces + veth - -Why namespaces + veth? The loopback can hide “real networking” behavior. Namespaces/veth keep the test local (repeatable), but with a path closer to real routing. - ---- - -> **Inner Prosecutor:** “You claim it’s repeatable. Prove your setup.” -> -> **Me:** “Here’s the lab.” - ---- - -## 1. The Lab Setup - -Backend server inside `ns_server`: - -```bash -sudo ip netns exec ns_server iperf3 -s -p 9001 -``` - -Client inside `ns_client`, traffic goes through `oi`: - -```bash -sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8 -``` - -> **Note**: -P 8 matters. A forwarder might look okay under -P 1, then collapse when syscall pressure scales with concurrency. - -### Forwarder config - -`oi` listens on `10.0.1.1:9000` and connects to `10.0.0.2:9001`. - -`profiling.conf`: - -```yaml -127.0.0.1 9000 127.0.0.1 9001 -``` - ---- - -## 2. The Questions - -> **Inner Prosecutor:** “Alright. What exactly is the crime?” -> -> **Me:** “Throughput is lower than expected. The suspects:” -> -> 1. CPU bound vs I/O bound -> 2. Userspace overhead vs kernel TCP stack -> 3. Syscall-rate wall (too many `send/recv` per byte) -> 4. Async runtime scheduling / wakeups / locks - ---- - -## 3. 
Evidence Tool #1 — `perf stat` (Macro view) - -Command: - -```bash -sudo perf stat -p $(pidof oi) -e \ - cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \ - -- sleep 33 -``` - -### What I’m looking for - -* **Context switches** exploding → runtime contention or wake storms -* **CPU migrations** exploding → scheduler instability (bad for repeatability) -* **IPC** tanking + cache misses skyrocketing → memory/latency issues -* Otherwise: likely **kernel networking + syscalls** dominate - -Output: - -```text - Performance counter stats for process id '209785': - - 113,810,599,893 cpu_atom/cycles/ (0.11%) - 164,681,878,450 cpu_core/cycles/ (99.89%) - 102,575,167,734 cpu_atom/instructions/ # 0.90 insn per cycle (0.11%) - 237,094,207,911 cpu_core/instructions/ # 1.44 insn per cycle (99.89%) - 33,093,338 cpu_atom/cache-misses/ (0.11%) - 5,381,441 cpu_core/cache-misses/ (99.89%) - 20,012,975,873 cpu_atom/branches/ (0.11%) - 46,120,077,111 cpu_core/branches/ (99.89%) - 211,767,555 cpu_atom/branch-misses/ # 1.06% of all branches (0.11%) - 245,969,685 cpu_core/branch-misses/ # 0.53% of all branches (99.89%) - 1,686 context-switches - 150 cpu-migrations - - 33.004363800 seconds time elapsed -``` - -Interpretation: - -**Low context switching**: - - - context-switches: 1,686 over ~33s → ~51 switches/sec - - - cpu-migrations: 150 over ~33s → ~4.5/s → very stable CPU placement - -**CPU is working hard**: - - - 237,094,207,911 cpu_core instructions - - - IPC: 1.44 (instructions per cycle) → not lock-bound or stalling badly - -**Clean cache, branch metrics**: - - - cache-misses: ~3.1M (tiny compared to the instruction count) - - - branch-misses: 0.62% - ---- - -> **Inner Prosecutor:** “That’s a vibe-check. Where’s the real culprit?” -> -> **Me:** “Next tool. This one tells me what kind of pain we’re paying for.” - ---- - -## 4. Evidence Tool #2 — `strace -c` (Syscall composition) - -Command: - -```bash -sudo timeout 30s strace -c -f -p $(pidof oi) -``` - -### Why `strace -c` is lethal for forwarders - -A userspace TCP forwarder often boils down to: - -* `recv(...)` from one socket -* `send(...)` to the other socket - -If your throughput is low and `strace -c` shows **millions** of `sendto/recvfrom` calls, you’re likely hitting a **syscall-per-byte wall**. - -Output (simplified): - -```text -sendto 2,190,751 calls 4.146799s (57.6%) -recvfrom 2,190,763 calls 3.052340s (42.4%) -total syscall time: 7.200789s - -``` - -Interpretation: - -(A) **100% syscall/copy dominated:** - -Almost all traced time is inside: - - - sendto() (TCP send) - - - recvfrom() (TCP recv) - -(B) **syscall rate is massive** - -Total send+recv calls: - - - ~4,381,500 syscalls in ~32s - - → ~137k `sendto` per sec + ~137k `recvfrom` per sec - - → ~274k syscalls/sec total - -That’s exactly the pattern of a forwarder doing: - -`recv -> send -> recv -> send ...` with a relatively small buffer. - ---- - -> **Inner Prosecutor:** “So you’re saying the kernel is being spammed.” -> -> **Me:** “Exactly. Now I want to know who’s spamming it — my logic, my runtime, or my copy loop.” - ---- - -## 5. Evidence Tool #3 — FlameGraph (Where cycles actually go) - -Commands: - -```bash -sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30 -sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg -``` - -### What the flamegraph showed (described, not embedded) - -Instead of embedding the graph, here’s the important story the flamegraph told: - -1. 
The widest “towers” were kernel TCP send/recv paths: - - * `__x64_sys_sendto` → `tcp_sendmsg_locked` → `tcp_write_xmit` → … - * `__x64_sys_recvfrom` → `tcp_recvmsg` → … - -2. My userspace frames existed, but they were thin compared to the kernel towers. - That means: - - * I’m not burning CPU on complicated Rust logic. - * I’m paying overhead on the boundary: syscalls, TCP stack, copies. - -3. In the dwarf flamegraph, the *userspace* frames pointed to my forwarding implementation: - - * the code path that ultimately calls read/write repeatedly. - -> **Conclusion:** This is not “async is slow” in general. This is “my relay loop is forcing too many small kernel transitions.” - -## 6. The Suspect: my forwarding code - -Here was the original TCP relay: - -```rust -// Use smol's copy function to forward data in both directions -let client_to_server = io::copy(client_stream.clone(), server_stream.clone()); -let server_to_client = io::copy(server_stream, client_stream); - -futures_lite::future::try_zip(client_to_server, server_to_client).await?; -``` - -> **Inner Prosecutor:** “*Objection!* That looks perfectly reasonable.” -> -> **Me:** “Yes. That’s why it’s dangerous.” - -### Why this can be slow under high throughput - -Generic `io::copy` helpers often use a relatively small internal buffer (commonly ~8KiB), plus abstraction layers that can increase: - -* syscall frequency -* readiness polling -* per-chunk overhead - -Small buffers aren’t “wrong”. They’re memory-friendly. But for a forwarder pushing tens of Gbit/s, **syscalls per byte** becomes the real limiter. - ---- - -## 7. The Fix: a manual `pump()` loop (and a buffer size sweep) - -I replaced `io::copy` with a manual relay loop: - -* allocate a buffer once per direction -* read into it -* write it out -* on EOF, propagate half-close with `shutdown(Write)` - -Code (core idea): - -```rust -async fn pump(mut r: TcpStream, mut w: TcpStream, buf_sz: usize) -> io::Result { - let mut buf = vec![0u8; buf_sz]; - let mut total = 0u64; - - loop { - let n = r.read(&mut buf).await?; - if n == 0 { - let _ = w.shutdown(std::net::Shutdown::Write); - break; - } - w.write_all(&buf[..n]).await?; - total += n as u64; - } - Ok(total) -} -``` - -And run both directions: - -```rust -let c2s = pump(client_stream.clone(), server_stream.clone(), BUF); -let s2c = pump(server_stream, client_stream, BUF); -try_zip(c2s, s2c).await?; -``` - ---- - -> **Inner Prosecutor:** “You changed ‘one helper call’ into ‘a loop’. That’s your miracle?” -> -> **Me:** “Not the loop. The *bytes per syscall*.” - ---- - -## 8. Verification: numbers don’t lie - -Same machine, same namespaces/veth, same `iperf3 -P 8`. - -### Baseline (generic copy, ~8KiB internal buffer) - -Throughput: - -```text -17.8 Gbit/s -``` - -### Pump + 16KiB buffer - -Throughput: - -```text -28.6 Gbit/s -``` - -`strace -c` showed `sendto/recvfrom` call count dropped: - -```text -% time seconds usecs/call calls errors syscall ------- ----------- ----------- --------- --------- ---------------- - 57.80 14.590016 442121 33 epoll_wait - 28.84 7.279883 4 1771146 sendto - 13.33 3.363882 1 1771212 48 recvfrom - 0.02 0.003843 61 62 44 futex - 0.01 0.001947 12 159 epoll_ctl -... 
------- ----------- ----------- --------- --------- ---------------- -100.00 25.242897 7 3542787 143 total -``` - -### Pump + 64KiB buffer - -Throughput: - -```text -54.1 Gbit/s (best observed) -``` - -`perf stat` output: - -```text -Performance counter stats for process id '893123': - - 120,859,810,675 cpu_atom/cycles/ (0.15%) - 134,735,934,329 cpu_core/cycles/ (99.85%) - 79,946,979,880 cpu_atom/instructions/ # 0.66 insn per cycle (0.15%) - 127,036,644,759 cpu_core/instructions/ # 0.94 insn per cycle (99.85%) - 24,713,474 cpu_atom/cache-misses/ (0.15%) - 9,604,449 cpu_core/cache-misses/ (99.85%) - 15,584,074,530 cpu_atom/branches/ (0.15%) - 24,796,180,117 cpu_core/branches/ (99.85%) - 175,778,825 cpu_atom/branch-misses/ # 1.13% of all branches (0.15%) - 135,067,353 cpu_core/branch-misses/ # 0.54% of all branches (99.85%) - 1,519 context-switches - 50 cpu-migrations - - 33.006529572 seconds time elapsed -``` - -`strace -c` output: - -```text -% time seconds usecs/call calls errors syscall ------- ----------- ----------- --------- --------- ---------------- - 54.56 18.079500 463576 39 epoll_wait - 27.91 9.249443 7 1294854 2 sendto - 17.49 5.796927 4 1294919 51 recvfrom -... ------- ----------- ----------- --------- --------- ---------------- -100.00 33.135377 12 2590253 158 total -``` - ---- - -## 9. “Wait — why is `epoll_wait` taking most syscall time?” - -> **Inner Prosecutor:** “*Objection!* Your table says `epoll_wait` dominates time. So epoll is the bottleneck!” -> -> **Me:** “Nope. That’s a common misread.” - -`strace -c` counts **time spent inside syscalls**, including time spent **blocked**. - -In async runtimes, it’s normal for one thread to sit in `epoll_wait(timeout=...)` while other threads do actual send/recv work. That blocking time is charged to `epoll_wait`, but it’s not “overhead” — it’s *waiting*. - -The real signal is still: - -* `sendto/recvfrom` call counts (millions) -* average microseconds per call -* and whether call count drops when buffer size increases - -That’s the syscall-per-byte story. - ---- -## 10. So why did 64KiB cause such a huge jump? - -Two reasons: - -### 1) Syscall wall is nonlinear - -Throughput is roughly: - -**Throughput ≈ bytes_per_syscall_pair × syscall_pairs_per_second** - -If syscall rate is the limiter, increasing bytes per syscall can push you past a threshold where: - -* socket buffers stay fuller -* TCP windows are better utilized -* per-stream pacing is smoother -* concurrency (`-P 8`) stops fighting overhead and starts working in your favor - -Once you cross that threshold, throughput can jump until the *next* ceiling (kernel TCP work, memory bandwidth, or iperf itself). - -### 2) Less “per-chunk” overhead in userspace - -A small-buffer copy loop means more iterations, more polls, more bookkeeping. -A bigger buffer means: - -* fewer loop iterations per GB moved -* fewer wakeups/polls -* fewer syscall transitions per GB - -Your `strace` call counts dropped significantly between 16KiB and 64KiB, and throughput nearly doubled. - ---- - -## 11. Trade-offs: buffer size is not free - -> **Inner Prosecutor:** “*Hold it!* Bigger buffers mean wasted memory.” -> -> **Me:** “Correct.” - -A forwarder allocates **two buffers per connection** (one per direction). - -So for 64KiB: - -* ~128KiB per connection (just for relay buffers) -* plus runtime + socket buffers - -That’s fine for “few heavy streams”, but it matters if you handle thousands of concurrent connections. 
In practice, the right move is:

* choose a good default (64KiB is common)
* make it configurable
* consider buffer pooling if connection churn is heavy

---

## 12. Closing statement

This was a good reminder that performance work is not guessing — it’s a dialogue with the system:

1. Describe the situation
2. Ask sharp questions
3. Use tools to confirm
4. Explain the results using low-level knowledge
5. Make one change
6. Re-measure

And the funniest part: the “clean” one-liner `io::copy` was correct, but its defaults were hiding a performance policy I didn’t want.

> **Inner Prosecutor:** “Case closed?”
>
> **Me:** “For now. Next case: buffer pooling, socket buffer tuning, and maybe a Linux-only `splice(2)` fast path — carefully, behind a safe wrapper.”

---
title: Ace Profiling Attorney - The Case of the Missing Gbits
categories: [Programming, Profiling]
tags: [Rust, kernel, networking]
---

> **Disclaimer:** This is not a language-war post. No “X vs Y”.
> This is a profiling detective story about my Rust TCP forwarder [`oi`](https://github.com/DaZuo0122/oxidinetd).

---
## 0) Prologue — The Courthouse Lobby

> **Me:** I wrote a Rust TCP port forwarder. It works. It forwards.
>
> **Inner Prosecutor (Phoenix voice):** *Hold it!* “Works” is not a metric. How fast?
>
> **Me:** Not fast enough under load.
>
> **Inner Prosecutor:** *Objection!* “Not fast enough” is an emotion. Bring evidence.
>
> **Me:** Fine. I’ll bring **perf**, **strace**, and a **flamegraph**.
>
> **Inner Prosecutor:** Good. This court accepts only facts.

## 1) The Crime Scene — Setup & Reproduction

**Me:** Single machine, Debian 13. No WAN noise, no tunnel bottlenecks.

**Inner Prosecutor:** *Hold it!* If it’s “single machine”, how do you avoid loopback cheating?

**Me:** Network namespaces + veth. Local, repeatable, closer to real networking.

### Environment

- Debian 13
- Kernel: `6.12.48+deb13-amd64`
- Runtime: `smol`
- Test topology: `ns_client → oi (root ns) → ns_server` via veth

### Reproduction commands

**Exhibit A: Start backend server in `ns_server`**

```bash
sudo ip netns exec ns_server iperf3 -s -p 9001
```

**Exhibit B: Run client in `ns_client` through forwarder**

```bash
sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8
```

**Inner Prosecutor:** *Hold it!* Why `-P 8`?

**Me:** Because a forwarder can look fine in `-P 1` and fall apart when syscall pressure scales.

**Inner Prosecutor:** …Acceptable.
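**Me:** For the record, the post doesn’t ship my exact namespace script, but the wiring looks roughly like this. A sketch only: the interface names and every address except `10.0.1.1` (the one the client dials) and `10.0.0.2` (the upstream `oi` forwards to) are assumptions.

```bash
# ns_client (10.0.1.2) <-veth-> root ns (10.0.1.1, where `oi` listens)
# root ns (10.0.0.1)   <-veth-> ns_server (10.0.0.2, where iperf3 -s runs)
sudo ip netns add ns_client
sudo ip netns add ns_server

# Client-side veth pair
sudo ip link add veth-c type veth peer name veth-c-ns
sudo ip link set veth-c-ns netns ns_client
sudo ip addr add 10.0.1.1/24 dev veth-c
sudo ip link set veth-c up
sudo ip netns exec ns_client ip addr add 10.0.1.2/24 dev veth-c-ns
sudo ip netns exec ns_client ip link set veth-c-ns up

# Server-side veth pair
sudo ip link add veth-s type veth peer name veth-s-ns
sudo ip link set veth-s-ns netns ns_server
sudo ip addr add 10.0.0.1/24 dev veth-s
sudo ip link set veth-s up
sudo ip netns exec ns_server ip addr add 10.0.0.2/24 dev veth-s-ns
sudo ip netns exec ns_server ip link set veth-s-ns up
```

Tearing it down is just `sudo ip netns del ns_client` and `sudo ip netns del ns_server`; the veth pairs disappear with their namespaces.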
---

## 2) The Suspects — What Could Be Limiting Throughput?

**Me:** Four suspects.

1. **CPU bound** (pure compute wall)
2. **Kernel TCP stack bound** (send/recv path, skb, softirq, netfilter/conntrack)
3. **Syscall-rate wall** (too many `sendto/recvfrom` per byte)
4. **Runtime scheduling / contention** (wake storms, locks, futex)

**Inner Prosecutor:** *Objection!* That’s too broad. Narrow it down.

**Me:** That’s what the tools are for.

---

## 3) Evidence #1 — `perf stat` (The Macro View)

**Me:** First I ask: are we burning CPU, thrashing schedulers, or stalling on memory?

**Command:**

```bash
sudo perf stat -p $(pidof oi) -e \
  cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
  -- sleep 33
```

**What I’m looking for:**

* Huge `context-switches` → runtime thrash / lock contention
* Huge `cpu-migrations` → unstable scheduling
* Very low IPC + huge cache misses → memory stalls
* Otherwise: likely syscall/kernel path

Output:

```text
 Performance counter stats for process id '209785':

 113,810,599,893 cpu_atom/cycles/ (0.11%)
 164,681,878,450 cpu_core/cycles/ (99.89%)
 102,575,167,734 cpu_atom/instructions/ # 0.90 insn per cycle (0.11%)
 237,094,207,911 cpu_core/instructions/ # 1.44 insn per cycle (99.89%)
 33,093,338 cpu_atom/cache-misses/ (0.11%)
 5,381,441 cpu_core/cache-misses/ (99.89%)
 20,012,975,873 cpu_atom/branches/ (0.11%)
 46,120,077,111 cpu_core/branches/ (99.89%)
 211,767,555 cpu_atom/branch-misses/ # 1.06% of all branches (0.11%)
 245,969,685 cpu_core/branch-misses/ # 0.53% of all branches (99.89%)
 1,686 context-switches
 150 cpu-migrations

 33.004363800 seconds time elapsed
```

**Low context switching**:

  - context-switches: 1,686 over ~33s → ~51 switches/sec

  - cpu-migrations: 150 over ~33s → ~4.5/s → very stable CPU placement

**CPU is working hard**:

  - 237,094,207,911 cpu_core instructions

  - IPC: 1.44 (instructions per cycle) → not lock-bound or stalling badly

**Clean cache, branch metrics**:

  - cache-misses: ~5.4M on the core counters (tiny compared to ~237B instructions)

  - branch-misses: 0.53% of core branches, 1.06% of atom branches

**Inner Prosecutor:** *Hold it!* Those counters are a vibe check, not a culprit.

**Me:** Patience. The next exhibit makes the culprit confess.

---

## 4) Evidence #2 — `strace -c` (The Confession: Syscall Composition)

**Me:** Next: “What syscalls are we paying for?”

**Command:**

```bash
sudo timeout 30s strace -c -f -p $(pidof oi)
```

**What I expect if this is a forwarding wall:**

* `sendto` and `recvfrom` dominate calls
* call counts in the millions

Output (simplified):

```text
sendto 2,190,751 calls 4.146799s (57.6%)
recvfrom 2,190,763 calls 3.052340s (42.4%)
total syscall time: 7.200789s
```

(A) **Syscall/copy dominated:**

- Almost all traced time is inside:

  - sendto() (TCP send)

  - recvfrom() (TCP recv)

(B) **Syscall rate is massive**

- Total send+recv calls:

  - ~4,381,500 syscalls in ~32s
  - → ~68k `sendto` per sec + ~68k `recvfrom` per sec
  - → ~137k syscalls/sec total

**Inner Prosecutor:** *Objection!* Syscalls alone don’t prove the bottleneck.

**Me:** True. So I brought a witness.

---

## 5) Evidence #3 — FlameGraph (The Witness)

**Me:** The flamegraph doesn’t lie. It testifies where cycles go.

**Commands:**

```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```
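One procedural note: `stackcollapse-perf.pl` and `flamegraph.pl` are not part of `perf` itself; they come from Brendan Gregg’s FlameGraph repository. If they aren’t already on your `PATH`, something like this works (the clone location is an assumption):

```bash
# One-time setup for the FlameGraph scripts used above
git clone https://github.com/brendangregg/FlameGraph ~/FlameGraph
export PATH="$PATH:$HOME/FlameGraph"
```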
+ +--- + +## 6) The Real Culprit — A “Perfectly Reasonable” Copy Loop + +**Me:** Here’s the original relay code. Looks clean, right? + +```rust +let client_to_server = io::copy(client_stream.clone(), server_stream.clone()); +let server_to_client = io::copy(server_stream, client_stream); + +futures_lite::future::try_zip(client_to_server, server_to_client).await?; +``` + +**Inner Prosecutor:** *Objection!* This is idiomatic and correct. + +**Me:** Yes. That’s why it’s dangerous. + +**Key detail:** `futures_lite::io::copy` uses a small internal buffer (~8KiB in practice). +Small buffer → more iterations → more syscalls → more overhead. + +If a forwarder is syscall-rate bound, this becomes a ceiling. + +--- + +## 7) The First Breakthrough — Replace `io::copy` with `pump()` + +**Me:** I wrote a manual pump loop: + +* allocate a buffer once +* `read()` into it +* `write_all()` out +* on EOF: `shutdown(Write)` to propagate half-close + +```rust +async fn pump(mut r: TcpStream, mut w: TcpStream, buf_sz: usize) -> io::Result { + let mut buf = vec![0u8; buf_sz]; + let mut total = 0u64; + + loop { + let n = r.read(&mut buf).await?; + if n == 0 { + let _ = w.shutdown(std::net::Shutdown::Write); + break; + } + w.write_all(&buf[..n]).await?; + total += n as u64; + } + Ok(total) +} +``` + +Run both directions: + +```rust +let c2s = pump(client_stream.clone(), server_stream.clone(), BUF); +let s2c = pump(server_stream, client_stream, BUF); +try_zip(c2s, s2c).await?; +``` + +**Inner Prosecutor:** *Hold it!* That’s just a loop. How does that win? + +**Me:** Not the loop. The **bytes per syscall**. + +--- + +### 8) Exhibit C — The Numbers (8KiB → 16KiB → 64KiB) + +### Baseline: ~8KiB (generic copy helper) + +Throughput: + +```text +17.8 Gbit/s +``` + + +**Inner Prosecutor:** *Objection!* That’s your “crime scene” number? + +**Me:** Yes. Now watch what happens when the kernel stops getting spammed. + + +### Pump + 16KiB buffer + +Throughput: + +```text +28.6 Gbit/s +``` + +`strace -c` showed `sendto/recvfrom` call count dropped: + +```text +% time seconds usecs/call calls errors syscall +------ ----------- ----------- --------- --------- ---------------- + 57.80 14.590016 442121 33 epoll_wait + 28.84 7.279883 4 1771146 sendto + 13.33 3.363882 1 1771212 48 recvfrom + 0.02 0.003843 61 62 44 futex + 0.01 0.001947 12 159 epoll_ctl +... +------ ----------- ----------- --------- --------- ---------------- +100.00 25.242897 7 3542787 143 total +``` + + +**Inner Prosecutor:** *Hold it!* That’s already big. But you claim there’s more? + +**Me:** Oh, there’s more. 
### Pump + 64KiB buffer

Throughput:

```text
54.1 Gbit/s (best observed)
```

`perf stat` output:

```text
Performance counter stats for process id '893123':

 120,859,810,675 cpu_atom/cycles/ (0.15%)
 134,735,934,329 cpu_core/cycles/ (99.85%)
 79,946,979,880 cpu_atom/instructions/ # 0.66 insn per cycle (0.15%)
 127,036,644,759 cpu_core/instructions/ # 0.94 insn per cycle (99.85%)
 24,713,474 cpu_atom/cache-misses/ (0.15%)
 9,604,449 cpu_core/cache-misses/ (99.85%)
 15,584,074,530 cpu_atom/branches/ (0.15%)
 24,796,180,117 cpu_core/branches/ (99.85%)
 175,778,825 cpu_atom/branch-misses/ # 1.13% of all branches (0.15%)
 135,067,353 cpu_core/branch-misses/ # 0.54% of all branches (99.85%)
 1,519 context-switches
 50 cpu-migrations

 33.006529572 seconds time elapsed
```

`strace -c` output:

```text
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 54.56   18.079500      463576        39           epoll_wait
 27.91    9.249443           7   1294854         2 sendto
 17.49    5.796927           4   1294919        51 recvfrom
...
------ ----------- ----------- --------- --------- ----------------
100.00   33.135377          12   2590253       158 total
```

**Inner Prosecutor:** *OBJECTION!* `epoll_wait` is eating the time. That’s the bottleneck!

**Me:** Nice try. That’s a classic trap.

---

## 9) Cross-Examination — The `epoll_wait` Trap

**Me:** `strace -c` measures time spent *inside syscalls*, including time spent **blocked**.

In async runtimes:

* One thread can sit in `epoll_wait(timeout=...)`
* Other threads do `sendto/recvfrom`
* `strace` charges the blocking time to `epoll_wait`

So `epoll_wait` dominating **does not** mean “epoll is slow”.
It often means “one thread is waiting while others work”.

**What matters here:**

* `sendto` / `recvfrom` call counts
* and how they change with buffer size

---

## 10) Final Explanation — Why 64KiB Causes a “Nonlinear” Jump

**Inner Prosecutor:** *Hold it!* Going from 16KiB to 64KiB only cut the `sendto`/`recvfrom` counts by about a quarter. How does that nearly double throughput again, and roughly triple the baseline?

**Me:** Because syscall walls are **nonlinear**.

A forwarder’s throughput is approximately:

> **Throughput ≈ bytes_per_syscall_pair × syscall_pairs_per_second**

If you’re syscall-rate limited, increasing `bytes_per_syscall_pair` pushes you past a threshold where:

* socket buffers stay fuller
* the TCP window is better utilized
* each stream spends less time in per-chunk bookkeeping
* concurrency (`-P 8`) stops fighting overhead and starts helping

Once you cross that threshold, throughput can jump until the next ceiling (kernel TCP, memory bandwidth, iperf itself).

That’s why a “small” change can create a big effect.
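To put rough numbers on that formula, here is a back-of-the-envelope check using only figures already in this post; holding the pair rate constant is deliberately unrealistic and just shows the shape of the curve:

```bash
# Baseline: 17.8 Gbit/s at ~8 KiB per recv+send pair
# implied pair rate ~= 17.8e9 / (8 * 1024 * 8 bits) ~= 272k pairs/s
for kib in 8 16 64; do
  awk -v k="$kib" 'BEGIN { printf "%2d KiB/pair -> %6.1f Gbit/s at a constant 272k pairs/s\n", k, k*1024*8*272000/1e9 }'
done
```

Measured reality lands at 28.6 and 54.1 Gbit/s rather than the predicted 35.7 and 142.6, because the pair rate itself falls as each call moves more data (note the `usecs/call` for `sendto` creeping from 4 to 7 between the two `strace` tables). The formula still explains the direction and the rough size of the jump.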
---

## 11) Trade-offs: Buffer Size Is Not Free

**Inner Prosecutor:** *Objection!* Bigger buffers waste memory!

**Me:** Sustained.

A forwarder allocates **two buffers per connection** (one per direction).

So for 64KiB:

* ~128KiB per connection (just for relay buffers)
* plus runtime + socket buffers

That’s fine for “few heavy streams”, but it matters if you handle thousands of concurrent connections.

In practice, the right move is:

* choose a good default (64KiB is common)
* make it configurable
* consider buffer pooling if connection churn is heavy

---

## Epilogue — Case Closed (for now)

**Inner Prosecutor:** So the culprit was…

**Me:** A perfectly reasonable helper with a default buffer size I didn’t question.

**Inner Prosecutor:** And the lesson?

**Me:** Don’t guess. Ask sharp questions. Use the tools. Let the system testify.

> **Verdict:** Guilty of “too many syscalls per byte.”
>
> **Sentence:** 64KiB buffers and a better relay loop.

---

## Closing Statement

This was a good reminder that performance work is not guessing — it’s a dialogue with the system:

1. Describe the situation
2. Ask sharp questions
3. Use tools to confirm
4. Explain the results using low-level knowledge
5. Make one change
6. Re-measure

And the funniest part: the “clean” one-liner `io::copy` was correct, but its defaults were hiding a performance policy I didn’t want.

> **Inner Prosecutor:** “Case closed?”
>
> **Me:** “For now. Next case: buffer pooling, socket buffer tuning, and maybe a Linux-only `splice(2)` fast path — carefully, behind a safe wrapper.”

---
\ No newline at end of file