Add: more Phoenix flavor
---
title: Ace Profiling Attorney - The Case of the Missing Gbits
categories: [Programming, Profiling]
tags: [Rust, kernel, networking]
---
> **Disclaimer:** This is not a language-war post. No “X vs Y”.
>
> This is a profiling detective story about my Rust TCP forwarder [`oi`](https://github.com/DaZuo0122/oxidinetd).

---

## 0) Prologue — The Courthouse Lobby

> **Me:** I wrote a Rust TCP port forwarder. It works. It forwards.
>
> **Inner Prosecutor (Phoenix voice):** *Hold it!* “Works” is not a metric. How fast?
>
> **Me:** Not fast enough under load.
>
> **Inner Prosecutor:** *Objection!* “Not fast enough” is an emotion. Bring evidence.
>
> **Me:** Fine. I’ll bring **perf**, **strace**, and a **flamegraph**.
>
> **Inner Prosecutor:** Good. This court accepts only facts.
## 1) The Crime Scene — Setup & Reproduction

**Me:** Single machine, Debian 13. No WAN noise, no tunnel bottlenecks.

**Inner Prosecutor:** *Hold it!* If it’s “single machine”, how do you avoid loopback cheating?

**Me:** Network namespaces + veth. Local, repeatable, closer to real networking.

### Environment

- Debian 13
- Kernel: `6.12.48+deb13-amd64`
- Runtime: `smol`
- Test topology: `ns_client → oi (root ns) → ns_server` via veth
- Forwarder: `oi` listens on `10.0.1.1:9000` and connects to the backend at `10.0.0.2:9001`

### Reproduction commands

**Exhibit A: Start backend server in `ns_server`**

```bash
sudo ip netns exec ns_server iperf3 -s -p 9001
```

**Exhibit B: Run client in `ns_client` through forwarder**

```bash
sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8
```

**Inner Prosecutor:** *Hold it!* Why `-P 8`?

**Me:** Because a forwarder can look fine in `-P 1` and fall apart when syscall pressure scales.

**Inner Prosecutor:** …Acceptable.
---

## 2) The Suspects — What Could Be Limiting Throughput?

**Me:** Four suspects.

1. **CPU bound** (pure compute wall)
2. **Kernel TCP stack bound** (send/recv path, skb, softirq, netfilter/conntrack)
3. **Syscall-rate wall** (too many `sendto`/`recvfrom` per byte)
4. **Runtime scheduling / contention** (wake storms, locks, futex)

**Inner Prosecutor:** *Objection!* That’s too broad. Narrow it down.

**Me:** That’s what the tools are for.

---
## 3) Evidence #1 — `perf stat` (The Macro View)

**Me:** First I ask: are we burning CPU, thrashing schedulers, or stalling on memory?

**Command:**

```bash
sudo perf stat -p $(pidof oi) -e \
cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
-- sleep 33
```

**What I’m looking for:**

* Huge `context-switches` → runtime thrash / lock contention
* Huge `cpu-migrations` → unstable scheduling
* Very low IPC + huge cache misses → memory stalls
* Otherwise: likely syscall/kernel path

Output:

```text
Performance counter stats for process id '209785':

   113,810,599,893      cpu_atom/cycles/                 (0.11%)
   164,681,878,450      cpu_core/cycles/                 (99.89%)
   102,575,167,734      cpu_atom/instructions/           # 0.90 insn per cycle (0.11%)
   237,094,207,911      cpu_core/instructions/           # 1.44 insn per cycle (99.89%)
        33,093,338      cpu_atom/cache-misses/           (0.11%)
         5,381,441      cpu_core/cache-misses/           (99.89%)
    20,012,975,873      cpu_atom/branches/               (0.11%)
    46,120,077,111      cpu_core/branches/               (99.89%)
       211,767,555      cpu_atom/branch-misses/          # 1.06% of all branches (0.11%)
       245,969,685      cpu_core/branch-misses/          # 0.53% of all branches (99.89%)
             1,686      context-switches
               150      cpu-migrations

      33.004363800 seconds time elapsed
```
**Low context switching**:

- context-switches: 1,686 over ~33s → ~51 switches/sec
- cpu-migrations: 150 over ~33s → ~4.5/s → very stable CPU placement

**CPU is working hard**:

- 237,094,207,911 cpu_core instructions
- IPC: 1.44 (instructions per cycle) → not lock-bound or stalling badly

**Clean cache and branch metrics**:

- cache-misses: ~5.4M on cpu_core (tiny compared to the instruction count)
- branch-misses: 0.53% on cpu_core
**Inner Prosecutor:** *Hold it!* Counters are not a culprit. You still haven’t shown me the incriminating numbers.

**Me:** Patience. The next exhibit makes the culprit confess.

---
## 4) Evidence #2 — `strace -c` (The Confession: Syscall Composition)

**Me:** Next: “What syscalls are we paying for?”

**Command:**

```bash
sudo timeout 30s strace -c -f -p $(pidof oi)
```

**What I expect if this is a forwarding wall:**

* `sendto` and `recvfrom` dominate calls
* call counts in the millions

Output (simplified):

```text
sendto      2,190,751 calls   4.146799s  (57.6%)
recvfrom    2,190,763 calls   3.052340s  (42.4%)
total syscall time: 7.200789s
```
(A) **100% syscall/copy dominated:** almost all traced time is inside

- `sendto()` (TCP send)
- `recvfrom()` (TCP recv)

(B) **Syscall rate is massive:**

- total send+recv calls: ~4,381,500 syscalls in ~32s
- → ~68k `sendto` per sec + ~68k `recvfrom` per sec
- → ~137k syscalls/sec total
**Inner Prosecutor:** *Objection!* Syscalls alone don’t prove the bottleneck.

**Me:** True. So I brought a witness.

---
## 5) Evidence #3 — FlameGraph (The Witness)

**Me:** The flamegraph doesn’t lie. It testifies where cycles go.

**Commands:**

```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```

**What the flamegraph showed (described, not embedded):**

* The widest towers were kernel TCP send/recv paths:
  * `__x64_sys_sendto` → `tcp_sendmsg_locked` → `tcp_write_xmit` → ...
  * `__x64_sys_recvfrom` → `tcp_recvmsg` → ...
* My userspace frames existed, but were comparatively thin.
* The call chain still pointed into my forwarding implementation.

**Inner Prosecutor:** *Hold it!* So you’re saying… the kernel is doing the heavy lifting?

**Me:** Exactly. Which means my job is to **stop annoying the kernel** with too many tiny operations.

---
## 6) The Real Culprit — A “Perfectly Reasonable” Copy Loop

**Me:** Here’s the original relay code. Looks clean, right?

```rust
let client_to_server = io::copy(client_stream.clone(), server_stream.clone());
let server_to_client = io::copy(server_stream, client_stream);

futures_lite::future::try_zip(client_to_server, server_to_client).await?;
```

**Inner Prosecutor:** *Objection!* This is idiomatic and correct.

**Me:** Yes. That’s why it’s dangerous.

**Key detail:** `futures_lite::io::copy` uses a small internal buffer (~8KiB in practice).

Small buffer → more iterations → more syscalls → more overhead.

If a forwarder is syscall-rate bound, this becomes a ceiling.
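For intuition, here is a conceptual sketch of what a generic copy helper does under the hood. This is not the actual `futures-lite` source, just the shape of the loop, with the ~8KiB figure taken from the observation above:

```rust
use futures_lite::io::{AsyncRead, AsyncReadExt, AsyncWrite, AsyncWriteExt};

// Conceptual sketch only, not the real futures-lite implementation: a generic
// copy helper loops over a fixed internal buffer, so every iteration costs one
// read syscall plus one write syscall for at most `buf.len()` bytes.
async fn copy_like<R, W>(mut r: R, mut w: W) -> std::io::Result<u64>
where
    R: AsyncRead + Unpin,
    W: AsyncWrite + Unpin,
{
    let mut buf = [0u8; 8 * 1024]; // fixed buffer: this is the bytes-per-syscall ceiling
    let mut total = 0u64;
    loop {
        let n = r.read(&mut buf).await?;
        if n == 0 {
            break;
        }
        w.write_all(&buf[..n]).await?;
        total += n as u64;
    }
    Ok(total)
}
```

Every lap around that loop is a kernel round trip, and the fixed buffer caps how much work each round trip can do.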
---

## 7) The First Breakthrough — Replace `io::copy` with `pump()`

**Me:** I wrote a manual pump loop:

* allocate a buffer once
* `read()` into it
* `write_all()` out
* on EOF: `shutdown(Write)` to propagate half-close

```rust
async fn pump(mut r: TcpStream, mut w: TcpStream, buf_sz: usize) -> io::Result<u64> {
    let mut buf = vec![0u8; buf_sz];
    let mut total = 0u64;

    loop {
        let n = r.read(&mut buf).await?;
        if n == 0 {
            let _ = w.shutdown(std::net::Shutdown::Write);
            break;
        }
        w.write_all(&buf[..n]).await?;
        total += n as u64;
    }
    Ok(total)
}
```
Run both directions:

```rust
let c2s = pump(client_stream.clone(), server_stream.clone(), BUF);
let s2c = pump(server_stream, client_stream, BUF);
try_zip(c2s, s2c).await?;
```
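For context, here is roughly how the two pumps might get wired up per accepted connection, assuming smol's `TcpStream`. `BUF` and `relay` are illustrative names for this sketch, not oi's actual API:

```rust
use smol::net::TcpStream;
use std::io;

// Illustrative wiring: one pump per direction, joined with try_zip so the
// relay finishes when both directions hit EOF (or bails out on the first error).
const BUF: usize = 64 * 1024; // the relay buffer size that measured best below

async fn relay(client_stream: TcpStream, server_stream: TcpStream) -> io::Result<()> {
    let c2s = pump(client_stream.clone(), server_stream.clone(), BUF);
    let s2c = pump(server_stream, client_stream, BUF);
    futures_lite::future::try_zip(c2s, s2c).await?;
    Ok(())
}
```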
**Inner Prosecutor:** *Hold it!* That’s just a loop. How does that win?

**Me:** Not the loop. The **bytes per syscall**.

---
## 8) Exhibit C — The Numbers (8KiB → 16KiB → 64KiB)

### Baseline: ~8KiB (generic copy helper)

Throughput:

```text
17.8 Gbit/s
```
**Inner Prosecutor:** *Objection!* That’s your “crime scene” number?

**Me:** Yes. Now watch what happens when the kernel stops getting spammed.

### Pump + 16KiB buffer

Throughput:

```text
28.6 Gbit/s
```
`strace -c` showed the `sendto`/`recvfrom` call count dropped:

```text
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 57.80   14.590016      442121        33           epoll_wait
 28.84    7.279883           4   1771146           sendto
 13.33    3.363882           1   1771212        48 recvfrom
  0.02    0.003843          61        62        44 futex
  0.01    0.001947          12       159           epoll_ctl
...
------ ----------- ----------- --------- --------- ----------------
100.00   25.242897           7   3542787       143 total
```

**Inner Prosecutor:** *Hold it!* That’s already big. But you claim there’s more?

**Me:** Oh, there’s more.
### Pump + 64KiB buffer

Throughput:

```text
54.1 Gbit/s (best observed)
```

`perf stat` output:
```text
Performance counter stats for process id '893123':

   120,859,810,675      cpu_atom/cycles/                 (0.15%)
   134,735,934,329      cpu_core/cycles/                 (99.85%)
    79,946,979,880      cpu_atom/instructions/           # 0.66 insn per cycle (0.15%)
   127,036,644,759      cpu_core/instructions/           # 0.94 insn per cycle (99.85%)
        24,713,474      cpu_atom/cache-misses/           (0.15%)
         9,604,449      cpu_core/cache-misses/           (99.85%)
    15,584,074,530      cpu_atom/branches/               (0.15%)
    24,796,180,117      cpu_core/branches/               (99.85%)
       175,778,825      cpu_atom/branch-misses/          # 1.13% of all branches (0.15%)
       135,067,353      cpu_core/branch-misses/          # 0.54% of all branches (99.85%)
             1,519      context-switches
                50      cpu-migrations

      33.006529572 seconds time elapsed
```

`strace -c` output:

```text
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 54.56   18.079500      463576        39           epoll_wait
 27.91    9.249443           7   1294854         2 sendto
 17.49    5.796927           4   1294919        51 recvfrom
...
------ ----------- ----------- --------- --------- ----------------
100.00   33.135377          12   2590253       158 total
```
**Inner Prosecutor:** *OBJECTION!* `epoll_wait` is eating the time. That’s the bottleneck!

**Me:** Nice try. That’s a classic trap.

---

## 9) Cross-Examination — The `epoll_wait` Trap

**Me:** `strace -c` measures time spent *inside syscalls*, including time spent **blocked**.

In async runtimes:

* One thread can sit in `epoll_wait(timeout=...)`
* Other threads do `sendto`/`recvfrom`
* `strace` charges the blocking time to `epoll_wait`

So `epoll_wait` dominating **does not** mean “epoll is slow”.
It often means “one thread is waiting while others work”.

**What matters here:**

* `sendto` / `recvfrom` call counts
* and how they change with buffer size

---
## 10) Final Explanation — Why 64KiB Causes a “Nonlinear” Jump

**Inner Prosecutor:** *Hold it!* You cut the syscall count by less than half. How do you nearly triple throughput?

**Me:** Because syscall walls are **nonlinear**.

A forwarder’s throughput is approximately:

> **Throughput ≈ bytes_per_syscall_pair × syscall_pairs_per_second**

If you’re syscall-rate limited, increasing `bytes_per_syscall_pair` pushes you past a threshold where:

* socket buffers stay fuller
* the TCP window is better utilized
* each stream spends less time in per-chunk bookkeeping
* concurrency (`-P 8`) stops fighting overhead and starts helping

Once you cross that threshold, throughput can jump until the next ceiling (kernel TCP, memory bandwidth, iperf itself).

That’s why a “small” change can create a big effect.
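To make the nonlinearity concrete, here is a back-of-envelope check of that model, plugging in the measured throughputs. The syscall-pair rates are derived from the model, not measured:

```rust
// Model: throughput ≈ bytes_per_syscall_pair × syscall_pairs_per_second.
fn main() {
    let baseline_bytes_per_sec = 17.8e9 / 8.0; // 17.8 Gbit/s as bytes/sec
    let tuned_bytes_per_sec = 54.1e9 / 8.0;    // 54.1 Gbit/s as bytes/sec

    let pairs_per_sec_8k = baseline_bytes_per_sec / (8.0 * 1024.0);
    let pairs_per_sec_64k = tuned_bytes_per_sec / (64.0 * 1024.0);

    println!("8KiB per pair : ~{:.0}k read/write pairs per second", pairs_per_sec_8k / 1e3);
    println!("64KiB per pair: ~{:.0}k read/write pairs per second", pairs_per_sec_64k / 1e3);
    // Prints roughly 272k vs 103k.
}
```

Throughput roughly tripled while the implied syscall-pair rate *fell*: the win came from moving more bytes per kernel transition, not from issuing syscalls faster.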
---
## 11) Trade-offs — Buffer Size Is Not Free

**Inner Prosecutor:** *Objection!* Bigger buffers waste memory!

**Me:** Sustained.

A forwarder allocates **two buffers per connection** (one per direction).

So for 64KiB:

* ~128KiB per connection (just for relay buffers)
* plus runtime + socket buffers

That’s fine for a few heavy streams, but it matters if you handle thousands of concurrent connections.

In practice, the right move is:

* choose a good default (64KiB is common)
* make it configurable
* consider buffer pooling if connection churn is heavy (a sketch follows below)
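As a closing sketch, here is what “configurable plus pooled” could look like. `RelayBufPool` and its limits are illustrative, not oi’s actual API:

```rust
use std::sync::Mutex;

// Illustrative sketch: a configurable buffer size plus a tiny pool, so relay
// buffers get reused across short-lived connections instead of allocating two
// fresh Vecs for every accepted connection.
struct RelayBufPool {
    buf_size: usize,           // configurable; 64 * 1024 is a sensible default
    free: Mutex<Vec<Vec<u8>>>, // buffers handed back by finished connections
}

impl RelayBufPool {
    fn new(buf_size: usize) -> Self {
        Self { buf_size, free: Mutex::new(Vec::new()) }
    }

    // Reuse a returned buffer if one is available, otherwise allocate.
    fn get(&self) -> Vec<u8> {
        self.free
            .lock()
            .unwrap()
            .pop()
            .unwrap_or_else(|| vec![0u8; self.buf_size])
    }

    // Hand a buffer back; cap the pool so idle memory stays bounded.
    fn put(&self, buf: Vec<u8>) {
        let mut free = self.free.lock().unwrap();
        if free.len() < 1024 {
            free.push(buf);
        }
    }
}
```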
---

## Epilogue — Case Closed (for now)

**Inner Prosecutor:** So the culprit was…

**Me:** A perfectly reasonable helper with a default buffer size I didn’t question.

**Inner Prosecutor:** And the lesson?

**Me:** Don’t guess. Ask sharp questions. Use the tools. Let the system testify.

> **Verdict:** Guilty of “too many syscalls per byte.”
>
> **Sentence:** 64KiB buffers and a better relay loop.
---

## Closing Statement

This was a good reminder that performance work is not guessing — it’s a dialogue with the system:

1. Describe the situation
2. Ask sharp questions
3. Use tools to confirm
4. Explain the results using low-level knowledge
5. Make one change
6. Re-measure

And the funniest part: the “clean” one-liner `io::copy` was correct, but its defaults were hiding a performance policy I didn’t want.

> **Inner Prosecutor:** “Case closed?”
>
> **Me:** “For now. Next case: buffer pooling, socket buffer tuning, and maybe a Linux-only `splice(2)` fast path — carefully, behind a safe wrapper.”

---