Upload files to "drafts"

2026-02-03 13:06:58 +08:00
parent ba61eb37b9
commit ca04039826
1 changed files with 476 additions and 0 deletions
--- a/drafts/2026-01-25-ace-profiling-attorney-the-case-of-the-missing-gbits.md
+++ b/drafts/2026-01-25-ace-profiling-attorney-the-case-of-the-missing-gbits.md
@@ -0,0 +1,476 @@
 ---
 title: Ace Profiling Attorney - The Case of the Missing Gbits
 categories: [Programming, Profiling]
 tags: [Rust, kernel, networking]
 ---
 > **Cast**
 >
 > **Me:** “I rewrote a port forwarder in Rust. It works. It’s… not fast enough.”
 >
 > **Inner Prosecutor:** “*Objection!* ‘Not fast enough’ is not evidence. Bring numbers.”
 >
 > **Me:** “Fine. We’ll do this properly.”
 ---
 ## 0. The Situation
 I’m building a userspace TCP port forwarder in Rust called [oxidinetd](https://github.com/DaZuo0122/oxidinetd) (The binary named `oi`). It accepts a TCP connection, connects to an upstream server, then relays bytes in both directions.
 > This post is not a “Rust vs C” piece — it’s about **profiling**, **forming hypotheses**, and **turning measurements into speed**.
 ### Test environment
 - OS: Debian 13
 - Kernel: `6.12.48+deb13-amd64`
 - Runtime: `smol`
 - Benchmark: single machine, network namespaces + veth
 Why namespaces + veth? The loopback can hide “real networking” behavior. Namespaces/veth keep the test local (repeatable), but with a path closer to real routing.
 ---
 > **Inner Prosecutor:** “You claim it’s repeatable. Prove your setup.”
 >
 > **Me:** “Here’s the lab.”
 ---
 ## 1. The Lab Setup
 Backend server inside `ns_server`:
 ```bash
 sudo ip netns exec ns_server iperf3 -s -p 9001
 ```
 Client inside `ns_client`, traffic goes through `oi`:
 ```bash
 sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8
 ```
 > **Note**: -P 8 matters. A forwarder might look okay under -P 1, then collapse when syscall pressure scales with concurrency.
 ### Forwarder config
 `oi` listens on `10.0.1.1:9000` and connects to `10.0.0.2:9001`.
 `profiling.conf`:
 ```yaml
 127.0.0.1 9000 127.0.0.1 9001
 ```
 ---
 ## 2. The Questions
 > **Inner Prosecutor:** “Alright. What exactly is the crime?”
 >
 > **Me:** “Throughput is lower than expected. The suspects:”
 >
 > 1. CPU bound vs I/O bound
 > 2. Userspace overhead vs kernel TCP stack
 > 3. Syscall-rate wall (too many `send/recv` per byte)
 > 4. Async runtime scheduling / wakeups / locks
 ---
 ## 3. Evidence Tool #1 — `perf stat` (Macro view)
 Command:
 ```bash
 sudo perf stat -p $(pidof oi) -e \
  cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
  -- sleep 33
 ```
 ### What I’m looking for
 * **Context switches** exploding → runtime contention or wake storms
 * **CPU migrations** exploding → scheduler instability (bad for repeatability)
 * **IPC** tanking + cache misses skyrocketing → memory/latency issues
 * Otherwise: likely **kernel networking + syscalls** dominate
 Output: 
 ```text
 Performance counter stats for process id '209785':
   113,810,599,893      cpu_atom/cycles/                                                        (0.11%)
   164,681,878,450      cpu_core/cycles/                                                        (99.89%)
   102,575,167,734      cpu_atom/instructions/           #    0.90  insn per cycle              (0.11%)
   237,094,207,911      cpu_core/instructions/           #    1.44  insn per cycle              (99.89%)
        33,093,338      cpu_atom/cache-misses/                                                  (0.11%)
         5,381,441      cpu_core/cache-misses/                                                  (99.89%)
    20,012,975,873      cpu_atom/branches/                                                      (0.11%)
    46,120,077,111      cpu_core/branches/                                                      (99.89%)
       211,767,555      cpu_atom/branch-misses/          #    1.06% of all branches             (0.11%)
       245,969,685      cpu_core/branch-misses/          #    0.53% of all branches             (99.89%)
             1,686      context-switches
               150      cpu-migrations
      33.004363800 seconds time elapsed
 ```
 Interpretation:
 **Low context switching**:
 	- context-switches: 1,686 over ~33s → ~51 switches/sec
 	- cpu-migrations: 150 over ~33s → ~4.5/s → very stable CPU placement
 **CPU is working hard**:
 	- 237,094,207,911 cpu_core instructions
 	- IPC: 1.44 (instructions per cycle) → not lock-bound or stalling badly
 **Clean cache, branch metrics**:
 	- cache-misses: ~3.1M (tiny compared to the instruction count)
 	- branch-misses: 0.62%
 ---
 > **Inner Prosecutor:** “That’s a vibe-check. Where’s the real culprit?”
 >
 > **Me:** “Next tool. This one tells me what kind of pain we’re paying for.”
 ---
 ## 4. Evidence Tool #2 — `strace -c` (Syscall composition)
 Command:
 ```bash
 sudo timeout 30s strace -c -f -p $(pidof oi)
 ```
 ### Why `strace -c` is lethal for forwarders
 A userspace TCP forwarder often boils down to:
 * `recv(...)` from one socket
 * `send(...)` to the other socket
 If your throughput is low and `strace -c` shows **millions** of `sendto/recvfrom` calls, you’re likely hitting a **syscall-per-byte wall**.
 Output (simplified):
 ```text
 sendto   2,190,751 calls   4.146799s  (57.6%)
 recvfrom 2,190,763 calls   3.052340s  (42.4%)
 total syscall time:        7.200789s
 ```
 Interpretation:
 (A) **100% syscall/copy dominated:**
 Almost all traced time is inside:
 	- sendto() (TCP send)
 	- recvfrom() (TCP recv)
 (B) **syscall rate is massive**
 Total send+recv calls:
 	- ~4,381,500 syscalls in ~32s
 	- → ~137k `sendto` per sec + ~137k `recvfrom` per sec
 	- → ~274k syscalls/sec total
 That’s exactly the pattern of a forwarder doing:
 `recv -> send -> recv -> send ...` with a relatively small buffer.
 ---
 > **Inner Prosecutor:** “So you’re saying the kernel is being spammed.”
 >
 > **Me:** “Exactly. Now I want to know who’s spamming it — my logic, my runtime, or my copy loop.”
 ---
 ## 5. Evidence Tool #3 — FlameGraph (Where cycles actually go)
 Commands:
 ```bash
 sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
 sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
 ```
 ### What the flamegraph showed (described, not embedded)
 Instead of embedding the graph, here’s the important story the flamegraph told:
 1. The widest “towers” were kernel TCP send/recv paths:
   * `__x64_sys_sendto` → `tcp_sendmsg_locked` → `tcp_write_xmit` → …
   * `__x64_sys_recvfrom` → `tcp_recvmsg` → …
 2. My userspace frames existed, but they were thin compared to the kernel towers.
   That means:
   * I’m not burning CPU on complicated Rust logic.
   * I’m paying overhead on the boundary: syscalls, TCP stack, copies.
 3. In the dwarf flamegraph, the *userspace* frames pointed to my forwarding implementation:
   * the code path that ultimately calls read/write repeatedly.
 > **Conclusion:** This is not “async is slow” in general. This is “my relay loop is forcing too many small kernel transitions.”
 ## 6. The Suspect: my forwarding code
 Here was the original TCP relay:
 ```rust
 // Use smol's copy function to forward data in both directions
 let client_to_server = io::copy(client_stream.clone(), server_stream.clone());
 let server_to_client = io::copy(server_stream, client_stream);
 futures_lite::future::try_zip(client_to_server, server_to_client).await?;
 ```
 > **Inner Prosecutor:** “*Objection!* That looks perfectly reasonable.”
 >
 > **Me:** “Yes. That’s why it’s dangerous.”
 ### Why this can be slow under high throughput
 Generic `io::copy` helpers often use a relatively small internal buffer (commonly ~8KiB), plus abstraction layers that can increase:
 * syscall frequency
 * readiness polling
 * per-chunk overhead
 Small buffers aren’t “wrong”. They’re memory-friendly. But for a forwarder pushing tens of Gbit/s, **syscalls per byte** becomes the real limiter.
 ---
 ## 7. The Fix: a manual `pump()` loop (and a buffer size sweep)
 I replaced `io::copy` with a manual relay loop:
 * allocate a buffer once per direction
 * read into it
 * write it out
 * on EOF, propagate half-close with `shutdown(Write)`
 Code (core idea):
 ```rust
 async fn pump(mut r: TcpStream, mut w: TcpStream, buf_sz: usize) -> io::Result<u64> {
    let mut buf = vec![0u8; buf_sz];
    let mut total = 0u64;
    loop {
        let n = r.read(&mut buf).await?;
        if n == 0 {
            let _ = w.shutdown(std::net::Shutdown::Write);
            break;
        }
        w.write_all(&buf[..n]).await?;
        total += n as u64;
    }
    Ok(total)
 }
 ```
 And run both directions:
 ```rust
 let c2s = pump(client_stream.clone(), server_stream.clone(), BUF);
 let s2c = pump(server_stream, client_stream, BUF);
 try_zip(c2s, s2c).await?;
 ```
 ---
 > **Inner Prosecutor:** “You changed ‘one helper call’ into ‘a loop’. That’s your miracle?”
 >
 > **Me:** “Not the loop. The *bytes per syscall*.”
 ---
 ## 8. Verification: numbers don’t lie
 Same machine, same namespaces/veth, same `iperf3 -P 8`.
 ### Baseline (generic copy, ~8KiB internal buffer)
 Throughput:
 ```text
 17.8 Gbit/s
 ```
 ### Pump + 16KiB buffer
 Throughput:
 ```text
 28.6 Gbit/s
 ```
 `strace -c` showed `sendto/recvfrom` call count dropped:
 ```text
 % time     seconds  usecs/call     calls    errors syscall
 ------ ----------- ----------- --------- --------- ----------------
 57.80   14.590016      442121        33           epoll_wait
 28.84    7.279883           4   1771146           sendto
 13.33    3.363882           1   1771212        48 recvfrom
  0.02    0.003843          61        62        44 futex
  0.01    0.001947          12       159           epoll_ctl
 ...
 ------ ----------- ----------- --------- --------- ----------------
 100.00   25.242897           7   3542787       143 total
 ```
 ### Pump + 64KiB buffer
 Throughput:
 ```text
 54.1 Gbit/s (best observed)
 ```
 `perf stat` output:
 ```text
 Performance counter stats for process id '893123':
   120,859,810,675      cpu_atom/cycles/                                                        (0.15%)
   134,735,934,329      cpu_core/cycles/                                                        (99.85%)
    79,946,979,880      cpu_atom/instructions/           #    0.66  insn per cycle              (0.15%)
   127,036,644,759      cpu_core/instructions/           #    0.94  insn per cycle              (99.85%)
        24,713,474      cpu_atom/cache-misses/                                                  (0.15%)
         9,604,449      cpu_core/cache-misses/                                                  (99.85%)
    15,584,074,530      cpu_atom/branches/                                                      (0.15%)
    24,796,180,117      cpu_core/branches/                                                      (99.85%)
       175,778,825      cpu_atom/branch-misses/          #    1.13% of all branches             (0.15%)
       135,067,353      cpu_core/branch-misses/          #    0.54% of all branches             (99.85%)
             1,519      context-switches
                50      cpu-migrations
      33.006529572 seconds time elapsed
 ```
 `strace -c` output:
 ```text
 % time     seconds  usecs/call     calls    errors syscall
 ------ ----------- ----------- --------- --------- ----------------
 54.56   18.079500      463576        39           epoll_wait
 27.91    9.249443           7   1294854         2 sendto
 17.49    5.796927           4   1294919        51 recvfrom
 ...
 ------ ----------- ----------- --------- --------- ----------------
 100.00   33.135377          12   2590253       158 total
 ```
 ---
 ## 9. “Wait — why is `epoll_wait` taking most syscall time?”
 > **Inner Prosecutor:** “*Objection!* Your table says `epoll_wait` dominates time. So epoll is the bottleneck!”
 >
 > **Me:** “Nope. That’s a common misread.”
 `strace -c` counts **time spent inside syscalls**, including time spent **blocked**.
 In async runtimes, it’s normal for one thread to sit in `epoll_wait(timeout=...)` while other threads do actual send/recv work. That blocking time is charged to `epoll_wait`, but it’s not “overhead” — it’s *waiting*.
 The real signal is still:
 * `sendto/recvfrom` call counts (millions)
 * average microseconds per call
 * and whether call count drops when buffer size increases
 That’s the syscall-per-byte story.
 ---
 ## 10. So why did 64KiB cause such a huge jump?
 Two reasons:
 ### 1) Syscall wall is nonlinear
 Throughput is roughly:
 **Throughput ≈ bytes_per_syscall_pair × syscall_pairs_per_second**
 If syscall rate is the limiter, increasing bytes per syscall can push you past a threshold where:
 * socket buffers stay fuller
 * TCP windows are better utilized
 * per-stream pacing is smoother
 * concurrency (`-P 8`) stops fighting overhead and starts working in your favor
 Once you cross that threshold, throughput can jump until the *next* ceiling (kernel TCP work, memory bandwidth, or iperf itself).
 ### 2) Less “per-chunk” overhead in userspace
 A small-buffer copy loop means more iterations, more polls, more bookkeeping.
 A bigger buffer means:
 * fewer loop iterations per GB moved
 * fewer wakeups/polls
 * fewer syscall transitions per GB
 Your `strace` call counts dropped significantly between 16KiB and 64KiB, and throughput nearly doubled.
 ---
 ## 11. Trade-offs: buffer size is not free
 > **Inner Prosecutor:** “*Hold it!* Bigger buffers mean wasted memory.”
 >
 > **Me:** “Correct.”
 A forwarder allocates **two buffers per connection** (one per direction).
 So for 64KiB:
 * ~128KiB per connection (just for relay buffers)
 * plus runtime + socket buffers
 That’s fine for “few heavy streams”, but it matters if you handle thousands of concurrent connections.
 In practice, the right move is:
 * choose a good default (64KiB is common)
 * make it configurable
 * consider buffer pooling if connection churn is heavy
 ---
 ## 12. Closing statement
 This was a good reminder that performance work is not guessing — it’s a dialogue with the system:
 1. Describe the situation
 2. Ask sharp questions
 3. Use tools to confirm
 4. Explain the results using low-level knowledge
 5. Make one change
 6. Re-measure
 And the funniest part: the “clean” one-liner `io::copy` was correct, but its defaults were hiding a performance policy I didn’t want.
 > **Inner Prosecutor:** “Case closed?”
 >
 > **Me:** “For now. Next case: buffer pooling, socket buffer tuning, and maybe a Linux-only `splice(2)` fast path — carefully, behind a safe wrapper.”
 ---