Compare commits

3 Commits

2 changed files with 1008 additions and 193 deletions

@@ -1,193 +1,533 @@
## Testing environment setup
Install tools:
```bash
sudo apt update
sudo apt install -y hyperfine heaptrack valgrind
sudo apt install -y \
build-essential clang lld pkg-config \
linux-perf \
iperf3 netperf net-tools \
tcpdump ethtool iproute2 \
bpftrace bpfcc-tools \
strace ltrace \
sysstat procps \
git perl
```
Install FlameGraph (not packaged on Debian):
```bash
git clone https://github.com/brendangregg/FlameGraph ~/FlameGraph
echo 'export PATH="$HOME/FlameGraph:$PATH"' >> ~/.bashrc
source ~/.bashrc
which flamegraph.pl
```
Modify the Cargo.toml of version 0.1.0:
```toml
[profile.release]
lto = true
codegen-units = 1
debug = 1
strip = "none"
panic = "abort"
```
Build with frame pointers to help profiling:
```bash
git clone https://github.com/DaZuo0122/oxidinetd.git
RUSTFLAGS="-C force-frame-pointers=yes" cargo build --release
```
`profiling.conf`:
```yaml
127.0.0.1 9000 127.0.0.1 9001
```
Backend iperf3 server:
```bash
iperf3 -s -p 9001
```
Forwarder:
```bash
./oi -c profiling.conf
```
Trigger the redirect:
```bash
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 1
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 8
```
Verification:
```bash
sudo ss -tnp | egrep '(:9000|:9001)'
```
## Testing
CPU hotspot:
```bash
sudo perf top -p $(pidof oi)
```
If you see lots of:
- sys_read, sys_write, __x64_sys_sendto, tcp_sendmsg → syscall/copy overhead
- futex, __lll_lock_wait → contention/locks
- epoll_wait → executor wake behavior / too many idle polls
Hard numbers:
```bash
sudo perf stat -p $(pidof oi) -e \
cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
-- sleep 30
```
Big differences to watch:
- context-switches much higher on oi → too many tasks/wakers / lock contention
- instructions much higher on oi for same throughput → runtime overhead / copies
- cache-misses higher → allocations / poor locality
Flamegraph
Record:
```bash
sudo perf record -F 199 -g -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```
If the stack looks “flat / missing” (common with async + LTO), use dwarf unwinding:
```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```
syscall-cost check:
```bash
sudo strace -ff -c -p $(pidof oi) -o /tmp/oi.strace
# run 15–30s under load, then Ctrl+C
tail -n +1 /tmp/oi.strace.*
```
If you see huge % time in read/write/sendmsg/recvmsg, you're dominated by copying + syscalls.
eBPF stuff:
--skipped--
Smol-focused bottlenecks + the “fix list”
A) If you're syscall/copy bound
Best improvement candidates:
- buffer reuse (no per-loop Vec allocation)
- reduce tiny writes (coalesce)
- zero-copy splice (Linux-only, biggest win but more complex)
For Linux zero-copy, you'd implement a splice(2)-based fast path (socket→pipe→socket). That's how high-performance forwarders avoid double-copy.
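For reference, a minimal blocking sketch of that fast path using the `libc` crate directly; `relay_splice`, the fixed 64 KiB chunk, and the raw-fd plumbing are illustrative assumptions, and wiring this into smol would still require non-blocking pipes plus readiness handling:
```rust
// Illustrative only: a blocking splice() relay; not oi's current code.
use std::io;
use std::os::unix::io::RawFd;
use std::ptr;

/// Move bytes from `src` to `dst` through an anonymous pipe, so the payload
/// never crosses into userspace. Returns total bytes forwarded.
fn relay_splice(src: RawFd, dst: RawFd) -> io::Result<u64> {
    let mut pipe_fds = [0i32; 2];
    if unsafe { libc::pipe(pipe_fds.as_mut_ptr()) } < 0 {
        return Err(io::Error::last_os_error());
    }
    let (pipe_r, pipe_w) = (pipe_fds[0], pipe_fds[1]);
    let flags = libc::SPLICE_F_MOVE | libc::SPLICE_F_MORE;
    let mut total = 0u64;
    loop {
        // socket -> pipe (error paths leak the pipe in this sketch)
        let n = unsafe {
            libc::splice(src, ptr::null_mut(), pipe_w, ptr::null_mut(), 64 * 1024, flags)
        };
        if n < 0 { return Err(io::Error::last_os_error()); }
        if n == 0 { break; } // EOF from the source socket
        // pipe -> socket: drain everything we just queued
        let mut left = n as usize;
        while left > 0 {
            let m = unsafe {
                libc::splice(pipe_r, ptr::null_mut(), dst, ptr::null_mut(), left, flags)
            };
            if m < 0 { return Err(io::Error::last_os_error()); }
            left -= m as usize;
        }
        total += n as u64;
    }
    unsafe { libc::close(pipe_r); libc::close(pipe_w); }
    Ok(total)
}
```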
B) If you're executor/waker bound (common for async forwarders)
Symptoms:
- perf shows a lot of runtime / wake / scheduling
- perf stat shows more context switches than rinetd
Fixes:
- don't spawn 2 tasks per connection (one per direction) unless needed
  → do a single task that forwards both directions in one loop (state machine)
- avoid any shared Mutex on hot path (logging/metrics)
- keep per-conn state minimal
C) If you're single-thread limited
smol can be extremely fast, but if you're effectively running everything on one thread, throughput may cap earlier.
Fix direction:
- move to smol::Executor + N threads (usually num_cpus); see the sketch below
- or run multiple block_on() workers (careful: avoid accept() duplication)
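A minimal sketch of that multi-thread direction, assuming smol's re-exported `Executor`, `channel`, and `future::block_on`; the accept loop is elided and none of the names below are oi's current code:
```rust
// Sketch: one smol Executor shared by N worker threads.
// Workers park in ex.run(...) until the stop channel's sender is dropped.
use std::sync::Arc;
use std::thread;

fn main() {
    let ex = Arc::new(smol::Executor::new());
    let (stop_tx, stop_rx) = smol::channel::unbounded::<()>();

    let n_threads = thread::available_parallelism().map(|n| n.get()).unwrap_or(4);
    let workers: Vec<_> = (0..n_threads)
        .map(|_| {
            let ex = ex.clone();
            let stop_rx = stop_rx.clone();
            thread::spawn(move || smol::future::block_on(ex.run(stop_rx.recv())))
        })
        .collect();

    // Drive the (elided) accept loop on the same executor from the main thread;
    // each accepted connection would be spawned with ex.spawn(...).detach().
    smol::future::block_on(ex.run(async {
        // accept_loop(&ex).await
    }));

    drop(stop_tx); // unblocks every worker's ex.run(stop_rx.recv())
    for w in workers {
        let _ = w.join();
    }
}
```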
## Outcome: oi
### CPU hotspot
testing commands:
```bash
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 1
sudo perf stat -p $(pidof oi) -e \
cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
-- sleep 30
```
perf report:
```text
Performance counter stats for process id '207279':
98,571,874,480 cpu_atom/cycles/ (0.10%)
134,732,064,800 cpu_core/cycles/ (99.90%)
75,889,748,906 cpu_atom/instructions/ # 0.77 insn per cycle (0.10%)
159,098,987,713 cpu_core/instructions/ # 1.18 insn per cycle (99.90%)
30,443,258 cpu_atom/cache-misses/ (0.10%)
3,155,528 cpu_core/cache-misses/ (99.90%)
15,003,063,317 cpu_atom/branches/ (0.10%)
31,479,765,962 cpu_core/branches/ (99.90%)
149,091,165 cpu_atom/branch-misses/ # 0.99% of all branches (0.10%)
195,562,861 cpu_core/branch-misses/ # 0.62% of all branches (99.90%)
1,138 context-switches
37 cpu-migrations
33.004738330 seconds time elapsed
```
### FlameGraph
testing commands:
```bash
sudo perf record -F 199 -g -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```
outcome:
oi.svg
commands:
```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi_dwarf.svg
```
outcome:
oi_dwarf.svg
### syscall-cost check
```bash
sudo strace -ff -C -p $(pidof oi) -o /tmp/oi.strace
# run 15–30s under load, then Ctrl+C
tail -n +1 /tmp/oi.strace.*
```
## More realistic setup
Traffic goes through real kernel routing + 2 TCP legs.
Create namespaces + veth links:
```bash
sudo ip netns add ns_client
sudo ip netns add ns_server
sudo ip link add veth_c type veth peer name veth_c_ns
sudo ip link set veth_c_ns netns ns_client
sudo ip link add veth_s type veth peer name veth_s_ns
sudo ip link set veth_s_ns netns ns_server
sudo ip addr add 10.0.1.1/24 dev veth_c
sudo ip link set veth_c up
sudo ip addr add 10.0.0.1/24 dev veth_s
sudo ip link set veth_s up
sudo ip netns exec ns_client ip addr add 10.0.1.2/24 dev veth_c_ns
sudo ip netns exec ns_client ip link set veth_c_ns up
sudo ip netns exec ns_client ip link set lo up
sudo ip netns exec ns_server ip addr add 10.0.0.2/24 dev veth_s_ns
sudo ip netns exec ns_server ip link set veth_s_ns up
sudo ip netns exec ns_server ip link set lo up
sudo sysctl -w net.ipv4.ip_forward=1
```
Config to force redirect path:
```yaml
10.0.1.1 9000 10.0.0.2 9001
```
Start backend server in ns_server:
```bash
sudo ip netns exec ns_server iperf3 -s -p 9001
```
Run client in ns_client → forwarder → backend:
```bash
sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8
```
perf report:
```text
sudo perf stat -p $(pidof oi) -e cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations -- sleep 33
Performance counter stats for process id '209785':
113,810,599,893 cpu_atom/cycles/ (0.11%)
164,681,878,450 cpu_core/cycles/ (99.89%)
102,575,167,734 cpu_atom/instructions/ # 0.90 insn per cycle (0.11%)
237,094,207,911 cpu_core/instructions/ # 1.44 insn per cycle (99.89%)
33,093,338 cpu_atom/cache-misses/ (0.11%)
5,381,441 cpu_core/cache-misses/ (99.89%)
20,012,975,873 cpu_atom/branches/ (0.11%)
46,120,077,111 cpu_core/branches/ (99.89%)
211,767,555 cpu_atom/branch-misses/ # 1.06% of all branches (0.11%)
245,969,685 cpu_core/branch-misses/ # 0.53% of all branches (99.89%)
1,686 context-switches
150 cpu-migrations
33.004363800 seconds time elapsed
```
flamegraph
### Add latency + small-packet tests
netperf (request/response)
Start netserver in backend namespace:
```bash
sudo ip netns exec ns_server netserver -p 9001
```
Run TCP_RR against forwarded port:
```bash
sudo ip netns exec ns_client netperf -H 10.0.1.1 -p 9000 -t TCP_RR -l 30 -- -r 32,32
```
## After opt
Here, we replaced the 8 KiB buffer inside `futures_lite::io::copy` with a customized 16 KiB buffer. (To avoid conflicts, I renamed the binary to oiopt.)
```rust
// Assumed imports (the snippet is excerpted from handle_tcp_connection's module):
// use smol::io::{AsyncReadExt, AsyncWriteExt};
// use smol::net::TcpStream;
// use std::net::Shutdown;

async fn pump(mut r: TcpStream, mut w: TcpStream) -> io::Result<u64> {
    // let's try 16 KiB instead of futures_lite::io::copy's 8 KiB default
    // and do a profiling run to see the outcome
    let mut buf = vec![0u8; 16 * 1024];
    let mut total = 0u64;
    loop {
        let n = r.read(&mut buf).await?;
        if n == 0 {
            // EOF: send FIN to the peer
            let _ = w.shutdown(Shutdown::Write);
            break;
        }
        w.write_all(&buf[0..n]).await?;
        total += n as u64;
    }
    Ok(total)
}

// And change the function calls in handle_tcp_connection:
let client_to_server = pump(client_stream.clone(), server_stream.clone());
let server_to_client = pump(server_stream, client_stream);
```
### outcomes
Still with `sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8`
perf stat:
```text
sudo perf stat -p $(pidof oiopt) -e cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations -- sleep 33
Performance counter stats for process id '883435':
118,960,667,431 cpu_atom/cycles/ (0.05%)
131,934,369,110 cpu_core/cycles/ (99.95%)
100,530,466,140 cpu_atom/instructions/ # 0.85 insn per cycle (0.05%)
185,203,788,299 cpu_core/instructions/ # 1.40 insn per cycle (99.95%)
11,027,490 cpu_atom/cache-misses/ (0.05%)
2,123,369 cpu_core/cache-misses/ (99.95%)
19,641,945,774 cpu_atom/branches/ (0.05%)
36,245,438,057 cpu_core/branches/ (99.95%)
214,098,497 cpu_atom/branch-misses/ # 1.09% of all branches (0.05%)
179,848,095 cpu_core/branch-misses/ # 0.50% of all branches (99.95%)
2,308 context-switches
31 cpu-migrations
33.004555878 seconds time elapsed
```
system call check:
```bash
sudo timeout 30s strace -c -f -p $(pidof oiopt)
```
output:
```text
strace: Process 883435 attached with 4 threads
strace: Process 883438 detached
strace: Process 883437 detached
strace: Process 883436 detached
strace: Process 883435 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
57.80 14.590016 442121 33 epoll_wait
28.84 7.279883 4 1771146 sendto
13.33 3.363882 1 1771212 48 recvfrom
0.02 0.003843 61 62 44 futex
0.01 0.001947 12 159 epoll_ctl
0.00 0.000894 99 9 9 connect
0.00 0.000620 34 18 9 accept4
0.00 0.000503 14 34 timerfd_settime
0.00 0.000446 13 33 33 read
0.00 0.000271 15 18 ioctl
0.00 0.000189 21 9 write
0.00 0.000176 19 9 socket
0.00 0.000099 11 9 getsockopt
0.00 0.000079 4 18 shutdown
0.00 0.000049 2 18 close
------ ----------- ----------- --------- --------- ----------------
100.00 25.242897 7 3542787 143 total
```
## Further tests to explain why the gain is this big
Changed the 16 KiB buffer to 64 KiB and renamed the binary to oiopt64.
iperf3 throughput under `-P 8`: highest observed 54.1 Gbits/sec, and the other streams are also much higher than with the 16 KiB buffer.
perf stat:
```text
sudo perf stat -p $(pidof oiopt64) -e cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations -- sleep 33
Performance counter stats for process id '893123':
120,859,810,675 cpu_atom/cycles/ (0.15%)
134,735,934,329 cpu_core/cycles/ (99.85%)
79,946,979,880 cpu_atom/instructions/ # 0.66 insn per cycle (0.15%)
127,036,644,759 cpu_core/instructions/ # 0.94 insn per cycle (99.85%)
24,713,474 cpu_atom/cache-misses/ (0.15%)
9,604,449 cpu_core/cache-misses/ (99.85%)
15,584,074,530 cpu_atom/branches/ (0.15%)
24,796,180,117 cpu_core/branches/ (99.85%)
175,778,825 cpu_atom/branch-misses/ # 1.13% of all branches (0.15%)
135,067,353 cpu_core/branch-misses/ # 0.54% of all branches (99.85%)
1,519 context-switches
50 cpu-migrations
33.006529572 seconds time elapsed
```
system call check:
```bash
sudo timeout 30s strace -c -f -p $(pidof oiopt64)
```
output:
```text
strace: Process 893123 attached with 4 threads
strace: Process 893126 detached
strace: Process 893125 detached
strace: Process 893124 detached
strace: Process 893123 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
54.56 18.079500 463576 39 epoll_wait
27.91 9.249443 7 1294854 2 sendto
17.49 5.796927 4 1294919 51 recvfrom
0.01 0.003778 50 75 49 futex
0.01 0.002188 12 175 epoll_ctl
0.00 0.000747 83 9 9 connect
0.00 0.000714 17 40 timerfd_settime
0.00 0.000510 13 39 38 read
0.00 0.000452 25 18 9 accept4
0.00 0.000310 17 18 ioctl
0.00 0.000232 23 10 write
0.00 0.000200 22 9 socket
0.00 0.000183 20 9 getsockopt
0.00 0.000100 5 18 shutdown
0.00 0.000053 2 18 close
0.00 0.000020 20 1 mprotect
0.00 0.000015 15 1 sched_yield
0.00 0.000005 5 1 madvise
------ ----------- ----------- --------- --------- ----------------
100.00 33.135377 12 2590253 158 total
```
### Cleanup:
```bash
sudo ip netns del ns_client
sudo ip netns del ns_server
sudo ip link del veth_c
sudo ip link del veth_s
```

@@ -0,0 +1,475 @@
---
title: Ace Profiling Attorney - The Case of the Missing Gbits
categories: [Programming, Profiling]
tags: [Rust, kernel, networking]
---
> **Disclaimer:** This is not a language-war post. No “X vs Y”.
> This is a profiling detective story about my Rust TCP forwarder [`oi`](https://github.com/DaZuo0122/oxidinetd).
---
## 0) Prologue — The Courthouse Lobby
> **Me:** I wrote a Rust TCP port forwarder. It works. It forwards.
>
> **Inner Prosecutor (Phoenix voice):** *Hold it!* “Works” is not a metric. How fast?
>
> **Me:** Not fast enough under load.
>
> **Inner Prosecutor:** *Objection!* “Not fast enough” is an emotion. Bring evidence.
>
> **Me:** Fine. I'll bring **perf**, **strace**, and a **flamegraph**.
>
> **Inner Prosecutor:** Good. This court accepts only facts.
## 1) The Crime Scene — Setup & Reproduction
**Me:** Single machine, Debian 13. No WAN noise, no tunnel bottlenecks.
**Inner Prosecutor:** *Hold it!* If it's “single machine”, how do you avoid loopback cheating?
**Me:** Network namespaces + veth. Local, repeatable, closer to real networking.
### Environment
- Debian 13
- Kernel: `6.12.48+deb13-amd64`
- Runtime: `smol`
- Test topology: `ns_client → oi (root ns) → ns_server` via veth
### Reproduction commands
**Exhibit A: Start backend server in `ns_server`**
```bash
sudo ip netns exec ns_server iperf3 -s -p 9001
```
**Exhibit B: Run client in `ns_client` through forwarder**
```bash
sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8
```
**Inner Prosecutor:** *Hold it!* Why `-P 8`?
**Me:** Because a forwarder can look fine in `-P 1` and fall apart when syscall pressure scales.
**Inner Prosecutor:** …Acceptable.
---
## 2) The Suspects — What Could Be Limiting Throughput?
**Me:** Four suspects.
1. **CPU bound** (pure compute wall)
2. **Kernel TCP stack bound** (send/recv path, skb, softirq, netfilter/conntrack)
3. **Syscall-rate wall** (too many `sendto/recvfrom` per byte)
4. **Runtime scheduling / contention** (wake storms, locks, futex)
**Inner Prosecutor:** *Objection!* That's too broad. Narrow it down.
**Me:** That's what the tools are for.
---
## 3) Evidence #1 — `perf stat` (The Macro View)
**Me:** First I ask: are we burning CPU, thrashing schedulers, or stalling on memory?
**Command:**
```bash
sudo perf stat -p $(pidof oi) -e \
cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
-- sleep 33
```
**What I'm looking for:**
* Huge `context-switches` → runtime thrash / lock contention
* Huge `cpu-migrations` → unstable scheduling
* Very low IPC + huge cache misses → memory stalls
* Otherwise: likely syscall/kernel path
Output:
```text
Performance counter stats for process id '209785':
113,810,599,893 cpu_atom/cycles/ (0.11%)
164,681,878,450 cpu_core/cycles/ (99.89%)
102,575,167,734 cpu_atom/instructions/ # 0.90 insn per cycle (0.11%)
237,094,207,911 cpu_core/instructions/ # 1.44 insn per cycle (99.89%)
33,093,338 cpu_atom/cache-misses/ (0.11%)
5,381,441 cpu_core/cache-misses/ (99.89%)
20,012,975,873 cpu_atom/branches/ (0.11%)
46,120,077,111 cpu_core/branches/ (99.89%)
211,767,555 cpu_atom/branch-misses/ # 1.06% of all branches (0.11%)
245,969,685 cpu_core/branch-misses/ # 0.53% of all branches (99.89%)
1,686 context-switches
150 cpu-migrations
33.004363800 seconds time elapsed
```
**Low context switching**:
- context-switches: 1,686 over ~33s → ~51 switches/sec
- cpu-migrations: 150 over ~33s → ~4.5/s → very stable CPU placement
**CPU is working hard**:
- 237,094,207,911 cpu_core instructions
- IPC: 1.44 (instructions per cycle) → not lock-bound or stalling badly
**Clean cache and branch metrics**:
- cache-misses: ~5.4M on cpu_core (tiny compared to the instruction count)
- branch-misses: 0.53% on cpu_core
**Inner Prosecutor:** *Hold it!* Those numbers don't name a culprit.
**Me:** Patience. The next exhibit makes the culprit confess.
---
## 4) Evidence #2 — `strace -c` (The Confession: Syscall Composition)
**Me:** Next: “What syscalls are we paying for?”
**Command:**
```bash
sudo timeout 30s strace -c -f -p $(pidof oi)
```
**What I expect if this is a forwarding wall:**
* `sendto` and `recvfrom` dominate calls
* call counts in the millions
Output (simplified):
```text
sendto 2,190,751 calls 4.146799s (57.6%)
recvfrom 2,190,763 calls 3.052340s (42.4%)
total syscall time: 7.200789s
```
(A) **100% syscall/copy dominated:**
- Almost all traced time is inside:
- sendto() (TCP send)
- recvfrom() (TCP recv)
(B) **syscall rate is massive**
- Total send+recv calls:
- ~4,381,500 syscalls in ~32s
- → ~68k `sendto` per sec + ~68k `recvfrom` per sec
- → ~137k syscalls/sec total
**Inner Prosecutor:** *Objection!* Syscalls alone don't prove the bottleneck.
**Me:** True. So I brought a witness.
---
## 5) Evidence #3 — FlameGraph (The Witness)
**Me:** The flamegraph doesn't lie. It testifies where cycles go.
**Commands:**
```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```
**What the flamegraph showed (described, not embedded):**
* The widest towers were kernel TCP send/recv paths:
* `__x64_sys_sendto` → `tcp_sendmsg_locked` → `tcp_write_xmit` → ...
* `__x64_sys_recvfrom` → `tcp_recvmsg` → ...
* My userspace frames existed, but were comparatively thin.
* The call chain still pointed into my forwarding implementation.
**Inner Prosecutor:** *Hold it!* So you're saying… the kernel is doing the heavy lifting?
**Me:** Exactly. Which means my job is to **stop annoying the kernel** with too many tiny operations.
---
## 6) The Real Culprit — A “Perfectly Reasonable” Copy Loop
**Me:** Here's the original relay code. Looks clean, right?
```rust
let client_to_server = io::copy(client_stream.clone(), server_stream.clone());
let server_to_client = io::copy(server_stream, client_stream);
futures_lite::future::try_zip(client_to_server, server_to_client).await?;
```
**Inner Prosecutor:** *Objection!* This is idiomatic and correct.
**Me:** Yes. That's why it's dangerous.
**Key detail:** `futures_lite::io::copy` uses a small internal buffer (~8KiB in practice).
Small buffer → more iterations → more syscalls → more overhead.
If a forwarder is syscall-rate bound, this becomes a ceiling.
---
## 7) The First Breakthrough — Replace `io::copy` with `pump()`
**Me:** I wrote a manual pump loop:
* allocate a buffer once
* `read()` into it
* `write_all()` out
* on EOF: `shutdown(Write)` to propagate half-close
```rust
async fn pump(mut r: TcpStream, mut w: TcpStream, buf_sz: usize) -> io::Result<u64> {
    let mut buf = vec![0u8; buf_sz];
    let mut total = 0u64;
    loop {
        let n = r.read(&mut buf).await?;
        if n == 0 {
            let _ = w.shutdown(std::net::Shutdown::Write);
            break;
        }
        w.write_all(&buf[..n]).await?;
        total += n as u64;
    }
    Ok(total)
}
```
Run both directions:
```rust
let c2s = pump(client_stream.clone(), server_stream.clone(), BUF);
let s2c = pump(server_stream, client_stream, BUF);
try_zip(c2s, s2c).await?;
```
**Inner Prosecutor:** *Hold it!* That's just a loop. How does that win?
**Me:** Not the loop. The **bytes per syscall**.
---
## 8) Exhibit C — The Numbers (8KiB → 16KiB → 64KiB)
### Baseline: ~8KiB (generic copy helper)
Throughput:
```text
17.8 Gbit/s
```
**Inner Prosecutor:** *Objection!* That's your “crime scene” number?
**Me:** Yes. Now watch what happens when the kernel stops getting spammed.
### Pump + 16KiB buffer
Throughput:
```text
28.6 Gbit/s
```
`strace -c` showed `sendto/recvfrom` call count dropped:
```text
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
57.80 14.590016 442121 33 epoll_wait
28.84 7.279883 4 1771146 sendto
13.33 3.363882 1 1771212 48 recvfrom
0.02 0.003843 61 62 44 futex
0.01 0.001947 12 159 epoll_ctl
...
------ ----------- ----------- --------- --------- ----------------
100.00 25.242897 7 3542787 143 total
```
**Inner Prosecutor:** *Hold it!* That's already big. But you claim there's more?
**Me:** Oh, there's more.
### Pump + 64KiB buffer
Throughput:
```text
54.1 Gbit/s (best observed)
```
`perf stat` output:
```text
Performance counter stats for process id '893123':
120,859,810,675 cpu_atom/cycles/ (0.15%)
134,735,934,329 cpu_core/cycles/ (99.85%)
79,946,979,880 cpu_atom/instructions/ # 0.66 insn per cycle (0.15%)
127,036,644,759 cpu_core/instructions/ # 0.94 insn per cycle (99.85%)
24,713,474 cpu_atom/cache-misses/ (0.15%)
9,604,449 cpu_core/cache-misses/ (99.85%)
15,584,074,530 cpu_atom/branches/ (0.15%)
24,796,180,117 cpu_core/branches/ (99.85%)
175,778,825 cpu_atom/branch-misses/ # 1.13% of all branches (0.15%)
135,067,353 cpu_core/branch-misses/ # 0.54% of all branches (99.85%)
1,519 context-switches
50 cpu-migrations
33.006529572 seconds time elapsed
```
`strace -c` output:
```text
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
54.56 18.079500 463576 39 epoll_wait
27.91 9.249443 7 1294854 2 sendto
17.49 5.796927 4 1294919 51 recvfrom
...
------ ----------- ----------- --------- --------- ----------------
100.00 33.135377 12 2590253 158 total
```
**Inner Prosecutor:** *OBJECTION!* `epoll_wait` is eating the time. That's the bottleneck!
**Me:** Nice try. That's a classic trap.
---
## 9) Cross-Examination — The `epoll_wait` Trap
**Me:** `strace -c` measures time spent *inside syscalls*, including time spent **blocked**.
In async runtimes:
* One thread can sit in `epoll_wait(timeout=...)`
* Other threads do `sendto/recvfrom`
* `strace` charges the blocking time to `epoll_wait`
So `epoll_wait` dominating **does not** mean “epoll is slow”.
It often means “one thread is waiting while others work”.
**What matters here:**
* `sendto` / `recvfrom` call counts
* and how they change with buffer size
---
## 10) Final Explanation — Why 64KiB Causes a “Nonlinear” Jump
**Inner Prosecutor:** *Hold it!* You cut the syscall count by less than half. How do you nearly triple throughput?
**Me:** Because syscall walls are **nonlinear**.
A forwarder's throughput is approximately:
> **Throughput ≈ bytes_per_syscall_pair × syscall_pairs_per_second**
If you're syscall-rate limited, increasing `bytes_per_syscall_pair` pushes you past a threshold where:
* socket buffers stay fuller
* the TCP window is better utilized
* each stream spends less time in per-chunk bookkeeping
* concurrency (`-P 8`) stops fighting overhead and starts helping
Once you cross that threshold, throughput can jump until the next ceiling (kernel TCP, memory bandwidth, iperf itself).
That's why a “small” change can create a big effect.
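To make that concrete, here's a toy calculation with a purely hypothetical fixed rate of R = 100k read/write pairs per second (real pair rates shift as buffers change, and strace itself perturbs them):
```latex
% Toy numbers only: assume a hypothetical, fixed rate R = 100k read/write
% pairs per second; real pair rates shift as the buffer grows.
\text{ceiling} \approx B \times R
\quad\Longrightarrow\quad
\begin{cases}
B = 8\,\mathrm{KiB}: & 8192 \cdot 8 \cdot 10^{5} \approx 6.6\ \mathrm{Gbit/s} \\
B = 64\,\mathrm{KiB}: & 65536 \cdot 8 \cdot 10^{5} \approx 52\ \mathrm{Gbit/s}
\end{cases}
```
Same (hypothetical) syscall budget, roughly an 8× higher ceiling; the call count only has to fall a little once each call carries far more bytes.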
---
## 11) Trade-offs: buffer size is not free
**Inner Prosecutor:** *Objection!* Bigger buffers waste memory!
**Me:** Sustained.
A forwarder allocates **two buffers per connection** (one per direction).
So for 64KiB:
* ~128KiB per connection (just for relay buffers)
* plus runtime + socket buffers
That's fine for “few heavy streams”, but it matters if you handle thousands of concurrent connections.
In practice, the right move is:
* choose a good default (64KiB is common)
* make it configurable
* consider buffer pooling if connection churn is heavy (rough sketch below)
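A minimal sketch of those last two points, with a hypothetical `BufPool` type and a made-up `OI_BUF_SIZE` environment variable (neither exists in oi today); take/put happen once per connection, so the `Mutex` stays off the per-read hot path:
```rust
use std::sync::Mutex;

/// Trivial pool of fixed-size relay buffers, reused across connections.
pub struct BufPool {
    size: usize,
    free: Mutex<Vec<Vec<u8>>>,
}

impl BufPool {
    /// Buffer size would come from config, e.g. the (made-up) OI_BUF_SIZE
    /// env var, falling back to a 64 KiB default.
    pub fn from_env() -> Self {
        let size = std::env::var("OI_BUF_SIZE")
            .ok()
            .and_then(|v| v.parse().ok())
            .unwrap_or(64 * 1024);
        Self { size, free: Mutex::new(Vec::new()) }
    }

    /// Take a buffer, reusing a previously returned one if available.
    pub fn take(&self) -> Vec<u8> {
        self.free
            .lock()
            .unwrap()
            .pop()
            .unwrap_or_else(|| vec![0u8; self.size])
    }

    /// Return a buffer for reuse (called when a connection closes).
    pub fn put(&self, buf: Vec<u8>) {
        if buf.len() == self.size {
            self.free.lock().unwrap().push(buf);
        }
    }
}
```
`pump()` would then `take()` its buffer instead of calling `vec![0u8; buf_sz]`, and `put()` it back when its direction finishes.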
---
## Epilogue — Case Closed (for now)
**Inner Prosecutor:** So the culprit was…
**Me:** A perfectly reasonable helper with a default buffer size I didn't question.
**Inner Prosecutor:** And the lesson?
**Me:** Don't guess. Ask sharp questions. Use the tools. Let the system testify.
> **Verdict:** Guilty of “too many syscalls per byte.”
>
> **Sentence:** 64KiB buffers and a better relay loop.
---
## Ending
This was a good reminder that performance work is not guessing — it's a dialogue with the system:
1. Describe the situation
2. Ask sharp questions
3. Use tools to confirm
4. Explain the results using low-level knowledge
5. Make one change
6. Re-measure
And the funniest part: the “clean” one-liner `io::copy` was correct, but its defaults were hiding a performance policy I didn't want.
> **Inner Prosecutor:** “Case closed?”
>
> **Me:** “For now. Next case: buffer pooling, socket buffer tuning, and maybe a Linux-only `splice(2)` fast path — carefully, behind a safe wrapper.”
---