---
title: Ace Profiling Attorney - The Case of the Missing Gbits
categories: [Programming, Profiling]
tags: [Rust, kernel, networking]
---

> **Cast**
>
> **Me:** “I rewrote a port forwarder in Rust. It works. It’s… not fast enough.”
>
> **Inner Prosecutor:** “*Objection!* ‘Not fast enough’ is not evidence. Bring numbers.”
>
> **Me:** “Fine. We’ll do this properly.”

---

## 0. The Situation

I’m building a userspace TCP port forwarder in Rust called [oxidinetd](https://github.com/DaZuo0122/oxidinetd) (the binary is named `oi`). It accepts a TCP connection, connects to an upstream server, then relays bytes in both directions.

> This post is not a “Rust vs C” piece — it’s about **profiling**, **forming hypotheses**, and **turning measurements into speed**.

### Test environment

- OS: Debian 13
- Kernel: `6.12.48+deb13-amd64`
- Runtime: `smol`
- Benchmark: single machine, network namespaces + veth

Why namespaces + veth? The loopback interface can hide “real networking” behavior. Namespaces and veth keep the test local (and repeatable), but with a path closer to real routing.

---

> **Inner Prosecutor:** “You claim it’s repeatable. Prove your setup.”
>
> **Me:** “Here’s the lab.”

---

## 1. The Lab Setup

Backend server inside `ns_server`:

```bash
sudo ip netns exec ns_server iperf3 -s -p 9001
```

Client inside `ns_client`, traffic goes through `oi`:

```bash
sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8
```

> **Note**: `-P 8` matters. A forwarder might look okay under `-P 1`, then collapse when syscall pressure scales with concurrency.

### Forwarder config

`oi` listens on `10.0.1.1:9000` and connects to `10.0.0.2:9001`.

`profiling.conf`:

```text
10.0.1.1 9000 10.0.0.2 9001
```

---

## 2. The Questions

> **Inner Prosecutor:** “Alright. What exactly is the crime?”
>
> **Me:** “Throughput is lower than expected. The suspects:”
>
> 1. CPU bound vs I/O bound
> 2. Userspace overhead vs kernel TCP stack
> 3. Syscall-rate wall (too many `send/recv` per byte)
> 4. Async runtime scheduling / wakeups / locks

---
## 3. Evidence Tool #1 — `perf stat` (Macro view)

Command:

```bash
sudo perf stat -p $(pidof oi) -e \
  cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
  -- sleep 33
```

### What I’m looking for

* **Context switches** exploding → runtime contention or wake storms
* **CPU migrations** exploding → scheduler instability (bad for repeatability)
* **IPC** tanking + cache misses skyrocketing → memory/latency issues
* Otherwise: likely **kernel networking + syscalls** dominate

Output:

```text
 Performance counter stats for process id '209785':

   113,810,599,893      cpu_atom/cycles/                                 (0.11%)
   164,681,878,450      cpu_core/cycles/                                 (99.89%)
   102,575,167,734      cpu_atom/instructions/      # 0.90 insn per cycle    (0.11%)
   237,094,207,911      cpu_core/instructions/      # 1.44 insn per cycle    (99.89%)
        33,093,338      cpu_atom/cache-misses/                           (0.11%)
         5,381,441      cpu_core/cache-misses/                           (99.89%)
    20,012,975,873      cpu_atom/branches/                               (0.11%)
    46,120,077,111      cpu_core/branches/                               (99.89%)
       211,767,555      cpu_atom/branch-misses/     # 1.06% of all branches  (0.11%)
       245,969,685      cpu_core/branch-misses/     # 0.53% of all branches  (99.89%)
             1,686      context-switches
               150      cpu-migrations

      33.004363800 seconds time elapsed
```

Interpretation:

**Low context switching**:

- context-switches: 1,686 over ~33s → ~51 switches/sec
- cpu-migrations: 150 over ~33s → ~4.5/sec → very stable CPU placement

**CPU is working hard**:

- 237,094,207,911 cpu_core instructions
- IPC: 1.44 insn per cycle → not lock-bound or stalling badly

**Clean cache and branch metrics**:

- cache-misses: ~5.4M on the performance cores (tiny compared to ~237B instructions)
- branch-misses: ~0.5% of all branches on the performance cores

---

> **Inner Prosecutor:** “That’s a vibe-check. Where’s the real culprit?”
>
> **Me:** “Next tool. This one tells me what kind of pain we’re paying for.”

---

## 4. Evidence Tool #2 — `strace -c` (Syscall composition)

Command:

```bash
sudo timeout 30s strace -c -f -p $(pidof oi)
```

### Why `strace -c` is lethal for forwarders

A userspace TCP forwarder often boils down to:

* `recv(...)` from one socket
* `send(...)` to the other socket

If your throughput is low and `strace -c` shows **millions** of `sendto`/`recvfrom` calls, you’re likely hitting a **syscall-per-byte wall**.

Output (simplified):

```text
sendto      2,190,751 calls   4.146799s   (57.6%)
recvfrom    2,190,763 calls   3.052340s   (42.4%)

total syscall time: 7.200789s
```

Interpretation:

(A) **Almost 100% syscall/copy dominated.** Nearly all traced time is inside:

- `sendto()` (TCP send)
- `recvfrom()` (TCP recv)

(B) **The syscall rate is massive.** Total send+recv calls:

- ~4,381,500 syscalls in ~32s
- → ~68k `sendto`/sec + ~68k `recvfrom`/sec
- → ~137k syscalls/sec total

That’s exactly the pattern of a forwarder doing `recv -> send -> recv -> send ...` with a relatively small buffer.

---

> **Inner Prosecutor:** “So you’re saying the kernel is being spammed.”
>
> **Me:** “Exactly. Now I want to know who’s spamming it — my logic, my runtime, or my copy loop.”

---

## 5. Evidence Tool #3 — FlameGraph (Where cycles actually go)

Commands:

```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```

### What the flamegraph showed (described, not embedded)

Instead of embedding the graph, here’s the important story it told:

1. The widest “towers” were kernel TCP send/recv paths:

   * `__x64_sys_sendto` → `tcp_sendmsg_locked` → `tcp_write_xmit` → …
   * `__x64_sys_recvfrom` → `tcp_recvmsg` → …

2. My userspace frames existed, but they were thin compared to the kernel towers.
   That means:

   * I’m not burning CPU on complicated Rust logic.
   * I’m paying overhead at the boundary: syscalls, the TCP stack, copies.

3. In the DWARF flamegraph, the *userspace* frames pointed to my forwarding implementation:

   * the code path that ultimately calls read/write repeatedly.

> **Conclusion:** This is not “async is slow” in general. This is “my relay loop is forcing too many small kernel transitions.”

---

## 6. The Suspect: my forwarding code

Here was the original TCP relay:

```rust
// Use smol's copy function to forward data in both directions
let client_to_server = io::copy(client_stream.clone(), server_stream.clone());
let server_to_client = io::copy(server_stream, client_stream);

futures_lite::future::try_zip(client_to_server, server_to_client).await?;
```

> **Inner Prosecutor:** “*Objection!* That looks perfectly reasonable.”
>
> **Me:** “Yes. That’s why it’s dangerous.”

### Why this can be slow under high throughput

Generic `io::copy` helpers often use a relatively small internal buffer (commonly ~8KiB), plus abstraction layers that can increase:

* syscall frequency
* readiness polling
* per-chunk overhead

Small buffers aren’t “wrong”. They’re memory-friendly. But for a forwarder pushing tens of Gbit/s, **syscalls per byte** becomes the real limiter.

---

## 7. The Fix: a manual `pump()` loop (and a buffer size sweep)

I replaced `io::copy` with a manual relay loop:

* allocate a buffer once per direction
* read into it
* write it out
* on EOF, propagate half-close with `shutdown(Write)`

Code (core idea):

```rust
use std::io;

use futures_lite::io::{AsyncReadExt, AsyncWriteExt};
use smol::net::TcpStream;

/// Relay bytes from `r` to `w` until EOF, then propagate the half-close.
async fn pump(mut r: TcpStream, mut w: TcpStream, buf_sz: usize) -> io::Result<u64> {
    // One buffer per direction, allocated once for the connection's lifetime.
    let mut buf = vec![0u8; buf_sz];
    let mut total = 0u64;
    loop {
        let n = r.read(&mut buf).await?;
        if n == 0 {
            // EOF on the read side: half-close the write side.
            let _ = w.shutdown(std::net::Shutdown::Write);
            break;
        }
        w.write_all(&buf[..n]).await?;
        total += n as u64;
    }
    Ok(total)
}
```

And run both directions:

```rust
let c2s = pump(client_stream.clone(), server_stream.clone(), BUF);
let s2c = pump(server_stream, client_stream, BUF);

try_zip(c2s, s2c).await?;
```
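For completeness, here is roughly how those two `pump()` calls might sit inside an accept loop. This is a minimal sketch assuming `smol`’s `TcpListener`/`TcpStream` and a hard-coded `BUF` constant; the addresses and the constant are illustrative, not oxidinetd’s actual config handling:

```rust
use futures_lite::future::try_zip;
use smol::net::{TcpListener, TcpStream};

// Illustrative only: the real forwarder reads listen/target addresses
// and the buffer size from its configuration.
const BUF: usize = 64 * 1024;

fn main() -> std::io::Result<()> {
    smol::block_on(async {
        let listener = TcpListener::bind("10.0.1.1:9000").await?;
        loop {
            let (client_stream, _peer) = listener.accept().await?;
            // One task per connection; each task drives both pump directions.
            smol::spawn(async move {
                if let Ok(server_stream) = TcpStream::connect("10.0.0.2:9001").await {
                    // `pump` is the relay function defined above.
                    let c2s = pump(client_stream.clone(), server_stream.clone(), BUF);
                    let s2c = pump(server_stream, client_stream, BUF);
                    let _ = try_zip(c2s, s2c).await;
                }
            })
            .detach();
        }
    })
}
```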
---

> **Inner Prosecutor:** “You changed ‘one helper call’ into ‘a loop’. That’s your miracle?”
>
> **Me:** “Not the loop. The *bytes per syscall*.”

---

## 8. Verification: numbers don’t lie

Same machine, same namespaces/veth, same `iperf3 -P 8`.

### Baseline (generic copy, ~8KiB internal buffer)

Throughput:

```text
17.8 Gbit/s
```

### Pump + 16KiB buffer

Throughput:

```text
28.6 Gbit/s
```

`strace -c` showed the `sendto`/`recvfrom` call counts dropped:

```text
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 57.80   14.590016      442121        33           epoll_wait
 28.84    7.279883           4   1771146           sendto
 13.33    3.363882           1   1771212        48 recvfrom
  0.02    0.003843          61        62        44 futex
  0.01    0.001947          12       159           epoll_ctl
...
------ ----------- ----------- --------- --------- ----------------
100.00   25.242897           7   3542787       143 total
```

### Pump + 64KiB buffer

Throughput:

```text
54.1 Gbit/s (best observed)
```

`perf stat` output:

```text
 Performance counter stats for process id '893123':

   120,859,810,675      cpu_atom/cycles/                                 (0.15%)
   134,735,934,329      cpu_core/cycles/                                 (99.85%)
    79,946,979,880      cpu_atom/instructions/      # 0.66 insn per cycle    (0.15%)
   127,036,644,759      cpu_core/instructions/      # 0.94 insn per cycle    (99.85%)
        24,713,474      cpu_atom/cache-misses/                           (0.15%)
         9,604,449      cpu_core/cache-misses/                           (99.85%)
    15,584,074,530      cpu_atom/branches/                               (0.15%)
    24,796,180,117      cpu_core/branches/                               (99.85%)
       175,778,825      cpu_atom/branch-misses/     # 1.13% of all branches  (0.15%)
       135,067,353      cpu_core/branch-misses/     # 0.54% of all branches  (99.85%)
             1,519      context-switches
                50      cpu-migrations

      33.006529572 seconds time elapsed
```

`strace -c` output:

```text
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 54.56   18.079500      463576        39           epoll_wait
 27.91    9.249443           7   1294854         2 sendto
 17.49    5.796927           4   1294919        51 recvfrom
...
------ ----------- ----------- --------- --------- ----------------
100.00   33.135377          12   2590253       158 total
```

---

## 9. “Wait — why is `epoll_wait` taking most syscall time?”

> **Inner Prosecutor:** “*Objection!* Your table says `epoll_wait` dominates time. So epoll is the bottleneck!”
>
> **Me:** “Nope. That’s a common misread.”

`strace -c` counts **time spent inside syscalls**, including time spent **blocked**. In async runtimes, it’s normal for one thread to sit in `epoll_wait(timeout=...)` while other threads do the actual send/recv work. That blocking time is charged to `epoll_wait`, but it’s not “overhead” — it’s *waiting*.

The real signal is still:

* `sendto`/`recvfrom` call counts (millions)
* average microseconds per call
* whether the call count drops when the buffer size increases

That’s the syscall-per-byte story.

---

## 10. So why did 64KiB cause such a huge jump?

Two reasons:

### 1) The syscall wall is nonlinear

Throughput is roughly:

**Throughput ≈ bytes_per_syscall_pair × syscall_pairs_per_second**

(Purely hypothetical numbers for illustration: at 200k recv/send pairs per second, 8KiB per pair caps you around 13 Gbit/s, while 64KiB per pair raises the ceiling to roughly 105 Gbit/s.)

If the syscall rate is the limiter, increasing bytes per syscall can push you past a threshold where:

* socket buffers stay fuller
* TCP windows are better utilized
* per-stream pacing is smoother
* concurrency (`-P 8`) stops fighting overhead and starts working in your favor

Once you cross that threshold, throughput can jump until the *next* ceiling (kernel TCP work, memory bandwidth, or iperf itself).

### 2) Less “per-chunk” overhead in userspace

A small-buffer copy loop means more iterations, more polls, more bookkeeping. A bigger buffer means:

* fewer loop iterations per GB moved
* fewer wakeups/polls
* fewer syscall transitions per GB

The `strace` call counts dropped significantly between 16KiB and 64KiB, and throughput nearly doubled.

---

## 11. Trade-offs: buffer size is not free

> **Inner Prosecutor:** “*Hold it!* Bigger buffers mean wasted memory.”
>
> **Me:** “Correct.”

A forwarder allocates **two buffers per connection** (one per direction). So for 64KiB:

* ~128KiB per connection (just for relay buffers)
* plus runtime + socket buffers

That’s fine for “few heavy streams”, but it matters if you handle thousands of concurrent connections.

In practice, the right move is:

* choose a good default (64KiB is common)
* make it configurable
* consider buffer pooling if connection churn is heavy (see the sketch below)
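For that last point, here is a minimal sketch of what buffer pooling could look like. `BufPool` is a hypothetical type of my own naming, not something oxidinetd ships today; the idea is that `pump()` would take a buffer from the pool when a connection starts and return it when the connection closes:

```rust
use std::sync::Mutex;

/// Hypothetical pool of relay buffers, shared across connections.
pub struct BufPool {
    buf_sz: usize,
    free: Mutex<Vec<Vec<u8>>>,
}

impl BufPool {
    pub fn new(buf_sz: usize) -> Self {
        Self { buf_sz, free: Mutex::new(Vec::new()) }
    }

    /// Hand out a buffer, reusing a returned one if available.
    pub fn get(&self) -> Vec<u8> {
        self.free
            .lock()
            .unwrap()
            .pop()
            .unwrap_or_else(|| vec![0u8; self.buf_sz])
    }

    /// Take a buffer back when a connection closes; cap the pool to bound memory.
    pub fn put(&self, buf: Vec<u8>) {
        let mut free = self.free.lock().unwrap();
        if free.len() < 1024 {
            free.push(buf);
        }
    }
}
```

A plain `Mutex<Vec<...>>` is crude, but the lock is only touched once per connection (not per read), so it stays off the hot path.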
---

## 12. Closing statement

This was a good reminder that performance work is not guessing — it’s a dialogue with the system:

1. Describe the situation
2. Ask sharp questions
3. Use tools to confirm
4. Explain the results using low-level knowledge
5. Make one change
6. Re-measure

And the funniest part: the “clean” one-liner `io::copy` was correct, but its defaults were hiding a performance policy I didn’t want.

> **Inner Prosecutor:** “Case closed?”
>
> **Me:** “For now. Next case: buffer pooling, socket buffer tuning, and maybe a Linux-only `splice(2)` fast path — carefully, behind a safe wrapper.”

---