---
title: Ace Profiling Attorney - The Case of the Missing Gbits
categories: [Programming, Profiling]
tags: [Rust, kernel, networking]
---
> **Disclaimer:** This is not a language-war post. No “X vs Y”.
> This is a profiling detective story about my Rust TCP forwarder [`oi`](https://github.com/DaZuo0122/oxidinetd).
---
## 0) Prologue — The Courthouse Lobby
> **Me:** I wrote a Rust TCP port forwarder. It works. It forwards.
>
> **Inner Prosecutor (Phoenix voice):** *Hold it!* “Works” is not a metric. How fast?
>
> **Me:** Not fast enough under load.
>
> **Inner Prosecutor:** *Objection!* “Not fast enough” is an emotion. Bring evidence.
>
> **Me:** Fine. I'll bring **perf**, **strace**, and a **flamegraph**.
>
> **Inner Prosecutor:** Good. This court accepts only facts.
## 1) The Crime Scene — Setup & Reproduction
**Me:** Single machine, Debian 13. No WAN noise, no tunnel bottlenecks.
**Inner Prosecutor:** *Hold it!* If it's “single machine”, how do you avoid loopback cheating?
**Me:** Network namespaces + veth. Local, repeatable, closer to real networking.
### Environment
- Debian 13
- Kernel: `6.12.48+deb13-amd64`
- Runtime: `smol`
- Test topology: `ns_client → oi (root ns) → ns_server` via veth
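
For anyone recreating the topology, here is a rough, illustrative sketch of the kind of namespace/veth plumbing this implies (the interface names and all addresses other than `10.0.1.1` are placeholders; `oi` listens in the root namespace and forwards toward `ns_server`):

```bash
# Illustrative sketch: only ns_client, ns_server, and 10.0.1.1 come from
# the actual test; interface names and the other addresses are made up.
sudo ip netns add ns_client
sudo ip netns add ns_server

# Client side: veth pair between ns_client and the root namespace.
sudo ip link add veth-c type veth peer name veth-c-host
sudo ip link set veth-c netns ns_client
sudo ip addr add 10.0.1.1/24 dev veth-c-host      # root-ns end; oi listens here
sudo ip netns exec ns_client ip addr add 10.0.1.2/24 dev veth-c
sudo ip link set veth-c-host up
sudo ip netns exec ns_client ip link set veth-c up

# Server side: a second veth pair between the root namespace and ns_server.
sudo ip link add veth-s type veth peer name veth-s-host
sudo ip link set veth-s netns ns_server
sudo ip addr add 10.0.2.1/24 dev veth-s-host
sudo ip netns exec ns_server ip addr add 10.0.2.2/24 dev veth-s
sudo ip link set veth-s-host up
sudo ip netns exec ns_server ip link set veth-s up
```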
### Reproduction commands
**Exhibit A: Start backend server in `ns_server`**
```bash
sudo ip netns exec ns_server iperf3 -s -p 9001
```
**Exhibit B: Run client in `ns_client` through forwarder**
```bash
sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8
```
**Inner Prosecutor:** *Hold it!* Why `-P 8`?
**Me:** Because a forwarder can look fine in `-P 1` and fall apart when syscall pressure scales.
**Inner Prosecutor:** …Acceptable.
---
## 2) The Suspects — What Could Be Limiting Throughput?
**Me:** Four suspects.
1. **CPU bound** (pure compute wall)
2. **Kernel TCP stack bound** (send/recv path, skb, softirq, netfilter/conntrack)
3. **Syscall-rate wall** (too many `sendto/recvfrom` per byte)
4. **Runtime scheduling / contention** (wake storms, locks, futex)
**Inner Prosecutor:** *Objection!* That's too broad. Narrow it down.
**Me:** That's what the tools are for.
---
## 3) Evidence #1 — `perf stat` (The Macro View)
**Me:** First I ask: are we burning CPU, thrashing schedulers, or stalling on memory?
**Command:**
```bash
sudo perf stat -p $(pidof oi) -e \
cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
-- sleep 33
```
**What I'm looking for:**
* Huge `context-switches` → runtime thrash / lock contention
* Huge `cpu-migrations` → unstable scheduling
* Very low IPC + huge cache misses → memory stalls
* Otherwise: likely syscall/kernel path
Output:
```text
Performance counter stats for process id '209785':
113,810,599,893 cpu_atom/cycles/ (0.11%)
164,681,878,450 cpu_core/cycles/ (99.89%)
102,575,167,734 cpu_atom/instructions/ # 0.90 insn per cycle (0.11%)
237,094,207,911 cpu_core/instructions/ # 1.44 insn per cycle (99.89%)
33,093,338 cpu_atom/cache-misses/ (0.11%)
5,381,441 cpu_core/cache-misses/ (99.89%)
20,012,975,873 cpu_atom/branches/ (0.11%)
46,120,077,111 cpu_core/branches/ (99.89%)
211,767,555 cpu_atom/branch-misses/ # 1.06% of all branches (0.11%)
245,969,685 cpu_core/branch-misses/ # 0.53% of all branches (99.89%)
1,686 context-switches
150 cpu-migrations
33.004363800 seconds time elapsed
```
**Low context switching**:
- context-switches: 1,686 over ~33s → ~51 switches/sec
- cpu-migrations: 150 over ~33s → ~4.5/s → very stable CPU placement
**CPU is working hard**:
- 237,094,207,911 cpu_core instructions
- IPC: 1.44 (instructions per cycle) → not lock-bound or stalling badly
**Clean cache and branch metrics**:
- cache-misses: ~5.4M on `cpu_core` (tiny compared to ~237B instructions)
- branch-misses: 0.53% of branches on `cpu_core` (1.06% on `cpu_atom`)
**Inner Prosecutor:** *Hold it!* Those numbers don't name a culprit.
**Me:** Patience. The next exhibit makes the culprit confess.
---
## 4) Evidence #2 — `strace -c` (The Confession: Syscall Composition)
**Me:** Next: “What syscalls are we paying for?”
**Command:**
```bash
sudo timeout 30s strace -c -f -p $(pidof oi)
```
**What I expect if this is a forwarding wall:**
* `sendto` and `recvfrom` dominate calls
* call counts in the millions
Output (simplified):
```text
sendto 2,190,751 calls 4.146799s (57.6%)
recvfrom 2,190,763 calls 3.052340s (42.4%)
total syscall time: 7.200789s
```
(A) **100% syscall/copy dominated:**
- Almost all traced time is inside:
  - `sendto()` (TCP send)
  - `recvfrom()` (TCP recv)
(B) **Syscall rate is massive:**
- Total send+recv calls: ~4,381,500 syscalls in ~32s
  - → ~68k `sendto` per sec + ~68k `recvfrom` per sec
  - → ~137k syscalls/sec total
**Inner Prosecutor:** *Objection!* Syscalls alone don't prove the bottleneck.
**Me:** True. So I brought a witness.
---
## 5) Evidence #3 — FlameGraph (The Witness)
**Me:** The flamegraph doesn't lie. It testifies where cycles go.
**Commands:**
```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```
**What the flamegraph showed (described, not embedded):**
* The widest towers were kernel TCP send/recv paths:
* `__x64_sys_sendto` → `tcp_sendmsg_locked` → `tcp_write_xmit` → ...
* `__x64_sys_recvfrom` → `tcp_recvmsg` → ...
* My userspace frames existed, but were comparatively thin.
* The call chain still pointed into my forwarding implementation.
**Inner Prosecutor:** *Hold it!* So you're saying… the kernel is doing the heavy lifting?
**Me:** Exactly. Which means my job is to **stop annoying the kernel** with too many tiny operations.
---
## 6) The Real Culprit — A “Perfectly Reasonable” Copy Loop
**Me:** Here's the original relay code. Looks clean, right?
```rust
let client_to_server = io::copy(client_stream.clone(), server_stream.clone());
let server_to_client = io::copy(server_stream, client_stream);
futures_lite::future::try_zip(client_to_server, server_to_client).await?;
```
**Inner Prosecutor:** *Objection!* This is idiomatic and correct.
**Me:** Yes. That's why it's dangerous.
**Key detail:** `futures_lite::io::copy` uses a small internal buffer (~8KiB in practice).
Small buffer → more iterations → more syscalls → more overhead.
If a forwarder is syscall-rate bound, this becomes a ceiling.
---
## 7) The First Breakthrough — Replace `io::copy` with `pump()`
**Me:** I wrote a manual pump loop:
* allocate a buffer once
* `read()` into it
* `write_all()` out
* on EOF: `shutdown(Write)` to propagate half-close
```rust
use futures_lite::io::{AsyncReadExt, AsyncWriteExt}; // read()/write_all() extension traits

async fn pump(mut r: TcpStream, mut w: TcpStream, buf_sz: usize) -> io::Result<u64> {
    // One buffer per direction, allocated once and reused for every chunk.
    let mut buf = vec![0u8; buf_sz];
    let mut total = 0u64;
    loop {
        let n = r.read(&mut buf).await?;
        if n == 0 {
            // EOF on the read side: propagate the half-close downstream.
            let _ = w.shutdown(std::net::Shutdown::Write);
            break;
        }
        w.write_all(&buf[..n]).await?;
        total += n as u64;
    }
    Ok(total)
}
```
Run both directions:
```rust
let c2s = pump(client_stream.clone(), server_stream.clone(), BUF);
let s2c = pump(server_stream, client_stream, BUF);
try_zip(c2s, s2c).await?;
```
**Inner Prosecutor:** *Hold it!* That's just a loop. How does that win?
**Me:** Not the loop. The **bytes per syscall**.
---
## 8) Exhibit C — The Numbers (8KiB → 16KiB → 64KiB)
### Baseline: ~8KiB (generic copy helper)
Throughput:
```text
17.8 Gbit/s
```
**Inner Prosecutor:** *Objection!* That's your “crime scene” number?
**Me:** Yes. Now watch what happens when the kernel stops getting spammed.
### Pump + 16KiB buffer
Throughput:
```text
28.6 Gbit/s
```
`strace -c` showed `sendto/recvfrom` call count dropped:
```text
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 57.80   14.590016      442121        33           epoll_wait
 28.84    7.279883           4   1771146           sendto
 13.33    3.363882           1   1771212        48 recvfrom
  0.02    0.003843          61        62        44 futex
  0.01    0.001947          12       159           epoll_ctl
...
------ ----------- ----------- --------- --------- ----------------
100.00   25.242897           7   3542787       143 total
```
**Inner Prosecutor:** *Hold it!* That's already big. But you claim there's more?
**Me:** Oh, there's more.
### Pump + 64KiB buffer
Throughput:
```text
54.1 Gbit/s (best observed)
```
`perf stat` output:
```text
Performance counter stats for process id '893123':
120,859,810,675 cpu_atom/cycles/ (0.15%)
134,735,934,329 cpu_core/cycles/ (99.85%)
79,946,979,880 cpu_atom/instructions/ # 0.66 insn per cycle (0.15%)
127,036,644,759 cpu_core/instructions/ # 0.94 insn per cycle (99.85%)
24,713,474 cpu_atom/cache-misses/ (0.15%)
9,604,449 cpu_core/cache-misses/ (99.85%)
15,584,074,530 cpu_atom/branches/ (0.15%)
24,796,180,117 cpu_core/branches/ (99.85%)
175,778,825 cpu_atom/branch-misses/ # 1.13% of all branches (0.15%)
135,067,353 cpu_core/branch-misses/ # 0.54% of all branches (99.85%)
1,519 context-switches
50 cpu-migrations
33.006529572 seconds time elapsed
```
`strace -c` output:
```text
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 54.56   18.079500      463576        39           epoll_wait
 27.91    9.249443           7   1294854         2 sendto
 17.49    5.796927           4   1294919        51 recvfrom
...
------ ----------- ----------- --------- --------- ----------------
100.00   33.135377          12   2590253       158 total
```
**Inner Prosecutor:** *OBJECTION!* `epoll_wait` is eating the time. That's the bottleneck!
**Me:** Nice try. That's a classic trap.
---
## 9) Cross-Examination — The `epoll_wait` Trap
**Me:** `strace -c` measures time spent *inside syscalls*, including time spent **blocked**.
In async runtimes:
* One thread can sit in `epoll_wait(timeout=...)`
* Other threads do `sendto/recvfrom`
* `strace` charges the blocking time to `epoll_wait`
So `epoll_wait` dominating **does not** mean “epoll is slow”.
It often means “one thread is waiting while others work”.
**What matters here:**
* `sendto` / `recvfrom` call counts
* and how they change with buffer size
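
A variation worth noting: restricting `strace` to the data-path syscalls keeps blocked `epoll_wait` time out of the summary entirely, which makes call counts and per-call costs easier to compare across buffer sizes:

```bash
# Count only the data-path syscalls; blocked epoll_wait time never
# enters the table, so sendto/recvfrom costs stand on their own.
sudo timeout 30s strace -c -f -e trace=sendto,recvfrom -p $(pidof oi)
```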
---
## 10) Final Explanation — Why 64KiB Causes a “Nonlinear” Jump
**Inner Prosecutor:** *Hold it!* You only cut the `sendto`/`recvfrom` call counts by roughly 40%. How do you nearly triple throughput?
**Me:** Because syscall walls are **nonlinear**.
A forwarder's throughput is approximately:
> **Throughput ≈ bytes_per_syscall_pair × syscall_pairs_per_second**
If youre syscall-rate limited, increasing `bytes_per_syscall_pair` pushes you past a threshold where:
* socket buffers stay fuller
* the TCP window is better utilized
* each stream spends less time in per-chunk bookkeeping
* concurrency (`-P 8`) stops fighting overhead and starts helping
Once you cross that threshold, throughput can jump until the next ceiling (kernel TCP, memory bandwidth, iperf itself).
That's why a “small” change can create a big effect.
---
## 11) Trade-offs: buffer size is not free
**Inner Prosecutor:** *Objection!* Bigger buffers waste memory!
**Me:** Sustained.
A forwarder allocates **two buffers per connection** (one per direction).
So for 64KiB:
* ~128KiB per connection (just for relay buffers)
* plus runtime + socket buffers
That's fine for “few heavy streams”, but it matters if you handle thousands of concurrent connections.
In practice, the right move is:
* choose a good default (64KiB is common)
* make it configurable
* consider buffer pooling if connection churn is heavy
---
## Epilogue — Case Closed (for now)
**Inner Prosecutor:** So the culprit was…
**Me:** A perfectly reasonable helper with a default buffer size I didn't question.
**Inner Prosecutor:** And the lesson?
**Me:** Don't guess. Ask sharp questions. Use the tools. Let the system testify.
> **Verdict:** Guilty of “too many syscalls per byte.”
>
> **Sentence:** 64KiB buffers and a better relay loop.
---
## Ending
This was a good reminder that performance work is not guessing — it's a dialogue with the system:
1. Describe the situation
2. Ask sharp questions
3. Use tools to confirm
4. Explain the results using low-level knowledge
5. Make one change
6. Re-measure
And the funniest part: the “clean” one-liner `io::copy` was correct, but its defaults were hiding a performance policy I didn't want.
> **Inner Prosecutor:** “Case closed?”
>
> **Me:** “For now. Next case: buffer pooling, socket buffer tuning, and maybe a Linux-only `splice(2)` fast path — carefully, behind a safe wrapper.”
---