---
title: Ace Profiling Attorney - The Case of the Missing Gbits
categories: [Programming, Profiling]
tags: [Rust, kernel, networking]
---
> **Cast**
>
> **Me:** “I rewrote a port forwarder in Rust. It works. It's… not fast enough.”
>
> **Inner Prosecutor:** “*Objection!* ‘Not fast enough’ is not evidence. Bring numbers.”
>
> **Me:** “Fine. We'll do this properly.”
---
## 0. The Situation
I'm building a userspace TCP port forwarder in Rust called [oxidinetd](https://github.com/DaZuo0122/oxidinetd) (the binary is named `oi`). It accepts a TCP connection, connects to an upstream server, and relays bytes in both directions.
> This post is not a “Rust vs C” piece; it's about **profiling**, **forming hypotheses**, and **turning measurements into speed**.
### Test environment
- OS: Debian 13
- Kernel: `6.12.48+deb13-amd64`
- Runtime: `smol`
- Benchmark: single machine, network namespaces + veth

Why namespaces + veth? The loopback can hide “real networking” behavior. Namespaces/veth keep the test local (repeatable), but with a path closer to real routing.
---
> **Inner Prosecutor:** “You claim it's repeatable. Prove your setup.”
>
> **Me:** “Here's the lab.”
---
## 1. The Lab Setup
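First, the namespaces and veth pairs themselves. Here's a minimal sketch of one topology that matches the addresses used below, assuming `oi` runs in the root namespace with one veth pair per namespace; the interface names and the `/24` masks are illustrative choices of mine, not the exact lab commands:
```bash
# Server side: ns_server <-> root namespace over 10.0.0.0/24
sudo ip netns add ns_server
sudo ip link add veth-srv type veth peer name veth-srv-ns
sudo ip link set veth-srv-ns netns ns_server
sudo ip addr add 10.0.0.1/24 dev veth-srv
sudo ip link set veth-srv up
sudo ip netns exec ns_server ip addr add 10.0.0.2/24 dev veth-srv-ns
sudo ip netns exec ns_server ip link set veth-srv-ns up

# Client side: ns_client <-> root namespace over 10.0.1.0/24
sudo ip netns add ns_client
sudo ip link add veth-cli type veth peer name veth-cli-ns
sudo ip link set veth-cli-ns netns ns_client
sudo ip addr add 10.0.1.1/24 dev veth-cli
sudo ip link set veth-cli up
sudo ip netns exec ns_client ip addr add 10.0.1.2/24 dev veth-cli-ns
sudo ip netns exec ns_client ip link set veth-cli-ns up
```
With that in place, the benchmark pieces look like this.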
Backend server inside `ns_server`:
```bash
sudo ip netns exec ns_server iperf3 -s -p 9001
```
Client inside `ns_client`, traffic goes through `oi`:
```bash
sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8
```
> **Note**: `-P 8` matters. A forwarder might look okay under `-P 1`, then collapse when syscall pressure scales with concurrency.
### Forwarder config
`oi` listens on `10.0.1.1:9000` and connects to `10.0.0.2:9001`.
`profiling.conf`:
```text
10.0.1.1 9000 10.0.0.2 9001
```
---
## 2. The Questions
> **Inner Prosecutor:** “Alright. What exactly is the crime?”
>
> **Me:** “Throughput is lower than expected. The suspects:”
>
> 1. CPU bound vs I/O bound
> 2. Userspace overhead vs kernel TCP stack
> 3. Syscall-rate wall (too many `send/recv` per byte)
> 4. Async runtime scheduling / wakeups / locks
---
## 3. Evidence Tool #1 — `perf stat` (Macro view)
Command:
```bash
sudo perf stat -p $(pidof oi) -e \
cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
-- sleep 33
```
### What I'm looking for
* **Context switches** exploding → runtime contention or wake storms
* **CPU migrations** exploding → scheduler instability (bad for repeatability)
* **IPC** tanking + cache misses skyrocketing → memory/latency issues
* Otherwise: likely **kernel networking + syscalls** dominate
Output:
```text
Performance counter stats for process id '209785':
113,810,599,893 cpu_atom/cycles/ (0.11%)
164,681,878,450 cpu_core/cycles/ (99.89%)
102,575,167,734 cpu_atom/instructions/ # 0.90 insn per cycle (0.11%)
237,094,207,911 cpu_core/instructions/ # 1.44 insn per cycle (99.89%)
33,093,338 cpu_atom/cache-misses/ (0.11%)
5,381,441 cpu_core/cache-misses/ (99.89%)
20,012,975,873 cpu_atom/branches/ (0.11%)
46,120,077,111 cpu_core/branches/ (99.89%)
211,767,555 cpu_atom/branch-misses/ # 1.06% of all branches (0.11%)
245,969,685 cpu_core/branch-misses/ # 0.53% of all branches (99.89%)
1,686 context-switches
150 cpu-migrations
33.004363800 seconds time elapsed
```
Interpretation:

**Low context switching**:

- context-switches: 1,686 over ~33s → ~51 switches/sec
- cpu-migrations: 150 over ~33s → ~4.5/s → very stable CPU placement

**CPU is working hard**:

- 237,094,207,911 cpu_core instructions
- IPC: 1.44 (instructions per cycle) → not lock-bound or stalling badly

**Clean cache and branch metrics**:

- cache-misses: ~5.4M on the performance cores (tiny compared to the instruction count)
- branch-misses: 0.53% of branches on the performance cores
---
> **Inner Prosecutor:** “That's a vibe check. Where's the real culprit?”
>
> **Me:** “Next tool. This one tells me what kind of pain we're paying for.”
---
## 4. Evidence Tool #2 — `strace -c` (Syscall composition)
Command:
```bash
sudo timeout 30s strace -c -f -p $(pidof oi)
```
### Why `strace -c` is lethal for forwarders
A userspace TCP forwarder often boils down to:
* `recv(...)` from one socket
* `send(...)` to the other socket
If your throughput is low and `strace -c` shows **millions** of `sendto/recvfrom` calls, you're likely hitting a **syscall-per-byte wall**.
Output (simplified):
```text
sendto 2,190,751 calls 4.146799s (57.6%)
recvfrom 2,190,763 calls 3.052340s (42.4%)
total syscall time: 7.200789s
```
Interpretation:

(A) **100% syscall/copy dominated:**

Almost all traced time is inside:

- sendto() (TCP send)
- recvfrom() (TCP recv)

(B) **Syscall rate is massive**

Total send+recv calls:

- ~4,381,500 syscalls in ~32s
- → ~68k `sendto` per sec + ~68k `recvfrom` per sec
- → ~137k syscalls/sec total

That's exactly the pattern of a forwarder doing:
`recv -> send -> recv -> send ...` with a relatively small buffer.
---
> **Inner Prosecutor:** “So you're saying the kernel is being spammed.”
>
> **Me:** “Exactly. Now I want to know who's spamming it: my logic, my runtime, or my copy loop.”
---
## 5. Evidence Tool #3 — FlameGraph (Where cycles actually go)
Commands:
```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```
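The `stackcollapse-perf.pl` and `flamegraph.pl` scripts come from Brendan Gregg's FlameGraph repository; if they aren't already on your `PATH`, one way to get them is:
```bash
# Fetch the FlameGraph helper scripts and make them callable by name.
git clone https://github.com/brendangregg/FlameGraph
export PATH="$PWD/FlameGraph:$PATH"
```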
### What the flamegraph showed (described, not embedded)
Instead of embedding the graph, here's the important story the flamegraph told:
1. The widest “towers” were kernel TCP send/recv paths:
   * `__x64_sys_sendto` → `tcp_sendmsg_locked` → `tcp_write_xmit` → …
   * `__x64_sys_recvfrom` → `tcp_recvmsg` → …
2. My userspace frames existed, but they were thin compared to the kernel towers.
   That means:
   * I'm not burning CPU on complicated Rust logic.
   * I'm paying overhead at the boundary: syscalls, TCP stack, copies.
3. In the dwarf flamegraph, the *userspace* frames pointed to my forwarding implementation:
* the code path that ultimately calls read/write repeatedly.
> **Conclusion:** This is not “async is slow” in general. This is “my relay loop is forcing too many small kernel transitions.”
---
## 6. The Suspect: my forwarding code
Here was the original TCP relay:
```rust
// Use smol's copy function to forward data in both directions
let client_to_server = io::copy(client_stream.clone(), server_stream.clone());
let server_to_client = io::copy(server_stream, client_stream);
futures_lite::future::try_zip(client_to_server, server_to_client).await?;
```
> **Inner Prosecutor:** “*Objection!* That looks perfectly reasonable.”
>
> **Me:** “Yes. That's why it's dangerous.”
### Why this can be slow under high throughput
Generic `io::copy` helpers often use a relatively small internal buffer (commonly ~8KiB), plus abstraction layers that can increase:
* syscall frequency
* readiness polling
* per-chunk overhead
Small buffers aren't “wrong”. They're memory-friendly. But for a forwarder pushing tens of Gbit/s, **syscalls per byte** becomes the real limiter.
---
## 7. The Fix: a manual `pump()` loop (and a buffer size sweep)
I replaced `io::copy` with a manual relay loop:
* allocate a buffer once per direction
* read into it
* write it out
* on EOF, propagate half-close with `shutdown(Write)`
Code (core idea):
```rust
async fn pump(mut r: TcpStream, mut w: TcpStream, buf_sz: usize) -> io::Result<u64> {
    // One buffer per direction, allocated once and reused for every read.
    let mut buf = vec![0u8; buf_sz];
    let mut total = 0u64;
    loop {
        let n = r.read(&mut buf).await?;
        if n == 0 {
            // EOF on the read side: propagate half-close to the write side.
            let _ = w.shutdown(std::net::Shutdown::Write);
            break;
        }
        w.write_all(&buf[..n]).await?;
        total += n as u64;
    }
    Ok(total)
}
```
And run both directions:
```rust
let c2s = pump(client_stream.clone(), server_stream.clone(), BUF);
let s2c = pump(server_stream, client_stream, BUF);
try_zip(c2s, s2c).await?;
```
---
> **Inner Prosecutor:** “You changed one helper call into a loop. That's your miracle?”
>
> **Me:** “Not the loop. The *bytes per syscall*.”
---
## 8. Verification: numbers don't lie
Same machine, same namespaces/veth, same `iperf3 -P 8`.
### Baseline (generic copy, ~8KiB internal buffer)
Throughput:
```text
17.8 Gbit/s
```
### Pump + 16KiB buffer
Throughput:
```text
28.6 Gbit/s
```
`strace -c` showed `sendto/recvfrom` call count dropped:
```text
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 57.80   14.590016      442121        33           epoll_wait
 28.84    7.279883           4   1771146           sendto
 13.33    3.363882           1   1771212        48 recvfrom
  0.02    0.003843          61        62        44 futex
  0.01    0.001947          12       159           epoll_ctl
...
------ ----------- ----------- --------- --------- ----------------
100.00   25.242897           7   3542787       143 total
```
### Pump + 64KiB buffer
Throughput:
```text
54.1 Gbit/s (best observed)
```
`perf stat` output:
```text
Performance counter stats for process id '893123':
120,859,810,675 cpu_atom/cycles/ (0.15%)
134,735,934,329 cpu_core/cycles/ (99.85%)
79,946,979,880 cpu_atom/instructions/ # 0.66 insn per cycle (0.15%)
127,036,644,759 cpu_core/instructions/ # 0.94 insn per cycle (99.85%)
24,713,474 cpu_atom/cache-misses/ (0.15%)
9,604,449 cpu_core/cache-misses/ (99.85%)
15,584,074,530 cpu_atom/branches/ (0.15%)
24,796,180,117 cpu_core/branches/ (99.85%)
175,778,825 cpu_atom/branch-misses/ # 1.13% of all branches (0.15%)
135,067,353 cpu_core/branch-misses/ # 0.54% of all branches (99.85%)
1,519 context-switches
50 cpu-migrations
33.006529572 seconds time elapsed
```
`strace -c` output:
```text
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 54.56   18.079500      463576        39           epoll_wait
 27.91    9.249443           7   1294854         2 sendto
 17.49    5.796927           4   1294919        51 recvfrom
...
------ ----------- ----------- --------- --------- ----------------
100.00   33.135377          12   2590253       158 total
```
---
## 9. “Wait — why is `epoll_wait` taking most syscall time?”
> **Inner Prosecutor:** “*Objection!* Your table says `epoll_wait` dominates time. So epoll is the bottleneck!”
>
> **Me:** “Nope. That's a common misread.”
`strace -c` counts **time spent inside syscalls**, including time spent **blocked**.
In async runtimes, it's normal for one thread to sit in `epoll_wait(timeout=...)` while other threads do actual send/recv work. That blocking time is charged to `epoll_wait`, but it's not “overhead”: it's *waiting*.
The real signal is still:
* `sendto/recvfrom` call counts (millions)
* average microseconds per call
* and whether call count drops when buffer size increases
That's the syscall-per-byte story.
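One way to keep that blocking time out of the table entirely (a variation I find handy; the numbers above did not use it) is to make `strace -c` count only the network syscall class:
```bash
# Count only network-related syscalls (sendto, recvfrom, connect, ...),
# so time spent blocked in epoll_wait doesn't dominate the percentages.
sudo timeout 30s strace -c -f -e trace=%net -p $(pidof oi)
```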
---
## 10. So why did 64KiB cause such a huge jump?
Two reasons:
### 1) Syscall wall is nonlinear
Throughput is roughly:
**Throughput ≈ bytes_per_syscall_pair × syscall_pairs_per_second**
If syscall rate is the limiter, increasing bytes per syscall can push you past a threshold where:
* socket buffers stay fuller
* TCP windows are better utilized
* per-stream pacing is smoother
* concurrency (`-P 8`) stops fighting overhead and starts working in your favor
Once you cross that threshold, throughput can jump until the *next* ceiling (kernel TCP work, memory bandwidth, or iperf itself).
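A quick back-of-the-envelope illustration of the formula, using made-up round numbers rather than measurements from the runs above: hold the read/write pair rate fixed and change only the buffer size, and the ceiling moves by the same factor.
```text
250,000 pairs/s ×  8 KiB per read ≈  2.0 GB/s ≈  16 Gbit/s
250,000 pairs/s × 64 KiB per read ≈ 16.4 GB/s ≈ 131 Gbit/s
```
In practice the pair rate shifts too, which is why the measured jump is not a clean multiple of the buffer-size ratio.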
### 2) Less “per-chunk” overhead in userspace
A small-buffer copy loop means more iterations, more polls, more bookkeeping.
A bigger buffer means:
* fewer loop iterations per GB moved
* fewer wakeups/polls
* fewer syscall transitions per GB
The `strace` call counts dropped significantly between 16KiB and 64KiB (`sendto` went from ~1.77M to ~1.29M calls over a similar window), and throughput nearly doubled.
---
## 11. Trade-offs: buffer size is not free
> **Inner Prosecutor:** “*Hold it!* Bigger buffers mean wasted memory.”
>
> **Me:** “Correct.”
A forwarder allocates **two buffers per connection** (one per direction).
So for 64KiB:
* ~128KiB per connection (just for relay buffers)
* plus runtime + socket buffers
That's fine for “a few heavy streams”, but it matters once you handle thousands of concurrent connections: at 10,000 connections, the relay buffers alone are ~1.2 GiB.
In practice, the right move is:
* choose a good default (64KiB is common)
* make it configurable
* consider buffer pooling if connection churn is heavy
---
## 12. Closing statement
This was a good reminder that performance work is not guessing; it's a dialogue with the system:
1. Describe the situation
2. Ask sharp questions
3. Use tools to confirm
4. Explain the results using low-level knowledge
5. Make one change
6. Re-measure
And the funniest part: the “clean” one-liner `io::copy` was correct, but its defaults were hiding a performance policy I didn't want.
> **Inner Prosecutor:** “Case closed?”
>
> **Me:** “For now. Next case: buffer pooling, socket buffer tuning, and maybe a Linux-only `splice(2)` fast path — carefully, behind a safe wrapper.”
---