Add: more Phoenix flavor

Commit 144ae4a0f3 (parent ca04039826), 2026-02-03 13:41:37 +08:00


categories: [Programming, Profiling]
tags: [Rust, kernel, networking]
---

> **Disclaimer:** This is not a language-war post. No “X vs Y”.
> This is a profiling detective story about my Rust TCP forwarder [`oi`](https://github.com/DaZuo0122/oxidinetd).

---

## 0) Prologue — The Courthouse Lobby

> **Me:** I wrote a Rust TCP port forwarder. It works. It forwards.
>
> **Inner Prosecutor (Phoenix voice):** *Hold it!* “Works” is not a metric. How fast?
>
> **Me:** Not fast enough under load.
>
> **Inner Prosecutor:** *Objection!* “Not fast enough” is an emotion. Bring evidence.
>
> **Me:** Fine. I'll bring **perf**, **strace**, and a **flamegraph**.
>
> **Inner Prosecutor:** Good. This court accepts only facts.

## 1) The Crime Scene — Setup & Reproduction

**Me:** Single machine, Debian 13. No WAN noise, no tunnel bottlenecks.

**Inner Prosecutor:** *Hold it!* If it's “single machine”, how do you avoid loopback cheating?

**Me:** Network namespaces + veth. Local, repeatable, closer to real networking.

### Environment

- Debian 13
- Kernel: `6.12.48+deb13-amd64`
- Runtime: `smol`
- Test topology: `ns_client → oi (root ns) → ns_server` via veth

### Reproduction commands

`oi` listens on `10.0.1.1:9000` and connects to the backend at `10.0.0.2:9001`.

**Exhibit A: Start backend server in `ns_server`**

```bash
sudo ip netns exec ns_server iperf3 -s -p 9001
```

**Exhibit B: Run client in `ns_client` through the forwarder**

```bash
sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8
```

**Inner Prosecutor:** *Hold it!* Why `-P 8`?

**Me:** Because a forwarder can look fine in `-P 1` and fall apart when syscall pressure scales with concurrency.

**Inner Prosecutor:** …Acceptable.

---

## 2) The Suspects — What Could Be Limiting Throughput?

**Me:** Four suspects.

1. **CPU bound** (pure compute wall)
2. **Kernel TCP stack bound** (send/recv path, skb, softirq, netfilter/conntrack)
3. **Syscall-rate wall** (too many `sendto`/`recvfrom` per byte)
4. **Runtime scheduling / contention** (wake storms, locks, futex)

**Inner Prosecutor:** *Objection!* That's too broad. Narrow it down.

**Me:** That's what the tools are for.

---

## 3) Evidence #1 — `perf stat` (The Macro View)

**Me:** First I ask: are we burning CPU, thrashing schedulers, or stalling on memory?

**Command:**

```bash
sudo perf stat -p $(pidof oi) -e … \
  -- sleep 33
```

**What I'm looking for:**

* Huge `context-switches` → runtime thrash / lock contention
* Huge `cpu-migrations` → unstable scheduling
* Very low IPC + huge cache misses → memory stalls
* Otherwise: likely the syscall/kernel path

Output:

```
…
33.004363800 seconds time elapsed
```

Interpretation (abridged):

- **Low context switching**
- …
- branch-misses: 0.62%

**Inner Prosecutor:** *Hold it!* You didn't show the numbers.

**Me:** Patience. The next exhibit makes the culprit confess.

---

## 4) Evidence #2 — `strace -c` (The Confession: Syscall Composition)

**Me:** Next: “What syscalls are we paying for?”

**Command:**

```bash
sudo timeout 30s strace -c -f -p $(pidof oi)
```

**What I expect if this is a forwarding wall:**

* `sendto` and `recvfrom` dominate the calls
* call counts in the millions

Output (simplified):

```
sendto      2,190,751 calls   4.146799s   (57.6%)
recvfrom    2,190,763 calls   3.052340s   (42.4%)
total syscall time: 7.200789s
```

(A) **100% syscall/copy dominated:**

- Almost all traced time is inside:
  - sendto() (TCP send)
  - recvfrom() (TCP recv)

(B) **syscall rate is massive**

- Total send+recv calls:
  - ~4,381,500 syscalls in ~32s
  - → ~68k `sendto` per sec + ~68k `recvfrom` per sec
  - → ~137k syscalls/sec total

**Inner Prosecutor:** *Objection!* Syscalls alone don't prove the bottleneck.

**Me:** True. So I brought a witness.

---

## 5) Evidence #3 — FlameGraph (The Witness)

**Me:** The flamegraph doesn't lie. It testifies where cycles go.

**Commands:**

```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```

**What the flamegraph showed (described, not embedded):**

* The widest towers were kernel TCP send/recv paths:
  * `__x64_sys_sendto``tcp_sendmsg_locked``tcp_write_xmit` → ...
  * `__x64_sys_recvfrom``tcp_recvmsg` → ...
* My userspace frames existed, but were comparatively thin.
* The call chain still pointed into my forwarding implementation.

**Inner Prosecutor:** *Hold it!* So you're saying… the kernel is doing the heavy lifting?

**Me:** Exactly. Which means my job is to **stop annoying the kernel** with too many tiny operations.

---

## 6) The Real Culprit — A “Perfectly Reasonable” Copy Loop

**Me:** Here's the original relay code. Looks clean, right?

```rust
// Use smol's copy function to forward data in both directions
let client_to_server = io::copy(client_stream.clone(), server_stream.clone());
let server_to_client = io::copy(server_stream, client_stream);
futures_lite::future::try_zip(client_to_server, server_to_client).await?;
```

**Inner Prosecutor:** *Objection!* This is idiomatic and correct.

**Me:** Yes. That's why it's dangerous.

**Key detail:** `futures_lite::io::copy` uses a small internal buffer (~8KiB in practice).
Small buffer → more iterations → more syscalls → more overhead.
If a forwarder is syscall-rate bound, this becomes a ceiling.

---

## 7) The First Breakthrough — Replace `io::copy` with `pump()`

**Me:** I wrote a manual pump loop:

* allocate a buffer once
* `read()` into it
* `write_all()` out
* on EOF: `shutdown(Write)` to propagate half-close

```rust
async fn pump(mut r: TcpStream, mut w: TcpStream, buf_sz: usize) -> io::Result<u64> {
    // … (loop body not shown here; see the sketch below)
}
```
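
A minimal sketch of what such a pump loop can look like (my reconstruction for illustration, not the exact code in `oi`), assuming `smol::net::TcpStream` and the `futures_lite` `AsyncReadExt`/`AsyncWriteExt` extension traits:

```rust
use std::io;
use std::net::Shutdown;

use futures_lite::io::{AsyncReadExt, AsyncWriteExt};
use smol::net::TcpStream;

// Relay bytes from `r` to `w`, reusing a single buffer of `buf_sz` bytes.
async fn pump(mut r: TcpStream, mut w: TcpStream, buf_sz: usize) -> io::Result<u64> {
    let mut buf = vec![0u8; buf_sz];
    let mut total = 0u64;
    loop {
        let n = r.read(&mut buf).await?;
        if n == 0 {
            // EOF on the read side: propagate half-close so the peer sees FIN.
            w.shutdown(Shutdown::Write)?;
            return Ok(total);
        }
        // Push the whole chunk out before reading again.
        w.write_all(&buf[..n]).await?;
        total += n as u64;
    }
}
```

The point is less the loop itself than that `buf_sz` is now an explicit knob, which is exactly what the buffer-size sweep below exercises.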

Run both directions:

```rust
let c2s = pump(client_stream.clone(), server_stream.clone(), BUF);
let s2c = pump(server_stream, client_stream, BUF);

try_zip(c2s, s2c).await?;
```
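
For context, a sketch of how a per-connection handler might wire this together (the names `BUF_SZ`, `handle_conn`, and `run` are mine, not from `oi`; `pump` is the relay loop sketched above):

```rust
use smol::net::{TcpListener, TcpStream};

const BUF_SZ: usize = 64 * 1024; // the knob the benchmark sweeps below

async fn handle_conn(client: TcpStream, upstream_addr: &str) -> std::io::Result<()> {
    let server = TcpStream::connect(upstream_addr).await?;
    // Relay both directions; try_zip completes when both finish (or either errors).
    let c2s = pump(client.clone(), server.clone(), BUF_SZ);
    let s2c = pump(server, client, BUF_SZ);
    futures_lite::future::try_zip(c2s, s2c).await?;
    Ok(())
}

async fn run(listen_addr: &str, upstream_addr: &'static str) -> std::io::Result<()> {
    let listener = TcpListener::bind(listen_addr).await?;
    loop {
        let (client, _peer) = listener.accept().await?;
        // One lightweight task per accepted connection; errors just drop that connection.
        smol::spawn(async move { handle_conn(client, upstream_addr).await.ok() }).detach();
    }
}
```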

**Inner Prosecutor:** *Hold it!* That's just a loop. How does that win?

**Me:** Not the loop. The **bytes per syscall**.

---

## 8) Exhibit C — The Numbers (8KiB → 16KiB → 64KiB)

Same machine, same namespaces/veth, same `iperf3 -P 8`.

### Baseline: ~8KiB (generic copy helper)

Throughput:

```
17.8 Gbit/s
```

**Inner Prosecutor:** *Objection!* That's your “crime scene” number?

**Me:** Yes. Now watch what happens when the kernel stops getting spammed.

### Pump + 16KiB buffer

Throughput and `strace -c` summary (abridged):

```
…
100.00   25.242897    7   3542787   143   total
```

**Inner Prosecutor:** *Hold it!* That's already big. But you claim there's more?

**Me:** Oh, there's more.

### Pump + 64KiB buffer

Throughput, `perf stat`, and `strace -c` summary (abridged):

```
…
Performance counter stats for process id '893123':
…
100.00   33.135377   12   2590253   158   total
```

**Inner Prosecutor:** *OBJECTION!* `epoll_wait` is eating the time. That's the bottleneck!

**Me:** Nice try. That's a classic trap.

---

## 9) Cross-Examination — The `epoll_wait` Trap

**Me:** `strace -c` measures time spent *inside syscalls*, including time spent **blocked**.

In async runtimes:

* One thread can sit in `epoll_wait(timeout=...)`
* Other threads do the actual `sendto`/`recvfrom` work
* `strace` charges the blocking time to `epoll_wait`

So `epoll_wait` dominating **does not** mean “epoll is slow”.
It often means “one thread is waiting while others work”.

**What matters here:**

* `sendto` / `recvfrom` call counts
* and how they change with buffer size

---

## 10) Final Explanation — Why 64KiB Causes a “Nonlinear” Jump

**Inner Prosecutor:** *Hold it!* You only cut the syscall count by a modest fraction. How do you nearly triple throughput?

**Me:** Because syscall walls are **nonlinear**.

A forwarder's throughput is approximately:

> **Throughput ≈ bytes_per_syscall_pair × syscall_pairs_per_second**

If you're syscall-rate limited, increasing `bytes_per_syscall_pair` pushes you past a threshold where:

* socket buffers stay fuller
* the TCP window is better utilized
* each stream spends less time in per-chunk bookkeeping
* concurrency (`-P 8`) stops fighting overhead and starts helping

Once you cross that threshold, throughput can jump until the next ceiling (kernel TCP, memory bandwidth, iperf itself).
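
To make that concrete, here is a toy calculation (the 140k pairs/sec budget is an assumed number for illustration, not a measurement from this post): hold the syscall-pair budget fixed and vary the bytes moved per pair.

```rust
// Toy model of the syscall wall: the throughput ceiling if the process can
// only sustain a fixed number of recv+send syscall pairs per second.
fn main() {
    let pairs_per_sec: f64 = 140_000.0; // assumed budget, illustration only
    for buf_bytes in [8 * 1024u64, 16 * 1024, 64 * 1024] {
        let gbit_per_sec = pairs_per_sec * buf_bytes as f64 * 8.0 / 1e9;
        println!("{:>2} KiB per pair -> ~{:.0} Gbit/s ceiling", buf_bytes / 1024, gbit_per_sec);
    }
}
```

The same syscall budget gives roughly 9, 18, and 73 Gbit/s ceilings for 8KiB, 16KiB, and 64KiB buffers; real systems hit the next ceiling before scaling that cleanly, but the shape is why crossing the wall looks like a jump rather than a gentle slope.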

That's why a “small” change can create a big effect.

---

## 11. Trade-offs: buffer size is not free

**Inner Prosecutor:** *Objection!* Bigger buffers waste memory!

**Me:** Sustained.

A forwarder allocates **two buffers per connection** (one per direction).
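
As a rough illustration (hypothetical scale, not numbers from the post): with 64KiB buffers that is 128KiB pinned per connection, so 1,000 concurrent connections hold about 125 MiB in relay buffers and 10,000 hold roughly 1.2 GiB, before counting kernel socket buffers.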

…

---

## Epilogue — Case Closed (for now)

**Inner Prosecutor:** So the culprit was…

**Me:** A perfectly reasonable helper with a default buffer size I didn't question.

**Inner Prosecutor:** And the lesson?

**Me:** Don't guess. Ask sharp questions. Use the tools. Let the system testify.

> **Verdict:** Guilty of “too many syscalls per byte.”
>
> **Sentence:** 64KiB buffers and a better relay loop.

---

## Ending

This was a good reminder that performance work is not guessing — it's a dialogue with the system: