Add: more Phoenix flavor
---
title: Ace Profiling Attorney - The Case of the Missing Gbits
categories: [Programming, Profiling]
tags: [Rust, kernel, networking]
---
> **Disclaimer:** This is not a language-war post. No “X vs Y”.
>
> This is a profiling detective story about my Rust TCP forwarder [`oi`](https://github.com/DaZuo0122/oxidinetd).

---

## 0) Prologue — The Courthouse Lobby

> **Me:** I wrote a Rust TCP port forwarder. It works. It forwards.
>
> **Inner Prosecutor (Phoenix voice):** *Hold it!* “Works” is not a metric. How fast?
>
> **Me:** Not fast enough under load.
>
> **Inner Prosecutor:** *Objection!* “Not fast enough” is an emotion. Bring evidence.
>
> **Me:** Fine. I’ll bring **perf**, **strace**, and a **flamegraph**.
>
> **Inner Prosecutor:** Good. This court accepts only facts.
## 1) The Crime Scene — Setup & Reproduction

**Me:** Single machine, Debian 13. No WAN noise, no tunnel bottlenecks.

**Inner Prosecutor:** *Hold it!* If it’s “single machine”, how do you avoid loopback cheating?

**Me:** Network namespaces + veth. Local, repeatable, closer to real networking.

### Environment

- Debian 13
- Kernel: `6.12.48+deb13-amd64`
- Runtime: `smol`
- Test topology: `ns_client → oi (root ns) → ns_server` via veth
- Forwarder: `oi` listens on `10.0.1.1:9000` and connects to the backend at `10.0.0.2:9001`

### Reproduction commands

**Exhibit A: Start backend server in `ns_server`**

```bash
sudo ip netns exec ns_server iperf3 -s -p 9001
```

**Exhibit B: Run client in `ns_client` through forwarder**

```bash
sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8
```

**Inner Prosecutor:** *Hold it!* Why `-P 8`?

**Me:** Because a forwarder can look fine in `-P 1` and fall apart when syscall pressure scales.

**Inner Prosecutor:** …Acceptable.
---

## 2) The Suspects — What Could Be Limiting Throughput?

**Me:** Four suspects.

1. **CPU bound** (pure compute wall)
2. **Kernel TCP stack bound** (send/recv path, skb, softirq, netfilter/conntrack)
3. **Syscall-rate wall** (too many `sendto`/`recvfrom` per byte)
4. **Runtime scheduling / contention** (wake storms, locks, futex)

**Inner Prosecutor:** *Objection!* That’s too broad. Narrow it down.

**Me:** That’s what the tools are for.

---
## 3) Evidence #1 — `perf stat` (The Macro View)

**Me:** First I ask: are we burning CPU, thrashing schedulers, or stalling on memory?

**Command:**

```bash
sudo perf stat -p $(pidof oi) -e \
cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
-- sleep 33
```

**What I’m looking for:**

* Huge `context-switches` → runtime thrash / lock contention
* Huge `cpu-migrations` → unstable scheduling
* Very low IPC + huge cache misses → memory stalls
* Otherwise: likely syscall/kernel path

Output:

```text
Performance counter stats for process id '209785':

   113,810,599,893      cpu_atom/cycles/                 (0.11%)
   164,681,878,450      cpu_core/cycles/                 (99.89%)
   102,575,167,734      cpu_atom/instructions/           # 0.90 insn per cycle (0.11%)
   237,094,207,911      cpu_core/instructions/           # 1.44 insn per cycle (99.89%)
        33,093,338      cpu_atom/cache-misses/           (0.11%)
         5,381,441      cpu_core/cache-misses/           (99.89%)
    20,012,975,873      cpu_atom/branches/               (0.11%)
    46,120,077,111      cpu_core/branches/               (99.89%)
       211,767,555      cpu_atom/branch-misses/          # 1.06% of all branches (0.11%)
       245,969,685      cpu_core/branch-misses/          # 0.53% of all branches (99.89%)
             1,686      context-switches
               150      cpu-migrations

      33.004363800 seconds time elapsed
```
**Low context switching**:

- context-switches: 1,686 over ~33s → ~51 switches/sec
- cpu-migrations: 150 over ~33s → ~4.5/s → very stable CPU placement

**CPU is working hard**:

- 237,094,207,911 cpu_core instructions
- IPC: 1.44 (instructions per cycle) → not lock-bound or stalling badly

**Clean cache and branch metrics**:

- cache-misses: ~5.4M on cpu_core (tiny compared to the instruction count)
- branch-misses: 0.53% on cpu_core
**Inner Prosecutor:** *Hold it!* Counters are not a culprit. You still haven’t shown me the incriminating numbers.

**Me:** Patience. The next exhibit makes the culprit confess.

---
## 4) Evidence #2 — `strace -c` (The Confession: Syscall Composition)

**Me:** Next: “What syscalls are we paying for?”

**Command:**

```bash
sudo timeout 30s strace -c -f -p $(pidof oi)
```

**What I expect if this is a forwarding wall:**

* `sendto` and `recvfrom` dominate calls
* call counts in the millions

Output (simplified):

```text
sendto      2,190,751 calls   4.146799s  (57.6%)
recvfrom    2,190,763 calls   3.052340s  (42.4%)
total syscall time: 7.200789s
```
(A) **100% syscall/copy dominated:** almost all traced time is inside

- `sendto()` (TCP send)
- `recvfrom()` (TCP recv)

(B) **Syscall rate is massive:**

- total send+recv calls: ~4,381,500 syscalls in ~32s
- → ~68k `sendto` per sec + ~68k `recvfrom` per sec
- → ~137k syscalls/sec total
**Inner Prosecutor:** *Objection!* Syscalls alone don’t prove the bottleneck.

**Me:** True. So I brought a witness.

---
## 5) Evidence #3 — FlameGraph (The Witness)

**Me:** The flamegraph doesn’t lie. It testifies where cycles go.

**Commands:**

```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```

**What the flamegraph showed (described, not embedded):**

* The widest towers were kernel TCP send/recv paths:
  * `__x64_sys_sendto` → `tcp_sendmsg_locked` → `tcp_write_xmit` → ...
  * `__x64_sys_recvfrom` → `tcp_recvmsg` → ...
* My userspace frames existed, but were comparatively thin.
* The call chain still pointed into my forwarding implementation.

**Inner Prosecutor:** *Hold it!* So you’re saying… the kernel is doing the heavy lifting?

**Me:** Exactly. Which means my job is to **stop annoying the kernel** with too many tiny operations.

---
## 6) The Real Culprit — A “Perfectly Reasonable” Copy Loop

**Me:** Here’s the original relay code. Looks clean, right?

```rust
let client_to_server = io::copy(client_stream.clone(), server_stream.clone());
let server_to_client = io::copy(server_stream, client_stream);

futures_lite::future::try_zip(client_to_server, server_to_client).await?;
```

**Inner Prosecutor:** *Objection!* This is idiomatic and correct.

**Me:** Yes. That’s why it’s dangerous.

**Key detail:** `futures_lite::io::copy` uses a small internal buffer (~8KiB in practice).

Small buffer → more iterations → more syscalls → more overhead.

If a forwarder is syscall-rate bound, this becomes a ceiling.
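For intuition, here is a conceptual sketch of what a generic copy helper does under the hood. This is not the actual `futures-lite` source, just the shape of the loop, with the ~8KiB figure taken from the observation above:

```rust
use futures_lite::io::{AsyncRead, AsyncReadExt, AsyncWrite, AsyncWriteExt};

// Conceptual sketch only, not the real futures-lite implementation: a generic
// copy helper loops over a fixed internal buffer, so every iteration costs one
// read syscall plus one write syscall for at most `buf.len()` bytes.
async fn copy_like<R, W>(mut r: R, mut w: W) -> std::io::Result<u64>
where
    R: AsyncRead + Unpin,
    W: AsyncWrite + Unpin,
{
    let mut buf = [0u8; 8 * 1024]; // fixed buffer: this is the bytes-per-syscall ceiling
    let mut total = 0u64;
    loop {
        let n = r.read(&mut buf).await?;
        if n == 0 {
            break;
        }
        w.write_all(&buf[..n]).await?;
        total += n as u64;
    }
    Ok(total)
}
```

Every lap around that loop is a kernel round trip, and the fixed buffer caps how much work each round trip can do.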
---

## 7) The First Breakthrough — Replace `io::copy` with `pump()`

**Me:** I wrote a manual pump loop:

* allocate a buffer once
* `read()` into it
* `write_all()` out
* on EOF: `shutdown(Write)` to propagate half-close

```rust
async fn pump(mut r: TcpStream, mut w: TcpStream, buf_sz: usize) -> io::Result<u64> {
    let mut buf = vec![0u8; buf_sz];
    let mut total = 0u64;

    loop {
        let n = r.read(&mut buf).await?;
        if n == 0 {
            let _ = w.shutdown(std::net::Shutdown::Write);
            break;
        }
        w.write_all(&buf[..n]).await?;
        total += n as u64;
    }
    Ok(total)
}
```
Run both directions:

```rust
let c2s = pump(client_stream.clone(), server_stream.clone(), BUF);
let s2c = pump(server_stream, client_stream, BUF);
try_zip(c2s, s2c).await?;
```
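For context, here is roughly how the two pumps might get wired up per accepted connection, assuming smol's `TcpStream`. `BUF` and `relay` are illustrative names for this sketch, not oi's actual API:

```rust
use smol::net::TcpStream;
use std::io;

// Illustrative wiring: one pump per direction, joined with try_zip so the
// relay finishes when both directions hit EOF (or bails out on the first error).
const BUF: usize = 64 * 1024; // the relay buffer size that measured best below

async fn relay(client_stream: TcpStream, server_stream: TcpStream) -> io::Result<()> {
    let c2s = pump(client_stream.clone(), server_stream.clone(), BUF);
    let s2c = pump(server_stream, client_stream, BUF);
    futures_lite::future::try_zip(c2s, s2c).await?;
    Ok(())
}
```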
**Inner Prosecutor:** *Hold it!* That’s just a loop. How does that win?

**Me:** Not the loop. The **bytes per syscall**.

---
## 8) Exhibit C — The Numbers (8KiB → 16KiB → 64KiB)

### Baseline: ~8KiB (generic copy helper)

Throughput:

```text
17.8 Gbit/s
```
**Inner Prosecutor:** *Objection!* That’s your “crime scene” number?

**Me:** Yes. Now watch what happens when the kernel stops getting spammed.

### Pump + 16KiB buffer

Throughput:

```text
28.6 Gbit/s
```
`strace -c` showed the `sendto`/`recvfrom` call count dropped:

```text
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 57.80   14.590016      442121        33           epoll_wait
 28.84    7.279883           4   1771146           sendto
 13.33    3.363882           1   1771212        48 recvfrom
  0.02    0.003843          61        62        44 futex
  0.01    0.001947          12       159           epoll_ctl
...
------ ----------- ----------- --------- --------- ----------------
100.00   25.242897           7   3542787       143 total
```

**Inner Prosecutor:** *Hold it!* That’s already big. But you claim there’s more?

**Me:** Oh, there’s more.
### Pump + 64KiB buffer

Throughput:

```text
54.1 Gbit/s (best observed)
```

`perf stat` output:
```text
Performance counter stats for process id '893123':

   120,859,810,675      cpu_atom/cycles/                 (0.15%)
   134,735,934,329      cpu_core/cycles/                 (99.85%)
    79,946,979,880      cpu_atom/instructions/           # 0.66 insn per cycle (0.15%)
   127,036,644,759      cpu_core/instructions/           # 0.94 insn per cycle (99.85%)
        24,713,474      cpu_atom/cache-misses/           (0.15%)
         9,604,449      cpu_core/cache-misses/           (99.85%)
    15,584,074,530      cpu_atom/branches/               (0.15%)
    24,796,180,117      cpu_core/branches/               (99.85%)
       175,778,825      cpu_atom/branch-misses/          # 1.13% of all branches (0.15%)
       135,067,353      cpu_core/branch-misses/          # 0.54% of all branches (99.85%)
             1,519      context-switches
                50      cpu-migrations

      33.006529572 seconds time elapsed
```

`strace -c` output:

```text
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 54.56   18.079500      463576        39           epoll_wait
 27.91    9.249443           7   1294854         2 sendto
 17.49    5.796927           4   1294919        51 recvfrom
...
------ ----------- ----------- --------- --------- ----------------
100.00   33.135377          12   2590253       158 total
```
**Inner Prosecutor:** *OBJECTION!* `epoll_wait` is eating the time. That’s the bottleneck!

**Me:** Nice try. That’s a classic trap.

---

## 9) Cross-Examination — The `epoll_wait` Trap

**Me:** `strace -c` measures time spent *inside syscalls*, including time spent **blocked**.

In async runtimes:

* One thread can sit in `epoll_wait(timeout=...)`
* Other threads do `sendto`/`recvfrom`
* `strace` charges the blocking time to `epoll_wait`

So `epoll_wait` dominating **does not** mean “epoll is slow”.
It often means “one thread is waiting while others work”.

**What matters here:**

* `sendto` / `recvfrom` call counts
* and how they change with buffer size

---
## 10) Final Explanation — Why 64KiB Causes a “Nonlinear” Jump

**Inner Prosecutor:** *Hold it!* You cut the syscall count by less than half. How do you nearly triple throughput?

**Me:** Because syscall walls are **nonlinear**.

A forwarder’s throughput is approximately:

> **Throughput ≈ bytes_per_syscall_pair × syscall_pairs_per_second**

If you’re syscall-rate limited, increasing `bytes_per_syscall_pair` pushes you past a threshold where:

* socket buffers stay fuller
* the TCP window is better utilized
* each stream spends less time in per-chunk bookkeeping
* concurrency (`-P 8`) stops fighting overhead and starts helping

Once you cross that threshold, throughput can jump until the next ceiling (kernel TCP, memory bandwidth, iperf itself).

That’s why a “small” change can create a big effect.
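To make the nonlinearity concrete, here is a back-of-envelope check of that model, plugging in the measured throughputs. The syscall-pair rates are derived from the model, not measured:

```rust
// Model: throughput ≈ bytes_per_syscall_pair × syscall_pairs_per_second.
fn main() {
    let baseline_bytes_per_sec = 17.8e9 / 8.0; // 17.8 Gbit/s as bytes/sec
    let tuned_bytes_per_sec = 54.1e9 / 8.0;    // 54.1 Gbit/s as bytes/sec

    let pairs_per_sec_8k = baseline_bytes_per_sec / (8.0 * 1024.0);
    let pairs_per_sec_64k = tuned_bytes_per_sec / (64.0 * 1024.0);

    println!("8KiB per pair : ~{:.0}k read/write pairs per second", pairs_per_sec_8k / 1e3);
    println!("64KiB per pair: ~{:.0}k read/write pairs per second", pairs_per_sec_64k / 1e3);
    // Prints roughly 272k vs 103k.
}
```

Throughput roughly tripled while the implied syscall-pair rate *fell*: the win came from moving more bytes per kernel transition, not from issuing syscalls faster.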
---
## 11) Trade-offs — Buffer Size Is Not Free

**Inner Prosecutor:** *Objection!* Bigger buffers waste memory!

**Me:** Sustained.

A forwarder allocates **two buffers per connection** (one per direction).

So for 64KiB:

* ~128KiB per connection (just for relay buffers)
* plus runtime + socket buffers

That’s fine for a few heavy streams, but it matters if you handle thousands of concurrent connections.

In practice, the right move is:

* choose a good default (64KiB is common)
* make it configurable
* consider buffer pooling if connection churn is heavy (a sketch follows below)
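As a closing sketch, here is what “configurable plus pooled” could look like. `RelayBufPool` and its limits are illustrative, not oi’s actual API:

```rust
use std::sync::Mutex;

// Illustrative sketch: a configurable buffer size plus a tiny pool, so relay
// buffers get reused across short-lived connections instead of allocating two
// fresh Vecs for every accepted connection.
struct RelayBufPool {
    buf_size: usize,           // configurable; 64 * 1024 is a sensible default
    free: Mutex<Vec<Vec<u8>>>, // buffers handed back by finished connections
}

impl RelayBufPool {
    fn new(buf_size: usize) -> Self {
        Self { buf_size, free: Mutex::new(Vec::new()) }
    }

    // Reuse a returned buffer if one is available, otherwise allocate.
    fn get(&self) -> Vec<u8> {
        self.free
            .lock()
            .unwrap()
            .pop()
            .unwrap_or_else(|| vec![0u8; self.buf_size])
    }

    // Hand a buffer back; cap the pool so idle memory stays bounded.
    fn put(&self, buf: Vec<u8>) {
        let mut free = self.free.lock().unwrap();
        if free.len() < 1024 {
            free.push(buf);
        }
    }
}
```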
---

## Epilogue — Case Closed (for now)

**Inner Prosecutor:** So the culprit was…

**Me:** A perfectly reasonable helper with a default buffer size I didn’t question.

**Inner Prosecutor:** And the lesson?

**Me:** Don’t guess. Ask sharp questions. Use the tools. Let the system testify.

> **Verdict:** Guilty of “too many syscalls per byte.”
>
> **Sentence:** 64KiB buffers and a better relay loop.
---

## Closing Statement

This was a good reminder that performance work is not guessing — it’s a dialogue with the system:

1. Describe the situation
2. Ask sharp questions
3. Use tools to confirm
4. Explain the results using low-level knowledge
5. Make one change
6. Re-measure

And the funniest part: the “clean” one-liner `io::copy` was correct, but its defaults were hiding a performance policy I didn’t want.

> **Inner Prosecutor:** “Case closed?”
>
> **Me:** “For now. Next case: buffer pooling, socket buffer tuning, and maybe a Linux-only `splice(2)` fast path — carefully, behind a safe wrapper.”

---