---
title: Ace Profiling Attorney - The Case of the Missing Gbits
categories: [Programming, Profiling]
tags: [Rust, kernel, networking]
---
> **Disclaimer:** This is not a language-war post. No “X vs Y”.
> This is a profiling detective story about my Rust TCP forwarder [`oi`](https://github.com/DaZuo0122/oxidinetd).

---

## 0) Prologue — The Courthouse Lobby

> **Me:** I wrote a Rust TCP port forwarder. It works. It forwards.
>
> **Inner Prosecutor (Phoenix voice):** *Hold it!* “Works” is not a metric. How fast?
>
> **Me:** Not fast enough under load.
>
> **Inner Prosecutor:** *Objection!* “Not fast enough” is an emotion. Bring evidence.
>
> **Me:** Fine. I'll bring **perf**, **strace**, and a **flamegraph**.
>
> **Inner Prosecutor:** Good. This court accepts only facts.
## 1) The Crime Scene — Setup & Reproduction

**Me:** Single machine, Debian 13. No WAN noise, no tunnel bottlenecks.

**Inner Prosecutor:** *Hold it!* If it's “single machine”, how do you avoid loopback cheating?

**Me:** Network namespaces + veth. Local, repeatable, and closer to real routing than loopback, which can hide “real networking” behavior.

### Environment

- Debian 13
- Kernel: `6.12.48+deb13-amd64`
- Runtime: `smol`
- Test topology: `ns_client → oi (root ns) → ns_server` via veth
- `oi` listens on `10.0.1.1:9000` and connects to `10.0.0.2:9001`

### Reproduction commands
**Exhibit A: Start backend server in `ns_server`**

```bash
sudo ip netns exec ns_server iperf3 -s -p 9001
```

**Exhibit B: Run client in `ns_client` through forwarder**

```bash
sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8
```

**Inner Prosecutor:** *Hold it!* Why `-P 8`?

**Me:** Because a forwarder can look fine in `-P 1` and fall apart when syscall pressure scales.

**Inner Prosecutor:** …Acceptable.
---

## 2) The Suspects — What Could Be Limiting Throughput?

**Me:** Four suspects.

1. **CPU bound** (pure compute wall)
2. **Kernel TCP stack bound** (send/recv path, skb, softirq, netfilter/conntrack)
3. **Syscall-rate wall** (too many `sendto/recvfrom` per byte)
4. **Runtime scheduling / contention** (wake storms, locks, futex)

**Inner Prosecutor:** *Objection!* That's too broad. Narrow it down.

**Me:** That's what the tools are for.

---

## 3) Evidence #1 — `perf stat` (The Macro View)

**Me:** First I ask: are we burning CPU, thrashing schedulers, or stalling on memory?
**Command:**

```bash
sudo perf stat -p $(pidof oi) -e \
  cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
  -- sleep 33
```

**What I'm looking for:**

* Huge `context-switches` → runtime thrash / lock contention
* Huge `cpu-migrations` → unstable scheduling
* Very low IPC + huge cache misses → memory stalls
* Otherwise: likely syscall/kernel path

Output:

```text
 Performance counter stats for process id '209785':

   113,810,599,893      cpu_atom/cycles/                                      (0.11%)
   164,681,878,450      cpu_core/cycles/                                      (99.89%)
   102,575,167,734      cpu_atom/instructions/   #  0.90 insn per cycle       (0.11%)
   237,094,207,911      cpu_core/instructions/   #  1.44 insn per cycle       (99.89%)
        33,093,338      cpu_atom/cache-misses/                                (0.11%)
         5,381,441      cpu_core/cache-misses/                                (99.89%)
    20,012,975,873      cpu_atom/branches/                                    (0.11%)
    46,120,077,111      cpu_core/branches/                                    (99.89%)
       211,767,555      cpu_atom/branch-misses/  #  1.06% of all branches     (0.11%)
       245,969,685      cpu_core/branch-misses/  #  0.53% of all branches     (99.89%)
             1,686      context-switches
               150      cpu-migrations

      33.004363800 seconds time elapsed
```

Interpretation:

**Low context switching, stable placement**:

- context-switches: 1,686 over ~33s → ~51 switches/sec
- cpu-migrations: 150 over ~33s → ~4.5/s → very stable CPU placement

**CPU is working hard**:

- 237,094,207,911 cpu_core instructions
- IPC: 1.44 (instructions per cycle) → not lock-bound or stalling badly

**Clean cache and branch metrics**:

- cache-misses: ~5.4M core / ~33M atom (tiny compared to ~237B instructions)
- branch-misses: 0.53% of branches on the core side, 1.06% on the atom side
**Inner Prosecutor:** *Hold it!* Numbers, yes. But none of them name a culprit.

**Me:** Patience. The next exhibit makes the culprit confess.

---

## 4) Evidence #2 — `strace -c` (The Confession: Syscall Composition)

**Me:** Next question: what syscalls are we paying for?
**Command:**

```bash
sudo timeout 30s strace -c -f -p $(pidof oi)
```

**What I expect if this is a forwarding wall:**

* `sendto` and `recvfrom` dominate calls
* call counts in the millions

Output (simplified):

```text
sendto     2,190,751 calls   4.146799s   (57.6%)
recvfrom   2,190,763 calls   3.052340s   (42.4%)
total syscall time: 7.200789s
```

Interpretation:

(A) **100% syscall/copy dominated:**

- Almost all traced time is inside:
  - sendto() (TCP send)
  - recvfrom() (TCP recv)

(B) **Syscall rate is massive:**

- Total send+recv calls:
  - ~4,381,500 syscalls in ~32s
  - → ~137k `sendto` per sec + ~137k `recvfrom` per sec
  - → ~274k syscalls/sec total

**Inner Prosecutor:** *Objection!* Syscalls alone don't prove the bottleneck.

**Me:** True. So I brought a witness.

---
## 5) Evidence #3 — FlameGraph (The Witness)

**Me:** The flamegraph doesn't lie. It testifies where cycles go.

**Commands:**

```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```

**What the flamegraph showed (described, not embedded):**

* The widest towers were kernel TCP send/recv paths:
  * `__x64_sys_sendto` → `tcp_sendmsg_locked` → `tcp_write_xmit` → ...
  * `__x64_sys_recvfrom` → `tcp_recvmsg` → ...
* My userspace frames existed, but were comparatively thin.
* The call chain still pointed into my forwarding implementation.

**Inner Prosecutor:** *Hold it!* So you're saying… the kernel is doing the heavy lifting?

**Me:** Exactly. Which means my job is to **stop annoying the kernel** with too many tiny operations.

---

## 6) The Real Culprit — A “Perfectly Reasonable” Copy Loop

**Me:** Here's the original relay code. Looks clean, right?

```rust
// Forward data in both directions with the runtime's generic copy helper.
let client_to_server = io::copy(client_stream.clone(), server_stream.clone());
let server_to_client = io::copy(server_stream, client_stream);

futures_lite::future::try_zip(client_to_server, server_to_client).await?;
```

**Inner Prosecutor:** *Objection!* This is idiomatic and correct.

**Me:** Yes. That's why it's dangerous.

**Key detail:** `futures_lite::io::copy` uses a small internal buffer (~8KiB in practice).
Small buffer → more iterations → more syscalls → more overhead.

If a forwarder is syscall-rate bound, this becomes a ceiling.
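If I wanted to verify that “~8KiB” claim from inside the process instead of trusting it, one option is to wrap the reader and record the chunk sizes the copy helper actually asks for. A minimal sketch, assuming `futures_lite`'s `AsyncRead` trait; `ChunkSpy` is a hypothetical name, not something that exists in `oi`:

```rust
use std::pin::Pin;
use std::task::{Context, Poll};

use futures_lite::io::AsyncRead;

/// Hypothetical instrumentation: wraps any reader and remembers the largest
/// slice a copy helper ever asked it to fill (roughly its internal buffer size).
struct ChunkSpy<R> {
    inner: R,
    max_chunk: usize,
}

impl<R> ChunkSpy<R> {
    fn new(inner: R) -> Self {
        Self { inner, max_chunk: 0 }
    }

    fn max_chunk(&self) -> usize {
        self.max_chunk
    }
}

impl<R: AsyncRead + Unpin> AsyncRead for ChunkSpy<R> {
    fn poll_read(
        self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &mut [u8],
    ) -> Poll<std::io::Result<usize>> {
        let this = self.get_mut();
        // `buf.len()` is the chunk size the caller (e.g. `io::copy`) works in.
        this.max_chunk = this.max_chunk.max(buf.len());
        Pin::new(&mut this.inner).poll_read(cx, buf)
    }
}
```

Since the counter lives inside the spy, a real version would expose it through something shared (an `Arc<AtomicUsize>`, for example) so it can still be read after the copy future finishes; the point is only that the hypothesis is cheap to check.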
---

## 7) The First Breakthrough — Replace `io::copy` with `pump()`

**Me:** I wrote a manual pump loop:

* allocate a buffer once
* `read()` into it
* `write_all()` out
* on EOF: `shutdown(Write)` to propagate half-close

```rust
use std::io;

use futures_lite::io::{AsyncReadExt, AsyncWriteExt};
use smol::net::TcpStream; // assuming smol's TcpStream (the runtime named above)

async fn pump(mut r: TcpStream, mut w: TcpStream, buf_sz: usize) -> io::Result<u64> {
    // One buffer per direction, allocated once for the connection's lifetime.
    let mut buf = vec![0u8; buf_sz];
    let mut total = 0u64;

    loop {
        let n = r.read(&mut buf).await?;
        if n == 0 {
            // EOF: propagate the half-close so the other side sees it.
            let _ = w.shutdown(std::net::Shutdown::Write);
            break;
        }
        w.write_all(&buf[..n]).await?;
        total += n as u64;
    }
    Ok(total)
}
```

Run both directions:

```rust
// BUF is the relay buffer size (the sweep below settles on 64 * 1024);
// try_zip is futures_lite::future::try_zip, as in the original version.
let c2s = pump(client_stream.clone(), server_stream.clone(), BUF);
let s2c = pump(server_stream, client_stream, BUF);
try_zip(c2s, s2c).await?;
```

**Inner Prosecutor:** *Hold it!* That's just a loop. How does that win?

**Me:** Not the loop. The **bytes per syscall**.
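For context, here is roughly how those two pumps hang off an accepted connection. This is a sketch, not `oi`'s actual accept path: `BUF`, `handle`, and the use of `smol::net::TcpStream` are my assumptions based on the runtime named above.

```rust
use std::io;

use futures_lite::future::try_zip;
use smol::net::TcpStream;

const BUF: usize = 64 * 1024; // relay buffer per direction (the winner from the sweep below)

async fn handle(client_stream: TcpStream, upstream: &str) -> io::Result<()> {
    // One upstream connection per accepted client, then relay both directions
    // until each side has seen EOF (or either direction errors).
    let server_stream = TcpStream::connect(upstream).await?;

    let c2s = pump(client_stream.clone(), server_stream.clone(), BUF);
    let s2c = pump(server_stream, client_stream, BUF);
    try_zip(c2s, s2c).await?;
    Ok(())
}
```

`try_zip` behaves like the original `io::copy` wiring: both directions run concurrently, and the first error tears the pair down.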
---

## 8) Exhibit C — The Numbers (8KiB → 16KiB → 64KiB)

### Baseline: ~8KiB (generic copy helper)

Throughput:

```text
17.8 Gbit/s
```

**Inner Prosecutor:** *Objection!* That's your “crime scene” number?

**Me:** Yes. Now watch what happens when the kernel stops getting spammed.

### Pump + 16KiB buffer

Throughput:

```text
28.6 Gbit/s
```

`strace -c` showed the `sendto/recvfrom` call count dropped:

```text
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 57.80   14.590016      442121        33           epoll_wait
 28.84    7.279883           4   1771146           sendto
 13.33    3.363882           1   1771212        48 recvfrom
  0.02    0.003843          61        62        44 futex
  0.01    0.001947          12       159           epoll_ctl
...
------ ----------- ----------- --------- --------- ----------------
100.00   25.242897           7   3542787       143 total
```

**Inner Prosecutor:** *Hold it!* That's already big. But you claim there's more?

**Me:** Oh, there's more.

### Pump + 64KiB buffer

Throughput:

```text
54.1 Gbit/s (best observed)
```
`perf stat` output:

```text
 Performance counter stats for process id '893123':

   120,859,810,675      cpu_atom/cycles/                                      (0.15%)
   134,735,934,329      cpu_core/cycles/                                      (99.85%)
    79,946,979,880      cpu_atom/instructions/   #  0.66 insn per cycle       (0.15%)
   127,036,644,759      cpu_core/instructions/   #  0.94 insn per cycle       (99.85%)
        24,713,474      cpu_atom/cache-misses/                                (0.15%)
         9,604,449      cpu_core/cache-misses/                                (99.85%)
    15,584,074,530      cpu_atom/branches/                                    (0.15%)
    24,796,180,117      cpu_core/branches/                                    (99.85%)
       175,778,825      cpu_atom/branch-misses/  #  1.13% of all branches     (0.15%)
       135,067,353      cpu_core/branch-misses/  #  0.54% of all branches     (99.85%)
             1,519      context-switches
                50      cpu-migrations

      33.006529572 seconds time elapsed
```

`strace -c` output:

```text
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 54.56   18.079500      463576        39           epoll_wait
 27.91    9.249443           7   1294854         2 sendto
 17.49    5.796927           4   1294919        51 recvfrom
...
------ ----------- ----------- --------- --------- ----------------
100.00   33.135377          12   2590253       158 total
```
**Inner Prosecutor:** *OBJECTION!* `epoll_wait` is eating the time. That's the bottleneck!

**Me:** Nice try. That's a classic trap.

---

## 9) Cross-Examination — The `epoll_wait` Trap

**Me:** `strace -c` measures time spent *inside syscalls*, including time spent **blocked**.

In async runtimes:

* One thread can sit in `epoll_wait(timeout=...)`
* Other threads do `sendto/recvfrom`
* `strace` charges the blocking time to `epoll_wait`

So `epoll_wait` dominating **does not** mean “epoll is slow”.
It often means “one thread is waiting while others work”.

**What matters here:**

* `sendto` / `recvfrom` call counts
* and how they change with buffer size

---

## 10) Final Explanation — Why 64KiB Causes a “Nonlinear” Jump

**Inner Prosecutor:** *Hold it!* You reduced the syscall count by roughly 40%. How do you nearly triple throughput?

**Me:** Because syscall walls are **nonlinear**.

A forwarder's throughput is approximately:

> **Throughput ≈ bytes_per_syscall_pair × syscall_pairs_per_second**

If you're syscall-rate limited, increasing `bytes_per_syscall_pair` pushes you past a threshold where:

* socket buffers stay fuller
* the TCP window is better utilized
* each stream spends less time in per-chunk bookkeeping
* concurrency (`-P 8`) stops fighting overhead and starts helping

Once you cross that threshold, throughput can jump until the next ceiling (kernel TCP, memory bandwidth, iperf itself).

That's why a “small” change can create a big effect.
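A back-of-the-envelope check of that formula, assuming the best case where every `read()` comes back with a full buffer:

```rust
/// Minimum read/write syscall pairs needed to relay `bytes`, if every read
/// fills the whole buffer. Real traffic needs at least this many.
fn min_syscall_pairs(bytes: u64, buf_sz: u64) -> u64 {
    bytes.div_ceil(buf_sz)
}

fn main() {
    let gib = 1u64 << 30;
    println!("{} pairs per GiB at  8 KiB", min_syscall_pairs(gib, 8 * 1024)); // 131072
    println!("{} pairs per GiB at 64 KiB", min_syscall_pairs(gib, 64 * 1024)); //  16384
}
```

An 8× larger buffer can shave up to 8× off the per-byte syscall tax; whether the reads actually fill the buffer is exactly what the `strace` call counts above are checking.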
---

## 11) Trade-offs — Buffer Size Is Not Free

**Inner Prosecutor:** *Objection!* Bigger buffers waste memory!

**Me:** Sustained.

A forwarder allocates **two buffers per connection** (one per direction).

So for 64KiB:

* ~128KiB per connection (just for relay buffers)
* plus runtime + socket buffers

That's fine for “few heavy streams”, but it matters if you handle thousands of concurrent connections.

In practice, the right move is:

* choose a good default (64KiB is common)
* make it configurable
* consider buffer pooling if connection churn is heavy (see the sketch below)
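A minimal sketch of the last two points, with hypothetical names (`OI_BUF_SIZE`, `BufPool`) that I'm inventing for illustration, not anything `oi` ships today:

```rust
use std::sync::Mutex;

/// Hypothetical knob: pick the relay buffer size at startup, defaulting to 64 KiB.
fn relay_buf_size() -> usize {
    std::env::var("OI_BUF_SIZE")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(64 * 1024)
}

/// Hypothetical pool: recycle relay buffers across connections, so heavy
/// connection churn does not mean two fresh 64 KiB allocations per accept.
struct BufPool {
    buf_sz: usize,
    free: Mutex<Vec<Vec<u8>>>,
}

impl BufPool {
    fn new(buf_sz: usize) -> Self {
        Self { buf_sz, free: Mutex::new(Vec::new()) }
    }

    /// Reuse a returned buffer if one is available, otherwise allocate.
    fn get(&self) -> Vec<u8> {
        self.free
            .lock()
            .unwrap()
            .pop()
            .unwrap_or_else(|| vec![0u8; self.buf_sz])
    }

    /// Hand the buffer back when the connection closes.
    fn put(&self, buf: Vec<u8>) {
        self.free.lock().unwrap().push(buf);
    }
}
```

With something like this, `pump()` would borrow its buffer from the pool instead of calling `vec![0u8; buf_sz]` per direction.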
---

## Epilogue — Case Closed (for now)

**Inner Prosecutor:** So the culprit was…

**Me:** A perfectly reasonable helper with a default buffer size I didn't question.

**Inner Prosecutor:** And the lesson?

**Me:** Don't guess. Ask sharp questions. Use the tools. Let the system testify.

> **Verdict:** Guilty of “too many syscalls per byte.”
>
> **Sentence:** 64KiB buffers and a better relay loop.

---

## Ending

This was a good reminder that performance work is not guessing — it's a dialogue with the system:

1. Describe the situation
2. Ask sharp questions
3. Use tools to confirm
4. Explain the results using low-level knowledge
5. Make one change
6. Re-measure

And the funniest part: the “clean” one-liner `io::copy` was correct, but its defaults were hiding a performance policy I didn't want.

> **Inner Prosecutor:** “Case closed?”
>
> **Me:** “For now. Next case: buffer pooling, socket buffer tuning, and maybe a Linux-only `splice(2)` fast path — carefully, behind a safe wrapper.”
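For the curious, here is the rough shape of that `splice(2)` idea: move bytes socket → pipe → socket inside the kernel so the payload never crosses into userspace at all. A blocking, Linux-only sketch using the `libc` crate (an assumption on my part; `oi` does none of this today), with non-blocking integration, `EAGAIN`/`EINTR` handling, and the promised safe wrapper all left out:

```rust
use std::io;
use std::os::fd::RawFd;

/// One relay step: move up to `len` bytes from `from` to `to` through a pipe
/// created earlier (e.g. with `libc::pipe2`), without copying through userspace.
fn splice_once(from: RawFd, to: RawFd, pipe_rd: RawFd, pipe_wr: RawFd, len: usize) -> io::Result<usize> {
    // Socket -> pipe: the data stays in kernel buffers.
    let n = unsafe {
        libc::splice(from, std::ptr::null_mut(), pipe_wr, std::ptr::null_mut(), len, libc::SPLICE_F_MOVE)
    };
    if n < 0 {
        return Err(io::Error::last_os_error());
    }
    if n == 0 {
        return Ok(0); // EOF on the source socket
    }

    // Pipe -> destination socket, looping until the pipe is drained.
    let mut remaining = n as usize;
    while remaining > 0 {
        let m = unsafe {
            libc::splice(pipe_rd, std::ptr::null_mut(), to, std::ptr::null_mut(), remaining, libc::SPLICE_F_MOVE)
        };
        if m < 0 {
            return Err(io::Error::last_os_error());
        }
        if m == 0 {
            return Err(io::ErrorKind::WriteZero.into());
        }
        remaining -= m as usize;
    }
    Ok(n as usize)
}
```

A real version would hide this behind a `#[cfg(target_os = "linux")]` wrapper and fall back to `pump()` elsewhere, which is exactly why it's a “next case” and not part of this one.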
---