Compare commits
3 Commits: markyan04-... → main

| SHA1 |
|---|
| 144ae4a0f3 |
| ca04039826 |
| ba61eb37b9 |
@@ -1,4 +1,3 @@

## Testing environment setup

Install tools:
@@ -191,3 +190,344 @@ Fix direction:

- move to smol::Executor + N threads (usually num_cpus); see the sketch below
- or run multiple block_on() workers (careful: avoid accept() duplication)

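
A minimal sketch of the first option (my addition, not oi's actual `main`; the accept loop is a hypothetical placeholder):

```rust
// Sketch only: one shared smol::Executor driven by N worker threads.
// The single accept task would be spawned onto `ex`, so accept() is
// never duplicated across threads.
use std::{sync::Arc, thread};

fn main() {
    let ex = Arc::new(smol::Executor::new());
    let n = thread::available_parallelism().map(|n| n.get()).unwrap_or(4);

    for _ in 1..n {
        let ex = Arc::clone(&ex);
        thread::spawn(move || smol::block_on(ex.run(smol::future::pending::<()>())));
    }

    // The main thread drives the executor too; the single listener task
    // (hypothetical accept_loop) would be spawned here with ex.spawn(...).detach().
    smol::block_on(ex.run(smol::future::pending::<()>()));
}
```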
## Outcome: oi

### CPU hotspot

Testing commands:

```bash
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 1

sudo perf stat -p $(pidof oi) -e \
  cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
  -- sleep 30
```
perf stat output:

```text
Performance counter stats for process id '207279':

    98,571,874,480      cpu_atom/cycles/                                              (0.10%)
   134,732,064,800      cpu_core/cycles/                                              (99.90%)
    75,889,748,906      cpu_atom/instructions/           #    0.77  insn per cycle    (0.10%)
   159,098,987,713      cpu_core/instructions/           #    1.18  insn per cycle    (99.90%)
        30,443,258      cpu_atom/cache-misses/                                        (0.10%)
         3,155,528      cpu_core/cache-misses/                                        (99.90%)
    15,003,063,317      cpu_atom/branches/                                            (0.10%)
    31,479,765,962      cpu_core/branches/                                            (99.90%)
       149,091,165      cpu_atom/branch-misses/          #    0.99% of all branches   (0.10%)
       195,562,861      cpu_core/branch-misses/          #    0.62% of all branches   (99.90%)
             1,138      context-switches
                37      cpu-migrations

      33.004738330 seconds time elapsed
```
### FlameGraph

Testing commands:

```bash
sudo perf record -F 199 -g -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```

Outcome:

oi.svg

Commands (DWARF call graphs):

```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi_dwarf.svg
```

Outcome:

oi_dwarf.svg
### syscall-cost check

```bash
sudo strace -ff -C -p $(pidof oi) -o /tmp/oi.strace
# run 15–30s under load, then Ctrl+C
tail -n +1 /tmp/oi.strace.*
```
## More realistic setup

Traffic goes through real kernel routing plus two TCP legs.

Create namespaces + veth links:

```bash
sudo ip netns add ns_client
sudo ip netns add ns_server

sudo ip link add veth_c type veth peer name veth_c_ns
sudo ip link set veth_c_ns netns ns_client

sudo ip link add veth_s type veth peer name veth_s_ns
sudo ip link set veth_s_ns netns ns_server

sudo ip addr add 10.0.1.1/24 dev veth_c
sudo ip link set veth_c up
sudo ip addr add 10.0.0.1/24 dev veth_s
sudo ip link set veth_s up

sudo ip netns exec ns_client ip addr add 10.0.1.2/24 dev veth_c_ns
sudo ip netns exec ns_client ip link set veth_c_ns up
sudo ip netns exec ns_client ip link set lo up

sudo ip netns exec ns_server ip addr add 10.0.0.2/24 dev veth_s_ns
sudo ip netns exec ns_server ip link set veth_s_ns up
sudo ip netns exec ns_server ip link set lo up

sudo sysctl -w net.ipv4.ip_forward=1
```
Config to force the redirect path:

```yaml
# listen_addr listen_port dest_addr dest_port
10.0.1.1 9000 10.0.0.2 9001
```

Start backend server in ns_server:

```bash
sudo ip netns exec ns_server iperf3 -s -p 9001
```

Run client in ns_client → forwarder → backend:

```bash
sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8
```
perf stat output:

```text
sudo perf stat -p $(pidof oi) -e cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations -- sleep 33

Performance counter stats for process id '209785':

   113,810,599,893      cpu_atom/cycles/                                              (0.11%)
   164,681,878,450      cpu_core/cycles/                                              (99.89%)
   102,575,167,734      cpu_atom/instructions/           #    0.90  insn per cycle    (0.11%)
   237,094,207,911      cpu_core/instructions/           #    1.44  insn per cycle    (99.89%)
        33,093,338      cpu_atom/cache-misses/                                        (0.11%)
         5,381,441      cpu_core/cache-misses/                                        (99.89%)
    20,012,975,873      cpu_atom/branches/                                            (0.11%)
    46,120,077,111      cpu_core/branches/                                            (99.89%)
       211,767,555      cpu_atom/branch-misses/          #    1.06% of all branches   (0.11%)
       245,969,685      cpu_core/branch-misses/          #    0.53% of all branches   (99.89%)
             1,686      context-switches
               150      cpu-migrations

      33.004363800 seconds time elapsed
```

Flamegraph:
### Add latency + small-packet tests

netperf (request/response):

Start netserver in the backend namespace:

```bash
sudo ip netns exec ns_server netserver -p 9001
```

Run TCP_RR against the forwarded port:

```bash
sudo ip netns exec ns_client netperf -H 10.0.1.1 -p 9000 -t TCP_RR -l 30 -- -r 32,32
```
## After optimization

Here we replace the `futures_lite::io::copy` 8 KiB buffer with a custom 16 KiB buffer. (To avoid conflicts, the binary is renamed to oiopt.)

```rust
async fn pump(mut r: TcpStream, mut w: TcpStream) -> io::Result<u64> {
    // Try 16 KiB instead of futures_lite::io::copy's 8 KiB,
    // then profile again to see the outcome.
    let mut buf = vec![0u8; 16 * 1024];
    let mut total = 0u64;

    loop {
        let n = r.read(&mut buf).await?;
        if n == 0 {
            // EOF: send FIN to peer
            let _ = w.shutdown(Shutdown::Write);
            break;
        }
        w.write_all(&buf[0..n]).await?;
        total += n as u64;
    }

    Ok(total)
}

// And change the call sites in handle_tcp_connection:
let client_to_server = pump(client_stream.clone(), server_stream.clone());
let server_to_client = pump(server_stream, client_stream);
// ...then await both directions together, e.g. with
// futures_lite::future::try_zip(client_to_server, server_to_client).await?;
```
### Outcomes

Still with `sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8`.

perf stat:

```text
sudo perf stat -p $(pidof oiopt) -e cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations -- sleep 33

Performance counter stats for process id '883435':

   118,960,667,431      cpu_atom/cycles/                                              (0.05%)
   131,934,369,110      cpu_core/cycles/                                              (99.95%)
   100,530,466,140      cpu_atom/instructions/           #    0.85  insn per cycle    (0.05%)
   185,203,788,299      cpu_core/instructions/           #    1.40  insn per cycle    (99.95%)
        11,027,490      cpu_atom/cache-misses/                                        (0.05%)
         2,123,369      cpu_core/cache-misses/                                        (99.95%)
    19,641,945,774      cpu_atom/branches/                                            (0.05%)
    36,245,438,057      cpu_core/branches/                                            (99.95%)
       214,098,497      cpu_atom/branch-misses/          #    1.09% of all branches   (0.05%)
       179,848,095      cpu_core/branch-misses/          #    0.50% of all branches   (99.95%)
             2,308      context-switches
                31      cpu-migrations

      33.004555878 seconds time elapsed
```
System call check:

```bash
sudo timeout 30s strace -c -f -p $(pidof oiopt)
```

Output:

```text
strace: Process 883435 attached with 4 threads
strace: Process 883438 detached
strace: Process 883437 detached
strace: Process 883436 detached
strace: Process 883435 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 57.80   14.590016      442121        33           epoll_wait
 28.84    7.279883           4   1771146           sendto
 13.33    3.363882           1   1771212        48 recvfrom
  0.02    0.003843          61        62        44 futex
  0.01    0.001947          12       159           epoll_ctl
  0.00    0.000894          99         9         9 connect
  0.00    0.000620          34        18         9 accept4
  0.00    0.000503          14        34           timerfd_settime
  0.00    0.000446          13        33        33 read
  0.00    0.000271          15        18           ioctl
  0.00    0.000189          21         9           write
  0.00    0.000176          19         9           socket
  0.00    0.000099          11         9           getsockopt
  0.00    0.000079           4        18           shutdown
  0.00    0.000049           2        18           close
------ ----------- ----------- --------- --------- ----------------
100.00   25.242897           7   3542787       143 total
```
## Further tests to explain why the gain is this large

Changed the 16 KiB buffer to 64 KiB and named the binary oiopt64.

iperf3 throughput under `-P 8`: 54.1 Gbit/s at the highest, and the individual streams are much faster than with the 16 KiB buffer.

perf stat:

```text
sudo perf stat -p $(pidof oiopt64) -e cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations -- sleep 33

Performance counter stats for process id '893123':

   120,859,810,675      cpu_atom/cycles/                                              (0.15%)
   134,735,934,329      cpu_core/cycles/                                              (99.85%)
    79,946,979,880      cpu_atom/instructions/           #    0.66  insn per cycle    (0.15%)
   127,036,644,759      cpu_core/instructions/           #    0.94  insn per cycle    (99.85%)
        24,713,474      cpu_atom/cache-misses/                                        (0.15%)
         9,604,449      cpu_core/cache-misses/                                        (99.85%)
    15,584,074,530      cpu_atom/branches/                                            (0.15%)
    24,796,180,117      cpu_core/branches/                                            (99.85%)
       175,778,825      cpu_atom/branch-misses/          #    1.13% of all branches   (0.15%)
       135,067,353      cpu_core/branch-misses/          #    0.54% of all branches   (99.85%)
             1,519      context-switches
                50      cpu-migrations

      33.006529572 seconds time elapsed
```
System call check:

```bash
sudo timeout 30s strace -c -f -p $(pidof oiopt64)
```

Output:

```text
strace: Process 893123 attached with 4 threads
strace: Process 893126 detached
strace: Process 893125 detached
strace: Process 893124 detached
strace: Process 893123 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 54.56   18.079500      463576        39           epoll_wait
 27.91    9.249443           7   1294854         2 sendto
 17.49    5.796927           4   1294919        51 recvfrom
  0.01    0.003778          50        75        49 futex
  0.01    0.002188          12       175           epoll_ctl
  0.00    0.000747          83         9         9 connect
  0.00    0.000714          17        40           timerfd_settime
  0.00    0.000510          13        39        38 read
  0.00    0.000452          25        18         9 accept4
  0.00    0.000310          17        18           ioctl
  0.00    0.000232          23        10           write
  0.00    0.000200          22         9           socket
  0.00    0.000183          20         9           getsockopt
  0.00    0.000100           5        18           shutdown
  0.00    0.000053           2        18           close
  0.00    0.000020          20         1           mprotect
  0.00    0.000015          15         1           sched_yield
  0.00    0.000005           5         1           madvise
------ ----------- ----------- --------- --------- ----------------
100.00   33.135377          12   2590253       158 total
```
### Cleanup

```bash
sudo ip netns del ns_client
sudo ip netns del ns_server
sudo ip link del veth_c
sudo ip link del veth_s
```
@@ -0,0 +1,475 @@
---
title: Ace Profiling Attorney - The Case of the Missing Gbits
categories: [Programming, Profiling]
tags: [Rust, kernel, networking]
---

> **Disclaimer:** This is not a language-war post. No “X vs Y”.
> This is a profiling detective story about my Rust TCP forwarder [`oi`](https://github.com/DaZuo0122/oxidinetd).

---
## 0) Prologue — The Courthouse Lobby

> **Me:** I wrote a Rust TCP port forwarder. It works. It forwards.
>
> **Inner Prosecutor (Phoenix voice):** *Hold it!* “Works” is not a metric. How fast?
>
> **Me:** Not fast enough under load.
>
> **Inner Prosecutor:** *Objection!* “Not fast enough” is an emotion. Bring evidence.
>
> **Me:** Fine. I’ll bring **perf**, **strace**, and a **flamegraph**.
>
> **Inner Prosecutor:** Good. This court accepts only facts.
## 1) The Crime Scene — Setup & Reproduction

**Me:** Single machine, Debian 13. No WAN noise, no tunnel bottlenecks.

**Inner Prosecutor:** *Hold it!* If it’s “single machine”, how do you avoid loopback cheating?

**Me:** Network namespaces + veth. Local, repeatable, closer to real networking.

### Environment

- Debian 13
- Kernel: `6.12.48+deb13-amd64`
- Runtime: `smol`
- Test topology: `ns_client → oi (root ns) → ns_server` via veth

### Reproduction commands

**Exhibit A: Start backend server in `ns_server`**

```bash
sudo ip netns exec ns_server iperf3 -s -p 9001
```

**Exhibit B: Run client in `ns_client` through forwarder**

```bash
sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8
```

**Inner Prosecutor:** *Hold it!* Why `-P 8`?

**Me:** Because a forwarder can look fine in `-P 1` and fall apart when syscall pressure scales.

**Inner Prosecutor:** …Acceptable.

---
## 2) The Suspects — What Could Be Limiting Throughput?

**Me:** Four suspects.

1. **CPU bound** (pure compute wall)
2. **Kernel TCP stack bound** (send/recv path, skb, softirq, netfilter/conntrack)
3. **Syscall-rate wall** (too many `sendto/recvfrom` per byte)
4. **Runtime scheduling / contention** (wake storms, locks, futex)

**Inner Prosecutor:** *Objection!* That’s too broad. Narrow it down.

**Me:** That’s what the tools are for.

---
## 3) Evidence #1 — `perf stat` (The Macro View)

**Me:** First I ask: are we burning CPU, thrashing schedulers, or stalling on memory?

**Command:**

```bash
sudo perf stat -p $(pidof oi) -e \
  cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
  -- sleep 33
```

**What I’m looking for:**

* Huge `context-switches` → runtime thrash / lock contention
* Huge `cpu-migrations` → unstable scheduling
* Very low IPC + huge cache misses → memory stalls
* Otherwise: likely syscall/kernel path
Output:

```text
Performance counter stats for process id '209785':

   113,810,599,893      cpu_atom/cycles/                                              (0.11%)
   164,681,878,450      cpu_core/cycles/                                              (99.89%)
   102,575,167,734      cpu_atom/instructions/           #    0.90  insn per cycle    (0.11%)
   237,094,207,911      cpu_core/instructions/           #    1.44  insn per cycle    (99.89%)
        33,093,338      cpu_atom/cache-misses/                                        (0.11%)
         5,381,441      cpu_core/cache-misses/                                        (99.89%)
    20,012,975,873      cpu_atom/branches/                                            (0.11%)
    46,120,077,111      cpu_core/branches/                                            (99.89%)
       211,767,555      cpu_atom/branch-misses/          #    1.06% of all branches   (0.11%)
       245,969,685      cpu_core/branch-misses/          #    0.53% of all branches   (99.89%)
             1,686      context-switches
               150      cpu-migrations

      33.004363800 seconds time elapsed
```

**Low context switching**:

- context-switches: 1,686 over ~33s → ~51 switches/sec
- cpu-migrations: 150 over ~33s → ~4.5/s → very stable CPU placement

**CPU is working hard**:

- 237,094,207,911 cpu_core instructions
- IPC: 1.44 (instructions per cycle) → not lock-bound or stalling badly

**Clean cache and branch metrics**:

- cache-misses (cpu_core): ~5.4M (tiny compared to the instruction count)
- branch-misses (cpu_core): 0.53%
**Inner Prosecutor:** *Hold it!* Those numbers describe the scene, but they don’t name a culprit.

**Me:** Patience. The next exhibit makes the culprit confess.

---
## 4) Evidence #2 — `strace -c` (The Confession: Syscall Composition)

**Me:** Next: “What syscalls are we paying for?”

**Command:**

```bash
sudo timeout 30s strace -c -f -p $(pidof oi)
```

**What I expect if this is a forwarding wall:**

* `sendto` and `recvfrom` dominate calls
* call counts in the millions
Output (simplified):

```text
sendto     2,190,751 calls   4.146799s  (57.6%)
recvfrom   2,190,763 calls   3.052340s  (42.4%)
total syscall time: 7.200789s
```

(A) **100% syscall/copy dominated:**

- Almost all traced time is inside:
  - sendto() (TCP send)
  - recvfrom() (TCP recv)

(B) **Syscall rate is massive:**

- Total send+recv calls:
  - ~4,381,500 syscalls in ~32s
  - → ~68k `sendto` per sec + ~68k `recvfrom` per sec
  - → ~137k syscalls/sec total
**Inner Prosecutor:** *Objection!* Syscalls alone don’t prove the bottleneck.

**Me:** True. So I brought a witness.

---
## 5) Evidence #3 — FlameGraph (The Witness)

**Me:** The flamegraph doesn’t lie. It testifies where cycles go.

**Commands:**

```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```

**What the flamegraph showed (described, not embedded):**

* The widest towers were kernel TCP send/recv paths:
  * `__x64_sys_sendto` → `tcp_sendmsg_locked` → `tcp_write_xmit` → ...
  * `__x64_sys_recvfrom` → `tcp_recvmsg` → ...
* My userspace frames existed, but were comparatively thin.
* The call chain still pointed into my forwarding implementation.

**Inner Prosecutor:** *Hold it!* So you’re saying… the kernel is doing the heavy lifting?

**Me:** Exactly. Which means my job is to **stop annoying the kernel** with too many tiny operations.

---
## 6) The Real Culprit — A “Perfectly Reasonable” Copy Loop

**Me:** Here’s the original relay code. Looks clean, right?

```rust
let client_to_server = io::copy(client_stream.clone(), server_stream.clone());
let server_to_client = io::copy(server_stream, client_stream);

futures_lite::future::try_zip(client_to_server, server_to_client).await?;
```

**Inner Prosecutor:** *Objection!* This is idiomatic and correct.

**Me:** Yes. That’s why it’s dangerous.

**Key detail:** `futures_lite::io::copy` uses a small internal buffer (~8KiB in practice).
Small buffer → more iterations → more syscalls → more overhead.

If a forwarder is syscall-rate bound, this becomes a ceiling.

---
## 7) The First Breakthrough — Replace `io::copy` with `pump()`

**Me:** I wrote a manual pump loop:

* allocate a buffer once
* `read()` into it
* `write_all()` out
* on EOF: `shutdown(Write)` to propagate half-close

```rust
async fn pump(mut r: TcpStream, mut w: TcpStream, buf_sz: usize) -> io::Result<u64> {
    let mut buf = vec![0u8; buf_sz];
    let mut total = 0u64;

    loop {
        let n = r.read(&mut buf).await?;
        if n == 0 {
            // EOF: propagate half-close to the peer
            let _ = w.shutdown(std::net::Shutdown::Write);
            break;
        }
        w.write_all(&buf[..n]).await?;
        total += n as u64;
    }
    Ok(total)
}
```

Run both directions:

```rust
let c2s = pump(client_stream.clone(), server_stream.clone(), BUF);
let s2c = pump(server_stream, client_stream, BUF);
try_zip(c2s, s2c).await?;
```

**Inner Prosecutor:** *Hold it!* That’s just a loop. How does that win?

**Me:** Not the loop. The **bytes per syscall**.

---
## 8) Exhibit C — The Numbers (8KiB → 16KiB → 64KiB)

### Baseline: ~8KiB (generic copy helper)

Throughput:

```text
17.8 Gbit/s
```

**Inner Prosecutor:** *Objection!* That’s your “crime scene” number?

**Me:** Yes. Now watch what happens when the kernel stops getting spammed.
### Pump + 16KiB buffer

Throughput:

```text
28.6 Gbit/s
```

`strace -c` showed the `sendto/recvfrom` call count dropped:

```text
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 57.80   14.590016      442121        33           epoll_wait
 28.84    7.279883           4   1771146           sendto
 13.33    3.363882           1   1771212        48 recvfrom
  0.02    0.003843          61        62        44 futex
  0.01    0.001947          12       159           epoll_ctl
...
------ ----------- ----------- --------- --------- ----------------
100.00   25.242897           7   3542787       143 total
```

**Inner Prosecutor:** *Hold it!* That’s already big. But you claim there’s more?

**Me:** Oh, there’s more.
### Pump + 64KiB buffer

Throughput:

```text
54.1 Gbit/s (best observed)
```

`perf stat` output:

```text
Performance counter stats for process id '893123':

   120,859,810,675      cpu_atom/cycles/                                              (0.15%)
   134,735,934,329      cpu_core/cycles/                                              (99.85%)
    79,946,979,880      cpu_atom/instructions/           #    0.66  insn per cycle    (0.15%)
   127,036,644,759      cpu_core/instructions/           #    0.94  insn per cycle    (99.85%)
        24,713,474      cpu_atom/cache-misses/                                        (0.15%)
         9,604,449      cpu_core/cache-misses/                                        (99.85%)
    15,584,074,530      cpu_atom/branches/                                            (0.15%)
    24,796,180,117      cpu_core/branches/                                            (99.85%)
       175,778,825      cpu_atom/branch-misses/          #    1.13% of all branches   (0.15%)
       135,067,353      cpu_core/branch-misses/          #    0.54% of all branches   (99.85%)
             1,519      context-switches
                50      cpu-migrations

      33.006529572 seconds time elapsed
```

`strace -c` output:

```text
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 54.56   18.079500      463576        39           epoll_wait
 27.91    9.249443           7   1294854         2 sendto
 17.49    5.796927           4   1294919        51 recvfrom
...
------ ----------- ----------- --------- --------- ----------------
100.00   33.135377          12   2590253       158 total
```

**Inner Prosecutor:** *OBJECTION!* `epoll_wait` is eating the time. That’s the bottleneck!

**Me:** Nice try. That’s a classic trap.

---
## 9) Cross-Examination — The `epoll_wait` Trap

**Me:** `strace -c` measures time spent *inside syscalls*, including time spent **blocked**.

In async runtimes:

* One thread can sit in `epoll_wait(timeout=...)`
* Other threads do `sendto/recvfrom`
* `strace` charges the blocking time to `epoll_wait`

So `epoll_wait` dominating **does not** mean “epoll is slow”.
It often means “one thread is waiting while others work”.

**What matters here:**

* `sendto` / `recvfrom` call counts
* and how they change with buffer size

---
## 10) Final Explanation — Why 64KiB Causes a “Nonlinear” Jump

**Inner Prosecutor:** *Hold it!* You only reduced the syscall count by a modest fraction. How do you nearly triple throughput?

**Me:** Because syscall walls are **nonlinear**.

A forwarder’s throughput is approximately:

> **Throughput ≈ bytes_per_syscall_pair × syscall_pairs_per_second**

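
To make the first factor concrete, here is a back-of-the-envelope sketch (my addition; the syscall-pair rate is a purely hypothetical fixed value, and in reality the achievable rate also improves once the kernel gets bigger writes, which is the threshold effect described next):

```rust
// Toy arithmetic for the formula above, at an assumed (not measured)
// fixed rate of read+write syscall pairs per second.
fn main() {
    let pairs_per_sec: u64 = 70_000; // hypothetical rate, per stream
    for buf in [8 * 1024u64, 16 * 1024, 64 * 1024] {
        let gbit_per_s = (pairs_per_sec * buf * 8) as f64 / 1e9;
        println!("{:>2} KiB per pair -> ~{:.1} Gbit/s ceiling", buf / 1024, gbit_per_s);
    }
}
```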
If you’re syscall-rate limited, increasing `bytes_per_syscall_pair` pushes you past a threshold where:

* socket buffers stay fuller
* the TCP window is better utilized
* each stream spends less time in per-chunk bookkeeping
* concurrency (`-P 8`) stops fighting overhead and starts helping

Once you cross that threshold, throughput can jump until the next ceiling (kernel TCP, memory bandwidth, iperf itself).

That’s why a “small” change can create a big effect.

---
## 11) Trade-offs: Buffer Size Is Not Free

**Inner Prosecutor:** *Objection!* Bigger buffers waste memory!

**Me:** Sustained.

A forwarder allocates **two buffers per connection** (one per direction).

So for 64KiB:

* ~128KiB per connection (just for relay buffers)
* plus runtime + socket buffers

That’s fine for “few heavy streams”, but it matters if you handle thousands of concurrent connections.

In practice, the right move is:

* choose a good default (64KiB is common)
* make it configurable
* consider buffer pooling if connection churn is heavy (sketch below)

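
A minimal sketch of what that pooling could look like (my addition, not oi's code; it assumes a plain `std::sync::Mutex` is fine because the lock is only held to push or pop a buffer). `pump()` would then take an `Arc<BufPool>`, call `get()` on entry and `put(buf)` before returning.

```rust
use std::sync::{Arc, Mutex};

/// Tiny free-list of relay buffers, so short-lived connections reuse
/// allocations instead of allocating 2 x 64 KiB each time.
struct BufPool {
    free: Mutex<Vec<Vec<u8>>>,
    buf_sz: usize,
}

impl BufPool {
    fn new(buf_sz: usize) -> Arc<Self> {
        Arc::new(Self { free: Mutex::new(Vec::new()), buf_sz })
    }

    /// Take a buffer from the pool, or allocate a fresh one if it is empty.
    fn get(&self) -> Vec<u8> {
        self.free
            .lock()
            .unwrap()
            .pop()
            .unwrap_or_else(|| vec![0u8; self.buf_sz])
    }

    /// Return a buffer when a pump finishes; cap the pool so idle memory
    /// stays bounded even after a connection spike.
    fn put(&self, buf: Vec<u8>) {
        let mut free = self.free.lock().unwrap();
        if free.len() < 1024 {
            free.push(buf);
        }
    }
}
```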
---
## Epilogue — Case Closed (for now)

**Inner Prosecutor:** So the culprit was…

**Me:** A perfectly reasonable helper with a default buffer size I didn’t question.

**Inner Prosecutor:** And the lesson?

**Me:** Don’t guess. Ask sharp questions. Use the tools. Let the system testify.

> **Verdict:** Guilty of “too many syscalls per byte.”
>
> **Sentence:** 64KiB buffers and a better relay loop.

---

## Ending

This was a good reminder that performance work is not guessing — it’s a dialogue with the system:

1. Describe the situation
2. Ask sharp questions
3. Use tools to confirm
4. Explain the results using low-level knowledge
5. Make one change
6. Re-measure

And the funniest part: the “clean” one-liner `io::copy` was correct, but its defaults were hiding a performance policy I didn’t want.

> **Inner Prosecutor:** “Case closed?”
>
> **Me:** “For now. Next case: buffer pooling, socket buffer tuning, and maybe a Linux-only `splice(2)` fast path — carefully, behind a safe wrapper.”

---