Ace Profiling Attorney - The Case of the Missing Gbits
Disclaimer: This is not a language-war post. No “X vs Y”.
This is a profiling detective story about my Rust TCP forwarder, oi.
0) Prologue — The Courthouse Lobby
Me: I wrote a Rust TCP port forwarder. It works. It forwards.
Inner Prosecutor (Phoenix voice): Hold it! “Works” is not a metric. How fast?
Me: Not fast enough under load.
Inner Prosecutor: Objection! “Not fast enough” is an emotion. Bring evidence.
Me: Fine. I’ll bring perf, strace, and a flamegraph.
Inner Prosecutor: Good. This court accepts only facts.
1) The Crime Scene — Setup & Reproduction
Me: Single machine, Debian 13. No WAN noise, no tunnel bottlenecks.
Inner Prosecutor: Hold it! If it’s “single machine”, how do you avoid loopback cheating?
Me: Network namespaces + veth. Local, repeatable, closer to real networking.
Environment
- Debian 13
- Kernel: 6.12.48+deb13-amd64
- Runtime: smol
- Test topology: ns_client → oi (root ns) → ns_server via veth
Reproduction commands
Exhibit A: Start backend server in ns_server
sudo ip netns exec ns_server iperf3 -s -p 9001
Exhibit B: Run client in ns_client through forwarder
sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8
Inner Prosecutor: Hold it! Why -P 8?
Me: Because a forwarder can look fine in -P 1 and fall apart when syscall pressure scales.
Inner Prosecutor: …Acceptable.
2) The Suspects — What Could Be Limiting Throughput?
Me: Four suspects.
- CPU bound (pure compute wall)
- Kernel TCP stack bound (send/recv path, skb, softirq, netfilter/conntrack)
- Syscall-rate wall (too many sendto/recvfrom per byte)
- Runtime scheduling / contention (wake storms, locks, futex)
Inner Prosecutor: Objection! That’s too broad. Narrow it down.
Me: That’s what the tools are for.
3) Evidence #1 — perf stat (The Macro View)
Me: First I ask: are we burning CPU, thrashing schedulers, or stalling on memory?
Command:
sudo perf stat -p $(pidof oi) -e \
cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
-- sleep 33
What I’m looking for:
- Huge context-switches → runtime thrash / lock contention
- Huge cpu-migrations → unstable scheduling
- Very low IPC + huge cache misses → memory stalls
- Otherwise: likely syscall/kernel path
Output:
Performance counter stats for process id '209785':
113,810,599,893 cpu_atom/cycles/ (0.11%)
164,681,878,450 cpu_core/cycles/ (99.89%)
102,575,167,734 cpu_atom/instructions/ # 0.90 insn per cycle (0.11%)
237,094,207,911 cpu_core/instructions/ # 1.44 insn per cycle (99.89%)
33,093,338 cpu_atom/cache-misses/ (0.11%)
5,381,441 cpu_core/cache-misses/ (99.89%)
20,012,975,873 cpu_atom/branches/ (0.11%)
46,120,077,111 cpu_core/branches/ (99.89%)
211,767,555 cpu_atom/branch-misses/ # 1.06% of all branches (0.11%)
245,969,685 cpu_core/branch-misses/ # 0.53% of all branches (99.89%)
1,686 context-switches
150 cpu-migrations
33.004363800 seconds time elapsed
Low context switching:
- context-switches: 1,686 over ~33s → ~51 switches/sec
- cpu-migrations: 150 over ~33s → ~4.5/s → very stable CPU placement
CPU is working hard:
- 237,094,207,911 cpu_core instructions
- IPC: 1.44 (instructions per cycle) → not lock-bound and not stalling badly
Clean cache and branch metrics:
- cache-misses: ~5.4M on cpu_core (tiny compared to the instruction count)
- branch-misses: 0.53% on cpu_core
Inner Prosecutor: Hold it! These counters don’t name a culprit.
Me: Patience. The next exhibit makes the culprit confess.
4) Evidence #2 — strace -c (The Confession: Syscall Composition)
Me: Next: “What syscalls are we paying for?”
Command:
sudo timeout 30s strace -c -f -p $(pidof oi)
What I expect if this is a forwarding wall:
- sendto and recvfrom dominate
- call counts in the millions
Output (simplified):
sendto 2,190,751 calls 4.146799s (57.6%)
recvfrom 2,190,763 calls 3.052340s (42.4%)
total syscall time: 7.200789s
(A) 100% syscall/copy dominated:
- Almost all traced time is inside:
  - sendto() (TCP send)
  - recvfrom() (TCP recv)
(B) The syscall rate is massive:
- Total send+recv calls: ~4,381,500 syscalls in ~32s
- → ~68k sendto/sec + ~68k recvfrom/sec
- → ~137k syscalls/sec total
Inner Prosecutor: Objection! Syscalls alone don’t prove the bottleneck.
Me: True. So I brought a witness.
5) Evidence #3 — FlameGraph (The Witness)
Me: The flamegraph doesn’t lie. It testifies where cycles go.
Commands:
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
What the flamegraph showed (described, not embedded):
- The widest towers were kernel TCP send/recv paths:
  - __x64_sys_sendto → tcp_sendmsg_locked → tcp_write_xmit → ...
  - __x64_sys_recvfrom → tcp_recvmsg → ...
- My userspace frames existed, but were comparatively thin.
- The call chain still pointed into my forwarding implementation.
Inner Prosecutor: Hold it! So you’re saying… the kernel is doing the heavy lifting?
Me: Exactly. Which means my job is to stop annoying the kernel with too many tiny operations.
6) The Real Culprit — A “Perfectly Reasonable” Copy Loop
Me: Here’s the original relay code. Looks clean, right?
// futures_lite::io::copy takes the writer by &mut, so each write half gets its own binding.
let mut server_writer = server_stream.clone();
let mut client_writer = client_stream.clone();
let client_to_server = io::copy(client_stream.clone(), &mut server_writer);
let server_to_client = io::copy(server_stream, &mut client_writer);
futures_lite::future::try_zip(client_to_server, server_to_client).await?;
Inner Prosecutor: Objection! This is idiomatic and correct.
Me: Yes. That’s why it’s dangerous.
Key detail: futures_lite::io::copy uses a small internal buffer (~8KiB in practice).
Small buffer → more iterations → more syscalls → more overhead.
At 8KiB per chunk, relaying 1GiB costs at least ~131,000 read/write pairs; at 64KiB it costs ~16,000.
If a forwarder is syscall-rate bound, this becomes a ceiling.
7) The First Breakthrough — Replace io::copy with pump()
Me: I wrote a manual pump loop:
- allocate a buffer once
- read() into it
- write_all() out
- on EOF: shutdown(Write) to propagate half-close
// Assumes smol::net::TcpStream and futures_lite's AsyncReadExt / AsyncWriteExt
// extension traits (e.g. via `use futures_lite::prelude::*;`) are in scope.
async fn pump(mut r: TcpStream, mut w: TcpStream, buf_sz: usize) -> io::Result<u64> {
let mut buf = vec![0u8; buf_sz];
let mut total = 0u64;
loop {
let n = r.read(&mut buf).await?;
if n == 0 {
let _ = w.shutdown(std::net::Shutdown::Write);
break;
}
w.write_all(&buf[..n]).await?;
total += n as u64;
}
Ok(total)
}
Run both directions:
let c2s = pump(client_stream.clone(), server_stream.clone(), BUF);
let s2c = pump(server_stream, client_stream, BUF);
try_zip(c2s, s2c).await?;
Inner Prosecutor: Hold it! That’s just a loop. How does that win?
Me: Not the loop. The bytes per syscall.
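For context, here is a minimal sketch of how the two pumps could be wired into the forwarder’s accept path. It assumes smol / async-net; run, handle_conn, BUF, and the error logging are illustrative, not the actual oi code.
use futures_lite::future::try_zip;
use smol::net::{TcpListener, TcpStream};
use std::io;

const BUF: usize = 64 * 1024; // relay buffer size per direction

async fn handle_conn(client_stream: TcpStream, backend: &str) -> io::Result<()> {
    // One backend connection per accepted client, then pump both directions.
    let server_stream = TcpStream::connect(backend).await?;
    let c2s = pump(client_stream.clone(), server_stream.clone(), BUF);
    let s2c = pump(server_stream, client_stream, BUF);
    try_zip(c2s, s2c).await?;
    Ok(())
}

async fn run(listen_addr: &str, backend: &'static str) -> io::Result<()> {
    let listener = TcpListener::bind(listen_addr).await?;
    loop {
        let (client, _peer) = listener.accept().await?;
        // One task per connection; a relay error kills that connection only.
        smol::spawn(async move {
            if let Err(e) = handle_conn(client, backend).await {
                eprintln!("relay error: {e}");
            }
        })
        .detach();
    }
}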
8) Exhibit C — The Numbers (8KiB → 16KiB → 64KiB)
Baseline: ~8KiB (generic copy helper)
Throughput:
17.8 Gbit/s
Inner Prosecutor: Objection! That’s your “crime scene” number?
Me: Yes. Now watch what happens when the kernel stops getting spammed.
Pump + 16KiB buffer
Throughput:
28.6 Gbit/s
strace -c showed sendto/recvfrom call count dropped:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
57.80 14.590016 442121 33 epoll_wait
28.84 7.279883 4 1771146 sendto
13.33 3.363882 1 1771212 48 recvfrom
0.02 0.003843 61 62 44 futex
0.01 0.001947 12 159 epoll_ctl
...
------ ----------- ----------- --------- --------- ----------------
100.00 25.242897 7 3542787 143 total
Inner Prosecutor: Hold it! That’s already big. But you claim there’s more?
Me: Oh, there’s more.
Pump + 64KiB buffer
Throughput:
54.1 Gbit/s (best observed)
perf stat output:
Performance counter stats for process id '893123':
120,859,810,675 cpu_atom/cycles/ (0.15%)
134,735,934,329 cpu_core/cycles/ (99.85%)
79,946,979,880 cpu_atom/instructions/ # 0.66 insn per cycle (0.15%)
127,036,644,759 cpu_core/instructions/ # 0.94 insn per cycle (99.85%)
24,713,474 cpu_atom/cache-misses/ (0.15%)
9,604,449 cpu_core/cache-misses/ (99.85%)
15,584,074,530 cpu_atom/branches/ (0.15%)
24,796,180,117 cpu_core/branches/ (99.85%)
175,778,825 cpu_atom/branch-misses/ # 1.13% of all branches (0.15%)
135,067,353 cpu_core/branch-misses/ # 0.54% of all branches (99.85%)
1,519 context-switches
50 cpu-migrations
33.006529572 seconds time elapsed
strace -c output:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
54.56 18.079500 463576 39 epoll_wait
27.91 9.249443 7 1294854 2 sendto
17.49 5.796927 4 1294919 51 recvfrom
...
------ ----------- ----------- --------- --------- ----------------
100.00 33.135377 12 2590253 158 total
Inner Prosecutor: OBJECTION! epoll_wait is eating the time. That’s the bottleneck!
Me: Nice try. That’s a classic trap.
9) Cross-Examination — The epoll_wait Trap
Me: strace -c measures time spent inside syscalls, including time spent blocked.
In async runtimes:
- One thread can sit in epoll_wait(timeout=...)
- Other threads do sendto/recvfrom
- strace charges the blocking time to epoll_wait
So epoll_wait dominating does not mean “epoll is slow”.
It often means “one thread is waiting while others work”.
What matters here:
- sendto/recvfrom call counts
- and how they change with buffer size
10) Final Explanation — Why 64KiB Causes a “Nonlinear” Jump
Inner Prosecutor: Hold it! You only cut the sendto/recvfrom call counts by roughly 40%. How do you nearly triple throughput?
Me: Because syscall walls are nonlinear.
A forwarder’s throughput is approximately:
Throughput ≈ bytes_per_syscall_pair × syscall_pairs_per_second
If you’re syscall-rate limited, increasing bytes_per_syscall_pair pushes you past a threshold where:
- socket buffers stay fuller
- the TCP window is better utilized
- each stream spends less time in per-chunk bookkeeping
- concurrency (-P 8) stops fighting overhead and starts helping
Once you cross that threshold, throughput can jump until the next ceiling (kernel TCP, memory bandwidth, iperf itself).
That’s why a “small” change can create a big effect.
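To make the formula concrete, here is a tiny back-of-the-envelope sketch in Rust. The 100,000 pairs-per-second budget is a made-up number for illustration, not a measurement from the runs above.
// Illustration only: assumes a fixed, hypothetical syscall-pair budget.
fn main() {
    let pairs_per_sec: u64 = 100_000;
    for buf_kib in [8u64, 16, 64] {
        let bytes_per_pair = buf_kib * 1024;
        let gbit_per_sec = (bytes_per_pair * pairs_per_sec * 8) as f64 / 1e9;
        println!("{buf_kib:>2} KiB per pair -> ~{gbit_per_sec:.1} Gbit/s ceiling");
    }
}
With a fixed pair budget the ceiling scales with bytes per pair (≈6.6, 13.1, and 52.4 Gbit/s here); in the real runs the pair rate also shifted, which is why the measured numbers don’t line up exactly.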
11) Trade-offs: buffer size is not free
Inner Prosecutor: Objection! Bigger buffers waste memory!
Me: Sustained.
A forwarder allocates two buffers per connection (one per direction).
So for 64KiB:
- ~128KiB per connection (just for relay buffers)
- plus runtime + socket buffers
That’s fine for “few heavy streams”, but it matters if you handle thousands of concurrent connections.
In practice, the right move is:
- choose a good default (64KiB is common)
- make it configurable
- consider buffer pooling if connection churn is heavy
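If connection churn ever becomes the dominant cost, one option is a small pool that recycles relay buffers instead of allocating two fresh ones per connection. A minimal sketch, assuming a global pool behind a std Mutex; BufPool, RELAY_BUF, MAX_POOLED, and POOL are illustrative names, not part of oi.
use std::sync::Mutex;

const RELAY_BUF: usize = 64 * 1024; // matches the relay buffer size above
const MAX_POOLED: usize = 1024;     // cap on idle buffers kept around

struct BufPool {
    free: Mutex<Vec<Vec<u8>>>,
}

impl BufPool {
    const fn new() -> Self {
        Self { free: Mutex::new(Vec::new()) }
    }

    // Reuse an idle buffer if one exists, otherwise allocate a fresh one.
    fn get(&self) -> Vec<u8> {
        self.free
            .lock()
            .unwrap()
            .pop()
            .unwrap_or_else(|| vec![0u8; RELAY_BUF])
    }

    // Hand a buffer back when a connection ends; drop it if the pool is full.
    fn put(&self, buf: Vec<u8>) {
        let mut free = self.free.lock().unwrap();
        if free.len() < MAX_POOLED {
            free.push(buf);
        }
    }
}

static POOL: BufPool = BufPool::new();
In this sketch, pump() would take its buffer from POOL.get() instead of allocating with vec![0u8; buf_sz], and return it with POOL.put(buf) when that direction finishes.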
Epilogue — Case Closed (for now)
Inner Prosecutor: So the culprit was…
Me: A perfectly reasonable helper with a default buffer size I didn’t question.
Inner Prosecutor: And the lesson?
Me: Don’t guess. Ask sharp questions. Use the tools. Let the system testify.
Verdict: Guilty of “too many syscalls per byte.”
Sentence: 64KiB buffers and a better relay loop.
Ending
This was a good reminder that performance work is not guessing — it’s a dialogue with the system:
- Describe the situation
- Ask sharp questions
- Use tools to confirm
- Explain the results using low-level knowledge
- Make one change
- Re-measure
And the funniest part: the “clean” one-liner io::copy was correct, but its defaults were hiding a performance policy I didn’t want.
Inner Prosecutor: “Case closed?”
Me: “For now. Next case: buffer pooling, socket buffer tuning, and maybe a Linux-only splice(2) fast path — carefully, behind a safe wrapper.”