Ace Profiling Attorney - The Case of the Missing Gbits
Cast
Me: “I rewrote a port forwarder in Rust. It works. It’s… not fast enough.”
Inner Prosecutor: “Objection! ‘Not fast enough’ is not evidence. Bring numbers.”
Me: “Fine. We’ll do this properly.”
0. The Situation
I’m building a userspace TCP port forwarder in Rust called oxidinetd (the binary is named oi). It accepts a TCP connection, connects to an upstream server, then relays bytes in both directions.
This post is not a “Rust vs C” piece — it’s about profiling, forming hypotheses, and turning measurements into speed.
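For orientation, here is a minimal sketch of that accept → connect → relay shape, assuming smol’s TcpListener/TcpStream and the lab addresses from section 1. It is illustrative, not the actual oi source (real oi reads listen/target pairs from a config file), and the relay body is exactly what sections 6 and 7 argue about:

// Minimal sketch only, not oi's real accept loop.
use smol::net::{TcpListener, TcpStream};

fn main() -> std::io::Result<()> {
    smol::block_on(async {
        let listener = TcpListener::bind("10.0.1.1:9000").await?;
        loop {
            let (client, _peer) = listener.accept().await?;
            // One task per connection: connect upstream, then relay both ways.
            smol::spawn(async move {
                if let Ok(server) = TcpStream::connect("10.0.0.2:9001").await {
                    let _ = relay(client, server).await;
                }
            })
            .detach();
        }
    })
}

async fn relay(client: TcpStream, server: TcpStream) -> std::io::Result<()> {
    // The relay body starts out as section 6's io::copy one-liner.
    let c2s = futures_lite::io::copy(client.clone(), server.clone());
    let s2c = futures_lite::io::copy(server, client);
    futures_lite::future::try_zip(c2s, s2c).await?;
    Ok(())
}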
Test environment
- OS: Debian 13
- Kernel: 6.12.48+deb13-amd64
- Runtime: smol
- Benchmark: single machine, network namespaces + veth
Why namespaces + veth? The loopback can hide “real networking” behavior. Namespaces/veth keep the test local (repeatable), but with a path closer to real routing.
Inner Prosecutor: “You claim it’s repeatable. Prove your setup.”
Me: “Here’s the lab.”
1. The Lab Setup
Backend server inside ns_server:
sudo ip netns exec ns_server iperf3 -s -p 9001
Client inside ns_client, traffic goes through oi:
sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8
Note: -P 8 matters. A forwarder might look okay under -P 1, then collapse when syscall pressure scales with concurrency.
Forwarder config
oi listens on 10.0.1.1:9000 and connects to 10.0.0.2:9001.
profiling.conf:
10.0.1.1 9000 10.0.0.2 9001
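For context, each config line maps to a listen/target pair. A hypothetical parser for such a line, assuming the format is listen_addr listen_port target_addr target_port (inferred from the example above, not oi’s actual parsing code):

use std::net::SocketAddr;

// Hypothetical: split one "listen_addr listen_port target_addr target_port"
// line into the (listen, target) socket addresses oi forwards between.
fn parse_forward_line(line: &str) -> Option<(SocketAddr, SocketAddr)> {
    let mut fields = line.split_whitespace();
    let listen = format!("{}:{}", fields.next()?, fields.next()?).parse().ok()?;
    let target = format!("{}:{}", fields.next()?, fields.next()?).parse().ok()?;
    Some((listen, target))
}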
2. The Questions
Inner Prosecutor: “Alright. What exactly is the crime?”
Me: “Throughput is lower than expected. The suspects:”
- CPU bound vs I/O bound
- Userspace overhead vs kernel TCP stack
- Syscall-rate wall (too many send/recv calls per byte)
- Async runtime scheduling / wakeups / locks
3. Evidence Tool #1 — perf stat (Macro view)
Command:
sudo perf stat -p $(pidof oi) -e \
cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
-- sleep 33
What I’m looking for
- Context switches exploding → runtime contention or wake storms
- CPU migrations exploding → scheduler instability (bad for repeatability)
- IPC tanking + cache misses skyrocketing → memory/latency issues
- Otherwise: likely kernel networking + syscalls dominate
Output:
Performance counter stats for process id '209785':
113,810,599,893 cpu_atom/cycles/ (0.11%)
164,681,878,450 cpu_core/cycles/ (99.89%)
102,575,167,734 cpu_atom/instructions/ # 0.90 insn per cycle (0.11%)
237,094,207,911 cpu_core/instructions/ # 1.44 insn per cycle (99.89%)
33,093,338 cpu_atom/cache-misses/ (0.11%)
5,381,441 cpu_core/cache-misses/ (99.89%)
20,012,975,873 cpu_atom/branches/ (0.11%)
46,120,077,111 cpu_core/branches/ (99.89%)
211,767,555 cpu_atom/branch-misses/ # 1.06% of all branches (0.11%)
245,969,685 cpu_core/branch-misses/ # 0.53% of all branches (99.89%)
1,686 context-switches
150 cpu-migrations
33.004363800 seconds time elapsed
Interpretation:
Low context switching:
- context-switches: 1,686 over ~33s → ~51 switches/sec
- cpu-migrations: 150 over ~33s → ~4.5/s → very stable CPU placement
CPU is working hard:
- 237,094,207,911 cpu_core instructions
- IPC: 1.44 (instructions per cycle) → not lock-bound or stalling badly
Clean cache, branch metrics:
- cache-misses: ~5.4M on cpu_core (tiny compared to the instruction count)
- branch-misses: 0.53% on cpu_core
Inner Prosecutor: “That’s a vibe-check. Where’s the real culprit?”
Me: “Next tool. This one tells me what kind of pain we’re paying for.”
4. Evidence Tool #2 — strace -c (Syscall composition)
Command:
sudo timeout 30s strace -c -f -p $(pidof oi)
Why strace -c is lethal for forwarders
A userspace TCP forwarder often boils down to:
- recv(...) from one socket
- send(...) to the other socket
If your throughput is low and strace -c shows millions of sendto/recvfrom calls, you’re likely hitting a syscall-per-byte wall.
Output (simplified):
sendto 2,190,751 calls 4.146799s (57.6%)
recvfrom 2,190,763 calls 3.052340s (42.4%)
total syscall time: 7.200789s
Interpretation:
(A) 100% syscall/copy dominated:
Almost all traced time is inside:
- sendto() (TCP send)
- recvfrom() (TCP recv)
(B) syscall rate is massive
Total send+recv calls:
- ~4,381,500 syscalls in ~32s
- → ~68k `sendto` per sec + ~68k `recvfrom` per sec
- → ~137k syscalls/sec total
That’s exactly the pattern of a forwarder doing:
recv -> send -> recv -> send ... with a relatively small buffer.
Inner Prosecutor: “So you’re saying the kernel is being spammed.”
Me: “Exactly. Now I want to know who’s spamming it — my logic, my runtime, or my copy loop.”
5. Evidence Tool #3 — FlameGraph (Where cycles actually go)
Commands:
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
What the flamegraph showed (described, not embedded)
Instead of embedding the graph, here’s the important story the flamegraph told:
- The widest “towers” were kernel TCP send/recv paths:
  - __x64_sys_sendto → tcp_sendmsg_locked → tcp_write_xmit → …
  - __x64_sys_recvfrom → tcp_recvmsg → …
- My userspace frames existed, but they were thin compared to the kernel towers. That means:
  - I’m not burning CPU on complicated Rust logic.
  - I’m paying overhead at the boundary: syscalls, TCP stack, copies.
- In the dwarf flamegraph, the userspace frames pointed to my forwarding implementation: the code path that ultimately calls read/write repeatedly.
Conclusion: This is not “async is slow” in general. This is “my relay loop is forcing too many small kernel transitions.”
6. The Suspect: my forwarding code
Here was the original TCP relay:
// Use smol's copy function to forward data in both directions
let client_to_server = io::copy(client_stream.clone(), server_stream.clone());
let server_to_client = io::copy(server_stream, client_stream);
futures_lite::future::try_zip(client_to_server, server_to_client).await?;
Inner Prosecutor: “Objection! That looks perfectly reasonable.”
Me: “Yes. That’s why it’s dangerous.”
Why this can be slow under high throughput
Generic io::copy helpers often use a relatively small internal buffer (commonly ~8KiB), plus abstraction layers that can increase:
- syscall frequency
- readiness polling
- per-chunk overhead
Small buffers aren’t “wrong”. They’re memory-friendly. But for a forwarder pushing tens of Gbit/s, syscalls per byte becomes the real limiter.
7. The Fix: a manual pump() loop (and a buffer size sweep)
I replaced io::copy with a manual relay loop:
- allocate a buffer once per direction
- read into it
- write it out
- on EOF, propagate half-close with shutdown(Write)
Code (core idea):
// Assumes smol::net::TcpStream with the futures-lite AsyncReadExt /
// AsyncWriteExt extension traits in scope.
async fn pump(mut r: TcpStream, mut w: TcpStream, buf_sz: usize) -> io::Result<u64> {
    // One buffer per direction, allocated once for the life of the connection.
    let mut buf = vec![0u8; buf_sz];
    let mut total = 0u64;
    loop {
        let n = r.read(&mut buf).await?;
        if n == 0 {
            // EOF on the read side: propagate half-close to the peer.
            let _ = w.shutdown(std::net::Shutdown::Write);
            break;
        }
        w.write_all(&buf[..n]).await?;
        total += n as u64;
    }
    Ok(total)
}
And run both directions:
// BUF is the relay buffer size (the sweep below tries 16KiB and 64KiB);
// try_zip here is futures_lite::future::try_zip.
let c2s = pump(client_stream.clone(), server_stream.clone(), BUF);
let s2c = pump(server_stream, client_stream, BUF);
try_zip(c2s, s2c).await?;
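To make the buffer-size sweep easy to rerun, the size wants to be a knob rather than a hard-coded constant. A minimal hypothetical version (the OI_BUF_SIZE name is made up for this post; the real config option may look different):

// Hypothetical knob: read the relay buffer size from the environment,
// falling back to 64KiB. OI_BUF_SIZE is an illustrative name.
fn relay_buf_size() -> usize {
    std::env::var("OI_BUF_SIZE")
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or(64 * 1024)
}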
Inner Prosecutor: “You changed ‘one helper call’ into ‘a loop’. That’s your miracle?”
Me: “Not the loop. The bytes per syscall.”
8. Verification: numbers don’t lie
Same machine, same namespaces/veth, same iperf3 -P 8.
Baseline (generic copy, ~8KiB internal buffer)
Throughput:
17.8 Gbit/s
Pump + 16KiB buffer
Throughput:
28.6 Gbit/s
strace -c showed sendto/recvfrom call count dropped:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
57.80 14.590016 442121 33 epoll_wait
28.84 7.279883 4 1771146 sendto
13.33 3.363882 1 1771212 48 recvfrom
0.02 0.003843 61 62 44 futex
0.01 0.001947 12 159 epoll_ctl
...
------ ----------- ----------- --------- --------- ----------------
100.00 25.242897 7 3542787 143 total
Pump + 64KiB buffer
Throughput:
54.1 Gbit/s (best observed)
perf stat output:
Performance counter stats for process id '893123':
120,859,810,675 cpu_atom/cycles/ (0.15%)
134,735,934,329 cpu_core/cycles/ (99.85%)
79,946,979,880 cpu_atom/instructions/ # 0.66 insn per cycle (0.15%)
127,036,644,759 cpu_core/instructions/ # 0.94 insn per cycle (99.85%)
24,713,474 cpu_atom/cache-misses/ (0.15%)
9,604,449 cpu_core/cache-misses/ (99.85%)
15,584,074,530 cpu_atom/branches/ (0.15%)
24,796,180,117 cpu_core/branches/ (99.85%)
175,778,825 cpu_atom/branch-misses/ # 1.13% of all branches (0.15%)
135,067,353 cpu_core/branch-misses/ # 0.54% of all branches (99.85%)
1,519 context-switches
50 cpu-migrations
33.006529572 seconds time elapsed
strace -c output:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
54.56 18.079500 463576 39 epoll_wait
27.91 9.249443 7 1294854 2 sendto
17.49 5.796927 4 1294919 51 recvfrom
...
------ ----------- ----------- --------- --------- ----------------
100.00 33.135377 12 2590253 158 total
9. “Wait — why is epoll_wait taking most syscall time?”
Inner Prosecutor: “Objection! Your table says epoll_wait dominates time. So epoll is the bottleneck!”
Me: “Nope. That’s a common misread.”
strace -c counts time spent inside syscalls, including time spent blocked.
In async runtimes, it’s normal for one thread to sit in epoll_wait(timeout=...) while other threads do actual send/recv work. That blocking time is charged to epoll_wait, but it’s not “overhead” — it’s waiting.
The real signal is still:
- sendto/recvfrom call counts (millions)
- average microseconds per call
- and whether call count drops when buffer size increases
That’s the syscall-per-byte story.
10. So why did 64KiB cause such a huge jump?
Two reasons:
1) Syscall wall is nonlinear
Throughput is roughly:
Throughput ≈ bytes_per_syscall_pair × syscall_pairs_per_second
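A quick back-of-envelope makes that formula concrete (the 20 Gbit/s target here is illustrative, not one of the measured runs):

// Syscall pairs (recv + send) needed per second for a target throughput.
fn pairs_per_sec(gbit_per_s: f64, bytes_per_pair: f64) -> f64 {
    gbit_per_s * 1e9 / 8.0 / bytes_per_pair
}

fn main() {
    // 20 Gbit/s with 8KiB moved per pair → ~305k pairs/s (~610k syscalls/s).
    println!("{:.0}", pairs_per_sec(20.0, 8.0 * 1024.0));
    // The same 20 Gbit/s with 64KiB per pair → ~38k pairs/s.
    println!("{:.0}", pairs_per_sec(20.0, 64.0 * 1024.0));
}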
If syscall rate is the limiter, increasing bytes per syscall can push you past a threshold where:
- socket buffers stay fuller
- TCP windows are better utilized
- per-stream pacing is smoother
- concurrency (-P 8) stops fighting overhead and starts working in your favor
Once you cross that threshold, throughput can jump until the next ceiling (kernel TCP work, memory bandwidth, or iperf itself).
2) Less “per-chunk” overhead in userspace
A small-buffer copy loop means more iterations, more polls, more bookkeeping. A bigger buffer means:
- fewer loop iterations per GB moved
- fewer wakeups/polls
- fewer syscall transitions per GB
The strace call counts dropped again between 16KiB and 64KiB (sendto went from ~1.77M to ~1.29M calls over a similar window), and throughput nearly doubled.
11. Trade-offs: buffer size is not free
Inner Prosecutor: “Hold it! Bigger buffers mean wasted memory.”
Me: “Correct.”
A forwarder allocates two buffers per connection (one per direction).
So for 64KiB:
- ~128KiB per connection (just for relay buffers)
- plus runtime + socket buffers
That’s fine for “few heavy streams”, but it matters if you handle thousands of concurrent connections.
In practice, the right move is:
- choose a good default (64KiB is common)
- make it configurable
- consider buffer pooling if connection churn is heavy
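On the pooling point, here is a minimal sketch of what a relay-buffer pool could look like (illustrative only; oi does not do this yet):

use std::sync::Mutex;

// Minimal sketch: hand out relay buffers from a shared pool instead of
// allocating two fresh Vecs per connection.
pub struct BufPool {
    bufs: Mutex<Vec<Vec<u8>>>,
    buf_sz: usize,
}

impl BufPool {
    pub fn new(buf_sz: usize) -> Self {
        Self { bufs: Mutex::new(Vec::new()), buf_sz }
    }

    /// Take a buffer from the pool, or allocate one if the pool is empty.
    pub fn get(&self) -> Vec<u8> {
        self.bufs.lock().unwrap().pop().unwrap_or_else(|| vec![0u8; self.buf_sz])
    }

    /// Return a buffer for the next connection; cap the pool so idle memory stays bounded.
    pub fn put(&self, buf: Vec<u8>) {
        let mut bufs = self.bufs.lock().unwrap();
        if bufs.len() < 1024 {
            bufs.push(buf);
        }
    }
}

In that world, pump() would take a pooled Vec<u8> instead of calling vec![0u8; buf_sz] itself, and hand the buffer back when its direction closes.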
12. Closing statement
This was a good reminder that performance work is not guessing — it’s a dialogue with the system:
- Describe the situation
- Ask sharp questions
- Use tools to confirm
- Explain the results using low-level knowledge
- Make one change
- Re-measure
And the funniest part: the “clean” one-liner io::copy was correct, but its defaults were hiding a performance policy I didn’t want.
Inner Prosecutor: “Case closed?”
Me: “For now. Next case: buffer pooling, socket buffer tuning, and maybe a Linux-only splice(2) fast path — carefully, behind a safe wrapper.”