title: Ace Profiling Attorney - The Case of the Missing Gbits
categories: Programming
tags: Profiling, Rust, kernel, networking

Cast

Me: “I rewrote a port forwarder in Rust. It works. It's… not fast enough.”

Inner Prosecutor: “Objection! ‘Not fast enough’ is not evidence. Bring numbers.”

Me: “Fine. We'll do this properly.”


0. The Situation

I'm building a userspace TCP port forwarder in Rust called oxidinetd (the binary is named oi). It accepts a TCP connection, connects to an upstream server, then relays bytes in both directions.

This post is not a “Rust vs C” piece — it's about profiling, forming hypotheses, and turning measurements into speed.

Test environment

  • OS: Debian 13
  • Kernel: 6.12.48+deb13-amd64
  • Runtime: smol
  • Benchmark: single machine, network namespaces + veth

Why namespaces + veth? The loopback can hide “real networking” behavior. Namespaces/veth keep the test local (repeatable), but with a path closer to real routing.


Inner Prosecutor: “You claim it's repeatable. Prove your setup.”

Me: “Here's the lab.”


1. The Lab Setup
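
Namespace plumbing, roughly (a sketch of the topology I'm assuming: oi runs in the default namespace, 10.0.1.1 faces the client, 10.0.0.2 lives inside ns_server; the interface names and the 10.0.1.2 / 10.0.0.1 addresses are illustrative):

# Two namespaces, each connected to the default namespace by a veth pair.
# No ip_forward needed: oi terminates TCP locally and dials out itself.
sudo ip netns add ns_client
sudo ip netns add ns_server

# default ns <-> ns_client (10.0.1.1 is where oi listens)
sudo ip link add veth-c type veth peer name veth-c-ns
sudo ip link set veth-c-ns netns ns_client
sudo ip addr add 10.0.1.1/24 dev veth-c
sudo ip netns exec ns_client ip addr add 10.0.1.2/24 dev veth-c-ns
sudo ip link set veth-c up
sudo ip netns exec ns_client ip link set veth-c-ns up

# default ns <-> ns_server (10.0.0.2 is the iperf3 backend)
sudo ip link add veth-s type veth peer name veth-s-ns
sudo ip link set veth-s-ns netns ns_server
sudo ip addr add 10.0.0.1/24 dev veth-s
sudo ip netns exec ns_server ip addr add 10.0.0.2/24 dev veth-s-ns
sudo ip link set veth-s up
sudo ip netns exec ns_server ip link set veth-s-ns up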

Backend server inside ns_server:

sudo ip netns exec ns_server iperf3 -s -p 9001

Client inside ns_client, traffic goes through oi:

sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8

Note: -P 8 matters. A forwarder might look okay under -P 1, then collapse when syscall pressure scales with concurrency.

Forwarder config

oi listens on 10.0.1.1:9000 and connects to 10.0.0.2:9001.

profiling.conf:

10.0.1.1 9000 10.0.0.2 9001

2. The Questions

Inner Prosecutor: “Alright. What exactly is the crime?”

Me: “Throughput is lower than expected. The suspects:”

  1. CPU bound vs I/O bound
  2. Userspace overhead vs kernel TCP stack
  3. Syscall-rate wall (too many send/recv per byte)
  4. Async runtime scheduling / wakeups / locks

3. Evidence Tool #1 — perf stat (Macro view)

Command:

sudo perf stat -p $(pidof oi) -e \
  cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
  -- sleep 33

What I'm looking for

  • Context switches exploding → runtime contention or wake storms
  • CPU migrations exploding → scheduler instability (bad for repeatability)
  • IPC tanking + cache misses skyrocketing → memory/latency issues
  • Otherwise: likely kernel networking + syscalls dominate

Output:

 Performance counter stats for process id '209785':

   113,810,599,893      cpu_atom/cycles/                                                        (0.11%)
   164,681,878,450      cpu_core/cycles/                                                        (99.89%)
   102,575,167,734      cpu_atom/instructions/           #    0.90  insn per cycle              (0.11%)
   237,094,207,911      cpu_core/instructions/           #    1.44  insn per cycle              (99.89%)
        33,093,338      cpu_atom/cache-misses/                                                  (0.11%)
         5,381,441      cpu_core/cache-misses/                                                  (99.89%)
    20,012,975,873      cpu_atom/branches/                                                      (0.11%)
    46,120,077,111      cpu_core/branches/                                                      (99.89%)
       211,767,555      cpu_atom/branch-misses/          #    1.06% of all branches             (0.11%)
       245,969,685      cpu_core/branch-misses/          #    0.53% of all branches             (99.89%)
             1,686      context-switches
               150      cpu-migrations

      33.004363800 seconds time elapsed

Interpretation:

Low context switching:

- context-switches: 1,686 over ~33s → ~51 switches/sec

- cpu-migrations: 150 over ~33s → ~4.5/s → very stable CPU placement

CPU is working hard:

- 237,094,207,911 cpu_core instructions

- IPC: 1.44 (instructions per cycle) → not lock-bound or stalling badly

Clean cache and branch metrics:

- cache-misses: ~5.4M on cpu_core, ~33M on cpu_atom (tiny compared to the instruction count)

- branch-misses: 0.53% on cpu_core, 1.06% on cpu_atom

Inner Prosecutor: “That's a vibe check. Where's the real culprit?”

Me: “Next tool. This one tells me what kind of pain we're paying for.”


4. Evidence Tool #2 — strace -c (Syscall composition)

Command:

sudo timeout 30s strace -c -f -p $(pidof oi)

Why strace -c is lethal for forwarders

A userspace TCP forwarder often boils down to:

  • recv(...) from one socket
  • send(...) to the other socket

If your throughput is low and strace -c shows millions of sendto/recvfrom calls, you're likely hitting a syscall-per-byte wall.

Output (simplified):

sendto   2,190,751 calls   4.146799s  (57.6%)
recvfrom 2,190,763 calls   3.052340s  (42.4%)
total syscall time:        7.200789s

Interpretation:

(A) 100% syscall/copy dominated:

Almost all traced time is inside:

- sendto() (TCP send)

- recvfrom() (TCP recv)

(B) syscall rate is massive

Total send+recv calls:

- ~4,381,500 syscalls in ~32s
- → ~68k `sendto` per sec + ~68k `recvfrom` per sec
- → ~137k syscalls/sec total

That's exactly the pattern of a forwarder doing:

recv -> send -> recv -> send ... with a relatively small buffer.


Inner Prosecutor: “So you're saying the kernel is being spammed.”

Me: “Exactly. Now I want to know who's spamming it — my logic, my runtime, or my copy loop.”


5. Evidence Tool #3 — FlameGraph (Where cycles actually go)

Commands:

sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
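
(stackcollapse-perf.pl and flamegraph.pl are the scripts from Brendan Gregg's FlameGraph repo, assumed to be on PATH here.)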

What the flamegraph showed (described, not embedded)

Instead of embedding the graph, here's the important story the flamegraph told:

  1. The widest “towers” were kernel TCP send/recv paths:

    • __x64_sys_sendto → tcp_sendmsg_locked → tcp_write_xmit → …
    • __x64_sys_recvfrom → tcp_recvmsg → …
  2. My userspace frames existed, but they were thin compared to the kernel towers. That means:

    • I'm not burning CPU on complicated Rust logic.
    • I'm paying overhead on the boundary: syscalls, TCP stack, copies.
  3. In the dwarf flamegraph, the userspace frames pointed to my forwarding implementation:

    • the code path that ultimately calls read/write repeatedly.

Conclusion: This is not “async is slow” in general. This is “my relay loop is forcing too many small kernel transitions.”

6. The Suspect: my forwarding code

Here was the original TCP relay:

// Use smol's copy function to forward data in both directions
let client_to_server = io::copy(client_stream.clone(), server_stream.clone());
let server_to_client = io::copy(server_stream, client_stream);

futures_lite::future::try_zip(client_to_server, server_to_client).await?;

Inner Prosecutor: “Objection! That looks perfectly reasonable.”

Me: “Yes. That's why it's dangerous.”

Why this can be slow under high throughput

Generic io::copy helpers often use a relatively small internal buffer (commonly ~8KiB), plus abstraction layers that can increase:

  • syscall frequency
  • readiness polling
  • per-chunk overhead
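
To make “abstraction layers” concrete, here's a rough model of what a generic async copy helper boils down to (illustrative only, not the actual smol/futures-lite source):

use futures_lite::io::{AsyncRead, AsyncReadExt, AsyncWrite, AsyncWriteExt};

// Simplified model of a generic copy helper: a fixed ~8 KiB scratch buffer,
// one read and at least one write per iteration.
async fn naive_copy<R, W>(mut reader: R, mut writer: W) -> std::io::Result<u64>
where
    R: AsyncRead + Unpin,
    W: AsyncWrite + Unpin,
{
    let mut buf = [0u8; 8 * 1024];
    let mut total = 0u64;
    loop {
        let n = reader.read(&mut buf).await?;   // roughly one recvfrom() per iteration
        if n == 0 {
            return Ok(total);                   // EOF
        }
        writer.write_all(&buf[..n]).await?;     // roughly one sendto() per iteration
        total += n as u64;
    }
}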

Small buffers aren't “wrong”. They're memory-friendly. But for a forwarder pushing tens of Gbit/s, syscalls per byte becomes the real limiter.
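
Back-of-the-envelope, with illustrative numbers (the factor of 2 counts one recv plus one send per chunk):

syscalls/sec ≈ 2 × throughput / bytes_per_chunk

  8 KiB chunks at 2.5 GB/s (≈ 20 Gbit/s): 2 × 2.5e9 / 8192  ≈ 610k syscalls/s
 64 KiB chunks at 2.5 GB/s (≈ 20 Gbit/s): 2 × 2.5e9 / 65536 ≈  76k syscalls/s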


7. The Fix: a manual pump() loop (and a buffer size sweep)

I replaced io::copy with a manual relay loop:

  • allocate a buffer once per direction
  • read into it
  • write it out
  • on EOF, propagate half-close with shutdown(Write)

Code (core idea):

// Needs the async read/write extension traits in scope,
// e.g. `use futures_lite::prelude::*;` (AsyncReadExt / AsyncWriteExt).
async fn pump(mut r: TcpStream, mut w: TcpStream, buf_sz: usize) -> io::Result<u64> {
    // One buffer per direction, allocated once for the lifetime of the connection.
    let mut buf = vec![0u8; buf_sz];
    let mut total = 0u64;

    loop {
        let n = r.read(&mut buf).await?;
        if n == 0 {
            // EOF: propagate half-close so the peer sees FIN for this direction.
            let _ = w.shutdown(std::net::Shutdown::Write);
            break;
        }
        w.write_all(&buf[..n]).await?;
        total += n as u64;
    }
    Ok(total)
}

And run both directions:

let c2s = pump(client_stream.clone(), server_stream.clone(), BUF);
let s2c = pump(server_stream, client_stream, BUF);
try_zip(c2s, s2c).await?;

Inner Prosecutor: “You changed one helper call into a loop. That's your miracle?”

Me: “Not the loop. The bytes per syscall.”


8. Verification: numbers don't lie

Same machine, same namespaces/veth, same iperf3 -P 8.

Baseline (generic copy, ~8KiB internal buffer)

Throughput:

17.8 Gbit/s

Pump + 16KiB buffer

Throughput:

28.6 Gbit/s

strace -c showed that the sendto/recvfrom call counts dropped:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 57.80   14.590016      442121        33           epoll_wait
 28.84    7.279883           4   1771146           sendto
 13.33    3.363882           1   1771212        48 recvfrom
  0.02    0.003843          61        62        44 futex
  0.01    0.001947          12       159           epoll_ctl
...
------ ----------- ----------- --------- --------- ----------------
100.00   25.242897           7   3542787       143 total

Pump + 64KiB buffer

Throughput:

54.1 Gbit/s (best observed)

perf stat output:

Performance counter stats for process id '893123':

   120,859,810,675      cpu_atom/cycles/                                                        (0.15%)
   134,735,934,329      cpu_core/cycles/                                                        (99.85%)
    79,946,979,880      cpu_atom/instructions/           #    0.66  insn per cycle              (0.15%)
   127,036,644,759      cpu_core/instructions/           #    0.94  insn per cycle              (99.85%)
        24,713,474      cpu_atom/cache-misses/                                                  (0.15%)
         9,604,449      cpu_core/cache-misses/                                                  (99.85%)
    15,584,074,530      cpu_atom/branches/                                                      (0.15%)
    24,796,180,117      cpu_core/branches/                                                      (99.85%)
       175,778,825      cpu_atom/branch-misses/          #    1.13% of all branches             (0.15%)
       135,067,353      cpu_core/branch-misses/          #    0.54% of all branches             (99.85%)
             1,519      context-switches
                50      cpu-migrations

      33.006529572 seconds time elapsed

strace -c output:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 54.56   18.079500      463576        39           epoll_wait
 27.91    9.249443           7   1294854         2 sendto
 17.49    5.796927           4   1294919        51 recvfrom
...
------ ----------- ----------- --------- --------- ----------------
100.00   33.135377          12   2590253       158 total

9. “Wait — why is epoll_wait taking most syscall time?”

Inner Prosecutor: “Objection! Your table says epoll_wait dominates time. So epoll is the bottleneck!”

Me: “Nope. That's a common misread.”

strace -c counts time spent inside syscalls, including time spent blocked.

In async runtimes, it's normal for one thread to sit in epoll_wait(timeout=...) while other threads do actual send/recv work. That blocking time is charged to epoll_wait, but it's not “overhead” — it's waiting.

The real signal is still:

  • sendto/recvfrom call counts (millions)
  • average microseconds per call
  • and whether call count drops when buffer size increases

That's the syscall-per-byte story.
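
Side note: if you want epoll_wait out of the summary entirely, a reasonably recent strace can restrict the tally to network syscalls:

sudo timeout 30s strace -c -f -e trace=%network -p $(pidof oi)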


10. So why did 64KiB cause such a huge jump?

Two reasons:

1) Syscall wall is nonlinear

Throughput is roughly:

Throughput ≈ bytes_per_syscall_pair × syscall_pairs_per_second

If syscall rate is the limiter, increasing bytes per syscall can push you past a threshold where:

  • socket buffers stay fuller
  • TCP windows are better utilized
  • per-stream pacing is smoother
  • concurrency (-P 8) stops fighting overhead and starts working in your favor

Once you cross that threshold, throughput can jump until the next ceiling (kernel TCP work, memory bandwidth, or iperf itself).

2) Less “per-chunk” overhead in userspace

A small-buffer copy loop means more iterations, more polls, more bookkeeping. A bigger buffer means:

  • fewer loop iterations per GB moved
  • fewer wakeups/polls
  • fewer syscall transitions per GB

The strace call counts dropped significantly between 16KiB and 64KiB (sendto+recvfrom: ~3.54M vs ~2.59M calls), and throughput nearly doubled.


11. Trade-offs: buffer size is not free

Inner Prosecutor: “Hold it! Bigger buffers mean wasted memory.”

Me: “Correct.”

A forwarder allocates two buffers per connection (one per direction).

So for 64KiB:

  • ~128KiB per connection (just for relay buffers)
  • plus runtime + socket buffers

That's fine for “few heavy streams”, but it matters if you handle thousands of concurrent connections: 10,000 connections × 128KiB is already ~1.2GiB of relay buffers alone.

In practice, the right move is:

  • choose a good default (64KiB is common)
  • make it configurable
  • consider buffer pooling if connection churn is heavy
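
For the pooling idea, a minimal sketch of what I mean (not oi's current code; the BufPool type and the cap of 1,024 pooled buffers are made up for illustration):

use std::sync::Mutex;

// A tiny free-list of relay buffers: reuse them across connections instead of
// allocating a fresh 64KiB Vec on every accept().
struct BufPool {
    free: Mutex<Vec<Vec<u8>>>,
    buf_sz: usize,
}

impl BufPool {
    fn new(buf_sz: usize) -> Self {
        Self { free: Mutex::new(Vec::new()), buf_sz }
    }

    fn get(&self) -> Vec<u8> {
        self.free
            .lock()
            .unwrap()
            .pop()
            .unwrap_or_else(|| vec![0u8; self.buf_sz])
    }

    fn put(&self, buf: Vec<u8>) {
        let mut free = self.free.lock().unwrap();
        // Cap the pool so idle memory stays bounded.
        if free.len() < 1024 {
            free.push(buf);
        }
    }
}

Reused buffers come back with stale bytes in them, which is fine for a relay: pump() only ever writes the &buf[..n] it just read.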

12. Closing statement

This was a good reminder that performance work is not guessing — it's a dialogue with the system:

  1. Describe the situation
  2. Ask sharp questions
  3. Use tools to confirm
  4. Explain the results using low-level knowledge
  5. Make one change
  6. Re-measure

And the funniest part: the “clean” one-liner io::copy was correct, but its defaults were hiding a performance policy I didn't want.

Inner Prosecutor: “Case closed?”

Me: “For now. Next case: buffer pooling, socket buffer tuning, and maybe a Linux-only splice(2) fast path — carefully, behind a safe wrapper.”