title: Ace Profiling Attorney - The Case of the Missing Gbits
categories: Programming
tags: Profiling, Rust, kernel, networking

Cast

Me: “I rewrote a port forwarder in Rust. It works. It's… not fast enough.”

Inner Prosecutor: “Objection! ‘Not fast enough’ is not evidence. Bring numbers.”

Me: “Fine. We'll do this properly.”


0. The Situation

I'm building a userspace TCP port forwarder in Rust called oxidinetd (the binary is named oi). It accepts a TCP connection, connects to an upstream server, then relays bytes in both directions.

This post is not a “Rust vs C” piece — it's about profiling, forming hypotheses, and turning measurements into speed.

Test environment

  • OS: Debian 13
  • Kernel: 6.12.48+deb13-amd64
  • Runtime: smol
  • Benchmark: single machine, network namespaces + veth

Why namespaces + veth? The loopback can hide “real networking” behavior. Namespaces/veth keep the test local (repeatable), but with a path closer to real routing.


Inner Prosecutor: “You claim it's repeatable. Prove your setup.”

Me: “Here's the lab.”


1. The Lab Setup
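
Namespace plumbing, roughly (a sketch of the topology I'm assuming: oi runs in the default namespace, 10.0.1.1 faces the client, 10.0.0.2 lives inside ns_server; the interface names and the 10.0.1.2 / 10.0.0.1 addresses are illustrative):

# Two namespaces, each connected to the default namespace by a veth pair.
# No ip_forward needed: oi terminates TCP locally and dials out itself.
sudo ip netns add ns_client
sudo ip netns add ns_server

# default ns <-> ns_client (10.0.1.1 is where oi listens)
sudo ip link add veth-c type veth peer name veth-c-ns
sudo ip link set veth-c-ns netns ns_client
sudo ip addr add 10.0.1.1/24 dev veth-c
sudo ip netns exec ns_client ip addr add 10.0.1.2/24 dev veth-c-ns
sudo ip link set veth-c up
sudo ip netns exec ns_client ip link set veth-c-ns up

# default ns <-> ns_server (10.0.0.2 is the iperf3 backend)
sudo ip link add veth-s type veth peer name veth-s-ns
sudo ip link set veth-s-ns netns ns_server
sudo ip addr add 10.0.0.1/24 dev veth-s
sudo ip netns exec ns_server ip addr add 10.0.0.2/24 dev veth-s-ns
sudo ip link set veth-s up
sudo ip netns exec ns_server ip link set veth-s-ns up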

Backend server inside ns_server:

sudo ip netns exec ns_server iperf3 -s -p 9001

Client inside ns_client, traffic goes through oi:

sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8

Note: -P 8 matters. A forwarder might look okay under -P 1, then collapse when syscall pressure scales with concurrency.

Forwarder config

oi listens on 10.0.1.1:9000 and connects to 10.0.0.2:9001.

profiling.conf:

10.0.1.1 9000 10.0.0.2 9001

2. The Questions

Inner Prosecutor: “Alright. What exactly is the crime?”

Me: “Throughput is lower than expected. The suspects:”

  1. CPU bound vs I/O bound
  2. Userspace overhead vs kernel TCP stack
  3. Syscall-rate wall (too many send/recv per byte)
  4. Async runtime scheduling / wakeups / locks

3. Evidence Tool #1 — perf stat (Macro view)

Command:

sudo perf stat -p $(pidof oi) -e \
  cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
  -- sleep 33

What I'm looking for

  • Context switches exploding → runtime contention or wake storms
  • CPU migrations exploding → scheduler instability (bad for repeatability)
  • IPC tanking + cache misses skyrocketing → memory/latency issues
  • Otherwise: likely kernel networking + syscalls dominate

Output:

 Performance counter stats for process id '209785':

   113,810,599,893      cpu_atom/cycles/                                                        (0.11%)
   164,681,878,450      cpu_core/cycles/                                                        (99.89%)
   102,575,167,734      cpu_atom/instructions/           #    0.90  insn per cycle              (0.11%)
   237,094,207,911      cpu_core/instructions/           #    1.44  insn per cycle              (99.89%)
        33,093,338      cpu_atom/cache-misses/                                                  (0.11%)
         5,381,441      cpu_core/cache-misses/                                                  (99.89%)
    20,012,975,873      cpu_atom/branches/                                                      (0.11%)
    46,120,077,111      cpu_core/branches/                                                      (99.89%)
       211,767,555      cpu_atom/branch-misses/          #    1.06% of all branches             (0.11%)
       245,969,685      cpu_core/branch-misses/          #    0.53% of all branches             (99.89%)
             1,686      context-switches
               150      cpu-migrations

      33.004363800 seconds time elapsed

Interpretation:

Low context switching:

- context-switches: 1,686 over ~33s → ~51 switches/sec

- cpu-migrations: 150 over ~33s → ~4.5/s → very stable CPU placement

CPU is working hard:

- 237,094,207,911 cpu_core instructions

- IPC: 1.44 (instructions per cycle) → not lock-bound or stalling badly

Clean cache and branch metrics:

- cache-misses: ~5.4M on cpu_core, ~33M on cpu_atom (tiny compared to the instruction count)

- branch-misses: 0.53% on cpu_core, 1.06% on cpu_atom

Inner Prosecutor: “That's a vibe check. Where's the real culprit?”

Me: “Next tool. This one tells me what kind of pain we're paying for.”


4. Evidence Tool #2 — strace -c (Syscall composition)

Command:

sudo timeout 30s strace -c -f -p $(pidof oi)

Why strace -c is lethal for forwarders

A userspace TCP forwarder often boils down to:

  • recv(...) from one socket
  • send(...) to the other socket

If your throughput is low and strace -c shows millions of sendto/recvfrom calls, you're likely hitting a syscall-per-byte wall.

Output (simplified):

sendto   2,190,751 calls   4.146799s  (57.6%)
recvfrom 2,190,763 calls   3.052340s  (42.4%)
total syscall time:        7.200789s

Interpretation:

(A) 100% syscall/copy dominated:

Almost all traced time is inside:

- sendto() (TCP send)

- recvfrom() (TCP recv)

(B) syscall rate is massive

Total send+recv calls:

- ~4,381,500 syscalls in ~32s
- → ~68k `sendto` per sec + ~68k `recvfrom` per sec
- → ~137k syscalls/sec total

That's exactly the pattern of a forwarder doing:

recv -> send -> recv -> send ... with a relatively small buffer.


Inner Prosecutor: “So you're saying the kernel is being spammed.”

Me: “Exactly. Now I want to know who's spamming it — my logic, my runtime, or my copy loop.”


5. Evidence Tool #3 — FlameGraph (Where cycles actually go)

Commands:

sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
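
(stackcollapse-perf.pl and flamegraph.pl are the scripts from Brendan Gregg's FlameGraph repo, assumed to be on PATH here.)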

What the flamegraph showed (described, not embedded)

Instead of embedding the graph, here's the important story the flamegraph told:

  1. The widest “towers” were kernel TCP send/recv paths:

    • __x64_sys_sendto → tcp_sendmsg_locked → tcp_write_xmit → …
    • __x64_sys_recvfrom → tcp_recvmsg → …
  2. My userspace frames existed, but they were thin compared to the kernel towers. That means:

    • I'm not burning CPU on complicated Rust logic.
    • I'm paying overhead on the boundary: syscalls, TCP stack, copies.
  3. In the dwarf flamegraph, the userspace frames pointed to my forwarding implementation:

    • the code path that ultimately calls read/write repeatedly.

Conclusion: This is not “async is slow” in general. This is “my relay loop is forcing too many small kernel transitions.”

6. The Suspect: my forwarding code

Here was the original TCP relay:

// Use smol's copy function to forward data in both directions
let client_to_server = io::copy(client_stream.clone(), server_stream.clone());
let server_to_client = io::copy(server_stream, client_stream);

futures_lite::future::try_zip(client_to_server, server_to_client).await?;

Inner Prosecutor: “Objection! That looks perfectly reasonable.”

Me: “Yes. That's why it's dangerous.”

Why this can be slow under high throughput

Generic io::copy helpers often use a relatively small internal buffer (commonly ~8KiB), plus abstraction layers that can increase:

  • syscall frequency
  • readiness polling
  • per-chunk overhead
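
To make “abstraction layers” concrete, here's a rough model of what a generic async copy helper boils down to (illustrative only, not the actual smol/futures-lite source):

use futures_lite::io::{AsyncRead, AsyncReadExt, AsyncWrite, AsyncWriteExt};

// Simplified model of a generic copy helper: a fixed ~8 KiB scratch buffer,
// one read and at least one write per iteration.
async fn naive_copy<R, W>(mut reader: R, mut writer: W) -> std::io::Result<u64>
where
    R: AsyncRead + Unpin,
    W: AsyncWrite + Unpin,
{
    let mut buf = [0u8; 8 * 1024];
    let mut total = 0u64;
    loop {
        let n = reader.read(&mut buf).await?;   // roughly one recvfrom() per iteration
        if n == 0 {
            return Ok(total);                   // EOF
        }
        writer.write_all(&buf[..n]).await?;     // roughly one sendto() per iteration
        total += n as u64;
    }
}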

Small buffers aren't “wrong”. They're memory-friendly. But for a forwarder pushing tens of Gbit/s, syscalls per byte becomes the real limiter.
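
Back-of-the-envelope, with illustrative numbers (the factor of 2 counts one recv plus one send per chunk):

syscalls/sec ≈ 2 × throughput / bytes_per_chunk

  8 KiB chunks at 2.5 GB/s (≈ 20 Gbit/s): 2 × 2.5e9 / 8192  ≈ 610k syscalls/s
 64 KiB chunks at 2.5 GB/s (≈ 20 Gbit/s): 2 × 2.5e9 / 65536 ≈  76k syscalls/s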


7. The Fix: a manual pump() loop (and a buffer size sweep)

I replaced io::copy with a manual relay loop:

  • allocate a buffer once per direction
  • read into it
  • write it out
  • on EOF, propagate half-close with shutdown(Write)

Code (core idea):

// Needs the async read/write extension traits in scope,
// e.g. `use futures_lite::prelude::*;` (AsyncReadExt / AsyncWriteExt).
async fn pump(mut r: TcpStream, mut w: TcpStream, buf_sz: usize) -> io::Result<u64> {
    // One buffer per direction, allocated once for the lifetime of the connection.
    let mut buf = vec![0u8; buf_sz];
    let mut total = 0u64;

    loop {
        let n = r.read(&mut buf).await?;
        if n == 0 {
            // EOF: propagate half-close so the peer sees FIN for this direction.
            let _ = w.shutdown(std::net::Shutdown::Write);
            break;
        }
        w.write_all(&buf[..n]).await?;
        total += n as u64;
    }
    Ok(total)
}

And run both directions:

let c2s = pump(client_stream.clone(), server_stream.clone(), BUF);
let s2c = pump(server_stream, client_stream, BUF);
try_zip(c2s, s2c).await?;

Inner Prosecutor: “You changed one helper call into a loop. That's your miracle?”

Me: “Not the loop. The bytes per syscall.”


8. Verification: numbers don't lie

Same machine, same namespaces/veth, same iperf3 -P 8.

Baseline (generic copy, ~8KiB internal buffer)

Throughput:

17.8 Gbit/s

Pump + 16KiB buffer

Throughput:

28.6 Gbit/s

strace -c showed that the sendto/recvfrom call counts dropped:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 57.80   14.590016      442121        33           epoll_wait
 28.84    7.279883           4   1771146           sendto
 13.33    3.363882           1   1771212        48 recvfrom
  0.02    0.003843          61        62        44 futex
  0.01    0.001947          12       159           epoll_ctl
...
------ ----------- ----------- --------- --------- ----------------
100.00   25.242897           7   3542787       143 total

Pump + 64KiB buffer

Throughput:

54.1 Gbit/s (best observed)

perf stat output:

Performance counter stats for process id '893123':

   120,859,810,675      cpu_atom/cycles/                                                        (0.15%)
   134,735,934,329      cpu_core/cycles/                                                        (99.85%)
    79,946,979,880      cpu_atom/instructions/           #    0.66  insn per cycle              (0.15%)
   127,036,644,759      cpu_core/instructions/           #    0.94  insn per cycle              (99.85%)
        24,713,474      cpu_atom/cache-misses/                                                  (0.15%)
         9,604,449      cpu_core/cache-misses/                                                  (99.85%)
    15,584,074,530      cpu_atom/branches/                                                      (0.15%)
    24,796,180,117      cpu_core/branches/                                                      (99.85%)
       175,778,825      cpu_atom/branch-misses/          #    1.13% of all branches             (0.15%)
       135,067,353      cpu_core/branch-misses/          #    0.54% of all branches             (99.85%)
             1,519      context-switches
                50      cpu-migrations

      33.006529572 seconds time elapsed

strace -c output:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 54.56   18.079500      463576        39           epoll_wait
 27.91    9.249443           7   1294854         2 sendto
 17.49    5.796927           4   1294919        51 recvfrom
...
------ ----------- ----------- --------- --------- ----------------
100.00   33.135377          12   2590253       158 total

9. “Wait — why is epoll_wait taking most syscall time?”

Inner Prosecutor: “Objection! Your table says epoll_wait dominates time. So epoll is the bottleneck!”

Me: “Nope. That's a common misread.”

strace -c counts time spent inside syscalls, including time spent blocked.

In async runtimes, it's normal for one thread to sit in epoll_wait(timeout=...) while other threads do actual send/recv work. That blocking time is charged to epoll_wait, but it's not “overhead” — it's waiting.

The real signal is still:

  • sendto/recvfrom call counts (millions)
  • average microseconds per call
  • and whether call count drops when buffer size increases

That's the syscall-per-byte story.
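
Side note: if you want epoll_wait out of the summary entirely, a reasonably recent strace can restrict the tally to network syscalls:

sudo timeout 30s strace -c -f -e trace=%network -p $(pidof oi)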


10. So why did 64KiB cause such a huge jump?

Two reasons:

1) Syscall wall is nonlinear

Throughput is roughly:

Throughput ≈ bytes_per_syscall_pair × syscall_pairs_per_second

If syscall rate is the limiter, increasing bytes per syscall can push you past a threshold where:

  • socket buffers stay fuller
  • TCP windows are better utilized
  • per-stream pacing is smoother
  • concurrency (-P 8) stops fighting overhead and starts working in your favor

Once you cross that threshold, throughput can jump until the next ceiling (kernel TCP work, memory bandwidth, or iperf itself).

2) Less “per-chunk” overhead in userspace

A small-buffer copy loop means more iterations, more polls, more bookkeeping. A bigger buffer means:

  • fewer loop iterations per GB moved
  • fewer wakeups/polls
  • fewer syscall transitions per GB

The strace call counts dropped significantly between 16KiB and 64KiB (sendto+recvfrom: ~3.54M vs ~2.59M calls), and throughput nearly doubled.


11. Trade-offs: buffer size is not free

Inner Prosecutor: “Hold it! Bigger buffers mean wasted memory.”

Me: “Correct.”

A forwarder allocates two buffers per connection (one per direction).

So for 64KiB:

  • ~128KiB per connection (just for relay buffers)
  • plus runtime + socket buffers

That's fine for “few heavy streams”, but it matters if you handle thousands of concurrent connections: 10,000 connections × 128KiB is already ~1.2GiB of relay buffers alone.

In practice, the right move is:

  • choose a good default (64KiB is common)
  • make it configurable
  • consider buffer pooling if connection churn is heavy
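
For the pooling idea, a minimal sketch of what I mean (not oi's current code; the BufPool type and the cap of 1,024 pooled buffers are made up for illustration):

use std::sync::Mutex;

// A tiny free-list of relay buffers: reuse them across connections instead of
// allocating a fresh 64KiB Vec on every accept().
struct BufPool {
    free: Mutex<Vec<Vec<u8>>>,
    buf_sz: usize,
}

impl BufPool {
    fn new(buf_sz: usize) -> Self {
        Self { free: Mutex::new(Vec::new()), buf_sz }
    }

    fn get(&self) -> Vec<u8> {
        self.free
            .lock()
            .unwrap()
            .pop()
            .unwrap_or_else(|| vec![0u8; self.buf_sz])
    }

    fn put(&self, buf: Vec<u8>) {
        let mut free = self.free.lock().unwrap();
        // Cap the pool so idle memory stays bounded.
        if free.len() < 1024 {
            free.push(buf);
        }
    }
}

Reused buffers come back with stale bytes in them, which is fine for a relay: pump() only ever writes the &buf[..n] it just read.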

12. Closing statement

This was a good reminder that performance work is not guessing — it's a dialogue with the system:

  1. Describe the situation
  2. Ask sharp questions
  3. Use tools to confirm
  4. Explain the results using low-level knowledge
  5. Make one change
  6. Re-measure

And the funniest part: the “clean” one-liner io::copy was correct, but its defaults were hiding a performance policy I didn't want.

Inner Prosecutor: “Case closed?”

Me: “For now. Next case: buffer pooling, socket buffer tuning, and maybe a Linux-only splice(2) fast path — carefully, behind a safe wrapper.”