Ace Profiling Attorney - The Case of the Missing Gbits
Disclaimer: This is not a language-war post. No “X vs Y”.
This is a profiling detective story about my Rust TCP forwarder, oi.
0) Prologue — The Courthouse Lobby
Me: I wrote a Rust TCP port forwarder. It works. It forwards.
Inner Prosecutor (Phoenix voice): Hold it! “Works” is not a metric. How fast?
Me: Not fast enough under load.
Inner Prosecutor: Objection! “Not fast enough” is an emotion. Bring evidence.
Me: Fine. I’ll bring perf, strace, and a flamegraph.
Inner Prosecutor: Good. This court accepts only facts.
1) The Crime Scene — Setup & Reproduction
Me: Single machine, Debian 13. No WAN noise, no tunnel bottlenecks.
Inner Prosecutor: Hold it! If it’s “single machine”, how do you avoid loopback cheating?
Me: Network namespaces + veth. Local, repeatable, closer to real networking.
Environment
- Debian 13
- Kernel: 6.12.48+deb13-amd64
- Runtime: smol
- Test topology: ns_client → oi (root ns) → ns_server via veth
Reproduction commands
Exhibit A: Start backend server in ns_server
sudo ip netns exec ns_server iperf3 -s -p 9001
Exhibit B: Run client in ns_client through forwarder
sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8
Inner Prosecutor: Hold it! Why -P 8?
Me: Because a forwarder can look fine in -P 1 and fall apart when syscall pressure scales.
Inner Prosecutor: …Acceptable.
2) The Suspects — What Could Be Limiting Throughput?
Me: Four suspects.
- CPU bound (pure compute wall)
- Kernel TCP stack bound (send/recv path, skb, softirq, netfilter/conntrack)
- Syscall-rate wall (too many sendto/recvfrom per byte)
- Runtime scheduling / contention (wake storms, locks, futex)
Inner Prosecutor: Objection! That’s too broad. Narrow it down.
Me: That’s what the tools are for.
3) Evidence #1 — perf stat (The Macro View)
Me: First I ask: are we burning CPU, thrashing schedulers, or stalling on memory?
Command:
sudo perf stat -p $(pidof oi) -e \
cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
-- sleep 33
What I’m looking for:
- Huge context-switches → runtime thrash / lock contention
- Huge cpu-migrations → unstable scheduling
- Very low IPC + huge cache misses → memory stalls
- Otherwise: likely syscall/kernel path
Output:
Performance counter stats for process id '209785':
113,810,599,893 cpu_atom/cycles/ (0.11%)
164,681,878,450 cpu_core/cycles/ (99.89%)
102,575,167,734 cpu_atom/instructions/ # 0.90 insn per cycle (0.11%)
237,094,207,911 cpu_core/instructions/ # 1.44 insn per cycle (99.89%)
33,093,338 cpu_atom/cache-misses/ (0.11%)
5,381,441 cpu_core/cache-misses/ (99.89%)
20,012,975,873 cpu_atom/branches/ (0.11%)
46,120,077,111 cpu_core/branches/ (99.89%)
211,767,555 cpu_atom/branch-misses/ # 1.06% of all branches (0.11%)
245,969,685 cpu_core/branch-misses/ # 0.53% of all branches (99.89%)
1,686 context-switches
150 cpu-migrations
33.004363800 seconds time elapsed
Low context switching:
- context-switches: 1,686 over ~33s → ~51 switches/sec
- cpu-migrations: 150 over ~33s → ~4.5/s → very stable CPU placement
CPU is working hard:
- 237,094,207,911 cpu_core instructions
- IPC: 1.44 (instructions per cycle) → not lock-bound and not stalling badly
Clean cache and branch metrics:
- cache-misses: ~5.4M on cpu_core (tiny compared to the instruction count)
- branch-misses: 0.53% on cpu_core
Inner Prosecutor: Hold it! These counters don’t name a culprit.
Me: Patience. The next exhibit makes the culprit confess.
4) Evidence #2 — strace -c (The Confession: Syscall Composition)
Me: Next: “What syscalls are we paying for?”
Command:
sudo timeout 30s strace -c -f -p $(pidof oi)
What I expect if this is a forwarding wall:
- sendto and recvfrom dominate
- call counts in the millions
Output (simplified):
sendto 2,190,751 calls 4.146799s (57.6%)
recvfrom 2,190,763 calls 3.052340s (42.4%)
total syscall time: 7.200789s
(A) 100% syscall/copy dominated:
- Almost all traced time is inside:
  - sendto() (TCP send)
  - recvfrom() (TCP recv)
(B) The syscall rate is massive:
- Total send+recv calls: ~4,381,500 syscalls in ~32s
- → ~68k sendto/sec + ~68k recvfrom/sec
- → ~137k syscalls/sec total
Inner Prosecutor: Objection! Syscalls alone don’t prove the bottleneck.
Me: True. So I brought a witness.
5) Evidence #3 — FlameGraph (The Witness)
Me: The flamegraph doesn’t lie. It testifies where cycles go.
Commands:
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
What the flamegraph showed (described, not embedded):
- The widest towers were kernel TCP send/recv paths:
  - __x64_sys_sendto → tcp_sendmsg_locked → tcp_write_xmit → ...
  - __x64_sys_recvfrom → tcp_recvmsg → ...
- My userspace frames existed, but were comparatively thin.
- The call chain still pointed into my forwarding implementation.
Inner Prosecutor: Hold it! So you’re saying… the kernel is doing the heavy lifting?
Me: Exactly. Which means my job is to stop annoying the kernel with too many tiny operations.
6) The Real Culprit — A “Perfectly Reasonable” Copy Loop
Me: Here’s the original relay code. Looks clean, right?
// futures_lite::io::copy takes the writer by &mut, so each write half gets its own binding.
let mut server_writer = server_stream.clone();
let mut client_writer = client_stream.clone();
let client_to_server = io::copy(client_stream.clone(), &mut server_writer);
let server_to_client = io::copy(server_stream, &mut client_writer);
futures_lite::future::try_zip(client_to_server, server_to_client).await?;
Inner Prosecutor: Objection! This is idiomatic and correct.
Me: Yes. That’s why it’s dangerous.
Key detail: futures_lite::io::copy uses a small internal buffer (~8KiB in practice).
Small buffer → more iterations → more syscalls → more overhead.
At 8KiB per chunk, relaying 1GiB costs at least ~131,000 read/write pairs; at 64KiB it costs ~16,000.
If a forwarder is syscall-rate bound, this becomes a ceiling.
7) The First Breakthrough — Replace io::copy with pump()
Me: I wrote a manual pump loop:
- allocate a buffer once
- read() into it
- write_all() out
- on EOF: shutdown(Write) to propagate half-close
// Assumes smol::net::TcpStream and futures_lite's AsyncReadExt / AsyncWriteExt
// extension traits (e.g. via `use futures_lite::prelude::*;`) are in scope.
async fn pump(mut r: TcpStream, mut w: TcpStream, buf_sz: usize) -> io::Result<u64> {
let mut buf = vec![0u8; buf_sz];
let mut total = 0u64;
loop {
let n = r.read(&mut buf).await?;
if n == 0 {
let _ = w.shutdown(std::net::Shutdown::Write);
break;
}
w.write_all(&buf[..n]).await?;
total += n as u64;
}
Ok(total)
}
Run both directions:
let c2s = pump(client_stream.clone(), server_stream.clone(), BUF);
let s2c = pump(server_stream, client_stream, BUF);
try_zip(c2s, s2c).await?;
Inner Prosecutor: Hold it! That’s just a loop. How does that win?
Me: Not the loop. The bytes per syscall.
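For context, here is a minimal sketch of how the two pumps could be wired into the forwarder’s accept path. It assumes smol / async-net; run, handle_conn, BUF, and the error logging are illustrative, not the actual oi code.
use futures_lite::future::try_zip;
use smol::net::{TcpListener, TcpStream};
use std::io;

const BUF: usize = 64 * 1024; // relay buffer size per direction

async fn handle_conn(client_stream: TcpStream, backend: &str) -> io::Result<()> {
    // One backend connection per accepted client, then pump both directions.
    let server_stream = TcpStream::connect(backend).await?;
    let c2s = pump(client_stream.clone(), server_stream.clone(), BUF);
    let s2c = pump(server_stream, client_stream, BUF);
    try_zip(c2s, s2c).await?;
    Ok(())
}

async fn run(listen_addr: &str, backend: &'static str) -> io::Result<()> {
    let listener = TcpListener::bind(listen_addr).await?;
    loop {
        let (client, _peer) = listener.accept().await?;
        // One task per connection; a relay error kills that connection only.
        smol::spawn(async move {
            if let Err(e) = handle_conn(client, backend).await {
                eprintln!("relay error: {e}");
            }
        })
        .detach();
    }
}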
8) Exhibit C — The Numbers (8KiB → 16KiB → 64KiB)
Baseline: ~8KiB (generic copy helper)
Throughput:
17.8 Gbit/s
Inner Prosecutor: Objection! That’s your “crime scene” number?
Me: Yes. Now watch what happens when the kernel stops getting spammed.
Pump + 16KiB buffer
Throughput:
28.6 Gbit/s
strace -c showed sendto/recvfrom call count dropped:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
57.80 14.590016 442121 33 epoll_wait
28.84 7.279883 4 1771146 sendto
13.33 3.363882 1 1771212 48 recvfrom
0.02 0.003843 61 62 44 futex
0.01 0.001947 12 159 epoll_ctl
...
------ ----------- ----------- --------- --------- ----------------
100.00 25.242897 7 3542787 143 total
Inner Prosecutor: Hold it! That’s already big. But you claim there’s more?
Me: Oh, there’s more.
Pump + 64KiB buffer
Throughput:
54.1 Gbit/s (best observed)
perf stat output:
Performance counter stats for process id '893123':
120,859,810,675 cpu_atom/cycles/ (0.15%)
134,735,934,329 cpu_core/cycles/ (99.85%)
79,946,979,880 cpu_atom/instructions/ # 0.66 insn per cycle (0.15%)
127,036,644,759 cpu_core/instructions/ # 0.94 insn per cycle (99.85%)
24,713,474 cpu_atom/cache-misses/ (0.15%)
9,604,449 cpu_core/cache-misses/ (99.85%)
15,584,074,530 cpu_atom/branches/ (0.15%)
24,796,180,117 cpu_core/branches/ (99.85%)
175,778,825 cpu_atom/branch-misses/ # 1.13% of all branches (0.15%)
135,067,353 cpu_core/branch-misses/ # 0.54% of all branches (99.85%)
1,519 context-switches
50 cpu-migrations
33.006529572 seconds time elapsed
strace -c output:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
54.56 18.079500 463576 39 epoll_wait
27.91 9.249443 7 1294854 2 sendto
17.49 5.796927 4 1294919 51 recvfrom
...
------ ----------- ----------- --------- --------- ----------------
100.00 33.135377 12 2590253 158 total
Inner Prosecutor: OBJECTION! epoll_wait is eating the time. That’s the bottleneck!
Me: Nice try. That’s a classic trap.
9) Cross-Examination — The epoll_wait Trap
Me: strace -c measures time spent inside syscalls, including time spent blocked.
In async runtimes:
- One thread can sit in epoll_wait(timeout=...)
- Other threads do sendto/recvfrom
- strace charges the blocking time to epoll_wait
So epoll_wait dominating does not mean “epoll is slow”.
It often means “one thread is waiting while others work”.
What matters here:
- sendto/recvfrom call counts
- and how they change with buffer size
10) Final Explanation — Why 64KiB Causes a “Nonlinear” Jump
Inner Prosecutor: Hold it! You only cut the sendto/recvfrom call counts by roughly 40%. How do you nearly triple throughput?
Me: Because syscall walls are nonlinear.
A forwarder’s throughput is approximately:
Throughput ≈ bytes_per_syscall_pair × syscall_pairs_per_second
If you’re syscall-rate limited, increasing bytes_per_syscall_pair pushes you past a threshold where:
- socket buffers stay fuller
- the TCP window is better utilized
- each stream spends less time in per-chunk bookkeeping
- concurrency (-P 8) stops fighting overhead and starts helping
Once you cross that threshold, throughput can jump until the next ceiling (kernel TCP, memory bandwidth, iperf itself).
That’s why a “small” change can create a big effect.
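To make the formula concrete, here is a tiny back-of-the-envelope sketch in Rust. The 100,000 pairs-per-second budget is a made-up number for illustration, not a measurement from the runs above.
// Illustration only: assumes a fixed, hypothetical syscall-pair budget.
fn main() {
    let pairs_per_sec: u64 = 100_000;
    for buf_kib in [8u64, 16, 64] {
        let bytes_per_pair = buf_kib * 1024;
        let gbit_per_sec = (bytes_per_pair * pairs_per_sec * 8) as f64 / 1e9;
        println!("{buf_kib:>2} KiB per pair -> ~{gbit_per_sec:.1} Gbit/s ceiling");
    }
}
With a fixed pair budget the ceiling scales with bytes per pair (≈6.6, 13.1, and 52.4 Gbit/s here); in the real runs the pair rate also shifted, which is why the measured numbers don’t line up exactly.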
11) Trade-offs: buffer size is not free
Inner Prosecutor: Objection! Bigger buffers waste memory!
Me: Sustained.
A forwarder allocates two buffers per connection (one per direction).
So for 64KiB:
- ~128KiB per connection (just for relay buffers)
- plus runtime + socket buffers
That’s fine for “few heavy streams”, but it matters if you handle thousands of concurrent connections.
In practice, the right move is:
- choose a good default (64KiB is common)
- make it configurable
- consider buffer pooling if connection churn is heavy
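If connection churn ever becomes the dominant cost, one option is a small pool that recycles relay buffers instead of allocating two fresh ones per connection. A minimal sketch, assuming a global pool behind a std Mutex; BufPool, RELAY_BUF, MAX_POOLED, and POOL are illustrative names, not part of oi.
use std::sync::Mutex;

const RELAY_BUF: usize = 64 * 1024; // matches the relay buffer size above
const MAX_POOLED: usize = 1024;     // cap on idle buffers kept around

struct BufPool {
    free: Mutex<Vec<Vec<u8>>>,
}

impl BufPool {
    const fn new() -> Self {
        Self { free: Mutex::new(Vec::new()) }
    }

    // Reuse an idle buffer if one exists, otherwise allocate a fresh one.
    fn get(&self) -> Vec<u8> {
        self.free
            .lock()
            .unwrap()
            .pop()
            .unwrap_or_else(|| vec![0u8; RELAY_BUF])
    }

    // Hand a buffer back when a connection ends; drop it if the pool is full.
    fn put(&self, buf: Vec<u8>) {
        let mut free = self.free.lock().unwrap();
        if free.len() < MAX_POOLED {
            free.push(buf);
        }
    }
}

static POOL: BufPool = BufPool::new();
In this sketch, pump() would take its buffer from POOL.get() instead of allocating with vec![0u8; buf_sz], and return it with POOL.put(buf) when that direction finishes.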
Epilogue — Case Closed (for now)
Inner Prosecutor: So the culprit was…
Me: A perfectly reasonable helper with a default buffer size I didn’t question.
Inner Prosecutor: And the lesson?
Me: Don’t guess. Ask sharp questions. Use the tools. Let the system testify.
Verdict: Guilty of “too many syscalls per byte.”
Sentence: 64KiB buffers and a better relay loop.
Ending
This was a good reminder that performance work is not guessing — it’s a dialogue with the system:
- Describe the situation
- Ask sharp questions
- Use tools to confirm
- Explain the results using low-level knowledge
- Make one change
- Re-measure
And the funniest part: the “clean” one-liner io::copy was correct, but its defaults were hiding a performance policy I didn’t want.
Inner Prosecutor: “Case closed?”
Me: “For now. Next case: buffer pooling, socket buffer tuning, and maybe a Linux-only splice(2) fast path — carefully, behind a safe wrapper.”