## Testing environment setup

Install tools:

```bash
sudo apt update

sudo apt install -y hyperfine heaptrack valgrind

sudo apt install -y \
    build-essential clang lld pkg-config \
    linux-perf \
    iperf3 netperf net-tools \
    tcpdump ethtool iproute2 \
    bpftrace bpfcc-tools \
    strace ltrace \
    sysstat procps \
    git perl
```

Install FlameGraph (not packaged on Debian):

```bash
git clone https://github.com/brendangregg/FlameGraph ~/FlameGraph

echo 'export PATH="$HOME/FlameGraph:$PATH"' >> ~/.bashrc

source ~/.bashrc

which flamegraph.pl
```

Modify the Cargo.toml of version 0.1.0 so the release build keeps symbols and line info for profiling:

```toml
[profile.release]
lto = true
codegen-units = 1
debug = 1
strip = "none"
panic = "abort"
```

Build with frame pointers so perf can walk the stacks cheaply:

```bash
git clone https://github.com/DaZuo0122/oxidinetd.git

RUSTFLAGS="-C force-frame-pointers=yes" cargo build --release
```

`profiling.conf` (bind address and port, then destination address and port):

```text
127.0.0.1 9000 127.0.0.1 9001
```

Backend iperf3 server:

```bash
iperf3 -s -p 9001
```

Forwarder:

```bash
./oi -c profiling.conf
```

Trigger the redirect (first a single stream, then 8 parallel streams):

```bash
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 1
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 8
```

Verification:

```bash
sudo ss -tnp | egrep '(:9000|:9001)'
```

## Testing

CPU hotspot:

```bash
sudo perf top -p $(pidof oi)
```

If you see lots of:

- sys_read, sys_write, __x64_sys_sendto, tcp_sendmsg → syscall/copy overhead
- futex, __lll_lock_wait → contention/locks
- epoll_wait → executor wake behavior / too many idle polls

Hard numbers:

```bash
sudo perf stat -p $(pidof oi) -e \
  cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
  -- sleep 30
```

Big differences to watch (against a baseline like rinetd):

- context-switches much higher on oi → too many tasks/wakers / lock contention
- instructions much higher on oi for the same throughput → runtime overhead / copies
- cache-misses higher → allocations / poor locality

Flamegraph record:

```bash
sudo perf record -F 199 -g -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```

If the stacks look flat or truncated (common with async + LTO), use DWARF unwinding instead:

```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi_dwarf.svg
```

Syscall-cost check:

```bash
sudo strace -ff -c -p $(pidof oi) -o /tmp/oi.strace
# run 15–30s under load, then Ctrl+C
tail -n +1 /tmp/oi.strace.*
```

If a huge % of time sits in read/write/sendmsg/recvmsg, the forwarder is dominated by copying + syscalls.

eBPF checks:

--skipped--

## Smol-focused bottlenecks + the “fix list”

### A) If you’re syscall/copy bound

Best improvement candidates:

- buffer reuse (no per-loop Vec allocation)
- reduce tiny writes (coalesce)
- zero-copy splice (Linux-only, biggest win but more complex)

For Linux zero-copy, you’d implement a splice(2)-based fast path (socket → pipe → socket), so the payload never crosses into userspace. That’s how high-performance forwarders avoid the double copy.
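To make that concrete, here is a minimal sketch of the splice fast path on top of the `libc` crate. This is not oxidinetd’s code: the function names, the 64 KiB chunk size, and the blocking-I/O shape are all assumptions, and an async forwarder would only call `splice` once the reactor reports the fds ready.

```rust
// Sketch of a splice(2) fast path: socket -> pipe -> socket, with the payload
// staying inside the kernel. Names and blocking-I/O shape are illustrative.
use std::io;
use std::os::unix::io::RawFd;

const SPLICE_CHUNK: usize = 64 * 1024;

/// Create the kernel pipe that serves as the in-kernel staging buffer.
fn make_pipe() -> io::Result<(RawFd, RawFd)> {
    let mut fds = [0i32; 2];
    if unsafe { libc::pipe2(fds.as_mut_ptr(), libc::O_CLOEXEC) } < 0 {
        return Err(io::Error::last_os_error());
    }
    Ok((fds[0], fds[1])) // (read end, write end)
}

/// Move one chunk from `from` to `to` without copying through userspace.
/// Returns Ok(0) on EOF.
fn splice_once(from: RawFd, to: RawFd, pipe_rd: RawFd, pipe_wr: RawFd) -> io::Result<usize> {
    // Socket -> pipe: bytes land in the pipe buffer inside the kernel.
    let n = unsafe {
        libc::splice(from, std::ptr::null_mut(), pipe_wr, std::ptr::null_mut(),
                     SPLICE_CHUNK, libc::SPLICE_F_MOVE)
    };
    if n < 0 {
        return Err(io::Error::last_os_error());
    }
    if n == 0 {
        return Ok(0); // source closed
    }
    // Pipe -> socket: drain exactly what was staged.
    let mut left = n as usize;
    while left > 0 {
        let m = unsafe {
            libc::splice(pipe_rd, std::ptr::null_mut(), to, std::ptr::null_mut(),
                         left, libc::SPLICE_F_MOVE)
        };
        if m < 0 {
            return Err(io::Error::last_os_error());
        }
        left -= m as usize;
    }
    Ok(n as usize)
}
```

Each forwarded chunk then costs two splice syscalls instead of a read plus a write through a userspace buffer, which is exactly the overhead the strace check above measures.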
### B) If you’re executor/waker bound (common for async forwarders)

Symptoms:

- perf shows a lot of runtime / wake / scheduling frames
- perf stat shows more context switches than rinetd

Fixes (both are sketched after this section):

- don’t spawn 2 tasks per connection (one per direction) unless needed → run a single task that forwards both directions in one loop (state machine)
- avoid any shared Mutex on the hot path (logging/metrics)
- keep per-connection state minimal

### C) If you’re single-thread limited

smol can be extremely fast, but if you’re effectively running everything on one thread, throughput may cap earlier.

Fix direction (second sketch below):

- move to smol::Executor + N threads (usually num_cpus)
- or run multiple block_on() workers (careful: avoid accept() duplication)
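For (B), a sketch of the one-task-per-connection shape using smol’s own types. `forward` and `handle` are illustrative names, not oxidinetd’s API, and shutdown propagation between the two directions is elided. The single fixed buffer per direction also covers the buffer-reuse item from (A).

```rust
// One task per connection: both directions forwarded inside a single future,
// with a fixed buffer reused across iterations (no per-loop Vec allocation).
use smol::future;
use smol::io::{AsyncReadExt, AsyncWriteExt};
use smol::net::TcpStream;

async fn forward(mut from: TcpStream, mut to: TcpStream) -> std::io::Result<u64> {
    let mut buf = vec![0u8; 64 * 1024]; // allocated once, reused every iteration
    let mut total = 0u64;
    loop {
        let n = from.read(&mut buf).await?;
        if n == 0 {
            break; // peer closed its write side
        }
        to.write_all(&buf[..n]).await?;
        total += n as u64;
    }
    Ok(total)
}

/// Drive client->backend and backend->client from one task instead of
/// spawning a separate task per direction.
async fn handle(client: TcpStream, backend: TcpStream) -> std::io::Result<()> {
    let up = forward(client.clone(), backend.clone());
    let down = forward(backend, client);
    let (a, b) = future::zip(up, down).await;
    a?;
    b?;
    Ok(())
}
```

`future::zip` still polls two sub-futures, but they live in one task, so per-connection wakeups stay local instead of bouncing between two scheduled tasks; a hand-rolled state machine over both directions would go one step further at the cost of more code.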
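For (C), a sketch of the multi-threaded executor shape: one shared `smol::Executor`, N worker threads driving it, and exactly one acceptor task so `accept()` is never duplicated across workers. The bind address, worker count, and shutdown story (workers run a `pending` future forever) are placeholders.

```rust
// One shared executor, N worker threads, a single acceptor task.
use std::thread;
use smol::{future, net::TcpListener, Executor};

fn main() {
    let ex = Executor::new();
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(4);

    thread::scope(|scope| {
        // Worker threads all drive the same executor; `pending` never
        // resolves, so they keep running tasks until the process exits.
        for _ in 0..workers {
            scope.spawn(|| future::block_on(ex.run(future::pending::<()>())));
        }
        // The one and only acceptor loop.
        future::block_on(ex.run(async {
            let listener = TcpListener::bind("127.0.0.1:9000").await.unwrap();
            loop {
                let (client, _) = listener.accept().await.unwrap();
                // Spawn the per-connection task; a real forwarder would call
                // the bidirectional `handle` from the previous sketch here.
                ex.spawn(async move { drop(client) }).detach();
            }
        }));
    });
}
```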
## Outcome

### CPU counters (perf stat)

Test commands:

```bash
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 1

sudo perf stat -p $(pidof oi) -e \
  cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
  -- sleep 30
```

perf stat output (hybrid CPU, so counters are reported per core type):

```text
Performance counter stats for process id '207279':

    98,571,874,480      cpu_atom/cycles/                                    (0.10%)
   134,732,064,800      cpu_core/cycles/                                    (99.90%)
    75,889,748,906      cpu_atom/instructions/    #  0.77 insn per cycle    (0.10%)
   159,098,987,713      cpu_core/instructions/    #  1.18 insn per cycle    (99.90%)
        30,443,258      cpu_atom/cache-misses/                              (0.10%)
         3,155,528      cpu_core/cache-misses/                              (99.90%)
    15,003,063,317      cpu_atom/branches/                                  (0.10%)
    31,479,765,962      cpu_core/branches/                                  (99.90%)
       149,091,165      cpu_atom/branch-misses/   #  0.99% of all branches  (0.10%)
       195,562,861      cpu_core/branch-misses/   #  0.62% of all branches  (99.90%)
             1,138      context-switches
                37      cpu-migrations

      33.004738330 seconds time elapsed
```

Measured against the checklist above: 1,138 context switches over 33 s (about 34/s) is low, so task/waker churn looks unlikely to be the bottleneck in this single-stream run.

### FlameGraph

Test commands:

```bash
sudo perf record -F 199 -g -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```

Output:

oi.svg

DWARF-unwound variant:

```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi_dwarf.svg
```

Output:

oi_dwarf.svg

### Syscall-cost check

```bash
sudo strace -ff -c -p $(pidof oi) -o /tmp/oi.strace
# run 15–30s under load, then Ctrl+C
tail -n +1 /tmp/oi.strace.*
```