Upload files to "drafts"

2026-01-24 14:05:01 +08:00
commit fdcf5838b3
1 changed files with 193 additions and 0 deletions
--- a/drafts/2026-01-24-profiling-rust-written-network-program.md
+++ b/drafts/2026-01-24-profiling-rust-written-network-program.md
@@ -0,0 +1,193 @@
 ## Testing environment setup
 Install tools:
 ```bash
 sudo apt update
 sudo apt install -y hyperfine heaptrack valgrind
 sudo apt install -y \
  build-essential clang lld pkg-config \
  linux-perf \
  iperf3 netperf net-tools \
  tcpdump ethtool iproute2 \
  bpftrace bpfcc-tools \
  strace ltrace \
  sysstat procps \
  git perl
 ```
 Install FlameGraph (not shipped on Debian):
 ```bash
 git clone https://github.com/brendangregg/FlameGraph ~/FlameGraph
 echo 'export PATH="$HOME/FlameGraph:$PATH"' >> ~/.bashrc
 source ~/.bashrc
 which flamegraph.pl
 ```
 Modify the `Cargo.toml` of version 0.1.0:
 ```toml
 [profile.release]
 lto = true
 codegen-units = 1
 debug = 1
 strip = "none"
 panic = "abort"
 ```
 Build with frame pointers to help profiling:
 ```bash
 git clone https://github.com/DaZuo0122/oxidinetd.git
 cd oxidinetd
 RUSTFLAGS="-C force-frame-pointers=yes" cargo build --release
 ```
 `profiling.conf`:
 ```yaml
 127.0.0.1 9000 127.0.0.1 9001
 ```
 Backend iperf3 server:
 ```bash
 iperf3 -s -p 9001
 ```
 Forwarder:
 ```bash
 ./oi -c profiling.conf
 ```
 Trigger the redirect by sending traffic through the forwarder:
 ```bash
 iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 1
 iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 8
 ```
 Verify that connections exist on both ports:
 ```bash
 sudo ss -tnp | grep -E '(:9000|:9001)'
 ```
 ## Testing
 CPU hotspots:
 ```bash
 sudo perf top -p $(pidof oi)
 ```
 If you see lots of:
 - `sys_read`, `sys_write`, `__x64_sys_sendto`, `tcp_sendmsg` → syscall/copy overhead
 - `futex`, `__lll_lock_wait` → contention/locks
 - `epoll_wait` → executor wake behavior / too many idle polls

 Hard numbers:
 ```bash
 sudo perf stat -p $(pidof oi) -e \
  cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
  -- sleep 30
 ```
 Big differences to watch when comparing oi against a baseline such as rinetd under the same load:
 - context-switches much higher on oi → too many tasks/wakers or lock contention
 - instructions much higher on oi for the same throughput → runtime overhead or extra copies
 - cache-misses higher → allocations or poor locality
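
These comparisons only mean something once the counters are normalized by how much traffic each forwarder moved in the same window. A minimal sketch of that arithmetic, assuming you copy the totals from `perf stat` and the byte count from the iperf3 summary by hand; `PerfSample` and its field names are made up for illustration and are not part of oi or perf:

```rust
/// Counter totals copied by hand from one `perf stat` run (hypothetical type).
#[derive(Debug, Clone, Copy)]
pub struct PerfSample {
    pub cycles: u64,
    pub instructions: u64,
    pub context_switches: u64,
    /// Bytes moved during the same window, e.g. taken from the iperf3 summary.
    pub bytes_forwarded: u64,
}

impl PerfSample {
    /// Instructions retired per cycle.
    pub fn ipc(&self) -> f64 {
        self.instructions as f64 / self.cycles as f64
    }

    /// Instructions spent per forwarded byte; lower is better.
    pub fn instructions_per_byte(&self) -> f64 {
        self.instructions as f64 / self.bytes_forwarded as f64
    }

    /// Context switches per MiB forwarded; lower is better.
    pub fn ctx_switches_per_mib(&self) -> f64 {
        let mib = self.bytes_forwarded as f64 / (1024.0 * 1024.0);
        self.context_switches as f64 / mib
    }
}
```

Fill one `PerfSample` per forwarder and compare the per-byte and per-MiB figures rather than the raw totals, since the two runs rarely move exactly the same amount of data.
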

 Record and render a flame graph:
 ```bash
 sudo perf record -F 199 -g -p $(pidof oi) -- sleep 30
 sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
 ```
 If the stack looks “flat / missing” (common with async + LTO), use DWARF unwinding:
 ```bash
 sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
 sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
 ```
 Syscall cost check:
 ```bash
 # -f follows every thread; -c prints one aggregated summary (-ff cannot be combined with -c)
 sudo strace -f -c -p $(pidof oi) -o /tmp/oi.strace
 # run 15–30s under load, then Ctrl+C
 cat /tmp/oi.strace
 ```
 If you see huge % time in read/write/sendmsg/recvmsg, you’re dominated by copying + syscalls.

 eBPF stuff: skipped.

 ## Smol-focused bottlenecks + the “fix list”
 ### A) If you’re syscall/copy bound
 Best improvement candidates:
 - buffer reuse (no per-loop `Vec` allocation)
 - reduce tiny writes (coalesce)
 - zero-copy `splice` (Linux-only, biggest win but more complex)

 For Linux zero-copy, you’d implement a `splice(2)`-based fast path (socket→pipe→socket). That’s how high-performance forwarders avoid double-copy.
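
The buffer-reuse point can be sketched as a copy loop that takes its buffer from the caller instead of allocating inside the loop. This is an illustration under assumptions, not oi's actual code: it uses blocking `std::io` traits rather than smol's async ones, and `forward` is a hypothetical helper name.

```rust
use std::io::{self, Read, Write};

/// Copies `src` into `dst` until EOF, reusing the caller's buffer for every read.
/// `buf` must be non-empty, otherwise the first read returns 0 and nothing is copied.
pub fn forward<R: Read, W: Write>(
    src: &mut R,
    dst: &mut W,
    buf: &mut [u8],
) -> io::Result<u64> {
    let mut total = 0u64;
    loop {
        let n = match src.read(buf) {
            Ok(0) => break, // EOF: the peer closed its side
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        // One write per read: a larger buffer means fewer, larger syscalls.
        dst.write_all(&buf[..n])?;
        total += n as u64;
    }
    dst.flush()?;
    Ok(total)
}
```

Allocate the buffer once per connection (or per direction), for example `let mut buf = vec![0u8; 64 * 1024];`, and pass it to every call. The 64 KiB size is an assumption to tune against the `perf` and `strace` numbers, not a recommendation.
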
 ### B) If you’re executor/waker bound (common for async forwarders)
 Symptoms:
 - perf shows a lot of runtime / wake / scheduling
 - perf stat shows more context switches than rinetd

 Fixes:
 - don’t spawn 2 tasks per connection (one per direction) unless needed; do a single task that forwards both directions in one loop (state machine)
 - avoid any shared `Mutex` on the hot path (logging/metrics)
 - keep per-conn state minimal
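
One way to keep metrics off the hot path without a shared `Mutex` is to count into plain per-connection integers and publish them to global atomics once, when the connection ends. A rough sketch under that assumption; `Metrics` and `ConnStats` are hypothetical names, not types from oi or smol:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Process-wide counters shared by every connection (hypothetical).
#[derive(Default)]
pub struct Metrics {
    pub bytes_forwarded: AtomicU64,
    pub connections_closed: AtomicU64,
}

/// Per-connection tally kept in plain integers, so the copy loop
/// touches no shared state at all.
#[derive(Default)]
pub struct ConnStats {
    pub bytes: u64,
}

impl ConnStats {
    /// Called from the hot loop: a local add, no atomics, no lock.
    pub fn record(&mut self, n: usize) {
        self.bytes += n as u64;
    }

    /// Called once when the connection ends: publishes the totals.
    pub fn flush_into(self, global: &Metrics) {
        global.bytes_forwarded.fetch_add(self.bytes, Ordering::Relaxed);
        global.connections_closed.fetch_add(1, Ordering::Relaxed);
    }
}
```

`Relaxed` ordering is enough here because the counters are only read for reporting and do not guard any other data.
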
 ### C) If you’re single-thread limited
 smol can be extremely fast, but if you’re effectively running everything on one thread, throughput may cap earlier.

 Fix direction:
 - move to `smol::Executor` + N threads (usually `num_cpus`)
 - or run multiple `block_on()` workers (careful: avoid `accept()` duplication)
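
The "accept in one place, do the work on N threads" shape can be illustrated with the standard library alone. This is deliberately not smol code: it is a blocking sketch in which one thread owns `accept()` and hands sockets to workers over channels, and each worker handles its connections one at a time, whereas an async executor would interleave them. `handle`, `spawn_workers` and `accept_loop` are hypothetical names, and `handle` just echoes bytes as a stand-in for the real forwarding logic.

```rust
use std::io::{self, Read, Write};
use std::net::{TcpListener, TcpStream};
use std::sync::mpsc::{self, Sender};
use std::thread;

/// Placeholder per-connection work: echo bytes back until the peer closes.
fn handle(mut conn: TcpStream) -> io::Result<()> {
    let mut buf = [0u8; 16 * 1024];
    loop {
        let n = conn.read(&mut buf)?;
        if n == 0 {
            return Ok(());
        }
        conn.write_all(&buf[..n])?;
    }
}

/// Starts `workers` threads, each draining its own queue of connections.
pub fn spawn_workers(workers: usize) -> Vec<Sender<TcpStream>> {
    (0..workers)
        .map(|_| {
            let (tx, rx) = mpsc::channel::<TcpStream>();
            thread::spawn(move || {
                for conn in rx {
                    // A failed connection must not take the worker down.
                    let _ = handle(conn);
                }
            });
            tx
        })
        .collect()
}

/// The only place that calls accept(); sockets are handed out round-robin.
/// `queues` must be non-empty.
pub fn accept_loop(listener: TcpListener, queues: Vec<Sender<TcpStream>>) -> io::Result<()> {
    for (i, conn) in listener.incoming().enumerate() {
        let conn = conn?;
        if queues[i % queues.len()].send(conn).is_err() {
            break; // the worker is gone, stop accepting
        }
    }
    Ok(())
}
```

Because only `accept_loop` touches the listener, no two workers can race on the same incoming connection, which is the duplication problem the last bullet warns about.
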