## Testing environment setup

Install tools:

```bash
sudo apt update

sudo apt install -y hyperfine heaptrack valgrind

sudo apt install -y \
    build-essential clang lld pkg-config \
    linux-perf \
    iperf3 netperf net-tools \
    tcpdump ethtool iproute2 \
    bpftrace bpfcc-tools \
    strace ltrace \
    sysstat procps \
    git perl
```

Install FlameGraph (not shipped in Debian):

```bash
git clone https://github.com/brendangregg/FlameGraph ~/FlameGraph

echo 'export PATH="$HOME/FlameGraph:$PATH"' >> ~/.bashrc

source ~/.bashrc

which flamegraph.pl
```

Modify the Cargo.toml of version 0.1.0:

```toml
[profile.release]
lto = true
codegen-units = 1
debug = 1        # keep line tables so profilers can map samples to source
strip = "none"   # keep symbols so perf can resolve function names
panic = "abort"
```

Build with frame pointers to help profiling:

```bash
git clone https://github.com/DaZuo0122/oxidinetd.git

RUSTFLAGS="-C force-frame-pointers=yes" cargo build --release
```

`profiling.conf` (listen on 9000, forward to 9001):

```text
127.0.0.1 9000 127.0.0.1 9001
```

Backend iperf3 server:

```bash
iperf3 -s -p 9001
```

Start the forwarder:

```bash
./oi -c profiling.conf
```

Trigger the redirect (one stream, then 8 parallel streams):

```bash
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 1
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 8
```

Verify both legs of the connection:

```bash
sudo ss -tnp | grep -E '(:9000|:9001)'
```

## Testing

CPU hotspots:

```bash
sudo perf top -p $(pidof oi)
```

If you see lots of:

- `sys_read`, `sys_write`, `__x64_sys_sendto`, `tcp_sendmsg` → syscall/copy overhead
- `futex`, `__lll_lock_wait` → contention/locks
- `epoll_wait` → executor wake behavior / too many idle polls

Hard numbers:

```bash
sudo perf stat -p $(pidof oi) -e \
    cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
    -- sleep 30
```

Big differences to watch:

- context-switches much higher on oi → too many tasks/wakers / lock contention
- instructions much higher on oi for the same throughput → runtime overhead / copies
- cache-misses higher → allocations / poor locality

Flamegraph record:

```bash
sudo perf record -F 199 -g -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```

If the stacks look “flat / missing” (common with async + LTO), use DWARF unwinding:

```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```

Syscall-cost check:

```bash
sudo strace -ff -c -p $(pidof oi) -o /tmp/oi.strace
# run 15–30s under load, then Ctrl+C
tail -n +1 /tmp/oi.strace.*
```

If a huge share of time goes to read/write/sendmsg/recvmsg, you're dominated by copying + syscalls.

## eBPF checks

--skipped--

## Smol-focused bottlenecks + the “fix list”

### A) If you're syscall/copy bound

Best improvement candidates:

- buffer reuse (no per-loop Vec allocation)
- reduce tiny writes (coalesce)
- zero-copy splice (Linux-only, biggest win but more complex)

For Linux zero-copy, you'd implement a splice(2)-based fast path (socket → pipe → socket). That's how high-performance forwarders avoid double-copy.
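A minimal sketch of that fast path, assuming blocking std sockets and the `libc` crate; `splice_pump` and its error handling are illustrative, not oxidinetd code, and an async version on smol streams would additionally need EAGAIN/readiness handling:

```rust
use std::io;
use std::net::TcpStream;
use std::os::fd::{AsRawFd, FromRawFd, OwnedFd};

/// Move bytes from `r` to `w` through a kernel pipe with splice(2),
/// so the payload never crosses the kernel/userspace boundary.
fn splice_pump(r: &TcpStream, w: &TcpStream) -> io::Result<u64> {
    // A pipe is the mandatory intermediary for socket-to-socket splice.
    let mut fds = [0i32; 2];
    if unsafe { libc::pipe2(fds.as_mut_ptr(), 0) } < 0 {
        return Err(io::Error::last_os_error());
    }
    // OwnedFd closes both pipe ends on every exit path.
    let (pr, pw) = unsafe { (OwnedFd::from_raw_fd(fds[0]), OwnedFd::from_raw_fd(fds[1])) };
    let mut total = 0u64;

    loop {
        // socket -> pipe
        let n = unsafe {
            libc::splice(r.as_raw_fd(), std::ptr::null_mut(),
                         pw.as_raw_fd(), std::ptr::null_mut(),
                         64 * 1024, libc::SPLICE_F_MOVE)
        };
        if n < 0 {
            return Err(io::Error::last_os_error());
        }
        if n == 0 {
            break; // EOF on the source socket
        }
        // pipe -> socket: drain exactly what we just moved in
        let mut left = n as usize;
        while left > 0 {
            let m = unsafe {
                libc::splice(pr.as_raw_fd(), std::ptr::null_mut(),
                             w.as_raw_fd(), std::ptr::null_mut(),
                             left, libc::SPLICE_F_MOVE)
            };
            // 0 shouldn't happen while the pipe still holds data
            if m <= 0 {
                return Err(io::Error::last_os_error());
            }
            left -= m as usize;
            total += m as u64;
        }
    }
    Ok(total)
}
```

The readiness plumbing is where the real complexity lives: in the async forwarder you'd splice only when the source socket is readable, and fall back to the buffered read/write loop if splice fails.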
### B) If you're executor/waker bound (common for async forwarders)

Symptoms:

- perf shows a lot of runtime / wake / scheduling frames
- perf stat shows more context switches than rinetd

Fixes (see the sketch below):

- don't spawn 2 tasks per connection (one per direction) unless needed → do a single task that forwards both directions in one loop (state machine)
- avoid any shared Mutex on the hot path (logging/metrics)
- keep per-conn state minimal
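A sketch of that single-task shape on smol's `TcpStream` (`handle_conn` and `copy_dir` are illustrative names; whether oxidinetd currently spawns one or two tasks per connection isn't shown in this post). `future::zip` polls both directions from one task, so the executor schedules one task per connection instead of two:

```rust
use std::{io, net::Shutdown};
use smol::{future, io::{AsyncReadExt, AsyncWriteExt}, net::TcpStream};

// One task drives both directions: zip polls the two copy loops
// together, so there's a single task per connection.
async fn handle_conn(client: TcpStream, server: TcpStream) -> io::Result<()> {
    let up = copy_dir(client.clone(), server.clone());
    let down = copy_dir(server, client);
    let (a, b) = future::zip(up, down).await;
    a?;
    b?;
    Ok(())
}

// Minimal one-direction copy loop: reused buffer, FIN forwarded on EOF.
async fn copy_dir(mut r: TcpStream, mut w: TcpStream) -> io::Result<()> {
    let mut buf = vec![0u8; 16 * 1024];
    loop {
        let n = r.read(&mut buf).await?;
        if n == 0 {
            let _ = w.shutdown(Shutdown::Write);
            return Ok(());
        }
        w.write_all(&buf[..n]).await?;
    }
}
```

A full state machine would go further (one buffer, one poll loop), but zip alone already removes the second spawn per connection.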
### C) If you're single-thread limited

smol can be extremely fast, but if you're effectively running everything on one thread, throughput may cap earlier.

Fix direction (sketch below):

- move to a shared smol::Executor driven by N threads (usually num_cpus)
- or run multiple block_on() workers (careful: avoid accept() duplication)
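A minimal sketch of the first option (not oxidinetd's actual main; `handle` stands in for its per-connection handler, e.g. handle_tcp_connection, and the listen address is hard-coded for brevity):

```rust
use std::{io, sync::Arc, thread};
use smol::{future, net::{TcpListener, TcpStream}, Executor};

// Stand-in for the real per-connection handler.
async fn handle(_stream: TcpStream) -> io::Result<()> {
    Ok(())
}

fn main() -> io::Result<()> {
    let ex: Arc<Executor<'static>> = Arc::new(Executor::new());

    // N-1 extra OS threads all drive the same executor, so tasks spread
    // across cores while accept() still runs in exactly one place.
    let n = thread::available_parallelism().map(usize::from).unwrap_or(4);
    for _ in 1..n {
        let ex = ex.clone();
        thread::spawn(move || future::block_on(ex.run(future::pending::<()>())));
    }

    // The main thread runs the accept loop and drives the executor too.
    let ex2 = ex.clone();
    future::block_on(ex.run(async move {
        let listener = TcpListener::bind("127.0.0.1:9000").await?;
        loop {
            let (stream, _) = listener.accept().await?;
            ex2.spawn(async move { let _ = handle(stream).await; }).detach();
        }
    }))
}
```

This keeps a single listener and a single accept loop, sidestepping the accept() duplication problem the multiple-block_on() route has.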
## Outcome: oi

### CPU counters (perf stat)

Testing commands:

```bash
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 1

sudo perf stat -p $(pidof oi) -e \
    cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
    -- sleep 30
```

perf stat output:

```text
 Performance counter stats for process id '207279':

    98,571,874,480      cpu_atom/cycles/                                         (0.10%)
   134,732,064,800      cpu_core/cycles/                                         (99.90%)
    75,889,748,906      cpu_atom/instructions/         # 0.77 insn per cycle     (0.10%)
   159,098,987,713      cpu_core/instructions/         # 1.18 insn per cycle     (99.90%)
        30,443,258      cpu_atom/cache-misses/                                   (0.10%)
         3,155,528      cpu_core/cache-misses/                                   (99.90%)
    15,003,063,317      cpu_atom/branches/                                       (0.10%)
    31,479,765,962      cpu_core/branches/                                       (99.90%)
       149,091,165      cpu_atom/branch-misses/        # 0.99% of all branches   (0.10%)
       195,562,861      cpu_core/branch-misses/        # 0.62% of all branches   (99.90%)
             1,138      context-switches
                37      cpu-migrations

      33.004738330 seconds time elapsed
```

### FlameGraph

Testing commands:

```bash
sudo perf record -F 199 -g -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```

Outcome: `oi.svg`

The same with DWARF unwinding:

```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi_dwarf.svg
```

Outcome: `oi_dwarf.svg`

### Syscall-cost check

```bash
sudo strace -ff -C -p $(pidof oi) -o /tmp/oi.strace
# run 15–30s under load, then Ctrl+C
tail -n +1 /tmp/oi.strace.*
```

## A more realistic setup

Traffic now goes through real kernel routing and two TCP legs.

Create namespaces and veth links:

```bash
sudo ip netns add ns_client
sudo ip netns add ns_server

# client side: host 10.0.1.1 <-> ns_client 10.0.1.2
sudo ip link add veth_c type veth peer name veth_c_ns
sudo ip link set veth_c_ns netns ns_client

# server side: host 10.0.0.1 <-> ns_server 10.0.0.2
sudo ip link add veth_s type veth peer name veth_s_ns
sudo ip link set veth_s_ns netns ns_server

sudo ip addr add 10.0.1.1/24 dev veth_c
sudo ip link set veth_c up
sudo ip addr add 10.0.0.1/24 dev veth_s
sudo ip link set veth_s up

sudo ip netns exec ns_client ip addr add 10.0.1.2/24 dev veth_c_ns
sudo ip netns exec ns_client ip link set veth_c_ns up
sudo ip netns exec ns_client ip link set lo up

sudo ip netns exec ns_server ip addr add 10.0.0.2/24 dev veth_s_ns
sudo ip netns exec ns_server ip link set veth_s_ns up
sudo ip netns exec ns_server ip link set lo up

sudo sysctl -w net.ipv4.ip_forward=1
```

Config to force the redirect path:

```text
10.0.1.1 9000 10.0.0.2 9001
```

Start the backend server in ns_server:

```bash
sudo ip netns exec ns_server iperf3 -s -p 9001
```

Run the client in ns_client → forwarder → backend:

```bash
sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8
```

perf stat output:

```text
sudo perf stat -p $(pidof oi) -e cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations -- sleep 33

 Performance counter stats for process id '209785':

   113,810,599,893      cpu_atom/cycles/                                         (0.11%)
   164,681,878,450      cpu_core/cycles/                                         (99.89%)
   102,575,167,734      cpu_atom/instructions/         # 0.90 insn per cycle     (0.11%)
   237,094,207,911      cpu_core/instructions/         # 1.44 insn per cycle     (99.89%)
        33,093,338      cpu_atom/cache-misses/                                   (0.11%)
         5,381,441      cpu_core/cache-misses/                                   (99.89%)
    20,012,975,873      cpu_atom/branches/                                       (0.11%)
    46,120,077,111      cpu_core/branches/                                       (99.89%)
       211,767,555      cpu_atom/branch-misses/        # 1.06% of all branches   (0.11%)
       245,969,685      cpu_core/branch-misses/        # 0.53% of all branches   (99.89%)
             1,686      context-switches
               150      cpu-migrations

      33.004363800 seconds time elapsed
```

Flamegraph:

--skipped--

### Add latency + small-packet tests

netperf (request/response)

Start netserver in the backend namespace:

```bash
sudo ip netns exec ns_server netserver -p 9001
```

Run TCP_RR against the forwarded port:

```bash
sudo ip netns exec ns_client netperf -H 10.0.1.1 -p 9000 -t TCP_RR -l 30 -- -r 32,32
```

## After opt

Here we replaced the 8 KiB buffer that `futures_lite::io` uses with a custom 16 KiB buffer. (To avoid clashing with the old binary, I renamed this build to oiopt.)

```rust
// imports assume smol's re-exports of futures_lite's I/O traits
use std::{io, net::Shutdown};
use smol::{io::{AsyncReadExt, AsyncWriteExt}, net::TcpStream};

async fn pump(mut r: TcpStream, mut w: TcpStream) -> io::Result<u64> {
    // Let's try 16 KiB instead of futures_lite::io's 8 KiB default
    // and profile again to see the outcome. The buffer is allocated
    // once per direction and reused for every read.
    let mut buf = vec![0u8; 16 * 1024];
    let mut total = 0u64;

    loop {
        let n = r.read(&mut buf).await?;
        if n == 0 {
            // EOF: send FIN to peer
            let _ = w.shutdown(Shutdown::Write);
            break;
        }
        w.write_all(&buf[0..n]).await?;
        total += n as u64;
    }

    Ok(total)
}


// And change the call sites in handle_tcp_connection:
let client_to_server = pump(client_stream.clone(), server_stream.clone());
let server_to_client = pump(server_stream, client_stream);
```

### Outcomes

Still with `sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8`.

perf stat:

```text
sudo perf stat -p $(pidof oiopt) -e cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations -- sleep 33

 Performance counter stats for process id '883435':

   118,960,667,431      cpu_atom/cycles/                                         (0.05%)
   131,934,369,110      cpu_core/cycles/                                         (99.95%)
   100,530,466,140      cpu_atom/instructions/         # 0.85 insn per cycle     (0.05%)
   185,203,788,299      cpu_core/instructions/         # 1.40 insn per cycle     (99.95%)
        11,027,490      cpu_atom/cache-misses/                                   (0.05%)
         2,123,369      cpu_core/cache-misses/                                   (99.95%)
    19,641,945,774      cpu_atom/branches/                                       (0.05%)
    36,245,438,057      cpu_core/branches/                                       (99.95%)
       214,098,497      cpu_atom/branch-misses/        # 1.09% of all branches   (0.05%)
       179,848,095      cpu_core/branch-misses/        # 0.50% of all branches   (99.95%)
             2,308      context-switches
                31      cpu-migrations

      33.004555878 seconds time elapsed
```

System-call check:

```bash
sudo timeout 30s strace -c -f -p $(pidof oiopt)
```

Output:

```text
strace: Process 883435 attached with 4 threads
strace: Process 883438 detached
strace: Process 883437 detached
strace: Process 883436 detached
strace: Process 883435 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 57.80   14.590016      442121        33           epoll_wait
 28.84    7.279883           4   1771146           sendto
 13.33    3.363882           1   1771212        48 recvfrom
  0.02    0.003843          61        62        44 futex
  0.01    0.001947          12       159           epoll_ctl
  0.00    0.000894          99         9         9 connect
  0.00    0.000620          34        18         9 accept4
  0.00    0.000503          14        34           timerfd_settime
  0.00    0.000446          13        33        33 read
  0.00    0.000271          15        18           ioctl
  0.00    0.000189          21         9           write
  0.00    0.000176          19         9           socket
  0.00    0.000099          11         9           getsockopt
  0.00    0.000079           4        18           shutdown
  0.00    0.000049           2        18           close
------ ----------- ----------- --------- --------- ----------------
100.00   25.242897           7   3542787       143 total
```

## Further tests to explain why the effect is this big

Changed the 16 KiB buffer to 64 KiB and named the binary oiopt64.

iperf3 throughput under `-P 8`: the fastest stream hit 54.1 Gbits/sec, and the other streams are also much higher than with the 16 KiB buffer.

perf stat:

```text
sudo perf stat -p $(pidof oiopt64) -e cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations -- sleep 33

 Performance counter stats for process id '893123':

   120,859,810,675      cpu_atom/cycles/                                         (0.15%)
   134,735,934,329      cpu_core/cycles/                                         (99.85%)
    79,946,979,880      cpu_atom/instructions/         # 0.66 insn per cycle     (0.15%)
   127,036,644,759      cpu_core/instructions/         # 0.94 insn per cycle     (99.85%)
        24,713,474      cpu_atom/cache-misses/                                   (0.15%)
         9,604,449      cpu_core/cache-misses/                                   (99.85%)
    15,584,074,530      cpu_atom/branches/                                       (0.15%)
    24,796,180,117      cpu_core/branches/                                       (99.85%)
       175,778,825      cpu_atom/branch-misses/        # 1.13% of all branches   (0.15%)
       135,067,353      cpu_core/branch-misses/        # 0.54% of all branches   (99.85%)
             1,519      context-switches
                50      cpu-migrations

      33.006529572 seconds time elapsed
```

System-call check:

```bash
sudo timeout 30s strace -c -f -p $(pidof oiopt64)
```

Output:

```text
strace: Process 893123 attached with 4 threads
strace: Process 893126 detached
strace: Process 893125 detached
strace: Process 893124 detached
strace: Process 893123 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 54.56   18.079500      463576        39           epoll_wait
 27.91    9.249443           7   1294854         2 sendto
 17.49    5.796927           4   1294919        51 recvfrom
  0.01    0.003778          50        75        49 futex
  0.01    0.002188          12       175           epoll_ctl
  0.00    0.000747          83         9         9 connect
  0.00    0.000714          17        40           timerfd_settime
  0.00    0.000510          13        39        38 read
  0.00    0.000452          25        18         9 accept4
  0.00    0.000310          17        18           ioctl
  0.00    0.000232          23        10           write
  0.00    0.000200          22         9           socket
  0.00    0.000183          20         9           getsockopt
  0.00    0.000100           5        18           shutdown
  0.00    0.000053           2        18           close
  0.00    0.000020          20         1           mprotect
  0.00    0.000015          15         1           sched_yield
  0.00    0.000005           5         1           madvise
------ ----------- ----------- --------- --------- ----------------
100.00   33.135377          12   2590253       158 total
```

The counts tell the story: going from 16 KiB to 64 KiB buffers cuts sendto/recvfrom from about 1.77M to about 1.29M calls over the same window. Each read moves more bytes, so the per-byte syscall and copy cost drops.

### Cleanup

```bash
sudo ip netns del ns_client
sudo ip netns del ns_server
sudo ip link del veth_c
sudo ip link del veth_s
```