From fdcf5838b321b7ee9ce68f6be194afd40623a305 Mon Sep 17 00:00:00 2001
From: manbo <manbo@hachi.mi>
Date: Sat, 24 Jan 2026 14:05:01 +0800
Subject: [PATCH] Upload files to "drafts"

---
 ...-profiling-rust-written-network-program.md | 193 ++++++++++++++++++
 1 file changed, 193 insertions(+)
 create mode 100644 drafts/2026-01-24-profiling-rust-written-network-program.md

diff --git a/drafts/2026-01-24-profiling-rust-written-network-program.md b/drafts/2026-01-24-profiling-rust-written-network-program.md
new file mode 100644
index 0000000..714c8c0
--- /dev/null
+++ b/drafts/2026-01-24-profiling-rust-written-network-program.md
@@ -0,0 +1,193 @@
+
+## Testing environment setup
+Install tools:
+
+```bash
+sudo apt update
+
+sudo apt install -y hyperfine heaptrack valgrind
+
+sudo apt install -y \
+  build-essential clang lld pkg-config \
+  linux-perf \
+  iperf3 netperf net-tools \
+  tcpdump ethtool iproute2 \
+  bpftrace bpfcc-tools \
+  strace ltrace \
+  sysstat procps \
+  git perl
+```
+
+
+Install FlameGraph (not shipped on Debian):
+
+```bash
+git clone https://github.com/brendangregg/FlameGraph ~/FlameGraph
+
+echo 'export PATH="$HOME/FlameGraph:$PATH"' >> ~/.bashrc
+
+source ~/.bashrc
+
+which flamegraph.pl
+```
+
+Modify the Cargo.toml of version 0.1.0:
+
+```toml
+[profile.release]
+lto = true
+codegen-units = 1
+debug = 1
+strip = "none"
+panic = "abort"
+```
+
+
+Build with frame pointers to help profiling:
+
+```bash
+git clone https://github.com/DaZuo0122/oxidinetd.git
+cd oxidinetd
+RUSTFLAGS="-C force-frame-pointers=yes" cargo build --release
+```
+
+
+`profiling.conf`:
+
+```text
+127.0.0.1 9000 127.0.0.1 9001
+```
+
+
+Backend iperf3 server:
+
+```bash
+iperf3 -s -p 9001
+```
+
+
+Forwarder:
+
+```bash
+./oi -c profiling.conf
+```
+
+
+Trigger the redirect by sending client traffic through the forwarder:
+
+```bash
+iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 1
+iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 8
+```
+
+Verification:
+
+```bash
+sudo ss -tnp | grep -E '(:9000|:9001)'
+```
+
+## Testing
+
+CPU hotspot:
+
+```bash
+sudo perf top -p $(pidof oi)
+```
+
+If any of these symbols dominate the profile, here is what they suggest:
+
+- sys_read, sys_write, __x64_sys_sendto, tcp_sendmsg → syscall/copy overhead
+
+- futex, __lll_lock_wait → contention/locks
+
+- epoll_wait → executor wake behavior / too many idle polls
+
+
+Hard numbers:
+
+```bash
+sudo perf stat -p $(pidof oi) -e \
+  cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
+  -- sleep 30
+```
+
+Big differences to watch (relative to a baseline such as rinetd under the same load):
+
+- context-switches much higher on oi → too many tasks/wakers / lock contention
+
+- instructions much higher on oi for same throughput → runtime overhead / copies
+
+- cache-misses higher → allocations / poor locality
+
+
+Flamegraph  
+Record: 
+
+```bash
+sudo perf record -F 199 -g -p $(pidof oi) -- sleep 30
+sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
+```
+
+If the stacks look “flat / missing” (common with async + LTO), use DWARF unwinding:
+
+```bash
+sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
+sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
+```
+
+
+Syscall-cost check (`-f` rather than `-ff`, because strace does not keep per-process counts with `-c`):
+
+```bash
+sudo strace -f -c -p $(pidof oi) -o /tmp/oi.strace
+# run 15–30s under load, then Ctrl+C
+cat /tmp/oi.strace
+```
+
+If you see huge % time in read/write/sendmsg/recvmsg, you’re dominated by copying + syscalls.
+
+
+eBPF tooling:
+
+--skipped--
+
+
+## Smol-focused bottlenecks + the “fix list”
+### A) If you’re syscall/copy bound
+
+Best improvement candidates:
+
+- buffer reuse (no per-loop Vec allocation)
+
+- reduce tiny writes (coalesce)
+
+- zero-copy splice (Linux-only, biggest win but more complex)
+
+For Linux zero-copy, you’d implement a splice(2)-based fast path (socket→pipe→socket). That’s how high-performance forwarders avoid double-copy.
+
+### B) If you’re executor/waker bound (common for async forwarders)
+
+Symptoms:
+
+- perf shows a lot of time in runtime / wake / scheduling code
+
+- perf stat shows more context switches than rinetd
+
+Fixes:
+
+- don’t spawn 2 tasks per connection (one per direction) unless needed
+  → do a single task that forwards both directions in one loop (state machine)
+
+- avoid any shared Mutex on the hot path (logging/metrics)
+
+- keep per-connection state minimal
+
+### C) If you’re single-thread limited
+
+smol can be extremely fast, but if you’re effectively running everything on one thread, throughput may cap earlier.
+
+Fix direction:
+
+- move to smol::Executor + N threads (usually num_cpus)
+
+- or run multiple block_on() workers (careful: avoid accept() duplication)
\ No newline at end of file