## Testing environment setup

Install tools:

```bash
sudo apt update
sudo apt install -y hyperfine heaptrack valgrind
sudo apt install -y \
  build-essential clang lld pkg-config \
  linux-perf \
  iperf3 netperf net-tools \
  tcpdump ethtool iproute2 \
  bpftrace bpfcc-tools \
  strace ltrace \
  sysstat procps \
  git perl
```

Install FlameGraph (not shipped on Debian):

```bash
git clone https://github.com/brendangregg/FlameGraph ~/FlameGraph
echo 'export PATH="$HOME/FlameGraph:$PATH"' >> ~/.bashrc
source ~/.bashrc
which flamegraph.pl
```

Modify the Cargo.toml of version 0.1.0:

```toml
[profile.release]
lto = true
codegen-units = 1
debug = 1
strip = "none"
panic = "abort"
```

Build with frame pointers to help profiling:

```bash
git clone https://github.com/DaZuo0122/oxidinetd.git
RUSTFLAGS="-C force-frame-pointers=yes" cargo build --release
```

`profiling.conf`:

```text
127.0.0.1 9000 127.0.0.1 9001
```

Backend iperf3 server:

```bash
iperf3 -s -p 9001
```

Forwarder:

```bash
./oi -c profiling.conf
```

Trigger the redirect:

```bash
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 1
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 8
```

Verification:

```bash
sudo ss -tnp | egrep '(:9000|:9001)'
```

## Testing

CPU hotspot:

```bash
sudo perf top -p $(pidof oi)
```

If you see lots of:

- sys_read, sys_write, __x64_sys_sendto, tcp_sendmsg → syscall/copy overhead
- futex, __lll_lock_wait → contention/locks
- epoll_wait → executor wake behavior / too many idle polls

Hard numbers:

```bash
sudo perf stat -p $(pidof oi) -e \
  cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
  -- sleep 30
```

Big differences to watch:

- context-switches much higher on oi → too many tasks/wakers / lock contention
- instructions much higher on oi for the same throughput → runtime overhead / copies
- cache-misses higher → allocations / poor locality

FlameGraph record:

```bash
sudo perf record -F 199 -g -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```

If the stack looks “flat / missing” (common with async + LTO), use DWARF unwinding:

```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```

Syscall-cost check:

```bash
sudo strace -ff -c -p $(pidof oi) -o /tmp/oi.strace
# run 15–30s under load, then Ctrl+C
tail -n +1 /tmp/oi.strace.*
```

If you see a huge % of time in read/write/sendmsg/recvmsg, you are dominated by copying + syscalls.

eBPF stuff: --skipped--

### Smol-focused bottlenecks + the “fix list”

A) If you’re syscall/copy bound

Best improvement candidates:

- buffer reuse (no per-loop Vec allocation)
- reduce tiny writes (coalesce)
- zero-copy splice (Linux-only, biggest win but more complex)

For Linux zero-copy, you’d implement a splice(2)-based fast path (socket → pipe → socket). That’s how high-performance forwarders avoid the double copy.
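A minimal sketch of what that fast path could look like, assuming a Linux-only build and a `libc` dependency; none of this is in oxidinetd today, the `splice_pump`/`splice_once` names are made up for illustration, and the blocking calls would have to be integrated with smol's readiness handling in a real async version:

```rust
// Hypothetical Linux-only fast path (not in oxidinetd): splice() moves data
// kernel-side, socket -> pipe -> socket, so the payload never enters userspace.
// Blocking calls for clarity; an async version would wait for socket readiness
// before each splice().
use std::io;
use std::net::TcpStream;
use std::os::unix::io::{AsRawFd, RawFd};

fn splice_once(from: RawFd, to: RawFd, len: usize) -> io::Result<usize> {
    let n = unsafe {
        libc::splice(
            from,
            std::ptr::null_mut(),
            to,
            std::ptr::null_mut(),
            len,
            libc::SPLICE_F_MOVE | libc::SPLICE_F_MORE,
        )
    };
    if n < 0 {
        Err(io::Error::last_os_error())
    } else {
        Ok(n as usize)
    }
}

/// Forward one direction of a connection through a pipe.
/// The pipe's capacity plays the role of the userspace buffer.
fn splice_pump(src: &TcpStream, dst: &TcpStream) -> io::Result<u64> {
    let mut fds = [0i32; 2];
    if unsafe { libc::pipe(fds.as_mut_ptr()) } < 0 {
        return Err(io::Error::last_os_error());
    }
    let (pipe_rd, pipe_wr) = (fds[0], fds[1]);

    let result: io::Result<u64> = (|| {
        let mut total = 0u64;
        loop {
            // socket -> pipe
            let n = splice_once(src.as_raw_fd(), pipe_wr, 64 * 1024)?;
            if n == 0 {
                return Ok(total); // EOF on the source socket
            }
            // pipe -> socket: drain everything we just queued
            let mut left = n;
            while left > 0 {
                left -= splice_once(pipe_rd, dst.as_raw_fd(), left)?;
            }
            total += n as u64;
        }
    })();

    unsafe {
        libc::close(pipe_rd);
        libc::close(pipe_wr);
    }
    result
}
```

SPLICE_F_MOVE asks the kernel to move pages instead of copying where it can, which is where the double-copy saving comes from.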
B) If you’re executor/waker bound (common for async forwarders)

Symptoms:

- perf shows a lot of runtime / wake / scheduling
- perf stat shows more context switches than rinetd

Fixes:

- don’t spawn 2 tasks per connection (one per direction) unless needed → do a single task that forwards both directions in one loop (state machine); see the sketch below
- avoid any shared Mutex on the hot path (logging/metrics)
- keep per-connection state minimal
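As an illustration of the single-task idea (the names here are placeholders, not oxidinetd's actual API): instead of spawning one task per direction, the connection task can poll both directions itself with `futures_lite::future::zip`, so there is one waker, one spawn, and no cross-task hand-off on the hot path:

```rust
use std::io;

use futures_lite::future;
use smol::net::TcpStream;

async fn pump(r: TcpStream, w: TcpStream) -> io::Result<u64> {
    // per-direction copy loop elided; see the pump() shown later in this document
    let _ = (r, w);
    Ok(0)
}

async fn handle_connection(client: TcpStream, server: TcpStream) -> io::Result<(u64, u64)> {
    let c2s = pump(client.clone(), server.clone());
    let s2c = pump(server, client);
    // Both directions are driven by this single task.
    let (up, down) = future::zip(c2s, s2c).await;
    Ok((up?, down?))
}
```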
C) If you’re single-thread limited

smol can be extremely fast, but if you’re effectively running everything on one thread, throughput may cap earlier.

Fix direction: move to a shared smol::Executor + N threads (usually num_cpus), or run multiple block_on() workers (careful: avoid accept() duplication).
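A rough sketch of that layout, assuming the smol crate's re-exports (`smol::Executor`, `smol::channel`, `smol::future`, `smol::net`); `forward` is a stand-in for the real per-connection handler (`handle_tcp_connection`), not code from the project:

```rust
use std::thread;

use smol::net::{TcpListener, TcpStream};
use smol::{channel, future, Executor};

async fn forward(_client: TcpStream) {
    // per-connection proxying elided
}

fn main() {
    let ex = Executor::new();
    // Dropping `signal` closes the channel, which lets the worker threads exit.
    let (signal, shutdown) = channel::unbounded::<()>();
    let n_workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(4);

    thread::scope(|s| {
        let ex = &ex;
        for _ in 0..n_workers {
            let shutdown = shutdown.clone();
            // Each worker thread just drives the shared executor.
            s.spawn(move || {
                let _ = future::block_on(ex.run(shutdown.recv()));
            });
        }

        // Accept on one thread only (no accept() duplication); the spawned
        // connection tasks may run on any worker thread.
        future::block_on(ex.run(async {
            let listener = TcpListener::bind("127.0.0.1:9000").await.expect("bind");
            loop {
                match listener.accept().await {
                    Ok((client, _peer)) => {
                        ex.spawn(forward(client)).detach();
                    }
                    Err(e) => {
                        eprintln!("accept error: {e}");
                        break;
                    }
                }
            }
        }));

        drop(signal);
    });
}
```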
## Outcome: oi

### CPU hotspot testing

Commands:

```bash
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 1
sudo perf stat -p $(pidof oi) -e \
  cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
  -- sleep 30
```

perf report:

```text
 Performance counter stats for process id '207279':

    98,571,874,480      cpu_atom/cycles/                                             (0.10%)
   134,732,064,800      cpu_core/cycles/                                             (99.90%)
    75,889,748,906      cpu_atom/instructions/          #    0.77  insn per cycle    (0.10%)
   159,098,987,713      cpu_core/instructions/          #    1.18  insn per cycle    (99.90%)
        30,443,258      cpu_atom/cache-misses/                                       (0.10%)
         3,155,528      cpu_core/cache-misses/                                       (99.90%)
    15,003,063,317      cpu_atom/branches/                                           (0.10%)
    31,479,765,962      cpu_core/branches/                                           (99.90%)
       149,091,165      cpu_atom/branch-misses/         #    0.99% of all branches   (0.10%)
       195,562,861      cpu_core/branch-misses/         #    0.62% of all branches   (99.90%)
             1,138      context-switches
                37      cpu-migrations

      33.004738330 seconds time elapsed
```

### FlameGraph testing

Commands:

```bash
sudo perf record -F 199 -g -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```

Outcome: oi.svg

Commands:

```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi_dwarf.svg
```

Outcome: oi_dwarf.svg

### syscall-cost check

```bash
sudo strace -ff -C -p $(pidof oi) -o /tmp/oi.strace
# run 15–30s under load, then Ctrl+C
tail -n +1 /tmp/oi.strace.*
```

## More realistic setup

Traffic goes through real kernel routing + two TCP legs.

Create namespaces + veth links:

```bash
sudo ip netns add ns_client
sudo ip netns add ns_server

sudo ip link add veth_c type veth peer name veth_c_ns
sudo ip link set veth_c_ns netns ns_client
sudo ip link add veth_s type veth peer name veth_s_ns
sudo ip link set veth_s_ns netns ns_server

sudo ip addr add 10.0.1.1/24 dev veth_c
sudo ip link set veth_c up
sudo ip addr add 10.0.0.1/24 dev veth_s
sudo ip link set veth_s up

sudo ip netns exec ns_client ip addr add 10.0.1.2/24 dev veth_c_ns
sudo ip netns exec ns_client ip link set veth_c_ns up
sudo ip netns exec ns_client ip link set lo up

sudo ip netns exec ns_server ip addr add 10.0.0.2/24 dev veth_s_ns
sudo ip netns exec ns_server ip link set veth_s_ns up
sudo ip netns exec ns_server ip link set lo up

sudo sysctl -w net.ipv4.ip_forward=1
```

Config to force the redirect path:

```text
10.0.1.1 9000 10.0.0.2 9001
```

Start the backend server in ns_server:

```bash
sudo ip netns exec ns_server iperf3 -s -p 9001
```

Run the client in ns_client → forwarder → backend:

```bash
sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8
```

perf report:

```text
sudo perf stat -p $(pidof oi) -e cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations -- sleep 33

 Performance counter stats for process id '209785':

   113,810,599,893      cpu_atom/cycles/                                             (0.11%)
   164,681,878,450      cpu_core/cycles/                                             (99.89%)
   102,575,167,734      cpu_atom/instructions/          #    0.90  insn per cycle    (0.11%)
   237,094,207,911      cpu_core/instructions/          #    1.44  insn per cycle    (99.89%)
        33,093,338      cpu_atom/cache-misses/                                       (0.11%)
         5,381,441      cpu_core/cache-misses/                                       (99.89%)
    20,012,975,873      cpu_atom/branches/                                           (0.11%)
    46,120,077,111      cpu_core/branches/                                           (99.89%)
       211,767,555      cpu_atom/branch-misses/         #    1.06% of all branches   (0.11%)
       245,969,685      cpu_core/branch-misses/         #    0.53% of all branches   (99.89%)
             1,686      context-switches
               150      cpu-migrations

      33.004363800 seconds time elapsed
```

flamegraph

### Add latency + small-packet tests

netperf (request/response). Start netserver in the backend namespace:

```bash
sudo ip netns exec ns_server netserver -p 9001
```

Run TCP_RR against the forwarded port:

```bash
sudo ip netns exec ns_client netperf -H 10.0.1.1 -p 9000 -t TCP_RR -l 30 -- -r 32,32
```

## After opt

Here, we changed the futures_lite::io 8 KiB buffer to a customized 16 KiB buffer. (To avoid conflict, I changed the binary name to oiopt.)

```rust
// Assumes smol::net::TcpStream and futures_lite's AsyncReadExt/AsyncWriteExt
// are in scope, as elsewhere in the crate.
use std::io;
use std::net::Shutdown;

async fn pump(mut r: TcpStream, mut w: TcpStream) -> io::Result<u64> {
    // let's try 16KiB instead of futures_lite::io's 8KiB default
    // and do a profiling run to see the outcome
    let mut buf = vec![0u8; 16 * 1024];
    let mut total = 0u64;
    loop {
        let n = r.read(&mut buf).await?;
        if n == 0 {
            // EOF: send FIN to the peer
            let _ = w.shutdown(Shutdown::Write);
            break;
        }
        w.write_all(&buf[0..n]).await?;
        total += n as u64;
    }
    Ok(total)
}

// And change the function calls in handle_tcp_connection:
let client_to_server = pump(client_stream.clone(), server_stream.clone());
let server_to_client = pump(server_stream, client_stream);
```

### Outcomes

Still with `sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8`.

perf stat:

```text
sudo perf stat -p $(pidof oiopt) -e cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations -- sleep 33

 Performance counter stats for process id '883435':

   118,960,667,431      cpu_atom/cycles/                                             (0.05%)
   131,934,369,110      cpu_core/cycles/                                             (99.95%)
   100,530,466,140      cpu_atom/instructions/          #    0.85  insn per cycle    (0.05%)
   185,203,788,299      cpu_core/instructions/          #    1.40  insn per cycle    (99.95%)
        11,027,490      cpu_atom/cache-misses/                                       (0.05%)
         2,123,369      cpu_core/cache-misses/                                       (99.95%)
    19,641,945,774      cpu_atom/branches/                                           (0.05%)
    36,245,438,057      cpu_core/branches/                                           (99.95%)
       214,098,497      cpu_atom/branch-misses/         #    1.09% of all branches   (0.05%)
       179,848,095      cpu_core/branch-misses/         #    0.50% of all branches   (99.95%)
             2,308      context-switches
                31      cpu-migrations

      33.004555878 seconds time elapsed
```

System call check:

```bash
sudo timeout 30s strace -c -f -p $(pidof oiopt)
```

Output:

```text
strace: Process 883435 attached with 4 threads
strace: Process 883438 detached
strace: Process 883437 detached
strace: Process 883436 detached
strace: Process 883435 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 57.80   14.590016      442121        33           epoll_wait
 28.84    7.279883           4   1771146           sendto
 13.33    3.363882           1   1771212        48 recvfrom
  0.02    0.003843          61        62        44 futex
  0.01    0.001947          12       159           epoll_ctl
  0.00    0.000894          99         9         9 connect
  0.00    0.000620          34        18         9 accept4
  0.00    0.000503          14        34           timerfd_settime
  0.00    0.000446          13        33        33 read
  0.00    0.000271          15        18           ioctl
  0.00    0.000189          21         9           write
  0.00    0.000176          19         9           socket
  0.00    0.000099          11         9           getsockopt
  0.00    0.000079           4        18           shutdown
  0.00    0.000049           2        18           close
------ ----------- ----------- --------- --------- ----------------
100.00   25.242897           7   3542787       143 total
```

## Further tests to explain why the gain is so large

Changed the 16 KiB buffer to 64 KiB and named the binary oiopt64.

iperf3 throughput under `-P 8`: the highest stream reaches 54.1 Gbits/sec, and the other parallel streams are also much higher than before (with the 16 KiB buffer).

perf stat:

```text
sudo perf stat -p $(pidof oiopt64) -e cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations -- sleep 33

 Performance counter stats for process id '893123':

   120,859,810,675      cpu_atom/cycles/                                             (0.15%)
   134,735,934,329      cpu_core/cycles/                                             (99.85%)
    79,946,979,880      cpu_atom/instructions/          #    0.66  insn per cycle    (0.15%)
   127,036,644,759      cpu_core/instructions/          #    0.94  insn per cycle    (99.85%)
        24,713,474      cpu_atom/cache-misses/                                       (0.15%)
         9,604,449      cpu_core/cache-misses/                                       (99.85%)
    15,584,074,530      cpu_atom/branches/                                           (0.15%)
    24,796,180,117      cpu_core/branches/                                           (99.85%)
       175,778,825      cpu_atom/branch-misses/         #    1.13% of all branches   (0.15%)
       135,067,353      cpu_core/branch-misses/         #    0.54% of all branches   (99.85%)
             1,519      context-switches
                50      cpu-migrations

      33.006529572 seconds time elapsed
```

System call check:

```bash
sudo timeout 30s strace -c -f -p $(pidof oiopt64)
```

Output:

```text
strace: Process 893123 attached with 4 threads
strace: Process 893126 detached
strace: Process 893125 detached
strace: Process 893124 detached
strace: Process 893123 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 54.56   18.079500      463576        39           epoll_wait
 27.91    9.249443           7   1294854         2 sendto
 17.49    5.796927           4   1294919        51 recvfrom
  0.01    0.003778          50        75        49 futex
  0.01    0.002188          12       175           epoll_ctl
  0.00    0.000747          83         9         9 connect
  0.00    0.000714          17        40           timerfd_settime
  0.00    0.000510          13        39        38 read
  0.00    0.000452          25        18         9 accept4
  0.00    0.000310          17        18           ioctl
  0.00    0.000232          23        10           write
  0.00    0.000200          22         9           socket
  0.00    0.000183          20         9           getsockopt
  0.00    0.000100           5        18           shutdown
  0.00    0.000053           2        18           close
  0.00    0.000020          20         1           mprotect
  0.00    0.000015          15         1           sched_yield
  0.00    0.000005           5         1           madvise
------ ----------- ----------- --------- --------- ----------------
100.00   33.135377          12   2590253       158 total
```

These numbers explain most of the gain: over the same 30 s window the 64 KiB build issues roughly 1.29M sendto/recvfrom pairs versus roughly 1.77M for the 16 KiB build while moving more data, so each syscall carries more bytes and less time is spent crossing the user/kernel boundary.

### Cleanup

```bash
sudo ip netns del ns_client
sudo ip netns del ns_server
sudo ip link del veth_c
sudo ip link del veth_s
```