## Testing environment setup
Install tools:
```bash
sudo apt update
sudo apt install -y hyperfine heaptrack valgrind
sudo apt install -y \
build-essential clang lld pkg-config \
linux-perf \
iperf3 netperf net-tools \
tcpdump ethtool iproute2 \
bpftrace bpfcc-tools \
strace ltrace \
sysstat procps \
git perl
```
Install FlameGraph (not packaged on Debian):
```bash
git clone https://github.com/brendangregg/FlameGraph ~/FlameGraph
echo 'export PATH="$HOME/FlameGraph:$PATH"' >> ~/.bashrc
source ~/.bashrc
which flamegraph.pl
```
Modify the `Cargo.toml` of version 0.1.0 so the release build keeps symbols and line info for profiling:
```toml
[profile.release]
lto = true          # whole-program optimization, matches a realistic release build
codegen-units = 1   # single codegen unit for best codegen
debug = 1           # keep line tables so perf can map samples to source
strip = "none"      # keep symbols for readable stacks
panic = "abort"
```
Build with frame pointers to help profiling:
```bash
git clone https://github.com/DaZuo0122/oxidinetd.git
cd oxidinetd
RUSTFLAGS="-C force-frame-pointers=yes" cargo build --release
```
`profiling.conf` (listen on 127.0.0.1:9000, forward to 127.0.0.1:9001):
```text
127.0.0.1 9000 127.0.0.1 9001
```
Backend iperf3 server:
```bash
iperf3 -s -p 9001
```
Start the forwarder:
```bash
./oi -c profiling.conf
```
Drive traffic through the forwarder (one stream, then 8 parallel streams):
```bash
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 1
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 8
```
Verify both legs of the forwarding path:
```bash
sudo ss -tnp | egrep '(:9000|:9001)'
```
## Testing
CPU hotspot:
```bash
sudo perf top -p $(pidof oi)
```
If you see lots of:
- sys_read, sys_write, __x64_sys_sendto, tcp_sendmsg → syscall/copy overhead
- futex, __lll_lock_wait → contention/locks
- epoll_wait → executor wake behavior / too many idle polls
Hard numbers:
```bash
sudo perf stat -p $(pidof oi) -e \
cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
-- sleep 30
```
Big differences to watch:
- context-switches much higher on oi → too many tasks/wakers / lock contention
- instructions much higher on oi for same throughput → runtime overhead / copies
- cache-misses higher → allocations / poor locality
FlameGraph record:
```bash
sudo perf record -F 199 -g -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```
If the stack looks "flat / missing" (common with async + LTO), use DWARF unwinding:
```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```
Syscall-cost check:
```bash
sudo strace -ff -c -p $(pidof oi) -o /tmp/oi.strace
# run 15-30s under load, then Ctrl+C
tail -n +1 /tmp/oi.strace.*
```
If you see a huge % of time in read/write/sendmsg/recvmsg, you're dominated by copying + syscalls.
eBPF tools
--skipped--
## Smol-focused bottlenecks + the "fix list"
A) If you're syscall/copy bound
Best improvement candidates (both sketched below):
- buffer reuse (no per-loop `Vec` allocation)
- reduce tiny writes (coalesce)
- zero-copy splice (Linux-only; biggest win but more complex)

For Linux zero-copy, you'd implement a splice(2)-based fast path (socket → pipe → socket). That's how high-performance forwarders avoid the double copy.
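A minimal sketch of the buffer-reuse idea, assuming a futures-lite style copy loop (the function name and buffer size are illustrative, not oxidinetd's actual code):
```rust
use futures_lite::io::{AsyncRead, AsyncReadExt, AsyncWrite, AsyncWriteExt};

// One buffer per direction, allocated once per connection and reused on
// every iteration instead of building a fresh Vec inside the loop.
async fn forward_one_way<R, W>(mut reader: R, mut writer: W) -> std::io::Result<u64>
where
    R: AsyncRead + Unpin,
    W: AsyncWrite + Unpin,
{
    let mut buf = vec![0u8; 64 * 1024]; // large enough to coalesce tiny reads
    let mut total = 0u64;
    loop {
        let n = reader.read(&mut buf).await?;
        if n == 0 {
            return Ok(total); // EOF: peer closed its write side
        }
        writer.write_all(&buf[..n]).await?; // one write per read, no tiny writes
        total += n as u64;
    }
}
```
And a condensed sketch of the splice(2) fast path using the `libc` crate (the async readiness handling, pipe setup via pipe2(2), and the partial-drain loop are omitted; this is an assumption about the shape of the fix, not shipped code):
```rust
use std::io;
use std::os::unix::io::RawFd;

// Move up to `len` bytes from one socket to another through a pipe:
// socket -> pipe -> socket, with no copy through userspace buffers.
fn splice_step(from: RawFd, pipe_r: RawFd, pipe_w: RawFd, to: RawFd, len: usize) -> io::Result<usize> {
    let flags = libc::SPLICE_F_MOVE | libc::SPLICE_F_NONBLOCK;
    // Stage 1: socket -> pipe (kernel moves page references)
    let n = unsafe { libc::splice(from, std::ptr::null_mut(), pipe_w, std::ptr::null_mut(), len, flags) };
    if n < 0 {
        return Err(io::Error::last_os_error());
    }
    // Stage 2: pipe -> socket (a real forwarder loops until the pipe drains)
    let m = unsafe { libc::splice(pipe_r, std::ptr::null_mut(), to, std::ptr::null_mut(), n as usize, flags) };
    if m < 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(m as usize)
}
```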
B) If you're executor/waker bound (common for async forwarders)
Symptoms:
- perf shows a lot of runtime / wake / scheduling
- perf stat shows more context switches than rinetd

Fixes (see the sketch after this list):
- don't spawn 2 tasks per connection (one per direction) unless needed → do a single task that forwards both directions in one loop (state machine)
- avoid any shared Mutex on the hot path (logging/metrics)
- keep per-connection state minimal
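A minimal sketch of the single-task pattern with smol (the stream names and `futures_lite` wiring are assumptions, not oxidinetd's current code):
```rust
use futures_lite::{future, io};
use smol::net::TcpStream;

// One task drives both directions of a connection: no second spawn,
// no waker ping-pong between paired tasks, no shared Mutex on the hot path.
async fn forward(client: TcpStream, backend: TcpStream) -> std::io::Result<()> {
    // smol's TcpStream is cheaply cloneable (Arc inside), so each copy
    // direction gets its own handle without locking.
    let client_to_backend = io::copy(client.clone(), backend.clone());
    let backend_to_client = io::copy(backend, client);
    let (a, b) = future::zip(client_to_backend, backend_to_client).await;
    a?;
    b?;
    Ok(())
}
```
Note that `future::zip` waits for both directions to hit EOF; a production version would also shut down the peer's write side when one direction finishes.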
C) If you're single-thread limited
smol can be extremely fast, but if you're effectively running everything on one thread, throughput may cap earlier.
Fix direction (sketched below):
- move to smol::Executor + N threads (usually num_cpus)
- or run multiple block_on() workers (careful: avoid accept() duplication)
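A minimal sketch of that direction, assuming smol's `Executor` shared by N worker threads plus a single accept loop (the address and the per-connection task body are placeholders):
```rust
use std::sync::Arc;
use std::thread;
use smol::{future, net::TcpListener, Executor};

fn main() -> std::io::Result<()> {
    // One shared executor, driven by one worker thread per CPU.
    let ex = Arc::new(Executor::new());
    let threads = thread::available_parallelism().map(usize::from).unwrap_or(1);

    for _ in 0..threads {
        let ex = Arc::clone(&ex);
        // Workers park in ex.run() forever, picking up whatever task is ready.
        thread::spawn(move || future::block_on(ex.run(future::pending::<()>())));
    }

    // Exactly one accept loop, so workers never duplicate accept().
    future::block_on(ex.clone().run(async move {
        let listener = TcpListener::bind("127.0.0.1:9000").await?;
        loop {
            let (client, _) = listener.accept().await?;
            // Per-connection task on the shared executor; the backend
            // connect + bidirectional forward from section B goes here.
            ex.spawn(async move { drop(client) }).detach();
        }
    }))
}
```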
## Outcome
### CPU hotspot
Testing commands:
```bash
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 1
sudo perf stat -p $(pidof oi) -e \
cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
-- sleep 30
```
perf stat output:
```text
 Performance counter stats for process id '207279':

    98,571,874,480      cpu_atom/cycles/                                          (0.10%)
   134,732,064,800      cpu_core/cycles/                                          (99.90%)
    75,889,748,906      cpu_atom/instructions/        #  0.77 insn per cycle      (0.10%)
   159,098,987,713      cpu_core/instructions/        #  1.18 insn per cycle      (99.90%)
        30,443,258      cpu_atom/cache-misses/                                    (0.10%)
         3,155,528      cpu_core/cache-misses/                                    (99.90%)
    15,003,063,317      cpu_atom/branches/                                        (0.10%)
    31,479,765,962      cpu_core/branches/                                        (99.90%)
       149,091,165      cpu_atom/branch-misses/       #  0.99% of all branches    (0.10%)
       195,562,861      cpu_core/branch-misses/       #  0.62% of all branches    (99.90%)
             1,138      context-switches
                37      cpu-migrations

      33.004738330 seconds time elapsed
```
### FlameGraph
Testing commands:
```bash
sudo perf record -F 199 -g -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```
Outcome: `oi.svg`
DWARF call-graph commands:
```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi_dwarf.svg
```
Outcome: `oi_dwarf.svg`
### Syscall-cost check
```bash
sudo strace -ff -c -p $(pidof oi) -o /tmp/oi.strace
# run 15-30s under load, then Ctrl+C
tail -n +1 /tmp/oi.strace.*
```