Update drafts/2026-01-24-profiling-rust-written-network-program.md
@@ -1,193 +1,261 @@
## Testing environment setup

Install tools:

```bash
sudo apt update

sudo apt install -y hyperfine heaptrack valgrind

sudo apt install -y \
  build-essential clang lld pkg-config \
  linux-perf \
  iperf3 netperf net-tools \
  tcpdump ethtool iproute2 \
  bpftrace bpfcc-tools \
  strace ltrace \
  sysstat procps \
  git perl
```

Install FlameGraph (not shipped on Debian):

```bash
git clone https://github.com/brendangregg/FlameGraph ~/FlameGraph

echo 'export PATH="$HOME/FlameGraph:$PATH"' >> ~/.bashrc

source ~/.bashrc

which flamegraph.pl
```

Modify the `Cargo.toml` of version 0.1.0 so the release build keeps debug info and symbols for the profiler:

```toml
[profile.release]
lto = true
codegen-units = 1
debug = 1
strip = "none"
panic = "abort"
```

Build with frame pointers to help profiling:

```bash
git clone https://github.com/DaZuo0122/oxidinetd.git
cd oxidinetd

RUSTFLAGS="-C force-frame-pointers=yes" cargo build --release
```

`profiling.conf`:

```text
127.0.0.1 9000 127.0.0.1 9001
```
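
The single line in `profiling.conf` appears to follow the rinetd-style layout `bind-address bind-port connect-address connect-port`, which matches the test setup here (the iperf3 client connects to port 9000, the iperf3 server listens on 9001). As a rough illustration of that reading, and not oxidinetd's actual parser, one line could be parsed like this. `Rule` and `parse_rule` are hypothetical names, and the sketch only handles IPv4 literals:

```rust
use std::net::SocketAddr;

/// One forwarding rule: accept on `bind`, connect onward to `target`.
/// Hypothetical type for illustration; not taken from oxidinetd.
#[derive(Debug, PartialEq)]
struct Rule {
    bind: SocketAddr,
    target: SocketAddr,
}

/// Parses `bind_addr bind_port target_addr target_port` (IPv4 only in this sketch).
fn parse_rule(line: &str) -> Result<Rule, String> {
    let fields: Vec<&str> = line.split_whitespace().collect();
    if fields.len() != 4 {
        return Err(format!("expected 4 fields, got {}", fields.len()));
    }
    // Join host and port, then let the standard library validate both.
    let addr = |host: &str, port: &str| -> Result<SocketAddr, String> {
        format!("{host}:{port}")
            .parse::<SocketAddr>()
            .map_err(|e| format!("bad address {host}:{port}: {e}"))
    };
    Ok(Rule {
        bind: addr(fields[0], fields[1])?,
        target: addr(fields[2], fields[3])?,
    })
}
```

Calling `parse_rule("127.0.0.1 9000 127.0.0.1 9001")` yields a rule that binds on port 9000 and targets port 9001; a line with the wrong number of fields or a non-numeric port returns an error.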

Backend iperf3 server:

```bash
iperf3 -s -p 9001
```

Forwarder:

```bash
./oi -c profiling.conf
```

Generate traffic through the forwarder (one stream, then eight parallel streams):

```bash
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 1
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 8
```

Verify that connections exist on both ports:

```bash
sudo ss -tnp | grep -E '(:9000|:9001)'
```

## Testing

CPU hotspot:

```bash
sudo perf top -p $(pidof oi)
```

If you see lots of:

- sys_read, sys_write, __x64_sys_sendto, tcp_sendmsg → syscall/copy overhead

- futex, __lll_lock_wait → contention/locks

- epoll_wait → executor wake behavior / too many idle polls

Hard numbers:

```bash
sudo perf stat -p $(pidof oi) -e \
  cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
  -- sleep 30
```

Big differences to watch when comparing oi against a baseline such as rinetd:

- context-switches much higher on oi → too many tasks/wakers / lock contention

- instructions much higher on oi for same throughput → runtime overhead / copies

- cache-misses higher → allocations / poor locality

Flamegraph

Record:

```bash
sudo perf record -F 199 -g -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```

If the stack looks “flat / missing” (common with async + LTO), use DWARF unwinding and write to a separate file so the first graph is not overwritten:

```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi_dwarf.svg
```

Syscall-cost check:

```bash
sudo strace -ff -c -p $(pidof oi) -o /tmp/oi.strace
# run 15–30s under load, then Ctrl+C
tail -n +1 /tmp/oi.strace.*
```

If you see a huge % of time in read/write/sendmsg/recvmsg, you’re dominated by copying + syscalls.

eBPF stuff: skipped.

Smol-focused bottlenecks + the “fix list”

A) If you’re syscall/copy bound

Best improvement candidates:

- buffer reuse (no per-loop `Vec` allocation)

- reduce tiny writes (coalesce)

- zero-copy splice (Linux-only, biggest win but more complex)

For Linux zero-copy, you’d implement a `splice(2)`-based fast path (socket→pipe→socket). That’s how high-performance forwarders avoid double-copy.

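The buffer-reuse item can be sketched with the standard library alone. This is a blocking illustration of the idea, not oxidinetd's code and not smol's async API: the caller allocates one buffer per connection and the loop reuses it for every read, so no `Vec` is allocated inside the loop. `forward_with_reused_buf` is a hypothetical name.

```rust
use std::io::{self, Read, Write};

/// Copies everything from `src` to `dst` through one caller-provided buffer.
/// Returns the total number of bytes forwarded.
fn forward_with_reused_buf<R: Read, W: Write>(
    src: &mut R,
    dst: &mut W,
    buf: &mut [u8],
) -> io::Result<u64> {
    let mut total = 0u64;
    loop {
        let n = match src.read(buf) {
            Ok(0) => break, // EOF: the peer closed its side
            Ok(n) => n,
            // A signal interrupted the call; retry instead of failing.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        // write_all loops over short writes until the whole chunk is sent.
        dst.write_all(&buf[..n])?;
        total += n as u64;
    }
    dst.flush()?;
    Ok(total)
}
```

A caller would allocate the buffer once, for example `let mut buf = vec![0u8; 64 * 1024];`, and pass `&mut buf` for the lifetime of the connection. The 64 KiB size is an arbitrary starting point for this sketch; a larger buffer also means fewer, bigger writes, which touches the "reduce tiny writes" item as well.
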
B) If you’re executor/waker bound (common for async forwarders)

Symptoms:

- perf shows a lot of runtime / wake / scheduling

- perf stat shows more context switches than rinetd

Fixes:

- don’t spawn 2 tasks per connection (one per direction) unless needed → do a single task that forwards both directions in one loop (state machine)

- avoid any shared `Mutex` on the hot path (logging/metrics)

- keep per-conn state minimal

C) If you’re single-thread limited

smol can be extremely fast, but if you’re effectively running everything on one thread, throughput may cap earlier.

Fix direction:

- move to `smol::Executor` + N threads (usually num_cpus)

- or run multiple `block_on()` workers (careful: avoid accept() duplication)

## Outcome

### CPU hotspot

Testing commands:

```bash
# terminal 1: generate load through the forwarder
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 1

# terminal 2: collect counters while the load is running
sudo perf stat -p $(pidof oi) -e \
  cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
  -- sleep 30
```

`perf stat` output:

```text
 Performance counter stats for process id '207279':

    98,571,874,480      cpu_atom/cycles/                                          (0.10%)
   134,732,064,800      cpu_core/cycles/                                          (99.90%)
    75,889,748,906      cpu_atom/instructions/    #  0.77  insn per cycle         (0.10%)
   159,098,987,713      cpu_core/instructions/    #  1.18  insn per cycle         (99.90%)
        30,443,258      cpu_atom/cache-misses/                                    (0.10%)
         3,155,528      cpu_core/cache-misses/                                    (99.90%)
    15,003,063,317      cpu_atom/branches/                                        (0.10%)
    31,479,765,962      cpu_core/branches/                                        (99.90%)
       149,091,165      cpu_atom/branch-misses/   #  0.99% of all branches        (0.10%)
       195,562,861      cpu_core/branch-misses/   #  0.62% of all branches        (99.90%)
             1,138      context-switches
                37      cpu-migrations

      33.004738330 seconds time elapsed
```

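To turn the raw counters above into the ratios the "big differences" list talks about, the arithmetic is just a few divisions. The sketch below uses only the `cpu_core` rows, on the assumption that the `(99.90%)` annotation means those counters were active for nearly the whole run while the `cpu_atom` values were scaled up from a 0.10% slice and are less trustworthy. All function and constant names are made up for this example.

```rust
// Values copied from the cpu_core rows of the perf stat output above.
const CORE_CYCLES: u64 = 134_732_064_800;
const CORE_INSTRUCTIONS: u64 = 159_098_987_713;
const CORE_BRANCHES: u64 = 31_479_765_962;
const CORE_BRANCH_MISSES: u64 = 195_562_861;
const CONTEXT_SWITCHES: u64 = 1_138;
const ELAPSED_SECS: f64 = 33.004738330;

/// Instructions retired per cycle.
fn ipc(instructions: u64, cycles: u64) -> f64 {
    instructions as f64 / cycles as f64
}

/// `part` as a percentage of `whole`.
fn percent(part: u64, whole: u64) -> f64 {
    100.0 * part as f64 / whole as f64
}

/// Events per second of wall-clock time.
fn per_second(count: u64, elapsed_secs: f64) -> f64 {
    count as f64 / elapsed_secs
}

/// One-line summary of the derived ratios for this run.
fn summary() -> String {
    format!(
        "IPC {:.2}, branch-miss {:.2}%, ctx-switch/s {:.1}",
        ipc(CORE_INSTRUCTIONS, CORE_CYCLES),
        percent(CORE_BRANCH_MISSES, CORE_BRANCHES),
        per_second(CONTEXT_SWITCHES, ELAPSED_SECS),
    )
}
```

For this run the IPC and branch-miss figures agree with perf's own annotations (1.18 and 0.62%), and the context-switch rate works out to roughly 34 per second. Whether that rate is high or low for a forwarder can only be judged against a baseline run such as rinetd under the same load, which this draft does not include yet.
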
### FlameGraph

Testing commands:

```bash
sudo perf record -F 199 -g -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```

Outcome: `oi.svg`

Commands (DWARF unwinding):

```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi_dwarf.svg
```

Outcome: `oi_dwarf.svg`

### Syscall-cost check

```bash
sudo strace -ff -c -p $(pidof oi) -o /tmp/oi.strace
# run 15–30s under load, then Ctrl+C
tail -n +1 /tmp/oi.strace.*
```