## Testing environment setup
Install tools:
```bash
sudo apt update
sudo apt install -y hyperfine heaptrack valgrind
sudo apt install -y \
  build-essential clang lld pkg-config \
  linux-perf \
  iperf3 netperf net-tools \
  tcpdump ethtool iproute2 \
  bpftrace bpfcc-tools \
  strace ltrace \
  sysstat procps \
  git perl
```
Install FlameGraph (not shipped on Debian):
```bash
git clone https://github.com/brendangregg/FlameGraph ~/FlameGraph
echo 'export PATH="$HOME/FlameGraph:$PATH"' >> ~/.bashrc
source ~/.bashrc
which flamegraph.pl
```
Modify the Cargo.toml of version 0.1.0:
```toml
[profile.release]
lto = true
codegen-units = 1
debug = 1
strip = "none"
panic = "abort"
```
Build with frame pointers to help profiling:
```bash
git clone https://github.com/DaZuo0122/oxidinetd.git
cd oxidinetd  # run cargo from inside the cloned repository
RUSTFLAGS="-C force-frame-pointers=yes" cargo build --release
```
`profiling.conf`:
```yaml
127.0.0.1 9000 127.0.0.1 9001
```
Backend iperf3 server:
```bash
iperf3 -s -p 9001
```
Forwarder:
```bash
./oi -c profiling.conf
```
Trigger the redirect (first with one stream, then with eight parallel streams):
```bash
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 1
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 8
```
Verification:
```bash
sudo ss -tnp | grep -E '(:9000|:9001)'
```
## Testing
CPU hotspot:
```bash
sudo perf top -p $(pidof oi)
```
If you see lots of:
- sys_read, sys_write, __x64_sys_sendto, tcp_sendmsg → syscall/copy overhead
- futex, __lll_lock_wait → contention/locks
- epoll_wait → executor wake behavior / too many idle polls
Hard numbers:
```bash
sudo perf stat -p $(pidof oi) -e \
  cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
  -- sleep 30
```
Big differences to watch:
- context-switches much higher on oi → too many tasks/wakers / lock contention
- instructions much higher on oi for same throughput → runtime overhead / copies
- cache-misses higher → allocations / poor locality
Flamegraph
Record:
```bash
sudo perf record -F 199 -g -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```
If the stack looks “flat / missing” (common with async + LTO), use DWARF unwinding:
```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```
Syscall-cost check:
```bash
sudo strace -ff -c -p $(pidof oi) -o /tmp/oi.strace
# run 15-30s under load, then Ctrl+C
tail -n +1 /tmp/oi.strace.*
```
If you see a huge % of time in read/write/sendmsg/recvmsg, you're dominated by copying + syscalls.
eBPF stuff
--skipped--
Smol-focused bottlenecks + the “fix list”
A) If you're syscall/copy bound
Best improvement candidates:
- buffer reuse (no per-loop Vec allocation)
- reduce tiny writes (coalesce)
- zero-copy splice (Linux-only, biggest win but more complex)
For Linux zero-copy, you'd implement a splice(2)-based fast path (socket→pipe→socket). That's how high-performance forwarders avoid double-copy.
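The buffer-reuse item can be sketched as follows. This is a minimal illustration, not oi's actual code: it uses the blocking `std::io` traits to stay dependency-free, and the function name `forward_with_buffer` and the buffer size in the usage note are assumptions made for the example. The same shape (one buffer per connection, allocated outside the loop) should carry over to an async read/write loop.

```rust
use std::io::{self, Read, Write};

/// Copy everything from `src` to `dst` through one caller-owned buffer.
/// The buffer is allocated once (by the caller) and reused on every
/// iteration, so the hot loop performs no heap allocation.
/// `forward_with_buffer` is a hypothetical name, not part of oi.
pub fn forward_with_buffer<R: Read, W: Write>(
    src: &mut R,
    dst: &mut W,
    buf: &mut [u8],
) -> io::Result<u64> {
    let mut total = 0u64;
    loop {
        let n = match src.read(buf) {
            Ok(0) => break, // EOF: the peer closed its write side
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        // write_all retries short writes, so each read maps to one logical write.
        dst.write_all(&buf[..n])?;
        total += n as u64;
    }
    dst.flush()?;
    Ok(total)
}
```

A caller would allocate the buffer once per connection, for example `let mut buf = vec![0u8; 16 * 1024];`, and pass `&mut buf` for the lifetime of that connection. A larger buffer also reduces the number of small writes, which addresses the coalescing item at the same time.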
B) If you're executor/waker bound (common for async forwarders)
Symptoms:
- perf shows a lot of runtime / wake / scheduling
- perf stat shows more context switches than rinetd
Fixes:
- don't spawn 2 tasks per connection (one per direction) unless needed → do a single task that forwards both directions in one loop (state machine)
- avoid any shared Mutex on hot path (logging/metrics)
- keep per-conn state minimal
C) If you're single-thread limited
smol can be extremely fast, but if you're effectively running everything on one thread, throughput may cap earlier.
Fix direction:
- move to smol::Executor + N threads (usually num_cpus)
- or run multiple block_on() workers (careful: avoid accept() duplication)
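The "one acceptor, N workers" shape can be illustrated without any async runtime. The sketch below deliberately swaps smol for plain `std::thread` and `std::sync::mpsc`, so it shows only the dispatch pattern, not smol's API; `worker_count` and `dispatch_round_robin` are hypothetical helper names. Only the calling thread pulls items from `incoming`, which stands in for a single accept loop, so no connection is handed out twice.

```rust
use std::sync::mpsc;
use std::thread;

/// Number of worker threads: one per available CPU, falling back to 1.
pub fn worker_count() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

/// Hand each item from `incoming` to one of `workers` threads, round-robin.
/// Returns how many items each worker handled, in worker order.
pub fn dispatch_round_robin<T, I, F>(incoming: I, workers: usize, handle: F) -> Vec<usize>
where
    T: Send + 'static,
    I: IntoIterator<Item = T>,
    F: Fn(T) + Send + Clone + 'static,
{
    let workers = workers.max(1);
    let mut senders = Vec::with_capacity(workers);
    let mut joins = Vec::with_capacity(workers);
    for _ in 0..workers {
        let (tx, rx) = mpsc::channel::<T>();
        let handle = handle.clone();
        joins.push(thread::spawn(move || {
            let mut handled = 0usize;
            // The loop ends once the sender side of this channel is dropped.
            for item in rx {
                handle(item);
                handled += 1;
            }
            handled
        }));
        senders.push(tx);
    }
    for (i, item) in incoming.into_iter().enumerate() {
        // A send only fails if that worker has already exited; drop the item then.
        let _ = senders[i % workers].send(item);
    }
    drop(senders); // close all channels so the workers can finish
    joins.into_iter().map(|j| j.join().unwrap_or(0)).collect()
}
```

In a real forwarder the items would be accepted connections and `handle` would run the per-connection forwarding loop. Round-robin is the simplest policy; it ignores how busy each worker is, which may matter when connections differ a lot in volume.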
## Outcome
### CPU hotspot
testing commands:
```bash
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 1
sudo perf stat -p $(pidof oi) -e \
cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
-- sleep 30
```
perf stat output:
```text
Performance counter stats for process id '207279':
98,571,874,480 cpu_atom/cycles/ (0.10%)
134,732,064,800 cpu_core/cycles/ (99.90%)
75,889,748,906 cpu_atom/instructions/ # 0.77 insn per cycle (0.10%)
159,098,987,713 cpu_core/instructions/ # 1.18 insn per cycle (99.90%)
30,443,258 cpu_atom/cache-misses/ (0.10%)
3,155,528 cpu_core/cache-misses/ (99.90%)
15,003,063,317 cpu_atom/branches/ (0.10%)
31,479,765,962 cpu_core/branches/ (99.90%)
149,091,165 cpu_atom/branch-misses/ # 0.99% of all branches (0.10%)
195,562,861 cpu_core/branch-misses/ # 0.62% of all branches (99.90%)
1,138 context-switches
37 cpu-migrations
33.004738330 seconds time elapsed
```
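The derived figures in the output above (instructions per cycle, branch-miss percentage) can be re-derived from the raw counters, which is handy when comparing runs by hand. The following is a small sketch; the function and constant names are made up for this example, and the constants are copied from the perf stat output above.

```rust
/// Instructions retired per CPU cycle.
pub fn ipc(instructions: u64, cycles: u64) -> f64 {
    instructions as f64 / cycles as f64
}

/// Branch misses as a percentage of all branches.
pub fn branch_miss_pct(misses: u64, branches: u64) -> f64 {
    100.0 * misses as f64 / branches as f64
}

/// Events per second of wall-clock time.
pub fn per_second(events: u64, elapsed_secs: f64) -> f64 {
    events as f64 / elapsed_secs
}

// cpu_core rows from the perf stat output above.
pub const CORE_CYCLES: u64 = 134_732_064_800;
pub const CORE_INSTRUCTIONS: u64 = 159_098_987_713;
pub const CORE_BRANCHES: u64 = 31_479_765_962;
pub const CORE_BRANCH_MISSES: u64 = 195_562_861;

// cpu_atom rows from the same output.
pub const ATOM_CYCLES: u64 = 98_571_874_480;
pub const ATOM_INSTRUCTIONS: u64 = 75_889_748_906;

// Software counters and elapsed time from the same output.
pub const CONTEXT_SWITCHES: u64 = 1_138;
pub const ELAPSED_SECS: f64 = 33.004738330;
```

With these inputs, `ipc(CORE_INSTRUCTIONS, CORE_CYCLES)` rounds to the 1.18 that perf prints, `branch_miss_pct(CORE_BRANCH_MISSES, CORE_BRANCHES)` rounds to 0.62, and `per_second(CONTEXT_SWITCHES, ELAPSED_SECS)` comes out between 34 and 35 context switches per second.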
### FlameGraph
testing commands:
```bash
sudo perf record -F 199 -g -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```
outcome:
oi.svg
commands:
```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi_dwarf.svg
```
outcome:
oi_dwarf.svg
### syscall-cost check
```bash
sudo strace -ff -c -p $(pidof oi) -o /tmp/oi.strace
# run 15-30s under load, then Ctrl+C
tail -n +1 /tmp/oi.strace.*
```