Update: change the buffer from 8 KiB to 16 KiB and finally 64 KiB; discovered more things than expected.

This commit is contained in:
2026-01-27 23:23:44 +08:00
parent fdcf5838b3
commit ba61eb37b9


## Testing environment setup
Install tools:
```bash
sudo apt update
sudo apt install -y hyperfine heaptrack valgrind
sudo apt install -y \
  build-essential clang lld pkg-config \
  linux-perf \
  iperf3 netperf net-tools \
  tcpdump ethtool iproute2 \
  bpftrace bpfcc-tools \
  strace ltrace \
  sysstat procps \
  git perl
```
Install FlameGraph (not packaged on Debian):
```bash
git clone https://github.com/brendangregg/FlameGraph ~/FlameGraph
echo 'export PATH="$HOME/FlameGraph:$PATH"' >> ~/.bashrc
source ~/.bashrc
which flamegraph.pl
```
Modify the Cargo.toml of version 0.1.0:
```toml
[profile.release]
lto = true
codegen-units = 1
debug = 1
strip = "none"
panic = "abort"
```
Build with frame pointers to help profiling:
```bash
git clone https://github.com/DaZuo0122/oxidinetd.git
RUSTFLAGS="-C force-frame-pointers=yes" cargo build --release
```
`profiling.conf` (one forward rule per line: bind address, bind port, connect address, connect port):
```yaml
127.0.0.1 9000 127.0.0.1 9001
```
Backend iperf3 server:
```bash
iperf3 -s -p 9001
```
Forwarder:
```bash
./oi -c profiling.conf
```
Trigger the redirect:
```bash
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 1
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 8
```
Verification:
```bash
sudo ss -tnp | egrep '(:9000|:9001)'
```
## Testing
CPU hotspot:
```bash
sudo perf top -p $(pidof oi)
```
If you see lots of:
- sys_read, sys_write, __x64_sys_sendto, tcp_sendmsg → syscall/copy overhead
- futex, __lll_lock_wait → contention/locks
- epoll_wait → executor wake behavior / too many idle polls

Hard numbers:
```bash
sudo perf stat -p $(pidof oi) -e \
  cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
  -- sleep 30
```
Big differences to watch:
- context-switches much higher on oi → too many tasks/wakers / lock contention
- instructions much higher on oi for same throughput → runtime overhead / copies
- cache-misses higher → allocations / poor locality

Flamegraph
Record:
```bash
sudo perf record -F 199 -g -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```
If the stack looks "flat / missing" (common with async + LTO), use DWARF unwinding:
```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```
syscall-cost check:
```bash
sudo strace -ff -c -p $(pidof oi) -o /tmp/oi.strace
# run 15-30s under load, then Ctrl+C
tail -n +1 /tmp/oi.strace.*
```
If you see a huge % of time in read/write/sendmsg/recvmsg, you're dominated by copying + syscalls.
ebpf stuff
--skipped--

Smol-focused bottlenecks + the "fix list"

A) If you're syscall/copy bound
Best improvement candidates:
- buffer reuse (no per-loop Vec allocation)
- reduce tiny writes (coalesce)
- zero-copy splice (Linux-only, biggest win but more complex)

For Linux zero-copy, you'd implement a splice(2)-based fast path (socket → pipe → socket). That's how high-performance forwarders avoid the double copy.
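A rough sketch of what that fast path could look like, not taken from the repo: it assumes the `libc` crate, blocking sockets addressed by raw fds, and a pre-created anonymous pipe (e.g. via `libc::pipe2`). Wiring it into smol's nonblocking sockets would additionally need `SPLICE_F_NONBLOCK` plus readiness waits.
```rust
use std::io;
use std::os::fd::RawFd;

// Move up to `len` bytes from socket `from` to socket `to` through the pipe
// (`pipe_rd`/`pipe_wr`) without copying the payload into user space.
fn splice_once(from: RawFd, to: RawFd, pipe_rd: RawFd, pipe_wr: RawFd, len: usize) -> io::Result<usize> {
    // socket -> pipe
    let n = unsafe {
        libc::splice(from, std::ptr::null_mut(), pipe_wr, std::ptr::null_mut(), len, libc::SPLICE_F_MOVE)
    };
    if n < 0 {
        return Err(io::Error::last_os_error());
    }
    if n == 0 {
        return Ok(0); // EOF on the source socket
    }
    // pipe -> socket (may take several calls to drain what we buffered)
    let mut left = n as usize;
    while left > 0 {
        let m = unsafe {
            libc::splice(pipe_rd, std::ptr::null_mut(), to, std::ptr::null_mut(), left, libc::SPLICE_F_MOVE)
        };
        if m < 0 {
            return Err(io::Error::last_os_error());
        }
        left -= m as usize;
    }
    Ok(n as usize)
}
```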
B) If you're executor/waker bound (common for async forwarders)
Symptoms:
- perf shows a lot of runtime / wake / scheduling
- perf stat shows more context switches than rinetd

Fixes:
- don't spawn 2 tasks per connection (one per direction) unless needed → do a single task that forwards both directions (state machine); see the sketch after this list
- avoid any shared Mutex on the hot path (logging/metrics)
- keep per-connection state minimal
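A lighter-weight step toward the first fix, sketched under the assumption that the `pump` helper shown later in this doc and smol's global executor are available: drive both directions from one spawned task instead of spawning a task per direction.
```rust
use futures_lite::future;

// One task per connection: both pump futures are polled by the same task,
// halving the number of spawned tasks per connection.
smol::spawn(async move {
    let c2s = pump(client_stream.clone(), server_stream.clone());
    let s2c = pump(server_stream, client_stream);
    let (_up, _down) = future::zip(c2s, s2c).await;
})
.detach();
```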
C) If you're single-thread limited
smol can be extremely fast, but if you're effectively running everything on one thread, throughput may cap earlier.
Fix direction:
- move to smol::Executor + N threads (usually num_cpus)
- or run multiple block_on() workers (careful: avoid accept() duplication)
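A minimal sketch of the first direction, assuming the smol and futures-lite crates and a hypothetical handle_tcp_connection: one shared smol::Executor driven by N worker threads, with a single accept loop so connections are never accepted twice.
```rust
use std::sync::Arc;
use std::thread;

fn main() -> std::io::Result<()> {
    let ex = Arc::new(smol::Executor::new());

    // N worker threads, each driving the same executor forever.
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(4);
    for _ in 0..workers {
        let ex = ex.clone();
        thread::spawn(move || {
            futures_lite::future::block_on(ex.run(futures_lite::future::pending::<()>()))
        });
    }

    // Single accept loop; per-connection tasks are distributed across the pool.
    futures_lite::future::block_on(ex.run(async {
        let listener = smol::net::TcpListener::bind("0.0.0.0:9000").await?;
        loop {
            let (client, _peer) = listener.accept().await?;
            ex.spawn(async move {
                // e.g. handle_tcp_connection(client).await
                let _ = client;
            })
            .detach();
        }
    }))
}
```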
## Outcome: oi
### CPU hotspot
testing commands:
```bash
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 1
sudo perf stat -p $(pidof oi) -e \
cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
-- sleep 30
```
perf report:
```text
Performance counter stats for process id '207279':
98,571,874,480 cpu_atom/cycles/ (0.10%)
134,732,064,800 cpu_core/cycles/ (99.90%)
75,889,748,906 cpu_atom/instructions/ # 0.77 insn per cycle (0.10%)
159,098,987,713 cpu_core/instructions/ # 1.18 insn per cycle (99.90%)
30,443,258 cpu_atom/cache-misses/ (0.10%)
3,155,528 cpu_core/cache-misses/ (99.90%)
15,003,063,317 cpu_atom/branches/ (0.10%)
31,479,765,962 cpu_core/branches/ (99.90%)
149,091,165 cpu_atom/branch-misses/ # 0.99% of all branches (0.10%)
195,562,861 cpu_core/branch-misses/ # 0.62% of all branches (99.90%)
1,138 context-switches
37 cpu-migrations
33.004738330 seconds time elapsed
```
### FlameGraph
testing commands:
```bash
sudo perf record -F 199 -g -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```
outcome:
oi.svg
commands:
```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi_dwarf.svg
```
outcome:
oi_dwarf.svg
### syscall-cost check
```bash
sudo strace -ff -C -p $(pidof oi) -o /tmp/oi.strace
# run 15-30s under load, then Ctrl+C
tail -n +1 /tmp/oi.strace.*
```
## More realistic setup
Traffic goes through real kernel routing plus two TCP legs.
Create namespaces + veth links:
```bash
sudo ip netns add ns_client
sudo ip netns add ns_server
sudo ip link add veth_c type veth peer name veth_c_ns
sudo ip link set veth_c_ns netns ns_client
sudo ip link add veth_s type veth peer name veth_s_ns
sudo ip link set veth_s_ns netns ns_server
sudo ip addr add 10.0.1.1/24 dev veth_c
sudo ip link set veth_c up
sudo ip addr add 10.0.0.1/24 dev veth_s
sudo ip link set veth_s up
sudo ip netns exec ns_client ip addr add 10.0.1.2/24 dev veth_c_ns
sudo ip netns exec ns_client ip link set veth_c_ns up
sudo ip netns exec ns_client ip link set lo up
sudo ip netns exec ns_server ip addr add 10.0.0.2/24 dev veth_s_ns
sudo ip netns exec ns_server ip link set veth_s_ns up
sudo ip netns exec ns_server ip link set lo up
sudo sysctl -w net.ipv4.ip_forward=1
```
Config to force redirect path:
```yaml
10.0.1.1 9000 10.0.0.2 9001
```
Start backend server in ns_server:
```bash
sudo ip netns exec ns_server iperf3 -s -p 9001
```
Run client in ns_client → forwarder → backend:
```bash
sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8
```
perf report:
```text
sudo perf stat -p $(pidof oi) -e cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations -- sleep 33
Performance counter stats for process id '209785':
113,810,599,893 cpu_atom/cycles/ (0.11%)
164,681,878,450 cpu_core/cycles/ (99.89%)
102,575,167,734 cpu_atom/instructions/ # 0.90 insn per cycle (0.11%)
237,094,207,911 cpu_core/instructions/ # 1.44 insn per cycle (99.89%)
33,093,338 cpu_atom/cache-misses/ (0.11%)
5,381,441 cpu_core/cache-misses/ (99.89%)
20,012,975,873 cpu_atom/branches/ (0.11%)
46,120,077,111 cpu_core/branches/ (99.89%)
211,767,555 cpu_atom/branch-misses/ # 1.06% of all branches (0.11%)
245,969,685 cpu_core/branch-misses/ # 0.53% of all branches (99.89%)
1,686 context-switches
150 cpu-migrations
33.004363800 seconds time elapsed
```
flamegraph
### Add latency + small-packet tests
netperf (request/response)
Start netserver in backend namespace:
```bash
sudo ip netns exec ns_server netserver -p 9001
```
Run TCP_RR against forwarded port:
```bash
sudo ip netns exec ns_client netperf -H 10.0.1.1 -p 9000 -t TCP_RR -l 30 -- -r 32,32
```
## After optimization
Here we replaced the futures_lite::io 8 KiB copy buffer with a customized 16 KiB buffer. (To avoid conflicts, the binary is renamed to oiopt.)
```rust
// Assumed imports for this excerpt (smol's TcpStream + futures-lite's IO traits):
use std::io;
use std::net::Shutdown;
use futures_lite::io::{AsyncReadExt, AsyncWriteExt};
use smol::net::TcpStream;

// Copy bytes from `r` to `w` until EOF, reusing one 16 KiB buffer per direction.
async fn pump(mut r: TcpStream, mut w: TcpStream) -> io::Result<u64> {
    // let's try 16 KiB instead of futures_lite::io's 8 KiB
    // and do a profiling run to see the outcome
    let mut buf = vec![0u8; 16 * 1024];
    let mut total = 0u64;
    loop {
        let n = r.read(&mut buf).await?;
        if n == 0 {
            // EOF: send FIN to the peer
            let _ = w.shutdown(Shutdown::Write);
            break;
        }
        w.write_all(&buf[0..n]).await?;
        total += n as u64;
    }
    Ok(total)
}

// And change the function calls in handle_tcp_connection:
let client_to_server = pump(client_stream.clone(), server_stream.clone());
let server_to_client = pump(server_stream, client_stream);
```
### outcomes
Still with `sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8`
perf stat:
```text
sudo perf stat -p $(pidof oiopt) -e cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations -- sleep 33
Performance counter stats for process id '883435':
118,960,667,431 cpu_atom/cycles/ (0.05%)
131,934,369,110 cpu_core/cycles/ (99.95%)
100,530,466,140 cpu_atom/instructions/ # 0.85 insn per cycle (0.05%)
185,203,788,299 cpu_core/instructions/ # 1.40 insn per cycle (99.95%)
11,027,490 cpu_atom/cache-misses/ (0.05%)
2,123,369 cpu_core/cache-misses/ (99.95%)
19,641,945,774 cpu_atom/branches/ (0.05%)
36,245,438,057 cpu_core/branches/ (99.95%)
214,098,497 cpu_atom/branch-misses/ # 1.09% of all branches (0.05%)
179,848,095 cpu_core/branch-misses/ # 0.50% of all branches (99.95%)
2,308 context-switches
31 cpu-migrations
33.004555878 seconds time elapsed
```
system call check:
```bash
sudo timeout 30s strace -c -f -p $(pidof oiopt)
```
output:
```text
strace: Process 883435 attached with 4 threads
strace: Process 883438 detached
strace: Process 883437 detached
strace: Process 883436 detached
strace: Process 883435 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
57.80 14.590016 442121 33 epoll_wait
28.84 7.279883 4 1771146 sendto
13.33 3.363882 1 1771212 48 recvfrom
0.02 0.003843 61 62 44 futex
0.01 0.001947 12 159 epoll_ctl
0.00 0.000894 99 9 9 connect
0.00 0.000620 34 18 9 accept4
0.00 0.000503 14 34 timerfd_settime
0.00 0.000446 13 33 33 read
0.00 0.000271 15 18 ioctl
0.00 0.000189 21 9 write
0.00 0.000176 19 9 socket
0.00 0.000099 11 9 getsockopt
0.00 0.000079 4 18 shutdown
0.00 0.000049 2 18 close
------ ----------- ----------- --------- --------- ----------------
100.00 25.242897 7 3542787 143 total
```
## Further tests to explain why the gap is this large
Changed the 16 KiB buffer to 64 KiB and named the binary oiopt64.
iperf3 throughput under `-P 8`: the fastest stream reached 54.1 Gbits/sec, and the other streams are also much higher than before (16 KiB buffer).
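For reference, the only code change from oiopt is the buffer size inside `pump` (a sketch restating the change described above):
```rust
// oiopt64: identical pump loop, only the per-direction buffer grows to 64 KiB
let mut buf = vec![0u8; 64 * 1024];
```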
perf stat:
```text
sudo perf stat -p $(pidof oiopt64) -e cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations -- sleep 33
Performance counter stats for process id '893123':
120,859,810,675 cpu_atom/cycles/ (0.15%)
134,735,934,329 cpu_core/cycles/ (99.85%)
79,946,979,880 cpu_atom/instructions/ # 0.66 insn per cycle (0.15%)
127,036,644,759 cpu_core/instructions/ # 0.94 insn per cycle (99.85%)
24,713,474 cpu_atom/cache-misses/ (0.15%)
9,604,449 cpu_core/cache-misses/ (99.85%)
15,584,074,530 cpu_atom/branches/ (0.15%)
24,796,180,117 cpu_core/branches/ (99.85%)
175,778,825 cpu_atom/branch-misses/ # 1.13% of all branches (0.15%)
135,067,353 cpu_core/branch-misses/ # 0.54% of all branches (99.85%)
1,519 context-switches
50 cpu-migrations
33.006529572 seconds time elapsed
```
system call check:
```bash
sudo timeout 30s strace -c -f -p $(pidof oiopt64)
```
output:
```text
strace: Process 893123 attached with 4 threads
strace: Process 893126 detached
strace: Process 893125 detached
strace: Process 893124 detached
strace: Process 893123 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
54.56 18.079500 463576 39 epoll_wait
27.91 9.249443 7 1294854 2 sendto
17.49 5.796927 4 1294919 51 recvfrom
0.01 0.003778 50 75 49 futex
0.01 0.002188 12 175 epoll_ctl
0.00 0.000747 83 9 9 connect
0.00 0.000714 17 40 timerfd_settime
0.00 0.000510 13 39 38 read
0.00 0.000452 25 18 9 accept4
0.00 0.000310 17 18 ioctl
0.00 0.000232 23 10 write
0.00 0.000200 22 9 socket
0.00 0.000183 20 9 getsockopt
0.00 0.000100 5 18 shutdown
0.00 0.000053 2 18 close
0.00 0.000020 20 1 mprotect
0.00 0.000015 15 1 sched_yield
0.00 0.000005 5 1 madvise
------ ----------- ----------- --------- --------- ----------------
100.00 33.135377 12 2590253 158 total
```
### Cleanup:
```bash
sudo ip netns del ns_client
sudo ip netns del ns_server
sudo ip link del veth_c
sudo ip link del veth_s
```