blog-post-backup/drafts/2026-01-24-profiling-rust-written-network-program.md at main

manbo/blog-post-backup

Fork 0

Files

manbo ba61eb37b9 Update: changes buffer from 8KiB to 16 and finally 64, have discovered more things than expect.

2026-01-27 23:23:44 +08:00

16 KiB

Raw Permalink Blame History

Testing enviornment setup

Install tools:

sudo apt update

sudo apt install -y hyperfine heaptrack valgrind

sudo apt install -y \
  build-essential clang lld pkg-config \
  linux-perf \
  iperf3 netperf net-tools \
  tcpdump ethtool iproute2 \
  bpftrace bpfcc-tools \
  strace ltrace \
  sysstat procps \
  git perl

Install framegraph(not shipped on debian):

git clone https://github.com/brendangregg/FlameGraph ~/FlameGraph

echo 'export PATH="$HOME/FlameGraph:$PATH"' >> ~/.bashrc

source ~/.bashrc

which flamegraph.pl

modify the Cargo.toml of verion 0.1.0:

[profile.release]
lto = true
codegen-units = 1
debug = 1
strip = "none"
panic = "abort"

Build with frame pointers to help profiling:

git clone https://github.com/DaZuo0122/oxidinetd.git

RUSTFLAGS="-C force-frame-pointers=yes" cargo build --release

profiling.conf:

127.0.0.1 9000 127.0.0.1 9001

Backend iperf3 server:

iperf3 -s -p 9001

forwarder:

./oi -c profiling.conf

triggers redirect:

iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 1
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 8

verification:

sudo ss -tnp | egrep '(:9000|:9001)'

Testing

CPU hotspot:

sudo perf top -p $(pidof oi)

If you see lots of:

sys_read, sys_write, __x64_sys_sendto, tcp_sendmsg → syscall/copy overhead
futex, __lll_lock_wait → contention/locks
epoll_wait → executor wake behavior / too many idle polls

Hard numbers:

sudo perf stat -p $(pidof oi) -e \
  cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
  -- sleep 30

Big differences to watch:

context-switches much higher on oi → too many tasks/wakers / lock contention
instructions much higher on oi for same throughput → runtime overhead / copies
cache-misses higher → allocations / poor locality

Flamegraph
Record:

sudo perf record -F 199 -g -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg

If the stack looks “flat / missing” (common with async + LTO), use dwarf unwinding:

sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg

syscall-cost check:

sudo strace -ff -c -p $(pidof oi) -o /tmp/oi.strace
# run 15–30s under load, then Ctrl+C
tail -n +1 /tmp/oi.strace.*

If you see huge % time in read/write/sendmsg/recvmsg, you’re dominated by copying + syscalls.

ebpf stuffs

--skipped--

Smol-focused bottlenecks + the “fix list” A) If you’re syscall/copy bound

Best improvement candidates:

buffer reuse (no per-loop Vec allocation)

reduce tiny writes (coalesce)

zero-copy splice (Linux-only, biggest win but more complex)

For Linux zero-copy, you’d implement a splice(2)-based fast path (socket→pipe→socket). That’s how high-performance forwarders avoid double-copy.

B) If you’re executor/waker bound (common for async forwarders)

Symptoms:

perf shows a lot of runtime / wake / scheduling

perf stat shows more context switches than rinetd

Fixes:

don’t spawn 2 tasks per connection (one per direction) unless needed → do a single task that forwards both directions in one loop (state machine)

avoid any shared Mutex on hot path (logging/metrics)

keep per-conn state minimal

C) If you’re single-thread limited

smol can be extremely fast, but if you’re effectively running everything on one thread, throughput may cap earlier.

Fix direction:

move to smol::Executor + N threads (usually num_cpus)

or run multiple block_on() workers (careful: avoid accept() duplication)

outcome oi

CPU hotspot

testing commands:

iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 1

sudo perf stat -p $(pidof oi) -e \
  cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
  -- sleep 30

perf report:

Performance counter stats for process id '207279':

    98,571,874,480      cpu_atom/cycles/                                                        (0.10%)
   134,732,064,800      cpu_core/cycles/                                                        (99.90%)
    75,889,748,906      cpu_atom/instructions/           #    0.77  insn per cycle              (0.10%)
   159,098,987,713      cpu_core/instructions/           #    1.18  insn per cycle              (99.90%)
        30,443,258      cpu_atom/cache-misses/                                                  (0.10%)
         3,155,528      cpu_core/cache-misses/                                                  (99.90%)
    15,003,063,317      cpu_atom/branches/                                                      (0.10%)
    31,479,765,962      cpu_core/branches/                                                      (99.90%)
       149,091,165      cpu_atom/branch-misses/          #    0.99% of all branches             (0.10%)
       195,562,861      cpu_core/branch-misses/          #    0.62% of all branches             (99.90%)
             1,138      context-switches                                                      
                37      cpu-migrations                                                        

      33.004738330 seconds time elapsed

FlameGraph

testing commands:

sudo perf record -F 199 -g -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg

outcome:

oi.svg

commands:

sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi_dwarf.svg

outcome:

oi_dwarf.svg

syscall-cost check

sudo strace -ff -C -p $(pidof oi) -o /tmp/oi.strace
# run 15–30s under load, then Ctrl+C
tail -n +1 /tmp/oi.strace.*

More real setup

traffic goes through real kernel routing + 2 TCP legs

Create namespaces + veth links:

sudo ip netns add ns_client
sudo ip netns add ns_server

sudo ip link add veth_c type veth peer name veth_c_ns
sudo ip link set veth_c_ns netns ns_client

sudo ip link add veth_s type veth peer name veth_s_ns
sudo ip link set veth_s_ns netns ns_server

sudo ip addr add 10.0.1.1/24 dev veth_c
sudo ip link set veth_c up
sudo ip addr add 10.0.0.1/24 dev veth_s
sudo ip link set veth_s up

sudo ip netns exec ns_client ip addr add 10.0.1.2/24 dev veth_c_ns
sudo ip netns exec ns_client ip link set veth_c_ns up
sudo ip netns exec ns_client ip link set lo up

sudo ip netns exec ns_server ip addr add 10.0.0.2/24 dev veth_s_ns
sudo ip netns exec ns_server ip link set veth_s_ns up
sudo ip netns exec ns_server ip link set lo up

sudo sysctl -w net.ipv4.ip_forward=1

Config to force redirect path:

10.0.1.1 9000 10.0.0.2 9001

Start backend server in ns_server:

sudo ip netns exec ns_server iperf3 -s -p 9001

Run client in ns_client → forwarder → backend:

sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8

perf report:

sudo perf stat -p $(pidof oi) -e   cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations   -- sleep 33

 Performance counter stats for process id '209785':

   113,810,599,893      cpu_atom/cycles/                                                        (0.11%)
   164,681,878,450      cpu_core/cycles/                                                        (99.89%)
   102,575,167,734      cpu_atom/instructions/           #    0.90  insn per cycle              (0.11%)
   237,094,207,911      cpu_core/instructions/           #    1.44  insn per cycle              (99.89%)
        33,093,338      cpu_atom/cache-misses/                                                  (0.11%)
         5,381,441      cpu_core/cache-misses/                                                  (99.89%)
    20,012,975,873      cpu_atom/branches/                                                      (0.11%)
    46,120,077,111      cpu_core/branches/                                                      (99.89%)
       211,767,555      cpu_atom/branch-misses/          #    1.06% of all branches             (0.11%)
       245,969,685      cpu_core/branch-misses/          #    0.53% of all branches             (99.89%)
             1,686      context-switches
               150      cpu-migrations

      33.004363800 seconds time elapsed

flamegraph

Add latency + small-packet tests

netperf (request/response)

Start netserver in backend namespace:

sudo ip netns exec ns_server netserver -p 9001

Run TCP_RR against forwarded port:

sudo ip netns exec ns_client netperf -H 10.0.1.1 -p 9000 -t TCP_RR -l 30 -- -r 32,32

After opt

Here, we changed future_lite::io 8KiB buffer to a customized 16KiB buffer. (To avoid conflict, I changed binary name to oiopt).

async fn pump(mut r: TcpStream, mut w: TcpStream) -> io::Result<u64> {
    // let's try 16KiB instead of future_lite::io 8KiB
    // and do a profiling to see the outcome
    let mut buf = vec![0u8; 16 * 1024];
    let mut total = 0u64;

    loop {
        let n = r.read(&mut buf).await?;
        if n == 0 {
            // EOF: send FIN to peer
            let _ = w.shutdown(Shutdown::Write);
            break;
        }
        w.write_all(&buf[0..n]).await?;
        total += n as u64;
    }

    Ok(total)
}


// And change the function call in handle_tcp_connection 
let client_to_server = pump(client_stream.clone(), server_stream.clone());
let server_to_client = pump(server_stream, client_stream);

outcomes

Still with sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8

perf stat:

sudo perf stat -p $(pidof oiopt) -e   cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations   -- sleep 33

 Performance counter stats for process id '883435':

   118,960,667,431      cpu_atom/cycles/                                                        (0.05%)
   131,934,369,110      cpu_core/cycles/                                                        (99.95%)
   100,530,466,140      cpu_atom/instructions/           #    0.85  insn per cycle              (0.05%)
   185,203,788,299      cpu_core/instructions/           #    1.40  insn per cycle              (99.95%)
        11,027,490      cpu_atom/cache-misses/                                                  (0.05%)
         2,123,369      cpu_core/cache-misses/                                                  (99.95%)
    19,641,945,774      cpu_atom/branches/                                                      (0.05%)
    36,245,438,057      cpu_core/branches/                                                      (99.95%)
       214,098,497      cpu_atom/branch-misses/          #    1.09% of all branches             (0.05%)
       179,848,095      cpu_core/branch-misses/          #    0.50% of all branches             (99.95%)
             2,308      context-switches
                31      cpu-migrations

      33.004555878 seconds time elapsed

system call check:

sudo timeout 30s strace -c -f -p $(pidof oiopt)

output:

strace: Process 883435 attached with 4 threads
strace: Process 883438 detached
strace: Process 883437 detached
strace: Process 883436 detached
strace: Process 883435 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 57.80   14.590016      442121        33           epoll_wait
 28.84    7.279883           4   1771146           sendto
 13.33    3.363882           1   1771212        48 recvfrom
  0.02    0.003843          61        62        44 futex
  0.01    0.001947          12       159           epoll_ctl
  0.00    0.000894          99         9         9 connect
  0.00    0.000620          34        18         9 accept4
  0.00    0.000503          14        34           timerfd_settime
  0.00    0.000446          13        33        33 read
  0.00    0.000271          15        18           ioctl
  0.00    0.000189          21         9           write
  0.00    0.000176          19         9           socket
  0.00    0.000099          11         9           getsockopt
  0.00    0.000079           4        18           shutdown
  0.00    0.000049           2        18           close
------ ----------- ----------- --------- --------- ----------------
100.00   25.242897           7   3542787       143 total

Further tests to explain why this huge

Changed 16KiB buffer to 64KiB, and named the binary to oiopt64

iperf3 throughput under -P 8, highest 54.1Gbits/sec, other threads are much higher than before(16KiB buffer)