## Testing environment setup

Install tools:

```bash
sudo apt update
sudo apt install -y hyperfine heaptrack valgrind
sudo apt install -y \
  build-essential clang lld pkg-config \
  linux-perf \
  iperf3 netperf net-tools \
  tcpdump ethtool iproute2 \
  bpftrace bpfcc-tools \
  strace ltrace \
  sysstat procps \
  git perl
```

Install FlameGraph (not shipped in Debian):

```bash
git clone https://github.com/brendangregg/FlameGraph ~/FlameGraph
echo 'export PATH="$HOME/FlameGraph:$PATH"' >> ~/.bashrc
source ~/.bashrc
which flamegraph.pl
```

Modify the `Cargo.toml` of version 0.1.0:

```toml
[profile.release]
lto = true
codegen-units = 1
debug = 1
strip = "none"
panic = "abort"
```

Build with frame pointers to help profiling:

```bash
git clone https://github.com/DaZuo0122/oxidinetd.git
RUSTFLAGS="-C force-frame-pointers=yes" cargo build --release
```

`profiling.conf`:

```text
127.0.0.1 9000 127.0.0.1 9001
```

Backend iperf3 server:

```bash
iperf3 -s -p 9001
```

Forwarder:

```bash
./oi -c profiling.conf
```

Trigger the redirect:

```bash
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 1
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 8
```

Verification:

```bash
sudo ss -tnp | egrep '(:9000|:9001)'
```

## Testing

CPU hotspot:

```bash
sudo perf top -p $(pidof oi)
```

If you see lots of:

- sys_read, sys_write, __x64_sys_sendto, tcp_sendmsg → syscall/copy overhead
- futex, __lll_lock_wait → contention/locks
- epoll_wait → executor wake behavior / too many idle polls

Hard numbers:

```bash
sudo perf stat -p $(pidof oi) -e \
  cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
  -- sleep 30
```

Big differences to watch:

- context-switches much higher on oi → too many tasks/wakers / lock contention
- instructions much higher on oi for the same throughput → runtime overhead / copies
- cache-misses higher → allocations / poor locality

FlameGraph record:

```bash
sudo perf record -F 199 -g -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```

If the stacks look "flat / missing" (common with async + LTO), use DWARF unwinding:

```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi_dwarf.svg
```

Syscall-cost check:

```bash
sudo strace -ff -c -p $(pidof oi) -o /tmp/oi.strace
# run 15–30s under load, then Ctrl+C
tail -n +1 /tmp/oi.strace.*
```

If you see a huge % of time in read/write/sendmsg/recvmsg, you are dominated by copying + syscalls.

eBPF checks: --skipped--

## Smol-focused bottlenecks + the "fix list"

### A) If you're syscall/copy bound

Best improvement candidates:

- buffer reuse (no per-loop Vec allocation)
- reduce tiny writes (coalesce)
- zero-copy splice (Linux-only, biggest win but more complex)

For Linux zero-copy, you'd implement a splice(2)-based fast path (socket → pipe → socket); that's how high-performance forwarders avoid the double copy. See the sketch after this section.

### B) If you're executor/waker bound (common for async forwarders)

Symptoms:

- perf shows a lot of runtime / wake / scheduling
- perf stat shows more context switches than rinetd

Fixes:

- don't spawn 2 tasks per connection (one per direction) unless needed → do a single task that forwards both directions in one loop (state machine); see the sketch after this section
- avoid any shared Mutex on the hot path (logging/metrics)
- keep per-connection state minimal

### C) If you're single-thread limited

smol can be extremely fast, but if you're effectively running everything on one thread, throughput may cap earlier.

Fix direction: move to smol::Executor + N threads (usually num_cpus), or run multiple block_on() workers (careful: avoid accept() duplication). See the sketch after this section.
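To make the A) fast path concrete, here is a minimal sketch of the socket → pipe → socket splice pattern. It is not oxidinetd code: it uses blocking std sockets and the `libc` crate, the 64 KiB chunk size is arbitrary, the `splice_forward` name is made up, and a real async integration would only splice when the fds are ready and would retry on EAGAIN / partial writes.

```rust
// Sketch only: zero-copy forward from `src` to `dst` through a kernel pipe,
// so payload bytes never enter user space. Error handling is simplified.
use std::net::TcpStream;
use std::os::unix::io::AsRawFd;

fn splice_forward(src: &TcpStream, dst: &TcpStream) -> std::io::Result<()> {
    // The pipe is the in-kernel buffer sitting between the two sockets.
    let mut pipe_fds = [0i32; 2];
    if unsafe { libc::pipe(pipe_fds.as_mut_ptr()) } != 0 {
        return Err(std::io::Error::last_os_error());
    }
    let (pipe_rd, pipe_wr) = (pipe_fds[0], pipe_fds[1]);
    let flags = libc::SPLICE_F_MOVE | libc::SPLICE_F_MORE;

    loop {
        // Socket -> pipe: data moves inside the kernel, no user-space copy.
        let n = unsafe {
            libc::splice(src.as_raw_fd(), std::ptr::null_mut(),
                         pipe_wr, std::ptr::null_mut(), 64 * 1024, flags)
        };
        if n <= 0 {
            break; // 0 = peer closed; <0 = error (a real impl checks EAGAIN)
        }

        // Pipe -> socket: drain exactly what was just spliced in.
        let mut remaining = n;
        while remaining > 0 {
            let m = unsafe {
                libc::splice(pipe_rd, std::ptr::null_mut(),
                             dst.as_raw_fd(), std::ptr::null_mut(),
                             remaining as usize, flags)
            };
            if m <= 0 {
                // Write side failed or would block; a real impl retries.
                unsafe { libc::close(pipe_rd); libc::close(pipe_wr); }
                return Err(std::io::Error::last_os_error());
            }
            remaining -= m;
        }
    }

    unsafe {
        libc::close(pipe_rd);
        libc::close(pipe_wr);
    }
    Ok(())
}
```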
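For the B) fix, a sketch of forwarding both directions of one connection inside a single task, assuming smol's re-exported `net`/`io`/`future` modules; the `forward_both_ways` name and the error handling are illustrative, not oxidinetd's actual API.

```rust
// Sketch only: one task per connection; both directions are driven by a
// single zipped future instead of two spawned tasks.
use smol::{future, io, net::TcpStream};

async fn forward_both_ways(client: TcpStream, server: TcpStream) -> std::io::Result<()> {
    // smol's TcpStream also implements AsyncRead/AsyncWrite for &TcpStream,
    // so one task can borrow both sockets for both directions.
    let (mut to_server, mut to_client) = (&server, &client);
    let client_to_server = io::copy(&client, &mut to_server);
    let server_to_client = io::copy(&server, &mut to_client);

    // One task/waker drives both copies; io::copy also keeps reusing one
    // internal buffer per direction instead of allocating per iteration.
    // A real forwarder would half-close the peer when one direction hits EOF.
    let (a, b) = future::zip(client_to_server, server_to_client).await;
    a?;
    b?;
    Ok(())
}
```

Spawning one such task per accepted connection roughly halves the task and waker count compared with one task per direction.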
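For the C) fix direction, a hedged sketch of one shared `smol::Executor` driven by N threads with a single accept-loop task; the `accept_loop` helper, the thread-count heuristic, and the per-connection body are placeholders, and the ports mirror `profiling.conf`.

```rust
// Sketch only: N threads run one shared executor; exactly one task accepts,
// so accept() is never duplicated across workers.
use std::{sync::Arc, thread};
use smol::{future, io, net::{TcpListener, TcpStream}, Executor};

async fn accept_loop(ex: Arc<Executor<'static>>) -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:9000").await?;
    loop {
        let (client, _peer) = listener.accept().await?;
        // One forwarding task per connection; any worker thread may run it.
        ex.spawn(async move {
            if let Ok(server) = TcpStream::connect("127.0.0.1:9001").await {
                // Forward both directions in this one task (see sketch above).
                let (mut to_server, mut to_client) = (&server, &client);
                let _ = future::zip(
                    io::copy(&client, &mut to_server),
                    io::copy(&server, &mut to_client),
                )
                .await;
            }
        })
        .detach();
    }
}

fn main() -> std::io::Result<()> {
    let ex: Arc<Executor<'static>> = Arc::new(Executor::new());

    // Roughly one worker per core; the main thread counts as one of them.
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(4);
    for _ in 1..workers {
        let ex = ex.clone();
        // Workers just run the executor; a real program would use a shutdown
        // signal instead of a forever-pending future.
        thread::spawn(move || future::block_on(ex.run(future::pending::<()>())));
    }

    // The main thread also runs the executor and owns the only accept loop.
    let ex_main = ex.clone();
    future::block_on(ex_main.run(accept_loop(ex)))
}
```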
## Outcome

### CPU hotspot testing

commands:

```bash
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 1
sudo perf stat -p $(pidof oi) -e \
  cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
  -- sleep 30
```

perf report:

```text
Performance counter stats for process id '207279':

    98,571,874,480   cpu_atom/cycles/                                      (0.10%)
   134,732,064,800   cpu_core/cycles/                                      (99.90%)
    75,889,748,906   cpu_atom/instructions/    # 0.77 insn per cycle       (0.10%)
   159,098,987,713   cpu_core/instructions/    # 1.18 insn per cycle       (99.90%)
        30,443,258   cpu_atom/cache-misses/                                (0.10%)
         3,155,528   cpu_core/cache-misses/                                (99.90%)
    15,003,063,317   cpu_atom/branches/                                    (0.10%)
    31,479,765,962   cpu_core/branches/                                    (99.90%)
       149,091,165   cpu_atom/branch-misses/   # 0.99% of all branches     (0.10%)
       195,562,861   cpu_core/branch-misses/   # 0.62% of all branches     (99.90%)
             1,138   context-switches
                37   cpu-migrations

   33.004738330 seconds time elapsed
```

### FlameGraph testing

commands:

```bash
sudo perf record -F 199 -g -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```

outcome: oi.svg

commands:

```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi_dwarf.svg
```

outcome: oi_dwarf.svg

### syscall-cost check

```bash
sudo strace -ff -c -p $(pidof oi) -o /tmp/oi.strace
# run 15–30s under load, then Ctrl+C
tail -n +1 /tmp/oi.strace.*
```