## Testing environment setup

Install tools:

```bash
sudo apt update

sudo apt install -y hyperfine heaptrack valgrind

sudo apt install -y \
    build-essential clang lld pkg-config \
    linux-perf \
    iperf3 netperf net-tools \
    tcpdump ethtool iproute2 \
    bpftrace bpfcc-tools \
    strace ltrace \
    sysstat procps \
    git perl
```

Install FlameGraph (not shipped in Debian):

```bash
git clone https://github.com/brendangregg/FlameGraph ~/FlameGraph

echo 'export PATH="$HOME/FlameGraph:$PATH"' >> ~/.bashrc

source ~/.bashrc

which flamegraph.pl
```

Modify the Cargo.toml of version 0.1.0:

```toml
[profile.release]
lto = true
codegen-units = 1
debug = 1        # keep line tables so profilers can map samples to source
strip = "none"   # keep symbols so perf can resolve function names
panic = "abort"
```

Build with frame pointers to help profiling:

```bash
git clone https://github.com/DaZuo0122/oxidinetd.git

RUSTFLAGS="-C force-frame-pointers=yes" cargo build --release
```

`profiling.conf` (listen on 9000, forward to 9001):

```text
127.0.0.1 9000 127.0.0.1 9001
```

Backend iperf3 server:

```bash
iperf3 -s -p 9001
```

Start the forwarder:

```bash
./oi -c profiling.conf
```

Trigger the redirect (one stream, then 8 parallel streams):

```bash
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 1
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 8
```

Verify both legs of the connection:

```bash
sudo ss -tnp | grep -E '(:9000|:9001)'
```

## Testing

CPU hotspots:

```bash
sudo perf top -p $(pidof oi)
```

If you see lots of:

- `sys_read`, `sys_write`, `__x64_sys_sendto`, `tcp_sendmsg` → syscall/copy overhead
- `futex`, `__lll_lock_wait` → contention/locks
- `epoll_wait` → executor wake behavior / too many idle polls

Hard numbers:

```bash
sudo perf stat -p $(pidof oi) -e \
    cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
    -- sleep 30
```

Big differences to watch:

- context-switches much higher on oi → too many tasks/wakers / lock contention
- instructions much higher on oi for the same throughput → runtime overhead / copies
- cache-misses higher → allocations / poor locality

Flamegraph record:

```bash
sudo perf record -F 199 -g -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```

If the stacks look “flat / missing” (common with async + LTO), use DWARF unwinding:

```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```

Syscall-cost check:

```bash
sudo strace -ff -c -p $(pidof oi) -o /tmp/oi.strace
# run 15–30s under load, then Ctrl+C
tail -n +1 /tmp/oi.strace.*
```

If a huge share of time goes to read/write/sendmsg/recvmsg, you're dominated by copying + syscalls.

## eBPF checks

--skipped--

## Smol-focused bottlenecks + the “fix list”

### A) If you're syscall/copy bound

Best improvement candidates:

- buffer reuse (no per-loop Vec allocation)
- reduce tiny writes (coalesce)
- zero-copy splice (Linux-only, biggest win but more complex)

For Linux zero-copy, you'd implement a splice(2)-based fast path (socket → pipe → socket). That's how high-performance forwarders avoid double-copy.
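A minimal sketch of that fast path, assuming blocking std sockets and the `libc` crate; `splice_pump` and its error handling are illustrative, not oxidinetd code, and an async version on smol streams would additionally need EAGAIN/readiness handling:

```rust
use std::io;
use std::net::TcpStream;
use std::os::fd::{AsRawFd, FromRawFd, OwnedFd};

/// Move bytes from `r` to `w` through a kernel pipe with splice(2),
/// so the payload never crosses the kernel/userspace boundary.
fn splice_pump(r: &TcpStream, w: &TcpStream) -> io::Result<u64> {
    // A pipe is the mandatory intermediary for socket-to-socket splice.
    let mut fds = [0i32; 2];
    if unsafe { libc::pipe2(fds.as_mut_ptr(), 0) } < 0 {
        return Err(io::Error::last_os_error());
    }
    // OwnedFd closes both pipe ends on every exit path.
    let (pr, pw) = unsafe { (OwnedFd::from_raw_fd(fds[0]), OwnedFd::from_raw_fd(fds[1])) };
    let mut total = 0u64;

    loop {
        // socket -> pipe
        let n = unsafe {
            libc::splice(r.as_raw_fd(), std::ptr::null_mut(),
                         pw.as_raw_fd(), std::ptr::null_mut(),
                         64 * 1024, libc::SPLICE_F_MOVE)
        };
        if n < 0 {
            return Err(io::Error::last_os_error());
        }
        if n == 0 {
            break; // EOF on the source socket
        }
        // pipe -> socket: drain exactly what we just moved in
        let mut left = n as usize;
        while left > 0 {
            let m = unsafe {
                libc::splice(pr.as_raw_fd(), std::ptr::null_mut(),
                             w.as_raw_fd(), std::ptr::null_mut(),
                             left, libc::SPLICE_F_MOVE)
            };
            // 0 shouldn't happen while the pipe still holds data
            if m <= 0 {
                return Err(io::Error::last_os_error());
            }
            left -= m as usize;
            total += m as u64;
        }
    }
    Ok(total)
}
```

The readiness plumbing is where the real complexity lives: in the async forwarder you'd splice only when the source socket is readable, and fall back to the buffered read/write loop if splice fails.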
### B) If you're executor/waker bound (common for async forwarders)

Symptoms:

- perf shows a lot of runtime / wake / scheduling frames
- perf stat shows more context switches than rinetd

Fixes (see the sketch below):

- don't spawn 2 tasks per connection (one per direction) unless needed → do a single task that forwards both directions in one loop (state machine)
- avoid any shared Mutex on the hot path (logging/metrics)
- keep per-conn state minimal
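A sketch of that single-task shape on smol's `TcpStream` (`handle_conn` and `copy_dir` are illustrative names; whether oxidinetd currently spawns one or two tasks per connection isn't shown in this post). `future::zip` polls both directions from one task, so the executor schedules one task per connection instead of two:

```rust
use std::{io, net::Shutdown};
use smol::{future, io::{AsyncReadExt, AsyncWriteExt}, net::TcpStream};

// One task drives both directions: zip polls the two copy loops
// together, so there's a single task per connection.
async fn handle_conn(client: TcpStream, server: TcpStream) -> io::Result<()> {
    let up = copy_dir(client.clone(), server.clone());
    let down = copy_dir(server, client);
    let (a, b) = future::zip(up, down).await;
    a?;
    b?;
    Ok(())
}

// Minimal one-direction copy loop: reused buffer, FIN forwarded on EOF.
async fn copy_dir(mut r: TcpStream, mut w: TcpStream) -> io::Result<()> {
    let mut buf = vec![0u8; 16 * 1024];
    loop {
        let n = r.read(&mut buf).await?;
        if n == 0 {
            let _ = w.shutdown(Shutdown::Write);
            return Ok(());
        }
        w.write_all(&buf[..n]).await?;
    }
}
```

A full state machine would go further (one buffer, one poll loop), but zip alone already removes the second spawn per connection.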
### C) If you're single-thread limited

smol can be extremely fast, but if you're effectively running everything on one thread, throughput may cap earlier.

Fix direction (sketch below):

- move to a shared smol::Executor driven by N threads (usually num_cpus)
- or run multiple block_on() workers (careful: avoid accept() duplication)
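A minimal sketch of the first option (not oxidinetd's actual main; `handle` stands in for its per-connection handler, e.g. handle_tcp_connection, and the listen address is hard-coded for brevity):

```rust
use std::{io, sync::Arc, thread};
use smol::{future, net::{TcpListener, TcpStream}, Executor};

// Stand-in for the real per-connection handler.
async fn handle(_stream: TcpStream) -> io::Result<()> {
    Ok(())
}

fn main() -> io::Result<()> {
    let ex: Arc<Executor<'static>> = Arc::new(Executor::new());

    // N-1 extra OS threads all drive the same executor, so tasks spread
    // across cores while accept() still runs in exactly one place.
    let n = thread::available_parallelism().map(usize::from).unwrap_or(4);
    for _ in 1..n {
        let ex = ex.clone();
        thread::spawn(move || future::block_on(ex.run(future::pending::<()>())));
    }

    // The main thread runs the accept loop and drives the executor too.
    let ex2 = ex.clone();
    future::block_on(ex.run(async move {
        let listener = TcpListener::bind("127.0.0.1:9000").await?;
        loop {
            let (stream, _) = listener.accept().await?;
            ex2.spawn(async move { let _ = handle(stream).await; }).detach();
        }
    }))
}
```

This keeps a single listener and a single accept loop, sidestepping the accept() duplication problem the multiple-block_on() route has.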
## Outcome: oi

### CPU counters (perf stat)

Testing commands:

```bash
iperf3 -c 127.0.0.1 -p 9000 -t 30 -P 1

sudo perf stat -p $(pidof oi) -e \
    cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations \
    -- sleep 30
```

perf stat output:

```text
 Performance counter stats for process id '207279':

    98,571,874,480      cpu_atom/cycles/                                         (0.10%)
   134,732,064,800      cpu_core/cycles/                                         (99.90%)
    75,889,748,906      cpu_atom/instructions/         # 0.77 insn per cycle     (0.10%)
   159,098,987,713      cpu_core/instructions/         # 1.18 insn per cycle     (99.90%)
        30,443,258      cpu_atom/cache-misses/                                   (0.10%)
         3,155,528      cpu_core/cache-misses/                                   (99.90%)
    15,003,063,317      cpu_atom/branches/                                       (0.10%)
    31,479,765,962      cpu_core/branches/                                       (99.90%)
       149,091,165      cpu_atom/branch-misses/        # 0.99% of all branches   (0.10%)
       195,562,861      cpu_core/branch-misses/        # 0.62% of all branches   (99.90%)
             1,138      context-switches
                37      cpu-migrations

      33.004738330 seconds time elapsed
```

### FlameGraph

Testing commands:

```bash
sudo perf record -F 199 -g -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi.svg
```

Outcome: `oi.svg`

The same with DWARF unwinding:

```bash
sudo perf record -F 199 --call-graph dwarf,16384 -p $(pidof oi) -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > oi_dwarf.svg
```

Outcome: `oi_dwarf.svg`

### Syscall-cost check

```bash
sudo strace -ff -C -p $(pidof oi) -o /tmp/oi.strace
# run 15–30s under load, then Ctrl+C
tail -n +1 /tmp/oi.strace.*
```

## A more realistic setup

Traffic now goes through real kernel routing and two TCP legs.

Create namespaces and veth links:

```bash
sudo ip netns add ns_client
sudo ip netns add ns_server

# client side: host 10.0.1.1 <-> ns_client 10.0.1.2
sudo ip link add veth_c type veth peer name veth_c_ns
sudo ip link set veth_c_ns netns ns_client

# server side: host 10.0.0.1 <-> ns_server 10.0.0.2
sudo ip link add veth_s type veth peer name veth_s_ns
sudo ip link set veth_s_ns netns ns_server

sudo ip addr add 10.0.1.1/24 dev veth_c
sudo ip link set veth_c up
sudo ip addr add 10.0.0.1/24 dev veth_s
sudo ip link set veth_s up

sudo ip netns exec ns_client ip addr add 10.0.1.2/24 dev veth_c_ns
sudo ip netns exec ns_client ip link set veth_c_ns up
sudo ip netns exec ns_client ip link set lo up

sudo ip netns exec ns_server ip addr add 10.0.0.2/24 dev veth_s_ns
sudo ip netns exec ns_server ip link set veth_s_ns up
sudo ip netns exec ns_server ip link set lo up

sudo sysctl -w net.ipv4.ip_forward=1
```

Config to force the redirect path:

```text
10.0.1.1 9000 10.0.0.2 9001
```

Start the backend server in ns_server:

```bash
sudo ip netns exec ns_server iperf3 -s -p 9001
```

Run the client in ns_client → forwarder → backend:

```bash
sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8
```

perf stat output:

```text
sudo perf stat -p $(pidof oi) -e cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations -- sleep 33

 Performance counter stats for process id '209785':

   113,810,599,893      cpu_atom/cycles/                                         (0.11%)
   164,681,878,450      cpu_core/cycles/                                         (99.89%)
   102,575,167,734      cpu_atom/instructions/         # 0.90 insn per cycle     (0.11%)
   237,094,207,911      cpu_core/instructions/         # 1.44 insn per cycle     (99.89%)
        33,093,338      cpu_atom/cache-misses/                                   (0.11%)
         5,381,441      cpu_core/cache-misses/                                   (99.89%)
    20,012,975,873      cpu_atom/branches/                                       (0.11%)
    46,120,077,111      cpu_core/branches/                                       (99.89%)
       211,767,555      cpu_atom/branch-misses/        # 1.06% of all branches   (0.11%)
       245,969,685      cpu_core/branch-misses/        # 0.53% of all branches   (99.89%)
             1,686      context-switches
               150      cpu-migrations

      33.004363800 seconds time elapsed
```

Flamegraph:

--skipped--

### Add latency + small-packet tests

netperf (request/response)

Start netserver in the backend namespace:

```bash
sudo ip netns exec ns_server netserver -p 9001
```

Run TCP_RR against the forwarded port:

```bash
sudo ip netns exec ns_client netperf -H 10.0.1.1 -p 9000 -t TCP_RR -l 30 -- -r 32,32
```

## After opt

Here we replaced the 8 KiB buffer that `futures_lite::io` uses with a custom 16 KiB buffer. (To avoid clashing with the old binary, I renamed this build to oiopt.)

```rust
// imports assume smol's re-exports of futures_lite's I/O traits
use std::{io, net::Shutdown};
use smol::{io::{AsyncReadExt, AsyncWriteExt}, net::TcpStream};

async fn pump(mut r: TcpStream, mut w: TcpStream) -> io::Result<u64> {
    // Let's try 16 KiB instead of futures_lite::io's 8 KiB default
    // and profile again to see the outcome. The buffer is allocated
    // once per direction and reused for every read.
    let mut buf = vec![0u8; 16 * 1024];
    let mut total = 0u64;

    loop {
        let n = r.read(&mut buf).await?;
        if n == 0 {
            // EOF: send FIN to peer
            let _ = w.shutdown(Shutdown::Write);
            break;
        }
        w.write_all(&buf[0..n]).await?;
        total += n as u64;
    }

    Ok(total)
}


// And change the call sites in handle_tcp_connection:
let client_to_server = pump(client_stream.clone(), server_stream.clone());
let server_to_client = pump(server_stream, client_stream);
```

### Outcomes

Still with `sudo ip netns exec ns_client iperf3 -c 10.0.1.1 -p 9000 -t 30 -P 8`.

perf stat:

```text
sudo perf stat -p $(pidof oiopt) -e cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations -- sleep 33

 Performance counter stats for process id '883435':

   118,960,667,431      cpu_atom/cycles/                                         (0.05%)
   131,934,369,110      cpu_core/cycles/                                         (99.95%)
   100,530,466,140      cpu_atom/instructions/         # 0.85 insn per cycle     (0.05%)
   185,203,788,299      cpu_core/instructions/         # 1.40 insn per cycle     (99.95%)
        11,027,490      cpu_atom/cache-misses/                                   (0.05%)
         2,123,369      cpu_core/cache-misses/                                   (99.95%)
    19,641,945,774      cpu_atom/branches/                                       (0.05%)
    36,245,438,057      cpu_core/branches/                                       (99.95%)
       214,098,497      cpu_atom/branch-misses/        # 1.09% of all branches   (0.05%)
       179,848,095      cpu_core/branch-misses/        # 0.50% of all branches   (99.95%)
             2,308      context-switches
                31      cpu-migrations

      33.004555878 seconds time elapsed
```

System-call check:

```bash
sudo timeout 30s strace -c -f -p $(pidof oiopt)
```

Output:

```text
strace: Process 883435 attached with 4 threads
strace: Process 883438 detached
strace: Process 883437 detached
strace: Process 883436 detached
strace: Process 883435 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 57.80   14.590016      442121        33           epoll_wait
 28.84    7.279883           4   1771146           sendto
 13.33    3.363882           1   1771212        48 recvfrom
  0.02    0.003843          61        62        44 futex
  0.01    0.001947          12       159           epoll_ctl
  0.00    0.000894          99         9         9 connect
  0.00    0.000620          34        18         9 accept4
  0.00    0.000503          14        34           timerfd_settime
  0.00    0.000446          13        33        33 read
  0.00    0.000271          15        18           ioctl
  0.00    0.000189          21         9           write
  0.00    0.000176          19         9           socket
  0.00    0.000099          11         9           getsockopt
  0.00    0.000079           4        18           shutdown
  0.00    0.000049           2        18           close
------ ----------- ----------- --------- --------- ----------------
100.00   25.242897           7   3542787       143 total
```

## Further tests to explain why the effect is this big

Changed the 16 KiB buffer to 64 KiB and named the binary oiopt64.

iperf3 throughput under `-P 8`: the fastest stream hit 54.1 Gbits/sec, and the other streams are also much higher than with the 16 KiB buffer.

perf stat:

```text
sudo perf stat -p $(pidof oiopt64) -e cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations -- sleep 33

 Performance counter stats for process id '893123':

   120,859,810,675      cpu_atom/cycles/                                         (0.15%)
   134,735,934,329      cpu_core/cycles/                                         (99.85%)
    79,946,979,880      cpu_atom/instructions/         # 0.66 insn per cycle     (0.15%)
   127,036,644,759      cpu_core/instructions/         # 0.94 insn per cycle     (99.85%)
        24,713,474      cpu_atom/cache-misses/                                   (0.15%)
         9,604,449      cpu_core/cache-misses/                                   (99.85%)
    15,584,074,530      cpu_atom/branches/                                       (0.15%)
    24,796,180,117      cpu_core/branches/                                       (99.85%)
       175,778,825      cpu_atom/branch-misses/        # 1.13% of all branches   (0.15%)
       135,067,353      cpu_core/branch-misses/        # 0.54% of all branches   (99.85%)
             1,519      context-switches
                50      cpu-migrations

      33.006529572 seconds time elapsed
```

System-call check:

```bash
sudo timeout 30s strace -c -f -p $(pidof oiopt64)
```

Output:

```text
strace: Process 893123 attached with 4 threads
strace: Process 893126 detached
strace: Process 893125 detached
strace: Process 893124 detached
strace: Process 893123 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 54.56   18.079500      463576        39           epoll_wait
 27.91    9.249443           7   1294854         2 sendto
 17.49    5.796927           4   1294919        51 recvfrom
  0.01    0.003778          50        75        49 futex
  0.01    0.002188          12       175           epoll_ctl
  0.00    0.000747          83         9         9 connect
  0.00    0.000714          17        40           timerfd_settime
  0.00    0.000510          13        39        38 read
  0.00    0.000452          25        18         9 accept4
  0.00    0.000310          17        18           ioctl
  0.00    0.000232          23        10           write
  0.00    0.000200          22         9           socket
  0.00    0.000183          20         9           getsockopt
  0.00    0.000100           5        18           shutdown
  0.00    0.000053           2        18           close
  0.00    0.000020          20         1           mprotect
  0.00    0.000015          15         1           sched_yield
  0.00    0.000005           5         1           madvise
------ ----------- ----------- --------- --------- ----------------
100.00   33.135377          12   2590253       158 total
```

The counts tell the story: going from 16 KiB to 64 KiB buffers cuts sendto/recvfrom from about 1.77M to about 1.29M calls over the same window. Each read moves more bytes, so the per-byte syscall and copy cost drops.

### Cleanup

```bash
sudo ip netns del ns_client
sudo ip netns del ns_server
sudo ip link del veth_c
sudo ip link del veth_s
```