Below is a **high-level (language-agnostic)** design for a **client-side DNS leak detector** aimed at *censorship-resistance threat models*, i.e.:
> “Censor/ISP can observe/log DNS intent or infer proxy usage; we want to detect when DNS behavior escapes the intended protection path.”
I'll cover: **definitions**, **detection standards**, **workflow**, **modules**, **passive + active detection**, **outputs**, and **test methodology**.
---
# 1) Scope and goals
## Goals
Your detector should answer, with evidence:
1. **Did any DNS query leave the device outside the intended safe path?**
2. **Which domains leaked?** (when visible)
3. **Which transport leaked?** (UDP/53, TCP/53, DoT/853, DoH)
4. **Which interface leaked?** (Wi-Fi/Ethernet vs tunnel)
5. **Which process/app triggered it?** (if your OS allows attribution)
And in your censorship model, it should also detect:
6. **Split-policy intent leakage**: “unknown/sensitive domains were resolved using domestic/ISP-facing DNS.”
## Non-goals (be explicit)
* Not a censorship circumvention tool itself
* Not a full firewall manager (can suggest fixes, but detection is the core)
* Not perfect attribution on every OS (process mapping may be partial)
---
# 2) Define “DNS leak” precisely (your program's standard)
You need a **formal definition** because “DNS leak” is overloaded.
## Standard definition A (classic VPN / tunnel bypass)
A leak occurs if:
> **An unencrypted DNS query is sent outside the secure tunnel path**
> This is essentially how popular leak test sites define it (“unencrypted DNS query sent OUTSIDE the established VPN tunnel”). ([IP Leak][1])
Your detector should implement it in a machine-checkable way:
**Leak-A condition**
* DNS over **UDP/53 or TCP/53**
* Destination is **not** a “trusted resolver path” (e.g., not the tunnel interface, not loopback stub, not proxy channel)
* Interface is **not** the intended egress
✅ Strong for censorship: plaintext DNS exposes intent.
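To make this machine-checkable, the condition reduces to a single predicate over a normalized DNS event. The sketch below assumes illustrative field names (`transport`, `dst_ip`, `iface`) and a policy object holding the trusted resolver IPs and protected interfaces; none of this is a fixed schema.

```python
# Sketch: Leak-A check over a normalized DNS event.
# Field names (transport, dst_ip, iface) are illustrative only.

PLAINTEXT_DNS = {"udp53", "tcp53"}

def is_leak_a(event, policy):
    """True if an unencrypted DNS query left the device outside the safe path."""
    if event["transport"] not in PLAINTEXT_DNS:
        return False
    # Destination is part of the trusted resolver path
    # (tunnel endpoint, loopback stub, or proxy channel) -> not a leak.
    if event["dst_ip"] in policy["trusted_resolver_ips"]:
        return False
    # Interface is the intended egress (tunnel or loopback) -> not a leak.
    if event["iface"] in policy["protected_interfaces"]:
        return False
    return True
```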
---
## Standard definition B (split-policy intent leak)
A leak occurs if:
> **A domain that should be “proxied / remote-resolved” was queried via local/ISP-facing DNS.**
This is the “proxy split rules still leak intent” case.
**Leak-B condition**
* Query name matches either:
* a “proxy-required set” (sensitive list, non-allowlist, unknown), or
* a policy rule (“everything except allowlist must resolve via proxy DNS”)
* And the query was observed going to:
* ISP resolver(s) / domestic resolver(s) / non-tunnel interface
✅ This is the leak most users in censorship settings care about.
---
## Standard definition C (encrypted DNS escape / bypass)
A leak occurs if:
> DNS was encrypted, but escaped the intended channel (e.g., app uses its own DoH directly to the Internet).
This matters because DoH hides the QNAME but still creates **observable behavior** and breaks your “DNS must follow proxy” invariant.
**Leak-C condition**
* DoH (RFC 8484) ([IETF Datatracker][2]) or DoT (RFC 7858) ([IETF Datatracker][3]) flow exists
* And it does **not** go through your approved egress path (tunnel/proxy)
✅ Detects “Firefox/Chrome built-in DoH bypass” style cases.
---
## Standard definition D (mismatch risk indicator)
Not a “leak” by itself, but a **proxy inference amplifier**:
> DNS egress region/path differs from traffic egress region/path.
This is a *censorship-resistance hygiene metric*, not a binary leak.
**Mismatch condition**
* Same domain produces:
* DNS resolution via path X
* TCP/TLS connection via path Y
* Where X ≠ Y (interface, ASN region, etc.)
✅ Helps catch “DNS direct, traffic proxy” or “DNS proxy, traffic direct” weirdness.
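One way to compute this indicator is to correlate the path that carried a domain's DNS resolution with the path that carried the subsequent connection. The sketch below assumes both observation streams carry an illustrative `path_class` label (e.g., “tunnel” vs. “physical”) and that connections expose a server name (from SNI or a prior DNS answer):

```python
# Sketch: Leak-D (mismatch) correlation. Each observation is assumed to record
# which path class ("tunnel" or "physical") carried it.

def find_path_mismatches(dns_observations, connection_observations):
    """Yield domains whose DNS path differs from their traffic path."""
    dns_path = {obs["qname"]: obs["path_class"] for obs in dns_observations}
    for conn in connection_observations:
        domain = conn.get("server_name")        # e.g., from SNI or a prior DNS answer
        if domain and domain in dns_path:
            if dns_path[domain] != conn["path_class"]:
                yield {
                    "domain": domain,
                    "dns_path": dns_path[domain],
                    "traffic_path": conn["path_class"],
                }
```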
---
# 3) High-level architecture
## Core components
1. **Policy & Configuration**
* What counts as “safe DNS path”
* Which interfaces are “protected” (tunnel) vs “physical”
* Allowlist / proxy-required sets (optional)
* Known resolver lists (optional)
* Severity thresholds
2. **Traffic Sensor (Passive Monitor)**
* Captures outbound traffic metadata (and optionally payload for DNS parsing)
* Must cover:
* UDP/53, TCP/53
* TCP/853 (DoT)
* HTTPS flows that look like DoH (see below)
* Emits normalized events into a pipeline
3. **Classifier**
* Recognize DNS protocol types:
* Plain DNS
* DoT
* DoH
* Attach confidence scores (especially for DoH)
4. **DNS Parser (for plaintext DNS only)**
* Extract: QNAME, QTYPE, transaction IDs, response codes (optional)
* Store minimally (privacy-aware)
5. **Flow Tracker**
* Correlate packets into “flows”
* Map flow → interface → destination → process (if possible)
* Track timing correlation: DNS → connection attempts
6. **Leak Detector (Rules Engine)**
* Apply Leak-A/B/C/D definitions
* Produce leak events + severity + evidence chain
7. **Active Prober**
* Generates controlled DNS lookups to test behavior
* Can test fail-closed, bypasses, multi-interface behavior, etc.
8. **Report Generator**
* Human-readable summary
* Machine-readable logs (JSON)
* Recommendations (non-invasive)
---
# 4) Workflow (end-to-end)
## Workflow 0: Setup & baseline
1. Enumerate interfaces and routes
* Identify physical NICs
* Identify tunnel / proxy interface (or “expected egress destinations”)
2. Identify system DNS configuration
* Default resolvers per interface
* Local stub presence (127.0.0.1, etc.)
3. Load policy profile
* Full-tunnel, split-tunnel, or proxy-based
4. Start passive monitor
**Output:** “Current state snapshot” (useful even before testing).
---
## Workflow 1: Passive detection loop (always-on)
Continuously:
1. Capture outbound packets/flows
2. Classify as DNS-like (plain DNS / DoT / DoH / unknown)
3. If plaintext DNS → parse QNAME/QTYPE
4. Assign metadata:
* interface
* dst IP/port
* process (if possible)
* timestamp
5. Evaluate leak rules:
* Leak-A/B/C/D
6. Write event log + optional real-time alert
**Key design point:** passive mode should be able to detect leaks **without requiring any special test domain**.
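Structurally, the loop is a small pipeline. The sketch below is shape-only: the `sensor`, `classify`, `parse_dns`, `attribute_process`, and `evaluate_rules` callables stand in for the modules from section 3 and are injected rather than implemented here.

```python
# Sketch of the passive detection loop. All collaborators are injected so the
# loop stays module-agnostic: `sensor` yields captured packets, and the other
# callables stand in for the Classifier, DNS Parser, PAE, and rules engine.

def passive_loop(sensor, classify, parse_dns, attribute_process, evaluate_rules,
                 policy, emit):
    for packet in sensor.capture_outbound():               # 1. capture
        kind = classify(packet)                            # 2. plain DNS / DoT / DoH / unknown
        if kind == "unknown":
            continue
        qname = parse_dns(packet) if kind == "plain_dns" else None   # 3. parse QNAME
        event = {                                          # 4. attach metadata
            "transport": kind,
            "qname": qname,
            "iface": packet.iface,
            "dst_ip": packet.dst_ip,
            "dst_port": packet.dst_port,
            "process": attribute_process(packet),          # best-effort, may be None
            "timestamp": packet.timestamp,
        }
        for leak in evaluate_rules(event, policy):         # 5. Leak-A/B/C/D
            emit(leak)                                     # 6. log / alert
```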
---
## Workflow 2: Active test suite (on-demand)
Active tests exist because some leaks are intermittent or only happen under stress.
### Active Test A: “No plaintext DNS escape”
* Trigger a set of DNS queries (unique random domains)
* Verify **zero UDP/53 & TCP/53** leaves physical interfaces
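A minimal sketch of Active Test A, assuming the passive monitor exposes the events it observed (the `captured_events` callable and the probe-domain suffix are placeholders). The test triggers lookups for unique random names so that any plaintext query seen on a physical interface is unambiguously ours.

```python
# Sketch: Active Test A ("no plaintext DNS escape"). `captured_events` is
# assumed to return events the passive monitor observed during the test.

import secrets
import socket

def run_plaintext_escape_test(captured_events, probe_suffix="dnsleak-probe.invalid", count=5):
    """Trigger unique lookups, then check none left a physical interface in plaintext."""
    probes = {f"{secrets.token_hex(8)}.{probe_suffix}" for _ in range(count)}
    for name in probes:
        try:
            socket.getaddrinfo(name, None)     # resolution may fail; only the wire matters
        except socket.gaierror:
            pass
    leaked = [
        e for e in captured_events()
        if e["transport"] in ("udp53", "tcp53")
        and e["route_class"] == "physical"
        and (e.get("qname") or "").rstrip(".") in probes
    ]
    return {"pass": not leaked, "evidence": leaked}
```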
### Active Test B: “Fail-closed test”
* Temporarily disrupt the “protected path” (e.g., tunnel down)
* Trigger lookups again
* Expected: DNS fails (no fallback to ISP DNS)
### Active Test C: “App bypass test”
* Launch test scenarios that mimic real apps
* Confirm no direct DoH/DoT flows go to public Internet outside the proxy path
### Active Test D: “Split-policy correctness”
* Query domains that should be:
* direct-allowed
* proxy-required
* unknown
* Confirm resolution path matches policy
---
# 5) How to recognize DNS transports (detection mechanics)
## Plain DNS (strongest signal)
**Match conditions**
* UDP dst port 53 OR TCP dst port 53
* Parse DNS header
* Extract QNAME/QTYPE
**Evidence strength:** high
**Intent visibility:** yes (domain visible)
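A minimal QNAME/QTYPE extractor only needs the fixed 12-byte header plus label-by-label decoding of the first question. The sketch below handles UDP payloads only and skips name compression (which rarely appears in the question section):

```python
# Sketch: minimal QNAME/QTYPE extraction from a raw DNS message (UDP payload).
# Handles only the first question and does not follow compression pointers.

import struct

def parse_question(payload: bytes):
    """Return (qname, qtype) from a DNS query payload, or None if malformed."""
    if len(payload) < 12:
        return None
    qdcount = struct.unpack("!H", payload[4:6])[0]   # question count from the header
    if qdcount < 1:
        return None
    labels, pos = [], 12                             # question section starts after the header
    while pos < len(payload):
        length = payload[pos]
        pos += 1
        if length == 0:                              # root label terminates the name
            break
        labels.append(payload[pos:pos + length].decode("ascii", errors="replace"))
        pos += length
    if pos + 4 > len(payload):                       # QTYPE + QCLASS must follow
        return None
    qtype = struct.unpack("!H", payload[pos:pos + 2])[0]
    return ".".join(labels), qtype
```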
---
## DoT (port-based, easy)
DoT is defined over TLS, typically port **853**. ([IETF Datatracker][3])
**Match conditions**
* TCP dst port 853
* Optionally confirm TLS handshake exists
**Evidence strength:** high
**Intent visibility:** no (domain hidden)
---
## DoH (harder; heuristic + optional allowlists)
DoH is DNS over HTTPS (RFC 8484). ([IETF Datatracker][2])
**Recognizers (from strongest to weakest):**
1. HTTP request with `Content-Type: application/dns-message`
2. Path/pattern common to DoH endpoints (optional list)
3. SNI matches known DoH providers (optional list)
4. Traffic resembles frequent small HTTPS POST/GET bursts typical of DoH (weak)
**Evidence strength:** medium
**Intent visibility:** no (domain hidden)
**Important for your use case:** you may not need to *prove* it's DoH; you mostly need to detect “DNS-like encrypted resolver traffic bypassing the proxy channel.”
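A confidence-scored recognizer can simply walk these signals from strongest to weakest. The provider hostnames and score weights below are illustrative placeholders, not a vetted dataset:

```python
# Sketch: heuristic DoH scoring. The provider hostnames and weights are
# illustrative placeholders.

KNOWN_DOH_SNI = {"dns.google", "cloudflare-dns.com", "dns.quad9.net"}   # example entries only

def score_doh(flow):
    """Return (confidence, reasons) that an HTTPS flow is DoH-like."""
    score, reasons = 0.0, []
    if flow.get("content_type") == "application/dns-message":   # only visible if HTTP is inspectable
        score = 1.0
        reasons.append("dns-message content type")
    if flow.get("sni") in KNOWN_DOH_SNI:
        score = max(score, 0.8)
        reasons.append("known DoH provider SNI")
    if flow.get("url_path", "").startswith("/dns-query"):
        score = max(score, 0.7)
        reasons.append("common DoH path")
    if flow.get("small_request_bursts"):                         # weak traffic-shape signal
        score = max(score, 0.3)
        reasons.append("DoH-like traffic shape")
    return score, reasons
```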
---
# 6) Policy model: define “safe DNS path”
You need a simple abstraction users can configure:
### Safe DNS path can be defined by one or more of:
* **Allowed interfaces**
* loopback (local stub)
* tunnel interface
* **Allowed destination set**
* proxy server IP(s)
* internal resolver IP(s)
* **Allowed process**
* only your local stub + proxy allowed to resolve externally
* **Allowed port set**
* maybe only permit 443 to proxy server (if DNS rides inside it)
Then implement:
**A DNS event is a “leak” if it violates safe-path constraints.**
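Under those abstractions the check itself stays small. The sketch below models the safe path as a dataclass whose field names mirror the list above; an empty set means “constraint not configured”:

```python
# Sketch: safe-path evaluation. Field names mirror the abstraction above.

from dataclasses import dataclass, field

@dataclass
class SafePath:
    allowed_interfaces: set = field(default_factory=set)    # e.g., {"lo", "tun0"}
    allowed_destinations: set = field(default_factory=set)  # proxy / internal resolver IPs
    allowed_processes: set = field(default_factory=set)     # e.g., {"local-stub", "proxy"}
    allowed_ports: set = field(default_factory=set)         # e.g., {443} toward the proxy

def violated_constraints(event, safe: SafePath):
    """Return the list of safe-path constraints a DNS-like event violates."""
    violations = []
    if safe.allowed_interfaces and event["iface"] not in safe.allowed_interfaces:
        violations.append("interface")
    if safe.allowed_destinations and event["dst_ip"] not in safe.allowed_destinations:
        violations.append("destination")
    if safe.allowed_processes and event.get("process_name") not in safe.allowed_processes:
        violations.append("process")
    if safe.allowed_ports and event["dst_port"] not in safe.allowed_ports:
        violations.append("port")
    return violations
```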
---
# 7) Leak severity model (useful for real-world debugging)
### Severity P0 (critical)
* Plaintext DNS (UDP/TCP 53) on physical interface to ISP/public resolver
* Especially if QNAME matches proxy-required/sensitive list
### Severity P1 (high)
* DoH/DoT bypassing proxy channel directly to public Internet
### Severity P2 (medium)
* Policy mismatch: domain resolved locally but connection later proxied (or vice versa)
### Severity P3 (low / info)
* Authoritative-side “resolver egress exposure” (less relevant for a client-side leak detector)
* CDN performance mismatch indicators
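A direct mapping from the model above, with illustrative field names (`route_class`, `via_proxy`) standing in for whatever the flow tracker actually records:

```python
# Sketch: severity assignment following the P0-P3 model above.

def assign_severity(leak):
    if leak["transport"] in ("udp53", "tcp53") and leak["route_class"] == "physical":
        return "P0"          # plaintext DNS on a physical interface
    if leak["transport"] in ("dot", "doh") and not leak.get("via_proxy", False):
        return "P1"          # encrypted DNS bypassing the proxy channel
    if leak["leak_type"] == "D":
        return "P2"          # path/region mismatch indicator
    return "P3"
```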
---
# 8) Outputs and reporting
## Real-time console output (for debugging)
* “DNS leak detected: Plain DNS”
* domain (if visible)
* destination resolver IP
* interface
* process name (if available)
* policy rule violated
* suggested fix category (e.g., “force stub + block port 53”)
## Forensics log (machine-readable)
A single **LeakEvent** record could include:
* timestamp
* leak_type (A/B/C/D)
* transport (UDP53, TCP53, DoT, DoH)
* qname/qtype (nullable)
* src_iface / dst_ip / dst_port
* process_id/process_name (nullable)
* correlation_id (link DNS → subsequent connection attempt)
* confidence score (esp. DoH)
* raw evidence pointers (pcap offsets / event IDs)
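One possible shape for that record, before JSON serialization; the field names mirror the list above and the types are suggestions, not a spec:

```python
# Sketch: one possible LeakEvent schema, mirroring the field list above.

from dataclasses import dataclass
from typing import Optional

@dataclass
class LeakEvent:
    timestamp: float
    leak_type: str                  # "A" | "B" | "C" | "D"
    transport: str                  # "udp53" | "tcp53" | "dot" | "doh"
    qname: Optional[str]            # null for encrypted transports
    qtype: Optional[int]
    src_iface: str
    dst_ip: str
    dst_port: int
    process_id: Optional[int]
    process_name: Optional[str]
    correlation_id: Optional[str]   # links DNS to the follow-up connection attempt
    confidence: float               # mainly for DoH heuristics
    evidence_refs: list             # pcap offsets / event IDs
```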
## Summary report
* Leak counts by type
* Top leaking processes
* Top leaking resolver destinations
* Timeline view (bursts often indicate OS fallback behavior)
* “Pass/Fail” per policy definition
---
# 9) Validation strategy (“how do I know my detector is correct?”)
## Ground truth tests
1. **Known-leak scenario**
* intentionally set OS DNS to ISP DNS, no tunnel
* detector must catch plaintext DNS
2. **Known-safe scenario**
* local stub only + blocked outbound 53/853
* detector should show zero leaks
3. **Bypass scenario**
* enable browser built-in DoH directly
* detector should catch encrypted resolver bypass (Leak-C)
4. **Split-policy scenario**
* allowlist CN direct, everything else proxy-resolve
* detector should show:
* allowlist resolved direct
* unknown resolved via proxy path
---
# 10) Recommended “profiles” (makes tool usable)
Provide built-in presets:
### Profile 1: Full-tunnel VPN
* allow DNS only via tunnel interface or loopback stub
* any UDP/TCP 53 on physical NIC = leak
### Profile 2: Proxy + local stub (your case)
* allow DNS only to loopback stub
* allow stub upstream only via proxy server destinations
* flag any direct DoH/DoT to public endpoints
### Profile 3: Split tunnel (geoip + allowlist)
* allow plaintext DNS **only** for allowlisted domains (if user accepts risk)
* enforce “unknown → proxy-resolve”
* emphasize Leak-B correctness
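The presets could ship as plain declarative data that the policy module expands against the live system; the keys and placeholder values below are illustrative only:

```python
# Sketch: built-in profile presets as declarative data. Interface names and
# destinations are placeholders to be filled in from the live system.

PROFILES = {
    "full_tunnel_vpn": {
        "allowed_interfaces": ["loopback", "tunnel"],
        "plaintext53_on_physical": "leak",
    },
    "proxy_plus_local_stub": {
        "allowed_destinations": ["127.0.0.1:53", "<proxy-server-ip>:443"],
        "flag_direct_doh_dot": True,
    },
    "split_tunnel": {
        "plaintext_dns_allowed_for": "allowlist_only",
        "unknown_domains": "proxy_resolve",
        "emphasize": "Leak-B",
    },
}
```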
---
Below is an updated **high-level design** (still language-agnostic) that integrates **process attribution** cleanly, including how it fits into the workflow and what to log.
---
# 1) New component: Process Attribution Engine (PAE)
## Purpose
When a DNS-like event is observed, the PAE tries to attach:
* **PID**
* **PPID**
* **process name**
* *(optional but extremely useful)* full command line, executable path, user, container/app package, etc.
This lets your logs answer:
> “Which program generated the leaked DNS request?”
> “Was it a browser, OS service, updater, antivirus, proxy itself, or some library?”
## Position in the pipeline
It sits between **Traffic Sensor** and **Leak Detector** as an “event enricher”:
**Traffic Event → (Classifier) → (Process Attribution) → Enriched Event → Leak Rules → Report**
---
# 2) Updated architecture (with process attribution)
### Existing modules (from earlier design)
1. Policy & Configuration
2. Traffic Sensor (packet/flow monitor)
3. Classifier (Plain DNS / DoT / DoH / Unknown)
4. DNS Parser (plaintext only)
5. Flow Tracker
6. Leak Detector (rules engine)
7. Active Prober
8. Report Generator
### New module
9. **Process Attribution Engine (PAE)**
* resolves “who owns this flow / packet”
* emits PID/PPID/name
* handles platform-specific differences and fallbacks
---
# 3) Workflow changes (what happens when a potential leak is seen)
## Passive detection loop (updated)
1. Capture outbound traffic event
2. Classify transport type:
* UDP/53, TCP/53 → plaintext DNS
* TCP/853 → DoT
* HTTPS patterns → DoH (heuristic)
3. Extract the **5-tuple**
* src IP:port, dst IP:port, protocol
4. **PAE lookup**
* resolve the owner process for this traffic
* attach PID/PPID/name (+ optional metadata)
5. Apply leak rules (A/B/C/D)
6. Emit:
* realtime log line (human readable)
* structured record (JSON/event log)
---
# 4) Process attribution: what to detect and how (high-level)
Process attribution always works on one core concept:
> **Map observed traffic (socket/flow) → owning process**
### Inputs PAE needs
* protocol (UDP/TCP)
* local src port
* local address
* timestamp
* optionally: connection state / flow ID
### Output from PAE
* `pid`, `ppid`, `process_name`
* optional enrichment:
* `exe_path`
* `cmdline`
* `user`
* “process tree chain” (for debugging: parent → child → …)
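On platforms where the socket table is readable, a provider can scan it for the matching local endpoint. The sketch below uses `psutil` purely as an example of one such provider (kernel-level flow ownership or event tracing are alternatives) and degrades to `None` when the library, the permission, or the process is gone:

```python
# Sketch: socket-table attribution provider. `psutil` is only one possible
# provider; the PAE should treat it as best-effort.

try:
    import psutil
except ImportError:
    psutil = None

def attribute_by_socket_table(protocol, local_ip, local_port):
    """Best-effort: map a local endpoint to {pid, ppid, name}, or None."""
    if psutil is None:
        return None                                  # provider unavailable on this system
    kind = "udp" if protocol == "udp" else "tcp"
    for conn in psutil.net_connections(kind=kind):
        if not conn.laddr or conn.pid is None:
            continue
        if conn.laddr.port != local_port:
            continue
        if conn.laddr.ip not in (local_ip, "0.0.0.0", "::"):
            continue
        try:
            proc = psutil.Process(conn.pid)
            return {"pid": proc.pid, "ppid": proc.ppid(), "name": proc.name()}
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            return None                              # race: process gone, or no permission
    return None
```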
---
# 5) Platform support strategy (without implementation detail)
Process attribution is **OS-specific**, so structure it as:
## “Attribution Provider” interface
* Provider A: “kernel-level flow owner”
* Provider B: “socket table owner lookup”
* Provider C: “event tracing feed”
* Provider D: fallback “unknown / not supported”
Your main design goal is:
### Design rule
**Attribution must be best-effort + gracefully degrading**, never blocking detection.
So you always log the leak even if PID is unavailable:
* `pid=null, attribution_confidence=LOW`
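The rule can be enforced with a provider chain that swallows provider errors and always returns something loggable; the provider names and returned field names below are illustrative:

```python
# Sketch: provider chain with graceful degradation. Each provider is a callable
# taking a flow and returning attribution info or None.

def attribute(flow, providers):
    """Try providers in priority order; never block or fail the leak event."""
    failures = []
    for name, provider in providers:
        try:
            result = provider(flow)
        except Exception as err:          # a broken provider must not drop the event
            failures.append(f"{name}: {err}")
            continue
        if result is not None:
            result["attribution_source"] = name
            return result
        failures.append(f"{name}: no match")
    # Every provider failed: the leak is still logged, just unattributed.
    return {"pid": None, "ppid": None, "process_name": None,
            "attribution_confidence": "NONE", "attribution_failures": failures}
```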
---
# 6) Attribution confidence + race handling (important!)
Attribution can be tricky because:
* a process may exit quickly (“short-lived resolver helper”)
* ports can be reused
* NAT or local proxies may obscure the real origin
So log **confidence**:
* **HIGH**: direct mapping from kernel/socket owner at time of event
* **MEDIUM**: mapping by lookup shortly after event (possible race)
* **LOW**: inferred / uncertain
* **NONE**: not resolved
Also record *why* attribution failed:
* “permission denied”
* “flow already gone”
* “unsupported transport”
* “ambiguous mapping”
This makes debugging much easier.
---
# 7) What PID/PPID adds to your leak definitions
### Leak-A (plaintext DNS outside safe path)
Now you can say:
> “`svchost.exe (PID 1234)` sent UDP/53 to ISP resolver on Wi-Fi interface”
### Leak-B (split-policy intent leak)
You can catch:
* “game launcher looked up blocked domain”
* “system service triggered a sensitive name unexpectedly”
* “your proxy itself isn't actually resolving via its own channel”
### Leak-C (encrypted DNS bypass)
This becomes *very actionable*:
> “`firefox.exe` started direct DoH to resolver outside tunnel”
### Leak-D (mismatch indicator)
You can also correlate:
* DNS resolved by one process
* connection made by another process
(e.g., local stub vs app)
---
# 8) Reporting / realtime logging format (updated)
## Realtime log line (human readable)
Example (conceptual):
* **[P0][Leak-A] Plain DNS leaked**
* Domain: `example-sensitive.com` (A)
* From: `Wi-Fi` → To: `1.2.3.4:53`
* Process: `browser.exe` **PID=4321 PPID=1200**
* Policy violated: “No UDP/53 on physical NIC”
## Structured event (JSON-style fields)
Minimum recommended fields:
### Event identity
* `event_id`
* `timestamp`
### DNS identity
* `transport` (udp53/tcp53/dot/doh/unknown)
* `qname` (nullable)
* `qtype` (nullable)
### Network path
* `interface_name`
* `src_ip`, `src_port`
* `dst_ip`, `dst_port`
* `route_class` (tunnel / physical / loopback)
### Process identity (your requested additions)
* `pid`
* `ppid`
* `process_name`
* optional:
* `exe_path`
* `cmdline`
* `user`
### Detection result
* `leak_type` (A/B/C/D)
* `severity` (P0..P3)
* `policy_rule_id`
* `attribution_confidence`
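For concreteness, an enriched record might serialize like the following (all values invented for illustration, matching the realtime example above):

```python
# Illustrative enriched event (values are invented). This is roughly what the
# structured JSON record would look like once serialized.

example_event = {
    "event_id": "evt-000123",
    "timestamp": "2026-01-17T10:45:24Z",
    "transport": "udp53",
    "qname": "example-sensitive.com",
    "qtype": "A",
    "interface_name": "wlan0",
    "src_ip": "192.168.1.20", "src_port": 54122,
    "dst_ip": "203.0.113.53", "dst_port": 53,
    "route_class": "physical",
    "pid": 4321, "ppid": 1200,
    "process_name": "browser.exe",
    "exe_path": None, "cmdline": None, "user": None,
    "leak_type": "A",
    "severity": "P0",
    "policy_rule_id": "no-udp53-on-physical-nic",
    "attribution_confidence": "HIGH",
}
```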
---
# 9) Privacy and safety notes (important in a DNS tool)
Because you're logging **domains** and **process command lines**, this becomes sensitive.
Add a “privacy mode” policy:
* **Full**: store full domain + cmdline
* **Redacted**: hash domain; keep TLD only; truncate cmdline
* **Minimal**: only keep leak counts + resolver IPs + process name
Also allow “capture window” (rotate logs, avoid giant histories).
---
# 10) UX feature: “Show me the process tree”
When a leak happens, a good debugger view is:
* `Process: foo (PID 1000)`
* `Parent: bar (PID 900)`
* `Grandparent: systemd / svchost / …`
This is extremely useful to identify:
* browsers spawning helpers
* OS DNS services
* containerized processes
* update agents / telemetry daemons
So your report generator should support:
**Process chain rendering** (where possible)
---
# 11) Practical edge cases you should detect (with PID helping)
1. **Local stub is fine, upstream isn't**
* Your local resolver process leaks upstream plaintext DNS
2. **Browser uses its own DoH**
* process attribution immediately reveals it
3. **Multiple interfaces**
* a leak only happens on Wi-Fi but not Ethernet
4. **Kill-switch failure**
* when tunnel drops, PID shows which app starts leaking first
---