Below is a **high-level (language-agnostic)** design for a **client-side DNS leak detector** aimed at *censorship-resistance threat models*, i.e.:
> “Censor/ISP can observe/log DNS intent or infer proxy usage; we want to detect when DNS behavior escapes the intended protection path.”
I'll cover: **definitions**, **detection standards**, **workflow**, **modules**, **passive + active detection**, **outputs**, and **test methodology**.
---
# 1) Scope and goals
## Goals
Your detector should answer, with evidence:
1. **Did any DNS query leave the device outside the intended safe path?**
2. **Which domains leaked?** (when visible)
3. **Which transport leaked?** (UDP/53, TCP/53, DoT/853, DoH)
4. **Which interface leaked?** (Wi-Fi/Ethernet vs tunnel)
5. **Which process/app triggered it?** (if your OS allows attribution)
And in your censorship model, it should also detect:
6. **Split-policy intent leakage**: “unknown/sensitive domains were resolved using domestic/ISP-facing DNS.”
## Non-goals (be explicit)
* Not a censorship circumvention tool itself
* Not a full firewall manager (can suggest fixes, but detection is the core)
* Not perfect attribution on every OS (process mapping may be partial)
---
# 2) Define “DNS leak” precisely (your program's standard)
You need a **formal definition** because “DNS leak” is overloaded.
## Standard definition A (classic VPN / tunnel bypass)
A leak occurs if:
> **An unencrypted DNS query is sent outside the secure tunnel path**
> This is essentially how popular leak test sites define it (“unencrypted DNS query sent OUTSIDE the established VPN tunnel”). ([IP Leak][1])
Your detector should implement it in a machine-checkable way:
**Leak-A condition**
* DNS over **UDP/53 or TCP/53**
* Destination is **not** a “trusted resolver path” (e.g., not the tunnel interface, not loopback stub, not proxy channel)
* Interface is **not** the intended egress
✅ Strong for censorship: plaintext DNS exposes intent.
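To make this machine-checkable, the condition reduces to a single predicate over a normalized DNS event. The sketch below assumes illustrative field names (`transport`, `dst_ip`, `iface`) and a policy object holding the trusted resolver IPs and protected interfaces; none of this is a fixed schema.

```python
# Sketch: Leak-A check over a normalized DNS event.
# Field names (transport, dst_ip, iface) are illustrative only.

PLAINTEXT_DNS = {"udp53", "tcp53"}

def is_leak_a(event, policy):
    """True if an unencrypted DNS query left the device outside the safe path."""
    if event["transport"] not in PLAINTEXT_DNS:
        return False
    # Destination is part of the trusted resolver path
    # (tunnel endpoint, loopback stub, or proxy channel) -> not a leak.
    if event["dst_ip"] in policy["trusted_resolver_ips"]:
        return False
    # Interface is the intended egress (tunnel or loopback) -> not a leak.
    if event["iface"] in policy["protected_interfaces"]:
        return False
    return True
```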
---
## Standard definition B (split-policy intent leak)
A leak occurs if:
> **A domain that should be “proxied / remote-resolved” was queried via local/ISP-facing DNS.**
This is the “proxy split rules still leak intent” case.
**Leak-B condition**
* Query name matches either:
* a “proxy-required set” (sensitive list, non-allowlist, unknown), or
* a policy rule (“everything except allowlist must resolve via proxy DNS”)
* And the query was observed going to:
* ISP resolver(s) / domestic resolver(s) / non-tunnel interface
✅ This is the leak most users in censorship settings care about.
---
## Standard definition C (encrypted DNS escape / bypass)
A leak occurs if:
> DNS was encrypted, but escaped the intended channel (e.g., app uses its own DoH directly to the Internet).
This matters because DoH hides the QNAME but still creates **observable behavior** and breaks your “DNS must follow proxy” invariant.
**Leak-C condition**
* DoH (RFC 8484) ([IETF Datatracker][2]) or DoT (RFC 7858) ([IETF Datatracker][3]) flow exists
* And it does **not** go through your approved egress path (tunnel/proxy)
✅ Detects “Firefox/Chrome built-in DoH bypass” style cases.
---
## Standard definition D (mismatch risk indicator)
Not a “leak” by itself, but a **proxy inference amplifier**:
> DNS egress region/path differs from traffic egress region/path.
This is a *censorship-resistance hygiene metric*, not a binary leak.
**Mismatch condition**
* Same domain produces:
* DNS resolution via path X
* TCP/TLS connection via path Y
* Where X ≠ Y (interface, ASN region, etc.)
✅ Helps catch “DNS direct, traffic proxy” or “DNS proxy, traffic direct” weirdness.
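One way to compute this indicator is to correlate the path that carried a domain's DNS resolution with the path that carried the subsequent connection. The sketch below assumes both observation streams carry an illustrative `path_class` label (e.g., “tunnel” vs. “physical”) and that connections expose a server name (from SNI or a prior DNS answer):

```python
# Sketch: Leak-D (mismatch) correlation. Each observation is assumed to record
# which path class ("tunnel" or "physical") carried it.

def find_path_mismatches(dns_observations, connection_observations):
    """Yield domains whose DNS path differs from their traffic path."""
    dns_path = {obs["qname"]: obs["path_class"] for obs in dns_observations}
    for conn in connection_observations:
        domain = conn.get("server_name")        # e.g., from SNI or a prior DNS answer
        if domain and domain in dns_path:
            if dns_path[domain] != conn["path_class"]:
                yield {
                    "domain": domain,
                    "dns_path": dns_path[domain],
                    "traffic_path": conn["path_class"],
                }
```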
---
# 3) High-level architecture
## Core components
1. **Policy & Configuration**
* What counts as “safe DNS path”
* Which interfaces are “protected” (tunnel) vs “physical”
* Allowlist / proxy-required sets (optional)
* Known resolver lists (optional)
* Severity thresholds
2. **Traffic Sensor (Passive Monitor)**
* Captures outbound traffic metadata (and optionally payload for DNS parsing)
* Must cover:
* UDP/53, TCP/53
* TCP/853 (DoT)
* HTTPS flows that look like DoH (see below)
* Emits normalized events into a pipeline
3. **Classifier**
* Recognize DNS protocol types:
* Plain DNS
* DoT
* DoH
* Attach confidence scores (especially for DoH)
4. **DNS Parser (for plaintext DNS only)**
* Extract: QNAME, QTYPE, transaction IDs, response codes (optional)
* Store minimally (privacy-aware)
5. **Flow Tracker**
* Correlate packets into “flows”
* Map flow → interface → destination → process (if possible)
* Track timing correlation: DNS → connection attempts
6. **Leak Detector (Rules Engine)**
* Apply Leak-A/B/C/D definitions
* Produce leak events + severity + evidence chain
7. **Active Prober**
* Generates controlled DNS lookups to test behavior
* Can test fail-closed, bypasses, multi-interface behavior, etc.
8. **Report Generator**
* Human-readable summary
* Machine-readable logs (JSON)
* Recommendations (non-invasive)
---
# 4) Workflow (end-to-end)
## Workflow 0: Setup & baseline
1. Enumerate interfaces and routes
* Identify physical NICs
* Identify tunnel / proxy interface (or “expected egress destinations”)
2. Identify system DNS configuration
* Default resolvers per interface
* Local stub presence (127.0.0.1, etc.)
3. Load policy profile
* Full-tunnel, split-tunnel, or proxy-based
4. Start passive monitor
**Output:** “Current state snapshot” (useful even before testing).
---
## Workflow 1: Passive detection loop (always-on)
Continuously:
1. Capture outbound packets/flows
2. Classify as DNS-like (plain DNS / DoT / DoH / unknown)
3. If plaintext DNS → parse QNAME/QTYPE
4. Assign metadata:
* interface
* dst IP/port
* process (if possible)
* timestamp
5. Evaluate leak rules:
* Leak-A/B/C/D
6. Write event log + optional real-time alert
**Key design point:** passive mode should be able to detect leaks **without requiring any special test domain**.
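Structurally, the loop is a small pipeline. The sketch below is shape-only: the `sensor`, `classify`, `parse_dns`, `attribute_process`, and `evaluate_rules` callables stand in for the modules from section 3 and are injected rather than implemented here.

```python
# Sketch of the passive detection loop. All collaborators are injected so the
# loop stays module-agnostic: `sensor` yields captured packets, and the other
# callables stand in for the Classifier, DNS Parser, PAE, and rules engine.

def passive_loop(sensor, classify, parse_dns, attribute_process, evaluate_rules,
                 policy, emit):
    for packet in sensor.capture_outbound():               # 1. capture
        kind = classify(packet)                            # 2. plain DNS / DoT / DoH / unknown
        if kind == "unknown":
            continue
        qname = parse_dns(packet) if kind == "plain_dns" else None   # 3. parse QNAME
        event = {                                          # 4. attach metadata
            "transport": kind,
            "qname": qname,
            "iface": packet.iface,
            "dst_ip": packet.dst_ip,
            "dst_port": packet.dst_port,
            "process": attribute_process(packet),          # best-effort, may be None
            "timestamp": packet.timestamp,
        }
        for leak in evaluate_rules(event, policy):         # 5. Leak-A/B/C/D
            emit(leak)                                     # 6. log / alert
```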
---
## Workflow 2: Active test suite (on-demand)
Active tests exist because some leaks are intermittent or only happen under stress.
### Active Test A: “No plaintext DNS escape”
* Trigger a set of DNS queries (unique random domains)
* Verify **zero UDP/53 & TCP/53** leaves physical interfaces
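A minimal sketch of Active Test A, assuming the passive monitor exposes the events it observed (the `captured_events` callable and the probe-domain suffix are placeholders). The test triggers lookups for unique random names so that any plaintext query seen on a physical interface is unambiguously ours.

```python
# Sketch: Active Test A ("no plaintext DNS escape"). `captured_events` is
# assumed to return events the passive monitor observed during the test.

import secrets
import socket

def run_plaintext_escape_test(captured_events, probe_suffix="dnsleak-probe.invalid", count=5):
    """Trigger unique lookups, then check none left a physical interface in plaintext."""
    probes = {f"{secrets.token_hex(8)}.{probe_suffix}" for _ in range(count)}
    for name in probes:
        try:
            socket.getaddrinfo(name, None)     # resolution may fail; only the wire matters
        except socket.gaierror:
            pass
    leaked = [
        e for e in captured_events()
        if e["transport"] in ("udp53", "tcp53")
        and e["route_class"] == "physical"
        and (e.get("qname") or "").rstrip(".") in probes
    ]
    return {"pass": not leaked, "evidence": leaked}
```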
### Active Test B: “Fail-closed test”
* Temporarily disrupt the “protected path” (e.g., tunnel down)
* Trigger lookups again
* Expected: DNS fails (no fallback to ISP DNS)
### Active Test C: “App bypass test”
* Launch test scenarios that mimic real apps
* Confirm no direct DoH/DoT flows go to public Internet outside the proxy path
### Active Test D: “Split-policy correctness”
* Query domains that should be:
* direct-allowed
* proxy-required
* unknown
* Confirm resolution path matches policy
---
# 5) How to recognize DNS transports (detection mechanics)
## Plain DNS (strongest signal)
**Match conditions**
* UDP dst port 53 OR TCP dst port 53
* Parse DNS header
* Extract QNAME/QTYPE
**Evidence strength:** high
**Intent visibility:** yes (domain visible)
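A minimal QNAME/QTYPE extractor only needs the fixed 12-byte header plus label-by-label decoding of the first question. The sketch below handles UDP payloads only and skips name compression (which rarely appears in the question section):

```python
# Sketch: minimal QNAME/QTYPE extraction from a raw DNS message (UDP payload).
# Handles only the first question and does not follow compression pointers.

import struct

def parse_question(payload: bytes):
    """Return (qname, qtype) from a DNS query payload, or None if malformed."""
    if len(payload) < 12:
        return None
    qdcount = struct.unpack("!H", payload[4:6])[0]   # question count from the header
    if qdcount < 1:
        return None
    labels, pos = [], 12                             # question section starts after the header
    while pos < len(payload):
        length = payload[pos]
        pos += 1
        if length == 0:                              # root label terminates the name
            break
        labels.append(payload[pos:pos + length].decode("ascii", errors="replace"))
        pos += length
    if pos + 4 > len(payload):                       # QTYPE + QCLASS must follow
        return None
    qtype = struct.unpack("!H", payload[pos:pos + 2])[0]
    return ".".join(labels), qtype
```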
---
## DoT (port-based, easy)
DoT is defined over TLS, typically port **853**. ([IETF Datatracker][3])
**Match conditions**
* TCP dst port 853
* Optionally confirm TLS handshake exists
**Evidence strength:** high
**Intent visibility:** no (domain hidden)
---
## DoH (harder; heuristic + optional allowlists)
DoH is DNS over HTTPS (RFC 8484). ([IETF Datatracker][2])
**Recognizers (from strongest to weakest):**
1. HTTP request with `Content-Type: application/dns-message`
2. Path/pattern common to DoH endpoints (optional list)
3. SNI matches known DoH providers (optional list)
4. Traffic resembles frequent small HTTPS POST/GET bursts typical of DoH (weak)
**Evidence strength:** medium
**Intent visibility:** no (domain hidden)
**Important for your use case:** you may not need to *prove* it's DoH; you mostly need to detect “DNS-like encrypted resolver traffic bypassing the proxy channel.”
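A confidence-scored recognizer can simply walk these signals from strongest to weakest. The provider hostnames and score weights below are illustrative placeholders, not a vetted dataset:

```python
# Sketch: heuristic DoH scoring. The provider hostnames and weights are
# illustrative placeholders.

KNOWN_DOH_SNI = {"dns.google", "cloudflare-dns.com", "dns.quad9.net"}   # example entries only

def score_doh(flow):
    """Return (confidence, reasons) that an HTTPS flow is DoH-like."""
    score, reasons = 0.0, []
    if flow.get("content_type") == "application/dns-message":   # only visible if HTTP is inspectable
        score = 1.0
        reasons.append("dns-message content type")
    if flow.get("sni") in KNOWN_DOH_SNI:
        score = max(score, 0.8)
        reasons.append("known DoH provider SNI")
    if flow.get("url_path", "").startswith("/dns-query"):
        score = max(score, 0.7)
        reasons.append("common DoH path")
    if flow.get("small_request_bursts"):                         # weak traffic-shape signal
        score = max(score, 0.3)
        reasons.append("DoH-like traffic shape")
    return score, reasons
```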
---
# 6) Policy model: define “safe DNS path”
You need a simple abstraction users can configure:
### Safe DNS path can be defined by one or more of:
* **Allowed interfaces**
* loopback (local stub)
* tunnel interface
* **Allowed destination set**
* proxy server IP(s)
* internal resolver IP(s)
* **Allowed process**
* only your local stub + proxy allowed to resolve externally
* **Allowed port set**
* maybe only permit 443 to proxy server (if DNS rides inside it)
Then implement:
**A DNS event is a “leak” if it violates safe-path constraints.**
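Under those abstractions the check itself stays small. The sketch below models the safe path as a dataclass whose field names mirror the list above; an empty set means “constraint not configured”:

```python
# Sketch: safe-path evaluation. Field names mirror the abstraction above.

from dataclasses import dataclass, field

@dataclass
class SafePath:
    allowed_interfaces: set = field(default_factory=set)    # e.g., {"lo", "tun0"}
    allowed_destinations: set = field(default_factory=set)  # proxy / internal resolver IPs
    allowed_processes: set = field(default_factory=set)     # e.g., {"local-stub", "proxy"}
    allowed_ports: set = field(default_factory=set)         # e.g., {443} toward the proxy

def violated_constraints(event, safe: SafePath):
    """Return the list of safe-path constraints a DNS-like event violates."""
    violations = []
    if safe.allowed_interfaces and event["iface"] not in safe.allowed_interfaces:
        violations.append("interface")
    if safe.allowed_destinations and event["dst_ip"] not in safe.allowed_destinations:
        violations.append("destination")
    if safe.allowed_processes and event.get("process_name") not in safe.allowed_processes:
        violations.append("process")
    if safe.allowed_ports and event["dst_port"] not in safe.allowed_ports:
        violations.append("port")
    return violations
```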
---
# 7) Leak severity model (useful for real-world debugging)
### Severity P0 (critical)
* Plaintext DNS (UDP/TCP 53) on physical interface to ISP/public resolver
* Especially if QNAME matches proxy-required/sensitive list
### Severity P1 (high)
* DoH/DoT bypassing proxy channel directly to public Internet
### Severity P2 (medium)
* Policy mismatch: domain resolved locally but connection later proxied (or vice versa)
### Severity P3 (low / info)
* Authoritative-side “resolver egress exposure” (less relevant for a client-side leak detector)
* CDN performance mismatch indicators
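A direct mapping from the model above, with illustrative field names (`route_class`, `via_proxy`) standing in for whatever the flow tracker actually records:

```python
# Sketch: severity assignment following the P0-P3 model above.

def assign_severity(leak):
    if leak["transport"] in ("udp53", "tcp53") and leak["route_class"] == "physical":
        return "P0"          # plaintext DNS on a physical interface
    if leak["transport"] in ("dot", "doh") and not leak.get("via_proxy", False):
        return "P1"          # encrypted DNS bypassing the proxy channel
    if leak["leak_type"] == "D":
        return "P2"          # path/region mismatch indicator
    return "P3"
```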
---
# 8) Outputs and reporting
## Real-time console output (for debugging)
* “DNS leak detected: Plain DNS”
* domain (if visible)
* destination resolver IP
* interface
* process name (if available)
* policy rule violated
* suggested fix category (e.g., “force stub + block port 53”)
## Forensics log (machine-readable)
A single **LeakEvent** record could include:
* timestamp
* leak_type (A/B/C/D)
* transport (UDP53, TCP53, DoT, DoH)
* qname/qtype (nullable)
* src_iface / dst_ip / dst_port
* process_id/process_name (nullable)
* correlation_id (link DNS → subsequent connection attempt)
* confidence score (esp. DoH)
* raw evidence pointers (pcap offsets / event IDs)
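One possible shape for that record, before JSON serialization; the field names mirror the list above and the types are suggestions, not a spec:

```python
# Sketch: one possible LeakEvent schema, mirroring the field list above.

from dataclasses import dataclass
from typing import Optional

@dataclass
class LeakEvent:
    timestamp: float
    leak_type: str                  # "A" | "B" | "C" | "D"
    transport: str                  # "udp53" | "tcp53" | "dot" | "doh"
    qname: Optional[str]            # null for encrypted transports
    qtype: Optional[int]
    src_iface: str
    dst_ip: str
    dst_port: int
    process_id: Optional[int]
    process_name: Optional[str]
    correlation_id: Optional[str]   # links DNS to the follow-up connection attempt
    confidence: float               # mainly for DoH heuristics
    evidence_refs: list             # pcap offsets / event IDs
```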
## Summary report
* Leak counts by type
* Top leaking processes
* Top leaking resolver destinations
* Timeline view (bursts often indicate OS fallback behavior)
* “Pass/Fail” per policy definition
---
# 9) Validation strategy (“how do I know my detector is correct?”)
## Ground truth tests
1. **Known-leak scenario**
* intentionally set OS DNS to ISP DNS, no tunnel
* detector must catch plaintext DNS
2. **Known-safe scenario**
* local stub only + blocked outbound 53/853
* detector should show zero leaks
3. **Bypass scenario**
* enable browser built-in DoH directly
* detector should catch encrypted resolver bypass (Leak-C)
4. **Split-policy scenario**
* allowlist CN direct, everything else proxy-resolve
* detector should show:
* allowlist resolved direct
* unknown resolved via proxy path
---
# 10) Recommended “profiles” (makes tool usable)
Provide built-in presets:
### Profile 1: Full-tunnel VPN
* allow DNS only via tunnel interface or loopback stub
* any UDP/TCP 53 on physical NIC = leak
### Profile 2: Proxy + local stub (your case)
* allow DNS only to loopback stub
* allow stub upstream only via proxy server destinations
* flag any direct DoH/DoT to public endpoints
### Profile 3: Split tunnel (geoip + allowlist)
* allow plaintext DNS **only** for allowlisted domains (if user accepts risk)
* enforce “unknown → proxy-resolve”
* emphasize Leak-B correctness
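The presets could ship as plain declarative data that the policy module expands against the live system; the keys and placeholder values below are illustrative only:

```python
# Sketch: built-in profile presets as declarative data. Interface names and
# destinations are placeholders to be filled in from the live system.

PROFILES = {
    "full_tunnel_vpn": {
        "allowed_interfaces": ["loopback", "tunnel"],
        "plaintext53_on_physical": "leak",
    },
    "proxy_plus_local_stub": {
        "allowed_destinations": ["127.0.0.1:53", "<proxy-server-ip>:443"],
        "flag_direct_doh_dot": True,
    },
    "split_tunnel": {
        "plaintext_dns_allowed_for": "allowlist_only",
        "unknown_domains": "proxy_resolve",
        "emphasize": "Leak-B",
    },
}
```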
---
Below is an updated **high-level design** (still language-agnostic) that integrates **process attribution** cleanly, including how it fits into the workflow and what to log.
---
# 1) New component: Process Attribution Engine (PAE)
## Purpose
When a DNS-like event is observed, the PAE tries to attach:
* **PID**
* **PPID**
* **process name**
* *(optional but extremely useful)* full command line, executable path, user, container/app package, etc.
This lets your logs answer:
> “Which program generated the leaked DNS request?”
> “Was it a browser, OS service, updater, antivirus, proxy itself, or some library?”
## Position in the pipeline
It sits between **Traffic Sensor** and **Leak Detector** as an “event enricher”:
**Traffic Event → (Classifier) → (Process Attribution) → Enriched Event → Leak Rules → Report**
---
# 2) Updated architecture (with process attribution)
### Existing modules (from earlier design)
1. Policy & Configuration
2. Traffic Sensor (packet/flow monitor)
3. Classifier (Plain DNS / DoT / DoH / Unknown)
4. DNS Parser (plaintext only)
5. Flow Tracker
6. Leak Detector (rules engine)
7. Active Prober
8. Report Generator
### New module
9. **Process Attribution Engine (PAE)**
* resolves “who owns this flow / packet”
* emits PID/PPID/name
* handles platform-specific differences and fallbacks
---
# 3) Workflow changes (what happens when a potential leak is seen)
## Passive detection loop (updated)
1. Capture outbound traffic event
2. Classify transport type:
* UDP/53, TCP/53 → plaintext DNS
* TCP/853 → DoT
* HTTPS patterns → DoH (heuristic)
3. Extract the **5-tuple**
* src IP:port, dst IP:port, protocol
4. **PAE lookup**
* resolve the owner process for this traffic
* attach PID/PPID/name (+ optional metadata)
5. Apply leak rules (A/B/C/D)
6. Emit:
* realtime log line (human readable)
* structured record (JSON/event log)
---
# 4) Process attribution: what to detect and how (high-level)
Process attribution always works on one core concept:
> **Map observed traffic (socket/flow) → owning process**
### Inputs PAE needs
* protocol (UDP/TCP)
* local src port
* local address
* timestamp
* optionally: connection state / flow ID
### Output from PAE
* `pid`, `ppid`, `process_name`
* optional enrichment:
* `exe_path`
* `cmdline`
* `user`
* “process tree chain” (for debugging: parent → child → …)
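On platforms where the socket table is readable, a provider can scan it for the matching local endpoint. The sketch below uses `psutil` purely as an example of one such provider (kernel-level flow ownership or event tracing are alternatives) and degrades to `None` when the library, the permission, or the process is gone:

```python
# Sketch: socket-table attribution provider. `psutil` is only one possible
# provider; the PAE should treat it as best-effort.

try:
    import psutil
except ImportError:
    psutil = None

def attribute_by_socket_table(protocol, local_ip, local_port):
    """Best-effort: map a local endpoint to {pid, ppid, name}, or None."""
    if psutil is None:
        return None                                  # provider unavailable on this system
    kind = "udp" if protocol == "udp" else "tcp"
    for conn in psutil.net_connections(kind=kind):
        if not conn.laddr or conn.pid is None:
            continue
        if conn.laddr.port != local_port:
            continue
        if conn.laddr.ip not in (local_ip, "0.0.0.0", "::"):
            continue
        try:
            proc = psutil.Process(conn.pid)
            return {"pid": proc.pid, "ppid": proc.ppid(), "name": proc.name()}
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            return None                              # race: process gone, or no permission
    return None
```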
---
# 5) Platform support strategy (without implementation detail)
Process attribution is **OS-specific**, so structure it as:
## “Attribution Provider” interface
* Provider A: “kernel-level flow owner”
* Provider B: “socket table owner lookup”
* Provider C: “event tracing feed”
* Provider D: fallback “unknown / not supported”
Your main design goal is:
### Design rule
**Attribution must be best-effort + gracefully degrading**, never blocking detection.
So you always log the leak even if PID is unavailable:
* `pid=null, attribution_confidence=LOW`
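The rule can be enforced with a provider chain that swallows provider errors and always returns something loggable; the provider names and returned field names below are illustrative:

```python
# Sketch: provider chain with graceful degradation. Each provider is a callable
# taking a flow and returning attribution info or None.

def attribute(flow, providers):
    """Try providers in priority order; never block or fail the leak event."""
    failures = []
    for name, provider in providers:
        try:
            result = provider(flow)
        except Exception as err:          # a broken provider must not drop the event
            failures.append(f"{name}: {err}")
            continue
        if result is not None:
            result["attribution_source"] = name
            return result
        failures.append(f"{name}: no match")
    # Every provider failed: the leak is still logged, just unattributed.
    return {"pid": None, "ppid": None, "process_name": None,
            "attribution_confidence": "NONE", "attribution_failures": failures}
```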
---
# 6) Attribution confidence + race handling (important!)
Attribution can be tricky because:
* a process may exit quickly (“short-lived resolver helper”)
* ports can be reused
* NAT or local proxies may obscure the real origin
So log **confidence**:
* **HIGH**: direct mapping from kernel/socket owner at time of event
* **MEDIUM**: mapping by lookup shortly after event (possible race)
* **LOW**: inferred / uncertain
* **NONE**: not resolved
Also record *why* attribution failed:
* “permission denied”
* “flow already gone”
* “unsupported transport”
* “ambiguous mapping”
This makes debugging much easier.
---
# 7) What PID/PPID adds to your leak definitions
### Leak-A (plaintext DNS outside safe path)
Now you can say:
> “`svchost.exe (PID 1234)` sent UDP/53 to ISP resolver on Wi-Fi interface”
### Leak-B (split-policy intent leak)
You can catch:
* “game launcher looked up blocked domain”
* “system service triggered a sensitive name unexpectedly”
* “your proxy itself isn't actually resolving via its own channel”
### Leak-C (encrypted DNS bypass)
This becomes *very actionable*:
> “`firefox.exe` started direct DoH to resolver outside tunnel”
### Leak-D (mismatch indicator)
You can also correlate:
* DNS resolved by one process
* connection made by another process
(e.g., local stub vs app)
---
# 8) Reporting / realtime logging format (updated)
## Realtime log line (human readable)
Example (conceptual):
* **[P0][Leak-A] Plain DNS leaked**
* Domain: `example-sensitive.com` (A)
* From: `Wi-Fi` → To: `1.2.3.4:53`
* Process: `browser.exe` **PID=4321 PPID=1200**
* Policy violated: “No UDP/53 on physical NIC”
## Structured event (JSON-style fields)
Minimum recommended fields:
### Event identity
* `event_id`
* `timestamp`
### DNS identity
* `transport` (udp53/tcp53/dot/doh/unknown)
* `qname` (nullable)
* `qtype` (nullable)
### Network path
* `interface_name`
* `src_ip`, `src_port`
* `dst_ip`, `dst_port`
* `route_class` (tunnel / physical / loopback)
### Process identity (your requested additions)
* `pid`
* `ppid`
* `process_name`
* optional:
* `exe_path`
* `cmdline`
* `user`
### Detection result
* `leak_type` (A/B/C/D)
* `severity` (P0..P3)
* `policy_rule_id`
* `attribution_confidence`
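For concreteness, an enriched record might serialize like the following (all values invented for illustration, matching the realtime example above):

```python
# Illustrative enriched event (values are invented). This is roughly what the
# structured JSON record would look like once serialized.

example_event = {
    "event_id": "evt-000123",
    "timestamp": "2026-01-17T10:45:24Z",
    "transport": "udp53",
    "qname": "example-sensitive.com",
    "qtype": "A",
    "interface_name": "wlan0",
    "src_ip": "192.168.1.20", "src_port": 54122,
    "dst_ip": "203.0.113.53", "dst_port": 53,
    "route_class": "physical",
    "pid": 4321, "ppid": 1200,
    "process_name": "browser.exe",
    "exe_path": None, "cmdline": None, "user": None,
    "leak_type": "A",
    "severity": "P0",
    "policy_rule_id": "no-udp53-on-physical-nic",
    "attribution_confidence": "HIGH",
}
```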
---
# 9) Privacy and safety notes (important in a DNS tool)
Because you're logging **domains** and **process command lines**, this becomes sensitive.
Add a “privacy mode” policy:
* **Full**: store full domain + cmdline
* **Redacted**: hash domain; keep TLD only; truncate cmdline
* **Minimal**: only keep leak counts + resolver IPs + process name
Also allow “capture window” (rotate logs, avoid giant histories).
---
# 10) UX feature: “Show me the process tree”
When a leak happens, a good debugger view is:
* `Process: foo (PID 1000)`
* `Parent: bar (PID 900)`
* `Grandparent: systemd / svchost / …`
This is extremely useful to identify:
* browsers spawning helpers
* OS DNS services
* containerized processes
* update agents / telemetry daemons
So your report generator should support:
**Process chain rendering** (where possible)
---
# 11) Practical edge cases you should detect (with PID helping)
1. **Local stub is fine, upstream isn't**
* Your local resolver process leaks upstream plaintext DNS
2. **Browser uses its own DoH**
* process attribution immediately reveals it
3. **Multiple interfaces**
* a leak only happens on Wi-Fi but not Ethernet
4. **Kill-switch failure**
* when tunnel drops, PID shows which app starts leaking first
---