Add: dns leak detection

2026-01-17 18:45:24 +08:00
parent ccd4a31d21
commit cfa96bde08
30 changed files with 3973 additions and 16 deletions
--- a/docs/dns_leak_detection_design.md
+++ b/docs/dns_leak_detection_design.md
@@ -0,0 +1,723 @@
+Below is a **high-level (language-agnostic)** design for a **client-side DNS leak detector** aimed at *censorship-resistance threat models*, i.e.:
+
+> “Censor/ISP can observe/log DNS intent or infer proxy usage; we want to detect when DNS behavior escapes the intended protection path.”
+
+I’ll cover: **definitions**, **detection standards**, **workflow**, **modules**, **passive+active detection**, **outputs**, and **test methodology**.
+
+---
+
+# 1) Scope and goals
+
+## Goals
+
+Your detector should answer, with evidence:
+
+1. **Did any DNS query leave the device outside the intended safe path?**
+2. **Which domains leaked?** (when visible)
+3. **Which transport leaked?** (UDP/53, TCP/53, DoT/853, DoH)
+4. **Which interface leaked?** (Wi-Fi/Ethernet vs tunnel)
+5. **Which process/app triggered it?** (if your OS allows attribution)
+
+And in your censorship model, it should also detect:
+
+6. **Split-policy intent leakage**: “unknown/sensitive domains were resolved using domestic/ISP-facing DNS.”
+
+## Non-goals (be explicit)
+
+* Not a censorship circumvention tool itself
+* Not a full firewall manager (can suggest fixes, but detection is the core)
+* Not perfect attribution on every OS (process mapping may be partial)
+
+---
+
+# 2) Define “DNS leak” precisely (your program’s standard)
+
+You need a **formal definition** because “DNS leak” is overloaded.
+
+## Standard definition A (classic VPN / tunnel bypass)
+
+A leak occurs if:
+
+> **An unencrypted DNS query is sent outside the secure tunnel path.**
+
+This is essentially how popular leak-test sites define it (“unencrypted DNS query sent OUTSIDE the established VPN tunnel”). ([IP Leak][1])
+
+Your detector should implement it in a machine-checkable way:
+
+**Leak-A condition**
+
+* DNS over **UDP/53 or TCP/53**
+* Destination is **not** a “trusted resolver path” (e.g., not the tunnel interface, not loopback stub, not proxy channel)
+* Interface is **not** the intended egress
+
+✅ Strong for censorship: plaintext DNS exposes intent.
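The Leak-A condition can be made machine-checkable with a small predicate. The following is a minimal sketch, not a definitive implementation: the `DnsEvent` type, the interface names (`lo`, `tun0`), and the trusted resolver networks are all hypothetical placeholders that a real tool would load from its policy configuration.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class DnsEvent:
    proto: str      # "udp" or "tcp"
    dst_ip: str     # destination resolver address
    dst_port: int
    iface: str      # egress interface name


# Hypothetical policy values (assumptions for illustration only).
TRUSTED_IFACES = {"lo", "tun0"}
TRUSTED_RESOLVER_NETS = [ip_network("127.0.0.0/8"), ip_network("10.8.0.0/24")]


def is_leak_a(ev: DnsEvent) -> bool:
    """Return True when a plaintext DNS event meets the Leak-A condition."""
    if ev.proto not in ("udp", "tcp") or ev.dst_port != 53:
        return False  # not plaintext DNS, so Leak-A does not apply
    dst = ip_address(ev.dst_ip)
    trusted_dst = any(dst in net for net in TRUSTED_RESOLVER_NETS)
    trusted_iface = ev.iface in TRUSTED_IFACES
    # Leak-A: untrusted destination AND not the intended egress interface.
    return not trusted_dst and not trusted_iface
```

For example, `is_leak_a(DnsEvent("udp", "8.8.8.8", 53, "wlan0"))` flags a leak, while the same query sent over `tun0` does not.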
+
+---
+
+## Standard definition B (split-policy intent leak)
+
+A leak occurs if:
+
+> **A domain that should be “proxied / remote-resolved” was queried via local/ISP-facing DNS.**
+
+This is the “proxy split rules still leak intent” case.
+
+**Leak-B condition**
+
+* Query name matches either:
+
+  * a “proxy-required set” (sensitive list, non-allowlist, unknown), or
+  * a policy rule (“everything except allowlist must resolve via proxy DNS”)
+* And the query was observed going to:
+
+  * ISP resolver(s) / domestic resolver(s) / non-tunnel interface
+
+✅ This is the leak most users in censorship settings care about.
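A sketch of the Leak-B check under the “everything except allowlist must resolve via proxy DNS” rule. The allowlist entry and the `route_class` values (`physical`, `tunnel`, `loopback`) are assumptions for illustration; matching is done on whole labels so that `notexample.cn` does not match `example.cn`.

```python
# Hypothetical allowlist of domains that may be resolved directly.
ALLOWLIST = {"example.cn"}


def matches_suffix(qname: str, suffixes: set) -> bool:
    """True if qname equals, or is a subdomain of, any entry in suffixes."""
    labels = qname.rstrip(".").lower().split(".")
    # Compare every label-aligned suffix: a.b.c -> "a.b.c", "b.c", "c".
    return any(".".join(labels[i:]) in suffixes for i in range(len(labels)))


def requires_proxy(qname: str) -> bool:
    """Policy: anything not on the allowlist must be proxy-resolved."""
    return not matches_suffix(qname, ALLOWLIST)


def is_leak_b(qname: str, route_class: str) -> bool:
    """Leak-B: a proxy-required name was resolved over a physical interface."""
    return requires_proxy(qname) and route_class == "physical"
```

Note that this check needs a visible QNAME, so it only applies to plaintext DNS events.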
+
+---
+
+## Standard definition C (encrypted DNS escape / bypass)
+
+A leak occurs if:
+
+> DNS was encrypted, but escaped the intended channel (e.g., app uses its own DoH directly to the Internet).
+
+This matters because DoH hides the QNAME but still creates **observable behavior** and breaks your “DNS must follow proxy” invariant.
+
+**Leak-C condition**
+
+* DoH (RFC 8484) ([IETF Datatracker][2]) or DoT (RFC 7858) ([IETF Datatracker][3]) flow exists
+* And it does **not** go through your approved egress path (tunnel/proxy)
+
+✅ Detects “Firefox/Chrome built-in DoH bypass” style cases.
+
+---
+
+## Standard definition D (mismatch risk indicator)
+
+Not a “leak” by itself, but a **proxy inference amplifier**:
+
+> DNS egress region/path differs from traffic egress region/path.
+
+This is a *censorship-resistance hygiene metric*, not a binary leak.
+
+**Mismatch condition**
+
+* Same domain produces:
+
+  * DNS resolution via path X
+  * TCP/TLS connection via path Y
+* Where X ≠ Y (interface, ASN, region, etc.)
+
+✅ Helps catch “DNS direct, traffic proxy” or “DNS proxy, traffic direct” weirdness.
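The mismatch condition reduces to comparing two per-domain maps. A minimal sketch, assuming the flow tracker has already produced a `domain -> path` mapping for DNS resolutions and another for connections (both inputs and the path labels are hypothetical):

```python
def find_path_mismatches(dns_paths: dict, conn_paths: dict) -> list:
    """Return domains seen on both sides whose DNS path differs from
    the path used by the subsequent connection."""
    shared = dns_paths.keys() & conn_paths.keys()
    return sorted(d for d in shared if dns_paths[d] != conn_paths[d])


# Example: a.example was resolved directly but connected via the tunnel.
dns_seen = {"a.example": "physical", "b.example": "tunnel"}
conn_seen = {"a.example": "tunnel", "b.example": "tunnel", "c.example": "physical"}
mismatched = find_path_mismatches(dns_seen, conn_seen)
```

Domains observed on only one side (such as `c.example` here) are ignored, since there is nothing to compare them against.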
+
+---
+
+# 3) High-level architecture
+
+## Core components
+
+1. **Policy & Configuration**
+
+   * What counts as “safe DNS path”
+   * Which interfaces are “protected” (tunnel) vs “physical”
+   * Allowlist / proxy-required sets (optional)
+   * Known resolver lists (optional)
+   * Severity thresholds
+
+2. **Traffic Sensor (Passive Monitor)**
+
+   * Captures outbound traffic metadata (and optionally payload for DNS parsing)
+   * Must cover:
+
+     * UDP/53, TCP/53
+     * TCP/853 (DoT)
+     * HTTPS flows that look like DoH (see below)
+   * Emits normalized events into a pipeline
+
+3. **Classifier**
+
+   * Recognize DNS protocol types:
+
+     * Plain DNS
+     * DoT
+     * DoH
+   * Attach confidence scores (especially for DoH)
+
+4. **DNS Parser (for plaintext DNS only)**
+
+   * Extract: QNAME, QTYPE, transaction IDs, response codes (optional)
+   * Store minimally (privacy-aware)
+
+5. **Flow Tracker**
+
+   * Correlate packets into “flows”
+   * Map flow → interface → destination → process (if possible)
+   * Track timing correlation: DNS → connection attempts
+
+6. **Leak Detector (Rules Engine)**
+
+   * Apply Leak-A/B/C/D definitions
+   * Produce leak events + severity + evidence chain
+
+7. **Active Prober**
+
+   * Generates controlled DNS lookups to test behavior
+   * Can test fail-closed, bypasses, multi-interface behavior, etc.
+
+8. **Report Generator**
+
+   * Human-readable summary
+   * Machine-readable logs (JSON)
+   * Recommendations (non-invasive)
+
+---
+
+# 4) Workflow (end-to-end)
+
+## Workflow 0: Setup & baseline
+
+1. Enumerate interfaces and routes
+
+   * Identify physical NICs
+   * Identify tunnel / proxy interface (or “expected egress destinations”)
+2. Identify system DNS configuration
+
+   * Default resolvers per interface
+   * Local stub presence (127.0.0.1, etc.)
+3. Load policy profile
+
+   * Full-tunnel, split-tunnel, or proxy-based
+4. Start passive monitor
+
+**Output:** “Current state snapshot” (useful even before testing).
+
+---
+
+## Workflow 1: Passive detection loop (always-on)
+
+Continuously:
+
+1. Capture outbound packets/flows
+2. Classify as DNS-like (plain DNS / DoT / DoH / unknown)
+3. If plaintext DNS → parse QNAME/QTYPE
+4. Assign metadata:
+
+   * interface
+   * dst IP/port
+   * process (if possible)
+   * timestamp
+5. Evaluate leak rules:
+
+   * Leak-A/B/C/D
+6. Write event log + optional real-time alert
+
+**Key design point:** passive mode should be able to detect leaks **without requiring any special test domain**.
+
+---
+
+## Workflow 2: Active test suite (on-demand)
+
+Active tests exist because some leaks are intermittent or only happen under stress.
+
+### Active Test A: “No plaintext DNS escape”
+
+* Trigger a set of DNS queries (unique random domains)
+* Verify **zero UDP/53 & TCP/53** leaves physical interfaces
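One way to sketch Active Test A: generate unique probe names so that cached answers cannot hide a leak, then check the sensor's events for plaintext DNS on physical interfaces. The zone `leakprobe.example` is a placeholder (you would use a zone you control), and the `(iface, proto, dst_port)` event tuple is an assumed shape, not a fixed format.

```python
import secrets


def make_probe_names(count: int, zone: str = "leakprobe.example") -> list:
    """Generate unique random names so that no cache can answer them."""
    return [f"{secrets.token_hex(8)}.{zone}" for _ in range(count)]


def plaintext_escapes(events, physical_ifaces) -> list:
    """Filter (iface, proto, dst_port) tuples down to plaintext DNS
    that left through a physical interface."""
    return [
        e for e in events
        if e[0] in physical_ifaces and e[1] in ("udp", "tcp") and e[2] == 53
    ]
```

The test passes when `plaintext_escapes(...)` returns an empty list for the capture window in which the probe names were resolved.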
+
+### Active Test B: “Fail-closed test”
+
+* Temporarily disrupt the “protected path” (e.g., tunnel down)
+* Trigger lookups again
+* Expected: DNS fails (no fallback to ISP DNS)
+
+### Active Test C: “App bypass test”
+
+* Launch test scenarios that mimic real apps
+* Confirm no direct DoH/DoT flows go to public Internet outside the proxy path
+
+### Active Test D: “Split-policy correctness”
+
+* Query domains that should be:
+
+  * direct-allowed
+  * proxy-required
+  * unknown
+* Confirm resolution path matches policy
+
+---
+
+# 5) How to recognize DNS transports (detection mechanics)
+
+## Plain DNS (strongest signal)
+
+**Match conditions**
+
+* UDP dst port 53 OR TCP dst port 53
+* Parse DNS header
+* Extract QNAME/QTYPE
+
+**Evidence strength:** high
+**Intent visibility:** yes (domain visible)
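As an illustration of the parsing step, here is a minimal sketch that extracts the transaction ID, QNAME, and QTYPE of the first question from a raw DNS message. It assumes a UDP payload; for TCP/53 the 2-byte length prefix has to be stripped first. The function name is hypothetical.

```python
import struct


def parse_dns_question(payload: bytes):
    """Return (txid, qname, qtype) for the first question, or None if malformed."""
    if len(payload) < 12:                      # DNS header is 12 bytes
        return None
    txid, _flags, qdcount = struct.unpack("!HHH", payload[:6])
    if qdcount < 1:
        return None
    labels, pos = [], 12
    while True:
        if pos >= len(payload):
            return None                        # truncated name
        length = payload[pos]
        pos += 1
        if length == 0:                        # root label ends the name
            break
        if length & 0xC0:                      # compression pointer: not expected here
            return None
        if pos + length > len(payload):
            return None
        labels.append(payload[pos:pos + length].decode("ascii", errors="replace"))
        pos += length
    if pos + 4 > len(payload):
        return None
    qtype, _qclass = struct.unpack("!HH", payload[pos:pos + 4])
    return txid, ".".join(labels) or ".", qtype
```

Malformed input returns `None` rather than raising, so the sensor can keep running on garbage traffic that merely happens to use port 53.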
+
+---
+
+## DoT (port-based, easy)
+
+DoT is defined over TLS, typically port **853**. ([IETF Datatracker][3])
+
+**Match conditions**
+
+* TCP dst port 853
+* Optionally confirm TLS handshake exists
+
+**Evidence strength:** high
+**Intent visibility:** no (domain hidden)
+
+---
+
+## DoH (harder; heuristic + optional allowlists)
+
+DoH is DNS over HTTPS (RFC 8484). ([IETF Datatracker][2])
+
+**Recognizers (from strongest to weakest):**
+
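+1. HTTP request with `Content-Type: application/dns-message` (POST) or `Accept: application/dns-message` (GET); visible only where TLS is terminated locally, e.g., at a local proxy
+2. Path/pattern common to DoH endpoints (optional list)
+3. SNI matches known DoH providers (optional list)
+4. Traffic resembles frequent small HTTPS request/response bursts typical of DoH (weak)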
+
+**Evidence strength:** medium
+**Intent visibility:** no (domain hidden)
+
+**Important for your use-case:** you may not need to *prove* it’s DoH; you mostly need to detect “DNS-like encrypted resolver traffic bypassing the proxy channel.”
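The recognizer ladder above can be sketched as a scoring function. The numeric scores are illustrative assumptions, not calibrated values, and the `HttpsFlow` type is hypothetical; the header and path fields are only populated when the sensor can see inside TLS. `dns.google` and `cloudflare-dns.com` are two public DoH hostnames used here as a tiny example list.

```python
from dataclasses import dataclass

# Small example list; a real tool would ship a maintained provider list.
KNOWN_DOH_HOSTS = {"dns.google", "cloudflare-dns.com"}
DOH_MEDIA_TYPE = "application/dns-message"


@dataclass
class HttpsFlow:
    sni: str = ""
    path: str = ""                     # known only if TLS is terminated locally
    content_type: str = ""             # same caveat
    accept: str = ""                   # same caveat
    small_request_burst: bool = False  # output of a separate traffic heuristic


def doh_confidence(flow: HttpsFlow) -> float:
    """Score how DoH-like a flow is, strongest recognizer first."""
    if DOH_MEDIA_TYPE in (flow.content_type, flow.accept):
        return 0.95
    if flow.path.startswith("/dns-query"):
        return 0.8
    if flow.sni.lower() in KNOWN_DOH_HOSTS:
        return 0.7
    if flow.small_request_burst:
        return 0.3
    return 0.0
```

Returning a score instead of a boolean lets the rules engine decide where to draw the line, for example alerting only above 0.7 while still logging weaker matches.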
+
+---
+
+# 6) Policy model: define “safe DNS path”
+
+You need a simple abstraction users can configure:
+
+### Safe DNS path can be defined by one or more of:
+
+* **Allowed interfaces**
+
+  * loopback (local stub)
+  * tunnel interface
+* **Allowed destination set**
+
+  * proxy server IP(s)
+  * internal resolver IP(s)
+* **Allowed process**
+
+  * only your local stub + proxy allowed to resolve externally
+* **Allowed port set**
+
+  * maybe only permit 443 to proxy server (if DNS rides inside it)
+
+Then implement:
+
+**A DNS event is a “leak” if it violates safe-path constraints.**
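A minimal sketch of this abstraction, assuming that an empty set means “constraint not configured” (so users can define the safe path by one or more of the four dimensions). The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafePathPolicy:
    # An empty set means the constraint is not configured and is skipped.
    allowed_ifaces: frozenset = frozenset()
    allowed_dsts: frozenset = frozenset()
    allowed_processes: frozenset = frozenset()
    allowed_ports: frozenset = frozenset()

    def violations(self, iface: str, dst_ip: str, process: str, dst_port: int) -> list:
        """Return the names of every configured constraint the event violates."""
        checks = [
            ("interface", self.allowed_ifaces, iface),
            ("destination", self.allowed_dsts, dst_ip),
            ("process", self.allowed_processes, process),
            ("port", self.allowed_ports, dst_port),
        ]
        return [name for name, allowed, value in checks if allowed and value not in allowed]


policy = SafePathPolicy(
    allowed_ifaces=frozenset({"lo", "tun0"}),
    allowed_ports=frozenset({443}),
)
```

With this policy, `policy.violations("wlan0", "8.8.8.8", "firefox", 53)` reports both the interface and the port, and an event is a leak whenever the returned list is non-empty.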
+
+---
+
+# 7) Leak severity model (useful for real-world debugging)
+
+### Severity P0 (critical)
+
+* Plaintext DNS (UDP/TCP 53) on physical interface to ISP/public resolver
+* Especially if QNAME matches proxy-required/sensitive list
+
+### Severity P1 (high)
+
+* DoH/DoT bypassing proxy channel directly to public Internet
+
+### Severity P2 (medium)
+
+* Policy mismatch: domain resolved locally but connection later proxied (or vice versa)
+
+### Severity P3 (low / info)
+
+* Authoritative-side “resolver egress exposure” (less relevant for client-side leak detector)
+* CDN performance mismatch indicators
+
+---
+
+# 8) Outputs and reporting
+
+## Real-time console output (for debugging)
+
+* “DNS leak detected: Plain DNS”
+* domain (if visible)
+* destination resolver IP
+* interface
+* process name (if available)
+* policy rule violated
+* suggested fix category (e.g., “force stub + block port 53”)
+
+## Forensics log (machine-readable)
+
+A single **LeakEvent** record could include:
+
+* timestamp
+* leak_type (A/B/C/D)
+* transport (UDP53, TCP53, DoT, DoH)
+* qname/qtype (nullable)
+* src_iface / dst_ip / dst_port
+* process_id/process_name (nullable)
+* correlation_id (link DNS → subsequent connection attempt)
+* confidence score (esp. DoH)
+* raw evidence pointers (pcap offsets / event IDs)
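The record above maps naturally onto a small data class. This is one possible shape, not a fixed schema: field names follow the list, nullable fields default to `None`, and the example values are made up.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class LeakEvent:
    timestamp: str                          # e.g. ISO 8601 string
    leak_type: str                          # "A" | "B" | "C" | "D"
    transport: str                          # "UDP53" | "TCP53" | "DoT" | "DoH"
    src_iface: str
    dst_ip: str
    dst_port: int
    qname: Optional[str] = None             # None when the name is not visible
    qtype: Optional[str] = None
    process_id: Optional[int] = None
    process_name: Optional[str] = None
    correlation_id: Optional[str] = None    # links DNS to a later connection
    confidence: Optional[float] = None
    evidence: List[str] = field(default_factory=list)  # pcap offsets / event IDs

    def to_json(self) -> str:
        """Serialize with stable key order so log lines diff cleanly."""
        return json.dumps(asdict(self), sort_keys=True)


event = LeakEvent(
    timestamp="2026-01-17T10:45:24Z", leak_type="C", transport="DoH",
    src_iface="wlan0", dst_ip="192.0.2.10", dst_port=443, confidence=0.7,
)
```

Calling `event.to_json()` yields one JSON object per event, with `null` for the fields that a DoH leak cannot provide (`qname`, `qtype`).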
+
+## Summary report
+
+* Leak counts by type
+* Top leaking processes
+* Top leaking resolver destinations
+* Timeline view (bursts often indicate OS fallback behavior)
+* “Pass/Fail” per policy definition
+
+---
+
+# 9) Validation strategy (“how do I know my detector is correct?”)
+
+## Ground truth tests
+
+1. **Known-leak scenario**
+
+   * intentionally set OS DNS to ISP DNS, no tunnel
+   * detector must catch plaintext DNS
+
+2. **Known-safe scenario**
+
+   * local stub only + blocked outbound 53/853
+   * detector should show zero leaks
+
+3. **Bypass scenario**
+
+   * enable browser built-in DoH directly
+   * detector should catch encrypted resolver bypass (Leak-C)
+
+4. **Split-policy scenario**
+
+   * allowlisted domestic (CN) domains resolve directly; everything else resolves via the proxy
+   * detector should show:
+
+     * allowlist resolved direct
+     * unknown resolved via proxy path
+
+---
+
+# 10) Recommended “profiles” (makes tool usable)
+
+Provide built-in presets:
+
+### Profile 1: Full-tunnel VPN
+
+* allow DNS only via tunnel interface or loopback stub
+* any UDP/TCP 53 on physical NIC = leak
+
+### Profile 2: Proxy + local stub (your case)
+
+* allow DNS only to loopback stub
+* allow stub upstream only via proxy server destinations
+* flag any direct DoH/DoT to public endpoints
+
+### Profile 3: Split tunnel (geoip + allowlist)
+
+* allow plaintext DNS **only** for allowlisted domains (if user accepts risk)
+* enforce “unknown → proxy-resolve”
+* emphasize Leak-B correctness
+
+---
+
+Below is an updated **high-level design** (still language-agnostic) that integrates **process attribution** cleanly, including how it fits into the workflow and what to log.
+
+---
+
+# 1) New component: Process Attribution Engine (PAE)
+
+## Purpose
+
+When a DNS-like event is observed, the PAE tries to attach:
+
+* **PID**
+* **PPID**
+* **process name**
+* *(optional but extremely useful)* full command line, executable path, user, container/app package, etc.
+
+This lets your logs answer:
+
+> “Which program generated the leaked DNS request?”
+> “Was it a browser, OS service, updater, antivirus, proxy itself, or some library?”
+
+## Position in the pipeline
+
+It sits between **Traffic Sensor** and **Leak Detector** as an “event enricher”:
+
+**Traffic Event → (Classifier) → (Process Attribution) → Enriched Event → Leak Rules → Report**
+
+---
+
+# 2) Updated architecture (with process attribution)
+
+### Existing modules (from earlier design)
+
+1. Policy & Configuration
+2. Traffic Sensor (packet/flow monitor)
+3. Classifier (Plain DNS / DoT / DoH / Unknown)
+4. DNS Parser (plaintext only)
+5. Flow Tracker
+6. Leak Detector (rules engine)
+7. Active Prober
+8. Report Generator
+
+### New module
+
+9. **Process Attribution Engine (PAE)**
+
+   * resolves “who owns this flow / packet”
+   * emits PID/PPID/name
+   * handles platform-specific differences and fallbacks
+
+---
+
+# 3) Workflow changes (what happens when a potential leak is seen)
+
+## Passive detection loop (updated)
+
+1. Capture outbound traffic event
+2. Classify transport type:
+
+   * UDP/53, TCP/53 → plaintext DNS
+   * TCP/853 → DoT
+   * HTTPS patterns → DoH (heuristic)
+3. Extract the **5-tuple**
+
+   * src IP:port, dst IP:port, protocol
+4. **PAE lookup**
+
+   * resolve the owner process for this traffic
+   * attach PID/PPID/name (+ optional metadata)
+5. Apply leak rules (A/B/C/D)
+6. Emit:
+
+   * realtime log line (human readable)
+   * structured record (JSON/event log)
+
+---
+
+# 4) Process attribution: what to detect and how (high-level)
+
+Process attribution always works on one core concept:
+
+> **Map observed traffic (socket/flow) → owning process**
+
+### Inputs PAE needs
+
+* protocol (UDP/TCP)
+* local src port
+* local address
+* timestamp
+* optionally: connection state / flow ID
+
+### Output from PAE
+
+* `pid`, `ppid`, `process_name`
+* optional enrichment:
+
+  * `exe_path`
+  * `cmdline`
+  * `user`
+  * “process tree chain” (for debugging: parent → child → …)
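A sketch of the “socket table owner lookup” idea using the inputs listed above. It assumes some platform-specific code has already produced a snapshot mapping `(proto, local_ip, local_port)` to a PID; the snapshot format and function name are hypothetical. The wildcard fallback matters because many resolvers bind UDP sockets to `0.0.0.0` or `::` rather than to a specific address.

```python
from typing import Dict, Optional, Tuple

SocketKey = Tuple[str, str, int]  # (proto, local_ip, local_port)


def lookup_owner(table: Dict[SocketKey, int], proto: str,
                 local_ip: str, local_port: int) -> Optional[int]:
    """Find the PID owning a local endpoint, falling back to wildcard binds."""
    if ":" in local_ip:
        candidates = [local_ip, "::"]
    else:
        # An IPv4 packet may come from a socket bound to 0.0.0.0,
        # or from a dual-stack socket bound to "::".
        candidates = [local_ip, "0.0.0.0", "::"]
    for ip in candidates:
        pid = table.get((proto, ip, local_port))
        if pid is not None:
            return pid
    return None


snapshot = {("udp", "0.0.0.0", 5353): 42, ("tcp", "192.168.1.5", 50000): 7}
```

Because the snapshot is taken separately from the packet capture, a hit here is subject to races (process exit, port reuse) and should not be treated as certain.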
+
+---
+
+# 5) Platform support strategy (without implementation detail)
+
+Process attribution is **OS-specific**, so structure it as:
+
+## “Attribution Provider” interface
+
+* Provider A: “kernel-level flow owner”
+* Provider B: “socket table owner lookup”
+* Provider C: “event tracing feed”
+* Provider D: fallback “unknown / not supported”
+
+Your main design goal is:
+
+### Design rule
+
+**Attribution must be best-effort + gracefully degrading**, never blocking detection.
+
+So you always log the leak even if PID is unavailable:
+
+* `pid=null, attribution_confidence=NONE`
+
+---
+
+# 6) Attribution confidence + race handling (important!)
+
+Attribution can be tricky because:
+
+* a process may exit quickly (“short-lived resolver helper”)
+* ports can be reused
+* NAT or local proxies may obscure the real origin
+
+So log **confidence**:
+
+* **HIGH**: direct mapping from kernel/socket owner at time of event
+* **MEDIUM**: mapping by lookup shortly after event (possible race)
+* **LOW**: inferred / uncertain
+* **NONE**: not resolved
+
+Also record *why* attribution failed:
+
+* “permission denied”
+* “flow already gone”
+* “unsupported transport”
+* “ambiguous mapping”
+
+This makes debugging much easier.
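One way to combine the provider fallback with confidence levels and failure reasons is sketched below. Providers are modeled as plain callables that return a PID or raise; the three stand-in providers at the bottom are purely illustrative, and real ones would wrap platform-specific APIs.

```python
from enum import Enum
from typing import NamedTuple, Optional


class Confidence(Enum):
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"
    NONE = "NONE"


class Attribution(NamedTuple):
    pid: Optional[int]
    confidence: Confidence
    failure_reasons: tuple = ()


def attribute(flow_key, providers) -> Attribution:
    """Try each (provider, confidence) pair in order; never raise."""
    reasons = []
    for provider, confidence in providers:
        try:
            return Attribution(provider(flow_key), confidence)
        except PermissionError:
            reasons.append("permission denied")
        except LookupError as exc:
            reasons.append(str(exc) or "flow already gone")
    # Detection continues even when nobody could name the owner.
    return Attribution(None, Confidence.NONE, tuple(reasons))


# Stand-in providers for illustration only.
def kernel_provider(flow_key):
    raise PermissionError("needs elevated privileges")


def socket_table_provider(flow_key):
    return 4321


def gone_provider(flow_key):
    raise LookupError("flow already gone")
```

Ordering the providers from most to least reliable means the first success also carries the best available confidence level.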
+
+---
+
+# 7) What PID/PPID adds to your leak definitions
+
+### Leak-A (plaintext DNS outside safe path)
+
+Now you can say:
+
+> “`svchost.exe (PID 1234)` sent UDP/53 to ISP resolver on Wi-Fi interface”
+
+### Leak-B (split-policy intent leak)
+
+You can catch:
+
+* “game launcher looked up blocked domain”
+* “system service triggered a sensitive name unexpectedly”
+* “your proxy itself isn’t actually resolving via its own channel”
+
+### Leak-C (encrypted DNS bypass)
+
+This becomes *very actionable*:
+
+> “`firefox.exe` started direct DoH to resolver outside tunnel”
+
+### Leak-D (mismatch indicator)
+
+You can also correlate:
+
+* DNS resolved by one process
+* connection made by another process
+  (e.g., local stub vs app)
+
+---
+
+# 8) Reporting / realtime logging format (updated)
+
+## Realtime log line (human readable)
+
+Example (conceptual):
+
+* **[P0][Leak-A] Plain DNS leaked**
+
+  * Domain: `example-sensitive.com` (A)
+  * From: `Wi-Fi` → To: `1.2.3.4:53`
+  * Process: `browser.exe` **PID=4321 PPID=1200**
+  * Policy violated: “No UDP/53 on physical NIC”
+
+## Structured event (JSON-style fields)
+
+Minimum recommended fields:
+
+### Event identity
+
+* `event_id`
+* `timestamp`
+
+### DNS identity
+
+* `transport` (udp53/tcp53/dot/doh/unknown)
+* `qname` (nullable)
+* `qtype` (nullable)
+
+### Network path
+
+* `interface_name`
+* `src_ip`, `src_port`
+* `dst_ip`, `dst_port`
+* `route_class` (tunnel / physical / loopback)
+
+### Process identity (your requested additions)
+
+* `pid`
+* `ppid`
+* `process_name`
+* optional:
+
+  * `exe_path`
+  * `cmdline`
+  * `user`
+
+### Detection result
+
+* `leak_type` (A/B/C/D)
+* `severity` (P0..P3)
+* `policy_rule_id`
+* `attribution_confidence`
+
+---
+
+# 9) Privacy and safety notes (important in a DNS tool)
+
+Because you’re logging **domains** and **process command lines**, this becomes sensitive.
+
+Add a “privacy mode” policy:
+
+* **Full**: store full domain + cmdline
+* **Redacted**: hash domain; keep TLD only; truncate cmdline
+* **Minimal**: only keep leak counts + resolver IPs + process name
+
+Also allow “capture window” (rotate logs, avoid giant histories).
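A sketch of the three privacy modes applied to a domain name. It uses a keyed hash (HMAC-SHA256) rather than a bare hash, on the assumption that a plain hash of a domain is easy to reverse by hashing a list of known domains; the key would be a per-installation secret. The function name and mode strings are hypothetical.

```python
import hashlib
import hmac
from typing import Optional


def redact_domain(qname: str, mode: str, key: bytes) -> Optional[str]:
    """Apply the privacy mode to a domain before it is written to the log."""
    name = qname.rstrip(".").lower()
    if mode == "full":
        return name
    if mode == "redacted":
        tld = name.rsplit(".", 1)[-1]
        # Keyed hash: the same domain maps to the same token within one install,
        # so events can still be grouped without storing the name.
        digest = hmac.new(key, name.encode("utf-8"), hashlib.sha256).hexdigest()[:16]
        return f"{digest}.{tld}"
    if mode == "minimal":
        return None  # domains are not stored at all
    raise ValueError(f"unknown privacy mode: {mode}")
```

Truncating the digest to 16 hex characters is an arbitrary choice here to keep log lines short.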
+
+---
+
+# 10) UX feature: “Show me the process tree”
+
+When a leak happens, a good debugger view is:
+
+* `PID: foo (pid 1000)`
+
+  * `PPID: bar (pid 900)`
+
+    * `PPID: systemd/svchost/etc`
+
+This is extremely useful to identify:
+
+* browsers spawning helpers
+* OS DNS services
+* containerized processes
+* update agents / telemetry daemons
+
+So your report generator should support:
+
+✅ **Process chain rendering** (where possible)
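Process chain rendering can be sketched as a walk up the parent links. The process table here is a hypothetical `pid -> (name, ppid)` snapshot; the depth limit and the `seen` set guard against cycles caused by PID reuse or stale data.

```python
from typing import Dict, List, Tuple


def process_chain(pid: int, table: Dict[int, Tuple[str, int]],
                  max_depth: int = 16) -> List[Tuple[int, str]]:
    """Walk parent links from pid upward, stopping at unknown PIDs or cycles."""
    chain, seen = [], set()
    while pid in table and pid not in seen and len(chain) < max_depth:
        seen.add(pid)
        name, ppid = table[pid]
        chain.append((pid, name))
        pid = ppid
    return chain


def render_chain(chain: List[Tuple[int, str]]) -> str:
    """Indent each ancestor one level deeper than its child."""
    return "\n".join(
        f"{'  ' * depth}{name} (pid {pid})" for depth, (pid, name) in enumerate(chain)
    )


procs = {1000: ("foo", 900), 900: ("bar", 1), 1: ("systemd", 0)}
tree_text = render_chain(process_chain(1000, procs))
```

The walk stops as soon as a parent PID is missing from the snapshot, so a partially known chain still renders.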
+
+---
+
+# 11) Practical edge cases you should detect (with PID helping)
+
+1. **Local stub is fine, upstream isn’t**
+
+   * Your local resolver process leaks upstream plaintext DNS
+2. **Browser uses its own DoH**
+
+   * process attribution immediately reveals it
+3. **Multiple interfaces**
+
+   * a leak only happens on Wi-Fi but not Ethernet
+4. **Kill-switch failure**
+
+   * when tunnel drops, PID shows which app starts leaking first
+
+---