Add: dns leak detection

This commit is contained in:
DaZuo0122
2026-01-17 18:45:24 +08:00
parent ccd4a31d21
commit cfa96bde08
30 changed files with 3973 additions and 16 deletions

View File

@@ -0,0 +1,723 @@
Below is a **high-level (language-agnostic)** design for a **client-side DNS leak detector** aimed at *censorship-resistance threat models*, i.e.:
> “Censor/ISP can observe/log DNS intent or infer proxy usage; we want to detect when DNS behavior escapes the intended protection path.”
Ill cover: **definitions**, **detection standards**, **workflow**, **modules**, **passive+active detection**, **outputs**, and **test methodology**.
---
# 1) Scope and goals
## Goals
Your detector should answer, with evidence:
1. **Did any DNS query leave the device outside the intended safe path?**
2. **Which domains leaked?** (when visible)
3. **Which transport leaked?** (UDP/53, TCP/53, DoT/853, DoH)
4. **Which interface leaked?** (Wi-Fi/Ethernet vs tunnel)
5. **Which process/app triggered it?** (if your OS allows attribution)
And in your censorship model, it should also detect:
6. **Split-policy intent leakage**: “unknown/sensitive domains were resolved using domestic/ISP-facing DNS.”
## Non-goals (be explicit)
* Not a censorship circumvention tool itself
* Not a full firewall manager (can suggest fixes, but detection is the core)
* Not perfect attribution on every OS (process mapping may be partial)
---
# 2) Define “DNS leak” precisely (your programs standard)
You need a **formal definition** because “DNS leak” is overloaded.
## Standard definition A (classic VPN / tunnel bypass)
A leak occurs if:
> **An unencrypted DNS query is sent outside the secure tunnel path**
> This is essentially how popular leak test sites define it (“unencrypted DNS query sent OUTSIDE the established VPN tunnel”). ([IP Leak][1])
Your detector should implement it in a machine-checkable way:
**Leak-A condition**
* DNS over **UDP/53 or TCP/53**
* Destination is **not** a “trusted resolver path” (e.g., not the tunnel interface, not loopback stub, not proxy channel)
* Interface is **not** the intended egress
✅ Strong for censorship: plaintext DNS exposes intent.
---
## Standard definition B (split-policy intent leak)
A leak occurs if:
> **A domain that should be “proxied / remote-resolved” was queried via local/ISP-facing DNS.**
This is the “proxy split rules still leak intent” case.
**Leak-B condition**
* Query name matches either:
* a “proxy-required set” (sensitive list, non-allowlist, unknown), or
* a policy rule (“everything except allowlist must resolve via proxy DNS”)
* And the query was observed going to:
* ISP resolver(s) / domestic resolver(s) / non-tunnel interface
✅ This is the leak most users in censorship settings care about.
---
## Standard definition C (encrypted DNS escape / bypass)
A leak occurs if:
> DNS was encrypted, but escaped the intended channel (e.g., app uses its own DoH directly to the Internet).
This matters because DoH hides the QNAME but still creates **observable behavior** and breaks your “DNS must follow proxy” invariant.
**Leak-C condition**
* DoH (RFC 8484) ([IETF Datatracker][2]) or DoT (RFC 7858) ([IETF Datatracker][3]) flow exists
* And it does **not** go through your approved egress path (tunnel/proxy)
✅ Detects “Firefox/Chrome built-in DoH bypass” style cases.
---
## Standard definition D (mismatch risk indicator)
Not a “leak” by itself, but a **proxy inference amplifier**:
> DNS egress region/path differs from traffic egress region/path.
This is a *censorship-resistance hygiene metric*, not a binary leak.
**Mismatch condition**
* Same domain produces:
* DNS resolution via path X
* TCP/TLS connection via path Y
* Where X ≠ Y (interface, ASN region, etc.)
✅ Helps catch “DNS direct, traffic proxy” or “DNS proxy, traffic direct” weirdness.
---
# 3) High-level architecture
## Core components
1. **Policy & Configuration**
* What counts as “safe DNS path”
* Which interfaces are “protected” (tunnel) vs “physical”
* Allowlist / proxy-required sets (optional)
* Known resolver lists (optional)
* Severity thresholds
2. **Traffic Sensor (Passive Monitor)**
* Captures outbound traffic metadata (and optionally payload for DNS parsing)
* Must cover:
* UDP/53, TCP/53
* TCP/853 (DoT)
* HTTPS flows that look like DoH (see below)
* Emits normalized events into a pipeline
3. **Classifier**
* Recognize DNS protocol types:
* Plain DNS
* DoT
* DoH
* Attach confidence scores (especially for DoH)
4. **DNS Parser (for plaintext DNS only)**
* Extract: QNAME, QTYPE, transaction IDs, response codes (optional)
* Store minimally (privacy-aware)
5. **Flow Tracker**
* Correlate packets into “flows”
* Map flow → interface → destination → process (if possible)
* Track timing correlation: DNS → connection attempts
6. **Leak Detector (Rules Engine)**
* Apply Leak-A/B/C/D definitions
* Produce leak events + severity + evidence chain
7. **Active Prober**
* Generates controlled DNS lookups to test behavior
* Can test fail-closed, bypasses, multi-interface behavior, etc.
8. **Report Generator**
* Human-readable summary
* Machine-readable logs (JSON)
* Recommendations (non-invasive)
---
# 4) Workflow (end-to-end)
## Workflow 0: Setup & baseline
1. Enumerate interfaces and routes
* Identify physical NICs
* Identify tunnel / proxy interface (or “expected egress destinations”)
2. Identify system DNS configuration
* Default resolvers per interface
* Local stub presence (127.0.0.1, etc.)
3. Load policy profile
* Full-tunnel, split-tunnel, or proxy-based
4. Start passive monitor
**Output:** “Current state snapshot” (useful even before testing).
---
## Workflow 1: Passive detection loop (always-on)
Continuously:
1. Capture outbound packets/flows
2. Classify as DNS-like (plain DNS / DoT / DoH / unknown)
3. If plaintext DNS → parse QNAME/QTYPE
4. Assign metadata:
* interface
* dst IP/port
* process (if possible)
* timestamp
5. Evaluate leak rules:
* Leak-A/B/C/D
6. Write event log + optional real-time alert
**Key design point:** passive mode should be able to detect leaks **without requiring any special test domain**.
---
## Workflow 2: Active test suite (on-demand)
Active tests exist because some leaks are intermittent or only happen under stress.
### Active Test A: “No plaintext DNS escape”
* Trigger a set of DNS queries (unique random domains)
* Verify **zero UDP/53 & TCP/53** leaves physical interfaces
### Active Test B: “Fail-closed test”
* Temporarily disrupt the “protected path” (e.g., tunnel down)
* Trigger lookups again
* Expected: DNS fails (no fallback to ISP DNS)
### Active Test C: “App bypass test”
* Launch test scenarios that mimic real apps
* Confirm no direct DoH/DoT flows go to public Internet outside the proxy path
### Active Test D: “Split-policy correctness”
* Query domains that should be:
* direct-allowed
* proxy-required
* unknown
* Confirm resolution path matches policy
---
# 5) How to recognize DNS transports (detection mechanics)
## Plain DNS (strongest signal)
**Match conditions**
* UDP dst port 53 OR TCP dst port 53
* Parse DNS header
* Extract QNAME/QTYPE
**Evidence strength:** high
**Intent visibility:** yes (domain visible)
---
## DoT (port-based, easy)
DoT is defined over TLS, typically port **853**. ([IETF Datatracker][3])
**Match conditions**
* TCP dst port 853
* Optionally confirm TLS handshake exists
**Evidence strength:** high
**Intent visibility:** no (domain hidden)
---
## DoH (harder; heuristic + optional allowlists)
DoH is DNS over HTTPS (RFC 8484). ([IETF Datatracker][2])
**Recognizers (from strongest to weakest):**
1. HTTP request with `Content-Type: application/dns-message`
2. Path/pattern common to DoH endpoints (optional list)
3. SNI matches known DoH providers (optional list)
4. Traffic resembles frequent small HTTPS POST/GET bursts typical of DoH (weak)
**Evidence strength:** medium
**Intent visibility:** no (domain hidden)
**Important for your use-case:** you may not need to *prove* its DoH; you mostly need to detect “DNS-like encrypted resolver traffic bypassing the proxy channel.”
---
# 6) Policy model: define “safe DNS path”
You need a simple abstraction users can configure:
### Safe DNS path can be defined by one or more of:
* **Allowed interfaces**
* loopback (local stub)
* tunnel interface
* **Allowed destination set**
* proxy server IP(s)
* internal resolver IP(s)
* **Allowed process**
* only your local stub + proxy allowed to resolve externally
* **Allowed port set**
* maybe only permit 443 to proxy server (if DNS rides inside it)
Then implement:
**A DNS event is a “leak” if it violates safe-path constraints.**
---
# 7) Leak severity model (useful for real-world debugging)
### Severity P0 (critical)
* Plaintext DNS (UDP/TCP 53) on physical interface to ISP/public resolver
* Especially if QNAME matches proxy-required/sensitive list
### Severity P1 (high)
* DoH/DoT bypassing proxy channel directly to public Internet
### Severity P2 (medium)
* Policy mismatch: domain resolved locally but connection later proxied (or vice versa)
### Severity P3 (low / info)
* Authoritative-side “resolver egress exposure” (less relevant for client-side leak detector)
* CDN performance mismatch indicators
---
# 8) Outputs and reporting
## Real-time console output (for debugging)
* “DNS leak detected: Plain DNS”
* domain (if visible)
* destination resolver IP
* interface
* process name (if available)
* policy rule violated
* suggested fix category (e.g., “force stub + block port 53”)
## Forensics log (machine-readable)
A single **LeakEvent** record could include:
* timestamp
* leak_type (A/B/C/D)
* transport (UDP53, TCP53, DoT, DoH)
* qname/qtype (nullable)
* src_iface / dst_ip / dst_port
* process_id/process_name (nullable)
* correlation_id (link DNS → subsequent connection attempt)
* confidence score (esp. DoH)
* raw evidence pointers (pcap offsets / event IDs)
## Summary report
* Leak counts by type
* Top leaking processes
* Top leaking resolver destinations
* Timeline view (bursts often indicate OS fallback behavior)
* “Pass/Fail” per policy definition
---
# 9) Validation strategy (“how do I know my detector is correct?”)
## Ground truth tests
1. **Known-leak scenario**
* intentionally set OS DNS to ISP DNS, no tunnel
* detector must catch plaintext DNS
2. **Known-safe scenario**
* local stub only + blocked outbound 53/853
* detector should show zero leaks
3. **Bypass scenario**
* enable browser built-in DoH directly
* detector should catch encrypted resolver bypass (Leak-C)
4. **Split-policy scenario**
* allowlist CN direct, everything else proxy-resolve
* detector should show:
* allowlist resolved direct
* unknown resolved via proxy path
---
# 10) Recommended “profiles” (makes tool usable)
Provide built-in presets:
### Profile 1: Full-tunnel VPN
* allow DNS only via tunnel interface or loopback stub
* any UDP/TCP 53 on physical NIC = leak
### Profile 2: Proxy + local stub (your case)
* allow DNS only to loopback stub
* allow stub upstream only via proxy server destinations
* flag any direct DoH/DoT to public endpoints
### Profile 3: Split tunnel (geoip + allowlist)
* allow plaintext DNS **only** for allowlisted domains (if user accepts risk)
* enforce “unknown → proxy-resolve”
* emphasize Leak-B correctness
---
Below is an updated **high-level design** (still language-agnostic) that integrates **process attribution** cleanly, including how it fits into the workflow and what to log.
---
# 1) New component: Process Attribution Engine (PAE)
## Purpose
When a DNS-like event is observed, the PAE tries to attach:
* **PID**
* **PPID**
* **process name**
* *(optional but extremely useful)* full command line, executable path, user, container/app package, etc.
This lets your logs answer:
> “Which program generated the leaked DNS request?”
> “Was it a browser, OS service, updater, antivirus, proxy itself, or some library?”
## Position in the pipeline
It sits between **Traffic Sensor** and **Leak Detector** as an “event enricher”:
**Traffic Event → (Classifier) → (Process Attribution) → Enriched Event → Leak Rules → Report**
---
# 2) Updated architecture (with process attribution)
### Existing modules (from earlier design)
1. Policy & Configuration
2. Traffic Sensor (packet/flow monitor)
3. Classifier (Plain DNS / DoT / DoH / Unknown)
4. DNS Parser (plaintext only)
5. Flow Tracker
6. Leak Detector (rules engine)
7. Active Prober
8. Report Generator
### New module
9. **Process Attribution Engine (PAE)**
* resolves “who owns this flow / packet”
* emits PID/PPID/name
* handles platform-specific differences and fallbacks
---
# 3) Workflow changes (what happens when a potential leak is seen)
## Passive detection loop (updated)
1. Capture outbound traffic event
2. Classify transport type:
* UDP/53, TCP/53 → plaintext DNS
* TCP/853 → DoT
* HTTPS patterns → DoH (heuristic)
3. Extract the **5-tuple**
* src IP:port, dst IP:port, protocol
4. **PAE lookup**
* resolve the owner process for this traffic
* attach PID/PPID/name (+ optional metadata)
5. Apply leak rules (A/B/C/D)
6. Emit:
* realtime log line (human readable)
* structured record (JSON/event log)
---
# 4) Process attribution: what to detect and how (high-level)
Process attribution always works on one core concept:
> **Map observed traffic (socket/flow) → owning process**
### Inputs PAE needs
* protocol (UDP/TCP)
* local src port
* local address
* timestamp
* optionally: connection state / flow ID
### Output from PAE
* `pid`, `ppid`, `process_name`
* optional enrichment:
* `exe_path`
* `cmdline`
* `user`
* “process tree chain” (for debugging: parent → child → …)
---
# 5) Platform support strategy (without implementation detail)
Process attribution is **OS-specific**, so structure it as:
## “Attribution Provider” interface
* Provider A: “kernel-level flow owner”
* Provider B: “socket table owner lookup”
* Provider C: “event tracing feed”
* Provider D: fallback “unknown / not supported”
Your main design goal is:
### Design rule
**Attribution must be best-effort + gracefully degrading**, never blocking detection.
So you always log the leak even if PID is unavailable:
* `pid=null, attribution_confidence=LOW`
---
# 6) Attribution confidence + race handling (important!)
Attribution can be tricky because:
* a process may exit quickly (“short-lived resolver helper”)
* ports can be reused
* NAT or local proxies may obscure the real origin
So log **confidence**:
* **HIGH**: direct mapping from kernel/socket owner at time of event
* **MEDIUM**: mapping by lookup shortly after event (possible race)
* **LOW**: inferred / uncertain
* **NONE**: not resolved
Also record *why* attribution failed:
* “permission denied”
* “flow already gone”
* “unsupported transport”
* “ambiguous mapping”
This makes debugging much easier.
---
# 7) What PID/PPID adds to your leak definitions
### Leak-A (plaintext DNS outside safe path)
Now you can say:
> “`svchost.exe (PID 1234)` sent UDP/53 to ISP resolver on Wi-Fi interface”
### Leak-B (split-policy intent leak)
You can catch:
* “game launcher looked up blocked domain”
* “system service triggered a sensitive name unexpectedly”
* “your proxy itself isnt actually resolving via its own channel”
### Leak-C (encrypted DNS bypass)
This becomes *very actionable*:
> “`firefox.exe` started direct DoH to resolver outside tunnel”
### Leak-D (mismatch indicator)
You can also correlate:
* DNS resolved by one process
* connection made by another process
(e.g., local stub vs app)
---
# 8) Reporting / realtime logging format (updated)
## Realtime log line (human readable)
Example (conceptual):
* **[P0][Leak-A] Plain DNS leaked**
* Domain: `example-sensitive.com` (A)
* From: `Wi-Fi` → To: `1.2.3.4:53`
* Process: `browser.exe` **PID=4321 PPID=1200**
* Policy violated: “No UDP/53 on physical NIC”
## Structured event (JSON-style fields)
Minimum recommended fields:
### Event identity
* `event_id`
* `timestamp`
### DNS identity
* `transport` (udp53/tcp53/dot/doh/unknown)
* `qname` (nullable)
* `qtype` (nullable)
### Network path
* `interface_name`
* `src_ip`, `src_port`
* `dst_ip`, `dst_port`
* `route_class` (tunnel / physical / loopback)
### Process identity (your requested additions)
* `pid`
* `ppid`
* `process_name`
* optional:
* `exe_path`
* `cmdline`
* `user`
### Detection result
* `leak_type` (A/B/C/D)
* `severity` (P0..P3)
* `policy_rule_id`
* `attribution_confidence`
---
# 9) Privacy and safety notes (important in a DNS tool)
Because youre logging **domains** and **process command lines**, this becomes sensitive.
Add a “privacy mode” policy:
* **Full**: store full domain + cmdline
* **Redacted**: hash domain; keep TLD only; truncate cmdline
* **Minimal**: only keep leak counts + resolver IPs + process name
Also allow “capture window” (rotate logs, avoid giant histories).
---
# 10) UX feature: “Show me the process tree”
When a leak happens, a good debugger view is:
* `PID: foo (pid 1000)`
* `PPID: bar (pid 900)`
* `PPID: systemd/svchost/etc`
This is extremely useful to identify:
* browsers spawning helpers
* OS DNS services
* containerized processes
* update agents / telemetry daemons
So your report generator should support:
**Process chain rendering** (where possible)
---
# 11) Practical edge cases you should detect (with PID helping)
1. **Local stub is fine, upstream isnt**
* Your local resolver process leaks upstream plaintext DNS
2. **Browser uses its own DoH**
* process attribution immediately reveals it
3. **Multiple interfaces**
* a leak only happens on Wi-Fi but not Ethernet
4. **Kill-switch failure**
* when tunnel drops, PID shows which app starts leaking first
---