Below is a **high-level (language-agnostic)** design for a **client-side DNS leak detector** aimed at *censorship-resistance threat models*, i.e.:

> “Censor/ISP can observe/log DNS intent or infer proxy usage; we want to detect when DNS behavior escapes the intended protection path.”

I’ll cover: **definitions**, **detection standards**, **workflow**, **modules**, **passive+active detection**, **outputs**, and **test methodology**.
---
# 1) Scope and goals

## Goals

Your detector should answer, with evidence:

1. **Did any DNS query leave the device outside the intended safe path?**
2. **Which domains leaked?** (when visible)
3. **Which transport leaked?** (UDP/53, TCP/53, DoT/853, DoH)
4. **Which interface leaked?** (Wi-Fi/Ethernet vs tunnel)
5. **Which process/app triggered it?** (if your OS allows attribution)

And in your censorship model, it should also detect:

6. **Split-policy intent leakage**: “unknown/sensitive domains were resolved using domestic/ISP-facing DNS.”
## Non-goals (be explicit)

* Not a censorship circumvention tool itself
* Not a full firewall manager (it can suggest fixes, but detection is the core)
* Not perfect attribution on every OS (process mapping may be partial)
---
# 2) Define “DNS leak” precisely (your program’s standard)

You need a **formal definition** because “DNS leak” is overloaded.

## Standard definition A (classic VPN / tunnel bypass)

A leak occurs if:

> **An unencrypted DNS query is sent outside the secure tunnel path.**
> This is essentially how popular leak test sites define it (“unencrypted DNS query sent OUTSIDE the established VPN tunnel”). ([IP Leak][1])

Your detector should implement it in a machine-checkable way:

**Leak-A condition**

* DNS over **UDP/53 or TCP/53**
* Destination is **not** a “trusted resolver path” (e.g., not the tunnel interface, not the loopback stub, not the proxy channel)
* Interface is **not** the intended egress

✅ Strong for censorship: plaintext DNS exposes intent.
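To make the condition concrete, here is a minimal sketch in Python; the event fields, resolver IPs, and interface names are illustrative assumptions, not a fixed schema:

```python
from dataclasses import dataclass

# Hypothetical normalized flow event; field names are illustrative.
@dataclass
class FlowEvent:
    proto: str      # "udp" or "tcp"
    dst_ip: str
    dst_port: int
    iface: str      # e.g. "wlan0", "tun0", "lo"

# Example policy values (assumptions for illustration).
TRUSTED_RESOLVER_IPS = {"127.0.0.53", "10.8.0.1"}   # loopback stub, in-tunnel resolver
PROTECTED_IFACES = {"tun0", "lo"}                   # tunnel + loopback

def is_leak_a(ev: FlowEvent) -> bool:
    """Leak-A: plaintext DNS (UDP/53 or TCP/53) leaving a non-protected
    interface toward a destination outside the trusted resolver path."""
    is_plain_dns = ev.dst_port == 53 and ev.proto in ("udp", "tcp")
    safe_path = ev.iface in PROTECTED_IFACES or ev.dst_ip in TRUSTED_RESOLVER_IPS
    return is_plain_dns and not safe_path
```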
---
## Standard definition B (split-policy intent leak)

A leak occurs if:

> **A domain that should be “proxied / remote-resolved” was queried via local/ISP-facing DNS.**

This is the “proxy split rules still leak intent” case.

**Leak-B condition**

* Query name matches either:

  * a “proxy-required set” (sensitive list, non-allowlist, unknown), or
  * a policy rule (“everything except the allowlist must resolve via proxy DNS”)
* And the query was observed going to:

  * ISP resolver(s) / domestic resolver(s) / a non-tunnel interface

✅ This is the leak most users in censorship settings care about.
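A sketch of the name-matching half, assuming the common “everything except an explicit allowlist is proxy-required” policy; the domain lists, resolver IPs, and interface names are illustrative:

```python
# Illustrative policy data.
DIRECT_ALLOWLIST = {"example.cn", "cdn.example.net"}
ISP_RESOLVERS = {"192.0.2.53", "198.51.100.53"}

def matches_suffix(qname: str, suffixes: set) -> bool:
    """True if qname equals or is a subdomain of any listed suffix."""
    q = qname.rstrip(".").lower()
    return any(q == s or q.endswith("." + s) for s in suffixes)

def is_leak_b(qname: str, dst_ip: str, iface: str, tunnel_iface: str = "tun0") -> bool:
    """Leak-B: a proxy-required name resolved via an ISP resolver or a
    non-tunnel interface."""
    proxy_required = not matches_suffix(qname, DIRECT_ALLOWLIST)
    isp_facing = dst_ip in ISP_RESOLVERS or iface != tunnel_iface
    return proxy_required and isp_facing
```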
---
## Standard definition C (encrypted DNS escape / bypass)

A leak occurs if:

> DNS was encrypted, but escaped the intended channel (e.g., an app uses its own DoH directly to the Internet).

This matters because DoH hides the QNAME but still creates **observable behavior** and breaks your “DNS must follow proxy” invariant.

**Leak-C condition**

* A DoH (RFC 8484) ([IETF Datatracker][2]) or DoT (RFC 7858) ([IETF Datatracker][3]) flow exists
* And it does **not** go through your approved egress path (tunnel/proxy)

✅ Detects “Firefox/Chrome built-in DoH bypass” style cases.
---
## Standard definition D (mismatch risk indicator)

Not a “leak” by itself, but a **proxy inference amplifier**:

> DNS egress region/path differs from traffic egress region/path.

This is a *censorship-resistance hygiene metric*, not a binary leak.

**Mismatch condition**

* The same domain produces:

  * DNS resolution via path X
  * a TCP/TLS connection via path Y
* Where X ≠ Y (interface, ASN region, etc.)

✅ Helps catch “DNS direct, traffic proxy” or “DNS proxy, traffic direct” weirdness.
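The check itself is a small correlation between two observations of the same domain; a minimal sketch, assuming interfaces classify cleanly into tunnel vs. physical (the name heuristic is an assumption):

```python
# Leak-D sketch: compare the path class used to resolve a domain with the
# path class of the subsequent connection.
def path_class(iface: str) -> str:
    return "tunnel" if iface.startswith(("tun", "wg")) or iface == "lo" else "physical"

def is_mismatch(dns_iface: str, conn_iface: str) -> bool:
    """Leak-D indicator: DNS and traffic egress diverge for the same domain."""
    return path_class(dns_iface) != path_class(conn_iface)

# e.g. resolved over Wi-Fi but connected through the tunnel:
assert is_mismatch("wlan0", "tun0")
```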
---
# 3) High-level architecture

## Core components

1. **Policy & Configuration**

   * What counts as a “safe DNS path”
   * Which interfaces are “protected” (tunnel) vs “physical”
   * Allowlist / proxy-required sets (optional)
   * Known resolver lists (optional)
   * Severity thresholds

2. **Traffic Sensor (Passive Monitor)**

   * Captures outbound traffic metadata (and optionally payload for DNS parsing)
   * Must cover:

     * UDP/53, TCP/53
     * TCP/853 (DoT)
     * HTTPS flows that look like DoH (see below)
   * Emits normalized events into a pipeline

3. **Classifier**

   * Recognizes DNS protocol types:

     * plain DNS
     * DoT
     * DoH
   * Attaches confidence scores (especially for DoH)

4. **DNS Parser (for plaintext DNS only)**

   * Extracts: QNAME, QTYPE, transaction IDs, response codes (optional)
   * Stores minimally (privacy-aware)

5. **Flow Tracker**

   * Correlates packets into “flows”
   * Maps flow → interface → destination → process (if possible)
   * Tracks timing correlation: DNS → connection attempts

6. **Leak Detector (Rules Engine)**

   * Applies the Leak-A/B/C/D definitions
   * Produces leak events + severity + evidence chain

7. **Active Prober**

   * Generates controlled DNS lookups to test behavior
   * Can test fail-closed behavior, bypasses, multi-interface behavior, etc.

8. **Report Generator**

   * Human-readable summary
   * Machine-readable logs (JSON)
   * Recommendations (non-invasive)
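These components can share one normalized record; a sketch of what such a record might carry (field names are illustrative, not a mandated schema):

```python
from dataclasses import dataclass
from typing import Optional

# One possible event shape passed Sensor -> Classifier -> Detector.
@dataclass
class DnsEvent:
    ts: float                    # capture timestamp
    transport: str               # "udp53" | "tcp53" | "dot" | "doh" | "unknown"
    iface: str
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    qname: Optional[str] = None  # plaintext DNS only
    qtype: Optional[str] = None
    pid: Optional[int] = None    # filled in by attribution (second half of this doc)
    confidence: float = 1.0      # classifier confidence (DoH is heuristic)
```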
---
# 4) Workflow (end-to-end)

## Workflow 0: Setup & baseline

1. Enumerate interfaces and routes

   * Identify physical NICs
   * Identify the tunnel / proxy interface (or “expected egress destinations”)
2. Identify system DNS configuration

   * Default resolvers per interface
   * Local stub presence (127.0.0.1, etc.)
3. Load policy profile

   * Full-tunnel, split-tunnel, or proxy-based
4. Start the passive monitor

**Output:** a “current state snapshot” (useful even before testing).
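A sketch of the snapshot step, using `psutil` for cross-platform interface listing plus Linux-style `/etc/resolv.conf` parsing; the tunnel-name heuristic is an assumption, and other platforms need their own resolver discovery:

```python
import psutil  # third-party, cross-platform interface listing

def snapshot_interfaces() -> dict:
    """Enumerate interfaces and flag likely tunnels by name (heuristic)."""
    snap = {}
    for name, addrs in psutil.net_if_addrs().items():
        snap[name] = {
            "addrs": [a.address for a in addrs],
            "looks_like_tunnel": name.startswith(("tun", "tap", "wg", "utun")),
        }
    return snap

def system_resolvers(path: str = "/etc/resolv.conf") -> list:
    """Default resolver discovery, Linux-style; returns [] if unreadable."""
    try:
        with open(path) as f:
            return [parts[1] for parts in (line.split() for line in f)
                    if len(parts) > 1 and parts[0] == "nameserver"]
    except OSError:
        return []
```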
---
## Workflow 1: Passive detection loop (always-on)

Continuously:

1. Capture outbound packets/flows
2. Classify as DNS-like (plain DNS / DoT / DoH / unknown)
3. If plaintext DNS → parse QNAME/QTYPE
4. Assign metadata:

   * interface
   * dst IP/port
   * process (if possible)
   * timestamp
5. Evaluate leak rules:

   * Leak-A/B/C/D
6. Write to the event log + optional real-time alert

**Key design point:** passive mode should be able to detect leaks **without requiring any special test domain**.
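As a skeleton (Python, with every callable standing in for a module described above rather than any real API):

```python
# Passive loop skeleton. `capture` is whatever packet/flow source you use
# (pcap, eBPF, WinDivert, ...); all parameters are placeholders for the
# modules in section 3.
def passive_loop(capture, classify, parse_question, attribute, rules, sink):
    for pkt in capture():                         # 1. capture
        ev = classify(pkt)                        # 2. DNS-like or not
        if ev is None:
            continue
        if ev.transport in ("udp53", "tcp53"):    # 3. plaintext only
            ev.qname, ev.qtype = parse_question(pkt)
        attribute(ev)                             # 4. iface/process metadata
        for leak in rules.evaluate(ev):           # 5. Leak-A/B/C/D
            sink(leak)                            # 6. log / alert
```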
---
## Workflow 2: Active test suite (on-demand)

Active tests exist because some leaks are intermittent or only happen under stress.

### Active Test A: “No plaintext DNS escape”

* Trigger a set of DNS queries (unique random domains; see the probe sketch after Test D)
* Verify **zero UDP/53 & TCP/53** leaves the physical interfaces

### Active Test B: “Fail-closed test”

* Temporarily disrupt the “protected path” (e.g., tunnel down)
* Trigger lookups again
* Expected: DNS fails (no fallback to ISP DNS)

### Active Test C: “App bypass test”

* Launch test scenarios that mimic real apps
* Confirm no direct DoH/DoT flows go to the public Internet outside the proxy path

### Active Test D: “Split-policy correctness”

* Query domains that should be:

  * direct-allowed
  * proxy-required
  * unknown
* Confirm the resolution path matches policy
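The probe sketch referenced under Test A: unique random labels guarantee a fresh resolution attempt on every run (nothing served from cache). The base zone is an illustrative placeholder; `.invalid` is reserved and never resolves, which is fine because we only observe *where the query goes*:

```python
import secrets
import socket

def probe_domains(n: int = 10, base: str = "dnsleaktest.invalid") -> list:
    """Generate unique, never-cached probe names."""
    return [f"{secrets.token_hex(8)}.{base}" for _ in range(n)]

def run_probe() -> None:
    """Drive the OS resolver path; failures (NXDOMAIN) are expected and fine."""
    for name in probe_domains():
        try:
            socket.getaddrinfo(name, 443)
        except socket.gaierror:
            pass  # we only care which path the query took, observed passively
```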
---
# 5) How to recognize DNS transports (detection mechanics)

## Plain DNS (strongest signal)

**Match conditions**

* UDP dst port 53 OR TCP dst port 53
* Parse the DNS header
* Extract QNAME/QTYPE

**Evidence strength:** high
**Intent visibility:** yes (domain visible)
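A minimal question-section parser, standard library only; it handles the common case (one question, no name compression) and treats anything else as “not plaintext DNS”. Note that TCP/53 carries the same message format behind a 2-byte length prefix:

```python
import struct

def parse_dns_question(payload: bytes):
    """Return (qname, qtype) for a plaintext DNS *query*, else None.
    `payload` is the UDP payload; strip the 2-byte length prefix for TCP/53."""
    if len(payload) < 12:                       # fixed 12-byte DNS header
        return None
    _txid, flags, qdcount = struct.unpack_from("!HHH", payload, 0)
    if flags & 0x8000 or qdcount < 1:           # QR bit set -> a response
        return None
    labels, off = [], 12
    while off < len(payload):
        ln = payload[off]
        off += 1
        if ln == 0:                             # root label: end of name
            break
        if ln & 0xC0:                           # compression pointer: unexpected
            return None                         # in a question; bail out
        labels.append(payload[off:off + ln].decode("ascii", "replace"))
        off += ln
    if off + 4 > len(payload):
        return None
    qtype, _qclass = struct.unpack_from("!HH", payload, off)
    return ".".join(labels), qtype
```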
---
## DoT (port-based, easy)

DoT is DNS over TLS, typically on port **853**. ([IETF Datatracker][3])

**Match conditions**

* TCP dst port 853
* Optionally confirm a TLS handshake exists

**Evidence strength:** high
**Intent visibility:** no (domain hidden)
---
## DoH (harder; heuristics + optional allowlists)

DoH is DNS over HTTPS (RFC 8484). ([IETF Datatracker][2])

**Recognizers (from strongest to weakest):**

1. HTTP request with `Content-Type: application/dns-message`
2. A path/pattern common to DoH endpoints (optional list)
3. SNI matches known DoH providers (optional list)
4. Traffic resembles the frequent small HTTPS POST/GET bursts typical of DoH (weak)

**Evidence strength:** medium
**Intent visibility:** no (domain hidden)

**Important for your use-case:** you may not need to *prove* it’s DoH; you mostly need to detect “DNS-like encrypted resolver traffic bypassing the proxy channel.”
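A sketch scoring those recognizers in order; the confidence numbers and the provider SNI list are illustrative assumptions (and note the `Content-Type` header is only visible when the flow is not TLS-encrypted or you terminate TLS locally):

```python
from typing import Optional

# Illustrative, user-maintained SNI list; never complete by definition.
KNOWN_DOH_SNI = {"dns.google", "cloudflare-dns.com", "dns.quad9.net"}

def doh_confidence(content_type: Optional[str],
                   url_path: Optional[str],
                   sni: Optional[str]) -> float:
    """Score an HTTPS flow as DoH, strongest recognizer first."""
    if content_type == "application/dns-message":   # RFC 8484 media type
        return 0.95
    if url_path and "/dns-query" in url_path:       # common endpoint pattern
        return 0.80
    if sni in KNOWN_DOH_SNI:
        return 0.60
    return 0.0   # left to (weak) traffic-shape heuristics
```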
---
# 6) Policy model: define the “safe DNS path”

You need a simple abstraction users can configure:

### A safe DNS path can be defined by one or more of:

* **Allowed interfaces**

  * loopback (local stub)
  * tunnel interface
* **Allowed destination set**

  * proxy server IP(s)
  * internal resolver IP(s)
* **Allowed processes**

  * only your local stub + proxy are allowed to resolve externally
* **Allowed port set**

  * maybe only permit 443 to the proxy server (if DNS rides inside it)

Then implement:

**A DNS event is a “leak” if it violates the safe-path constraints.**
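Expressed as plain data (the same shape a YAML/JSON config would carry; all concrete values are illustrative):

```python
# Safe-path policy as data. A DNS event that satisfies none of the allowed
# paths violates the policy. Values are placeholders for illustration.
SAFE_DNS_PATH = {
    "allowed_interfaces": ["lo", "tun0"],
    "allowed_destinations": ["127.0.0.53", "10.8.0.1"],   # stub + proxy resolver
    "allowed_processes": ["stub-resolver", "proxy-client"],
    "allowed_ports": {"10.8.0.1": [443]},                 # DNS rides inside the proxy
}

def violates_safe_path(ev) -> bool:
    """Leak check over the interface/destination constraints (the process and
    port constraints compose the same way once attribution is available)."""
    return not (ev.iface in SAFE_DNS_PATH["allowed_interfaces"]
                or ev.dst_ip in SAFE_DNS_PATH["allowed_destinations"])
```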
---
# 7) Leak severity model (useful for real-world debugging)

### Severity P0 (critical)

* Plaintext DNS (UDP/TCP 53) on a physical interface to an ISP/public resolver
* Especially if the QNAME matches the proxy-required/sensitive list

### Severity P1 (high)

* DoH/DoT bypassing the proxy channel directly to the public Internet

### Severity P2 (medium)

* Policy mismatch: domain resolved locally but the connection later proxied (or vice versa)

### Severity P3 (low / info)

* Authoritative-side “resolver egress exposure” (less relevant for a client-side leak detector)
* CDN performance mismatch indicators
---
# 8) Outputs and reporting

## Real-time console output (for debugging)

* “DNS leak detected: Plain DNS”
* domain (if visible)
* destination resolver IP
* interface
* process name (if available)
* policy rule violated
* suggested fix category (e.g., “force stub + block port 53”)

## Forensics log (machine-readable)

A single **LeakEvent** record could include:

* timestamp
* leak_type (A/B/C/D)
* transport (UDP53, TCP53, DoT, DoH)
* qname/qtype (nullable)
* src_iface / dst_ip / dst_port
* process_id / process_name (nullable)
* correlation_id (links DNS → the subsequent connection attempt)
* confidence score (esp. for DoH)
* raw evidence pointers (pcap offsets / event IDs)

## Summary report

* Leak counts by type
* Top leaking processes
* Top leaking resolver destinations
* Timeline view (bursts often indicate OS fallback behavior)
* “Pass/Fail” per policy definition
---
# 9) Validation strategy (“how do I know my detector is correct?”)

## Ground truth tests

1. **Known-leak scenario**

   * intentionally set the OS DNS to the ISP DNS, no tunnel
   * the detector must catch plaintext DNS

2. **Known-safe scenario**

   * local stub only + blocked outbound 53/853
   * the detector should show zero leaks

3. **Bypass scenario**

   * enable the browser’s built-in DoH directly
   * the detector should catch the encrypted resolver bypass (Leak-C)

4. **Split-policy scenario**

   * allowlist CN domains direct, everything else proxy-resolved
   * the detector should show:

     * allowlisted names resolved direct
     * unknown names resolved via the proxy path
---
# 10) Recommended “profiles” (makes the tool usable)

Provide built-in presets:

### Profile 1: Full-tunnel VPN

* allow DNS only via the tunnel interface or loopback stub
* any UDP/TCP 53 on a physical NIC = leak

### Profile 2: Proxy + local stub (your case)

* allow DNS only to the loopback stub
* allow stub upstream traffic only to the proxy server destinations
* flag any direct DoH/DoT to public endpoints

### Profile 3: Split tunnel (geoip + allowlist)

* allow plaintext DNS **only** for allowlisted domains (if the user accepts the risk)
* enforce “unknown → proxy-resolve”
* emphasize Leak-B correctness
---
Below is an updated **high-level design** (still language-agnostic) that integrates **process attribution** cleanly, including how it fits into the workflow and what to log.
---
# 1) New component: Process Attribution Engine (PAE)

## Purpose

When a DNS-like event is observed, the PAE tries to attach:

* **PID**
* **PPID**
* **process name**
* *(optional but extremely useful)* full command line, executable path, user, container/app package, etc.

This lets your logs answer:

> “Which program generated the leaked DNS request?”
> “Was it a browser, an OS service, an updater, antivirus, the proxy itself, or some library?”

## Position in the pipeline

It sits between the **Traffic Sensor** and the **Leak Detector** as an “event enricher”:

**Traffic Event → (Classifier) → (Process Attribution) → Enriched Event → Leak Rules → Report**
---
# 2) Updated architecture (with process attribution)

### Existing modules (from the earlier design)

1. Policy & Configuration
2. Traffic Sensor (packet/flow monitor)
3. Classifier (plain DNS / DoT / DoH / unknown)
4. DNS Parser (plaintext only)
5. Flow Tracker
6. Leak Detector (rules engine)
7. Active Prober
8. Report Generator

### New module

9. **Process Attribution Engine (PAE)**

   * resolves “who owns this flow / packet”
   * emits PID/PPID/name
   * handles platform-specific differences and fallbacks
---
# 3) Workflow changes (what happens when a potential leak is seen)

## Passive detection loop (updated)

1. Capture an outbound traffic event
2. Classify the transport type:

   * UDP/53, TCP/53 → plaintext DNS
   * TCP/853 → DoT
   * HTTPS patterns → DoH (heuristic)
3. Extract the **5-tuple**

   * src IP:port, dst IP:port, protocol
4. **PAE lookup**

   * resolve the owner process for this traffic
   * attach PID/PPID/name (+ optional metadata)
5. Apply the leak rules (A/B/C/D)
6. Emit:

   * a realtime log line (human-readable)
   * a structured record (JSON/event log)
---
# 4) Process attribution: what to detect and how (high-level)

Process attribution always works on one core concept:

> **Map observed traffic (socket/flow) → owning process**

### Inputs the PAE needs

* protocol (UDP/TCP)
* local src port
* local address
* timestamp
* optionally: connection state / flow ID

### Output from the PAE

* `pid`, `ppid`, `process_name`
* optional enrichment:

  * `exe_path`
  * `cmdline`
  * `user`
  * “process tree chain” (for debugging: parent → child → …)
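A socket-table lookup (Provider B in the next section) is the most portable way to get this mapping; a sketch using `psutil`, with the usual caveats that the scan may need elevated privileges and that short-lived processes can vanish before it runs:

```python
import psutil

def attribute_flow(proto: str, src_ip: str, src_port: int):
    """Best-effort (pid, ppid, process_name) for a local endpoint, or None.
    Inherently racy: the owning process may exit before we look."""
    kind = "udp" if proto == "udp" else "tcp"
    for conn in psutil.net_connections(kind=kind):   # may require privileges
        if conn.laddr and conn.laddr.ip == src_ip and conn.laddr.port == src_port:
            if conn.pid is None:                     # visible but unowned
                return None
            try:
                p = psutil.Process(conn.pid)
                return p.pid, p.ppid(), p.name()
            except psutil.NoSuchProcess:             # "flow already gone"
                return None
    return None
```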
---
# 5) Platform support strategy (without implementation detail)

Process attribution is **OS-specific**, so structure it as an:

## “Attribution Provider” interface

* Provider A: “kernel-level flow owner”
* Provider B: “socket table owner lookup”
* Provider C: “event tracing feed”
* Provider D: fallback “unknown / not supported”

Your main design goal is:

### Design rule

**Attribution must be best-effort and gracefully degrading**, never blocking detection.

So you always log the leak even if the PID is unavailable:

* `pid=null, attribution_confidence=LOW`
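A sketch of that provider chain, tried strongest-first, with the design rule baked in (names mirror the list above; every class here is a stub):

```python
from typing import Optional, Tuple

class AttributionProvider:
    """One platform-specific strategy; returns (pid, ppid, name) or None."""
    confidence = "NONE"
    def lookup(self, ev) -> Optional[Tuple[int, int, str]]:
        raise NotImplementedError

def attribute(ev, providers) -> dict:
    """Try providers strongest-first; never block detection. A failed lookup
    still yields a loggable record with pid=None."""
    for prov in providers:                    # e.g. [KernelOwner(), SocketTable()]
        result = prov.lookup(ev)
        if result is not None:
            pid, ppid, name = result
            return {"pid": pid, "ppid": ppid, "process_name": name,
                    "attribution_confidence": prov.confidence}
    return {"pid": None, "ppid": None, "process_name": None,
            "attribution_confidence": "NONE"}
```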
---
# 6) Attribution confidence + race handling (important!)

Attribution can be tricky because:

* a process may exit quickly (a “short-lived resolver helper”)
* ports can be reused
* NAT or local proxies may obscure the real origin

So log a **confidence** level:

* **HIGH**: direct mapping from the kernel/socket owner at the time of the event
* **MEDIUM**: mapping by lookup shortly after the event (possible race)
* **LOW**: inferred / uncertain
* **NONE**: not resolved

Also record *why* attribution failed:

* “permission denied”
* “flow already gone”
* “unsupported transport”
* “ambiguous mapping”

This makes debugging much easier.
---
# 7) What PID/PPID adds to your leak definitions

### Leak-A (plaintext DNS outside the safe path)

Now you can say:

> “`svchost.exe (PID 1234)` sent UDP/53 to the ISP resolver on the Wi-Fi interface”

### Leak-B (split-policy intent leak)

You can catch:

* “the game launcher looked up a blocked domain”
* “a system service triggered a sensitive name unexpectedly”
* “your proxy itself isn’t actually resolving via its own channel”

### Leak-C (encrypted DNS bypass)

This becomes *very actionable*:

> “`firefox.exe` started direct DoH to a resolver outside the tunnel”

### Leak-D (mismatch indicator)

You can also correlate:

* DNS resolved by one process
* the connection made by another process (e.g., local stub vs. app)
---
# 8) Reporting / realtime logging format (updated)

## Realtime log line (human readable)

Example (conceptual):

* **[P0][Leak-A] Plain DNS leaked**

  * Domain: `example-sensitive.com` (A)
  * From: `Wi-Fi` → To: `1.2.3.4:53`
  * Process: `browser.exe` **PID=4321 PPID=1200**
  * Policy violated: “No UDP/53 on physical NIC”

## Structured event (JSON-style fields)

Minimum recommended fields:

### Event identity

* `event_id`
* `timestamp`

### DNS identity

* `transport` (udp53/tcp53/dot/doh/unknown)
* `qname` (nullable)
* `qtype` (nullable)

### Network path

* `interface_name`
* `src_ip`, `src_port`
* `dst_ip`, `dst_port`
* `route_class` (tunnel / physical / loopback)

### Process identity (your requested additions)

* `pid`
* `ppid`
* `process_name`
* optional:

  * `exe_path`
  * `cmdline`
  * `user`

### Detection result

* `leak_type` (A/B/C/D)
* `severity` (P0..P3)
* `policy_rule_id`
* `attribution_confidence`
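Put together, one fully-populated record might look like this (values mirror the conceptual realtime example above and are purely illustrative, including the rule ID):

```json
{
  "event_id": "evt-000123",
  "timestamp": "2024-05-01T12:34:56Z",
  "transport": "udp53",
  "qname": "example-sensitive.com",
  "qtype": "A",
  "interface_name": "Wi-Fi",
  "src_ip": "192.168.1.10",
  "src_port": 54412,
  "dst_ip": "1.2.3.4",
  "dst_port": 53,
  "route_class": "physical",
  "pid": 4321,
  "ppid": 1200,
  "process_name": "browser.exe",
  "leak_type": "A",
  "severity": "P0",
  "policy_rule_id": "no-udp53-on-physical-nic",
  "attribution_confidence": "HIGH"
}
```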
---
# 9) Privacy and safety notes (important in a DNS tool)

Because you’re logging **domains** and **process command lines**, this data becomes sensitive.

Add a “privacy mode” policy:

* **Full**: store the full domain + cmdline
* **Redacted**: hash the domain; keep the TLD only; truncate the cmdline
* **Minimal**: only keep leak counts + resolver IPs + process names

Also allow a “capture window” (rotate logs, avoid giant histories).
---
# 10) UX feature: “Show me the process tree”

When a leak happens, a good debugger view is the ownership chain:

* `process: foo (pid 1000)`
* `parent: bar (pid 900)`
* `grandparent: systemd / svchost / etc.`

This is extremely useful to identify:

* browsers spawning helpers
* OS DNS services
* containerized processes
* update agents / telemetry daemons

So your report generator should support:

✅ **Process chain rendering** (where possible)
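Rendering that chain is a short walk up the parent links; a best-effort sketch with `psutil` (any process in the chain may already have exited):

```python
import psutil

def process_chain(pid: int) -> str:
    """Render 'leaf <- parent <- grandparent <- ...' for a leak report."""
    parts = []
    try:
        p = psutil.Process(pid)
        while p is not None:
            parts.append(f"{p.name()}({p.pid})")
            p = p.parent()                 # None at the top of the tree
    except psutil.Error:                   # exited mid-walk / access denied
        parts.append("?")
    return " <- ".join(parts) if parts else "unknown"
```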
---
# 11) Practical edge cases you should detect (with the PID helping)

1. **The local stub is fine, the upstream isn’t**

   * your local resolver process leaks upstream plaintext DNS
2. **A browser uses its own DoH**

   * process attribution immediately reveals it
3. **Multiple interfaces**

   * a leak only happens on Wi-Fi but not Ethernet
4. **Kill-switch failure**

   * when the tunnel drops, the PID shows which app starts leaking first