Below is a high-level (language-agnostic) design for a client-side DNS leak detector aimed at censorship-resistance threat models, i.e.:

“Censor/ISP can observe/log DNS intent or infer proxy usage; we want to detect when DNS behavior escapes the intended protection path.”

I'll cover: definitions, detection standards, workflow, modules, passive and active detection, outputs, and test methodology.


1) Scope and goals

Goals

Your detector should answer, with evidence:

  1. Did any DNS query leave the device outside the intended safe path?
  2. Which domains leaked? (when visible)
  3. Which transport leaked? (UDP/53, TCP/53, DoT/853, DoH)
  4. Which interface leaked? (Wi-Fi/Ethernet vs tunnel)
  5. Which process/app triggered it? (if your OS allows attribution)

And in your censorship model, it should also detect:

  1. Split-policy intent leakage: “unknown/sensitive domains were resolved using domestic/ISP-facing DNS.”

Non-goals (be explicit)

  • Not a censorship circumvention tool itself
  • Not a full firewall manager (can suggest fixes, but detection is the core)
  • Not perfect attribution on every OS (process mapping may be partial)

2) Define “DNS leak” precisely (your program's standard)

You need a formal definition because “DNS leak” is overloaded.

Standard definition A (classic VPN / tunnel bypass)

A leak occurs if:

An unencrypted DNS query is sent outside the secure tunnel path. This is essentially how popular leak-test sites define it (“unencrypted DNS query sent OUTSIDE the established VPN tunnel”). ([IP Leak][1])

Your detector should implement it in a machine-checkable way (a combined sketch for definitions A–D appears at the end of this section):

Leak-A condition

  • DNS over UDP/53 or TCP/53
  • Destination is not a “trusted resolver path” (e.g., not the tunnel interface, not loopback stub, not proxy channel)
  • Interface is not the intended egress

This matters most in the censorship model: plaintext DNS exposes intent.


Standard definition B (split-policy intent leak)

A leak occurs if:

A domain that should be “proxied / remote-resolved” was queried via local/ISP-facing DNS.

This is the “proxy split rules still leak intent” case.

Leak-B condition

  • Query name matches either:

    • a “proxy-required set” (sensitive list, non-allowlist, unknown), or
    • a policy rule (“everything except allowlist must resolve via proxy DNS”)
  • And the query was observed going to:

    • ISP resolver(s) / domestic resolver(s) / non-tunnel interface

This is the leak most users in censorship settings care about.


Standard definition C (encrypted DNS escape / bypass)

A leak occurs if:

DNS was encrypted, but escaped the intended channel (e.g., app uses its own DoH directly to the Internet).

This matters because DoH hides the QNAME but still creates observable behavior and breaks your “DNS must follow proxy” invariant.

Leak-C condition

  • DoH (RFC 8484) ([IETF Datatracker][2]) or DoT (RFC 7858) ([IETF Datatracker][3]) flow exists
  • And it does not go through your approved egress path (tunnel/proxy)

Detects “Firefox/Chrome built-in DoH bypass” style cases.


Standard definition D (mismatch risk indicator)

Not a “leak” by itself, but a proxy inference amplifier:

DNS egress region/path differs from traffic egress region/path.

This is a censorship-resistance hygiene metric, not a binary leak.

Mismatch condition

  • Same domain produces:

    • DNS resolution via path X
    • TCP/TLS connection via path Y
  • Where X ≠ Y (interface, ASN region, etc.)

Helps catch “DNS direct, traffic proxy” or “DNS proxy, traffic direct” weirdness.
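
To make the four definitions machine-checkable, here is a minimal rules-engine sketch in Python. The event and policy field names (`transport`, `iface`, `proxy_required`, and so on) are illustrative assumptions rather than a fixed schema; the predicates simply mirror the Leak-A/B/C/D conditions above.

```python
from dataclasses import dataclass, field

@dataclass
class Policy:
    # Illustrative policy shape; names are assumptions, not a fixed schema.
    safe_ifaces: set = field(default_factory=lambda: {"tun0", "lo"})
    safe_resolver_ips: set = field(default_factory=set)   # proxy / internal resolvers
    proxy_required: set = field(default_factory=set)      # names that must resolve remotely
    direct_allowlist: set = field(default_factory=set)    # names allowed to resolve locally

@dataclass
class DnsEvent:
    transport: str                     # "udp53" | "tcp53" | "dot" | "doh"
    iface: str                         # e.g. "wlan0", "tun0", "lo"
    dst_ip: str
    qname: str | None = None           # only visible for plaintext DNS
    traffic_iface: str | None = None   # egress iface of the follow-up connection (Leak-D)

def on_safe_path(ev: DnsEvent, pol: Policy) -> bool:
    return ev.iface in pol.safe_ifaces or ev.dst_ip in pol.safe_resolver_ips

def leak_a(ev, pol):   # plaintext DNS outside the safe path
    return ev.transport in ("udp53", "tcp53") and not on_safe_path(ev, pol)

def leak_b(ev, pol):   # proxy-required or unknown name resolved via the local/ISP path
    if ev.qname is None or on_safe_path(ev, pol):
        return False
    return ev.qname in pol.proxy_required or ev.qname not in pol.direct_allowlist

def leak_c(ev, pol):   # encrypted DNS (DoT/DoH) escaping the approved egress
    return ev.transport in ("dot", "doh") and not on_safe_path(ev, pol)

def leak_d(ev, pol):   # mismatch indicator: DNS path differs from traffic path
    return ev.traffic_iface is not None and ev.traffic_iface != ev.iface

def classify(ev: DnsEvent, pol: Policy) -> list[str]:
    checks = {"A": leak_a, "B": leak_b, "C": leak_c, "D": leak_d}
    return [name for name, fn in checks.items() if fn(ev, pol)]
```

A real engine would also attach severity and evidence pointers; this only shows the decision shape.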


3) High-level architecture

Core components

  1. Policy & Configuration

    • What counts as “safe DNS path”
    • Which interfaces are “protected” (tunnel) vs “physical”
    • Allowlist / proxy-required sets (optional)
    • Known resolver lists (optional)
    • Severity thresholds
  2. Traffic Sensor (Passive Monitor)

    • Captures outbound traffic metadata (and optionally payload for DNS parsing)

    • Must cover:

      • UDP/53, TCP/53
      • TCP/853 (DoT)
      • HTTPS flows that look like DoH (see below)
    • Emits normalized events into a pipeline

  3. Classifier

    • Recognize DNS protocol types:

      • Plain DNS
      • DoT
      • DoH
    • Attach confidence scores (especially for DoH)

  4. DNS Parser (for plaintext DNS only)

    • Extract: QNAME, QTYPE, transaction IDs, response codes (optional)
    • Store minimally (privacy-aware)
  5. Flow Tracker

    • Correlate packets into “flows”
    • Map flow → interface → destination → process (if possible)
    • Track timing correlation: DNS → connection attempts
  6. Leak Detector (Rules Engine)

    • Apply Leak-A/B/C/D definitions
    • Produce leak events + severity + evidence chain
  7. Active Prober

    • Generates controlled DNS lookups to test behavior
    • Can test fail-closed, bypasses, multi-interface behavior, etc.
  8. Report Generator

    • Human-readable summary
    • Machine-readable logs (JSON)
    • Recommendations (non-invasive)
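
One way to tie these modules together is a single normalized event record that flows through the pipeline and gets enriched at each stage. A sketch (field names are assumptions):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrafficEvent:
    """Normalized record emitted by the Traffic Sensor and enriched downstream."""
    timestamp: float
    iface: str                       # capturing interface, e.g. "wlan0" or "tun0"
    proto: str                       # "udp" | "tcp"
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    payload: Optional[bytes] = None  # kept only long enough for DNS parsing
    # Filled in by later stages:
    transport: Optional[str] = None  # "plain_dns" | "dot" | "doh" | "unknown" (Classifier)
    qname: Optional[str] = None      # plaintext DNS only (DNS Parser)
    pid: Optional[int] = None        # owning process, when attribution succeeds
```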

4) Workflow (end-to-end)

Workflow 0: Setup & baseline

  1. Enumerate interfaces and routes

    • Identify physical NICs
    • Identify tunnel / proxy interface (or “expected egress destinations”)
  2. Identify system DNS configuration

    • Default resolvers per interface
    • Local stub presence (127.0.0.1, etc.)
  3. Load policy profile

    • Full-tunnel, split-tunnel, or proxy-based
  4. Start passive monitor

Output: “Current state snapshot” (useful even before testing).


Workflow 1: Passive detection loop (always-on)

Continuously:

  1. Capture outbound packets/flows

  2. Classify as DNS-like (plain DNS / DoT / DoH / unknown)

  3. If plaintext DNS → parse QNAME/QTYPE

  4. Assign metadata:

    • interface
    • dst IP/port
    • process (if possible)
    • timestamp
  5. Evaluate leak rules:

    • Leak-A/B/C/D
  6. Write event log + optional real-time alert

Key design point: passive mode should be able to detect leaks without requiring any special test domain.


Workflow 2: Active test suite (on-demand)

Active tests exist because some leaks are intermittent or only happen under stress.

Active Test A: “No plaintext DNS escape”

  • Trigger a set of DNS queries (unique random domains)
  • Verify that no UDP/53 or TCP/53 traffic leaves the physical interfaces
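
Test A only needs unique, never-before-seen names so that any resolution observed on a physical interface is unambiguously caused by the probe. A minimal sketch; the base domain is a placeholder for a zone the operator controls:

```python
import secrets
import socket

def probe_names(n: int, base: str = "probe.example.net") -> list[str]:
    # Random labels avoid cache hits and make probe traffic easy to spot
    # in the passive monitor's logs. "probe.example.net" is a placeholder.
    return [f"{secrets.token_hex(8)}.{base}" for _ in range(n)]

def run_probe(names: list[str]) -> None:
    for name in names:
        try:
            socket.getaddrinfo(name, None)  # resolve via the system's normal path
        except socket.gaierror:
            pass  # NXDOMAIN or failure is fine; we only care where the query went
```

While the probe runs, the passive monitor asserts that no UDP/53 or TCP/53 flow carrying these labels appears on a physical interface.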

Active Test B: “Fail-closed test”

  • Temporarily disrupt the “protected path” (e.g., tunnel down)
  • Trigger lookups again
  • Expected: DNS fails (no fallback to ISP DNS)

Active Test C: “App bypass test”

  • Launch test scenarios that mimic real apps
  • Confirm no direct DoH/DoT flows go to public Internet outside the proxy path

Active Test D: “Split-policy correctness”

  • Query domains that should be:

    • direct-allowed
    • proxy-required
    • unknown
  • Confirm resolution path matches policy


5) How to recognize DNS transports (detection mechanics)

Plain DNS (strongest signal)

Match conditions

  • UDP dst port 53 OR TCP dst port 53
  • Parse DNS header
  • Extract QNAME/QTYPE

Evidence strength: high
Intent visibility: yes (domain visible)
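
A minimal parser sketch for pulling QNAME/QTYPE out of a plaintext DNS query payload (UDP shown; for TCP/53 the message is prefixed with a 2-byte length):

```python
import struct

def parse_dns_query(payload: bytes):
    """Return (qname, qtype) from a plaintext DNS query, or None if it doesn't parse."""
    if len(payload) < 12:
        return None
    _txid, _flags, qdcount, *_ = struct.unpack("!6H", payload[:12])
    if qdcount < 1:
        return None
    labels, pos = [], 12
    while pos < len(payload):
        length = payload[pos]
        if length == 0:            # end of QNAME
            pos += 1
            break
        if length & 0xC0:          # compression pointer; not handled in this sketch
            return None
        labels.append(payload[pos + 1:pos + 1 + length].decode("ascii", "replace"))
        pos += 1 + length
    else:
        return None                # ran off the end without a terminating label
    if pos + 4 > len(payload):
        return None
    qtype, _qclass = struct.unpack("!2H", payload[pos:pos + 4])
    return ".".join(labels), qtype
```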


DoT (port-based, easy)

DoT is defined over TLS, typically port 853. ([IETF Datatracker][3])

Match conditions

  • TCP dst port 853
  • Optionally confirm TLS handshake exists

Evidence strength: high
Intent visibility: no (domain hidden)


DoH (harder; heuristic + optional allowlists)

DoH is DNS over HTTPS (RFC 8484). ([IETF Datatracker][2])

Recognizers (from strongest to weakest):

  1. HTTP request with Content-Type: application/dns-message
  2. Path/pattern common to DoH endpoints (optional list)
  3. SNI matches known DoH providers (optional list)
  4. Traffic resembles frequent small HTTPS POST/GET bursts typical of DoH (weak)

Evidence strength: medium
Intent visibility: no (domain hidden)

Important for your use case: you may not need to prove it's DoH; you mostly need to detect “DNS-like encrypted resolver traffic bypassing the proxy channel.”
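
A sketch of the layered recognizer, scoring confidence from strongest to weakest signal. The flow fields and the known-endpoint lists are assumptions an operator would supply:

```python
# Operator-supplied lists (placeholders only).
KNOWN_DOH_SNI = {"dns.example-provider.net"}
KNOWN_DOH_IPS = {"192.0.2.53"}

def doh_confidence(flow) -> tuple[float, str]:
    """Score how likely an HTTPS flow is DoH. Assumed flow fields:
    content_type (if payload is visible), sni, dst_ip, looks_like_dns_cadence."""
    if getattr(flow, "content_type", None) == "application/dns-message":
        return 0.95, "Content-Type: application/dns-message"
    if getattr(flow, "sni", None) in KNOWN_DOH_SNI:
        return 0.8, "SNI matches known DoH endpoint"
    if flow.dst_ip in KNOWN_DOH_IPS:
        return 0.6, "destination IP on DoH resolver list"
    if getattr(flow, "looks_like_dns_cadence", False):
        return 0.3, "traffic shape resembles DoH (weak)"
    return 0.0, "no DoH indicators"
```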


6) Policy model: define “safe DNS path”

You need a simple abstraction users can configure:

Safe DNS path can be defined by one or more of:

  • Allowed interfaces

    • loopback (local stub)
    • tunnel interface
  • Allowed destination set

    • proxy server IP(s)
    • internal resolver IP(s)
  • Allowed process

    • only your local stub + proxy allowed to resolve externally
  • Allowed port set

    • maybe only permit 443 to proxy server (if DNS rides inside it)

Then implement:

A DNS event is a “leak” if it violates safe-path constraints.


7) Leak severity model (useful for real-world debugging)

Severity P0 (critical)

  • Plaintext DNS (UDP/TCP 53) on physical interface to ISP/public resolver
  • Especially if QNAME matches proxy-required/sensitive list

Severity P1 (high)

  • DoH/DoT bypassing proxy channel directly to public Internet

Severity P2 (medium)

  • Policy mismatch: domain resolved locally but connection later proxied (or vice versa)

Severity P3 (low / info)

  • Authoritative-side “resolver egress exposure” (less relevant for client-side leak detector)
  • CDN performance mismatch indicators
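
One possible mapping from leak type to the P0–P3 scale, consistent with the list above (a sketch, not a fixed API; Leak-B is treated as P0 because it is plaintext intent exposure):

```python
def severity(leak_type: str) -> str:
    """Map a leak finding onto the P0–P3 scale defined above."""
    return {
        "A": "P0",  # plaintext DNS (UDP/TCP 53) outside the safe path
        "B": "P0",  # sensitive / proxy-required name resolved via ISP-facing DNS
        "C": "P1",  # DoH/DoT bypassing the proxy channel
        "D": "P2",  # DNS/traffic path mismatch indicator
    }.get(leak_type, "P3")
```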

8) Outputs and reporting

Real-time console output (for debugging)

  • “DNS leak detected: Plain DNS”
  • domain (if visible)
  • destination resolver IP
  • interface
  • process name (if available)
  • policy rule violated
  • suggested fix category (e.g., “force stub + block port 53”)

Forensics log (machine-readable)

A single LeakEvent record could include:

  • timestamp
  • leak_type (A/B/C/D)
  • transport (UDP53, TCP53, DoT, DoH)
  • qname/qtype (nullable)
  • src_iface / dst_ip / dst_port
  • process_id/process_name (nullable)
  • correlation_id (link DNS → subsequent connection attempt)
  • confidence score (esp. DoH)
  • raw evidence pointers (pcap offsets / event IDs)

Summary report

  • Leak counts by type
  • Top leaking processes
  • Top leaking resolver destinations
  • Timeline view (bursts often indicate OS fallback behavior)
  • “Pass/Fail” per policy definition

9) Validation strategy (“how do I know my detector is correct?”)

Ground truth tests

  1. Known-leak scenario

    • intentionally set OS DNS to ISP DNS, no tunnel
    • detector must catch plaintext DNS
  2. Known-safe scenario

    • local stub only + blocked outbound 53/853
    • detector should show zero leaks
  3. Bypass scenario

    • enable browser built-in DoH directly
    • detector should catch encrypted resolver bypass (Leak-C)
  4. Split-policy scenario

    • allowlist CN direct, everything else proxy-resolve

    • detector should show:

      • allowlist resolved direct
      • unknown resolved via proxy path

10) Recommended “profiles” (makes tool usable)

Provide built-in presets:

Profile 1: Full-tunnel VPN

  • allow DNS only via tunnel interface or loopback stub
  • any UDP/TCP 53 on physical NIC = leak

Profile 2: Proxy + local stub (your case)

  • allow DNS only to loopback stub
  • allow stub upstream only via proxy server destinations
  • flag any direct DoH/DoT to public endpoints

Profile 3: Split tunnel (geoip + allowlist)

  • allow plaintext DNS only for allowlisted domains (if user accepts risk)
  • enforce “unknown → proxy-resolve”
  • emphasize Leak-B correctness
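
As an illustration, Profile 2 (proxy + local stub) could be shipped as a preset along these lines; the addresses, names, and keys are placeholders, only the shape matters:

```python
PROFILE_PROXY_LOCAL_STUB = {
    "name": "proxy-plus-local-stub",
    "safe_dns_path": {
        "allowed_interfaces": ["lo"],               # clients may only talk to the local stub
        "allowed_destinations": ["198.51.100.10"],  # stub upstream: the proxy server (placeholder IP)
        "allowed_ports": [443],                     # DNS rides inside the proxy channel
        "allowed_processes": ["local-stub", "proxy-client"],  # placeholder process names
    },
    "flags": {
        "flag_direct_doh_dot": True,        # Leak-C: any direct DoH/DoT to public endpoints
        "plaintext_53_on_physical": "P0",   # Leak-A severity for this profile
    },
}
```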

Below is an updated high-level design (still language-agnostic) that integrates process attribution cleanly, including how it fits into the workflow and what to log.


1) New component: Process Attribution Engine (PAE)

Purpose

When a DNS-like event is observed, the PAE tries to attach:

  • PID
  • PPID
  • process name
  • (optional but extremely useful) full command line, executable path, user, container/app package, etc.

This lets your logs answer:

“Which program generated the leaked DNS request?” “Was it a browser, OS service, updater, antivirus, proxy itself, or some library?”

Position in the pipeline

It sits between Traffic Sensor and Leak Detector as an “event enricher”:

Traffic Event → (Classifier) → (Process Attribution) → Enriched Event → Leak Rules → Report


2) Updated architecture (with process attribution)

Existing modules (from earlier design)

  1. Policy & Configuration
  2. Traffic Sensor (packet/flow monitor)
  3. Classifier (Plain DNS / DoT / DoH / Unknown)
  4. DNS Parser (plaintext only)
  5. Flow Tracker
  6. Leak Detector (rules engine)
  7. Active Prober
  8. Report Generator

New module

  1. Process Attribution Engine (PAE)

    • resolves “who owns this flow / packet”
    • emits PID/PPID/name
    • handles platform-specific differences and fallbacks

3) Workflow changes (what happens when a potential leak is seen)

Passive detection loop (updated)

  1. Capture outbound traffic event

  2. Classify transport type:

    • UDP/53, TCP/53 → plaintext DNS
    • TCP/853 → DoT
    • HTTPS patterns → DoH (heuristic)
  3. Extract the 5-tuple

    • src IP:port, dst IP:port, protocol
  4. PAE lookup

    • resolve the owner process for this traffic
    • attach PID/PPID/name (+ optional metadata)
  5. Apply leak rules (A/B/C/D)

  6. Emit:

    • realtime log line (human readable)
    • structured record (JSON/event log)

4) Process attribution: what to detect and how (high-level)

Process attribution always works on one core concept:

Map observed traffic (socket/flow) → owning process

Inputs PAE needs

  • protocol (UDP/TCP)
  • local src port
  • local address
  • timestamp
  • optionally: connection state / flow ID

Output from PAE

  • pid, ppid, process_name

  • optional enrichment:

    • exe_path
    • cmdline
    • user
    • “process tree chain” (for debugging: parent → child → …)
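
As one concrete example of a socket-table lookup (Provider B in the next section), this sketch uses the third-party psutil library to map a local source port back to its owning process. Elevated privileges may be required on some platforms, and the lookup can race with short-lived processes:

```python
import socket
import psutil  # third-party, cross-platform socket-table access

def lookup_owner(local_port: int, proto: str = "udp"):
    """Best-effort map of (proto, local_port) -> (pid, ppid, name); None if unresolved."""
    want_type = socket.SOCK_DGRAM if proto == "udp" else socket.SOCK_STREAM
    for conn in psutil.net_connections(kind="inet"):
        if conn.type == want_type and conn.laddr and conn.laddr.port == local_port:
            if conn.pid is None:   # kernel-owned or permission-limited entry
                return None
            try:
                proc = psutil.Process(conn.pid)
                return conn.pid, proc.ppid(), proc.name()
            except psutil.Error:   # process exited or access denied mid-lookup
                return None
    return None
```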

5) Platform support strategy (without implementation detail)

Process attribution is OS-specific, so structure it as:

“Attribution Provider” interface

  • Provider A: “kernel-level flow owner”
  • Provider B: “socket table owner lookup”
  • Provider C: “event tracing feed”
  • Provider D: fallback “unknown / not supported”

Your main design goal is:

Design rule

Attribution must be best-effort + gracefully degrading, never blocking detection.

So you always log the leak even if PID is unavailable:

  • pid=null, attribution_confidence=LOW
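
A sketch of the provider chain honoring that design rule: providers are tried strongest-first, and any failure degrades to an unattributed event instead of blocking detection. The Attribution record and provider signature are illustrative:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Attribution:
    pid: Optional[int]
    ppid: Optional[int]
    process_name: Optional[str]
    confidence: str                        # "HIGH" | "MEDIUM" | "LOW" | "NONE"
    failure_reason: Optional[str] = None

# Each provider maps a flow to an Attribution, or returns None if it cannot help.
Provider = Callable[[object], Optional[Attribution]]

def attribute(flow, providers: list[Provider]) -> Attribution:
    last_error = None
    for provider in providers:             # ordered: kernel owner, socket table, tracing, ...
        try:
            result = provider(flow)
            if result is not None:
                return result
        except PermissionError:
            last_error = "permission denied"
        except Exception as exc:           # never let attribution break detection
            last_error = str(exc)
    return Attribution(pid=None, ppid=None, process_name=None,
                       confidence="NONE",
                       failure_reason=last_error or "flow already gone")
```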

6) Attribution confidence + race handling (important!)

Attribution can be tricky because:

  • a process may exit quickly (“short-lived resolver helper”)
  • ports can be reused
  • NAT or local proxies may obscure the real origin

So log confidence:

  • HIGH: direct mapping from kernel/socket owner at time of event
  • MEDIUM: mapping by lookup shortly after event (possible race)
  • LOW: inferred / uncertain
  • NONE: not resolved

Also record why attribution failed:

  • “permission denied”
  • “flow already gone”
  • “unsupported transport”
  • “ambiguous mapping”

This makes debugging much easier.


7) What PID/PPID adds to your leak definitions

Leak-A (plaintext DNS outside safe path)

Now you can say:

“svchost.exe (PID 1234) sent UDP/53 to ISP resolver on Wi-Fi interface”

Leak-B (split-policy intent leak)

You can catch:

  • “game launcher looked up blocked domain”
  • “system service triggered a sensitive name unexpectedly”
  • “your proxy itself isn't actually resolving via its own channel”

Leak-C (encrypted DNS bypass)

This becomes very actionable:

“firefox.exe started direct DoH to a resolver outside the tunnel”

Leak-D (mismatch indicator)

You can also correlate:

  • DNS resolved by one process
  • connection made by another process (e.g., local stub vs app)

8) Reporting / realtime logging format (updated)

Realtime log line (human readable)

Example (conceptual):

  • [P0][Leak-A] Plain DNS leaked

    • Domain: example-sensitive.com (A)
    • From: Wi-Fi → To: 1.2.3.4:53
    • Process: browser.exe PID=4321 PPID=1200
    • Policy violated: “No UDP/53 on physical NIC”

Structured event (JSON-style fields)

Minimum recommended fields:

Event identity

  • event_id
  • timestamp

DNS identity

  • transport (udp53/tcp53/dot/doh/unknown)
  • qname (nullable)
  • qtype (nullable)

Network path

  • interface_name
  • src_ip, src_port
  • dst_ip, dst_port
  • route_class (tunnel / physical / loopback)

Process identity (your requested additions)

  • pid

  • ppid

  • process_name

  • optional:

    • exe_path
    • cmdline
    • user

Detection result

  • leak_type (A/B/C/D)
  • severity (P0..P3)
  • policy_rule_id
  • attribution_confidence
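
Pulling the field groups above together, one structured record could look like the following literal (values are illustrative):

```python
leak_event = {
    # Event identity
    "event_id": "evt-000123",
    "timestamp": "2026-01-17T10:32:05Z",
    # DNS identity
    "transport": "udp53",
    "qname": "example-sensitive.com",     # null for DoT/DoH
    "qtype": "A",
    # Network path
    "interface_name": "wlan0",
    "src_ip": "192.168.1.20", "src_port": 51512,
    "dst_ip": "203.0.113.53", "dst_port": 53,
    "route_class": "physical",
    # Process identity
    "pid": 4321, "ppid": 1200, "process_name": "browser.exe",
    "exe_path": None, "cmdline": None, "user": None,   # optional enrichment
    # Detection result
    "leak_type": "A",
    "severity": "P0",
    "policy_rule_id": "no-udp53-on-physical",
    "attribution_confidence": "HIGH",
}
```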

9) Privacy and safety notes (important in a DNS tool)

Because you're logging domains and process command lines, this becomes sensitive.

Add a “privacy mode” policy:

  • Full: store full domain + cmdline
  • Redacted: hash domain; keep TLD only; truncate cmdline
  • Minimal: only keep leak counts + resolver IPs + process name

Also allow “capture window” (rotate logs, avoid giant histories).
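
A minimal redaction sketch for the three modes; the salted-hash approach is one option, not a requirement:

```python
import hashlib

def redact_qname(qname: str, mode: str, salt: bytes = b"rotate-me") -> str | None:
    if mode == "full":
        return qname
    if mode == "redacted":
        tld = qname.rsplit(".", 1)[-1]
        digest = hashlib.sha256(salt + qname.encode()).hexdigest()[:12]
        return f"{digest}.{tld}"   # keep TLD only, hash the rest
    return None                    # "minimal": drop the name entirely
```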


10) UX feature: “Show me the process tree”

When a leak happens, a good debugger view is:

  • foo (pid 1000)

    • parent: bar (pid 900)

      • grandparent: systemd / svchost / etc.

This is extremely useful to identify:

  • browsers spawning helpers
  • OS DNS services
  • containerized processes
  • update agents / telemetry daemons

So your report generator should support:

Process chain rendering (where possible)
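
A sketch of process-chain rendering using psutil, walking parent() until the chain ends (entries can vanish mid-walk, so errors are tolerated):

```python
import psutil

def process_chain(pid: int) -> list[str]:
    """Return ["name (pid N)", ...] from the leaking process up through its ancestors."""
    chain = []
    try:
        proc = psutil.Process(pid)
        while proc is not None:
            chain.append(f"{proc.name()} (pid {proc.pid})")
            proc = proc.parent()
    except psutil.Error:
        pass   # parent may have exited; render whatever was collected
    return chain
```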


11) Practical edge cases you should detect (with PID helping)

  1. Local stub is fine, upstream isn't

    • Your local resolver process leaks upstream plaintext DNS
  2. Browser uses its own DoH

    • process attribution immediately reveals it
  3. Multiple interfaces

    • a leak only happens on Wi-Fi but not Ethernet
  4. Kill-switch failure

    • when tunnel drops, PID shows which app starts leaking first