# Hybrid Diffusion for ICS Traffic (HAI 21.03) — Project Report
## 1. Project Goal
Build a **hybrid diffusion-based generator** for ICS traffic features, focusing on **mixed continuous + discrete** feature sequences. The output is **feature-level sequences**, not raw packets. The generator should preserve:
- **Distributional fidelity** (continuous ranges + discrete frequencies)
- **Temporal consistency** (time correlation and sequence structure)
- **Field/logic consistency** for discrete protocol-like columns
---
## 2. Data and Scope
**Dataset used in current implementation:** HAI 21.03 (CSV feature traces).
**Data path (default in config):**
- `dataset/hai/hai-21.03/train*.csv.gz`
**Feature split (fixed schema):** `example/feature_split.json`
- Continuous features: sensor/process values
- Discrete features: binary/low-cardinality status/flag fields
- `time` is excluded from modeling
---
## 3. End-to-End Pipeline
One command pipeline:
```
python example/run_all.py --device cuda
```
Full pipeline + diagnostics:
```
python example/run_all_full.py --device cuda
```
Pipeline stages:
1) **Prepare data** (`example/prepare_data.py`)
2) **Train temporal backbone** (`example/train.py`, stage 1)
3) **Train diffusion on residuals** (`example/train.py`, stage 2)
4) **Generate samples** (`example/export_samples.py`)
5) **Evaluate** (`example/evaluate_generated.py`)
---
## 4. Technical Architecture
### 4.1 Hybrid Diffusion Model (Core)
Defined in `example/hybrid_diffusion.py`.
**Inputs:**
- Continuous projection
- Discrete embeddings
- Time embedding (sinusoidal)
- Positional embedding (sequence index)
- Optional condition embedding (`file_id`)
**Backbone (configurable):**
- GRU (sequence modeling)
- Transformer encoder (self-attention)
- Post LayerNorm + residual MLP
**Current default config (latest):**
- Backbone: Transformer
- Sequence length: 96
- Batch size: 16
**Outputs:**
- Continuous head: predicts target (`eps` or `x0`)
- Discrete heads: logits per discrete column
**Continuous branch:** Gaussian diffusion
**Discrete branch:** mask diffusion
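The input assembly above can be sketched as follows. This is a minimal stand-in, not the repository's `example/hybrid_diffusion.py`: class and argument names here are hypothetical, and the real model adds the optional `file_id` condition embedding and a residual MLP that this sketch omits.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard sinusoidal embedding for diffusion timesteps t of shape [B]."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # [B, dim]

class HybridDenoiser(nn.Module):
    """Sketch: project continuous features, embed discrete tokens, add
    positional + time embeddings, run a Transformer encoder, and emit a
    continuous head plus per-column discrete logits."""
    def __init__(self, n_cont, disc_vocab_sizes, d_model=64, seq_len=96):
        super().__init__()
        self.cont_proj = nn.Linear(n_cont, d_model)
        # +1 slot per vocab reserves an id for the [MASK] token
        self.disc_embs = nn.ModuleList(nn.Embedding(v + 1, d_model) for v in disc_vocab_sizes)
        self.pos_emb = nn.Embedding(seq_len, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.norm = nn.LayerNorm(d_model)
        self.cont_head = nn.Linear(d_model, n_cont)
        self.disc_heads = nn.ModuleList(nn.Linear(d_model, v) for v in disc_vocab_sizes)

    def forward(self, x_cont, x_disc, t):
        B, L, _ = x_cont.shape
        h = self.cont_proj(x_cont)                         # continuous projection
        for i, emb in enumerate(self.disc_embs):
            h = h + emb(x_disc[..., i])                    # discrete embeddings
        h = h + self.pos_emb(torch.arange(L))[None]        # positional embedding
        h = h + sinusoidal_embedding(t, h.shape[-1])[:, None, :]  # time embedding
        h = self.norm(self.backbone(h))
        return self.cont_head(h), [head(h) for head in self.disc_heads]
```

Swapping `nn.TransformerEncoder` for an `nn.GRU` at the `backbone` attribute is the switchable-backbone decision described in Section 11.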
---
### 4.2 Stage-1 Temporal Model (GRU)
A separate GRU models the **trend backbone** of continuous features. It is trained first using teacher forcing to predict the next step.
Trend definition:
```
trend = GRU(x)
residual = x - trend
```
**Two-stage training:** temporal GRU first, diffusion on residuals.
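The trend/residual split can be sketched as below. Names are hypothetical (the repository's class is `TemporalGRUGenerator`); the shift-by-one alignment of predictions to time steps is an assumption about how `trend = GRU(x)` is realized.

```python
import torch
import torch.nn as nn

class TemporalGRU(nn.Module):
    """Sketch of the stage-1 trend model: a GRU trained with teacher
    forcing to predict the next step of the continuous features."""
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_features)

    def forward(self, x):           # x: [B, L, F]
        h, _ = self.gru(x)
        return self.head(h)         # next-step prediction at each position

def trend_and_residual(model, x):
    """Stage-2 input: trend = GRU(x), residual = x - trend."""
    with torch.no_grad():
        # shift predictions right so step t's trend comes from steps <= t-1;
        # the first step keeps its own value (zero residual there)
        trend = torch.cat([x[:, :1], model(x)[:, :-1]], dim=1)
    return trend, x - trend
```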
### 4.3 Feature-Type Aware Strategy / 特征类型分治方案
Based on HAI feature semantics and observed KS outliers, we classify problematic features into six types and plan separate modeling paths:
1) **Type 1: Exogenous setpoints / demands** (schedule-driven, piecewise-constant)
Examples: P1_B4002, P2_MSD, P4_HT_LD
Strategy: program generator (HSMM / change-point), or sample from program library; condition diffusion on these.
2) **Type 2: Controller outputs** (policy-like, saturation / rate limits)
Example: P1_B4005
Strategy: small controller emulator (PID/NARX) with clamp + rate-limit.
3) **Type 3: Spiky actuators** (few operating points + long dwell)
Examples: P1_PCV02Z, P1_FCV02Z
Strategy: spike-and-slab + dwell-time modeling or command-driven actuator dynamics.
4) **Type 4: Quantized / digital-as-continuous**
Examples: P4_ST_PT01, P4_ST_TT01
Strategy: generate latent continuous then quantize or treat as ordinal discrete diffusion.
5) **Type 5: Derived conversions**
Examples: `*FT*` columns and their `*FTZ*` conversions
Strategy: generate base variable and derive conversions deterministically.
6) **Type 6: Aux / vibration / narrow-band**
Examples: P2_24Vdc, P2_HILout
Strategy: AR/ARMA or regime-conditioned narrow-band models.
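As a concrete illustration of the Type-1 plan, a schedule-like setpoint series can be sampled as discrete levels with random dwell times. This is a hypothetical minimal stand-in for the proposed HSMM / change-point program generator, not existing project code.

```python
import numpy as np

def piecewise_constant_program(length, levels, mean_dwell=20, seed=0):
    """Hypothetical Type-1 sketch: a piecewise-constant setpoint series
    built from discrete levels held for geometric dwell times."""
    rng = np.random.default_rng(seed)
    out = np.empty(length)
    i = 0
    while i < length:
        dwell = rng.geometric(1.0 / mean_dwell)   # how long this level holds
        out[i:i + dwell] = rng.choice(levels)     # pick the next setpoint level
        i += dwell
    return out
```

A real HSMM would additionally model level-to-level transition probabilities instead of choosing levels independently.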
---
## 5. Diffusion Formulations
### 5.1 Continuous Diffusion
Forward process on residuals:
```
r_t = sqrt(a_bar_t) * r + sqrt(1 - a_bar_t) * eps
```
Targets supported:
- **eps prediction**
- **x0 prediction** (default)
Current config:
```
"cont_target": "x0"
```
### 5.2 Discrete Diffusion / 离散扩散
Mask diffusion with cosine schedule:
```
p(t) = 0.5 * (1 - cos(pi * t / T))
```
Mask-only cross-entropy is computed on masked positions.
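The masking schedule and mask-only loss can be sketched as below; function names are hypothetical, and the model's logits are assumed to have shape `[B, L, vocab]`.

```python
import math
import torch
import torch.nn.functional as F

def mask_rate(t, T):
    """Cosine schedule: p(t) = 0.5 * (1 - cos(pi * t / T))."""
    return 0.5 * (1.0 - math.cos(math.pi * t / T))

def mask_and_loss(tokens, logits, t, T, mask_id):
    """Corrupt tokens with probability p(t), then compute cross-entropy
    only on the masked positions (unmasked positions carry no loss)."""
    p = mask_rate(t, T)
    masked = torch.rand(tokens.shape) < p
    corrupted = tokens.masked_fill(masked, mask_id)
    if masked.any():
        loss = F.cross_entropy(logits[masked], tokens[masked])
    else:
        loss = logits.new_zeros(())
    return corrupted, loss
```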
---
## 6. Loss Design
Total loss:
```
L = λ * L_cont + (1 - λ) * L_disc
```
### 6.1 Continuous Loss
- `eps` target: MSE(eps_pred, eps)
- `x0` target: MSE(x0_pred, x0)
- Optional inverse-variance weighting: `cont_loss_weighting = "inv_std"`
- Optional **SNR-weighted loss**: reweights MSE by SNR to stabilize diffusion training
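One common form of SNR reweighting (the "min-SNR" clip) is sketched below, assuming the x0 target. This is an assumption about the variant used; the repository may weight differently.

```python
import torch

def snr_weighted_mse(x0_pred, x0, t, alpha_bar, gamma=5.0):
    """Min-SNR-style reweighting (sketch, x0 target): per-sample MSE is
    scaled by min(SNR_t, gamma), so near-noiseless steps with huge SNR
    cannot dominate the objective."""
    a = alpha_bar[t]                               # [B]
    snr = a / (1.0 - a)                            # SNR_t = a_bar_t / (1 - a_bar_t)
    w = torch.minimum(snr, torch.full_like(snr, gamma))
    per_sample = (x0_pred - x0).pow(2).mean(dim=(1, 2))   # [B]
    return (w * per_sample).mean()
```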
### 6.2 Discrete Loss
Cross-entropy on masked positions only.
### 6.3 Temporal Loss
Stage-1 GRU predicts the next step:
```
L_temporal = MSE(pred_next, x[:,1:])
```
### 6.4 Residual Alignment Losses
- **Quantile loss** on residuals to align distribution tails.
- **Residual mean/std penalty** to reduce drift and improve KS.
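One plausible realization of these two penalties, operating on flattened batches of shape `[N, F]` (function names hypothetical):

```python
import torch

def quantile_align_loss(gen, real, qs=(0.05, 0.25, 0.5, 0.75, 0.95)):
    """Match per-feature residual quantiles between generated and real
    batches, pulling the distribution tails together."""
    q = torch.tensor(qs)
    return (torch.quantile(gen, q, dim=0) - torch.quantile(real, q, dim=0)).abs().mean()

def moment_penalty(gen, real):
    """Penalize drift in the per-feature residual mean/std."""
    return ((gen.mean(0) - real.mean(0)).pow(2).mean()
            + (gen.std(0) - real.std(0)).pow(2).mean())
```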
---
## 7. Data Processing
Defined in `example/data_utils.py` + `example/prepare_data.py`.
Key steps:
- Streaming mean/std/min/max + int-like detection
- Optional **log1p transform** for heavy-tailed continuous columns
- Optional **quantile transform** (TabDDPM-style) for continuous columns (skips extra standardization)
- **Full quantile stats** (full_stats) for stable calibration
- Optional **post-hoc quantile calibration** to align 1D CDFs after sampling
- Discrete vocab + most frequent token
- Windowed batching with **shuffle buffer**
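The TabDDPM-style quantile transform can be sketched per column as: empirical CDF → uniform → inverse normal CDF, and back at export time. This is an illustration with hypothetical names, not the code in `example/data_utils.py`.

```python
import numpy as np
from statistics import NormalDist

def fit_quantiles(col, n_quantiles=100):
    """Store the empirical quantile grid of one continuous column."""
    probs = np.linspace(0.0, 1.0, n_quantiles)
    return probs, np.quantile(col, probs)

def to_gaussian(col, probs, refs, eps=1e-6):
    """Map the column into an approximately N(0, 1) space, so the
    diffusion model never sees heavy tails directly."""
    u = np.clip(np.interp(col, refs, probs), eps, 1.0 - eps)
    nd = NormalDist()
    return np.array([nd.inv_cdf(v) for v in u])

def from_gaussian(z, probs, refs):
    """Inverse map used at export time (no extra de-standardization)."""
    nd = NormalDist()
    u = np.array([nd.cdf(v) for v in z])
    return np.interp(u, probs, refs)
```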
---
## 8. Sampling & Export
Defined in:
- `example/sample.py`
- `example/export_samples.py`
Export process:
- Generate trend using temporal GRU
- Diffusion generates residuals
- Output: `trend + residual`
- De-normalize continuous values
- Inverse quantile transform (if enabled; no extra de-standardization)
- Optional post-hoc quantile calibration (if enabled)
- Bound to observed min/max (clamp / sigmoid / soft_tanh / none)
- Restore discrete tokens from vocab
- Write to CSV
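The bounding step can be sketched as below. The repository exposes the four modes listed above (clamp / sigmoid / soft_tanh / none); the exact squashing formulas for the two soft modes are assumptions here.

```python
import numpy as np

def bound_to_range(x, lo, hi, mode="clamp"):
    """Keep de-normalized continuous values inside the observed [min, max].
    The soft variants squash instead of cutting, which avoids probability
    mass piling up exactly at the boundaries."""
    mid = (hi + lo) / 2.0
    half = np.maximum((hi - lo) / 2.0, 1e-12)
    if mode == "clamp":
        return np.clip(x, lo, hi)
    if mode == "soft_tanh":
        return mid + half * np.tanh((x - mid) / half)
    if mode == "sigmoid":
        return lo + (hi - lo) / (1.0 + np.exp(-(x - mid) / half))
    return x                                    # mode == "none"
```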
---
## 9. Evaluation
Defined in `example/evaluate_generated.py`.
Metrics (with reference):
- **KS statistic** (continuous distribution)
- **Quantile diffs** (q05/q25/q50/q75/q95)
- **Lag1 correlation diff** (temporal structure)
- **Discrete JSD** over vocab frequency
- **Invalid token counts**
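The three headline metrics reduce to a few lines each; this is a generic sketch, not the code in `example/evaluate_generated.py`.

```python
import numpy as np

def ks_stat(real, gen):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    grid = np.sort(np.concatenate([real, gen]))
    cdf_r = np.searchsorted(np.sort(real), grid, side="right") / len(real)
    cdf_g = np.searchsorted(np.sort(gen), grid, side="right") / len(gen)
    return float(np.abs(cdf_r - cdf_g).max())

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two token-frequency vectors."""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def lag1_corr(x):
    """Lag-1 autocorrelation of a 1-D series (temporal-structure check)."""
    return float(np.corrcoef(x[:-1], x[1:])[0, 1])
```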
**Metric summary and comparison script:** `example/summary_metrics.py`
- Prints avg_ks / avg_jsd / avg_lag1_diff
- Appends a record to `example/results/metrics_history.csv`
- If a previous record exists, prints the delta (old vs. new)
**Distribution diagnostics (per-feature KS/CDF):** `example/diagnose_ks.py`
- Writes `example/results/ks_per_feature.csv` (KS per continuous feature)
- Writes `example/results/cdf_<feature>.svg` (real vs. generated CDF)
- Reports whether generated data piles up at the boundaries (gen_frac_at_min / gen_frac_at_max)
**Filtered KS (drops hard-to-learn features; diagnostics only):** `example/filtered_metrics.py`
- Rule: features with too-small std or too-high KS are dropped automatically
- Writes `example/results/filtered_metrics.json`
- Used for diagnosis only, not as the final metric
Recent runs (Windows):
- 2026-01-27 21:22:34 — avg_ks 0.4046 / avg_jsd 0.0376 / avg_lag1_diff 0.1449
---
## 10. Automation
`example/run_all.py` runs all stages with config-driven paths.
`example/run_all_full.py` runs prepare/train/export/eval + KS diagnostics in one command.
`example/run_compare.py` can run a baseline vs temporal config and compute metric deltas.
---
## 11. Key Engineering Decisions
- Mixed-type diffusion: continuous + discrete split
- Two-stage training: temporal backbone first, diffusion on residuals
- Switchable backbone: GRU vs Transformer encoder for the diffusion model
- Positional + time embeddings for stability
- Optional inverse-variance weighting for continuous loss
- Log1p transforms for heavy-tailed signals
---
## 12. Code Map (Key Files)
- Core model: `example/hybrid_diffusion.py`
- Training: `example/train.py`
- Temporal GRU: `example/hybrid_diffusion.py` (`TemporalGRUGenerator`)
- Data prep: `example/prepare_data.py`
- Data utilities: `example/data_utils.py`
- Sampling: `example/sample.py`
- Export: `example/export_samples.py`
- Evaluation: `example/evaluate_generated.py`
- Pipeline: `example/run_all.py`
- Config: `example/config.json`
---
## 13. Known Issues / Current Limitations
- KS can remain high on a subset of features → per-feature diagnosis required
- Lag1 may fluctuate → distribution vs temporal trade-off
- Discrete JSD can regress when continuous KS is prioritized
- Transformer backbone may change stability; needs systematic comparison
---
## 14. Suggested Next Steps
- Compare GRU vs Transformer backbone using `run_compare.py`
- Explore **v-prediction** for the continuous branch
- Strengthen discrete diffusion (e.g., D3PM-style transitions)
- Add targeted discrete calibration for high-JSD columns
---
## 15. Summary
This project implements a **two-stage hybrid diffusion model** for ICS feature sequences: a GRU-based temporal backbone first models sequence trends, then diffusion learns residual corrections. The pipeline covers data prep, two-stage training, sampling, export, and evaluation. The main research challenge remains in balancing **distributional fidelity (KS)** and **temporal consistency (lag1)**.