mask-ddpm/report.md
2026-01-27 18:19:07 +08:00

# Hybrid Diffusion for ICS Traffic (HAI 21.03): Project Report
## 1. Project Goal
Build a **hybrid diffusion-based generator** for ICS traffic features, focusing on **mixed continuous + discrete** feature sequences. The output is **feature-level sequences**, not raw packets. The generator should preserve:
- **Distributional fidelity** (continuous ranges + discrete frequencies)
- **Temporal consistency** (time correlation and sequence structure)
- **Field/logic consistency** for discrete protocol-like columns
---
## 2. Data and Scope
**Dataset used in current implementation:** HAI 21.03 (CSV feature traces).
**Data path (default in config):**
- `dataset/hai/hai-21.03/train*.csv.gz`
**Feature split (fixed schema):** `example/feature_split.json`
- Continuous features: sensor/process values
- Discrete features: binary/low-cardinality status/flag fields
- `time` is excluded from modeling
---
## 3. End-to-End Pipeline
One-command pipeline:
```
python example/run_all.py --device cuda
```
Pipeline stages:
1) **Prepare data** (`example/prepare_data.py`)
2) **Train temporal backbone** (`example/train.py`, stage 1)
3) **Train diffusion on residuals** (`example/train.py`, stage 2)
4) **Generate samples** (`example/export_samples.py`)
5) **Evaluate** (`example/evaluate_generated.py`)
The one-command run covers: data preparation → temporal backbone training → residual diffusion training → sample export → evaluation.
---
## 4. Technical Architecture
### 4.1 Hybrid Diffusion Model (Core)
Defined in `example/hybrid_diffusion.py`.
**Inputs:**
- Continuous projection
- Discrete embeddings
- Time embedding (sinusoidal)
- Positional embedding (sequence index)
- Optional condition embedding (`file_id`)
**Backbone (configurable):**
- GRU (sequence modeling)
- Transformer encoder (self-attention)
- Post LayerNorm + residual MLP
**Outputs:**
- Continuous head: predicts target (`eps` or `x0`)
- Discrete heads: logits per discrete column
**Continuous branch:** Gaussian diffusion
**Discrete branch:** Mask diffusion
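
The sinusoidal time embedding listed among the inputs can be illustrated with a minimal, dependency-free sketch. The function name `sinusoidal_embedding`, the `max_period` of 10000, and the sin-then-cos layout are assumptions borrowed from common diffusion implementations; they are not taken from `example/hybrid_diffusion.py`, which presumably works on batched tensors.

```python
import math


def sinusoidal_embedding(t: int, dim: int, max_period: float = 10000.0) -> list[float]:
    """Return a `dim`-sized sinusoidal embedding of diffusion step `t`.

    Hypothetical helper: name, layout, and max_period are illustrative only.
    """
    half = dim // 2
    # Geometric ladder of frequencies, starting at 1 and decaying towards 1/max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    emb = [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]
    if dim % 2 == 1:
        emb.append(0.0)  # pad odd dimensions so the length is exactly `dim`
    return emb
```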
---
### 4.2 Stage-1 Temporal Model (GRU)
A separate GRU models the **trend backbone** of continuous features. It is trained first, using teacher forcing, to predict the next step.
Trend definition:
```
trend = GRU(x)
residual = x - trend
```
---
## 5. Diffusion Formulations
### 5.1 Continuous Diffusion
Forward process on residuals:
```
r_t = sqrt(a_bar_t) * r + sqrt(1 - a_bar_t) * eps
```
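
The forward process above can be sketched in plain Python. The report does not state which noise schedule produces `a_bar_t`, so the linear beta schedule here (and the names `linear_alpha_bar` and `q_sample`) is an assumption for illustration only.

```python
import math
import random


def linear_alpha_bar(T: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> list[float]:
    """Cumulative products a_bar_t for an ASSUMED linear beta schedule."""
    a_bar, prod = [], 1.0
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / max(T - 1, 1)
        prod *= 1.0 - beta
        a_bar.append(prod)
    return a_bar


def q_sample(r: list[float], a_bar_t: float, rng: random.Random) -> tuple[list[float], list[float]]:
    """Noise a residual vector: r_t = sqrt(a_bar_t) * r + sqrt(1 - a_bar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in r]
    signal, noise = math.sqrt(a_bar_t), math.sqrt(1.0 - a_bar_t)
    r_t = [signal * ri + noise * ei for ri, ei in zip(r, eps)]
    return r_t, eps  # eps is returned because it is the target for eps-prediction
```

As `a_bar_t` shrinks with increasing `t`, the residual is progressively replaced by Gaussian noise.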
Targets supported:
- **eps prediction**
- **x0 prediction** (default)
Current config:
```
"cont_target": "x0"
```
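
The two targets are interchangeable given `r_t` and `a_bar_t`, because the forward equation can be solved for either term. A small sketch of that conversion follows; the function names are hypothetical, `x0` denotes the clean residual `r`, and `0 < a_bar_t < 1` is assumed so that neither division is by zero.

```python
import math


def x0_from_eps(r_t: list[float], eps_pred: list[float], a_bar_t: float) -> list[float]:
    """Recover the clean residual from a noise prediction."""
    signal, noise = math.sqrt(a_bar_t), math.sqrt(1.0 - a_bar_t)
    return [(x - noise * e) / signal for x, e in zip(r_t, eps_pred)]


def eps_from_x0(r_t: list[float], x0_pred: list[float], a_bar_t: float) -> list[float]:
    """Recover the implied noise from a clean-residual prediction."""
    signal, noise = math.sqrt(a_bar_t), math.sqrt(1.0 - a_bar_t)
    return [(x - signal * x0) / noise for x, x0 in zip(r_t, x0_pred)]
```

At high noise levels (small `a_bar_t`) the division by `sqrt(a_bar_t)` amplifies errors in an `eps` prediction, which is one commonly cited reason to predict `x0` directly.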
### 5.2 Discrete Diffusion
Mask diffusion with cosine schedule:
```
p(t) = 0.5 * (1 - cos(pi * t / T))
```
Mask-only cross-entropy is computed on masked positions.
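
A minimal sketch of this branch, reading `p(t)` as the per-token masking probability at step `t`. The mask id `MASK_ID = -1` and all function names are hypothetical, and the real code presumably works on batched tensors with one set of logits per discrete column.

```python
import math
import random

MASK_ID = -1  # hypothetical id reserved for the mask token


def mask_prob(t: int, T: int) -> float:
    """Cosine schedule from the report: 0 at t=0, 1 at t=T."""
    return 0.5 * (1.0 - math.cos(math.pi * t / T))


def mask_tokens(tokens: list[int], t: int, T: int, rng: random.Random):
    """Independently replace each token with MASK_ID with probability p(t)."""
    p = mask_prob(t, T)
    flags = [rng.random() < p for _ in tokens]
    masked = [MASK_ID if f else tok for tok, f in zip(tokens, flags)]
    return masked, flags


def masked_cross_entropy(logits: list[list[float]], targets: list[int], flags: list[bool]) -> float:
    """Mean cross-entropy over masked positions only (0.0 if nothing is masked)."""
    total, count = 0.0, 0
    for row, y, is_masked in zip(logits, targets, flags):
        if not is_masked:
            continue  # unmasked positions contribute nothing to the loss
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))  # stable log-sum-exp
        total += log_z - row[y]
        count += 1
    return total / count if count else 0.0
```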
---
## 6. Loss Design
Total loss:
```
L = λ * L_cont + (1 - λ) * L_disc
```
### 6.1 Continuous Loss
- `eps` target: MSE(eps_pred, eps)
- `x0` target: MSE(x0_pred, x0)
- Optional inverse-variance weighting: `cont_loss_weighting = "inv_std"`
- Optional **SNR-weighted loss**: reweights MSE by SNR to stabilize diffusion training
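
The report does not specify the exact form of the SNR weighting. One common choice is min-SNR clipping, sketched below purely as an assumption: the names `snr`, `min_snr_weight`, and `weighted_mse`, and the clipping value `gamma = 5.0`, are illustrative and may differ from what `example/train.py` does.

```python
def snr(a_bar_t: float) -> float:
    """Signal-to-noise ratio of the forward process at a given a_bar_t (< 1)."""
    return a_bar_t / (1.0 - a_bar_t)


def min_snr_weight(a_bar_t: float, gamma: float = 5.0, target: str = "x0") -> float:
    """ASSUMED min-SNR-style weight: clip the SNR at gamma.

    For an x0 target the weight is min(SNR, gamma); for an eps target it is
    min(SNR, gamma) / SNR, which expresses the same clipping in eps units.
    """
    s = snr(a_bar_t)
    clipped = min(s, gamma)
    return clipped if target == "x0" else clipped / s


def weighted_mse(pred: list[float], target: list[float], weight: float) -> float:
    """Per-sample MSE scaled by a timestep-dependent weight."""
    return weight * sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)
```

Clipping keeps low-noise steps, where the SNR is very large, from dominating an `x0`-target loss.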
### 6.2 Discrete Loss
Cross-entropy on masked positions only.
### 6.3 Temporal Loss
The Stage-1 GRU predicts the next step:
```
L_temporal = MSE(pred_next, x[:,1:])
```
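
The index shift in this loss is easy to get wrong, so here is a small sketch for a single sequence. The `predict` callable is a stand-in for the GRU (a trivial persistence predictor is used for the demonstration, not a recurrent model), and all names are hypothetical.

```python
def next_step_mse(predict, x: list[list[float]]) -> float:
    """Teacher-forced next-step MSE for one sequence of feature vectors.

    At step i the model sees the ground-truth history x[0..i] and is scored
    against x[i+1], which matches comparing pred_next with x[:, 1:].
    """
    targets = x[1:]
    preds = [predict(x[: i + 1]) for i in range(len(x) - 1)]
    sq_errors = [
        (p - t) ** 2
        for pred_vec, target_vec in zip(preds, targets)
        for p, t in zip(pred_vec, target_vec)
    ]
    return sum(sq_errors) / len(sq_errors)


def persistence(history: list[list[float]]) -> list[float]:
    """Stand-in predictor: the next step equals the last observed step."""
    return history[-1]
```

For example, `next_step_mse(persistence, [[0.0], [1.0], [3.0]])` scores the predictions `[0.0]` and `[1.0]` against the targets `[1.0]` and `[3.0]`.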
### 6.4 Residual Alignment Losses
- **Quantile loss** on residuals to align distribution tails.
- **Residual mean/std penalty** to reduce drift and improve KS.
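
The report does not give formulas for these two terms. The sketch below shows one plausible reading, computed on plain lists of values: a mean absolute gap between empirical quantiles, and a squared gap between means and standard deviations. The function names and the quantile levels are assumptions; a training version would need the same computation on differentiable tensors.

```python
import math


def quantile(values: list[float], q: float) -> float:
    """Empirical quantile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(s) - 1)
    frac = pos - lo
    return s[lo] * (1.0 - frac) + s[hi] * frac


def quantile_alignment_loss(gen, real, qs=(0.05, 0.25, 0.5, 0.75, 0.95)) -> float:
    """ASSUMED form: mean absolute gap between matching quantiles."""
    return sum(abs(quantile(gen, q) - quantile(real, q)) for q in qs) / len(qs)


def mean_std_penalty(gen, real) -> float:
    """ASSUMED form: squared gap in mean plus squared gap in (population) std."""
    def stats(v):
        m = sum(v) / len(v)
        return m, math.sqrt(sum((x - m) ** 2 for x in v) / len(v))

    mean_g, std_g = stats(gen)
    mean_r, std_r = stats(real)
    return (mean_g - mean_r) ** 2 + (std_g - std_r) ** 2
```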
---
## 7. Data Processing
Defined in `example/data_utils.py` and `example/prepare_data.py`.
Key steps:
- Streaming mean/std/min/max + int-like detection
- Optional **log1p transform** for heavy-tailed continuous columns
- Discrete vocab + most frequent token
- Windowed batching with **shuffle buffer**
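
The streaming statistics step can be sketched with Welford's online algorithm, which updates the mean and variance one value at a time without holding a column in memory. This is an illustration under assumptions: the class and helper names are hypothetical, a population standard deviation is used, and the actual logic in `example/data_utils.py` may differ.

```python
import math


class RunningStats:
    """Single-pass mean/std/min/max plus an int-like flag for one column."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.min = math.inf
        self.max = -math.inf
        self.int_like = True

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)  # Welford update
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        self.int_like = self.int_like and float(x).is_integer()

    @property
    def std(self) -> float:
        return math.sqrt(self.m2 / self.n) if self.n else 0.0


def forward_log1p(x: float) -> float:
    """Compress a heavy-tailed value; requires x > -1."""
    return math.log1p(x)


def inverse_log1p(y: float) -> float:
    """Undo forward_log1p when mapping generated values back."""
    return math.expm1(y)
```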
---
## 8. Sampling & Export
Defined in:
- `example/sample.py`
- `example/export_samples.py`
Export process:
- Generate trend using temporal GRU
- Diffusion generates residuals
- Output: `trend + residual`
- De-normalize continuous values
- Clamp to observed min/max
- Restore discrete tokens from vocab
- Write to CSV
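
The post-processing part of the export steps can be sketched for a single continuous column and a single discrete column. The function names and argument layout are hypothetical, the sketch assumes standard-score normalization, and it omits the inverse of the optional log1p transform as well as the CSV writing.

```python
def postprocess_continuous(
    trend: list[float],
    residual: list[float],
    mean: float,
    std: float,
    lo: float,
    hi: float,
) -> list[float]:
    """Combine trend and residual, de-normalize, then clamp to [lo, hi]."""
    out = []
    for tr, res in zip(trend, residual):
        z = tr + res                      # output in normalized space
        x = z * std + mean                # undo (x - mean) / std normalization
        out.append(min(max(x, lo), hi))   # clamp to the observed range
    return out


def decode_tokens(ids: list[int], vocab: list[str]) -> list[str]:
    """Map generated token ids back to their original discrete values."""
    return [vocab[i] for i in ids]
```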
---
## 9. Evaluation
Defined in `example/evaluate_generated.py`.
Metrics (computed against the real reference data):
- **KS statistic** (continuous distribution)
- **Quantile diffs** (q05/q25/q50/q75/q95)
- **Lag-1 correlation diff** (temporal structure)
- **Discrete JSD** over vocab frequency
- **Invalid token counts**
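
For reference, three of these metrics can be computed as follows. This is a dependency-free sketch with hypothetical function names; the base-2 logarithm in the JSD and the Pearson form of the lag-1 correlation are assumptions, and `example/evaluate_generated.py` may define them differently.

```python
import math
from collections import Counter


def ks_statistic(a: list[float], b: list[float]) -> float:
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        v = min(a[i], b[j])
        while i < len(a) and a[i] <= v:  # advance past all values <= v (handles ties)
            i += 1
        while j < len(b) and b[j] <= v:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def jsd(real_tokens: list, gen_tokens: list) -> float:
    """Jensen-Shannon divergence (base 2, range [0, 1]) over token frequencies."""
    vocab = set(real_tokens) | set(gen_tokens)
    cr, cg = Counter(real_tokens), Counter(gen_tokens)
    p = {v: cr[v] / len(real_tokens) for v in vocab}
    q = {v: cg[v] / len(gen_tokens) for v in vocab}
    m = {v: 0.5 * (p[v] + q[v]) for v in vocab}

    def kl(x):
        return sum(x[v] * math.log2(x[v] / m[v]) for v in vocab if x[v] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def lag1_corr(x: list[float]) -> float:
    """Pearson correlation between x[t] and x[t+1]; 0.0 for constant series."""
    a, b = x[:-1], x[1:]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def lag1_diff(real: list[float], gen: list[float]) -> float:
    """Absolute difference between real and generated lag-1 correlations."""
    return abs(lag1_corr(real) - lag1_corr(gen))
```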
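**Metric summary and comparison script:** `example/summary_metrics.py`
- Outputs avg_ks / avg_jsd / avg_lag1_diff
- Appends a record to `example/results/metrics_history.csv`
- If a previous record exists, prints the delta (new vs. previous)
**Distribution diagnostics script (per-feature KS/CDF):** `example/diagnose_ks.py`
- Writes `example/results/ks_per_feature.csv` (KS for each continuous feature)
- Writes `example/results/cdf_<feature>.svg` (real vs. generated CDF)
- Reports whether generated data piles up at the boundaries (`gen_frac_at_min` / `gen_frac_at_max`)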
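
The boundary check matters because clamping generated values to the observed min/max at export time can stack probability mass on the two endpoints, which would inflate KS. A minimal sketch of such a check is below; the function name, the tolerance, and the exact definition of the two fractions are assumptions rather than the behavior of `example/diagnose_ks.py`.

```python
def boundary_fractions(gen: list[float], lo: float, hi: float, tol: float = 1e-9) -> dict:
    """Fraction of generated values sitting on the observed min or max."""
    n = len(gen)
    at_min = sum(1 for v in gen if abs(v - lo) <= tol) / n
    at_max = sum(1 for v in gen if abs(v - hi) <= tol) / n
    return {"gen_frac_at_min": at_min, "gen_frac_at_max": at_max}
```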
Recent run (user-reported, Windows):
- avg_ks 0.7096 / avg_jsd 0.03318 / avg_lag1_diff 0.18984
---
## 10. Automation
`example/run_all.py` runs all stages with config-driven paths.
`example/run_compare.py` can run a baseline vs temporal config and compute metric deltas.
---
## 11. Key Engineering Decisions
- Mixed-type diffusion: continuous + discrete split
- Two-stage training: temporal backbone first, diffusion on residuals
- Switchable backbone: GRU vs Transformer encoder for the diffusion model
- Positional + time embeddings for stability
- Optional inverse-variance weighting for continuous loss
- Log1p transforms for heavy-tailed signals
---
## 12. Code Map (Key Files)
- Core model: `example/hybrid_diffusion.py`
- Training: `example/train.py`
- Temporal GRU: `example/hybrid_diffusion.py` (`TemporalGRUGenerator`)
- Data prep: `example/prepare_data.py`
- Data utilities: `example/data_utils.py`
- Sampling: `example/sample.py`
- Export: `example/export_samples.py`
- Evaluation: `example/evaluate_generated.py`
- Pipeline: `example/run_all.py`
- Config: `example/config.json`
---
## 13. Known Issues / Current Limitations
- KS may remain high → continuous distribution mismatch
- Lag-1 may fluctuate → distribution vs temporal trade-off
- Continuous loss may dominate → needs careful weighting
- Transformer backbone may change stability; needs systematic comparison
---
## 14. Suggested Next Steps
- Compare GRU vs Transformer backbone using `run_compare.py`
- Explore **v-prediction** for the continuous branch
- Strengthen discrete diffusion (e.g., D3PM-style transitions)
---
## 15. Summary
This project implements a **two-stage hybrid diffusion model** for ICS feature sequences: a GRU-based temporal backbone first models sequence trends, then diffusion learns residual corrections. The pipeline covers data prep, two-stage training, sampling, export, and evaluation. The main research challenge remains balancing **distributional fidelity (KS)** against **temporal consistency (lag-1)**.