From 8db286792e5fee480756245fa9e24f9dfb6de6ec Mon Sep 17 00:00:00 2001
From: MingzheYang
Date: Wed, 28 Jan 2026 22:24:50 +0800
Subject: [PATCH] Rewrite report with full project documentation

---
 report.md | 524 +++++++++++++++++++++++++-----------------------------
 1 file changed, 247 insertions(+), 277 deletions(-)

diff --git a/report.md b/report.md
index 120090c..08f2846 100644
--- a/report.md
+++ b/report.md
@@ -1,338 +1,308 @@
-# Hybrid Diffusion for ICS Traffic (HAI 21.03) — Project Report

# mask-ddpm Project Report (Detailed)

This report is a **complete, beginner-friendly** description of the current implementation, as of the latest code in this repo. It explains **what the project does**, **how data flows**, **what each file is for**, and **why the architecture is designed this way**.

---

## 0. TL;DR / 一句话概览

We generate multivariate ICS time-series by **(1) learning the temporal trend with a GRU** and **(2) learning residuals with a hybrid diffusion model** (continuous DDPM + discrete masked diffusion). We then evaluate with **tie-aware KS** and run **type-aware postprocessing** for diagnostic KS reduction.

---

## 1. Project Goal / 项目目标

-Build a **hybrid diffusion-based generator** for ICS traffic features, focusing on **mixed continuous + discrete** feature sequences. The output is **feature-level sequences**, not raw packets. The generator should preserve:
-- **Distributional fidelity** (continuous ranges + discrete frequencies)
-- **Temporal consistency** (time correlation and sequence structure)
-- **Field/logic consistency** for discrete protocol-like columns

We want synthetic ICS sequences that are:
1) **Distribution-aligned** (per-feature CDF matches real data → low KS)
2) **Temporally consistent** (lag-1 correlation and trend are realistic)
3) **Discrete-valid** (state tokens are legal and frequency-consistent)

This is hard because **distribution** and **temporal structure** often conflict in a single model.

---

-## 2. Data and Scope / 数据与范围
-**Dataset used in current implementation:** HAI 21.03 (CSV feature traces).
-**Data path (default in config):**
-- `dataset/hai/hai-21.03/train*.csv.gz`
-**Feature split (fixed schema):** `example/feature_split.json`
-- Continuous features: sensor/process values
-- Discrete features: binary/low-cardinality status/flag fields
-- `time` is excluded from modeling

## 2. Data & Feature Schema / 数据与特征结构

**Input data**: HAI CSV files (compressed) in `dataset/hai/hai-21.03/`.

**Feature split**: `example/feature_split.json` (consumed as in the sketch below)
- `continuous`: real-valued sensors/actuators
- `discrete`: state tokens / modes
- `time_column`: time index (not trained)
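As a concrete reading of this schema, the split file can be consumed like the minimal sketch below. This snippet is illustrative, not a file in the repo; it assumes the JSON keys match the bullet names above and that pandas is available:

```python
import glob
import json

import pandas as pd

# Fixed feature schema; keys assumed to mirror the bullets above.
with open("example/feature_split.json") as f:
    split = json.load(f)

# Default data path from the config: dataset/hai/hai-21.03/train*.csv.gz
frames = [
    pd.read_csv(path, compression="gzip")
    for path in sorted(glob.glob("dataset/hai/hai-21.03/train*.csv.gz"))
]
df = pd.concat(frames, ignore_index=True)

x_cont = df[split["continuous"]].to_numpy(dtype="float32")  # sensors/actuators
x_disc = df[split["discrete"]].astype(str)                  # state tokens/modes
# df[split["time_column"]] is kept only as an index and is never trained on.
```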
---

-## 3. End-to-End Pipeline / 端到端流程
-One command pipeline:
-```
-python example/run_all.py --device cuda
-```
-Full pipeline + diagnostics:
-```
-python example/run_all_full.py --device cuda
-```
-Pipeline stages:
-1) **Prepare data** (`example/prepare_data.py`)
-2) **Train temporal backbone** (`example/train.py`, stage 1)
-3) **Train diffusion on residuals** (`example/train.py`, stage 2)
-4) **Generate samples** (`example/export_samples.py`)
-5) **Evaluate** (`example/evaluate_generated.py`)
-The one-command pipeline maps to: data preparation → temporal-backbone training → residual-diffusion training → sample export → evaluation.

## 3. Preprocessing / 预处理

File: `example/prepare_data.py`

### Continuous features
- Mean/std statistics
- Quantile table (if `use_quantile_transform=true`)
- Optional transforms (log1p etc.)
- Output: `example/results/cont_stats.json`

### Discrete features
- Token vocabulary built from the data
- Output: `example/results/disc_vocab.json`

File: `example/data_utils.py` provides
- Normalization / inverse
- Quantile transform / inverse
- Post-calibration helpers

---

-## 4. Technical Architecture / 技术架构
-### 4.1 Hybrid Diffusion Model (Core) / 混合扩散模型(核心)
-Defined in `example/hybrid_diffusion.py`.
-**Inputs:**
-- Continuous projection
-- Discrete embeddings
-- Time embedding (sinusoidal)
-- Positional embedding (sequence index)
-- Optional condition embedding (`file_id`)
-**Backbone (configurable):**
-- GRU (sequence modeling)
-- Transformer encoder (self-attention)
-- Post LayerNorm + residual MLP
-**Current default config (latest):**
-- Backbone: Transformer
-- Sequence length: 96
-- Batch size: 16
-**Outputs:**
-- Continuous head: predicts the target (`eps` or `x0`)
-- Discrete heads: logits per discrete column
-**Continuous branch:** Gaussian diffusion
-**Discrete branch:** Mask diffusion

## 4. Architecture / 模型结构

### 4.1 Stage-1 Temporal GRU (Trend)
File: `example/hybrid_diffusion.py`
- Class: `TemporalGRUGenerator`
- Input: continuous sequence
- Output: **trend sequence** (teacher-forced)
- Purpose: capture temporal structure

### 4.2 Stage-2 Hybrid Diffusion (Residual)
File: `example/hybrid_diffusion.py`

**Continuous branch**
- Gaussian DDPM
- Predicts the **residual** (or its noise)

**Discrete branch**
- Mask diffusion (masked tokens)
- Classifier head per discrete column

**Backbone**
- The current config uses a **Transformer encoder** (`backbone_type=transformer`)
- GRU is still supported as an option

**Conditioning**
- File-id conditioning (`use_condition=true`, `condition_type=file_id`)
- Type-1 (setpoint/demand) signals can be passed as a **continuous condition** (`cond_cont`)

---

-### 4.2 Stage-1 Temporal Model (GRU) / 第一阶段时序模型(GRU)
-A separate GRU models the **trend backbone** of continuous features. It is trained first, using teacher forcing to predict the next step.
-Trend definition:
-```
-trend = GRU(x)
-residual = x - trend
-```
-**Two-stage training:** temporal GRU first, then diffusion on residuals.

## 5. Training Flow / 训练流程
File: `example/train.py`

### 5.1 Stage-1 temporal training
- Uses continuous features (excluding Type1/Type5)
- A teacher-forced GRU predicts the next step
- Loss: **MSE**
- Output: `temporal.pt`

### 5.2 Stage-2 diffusion training
- Compute the residual: `x_resid = x_cont - trend`
- Sample a time step `t`
- Add noise to the continuous part; mask tokens in the discrete part (see the sketch below)
- The model predicts:
  - **eps_pred** (or **x0_pred**, depending on `cont_target`) for the continuous residual
  - logits for the discrete tokens
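The corruption step in 5.2 is compact enough to state in code. A minimal sketch of both forward processes, assuming a precomputed `alpha_bar` table of cumulative noise levels and a reserved mask-token id; these names, and the generic linear beta schedule in the demo, are illustrative rather than the repo's API:

```python
import math

import torch

def q_sample_residual(r0, t, alpha_bar):
    """Noise residuals: r_t = sqrt(a_bar_t) * r0 + sqrt(1 - a_bar_t) * eps."""
    a_bar = alpha_bar[t].view(-1, 1, 1)          # (batch, 1, 1)
    eps = torch.randn_like(r0)
    r_t = a_bar.sqrt() * r0 + (1.0 - a_bar).sqrt() * eps
    return r_t, eps                              # eps is the target when cont_target="eps"

def mask_tokens(tokens, t, T, mask_id):
    """Mask diffusion with the cosine schedule p(t) = 0.5 * (1 - cos(pi * t / T))."""
    p = 0.5 * (1.0 - torch.cos(math.pi * t.float() / T)).view(-1, 1)
    masked = torch.rand(tokens.shape) < p        # which positions to corrupt
    corrupted = tokens.masked_fill(masked, mask_id)
    return corrupted, masked                     # cross-entropy is applied on `masked` only

# Shapes: residuals (batch, seq_len, n_cont); tokens (batch, seq_len) per discrete column.
T = 600                                          # timesteps from the current config
betas = torch.linspace(1e-4, 0.02, T)            # generic linear schedule, for illustration
alpha_bar = torch.cumprod(1.0 - betas, dim=0)
r_t, eps = q_sample_residual(torch.randn(16, 96, 32), torch.randint(0, T, (16,)), alpha_bar)
tok_t, m = mask_tokens(torch.randint(0, 5, (16, 96)), torch.randint(0, T, (16,)), T, mask_id=5)
```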
-### 4.3 Feature-Type Aware Strategy / 特征类型分治方案
-Based on HAI feature semantics and the observed KS outliers, we classify the problematic features into six types and plan separate modeling paths:
-1) **Type 1: Exogenous setpoints / demands** (schedule-driven, piecewise-constant)
-   Examples: P1_B4002, P2_MSD, P4_HT_LD
-   Strategy: program generator (HSMM / change-point), or sample from a program library; condition the diffusion on these.
-2) **Type 2: Controller outputs** (policy-like, saturation / rate limits)
-   Example: P1_B4005
-   Strategy: small controller emulator (PID/NARX) with clamp + rate limit.
-3) **Type 3: Spiky actuators** (few operating points + long dwell)
-   Examples: P1_PCV02Z, P1_FCV02Z
-   Strategy: spike-and-slab + dwell-time modeling, or command-driven actuator dynamics.
-4) **Type 4: Quantized / digital-as-continuous**
-   Examples: P4_ST_PT01, P4_ST_TT01
-   Strategy: generate a latent continuous signal and quantize, or treat as ordinal discrete diffusion.
-5) **Type 5: Derived conversions**
-   Examples: *FT* → *FTZ*
-   Strategy: generate the base variable and derive the conversions deterministically.
-6) **Type 6: Aux / vibration / narrow-band**
-   Examples: P2_24Vdc, P2_HILout
-   Strategy: AR/ARMA or regime-conditioned narrow-band models.
-### 4.4 Module Boundaries / 模块边界
-- **Program generator** outputs Type-1 variables (setpoints/demands).
-- **Controller/actuator modules** output Type-2/3 variables conditioned on Type-1.
-- **Diffusion** generates the remaining continuous PVs + discrete features.
-- **Post-processing** reconstructs Type-5 derived tags and applies calibration.

### Loss design
- Continuous loss: MSE on eps or x0 (`cont_target`)
- Optional weighting: inverse variance (`cont_loss_weighting=inv_std`)
- Optional SNR weighting (`snr_weighted_loss`)
- Optional quantile loss (aligns the residual distribution)
- Optional residual mean/std loss
- Discrete loss: cross-entropy on masked tokens
- Total: `loss = λ * loss_cont + (1 − λ) * loss_disc`

---

-## 5. Diffusion Formulations / 扩散形式
-### 5.1 Continuous Diffusion / 连续扩散
-Forward process on residuals:
-```
-r_t = sqrt(a_bar_t) * r + sqrt(1 - a_bar_t) * eps
-```
-Targets supported:
-- **eps prediction**
-- **x0 prediction** (default)
-Current config:
-```
-"cont_target": "x0"
-```
-### 5.2 Discrete Diffusion / 离散扩散
-Mask diffusion with a cosine schedule:
-```
-p(t) = 0.5 * (1 - cos(pi * t / T))
-```
-Cross-entropy is computed on the masked positions only.

## 6. Sampling & Export / 采样与导出
File: `example/export_samples.py`

Steps:
1) Initialize the continuous part with noise
2) Initialize the discrete part with masks
3) Run the reverse diffusion loop from `t = T..0`
4) Add the trend back (if the temporal stage is enabled)
5) Apply inverse transforms (quantile → raw)
6) Clip/bound if configured
7) Merge back the Type1 (conditioning) and Type5 (derived) columns
8) Write `generated.csv`

---

-## 6. Loss Design / 损失设计
-Total loss:
-```
-L = λ * L_cont + (1 − λ) * L_disc
-```
-### 6.1 Continuous Loss / 连续损失
-- `eps` target: MSE(eps_pred, eps)
-- `x0` target: MSE(x0_pred, x0)
-- Optional inverse-variance weighting: `cont_loss_weighting = "inv_std"`
-- Optional **SNR-weighted loss**: reweights the MSE by SNR to stabilize diffusion training
-### 6.2 Discrete Loss / 离散损失
-Cross-entropy on masked positions only.
-### 6.3 Temporal Loss / 时序损失
-Stage-1 GRU predicts the next step:
-```
-L_temporal = MSE(pred_next, x[:, 1:])
-```
-### 6.4 Residual Alignment Losses / 残差对齐损失
-- **Quantile loss** on residuals to align the distribution tails.
-- **Residual mean/std penalty** to reduce drift and improve KS.

## 7. Evaluation / 评估
File: `example/evaluate_generated.py`

### Metrics
- **KS (tie-aware)** for continuous features
- **JSD** for discrete features
- **Lag-1 correlation** for temporal consistency
- Quantile diffs and mean/std errors

### Important
- The reference path supports **glob** patterns and aggregates **all matched files**
- The KS implementation is **tie-aware** (correct for spiky/quantized data)

Outputs:
- `example/results/eval.json`

---

-## 7. Data Processing / 数据处理
-Defined in `example/data_utils.py` + `example/prepare_data.py`.
-Key steps:
-- Streaming mean/std/min/max + int-like detection
-- Optional **log1p transform** for heavy-tailed continuous columns
-- Optional **quantile transform** (TabDDPM-style) for continuous columns (skips the extra standardization)
-- **Full quantile stats** (full_stats) for stable calibration
-- Optional **post-hoc quantile calibration** to align 1D CDFs after sampling (sketched below)
-- Discrete vocab + most frequent token
-- Windowed batching with a **shuffle buffer**
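The post-hoc quantile calibration mentioned in both the loss design and the data-processing steps reduces to a 1D CDF match: send each generated value to its own empirical quantile, then read off the real distribution's value at that quantile. A minimal numpy sketch; the repo's `full_stats`-based implementation may differ in interpolation details:

```python
import numpy as np

def quantile_calibrate(gen: np.ndarray, real: np.ndarray) -> np.ndarray:
    """Post-hoc 1D CDF alignment: gen value -> own quantile -> real quantile."""
    gen_sorted = np.sort(gen)
    q = np.searchsorted(gen_sorted, gen, side="right") / gen.size
    return np.quantile(real, np.clip(q, 0.0, 1.0))

rng = np.random.default_rng(0)
real = rng.gamma(2.0, 2.0, size=5000)       # heavy-tailed "real" column
gen = rng.normal(4.0, 2.0, size=5000)       # distribution-shifted "generated" column
cal = quantile_calibrate(gen, real)         # per-feature KS(cal, real) drops to ~0
```

Note that this fixes only the marginals; it can distort temporal structure, which is exactly the distribution-vs-temporal trade-off discussed throughout this report.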
---

## 8. Diagnostics / 诊断工具

- `example/diagnose_ks.py`: CDF plots and per-feature KS
- `example/ranked_ks.py`: ranked KS + per-feature contribution
- `example/filtered_metrics.py`: filtered KS excluding outliers
- `example/program_stats.py`: Type-1 stats
- `example/controller_stats.py`: Type-2 stats
- `example/actuator_stats.py`: Type-3 stats
- `example/pv_stats.py`: Type-4 stats
- `example/aux_stats.py`: Type-6 stats

---

## 9. Type-Aware Modeling / 类型化分离

To reduce an average KS that is dominated by a few variables, the project uses **type categories** defined in the config:
- **Type1**: setpoints / demands (schedule-driven)
- **Type2**: controller outputs
- **Type3**: actuator positions
- **Type4**: PV sensors
- **Type5**: derived tags
- **Type6**: auxiliary / coupling

### Current implementation (diagnostic KS baseline)
File: `example/postprocess_types.py`
- Type1/2/3/5/6 → **empirical resampling** from the real distribution
- Type4 → keep the diffusion output

This is **not** the final model, but it provides a **best-case KS bound** for diagnosis.

Outputs:
- `example/results/generated_post.csv`
- `example/results/eval_post.json`

---

## 10. Pipeline / 一键流程

File: `example/run_all.py`

Default pipeline:
1) prepare_data
2) train
3) export_samples
4) evaluate_generated (generated.csv)
5) postprocess_types (generated_post.csv)
6) evaluate_generated (eval_post.json)
7) diagnostics scripts

**Linux**:
```bash
python example/run_all.py --device cuda --config example/config.json
```

**Windows (PowerShell)**:
```powershell
# run from inside example/ (note the relative paths)
python run_all.py --device cuda --config config.json
```

---

## 11. Current Configuration (Key Defaults)
From `example/config.json`:
- backbone_type: **transformer**
- timesteps: 600
- seq_len: 96
- batch_size: 16
- cont_target: `x0`
- cont_loss_weighting: `inv_std`
- snr_weighted_loss: true
- quantile_loss_weight: 0.2
- use_quantile_transform: true
- cont_post_calibrate: true
- use_temporal_stage1: true

---

## 12. What's Actually Trained vs. What's Post-Processed

**Trained**
- Temporal GRU (trend)
- Diffusion residual model (continuous + discrete)

**Post-processed (KS-only)**
- Type1/2/3/5/6 replaced by empirical resampling

This distinction matters: the postprocessing improves KS but **may break joint realism**.

---

## 13. Why It's Still Hard / 当前难点

- Type1/2/3 signals are **event-driven** and **piecewise constant**
- Diffusion (Gaussian DDPM + MSE) tends to smooth/blur them
- The temporal and distribution objectives pull in opposite directions

---

## 14. Where To Improve Next / 下一步方向

1) Replace the KS-only postprocess with **conditional generators** (the baseline being replaced is sketched below):
   - Type1: program generator (HMM / schedule)
   - Type2: controller emulator (PID-like)
   - Type3: actuator dynamics (dwell + rate + saturation)

2) Add regime conditioning for Type4 PVs

3) Joint realism checks (cross-feature correlation)
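For reference, the KS-only baseline from sections 9 and 12 that these conditional generators would replace amounts to independent redraws from the real marginal. A hypothetical sketch; the column names and the single reference file are illustrative, and the real rules live in `example/postprocess_types.py`:

```python
import numpy as np
import pandas as pd

def resample_marginal(real: pd.Series, n: int, rng: np.random.Generator) -> np.ndarray:
    """Draw i.i.d. values from the real empirical distribution.
    Pushes the per-feature KS toward 0 while ignoring temporal/joint structure."""
    return rng.choice(real.to_numpy(), size=n, replace=True)

rng = np.random.default_rng(0)
real_df = pd.read_csv("dataset/hai/hai-21.03/train1.csv.gz")   # one reference file (name assumed)
gen_df = pd.read_csv("example/results/generated.csv")

resample_cols = ["P1_B4002", "P1_B4005", "P1_PCV02Z"]          # illustrative Type1/2/3 columns
for col in resample_cols:
    gen_df[col] = resample_marginal(real_df[col], len(gen_df), rng)

gen_df.to_csv("example/results/generated_post.csv", index=False)
```

This is also why `eval_post.json` is diagnostic only: the marginals improve, while lag-1 and cross-feature structure for the resampled columns are destroyed.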
---

-## 8. Sampling & Export / 采样与导出
-Defined in:
-- `example/sample.py`
-- `example/export_samples.py`
-Export process:
-- Generate the trend with the temporal GRU
-- Let the diffusion model generate the residuals
-- Output: `trend + residual`
-- De-normalize the continuous values
-- Apply the inverse quantile transform (if enabled; no extra de-standardization)
-- Apply optional post-hoc quantile calibration (if enabled)
-- Bound values to the observed min/max (clamp / sigmoid / soft_tanh / none)
-- Restore discrete tokens from the vocab
-- Write to CSV

---

-## 9. Evaluation / 评估指标
-Defined in `example/evaluate_generated.py`.
-Metrics (with a reference dataset):
-- **KS statistic** (continuous distributions; tie-aware, sketched below)
-- **Quantile diffs** (q05/q25/q50/q75/q95)
-- **Lag-1 correlation diff** (temporal structure)
-- **Discrete JSD** over vocab frequencies
-- **Invalid token counts**
-**Metric summary & comparison script:** `example/summary_metrics.py`
-- Reports avg_ks / avg_jsd / avg_lag1_diff
-- Appends each run to `example/results/metrics_history.csv`
-- If a previous record exists, prints the delta (new vs. old)
-**Distribution diagnostics (per-feature KS/CDF):** `example/diagnose_ks.py`
-- Writes `example/results/ks_per_feature.csv` (KS for every continuous feature)
-- Writes `example/results/cdf_.svg` (real vs. generated CDFs)
-- Reports whether generated data piles up at the bounds (gen_frac_at_min / gen_frac_at_max)
-**Filtered KS (drops hard-to-learn features; diagnostics only):** `example/filtered_metrics.py`
-- Rule: features with a too-small std or a too-high KS are dropped automatically
-- Writes `example/results/filtered_metrics.json`
-- Diagnostic only; not a final metric
-**Ranked KS (feature contribution ranking):** `example/ranked_ks.py`
-- Writes `example/results/ranked_ks.csv`
-- Computes each feature's contribution to avg_ks, and the avg_ks after removing the top-N features
-**Program stats (setpoints/demands):** `example/program_stats.py`
-- Writes `example/results/program_stats.json`
-- Metrics: change count / dwell / step size (generated vs. real)
-**Controller stats (Type-2 control outputs):** `example/controller_stats.py`
-- Writes `example/results/controller_stats.json`
-- Metrics: saturation fraction / rate of change / median step size
-**Actuator stats (Type-3 actuators):** `example/actuator_stats.py`
-- Writes `example/results/actuator_stats.json`
-- Metrics: spike fraction / unique ratio / dwell
-**PV stats (Type-4 sensors):** `example/pv_stats.py`
-- Writes `example/results/pv_stats.json`
-- Metrics: q05/q50/q95 + tail ratio
-**Aux stats (Type-6 auxiliary signals):** `example/aux_stats.py`
-- Writes `example/results/aux_stats.json`
-- Metrics: mean / variance / lag-1
-**Type-based postprocess:** `example/postprocess_types.py`
-- Writes `example/results/generated_post.csv`
-- Rebuilds selected columns with Type-1/2/3/5/6 rules (no training required)
-- KS-only baseline: empirical resampling of Type1/2/3/5/6 (purely to lower KS; may break the joint distribution)
-**Evaluation protocol:** see `docs/evaluation.md`.
-Recent runs (Windows):
-- 2026-01-27 21:22:34 — avg_ks 0.4046 / avg_jsd 0.0376 / avg_lag1_diff 0.1449
-Recent runs (WSL, diagnostic):
-- 2026-01-28 — KS-only postprocess baseline (full-reference, tie-aware KS): overall_avg_ks 0.2851

---

-## 10. Automation / 自动化
-`example/run_all.py` runs prepare/train/export/eval + postprocess + diagnostics in one command.
-`example/run_all_full.py` is the legacy full runner.
-`example/run_compare.py` can run a baseline vs. a temporal config and compute the metric deltas.

---

-## 11. Key Engineering Decisions / 关键工程决策
-- Mixed-type diffusion: continuous + discrete split
-- Two-stage training: temporal backbone first, diffusion on residuals
-- Switchable backbone: GRU vs. Transformer encoder for the diffusion model
-- Positional + time embeddings for stability
-- Optional inverse-variance weighting for the continuous loss
-- Log1p transforms for heavy-tailed signals
-- Quantile transform + post-hoc calibration to stabilize CDF alignment
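Both versions of the report stress that the KS used here is tie-aware. One standard tie-aware formulation evaluates the two right-continuous ECDFs on the pooled support, which stays correct for spiky/quantized columns; a compact sketch follows (the exact implementation in `example/evaluate_generated.py` may differ):

```python
import numpy as np

def ks_tie_aware(a: np.ndarray, b: np.ndarray) -> float:
    """Two-sample KS with ties handled via ECDFs on the pooled unique values."""
    a, b = np.sort(a), np.sort(b)
    grid = np.unique(np.concatenate([a, b]))
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.abs(cdf_a - cdf_b).max())

# Heavily tied example: an actuator that sits at one operating point 90% of the time.
real = np.concatenate([np.zeros(900), np.full(100, 100.0)])
gen = np.zeros(1000)
print(ks_tie_aware(real, gen))   # 0.1, the true CDF gap at the spike
```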
---

## 15. Key Files (Complete but Pruned)

```
mask-ddpm/
  report.md
  docs/
    README.md
    architecture.md
    evaluation.md
    decisions.md
    experiments.md
    ideas.md
  example/
    config.json
    config_no_temporal.json
    config_temporal_strong.json
    feature_split.json
    data_utils.py
    prepare_data.py
    hybrid_diffusion.py
    train.py
    sample.py
    export_samples.py
    evaluate_generated.py
    run_all.py
    run_compare.py
    diagnose_ks.py
    filtered_metrics.py
    ranked_ks.py
    program_stats.py
    controller_stats.py
    actuator_stats.py
    pv_stats.py
    aux_stats.py
    postprocess_types.py
    results/
      generated.csv
      generated_post.csv
      eval.json
      eval_post.json
      cont_stats.json
      disc_vocab.json
      metrics_history.csv
```

---

## 16. Summary / 总结

The current project is a **hybrid diffusion system** with a **two-stage temporal + residual design**, built to balance **distribution alignment** and **temporal realism**. The architecture is modular, with explicit type-aware diagnostics and postprocessing, and it supports both GRU and Transformer backbones. The remaining research challenge is to replace the KS-only postprocessing with **conditional, structurally consistent generators** for the Type1/2/3/5/6 features.

-## 12. Code Map (Key Files) / 代码索引
-- Core model: `example/hybrid_diffusion.py`
-- Training: `example/train.py`
-- Temporal GRU: `example/hybrid_diffusion.py` (`TemporalGRUGenerator`)
-- Data prep: `example/prepare_data.py`
-- Data utilities: `example/data_utils.py`
-- Sampling: `example/sample.py`
-- Export: `example/export_samples.py`
-- Evaluation: `example/evaluate_generated.py` (KS uses a tie-aware implementation and aggregates all reference files matched by the glob pattern)
-- Pipeline: `example/run_all.py`
-- Config: `example/config.json`

---

-## 13. Known Issues / Current Limitations / 已知问题
-- KS can remain high on a subset of features → per-feature diagnosis is required
-- Lag-1 may fluctuate → the distribution-vs-temporal trade-off
-- Discrete JSD can regress when continuous KS is prioritized
-- The Transformer backbone may change training stability; it needs a systematic comparison
-- Program/actuator features require specialized modeling beyond diffusion

---

-## 14. Suggested Next Steps / 下一步建议
-- Compare the GRU vs. Transformer backbone using `run_compare.py`
-- Explore **v-prediction** for the continuous branch
-- Strengthen the discrete diffusion (e.g., D3PM-style transitions)
-- Add targeted discrete calibration for high-JSD columns
-- Implement a program generator for Type-1 features and evaluate it with dwell/step metrics (see the sketch below)
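The Type-1 program generator suggested above can start from something as small as a dwell-and-jump process. The sketch below is a hypothetical starting point, not code that exists in the repo; in practice, the levels and dwell statistics would be fitted from `program_stats.json`:

```python
import numpy as np

def sample_program(levels, mean_dwell, n, rng):
    """Piecewise-constant setpoint/demand trace: hold a randomly chosen level
    for a random dwell time, then jump. A minimal stand-in for the proposed
    HSMM / change-point program generator."""
    out = np.empty(n)
    i = 0
    while i < n:
        dwell = max(1, int(rng.exponential(mean_dwell)))
        out[i:i + dwell] = rng.choice(levels)
        i += dwell
    return out

rng = np.random.default_rng(0)
trace = sample_program(levels=np.array([20.0, 35.0, 50.0]), mean_dwell=300, n=3600, rng=rng)
# Score it with the same dwell/step metrics as program_stats: change count, dwell, step size.
changes = np.flatnonzero(np.diff(trace) != 0)
print(len(changes), np.diff(trace)[changes])
```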
---

-## 15. Deliverables / 交付清单
-- Code: diffusion + temporal + diagnostics + pipeline scripts
-- Docs: report + decisions + experiments + architecture + evaluation protocol
-- Results: full metrics, filtered metrics, ranked KS, per-feature CDFs

---

-## 16. Summary / 总结
-This project implements a **two-stage hybrid diffusion model** for ICS feature sequences: a GRU-based temporal backbone first models the sequence trends, then diffusion learns residual corrections. The pipeline covers data prep, two-stage training, sampling, export, and evaluation. The main research challenge remains balancing **distributional fidelity (KS)** against **temporal consistency (lag-1)**.