Rewrite report with full project documentation

2026-01-28 22:24:50 +08:00
parent f3991cc91e
commit 8db286792e

report.md
# mask-ddpm Project Report (Detailed) — Hybrid Diffusion for ICS Traffic (HAI 21.03)

This report is a **complete, beginner-friendly** description of the current project implementation as of the latest code in this repo. It explains **what the project does**, **how data flows**, **what each file is for**, and **why the architecture is designed this way**.

---

## 0. TL;DR

We generate multivariate ICS time series by **(1) learning the temporal trend with a GRU** and **(2) learning residuals with a hybrid diffusion model** (continuous DDPM + discrete masked diffusion). We then evaluate with **tie-aware KS** and run **type-aware post-processing** for diagnostic KS reduction.
---
## 1. Project Goal

Build a **hybrid diffusion-based generator** for ICS traffic features, focusing on **mixed continuous + discrete** feature sequences. The output is **feature-level sequences**, not raw packets. The generated sequences should be:

1) **Distribution-aligned** — continuous ranges and discrete frequencies match real data (per-feature CDF → low KS)
2) **Temporally consistent** — lag-1 correlation and sequence trend are realistic
3) **Discrete-valid** — state tokens are legal, frequency-consistent, and preserve field/logic semantics

This is hard because the **distributional** and **temporal** objectives often conflict in a single model.
---

## 2. Data & Feature Schema

**Dataset used in the current implementation:** HAI 21.03 (compressed CSV feature traces), read from `dataset/hai/hai-21.03/` (default path in config: `dataset/hai/hai-21.03/train*.csv.gz`).

**Feature split (fixed schema):** `example/feature_split.json`
- `continuous`: real-valued sensor/process values
- `discrete`: binary/low-cardinality status/flag fields (state tokens / modes)
- `time_column`: the time index, excluded from modeling
---

## 3. Preprocessing

File: `example/prepare_data.py`

The end-to-end pipeline (data prep → temporal backbone training → residual diffusion training → sample export → evaluation) is driven by `example/run_all.py`; see Section 10.

### Continuous features
- Streaming mean/std/min/max statistics + int-like detection
- Quantile table (if `use_quantile_transform=true`; TabDDPM-style, skips extra standardization)
- Full quantile stats (`full_stats`) for stable calibration
- Optional transforms (e.g. `log1p` for heavy-tailed columns)
- Output: `example/results/cont_stats.json`

### Discrete features
- Token vocabulary built from the data (plus the most frequent token per column)
- Output: `example/results/disc_vocab.json`

File: `example/data_utils.py` contains
- Normalization / inverse normalization
- Quantile transform / inverse transform
- Post-calibration helpers
- Windowed batching with a **shuffle buffer**
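The quantile transform and its inverse can be sketched in a few lines of NumPy (a minimal version with a fixed quantile grid; the function names are illustrative, and the actual implementation in `example/data_utils.py` may differ):

```python
import numpy as np

def fit_quantile_table(x, n_quantiles=256):
    """Store the empirical quantiles of a 1-D feature."""
    probs = np.linspace(0.0, 1.0, n_quantiles)
    return probs, np.quantile(x, probs)

def to_uniform(x, probs, quantiles):
    """Map raw values to [0, 1] via the empirical CDF (monotone interpolation)."""
    return np.interp(x, quantiles, probs)

def from_uniform(u, probs, quantiles):
    """Inverse transform: map [0, 1] back to the raw value range."""
    return np.interp(u, probs, quantiles)

x = np.random.default_rng(0).lognormal(size=1000)  # heavy-tailed example
probs, q = fit_quantile_table(x)
u = to_uniform(x, probs, q)          # training happens in this tamed space
x_back = from_uniform(u, probs, q)   # sampling maps back to raw values
```

The point of the transform is that heavy-tailed raw values become approximately uniform, which is much easier for an MSE-trained diffusion model to match.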
---
## 4. Architecture

### 4.1 Stage-1 Temporal GRU (Trend)

File: `example/hybrid_diffusion.py`
- Class: `TemporalGRUGenerator`
- Input: continuous sequence
- Output: **trend sequence** (teacher-forced)
- Purpose: capture temporal structure
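A minimal sketch of this stage (hypothetical class name `TrendGRU` and layer sizes; the real `TemporalGRUGenerator` may differ in detail):

```python
import torch
import torch.nn as nn

class TrendGRU(nn.Module):
    """Teacher-forced next-step predictor; its output is used as the trend."""
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_features)

    def forward(self, x):                 # x: (batch, seq_len, n_features)
        h, _ = self.gru(x)
        return self.head(h)               # next-step prediction per position

model = TrendGRU(n_features=4)
x = torch.randn(2, 96, 4)                 # batch of continuous windows
pred_next = model(x)
loss = nn.functional.mse_loss(pred_next[:, :-1], x[:, 1:])  # teacher forcing
trend = pred_next.detach()
residual = x - trend                       # passed to the diffusion stage
```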
### 4.2 Stage-2 Hybrid Diffusion (Residual)

File: `example/hybrid_diffusion.py`

**Inputs:**
- Continuous projection
- Discrete embeddings
- Time embedding (sinusoidal)
- Positional embedding (sequence index)
- Optional condition embedding (`file_id`)

**Continuous branch**
- Gaussian DDPM on residuals
- Continuous head predicts the target (`eps` or `x0`)

**Discrete branch**
- Mask diffusion over tokens
- Classifier head (logits) per discrete column

**Backbone (configurable)**
- The current config uses a **Transformer encoder** (`backbone_type=transformer`); GRU is still supported as an option (defaults: sequence length 96, batch size 16)
- Post-LayerNorm + residual MLP

**Conditioning**
- File-id conditioning (`use_condition=true`, `condition_type=file_id`)
- Type-1 features (setpoint/demand) can be passed as a **continuous condition** (`cond_cont`)
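The sinusoidal time embedding is the standard DDPM/Transformer construction; a sketch with an illustrative dimension:

```python
import numpy as np

def sinusoidal_embedding(t, dim=16):
    """Embed integer diffusion steps t as [sin(t*w_k), cos(t*w_k)] pairs."""
    t = np.asarray(t, dtype=np.float64)[:, None]                  # (batch, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(dim // 2) / (dim // 2))
    angles = t * freqs[None, :]                                   # (batch, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

emb = sinusoidal_embedding(np.array([0, 1, 599]), dim=16)
```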
---

### 4.3 Feature-Type-Aware Strategy

Based on HAI feature semantics and observed KS outliers, we classify problematic features into six types and plan separate modeling paths:

1) **Type 1: Exogenous setpoints / demands** (schedule-driven, piecewise constant)
   Examples: P1_B4002, P2_MSD, P4_HT_LD
   Strategy: program generator (HSMM / change-point), or sample from a program library; condition diffusion on these.
2) **Type 2: Controller outputs** (policy-like, with saturation / rate limits)
   Example: P1_B4005
   Strategy: small controller emulator (PID/NARX) with clamp + rate limit.
3) **Type 3: Spiky actuators** (few operating points + long dwell times)
   Examples: P1_PCV02Z, P1_FCV02Z
   Strategy: spike-and-slab + dwell-time modeling, or command-driven actuator dynamics.
4) **Type 4: Quantized / digital-as-continuous**
   Examples: P4_ST_PT01, P4_ST_TT01
   Strategy: generate a latent continuous value then quantize, or treat as ordinal discrete diffusion.
5) **Type 5: Derived conversions**
   Examples: *FT* → *FTZ*
   Strategy: generate the base variable and derive conversions deterministically.
6) **Type 6: Aux / vibration / narrow-band**
   Examples: P2_24Vdc, P2_HILout
   Strategy: AR/ARMA or regime-conditioned narrow-band models.

### 4.4 Module Boundaries

- The **program generator** outputs Type-1 variables (setpoints/demands).
- The **controller/actuator modules** output Type-2/3 variables conditioned on Type-1.
- The **diffusion model** generates the remaining continuous PVs + discrete features.
- **Post-processing** reconstructs Type-5 derived tags and applies calibration.

---

## 5. Training Flow

File: `example/train.py`

**Two-stage training:** the temporal GRU is trained first; diffusion is then trained on the residuals.

### 5.1 Stage-1 temporal training

- Use continuous features (excluding Type-1/Type-5)
- A teacher-forced GRU predicts the next step
- Loss: **MSE**, `L_temporal = MSE(pred_next, x[:, 1:])`
- Output: `temporal.pt`

Trend definition:

```
trend = GRU(x)
residual = x - trend
```

### 5.2 Stage-2 diffusion training

- Compute the residual: `x_resid = x_cont - trend`
- Sample a time step `t`
- Add Gaussian noise to the continuous residual; mask the discrete tokens
- The model predicts:
  - **eps_pred** (or **x0_pred**) for the continuous residual
  - logits for the discrete tokens

### 5.3 Loss design

- Continuous loss: MSE on eps or x0 (`cont_target`)
- Optional inverse-variance weighting (`cont_loss_weighting=inv_std`)
- Optional **SNR-weighted loss**: reweights the MSE by SNR to stabilize diffusion training
- Optional quantile loss on residuals (aligns the residual distribution, including tails)
- Optional residual mean/std penalty (reduces drift and improves KS)
- Discrete loss: cross-entropy on masked positions only
- Total: `loss = λ * loss_cont + (1 - λ) * loss_disc`
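Both forward corruptions used during training can be sketched in NumPy (illustrative linear beta schedule and mask id; the repo's actual schedules may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 600
betas = np.linspace(1e-4, 0.02, T)
a_bar = np.cumprod(1.0 - betas)                 # cumulative alpha-bar

def noise_continuous(r, t):
    """Forward process: r_t = sqrt(a_bar_t)*r + sqrt(1 - a_bar_t)*eps."""
    eps = rng.standard_normal(r.shape)
    r_t = np.sqrt(a_bar[t]) * r + np.sqrt(1.0 - a_bar[t]) * eps
    return r_t, eps

def mask_discrete(tokens, t, mask_id=-1):
    """Cosine mask schedule: p(t) = 0.5 * (1 - cos(pi * t / T))."""
    p = 0.5 * (1.0 - np.cos(np.pi * t / T))
    mask = rng.random(tokens.shape) < p
    return np.where(mask, mask_id, tokens), mask   # loss uses masked positions only

r = rng.standard_normal((96, 8))                   # residual window
r_t, eps = noise_continuous(r, t=300)
tokens = rng.integers(0, 4, size=(96, 3))
noisy_tokens, mask = mask_discrete(tokens, t=300)
```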
---

### 5.4 Diffusion formulations

**Continuous:** forward process on the residuals:

```
r_t = sqrt(a_bar_t) * r + sqrt(1 - a_bar_t) * eps
```

Supported targets: **eps prediction** and **x0 prediction** (current config: `"cont_target": "x0"`).

**Discrete:** mask diffusion with a cosine schedule

```
p(t) = 0.5 * (1 - cos(pi * t / T))
```

and mask-only cross-entropy computed on the masked positions.

---

## 6. Sampling & Export

Files: `example/sample.py`, `example/export_samples.py`

Steps:
1) Initialize the continuous channels with noise
2) Initialize the discrete channels with masks
3) Run the reverse diffusion loop from `t = T..0`
4) Add the trend back (if the temporal stage is enabled)
5) Apply the inverse transforms (de-normalize; inverse quantile transform → raw values, with no extra de-standardization)
6) Optional post-hoc quantile calibration (if enabled)
7) Bound to the observed min/max (clamp / sigmoid / soft_tanh / none)
8) Restore discrete tokens from the vocabulary
9) Merge back Type-1 (conditioning) and Type-5 (derived) columns
10) Write `generated.csv`
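A minimal reverse-loop sketch for the continuous branch (NumPy, x0-parameterization with a stand-in predictor; the actual sampler in `example/sample.py` may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 600
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
a_bar = np.cumprod(alphas)

def fake_x0_predictor(x_t, t):
    """Stand-in for the trained network (cont_target = "x0")."""
    return np.zeros_like(x_t)

x = rng.standard_normal((96, 8))            # step 1: start from noise
for t in range(T - 1, -1, -1):              # step 3: reverse loop t = T..0
    x0_hat = fake_x0_predictor(x, t)
    a_bar_prev = a_bar[t - 1] if t > 0 else 1.0
    # mean of the DDPM posterior q(x_{t-1} | x_t, x0_hat)
    coef_x0 = np.sqrt(a_bar_prev) * betas[t] / (1.0 - a_bar[t])
    coef_xt = np.sqrt(alphas[t]) * (1.0 - a_bar_prev) / (1.0 - a_bar[t])
    mean = coef_x0 * x0_hat + coef_xt * x
    if t > 0:
        var = betas[t] * (1.0 - a_bar_prev) / (1.0 - a_bar[t])
        x = mean + np.sqrt(var) * rng.standard_normal(x.shape)
    else:
        x = mean
residual = x                                # step 4 would add the GRU trend back
```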
---
## 7. Evaluation

File: `example/evaluate_generated.py`

### Metrics
- **KS (tie-aware)** for continuous distributions
- **Quantile diffs** (q05/q25/q50/q75/q95) and mean/std errors
- **Lag-1 correlation diff** for temporal consistency
- **JSD** over discrete vocab frequencies, plus invalid-token counts

### Important
- The reference path supports **glob** patterns and aggregates **all matched files**
- The KS implementation is **tie-aware** (correct for spiky/quantized data)

Outputs:
- `example/results/eval.json`
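A tie-aware two-sample KS can be computed by evaluating both empirical CDFs on the pooled unique values, which stays correct under heavy ties (a sketch; the repo's implementation may differ in detail):

```python
import numpy as np

def ks_tie_aware(real, gen):
    """Two-sample KS statistic that stays correct when values repeat (ties)."""
    real = np.sort(np.asarray(real, dtype=float))
    gen = np.sort(np.asarray(gen, dtype=float))
    grid = np.union1d(real, gen)                        # pooled unique values
    # right-continuous ECDFs: P(X <= v) at every pooled value
    cdf_real = np.searchsorted(real, grid, side="right") / real.size
    cdf_gen = np.searchsorted(gen, grid, side="right") / gen.size
    return np.max(np.abs(cdf_real - cdf_gen))

# Spiky/quantized example: most of the mass sits on a few repeated values
real = np.array([0, 0, 0, 0, 1, 1, 2, 2])
gen = np.array([0, 0, 1, 1, 1, 1, 2, 2])
ks = ks_tie_aware(real, gen)   # 0.25: the CDFs differ most at value 0
```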
---
## 8. Diagnostics

- `example/diagnose_ks.py`: per-feature KS and CDF plots
  - writes `example/results/ks_per_feature.csv` (KS for each continuous feature) and `example/results/cdf_<feature>.svg` (real vs. generated CDF)
  - also reports boundary pile-up in the generated data (gen_frac_at_min / gen_frac_at_max)
- `example/ranked_ks.py`: ranked KS + per-feature contribution
  - writes `example/results/ranked_ks.csv`: each feature's contribution to avg_ks, and avg_ks after removing the top-N features
- `example/filtered_metrics.py`: filtered KS excluding outliers (diagnostic only, never the final metric)
  - rule: automatically drop features with very small std or very high KS; writes `example/results/filtered_metrics.json`
- `example/program_stats.py`: Type-1 stats → `program_stats.json` (change count / dwell / step size, generated vs. real)
- `example/controller_stats.py`: Type-2 stats → `controller_stats.json` (saturation ratio / change rate / median step size)
- `example/actuator_stats.py`: Type-3 stats → `actuator_stats.json` (peak share / unique ratio / dwell)
- `example/pv_stats.py`: Type-4 stats → `pv_stats.json` (q05/q50/q95 + tail ratio)
- `example/aux_stats.py`: Type-6 stats → `aux_stats.json` (mean / variance / lag-1)
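Two of the statistics these scripts report, sketched in NumPy (illustrative helper names):

```python
import numpy as np

def lag1_corr(x):
    """Lag-1 autocorrelation of a 1-D series."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[:-1], x[1:])[0, 1]

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2) between token-frequency vectors."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

real = np.sin(np.linspace(0, 20, 500))                 # smooth, high lag-1
gen = np.random.default_rng(0).standard_normal(500)    # white noise, ~0 lag-1
lag1_diff = abs(lag1_corr(real) - lag1_corr(gen))
freq_jsd = jsd([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])       # discrete frequencies
```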
---
## 9. Type-Aware Modeling

To reduce a KS average dominated by a few variables, the project uses **Type categories** defined in the config:

- **Type 1**: setpoints / demand (schedule-driven)
- **Type 2**: controller outputs
- **Type 3**: actuator positions
- **Type 4**: PV sensors
- **Type 5**: derived tags
- **Type 6**: auxiliary / coupling

### Current implementation (diagnostic KS baseline)

File: `example/postprocess_types.py`
- Type 1/2/3/5/6 → **empirical resampling** from the real distribution (rule-based, no training)
- Type 4 → keep the diffusion output

This is **not** the final model: it only pushes KS down and may break the joint distribution, but it provides a **KS upper bound** for diagnosis.

Outputs:
- `example/results/generated_post.csv`
- `example/results/eval_post.json`
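One simple form of empirical resampling is rank-preserving quantile mapping: keep the generated ordering but force the real marginal, driving per-feature KS toward zero. A sketch of that idea (the actual `postprocess_types.py` may implement resampling differently):

```python
import numpy as np

def resample_to_real(real_col, gen_col):
    """Replace generated values by real quantiles at the generated ranks.

    Preserves the generated column's ordering while forcing its marginal
    distribution onto the real one, so per-feature KS collapses.
    """
    ranks = np.argsort(np.argsort(gen_col))      # rank of each generated value
    u = (ranks + 0.5) / len(gen_col)             # ranks -> (0, 1)
    return np.quantile(real_col, u)

rng = np.random.default_rng(1)
real = rng.lognormal(size=2000)                  # real marginal
gen = rng.standard_normal(200)                   # generated, wrong marginal
fixed = resample_to_real(real, gen)
```

This is exactly why the post-processed KS is only a diagnostic bound: the marginal is fixed, but cross-feature structure is untouched (or broken).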
---
## 10. Pipeline

File: `example/run_all.py` — runs prepare/train/export/eval + postprocess + diagnostics in one command. `example/run_all_full.py` is the legacy full runner; `example/run_compare.py` runs a baseline vs. a temporal config and computes metric deltas.

Default pipeline:
1) prepare_data
2) train
3) export_samples
4) evaluate_generated (generated.csv)
5) postprocess_types (generated_post.csv)
6) evaluate_generated (eval_post.json)
7) diagnostics scripts

**Linux**:
```bash
python example/run_all.py --device cuda --config example/config.json
```

**Windows (PowerShell)**:
```powershell
python run_all.py --device cuda --config config.json
```
---

## 11. Current Configuration (Key Defaults)

From `example/config.json`:
- backbone_type: **transformer**
- timesteps: 600
- seq_len: 96
- batch_size: 16
- cont_target: `x0`
- cont_loss_weighting: `inv_std`
- snr_weighted_loss: true
- quantile_loss_weight: 0.2
- use_quantile_transform: true
- cont_post_calibrate: true
- use_temporal_stage1: true
---

## 12. What's Actually Trained vs. What's Post-Processed

**Trained**
- The temporal GRU (trend)
- The diffusion residual model (continuous + discrete)

**Post-processed (KS-only)**
- Type-1/2/3/5/6 columns, replaced by empirical resampling

This matters: the post-process improves KS but **may break joint realism**.
---

## 13. Why It's Still Hard

- Type-1/2/3 features are **event-driven** and **piecewise constant**
- Diffusion (Gaussian DDPM + MSE) tends to smooth/blur such signals
- The temporal and distributional objectives pull in opposite directions

---
## 14. Where to Improve Next

1) Replace the KS-only post-process with **conditional generators**:
   - Type 1: program generator (HMM / schedule)
   - Type 2: controller emulator (PID-like)
   - Type 3: actuator dynamics (dwell + rate limits + saturation)
2) Add regime conditioning for Type-4 PVs
3) Add joint-realism checks (cross-feature correlation)

Progress is tracked with `example/summary_metrics.py`:
- Reports avg_ks / avg_jsd / avg_lag1_diff
- Appends each run to `example/results/metrics_history.csv`
- If a previous record exists, it also reports the delta vs. that run

**Evaluation protocol:** see `docs/evaluation.md`.

Recent runs:
- 2026-01-27 21:22:34 (Windows) — avg_ks 0.4046 / avg_jsd 0.0376 / avg_lag1_diff 0.1449
- 2026-01-28 (WSL, diagnostic) — KS-only postprocess baseline (full-reference, tie-aware KS): overall_avg_ks 0.2851

---
## 15. Key Files (Complete but Pruned)

```
mask-ddpm/
  report.md
  docs/
    README.md
    architecture.md
    evaluation.md
    decisions.md
    experiments.md
    ideas.md
  example/
    config.json
    config_no_temporal.json
    config_temporal_strong.json
    feature_split.json
    data_utils.py
    prepare_data.py
    hybrid_diffusion.py
    train.py
    sample.py
    export_samples.py
    evaluate_generated.py
    run_all.py
    run_compare.py
    diagnose_ks.py
    filtered_metrics.py
    ranked_ks.py
    program_stats.py
    controller_stats.py
    actuator_stats.py
    pv_stats.py
    aux_stats.py
    postprocess_types.py
    results/
      generated.csv
      generated_post.csv
      eval.json
      eval_post.json
      cont_stats.json
      disc_vocab.json
      metrics_history.csv
```
---

## 16. Code Map (Key Files)

- Core model: `example/hybrid_diffusion.py`
- Training: `example/train.py`
- Temporal GRU: `example/hybrid_diffusion.py` (`TemporalGRUGenerator`)
- Data prep: `example/prepare_data.py`
- Data utilities: `example/data_utils.py`
- Sampling: `example/sample.py`
- Export: `example/export_samples.py`
- Evaluation: `example/evaluate_generated.py` (tie-aware KS; aggregates all reference files matched by glob)
- Pipeline: `example/run_all.py`
- Config: `example/config.json`

---

## 17. Key Engineering Decisions

- Mixed-type diffusion: continuous + discrete split
- Two-stage training: temporal backbone first, diffusion on residuals
- Switchable backbone: GRU vs. Transformer encoder for the diffusion model
- Positional + time embeddings for stability
- Optional inverse-variance weighting for the continuous loss
- log1p transforms for heavy-tailed signals
- Quantile transform + post-hoc calibration to stabilize CDF alignment

---

## 18. Known Issues / Current Limitations

- KS can remain high on a subset of features → per-feature diagnosis required
- Lag-1 may fluctuate → distribution vs. temporal trade-off
- Discrete JSD can regress when continuous KS is prioritized
- The Transformer backbone may change training stability; it needs a systematic comparison
- Program/actuator features require specialized modeling beyond diffusion

---

## 19. Suggested Next Steps

- Compare the GRU and Transformer backbones using `run_compare.py`
- Explore **v-prediction** for the continuous branch
- Strengthen discrete diffusion (e.g., D3PM-style transitions)
- Add targeted discrete calibration for high-JSD columns
- Implement the program generator for Type-1 features and evaluate it with dwell/step metrics

---

## 20. Deliverables

- Code: diffusion + temporal + diagnostics + pipeline scripts
- Docs: report + decisions + experiments + architecture + evaluation protocol
- Results: full metrics, filtered metrics, ranked KS, per-feature CDFs

---

## 21. Summary

The current project is a **hybrid diffusion system** with a **two-stage temporal + residual design**: a GRU-based temporal backbone first models sequence trends, then diffusion learns residual corrections on top of them. The pipeline covers data prep, two-stage training, sampling, export, and evaluation; it is modular, with explicit type-aware diagnostics and post-processing, and supports both GRU and Transformer backbones. The remaining research challenge is to replace the KS-only post-processing with **conditional, structurally consistent generators** for Type-1/2/3/5/6 features, balancing **distribution alignment (KS)** against **temporal consistency (lag-1)**.