From f6be8a6ecb6bdc286f2937a5a4240664c02e7239 Mon Sep 17 00:00:00 2001
From: MingzheYang
Date: Wed, 28 Jan 2026 22:34:10 +0800
Subject: [PATCH] Rewrite report as full user manual

---
 report.md | 377 ++++++++++++++++++++++++++++--------------------
 1 file changed, 197 insertions(+), 180 deletions(-)

diff --git a/report.md b/report.md
index 08f2846..53e5d37 100644
--- a/report.md
+++ b/report.md
@@ -1,218 +1,239 @@
-# mask-ddpm Project Report (Detailed)
+# mask-ddpm Project Manual (Complete, Detailed Edition)

-This report is a **complete, beginner‑friendly** description of the current project implementation as of the latest code in this repo. It explains **what the project does**, **how data flows**, **what each file is for**, and **why the architecture is designed this way**.
+> This document is a full, manual-level description aimed at readers who are new to the project.
+> The goal is that **someone unfamiliar with diffusion or time-series modeling** can still understand: what the project is, how to run it, what each file does, what each step trains, and why it is designed this way.
+>
+> Scope: the current repository code (with `example/config.json` as the main configuration).

 ---

-## 0. TL;DR / 一句话概览
-
-We generate multivariate ICS time‑series by **(1) learning temporal trend with GRU** and **(2) learning residuals with a hybrid diffusion model** (continuous DDPM + discrete masked diffusion). We then evaluate with **tie‑aware KS** and run **Type‑aware postprocessing** for diagnostic KS reduction.
+## Table of Contents
+1. Project goal and research questions
+2. Data and feature schema
+3. Preprocessing and statistics files
+4. Overall model architecture
+5. Training flow (step by step)
+6. Sampling and export flow
+7. Evaluation metrics
+8. Diagnostic tools and common scripts
+9. Type-aware design (divide by type)
+10. One-command run and common commands
+11. Output files
+12. Current configuration and key hyperparameters
+13. Common issues and why runs are slow
+14. Known limitations and next steps
+15. File tree (pruned)
+16. File responsibilities (file by file)

 ---

-## 1. Project Goal / 项目目标
+## 1. Project Goal and Research Questions

-We want synthetic ICS sequences that are:
-1) **Distribution‑aligned** (per‑feature CDF matches real data → low KS)
-2) **Temporally consistent** (lag‑1 correlation and trend are realistic)
-3) **Discrete‑valid** (state tokens are legal and frequency‑consistent)
+The goal is to generate multivariate industrial control system (ICS) time series that satisfy three requirements:

-This is hard because **distribution** and **temporal structure** often conflict in a single model.
+- **Distribution alignment**: each variable's statistical distribution is close to the real data (measured with KS)
+- **Temporal consistency**: sequence structure is plausible; lag-1 correlation and trends match the real data
+- **Discrete validity**: discrete variables (states/modes) must be legal tokens with a reasonable distribution (JSD)
+
+Core difficulties:
+- Temporal structure and distribution alignment often conflict with each other
+- The real data contains program-driven / event-driven variables that a plain DDPM struggles to learn

 ---

-## 2. Data & Feature Schema / 数据与特征结构
+## 2. Data and Feature Schema

-**Input data**: HAI CSV files (compressed) in `dataset/hai/hai-21.03/`.
+**Data source**: HAI `train*.csv.gz` (multiple files)

-**Feature split**: `example/feature_split.json`
-- `continuous`: real‑valued sensors/actuators
-- `discrete`: state tokens / modes
-- `time_column`: time index (not trained)
+**Feature split** (see `example/feature_split.json`):
+- `continuous`: continuous variables (sensors/actuators)
+- `discrete`: discrete variables (states/modes)
+- `time_column`: time column (not trained)

 ---

-## 3. Preprocessing / 预处理
+## 3. Preprocessing and Statistics Files

-File: `example/prepare_data.py`
+Script: `example/prepare_data.py`

-### Continuous features
-- Mean/std statistics
-- Quantile table (if `use_quantile_transform=true`)
-- Optional transforms (log1p etc.)
-- Output: `example/results/cont_stats.json`
+### 3.1 Continuous variables
+- Compute mean/std
+- If `use_quantile_transform` is enabled: compute a quantile table (empirical CDF)
+- Output: `example/results/cont_stats.json`

-### Discrete features
-- Token vocab from data
-- Output: `example/results/disc_vocab.json`
+### 3.2 Discrete variables
+- Build the token vocab
+- Output: `example/results/disc_vocab.json`

-File: `example/data_utils.py` contains
-- Normalization / inverse
-- Quantile transform / inverse
-- Post‑calibration helpers
+### 3.3 Data utilities
+`example/data_utils.py` provides:
+- Normalization / inverse normalization
+- Quantile transform / inverse transform (a minimal sketch follows this list)
+- Optional post-calibration (quantile calibration)
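+
+The following is an illustrative sketch of the quantile (CDF) transform and its inverse; the real implementation lives in `example/data_utils.py`, and the function names and the number of knots here are assumptions made for the example only.
+
+```python
+import numpy as np
+
+def fit_quantile_table(x: np.ndarray, n_knots: int = 256) -> np.ndarray:
+    """Store equally spaced empirical quantiles of one continuous feature."""
+    return np.quantile(x, np.linspace(0.0, 1.0, n_knots))
+
+def to_uniform(x: np.ndarray, table: np.ndarray) -> np.ndarray:
+    """Map raw values to roughly Uniform[0, 1] via the empirical CDF."""
+    return np.interp(x, table, np.linspace(0.0, 1.0, table.size))
+
+def from_uniform(u: np.ndarray, table: np.ndarray) -> np.ndarray:
+    """Inverse transform: map Uniform[0, 1] values back to the raw scale."""
+    return np.interp(u, np.linspace(0.0, 1.0, table.size), table)
+```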

 ---

-## 4. Architecture / 模型结构
+## 4. Overall Model Architecture

-### 4.1 Stage‑1 Temporal GRU (Trend)
-File: `example/hybrid_diffusion.py`
-- Class: `TemporalGRUGenerator`
-- Input: continuous sequence
-- Output: **trend sequence** (teacher forced)
-- Purpose: capture temporal structure
+The project uses a **two-stage + hybrid diffusion** architecture:

-### 4.2 Stage‑2 Hybrid Diffusion (Residual)
-File: `example/hybrid_diffusion.py`
+### 4.1 Stage-1 Temporal GRU
+- Purpose: learn the sequence trend and temporal structure
+- Input: continuous-variable sequences
+- Output: trend (trend sequence)

-**Continuous branch**
-- Gaussian DDPM
-- Predicts **residual** (or noise)
+### 4.2 Stage-2 Hybrid Diffusion
+- Purpose: learn the residual distribution (decoupling temporal structure from the marginal distribution)
+- Continuous variables: Gaussian DDPM
+- Discrete variables: mask-diffusion classification head

-**Discrete branch**
-- Mask diffusion (masked tokens)
-- Classifier head per discrete column
-
-**Backbone**
-- Current config uses **Transformer encoder** (`backbone_type=transformer`)
-- GRU is still supported as option
-
-**Conditioning**
-- File‑id conditioning (`use_condition=true`, `condition_type=file_id`)
-- Type‑1 (setpoint/demand) can be passed as **continuous condition** (`cond_cont`)
+### 4.3 Backbone choice
+- Current config: `backbone_type = transformer`
+- Alternative: GRU (uses less GPU memory, more stable)

 ---

-## 5. Training Flow / 训练流程
-File: `example/train.py`
+## 5. Training Flow (Step by Step)

-### 5.1 Stage‑1 Temporal training
-- Use continuous features (excluding Type1/Type5)
-- Teacher‑forced GRU predicts next step
-- Loss: **MSE**
-- Output: `temporal.pt`
+Script: `example/train.py`

-### 5.2 Stage‑2 Diffusion training
-- Compute residual: `x_resid = x_cont - trend`
-- Sample time step `t`
-- Add noise for continuous; mask tokens for discrete
-- Model predicts:
-  - **eps_pred** for continuous residual
-  - logits for discrete tokens
+### Step 1: Temporal training
+- Input: continuous sequences
+- A teacher-forced GRU predicts the next step
+- Loss: MSE
+- Output: `temporal.pt`

-### Loss design
-- Continuous loss: MSE on eps or x0 (`cont_target`)
-- Optional weighting: inverse variance (`cont_loss_weighting=inv_std`)
-- Optional SNR weighting (`snr_weighted_loss`)
-- Optional quantile loss (align residual distribution)
-- Optional residual mean/std loss
-- Discrete loss: cross‑entropy on masked tokens
-- Total: `loss = λ * loss_cont + (1‑λ) * loss_disc`
+### Step 2: Diffusion training
+- Compute the residual: `x_resid = x_cont - trend`
+- Sample a time step t
+- Continuous: add noise
+- Discrete: mask tokens
+- The model predicts eps / logits
+
+### Loss design
+- Continuous: MSE (on eps or x0)
+- Discrete: cross-entropy (on masked positions)
+- Total loss: `loss = λ * loss_cont + (1-λ) * loss_disc`
+- Optional weightings:
+  - inverse-std
+  - SNR-weighted
+  - quantile loss
+  - residual stat loss

 ---

-## 6. Sampling & Export / 采样与导出
-File: `example/export_samples.py`
+## 6. Sampling and Export Flow

-Steps:
-1) Initialize continuous with noise
-2) Initialize discrete with masks
-3) Reverse diffusion loop from `t=T..0`
-4) Add trend back (if temporal stage enabled)
-5) Inverse transforms (quantile → raw)
-6) Clip/bound if configured
-7) Merge back Type1 (conditioning) and Type5 (derived)
-8) Write `generated.csv`
+Script: `example/export_samples.py`
+
+Flow (a minimal sketch of the continuous part follows this list):
+1) Initialize noise (continuous)
+2) Initialize masks (discrete)
+3) Reverse diffusion from t=T..0
+4) Add the trend back
+5) Inverse transforms (quantile / normalization)
+6) Assemble the CSV
+
+Output: `example/results/generated.csv`
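+
+The sketch below illustrates steps 1), 3) and 4) for the continuous branch only, assuming an x0-predicting denoiser (`cont_target = x0`), a linear beta schedule, and a `model(x, t)` call signature. None of these details are taken from the actual code; `example/export_samples.py` and `example/hybrid_diffusion.py` remain the reference.
+
+```python
+import torch
+
+T = 600                                    # timesteps (matches the config)
+betas = torch.linspace(1e-4, 0.02, T)      # assumed linear schedule
+alphas = 1.0 - betas
+alpha_bar = torch.cumprod(alphas, dim=0)
+
+@torch.no_grad()
+def sample_residual(model, trend):
+    """Ancestral DDPM sampling of the residual, then add the trend back."""
+    x = torch.randn_like(trend)                              # step 1: pure noise
+    for t in reversed(range(T)):                             # step 3: t = T..0
+        t_batch = torch.full((x.shape[0],), t, dtype=torch.long)
+        x0_hat = model(x, t_batch)                           # predicted residual x0
+        ab_t = alpha_bar[t]
+        ab_prev = alpha_bar[t - 1] if t > 0 else torch.tensor(1.0)
+        # Posterior q(x_{t-1} | x_t, x0) coefficients for an x0-predicting network
+        coef_x0 = torch.sqrt(ab_prev) * betas[t] / (1.0 - ab_t)
+        coef_xt = torch.sqrt(alphas[t]) * (1.0 - ab_prev) / (1.0 - ab_t)
+        mean = coef_x0 * x0_hat + coef_xt * x
+        if t > 0:
+            var = betas[t] * (1.0 - ab_prev) / (1.0 - ab_t)
+            x = mean + torch.sqrt(var) * torch.randn_like(x)
+        else:
+            x = mean
+    return trend + x                                         # step 4: add trend back
+```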

 ---

-## 7. Evaluation / 评估
-File: `example/evaluate_generated.py`
+## 7. Evaluation Metrics

-### Metrics
-- **KS (tie‑aware)** for continuous
-- **JSD** for discrete
-- **lag‑1 correlation** for temporal consistency
-- quantile diffs, mean/std errors
+Script: `example/evaluate_generated.py`

-### Important
-- Reference supports **glob** and aggregates **all matched files**
-- KS implementation is **tie‑aware** (correct for spiky/quantized data)
+### Continuous metrics
+- **KS (tie-aware)** — an illustrative tie-aware computation is sketched in the appendix at the end of this document
+- quantile diff
+- lag-1 correlation

-Outputs:
-- `example/results/eval.json`
+### Discrete metrics
+- JSD
+- proportion of invalid tokens
+
+### Reading the reference data
+- Supports the `train*.csv.gz` glob
+- Automatically aggregates all matched files

 ---

-## 8. Diagnostics / 诊断工具
+## 8. Diagnostic Tools and Common Scripts

-- `example/diagnose_ks.py`: CDF plots and per‑feature KS
-- `example/ranked_ks.py`: ranked KS + contribution
-- `example/filtered_metrics.py`: filtered KS excluding outliers
-- `example/program_stats.py`: Type‑1 stats
-- `example/controller_stats.py`: Type‑2 stats
-- `example/actuator_stats.py`: Type‑3 stats
-- `example/pv_stats.py`: Type‑4 stats
-- `example/aux_stats.py`: Type‑6 stats
+- `diagnose_ks.py`: CDF visualization
+- `ranked_ks.py`: ranked KS contributions
+- `filtered_metrics.py`: KS after filtering anomalous features
+- `program_stats.py`: Type1 statistics
+- `controller_stats.py`: Type2 statistics
+- `actuator_stats.py`: Type3 statistics
+- `pv_stats.py`: Type4 statistics
+- `aux_stats.py`: Type6 statistics

 ---

-## 9. Type‑Aware Modeling / 类型化分离
+## 9. Type-Aware Design (Divide by Type)

-To reduce KS dominated by a few variables, the project uses **Type categories** defined in config:
-- **Type1**: setpoints / demand (schedule‑driven)
-- **Type2**: controller outputs
-- **Type3**: actuator positions
-- **Type4**: PV sensors
-- **Type5**: derived tags
-- **Type6**: auxiliary / coupling
+In real ICS data some variables are hard for a DDPM to learn, so the features are divided into types:

-### Current implementation (diagnostic KS baseline)
-File: `example/postprocess_types.py`
-- Type1/2/3/5/6 → **empirical resampling** from real distribution
-- Type4 → keep diffusion output
+- **Type1**: setpoint/demand (schedule-driven)
+- **Type2**: controller outputs
+- **Type3**: actuator positions
+- **Type4**: PV sensors
+- **Type5**: derived tags
+- **Type6**: aux/coupling

-This is **not** the final model, but provides a **KS‑upper bound** for diagnosis.
+Script: `example/postprocess_types.py`

-Outputs:
-- `example/results/generated_post.csv`
-- `example/results/eval_post.json`
+The current implementation is a **KS-only baseline**:
+- Type1/2/3/5/6 → empirical resampling
+- Type4 → still the diffusion output
+
+Purpose:
+- Quickly diagnose the "best achievable KS upper bound"
+- No guarantee of joint-distribution realism
+
+Output: `example/results/generated_post.csv`

 ---

-## 10. Pipeline / 一键流程
+## 10. One-Command Run and Common Commands

-File: `example/run_all.py`
-
-Default pipeline:
-1) prepare_data
-2) train
-3) export_samples
-4) evaluate_generated (generated.csv)
-5) postprocess_types (generated_post.csv)
-6) evaluate_generated (eval_post.json)
-7) diagnostics scripts
-
-**Linux**:
+### Full pipeline (recommended)
 ```bash
 python example/run_all.py --device cuda --config example/config.json
 ```

-**Windows (PowerShell)**:
-```powershell
-python run_all.py --device cuda --config config.json
+### Evaluate only, no training
+```bash
+python example/run_all.py --skip-prepare --skip-train --skip-export
+```
+
+### Train only, no evaluation
+```bash
+python example/run_all.py --skip-eval --skip-postprocess --skip-post-eval --skip-diagnostics
 ```

 ---

-## 11. Current Configuration (Key Defaults)
-From `example/config.json`:
+## 11. Output Files
+
+- `generated.csv`: raw diffusion output
+- `generated_post.csv`: KS-only postprocessed output
+- `eval.json`: evaluation of the raw output
+- `eval_post.json`: evaluation of the postprocessed output
+- `cont_stats.json` / `disc_vocab.json`: statistics files
+- `*_stats.json`: per-Type statistics reports
+
+---
+
+## 12. Current Configuration (Key Hyperparameters)
+
+From `example/config.json` (a small loader sketch follows this list):
 - backbone_type: **transformer**
 - timesteps: 600
 - seq_len: 96
 - batch_size: 16
-- cont_target: `x0`
-- cont_loss_weighting: `inv_std`
+- cont_target: x0
+- cont_loss_weighting: inv_std
 - snr_weighted_loss: true
 - quantile_loss_weight: 0.2
 - use_quantile_transform: true
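+
+The snippet below is only a convenience for checking these values against the file itself; the key names simply mirror the bullet list above and may not cover the full schema of `example/config.json`.
+
+```python
+import json
+
+with open("example/config.json") as f:
+    cfg = json.load(f)
+
+for key in ["backbone_type", "timesteps", "seq_len", "batch_size",
+            "cont_target", "cont_loss_weighting", "snr_weighted_loss",
+            "quantile_loss_weight", "use_quantile_transform"]:
+    print(key, "=", cfg.get(key, "<not present>"))
+```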
@@ -221,41 +242,30 @@

 ---

-## 12. What’s Actually Trained vs What’s Post‑Processed
+## 13. Why Runs Are Slow

-**Trained**
-- Temporal GRU (trend)
-- Diffusion residual model (continuous + discrete)
-
-**Post‑Processed (KS‑only)**
-- Type1/2/3/5/6 replaced by empirical resampling
-
-This is important: postprocess improves KS but **may break joint realism**.
+1) Two training stages (temporal + diffusion)
+2) Evaluation reads the full train*.csv.gz set
+3) run_all runs every diagnostic script by default
+4) timesteps / seq_len are large

 ---

-## 13. Why It’s Still Hard / 当前难点
+## 14. Known Limitations and Next Steps

-- Type1/2/3 are **event‑driven** and **piecewise constant**
-- Diffusion (Gaussian DDPM + MSE) tends to smooth/blur these
-- Temporal vs distribution objectives pull in opposite directions
+Limitations:
+- Type1/2/3 still dominate the KS
+- The KS-only baseline breaks the joint distribution
+- There is a trade-off between temporal structure and distribution

+Directions:
+- Build conditional models for Type1/2/3
+- Add regime conditioning for Type4
+- Joint metrics (cross-feature correlation)

 ---

-## 14. Where To Improve Next / 下一步方向
-
-1) Replace KS‑only postprocess with **conditional generators**:
-   - Type1: program generator (HMM / schedule)
-   - Type2: controller emulator (PID‑like)
-   - Type3: actuator dynamics (dwell + rate + saturation)
-
-2) Add regime conditioning for Type4 PVs
-
-3) Joint realism checks (cross‑feature correlation)
-
----
-
-## 15. Key Files (Complete but Pruned)
+## 15. File Tree (Pruned)

 ```
 mask-ddpm/
@@ -291,18 +301,25 @@
     aux_stats.py
     postprocess_types.py
     results/
-      generated.csv
-      generated_post.csv
-      eval.json
-      eval_post.json
-      cont_stats.json
-      disc_vocab.json
-      metrics_history.csv
 ```

 ---

-## 16. Summary / 总结
+## 16. File Responsibilities (File by File)

-The current project is a **hybrid diffusion system** with a **two‑stage temporal+residual design**, built to balance **distribution alignment** and **temporal realism**. The architecture is modular, with explicit type‑aware diagnostics and postprocessing, and supports both GRU and Transformer backbones. The remaining research challenge is to replace KS‑only postprocessing with **conditional, structurally consistent generators** for Type1/2/3/5/6 features.
+- `prepare_data.py`: computes statistics for continuous/discrete features
+- `data_utils.py`: preprocessing and transform functions
+- `hybrid_diffusion.py`: model core (Temporal + Diffusion)
+- `train.py`: two-stage training
+- `export_samples.py`: sampling and export
+- `evaluate_generated.py`: evaluation metrics
+- `run_all.py`: one-command pipeline
+- `postprocess_types.py`: Type-aware KS-only baseline
+- `diagnose_ks.py`: CDF diagnostics
+- `ranked_ks.py`: KS ranking
+- `filtered_metrics.py`: filtered KS
+
+---
+
+# End
+If you need a more paper-style version (with formulas, pseudocode, and experiment tables), it can be added on top of this document.
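+
+---
+
+## Appendix: Illustrative Tie-Aware KS Sketch
+
+As referenced in Section 7, the sketch below shows one way to compute a two-sample KS statistic that stays correct when the data contain many repeated (quantized) values, by evaluating both empirical CDFs on the pooled support. It is an illustration only and is not the implementation used in `example/evaluate_generated.py`.
+
+```python
+import numpy as np
+
+def ks_tie_aware(real: np.ndarray, gen: np.ndarray) -> float:
+    """Two-sample KS statistic evaluated on the pooled set of values."""
+    grid = np.unique(np.concatenate([real, gen]))
+    cdf_real = np.searchsorted(np.sort(real), grid, side="right") / real.size
+    cdf_gen = np.searchsorted(np.sort(gen), grid, side="right") / gen.size
+    return float(np.max(np.abs(cdf_real - cdf_gen)))
+
+# Heavily tied (quantized) toy data
+rng = np.random.default_rng(0)
+a = np.round(rng.normal(size=5000), 1)
+b = np.round(rng.normal(loc=0.1, size=5000), 1)
+print(ks_tie_aware(a, b))
+```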