Rewrite report as full user manual

2026-01-28 22:34:10 +08:00
parent 6fb53dd5c1
commit f6be8a6ecb

report.md

@@ -1,218 +1,239 @@
# mask-ddpm Project Manual (Complete, Detailed Edition)
This document is a **complete, beginner-friendly** description of the current project implementation as of the latest code in this repo. It explains **what the project does**, **how data flows**, **what each file is for**, and **why the architecture is designed this way**.
> The goal is that readers **unfamiliar with diffusion or time-series modeling** can still understand: what the project is, how to run it, what each file does, what each training step learns, and why it is designed this way.
>
> Scope: the current repository code (main configuration: `example/config.json`).
---
## 0. TL;DR / One-Sentence Overview
We generate multivariate ICS time series by **(1) learning the temporal trend with a GRU** and **(2) learning the residuals with a hybrid diffusion model** (continuous DDPM + discrete masked diffusion). We then evaluate with **tie-aware KS** and run **Type-aware post-processing** for diagnostic KS reduction.
## Table of Contents
1. Project goal & research questions
2. Data & feature schema
3. Preprocessing & statistics files
4. Model architecture
5. Training flow (step by step)
6. Sampling & export flow
7. Evaluation & metrics
8. Diagnostic tools & common scripts
9. Type-aware design (divide and conquer by type)
10. One-command pipeline & common commands
11. Output files
12. Current configuration (key hyperparameters)
13. What's trained vs what's post-processed, and why runs are slow
14. Known limitations & next steps
15. File tree (pruned)
16. File responsibilities (file by file)
---
## 1. Project Goal & Research Questions
We want to generate multivariate ICS (industrial control system) time series that are:
1) **Distribution-aligned**: each feature's CDF matches the real data (low KS)
2) **Temporally consistent**: lag-1 correlation and trends are realistic
3) **Discrete-valid**: discrete variables (states/modes) are legal tokens with realistic frequencies (low JSD)
Core difficulties:
- Distribution alignment and temporal structure often conflict within a single model.
- The real data contains "program-driven / event-driven" variables that a pure DDPM struggles to learn.
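For reference, the three headline metrics, written as standard textbook definitions (the evaluation code computes empirical, tie-aware versions of these):
```latex
% KS: largest CDF gap between real and generated values of one feature
\mathrm{KS} = \sup_x \big| F_{\text{real}}(x) - F_{\text{gen}}(x) \big|

% JSD between real and generated token distributions P and Q, with M = (P+Q)/2
\mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M)

% lag-1 autocorrelation of a feature x_t (temporal consistency)
\rho_1 = \operatorname{corr}(x_t,\; x_{t+1})
```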
---
## 2. Data & Feature Schema
**Input data**: HAI CSV files (compressed `train*.csv.gz`, multiple files) in `dataset/hai/hai-21.03/`.
**Feature split** (see `example/feature_split.json`):
- `continuous`: real-valued sensors/actuators
- `discrete`: state tokens / modes
- `time_column`: time index (not used for training)
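A minimal sketch of how a script might read this split file; the top-level keys come from the description above, while the column names in the comments are hypothetical placeholders rather than real HAI tags.
```python
# Load the feature split and separate column groups.
# Key names ("continuous", "discrete", "time_column") are from this manual;
# the example tag names in the comments are hypothetical, not real HAI columns.
import json

with open("example/feature_split.json", "r", encoding="utf-8") as f:
    split = json.load(f)

cont_cols = split["continuous"]   # e.g. ["P1_SENSOR_A", ...] (placeholder names)
disc_cols = split["discrete"]     # e.g. ["P1_STATE", ...] (placeholder names)
time_col = split["time_column"]   # dropped before training

print(f"{len(cont_cols)} continuous, {len(disc_cols)} discrete, time column: {time_col}")
```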
---
## 3. Preprocessing & Statistics Files
Script: `example/prepare_data.py`
### 3.1 Continuous features
- Mean/std statistics
- Quantile table (empirical CDF) if `use_quantile_transform=true`
- Optional transforms (log1p, etc.)
- Output: `example/results/cont_stats.json`
### 3.2 Discrete features
- Token vocabulary built from the data
- Output: `example/results/disc_vocab.json`
### 3.3 Data utilities
`example/data_utils.py` provides:
- Normalization / inverse normalization
- Quantile transform / inverse transform
- Optional post-calibration helpers (quantile calibration)
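To make the quantile (CDF) transform concrete, here is a small self-contained sketch of the idea using empirical quantiles; the actual helpers in `example/data_utils.py` may use a different table format, interpolation, or edge handling.
```python
# Sketch of an empirical quantile (CDF) transform and its inverse; illustration
# only, not the repo's exact implementation.
import numpy as np

def fit_quantile_table(x: np.ndarray, n_quantiles: int = 1000) -> np.ndarray:
    """Store evenly spaced empirical quantiles of the training data."""
    probs = np.linspace(0.0, 1.0, n_quantiles)
    return np.quantile(x, probs)

def to_uniform(x: np.ndarray, table: np.ndarray) -> np.ndarray:
    """Map raw values to [0, 1] through the empirical CDF."""
    probs = np.linspace(0.0, 1.0, len(table))
    return np.interp(x, table, probs)

def from_uniform(u: np.ndarray, table: np.ndarray) -> np.ndarray:
    """Inverse transform: map [0, 1] back to the original value range."""
    probs = np.linspace(0.0, 1.0, len(table))
    return np.interp(u, probs, table)

# Round trip: x_raw -> u -> x_rec should approximately recover x_raw.
x_raw = np.random.lognormal(size=10_000)
table = fit_quantile_table(x_raw)
x_rec = from_uniform(to_uniform(x_raw, table), table)
```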
---
## 4. Model Architecture
The project uses a **two-stage + hybrid-diffusion** architecture.
### 4.1 Stage-1: Temporal GRU (Trend)
File: `example/hybrid_diffusion.py`
- Class: `TemporalGRUGenerator`
- Input: continuous sequence
- Output: **trend sequence** (teacher-forced)
- Purpose: capture temporal structure
### 4.2 Stage-2: Hybrid Diffusion (Residual)
File: `example/hybrid_diffusion.py`
- Purpose: learn the residual distribution, decoupling temporal structure from distribution
**Continuous branch**
- Gaussian DDPM
- Predicts the **residual** (noise or x0)
**Discrete branch**
- Mask diffusion (masked tokens)
- One classifier head per discrete column
**Conditioning**
- File-id conditioning (`use_condition=true`, `condition_type=file_id`)
- Type1 (setpoint/demand) can be passed as a **continuous condition** (`cond_cont`)
### 4.3 Backbone choice
- Current config: `backbone_type = transformer` (Transformer encoder)
- GRU is still supported as an option (less memory, more stable)
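The following is a conceptual sketch of the two-stage idea, not the repo's actual classes or signatures: a hypothetical `TrendGRU` stands in for `TemporalGRUGenerator`, and the closing comments show where the residual decomposition happens.
```python
# Conceptual sketch of the two-stage decomposition (illustrative only).
import torch
import torch.nn as nn

class TrendGRU(nn.Module):
    """Hypothetical stand-in for a teacher-forced trend model."""
    def __init__(self, n_features: int, hidden: int = 128):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_features)

    def forward(self, x_prev: torch.Tensor) -> torch.Tensor:
        # x_prev: (batch, seq_len, n_features), shifted by one step (teacher forcing)
        h, _ = self.gru(x_prev)
        return self.head(h)  # predicted trend for each step

# Stage-2 then works on the residual rather than the raw signal:
#   x_resid = x_cont - trend
# The diffusion model denoises x_resid (continuous branch) while a classifier
# head fills in masked discrete tokens (discrete branch).
```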
---
## 5. Training Flow (Step by Step)
Script: `example/train.py`
### 5.1 Stage-1: Temporal training
- Uses the continuous features (excluding Type1/Type5)
- A teacher-forced GRU predicts the next step
- Loss: **MSE**
- Output: `temporal.pt`
### 5.2 Stage-2: Diffusion training
- Compute the residual: `x_resid = x_cont - trend`
- Sample a time step `t`
- Add Gaussian noise to the continuous part; mask tokens in the discrete part
- The model predicts:
  - **eps_pred** (or x0) for the continuous residual
  - logits for the discrete tokens
### Loss design
- Continuous loss: MSE on eps or x0 (`cont_target`)
- Discrete loss: cross-entropy on the masked tokens
- Total: `loss = λ * loss_cont + (1 - λ) * loss_disc`
- Optional weightings and auxiliary terms:
  - inverse-variance weighting (`cont_loss_weighting=inv_std`)
  - SNR weighting (`snr_weighted_loss`)
  - quantile loss (aligns the residual distribution)
  - residual mean/std loss
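A minimal sketch of one Stage-2 training step under simplifying assumptions (epsilon target, a single discrete column, masking probability `t/T`, none of the optional weightings); the function and argument names here are hypothetical and the real `train.py` differs in its details.
```python
# One Stage-2 training step (illustrative sketch, not the repo's code).
import torch
import torch.nn.functional as F

def stage2_step(model, x_cont, trend, x_disc, alphas_cumprod, mask_id, lam=0.5):
    B, L, _ = x_cont.shape
    T = alphas_cumprod.shape[0]

    # 1) Residual target for the continuous branch.
    x_resid = x_cont - trend

    # 2) Sample a diffusion step and add Gaussian noise to the residual.
    t = torch.randint(0, T, (B,), device=x_cont.device)
    a_bar = alphas_cumprod[t].view(B, 1, 1)
    eps = torch.randn_like(x_resid)
    x_noisy = a_bar.sqrt() * x_resid + (1 - a_bar).sqrt() * eps

    # 3) Mask a random subset of discrete tokens (mask diffusion).
    mask = torch.rand_like(x_disc, dtype=torch.float) < (t.float() / T).view(B, 1)
    x_disc_in = torch.where(mask, torch.full_like(x_disc, mask_id), x_disc)

    # 4) Model predicts noise for the residual and logits for masked tokens.
    eps_pred, logits = model(x_noisy, x_disc_in, t)

    # 5) Combine the two losses: loss = lam * loss_cont + (1 - lam) * loss_disc.
    loss_cont = F.mse_loss(eps_pred, eps)
    loss_disc = F.cross_entropy(logits[mask], x_disc[mask]) if mask.any() else 0.0
    return lam * loss_cont + (1 - lam) * loss_disc
```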
---
## 6. Sampling & Export Flow
Script: `example/export_samples.py`
Steps:
1) Initialize the continuous part with Gaussian noise
2) Initialize the discrete part with mask tokens
3) Run the reverse diffusion loop from `t = T .. 0`
4) Add the trend back (if the temporal stage is enabled)
5) Apply the inverse transforms (quantile → raw, de-normalization)
6) Clip/bound values if configured
7) Merge back Type1 (conditioning) and Type5 (derived) columns
8) Write the result to `example/results/generated.csv`
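For intuition, a bare-bones sketch of the reverse (ancestral) DDPM loop for the continuous residual, assuming an epsilon-predicting model; the repo's sampler additionally un-masks discrete tokens at every step and supports more options.
```python
# Reverse DDPM loop for the continuous residual (illustrative sketch).
import torch

@torch.no_grad()
def sample_residual(model, shape, betas):
    alphas = 1.0 - betas
    a_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                      # step 1: start from pure noise
    for t in reversed(range(len(betas))):       # step 3: t = T-1 .. 0
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_pred, _ = model(x, None, t_batch)   # discrete branch omitted here
        coef = betas[t] / (1 - a_bar[t]).sqrt()
        mean = (x - coef * eps_pred) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise
    return x  # residual; the trend is added back afterwards (step 4)
```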
---
## 7. Evaluation & Metrics
Script: `example/evaluate_generated.py`
### Continuous metrics
- **KS (tie-aware)**, correct for spiky/quantized data
- Quantile diffs, mean/std errors
- lag-1 correlation (temporal consistency)
### Discrete metrics
- **JSD**
- Invalid-token ratio
### Reference data
- The reference path supports **glob** patterns (e.g. `train*.csv.gz`) and aggregates **all matched files**
Output:
- `example/results/eval.json`
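The tie-aware KS idea fits in a few lines: evaluate both empirical CDFs on the union of observed values, so heavily tied (spiky/quantized) signals are handled correctly. This is a sketch of the idea, not the repo's exact implementation.
```python
# Tie-aware two-sample KS statistic (illustrative sketch).
import numpy as np

def ks_tie_aware(real: np.ndarray, gen: np.ndarray) -> float:
    grid = np.union1d(real, gen)                     # evaluation points, incl. tied values
    cdf_real = np.searchsorted(np.sort(real), grid, side="right") / len(real)
    cdf_gen = np.searchsorted(np.sort(gen), grid, side="right") / len(gen)
    return float(np.abs(cdf_real - cdf_gen).max())   # sup |F_real - F_gen|
```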
---
## 8. Diagnostic Tools & Common Scripts
- `example/diagnose_ks.py`: CDF plots and per-feature KS
- `example/ranked_ks.py`: ranked KS + per-feature contribution
- `example/filtered_metrics.py`: filtered KS excluding outlier features
- `example/program_stats.py`: Type1 statistics
- `example/controller_stats.py`: Type2 statistics
- `example/actuator_stats.py`: Type3 statistics
- `example/pv_stats.py`: Type4 statistics
- `example/aux_stats.py`: Type6 statistics
---
## 9. Type-Aware Design (Divide and Conquer by Type)
In real ICS data some variables are hard for a DDPM to learn, and a few of them dominate KS. The config therefore defines **Type categories**:
- **Type1**: setpoints / demand (schedule-driven)
- **Type2**: controller outputs
- **Type3**: actuator positions
- **Type4**: PV sensors
- **Type5**: derived tags
- **Type6**: auxiliary / coupling
### Current implementation (KS-only diagnostic baseline)
Script: `example/postprocess_types.py`
- Type1/2/3/5/6 → **empirical resampling** from the real distribution
- Type4 → keep the diffusion output
This is **not** the final model: it quickly shows the **best reachable KS (an upper bound)** for diagnosis, but it does not preserve joint-distribution realism.
Outputs:
- `example/results/generated_post.csv`
- `example/results/eval_post.json`
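For clarity, the "empirical resampling" baseline boils down to redrawing a column i.i.d. from the real marginal distribution; below is a sketch under that assumption (column names hypothetical).
```python
# Empirical resampling of a single column (illustrative sketch).
# This matches the real per-feature CDF (KS ≈ 0) but ignores temporal and
# cross-feature structure, which is why it is diagnostic only.
import numpy as np
import pandas as pd

def resample_column(real: pd.Series, n: int, rng: np.random.Generator) -> np.ndarray:
    return rng.choice(real.to_numpy(), size=n, replace=True)

# Hypothetical usage: overwrite the non-Type4 columns of the generated frame.
# rng = np.random.default_rng(0)
# gen_df["SOME_TYPE1_TAG"] = resample_column(real_df["SOME_TYPE1_TAG"], len(gen_df), rng)
```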
---
## 10. One-Command Pipeline & Common Commands
File: `example/run_all.py`
Default pipeline:
1) prepare_data
2) train
3) export_samples
4) evaluate_generated (generated.csv)
5) postprocess_types (generated_post.csv)
6) evaluate_generated (eval_post.json)
7) diagnostic scripts
### Full pipeline (recommended, Linux)
```bash
python example/run_all.py --device cuda --config example/config.json
```
### Full pipeline (Windows PowerShell, paths relative to `example/`)
```powershell
python run_all.py --device cuda --config config.json
```
### Evaluate only (skip training)
```bash
python example/run_all.py --skip-prepare --skip-train --skip-export
```
### Train only (skip evaluation)
```bash
python example/run_all.py --skip-eval --skip-postprocess --skip-post-eval --skip-diagnostics
```
---
## 11. Output Files
All outputs live under `example/results/`:
- `generated.csv`: raw diffusion output
- `generated_post.csv`: KS-only post-processed output
- `eval.json`: evaluation of the raw output
- `eval_post.json`: evaluation of the post-processed output
- `cont_stats.json` / `disc_vocab.json`: statistics files
- `*_stats.json`: per-Type statistics reports
---
## 12. Current Configuration (Key Hyperparameters)
From `example/config.json`:
- backbone_type: **transformer**
- timesteps: 600
- seq_len: 96
- batch_size: 16
- cont_target: `x0`
- cont_loss_weighting: `inv_std`
- snr_weighted_loss: true
- quantile_loss_weight: 0.2
- use_quantile_transform: true
@@ -221,41 +242,30 @@ From `example/config.json`:
---
## 13. What's Trained vs What's Post-Processed, and Why Runs Are Slow
**Trained**
- Temporal GRU (trend)
- Diffusion residual model (continuous + discrete)
**Post-processed (KS-only)**
- Type1/2/3/5/6 replaced by empirical resampling
This matters: post-processing improves KS but **may break joint realism**.
**Why a full run is slow**
1) Two training stages (temporal + diffusion)
2) Evaluation reads all `train*.csv.gz` reference files
3) `run_all.py` runs every diagnostic script by default
4) Large `timesteps` / `seq_len`
---
## 14. Known Limitations & Next Steps
**Why it's still hard**
- Type1/2/3 are **event-driven** and **piecewise constant**
- A Gaussian DDPM trained with MSE tends to smooth/blur them
- The temporal and distributional objectives pull in opposite directions
**Current limitations**
- Type1/2/3 still dominate KS
- The KS-only baseline breaks the joint distribution
- There is a temporal-vs-distribution trade-off
**Where to improve next**
1) Replace the KS-only post-process with **conditional generators**:
   - Type1: program generator (HMM / schedule)
   - Type2: controller emulator (PID-like)
   - Type3: actuator dynamics (dwell + rate + saturation); an illustrative sketch follows this section
2) Add regime conditioning for Type4 PVs
3) Joint realism checks (cross-feature correlation)
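To make the Type3 direction concrete, here is a purely illustrative sketch of an actuator-dynamics generator with dwell, rate limiting, and saturation; nothing like this exists in the repo yet, and every name and parameter is hypothetical.
```python
# Hypothetical actuator-dynamics model: follow a command with a minimum dwell
# time before reacting, a limited slew rate, and saturation at [lo, hi].
import numpy as np

def actuator_response(command: np.ndarray, rate_limit: float, lo: float, hi: float,
                      dwell_steps: int) -> np.ndarray:
    pos = np.empty_like(command, dtype=float)
    pos[0] = np.clip(command[0], lo, hi)
    steps_since_change = dwell_steps
    for t in range(1, len(command)):
        # Count how long the command has been stable.
        if command[t] != command[t - 1]:
            steps_since_change = 0
        else:
            steps_since_change += 1
        # Only react once the command has dwelled long enough.
        target = command[t] if steps_since_change >= dwell_steps else pos[t - 1]
        # Rate-limit the move and saturate the position.
        delta = np.clip(target - pos[t - 1], -rate_limit, rate_limit)
        pos[t] = np.clip(pos[t - 1] + delta, lo, hi)
    return pos
```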
---
## 15. File Tree (Pruned)
```
mask-ddpm/
@@ -291,18 +301,25 @@ mask-ddpm/
  aux_stats.py
  postprocess_types.py
  results/
    generated.csv
    generated_post.csv
    eval.json
    eval_post.json
    cont_stats.json
    disc_vocab.json
    metrics_history.csv
```
---
## 16. File Responsibilities (File by File)
- `prepare_data.py`: computes statistics for continuous/discrete features
- `data_utils.py`: preprocessing and transform utilities
- `hybrid_diffusion.py`: model core (Temporal + Diffusion)
- `train.py`: two-stage training
- `export_samples.py`: sampling and export
- `evaluate_generated.py`: evaluation metrics
- `run_all.py`: one-command pipeline
- `postprocess_types.py`: Type-aware KS-only baseline
- `diagnose_ks.py`: CDF diagnostics
- `ranked_ks.py`: ranked KS
- `filtered_metrics.py`: filtered KS
---
# Summary
The current project is a **hybrid diffusion system** with a **two-stage temporal + residual design**, built to balance **distribution alignment** and **temporal realism**. The architecture is modular, with explicit type-aware diagnostics and post-processing, and supports both GRU and Transformer backbones. The remaining research challenge is to replace KS-only post-processing with **conditional, structurally consistent generators** for Type1/2/3/5/6 features.
If you need a more "paper-style" version (with formulas, pseudocode, and experiment tables), this manual can be extended further.