Rewrite report as full user manual

2026-01-28 22:34:10 +08:00
parent 6fb53dd5c1
commit f6be8a6ecb
1 changed files with 197 additions and 180 deletions
--- a/report.md
+++ b/report.md
@@ -1,218 +1,239 @@
-# mask-ddpm Project Report (Detailed)
+# mask-ddpm Project Manual (Complete, Detailed Edition)
-This report is a **complete, beginner‑friendly** description of the current project implementation as of the latest code in this repo. It explains **what the project does**, **how data flows**, **what each file is for**, and **why the architecture is designed this way**.
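+> This document is a complete, manual-level description of the project, written for readers encountering it for the first time.
 > The goal is that even **readers unfamiliar with diffusion or time-series modeling** can understand what the project is, how to run it, what each file does, what each step trains, and why it is designed this way.
 >
 > Scope: the code currently in this repository, with `example/config.json` as the primary configuration.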
 ---
-## 0. TL;DR / 一句话概览
+## Table of Contents
-
+1. Project Goal and Research Questions
-We generate multivariate ICS time‑series by **(1) learning temporal trend with GRU** and **(2) learning residuals with a hybrid diffusion model** (continuous DDPM + discrete masked diffusion). We then evaluate with **tie‑aware KS** and run **Type‑aware postprocessing** for diagnostic KS reduction.
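+2. Data and Feature Schema
 3. Preprocessing and Statistics Files
 4. Overall Model Architecture
 5. Training Flow (Step by Step)
 6. Sampling and Export Flow
 7. Evaluation Framework and Metrics
 8. Diagnostic Tools and Common Scripts
 9. Type‑aware Design (Divide and Conquer by Type)
 10. One-Command Run and Common Commands
 11. Output Files
 12. Current Configuration and Key Hyperparameters
 13. Common Issues and Causes of Slowness
 14. Known Limitations and Future Directions
 15. File Tree (Condensed)
 16. File Responsibilities (File by File)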
 ---
-## 1. Project Goal / 项目目标
+## 1. Project Goal and Research Questions
-We want synthetic ICS sequences that are:
+The project generates multivariate time-series data for industrial control systems (ICS) that satisfies three requirements:
 1) **Distribution‑aligned** (per‑feature CDF matches real data → low KS)
 2) **Temporally consistent** (lag‑1 correlation and trend are realistic)
 3) **Discrete‑valid** (state tokens are legal and frequency‑consistent)
-This is hard because **distribution** and **temporal structure** often conflict in a single model.
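+- **Distribution consistency**: each variable's marginal distribution is close to the real one (measured by KS)
 - **Temporal consistency**: the sequence structure is plausible, with lag‑1 correlation and trends matching the real data
 - **Discrete validity**: discrete variables (states/modes) must be legal tokens with a plausible distribution (measured by JSD)
 Core difficulties:
 - Temporal structure and distribution alignment often conflict
 - Real data contains program-driven and event-driven variables that a pure DDPM struggles to learn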
 ---
-## 2. Data & Feature Schema / 数据与特征结构
+## 2. Data and Feature Schema
-**Input data**: HAI CSV files (compressed) in `dataset/hai/hai-21.03/`.
+**Data source**: HAI `train*.csv.gz` (multiple files)
-**Feature split**: `example/feature_split.json`
+**Feature split** (see `example/feature_split.json`):
- `continuous`: real‑valued sensors/actuators
+- `continuous`: continuous variables (sensors/actuators)
- `discrete`: state tokens / modes
+- `discrete`: discrete variables (states/modes)
- `time_column`: time index (not trained)
+- `time_column`: time column (not used in training)
 ---
-## 3. Preprocessing / 预处理
+## 3. Preprocessing and Statistics Files
-File: `example/prepare_data.py`
+Script: `example/prepare_data.py`
-### Continuous features
+### 3.1 Continuous Variables
- Mean/std statistics
+- Compute mean/std
- Quantile table (if `use_quantile_transform=true`)
+- If `use_quantile_transform` is enabled: compute the quantile table (CDF)
- Optional transforms (log1p etc.)
+- Output: `example/results/cont_stats.json`
 - Output: `example/results/cont_stats.json`
-### Discrete features
+### 3.2 Discrete Variables
- Token vocab from data
+- Build the vocab from the data
- Output: `example/results/disc_vocab.json`
+- Output: `example/results/disc_vocab.json`
-File: `example/data_utils.py` contains
+### 3.3 Data Utilities
- Normalization / inverse
+`example/data_utils.py` provides:
- Quantile transform / inverse
+- Normalization / inverse normalization
- Post‑calibration helpers
+- Quantile transform / inverse transform
 - Optional post-calibration (quantile calibration)
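The quantile transform and its inverse listed above can be pictured with a small standalone sketch. It does not reproduce `example/data_utils.py`: the function names `build_quantile_table`, `quantile_transform`, and `quantile_inverse`, the table size, and the linear interpolation are all assumptions chosen for illustration.

```python
from bisect import bisect_right


def build_quantile_table(values, n_quantiles=5):
    """Evenly spaced empirical quantiles of `values` (hypothetical helper)."""
    xs = sorted(values)
    table = []
    for i in range(n_quantiles):
        pos = i * (len(xs) - 1) / (n_quantiles - 1)
        lo = int(pos)
        hi = min(lo + 1, len(xs) - 1)
        # Linear interpolation between the two neighbouring order statistics.
        table.append(xs[lo] + (pos - lo) * (xs[hi] - xs[lo]))
    return table


def _interp(x, xp, fp):
    """Piecewise-linear interpolation, clamped at both ends of xp."""
    if x <= xp[0]:
        return fp[0]
    if x >= xp[-1]:
        return fp[-1]
    j = bisect_right(xp, x) - 1  # xp[j] <= x < xp[j + 1]
    frac = (x - xp[j]) / (xp[j + 1] - xp[j])
    return fp[j] + frac * (fp[j + 1] - fp[j])


def quantile_transform(x, table):
    """Map a raw value to its approximate CDF position in [0, 1]."""
    probs = [i / (len(table) - 1) for i in range(len(table))]
    return _interp(x, table, probs)


def quantile_inverse(u, table):
    """Map a CDF position in [0, 1] back to the raw value scale."""
    probs = [i / (len(table) - 1) for i in range(len(table))]
    return _interp(u, probs, table)
```

A value inside the table's range should survive a round trip, for example `quantile_inverse(quantile_transform(1.5, table), table)` with `table = build_quantile_table([0, 1, 2, 3, 4])`. Values outside the range are clamped, which is one reason a real pipeline may need additional clipping or calibration.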
 ---
-## 4. Architecture / 模型结构
+## 4. Overall Model Architecture
-### 4.1 Stage‑1 Temporal GRU (Trend)
+The project uses a **two-stage + hybrid diffusion** architecture:
 File: `example/hybrid_diffusion.py`
 - Class: `TemporalGRUGenerator`
 - Input: continuous sequence
 - Output: **trend sequence** (teacher forced)
 - Purpose: capture temporal structure
-### 4.2 Stage‑2 Hybrid Diffusion (Residual)
+### 4.1 Stage‑1 Temporal GRU
-File: `example/hybrid_diffusion.py`
+- Purpose: learn the sequence trend and temporal structure
 - Input: continuous-variable sequence
 - Output: trend (the trend sequence)
-**Continuous branch**
+### 4.2 Stage‑2 Hybrid Diffusion
- Gaussian DDPM
+- Purpose: learn the residual distribution (decoupling temporal structure from distribution)
- Predicts **residual** (or noise)
+- Continuous variables: Gaussian DDPM
 - Discrete variables: mask diffusion with a classification head
-**Discrete branch**
+### 4.3 Backbone Selection
- Mask diffusion (masked tokens)
+- Current config: `backbone_type = transformer`
- Classifier head per discrete column
+- Optional: GRU (lower GPU memory use, more stable)
 **Backbone**
 - Current config uses **Transformer encoder** (`backbone_type=transformer`)
 - GRU is still supported as option
 **Conditioning**
 - File‑id conditioning (`use_condition=true`, `condition_type=file_id`)
 - Type‑1 (setpoint/demand) can be passed as **continuous condition** (`cond_cont`)
 ---
-## 5. Training Flow / 训练流程
+## 5. Training Flow (Step by Step)
 File: `example/train.py`
-### 5.1 Stage‑1 Temporal training
+Script: `example/train.py`
 - Use continuous features (excluding Type1/Type5)
 - Teacher‑forced GRU predicts next step
 - Loss: **MSE**
 - Output: `temporal.pt`
-### 5.2 Stage‑2 Diffusion training
+### Step 1: Temporal Training
- Compute residual: `x_resid = x_cont - trend`
+- Input: continuous sequence
- Sample time step `t`
+- The GRU predicts the next step under teacher forcing
- Add noise for continuous; mask tokens for discrete
+- Loss: MSE
- Model predicts:
+- Output: `temporal.pt`
  - **eps_pred** for continuous residual
  - logits for discrete tokens
-### Loss design
+### Step 2: Diffusion Training
- Continuous loss: MSE on eps or x0 (`cont_target`)
+- Compute the residual: `x_resid = x_cont - trend`
- Optional weighting: inverse variance (`cont_loss_weighting=inv_std`)
+- Sample a timestep `t`
- Optional SNR weighting (`snr_weighted_loss`)
+- Continuous: add noise
- Optional quantile loss (align residual distribution)
+- Discrete: mask tokens
- Optional residual mean/std loss
+- The model predicts eps / logits
- Discrete loss: cross‑entropy on masked tokens
+
- Total: `loss = λ * loss_cont + (1‑λ) * loss_disc`
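The corruption step of Stage‑2 training (residual, timestep, Gaussian noise, token masking) can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `example/train.py`: the linear beta schedule, the mask probability `t / T`, and the names `make_alpha_bar` and `corrupt_step` are all hypothetical.

```python
import math
import random


def make_alpha_bar(timesteps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bar, prod = [], 1.0
    for i in range(timesteps):
        beta = beta_start + (beta_end - beta_start) * i / max(timesteps - 1, 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar


def corrupt_step(x_cont, trend, tokens, t, alpha_bar, mask_id, rng):
    """Build one corrupted training example at timestep t (1-indexed)."""
    # Residual modelled by the diffusion stage: x_resid = x_cont - trend.
    x_resid = [x - m for x, m in zip(x_cont, trend)]
    a = alpha_bar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x_resid]
    # Gaussian forward process: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps.
    x_t = [math.sqrt(a) * r + math.sqrt(1.0 - a) * e
           for r, e in zip(x_resid, eps)]
    # Assumed mask schedule: each token is masked with probability t / T.
    p_mask = t / len(alpha_bar)
    masked = [rng.random() < p_mask for _ in tokens]
    tokens_t = [mask_id if m else tok for tok, m in zip(tokens, masked)]
    return x_resid, eps, x_t, tokens_t, masked
```

For example, `corrupt_step([1.5, 2.0], [1.0, 1.0], [3, 4], 600, make_alpha_bar(600), -1, random.Random(0))` masks every token, because the assumed mask probability reaches 1 at `t = T`. The model would then be trained to recover `eps` (or `x0`) from `x_t` and the original tokens from `tokens_t`.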
+### Loss Design
 - Continuous: MSE (on eps or x0)
 - Discrete: cross-entropy (on masked positions)
 - Total loss: `loss = λ * loss_cont + (1-λ) * loss_disc`
 - Optional weighting and auxiliary terms:
  - inverse‑std
  - SNR‑weighted
  - quantile loss
  - residual stat loss
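The basic combination `loss = λ * loss_cont + (1-λ) * loss_disc` can be written out as a minimal sketch. It covers only the two core terms and leaves out the optional weightings; the helper names and the default `lam=0.5` are placeholders, since the value of λ is not stated here.

```python
import math


def mse(pred, target):
    """Mean squared error over the continuous outputs."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def masked_cross_entropy(logits, targets, masked):
    """Mean cross-entropy computed over masked positions only."""
    total, count = 0.0, 0
    for row, tgt, is_masked in zip(logits, targets, masked):
        if not is_masked:
            continue  # unmasked tokens do not contribute to the loss
        m = max(row)
        # Numerically stable log-sum-exp of the logits.
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[tgt]
        count += 1
    return total / count if count else 0.0


def hybrid_loss(cont_pred, cont_target, logits, targets, masked, lam=0.5):
    """loss = lam * loss_cont + (1 - lam) * loss_disc (lam is a placeholder)."""
    loss_cont = mse(cont_pred, cont_target)
    loss_disc = masked_cross_entropy(logits, targets, masked)
    return lam * loss_cont + (1.0 - lam) * loss_disc
```

With a single scalar λ, raising the weight on one branch necessarily lowers it on the other, so the two terms should be on comparable scales before λ is tuned.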
 ---
-## 6. Sampling & Export / 采样与导出
+## 6. Sampling and Export Flow
 File: `example/export_samples.py`
-Steps:
+Script: `example/export_samples.py`
-1) Initialize continuous with noise
+
-2) Initialize discrete with masks
+Steps:
-3) Reverse diffusion loop from `t=T..0`
+1) Initialize noise (continuous)
-4) Add trend back (if temporal stage enabled)
+2) Initialize masks (discrete)
-5) Inverse transforms (quantile → raw)
+3) Reverse diffusion from t=T..0
-6) Clip/bound if configured
+4) Add the trend back
-7) Merge back Type1 (conditioning) and Type5 (derived)
+5) Inverse transforms (quantile / normalization)
-8) Write `generated.csv`
+6) Assemble the CSV
 Output: `example/results/generated.csv`
 ---
-## 7. Evaluation / 评估
+## 7. Evaluation Framework and Metrics
 File: `example/evaluate_generated.py`
-### Metrics
+Script: `example/evaluate_generated.py`
 - **KS (tie‑aware)** for continuous
 - **JSD** for discrete
 - **lag‑1 correlation** for temporal consistency
 - quantile diffs, mean/std errors
-### Important
+### Continuous Metrics
- Reference supports **glob** and aggregates **all matched files**
+- **KS (tie‑aware)**
- KS implementation is **tie‑aware** (correct for spiky/quantized data)
+- quantile diff
 - lag‑1 correlation
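Two of the continuous metrics above, the tie‑aware KS statistic and the lag‑1 correlation, can be sketched in a few lines. This is one straightforward way to handle ties (evaluate both empirical CDFs only at distinct observed values); whether `example/evaluate_generated.py` does exactly this is an assumption, and the function names are hypothetical.

```python
import math
from bisect import bisect_right


def ks_tie_aware(real, synth):
    """Two-sample KS statistic evaluated only at distinct observed values."""
    a, b = sorted(real), sorted(synth)
    d = 0.0
    for v in sorted(set(a) | set(b)):
        # bisect_right counts every sample <= v, so a run of tied values
        # is consumed in one step on both sides before the CDFs are compared.
        fa = bisect_right(a, v) / len(a)
        fb = bisect_right(b, v) / len(b)
        d = max(d, abs(fa - fb))
    return d


def lag1_corr(xs):
    """Pearson correlation between x[t] and x[t + 1]."""
    x, y = xs[:-1], xs[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    vx = sum((p - mx) ** 2 for p in x)
    vy = sum((q - my) ** 2 for q in y)
    # Convention chosen here: a constant series returns 0.0.
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0
```

The tie handling matters for spiky or quantized signals: an implementation that advances through the two sorted samples one element at a time can report a spurious gap in the middle of a run of equal values, even when both samples are identical.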
-Outputs:
+### Discrete Metrics
- `example/results/eval.json`
+- JSD
 - Invalid-token ratio
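The two discrete metrics can be illustrated on plain token lists. This is a sketch only: the base‑2 logarithm (which bounds JSD to [0, 1]) and the function names are assumptions, not details taken from the evaluation script.

```python
import math
from collections import Counter


def jsd(real_tokens, synth_tokens):
    """Jensen-Shannon divergence between two token frequency distributions."""
    p_counts, q_counts = Counter(real_tokens), Counter(synth_tokens)
    n_p, n_q = len(real_tokens), len(synth_tokens)
    total = 0.0
    for tok in set(p_counts) | set(q_counts):
        p, q = p_counts[tok] / n_p, q_counts[tok] / n_q
        m = 0.5 * (p + q)  # mixture distribution
        if p > 0:
            total += 0.5 * p * math.log2(p / m)
        if q > 0:
            total += 0.5 * q * math.log2(q / m)
    return total


def invalid_token_ratio(synth_tokens, vocab):
    """Fraction of generated tokens that are not in the real vocabulary."""
    vocab = set(vocab)
    return sum(tok not in vocab for tok in synth_tokens) / len(synth_tokens)
```

Identical frequency distributions give a JSD of 0 and fully disjoint ones give 1 under the base‑2 convention. The invalid-token ratio is a separate check, because a sample can match frequencies well on legal tokens while still emitting a few illegal ones.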
 ### Reading the Reference
 - Supports the `train*.csv.gz` glob
 - Automatically aggregates all matched files
 ---
-## 8. Diagnostics / 诊断工具
+## 8. Diagnostic Tools and Common Scripts
- `example/diagnose_ks.py`: CDF plots and per‑feature KS
+- `diagnose_ks.py`: CDF visualization
- `example/ranked_ks.py`: ranked KS + contribution
+- `ranked_ks.py`: features ranked by KS contribution
- `example/filtered_metrics.py`: filtered KS excluding outliers
+- `filtered_metrics.py`: KS after filtering out outlier features
- `example/program_stats.py`: Type‑1 stats
+- `program_stats.py`: Type1 statistics
- `example/controller_stats.py`: Type‑2 stats
+- `controller_stats.py`: Type2 statistics
- `example/actuator_stats.py`: Type‑3 stats
+- `actuator_stats.py`: Type3 statistics
- `example/pv_stats.py`: Type‑4 stats
+- `pv_stats.py`: Type4 statistics
- `example/aux_stats.py`: Type‑6 stats
+- `aux_stats.py`: Type6 statistics
 ---
-## 9. Type‑Aware Modeling / 类型化分离
+## 9. Type‑aware Design (Divide and Conquer by Type)
-To reduce KS dominated by a few variables, the project uses **Type categories** defined in config:
+In real ICS data, some variables are hard for a DDPM to learn, so variables are divided into types:
 - **Type1**: setpoints / demand (schedule‑driven)
 - **Type2**: controller outputs
 - **Type3**: actuator positions
 - **Type4**: PV sensors
 - **Type5**: derived tags
 - **Type6**: auxiliary / coupling
-### Current implementation (diagnostic KS baseline)
+- **Type1**: setpoint/demand (schedule-driven)
-File: `example/postprocess_types.py`
+- **Type2**: controller outputs
- Type1/2/3/5/6 → **empirical resampling** from real distribution
+- **Type3**: actuator positions
- Type4 → keep diffusion output
+- **Type4**: PV sensors
 - **Type5**: derived tags
 - **Type6**: aux/coupling
-This is **not** the final model, but provides a **KS‑upper bound** for diagnosis.
+Script: `example/postprocess_types.py`
-Outputs:
+The current implementation is a **KS‑only baseline**:
- `example/results/generated_post.csv`
+- Type1/2/3/5/6 → empirical resampling
- `example/results/eval_post.json`
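+- Type4 → still uses diffusion output
 Purpose:
 - Quickly diagnose the best KS that is achievable (a bound on attainable KS performance)
 - Does not guarantee a realistic joint distribution
 Output: `example/results/generated_post.csv`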
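The idea of the KS‑only baseline can be sketched as per-column empirical resampling. This is not the code in `example/postprocess_types.py`: the function name, the dict-of-lists data layout, the type labels, and the choice to draw each column independently are assumptions for illustration.

```python
import random


def resample_columns(generated, real, type_of, keep_types=("type4",), seed=0):
    """Replace non-kept columns with draws from the real empirical distribution.

    generated, real: dict mapping column name -> list of values.
    type_of: dict mapping column name -> type label (hypothetical labels).
    """
    rng = random.Random(seed)
    out = {}
    for col, values in generated.items():
        if type_of.get(col) in keep_types:
            # Kept types (Type4 here) retain the diffusion output unchanged.
            out[col] = list(values)
        else:
            # Independent draws per column: the marginal matches the real
            # data, but relations between columns and over time are lost.
            out[col] = [rng.choice(real[col]) for _ in values]
    return out
```

Because every replaced column is drawn on its own, its marginal distribution approaches the real one and KS falls, while cross-feature and temporal relations are not preserved. That is consistent with using this step for diagnosis rather than as a final generator.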
 ---
-## 10. Pipeline / 一键流程
+## 10. One-Command Run and Common Commands
-File: `example/run_all.py`
+### Full Pipeline (Recommended)
 Default pipeline:
 1) prepare_data
 2) train
 3) export_samples
 4) evaluate_generated (generated.csv)
 5) postprocess_types (generated_post.csv)
 6) evaluate_generated (eval_post.json)
 7) diagnostics scripts
 **Linux**:
 ```bash
 python example/run_all.py --device cuda --config example/config.json
 ```
-**Windows (PowerShell)**:
+### Evaluate Only (No Training)
-```powershell
+```bash
-python run_all.py --device cuda --config config.json
+python example/run_all.py --skip-prepare --skip-train --skip-export
 ```
 ### Train Only (No Evaluation)
 ```bash
 python example/run_all.py --skip-eval --skip-postprocess --skip-post-eval --skip-diagnostics
 ```
 ---
-## 11. Current Configuration (Key Defaults)
+## 11. Output Files
-From `example/config.json`:
+
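 - `generated.csv`: raw diffusion output
 - `generated_post.csv`: output after KS‑only postprocessing
 - `eval.json`: evaluation of the raw output
 - `eval_post.json`: evaluation of the postprocessed output
 - `cont_stats.json` / `disc_vocab.json`: statistics files
 - `*_stats.json`: per-Type statistics reports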
 ---
 ## 12. Current Configuration (Key Hyperparameters)
 From `example/config.json`:
 - backbone_type: **transformer**
 - timesteps: 600
 - seq_len: 96
 - batch_size: 16
- cont_target: `x0`
+- cont_target: x0
- cont_loss_weighting: `inv_std`
+- cont_loss_weighting: inv_std
 - snr_weighted_loss: true
 - quantile_loss_weight: 0.2
 - use_quantile_transform: true
@@ -221,41 +242,30 @@ From `example/config.json`:
 ---
-## 12. What’s Actually Trained vs What’s Post‑Processed
+## 13. Why Runs Are Slow
-**Trained**
+1) Two-stage training (temporal + diffusion)
- Temporal GRU (trend)
+2) Evaluation reads the full set of train*.csv.gz files
- Diffusion residual model (continuous + discrete)
+3) run_all runs every diagnostic script by default
-
+4) Large timesteps / seq_len
 **Post‑Processed (KS‑only)**
 - Type1/2/3/5/6 replaced by empirical resampling
 This is important: postprocess improves KS but **may break joint realism**.
 ---
-## 13. Why It’s Still Hard / 当前难点
+## 14. Known Limitations and Future Directions
- Type1/2/3 are **event‑driven** and **piecewise constant**
+Limitations:
- Diffusion (Gaussian DDPM + MSE) tends to smooth/blur these
+- Type1/2/3 still dominate KS
- Temporal vs distribution objectives pull in opposite directions
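+- The KS‑only baseline breaks the joint distribution
 - Temporal structure and distribution trade off against each other
 Directions:
 - Build conditional models for Type1/2/3
 - Add regime conditioning for Type4
 - Joint metrics (cross‑feature correlation)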
 ---
-## 14. Where To Improve Next / 下一步方向
+## 15. File Tree (Condensed)
 1) Replace KS‑only postprocess with **conditional generators**:
   - Type1: program generator (HMM / schedule)
   - Type2: controller emulator (PID‑like)
   - Type3: actuator dynamics (dwell + rate + saturation)
 2) Add regime conditioning for Type4 PVs
 3) Joint realism checks (cross‑feature correlation)
 ---
 ## 15. Key Files (Complete but Pruned)
 ```
 mask-ddpm/
@@ -291,18 +301,25 @@ mask-ddpm/
    aux_stats.py
    postprocess_types.py
    results/
      generated.csv
      generated_post.csv
      eval.json
      eval_post.json
      cont_stats.json
      disc_vocab.json
      metrics_history.csv
 ```
 ---
-## 16. Summary / 总结
+## 16. File Responsibilities (File by File)
-The current project is a **hybrid diffusion system** with a **two‑stage temporal+residual design**, built to balance **distribution alignment** and **temporal realism**. The architecture is modular, with explicit type‑aware diagnostics and postprocessing, and supports both GRU and Transformer backbones. The remaining research challenge is to replace KS‑only postprocessing with **conditional, structurally consistent generators** for Type1/2/3/5/6 features.
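+- `prepare_data.py`: computes statistics for continuous/discrete features
 - `data_utils.py`: preprocessing and transform functions
 - `hybrid_diffusion.py`: model definitions (Temporal + Diffusion)
 - `train.py`: two-stage training
 - `export_samples.py`: sampling and export
 - `evaluate_generated.py`: evaluation metrics
 - `run_all.py`: one-command pipeline
 - `postprocess_types.py`: Type‑aware KS‑only baseline
 - `diagnose_ks.py`: CDF diagnostics
 - `ranked_ks.py`: KS ranking
 - `filtered_metrics.py`: filtered KS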
 ---
 # End
 If you need a more paper-style version (with formulas, pseudocode, and experiment tables), it can be added later.