Rewrite report as full user manual

2026-01-28 22:34:10 +08:00
parent 6fb53dd5c1
commit f6be8a6ecb

report.md

@@ -1,218 +1,239 @@
# mask-ddpm Project Manual (Complete, Detailed Edition)
This document is a **complete, beginner-friendly** description of the current project implementation as of the latest code in this repo. It explains **what the project does**, **how data flows**, **what each file is for**, and **why the architecture is designed this way**.
> The goal is that readers **unfamiliar with diffusion or time-series modeling** can still understand: what the project is, how to run it, what each file does, what each training step learns, and why it is designed this way.
>
> Scope: the current repository code (main configuration: `example/config.json`).
---
## 0. TL;DR / One-Sentence Overview
We generate multivariate ICS time series by **(1) learning the temporal trend with a GRU** and **(2) learning the residuals with a hybrid diffusion model** (continuous DDPM + discrete masked diffusion). We then evaluate with **tie-aware KS** and run **Type-aware post-processing** for diagnostic KS reduction.
## Table of Contents
1. Project goal & research questions
2. Data & feature schema
3. Preprocessing & statistics files
4. Model architecture
5. Training flow (step by step)
6. Sampling & export flow
7. Evaluation & metrics
8. Diagnostic tools & common scripts
9. Type-aware design (divide and conquer by type)
10. One-command pipeline & common commands
11. Output files
12. Current configuration (key hyperparameters)
13. What's trained vs what's post-processed, and why runs are slow
14. Known limitations & next steps
15. File tree (pruned)
16. File responsibilities (file by file)
---
## 1. Project Goal & Research Questions
We want to generate multivariate ICS (industrial control system) time series that are:
1) **Distribution-aligned**: each feature's CDF matches the real data (low KS)
2) **Temporally consistent**: lag-1 correlation and trends are realistic
3) **Discrete-valid**: discrete variables (states/modes) are legal tokens with realistic frequencies (low JSD)
Core difficulties:
- Distribution alignment and temporal structure often conflict within a single model.
- The real data contains "program-driven / event-driven" variables that a pure DDPM struggles to learn.
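For reference, the three headline metrics, written as standard textbook definitions (the evaluation code computes empirical, tie-aware versions of these):
```latex
% KS: largest CDF gap between real and generated values of one feature
\mathrm{KS} = \sup_x \big| F_{\text{real}}(x) - F_{\text{gen}}(x) \big|

% JSD between real and generated token distributions P and Q, with M = (P+Q)/2
\mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M)

% lag-1 autocorrelation of a feature x_t (temporal consistency)
\rho_1 = \operatorname{corr}(x_t,\; x_{t+1})
```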
---
## 2. Data & Feature Schema
**Input data**: HAI CSV files (compressed `train*.csv.gz`, multiple files) in `dataset/hai/hai-21.03/`.
**Feature split** (see `example/feature_split.json`):
- `continuous`: real-valued sensors/actuators
- `discrete`: state tokens / modes
- `time_column`: time index (not used for training)
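A minimal sketch of how a script might read this split file; the top-level keys come from the description above, while the column names in the comments are hypothetical placeholders rather than real HAI tags.
```python
# Load the feature split and separate column groups.
# Key names ("continuous", "discrete", "time_column") are from this manual;
# the example tag names in the comments are hypothetical, not real HAI columns.
import json

with open("example/feature_split.json", "r", encoding="utf-8") as f:
    split = json.load(f)

cont_cols = split["continuous"]   # e.g. ["P1_SENSOR_A", ...] (placeholder names)
disc_cols = split["discrete"]     # e.g. ["P1_STATE", ...] (placeholder names)
time_col = split["time_column"]   # dropped before training

print(f"{len(cont_cols)} continuous, {len(disc_cols)} discrete, time column: {time_col}")
```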
---
## 3. Preprocessing & Statistics Files
Script: `example/prepare_data.py`
### 3.1 Continuous features
- Mean/std statistics
- Quantile table (empirical CDF) if `use_quantile_transform=true`
- Optional transforms (log1p, etc.)
- Output: `example/results/cont_stats.json`
### 3.2 Discrete features
- Token vocabulary built from the data
- Output: `example/results/disc_vocab.json`
### 3.3 Data utilities
`example/data_utils.py` provides:
- Normalization / inverse normalization
- Quantile transform / inverse transform
- Optional post-calibration helpers (quantile calibration)
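To make the quantile (CDF) transform concrete, here is a small self-contained sketch of the idea using empirical quantiles; the actual helpers in `example/data_utils.py` may use a different table format, interpolation, or edge handling.
```python
# Sketch of an empirical quantile (CDF) transform and its inverse; illustration
# only, not the repo's exact implementation.
import numpy as np

def fit_quantile_table(x: np.ndarray, n_quantiles: int = 1000) -> np.ndarray:
    """Store evenly spaced empirical quantiles of the training data."""
    probs = np.linspace(0.0, 1.0, n_quantiles)
    return np.quantile(x, probs)

def to_uniform(x: np.ndarray, table: np.ndarray) -> np.ndarray:
    """Map raw values to [0, 1] through the empirical CDF."""
    probs = np.linspace(0.0, 1.0, len(table))
    return np.interp(x, table, probs)

def from_uniform(u: np.ndarray, table: np.ndarray) -> np.ndarray:
    """Inverse transform: map [0, 1] back to the original value range."""
    probs = np.linspace(0.0, 1.0, len(table))
    return np.interp(u, probs, table)

# Round trip: x_raw -> u -> x_rec should approximately recover x_raw.
x_raw = np.random.lognormal(size=10_000)
table = fit_quantile_table(x_raw)
x_rec = from_uniform(to_uniform(x_raw, table), table)
```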
---
## 4. Model Architecture
The project uses a **two-stage + hybrid-diffusion** architecture.
### 4.1 Stage-1: Temporal GRU (Trend)
File: `example/hybrid_diffusion.py`
- Class: `TemporalGRUGenerator`
- Input: continuous sequence
- Output: **trend sequence** (teacher-forced)
- Purpose: capture temporal structure
### 4.2 Stage-2: Hybrid Diffusion (Residual)
File: `example/hybrid_diffusion.py`
- Purpose: learn the residual distribution, decoupling temporal structure from distribution
**Continuous branch**
- Gaussian DDPM
- Predicts the **residual** (noise or x0)
**Discrete branch**
- Mask diffusion (masked tokens)
- One classifier head per discrete column
**Conditioning**
- File-id conditioning (`use_condition=true`, `condition_type=file_id`)
- Type1 (setpoint/demand) can be passed as a **continuous condition** (`cond_cont`)
### 4.3 Backbone choice
- Current config: `backbone_type = transformer` (Transformer encoder)
- GRU is still supported as an option (less memory, more stable)
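The following is a conceptual sketch of the two-stage idea, not the repo's actual classes or signatures: a hypothetical `TrendGRU` stands in for `TemporalGRUGenerator`, and the closing comments show where the residual decomposition happens.
```python
# Conceptual sketch of the two-stage decomposition (illustrative only).
import torch
import torch.nn as nn

class TrendGRU(nn.Module):
    """Hypothetical stand-in for a teacher-forced trend model."""
    def __init__(self, n_features: int, hidden: int = 128):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_features)

    def forward(self, x_prev: torch.Tensor) -> torch.Tensor:
        # x_prev: (batch, seq_len, n_features), shifted by one step (teacher forcing)
        h, _ = self.gru(x_prev)
        return self.head(h)  # predicted trend for each step

# Stage-2 then works on the residual rather than the raw signal:
#   x_resid = x_cont - trend
# The diffusion model denoises x_resid (continuous branch) while a classifier
# head fills in masked discrete tokens (discrete branch).
```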
---
## 5. Training Flow (Step by Step)
Script: `example/train.py`
### 5.1 Stage-1: Temporal training
- Uses the continuous features (excluding Type1/Type5)
- A teacher-forced GRU predicts the next step
- Loss: **MSE**
- Output: `temporal.pt`
### 5.2 Stage-2: Diffusion training
- Compute the residual: `x_resid = x_cont - trend`
- Sample a time step `t`
- Add Gaussian noise to the continuous part; mask tokens in the discrete part
- The model predicts:
  - **eps_pred** (or x0) for the continuous residual
  - logits for the discrete tokens
### Loss design
- Continuous loss: MSE on eps or x0 (`cont_target`)
- Discrete loss: cross-entropy on the masked tokens
- Total: `loss = λ * loss_cont + (1 - λ) * loss_disc`
- Optional weightings and auxiliary terms:
  - inverse-variance weighting (`cont_loss_weighting=inv_std`)
  - SNR weighting (`snr_weighted_loss`)
  - quantile loss (aligns the residual distribution)
  - residual mean/std loss
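A minimal sketch of one Stage-2 training step under simplifying assumptions (epsilon target, a single discrete column, masking probability `t/T`, none of the optional weightings); the function and argument names here are hypothetical and the real `train.py` differs in its details.
```python
# One Stage-2 training step (illustrative sketch, not the repo's code).
import torch
import torch.nn.functional as F

def stage2_step(model, x_cont, trend, x_disc, alphas_cumprod, mask_id, lam=0.5):
    B, L, _ = x_cont.shape
    T = alphas_cumprod.shape[0]

    # 1) Residual target for the continuous branch.
    x_resid = x_cont - trend

    # 2) Sample a diffusion step and add Gaussian noise to the residual.
    t = torch.randint(0, T, (B,), device=x_cont.device)
    a_bar = alphas_cumprod[t].view(B, 1, 1)
    eps = torch.randn_like(x_resid)
    x_noisy = a_bar.sqrt() * x_resid + (1 - a_bar).sqrt() * eps

    # 3) Mask a random subset of discrete tokens (mask diffusion).
    mask = torch.rand_like(x_disc, dtype=torch.float) < (t.float() / T).view(B, 1)
    x_disc_in = torch.where(mask, torch.full_like(x_disc, mask_id), x_disc)

    # 4) Model predicts noise for the residual and logits for masked tokens.
    eps_pred, logits = model(x_noisy, x_disc_in, t)

    # 5) Combine the two losses: loss = lam * loss_cont + (1 - lam) * loss_disc.
    loss_cont = F.mse_loss(eps_pred, eps)
    loss_disc = F.cross_entropy(logits[mask], x_disc[mask]) if mask.any() else 0.0
    return lam * loss_cont + (1 - lam) * loss_disc
```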
---
## 6. Sampling & Export Flow
Script: `example/export_samples.py`
Steps:
1) Initialize the continuous part with Gaussian noise
2) Initialize the discrete part with mask tokens
3) Run the reverse diffusion loop from `t = T .. 0`
4) Add the trend back (if the temporal stage is enabled)
5) Apply the inverse transforms (quantile → raw, de-normalization)
6) Clip/bound values if configured
7) Merge back Type1 (conditioning) and Type5 (derived) columns
8) Write the result to `example/results/generated.csv`
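For intuition, a bare-bones sketch of the reverse (ancestral) DDPM loop for the continuous residual, assuming an epsilon-predicting model; the repo's sampler additionally un-masks discrete tokens at every step and supports more options.
```python
# Reverse DDPM loop for the continuous residual (illustrative sketch).
import torch

@torch.no_grad()
def sample_residual(model, shape, betas):
    alphas = 1.0 - betas
    a_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                      # step 1: start from pure noise
    for t in reversed(range(len(betas))):       # step 3: t = T-1 .. 0
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_pred, _ = model(x, None, t_batch)   # discrete branch omitted here
        coef = betas[t] / (1 - a_bar[t]).sqrt()
        mean = (x - coef * eps_pred) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise
    return x  # residual; the trend is added back afterwards (step 4)
```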
---
## 7. Evaluation & Metrics
Script: `example/evaluate_generated.py`
### Continuous metrics
- **KS (tie-aware)**, correct for spiky/quantized data
- Quantile diffs, mean/std errors
- lag-1 correlation (temporal consistency)
### Discrete metrics
- **JSD**
- Invalid-token ratio
### Reference data
- The reference path supports **glob** patterns (e.g. `train*.csv.gz`) and aggregates **all matched files**
Output:
- `example/results/eval.json`
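The tie-aware KS idea fits in a few lines: evaluate both empirical CDFs on the union of observed values, so heavily tied (spiky/quantized) signals are handled correctly. This is a sketch of the idea, not the repo's exact implementation.
```python
# Tie-aware two-sample KS statistic (illustrative sketch).
import numpy as np

def ks_tie_aware(real: np.ndarray, gen: np.ndarray) -> float:
    grid = np.union1d(real, gen)                     # evaluation points, incl. tied values
    cdf_real = np.searchsorted(np.sort(real), grid, side="right") / len(real)
    cdf_gen = np.searchsorted(np.sort(gen), grid, side="right") / len(gen)
    return float(np.abs(cdf_real - cdf_gen).max())   # sup |F_real - F_gen|
```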
---
## 8. Diagnostic Tools & Common Scripts
- `example/diagnose_ks.py`: CDF plots and per-feature KS
- `example/ranked_ks.py`: ranked KS + per-feature contribution
- `example/filtered_metrics.py`: filtered KS excluding outlier features
- `example/program_stats.py`: Type1 statistics
- `example/controller_stats.py`: Type2 statistics
- `example/actuator_stats.py`: Type3 statistics
- `example/pv_stats.py`: Type4 statistics
- `example/aux_stats.py`: Type6 statistics
---
## 9. Type-Aware Design (Divide and Conquer by Type)
In real ICS data some variables are hard for a DDPM to learn, and a few of them dominate KS. The config therefore defines **Type categories**:
- **Type1**: setpoints / demand (schedule-driven)
- **Type2**: controller outputs
- **Type3**: actuator positions
- **Type4**: PV sensors
- **Type5**: derived tags
- **Type6**: auxiliary / coupling
### Current implementation (KS-only diagnostic baseline)
Script: `example/postprocess_types.py`
- Type1/2/3/5/6 → **empirical resampling** from the real distribution
- Type4 → keep the diffusion output
This is **not** the final model: it quickly shows the **best reachable KS (an upper bound)** for diagnosis, but it does not preserve joint-distribution realism.
Outputs:
- `example/results/generated_post.csv`
- `example/results/eval_post.json`
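For clarity, the "empirical resampling" baseline boils down to redrawing a column i.i.d. from the real marginal distribution; below is a sketch under that assumption (column names hypothetical).
```python
# Empirical resampling of a single column (illustrative sketch).
# This matches the real per-feature CDF (KS ≈ 0) but ignores temporal and
# cross-feature structure, which is why it is diagnostic only.
import numpy as np
import pandas as pd

def resample_column(real: pd.Series, n: int, rng: np.random.Generator) -> np.ndarray:
    return rng.choice(real.to_numpy(), size=n, replace=True)

# Hypothetical usage: overwrite the non-Type4 columns of the generated frame.
# rng = np.random.default_rng(0)
# gen_df["SOME_TYPE1_TAG"] = resample_column(real_df["SOME_TYPE1_TAG"], len(gen_df), rng)
```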
---
## 10. One-Command Pipeline & Common Commands
File: `example/run_all.py`
Default pipeline:
1) prepare_data
2) train
3) export_samples
4) evaluate_generated (generated.csv)
5) postprocess_types (generated_post.csv)
6) evaluate_generated (eval_post.json)
7) diagnostic scripts
### Full pipeline (recommended, Linux)
```bash
python example/run_all.py --device cuda --config example/config.json
```
### Full pipeline (Windows PowerShell, paths relative to `example/`)
```powershell
python run_all.py --device cuda --config config.json
```
### Evaluate only (skip training)
```bash
python example/run_all.py --skip-prepare --skip-train --skip-export
```
### Train only (skip evaluation)
```bash
python example/run_all.py --skip-eval --skip-postprocess --skip-post-eval --skip-diagnostics
```
---
## 11. Output Files
All outputs live under `example/results/`:
- `generated.csv`: raw diffusion output
- `generated_post.csv`: KS-only post-processed output
- `eval.json`: evaluation of the raw output
- `eval_post.json`: evaluation of the post-processed output
- `cont_stats.json` / `disc_vocab.json`: statistics files
- `*_stats.json`: per-Type statistics reports
---
## 12. Current Configuration (Key Hyperparameters)
From `example/config.json`:
- backbone_type: **transformer**
- timesteps: 600
- seq_len: 96
- batch_size: 16
- cont_target: `x0`
- cont_loss_weighting: `inv_std`
- snr_weighted_loss: true
- quantile_loss_weight: 0.2
- use_quantile_transform: true
@@ -221,41 +242,30 @@ From `example/config.json`:
---
## 13. What's Trained vs What's Post-Processed, and Why Runs Are Slow
**Trained**
- Temporal GRU (trend)
- Diffusion residual model (continuous + discrete)
**Post-processed (KS-only)**
- Type1/2/3/5/6 replaced by empirical resampling
This matters: post-processing improves KS but **may break joint realism**.
**Why a full run is slow**
1) Two training stages (temporal + diffusion)
2) Evaluation reads all `train*.csv.gz` reference files
3) `run_all.py` runs every diagnostic script by default
4) Large `timesteps` / `seq_len`
---
## 14. Known Limitations & Next Steps
**Why it's still hard**
- Type1/2/3 are **event-driven** and **piecewise constant**
- A Gaussian DDPM trained with MSE tends to smooth/blur them
- The temporal and distributional objectives pull in opposite directions
**Current limitations**
- Type1/2/3 still dominate KS
- The KS-only baseline breaks the joint distribution
- There is a temporal-vs-distribution trade-off
**Where to improve next**
1) Replace the KS-only post-process with **conditional generators**:
   - Type1: program generator (HMM / schedule)
   - Type2: controller emulator (PID-like)
   - Type3: actuator dynamics (dwell + rate + saturation); an illustrative sketch follows this section
2) Add regime conditioning for Type4 PVs
3) Joint realism checks (cross-feature correlation)
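To make the Type3 direction concrete, here is a purely illustrative sketch of an actuator-dynamics generator with dwell, rate limiting, and saturation; nothing like this exists in the repo yet, and every name and parameter is hypothetical.
```python
# Hypothetical actuator-dynamics model: follow a command with a minimum dwell
# time before reacting, a limited slew rate, and saturation at [lo, hi].
import numpy as np

def actuator_response(command: np.ndarray, rate_limit: float, lo: float, hi: float,
                      dwell_steps: int) -> np.ndarray:
    pos = np.empty_like(command, dtype=float)
    pos[0] = np.clip(command[0], lo, hi)
    steps_since_change = dwell_steps
    for t in range(1, len(command)):
        # Count how long the command has been stable.
        if command[t] != command[t - 1]:
            steps_since_change = 0
        else:
            steps_since_change += 1
        # Only react once the command has dwelled long enough.
        target = command[t] if steps_since_change >= dwell_steps else pos[t - 1]
        # Rate-limit the move and saturate the position.
        delta = np.clip(target - pos[t - 1], -rate_limit, rate_limit)
        pos[t] = np.clip(pos[t - 1] + delta, lo, hi)
    return pos
```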
---
## 15. File Tree (Pruned)
```
mask-ddpm/
@@ -291,18 +301,25 @@ mask-ddpm/
  aux_stats.py
  postprocess_types.py
  results/
    generated.csv
    generated_post.csv
    eval.json
    eval_post.json
    cont_stats.json
    disc_vocab.json
    metrics_history.csv
```
---
## 16. File Responsibilities (File by File)
- `prepare_data.py`: computes statistics for continuous/discrete features
- `data_utils.py`: preprocessing and transform utilities
- `hybrid_diffusion.py`: model core (Temporal + Diffusion)
- `train.py`: two-stage training
- `export_samples.py`: sampling and export
- `evaluate_generated.py`: evaluation metrics
- `run_all.py`: one-command pipeline
- `postprocess_types.py`: Type-aware KS-only baseline
- `diagnose_ks.py`: CDF diagnostics
- `ranked_ks.py`: ranked KS
- `filtered_metrics.py`: filtered KS
---
# Summary
The current project is a **hybrid diffusion system** with a **two-stage temporal + residual design**, built to balance **distribution alignment** and **temporal realism**. The architecture is modular, with explicit type-aware diagnostics and post-processing, and supports both GRU and Transformer backbones. The remaining research challenge is to replace KS-only post-processing with **conditional, structurally consistent generators** for Type1/2/3/5/6 features.
If you need a more "paper-style" version (with formulas, pseudocode, and experiment tables), this manual can be extended further.