mask-ddpm/report.md
# mask-ddpm Project Report (Detailed)
This report is a **complete, beginner-friendly** description of the current project implementation, reflecting the latest code in this repo. It explains **what the project does**, **how data flows**, **what each file is for**, and **why the architecture is designed this way**.
---
## 0. TL;DR (One-Sentence Overview)
We generate multivariate ICS time series by **(1) learning the temporal trend with a GRU** and **(2) learning the residuals with a hybrid diffusion model** (continuous DDPM + discrete masked diffusion). We then evaluate with a **tie-aware KS statistic** and run **type-aware post-processing** for diagnostic KS reduction.
---
## 1. Project Goal
We want synthetic ICS sequences that are:
1) **Distribution-aligned** (per-feature CDF matches real data → low KS)
2) **Temporally consistent** (lag-1 correlation and trend are realistic)
3) **Discrete-valid** (state tokens are legal and frequency-consistent)
This is hard because **distribution** and **temporal structure** often conflict in a single model.
---
## 2. Data & Feature Schema
**Input data**: HAI CSV files (compressed) in `dataset/hai/hai-21.03/`.
**Feature split**: `example/feature_split.json`
- `continuous`: real-valued sensors/actuators
- `discrete`: state tokens / modes
- `time_column`: time index (not trained)
---
## 3. Preprocessing
File: `example/prepare_data.py`
### Continuous features
- Mean/std statistics
- Quantile table (if `use_quantile_transform=true`)
- Optional transforms (log1p etc.)
- Output: `example/results/cont_stats.json`
### Discrete features
- Token vocab from data
- Output: `example/results/disc_vocab.json`
File: `example/data_utils.py` contains
- Normalization / inverse
- Quantile transform / inverse
- Post-calibration helpers
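To make the transform/inverse pairing concrete, here is a minimal sketch of what normalization and quantile-transform helpers of this kind typically look like. Function names are illustrative, not the actual `data_utils.py` API:

```python
import numpy as np

def normalize(x, mean, std, eps=1e-8):
    """Standardize a continuous feature with precomputed statistics."""
    return (x - mean) / (std + eps)

def denormalize(z, mean, std, eps=1e-8):
    """Inverse of normalize()."""
    return z * (std + eps) + mean

def quantile_transform(x, q_table):
    """Map raw values to roughly Uniform(0,1) via a sorted quantile table."""
    ranks = np.searchsorted(q_table, x, side="right")
    return ranks / len(q_table)

def inverse_quantile_transform(u, q_table):
    """Map uniform values back to the empirical distribution."""
    idx = np.clip((u * len(q_table)).astype(int), 0, len(q_table) - 1)
    return q_table[idx]
```

The round trip is exact for normalization and approximate (up to quantization by the table resolution) for the quantile transform.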
---
## 4. Architecture
### 4.1 Stage 1: Temporal GRU (Trend)
File: `example/hybrid_diffusion.py`
- Class: `TemporalGRUGenerator`
- Input: continuous sequence
- Output: **trend sequence** (teacher-forced)
- Purpose: capture temporal structure
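A minimal sketch of such a teacher-forced trend model (class and layer sizes are illustrative, not the exact `TemporalGRUGenerator` implementation):

```python
import torch
import torch.nn as nn

class TrendGRU(nn.Module):
    """Predict the next continuous observation from the past (teacher-forced)."""
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_features)

    def forward(self, x):
        # x: (batch, seq_len, n_features); output at step t predicts x[t+1]
        h, _ = self.gru(x)
        return self.head(h)

# Teacher-forced training step: feed steps 0..T-2, target steps 1..T-1
model = TrendGRU(n_features=4)
x = torch.randn(2, 96, 4)
pred = model(x[:, :-1])
loss = nn.functional.mse_loss(pred, x[:, 1:])
```

At sampling time the same network can be rolled out autoregressively to produce the trend that Stage 2 later adds its residual to.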
### 4.2 Stage 2: Hybrid Diffusion (Residual)
File: `example/hybrid_diffusion.py`
**Continuous branch**
- Gaussian DDPM
- Predicts **residual** (or noise)
**Discrete branch**
- Mask diffusion (masked tokens)
- Classifier head per discrete column
**Backbone**
- Current config uses **Transformer encoder** (`backbone_type=transformer`)
- A GRU backbone is still supported as an option
**Conditioning**
- File-ID conditioning (`use_condition=true`, `condition_type=file_id`)
- Type1 (setpoint/demand) can be passed as a **continuous condition** (`cond_cont`)
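The two forward corruptions used by the hybrid model can be sketched as follows (a simplified illustration, not the repo's exact noising code): Gaussian noising for the continuous residual and independent token masking for the discrete branch.

```python
import torch

def noise_continuous(x0, t, alphas_cumprod):
    """Sample q(x_t | x_0) for a standard Gaussian DDPM."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1)          # (batch, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    return x_t, eps

def mask_discrete(tokens, t, timesteps, mask_id):
    """Mask each token independently with probability t/T."""
    p = (t.float() / timesteps).view(-1, 1, 1)
    masked = torch.rand_like(tokens, dtype=torch.float) < p
    return torch.where(masked, torch.full_like(tokens, mask_id), tokens), masked
```

The model then predicts the noise (or `x0`) for the continuous branch and the original token identities at the masked positions for the discrete branch.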
---
## 5. Training Flow
File: `example/train.py`
### 5.1 Stage 1: Temporal training
- Use continuous features (excluding Type1/Type5)
- Teacher-forced GRU predicts the next step
- Loss: **MSE**
- Output: `temporal.pt`
### 5.2 Stage 2: Diffusion training
- Compute residual: `x_resid = x_cont - trend`
- Sample time step `t`
- Add noise for continuous; mask tokens for discrete
- Model predicts:
  - **eps_pred** for continuous residual
  - logits for discrete tokens
### Loss design
- Continuous loss: MSE on eps or x0 (`cont_target`)
- Optional weighting: inverse variance (`cont_loss_weighting=inv_std`)
- Optional SNR weighting (`snr_weighted_loss`)
- Optional quantile loss (align residual distribution)
- Optional residual mean/std loss
- Discrete loss: cross-entropy on masked tokens
- Total: `loss = λ * loss_cont + (1 − λ) * loss_disc`
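The λ-weighted combination can be sketched as below (a simplified illustration; the optional inv_std/SNR/quantile weighting terms listed above are omitted, and the function name is hypothetical):

```python
import torch
import torch.nn.functional as F

def hybrid_loss(eps_pred, eps_true, disc_logits, disc_targets, masked, lam=0.5):
    """loss = λ * MSE(continuous) + (1 − λ) * CE(masked discrete tokens)."""
    loss_cont = F.mse_loss(eps_pred, eps_true)
    # Cross-entropy per token, then averaged only over masked positions
    ce = F.cross_entropy(
        disc_logits.reshape(-1, disc_logits.size(-1)),
        disc_targets.reshape(-1),
        reduction="none",
    ).reshape(disc_targets.shape)
    loss_disc = (ce * masked).sum() / masked.sum().clamp(min=1)
    return lam * loss_cont + (1 - lam) * loss_disc
```

Restricting the cross-entropy to masked positions matters: unmasked tokens are given to the model as input, so including them would let it trivially copy.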
---
## 6. Sampling & Export
File: `example/export_samples.py`
Steps:
1) Initialize continuous with noise
2) Initialize discrete with masks
3) Reverse diffusion loop from `t=T..0`
4) Add trend back (if temporal stage enabled)
5) Inverse transforms (quantile → raw)
6) Clip/bound if configured
7) Merge back Type1 (conditioning) and Type5 (derived)
8) Write `generated.csv`
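A highly simplified version of the reverse loop for the continuous branch (DDPM ancestral sampling) is sketched below; the real export script additionally unmasks discrete tokens at each step, adds the trend back, and applies the inverse transforms:

```python
import torch

@torch.no_grad()
def sample_continuous(model, shape, betas):
    """Ancestral DDPM sampling, starting from pure noise."""
    alphas = 1.0 - betas
    a_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                    # step 1: initialize with noise
    for t in reversed(range(len(betas))):     # step 3: reverse loop t=T..0
        eps = model(x, torch.full((shape[0],), t))
        coef = betas[t] / (1 - a_bar[t]).sqrt()
        x = (x - coef * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x
```

The numbered step indices in the comments refer to the export steps listed above.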
---
## 7. Evaluation
File: `example/evaluate_generated.py`
### Metrics
- **KS (tie-aware)** for continuous
- **JSD** for discrete
- **lag-1 correlation** for temporal consistency
- quantile diffs, mean/std errors
### Important
- The reference path supports **glob** patterns and aggregates **all matched files**
- The KS implementation is **tie-aware** (correct for spiky/quantized data)
Outputs:
- `example/results/eval.json`
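A tie-aware two-sample KS statistic can be computed by comparing the empirical CDFs at every observed value, which stays correct for quantized or spiky features where rank-based formulas mishandle ties. A minimal sketch (not the repo's exact implementation):

```python
import numpy as np

def ks_tie_aware(a, b):
    """Max absolute difference between the two empirical CDFs."""
    support = np.unique(np.concatenate([a, b]))
    # side="right" counts ties into the CDF value at each support point
    cdf_a = np.searchsorted(np.sort(a), support, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), support, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()
```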
---
## 8. Diagnostics
- `example/diagnose_ks.py`: CDF plots and per-feature KS
- `example/ranked_ks.py`: ranked KS + contribution
- `example/filtered_metrics.py`: filtered KS excluding outliers
- `example/program_stats.py`: Type1 stats
- `example/controller_stats.py`: Type2 stats
- `example/actuator_stats.py`: Type3 stats
- `example/pv_stats.py`: Type4 stats
- `example/aux_stats.py`: Type6 stats
---
## 9. Type-Aware Modeling
To reduce KS dominated by a few variables, the project uses **Type categories** defined in config:
- **Type1**: setpoints / demand (schedule-driven)
- **Type2**: controller outputs
- **Type3**: actuator positions
- **Type4**: PV sensors
- **Type5**: derived tags
- **Type6**: auxiliary / coupling
### Current implementation (diagnostic KS baseline)
File: `example/postprocess_types.py`
- Type1/2/3/5/6 → **empirical resampling** from real distribution
- Type4 → keep diffusion output
This is **not** the final model, but it provides a **KS upper bound** for diagnosis.
Outputs:
- `example/results/generated_post.csv`
- `example/results/eval_post.json`
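The empirical-resampling baseline amounts to replacing a generated column with i.i.d. draws from the real marginal. It matches the marginal distribution by construction, which is exactly why it only serves as a KS diagnostic: it destroys temporal and cross-feature structure. A minimal sketch (function name is illustrative):

```python
import numpy as np

def empirical_resample(real_col, n, rng=None):
    """Draw n values i.i.d. from the empirical distribution of real_col."""
    rng = np.random.default_rng(rng)
    return rng.choice(real_col, size=n, replace=True)
```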
---
## 10. Pipeline (One-Command Run)
File: `example/run_all.py`
Default pipeline:
1) prepare_data
2) train
3) export_samples
4) evaluate_generated (generated.csv)
5) postprocess_types (generated_post.csv)
6) evaluate_generated (eval_post.json)
7) diagnostics scripts
**Linux**:
```bash
python example/run_all.py --device cuda --config example/config.json
```
**Windows (PowerShell, run from the `example/` directory)**:
```powershell
python run_all.py --device cuda --config config.json
```
---
## 11. Current Configuration (Key Defaults)
From `example/config.json`:
- backbone_type: **transformer**
- timesteps: 600
- seq_len: 96
- batch_size: 16
- cont_target: `x0`
- cont_loss_weighting: `inv_std`
- snr_weighted_loss: true
- quantile_loss_weight: 0.2
- use_quantile_transform: true
- cont_post_calibrate: true
- use_temporal_stage1: true
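Assembled from the defaults listed above, the relevant fragment of `example/config.json` would look roughly like this (abridged; keys not mentioned above are omitted):

```json
{
  "backbone_type": "transformer",
  "timesteps": 600,
  "seq_len": 96,
  "batch_size": 16,
  "cont_target": "x0",
  "cont_loss_weighting": "inv_std",
  "snr_weighted_loss": true,
  "quantile_loss_weight": 0.2,
  "use_quantile_transform": true,
  "cont_post_calibrate": true,
  "use_temporal_stage1": true
}
```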
---
## 12. What's Actually Trained vs What's Post-Processed
**Trained**
- Temporal GRU (trend)
- Diffusion residual model (continuous + discrete)
**Post-processed (KS-only)**
- Type1/2/3/5/6 replaced by empirical resampling
This is important: post-processing improves KS but **may break joint realism**.
---
## 13. Why It's Still Hard
- Type1/2/3 are **event-driven** and **piecewise constant**
- Diffusion (Gaussian DDPM + MSE) tends to smooth/blur these signals
- Temporal and distributional objectives pull in opposite directions
---
## 14. Where To Improve Next
1) Replace KS-only post-processing with **conditional generators**:
   - Type1: program generator (HMM / schedule)
   - Type2: controller emulator (PID-like)
   - Type3: actuator dynamics (dwell + rate + saturation)
2) Add regime conditioning for Type4 PVs
3) Joint realism checks (cross-feature correlation)
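As a toy illustration of what the proposed Type3 actuator model could look like, the sketch below combines the three ingredients named above: a dwell interval between re-commands, a per-step rate limit, and saturation at the travel bounds. All names and parameters are hypothetical; this is a design sketch, not project code:

```python
import numpy as np

def actuator_step(pos, target, rate_limit, lo, hi):
    """Move toward target, limited per step and saturated at [lo, hi]."""
    delta = np.clip(target - pos, -rate_limit, rate_limit)  # rate limit
    return float(np.clip(pos + delta, lo, hi))              # saturation

def simulate(targets, dwell=3, rate_limit=0.1, lo=0.0, hi=1.0):
    """Re-command only every `dwell` steps (a crude stand-in for dwell time)."""
    pos, out = 0.0, []
    for i, tgt in enumerate(targets):
        if i % dwell == 0:
            cmd = tgt
        pos = actuator_step(pos, cmd, rate_limit, lo, hi)
        out.append(pos)
    return out
```

Driving such a model with generated Type1/Type2 signals would produce piecewise, rate-limited trajectories that Gaussian DDPM sampling tends to blur.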
---
## 15. Key Files (Complete but Pruned)
```
mask-ddpm/
  report.md
  docs/
    README.md
    architecture.md
    evaluation.md
    decisions.md
    experiments.md
    ideas.md
  example/
    config.json
    config_no_temporal.json
    config_temporal_strong.json
    feature_split.json
    data_utils.py
    prepare_data.py
    hybrid_diffusion.py
    train.py
    sample.py
    export_samples.py
    evaluate_generated.py
    run_all.py
    run_compare.py
    diagnose_ks.py
    filtered_metrics.py
    ranked_ks.py
    program_stats.py
    controller_stats.py
    actuator_stats.py
    pv_stats.py
    aux_stats.py
    postprocess_types.py
    results/
      generated.csv
      generated_post.csv
      eval.json
      eval_post.json
      cont_stats.json
      disc_vocab.json
      metrics_history.csv
```
---
## 16. Summary
The current project is a **hybrid diffusion system** with a **two-stage temporal + residual design**, built to balance **distribution alignment** and **temporal realism**. The architecture is modular, with explicit type-aware diagnostics and post-processing, and it supports both GRU and Transformer backbones. The remaining research challenge is to replace KS-only post-processing with **conditional, structurally consistent generators** for Type1/2/3/5/6 features.
The current project is a **hybrid diffusion system** with a **twostage temporal+residual design**, built to balance **distribution alignment** and **temporal realism**. The architecture is modular, with explicit typeaware diagnostics and postprocessing, and supports both GRU and Transformer backbones. The remaining research challenge is to replace KSonly postprocessing with **conditional, structurally consistent generators** for Type1/2/3/5/6 features.