# mask-ddpm Project Report (Detailed)

This report is a **complete, beginner-friendly** description of the current project implementation, as of the latest code in this repo. It explains **what the project does**, **how data flows**, **what each file is for**, and **why the architecture is designed this way**.

---

## 0. TL;DR

We generate multivariate ICS time series by **(1) learning the temporal trend with a GRU** and **(2) learning the residuals with a hybrid diffusion model** (continuous DDPM + discrete masked diffusion). We evaluate with a **tie-aware KS** statistic and run **type-aware postprocessing** as a diagnostic KS baseline.

---

## 1. Project Goal

We want synthetic ICS sequences that are:

1) **Distribution-aligned** (each feature's CDF matches the real data → low KS)
2) **Temporally consistent** (lag-1 correlation and trend are realistic)
3) **Discrete-valid** (state tokens are legal and frequency-consistent)

This is hard because the **distribution** and **temporal-structure** objectives often conflict in a single model.

---

## 2. Data & Feature Schema

**Input data**: HAI CSV files (compressed) in `dataset/hai/hai-21.03/`.

**Feature split**: `example/feature_split.json`
- `continuous`: real-valued sensors/actuators
- `discrete`: state tokens / modes
- `time_column`: time index (not trained)

---

## 3. Preprocessing

File: `example/prepare_data.py`

### Continuous features
- Mean/std statistics
- Quantile table (if `use_quantile_transform=true`)
- Optional transforms (log1p etc.)
- Output: `example/results/cont_stats.json`

### Discrete features
- Token vocabulary built from the data
- Output: `example/results/disc_vocab.json`

File: `example/data_utils.py` contains:
- Normalization / inverse
- Quantile transform / inverse
- Post-calibration helpers

---

## 4. Architecture

### 4.1 Stage-1 Temporal GRU (Trend)

File: `example/hybrid_diffusion.py`
- Class: `TemporalGRUGenerator`
- Input: continuous sequence
- Output: **trend sequence** (teacher-forced)
- Purpose: capture temporal structure

### 4.2 Stage-2 Hybrid Diffusion (Residual)

File: `example/hybrid_diffusion.py`

**Continuous branch**
- Gaussian DDPM
- Predicts the **residual** (or its noise)

**Discrete branch**
- Masked diffusion (tokens are progressively masked)
- One classifier head per discrete column

**Backbone**
- The current config uses a **Transformer encoder** (`backbone_type=transformer`)
- A GRU backbone is still supported as an option

**Conditioning**
- File-id conditioning (`use_condition=true`, `condition_type=file_id`)
- Type1 (setpoint/demand) features can be passed as a **continuous condition** (`cond_cont`)

---

## 5. Training Flow

File: `example/train.py`

### 5.1 Stage-1 temporal training
- Uses continuous features (excluding Type1/Type5)
- A teacher-forced GRU predicts the next step
- Loss: **MSE**
- Output: `temporal.pt`

### 5.2 Stage-2 diffusion training
- Compute the residual: `x_resid = x_cont - trend`
- Sample a time step `t`
- Add Gaussian noise to the continuous part; mask tokens in the discrete part
- The model predicts:
  - **eps_pred** for the continuous residual
  - logits for the discrete tokens

### Loss design
- Continuous loss: MSE on eps or x0 (`cont_target`)
- Optional weighting: inverse variance (`cont_loss_weighting=inv_std`)
- Optional SNR weighting (`snr_weighted_loss`)
- Optional quantile loss (aligns the residual distribution)
- Optional residual mean/std loss
- Discrete loss: cross-entropy on masked tokens
- Total: `loss = λ * loss_cont + (1 - λ) * loss_disc` (a sketch of one full training step follows below)
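To make the Stage-2 recipe concrete, here is a minimal PyTorch sketch of one training step. It is an illustration, not the code in `train.py`: `model`, `gru`, `alphas_cumprod`, `MASK_ID`, and `lam` are hypothetical stand-ins, and the linear-in-`t` masking schedule is a simplifying assumption.

```python
import torch
import torch.nn.functional as F

def stage2_step(model, gru, x_cont, x_disc, alphas_cumprod, MASK_ID, lam=0.5):
    """One hybrid diffusion training step (illustrative sketch)."""
    B = x_cont.shape[0]
    T_steps = len(alphas_cumprod)

    # Residual w.r.t. the Stage-1 trend (the GRU is frozen here).
    with torch.no_grad():
        trend = gru(x_cont)
    x_resid = x_cont - trend

    # Sample a diffusion step and noise the continuous residual (DDPM forward).
    t = torch.randint(0, T_steps, (B,), device=x_cont.device)
    a_bar = alphas_cumprod[t].view(B, 1, 1)
    eps = torch.randn_like(x_resid)
    x_t = a_bar.sqrt() * x_resid + (1.0 - a_bar).sqrt() * eps

    # Mask a t-dependent fraction of discrete tokens (absorbing-state diffusion).
    mask_prob = (t.float() + 1.0) / T_steps
    mask = torch.rand(x_disc.shape, device=x_cont.device) < mask_prob.view(B, 1, 1)
    x_disc_masked = x_disc.masked_fill(mask, MASK_ID)

    eps_pred, logits = model(x_t, x_disc_masked, t)

    # MSE on eps here; an x0 target works the same way (cont_target).
    loss_cont = F.mse_loss(eps_pred, eps)
    # Cross-entropy only on the masked positions.
    loss_disc = (F.cross_entropy(logits[mask], x_disc[mask])
                 if mask.any() else logits.sum() * 0.0)
    return lam * loss_cont + (1.0 - lam) * loss_disc
```

The optional quantile, mean/std, SNR, and inverse-variance terms from the list above would be added on top of `loss_cont` in the same way.

---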
## 6. Sampling & Export

File: `example/export_samples.py`

Steps:
1) Initialize the continuous channels with noise
2) Initialize the discrete channels with masks
3) Run the reverse diffusion loop from `t = T .. 0`
4) Add the trend back (if the temporal stage is enabled)
5) Apply the inverse transforms (quantile → raw)
6) Clip/bound values if configured
7) Merge back the Type1 (conditioning) and Type5 (derived) columns
8) Write `generated.csv`

---

## 7. Evaluation

File: `example/evaluate_generated.py`

### Metrics
- **KS (tie-aware)** for continuous features
- **JSD** for discrete features
- **lag-1 correlation** for temporal consistency
- quantile diffs, mean/std errors

### Important
- The reference path supports **glob** patterns and aggregates **all matched files**
- The KS implementation is **tie-aware** (correct for spiky/quantized data)

Outputs:
- `example/results/eval.json`

---

## 8. Diagnostics

- `example/diagnose_ks.py`: CDF plots and per-feature KS
- `example/ranked_ks.py`: ranked KS + contribution
- `example/filtered_metrics.py`: filtered KS excluding outliers
- `example/program_stats.py`: Type1 stats
- `example/controller_stats.py`: Type2 stats
- `example/actuator_stats.py`: Type3 stats
- `example/pv_stats.py`: Type4 stats
- `example/aux_stats.py`: Type6 stats

---

## 9. Type-Aware Modeling

To reduce a KS score dominated by a few variables, the project uses **type categories** defined in the config:

- **Type1**: setpoints / demand (schedule-driven)
- **Type2**: controller outputs
- **Type3**: actuator positions
- **Type4**: PV sensors
- **Type5**: derived tags
- **Type6**: auxiliary / coupling

### Current implementation (diagnostic KS baseline)

File: `example/postprocess_types.py`
- Type1/2/3/5/6 → **empirical resampling** from the real distribution
- Type4 → keep the diffusion output

This is **not** the final model; it provides a **KS upper bound** for diagnosis.

Outputs:
- `example/results/generated_post.csv`
- `example/results/eval_post.json`

---

## 10. Pipeline

File: `example/run_all.py`

Default pipeline:
1) prepare_data
2) train
3) export_samples
4) evaluate_generated (generated.csv)
5) postprocess_types (generated_post.csv)
6) evaluate_generated (eval_post.json)
7) diagnostics scripts

**Linux**:
```bash
python example/run_all.py --device cuda --config example/config.json
```

**Windows (PowerShell, from the `example/` directory)**:
```powershell
python run_all.py --device cuda --config config.json
```

---

## 11. Current Configuration (Key Defaults)

From `example/config.json`:

- backbone_type: **transformer**
- timesteps: 600
- seq_len: 96
- batch_size: 16
- cont_target: `x0`
- cont_loss_weighting: `inv_std`
- snr_weighted_loss: true
- quantile_loss_weight: 0.2
- use_quantile_transform: true
- cont_post_calibrate: true
- use_temporal_stage1: true

---

## 12. What's Actually Trained vs What's Post-Processed

**Trained**
- Temporal GRU (trend)
- Diffusion residual model (continuous + discrete)

**Post-processed (KS-only)**
- Type1/2/3/5/6 replaced by empirical resampling

This distinction matters: postprocessing improves KS but **may break joint realism**.

---

## 13. Why It's Still Hard

- Type1/2/3 signals are **event-driven** and **piecewise constant**
- Diffusion (Gaussian DDPM + MSE) tends to smooth/blur such signals
- The temporal and distribution objectives pull in opposite directions

---

## 14. Where To Improve Next

1) Replace the KS-only postprocess with **conditional generators**:
   - Type1: program generator (HMM / schedule)
   - Type2: controller emulator (PID-like)
   - Type3: actuator dynamics (dwell + rate + saturation; see the sketch after this list)
2) Add regime conditioning for Type4 PVs
3) Joint realism checks (cross-feature correlation)
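As a concrete illustration of the Type3 direction, here is a minimal NumPy sketch of an actuator-dynamics generator with dwell, rate limiting, and saturation. All names and parameter values (`simulate_actuator`, the target set, dwell range, slew rate, bounds) are illustrative assumptions, not values taken from the HAI data.

```python
import numpy as np

def simulate_actuator(targets, dwell_range=(20, 120), rate=0.5,
                      lo=0.0, hi=100.0, n_steps=1000, seed=0):
    """Piecewise-constant commands tracked by a rate-limited, saturating actuator."""
    rng = np.random.default_rng(seed)
    pos = np.empty(n_steps)
    x = float(targets[0])
    target, dwell = x, 0
    for i in range(n_steps):
        if dwell <= 0:
            # Pick a new commanded position and hold it for a random dwell time.
            target = float(rng.choice(targets))
            dwell = int(rng.integers(*dwell_range))
        dwell -= 1
        # Move toward the target at a bounded rate, then saturate at the limits.
        x += float(np.clip(target - x, -rate, rate))
        x = float(np.clip(x, lo, hi))
        pos[i] = x
    return pos

# Example: a valve stepping between a few discrete commanded positions.
trace = simulate_actuator(targets=[0.0, 35.0, 70.0, 100.0])
```

Fitting the dwell distribution, slew rate, and target set per tag from the real data would give a Type3 generator that preserves both the marginal distribution and the step-like temporal shape.

---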
## 15. Key Files (Complete but Pruned)

```
mask-ddpm/
  report.md
  docs/
    README.md
    architecture.md
    evaluation.md
    decisions.md
    experiments.md
    ideas.md
  example/
    config.json
    config_no_temporal.json
    config_temporal_strong.json
    feature_split.json
    data_utils.py
    prepare_data.py
    hybrid_diffusion.py
    train.py
    sample.py
    export_samples.py
    evaluate_generated.py
    run_all.py
    run_compare.py
    diagnose_ks.py
    filtered_metrics.py
    ranked_ks.py
    program_stats.py
    controller_stats.py
    actuator_stats.py
    pv_stats.py
    aux_stats.py
    postprocess_types.py
    results/
      generated.csv
      generated_post.csv
      eval.json
      eval_post.json
      cont_stats.json
      disc_vocab.json
      metrics_history.csv
```

---

## 16. Summary

The current project is a **hybrid diffusion system** with a **two-stage temporal + residual design**, built to balance **distribution alignment** and **temporal realism**. The architecture is modular, with explicit type-aware diagnostics and postprocessing, and supports both GRU and Transformer backbones. The remaining research challenge is to replace the KS-only postprocessing with **conditional, structurally consistent generators** for the Type1/2/3/5/6 features.