mask-ddpm/report.md
# mask-ddpm Project Report (Detailed)
This report is a **complete, beginner-friendly** description of the current project implementation, reflecting the latest code in this repo. It explains **what the project does**, **how data flows**, **what each file is for**, and **why the architecture is designed this way**.
---
## 0. TL;DR (One-Sentence Overview)
We generate multivariate ICS time series by **(1) learning the temporal trend with a GRU** and **(2) learning the residuals with a hybrid diffusion model** (continuous DDPM + discrete masked diffusion). We then evaluate with a **tie-aware KS statistic** and run **type-aware post-processing** for diagnostic KS reduction.
---
## 1. Project Goal
We want synthetic ICS sequences that are:
1) **Distribution-aligned** (per-feature CDF matches real data → low KS)
2) **Temporally consistent** (lag-1 correlation and trend are realistic)
3) **Discrete-valid** (state tokens are legal and frequency-consistent)
This is hard because **distribution** and **temporal structure** often conflict in a single model.
---
## 2. Data & Feature Schema
**Input data**: HAI CSV files (compressed) in `dataset/hai/hai-21.03/`.
**Feature split**: `example/feature_split.json`
- `continuous`: real-valued sensors/actuators
- `discrete`: state tokens / modes
- `time_column`: time index (not trained)
---
## 3. Preprocessing
File: `example/prepare_data.py`
### Continuous features
- Mean/std statistics
- Quantile table (if `use_quantile_transform=true`)
- Optional transforms (log1p etc.)
- Output: `example/results/cont_stats.json`
### Discrete features
- Token vocab from data
- Output: `example/results/disc_vocab.json`
File: `example/data_utils.py` contains
- Normalization / inverse
- Quantile transform / inverse
- Post-calibration helpers
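To make the transform/inverse pairing concrete, here is a minimal sketch of what normalization and quantile-transform helpers of this kind typically look like. Function names are illustrative, not the actual `data_utils.py` API:

```python
import numpy as np

def normalize(x, mean, std, eps=1e-8):
    """Standardize a continuous feature with precomputed statistics."""
    return (x - mean) / (std + eps)

def denormalize(z, mean, std, eps=1e-8):
    """Inverse of normalize()."""
    return z * (std + eps) + mean

def quantile_transform(x, q_table):
    """Map raw values to roughly Uniform(0,1) via a sorted quantile table."""
    ranks = np.searchsorted(q_table, x, side="right")
    return ranks / len(q_table)

def inverse_quantile_transform(u, q_table):
    """Map uniform values back to the empirical distribution."""
    idx = np.clip((u * len(q_table)).astype(int), 0, len(q_table) - 1)
    return q_table[idx]
```

The round trip is exact for normalization and approximate (up to quantization by the table resolution) for the quantile transform.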
---
## 4. Architecture
### 4.1 Stage 1: Temporal GRU (Trend)
File: `example/hybrid_diffusion.py`
- Class: `TemporalGRUGenerator`
- Input: continuous sequence
- Output: **trend sequence** (teacher-forced)
- Purpose: capture temporal structure
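A minimal sketch of such a teacher-forced trend model (class and layer sizes are illustrative, not the exact `TemporalGRUGenerator` implementation):

```python
import torch
import torch.nn as nn

class TrendGRU(nn.Module):
    """Predict the next continuous observation from the past (teacher-forced)."""
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_features)

    def forward(self, x):
        # x: (batch, seq_len, n_features); output at step t predicts x[t+1]
        h, _ = self.gru(x)
        return self.head(h)

# Teacher-forced training step: feed steps 0..T-2, target steps 1..T-1
model = TrendGRU(n_features=4)
x = torch.randn(2, 96, 4)
pred = model(x[:, :-1])
loss = nn.functional.mse_loss(pred, x[:, 1:])
```

At sampling time the same network can be rolled out autoregressively to produce the trend that Stage 2 later adds its residual to.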
### 4.2 Stage 2: Hybrid Diffusion (Residual)
File: `example/hybrid_diffusion.py`
**Continuous branch**
- Gaussian DDPM
- Predicts **residual** (or noise)
**Discrete branch**
- Mask diffusion (masked tokens)
- Classifier head per discrete column
**Backbone**
- Current config uses **Transformer encoder** (`backbone_type=transformer`)
- A GRU backbone is still supported as an option
**Conditioning**
- File-ID conditioning (`use_condition=true`, `condition_type=file_id`)
- Type1 (setpoint/demand) can be passed as a **continuous condition** (`cond_cont`)
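The two forward corruptions used by the hybrid model can be sketched as follows (a simplified illustration, not the repo's exact noising code): Gaussian noising for the continuous residual and independent token masking for the discrete branch.

```python
import torch

def noise_continuous(x0, t, alphas_cumprod):
    """Sample q(x_t | x_0) for a standard Gaussian DDPM."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1)          # (batch, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    return x_t, eps

def mask_discrete(tokens, t, timesteps, mask_id):
    """Mask each token independently with probability t/T."""
    p = (t.float() / timesteps).view(-1, 1, 1)
    masked = torch.rand_like(tokens, dtype=torch.float) < p
    return torch.where(masked, torch.full_like(tokens, mask_id), tokens), masked
```

The model then predicts the noise (or `x0`) for the continuous branch and the original token identities at the masked positions for the discrete branch.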
---
## 5. Training Flow
File: `example/train.py`
### 5.1 Stage 1: Temporal training
- Use continuous features (excluding Type1/Type5)
- Teacher-forced GRU predicts the next step
- Loss: **MSE**
- Output: `temporal.pt`
### 5.2 Stage 2: Diffusion training
- Compute residual: `x_resid = x_cont - trend`
- Sample time step `t`
- Add noise for continuous; mask tokens for discrete
- Model predicts:
  - **eps_pred** for continuous residual
  - logits for discrete tokens
### Loss design
- Continuous loss: MSE on eps or x0 (`cont_target`)
- Optional weighting: inverse variance (`cont_loss_weighting=inv_std`)
- Optional SNR weighting (`snr_weighted_loss`)
- Optional quantile loss (align residual distribution)
- Optional residual mean/std loss
- Discrete loss: cross-entropy on masked tokens
- Total: `loss = λ * loss_cont + (1 − λ) * loss_disc`
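The λ-weighted combination can be sketched as below (a simplified illustration; the optional inv_std/SNR/quantile weighting terms listed above are omitted, and the function name is hypothetical):

```python
import torch
import torch.nn.functional as F

def hybrid_loss(eps_pred, eps_true, disc_logits, disc_targets, masked, lam=0.5):
    """loss = λ * MSE(continuous) + (1 − λ) * CE(masked discrete tokens)."""
    loss_cont = F.mse_loss(eps_pred, eps_true)
    # Cross-entropy per token, then averaged only over masked positions
    ce = F.cross_entropy(
        disc_logits.reshape(-1, disc_logits.size(-1)),
        disc_targets.reshape(-1),
        reduction="none",
    ).reshape(disc_targets.shape)
    loss_disc = (ce * masked).sum() / masked.sum().clamp(min=1)
    return lam * loss_cont + (1 - lam) * loss_disc
```

Restricting the cross-entropy to masked positions matters: unmasked tokens are given to the model as input, so including them would let it trivially copy.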
---
## 6. Sampling & Export
File: `example/export_samples.py`
Steps:
1) Initialize continuous with noise
2) Initialize discrete with masks
3) Reverse diffusion loop from `t=T..0`
4) Add trend back (if temporal stage enabled)
5) Inverse transforms (quantile → raw)
6) Clip/bound if configured
7) Merge back Type1 (conditioning) and Type5 (derived)
8) Write `generated.csv`
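A highly simplified version of the reverse loop for the continuous branch (DDPM ancestral sampling) is sketched below; the real export script additionally unmasks discrete tokens at each step, adds the trend back, and applies the inverse transforms:

```python
import torch

@torch.no_grad()
def sample_continuous(model, shape, betas):
    """Ancestral DDPM sampling, starting from pure noise."""
    alphas = 1.0 - betas
    a_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                    # step 1: initialize with noise
    for t in reversed(range(len(betas))):     # step 3: reverse loop t=T..0
        eps = model(x, torch.full((shape[0],), t))
        coef = betas[t] / (1 - a_bar[t]).sqrt()
        x = (x - coef * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x
```

The numbered step indices in the comments refer to the export steps listed above.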
---
## 7. Evaluation
File: `example/evaluate_generated.py`
### Metrics
- **KS (tie-aware)** for continuous
- **JSD** for discrete
- **lag-1 correlation** for temporal consistency
- quantile diffs, mean/std errors
### Important
- The reference path supports **glob** patterns and aggregates **all matched files**
- The KS implementation is **tie-aware** (correct for spiky/quantized data)
Outputs:
- `example/results/eval.json`
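A tie-aware two-sample KS statistic can be computed by comparing the empirical CDFs at every observed value, which stays correct for quantized or spiky features where rank-based formulas mishandle ties. A minimal sketch (not the repo's exact implementation):

```python
import numpy as np

def ks_tie_aware(a, b):
    """Max absolute difference between the two empirical CDFs."""
    support = np.unique(np.concatenate([a, b]))
    # side="right" counts ties into the CDF value at each support point
    cdf_a = np.searchsorted(np.sort(a), support, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), support, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()
```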
---
## 8. Diagnostics
- `example/diagnose_ks.py`: CDF plots and per-feature KS
- `example/ranked_ks.py`: ranked KS + contribution
- `example/filtered_metrics.py`: filtered KS excluding outliers
- `example/program_stats.py`: Type1 stats
- `example/controller_stats.py`: Type2 stats
- `example/actuator_stats.py`: Type3 stats
- `example/pv_stats.py`: Type4 stats
- `example/aux_stats.py`: Type6 stats
---
## 9. Type-Aware Modeling
To reduce KS dominated by a few variables, the project uses **Type categories** defined in config:
- **Type1**: setpoints / demand (schedule-driven)
- **Type2**: controller outputs
- **Type3**: actuator positions
- **Type4**: PV sensors
- **Type5**: derived tags
- **Type6**: auxiliary / coupling
### Current implementation (diagnostic KS baseline)
File: `example/postprocess_types.py`
- Type1/2/3/5/6 → **empirical resampling** from real distribution
- Type4 → keep diffusion output
This is **not** the final model, but it provides a **KS upper bound** for diagnosis.
Outputs:
- `example/results/generated_post.csv`
- `example/results/eval_post.json`
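The empirical-resampling baseline amounts to replacing a generated column with i.i.d. draws from the real marginal. It matches the marginal distribution by construction, which is exactly why it only serves as a KS diagnostic: it destroys temporal and cross-feature structure. A minimal sketch (function name is illustrative):

```python
import numpy as np

def empirical_resample(real_col, n, rng=None):
    """Draw n values i.i.d. from the empirical distribution of real_col."""
    rng = np.random.default_rng(rng)
    return rng.choice(real_col, size=n, replace=True)
```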
---
## 10. Pipeline (One-Command Run)
File: `example/run_all.py`
Default pipeline:
1) prepare_data
2) train
3) export_samples
4) evaluate_generated (generated.csv)
5) postprocess_types (generated_post.csv)
6) evaluate_generated (eval_post.json)
7) diagnostics scripts
**Linux**:
```bash
python example/run_all.py --device cuda --config example/config.json
```
**Windows (PowerShell, run from the `example/` directory)**:
```powershell
python run_all.py --device cuda --config config.json
```
---
## 11. Current Configuration (Key Defaults)
From `example/config.json`:
- backbone_type: **transformer**
- timesteps: 600
- seq_len: 96
- batch_size: 16
- cont_target: `x0`
- cont_loss_weighting: `inv_std`
- snr_weighted_loss: true
- quantile_loss_weight: 0.2
- use_quantile_transform: true
- cont_post_calibrate: true
- use_temporal_stage1: true
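Assembled from the defaults listed above, the relevant fragment of `example/config.json` would look roughly like this (abridged; keys not mentioned above are omitted):

```json
{
  "backbone_type": "transformer",
  "timesteps": 600,
  "seq_len": 96,
  "batch_size": 16,
  "cont_target": "x0",
  "cont_loss_weighting": "inv_std",
  "snr_weighted_loss": true,
  "quantile_loss_weight": 0.2,
  "use_quantile_transform": true,
  "cont_post_calibrate": true,
  "use_temporal_stage1": true
}
```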
---
## 12. What's Actually Trained vs What's Post-Processed
**Trained**
- Temporal GRU (trend)
- Diffusion residual model (continuous + discrete)
**Post-processed (KS-only)**
- Type1/2/3/5/6 replaced by empirical resampling
This is important: post-processing improves KS but **may break joint realism**.
---
## 13. Why It's Still Hard
- Type1/2/3 are **event-driven** and **piecewise constant**
- Diffusion (Gaussian DDPM + MSE) tends to smooth/blur these signals
- Temporal and distributional objectives pull in opposite directions
---
## 14. Where To Improve Next
1) Replace KS-only post-processing with **conditional generators**:
   - Type1: program generator (HMM / schedule)
   - Type2: controller emulator (PID-like)
   - Type3: actuator dynamics (dwell + rate + saturation)
2) Add regime conditioning for Type4 PVs
3) Joint realism checks (cross-feature correlation)
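As a toy illustration of what the proposed Type3 actuator model could look like, the sketch below combines the three ingredients named above: a dwell interval between re-commands, a per-step rate limit, and saturation at the travel bounds. All names and parameters are hypothetical; this is a design sketch, not project code:

```python
import numpy as np

def actuator_step(pos, target, rate_limit, lo, hi):
    """Move toward target, limited per step and saturated at [lo, hi]."""
    delta = np.clip(target - pos, -rate_limit, rate_limit)  # rate limit
    return float(np.clip(pos + delta, lo, hi))              # saturation

def simulate(targets, dwell=3, rate_limit=0.1, lo=0.0, hi=1.0):
    """Re-command only every `dwell` steps (a crude stand-in for dwell time)."""
    pos, out = 0.0, []
    for i, tgt in enumerate(targets):
        if i % dwell == 0:
            cmd = tgt
        pos = actuator_step(pos, cmd, rate_limit, lo, hi)
        out.append(pos)
    return out
```

Driving such a model with generated Type1/Type2 signals would produce piecewise, rate-limited trajectories that Gaussian DDPM sampling tends to blur.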
---
## 15. Key Files (Complete but Pruned)
```
mask-ddpm/
  report.md
  docs/
    README.md
    architecture.md
    evaluation.md
    decisions.md
    experiments.md
    ideas.md
  example/
    config.json
    config_no_temporal.json
    config_temporal_strong.json
    feature_split.json
    data_utils.py
    prepare_data.py
    hybrid_diffusion.py
    train.py
    sample.py
    export_samples.py
    evaluate_generated.py
    run_all.py
    run_compare.py
    diagnose_ks.py
    filtered_metrics.py
    ranked_ks.py
    program_stats.py
    controller_stats.py
    actuator_stats.py
    pv_stats.py
    aux_stats.py
    postprocess_types.py
    results/
      generated.csv
      generated_post.csv
      eval.json
      eval_post.json
      cont_stats.json
      disc_vocab.json
      metrics_history.csv
```
---
## 16. Summary
The current project is a **hybrid diffusion system** with a **two-stage temporal + residual design**, built to balance **distribution alignment** and **temporal realism**. The architecture is modular, with explicit type-aware diagnostics and post-processing, and it supports both GRU and Transformer backbones. The remaining research challenge is to replace KS-only post-processing with **conditional, structurally consistent generators** for Type1/2/3/5/6 features.
The current project is a **hybrid diffusion system** with a **twostage temporal+residual design**, built to balance **distribution alignment** and **temporal realism**. The architecture is modular, with explicit typeaware diagnostics and postprocessing, and supports both GRU and Transformer backbones. The remaining research challenge is to replace KSonly postprocessing with **conditional, structurally consistent generators** for Type1/2/3/5/6 features.