Example: HAI 21.03 Feature Split

This folder contains a small, reproducible example that inspects the HAI 21.03 CSV (train1) and produces a continuous/discrete split using a simple heuristic.

Files

  • analyze_hai21_03.py: reads a sample of the data and writes the feature split and summary to results/.
  • data_utils.py: CSV loading, vocab, normalization, and batching helpers.
  • feature_split.json: column split for HAI 21.03.
  • hybrid_diffusion.py: hybrid model + diffusion utilities.
  • prepare_data.py: compute vocab and normalization stats.
  • train_stub.py: end-to-end scaffold for loss computation.
  • train.py: minimal training loop with checkpoints.
  • sample.py: minimal sampling loop.
  • model_design.md: step-by-step design notes.
  • results/feature_split.txt: comma-separated feature lists.
  • results/summary.txt: basic stats (rows sampled, column counts).

Run

Analyze the data and write the feature split:

python /home/anay/Dev/diffusion/mask-ddpm/example/analyze_hai21_03.py

Prepare vocab + stats (writes to example/results):

python /home/anay/Dev/diffusion/mask-ddpm/example/prepare_data.py

Train a small run:

python /home/anay/Dev/diffusion/mask-ddpm/example/train.py

Sample from the trained model:

python /home/anay/Dev/diffusion/mask-ddpm/example/sample.py
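
train.py and sample.py need PyTorch and use a GPU when one is available. The snippet below is a minimal sketch of that kind of device selection, not code taken from the scripts: the helper name pick_device is hypothetical, and the fallback to "cpu" when PyTorch is missing is an assumption added only so the sketch runs anywhere.

```python
def pick_device():
    """Return "cuda" if PyTorch sees a GPU, otherwise "cpu".

    Hypothetical helper for illustration; the real scripts may select
    the device differently.
    """
    try:
        import torch  # required by train.py and sample.py
    except ImportError:
        # Assumption for this sketch only: report CPU when PyTorch is absent.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"using device: {device}")
```

In a training script the returned string would typically be passed to torch.device(...) and used when moving the model and batches.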

Notes

  • Heuristic: numeric columns whose values are all integer-like and that have low cardinality (at most 10 distinct values) are treated as discrete. All other numeric columns are treated as continuous.
  • analyze_hai21_03.py reads only the first 5000 rows to stay fast.
  • prepare_data.py runs without PyTorch, but train.py and sample.py require it.
  • train.py and sample.py auto-select GPU if available; otherwise they fall back to CPU.
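
The split heuristic from the notes above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the contents of analyze_hai21_03.py: the function names, the handling of non-numeric columns (skipped), and the sample column names are all hypothetical. Only the 5000-row limit and the cardinality threshold of 10 come from the notes.

```python
import csv
import io
from itertools import islice

MAX_ROWS = 5000        # row limit stated in the notes
MAX_CARDINALITY = 10   # cardinality threshold stated in the notes


def read_numeric_columns(handle, max_rows=MAX_ROWS):
    """Read up to max_rows rows; keep columns whose values all parse as floats."""
    reader = csv.DictReader(handle)
    columns = {name: [] for name in (reader.fieldnames or [])}
    for row in islice(reader, max_rows):
        for name in columns:
            columns[name].append(row.get(name))
    numeric = {}
    for name, raws in columns.items():
        try:
            numeric[name] = [float(v) for v in raws]
        except (TypeError, ValueError):
            continue  # assumption: non-numeric columns are skipped
    return numeric


def split_features(numeric, max_cardinality=MAX_CARDINALITY):
    """Return (continuous, discrete) column-name lists."""
    continuous, discrete = [], []
    for name, values in numeric.items():
        integer_like = all(v.is_integer() for v in values)
        if values and integer_like and len(set(values)) <= max_cardinality:
            discrete.append(name)
        else:
            continuous.append(name)
    return continuous, discrete


# Tiny in-memory CSV with made-up column names, standing in for the real file.
sample = io.StringIO(
    "time,valve_state,pressure,mode\n"
    "2021-01-01 00:00:00,0,1.25,3\n"
    "2021-01-01 00:00:01,1,1.31,3\n"
    "2021-01-01 00:00:02,1,1.29,4\n"
)
continuous, discrete = split_features(read_numeric_columns(sample))
print("continuous:", continuous)
print("discrete:", discrete)
```

Because only the first rows are inspected, a column that takes more than 10 distinct values later in the file could still be classified as discrete; the resulting split is worth a manual check before training.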