Clean artifacts and update example pipeline

2026-01-22 16:32:51 +08:00
parent c0639386be
commit c3f750cd9d
20 changed files with 651 additions and 30826 deletions

@@ -12,33 +12,54 @@ CSV (train1) and produces a continuous/discrete split using a simple heuristic.
- train_stub.py: end-to-end scaffold for loss computation.
- train.py: minimal training loop with checkpoints.
- sample.py: minimal sampling loop.
- export_samples.py: sample + export to CSV with original column names.
- evaluate_generated.py: basic eval of generated CSV vs training stats.
- config.json: training defaults for train.py.
- model_design.md: step-by-step design notes.
- results/feature_split.txt: comma-separated feature lists.
- results/summary.txt: basic stats (rows sampled, column counts).
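
For reference, `example/config.json` might look roughly like the following. Every field name and value here is an illustrative assumption, not the repo's actual schema; check the file itself:

```json
{
  "device": "auto",
  "batch_size": 256,
  "epochs": 10,
  "lr": 1e-3,
  "checkpoint_dir": "example/results"
}
```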
## Run
```
-python /home/anay/Dev/diffusion/mask-ddpm/example/analyze_hai21_03.py
+python example/analyze_hai21_03.py
```
Prepare vocab + stats (writes to `example/results`):
```
-python /home/anay/Dev/diffusion/mask-ddpm/example/prepare_data.py
+python example/prepare_data.py
```
Train a small run:
```
-python /home/anay/Dev/diffusion/mask-ddpm/example/train.py
+python example/train.py --config example/config.json
```
Sample from the trained model:
```
-python /home/anay/Dev/diffusion/mask-ddpm/example/sample.py
+python example/sample.py
```
Sample and export CSV:
```
python example/export_samples.py --include-time --device cpu
```
Evaluate generated CSV (writes eval.json):
```
python example/evaluate_generated.py
```
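A stats comparison of the kind `evaluate_generated.py` performs can be sketched as follows. The function name, column handling, and report layout here are assumptions for illustration, not the script's actual behavior:

```python
import json

import pandas as pd


def compare_stats(train_csv: str, generated_csv: str, out_path: str = "eval.json"):
    """Compare per-column mean/std of generated data against training data."""
    train = pd.read_csv(train_csv)
    gen = pd.read_csv(generated_csv)
    report = {}
    for col in train.select_dtypes(include="number").columns:
        if col not in gen.columns:
            continue  # column was excluded from generation (e.g. attack*, time)
        report[col] = {
            "train_mean": float(train[col].mean()),
            "gen_mean": float(gen[col].mean()),
            "train_std": float(train[col].std()),
            "gen_std": float(gen[col].std()),
        }
    with open(out_path, "w") as f:
        json.dump(report, f, indent=2)
    return report
```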
One-click pipeline (prepare -> train -> export -> eval -> plot):
```
python example/run_pipeline.py --device auto
```
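A one-click pipeline of this shape typically just runs each stage as a subprocess and stops on the first failure. A minimal sketch, where the stage list and error handling are assumptions and the real `run_pipeline.py` may differ (e.g. in how it forwards `--device`):

```python
import subprocess
import sys

# Ordered pipeline stages; each entry is an example/ script plus its arguments.
STAGES = [
    ["example/prepare_data.py"],
    ["example/train.py", "--config", "example/config.json"],
    ["example/export_samples.py"],
    ["example/evaluate_generated.py"],
]


def run_pipeline(stages=STAGES):
    for stage in stages:
        # Run each stage with the current interpreter; abort on first failure.
        result = subprocess.run([sys.executable, *stage])
        if result.returncode != 0:
            raise SystemExit(f"stage failed: {' '.join(stage)}")
```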
## Notes
- Heuristic: integer-like columns with low cardinality (<=10 unique values) are
  treated as discrete; all other numeric columns are continuous.
- Set `device` in `example/config.json` to `auto` or `cuda` when moving to a GPU machine.
- Attack label columns (`attack*`) are excluded from training and generation.
- The `time` column is always excluded from training and generation; it can optionally be included at export time via `--include-time`.
- The script samples only the first 5000 rows to keep runs fast.
- `prepare_data.py` runs without PyTorch, but `train.py` and `sample.py` require it.
- `train.py` and `sample.py` auto-select GPU if available; otherwise they fall back to CPU.
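
The discrete/continuous heuristic described above can be sketched like this; the function name and the exact integer-likeness check are assumptions, not the repo's implementation:

```python
import pandas as pd


def split_features(df: pd.DataFrame, max_cardinality: int = 10):
    """Split numeric columns into discrete vs. continuous feature lists.

    Integer-like columns with at most `max_cardinality` unique values are
    treated as discrete; all other numeric columns are continuous.
    """
    discrete, continuous = [], []
    for col in df.select_dtypes(include="number").columns:
        values = df[col].dropna()
        # A column is integer-like if every value equals its rounded self.
        integer_like = (values == values.round()).all()
        if integer_like and values.nunique() <= max_cardinality:
            discrete.append(col)
        else:
            continuous.append(col)
    return discrete, continuous
```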