# Example: HAI 21.03 Feature Split
This folder contains a small, reproducible example that inspects the HAI 21.03
CSV data (all `train*.csv.gz` files) and produces a continuous/discrete feature
split using a simple heuristic.
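
For orientation, the loading step amounts to concatenating the gzipped shards. Below is a minimal sketch with synthetic files standing in for the real dumps; the column names `P1_PV` and `attack` are illustrative, and the actual helpers live in `data_utils.py`:

```python
import glob
import os
import tempfile

import pandas as pd

def load_shards(pattern: str) -> pd.DataFrame:
    """Concatenate every gzipped CSV shard matching the glob pattern."""
    paths = sorted(glob.glob(pattern))
    frames = [pd.read_csv(p, compression="gzip") for p in paths]
    return pd.concat(frames, ignore_index=True)

# Tiny demo with two synthetic shards standing in for the real train*.csv.gz dumps.
tmp = tempfile.mkdtemp()
for i in range(2):
    pd.DataFrame({"P1_PV": [0.1 * i, 0.2 * i], "attack": [0, 1]}).to_csv(
        os.path.join(tmp, f"train{i}.csv.gz"), index=False, compression="gzip"
    )

df = load_shards(os.path.join(tmp, "train*.csv.gz"))
print(df.shape)  # (4, 2)
```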
## Files

- `analyze_hai21_03.py`: reads a sample of the data and writes results.
- `data_utils.py`: CSV loading, vocab, normalization, and batching helpers.
- `feature_split.json`: column split for HAI 21.03.
- `hybrid_diffusion.py`: hybrid model + diffusion utilities.
- `prepare_data.py`: computes vocab and normalization stats.
- `train_stub.py`: end-to-end scaffold for loss computation.
- `train.py`: minimal training loop with checkpoints.
- `sample.py`: minimal sampling loop.
- `export_samples.py`: samples + exports to CSV with the original column names.
- `evaluate_generated.py`: basic evaluation of a generated CSV against training stats.
- `run_pipeline.py`: one-click pipeline (prepare -> train -> export -> eval -> plot).
- `config.json`: training defaults for `train.py`.
- `model_design.md`: step-by-step design notes.
- `results/feature_split.txt`: comma-separated feature lists.
- `results/summary.txt`: basic stats (rows sampled, column counts).

## Run

Inspect the data and write the feature split:

```
python example/analyze_hai21_03.py
```

Prepare vocab + stats (writes to `example/results`):

```
python example/prepare_data.py
```
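
In spirit, this step derives two artifacts from the training frame: per-column normalization stats for the continuous features and a value vocabulary for the discrete ones. A rough sketch of what `prepare_data.py` plausibly computes; the output layout and column names here are assumptions, not the repo's exact format:

```python
import json

import pandas as pd

def prepare_stats(df, continuous, discrete):
    """Per-column mean/std for continuous features, value vocab for discrete ones."""
    return {
        "continuous": {
            # `or 1.0` guards against constant columns with zero std.
            c: {"mean": float(df[c].mean()), "std": float(df[c].std()) or 1.0}
            for c in continuous
        },
        "vocab": {c: sorted(df[c].dropna().unique().tolist()) for c in discrete},
    }

# Illustrative frame; the real column names come from the HAI CSV header.
df = pd.DataFrame({"P1_LIT01": [0.5, 1.5, 2.5], "P1_STATE": [0, 1, 1]})
stats = prepare_stats(df, continuous=["P1_LIT01"], discrete=["P1_STATE"])
print(json.dumps(stats, indent=2))
```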

Train a small run:

```
python example/train.py --config example/config.json
```
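
Two training details called out in the notes, gradient clipping and an EMA copy of the weights, can be sketched in isolation. The model and loss below are placeholders, not the repo's hybrid diffusion objective:

```python
import copy

import torch
import torch.nn as nn

# Placeholders: a linear layer and an MSE stand-in for the diffusion loss.
model = nn.Linear(4, 4)
ema = copy.deepcopy(model)          # EMA copy, the kind saved as model_ema.pt
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
decay, grad_clip = 0.999, 1.0       # grad_clip mirrors the config.json knob

for _ in range(10):
    x = torch.randn(8, 4)
    loss = ((model(x) - x) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    # Clip the global gradient norm before the optimizer step.
    torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    opt.step()
    # EMA update: ema <- decay * ema + (1 - decay) * current weights.
    with torch.no_grad():
        for p_ema, p in zip(ema.parameters(), model.parameters()):
            p_ema.mul_(decay).add_(p, alpha=1 - decay)
```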

Sample from the trained model:

```
python example/sample.py
```
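
One stability detail worth illustrating: during sampling, continuous values are clipped in normalized space after each step so they cannot drift outside the normalized domain. A toy sketch with a random walk standing in for the model's reverse step (the bound of 5.0 is an assumed value, not the repo's):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))  # current continuous state in normalized space

for _ in range(3):
    x = x + 0.1 * rng.normal(size=x.shape)  # stand-in for the model's reverse step
    x = np.clip(x, -5.0, 5.0)               # keep values inside the normalized domain
```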

Sample and export CSV:

```
python example/export_samples.py --include-time --device cpu
```

Evaluate the generated CSV (writes `eval.json`):

```
python example/evaluate_generated.py
```
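
The evaluation compares simple statistics of the generated CSV against the training data. A sketch of that kind of check; the exact metrics written to `eval.json` are assumptions:

```python
import pandas as pd

def compare_stats(real, fake, cols):
    """Absolute mean/std gaps per column between training and generated data."""
    return {
        c: {
            "mean_gap": abs(float(real[c].mean()) - float(fake[c].mean())),
            "std_gap": abs(float(real[c].std()) - float(fake[c].std())),
        }
        for c in cols
    }

# Illustrative data; in practice `real` is the training CSV and `fake` the export.
real = pd.DataFrame({"P1_LIT01": [1.0, 2.0, 3.0]})
fake = pd.DataFrame({"P1_LIT01": [1.0, 2.0, 4.0]})
report = compare_stats(real, fake, ["P1_LIT01"])
```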

One-click pipeline (prepare -> train -> export -> eval -> plot):

```
python example/run_pipeline.py --device auto
```

## Notes

- Heuristic: integer-like values with low cardinality (<= 10) are treated as discrete; all other numeric columns are continuous.
- Set `device` in `example/config.json` to `auto` or `cuda` when moving to a GPU machine.
- Attack label columns (`attack*`) are excluded from training and generation.
- The `time` column is always excluded from training and generation (it is optional for export only).
- EMA weights are saved as `model_ema.pt` and used by the pipeline for sampling.
- Gradients are clipped by default (`grad_clip` in `config.json`) to stabilize training.
- Discrete masking uses a cosine schedule for smoother corruption.
- Continuous sampling is clipped in normalized space at each step for stability.
- Optional conditioning on file id (`train*.csv.gz`) is enabled by default for multi-file training.
- The continuous head can be bounded with `tanh` via `use_tanh_eps` in the config.
- The analysis script samples only the first 5000 rows to stay fast.
- `prepare_data.py` runs without PyTorch, but `train.py` and `sample.py` require it.
- `train.py` and `sample.py` auto-select a GPU if available; otherwise they fall back to CPU.
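
The split heuristic from the first note is small enough to sketch directly (the column names below are illustrative):

```python
import pandas as pd

def split_features(df, max_card=10):
    """Integer-like columns with cardinality <= max_card are discrete;
    every other numeric column is continuous."""
    continuous, discrete = [], []
    for col in df.select_dtypes("number").columns:
        vals = df[col].dropna()
        integer_like = bool((vals % 1 == 0).all())
        if integer_like and vals.nunique() <= max_card:
            discrete.append(col)
        else:
            continuous.append(col)
    return continuous, discrete

# Illustrative columns; the real names come from the HAI header.
df = pd.DataFrame({"P1_LIT01": [0.13, 0.55, 0.91], "P1_PP01D": [0, 1, 1]})
print(split_features(df))  # (['P1_LIT01'], ['P1_PP01D'])
```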