Example: HAI 21.03 Feature Split
This folder contains a small, reproducible example that inspects the HAI 21.03 CSV data (all `train*.csv.gz` files) and produces a continuous/discrete feature split using a simple heuristic.
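The split heuristic (described under Notes: integer-like values with cardinality <= 10 are discrete, everything else numeric is continuous) can be sketched as follows. This is an illustrative stand-in, not the code in `analyze_hai21_03.py`; the function name and the mapping-based input are assumptions:

```python
def split_features(columns, max_cardinality=10):
    """Sketch of the README's heuristic.

    columns: mapping of column name -> list of numeric values.
    A column is discrete if every value is integer-like and the number of
    distinct values is at most max_cardinality; otherwise it is continuous.
    """
    discrete, continuous = [], []
    for name, values in columns.items():
        vals = [v for v in values if v is not None]
        integer_like = all(float(v).is_integer() for v in vals)
        if integer_like and len(set(vals)) <= max_cardinality:
            discrete.append(name)
        else:
            continuous.append(name)
    return discrete, continuous
```

For example, a 0/1 status column lands in `discrete`, while a float sensor reading lands in `continuous`.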
Files
- analyze_hai21_03.py: reads a sample of the data and writes results.
- data_utils.py: CSV loading, vocab, normalization, and batching helpers.
- feature_split.json: column split for HAI 21.03.
- hybrid_diffusion.py: hybrid model + diffusion utilities.
- prepare_data.py: compute vocab and normalization stats.
- train_stub.py: end-to-end scaffold for loss computation.
- train.py: minimal training loop with checkpoints.
- sample.py: minimal sampling loop.
- export_samples.py: sample + export to CSV with original column names.
- evaluate_generated.py: basic eval of generated CSV vs training stats.
- config.json: training defaults for train.py.
- model_design.md: step-by-step design notes.
- results/feature_split.txt: comma-separated feature lists.
- results/summary.txt: basic stats (rows sampled, column counts).
Run
python example/analyze_hai21_03.py
Prepare vocab + stats (writes to example/results):
python example/prepare_data.py
Train a small run:
python example/train.py --config example/config.json
Sample from the trained model:
python example/sample.py
Sample and export CSV:
python example/export_samples.py --include-time --device cpu
Evaluate generated CSV (writes eval.json):
python example/evaluate_generated.py
One-click pipeline (prepare -> train -> export -> eval -> plot):
python example/run_pipeline.py --device auto
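The per-column normalization stats that `prepare_data.py` writes could be computed along these lines. The Notes mention an automatic log1p transform for heavy-tailed columns (recorded in `cont_stats.json`); the skewness threshold, function name, and output format here are assumptions, not the actual implementation:

```python
import numpy as np

def continuous_stats(columns, skew_threshold=2.0):
    """Sketch: per-column mean/std for z-score normalization, with an
    optional log1p transform for heavy-tailed (highly skewed, non-negative)
    columns. skew_threshold is a hypothetical knob."""
    stats = {}
    for name, values in columns.items():
        x = np.asarray(values, dtype=float)
        mean, std = x.mean(), x.std()
        # Sample skewness; large positive values indicate a heavy right tail.
        skew = ((x - mean) ** 3).mean() / (std ** 3 + 1e-12)
        use_log1p = bool(skew > skew_threshold and x.min() >= 0)
        y = np.log1p(x) if use_log1p else x
        stats[name] = {"log1p": use_log1p,
                       "mean": float(y.mean()),
                       "std": float(y.std())}
    return stats
```

Stats are taken after the transform, so at sampling time the inverse order is: denormalize, then `expm1` for log1p columns.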
Notes
- Heuristic: integer-like values with low cardinality (<=10) are treated as discrete. All other numeric columns are continuous.
- Set `device` in `example/config.json` to `auto` or `cuda` when moving to a GPU machine.
- Attack label columns (`attack*`) are excluded from training and generation.
- The `time` column is always excluded from training and generation (optional for export only).
- EMA weights are saved as `model_ema.pt` and used by the pipeline for sampling.
- Gradients are clipped by default (`grad_clip` in `config.json`) to stabilize training.
- Discrete masking uses a cosine schedule for smoother corruption.
- Continuous sampling is clipped in normalized space each step for stability.
- Optional conditioning by file id (`train*.csv.gz`) is enabled by default for multi-file training.
- The continuous head can be bounded with `tanh` via `use_tanh_eps` in the config.
- Export clamps continuous features to the training min/max and preserves integer/decimal precision.
- Continuous features may be log1p-transformed automatically for heavy-tailed columns (see `cont_stats.json`).
- `<UNK>` tokens are replaced by the most frequent token for each discrete column at export.
- The analysis script samples only the first 5000 rows to stay fast.
- `prepare_data.py` runs without PyTorch, but `train.py` and `sample.py` require it.
- `train.py` and `sample.py` auto-select a GPU if available; otherwise they fall back to CPU.
- An optional feature-graph mixer (`model_use_feature_graph`) adds a learnable relation prior across feature channels.
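The cosine masking schedule mentioned in the Notes could take the following shape. This is one common cosine form, shown for illustration; the exact schedule in `hybrid_diffusion.py` may differ:

```python
import math

def mask_rate(t):
    """Fraction of discrete tokens masked at diffusion time t in [0, 1].

    Rises smoothly from 0 (clean data) to 1 (fully masked), corrupting
    more gently at early timesteps than a linear ramp would.
    """
    return 1.0 - math.cos(0.5 * math.pi * t)
```

During training, each discrete token is replaced by a `[MASK]` id with probability `mask_rate(t)`, and the model learns to recover the original tokens.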