# Example: HAI 21.03 Feature Split

This folder contains a small, reproducible example that inspects the HAI 21.03 CSVs (all `train*.csv.gz` files) and produces a continuous/discrete feature split using a simple heuristic.

## Files

- `analyze_hai21_03.py`: reads a sample of the data and writes results.
- `data_utils.py`: CSV loading, vocab, normalization, and batching helpers.
- `feature_split.json`: column split for HAI 21.03.
- `hybrid_diffusion.py`: hybrid model + diffusion utilities.
- `prepare_data.py`: computes vocab and normalization stats.
- `train_stub.py`: end-to-end scaffold for loss computation.
- `train.py`: minimal training loop with checkpoints.
- `sample.py`: minimal sampling loop.
- `export_samples.py`: samples and exports to CSV with the original column names.
- `evaluate_generated.py`: basic evaluation of a generated CSV against training stats.
- `run_pipeline.py`: one-click pipeline (prepare -> train -> export -> eval -> plot).
- `config.json`: training defaults for `train.py`.
- `model_design.md`: step-by-step design notes.
- `results/feature_split.txt`: comma-separated feature lists.
- `results/summary.txt`: basic stats (rows sampled, column counts).

## Run

Analyze the data and write the split:

```
python example/analyze_hai21_03.py
```

Prepare vocab + stats (writes to `example/results`):

```
python example/prepare_data.py
```

Train a small run:

```
python example/train.py --config example/config.json
```

Sample from the trained model:

```
python example/sample.py
```

Sample and export to CSV:

```
python example/export_samples.py --include-time --device cpu
```

Evaluate the generated CSV (writes `eval.json`):

```
python example/evaluate_generated.py
```

One-click pipeline (prepare -> train -> export -> eval -> plot):

```
python example/run_pipeline.py --device auto
```

## Notes

- Heuristic: integer-like values with low cardinality (<= 10) are treated as discrete; all other numeric columns are continuous (see the split sketch below).
- Set `device` in `example/config.json` to `auto` or `cuda` when moving to a GPU machine.
- Attack label columns (`attack*`) are excluded from training and generation.
- The `time` column is always excluded from training and generation (it is optional for export only).
- EMA weights are saved as `model_ema.pt` and used by the pipeline for sampling (see the training-step sketch below).
- Gradients are clipped by default (`grad_clip` in `config.json`) to stabilize training.
- Discrete masking uses a cosine schedule for smoother corruption (see the schedule sketch below).
- Continuous sampling is clipped in normalized space at each step for stability (sketched below).
- Optional conditioning on file id (`train*.csv.gz`) is enabled by default for multi-file training.
- The continuous head can be bounded with `tanh` via `use_tanh_eps` in the config.
- Export clamps continuous features to the training min/max and preserves integer/decimal precision (see the export sketch below).
- Continuous features may be log1p-transformed automatically for heavy-tailed columns (see `cont_stats.json` and the log1p sketch below).
- Unknown tokens are replaced by the most frequent token for each discrete column at export.
- `analyze_hai21_03.py` samples only the first 5000 rows to stay fast.
- `prepare_data.py` runs without PyTorch, but `train.py` and `sample.py` require it.
- `train.py` and `sample.py` auto-select a GPU if available; otherwise they fall back to CPU (see the device sketch below).
- An optional two-stage temporal model (`use_temporal_stage1`) trains a GRU trend backbone first, then diffusion models the residuals.
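## Sketches

The snippets below are illustrative sketches only, not the code in this folder; function names, thresholds, and stats layouts that do not appear above are assumptions.

A minimal sketch of the split heuristic from the notes, assuming pandas: numeric columns whose values are integer-like and take at most 10 distinct values are treated as discrete, everything else as continuous.

```
import pandas as pd

def split_features(df: pd.DataFrame, max_cardinality: int = 10):
    """Return (continuous, discrete) column-name lists for numeric features."""
    continuous, discrete = [], []
    for col in df.select_dtypes(include="number").columns:
        values = df[col].dropna()
        integer_like = (values % 1 == 0).all()
        if integer_like and values.nunique() <= max_cardinality:
            discrete.append(col)    # integer-like with low cardinality
        else:
            continuous.append(col)  # everything else stays continuous
    return continuous, discrete
```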
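A minimal sketch of resolving `device: auto` (and the GPU fallback behavior of `train.py` / `sample.py`), assuming PyTorch:

```
import torch

def resolve_device(device: str = "auto") -> torch.device:
    # "auto" picks CUDA when available, otherwise falls back to CPU.
    if device == "auto":
        return torch.device("cuda" if torch.cuda.is_available() else "cpu")
    return torch.device(device)
```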
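A hedged sketch of the gradient-clipping and EMA bookkeeping described in the notes; `loss_fn` and the 0.999 decay are illustrative, while `grad_clip` mirrors the `config.json` key:

```
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay: float = 0.999):
    # In-place exponential moving average of the online weights; the
    # averaged copy is what gets saved as model_ema.pt.
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)

def train_step(model, ema_model, batch, loss_fn, optimizer, grad_clip: float = 1.0):
    optimizer.zero_grad()
    loss = loss_fn(model, batch)
    loss.backward()
    # Clip the gradient norm (grad_clip in config.json) to stabilize training.
    torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    optimizer.step()
    ema_update(ema_model, model)
    return float(loss.detach())
```

The EMA copy can be created once at startup with `copy.deepcopy(model)` and updated after every optimizer step.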
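A hedged sketch of a cosine masking schedule for discrete corruption; the exact parameterization in `hybrid_diffusion.py` may differ, and `mask_id` is an assumed reserved token id:

```
import math
import torch

def cosine_mask_prob(t: torch.Tensor) -> torch.Tensor:
    # t in [0, 1]; the mask probability ramps smoothly from 0 (clean
    # data) to 1 (fully masked), with gentler corruption early on.
    return 1.0 - torch.cos(0.5 * math.pi * t)

def mask_tokens(tokens: torch.Tensor, t: torch.Tensor, mask_id: int) -> torch.Tensor:
    # tokens: (batch, n_cols) integer ids; t: (batch,) diffusion times.
    p = cosine_mask_prob(t).unsqueeze(-1)  # (batch, 1), broadcasts per column
    masked = torch.rand(tokens.shape, device=tokens.device) < p
    return torch.where(masked, torch.full_like(tokens, mask_id), tokens)
```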
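The per-step clipping of continuous samples could look like the following; the +/-3 range is an assumption (values live in normalized z-score space):

```
import torch

def clip_step(x: torch.Tensor, clip: float = 3.0) -> torch.Tensor:
    # Clamp after each denoising step so samples cannot drift far
    # outside the normalized training range.
    return x.clamp(min=-clip, max=clip)
```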
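A hedged sketch of the export-time post-processing: clamp each continuous feature to its training min/max, then round to the precision seen in training. The per-column stats layout (`min`, `max`, `decimals`) is an assumption:

```
import pandas as pd

def postprocess_continuous(df: pd.DataFrame, stats: dict) -> pd.DataFrame:
    # stats[col] = {"min": float, "max": float, "decimals": int}
    for col, s in stats.items():
        clamped = df[col].clip(lower=s["min"], upper=s["max"])
        df[col] = clamped.round(int(s["decimals"]))  # 0 decimals keeps integer columns integer
    return df
```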
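Finally, a hedged sketch of an automatic log1p decision for heavy-tailed columns; the skewness criterion and its threshold are assumptions (the real rule is recorded in `cont_stats.json`):

```
import numpy as np

def should_log1p(values: np.ndarray, skew_threshold: float = 2.0) -> bool:
    values = values[~np.isnan(values)]
    if values.size == 0 or values.min() < 0:  # log1p needs non-negative inputs here
        return False
    mean, std = values.mean(), values.std()
    if std == 0:
        return False
    skew = float(np.mean(((values - mean) / std) ** 3))
    return skew > skew_threshold  # heavy right tail -> transform
```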