Example: HAI 21.03 Feature Split
This folder contains a small, reproducible example that inspects the HAI 21.03 CSV data (all `train*.csv.gz` files) and produces a continuous/discrete feature split using a simple heuristic.
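The split heuristic (described under Notes: integer-like values with cardinality <= 10 are discrete, everything else numeric is continuous) can be sketched as follows. This is an illustrative stand-in, not the code in `analyze_hai21_03.py`; the function name and the mapping-based input are assumptions:

```python
def split_features(columns, max_cardinality=10):
    """Sketch of the README's heuristic.

    columns: mapping of column name -> list of numeric values.
    A column is discrete if every value is integer-like and the number of
    distinct values is at most max_cardinality; otherwise it is continuous.
    """
    discrete, continuous = [], []
    for name, values in columns.items():
        vals = [v for v in values if v is not None]
        integer_like = all(float(v).is_integer() for v in vals)
        if integer_like and len(set(vals)) <= max_cardinality:
            discrete.append(name)
        else:
            continuous.append(name)
    return discrete, continuous
```

For example, a 0/1 status column lands in `discrete`, while a float sensor reading lands in `continuous`.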
Files
- analyze_hai21_03.py: reads a sample of the data and writes results.
- data_utils.py: CSV loading, vocab, normalization, and batching helpers.
- feature_split.json: column split for HAI 21.03.
- hybrid_diffusion.py: hybrid model + diffusion utilities.
- prepare_data.py: compute vocab and normalization stats.
- train_stub.py: end-to-end scaffold for loss computation.
- train.py: minimal training loop with checkpoints.
- sample.py: minimal sampling loop.
- export_samples.py: sample + export to CSV with original column names.
- evaluate_generated.py: basic eval of generated CSV vs training stats.
- config.json: training defaults for train.py.
- model_design.md: step-by-step design notes.
- results/feature_split.txt: comma-separated feature lists.
- results/summary.txt: basic stats (rows sampled, column counts).
Run
python example/analyze_hai21_03.py
Prepare vocab + stats (writes to example/results):
python example/prepare_data.py
Train a small run:
python example/train.py --config example/config.json
Sample from the trained model:
python example/sample.py
Sample and export CSV:
python example/export_samples.py --include-time --device cpu
Evaluate generated CSV (writes eval.json):
python example/evaluate_generated.py
One-click pipeline (prepare -> train -> export -> eval -> plot):
python example/run_pipeline.py --device auto
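The per-column normalization stats that `prepare_data.py` writes could be computed along these lines. The Notes mention an automatic log1p transform for heavy-tailed columns (recorded in `cont_stats.json`); the skewness threshold, function name, and output format here are assumptions, not the actual implementation:

```python
import numpy as np

def continuous_stats(columns, skew_threshold=2.0):
    """Sketch: per-column mean/std for z-score normalization, with an
    optional log1p transform for heavy-tailed (highly skewed, non-negative)
    columns. skew_threshold is a hypothetical knob."""
    stats = {}
    for name, values in columns.items():
        x = np.asarray(values, dtype=float)
        mean, std = x.mean(), x.std()
        # Sample skewness; large positive values indicate a heavy right tail.
        skew = ((x - mean) ** 3).mean() / (std ** 3 + 1e-12)
        use_log1p = bool(skew > skew_threshold and x.min() >= 0)
        y = np.log1p(x) if use_log1p else x
        stats[name] = {"log1p": use_log1p,
                       "mean": float(y.mean()),
                       "std": float(y.std())}
    return stats
```

Stats are taken after the transform, so at sampling time the inverse order is: denormalize, then `expm1` for log1p columns.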
Notes
- Heuristic: integer-like values with low cardinality (<=10) are treated as discrete. All other numeric columns are continuous.
- Set `device` in `example/config.json` to `auto` or `cuda` when moving to a GPU machine.
- Attack label columns (`attack*`) are excluded from training and generation.
- The `time` column is always excluded from training and generation (optional for export only).
- EMA weights are saved as `model_ema.pt` and used by the pipeline for sampling.
- Gradients are clipped by default (`grad_clip` in `config.json`) to stabilize training.
- Discrete masking uses a cosine schedule for smoother corruption.
- Continuous sampling is clipped in normalized space each step for stability.
- Optional conditioning by file id (`train*.csv.gz`) is enabled by default for multi-file training.
- The continuous head can be bounded with `tanh` via `use_tanh_eps` in the config.
- Export clamps continuous features to the training min/max and preserves integer/decimal precision.
- Continuous features may be log1p-transformed automatically for heavy-tailed columns (see `cont_stats.json`).
- `<UNK>` tokens are replaced by the most frequent token for each discrete column at export.
- The analysis script samples only the first 5000 rows to stay fast.
- `prepare_data.py` runs without PyTorch, but `train.py` and `sample.py` require it.
- `train.py` and `sample.py` auto-select a GPU if available; otherwise they fall back to CPU.
- An optional feature-graph mixer (`model_use_feature_graph`) adds a learnable relation prior across feature channels.
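The cosine masking schedule mentioned in the Notes could take the following shape. This is one common cosine form, shown for illustration; the exact schedule in `hybrid_diffusion.py` may differ:

```python
import math

def mask_rate(t):
    """Fraction of discrete tokens masked at diffusion time t in [0, 1].

    Rises smoothly from 0 (clean data) to 1 (fully masked), corrupting
    more gently at early timesteps than a linear ramp would.
    """
    return 1.0 - math.cos(0.5 * math.pi * t)
```

During training, each discrete token is replaced by a `[MASK]` id with probability `mask_rate(t)`, and the model learns to recover the original tokens.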