Shortcomings of continuous features in capturing temporal correlations

2026-01-23 15:06:52 +08:00
parent 0d17be9a1c
commit ff12324560
12 changed files with 1212 additions and 68 deletions

@@ -67,6 +67,7 @@ python example/run_pipeline.py --device auto
- Optional conditioning by file id (`train*.csv.gz`) is enabled by default for multi-file training.
- Continuous head can be bounded with `tanh` via `use_tanh_eps` in config.
- Export now clamps continuous features to training min/max and preserves integer/decimal precision.
- Continuous features may be log1p-transformed automatically for heavy-tailed columns (see cont_stats.json).
- `<UNK>` tokens are replaced by the most frequent token for each discrete column at export.
- For speed, the script samples only the first 5000 rows.
- `prepare_data.py` runs without PyTorch, but `train.py` and `sample.py` require it.
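The log1p handling and export-time clamping mentioned above can be sketched as follows. This is a minimal illustration, not the repo's actual code: the function names, the skewness cutoff, and the `cont_stats.json` layout are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Assumed cutoff for deciding a column is "heavy-tailed"; the real
# pipeline may use a different criterion.
SKEW_THRESHOLD = 2.0

def fit_cont_stats(df: pd.DataFrame, cols: list[str]) -> dict:
    """Record per-column training min/max and whether log1p is applied.

    This dict is what a file like cont_stats.json might contain
    (hypothetical layout).
    """
    stats = {}
    for c in cols:
        x = df[c].astype(float)
        # log1p only makes sense for non-negative, right-skewed columns
        use_log1p = bool(x.min() >= 0 and x.skew() > SKEW_THRESHOLD)
        stats[c] = {
            "min": float(x.min()),
            "max": float(x.max()),
            "log1p": use_log1p,
        }
    return stats

def export_clamp(sampled: pd.DataFrame, stats: dict) -> pd.DataFrame:
    """At export: invert log1p where it was applied, then clamp each
    column to the min/max observed during training."""
    out = sampled.copy()
    for c, s in stats.items():
        x = out[c].astype(float)
        if s["log1p"]:
            x = np.expm1(x)  # inverse of log1p
        out[c] = x.clip(s["min"], s["max"])
    return out

# Example: a heavy-tailed column gets flagged, and out-of-range
# sampled values are pulled back into the training range.
df = pd.DataFrame({"amount": [1.0, 2.0, 3.0, 500.0, 10000.0]})
stats = fit_cont_stats(df, ["amount"])
clamped = export_clamp(pd.DataFrame({"amount": [-5.0, 20000.0]}),
                       {"amount": {"min": 1.0, "max": 10000.0,
                                   "log1p": False}})
```

Preserving integer/decimal precision (rounding each column back to the number of decimals seen in training) would be an additional step after clamping; it is omitted here for brevity.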