Shortcomings of continuous features in capturing temporal correlations

2026-01-23 15:06:52 +08:00
parent 0d17be9a1c
commit ff12324560
12 changed files with 1212 additions and 68 deletions

@@ -67,6 +67,7 @@ python example/run_pipeline.py --device auto
- Optional conditioning by file id (`train*.csv.gz`) is enabled by default for multi-file training.
- Continuous head can be bounded with `tanh` via `use_tanh_eps` in config.
- Export now clamps continuous features to training min/max and preserves integer/decimal precision.
- Continuous features may be log1p-transformed automatically for heavy-tailed columns (see cont_stats.json).
- `<UNK>` tokens are replaced by the most frequent token for each discrete column at export.
- For speed, the script samples only the first 5000 rows.
- `prepare_data.py` runs without PyTorch, but `train.py` and `sample.py` require it.
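The log1p handling and export-time clamping mentioned above can be sketched as follows. This is a minimal illustration, not the repo's actual code: the function names, the skewness cutoff, and the `cont_stats.json` layout are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Assumed cutoff for deciding a column is "heavy-tailed"; the real
# pipeline may use a different criterion.
SKEW_THRESHOLD = 2.0

def fit_cont_stats(df: pd.DataFrame, cols: list[str]) -> dict:
    """Record per-column training min/max and whether log1p is applied.

    This dict is what a file like cont_stats.json might contain
    (hypothetical layout).
    """
    stats = {}
    for c in cols:
        x = df[c].astype(float)
        # log1p only makes sense for non-negative, right-skewed columns
        use_log1p = bool(x.min() >= 0 and x.skew() > SKEW_THRESHOLD)
        stats[c] = {
            "min": float(x.min()),
            "max": float(x.max()),
            "log1p": use_log1p,
        }
    return stats

def export_clamp(sampled: pd.DataFrame, stats: dict) -> pd.DataFrame:
    """At export: invert log1p where it was applied, then clamp each
    column to the min/max observed during training."""
    out = sampled.copy()
    for c, s in stats.items():
        x = out[c].astype(float)
        if s["log1p"]:
            x = np.expm1(x)  # inverse of log1p
        out[c] = x.clip(s["min"], s["max"])
    return out

# Example: a heavy-tailed column gets flagged, and out-of-range
# sampled values are pulled back into the training range.
df = pd.DataFrame({"amount": [1.0, 2.0, 3.0, 500.0, 10000.0]})
stats = fit_cont_stats(df, ["amount"])
clamped = export_clamp(pd.DataFrame({"amount": [-5.0, 20000.0]}),
                       {"amount": {"min": 1.0, "max": 10000.0,
                                   "log1p": False}})
```

Preserving integer/decimal precision (rounding each column back to the number of decimals seen in training) would be an additional step after clamping; it is omitted here for brevity.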