Hybrid Diffusion for ICS Traffic (HAI 21.03) — Project Report
1. Project Goal
Build a hybrid diffusion-based generator for ICS traffic features, focusing on mixed continuous + discrete feature sequences. The output is feature-level sequences, not raw packets. The generator should preserve:
- Distributional fidelity (continuous ranges + discrete frequencies)
- Temporal consistency (time correlation and sequence structure)
- Field/logic consistency for discrete protocol-like columns
2. Data and Scope
Dataset used in the current implementation: HAI 21.03 (CSV feature traces).
Data path (default in config):
dataset/hai/hai-21.03/train*.csv.gz
Feature split (fixed schema): example/feature_split.json
- Continuous features: sensor/process values
- Discrete features: binary/low-cardinality status/flag fields
The time column is excluded from modeling.
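A minimal loader sketch for the feature split (the JSON keys shown here are assumptions; the real example/feature_split.json may use different names and columns):

```python
import json

# Hypothetical content standing in for example/feature_split.json;
# the actual file's keys and column names may differ.
raw = '{"continuous": ["P1_PIT01", "time"], "discrete": ["P1_PP01AD"]}'

def load_feature_split(text):
    """Parse the split and drop the time column, which is excluded from modeling."""
    obj = json.loads(text)
    cont = [c for c in obj.get("continuous", []) if c != "time"]
    disc = [c for c in obj.get("discrete", []) if c != "time"]
    return cont, disc

cont_cols, disc_cols = load_feature_split(raw)
```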
3. End-to-End Pipeline
One command pipeline:
python example/run_all.py --device cuda
Pipeline stages:
- Prepare data (example/prepare_data.py)
- Train model (example/train.py)
- Generate samples (example/export_samples.py)
- Evaluate (example/evaluate_generated.py)
4. Technical Architecture
4.1 Hybrid Diffusion Model (Core)
Defined in example/hybrid_diffusion.py.
Inputs:
- Continuous projection
- Discrete embeddings
- Time embedding (sinusoidal)
- Positional embedding (sequence index)
- Optional condition embedding (file_id)
Backbone:
- GRU (sequence modeling)
- Post LayerNorm + residual MLP
Outputs:
- Continuous head: predicts the target (eps or x0)
- Discrete heads: logits per discrete column
Continuous branch: Gaussian diffusion. Discrete branch: mask diffusion.
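The sinusoidal time embedding listed among the inputs can be sketched as follows (a minimal NumPy version for illustration; the actual module in example/hybrid_diffusion.py may differ in details):

```python
import numpy as np

def sinusoidal_embedding(t, dim):
    """Map integer diffusion timesteps t (shape [B]) to [B, dim] embeddings.

    Standard transformer-style encoding: (sin, cos) pairs at
    geometrically spaced frequencies.
    """
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)          # [half]
    args = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]   # [B, half]
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)       # [B, dim]

emb = sinusoidal_embedding([0, 10, 100], dim=32)
```

The same construction serves for the positional embedding, with the sequence index in place of the diffusion timestep.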
4.2 Temporal Backbone (GRU)
The GRU is the shared temporal backbone that fuses continuous + discrete signals into a unified sequence representation, enabling joint modeling of temporal dynamics and cross-feature dependencies.
5. Diffusion Formulations
5.1 Continuous Diffusion
Forward process:
x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps
Targets supported:
- eps prediction (default)
- x0 prediction (direct reconstruction)
Current config:
"cont_target": "x0"
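The forward process above can be sketched directly (illustrative NumPy; a linear beta schedule is assumed here for brevity, while the project config defines the real one):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed schedule, for illustration only
alpha_bar = np.cumprod(1.0 - betas)     # a_bar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, eps):
    """Draw x_t ~ q(x_t | x_0) = sqrt(a_bar_t)*x_0 + sqrt(1-a_bar_t)*eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.standard_normal((4, 8))        # [batch, features]
eps = rng.standard_normal(x0.shape)
x_t = q_sample(x0, t=500, eps=eps)
```

With "cont_target": "x0", the model is trained to reconstruct x0 from x_t; with eps prediction it regresses the injected noise instead.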
5.2 Discrete Diffusion
Mask diffusion with cosine schedule:
p(t) = 0.5 * (1 - cos(pi * t / T))
Mask-only cross-entropy is computed on masked positions.
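A minimal sketch of the masking forward step, assuming an extra mask token appended to each column's vocabulary (token ids here are illustrative):

```python
import numpy as np

T = 1000

def mask_prob(t, T=T):
    """Cosine masking rate: p(t) = 0.5 * (1 - cos(pi * t / T))."""
    return 0.5 * (1.0 - np.cos(np.pi * t / T))

rng = np.random.default_rng(0)
tokens = rng.integers(0, 5, size=(4, 16))       # [batch, seq] discrete ids in [0, 5)
MASK = 5                                         # assumed extra mask-token id
mask = rng.random(tokens.shape) < mask_prob(500)
x_t = np.where(mask, MASK, tokens)               # corrupt masked positions only
```

The schedule starts at p(0) = 0 (nothing masked) and reaches p(T) = 1 (everything masked), so the reverse process learns to unmask gradually.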
6. Loss Design
Total loss:
L = λ * L_cont + (1 − λ) * L_disc
6.1 Continuous Loss
- eps target: MSE(eps_pred, eps)
- x0 target: MSE(x0_pred, x0)
- Optional inverse-variance weighting: cont_loss_weighting = "inv_std"
6.2 Discrete Loss
Cross-entropy on masked positions only.
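Combining both branches, the total loss above can be sketched as follows (NumPy for self-containment; `lambda_cont` stands in for λ, and the eps-prediction variant is shown):

```python
import numpy as np

def masked_cross_entropy(logits, targets, mask):
    """CE over masked positions only. logits: [N, V], targets: [N], mask: [N] bool."""
    z = logits - logits.max(axis=-1, keepdims=True)            # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return nll[mask].mean() if mask.any() else 0.0

def total_loss(eps_pred, eps, logits, targets, mask, lambda_cont=0.5):
    """L = lambda * L_cont + (1 - lambda) * L_disc."""
    l_cont = np.mean((eps_pred - eps) ** 2)                    # MSE on noise
    l_disc = masked_cross_entropy(logits, targets, mask)
    return lambda_cont * l_cont + (1.0 - lambda_cont) * l_disc

# Perfect eps prediction + confident discrete logits -> near-zero total loss.
demo = total_loss(
    eps_pred=np.zeros((2, 4)), eps=np.zeros((2, 4)),
    logits=np.eye(3) * 10.0, targets=np.arange(3), mask=np.ones(3, dtype=bool),
)
```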
7. Data Processing
Defined in example/data_utils.py + example/prepare_data.py.
Key steps:
- Streaming mean/std/min/max + int-like detection
- Optional log1p transform for heavy-tailed continuous columns
- Discrete vocab + most frequent token
- Windowed batching with shuffle buffer
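The streaming statistics pass can be sketched with Welford's online algorithm (one possible one-pass implementation; example/data_utils.py may differ):

```python
class StreamingStats:
    """One-pass mean/std/min/max via Welford's online algorithm."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.min, self.max = float("inf"), float("-inf")

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)   # running sum of squared deviations
        self.min, self.max = min(self.min, x), max(self.max, x)

    @property
    def std(self):
        # Population std; the project may use the sample (n-1) variant instead.
        return (self.m2 / self.n) ** 0.5 if self.n > 1 else 0.0

stats = StreamingStats()
for v in [1.0, 2.0, 3.0, 4.0]:
    stats.update(v)
```

Because only (n, mean, m2, min, max) are kept, arbitrarily large CSV traces can be normalized without loading them into memory.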
8. Sampling & Export
Defined in:
- example/sample.py
- example/export_samples.py
Export process:
- Reverse diffusion sampling
- De-normalize continuous values
- Clamp to observed min/max
- Restore discrete tokens from vocab
- Write to CSV
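The de-normalize and clamp steps can be sketched as follows (the helper name and signature are illustrative, not the project's API):

```python
import numpy as np

def denormalize(x, mean, std, vmin, vmax, log1p=False):
    """Invert z-score normalization, optionally invert log1p, clamp to observed range."""
    y = x * std + mean
    if log1p:
        y = np.expm1(y)              # inverse of np.log1p
    return np.clip(y, vmin, vmax)    # clamp to min/max seen during stats pass

samples = np.array([-3.0, 0.0, 3.0])                      # model-space values
out = denormalize(samples, mean=10.0, std=2.0, vmin=5.0, vmax=15.0)
```

Clamping to the observed min/max keeps reverse-diffusion overshoot from producing physically impossible sensor values in the exported CSV.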
9. Evaluation
Defined in example/evaluate_generated.py.
Metrics (with reference):
- KS statistic (continuous distribution)
- Quantile diffs (q05/q25/q50/q75/q95)
- Lag‑1 correlation diff (temporal structure)
- Discrete JSD over vocab frequency
- Invalid token counts
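Plain-NumPy sketches of these metrics (illustrative helpers; example/evaluate_generated.py defines the authoritative versions):

```python
import numpy as np

def ks_stat(a, b):
    """Two-sample KS statistic: max gap between empirical CDFs."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

def lag1_corr(x):
    """Pearson correlation between x[t] and x[t+1]."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two frequency vectors."""
    p = p / p.sum(); q = q / q.sum(); m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

KS and the quantile diffs measure distributional fidelity per continuous column, lag-1 correlation diff measures temporal structure, and JSD over vocabulary frequencies measures discrete fidelity.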
10. Automation
example/run_all.py runs all stages with config-driven paths.
11. Key Engineering Decisions
- Mixed-type diffusion: continuous + discrete split
- Shared temporal backbone (GRU) to align sequence structure
- Positional + time embeddings for stability
- Optional inverse-variance weighting for continuous loss
- Log1p transforms for heavy-tailed signals
12. Code Map (Key Files)
- Core model: example/hybrid_diffusion.py
- Training: example/train.py
- Data prep: example/prepare_data.py
- Data utilities: example/data_utils.py
- Sampling: example/sample.py
- Export: example/export_samples.py
- Evaluation: example/evaluate_generated.py
- Pipeline: example/run_all.py
- Config: example/config.json
13. Known Issues / Current Limitations
- KS sometimes remains high → continuous distribution mismatch
- Lag‑1 may fluctuate → distribution vs temporal trade-off
- Continuous loss may dominate → needs careful weighting
14. Suggested Next Steps
- Add SNR-weighted loss for stable diffusion training
- Explore v‑prediction for continuous branch
- Consider two-stage training (temporal first, distribution second)
- Strengthen discrete diffusion (e.g., D3PM-style transitions)
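For reference, the v-prediction target and a min-SNR-style loss weighting (one common form for eps-prediction) can be written as follows; this is a sketch of the standard definitions, not project code:

```python
import numpy as np

def v_target(x0, eps, alpha_bar_t):
    """v-prediction target: v = sqrt(a_bar)*eps - sqrt(1-a_bar)*x0."""
    return np.sqrt(alpha_bar_t) * eps - np.sqrt(1.0 - alpha_bar_t) * x0

def min_snr_weight(alpha_bar_t, gamma=5.0):
    """Min-SNR weighting: min(SNR, gamma) / SNR, with SNR = a_bar / (1 - a_bar).

    Caps the influence of low-noise (high-SNR) timesteps so the continuous
    loss cannot dominate training, addressing the imbalance noted in Section 13.
    """
    snr = alpha_bar_t / (1.0 - alpha_bar_t)
    return np.minimum(snr, gamma) / snr
```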
15. Summary
This project implements a hybrid diffusion model for ICS feature sequences with a GRU backbone, handling continuous and discrete features separately while sharing temporal structure. The pipeline covers data preparation, training, sampling, export, and evaluation. The main research challenge remains balancing distributional fidelity (KS) against temporal consistency (lag‑1).
16. Latest Evaluation Snapshot
Computed averages from the latest eval.json:
- avg_ks: 0.5208903596698115
- avg_jsd: 0.010592151023360712
- avg_lag1_diff: 0.8265139723919303