commit 200bdf6136e69f2cf015f8e1a918685e6229c7c2
Author: MingzheYang
Date:   Thu Jan 8 05:17:16 2026 +0800

    Add TODO-FOR-AI.md

diff --git a/TODO-FOR-AI.md b/TODO-FOR-AI.md
new file mode 100644
index 0000000..aea9546
--- /dev/null
+++ b/TODO-FOR-AI.md
@@ -0,0 +1,219 @@
+# Project Context for AI
+
+**Modbus / ICS Traffic Generation with Hybrid Diffusion**
+
+## 1. Project Background
+
+This project aims to build a **hybrid diffusion-based generative model** for industrial network traffic, inspired by the paper:
+
+> *Spatio-Temporal Diffusion Model for Cellular Traffic Generation (STOUTER)*
+
+Unlike the original paper (cellular traffic, aggregated continuous values), this project targets **industrial control system (ICS) traffic**, with a focus on **Modbus-like protocols**, where traffic features include both:
+
+* **continuous values** (e.g., timing, numeric payloads), and
+* **discrete protocol fields** (e.g., function codes, message types).
+
+The final goal is to generate **realistic, protocol-consistent traffic features**, which can later be converted into raw packets (PCAP) by an external generator.
+
+---
+
+## 2. Available Datasets (Current State)
+
+The dataset directory is:
+
+```text
+/Dev/diffusion/dataset/
+```
+
+It currently contains **two datasets**:
+
+```text
+dataset/
+├── modbus_dataset/
+└── hai/
+```
+
+### 2.1 Common Properties
+
+Both datasets:
+
+* Are already **preprocessed into CSV files**
+* Contain **traffic-level features**, not raw PCAP
+* Are suitable as **model input for diffusion-based generation**
+* Represent **sequences of network events / flows**, not aggregated hourly statistics
+
+No packet parsing is required at this stage.
+
+---
+
+### 2.2 Modbus Dataset
+
+* Domain: **Modbus / industrial control traffic**
+* Semantics:
+
+  * Explicit protocol meaning (request/response, function codes, registers)
+  * Strong logical and temporal constraints
+* Feature types typically include:
+
+  * Continuous:
+
+    * inter-arrival time
+    * numeric register values
+    * payload length
+  * Discrete:
+
+    * function code
+    * direction (master → slave / slave → master)
+    * message type
+
+This dataset aligns closely with the **target application domain** of the project.
+
+---
+
+### 2.3 HAI Dataset
+
+* Domain: **ICS network traffic (broader, not Modbus-only)**
+* Characteristics:
+
+  * Feature-extracted CSV format
+  * Contains both normal and abnormal behavior
+  * Less explicit protocol semantics compared to Modbus
+* Often used for:
+
+  * Anomaly detection
+  * Security-oriented modeling
+
+This dataset may be more suitable if the project emphasizes **security behavior patterns** rather than strict protocol logic.
+
+---
+
+## 3. Task 1: Dataset-Level Decision
+
+The AI should first:
+
+1. **Inspect both datasets**
+
+   * Compare feature schemas
+   * Identify:
+
+     * continuous vs discrete fields
+     * temporal resolution
+     * protocol specificity
+2. **Decide which dataset is more appropriate** for this project, based on:
+
+   * Alignment with Modbus-style protocol semantics
+   * Suitability for diffusion-based generation
+   * Ability to support mixed continuous + discrete modeling
+
+The decision should be **explicitly justified** (why one dataset is preferred over the other in this project context).
+
+---
+
+## 4. Modeling Goal
+
+After selecting the dataset, the AI should design a **hybrid diffusion model** that:
+
+* Operates on **feature-level traffic data**
+* Generates **synthetic traffic feature sequences**
+* Preserves:
+
+  * temporal patterns
+  * protocol-level consistency
+  * stochastic variability
+
+The model does **not** generate raw packets directly.
+
+---
+
+## 5. Hybrid Diffusion Design Constraints
+
+### 5.1 Feature Type Separation
+
+The selected dataset’s features should be divided into two groups:
+
+#### Continuous Features
+
+Examples:
+
+* inter-arrival time
+* numeric values
+* continuous statistics
+
+**Modeling requirement**:
+
+* Use **Gaussian diffusion (DDPM-style)**
+* Forward process: add Gaussian noise
+* Reverse process: predict noise with MSE (or L1) loss
+
+---
+
+#### Discrete Features
+
+Examples:
+
+* function code
+* message type
+* direction
+* categorical flags
+
+**Modeling requirement**:
+
+* Use **mask-based discrete diffusion**
+* Forward process: randomly replace tokens with `[MASK]`
+* Reverse process: predict original token via classification
+* Loss: cross-entropy (typically on masked positions only)
+
+---
+
+### 5.2 Unified Model Requirement
+
+The AI should design a model that:
+
+* Uses a **shared backbone** (e.g., UNet-like or temporal model)
+* Has:
+
+  * one head for continuous noise prediction
+  * one head for discrete token prediction
+* Trains with a **combined loss**:
+
+```text
+L = λ · L_continuous + (1 − λ) · L_discrete
+```
+
+---
+
+## 6. Output Expected from the AI's Reasoning
+
+The AI should produce:
+
+1. **Dataset selection result**
+
+   * Which dataset is chosen
+   * Why it is more suitable for this project
+2. **Feature breakdown**
+
+   * Which columns are continuous
+   * Which columns are discrete
+3. **Hybrid diffusion architecture**
+
+   * Input representation
+   * Forward noise strategy (continuous + discrete)
+   * Reverse denoising objectives
+4. **Training formulation**
+
+   * Loss definitions
+   * High-level training loop description
+
+Implementation details can remain **high-level / pseudocode-level** unless explicitly requested later.
+
+---
+
+## 7. Non-Goals (Important)
+
+* Do **not** design packet parsers
+* Do **not** generate raw PCAP directly
+* Do **not** assume image-style diffusion
+* Do **not** treat all features as continuous
+
+The focus is **feature-level hybrid diffusion modeling** under an ICS / Modbus context.
+
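
Note on sections 5.1 and 5.2 of TODO-FOR-AI.md: the two forward processes and the combined loss described there could be sketched roughly as below. This is only an illustrative, pure-Python sketch and not part of the commit. The function names, the `MASK = -1` token id, the `alpha_bar` noise-level parameter, and the representation of classifier outputs as per-position probability lists are all assumptions made for this example; a real implementation would use a tensor library and a learned backbone.

```python
import math
import random

MASK = -1  # hypothetical id standing in for the [MASK] token


def gaussian_forward(x0, alpha_bar, rng):
    """DDPM-style forward step: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
          for x, e in zip(x0, eps)]
    return xt, eps


def mask_forward(tokens, mask_prob, rng):
    """Mask-based forward step: each token becomes MASK with prob mask_prob."""
    masked = [rng.random() < mask_prob for _ in tokens]
    noised = [MASK if m else t for t, m in zip(tokens, masked)]
    return noised, masked


def mse(pred_eps, eps):
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(pred_eps, eps)) / len(eps)


def masked_cross_entropy(probs, targets, masked):
    """Mean -log p(target) over masked positions only (0.0 if none masked).

    probs[i] is a list of class probabilities for position i; the target
    class is assumed to have non-zero probability.
    """
    losses = [-math.log(p[t]) for p, t, m in zip(probs, targets, masked) if m]
    return sum(losses) / len(losses) if losses else 0.0


def hybrid_loss(l_continuous, l_discrete, lam):
    """Combined loss: L = lam * L_continuous + (1 - lam) * L_discrete."""
    return lam * l_continuous + (1.0 - lam) * l_discrete


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy sequence: inter-arrival times and function codes.
    xt, eps = gaussian_forward([0.12, 0.50, 1.30], alpha_bar=0.9, rng=rng)
    noised, masked = mask_forward([3, 16, 3], mask_prob=0.5, rng=rng)
    print(xt, noised, masked)
```

In this sketch the continuous head would be trained to recover `eps` from `xt`, and the discrete head would be trained to recover the original tokens at positions where `masked` is true. `lam` plays the role of λ in the combined loss.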