Add TODO-FOR-AI.md

2026-01-08 05:17:16 +08:00
commit 200bdf6136

TODO-FOR-AI.md Normal file

@@ -0,0 +1,219 @@
# Project Context for AI
**Modbus / ICS Traffic Generation with Hybrid Diffusion**
## 1. Project Background
This project aims to build a **hybrid diffusion-based generative model** for industrial network traffic, inspired by the paper:
> *Spatio-Temporal Diffusion Model for Cellular Traffic Generation (STOUTER)*
Unlike the original paper (cellular traffic, aggregated continuous values), this project targets **industrial control system (ICS) traffic**, with a focus on **Modbus-like protocols**, where traffic features include both:
* **continuous values** (e.g., timing, numeric payloads), and
* **discrete protocol fields** (e.g., function codes, message types).
The final goal is to generate **realistic, protocol-consistent traffic features**, which can later be converted into raw packets (PCAP) by an external generator.
---
## 2. Available Datasets (Current State)
The dataset directory is:
```text
/Dev/diffusion/dataset/
```
It currently contains **two datasets**:
```text
dataset/
├── modbus_dataset/
└── hai/
```
### 2.1 Common Properties
Both datasets:
* Are already **preprocessed into CSV files**
* Contain **traffic-level features**, not raw PCAP
* Are suitable as **model input for diffusion-based generation**
* Represent **sequences of network events / flows**, not aggregated hourly statistics
No packet parsing is required at this stage.
---
### 2.2 Modbus Dataset
* Domain: **Modbus / industrial control traffic**
* Semantics:
  * Explicit protocol meaning (request/response, function codes, registers)
  * Strong logical and temporal constraints
* Feature types typically include:
  * Continuous:
    * inter-arrival time
    * numeric register values
    * payload length
  * Discrete:
    * function code
    * direction (master → slave / slave → master)
    * message type
This dataset aligns closely with the **target application domain** of the project.
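To make the continuous/discrete split concrete, one event row could be represented roughly as follows. This is a minimal sketch under stated assumptions: the `ModbusEvent` class, its field names, and the three vocabularies are hypothetical and chosen for illustration; the real column names and category sets must be read from the CSV files.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical vocabularies; the real ones must be derived from the CSV columns.
FUNCTION_CODES = [1, 2, 3, 4, 5, 6, 15, 16]  # subset of standard Modbus function codes
DIRECTIONS = ["master_to_slave", "slave_to_master"]
MESSAGE_TYPES = ["request", "response"]


@dataclass
class ModbusEvent:
    # Continuous features
    inter_arrival_time: float  # seconds since the previous event
    register_value: float
    payload_length: float
    # Discrete features
    function_code: int
    direction: str
    message_type: str

    def continuous(self) -> List[float]:
        """Values intended for the Gaussian diffusion branch."""
        return [self.inter_arrival_time, self.register_value, self.payload_length]

    def discrete(self) -> List[int]:
        """Token ids intended for the mask-based diffusion branch."""
        return [
            FUNCTION_CODES.index(self.function_code),
            DIRECTIONS.index(self.direction),
            MESSAGE_TYPES.index(self.message_type),
        ]


event = ModbusEvent(0.02, 1200.0, 12.0, 3, "master_to_slave", "request")
print(event.continuous())
print(event.discrete())
```

Keeping the two views separate from the start means each branch of the model only ever sees the representation it is designed for.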
---
### 2.3 HAI Dataset
* Domain: **ICS network traffic (broader, not Modbus-only)**
* Characteristics:
  * Feature-extracted CSV format
  * Contains both normal and abnormal behavior
  * Less explicit protocol semantics compared to Modbus
* Often used for:
  * Anomaly detection
  * Security-oriented modeling
This dataset may be more suitable if the project emphasizes **security behavior patterns** rather than strict protocol logic.
---
## 3. Task 1: Dataset-Level Decision
The AI should first:
1. **Inspect both datasets**
   * Compare feature schemas
   * Identify:
     * continuous vs discrete fields
     * temporal resolution
     * protocol specificity
2. **Decide which dataset is more appropriate** for this project, based on:
   * Alignment with Modbus-style protocol semantics
   * Suitability for diffusion-based generation
   * Ability to support mixed continuous + discrete modeling
The decision should be **explicitly justified** (why one dataset is preferred over the other in this project context).
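The schema inspection step could start from a simple column classifier like the one below. It is only a sketch: the function name `infer_column_types`, the `max_categories` threshold, and the sample column names are assumptions, and the heuristic (numeric with many distinct values means continuous) will need manual review for fields such as register addresses.

```python
import csv
import io


def infer_column_types(rows, max_categories=20):
    """Label each column 'continuous' or 'discrete'.

    Heuristic (an assumption, not a rule from the datasets): a column is
    continuous if every value parses as a float and it has more than
    `max_categories` distinct values; otherwise it is discrete.
    """
    types = {}
    for name in rows[0].keys():
        values = [row[name] for row in rows]
        try:
            for v in values:
                float(v)
            numeric = True
        except ValueError:
            numeric = False
        distinct = len(set(values))
        types[name] = "continuous" if numeric and distinct > max_categories else "discrete"
    return types


# Tiny inline CSV standing in for a real file; column names are made up.
sample = io.StringIO(
    "inter_arrival,function_code,direction\n"
    "0.013,3,req\n"
    "0.250,3,resp\n"
    "0.041,16,req\n"
    "0.502,16,resp\n"
)
rows = list(csv.DictReader(sample))
print(infer_column_types(rows, max_categories=3))
```

Running the same function over both datasets gives a quick side-by-side view of how many columns fall into each group, which feeds directly into the justification.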
---
## 4. Modeling Goal
After selecting the dataset, the AI should design a **hybrid diffusion model** that:
* Operates on **feature-level traffic data**
* Generates **synthetic traffic feature sequences**
* Preserves:
  * temporal patterns
  * protocol-level consistency
  * stochastic variability
The model does **not** generate raw packets directly.
---
## 5. Hybrid Diffusion Design Constraints
### 5.1 Feature Type Separation
The selected dataset's features should be divided into two groups:
#### Continuous Features
Examples:
* inter-arrival time
* numeric values
* continuous statistics
**Modeling requirement**:
* Use **Gaussian diffusion (DDPM-style)**
* Forward process: add Gaussian noise
* Reverse process: predict noise with MSE (or L1) loss
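The forward process and noise-prediction loss above can be sketched in plain Python as follows. The linear beta schedule, its endpoints, and the helper names (`make_alpha_bars`, `q_sample`, `mse`) are assumptions for illustration; a real implementation would use a tensor library and a trained network in place of the zero prediction.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (num_steps >= 2)."""
    alpha_bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = alpha_bars[t]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps


def mse(pred, target):
    """Mean squared error between predicted and true noise."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


rng = random.Random(0)
alpha_bars = make_alpha_bars(num_steps=1000)
x0 = [0.02, 1.5, -0.3]  # normalized continuous features (made-up values)
x_t, eps = q_sample(x0, t=500, alpha_bars=alpha_bars, rng=rng)
eps_pred = [0.0] * len(eps)  # stand-in for the network's noise prediction
print(mse(eps_pred, eps))
```

Continuous features should be normalized before noising, since the schedule assumes inputs on roughly unit scale.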
---
#### Discrete Features
Examples:
* function code
* message type
* direction
* categorical flags
**Modeling requirement**:
* Use **mask-based discrete diffusion**
* Forward process: randomly replace tokens with `[MASK]`
* Reverse process: predict original token via classification
* Loss: cross-entropy (typically on masked positions only)
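A minimal sketch of the masking forward process and the masked cross-entropy loss is shown below. The `MASK` sentinel value, the linear masking probability `t / num_steps`, and the function names are assumptions; other masking schedules are equally possible, and the uniform distribution stands in for a real classifier head.

```python
import math
import random

MASK = -1  # hypothetical sentinel id for the [MASK] token


def mask_tokens(tokens, t, num_steps, rng):
    """Forward process: mask each token independently with probability t / num_steps."""
    p = t / num_steps
    return [MASK if rng.random() < p else tok for tok in tokens]


def masked_cross_entropy(probs, targets, noisy):
    """Mean cross-entropy over masked positions only.

    probs[i] is a predicted distribution (list of probabilities) for position i.
    """
    losses = [
        -math.log(probs[i][targets[i]])
        for i in range(len(targets))
        if noisy[i] == MASK
    ]
    return sum(losses) / len(losses) if losses else 0.0


rng = random.Random(0)
tokens = [2, 0, 1, 2, 0, 1]  # e.g. function-code ids (made up)
noisy = mask_tokens(tokens, t=500, num_steps=1000, rng=rng)
uniform = [[1 / 3] * 3 for _ in tokens]  # stand-in for the classifier head output
print(noisy, masked_cross_entropy(uniform, tokens, noisy))
```

At `t = 0` nothing is masked and at `t = num_steps` everything is, which mirrors the clean-to-noise range of the Gaussian branch.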
---
### 5.2 Unified Model Requirement
The AI should design a model that:
* Uses a **shared backbone** (e.g., UNet-like or temporal model)
* Has:
  * one head for continuous noise prediction
  * one head for discrete token prediction
* Trains with a **combined loss**:
```text
L = λ · L_continuous + (1 − λ) · L_discrete
```
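As a small illustration, the combined loss is a convex combination of the two branch losses, so λ must stay within [0, 1]. The function name `hybrid_loss`, the default `lam=0.5`, and the loss values are assumptions for this sketch.

```python
def hybrid_loss(loss_continuous, loss_discrete, lam=0.5):
    """L = lam * L_continuous + (1 - lam) * L_discrete, with lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * loss_continuous + (1.0 - lam) * loss_discrete


# Made-up per-batch losses from the two heads.
print(hybrid_loss(0.8, 1.2, lam=0.25))
```

Because MSE and cross-entropy live on different scales, λ usually needs tuning rather than being fixed at 0.5.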
---
## 6. Output Expected from the AI's Reasoning
The AI should produce:
1. **Dataset selection result**
   * Which dataset is chosen
   * Why it is more suitable for this project
2. **Feature breakdown**
   * Which columns are continuous
   * Which columns are discrete
3. **Hybrid diffusion architecture**
   * Input representation
   * Forward noise strategy (continuous + discrete)
   * Reverse denoising objectives
4. **Training formulation**
   * Loss definitions
   * High-level training loop description
Implementation details can remain **high-level / pseudocode-level** unless explicitly requested later.
---
## 7. Non-Goals (Important)
* Do **not** design packet parsers
* Do **not** generate raw PCAP directly
* Do **not** assume image-style diffusion
* Do **not** treat all features as continuous
The focus is **feature-level hybrid diffusion modeling** under an ICS / Modbus context.