Add TODO-FOR-AI.md
219
TODO-FOR-AI.md
Normal file
@@ -0,0 +1,219 @@
# Project Context for AI

**Modbus / ICS Traffic Generation with Hybrid Diffusion**

## 1. Project Background

This project aims to build a **hybrid diffusion-based generative model** for industrial network traffic, inspired by the paper:

> *Spatio-Temporal Diffusion Model for Cellular Traffic Generation (STOUTER)*

Unlike the original paper (cellular traffic, aggregated continuous values), this project targets **industrial control system (ICS) traffic**, with a focus on **Modbus-like protocols**, where traffic features include both:

* **continuous values** (e.g., timing, numeric payloads), and
* **discrete protocol fields** (e.g., function codes, message types).

The final goal is to generate **realistic, protocol-consistent traffic features**, which can later be converted into raw packets (PCAP) by an external generator.

---

## 2. Available Datasets (Current State)

The dataset directory is:

```text
/Dev/diffusion/dataset/
```

It currently contains **two datasets**:

```text
dataset/
├── modbus_dataset/
└── hai/
```

### 2.1 Common Properties

Both datasets:

* Are already **preprocessed into CSV files**
* Contain **traffic-level features**, not raw PCAP
* Are suitable as **model input for diffusion-based generation**
* Represent **sequences of network events / flows**, not aggregated hourly statistics

No packet parsing is required at this stage.

---

### 2.2 Modbus Dataset

* Domain: **Modbus / industrial control traffic**
* Semantics:

  * Explicit protocol meaning (request/response, function codes, registers)
  * Strong logical and temporal constraints
* Feature types typically include:

  * Continuous:

    * inter-arrival time
    * numeric register values
    * payload length
  * Discrete:

    * function code
    * direction (master → slave / slave → master)
    * message type

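As a rough illustration of the continuous/discrete split listed above, one event row could be modelled as a small typed record. This is only a sketch: the field names (`inter_arrival_s`, `register_value`, and so on), the `Direction` encoding, and the `split_event` helper are hypothetical placeholders, not the actual column names in `modbus_dataset/`, which still have to be inspected.

```python
from dataclasses import dataclass, fields
from enum import Enum
from typing import List, Tuple


class Direction(Enum):
    """Who sent the message (hypothetical encoding)."""
    MASTER_TO_SLAVE = 0
    SLAVE_TO_MASTER = 1


@dataclass(frozen=True)
class ModbusEvent:
    """One row of feature-level traffic; every field name is an assumption."""
    # Continuous features (candidates for Gaussian diffusion)
    inter_arrival_s: float
    register_value: float
    payload_length: float
    # Discrete features (candidates for mask-based diffusion)
    function_code: int
    direction: Direction
    message_type: str


CONTINUOUS = ("inter_arrival_s", "register_value", "payload_length")
DISCRETE = tuple(f.name for f in fields(ModbusEvent) if f.name not in CONTINUOUS)


def split_event(event: ModbusEvent) -> Tuple[List[float], List[object]]:
    """Return (continuous values, discrete values) for one event."""
    cont = [float(getattr(event, name)) for name in CONTINUOUS]
    disc = [getattr(event, name) for name in DISCRETE]
    return cont, disc


# Usage with made-up values; function code 3 is "Read Holding Registers".
event = ModbusEvent(0.02, 1234.0, 12.0, 3, Direction.MASTER_TO_SLAVE, "request")
continuous_part, discrete_part = split_event(event)
```
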
This dataset aligns closely with the **target application domain** of the project.

---

### 2.3 HAI Dataset

* Domain: **ICS network traffic (broader, not Modbus-only)**
* Characteristics:

  * Feature-extracted CSV format
  * Contains both normal and abnormal behavior
  * Less explicit protocol semantics compared to Modbus
* Often used for:

  * Anomaly detection
  * Security-oriented modeling

This dataset may be more suitable if the project emphasizes **security behavior patterns** rather than strict protocol logic.

---

## 3. Task 1: Dataset-Level Decision

The AI should first:

1. **Inspect both datasets**

   * Compare feature schemas
   * Identify:

     * continuous vs discrete fields
     * temporal resolution
     * protocol specificity
2. **Decide which dataset is more appropriate** for this project, based on:

   * Alignment with Modbus-style protocol semantics
   * Suitability for diffusion-based generation
   * Ability to support mixed continuous + discrete modeling

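A first pass over the two schemas could be automated roughly as follows. This is a sketch under stated assumptions: the CSVs have a header row, and the heuristic (non-numeric columns, or numeric columns with at most `max_categories` distinct values, count as discrete) is an illustrative choice rather than a rule taken from either dataset. The function name `infer_column_types`, the threshold, and the sample column names are all hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, List


def _is_number(text: str) -> bool:
    """True if the cell parses as a float (empty or missing cells do not)."""
    try:
        float(text)
        return True
    except (ValueError, TypeError):
        return False


def infer_column_types(
    rows: Iterable[Dict[str, str]], max_categories: int = 16
) -> Dict[str, str]:
    """Label each column 'continuous' or 'discrete' with a simple heuristic."""
    values: Dict[str, List[str]] = {}
    for row in rows:
        for name, cell in row.items():
            values.setdefault(name, []).append(cell)
    types = {}
    for name, cells in values.items():
        numeric = all(_is_number(c) for c in cells)
        few_distinct = len(set(cells)) <= max_categories
        types[name] = "continuous" if numeric and not few_distinct else "discrete"
    return types


# Tiny in-memory stand-in for a real CSV file (values are made up).
sample = io.StringIO(
    "inter_arrival,function_code,direction\n"
    "0.013,3,request\n"
    "0.250,3,response\n"
    "0.014,16,request\n"
)
column_types = infer_column_types(csv.DictReader(sample), max_categories=2)
```
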
The decision should be **explicitly justified** (why one dataset is preferred over the other in this project context).

---

## 4. Modeling Goal

After selecting the dataset, the AI should design a **hybrid diffusion model** that:

* Operates on **feature-level traffic data**
* Generates **synthetic traffic feature sequences**
* Preserves:

  * temporal patterns
  * protocol-level consistency
  * stochastic variability

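Because the model works on sequences rather than isolated rows, the event stream has to be cut into training examples somehow. One possible approach, shown here purely as an assumption, is fixed-length sliding windows; the helper name `sliding_windows` and the window and stride values are hypothetical and not prescribed by this document.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(events: Sequence[T], length: int, stride: int = 1) -> List[List[T]]:
    """Cut an ordered event sequence into fixed-length, possibly overlapping windows."""
    if length <= 0 or stride <= 0:
        raise ValueError("length and stride must be positive")
    return [
        list(events[start:start + length])
        for start in range(0, len(events) - length + 1, stride)
    ]


# Usage: 6 events, windows of 4 with stride 2 give two overlapping windows.
windows = sliding_windows(["e0", "e1", "e2", "e3", "e4", "e5"], length=4, stride=2)
```
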
The model does **not** generate raw packets directly.

---

## 5. Hybrid Diffusion Design Constraints

### 5.1 Feature Type Separation

The selected dataset’s features should be divided into two groups:

#### Continuous Features

Examples:

* inter-arrival time
* numeric values
* continuous statistics

**Modeling requirement**:

* Use **Gaussian diffusion (DDPM-style)**
* Forward process: add Gaussian noise
* Reverse process: predict noise with MSE (or L1) loss

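The continuous branch described above can be sketched in plain Python as follows. It assumes the standard DDPM closed form, x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε, with the linear β schedule from the original DDPM paper; the helper names (`linear_betas`, `alpha_bars`, `q_sample`, `mse`) and the feature values are illustrative, and the all-zero "prediction" merely stands in for a real network output.

```python
import math
import random
from typing import List, Tuple


def linear_betas(num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> List[float]:
    """Linear noise schedule (DDPM defaults: 1e-4 to 0.02 over 1000 steps)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas: List[float]) -> List[float]:
    """Cumulative products of (1 - beta_t)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def q_sample(
    x0: List[float], t: int, abar: List[float], rng: random.Random
) -> Tuple[List[float], List[float]]:
    """Forward process: draw x_t given x_0 in one shot; also return the noise used."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, s = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
    xt = [a * x + s * e for x, e in zip(x0, eps)]
    return xt, eps


def mse(pred: List[float], target: List[float]) -> float:
    """Mean squared error between predicted and true noise."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


rng = random.Random(0)
abar = alpha_bars(linear_betas(1000))
x0 = [0.12, -0.40, 1.30]               # standardized continuous features (made up)
xt, eps = q_sample(x0, t=500, abar=abar, rng=rng)
dummy_prediction = [0.0] * len(eps)    # placeholder for the continuous head's output
loss_continuous = mse(dummy_prediction, eps)
```
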
---

#### Discrete Features

Examples:

* function code
* message type
* direction
* categorical flags

**Modeling requirement**:

* Use **mask-based discrete diffusion**
* Forward process: randomly replace tokens with `[MASK]`
* Reverse process: predict original token via classification
* Loss: cross-entropy (typically on masked positions only)

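The discrete branch can be sketched in the same spirit. The code below assumes each position is masked independently with a single probability `mask_prob` (a full model would typically tie this to the diffusion step, which is an assumption here), and it uses a uniform distribution as a stand-in for the classifier head. The token strings and the helper names `mask_tokens` and `masked_cross_entropy` are hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence

MASK = "[MASK]"


def mask_tokens(tokens: Sequence[str], mask_prob: float, rng: random.Random) -> List[str]:
    """Forward process: replace each token with [MASK] with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


def masked_cross_entropy(
    probs: Sequence[Dict[str, float]],
    targets: Sequence[str],
    corrupted: Sequence[str],
) -> float:
    """Average -log p(original token), counted on masked positions only."""
    losses = [
        -math.log(p[target])
        for p, target, seen in zip(probs, targets, corrupted)
        if seen == MASK
    ]
    return sum(losses) / len(losses) if losses else 0.0


rng = random.Random(0)
original = ["fc=3", "request", "fc=3", "response"]   # made-up discrete tokens
corrupted = mask_tokens(original, mask_prob=0.5, rng=rng)

vocab = sorted(set(original))
uniform = {tok: 1.0 / len(vocab) for tok in vocab}    # stand-in for the discrete head
loss_discrete = masked_cross_entropy([uniform] * len(original), original, corrupted)
```
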
---

### 5.2 Unified Model Requirement

The AI should design a model that:

* Uses a **shared backbone** (e.g., UNet-like or temporal model)
* Has:

  * one head for continuous noise prediction
  * one head for discrete token prediction
* Trains with a **combined loss**:

```text
L = λ · L_continuous + (1 − λ) · L_discrete
```

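The shared-backbone, two-head layout and the combined loss can be illustrated with a toy structure. Everything here is a placeholder: `HybridDenoiser`, its three callables, and the lambda "networks" are hypothetical and only show how the pieces connect. A real implementation would presumably use a deep-learning framework. Only the weighting formula itself comes from the text above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vector = List[float]


@dataclass
class HybridDenoiser:
    """Shared backbone feeding two task-specific heads (all three are placeholders)."""
    backbone: Callable[[Vector, int], Vector]     # (noisy input, step t) -> hidden features
    continuous_head: Callable[[Vector], Vector]   # hidden -> predicted noise
    discrete_head: Callable[[Vector], Vector]     # hidden -> token logits

    def __call__(self, x: Vector, t: int) -> Tuple[Vector, Vector]:
        hidden = self.backbone(x, t)
        return self.continuous_head(hidden), self.discrete_head(hidden)


def hybrid_loss(loss_continuous: float, loss_discrete: float, lam: float = 0.5) -> float:
    """L = lam * L_continuous + (1 - lam) * L_discrete, with lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * loss_continuous + (1.0 - lam) * loss_discrete


# Toy wiring: the lambdas only demonstrate the data flow, not a real network.
model = HybridDenoiser(
    backbone=lambda x, t: [v + t for v in x],
    continuous_head=lambda h: h[:1],
    discrete_head=lambda h: h[1:],
)
eps_hat, logits = model([0.5, 1.0, 2.0], t=1)
total = hybrid_loss(loss_continuous=0.8, loss_discrete=0.4, lam=0.25)
```
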
---

## 6. Output Expected from the AI's Reasoning

The AI should produce:

1. **Dataset selection result**

   * Which dataset is chosen
   * Why it is more suitable for this project
2. **Feature breakdown**

   * Which columns are continuous
   * Which columns are discrete
3. **Hybrid diffusion architecture**

   * Input representation
   * Forward noise strategy (continuous + discrete)
   * Reverse denoising objectives
4. **Training formulation**

   * Loss definitions
   * High-level training loop description

Implementation details can remain **high-level / pseudocode-level** unless explicitly requested later.

---

## 7. Non-Goals (Important)

* Do **not** design packet parsers
* Do **not** generate raw PCAP directly
* Do **not** assume image-style diffusion
* Do **not** treat all features as continuous

The focus is **feature-level hybrid diffusion modeling** under an ICS / Modbus context.