Add TODO-FOR-AI.md
# Project Context for AI

**Modbus / ICS Traffic Generation with Hybrid Diffusion**

## 1. Project Background

This project aims to build a **hybrid diffusion-based generative model** for industrial network traffic, inspired by the paper:

> *Spatio-Temporal Diffusion Model for Cellular Traffic Generation (STOUTER)*

Unlike the original paper (cellular traffic, aggregated continuous values), this project targets **industrial control system (ICS) traffic**, with a focus on **Modbus-like protocols**, where traffic features include both:

* **continuous values** (e.g., timing, numeric payloads), and
* **discrete protocol fields** (e.g., function codes, message types).

The final goal is to generate **realistic, protocol-consistent traffic features**, which can later be converted into raw packets (PCAP) by an external generator.

---

## 2. Available Datasets (Current State)

The dataset directory is:

```text
/Dev/diffusion/dataset/
```

It currently contains **two datasets**:

```text
dataset/
├── modbus_dataset/
└── hai/
```

### 2.1 Common Properties

Both datasets:

* Are already **preprocessed into CSV files**
* Contain **traffic-level features**, not raw PCAP
* Are suitable as **model input for diffusion-based generation**
* Represent **sequences of network events / flows**, not aggregated hourly statistics

No packet parsing is required at this stage.

---

### 2.2 Modbus Dataset

* Domain: **Modbus / industrial control traffic**
* Semantics:

  * Explicit protocol meaning (request/response, function codes, registers)
  * Strong logical and temporal constraints
* Feature types typically include:

  * Continuous:

    * inter-arrival time
    * numeric register values
    * payload length
  * Discrete:

    * function code
    * direction (master → slave / slave → master)
    * message type

This dataset aligns closely with the **target application domain** of the project.
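
The continuous/discrete split listed above can be pictured as a single event record. The following is only a sketch: the class name `ModbusEvent` and every field name are hypothetical and are not taken from the actual CSV schema, which still has to be inspected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModbusEvent:
    """One traffic event; all field names here are illustrative assumptions."""

    # Continuous features (candidates for Gaussian diffusion)
    inter_arrival_s: float
    register_value: float
    payload_len: float
    # Discrete features (candidates for mask-based diffusion)
    function_code: int
    direction: str
    message_type: str

    def continuous(self) -> list[float]:
        """Return the continuous part as a flat numeric vector."""
        return [self.inter_arrival_s, self.register_value, self.payload_len]

    def discrete(self) -> list[str]:
        """Return the discrete part as a list of string tokens."""
        return [str(self.function_code), self.direction, self.message_type]


event = ModbusEvent(0.02, 117.0, 12.0, 3, "master_to_slave", "request")
```

Keeping the two views separate at the record level makes it straightforward to route each group to its own forward noise process later.
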

---

### 2.3 HAI Dataset

* Domain: **ICS network traffic (broader, not Modbus-only)**
* Characteristics:

  * Feature-extracted CSV format
  * Contains both normal and abnormal behavior
  * Less explicit protocol semantics compared to Modbus
* Often used for:

  * Anomaly detection
  * Security-oriented modeling

This dataset may be more suitable if the project emphasizes **security behavior patterns** rather than strict protocol logic.

---

## 3. Task 1: Dataset-Level Decision

The AI should first:

1. **Inspect both datasets**

   * Compare feature schemas
   * Identify:

     * continuous vs discrete fields
     * temporal resolution
     * protocol specificity
2. **Decide which dataset is more appropriate** for this project, based on:

   * Alignment with Modbus-style protocol semantics
   * Suitability for diffusion-based generation
   * Ability to support mixed continuous + discrete modeling

The decision should be **explicitly justified** (why one dataset is preferred over the other in this project context).
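
One possible way to start the schema inspection is a simple heuristic that labels each CSV column as continuous or discrete. This is a sketch under stated assumptions: the column names and sample rows are invented for illustration, and the rule itself (non-numeric, or integer-valued with few distinct levels, counts as discrete) and the `max_categories` threshold are arbitrary choices that should be checked by hand against the real files.

```python
import csv
import io


def _is_number(text):
    """Return True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def classify_columns(rows, max_categories=16):
    """Label each column 'continuous' or 'discrete' with a crude heuristic."""
    result = {}
    for name in rows[0].keys():
        values = [row[name] for row in rows]
        if not all(_is_number(v) for v in values):
            result[name] = "discrete"  # any non-numeric value: categorical
            continue
        numbers = [float(v) for v in values]
        integer_valued = all(n.is_integer() for n in numbers)
        few_levels = len(set(numbers)) <= max_categories
        result[name] = "discrete" if integer_valued and few_levels else "continuous"
    return result


# Hypothetical sample; the real column names may differ.
SAMPLE_CSV = (
    "inter_arrival,function_code,direction\n"
    "0.020,3,request\n"
    "0.513,3,response\n"
    "0.021,16,request\n"
)
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
schema = classify_columns(rows)
```

On small samples an integer-valued continuous column (for example a payload length) can be mislabeled as discrete, so the output is a starting point for the feature breakdown rather than a final answer.
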

---

## 4. Modeling Goal

After selecting the dataset, the AI should design a **hybrid diffusion model** that:

* Operates on **feature-level traffic data**
* Generates **synthetic traffic feature sequences**
* Preserves:

  * temporal patterns
  * protocol-level consistency
  * stochastic variability

The model does **not** generate raw packets directly.

---

## 5. Hybrid Diffusion Design Constraints

### 5.1 Feature Type Separation

The selected dataset’s features should be divided into two groups:

#### Continuous Features

Examples:

* inter-arrival time
* numeric values
* continuous statistics

**Modeling requirement**:

* Use **Gaussian diffusion (DDPM-style)**
* Forward process: add Gaussian noise
* Reverse process: predict noise with MSE (or L1) loss
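
The forward process for the continuous group can be sketched in a few lines. This is an illustration, not the project's implementation: it assumes the standard DDPM closed form `x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε` with a linear β schedule, and the step count and β range are commonly used defaults rather than values chosen for this project.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1 .. beta_T (assumed defaults)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Sample x_t given x_0 in closed form; returns (x_t, noise)."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return x_t, noise


def mse(pred, target):
    """Mean squared error between predicted and true noise."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
x_t, noise = q_sample([0.5, -1.0], t=10, alpha_bars=alpha_bars, rng=random.Random(0))
```

During training the network would receive `x_t` and `t` and be scored with `mse(predicted_noise, noise)`. Continuous features are usually normalized first so that unit-variance noise is on a comparable scale.
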

---

#### Discrete Features

Examples:

* function code
* message type
* direction
* categorical flags

**Modeling requirement**:

* Use **mask-based discrete diffusion**
* Forward process: randomly replace tokens with `[MASK]`
* Reverse process: predict original token via classification
* Loss: cross-entropy (typically on masked positions only)
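
The masking forward process and the masked-position loss can be sketched as follows. The helper names are hypothetical, and tying `mask_prob` to the diffusion timestep (for example `t / T`) is an assumption about the schedule that the text above does not fix.

```python
import math
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob, rng):
    """Independently replace each token with [MASK] with probability mask_prob.

    Returns the corrupted sequence and the indices that were masked.
    """
    corrupted, positions = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            positions.append(i)
        else:
            corrupted.append(tok)
    return corrupted, positions


def masked_cross_entropy(probs, targets, positions):
    """Average negative log-likelihood of the true token at masked positions.

    probs[i] is a dict mapping each candidate token to its predicted probability.
    """
    if not positions:
        return 0.0
    total = -sum(math.log(probs[i][targets[i]]) for i in positions)
    return total / len(positions)


tokens = ["3", "request", "master_to_slave"]
corrupted, positions = mask_tokens(tokens, mask_prob=0.5, rng=random.Random(0))
```

Restricting the loss to `positions` means that tokens left visible do not contribute to the loss, which matches the "masked positions only" convention in the list above.
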

---

### 5.2 Unified Model Requirement

The AI should design a model that:

* Uses a **shared backbone** (e.g., UNet-like or temporal model)
* Has:

  * one head for continuous noise prediction
  * one head for discrete token prediction
* Trains with a **combined loss**:

```text
L = λ · L_continuous + (1 − λ) · L_discrete
```
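
As a small worked instance of the formula above, the weighting can be written as a plain function. The name `combined_loss` and the default `lam=0.5` are assumptions for illustration; the document does not prescribe a value for λ.

```python
def combined_loss(loss_continuous, loss_discrete, lam=0.5):
    """Convex combination of the two loss terms; lam must lie in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * loss_continuous + (1.0 - lam) * loss_discrete


# Example: L_continuous = 2.0, L_discrete = 4.0, lam = 0.25
# gives 0.25 * 2.0 + 0.75 * 4.0 = 3.5
example = combined_loss(2.0, 4.0, lam=0.25)
```

Because MSE and cross-entropy can live on different numeric scales, λ typically needs tuning so that neither head dominates the gradient.
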

---

## 6. Output Expected from the AI's Reasoning

The AI should produce:

1. **Dataset selection result**

   * Which dataset is chosen
   * Why it is more suitable for this project
2. **Feature breakdown**

   * Which columns are continuous
   * Which columns are discrete
3. **Hybrid diffusion architecture**

   * Input representation
   * Forward noise strategy (continuous + discrete)
   * Reverse denoising objectives
4. **Training formulation**

   * Loss definitions
   * High-level training loop description

Implementation details can remain **high-level / pseudocode-level** unless explicitly requested later.
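
To indicate the level of detail meant by a "high-level training loop", here is a framework-agnostic skeleton of one loss-evaluation pass. It is a sketch only: the function name `training_step` and all of its callable parameters are hypothetical placeholders, and the gradient update is deliberately left out because it depends on the deep learning framework chosen later.

```python
import random


def training_step(batch, model, corrupt_cont, corrupt_disc,
                  loss_cont, loss_disc, num_steps, lam, rng):
    """Average hybrid loss over a batch of (continuous, discrete) samples.

    corrupt_cont(x, t) -> (noisy_x, noise)          Gaussian forward process
    corrupt_disc(tokens, t) -> (masked, positions)  masking forward process
    model(noisy_x, masked, t) -> (pred_noise, pred_probs)
    """
    total = 0.0
    for x_cont, x_disc in batch:
        t = rng.randrange(num_steps)            # one shared timestep per sample
        noisy, noise = corrupt_cont(x_cont, t)
        masked, positions = corrupt_disc(x_disc, t)
        pred_noise, pred_probs = model(noisy, masked, t)
        l_c = loss_cont(pred_noise, noise)
        l_d = loss_disc(pred_probs, x_disc, positions)
        total += lam * l_c + (1.0 - lam) * l_d
    # An optimizer step on this value would follow here in a real framework.
    return total / len(batch)
```

Sharing one timestep `t` between the two corruption processes is a design assumption; sampling separate timesteps per feature group would be an equally valid variant to evaluate.
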

---

## 7. Non-Goals (Important)

* Do **not** design packet parsers
* Do **not** generate raw PCAP directly
* Do **not** assume image-style diffusion
* Do **not** treat all features as continuous

The focus is **feature-level hybrid diffusion modeling** under an ICS / Modbus context.