# Project Context for AI

**Modbus / ICS Traffic Generation with Hybrid Diffusion**

## 1. Project Background

This project aims to build a **hybrid diffusion-based generative model** for industrial network traffic, inspired by the paper:

> *Spatio-Temporal Diffusion Model for Cellular Traffic Generation (STOUTER)*

Unlike the original paper (cellular traffic, aggregated continuous values), this project targets **industrial control system (ICS) traffic**, with a focus on **Modbus-like protocols**, where traffic features include both:

* **continuous values** (e.g., timing, numeric payloads), and
* **discrete protocol fields** (e.g., function codes, message types).

The final goal is to generate **realistic, protocol-consistent traffic features**, which can later be converted into raw packets (PCAP) by an external generator.

---

## 2. Available Datasets (Current State)

The dataset directory is:

```text
/Dev/diffusion/dataset/
```

It currently contains **two datasets**:

```text
dataset/
├── modbus_dataset/
└── hai/
```

### 2.1 Common Properties

Both datasets:

* Are already **preprocessed into CSV files**
* Contain **traffic-level features**, not raw PCAP
* Are suitable as **model input for diffusion-based generation**
* Represent **sequences of network events / flows**, not aggregated hourly statistics

No packet parsing is required at this stage.

---

### 2.2 Modbus Dataset

* Domain: **Modbus / industrial control traffic**
* Semantics:

  * Explicit protocol meaning (request/response, function codes, registers)
  * Strong logical and temporal constraints
* Feature types typically include:

  * Continuous:

    * inter-arrival time
    * numeric register values
    * payload length
  * Discrete:

    * function code
    * direction (master → slave / slave → master)
    * message type

This dataset aligns closely with the **target application domain** of the project.

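One way to picture a single row of such a dataset is as a record that keeps the two feature groups apart. The sketch below is illustrative only: the class names, the field names (`inter_arrival_s`, `register_value`, and so on), and the enum encodings are assumptions, not the actual CSV schema, which has to be read from the files in `modbus_dataset/`.

```python
from dataclasses import dataclass
from enum import Enum


class Direction(Enum):
    """Hypothetical encoding of the traffic direction."""
    MASTER_TO_SLAVE = 0
    SLAVE_TO_MASTER = 1


class MessageType(Enum):
    """Hypothetical encoding of the message type."""
    REQUEST = 0
    RESPONSE = 1


@dataclass(frozen=True)
class ModbusEvent:
    # Continuous features (candidates for Gaussian diffusion).
    inter_arrival_s: float   # time since the previous event, in seconds
    register_value: float    # numeric register value carried by the message
    payload_length: float    # payload length in bytes, treated as numeric

    # Discrete features (candidates for mask-based diffusion).
    function_code: int       # e.g. 3 = Read Holding Registers in standard Modbus
    direction: Direction
    message_type: MessageType

    def continuous(self) -> list[float]:
        """Continuous part as a plain vector."""
        return [self.inter_arrival_s, self.register_value, self.payload_length]

    def discrete(self) -> list[int]:
        """Discrete part as integer token ids."""
        return [self.function_code, self.direction.value, self.message_type.value]


event = ModbusEvent(0.05, 1200.0, 12.0, 3, Direction.MASTER_TO_SLAVE, MessageType.REQUEST)
print(event.continuous())  # [0.05, 1200.0, 12.0]
print(event.discrete())    # [3, 0, 0]
```
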
---

### 2.3 HAI Dataset

* Domain: **ICS network traffic (broader, not Modbus-only)**
* Characteristics:

  * Feature-extracted CSV format
  * Contains both normal and abnormal behavior
  * Less explicit protocol semantics compared to Modbus
* Often used for:

  * Anomaly detection
  * Security-oriented modeling

This dataset may be more suitable if the project emphasizes **security behavior patterns** rather than strict protocol logic.

---

## 3. Task 1: Dataset-Level Decision

The AI should first:

1. **Inspect both datasets**

   * Compare feature schemas
   * Identify:

     * continuous vs discrete fields
     * temporal resolution
     * protocol specificity
2. **Decide which dataset is more appropriate** for this project, based on:

   * Alignment with Modbus-style protocol semantics
   * Suitability for diffusion-based generation
   * Ability to support mixed continuous + discrete modeling

The decision should be **explicitly justified** (why one dataset is preferred over the other in this project context).

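The inspection step can start from a simple heuristic pass over each CSV. The following is a minimal sketch, not a prescribed procedure: the function name `classify_columns`, the `max_categories` threshold, and the sample column names are all assumptions made for illustration. A column is treated as continuous when every value parses as a number and it has more distinct values than the threshold; everything else is treated as discrete. Numeric codes with few distinct values (such as function codes) therefore land in the discrete group, and the result should still be reviewed by hand. For a real file, pass a handle opened with `open(path, newline="")` instead of the in-memory sample.

```python
import csv
import io


def classify_columns(csv_file, max_categories=16):
    """Split CSV columns into continuous and discrete by a simple heuristic.

    A column counts as continuous when every value parses as a float and it
    has more than `max_categories` distinct values; otherwise it is discrete.
    """
    reader = csv.DictReader(csv_file)
    values = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name in values:
            values[name].append(row[name])

    continuous, discrete = [], []
    for name, column in values.items():
        try:
            for v in column:
                float(v)
            numeric = True
        except ValueError:
            numeric = False
        if numeric and len(set(column)) > max_categories:
            continuous.append(name)
        else:
            discrete.append(name)
    return continuous, discrete


# Tiny in-memory stand-in for a real file; the column names are hypothetical.
sample = io.StringIO(
    "inter_arrival,function_code,direction\n"
    "0.010,3,req\n"
    "0.023,3,resp\n"
    "0.051,16,req\n"
    "0.048,16,resp\n"
)
continuous, discrete = classify_columns(sample, max_categories=3)
print(continuous)  # ['inter_arrival']
print(discrete)    # ['function_code', 'direction']
```
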
---

## 4. Modeling Goal

After selecting the dataset, the AI should design a **hybrid diffusion model** that:

* Operates on **feature-level traffic data**
* Generates **synthetic traffic feature sequences**
* Preserves:

  * temporal patterns
  * protocol-level consistency
  * stochastic variability

The model does **not** generate raw packets directly.

---

## 5. Hybrid Diffusion Design Constraints

### 5.1 Feature Type Separation

The selected dataset’s features should be divided into two groups:

#### Continuous Features

Examples:

* inter-arrival time
* numeric values
* continuous statistics

**Modeling requirement**:

* Use **Gaussian diffusion (DDPM-style)**
* Forward process: add Gaussian noise
* Reverse process: predict noise with MSE (or L1) loss

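The continuous branch can be illustrated with the standard DDPM closed form, where a noisy sample is drawn as x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε with ε ~ N(0, I). The sketch below uses only the Python standard library to stay self-contained; a real implementation would use a tensor library. The linear beta schedule, its endpoints, the number of steps, and the helper names (`linear_alpha_bars`, `q_sample`, `mse`) are assumptions chosen for illustration, and the zero "prediction" merely stands in for a trained network.

```python
import math
import random


def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) and return it with the noise that was used."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise


def mse(pred, target):
    """Mean squared error between predicted and true noise."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


rng = random.Random(0)
alpha_bars = linear_alpha_bars(num_steps=1000)
x0 = [0.2, -1.3, 0.7]  # normalized continuous features of one event (made up)
xt, noise = q_sample(x0, t=500, alpha_bars=alpha_bars, rng=rng)

# A trained network would predict `noise` from (xt, t); a zero prediction
# stands in here only so the loss can be evaluated.
loss = mse([0.0] * len(noise), noise)
```
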
---

#### Discrete Features

Examples:

* function code
* message type
* direction
* categorical flags

**Modeling requirement**:

* Use **mask-based discrete diffusion**
* Forward process: randomly replace tokens with `[MASK]`
* Reverse process: predict original token via classification
* Loss: cross-entropy (typically on masked positions only)

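The discrete branch can be sketched in the same spirit. Below, `mask_tokens` plays the role of the forward process and `masked_cross_entropy` computes the loss on masked positions only. The `MASK` id, the function names, and the fixed `mask_prob` are assumptions for illustration; in a diffusion setting the masking probability would typically be tied to the timestep (for example, growing with t), and the logits would come from the model rather than being uniform.

```python
import math
import random

MASK = -1  # hypothetical id reserved for the [MASK] token


def mask_tokens(tokens, mask_prob, rng):
    """Forward process: replace each token with MASK with probability mask_prob."""
    corrupted, is_masked = [], []
    for tok in tokens:
        hit = rng.random() < mask_prob
        corrupted.append(MASK if hit else tok)
        is_masked.append(hit)
    return corrupted, is_masked


def masked_cross_entropy(logits, targets, is_masked):
    """Mean cross-entropy over masked positions only.

    logits[i] holds one unnormalized score per class for position i.
    """
    losses = []
    for scores, target, masked in zip(logits, targets, is_masked):
        if not masked:
            continue
        peak = max(scores)  # subtract the max for a stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        losses.append(log_norm - scores[target])
    return sum(losses) / len(losses) if losses else 0.0


rng = random.Random(0)
function_codes = [0, 2, 1, 0, 2]  # class indices, not raw Modbus codes
corrupted, is_masked = mask_tokens(function_codes, mask_prob=0.5, rng=rng)

# Uniform logits stand in for a model's output; the loss is then log(3)
# whenever at least one position was masked.
uniform_logits = [[0.0, 0.0, 0.0] for _ in function_codes]
loss = masked_cross_entropy(uniform_logits, function_codes, is_masked)
```
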
---

### 5.2 Unified Model Requirement

The AI should design a model that:

* Uses a **shared backbone** (e.g., UNet-like or temporal model)
* Has:

  * one head for continuous noise prediction
  * one head for discrete token prediction
* Trains with a **combined loss**:

  ```text
  L = λ · L_continuous + (1 − λ) · L_discrete
  ```

Here λ ∈ [0, 1] sets the relative weight of the continuous and discrete terms.

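The weighting itself is a one-line convex combination. The helper below is a minimal sketch; the name `combined_loss`, the parameter name `lam`, and its default of 0.5 are assumptions, not values fixed by this project. Because MSE and cross-entropy usually live on different scales, λ will likely need tuning (or the two terms normalizing) rather than being left at a default.

```python
def combined_loss(loss_continuous, loss_discrete, lam=0.5):
    """L = lam * L_continuous + (1 - lam) * L_discrete, with lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * loss_continuous + (1.0 - lam) * loss_discrete


print(combined_loss(0.8, 0.2, lam=0.5))  # 0.5
print(combined_loss(0.8, 0.2, lam=1.0))  # 0.8 (continuous term only)
```
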
---

## 6. Output Expected from the AI's Reasoning

The AI should produce:

1. **Dataset selection result**

   * Which dataset is chosen
   * Why it is more suitable for this project
2. **Feature breakdown**

   * Which columns are continuous
   * Which columns are discrete
3. **Hybrid diffusion architecture**

   * Input representation
   * Forward noise strategy (continuous + discrete)
   * Reverse denoising objectives
4. **Training formulation**

   * Loss definitions
   * High-level training loop description

Implementation details can remain **high-level / pseudocode-level** unless explicitly requested later.

---

## 7. Non-Goals (Important)

* Do **not** design packet parsers
* Do **not** generate raw PCAP directly
* Do **not** assume image-style diffusion
* Do **not** treat all features as continuous

The focus is **feature-level hybrid diffusion modeling** under an ICS / Modbus context.