mask-ddpm/TODO-FOR-AI.md
2026-01-08 05:17:16 +08:00

# Project Context for AI
**Modbus / ICS Traffic Generation with Hybrid Diffusion**
## 1. Project Background
This project aims to build a **hybrid diffusion-based generative model** for industrial network traffic, inspired by the paper:
> *Spatio-Temporal Diffusion Model for Cellular Traffic Generation (STOUTER)*
Unlike the original paper (cellular traffic, aggregated continuous values), this project targets **industrial control system (ICS) traffic**, with a focus on **Modbus-like protocols**, where traffic features include both:
* **continuous values** (e.g., timing, numeric payloads), and
* **discrete protocol fields** (e.g., function codes, message types).
The final goal is to generate **realistic, protocol-consistent traffic features**, which can later be converted into raw packets (PCAP) by an external generator.
---
## 2. Available Datasets (Current State)
The dataset directory is:
```text
/Dev/diffusion/dataset/
```
It currently contains **two datasets**:
```text
dataset/
├── modbus_dataset/
└── hai/
```
### 2.1 Common Properties
Both datasets:
* Are already **preprocessed into CSV files**
* Contain **traffic-level features**, not raw PCAP
* Are suitable as **model input for diffusion-based generation**
* Represent **sequences of network events / flows**, not aggregated hourly statistics
No packet parsing is required at this stage.
---
### 2.2 Modbus Dataset
* Domain: **Modbus / industrial control traffic**
* Semantics:
  * Explicit protocol meaning (request/response, function codes, registers)
  * Strong logical and temporal constraints
* Feature types typically include:
  * Continuous:
    * inter-arrival time
    * numeric register values
    * payload length
  * Discrete:
    * function code
    * direction (master → slave / slave → master)
    * message type
This dataset aligns closely with the **target application domain** of the project.
---
### 2.3 HAI Dataset
* Domain: **ICS network traffic (broader, not Modbus-only)**
* Characteristics:
  * Feature-extracted CSV format
  * Contains both normal and abnormal behavior
  * Less explicit protocol semantics compared to Modbus
* Often used for:
  * Anomaly detection
  * Security-oriented modeling
This dataset may be more suitable if the project emphasizes **security behavior patterns** rather than strict protocol logic.
---
## 3. Task 1: Dataset-Level Decision
The AI should first:
1. **Inspect both datasets**
   * Compare feature schemas
   * Identify:
     * continuous vs discrete fields
     * temporal resolution
     * protocol specificity
2. **Decide which dataset is more appropriate** for this project, based on:
   * Alignment with Modbus-style protocol semantics
   * Suitability for diffusion-based generation
   * Ability to support mixed continuous + discrete modeling
The decision should be **explicitly justified** (why one dataset is preferred over the other in this project context).
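As a starting point for the schema inspection, a first pass over the CSV columns could look like the sketch below. It is only a heuristic under stated assumptions: the column names (`inter_arrival`, `function_code`, `direction`), the sample rows, and the `max_categories` threshold are hypothetical and not taken from either dataset, so the resulting split should be reviewed by hand.

```python
import csv
import io

def split_feature_columns(rows, max_categories=16):
    """Heuristically split CSV columns into continuous and discrete groups.

    A column counts as discrete when any value is non-numeric, or when it
    takes at most `max_categories` distinct values (an assumed threshold).
    """
    continuous, discrete = [], []
    for name in rows[0].keys():
        distinct = {row[name] for row in rows}
        try:
            for value in distinct:
                float(value)
            numeric = True
        except ValueError:
            numeric = False
        if numeric and len(distinct) > max_categories:
            continuous.append(name)
        else:
            discrete.append(name)
    return continuous, discrete

# Hypothetical sample; real column names must be read from the dataset CSVs.
lines = ["inter_arrival,function_code,direction"]
for i in range(1, 41):
    code = "3" if i % 2 else "16"
    direction = "request" if i % 2 else "response"
    lines.append(f"{0.01 * i:.2f},{code},{direction}")

rows = list(csv.DictReader(io.StringIO("\n".join(lines))))
continuous_cols, discrete_cols = split_feature_columns(rows)
print("continuous:", continuous_cols)
print("discrete:", discrete_cols)
```

A numeric column with few distinct values (for example a payload length that only takes a handful of sizes) would be classified as discrete by this rule, which is one reason the output is a proposal rather than a final decision.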
---
## 4. Modeling Goal
After selecting the dataset, the AI should design a **hybrid diffusion model** that:
* Operates on **feature-level traffic data**
* Generates **synthetic traffic feature sequences**
* Preserves:
  * temporal patterns
  * protocol-level consistency
  * stochastic variability
The model does **not** generate raw packets directly.
---
## 5. Hybrid Diffusion Design Constraints
### 5.1 Feature Type Separation
The selected dataset's features should be divided into two groups:
#### Continuous Features
Examples:
* inter-arrival time
* numeric values
* continuous statistics
**Modeling requirement**:
* Use **Gaussian diffusion (DDPM-style)**
* Forward process: add Gaussian noise
* Reverse process: predict noise with MSE (or L1) loss
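
The continuous branch can be sketched as follows. This is a minimal illustration, assuming a linear beta schedule with 1000 steps (this document does not fix either choice) and plain Python lists instead of tensors. The helper names and the sample values in `x0` are hypothetical.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed schedule: betas spaced linearly between beta_start and beta_end."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """Return alpha_bar_t, the running product of (1 - beta_s) for s <= t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def q_sample(x0, t, alpha_bars, rng):
    """Forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps

def mse_loss(pred, target):
    """Mean squared error between predicted and true noise."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))

x0 = [0.12, -0.40, 1.05]  # hypothetical normalized continuous features
xt, eps = q_sample(x0, 500, alpha_bars, rng)

# A trained network would predict eps from (xt, t); two stand-ins are compared here.
loss_perfect = mse_loss(eps, eps)
loss_zero_guess = mse_loss([0.0] * len(eps), eps)
print(loss_perfect, loss_zero_guess)
```

The closed-form sample from `x0` avoids simulating every intermediate step, which is what makes training on a randomly drawn `t` cheap.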
---
#### Discrete Features
Examples:
* function code
* message type
* direction
* categorical flags
**Modeling requirement**:
* Use **mask-based discrete diffusion**
* Forward process: randomly replace tokens with `[MASK]`
* Reverse process: predict original token via classification
* Loss: cross-entropy (typically on masked positions only)
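
The discrete branch can be sketched in the same spirit. The example assumes a linear masking schedule (mask probability `t / num_steps`), which this document does not specify, and it uses a uniform distribution as a stand-in for the model's predictions. The token sequence and function names are hypothetical.

```python
import math
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob, rng):
    """Forward process: replace each token with [MASK] independently with mask_prob."""
    corrupted, masked_positions = [], []
    for i, token in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            masked_positions.append(i)
        else:
            corrupted.append(token)
    return corrupted, masked_positions

def masked_cross_entropy(probs, targets, masked_positions):
    """Average of -log p(original token), taken over masked positions only."""
    if not masked_positions:
        return 0.0
    total = sum(-math.log(probs[i][targets[i]]) for i in masked_positions)
    return total / len(masked_positions)

rng = random.Random(0)
function_codes = ["3", "3", "16", "3", "6", "16"]  # hypothetical token sequence

t, num_steps = 6, 10  # assumed linear schedule: mask probability = t / num_steps
corrupted, masked_positions = mask_tokens(function_codes, t / num_steps, rng)

# Stand-in for the classifier head: a uniform distribution at every position.
vocab = sorted(set(function_codes))
uniform = {token: 1.0 / len(vocab) for token in vocab}
probs = [uniform for _ in function_codes]

loss = masked_cross_entropy(probs, function_codes, masked_positions)
print(corrupted, loss)
```

Restricting the loss to masked positions keeps the model from being rewarded for copying tokens it can already see.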
---
### 5.2 Unified Model Requirement
The AI should design a model that:
* Uses a **shared backbone** (e.g., UNet-like or temporal model)
* Has:
  * one head for continuous noise prediction
  * one head for discrete token prediction
* Trains with a **combined loss**:
```text
L = λ · L_continuous + (1 - λ) · L_discrete
```
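
The weighted sum can be written as a small helper, reading the second weight as one minus λ so that the two weights sum to one. The function name, the default `lam=0.5`, and the range check are illustrative assumptions. The document does not prescribe a value for λ.

```python
def combined_loss(loss_continuous, loss_discrete, lam=0.5):
    """Convex combination of the two branch losses; lam is the weight λ."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * loss_continuous + (1.0 - lam) * loss_discrete

# Hypothetical per-batch loss values from the two heads.
total = combined_loss(1.0, 3.0, lam=0.5)
print(total)
```

Because MSE and cross-entropy live on different scales, λ will likely need tuning rather than being left at an arbitrary default.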
---
## 6. Output Expected from the AI's Reasoning
The AI should produce:
1. **Dataset selection result**
   * Which dataset is chosen
   * Why it is more suitable for this project
2. **Feature breakdown**
   * Which columns are continuous
   * Which columns are discrete
3. **Hybrid diffusion architecture**
   * Input representation
   * Forward noise strategy (continuous + discrete)
   * Reverse denoising objectives
4. **Training formulation**
   * Loss definitions
   * High-level training loop description
Implementation details can remain **high-level / pseudocode-level** unless explicitly requested later.
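
For the input representation item, one possible encoding is sketched below, assuming z-score normalization for continuous columns and integer token ids for discrete columns, with id 0 reserved for `[MASK]`. These choices, the column names, and the sample rows are hypothetical and not derived from the datasets.

```python
import statistics

def build_vocab(values):
    """Map each distinct category to an integer id; id 0 is reserved for [MASK]."""
    vocab = {"[MASK]": 0}
    for value in values:
        if value not in vocab:
            vocab[value] = len(vocab)
    return vocab

def encode_rows(rows, continuous_cols, discrete_cols):
    """Return standardized continuous vectors and token-id vectors per row."""
    stats = {}
    for col in continuous_cols:
        column = [float(row[col]) for row in rows]
        mean = statistics.fmean(column)
        std = statistics.pstdev(column) or 1.0  # guard against constant columns
        stats[col] = (mean, std)
    vocabs = {col: build_vocab(row[col] for row in rows) for col in discrete_cols}
    x_cont = [
        [(float(row[col]) - stats[col][0]) / stats[col][1] for col in continuous_cols]
        for row in rows
    ]
    x_disc = [[vocabs[col][row[col]] for col in discrete_cols] for row in rows]
    return x_cont, x_disc, stats, vocabs

# Hypothetical rows; real column names come from the selected dataset.
rows = [
    {"inter_arrival": "0.010", "function_code": "3", "direction": "request"},
    {"inter_arrival": "0.030", "function_code": "3", "direction": "response"},
    {"inter_arrival": "0.020", "function_code": "16", "direction": "request"},
]
x_cont, x_disc, stats, vocabs = encode_rows(
    rows, ["inter_arrival"], ["function_code", "direction"]
)
print(x_cont)
print(x_disc)
```

Keeping `stats` and `vocabs` alongside the encoded data matters because generated samples have to be mapped back to real units and category labels before an external generator can turn them into packets.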
---
## 7. Non-Goals (Important)
* Do **not** design packet parsers
* Do **not** generate raw PCAP directly
* Do **not** assume image-style diffusion
* Do **not** treat all features as continuous
The focus is **feature-level hybrid diffusion modeling** under an ICS / Modbus context.