Add TODO-FOR-AI.md

2026-01-08 05:17:16 +08:00
commit 200bdf6136
1 changed file with 219 additions and 0 deletions
--- a/TODO-FOR-AI.md
+++ b/TODO-FOR-AI.md
@@ -0,0 +1,219 @@
+# Project Context for AI
+
+**Modbus / ICS Traffic Generation with Hybrid Diffusion**
+
+## 1. Project Background
+
+This project aims to build a **hybrid diffusion-based generative model** for industrial network traffic, inspired by the paper:
+
+> *Spatio-Temporal Diffusion Model for Cellular Traffic Generation (STOUTER)*
+
+Unlike the original paper (cellular traffic, aggregated continuous values), this project targets **industrial control system (ICS) traffic**, with a focus on **Modbus-like protocols**, where traffic features include both:
+
+* **continuous values** (e.g., timing, numeric payloads), and
+* **discrete protocol fields** (e.g., function codes, message types).
+
+The final goal is to generate **realistic, protocol-consistent traffic features**, which can later be converted into raw packets (PCAP) by an external generator.
+
+---
+
+## 2. Available Datasets (Current State)
+
+The dataset directory is:
+
+```text
+/Dev/diffusion/dataset/
+```
+
+It currently contains **two datasets**:
+
+```text
+dataset/
+├── modbus_dataset/
+└── hai/
+```
+
+### 2.1 Common Properties
+
+Both datasets:
+
+* Are already **preprocessed into CSV files**
+* Contain **traffic-level features**, not raw PCAP
+* Are suitable as **model input for diffusion-based generation**
+* Represent **sequences of network events / flows**, not aggregated hourly statistics
+
+No packet parsing is required at this stage.
+
+---
+
+### 2.2 Modbus Dataset
+
+* Domain: **Modbus / industrial control traffic**
+* Semantics:
+
+  * Explicit protocol meaning (request/response, function codes, registers)
+  * Strong logical and temporal constraints
+* Feature types typically include:
+
+  * Continuous:
+
+    * inter-arrival time
+    * numeric register values
+    * payload length
+  * Discrete:
+
+    * function code
+    * direction (master → slave / slave → master)
+    * message type
+
+This dataset aligns closely with the **target application domain** of the project.
+
+---
+
+### 2.3 HAI Dataset
+
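+* Domain: **ICS data (broader, not Modbus-only)**. The public HAI release consists of sensor and actuator time series from a testbed rather than network flows, so confirm what the CSV files here actually contain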
+* Characteristics:
+
+  * Feature-extracted CSV format
+  * Contains both normal and abnormal behavior
+  * Less explicit protocol semantics compared to Modbus
+* Often used for:
+
+  * Anomaly detection
+  * Security-oriented modeling
+
+This dataset may be more suitable if the project emphasizes **security behavior patterns** rather than strict protocol logic.
+
+---
+
+## 3. Task 1: Dataset-Level Decision
+
+The AI should first:
+
+1. **Inspect both datasets**
+
+   * Compare feature schemas
+   * Identify:
+
+     * continuous vs discrete fields
+     * temporal resolution
+     * protocol specificity
+2. **Decide which dataset is more appropriate** for this project, based on:
+
+   * Alignment with Modbus-style protocol semantics
+   * Suitability for diffusion-based generation
+   * Ability to support mixed continuous + discrete modeling
+
+The decision should be **explicitly justified** (why one dataset is preferred over the other in this project context).
+
+---
+
+## 4. Modeling Goal
+
+After selecting the dataset, the AI should design a **hybrid diffusion model** that:
+
+* Operates on **feature-level traffic data**
+* Generates **synthetic traffic feature sequences**
+* Preserves:
+
+  * temporal patterns
+  * protocol-level consistency
+  * stochastic variability
+
+The model does **not** generate raw packets directly.
+
+---
+
+## 5. Hybrid Diffusion Design Constraints
+
+### 5.1 Feature Type Separation
+
+The selected dataset’s features should be divided into two groups:
+
+#### Continuous Features
+
+Examples:
+
+* inter-arrival time
+* numeric values
+* continuous statistics
+
+**Modeling requirement**:
+
+* Use **Gaussian diffusion (DDPM-style)**
+* Forward process: add Gaussian noise
+* Reverse process: predict noise with MSE (or L1) loss
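+
+For reference, the continuous branch can be written in standard DDPM notation. The symbols `β_t`, `ᾱ_t`, and `ε_θ` come from that convention and are assumptions here, not names fixed by this project; `x_0` stands for the normalized continuous features of one sequence:
+
+```latex
+% Forward process: closed-form noising of the continuous features at step t
+q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\right),
+\qquad \bar\alpha_t = \prod_{s=1}^{t} (1-\beta_s)
+
+% Equivalent sampling form used during training
+x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,
+\qquad \epsilon \sim \mathcal{N}(0, I)
+
+% Noise-prediction objective (MSE variant)
+L_{\text{continuous}} = \mathbb{E}_{x_0,\,t,\,\epsilon}
+  \left[\ \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert_2^2\ \right]
+```
+
+The L1 variant replaces the squared L2 norm with an L1 norm; the noise schedule `β_t` is left open.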
+
+---
+
+#### Discrete Features
+
+Examples:
+
+* function code
+* message type
+* direction
+* categorical flags
+
+**Modeling requirement**:
+
+* Use **mask-based discrete diffusion**
+* Forward process: randomly replace tokens with `[MASK]`, masking a larger fraction of tokens as the diffusion step increases
+* Reverse process: predict original token via classification
+* Loss: cross-entropy (typically on masked positions only)
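+
+One possible formalization of the mask-based branch is the absorbing-state form below. The masking rate `m_t`, the linear schedule, and the symbols `z_0`, `z_t`, `p_θ` are illustrative assumptions, not choices made by this project:
+
+```latex
+% Forward process: each token z_0^i is masked independently with probability m_t
+q(z_t^i \mid z_0^i) = (1 - m_t)\,\mathbb{1}[z_t^i = z_0^i]
+                    + m_t\,\mathbb{1}[z_t^i = \texttt{[MASK]}],
+\qquad m_t \in [0, 1] \text{ non-decreasing in } t
+
+% Example schedule (assumption): linear masking rate over T steps
+m_t = t / T
+
+% Cross-entropy on masked positions only
+L_{\text{discrete}} = \mathbb{E}_{z_0,\,t,\,z_t}
+  \left[\ -\sum_{i\,:\,z_t^i = \texttt{[MASK]}} \log p_\theta(z_0^i \mid z_t, t)\ \right]
+```
+
+With this schedule, `t = T` masks every token, so sampling would start from a fully masked sequence and unmask step by step.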
+
+---
+
+### 5.2 Unified Model Requirement
+
+The AI should design a model that:
+
+* Uses a **shared backbone** (e.g., UNet-like or temporal model)
+* Has:
+
+  * one head for continuous noise prediction
+  * one head for discrete token prediction
+* Trains with a **combined loss**, where `λ ∈ [0, 1]` weights the continuous term against the discrete term:
+
+```text
+L = λ · L_continuous + (1 − λ) · L_discrete
+```
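+
+Written out with both heads reading the same noised input, the combined objective might look like the sketch below. Conditioning each head on both `x_t` (noised continuous features) and `z_t` (masked discrete tokens) is an assumption that follows from the shared backbone; the numbers in the worked example are invented purely to show the arithmetic:
+
+```latex
+% Joint objective over a shared backbone with two output heads
+L(\theta) = \lambda\,
+  \mathbb{E}\left[\ \lVert \epsilon - \epsilon_\theta(x_t, z_t, t) \rVert_2^2\ \right]
+  + (1 - \lambda)\,
+  \mathbb{E}\left[\ -\sum_{i\,:\,z_t^i = \texttt{[MASK]}}
+      \log p_\theta(z_0^i \mid x_t, z_t, t)\ \right]
+
+% Worked example with made-up batch losses
+\lambda = 0.7,\quad L_{\text{continuous}} = 0.20,\quad L_{\text{discrete}} = 1.50
+\;\Longrightarrow\;
+L = 0.7 \cdot 0.20 + 0.3 \cdot 1.50 = 0.14 + 0.45 = 0.59
+```
+
+Because MSE and cross-entropy live on different scales, `λ` may need tuning so that neither term dominates the gradient.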
+
+---
+
+## 6. Expected Output from the AI
+
+The AI should produce:
+
+1. **Dataset selection result**
+
+   * Which dataset is chosen
+   * Why it is more suitable for this project
+2. **Feature breakdown**
+
+   * Which columns are continuous
+   * Which columns are discrete
+3. **Hybrid diffusion architecture**
+
+   * Input representation
+   * Forward noise strategy (continuous + discrete)
+   * Reverse denoising objectives
+4. **Training formulation**
+
+   * Loss definitions
+   * High-level training loop description
+
+Implementation details can remain **high-level / pseudocode-level** unless explicitly requested later.
+
+---
+
+## 7. Non-Goals (Important)
+
+* Do **not** design packet parsers
+* Do **not** generate raw PCAP directly
+* Do **not** assume image-style diffusion
+* Do **not** treat all features as continuous
+
+The focus is **feature-level hybrid diffusion modeling** under an ICS / Modbus context.
+