mask-ddpm/TODO-FOR-AI.md

# Project Context for AI

**Modbus / ICS Traffic Generation with Hybrid Diffusion**

## 1. Project Background

This project aims to build a **hybrid diffusion-based generative model** for industrial network traffic, inspired by the paper:

> *Spatio-Temporal Diffusion Model for Cellular Traffic Generation (STOUTER)*

Unlike the original paper (cellular traffic, aggregated continuous values), this project targets **industrial control system (ICS) traffic**, with a focus on **Modbus-like protocols**, where traffic features include both:

* **continuous values** (e.g., timing, numeric payloads), and
* **discrete protocol fields** (e.g., function codes, message types).

The final goal is to generate **realistic, protocol-consistent traffic features**, which can later be converted into raw packets (PCAP) by an external generator.

---

## 2. Available Datasets (Current State)

The dataset directory is:

```text
/Dev/diffusion/dataset/
```

It currently contains **two datasets**:

```text
dataset/
├── modbus_dataset/
└── hai/
```

### 2.1 Common Properties

Both datasets:

* Are already **preprocessed into CSV files**
* Contain **traffic-level features**, not raw PCAP
* Are suitable as **model input for diffusion-based generation**
* Represent **sequences of network events / flows**, not aggregated hourly statistics

No packet parsing is required at this stage.

---

### 2.2 Modbus Dataset

* Domain: **Modbus / industrial control traffic**
* Semantics:

  * Explicit protocol meaning (request/response, function codes, registers)
  * Strong logical and temporal constraints
* Feature types typically include:

  * Continuous:

    * inter-arrival time
    * numeric register values
    * payload length
  * Discrete:

    * function code
    * direction (master → slave / slave → master)
    * message type

This dataset aligns closely with the **target application domain** of the project.

---

### 2.3 HAI Dataset

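* Domain: **ICS data (broader, not Modbus-only)**. The public HAI release consists of process-level sensor and actuator time series rather than packet captures, so confirm against the local CSV schema that the features here are traffic-level.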
* Characteristics:

  * Feature-extracted CSV format
  * Contains both normal and abnormal behavior
  * Less explicit protocol semantics compared to Modbus
* Often used for:

  * Anomaly detection
  * Security-oriented modeling

This dataset may be more suitable if the project emphasizes **security behavior patterns** rather than strict protocol logic.

---

## 3. Task 1: Dataset-Level Decision

The AI should first:

1. **Inspect both datasets**

   * Compare feature schemas
   * Identify:

     * continuous vs discrete fields
     * temporal resolution
     * protocol specificity
2. **Decide which dataset is more appropriate** for this project, based on:

   * Alignment with Modbus-style protocol semantics
   * Suitability for diffusion-based generation
   * Ability to support mixed continuous + discrete modeling

The decision should be **explicitly justified** (why one dataset is preferred over the other in this project context).
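
The schema inspection in step 1 can be started with a simple heuristic. The following is a minimal sketch, not a definitive procedure: the function name `classify_columns`, the `max_categories` threshold, and the sample column names are all hypothetical, and the real CSV headers in `modbus_dataset/` and `hai/` must be checked by hand.

```python
import csv
import io


def classify_columns(rows, max_categories=16):
    """Split column names into (continuous, discrete) lists.

    Heuristic (an assumption, not a project rule): a column is discrete if
    any value is non-numeric or if it has at most `max_categories` distinct
    values; otherwise it is treated as continuous.
    """
    continuous, discrete = [], []
    for name in rows[0].keys():
        distinct = {row[name] for row in rows}
        try:
            for value in distinct:
                float(value)
            numeric = True
        except ValueError:
            numeric = False
        if numeric and len(distinct) > max_categories:
            continuous.append(name)
        else:
            discrete.append(name)
    return continuous, discrete


# Tiny synthetic sample; the column names are illustrative only.
SAMPLE_CSV = """inter_arrival,function_code,direction
0.012,3,request
0.015,3,response
0.250,16,request
0.018,16,response
0.011,3,request
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
continuous, discrete = classify_columns(rows, max_categories=3)
print("continuous:", continuous)
print("discrete:", discrete)
```

A distinct-value threshold will misclassify low-cardinality numeric fields and high-cardinality identifiers, so the result should be treated as a first pass to review manually.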

---

## 4. Modeling Goal

After selecting the dataset, the AI should design a **hybrid diffusion model** that:

* Operates on **feature-level traffic data**
* Generates **synthetic traffic feature sequences**
* Preserves:

  * temporal patterns
  * protocol-level consistency
  * stochastic variability

The model does **not** generate raw packets directly.

---

## 5. Hybrid Diffusion Design Constraints

### 5.1 Feature Type Separation

The selected dataset’s features should be divided into two groups:

#### Continuous Features

Examples:

* inter-arrival time
* numeric values
* continuous statistics

**Modeling requirement**:

* Use **Gaussian diffusion (DDPM-style)**
* Forward process: add Gaussian noise
* Reverse process: predict the added noise and train with an MSE (or L1) loss
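
The continuous branch can be illustrated with the standard DDPM closed-form forward step. This is a minimal sketch under stated assumptions: the linear beta schedule, the 1000-step horizon, and the helper names (`linear_beta_schedule`, `q_sample`, `mse`) are illustrative choices, and the features are assumed to be normalized beforehand.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (an assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    a_bar = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
           for x, e in zip(x0, noise)]
    return x_t, noise


def mse(pred, target):
    """Mean squared error between predicted and true noise."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
x0 = [0.1, -0.5, 1.2]  # normalized continuous features of one event
x_t, noise = q_sample(x0, 500, alpha_bars, rng)

# A real model would predict `noise` from (x_t, t); a perfect prediction
# gives zero loss.
perfect_loss = mse(noise, noise)
```

During training, the network receives `x_t` and the timestep and is penalized by `mse(predicted_noise, noise)`.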

---

#### Discrete Features

Examples:

* function code
* message type
* direction
* categorical flags

**Modeling requirement**:

* Use **mask-based discrete diffusion**
* Forward process: randomly replace tokens with `[MASK]`, with a masking probability that grows with the diffusion timestep
* Reverse process: predict original token via classification
* Loss: cross-entropy (typically on masked positions only)
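
The discrete branch can be sketched in the same spirit. The example below assumes each position is masked independently with probability `mask_prob` (for instance `t / T`, which is one possible schedule, not a requirement of this document). The helper names and the two-token vocabulary are hypothetical.

```python
import math
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob, rng):
    """Forward process: replace each token with MASK with prob. `mask_prob`."""
    corrupted, is_masked = [], []
    for token in tokens:
        hit = rng.random() < mask_prob
        corrupted.append(MASK if hit else token)
        is_masked.append(hit)
    return corrupted, is_masked


def masked_cross_entropy(probs, targets, is_masked):
    """Mean cross-entropy over masked positions only.

    `probs` holds one dict per position mapping token -> predicted
    probability. Probabilities must be strictly positive for the targets.
    """
    losses = [-math.log(p[t])
              for p, t, m in zip(probs, targets, is_masked) if m]
    return sum(losses) / len(losses) if losses else 0.0


rng = random.Random(0)
VOCAB = ["3", "16"]  # illustrative function codes
tokens = ["3", "3", "16", "3"]

fully_masked, mask_all = mask_tokens(tokens, 1.0, rng)
untouched, mask_none = mask_tokens(tokens, 0.0, rng)

# Stand-in for a model output: a uniform distribution at every position.
uniform = [{tok: 1.0 / len(VOCAB) for tok in VOCAB} for _ in tokens]
loss_uniform = masked_cross_entropy(uniform, tokens, mask_all)
```

With a uniform prediction over two classes the loss equals `log 2`, which is a useful sanity baseline when the classification head is first wired up.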

---

### 5.2 Unified Model Requirement

The AI should design a model that:

* Uses a **shared backbone** over the event sequence (e.g., a 1D UNet-like network or a temporal model such as a Transformer or recurrent network)
* Has:

  * one head for continuous noise prediction
  * one head for discrete token prediction
* Trains with a **combined loss**, where `λ ∈ [0, 1]` weights the continuous term against the discrete term:

```text
L = λ · L_continuous + (1 − λ) · L_discrete
```
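
As a minimal sketch of the weighting above (the function name `combined_loss` and the default `lam=0.5` are assumptions for illustration):

```python
def combined_loss(loss_continuous, loss_discrete, lam=0.5):
    """Convex combination of the two branch losses, with lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * loss_continuous + (1.0 - lam) * loss_discrete


# Example: weight the discrete (cross-entropy) term more heavily.
total = combined_loss(0.2, 0.8, lam=0.25)
```

MSE and cross-entropy usually live on different scales, so `lam` may need tuning rather than being fixed at 0.5.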

---

## 6. Expected Output from the AI

The AI should produce:

1. **Dataset selection result**

   * Which dataset is chosen
   * Why it is more suitable for this project
2. **Feature breakdown**

   * Which columns are continuous
   * Which columns are discrete
3. **Hybrid diffusion architecture**

   * Input representation
   * Forward noise strategy (continuous + discrete)
   * Reverse denoising objectives
4. **Training formulation**

   * Loss definitions
   * High-level training loop description

Implementation details can remain **high-level / pseudocode-level** unless explicitly requested later.

---

## 7. Non-Goals (Important)

* Do **not** design packet parsers
* Do **not** generate raw PCAP directly
* Do **not** assume image-style diffusion
* Do **not** treat all features as continuous

The focus is **feature-level hybrid diffusion modeling** under an ICS / Modbus context.