# Project Context for AI

**Modbus / ICS Traffic Generation with Hybrid Diffusion**

## 1. Project Background

This project aims to build a **hybrid diffusion-based generative model** for industrial network traffic, inspired by the paper:

> *Spatio-Temporal Diffusion Model for Cellular Traffic Generation (STOUTER)*

Unlike the original paper (cellular traffic, aggregated continuous values), this project targets **industrial control system (ICS) traffic**, with a focus on **Modbus-like protocols**, where traffic features include both:

* **continuous values** (e.g., timing, numeric payloads), and
* **discrete protocol fields** (e.g., function codes, message types).

The final goal is to generate **realistic, protocol-consistent traffic features**, which can later be converted into raw packets (PCAP) by an external generator.

---

## 2. Available Datasets (Current State)

The dataset directory is:

```text
/Dev/diffusion/dataset/
```

It currently contains **two datasets**:

```text
dataset/
├── modbus_dataset/
└── hai/
```

### 2.1 Common Properties

Both datasets:

* Are already **preprocessed into CSV files**
* Contain **traffic-level features**, not raw PCAP
* Are suitable as **model input for diffusion-based generation**
* Represent **sequences of network events / flows**, not aggregated hourly statistics

No packet parsing is required at this stage.

---

### 2.2 Modbus Dataset

* Domain: **Modbus / industrial control traffic**
* Semantics:
  * Explicit protocol meaning (request/response, function codes, registers)
  * Strong logical and temporal constraints
* Feature types typically include:
  * Continuous:
    * inter-arrival time
    * numeric register values
    * payload length
  * Discrete:
    * function code
    * direction (master → slave / slave → master)
    * message type

This dataset aligns closely with the **target application domain** of the project.

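
The continuous/discrete split listed for the Modbus dataset could be captured in an explicit schema before any modeling starts. The following is a minimal sketch, not the project's actual loader: the column names, the vocabularies, the `FeatureSchema` class, and the sample row are all hypothetical placeholders, since the real CSV headers have not been inspected yet.

```python
from dataclasses import dataclass


@dataclass
class FeatureSchema:
    """Explicit split of CSV columns into continuous and discrete groups."""
    continuous: tuple  # column names parsed as floats
    discrete: dict     # column name -> ordered list of allowed string values

    def mask_id(self, column):
        # Reserve one extra id past the vocabulary for a [MASK] token.
        return len(self.discrete[column])

    def encode(self, event):
        """Turn one CSV row (dict of strings) into floats and token ids."""
        x_cont = [float(event[name]) for name in self.continuous]
        x_disc = [vocab.index(event[name]) for name, vocab in self.discrete.items()]
        return x_cont, x_disc


# Hypothetical column names and vocabularies; the real CSV headers may differ.
MODBUS_SCHEMA = FeatureSchema(
    continuous=("inter_arrival_time", "register_value", "payload_length"),
    discrete={
        "function_code": ["1", "2", "3", "4", "5", "6", "15", "16"],
        "direction": ["master_to_slave", "slave_to_master"],
        "message_type": ["request", "response"],
    },
)

# Illustrative row with made-up values, shaped like csv.DictReader output.
row = {
    "inter_arrival_time": "0.0125",
    "register_value": "1013",
    "payload_length": "12",
    "function_code": "3",
    "direction": "master_to_slave",
    "message_type": "request",
}
x_cont, x_disc = MODBUS_SCHEMA.encode(row)
print(x_cont, x_disc)
```

Declaring the split by hand, rather than inferring it from value counts, avoids misclassifying integer-valued columns such as payload length as categorical when only a few distinct values appear in a sample.
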

---

### 2.3 HAI Dataset

* Domain: **ICS network traffic (broader, not Modbus-only)**
* Characteristics:
  * Feature-extracted CSV format
  * Contains both normal and abnormal behavior
  * Less explicit protocol semantics compared to Modbus
* Often used for:
  * Anomaly detection
  * Security-oriented modeling

This dataset may be more suitable if the project emphasizes **security behavior patterns** rather than strict protocol logic.

---

## 3. Task 1: Dataset-Level Decision

The AI should first:

1. **Inspect both datasets**
   * Compare feature schemas
   * Identify:
     * continuous vs discrete fields
     * temporal resolution
     * protocol specificity
2. **Decide which dataset is more appropriate** for this project, based on:
   * Alignment with Modbus-style protocol semantics
   * Suitability for diffusion-based generation
   * Ability to support mixed continuous + discrete modeling

The decision should be **explicitly justified** (why one dataset is preferred over the other in this project context).

---

## 4. Modeling Goal

After selecting the dataset, the AI should design a **hybrid diffusion model** that:

* Operates on **feature-level traffic data**
* Generates **synthetic traffic feature sequences**
* Preserves:
  * temporal patterns
  * protocol-level consistency
  * stochastic variability

The model does **not** generate raw packets directly.

---

## 5. Hybrid Diffusion Design Constraints

### 5.1 Feature Type Separation

The selected dataset’s features should be divided into two groups:

#### Continuous Features

Examples:

* inter-arrival time
* numeric values
* continuous statistics

**Modeling requirement**:

* Use **Gaussian diffusion (DDPM-style)**
* Forward process: add Gaussian noise
* Reverse process: predict noise with MSE (or L1) loss

---

#### Discrete Features

Examples:

* function code
* message type
* direction
* categorical flags

**Modeling requirement**:

* Use **mask-based discrete diffusion**
* Forward process: randomly replace tokens with `[MASK]`
* Reverse process: predict original token via classification
* Loss: cross-entropy (typically on masked positions only)

---

### 5.2 Unified Model Requirement

The AI should design a model that:

* Uses a **shared backbone** (e.g., UNet-like or temporal model)
* Has:
  * one head for continuous noise prediction
  * one head for discrete token prediction
* Trains with a **combined loss**:

```text
L = λ · L_continuous + (1 − λ) · L_discrete
```

---

## 6. Output Expected from the AI's Reasoning

The AI should produce:

1. **Dataset selection result**
   * Which dataset is chosen
   * Why it is more suitable for this project
2. **Feature breakdown**
   * Which columns are continuous
   * Which columns are discrete
3. **Hybrid diffusion architecture**
   * Input representation
   * Forward noise strategy (continuous + discrete)
   * Reverse denoising objectives
4. **Training formulation**
   * Loss definitions
   * High-level training loop description

Implementation details can remain **high-level / pseudocode-level** unless explicitly requested later.

---

## 7. Non-Goals (Important)

* Do **not** design packet parsers
* Do **not** generate raw PCAP directly
* Do **not** assume image-style diffusion
* Do **not** treat all features as continuous

The focus is **feature-level hybrid diffusion modeling** under an ICS / Modbus context.
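
As a rough illustration of the two forward processes and the combined loss from Section 5, the sketch below uses only the Python standard library. Several choices are assumptions rather than requirements of this document: the number of steps `T`, the linear beta schedule, the masking probability `t / T`, and all function names. The model heads are replaced by stand-in values (a zero noise guess and uniform logits), so this shows the loss plumbing only, not a trained network.

```python
import math
import random

T = 100  # number of diffusion steps (assumed value)


def alpha_bar(t, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t, linear beta schedule."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (T - 1)
        prod *= 1.0 - beta
    return prod


def noise_continuous(x0, t, rng):
    """Gaussian forward step: x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps."""
    ab = alpha_bar(t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]
    return x_t, eps


def mask_discrete(tokens, t, mask_id, rng):
    """Masking forward step: each token becomes mask_id with probability t / T."""
    masked = [rng.random() < t / T for _ in tokens]
    x_t = [mask_id if m else tok for tok, m in zip(tokens, masked)]
    return x_t, masked


def mse(pred, target):
    """Mean squared error between predicted and true noise."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def masked_cross_entropy(logits, targets, masked):
    """Mean cross-entropy over masked positions only; 0.0 if nothing is masked."""
    losses = []
    for row, target, m in zip(logits, targets, masked):
        if not m:
            continue
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        losses.append(log_z - row[target])
    return sum(losses) / len(losses) if losses else 0.0


def hybrid_loss(eps_pred, eps, logits, tokens, masked, lam=0.5):
    """L = lam * L_continuous + (1 - lam) * L_discrete."""
    return lam * mse(eps_pred, eps) + (1.0 - lam) * masked_cross_entropy(
        logits, tokens, masked
    )


rng = random.Random(0)
tokens = [2, 0, 1]  # token ids for one discrete feature over three events
x_t, eps = noise_continuous([0.5, -1.2, 0.3], t=50, rng=rng)
tok_t, masked = mask_discrete(tokens, t=50, mask_id=8, rng=rng)

# Stand-ins for the two model heads: a zero noise guess and uniform logits.
eps_pred = [0.0] * len(eps)
logits = [[0.0] * 8 for _ in tok_t]
loss = hybrid_loss(eps_pred, eps, logits, tokens, masked, lam=0.5)
print(loss)
```

In a real training loop the stand-ins would be replaced by the outputs of the shared backbone's two heads, and `t` would be sampled per batch; a tensor library would also replace the list arithmetic used here for readability.
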