Project Context for AI
Modbus / ICS Traffic Generation with Hybrid Diffusion
1. Project Background
This project aims to build a hybrid diffusion-based generative model for industrial network traffic, inspired by the paper:
Spatio-Temporal Diffusion Model for Cellular Traffic Generation (STOUTER)
Unlike the original paper (cellular traffic, aggregated continuous values), this project targets industrial control system (ICS) traffic, with a focus on Modbus-like protocols, where traffic features include both:
- continuous values (e.g., timing, numeric payloads), and
- discrete protocol fields (e.g., function codes, message types).
The final goal is to generate realistic, protocol-consistent traffic features, which can later be converted into raw packets (PCAP) by an external generator.
2. Available Datasets (Current State)
The dataset directory is:
/Dev/diffusion/dataset/
It currently contains two datasets:
dataset/
├── modbus_dataset/
└── hai/
2.1 Common Properties
Both datasets:
- Are already preprocessed into CSV files
- Contain traffic-level features, not raw PCAP
- Are suitable as model input for diffusion-based generation
- Represent sequences of network events / flows, not aggregated hourly statistics
No packet parsing is required at this stage.
2.2 Modbus Dataset
- Domain: Modbus / industrial control traffic
- Semantics:
  - Explicit protocol meaning (request/response, function codes, registers)
  - Strong logical and temporal constraints
- Feature types typically include:
  - Continuous:
    - inter-arrival time
    - numeric register values
    - payload length
  - Discrete:
    - function code
    - direction (master → slave / slave → master)
    - message type
This dataset aligns closely with the target application domain of the project.
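To make the continuous/discrete split concrete, the following is a minimal sketch of how one Modbus event could be represented and separated into the two feature groups. The class name `ModbusEvent`, the field names, the `split_features` helper, and the vocabulary mapping are all hypothetical; the real column names must come from the CSV files in `modbus_dataset/`.

```python
from dataclasses import dataclass


@dataclass
class ModbusEvent:
    # Continuous features (field names are assumptions, not the dataset schema)
    inter_arrival: float   # seconds since the previous event
    register_value: float
    payload_length: float
    # Discrete features
    function_code: int     # e.g. 3 = Read Holding Registers
    direction: str         # "request" (master -> slave) or "response"
    message_type: str


def split_features(event, vocab):
    """Return (continuous vector, discrete token ids) for one event."""
    continuous = [event.inter_arrival, event.register_value, event.payload_length]
    discrete = [
        vocab["function_code"][event.function_code],
        vocab["direction"][event.direction],
        vocab["message_type"][event.message_type],
    ]
    return continuous, discrete


# Hypothetical vocabularies mapping raw categories to token ids.
vocab = {
    "function_code": {3: 0, 16: 1},
    "direction": {"request": 0, "response": 1},
    "message_type": {"read": 0, "write": 1},
}
event = ModbusEvent(0.012, 1500.5, 12.0, 16, "response", "write")
continuous, discrete = split_features(event, vocab)
```

Keeping the discrete fields as token ids rather than raw numbers avoids implying an ordering between, say, function codes 3 and 16.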
2.3 HAI Dataset
- Domain: ICS network traffic (broader, not Modbus-only)
- Characteristics:
  - Feature-extracted CSV format
  - Contains both normal and abnormal behavior
  - Less explicit protocol semantics compared to Modbus
- Often used for:
  - Anomaly detection
  - Security-oriented modeling
This dataset may be more suitable if the project emphasizes security behavior patterns rather than strict protocol logic.
3. Task 1: Dataset-Level Decision
The AI should first:
- Inspect both datasets
- Compare feature schemas
- Identify:
  - continuous vs discrete fields
  - temporal resolution
  - protocol specificity
- Decide which dataset is more appropriate for this project, based on:
  - Alignment with Modbus-style protocol semantics
  - Suitability for diffusion-based generation
  - Ability to support mixed continuous + discrete modeling
The decision should be explicitly justified (why one dataset is preferred over the other in this project context).
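As a starting point for the schema inspection, the sketch below labels each CSV column as continuous or discrete with a simple heuristic: non-numeric columns and integer-valued columns with few distinct levels are treated as discrete. The function name `classify_columns`, the `max_categories` threshold, and the sample column names are assumptions for illustration only; the result should be reviewed by hand against the actual datasets.

```python
import csv
import io


def classify_columns(rows, max_categories=20):
    """Label each column 'continuous' or 'discrete' using a simple heuristic."""
    kinds = {}
    for name in rows[0].keys():
        values = [row[name] for row in rows]
        try:
            numbers = [float(v) for v in values]
        except ValueError:
            kinds[name] = "discrete"  # non-numeric strings are categorical
            continue
        all_integers = all(n.is_integer() for n in numbers)
        few_levels = len(set(numbers)) <= max_categories
        # Integer-valued columns with few distinct levels look like codes or flags.
        kinds[name] = "discrete" if (all_integers and few_levels) else "continuous"
    return kinds


# Hypothetical sample; the real files live under dataset/modbus_dataset/ and dataset/hai/.
sample_csv = """inter_arrival,function_code,direction,register_value
0.012,3,request,1500.5
0.034,3,response,1498.25
0.011,16,request,20.0
0.052,16,response,21.75
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
schema = classify_columns(rows)
print(schema)
```

A heuristic like this can misclassify integer counters with few observed values, so the per-column decision still needs the justification asked for above.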
4. Modeling Goal
After selecting the dataset, the AI should design a hybrid diffusion model that:
- Operates on feature-level traffic data
- Generates synthetic traffic feature sequences
- Preserves:
  - temporal patterns
  - protocol-level consistency
  - stochastic variability
The model does not generate raw packets directly.
5. Hybrid Diffusion Design Constraints
5.1 Feature Type Separation
The selected dataset’s features should be divided into two groups:
Continuous Features
Examples:
- inter-arrival time
- numeric values
- continuous statistics
Modeling requirement:
- Use Gaussian diffusion (DDPM-style)
- Forward process: add Gaussian noise
- Reverse process: predict noise with MSE (or L1) loss
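The continuous branch can be sketched as follows, assuming a standard DDPM closed-form forward step x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε with a linear beta schedule. The schedule endpoints, the step count, and the function names `make_alpha_bar`, `q_sample`, and `mse` are illustrative assumptions, and the input is assumed to be standardized beforehand.

```python
import math
import random


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bar, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) and return it with the noise that was used."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    a = math.sqrt(alpha_bar[t])
    s = math.sqrt(1.0 - alpha_bar[t])
    xt = [a * x + s * e for x, e in zip(x0, noise)]
    return xt, noise


def mse(pred, target):
    """Mean squared error between predicted and true noise."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


rng = random.Random(0)
alpha_bar = make_alpha_bar()
x0 = [0.3, -1.2, 0.8]  # hypothetical standardized continuous features
xt, noise = q_sample(x0, t=500, alpha_bar=alpha_bar, rng=rng)
# In training, a network would predict `noise` from (xt, t); here the true
# noise stands in for a perfect prediction, so the loss is zero.
loss_if_perfect = mse(noise, noise)
```

The training target is the sampled noise itself, which is why `q_sample` returns it alongside the noised features.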
Discrete Features
Examples:
- function code
- message type
- direction
- categorical flags
Modeling requirement:
- Use mask-based discrete diffusion
- Forward process: randomly replace tokens with [MASK]
- Reverse process: predict original token via classification
- Loss: cross-entropy (typically on masked positions only)
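A minimal sketch of the discrete branch is shown below, assuming each token is masked independently with a probability that would normally grow with the diffusion step. The sentinel value `MASK = -1`, the function names, and the three-class vocabulary are assumptions for illustration; the logits would come from the model's classification head.

```python
import math
import random

MASK = -1  # hypothetical sentinel id standing in for the [MASK] token


def mask_tokens(tokens, mask_prob, rng):
    """Forward process: independently replace each token with MASK."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


def masked_cross_entropy(logits, targets, noisy):
    """Mean cross-entropy over the positions that are masked in `noisy`."""
    total, count = 0.0, 0
    for row, target, tok in zip(logits, targets, noisy):
        if tok != MASK:
            continue  # unmasked positions do not contribute to the loss
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[target]
        count += 1
    return total / count if count else 0.0


rng = random.Random(0)
tokens = [0, 1, 0, 2]  # hypothetical function-code token ids, 3 classes
noisy = mask_tokens(tokens, mask_prob=0.5, rng=rng)

# Uniform logits give a loss of log(num_classes) on every masked position.
fully_masked = mask_tokens(tokens, mask_prob=1.0, rng=rng)
uniform_logits = [[0.0, 0.0, 0.0] for _ in tokens]
loss = masked_cross_entropy(uniform_logits, tokens, fully_masked)
```

Restricting the loss to masked positions keeps the model from being rewarded for copying tokens it can already see.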
5.2 Unified Model Requirement
The AI should design a model that:
- Uses a shared backbone (e.g., UNet-like or temporal model)
- Has:
  - one head for continuous noise prediction
  - one head for discrete token prediction
- Trains with a combined loss:
  L = λ · L_continuous + (1 − λ) · L_discrete
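The weighting can be illustrated with a small worked example. The function name `hybrid_loss`, the default `lam=0.5`, and the loss values are assumptions chosen for illustration; restricting λ to [0, 1] is also an assumption, implied by the convex-combination form of the formula.

```python
def hybrid_loss(loss_continuous, loss_discrete, lam=0.5):
    """Weighted combination of the continuous (MSE) and discrete (CE) losses."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * loss_continuous + (1.0 - lam) * loss_discrete


# Hypothetical per-batch losses: 0.7 * 0.2 + 0.3 * 1.0, approximately 0.44.
total = hybrid_loss(loss_continuous=0.2, loss_discrete=1.0, lam=0.7)
```

Because MSE and cross-entropy are on different scales, λ usually needs tuning so that neither head dominates the gradient.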
6. Output Expected from the AI's Reasoning
The AI should produce:
- Dataset selection result
  - Which dataset is chosen
  - Why it is more suitable for this project
- Feature breakdown
  - Which columns are continuous
  - Which columns are discrete
- Hybrid diffusion architecture
  - Input representation
  - Forward noise strategy (continuous + discrete)
  - Reverse denoising objectives
- Training formulation
  - Loss definitions
  - High-level training loop description
Implementation details can remain high-level / pseudocode-level unless explicitly requested later.
7. Non-Goals (Important)
- Do not design packet parsers
- Do not generate raw PCAP directly
- Do not assume image-style diffusion
- Do not treat all features as continuous
The focus is feature-level hybrid diffusion modeling under an ICS / Modbus context.