ModuFlow/mask-ddpm

MingzheYang 200bdf6136 Add TODO-FOR-AI.md

2026-01-08 05:17:16 +08:00

Project Context for AI

Modbus / ICS Traffic Generation with Hybrid Diffusion

1. Project Background

This project aims to build a hybrid diffusion-based generative model for industrial network traffic, inspired by the paper:

Spatio-Temporal Diffusion Model for Cellular Traffic Generation (STOUTER)

Unlike the original paper (cellular traffic, aggregated continuous values), this project targets industrial control system (ICS) traffic, with a focus on Modbus-like protocols, where traffic features include both:

continuous values (e.g., timing, numeric payloads), and
discrete protocol fields (e.g., function codes, message types).

The final goal is to generate realistic, protocol-consistent traffic features, which can later be converted into raw packets (PCAP) by an external generator.

2. Available Datasets (Current State)

The dataset directory is:

/Dev/diffusion/dataset/

It currently contains two datasets:

dataset/
├── modbus_dataset/
└── hai/

2.1 Common Properties

Both datasets:

Are already preprocessed into CSV files
Contain traffic-level features, not raw PCAP
Are suitable as model input for diffusion-based generation
Represent sequences of network events / flows, not aggregated hourly statistics

No packet parsing is required at this stage.

2.2 Modbus Dataset

Domain: Modbus / industrial control traffic
Semantics:
- Explicit protocol meaning (request/response, function codes, registers)
- Strong logical and temporal constraints
Feature types typically include:
- Continuous:
  - inter-arrival time
  - numeric register values
  - payload length
- Discrete:
  - function code
  - direction (master → slave / slave → master)
  - message type

This dataset aligns closely with the target application domain of the project.

2.3 HAI Dataset

Domain: ICS network traffic (broader, not Modbus-only)
Characteristics:
- Feature-extracted CSV format
- Contains both normal and abnormal behavior
- Less explicit protocol semantics compared to Modbus
Often used for:
- Anomaly detection
- Security-oriented modeling

This dataset may be more suitable if the project emphasizes security behavior patterns rather than strict protocol logic.

3. Task 1: Dataset-Level Decision

The AI should first:

- Inspect both datasets
  - Compare feature schemas
  - Identify:
    - continuous vs discrete fields
    - temporal resolution
    - protocol specificity
- Decide which dataset is more appropriate for this project, based on:
  - Alignment with Modbus-style protocol semantics
  - Suitability for diffusion-based generation
  - Ability to support mixed continuous + discrete modeling

The decision should be explicitly justified (why one dataset is preferred over the other in this project context).
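As a starting point for the schema comparison, the continuous vs discrete split can be approximated with a simple heuristic over the CSV columns. The sketch below is illustrative only: the column names in `SAMPLE`, the function name `infer_column_types`, and the `max_categories` threshold are assumptions, not taken from either dataset, and the real files may need manual overrides.

```python
import csv
import io


def infer_column_types(csv_text, max_categories=10):
    """Label each column 'continuous' or 'discrete' with a simple heuristic."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)

    kinds = {}
    for name, values in columns.items():
        try:
            numeric = [float(v) for v in values]
        except ValueError:
            # Non-numeric strings are treated as categorical.
            kinds[name] = "discrete"
            continue
        all_integers = all(x.is_integer() for x in numeric)
        few_distinct = len(set(numeric)) <= max_categories
        # Integer-valued columns with few distinct values look like codes or flags.
        kinds[name] = "discrete" if (all_integers and few_distinct) else "continuous"
    return kinds


# Hypothetical rows; the real column names must be read from the CSV headers.
SAMPLE = """inter_arrival,function_code,direction,register_value
0.012,3,request,120.5
0.034,3,response,121.0
0.010,16,request,119.8
0.051,16,response,119.8
"""

schema = infer_column_types(SAMPLE)
for column, kind in schema.items():
    print(f"{column}: {kind}")
```

A heuristic like this will misclassify some fields (for example, an integer register value with few observed levels), so its output is best treated as a first draft to be checked against protocol knowledge.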

4. Modeling Goal

After selecting the dataset, the AI should design a hybrid diffusion model that:

Operates on feature-level traffic data
Generates synthetic traffic feature sequences
Preserves:
- temporal patterns
- protocol-level consistency
- stochastic variability

The model does not generate raw packets directly.

5. Hybrid Diffusion Design Constraints

5.1 Feature Type Separation

The selected dataset’s features should be divided into two groups:

Continuous Features

Examples:

inter-arrival time
numeric values
continuous statistics

Modeling requirement:

Use Gaussian diffusion (DDPM-style)
Forward process: add Gaussian noise
Reverse process: predict noise with MSE (or L1) loss
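The continuous branch described above can be sketched with the standard DDPM closed-form forward step, x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε. The linear beta schedule, the number of steps, and the function names below are assumptions chosen for illustration; the project brief does not fix any of them, and a real implementation would use a tensor library and a trained network in place of the placeholder.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed, commonly used schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Forward process: return the noised vector x_t and the noise that was added."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps


def mse(pred, target):
    """Mean squared error between predicted and true noise."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))

# Hypothetical standardized continuous features for one traffic event.
x0 = [0.5, -1.2, 0.3]
x_t, eps = q_sample(x0, t=500, alpha_bars=alpha_bars, rng=rng)

# Placeholder for the network output: an all-zero noise prediction.
eps_pred = [0.0 for _ in eps]
print("noise-prediction MSE:", mse(eps_pred, eps))
```

Continuous features such as inter-arrival time are usually standardized (or log-transformed) first, so that the unit-variance noise assumption is reasonable.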

Discrete Features

Examples:

function code
message type
direction
categorical flags

Modeling requirement:

Use mask-based discrete diffusion
Forward process: randomly replace tokens with [MASK]
Reverse process: predict original token via classification
Loss: cross-entropy (typically on masked positions only)
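The discrete branch can be sketched in the same spirit: tokens are independently replaced by a mask id, and cross-entropy is computed only where a token was masked. The vocabulary size, the choice of `MASK_ID`, the constant `mask_prob`, and the function names are assumptions for illustration; in practice the masking probability would typically depend on the diffusion timestep.

```python
import math
import random

NUM_CLASSES = 3        # hypothetical vocabulary size (e.g. three function codes)
MASK_ID = NUM_CLASSES  # extra id reserved for the [MASK] token (assumption)


def mask_tokens(tokens, mask_prob, rng):
    """Forward process: independently replace each token with MASK_ID."""
    noisy, is_masked = [], []
    for tok in tokens:
        hit = rng.random() < mask_prob
        noisy.append(MASK_ID if hit else tok)
        is_masked.append(hit)
    return noisy, is_masked


def masked_cross_entropy(logits, targets, is_masked):
    """Mean cross-entropy over masked positions only (0.0 if none are masked)."""
    total, count = 0.0, 0
    for row, target, hit in zip(logits, targets, is_masked):
        if not hit:
            continue
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[target]
        count += 1
    return total / count if count else 0.0


rng = random.Random(0)
tokens = [0, 2, 1, 0, 2]  # class indices of one discrete field over five events
noisy, is_masked = mask_tokens(tokens, mask_prob=0.5, rng=rng)

# Placeholder for the classification head: uniform logits over the classes.
uniform_logits = [[0.0] * NUM_CLASSES for _ in tokens]
print("noisy tokens:", noisy)
print("masked CE:", masked_cross_entropy(uniform_logits, tokens, is_masked))
```

Restricting the loss to masked positions keeps the model from being rewarded for simply copying tokens that were never corrupted.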

5.2 Unified Model Requirement

The AI should design a model that:

Uses a shared backbone (e.g., UNet-like or temporal model)
Has:
- one head for continuous noise prediction
- one head for discrete token prediction
Trains with a combined loss:

L = λ · L_continuous + (1 − λ) · L_discrete

where λ ∈ [0, 1] is a weighting hyperparameter that balances the two terms.
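A minimal sketch of the combined objective, assuming the two branch losses have already been computed as scalars. The function name `hybrid_loss` and the default `lam=0.5` are illustrative assumptions; the brief does not prescribe a value for λ.

```python
def hybrid_loss(loss_continuous, loss_discrete, lam=0.5):
    """Convex combination of the continuous (MSE/L1) and discrete (CE) losses."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * loss_continuous + (1.0 - lam) * loss_discrete


# Hypothetical per-batch losses from the two heads.
total = hybrid_loss(loss_continuous=2.0, loss_discrete=4.0, lam=0.25)
print("combined loss:", total)
```

Because noise-prediction MSE and cross-entropy can live on quite different scales, λ is likely to need tuning rather than being left at 0.5.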

6. Output Expected from the AI's Reasoning

The AI should produce:

- Dataset selection result
  - Which dataset is chosen
  - Why it is more suitable for this project
- Feature breakdown
  - Which columns are continuous
  - Which columns are discrete
- Hybrid diffusion architecture
  - Input representation
  - Forward noise strategy (continuous + discrete)
  - Reverse denoising objectives
- Training formulation
  - Loss definitions
  - High-level training loop description

Implementation details can remain high-level / pseudocode-level unless explicitly requested later.

7. Non-Goals (Important)

Do not design packet parsers
Do not generate raw PCAP directly
Do not assume image-style diffusion
Do not treat all features as continuous

The focus is feature-level hybrid diffusion modeling under an ICS / Modbus context.
