mask-ddpm/TODO-FOR-AI.md
2026-01-08 05:17:16 +08:00

Project Context for AI

Modbus / ICS Traffic Generation with Hybrid Diffusion

1. Project Background

This project aims to build a hybrid diffusion-based generative model for industrial network traffic, inspired by the paper:

Spatio-Temporal Diffusion Model for Cellular Traffic Generation (STOUTER)

Unlike the original paper (cellular traffic, aggregated continuous values), this project targets industrial control system (ICS) traffic, with a focus on Modbus-like protocols, where traffic features include both:

  • continuous values (e.g., timing, numeric payloads), and
  • discrete protocol fields (e.g., function codes, message types).

The final goal is to generate realistic, protocol-consistent traffic features, which can later be converted into raw packets (PCAP) by an external generator.


2. Available Datasets (Current State)

The dataset directory is:

/Dev/diffusion/dataset/

It currently contains two datasets:

dataset/
├── modbus_dataset/
└── hai/

2.1 Common Properties

Both datasets:

  • Are already preprocessed into CSV files
  • Contain traffic-level features, not raw PCAP
  • Are suitable as model input for diffusion-based generation
  • Represent sequences of network events / flows, not aggregated hourly statistics

No packet parsing is required at this stage.


2.2 Modbus Dataset

  • Domain: Modbus / industrial control traffic

  • Semantics:

    • Explicit protocol meaning (request/response, function codes, registers)
    • Strong logical and temporal constraints
  • Feature types typically include:

    • Continuous:

      • inter-arrival time
      • numeric register values
      • payload length
    • Discrete:

      • function code
      • direction (master → slave / slave → master)
      • message type

This dataset aligns closely with the target application domain of the project.
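
To make the continuous/discrete split concrete, one event can be written as a typed record. This is only a sketch: the field names, the `Direction` encoding, and the request/response coding of `message_type` are illustrative assumptions, not the actual CSV columns of the Modbus dataset.

```python
from dataclasses import dataclass
from enum import IntEnum


class Direction(IntEnum):
    """Hypothetical encoding of the message direction."""
    MASTER_TO_SLAVE = 0
    SLAVE_TO_MASTER = 1


@dataclass(frozen=True)
class ModbusEvent:
    """One traffic event. Field names are illustrative, not the real schema."""
    # Continuous features (candidates for Gaussian diffusion)
    inter_arrival_time: float  # seconds since the previous event
    register_value: float
    payload_length: float
    # Discrete features (candidates for mask-based diffusion)
    function_code: int         # e.g. 3 = Read Holding Registers
    direction: Direction
    message_type: int          # assumed coding: 0 = request, 1 = response

    def continuous(self) -> list[float]:
        """Values routed to the continuous branch of the model."""
        return [self.inter_arrival_time, self.register_value, self.payload_length]

    def discrete(self) -> list[int]:
        """Token ids routed to the discrete branch of the model."""
        return [self.function_code, int(self.direction), self.message_type]


event = ModbusEvent(0.012, 230.0, 12.0, 3, Direction.MASTER_TO_SLAVE, 0)
```

Keeping the two groups behind separate accessors mirrors the later requirement that each group gets its own forward process and its own prediction head.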


2.3 HAI Dataset

  • Domain: ICS network traffic (broader, not Modbus-only)

  • Characteristics:

    • Feature-extracted CSV format
    • Contains both normal and abnormal behavior
    • Less explicit protocol semantics compared to Modbus
  • Often used for:

    • Anomaly detection
    • Security-oriented modeling

This dataset may be more suitable if the project emphasizes security behavior patterns rather than strict protocol logic.


3. Task 1: Dataset-Level Decision

The AI should first:

  1. Inspect both datasets

    • Compare feature schemas

    • Identify:

      • continuous vs discrete fields
      • temporal resolution
      • protocol specificity
  2. Decide which dataset is more appropriate for this project, based on:

    • Alignment with Modbus-style protocol semantics
    • Suitability for diffusion-based generation
    • Ability to support mixed continuous + discrete modeling

The decision should be explicitly justified (why one dataset is preferred over the other in this project context).
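
A first pass over either CSV could label columns automatically before any manual review. The sketch below is a heuristic under stated assumptions: `infer_column_types`, the `max_categories` threshold, and the sample column names are all hypothetical, and the inline CSV stands in for a real file from the dataset directory.

```python
import csv
import io


def infer_column_types(rows, max_categories=32):
    """Label each column 'continuous' or 'discrete' with a simple heuristic.

    A column counts as discrete if any value is non-numeric, or if all values
    are integer-valued and there are at most `max_categories` distinct ones.
    The threshold is an assumption and should be tuned per dataset.
    """
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)

    types = {}
    for name, values in columns.items():
        try:
            numbers = [float(v) for v in values]
        except ValueError:
            types[name] = "discrete"  # non-numeric strings are categorical
            continue
        all_integer = all(x.is_integer() for x in numbers)
        few_values = len(set(numbers)) <= max_categories
        types[name] = "discrete" if (all_integer and few_values) else "continuous"
    return types


# Tiny made-up sample; a real run would read a CSV file from the dataset directory.
sample_csv = """inter_arrival_time,function_code,direction
0.012,3,request
0.250,3,response
0.013,16,request
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
column_types = infer_column_types(rows)
```

The heuristic can mislabel integer-valued measurements with few distinct values (payload length on a small sample, for instance), so its output is a starting point for the feature breakdown, not a final answer.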


4. Modeling Goal

After selecting the dataset, the AI should design a hybrid diffusion model that:

  • Operates on feature-level traffic data

  • Generates synthetic traffic feature sequences

  • Preserves:

    • temporal patterns
    • protocol-level consistency
    • stochastic variability

The model does not generate raw packets directly.


5. Hybrid Diffusion Design Constraints

5.1 Feature Type Separation

The selected dataset's features should be divided into two groups:

Continuous Features

Examples:

  • inter-arrival time
  • numeric values
  • continuous statistics

Modeling requirement:

  • Use Gaussian diffusion (DDPM-style)
  • Forward process: add Gaussian noise
  • Reverse process: predict noise with MSE (or L1) loss
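
The forward process and the noise-prediction loss above can be sketched in a few lines of plain Python. The linear beta schedule, its endpoints, the step count, and the helper names (`linear_alpha_bar`, `q_sample`, `mse_loss`) are assumptions chosen for illustration; the project does not fix them.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar[t] for a linear beta schedule."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) and return it with the noise that was added."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    signal_scale = math.sqrt(alpha_bar[t])
    noise_scale = math.sqrt(1.0 - alpha_bar[t])
    x_t = [signal_scale * x + noise_scale * e for x, e in zip(x0, noise)]
    return x_t, noise


def mse_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(predicted_noise, true_noise)) / len(true_noise)


rng = random.Random(0)
alpha_bar = linear_alpha_bar(num_steps=1000)
x0 = [0.3, -1.2, 0.8]  # standardized continuous features of one event (made-up values)
x_t, noise = q_sample(x0, t=500, alpha_bar=alpha_bar, rng=rng)
loss_if_perfect = mse_loss(noise, noise)  # a perfect noise predictor gives zero loss
```

In training, `predicted_noise` would come from the continuous head of the model given `x_t` and `t`. Continuous features are assumed to be standardized beforehand so that unit-variance noise is on a comparable scale.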

Discrete Features

Examples:

  • function code
  • message type
  • direction
  • categorical flags

Modeling requirement:

  • Use mask-based discrete diffusion
  • Forward process: randomly replace tokens with [MASK]
  • Reverse process: predict original token via classification
  • Loss: cross-entropy (typically on masked positions only)
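
The masking process and the masked-position loss above can be sketched as follows. The `MASK` id, the helper names, and the fixed `mask_prob` are assumptions for illustration; in practice the masking probability would usually be tied to the diffusion step.

```python
import math
import random

MASK = -1  # hypothetical id for the [MASK] token, outside the real vocabulary


def mask_tokens(tokens, mask_prob, rng):
    """Independently replace each token with MASK with probability mask_prob."""
    corrupted, is_masked = [], []
    for tok in tokens:
        hit = rng.random() < mask_prob
        corrupted.append(MASK if hit else tok)
        is_masked.append(hit)
    return corrupted, is_masked


def masked_cross_entropy(logits, targets, is_masked):
    """Mean cross-entropy over masked positions; logits[i] holds one score per class."""
    losses = []
    for row, target, masked in zip(logits, targets, is_masked):
        if not masked:
            continue  # unmasked positions do not contribute to the loss
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        losses.append(log_norm - row[target])
    return sum(losses) / len(losses) if losses else 0.0


rng = random.Random(0)
tokens = [2, 0, 1, 2, 1, 0]  # class indices of one discrete feature over six events
corrupted, is_masked = mask_tokens(tokens, mask_prob=0.5, rng=rng)

# Stand-in for a model output: uniform logits over 3 classes at every position.
uniform_logits = [[0.0, 0.0, 0.0] for _ in tokens]
loss = masked_cross_entropy(uniform_logits, tokens, is_masked)
```

With uniform logits the loss at every masked position equals log(3), which makes a convenient sanity check before plugging in a real discrete head.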

5.2 Unified Model Requirement

The AI should design a model that:

  • Uses a shared backbone (e.g., UNet-like or temporal model)

  • Has:

    • one head for continuous noise prediction
    • one head for discrete token prediction
  • Trains with a combined loss:

L = λ · L_continuous + (1 − λ) · L_discrete
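
Reading the combined loss as a convex combination, with λ weighting the continuous term and 1 − λ weighting the discrete term, a minimal sketch looks like this. The function name `hybrid_loss`, the default `lam=0.5`, and the range check are assumptions, not part of the specification.

```python
def hybrid_loss(loss_continuous, loss_discrete, lam=0.5):
    """Convex combination of the two objectives; lam weights the continuous term."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * loss_continuous + (1.0 - lam) * loss_discrete


# Made-up loss values, only to show the call.
total = hybrid_loss(loss_continuous=0.8, loss_discrete=1.2, lam=0.25)
```

Because MSE and cross-entropy live on different scales, λ is best treated as a hyperparameter to tune, not a fixed constant.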

6. Output Expected from the AI's Reasoning

The AI should produce:

  1. Dataset selection result

    • Which dataset is chosen
    • Why it is more suitable for this project
  2. Feature breakdown

    • Which columns are continuous
    • Which columns are discrete
  3. Hybrid diffusion architecture

    • Input representation
    • Forward noise strategy (continuous + discrete)
    • Reverse denoising objectives
  4. Training formulation

    • Loss definitions
    • High-level training loop description

Implementation details can remain high-level / pseudocode-level unless explicitly requested later.


7. Non-Goals (Important)

  • Do not design packet parsers
  • Do not generate raw PCAP directly
  • Do not assume image-style diffusion
  • Do not treat all features as continuous

The focus is feature-level hybrid diffusion modeling under an ICS / Modbus context.