Intro and Related Work Completed

- The reference for the HAI dataset still has problems.
2026-02-04 19:39:36 +08:00
parent 81625b5c4e
commit 272e159df1
2 changed files with 222 additions and 83 deletions
--- a/arxiv-style/main.tex
+++ b/arxiv-style/main.tex
@@ -25,7 +25,7 @@
 \title{Your Paper Title: A Deep Learning Approach for Something}

 % If no date is needed, uncomment the line below
-%\date{}
+\date{}

 \newif\ifuniqueAffiliation
 \uniqueAffiliationtrue
@@ -67,7 +67,7 @@ pdfkeywords={Keyword1, Keyword2, Keyword3},
 \maketitle

 \begin{abstract}
-	Here is the abstract of your paper.
+	Here is the abstract of your paper. Here is the abstract of your paper. Here is the abstract of your paper. Here is the abstract of your paper. Here is the abstract of your paper. Here is the abstract of your paper. Here is the abstract of your paper. Here is the abstract of your paper. Here is the abstract of your paper. Here is the abstract of your paper. Here is the abstract of your paper. Here is the abstract of your paper. Here is the abstract of your paper. Here is the abstract of your paper.
 \end{abstract}

 % Keywords
@@ -76,7 +76,13 @@ pdfkeywords={Keyword1, Keyword2, Keyword3},
 % 1. Introduction
 \section{Introduction}
 \label{sec:intro}
-Here introduces the background, problem statement, and contribution.
+Industrial control systems (ICS) form the backbone of modern critical infrastructure, which includes power grids, water treatment, manufacturing, and transportation, among others. These systems monitor, regulate, and automate the physical processes through sensors, actuators, programmable logic controllers (PLCs), and monitoring software. Unlike conventional IT systems, ICS operate in real time, closely coupled with physical processes and safety‑critical constraints, using heterogeneous and legacy communication protocols such as Modbus/TCP and DNP3 that were not originally designed with robust security in mind. This architectural complexity and operational criticality make ICS high‑impact targets for cyber attacks, where disruptions can result in physical damage, environmental harm, and even loss of life. Recent reviews of ICS security highlight the expanding attack surface due to increased connectivity, legacy systems’ vulnerabilities, and the inadequacy of traditional security controls in capturing the nuances of ICS networks and protocols \citep{10.1007/s10844-022-00753-1, Nankya2023-gp}.
+
+While machine learning (ML) techniques have shown promise for anomaly detection and automated cybersecurity within ICS, they rely heavily on labeled datasets that capture both benign operations and diverse attack patterns. In practice, real ICS traffic data, especially attack‑triggered captures, are scarce due to confidentiality, safety, and legal restrictions, and available public ICS datasets are few, limited in scope, or fail to reflect current threat modalities. For instance, the HAI Security Dataset provides operational telemetry and anomaly flags from a realistic control system setup for research purposes, but must be carefully preprocessed to derive protocol‑relevant features for ML tasks \citep{shin}. Data scarcity directly undermines model generalization, evaluation reproducibility, and the robustness of intrusion detection research, especially when training or testing ML models on realistic ICS behavior remains confined to small or outdated collections of examples \citep{info16100910}.
+
+Synthetic data generation offers a practical pathway to mitigate these challenges. By programmatically generating feature‑level sequences that mimic the statistical and temporal structure of real ICS telemetry, researchers can augment scarce training sets, standardize benchmarking, and preserve operational confidentiality. Relative to raw packet captures, feature‑level synthesis abstracts critical protocol semantics and statistical patterns without exposing sensitive fields, making it more compatible with safety constraints and compliance requirements in ICS environments. Modern generative modeling, including diffusion models, has advanced significantly in producing high‑fidelity synthetic data across domains. Diffusion approaches, such as denoising diffusion probabilistic models, learn to transform noise into coherent structured samples and have been successfully applied to tabular or time series data synthesis with better stability and data coverage compared to adversarial methods \citep{pmlr-v202-kotelnikov23a, rasul2021autoregressivedenoisingdiffusionmodels}.
+
+Despite these advances, most existing work either focuses on packet‑level generation \citep{jiang2023netdiffusionnetworkdataaugmentation} or is limited to generic tabular data \citep{pmlr-v202-kotelnikov23a}, rather than domain‑specific control sequence synthesis tailored for ICS protocols where temporal coherence, multi‑channel dependencies, and discrete protocol legality are jointly required. This gap motivates our focus on protocol feature-level generation for ICS, which involves synthesizing sequences of protocol-relevant fields conditioned on their temporal and cross-channel structure. In this work, we formulate a hybrid modeling pipeline that decouples long‑horizon trends and local statistical detail while preserving discrete semantics of protocol tokens. By combining causal Transformers with diffusion‑based refiners, and enforcing deterministic validity constraints during sampling, our framework generates semantically coherent, temporally consistent, and distributionally faithful ICS feature sequences. We evaluate features derived from the HAI Security Dataset and demonstrate that our approach produces high‑quality synthetic sequences suitable for downstream augmentation, benchmarking, and integration into packet‑construction workflows that respect realistic ICS constraints.

 % 2. Related Work
 \section{Related Work}
@@ -107,12 +113,12 @@ A key empirical and methodological tension in ICS synthesis is that temporal rea

 Motivated by these considerations, we propose Mask-DDPM, organized in the following order:
 \begin{enumerate}
-  \item Transformer trend module: learns the dominant temporal backbone of continuous dynamics via attention-based sequence modeling \citep{vaswani2017attention}. 
-  
+  \item Transformer trend module: learns the dominant temporal backbone of continuous dynamics via attention-based sequence modeling \citep{vaswani2017attention}.
+
  \item Residual DDPM for continuous variables: models distributional detail as stochastic residual structure conditioned on the learned trend \citep{ho2020denoising,kollovieh2023tsdiff}.
-  
+
  \item Masked diffusion for discrete variables: generates discrete ICS states with an absorbing/masking corruption process and categorical reconstruction \citep{austin2021structured, shi2024simplified}.
-  
+
  \item Type-aware decomposition: a type-aware factorization and routing layer that assigns variables to the most appropriate modeling mechanism and enforces deterministic constraints where warranted.
 \end{enumerate}

@@ -209,17 +215,17 @@ We therefore introduce a type-aware decomposition that formalizes this heterogen

 We use the following taxonomy:
 \begin{enumerate}
-	\item Type 1 (program-driven / setpoint-like): externally commanded, step-and-dwell variables. These variables can be treated as exogenous drivers (conditioning signals) or routed to specialized change-point / dwell-time models, rather than being forced into a smooth denoiser that may over-regularize step structure. 
-	
-	\item Type 2 (controller outputs): continuous variables tightly coupled to feedback loops; these benefit from conditional modeling where the conditioning includes relevant process variables and commanded setpoints. 
-	
-	\item Type 3 (actuator states/positions): often exhibit saturation, dwell, and rate limits; these may require stateful dynamics beyond generic residual diffusion, motivating either specialized conditional modules or additional inductive constraints. 
-	
-	\item Type 4 (process variables): inertia-dominated continuous dynamics; these are the primary beneficiaries of the Transformer trend + residual DDPM pipeline. 
+	\item Type 1 (program-driven / setpoint-like): externally commanded, step-and-dwell variables. These variables can be treated as exogenous drivers (conditioning signals) or routed to specialized change-point / dwell-time models, rather than being forced into a smooth denoiser that may over-regularize step structure.
+
+	\item Type 2 (controller outputs): continuous variables tightly coupled to feedback loops; these benefit from conditional modeling where the conditioning includes relevant process variables and commanded setpoints.
+
+	\item Type 3 (actuator states/positions): often exhibit saturation, dwell, and rate limits; these may require stateful dynamics beyond generic residual diffusion, motivating either specialized conditional modules or additional inductive constraints.
+
+	\item Type 4 (process variables): inertia-dominated continuous dynamics; these are the primary beneficiaries of the Transformer trend + residual DDPM pipeline.

 	\item Type 5 (derived/deterministic variables): algebraic or rule-based functions of other variables; we enforce deterministic reconstruction $\hat{x}^{(i)} = g_i(\hat{X},\hat{Y})$ rather than learning a stochastic generator, improving logical consistency and sample efficiency.
-	
-	\item Type 6 (auxiliary/low-impact variables): weakly coupled or sparse signals; we allow simplified modeling (e.g., calibrated marginals or lightweight temporal models) to avoid allocating diffusion capacity where it is not warranted. 
+
+	\item Type 6 (auxiliary/low-impact variables): weakly coupled or sparse signals; we allow simplified modeling (e.g., calibrated marginals or lightweight temporal models) to avoid allocating diffusion capacity where it is not warranted.
 \end{enumerate}

 Type-aware decomposition improves synthesis quality through three mechanisms. First, it improves capacity allocation by preventing a small set of mechanistically atypical variables from dominating gradients and distorting the learned distribution for the majority class (typically Type 4). Second, it enables constraint enforcement by deterministically reconstructing Type 5 variables, preventing logically inconsistent samples that purely learned generators can produce. Third, it improves mechanism alignment by attaching inductive biases consistent with step/dwell or saturation behaviors where generic denoisers may implicitly favor smoothness.
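
The routing and deterministic-reconstruction ideas in the taxonomy above can be illustrated with a minimal sketch. This is not the paper's implementation: the enum names, the route labels, the helper functions `route` and `reconstruct_derived`, and the example variables (`flow_a`, `flow_b`, `total_flow`, `tank_level`) are all hypothetical, chosen only to show how Type 5 variables might be recomputed from other generated values instead of being sampled.

```python
from enum import Enum
from typing import Callable, Dict


class VarType(Enum):
    """The six variable types of the taxonomy (member names are illustrative)."""
    SETPOINT = 1           # Type 1: program-driven / setpoint-like
    CONTROLLER_OUTPUT = 2  # Type 2: controller outputs
    ACTUATOR = 3           # Type 3: actuator states/positions
    PROCESS = 4            # Type 4: process variables
    DERIVED = 5            # Type 5: derived/deterministic variables
    AUXILIARY = 6          # Type 6: auxiliary/low-impact variables


# Hypothetical route labels; the real modules are not specified here.
ROUTES: Dict[VarType, str] = {
    VarType.SETPOINT: "conditioning_signal",
    VarType.CONTROLLER_OUTPUT: "conditional_model",
    VarType.ACTUATOR: "stateful_model",
    VarType.PROCESS: "trend_plus_residual_ddpm",
    VarType.DERIVED: "deterministic_rule",
    VarType.AUXILIARY: "lightweight_model",
}


def route(var_types: Dict[str, VarType]) -> Dict[str, str]:
    """Assign each variable name to the modeling mechanism for its type."""
    return {name: ROUTES[vtype] for name, vtype in var_types.items()}


def reconstruct_derived(
    sample: Dict[str, float],
    rules: Dict[str, Callable[[Dict[str, float]], float]],
) -> Dict[str, float]:
    """Overwrite Type 5 variables with rule outputs g_i(sample).

    Any sampled value for a derived variable is discarded, so the result
    always satisfies the rule.
    """
    out = dict(sample)
    for name, rule in rules.items():
        out[name] = rule(out)
    return out


if __name__ == "__main__":
    types = {"tank_level": VarType.PROCESS, "total_flow": VarType.DERIVED}
    print(route(types))

    # Assumed rule for illustration: total_flow is the sum of two flows.
    rules = {"total_flow": lambda s: s["flow_a"] + s["flow_b"]}
    generated = {"flow_a": 1.5, "flow_b": 2.0, "total_flow": 99.0}
    print(reconstruct_derived(generated, rules))
```

In this sketch the inconsistent sampled value of `total_flow` is replaced by the rule output, which is the sense in which deterministic reconstruction rules out logically inconsistent samples.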