forked from manbo/internal-docs
Refine type taxonomy figure and benchmark layout
@@ -22,17 +22,18 @@
\usepackage{bm}
\usepackage{array}    % For column formatting
\usepackage{caption}  % Better caption spacing
\usepackage{float}    % Precise figure placement

% Title
\title{Mask-DDPM: Transformer-Conditioned Mixed-Type Diffusion for Semantically Valid ICS Telemetry Synthesis}

% If no date is needed, uncomment the line below
\date{}

\newif\ifuniqueAffiliation
\uniqueAffiliationtrue

\ifuniqueAffiliation % Standard author block
\author{
Zhenglan Chen \\
Aberdeen Institute of Data Science and Artificial Intelligence\\
@@ -60,10 +61,10 @@
}
\fi

% Header settings
\renewcommand{\shorttitle}{Mask-DDPM for ICS Telemetry Synthesis}

%%% PDF metadata
\hypersetup{
pdftitle={Mask-DDPM: Transformer-Conditioned Mixed-Type Diffusion for Semantically Valid ICS Telemetry Synthesis},
pdfsubject={cs.LG, cs.CR},
@@ -75,22 +76,22 @@ pdfkeywords={Keyword1, Keyword2, Keyword3},
\maketitle

\begin{abstract}
Industrial control systems (ICS) security research is increasingly constrained by the scarcity and non-shareability of realistic traffic and telemetry, especially for attack scenarios. To mitigate this bottleneck, we study synthetic generation at the protocol feature/telemetry level, where samples must simultaneously preserve temporal coherence, match continuous marginal distributions, and keep discrete supervisory variables strictly within valid vocabularies. We propose Mask-DDPM, a hybrid framework tailored to mixed-type, multi-scale ICS sequences. Mask-DDPM factorizes generation into (i) a causal Transformer trend module that rolls out a stable long-horizon temporal scaffold for continuous channels, (ii) a trend-conditioned residual DDPM that refines local stochastic structure and heavy-tailed fluctuations without degrading global dynamics, (iii) a masked (absorbing) diffusion branch for discrete variables that guarantees categorical legality by construction, and (iv) a type-aware decomposition/routing layer that aligns modeling mechanisms with heterogeneous ICS variable origins and enforces deterministic reconstruction where appropriate. Evaluated on fixed-length windows ($L=96$) derived from the HAI Security Dataset, Mask-DDPM achieves stable fidelity across seeds with mean KS = 0.3311 $\pm$ 0.0079 (continuous), mean JSD = 0.0284 $\pm$ 0.0073 (discrete), and mean absolute lag-1 autocorrelation difference = 0.2684 $\pm$ 0.0027, indicating faithful marginals, preserved short-horizon dynamics, and valid discrete semantics. The resulting generator provides a reproducible basis for data augmentation, benchmarking, and downstream ICS protocol reconstruction workflows.
\end{abstract}

% Keywords
\keywords{Machine Learning \and Cyber Defense \and ICS}

% 1. Introduction
\section{Introduction}
\label{sec:intro}
Industrial control systems (ICS) form the backbone of modern critical infrastructure, including power grids, water treatment, manufacturing, and transportation. These systems monitor, regulate, and automate physical processes through sensors, actuators, programmable logic controllers (PLCs), and monitoring software. Unlike conventional IT systems, ICS operate in real time, closely coupled with physical processes and safety-critical constraints, using heterogeneous and legacy communication protocols such as Modbus/TCP and DNP3 that were not originally designed with robust security in mind. This architectural complexity and operational criticality make ICS high-impact targets for cyber attacks, where disruptions can result in physical damage, environmental harm, and even loss of life. Recent reviews of ICS security highlight the expanding attack surface due to increased connectivity, vulnerabilities in legacy systems, and the inadequacy of traditional security controls in capturing the nuances of ICS networks and protocols \citep{10.1007/s10844-022-00753-1, Nankya2023-gp}.
While machine learning (ML) techniques have shown promise for anomaly detection and automated cybersecurity within ICS, they rely heavily on labeled datasets that capture both benign operations and diverse attack patterns. In practice, real ICS traffic data, especially attack-triggered captures, are scarce due to confidentiality, safety, and legal restrictions, and available public ICS datasets are few, limited in scope, or fail to reflect current threat modalities. For instance, the HAI Security Dataset provides operational telemetry and anomaly flags from a realistic control system setup for research purposes, but must be carefully preprocessed to derive protocol-relevant features for ML tasks \citep{shin}. Data scarcity directly undermines model generalization, evaluation reproducibility, and the robustness of intrusion detection research, especially when training or testing ML models on realistic ICS behavior remains confined to small or outdated collections of examples \citep{info16100910}.
Synthetic data generation offers a practical pathway to mitigate these challenges. By programmatically generating feature-level sequences that mimic the statistical and temporal structure of real ICS telemetry, researchers can augment scarce training sets, standardize benchmarking, and preserve operational confidentiality. Relative to raw packet captures, feature-level synthesis abstracts critical protocol semantics and statistical patterns without exposing sensitive fields, making it more compatible with safety constraints and compliance requirements in ICS environments. Modern generative modeling, including diffusion models, has advanced significantly in producing high-fidelity synthetic data across domains. Diffusion approaches, such as denoising diffusion probabilistic models, learn to transform noise into coherent structured samples and have been successfully applied to tabular or time series data synthesis with better stability and data coverage compared to adversarial methods \citep{pmlr-v202-kotelnikov23a, rasul2021autoregressivedenoisingdiffusionmodels}.
Despite these advances, most existing work either focuses on packet-level generation \citep{jiang2023netdiffusionnetworkdataaugmentation} or is limited to generic tabular data \citep{pmlr-v202-kotelnikov23a}, rather than domain-specific control sequence synthesis tailored for ICS protocols where temporal coherence, multi-channel dependencies, and discrete protocol legality are jointly required. This gap motivates our focus on protocol feature-level generation for ICS, which involves synthesizing sequences of protocol-relevant fields conditioned on their temporal and cross-channel structure. In this work, we formulate a hybrid modeling pipeline that decouples long-horizon trends and local statistical detail while preserving discrete semantics of protocol tokens. By combining causal Transformers with diffusion-based refiners, and enforcing deterministic validity constraints during sampling, our framework generates semantically coherent, temporally consistent, and distributionally faithful ICS feature sequences. We evaluate features derived from the HAI Security Dataset and demonstrate that our approach produces high-quality synthetic sequences suitable for downstream augmentation, benchmarking, and integration into packet-reconstruction workflows that respect realistic ICS constraints.

% 2. Related Work
\section{Related Work}
@@ -106,9 +107,9 @@ From the perspective of high-level synthesis, the temporal structure is equally

% 3. Methodology
\section{Methodology}
\label{sec:method}
Industrial control system (ICS) telemetry is intrinsically mixed-type and mechanistically heterogeneous: continuous process trajectories (e.g., sensor and actuator signals) coexist with discrete supervisory states (e.g., modes, alarms, interlocks), and the underlying generating mechanisms range from physical inertia to program-driven step logic. This heterogeneity is not cosmetic: it directly affects what realistic synthesis means, because a generator must jointly satisfy (i) temporal coherence, (ii) distributional fidelity, and (iii) discrete semantic validity (i.e., every discrete output must belong to its legal vocabulary by construction). These properties are emphasized broadly in operational-technology security guidance and ICS engineering practice, where state logic and physical dynamics are tightly coupled \citep{nist2023sp80082}.
We model each training instance as a fixed-length window of length $L$, comprising continuous channels $\bm{X} \in \mathbb{R}^{L \times d_c}$ and discrete channels $\bm{Y} = \{y^{(j)}_{1:L}\}_{j=1}^{d_d}$, where each discrete variable satisfies $y^{(j)}_t \in \mathcal{V}_j$ for a finite vocabulary $\mathcal{V}_j$. Our objective is to learn a generator that produces synthetic $(\hat{\bm{X}}, \hat{\bm{Y}})$ that are simultaneously coherent and distributionally faithful, while also ensuring $\hat{y}^{(j)}_t\in\mathcal{V}_j$ for all $j$, $t$ by construction (rather than via post-hoc rounding or thresholding).
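As a concrete illustration of this data interface, the sketch below slices aligned continuous and discrete telemetry into windows of length $L$ and checks the vocabulary-legality condition $y^{(j)}_t \in \mathcal{V}_j$. The helper names and NumPy framing are ours, not part of the Mask-DDPM implementation:

```python
import numpy as np

def make_windows(cont, disc, L=96, stride=96):
    """Slice aligned telemetry into fixed-length windows.

    cont: (T, d_c) float array of continuous channels X.
    disc: (T, d_d) int array of discrete channels Y (vocabulary indices).
    Returns X_w of shape (N, L, d_c) and Y_w of shape (N, L, d_d).
    """
    T = cont.shape[0]
    starts = range(0, T - L + 1, stride)
    X_w = np.stack([cont[s:s + L] for s in starts])
    Y_w = np.stack([disc[s:s + L] for s in starts])
    return X_w, Y_w

def vocab_legal(Y_w, vocabs):
    """Check y_t^(j) in V_j for every window, time step, and channel j."""
    return all(set(np.unique(Y_w[..., j])) <= vocabs[j]
               for j in range(Y_w.shape[-1]))
```

A generator that satisfies legality by construction should pass `vocab_legal` on every sampled batch without any post-hoc rounding.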
A key empirical and methodological tension in ICS synthesis is that temporal realism and marginal/distributional realism can compete when optimized monolithically: sequence models trained primarily for regression often over-smooth heavy tails and intermittent bursts, while purely distribution-matching objectives can erode long-range structure. Diffusion models provide a principled route to rich distribution modeling through iterative denoising, but they do not, by themselves, resolve (i) the need for a stable low-frequency temporal scaffold, nor (ii) the discrete legality constraints for supervisory variables \citep{ho2020denoising,song2021score}. Recent time-series diffusion work further suggests that separating coarse structure from stochastic refinement can be an effective inductive bias for long-horizon realism \citep{kollovieh2023tsdiff,sikder2023transfusion}.
@@ -134,14 +135,14 @@ This ordering is intentional. The trend module establishes a macro-temporal scaf

\subsection{Transformer trend module for continuous dynamics}
\label{sec:method-trans}
We instantiate the temporal backbone as a causal Transformer trend extractor, leveraging self-attention's ability to represent long-range dependencies and cross-channel interactions without recurrence \citep{vaswani2017attention}. Compared with recurrent trend extractors (e.g., GRU-style backbones), a Transformer trend module offers a direct mechanism to model delayed effects and multivariate coupling common in ICS, where control actions may influence downstream sensors with nontrivial lags and regime-dependent propagation \citep{vaswani2017attention,nist2023sp80082}. Crucially, in our design the Transformer is not asked to be the entire generator; instead, it serves a deliberately restricted role: providing a stable, temporally coherent conditioning signal that later stochastic components refine.

For continuous channels $\bm{X}$, we posit an additive decomposition:
\begin{equation}
\bm{X} = \bm{S} + \bm{R},
\label{eq:additive_decomp}
\end{equation}
where $\bm{S} \in \mathbb{R}^{L \times d_c}$ is a smooth trend capturing predictable temporal evolution, and $\bm{R} \in \mathbb{R}^{L \times d_c}$ is a residual capturing distributional detail (e.g., bursts, heavy tails, local fluctuations) that is difficult to represent robustly with a purely regression-based temporal objective. This separation reflects an explicit division of labor: the trend module prioritizes temporal coherence, while diffusion (introduced next) targets distributional realism at the residual level, a strategy aligned with predict-then-refine perspectives in time-series diffusion modeling \citep{kollovieh2023tsdiff,sikder2023transfusion}.

We parameterize the trend $\bm{S}$ using a causal Transformer $f_\phi$. With teacher forcing, we train $f_\phi$ to predict the next-step trend from past observations:
\begin{equation}
\hat{\bm{S}}_{t+1} = f_\phi\bigl(\bm{X}_{1:t}\bigr),
\end{equation}
using the mean-squared error objective:
\begin{equation}
\mathcal{L}_{\text{trend}}(\phi) = \frac{1}{(L-1)d_c} \sum_{t=1}^{L-1} \bigl\| \hat{\bm{S}}_{t+1} - \bm{X}_{t+1} \bigr\|_2^2.
\label{eq:trend_loss}
\end{equation}
At inference, we roll out the Transformer autoregressively to obtain $\hat{\bm{S}}$, and then define the residual target for diffusion as $\bm{R} = \bm{X} - \hat{\bm{S}}$. This setup intentionally locks in a coherent low-frequency scaffold before any stochastic refinement is applied, thereby reducing the burden on downstream diffusion modules to simultaneously learn both long-range structure and marginal detail. In this sense, our use of Transformers is distinctive: it is a conditioning-first temporal backbone designed to stabilize mixed-type diffusion synthesis in ICS, rather than an end-to-end monolithic generator \citep{vaswani2017attention,kollovieh2023tsdiff,yuan2025ctu}.
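The autoregressive rollout can be sketched generically as follows; `step_fn` stands in for the trained trend Transformer $f_\phi$, and the persistence predictor used in the usage note is purely illustrative:

```python
import numpy as np

def rollout_trend(step_fn, x_init, horizon):
    """Autoregressive rollout of a next-step trend predictor.

    step_fn: maps past observations (t, d_c) -> next trend estimate (d_c,);
             a stand-in for the causal Transformer f_phi.
    x_init:  (t0, d_c) observed prefix seeding the rollout.
    Returns S_hat of shape (horizon, d_c). Unlike teacher-forced training,
    the model is fed its own predictions at inference time.
    """
    history = [np.asarray(r, dtype=float) for r in x_init]
    preds = []
    for _ in range(horizon):
        nxt = step_fn(np.stack(history))
        preds.append(nxt)
        history.append(nxt)  # feed back the prediction
    return np.stack(preds)
```

Given a real window `X`, the residual target for the diffusion stage is then simply `R = X - S_hat`, matching the additive decomposition above.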

\subsection{DDPM for continuous residual generation}
\label{sec:method-ddpm}
@@ -192,7 +193,7 @@ After sampling $\hat{\bm{R}}$ by reverse diffusion, we reconstruct the continuou

\subsection{Masked diffusion for discrete ICS variables}
\label{sec:method-discrete}
Discrete ICS variables must remain categorical, making Gaussian diffusion inappropriate for supervisory states and mode-like channels. While one can attempt continuous relaxations or post-hoc discretization, such strategies risk producing semantically invalid intermediate states (e.g., in-between modes) and can distort the discrete marginal distribution. Discrete-state diffusion provides a principled alternative by defining a valid corruption process directly on categorical variables \citep{austin2021structured,shi2024simplified}. In the ICS setting, this is not a secondary detail: supervisory tags often encode control logic boundaries (modes, alarms, interlocks) that must remain within a finite vocabulary to preserve semantic correctness \citep{nist2023sp80082}.
We therefore adopt masked (absorbing) diffusion for discrete channels, where corruption replaces tokens with a special $\texttt{[MASK]}$ symbol according to a schedule \citep{shi2024simplified}. For each variable $j$, define a masking schedule $\{m_k\}_{k=1}^K$ (with $m_k\in[0,1]$) increasing in $k$. The forward corruption process is:
\begin{equation}
q\bigl(y^{(j)}_{t,k} \mid y^{(j)}_{t,0}\bigr) = (1 - m_k)\,\mathbf{1}\bigl[y^{(j)}_{t,k} = y^{(j)}_{t,0}\bigr] + m_k\,\mathbf{1}\bigl[y^{(j)}_{t,k} = \texttt{[MASK]}\bigr].
\end{equation}
@@ -217,7 +218,7 @@ where $\mathrm{CE}(\cdot,\cdot)$ is cross-entropy. At sampling time, we initiali
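The absorbing corruption on which this branch rests can be sketched in a few lines; the `MASK` id and helper name are illustrative, not the paper's code:

```python
import numpy as np

MASK = -1  # stand-in id for the [MASK] symbol, outside every vocabulary V_j

def mask_corrupt(y0, m_k, rng):
    """Absorbing forward corruption at masking level m_k: each token is
    independently replaced by MASK with probability m_k, else kept."""
    keep = rng.random(y0.shape) >= m_k
    return np.where(keep, y0, MASK)
```

Because corrupted tokens are either unchanged or `MASK`, a denoiser trained to fill masked positions can only ever emit in-vocabulary tokens, which is exactly the legality-by-construction property the text argues for.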

\subsection{Type-aware decomposition as factorization and routing layer}
\label{sec:method-types}
Even with a trend-conditioned residual DDPM and a discrete masked-diffusion branch, a single uniform modeling treatment can remain suboptimal because ICS variables are generated by qualitatively different mechanisms. For example, program-driven setpoints exhibit step-and-dwell dynamics; controller outputs follow control laws conditioned on process feedback; actuator positions may show saturation and dwell; and some derived tags are deterministic functions of other channels. Treating all channels as if they were exchangeable stochastic processes can misallocate model capacity and induce systematic error concentration on a small subset of mechanistically distinct variables \citep{nist2023sp80082}.
We therefore introduce a type-aware decomposition that formalizes this heterogeneity as a routing and constraint layer. Let $\tau(i)\in\{1,\dots,6\}$ assign each variable $i$ to a type class. The type assignment can be initialized from domain semantics (tag metadata, value domains, and engineering meaning), and subsequently refined via an error-attribution workflow described in the Benchmark section. Importantly, this refinement does not change the core diffusion backbone; it changes which mechanism is responsible for which variable, thereby aligning inductive bias with variable-generating mechanism while preserving overall coherence.
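A minimal sketch of this routing layer follows; the mechanism names are ours and the mapping is illustrative of the taxonomy described in this section (Type 1 as conditioning, Type 5 deterministically reconstructed, Type 6 on a simplified path, the rest sharing the learned generator):

```python
def route(tau):
    """Map a variable's type class tau(i) in {1, ..., 6} to its mechanism.

    Illustrative mapping: Type 1 variables condition the generator, Type 5
    variables are recomputed deterministically from other channels after
    sampling, Type 6 gets a lightweight model, and the remaining types
    share the learned diffusion generator.
    """
    table = {
        1: "conditioning",
        2: "generator",
        3: "generator",
        4: "generator",
        5: "deterministic",
        6: "lightweight",
    }
    return table[tau]
```

Refining the type assignment then amounts to editing this table, leaving the diffusion backbone untouched, which is the point the paragraph above makes.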
@@ -236,20 +237,19 @@ We use the following taxonomy:
\item Type 6 (auxiliary/low-impact variables): weakly coupled or sparse signals; we allow simplified modeling (e.g., calibrated marginals or lightweight temporal models) to avoid allocating diffusion capacity where it is not warranted.
\end{enumerate}

\begin{figure}[H]
\centering
\includegraphics[width=0.98\textwidth]{typeclass-cropped.pdf}
\caption*{Type assignment and six-type taxonomy.}
\end{figure}

Type-aware decomposition improves synthesis quality through three mechanisms. First, it improves capacity allocation by preventing a small set of mechanistically atypical variables from dominating gradients and distorting the learned distribution for the majority class (typically Type 4). Second, it enables constraint enforcement by deterministically reconstructing Type 5 variables, preventing logically inconsistent samples that purely learned generators can produce. Third, it improves mechanism alignment by attaching inductive biases consistent with step/dwell or saturation behaviors where generic denoisers may implicitly favor smoothness.
From a novelty standpoint, this layer is not merely an engineering patch; it is an explicit methodological statement that ICS synthesis benefits from typed factorization, a principle that has analogues in mixed-type generative modeling more broadly, but that remains underexplored in diffusion-based ICS telemetry synthesis \citep{shi2025tabdiff,yuan2025ctu,nist2023sp80082}.

\subsection{Joint optimization and end-to-end sampling}
\label{sec:method-joint}
We train the model in a staged manner consistent with the above factorization, which improves optimization stability and encourages each component to specialize in its intended role. Specifically: (i) we train the trend Transformer $f_{\phi}$ to obtain $\hat{\bm{S}}$; (ii) we compute residual targets $\hat{\bm{R}} = \bm{X} - \hat{\bm{S}}$ for the continuous variables routed to residual diffusion; (iii) we train the residual DDPM $p_{\theta}(\bm{R}\mid \hat{\bm{S}})$ and masked diffusion model $p_{\psi}(\bm{Y}\mid \text{masked}(\bm{Y}), \hat{\bm{S}}, \hat{\bm{X}})$; and (iv) we apply type-aware routing and deterministic reconstruction during sampling. This staged strategy is aligned with the design goal of separating temporal scaffolding from distributional refinement, and it mirrors the broader intuition in time-series diffusion that decoupling coarse structure and stochastic detail can mitigate structure-vs.-realism conflicts \citep{kollovieh2023tsdiff,sikder2023transfusion}.
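Stages (i)–(ii) of this factorization can be sketched numerically. The per-window mean below is only a stand-in for the causal Transformer trend $f_{\phi}$; the point is that trend plus residual recomposes the original window exactly, so the residual DDPM only has to model what the trend scaffold leaves behind.

```python
import numpy as np

def residual_targets(X, S_hat):
    """Stage (ii): residual targets R = X - S_hat for the residual DDPM."""
    return X - S_hat

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 96, 3))  # (windows, L=96, continuous channels)
# Stand-in trend: per-window channel mean (the paper uses a causal Transformer).
S_hat = np.broadcast_to(X.mean(axis=1, keepdims=True), X.shape)
R = residual_targets(X, S_hat)
# Sampling recomposes trend and refined residual; the identity holds by construction.
assert np.allclose(S_hat + R, X)
```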
A simple combined objective is $\mathcal{L} = \lambda\mathcal{L}_{\text{cont}} + (1-\lambda)\mathcal{L}_{\text{disc}}$ with $\lambda\in[0,1]$ controlling the balance between continuous and discrete learning. Type-aware routing determines which channels contribute to which loss and which are excluded in favor of deterministic reconstruction. In practice, this routing acts as a principled guardrail against negative transfer across variable mechanisms: channels that are best handled deterministically (Type 5) or by specialized drivers (Type 1/3, depending on configuration) are prevented from forcing the diffusion models into statistically incoherent compromises.
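A minimal sketch of this objective and its routing mask, assuming per-channel losses have already been computed: channels handled deterministically carry a zero in the mask and so contribute no gradient.

```python
import numpy as np

def combined_loss(l_cont: float, l_disc: float, lam: float = 0.7) -> float:
    """L = lam * L_cont + (1 - lam) * L_disc, with lam in [0, 1]."""
    assert 0.0 <= lam <= 1.0
    return lam * l_cont + (1.0 - lam) * l_disc

def routed_mean_loss(per_channel_losses, routed_mask):
    """Average loss over channels routed to a learned model.

    Channels excluded in favor of deterministic reconstruction carry a 0
    in routed_mask and therefore contribute nothing to the objective.
    """
    losses = np.asarray(per_channel_losses, dtype=float)
    mask = np.asarray(routed_mask, dtype=float)
    return float((losses * mask).sum() / mask.sum())
```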
\section{Benchmark evaluation}

A credible ICS generator must clear four progressively harder hurdles.
This organization is particularly important for ICS telemetry. A generator can look competitive on one-dimensional marginals while still failing on the aspects that make a trace operationally plausible: long plateaus in setpoint-like variables, concentrated occupancy in actuator states, tight controller--sensor coupling, or persistent support signals. Our goal is therefore not to maximize a single scalar, but to show which parts of realism have already been solved, which remain brittle, and which model components are responsible for each regime.
For continuous channels, we prioritize marginal agreement because ICS process signals often exhibit bounded support, long plateaus, saturation effects, and non-Gaussian tails that are poorly summarized by moment matching alone. We therefore use the Kolmogorov--Smirnov (KS) statistic per feature and average it over continuous variables: KS compares empirical cumulative distributions directly, requires no parametric assumption, and is sensitive to support shifts or local shape mismatches that are operationally meaningful in telemetry. For discrete channels, the object of interest is different: supervisory variables live on a finite vocabulary, so realism is primarily about whether the synthetic sampler places the right probability mass on the right states. We therefore compute Jensen--Shannon divergence (JSD) between per-feature categorical marginals and average across discrete variables \citep{lin1991divergence,yoon2019timegan}, since JSD is symmetric, bounded, and naturally suited to comparing categorical occupancy patterns. To assess short-horizon dynamics, we compare lag-1 autocorrelation feature-wise and report the mean absolute difference between real and synthetic lag-1 coefficients, which captures the short-memory persistence induced by actuator dwell, controller smoothing, and process inertia. We additionally track semantic legality by counting out-of-vocabulary discrete outputs, and we report a filtered KS that excludes near-constant channels whose variance is effectively zero so that trivially flat tags do not dominate the aggregate. These core measures are complemented with type-aware diagnostics, extended realism metrics, and ablations.
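These three core measures can be sketched directly; a minimal implementation assuming standard SciPy, where the squaring converts SciPy's Jensen--Shannon distance back into the divergence used in the text:

```python
import numpy as np
from scipy.stats import ks_2samp
from scipy.spatial.distance import jensenshannon

def ks_stat(real: np.ndarray, synth: np.ndarray) -> float:
    """Per-feature KS statistic: max gap between the two empirical CDFs."""
    return float(ks_2samp(real, synth).statistic)

def categorical_jsd(real_cat, synth_cat, vocab) -> float:
    """JSD between per-feature categorical marginals (base 2, bounded in [0, 1])."""
    p = np.array([(np.asarray(real_cat) == v).mean() for v in vocab])
    q = np.array([(np.asarray(synth_cat) == v).mean() for v in vocab])
    return float(jensenshannon(p, q, base=2) ** 2)  # distance -> divergence

def lag1_abs_diff(real: np.ndarray, synth: np.ndarray) -> float:
    """Absolute difference between real and synthetic lag-1 autocorrelation."""
    ac = lambda x: np.corrcoef(x[:-1], x[1:])[0, 1]
    return float(abs(ac(real) - ac(synth)))
```

Averaging `ks_stat` over continuous channels, `categorical_jsd` over discrete channels, and `lag1_abs_diff` feature-wise yields the aggregate numbers reported below.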
\subsection{Core fidelity, legality, and reproducibility}
\label{sec:benchmark-quant}
Across three independent runs, Mask-DDPM achieves mean KS $=0.3311 \pm 0.0079$, mean JSD $=0.0284 \pm 0.0073$, and mean absolute lag-1 difference $=0.2684 \pm 0.0027$ (best single runs: KS $=0.3224$, JSD $=0.0209$, and lag-1 $=0.2661$), while maintaining a validity rate of \textbf{100\%} across the 26 modeled discrete channels. The small dispersion across runs suggests that the generator is reproducible at the level of global mixed-type fidelity rather than depending on a single favorable seed. This is the first major benchmark takeaway: semantic legality is already saturated by construction, so the remaining challenge is no longer whether the model can emit valid symbols, but whether it can place valid symbols and trajectories in the right temporal and cross-channel context.
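The validity rate is a purely mechanical audit; a minimal sketch of the out-of-vocabulary check, with channel names purely illustrative:

```python
def validity_rate(synth_tokens: dict, vocabularies: dict) -> float:
    """Fraction of discrete channels whose generated tokens are all in-vocabulary."""
    legal = sum(
        all(t in vocabularies[ch] for t in tokens)
        for ch, tokens in synth_tokens.items()
    )
    return legal / len(synth_tokens)
```

Under masked (absorbing) diffusion the denoiser can only place probability mass on vocabulary entries, so this check passes by construction rather than by filtering.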
A representative diagnostic slice provides the complementary localized view. On that slice, the model attains mean KS $=0.4025$, filtered mean KS $=0.3191$, mean JSD $=0.0166$, and mean absolute lag-1 difference $=0.2859$, again with zero invalid discrete tokens. Two patterns matter most. First, the discrete branch remains consistently reliable: low JSD together with perfect validity indicates that supervisory semantics are being learned rather than repaired after the fact. Second, the gap between overall KS and filtered KS suggests that continuous mismatch is concentrated in a limited subset of difficult channels instead of being spread uniformly across the telemetry space.
\begin{figure}[htbp]
\centering
\includegraphics[width=\textwidth]{fig-benchmark-story-v2.png}
\caption{Benchmark evidence chain. Left: seed-level reproducibility over the three benchmark runs, showing that the global metrics are stable across seeds. Middle: top-10 continuous features ranked by KS on the diagnostic slice, with overall and filtered average KS overlaid to show that a small subset of tags dominates the continuous error budget. Right: representative type-aware mismatch scores from the same slice, using program dwell, controller change rate, actuator top-3 mass, PV tail ratio, and auxiliary lag-1 persistence as mechanism-level diagnostics. Lower is better in all panels.}
\label{fig:benchmark}
\end{figure}
\begin{table}[htbp]
\centering
\caption{Main benchmark summary. The left column reports reproducibility across three complete runs; the right column reports the diagnostic slice used for the per-feature, type-aware, and extended analyses. Lower is better except for validity rate.}
\label{tab:core_metrics}
\begin{tabular}{@{}lcc@{}}
\toprule
\textbf{Metric} & \textbf{3-run mean $\pm$ std} & \textbf{Diagnostic slice} \\
\midrule
Mean KS (continuous) & $0.3311 \pm 0.0079$ & $0.4025$ \\
Filtered mean KS & -- & $0.3191$ \\
Mean JSD (discrete) & $0.0284 \pm 0.0073$ & $0.0166$ \\
Mean abs. lag-1 difference & $0.2684 \pm 0.0027$ & $0.2859$ \\
Validity rate (26 discrete tags) $\uparrow$ & $100.0 \pm 0.0\%$ & $100.0\%$ \\
\bottomrule
\end{tabular}
\end{table}
Figure~\ref{fig:benchmark} turns these numbers into a structural diagnosis rather than a scoreboard. The largest KS contributors are concentrated in a handful of control-relevant tags, including \texttt{P1\_B4002}, \texttt{P1\_FCV02Z}, \texttt{P1\_B3004}, \texttt{P1\_B2004}, and \texttt{P1\_PCV02Z}, so the continuous error budget reflects a small subset of difficult channels rather than a global collapse of the generator, while the type-aware panel shows that the remaining gap is mechanism-specific. In other words, the model has largely solved legality and a substantial portion of mixed-type marginal fidelity, but realism remains harder for behaviors governed by switching, long dwell, bounded operating regimes, and strong local persistence.
\subsection{Extended realism and downstream utility}
\label{sec:benchmark-extended}
The next question is whether improvements under fidelity metrics correspond to broader structural realism and downstream usefulness. We therefore additionally evaluate two-sample distance, cross-variable coupling, spectral similarity, predictive consistency, memorization risk, and anomaly-detection utility on a representative diagnostic slice. Because this slice is intentionally small (four synthetic windows, i.e., 384 generated rows at $L=96$), we interpret the resulting numbers as diagnostic rather than definitive; their purpose is to show which aspects of realism respond to post-processing and which ones remain limited by mechanism-level dynamics.
\begin{table}[htbp]
\centering
\caption{Extended realism and downstream utility. The post-processed column corresponds to the typed post-processing baseline. Lower is better except for AUPRC. For reference, the real-only predictor RMSE is $0.558$ and the real-only anomaly AUPRC is $0.653$.}
\label{tab:extended_eval}
\begin{tabular}{@{}lcc@{}}
\toprule
\textbf{Metric} & \textbf{Synthetic} & \textbf{Post-processed} \\
\bottomrule
\end{tabular}
\end{table}

Table~\ref{tab:extended_eval} reveals a useful asymmetry: typed post-processing improves distribution-level alignment, while mechanism-level dynamics are not so easily repaired after the fact.
\subsection{Type-aware diagnostics}
\label{sec:benchmark-typed}
Type-aware diagnostics make that mechanism gap explicit. Table~\ref{tab:typed_diagnostics} summarizes one representative statistic per variable family on the same diagnostic slice. These statistics are not redundant with the main benchmark table: they answer a different question, namely which operational behaviors remain hardest to match once legality and marginal alignment are largely in place.
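Two family-level statistics of this kind can be sketched directly; the functions below are generic illustrations (mean dwell length for step-and-dwell program variables, top-3 occupancy mass for actuator states), not the exact diagnostic implementation.

```python
import numpy as np

def mean_dwell(seq) -> float:
    """Average run length of constant segments (step-and-dwell behavior)."""
    seq = np.asarray(seq)
    change_points = np.flatnonzero(np.diff(seq) != 0)
    bounds = np.concatenate(([0], change_points + 1, [len(seq)]))
    return float(np.diff(bounds).mean())

def top3_mass(states) -> float:
    """Probability mass occupied by the three most frequent discrete states."""
    _, counts = np.unique(np.asarray(states), return_counts=True)
    probs = np.sort(counts / counts.sum())[::-1]
    return float(probs[:3].sum())
```

Comparing each statistic between real and synthetic windows, and normalizing the gap by the real-data value, yields mechanism-level mismatch scores of the form reported in Table~\ref{tab:typed_diagnostics}.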
\begin{table}[htbp]
\centering
\caption{Type-aware diagnostic summary on the diagnostic slice. ``Mean abs. error'' is reported in the native unit of the corresponding diagnostic statistic; ``Mean rel. error'' normalizes by the real-data value to indicate severity. Lower values indicate better alignment.}
\label{tab:typed_diagnostics}
\begin{tabular}{@{}llcc@{}}
\toprule
\textbf{Variable family} & \textbf{Diagnostic statistic} & \textbf{Mean abs. error} & \textbf{Mean rel. error} \\
\bottomrule
\end{tabular}
\end{table}

This typed view sharpens the story substantially: program-like channels remain the hardest family to reproduce.
\subsection{Ablation study}
\label{sec:benchmark-ablation}
A good ablation does more than show that removing components changes numbers; it should identify which failure mode each component is preventing. We therefore evaluate ten controlled one-seed variants under a shared pipeline and summarize six representative metrics: continuous fidelity (KS), discrete fidelity (JSD), short-horizon dynamics (lag-1), cross-variable coupling, predictive transfer, and downstream anomaly utility. Figure~\ref{fig:benchmark-ablations} visualizes signed changes relative to the full model, and Table~\ref{tab:ablation} gives the underlying values.
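The signed-change convention can be stated compactly; a minimal sketch in which the sign is flipped for higher-is-better metrics such as AUPRC, so that positive always means the ablated variant is worse:

```python
def signed_delta(ablated: float, full: float, higher_is_better: bool = False) -> float:
    """Positive values mean the ablated variant is worse than the full model.

    For loss-like metrics (KS, JSD, lag-1 error, coupling error, RMSE) worse
    means larger, so the delta is ablated - full; for AUPRC worse means
    smaller, so the sign is flipped.
    """
    return (full - ablated) if higher_is_better else (ablated - full)
```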
\begin{figure}[htbp]
\centering
\includegraphics[width=\textwidth]{fig-benchmark-ablations-v1.png}
\caption{Ablation impact relative to the full model. For KS, JSD, lag-1 error, coupling error, and predictive RMSE, positive values mean the ablated model is worse than the full model. For AUPRC, positive values mean the ablated model loses downstream utility. The figure makes clear that different components protect different notions of realism rather than contributing uniformly to every metric.}
\label{fig:benchmark-ablations}
\end{figure}
\begin{table}[htbp]
\centering
\small
\caption{Ablation study on one-seed runs. Lower is better except for anomaly AUPRC.}
\label{tab:ablation}
\begin{tabular}{@{}lcccccc@{}}
\toprule
\textbf{Variant} & \textbf{KS} & \textbf{JSD} & \textbf{Lag-1} & \textbf{Coupling} & \textbf{Pred. RMSE} & \textbf{AUPRC $\uparrow$} \\
\bottomrule
\end{tabular}
\end{table}

\section{Conclusion}

This paper addresses the data scarcity and shareability barriers that limit machine learning research in ICS security.
Our main contributions are: (i) a causal Transformer trend module that provides a stable long-horizon temporal scaffold for continuous channels; (ii) a trend-conditioned residual DDPM that focuses modeling capacity on local stochastic detail and marginal fidelity without destabilizing global structure; (iii) a masked (absorbing) diffusion branch for discrete variables that guarantees in-vocabulary outputs and supports semantics-aware conditioning on continuous context; and (iv) a type-aware decomposition/routing layer that aligns model mechanisms with heterogeneous ICS variable origins (e.g., process inertia, step-and-dwell setpoints, deterministic derived tags), enabling deterministic enforcement where appropriate and improving capacity allocation.
We evaluated the approach on windows derived from the HAI Security Dataset and reported mixed-type, protocol-relevant metrics rather than a single aggregate score. Across seeds, the model achieves stable fidelity with mean KS = 0.3311 $\pm$ 0.0079 on continuous features, mean JSD = 0.0284 $\pm$ 0.0073 on discrete features, and mean absolute lag-1 autocorrelation difference 0.2684 $\pm$ 0.0027, indicating that Mask-DDPM preserves both marginal distributions and short-horizon dynamics while maintaining discrete legality.
Overall, Mask-DDPM provides a reproducible foundation for generating shareable, semantically valid ICS feature sequences suitable for data augmentation, benchmarking, and downstream packet/trace reconstruction workflows. Building on this capability, a natural next step is to move from purely legal synthesis toward controllable scenario construction, including structured attack/violation injection under engineering constraints to support adversarial evaluation and more comprehensive security benchmarks.
% References
\bibliographystyle{unsrtnat}
\bibliography{references}
\end{document}