From b88a9d39dacaebd55c990be57a4e41e75f62e5fa Mon Sep 17 00:00:00 2001 From: manbo Date: Fri, 30 Jan 2026 17:43:12 +0800 Subject: [PATCH 1/4] Add knowledges/draft-incomplete-methodology.md --- knowledges/draft-incomplete-methodology.md | 176 +++++++++++++++++++++ 1 file changed, 176 insertions(+) create mode 100644 knowledges/draft-incomplete-methodology.md diff --git a/knowledges/draft-incomplete-methodology.md b/knowledges/draft-incomplete-methodology.md new file mode 100644 index 0000000..bd5e09f --- /dev/null +++ b/knowledges/draft-incomplete-methodology.md @@ -0,0 +1,176 @@ +## Methodology + +### Problem setting and notation + +We consider multivariate industrial control system (ICS) time series composed of **continuous process variables** (PVs) and **discrete/state variables** (e.g., modes, alarms, categorical tags). Let +[ +X \in \mathbb{R}^{L \times d_c},\qquad Y \in {1,\dots,V}^{L \times d_d}, +] +denote a length-(L) segment with (d_c) continuous channels and (d_d) discrete channels (vocabulary size (V)). Our goal is to learn a generative model that produces realistic synthetic sequences ((\hat{X}, \hat{Y})) matching (i) temporal dynamics, (ii) marginal/conditional distributions, and (iii) discrete semantic validity. + +A core empirical difficulty in ICS is that a single model optimized end-to-end often trades off **distributional fidelity** and **temporal coherence**. We therefore adopt a **decoupled, two-stage** framework (“Mask-DDPM”) that separates (A) trend/dynamics modeling from (B) residual distribution modeling, and further separates (B1) continuous and (B2) discrete generation. + +--- + +### Overview of the proposed Mask-DDPM framework + +Mask-DDPM factorizes generation into: + +1. **Stage 1 (Temporal trend modeling):** learn a deterministic trend estimator (T) for continuous dynamics using a **Transformer** backbone (instead of GRU), producing a smooth, temporally consistent “skeleton”. +2. **Stage 2 (Residual distribution modeling):** generate (i) continuous residuals with a DDPM and (ii) discrete variables with masked (absorbing) diffusion, then reconstruct the final signal. + +Formally, we decompose continuous observations as +[ +R = X - T,\qquad \hat{X} = T + \hat{R}, +] +where (T) is predicted by the Transformer trend model and (\hat{R}) is sampled from a residual diffusion model. Discrete variables (Y) are generated via masked diffusion to guarantee outputs remain in the discrete vocabulary. + +This methodology sits at the intersection of diffusion modeling [2] and attention-based sequence modeling [1], and is motivated by recent diffusion-based approaches to time-series synthesis in general [12] and in industrial/ICS contexts [13], while explicitly addressing mixed discrete–continuous structure using masked diffusion objectives for discrete data [5]. + +--- + +### Data representation and preprocessing + +**Segmentation.** Raw ICS logs are segmented into fixed-length windows of length (L) (e.g., (L=128)) to form training examples. Windows may be sampled with overlap to increase effective data size. + +**Continuous channels.** Continuous variables are normalized channel-wise to stabilize optimization (e.g., z-score normalization). Normalization statistics are computed on the training split and reused at inference. + +**Discrete channels.** Each discrete/state variable is mapped to integer tokens in ({1,\dots,V}). A dedicated **[MASK]** token is added for masked diffusion corruption. Optionally, rare categories can be merged to reduce sparsity. 
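To make the preprocessing contract above concrete, the following is a minimal, illustrative sketch (NumPy-based; the window length, stride, the `MASK_ID = 0` convention, and all function names are assumptions for illustration, not the project's actual code) of windowing, train-split z-score normalization, and discrete tokenization with a reserved `[MASK]` id.

```python
import numpy as np

MASK_ID = 0  # reserved token id for [MASK]; real categories map to 1..V (assumed convention)

def make_windows(series: np.ndarray, length: int = 128, stride: int = 32) -> np.ndarray:
    """Slice a (T, d) log into overlapping windows of shape (N, length, d)."""
    starts = range(0, series.shape[0] - length + 1, stride)
    return np.stack([series[s:s + length] for s in starts])

def fit_zscore(train_cont: np.ndarray, eps: float = 1e-8):
    """Channel-wise mean/std computed on the training split only."""
    flat = train_cont.reshape(-1, train_cont.shape[-1])
    return flat.mean(axis=0), flat.std(axis=0) + eps

def apply_zscore(cont: np.ndarray, mu: np.ndarray, sd: np.ndarray) -> np.ndarray:
    return (cont - mu) / sd

def build_vocab(train_disc_col: np.ndarray) -> dict:
    """Map raw categorical values to integer tokens, reserving MASK_ID for corruption."""
    return {v: i + 1 for i, v in enumerate(sorted(set(train_disc_col.ravel().tolist())))}

def tokenize(disc_col: np.ndarray, vocab: dict) -> np.ndarray:
    return np.vectorize(vocab.__getitem__)(disc_col)
```

Normalization statistics would be fitted once on the training windows and reused unchanged for validation, test, and for inverting generated samples back to engineering units.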
+ +--- + +### Stage 1: Transformer-based temporal trend model + +We model the predictable, low-frequency temporal structure of continuous channels using a causal Transformer [1]. Let (X_{1:t}) denote the prefix up to time (t). The trend model (f_\phi) produces one-step-ahead predictions: +[ +\hat{T}*{t+1} = f*\phi(X_{1:t}),\qquad t=1,\dots,L-1, +] +and is trained with mean squared error (teacher forcing): +[ +\mathcal{L}*{trend} = \frac{1}{(L-1)d_c}\sum*{t=1}^{L-1}\left| \hat{T}*{t+1} - X*{t+1}\right|_2^2. +] +After training, the model is rolled out over a window to obtain (T \in \mathbb{R}^{L \times d_c}). This explicit trend extraction reduces the burden on diffusion to simultaneously learn “where the sequence goes” and “how values are distributed,” a tension frequently observed in diffusion/time-series settings [12,13]. + +--- + +### Stage 2A: Continuous residual modeling with DDPM + +We learn a denoising diffusion probabilistic model (DDPM) over residuals (R) [2]. The forward (noising) process gradually perturbs residuals with Gaussian noise: +[ +q(r_t\mid r_{t-1})=\mathcal{N}!\left(\sqrt{1-\beta_t},r_{t-1},,\beta_t I\right), +] +with (t=1,\dots,T) diffusion steps and a pre-defined schedule ({\beta_t}). This yields the closed-form: +[ +r_t=\sqrt{\bar{\alpha}_t},r_0+\sqrt{1-\bar{\alpha}_t},\epsilon,\qquad \epsilon\sim \mathcal{N}(0,I), +] +where (\alpha_t=1-\beta_t) and (\bar{\alpha}*t=\prod*{s=1}^t \alpha_s). + +A Transformer-based denoiser (g_\theta) parameterizes the reverse process by predicting either the added noise (\epsilon) or the clean residual (r_0): +[ +\hat{\epsilon}=g_\theta(r_t,t)\quad\text{or}\quad \hat{r}*0=g*\theta(r_t,t). +] +We use the standard DDPM objective (two equivalent parameterizations commonly used in practice [2]): +[ +\mathcal{L}_{cont} = +\begin{cases} +\left|\hat{\epsilon}-\epsilon\right|_2^2 & (\epsilon\text{-prediction})[4pt] +\left|\hat{r}_0-r_0\right|_2^2 & (r_0\text{-prediction}) +\end{cases} +] + +**SNR-weighted training (optional).** To mitigate optimization imbalance across timesteps, we optionally apply an SNR-based weighting strategy: +[ +\mathcal{L}_{snr}=\frac{\mathrm{SNR}_t}{\mathrm{SNR}*t+\gamma},\mathcal{L}*{cont}, +] +which is conceptually aligned with Min-SNR-style diffusion reweighting [3]. + +**Residual reconstruction.** At inference, we sample (\hat{R}) by iterative denoising from Gaussian noise, then reconstruct the final continuous output: +[ +\hat{X}=T+\hat{R}. +] + +--- + +### Stage 2B: Discrete variable modeling with masked diffusion + +Discrete ICS channels must remain **semantically valid** (i.e., categorical, not fractional). Instead of continuous diffusion on (Y), we use **masked (absorbing) diffusion** [5], which corrupts sequences by replacing tokens with a special mask symbol and trains the model to recover them. + +**Forward corruption.** Given a schedule (m_t\in[0,1]) increasing with (t), we sample a masked version (y_t) by independently masking positions: +[ +y_t^{(i)} = +\begin{cases} +\texttt{[MASK]} & \text{with prob. } m_t\ +y_0^{(i)} & \text{otherwise} +\end{cases} +] +This “absorbing” corruption is a discrete analogue of diffusion and underpins modern masked diffusion formulations [5]. + +**Reverse model and loss.** A Transformer (h_\psi) outputs a categorical distribution over the vocabulary for each position. We compute cross-entropy **only on masked positions** (\mathcal{M}): +[ +\mathcal{L}*{disc}=\frac{1}{|\mathcal{M}|}\sum*{(i,t)\in \mathcal{M}} CE(\hat{p}*{i,t},,y*{i,t}). 
+] +This guarantees decoded samples belong to the discrete vocabulary by construction. Masked diffusion can be viewed as a simplified, scalable alternative within the broader family of discrete diffusion models [4,5]. + +--- + +### Joint objective and training protocol + +We train Stage 1 and Stage 2 sequentially: + +1. **Train trend Transformer** (f_\phi) on continuous channels to obtain (T). +2. **Compute residuals** (R=X-T). +3. **Train diffusion models** on (R) (continuous DDPM) and (Y) (masked diffusion), using a weighted combination: + [ + \mathcal{L}=\lambda \mathcal{L}*{cont}+(1-\lambda)\mathcal{L}*{disc}. + ] + +In our implementation, we typically use (L=128) and a diffusion horizon (T) up to 600 steps (trade-off between sample quality and compute). Transformer backbones increase training cost due to (O(L^2)) attention, but provide a principled mechanism for long-range temporal dependency modeling that is especially relevant in ICS settings [1,13]. + +--- + +### Sampling procedure (end-to-end generation) + +Given an optional seed prefix (or a sampled initial context): + +1. **Trend rollout:** use (f_\phi) to produce (T) over (L) steps. +2. **Continuous residual sampling:** sample (\hat{R}) by reverse DDPM from noise, producing (\hat{X}=T+\hat{R}). +3. **Discrete sampling:** initialize (\hat{Y}) as fully masked and iteratively unmask/denoise using the masked diffusion reverse model until all tokens are assigned [5]. +4. **Return** ((\hat{X},\hat{Y})) as a synthetic ICS window. + +--- + +### Type-aware decomposition (diagnostic-guided extensibility) + +In practice, a small subset of channels can dominate failure modes (e.g., program-driven setpoints, actuator saturation/stiction, derived deterministic tags). We incorporate a **type-aware** diagnostic partitioning that groups variables by generative mechanism and enables modular replacements (e.g., conditional generation for program signals, deterministic reconstruction for derived tags). This design is compatible with emerging conditional diffusion paradigms for industrial time series [11] and complements prior ICS diffusion augmentation work that primarily targets continuous MTS fidelity [13]. + +> **Note:** Detailed benchmark metrics (e.g., KS/JSD/Lag-1) and evaluation protocol belong in the **Benchmark / Experiments** section, not in Methodology, and are therefore omitted here as requested. + +--- + +## References + +[1] Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). *Attention Is All You Need.* NeurIPS. ([arXiv][1]) +[2] Ho, J., Jain, A., & Abbeel, P. (2020). *Denoising Diffusion Probabilistic Models.* NeurIPS. ([SSRN][2]) +[3] Hang, T., Gu, S., Li, C., et al. (2023). *Efficient Diffusion Training via Min-SNR Weighting Strategy.* ICCV. ([arXiv][3]) +[4] Austin, J., Johnson, D. D., Ho, J., et al. (2021). *Structured Denoising Diffusion Models in Discrete State-Spaces (D3PM).* NeurIPS. +[5] Shi, J., Han, K., Wang, Z., Doucet, A., & Titsias, M. K. (2024). *Simplified and Generalized Masked Diffusion for Discrete Data.* NeurIPS (poster); arXiv:2406.04329. ([arXiv][4]) +[6] Kotelnikov, A., Baranchuk, D., Rubachev, I., & Babenko, A. (2023). *TabDDPM: Modelling Tabular Data with Diffusion Models.* ICML. +[7] Shi, J., Xu, M., Hua, H., Zhang, H., Ermon, S., & Leskovec, J. (2024). *TabDiff: a Mixed-type Diffusion Model for Tabular Data Generation.* arXiv:2410.20626 / ICLR. ([arXiv][5]) +[8] Kollovieh, M., Ansari, A. F., Bohlke-Schneider, M., et al. (2023). 
*Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting (TSDiff).* NeurIPS. ([arXiv][6]) +[9] Ren, L., Wang, H., & Laili, Y. (2024). *Diff-MTS: Temporal-Augmented Conditional Diffusion-based AIGC for Industrial Time Series Toward the Large Model Era.* arXiv:2407.11501. ([arXiv][7]) +[10] Sikder, M. F., et al. (2023). *TransFusion: Generating Long, High Fidelity Time Series with Diffusion and Transformers.* arXiv:2307.12667. ([arXiv][8]) +[11] Su, C., Cai, Z., Tian, Y., et al. (2025). *Diffusion Models for Time Series Forecasting: A Survey.* arXiv:2507.14507. ([arXiv][9]) +[12] Sha, Y., Yuan, Y., Wu, Y., & Zhao, H. (2026). *DDPM Fusing Mamba and Adaptive Attention: An Augmentation Method for Industrial Control Systems Anomaly Data (ETUA-DDPM).* SSRN. ([SSRN][10]) +[13] Yuan, Y., et al. (2025). *CTU-DDPM: Generating Industrial Control System Time-Series Data with a CNN-Transformer Hybrid Diffusion Model.* ACM (conference paper metadata). ([ACM Digital Library][11]) + +[1]: https://arxiv.org/html/2307.12667v2?utm_source=chatgpt.com "Generating Long, High Fidelity Time Series using Diffusion ..." +[2]: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5134458&utm_source=chatgpt.com "Diffusion Model Based Synthetic Data Generation for ..." +[3]: https://arxiv.org/abs/2303.09556 "https://arxiv.org/abs/2303.09556" +[4]: https://arxiv.org/pdf/2406.04329 "https://arxiv.org/pdf/2406.04329" +[5]: https://arxiv.org/abs/2410.20626 "https://arxiv.org/abs/2410.20626" +[6]: https://arxiv.org/abs/2307.11494 "https://arxiv.org/abs/2307.11494" +[7]: https://arxiv.org/abs/2407.11501 "https://arxiv.org/abs/2407.11501" +[8]: https://arxiv.org/abs/2307.12667 "https://arxiv.org/abs/2307.12667" +[9]: https://arxiv.org/abs/2507.14507 "https://arxiv.org/abs/2507.14507" +[10]: https://papers.ssrn.com/sol3/Delivery.cfm/254b62d8-9d8c-4b87-bb57-6b07da15f831-MECA.pdf?abstractid=6055903&mirid=1&type=2 "https://papers.ssrn.com/sol3/Delivery.cfm/254b62d8-9d8c-4b87-bb57-6b07da15f831-MECA.pdf?abstractid=6055903&mirid=1&type=2" +[11]: https://dl.acm.org/doi/pdf/10.1145/3776759.3776845 "https://dl.acm.org/doi/pdf/10.1145/3776759.3776845" From 735ca8ab511e4040a4b6c99c002e312948395763 Mon Sep 17 00:00:00 2001 From: manbo Date: Fri, 30 Jan 2026 21:05:56 +0800 Subject: [PATCH 2/4] Update knowledges/draft-incomplete-methodology.md --- knowledges/draft-incomplete-methodology.md | 312 ++++++++++++--------- 1 file changed, 187 insertions(+), 125 deletions(-) diff --git a/knowledges/draft-incomplete-methodology.md b/knowledges/draft-incomplete-methodology.md index bd5e09f..d2b3cc1 100644 --- a/knowledges/draft-incomplete-methodology.md +++ b/knowledges/draft-incomplete-methodology.md @@ -1,176 +1,238 @@ ## Methodology -### Problem setting and notation +### 1. Problem setting and design motivation -We consider multivariate industrial control system (ICS) time series composed of **continuous process variables** (PVs) and **discrete/state variables** (e.g., modes, alarms, categorical tags). Let -[ -X \in \mathbb{R}^{L \times d_c},\qquad Y \in {1,\dots,V}^{L \times d_d}, -] -denote a length-(L) segment with (d_c) continuous channels and (d_d) discrete channels (vocabulary size (V)). Our goal is to learn a generative model that produces realistic synthetic sequences ((\hat{X}, \hat{Y})) matching (i) temporal dynamics, (ii) marginal/conditional distributions, and (iii) discrete semantic validity. 
+Industrial control system (ICS) telemetry is a **mixed-type** sequential object: it couples **continuous** process dynamics (e.g., sensor values and physical responses) with **discrete** supervisory states (e.g., modes, alarms, interlocks). We model each training instance as a fixed-length window of length (L), consisting of (i) continuous channels (X\in\mathbb{R}^{L\times d_c}) and (ii) discrete channels (Y={y^{(j)}*{1:L}}*{j=1}^{d_d}), where each discrete variable (y^{(j)}_t\in\mathcal{V}_j) belongs to a finite vocabulary. -A core empirical difficulty in ICS is that a single model optimized end-to-end often trades off **distributional fidelity** and **temporal coherence**. We therefore adopt a **decoupled, two-stage** framework (“Mask-DDPM”) that separates (A) trend/dynamics modeling from (B) residual distribution modeling, and further separates (B1) continuous and (B2) discrete generation. +Our methodological objective is to learn a generator that produces synthetic ((\hat{X},\hat{Y})) that are simultaneously: (a) temporally coherent, (b) distributionally faithful, and (c) **semantically valid** for discrete channels (i.e., (\hat{y}^{(j)}_t\in\mathcal{V}_j) by construction). A key empirical observation motivating our design is that, in ICS, temporal realism and distributional realism often “pull” the model in different directions when trained monolithically; this is amplified by the heterogeneous mechanisms that produce different variables (program steps, controller laws, actuator saturation, physical inertia, and deterministic derived tags). We therefore structure the generator to *separate concerns* and then *specialize*. --- -### Overview of the proposed Mask-DDPM framework +### 2. Overview of Mask-DDPM -Mask-DDPM factorizes generation into: +We propose **Mask-DDPM**, organized in the following order: -1. **Stage 1 (Temporal trend modeling):** learn a deterministic trend estimator (T) for continuous dynamics using a **Transformer** backbone (instead of GRU), producing a smooth, temporally consistent “skeleton”. -2. **Stage 2 (Residual distribution modeling):** generate (i) continuous residuals with a DDPM and (ii) discrete variables with masked (absorbing) diffusion, then reconstruct the final signal. +1. **Transformer trend module**: learns the dominant temporal backbone of continuous dynamics via attention-based sequence modeling [1]. +2. **Residual DDPM for continuous variables**: models distributional detail as stochastic residual structure conditioned on the learned trend [2]. +3. **Masked diffusion for discrete variables**: generates discrete ICS states with an absorbing/masking corruption process and categorical reconstruction [3,4]. +4. **Type-aware decomposition**: a performance-oriented refinement layer that routes variables to the most appropriate modeling mechanism and enforces deterministic constraints where warranted. -Formally, we decompose continuous observations as -[ -R = X - T,\qquad \hat{X} = T + \hat{R}, -] -where (T) is predicted by the Transformer trend model and (\hat{R}) is sampled from a residual diffusion model. Discrete variables (Y) are generated via masked diffusion to guarantee outputs remain in the discrete vocabulary. 
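Anticipating the Stage-1 trend module described below, the following PyTorch-style sketch shows one plausible realization of the trend extractor and the residual bookkeeping R = X - T. The architecture, hyperparameters, and names (`TrendTransformer`, `trend_training_step`, `residual_targets`) are illustrative assumptions rather than the draft's reference implementation.

```python
import torch
import torch.nn as nn

class TrendTransformer(nn.Module):
    """Causal Transformer f_phi: position t predicts the trend value at t+1."""

    def __init__(self, d_c: int, d_model: int = 128, nhead: int = 4,
                 num_layers: int = 3, max_len: int = 1024):
        super().__init__()
        self.inp = nn.Linear(d_c, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.out = nn.Linear(d_model, d_c)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))  # learned positional embedding

    def forward(self, x: torch.Tensor) -> torch.Tensor:            # x: (B, L, d_c)
        L = x.shape[1]
        h = self.inp(x) + self.pos[:, :L]
        causal = nn.Transformer.generate_square_subsequent_mask(L).to(x.device)
        return self.out(self.encoder(h, mask=causal))               # (B, L, d_c)

def trend_training_step(model, x, opt):
    """Teacher-forced next-step MSE, matching the trend objective in the text."""
    pred = model(x[:, :-1])                        # predictions for positions 2..L
    loss = nn.functional.mse_loss(pred, x[:, 1:])
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

@torch.no_grad()
def residual_targets(model, x):
    """R = X - T: residuals relative to the (teacher-forced) trend estimate."""
    trend = torch.cat([x[:, :1], model(x[:, :-1])], dim=1)
    return x - trend
```

The final continuous sample is then reassembled as X_hat = T + R_hat once the residual model described next has been sampled.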
- -This methodology sits at the intersection of diffusion modeling [2] and attention-based sequence modeling [1], and is motivated by recent diffusion-based approaches to time-series synthesis in general [12] and in industrial/ICS contexts [13], while explicitly addressing mixed discrete–continuous structure using masked diffusion objectives for discrete data [5]. +This ordering is intentional: the trend module fixes the *macro-temporal scaffold*; diffusion then focuses on *micro-structure and marginal fidelity*; masked diffusion guarantees *discrete legality*; and type-aware routing resolves the remaining mismatch caused by heterogeneous variable-generating mechanisms. --- -### Data representation and preprocessing +## 3. Transformer trend module for continuous dynamics -**Segmentation.** Raw ICS logs are segmented into fixed-length windows of length (L) (e.g., (L=128)) to form training examples. Windows may be sampled with overlap to increase effective data size. +### 3.1 Trend–residual decomposition -**Continuous channels.** Continuous variables are normalized channel-wise to stabilize optimization (e.g., z-score normalization). Normalization statistics are computed on the training split and reused at inference. +For continuous channels (X), we posit an additive decomposition +[ +X = S + R, +] +where (S\in\mathbb{R}^{L\times d_c}) is a smooth **trend** capturing predictable temporal evolution, and (R\in\mathbb{R}^{L\times d_c}) is a **residual** capturing distributional detail (e.g., bursts, heavy tails, local fluctuations) that is difficult to represent robustly with a purely regression-based temporal objective. -**Discrete channels.** Each discrete/state variable is mapped to integer tokens in ({1,\dots,V}). A dedicated **[MASK]** token is added for masked diffusion corruption. Optionally, rare categories can be merged to reduce sparsity. +### 3.2 Causal Transformer parameterization + +We parameterize the trend (S) using a **causal Transformer** (f_\phi) [1]. Concretely, with teacher forcing we train (f_\phi) to predict the next-step trend from past observations: +[ +\hat{S}*{t+1} = f*\phi(X_{1:t}), \qquad t=1,\dots,L-1, +] +with the mean-squared error objective +[ +\mathcal{L}*{\text{trend}}(\phi)=\frac{1}{(L-1)d_c}\sum*{t=1}^{L-1}\left| \hat{S}*{t+1} - X*{t+1}\right|_2^2. +] +Self-attention is particularly suitable here because it provides a direct mechanism for learning cross-channel interactions and long-range temporal dependencies without recurrence, which is important when control actions influence downstream variables with nontrivial delays [1]. + +At inference, we roll out the Transformer autoregressively to obtain (\hat{S}), and then define the residual target for diffusion as (R = X - \hat{S}). + +#### Uniqueness note (trend module) + +*While Transformers are now standard in sequence modeling, our use of a Transformer explicitly as a **trend extractor**—whose sole role is to provide conditioning for subsequent diffusion rather than to serve as the primary generator—creates a clean separation between temporal structure and distributional refinement in the ICS setting.* --- -### Stage 1: Transformer-based temporal trend model +## 4. DDPM for continuous residual generation -We model the predictable, low-frequency temporal structure of continuous channels using a causal Transformer [1]. Let (X_{1:t}) denote the prefix up to time (t). 
The trend model (f_\phi) produces one-step-ahead predictions: +We parameterize the trend (S) using a **causal Transformer** (f_\phi) [1]. Concretely, with teacher forcing we train (f_\phi) to predict the next-step trend from past observations:
[
-\hat{T}_{t+1} = f_\phi(X_{1:t}),\qquad t=1,\dots,L-1,
+\hat{S}_{t+1} = f_\phi(X_{1:t}), \qquad t=1,\dots,L-1,
]
-and is trained with mean squared error (teacher forcing):
+with the mean-squared error objective
[
-\mathcal{L}_{trend} = \frac{1}{(L-1)d_c}\sum_{t=1}^{L-1}\left\| \hat{T}_{t+1} - X_{t+1}\right\|_2^2.
+\mathcal{L}_{\text{trend}}(\phi)=\frac{1}{(L-1)d_c}\sum_{t=1}^{L-1}\left\| \hat{S}_{t+1} - X_{t+1}\right\|_2^2.
]
-After training, the model is rolled out over a window to obtain (T \in \mathbb{R}^{L \times d_c}). This explicit trend extraction reduces the burden on diffusion to simultaneously learn “where the sequence goes” and “how values are distributed,” a tension frequently observed in diffusion/time-series settings [12,13].
+Self-attention is particularly suitable here because it provides a direct mechanism for learning cross-channel interactions and long-range temporal dependencies without recurrence, which is important when control actions influence downstream variables with nontrivial delays [1].
+
+At inference, we roll out the Transformer autoregressively to obtain (\hat{S}), and then define the residual target for diffusion as (R = X - \hat{S}).
+
+#### Uniqueness note (trend module)
+
+*While Transformers are now standard in sequence modeling, our use of a Transformer explicitly as a **trend extractor**—whose sole role is to provide conditioning for subsequent diffusion rather than to serve as the primary generator—creates a clean separation between temporal structure and distributional refinement in the ICS setting.*

---

-### Stage 2A: Continuous residual modeling with DDPM
+## 4. DDPM for continuous residual generation

-We learn a denoising diffusion probabilistic model (DDPM) over residuals (R) [2].
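The residual diffusion objective admits a compact training-step sketch, given below. The `denoiser(r_k, k, cond)` interface, the linear beta schedule, and the optional SNR-style reweighting are illustrative assumptions; the `cond` argument corresponds to the trend-conditioned variant, and can be left as `None` for an unconditional residual model.

```python
import torch
import torch.nn.functional as F

def linear_beta_schedule(K: int = 600, beta_1: float = 1e-4, beta_K: float = 2e-2) -> torch.Tensor:
    """Cumulative products alpha_bar_k for a linear beta schedule (illustrative choice)."""
    betas = torch.linspace(beta_1, beta_K, K)
    return torch.cumprod(1.0 - betas, dim=0)

def ddpm_residual_step(denoiser, r0, alpha_bars, opt, cond=None, snr_gamma=None):
    """One epsilon-prediction training step on residuals R (B, L, d_c)."""
    B = r0.shape[0]
    k = torch.randint(0, alpha_bars.numel(), (B,), device=r0.device)
    a_bar = alpha_bars.to(r0.device)[k].view(B, 1, 1)             # broadcast over (L, d_c)
    eps = torch.randn_like(r0)
    r_k = a_bar.sqrt() * r0 + (1.0 - a_bar).sqrt() * eps          # closed-form forward corruption
    eps_hat = denoiser(r_k, k, cond)                              # assumed interface
    per_example = F.mse_loss(eps_hat, eps, reduction="none").mean(dim=(1, 2))
    if snr_gamma is not None:                                     # optional SNR-based reweighting
        snr = (a_bar / (1.0 - a_bar)).view(B)
        per_example = per_example * snr / (snr + snr_gamma)
    loss = per_example.mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```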
The forward (noising) process gradually perturbs residuals with Gaussian noise: -[ -q(r_t\mid r_{t-1})=\mathcal{N}!\left(\sqrt{1-\beta_t},r_{t-1},,\beta_t I\right), -] -with (t=1,\dots,T) diffusion steps and a pre-defined schedule ({\beta_t}). This yields the closed-form: -[ -r_t=\sqrt{\bar{\alpha}_t},r_0+\sqrt{1-\bar{\alpha}_t},\epsilon,\qquad \epsilon\sim \mathcal{N}(0,I), -] -where (\alpha_t=1-\beta_t) and (\bar{\alpha}*t=\prod*{s=1}^t \alpha_s). +### 5.1 Discrete corruption via absorbing masks -A Transformer-based denoiser (g_\theta) parameterizes the reverse process by predicting either the added noise (\epsilon) or the clean residual (r_0): +Discrete ICS variables must remain categorical, making Gaussian diffusion inappropriate. We therefore use **masked (absorbing) diffusion** for discrete channels [3,4], in which corruption replaces tokens with a special (\texttt{[MASK]}) symbol according to a schedule. + +For a discrete sequence (y^{(j)}_{1:L}), define a corruption level (m_k\in[0,1]) that increases with (k). The corrupted sequence at step (k) is formed by independent masking: [ -\hat{\epsilon}=g_\theta(r_t,t)\quad\text{or}\quad \hat{r}*0=g*\theta(r_t,t). -] -We use the standard DDPM objective (two equivalent parameterizations commonly used in practice [2]): -[ -\mathcal{L}_{cont} = +y^{(j,k)}_t = \begin{cases} -\left|\hat{\epsilon}-\epsilon\right|_2^2 & (\epsilon\text{-prediction})[4pt] -\left|\hat{r}_0-r_0\right|_2^2 & (r_0\text{-prediction}) +\texttt{[MASK]}, & \text{with probability } m_k,\ +y^{(j)}_t, & \text{otherwise}. \end{cases} ] +This “absorbing” mechanism yields a diffusion-like process over discrete states while preserving a clear reconstruction target at every step [3,4]. -**SNR-weighted training (optional).** To mitigate optimization imbalance across timesteps, we optionally apply an SNR-based weighting strategy: +### 5.2 Categorical denoising objective + +We parameterize the reverse model with a Transformer (h_\psi) that outputs a categorical distribution over (\mathcal{V}*j) for each masked position. Let (\mathcal{M}) be the set of masked indices. We train with cross-entropy over masked positions: [ -\mathcal{L}_{snr}=\frac{\mathrm{SNR}_t}{\mathrm{SNR}*t+\gamma},\mathcal{L}*{cont}, -] -which is conceptually aligned with Min-SNR-style diffusion reweighting [3]. +\mathcal{L}*{\text{disc}}(\psi) +=============================== -**Residual reconstruction.** At inference, we sample (\hat{R}) by iterative denoising from Gaussian noise, then reconstruct the final continuous output: +\mathbb{E}*{k} +\left[ +\frac{1}{|\mathcal{M}|}\sum*{(j,t)\in\mathcal{M}} +\mathrm{CE}!\left( +h_\psi!\left(y^{(k)},k,\hat{S}\right)_{j,t},\ y^{(j)}_t +\right) +\right]. +] +Conditioning on (\hat{S}) (and, optionally, on (\hat{X})) allows discrete state generation to reflect the continuous context, which is often essential in ICS (e.g., mode transitions coupled to process regimes). + +### 5.3 Sampling + +At inference, we initialize (y^{(K)}) as fully masked and iteratively denoise/unmask from (k=K) to (1), sampling valid tokens from the predicted categorical distributions. Because outputs are always drawn from (\mathcal{V}_j), semantic validity is satisfied by construction. 
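One possible realization of the iterative unmask/denoise loop just described is sketched below. The `model(y, k, cond)` logits interface, the linear unmasking schedule, and the confidence-based commit rule are assumptions made for illustration; other unmasking orders are equally compatible with the formulation, and every returned entry is a legal vocabulary token by construction.

```python
import torch

@torch.no_grad()
def sample_discrete(model, B, L, d_d, K, cond=None, mask_id=0):
    """Reverse absorbing (masked) diffusion: start fully masked, commit tokens step by step."""
    y = torch.full((B, L, d_d), mask_id, dtype=torch.long)
    total = B * L * d_d
    for k in range(K, 0, -1):
        logits = model(y, torch.full((B,), k, dtype=torch.long), cond)  # (B, L, d_d, V)
        probs = torch.softmax(logits, dim=-1)
        probs[..., mask_id] = 0.0                        # never emit the mask token itself
        conf, cand = probs.max(dim=-1)
        still_masked = y.eq(mask_id)
        n_unmask = int(still_masked.sum().item() - (k - 1) / K * total)
        if n_unmask <= 0:
            continue
        conf = conf.masked_fill(~still_masked, -1.0)     # only commit currently masked slots
        idx = conf.flatten().topk(n_unmask).indices
        y.view(-1)[idx] = cand.flatten()[idx]
    return y
```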
+ +#### Uniqueness note (discrete branch) + +*To our knowledge, diffusion-based ICS synthesis has predominantly emphasized continuous multivariate generation; in contrast, we introduce masked diffusion as a **first-class discrete legality mechanism** within the same pipeline, rather than treating discrete tags as continuous surrogates or post-hoc thresholds.* + +--- + +## 6. Type-aware decomposition as a performance refinement layer + +### 6.1 Motivation: mechanistic heterogeneity in ICS variables + +Even with a strong continuous generator and a discrete legality mechanism, ICS data pose an additional challenge: variables arise from **qualitatively different generative mechanisms**. For example, program-driven setpoints are often piecewise constant with abrupt changes; actuators exhibit saturation and dwell; some channels are deterministic functions of others (derived tags). Modeling all channels identically can cause persistent error concentration and degrade overall quality. + +We therefore introduce a **type-aware decomposition** that operates on top of the base pipeline to improve fidelity, stability, and interpretability. + +### 6.2 Typing function and routing + +Let (\tau(i)\in{1,\dots,6}) assign each variable (i) to a type. This induces a partition of variables into index sets ({\mathcal{I}*k}*{k=1}^6). We then define the generator as a composition of type-specific operators: [ -\hat{X}=T+\hat{R}. +(\hat{X},\hat{Y}) = \mathcal{A}\Big(\hat{S},\hat{R},\hat{Y}_{\text{mask}},{\mathcal{G}*k}*{k=1}^6;\tau\Big), ] +where (\mathcal{A}) assembles a complete sample and (\mathcal{G}_k) encodes the appropriate modeling choice per type. + +### 6.3 Six-type schema and modeling commitments + +We operationalize six types that map cleanly onto common ICS semantics: + +* **Type 1 (program-driven / setpoint-like):** exogenous drivers with step changes and long dwell. These variables are treated as conditioning signals or modeled with a dedicated change-point-aware generator rather than forcing them into residual diffusion. +* **Type 2 (controller outputs):** variables that respond to setpoints and process feedback; we treat them as conditional on Type 1 and continuous context, and allow separate specialization if they dominate error. +* **Type 3 (actuator states/positions):** bounded, often quantized, with saturation and dwell; we route them to discrete masked diffusion when naturally categorical/quantized, or to specialized dwell-aware modeling when continuous but state-persistent. +* **Type 4 (process variables):** inertia-dominated continuous dynamics; these are the primary beneficiaries of the **Transformer trend + residual DDPM** pipeline. +* **Type 5 (derived/deterministic variables):** algebraic or rule-based functions of other variables; we enforce deterministic reconstruction (\hat{x}^{(i)} = g_i(\hat{X},\hat{Y})) rather than learning a stochastic generator, improving logical consistency and sample efficiency. +* **Type 6 (auxiliary/low-impact variables):** weakly coupled or sparse signals; we allow simplified modeling (e.g., calibrated marginals or lightweight temporal models) to avoid allocating diffusion capacity where it is not warranted. + +### 6.4 Training-time and inference-time integration + +Type-aware decomposition improves performance through three concrete mechanisms: + +1. 
**Capacity allocation:** by focusing diffusion on Type 4 (and selected Type 2/3), we reduce the tendency for a few mechanistically distinct variables to dominate gradients and distort the learned distribution elsewhere. +2. **Constraint enforcement:** Type 5 variables are computed deterministically, preventing logically inconsistent samples that a purely learned generator may produce. +3. **Mechanism alignment:** Type 1/3 variables receive inductive biases consistent with step-like or dwell-like behavior, which diffusion-trained smooth denoisers can otherwise over-regularize. + +In practice, (\tau) can be initialized from domain semantics (tag metadata and value domains) and refined via an error-attribution loop described in the Benchmark section. Importantly, this refinement does not alter the core generative backbone; it *re-routes* difficult variables to better-suited mechanisms while preserving end-to-end coherence. + +#### Uniqueness note (type-aware layer) + +*The type-aware decomposition is not merely a diagnostic taxonomy; it is a modeling layer that **changes the generator’s factorization**—deciding which variables are best treated as conditional drivers, which require dwell-aware behavior, and which should be deterministic—thereby tailoring the diffusion pipeline to the structural heterogeneity characteristic of ICS telemetry.* --- -### Stage 2B: Discrete variable modeling with masked diffusion +## 7. Joint optimization and end-to-end sampling -Discrete ICS channels must remain **semantically valid** (i.e., categorical, not fractional). Instead of continuous diffusion on (Y), we use **masked (absorbing) diffusion** [5], which corrupts sequences by replacing tokens with a special mask symbol and trains the model to recover them. +We train the model in a staged manner consistent with the factorization: -**Forward corruption.** Given a schedule (m_t\in[0,1]) increasing with (t), we sample a masked version (y_t) by independently masking positions: +1. Train the trend Transformer (f_\phi) to obtain (S). +2. Compute residual targets (R=X-S) for Type 4 (and any routed continuous types). +3. Train the residual DDPM (p_\theta(R\mid S)) and the masked diffusion model (p_\psi(Y\mid \text{masked}(Y),S)). +4. Apply type-aware routing and deterministic reconstruction rules during sampling. + +A simple combined objective is [ -y_t^{(i)} = -\begin{cases} -\texttt{[MASK]} & \text{with prob. } m_t\ -y_0^{(i)} & \text{otherwise} -\end{cases} +\mathcal{L} = \lambda,\mathcal{L}*{\text{cont}} + (1-\lambda),\mathcal{L}*{\text{disc}}, ] -This “absorbing” corruption is a discrete analogue of diffusion and underpins modern masked diffusion formulations [5]. +with (\lambda\in[0,1]) controlling the balance between continuous and discrete learning. Type-aware routing determines which channels contribute to which loss and which are excluded in favor of deterministic reconstruction. -**Reverse model and loss.** A Transformer (h_\psi) outputs a categorical distribution over the vocabulary for each position. We compute cross-entropy **only on masked positions** (\mathcal{M}): -[ -\mathcal{L}*{disc}=\frac{1}{|\mathcal{M}|}\sum*{(i,t)\in \mathcal{M}} CE(\hat{p}*{i,t},,y*{i,t}). -] -This guarantees decoded samples belong to the discrete vocabulary by construction. Masked diffusion can be viewed as a simplified, scalable alternative within the broader family of discrete diffusion models [4,5]. 
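For the discrete branch's training objective (cross-entropy restricted to masked positions), a minimal sketch is given below. The shared vocabulary across discrete channels, the linear masking schedule m_k = k/K, and the `model(y_k, k, cond)` signature are simplifying assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # assumed reserved mask token id, consistent with the preprocessing sketch

def masked_diffusion_step(model, y0, K, opt, cond=None):
    """Absorbing corruption plus cross-entropy computed on masked positions only."""
    B = y0.shape[0]
    k = torch.randint(1, K + 1, (B,), device=y0.device)
    m_k = k.float() / K                                           # linear masking schedule (illustrative)
    drop = torch.rand(y0.shape, device=y0.device) < m_k.view(-1, 1, 1)
    y_k = torch.where(drop, torch.full_like(y0, MASK_ID), y0)     # absorbing corruption
    logits = model(y_k, k, cond)                                  # (B, L, d_d, V), assumed interface
    ce = F.cross_entropy(logits.flatten(0, 2), y0.flatten(), reduction="none")
    loss = ce[drop.flatten()].mean()                              # masked positions only
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```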
+At inference time, we generate in the same order: (i) trend (\hat{S}), (ii) residual (\hat{R}) via DDPM, (iii) discrete (\hat{Y}) via masked diffusion, and (iv) type-aware assembly and deterministic reconstruction. --- -### Joint objective and training protocol +# References -We train Stage 1 and Stage 2 sequentially: +[1] Vaswani, A., Shazeer, N., Parmar, N., et al. *Attention Is All You Need.* NeurIPS, 2017. ([arXiv][1]) +[2] Ho, J., Jain, A., Abbeel, P. *Denoising Diffusion Probabilistic Models.* NeurIPS, 2020. ([arXiv][2]) +[3] Austin, J., Johnson, D. D., Ho, J., Tarlow, D., van den Berg, R. *Structured Denoising Diffusion Models in Discrete State-Spaces (D3PM).* NeurIPS, 2021. ([arXiv][3]) +[4] Shi, J., Han, K., Wang, Z., Doucet, A., Titsias, M. K. *Simplified and Generalized Masked Diffusion for Discrete Data.* arXiv:2406.04329, 2024. ([arXiv][4]) +[5] Hang, T., Gu, S., Li, C., et al. *Efficient Diffusion Training via Min-SNR Weighting Strategy.* ICCV, 2023. ([arXiv][5]) +[6] Kollovieh, M., Ansari, A. F., Bohlke-Schneider, M., et al. *Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting (TSDiff).* arXiv:2307.11494, 2023. ([arXiv][6]) +[7] Su, C., Cai, Z., Tian, Y., et al. *Diffusion Models for Time Series Forecasting: A Survey.* arXiv:2507.14507, 2025. ([arXiv][7]) -1. **Train trend Transformer** (f_\phi) on continuous channels to obtain (T). -2. **Compute residuals** (R=X-T). -3. **Train diffusion models** on (R) (continuous DDPM) and (Y) (masked diffusion), using a weighted combination: - [ - \mathcal{L}=\lambda \mathcal{L}*{cont}+(1-\lambda)\mathcal{L}*{disc}. - ] - -In our implementation, we typically use (L=128) and a diffusion horizon (T) up to 600 steps (trade-off between sample quality and compute). Transformer backbones increase training cost due to (O(L^2)) attention, but provide a principled mechanism for long-range temporal dependency modeling that is especially relevant in ICS settings [1,13]. - ---- - -### Sampling procedure (end-to-end generation) - -Given an optional seed prefix (or a sampled initial context): - -1. **Trend rollout:** use (f_\phi) to produce (T) over (L) steps. -2. **Continuous residual sampling:** sample (\hat{R}) by reverse DDPM from noise, producing (\hat{X}=T+\hat{R}). -3. **Discrete sampling:** initialize (\hat{Y}) as fully masked and iteratively unmask/denoise using the masked diffusion reverse model until all tokens are assigned [5]. -4. **Return** ((\hat{X},\hat{Y})) as a synthetic ICS window. - ---- - -### Type-aware decomposition (diagnostic-guided extensibility) - -In practice, a small subset of channels can dominate failure modes (e.g., program-driven setpoints, actuator saturation/stiction, derived deterministic tags). We incorporate a **type-aware** diagnostic partitioning that groups variables by generative mechanism and enables modular replacements (e.g., conditional generation for program signals, deterministic reconstruction for derived tags). This design is compatible with emerging conditional diffusion paradigms for industrial time series [11] and complements prior ICS diffusion augmentation work that primarily targets continuous MTS fidelity [13]. - -> **Note:** Detailed benchmark metrics (e.g., KS/JSD/Lag-1) and evaluation protocol belong in the **Benchmark / Experiments** section, not in Methodology, and are therefore omitted here as requested. - ---- - -## References - -[1] Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). *Attention Is All You Need.* NeurIPS. 
([arXiv][1]) -[2] Ho, J., Jain, A., & Abbeel, P. (2020). *Denoising Diffusion Probabilistic Models.* NeurIPS. ([SSRN][2]) -[3] Hang, T., Gu, S., Li, C., et al. (2023). *Efficient Diffusion Training via Min-SNR Weighting Strategy.* ICCV. ([arXiv][3]) -[4] Austin, J., Johnson, D. D., Ho, J., et al. (2021). *Structured Denoising Diffusion Models in Discrete State-Spaces (D3PM).* NeurIPS. -[5] Shi, J., Han, K., Wang, Z., Doucet, A., & Titsias, M. K. (2024). *Simplified and Generalized Masked Diffusion for Discrete Data.* NeurIPS (poster); arXiv:2406.04329. ([arXiv][4]) -[6] Kotelnikov, A., Baranchuk, D., Rubachev, I., & Babenko, A. (2023). *TabDDPM: Modelling Tabular Data with Diffusion Models.* ICML. -[7] Shi, J., Xu, M., Hua, H., Zhang, H., Ermon, S., & Leskovec, J. (2024). *TabDiff: a Mixed-type Diffusion Model for Tabular Data Generation.* arXiv:2410.20626 / ICLR. ([arXiv][5]) -[8] Kollovieh, M., Ansari, A. F., Bohlke-Schneider, M., et al. (2023). *Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting (TSDiff).* NeurIPS. ([arXiv][6]) -[9] Ren, L., Wang, H., & Laili, Y. (2024). *Diff-MTS: Temporal-Augmented Conditional Diffusion-based AIGC for Industrial Time Series Toward the Large Model Era.* arXiv:2407.11501. ([arXiv][7]) -[10] Sikder, M. F., et al. (2023). *TransFusion: Generating Long, High Fidelity Time Series with Diffusion and Transformers.* arXiv:2307.12667. ([arXiv][8]) -[11] Su, C., Cai, Z., Tian, Y., et al. (2025). *Diffusion Models for Time Series Forecasting: A Survey.* arXiv:2507.14507. ([arXiv][9]) -[12] Sha, Y., Yuan, Y., Wu, Y., & Zhao, H. (2026). *DDPM Fusing Mamba and Adaptive Attention: An Augmentation Method for Industrial Control Systems Anomaly Data (ETUA-DDPM).* SSRN. ([SSRN][10]) -[13] Yuan, Y., et al. (2025). *CTU-DDPM: Generating Industrial Control System Time-Series Data with a CNN-Transformer Hybrid Diffusion Model.* ACM (conference paper metadata). ([ACM Digital Library][11]) - -[1]: https://arxiv.org/html/2307.12667v2?utm_source=chatgpt.com "Generating Long, High Fidelity Time Series using Diffusion ..." -[2]: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5134458&utm_source=chatgpt.com "Diffusion Model Based Synthetic Data Generation for ..." 
-[3]: https://arxiv.org/abs/2303.09556 "https://arxiv.org/abs/2303.09556" -[4]: https://arxiv.org/pdf/2406.04329 "https://arxiv.org/pdf/2406.04329" -[5]: https://arxiv.org/abs/2410.20626 "https://arxiv.org/abs/2410.20626" -[6]: https://arxiv.org/abs/2307.11494 "https://arxiv.org/abs/2307.11494" -[7]: https://arxiv.org/abs/2407.11501 "https://arxiv.org/abs/2407.11501" -[8]: https://arxiv.org/abs/2307.12667 "https://arxiv.org/abs/2307.12667" -[9]: https://arxiv.org/abs/2507.14507 "https://arxiv.org/abs/2507.14507" -[10]: https://papers.ssrn.com/sol3/Delivery.cfm/254b62d8-9d8c-4b87-bb57-6b07da15f831-MECA.pdf?abstractid=6055903&mirid=1&type=2 "https://papers.ssrn.com/sol3/Delivery.cfm/254b62d8-9d8c-4b87-bb57-6b07da15f831-MECA.pdf?abstractid=6055903&mirid=1&type=2" -[11]: https://dl.acm.org/doi/pdf/10.1145/3776759.3776845 "https://dl.acm.org/doi/pdf/10.1145/3776759.3776845" +[1]: https://arxiv.org/abs/1706.03762?utm_source=chatgpt.com "Attention Is All You Need" +[2]: https://arxiv.org/abs/2006.11239?utm_source=chatgpt.com "Denoising Diffusion Probabilistic Models" +[3]: https://arxiv.org/abs/2107.03006?utm_source=chatgpt.com "Structured Denoising Diffusion Models in Discrete State-Spaces" +[4]: https://arxiv.org/abs/2406.04329?utm_source=chatgpt.com "Simplified and Generalized Masked Diffusion for Discrete Data" +[5]: https://arxiv.org/abs/2303.09556?utm_source=chatgpt.com "Efficient Diffusion Training via Min-SNR Weighting Strategy" +[6]: https://arxiv.org/abs/2307.11494?utm_source=chatgpt.com "Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting" +[7]: https://arxiv.org/abs/2507.14507?utm_source=chatgpt.com "Diffusion Models for Time Series Forecasting: A Survey" From 2c1e21150499fa7cc2e60dbe71a828db14a9f769 Mon Sep 17 00:00:00 2001 From: manbo Date: Sat, 31 Jan 2026 21:50:02 +0800 Subject: [PATCH 3/4] Version 2, slightly differ from online docs v2(use that) online docs v2: https://my.feishu.cn/wiki/Za4dwCsG6iPD9qklRLWcoJOZnnb?edition_id=dXsOZT --- knowledges/draft-incomplete-methodology.md | 46 +++++----------------- 1 file changed, 10 insertions(+), 36 deletions(-) diff --git a/knowledges/draft-incomplete-methodology.md b/knowledges/draft-incomplete-methodology.md index d2b3cc1..65ca4ba 100644 --- a/knowledges/draft-incomplete-methodology.md +++ b/knowledges/draft-incomplete-methodology.md @@ -6,16 +6,14 @@ Industrial control system (ICS) telemetry is a **mixed-type** sequential object: Our methodological objective is to learn a generator that produces synthetic ((\hat{X},\hat{Y})) that are simultaneously: (a) temporally coherent, (b) distributionally faithful, and (c) **semantically valid** for discrete channels (i.e., (\hat{y}^{(j)}_t\in\mathcal{V}_j) by construction). A key empirical observation motivating our design is that, in ICS, temporal realism and distributional realism often “pull” the model in different directions when trained monolithically; this is amplified by the heterogeneous mechanisms that produce different variables (program steps, controller laws, actuator saturation, physical inertia, and deterministic derived tags). We therefore structure the generator to *separate concerns* and then *specialize*. ---- - -### 2. Overview of Mask-DDPM +(Note: need a workflow graph here) We propose **Mask-DDPM**, organized in the following order: 1. **Transformer trend module**: learns the dominant temporal backbone of continuous dynamics via attention-based sequence modeling [1]. 2. 
**Residual DDPM for continuous variables**: models distributional detail as stochastic residual structure conditioned on the learned trend [2].
3. **Masked diffusion for discrete variables**: generates discrete ICS states with an absorbing/masking corruption process and categorical reconstruction [3,4].
-4. **Type-aware decomposition**: a performance-oriented refinement layer that routes variables to the most appropriate modeling mechanism and enforces deterministic constraints where warranted.
+4. **Type-aware decomposition**: a factorization and routing layer that assigns variables to the most appropriate modeling mechanism and enforces deterministic constraints where warranted.

This ordering is intentional: the trend module fixes the *macro-temporal scaffold*; diffusion then focuses on *micro-structure and marginal fidelity*; masked diffusion guarantees *discrete legality*; and type-aware routing resolves the remaining mismatch caused by heterogeneous variable-generating mechanisms.

---

## 3. Transformer trend module for continuous dynamics

-### 3.1 Trend–residual decomposition
+Transformers are a standard backbone for sequence modeling [1]. We use a Transformer explicitly as a trend extractor whose sole role is to provide conditioning for subsequent diffusion rather than to serve as the primary generator; this creates a clean separation between temporal structure and distributional refinement in the ICS setting.
+
For continuous channels (X), we posit an additive decomposition
[
X = S + R,
]
where (S\in\mathbb{R}^{L\times d_c}) is a smooth **trend** capturing predictable temporal evolution, and (R\in\mathbb{R}^{L\times d_c}) is a **residual** capturing distributional detail (e.g., bursts, heavy tails, local fluctuations) that is difficult to represent robustly with a purely regression-based temporal objective.

-### 3.2 Causal Transformer parameterization

We parameterize the trend (S) using a **causal Transformer** (f_\phi) [1]. Concretely, with teacher forcing we train (f_\phi) to predict the next-step trend from past observations:
[
\hat{S}_{t+1} = f_\phi(X_{1:t}), \qquad t=1,\dots,L-1,
]
with the mean-squared error objective
[
\mathcal{L}_{\text{trend}}(\phi)=\frac{1}{(L-1)d_c}\sum_{t=1}^{L-1}\left\| \hat{S}_{t+1} - X_{t+1}\right\|_2^2.
]
Self-attention is particularly suitable here because it provides a direct mechanism for learning cross-channel interactions and long-range temporal dependencies without recurrence, which is important when control actions influence downstream variables with nontrivial delays [1].

At inference, we roll out the Transformer autoregressively to obtain (\hat{S}), and then define the residual target for diffusion as (R = X - \hat{S}).

-#### Uniqueness note (trend module)
-
-*While Transformers are now standard in sequence modeling, our use of a Transformer explicitly as a **trend extractor**—whose sole role is to provide conditioning for subsequent diffusion rather than to serve as the primary generator—creates a clean separation between temporal structure and distributional refinement in the ICS setting.*
-
---

## 4. DDPM for continuous residual generation

-### 4.1 Conditional diffusion on residuals
+We adopt a trend-conditioned residual diffusion formulation: diffusion is trained on residual structure explicitly defined relative to a Transformer trend model, which is particularly well aligned with ICS, where low-frequency dynamics are strong and persistent.
+
We model the residual (R) with a denoising diffusion probabilistic model (DDPM) conditioned on the trend (\hat{S}) [2]. Let (K) denote the number of diffusion steps, with a noise schedule (\{\beta_k\}_{k=1}^K), (\alpha_k = 1-\beta_k), and (\bar{\alpha}_k=\prod_{i=1}^k \alpha_i).
The forward corruption process is
[
q(r_k\mid r_0)=\mathcal{N}\!\left(\sqrt{\bar{\alpha}_k}\,r_0,\ (1-\bar{\alpha}_k)\mathbf{I}\right),
]
equivalently,
[
r_k = \sqrt{\bar{\alpha}_k}\,r_0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon,\qquad \epsilon\sim\mathcal{N}(0,\mathbf{I}).
]

The learned reverse process is parameterized as
[
p_{\theta}(r_{k-1}\mid r_k,\hat{S})=\mathcal{N}\!\left(\mu_{\theta}(r_k,k,\hat{S}),\ \Sigma_{\theta}(k)\right),
]
where (\mu_\theta) is implemented by a **Transformer denoiser** that consumes (i) the noised residual (r_k), (ii) a timestep embedding for (k), and (iii) conditioning features derived from (\hat{S}). Conditioning is crucial: it prevents the diffusion model from re-learning the low-frequency temporal scaffold and concentrates its capacity on matching the residual distribution.

-### 4.2 Training objective and loss shaping

We train the denoiser using the standard DDPM (\epsilon)-prediction objective [2]:
[
\mathcal{L}_{\text{cont}}(\theta)
=
\mathbb{E}_{k,r_0,\epsilon}
\left[
\left\|
\epsilon - \epsilon_{\theta}(r_k,k,\hat{S})
\right\|_2^2
\right].
]
Because diffusion optimization can exhibit timestep imbalance and gradient conflict, we optionally apply an SNR-based reweighting consistent with Min-SNR training [5]:
[
\mathcal{L}^{\text{snr}}_{\text{cont}}(\theta)
=
\mathbb{E}_{k,r_0,\epsilon}
\left[
w(k)\left\|
\epsilon - \epsilon_{\theta}(r_k,k,\hat{S})
\right\|_2^2
\right],
\qquad
w(k)=\frac{\mathrm{SNR}_k}{\mathrm{SNR}_k+\gamma}.
]
-(We treat this as an ablation: it is not required by the method, but can improve training efficiency and stability in practice [5].)

-### 4.3 Continuous reconstruction

After sampling (\hat{R}) by reverse diffusion, we reconstruct the continuous output as
[
\hat{X} = \hat{S} + \hat{R}.
]
This design makes the role of diffusion explicit: it acts as a **distributional corrector** on top of a temporally coherent backbone.

-#### Uniqueness note (continuous branch)
-
-*The central integration here is not “Transformer + diffusion” in isolation, but rather a **trend-conditioned residual diffusion** formulation: diffusion is trained on residual structure explicitly defined relative to a Transformer trend model, which is particularly aligned with ICS where low-frequency dynamics are strong and persistent.*
-
---

## 5. Masked diffusion for discrete ICS variables

-### 5.1 Discrete corruption via absorbing masks
-
-Discrete ICS variables must remain categorical, making Gaussian diffusion inappropriate. We therefore use **masked (absorbing) diffusion** for discrete channels [3,4], in which corruption replaces tokens with a special (\texttt{[MASK]}) symbol according to a schedule.
+Discrete ICS variables must remain categorical, making Gaussian diffusion inappropriate. Diffusion-based ICS synthesis has predominantly emphasized continuous multivariate generation; in contrast, we introduce masked diffusion as a first-class discrete legality mechanism within the same pipeline, rather than treating discrete tags as continuous surrogates or post-hoc thresholds. We therefore use **masked (absorbing) diffusion** for discrete channels [3,4], in which corruption replaces tokens with a special (\texttt{[MASK]}) symbol according to a schedule.

For a discrete sequence (y^{(j)}_{1:L}), define a corruption level (m_k\in[0,1]) that increases with (k). The corrupted sequence at step (k) is formed by independent masking:
[
y^{(j,k)}_t =
\begin{cases}
\texttt{[MASK]}, & \text{with probability } m_k,\\
y^{(j)}_t, & \text{otherwise}.
\end{cases}
]
This “absorbing” mechanism yields a diffusion-like process over discrete states while preserving a clear reconstruction target at every step [3,4].

-### 5.2 Categorical denoising objective

+For the categorical denoising objective, we parameterize the reverse model with a Transformer (h_\psi) that outputs a categorical distribution over (\mathcal{V}_j) for each masked position. Let (\mathcal{M}) be the set of masked indices.
We train with cross-entropy over masked positions: [ \mathcal{L}*{\text{disc}}(\psi) =============================== @@ -147,25 +132,20 @@ h_\psi!\left(y^{(k)},k,\hat{S}\right)_{j,t},\ y^{(j)}_t ] Conditioning on (\hat{S}) (and, optionally, on (\hat{X})) allows discrete state generation to reflect the continuous context, which is often essential in ICS (e.g., mode transitions coupled to process regimes). -### 5.3 Sampling At inference, we initialize (y^{(K)}) as fully masked and iteratively denoise/unmask from (k=K) to (1), sampling valid tokens from the predicted categorical distributions. Because outputs are always drawn from (\mathcal{V}_j), semantic validity is satisfied by construction. -#### Uniqueness note (discrete branch) - -*To our knowledge, diffusion-based ICS synthesis has predominantly emphasized continuous multivariate generation; in contrast, we introduce masked diffusion as a **first-class discrete legality mechanism** within the same pipeline, rather than treating discrete tags as continuous surrogates or post-hoc thresholds.* - --- ## 6. Type-aware decomposition as a performance refinement layer -### 6.1 Motivation: mechanistic heterogeneity in ICS variables + Even with a strong continuous generator and a discrete legality mechanism, ICS data pose an additional challenge: variables arise from **qualitatively different generative mechanisms**. For example, program-driven setpoints are often piecewise constant with abrupt changes; actuators exhibit saturation and dwell; some channels are deterministic functions of others (derived tags). Modeling all channels identically can cause persistent error concentration and degrade overall quality. We therefore introduce a **type-aware decomposition** that operates on top of the base pipeline to improve fidelity, stability, and interpretability. -### 6.2 Typing function and routing +The type-aware decomposition is not merely a diagnostic taxonomy; it is a modeling layer that changes the generator’s factorization—deciding which variables are best treated as conditional drivers, which require dwell-aware behavior, and which should be deterministic—thereby tailoring the diffusion pipeline to the structural heterogeneity characteristic of ICS telemetry. Let (\tau(i)\in{1,\dots,6}) assign each variable (i) to a type. This induces a partition of variables into index sets ({\mathcal{I}*k}*{k=1}^6). We then define the generator as a composition of type-specific operators: [ @@ -173,7 +153,6 @@ Let (\tau(i)\in{1,\dots,6}) assign each variable (i) to a type. This induces a p ] where (\mathcal{A}) assembles a complete sample and (\mathcal{G}_k) encodes the appropriate modeling choice per type. -### 6.3 Six-type schema and modeling commitments We operationalize six types that map cleanly onto common ICS semantics: @@ -184,7 +163,6 @@ We operationalize six types that map cleanly onto common ICS semantics: * **Type 5 (derived/deterministic variables):** algebraic or rule-based functions of other variables; we enforce deterministic reconstruction (\hat{x}^{(i)} = g_i(\hat{X},\hat{Y})) rather than learning a stochastic generator, improving logical consistency and sample efficiency. * **Type 6 (auxiliary/low-impact variables):** weakly coupled or sparse signals; we allow simplified modeling (e.g., calibrated marginals or lightweight temporal models) to avoid allocating diffusion capacity where it is not warranted. 
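One way to operationalize the six-type schema above is a small routing table plus a deterministic-reconstruction hook for Type 5, as in the sketch below. The enum labels, route descriptions, and the `assemble` signature are illustrative assumptions; in practice the assignment (\tau) comes from tag metadata and the error-attribution loop rather than from a hand-written table.

```python
from enum import IntEnum

class VarType(IntEnum):
    PROGRAM = 1      # program-driven / setpoint-like
    CONTROLLER = 2   # controller outputs
    ACTUATOR = 3     # actuator states / positions
    PROCESS = 4      # inertia-dominated process variables
    DERIVED = 5      # deterministic functions of other variables
    AUXILIARY = 6    # weakly coupled / low-impact signals

# Illustrative routing: which generative mechanism handles which type.
ROUTES = {
    VarType.PROGRAM:    "conditioning / change-point-aware generator",
    VarType.CONTROLLER: "residual DDPM, conditioned on Type 1 and context",
    VarType.ACTUATOR:   "masked diffusion or dwell-aware model",
    VarType.PROCESS:    "trend Transformer + residual DDPM",
    VarType.DERIVED:    "deterministic reconstruction",
    VarType.AUXILIARY:  "lightweight / calibrated marginals",
}

def assemble(sample: dict, tau: dict, derived_rules: dict) -> dict:
    """Type-aware assembly: overwrite Type-5 channels with their deterministic rules g_i."""
    for name, var_type in tau.items():
        if var_type == VarType.DERIVED:
            sample[name] = derived_rules[name](sample)  # x_i = g_i(X_hat, Y_hat)
    return sample
```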
-### 6.4 Training-time and inference-time integration Type-aware decomposition improves performance through three concrete mechanisms: @@ -194,10 +172,6 @@ Type-aware decomposition improves performance through three concrete mechanisms: In practice, (\tau) can be initialized from domain semantics (tag metadata and value domains) and refined via an error-attribution loop described in the Benchmark section. Importantly, this refinement does not alter the core generative backbone; it *re-routes* difficult variables to better-suited mechanisms while preserving end-to-end coherence. -#### Uniqueness note (type-aware layer) - -*The type-aware decomposition is not merely a diagnostic taxonomy; it is a modeling layer that **changes the generator’s factorization**—deciding which variables are best treated as conditional drivers, which require dwell-aware behavior, and which should be deterministic—thereby tailoring the diffusion pipeline to the structural heterogeneity characteristic of ICS telemetry.* - --- ## 7. Joint optimization and end-to-end sampling From 1ee85b97bcfd7958baea1426b6a1656cdab10896 Mon Sep 17 00:00:00 2001 From: manbo Date: Sat, 31 Jan 2026 21:53:52 +0800 Subject: [PATCH 4/4] Add more paragraphs/citations to smooth the logic flow --- knowledges/draft-incomplete-methodology.md | 207 +++++++++------------ 1 file changed, 92 insertions(+), 115 deletions(-) diff --git a/knowledges/draft-incomplete-methodology.md b/knowledges/draft-incomplete-methodology.md index 65ca4ba..f0b9fed 100644 --- a/knowledges/draft-incomplete-methodology.md +++ b/knowledges/draft-incomplete-methodology.md @@ -1,212 +1,189 @@ +(Updated from your current draft; no benchmark-metric details are introduced here, as requested.) + ## Methodology -### 1. Problem setting and design motivation +Industrial control system (ICS) telemetry is intrinsically **mixed-type** and **mechanistically heterogeneous**: continuous process trajectories (e.g., sensor and actuator signals) coexist with discrete supervisory states (e.g., modes, alarms, interlocks), and the underlying generating mechanisms range from physical inertia to program-driven step logic. This heterogeneity is not cosmetic—it directly affects what “realistic” synthesis means, because a generator must jointly satisfy (i) temporal coherence, (ii) distributional fidelity, and (iii) discrete semantic validity (i.e., every discrete output must belong to its legal vocabulary by construction). These properties are emphasized broadly in operational-technology security guidance and ICS engineering practice, where state logic and physical dynamics are tightly coupled. [12] -Industrial control system (ICS) telemetry is a **mixed-type** sequential object: it couples **continuous** process dynamics (e.g., sensor values and physical responses) with **discrete** supervisory states (e.g., modes, alarms, interlocks). We model each training instance as a fixed-length window of length (L), consisting of (i) continuous channels (X\in\mathbb{R}^{L\times d_c}) and (ii) discrete channels (Y={y^{(j)}*{1:L}}*{j=1}^{d_d}), where each discrete variable (y^{(j)}_t\in\mathcal{V}_j) belongs to a finite vocabulary. +We formalize each training instance as a fixed-length window of length (L), consisting of (i) continuous channels (X\in\mathbb{R}^{L\times d_c}) and (ii) discrete channels (Y={y^{(j)}*{1:L}}*{j=1}^{d_d}), where each discrete variable (y^{(j)}_t\in\mathcal{V}_j) belongs to a finite vocabulary (\mathcal{V}_j). 
Our objective is to learn a generator that produces synthetic ((\hat{X},\hat{Y})) that are simultaneously temporally coherent and distributionally faithful, while also ensuring (\hat{y}^{(j)}_t\in\mathcal{V}_j) for all (j,t) by construction (rather than via post-hoc rounding or thresholding). -Our methodological objective is to learn a generator that produces synthetic ((\hat{X},\hat{Y})) that are simultaneously: (a) temporally coherent, (b) distributionally faithful, and (c) **semantically valid** for discrete channels (i.e., (\hat{y}^{(j)}_t\in\mathcal{V}_j) by construction). A key empirical observation motivating our design is that, in ICS, temporal realism and distributional realism often “pull” the model in different directions when trained monolithically; this is amplified by the heterogeneous mechanisms that produce different variables (program steps, controller laws, actuator saturation, physical inertia, and deterministic derived tags). We therefore structure the generator to *separate concerns* and then *specialize*. +A key empirical and methodological tension in ICS synthesis is that *temporal realism* and *marginal/distributional realism* can compete when optimized monolithically: sequence models trained primarily for regression often over-smooth heavy tails and intermittent bursts, while purely distribution-matching objectives can erode long-range structure. Diffusion models provide a principled route to rich distribution modeling through iterative denoising, but they do not, by themselves, resolve (i) the need for a stable low-frequency temporal scaffold, nor (ii) the discrete legality constraints for supervisory variables. [2,8] Recent time-series diffusion work further suggests that separating coarse structure from stochastic refinement can be an effective inductive bias for long-horizon realism. [6,7] -(Note: need a workflow graph here) +Motivated by these considerations, we propose **Mask-DDPM**, organized in the following order: -We propose **Mask-DDPM**, organized in the following order: +1. **Transformer trend module**: learns the dominant temporal backbone of continuous dynamics via attention-based sequence modeling. [1] +2. **Residual DDPM for continuous variables**: models distributional detail as stochastic residual structure conditioned on the learned trend. [2,6] +3. **Masked diffusion for discrete variables**: generates discrete ICS states with an absorbing/masking corruption process and categorical reconstruction. [3,4] +4. **Type-aware decomposition**: a **type-aware factorization and routing layer** that assigns variables to the most appropriate modeling mechanism and enforces deterministic constraints where warranted. -1. **Transformer trend module**: learns the dominant temporal backbone of continuous dynamics via attention-based sequence modeling [1]. -2. **Residual DDPM for continuous variables**: models distributional detail as stochastic residual structure conditioned on the learned trend [2]. -3. **Masked diffusion for discrete variables**: generates discrete ICS states with an absorbing/masking corruption process and categorical reconstruction [3,4]. -4. **Type-aware decomposition**: a post-process (consider using a word instead of having finetune/refinement meanings) layer that routes variables to the most appropriate modeling mechanism and enforces deterministic constraints where warranted. 
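
The four-stage order can also be read as a simple composition of samplers. The sketch below is an illustrative orchestration only: the stage callables are stand-ins (zeros, Gaussian noise, a constant discrete sampler), not the trained modules described in this draft, and the function names are hypothetical.

```python
from typing import Callable, Dict

import numpy as np


def sample_mask_ddpm(
    sample_trend: Callable[[], np.ndarray],                    # Stage 1: trend S_hat
    sample_residual: Callable[[np.ndarray], np.ndarray],        # Stage 2: residual given S_hat
    sample_discrete: Callable[[np.ndarray], np.ndarray],        # Stage 3: discrete given context
    assemble: Callable[[np.ndarray, np.ndarray, np.ndarray], Dict[str, np.ndarray]],  # Stage 4
) -> Dict[str, np.ndarray]:
    """Generate one sample in the order: trend -> residual -> discrete -> assembly."""
    s_hat = sample_trend()
    r_hat = sample_residual(s_hat)
    y_hat = sample_discrete(s_hat + r_hat)
    return assemble(s_hat, r_hat, y_hat)


if __name__ == "__main__":
    L, d_c, d_d = 128, 8, 3
    out = sample_mask_ddpm(
        sample_trend=lambda: np.zeros((L, d_c)),
        sample_residual=lambda s: np.random.randn(*s.shape) * 0.1,
        sample_discrete=lambda x: np.zeros((x.shape[0], d_d), dtype=int),
        assemble=lambda s, r, y: {"X_hat": s + r, "Y_hat": y},
    )
    print(out["X_hat"].shape, out["Y_hat"].shape)
```
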
- -This ordering is intentional: the trend module fixes the *macro-temporal scaffold*; diffusion then focuses on *micro-structure and marginal fidelity*; masked diffusion guarantees *discrete legality*; and type-aware routing resolves the remaining mismatch caused by heterogeneous variable-generating mechanisms. +This ordering is intentional. The trend module establishes a macro-temporal scaffold; residual diffusion then concentrates capacity on micro-structure and marginal fidelity; masked diffusion provides a native mechanism for discrete legality; and the type-aware layer operationalizes the observation that not all ICS variables should be modeled with the same stochastic mechanism. Importantly, while diffusion-based generation for ICS telemetry has begun to emerge, existing approaches remain limited and typically emphasize continuous synthesis or augmentation; in contrast, our pipeline integrates (i) a Transformer-conditioned residual diffusion backbone, (ii) a discrete masked-diffusion branch, and (iii) explicit type-aware routing for heterogeneous variable mechanisms within a single coherent generator. [10,11] --- -## 3. Transformer trend module for continuous dynamics - -Transformers are often considered standard in sequence modeling [Note: need a citation here]. We purposed a Transformer explicitly as a trend extractor, a sole role that is to provide conditioning for subsequent diffusion rather than to serve as the primary generator—creates a clean separation between temporal structure and distributional refinement in the ICS setting. +## Transformer trend module for continuous dynamics +We instantiate the temporal backbone as a **causal Transformer** trend extractor, leveraging self-attention’s ability to represent long-range dependencies and cross-channel interactions without recurrence. [1] Compared with recurrent trend extractors (e.g., GRU-style backbones), a Transformer trend module offers a direct mechanism to model delayed effects and multivariate coupling—common in ICS, where control actions may influence downstream sensors with nontrivial lags and regime-dependent propagation. [1,12] Crucially, in our design the Transformer is *not* asked to be the entire generator; instead, it serves a deliberately restricted role: providing a stable, temporally coherent conditioning signal that later stochastic components refine. For continuous channels (X), we posit an additive decomposition [ X = S + R, ] -where (S\in\mathbb{R}^{L\times d_c}) is a smooth **trend** capturing predictable temporal evolution, and (R\in\mathbb{R}^{L\times d_c}) is a **residual** capturing distributional detail (e.g., bursts, heavy tails, local fluctuations) that is difficult to represent robustly with a purely regression-based temporal objective. +where (S\in\mathbb{R}^{L\times d_c}) is a smooth **trend** capturing predictable temporal evolution and (R\in\mathbb{R}^{L\times d_c}) is a **residual** capturing distributional detail (e.g., bursts, heavy tails, local fluctuations) that is difficult to represent robustly with a purely regression-oriented temporal objective. This separation reflects an explicit *division of labor*: the trend module prioritizes temporal coherence, while diffusion (introduced next) targets distributional realism at the residual level—a strategy aligned with “predict-then-refine” perspectives in time-series diffusion modeling. [6,7] - -We parameterize the trend (S) using a **causal Transformer** (f_\phi) [1]. 
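
A minimal PyTorch-style sketch of such a causal trend extractor is given below. It assumes a hypothetical `TrendTransformer` module name, a small encoder configuration, and omits positional encodings for brevity; it illustrates the one-step-ahead setup formalized in the equations that follow, not the draft's exact architecture.

```python
import torch
import torch.nn as nn


class TrendTransformer(nn.Module):
    """Causal Transformer that predicts the next-step trend from past observations."""

    def __init__(self, d_c: int, d_model: int = 64, nhead: int = 4, num_layers: int = 2):
        super().__init__()
        self.inp = nn.Linear(d_c, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.out = nn.Linear(d_model, d_c)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, L, d_c); returns predictions aligned with x[:, 1:] (one step ahead).
        B, L, _ = x.shape
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.encoder(self.inp(x), mask=causal)   # True entries are masked out
        return self.out(h)[:, :-1, :]                # predictions for timesteps 2..L


if __name__ == "__main__":
    B, L, d_c = 4, 128, 8
    x = torch.randn(B, L, d_c)
    model = TrendTransformer(d_c)
    s_next = model(x)                                # (B, L-1, d_c)
    loss = torch.mean((s_next - x[:, 1:, :]) ** 2)   # teacher-forced MSE trend loss
    print(s_next.shape, float(loss))
```
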
Concretely, with teacher forcing we train (f_\phi) to predict the next-step trend from past observations: +We parameterize the trend (S) using a causal Transformer (f_{\phi}). With teacher forcing, we train (f_{\phi}) to predict the next-step trend from past observations: [ -\hat{S}*{t+1} = f*\phi(X_{1:t}), \qquad t=1,\dots,L-1, +\hat{S}*{t+1} = f*{\phi}(X_{1:t}), \qquad t=1,\dots,L-1, ] -with the mean-squared error objective +using the mean-squared error objective [ \mathcal{L}*{\text{trend}}(\phi)=\frac{1}{(L-1)d_c}\sum*{t=1}^{L-1}\left| \hat{S}*{t+1} - X*{t+1}\right|_2^2. ] -Self-attention is particularly suitable here because it provides a direct mechanism for learning cross-channel interactions and long-range temporal dependencies without recurrence, which is important when control actions influence downstream variables with nontrivial delays [1]. - -At inference, we roll out the Transformer autoregressively to obtain (\hat{S}), and then define the residual target for diffusion as (R = X - \hat{S}). +At inference, we roll out the Transformer autoregressively to obtain (\hat{S}), and define the residual target for diffusion as (R = X - \hat{S}). This setup intentionally “locks in” a coherent low-frequency scaffold before any stochastic refinement is applied, thereby reducing the burden on downstream diffusion modules to simultaneously learn both long-range structure and marginal detail. In this sense, our use of Transformers is distinctive: it is a *conditioning-first* temporal backbone designed to stabilize mixed-type diffusion synthesis in ICS, rather than an end-to-end monolithic generator. [1,6,10] --- -## 4. DDPM for continuous residual generation +## DDPM for continuous residual generation -We made a trend-conditioned residual diffusion formulation: diffusion is trained on residual structure explicitly defined relative to a Transformer trend model, which is particularly aligned with ICS where low-frequency dynamics are strong and persistent. +We model the residual (R) with a denoising diffusion probabilistic model (DDPM) conditioned on the trend (\hat{S}). [2] Diffusion models learn complex data distributions by inverting a tractable noising process through iterative denoising, and have proven effective at capturing multimodality and heavy-tailed structure that is often attenuated by purely regression-based sequence models. [2,8] Conditioning the diffusion model on (\hat{S}) is central: it prevents the denoiser from re-learning the low-frequency scaffold and focuses capacity on residual micro-structure, mirroring the broader principle that diffusion excels as a distributional corrector when a reasonable coarse structure is available. [6,7] - -We model the residual (R) with a denoising diffusion probabilistic model (DDPM) conditioned on the trend (\hat{S}) [2]. Let (K) denote the number of diffusion steps, with a noise schedule ({\beta_k}_{k=1}^K), (\alpha_k = 1-\beta_k), and (\bar{\alpha}*k=\prod*{i=1}^k \alpha_i). The forward corruption process is +Let (K) denote the number of diffusion steps, with a noise schedule ({\beta_k}_{k=1}^K), (\alpha_k = 1-\beta_k), and (\bar{\alpha}*k=\prod*{i=1}^k \alpha_i). The forward corruption process is [ q(r_k\mid r_0)=\mathcal{N}!\left(\sqrt{\bar{\alpha}_k},r_0,\ (1-\bar{\alpha}_k)\mathbf{I}\right), ] equivalently, [ -r_k = \sqrt{\bar{\alpha}_k},r_0 + \sqrt{1-\bar{\alpha}_k},\epsilon,\qquad \epsilon\sim\mathcal{N}(0,\mathbf{I}). 
+r_k = \sqrt{\bar{\alpha}_k},r_0 + \sqrt{1-\bar{\alpha}_k},\epsilon,\qquad \epsilon\sim\mathcal{N}(0,\mathbf{I}), ] +where (r_0\equiv R) and (r_k) is the noised residual at step (k). The learned reverse process is parameterized as [ -p_{\theta}(r_{k-1}\mid r_k,\hat{S})=\mathcal{N}!\left(\mu_{\theta}(r_k,k,\hat{S}),\ \Sigma_{\theta}(k)\right), +p_{\theta}(r_{k-1}\mid r_k,\hat{S})=\mathcal{N}!\left(\mu_{\theta}(r_k,k,\hat{S}),\ \Sigma(k)\right), ] -where (\mu_\theta) is implemented by a **Transformer denoiser** that consumes (i) the noised residual (r_k), (ii) a timestep embedding for (k), and (iii) conditioning features derived from (\hat{S}). Conditioning is crucial: it prevents the diffusion model from re-learning the low-frequency temporal scaffold and concentrates its capacity on matching the residual distribution. +where (\mu_\theta) is implemented by a **Transformer denoiser** that consumes (i) the noised residual (r_k), (ii) a timestep embedding for (k), and (iii) conditioning features derived from (\hat{S}). This denoiser architecture is consistent with the growing use of attention-based denoisers for long-context time-series diffusion, while our key methodological emphasis is the *trend-conditioned residual* factorization as the object of diffusion learning. [2,7] - -We train the denoiser using the standard DDPM (\epsilon)-prediction objective [2]: +We train the denoiser using the standard DDPM (\epsilon)-prediction objective: [ -\mathcal{L}_{\text{cont}}(\theta) -================================= - -\mathbb{E}*{k,r_0,\epsilon} -\left[ +\mathcal{L}*{\text{cont}}(\theta) += \mathbb{E}*{k,r_0,\epsilon}!\left[ \left| -\epsilon - \epsilon*{\theta}(r_k,k,\hat{S}) +\epsilon - \epsilon_{\theta}(r_k,k,\hat{S}) \right|*2^2 \right]. ] -Because diffusion optimization can exhibit timestep imbalance and gradient conflict, we optionally apply an SNR-based reweighting consistent with Min-SNR training [5]: +Because diffusion optimization can exhibit timestep imbalance (i.e., some timesteps dominate gradients), we optionally apply an SNR-based reweighting consistent with Min-SNR training: [ \mathcal{L}^{\text{snr}}*{\text{cont}}(\theta) -============================================== - -\mathbb{E}*{k,r_0,\epsilon} -\left[ -w(k)\left| += \mathbb{E}*{k,r_0,\epsilon}!\left[ +w_k\left| \epsilon - \epsilon*{\theta}(r_k,k,\hat{S}) \right|_2^2 \right], \qquad -w(k)=\frac{\mathrm{SNR}_k}{\mathrm{SNR}_k+\gamma}. +w_k=\frac{\min(\mathrm{SNR}_k,\gamma)}{\mathrm{SNR}_k}, ] - +where (\mathrm{SNR}_k=\bar{\alpha}_k/(1-\bar{\alpha}_k)) and (\gamma>0) is a cap parameter. [5] After sampling (\hat{R}) by reverse diffusion, we reconstruct the continuous output as [ \hat{X} = \hat{S} + \hat{R}. ] -This design makes the role of diffusion explicit: it acts as a **distributional corrector** on top of a temporally coherent backbone. +Overall, the DDPM component serves as a **distributional corrector** on top of a temporally coherent backbone, which is particularly suited to ICS where low-frequency dynamics are strong and persistent but fine-scale variability (including bursts and regime-conditioned noise) remains important for realism. Relative to prior ICS diffusion efforts that primarily focus on continuous augmentation, our formulation elevates *trend-conditioned residual diffusion* as a modular mechanism for disentangling temporal structure from distributional refinement. [10,11] --- -## 5. 
Masked diffusion for discrete ICS variables +## Masked diffusion for discrete ICS variables -Discrete ICS variables must remain categorical, making Gaussian diffusion inappropriate. Diffusion-based ICS synthesis has predominantly emphasized continuous multivariate generation; in contrast, we introduce masked diffusion as a first-class discrete legality mechanism within the same pipeline, rather than treating discrete tags as continuous surrogates or post-hoc thresholds. We therefore use **masked (absorbing) diffusion** for discrete channels [3,4], in which corruption replaces tokens with a special (\texttt{[MASK]}) symbol according to a schedule. +Discrete ICS variables must remain categorical, making Gaussian diffusion inappropriate for supervisory states and mode-like channels. While one can attempt continuous relaxations or post-hoc discretization, such strategies risk producing semantically invalid intermediate states (e.g., “in-between” modes) and can distort the discrete marginal distribution. Discrete-state diffusion provides a principled alternative by defining a valid corruption process directly on categorical variables. [3,4] In the ICS setting, this is not a secondary detail: supervisory tags often encode control logic boundaries (modes, alarms, interlocks) that must remain within a finite vocabulary to preserve semantic correctness. [12] -For a discrete sequence (y^{(j)}_{1:L}), define a corruption level (m_k\in[0,1]) that increases with (k). The corrupted sequence at step (k) is formed by independent masking: +We therefore adopt **masked (absorbing) diffusion** for discrete channels, where corruption replaces tokens with a special (\texttt{[MASK]}) symbol according to a schedule. [4] For each variable (j), define a masking schedule ({m_k}_{k=1}^K) (with (m_k\in[0,1]) increasing in (k)). The forward corruption process is [ -y^{(j,k)}_t = +q(y^{(j)}_k \mid y^{(j)}_0)= \begin{cases} -\texttt{[MASK]}, & \text{with probability } m_k,\ -y^{(j)}_t, & \text{otherwise}. +y^{(j)}*0, & \text{with probability } 1-m_k,\ +\texttt{[MASK]}, & \text{with probability } m_k, \end{cases} ] -This “absorbing” mechanism yields a diffusion-like process over discrete states while preserving a clear reconstruction target at every step [3,4]. +applied independently across (j) and (t). Let (\mathcal{M}) denote the set of masked positions at step (k). The denoiser (h*{\psi}) predicts a categorical distribution over (\mathcal{V}*j) for each masked token, conditioned on (i) the corrupted discrete sequence, (ii) the diffusion step (k), and (iii) continuous context. Concretely, we condition on (\hat{S}) and (optionally) (\hat{X}) to couple supervisory reconstruction to the underlying continuous dynamics: +[ +p*{\psi}!\left(y^{(j)}*0 \mid y_k, k, \hat{S}, \hat{X}\right) += h*{\psi}(y_k,k,\hat{S},\hat{X}). +] +This conditioning choice is motivated by the fact that many discrete ICS states are not standalone—they are functions of regimes, thresholds, and procedural phases that manifest in continuous channels. [12] - -To categorical denoising objective, we parameterize the reverse model with a Transformer (h_\psi) that outputs a categorical distribution over (\mathcal{V}*j) for each masked position. Let (\mathcal{M}) be the set of masked indices. 
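
The absorbing corruption and the unmask-by-sampling reverse step described above can be sketched in a few lines. In the following illustration the `MASK_ID` sentinel, the linear masking level, the toy uniform predictor, and all helper names are assumptions for exposition; the draft's actual reverse model (h_\psi) is a trained Transformer conditioned on the trend and continuous context.

```python
import numpy as np

MASK_ID = -1  # stands in for the [MASK] token


def corrupt(y0: np.ndarray, m_k: float, rng: np.random.Generator) -> np.ndarray:
    """Independently replace each token with [MASK] with probability m_k."""
    mask = rng.random(y0.shape) < m_k
    y_k = y0.copy()
    y_k[mask] = MASK_ID
    return y_k


def unmask_step(y_k: np.ndarray, predict_probs, rng: np.random.Generator) -> np.ndarray:
    """Fill every currently-masked position by sampling from a predicted categorical."""
    y_next = y_k.copy()
    for (t, j) in np.argwhere(y_k == MASK_ID):
        p = predict_probs(y_k, t, j)          # distribution over the legal vocabulary V_j
        y_next[t, j] = rng.choice(len(p), p=p)
    return y_next


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    L, d_d, V = 16, 2, 4
    y0 = rng.integers(0, V, size=(L, d_d))
    y_k = corrupt(y0, m_k=0.5, rng=rng)       # forward corruption at some step k
    # Toy "denoiser": uniform over the vocabulary (a real model conditions on S_hat, X_hat).
    y_rec = unmask_step(y_k, lambda y, t, j: np.full(V, 1.0 / V), rng)
    assert (y_rec != MASK_ID).all()           # outputs always lie in the legal vocabulary
```
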
We train with cross-entropy over masked positions: +Training uses a categorical denoising objective: [ \mathcal{L}*{\text{disc}}(\psi) -=============================== - -\mathbb{E}*{k} -\left[ -\frac{1}{|\mathcal{M}|}\sum*{(j,t)\in\mathcal{M}} += \mathbb{E}*{k}!\left[ +\frac{1}{|\mathcal{M}|} +\sum_{(j,t)\in\mathcal{M}} \mathrm{CE}!\left( -h_\psi!\left(y^{(k)},k,\hat{S}\right)_{j,t},\ y^{(j)}_t +h_{\psi}(y_k,k,\hat{S},\hat{X})*{j,t}, +y^{(j)}*{0,t} \right) -\right]. +\right], ] -Conditioning on (\hat{S}) (and, optionally, on (\hat{X})) allows discrete state generation to reflect the continuous context, which is often essential in ICS (e.g., mode transitions coupled to process regimes). - - -At inference, we initialize (y^{(K)}) as fully masked and iteratively denoise/unmask from (k=K) to (1), sampling valid tokens from the predicted categorical distributions. Because outputs are always drawn from (\mathcal{V}_j), semantic validity is satisfied by construction. +where (\mathrm{CE}(\cdot,\cdot)) is cross-entropy. At sampling time, we initialize all discrete tokens as (\texttt{[MASK]}) and iteratively unmask them using the learned conditionals, ensuring that every output token lies in its legal vocabulary by construction. This discrete branch is a key differentiator of our pipeline: unlike typical continuous-only diffusion augmentation in ICS, we integrate masked diffusion as a first-class mechanism for supervisory-variable legality within the same end-to-end synthesis workflow. [4,10] --- -## 6. Type-aware decomposition as a performance refinement layer +## Type-aware decomposition as a performance refinement layer +Even with a trend-conditioned residual DDPM and a discrete masked-diffusion branch, a single uniform modeling treatment can remain suboptimal because ICS variables are generated by qualitatively different mechanisms. For example, program-driven setpoints exhibit step-and-dwell dynamics; controller outputs follow control laws conditioned on process feedback; actuator positions may show saturation and dwell; and some “derived tags” are deterministic functions of other channels. Treating all channels as if they were exchangeable stochastic processes can misallocate model capacity and induce systematic error concentration on a small subset of mechanistically distinct variables. [12] +We therefore introduce a **type-aware decomposition** that formalizes this heterogeneity as a routing and constraint layer. Let (\tau(i)\in{1,\dots,6}) assign each variable (i) to a type class. The type assignment can be initialized from domain semantics (tag metadata, value domains, and engineering meaning), and subsequently refined via an error-attribution workflow described in the Benchmark section. Importantly, this refinement does **not** change the core diffusion backbone; it changes *which mechanism is responsible for which variable*, thereby aligning inductive bias with variable-generating mechanism while preserving overall coherence. -Even with a strong continuous generator and a discrete legality mechanism, ICS data pose an additional challenge: variables arise from **qualitatively different generative mechanisms**. For example, program-driven setpoints are often piecewise constant with abrupt changes; actuators exhibit saturation and dwell; some channels are deterministic functions of others (derived tags). Modeling all channels identically can cause persistent error concentration and degrade overall quality. 
+We use the following taxonomy: -We therefore introduce a **type-aware decomposition** that operates on top of the base pipeline to improve fidelity, stability, and interpretability. +* **Type 1 (program-driven / setpoint-like):** externally commanded, step-and-dwell variables. These variables can be treated as exogenous drivers (conditioning signals) or routed to specialized change-point / dwell-time models, rather than being forced into a smooth denoiser that may over-regularize step structure. +* **Type 2 (controller outputs):** continuous variables tightly coupled to feedback loops; these benefit from conditional modeling where the conditioning includes relevant process variables and commanded setpoints. +* **Type 3 (actuator states/positions):** often exhibit saturation, dwell, and rate limits; these may require stateful dynamics beyond generic residual diffusion, motivating either specialized conditional modules or additional inductive constraints. +* **Type 4 (process variables):** inertia-dominated continuous dynamics; these are the primary beneficiaries of the **Transformer trend + residual DDPM** pipeline. +* **Type 5 (derived/deterministic variables):** algebraic or rule-based functions of other variables; we enforce deterministic reconstruction (\hat{x}^{(i)} = g_i(\hat{X},\hat{Y})) rather than learning a stochastic generator, improving logical consistency and sample efficiency. +* **Type 6 (auxiliary/low-impact variables):** weakly coupled or sparse signals; we allow simplified modeling (e.g., calibrated marginals or lightweight temporal models) to avoid allocating diffusion capacity where it is not warranted. -The type-aware decomposition is not merely a diagnostic taxonomy; it is a modeling layer that changes the generator’s factorization—deciding which variables are best treated as conditional drivers, which require dwell-aware behavior, and which should be deterministic—thereby tailoring the diffusion pipeline to the structural heterogeneity characteristic of ICS telemetry. +Type-aware decomposition improves synthesis quality through three mechanisms. First, it improves **capacity allocation** by preventing a small set of mechanistically atypical variables from dominating gradients and distorting the learned distribution for the majority class (typically Type 4). Second, it enables **constraint enforcement** by deterministically reconstructing Type 5 variables, preventing logically inconsistent samples that purely learned generators can produce. Third, it improves **mechanism alignment** by attaching inductive biases consistent with step/dwell or saturation behaviors where generic denoisers may implicitly favor smoothness. -Let (\tau(i)\in{1,\dots,6}) assign each variable (i) to a type. This induces a partition of variables into index sets ({\mathcal{I}*k}*{k=1}^6). We then define the generator as a composition of type-specific operators: -[ -(\hat{X},\hat{Y}) = \mathcal{A}\Big(\hat{S},\hat{R},\hat{Y}_{\text{mask}},{\mathcal{G}*k}*{k=1}^6;\tau\Big), -] -where (\mathcal{A}) assembles a complete sample and (\mathcal{G}_k) encodes the appropriate modeling choice per type. - - -We operationalize six types that map cleanly onto common ICS semantics: - -* **Type 1 (program-driven / setpoint-like):** exogenous drivers with step changes and long dwell. These variables are treated as conditioning signals or modeled with a dedicated change-point-aware generator rather than forcing them into residual diffusion. 
-* **Type 2 (controller outputs):** variables that respond to setpoints and process feedback; we treat them as conditional on Type 1 and continuous context, and allow separate specialization if they dominate error. -* **Type 3 (actuator states/positions):** bounded, often quantized, with saturation and dwell; we route them to discrete masked diffusion when naturally categorical/quantized, or to specialized dwell-aware modeling when continuous but state-persistent. -* **Type 4 (process variables):** inertia-dominated continuous dynamics; these are the primary beneficiaries of the **Transformer trend + residual DDPM** pipeline. -* **Type 5 (derived/deterministic variables):** algebraic or rule-based functions of other variables; we enforce deterministic reconstruction (\hat{x}^{(i)} = g_i(\hat{X},\hat{Y})) rather than learning a stochastic generator, improving logical consistency and sample efficiency. -* **Type 6 (auxiliary/low-impact variables):** weakly coupled or sparse signals; we allow simplified modeling (e.g., calibrated marginals or lightweight temporal models) to avoid allocating diffusion capacity where it is not warranted. - - -Type-aware decomposition improves performance through three concrete mechanisms: - -1. **Capacity allocation:** by focusing diffusion on Type 4 (and selected Type 2/3), we reduce the tendency for a few mechanistically distinct variables to dominate gradients and distort the learned distribution elsewhere. -2. **Constraint enforcement:** Type 5 variables are computed deterministically, preventing logically inconsistent samples that a purely learned generator may produce. -3. **Mechanism alignment:** Type 1/3 variables receive inductive biases consistent with step-like or dwell-like behavior, which diffusion-trained smooth denoisers can otherwise over-regularize. - -In practice, (\tau) can be initialized from domain semantics (tag metadata and value domains) and refined via an error-attribution loop described in the Benchmark section. Importantly, this refinement does not alter the core generative backbone; it *re-routes* difficult variables to better-suited mechanisms while preserving end-to-end coherence. +From a novelty standpoint, this layer is not merely an engineering “patch”; it is an explicit methodological statement that ICS synthesis benefits from **typed factorization**—a principle that has analogues in mixed-type generative modeling more broadly, but that remains underexplored in diffusion-based ICS telemetry synthesis. [9,10,12] --- -## 7. Joint optimization and end-to-end sampling +## Joint optimization and end-to-end sampling -We train the model in a staged manner consistent with the factorization: - -1. Train the trend Transformer (f_\phi) to obtain (S). -2. Compute residual targets (R=X-S) for Type 4 (and any routed continuous types). -3. Train the residual DDPM (p_\theta(R\mid S)) and the masked diffusion model (p_\psi(Y\mid \text{masked}(Y),S)). -4. Apply type-aware routing and deterministic reconstruction rules during sampling. +We train the model in a staged manner consistent with the above factorization, which improves optimization stability and encourages each component to specialize in its intended role. 
Specifically: (i) we train the trend Transformer (f_{\phi}) to obtain (\hat{S}); (ii) we compute residual targets (R=X-\hat{S}) for the continuous variables routed to residual diffusion; (iii) we train the residual DDPM (p_{\theta}(R\mid \hat{S})) and masked diffusion model (p_{\psi}(Y\mid \text{masked}(Y), \hat{S}, \hat{X})); and (iv) we apply type-aware routing and deterministic reconstruction during sampling. This staged strategy is aligned with the design goal of separating temporal scaffolding from distributional refinement, and it mirrors the broader intuition in time-series diffusion that decoupling coarse structure and stochastic detail can mitigate “structure vs. realism” conflicts. [6,7] A simple combined objective is [ \mathcal{L} = \lambda,\mathcal{L}*{\text{cont}} + (1-\lambda),\mathcal{L}*{\text{disc}}, ] -with (\lambda\in[0,1]) controlling the balance between continuous and discrete learning. Type-aware routing determines which channels contribute to which loss and which are excluded in favor of deterministic reconstruction. +with (\lambda\in[0,1]) controlling the balance between continuous and discrete learning. Type-aware routing determines which channels contribute to which loss and which are excluded in favor of deterministic reconstruction. In practice, this routing acts as a principled guardrail against negative transfer across variable mechanisms: channels that are best handled deterministically (Type 5) or by specialized drivers (Type 1/3, depending on configuration) are prevented from forcing the diffusion models into statistically incoherent compromises. -At inference time, we generate in the same order: (i) trend (\hat{S}), (ii) residual (\hat{R}) via DDPM, (iii) discrete (\hat{Y}) via masked diffusion, and (iv) type-aware assembly and deterministic reconstruction. +At inference time, generation follows the same structured order: (i) trend (\hat{S}) via the Transformer, (ii) residual (\hat{R}) via DDPM, (iii) discrete (\hat{Y}) via masked diffusion, and (iv) type-aware assembly with deterministic reconstruction for routed variables. This pipeline produces ((\hat{X},\hat{Y})) that are temporally coherent by construction (through (\hat{S})), distributionally expressive (through (\hat{R}) denoising), and discretely valid (through masked diffusion), while explicitly accounting for heterogeneous variable-generating mechanisms through type-aware routing. In combination, these choices constitute our central methodological contribution: a unified Transformer + mixed diffusion generator for ICS telemetry, augmented by typed factorization to align model capacity with domain mechanism. [2,4,10,12] --- # References -[1] Vaswani, A., Shazeer, N., Parmar, N., et al. *Attention Is All You Need.* NeurIPS, 2017. ([arXiv][1]) -[2] Ho, J., Jain, A., Abbeel, P. *Denoising Diffusion Probabilistic Models.* NeurIPS, 2020. ([arXiv][2]) -[3] Austin, J., Johnson, D. D., Ho, J., Tarlow, D., van den Berg, R. *Structured Denoising Diffusion Models in Discrete State-Spaces (D3PM).* NeurIPS, 2021. ([arXiv][3]) +[1] Vaswani, A., Shazeer, N., Parmar, N., et al. *Attention Is All You Need.* NeurIPS, 2017. arXiv:1706.03762. ([arXiv][1]) +[2] Ho, J., Jain, A., Abbeel, P. *Denoising Diffusion Probabilistic Models.* NeurIPS, 2020. arXiv:2006.11239. ([Proceedings of Machine Learning Research][2]) +[3] Austin, J., Johnson, D. D., Ho, J., Tarlow, D., van den Berg, R. *Structured Denoising Diffusion Models in Discrete State-Spaces (D3PM).* NeurIPS, 2021. arXiv:2107.03006. 
([arXiv][3]) [4] Shi, J., Han, K., Wang, Z., Doucet, A., Titsias, M. K. *Simplified and Generalized Masked Diffusion for Discrete Data.* arXiv:2406.04329, 2024. ([arXiv][4]) -[5] Hang, T., Gu, S., Li, C., et al. *Efficient Diffusion Training via Min-SNR Weighting Strategy.* ICCV, 2023. ([arXiv][5]) +[5] Hang, T., Gu, S., Li, C., et al. *Efficient Diffusion Training via Min-SNR Weighting Strategy.* ICCV, 2023. arXiv:2303.09556. ([arXiv][5]) [6] Kollovieh, M., Ansari, A. F., Bohlke-Schneider, M., et al. *Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting (TSDiff).* arXiv:2307.11494, 2023. ([arXiv][6]) -[7] Su, C., Cai, Z., Tian, Y., et al. *Diffusion Models for Time Series Forecasting: A Survey.* arXiv:2507.14507, 2025. ([arXiv][7]) +[7] Sikder, M. F., Ramachandranpillai, R., Heintz, F. *TransFusion: Generating Long, High Fidelity Time Series using Diffusion Models with Transformers.* arXiv:2307.12667, 2023. ([arXiv][7]) +[8] Song, Y., Sohl-Dickstein, J., Kingma, D. P., et al. *Score-Based Generative Modeling through Stochastic Differential Equations.* ICLR, 2021. arXiv:2011.13456. ([arXiv][8]) +[9] Zhang, H., Zhang, J., Li, J., et al. *TabDiff: a Mixed-type Diffusion Model for Tabular Data Generation.* arXiv:2410.20626, 2024. ([arXiv][9]) +[10] Yuan, H., Sha, K., Zhao, W. *CTU-DDPM: Conditional Transformer U-net DDPM for Industrial Control System Anomaly Data Augmentation.* ACM AICSS, 2025. DOI:10.1145/3776759.3776845. +[11] Sha, K., et al. *DDPM Fusing Mamba and Adaptive Attention: An Augmentation Method for Industrial Control Systems Anomaly Data.* SSRN, posted Jan 10, 2026. (SSRN 6055903). ([SSRN][10]) +[12] NIST. *Guide to Operational Technology (OT) Security (SP 800-82r3).* 2023. ([NIST Computer Security Resource Center][11]) -[1]: https://arxiv.org/abs/1706.03762?utm_source=chatgpt.com "Attention Is All You Need" -[2]: https://arxiv.org/abs/2006.11239?utm_source=chatgpt.com "Denoising Diffusion Probabilistic Models" -[3]: https://arxiv.org/abs/2107.03006?utm_source=chatgpt.com "Structured Denoising Diffusion Models in Discrete State-Spaces" -[4]: https://arxiv.org/abs/2406.04329?utm_source=chatgpt.com "Simplified and Generalized Masked Diffusion for Discrete Data" -[5]: https://arxiv.org/abs/2303.09556?utm_source=chatgpt.com "Efficient Diffusion Training via Min-SNR Weighting Strategy" -[6]: https://arxiv.org/abs/2307.11494?utm_source=chatgpt.com "Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting" -[7]: https://arxiv.org/abs/2507.14507?utm_source=chatgpt.com "Diffusion Models for Time Series Forecasting: A Survey" +[1]: https://arxiv.org/abs/1706.03762 "Attention Is All You Need" +[2]: https://arxiv.org/abs/2006.11239 "Denoising Diffusion Probabilistic Models" +[3]: https://arxiv.org/abs/2107.03006 "Structured Denoising Diffusion Models in Discrete State-Spaces" +[4]: https://arxiv.org/abs/2406.04329 "Simplified and Generalized Masked Diffusion for Discrete Data" +[5]: https://arxiv.org/abs/2303.09556 "Efficient Diffusion Training via Min-SNR Weighting Strategy" +[6]: https://arxiv.org/abs/2307.11494 "Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting (TSDiff)" +[7]: https://arxiv.org/abs/2307.12667 "TransFusion: Generating Long, High Fidelity Time Series using Diffusion Models with Transformers" +[8]: https://arxiv.org/abs/2011.13456 "Score-Based Generative Modeling through Stochastic Differential Equations" +[9]: https://arxiv.org/abs/2410.20626 "TabDiff: a Mixed-type Diffusion Model for Tabular Data Generation" +[10]: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6055903 "DDPM Fusing Mamba and Adaptive Attention: An Augmentation Method for Industrial Control Systems Anomaly Data" +[11]:
https://csrc.nist.gov/pubs/sp/800/82/r3/final "Guide to Operational Technology (OT) Security (SP 800-82r3)"