Update knowledges/draft-incomplete-methodology.md

# Methodology

## 1. Problem setting and design motivation

Industrial control system (ICS) telemetry is a **mixed-type** sequential object: it couples **continuous** process dynamics (e.g., sensor values and physical responses) with **discrete** supervisory states (e.g., modes, alarms, interlocks). We model each training instance as a fixed-length window of length $L$, consisting of (i) continuous channels $X\in\mathbb{R}^{L\times d_c}$ and (ii) discrete channels $Y=\{y^{(j)}_{1:L}\}_{j=1}^{d_d}$, where each discrete variable $y^{(j)}_t\in\mathcal{V}_j$ takes values in a finite vocabulary.

**Segmentation.** Raw ICS logs are segmented into fixed-length windows of length $L$ (e.g., $L=128$) to form training examples. Windows may be sampled with overlap to increase the effective data size.

**Continuous channels.** Continuous variables are normalized channel-wise (e.g., z-score normalization) to stabilize optimization. Normalization statistics are computed on the training split and reused at inference.

**Discrete channels.** Each discrete/state variable is mapped to integer tokens in its vocabulary $\mathcal{V}_j$. A dedicated $\texttt{[MASK]}$ token is added for masked diffusion corruption. Optionally, rare categories can be merged to reduce sparsity.

Our methodological objective is to learn a generator that produces synthetic $(\hat{X},\hat{Y})$ that are simultaneously (a) temporally coherent, (b) distributionally faithful, and (c) **semantically valid** for discrete channels (i.e., $\hat{y}^{(j)}_t\in\mathcal{V}_j$ by construction). A key empirical observation motivating our design is that, in ICS, temporal realism and distributional realism often pull the model in different directions when trained monolithically; this tension is amplified by the heterogeneous mechanisms that produce different variables (program steps, controller laws, actuator saturation, physical inertia, and deterministic derived tags). We therefore structure the generator to *separate concerns* and then *specialize*.

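
For concreteness, a minimal sketch of the preprocessing described above is given below (NumPy; the function and array names, the stride, and the tokenization helper are illustrative, not part of the method):

```python
import numpy as np

def make_windows(log_cont: np.ndarray, L: int = 128, stride: int = 64) -> np.ndarray:
    """Slice a (T, d_c) continuous log into overlapping windows of shape (N, L, d_c)."""
    starts = range(0, log_cont.shape[0] - L + 1, stride)
    return np.stack([log_cont[s:s + L] for s in starts])

def zscore_fit(train_windows: np.ndarray):
    """Channel-wise normalization statistics, computed on the training split only."""
    mu = train_windows.mean(axis=(0, 1), keepdims=True)
    sigma = train_windows.std(axis=(0, 1), keepdims=True) + 1e-8
    return mu, sigma

def tokenize_discrete(col: np.ndarray):
    """Map one discrete channel to integer tokens; reserve an extra index for [MASK]."""
    values = sorted(set(col.tolist()))
    to_id = {v: i for i, v in enumerate(values)}
    mask_id = len(values)                      # dedicated [MASK] token id
    return np.array([to_id[v] for v in col]), mask_id
```
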
---

## 2. Overview of Mask-DDPM

We propose **Mask-DDPM**, organized in the following order:

1. **Transformer trend module**: learns the dominant temporal backbone of continuous dynamics via attention-based sequence modeling [1].
2. **Residual DDPM for continuous variables**: models distributional detail as stochastic residual structure conditioned on the learned trend [2].
3. **Masked diffusion for discrete variables**: generates discrete ICS states with an absorbing/masking corruption process and categorical reconstruction [3,4].
4. **Type-aware decomposition**: a performance-oriented refinement layer that routes variables to the most appropriate modeling mechanism and enforces deterministic constraints where warranted.

Formally, continuous observations are decomposed as $X = S + R$, where the trend $S$ is predicted by the Transformer trend module and the residual $\hat{R}$ is sampled from a conditional diffusion model, so that $\hat{X} = \hat{S} + \hat{R}$; discrete variables $Y$ are generated via masked diffusion, which guarantees that outputs remain in the discrete vocabulary.

This methodology sits at the intersection of attention-based sequence modeling [1] and diffusion modeling [2], and is motivated by recent diffusion-based approaches to time-series synthesis [6,7], while explicitly addressing mixed discrete–continuous structure with masked diffusion objectives for discrete data [3,4].

This ordering is intentional: the trend module fixes the *macro-temporal scaffold*; diffusion then focuses on *micro-structure and marginal fidelity*; masked diffusion guarantees *discrete legality*; and type-aware routing resolves the remaining mismatch caused by heterogeneous variable-generating mechanisms.

---

## 3. Transformer trend module for continuous dynamics

### 3.1 Trend–residual decomposition

For continuous channels $X$, we posit an additive decomposition

$$
X = S + R,
$$

where $S\in\mathbb{R}^{L\times d_c}$ is a smooth **trend** capturing predictable temporal evolution, and $R\in\mathbb{R}^{L\times d_c}$ is a **residual** capturing distributional detail (e.g., bursts, heavy tails, local fluctuations) that is difficult to represent robustly with a purely regression-based temporal objective.

### 3.2 Causal Transformer parameterization

We parameterize the trend $S$ with a **causal Transformer** $f_\phi$ [1]. Concretely, with teacher forcing we train $f_\phi$ to predict the next-step trend from past observations,

$$
\hat{S}_{t+1} = f_\phi(X_{1:t}), \qquad t=1,\dots,L-1,
$$

using the mean-squared-error objective

$$
\mathcal{L}_{\text{trend}}(\phi)=\frac{1}{(L-1)\,d_c}\sum_{t=1}^{L-1}\left\| \hat{S}_{t+1} - X_{t+1}\right\|_2^2.
$$

Self-attention is particularly suitable here because it provides a direct mechanism for learning cross-channel interactions and long-range temporal dependencies without recurrence, which is important when control actions influence downstream variables with nontrivial delays [1].

At inference, we roll out the Transformer autoregressively to obtain $\hat{S}$, and then define the residual target for diffusion as $R = X - \hat{S}$.

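
A minimal sketch of the trend module and its next-step objective (PyTorch; the class name, layer sizes, and the omission of positional encodings are illustrative simplifications):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrendTransformer(nn.Module):
    """Causal Transformer f_phi mapping X_{1:t} to a one-step-ahead trend estimate.
    Positional encodings are omitted here for brevity."""
    def __init__(self, d_c: int, d_model: int = 128, nhead: int = 4, num_layers: int = 3):
        super().__init__()
        self.inp = nn.Linear(d_c, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.out = nn.Linear(d_model, d_c)

    def forward(self, x):                       # x: (B, t, d_c)
        t = x.size(1)
        causal = torch.triu(torch.ones(t, t, device=x.device, dtype=torch.bool), diagonal=1)
        h = self.encoder(self.inp(x), mask=causal)   # position i only attends to <= i
        return self.out(h)                      # (B, t, d_c): position i predicts step i+1

def trend_loss(model: TrendTransformer, x):     # x: (B, L, d_c), teacher forcing
    pred = model(x[:, :-1])                     # predict S_{2:L} from X_{1:L-1}
    return F.mse_loss(pred, x[:, 1:])
```
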
#### Uniqueness note (trend module)

*While Transformers are now standard in sequence modeling, our use of a Transformer explicitly as a **trend extractor**—whose sole role is to provide conditioning for subsequent diffusion rather than to serve as the primary generator—creates a clean separation between temporal structure and distributional refinement in the ICS setting.*

---

## 4. DDPM for continuous residual generation

### 4.1 Conditional diffusion on residuals

We model the residual $R$ with a denoising diffusion probabilistic model (DDPM) conditioned on the trend $\hat{S}$ [2]. Let $K$ denote the number of diffusion steps, with a noise schedule $\{\beta_k\}_{k=1}^K$, $\alpha_k = 1-\beta_k$, and $\bar{\alpha}_k=\prod_{i=1}^k \alpha_i$. The forward corruption process is

$$
q(r_k\mid r_0)=\mathcal{N}\!\left(\sqrt{\bar{\alpha}_k}\,r_0,\ (1-\bar{\alpha}_k)\mathbf{I}\right),
$$

equivalently,

$$
r_k = \sqrt{\bar{\alpha}_k}\,r_0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon,\qquad \epsilon\sim\mathcal{N}(0,\mathbf{I}).
$$

The learned reverse process is parameterized as

$$
p_{\theta}(r_{k-1}\mid r_k,\hat{S})=\mathcal{N}\!\left(\mu_{\theta}(r_k,k,\hat{S}),\ \Sigma_{\theta}(k)\right),
$$

where $\mu_\theta$ is implemented by a **Transformer denoiser** that consumes (i) the noised residual $r_k$, (ii) a timestep embedding for $k$, and (iii) conditioning features derived from $\hat{S}$. Conditioning is crucial: it prevents the diffusion model from re-learning the low-frequency temporal scaffold and concentrates its capacity on matching the residual distribution.

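
A minimal sketch of this forward corruption, used to build training pairs $(r_k,\epsilon)$ (PyTorch; the linear schedule and its endpoints are illustrative defaults, not prescribed by the method):

```python
import torch

def linear_beta_schedule(K: int = 600, beta_1: float = 1e-4, beta_K: float = 2e-2):
    """Pre-compute the noise schedule and the cumulative products alpha_bar_k."""
    betas = torch.linspace(beta_1, beta_K, K)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    return betas, alphas_bar

def q_sample(r0: torch.Tensor, k: torch.Tensor, alphas_bar: torch.Tensor):
    """Draw r_k ~ q(r_k | r_0) = N(sqrt(abar_k) r_0, (1 - abar_k) I) for a batch of residuals."""
    abar = alphas_bar[k].view(-1, 1, 1)        # (B, 1, 1), broadcast over (B, L, d_c)
    eps = torch.randn_like(r0)
    return abar.sqrt() * r0 + (1.0 - abar).sqrt() * eps, eps
```
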
### 4.2 Training objective and loss shaping

We train the denoiser using the standard DDPM $\epsilon$-prediction objective [2]:

$$
\mathcal{L}_{\text{cont}}(\theta)
= \mathbb{E}_{k,r_0,\epsilon}
\left[
\left\|
\epsilon - \epsilon_{\theta}(r_k,k,\hat{S})
\right\|_2^2
\right].
$$

Because diffusion optimization can exhibit timestep imbalance and gradient conflict, we optionally apply an SNR-based reweighting consistent with Min-SNR training [5]:

$$
\mathcal{L}^{\text{snr}}_{\text{cont}}(\theta)
= \mathbb{E}_{k,r_0,\epsilon}
\left[
w(k)\left\|
\epsilon - \epsilon_{\theta}(r_k,k,\hat{S})
\right\|_2^2
\right],
\qquad
w(k)=\frac{\mathrm{SNR}_k}{\mathrm{SNR}_k+\gamma}.
$$

(We treat this as an ablation: it is not required by the method, but can improve training efficiency and stability in practice [5].)

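
A sketch of the corresponding training step with the optional reweighting above (PyTorch; `denoiser` stands for any network with the stated inputs, and $\gamma$ is a tunable constant):

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, r0, s_hat, alphas_bar,
                            gamma: float = 5.0, use_min_snr: bool = False):
    """One epsilon-prediction step of the trend-conditioned residual DDPM."""
    B, K = r0.size(0), alphas_bar.numel()
    k = torch.randint(0, K, (B,), device=r0.device)          # random timestep per example
    abar = alphas_bar[k].view(-1, 1, 1)
    eps = torch.randn_like(r0)
    r_k = abar.sqrt() * r0 + (1.0 - abar).sqrt() * eps       # forward corruption q(r_k | r_0)
    eps_hat = denoiser(r_k, k, s_hat)                        # conditioned on the Transformer trend
    loss = F.mse_loss(eps_hat, eps, reduction="none").mean(dim=(1, 2))   # per-example MSE
    if use_min_snr:                                          # optional w(k) = SNR_k / (SNR_k + gamma)
        snr = alphas_bar[k] / (1.0 - alphas_bar[k])
        loss = snr / (snr + gamma) * loss
    return loss.mean()
```
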

### 4.3 Continuous reconstruction

After sampling $\hat{R}$ by reverse diffusion, we reconstruct the continuous output as

$$
\hat{X} = \hat{S} + \hat{R}.
$$

This design makes the role of diffusion explicit: it acts as a **distributional corrector** on top of a temporally coherent backbone.

#### Uniqueness note (continuous branch)

*The central integration here is not “Transformer + diffusion” in isolation, but rather a **trend-conditioned residual diffusion** formulation: diffusion is trained on residual structure explicitly defined relative to a Transformer trend model, which is particularly aligned with ICS where low-frequency dynamics are strong and persistent.*

---

## 5. Masked diffusion for discrete ICS variables

### 5.1 Discrete corruption via absorbing masks

Discrete ICS variables must remain categorical, making Gaussian diffusion inappropriate. We therefore use **masked (absorbing) diffusion** for discrete channels [3,4], in which corruption replaces tokens with a special $\texttt{[MASK]}$ symbol according to a schedule.

For a discrete sequence $y^{(j)}_{1:L}$, define a corruption level $m_k\in[0,1]$ that increases with $k$. The corrupted sequence at step $k$ is formed by independent masking:

$$
y^{(j,k)}_t =
\begin{cases}
\texttt{[MASK]}, & \text{with probability } m_k,\\
y^{(j)}_t, & \text{otherwise}.
\end{cases}
$$

This “absorbing” mechanism yields a diffusion-like process over discrete states while preserving a clear reconstruction target at every step [3,4].

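
A minimal sketch of this corruption for one discrete channel (PyTorch; the linear schedule $m_k = k/K$ and the argument names are illustrative choices):

```python
import torch

def mask_corrupt(y0: torch.Tensor, k: torch.Tensor, K: int, mask_id: int):
    """Independently replace tokens of y0 (B, L) with [MASK] with probability m_k = k / K."""
    m_k = (k.float() / K).view(-1, 1)                       # (B, 1), broadcast over positions
    dropped = torch.rand(y0.shape, device=y0.device) < m_k  # True where the token is absorbed
    y_k = torch.where(dropped, torch.full_like(y0, mask_id), y0)
    return y_k, dropped
```
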
### 5.2 Categorical denoising objective

We parameterize the reverse model with a Transformer $h_\psi$ that outputs a categorical distribution over $\mathcal{V}_j$ for each masked position. Let $\mathcal{M}$ be the set of masked indices. We train with cross-entropy over masked positions:

$$
\mathcal{L}_{\text{disc}}(\psi)
= \mathbb{E}_{k}
\left[
\frac{1}{|\mathcal{M}|}\sum_{(j,t)\in\mathcal{M}}
\mathrm{CE}\!\left(
h_\psi\!\left(y^{(k)},k,\hat{S}\right)_{j,t},\ y^{(j)}_t
\right)
\right].
$$

Conditioning on $\hat{S}$ (and, optionally, on $\hat{X}$) allows discrete state generation to reflect the continuous context, which is often essential in ICS (e.g., mode transitions coupled to process regimes).

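
A sketch of this masked cross-entropy objective for a single discrete channel (PyTorch; `denoiser` stands for $h_\psi$ returning per-position logits, and its interface is an assumption):

```python
import torch
import torch.nn.functional as F

def masked_ce_step(denoiser, y0, k, K, s_hat, mask_id):
    """Cross-entropy on masked positions only, as in L_disc."""
    m_k = (k.float() / K).view(-1, 1)                         # corruption level m_k = k / K
    dropped = torch.rand(y0.shape, device=y0.device) < m_k    # positions absorbed into [MASK]
    y_k = torch.where(dropped, torch.full_like(y0, mask_id), y0)
    logits = denoiser(y_k, k, s_hat)                          # (B, L, |V_j|), conditioned on trend
    ce = F.cross_entropy(logits.transpose(1, 2), y0, reduction="none")   # (B, L)
    return (ce * dropped.float()).sum() / dropped.float().sum().clamp(min=1.0)
```
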

### 5.3 Sampling

At inference, we initialize $y^{(K)}$ as fully masked and iteratively denoise/unmask from $k=K$ to $1$, sampling valid tokens from the predicted categorical distributions. Because outputs are always drawn from $\mathcal{V}_j$, semantic validity is satisfied by construction.

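
A sketch of the reverse unmasking loop (PyTorch; revealing roughly a $1/k$ fraction of the remaining masked positions per step is one simple schedule among several reasonable choices):

```python
import torch

@torch.no_grad()
def sample_discrete(denoiser, B: int, L: int, K: int, s_hat, mask_id: int):
    """Start fully masked; progressively commit sampled tokens from k = K down to 1."""
    y = torch.full((B, L), mask_id, dtype=torch.long)
    for k in reversed(range(1, K + 1)):
        k_vec = torch.full((B,), k, dtype=torch.long)
        probs = denoiser(y, k_vec, s_hat).softmax(dim=-1)        # (B, L, |V_j|)
        proposal = torch.distributions.Categorical(probs=probs).sample()
        reveal = y.eq(mask_id) & (torch.rand(B, L) < 1.0 / k)    # unmask ~1/k of what remains
        y = torch.where(reveal, proposal, y)                     # committed tokens never change
    return y                                                     # every token lies in the vocabulary
```
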
#### Uniqueness note (discrete branch)

*To our knowledge, diffusion-based ICS synthesis has predominantly emphasized continuous multivariate generation; in contrast, we introduce masked diffusion as a **first-class discrete legality mechanism** within the same pipeline, rather than treating discrete tags as continuous surrogates or post-hoc thresholds.*

---

## 6. Type-aware decomposition as a performance refinement layer

### 6.1 Motivation: mechanistic heterogeneity in ICS variables

Even with a strong continuous generator and a discrete legality mechanism, ICS data pose an additional challenge: variables arise from **qualitatively different generative mechanisms**. For example, program-driven setpoints are often piecewise constant with abrupt changes; actuators exhibit saturation and dwell; some channels are deterministic functions of others (derived tags). Modeling all channels identically can cause persistent error concentration and degrade overall quality.

We therefore introduce a **type-aware decomposition** that operates on top of the base pipeline to improve fidelity, stability, and interpretability.

### 6.2 Typing function and routing

Let $\tau(i)\in\{1,\dots,6\}$ assign each variable $i$ to a type. This induces a partition of variables into index sets $\{\mathcal{I}_k\}_{k=1}^6$. We then define the generator as a composition of type-specific operators:

$$
(\hat{X},\hat{Y}) = \mathcal{A}\Big(\hat{S},\hat{R},\hat{Y}_{\text{mask}},\{\mathcal{G}_k\}_{k=1}^6;\tau\Big),
$$

where $\mathcal{A}$ assembles a complete sample and $\mathcal{G}_k$ encodes the appropriate modeling choice per type.

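
As an illustration, the typing function $\tau$ can be represented as a simple mapping from channel names to types, from which the index sets $\mathcal{I}_k$ are derived (Python; the tag names and the specific assignments are hypothetical):

```python
from enum import IntEnum

class VarType(IntEnum):
    PROGRAM = 1      # Type 1: program-driven / setpoint-like drivers
    CONTROLLER = 2   # Type 2: controller outputs
    ACTUATOR = 3     # Type 3: actuator states/positions
    PROCESS = 4      # Type 4: inertia-dominated process variables
    DERIVED = 5      # Type 5: deterministic functions of other channels
    AUXILIARY = 6    # Type 6: weakly coupled, low-impact signals

# tau: channel name -> type (hypothetical assignment for illustration only)
tau = {
    "P1_SETPOINT": VarType.PROGRAM,
    "P1_VALVE_CMD": VarType.CONTROLLER,
    "P1_VALVE_POS": VarType.ACTUATOR,
    "P1_LEVEL": VarType.PROCESS,
    "P1_FLOW_TOTAL": VarType.DERIVED,
    "P1_AUX_TEMP": VarType.AUXILIARY,
}

def index_sets(tau: dict) -> dict:
    """Partition channel names into the index sets I_k consumed by the assembly operator A."""
    groups: dict = {t: [] for t in VarType}
    for name, t in tau.items():
        groups[t].append(name)
    return groups
```
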
### 6.3 Six-type schema and modeling commitments

We operationalize six types that map cleanly onto common ICS semantics:

* **Type 1 (program-driven / setpoint-like):** exogenous drivers with step changes and long dwell. These variables are treated as conditioning signals or modeled with a dedicated change-point-aware generator rather than forcing them into residual diffusion.
* **Type 2 (controller outputs):** variables that respond to setpoints and process feedback; we treat them as conditional on Type 1 and continuous context, and allow separate specialization if they dominate error.
* **Type 3 (actuator states/positions):** bounded, often quantized, with saturation and dwell; we route them to discrete masked diffusion when naturally categorical/quantized, or to specialized dwell-aware modeling when continuous but state-persistent.
* **Type 4 (process variables):** inertia-dominated continuous dynamics; these are the primary beneficiaries of the **Transformer trend + residual DDPM** pipeline.
* **Type 5 (derived/deterministic variables):** algebraic or rule-based functions of other variables; we enforce deterministic reconstruction $\hat{x}^{(i)} = g_i(\hat{X},\hat{Y})$ rather than learning a stochastic generator, improving logical consistency and sample efficiency (see the sketch after this list).
* **Type 6 (auxiliary/low-impact variables):** weakly coupled or sparse signals; we allow simplified modeling (e.g., calibrated marginals or lightweight temporal models) to avoid allocating diffusion capacity where it is not warranted.

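
For Type 5 channels, the deterministic rule $g_i$ is applied to already-generated channels instead of being sampled. A hedged example (the "totalizer" tag and the sampling period are hypothetical):

```python
import numpy as np

def reconstruct_totalizer(flow: np.ndarray, dt: float = 1.0) -> np.ndarray:
    """Example g_i: a derived running-total tag computed from a generated flow channel.
    Because it is computed rather than sampled, it is always consistent with its parent channel."""
    return np.cumsum(flow * dt)
```
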
### 6.4 Training-time and inference-time integration

Type-aware decomposition improves performance through three concrete mechanisms:

1. **Capacity allocation:** by focusing diffusion on Type 4 (and selected Type 2/3), we reduce the tendency for a few mechanistically distinct variables to dominate gradients and distort the learned distribution elsewhere.
2. **Constraint enforcement:** Type 5 variables are computed deterministically, preventing logically inconsistent samples that a purely learned generator may produce.
3. **Mechanism alignment:** Type 1/3 variables receive inductive biases consistent with step-like or dwell-like behavior, which diffusion-trained smooth denoisers can otherwise over-regularize.

In practice, $\tau$ can be initialized from domain semantics (tag metadata and value domains) and refined via an error-attribution loop described in the Benchmark section. Importantly, this refinement does not alter the core generative backbone; it *re-routes* difficult variables to better-suited mechanisms while preserving end-to-end coherence.

#### Uniqueness note (type-aware layer)

*The type-aware decomposition is not merely a diagnostic taxonomy; it is a modeling layer that **changes the generator’s factorization**—deciding which variables are best treated as conditional drivers, which require dwell-aware behavior, and which should be deterministic—thereby tailoring the diffusion pipeline to the structural heterogeneity characteristic of ICS telemetry.*

---

## 7. Joint optimization and end-to-end sampling

We train the model in a staged manner consistent with the factorization:

1. Train the trend Transformer $f_\phi$ to obtain $S$.
2. Compute residual targets $R=X-S$ for Type 4 (and any routed continuous types).
3. Train the residual DDPM $p_\theta(R\mid S)$ and the masked diffusion model $p_\psi(Y\mid \text{masked}(Y),S)$.
4. Apply type-aware routing and deterministic reconstruction rules during sampling.

A simple combined objective is

$$
\mathcal{L} = \lambda\,\mathcal{L}_{\text{cont}} + (1-\lambda)\,\mathcal{L}_{\text{disc}},
$$

with $\lambda\in[0,1]$ controlling the balance between continuous and discrete learning. Type-aware routing determines which channels contribute to which loss and which are excluded in favor of deterministic reconstruction.

In our implementation, we typically use $L=128$ and a diffusion horizon of up to $K=600$ steps, trading off sample quality against compute. Transformer backbones increase training cost due to $O(L^2)$ attention, but provide a principled mechanism for long-range temporal dependency modeling that is especially relevant in ICS settings [1].

At inference time, given an optional seed prefix (or a sampled initial context), we generate in the same order:

1. **Trend rollout:** use $f_\phi$ to produce $\hat{S}$ over $L$ steps.
2. **Continuous residual sampling:** sample $\hat{R}$ by reverse DDPM from noise, producing $\hat{X}=\hat{S}+\hat{R}$.
3. **Discrete sampling:** initialize $\hat{Y}$ as fully masked and iteratively unmask/denoise with the masked diffusion reverse model until all tokens are assigned [3,4].
4. **Type-aware assembly:** apply routing and deterministic reconstruction, and return $(\hat{X},\hat{Y})$ as a synthetic ICS window.

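
The complete sampling order can be summarized as follows (Python-style sketch; the component objects stand for the modules described above, and their method names are illustrative, not a fixed API):

```python
def generate_window(trend_model, residual_ddpm, masked_diffusion, assemble, tau, L: int = 128):
    """End-to-end Mask-DDPM sampling of one synthetic ICS window."""
    s_hat = trend_model.rollout(L)                    # (i)   autoregressive trend rollout
    r_hat = residual_ddpm.sample(cond=s_hat)          # (ii)  residual via reverse DDPM
    x_hat = s_hat + r_hat                             # continuous reconstruction X = S + R
    y_hat = masked_diffusion.sample(cond=s_hat)       # (iii) discrete channels via unmasking
    return assemble(x_hat, y_hat, tau)                # (iv)  type-aware assembly + derived tags
```
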
---

## References

[1] Vaswani, A., Shazeer, N., Parmar, N., et al. *Attention Is All You Need.* NeurIPS, 2017. ([arXiv][1])

[2] Ho, J., Jain, A., Abbeel, P. *Denoising Diffusion Probabilistic Models.* NeurIPS, 2020. ([arXiv][2])

[3] Austin, J., Johnson, D. D., Ho, J., Tarlow, D., van den Berg, R. *Structured Denoising Diffusion Models in Discrete State-Spaces (D3PM).* NeurIPS, 2021. ([arXiv][3])

[4] Shi, J., Han, K., Wang, Z., Doucet, A., Titsias, M. K. *Simplified and Generalized Masked Diffusion for Discrete Data.* arXiv:2406.04329, 2024. ([arXiv][4])

[5] Hang, T., Gu, S., Li, C., et al. *Efficient Diffusion Training via Min-SNR Weighting Strategy.* ICCV, 2023. ([arXiv][5])

[6] Kollovieh, M., Ansari, A. F., Bohlke-Schneider, M., et al. *Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting (TSDiff).* arXiv:2307.11494, 2023. ([arXiv][6])

[7] Su, C., Cai, Z., Tian, Y., et al. *Diffusion Models for Time Series Forecasting: A Survey.* arXiv:2507.14507, 2025. ([arXiv][7])


[1]: https://arxiv.org/abs/1706.03762 "Attention Is All You Need"
[2]: https://arxiv.org/abs/2006.11239 "Denoising Diffusion Probabilistic Models"
[3]: https://arxiv.org/abs/2107.03006 "Structured Denoising Diffusion Models in Discrete State-Spaces"
[4]: https://arxiv.org/abs/2406.04329 "Simplified and Generalized Masked Diffusion for Discrete Data"
[5]: https://arxiv.org/abs/2303.09556 "Efficient Diffusion Training via Min-SNR Weighting Strategy"
[6]: https://arxiv.org/abs/2307.11494 "Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting"
[7]: https://arxiv.org/abs/2507.14507 "Diffusion Models for Time Series Forecasting: A Survey"