Methodology
Problem setting and notation
We consider multivariate industrial control system (ICS) time series composed of continuous process variables (PVs) and discrete/state variables (e.g., modes, alarms, categorical tags). Let
$$X \in \mathbb{R}^{L \times d_c},\qquad Y \in \{1,\dots,V\}^{L \times d_d},$$
denote a length-$L$ segment with $d_c$ continuous channels and $d_d$ discrete channels (vocabulary size $V$). Our goal is to learn a generative model that produces realistic synthetic sequences $(\hat{X}, \hat{Y})$ matching (i) temporal dynamics, (ii) marginal/conditional distributions, and (iii) discrete semantic validity.
A core empirical difficulty in ICS data synthesis is that a single model optimized end-to-end often trades distributional fidelity against temporal coherence. We therefore adopt a decoupled, two-stage framework ("Mask-DDPM") that separates (A) trend/dynamics modeling from (B) residual distribution modeling, and further separates (B1) continuous from (B2) discrete generation.
Overview of the proposed Mask-DDPM framework
Mask-DDPM factorizes generation into:
- Stage 1 (Temporal trend modeling): learn a deterministic trend estimator $T$ for continuous dynamics using a Transformer backbone (instead of a GRU), producing a smooth, temporally consistent "skeleton".
- Stage 2 (Residual distribution modeling): generate (i) continuous residuals with a DDPM and (ii) discrete variables with masked (absorbing) diffusion, then reconstruct the final signal.
Formally, we decompose continuous observations as
$$R = X - T,\qquad \hat{X} = T + \hat{R},$$
where $T$ is predicted by the Transformer trend model and $\hat{R}$ is sampled from a residual diffusion model. Discrete variables $Y$ are generated via masked diffusion to guarantee that outputs remain in the discrete vocabulary.
This methodology sits at the intersection of diffusion modeling [2] and attention-based sequence modeling [1], and is motivated by recent diffusion-based approaches to time-series synthesis in general [12] and in industrial/ICS contexts [13], while explicitly addressing mixed discrete–continuous structure using masked diffusion objectives for discrete data [5].
Data representation and preprocessing
Segmentation. Raw ICS logs are segmented into fixed-length windows of length $L$ (e.g., $L=128$) to form training examples. Windows may be sampled with overlap to increase the effective data size.
Continuous channels. Continuous variables are normalized channel-wise to stabilize optimization (e.g., z-score normalization). Normalization statistics are computed on the training split and reused at inference.
Discrete channels. Each discrete/state variable is mapped to integer tokens in $\{1,\dots,V\}$. A dedicated [MASK] token is added for masked diffusion corruption. Optionally, rare categories can be merged to reduce sparsity.
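For concreteness, the sketch below shows one way to realize this preprocessing (windowing with overlap, channel-wise z-scoring, and discrete tokenization with a reserved mask id). The array names, the stride, and the convention MASK_ID = 0 are illustrative assumptions, not details fixed by the method.

```python
# Illustrative preprocessing sketch (assumed inputs: numpy arrays `cont` of
# shape [N, d_c] and `disc` of shape [N, d_d] holding one raw ICS log).
import numpy as np

MASK_ID = 0  # assumption: 0 is reserved for [MASK]; real tokens are 1..V

def make_windows(arr, L=128, stride=64):
    """Slice a long log into overlapping windows of length L."""
    return np.stack([arr[s:s + L] for s in range(0, len(arr) - L + 1, stride)])

def fit_zscore(train_cont):
    """Channel-wise z-score statistics on the raw training log ([N, d_c])."""
    mu = train_cont.mean(axis=0)
    sigma = train_cont.std(axis=0) + 1e-8
    return mu, sigma

def tokenize_discrete(disc):
    """Map each discrete channel to integer tokens in {1, ..., V_j}."""
    tokens = np.zeros_like(disc, dtype=np.int64)
    lookups = []
    for j in range(disc.shape[-1]):
        cats = np.unique(disc[:, j])
        lookup = {c: i + 1 for i, c in enumerate(cats)}   # 1-based; 0 = [MASK]
        tokens[:, j] = np.vectorize(lookup.get)(disc[:, j])
        lookups.append(lookup)
    return tokens, lookups
```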
Stage 1: Transformer-based temporal trend model
We model the predictable, low-frequency temporal structure of continuous channels using a causal Transformer [1]. Let $X_{1:t}$ denote the prefix up to time $t$. The trend model $f_\phi$ produces one-step-ahead predictions
$$\hat{T}_{t+1} = f_\phi(X_{1:t}),\qquad t=1,\dots,L-1,$$
and is trained with mean squared error (teacher forcing):
$$\mathcal{L}_{\mathrm{trend}} = \frac{1}{(L-1)\,d_c}\sum_{t=1}^{L-1}\left\| \hat{T}_{t+1} - X_{t+1}\right\|_2^2.$$
After training, the model is rolled out over a window to obtain $T \in \mathbb{R}^{L \times d_c}$. This explicit trend extraction reduces the burden on diffusion to simultaneously learn "where the sequence goes" and "how values are distributed," a tension frequently observed in diffusion/time-series settings [12,13].
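A minimal PyTorch sketch of the Stage-1 trend model is shown below; the layer sizes, the learned positional embedding, and the class name TrendTransformer are illustrative choices rather than specifications of the method.

```python
# Minimal Stage-1 sketch: causal Transformer trained with teacher forcing.
import torch
import torch.nn as nn

class TrendTransformer(nn.Module):
    def __init__(self, d_c, d_model=128, nhead=4, num_layers=3, max_len=128):
        super().__init__()
        self.in_proj = nn.Linear(d_c, d_model)
        self.pos = nn.Parameter(torch.randn(1, max_len, d_model) * 0.02)  # learned positions
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.out_proj = nn.Linear(d_model, d_c)

    def forward(self, x):                                   # x: [B, t, d_c]
        t = x.size(1)
        h = self.in_proj(x) + self.pos[:, :t]
        causal = nn.Transformer.generate_square_subsequent_mask(t).to(x.device)
        h = self.encoder(h, mask=causal)                    # causal self-attention
        return self.out_proj(h)                             # next-step prediction at each position

def trend_loss(model, x):
    """Teacher-forced MSE: predict X_{t+1} from X_{1:t} for t = 1..L-1."""
    pred = model(x[:, :-1])                                 # \hat{T}_{2..L}
    return torch.mean((pred - x[:, 1:]) ** 2)
```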
Stage 2A: Continuous residual modeling with DDPM
We learn a denoising diffusion probabilistic model (DDPM) over residuals $R$ [2]. The forward (noising) process gradually perturbs residuals with Gaussian noise:
$$q(r_t\mid r_{t-1})=\mathcal{N}\!\left(\sqrt{1-\beta_t}\,r_{t-1},\;\beta_t I\right),$$
with $t=1,\dots,T_{\mathrm{diff}}$ diffusion steps and a pre-defined schedule $\{\beta_t\}$ (we write $T_{\mathrm{diff}}$ for the number of diffusion steps to avoid a clash with the trend $T$). This yields the closed form
$$r_t=\sqrt{\bar{\alpha}_t}\,r_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon,\qquad \epsilon\sim \mathcal{N}(0,I),$$
where $\alpha_t=1-\beta_t$ and $\bar{\alpha}_t=\prod_{s=1}^t \alpha_s$.
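Under this closed form, forward noising reduces to a single sampling step. The sketch below assumes PyTorch and a linear $\beta$ schedule; both are illustrative choices.

```python
# Forward-process sketch: schedule construction and closed-form q(r_t | r_0).
import torch

def make_schedule(T_diff=600, beta_start=1e-4, beta_end=2e-2):
    """Linear beta schedule (illustrative); returns betas and \bar{alpha}_t."""
    betas = torch.linspace(beta_start, beta_end, T_diff)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    return betas, alphas_cumprod

def q_sample(r0, t, alphas_cumprod):
    """r_t = sqrt(abar_t) r_0 + sqrt(1 - abar_t) eps, for a batch of timesteps t."""
    abar = alphas_cumprod.to(r0.device)[t].view(-1, 1, 1)   # broadcast over [B, L, d_c]
    eps = torch.randn_like(r0)
    return abar.sqrt() * r0 + (1.0 - abar).sqrt() * eps, eps
```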
A Transformer-based denoiser $g_\theta$ parameterizes the reverse process by predicting either the added noise $\epsilon$ or the clean residual $r_0$:
$$\hat{\epsilon}=g_\theta(r_t,t)\quad\text{or}\quad \hat{r}_0=g_\theta(r_t,t).$$
We use the standard DDPM objective (two equivalent parameterizations commonly used in practice [2]):
$$\mathcal{L}_{\mathrm{cont}} = \begin{cases} \left\|\hat{\epsilon}-\epsilon\right\|_2^2 & (\epsilon\text{-prediction})\\[4pt] \left\|\hat{r}_0-r_0\right\|_2^2 & (r_0\text{-prediction}) \end{cases}$$
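The corresponding training step for the $\epsilon$-prediction variant could look as follows (reusing q_sample from the previous sketch; the denoiser signature $g_\theta(r_t, t)$ and the uniform timestep sampling are assumptions).

```python
# One optimization step of the residual DDPM under epsilon-prediction.
import torch

def ddpm_train_step(denoiser, r0, alphas_cumprod, optimizer):
    T_diff = alphas_cumprod.numel()
    t = torch.randint(0, T_diff, (r0.size(0),), device=r0.device)  # uniform timesteps
    r_t, eps = q_sample(r0, t, alphas_cumprod)                     # closed-form noising
    loss = torch.mean((denoiser(r_t, t) - eps) ** 2)               # L_cont, eps-prediction
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```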
SNR-weighted training (optional). To mitigate optimization imbalance across timesteps, we optionally apply an SNR-based weighting strategy:
$$\mathcal{L}_{\mathrm{snr}}=\frac{\mathrm{SNR}_t}{\mathrm{SNR}_t+\gamma}\,\mathcal{L}_{\mathrm{cont}},$$
which is conceptually aligned with Min-SNR-style diffusion reweighting [3].
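For this weighting, the per-sample factor can be computed directly from $\bar{\alpha}_t$, since $\mathrm{SNR}_t = \bar{\alpha}_t/(1-\bar{\alpha}_t)$ for the Gaussian forward process; the value of $\gamma$ is a hyperparameter (the 5.0 below is only a placeholder).

```python
# SNR-based loss weight, matching the weighting form stated above.
def snr_weight(t, alphas_cumprod, gamma=5.0):
    abar = alphas_cumprod[t]
    snr = abar / (1.0 - abar)      # SNR_t of the forward process
    return snr / (snr + gamma)     # multiply the per-sample L_cont by this factor
```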
Residual reconstruction. At inference, we sample $\hat{R}$ by iterative denoising from Gaussian noise, then reconstruct the final continuous output:
$$\hat{X}=T+\hat{R}.$$
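Ancestral DDPM sampling of the residuals can be sketched as below; this is the $\epsilon$-prediction form of the standard reverse update, and the fixed-variance choice $\sigma_t^2 = \beta_t$ is an assumption.

```python
# Reverse-process sketch: iterative denoising from Gaussian noise.
import torch

@torch.no_grad()
def sample_residual(denoiser, shape, betas, alphas_cumprod, device="cpu"):
    betas = betas.to(device)
    abar = alphas_cumprod.to(device)
    alphas = 1.0 - betas
    r = torch.randn(shape, device=device)                       # start from pure noise
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps_hat = denoiser(r, t_batch)
        coef = betas[t] / (1.0 - abar[t]).sqrt()
        r = (r - coef * eps_hat) / alphas[t].sqrt()             # posterior mean
        if t > 0:
            r = r + betas[t].sqrt() * torch.randn_like(r)       # sigma_t^2 = beta_t
    return r                                                    # \hat{R}; reconstruct \hat{X} = T + \hat{R}
```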
Stage 2B: Discrete variable modeling with masked diffusion
Discrete ICS channels must remain semantically valid (i.e., categorical, not fractional). Instead of continuous diffusion on $Y$, we use masked (absorbing) diffusion [5], which corrupts sequences by replacing tokens with a special mask symbol and trains the model to recover them.
Forward corruption. Given a schedule $m_t\in[0,1]$ increasing with $t$, we sample a masked version $y_t$ by independently masking positions:
$$y_t^{(i)} = \begin{cases} \texttt{[MASK]} & \text{with prob. } m_t,\\ y_0^{(i)} & \text{otherwise.} \end{cases}$$
This "absorbing" corruption is a discrete analogue of diffusion and underpins modern masked diffusion formulations [5].
Reverse model and loss. A Transformer $h_\psi$ outputs a categorical distribution over the vocabulary for each position. We compute cross-entropy only on the masked positions $\mathcal{M}$:
$$\mathcal{L}_{\mathrm{disc}}=\frac{1}{|\mathcal{M}|}\sum_{(i,t)\in \mathcal{M}} \mathrm{CE}\!\left(\hat{p}_{i,t},\,y_{i,t}\right).$$
This guarantees that decoded samples belong to the discrete vocabulary by construction. Masked diffusion can be viewed as a simplified, scalable alternative within the broader family of discrete diffusion models [4,5].
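A sketch of the corruption step and the masked-only cross-entropy follows. MASK_ID = 0 and the logits shape [B, L, d_d, V+1] are illustrative conventions, and the classifier $h_\psi$ is passed in as model.

```python
# Masked (absorbing) corruption and cross-entropy on masked positions only.
import torch
import torch.nn.functional as F

MASK_ID = 0  # assumption: 0 reserved for [MASK], real tokens are 1..V

def mask_tokens(y0, m_t):
    """Independently replace each token with [MASK] with probability m_t."""
    masked = torch.rand(y0.shape, device=y0.device) < m_t
    y_t = y0.clone()
    y_t[masked] = MASK_ID
    return y_t, masked

def masked_ce_loss(model, y0, m_t):
    """L_disc: average CE over masked positions, given logits [B, L, d_d, V+1]."""
    y_t, masked = mask_tokens(y0, m_t)
    logits = model(y_t)
    return F.cross_entropy(logits[masked], y0[masked])
```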
Joint objective and training protocol
We train Stage 1 and Stage 2 sequentially:
- Train the trend Transformer $f_\phi$ on continuous channels to obtain $T$.
- Compute residuals $R = X - T$.
- Train the diffusion models on $R$ (continuous DDPM) and $Y$ (masked diffusion), using a weighted combination $\mathcal{L}=\lambda\,\mathcal{L}_{\mathrm{cont}}+(1-\lambda)\,\mathcal{L}_{\mathrm{disc}}$ (see the sketch after this list).
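Putting the pieces together, a heavily simplified version of this sequential protocol could look like the sketch below; the residual computation, the fixed masking rate, and $\lambda = 0.5$ are assumptions made only for illustration, reusing helpers from the earlier sketches.

```python
# Simplified two-stage training loop reusing the helpers sketched above.
import torch

def train_mask_ddpm(trend, denoiser, disc_model, loader,
                    opt_trend, opt_gen, alphas_cumprod,
                    m_t=0.5, lam=0.5, epochs=10):
    for _ in range(epochs):                                 # Stage 1: trend Transformer
        for x, _ in loader:
            loss = trend_loss(trend, x)
            opt_trend.zero_grad(); loss.backward(); opt_trend.step()
    for _ in range(epochs):                                 # Stage 2: residual DDPM + masked diffusion
        for x, y in loader:
            with torch.no_grad():                           # trend skeleton T (first step copied from data)
                T_skel = torch.cat([x[:, :1], trend(x[:, :-1])], dim=1)
            r0 = x - T_skel                                 # residuals R = X - T
            t = torch.randint(0, alphas_cumprod.numel(), (x.size(0),), device=x.device)
            r_t, eps = q_sample(r0, t, alphas_cumprod)
            l_cont = torch.mean((denoiser(r_t, t) - eps) ** 2)
            l_disc = masked_ce_loss(disc_model, y, m_t)
            loss = lam * l_cont + (1.0 - lam) * l_disc
            opt_gen.zero_grad(); loss.backward(); opt_gen.step()
```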
In our implementation, we typically use $L=128$ and a diffusion horizon $T_{\mathrm{diff}}$ of up to 600 steps (a trade-off between sample quality and compute). Transformer backbones increase training cost due to $O(L^2)$ attention, but provide a principled mechanism for modeling long-range temporal dependencies, which is especially relevant in ICS settings [1,13].
Sampling procedure (end-to-end generation)
Given an optional seed prefix (or a sampled initial context):
- Trend rollout: use $f_\phi$ to produce $T$ over $L$ steps.
- Continuous residual sampling: sample $\hat{R}$ by reverse DDPM from noise, producing $\hat{X}=T+\hat{R}$.
- Discrete sampling: initialize $\hat{Y}$ as fully masked and iteratively unmask/denoise using the masked diffusion reverse model until all tokens are assigned [5].
- Return $(\hat{X},\hat{Y})$ as a synthetic ICS window (see the sketch after this list).
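A sketch of end-to-end sampling is shown below, reusing the earlier helpers. The confidence-based unmasking order and the per-step unmasking budget are simplifying assumptions; the method only requires that all tokens be assigned by the end.

```python
# End-to-end generation sketch: trend rollout + residual sampling + unmasking.
import torch

@torch.no_grad()
def generate_window(trend, denoiser, disc_model, seed, betas, alphas_cumprod,
                    L=128, d_d=1, n_unmask_steps=10):
    # 1) Trend rollout from a seed prefix (autoregressive one-step extension).
    x = seed.clone()                                        # [B, L0, d_c]
    while x.size(1) < L:
        x = torch.cat([x, trend(x)[:, -1:]], dim=1)
    T_skel = x[:, :L]
    # 2) Continuous residuals via reverse DDPM, then reconstruction.
    R_hat = sample_residual(denoiser, T_skel.shape, betas, alphas_cumprod,
                            device=T_skel.device)
    X_hat = T_skel + R_hat
    # 3) Discrete channels: start fully masked, commit the most confident tokens.
    y = torch.full((seed.size(0), L, d_d), MASK_ID, dtype=torch.long, device=seed.device)
    for step in range(n_unmask_steps):
        conf, pred = disc_model(y).softmax(-1).max(-1)      # [B, L, d_d] each
        still = y == MASK_ID
        if not still.any():
            break
        k = -(-int(still.sum()) // (n_unmask_steps - step)) # ceil: equal share per step
        idx = torch.topk(conf.masked_fill(~still, -1.0).flatten(), k).indices
        flat = y.view(-1).clone()
        flat[idx] = pred.flatten()[idx]
        y = flat.view_as(y)
    return X_hat, y                                         # (\hat{X}, \hat{Y})
```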
Type-aware decomposition (diagnostic-guided extensibility)
In practice, a small subset of channels can dominate failure modes (e.g., program-driven setpoints, actuator saturation/stiction, derived deterministic tags). We incorporate a type-aware diagnostic partitioning that groups variables by generative mechanism and enables modular replacements (e.g., conditional generation for program signals, deterministic reconstruction for derived tags). This design is compatible with emerging conditional diffusion paradigms for industrial time series [11] and complements prior ICS diffusion augmentation work that primarily targets continuous MTS fidelity [13].
Detailed benchmark metrics (e.g., KS, JSD, lag-1 autocorrelation) and the evaluation protocol are reported in the Benchmark/Experiments section.
References
[1] Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention Is All You Need. NeurIPS.
[2] Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS.
[3] Hang, T., Gu, S., Li, C., et al. (2023). Efficient Diffusion Training via Min-SNR Weighting Strategy. ICCV.
[4] Austin, J., Johnson, D. D., Ho, J., et al. (2021). Structured Denoising Diffusion Models in Discrete State-Spaces (D3PM). NeurIPS.
[5] Shi, J., Han, K., Wang, Z., Doucet, A., & Titsias, M. K. (2024). Simplified and Generalized Masked Diffusion for Discrete Data. NeurIPS; arXiv:2406.04329.
[6] Kotelnikov, A., Baranchuk, D., Rubachev, I., & Babenko, A. (2023). TabDDPM: Modelling Tabular Data with Diffusion Models. ICML.
[7] Shi, J., Xu, M., Hua, H., Zhang, H., Ermon, S., & Leskovec, J. (2024). TabDiff: a Mixed-type Diffusion Model for Tabular Data Generation. ICLR; arXiv:2410.20626.
[8] Kollovieh, M., Ansari, A. F., Bohlke-Schneider, M., et al. (2023). Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting (TSDiff). NeurIPS.
[9] Ren, L., Wang, H., & Laili, Y. (2024). Diff-MTS: Temporal-Augmented Conditional Diffusion-based AIGC for Industrial Time Series Toward the Large Model Era. arXiv:2407.11501.
[10] Sikder, M. F., et al. (2023). TransFusion: Generating Long, High Fidelity Time Series with Diffusion and Transformers. arXiv:2307.12667.
[11] Su, C., Cai, Z., Tian, Y., et al. (2025). Diffusion Models for Time Series Forecasting: A Survey. arXiv:2507.14507.
[12] Sha, Y., Yuan, Y., Wu, Y., & Zhao, H. (2026). DDPM Fusing Mamba and Adaptive Attention: An Augmentation Method for Industrial Control Systems Anomaly Data (ETUA-DDPM). SSRN preprint.
[13] Yuan, Y., et al. (2025). CTU-DDPM: Generating Industrial Control System Time-Series Data with a CNN-Transformer Hybrid Diffusion Model. ACM conference paper.