## Methodology

### 1. Problem setting and design motivation

Industrial control system (ICS) telemetry is a **mixed-type** sequential object: it couples **continuous** process dynamics (e.g., sensor values and physical responses) with **discrete** supervisory states (e.g., modes, alarms, interlocks). We model each training instance as a fixed-length window of length \(L\), consisting of (i) continuous channels \(X\in\mathbb{R}^{L\times d_c}\) and (ii) discrete channels \(Y=\{y^{(j)}_{1:L}\}_{j=1}^{d_d}\), where each discrete variable \(y^{(j)}_t\in\mathcal{V}_j\) takes values in a finite vocabulary. Our methodological objective is to learn a generator that produces synthetic samples \((\hat{X},\hat{Y})\) that are simultaneously (a) temporally coherent, (b) distributionally faithful, and (c) **semantically valid** for discrete channels (i.e., \(\hat{y}^{(j)}_t\in\mathcal{V}_j\) by construction).

A key empirical observation motivating our design is that, in ICS, temporal realism and distributional realism often pull the model in different directions when trained monolithically; this tension is amplified by the heterogeneous mechanisms that produce different variables (program steps, controller laws, actuator saturation, physical inertia, and deterministic derived tags). We therefore structure the generator to *separate concerns* and then *specialize*.

We propose **Mask-DDPM**, organized as follows:

1. **Transformer trend module**: learns the dominant temporal backbone of continuous dynamics via attention-based sequence modeling [1].
2. **Residual DDPM for continuous variables**: models distributional detail as stochastic residual structure conditioned on the learned trend [2].
3. **Masked diffusion for discrete variables**: generates discrete ICS states with an absorbing/masking corruption process and categorical reconstruction [3,4].
4. **Type-aware decomposition**: a routing layer that assigns each variable to the most appropriate modeling mechanism and enforces deterministic constraints where warranted.

This ordering is intentional: the trend module fixes the *macro-temporal scaffold*; residual diffusion then focuses on *micro-structure and marginal fidelity*; masked diffusion guarantees *discrete legality*; and type-aware routing resolves the remaining mismatch caused by heterogeneous variable-generating mechanisms.

---

### 2. Transformer trend module for continuous dynamics

Transformers are a standard choice for sequence modeling [1]. We use the Transformer explicitly as a trend extractor whose sole role is to provide conditioning for the subsequent diffusion stages rather than to serve as the primary generator; this creates a clean separation between temporal structure and distributional refinement in the ICS setting.

For continuous channels \(X\), we posit an additive decomposition
\[
X = S + R,
\]
where \(S\in\mathbb{R}^{L\times d_c}\) is a smooth **trend** capturing predictable temporal evolution, and \(R\in\mathbb{R}^{L\times d_c}\) is a **residual** capturing distributional detail (e.g., bursts, heavy tails, local fluctuations) that is difficult to represent robustly with a purely regression-based temporal objective. We parameterize the trend \(S\) with a **causal Transformer** \(f_\phi\) [1].
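To make this concrete, the following is a minimal PyTorch sketch of one possible implementation of \(f_\phi\); the hidden width, number of layers and heads, learned positional embedding, and maximum window length are illustrative assumptions rather than prescribed design choices. The training objective and inference rollout are given next.

```python
# Minimal sketch of a causal Transformer trend extractor f_phi (PyTorch assumed).
# d_model, n_heads, n_layers, and max_len are illustrative hyperparameters.
import torch
import torch.nn as nn


class TrendTransformer(nn.Module):
    def __init__(self, d_c: int, d_model: int = 128, n_heads: int = 4,
                 n_layers: int = 3, max_len: int = 2048):
        super().__init__()
        self.in_proj = nn.Linear(d_c, d_model)                         # lift channels to model width
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, d_model))  # learned positions
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, d_c)                        # back to channel space

    def forward(self, x_past: torch.Tensor) -> torch.Tensor:
        """x_past: (B, t, d_c) observed prefix -> (B, t, d_c); position t predicts S_{t+1}."""
        B, t, _ = x_past.shape
        h = self.in_proj(x_past) + self.pos_emb[:, :t]
        # Additive causal mask: position t may only attend to inputs at positions <= t.
        causal = torch.triu(torch.full((t, t), float("-inf"), device=x_past.device), diagonal=1)
        h = self.encoder(h, mask=causal)
        return self.out_proj(h)
```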
Concretely, with teacher forcing we train \(f_\phi\) to predict the next-step trend from past observations:
\[
\hat{S}_{t+1} = f_\phi(X_{1:t}), \qquad t=1,\dots,L-1,
\]
with the mean-squared error objective
\[
\mathcal{L}_{\text{trend}}(\phi)=\frac{1}{(L-1)\,d_c}\sum_{t=1}^{L-1}\left\| \hat{S}_{t+1} - X_{t+1}\right\|_2^2 .
\]
Self-attention is particularly suitable here because it provides a direct mechanism for learning cross-channel interactions and long-range temporal dependencies without recurrence, which is important when control actions influence downstream variables with nontrivial delays [1]. At inference, we roll out the Transformer autoregressively to obtain \(\hat{S}\), and then define the residual target for diffusion as \(R = X - \hat{S}\).

---

### 3. DDPM for continuous residual generation

We adopt a trend-conditioned residual diffusion formulation: diffusion is trained on residual structure defined explicitly relative to the Transformer trend, which is well aligned with ICS data, where low-frequency dynamics are strong and persistent. We model the residual \(R\) with a denoising diffusion probabilistic model (DDPM) conditioned on the trend \(\hat{S}\) [2]. Let \(K\) denote the number of diffusion steps, with a noise schedule \(\{\beta_k\}_{k=1}^K\), \(\alpha_k = 1-\beta_k\), and \(\bar{\alpha}_k=\prod_{i=1}^k \alpha_i\). The forward corruption process is
\[
q(r_k\mid r_0)=\mathcal{N}\!\left(\sqrt{\bar{\alpha}_k}\,r_0,\ (1-\bar{\alpha}_k)\mathbf{I}\right),
\]
or equivalently
\[
r_k = \sqrt{\bar{\alpha}_k}\,r_0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon,\qquad \epsilon\sim\mathcal{N}(0,\mathbf{I}).
\]
The learned reverse process is parameterized as
\[
p_{\theta}(r_{k-1}\mid r_k,\hat{S})=\mathcal{N}\!\left(\mu_{\theta}(r_k,k,\hat{S}),\ \Sigma_{\theta}(k)\right),
\]
where \(\mu_\theta\) is implemented by a **Transformer denoiser** that consumes (i) the noised residual \(r_k\), (ii) a timestep embedding for \(k\), and (iii) conditioning features derived from \(\hat{S}\). Conditioning is crucial: it prevents the diffusion model from re-learning the low-frequency temporal scaffold and concentrates its capacity on matching the residual distribution.

We train the denoiser with the standard DDPM \(\epsilon\)-prediction objective [2]:
\[
\mathcal{L}_{\text{cont}}(\theta)
=
\mathbb{E}_{k,r_0,\epsilon}
\left[
\left\| \epsilon - \epsilon_{\theta}(r_k,k,\hat{S}) \right\|_2^2
\right].
\]
Because diffusion optimization can exhibit timestep imbalance and gradient conflict, we optionally apply an SNR-based reweighting consistent with Min-SNR training [5]:
\[
\mathcal{L}^{\text{snr}}_{\text{cont}}(\theta)
=
\mathbb{E}_{k,r_0,\epsilon}
\left[
w(k)\left\| \epsilon - \epsilon_{\theta}(r_k,k,\hat{S}) \right\|_2^2
\right],
\qquad
w(k)=\frac{\mathrm{SNR}_k}{\mathrm{SNR}_k+\gamma}.
\]
After sampling \(\hat{R}\) by reverse diffusion, we reconstruct the continuous output as
\[
\hat{X} = \hat{S} + \hat{R}.
\]
This design makes the role of diffusion explicit: it acts as a **distributional corrector** on top of a temporally coherent backbone.
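As a concrete illustration, the sketch below implements the forward corruption and the \(\epsilon\)-prediction loss, including the optional weight \(w(k)=\mathrm{SNR}_k/(\mathrm{SNR}_k+\gamma)\); the denoiser interface `eps_theta(r_k, k, s_hat)`, the linear \(\beta\)-schedule, and the value of \(K\) are assumptions made for exposition rather than the exact configuration.

```python
# Sketch of trend-conditioned residual DDPM training with epsilon-prediction.
# `eps_theta(r_k, k, s_hat)` stands in for the Transformer denoiser; the linear
# beta schedule and K are illustrative assumptions.
import torch

K = 1000
betas = torch.linspace(1e-4, 2e-2, K)        # noise schedule beta_1..beta_K
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)     # \bar{alpha}_k
snr = alpha_bar / (1.0 - alpha_bar)          # SNR_k


def residual_ddpm_loss(eps_theta, r0, s_hat, gamma=5.0, use_snr_weight=True):
    """r0: (B, L, d_c) residual targets R = X - S_hat; s_hat: (B, L, d_c) trend conditioning."""
    B = r0.shape[0]
    k = torch.randint(0, K, (B,), device=r0.device)              # uniform timestep per sample
    a_bar = alpha_bar.to(r0.device)[k].view(B, 1, 1)
    eps = torch.randn_like(r0)
    r_k = a_bar.sqrt() * r0 + (1.0 - a_bar).sqrt() * eps         # forward corruption q(r_k | r_0)
    err = (eps - eps_theta(r_k, k, s_hat)).pow(2).flatten(1).mean(dim=1)
    if use_snr_weight:                                           # w(k) = SNR_k / (SNR_k + gamma)
        err = err * (snr.to(r0.device)[k] / (snr.to(r0.device)[k] + gamma))
    return err.mean()
```

Sampling then runs the learned reverse chain from \(r_K\sim\mathcal{N}(0,\mathbf{I})\) down to \(r_0=\hat{R}\), conditioned on \(\hat{S}\) at every step.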
---

### 4. Masked diffusion for discrete ICS variables

Discrete ICS variables must remain categorical, which makes Gaussian diffusion inappropriate. Diffusion-based ICS synthesis has predominantly emphasized continuous multivariate generation; in contrast, we introduce masked diffusion as a first-class discrete legality mechanism within the same pipeline, rather than treating discrete tags as continuous surrogates or recovering them by post-hoc thresholding. We therefore use **masked (absorbing) diffusion** for discrete channels [3,4], in which corruption replaces tokens with a special \(\texttt{[MASK]}\) symbol according to a schedule.

For a discrete sequence \(y^{(j)}_{1:L}\), define a corruption level \(m_k\in[0,1]\) that increases with \(k\). The corrupted sequence at step \(k\) is formed by independent masking:
\[
y^{(j,k)}_t =
\begin{cases}
\texttt{[MASK]}, & \text{with probability } m_k,\\
y^{(j)}_t, & \text{otherwise}.
\end{cases}
\]
This “absorbing” mechanism yields a diffusion-like process over discrete states while preserving a clear reconstruction target at every step [3,4].

For the categorical denoising objective, we parameterize the reverse model with a Transformer \(h_\psi\) that outputs a categorical distribution over \(\mathcal{V}_j\) for each masked position. Let \(\mathcal{M}\) be the set of masked indices. We train with cross-entropy over masked positions:
\[
\mathcal{L}_{\text{disc}}(\psi)
=
\mathbb{E}_{k}
\left[
\frac{1}{|\mathcal{M}|}\sum_{(j,t)\in\mathcal{M}}
\mathrm{CE}\!\left(
h_\psi\!\left(y^{(k)},k,\hat{S}\right)_{j,t},\
y^{(j)}_t
\right)
\right].
\]
Conditioning on \(\hat{S}\) (and, optionally, on \(\hat{X}\)) allows discrete state generation to reflect the continuous context, which is often essential in ICS (e.g., mode transitions coupled to process regimes). At inference, we initialize \(y^{(K)}\) as fully masked and iteratively denoise/unmask from \(k=K\) to \(1\), sampling valid tokens from the predicted categorical distributions. Because outputs are always drawn from \(\mathcal{V}_j\), semantic validity is satisfied by construction.
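For clarity, the following sketch shows the corruption and masked cross-entropy objective for a single discrete channel \(j\); the denoiser interface `h_psi(y_masked, k, s_hat)`, the linear schedule for \(m_k\), and the convention that \(\texttt{[MASK]}\) is an extra token appended to \(\mathcal{V}_j\) are assumptions made for illustration.

```python
# Sketch of masked (absorbing) diffusion training for a single discrete channel j.
# h_psi(y_masked, k, s_hat) -> (B, L, |V_j|) logits is an assumed denoiser interface;
# mask_id is the index of the extra [MASK] token appended to the vocabulary V_j.
import torch
import torch.nn.functional as F

K = 1000  # number of corruption levels (illustrative)


def masked_diffusion_loss(h_psi, y0, s_hat, mask_id):
    """y0: (B, L) tokens in V_j; s_hat: (B, L, d_c) continuous conditioning."""
    B, L = y0.shape
    k = torch.randint(0, K, (B,), device=y0.device)              # sample a corruption step
    m_k = (k.float() + 1.0) / K                                  # corruption level, increasing in k
    mask = torch.rand(B, L, device=y0.device) < m_k.view(B, 1)   # independent masking per position
    y_k = torch.where(mask, torch.full_like(y0, mask_id), y0)    # absorbing corruption
    logits = h_psi(y_k, k, s_hat)                                # categorical logits over V_j
    ce = F.cross_entropy(logits.transpose(1, 2), y0, reduction="none")  # (B, L)
    return (ce * mask).sum() / mask.sum().clamp(min=1)           # CE averaged over masked positions
```

Because the sampler only ever fills masked positions with tokens drawn from the predicted categoricals over \(\mathcal{V}_j\), the legality guarantee discussed above holds by construction.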
---

### 5. Type-aware decomposition as a performance refinement layer

Even with a strong continuous generator and a discrete legality mechanism, ICS data pose an additional challenge: variables arise from **qualitatively different generative mechanisms**. For example, program-driven setpoints are often piecewise constant with abrupt changes; actuators exhibit saturation and dwell; some channels are deterministic functions of others (derived tags). Modeling all channels identically can cause persistent error concentration and degrade overall quality. We therefore introduce a **type-aware decomposition** that operates on top of the base pipeline to improve fidelity, stability, and interpretability.

The type-aware decomposition is not merely a diagnostic taxonomy; it is a modeling layer that changes the generator's factorization: it decides which variables are best treated as conditional drivers, which require dwell-aware behavior, and which should be deterministic, thereby tailoring the diffusion pipeline to the structural heterogeneity characteristic of ICS telemetry.

Let \(\tau(i)\in\{1,\dots,6\}\) assign each variable \(i\) to a type. This induces a partition of the variables into index sets \(\{\mathcal{I}_k\}_{k=1}^{6}\). We then define the generator as a composition of type-specific operators:
\[
(\hat{X},\hat{Y}) = \mathcal{A}\Big(\hat{S},\hat{R},\hat{Y}_{\text{mask}},\{\mathcal{G}_k\}_{k=1}^{6};\tau\Big),
\]
where \(\mathcal{A}\) assembles a complete sample and \(\mathcal{G}_k\) encodes the appropriate modeling choice per type. We operationalize six types that map cleanly onto common ICS semantics:

* **Type 1 (program-driven / setpoint-like):** exogenous drivers with step changes and long dwell. These variables are treated as conditioning signals or modeled with a dedicated change-point-aware generator rather than being forced into residual diffusion.
* **Type 2 (controller outputs):** variables that respond to setpoints and process feedback; we treat them as conditional on Type 1 and the continuous context, and allow separate specialization if they dominate error.
* **Type 3 (actuator states/positions):** bounded, often quantized, with saturation and dwell; we route them to discrete masked diffusion when naturally categorical or quantized, and to specialized dwell-aware modeling when continuous but state-persistent.
* **Type 4 (process variables):** inertia-dominated continuous dynamics; these are the primary beneficiaries of the **Transformer trend + residual DDPM** pipeline.
* **Type 5 (derived/deterministic variables):** algebraic or rule-based functions of other variables; we enforce deterministic reconstruction \(\hat{x}^{(i)} = g_i(\hat{X},\hat{Y})\) rather than learning a stochastic generator, improving logical consistency and sample efficiency.
* **Type 6 (auxiliary/low-impact variables):** weakly coupled or sparse signals; we allow simplified modeling (e.g., calibrated marginals or lightweight temporal models) to avoid allocating diffusion capacity where it is not warranted.

Type-aware decomposition improves performance through three concrete mechanisms:

1. **Capacity allocation:** by focusing diffusion on Type 4 (and selected Type 2/3) variables, we reduce the tendency for a few mechanistically distinct variables to dominate gradients and distort the learned distribution elsewhere.
2. **Constraint enforcement:** Type 5 variables are computed deterministically, preventing logically inconsistent samples that a purely learned generator may produce.
3. **Mechanism alignment:** Type 1/3 variables receive inductive biases consistent with step-like or dwell-like behavior, which diffusion-trained smooth denoisers can otherwise over-regularize.

In practice, \(\tau\) can be initialized from domain semantics (tag metadata and value domains) and refined via the error-attribution loop described in the Benchmark section. Importantly, this refinement does not alter the core generative backbone; it *re-routes* difficult variables to better-suited mechanisms while preserving end-to-end coherence.

---

### 6. Joint optimization and end-to-end sampling

We train the model in a staged manner consistent with the factorization:

1. Train the trend Transformer \(f_\phi\) and roll it out to obtain \(\hat{S}\).
2. Compute residual targets \(R=X-\hat{S}\) for Type 4 (and any other continuous types routed to diffusion).
3. Train the residual DDPM \(p_\theta(R\mid \hat{S})\) and the masked diffusion model \(p_\psi(Y\mid \text{masked}(Y),\hat{S})\).
4. Apply type-aware routing and deterministic reconstruction rules during sampling.

A simple combined objective is
\[
\mathcal{L} = \lambda\,\mathcal{L}_{\text{cont}} + (1-\lambda)\,\mathcal{L}_{\text{disc}},
\]
with \(\lambda\in[0,1]\) controlling the balance between continuous and discrete learning. Type-aware routing determines which channels contribute to which loss and which are excluded in favor of deterministic reconstruction.

At inference time, we generate in the same order: (i) the trend \(\hat{S}\), (ii) the residual \(\hat{R}\) via DDPM, (iii) the discrete channels \(\hat{Y}\) via masked diffusion, and (iv) type-aware assembly with deterministic reconstruction.
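The sketch below summarizes this sampling order and the type-aware assembly; the callables `rollout_trend`, `sample_residual`, and `sample_discrete` are placeholders for the trained components above, `tau` and `derived_rules` encode \(\tau\) and the deterministic rules \(g_i\), and only the Type 5 branch is shown explicitly to keep the example short.

```python
# High-level sketch of end-to-end sampling with type-aware assembly. The callables
# passed in are placeholders for the trained components described above; their exact
# interfaces are assumptions made for illustration.


def assemble_sample(rollout_trend, sample_residual, sample_discrete, tau, derived_rules, L):
    """tau: dict {var index -> type id in 1..6}; derived_rules: dict {var index -> g_i}."""
    s_hat = rollout_trend(L)                    # (i) trend scaffold, shape (L, d_c)
    r_hat = sample_residual(s_hat)              # (ii) residual via reverse diffusion, (L, d_c)
    x_hat = s_hat + r_hat                       # continuous reconstruction X_hat = S_hat + R_hat
    y_hat = sample_discrete(s_hat)              # (iii) discrete channels via iterative unmasking
    for i, type_id in tau.items():              # (iv) type-aware routing
        if type_id == 5:                        # Type 5: derived/deterministic tags
            x_hat[:, i] = derived_rules[i](x_hat, y_hat)   # enforce x_i = g_i(X_hat, Y_hat)
    return x_hat, y_hat
```

Type 1 and Type 6 channels would be routed to their dedicated generators in the same loop; they are omitted here for brevity.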
---

# References

[1] Vaswani, A., Shazeer, N., Parmar, N., et al. *Attention Is All You Need.* NeurIPS, 2017. ([arXiv][1])

[2] Ho, J., Jain, A., Abbeel, P. *Denoising Diffusion Probabilistic Models.* NeurIPS, 2020. ([arXiv][2])

[3] Austin, J., Johnson, D. D., Ho, J., Tarlow, D., van den Berg, R. *Structured Denoising Diffusion Models in Discrete State-Spaces (D3PM).* NeurIPS, 2021. ([arXiv][3])

[4] Shi, J., Han, K., Wang, Z., Doucet, A., Titsias, M. K. *Simplified and Generalized Masked Diffusion for Discrete Data.* arXiv:2406.04329, 2024. ([arXiv][4])

[5] Hang, T., Gu, S., Li, C., et al. *Efficient Diffusion Training via Min-SNR Weighting Strategy.* ICCV, 2023. ([arXiv][5])

[6] Kollovieh, M., Ansari, A. F., Bohlke-Schneider, M., et al. *Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting (TSDiff).* arXiv:2307.11494, 2023. ([arXiv][6])

[7] Su, C., Cai, Z., Tian, Y., et al. *Diffusion Models for Time Series Forecasting: A Survey.* arXiv:2507.14507, 2025. ([arXiv][7])

[1]: https://arxiv.org/abs/1706.03762 "Attention Is All You Need"
[2]: https://arxiv.org/abs/2006.11239 "Denoising Diffusion Probabilistic Models"
[3]: https://arxiv.org/abs/2107.03006 "Structured Denoising Diffusion Models in Discrete State-Spaces"
[4]: https://arxiv.org/abs/2406.04329 "Simplified and Generalized Masked Diffusion for Discrete Data"
[5]: https://arxiv.org/abs/2303.09556 "Efficient Diffusion Training via Min-SNR Weighting Strategy"
[6]: https://arxiv.org/abs/2307.11494 "Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting"
[7]: https://arxiv.org/abs/2507.14507 "Diffusion Models for Time Series Forecasting: A Survey"