
Methodology

1. Problem setting and design motivation

Industrial control system (ICS) telemetry is a mixed-type sequential object: it couples continuous process dynamics (e.g., sensor values and physical responses) with discrete supervisory states (e.g., modes, alarms, interlocks). We model each training instance as a fixed-length window of length $L$, consisting of (i) continuous channels $X\in\mathbb{R}^{L\times d_c}$ and (ii) discrete channels $Y=\{y^{(j)}_{1:L}\}_{j=1}^{d_d}$, where each discrete variable $y^{(j)}_t\in\mathcal{V}_j$ belongs to a finite vocabulary.

Our methodological objective is to learn a generator that produces synthetic $(\hat{X},\hat{Y})$ that are simultaneously: (a) temporally coherent, (b) distributionally faithful, and (c) semantically valid for discrete channels (i.e., $\hat{y}^{(j)}_t\in\mathcal{V}_j$ by construction). A key empirical observation motivating our design is that, in ICS, temporal realism and distributional realism often “pull” the model in different directions when trained monolithically; this is amplified by the heterogeneous mechanisms that produce different variables (program steps, controller laws, actuator saturation, physical inertia, and deterministic derived tags). We therefore structure the generator to separate concerns and then specialize.


2. Overview of Mask-DDPM

We propose Mask-DDPM, organized in the following order:

  1. Transformer trend module: learns the dominant temporal backbone of continuous dynamics via attention-based sequence modeling [1].
  2. Residual DDPM for continuous variables: models distributional detail as stochastic residual structure conditioned on the learned trend [2].
  3. Masked diffusion for discrete variables: generates discrete ICS states with an absorbing/masking corruption process and categorical reconstruction [3,4].
  4. Type-aware decomposition: a performance-oriented refinement layer that routes variables to the most appropriate modeling mechanism and enforces deterministic constraints where warranted.

This ordering is intentional: the trend module fixes the macro-temporal scaffold; diffusion then focuses on micro-structure and marginal fidelity; masked diffusion guarantees discrete legality; and type-aware routing resolves the remaining mismatch caused by heterogeneous variable-generating mechanisms.


3. Transformer trend module for continuous dynamics

3.1 Trend-residual decomposition

For continuous channels $X$, we posit an additive decomposition

$$X = S + R,$$

where $S\in\mathbb{R}^{L\times d_c}$ is a smooth trend capturing predictable temporal evolution, and $R\in\mathbb{R}^{L\times d_c}$ is a residual capturing distributional detail (e.g., bursts, heavy tails, local fluctuations) that is difficult to represent robustly with a purely regression-based temporal objective.

3.2 Causal Transformer parameterization

We parameterize the trend $S$ using a causal Transformer $f_\phi$ [1]. Concretely, with teacher forcing we train $f_\phi$ to predict the next-step trend from past observations:

$$\hat{S}_{t+1} = f_\phi(X_{1:t}), \qquad t=1,\dots,L-1,$$

with the mean-squared error objective

$$\mathcal{L}_{\text{trend}}(\phi)=\frac{1}{(L-1)\,d_c}\sum_{t=1}^{L-1}\left\| \hat{S}_{t+1} - X_{t+1}\right\|_2^2.$$

Self-attention is particularly suitable here because it provides a direct mechanism for learning cross-channel interactions and long-range temporal dependencies without recurrence, which is important when control actions influence downstream variables with nontrivial delays [1].

At inference, we roll out the Transformer autoregressively to obtain $\hat{S}$, and then define the residual target for diffusion as $R = X - \hat{S}$.
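
As a concrete illustration, the following minimal PyTorch sketch shows one way the teacher-forced trend objective above could be implemented; the module name `TrendTransformer`, the hyperparameter defaults, and the zero-based indexing are illustrative choices of this sketch, not part of the method specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrendTransformer(nn.Module):
    """Causal Transformer trend extractor f_phi (illustrative sketch)."""
    def __init__(self, d_c: int, d_model: int = 128, nhead: int = 4, num_layers: int = 2):
        super().__init__()
        self.in_proj = nn.Linear(d_c, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.out_proj = nn.Linear(d_model, d_c)

    def forward(self, x_past: torch.Tensor) -> torch.Tensor:
        # Causal mask: position t attends only to positions <= t.
        L = x_past.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(L).to(x_past.device)
        h = self.encoder(self.in_proj(x_past), mask=mask)
        return self.out_proj(h)  # per-position next-step trend prediction

def trend_loss(model: TrendTransformer, x: torch.Tensor) -> torch.Tensor:
    # Teacher forcing: predict X_{t+1} from X_{1:t}; MSE matches L_trend.
    pred = model(x[:, :-1, :])           # \hat{S}_{2:L}
    return F.mse_loss(pred, x[:, 1:, :])
```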

Uniqueness note (trend module)

While Transformers are now standard in sequence modeling, our use of a Transformer explicitly as a trend extractor—whose sole role is to provide conditioning for subsequent diffusion rather than to serve as the primary generator—creates a clean separation between temporal structure and distributional refinement in the ICS setting.


4. DDPM for continuous residual generation

4.1 Conditional diffusion on residuals

We model the residual $R$ with a denoising diffusion probabilistic model (DDPM) conditioned on the trend $\hat{S}$ [2]. Let $K$ denote the number of diffusion steps, with a noise schedule $\{\beta_k\}_{k=1}^K$, $\alpha_k = 1-\beta_k$, and $\bar{\alpha}_k=\prod_{i=1}^k \alpha_i$. The forward corruption process is

$$q(r_k\mid r_0)=\mathcal{N}\!\left(\sqrt{\bar{\alpha}_k}\,r_0,\ (1-\bar{\alpha}_k)\mathbf{I}\right),$$

equivalently,

$$r_k = \sqrt{\bar{\alpha}_k}\,r_0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon,\qquad \epsilon\sim\mathcal{N}(0,\mathbf{I}).$$
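
A minimal sketch of this forward corruption, assuming a linear $\beta$ schedule and zero-based step indexing (both illustrative choices; the method does not fix a schedule here):

```python
import torch

K = 1000
betas = torch.linspace(1e-4, 2e-2, K)        # {beta_k}, illustrative linear schedule
alphas = 1.0 - betas                          # alpha_k = 1 - beta_k
alpha_bars = torch.cumprod(alphas, dim=0)     # \bar{alpha}_k

def q_sample(r0: torch.Tensor, k: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """r_k = sqrt(abar_k) * r_0 + sqrt(1 - abar_k) * eps (k is zero-based here)."""
    abar = alpha_bars[k].view(-1, 1, 1)       # broadcast over (batch, L, d_c)
    return abar.sqrt() * r0 + (1.0 - abar).sqrt() * noise
```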

The learned reverse process is parameterized as

$$p_{\theta}(r_{k-1}\mid r_k,\hat{S})=\mathcal{N}\!\left(\mu_{\theta}(r_k,k,\hat{S}),\ \Sigma_{\theta}(k)\right),$$

where $\mu_\theta$ is implemented by a Transformer denoiser that consumes (i) the noised residual $r_k$, (ii) a timestep embedding for $k$, and (iii) conditioning features derived from $\hat{S}$. Conditioning is crucial: it prevents the diffusion model from re-learning the low-frequency temporal scaffold and concentrates its capacity on matching the residual distribution.

4.2 Training objective and loss shaping

We train the denoiser using the standard DDPM $\epsilon$-prediction objective [2]:

$$\mathcal{L}_{\text{cont}}(\theta) = \mathbb{E}_{k,r_0,\epsilon}\left[ \left\| \epsilon - \epsilon_{\theta}(r_k,k,\hat{S}) \right\|_2^2 \right].$$

Because diffusion optimization can exhibit timestep imbalance and gradient conflict, we optionally apply an SNR-based reweighting consistent with Min-SNR training [5]:

$$\mathcal{L}^{\text{snr}}_{\text{cont}}(\theta) = \mathbb{E}_{k,r_0,\epsilon}\left[ w(k)\left\| \epsilon - \epsilon_{\theta}(r_k,k,\hat{S}) \right\|_2^2 \right], \qquad w(k)=\frac{\mathrm{SNR}_k}{\mathrm{SNR}_k+\gamma}.$$

(We treat this as an ablation: it is not required by the method, but can improve training efficiency and stability in practice [5].)
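
For concreteness, the reweighting could be computed as below, using the standard definition $\mathrm{SNR}_k=\bar{\alpha}_k/(1-\bar{\alpha}_k)$; the value of `gamma` and the reuse of the `alpha_bars` buffer from the previous sketch are assumptions of this sketch.

```python
import torch

def snr_weight(k: torch.Tensor, alpha_bars: torch.Tensor, gamma: float = 5.0) -> torch.Tensor:
    """w(k) = SNR_k / (SNR_k + gamma), with SNR_k = abar_k / (1 - abar_k)."""
    abar = alpha_bars[k]
    snr = abar / (1.0 - abar)
    return snr / (snr + gamma)

def weighted_eps_loss(eps_pred: torch.Tensor, eps: torch.Tensor,
                      k: torch.Tensor, alpha_bars: torch.Tensor,
                      gamma: float = 5.0) -> torch.Tensor:
    # Per-sample squared error, weighted by w(k), averaged over the batch.
    per_sample = ((eps_pred - eps) ** 2).flatten(1).mean(dim=1)
    return (snr_weight(k, alpha_bars, gamma) * per_sample).mean()
```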

4.3 Continuous reconstruction

After sampling $\hat{R}$ by reverse diffusion, we reconstruct the continuous output as

$$\hat{X} = \hat{S} + \hat{R}.$$

This design makes the role of diffusion explicit: it acts as a distributional corrector on top of a temporally coherent backbone.
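
A sketch of the corresponding ancestral sampler, assuming an $\epsilon$-prediction denoiser with the hypothetical signature `eps_model(r_k, k, s_hat)` and the fixed-variance choice $\Sigma_\theta(k)=\beta_k\mathbf{I}$ of Ho et al. [2]:

```python
import torch

@torch.no_grad()
def sample_residual(eps_model, s_hat: torch.Tensor,
                    betas: torch.Tensor, alphas: torch.Tensor,
                    alpha_bars: torch.Tensor) -> torch.Tensor:
    """Reverse diffusion over residuals, conditioned on the trend s_hat."""
    K = betas.numel()
    r = torch.randn_like(s_hat)                       # r_K ~ N(0, I)
    for k in reversed(range(K)):
        k_idx = torch.full((r.size(0),), k, dtype=torch.long, device=r.device)
        eps_hat = eps_model(r, k_idx, s_hat)
        coef = betas[k] / (1.0 - alpha_bars[k]).sqrt()
        mean = (r - coef * eps_hat) / alphas[k].sqrt()
        noise = torch.randn_like(r) if k > 0 else torch.zeros_like(r)
        r = mean + betas[k].sqrt() * noise            # r_{k-1}
    return r                                          # \hat{R}; then \hat{X} = \hat{S} + \hat{R}
```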

Uniqueness note (continuous branch)

The central integration here is not “Transformer + diffusion” in isolation, but rather a trend-conditioned residual diffusion formulation: diffusion is trained on residual structure explicitly defined relative to a Transformer trend model, which is particularly aligned with ICS where low-frequency dynamics are strong and persistent.


5. Masked diffusion for discrete ICS variables

5.1 Discrete corruption via absorbing masks

Discrete ICS variables must remain categorical, making Gaussian diffusion inappropriate. We therefore use masked (absorbing) diffusion for discrete channels [3,4], in which corruption replaces tokens with a special $\texttt{[MASK]}$ symbol according to a schedule.

For a discrete sequence $y^{(j)}_{1:L}$, define a corruption level $m_k\in[0,1]$ that increases with $k$. The corrupted sequence at step $k$ is formed by independent masking:

$$y^{(j,k)}_t = \begin{cases} \texttt{[MASK]}, & \text{with probability } m_k,\\ y^{(j)}_t, & \text{otherwise}. \end{cases}$$

This “absorbing” mechanism yields a diffusion-like process over discrete states while preserving a clear reconstruction target at every step [3,4].
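
A sketch of this corruption for a single discrete channel, assuming integer token ids, a reserved `MASK_ID`, and the linear schedule $m_k = k/K$ (the schedule is an illustrative choice, not fixed by the method):

```python
import torch

MASK_ID = 0  # hypothetical reserved id for [MASK]

def mask_corrupt(y: torch.Tensor, k: torch.Tensor, K: int) -> torch.Tensor:
    """Independently replace each token with [MASK] with probability m_k = k/K.

    y: (batch, L) integer tokens; k: (batch,) diffusion steps in {1, ..., K}.
    """
    m_k = (k.float() / K).view(-1, 1)                    # corruption level per sample
    drop = torch.rand_like(y, dtype=torch.float) < m_k   # Bernoulli(m_k) mask
    return torch.where(drop, torch.full_like(y, MASK_ID), y)
```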

5.2 Categorical denoising objective

We parameterize the reverse model with a Transformer $h_\psi$ that outputs a categorical distribution over $\mathcal{V}_j$ for each masked position. Let $\mathcal{M}$ be the set of masked indices. We train with cross-entropy over masked positions:

$$\mathcal{L}_{\text{disc}}(\psi) = \mathbb{E}_{k}\left[ \frac{1}{|\mathcal{M}|}\sum_{(j,t)\in\mathcal{M}} \mathrm{CE}\!\left( h_\psi\!\left(y^{(k)},k,\hat{S}\right)_{j,t},\ y^{(j)}_t \right) \right].$$

Conditioning on $\hat{S}$ (and, optionally, on $\hat{X}$) allows discrete state generation to reflect the continuous context, which is often essential in ICS (e.g., mode transitions coupled to process regimes).
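
A sketch of this objective for one channel, assuming a hypothetical denoiser signature that returns per-position logits over $\mathcal{V}_j$:

```python
import torch
import torch.nn.functional as F

def masked_ce_loss(denoiser, y_corrupt: torch.Tensor, y_clean: torch.Tensor,
                   k: torch.Tensor, s_hat: torch.Tensor, mask_id: int = 0) -> torch.Tensor:
    """Cross-entropy restricted to masked positions, matching L_disc."""
    logits = denoiser(y_corrupt, k, s_hat)   # (batch, L, |V_j|)
    masked = (y_corrupt == mask_id)          # the index set M
    if masked.sum() == 0:
        return logits.new_zeros(())
    return F.cross_entropy(logits[masked], y_clean[masked])
```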

5.3 Sampling

At inference, we initialize $y^{(K)}$ as fully masked and iteratively denoise/unmask from $k=K$ to $1$, sampling valid tokens from the predicted categorical distributions. Because outputs are always drawn from $\mathcal{V}_j$, semantic validity is satisfied by construction.
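
One possible unmasking loop is sketched below; the reveal-a-fraction-per-step strategy is an illustrative choice, since the method only requires that tokens be drawn from the predicted categorical distributions.

```python
import torch

@torch.no_grad()
def sample_discrete(denoiser, s_hat: torch.Tensor, L: int, K: int,
                    mask_id: int = 0) -> torch.Tensor:
    """Start fully masked, then iteratively sample and reveal tokens."""
    batch = s_hat.size(0)
    y = torch.full((batch, L), mask_id, dtype=torch.long, device=s_hat.device)
    for k in reversed(range(1, K + 1)):
        k_idx = torch.full((batch,), k, dtype=torch.long, device=s_hat.device)
        probs = denoiser(y, k_idx, s_hat).softmax(dim=-1)   # (batch, L, |V_j|)
        proposal = torch.distributions.Categorical(probs=probs).sample()
        still_masked = (y == mask_id)
        # Reveal enough positions so roughly m_{k-1} = (k-1)/K remain masked;
        # at k = 1 every remaining masked position is revealed.
        keep_frac = (k - 1) / K
        reveal = still_masked & (torch.rand_like(probs[..., 0]) >= keep_frac)
        y = torch.where(reveal, proposal, y)
    return y
```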

Uniqueness note (discrete branch)

To our knowledge, diffusion-based ICS synthesis has predominantly emphasized continuous multivariate generation; in contrast, we introduce masked diffusion as a first-class discrete legality mechanism within the same pipeline, rather than treating discrete tags as continuous surrogates or post-hoc thresholds.


6. Type-aware decomposition as a performance refinement layer

6.1 Motivation: mechanistic heterogeneity in ICS variables

Even with a strong continuous generator and a discrete legality mechanism, ICS data pose an additional challenge: variables arise from qualitatively different generative mechanisms. For example, program-driven setpoints are often piecewise constant with abrupt changes; actuators exhibit saturation and dwell; some channels are deterministic functions of others (derived tags). Modeling all channels identically can cause persistent error concentration and degrade overall quality.

We therefore introduce a type-aware decomposition that operates on top of the base pipeline to improve fidelity, stability, and interpretability.

6.2 Typing function and routing

Let $\tau(i)\in\{1,\dots,6\}$ assign each variable $i$ to a type. This induces a partition of variables into index sets $\{\mathcal{I}_k\}_{k=1}^6$. We then define the generator as a composition of type-specific operators:

$$(\hat{X},\hat{Y}) = \mathcal{A}\Big(\hat{S},\hat{R},\hat{Y}_{\text{mask}},\{\mathcal{G}_k\}_{k=1}^6;\tau\Big),$$

where $\mathcal{A}$ assembles a complete sample and $\mathcal{G}_k$ encodes the appropriate modeling choice per type.

6.3 Six-type schema and modeling commitments

We operationalize six types that map cleanly onto common ICS semantics (a minimal routing sketch follows the list):

  • Type 1 (program-driven / setpoint-like): exogenous drivers with step changes and long dwell. These variables are treated as conditioning signals or modeled with a dedicated change-point-aware generator rather than forcing them into residual diffusion.
  • Type 2 (controller outputs): variables that respond to setpoints and process feedback; we treat them as conditional on Type 1 and continuous context, and allow separate specialization if they dominate error.
  • Type 3 (actuator states/positions): bounded, often quantized, with saturation and dwell; we route them to discrete masked diffusion when naturally categorical/quantized, or to specialized dwell-aware modeling when continuous but state-persistent.
  • Type 4 (process variables): inertia-dominated continuous dynamics; these are the primary beneficiaries of the Transformer trend + residual DDPM pipeline.
  • Type 5 (derived/deterministic variables): algebraic or rule-based functions of other variables; we enforce deterministic reconstruction (\hat{x}^{(i)} = g_i(\hat{X},\hat{Y})) rather than learning a stochastic generator, improving logical consistency and sample efficiency.
  • Type 6 (auxiliary/low-impact variables): weakly coupled or sparse signals; we allow simplified modeling (e.g., calibrated marginals or lightweight temporal models) to avoid allocating diffusion capacity where it is not warranted.
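
A minimal sketch of how $\tau$ and the deterministic Type 5 rules could be represented; the enum names, example tag names, and the derived-tag rule are purely illustrative and not taken from any particular plant (see also the assembly step in Section 7):

```python
from enum import IntEnum

class VarType(IntEnum):
    PROGRAM = 1      # Type 1: setpoint-like exogenous drivers
    CONTROLLER = 2   # Type 2: controller outputs
    ACTUATOR = 3     # Type 3: actuator states/positions
    PROCESS = 4      # Type 4: inertia-dominated process variables
    DERIVED = 5      # Type 5: deterministic functions of other variables
    AUXILIARY = 6    # Type 6: weakly coupled / low-impact signals

# tau: variable name -> type, initialized from tag metadata and value domains
# (hypothetical tag names for illustration only).
tau = {
    "PUMP1_SP": VarType.PROGRAM,
    "PUMP1_PID_OUT": VarType.CONTROLLER,
    "VALVE1_STATE": VarType.ACTUATOR,
    "TANK1_LEVEL": VarType.PROCESS,
    "FLOW_TOTAL": VarType.DERIVED,
}

# Deterministic reconstruction rules g_i for Type 5 variables; `channels` is a
# dict mapping tag names to generated tensors.
derived_rules = {
    "FLOW_TOTAL": lambda channels: channels["FLOW_IN_1"] + channels["FLOW_IN_2"],
}
```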

6.4 Training-time and inference-time integration

Type-aware decomposition improves performance through three concrete mechanisms:

  1. Capacity allocation: by focusing diffusion on Type 4 (and selected Type 2/3), we reduce the tendency for a few mechanistically distinct variables to dominate gradients and distort the learned distribution elsewhere.
  2. Constraint enforcement: Type 5 variables are computed deterministically, preventing logically inconsistent samples that a purely learned generator may produce.
  3. Mechanism alignment: Type 1/3 variables receive inductive biases consistent with step-like or dwell-like behavior, which diffusion-trained smooth denoisers can otherwise over-regularize.

In practice, $\tau$ can be initialized from domain semantics (tag metadata and value domains) and refined via an error-attribution loop described in the Benchmark section. Importantly, this refinement does not alter the core generative backbone; it re-routes difficult variables to better-suited mechanisms while preserving end-to-end coherence.

Uniqueness note (type-aware layer)

The type-aware decomposition is not merely a diagnostic taxonomy; it is a modeling layer that changes the generator's factorization, deciding which variables are best treated as conditional drivers, which require dwell-aware behavior, and which should be deterministic, thereby tailoring the diffusion pipeline to the structural heterogeneity characteristic of ICS telemetry.


7. Joint optimization and end-to-end sampling

We train the model in a staged manner consistent with the factorization:

  1. Train the trend Transformer $f_\phi$ and roll it out to obtain $\hat{S}$.
  2. Compute residual targets $R=X-\hat{S}$ for Type 4 (and any routed continuous types).
  3. Train the residual DDPM $p_\theta(R\mid \hat{S})$ and the masked diffusion model $p_\psi(Y\mid \text{masked}(Y),\hat{S})$.
  4. Apply type-aware routing and deterministic reconstruction rules during sampling.

A simple combined objective is

$$\mathcal{L} = \lambda\,\mathcal{L}_{\text{cont}} + (1-\lambda)\,\mathcal{L}_{\text{disc}},$$

with $\lambda\in[0,1]$ controlling the balance between continuous and discrete learning. Type-aware routing determines which channels contribute to which loss and which are excluded in favor of deterministic reconstruction.
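
The convex combination itself is a one-liner; for completeness, with a hypothetical `lambda_weight`:

```python
import torch

def combined_loss(loss_cont: torch.Tensor, loss_disc: torch.Tensor,
                  lambda_weight: float = 0.5) -> torch.Tensor:
    # L = lambda * L_cont + (1 - lambda) * L_disc
    return lambda_weight * loss_cont + (1.0 - lambda_weight) * loss_disc
```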

At inference time, we generate in the same order: (i) trend (\hat{S}), (ii) residual (\hat{R}) via DDPM, (iii) discrete (\hat{Y}) via masked diffusion, and (iv) type-aware assembly and deterministic reconstruction.
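
Putting the stages together, an end-to-end sampling sketch that reuses the illustrative components from the earlier sketches; the zero seed window for the trend rollout and the omission of per-type routing details are simplifications of this sketch.

```python
import torch

@torch.no_grad()
def generate(trend_model, eps_model, disc_denoiser,
             betas, alphas, alpha_bars,
             batch: int, L: int, d_c: int, K: int, device: str = "cpu"):
    # (i) trend: autoregressive rollout from a zero seed window (simplification)
    window = torch.zeros(batch, 1, d_c, device=device)
    for _ in range(L - 1):
        next_step = trend_model(window)[:, -1:, :]
        window = torch.cat([window, next_step], dim=1)
    s_hat = window
    # (ii) continuous residual via reverse diffusion
    r_hat = sample_residual(eps_model, s_hat, betas, alphas, alpha_bars)
    # (iii) discrete channels via iterative unmasking
    y_hat = sample_discrete(disc_denoiser, s_hat, L, K)
    # (iv) assembly; deterministic Type 5 channels would be recomputed here
    # from their rules g_i before returning the final sample
    x_hat = s_hat + r_hat
    return x_hat, y_hat
```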


References

[1] Vaswani, A., Shazeer, N., Parmar, N., et al. Attention Is All You Need. NeurIPS, 2017.
[2] Ho, J., Jain, A., Abbeel, P. Denoising Diffusion Probabilistic Models. NeurIPS, 2020.
[3] Austin, J., Johnson, D. D., Ho, J., Tarlow, D., van den Berg, R. Structured Denoising Diffusion Models in Discrete State-Spaces (D3PM). NeurIPS, 2021.
[4] Shi, J., Han, K., Wang, Z., Doucet, A., Titsias, M. K. Simplified and Generalized Masked Diffusion for Discrete Data. arXiv:2406.04329, 2024.
[5] Hang, T., Gu, S., Li, C., et al. Efficient Diffusion Training via Min-SNR Weighting Strategy. ICCV, 2023.
[6] Kollovieh, M., Ansari, A. F., Bohlke-Schneider, M., et al. Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting (TSDiff). arXiv:2307.11494, 2023.
[7] Su, C., Cai, Z., Tian, Y., et al. Diffusion Models for Time Series Forecasting: A Survey. arXiv:2507.14507, 2025.