## Methodology

## 1. Problem setting and design motivation

Industrial control system (ICS) telemetry is a **mixed-type** sequential object: it couples **continuous** process dynamics (e.g., sensor values and physical responses) with **discrete** supervisory states (e.g., modes, alarms, interlocks). We model each training instance as a fixed-length window of length $L$, consisting of (i) continuous channels $X\in\mathbb{R}^{L\times d_c}$ and (ii) discrete channels $Y=\{y^{(j)}_{1:L}\}_{j=1}^{d_d}$, where each discrete variable $y^{(j)}_t\in\mathcal{V}_j$ belongs to a finite vocabulary.

Our methodological objective is to learn a generator that produces synthetic pairs $(\hat{X},\hat{Y})$ that are simultaneously: (a) temporally coherent, (b) distributionally faithful, and (c) **semantically valid** for discrete channels (i.e., $\hat{y}^{(j)}_t\in\mathcal{V}_j$ by construction). A key empirical observation motivating our design is that, in ICS, temporal realism and distributional realism often “pull” the model in different directions when trained monolithically; this is amplified by the heterogeneous mechanisms that produce different variables (program steps, controller laws, actuator saturation, physical inertia, and deterministic derived tags). We therefore structure the generator to *separate concerns* and then *specialize*.
---

## 2. Overview of Mask-DDPM
We propose **Mask-DDPM**, organized in the following order:

1. **Transformer trend module**: learns the dominant temporal backbone of continuous dynamics via attention-based sequence modeling [1].
2. **Residual DDPM for continuous variables**: models distributional detail as stochastic residual structure conditioned on the learned trend [2].
3. **Masked diffusion for discrete variables**: generates discrete ICS states with an absorbing/masking corruption process and categorical reconstruction [3,4].
4. **Type-aware decomposition**: a performance-oriented refinement layer that routes variables to the most appropriate modeling mechanism and enforces deterministic constraints where warranted.

This ordering is intentional: the trend module fixes the *macro-temporal scaffold*; diffusion then focuses on *micro-structure and marginal fidelity*; masked diffusion guarantees *discrete legality*; and type-aware routing resolves the remaining mismatch caused by heterogeneous variable-generating mechanisms.

---

## 3. Transformer trend module for continuous dynamics

### 3.1 Trend–residual decomposition

For continuous channels $X$, we posit an additive decomposition

$$
X = S + R,
$$

where $S\in\mathbb{R}^{L\times d_c}$ is a smooth **trend** capturing predictable temporal evolution, and $R\in\mathbb{R}^{L\times d_c}$ is a **residual** capturing distributional detail (e.g., bursts, heavy tails, local fluctuations) that is difficult to represent robustly with a purely regression-based temporal objective.

### 3.2 Causal Transformer parameterization

We parameterize the trend $S$ using a **causal Transformer** $f_\phi$ [1]. Concretely, with teacher forcing we train $f_\phi$ to predict the next-step trend from past observations:

$$
\hat{S}_{t+1} = f_\phi(X_{1:t}), \qquad t=1,\dots,L-1,
$$

with the mean-squared error objective

$$
\mathcal{L}_{\text{trend}}(\phi)=\frac{1}{(L-1)d_c}\sum_{t=1}^{L-1}\left\| \hat{S}_{t+1} - X_{t+1}\right\|_2^2.
$$
Self-attention is particularly suitable here because it provides a direct mechanism for learning cross-channel interactions and long-range temporal dependencies without recurrence, which is important when control actions influence downstream variables with nontrivial delays [1].

At inference, we roll out the Transformer autoregressively to obtain $\hat{S}$, and then define the residual target for diffusion as $R = X - \hat{S}$.
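
For concreteness, the following is a minimal PyTorch sketch of the teacher-forced trend objective. It is illustrative rather than a reference implementation; names and hyperparameter defaults such as `TrendTransformer` and `d_model` are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrendTransformer(nn.Module):
    """Causal Transformer f_phi: predicts the next-step trend from past observations."""
    def __init__(self, d_c: int, d_model: int = 128, n_layers: int = 4, n_heads: int = 4):
        super().__init__()
        self.inp = nn.Linear(d_c, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, d_c)

    def forward(self, x_past: torch.Tensor) -> torch.Tensor:
        # x_past: (B, T, d_c); a causal mask prevents attending to future steps.
        T = x_past.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(x_past.device)
        h = self.encoder(self.inp(x_past), mask=mask)
        return self.out(h)                       # (B, T, d_c): predictions for steps 2..T+1

def trend_loss(model: TrendTransformer, x: torch.Tensor) -> torch.Tensor:
    # Teacher forcing: predict X_{t+1} from X_{1:t}; MSE over the L-1 shifted steps.
    pred = model(x[:, :-1, :])                   # \hat{S}_{2:L}
    return F.mse_loss(pred, x[:, 1:, :])
```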

#### Uniqueness note (trend module)

*While Transformers are now standard in sequence modeling, our use of a Transformer explicitly as a **trend extractor**—whose sole role is to provide conditioning for subsequent diffusion rather than to serve as the primary generator—creates a clean separation between temporal structure and distributional refinement in the ICS setting.*

---

## 4. DDPM for continuous residual generation
### 4.1 Conditional diffusion on residuals

We model the residual $R$ with a denoising diffusion probabilistic model (DDPM) conditioned on the trend $\hat{S}$ [2]. Let $K$ denote the number of diffusion steps, with a noise schedule $\{\beta_k\}_{k=1}^K$, $\alpha_k = 1-\beta_k$, and $\bar{\alpha}_k=\prod_{i=1}^k \alpha_i$. The forward corruption process is

$$
q(r_k\mid r_0)=\mathcal{N}\!\left(\sqrt{\bar{\alpha}_k}\,r_0,\ (1-\bar{\alpha}_k)\mathbf{I}\right),
$$

equivalently,

$$
r_k = \sqrt{\bar{\alpha}_k}\,r_0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon,\qquad \epsilon\sim\mathcal{N}(0,\mathbf{I}).
$$
The learned reverse process is parameterized as

$$
p_{\theta}(r_{k-1}\mid r_k,\hat{S})=\mathcal{N}\!\left(\mu_{\theta}(r_k,k,\hat{S}),\ \Sigma_{\theta}(k)\right),
$$

where $\mu_\theta$ is implemented by a **Transformer denoiser** that consumes (i) the noised residual $r_k$, (ii) a timestep embedding for $k$, and (iii) conditioning features derived from $\hat{S}$. Conditioning is crucial: it prevents the diffusion model from re-learning the low-frequency temporal scaffold and concentrates its capacity on matching the residual distribution.
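
For reference, a minimal sketch of the forward corruption and the conditional ε-prediction interface, under the same illustrative PyTorch conventions (the linear β-schedule and the name `eps_theta` are assumptions):

```python
import torch

K = 1000
betas = torch.linspace(1e-4, 0.02, K)             # linear beta schedule (an assumption)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)    # \bar{\alpha}_k

def q_sample(r0: torch.Tensor, k: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Forward corruption: r_k = sqrt(abar_k) * r_0 + sqrt(1 - abar_k) * eps."""
    ab = alphas_bar.to(r0.device)[k].view(-1, 1, 1)   # broadcast over (B, L, d_c)
    return ab.sqrt() * r0 + (1.0 - ab).sqrt() * noise

# eps_theta(r_k, k, s_hat) -> predicted noise with the same shape as r_k.
# It is a Transformer denoiser conditioned on the trend, e.g., by concatenating
# s_hat to r_k along the channel axis and adding a timestep embedding for k.
```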
### 4.2 Training objective and loss shaping

We train the denoiser using the standard DDPM $\epsilon$-prediction objective [2]:

$$
\mathcal{L}_{\text{cont}}(\theta)
= \mathbb{E}_{k,r_0,\epsilon}\left[\left\|\epsilon - \epsilon_{\theta}(r_k,k,\hat{S})\right\|_2^2\right].
$$
Because diffusion optimization can exhibit timestep imbalance and gradient conflict, we optionally apply an SNR-based reweighting consistent with Min-SNR training [5]:

$$
\mathcal{L}^{\text{snr}}_{\text{cont}}(\theta)
= \mathbb{E}_{k,r_0,\epsilon}\left[w(k)\left\|\epsilon - \epsilon_{\theta}(r_k,k,\hat{S})\right\|_2^2\right],
\qquad
w(k)=\frac{\mathrm{SNR}_k}{\mathrm{SNR}_k+\gamma}.
$$
(We treat this as an ablation: it is not required by the method, but can improve training efficiency and stability in practice [5].)
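
Continuing the sketch above, the ε-prediction loss with the optional SNR-based weight $w(k)=\mathrm{SNR}_k/(\mathrm{SNR}_k+\gamma)$ can be written as follows (illustrative; it reuses `K`, `q_sample`, and `alphas_bar` from the previous snippet):

```python
import torch

def continuous_loss(eps_theta, r0, s_hat, use_snr_weight: bool = False, gamma: float = 5.0):
    """epsilon-prediction loss on trend-conditioned residuals, with the optional
    SNR-based reweighting w(k) = SNR_k / (SNR_k + gamma) described above."""
    B = r0.size(0)
    k = torch.randint(0, K, (B,), device=r0.device)            # random diffusion steps
    noise = torch.randn_like(r0)
    r_k = q_sample(r0, k, noise)                               # forward corruption
    pred = eps_theta(r_k, k, s_hat)                            # predicted noise

    per_sample = ((noise - pred) ** 2).flatten(1).mean(dim=1)  # squared error per window
    if use_snr_weight:
        ab = alphas_bar.to(r0.device)[k]
        snr = ab / (1.0 - ab)                                  # SNR_k = abar_k / (1 - abar_k)
        per_sample = (snr / (snr + gamma)) * per_sample
    return per_sample.mean()
```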
### 4.3 Continuous reconstruction

After sampling $\hat{R}$ by reverse diffusion, we reconstruct the continuous output as

$$
\hat{X} = \hat{S} + \hat{R}.
$$
This design makes the role of diffusion explicit: it acts as a **distributional corrector** on top of a temporally coherent backbone.
#### Uniqueness note (continuous branch)
*The central integration here is not “Transformer + diffusion” in isolation, but rather a **trend-conditioned residual diffusion** formulation: diffusion is trained on residual structure explicitly defined relative to a Transformer trend model, which is particularly aligned with ICS where low-frequency dynamics are strong and persistent.*

---

## 5. Masked diffusion for discrete ICS variables
### 5.1 Discrete corruption via absorbing masks

Discrete ICS variables must remain categorical, making Gaussian diffusion inappropriate. We therefore use **masked (absorbing) diffusion** for discrete channels [3,4], in which corruption replaces tokens with a special $\texttt{[MASK]}$ symbol according to a schedule.

For a discrete sequence $y^{(j)}_{1:L}$, define a corruption level $m_k\in[0,1]$ that increases with $k$. The corrupted sequence at step $k$ is formed by independent masking:

$$
y^{(j,k)}_t =
\begin{cases}
\texttt{[MASK]}, & \text{with probability } m_k,\\
y^{(j)}_t, & \text{otherwise}.
\end{cases}
$$
This “absorbing” mechanism yields a diffusion-like process over discrete states while preserving a clear reconstruction target at every step [3,4].
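
A minimal sketch of this corruption step (the reserved `MASK_ID` index and the linear schedule $m_k=(k+1)/K$ are illustrative choices, not prescribed by the method):

```python
import torch

MASK_ID = 0          # reserve index 0 of each vocabulary for [MASK] (an assumption)

def mask_schedule(k: torch.Tensor, K: int) -> torch.Tensor:
    """Corruption level m_k increasing in k; a simple linear schedule."""
    return (k.float() + 1.0) / K

def corrupt_discrete(y: torch.Tensor, k: torch.Tensor, K: int) -> torch.Tensor:
    """Independently replace tokens with [MASK] with probability m_k.
    y: (B, L, d_d) integer tokens; k: (B,) diffusion steps."""
    m_k = mask_schedule(k, K).view(-1, 1, 1)                # (B, 1, 1)
    mask = torch.rand_like(y, dtype=torch.float) < m_k      # Bernoulli(m_k) per position
    return torch.where(mask, torch.full_like(y, MASK_ID), y)
```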
### 5.2 Categorical denoising objective

We parameterize the reverse model with a Transformer $h_\psi$ that outputs a categorical distribution over $\mathcal{V}_j$ for each masked position. Let $\mathcal{M}$ be the set of masked indices. We train with cross-entropy over masked positions:

$$
\mathcal{L}_{\text{disc}}(\psi)
= \mathbb{E}_{k}\left[\frac{1}{|\mathcal{M}|}\sum_{(j,t)\in\mathcal{M}}
\mathrm{CE}\!\left(h_\psi\!\left(y^{(k)},k,\hat{S}\right)_{j,t},\ y^{(j)}_t\right)\right].
$$

Conditioning on $\hat{S}$ (and, optionally, on $\hat{X}$) allows discrete state generation to reflect the continuous context, which is often essential in ICS (e.g., mode transitions coupled to process regimes).
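
A corresponding sketch of the masked cross-entropy objective, shown for a single discrete channel for brevity (`h_psi` is assumed to return per-position logits over that channel's vocabulary, including the mask index):

```python
import torch
import torch.nn.functional as F

def discrete_loss(h_psi, y0: torch.Tensor, s_hat: torch.Tensor, K: int) -> torch.Tensor:
    """Cross-entropy on masked positions only. y0: (B, L) tokens of one discrete channel."""
    B = y0.size(0)
    k = torch.randint(0, K, (B,), device=y0.device)
    y_k = corrupt_discrete(y0.unsqueeze(-1), k, K).squeeze(-1)   # reuse the corruption above
    logits = h_psi(y_k, k, s_hat)                                # (B, L, |V_j|)
    masked = (y_k == MASK_ID)                                    # positions to reconstruct
    if not masked.any():
        return logits.new_zeros(())
    return F.cross_entropy(logits[masked], y0[masked])
```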
### 5.3 Sampling

At inference, we initialize $y^{(K)}$ as fully masked and iteratively denoise/unmask from $k=K$ to $1$, sampling valid tokens from the predicted categorical distributions. Because outputs are always drawn from $\mathcal{V}_j$, semantic validity is satisfied by construction.
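
A matching sampler sketch; the commit rule used here (reveal roughly a $1/k$ fraction of the remaining masked positions per step, and everything at the final step) is one simple choice among many:

```python
import torch

@torch.no_grad()
def sample_discrete(h_psi, s_hat: torch.Tensor, L: int, K: int) -> torch.Tensor:
    """Iterative unmasking for one discrete channel: start fully masked, then at each
    reverse step sample categoricals for masked positions and commit a growing subset."""
    B = s_hat.size(0)
    y = torch.full((B, L), MASK_ID, dtype=torch.long, device=s_hat.device)
    for k in range(K, 0, -1):
        k_vec = torch.full((B,), k - 1, dtype=torch.long, device=s_hat.device)
        logits = h_psi(y, k_vec, s_hat)                    # (B, L, |V_j|)
        draws = torch.distributions.Categorical(logits=logits).sample()
        still_masked = (y == MASK_ID)
        if k == 1:
            reveal = still_masked                          # final step: unmask everything
        else:
            reveal = still_masked & (torch.rand(B, L, device=y.device) < 1.0 / k)
        y = torch.where(reveal, draws, y)
    return y
```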

#### Uniqueness note (discrete branch)

*To our knowledge, diffusion-based ICS synthesis has predominantly emphasized continuous multivariate generation; in contrast, we introduce masked diffusion as a **first-class discrete legality mechanism** within the same pipeline, rather than treating discrete tags as continuous surrogates or post-hoc thresholds.*

---

## 6. Type-aware decomposition as a performance refinement layer
### 6.1 Motivation: mechanistic heterogeneity in ICS variables
Even with a strong continuous generator and a discrete legality mechanism, ICS data pose an additional challenge: variables arise from **qualitatively different generative mechanisms**. For example, program-driven setpoints are often piecewise constant with abrupt changes; actuators exhibit saturation and dwell; some channels are deterministic functions of others (derived tags). Modeling all channels identically can cause persistent error concentration and degrade overall quality.
We therefore introduce a **type-aware decomposition** that operates on top of the base pipeline to improve fidelity, stability, and interpretability.
### 6.2 Typing function and routing

Let $\tau(i)\in\{1,\dots,6\}$ assign each variable $i$ to a type. This induces a partition of variables into index sets $\{\mathcal{I}_k\}_{k=1}^6$. We then define the generator as a composition of type-specific operators:

$$
(\hat{X},\hat{Y}) = \mathcal{A}\Big(\hat{S},\hat{R},\hat{Y}_{\text{mask}},\{\mathcal{G}_k\}_{k=1}^6;\tau\Big),
$$

where $\mathcal{A}$ assembles a complete sample and $\mathcal{G}_k$ encodes the appropriate modeling choice per type.
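
As an illustration of how the assembly operator $\mathcal{A}$ can be organized in code (hypothetical names such as `TypeRouting`, `generators`, and `assemble`; discrete channels are assembled analogously and omitted here):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

import torch

@dataclass
class TypeRouting:
    """tau as an explicit partition: type id (1..6) -> list of variable indices."""
    index_sets: Dict[int, List[int]]

def assemble(s_hat: torch.Tensor, r_hat: torch.Tensor, y_hat: torch.Tensor,
             routing: TypeRouting,
             generators: Dict[int, Callable[..., torch.Tensor]]) -> torch.Tensor:
    """Assemble continuous channels: each type's columns come from its own operator,
    e.g., Type 4 uses trend + residual, Type 5 applies deterministic rules."""
    x_hat = torch.zeros_like(s_hat)
    for type_id, idx in routing.index_sets.items():
        cols = generators[type_id](s_hat=s_hat, r_hat=r_hat, y_hat=y_hat, idx=idx)
        x_hat[..., idx] = cols
    return x_hat
```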
### 6.3 Six-type schema and modeling commitments

We operationalize six types that map cleanly onto common ICS semantics (an illustrative assignment is sketched after the list):

* **Type 1 (program-driven / setpoint-like):** exogenous drivers with step changes and long dwell. These variables are treated as conditioning signals or modeled with a dedicated change-point-aware generator rather than forcing them into residual diffusion.
* **Type 2 (controller outputs):** variables that respond to setpoints and process feedback; we treat them as conditional on Type 1 and continuous context, and allow separate specialization if they dominate error.
* **Type 3 (actuator states/positions):** bounded, often quantized, with saturation and dwell; we route them to discrete masked diffusion when naturally categorical/quantized, or to specialized dwell-aware modeling when continuous but state-persistent.
* **Type 4 (process variables):** inertia-dominated continuous dynamics; these are the primary beneficiaries of the **Transformer trend + residual DDPM** pipeline.
* **Type 5 (derived/deterministic variables):** algebraic or rule-based functions of other variables; we enforce deterministic reconstruction $\hat{x}^{(i)} = g_i(\hat{X},\hat{Y})$ rather than learning a stochastic generator, improving logical consistency and sample efficiency.
* **Type 6 (auxiliary/low-impact variables):** weakly coupled or sparse signals; we allow simplified modeling (e.g., calibrated marginals or lightweight temporal models) to avoid allocating diffusion capacity where it is not warranted.
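
For illustration only, $\tau$ might be initialized from tag metadata roughly as follows; the tag names and the Type 5 rule below are entirely hypothetical:

```python
# Hypothetical tag -> type assignment (type ids follow the schema above).
tau = {
    "FIC101.SP": 1,    # program-driven setpoint
    "FIC101.OUT": 2,   # controller output
    "MV101.STATE": 3,  # actuator state (categorical)
    "LIT101.PV": 4,    # process variable (trend + residual DDPM)
    "P101.POWER": 5,   # derived: deterministic function of other tags
    "AUX_TEMP": 6,     # auxiliary / low-impact
}

# Example deterministic rule g_i for a Type 5 tag; x_hat / y_hat here are dicts of
# per-tag arrays (an illustrative interface, not the paper's).
def derive_p101_power(x_hat, y_hat):
    # e.g., pump power is zero when the pump is OFF, else a fixed function of flow
    return (y_hat["MV101.STATE"] != 0) * (0.8 * x_hat["FIC101.OUT"] + 5.0)
```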
### 6.4 Training-time and inference-time integration
Type-aware decomposition improves performance through three concrete mechanisms:

1. **Capacity allocation:** by focusing diffusion on Type 4 (and selected Type 2/3), we reduce the tendency for a few mechanistically distinct variables to dominate gradients and distort the learned distribution elsewhere.
2. **Constraint enforcement:** Type 5 variables are computed deterministically, preventing logically inconsistent samples that a purely learned generator may produce.
3. **Mechanism alignment:** Type 1/3 variables receive inductive biases consistent with step-like or dwell-like behavior, which diffusion-trained smooth denoisers can otherwise over-regularize.

In practice, $\tau$ can be initialized from domain semantics (tag metadata and value domains) and refined via an error-attribution loop described in the Benchmark section. Importantly, this refinement does not alter the core generative backbone; it *re-routes* difficult variables to better-suited mechanisms while preserving end-to-end coherence.

#### Uniqueness note (type-aware layer)

*The type-aware decomposition is not merely a diagnostic taxonomy; it is a modeling layer that **changes the generator’s factorization**—deciding which variables are best treated as conditional drivers, which require dwell-aware behavior, and which should be deterministic—thereby tailoring the diffusion pipeline to the structural heterogeneity characteristic of ICS telemetry.*

---

## 7. Joint optimization and end-to-end sampling

We train the model in a staged manner consistent with the factorization (a driver sketch follows the list):

1. Train the trend Transformer $f_\phi$ to obtain $\hat{S}$.
2. Compute residual targets $R=X-\hat{S}$ for Type 4 (and any routed continuous types).
3. Train the residual DDPM $p_\theta(R\mid \hat{S})$ and the masked diffusion model $p_\psi(Y\mid \text{masked}(Y),\hat{S})$.
4. Apply type-aware routing and deterministic reconstruction rules during sampling.
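
A compact sketch of this staged driver, reusing the illustrative helpers from the earlier snippets (`trend_loss`, `continuous_loss`, `discrete_loss`); optimizers, schedulers, and data handling are reduced to a minimum:

```python
import torch

def train_staged(trend_model, eps_theta, h_psi, loader, K: int, lam: float = 0.5, epochs: int = 10):
    """Stage 1: fit the trend Transformer. Stages 2-3: fit the residual DDPM and masked
    diffusion on trend-conditioned targets with the lambda-weighted combined objective."""
    opt_trend = torch.optim.Adam(trend_model.parameters(), lr=1e-4)
    for _ in range(epochs):                        # Stage 1
        for x, y in loader:                        # x: (B, L, d_c), y: (B, L) discrete tokens
            opt_trend.zero_grad()
            trend_loss(trend_model, x).backward()
            opt_trend.step()

    opt_diff = torch.optim.Adam(list(eps_theta.parameters()) + list(h_psi.parameters()), lr=1e-4)
    for _ in range(epochs):                        # Stages 2-3
        for x, y in loader:
            with torch.no_grad():
                s_hat = trend_model(x[:, :-1, :])              # trend estimate (teacher-forced here)
                s_hat = torch.cat([x[:, :1, :], s_hat], dim=1)
            r0 = x - s_hat                                     # residual targets R = X - S_hat
            loss = lam * continuous_loss(eps_theta, r0, s_hat) \
                   + (1.0 - lam) * discrete_loss(h_psi, y, s_hat, K)
            opt_diff.zero_grad()
            loss.backward()
            opt_diff.step()
```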
A simple combined objective is

$$
\mathcal{L} = \lambda\,\mathcal{L}_{\text{cont}} + (1-\lambda)\,\mathcal{L}_{\text{disc}},
$$

with $\lambda\in[0,1]$ controlling the balance between continuous and discrete learning. Type-aware routing determines which channels contribute to which loss and which are excluded in favor of deterministic reconstruction.

At inference time, we generate in the same order: (i) trend $\hat{S}$, (ii) residual $\hat{R}$ via DDPM, (iii) discrete $\hat{Y}$ via masked diffusion, and (iv) type-aware assembly and deterministic reconstruction.
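
And a matching end-to-end sampling sketch; `sample_residual` stands in for a standard DDPM ancestral sampler over residuals (an assumed helper), and the trend-rollout seeding strategy is left as a design choice:

```python
import torch

@torch.no_grad()
def generate(trend_model, eps_theta, h_psi, routing, generators, n: int, L: int, d_c: int, K: int):
    """(i) autoregressive trend rollout, (ii) residual reverse diffusion, (iii) masked-diffusion
    unmasking for discrete channels, (iv) type-aware assembly."""
    # (i) roll the trend Transformer forward from a seed window (here: zeros, for simplicity)
    s_hat = torch.zeros(n, L, d_c)
    for t in range(1, L):
        s_hat[:, t, :] = trend_model(s_hat[:, :t, :])[:, -1, :]
    # (ii) reverse DDPM over residuals, conditioned on the trend
    r_hat = sample_residual(eps_theta, s_hat, K)
    # (iii) discrete channels via iterative unmasking
    y_hat = sample_discrete(h_psi, s_hat, L, K)
    # (iv) assemble, applying type routing and deterministic rules for Type 5
    x_hat = assemble(s_hat, r_hat, y_hat, routing, generators)
    return x_hat, y_hat
```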

---

# References

[1] Vaswani, A., Shazeer, N., Parmar, N., et al. *Attention Is All You Need.* NeurIPS, 2017. [arXiv:1706.03762](https://arxiv.org/abs/1706.03762)

[2] Ho, J., Jain, A., Abbeel, P. *Denoising Diffusion Probabilistic Models.* NeurIPS, 2020. [arXiv:2006.11239](https://arxiv.org/abs/2006.11239)

[3] Austin, J., Johnson, D. D., Ho, J., Tarlow, D., van den Berg, R. *Structured Denoising Diffusion Models in Discrete State-Spaces (D3PM).* NeurIPS, 2021. [arXiv:2107.03006](https://arxiv.org/abs/2107.03006)

[4] Shi, J., Han, K., Wang, Z., Doucet, A., Titsias, M. K. *Simplified and Generalized Masked Diffusion for Discrete Data.* 2024. [arXiv:2406.04329](https://arxiv.org/abs/2406.04329)

[5] Hang, T., Gu, S., Li, C., et al. *Efficient Diffusion Training via Min-SNR Weighting Strategy.* ICCV, 2023. [arXiv:2303.09556](https://arxiv.org/abs/2303.09556)

[6] Kollovieh, M., Ansari, A. F., Bohlke-Schneider, M., et al. *Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting (TSDiff).* 2023. [arXiv:2307.11494](https://arxiv.org/abs/2307.11494)

[7] Su, C., Cai, Z., Tian, Y., et al. *Diffusion Models for Time Series Forecasting: A Survey.* 2025. [arXiv:2507.14507](https://arxiv.org/abs/2507.14507)