Version 2; differs slightly from the online docs v2 (use that).

online docs v2: https://my.feishu.cn/wiki/Za4dwCsG6iPD9qklRLWcoJOZnnb?edition_id=dXsOZT

Industrial control system (ICS) telemetry is a **mixed-type** sequential object: continuous channels (X\in\mathbb{R}^{L\times d_c}) coexist with discrete channels (Y) whose values lie in per-channel vocabularies (\mathcal{V}_j).
Our methodological objective is to learn a generator that produces synthetic ((\hat{X},\hat{Y})) that are simultaneously: (a) temporally coherent, (b) distributionally faithful, and (c) **semantically valid** for discrete channels (i.e., (\hat{y}^{(j)}_t\in\mathcal{V}_j) by construction). A key empirical observation motivating our design is that, in ICS, temporal realism and distributional realism often “pull” the model in different directions when trained monolithically; this is amplified by the heterogeneous mechanisms that produce different variables (program steps, controller laws, actuator saturation, physical inertia, and deterministic derived tags). We therefore structure the generator to *separate concerns* and then *specialize*.
---

*(Note: add a workflow diagram here.)*
## 2. Overview of Mask-DDPM
We propose **Mask-DDPM**, organized in the following order:
1. **Transformer trend module**: learns the dominant temporal backbone of continuous dynamics via attention-based sequence modeling [1].
2. **Residual DDPM for continuous variables**: models distributional detail as stochastic residual structure conditioned on the learned trend [2].
3. **Masked diffusion for discrete variables**: generates discrete ICS states with an absorbing/masking corruption process and categorical reconstruction [3,4].
4. **Type-aware decomposition**: a post-processing layer that routes variables to the most appropriate modeling mechanism and enforces deterministic constraints where warranted.
This ordering is intentional: the trend module fixes the *macro-temporal scaffold*; diffusion then focuses on *micro-structure and marginal fidelity*; masked diffusion guarantees *discrete legality*; and type-aware routing resolves the remaining mismatch caused by heterogeneous variable-generating mechanisms.
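A minimal end-to-end sketch of this ordering is shown below; the stage interfaces and the toy stand-ins (`trend_model`, `residual_ddpm`, `masked_diffusion`, `type_router`) are illustrative assumptions, not the actual implementation.

```python
# Illustrative sketch of the Mask-DDPM generation order:
# trend rollout -> residual diffusion -> masked diffusion -> type-aware routing.
import numpy as np

def generate(trend_model, residual_ddpm, masked_diffusion, type_router, L, d_c):
    S_hat = trend_model(L, d_c)          # 1. macro-temporal scaffold (trend rollout)
    R_hat = residual_ddpm(S_hat)         # 2. distributional detail via residual diffusion
    X_hat = S_hat + R_hat                # continuous reconstruction
    Y_hat = masked_diffusion(S_hat)      # 3. legal discrete tokens via masked diffusion
    return type_router(X_hat, Y_hat)     # 4. type-aware routing / deterministic constraints

# Toy stand-ins so the sketch runs end to end.
rng = np.random.default_rng(0)
X_hat, Y_hat = generate(
    trend_model=lambda L, d: 0.01 * np.cumsum(rng.normal(size=(L, d)), axis=0),
    residual_ddpm=lambda S: rng.normal(scale=0.05, size=S.shape),
    masked_diffusion=lambda S: rng.integers(0, 3, size=(S.shape[0], 2)),
    type_router=lambda X, Y: (X, Y),
    L=64, d_c=4,
)
```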
## 3. Transformer trend module for continuous dynamics
Transformers are now standard in sequence modeling [1]. We use the Transformer here explicitly as a trend extractor whose sole role is to provide conditioning for the subsequent diffusion stages rather than to serve as the primary generator; this creates a clean separation between temporal structure and distributional refinement in the ICS setting.

### 3.1 Trend–residual decomposition
For continuous channels (X), we posit an additive decomposition
[
X = S + R,
]
where (S\in\mathbb{R}^{L\times d_c}) is a smooth **trend** capturing predictable temporal evolution, and (R\in\mathbb{R}^{L\times d_c}) is a **residual** capturing distributional detail (e.g., bursts, heavy tails, local fluctuations) that is difficult to represent robustly with a purely regression-based temporal objective.
### 3.2 Causal Transformer parameterization
We parameterize the trend (S) using a **causal Transformer** (f_\phi) [1]. Concretely, with teacher forcing we train (f_\phi) to predict the next-step trend from past observations:
[
\hat{s}_{t+1} = f_{\phi}(x_{1:t}),
\qquad
\mathcal{L}_{\text{trend}}(\phi) = \sum_{t=1}^{L-1}\left\| \hat{s}_{t+1} - x_{t+1} \right\|^{2}.
]
Self-attention is particularly suitable here because it provides a direct mechanism for capturing long-range temporal dependencies across the window.
At inference, we roll out the Transformer autoregressively to obtain (\hat{S}), and then define the residual target for diffusion as (R = X - \hat{S}).
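The following is a minimal PyTorch sketch of a causal Transformer trend extractor under assumed hyperparameters (model width, heads, and layer count are placeholders): teacher-forced next-step regression for training, and autoregressive rollout to obtain (\hat{S}).

```python
# Minimal sketch of the causal Transformer trend extractor f_phi (assumed sizes).
import torch
import torch.nn as nn

class TrendTransformer(nn.Module):
    def __init__(self, d_c, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.inp = nn.Linear(d_c, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=128,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, d_c)

    def forward(self, x):                              # x: (B, t, d_c)
        t = x.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(x.device)
        h = self.encoder(self.inp(x), mask=mask)       # causal self-attention
        return self.out(h)                             # next-step prediction per position

def teacher_forcing_loss(model, x):                    # x: (B, L, d_c)
    pred = model(x[:, :-1])                            # predict x_{2..L} from prefixes
    return torch.mean((pred - x[:, 1:]) ** 2)

@torch.no_grad()
def rollout(model, x0, L):                             # x0: (B, 1, d_c) seed observation
    seq = x0
    for _ in range(L - 1):
        nxt = model(seq)[:, -1:, :]                    # last-position output = next trend step
        seq = torch.cat([seq, nxt], dim=1)
    return seq                                         # S_hat: (B, L, d_c)
```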
---
## 4. DDPM for continuous residual generation
The central integration here is not “Transformer + diffusion” in isolation, but a **trend-conditioned residual diffusion** formulation: diffusion is trained on residual structure defined explicitly relative to the Transformer trend model, which is particularly well suited to ICS, where low-frequency dynamics are strong and persistent.

### 4.1 Conditional diffusion on residuals
We model the residual (R) with a denoising diffusion probabilistic model (DDPM) conditioned on the trend (\hat{S}) [2]. Let (K) denote the number of diffusion steps, with a noise schedule ({\beta_k}_{k=1}^K), (\alpha_k = 1-\beta_k), and (\bar{\alpha}_k=\prod_{i=1}^k \alpha_i). The forward corruption process is
[
q(r_k \mid r_{k-1}) = \mathcal{N}\!\left(r_k;\ \sqrt{1-\beta_k}\,r_{k-1},\ \beta_k I\right),
\qquad
q(r_k \mid r_0) = \mathcal{N}\!\left(r_k;\ \sqrt{\bar{\alpha}_k}\,r_0,\ (1-\bar{\alpha}_k) I\right),
]
and the learned reverse process is
[
p_{\theta}(r_{k-1}\mid r_k,\hat{S})=\mathcal{N}\!\left(r_{k-1};\ \mu_{\theta}(r_k,k,\hat{S}),\ \sigma_k^2 I\right),
]
where (\mu_\theta) is implemented by a **Transformer denoiser** that consumes (i) the noised residual (r_k), (ii) a timestep embedding for (k), and (iii) conditioning features derived from (\hat{S}). Conditioning is crucial: it prevents the diffusion model from re-learning the low-frequency temporal scaffold and concentrates its capacity on matching the residual distribution.
### 4.2 Training objective and loss shaping
We train the denoiser using the standard DDPM (\epsilon)-prediction objective [2]:
[
\mathcal{L}_{\text{cont}}(\theta)
= \mathbb{E}_{r_0,\epsilon,k}\!\left[
\left\| \epsilon - \epsilon_{\theta}\!\left(\sqrt{\bar{\alpha}_k}\,r_0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon,\ k,\ \hat{S}\right) \right\|^{2}
\right].
]
Optionally, the per-step loss can be reweighted by a signal-to-noise-ratio (SNR) factor,
[
\mathbb{E}_{r_0,\epsilon,k}\!\left[
w(k)\left\| \epsilon - \epsilon_{\theta}\!\left(\sqrt{\bar{\alpha}_k}\,r_0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon,\ k,\ \hat{S}\right) \right\|^{2}
\right],
\qquad
w(k)=\frac{\mathrm{SNR}_k}{\mathrm{SNR}_k+\gamma}.
]
(We treat this as an ablation: it is not required by the method, but can improve training efficiency and stability in practice [5].)
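A compact sketch of one training step follows, combining the forward corruption, the (\epsilon)-prediction loss, and the optional SNR weight; the tensor shapes and the `denoiser` interface are assumptions for illustration.

```python
# Sketch of one residual-DDPM training step (assumed shapes and interfaces).
import torch

def ddpm_training_step(denoiser, R, S_hat, alpha_bar, gamma=5.0, snr_weight=False):
    """R, S_hat: (B, L, d_c); alpha_bar: (K,) cumulative products of alpha_k."""
    B, K = R.size(0), alpha_bar.size(0)
    k = torch.randint(0, K, (B,), device=R.device)              # uniform timestep per sample
    a_bar = alpha_bar[k].view(B, 1, 1)
    eps = torch.randn_like(R)
    r_k = torch.sqrt(a_bar) * R + torch.sqrt(1 - a_bar) * eps   # forward corruption q(r_k | r_0)
    eps_hat = denoiser(r_k, k, S_hat)                           # Transformer denoiser, conditioned on S_hat
    err = ((eps - eps_hat) ** 2).mean(dim=(1, 2))               # per-sample epsilon-prediction error
    if snr_weight:                                              # optional SNR-based weighting (ablation)
        snr = a_bar.view(B) / (1 - a_bar.view(B))
        err = err * (snr / (snr + gamma))
    return err.mean()
```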
### 4.3 Continuous reconstruction
After sampling (\hat{R}) by reverse diffusion, we reconstruct the continuous output as
[
\hat{X} = \hat{S} + \hat{R}.
]
This design makes the role of diffusion explicit: it acts as a **distributional corrector** on top of a temporally coherent backbone.
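For concreteness, a minimal ancestral-sampling sketch of the reverse process is given below; it assumes the standard DDPM posterior-mean parameterization with (\sigma_k^2=\beta_k) and an illustrative `denoiser` interface.

```python
# Sketch of reverse sampling: draw R_hat conditioned on S_hat, then X_hat = S_hat + R_hat.
import torch

@torch.no_grad()
def sample_continuous(denoiser, S_hat, beta):                   # beta: (K,) noise schedule
    alpha = 1.0 - beta
    alpha_bar = torch.cumprod(alpha, dim=0)
    r = torch.randn_like(S_hat)                                 # r_K ~ N(0, I)
    for k in reversed(range(beta.size(0))):
        k_idx = torch.full((S_hat.size(0),), k, device=S_hat.device, dtype=torch.long)
        eps_hat = denoiser(r, k_idx, S_hat)
        coef = (1 - alpha[k]) / torch.sqrt(1 - alpha_bar[k])
        mean = (r - coef * eps_hat) / torch.sqrt(alpha[k])      # posterior mean mu_theta
        noise = torch.randn_like(r) if k > 0 else torch.zeros_like(r)
        r = mean + torch.sqrt(beta[k]) * noise                  # sigma_k^2 = beta_k
    return S_hat + r                                            # X_hat = S_hat + R_hat
```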
---
## 5. Masked diffusion for discrete ICS variables
To our knowledge, diffusion-based ICS synthesis has predominantly emphasized continuous multivariate generation; in contrast, we introduce masked diffusion as a **first-class discrete legality mechanism** within the same pipeline, rather than treating discrete tags as continuous surrogates or applying post-hoc thresholds.

### 5.1 Discrete corruption via absorbing masks

Discrete ICS variables must remain categorical, making Gaussian diffusion inappropriate. We therefore use **masked (absorbing) diffusion** for discrete channels [3,4], in which corruption replaces tokens with a special (\texttt{[MASK]}) symbol according to a schedule.
For a discrete sequence (y^{(j)}_{1:L}), define a corruption level (m_k\in[0,1]) that increases with (k). The corrupted sequence at step (k) is formed by independent masking:
[
y^{(j,k)}_t =
\begin{cases}
\texttt{[MASK]}, & \text{with probability } m_k,\\
y^{(j)}_t, & \text{otherwise}.
\end{cases}
]
This “absorbing” mechanism yields a diffusion-like process over discrete states while preserving a clear reconstruction target at every step [3,4].
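A small sketch of this corruption step for a single channel follows; the reserved `MASK_ID` and the scalar schedule value `m_k` are illustrative assumptions.

```python
# Sketch of absorbing-mask corruption for one discrete channel.
import numpy as np

MASK_ID = -1  # reserved id standing in for the [MASK] token

def corrupt(y, m_k, rng):
    """y: (L,) integer tokens; m_k: corruption level in [0, 1]."""
    mask = rng.random(y.shape) < m_k          # independent per-position masking
    y_k = y.copy()
    y_k[mask] = MASK_ID
    return y_k, mask                          # mask marks the reconstruction targets

rng = np.random.default_rng(0)
y = rng.integers(0, 4, size=16)               # tokens from a vocabulary of size 4
y_k, mask = corrupt(y, m_k=0.5, rng=rng)
```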
### 5.2 Categorical denoising objective
We parameterize the reverse model with a Transformer (h_\psi) that outputs a categorical distribution over (\mathcal{V}_j) for each masked position. Let (\mathcal{M}) be the set of masked indices. We train with cross-entropy over masked positions:
[
\mathcal{L}_{\text{disc}}(\psi)
=
\mathbb{E}\!\left[
\sum_{(j,t)\in\mathcal{M}}
\mathrm{CE}\!\left(
h_\psi\!\left(y^{(k)},k,\hat{S}\right)_{j,t},\ y^{(j)}_t
\right)
\right],
]
Conditioning on (\hat{S}) (and, optionally, on (\hat{X})) allows discrete state generation to reflect the continuous context, which is often essential in ICS (e.g., mode transitions coupled to process regimes).
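A minimal sketch of the masked cross-entropy computation is below, assuming a `(B, L, V)` logits layout from (h_\psi) and a boolean mask over positions.

```python
# Sketch of cross-entropy restricted to masked positions (assumed tensor layout).
import torch
import torch.nn.functional as F

def masked_ce_loss(logits, targets, mask):
    """logits: (B, L, V); targets: (B, L) token ids; mask: (B, L) bool, True = masked."""
    loss = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")  # (B, L)
    return (loss * mask).sum() / mask.sum().clamp(min=1)        # average over masked positions only
```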
### 5.3 Sampling
At inference, we initialize (y^{(K)}) as fully masked and iteratively denoise/unmask from (k=K) to (1), sampling valid tokens from the predicted categorical distributions. Because outputs are always drawn from (\mathcal{V}_j), semantic validity is satisfied by construction.
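The sketch below illustrates this iterative unmasking loop; the per-step unmask fraction (here `1/k`) and the `h_psi` call signature are illustrative choices rather than the method's prescribed schedule.

```python
# Sketch of iterative unmasking at inference: start fully masked, reveal tokens
# from k = K down to 1, always sampling from the predicted categorical distributions.
import torch

@torch.no_grad()
def sample_discrete(h_psi, S_hat, L, K, mask_id):
    y = torch.full((S_hat.size(0), L), mask_id, device=S_hat.device, dtype=torch.long)
    for k in reversed(range(1, K + 1)):
        logits = h_psi(y, k, S_hat)                              # (B, L, V)
        probs = torch.softmax(logits, dim=-1)
        sampled = torch.distributions.Categorical(probs).sample()
        still_masked = y == mask_id
        # reveal a fraction of the currently-masked positions; at k = 1 everything is revealed
        reveal = still_masked & (torch.rand_like(probs[..., 0]) < 1.0 / k)
        y = torch.where(reveal, sampled, y)
    return y                                                     # tokens always lie in V_j
```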
---
## 6. Type-aware decomposition as a performance refinement layer
### 6.1 Motivation: mechanistic heterogeneity in ICS variables
Even with a strong continuous generator and a discrete legality mechanism, ICS data pose an additional challenge: variables arise from **qualitatively different generative mechanisms**. For example, program-driven setpoints are often piecewise constant with abrupt changes; actuators exhibit saturation and dwell; some channels are deterministic functions of others (derived tags). Modeling all channels identically can cause persistent error concentration and degrade overall quality.
We therefore introduce a **type-aware decomposition** that operates on top of the base pipeline to improve fidelity, stability, and interpretability.
### 6.2 Typing function and routing

The type-aware decomposition is not merely a diagnostic taxonomy; it is a modeling layer that changes the generator's factorization, deciding which variables are best treated as conditional drivers, which require dwell-aware behavior, and which should be deterministic. It thereby tailors the diffusion pipeline to the structural heterogeneity characteristic of ICS telemetry.
Let (\tau(i)\in{1,\dots,6}) assign each variable (i) to a type. This induces a partition of variables into index sets ({\mathcal{I}_k}_{k=1}^6). We then define the generator as a composition of type-specific operators:
[
(\hat{X},\hat{Y})
=
\mathcal{A}\!\left(\mathcal{G}_1(\mathcal{I}_1),\dots,\mathcal{G}_6(\mathcal{I}_6)\right),
]
where (\mathcal{A}) assembles a complete sample and (\mathcal{G}_k) encodes the appropriate modeling choice per type.
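A minimal sketch of this routed composition is given below; the two example mechanisms and the routing table are illustrative stand-ins for the type-specific operators (\mathcal{G}_k).

```python
# Sketch of the typed composition: tau routes each variable to a mechanism,
# and the assembly operator A stacks the routed outputs into a complete sample.
import numpy as np

def compose(tau, mechanisms, var_indices, L, rng):
    """tau: dict var -> type id; mechanisms: dict type id -> callable(var, L, rng) -> (L,) array."""
    columns = {i: mechanisms[tau[i]](i, L, rng) for i in var_indices}
    return np.stack([columns[i] for i in var_indices], axis=1)   # assembly operator A

rng = np.random.default_rng(0)
mechanisms = {
    1: lambda i, L, r: np.repeat(r.integers(0, 3, L // 8 + 1), 8)[:L].astype(float),  # piecewise-constant driver
    2: lambda i, L, r: np.clip(0.1 * np.cumsum(r.normal(size=L)), -1.0, 1.0),         # saturating, inertia-like signal
}
sample = compose(tau={0: 1, 1: 2}, mechanisms=mechanisms, var_indices=[0, 1], L=64, rng=rng)
```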
### 6.3 Six-type schema and modeling commitments
We operationalize six types that map cleanly onto common ICS semantics:
* **Type 5 (derived/deterministic variables):** algebraic or rule-based functions of other variables; we enforce deterministic reconstruction (\hat{x}^{(i)} = g_i(\hat{X},\hat{Y})) rather than learning a stochastic generator, improving logical consistency and sample efficiency (a minimal sketch follows this list).
* **Type 6 (auxiliary/low-impact variables):** weakly coupled or sparse signals; we allow simplified modeling (e.g., calibrated marginals or lightweight temporal models) to avoid allocating diffusion capacity where it is not warranted.
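Below is a minimal sketch of the Type 5 commitment: derived channels are recomputed deterministically from other channels instead of being sampled; the rule `g_i` used here is purely illustrative.

```python
# Sketch of enforcing derived (Type 5) channels by deterministic reconstruction.
import numpy as np

def enforce_derived(X_hat, derived_rules):
    """X_hat: (L, d) array; derived_rules: dict column index -> callable(X_hat) -> (L,) array."""
    X_out = X_hat.copy()
    for col, g in derived_rules.items():
        X_out[:, col] = g(X_hat)              # overwrite with the deterministic value g_i(X_hat)
    return X_out

rng = np.random.default_rng(0)
X_hat = rng.normal(size=(32, 3))
# illustrative rule: column 2 is defined as the difference of two other channels
X_hat = enforce_derived(X_hat, {2: lambda X: X[:, 0] - X[:, 1]})
```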
### 6.4 Training-time and inference-time integration
Type-aware decomposition improves performance through three concrete mechanisms:
In practice, (\tau) can be initialized from domain semantics (tag metadata and value domains) and refined via an error-attribution loop described in the Benchmark section. Importantly, this refinement does not alter the core generative backbone; it *re-routes* difficult variables to better-suited mechanisms while preserving end-to-end coherence.
---
## 7. Joint optimization and end-to-end sampling