Version 2; differs slightly from the online docs v2 (use that).

online docs v2: https://my.feishu.cn/wiki/Za4dwCsG6iPD9qklRLWcoJOZnnb?edition_id=dXsOZT

Industrial control system (ICS) telemetry is a **mixed-type** sequential object: continuous channels (X\in\mathbb{R}^{L\times d_c}) coexist with discrete channels (Y) whose values lie in per-channel vocabularies (\mathcal{V}_j).
Our methodological objective is to learn a generator that produces synthetic ((\hat{X},\hat{Y})) that are simultaneously: (a) temporally coherent, (b) distributionally faithful, and (c) **semantically valid** for discrete channels (i.e., (\hat{y}^{(j)}_t\in\mathcal{V}_j) by construction). A key empirical observation motivating our design is that, in ICS, temporal realism and distributional realism often “pull” the model in different directions when trained monolithically; this is amplified by the heterogeneous mechanisms that produce different variables (program steps, controller laws, actuator saturation, physical inertia, and deterministic derived tags). We therefore structure the generator to *separate concerns* and then *specialize*.
---

*(Note: add a workflow diagram here.)*
## 2. Overview of Mask-DDPM
We propose **Mask-DDPM**, organized in the following order:
1. **Transformer trend module**: learns the dominant temporal backbone of continuous dynamics via attention-based sequence modeling [1].
2. **Residual DDPM for continuous variables**: models distributional detail as stochastic residual structure conditioned on the learned trend [2].
3. **Masked diffusion for discrete variables**: generates discrete ICS states with an absorbing/masking corruption process and categorical reconstruction [3,4].
4. **Type-aware decomposition**: a post-processing layer that routes variables to the most appropriate modeling mechanism and enforces deterministic constraints where warranted.
This ordering is intentional: the trend module fixes the *macro-temporal scaffold*; diffusion then focuses on *micro-structure and marginal fidelity*; masked diffusion guarantees *discrete legality*; and type-aware routing resolves the remaining mismatch caused by heterogeneous variable-generating mechanisms.
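A minimal end-to-end sketch of this ordering is shown below; the stage interfaces and the toy stand-ins (`trend_model`, `residual_ddpm`, `masked_diffusion`, `type_router`) are illustrative assumptions, not the actual implementation.

```python
# Illustrative sketch of the Mask-DDPM generation order:
# trend rollout -> residual diffusion -> masked diffusion -> type-aware routing.
import numpy as np

def generate(trend_model, residual_ddpm, masked_diffusion, type_router, L, d_c):
    S_hat = trend_model(L, d_c)          # 1. macro-temporal scaffold (trend rollout)
    R_hat = residual_ddpm(S_hat)         # 2. distributional detail via residual diffusion
    X_hat = S_hat + R_hat                # continuous reconstruction
    Y_hat = masked_diffusion(S_hat)      # 3. legal discrete tokens via masked diffusion
    return type_router(X_hat, Y_hat)     # 4. type-aware routing / deterministic constraints

# Toy stand-ins so the sketch runs end to end.
rng = np.random.default_rng(0)
X_hat, Y_hat = generate(
    trend_model=lambda L, d: 0.01 * np.cumsum(rng.normal(size=(L, d)), axis=0),
    residual_ddpm=lambda S: rng.normal(scale=0.05, size=S.shape),
    masked_diffusion=lambda S: rng.integers(0, 3, size=(S.shape[0], 2)),
    type_router=lambda X, Y: (X, Y),
    L=64, d_c=4,
)
```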
## 3. Transformer trend module for continuous dynamics
Transformers are now standard in sequence modeling [1]. We use the Transformer here explicitly as a trend extractor whose sole role is to provide conditioning for the subsequent diffusion stages rather than to serve as the primary generator; this creates a clean separation between temporal structure and distributional refinement in the ICS setting.

### 3.1 Trend–residual decomposition
For continuous channels (X), we posit an additive decomposition
[
X = S + R,
]
where (S\in\mathbb{R}^{L\times d_c}) is a smooth **trend** capturing predictable temporal evolution, and (R\in\mathbb{R}^{L\times d_c}) is a **residual** capturing distributional detail (e.g., bursts, heavy tails, local fluctuations) that is difficult to represent robustly with a purely regression-based temporal objective.
### 3.2 Causal Transformer parameterization
We parameterize the trend (S) using a **causal Transformer** (f_\phi) [1]. Concretely, with teacher forcing we train (f_\phi) to predict the next-step trend from past observations:
[
\hat{s}_{t+1} = f_{\phi}(x_{1:t}),
\qquad
\mathcal{L}_{\text{trend}}(\phi) = \sum_{t=1}^{L-1}\left\| \hat{s}_{t+1} - x_{t+1} \right\|^{2}.
]
Self-attention is particularly suitable here because it provides a direct mechanism for capturing long-range temporal dependencies across the window.
At inference, we roll out the Transformer autoregressively to obtain (\hat{S}), and then define the residual target for diffusion as (R = X - \hat{S}).
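The following is a minimal PyTorch sketch of a causal Transformer trend extractor under assumed hyperparameters (model width, heads, and layer count are placeholders): teacher-forced next-step regression for training, and autoregressive rollout to obtain (\hat{S}).

```python
# Minimal sketch of the causal Transformer trend extractor f_phi (assumed sizes).
import torch
import torch.nn as nn

class TrendTransformer(nn.Module):
    def __init__(self, d_c, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.inp = nn.Linear(d_c, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=128,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, d_c)

    def forward(self, x):                              # x: (B, t, d_c)
        t = x.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(x.device)
        h = self.encoder(self.inp(x), mask=mask)       # causal self-attention
        return self.out(h)                             # next-step prediction per position

def teacher_forcing_loss(model, x):                    # x: (B, L, d_c)
    pred = model(x[:, :-1])                            # predict x_{2..L} from prefixes
    return torch.mean((pred - x[:, 1:]) ** 2)

@torch.no_grad()
def rollout(model, x0, L):                             # x0: (B, 1, d_c) seed observation
    seq = x0
    for _ in range(L - 1):
        nxt = model(seq)[:, -1:, :]                    # last-position output = next trend step
        seq = torch.cat([seq, nxt], dim=1)
    return seq                                         # S_hat: (B, L, d_c)
```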
---
## 4. DDPM for continuous residual generation
The central integration here is not “Transformer + diffusion” in isolation, but a **trend-conditioned residual diffusion** formulation: diffusion is trained on residual structure defined explicitly relative to the Transformer trend model, which is particularly well suited to ICS, where low-frequency dynamics are strong and persistent.

### 4.1 Conditional diffusion on residuals
We model the residual (R) with a denoising diffusion probabilistic model (DDPM) conditioned on the trend (\hat{S}) [2]. Let (K) denote the number of diffusion steps, with a noise schedule ({\beta_k}_{k=1}^K), (\alpha_k = 1-\beta_k), and (\bar{\alpha}_k=\prod_{i=1}^k \alpha_i). The forward corruption process is
[
q(r_k \mid r_{k-1}) = \mathcal{N}\!\left(r_k;\ \sqrt{1-\beta_k}\,r_{k-1},\ \beta_k I\right),
\qquad
q(r_k \mid r_0) = \mathcal{N}\!\left(r_k;\ \sqrt{\bar{\alpha}_k}\,r_0,\ (1-\bar{\alpha}_k) I\right),
]
and the learned reverse process is
[
p_{\theta}(r_{k-1}\mid r_k,\hat{S})=\mathcal{N}\!\left(r_{k-1};\ \mu_{\theta}(r_k,k,\hat{S}),\ \sigma_k^2 I\right),
]
where (\mu_\theta) is implemented by a **Transformer denoiser** that consumes (i) the noised residual (r_k), (ii) a timestep embedding for (k), and (iii) conditioning features derived from (\hat{S}). Conditioning is crucial: it prevents the diffusion model from re-learning the low-frequency temporal scaffold and concentrates its capacity on matching the residual distribution.
### 4.2 Training objective and loss shaping
We train the denoiser using the standard DDPM (\epsilon)-prediction objective [2]:
[
\mathcal{L}_{\text{cont}}(\theta)
= \mathbb{E}_{r_0,\epsilon,k}\!\left[
\left\| \epsilon - \epsilon_{\theta}\!\left(\sqrt{\bar{\alpha}_k}\,r_0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon,\ k,\ \hat{S}\right) \right\|^{2}
\right].
]
Optionally, the per-step loss can be reweighted by a signal-to-noise-ratio (SNR) factor,
[
\mathbb{E}_{r_0,\epsilon,k}\!\left[
w(k)\left\| \epsilon - \epsilon_{\theta}\!\left(\sqrt{\bar{\alpha}_k}\,r_0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon,\ k,\ \hat{S}\right) \right\|^{2}
\right],
\qquad
w(k)=\frac{\mathrm{SNR}_k}{\mathrm{SNR}_k+\gamma}.
]
(We treat this as an ablation: it is not required by the method, but can improve training efficiency and stability in practice [5].)
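A compact sketch of one training step follows, combining the forward corruption, the (\epsilon)-prediction loss, and the optional SNR weight; the tensor shapes and the `denoiser` interface are assumptions for illustration.

```python
# Sketch of one residual-DDPM training step (assumed shapes and interfaces).
import torch

def ddpm_training_step(denoiser, R, S_hat, alpha_bar, gamma=5.0, snr_weight=False):
    """R, S_hat: (B, L, d_c); alpha_bar: (K,) cumulative products of alpha_k."""
    B, K = R.size(0), alpha_bar.size(0)
    k = torch.randint(0, K, (B,), device=R.device)              # uniform timestep per sample
    a_bar = alpha_bar[k].view(B, 1, 1)
    eps = torch.randn_like(R)
    r_k = torch.sqrt(a_bar) * R + torch.sqrt(1 - a_bar) * eps   # forward corruption q(r_k | r_0)
    eps_hat = denoiser(r_k, k, S_hat)                           # Transformer denoiser, conditioned on S_hat
    err = ((eps - eps_hat) ** 2).mean(dim=(1, 2))               # per-sample epsilon-prediction error
    if snr_weight:                                              # optional SNR-based weighting (ablation)
        snr = a_bar.view(B) / (1 - a_bar.view(B))
        err = err * (snr / (snr + gamma))
    return err.mean()
```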
### 4.3 Continuous reconstruction
After sampling (\hat{R}) by reverse diffusion, we reconstruct the continuous output as
[
\hat{X} = \hat{S} + \hat{R}.
]
This design makes the role of diffusion explicit: it acts as a **distributional corrector** on top of a temporally coherent backbone.
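For concreteness, a minimal ancestral-sampling sketch of the reverse process is given below; it assumes the standard DDPM posterior-mean parameterization with (\sigma_k^2=\beta_k) and an illustrative `denoiser` interface.

```python
# Sketch of reverse sampling: draw R_hat conditioned on S_hat, then X_hat = S_hat + R_hat.
import torch

@torch.no_grad()
def sample_continuous(denoiser, S_hat, beta):                   # beta: (K,) noise schedule
    alpha = 1.0 - beta
    alpha_bar = torch.cumprod(alpha, dim=0)
    r = torch.randn_like(S_hat)                                 # r_K ~ N(0, I)
    for k in reversed(range(beta.size(0))):
        k_idx = torch.full((S_hat.size(0),), k, device=S_hat.device, dtype=torch.long)
        eps_hat = denoiser(r, k_idx, S_hat)
        coef = (1 - alpha[k]) / torch.sqrt(1 - alpha_bar[k])
        mean = (r - coef * eps_hat) / torch.sqrt(alpha[k])      # posterior mean mu_theta
        noise = torch.randn_like(r) if k > 0 else torch.zeros_like(r)
        r = mean + torch.sqrt(beta[k]) * noise                  # sigma_k^2 = beta_k
    return S_hat + r                                            # X_hat = S_hat + R_hat
```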
---
## 5. Masked diffusion for discrete ICS variables
To our knowledge, diffusion-based ICS synthesis has predominantly emphasized continuous multivariate generation; in contrast, we introduce masked diffusion as a **first-class discrete legality mechanism** within the same pipeline, rather than treating discrete tags as continuous surrogates or applying post-hoc thresholds.

### 5.1 Discrete corruption via absorbing masks

Discrete ICS variables must remain categorical, making Gaussian diffusion inappropriate. We therefore use **masked (absorbing) diffusion** for discrete channels [3,4], in which corruption replaces tokens with a special (\texttt{[MASK]}) symbol according to a schedule.
For a discrete sequence (y^{(j)}_{1:L}), define a corruption level (m_k\in[0,1]) that increases with (k). The corrupted sequence at step (k) is formed by independent masking:
[
y^{(j,k)}_t =
\begin{cases}
\texttt{[MASK]}, & \text{with probability } m_k,\\
y^{(j)}_t, & \text{otherwise}.
\end{cases}
]
This “absorbing” mechanism yields a diffusion-like process over discrete states while preserving a clear reconstruction target at every step [3,4].
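A small sketch of this corruption step for a single channel follows; the reserved `MASK_ID` and the scalar schedule value `m_k` are illustrative assumptions.

```python
# Sketch of absorbing-mask corruption for one discrete channel.
import numpy as np

MASK_ID = -1  # reserved id standing in for the [MASK] token

def corrupt(y, m_k, rng):
    """y: (L,) integer tokens; m_k: corruption level in [0, 1]."""
    mask = rng.random(y.shape) < m_k          # independent per-position masking
    y_k = y.copy()
    y_k[mask] = MASK_ID
    return y_k, mask                          # mask marks the reconstruction targets

rng = np.random.default_rng(0)
y = rng.integers(0, 4, size=16)               # tokens from a vocabulary of size 4
y_k, mask = corrupt(y, m_k=0.5, rng=rng)
```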
### 5.2 Categorical denoising objective
We parameterize the reverse model with a Transformer (h_\psi) that outputs a categorical distribution over (\mathcal{V}_j) for each masked position. Let (\mathcal{M}) be the set of masked indices. We train with cross-entropy over masked positions:
[
\mathcal{L}_{\text{disc}}(\psi)
=
\mathbb{E}\!\left[
\sum_{(j,t)\in\mathcal{M}}
\mathrm{CE}\!\left(
h_\psi\!\left(y^{(k)},k,\hat{S}\right)_{j,t},\ y^{(j)}_t
\right)
\right],
]
Conditioning on (\hat{S}) (and, optionally, on (\hat{X})) allows discrete state generation to reflect the continuous context, which is often essential in ICS (e.g., mode transitions coupled to process regimes).
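A minimal sketch of the masked cross-entropy computation is below, assuming a `(B, L, V)` logits layout from (h_\psi) and a boolean mask over positions.

```python
# Sketch of cross-entropy restricted to masked positions (assumed tensor layout).
import torch
import torch.nn.functional as F

def masked_ce_loss(logits, targets, mask):
    """logits: (B, L, V); targets: (B, L) token ids; mask: (B, L) bool, True = masked."""
    loss = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")  # (B, L)
    return (loss * mask).sum() / mask.sum().clamp(min=1)        # average over masked positions only
```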
### 5.3 Sampling
At inference, we initialize (y^{(K)}) as fully masked and iteratively denoise/unmask from (k=K) to (1), sampling valid tokens from the predicted categorical distributions. Because outputs are always drawn from (\mathcal{V}_j), semantic validity is satisfied by construction.
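The sketch below illustrates this iterative unmasking loop; the per-step unmask fraction (here `1/k`) and the `h_psi` call signature are illustrative choices rather than the method's prescribed schedule.

```python
# Sketch of iterative unmasking at inference: start fully masked, reveal tokens
# from k = K down to 1, always sampling from the predicted categorical distributions.
import torch

@torch.no_grad()
def sample_discrete(h_psi, S_hat, L, K, mask_id):
    y = torch.full((S_hat.size(0), L), mask_id, device=S_hat.device, dtype=torch.long)
    for k in reversed(range(1, K + 1)):
        logits = h_psi(y, k, S_hat)                              # (B, L, V)
        probs = torch.softmax(logits, dim=-1)
        sampled = torch.distributions.Categorical(probs).sample()
        still_masked = y == mask_id
        # reveal a fraction of the currently-masked positions; at k = 1 everything is revealed
        reveal = still_masked & (torch.rand_like(probs[..., 0]) < 1.0 / k)
        y = torch.where(reveal, sampled, y)
    return y                                                     # tokens always lie in V_j
```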
---
## 6. Type-aware decomposition as a performance refinement layer
### 6.1 Motivation: mechanistic heterogeneity in ICS variables
Even with a strong continuous generator and a discrete legality mechanism, ICS data pose an additional challenge: variables arise from **qualitatively different generative mechanisms**. For example, program-driven setpoints are often piecewise constant with abrupt changes; actuators exhibit saturation and dwell; some channels are deterministic functions of others (derived tags). Modeling all channels identically can cause persistent error concentration and degrade overall quality.
We therefore introduce a **type-aware decomposition** that operates on top of the base pipeline to improve fidelity, stability, and interpretability.
### 6.2 Typing function and routing

The type-aware decomposition is not merely a diagnostic taxonomy; it is a modeling layer that changes the generator's factorization, deciding which variables are best treated as conditional drivers, which require dwell-aware behavior, and which should be deterministic. It thereby tailors the diffusion pipeline to the structural heterogeneity characteristic of ICS telemetry.
Let (\tau(i)\in{1,\dots,6}) assign each variable (i) to a type. This induces a partition of variables into index sets ({\mathcal{I}_k}_{k=1}^6). We then define the generator as a composition of type-specific operators:
[
(\hat{X},\hat{Y})
=
\mathcal{A}\!\left(\mathcal{G}_1(\mathcal{I}_1),\dots,\mathcal{G}_6(\mathcal{I}_6)\right),
]
where (\mathcal{A}) assembles a complete sample and (\mathcal{G}_k) encodes the appropriate modeling choice per type.
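A minimal sketch of this routed composition is given below; the two example mechanisms and the routing table are illustrative stand-ins for the type-specific operators (\mathcal{G}_k).

```python
# Sketch of the typed composition: tau routes each variable to a mechanism,
# and the assembly operator A stacks the routed outputs into a complete sample.
import numpy as np

def compose(tau, mechanisms, var_indices, L, rng):
    """tau: dict var -> type id; mechanisms: dict type id -> callable(var, L, rng) -> (L,) array."""
    columns = {i: mechanisms[tau[i]](i, L, rng) for i in var_indices}
    return np.stack([columns[i] for i in var_indices], axis=1)   # assembly operator A

rng = np.random.default_rng(0)
mechanisms = {
    1: lambda i, L, r: np.repeat(r.integers(0, 3, L // 8 + 1), 8)[:L].astype(float),  # piecewise-constant driver
    2: lambda i, L, r: np.clip(0.1 * np.cumsum(r.normal(size=L)), -1.0, 1.0),         # saturating, inertia-like signal
}
sample = compose(tau={0: 1, 1: 2}, mechanisms=mechanisms, var_indices=[0, 1], L=64, rng=rng)
```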
### 6.3 Six-type schema and modeling commitments
We operationalize six types that map cleanly onto common ICS semantics:
* **Type 5 (derived/deterministic variables):** algebraic or rule-based functions of other variables; we enforce deterministic reconstruction (\hat{x}^{(i)} = g_i(\hat{X},\hat{Y})) rather than learning a stochastic generator, improving logical consistency and sample efficiency (a minimal sketch follows this list).
* **Type 6 (auxiliary/low-impact variables):** weakly coupled or sparse signals; we allow simplified modeling (e.g., calibrated marginals or lightweight temporal models) to avoid allocating diffusion capacity where it is not warranted.
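Below is a minimal sketch of the Type 5 commitment: derived channels are recomputed deterministically from other channels instead of being sampled; the rule `g_i` used here is purely illustrative.

```python
# Sketch of enforcing derived (Type 5) channels by deterministic reconstruction.
import numpy as np

def enforce_derived(X_hat, derived_rules):
    """X_hat: (L, d) array; derived_rules: dict column index -> callable(X_hat) -> (L,) array."""
    X_out = X_hat.copy()
    for col, g in derived_rules.items():
        X_out[:, col] = g(X_hat)              # overwrite with the deterministic value g_i(X_hat)
    return X_out

rng = np.random.default_rng(0)
X_hat = rng.normal(size=(32, 3))
# illustrative rule: column 2 is defined as the difference of two other channels
X_hat = enforce_derived(X_hat, {2: lambda X: X[:, 0] - X[:, 1]})
```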
### 6.4 Training-time and inference-time integration
Type-aware decomposition improves performance through three concrete mechanisms:
In practice, (\tau) can be initialized from domain semantics (tag metadata and value domains) and refined via an error-attribution loop described in the Benchmark section. Importantly, this refinement does not alter the core generative backbone; it *re-routes* difficult variables to better-suited mechanisms while preserving end-to-end coherence.
---
## 7. Joint optimization and end-to-end sampling