Update benchmark and type-aware figures

This commit is contained in:
MZ YANG
2026-03-27 23:37:11 +08:00
parent 0ba59c131c
commit 6f6a4b6a20
4 changed files with 126 additions and 20 deletions


@@ -161,12 +161,12 @@ We model the residual $\bm{r}$ with a denoising diffusion probabilistic model (DDPM)
Let $K$ denote the number of diffusion steps, with a noise schedule $\{\beta_k\}_{k=1}^K$, $\alpha_k = 1 - \beta_k$, and $\bar{\alpha}_k = \prod_{i=1}^k \alpha_i$. The forward corruption process is:
\begin{equation}
q(\bm{r}_k \mid \bm{r}_0) = \mathcal{N}\bigl( \sqrt{\bar{\alpha}_k}\,\bm{r}_0,\; (1 - \bar{\alpha}_k)\mathbf{I} \bigr)
\label{eq:forward_corruption}
\end{equation}
equivalently,
\begin{equation}
\bm{r}_k = \sqrt{\bar{\alpha}_k}\,\bm{r}_0 + \sqrt{1 - \bar{\alpha}_k}\,\boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})
\label{eq:forward_corruption_eq}
\end{equation}
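As an illustrative aside, the closed-form corruption in Eq.~\eqref{eq:forward_corruption_eq} can be simulated directly. The sketch below assumes a linear $\beta$ schedule with $K=1000$ steps; both choices are common DDPM defaults, not necessarily this paper's configuration.

```python
import numpy as np

def make_schedule(K=1000, beta_start=1e-4, beta_end=2e-2):
    """Linear noise schedule (illustrative defaults, not the paper's config).
    Returns betas and the cumulative products bar_alpha_k = prod_i (1 - beta_i)."""
    betas = np.linspace(beta_start, beta_end, K)
    alpha_bar = np.cumprod(1.0 - betas)
    return betas, alpha_bar

def q_sample(r0, k, alpha_bar, rng):
    """Draw r_k ~ q(r_k | r_0) = N(sqrt(abar_k) r0, (1 - abar_k) I)
    via the reparameterization r_k = sqrt(abar_k) r0 + sqrt(1 - abar_k) eps."""
    eps = rng.standard_normal(np.shape(r0))
    rk = np.sqrt(alpha_bar[k]) * r0 + np.sqrt(1.0 - alpha_bar[k]) * eps
    return rk, eps
```

By the last step, $\bar{\alpha}_K$ is close to zero, so $\bm{r}_K$ is approximately standard Gaussian noise regardless of $\bm{r}_0$, which is what the learned reverse process starts from.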
The learned reverse process is parameterized as:
@@ -236,6 +236,13 @@ We use the following taxonomy:
\item Type 6 (auxiliary/low-impact variables): weakly coupled or sparse signals; we allow simplified modeling (e.g., calibrated marginals or lightweight temporal models) to avoid allocating diffusion capacity where it is not warranted.
\end{enumerate}
\begin{figure}[htbp]
\centering
\includegraphics[width=\textwidth]{fig-type-aware-routing-realdata.pdf}
\caption{Type-aware decomposition as mechanism-aligned routing. The left panel formalizes the assignment $\tau(i)=\mathrm{TypeAssign}(m_i,s_i,d_i)$ from metadata, temporal signature, and dependency pattern. The center panel organizes the resulting six-type taxonomy and embeds representative real HAI telemetry signatures as miniature evidence for each type. The right panel shows how the current implementation uses this taxonomy: Type 1 variables act as explicit conditioning signals together with file-level context, Types 2/3/4/6 share the learned generator, and Type 5 variables are deterministically reconstructed after sampling. The representative insets are selected automatically from the configured type sets and normalized within each inset for qualitative comparison.}
\label{fig:type-routing-realdata}
\end{figure}
Type-aware decomposition improves synthesis quality through three mechanisms. First, it improves capacity allocation by preventing a small set of mechanistically atypical variables from dominating gradients and distorting the learned distribution for the majority class (typically Type 4). Second, it enables constraint enforcement by deterministically reconstructing Type 5 variables, preventing logically inconsistent samples that purely learned generators can produce. Third, it improves mechanism alignment by attaching inductive biases consistent with step/dwell or saturation behaviors where generic denoisers may implicitly favor smoothness.
From a novelty standpoint, this layer is not merely an engineering ``patch''; it is an explicit methodological statement that ICS synthesis benefits from typed factorization---a principle that has analogues in mixed-type generative modeling more broadly, but that remains underexplored in diffusion-based ICS telemetry synthesis \citep{shi2025tabdiff,yuan2025ctu,nist2023sp80082}.
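To make the routing idea concrete, the sketch below assigns a tag to one of the six types from cheap metadata and signal statistics. The rules and thresholds are hypothetical stand-ins for the paper's $\mathrm{TypeAssign}(m_i,s_i,d_i)$, which is not reproduced here; only the shape of the decision (metadata first, then state-count and switching statistics) is intended to match the taxonomy.

```python
import numpy as np

def type_assign(meta, series):
    """Hypothetical routing heuristic standing in for TypeAssign(m_i, s_i, d_i).
    Returns a type index in {1,...,6}; rules and thresholds are illustrative."""
    x = np.asarray(series, dtype=float)
    if meta.get("derived", False):
        return 5                    # Type 5: deterministically reconstructable
    if meta.get("role") == "program":
        return 1                    # Type 1: explicit conditioning signal
    if len(np.unique(x)) <= 8:      # few legal states -> discrete-like channel
        change_rate = np.mean(np.diff(x) != 0) if len(x) > 1 else 0.0
        return 6 if change_rate < 0.01 else 3   # near-static aux vs. actuator
    if meta.get("role") == "controller":
        return 2                    # Type 2: controller output
    return 4                        # Type 4: continuous process variable (PV)
```

Under this routing, Types 2/3/4/6 would share the learned generator while Types 1 and 5 bypass it, mirroring the implementation described in Figure~\ref{fig:type-routing-realdata}.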
@@ -251,42 +258,141 @@ At inference time, generation follows the same structured order: (i) trend $\hat
% 4. Benchmark
\section{Benchmark}
\label{sec:benchmark}
We evaluate on fixed-length windows ($L=96$) of feature sequences derived from the HAI Security Dataset, which preserve the mixed-type structure of ICS telemetry. A credible ICS generator must clear four progressively harder hurdles. It must first be \emph{semantically legal}: any out-of-vocabulary supervisory token renders a sample unusable, no matter how good its marginals look. It must then match the heterogeneous statistics of mixed-type telemetry, including continuous process channels and discrete supervisory states. Third, it must preserve \emph{mechanism-level realism}: switch-and-dwell behavior, bounded control motion, cross-tag coordination, and short-horizon persistence. Finally, these properties should matter downstream rather than only under offline similarity scores. We therefore organize the benchmark as a funnel rather than a flat metric list, moving from reproducibility and legality to diagnostic localization, extended realism, and ablation \citep{coletta2023constrained,yang2001interlock,stenger2024survey}.
This organization is particularly important for ICS telemetry. A generator can look competitive on one-dimensional marginals while still failing on the aspects that make a trace operationally plausible: long plateaus in setpoint-like variables, concentrated occupancy in actuator states, tight controller--sensor coupling, or persistent support signals. Our goal is therefore not to maximize a single scalar, but to show which parts of realism have already been solved, which remain brittle, and which model components are responsible for each regime.
For continuous channels, we measure marginal alignment with the Kolmogorov--Smirnov (KS) statistic per feature and average it over continuous variables. For discrete channels, we compute Jensen--Shannon divergence (JSD) between per-feature categorical marginals and average across discrete variables \citep{lin1991divergence,yoon2019timegan}. To assess short-horizon dynamics, we compare lag-1 autocorrelation feature-wise and report the mean absolute difference between real and synthetic lag-1 coefficients. We additionally track semantic legality by counting out-of-vocabulary discrete outputs, and we report a filtered KS that excludes near-constant channels whose variance is effectively zero. These core measures are complemented with type-aware diagnostics, extended realism metrics, and ablations.
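For reproducibility, the core measures can be computed with plain NumPy as sketched below. The standard-deviation floor used for the filtered KS is an assumed value; the paper does not state its exact threshold.

```python
import numpy as np

def ks_stat(a, b):
    """Two-sample Kolmogorov--Smirnov statistic: max gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def mean_ks(real, synth, std_floor=1e-6):
    """Mean KS over continuous columns, plus the filtered mean that drops
    near-constant channels (std below std_floor; threshold is an assumption)."""
    stats, keep = [], []
    for j in range(real.shape[1]):
        s = ks_stat(real[:, j], synth[:, j])
        stats.append(s)
        if real[:, j].std() > std_floor:
            keep.append(s)
    return float(np.mean(stats)), float(np.mean(keep))

def jsd(p, q, eps=1e-12):
    """Jensen--Shannon divergence (base 2, so bounded by 1) between
    categorical marginals p and q."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log2(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def lag1_gap(real, synth):
    """Mean absolute difference of per-feature lag-1 autocorrelation."""
    ac1 = lambda x: np.corrcoef(x[:-1], x[1:])[0, 1]
    return float(np.mean([abs(ac1(real[:, j]) - ac1(synth[:, j]))
                          for j in range(real.shape[1])]))
```

Reporting JSD directly (rather than its square root, the Jensen--Shannon distance) matches the bounded-divergence convention assumed here.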
\subsection{Core fidelity, legality, and reproducibility}
\label{sec:benchmark-quant}
Across the three-run reproducibility sweep, Mask-DDPM achieves mean KS $=0.3311 \pm 0.0079$, mean JSD $=0.0284 \pm 0.0073$, and mean absolute lag-1 difference $=0.2684 \pm 0.0027$. The strongest individual seed reaches KS $=0.3224$, while the best runs for JSD and lag-1 are $0.0209$ and $0.2661$, respectively. Just as importantly, all three runs produce zero out-of-vocabulary tokens across the 26 modeled discrete channels, giving a validity rate of \textbf{100\%}. This is the first major benchmark takeaway: semantic legality is already saturated by construction, so the remaining difficulty is no longer ``can the model emit valid symbols?'' but rather ``can it place valid symbols and trajectories in the right temporal and cross-channel context?''
The latest fully diagnosed run provides the complementary view that a seed summary cannot offer. In that run, the model attains mean KS $=0.4025$, filtered mean KS $=0.3191$, mean JSD $=0.0166$, and mean absolute lag-1 difference $=0.2859$, again with zero invalid discrete tokens. Two points matter most. First, the discrete branch remains the most reliable component: low JSD combined with perfect validity means the generator is consistently learning legal supervisory semantics rather than merely matching coarse occupancy counts. Second, the sizable gap between overall KS and filtered KS shows that continuous mismatch is not spread uniformly across all channels. Instead, a relatively small subset of difficult variables dominates the error budget.
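The validity accounting itself is simple; a minimal sketch (with hypothetical vocabulary sets, since the per-tag vocabularies are dataset-specific) is:

```python
import numpy as np

def validity(synth_discrete, vocab_sets):
    """Count out-of-vocabulary emissions over an (n_rows, n_tags) integer
    array of generated discrete outputs; vocab_sets[j] is the legal code
    set for tag j. Returns (oov_count, validity_rate)."""
    n, d = synth_discrete.shape
    oov = sum(int(v) not in vocab_sets[j]
              for j in range(d) for v in synth_discrete[:, j])
    return oov, 1.0 - oov / (n * d)
```

A validity rate of $100\%$ corresponds to an OOV count of exactly zero over all rows and all 26 discrete tags.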
\begin{figure}[htbp]
\centering
\includegraphics[width=\textwidth]{fig-benchmark-story-v2.png}
\caption{Benchmark evidence chain. Left: seed-level reproducibility over the three benchmark runs, showing that the global metrics are stable across seeds. Middle: top-10 continuous features ranked by KS in the latest fully diagnosed run, with overall and filtered average KS overlaid to show that a small subset of tags dominates the continuous error budget. Right: representative type-aware mismatch scores from the same run, using program dwell, controller change rate, actuator top-3 mass, PV tail ratio, and auxiliary lag-1 persistence as mechanism-level diagnostics. Lower is better in all panels.}
\label{fig:benchmark}
\end{figure}
\begin{table}[htbp]
\centering
\caption{Main benchmark summary. The left column reports reproducibility across three complete runs; the right column reports the latest diagnosed run used for the per-feature, type-aware, and extended analyses. Lower is better except for validity rate.}
\label{tab:core_metrics}
\begin{tabular}{@{}lcc@{}}
\toprule
\textbf{Metric} & \textbf{3-run mean $\pm$ std} & \textbf{Latest diagnosed run} \\
\midrule
Mean KS (continuous) & $0.3311 \pm 0.0079$ & $0.4025$ \\
Filtered mean KS & -- & $0.3191$ \\
Mean JSD (discrete) & $0.0284 \pm 0.0073$ & $0.0166$ \\
Mean abs. $\Delta$ lag-1 autocorr & $0.2684 \pm 0.0027$ & $0.2859$ \\
Validity rate (26 discrete tags) $\uparrow$ & $100.0 \pm 0.0\%$ & $100.0\%$ \\
\bottomrule
\end{tabular}
\end{table}
Figure~\ref{fig:benchmark} turns these numbers into a diagnosis rather than a scoreboard. The largest KS contributors are concentrated in a handful of control-relevant tags, including \texttt{P1\_B4002}, \texttt{P1\_FCV02Z}, \texttt{P1\_B3004}, \texttt{P1\_B2004}, and \texttt{P1\_PCV02Z}. This means the current limitation is not a global collapse of the continuous generator. The model has already cleared the first hurdle (legality) and a large part of the second (mixed-type marginal fidelity). What remains difficult is the third hurdle: reproducing a small set of hard channels whose realism depends on step-like transitions, long plateaus, tightly bounded operating regions, or strong local persistence.
\subsection{Extended realism and downstream utility}
\label{sec:benchmark-extended}
The next question is whether samples that look cleaner under fidelity metrics are also more structurally faithful and more useful. To probe this, we additionally evaluate two-sample distance, cross-variable coupling, spectral similarity, predictive consistency, memorization risk, and downstream anomaly-detection utility on the latest diagnosed run. Because this run contains only four synthetic windows (384 generated rows at $L=96$), we treat the resulting numbers as \emph{small-sample diagnostic evidence} rather than as the final word. They are still informative because they tell us which kinds of realism can be improved by post-processing and which ones cannot be repaired so easily.
\begin{table}[htbp]
\centering
\caption{Extended realism and utility on the latest diagnosed run. The post-processed column corresponds to the typed post-processing baseline. Lower is better except for AUPRC. For reference, the real-only predictor RMSE is $0.558$ and the real-only anomaly AUPRC is $0.653$.}
\label{tab:extended_eval}
\begin{tabular}{@{}lcc@{}}
\toprule
\textbf{Metric} & \textbf{Raw generator} & \textbf{Post-processed} \\
\midrule
Continuous MMD (RBF) & $0.6499$ & $0.2166$ \\
Discriminative accuracy (ideal $0.5$) & $1.0000$ & $0.5000$ \\
Mean abs. corr. diff. & $0.2134$ & $0.1909$ \\
Mean abs. lag-1 corr. diff. & $0.2132$ & $0.1989$ \\
PSD $L_1$ distance & $0.0195$ & $0.0224$ \\
Memorization ratio & $2.9515$ & $1.6205$ \\
Predictive RMSE (synthetic-only) & $0.9722$ & $0.9641$ \\
Predictive RMSE (real + synthetic) & $0.5433$ & $0.5413$ \\
Anomaly AUPRC (synthetic-only) & $0.5889$ & $0.5894$ \\
Anomaly AUPRC (real + synthetic) & $0.6449$ & $0.6476$ \\
\bottomrule
\end{tabular}
\end{table}
Table~\ref{tab:extended_eval} reveals a useful asymmetry. Typed post-processing substantially improves distribution-level realism: continuous MMD drops from $0.6499$ to $0.2166$, discriminative accuracy moves from a trivially separable $1.0$ to the chance-level ideal of $0.5$, both contemporaneous and lagged correlation errors decrease, and the memorization ratio contracts from $2.95$ to $1.62$. In other words, post-processing is very effective at pulling the generated windows closer to the real holdout manifold without collapsing into exact training-set copies. Yet predictive and downstream utility improve only modestly. Synthetic-only predictors remain clearly weaker than real-only ones, and real-plus-synthetic anomaly utility stays slightly below the real-only baseline. This is an important benchmark result: once legality and low-order marginals are largely under control, the remaining gap is driven less by superficial distribution mismatch and more by mechanism-level dynamics that post hoc distribution shaping cannot fully restore.
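For reference, the continuous MMD entry in Table~\ref{tab:extended_eval} can be computed with a biased RBF-kernel estimator. The median-heuristic bandwidth below is an assumption about the evaluation configuration, not a documented choice of the pipeline.

```python
import numpy as np

def rbf_mmd2(X, Y, sigma=None):
    """Biased squared MMD between samples X (n, d) and Y (m, d) under an
    RBF kernel; bandwidth defaults to the median pairwise-distance heuristic
    (an assumed configuration)."""
    Z = np.vstack([X, Y])
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # pairwise sq. dists
    if sigma is None:
        sigma = np.sqrt(np.median(d2[d2 > 0]))            # median heuristic
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    n = len(X)
    return float(K[:n, :n].mean() + K[n:, n:].mean() - 2.0 * K[:n, n:].mean())
```

The biased estimator is exactly zero when the two sample sets coincide and grows as the synthetic windows drift off the real manifold, which is the sense in which post-processing shrinks it from $0.6499$ to $0.2166$.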
\subsection{Type-aware diagnostics}
\label{sec:benchmark-typed}
Type-aware diagnostics make that mechanism gap explicit. Table~\ref{tab:typed_diagnostics} summarizes one representative statistic per variable family, computed on the latest fully analyzed run. These statistics are not redundant with the main benchmark table. They answer a different question: \emph{if legality is already solved, what kind of control behavior is still implausible?}
\begin{table}[htbp]
\centering
\caption{Type-aware diagnostic summary on the latest fully diagnosed run. ``Mean abs. error'' is reported in the native unit of the corresponding diagnostic statistic; ``Mean rel. error'' normalizes by the real-data value to indicate severity. Lower values indicate better alignment.}
\label{tab:typed_diagnostics}
\begin{tabular}{@{}llcc@{}}
\toprule
\textbf{Type} & \textbf{Proxy statistic} & \textbf{Mean abs. error} & \textbf{Mean rel. error} \\
\midrule
Program & mean dwell & $315.75$ & $0.64$ \\
Controller & change rate & $0.352$ & $0.84$ \\
Actuator & top-3 mass & $0.0117$ & $0.67$ \\
PV & tail ratio & $0.0796$ & $0.21$ \\
Auxiliary & lag-1 autocorr & $0.467$ & $0.77$ \\
\bottomrule
\end{tabular}
\end{table}
This typed view sharpens the story substantially. Program-like channels remain the hardest class because the model still under-represents long dwell behavior: it switches too often instead of maintaining the long plateaus characteristic of setpoints and schedule-driven tags. Controllers are too reactive, as reflected in the large change-rate mismatch. Actuator channels are closer in aggregate but still spread probability mass too broadly, indicating that the generator does not yet reproduce the concentrated occupancy of a few valid operating states. PV diagnostics are the most encouraging: their tail-ratio error is materially smaller, suggesting that the continuous branch already captures a meaningful portion of process-variable shape even though some upper-tail behavior remains underfit. Auxiliary channels expose a different weakness, namely that support signals with strong short-horizon persistence are still not reproduced as faithfully as their low-order marginals. In short, legality is already solved, but control realism is not.
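To make these proxies reproducible, the sketch below implements four of the channel-level statistics named in Table~\ref{tab:typed_diagnostics}; the exact windowing and aggregation in the paper's pipeline may differ, and the tail-ratio quantiles are assumed values.

```python
import numpy as np

def mean_dwell(x):
    """Average run length of constant segments (program-like channels)."""
    x = np.asarray(x)
    change = np.flatnonzero(np.diff(x) != 0)
    bounds = np.concatenate([[-1], change, [len(x) - 1]])
    return float(np.mean(np.diff(bounds)))

def change_rate(x):
    """Fraction of steps where the value moves (controller-like channels)."""
    return float(np.mean(np.diff(np.asarray(x, dtype=float)) != 0))

def top3_mass(x):
    """Occupancy mass of the three most frequent states (actuator channels)."""
    _, counts = np.unique(x, return_counts=True)
    return float(np.sort(counts)[::-1][:3].sum() / len(x))

def tail_ratio(x, q=0.99):
    """Upper-tail heaviness of a PV channel: P99 / P50 (assumes positive x;
    quantile choices are illustrative)."""
    return float(np.quantile(x, q) / np.quantile(x, 0.5))
```

Each diagnostic is computed on real and synthetic data separately; the table reports the absolute gap and its value relative to the real-data statistic.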
\subsection{Ablation study}
\label{sec:benchmark-ablation}
A good ablation does more than show that removing components changes numbers; it should identify which failure mode each component is preventing. We therefore evaluate ten one-seed variants under the same pipeline and summarize six representative metrics: continuous fidelity (KS), discrete fidelity (JSD), short-horizon dynamics (lag-1), cross-variable coupling, predictive transfer, and downstream anomaly utility. Figure~\ref{fig:benchmark-ablations} visualizes signed changes relative to the full model, where red means that the ablated variant is worse. Table~\ref{tab:ablation} gives the underlying values.
\begin{figure}[htbp]
\centering
\includegraphics[width=\textwidth]{fig-benchmark-ablations-v1.png}
\caption{Ablation impact relative to the full model. For KS, JSD, lag-1 error, coupling error, and predictive RMSE, positive values mean the ablated model is worse than the full model. For AUPRC, positive values mean the ablated model loses downstream utility. The figure makes clear that different components protect different notions of realism rather than contributing uniformly to every metric.}
\label{fig:benchmark-ablations}
\end{figure}
\begin{table}[htbp]
\centering
\small
\caption{Ablation study on the latest one-seed runs. Lower is better except for anomaly AUPRC.}
\label{tab:ablation}
\begin{tabular}{@{}lcccccc@{}}
\toprule
\textbf{Variant} & \textbf{KS$\downarrow$} & \textbf{JSD$\downarrow$} & \textbf{Lag-1$\downarrow$} & \textbf{Coupling$\downarrow$} & \textbf{Pred. RMSE$\downarrow$} & \textbf{AUPRC$\uparrow$} \\
\midrule
\multicolumn{7}{@{}l}{\textit{Full model}} \\
Full model & $0.402$ & $0.028$ & $0.291$ & $0.215$ & $0.972$ & $0.644$ \\
\midrule
\multicolumn{7}{@{}l}{\textit{Structure and conditioning}} \\
No temporal scaffold & $0.408$ & $0.031$ & $0.664$ & $0.306$ & $0.977$ & $0.645$ \\
No file condition & $0.405$ & $0.033$ & $0.237$ & $0.262$ & $0.986$ & $0.640$ \\
No type routing & $0.356$ & $0.022$ & $0.138$ & $0.324$ & $1.017$ & $0.647$ \\
\midrule
\multicolumn{7}{@{}l}{\textit{Distribution shaping}} \\
No quantile transform & $0.599$ & $0.010$ & $0.156$ & $0.300$ & $1.653$ & $0.417$ \\
No post-calibration & $0.543$ & $0.024$ & $0.253$ & $0.249$ & $1.086$ & $0.647$ \\
\midrule
\multicolumn{7}{@{}l}{\textit{Loss and target design}} \\
No SNR weighting & $0.400$ & $0.022$ & $0.299$ & $0.214$ & $0.961$ & $0.637$ \\
No quantile loss & $0.413$ & $0.018$ & $0.311$ & $0.213$ & $0.965$ & $0.645$ \\
No residual-stat loss & $0.404$ & $0.029$ & $0.285$ & $0.210$ & $0.970$ & $0.647$ \\
Epsilon target & $0.482$ & $0.102$ & $0.728$ & $0.195$ & $0.968$ & $0.647$ \\
\bottomrule
\end{tabular}
\end{table}
The ablation results reveal three distinct roles. First, temporal staging is what makes the sequence look dynamical rather than merely plausible frame by frame: removing the temporal scaffold leaves KS nearly unchanged but more than doubles lag-1 error ($0.291 \rightarrow 0.664$) and substantially worsens coupling ($0.215 \rightarrow 0.306$). Second, quantile-based distribution shaping is what makes the continuous branch usable: without the quantile transform, KS degrades sharply ($0.402 \rightarrow 0.599$), synthetic-only predictive RMSE deteriorates dramatically ($0.972 \rightarrow 1.653$), and anomaly utility collapses ($0.644 \rightarrow 0.417$). This is not a cosmetic gain; it is one of the main contributors to usable process realism.
The routing ablation supplies the most instructive counterexample. Disabling type routing actually improves several one-dimensional metrics (for example KS and lag-1), yet it worsens coupling ($0.215 \rightarrow 0.324$) and predictive transfer ($0.972 \rightarrow 1.017$). This is exactly why the benchmark cannot stop at scalar per-feature scores: typed decomposition helps the generator coordinate variables and preserve mechanism-level consistency even when simpler metrics may look deceptively better without it. Finally, the target-parameterization ablation is the clearest failure case: replacing the current target with an epsilon target causes the largest degradation in JSD ($0.028 \rightarrow 0.102$) and lag-1 ($0.291 \rightarrow 0.728$), making it the most destructive ablation overall. By contrast, SNR weighting, quantile loss, and residual-stat regularization behave as second-order refinements whose effects are real but materially smaller.
Taken together, the benchmark now supports a sharper claim than a plain KS/JSD table could offer. Mask-DDPM already provides stable mixed-type fidelity, perfect discrete legality, and a meaningful amount of continuous realism. The remaining error is concentrated in a small subset of ICS-specific channels whose realism depends on rare switching, long dwell intervals, constrained occupancy, and persistent local dynamics. The ablation study clarifies why: temporal staging protects dynamical realism, quantile-based shaping protects continuous fidelity and downstream utility, and type-aware routing protects coordinated mechanism-level behavior even when simpler metrics do not fully reveal its value.
% 5. Future Work
\section{Future Work}