forked from manbo/internal-docs
Tighten benchmark section for page limit
@@ -211,17 +211,13 @@ At inference time, generation follows the same structured order: (i) trend $\hat
% 4. Benchmark
\section{Benchmark}
\label{sec:benchmark}
A credible ICS generator must clear four progressively harder hurdles. It must first be \emph{semantically legal}: any out-of-vocabulary supervisory token renders a sample unusable, no matter how good its marginals look. It must then match the heterogeneous statistics of mixed-type telemetry, including continuous process channels and discrete supervisory states. Third, it must preserve \emph{mechanism-level realism}: switch-and-dwell behavior, bounded control motion, cross-tag coordination, and short-horizon persistence. Finally, these properties should matter downstream rather than only under offline similarity scores. We therefore organize the benchmark as a funnel rather than a flat metric list, moving from reproducibility and legality to diagnostic localization, extended realism, and ablation \citep{coletta2023constrained,yang2001interlock,stenger2024survey}.
A credible ICS generator must clear three hurdles. It must first be \emph{semantically legal}: any out-of-vocabulary supervisory token renders a sample unusable, regardless of marginal fidelity. It must then match the heterogeneous statistics of mixed-type telemetry, including continuous process channels and discrete supervisory states. Finally, it must preserve \emph{mechanism-level realism}: switch-and-dwell behavior, bounded control motion, cross-tag coordination, and short-horizon persistence. We therefore organize the benchmark as a funnel from legality and reproducibility to structural diagnosis and ablation \citep{coletta2023constrained,yang2001interlock,stenger2024survey}.
This organization is particularly important for ICS telemetry. A generator can look competitive on one-dimensional marginals while still failing on the aspects that make a trace operationally plausible: long plateaus in setpoint-like variables, concentrated occupancy in actuator states, tight controller--sensor coupling, or persistent support signals. Our goal is therefore not to maximize a single scalar, but to show which parts of realism have already been solved, which remain brittle, and which model components are responsible for each regime.
For continuous channels, we prioritize marginal agreement because ICS process signals often exhibit bounded support, long plateaus, saturation effects, and non-Gaussian tails that are poorly summarized by moment matching alone. We therefore use the Kolmogorov--Smirnov (KS) statistic per feature and average it over continuous variables: KS compares empirical cumulative distributions directly, requires no parametric assumption, and is sensitive to support shifts or local shape mismatches that are operationally meaningful in telemetry. For discrete channels, the object of interest is different: supervisory variables live on a finite vocabulary, so realism is primarily about whether the synthetic sampler places the right probability mass on the right states. We therefore compute Jensen--Shannon divergence (JSD) between per-feature categorical marginals and average across discrete variables \citep{lin1991divergence,yoon2019timegan}, since JSD is symmetric, bounded, and naturally suited to comparing categorical occupancy patterns. To assess short-horizon dynamics, we compare lag-1 autocorrelation feature-wise and report the mean absolute difference between real and synthetic lag-1 coefficients, which captures the short-memory persistence induced by actuator dwell, controller smoothing, and process inertia. We additionally track semantic legality by counting out-of-vocabulary discrete outputs, and we report a filtered KS that excludes near-constant channels whose variance is effectively zero so that trivially flat tags do not dominate the aggregate. These core measures are complemented with type-aware diagnostics, extended realism metrics, and ablations.
For continuous channels, we use the Kolmogorov--Smirnov (KS) statistic because ICS process signals are often bounded, saturated, heavy-tailed, and plateau-dominated, so moment matching alone is too weak. KS directly compares empirical cumulative distributions, makes no parametric assumption, and is sensitive to support or shape mismatches that are operationally meaningful in telemetry. For discrete channels, realism is primarily about how probability mass is distributed over a finite vocabulary, so we use Jensen--Shannon divergence (JSD) between per-feature categorical marginals and average across discrete variables \citep{lin1991divergence,yoon2019timegan}. To assess short-horizon dynamics, we compare lag-1 autocorrelation feature-wise and report the mean absolute difference between real and synthetic lag-1 coefficients. We also track semantic legality by counting out-of-vocabulary discrete outputs and report a filtered KS that excludes near-constant channels so that trivially flat tags do not dominate the aggregate.
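As a concrete illustration, these core metrics are computable with standard tools. The sketch below assumes windows are flattened into a `(samples, features)` array, a per-feature vocabulary for the discrete channels, and an illustrative variance floor for the filtered KS; the function name, signatures, and floor value are ours, not the paper's pipeline.

```python
import numpy as np
from scipy.stats import ks_2samp
from scipy.spatial.distance import jensenshannon

def mixed_type_fidelity(real, synth, discrete_cols, vocab, var_floor=1e-8):
    """Mean KS over continuous features, mean JSD over discrete features,
    mean |lag-1 autocorrelation| gap, an out-of-vocabulary (OOV) count,
    and a filtered KS that skips near-constant channels.
    A minimal sketch, not the exact benchmark implementation."""
    ks_all, ks_filt, jsd_all, oov = [], [], [], 0
    for j in range(real.shape[1]):
        if j in discrete_cols:
            states = vocab[j]
            oov += int((~np.isin(synth[:, j], states)).sum())  # legality check
            p = np.array([(real[:, j] == s).mean() for s in states])
            q = np.array([(synth[:, j] == s).mean() for s in states])
            # scipy returns the JS *distance*; square it to get the divergence
            jsd_all.append(jensenshannon(p, q, base=2) ** 2)
        else:
            ks = ks_2samp(real[:, j], synth[:, j]).statistic
            ks_all.append(ks)
            if real[:, j].var() > var_floor:  # filtered KS drops flat tags
                ks_filt.append(ks)

    def lag1(x):  # per-column lag-1 autocorrelation
        x = x - x.mean(axis=0)
        return (x[1:] * x[:-1]).sum(0) / ((x * x).sum(0) + 1e-12)

    cont = [j for j in range(real.shape[1]) if j not in discrete_cols]
    lag_gap = np.abs(lag1(real[:, cont]) - lag1(synth[:, cont])).mean()
    return {"mean_ks": float(np.mean(ks_all)),
            "filtered_ks": float(np.mean(ks_filt)),
            "mean_jsd": float(np.mean(jsd_all)),
            "lag1_gap": float(lag_gap),
            "oov_count": oov}
```

Averaging KS only over sufficiently varying channels is what keeps trivially flat tags from dominating the aggregate, as described above.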
\subsection{Core fidelity, legality, and reproducibility}
\label{sec:benchmark-quant}
Across three independent runs, Mask-DDPM achieves mean KS $=0.3311 \pm 0.0079$, mean JSD $=0.0284 \pm 0.0073$, and mean absolute lag-1 difference $=0.2684 \pm 0.0027$, while maintaining a validity rate of \textbf{100\%} across the modeled discrete channels. The small dispersion across runs suggests that the generator is reproducible at the level of global mixed-type fidelity rather than depending on a single favorable seed. This is the first major benchmark takeaway: semantic legality is already saturated by construction, so the remaining challenge is no longer whether the model can emit valid symbols, but whether it can place valid symbols and trajectories in the right temporal and cross-channel context.
A representative diagnostic slice provides the complementary localized view. As summarized in Table~\ref{tab:core_metrics}, the model attains mean KS $=0.4025$, filtered mean KS $=0.3191$, mean JSD $=0.0166$, and mean absolute lag-1 difference $=0.2859$ on that slice, again with zero invalid discrete tokens. Two patterns matter most. First, the discrete branch remains consistently reliable: low JSD together with perfect validity indicates that supervisory semantics are being learned rather than repaired after the fact. Second, the gap between overall KS and filtered KS suggests that continuous mismatch is concentrated in a limited subset of difficult channels instead of being spread uniformly across the telemetry space.
Across three independent runs, Mask-DDPM achieves mean KS $=0.3311 \pm 0.0079$, mean JSD $=0.0284 \pm 0.0073$, and mean absolute lag-1 difference $=0.2684 \pm 0.0027$, while maintaining a validity rate of \textbf{100\%} across the modeled discrete channels. The small dispersion across runs shows that mixed-type fidelity is reproducible rather than dependent on a single favorable seed. On a representative diagnostic slice, the model attains mean KS $=0.4025$, filtered mean KS $=0.3191$, mean JSD $=0.0166$, and mean absolute lag-1 difference $=0.2859$, again with zero invalid discrete tokens. The main pattern is that discrete legality is already solved, while continuous mismatch is concentrated in a limited subset of difficult channels rather than spread uniformly across the telemetry space.
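The mean-and-deviation figures above are plain per-seed aggregates; a minimal helper could look like the following, where the use of population standard deviation (`ddof=0`) is our assumption, since the text does not specify the estimator.

```python
import numpy as np

def aggregate_runs(runs):
    """Collapse a list of per-seed metric dicts into 'mean ± std' strings.
    Sketch only; ddof=0 is an illustrative choice."""
    out = {}
    for k in runs[0]:
        vals = np.array([r[k] for r in runs])
        out[k] = f"{vals.mean():.4f} ± {vals.std():.4f}"
    return out
```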
\begin{figure}[htbp]
\centering
@@ -248,39 +244,11 @@ Validity rate (26 discrete tags) $\uparrow$ & $100.0 \pm 0.0\%$ & $100.0\%$ \\
\end{table}
%Question about the following part. "Figure~\ref{fig:benchmark_story} turns the table into a structural diagnosis."
Figure~\ref{fig:benchmark_story} turns the table into a structural diagnosis. The left panel visualizes seed-level stability across the three benchmark runs, showing that the reported KS, JSD, and lag-1 statistics are reproducible rather than the result of a single favorable seed. The middle panel ranks the most difficult continuous channels by KS and shows that the dominant continuous mismatch is concentrated in a relatively small subset of control-sensitive variables instead of indicating a global collapse of the generator. The right panel aggregates type-aware proxy mismatches and shows that the remaining realism gap is mechanism-specific, with program-like long-dwell behavior and actuator-state occupancy contributing more strongly than PV-like channels on this slice. In other words, the model has largely solved legality and a substantial portion of mixed-type marginal fidelity, but realism remains harder for behaviors governed by switching, long dwell, bounded operating regimes, and strong local persistence. This type-aware perspective is developed further in Section~\ref{sec:benchmark-typed}.
\subsection{Extended realism and downstream utility}
\label{sec:benchmark-extended}
The next question is whether improvements under fidelity metrics correspond to broader structural realism and downstream usefulness. We therefore additionally evaluate two-sample distance, cross-variable coupling, spectral similarity, predictive consistency, memorization risk, and anomaly-detection utility on a representative diagnostic slice. Because this slice is intentionally small, we interpret the resulting numbers as diagnostic rather than definitive; their purpose is to show which aspects of realism respond to post-processing and which ones remain limited by mechanism-level dynamics.
\begin{table}[htbp]
\centering
\caption{Extended realism and downstream utility. Lower is better except for AUPRC (higher is better) and discriminative accuracy (closer to the chance level $0.5$ is better). For reference, the real-only predictor RMSE is $0.558$ and the real-only anomaly AUPRC is $0.653$.}
\label{tab:extended_eval}
\begin{tabular}{@{}lcc@{}}
\toprule
\textbf{Metric} & \textbf{Raw generator} & \textbf{Post-processed} \\
\midrule
Continuous MMD (RBF) & $0.6499$ & $0.2166$ \\
Discriminative accuracy (ideal $0.5$) & $1.0000$ & $0.5000$ \\
Mean abs. corr. diff. & $0.2134$ & $0.1909$ \\
Mean abs. lag-1 corr. diff. & $0.2132$ & $0.1989$ \\
PSD $L_1$ distance & $0.0195$ & $0.0224$ \\
Memorization ratio & $2.9515$ & $1.6205$ \\
Predictive RMSE (synthetic-only) & $0.9722$ & $0.9641$ \\
Predictive RMSE (real + synthetic) & $0.5433$ & $0.5413$ \\
Anomaly AUPRC (synthetic-only) & $0.5889$ & $0.5894$ \\
Anomaly AUPRC (real + synthetic) & $0.6449$ & $0.6476$ \\
\bottomrule
\end{tabular}
\end{table}
Table~\ref{tab:extended_eval} reveals a useful asymmetry. Typed post-processing substantially improves distribution-level realism: continuous MMD drops from $0.6499$ to $0.2166$, discriminative accuracy moves from a trivially separable $1.0$ to the chance-level ideal of $0.5$, both contemporaneous and lagged correlation errors decrease, and the memorization ratio contracts from $2.95$ to $1.62$. In other words, post-processing is very effective at pulling the generated windows closer to the real holdout manifold without collapsing into exact training-set copies. Yet predictive and downstream utility improve only modestly. Synthetic-only predictors remain clearly weaker than real-only ones, and real-plus-synthetic anomaly utility stays slightly below the real-only baseline. This is an important benchmark result: once legality and low-order marginals are largely under control, the remaining gap is driven less by superficial distribution mismatch and more by mechanism-level dynamics that post hoc distribution shaping cannot fully restore.
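For readers reproducing the continuous MMD row, a biased (V-statistic) RBF-kernel estimate takes only a few lines. The median-heuristic bandwidth below is an assumption on our part, not necessarily the bandwidth used for the reported numbers.

```python
import numpy as np

def rbf_mmd2(X, Y, sigma=None):
    """Squared MMD between two sets of flattened windows under an RBF kernel.
    Biased estimate; median-heuristic bandwidth when sigma is None."""
    Z = np.vstack([X, Y])
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)  # pairwise squared dists
    if sigma is None:
        sigma = np.sqrt(np.median(d2[d2 > 0]) / 2)  # median heuristic
    K = np.exp(-d2 / (2 * sigma ** 2))
    n = len(X)
    return float(K[:n, :n].mean() + K[n:, n:].mean() - 2 * K[:n, n:].mean())
```

Discriminative accuracy is complementary: a held-out classifier trained to separate real from synthetic windows should approach chance (0.5) when the two distributions match.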
Figure~\ref{fig:benchmark_story} turns the table into a structural diagnosis. The left panel shows seed-level stability across the three benchmark runs. The middle panel shows that the dominant continuous mismatch is concentrated in a relatively small subset of control-sensitive variables rather than indicating a global collapse of the generator. The right panel shows that the remaining realism gap is mechanism-specific, with program-like long-dwell behavior and actuator-state occupancy contributing more strongly than PV-like channels on this slice.
\subsection{Type-aware diagnostics}
\label{sec:benchmark-typed}
Type-aware diagnostics make that mechanism gap explicit. Table~\ref{tab:typed_diagnostics} summarizes one representative statistic per variable family on the same diagnostic slice. These statistics are not redundant with the main benchmark table: they answer a different question, namely which operational behaviors remain hardest to match once legality and marginal alignment are largely in place. Because each family is evaluated with a different proxy, the absolute-error column should be interpreted within type, while the relative-error column is the more comparable cross-type indicator.
Type-aware diagnostics make that mechanism gap explicit. Table~\ref{tab:typed_diagnostics} reports one representative statistic per variable family on the same diagnostic slice. Because each family is evaluated with a different proxy, the absolute-error column should be interpreted within type, while the relative-error column is the more comparable cross-type indicator.
\begin{table}[htbp]
\centering
@@ -299,11 +267,11 @@ Auxiliary & lag-1 autocorr & $0.125$ & $0.37$ \\
\end{tabular}
\end{table}
This typed view sharpens the story substantially. Program-like channels remain the hardest family by a wide margin: mean-dwell mismatch is still large in both absolute and relative terms, indicating that the generator does not yet sustain the long plateaus characteristic of schedule-driven or setpoint-like behavior. Actuator channels form the next clear difficulty, with a sizable top-3-mass gap showing that the sampler still spreads probability mass across operating states more broadly than the real system does. Auxiliary channels exhibit a moderate persistence mismatch under the lag-1 proxy, suggesting that support signals with short-memory structure are only partially captured. By contrast, PV channels are the most stable family under this diagnostic, and the controller proxy is comparatively closer on this slice. In short, legality is already solved, but the remaining realism gap is not uniform across types: it is dominated primarily by long-dwell program behavior and actuator-state occupancy.
Program-like channels remain the hardest family by a clear margin: mean-dwell mismatch is still large, indicating that the generator does not yet sustain the long plateaus characteristic of schedule-driven behavior. Actuator channels form the next clear difficulty, while PV channels are the most stable family under this diagnostic. In short, legality is solved, but the remaining realism gap is not uniform across types; it is dominated primarily by long-dwell program behavior and actuator-state occupancy.
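The dwell and occupancy proxies behind this typed view reduce to simple run-length and state-mass statistics. The sketches below use illustrative names and boundary handling; comparing each statistic between a real and a synthetic channel yields the absolute and relative error columns.

```python
import numpy as np

def mean_dwell(seq):
    """Average run length of a piecewise-constant signal: the dwell proxy
    for program-like channels (sketch; boundary conventions may differ)."""
    change = np.flatnonzero(np.diff(seq) != 0)        # indices where value switches
    edges = np.concatenate(([-1], change, [len(seq) - 1]))
    return float(np.diff(edges).mean())               # run lengths between switches

def top3_mass(seq):
    """Occupancy mass on the three most-visited states: the actuator proxy."""
    _, counts = np.unique(seq, return_counts=True)
    p = np.sort(counts)[::-1] / counts.sum()
    return float(p[:3].sum())
```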
\subsection{Ablation study}
\label{sec:benchmark-ablation}
A good ablation does more than show that removing components changes numbers; it should identify which failure mode each component is preventing. We therefore evaluate ten controlled variants under a shared pipeline and summarize six representative metrics: continuous fidelity (KS), discrete fidelity (JSD), short-horizon dynamics (lag-1), cross-variable coupling, predictive transfer, and downstream anomaly utility. Figure~\ref{fig:ablation_impact} visualizes signed changes relative to the full model, and Table~\ref{tab:ablation} gives the underlying values.
We evaluate ten controlled variants under a shared pipeline and summarize six representative metrics: continuous fidelity (KS), discrete fidelity (JSD), short-horizon dynamics (lag-1), cross-variable coupling, predictive transfer, and downstream anomaly utility. Figure~\ref{fig:ablation_impact} visualizes signed changes relative to the full model, and Table~\ref{tab:ablation} gives the underlying values.
\begin{figure}[htbp]
\centering
@@ -342,11 +310,9 @@ Epsilon target & $0.482$ & $0.102$ & $0.728$ & $0.195$ & $0.968$ & $0.647$ \\
\end{tabular}
\end{table}
The ablation results reveal three distinct roles. First, temporal staging is what makes the sequence look dynamical rather than merely plausible frame by frame: removing the temporal scaffold leaves KS nearly unchanged but more than doubles lag-1 error ($0.291 \rightarrow 0.664$) and substantially worsens coupling ($0.215 \rightarrow 0.306$). Second, quantile-based distribution shaping is what makes the continuous branch usable: without the quantile transform, KS degrades sharply ($0.402 \rightarrow 0.599$), synthetic-only predictive RMSE deteriorates dramatically ($0.972 \rightarrow 1.653$), and anomaly utility collapses ($0.644 \rightarrow 0.417$). This is not a cosmetic gain; it is one of the main contributors to usable process realism.
The ablation results reveal three distinct roles. First, temporal staging is what makes the sequence look dynamical rather than merely plausible frame by frame: removing the temporal scaffold leaves KS nearly unchanged but more than doubles lag-1 error ($0.291 \rightarrow 0.664$) and substantially worsens coupling ($0.215 \rightarrow 0.306$). Second, quantile-based distribution shaping is one of the main contributors to usable continuous realism: without the quantile transform, KS degrades sharply ($0.402 \rightarrow 0.599$), synthetic-only predictive RMSE deteriorates ($0.972 \rightarrow 1.653$), and anomaly utility collapses ($0.644 \rightarrow 0.417$). Third, routing is the key counterexample to one-dimensional evaluation: disabling type routing can improve KS or lag-1 in isolation, yet it worsens coupling ($0.215 \rightarrow 0.324$) and predictive transfer ($0.972 \rightarrow 1.017$), showing that typed decomposition helps preserve coordinated mechanism-level behavior.
The routing ablation supplies the most instructive counterexample. Disabling type routing actually improves several one-dimensional metrics (for example KS and lag-1), yet it worsens coupling ($0.215 \rightarrow 0.324$) and predictive transfer ($0.972 \rightarrow 1.017$). This is exactly why the benchmark cannot stop at scalar per-feature scores: typed decomposition helps the generator coordinate variables and preserve mechanism-level consistency even when simpler metrics may look deceptively better without it. Finally, the target-parameterization ablation is the clearest failure case: replacing the current target with an epsilon target causes the largest degradation in JSD ($0.028 \rightarrow 0.102$) and lag-1 ($0.291 \rightarrow 0.728$), making it the most destructive ablation overall. By contrast, SNR weighting, quantile loss, and residual-stat regularization behave as second-order refinements whose effects are real but materially smaller.
Taken together, the benchmark now supports a sharper claim than a plain KS/JSD table could offer. Mask-DDPM already provides stable mixed-type fidelity, perfect discrete legality, and a meaningful amount of continuous realism. The remaining error is concentrated in a small subset of ICS-specific channels whose realism depends on rare switching, long dwell intervals, constrained occupancy, and persistent local dynamics. The ablation study clarifies why: temporal staging protects dynamical realism, quantile-based shaping protects continuous fidelity and downstream utility, and type-aware routing protects coordinated mechanism-level behavior even when simpler metrics do not fully reveal its value.
Taken together, the benchmark supports a focused claim. Mask-DDPM already provides stable mixed-type fidelity and perfect discrete legality, while the remaining error is concentrated in ICS-specific channels whose realism depends on rare switching, long dwell intervals, constrained occupancy, and persistent local dynamics.
% 5. Conclusion and Future Work
\section{Conclusion and Future Work}