Align benchmark evidence text and figure

2026-04-18 22:13:51 +08:00
parent 4a6dcb77a5
commit b67e7ffb0a
2 changed files with 8 additions and 8 deletions
--- a/arxiv-style/fig-benchmark-story-v2.png
+++ b/arxiv-style/fig-benchmark-story-v2.png
--- a/arxiv-style/main.tex
+++ b/arxiv-style/main.tex
@@ -291,7 +291,7 @@ Validity rate (26 discrete tags) $\uparrow$ & $100.0 \pm 0.0\%$ & $100.0\%$ \\
 \end{table}
 %Question about the following part. "Figure~\ref{fig:benchmark_story} turns the table into a structural diagnosis."
-Figure~\ref{fig:benchmark_story} turns the table into a structural diagnosis. Continuous error is concentrated in a relatively small subset of control-sensitive channels rather than indicating a global collapse of the generator, while the type-aware panel shows that the remaining gap is mechanism-specific. In other words, the model has largely solved legality and a substantial portion of mixed-type marginal fidelity, but realism remains harder for behaviors governed by switching, long dwell, bounded operating regimes, and strong local persistence.
+Figure~\ref{fig:benchmark_story} turns the table into a structural diagnosis. The left panel visualizes seed-level stability across the three benchmark runs, showing that the reported KS, JSD, and lag-1 statistics are reproducible rather than the result of a single favorable seed. The middle panel ranks the most difficult continuous channels by KS and shows that the continuous mismatch is concentrated in a relatively small subset of control-sensitive variables rather than reflecting a global collapse of the generator. The right panel aggregates type-aware proxy mismatches and shows that the remaining realism gap is mechanism-specific, with program-like long-dwell behavior and actuator-state occupancy contributing more strongly than PV-like channels on this slice. In other words, the model has largely solved legality and a substantial portion of mixed-type marginal fidelity, but realism remains harder for behaviors governed by switching, long dwell, bounded operating regimes, and strong local persistence. This type-aware perspective is developed further in Section~\ref{sec:benchmark-typed}.
 \subsection{Extended realism and downstream utility}
 \label{sec:benchmark-extended}
@@ -323,7 +323,7 @@ Table~\ref{tab:extended_eval} reveals a useful asymmetry. Typed post-processing
 \subsection{Type-aware diagnostics}
 \label{sec:benchmark-typed}
-Type-aware diagnostics make that mechanism gap explicit. Table~\ref{tab:typed_diagnostics} summarizes one representative statistic per variable family on the same diagnostic slice. These statistics are not redundant with the main benchmark table: they answer a different question, namely which operational behaviors remain hardest to match once legality and marginal alignment are largely in place.
+Type-aware diagnostics make that mechanism gap explicit. Table~\ref{tab:typed_diagnostics} summarizes one representative statistic per variable family on the same diagnostic slice. These statistics are not redundant with the main benchmark table: they answer a different question, namely which operational behaviors remain hardest to match once legality and marginal alignment are largely in place. Because each family is evaluated with a different proxy, the absolute-error column should be interpreted within type, while the relative-error column is the more comparable cross-type indicator.
 \begin{table}[htbp]
 \centering
@@ -333,16 +333,16 @@ Type-aware diagnostics make that mechanism gap explicit. Table~\ref{tab:typed_di
 \toprule
 \textbf{Type} & \textbf{Proxy statistic} & \textbf{Mean abs. error} & \textbf{Mean rel. error} \\
 \midrule
-Program & mean dwell & $315.75$ & $0.64$ \\
+Program & mean dwell & $318.70$ & $2.19$ \\
-Controller & change rate & $0.352$ & $0.84$ \\
+Controller & change rate & $0.104$ & $0.25$ \\
-Actuator & top-3 mass & $0.0117$ & $0.67$ \\
+Actuator & top-3 mass & $0.0615$ & $0.69$ \\
-PV & tail ratio & $0.0796$ & $0.21$ \\
+PV & tail ratio & $1.614$ & $0.20$ \\
-Auxiliary & lag-1 autocorr & $0.467$ & $0.77$ \\
+Auxiliary & lag-1 autocorr & $0.125$ & $0.37$ \\
 \bottomrule
 \end{tabular}
 \end{table}
-This typed view sharpens the story substantially. Program-like channels remain the hardest class because the model still under-represents long dwell behavior: it switches too often instead of maintaining the long plateaus characteristic of setpoints and schedule-driven tags. Controllers are too reactive, as reflected in the large change-rate mismatch. Actuator channels are closer in aggregate but still spread probability mass too broadly, indicating that the generator does not yet reproduce the concentrated occupancy of a few valid operating states. PV diagnostics are the most encouraging: their tail-ratio error is materially smaller, suggesting that the continuous branch already captures a meaningful portion of process-variable shape even though some upper-tail behavior remains underfit. Auxiliary channels expose a different weakness, namely that support signals with strong short-horizon persistence are still not reproduced as faithfully as their low-order marginals. In short, legality is already solved, but control realism is not.
+This typed view sharpens the story substantially. Program-like channels remain the hardest family by a wide margin: mean-dwell mismatch is large in both absolute and relative terms ($318.70$ and $2.19$), indicating that the generator does not yet sustain the long plateaus characteristic of schedule-driven or setpoint-like behavior. Actuator channels form the next clear difficulty, with a relative top-3-mass error of $0.69$ showing that the generator still spreads probability mass across operating states more broadly than the real system does. Auxiliary channels exhibit a moderate persistence mismatch under the lag-1 proxy (relative error $0.37$), suggesting that support signals with short-memory structure are only partially captured. By contrast, PV channels have the smallest relative error ($0.20$), and the controller proxy is comparatively close on this slice ($0.25$). In short, legality is already solved, but the remaining realism gap is not uniform across types: it is dominated by long-dwell program behavior and actuator-state occupancy.
 \subsection{Ablation study}
 \label{sec:benchmark-ablation}