[Suggestion]: Handle KS troublemaker features separately, by type #2
I just had GPT classify the features, using the annotations in the HAI PDF as reference. There are six types in total; I suggest planning a solution for each type separately. GPT's original answer:
I pulled the post-Jan-27 eval snapshot you’re referring to (the one where you logged KS ≈ 0.4046).
In that eval, a small set of features are still extreme outliers in continuous_ks (e.g., P1_B4002 = 1.0, P2_MSD = 1.0, P1_PCV02Z ≈ 0.992, P1_B4005 ≈ 0.797, P4_HT_LD ≈ 0.717).
Two of them are outright collapsed to constants in the generated data (std = 0.0): P1_B4002 and P2_MSD.
Those alone can keep KS stubbornly high.
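As a sanity check on those numbers, the per-feature continuous KS and the std=0 collapse can both be flagged in a few lines. This is a minimal sketch (the function name `ks_report` and the toy feature names are mine, not from the eval pipeline), using `scipy.stats.ks_2samp`:

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_report(real: np.ndarray, gen: np.ndarray, names: list) -> dict:
    """Per-feature two-sample KS between real and generated columns.
    Also flags features the generator collapsed to a constant (std == 0),
    which is what happened to P1_B4002 and P2_MSD in the eval."""
    report = {}
    for j, name in enumerate(names):
        stat, _ = ks_2samp(real[:, j], gen[:, j])
        report[name] = {"ks": float(stat), "collapsed": gen[:, j].std() == 0.0}
    return report

# Toy demo: second generated column is collapsed to a constant.
rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 2))
gen = np.column_stack([rng.normal(size=1000), np.zeros(1000)])
rep = ks_report(real, gen, ["ok_feature", "collapsed_feature"])
```

A collapsed column pins KS near 0.5 against a roughly symmetric real distribution, which is exactly the "stubbornly high" behavior described above.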
Below is a practical classification of the “troublemakers” (and a small watchlist), plus what to model them with instead.
Type 1 — Exogenous setpoints & demands (schedule-driven, piecewise-constant)
Features (high KS / watchlist)
Why they break diffusion KS
These are not “plant responses”; they’re operator/program inputs. In HAI, setpoint-change commands are delivered at the start of each day (i.e., regime/schedule-like). Diffusion often either (a) blurs steps into ramps, or (b) collapses to a mean (your std=0 cases).
Better methods/architectures
- HSMM / change-point program generator (recommended): sample `{num_changes, change_times, step_sizes}` + optional ramps.
- Conditional "control-program first" hierarchy.
- Empirical program library (fast baseline that often wins KS).
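A minimal version of the change-point program generator above can be a geometric-dwell sampler over an empirical level set. This is a sketch, not the recommended full HSMM: `levels` and `dwell_mean` are placeholders that would be fit from the real setpoint traces:

```python
import numpy as np

def sample_setpoint_program(T, levels, dwell_mean, rng):
    """Piecewise-constant setpoint trace: hold a level for a geometric
    dwell time, then jump to a new level from the empirical level set.
    Produces steps, not ramps, so diffusion-style blurring cannot occur."""
    sp = np.empty(T)
    t = 0
    level = rng.choice(levels)
    while t < T:
        dwell = rng.geometric(1.0 / dwell_mean)  # dwell time in samples
        sp[t:t + dwell] = level                  # slice clips at T safely
        t += dwell
        level = rng.choice(levels)
    return sp

rng = np.random.default_rng(1)
trace = sample_setpoint_program(500, levels=[20.0, 35.0, 50.0],
                                dwell_mean=60, rng=rng)
```

Because the trace is exactly piecewise-constant over the empirical levels, its marginal distribution matches the real level set by construction, which is why this baseline "often wins KS".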
Type 2 — Controller outputs (policy-like, saturation/rate-limits)
Features
Why it’s hard
Controller outputs are a function of (setpoint − measurement) + integral/derivative terms + saturation. If you model one as an unconstrained continuous variable, you'll miss (a) the heavy mass at 0/100, and (b) the sharp transitions.
Better methods/architectures
- Learned PID emulator (very strong for KS): predict `u_t` from `(SP_t, PV_t, PV_{t-1}, u_{t-1})`, with an explicit clamp + rate limit in the forward pass; fit `(Kp, Ki, Kd)` globally or per regime.
- State-space / NARX model.
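To make the "clamp + rate limit in the forward pass" idea concrete, here is a toy step function for such an emulator. The gains, limits, and the little first-order plant in the demo loop are illustrative assumptions, not fitted values:

```python
import numpy as np

def pid_emulator_step(sp, pv, pv_prev, u_prev, integ, kp, ki, kd,
                      dt=1.0, u_min=0.0, u_max=100.0, rate=5.0):
    """One step of a PID-shaped emulator for a controller output u_t.
    The clamp and rate limit are hard-coded into the forward pass, so the
    heavy mass at 0/100 and bounded slew rates hold by construction."""
    err = sp - pv
    integ = integ + err * dt
    deriv = -(pv - pv_prev) / dt                      # derivative on measurement
    u_raw = kp * err + ki * integ + kd * deriv
    u = np.clip(u_raw, u_prev - rate, u_prev + rate)  # rate limit
    u = np.clip(u, u_min, u_max)                      # saturation
    return float(u), integ

# Demo: drive a toy first-order plant toward setpoint 50.
u, integ, pv, pv_prev = 0.0, 0.0, 0.0, 0.0
us = []
for _ in range(200):
    u_new, integ = pid_emulator_step(50.0, pv, pv_prev, u, integ,
                                     kp=2.0, ki=0.1, kd=0.5)
    pv_prev, pv = pv, pv + 0.05 * (u_new - pv)        # toy plant dynamics
    u = u_new
    us.append(u)
```

In a learned version, `(kp, ki, kd)` (and optionally `rate`) become trainable parameters, global or regime-conditioned, while the clip operations stay in the forward pass.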
Type 3 — Actuator positions with “spiky”/rarely-changing distributions
Features (high KS / watchlist)
Why they break KS
Many valve positions are effectively mixtures of a few operating points + long dwell times. KS gets dominated by “missing spikes” even if your mean/std look fine.
Better methods/architectures
- Mixture-of-deltas + small noise (spike-and-slab): `Z_t = discrete_state_t + ε`, where `discrete_state_t` is categorical with dwell-time modeling.
- Actuator dynamics model driven by the command: treat `Z` as a constrained lag system, `Z_{t+1} = Z_t + clip(α*(D_t − Z_t), -rate, +rate) + noise`, with `α, rate` fit per actuator/regime.
- Regime-conditioned modeling.
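The constrained-lag recursion above is tiny to implement. A sketch, with `alpha`, `rate`, and `noise_std` as placeholder values that would be fit per actuator/regime:

```python
import numpy as np

def actuator_trace(demand, z0, alpha, rate, noise_std, rng):
    """Constrained lag system for an actuator position Z driven by a
    command D:  Z_{t+1} = Z_t + clip(alpha*(D_t - Z_t), -rate, +rate) + noise.
    Long dwells at a few operating points fall out of the dynamics, so the
    marginal keeps its spikes instead of smearing them."""
    z = np.empty(len(demand) + 1)
    z[0] = z0
    for t, d in enumerate(demand):
        step = np.clip(alpha * (d - z[t]), -rate, rate)
        z[t + 1] = z[t] + step + rng.normal(0.0, noise_std)
    return z[1:]

rng = np.random.default_rng(2)
demand = np.repeat([0.0, 80.0, 20.0], 100)  # piecewise-constant command
z = actuator_trace(demand, z0=0.0, alpha=0.3, rate=4.0,
                   noise_std=0.05, rng=rng)
```

The position slews at most `rate` per step toward each commanded level, then dwells there with only small noise, which is the "mixture of a few operating points + long dwell times" shape that plain diffusion misses.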
Type 4 — Quantized / “digital-as-continuous” variables
These don’t look like the top KS offenders right now, but they’re common future pain points and often worth splitting out early.
Examples present in your eval
Better methods
Type 5 — Derived conversions (don’t model twice)
You have explicit “Z” conversions that are deterministic transforms of base signals.
Examples
Best practice
This usually improves both KS and cross-feature consistency “for free”.
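A minimal post-processing hook for this "don't model twice" rule might look like the following. The tag pair and the linear transform here are hypothetical placeholders; the real conversions must be read off the HAI documentation:

```python
import numpy as np

# Hypothetical registry: derived tag -> (source tag, deterministic transform).
# The 0.01 scaling is a placeholder, NOT the actual HAI conversion.
CONVERSIONS = {
    "P1_FT01Z": ("P1_FT01", lambda x: 0.01 * x),
}

def recompute_derived(batch: dict) -> dict:
    """Overwrite derived 'Z' tags with deterministic transforms of their
    base signals, so they cannot drift distributionally from the
    generated sources."""
    out = dict(batch)
    for derived, (source, fn) in CONVERSIONS.items():
        out[derived] = fn(np.asarray(batch[source]))
    return out

batch = {"P1_FT01": np.array([100.0, 200.0]),
         "P1_FT01Z": np.array([9.0, 9.0])}   # stale generated values
fixed = recompute_derived(batch)
```

Since the derived column is now an exact function of its source, its KS against the real data reduces to the source's KS, with no extra modeling error.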
Type 6 — Auxiliary / vibration / HIL / power-supply signals
These are usually narrow-band / near-stationary and often easier with classic time-series models.
Examples
Better methods
Evaluate your “filter them out + model separately” idea
It’s a good move as long as you keep the dependency structure:
The robust way to do it is a hierarchical generator:
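A toy end-to-end sketch of that hierarchy, with every sub-model stubbed by a placeholder (the real versions would be the Type 1 scheduler, the Type 2 emulator, the Type 3 lag model, and the diffusion model conditioned on all of them):

```python
import numpy as np

def generate_episode(T, rng):
    """Hierarchical generation order: exogenous program first, then each
    dependent head conditioned on its parents, so the dependency structure
    survives even though the troublemakers leave the diffusion model."""
    # 1. Exogenous setpoint program (Type 1): piecewise-constant schedule.
    sp = np.repeat(rng.choice([20.0, 50.0], size=T // 50 + 1), 50)[:T]
    # 2. Controller head (Type 2), conditioned on the program (toy stub).
    u = np.clip(0.8 * sp + rng.normal(0.0, 1.0, T), 0.0, 100.0)
    # 3. Actuator head (Type 3), conditioned on the controller output.
    z = np.clip(u + rng.normal(0.0, 0.5, T), 0.0, 100.0)
    # 4. A diffusion model would generate the remaining plant-response
    #    features conditioned on (sp, u, z); omitted in this stub.
    return {"SP": sp, "U": u, "Z": z}

episode = generate_episode(300, np.random.default_rng(3))
```

The point is the ordering and the conditioning arrows, not the stub equations: each moved-out module still feeds the downstream ones, so joint realism is preserved.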
Two immediate “highest ROI” fixes from your eval
Fix the constant-output collapse for P1_B4002 and P2_MSD (both have std=0).
If you split Type 1 out into a program generator, these usually stop being a diffusion problem entirely.
Handle spiky actuator positions with a spike-and-slab / dwell-time model (P1_PCV02Z ~ 0.992 KS).
If you want, paste the list of features you were planning to “exclude from overall KS”, and I’ll propose a concrete module boundary (what to keep in diffusion vs. what to move out) + an evaluation plan that still checks joint realism.
I've extracted the six types into a separate table; let's proceed with this classification for now.
Below are the troublemaker continuous features (KS ≥ 0.50) from the current `eval.json`, grouped into the 6 types we discussed. (KS values shown are exactly what's recorded in `eval.json`.)
Type 1 — Operator setpoints / demands (schedule-driven, piecewise-constant)
Suggestion: model these separately as an HMM / state-machine / scheduler, or make them exogenous conditioning (because the dataset’s normal operation explicitly uses an HMM to generate setpoint commands).
Type 2 — Controller internal/output variables (mid-loop, non-Gaussian)
Suggestion: learn as conditional model (condition on SP + PV + mode), or use a mixture density / flow head rather than a single Gaussian-style target.
Type 3 — Actuator command/position variables (saturation + spikes)
Suggestion: model as state-conditional (e.g., separate “steady”, “ramping”, “saturated” regimes), or use discrete mode + continuous residual (a small regime classifier + conditional continuous generator).
Type 4 — Process-variable sensors (PV) with mode changes / heavy tails
Suggestion: try conditional normalizing flows or SSM-style (latent state) models; these PVs often become multi-modal because the plant operates in distinct regimes.
Type 5 — Derived / converted tags (should be near-deterministic)
Suggestion: don’t “generate” it—recompute deterministically from its source measurement(s) in post-processing (or train it as a deterministic head), so it can’t drift distributionally.
Type 6 — HIL / cross-process coupling outputs
Suggestion: model as a conditional/physics-informed head (condition on demand + key process PVs). These variables are driven by the simulator coupling, so unconditional diffusion often misses the correct mapping.