diff --git a/arxiv-style/main.tex b/arxiv-style/main.tex index 38a2a99..e67864f 100644 --- a/arxiv-style/main.tex +++ b/arxiv-style/main.tex @@ -97,7 +97,7 @@ Diffusion models exhibit good fit along this path: DDPM achieves high-quality sa Looking further into the mechanism complexity of ICS: its channel types are inherently mixed, containing both continuous process trajectories and discrete supervision/status variables, and discrete channels must be "legal" under operational constraints. The aforementioned progress in time series diffusion has mainly occurred in continuous spaces, but discrete diffusion has also developed systematic methods: D3PM improves sampling quality and likelihood through absorption/masking and structured transitions in discrete state spaces \citep{austin2023structureddenoisingdiffusionmodels}, subsequent masked diffusion provides stable reconstruction on categorical data in a more simplified form \citep{Lin_2020}, multinomial diffusion directly defines diffusion on a finite vocabulary through mechanisms such as argmax flows \citep{hoogeboom2021argmaxflowsmultinomialdiffusion}, and Diffusion-LM demonstrates an effective path for controllable text generation by imposing gradient constraints in continuous latent spaces \citep{li2022diffusionlmimprovescontrollabletext}. From the perspectives of protocols and finite-state machines, coverage-guided fuzz testing emphasizes the criticality of "sequence legality and state coverage" \citep{meng2025aflnetyearslatercoverageguided,godefroid2017learnfuzzmachinelearninginput,she2019neuzzefficientfuzzingneural}, echoing the concept of "legality by construction" in discrete diffusion: preferentially adopting absorption/masking diffusion on discrete channels, supplemented by type-aware conditioning and sampling constraints, to avoid semantic invalidity and marginal distortion caused by post hoc thresholding. 
-From the perspective of high-level synthesis, the temporal structure is equally indispensable: ICS control often involves delay effects, phased operating conditions, and cross-channel coupling, requiring models to be able to characterize low-frequency, long-range dependencies while also overlaying multi-facated fine-grained fluctuations on them. The Transformer series has provided sufficient evidence in long-sequence time series tasks: Transformer-XL breaks through the fixed-length context limitation through a reusable memory mechanism and significantly enhances long-range dependency expression \citep{dai2019transformerxlattentivelanguagemodels}; Informer uses ProbSparse attention and efficient decoding to balance span and efficiency in long-sequence prediction \citep{zhou2021informerefficienttransformerlong}; Autoformer robustly models long-term seasonality and trends through autocorrelation and decomposition mechanisms \citep{wu2022autoformerdecompositiontransformersautocorrelation}; FEDformer further improves long-period prediction performance in frequency domain enhancement and decomposition \citep{zhou2022fedformerfrequencyenhanceddecomposed}; PatchTST enhances the stability and generalization of long-sequence multivariate prediction through local patch-based representation and channel-independent modeling \citep{2023}. Combining our previous positioning of diffusion, this chain of evidence points to a natural division of labor: using attention-based sequence models to first extract stable low-frequency trends/conditions (long-range skeletons), and then allowing diffusion to focus on margins and details in the residual space; meanwhile, discrete masking/absorbing diffusion is applied to supervised/pattern variables to ensure vocabulary legality by construction. 
This design not only inherits the advantages of time series diffusion in distribution fitting and uncertainty characterization \citep{rasul2021autoregressivedenoisingdiffusionmodels,tashiro2021csdiconditionalscorebaseddiffusion,wen2024diffstgprobabilisticspatiotemporalgraph,liu2023pristiconditionaldiffusionframework,kong2021diffwaveversatilediffusionmodel,11087622}, but also stabilizes the macroscopic temporal support through the long-range attention of Transformer, enabling the formation of an operational integrated generation pipeline under the mixed types and multi-scale dynamics of ICS. +From the perspective of high-level synthesis, the temporal structure is equally indispensable: ICS control often involves delay effects, phased operating conditions, and cross-channel coupling, requiring models to characterize low-frequency, long-range dependencies while also overlaying multifaceted fine-grained fluctuations on them. The Transformer family has provided ample evidence on long-sequence time series tasks: Transformer-XL breaks through the fixed-length context limitation through a reusable memory mechanism and significantly enhances long-range dependency expression \citep{dai2019transformerxlattentivelanguagemodels}; Informer uses ProbSparse attention and efficient decoding to balance span and efficiency in long-sequence prediction \citep{zhou2021informerefficienttransformerlong}; Autoformer robustly models long-term seasonality and trends through autocorrelation and decomposition mechanisms \citep{wu2022autoformerdecompositiontransformersautocorrelation}; FEDformer further improves long-horizon prediction performance through frequency-domain enhancement and decomposition \citep{zhou2022fedformerfrequencyenhanceddecomposed}; PatchTST enhances the stability and generalization of long-sequence multivariate prediction through local patch-based representation and channel-independent modeling \citep{nie2023patchtst}. 
Together with our earlier positioning of diffusion, this chain of evidence points to a natural division of labor: attention-based sequence models first extract stable low-frequency trends/conditions (long-range skeletons), and diffusion then focuses on margins and details in the residual space; meanwhile, discrete masking/absorbing diffusion is applied to supervised/pattern variables to ensure vocabulary legality by construction. This design not only inherits the advantages of time series diffusion in distribution fitting and uncertainty characterization \citep{rasul2021autoregressivedenoisingdiffusionmodels,tashiro2021csdiconditionalscorebaseddiffusion,wen2024diffstgprobabilisticspatiotemporalgraph,liu2023pristiconditionaldiffusionframework,kong2021diffwaveversatilediffusionmodel,11087622}, but also stabilizes the macroscopic temporal support through the long-range attention of Transformer, enabling the formation of an operational integrated generation pipeline under the mixed types and multi-scale dynamics of ICS. % 3. Methodology \section{Methodology} @@ -247,7 +247,7 @@ From a novelty standpoint, this layer is not merely an engineering patch; it is \label{sec:method-joint} We train the model in a staged manner consistent with the above factorization, which improves optimization stability and encourages each component to specialize in its intended role. Specifically: (i) we train the trend Transformer $f_{\phi}$ to obtain $\hat{\bm{S}}$; (ii) we compute residual targets $\hat{\bm{R}} = \bm{X} - \hat{\bm{S}}$ for the continuous variables routed to residual diffusion; (iii) we train the residual DDPM $p_{\theta}(\bm{R}\mid \hat{\bm{S}})$ and masked diffusion model $p_{\psi}(\bm{Y}\mid \text{masked}(\bm{Y}), \hat{\bm{S}}, \hat{\bm{X}})$; and (iv) we apply type-aware routing and deterministic reconstruction during sampling. 
This staged strategy is aligned with the design goal of separating temporal scaffolding from distributional refinement, and it mirrors the broader intuition in time-series diffusion that decoupling coarse structure and stochastic detail can mitigate structure-vs.-realism conflicts \citep{kollovieh2023tsdiff,sikder2023transfusion}. -A simple combined objective is $\mathcal{L} = \lambda\mathcal{L}_{\text{cont}} + (1-\lambda)\mathcal{L}_{\text{disc}}$ with $\lambda\in[0,1]$ controlling the balance between continuous and discrete learning. Type-aware routing determines which channels contribute to which loss and which are excluded in favor of deterministic reconstruction. In practice, this routing acts as a principled guardrail against negative transfer across variable mechanisms: channels that are best handled deterministically (Type 5) or by specialized drivers (Type 1/3, depending on configuration) are prevented from forcing the diffusion models into statistically incoherent compromises. +A simple combined objective is $\mathcal{L} = \lambda\mathcal{L}_{\text{cont}} + (1-\lambda)\mathcal{L}_{\text{disc}}$ with $\lambda\in[0,1]$ controlling the balance between continuous and discrete learning. Type-aware routing determines which channels contribute to which loss and which are excluded in favor of deterministic reconstruction. In practice, this routing acts as a principled guardrail against negative transfer across variable mechanisms: channels that are best handled deterministically (Type 5) or as exogenous / specialized state channels (e.g., driver-like or actuator-state variables) are prevented from forcing the diffusion models into statistically incoherent compromises. At inference time, generation follows the same structured order: (i) trend $\hat{\bm{S}}$ via the Transformer, (ii) residual $\hat{\bm{R}}$ via DDPM, (iii) discrete $\hat{\bm{Y}}$ via masked diffusion, and (iv) type-aware assembly with deterministic reconstruction for routed variables. 
This pipeline produces $(\hat{\bm{X}},\hat{\bm{Y}})$ that are temporally coherent by construction (through $\hat{\bm{S}}$), distributionally expressive (through $\hat{\bm{R}}$ denoising), and discretely valid (through masked diffusion), while explicitly accounting for heterogeneous variable-generating mechanisms through type-aware routing. In combination, these choices constitute our central methodological contribution: a unified Transformer + mixed diffusion generator for ICS telemetry, augmented by typed factorization to align model capacity with domain mechanism \citep{ho2020denoising,shi2024simplified,yuan2025ctu,nist2023sp80082}. @@ -369,7 +369,7 @@ Full model & $0.402$ & $0.028$ & $0.291$ & $0.215$ & $0.972$ & $0.644$ \\ \midrule \multicolumn{7}{@{}l}{\textit{Structure and conditioning}} \\ No temporal scaffold & $0.408$ & $0.031$ & $0.664$ & $0.306$ & $0.977$ & $0.645$ \\ -No file condition & $0.405$ & $0.033$ & $0.237$ & $0.262$ & $0.986$ & $0.640$ \\ +No file-level context & $0.405$ & $0.033$ & $0.237$ & $0.262$ & $0.986$ & $0.640$ \\ No type routing & $0.356$ & $0.022$ & $0.138$ & $0.324$ & $1.017$ & $0.647$ \\ \midrule \multicolumn{7}{@{}l}{\textit{Distribution shaping}} \\ diff --git a/arxiv-style/references.bib b/arxiv-style/references.bib index a46b6ee..7794eb1 100644 --- a/arxiv-style/references.bib +++ b/arxiv-style/references.bib @@ -605,3 +605,10 @@ keywords = {burstiness, energy plot, generator, internet, modeling, structural m primaryClass={cs.LG}, url={https://arxiv.org/abs/2011.13456}, } +@inproceedings{nie2023patchtst, + title={A Time Series is Worth 64 Words: Long-term Forecasting with Transformers}, + author={Nie, Yuqi and Nguyen, Nam H. and Sinthong, Phanwadee and Kalagnanam, Jayant}, + booktitle={International Conference on Learning Representations (ICLR)}, + year={2023}, + url={https://arxiv.org/abs/2211.14730} +}
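The staged factorization touched by the methodology hunk above (combined objective $\mathcal{L} = \lambda\mathcal{L}_{\text{cont}} + (1-\lambda)\mathcal{L}_{\text{disc}}$, then trend → residual → discrete → type-aware assembly at inference) can be sketched as follows. This is a minimal illustration, not the paper's implementation: all function and variable names are hypothetical stand-ins for $f_{\phi}$, $p_{\theta}$, $p_{\psi}$, and the type-aware routing table.

```python
import numpy as np

def combined_loss(l_cont, l_disc, lam=0.5):
    """Weighted objective L = lam * L_cont + (1 - lam) * L_disc, lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * l_cont + (1 - lam) * l_disc

def generate(x_context, trend_model, residual_model, discrete_model, routed):
    """Staged sampling sketch: (i) trend scaffold, (ii) residual denoising,
    (iii) discrete masked diffusion, (iv) type-aware deterministic assembly.

    All callables and the `routed` table are placeholders standing in for
    the paper's f_phi, p_theta, p_psi, and type-aware routing rules.
    """
    s_hat = trend_model(x_context)        # (i) low-frequency skeleton S_hat
    r_hat = residual_model(s_hat)         # (ii) residual sample R_hat (DDPM)
    x_hat = s_hat + r_hat                 # continuous channels X_hat
    y_hat = discrete_model(s_hat, x_hat)  # (iii) discrete channels Y_hat
    for ch, fn in routed.items():         # (iv) deterministic reconstruction
        x_hat[:, ch] = fn(x_hat)          #      for routed (e.g. Type 5) channels
    return x_hat, y_hat
```

The routing loop is what keeps deterministically reconstructable channels out of the diffusion models' learning targets, matching the "guardrail against negative transfer" role described in the hunk.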