58 Commits

Author SHA1 Message Date
DaZuo0122
566e251743 Add: python scripts for figure generation 2026-02-09 00:24:40 +08:00
DaZuo0122
0a74d67f19 Add: ignore rules for drawio 2026-02-08 17:26:38 +08:00
DaZuo0122
096af1bcb6 Fix: Author Block 2026-02-06 15:51:07 +08:00
DaZuo0122
c146138443 Fix: reference issue, from 50 to 46 2026-02-06 15:24:43 +08:00
DaZuo0122
0b3e79841b Fix: pictures misplacement, remove the lower is better column from the table 2026-02-06 15:11:17 +08:00
DaZuo0122
67466f8e3d Add: IEEE style double column - exceeds 6 page limit and graph misplaced 2026-02-06 14:55:53 +08:00
DaZuo0122
250428b176 Add: README.md for latex compiling 2026-02-05 17:06:27 +08:00
Hongyu Yan
6f1e7a9994 Update main.tex 2026-02-04 23:02:58 +08:00
Hongyu Yan
b3280dcc19 Update main.tex 2026-02-04 22:48:59 +08:00
Hongyu Yan
de445963b5 Update main.tex 2026-02-04 22:45:41 +08:00
Hongyu Yan
5ede1a11f1 The first draft is completed. 2026-02-04 22:37:35 +08:00
Hongyu Yan
21053b4f13 Future Work Part Complete 2026-02-04 22:21:42 +08:00
Hongyu Yan
a7c8250d1a Reapply "Update references.bib"
This reverts commit 03640302db10f4f54109172e518661aad4d5cceb.
2026-02-04 22:18:35 +08:00
Hongyu Yan
03640302db Revert "Update references.bib"
This reverts commit 3a9836d15d7c4e467c408d5b7a3ed758f018ef9e.
2026-02-04 22:16:52 +08:00
Hongyu Yan
3a9836d15d Update references.bib
Solve the ref problem of HAI dataset
2026-02-04 22:16:18 +08:00
DaZuo0122
1e3eb39dea Add: benchmark, bibtex shows error in line 377, should be reference for other section than benchmark/method 2026-02-04 22:13:46 +08:00
DaZuo0122
9f8af2c67c Add: reference for benchmark 2026-02-04 21:46:53 +08:00
272e159df1 Intro and Related Work Completed
- The reference of the HAI dataset still has problems.
2026-02-04 19:39:36 +08:00
81625b5c4e Update .gitignore 2026-02-04 18:19:30 +08:00
DaZuo0122
4815d05127 Change: citations are now presented in numbers 2026-02-04 17:14:03 +08:00
DaZuo0122
5fbfd1068f Add: figure for benchmark section 2026-02-04 17:08:45 +08:00
DaZuo0122
dc5db4f3ec Fix: equation misplaced error 2026-02-04 17:06:57 +08:00
DaZuo0122
87ba7a5516 Finish: Methodology section - have error at \label{eq:masking_process} misplaced. 2026-02-04 16:48:46 +08:00
DaZuo0122
6ae0f24af3 Add: first 3 paragraphs to methodology section 2026-02-04 13:07:59 +08:00
DaZuo0122
c95fca657b Add: Dependencies for equations 2026-02-04 12:47:34 +08:00
DaZuo0122
704277f011 Remove: all custom \DeclareMathOperator and \newcommand 2026-02-04 12:38:55 +08:00
25bc355717 Update .gitignore 2026-02-04 11:23:01 +08:00
DaZuo0122
89997f1125 Add: equations of methodology in a separate file 2026-02-04 10:48:24 +08:00
Hongyu Yan
4ec1b0b7a8 Related Work Completed 2026-02-03 23:51:44 +08:00
Hongyu Yan
327fc36bf7 main tex file update 2026-02-03 23:38:31 +08:00
Hongyu Yan
72b653f35d Ignore file 2026-02-03 23:01:29 +08:00
0e0f951e12 Upload files to "arxiv-style" 2026-02-02 21:54:05 +08:00
ace1130179 Add: Methodology section 12 references 2026-02-01 11:46:16 +08:00
0613edcf61 Add arxiv style template 2026-02-01 11:44:24 +08:00
1ee85b97bc Add more paragraphs/citations to smooth the logic flow 2026-01-31 21:53:52 +08:00
2c1e211504 Version 2, slightly differs from online docs v2 (use that)
online docs v2: https://my.feishu.cn/wiki/Za4dwCsG6iPD9qklRLWcoJOZnnb?edition_id=dXsOZT
2026-01-31 21:50:02 +08:00
735ca8ab51 Update knowledges/draft-incomplete-methodology.md 2026-01-30 21:05:56 +08:00
b88a9d39da Add knowledges/draft-incomplete-methodology.md 2026-01-30 17:43:12 +08:00
Your Name
b3bbef795d Add detailed mask-ddpm documentation 2026-01-30 00:51:08 +08:00
Hongyu Yan
16a3085d14 Network Traffic Generation 2026-01-30 00:05:35 +08:00
Hongyu Yan
d5c8ace183 Generation of process time series on ICS to produce regular network data packets
Papers about topic 7 Generation of process time series on ICS to produce regular network data packets
2026-01-29 00:48:42 +08:00
Hongyu Yan
e94d4aa91f modbus generator related paper 2026-01-28 20:45:46 +08:00
Hongyu Yan
1cbfc6d53d Add a script that extracts all md files
All md files are extracted into the /papers/md folder
2026-01-26 18:22:48 +08:00
Hongyu Yan
1b86bc7927 In-depth paper reading completed 2026-01-26 15:55:39 +08:00
Hongyu Yan
dac3a2e545 Topic1 Completed 2026-01-26 15:43:12 +08:00
Hongyu Yan
7cfdeb7503 Topic2 Completed 2026-01-26 15:28:08 +08:00
Hongyu Yan
284b2dfe8f Topic3 Completed 2026-01-26 15:18:20 +08:00
Hongyu Yan
3219222204 Add paper summaries 2026-01-26 00:18:00 +08:00
Hongyu Yan
4af2d73cc8 Feishu Docs link 2026-01-24 14:19:54 +08:00
Hongyu Yan
4ef0266a71 Topic 4 & 5: in-depth paper readings filled in
All papers whose names start with U- have not yet been read in depth
2026-01-23 14:12:24 +08:00
Hongyu Yan
2208d85d80 Reference Paper 2026-01-22 16:08:52 +08:00
86852cbbbe Update README.md 2026-01-09 15:59:03 +08:00
079a138c2a Add notes4coding/checker_design.md 2026-01-09 15:49:44 +08:00
0730864c95 Add API/whatsneeded.md 2026-01-09 15:47:38 +08:00
112e5ae85e Update knowledges/Intro-Modbus-Diffusion-Tech-Route.md 2026-01-07 19:24:15 +08:00
9511e6271f Update knowledges/Intro-Modbus-Diffusion-Tech-Route.md 2026-01-07 14:53:05 +08:00
baf511a81d Update TeX formula
Change the TeX formula to the correctly rendered state
2026-01-07 14:52:09 +08:00
1728a88f4e Intro-Modbus-Diffusion-Tech-Route.md
An Intro for how to use diffusion on modbus protocol, and how to use/train it.
2026-01-07 00:14:45 +08:00
172 changed files with 24838 additions and 4 deletions

9
.gitignore vendored Normal file

@@ -0,0 +1,9 @@
papers/Topic8 Network Traffic Generation/NetDiffusion Network Data Augmentation Through Protocol-Constrained Traffic Generation/NetDiffusion Network Data Augmentation Through Protocol-Constrained Traffic Gener_1_33_translate_20260130214700.pdf
arxiv-style/*.pdf
arxiv-style/*.aux
arxiv-style/*.log
arxiv-style/*.blg
arxiv-style/*.bbl
arxiv-style/*.out
fig/
.DS_Store

32
API/whatsneeded.md Normal file

@@ -0,0 +1,32 @@
# What's needed for development & research?
Group members should describe their requirements and dependencies in detail in this document.
Current group members:
- HY YAN
- MZ YANG
- ZL CHEN
The following subheadings are the sections where each member describes their requirements.
## ZL CHEN
{placeholder}
## MZ YANG
{placeholder}
## HY YAN
{placeholder}


@@ -1 +1,12 @@
## Hi there!
## File organisation
Currently, the docs consist of 4 parts:
- **API**: contains detailed docs about API designs. (Note that `system_design.md` is a temp draft for discussing high-level design, not for API definitions).
- **knowledges**: contains theoretical knowledge/designs for the project (mainly diffusion stuff).
- **notes4coding**: contains detailed descriptions for coding; may be useful to coding agents (like codex).
- **papers**: contains useful papers we've found, in PDF
Here is the Feishu Docs link: https://my.feishu.cn/wiki/Za4dwCsG6iPD9qklRLWcoJOZnnb?from=from_copylink

6347
arxiv-style/IEEEtran.cls Normal file

File diff suppressed because it is too large

27
arxiv-style/README.md Normal file

@@ -0,0 +1,27 @@
## Files layout
- `arxiv.sty` and `template.tex`: The arxiv template we are using.
- `equations.tex`: **Duplicated**, contains the equations from the methodology section
- `main.tex` and `references.bib`: these produce our paper, currently using the arxiv template. Note that the references are template-independent.
## How to compile
It's recommended to use `MiKTeX` as the compiler on Windows.
To compile latex into pdf, follow these steps:
```bash
pdflatex ./main.tex
# Build the reference database; re-run only when references.bib changes
bibtex main
# Always run the compile command twice so cross-references resolve
pdflatex ./main.tex
pdflatex ./main.tex
```
## Troubleshooting
If you encounter warnings during compilation, simply press `Enter` to continue.
If references in the PDF appear as `[??]`, run the compile command twice more.
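
Alternatively, if your TeX distribution ships `latexmk` (an assumption; it is not required by this repo), a single command handles the pdflatex/bibtex reruns automatically:
```bash
latexmk -pdf main.tex
```
This is just a convenience sketch; the manual sequence above remains the reference workflow.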

262
arxiv-style/arxiv.sty Normal file

@@ -0,0 +1,262 @@
\NeedsTeXFormat{LaTeX2e}
\ProcessOptions\relax
% fonts
\renewcommand{\rmdefault}{ptm}
\renewcommand{\sfdefault}{phv}
% set page geometry
\usepackage[verbose=true,letterpaper]{geometry}
\AtBeginDocument{
\newgeometry{
textheight=9in,
textwidth=6.5in,
top=1in,
headheight=14pt,
headsep=25pt,
footskip=30pt
}
}
\widowpenalty=10000
\clubpenalty=10000
\flushbottom
\sloppy
\newcommand{\headeright}{A Preprint}
\newcommand{\undertitle}{A Preprint}
\newcommand{\shorttitle}{\@title}
\usepackage{fancyhdr}
\fancyhf{}
\pagestyle{fancy}
\renewcommand{\headrulewidth}{0.4pt}
\fancyheadoffset{0pt}
\rhead{\scshape \footnotesize \headeright}
\chead{\shorttitle}
\cfoot{\thepage}
%Handling Keywords
\def\keywordname{{\bfseries \emph{Keywords}}}%
\def\keywords#1{\par\addvspace\medskipamount{\rightskip=0pt plus1cm
\def\and{\ifhmode\unskip\nobreak\fi\ $\cdot$
}\noindent\keywordname\enspace\ignorespaces#1\par}}
% font sizes with reduced leading
\renewcommand{\normalsize}{%
\@setfontsize\normalsize\@xpt\@xipt
\abovedisplayskip 7\p@ \@plus 2\p@ \@minus 5\p@
\abovedisplayshortskip \z@ \@plus 3\p@
\belowdisplayskip \abovedisplayskip
\belowdisplayshortskip 4\p@ \@plus 3\p@ \@minus 3\p@
}
\normalsize
\renewcommand{\small}{%
\@setfontsize\small\@ixpt\@xpt
\abovedisplayskip 6\p@ \@plus 1.5\p@ \@minus 4\p@
\abovedisplayshortskip \z@ \@plus 2\p@
\belowdisplayskip \abovedisplayskip
\belowdisplayshortskip 3\p@ \@plus 2\p@ \@minus 2\p@
}
\renewcommand{\footnotesize}{\@setfontsize\footnotesize\@ixpt\@xpt}
\renewcommand{\scriptsize}{\@setfontsize\scriptsize\@viipt\@viiipt}
\renewcommand{\tiny}{\@setfontsize\tiny\@vipt\@viipt}
\renewcommand{\large}{\@setfontsize\large\@xiipt{14}}
\renewcommand{\Large}{\@setfontsize\Large\@xivpt{16}}
\renewcommand{\LARGE}{\@setfontsize\LARGE\@xviipt{20}}
\renewcommand{\huge}{\@setfontsize\huge\@xxpt{23}}
\renewcommand{\Huge}{\@setfontsize\Huge\@xxvpt{28}}
% sections with less space
\providecommand{\section}{}
\renewcommand{\section}{%
\@startsection{section}{1}{\z@}%
{-2.0ex \@plus -0.5ex \@minus -0.2ex}%
{ 1.5ex \@plus 0.3ex \@minus 0.2ex}%
{\large\bf\raggedright}%
}
\providecommand{\subsection}{}
\renewcommand{\subsection}{%
\@startsection{subsection}{2}{\z@}%
{-1.8ex \@plus -0.5ex \@minus -0.2ex}%
{ 0.8ex \@plus 0.2ex}%
{\normalsize\bf\raggedright}%
}
\providecommand{\subsubsection}{}
\renewcommand{\subsubsection}{%
\@startsection{subsubsection}{3}{\z@}%
{-1.5ex \@plus -0.5ex \@minus -0.2ex}%
{ 0.5ex \@plus 0.2ex}%
{\normalsize\bf\raggedright}%
}
\providecommand{\paragraph}{}
\renewcommand{\paragraph}{%
\@startsection{paragraph}{4}{\z@}%
{1.5ex \@plus 0.5ex \@minus 0.2ex}%
{-1em}%
{\normalsize\bf}%
}
\providecommand{\subparagraph}{}
\renewcommand{\subparagraph}{%
\@startsection{subparagraph}{5}{\z@}%
{1.5ex \@plus 0.5ex \@minus 0.2ex}%
{-1em}%
{\normalsize\bf}%
}
\providecommand{\subsubsubsection}{}
\renewcommand{\subsubsubsection}{%
\vskip5pt{\noindent\normalsize\rm\raggedright}%
}
% float placement
\renewcommand{\topfraction }{0.85}
\renewcommand{\bottomfraction }{0.4}
\renewcommand{\textfraction }{0.1}
\renewcommand{\floatpagefraction}{0.7}
\newlength{\@abovecaptionskip}\setlength{\@abovecaptionskip}{7\p@}
\newlength{\@belowcaptionskip}\setlength{\@belowcaptionskip}{\z@}
\setlength{\abovecaptionskip}{\@abovecaptionskip}
\setlength{\belowcaptionskip}{\@belowcaptionskip}
% swap above/belowcaptionskip lengths for tables
\renewenvironment{table}
{\setlength{\abovecaptionskip}{\@belowcaptionskip}%
\setlength{\belowcaptionskip}{\@abovecaptionskip}%
\@float{table}}
{\end@float}
% footnote formatting
\setlength{\footnotesep }{6.65\p@}
\setlength{\skip\footins}{9\p@ \@plus 4\p@ \@minus 2\p@}
\renewcommand{\footnoterule}{\kern-3\p@ \hrule width 12pc \kern 2.6\p@}
\setcounter{footnote}{0}
% paragraph formatting
\setlength{\parindent}{\z@}
\setlength{\parskip }{5.5\p@}
% list formatting
\setlength{\topsep }{4\p@ \@plus 1\p@ \@minus 2\p@}
\setlength{\partopsep }{1\p@ \@plus 0.5\p@ \@minus 0.5\p@}
\setlength{\itemsep }{2\p@ \@plus 1\p@ \@minus 0.5\p@}
\setlength{\parsep }{2\p@ \@plus 1\p@ \@minus 0.5\p@}
\setlength{\leftmargin }{3pc}
\setlength{\leftmargini }{\leftmargin}
\setlength{\leftmarginii }{2em}
\setlength{\leftmarginiii}{1.5em}
\setlength{\leftmarginiv }{1.0em}
\setlength{\leftmarginv }{0.5em}
\def\@listi {\leftmargin\leftmargini}
\def\@listii {\leftmargin\leftmarginii
\labelwidth\leftmarginii
\advance\labelwidth-\labelsep
\topsep 2\p@ \@plus 1\p@ \@minus 0.5\p@
\parsep 1\p@ \@plus 0.5\p@ \@minus 0.5\p@
\itemsep \parsep}
\def\@listiii{\leftmargin\leftmarginiii
\labelwidth\leftmarginiii
\advance\labelwidth-\labelsep
\topsep 1\p@ \@plus 0.5\p@ \@minus 0.5\p@
\parsep \z@
\partopsep 0.5\p@ \@plus 0\p@ \@minus 0.5\p@
\itemsep \topsep}
\def\@listiv {\leftmargin\leftmarginiv
\labelwidth\leftmarginiv
\advance\labelwidth-\labelsep}
\def\@listv {\leftmargin\leftmarginv
\labelwidth\leftmarginv
\advance\labelwidth-\labelsep}
\def\@listvi {\leftmargin\leftmarginvi
\labelwidth\leftmarginvi
\advance\labelwidth-\labelsep}
% create title
\providecommand{\maketitle}{}
\renewcommand{\maketitle}{%
\par
\begingroup
\renewcommand{\thefootnote}{\fnsymbol{footnote}}
% for perfect author name centering
%\renewcommand{\@makefnmark}{\hbox to \z@{$^{\@thefnmark}$\hss}}
% The footnote-mark was overlapping the footnote-text,
% added the following to fix this problem (MK)
\long\def\@makefntext##1{%
\parindent 1em\noindent
\hbox to 1.8em{\hss $\m@th ^{\@thefnmark}$}##1
}
\thispagestyle{empty}
\@maketitle
\@thanks
%\@notice
\endgroup
\let\maketitle\relax
\let\thanks\relax
}
% rules for title box at top of first page
\newcommand{\@toptitlebar}{
\hrule height 2\p@
\vskip 0.25in
\vskip -\parskip%
}
\newcommand{\@bottomtitlebar}{
\vskip 0.29in
\vskip -\parskip
\hrule height 2\p@
\vskip 0.09in%
}
% create title (includes both anonymized and non-anonymized versions)
\providecommand{\@maketitle}{}
\renewcommand{\@maketitle}{%
\vbox{%
\hsize\textwidth
\linewidth\hsize
\vskip 0.1in
\@toptitlebar
\centering
{\LARGE\sc \@title\par}
\@bottomtitlebar
\textsc{\undertitle}\\
\vskip 0.1in
\def\And{%
\end{tabular}\hfil\linebreak[0]\hfil%
\begin{tabular}[t]{c}\bf\rule{\z@}{24\p@}\ignorespaces%
}
\def\AND{%
\end{tabular}\hfil\linebreak[4]\hfil%
\begin{tabular}[t]{c}\bf\rule{\z@}{24\p@}\ignorespaces%
}
\begin{tabular}[t]{c}\bf\rule{\z@}{24\p@}\@author\end{tabular}%
\vskip 0.4in \@minus 0.1in \center{\@date} \vskip 0.2in
}
}
% add conference notice to bottom of first page
\newcommand{\ftype@noticebox}{8}
\newcommand{\@notice}{%
% give a bit of extra room back to authors on first page
\enlargethispage{2\baselineskip}%
\@float{noticebox}[b]%
\footnotesize\@noticestring%
\end@float%
}
% abstract styling
\renewenvironment{abstract}
{
\centerline
{\large \bfseries \scshape Abstract}
\begin{quote}
}
{
\end{quote}
}
\endinput
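
As a quick orientation (not part of the style file), a minimal document sketch showing how the commands defined above (`\maketitle`, `\And`, `\keywords`, the `abstract` environment) are typically invoked; the title, authors, and keywords here are placeholders, not the paper's actual content:

```latex
\documentclass{article}
\usepackage{arxiv}          % the style file above, saved as arxiv.sty
\usepackage{amsmath}

\title{Example Title}
\author{First Author \And Second Author}

\begin{document}
\maketitle
\begin{abstract}
One or two sentences of abstract text.
\end{abstract}
\keywords{Diffusion \and ICS \and Modbus}
\section{Introduction}
Body text.
\end{document}
```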

95
arxiv-style/equations.tex Normal file

@@ -0,0 +1,95 @@
\documentclass[10pt, twocolumn]{article}
\usepackage{amsmath, amssymb}
\usepackage{bm}
\usepackage[margin=1in]{geometry}
\usepackage{microtype}
\title{Equations: Mask-DDPM Methodology}
\author{}
\date{}
\begin{document}
\maketitle
\section{Problem Formulation}
Each training instance is a fixed-length window of length $L$, comprising continuous channels $\bm{X} \in \mathbb{R}^{L \times d_c}$ and discrete channels $\bm{Y} = \{y^{(j)}_{1:L}\}_{j=1}^{d_d}$, where each discrete variable satisfies $y^{(j)}_t \in \mathcal{V}_j$ for a finite vocabulary $\mathcal{V}_j$.
\section{Transformer Trend Module for Continuous Dynamics}
We posit an additive decomposition of the continuous signal:
\begin{equation}
\bm{X} = \bm{S} + \bm{R},
\label{eq:additive_decomp}
\end{equation}
where $\bm{S} \in \mathbb{R}^{L \times d_c}$ captures the smooth temporal trend and $\bm{R} \in \mathbb{R}^{L \times d_c}$ represents distributional residuals.
The causal Transformer trend extractor $f_{\phi}$ predicts the next-step trend via:
\begin{equation}
\hat{\bm{S}}_{t+1} = f_{\phi}(\bm{X}_{1:t}), \quad t = 1, \dots, L-1.
\label{eq:trend_prediction}
\end{equation}
Training minimizes the mean-squared error:
\begin{equation}
\mathcal{L}_{\text{trend}}(\phi) = \frac{1}{(L-1)d_c} \sum_{t=1}^{L-1} \bigl\| \hat{\bm{S}}_{t+1} - \bm{X}_{t+1} \bigr\|_2^2.
\label{eq:trend_loss}
\end{equation}
At inference, the residual target is defined as $\bm{R} = \bm{X} - \hat{\bm{S}}$.
\section{DDPM for Continuous Residual Generation}
Let $K$ denote diffusion steps with noise schedule $\{\beta_k\}_{k=1}^K$, $\alpha_k = 1 - \beta_k$, and $\bar{\alpha}_k = \prod_{i=1}^k \alpha_i$. The forward corruption process is:
\begin{align}
q(\bm{r}_k \mid \bm{r}_0) &= \mathcal{N}\bigl( \sqrt{\bar{\alpha}_k}\,\bm{r}_0,\; (1 - \bar{\alpha}_k)\mathbf{I} \bigr), \\
\bm{r}_k &= \sqrt{\bar{\alpha}_k}\,\bm{r}_0 + \sqrt{1 - \bar{\alpha}_k}\,\boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),
\label{eq:forward_process}
\end{align}
where $\bm{r}_0 \equiv \bm{R}$.
The reverse process is parameterized as:
\begin{equation}
p_{\theta}(\bm{r}_{k-1} \mid \bm{r}_k, \hat{\bm{S}}) = \mathcal{N}\bigl( \boldsymbol{\mu}_{\theta}(\bm{r}_k, k, \hat{\bm{S}}),\; \boldsymbol{\Sigma}(k) \bigr).
\label{eq:reverse_process}
\end{equation}
Training employs the $\epsilon$-prediction objective:
\begin{equation}
\mathcal{L}_{\text{cont}}(\theta) = \mathbb{E}_{k,\bm{r}_0,\boldsymbol{\epsilon}} \left[ \bigl\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}(\bm{r}_k, k, \hat{\bm{S}}) \bigr\|_2^2 \right].
\label{eq:ddpm_loss}
\end{equation}
Optionally, SNR-based reweighting yields:
\begin{equation}
\mathcal{L}^{\text{snr}}_{\text{cont}}(\theta) = \mathbb{E}_{k,\bm{r}_0,\boldsymbol{\epsilon}} \left[ w_k \bigl\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}(\bm{r}_k, k, \hat{\bm{S}}) \bigr\|_2^2 \right],
\label{eq:snr_loss}
\end{equation}
where $w_k = \min(\mathrm{SNR}_k, \gamma) / \mathrm{SNR}_k$ and $\mathrm{SNR}_k = \bar{\alpha}_k / (1 - \bar{\alpha}_k)$. The final continuous output is reconstructed as $\hat{\bm{X}} = \hat{\bm{S}} + \hat{\bm{R}}$.
\section{Masked Diffusion for Discrete Variables}
For discrete channel $j$, the forward masking process follows schedule $\{m_k\}_{k=1}^K$:
\begin{equation}
q(y^{(j)}_k \mid y^{(j)}_0) =
\begin{cases}
y^{(j)}_0, & \text{with probability } 1 - m_k, \\
\texttt{[MASK]}, & \text{with probability } m_k,
\end{cases}
\label{eq:masking_process}
\end{equation}
applied independently across variables and timesteps.
The denoiser $h_{\psi}$ predicts categorical distributions conditioned on continuous context:
\begin{equation}
p_{\psi}\bigl( y^{(j)}_0 \mid y_k, k, \hat{\bm{S}}, \hat{\bm{X}} \bigr) = h_{\psi}(y_k, k, \hat{\bm{S}}, \hat{\bm{X}}).
\label{eq:discrete_denoising}
\end{equation}
Training minimizes the categorical cross-entropy:
\begin{equation}
\mathcal{L}_{\text{disc}}(\psi) = \mathbb{E}_{k} \left[ \frac{1}{|\mathcal{M}|} \sum_{(j,t) \in \mathcal{M}} \mathrm{CE}\bigl( h_{\psi}(y_k, k, \hat{\bm{S}}, \hat{\bm{X}})_{j,t},\; y^{(j)}_{0,t} \bigr) \right],
\label{eq:discrete_loss}
\end{equation}
where $\mathcal{M}$ denotes masked positions at step $k$.
\section{Joint Optimization}
The combined objective balances continuous and discrete learning:
\begin{equation}
\mathcal{L} = \lambda \, \mathcal{L}_{\text{cont}} + (1 - \lambda) \, \mathcal{L}_{\text{disc}}, \quad \lambda \in [0,1].
\label{eq:joint_objective}
\end{equation}
Type-aware routing enforces deterministic reconstruction $\hat{x}^{(i)} = g_i(\hat{\bm{X}}, \hat{\bm{Y}})$ for derived variables.
\end{document}
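
For intuition only (not part of the paper sources), a minimal NumPy sketch of the two forward processes defined above, the Gaussian corruption of Eq. (eq:forward_process) and the token masking of Eq. (eq:masking_process); all shapes, schedules, and names are illustrative assumptions rather than the project's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): window length L, d_c continuous channels, K diffusion steps.
L, d_c, K = 64, 4, 1000
betas = np.linspace(1e-4, 0.02, K)        # noise schedule {beta_k}
alphas_bar = np.cumprod(1.0 - betas)      # \bar{alpha}_k = prod_i (1 - beta_i)

def forward_continuous(r0, k):
    """Sample r_k ~ N(sqrt(abar_k) * r_0, (1 - abar_k) I), as in Eq. (forward_process)."""
    eps = rng.standard_normal(r0.shape)
    return np.sqrt(alphas_bar[k]) * r0 + np.sqrt(1.0 - alphas_bar[k]) * eps

MASK = -1  # stand-in id for the [MASK] token

def forward_discrete(y0, k, mask_schedule):
    """Independently replace each token with [MASK] with probability m_k, as in Eq. (masking_process)."""
    keep = rng.random(y0.shape) >= mask_schedule[k]
    return np.where(keep, y0, MASK)

r0 = rng.standard_normal((L, d_c))        # residual window R = X - S_hat
y0 = rng.integers(0, 8, size=L)           # one discrete channel with vocabulary size 8
m_schedule = np.linspace(0.0, 1.0, K)     # masking schedule {m_k}
r_k = forward_continuous(r0, k=500)
y_k = forward_discrete(y0, k=500, mask_schedule=m_schedule)
```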

Binary file not shown (new image, 582 KiB)

Binary file not shown (new image, 669 KiB)

Binary file not shown (new image, 123 KiB)


@@ -0,0 +1 @@
3.12


@@ -0,0 +1,237 @@
#!/usr/bin/env python3
"""
Draw *separate* SVG figures for:
1) Continuous channels (multiple smooth curves per figure)
2) Discrete channels (multiple step-like/token curves per figure)
Outputs (default):
out/continuous_channels.svg
out/discrete_channels.svg
Notes:
- Transparent background (good for draw.io / LaTeX / diagrams).
- No axes/frames by default (diagram-friendly).
- Curves are synthetic placeholders; replace `make_*_channels()` with your real data.
"""
from __future__ import annotations
import argparse
from dataclasses import dataclass
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
# ----------------------------
# Data generators (placeholders)
# ----------------------------
@dataclass
class GenParams:
seconds: float = 10.0
fs: int = 200
seed: int = 7
n_cont: int = 6 # number of continuous channels (curves)
n_disc: int = 5 # number of discrete channels (curves)
disc_vocab: int = 8 # token/vocab size for discrete channels
disc_change_rate_hz: float = 1.2 # how often discrete tokens change
def make_continuous_channels(p: GenParams) -> tuple[np.ndarray, np.ndarray]:
"""
Returns:
t: shape (T,)
Y: shape (n_cont, T)
"""
rng = np.random.default_rng(p.seed)
T = int(p.seconds * p.fs)
t = np.linspace(0, p.seconds, T, endpoint=False)
Y = []
for i in range(p.n_cont):
# Multi-scale smooth-ish signals
f1 = 0.15 + 0.06 * i
f2 = 0.8 + 0.15 * (i % 3)
phase = rng.uniform(0, 2 * np.pi)
y = (
0.9 * np.sin(2 * np.pi * f1 * t + phase)
+ 0.35 * np.sin(2 * np.pi * f2 * t + 1.3 * phase)
)
# Add mild colored-ish noise by smoothing white noise
w = rng.normal(0, 1, size=T)
w = np.convolve(w, np.ones(9) / 9.0, mode="same")
y = y + 0.15 * w
# Normalize each channel for consistent visual scale
y = (y - np.mean(y)) / (np.std(y) + 1e-9)
y = 0.8 * y + 0.15 * i # vertical offset to separate curves a bit
Y.append(y)
return t, np.vstack(Y)
def make_discrete_channels(p: GenParams) -> tuple[np.ndarray, np.ndarray]:
"""
Discrete channels as piecewise-constant token IDs (integers).
Returns:
t: shape (T,)
X: shape (n_disc, T) (integers in [0, disc_vocab-1])
"""
rng = np.random.default_rng(p.seed + 100)
T = int(p.seconds * p.fs)
t = np.linspace(0, p.seconds, T, endpoint=False)
# expected number of changes per channel
expected_changes = int(max(1, p.seconds * p.disc_change_rate_hz))
X = np.zeros((p.n_disc, T), dtype=int)
for c in range(p.n_disc):
# pick change points
k = rng.poisson(expected_changes) + 1
change_pts = np.unique(rng.integers(0, T, size=k))
change_pts = np.sort(np.concatenate([[0], change_pts, [T]]))
cur = rng.integers(0, p.disc_vocab)
for a, b in zip(change_pts[:-1], change_pts[1:]):
# occasional token jump
if a != 0:
if rng.random() < 0.85:
cur = rng.integers(0, p.disc_vocab)
X[c, a:b] = cur
return t, X
# ----------------------------
# Plotting helpers
# ----------------------------
def _make_transparent_figure(width_in: float, height_in: float) -> tuple[plt.Figure, plt.Axes]:
fig = plt.figure(figsize=(width_in, height_in), dpi=200)
fig.patch.set_alpha(0.0)
ax = fig.add_axes([0.03, 0.03, 0.94, 0.94])
ax.patch.set_alpha(0.0)
return fig, ax
def save_continuous_channels_svg(
t: np.ndarray,
Y: np.ndarray,
out_path: Path,
*,
lw: float = 2.0,
clean: bool = True,
) -> None:
"""
Plot multiple continuous curves in one figure and save SVG.
Y shape: (n_cont, T)
"""
fig, ax = _make_transparent_figure(width_in=6.0, height_in=2.2)
# Let matplotlib choose different colors automatically (good defaults).
for i in range(Y.shape[0]):
ax.plot(t, Y[i], linewidth=lw)
if clean:
ax.set_axis_off()
else:
ax.set_xlabel("t")
ax.set_ylabel("value")
# Set limits with padding
y_all = Y.reshape(-1)
ymin, ymax = float(np.min(y_all)), float(np.max(y_all))
ypad = 0.08 * (ymax - ymin + 1e-9)
ax.set_xlim(t[0], t[-1])
ax.set_ylim(ymin - ypad, ymax + ypad)
out_path.parent.mkdir(parents=True, exist_ok=True)
fig.savefig(out_path, format="svg", bbox_inches="tight", pad_inches=0.0, transparent=True)
plt.close(fig)
def save_discrete_channels_svg(
t: np.ndarray,
X: np.ndarray,
out_path: Path,
*,
lw: float = 2.0,
clean: bool = True,
vertical_spacing: float = 1.25,
) -> None:
"""
Plot multiple discrete (piecewise-constant) curves in one figure and save SVG.
X shape: (n_disc, T) integers.
We draw each channel as a step plot, offset vertically so curves don't overlap.
"""
fig, ax = _make_transparent_figure(width_in=6.0, height_in=2.2)
for i in range(X.shape[0]):
y = X[i].astype(float) + i * vertical_spacing
ax.step(t, y, where="post", linewidth=lw)
if clean:
ax.set_axis_off()
else:
ax.set_xlabel("t")
ax.set_ylabel("token id (offset)")
y_all = (X.astype(float) + np.arange(X.shape[0])[:, None] * vertical_spacing).reshape(-1)
ymin, ymax = float(np.min(y_all)), float(np.max(y_all))
ypad = 0.10 * (ymax - ymin + 1e-9)
ax.set_xlim(t[0], t[-1])
ax.set_ylim(ymin - ypad, ymax + ypad)
out_path.parent.mkdir(parents=True, exist_ok=True)
fig.savefig(out_path, format="svg", bbox_inches="tight", pad_inches=0.0, transparent=True)
plt.close(fig)
# ----------------------------
# CLI
# ----------------------------
def main() -> None:
ap = argparse.ArgumentParser()
ap.add_argument("--outdir", type=Path, default=Path("out"))
ap.add_argument("--seed", type=int, default=7)
ap.add_argument("--seconds", type=float, default=10.0)
ap.add_argument("--fs", type=int, default=200)
ap.add_argument("--n-cont", type=int, default=6)
ap.add_argument("--n-disc", type=int, default=5)
ap.add_argument("--disc-vocab", type=int, default=8)
ap.add_argument("--disc-change-rate", type=float, default=1.2)
ap.add_argument("--keep-axes", action="store_true", help="Show axes/labels (default: off)")
args = ap.parse_args()
p = GenParams(
seconds=args.seconds,
fs=args.fs,
seed=args.seed,
n_cont=args.n_cont,
n_disc=args.n_disc,
disc_vocab=args.disc_vocab,
disc_change_rate_hz=args.disc_change_rate,
)
t_c, Y = make_continuous_channels(p)
t_d, X = make_discrete_channels(p)
cont_path = args.outdir / "continuous_channels.svg"
disc_path = args.outdir / "discrete_channels.svg"
save_continuous_channels_svg(t_c, Y, cont_path, clean=not args.keep_axes)
save_discrete_channels_svg(t_d, X, disc_path, clean=not args.keep_axes)
print("Wrote:")
print(f" {cont_path}")
print(f" {disc_path}")
if __name__ == "__main__":
main()
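
For reference, a typical invocation of the script above (the file name is an assumption; the flags match the argparse options defined in `main()`):

```bash
uv run python draw_channel_figures.py --outdir out --n-cont 6 --n-disc 5
```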


@@ -0,0 +1,272 @@
#!/usr/bin/env python3
"""
Option A: "Synthetic ICS Data" mini-panel (high-level features, not packets)
What it draws (one SVG, transparent background):
- Top: 2-3 continuous feature curves (smooth, time-aligned)
- Bottom: discrete/categorical feature strip (colored blocks)
- One vertical dashed alignment line crossing both
- Optional shaded regime window
- Optional "real vs synthetic" ghost overlay (faint gray behind one curve)
Usage:
uv run python draw_synthetic_ics_optionA.py --out ./assets/synth_ics_optionA.svg
"""
from __future__ import annotations
import argparse
from dataclasses import dataclass
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
@dataclass
class Params:
seed: int = 7
seconds: float = 10.0
fs: int = 300
n_curves: int = 3 # continuous channels shown
n_bins: int = 40 # discrete blocks across x
disc_vocab: int = 8 # number of discrete categories
# Layout / style
width_in: float = 6.0
height_in: float = 2.2
curve_lw: float = 2.3
ghost_lw: float = 2.0 # "real" overlay line width
strip_height: float = 0.65 # bar height in [0,1] strip axis
strip_gap_frac: float = 0.10 # gap between blocks (fraction of block width)
# Visual cues
show_alignment_line: bool = True
align_x_frac: float = 0.58 # where to place dashed line, fraction of timeline
show_regime_window: bool = True
regime_start_frac: float = 0.30
regime_end_frac: float = 0.45
show_real_ghost: bool = True # faint gray "real" behind first synthetic curve
def _smooth(x: np.ndarray, win: int) -> np.ndarray:
win = max(3, int(win) | 1) # odd
k = np.ones(win, dtype=float)
k /= k.sum()
return np.convolve(x, k, mode="same")
def make_continuous_curves(p: Params) -> tuple[np.ndarray, np.ndarray, np.ndarray | None]:
"""
Returns:
t: (T,)
Y_syn: (n_curves, T) synthetic curves
y_real: (T,) or None optional "real" ghost curve (for one channel)
"""
rng = np.random.default_rng(p.seed)
T = int(p.seconds * p.fs)
t = np.linspace(0, p.seconds, T, endpoint=False)
Y = []
for i in range(p.n_curves):
# multi-scale smooth temporal patterns
f_slow = 0.09 + 0.03 * (i % 3)
f_mid = 0.65 + 0.18 * (i % 4)
ph = rng.uniform(0, 2 * np.pi)
y = (
0.95 * np.sin(2 * np.pi * f_slow * t + ph)
+ 0.30 * np.sin(2 * np.pi * f_mid * t + 0.7 * ph)
)
# regime-like bumps
bumps = np.zeros_like(t)
for _ in range(2):
mu = rng.uniform(0.8, p.seconds - 0.8)
sig = rng.uniform(0.35, 0.85)
bumps += np.exp(-0.5 * ((t - mu) / (sig + 1e-9)) ** 2)
y += 0.55 * bumps
# mild smooth noise
noise = _smooth(rng.normal(0, 1, size=T), win=int(p.fs * 0.04))
y += 0.10 * noise
# normalize for clean presentation
y = (y - y.mean()) / (y.std() + 1e-9)
y *= 0.42
Y.append(y)
Y_syn = np.vstack(Y)
# Optional "real" ghost: similar to first curve, but slightly different
y_real = None
if p.show_real_ghost:
base = Y_syn[0].copy()
drift = _smooth(rng.normal(0, 1, size=T), win=int(p.fs * 0.18))
drift = drift / (np.std(drift) + 1e-9)
y_real = base * 0.95 + 0.07 * drift
return t, Y_syn, y_real
def make_discrete_strip(p: Params) -> np.ndarray:
"""
Piecewise-constant categorical IDs across n_bins.
Returns:
ids: (n_bins,) in [0, disc_vocab-1]
"""
rng = np.random.default_rng(p.seed + 123)
n = p.n_bins
ids = np.zeros(n, dtype=int)
cur = rng.integers(0, p.disc_vocab)
for i in range(n):
# occasional change
if i == 0 or rng.random() < 0.28:
cur = rng.integers(0, p.disc_vocab)
ids[i] = cur
return ids
def _axes_clean(ax: plt.Axes) -> None:
"""Keep axes lines optional but remove all text/numbers (diagram-friendly)."""
ax.set_xlabel("")
ax.set_ylabel("")
ax.set_title("")
ax.set_xticks([])
ax.set_yticks([])
ax.tick_params(
axis="both",
which="both",
bottom=False,
left=False,
top=False,
right=False,
labelbottom=False,
labelleft=False,
)
def draw_optionA(out_path: Path, p: Params) -> None:
# Figure
fig = plt.figure(figsize=(p.width_in, p.height_in), dpi=200)
fig.patch.set_alpha(0.0)
# Two stacked axes (shared x)
ax_top = fig.add_axes([0.08, 0.32, 0.90, 0.62])
ax_bot = fig.add_axes([0.08, 0.12, 0.90, 0.16], sharex=ax_top)
ax_top.patch.set_alpha(0.0)
ax_bot.patch.set_alpha(0.0)
# Generate data
t, Y_syn, y_real = make_continuous_curves(p)
ids = make_discrete_strip(p)
x0, x1 = float(t[0]), float(t[-1])
span = x1 - x0
# Optional shaded regime window
if p.show_regime_window:
rs = x0 + p.regime_start_frac * span
re = x0 + p.regime_end_frac * span
ax_top.axvspan(rs, re, alpha=0.12) # default color, semi-transparent
ax_bot.axvspan(rs, re, alpha=0.12)
# Optional vertical dashed alignment line
if p.show_alignment_line:
vx = x0 + p.align_x_frac * span
ax_top.axvline(vx, linestyle="--", linewidth=1.2, alpha=0.7)
ax_bot.axvline(vx, linestyle="--", linewidth=1.2, alpha=0.7)
# Continuous curves (use fixed colors for consistency)
curve_colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#9467bd"] # blue, orange, green, purple
# Ghost "real" behind the first curve (faint gray)
if y_real is not None:
ax_top.plot(t, y_real, linewidth=p.ghost_lw, color="0.65", alpha=0.55, zorder=1)
for i in range(Y_syn.shape[0]):
ax_top.plot(
t, Y_syn[i],
linewidth=p.curve_lw,
color=curve_colors[i % len(curve_colors)],
zorder=2
)
# Set top y-limits with padding
ymin, ymax = float(Y_syn.min()), float(Y_syn.max())
ypad = 0.10 * (ymax - ymin + 1e-9)
ax_top.set_xlim(x0, x1)
ax_top.set_ylim(ymin - ypad, ymax + ypad)
# Discrete strip as colored blocks
palette = [
"#e41a1c", "#377eb8", "#4daf4a", "#984ea3",
"#ff7f00", "#ffff33", "#a65628", "#f781bf",
]
n = len(ids)
bin_w = span / n
gap = p.strip_gap_frac * bin_w
ax_bot.set_ylim(0, 1)
y = (1 - p.strip_height) / 2
for i, cat in enumerate(ids):
left = x0 + i * bin_w + gap / 2
width = bin_w - gap
ax_bot.add_patch(
Rectangle(
(left, y), width, p.strip_height,
facecolor=palette[int(cat) % len(palette)],
edgecolor="none",
)
)
# Clean axes: no ticks/labels; keep spines (axes lines) visible
_axes_clean(ax_top)
_axes_clean(ax_bot)
for ax in (ax_top, ax_bot):
for side in ("left", "bottom", "top", "right"):
ax.spines[side].set_visible(True)
out_path.parent.mkdir(parents=True, exist_ok=True)
fig.savefig(out_path, format="svg", transparent=True, bbox_inches="tight", pad_inches=0.0)
plt.close(fig)
def main() -> None:
ap = argparse.ArgumentParser()
ap.add_argument("--out", type=Path, default=Path("synth_ics_optionA.svg"))
ap.add_argument("--seed", type=int, default=7)
ap.add_argument("--seconds", type=float, default=10.0)
ap.add_argument("--fs", type=int, default=300)
ap.add_argument("--curves", type=int, default=3)
ap.add_argument("--bins", type=int, default=40)
ap.add_argument("--vocab", type=int, default=8)
ap.add_argument("--no-align", action="store_true")
ap.add_argument("--no-regime", action="store_true")
ap.add_argument("--no-ghost", action="store_true")
args = ap.parse_args()
p = Params(
seed=args.seed,
seconds=args.seconds,
fs=args.fs,
n_curves=args.curves,
n_bins=args.bins,
disc_vocab=args.vocab,
show_alignment_line=not args.no_align,
show_regime_window=not args.no_regime,
show_real_ghost=not args.no_ghost,
)
draw_optionA(args.out, p)
print(f"Wrote: {args.out}")
if __name__ == "__main__":
main()


@@ -0,0 +1,318 @@
#!/usr/bin/env python3
"""
Option B: "Synthetic ICS Data" as a mini process-story strip (high-level features)
- ONE SVG, transparent background
- Two frames by default: "steady/normal" -> "disturbance/recovery"
- Each frame contains:
- Top: multiple continuous feature curves
- Bottom: discrete/categorical strip (colored blocks)
- A vertical dashed alignment line crossing both
- Optional shaded regime window
- A right-pointing arrow between frames
No text, no numbers (axes lines only). Good for draw.io embedding.
Run:
uv run python draw_synthetic_ics_optionB.py --out ./assets/synth_ics_optionB.svg
"""
from __future__ import annotations
import argparse
from dataclasses import dataclass
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle, FancyArrowPatch
@dataclass
class Params:
seed: int = 7
seconds: float = 8.0
fs: int = 250
# Two-frame story
n_frames: int = 2
# Per-frame visuals
n_curves: int = 3
n_bins: int = 32
disc_vocab: int = 8
# Layout
width_in: float = 8.2
height_in: float = 2.4
# Relative layout inside the figure
margin_left: float = 0.05
margin_right: float = 0.05
margin_bottom: float = 0.12
margin_top: float = 0.10
frame_gap: float = 0.08 # gap (figure fraction) between frames (space for arrow)
# Styling
curve_lw: float = 2.1
ghost_lw: float = 1.8
strip_height: float = 0.65
strip_gap_frac: float = 0.12
# Cues
show_alignment_line: bool = True
align_x_frac: float = 0.60
show_regime_window: bool = True
regime_start_frac: float = 0.30
regime_end_frac: float = 0.46
show_real_ghost: bool = False # keep default off for cleaner story
show_axes_spines: bool = True # axes lines only (no ticks/labels)
# ---------- helpers ----------
def _smooth(x: np.ndarray, win: int) -> np.ndarray:
win = max(3, int(win) | 1)
k = np.ones(win, dtype=float)
k /= k.sum()
return np.convolve(x, k, mode="same")
def _axes_only(ax: plt.Axes, *, keep_spines: bool) -> None:
ax.set_xlabel("")
ax.set_ylabel("")
ax.set_title("")
ax.set_xticks([])
ax.set_yticks([])
ax.tick_params(
axis="both",
which="both",
bottom=False,
left=False,
top=False,
right=False,
labelbottom=False,
labelleft=False,
)
ax.grid(False)
if keep_spines:
for s in ("left", "right", "top", "bottom"):
ax.spines[s].set_visible(True)
else:
for s in ("left", "right", "top", "bottom"):
ax.spines[s].set_visible(False)
def make_frame_continuous(seed: int, seconds: float, fs: int, n_curves: int, style: str) -> tuple[np.ndarray, np.ndarray]:
"""
style:
- "steady": smoother, smaller bumps
- "disturb": larger bumps and more variance
"""
rng = np.random.default_rng(seed)
T = int(seconds * fs)
t = np.linspace(0, seconds, T, endpoint=False)
amp_bump = 0.40 if style == "steady" else 0.85
amp_noise = 0.09 if style == "steady" else 0.14
amp_scale = 0.38 if style == "steady" else 0.46
base_freqs = [0.10, 0.08, 0.12, 0.09]
mid_freqs = [0.65, 0.78, 0.90, 0.72]
Y = []
for i in range(n_curves):
f_slow = base_freqs[i % len(base_freqs)]
f_mid = mid_freqs[i % len(mid_freqs)]
ph = rng.uniform(0, 2 * np.pi)
y = (
0.95 * np.sin(2 * np.pi * f_slow * t + ph)
+ 0.28 * np.sin(2 * np.pi * f_mid * t + 0.65 * ph)
)
bumps = np.zeros_like(t)
n_bumps = 2 if style == "steady" else 3
for _ in range(n_bumps):
mu = rng.uniform(0.9, seconds - 0.9)
sig = rng.uniform(0.35, 0.75) if style == "steady" else rng.uniform(0.20, 0.55)
bumps += np.exp(-0.5 * ((t - mu) / (sig + 1e-9)) ** 2)
y += amp_bump * bumps
noise = _smooth(rng.normal(0, 1, size=T), win=int(fs * 0.04))
y += amp_noise * noise
y = (y - y.mean()) / (y.std() + 1e-9)
y *= amp_scale
Y.append(y)
return t, np.vstack(Y)
def make_frame_discrete(seed: int, n_bins: int, vocab: int, style: str) -> np.ndarray:
"""
style:
- "steady": fewer transitions
- "disturb": more transitions
"""
rng = np.random.default_rng(seed + 111)
ids = np.zeros(n_bins, dtype=int)
p_change = 0.20 if style == "steady" else 0.38
cur = rng.integers(0, vocab)
for i in range(n_bins):
if i == 0 or rng.random() < p_change:
cur = rng.integers(0, vocab)
ids[i] = cur
return ids
def draw_frame(ax_top: plt.Axes, ax_bot: plt.Axes, t: np.ndarray, Y: np.ndarray, ids: np.ndarray, p: Params) -> None:
# Optional cues
x0, x1 = float(t[0]), float(t[-1])
span = x1 - x0
if p.show_regime_window:
rs = x0 + p.regime_start_frac * span
re = x0 + p.regime_end_frac * span
ax_top.axvspan(rs, re, alpha=0.12) # default color
ax_bot.axvspan(rs, re, alpha=0.12)
if p.show_alignment_line:
vx = x0 + p.align_x_frac * span
ax_top.axvline(vx, linestyle="--", linewidth=1.15, alpha=0.7)
ax_bot.axvline(vx, linestyle="--", linewidth=1.15, alpha=0.7)
# Curves
curve_colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#9467bd"]
for i in range(Y.shape[0]):
ax_top.plot(t, Y[i], linewidth=p.curve_lw, color=curve_colors[i % len(curve_colors)])
ymin, ymax = float(Y.min()), float(Y.max())
ypad = 0.10 * (ymax - ymin + 1e-9)
ax_top.set_xlim(x0, x1)
ax_top.set_ylim(ymin - ypad, ymax + ypad)
# Discrete strip
palette = [
"#e41a1c", "#377eb8", "#4daf4a", "#984ea3",
"#ff7f00", "#ffff33", "#a65628", "#f781bf",
]
ax_bot.set_xlim(x0, x1)
ax_bot.set_ylim(0, 1)
n = len(ids)
bin_w = span / n
gap = p.strip_gap_frac * bin_w
y = (1 - p.strip_height) / 2
for i, cat in enumerate(ids):
left = x0 + i * bin_w + gap / 2
width = bin_w - gap
ax_bot.add_patch(
Rectangle((left, y), width, p.strip_height, facecolor=palette[int(cat) % len(palette)], edgecolor="none")
)
# Axes-only style
_axes_only(ax_top, keep_spines=p.show_axes_spines)
_axes_only(ax_bot, keep_spines=p.show_axes_spines)
# ---------- main drawing ----------
def draw_optionB(out_path: Path, p: Params) -> None:
fig = plt.figure(figsize=(p.width_in, p.height_in), dpi=200)
fig.patch.set_alpha(0.0)
# Compute frame layout in figure coordinates
# Each frame has two stacked axes: top curves and bottom strip.
usable_w = 1.0 - p.margin_left - p.margin_right
usable_h = 1.0 - p.margin_bottom - p.margin_top
# Leave gap between frames for arrow
total_gap = p.frame_gap * (p.n_frames - 1)
frame_w = (usable_w - total_gap) / p.n_frames
# Within each frame: vertical split
top_h = usable_h * 0.70
bot_h = usable_h * 0.18
v_gap = usable_h * 0.06
# bottoms
bot_y = p.margin_bottom
top_y = bot_y + bot_h + v_gap
axes_pairs = []
for f in range(p.n_frames):
left = p.margin_left + f * (frame_w + p.frame_gap)
ax_top = fig.add_axes([left, top_y, frame_w, top_h])
ax_bot = fig.add_axes([left, bot_y, frame_w, bot_h], sharex=ax_top)
ax_top.patch.set_alpha(0.0)
ax_bot.patch.set_alpha(0.0)
axes_pairs.append((ax_top, ax_bot))
# Data per frame
styles = ["steady", "disturb"] if p.n_frames == 2 else ["steady"] * (p.n_frames - 1) + ["disturb"]
for idx, ((ax_top, ax_bot), style) in enumerate(zip(axes_pairs, styles)):
t, Y = make_frame_continuous(p.seed + 10 * idx, p.seconds, p.fs, p.n_curves, style=style)
ids = make_frame_discrete(p.seed + 10 * idx, p.n_bins, p.disc_vocab, style=style)
draw_frame(ax_top, ax_bot, t, Y, ids, p)
# Add a visual arrow between frames (in figure coordinates)
if p.n_frames >= 2:
for f in range(p.n_frames - 1):
# center between frame f and f+1
x_left = p.margin_left + f * (frame_w + p.frame_gap) + frame_w
x_right = p.margin_left + (f + 1) * (frame_w + p.frame_gap)
x_mid = (x_left + x_right) / 2
# arrow y in the middle of the frame stack
y_mid = bot_y + (bot_h + v_gap + top_h) / 2
arr = FancyArrowPatch(
(x_mid - 0.015, y_mid),
(x_mid + 0.015, y_mid),
transform=fig.transFigure,
arrowstyle="-|>",
mutation_scale=18,
linewidth=1.6,
color="black",
)
fig.patches.append(arr)
out_path.parent.mkdir(parents=True, exist_ok=True)
fig.savefig(out_path, format="svg", transparent=True, bbox_inches="tight", pad_inches=0.0)
plt.close(fig)
def main() -> None:
ap = argparse.ArgumentParser()
ap.add_argument("--out", type=Path, default=Path("synth_ics_optionB.svg"))
ap.add_argument("--seed", type=int, default=7)
ap.add_argument("--seconds", type=float, default=8.0)
ap.add_argument("--fs", type=int, default=250)
ap.add_argument("--frames", type=int, default=2, choices=[2, 3], help="2 or 3 frames (story strip)")
ap.add_argument("--curves", type=int, default=3)
ap.add_argument("--bins", type=int, default=32)
ap.add_argument("--vocab", type=int, default=8)
ap.add_argument("--no-align", action="store_true")
ap.add_argument("--no-regime", action="store_true")
ap.add_argument("--no-spines", action="store_true")
args = ap.parse_args()
p = Params(
seed=args.seed,
seconds=args.seconds,
fs=args.fs,
n_frames=args.frames,
n_curves=args.curves,
n_bins=args.bins,
disc_vocab=args.vocab,
show_alignment_line=not args.no_align,
show_regime_window=not args.no_regime,
show_axes_spines=not args.no_spines,
)
draw_optionB(args.out, p)
print(f"Wrote: {args.out}")
if __name__ == "__main__":
main()


@@ -0,0 +1,201 @@
#!/usr/bin/env python3
"""
Draw the *Transformer section* lower-half visuals:
- Continuous channels: multiple smooth curves (like the colored trend lines)
- Discrete channels: small colored bars/ticks along the bottom
Output: ONE SVG with transparent background, axes hidden.
Run:
uv run python draw_transformer_lower_half.py --out ./assets/transformer_lower_half.svg
"""
from __future__ import annotations
import argparse
from dataclasses import dataclass
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
@dataclass
class Params:
seed: int = 7
seconds: float = 10.0
fs: int = 300
# Continuous channels
n_curves: int = 3
curve_lw: float = 2.4
# Discrete bars
n_bins: int = 40 # number of discrete bars/ticks across time
bar_height: float = 0.11 # relative height inside bar strip axis
bar_gap: float = 0.08 # gap between bars (fraction of bar width)
# Canvas sizing
width_in: float = 5.8
height_in: float = 1.9
def _smooth(x: np.ndarray, win: int) -> np.ndarray:
win = max(3, int(win) | 1) # odd
k = np.ones(win, dtype=float)
k /= k.sum()
return np.convolve(x, k, mode="same")
def make_continuous_curves(p: Params) -> tuple[np.ndarray, np.ndarray]:
"""
Produce 3 smooth curves with gentle long-term temporal patterning.
Returns:
t: (T,)
Y: (n_curves, T)
"""
rng = np.random.default_rng(p.seed)
T = int(p.seconds * p.fs)
t = np.linspace(0, p.seconds, T, endpoint=False)
Y = []
base_freqs = [0.12, 0.09, 0.15]
mid_freqs = [0.65, 0.85, 0.75]
for i in range(p.n_curves):
f1 = base_freqs[i % len(base_freqs)]
f2 = mid_freqs[i % len(mid_freqs)]
ph = rng.uniform(0, 2 * np.pi)
# Smooth trend + mid wiggle
y = (
1.00 * np.sin(2 * np.pi * f1 * t + ph)
+ 0.35 * np.sin(2 * np.pi * f2 * t + 0.7 * ph)
)
# Add a couple of smooth bumps (like slow pattern changes)
bumps = np.zeros_like(t)
for _ in range(2):
mu = rng.uniform(0.8, p.seconds - 0.8)
sig = rng.uniform(0.35, 0.75)
bumps += np.exp(-0.5 * ((t - mu) / sig) ** 2)
y += 0.55 * bumps
# Mild smooth noise
noise = _smooth(rng.normal(0, 1, size=T), win=int(p.fs * 0.04))
y += 0.12 * noise
# Normalize and compress amplitude to fit nicely
y = (y - y.mean()) / (y.std() + 1e-9)
y *= 0.42
Y.append(y)
return t, np.vstack(Y)
def make_discrete_bars(p: Params) -> np.ndarray:
"""
Generate discrete "token-like" bars across time bins.
Returns:
ids: (n_bins,) integer category ids
"""
rng = np.random.default_rng(p.seed + 123)
n = p.n_bins
# A piecewise-constant sequence with occasional changes (looks like discrete channel)
ids = np.zeros(n, dtype=int)
cur = rng.integers(0, 8)
for i in range(n):
if i == 0 or rng.random() < 0.25:
cur = rng.integers(0, 8)
ids[i] = cur
return ids
def draw_transformer_lower_half_svg(out_path: Path, p: Params) -> None:
# --- Figure + transparent background ---
fig = plt.figure(figsize=(p.width_in, p.height_in), dpi=200)
fig.patch.set_alpha(0.0)
# Two stacked axes: curves (top), bars (bottom)
# Tight, diagram-style layout
ax_curves = fig.add_axes([0.06, 0.28, 0.90, 0.68]) # [left, bottom, width, height]
ax_bars = fig.add_axes([0.06, 0.10, 0.90, 0.14])
ax_curves.patch.set_alpha(0.0)
ax_bars.patch.set_alpha(0.0)
for ax in (ax_curves, ax_bars):
ax.set_axis_off()
# --- Data ---
t, Y = make_continuous_curves(p)
ids = make_discrete_bars(p)
# --- Continuous curves (explicit colors to match the “multi-colored” look) ---
# Feel free to swap these hex colors to match your figure theme.
curve_colors = ["#1f77b4", "#ff7f0e", "#2ca02c"] # blue / orange / green
for i in range(Y.shape[0]):
ax_curves.plot(t, Y[i], linewidth=p.curve_lw, color=curve_colors[i % len(curve_colors)])
# Set curve bounds with padding (keeps it clean)
ymin, ymax = float(Y.min()), float(Y.max())
pad = 0.10 * (ymax - ymin + 1e-9)
ax_curves.set_xlim(t[0], t[-1])
ax_curves.set_ylim(ymin - pad, ymax + pad)
# --- Discrete bars: small colored rectangles along the timeline ---
# A small palette for categories (repeats if more categories appear)
bar_palette = [
"#e41a1c", "#377eb8", "#4daf4a", "#984ea3",
"#ff7f00", "#ffff33", "#a65628", "#f781bf",
]
# Convert bins into time spans
n = len(ids)
x0, x1 = t[0], t[-1]
total = x1 - x0
bin_w = total / n
gap = p.bar_gap * bin_w
# Draw bars in [0,1] y-space inside ax_bars
ax_bars.set_xlim(x0, x1)
ax_bars.set_ylim(0, 1)
for i, cat in enumerate(ids):
left = x0 + i * bin_w + gap / 2
width = bin_w - gap
color = bar_palette[int(cat) % len(bar_palette)]
rect = Rectangle(
(left, (1 - p.bar_height) / 2),
width,
p.bar_height,
facecolor=color,
edgecolor="none",
)
ax_bars.add_patch(rect)
out_path.parent.mkdir(parents=True, exist_ok=True)
fig.savefig(out_path, format="svg", transparent=True, bbox_inches="tight", pad_inches=0.0)
plt.close(fig)
def main() -> None:
ap = argparse.ArgumentParser()
ap.add_argument("--out", type=Path, default=Path("transformer_lower_half.svg"))
ap.add_argument("--seed", type=int, default=7)
ap.add_argument("--seconds", type=float, default=10.0)
ap.add_argument("--fs", type=int, default=300)
ap.add_argument("--bins", type=int, default=40)
args = ap.parse_args()
p = Params(seed=args.seed, seconds=args.seconds, fs=args.fs, n_bins=args.bins)
draw_transformer_lower_half_svg(args.out, p)
print(f"Wrote: {args.out}")
if __name__ == "__main__":
main()


@@ -0,0 +1,202 @@
#!/usr/bin/env python3
"""
Transformer section lower-half visuals WITH AXES ONLY:
- Axes spines visible
- NO numbers (tick labels hidden)
- NO words (axis labels removed)
- Transparent background
- One SVG output
Run:
uv run python draw_transformer_lower_half_axes_only.py --out ./assets/transformer_lower_half_axes_only.svg
"""
from __future__ import annotations
import argparse
from dataclasses import dataclass
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
@dataclass
class Params:
seed: int = 7
seconds: float = 10.0
fs: int = 300
# Continuous channels
n_curves: int = 3
curve_lw: float = 2.4
# Discrete bars
n_bins: int = 40
bar_height: float = 0.55 # fraction of the discrete-axis y-range
bar_gap: float = 0.08 # fraction of bar width
# Figure size
width_in: float = 6.6
height_in: float = 2.6
def _smooth(x: np.ndarray, win: int) -> np.ndarray:
win = max(3, int(win) | 1) # odd
k = np.ones(win, dtype=float)
k /= k.sum()
return np.convolve(x, k, mode="same")
def make_continuous_curves(p: Params) -> tuple[np.ndarray, np.ndarray]:
rng = np.random.default_rng(p.seed)
T = int(p.seconds * p.fs)
t = np.linspace(0, p.seconds, T, endpoint=False)
Y = []
base_freqs = [0.12, 0.09, 0.15]
mid_freqs = [0.65, 0.85, 0.75]
for i in range(p.n_curves):
f1 = base_freqs[i % len(base_freqs)]
f2 = mid_freqs[i % len(mid_freqs)]
ph = rng.uniform(0, 2 * np.pi)
y = (
1.00 * np.sin(2 * np.pi * f1 * t + ph)
+ 0.35 * np.sin(2 * np.pi * f2 * t + 0.7 * ph)
)
bumps = np.zeros_like(t)
for _ in range(2):
mu = rng.uniform(0.8, p.seconds - 0.8)
sig = rng.uniform(0.35, 0.75)
bumps += np.exp(-0.5 * ((t - mu) / sig) ** 2)
y += 0.55 * bumps
noise = _smooth(rng.normal(0, 1, size=T), win=int(p.fs * 0.04))
y += 0.12 * noise
y = (y - y.mean()) / (y.std() + 1e-9)
y *= 0.42
Y.append(y)
return t, np.vstack(Y)
def make_discrete_bars(p: Params) -> np.ndarray:
rng = np.random.default_rng(p.seed + 123)
n = p.n_bins
ids = np.zeros(n, dtype=int)
cur = rng.integers(0, 8)
for i in range(n):
if i == 0 or rng.random() < 0.25:
cur = rng.integers(0, 8)
ids[i] = cur
return ids
def _axes_only(ax: plt.Axes) -> None:
"""Keep spines (axes lines), remove all ticks/labels/words."""
# No labels
ax.set_xlabel("")
ax.set_ylabel("")
ax.set_title("")
# Keep spines as the only axes element
for side in ("top", "right", "bottom", "left"):
ax.spines[side].set_visible(True)
# Remove tick marks and tick labels entirely
ax.set_xticks([])
ax.set_yticks([])
ax.tick_params(
axis="both",
which="both",
bottom=False,
left=False,
top=False,
right=False,
labelbottom=False,
labelleft=False,
)
# No grid
ax.grid(False)
def draw_transformer_lower_half_svg(out_path: Path, p: Params) -> None:
fig = plt.figure(figsize=(p.width_in, p.height_in), dpi=200)
fig.patch.set_alpha(0.0)
# Two axes sharing x (top curves, bottom bars)
ax_curves = fig.add_axes([0.10, 0.38, 0.86, 0.56])
ax_bars = fig.add_axes([0.10, 0.14, 0.86, 0.18], sharex=ax_curves)
ax_curves.patch.set_alpha(0.0)
ax_bars.patch.set_alpha(0.0)
# Data
t, Y = make_continuous_curves(p)
ids = make_discrete_bars(p)
# Top: continuous curves
curve_colors = ["#1f77b4", "#ff7f0e", "#2ca02c"] # blue / orange / green
for i in range(Y.shape[0]):
ax_curves.plot(t, Y[i], linewidth=p.curve_lw, color=curve_colors[i % len(curve_colors)])
ymin, ymax = float(Y.min()), float(Y.max())
ypad = 0.10 * (ymax - ymin + 1e-9)
ax_curves.set_xlim(t[0], t[-1])
ax_curves.set_ylim(ymin - ypad, ymax + ypad)
# Bottom: discrete bars (colored strip)
bar_palette = [
"#e41a1c", "#377eb8", "#4daf4a", "#984ea3",
"#ff7f00", "#ffff33", "#a65628", "#f781bf",
]
x0, x1 = t[0], t[-1]
total = x1 - x0
n = len(ids)
bin_w = total / n
gap = p.bar_gap * bin_w
ax_bars.set_xlim(x0, x1)
ax_bars.set_ylim(0, 1)
bar_y = (1 - p.bar_height) / 2
for i, cat in enumerate(ids):
left = x0 + i * bin_w + gap / 2
width = bin_w - gap
color = bar_palette[int(cat) % len(bar_palette)]
ax_bars.add_patch(Rectangle((left, bar_y), width, p.bar_height, facecolor=color, edgecolor="none"))
# Apply "axes only" styling (no numbers/words)
_axes_only(ax_curves)
_axes_only(ax_bars)
out_path.parent.mkdir(parents=True, exist_ok=True)
fig.savefig(out_path, format="svg", transparent=True, bbox_inches="tight", pad_inches=0.0)
plt.close(fig)
def main() -> None:
ap = argparse.ArgumentParser()
ap.add_argument("--out", type=Path, default=Path("transformer_lower_half_axes_only.svg"))
ap.add_argument("--seed", type=int, default=7)
ap.add_argument("--seconds", type=float, default=10.0)
ap.add_argument("--fs", type=int, default=300)
ap.add_argument("--bins", type=int, default=40)
ap.add_argument("--curves", type=int, default=3)
args = ap.parse_args()
p = Params(seed=args.seed, seconds=args.seconds, fs=args.fs, n_bins=args.bins, n_curves=args.curves)
draw_transformer_lower_half_svg(args.out, p)
print(f"Wrote: {args.out}")
if __name__ == "__main__":
main()


@@ -0,0 +1,161 @@
#!/usr/bin/env python3
"""
Generate "Noisy Residual" and "Denoised Residual" curves as SVGs.
- Produces TWO separate SVG files:
noisy_residual.svg
denoised_residual.svg
- Curves are synthetic but shaped like residual noise + denoised residual.
- Uses only matplotlib + numpy.
"""
from __future__ import annotations
import argparse
from dataclasses import dataclass
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
@dataclass
class CurveParams:
seconds: float = 12.0 # length of the signal
fs: int = 250 # samples per second
seed: int = 7 # RNG seed for reproducibility
base_amp: float = 0.12 # smooth baseline amplitude
noise_amp: float = 0.55 # high-frequency noise amplitude
burst_amp: float = 1.2 # occasional spike amplitude
burst_rate_hz: float = 0.35 # average spike frequency
denoise_smooth_ms: float = 120 # smoothing window for "denoised" (ms)
def gaussian_smooth(x: np.ndarray, sigma_samples: float) -> np.ndarray:
"""Gaussian smoothing using explicit kernel convolution (no SciPy dependency)."""
if sigma_samples <= 0:
return x.copy()
radius = int(np.ceil(4 * sigma_samples))
k = np.arange(-radius, radius + 1, dtype=float)
kernel = np.exp(-(k**2) / (2 * sigma_samples**2))
kernel /= kernel.sum()
return np.convolve(x, kernel, mode="same")
def make_residual(params: CurveParams) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
"""
Create synthetic residual:
- baseline: smooth wavy trend + slight drift
- noise: band-limited-ish high-frequency noise
- bursts: sparse spikes / impulse-like events
Returns: (t, noisy, denoised)
"""
rng = np.random.default_rng(params.seed)
n = int(params.seconds * params.fs)
t = np.linspace(0, params.seconds, n, endpoint=False)
# Smooth baseline (small): combination of sinusoids + small random drift
baseline = (
0.7 * np.sin(2 * np.pi * 0.35 * t + 0.2)
+ 0.35 * np.sin(2 * np.pi * 0.9 * t + 1.2)
+ 0.25 * np.sin(2 * np.pi * 0.15 * t + 2.0)
)
baseline *= params.base_amp
drift = np.cumsum(rng.normal(0, 1, size=n))
drift = drift / (np.max(np.abs(drift)) + 1e-9) * (params.base_amp * 0.25)
baseline = baseline + drift
# High-frequency noise: whitened then lightly smoothed to look "oscillatory"
raw = rng.normal(0, 1, size=n)
hf = raw - gaussian_smooth(raw, sigma_samples=params.fs * 0.03) # remove slow part
hf = hf / (np.std(hf) + 1e-9)
hf *= params.noise_amp
# Bursts/spikes: Poisson process impulses convolved with short kernel
expected_bursts = params.burst_rate_hz * params.seconds
k_bursts = rng.poisson(expected_bursts)
impulses = np.zeros(n)
if k_bursts > 0:
idx = rng.integers(0, n, size=k_bursts)
impulses[idx] = rng.normal(loc=1.0, scale=0.4, size=k_bursts)
# Shape impulses into spikes (asymmetric bump)
spike_kernel_len = int(params.fs * 0.06) # ~60ms
spike_kernel_len = max(spike_kernel_len, 7)
spike_t = np.arange(spike_kernel_len)
spike_kernel = np.exp(-spike_t / (params.fs * 0.012)) # fast decay
spike_kernel *= np.hanning(spike_kernel_len) # taper
spike_kernel /= (spike_kernel.max() + 1e-9)
bursts = np.convolve(impulses, spike_kernel, mode="same")
bursts *= params.burst_amp
noisy = baseline + hf + bursts
# "Denoised": remove high-frequency using Gaussian smoothing,
# but keep spike structures partially.
smooth_sigma = (params.denoise_smooth_ms / 1000.0) * params.fs / 3.0
denoised = gaussian_smooth(noisy, sigma_samples=smooth_sigma)
return t, noisy, denoised
def save_curve_svg(
t: np.ndarray,
y: np.ndarray,
out_path: Path,
*,
width_in: float = 5.4,
height_in: float = 1.6,
lw: float = 2.2,
pad: float = 0.03,
) -> None:
"""
Save a clean, figure-only SVG suitable for embedding in diagrams.
- No axes, ticks, labels.
- Tight bounding box.
"""
fig = plt.figure(figsize=(width_in, height_in), dpi=200)
ax = fig.add_axes([pad, pad, 1 - 2 * pad, 1 - 2 * pad])
ax.plot(t, y, linewidth=lw)
# Make it "icon-like" for diagrams: no axes or frames
ax.set_axis_off()
# Ensure bounds include a little padding
ymin, ymax = np.min(y), np.max(y)
ypad = 0.08 * (ymax - ymin + 1e-9)
ax.set_xlim(t[0], t[-1])
ax.set_ylim(ymin - ypad, ymax + ypad)
out_path.parent.mkdir(parents=True, exist_ok=True)
fig.savefig(out_path, format="svg", bbox_inches="tight", pad_inches=0.0)
plt.close(fig)
def main() -> None:
ap = argparse.ArgumentParser()
ap.add_argument("--outdir", type=Path, default=Path("."), help="Output directory")
ap.add_argument("--seed", type=int, default=7, help="RNG seed")
ap.add_argument("--seconds", type=float, default=12.0, help="Signal length (s)")
ap.add_argument("--fs", type=int, default=250, help="Sampling rate (Hz)")
ap.add_argument("--prefix", type=str, default="", help="Filename prefix (optional)")
args = ap.parse_args()
params = CurveParams(seconds=args.seconds, fs=args.fs, seed=args.seed)
t, noisy, denoised = make_residual(params)
noisy_path = args.outdir / f"{args.prefix}noisy_residual.svg"
den_path = args.outdir / f"{args.prefix}denoised_residual.svg"
save_curve_svg(t, noisy, noisy_path)
save_curve_svg(t, denoised, den_path)
print(f"Wrote:\n {noisy_path}\n {den_path}")
if __name__ == "__main__":
main()


@@ -0,0 +1,188 @@
#!/usr/bin/env python3
"""
DDPM-like residual curve SVGs (separate files, fixed colors):
- noisy_residual.svg (blue)
- denoised_residual.svg (purple)
"""
from __future__ import annotations
import argparse
from dataclasses import dataclass
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
@dataclass
class DDPMStyleParams:
seconds: float = 12.0
fs: int = 250
seed: int = 7
baseline_amp: float = 0.10
mid_wiggle_amp: float = 0.18
colored_noise_amp: float = 0.65
colored_alpha: float = 1.0
burst_rate_hz: float = 0.30
burst_amp: float = 0.9
burst_width_ms: float = 55
denoise_sigmas_ms: tuple[float, ...] = (25, 60, 140)
denoise_weights: tuple[float, ...] = (0.25, 0.35, 0.40)
denoise_texture_keep: float = 0.10
def gaussian_smooth(x: np.ndarray, sigma_samples: float) -> np.ndarray:
if sigma_samples <= 0:
return x.copy()
radius = int(np.ceil(4 * sigma_samples))
k = np.arange(-radius, radius + 1, dtype=float)
kernel = np.exp(-(k**2) / (2 * sigma_samples**2))
kernel /= kernel.sum()
return np.convolve(x, kernel, mode="same")
def colored_noise_1_f(n: int, rng: np.random.Generator, alpha: float) -> np.ndarray:
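# Shape white noise in the frequency domain: scaling the rFFT coefficients by
# 1/f**(alpha/2) yields a power spectrum proportional to 1/f**alpha
# (pink noise for alpha = 1); the result is then zero-meaned and unit-scaled.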
white = rng.normal(0, 1, size=n)
spec = np.fft.rfft(white)
freqs = np.fft.rfftfreq(n, d=1.0)
scale = np.ones_like(freqs)
nonzero = freqs > 0
scale[nonzero] = 1.0 / (freqs[nonzero] ** (alpha / 2.0))
spec *= scale
x = np.fft.irfft(spec, n=n)
x = x - np.mean(x)
x = x / (np.std(x) + 1e-9)
return x
def make_ddpm_like_residual(p: DDPMStyleParams) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
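# Recipe: slow sinusoidal baseline + mid-frequency wiggle + 1/f colored noise
# + Poisson-timed bursts; the "denoised" curve blends Gaussian smooths at several
# widths and keeps a small amount of high-frequency texture.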
rng = np.random.default_rng(p.seed)
n = int(p.seconds * p.fs)
t = np.linspace(0, p.seconds, n, endpoint=False)
baseline = (
0.8 * np.sin(2 * np.pi * 0.18 * t + 0.4)
+ 0.35 * np.sin(2 * np.pi * 0.06 * t + 2.2)
) * p.baseline_amp
mid = (
0.9 * np.sin(2 * np.pi * 0.9 * t + 1.1)
+ 0.5 * np.sin(2 * np.pi * 1.6 * t + 0.2)
+ 0.3 * np.sin(2 * np.pi * 2.4 * t + 2.6)
) * p.mid_wiggle_amp
col = colored_noise_1_f(n, rng, alpha=p.colored_alpha) * p.colored_noise_amp
expected = p.burst_rate_hz * p.seconds
k = rng.poisson(expected)
impulses = np.zeros(n)
if k > 0:
idx = rng.integers(0, n, size=k)
impulses[idx] = rng.normal(loc=1.0, scale=0.35, size=k)
width = max(int(p.fs * (p.burst_width_ms / 1000.0)), 7)
u = np.arange(width)
kernel = np.exp(-u / (p.fs * 0.012)) * np.hanning(width)
kernel /= (kernel.max() + 1e-9)
bursts = np.convolve(impulses, kernel, mode="same") * p.burst_amp
noisy = baseline + mid + col + bursts
sigmas_samples = [(ms / 1000.0) * p.fs / 3.0 for ms in p.denoise_sigmas_ms]
smooths = [gaussian_smooth(noisy, s) for s in sigmas_samples]
den_base = np.zeros_like(noisy)
for w, sm in zip(p.denoise_weights, smooths):
den_base += w * sm
hf = noisy - gaussian_smooth(noisy, sigma_samples=p.fs * 0.03)
denoised = den_base + p.denoise_texture_keep * (hf / (np.std(hf) + 1e-9)) * (0.10 * np.std(den_base))
return t, noisy, denoised
def save_single_curve_svg(
t: np.ndarray,
y: np.ndarray,
out_path: Path,
*,
color: str,
lw: float = 2.2,
) -> None:
fig = plt.figure(figsize=(5.4, 1.6), dpi=200)
# Make figure background transparent
fig.patch.set_alpha(0.0)
ax = fig.add_axes([0.03, 0.03, 0.94, 0.94])
# Make axes background transparent
ax.patch.set_alpha(0.0)
ax.plot(t, y, linewidth=lw, color=color)
# clean, diagram-friendly
ax.set_axis_off()
ymin, ymax = np.min(y), np.max(y)
ypad = 0.08 * (ymax - ymin + 1e-9)
ax.set_xlim(t[0], t[-1])
ax.set_ylim(ymin - ypad, ymax + ypad)
out_path.parent.mkdir(parents=True, exist_ok=True)
fig.savefig(
out_path,
format="svg",
bbox_inches="tight",
pad_inches=0.0,
transparent=True, # <-- key for transparent output
)
plt.close(fig)
def main() -> None:
ap = argparse.ArgumentParser()
ap.add_argument("--outdir", type=Path, default=Path("."))
ap.add_argument("--seed", type=int, default=7)
ap.add_argument("--seconds", type=float, default=12.0)
ap.add_argument("--fs", type=int, default=250)
ap.add_argument("--alpha", type=float, default=1.0)
ap.add_argument("--noise-amp", type=float, default=0.65)
ap.add_argument("--texture-keep", type=float, default=0.10)
ap.add_argument("--prefix", type=str, default="")
args = ap.parse_args()
p = DDPMStyleParams(
seconds=args.seconds,
fs=args.fs,
seed=args.seed,
colored_alpha=args.alpha,
colored_noise_amp=args.noise_amp,
denoise_texture_keep=args.texture_keep,
)
t, noisy, den = make_ddpm_like_residual(p)
outdir = args.outdir
noisy_path = outdir / f"{args.prefix}noisy_residual.svg"
den_path = outdir / f"{args.prefix}denoised_residual.svg"
# Fixed colors as you requested
save_single_curve_svg(t, noisy, noisy_path, color="blue")
save_single_curve_svg(t, den, den_path, color="purple")
print("Wrote:")
print(f" {noisy_path}")
print(f" {den_path}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,10 @@
[project]
name = "fig-gen-ddpm"
version = "0.1.0"
description = "Python scripts for generating the figures used in the Mask-DDPM paper"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
"numpy>=1.26",
"matplotlib>=3.8",
]

View File

@@ -0,0 +1,240 @@
#!/usr/bin/env python3
"""
3D "final combined outcome" (time × channel × value) with:
- NO numbers on axes (tick labels removed)
- Axis *titles* kept (texts are okay)
- Reduced whitespace: tight bbox + minimal margins
- White background (non-transparent) suitable for embedding into another SVG
Output:
default PNG, optional SVG (2D projected vectors)
Run:
uv run python synth_ics_3d_waterfall_tight.py --out ./assets/synth_ics_3d.png
uv run python synth_ics_3d_waterfall_tight.py --out ./assets/synth_ics_3d.svg --format svg
"""
from __future__ import annotations
import argparse
from dataclasses import dataclass
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
@dataclass
class Params:
seed: int = 7
seconds: float = 10.0
fs: int = 220
n_cont: int = 5
n_disc: int = 2
disc_vocab: int = 8
disc_change_rate_hz: float = 1.1
# view
elev: float = 25.0
azim: float = -58.0
# figure size (smaller, more "cube-like")
fig_w: float = 5.4
fig_h: float = 5.0
# discrete rendering
disc_z_scale: float = 0.45
disc_z_offset: float = -1.4
# margins (figure fraction)
left: float = 0.03
right: float = 0.99
bottom: float = 0.03
top: float = 0.99
def _smooth(x: np.ndarray, win: int) -> np.ndarray:
win = max(3, int(win) | 1)
k = np.ones(win, dtype=float)
k /= k.sum()
return np.convolve(x, k, mode="same")
def make_continuous(p: Params) -> tuple[np.ndarray, np.ndarray]:
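# Each continuous channel: a pair of low/mid-frequency sinusoids + a few Gaussian
# bumps + lightly smoothed noise, z-scored per channel.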
rng = np.random.default_rng(p.seed)
T = int(p.seconds * p.fs)
t = np.linspace(0, p.seconds, T, endpoint=False)
Y = []
base_freqs = [0.08, 0.10, 0.12, 0.09, 0.11]
mid_freqs = [0.55, 0.70, 0.85, 0.62, 0.78]
for i in range(p.n_cont):
f1 = base_freqs[i % len(base_freqs)]
f2 = mid_freqs[i % len(mid_freqs)]
ph = rng.uniform(0, 2 * np.pi)
y = (
0.95 * np.sin(2 * np.pi * f1 * t + ph)
+ 0.28 * np.sin(2 * np.pi * f2 * t + 0.65 * ph)
)
bumps = np.zeros_like(t)
for _ in range(rng.integers(2, 4)):
mu = rng.uniform(0.8, p.seconds - 0.8)
sig = rng.uniform(0.25, 0.80)
bumps += np.exp(-0.5 * ((t - mu) / (sig + 1e-9)) ** 2)
y += 0.55 * bumps
noise = _smooth(rng.normal(0, 1, size=T), win=int(p.fs * 0.05))
y += 0.10 * noise
y = (y - y.mean()) / (y.std() + 1e-9)
Y.append(y)
return t, np.vstack(Y) # (n_cont, T)
def make_discrete(p: Params, t: np.ndarray) -> np.ndarray:
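# Piecewise-constant channels: draw Poisson-many change points, then hold a random
# vocabulary id on each segment (re-drawn with probability 0.85 at each boundary
# after the first).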
rng = np.random.default_rng(p.seed + 123)
T = len(t)
expected_changes = max(1, int(p.seconds * p.disc_change_rate_hz))
X = np.zeros((p.n_disc, T), dtype=int)
for c in range(p.n_disc):
k = rng.poisson(expected_changes) + 1
pts = np.unique(rng.integers(0, T, size=k))
pts = np.sort(np.concatenate([[0], pts, [T]]))
cur = rng.integers(0, p.disc_vocab)
for a, b in zip(pts[:-1], pts[1:]):
if a != 0 and rng.random() < 0.85:
cur = rng.integers(0, p.disc_vocab)
X[c, a:b] = cur
return X
def style_3d_axes(ax):
# Make panes white but less visually heavy
try:
# Keep pane fill ON (white background) but reduce edge prominence
ax.xaxis.pane.set_edgecolor("0.7")
ax.yaxis.pane.set_edgecolor("0.7")
ax.zaxis.pane.set_edgecolor("0.7")
except Exception:
pass
ax.grid(True, linewidth=0.4, alpha=0.30)
def remove_tick_numbers_keep_axis_titles(ax):
# Remove tick labels (numbers) and tick marks, keep axis titles
ax.set_xticklabels([])
ax.set_yticklabels([])
ax.set_zticklabels([])
ax.tick_params(
axis="both",
which="both",
length=0, # no tick marks
pad=0,
)
# 3D has separate tick_params for z on some versions; this still works broadly:
try:
ax.zaxis.set_tick_params(length=0, pad=0)
except Exception:
pass
def main() -> None:
ap = argparse.ArgumentParser()
ap.add_argument("--out", type=Path, default=Path("synth_ics_3d.png"))
ap.add_argument("--format", choices=["png", "svg"], default="png")
ap.add_argument("--seed", type=int, default=7)
ap.add_argument("--seconds", type=float, default=10.0)
ap.add_argument("--fs", type=int, default=220)
ap.add_argument("--n-cont", type=int, default=5)
ap.add_argument("--n-disc", type=int, default=2)
ap.add_argument("--disc-vocab", type=int, default=8)
ap.add_argument("--disc-rate", type=float, default=1.1)
ap.add_argument("--elev", type=float, default=25.0)
ap.add_argument("--azim", type=float, default=-58.0)
ap.add_argument("--fig-w", type=float, default=5.4)
ap.add_argument("--fig-h", type=float, default=5.0)
ap.add_argument("--disc-z-scale", type=float, default=0.45)
ap.add_argument("--disc-z-offset", type=float, default=-1.4)
args = ap.parse_args()
p = Params(
seed=args.seed,
seconds=args.seconds,
fs=args.fs,
n_cont=args.n_cont,
n_disc=args.n_disc,
disc_vocab=args.disc_vocab,
disc_change_rate_hz=args.disc_rate,
elev=args.elev,
azim=args.azim,
fig_w=args.fig_w,
fig_h=args.fig_h,
disc_z_scale=args.disc_z_scale,
disc_z_offset=args.disc_z_offset,
)
t, Yc = make_continuous(p)
Xd = make_discrete(p, t)
fig = plt.figure(figsize=(p.fig_w, p.fig_h), dpi=220, facecolor="white")
ax = fig.add_subplot(111, projection="3d")
style_3d_axes(ax)
# Reduce whitespace around axes (tight placement)
fig.subplots_adjust(left=p.left, right=p.right, bottom=p.bottom, top=p.top)
# Draw continuous channels
for i in range(p.n_cont):
y = np.full_like(t, fill_value=i, dtype=float)
z = Yc[i]
ax.plot(t, y, z, linewidth=2.0)
# Draw discrete channels as steps
for j in range(p.n_disc):
ch = p.n_cont + j
y = np.full_like(t, fill_value=ch, dtype=float)
z = p.disc_z_offset + p.disc_z_scale * Xd[j].astype(float)
ax.step(t, y, z, where="post", linewidth=2.2)
# Axis titles kept
ax.set_xlabel("time")
ax.set_ylabel("channel")
ax.set_zlabel("value")
# Remove numeric tick labels + tick marks
remove_tick_numbers_keep_axis_titles(ax)
# Camera
ax.view_init(elev=p.elev, azim=p.azim)
# Save tightly (minimize white border)
args.out.parent.mkdir(parents=True, exist_ok=True)
save_kwargs = dict(bbox_inches="tight", pad_inches=0.03, facecolor="white")
if args.format == "svg" or args.out.suffix.lower() == ".svg":
fig.savefig(args.out, format="svg", **save_kwargs)
else:
fig.savefig(args.out, format="png", **save_kwargs)
plt.close(fig)
print(f"Wrote: {args.out}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,262 @@
#!/usr/bin/env python3
"""
Transformer-ish "trend" visuals with NO equations:
- attention_weights.svg : heatmap-like attention map (looks like "Transformer attends to positions")
- token_activation_trends.svg: multiple token-channel curves (continuous trends)
- discrete_tokens.svg : step-like discrete channel trends (optional)
All SVGs have transparent background and no axes (diagram-friendly).
"""
from __future__ import annotations
import argparse
from dataclasses import dataclass
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
# ----------------------------
# Synthetic data generators
# ----------------------------
@dataclass
class Params:
seed: int = 7
T: int = 24 # sequence length (positions)
n_heads: int = 4 # attention heads to blend/choose
n_curves: int = 7 # curves in token_activation_trends
seconds: float = 10.0
fs: int = 200
def _gaussian(x: np.ndarray, mu: float, sig: float) -> np.ndarray:
return np.exp(-0.5 * ((x - mu) / (sig + 1e-9)) ** 2)
def make_attention_map(T: int, rng: np.random.Generator, mode: str) -> np.ndarray:
"""
Create a transformer-like attention weight matrix A (T x T) with different visual styles:
- "local": mostly near-diagonal attention
- "global": some global tokens attend broadly
- "causal": lower-triangular (decoder-like) with local preference
"""
i = np.arange(T)[:, None] # query positions
j = np.arange(T)[None, :] # key positions
if mode == "local":
logits = -((i - j) ** 2) / (2 * (2.2 ** 2))
logits += 0.15 * rng.normal(size=(T, T))
elif mode == "global":
logits = -((i - j) ** 2) / (2 * (3.0 ** 2))
# Add a few "global" key positions that many queries attend to
globals_ = rng.choice(T, size=max(2, T // 10), replace=False)
for g in globals_:
logits += 1.2 * _gaussian(j, mu=g, sig=1.0)
logits += 0.12 * rng.normal(size=(T, T))
elif mode == "causal":
logits = -((i - j) ** 2) / (2 * (2.0 ** 2))
logits += 0.12 * rng.normal(size=(T, T))
logits = np.where(j <= i, logits, -1e9) # mask future
else:
raise ValueError(f"Unknown attention mode: {mode}")
# softmax rows
logits = logits - np.max(logits, axis=1, keepdims=True)
A = np.exp(logits)
A /= (np.sum(A, axis=1, keepdims=True) + 1e-9)
return A
def make_token_activation_trends(p: Params) -> tuple[np.ndarray, np.ndarray]:
"""
Multiple smooth curves that feel like "representations evolving across layers/time".
Returns:
t: (N,)
Y: (n_curves, N)
"""
rng = np.random.default_rng(p.seed)
N = int(p.seconds * p.fs)
t = np.linspace(0, p.seconds, N, endpoint=False)
Y = []
for k in range(p.n_curves):
# Multi-scale smooth components + some bursty response
f1 = 0.10 + 0.04 * k
f2 = 0.60 + 0.18 * (k % 3)
phase = rng.uniform(0, 2 * np.pi)
base = 0.9 * np.sin(2 * np.pi * f1 * t + phase) + 0.35 * np.sin(2 * np.pi * f2 * t + 0.7 * phase)
# "attention-like gating": a few bumps where the curve spikes smoothly
bumps = np.zeros_like(t)
for _ in range(rng.integers(2, 5)):
mu = rng.uniform(0.5, p.seconds - 0.5)
sig = rng.uniform(0.15, 0.55)
bumps += 0.9 * _gaussian(t, mu=mu, sig=sig)
noise = rng.normal(0, 1, size=N)
noise = np.convolve(noise, np.ones(11) / 11.0, mode="same") # smooth noise
y = base + 0.85 * bumps + 0.12 * noise
# normalize and vertically offset
y = (y - y.mean()) / (y.std() + 1e-9)
y = 0.75 * y + 0.18 * k
Y.append(y)
return t, np.vstack(Y)
def make_discrete_trends(p: Params, vocab: int = 9, change_rate_hz: float = 1.3) -> tuple[np.ndarray, np.ndarray]:
"""
Discrete step-like channels: useful if you want a "token-id / discrete feature" feel.
Returns:
t: (N,)
X: (n_curves, N) integers
"""
rng = np.random.default_rng(p.seed + 123)
N = int(p.seconds * p.fs)
t = np.linspace(0, p.seconds, N, endpoint=False)
expected = max(1, int(p.seconds * change_rate_hz))
X = np.zeros((p.n_curves, N), dtype=int)
for c in range(p.n_curves):
k = rng.poisson(expected) + 1
pts = np.unique(rng.integers(0, N, size=k))
pts = np.sort(np.concatenate([[0], pts, [N]]))
cur = rng.integers(0, vocab)
for a, b in zip(pts[:-1], pts[1:]):
if a != 0 and rng.random() < 0.9:
cur = rng.integers(0, vocab)
X[c, a:b] = cur
return t, X
# ----------------------------
# Plot helpers (SVG, transparent, axes-free)
# ----------------------------
def _transparent_fig_ax(width_in: float, height_in: float):
fig = plt.figure(figsize=(width_in, height_in), dpi=200)
fig.patch.set_alpha(0.0)
ax = fig.add_axes([0.03, 0.03, 0.94, 0.94])
ax.patch.set_alpha(0.0)
ax.set_axis_off()
return fig, ax
def save_attention_svg(A: np.ndarray, out: Path, *, show_colorbar: bool = False) -> None:
fig, ax = _transparent_fig_ax(4.2, 4.2)
# Using default colormap (no explicit color specification)
im = ax.imshow(A, aspect="equal", interpolation="nearest")
if show_colorbar:
# colorbar can be useful, but it adds clutter in diagrams
cax = fig.add_axes([0.92, 0.10, 0.03, 0.80])
cb = fig.colorbar(im, cax=cax)
cb.outline.set_linewidth(1.0)
out.parent.mkdir(parents=True, exist_ok=True)
fig.savefig(out, format="svg", bbox_inches="tight", pad_inches=0.0, transparent=True)
plt.close(fig)
def save_multi_curve_svg(t: np.ndarray, Y: np.ndarray, out: Path, *, lw: float = 2.0) -> None:
fig, ax = _transparent_fig_ax(6.0, 2.2)
for i in range(Y.shape[0]):
ax.plot(t, Y[i], linewidth=lw)
y_all = Y.reshape(-1)
ymin, ymax = float(np.min(y_all)), float(np.max(y_all))
ypad = 0.08 * (ymax - ymin + 1e-9)
ax.set_xlim(t[0], t[-1])
ax.set_ylim(ymin - ypad, ymax + ypad)
out.parent.mkdir(parents=True, exist_ok=True)
fig.savefig(out, format="svg", bbox_inches="tight", pad_inches=0.0, transparent=True)
plt.close(fig)
def save_discrete_svg(t: np.ndarray, X: np.ndarray, out: Path, *, lw: float = 2.0, spacing: float = 1.25) -> None:
fig, ax = _transparent_fig_ax(6.0, 2.2)
for i in range(X.shape[0]):
y = X[i].astype(float) + i * spacing
ax.step(t, y, where="post", linewidth=lw)
y_all = (X.astype(float) + np.arange(X.shape[0])[:, None] * spacing).reshape(-1)
ymin, ymax = float(np.min(y_all)), float(np.max(y_all))
ypad = 0.10 * (ymax - ymin + 1e-9)
ax.set_xlim(t[0], t[-1])
ax.set_ylim(ymin - ypad, ymax + ypad)
out.parent.mkdir(parents=True, exist_ok=True)
fig.savefig(out, format="svg", bbox_inches="tight", pad_inches=0.0, transparent=True)
plt.close(fig)
# ----------------------------
# CLI
# ----------------------------
def main() -> None:
ap = argparse.ArgumentParser()
ap.add_argument("--outdir", type=Path, default=Path("out"))
ap.add_argument("--seed", type=int, default=7)
# attention
ap.add_argument("--T", type=int, default=24)
ap.add_argument("--attn-mode", type=str, default="local", choices=["local", "global", "causal"])
ap.add_argument("--colorbar", action="store_true")
# curves
ap.add_argument("--seconds", type=float, default=10.0)
ap.add_argument("--fs", type=int, default=200)
ap.add_argument("--n-curves", type=int, default=7)
# discrete optional
ap.add_argument("--with-discrete", action="store_true")
ap.add_argument("--disc-vocab", type=int, default=9)
ap.add_argument("--disc-rate", type=float, default=1.3)
args = ap.parse_args()
p = Params(
seed=args.seed,
T=args.T,
n_curves=args.n_curves,
seconds=args.seconds,
fs=args.fs,
)
rng = np.random.default_rng(args.seed)
# 1) attention map
A = make_attention_map(args.T, rng, mode=args.attn_mode)
save_attention_svg(A, args.outdir / "attention_weights.svg", show_colorbar=args.colorbar)
# 2) continuous trends
t, Y = make_token_activation_trends(p)
save_multi_curve_svg(t, Y, args.outdir / "token_activation_trends.svg")
# 3) discrete trends (optional)
if args.with_discrete:
td, X = make_discrete_trends(p, vocab=args.disc_vocab, change_rate_hz=args.disc_rate)
save_discrete_svg(td, X, args.outdir / "discrete_tokens.svg")
print("Wrote:")
print(f" {args.outdir / 'attention_weights.svg'}")
print(f" {args.outdir / 'token_activation_trends.svg'}")
if args.with_discrete:
print(f" {args.outdir / 'discrete_tokens.svg'}")
if __name__ == "__main__":
main()

302
arxiv-style/main-ieee.tex Normal file
View File

@@ -0,0 +1,302 @@
\documentclass[conference]{IEEEtran}
\IEEEoverridecommandlockouts
\usepackage{cite}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{graphicx}
\usepackage{url}
\usepackage{textcomp}
\usepackage{xcolor}
\usepackage{booktabs}
\usepackage{bm}
\title{Mask-DDPM: Transformer-Conditioned Mixed-Type Diffusion for Semantically Valid ICS Telemetry Synthesis}
% If the date is not needed, uncomment the line below
\author{
\IEEEauthorblockN{Zhenglan Chen}
\IEEEauthorblockA{\textit{Aberdeen Institute of Data Science and Artificial Intelligence} \\
\textit{South China Normal University}\\
Guangzhou, China \\
email address or ORCID}
\and
\IEEEauthorblockN{Mingzhe Yang}
\IEEEauthorblockA{\textit{Aberdeen Institute of Data Science and Artificial Intelligence} \\
\textit{South China Normal University}\\
Guangzhou, China \\
email address or ORCID}
\and
\IEEEauthorblockN{Hongyu Yan}
\IEEEauthorblockA{\textit{Aberdeen Institute of Data Science and Artificial Intelligence} \\
\textit{South China Normal University}\\
Guangzhou, China \\
email address or ORCID}
\and
\IEEEauthorblockN{Huan Yang}
\IEEEauthorblockA{\textit{dept. name of organization (of Aff.)} \\
\textit{South China Normal University}\\
Guangzhou, China \\
email address or ORCID}
}
% Header settings
%%% PDF metadata
\begin{document}
\maketitle
\begin{abstract}
Industrial control systems (ICS) security research is increasingly constrained by the scarcity and non-shareability of realistic traffic and telemetry, especially for attack scenarios. To mitigate this bottleneck, we study synthetic generation at the protocol feature/telemetry level, where samples must simultaneously preserve temporal coherence, match continuous marginal distributions, and keep discrete supervisory variables strictly within valid vocabularies. We propose Mask-DDPM, a hybrid framework tailored to mixed-type, multi-scale ICS sequences. Mask-DDPM factorizes generation into (i) a causal Transformer trend module that rolls out a stable long-horizon temporal scaffold for continuous channels, (ii) a trend-conditioned residual DDPM that refines local stochastic structure and heavy-tailed fluctuations without degrading global dynamics, (iii) a masked (absorbing) diffusion branch for discrete variables that guarantees categorical legality by construction, and (iv) a type-aware decomposition/routing layer that aligns modeling mechanisms with heterogeneous ICS variable origins and enforces deterministic reconstruction where appropriate. Evaluated on fixed-length windows (L=96) derived from the HAI Security Dataset, Mask-DDPM achieves stable fidelity across seeds with mean KS = 0.3311 ± 0.0079 (continuous), mean JSD = 0.0284 ± 0.0073 (discrete), and mean absolute lag-1 autocorrelation difference = 0.2684 ± 0.0027, indicating faithful marginals, preserved short-horizon dynamics, and valid discrete semantics. The resulting generator provides a reproducible basis for data augmentation, benchmarking, and downstream ICS protocol reconstruction workflows.
\end{abstract}
% Keywords
\begin{IEEEkeywords}Machine Learning, Cyber Defense, ICS\end{IEEEkeywords}
% 1. Introduction
\section{Introduction}
\label{sec:intro}
Industrial control systems (ICS) form the backbone of modern critical infrastructure, which includes power grids, water treatment, manufacturing, and transportation, among others. These systems monitor, regulate, and automate physical processes through sensors, actuators, programmable logic controllers (PLCs), and monitoring software. Unlike conventional IT systems, ICS operate in real time, closely coupled with physical processes and safety-critical constraints, using heterogeneous and legacy communication protocols such as Modbus/TCP and DNP3 that were not originally designed with robust security in mind. This architectural complexity and operational criticality make ICS high-impact targets for cyber attacks, where disruptions can result in physical damage, environmental harm, and even loss of life. Recent reviews of ICS security highlight the expanding attack surface due to increased connectivity, legacy systems' vulnerabilities, and the inadequacy of traditional security controls in capturing the nuances of ICS networks and protocols \cite{10.1007/s10844-022-00753-1,Nankya2023-gp}.
While machine learning (ML) techniques have shown promise for anomaly detection and automated cybersecurity within ICS, they rely heavily on labeled datasets that capture both benign operations and diverse attack patterns. In practice, real ICS traffic data, especially attack-triggered captures, are scarce due to confidentiality, safety, and legal restrictions, and available public ICS datasets are few, limited in scope, or fail to reflect current threat modalities. For instance, the HAI Security Dataset provides operational telemetry and anomaly flags from a realistic control system setup for research purposes, but must be carefully preprocessed to derive protocol-relevant features for ML tasks \cite{shin}. Data scarcity directly undermines model generalization, evaluation reproducibility, and the robustness of intrusion detection research, especially when training or testing ML models on realistic ICS behavior remains confined to small or outdated collections of examples \cite{info16100910}.
Synthetic data generation offers a practical pathway to mitigate these challenges. By programmatically generating feature-level sequences that mimic the statistical and temporal structure of real ICS telemetry, researchers can augment scarce training sets, standardize benchmarking, and preserve operational confidentiality. Relative to raw packet captures, feature-level synthesis abstracts critical protocol semantics and statistical patterns without exposing sensitive fields, making it more compatible with safety constraints and compliance requirements in ICS environments. Modern generative modeling, including diffusion models, has advanced significantly in producing high-fidelity synthetic data across domains. Diffusion approaches, such as denoising diffusion probabilistic models, learn to transform noise into coherent structured samples and have been successfully applied to tabular or time series data synthesis with better stability and data coverage compared to adversarial methods \cite{pmlr-v202-kotelnikov23a,rasul2021autoregressivedenoisingdiffusionmodels}.
Despite these advances, most existing work either focuses on packet-level generation \cite{jiang2023netdiffusionnetworkdataaugmentation} or is limited to generic tabular data \cite{pmlr-v202-kotelnikov23a}, rather than domain-specific control sequence synthesis tailored for ICS protocols where temporal coherence, multi-channel dependencies, and discrete protocol legality are jointly required. This gap motivates our focus on protocol feature-level generation for ICS, which involves synthesizing sequences of protocol-relevant fields conditioned on their temporal and cross-channel structure. In this work, we formulate a hybrid modeling pipeline that decouples long-horizon trends and local statistical detail while preserving discrete semantics of protocol tokens. By combining causal Transformers with diffusion-based refiners, and enforcing deterministic validity constraints during sampling, our framework generates semantically coherent, temporally consistent, and distributionally faithful ICS feature sequences. We evaluate features derived from the HAI Security Dataset and demonstrate that our approach produces high-quality synthetic sequences suitable for downstream augmentation, benchmarking, and integration into packet-construction workflows that respect realistic ICS constraints.
% 2. Related Work
\section{Related Work}
\label{sec:related}
Early generation of network data oriented towards ``realism'' mostly remained at the packet/flow header level, either through replay or statistical synthesis based on single-point observations. Swing, in a closed-loop, network-responsive manner, extracts user/application/network distributions from single-point observations to reproduce burstiness and correlation across multiple time scales \cite{10.1145/1159913.1159928}. Subsequently, a series of works advanced header synthesis to learning-based generation: the WGAN-based method added explicit verification of protocol field consistency to NetFlow/IPFIX \cite{Ring_2019}, NetShare reconstructed header modeling as flow-level time series and improved fidelity and scalability through domain encoding and parallel fine-tuning \cite{10.1145/3544216.3544251}, and DoppelGANger preserved the long-range structure and downstream sorting consistency of networked time series by decoupling attributes from sequences \cite{Lin_2020}. However, in industrial control system (ICS) scenarios, the original PCAP is usually not shareable, and public testbeds (such as SWaT, WADI) mostly provide process/monitoring telemetry and protocol interactions for security assessment, but public datasets emphasize operational variables rather than packet-level traces \cite{7469060,10.1145/3055366.3055375}. This makes ``synthesis at the feature/telemetry level, aware of protocol and semantics'' more feasible and necessary in practice: we are more concerned with reproducing high-level distributions and multi-scale temporal patterns according to operational semantics and physical constraints without relying on the original packets. From this perspective, the generation paradigm naturally shifts from ``packet syntax reproduction'' to ``modeling of high-level spatio-temporal distributions and uncertainties'', requiring stable training, strong distribution fitting, and interpretable uncertainty characterization.
Diffusion models exhibit good fit along this path: DDPM achieves high-quality sampling and stable optimization through efficient $\epsilon$ parameterization and weighted variational objectives \cite{ho2020denoising}, the SDE perspective unifies score-based and diffusion, providing likelihood evaluation and prediction-correction sampling strategies based on probability flow ODEs \cite{song2021score}. For time series, TimeGrad replaces the constrained output distribution with conditional denoising, capturing high-dimensional correlations at each step \cite{rasul2021autoregressivedenoisingdiffusionmodels}; CSDI explicitly performs conditional diffusion and uses two-dimensional attention to simultaneously leverage temporal and cross-feature dependencies, suitable for conditioning and filling in missing values \cite{tashiro2021csdiconditionalscorebaseddiffusion}; in a more general spatio-temporal structure, DiffSTG generalizes diffusion to spatio-temporal graphs, combining TCN/GCN with denoising U-Net to improve CRPS and inference efficiency in a non-autoregressive manner \cite{wen2024diffstgprobabilisticspatiotemporalgraph}, and PriSTI further enhances conditional features and geographical relationships, maintaining robustness under high missing rates and sensor failures \cite{liu2023pristiconditionaldiffusionframework}; in long sequences and continuous domains, DiffWave verifies that diffusion can also match the quality of strong vocoders under non-autoregressive fast synthesis \cite{kong2021diffwaveversatilediffusionmodel}; studies on cellular communication traffic show that diffusion can recover spatio-temporal patterns and provide uncertainty characterization at the urban scale \cite{11087622}. These results overall point to a conclusion: when the research focus is on ``telemetry/high-level features'' rather than raw messages, diffusion models provide stable and fine-grained distribution fitting and uncertainty quantification, which is exactly in line with the requirements of ICS telemetry synthesis. Meanwhile, directly entrusting all structures to a ``monolithic diffusion'' is not advisable: long-range temporal skeletons and fine-grained marginal distributions often have optimization tensions, requiring explicit decoupling in modeling.
Looking further into the mechanism complexity of ICS: its channel types are inherently mixed, containing both continuous process trajectories and discrete supervision/status variables, and discrete channels must be ``legal'' under operational constraints. The aforementioned progress in time series diffusion has mainly occurred in continuous spaces, but discrete diffusion has also developed systematic methods: D3PM improves sampling quality and likelihood through absorption/masking and structured transitions in discrete state spaces \cite{austin2021structured}, subsequent masked diffusion provides stable reconstruction on categorical data in a more simplified form \cite{Lin_2020}, multinomial diffusion directly defines diffusion on a finite vocabulary through mechanisms such as argmax flows \cite{hoogeboom2021argmaxflowsmultinomialdiffusion}, and Diffusion-LM demonstrates an effective path for controllable text generation by imposing gradient constraints in continuous latent spaces \cite{li2022diffusionlmimprovescontrollabletext}. From the perspectives of protocols and finite-state machines, coverage-guided fuzz testing emphasizes the criticality of ``sequence legality and state coverage'' \cite{meng2025aflnetyearslatercoverageguided,godefroid2017learnfuzzmachinelearninginput,she2019neuzzefficientfuzzingneural}, echoing the concept of ``legality by construction'' in discrete diffusion: preferentially adopting absorption/masking diffusion on discrete channels, supplemented by type-aware conditioning and sampling constraints, to avoid semantic invalidity and marginal distortion caused by post hoc thresholding.
From the perspective of high-level synthesis, the temporal structure is equally indispensable: ICS control often involves delay effects, phased operating conditions, and cross-channel coupling, requiring models to be able to characterize low-frequency, long-range dependencies while also overlaying multi-modal fine-grained fluctuations on them. The Transformer series has provided sufficient evidence in long-sequence time series tasks: Transformer-XL breaks through the fixed-length context limitation through a reusable memory mechanism and significantly enhances long-range dependency expression \cite{dai2019transformerxlattentivelanguagemodels}; Informer uses ProbSparse attention and efficient decoding to balance span and efficiency in long-sequence prediction \cite{zhou2021informerefficienttransformerlong}; Autoformer robustly models long-term seasonality and trends through autocorrelation and decomposition mechanisms \cite{wu2022autoformerdecompositiontransformersautocorrelation}; FEDformer further improves long-period prediction performance in frequency domain enhancement and decomposition \cite{zhou2022fedformerfrequencyenhanceddecomposed}; PatchTST enhances the stability and generalization of long-sequence multivariate prediction through local patch-based representation and channel-independent modeling \cite{2023}. Combining our previous positioning of diffusion, this chain of evidence points to a natural division of labor: using attention-based sequence models to first extract stable low-frequency trends/conditions (long-range skeletons), and then allowing diffusion to focus on margins and details in the residual space; meanwhile, discrete masking/absorbing diffusion is applied to supervised/pattern variables to ensure vocabulary legality by construction. This design not only inherits the advantages of time series diffusion in distribution fitting and uncertainty characterization \cite{rasul2021autoregressivedenoisingdiffusionmodels,tashiro2021csdiconditionalscorebaseddiffusion,wen2024diffstgprobabilisticspatiotemporalgraph,liu2023pristiconditionaldiffusionframework,kong2021diffwaveversatilediffusionmodel,11087622}, but also stabilizes the macroscopic temporal support through the long-range attention of Transformer, enabling the formation of an operational integrated generation pipeline under the mixed types and multi-scale dynamics of ICS.
% 3. Methodology
\section{Methodology}
\label{sec:method}
Industrial control system (ICS) telemetry is intrinsically mixed-type and mechanistically heterogeneous: continuous process trajectories (e.g., sensor and actuator signals) coexist with discrete supervisory states (e.g., modes, alarms, interlocks), and the underlying generating mechanisms range from physical inertia to program-driven step logic. This heterogeneity is not cosmetic--it directly affects what ``realistic'' synthesis means, because a generator must jointly satisfy (i) temporal coherence, (ii) distributional fidelity, and (iii) discrete semantic validity (i.e., every discrete output must belong to its legal vocabulary by construction). These properties are emphasized broadly in operational-technology security guidance and ICS engineering practice, where state logic and physical dynamics are tightly coupled \cite{nist2023sp80082}.
We model each training instance as a fixed-length window of length $L$, comprising continuous channels $\bm{X} \in \mathbb{R}^{L \times d_c}$ and discrete channels $\bm{Y} = \{y^{(j)}_{1:L}\}_{j=1}^{d_d}$, where each discrete variable satisfies $y^{(j)}_t \in \mathcal{V}_j$ for a finite vocabulary $\mathcal{V}_j$. Our objective is to learn a generator that produces synthetic $(\hat{\bm{X}}, \hat{\bm{Y}})$ that are simultaneously coherent and distributionally faithful, while also ensuring $\hat{y}^{(j)}_t\in\mathcal{V}_j$ for all $j$, $t$ by construction (rather than via post-hoc rounding or thresholding).
A key empirical and methodological tension in ICS synthesis is that temporal realism and marginal/distributional realism can compete when optimized monolithically: sequence models trained primarily for regression often over-smooth heavy tails and intermittent bursts, while purely distribution-matching objectives can erode long-range structure. Diffusion models provide a principled route to rich distribution modeling through iterative denoising, but they do not, by themselves, resolve (i) the need for a stable low-frequency temporal scaffold, nor (ii) the discrete legality constraints for supervisory variables \cite{ho2020denoising,song2021score}. Recent time-series diffusion work further suggests that separating coarse structure from stochastic refinement can be an effective inductive bias for long-horizon realism \cite{kollovieh2023tsdiff,sikder2023transfusion}.
\begin{figure*}[t]
\centering
\includegraphics[width=\textwidth]{fig-design-v2.png}
% \caption{Description of the figure.}
\label{fig:design}
\end{figure*}
Motivated by these considerations, we propose Mask-DDPM, organized in the following order:
\begin{enumerate}
\item Transformer trend module: learns the dominant temporal backbone of continuous dynamics via attention-based sequence modeling \cite{vaswani2017attention}.
\item Residual DDPM for continuous variables: models distributional detail as stochastic residual structure conditioned on the learned trend \cite{ho2020denoising,kollovieh2023tsdiff}.
\item Masked diffusion for discrete variables: generates discrete ICS states with an absorbing/masking corruption process and categorical reconstruction \cite{austin2021structured,shi2024simplified}.
\item Type-aware decomposition: a type-aware factorization and routing layer that assigns variables to the most appropriate modeling mechanism and enforces deterministic constraints where warranted.
\end{enumerate}
This ordering is intentional. The trend module establishes a macro-temporal scaffold; residual diffusion then concentrates capacity on micro-structure and marginal fidelity; masked diffusion provides a native mechanism for discrete legality; and the type-aware layer operationalizes the observation that not all ICS variables should be modeled with the same stochastic mechanism. Importantly, while diffusion-based generation for ICS telemetry has begun to emerge, existing approaches remain limited and typically emphasize continuous synthesis or augmentation; in contrast, our pipeline integrates (i) a Transformer-conditioned residual diffusion backbone, (ii) a discrete masked-diffusion branch, and (iii) explicit type-aware routing for heterogeneous variable mechanisms within a single coherent generator \cite{yuan2025ctu,sha2026ddpm}.
\subsection{Transformer trend module for continuous dynamics}
\label{sec:method-trans}
We instantiate the temporal backbone as a causal Transformer trend extractor, leveraging self-attention's ability to represent long-range dependencies and cross-channel interactions without recurrence \cite{vaswani2017attention}. Compared with recurrent trend extractors (e.g., GRU-style backbones), a Transformer trend module offers a direct mechanism to model delayed effects and multivariate coupling--common in ICS, where control actions may influence downstream sensors with nontrivial lags and regime-dependent propagation \cite{vaswani2017attention,nist2023sp80082}. Crucially, in our design the Transformer is not asked to be the entire generator; instead, it serves a deliberately restricted role: providing a stable, temporally coherent conditioning signal that later stochastic components refine.
For continuous channels $\bm{X}$, we posit an additive decomposition:
\begin{equation}
\bm{X} = \bm{S} + \bm{R},
\label{eq:additive_decomp}
\end{equation}
where $\bm{S} \in \mathbb{R}^{L \times d_c}$ is a smooth trend capturing predictable temporal evolution, and $\bm{R} \in \mathbb{R}^{L \times d_c}$ is a residual capturing distributional detail (e.g., bursts, heavy tails, local fluctuations) that is difficult to represent robustly with a purely regression-based temporal objective. This separation reflects an explicit division of labor: the trend module prioritizes temporal coherence, while diffusion (introduced next) targets distributional realism at the residual level--a strategy aligned with ``predict-then-refine'' perspectives in time-series diffusion modeling \cite{kollovieh2023tsdiff,sikder2023transfusion}.
We parameterize the trend $\bm{S}$ using a causal Transformer $f_\phi$. With teacher forcing, we train $f_{\phi}$ to predict the next-step trend from past observations:
\begin{equation}
\hat{\bm{S}}_{t+1} = f_{\phi}(\bm{X}_{1:t}), \quad t = 1, \dots, L-1.
\label{eq:trend_prediction}
\end{equation}
The trend module is trained with the mean-squared error objective:
\begin{equation}
\mathcal{L}_{\text{trend}}(\phi) = \frac{1}{(L-1)d_c} \sum_{t=1}^{L-1} \bigl\| \hat{\bm{S}}_{t+1} - \bm{X}_{t+1} \bigr\|_2^2.
\label{eq:trend_loss}
\end{equation}
At inference, we roll out the Transformer autoregressively to obtain $\hat{\bm{S}}$, and then define the residual target for diffusion as $\bm{R} = \bm{X} - \hat{\bm{S}}$. This setup intentionally ``locks in'' a coherent low-frequency scaffold before any stochastic refinement is applied, thereby reducing the burden on downstream diffusion modules to simultaneously learn both long-range structure and marginal detail. In this sense, our use of Transformers is distinctive: it is a conditioning-first temporal backbone designed to stabilize mixed-type diffusion synthesis in ICS, rather than an end-to-end monolithic generator \cite{vaswani2017attention,kollovieh2023tsdiff,yuan2025ctu}.
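For concreteness, a minimal sketch of this rollout (the exact seeding of the initial steps is an implementation detail and is not fixed here) is
\[
\hat{\bm{S}}_{t+1} = f_{\phi}\bigl(\hat{\bm{S}}_{1:t}\bigr), \qquad t = 1, \dots, L-1,
\]
with early steps initialized from observed context.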
\subsection{DDPM for continuous residual generation}
\label{sec:method-ddpm}
We model the residual $\bm{R}$ with a denoising diffusion probabilistic model (DDPM) conditioned on the trend $\hat{\bm{S}}$ \cite{ho2020denoising}. Diffusion models learn complex data distributions by inverting a tractable noising process through iterative denoising, and have proven effective at capturing multimodality and heavy-tailed structure that is often attenuated by purely regression-based sequence models \cite{ho2020denoising,song2021score}. Conditioning the diffusion model on $\hat{\bm{S}}$ is central: it prevents the denoiser from re-learning the low-frequency scaffold and focuses capacity on residual micro-structure, mirroring the broader principle that diffusion excels as a distributional corrector when a reasonable coarse structure is available \cite{kollovieh2023tsdiff,sikder2023transfusion}.
Let $K$ denote the number of diffusion steps, with a noise schedule $\{\beta_k\}_{k=1}^K$, $\alpha_k = 1 - \beta_k$, and $\bar{\alpha}_k = \prod_{i=1}^k \alpha_i$. The forward corruption process is:
\begin{equation}
q(\bm{r}_k \mid \bm{r}_0) = \mathcal{N}\bigl( \sqrt{\bar{\alpha}_k}\,\bm{r}_0,\; (1 - \bar{\alpha}_k)\mathbf{I} \bigr)
\label{eq:forward_corruption}
\end{equation}
equivalently,
\begin{equation}
\bm{r}_k = \sqrt{\bar{\alpha}_k}\,\bm{r}_0 + \sqrt{1 - \bar{\alpha}_k}\,\boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})
\label{eq:forward_corruption_eq}
\end{equation}
The learned reverse process is parameterized as:
\begin{equation}
p_{\theta}(\bm{r}_{k-1} \mid \bm{r}_k, \hat{\bm{S}}) = \mathcal{N}\bigl( \boldsymbol{\mu}_{\theta}(\bm{r}_k, k, \hat{\bm{S}}),\; \boldsymbol{\Sigma}(k) \bigr).
\label{eq:reverse_process}
\end{equation}
where $\boldsymbol{\mu}_{\theta}$ is implemented by a Transformer denoiser that consumes (i) the noised residual $\bm{r}_k$, (ii) a timestep embedding for $k$, and (iii) conditioning features derived from $\hat{\bm{S}}$. This denoiser architecture is consistent with the growing use of attention-based denoisers for long-context time-series diffusion, while our key methodological emphasis is the trend-conditioned residual factorization as the object of diffusion learning \cite{ho2020denoising,sikder2023transfusion}.
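For completeness, in the standard $\epsilon$-parameterization the reverse mean is recovered from the predicted noise as
\[
\boldsymbol{\mu}_{\theta}(\bm{r}_k, k, \hat{\bm{S}}) = \frac{1}{\sqrt{\alpha_k}} \Bigl( \bm{r}_k - \frac{\beta_k}{\sqrt{1 - \bar{\alpha}_k}}\, \boldsymbol{\epsilon}_{\theta}(\bm{r}_k, k, \hat{\bm{S}}) \Bigr),
\]
following \cite{ho2020denoising}.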
We train the denoiser using the standard DDPM $\epsilon$-prediction objective:
\begin{equation}
\mathcal{L}_{\text{cont}}(\theta) = \mathbb{E}_{k,\bm{r}_0,\boldsymbol{\epsilon}} \left[ \bigl\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}(\bm{r}_k, k, \hat{\bm{S}}) \bigr\|_2^2 \right].
\label{eq:ddpm_loss}
\end{equation}
Because diffusion optimization can exhibit timestep imbalance (i.e., some timesteps dominate gradients), we optionally apply an SNR-based reweighting consistent with Min-SNR training:
\begin{equation}
\mathcal{L}^{\text{snr}}_{\text{cont}}(\theta) = \mathbb{E}_{k,\bm{r}_0,\boldsymbol{\epsilon}} \left[ w_k \bigl\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}(\bm{r}_k, k, \hat{\bm{S}}) \bigr\|_2^2 \right],
\label{eq:snr_loss}
\end{equation}
where $w_k = \min(\mathrm{SNR}_k, \gamma)/\mathrm{SNR}_k$ with $\mathrm{SNR}_k=\bar{\alpha}_k/(1-\bar{\alpha}_k)$, and $\gamma>0$ is a cap parameter \cite{hang2023efficient}.
After sampling $\hat{\bm{R}}$ by reverse diffusion, we reconstruct the continuous output as $\hat{\bm{X}} = \hat{\bm{S}} + \hat{\bm{R}}$. Overall, the DDPM component serves as a distributional corrector on top of a temporally coherent backbone, which is particularly suited to ICS where low-frequency dynamics are strong and persistent but fine-scale variability (including bursts and regime-conditioned noise) remains important for realism. Relative to prior ICS diffusion efforts that primarily focus on continuous augmentation, our formulation elevates trend-conditioned residual diffusion as a modular mechanism for disentangling temporal structure from distributional refinement \cite{yuan2025ctu,sha2026ddpm}.
\subsection{Masked diffusion for discrete ICS variables}
\label{sec:method-discrete}
Discrete ICS variables must remain categorical, making Gaussian diffusion inappropriate for supervisory states and mode-like channels. While one can attempt continuous relaxations or post-hoc discretization, such strategies risk producing semantically invalid intermediate states (e.g., ``in-between'' modes) and can distort the discrete marginal distribution. Discrete-state diffusion provides a principled alternative by defining a valid corruption process directly on categorical variables \cite{austin2021structured,shi2024simplified}. In the ICS setting, this is not a secondary detail: supervisory tags often encode control logic boundaries (modes, alarms, interlocks) that must remain within a finite vocabulary to preserve semantic correctness \cite{nist2023sp80082}.
We therefore adopt masked (absorbing) diffusion for discrete channels, where corruption replaces tokens with a special $\texttt{[MASK]}$ symbol according to a schedule \cite{shi2024simplified}. For each variable $j$, define a masking schedule $\{m_k\}_{k=1}^K$ (with $m_k\in[0,1]$) increasing in $k$. The forward corruption process is:
\begin{equation}
q(y^{(j)}_k \mid y^{(j)}_0) =
\begin{cases}
y^{(j)}_0, & \text{with probability } 1 - m_k, \\
\texttt{[MASK]}, & \text{with probability } m_k,
\end{cases}
\label{eq:masking_process}
\end{equation}
applied independently across $j$ and $t$. Let $\mathcal{M}$ denote the set of masked positions at step $k$. The denoiser $h_{\psi}$ predicts a categorical distribution over $\mathcal{V}_j$ for each masked token, conditioned on (i) the corrupted discrete sequence, (ii) the diffusion step $k$, and (iii) continuous context. Concretely, we condition on $\hat{\bm{S}}$ and $\hat{\bm{X}}$ to couple supervisory reconstruction to the underlying continuous dynamics:
\begin{equation}
p_{\psi}\bigl( y^{(j)}_0 \mid y_k, k, \hat{\bm{S}}, \hat{\bm{X}} \bigr) = h_{\psi}(y_k, k, \hat{\bm{S}}, \hat{\bm{X}}).
\label{eq:discrete_denoising}
\end{equation}
This conditioning choice is motivated by the fact that many discrete ICS states are not standalone; they are functions of regimes, thresholds, and procedural phases that manifest in continuous channels \cite{nist2023sp80082}. Training uses a categorical denoising objective:
\begin{equation}
\mathcal{L}_{\text{disc}}(\psi) = \mathbb{E}_{k} \left[ \frac{1}{|\mathcal{M}|} \sum_{(j,t) \in \mathcal{M}} \mathrm{CE}\bigl( h_{\psi}(y_k, k, \hat{\bm{S}}, \hat{\bm{X}})_{j,t},\; y^{(j)}_{0,t} \bigr) \right],
\label{eq:discrete_loss}
\end{equation}
where $\mathrm{CE}(\cdot,\cdot)$ is cross-entropy. At sampling time, we initialize all discrete tokens as $\texttt{[MASK]}$ and iteratively unmask them using the learned conditionals, ensuring that every output token lies in its legal vocabulary by construction. This discrete branch is a key differentiator of our pipeline: unlike typical continuous-only diffusion augmentation in ICS, we integrate masked diffusion as a first-class mechanism for supervisory-variable legality within the same end-to-end synthesis workflow \cite{shi2024simplified,yuan2025ctu}.
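As a concrete illustration (one possible instantiation rather than a prescription of the schedule used in our experiments), a linear schedule
\[
m_k = \frac{k}{K}, \qquad k = 1, \dots, K,
\]
masks an expected fraction $k/K$ of the tokens at step $k$, so the reverse pass starts from a fully masked window at $k = K$ and progressively commits tokens within $\mathcal{V}_j$ as $k$ decreases.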
\subsection{Type-aware decomposition as factorization and routing layer}
\label{sec:method-types}
Even with a trend-conditioned residual DDPM and a discrete masked-diffusion branch, a single uniform modeling treatment can remain suboptimal because ICS variables are generated by qualitatively different mechanisms. For example, program-driven setpoints exhibit step-and-dwell dynamics; controller outputs follow control laws conditioned on process feedback; actuator positions may show saturation and dwell; and some ``derived tags'' are deterministic functions of other channels. Treating all channels as if they were exchangeable stochastic processes can misallocate model capacity and induce systematic error concentration on a small subset of mechanistically distinct variables \cite{nist2023sp80082}.
We therefore introduce a type-aware decomposition that formalizes this heterogeneity as a routing and constraint layer. Let $\tau(i)\in\{1,\dots,6\}$ assign each variable $i$ to a type class. The type assignment can be initialized from domain semantics (tag metadata, value domains, and engineering meaning), and subsequently refined via an error-attribution workflow described in the Benchmark section. Importantly, this refinement does not change the core diffusion backbone; it changes which mechanism is responsible for which variable, thereby aligning inductive bias with variable-generating mechanism while preserving overall coherence.
We use the following taxonomy:
\begin{enumerate}
\item Type 1 (program-driven / setpoint-like): externally commanded, step-and-dwell variables. These variables can be treated as exogenous drivers (conditioning signals) or routed to specialized change-point / dwell-time models, rather than being forced into a smooth denoiser that may over-regularize step structure.
\item Type 2 (controller outputs): continuous variables tightly coupled to feedback loops; these benefit from conditional modeling where the conditioning includes relevant process variables and commanded setpoints.
\item Type 3 (actuator states/positions): often exhibit saturation, dwell, and rate limits; these may require stateful dynamics beyond generic residual diffusion, motivating either specialized conditional modules or additional inductive constraints.
\item Type 4 (process variables): inertia-dominated continuous dynamics; these are the primary beneficiaries of the Transformer trend + residual DDPM pipeline.
\item Type 5 (derived/deterministic variables): algebraic or rule-based functions of other variables; we enforce deterministic reconstruction $\hat{x}^{(i)} = g_i(\hat{X},\hat{Y})$ rather than learning a stochastic generator, improving logical consistency and sample efficiency.
\item Type 6 (auxiliary/low-impact variables): weakly coupled or sparse signals; we allow simplified modeling (e.g., calibrated marginals or lightweight temporal models) to avoid allocating diffusion capacity where it is not warranted.
\end{enumerate}
Type-aware decomposition improves synthesis quality through three mechanisms. First, it improves capacity allocation by preventing a small set of mechanistically atypical variables from dominating gradients and distorting the learned distribution for the majority class (typically Type 4). Second, it enables constraint enforcement by deterministically reconstructing Type 5 variables, preventing logically inconsistent samples that purely learned generators can produce. Third, it improves mechanism alignment by attaching inductive biases consistent with step/dwell or saturation behaviors where generic denoisers may implicitly favor smoothness.
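As a hypothetical illustration of a Type 5 rule (the concrete functions $g_i$ are dataset-specific and are not enumerated here), a derived control-error tag could be reconstructed deterministically from its commanded setpoint and measured process value,
\[
\hat{x}^{(\mathrm{err})}_t = \hat{x}^{(\mathrm{SP})}_t - \hat{x}^{(\mathrm{PV})}_t,
\]
rather than being sampled stochastically.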
From a novelty standpoint, this layer is not merely an engineering ``patch''; it is an explicit methodological statement that ICS synthesis benefits from typed factorization--a principle that has analogues in mixed-type generative modeling more broadly, but that remains underexplored in diffusion-based ICS telemetry synthesis \cite{shi2025tabdiff,yuan2025ctu,nist2023sp80082}.
\subsection{Joint optimization and end-to-end sampling}
\label{sec:method-joint}
We train the model in a staged manner consistent with the above factorization, which improves optimization stability and encourages each component to specialize in its intended role. Specifically: (i) we train the trend Transformer $f_{\phi}$ to obtain $\hat{\bm{S}}$; (ii) we compute residual targets $\bm{R} = \bm{X} - \hat{\bm{S}}$ for the continuous variables routed to residual diffusion; (iii) we train the residual DDPM $p_{\theta}(\bm{R}\mid \hat{\bm{S}})$ and masked diffusion model $p_{\psi}(\bm{Y}\mid \text{masked}(\bm{Y}), \hat{\bm{S}}, \hat{\bm{X}})$; and (iv) we apply type-aware routing and deterministic reconstruction during sampling. This staged strategy is aligned with the design goal of separating temporal scaffolding from distributional refinement, and it mirrors the broader intuition in time-series diffusion that decoupling coarse structure and stochastic detail can mitigate ``structure vs. realism'' conflicts \cite{kollovieh2023tsdiff,sikder2023transfusion}.
A simple combined objective is $\mathcal{L} = \lambda\mathcal{L}_{\text{cont}} + (1-\lambda)\mathcal{L}_{\text{disc}}$ with $\lambda\in[0,1]$ controlling the balance between continuous and discrete learning. Type-aware routing determines which channels contribute to which loss and which are excluded in favor of deterministic reconstruction. In practice, this routing acts as a principled guardrail against negative transfer across variable mechanisms: channels that are best handled deterministically (Type 5) or by specialized drivers (Type 1/3, depending on configuration) are prevented from forcing the diffusion models into statistically incoherent compromises.
At inference time, generation follows the same structured order: (i) trend $\hat{\bm{S}}$ via the Transformer, (ii) residual $\hat{\bm{R}}$ via DDPM, (iii) discrete $\hat{\bm{Y}}$ via masked diffusion, and (iv) type-aware assembly with deterministic reconstruction for routed variables. This pipeline produces $(\hat{\bm{X}},\hat{\bm{Y}})$ that are temporally coherent by construction (through $\hat{\bm{S}}$), distributionally expressive (through $\hat{\bm{R}}$ denoising), and discretely valid (through masked diffusion), while explicitly accounting for heterogeneous variable-generating mechanisms through type-aware routing. In combination, these choices constitute our central methodological contribution: a unified Transformer + mixed diffusion generator for ICS telemetry, augmented by typed factorization to align model capacity with domain mechanism \cite{ho2020denoising,shi2024simplified,yuan2025ctu,nist2023sp80082}.
% 4. Benchmark
\section{Benchmark}
\label{sec:benchmark}
We evaluate the proposed pipeline on feature sequences derived from the HAI Security Dataset, using fixed-length windows ($L=96$) that preserve the mixed-type structure of ICS telemetry. The goal of this benchmark is not only to report ``overall similarity'' but to justify why the proposed factorization is a better fit for protocol feature synthesis: continuous channels must match physical marginals \cite{coletta2023constrained}, discrete channels must remain semantically legal, and both must retain short-horizon dynamics that underpin state transitions and interlocks \cite{yang2001interlock}.
This emphasis reflects evaluation practice in time-series generation, where strong results are typically supported by multiple complementary views (marginal fidelity, dependency/temporal structure, and downstream plausibility), rather than a single aggregate score \cite{stenger2024survey}. In the ICS setting, this multi-view requirement is sharper: a generator that matches continuous marginals while emitting out-of-vocabulary supervisory tokens is unusable for protocol reconstruction, and a generator that matches marginals but breaks lag structure can produce temporally implausible command/response sequences.
Recent ICS time-series generators often emphasize aggregate similarity scores and utility-driven evaluations (e.g., anomaly-detection performance) to demonstrate realism, which is valuable but can under-specify mixed-type protocol constraints. Our benchmark complements these practices by making mixed-type legality and per-feature distributional alignment explicit: discrete outputs are evaluated as categorical distributions (JSD) and are constrained to remain within the legal vocabulary by construction, while continuous channels are evaluated with nonparametric distribution tests (KS) \cite{yoon2019timegan}. This combination provides a direct, protocol-relevant justification for the hybrid design, rather than relying on a single composite score that may mask discrete failures.
For continuous channels, we measure distributional alignment using the Kolmogorov--Smirnov (KS) statistic computed per feature between the empirical distributions of real and synthetic samples, and then averaged across features. For discrete channels, we quantify marginal fidelity with Jensen--Shannon divergence (JSD) \cite{lin1991divergence,yoon2019timegan} between categorical distributions per feature, averaged across discrete variables. To assess temporal realism, we compare lag-1 autocorrelation at the feature level and report the mean absolute difference between real and synthetic lag-1 autocorrelation, averaged across features. In addition, to avoid degenerate comparisons driven by near-constant tags, features whose empirical standard deviation falls below a small threshold are excluded from continuous KS aggregation; such channels carry limited distributional information and can distort summary statistics.
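For completeness, the standard per-feature statistics underlying these aggregates can be stated explicitly. Writing $F^{\text{real}}_i$ and $F^{\text{syn}}_i$ for the empirical CDFs of continuous feature $i$, $p_j$ and $q_j$ for the real and synthetic categorical marginals of discrete feature $j$ with midpoint $m_j = \tfrac{1}{2}(p_j + q_j)$, and $\rho_1^{\text{real}}(i)$, $\rho_1^{\text{syn}}(i)$ for lag-1 autocorrelations, the per-feature quantities are
\begin{equation}
\mathrm{KS}_i = \sup_{x}\,\bigl|F^{\text{real}}_i(x) - F^{\text{syn}}_i(x)\bigr|,
\label{eq:ks_def}
\end{equation}
\begin{equation}
\mathrm{JSD}_j = \tfrac{1}{2}\,\mathrm{KL}\bigl(p_j \,\|\, m_j\bigr) + \tfrac{1}{2}\,\mathrm{KL}\bigl(q_j \,\|\, m_j\bigr),
\label{eq:jsd_def}
\end{equation}
\begin{equation}
\Delta\rho_i = \bigl|\rho_1^{\text{real}}(i) - \rho_1^{\text{syn}}(i)\bigr|,
\label{eq:lag1_def}
\end{equation}
and the numbers reported below are means of these quantities over the retained features, with near-constant channels excluded from the KS average as described above.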
\subsection{Quantitative results}
\label{sec:benchmark-quant}
Across all runs, the mean continuous KS is 0.3311 (std 0.0079) and the mean discrete JSD is 0.0284 (std 0.0073), indicating that the generator preserves both continuous marginals and discrete semantic distributions at the feature level. Temporal consistency is similarly stable across runs, with a mean lag-1 autocorrelation difference of 0.2684 (std 0.0027), suggesting that the synthesized windows retain short-horizon dynamical structure \cite{ni2021sigwasserstein} instead of collapsing to marginal matching alone. The best-performing instance (by mean KS) attains 0.3224, and the small inter-seed variance shows that the reported fidelity is reproducible rather than driven by a single favorable initialization.
\begin{figure*}[t]
\centering
\includegraphics[width=\textwidth]{fig-overall-benchmark-v1.png}
% \caption{Description of the figure.}
\label{fig:benchmark}
\end{figure*}
\begin{table}[htbp]
\centering
\caption{Summary of benchmark metrics. Lower values indicate better performance.}
\label{tab:metrics}
\begin{tabular}{@{}l l c@{}}
\toprule
\textbf{Metric} & \textbf{Aggregation} & \textbf{Mean $\pm$ Std} \\
\midrule
KS (continuous) & mean over continuous features & 0.3311 $\pm$ 0.0079 \\
JSD (discrete) & mean over discrete features & 0.0284 $\pm$ 0.0073 \\
Abs $\Delta$ lag-1 autocorr & mean over features & 0.2684 $\pm$ 0.0027 \\
\bottomrule
\end{tabular}
\end{table}
To make the benchmark actionable (and comparable to prior work), we report type-appropriate, interpretable statistics instead of collapsing everything into a single similarity score. This matters in mixed-type ICS telemetry: continuous fidelity can be high while discrete semantics fail, and vice versa. By separating continuous (KS), discrete (JSD), and temporal (lag-1) views, the evaluation directly matches the design goals of the hybrid generator: distributional refinement for continuous residuals, vocabulary-valid reconstruction for discrete supervision, and trend-induced short-horizon coherence.
In addition, the seed-averaged reporting mirrors evaluation conventions in recent diffusion-based time-series generation studies, where robustness across runs is increasingly treated as a first-class signal rather than an afterthought. In this sense, the small inter-seed variance is itself evidence that the factorized training and typed routing reduce instability and localized error concentration, which is frequently observed when heterogeneous channels compete for the same modeling capacity.
% 5. Future Work
\section{Future Work}
\label{sec:future}
Future work will expand from generating legal ICS feature sequences toward data construction and adversarial evaluation for security tasks. The core contribution of this paper is the generation of feature sequences that are temporally consistent, distributionally credible, and discretely legal under mixed types and multi-scale dynamics. In practical intrusion- and anomaly-detection research, however, the more pressing bottleneck is often the lack of attack or anomaly data with clear semantics and sufficient coverage. A direct and important extension is therefore to treat the legal sequences generated here as a controllable, reproducible baseline operating flow and, while preserving sequence-level legality and engineering constraints, to inject or blend illegal behaviors according to specified attack patterns, thereby systematically constructing datasets for training and evaluating the detection of illegal data packets.
Specifically, attack injection can be upgraded from simple perturbation to semantically consistent, patterned rewriting. On continuous channels, this includes bias injection, covert manipulation near thresholds, instantaneous jumps, and intermittent bursts, which mimic the temporal signatures that attackers use for concealment without violating the basic boundary conditions of the process dynamics. On discrete channels, it includes illegal state transitions, alarm suppression or delayed triggering, and mode camouflage, reflecting trajectories that are unreachable under real control logic but forcibly created by an attacker. The injection process can further be coordinated with the type routing and constraint layer of this paper: for deterministically derived variables, illegal behavior should propagate through modifications of upstream variables to preserve consistency; for supervisory variables constrained by finite-state machines, interpretable illegal transitions can be generated via minimum-violation paths or controlled violation intensity, with violation points and types explicitly labeled so that downstream detection tasks can learn finer-grained discrimination criteria.
Methodologically, this direction also supports stronger controllability and measurability. Attack patterns can be treated as conditioning variables that jointly orchestrate legitimate generation and illegal injection, producing matched samples of different attack strategies under the same legitimate baseline and turning dataset construction into a repeatable scenario-generation process. By controlling the injection location, duration, amplitude, and coupling range, one can systematically sweep the performance-degradation curves of detectors across threat intensities and operating-condition stages, yielding a more stable benchmark than a single capture or a single script. Ultimately, this turns the legitimate-data generation capability presented in this paper into infrastructure for security research: a shareable, reproducible distribution of legitimate operation, augmented by controllable injection of clearly defined illegal patterns, producing datasets with sufficient coverage and consistent annotation for training and evaluating models that identify illegal packets and abnormal sequences, and improving reproducibility and engineering credibility in this direction.
% 6. Conclusion
\section{Conclusion}
\label{sec:conclusion}
This paper addresses the data scarcity and shareability barriers that limit machine-learning research for industrial control system (ICS) security by proposing a practical synthetic telemetry generation framework at the protocol feature level. We introduced Mask-DDPM, a hybrid generator designed explicitly for the mixed-type and multi-scale nature of ICS data, where continuous process dynamics must remain temporally coherent while discrete supervisory variables must remain categorically legal by construction.
Our main contributions are: (i) a causal Transformer trend module that provides a stable long-horizon temporal scaffold for continuous channels; (ii) a trend-conditioned residual DDPM that focuses modeling capacity on local stochastic detail and marginal fidelity without destabilizing global structure; (iii) a masked (absorbing) diffusion branch for discrete variables that guarantees in-vocabulary outputs and supports semantics-aware conditioning on continuous context; and (iv) a type-aware decomposition/routing layer that aligns model mechanisms with heterogeneous ICS variable origins (e.g., process inertia, step-and-dwell setpoints, deterministic derived tags), enabling deterministic enforcement where appropriate and improving capacity allocation.
We evaluated the approach on windows derived from the HAI Security Dataset and reported mixed-type, protocol-relevant metrics rather than a single aggregate score. Across seeds, the model achieves stable fidelity with mean KS = 0.3311 ± 0.0079 on continuous features, mean JSD = 0.0284 ± 0.0073 on discrete features, and mean absolute lag-1 autocorrelation difference 0.2684 ± 0.0027, indicating that Mask-DDPM preserves both marginal distributions and short-horizon dynamics while maintaining discrete legality.
Overall, Mask-DDPM provides a reproducible foundation for generating shareable, semantically valid ICS feature sequences suitable for data augmentation, benchmarking, and downstream packet/trace reconstruction workflows. Building on this capability, a natural next step is to move from purely legal synthesis toward controllable scenario construction, including structured attack/violation injection under engineering constraints to support adversarial evaluation and more comprehensive security benchmarks.
% References
\bibliographystyle{IEEEtran}
\bibliography{references}
\end{document}

arxiv-style/main.tex Normal file

@@ -0,0 +1,314 @@
\documentclass{article}
\usepackage{arxiv}
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc} % use 8-bit T1 fonts
\usepackage{hyperref} % hyperlinks
\usepackage{url} % simple URL typesetting
\usepackage{booktabs} % professional-quality tables
\usepackage{amsfonts} % blackboard math symbols
\usepackage{nicefrac} % compact symbols for 1/2, etc.
\usepackage{microtype} % microtypography
\usepackage{amsmath} % cleveref must be loaded after amsmath!
\usepackage{cleveref} % smart cross-referencing
\usepackage{lipsum} % Can be removed after putting your text content
\usepackage{graphicx}
\usepackage[numbers]{natbib}
\usepackage{doi}
% Packages for equations
\usepackage{amssymb}
\usepackage{bm}
\usepackage{array} % For column formatting
\usepackage{caption} % Better caption spacing
% Title
\title{Mask-DDPM: Transformer-Conditioned Mixed-Type Diffusion for Semantically Valid ICS Telemetry Synthesis}
% If a date is not needed, uncomment the line below
\date{}
\newif\ifuniqueAffiliation
\uniqueAffiliationtrue
\ifuniqueAffiliation % Standard author block
\author{
Zhenglan Chen \\
Aberdeen Institute of Data Science and Artificial Intelligence\\
South China Normal University\\
Guangzhou, Guangdong 510631, China \\
\texttt{20223803054@m.scnu.edu.cn} \\
\And
Mingzhe Yang \\
Aberdeen Institute of Data Science and Artificial Intelligence\\
South China Normal University\\
Guangzhou, Guangdong 510631, China \\
\texttt{20223803063@m.scnu.edu.cn} \\
\And
Hongyu Yan \\
Aberdeen Institute of Data Science and Artificial Intelligence\\
South China Normal University\\
Guangzhou, Guangdong 510631, China \\
\texttt{20223803065@m.scnu.edu.cn}
\And
Huan Yang \\
foo\\
South China Normal University\\
Guangzhou, Guangdong 510631, China \\
\texttt{foo@bar.com} \\
}
\fi
% Running-header settings
\renewcommand{\shorttitle}{Mask-DDPM for ICS Telemetry Synthesis}
%%% PDF metadata
\hypersetup{
pdftitle={Mask-DDPM: Transformer-Conditioned Mixed-Type Diffusion for Semantically Valid ICS Telemetry Synthesis},
pdfsubject={cs.LG, cs.CR},
pdfauthor={Zhenglan Chen, Mingzhe Yang, Hongyu Yan, Huan Yang},
pdfkeywords={Machine Learning, Cyber Defense, ICS},
}
\begin{document}
\maketitle
\begin{abstract}
Industrial control systems (ICS) security research is increasingly constrained by the scarcity and non-shareability of realistic traffic and telemetry, especially for attack scenarios. To mitigate this bottleneck, we study synthetic generation at the protocol feature/telemetry level, where samples must simultaneously preserve temporal coherence, match continuous marginal distributions, and keep discrete supervisory variables strictly within valid vocabularies. We propose Mask-DDPM, a hybrid framework tailored to mixed-type, multi-scale ICS sequences. Mask-DDPM factorizes generation into (i) a causal Transformer trend module that rolls out a stable long-horizon temporal scaffold for continuous channels, (ii) a trend-conditioned residual DDPM that refines local stochastic structure and heavy-tailed fluctuations without degrading global dynamics, (iii) a masked (absorbing) diffusion branch for discrete variables that guarantees categorical legality by construction, and (iv) a type-aware decomposition/routing layer that aligns modeling mechanisms with heterogeneous ICS variable origins and enforces deterministic reconstruction where appropriate. Evaluated on fixed-length windows (L=96) derived from the HAI Security Dataset, Mask-DDPM achieves stable fidelity across seeds with mean KS = 0.3311 ± 0.0079 (continuous), mean JSD = 0.0284 ± 0.0073 (discrete), and mean absolute lag-1 autocorrelation difference = 0.2684 ± 0.0027, indicating faithful marginals, preserved short-horizon dynamics, and valid discrete semantics. The resulting generator provides a reproducible basis for data augmentation, benchmarking, and downstream ICS protocol reconstruction workflows.
\end{abstract}
% Keywords
\keywords{Machine Learning \and Cyber Defense \and ICS}
% 1. Introduction
\section{Introduction}
\label{sec:intro}
Industrial control systems (ICS) form the backbone of modern critical infrastructure, which includes power grids, water treatment, manufacturing, and transportation, among others. These systems monitor, regulate, and automate the physical processes through sensors, actuators, programmable logic controllers (PLCs), and monitoring software. Unlike conventional IT systems, ICS operate in real time, closely coupled with physical processes and safety-critical constraints, using heterogeneous and legacy communication protocols such as Modbus/TCP and DNP3 that were not originally designed with robust security in mind. This architectural complexity and operational criticality make ICS high-impact targets for cyber attacks, where disruptions can result in physical damage, environmental harm, and even loss of life. Recent reviews of ICS security highlight the expanding attack surface due to increased connectivity, legacy system vulnerabilities, and the inadequacy of traditional security controls in capturing the nuances of ICS networks and protocols \citep{10.1007/s10844-022-00753-1, Nankya2023-gp}.
While machine learning (ML) techniques have shown promise for anomaly detection and automated cybersecurity within ICS, they rely heavily on labeled datasets that capture both benign operations and diverse attack patterns. In practice, real ICS traffic data, especially attack-triggered captures, are scarce due to confidentiality, safety, and legal restrictions, and available public ICS datasets are few, limited in scope, or fail to reflect current threat modalities. For instance, the HAI Security Dataset provides operational telemetry and anomaly flags from a realistic control system setup for research purposes, but must be carefully preprocessed to derive protocol-relevant features for ML tasks \citep{shin}. Data scarcity directly undermines model generalization, evaluation reproducibility, and the robustness of intrusion detection research, especially when training or testing ML models on realistic ICS behavior remains confined to small or outdated collections of examples \citep{info16100910}.
Synthetic data generation offers a practical pathway to mitigate these challenges. By programmatically generating feature-level sequences that mimic the statistical and temporal structure of real ICS telemetry, researchers can augment scarce training sets, standardize benchmarking, and preserve operational confidentiality. Relative to raw packet captures, feature-level synthesis abstracts critical protocol semantics and statistical patterns without exposing sensitive fields, making it more compatible with safety constraints and compliance requirements in ICS environments. Modern generative modeling, including diffusion models, has advanced significantly in producing high-fidelity synthetic data across domains. Diffusion approaches, such as denoising diffusion probabilistic models, learn to transform noise into coherent structured samples and have been successfully applied to tabular or time series data synthesis with better stability and data coverage compared to adversarial methods \citep{pmlr-v202-kotelnikov23a, rasul2021autoregressivedenoisingdiffusionmodels}.
Despite these advances, most existing work either focuses on packet-level generation \citep{jiang2023netdiffusionnetworkdataaugmentation} or is limited to generic tabular data \citep{pmlr-v202-kotelnikov23a}, rather than domain-specific control sequence synthesis tailored for ICS protocols where temporal coherence, multi-channel dependencies, and discrete protocol legality are jointly required. This gap motivates our focus on protocol feature-level generation for ICS, which involves synthesizing sequences of protocol-relevant fields conditioned on their temporal and cross-channel structure. In this work, we formulate a hybrid modeling pipeline that decouples long-horizon trends and local statistical detail while preserving discrete semantics of protocol tokens. By combining causal Transformers with diffusion-based refiners, and enforcing deterministic validity constraints during sampling, our framework generates semantically coherent, temporally consistent, and distributionally faithful ICS feature sequences. We evaluate features derived from the HAI Security Dataset and demonstrate that our approach produces high-quality synthetic sequences suitable for downstream augmentation, benchmarking, and integration into packet-construction workflows that respect realistic ICS constraints.
% 2. Related Work
\section{Related Work}
\label{sec:related}
Early generation of network data oriented towards ``realism'' mostly remained at the packet/flow header level, either through replay or statistical synthesis based on single-point observations. Swing, in a closed-loop, network-responsive manner, extracts user/application/network distributions from single-point observations to reproduce burstiness and correlation across multiple time scales \citep{10.1145/1151659.1159928,10.1145/1159913.1159928}. Subsequently, a series of works advanced header synthesis to learning-based generation: the WGAN-based method added explicit verification of protocol field consistency to NetFlow/IPFIX \citep{Ring_2019}, NetShare reconstructed header modeling as flow-level time series and improved fidelity and scalability through domain encoding and parallel fine-tuning \citep{10.1145/3544216.3544251}, and DoppelGANger preserved the long-range structure and downstream sorting consistency of networked time series by decoupling attributes from sequences \citep{Lin_2020}. However, in industrial control system (ICS) scenarios, the original PCAP is usually not shareable, and public testbeds (such as SWaT, WADI) mostly provide process/monitoring telemetry and protocol interactions for security assessment, but public datasets emphasize operational variables rather than packet-level traces \citep{7469060,10.1145/3055366.3055375}. This makes ``synthesis at the feature/telemetry level, aware of protocol and semantics'' more feasible and necessary in practice: we are more concerned with reproducing high-level distributions and multi-scale temporal patterns according to operational semantics and physical constraints without relying on the original packets. From this perspective, the generation paradigm naturally shifts from ``packet syntax reproduction'' to ``modeling of high-level spatio-temporal distributions and uncertainties'', requiring stable training, strong distribution fitting, and interpretable uncertainty characterization.
Diffusion models exhibit good fit along this path: DDPM achieves high-quality sampling and stable optimization through efficient $\epsilon$ parameterization and weighted variational objectives \citep{NEURIPS2020_4c5bcfec}, the SDE perspective unifies score-based and diffusion, providing likelihood evaluation and prediction-correction sampling strategies based on probability flow ODEs \citep{song2021scorebasedgenerativemodelingstochastic}. For time series, TimeGrad replaces the constrained output distribution with conditional denoising, capturing high-dimensional correlations at each step \citep{rasul2021autoregressivedenoisingdiffusionmodels}; CSDI explicitly performs conditional diffusion and uses two-dimensional attention to simultaneously leverage temporal and cross-feature dependencies, suitable for conditioning and filling in missing values \citep{tashiro2021csdiconditionalscorebaseddiffusion}; in a more general spatio-temporal structure, DiffSTG generalizes diffusion to spatio-temporal graphs, combining TCN/GCN with denoising U-Net to improve CRPS and inference efficiency in a non-autoregressive manner \citep{wen2024diffstgprobabilisticspatiotemporalgraph}, and PriSTI further enhances conditional features and geographical relationships, maintaining robustness under high missing rates and sensor failures \citep{liu2023pristiconditionaldiffusionframework}; in long sequences and continuous domains, DiffWave verifies that diffusion can also match the quality of strong vocoders under non-autoregressive fast synthesis \citep{kong2021diffwaveversatilediffusionmodel}; studies on cellular communication traffic show that diffusion can recover spatio-temporal patterns and provide uncertainty characterization at the urban scale \citep{11087622}. These results overall point to a conclusion: when the research focus is on ``telemetry/high-level features'' rather than raw messages, diffusion models provide stable and fine-grained distribution fitting and uncertainty quantification, which is exactly in line with the requirements of ICS telemetry synthesis. Meanwhile, directly entrusting all structures to a ``monolithic diffusion'' is not advisable: long-range temporal skeletons and fine-grained marginal distributions often have optimization tensions, requiring explicit decoupling in modeling.
Looking further into the mechanism complexity of ICS: its channel types are inherently mixed, containing both continuous process trajectories and discrete supervision/status variables, and discrete channels must be ``legal'' under operational constraints. The aforementioned progress in time series diffusion has mainly occurred in continuous spaces, but discrete diffusion has also developed systematic methods: D3PM improves sampling quality and likelihood through absorption/masking and structured transitions in discrete state spaces \citep{austin2023structureddenoisingdiffusionmodels}, subsequent masked diffusion provides stable reconstruction on categorical data in a more simplified form \citep{Lin_2020}, multinomial diffusion directly defines diffusion on a finite vocabulary through mechanisms such as argmax flows \citep{hoogeboom2021argmaxflowsmultinomialdiffusion}, and Diffusion-LM demonstrates an effective path for controllable text generation by imposing gradient constraints in continuous latent spaces \citep{li2022diffusionlmimprovescontrollabletext}. From the perspectives of protocols and finite-state machines, coverage-guided fuzz testing emphasizes the criticality of ``sequence legality and state coverage'' \citep{meng2025aflnetyearslatercoverageguided,godefroid2017learnfuzzmachinelearninginput,she2019neuzzefficientfuzzingneural}, echoing the concept of ``legality by construction'' in discrete diffusion: preferentially adopting absorption/masking diffusion on discrete channels, supplemented by type-aware conditioning and sampling constraints, to avoid semantic invalidity and marginal distortion caused by post hoc thresholding.
From the perspective of high-level synthesis, the temporal structure is equally indispensable: ICS control often involves delay effects, phased operating conditions, and cross-channel coupling, requiring models to be able to characterize low-frequency, long-range dependencies while also overlaying multi-modal fine-grained fluctuations on them. The Transformer series has provided sufficient evidence in long-sequence time series tasks: Transformer-XL breaks through the fixed-length context limitation through a reusable memory mechanism and significantly enhances long-range dependency expression \citep{dai2019transformerxlattentivelanguagemodels}; Informer uses ProbSparse attention and efficient decoding to balance span and efficiency in long-sequence prediction \citep{zhou2021informerefficienttransformerlong}; Autoformer robustly models long-term seasonality and trends through autocorrelation and decomposition mechanisms \citep{wu2022autoformerdecompositiontransformersautocorrelation}; FEDformer further improves long-period prediction performance in frequency domain enhancement and decomposition \citep{zhou2022fedformerfrequencyenhanceddecomposed}; PatchTST enhances the stability and generalization of long-sequence multivariate prediction through local patch-based representation and channel-independent modeling \citep{2023}. Combining our previous positioning of diffusion, this chain of evidence points to a natural division of labor: using attention-based sequence models to first extract stable low-frequency trends/conditions (long-range skeletons), and then allowing diffusion to focus on margins and details in the residual space; meanwhile, discrete masking/absorbing diffusion is applied to supervised/pattern variables to ensure vocabulary legality by construction. This design not only inherits the advantages of time series diffusion in distribution fitting and uncertainty characterization \citep{rasul2021autoregressivedenoisingdiffusionmodels,tashiro2021csdiconditionalscorebaseddiffusion,wen2024diffstgprobabilisticspatiotemporalgraph,liu2023pristiconditionaldiffusionframework,kong2021diffwaveversatilediffusionmodel,11087622}, but also stabilizes the macroscopic temporal support through the long-range attention of Transformer, enabling the formation of an operational integrated generation pipeline under the mixed types and multi-scale dynamics of ICS.
% 3. Methodology
\section{Methodology}
\label{sec:method}
Industrial control system (ICS) telemetry is intrinsically mixed-type and mechanistically heterogeneous: continuous process trajectories (e.g., sensor and actuator signals) coexist with discrete supervisory states (e.g., modes, alarms, interlocks), and the underlying generating mechanisms range from physical inertia to program-driven step logic. This heterogeneity is not cosmetic—it directly affects what “realistic” synthesis means, because a generator must jointly satisfy (i) temporal coherence, (ii) distributional fidelity, and (iii) discrete semantic validity (i.e., every discrete output must belong to its legal vocabulary by construction). These properties are emphasized broadly in operational-technology security guidance and ICS engineering practice, where state logic and physical dynamics are tightly coupled \citep{nist2023sp80082}.
We model each training instance as a fixed-length window of length $L$, comprising continuous channels $\bm{X} \in \mathbb{R}^{L \times d_c}$ and discrete channels $\bm{Y} = \{y^{(j)}_{1:L}\}_{j=1}^{d_d}$, where each discrete variable satisfies $y^{(j)}_t \in \mathcal{V}_j$ for a finite vocabulary $\mathcal{V}_j$. Our objective is to learn a generator that produces synthetic $(\hat{\bm{X}}, \hat{\bm{Y}})$ that are simultaneously coherent and distributionally faithful, while also ensuring $\hat{y}^{(j)}_t\in\mathcal{V}_j$ for all $j$, $t$ by construction (rather than via post-hoc rounding or thresholding).
A key empirical and methodological tension in ICS synthesis is that temporal realism and marginal/distributional realism can compete when optimized monolithically: sequence models trained primarily for regression often over-smooth heavy tails and intermittent bursts, while purely distribution-matching objectives can erode long-range structure. Diffusion models provide a principled route to rich distribution modeling through iterative denoising, but they do not, by themselves, resolve (i) the need for a stable low-frequency temporal scaffold, nor (ii) the discrete legality constraints for supervisory variables \citep{ho2020denoising,song2021score}. Recent time-series diffusion work further suggests that separating coarse structure from stochastic refinement can be an effective inductive bias for long-horizon realism \citep{kollovieh2023tsdiff,sikder2023transfusion}.
\begin{figure}[htbp]
\centering
\includegraphics[width=0.8\textwidth]{fig-design-v2.png}
% \caption{Description of the figure.}
\label{fig:design}
\end{figure}
Motivated by these considerations, we propose Mask-DDPM, organized in the following order:
\begin{enumerate}
\item Transformer trend module: learns the dominant temporal backbone of continuous dynamics via attention-based sequence modeling \citep{vaswani2017attention}.
\item Residual DDPM for continuous variables: models distributional detail as stochastic residual structure conditioned on the learned trend \citep{ho2020denoising,kollovieh2023tsdiff}.
\item Masked diffusion for discrete variables: generates discrete ICS states with an absorbing/masking corruption process and categorical reconstruction \citep{austin2021structured, shi2024simplified}.
\item Type-aware decomposition: a type-aware factorization and routing layer that assigns variables to the most appropriate modeling mechanism and enforces deterministic constraints where warranted.
\end{enumerate}
This ordering is intentional. The trend module establishes a macro-temporal scaffold; residual diffusion then concentrates capacity on micro-structure and marginal fidelity; masked diffusion provides a native mechanism for discrete legality; and the type-aware layer operationalizes the observation that not all ICS variables should be modeled with the same stochastic mechanism. Importantly, while diffusion-based generation for ICS telemetry has begun to emerge, existing approaches remain limited and typically emphasize continuous synthesis or augmentation; in contrast, our pipeline integrates (i) a Transformer-conditioned residual diffusion backbone, (ii) a discrete masked-diffusion branch, and (iii) explicit type-aware routing for heterogeneous variable mechanisms within a single coherent generator \citep{yuan2025ctu,sha2026ddpm}.
\subsection{Transformer trend module for continuous dynamics}
\label{sec:method-trans}
We instantiate the temporal backbone as a causal Transformer trend extractor, leveraging self-attention's ability to represent long-range dependencies and cross-channel interactions without recurrence \citep{vaswani2017attention}. Compared with recurrent trend extractors (e.g., GRU-style backbones), a Transformer trend module offers a direct mechanism to model delayed effects and multivariate coupling, which are common in ICS, where control actions may influence downstream sensors with nontrivial lags and regime-dependent propagation \citep{vaswani2017attention,nist2023sp80082}. Crucially, in our design the Transformer is not asked to be the entire generator; instead, it serves a deliberately restricted role: providing a stable, temporally coherent conditioning signal that later stochastic components refine.
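Concretely, causality is enforced at the attention level: writing $\bm{Q}$, $\bm{K}$, and $\bm{V}$ for the per-layer queries, keys, and values with key dimension $d_k$, and $\bm{M}$ for an additive mask with $M_{t,t'} = -\infty$ for $t' > t$ and $0$ otherwise, each attention head computes
\begin{equation}
\mathrm{Attn}(\bm{Q},\bm{K},\bm{V}) = \mathrm{softmax}\!\left(\frac{\bm{Q}\bm{K}^{\top}}{\sqrt{d_k}} + \bm{M}\right)\bm{V},
\label{eq:causal_attn}
\end{equation}
so that the trend estimate at time $t$ never attends to future observations \citep{vaswani2017attention}.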
For continuous channels $\bm{X}$, we posit an additive decomposition:
\begin{equation}
\bm{X} = \bm{S} + \bm{R},
\label{eq:additive_decomp}
\end{equation}
where $\bm{S} \in \mathbb{R}^{L \times d_c}$ is a smooth trend capturing predictable temporal evolution, and $\bm{R} \in \mathbb{R}^{L \times d_c}$ is a residual capturing distributional detail (e.g., bursts, heavy tails, local fluctuations) that is difficult to represent robustly with a purely regression-based temporal objective. This separation reflects an explicit division of labor: the trend module prioritizes temporal coherence, while diffusion (introduced next) targets distributional realism at the residual level—a strategy aligned with “predict-then-refine” perspectives in time-series diffusion modeling \citep{kollovieh2023tsdiff,sikder2023transfusion}.
We parameterize the trend $\bm{S}$ using a causal Transformer $f_\phi$. With teacher forcing, we train $f_{\phi}$ to predict the next-step trend from past observations:
\begin{equation}
\hat{\bm{S}}_{t+1} = f_{\phi}(\bm{X}_{1:t}), \quad t = 1, \dots, L-1.
\label{eq:trend_prediction}
\end{equation}
The trend module is trained with the mean-squared error objective:
\begin{equation}
\mathcal{L}_{\text{trend}}(\phi) = \frac{1}{(L-1)d_c} \sum_{t=1}^{L-1} \bigl\| \hat{\bm{S}}_{t+1} - \bm{X}_{t+1} \bigr\|_2^2.
\label{eq:trend_loss}
\end{equation}
At inference, we roll out the Transformer autoregressively to obtain $\hat{\bm{S}}$, and then define the residual target for diffusion as $\bm{R} = \bm{X} - \hat{\bm{S}}$. This setup intentionally “locks in” a coherent low-frequency scaffold before any stochastic refinement is applied, thereby reducing the burden on downstream diffusion modules to simultaneously learn both long-range structure and marginal detail. In this sense, our use of Transformers is distinctive: it is a conditioning-first temporal backbone designed to stabilize mixed-type diffusion synthesis in ICS, rather than an end-to-end monolithic generator \citep{vaswani2017attention,kollovieh2023tsdiff,yuan2025ctu}.
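A minimal sketch of this rollout, assuming the window is seeded with a short context prefix of length $c<L$ (whether the prefix is observed or itself sampled is an implementation choice not fixed by the factorization), feeds predictions back as inputs:
\begin{equation}
\hat{\bm{S}}_{t+1} = f_{\phi}\bigl(\bigl[\bm{X}_{1:c},\, \hat{\bm{S}}_{c+1:t}\bigr]\bigr), \qquad t = c, \dots, L-1.
\label{eq:rollout}
\end{equation}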
\subsection{DDPM for continuous residual generation}
\label{sec:method-ddpm}
We model the residual $\bm{R}$ with a denoising diffusion probabilistic model (DDPM) conditioned on the trend $\hat{\bm{S}}$ \citep{ho2020denoising}. Diffusion models learn complex data distributions by inverting a tractable noising process through iterative denoising, and have proven effective at capturing multimodality and heavy-tailed structure that is often attenuated by purely regression-based sequence models \citep{ho2020denoising,song2021score}. Conditioning the diffusion model on $\hat{\bm{S}}$ is central: it prevents the denoiser from re-learning the low-frequency scaffold and focuses capacity on residual micro-structure, mirroring the broader principle that diffusion excels as a distributional corrector when a reasonable coarse structure is available \citep{kollovieh2023tsdiff, sikder2023transfusion}.
Let $K$ denote the number of diffusion steps, with a noise schedule $\{\beta_k\}_{k=1}^K$, $\alpha_k = 1 - \beta_k$, and $\bar{\alpha}_k = \prod_{i=1}^k \alpha_i$. The forward corruption process is:
\begin{equation}
q(\bm{r}_k \mid \bm{r}_0) = \mathcal{N}\bigl( \bm{r}_k;\, \sqrt{\bar{\alpha}_k}\,\bm{r}_0,\; (1 - \bar{\alpha}_k)\mathbf{I} \bigr),
\label{eq:forward_corruption}
\end{equation}
equivalently,
\begin{equation}
\bm{r}_k = \sqrt{\bar{\alpha}_k}\,\bm{r}_0 + \sqrt{1 - \bar{\alpha}_k}\,\boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}).
\label{eq:forward_corruption_eq}
\end{equation}
The learned reverse process is parameterized as:
\begin{equation}
p_{\theta}(\bm{r}_{k-1} \mid \bm{r}_k, \hat{\bm{S}}) = \mathcal{N}\bigl( \bm{r}_{k-1};\, \boldsymbol{\mu}_{\theta}(\bm{r}_k, k, \hat{\bm{S}}),\; \boldsymbol{\Sigma}(k) \bigr),
\label{eq:reverse_process}
\end{equation}
where $\boldsymbol{\mu}_{\theta}$ is implemented by a Transformer denoiser that consumes (i) the noised residual $\bm{r}_k$, (ii) a timestep embedding for $k$, and (iii) conditioning features derived from $\hat{\bm{S}}$. This denoiser architecture is consistent with the growing use of attention-based denoisers for long-context time-series diffusion, while our key methodological emphasis is the trend-conditioned residual factorization as the object of diffusion learning \citep{ho2020denoising,sikder2023transfusion}.
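For the $\epsilon$-prediction parameterization adopted below, this mean takes the standard DDPM closed form
\begin{equation}
\boldsymbol{\mu}_{\theta}(\bm{r}_k, k, \hat{\bm{S}}) = \frac{1}{\sqrt{\alpha_k}} \Bigl( \bm{r}_k - \frac{\beta_k}{\sqrt{1 - \bar{\alpha}_k}}\, \boldsymbol{\epsilon}_{\theta}(\bm{r}_k, k, \hat{\bm{S}}) \Bigr),
\label{eq:reverse_mean}
\end{equation}
with $\boldsymbol{\Sigma}(k)$ fixed to a schedule-dependent isotropic covariance such as $\beta_k \mathbf{I}$ \citep{ho2020denoising}.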
We train the denoiser using the standard DDPM $\epsilon$-prediction objective:
\begin{equation}
\mathcal{L}_{\text{cont}}(\theta) = \mathbb{E}_{k,\bm{r}_0,\boldsymbol{\epsilon}} \left[ \bigl\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}(\bm{r}_k, k, \hat{\bm{S}}) \bigr\|_2^2 \right].
\label{eq:ddpm_loss}
\end{equation}
Because diffusion optimization can exhibit timestep imbalance (i.e., some timesteps dominate gradients), we optionally apply an SNR-based reweighting consistent with Min-SNR training:
\begin{equation}
\mathcal{L}^{\text{snr}}_{\text{cont}}(\theta) = \mathbb{E}_{k,\bm{r}_0,\boldsymbol{\epsilon}} \left[ w_k \bigl\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}(\bm{r}_k, k, \hat{\bm{S}}) \bigr\|_2^2 \right],
\label{eq:snr_loss}
\end{equation}
where $w_k = \min\{\mathrm{SNR}_k, \gamma\}/\mathrm{SNR}_k$ is the Min-SNR weight for $\epsilon$-prediction, $\mathrm{SNR}_k=\bar{\alpha}_k/(1-\bar{\alpha}_k)$, and $\gamma>0$ is a cap parameter \citep{hang2023efficient}.
After sampling $\hat{\bm{R}}$ by reverse diffusion, we reconstruct the continuous output as $\hat{\bm{X}} = \hat{\bm{S}} + \hat{\bm{R}}$. Overall, the DDPM component serves as a distributional corrector on top of a temporally coherent backbone, which is particularly suited to ICS where low-frequency dynamics are strong and persistent but fine-scale variability (including bursts and regime-conditioned noise) remains important for realism. Relative to prior ICS diffusion efforts that primarily focus on continuous augmentation, our formulation elevates trend-conditioned residual diffusion as a modular mechanism for disentangling temporal structure from distributional refinement \citep{yuan2025ctu,sha2026ddpm}.
\subsection{Masked diffusion for discrete ICS variables}
\label{sec:method-discrete}
Discrete ICS variables must remain categorical, making Gaussian diffusion inappropriate for supervisory states and mode-like channels. While one can attempt continuous relaxations or post-hoc discretization, such strategies risk producing semantically invalid intermediate states (e.g., “in-between” modes) and can distort the discrete marginal distribution. Discrete-state diffusion provides a principled alternative by defining a valid corruption process directly on categorical variables \citep{austin2021structured,shi2024simplified}. In the ICS setting, this is not a secondary detail: supervisory tags often encode control logic boundaries (modes, alarms, interlocks) that must remain within a finite vocabulary to preserve semantic correctness \citep{nist2023sp80082}.
We therefore adopt masked (absorbing) diffusion for discrete channels, where corruption replaces tokens with a special $\texttt{[MASK]}$ symbol according to a schedule \citep{shi2024simplified}. For each variable $j$, define a masking schedule $\{m_k\}_{k=1}^K$ (with $m_k\in[0,1]$) increasing in $k$. The forward corruption process is:
\begin{equation}
q(y^{(j)}_k \mid y^{(j)}_0) =
\begin{cases}
y^{(j)}_0, & \text{with probability } 1 - m_k, \\
\texttt{[MASK]}, & \text{with probability } m_k,
\end{cases}
\label{eq:masking_process}
\end{equation}
applied independently across $j$ and $t$. Let $\mathcal{M}$ denote the set of masked positions at step $k$. The denoiser $h_{\psi}$ predicts a categorical distribution over $\mathcal{V}_j$ for each masked token, conditioned on (i) the corrupted discrete sequence, (ii) the diffusion step $k$, and (iii) continuous context. Concretely, we condition on $\hat{\bm{S}}$ and $\hat{\bm{X}}$ to couple supervisory reconstruction to the underlying continuous dynamics:
\begin{equation}
p_{\psi}\bigl( y^{(j)}_0 \mid y_k, k, \hat{\bm{S}}, \hat{\bm{X}} \bigr) = h_{\psi}(y_k, k, \hat{\bm{S}}, \hat{\bm{X}}).
\label{eq:discrete_denoising}
\end{equation}
This conditioning choice is motivated by the fact that many discrete ICS states are not standalone; they are functions of regimes, thresholds, and procedural phases that manifest in continuous channels \citep{nist2023sp80082}. Training uses a categorical denoising objective:
\begin{equation}
\mathcal{L}_{\text{disc}}(\psi) = \mathbb{E}_{k} \left[ \frac{1}{|\mathcal{M}|} \sum_{(j,t) \in \mathcal{M}} \mathrm{CE}\bigl( h_{\psi}(y_k, k, \hat{\bm{S}}, \hat{\bm{X}})_{j,t},\; y^{(j)}_{0,t} \bigr) \right],
\label{eq:discrete_loss}
\end{equation}
where $\mathrm{CE}(\cdot,\cdot)$ is cross-entropy. At sampling time, we initialize all discrete tokens as $\texttt{[MASK]}$ and iteratively unmask them using the learned conditionals, ensuring that every output token lies in its legal vocabulary by construction. This discrete branch is a key differentiator of our pipeline: unlike typical continuous-only diffusion augmentation in ICS, we integrate masked diffusion as a first-class mechanism for supervisory-variable legality within the same end-to-end synthesis workflow \citep{shi2024simplified,yuan2025ctu}.
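One concrete instance of this unmasking procedure, following standard ancestral samplers for absorbing-state diffusion (implementation details such as the exact schedule parameterization may differ), is as follows: at the transition from step $k$ to $k-1$, each still-masked position $(j,t)$ is revealed with probability
\begin{equation}
\frac{m_k - m_{k-1}}{m_k},
\label{eq:unmask_prob}
\end{equation}
its value is drawn from $p_{\psi}\bigl(y^{(j)}_0 \mid y_k, k, \hat{\bm{S}}, \hat{\bm{X}}\bigr)$ over $\mathcal{V}_j$, and positions already revealed remain unchanged \citep{austin2021structured,shi2024simplified}. Because every revealed token is sampled from a distribution supported on $\mathcal{V}_j$, no post-hoc projection or rounding is required.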
\subsection{Type-aware decomposition as factorization and routing layer}
\label{sec:method-types}
Even with a trend-conditioned residual DDPM and a discrete masked-diffusion branch, a single uniform modeling treatment can remain suboptimal because ICS variables are generated by qualitatively different mechanisms. For example, program-driven setpoints exhibit step-and-dwell dynamics; controller outputs follow control laws conditioned on process feedback; actuator positions may show saturation and dwell; and some “derived tags” are deterministic functions of other channels. Treating all channels as if they were exchangeable stochastic processes can misallocate model capacity and induce systematic error concentration on a small subset of mechanistically distinct variables \citep{nist2023sp80082}.
We therefore introduce a type-aware decomposition that formalizes this heterogeneity as a routing and constraint layer. Let $\tau(i)\in\{1,\dots,6\}$ assign each variable $i$ to a type class. The type assignment can be initialized from domain semantics (tag metadata, value domains, and engineering meaning), and subsequently refined via an error-attribution workflow described in the Benchmark section. Importantly, this refinement does not change the core diffusion backbone; it changes which mechanism is responsible for which variable, thereby aligning inductive bias with variable-generating mechanism while preserving overall coherence.
We use the following taxonomy:
\begin{enumerate}
\item Type 1 (program-driven / setpoint-like): externally commanded, step-and-dwell variables. These variables can be treated as exogenous drivers (conditioning signals) or routed to specialized change-point / dwell-time models, rather than being forced into a smooth denoiser that may over-regularize step structure.
\item Type 2 (controller outputs): continuous variables tightly coupled to feedback loops; these benefit from conditional modeling where the conditioning includes relevant process variables and commanded setpoints.
\item Type 3 (actuator states/positions): often exhibit saturation, dwell, and rate limits; these may require stateful dynamics beyond generic residual diffusion, motivating either specialized conditional modules or additional inductive constraints.
\item Type 4 (process variables): inertia-dominated continuous dynamics; these are the primary beneficiaries of the Transformer trend + residual DDPM pipeline.
\item Type 5 (derived/deterministic variables): algebraic or rule-based functions of other variables; we enforce deterministic reconstruction $\hat{x}^{(i)} = g_i(\hat{\bm{X}},\hat{\bm{Y}})$ rather than learning a stochastic generator, improving logical consistency and sample efficiency (a worked example follows this list).
\item Type 6 (auxiliary/low-impact variables): weakly coupled or sparse signals; we allow simplified modeling (e.g., calibrated marginals or lightweight temporal models) to avoid allocating diffusion capacity where it is not warranted.
\end{enumerate}
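As a purely illustrative example of Type 5 routing (the tag names below are hypothetical and not taken from the HAI schema), consider a derived channel that reports the deviation between a commanded setpoint and its process variable. Rather than learning this channel stochastically, it is reconstructed after sampling as
\begin{equation}
\hat{x}^{(\mathrm{err})}_t = \hat{x}^{(\mathrm{sp})}_t - \hat{x}^{(\mathrm{pv})}_t, \qquad t = 1, \dots, L,
\label{eq:type5_example}
\end{equation}
where $\hat{x}^{(\mathrm{sp})}$ and $\hat{x}^{(\mathrm{pv})}$ are produced by their respective routed generators.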
Type-aware decomposition improves synthesis quality through three mechanisms. First, it improves capacity allocation by preventing a small set of mechanistically atypical variables from dominating gradients and distorting the learned distribution for the majority class (typically Type 4). Second, it enables constraint enforcement by deterministically reconstructing Type 5 variables, preventing logically inconsistent samples that purely learned generators can produce. Third, it improves mechanism alignment by attaching inductive biases consistent with step/dwell or saturation behaviors where generic denoisers may implicitly favor smoothness.
From a novelty standpoint, this layer is not merely an engineering “patch”; it is an explicit methodological statement that ICS synthesis benefits from typed factorization—a principle that has analogues in mixed-type generative modeling more broadly, but that remains underexplored in diffusion-based ICS telemetry synthesis \citep{shi2025tabdiff,yuan2025ctu,nist2023sp80082}.
\subsection{Joint optimization and end-to-end sampling}
\label{sec:method-joint}
We train the model in a staged manner consistent with the above factorization, which improves optimization stability and encourages each component to specialize in its intended role. Specifically: (i) we train the trend Transformer $f_{\phi}$ to obtain $\hat{\bm{S}}$; (ii) we compute residual targets $\hat{\bm{R}} = \bm{X} - \hat{\bm{S}}$ for the continuous variables routed to residual diffusion; (iii) we train the residual DDPM $p_{\theta}(\bm{R}\mid \hat{\bm{S}})$ and masked diffusion model $p_{\psi}(\bm{Y}\mid \text{masked}(\bm{Y}), \hat{\bm{S}}, \hat{\bm{X}})$; and (iv) we apply type-aware routing and deterministic reconstruction during sampling. This staged strategy is aligned with the design goal of separating temporal scaffolding from distributional refinement, and it mirrors the broader intuition in time-series diffusion that decoupling coarse structure and stochastic detail can mitigate “structure vs. realism” conflicts \citep{kollovieh2023tsdiff,sikder2023transfusion}.
A simple combined objective is $\mathcal{L} = \lambda\mathcal{L}_{\text{cont}} + (1-\lambda)\mathcal{L}_{\text{disc}}$ with $\lambda\in[0,1]$ controlling the balance between continuous and discrete learning. Type-aware routing determines which channels contribute to which loss and which are excluded in favor of deterministic reconstruction. In practice, this routing acts as a principled guardrail against negative transfer across variable mechanisms: channels that are best handled deterministically (Type 5) or by specialized drivers (Type 1/3, depending on configuration) are prevented from forcing the diffusion models into statistically incoherent compromises.
At inference time, generation follows the same structured order: (i) trend $\hat{\bm{S}}$ via the Transformer, (ii) residual $\hat{\bm{R}}$ via DDPM, (iii) discrete $\hat{\bm{Y}}$ via masked diffusion, and (iv) type-aware assembly with deterministic reconstruction for routed variables. This pipeline produces $(\hat{\bm{X}},\hat{\bm{Y}})$ that are temporally coherent by construction (through $\hat{\bm{S}}$), distributionally expressive (through $\hat{\bm{R}}$ denoising), and discretely valid (through masked diffusion), while explicitly accounting for heterogeneous variable-generating mechanisms through type-aware routing. In combination, these choices constitute our central methodological contribution: a unified Transformer + mixed diffusion generator for ICS telemetry, augmented by typed factorization to align model capacity with domain mechanism \citep{ho2020denoising,shi2024simplified,yuan2025ctu,nist2023sp80082}.
% 4. Benchmark
\section{Benchmark}
\label{sec:benchmark}
We evaluate the proposed pipeline on feature sequences derived from the HAI Security Dataset, using fixed-length windows (L=96) that preserve the mixed-type structure of ICS telemetry. The goal of this benchmark is not only to report “overall similarity”, but to justify why the proposed factorization is a better fit for protocol feature synthesis: continuous channels must match physical marginals \citep{coletta2023constrained}, discrete channels must remain semantically legal, and both must retain short-horizon dynamics that underpin state transitions and interlocks \citep{yang2001interlock}.
This emphasis reflects evaluation practice in time-series generation, where strong results are typically supported by multiple complementary views (marginal fidelity, dependency/temporal structure, and downstream plausibility), rather than a single aggregate score \citep{stenger2024survey}. In the ICS setting, this multi-view requirement is sharper: a generator that matches continuous marginals while emitting out-of-vocabulary supervisory tokens is unusable for protocol reconstruction, and a generator that matches marginals but breaks lag structure can produce temporally implausible command/response sequences.
Recent ICS time-series generators often emphasize aggregate similarity scores and utility-driven evaluations (e.g., anomaly-detection performance) to demonstrate realism, which is valuable but can under-specify mixed-type protocol constraints. Our benchmark complements these practices by making mixed-type legality and per-feature distributional alignment explicit: discrete outputs are evaluated as categorical distributions (JSD) and are constrained to remain within the legal vocabulary by construction, while continuous channels are evaluated with nonparametric distribution tests (KS) \citep{yoon2019timegan}. This combination provides a direct, protocol-relevant justification for the hybrid design, rather than relying on a single composite score that may mask discrete failures.
For continuous channels, we measure distributional alignment using the Kolmogorov--Smirnov (KS) statistic computed per feature between the empirical distributions of real and synthetic samples, and then averaged across features. For discrete channels, we quantify marginal fidelity with Jensen--Shannon divergence (JSD) \citep{lin1991divergence,yoon2019timegan} between categorical distributions per feature, averaged across discrete variables. To assess temporal realism, we compare lag-1 autocorrelation at the feature level and report the mean absolute difference between real and synthetic lag-1 autocorrelation, averaged across features. In addition, to avoid degenerate comparisons driven by near-constant tags, features whose empirical standard deviation falls below a small threshold are excluded from continuous KS aggregation; such channels carry limited distributional information and can distort summary statistics.
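For completeness, the standard per-feature statistics underlying these aggregates can be stated explicitly. Writing $F^{\text{real}}_i$ and $F^{\text{syn}}_i$ for the empirical CDFs of continuous feature $i$, $p_j$ and $q_j$ for the real and synthetic categorical marginals of discrete feature $j$ with midpoint $m_j = \tfrac{1}{2}(p_j + q_j)$, and $\rho_1^{\text{real}}(i)$, $\rho_1^{\text{syn}}(i)$ for lag-1 autocorrelations, the per-feature quantities are
\begin{equation}
\mathrm{KS}_i = \sup_{x}\,\bigl|F^{\text{real}}_i(x) - F^{\text{syn}}_i(x)\bigr|,
\label{eq:ks_def}
\end{equation}
\begin{equation}
\mathrm{JSD}_j = \tfrac{1}{2}\,\mathrm{KL}\bigl(p_j \,\|\, m_j\bigr) + \tfrac{1}{2}\,\mathrm{KL}\bigl(q_j \,\|\, m_j\bigr),
\label{eq:jsd_def}
\end{equation}
\begin{equation}
\Delta\rho_i = \bigl|\rho_1^{\text{real}}(i) - \rho_1^{\text{syn}}(i)\bigr|,
\label{eq:lag1_def}
\end{equation}
and the numbers reported below are means of these quantities over the retained features, with near-constant channels excluded from the KS average as described above.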
\subsection{Quantitative results}
\label{sec:benchmark-quant}
Across all runs, the mean continuous KS is 0.3311 (std 0.0079) and the mean discrete JSD is 0.0284 (std 0.0073), indicating that the generator preserves both continuous marginals and discrete semantic distributions at the feature level. Temporal consistency is similarly stable across runs, with a mean lag-1 autocorrelation difference of 0.2684 (std 0.0027), suggesting that the synthesized windows retain short-horizon dynamical structure \citep{ni2021sigwasserstein} instead of collapsing to marginal matching alone. The best-performing instance (by mean KS) attains 0.3224, and the small inter-seed variance shows that the reported fidelity is reproducible rather than driven by a single favorable initialization.
\begin{figure}[htbp]
\centering
\includegraphics[width=0.8\textwidth]{fig-overall-benchmark-v1.png}
% \caption{Description of the figure.}
\label{fig:benchmark}
\end{figure}
\begin{table}[htbp]
\centering
\caption{Summary of benchmark metrics. Lower values indicate better performance.}
\label{tab:metrics}
\begin{tabular}{@{}l l c c@{}}
\toprule
\textbf{Metric} & \textbf{Aggregation} & \textbf{Lower is better} & \textbf{Mean $\pm$ Std} \\
\midrule
KS (continuous) & mean over continuous features & \checkmark & 0.3311 $\pm$ 0.0079 \\
JSD (discrete) & mean over discrete features & \checkmark & 0.0284 $\pm$ 0.0073 \\
Abs $\Delta$ lag-1 autocorr & mean over features & \checkmark & 0.2684 $\pm$ 0.0027 \\
\bottomrule
\end{tabular}
\end{table}
To make the benchmark actionable (and comparable to prior work), we report type-appropriate, interpretable statistics instead of collapsing everything into a single similarity score. This matters in mixed-type ICS telemetry: continuous fidelity can be high while discrete semantics fail, and vice versa. By separating continuous (KS), discrete (JSD), and temporal (lag-1) views, the evaluation directly matches the design goals of the hybrid generator: distributional refinement for continuous residuals, vocabulary-valid reconstruction for discrete supervision, and trend-induced short-horizon coherence.
In addition, the seed-averaged reporting mirrors evaluation conventions in recent diffusion-based time-series generation studies, where robustness across runs is increasingly treated as a first-class signal rather than an afterthought. In this sense, the small inter-seed variance is itself evidence that the factorized training and typed routing reduce instability and localized error concentration, which is frequently observed when heterogeneous channels compete for the same modeling capacity.
% 5. Future Work
\section{Future Work}
\label{sec:future}
Future work will expand from generating legal ICS feature sequences toward data construction and adversarial evaluation for security tasks. The core contribution of this paper is the generation of feature sequences that are temporally consistent, distributionally credible, and discretely legal under mixed types and multi-scale dynamics. In practical intrusion- and anomaly-detection research, however, the more pressing bottleneck is often the lack of attack or anomaly data with clear semantics and sufficient coverage. A direct and important extension is therefore to treat the legal sequences generated here as a controllable, reproducible baseline operating flow and, while preserving sequence-level legality and engineering constraints, to inject or blend illegal behaviors according to specified attack patterns, thereby systematically constructing datasets for training and evaluating the detection of illegal data packets.
Specifically, attack injection can be upgraded from simple perturbation to semantically consistent, patterned rewriting. On continuous channels, this includes bias injection, covert manipulation near thresholds, instantaneous jumps, and intermittent bursts, which mimic the temporal signatures that attackers use for concealment without violating the basic boundary conditions of the process dynamics. On discrete channels, it includes illegal state transitions, alarm suppression or delayed triggering, and mode camouflage, reflecting trajectories that are unreachable under real control logic but forcibly created by an attacker. The injection process can further be coordinated with the type routing and constraint layer of this paper: for deterministically derived variables, illegal behavior should propagate through modifications of upstream variables to preserve consistency; for supervisory variables constrained by finite-state machines, interpretable illegal transitions can be generated via minimum-violation paths or controlled violation intensity, with violation points and types explicitly labeled so that downstream detection tasks can learn finer-grained discrimination criteria.
Methodologically, this direction also supports stronger controllability and measurability. Attack patterns can be treated as conditioning variables that jointly orchestrate legitimate generation and illegal injection, producing matched samples of different attack strategies under the same legitimate baseline and turning dataset construction into a repeatable scenario-generation process. By controlling the injection location, duration, amplitude, and coupling range, one can systematically sweep the performance-degradation curves of detectors across threat intensities and operating-condition stages, yielding a more stable benchmark than a single capture or a single script. Ultimately, this turns the legitimate-data generation capability presented in this paper into infrastructure for security research: a shareable, reproducible distribution of legitimate operation, augmented by controllable injection of clearly defined illegal patterns, producing datasets with sufficient coverage and consistent annotation for training and evaluating models that identify illegal packets and abnormal sequences, and improving reproducibility and engineering credibility in this direction.
% 6. Conclusion
\section{Conclusion}
\label{sec:conclusion}
This paper addresses the data scarcity and shareability barriers that limit machine-learning research for industrial control system (ICS) security by proposing a practical synthetic telemetry generation framework at the protocol feature level. We introduced Mask-DDPM, a hybrid generator designed explicitly for the mixed-type and multi-scale nature of ICS data, where continuous process dynamics must remain temporally coherent while discrete supervisory variables must remain categorically legal by construction.
Our main contributions are: (i) a causal Transformer trend module that provides a stable long-horizon temporal scaffold for continuous channels; (ii) a trend-conditioned residual DDPM that focuses modeling capacity on local stochastic detail and marginal fidelity without destabilizing global structure; (iii) a masked (absorbing) diffusion branch for discrete variables that guarantees in-vocabulary outputs and supports semantics-aware conditioning on continuous context; and (iv) a type-aware decomposition/routing layer that aligns model mechanisms with heterogeneous ICS variable origins (e.g., process inertia, step-and-dwell setpoints, deterministic derived tags), enabling deterministic enforcement where appropriate and improving capacity allocation.
We evaluated the approach on windows derived from the HAI Security Dataset and reported mixed-type, protocol-relevant metrics rather than a single aggregate score. Across seeds, the model achieves stable fidelity with mean KS = 0.3311 ± 0.0079 on continuous features, mean JSD = 0.0284 ± 0.0073 on discrete features, and mean absolute lag-1 autocorrelation difference 0.2684 ± 0.0027, indicating that Mask-DDPM preserves both marginal distributions and short-horizon dynamics while maintaining discrete legality.
Overall, Mask-DDPM provides a reproducible foundation for generating shareable, semantically valid ICS feature sequences suitable for data augmentation, benchmarking, and downstream packet/trace reconstruction workflows. Building on this capability, a natural next step is to move from purely legal synthesis toward controllable scenario construction, including structured attack/violation injection under engineering constraints to support adversarial evaluation and more comprehensive security benchmarks.
% References
\bibliographystyle{unsrtnat}
\bibliography{references}
\end{document}

555
arxiv-style/references.bib Normal file

@@ -0,0 +1,555 @@
@inproceedings{vaswani2017attention,
title={Attention Is All You Need},
author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, {\L}ukasz and Polosukhin, Illia},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
volume={30},
year={2017},
url={https://arxiv.org/abs/1706.03762}
}
@inproceedings{ho2020denoising,
title={Denoising Diffusion Probabilistic Models},
author={Ho, Jonathan and Jain, Ajay and Abbeel, Pieter},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
volume={33},
pages={6840--6851},
year={2020},
url={https://arxiv.org/abs/2006.11239}
}
@inproceedings{austin2021structured,
title={Structured Denoising Diffusion Models in Discrete State-Spaces},
author={Austin, Jacob and Johnson, Daniel D and Ho, Jonathan and Tarlow, Daniel and van den Berg, Rianne},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
volume={34},
pages={17981--17993},
year={2021},
url={https://arxiv.org/abs/2107.03006}
}
@article{shi2024simplified,
title={Simplified and Generalized Masked Diffusion for Discrete Data},
author={Shi, Jiaxin and Han, Kehang and Wang, Zhe and Doucet, Arnaud and Titsias, Michalis K.},
journal={arXiv preprint},
eprint={2406.04329},
archivePrefix={arXiv},
year={2024},
url={https://arxiv.org/abs/2406.04329}
}
@inproceedings{hang2023efficient,
title={Efficient Diffusion Training via Min-SNR Weighting Strategy},
author={Hang, Tiankai and Gu, Shuyang and Li, Chen and Bao, Jianmin and Chen, Dong and Hu, Han and Geng, Xin and Guo, Baining},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
pages={7407--7417},
year={2023},
doi={10.1109/ICCV51070.2023.00702},
url={https://arxiv.org/abs/2303.09556}
}
@inproceedings{kollovieh2023tsdiff,
title={Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting},
author={Kollovieh, Marcel and Ansari, Abdul Fatir and Bohlke-Schneider, Michael and Salinas, David},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
volume={36},
year={2023},
url={https://arxiv.org/abs/2307.11494}
}
@article{sikder2023transfusion,
title={TransFusion: Generating Long, High Fidelity Time Series using Diffusion Models with Transformers},
author={Sikder, M. F. and Ramachandranpillai, R. and Heintz, F.},
journal={arXiv preprint},
eprint={2307.12667},
archivePrefix={arXiv},
year={2023},
url={https://arxiv.org/abs/2307.12667}
}
@inproceedings{song2021score,
title={Score-Based Generative Modeling through Stochastic Differential Equations},
author={Song, Yang and Sohl-Dickstein, Jascha and Kingma, Diederik P and Kumar, Abhishek and Ermon, Stefano and Poole, Ben},
booktitle={International Conference on Learning Representations (ICLR)},
year={2021},
url={https://arxiv.org/abs/2011.13456}
}
@inproceedings{shi2025tabdiff,
title={TabDiff: A Mixed-type Diffusion Model for Tabular Data Generation},
author={Shi, Juntong and Xu, Minkai and Hua, Harper and Zhang, Hengrui and Ermon, Stefano and Leskovec, Jure},
booktitle={International Conference on Learning Representations (ICLR)},
year={2025},
url={https://arxiv.org/abs/2410.20626}
}
@inproceedings{yuan2025ctu,
title={CTU-DDPM: Generating Industrial Control System Time-Series Data with a CNN-Transformer Hybrid Diffusion Model},
author={Yuan, Yusong and Sha, Yun and Zhao, Wei and Zhang, Kun},
booktitle={Proceedings of the 2025 International Symposium on Artificial Intelligence and Computational Social Sciences (ACM AICSS)},
pages={123--132},
year={2025},
doi={10.1145/3776759.3776845},
url={https://dl.acm.org/doi/10.1145/3776759.3776845}
}
@misc{sha2026ddpm,
title={DDPM Fusing Mamba and Adaptive Attention: An Augmentation Method for Industrial Control Systems Anomaly Data},
author={Sha, Yun and Yuan, Yusong and Wu, Yonghao and Zhao, Haidong},
year={2026},
month={jan},
note={SSRN Electronic Journal},
eprint={6055903},
archivePrefix={SSRN},
doi={10.2139/ssrn.6055903},
url={https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6055903}
}
@techreport{nist2023sp80082,
title={Guide to Operational Technology (OT) Security},
author={{National Institute of Standards and Technology}},
institution={NIST},
type={Special Publication},
number={800-82 Rev. 3},
year={2023},
month={sep},
doi={10.6028/NIST.SP.800-82r3},
url={https://csrc.nist.gov/pubs/sp/800/82/r3/final}
}
Reference for Introduction Part
@article{10.1007/s10844-022-00753-1,
author = {Koay, Abigail M. Y. and Ko, Ryan K. L and Hettema, Hinne and Radke, Kenneth},
title = {Machine learning in industrial control system (ICS) security: current landscape, opportunities and challenges},
year = {2022},
issue_date = {Apr 2023},
publisher = {Kluwer Academic Publishers},
address = {USA},
volume = {60},
number = {2},
issn = {0925-9902},
url = {https://doi.org/10.1007/s10844-022-00753-1},
doi = {10.1007/s10844-022-00753-1},
abstract = {The advent of Industry 4.0 has led to a rapid increase in cyber attacks on industrial systems and processes, particularly on Industrial Control Systems (ICS). These systems are increasingly becoming prime targets for cyber criminals and nation-states looking to extort large ransoms or cause disruptions due to their ability to cause devastating impact whenever they cease working or malfunction. Although myriads of cyber attack detection systems have been proposed and developed, these detection systems still face many challenges that are typically not found in traditional detection systems. Motivated by the need to better understand these challenges to improve current approaches, this paper aims to (1) understand the current vulnerability landscape in ICS, (2) survey current advancements of Machine Learning (ML) based methods with respect to the usage of ML base classifiers (3) provide insights to benefits and limitations of recent advancement with respect to two performance vectors; detection accuracy and attack variety. Based on our findings, we present key open challenges which will represent exciting research opportunities for the research community.},
journal = {J. Intell. Inf. Syst.},
month = oct,
pages = {377--405},
numpages = {29},
keywords = {Operational technology, Cyber security, Dataset, Industrial control systems, Machine learning, Critical infrastructure}
}
@ARTICLE{Nankya2023-gp,
title = "Securing industrial Control Systems: Components, cyber threats,
and machine learning-driven defense strategies",
author = "Nankya, Mary and Chataut, Robin and Akl, Robert",
abstract = "Industrial Control Systems (ICS), which include Supervisory
Control and Data Acquisition (SCADA) systems, Distributed
Control Systems (DCS), and Programmable Logic Controllers (PLC),
play a crucial role in managing and regulating industrial
processes. However, ensuring the security of these systems is of
utmost importance due to the potentially severe consequences of
cyber attacks. This article presents an overview of ICS
security, covering its components, protocols, industrial
applications, and performance aspects. It also highlights the
typical threats and vulnerabilities faced by these systems.
Moreover, the article identifies key factors that influence the
design decisions concerning control, communication, reliability,
and redundancy properties of ICS, as these are critical in
determining the security needs of the system. The article
outlines existing security countermeasures, including network
segmentation, access control, patch management, and security
monitoring. Furthermore, the article explores the integration of
machine learning techniques to enhance the cybersecurity of ICS.
Machine learning offers several advantages, such as anomaly
detection, threat intelligence analysis, and predictive
maintenance. However, combining machine learning with other
security measures is essential to establish a comprehensive
defense strategy for ICS. The article also addresses the
challenges associated with existing measures and provides
recommendations for improving ICS security. This paper becomes a
valuable reference for researchers aiming to make meaningful
contributions within the constantly evolving ICS domain by
providing an in-depth examination of the present state,
challenges, and potential future advancements.",
journal = "Sensors (Basel)",
publisher = "MDPI AG",
volume = 23,
number = 21,
pages = "8840",
month = oct,
year = 2023,
keywords = "SCADA; anomaly detection; artificial intelligence; attacks;
cyber defense; cyber threats; industrial control systems;
security; vulnerabilities",
copyright = "https://creativecommons.org/licenses/by/4.0/",
language = "en"
}
@misc{shin,
title = {HAI Security Dataset},
url = {https://www.kaggle.com/dsv/5821622},
doi = {10.34740/kaggle/dsv/5821622},
publisher = {Kaggle},
author = {Shin, Hyeok-Ki and Lee, Woomyo and Choi, Seungoh and Yun, Jeong-Han and Min, Byung Gil and Kim, HyoungChun},
year = {2023}
}
@Article{info16100910,
AUTHOR = {Ali, Jokha and Ali, Saqib and Al Balushi, Taiseera and Nadir, Zia},
TITLE = {Intrusion Detection in Industrial Control Systems Using Transfer Learning Guided by Reinforcement Learning},
JOURNAL = {Information},
VOLUME = {16},
YEAR = {2025},
NUMBER = {10},
ARTICLE-NUMBER = {910},
URL = {https://www.mdpi.com/2078-2489/16/10/910},
ISSN = {2078-2489},
ABSTRACT = {Securing Industrial Control Systems (ICSs) is critical, but it is made challenging by the constant evolution of cyber threats and the scarcity of labeled attack data in these specialized environments. Standard intrusion detection systems (IDSs) often fail to adapt when transferred to new networks with limited data. To address this, this paper introduces an adaptive intrusion detection framework that combines a hybrid Convolutional Neural Network and Long Short-Term Memory (CNN-LSTM) model with a novel transfer learning strategy. We employ a Reinforcement Learning (RL) agent to intelligently guide the fine-tuning process, which allows the IDS to dynamically adjust its parameters such as layer freezing and learning rates in real-time based on performance feedback. We evaluated our system in a realistic data-scarce scenario using only 50 labeled training samples. Our RL-Guided model achieved a final F1-score of 0.9825, significantly outperforming a standard neural fine-tuning model (0.861) and a target baseline model (0.759). Analysis of the RL agents behavior confirmed that it learned a balanced and effective policy for adapting the model to the target domain. We conclude that the proposed RL-guided approach creates a highly accurate and adaptive IDS that overcomes the limitations of static transfer learning methods. This dynamic fine-tuning strategy is a powerful and promising direction for building resilient cybersecurity defenses for critical infrastructure.},
DOI = {10.3390/info16100910}
}
@InProceedings{pmlr-v202-kotelnikov23a,
title = {{T}ab{DDPM}: Modelling Tabular Data with Diffusion Models},
author = {Kotelnikov, Akim and Baranchuk, Dmitry and Rubachev, Ivan and Babenko, Artem},
booktitle = {Proceedings of the 40th International Conference on Machine Learning},
pages = {17564--17579},
year = {2023},
editor = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
volume = {202},
series = {Proceedings of Machine Learning Research},
month = {23--29 Jul},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v202/kotelnikov23a/kotelnikov23a.pdf},
url = {https://proceedings.mlr.press/v202/kotelnikov23a.html},
abstract = {Denoising diffusion probabilistic models are becoming the leading generative modeling paradigm for many important data modalities. Being the most prevalent in the computer vision community, diffusion models have recently gained some attention in other domains, including speech, NLP, and graph-like data. In this work, we investigate if the framework of diffusion models can be advantageous for general tabular problems, where data points are typically represented by vectors of heterogeneous features. The inherent heterogeneity of tabular data makes it quite challenging for accurate modeling since the individual features can be of a completely different nature, i.e., some of them can be continuous and some can be discrete. To address such data types, we introduce TabDDPM — a diffusion model that can be universally applied to any tabular dataset and handles any feature types. We extensively evaluate TabDDPM on a wide set of benchmarks and demonstrate its superiority over existing GAN/VAE alternatives, which is consistent with the advantage of diffusion models in other fields.}
}
@misc{rasul2021autoregressivedenoisingdiffusionmodels,
title={Autoregressive Denoising Diffusion Models for Multivariate Probabilistic Time Series Forecasting},
author={Kashif Rasul and Calvin Seward and Ingmar Schuster and Roland Vollgraf},
year={2021},
eprint={2101.12072},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2101.12072},
}
@misc{jiang2023netdiffusionnetworkdataaugmentation,
title={NetDiffusion: Network Data Augmentation Through Protocol-Constrained Traffic Generation},
author={Xi Jiang and Shinan Liu and Aaron Gember-Jacobson and Arjun Nitin Bhagoji and Paul Schmitt and Francesco Bronzino and Nick Feamster},
year={2023},
eprint={2310.08543},
archivePrefix={arXiv},
primaryClass={cs.NI},
url={https://arxiv.org/abs/2310.08543},
}
Reference for Related Work
@inproceedings{10.1145/1159913.1159928,
author = {Vishwanath, Kashi Venkatesh and Vahdat, Amin},
title = {Realistic and responsive network traffic generation},
year = {2006},
isbn = {1595933085},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1159913.1159928},
doi = {10.1145/1159913.1159928},
abstract = {This paper presents Swing, a closed-loop, network-responsive traffic generator that accurately captures the packet interactions of a range of applications using a simple structural model. Starting from observed traffic at a single point in the network, Swing automatically extracts distributions for user, application, and network behavior. It then generates live traffic corresponding to the underlying models in a network emulation environment running commodity network protocol stacks. We find that the generated traces are statistically similar to the original traces. Further, to the best of our knowledge, we are the first to reproduce burstiness in traffic across a range of timescales using a model applicable to a variety of network settings. An initial sensitivity analysis reveals the importance of capturing and recreating user, application, and network characteristics to accurately reproduce such burstiness. Finally, we explore Swing's ability to vary user characteristics, application properties, and wide-area network conditions to project traffic characteristics into alternate scenarios.},
booktitle = {Proceedings of the 2006 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications},
pages = {111--122},
numpages = {12},
keywords = {burstiness, energy plot, generator, internet, modeling, structural model, traffic, wavelets},
location = {Pisa, Italy},
series = {SIGCOMM '06}
}
@article{Ring_2019,
title={Flow-based network traffic generation using Generative Adversarial Networks},
volume={82},
ISSN={0167-4048},
url={http://dx.doi.org/10.1016/j.cose.2018.12.012},
DOI={10.1016/j.cose.2018.12.012},
journal={Computers \& Security},
publisher={Elsevier BV},
author={Ring, Markus and Schlör, Daniel and Landes, Dieter and Hotho, Andreas},
year={2019},
month=may, pages={156--172}
}
@inproceedings{10.1145/3544216.3544251,
author = {Yin, Yucheng and Lin, Zinan and Jin, Minhao and Fanti, Giulia and Sekar, Vyas},
title = {Practical GAN-based synthetic IP header trace generation using NetShare},
year = {2022},
isbn = {9781450394208},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3544216.3544251},
doi = {10.1145/3544216.3544251},
abstract = {We explore the feasibility of using Generative Adversarial Networks (GANs) to automatically learn generative models to generate synthetic packet- and flow header traces for networking tasks (e.g., telemetry, anomaly detection, provisioning). We identify key fidelity, scalability, and privacy challenges and tradeoffs in existing GAN-based approaches. By synthesizing domain-specific insights with recent advances in machine learning and privacy, we identify design choices to tackle these challenges. Building on these insights, we develop an end-to-end framework, NetShare. We evaluate NetShare on six diverse packet header traces and find that: (1) across all distributional metrics and traces, it achieves 46\% more accuracy than baselines and (2) it meets users' requirements of downstream tasks in evaluating accuracy and rank ordering of candidate approaches.},
booktitle = {Proceedings of the ACM SIGCOMM 2022 Conference},
pages = {458--472},
numpages = {15},
keywords = {synthetic data generation, privacy, network packets, network flows, generative adversarial networks},
location = {Amsterdam, Netherlands},
series = {SIGCOMM '22}
}
@inproceedings{Lin_2020, series={IMC '20},
title={Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions},
url={http://dx.doi.org/10.1145/3419394.3423643},
DOI={10.1145/3419394.3423643},
booktitle={Proceedings of the ACM Internet Measurement Conference},
publisher={ACM},
author={Lin, Zinan and Jain, Alankar and Wang, Chen and Fanti, Giulia and Sekar, Vyas},
year={2020},
month=oct, pages={464--483},
collection={IMC '20}
}
@INPROCEEDINGS{7469060,
author={Mathur, Aditya P. and Tippenhauer, Nils Ole},
booktitle={2016 International Workshop on Cyber-physical Systems for Smart Water Networks (CySWater)},
title={SWaT: a water treatment testbed for research and training on ICS security},
year={2016},
volume={},
number={},
pages={31-36},
keywords={Sensors;Actuators;Feeds;Process control;Chemicals;Chemical sensors;Security;Cyber Physical Systems;Industrial Control Systems;Cyber Attacks;Cyber Defense;Water Testbed},
doi={10.1109/CySWater.2016.7469060}
}
@inproceedings{10.1145/3055366.3055375,
author = {Ahmed, Chuadhry Mujeeb and Palleti, Venkata Reddy and Mathur, Aditya P.},
title = {WADI: a water distribution testbed for research in the design of secure cyber physical systems},
year = {2017},
isbn = {9781450349758},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3055366.3055375},
doi = {10.1145/3055366.3055375},
abstract = {The architecture of a water distribution testbed (WADI), and on-going research in the design of secure water distribution system is presented. WADI consists of three stages controlled by Programmable Logic Controllers (PLCs) and two stages controlled via Remote Terminal Units (RTUs). Each PLC and RTU uses sensors to estimate the system state and the actuators to effect control. WADI is currently used to (a) conduct security analysis for water distribution networks, (b) experimentally assess detection mechanisms for potential cyber and physical attacks, and (c) understand how the impact of an attack on one CPS could cascade to other connected CPSs. The cascading effects of attacks can be studied in WADI through its connection to two other testbeds, namely for water treatment and power generation and distribution.},
booktitle = {Proceedings of the 3rd International Workshop on Cyber-Physical Systems for Smart Water Networks},
pages = {25--28},
numpages = {4},
keywords = {attack detection, cyber physical systems, cyber security, industrial control systems, water distribution testbed},
location = {Pittsburgh, Pennsylvania},
series = {CySWATER '17}
}
@misc{tashiro2021csdiconditionalscorebaseddiffusion,
title={CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation},
author={Yusuke Tashiro and Jiaming Song and Yang Song and Stefano Ermon},
year={2021},
eprint={2107.03502},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={httpsarxiv.orgabs2107.03502},
}
@misc{wen2024diffstgprobabilisticspatiotemporalgraph,
title={DiffSTG: Probabilistic Spatio-Temporal Graph Forecasting with Denoising Diffusion Models},
author={Haomin Wen and Youfang Lin and Yutong Xia and Huaiyu Wan and Qingsong Wen and Roger Zimmermann and Yuxuan Liang},
year={2024},
eprint={2301.13629},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2301.13629},
}
@misc{liu2023pristiconditionaldiffusionframework,
title={PriSTI: A Conditional Diffusion Framework for Spatiotemporal Imputation},
author={Mingzhe Liu and Han Huang and Hao Feng and Leilei Sun and Bowen Du and Yanjie Fu},
year={2023},
eprint={2302.09746},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2302.09746},
}
@misc{kong2021diffwaveversatilediffusionmodel,
title={DiffWave: A Versatile Diffusion Model for Audio Synthesis},
author={Zhifeng Kong and Wei Ping and Jiaji Huang and Kexin Zhao and Bryan Catanzaro},
year={2021},
eprint={2009.09761},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2009.09761},
}
@ARTICLE{11087622,
author={Liu, Xiaosi and Xu, Xiaowen and Liu, Zhidan and Li, Zhenjiang and Wu, Kaishun},
journal={IEEE Transactions on Mobile Computing},
title={Spatio-Temporal Diffusion Model for Cellular Traffic Generation},
year={2026},
volume={25},
number={1},
pages={257-271},
keywords={Base stations;Diffusion models;Data models;Uncertainty;Predictive models;Generative adversarial networks;Knowledge graphs;Mobile computing;Telecommunication traffic;Semantics;Cellular traffic;data generation;diffusion model;spatio-temporal graph},
doi={10.1109/TMC.2025.3591183}
}
@misc{hoogeboom2021argmaxflowsmultinomialdiffusion,
title={Argmax Flows and Multinomial Diffusion: Learning Categorical Distributions},
author={Emiel Hoogeboom and Didrik Nielsen and Priyank Jaini and Patrick Forré and Max Welling},
year={2021},
eprint={2102.05379},
archivePrefix={arXiv},
primaryClass={stat.ML},
url={https://arxiv.org/abs/2102.05379},
}
@misc{li2022diffusionlmimprovescontrollabletext,
title={Diffusion-LM Improves Controllable Text Generation},
author={Xiang Lisa Li and John Thickstun and Ishaan Gulrajani and Percy Liang and Tatsunori B. Hashimoto},
year={2022},
eprint={2205.14217},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={httpsarxiv.orgabs2205.14217},
}
@misc{meng2025aflnetyearslatercoverageguided,
title={AFLNet Five Years Later: On Coverage-Guided Protocol Fuzzing},
author={Ruijie Meng and Van-Thuan Pham and Marcel Böhme and Abhik Roychoudhury},
year={2025},
eprint={2412.20324},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2412.20324},
}
@misc{godefroid2017learnfuzzmachinelearninginput,
title={Learn\&Fuzz: Machine Learning for Input Fuzzing},
author={Patrice Godefroid and Hila Peleg and Rishabh Singh},
year={2017},
eprint={1701.07232},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/1701.07232},
}
@misc{she2019neuzzefficientfuzzingneural,
title={NEUZZ: Efficient Fuzzing with Neural Program Smoothing},
author={Dongdong She and Kexin Pei and Dave Epstein and Junfeng Yang and Baishakhi Ray and Suman Jana},
year={2019},
eprint={1807.05620},
archivePrefix={arXiv},
primaryClass={cs.CR},
url={https://arxiv.org/abs/1807.05620},
}
@misc{dai2019transformerxlattentivelanguagemodels,
title={Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context},
author={Zihang Dai and Zhilin Yang and Yiming Yang and Jaime Carbonell and Quoc V. Le and Ruslan Salakhutdinov},
year={2019},
eprint={1901.02860},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/1901.02860},
}
@misc{zhou2021informerefficienttransformerlong,
title={Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting},
author={Haoyi Zhou and Shanghang Zhang and Jieqi Peng and Shuai Zhang and Jianxin Li and Hui Xiong and Wancai Zhang},
year={2021},
eprint={2012.07436},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2012.07436},
}
@misc{wu2022autoformerdecompositiontransformersautocorrelation,
title={Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting},
author={Haixu Wu and Jiehui Xu and Jianmin Wang and Mingsheng Long},
year={2022},
eprint={2106.13008},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2106.13008},
}
@misc{zhou2022fedformerfrequencyenhanceddecomposed,
title={FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting},
author={Tian Zhou and Ziqing Ma and Qingsong Wen and Xue Wang and Liang Sun and Rong Jin},
year={2022},
eprint={2201.12740},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2201.12740},
}
@article{2023,
title={A Note on Extremal Sombor Indices of Trees with a Given Degree Sequence},
volume={90},
ISSN={0340-6253},
url={http://dx.doi.org/10.46793/match.90-1.197D},
DOI={10.46793/match.90-1.197d},
number={1},
journal={Match Communications in Mathematical and in Computer Chemistry},
publisher={University Library in Kragujevac},
author={Damjanović, Ivan and Milošević, Marko and Stevanović, Dragan},
year={2023},
pages={197--202}
}
Reference for Benchmark
@article{stenger2024survey,
title={Evaluation is key: a survey on evaluation measures for synthetic time series},
author={Stenger, Michael and Leppich, Robert and Foster, Ian T and Kounev, Samuel and Bauer, Andre},
journal={Journal of Big Data},
volume={11},
number={1},
pages={66},
year={2024},
publisher={Springer}
}
@article{lin1991divergence,
title={Divergence measures based on the Shannon entropy},
author={Lin, Jianhua},
journal={IEEE Transactions on Information Theory},
volume={37},
number={1},
pages={145--151},
year={1991}
}
@inproceedings{yoon2019timegan,
title={Time-series generative adversarial networks},
author={Yoon, Jinsung and Jarrett, Daniel and van der Schaar, Mihaela},
booktitle={Advances in Neural Information Processing Systems},
volume={32},
year={2019}
}
@article{ni2021sigwasserstein,
title={Sig-Wasserstein GANs for time series generation},
author={Ni, Hao and Szpruch, Lukasz and Wiese, Magnus and Liao, Shujian and Xiao, Baoren},
journal={Proceedings of the ACM on Measurement and Analysis of Computing Systems},
volume={5},
number={3},
pages={1--25},
year={2021}
}
@inproceedings{coletta2023constrained,
title={On the constrained time-series generation problem},
author={Coletta, Alessandro and Rossi, Roberto and others},
booktitle={Advances in Neural Information Processing Systems},
volume={36},
year={2023}
}
@article{yang2001interlock,
title={Automatic verification of safety interlock systems for industrial processes},
author={Yang, Sheng-Hong and Hsieh, Min-Chi},
journal={Journal of Loss Prevention in the Process Industries},
volume={14},
number={6},
pages={473--483},
year={2001},
publisher={Elsevier}
}

214
arxiv-style/template.tex Normal file

@@ -0,0 +1,214 @@
\documentclass{article}
\usepackage{arxiv}
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc} % use 8-bit T1 fonts
\usepackage{hyperref} % hyperlinks
\usepackage{url} % simple URL typesetting
\usepackage{booktabs} % professional-quality tables
\usepackage{amsfonts} % blackboard math symbols
\usepackage{nicefrac} % compact symbols for 1/2, etc.
\usepackage{microtype} % microtypography
\usepackage{cleveref} % smart cross-referencing
\usepackage{lipsum} % Can be removed after putting your text content
\usepackage{graphicx}
\usepackage{natbib}
\usepackage{doi}
\title{A template for the \emph{arxiv} style}
% Here you can change the date presented in the paper title
%\date{September 9, 1985}
% Or remove it
%\date{}
\newif\ifuniqueAffiliation
% Comment to use multiple affiliations variant of author block
\uniqueAffiliationtrue
\ifuniqueAffiliation % Standard variant of author block
\author{ \href{https://orcid.org/0000-0000-0000-0000}{\includegraphics[scale=0.06]{orcid.pdf}\hspace{1mm}David S.~Hippocampus}\thanks{Use footnote for providing further
information about author (webpage, alternative
address)---\emph{not} for acknowledging funding agencies.} \\
Department of Computer Science\\
Cranberry-Lemon University\\
Pittsburgh, PA 15213 \\
\texttt{hippo@cs.cranberry-lemon.edu} \\
%% examples of more authors
\And
\href{https://orcid.org/0000-0000-0000-0000}{\includegraphics[scale=0.06]{orcid.pdf}\hspace{1mm}Elias D.~Striatum} \\
Department of Electrical Engineering\\
Mount-Sheikh University\\
Santa Narimana, Levand \\
\texttt{stariate@ee.mount-sheikh.edu} \\
%% \AND
%% Coauthor \\
%% Affiliation \\
%% Address \\
%% \texttt{email} \\
%% \And
%% Coauthor \\
%% Affiliation \\
%% Address \\
%% \texttt{email} \\
%% \And
%% Coauthor \\
%% Affiliation \\
%% Address \\
%% \texttt{email} \\
}
\else
% Multiple affiliations variant of author block
\usepackage{authblk}
\renewcommand\Authfont{\bfseries}
\setlength{\affilsep}{0em}
% box is needed for correct spacing with authblk
\newbox{\orcid}\sbox{\orcid}{\includegraphics[scale=0.06]{orcid.pdf}}
\author[1]{%
\href{https://orcid.org/0000-0000-0000-0000}{\usebox{\orcid}\hspace{1mm}David S.~Hippocampus\thanks{\texttt{hippo@cs.cranberry-lemon.edu}}}%
}
\author[1,2]{%
\href{https://orcid.org/0000-0000-0000-0000}{\usebox{\orcid}\hspace{1mm}Elias D.~Striatum\thanks{\texttt{stariate@ee.mount-sheikh.edu}}}%
}
\affil[1]{Department of Computer Science, Cranberry-Lemon University, Pittsburgh, PA 15213}
\affil[2]{Department of Electrical Engineering, Mount-Sheikh University, Santa Narimana, Levand}
\fi
% Uncomment to override the `A preprint' in the header
%\renewcommand{\headeright}{Technical Report}
%\renewcommand{\undertitle}{Technical Report}
\renewcommand{\shorttitle}{\textit{arXiv} Template}
%%% Add PDF metadata to help others organize their library
%%% Once the PDF is generated, you can check the metadata with
%%% $ pdfinfo template.pdf
\hypersetup{
pdftitle={A template for the arxiv style},
pdfsubject={q-bio.NC, q-bio.QM},
pdfauthor={David S.~Hippocampus, Elias D.~Striatum},
pdfkeywords={First keyword, Second keyword, More},
}
\begin{document}
\maketitle
\begin{abstract}
\lipsum[1]
\end{abstract}
% keywords can be removed
\keywords{First keyword \and Second keyword \and More}
\section{Introduction}
\lipsum[2]
\lipsum[3]
\section{Headings: first level}
\label{sec:headings}
\lipsum[4] See Section \ref{sec:headings}.
\subsection{Headings: second level}
\lipsum[5]
\begin{equation}
\xi _{ij}(t)=P(x_{t}=i,x_{t+1}=j|y,v,w;\theta)= {\frac {\alpha _{i}(t)a^{w_t}_{ij}\beta _{j}(t+1)b^{v_{t+1}}_{j}(y_{t+1})}{\sum _{i=1}^{N} \sum _{j=1}^{N} \alpha _{i}(t)a^{w_t}_{ij}\beta _{j}(t+1)b^{v_{t+1}}_{j}(y_{t+1})}}
\end{equation}
\subsubsection{Headings: third level}
\lipsum[6]
\paragraph{Paragraph}
\lipsum[7]
\section{Examples of citations, figures, tables, references}
\label{sec:others}
\subsection{Citations}
Citations use \verb+natbib+. The documentation may be found at
\begin{center}
\url{http://mirrors.ctan.org/macros/latex/contrib/natbib/natnotes.pdf}
\end{center}
Here is an example usage of the two main commands (\verb+citet+ and \verb+citep+): Some people thought a thing \citep{kour2014real, keshet2016prediction} but other people thought something else \citep{kour2014fast}. Many people have speculated that if we knew exactly why \citet{kour2014fast} thought this\dots
\subsection{Figures}
\lipsum[10]
See Figure \ref{fig:fig1}. Here is how you add footnotes. \footnote{Sample of the first footnote.}
\lipsum[11]
\begin{figure}
\centering
\fbox{\rule[-.5cm]{4cm}{4cm} \rule[-.5cm]{4cm}{0cm}}
\caption{Sample figure caption.}
\label{fig:fig1}
\end{figure}
\subsection{Tables}
See awesome Table~\ref{tab:table}.
The documentation for \verb+booktabs+ (`Publication quality tables in LaTeX') is available from:
\begin{center}
\url{https://www.ctan.org/pkg/booktabs}
\end{center}
\begin{table}
\caption{Sample table title}
\centering
\begin{tabular}{lll}
\toprule
\multicolumn{2}{c}{Part} \\
\cmidrule(r){1-2}
Name & Description & Size ($\mu$m) \\
\midrule
Dendrite & Input terminal & $\sim$100 \\
Axon & Output terminal & $\sim$10 \\
Soma & Cell body & up to $10^6$ \\
\bottomrule
\end{tabular}
\label{tab:table}
\end{table}
\subsection{Lists}
\begin{itemize}
\item Lorem ipsum dolor sit amet
\item consectetur adipiscing elit.
\item Aliquam dignissim blandit est, in dictum tortor gravida eget. In ac rutrum magna.
\end{itemize}
\bibliographystyle{unsrtnat}
\bibliography{references} %%% Uncomment this line and comment out the ``thebibliography'' section below to use the external .bib file (using bibtex) .
%%% Uncomment this section and comment out the \bibliography{references} line above to use inline references.
% \begin{thebibliography}{1}
% \bibitem{kour2014real}
% George Kour and Raid Saabne.
% \newblock Real-time segmentation of on-line handwritten arabic script.
% \newblock In {\em Frontiers in Handwriting Recognition (ICFHR), 2014 14th
% International Conference on}, pages 417--422. IEEE, 2014.
% \bibitem{kour2014fast}
% George Kour and Raid Saabne.
% \newblock Fast classification of handwritten on-line arabic characters.
% \newblock In {\em Soft Computing and Pattern Recognition (SoCPaR), 2014 6th
% International Conference of}, pages 312--318. IEEE, 2014.
% \bibitem{keshet2016prediction}
% Keshet, Renato, Alina Maor, and George Kour.
% \newblock Prediction-Based, Prioritized Market-Share Insight Extraction.
% \newblock In {\em Advanced Data Mining and Applications (ADMA), 2016 12th International
% Conference of}, pages 81--94,2016.
% \end{thebibliography}
\end{document}

66
copy.cmd Normal file

@@ -0,0 +1,66 @@
@echo off
setlocal
:: ====================================================================
:: Script name: copy_md_files.cmd
:: Purpose:     Walk every subfolder of /papers and copy all .md files
::              into the /papers/md folder.
:: Author:      [your name/company, optional]
:: Date:        2023-10-27
:: Version:     1.0
:: ====================================================================
:: Define source and target directories
set "SOURCE_DIR=%~dp0papers"
set "TARGET_DIR=%~dp0papers\md"
echo.
echo ======================================================
echo Start copying Markdown files
echo ======================================================
echo.
:: Check that the source directory exists
if not exist "%SOURCE_DIR%" (
echo Error: source directory "%SOURCE_DIR%" does not exist.
echo Make sure this script sits next to the "papers" folder, or edit SOURCE_DIR.
goto :eof
)
:: Create the target directory if it does not exist
if not exist "%TARGET_DIR%" (
echo Creating target directory: "%TARGET_DIR%"
mkdir "%TARGET_DIR%"
if errorlevel 1 (
echo Error: failed to create target directory "%TARGET_DIR%"
goto :eof
)
) else (
echo Target directory "%TARGET_DIR%" already exists.
)
echo.
echo Scanning "%SOURCE_DIR%" and its subdirectories for .md files...
echo.
:: Walk all subfolders and copy every .md file.
:: Note: the switches often quoted here (/s /i /y /d /h) belong to xcopy;
:: this script instead recurses with "for /r" and overwrites with plain "copy".
for /r "%SOURCE_DIR%" %%f in (*.md) do (
echo Copying: "%%f"
copy "%%f" "%TARGET_DIR%" >nul
if errorlevel 1 (
echo Warning: failed to copy file "%%f".
)
)
echo.
echo ======================================================
echo Markdown file copy complete
echo ======================================================
echo.
endlocal
pause


@@ -0,0 +1,429 @@
# Applying Diffusion Models to Modbus
## 🧑‍💻 User
Please introduce the content of this paper in detail, and also explain the Diffusion model in detail.
---
## 🤖 Assistant
Hello. I am an AI engineering expert. Based on the IEEE Transactions on Mobile Computing (2026) paper you provided, **"Spatio-Temporal Diffusion Model for Cellular Traffic Generation"**, I will give you an in-depth technical reading.
The paper proposes an innovative model named **STOUTER**, which aims to solve the core difficulty of generating large-scale cellular traffic data: how to capture complex spatio-temporal patterns precisely while also simulating traffic uncertainty.
Below is a detailed reading of the paper and a deep dive into diffusion model technology.
---
### Part 1: Deep dive into the STOUTER paper
#### 1. Background and pain points
* **Background**: 5G/6G network planning (e.g., base-station siting, resource allocation) needs massive amounts of traffic data.
* **Pain points**:
* **Data privacy and access**: operator data is sensitive and hard to obtain publicly.
* **Limitations of existing generative models**: traditional GANs are prone to **mode collapse** in large-scale generation, and the generated samples lack diversity.
* **Complex spatio-temporal characteristics**: cellular traffic exhibits long- and short-term periodicity (daily/weekly patterns), spatial heterogeneity (different base-station loads across regions), and high uncertainty (traffic bursts).
#### 2. Core solution: the STOUTER model
The paper proposes **STOUTER** (Spatio-Temporal diffusiOn model for cellUlar Traffic genERation), a diffusion-based generative framework built from three core modules:
##### A. Spatio-Temporal Graph Modeling
To let the model "understand" the spatio-temporal regularities of traffic, the authors build two graphs:
1. **Temporal Graph**:
* **Nodes**: the 24 hours of a day.
* **Edges**: connect adjacent hours (short-term dependence) and the same hour on different days of the week (long-term periodicity, e.g., Monday 9 am and Tuesday 9 am).
* **Role**: capture long- and short-term periodic patterns.
2. **Base Station Graph**:
* **Nodes**: base stations.
* **Edges**: connected by geographic distance (e.g., distance < 1 km).
* **Features**: fuse POI (point-of-interest) information (education, healthcare, dining, etc.), since regions with different functions have completely different traffic patterns.
* **Role**: capture spatial correlation and functional differences between regions.
A **graph autoencoder** pretrains both graphs and extracts latent spatio-temporal feature embeddings.
##### B. Diffusion-Based Generation
This is the core engine of the model (the diffusion principle is explained in detail below).
* **Forward process**: Gaussian noise is gradually added to the real traffic data until it becomes pure noise.
* **Reverse process**: starting from pure noise, a neural network denoises step by step to recover realistic traffic data.
* **Innovation**: the stochastic sampling of diffusion models is used to reproduce the **uncertainty patterns** of cellular traffic.
##### C. Spatio-Temporal Feature Fusion Module (STFFM) and STUnet
Without guidance, the reverse denoising process would only produce random noise. The authors therefore design **STUnet** (Spatio-Temporal UNet) as the denoising network:
* **STFFM (Spatio-Temporal Feature Fusion Module)**: fuses the pretrained temporal-graph and base-station-graph features and, together with the diffusion time step, injects them into the network as conditions.
* **Mechanism**: guides denoising toward outputs that follow real spatio-temporal regularities, so the generated data are both stochastic (diverse) and consistent with the spatio-temporal logic of the physical world.
#### 3. Experimental results
* **Datasets**: real telecom data from Milan and Trentino, Italy.
* **Performance**: on Jensen-Shannon divergence (JSD, a distribution-similarity metric), STOUTER improves over existing GAN and VAE methods by **52.77%**.
* **Conclusion**: the generated data are highly realistic in long-term trend, short-term fluctuation, and spatial distribution, and can directly be used to train traffic-forecasting models (downstream tasks).
---
### Part 2: Diffusion models explained
As an AI engineer, I believe understanding diffusion models is the key to understanding this paper and today's generative AI (e.g., Stable Diffusion, Sora).
#### 1. What is a diffusion model?
Diffusion models are a family of probabilistic generative models. They are inspired by the diffusion phenomenon in thermodynamics (e.g., a drop of ink in water gradually spreads until it is uniformly distributed).
In AI, this process is split into two phases:
* **Forward diffusion (destroying data)**: systematically corrupt the data until it becomes random noise.
* **Reverse diffusion (creating data)**: learn to invert the corruption and reconstruct data from noise.
#### 2. Technical principles
##### A. Forward process (diffusion process)
Suppose we have real data $x_0$ (in this paper, the real traffic matrix). We define a Markov chain that adds a small amount of Gaussian noise at every step $t$.
* Formula: $q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t I)$
* As the number of steps $T$ grows (e.g., $T=1000$), the original data $x_0$ eventually become standard Gaussian noise $x_T$.
* **Engineer's view**: this process has no learnable parameters; it is a fixed mathematical transformation.
##### B. Reverse process (denoising process)
This is the core of training. Since the forward process simply adds noise, if we can train a neural network $p_\theta$ to predict the noise added at each step, we can subtract it back out.
* **Goal**: starting from $x_T$ (pure noise), predict and remove noise step by step to obtain $x_{T-1}, \dots, x_0$.
* **Role of the network**: the paper's **STUnet** is exactly this network. Its inputs are the current noisy data $x_t$ and the spatio-temporal conditions; its output is the predicted noise $\epsilon_\theta$.
* **Loss**: usually MSE or L1 between the predicted noise and the noise actually added.
#### 3. Why diffusion models beat GANs
In the context of this paper, the advantages of diffusion models are clear:
1. **Training stability**: GANs rely on a generator-discriminator game and easily suffer from vanishing gradients or non-convergence. The diffusion training target (predicting noise) is explicit and stable.
2. **Avoiding mode collapse**: GANs tend to memorize particular samples and produce repetitive data. By estimating the whole data distribution, diffusion models generate data with broader coverage and more diversity (i.e., a better simulation of traffic **uncertainty**).
3. **High-quality generation**: through the iterative coarse-to-fine denoising process, diffusion models generate richer detail.
#### 4. How it is applied in this paper
The paper cleverly transfers diffusion models from image generation to spatio-temporal sequence generation:
* **Image pixels $\rightarrow$ traffic values**: treat base-station traffic data as a 2-D "image" (time $\times$ space).
* **Conditional guidance**: a plain diffusion model generates unconditionally. STOUTER uses **STFFM** to feed the spatio-temporal graph embeddings as conditions, telling the diffusion model: "do not generate arbitrary noise; generate traffic matching a business district at 9 am on Monday".
---
### Summary and expert comments
**"Spatio-Temporal Diffusion Model for Cellular Traffic Generation"** is a state-of-the-art work combining **graph neural networks (GNNs)** with **generative diffusion models**.
As an expert, I see its core value in two points:
1. **Methodological innovation**: it shows that diffusion models are equally strong for generating non-image spatio-temporal (time-series) data.
2. **Engineering value**: beyond theory, it addresses the practical "no data available" pain point of network planning. Guiding the generation process with physical constraints (the graph structure) guarantees the **physical interpretability** of the generated data, which purely data-driven black-box models cannot match.
The paper represents the latest direction of AI for networking.
---
## 🧑‍💻 User
Now please consider migrating this model to the Modbus protocol to generate valid data: simulate the main features of the packets while precisely capturing the spatio-temporal patterns between packets.
---
## 🤖 Assistant
This is a very challenging but highly valuable engineering transfer problem.
As an AI engineer, migrating **STOUTER** from **macro-level cellular traffic (continuous, statistical data)** to the **micro-level Modbus protocol (discrete, command-style, strongly constrained data)** requires a **semantic reconstruction** at the bottom layer. We cannot simply copy the model; the architecture must be adapted to the characteristics of industrial control systems (ICS).
Modbus data contain not just "traffic volume" but also **command logic (Function Code)**, **register state (Data/Values)**, and **device topology (Master-Slave)**.
Below is the **Modbus-Diffusion migration architecture** I designed for you.
---
### 1. Concept mapping: from Cellular to ICS
First, we redefine the notions of "space" and "time" in STOUTER so that they fit the Modbus environment.
| STOUTER concept | Meaning in cellular networks | **Meaning after migration to Modbus/ICS** |
| :--- | :--- | :--- |
| **Spatial nodes** | Base stations | **Industrial assets/registers** <br> (e.g., the temperature sensor of PLC_1, the valve coil of PLC_2) |
| **Spatial edges** | Physical distance | **Logical/process connections** <br> (e.g., Master-Slave links, or the process coupling between "temperature" and "cooling valve") |
| **POI semantics** | Schools, hospitals, malls | **Device-function semantics** <br> (e.g., sensor, actuator, HMI, alarm) |
| **Temporal cycle** | 24-hour human activity cycle | **Polling cycle & process cycle** <br> (e.g., one production-line takt, or the fixed SCADA query frequency) |
| **Traffic value** | Throughput | **Multi-modal packet features** <br> (inter-arrival time, function code, register address, register value) |
---
### 2. Architecture design: Modbus-STOUTER
To generate valid Modbus data, I suggest splitting the model into three stages: **topology-graph construction**, **spatio-temporal diffusion generation**, and **protocol-compliant reassembly**.
#### 1. Spatial modeling: building a Device-Register Graph
The original paper builds the base-station graph from POIs and distance; for Modbus we instead build an **industrial semantic topology graph**.
* **Node definition**:
* **Coarse-grained**: PLC/RTU devices as nodes.
* **Fine-grained (recommended)**: key registers (Holding Registers/Coils) as nodes, e.g., node A is "boiler temperature", node B is "feed-water valve state".
* **Edge definition**:
* **Communication edges**: the physical Master-to-Slave links.
* **Process edges**: couplings based on the physical process, e.g., when the "temperature register" rises, the "valve register" usually acts. Such couplings can be built automatically by analyzing the cross-correlation of historical data, or manually from P&ID diagrams (a minimal cross-correlation sketch follows this list).
* **Role of graph convolution (GCN)**: learn the **control-logic dependencies** of the plant, e.g., the model learns that "node B's value usually changes with a lag behind node A".
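A minimal sketch of the cross-correlation route, assuming the historical tag values are already aligned into a `(T, N)` array and z-scored per column; the lag window and the 0.6 threshold are illustrative choices, not values taken from the paper:
```python
import numpy as np

def process_adjacency(X, max_lag=30, thresh=0.6):
    """Binary process-edge adjacency from lagged cross-correlation.

    X: array of shape (T, N), one column per tag/register, z-scored per column.
    A[i, j] = 1 if some positive lag makes tag j strongly follow tag i.
    """
    T, N = X.shape
    A = np.zeros((N, N), dtype=int)
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            best = 0.0
            for lag in range(1, max_lag + 1):
                # correlation between tag i now and tag j `lag` steps later
                c = np.corrcoef(X[:-lag, i], X[lag:, j])[0, 1]
                if np.isfinite(c):
                    best = max(best, abs(c))
            A[i, j] = int(best >= thresh)
    return A
```
Manually curated P&ID edges can then be OR-ed into the resulting matrix before it is handed to the GCN.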
#### 2. Temporal modeling: a Cyclic Temporal Graph
Industrial data do not show the obvious morning/evening peaks of human behavior; instead they show **high-frequency periodicity** and **abrupt state changes**.
* **Time encoding**: no longer "hours", but units of the **polling interval** or the **process step**.
* **Long vs. short horizon**:
* **Short-term**: the request-response sequence within one TCP connection.
* **Long-term**: the device's operating-state cycle (e.g., start-up, running, shutdown, maintenance).
#### 3. Modified diffusion core
This is the hardest technical point. Modbus packets contain not only continuous values (e.g., temperature 45.3) but also discrete values (e.g., function codes 0x03, 0x10), while STOUTER natively handles continuous values only.
**Solution: embedding-based diffusion**
* **Input-vector design**: for each time step $t$ and node $i$, the input vector $x_{t,i}$ is no longer a scalar but a **concatenated vector**:
$$x_{t,i} = [\text{Embed}(\text{FunctionCode}) \oplus \text{Embed}(\text{Address}) \oplus \text{Normalize}(\text{Value}) \oplus \text{InterArrivalTime}]$$
* **Hybrid noising**:
* For the continuous parts (Value, Time): standard Gaussian noise.
* For the discrete parts (Func Code, Address): either map them into a continuous latent space before noising, or use **discrete diffusion (e.g., multinomial diffusion)**.
* **STUnet improvement**:
* Add a **Protocol Constraints Head** to the STFFM, used to learn the inherent rules of Modbus (e.g., a read request is normally followed by a response carrying the data).
---
### 3. Key innovation: from "noise" to "valid packets"
A diffusion model outputs tensors, not binary packets (PCAP). To guarantee that the generated Modbus data are not only "realistic" but also "valid", a **post-processing module** is required.
#### **Post-Processing: Deterministic Protocol Assembler**
Do not ask the AI to generate rigid fields such as the TCP checksum or the Protocol ID (the AI will make mistakes and waste capacity). The AI should generate the **semantics**.
1. **AI output**: the diffusion model generates a sequence such as `{timestamp: 10ms, source: Master, target: Slave_1, action: Write, address: 40001, value: 1234}`.
2. **Rule-based assembly**: use Python's `scapy` or `pymodbus` library as the "decoder" (a minimal sketch follows this list).
* Take the AI's semantic output.
* Compute the Length field automatically.
* Fill in the Transaction ID automatically (monotonically increasing).
* Pack everything into a binary stream that conforms to the Modbus TCP specification.
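A minimal sketch of the assembler for one event type (Write Single Register, function code 0x06), using only `struct` so the framing stays explicit; the `semantics` keys mirror the illustrative AI output in step 1 and are an assumed schema, and the 4xxxx-to-wire address mapping is one common convention rather than a universal rule:
```python
import struct

def assemble_write_single_register(semantics, transaction_id):
    """Pack one AI-emitted event into a Modbus/TCP Write Single Register frame."""
    unit_id = semantics.get("unit_id", 1)        # slave/unit identifier
    address = semantics["address"] - 40001       # 4xxxx tag name -> zero-based wire address
    value = semantics["value"] & 0xFFFF
    # PDU: function code 0x06 + register address + register value
    pdu = struct.pack(">BHH", 0x06, address, value)
    # MBAP header: transaction id, protocol id (0), length (= unit id + PDU), unit id
    mbap = struct.pack(">HHHB", transaction_id & 0xFFFF, 0, len(pdu) + 1, unit_id)
    return mbap + pdu

# Example: the illustrative event from step 1 above
frame = assemble_write_single_register({"address": 40001, "value": 1234}, transaction_id=1)
print(frame.hex())  # -> 0001000000060106000004d2
```
Once more function codes are needed, pymodbus or scapy can replace this hand-rolled packing; the point is only that the deterministic layer, not the diffusion model, owns the framing.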
---
### 4. Advantages and potential value of this migration
Compared with traditional rule-based fuzzing or simple replay-attack scripts, this spatio-temporal diffusion approach has major advantages:
1. **Capturing stealthy attack patterns**:
* Traditional attack detection relies on thresholds (e.g., sudden traffic spikes).
* Modbus-STOUTER can simulate **"slow attacks"** or **"logic attacks"**: sequences of packets that fully conform to the protocol but are anomalous in their spatio-temporal logic (e.g., a perfectly legal "open valve" command issued at a moment when the valve should not be opened). This is crucial for training advanced IDS (intrusion detection systems).
2. **Generating high-fidelity digital-twin data**:
* Industrial data are extremely hard to collect (for fear of disrupting production). The model can start from a small sample and generate massive synthetic data with **process-logic correlations** for system stress testing.
3. **Handling packet uncertainty**:
* Industrial networks exhibit jitter and packet loss. The inherently probabilistic nature of diffusion models can reproduce this **random fluctuation** of network conditions, rather than only idealized data.
# PART 2: mzyang thinks the discrete-value handling is questionable
Below I write only the part you asked for: **a detailed description of the two "noising/denoising" methods** (Gaussian diffusion for continuous fields, Mask for discrete fields), in a style you can paste straight into your document.
---
## Method 1: Gaussian diffusion for continuous fields (DDPM-style)
### What it applies to
Continuous fields are quantities that take real values, e.g.:
* `inter_arrival_time` (packet inter-arrival time)
* `value` (read/write values, or some continuous statistic)
* `payload_len` (length can be treated as continuous, or bucketed)
These variables have notions of "magnitude" and "closeness", so gradual Gaussian perturbation is appropriate.
---
### Core idea (intuition)
Gradually add noise to the real continuous data $x_0$ until it becomes pure noise;
then train a model to recover the real data from the noise step by step.
In one sentence: **first scramble the data, then learn to unscramble it.**
---
### Forward process (noising)
The classic DDPM recipe defines time steps $t=1..T$ and adds a little Gaussian noise at each step:
$$
q(x_t \mid x_{t-1}) = \mathcal{N}\big(\sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)
$$
where:
* $\beta_t$ is the noise strength at step $t$ (usually increasing)
* $I$ is the identity matrix
The commonly used closed form (more practical):
$$
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\quad \epsilon\sim \mathcal N(0,I)
$$
where:
* $\alpha_t = 1-\beta_t$
* $\bar{\alpha}_t=\prod_{s=1}^t \alpha_s$
> Intuition: the larger $t$, the larger $\sqrt{1-\bar{\alpha}_t}$, so the noise share grows and $x_t$ gets closer to pure randomness.
---
### Reverse process (denoising / generation)
The reverse direction learns a model $p_\theta$ that step by step turns $x_t$ back into $x_{t-1}$.
The usual training recipe is to let the model predict the noise $\epsilon$ (predicting $x_0$ also works):
$$
\epsilon_\theta = f_\theta(x_t, t, \text{context})
$$
The most common training loss is MSE:
$$
\mathcal L_{\text{cont}} = \mathbb E\big[\|\epsilon - \epsilon_\theta(x_t,t)\|^2\big]
$$
---
### How sampling (generation) works (what you care about most for deployment)
1. Start from Gaussian noise: $x_T \sim \mathcal N(0,I)$
2. For $t=T,T-1,\dots,1$:
* predict the noise $\epsilon_\theta(x_t,t)$ with the model
* apply the DDPM reverse-update formula to obtain $x_{t-1}$
3. The final $x_0$ is the generated continuous feature
> You do not need the full reverse formula in the document; saying "the model obtains continuous features by stepwise denoising" is enough.
---
### Pros & caveats (for reviewers/teammates)
**Pros**
* Continuous variables are usually generated well (distributions, fluctuations, and bursts look more natural)
* Training is stable (MSE is easy to optimize)
**Caveats**
* Continuous fields should be normalized/standardized (e.g., z-score), otherwise the noise scale is hard to choose
* $\beta_t$ (the noise schedule) affects quality; increasing schedules are usually more stable
---
---
## Method 2: Mask noise for discrete fields (mask corruption / discrete denoising)
### What it applies to
Discrete fields are "multiple-choice" quantities, e.g.:
* `func_code` (function code)
* `msg_type` (request/response)
* `direction` (C→S / S→C)
* `exception_code` (exception-code category)
These variables have no notion of "addition" or "intermediate values", so Gaussian noise cannot be added the way it is for continuous ones.
---
### Core idea (intuition)
The "noise" of discrete diffusion is not adding a number, but **covering up / dropping some tokens**:
* Forward: randomly replace some discrete values with `[MASK]`
* Reverse: the model guesses the `[MASK]` back from context (a classification problem)
In one sentence: **first punch holes, then fill them in.**
---
### Forward process (noising = masking)
Define a mask ratio (or mask probability) $m_t$ for each step $t$; usually the larger $t$, the more masking.
For each position of the discrete sequence $x^{\text{disc}}_0$:
* with probability $m_t$: replace it with `[MASK]`
* with probability $1-m_t$: keep the original token
This yields the corrupted $x^{\text{disc}}_t$.
> You can extend this to "mask or randomly replace with another token", but pure masking is clear and easy to implement for a first version.
---
### Reverse process (denoising = guessing tokens)
Model input: the masked discrete sequence + the step $t$ + (optionally) continuous-feature context
Model output: a class distribution (softmax) for every masked position
$$
p_\theta(x^{\text{disc}}_0 \mid x^{\text{disc}}_t, t, \text{context})
$$
The training loss is cross-entropy (more commonly computed only on the masked positions):
$$
\mathcal L_{\text{disc}} = -\sum_{j\in \text{masked}} \log p_\theta(x^{\text{disc}}_{0,j})
$$
> Intuition: the model's job is to fill the punched-out cells back in with the correct answers.
---
### How sampling (generation) works (simple version)
The simplest usable generation strategy:
1. Initialize the discrete sequence as all masks: `[MASK, MASK, ...]`
2. For $t=T,T-1,\dots,1$:
* have the model output softmax probabilities for every masked position
* sample (or take the argmax) to fill in some of the tokens
* as $t$ decreases, the number of masks shrinks (the sequence gets progressively more complete)
The end result is a complete sequence of discrete fields.
> In your document you can simply write "discrete fields are recovered progressively from a fully masked start"; do not worry about exactly how many positions are filled per step.
---
### Pros & caveats (important)
**Pros**
* Naturally correct for discrete fields: it introduces no assumption of a "numeric distance" between codes
* Especially suitable for protocol fields (function code / request-response type), because they can be predicted from context
**Caveats**
* High-cardinality discrete variables (e.g., addresses 0~65535) should not be masked over one giant softmax; that is hard to learn and the data are sparse.
It is better to split them into "region + offset" or bucket them before discrete modeling
* A reasonable mask schedule is needed; do not mask too aggressively at the start, otherwise the model can only guess blindly
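A minimal PyTorch sketch of the mask corruption and the masked-position cross-entropy; `MASK_ID` is assumed to be one extra index appended to the field's vocabulary, and `classifier` is any network returning per-position logits over the real classes:
```python
import torch
import torch.nn.functional as F

MASK_ID = 255  # assumed: one extra "absorbing" index appended to the vocabulary

def mask_corrupt(tokens, m_t):
    """Independently replace each token with MASK_ID with probability m_t."""
    mask = torch.rand(tokens.shape, device=tokens.device) < m_t
    return tokens.masked_fill(mask, MASK_ID), mask

def masked_ce_loss(classifier, tokens, t, m_t, context=None):
    """Cross-entropy computed only on the masked positions, as described above."""
    corrupted, mask = mask_corrupt(tokens, m_t)      # m_t should be > 0 so something is masked
    logits = classifier(corrupted, t, context)       # (batch, seq, vocab) over real classes
    return F.cross_entropy(logits[mask], tokens[mask])
```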
---
---
## How to describe the two methods together (one-sentence version for the document)
For each event's feature vector, we split the fields into:
* continuous fields: generated by Gaussian-diffusion (DDPM) denoising (MSE loss);
* discrete fields: recovered by discrete denoising classification under mask noise (CE loss).
A single shared backbone outputs both a continuous head and a discrete head, and the training objective is the weighted sum of the two:
$$
\mathcal L = \lambda\mathcal L_{\text{cont}} + (1-\lambda)\mathcal L_{\text{disc}}
$$
If you decide to implement this project, I suggest the following steps:
1. **Data preprocessing**: collect Modbus PCAPs and parse them into sequences of the form `(Timestamp, Source, Dest, Function, Register, Value)`.
2. **Graph construction**: build the adjacency matrix from source/destination IPs and register addresses.
3. **Model modification**: change STUnet's input/output layers to support multi-dimensional features (not just a traffic value).
4. **Hybrid training**: train value prediction with an L1 loss and function-code classification with a cross-entropy loss.
5. **Validation**: validate not only JSD (distribution similarity) but also the **Protocol Validity Rate** and **Process Logic Consistency**.
This migration turns a model originally built to predict human internet behavior into a generator that understands machine control logic, which is highly innovative for industrial security and digital twins.


@@ -0,0 +1,189 @@
(Updated from your current draft; no benchmark-metric details are introduced here, as requested.)
## Methodology
Industrial control system (ICS) telemetry is intrinsically **mixed-type** and **mechanistically heterogeneous**: continuous process trajectories (e.g., sensor and actuator signals) coexist with discrete supervisory states (e.g., modes, alarms, interlocks), and the underlying generating mechanisms range from physical inertia to program-driven step logic. This heterogeneity is not cosmetic—it directly affects what “realistic” synthesis means, because a generator must jointly satisfy (i) temporal coherence, (ii) distributional fidelity, and (iii) discrete semantic validity (i.e., every discrete output must belong to its legal vocabulary by construction). These properties are emphasized broadly in operational-technology security guidance and ICS engineering practice, where state logic and physical dynamics are tightly coupled. [12]
We formalize each training instance as a fixed-length window of length (L), consisting of (i) continuous channels (X\in\mathbb{R}^{L\times d_c}) and (ii) discrete channels (Y=\{y^{(j)}_{1:L}\}_{j=1}^{d_d}), where each discrete variable (y^{(j)}_t\in\mathcal{V}_j) belongs to a finite vocabulary (\mathcal{V}_j). Our objective is to learn a generator that produces synthetic ((\hat{X},\hat{Y})) that are simultaneously temporally coherent and distributionally faithful, while also ensuring (\hat{y}^{(j)}_t\in\mathcal{V}_j) for all (j,t) by construction (rather than via post-hoc rounding or thresholding).
A key empirical and methodological tension in ICS synthesis is that *temporal realism* and *marginal/distributional realism* can compete when optimized monolithically: sequence models trained primarily for regression often over-smooth heavy tails and intermittent bursts, while purely distribution-matching objectives can erode long-range structure. Diffusion models provide a principled route to rich distribution modeling through iterative denoising, but they do not, by themselves, resolve (i) the need for a stable low-frequency temporal scaffold, nor (ii) the discrete legality constraints for supervisory variables. [2,8] Recent time-series diffusion work further suggests that separating coarse structure from stochastic refinement can be an effective inductive bias for long-horizon realism. [6,7]
Motivated by these considerations, we propose **Mask-DDPM**, organized in the following order:
1. **Transformer trend module**: learns the dominant temporal backbone of continuous dynamics via attention-based sequence modeling. [1]
2. **Residual DDPM for continuous variables**: models distributional detail as stochastic residual structure conditioned on the learned trend. [2,6]
3. **Masked diffusion for discrete variables**: generates discrete ICS states with an absorbing/masking corruption process and categorical reconstruction. [3,4]
4. **Type-aware decomposition**: a **type-aware factorization and routing layer** that assigns variables to the most appropriate modeling mechanism and enforces deterministic constraints where warranted.
This ordering is intentional. The trend module establishes a macro-temporal scaffold; residual diffusion then concentrates capacity on micro-structure and marginal fidelity; masked diffusion provides a native mechanism for discrete legality; and the type-aware layer operationalizes the observation that not all ICS variables should be modeled with the same stochastic mechanism. Importantly, while diffusion-based generation for ICS telemetry has begun to emerge, existing approaches remain limited and typically emphasize continuous synthesis or augmentation; in contrast, our pipeline integrates (i) a Transformer-conditioned residual diffusion backbone, (ii) a discrete masked-diffusion branch, and (iii) explicit type-aware routing for heterogeneous variable mechanisms within a single coherent generator. [10,11]
---
## Transformer trend module for continuous dynamics
We instantiate the temporal backbone as a **causal Transformer** trend extractor, leveraging self-attention's ability to represent long-range dependencies and cross-channel interactions without recurrence. [1] Compared with recurrent trend extractors (e.g., GRU-style backbones), a Transformer trend module offers a direct mechanism to model delayed effects and multivariate coupling—common in ICS, where control actions may influence downstream sensors with nontrivial lags and regime-dependent propagation. [1,12] Crucially, in our design the Transformer is *not* asked to be the entire generator; instead, it serves a deliberately restricted role: providing a stable, temporally coherent conditioning signal that later stochastic components refine.
For continuous channels (X), we posit an additive decomposition
[
X = S + R,
]
where (S\in\mathbb{R}^{L\times d_c}) is a smooth **trend** capturing predictable temporal evolution and (R\in\mathbb{R}^{L\times d_c}) is a **residual** capturing distributional detail (e.g., bursts, heavy tails, local fluctuations) that is difficult to represent robustly with a purely regression-oriented temporal objective. This separation reflects an explicit *division of labor*: the trend module prioritizes temporal coherence, while diffusion (introduced next) targets distributional realism at the residual level—a strategy aligned with “predict-then-refine” perspectives in time-series diffusion modeling. [6,7]
We parameterize the trend (S) using a causal Transformer (f_{\phi}). With teacher forcing, we train (f_{\phi}) to predict the next-step trend from past observations:
[
\hat{S}_{t+1} = f_{\phi}(X_{1:t}), \qquad t=1,\dots,L-1,
]
using the mean-squared error objective
[
\mathcal{L}_{\text{trend}}(\phi)=\frac{1}{(L-1)d_c}\sum_{t=1}^{L-1}\left\| \hat{S}_{t+1} - X_{t+1}\right\|_2^2.
]
At inference, we roll out the Transformer autoregressively to obtain (\hat{S}), and define the residual target for diffusion as (R = X - \hat{S}). This setup intentionally “locks in” a coherent low-frequency scaffold before any stochastic refinement is applied, thereby reducing the burden on downstream diffusion modules to simultaneously learn both long-range structure and marginal detail. In this sense, our use of Transformers is distinctive: it is a *conditioning-first* temporal backbone designed to stabilize mixed-type diffusion synthesis in ICS, rather than an end-to-end monolithic generator. [1,6,10]
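A minimal sketch of this trend module under the teacher-forcing objective above, built on PyTorch's stock TransformerEncoder with a causal mask; the width, depth, and head counts are placeholders rather than the configuration used in the paper:
```python
import torch
import torch.nn as nn

class TrendTransformer(nn.Module):
    """Causal Transformer that predicts the next-step trend S_{t+1} from X_{1:t}."""
    def __init__(self, d_c, d_model=128, n_layers=4, n_heads=4):
        super().__init__()
        self.inp = nn.Linear(d_c, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, d_c)

    def forward(self, x):                        # x: (batch, L, d_c)
        L = x.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(L).to(x.device)
        h = self.encoder(self.inp(x), mask=causal)
        return self.out(h)                       # position t carries the prediction for t+1

def trend_loss(model, x):
    """Teacher-forced MSE: predictions from X_{1:t} regress X_{t+1}."""
    pred = model(x[:, :-1])                      # uses steps 1..L-1
    return torch.mean((pred - x[:, 1:]) ** 2)    # targets are steps 2..L
```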
---
## DDPM for continuous residual generation
We model the residual (R) with a denoising diffusion probabilistic model (DDPM) conditioned on the trend (\hat{S}). [2] Diffusion models learn complex data distributions by inverting a tractable noising process through iterative denoising, and have proven effective at capturing multimodality and heavy-tailed structure that is often attenuated by purely regression-based sequence models. [2,8] Conditioning the diffusion model on (\hat{S}) is central: it prevents the denoiser from re-learning the low-frequency scaffold and focuses capacity on residual micro-structure, mirroring the broader principle that diffusion excels as a distributional corrector when a reasonable coarse structure is available. [6,7]
Let (K) denote the number of diffusion steps, with a noise schedule (\{\beta_k\}_{k=1}^K), (\alpha_k = 1-\beta_k), and (\bar{\alpha}_k=\prod_{i=1}^k \alpha_i). The forward corruption process is
[
q(r_k\mid r_0)=\mathcal{N}\!\left(\sqrt{\bar{\alpha}_k}\,r_0,\ (1-\bar{\alpha}_k)\mathbf{I}\right),
]
equivalently,
[
r_k = \sqrt{\bar{\alpha}_k}\,r_0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon,\qquad \epsilon\sim\mathcal{N}(0,\mathbf{I}),
]
where (r_0\equiv R) and (r_k) is the noised residual at step (k).
The learned reverse process is parameterized as
[
p_{\theta}(r_{k-1}\mid r_k,\hat{S})=\mathcal{N}\!\left(\mu_{\theta}(r_k,k,\hat{S}),\ \Sigma(k)\right),
]
where (\mu_\theta) is implemented by a **Transformer denoiser** that consumes (i) the noised residual (r_k), (ii) a timestep embedding for (k), and (iii) conditioning features derived from (\hat{S}). This denoiser architecture is consistent with the growing use of attention-based denoisers for long-context time-series diffusion, while our key methodological emphasis is the *trend-conditioned residual* factorization as the object of diffusion learning. [2,7]
We train the denoiser using the standard DDPM (\epsilon)-prediction objective:
[
\mathcal{L}_{\text{cont}}(\theta)
= \mathbb{E}_{k,r_0,\epsilon}\!\left[
\left\|
\epsilon - \epsilon_{\theta}(r_k,k,\hat{S})
\right\|_2^2
\right].
]
Because diffusion optimization can exhibit timestep imbalance (i.e., some timesteps dominate gradients), we optionally apply an SNR-based reweighting consistent with Min-SNR training:
[
\mathcal{L}^{\text{snr}}_{\text{cont}}(\theta)
= \mathbb{E}_{k,r_0,\epsilon}\!\left[
w_k\left\|
\epsilon - \epsilon_{\theta}(r_k,k,\hat{S})
\right\|_2^2
\right],
\qquad
w_k=\frac{\min(\mathrm{SNR}_k,\gamma)}{\mathrm{SNR}_k},
]
where (\mathrm{SNR}_k=\bar{\alpha}_k/(1-\bar{\alpha}_k)) and (\gamma>0) is a cap parameter. [5]
After sampling \(\hat{R}\) by reverse diffusion, we reconstruct the continuous output as
\[
\hat{X} = \hat{S} + \hat{R}.
\]
Overall, the DDPM component serves as a **distributional corrector** on top of a temporally coherent backbone, which is particularly suited to ICS where low-frequency dynamics are strong and persistent but fine-scale variability (including bursts and regime-conditioned noise) remains important for realism. Relative to prior ICS diffusion efforts that primarily focus on continuous augmentation, our formulation elevates *trend-conditioned residual diffusion* as a modular mechanism for disentangling temporal structure from distributional refinement. [10,11]
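For completeness, a minimal sketch of ancestral sampling followed by trend reconstruction is shown below; it assumes a fixed reverse variance \(\Sigma(k)=\beta_k\mathbf{I}\), which is one common choice rather than a statement about the exact implementation.

```python
# Minimal sketch of reverse diffusion for the residual DDPM and reconstruction \hat{X} = \hat{S} + \hat{R}.
# alphas, alpha_bar, betas are the usual schedule tensors of shape (K,).
import torch

@torch.no_grad()
def sample_residual(eps_model, s_hat, shape, alphas, alpha_bar, betas):
    r = torch.randn(shape, device=s_hat.device)                 # r_K ~ N(0, I)
    K = alphas.size(0)
    for k in reversed(range(K)):
        kk = torch.full((shape[0],), k, device=r.device, dtype=torch.long)
        eps_hat = eps_model(r, kk, s_hat)
        mean = (r - betas[k] / (1 - alpha_bar[k]).sqrt() * eps_hat) / alphas[k].sqrt()
        noise = torch.randn_like(r) if k > 0 else torch.zeros_like(r)
        r = mean + betas[k].sqrt() * noise                      # p_theta(r_{k-1} | r_k, \hat{S})
    return r

# x_hat = s_hat + sample_residual(...)   # final continuous output
```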
---
## Masked diffusion for discrete ICS variables
Discrete ICS variables must remain categorical, making Gaussian diffusion inappropriate for supervisory states and mode-like channels. While one can attempt continuous relaxations or post-hoc discretization, such strategies risk producing semantically invalid intermediate states (e.g., “in-between” modes) and can distort the discrete marginal distribution. Discrete-state diffusion provides a principled alternative by defining a valid corruption process directly on categorical variables. [3,4] In the ICS setting, this is not a secondary detail: supervisory tags often encode control logic boundaries (modes, alarms, interlocks) that must remain within a finite vocabulary to preserve semantic correctness. [12]
We therefore adopt **masked (absorbing) diffusion** for discrete channels, where corruption replaces tokens with a special \(\texttt{[MASK]}\) symbol according to a schedule. [4] For each variable \(j\), define a masking schedule \(\{m_k\}_{k=1}^K\) (with \(m_k\in[0,1]\) increasing in \(k\)). The forward corruption process is
\[
q(y^{(j)}_k \mid y^{(j)}_0)=
\begin{cases}
y^{(j)}_0, & \text{with probability } 1-m_k,\\
\texttt{[MASK]}, & \text{with probability } m_k,
\end{cases}
\]
applied independently across \(j\) and \(t\). Let \(\mathcal{M}\) denote the set of masked positions at step \(k\). The denoiser \(h_{\psi}\) predicts a categorical distribution over \(\mathcal{V}_j\) for each masked token, conditioned on (i) the corrupted discrete sequence, (ii) the diffusion step \(k\), and (iii) continuous context. Concretely, we condition on \(\hat{S}\) and (optionally) \(\hat{X}\) to couple supervisory reconstruction to the underlying continuous dynamics:
\[
p_{\psi}\!\left(y^{(j)}_0 \mid y_k, k, \hat{S}, \hat{X}\right)
= h_{\psi}(y_k,k,\hat{S},\hat{X}).
\]
This conditioning choice is motivated by the fact that many discrete ICS states are not standalone—they are functions of regimes, thresholds, and procedural phases that manifest in continuous channels. [12]
Training uses a categorical denoising objective:
\[
\mathcal{L}_{\text{disc}}(\psi)
= \mathbb{E}_{k}\!\left[
\frac{1}{|\mathcal{M}|}
\sum_{(j,t)\in\mathcal{M}}
\mathrm{CE}\!\left(
h_{\psi}(y_k,k,\hat{S},\hat{X})_{j,t},\;
y^{(j)}_{0,t}
\right)
\right],
\]
where \(\mathrm{CE}(\cdot,\cdot)\) is cross-entropy. At sampling time, we initialize all discrete tokens as \(\texttt{[MASK]}\) and iteratively unmask them using the learned conditionals, ensuring that every output token lies in its legal vocabulary by construction. This discrete branch is a key differentiator of our pipeline: unlike typical continuous-only diffusion augmentation in ICS, we integrate masked diffusion as a first-class mechanism for supervisory-variable legality within the same end-to-end synthesis workflow. [4,10]
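A minimal sketch of one masked-diffusion training step is given below. It assumes a classifier `h_model` returning per-position logits of shape `(B, T, J, V)`, a reserved `MASK_ID` token per vocabulary, and a callable `mask_prob` mapping the step \(k\) to the masking rate \(m_k\); all of these names are illustrative assumptions.

```python
# Minimal sketch: mask tokens according to m_k, predict originals, and average cross-entropy
# over masked positions only.
import torch
import torch.nn.functional as F

MASK_ID = 0  # illustrative; in practice a reserved index per variable vocabulary

def masked_diffusion_step(h_model, y0, s_hat, x_hat, K, mask_prob):
    B = y0.size(0)
    k = torch.randint(1, K + 1, (B,), device=y0.device)
    m_k = mask_prob(k).view(B, 1, 1)                                 # masking rate per sample
    masked = torch.rand_like(y0, dtype=torch.float) < m_k
    y_k = torch.where(masked, torch.full_like(y0, MASK_ID), y0)      # forward corruption
    logits = h_model(y_k, k, s_hat, x_hat)                           # (B, T, J, V)
    ce = F.cross_entropy(logits.flatten(0, 2), y0.flatten(), reduction="none")
    ce = ce.view(y0.shape)
    return (ce * masked).sum() / masked.sum().clamp(min=1)           # CE over masked positions only
```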
---
## Type-aware decomposition as a performance refinement layer
Even with a trend-conditioned residual DDPM and a discrete masked-diffusion branch, a single uniform modeling treatment can remain suboptimal because ICS variables are generated by qualitatively different mechanisms. For example, program-driven setpoints exhibit step-and-dwell dynamics; controller outputs follow control laws conditioned on process feedback; actuator positions may show saturation and dwell; and some “derived tags” are deterministic functions of other channels. Treating all channels as if they were exchangeable stochastic processes can misallocate model capacity and induce systematic error concentration on a small subset of mechanistically distinct variables. [12]
We therefore introduce a **type-aware decomposition** that formalizes this heterogeneity as a routing and constraint layer. Let \(\tau(i)\in\{1,\dots,6\}\) assign each variable \(i\) to a type class. The type assignment can be initialized from domain semantics (tag metadata, value domains, and engineering meaning), and subsequently refined via an error-attribution workflow described in the Benchmark section. Importantly, this refinement does **not** change the core diffusion backbone; it changes *which mechanism is responsible for which variable*, thereby aligning inductive bias with variable-generating mechanism while preserving overall coherence.
We use the following taxonomy:
* **Type 1 (program-driven / setpoint-like):** externally commanded, step-and-dwell variables. These variables can be treated as exogenous drivers (conditioning signals) or routed to specialized change-point / dwell-time models, rather than being forced into a smooth denoiser that may over-regularize step structure.
* **Type 2 (controller outputs):** continuous variables tightly coupled to feedback loops; these benefit from conditional modeling where the conditioning includes relevant process variables and commanded setpoints.
* **Type 3 (actuator states/positions):** often exhibit saturation, dwell, and rate limits; these may require stateful dynamics beyond generic residual diffusion, motivating either specialized conditional modules or additional inductive constraints.
* **Type 4 (process variables):** inertia-dominated continuous dynamics; these are the primary beneficiaries of the **Transformer trend + residual DDPM** pipeline.
* **Type 5 (derived/deterministic variables):** algebraic or rule-based functions of other variables; we enforce deterministic reconstruction \(\hat{x}^{(i)} = g_i(\hat{X},\hat{Y})\) rather than learning a stochastic generator, improving logical consistency and sample efficiency.
* **Type 6 (auxiliary/low-impact variables):** weakly coupled or sparse signals; we allow simplified modeling (e.g., calibrated marginals or lightweight temporal models) to avoid allocating diffusion capacity where it is not warranted.
Type-aware decomposition improves synthesis quality through three mechanisms. First, it improves **capacity allocation** by preventing a small set of mechanistically atypical variables from dominating gradients and distorting the learned distribution for the majority class (typically Type 4). Second, it enables **constraint enforcement** by deterministically reconstructing Type 5 variables, preventing logically inconsistent samples that purely learned generators can produce. Third, it improves **mechanism alignment** by attaching inductive biases consistent with step/dwell or saturation behaviors where generic denoisers may implicitly favor smoothness.
From a novelty standpoint, this layer is not merely an engineering “patch”; it is an explicit methodological statement that ICS synthesis benefits from **typed factorization**—a principle that has analogues in mixed-type generative modeling more broadly, but that remains underexplored in diffusion-based ICS telemetry synthesis. [9,10,12]
---
## Joint optimization and end-to-end sampling
We train the model in a staged manner consistent with the above factorization, which improves optimization stability and encourages each component to specialize in its intended role. Specifically: (i) we train the trend Transformer \(f_{\phi}\) to obtain \(\hat{S}\); (ii) we compute residual targets \(R=X-\hat{S}\) for the continuous variables routed to residual diffusion; (iii) we train the residual DDPM \(p_{\theta}(R\mid \hat{S})\) and the masked diffusion model \(p_{\psi}(Y\mid \text{masked}(Y), \hat{S}, \hat{X})\); and (iv) we apply type-aware routing and deterministic reconstruction during sampling. This staged strategy is aligned with the design goal of separating temporal scaffolding from distributional refinement, and it mirrors the broader intuition in time-series diffusion that decoupling coarse structure and stochastic detail can mitigate "structure vs. realism" conflicts. [6,7]
A simple combined objective is
\[
\mathcal{L} = \lambda\,\mathcal{L}_{\text{cont}} + (1-\lambda)\,\mathcal{L}_{\text{disc}},
\]
with \(\lambda\in[0,1]\) controlling the balance between continuous and discrete learning. Type-aware routing determines which channels contribute to which loss and which are excluded in favor of deterministic reconstruction. In practice, this routing acts as a principled guardrail against negative transfer across variable mechanisms: channels that are best handled deterministically (Type 5) or by specialized drivers (Type 1/3, depending on configuration) are prevented from forcing the diffusion models into statistically incoherent compromises.
At inference time, generation follows the same structured order: (i) trend \(\hat{S}\) via the Transformer, (ii) residual \(\hat{R}\) via DDPM, (iii) discrete \(\hat{Y}\) via masked diffusion, and (iv) type-aware assembly with deterministic reconstruction for routed variables. This pipeline produces \((\hat{X},\hat{Y})\) that are temporally coherent by construction (through \(\hat{S}\)), distributionally expressive (through \(\hat{R}\) denoising), and discretely valid (through masked diffusion), while explicitly accounting for heterogeneous variable-generating mechanisms through type-aware routing. In combination, these choices constitute our central methodological contribution: a unified Transformer + mixed diffusion generator for ICS telemetry, augmented by typed factorization to align model capacity with domain mechanism. [2,4,10,12]
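A minimal sketch of this sampling order is given below. All component samplers (`trend_model.rollout`, `sample_residual_fn`, `sample_discrete_fn`, `derived_fns`) are assumed to exist and their names are illustrative; Type 5 channels are overwritten by their deterministic reconstructions.

```python
# Minimal sketch of the end-to-end sampling order: trend -> residual -> discrete -> type-aware assembly.
def generate(trend_model, sample_residual_fn, sample_discrete_fn, type_of, derived_fns, x_seed):
    s_hat = trend_model.rollout(x_seed)                     # (i) autoregressive trend \hat{S}
    r_hat = sample_residual_fn(s_hat)                       # (ii) residual \hat{R} via DDPM
    x_hat = s_hat + r_hat                                   # continuous channels \hat{X}
    y_hat = sample_discrete_fn(s_hat, x_hat)                # (iii) discrete \hat{Y} via masked diffusion
    for i, t in type_of.items():                            # (iv) type-aware assembly
        if t == 5:                                          # derived/deterministic variables
            x_hat[..., i] = derived_fns[i](x_hat, y_hat)    # \hat{x}^{(i)} = g_i(\hat{X}, \hat{Y})
    return x_hat, y_hat
```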
---
# References
[1] Vaswani, A., Shazeer, N., Parmar, N., et al. *Attention Is All You Need.* NeurIPS, 2017. arXiv:1706.03762.
[2] Ho, J., Jain, A., Abbeel, P. *Denoising Diffusion Probabilistic Models.* NeurIPS, 2020. arXiv:2006.11239.
[3] Austin, J., Johnson, D. D., Ho, J., Tarlow, D., van den Berg, R. *Structured Denoising Diffusion Models in Discrete State-Spaces (D3PM).* NeurIPS, 2021. arXiv:2107.03006.
[4] Shi, J., Han, K., Wang, Z., Doucet, A., Titsias, M. K. *Simplified and Generalized Masked Diffusion for Discrete Data.* arXiv:2406.04329, 2024.
[5] Hang, T., Wu, C., Zhang, H., et al. *Efficient Diffusion Training via Min-SNR Weighting Strategy.* arXiv:2303.09556, 2023.
[6] Kollovieh, M., Ansari, A. F., Bohlke-Schneider, M., et al. *Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting (TSDiff).* arXiv:2307.11494, 2023.
[7] Sikder, M. F., Ramachandranpillai, R., Heintz, F. *TransFusion: Generating Long, High Fidelity Time Series using Diffusion Models with Transformers.* arXiv:2307.12667, 2023.
[8] Song, Y., Sohl-Dickstein, J., Kingma, D. P., et al. *Score-Based Generative Modeling through Stochastic Differential Equations.* ICLR, 2021. arXiv:2011.13456.
[9] Zhang, H., Zhang, J., Li, J., et al. *TabDiff: a Mixed-type Diffusion Model for Tabular Data Generation.* arXiv:2410.20626, 2024.
[10] Yuan, H., Sha, K., Zhao, W. *CTU-DDPM: Conditional Transformer U-net DDPM for Industrial Control System Anomaly Data Augmentation.* ACM AICSS, 2025. DOI:10.1145/3776759.3776845.
[11] Sha, K., et al. *DDPM Fusing Mamba and Adaptive Attention: An Augmentation Method for Industrial Control Systems Anomaly Data.* SSRN preprint 6055903, posted Jan 10, 2026.
[12] NIST. *Guide to Operational Technology (OT) Security (SP 800-82r3).* 2023. https://csrc.nist.gov/pubs/sp/800/82/r3/final

knowledges/mask-ddpm.md
# maskddpm Project Documentation (internal long-form version, ~3k characters)
> Goal: in a mixed "paper narrative + manual" style, fully explain the project's background, method, workflow, evaluation, limitations, and future directions.
> Audience: readable even for readers unfamiliar with diffusion models / time-series modeling.
> Note: no concrete implementation code, only concepts and methods.
---
## Abstract
Industrial control system (ICS) data are multivariate, strongly temporal, and mixed-type. Conventional generative models typically satisfy either the distributional or the temporal requirement, but rarely both. We propose a **two-stage hybrid diffusion framework (maskddpm)**: the first stage learns the temporal trend, the second stage uses a diffusion model to learn the residual distribution, and discrete variables are handled with mask-and-recover diffusion. To address the problem of a few variables dominating the KS metric, we further propose a **type-aware divide-and-conquer strategy** that separates program-driven variables from process variables, supported by a complete diagnostic and evaluation toolkit. Experiments show that the framework substantially improves distributional consistency while retaining a reasonable temporal structure, and provides interpretable error localization.
---
## 1. Background and Motivation (Why this problem is hard)
Industrial data differ fundamentally from ordinary time series:
1) **Physical inertia**: many variables change slowly, and the previous moment strongly influences the next.
2) **Program-driven behavior**: some variables come from operators / scheduling programs and exhibit "step + long dwell" patterns.
3) **Mixed types**: continuous sensors coexist with discrete states and must be handled separately.
Using a single model to learn all variables causes obvious problems:
- to align distributions, the model sacrifices temporal structure → lag-1 degrades
- to keep the sequence stable, the model blurs the distribution → KS increases
- treating discrete variables as continuous produces illegal values → JSD increases
Therefore, a structured modeling strategy is needed to "split the conflict".
---
## 2. Problem Formulation
Given a real sequence \(X \in \mathbb{R}^{T\times D}\), where:
- \(T\): sequence length
- \(D\): number of variables (continuous + discrete)
the goal is to learn a generative model \(G\) that outputs \(\hat{X}\) such that:
**Distributional consistency**
\[
\forall i,\; F^{(i)}_{gen}(x) \approx F^{(i)}_{real}(x)
\]
**Temporal consistency**
\[
\rho_1(\hat{x}^{(i)}) \approx \rho_1(x^{(i)})
\]
**Discrete legality**
\[
\hat{x}^{(j)} \in \mathcal{V}_j
\]
where \(\rho_1\) denotes the lag-1 autocorrelation and \(\mathcal{V}_j\) is the legal vocabulary of discrete variable \(j\).
---
## 3. Method Overview
The core design is **two stages + hybrid diffusion**:
### Stage 1: temporal trend modeling
A sequence model learns the "overall trend". Intuitively, this stage only answers "how does the sequence evolve".
### Stage 2: distributional residual modeling
A diffusion model learns the "residual beyond the trend" and corrects the distribution shape. Intuitively, this stage only answers "do the values look distributionally right".
This decoupling keeps "temporal consistency" and "distributional consistency" from conflicting directly.
---
## 4. Model Details
### 4.1 Continuous-variable diffusion (DDPM)
Diffusion process:
\[
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\quad \epsilon \sim \mathcal{N}(0,I)
\]
**Explanation**:
- \(x_0\) is the true residual
- \(x_t\) is the noised data
- the model learns to recover \(x_0\) or \(\epsilon\) from \(x_t\)
### 4.2 Discrete-variable diffusion (mask diffusion)
Discrete variables are modeled by "mask + recover":
- randomly mask some tokens
- learn to predict the original tokens
This guarantees that discrete outputs are **legal and interpretable**.
### 4.3 Temporal module (GRU)
A GRU learns the trend of the continuous sequence:
- input: the continuous sequence
- output: the trend sequence
- used to build the residual: \(x_{resid} = x - trend\)
Intuition:
- the GRU handles the "direction of travel"
- diffusion handles "correcting the details"
---
## 5. Loss Design
### 5.1 Continuous loss
\[
\mathcal{L}_{cont} =
\begin{cases}
\|\hat{\epsilon}-\epsilon\|^2 & \text{(noise prediction)}\\
\|\hat{x}_0-x_0\|^2 & \text{(original-value prediction)}
\end{cases}
\]
**Explanation**: measures how close the denoised result is to the true residual.
### 5.2 Discrete loss
\[
\mathcal{L}_{disc} = \frac{1}{|\mathcal{M}|}\sum_{(i,t)\in\mathcal{M}} CE(\hat{p}_{i,t}, y_{i,t})
\]
**Explanation**: cross-entropy is computed only on masked tokens, ensuring correct discrete predictions.
### 5.3 Total loss
\[
\mathcal{L} = \lambda \mathcal{L}_{cont} + (1-\lambda)\mathcal{L}_{disc}
\]
**Explanation**: \(\lambda\) sets the trade-off between the continuous and discrete branches.
### 5.4 Optional enhancements
**SNR weighting**
\[
\mathcal{L}_{snr} = \frac{\text{SNR}_t}{\text{SNR}_t+\gamma}\,\mathcal{L}_{cont}
\]
**Quantile alignment**
\[
\mathcal{L}_Q = \frac{1}{K}\sum_k \|Q_k(x_{real}) - Q_k(x_{gen})\|_1
\]
These terms help lower KS but may sacrifice temporal consistency; a sketch of the quantile term follows.
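For illustration only (this document intentionally omits the project's implementation code), a small sketch of the quantile-alignment term, assuming 1-D samples of a single channel and K evenly spaced quantile levels:

```python
# Minimal sketch of the quantile-alignment loss L_Q on one channel.
import torch

def quantile_alignment_loss(x_real, x_gen, K=20):
    q = torch.linspace(0.0, 1.0, K, device=x_real.device)
    Q_real = torch.quantile(x_real, q)          # empirical quantiles of the real data
    Q_gen = torch.quantile(x_gen, q)            # empirical quantiles of the generated data
    return (Q_real - Q_gen).abs().mean()        # L1 gap averaged over the K quantile levels
```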
---
## 6. Evaluation Metrics
### 6.1 KS (distributional consistency)
\[
KS_i = \sup_x |F^{(i)}_{gen}(x) - F^{(i)}_{real}(x)|
\]
**Explanation**: smaller means the distributions are closer.
### 6.2 JSD (discrete consistency)
\[
JSD(P,Q)=\tfrac12 KL(P\|M)+\tfrac12 KL(Q\|M)
\]
**Explanation**: measures the divergence between discrete distributions.
### 6.3 Lag-1 diff (temporal consistency)
\[
\Delta_{lag1} = \frac{1}{d}\sum_i |\rho_1(\hat{x}_i) - \rho_1(x_i)|
\]
**Explanation**: measures the deviation in adjacent-step correlation. A sketch of all three metrics follows.
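For illustration, a small sketch of the three metrics using NumPy/SciPy (`ks_2samp`, `jensenshannon`), computed column-wise over arrays of shape (T, d); the function names here are illustrative, not the project's actual scripts:

```python
# Minimal sketch of KS, JSD, and lag-1 diff.
import numpy as np
from scipy.stats import ks_2samp
from scipy.spatial.distance import jensenshannon

def ks_per_feature(real, gen):
    return np.array([ks_2samp(real[:, i], gen[:, i]).statistic for i in range(real.shape[1])])

def jsd_per_feature(real_tokens, gen_tokens, vocab_size):
    out = []
    for i in range(real_tokens.shape[1]):
        p = np.bincount(real_tokens[:, i], minlength=vocab_size) / real_tokens.shape[0]
        q = np.bincount(gen_tokens[:, i], minlength=vocab_size) / gen_tokens.shape[0]
        out.append(jensenshannon(p, q) ** 2)    # squared JS distance = JS divergence
    return np.array(out)

def lag1_diff(real, gen):
    def rho1(x):                                 # lag-1 autocorrelation per column
        a, b = x[:-1], x[1:]
        return np.array([np.corrcoef(a[:, i], b[:, i])[0, 1] for i in range(x.shape[1])])
    return np.abs(rho1(gen) - rho1(real)).mean()
```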
---
## 7. Diagnostics and the Type-aware Divide-and-Conquer Idea
In practice, a small number of variables dominate KS:
- setpoint/demand (program-driven)
- actuators (saturation / dwell)
- derived tags (deterministic mappings)
We therefore propose a type-aware split:
| Type | Characteristics | Treatment |
|------|-----------------|-----------|
| Type 1 | program-driven | model separately / resample |
| Type 2 | controller outputs | conditional generation |
| Type 3 | actuator positions | state + dwell model |
| Type 4 | process variables (PV) | diffusion modeling |
| Type 5 | derived variables | deterministic mapping |
| Type 6 | auxiliary variables | simple statistical model |
The current implementation is a **KS-only baseline** (empirical resampling), used only as a diagnostic upper bound; a sketch follows.
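For illustration, a small sketch of the empirical-resampling baseline for one channel; it matches the marginal distribution (low KS) but deliberately ignores temporal and cross-variable structure, which is why it is a diagnostic bound rather than a generator:

```python
# Minimal sketch of the KS-only diagnostic baseline (empirical resampling of one column).
import numpy as np

def ks_only_resample(real_column, n_samples, rng=None):
    rng = rng or np.random.default_rng()
    return rng.choice(real_column, size=n_samples, replace=True)
```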
---
## 8. Strengths and Limitations
### Strengths
- decoupling temporal structure from distribution makes training more stable
- hybrid continuous/discrete diffusion guarantees legality
- a complete diagnostic toolkit localizes problem variables
### Limitations
- Type 1/2/3 variables remain difficult
- there is a trade-off between KS and temporal fidelity
- the KS-only baseline breaks the joint distribution
---
## 9. Future Directions
1) replace resampling with conditional generators for Type 1/2/3
2) add operating-condition embeddings for Type 4
3) add cross-variable correlation evaluation
---
## 10. Conclusion
Through "two-stage trend + diffusion residual", maskddpm achieves a more balanced framework for ICS sequence generation. The project already reaches reasonable distributional alignment on most variables, but program-driven variables still require dedicated modeling. The framework offers a clear and extensible research path.
---
## Appendix: Quick Reference for the Formulas
| Formula | Meaning |
|---------|---------|
| DDPM noising | how continuous variables are corrupted by noise |
| continuous loss | distance between the denoised result and the true residual |
| discrete cross-entropy | whether masked tokens are recovered correctly |
| KS | maximum distributional deviation |
| JSD | discrete distribution deviation |
| lag-1 diff | difference in temporal correlation |
---
# Part 1 / 4 — Method and Theoretical Foundations (~5k characters)
> Focus of this part: explain why the method is reasonable and build the reader's intuition and theoretical grounding for the overall framework.
## 1.1 What makes industrial time-series generation unique
Industrial control system (ICS) data differ fundamentally from ordinary time series (e.g., stock prices, weather). The three key points are:
**(a) Strong physical constraints**
Many process variables obey physical laws: a level cannot jump instantaneously, and pressure or temperature has bounded rates of change. The model must therefore respect "inertia" and "smoothness".
**(b) Strong program-driven behavior**
Industrial systems contain many "external control signals" (e.g., setpoints, demand, operating modes). These variables are not produced by the natural evolution of the process but by external programs / operator commands, and they typically show "step + dwell" patterns.
**(c) Mixed variable types**
Continuous variables (sensors, control outputs) and discrete variables (states, modes, alarms) coexist. Forcing discrete variables into a continuous treatment creates semantic errors, e.g., a meaningless state "between running and stopped".
Industrial time-series generation is therefore not just "predict the next step": it must simultaneously respect distribution, temporal structure, and discrete legality.
---
## 1.2 Why a single model fails
A single sequence model (e.g., LSTM/Transformer) asked to optimize distribution and temporal structure at the same time typically runs into the following conflicts:
- **Optimizing the distribution loss → temporal degradation**
  To match each variable's distribution, the model tends to ignore temporal structure, producing sequences that "look like the right distribution but behave like noise".
- **Optimizing the temporal loss → distributional degradation**
  To keep the sequence continuous, the generated values become overly smooth and the distribution drifts from the real data (especially heavy tails and spikes).
- **Discrete-variable errors**
  Treating discrete variables as continuous yields illegal tokens and increases JSD.
A structural split is therefore necessary: each module should only be responsible for what it is good at.
---
## 1.3 Design principles: decoupling and specialization
The project follows three core principles:
**Principle 1: decouple "temporal structure" from "distribution"**
- the temporal module handles temporal structure
- the diffusion module handles distributional detail
**Principle 2: model continuous and discrete variables separately**
- continuous variables suit Gaussian diffusion
- discrete variables suit mask diffusion
**Principle 3: treat hard-to-learn variables by category**
- program-driven variables (Type 1) are a poor fit for DDPM
- actuator variables (Type 3) need additional models
These principles form the overall methodology of the project.
---
## 1.4 Theoretical intuition for the two-stage framework
The two-stage method can be understood as:
\[
X = \text{Trend}(X) + \text{Residual}(X)
\]
**Trend part**: determines how the sequence "moves", e.g., gradual rise/fall, preserving physical inertia.
**Residual part**: determines what the values "look like", e.g., distribution shape, heavy tails, outliers.
A single model learning \(X\) directly must solve the trend and distribution problems simultaneously. After the split:
- stage 1 (GRU) learns only the trend
- stage 2 (diffusion) learns only the residual
This markedly reduces the "temporal vs. distributional" conflict.
---
## 1.5 Why diffusion models suit distribution alignment
Diffusion models (DDPMs) are generative models whose core idea is:
1) add noise step by step until the data become pure noise
2) denoise step by step to recover the true distribution
Mathematically:
\[
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon
\]
The denoising objective is to predict \(\epsilon\) or \(x_0\).
**Advantages**:
- can fit complex, multi-modal, heavy-tailed distributions
- does not rely on an explicit probability density function
This matches the reality that ICS data have complex distributions.
---
## 1.6 Why discrete variables need mask diffusion
The key issue for discrete variables is "semantic legality". Applying continuous diffusion directly produces illegal values (e.g., 0.3 or 1.7), which breaks the model.
The mask-diffusion idea:
1) randomly mask some tokens
2) the model predicts the masked tokens
The loss is cross-entropy:
\[
\mathcal{L}_{disc} = \frac{1}{|\mathcal{M}|}\sum_{(i,t)\in\mathcal{M}} CE(\hat{p}_{i,t},y_{i,t})
\]
This guarantees that outputs always come from the vocabulary and are semantically legal.
---
## 1.7 Why type-aware divide-and-conquer is needed
Even with two stages + hybrid diffusion, KS is still dominated by a few variables:
- Type 1 (setpoint/demand) is essentially a "program signal"
- Type 3 (actuators) show "dwell + saturation" behavior
- Type 5 (derived variables) are deterministic functions
Statistically, these variables are simply not suited to diffusion models, so extra design is needed.
The point of the type-aware split is:
- peel off the variables diffusion cannot learn well
- let diffusion focus on the variables it truly suits (Type 4)
---
## 1.8 Why these metrics (an intuitive explanation)
We use three core metrics:
**(a) KS (distribution alignment)**
The maximum gap between the generated and real distributions; smaller is better.
**(b) JSD (discrete consistency)**
Whether the frequencies of discrete states match the real data.
**(c) Lag-1 diff (temporal consistency)**
Whether the adjacent-step correlation structure matches the real data.
The three metrics correspond to the three core goals: distribution, temporal structure, and legality.
---
## 1.9 The theoretical trade-off
Roughly:
- low KS → good distribution
- low lag-1 diff → good temporal structure
In practice, however:
- over-optimizing KS can turn the sequence into "random sampling" → temporal structure collapses
- over-optimizing temporal structure makes the distribution too smooth → KS rises
A good model must therefore find a balance between the two. This is exactly the motivation for the two-stage framework.
---
## 1.10 Summary
This part laid out the theoretical basis of the method:
1) ICS data are complex and mixed-type; a single model cannot handle everything
2) the two-stage structure decouples temporal structure from distribution
3) hybrid diffusion handles continuous and discrete variables
4) the type-aware split handles variables that diffusion learns poorly
The next part details the **training procedure and loss design**.
---
# Part 2 / 4 — Training Procedure and Loss Design (~5k characters)
> Focus of this part: make it clear to a general reader what the training actually does, what each step optimizes, and why these losses are needed.
## 2.1 Training overview (high level)
Training is split into two stages:
1) **Temporal Stage**: train a sequence model (GRU) to learn the trend of the sequence.
2) **Diffusion Stage**: train the diffusion model to learn the distribution of the trend residual.
In other words, training is not done in one shot: first learn "how things change", then learn "the detailed values after the change".
---
## 2.2 Temporal Stage: why learn the trend
Industrial process dynamics are usually smooth and inertial. If the diffusion model were asked to learn the whole sequence directly, it would have to model "where the sequence goes" and "how the values are distributed" at the same time, which creates a conflict:
- over-optimizing the distribution → the sequence becomes random
- over-optimizing the temporal structure → the distribution narrows
The purpose of the Temporal Stage is to pin down the "direction of travel" first.
### Training setup (intuitively)
- input: the continuous-variable sequence
- target: predict the continuous sequence at the next step
- loss: MSE
The model thus learns the "trend structure" rather than the full distribution.
---
## 2.3 Diffusion Stage: learning the residual distribution
After the Temporal Stage produces the trend sequence \(T\), we construct the residual:
\[
R = X - T
\]
The Diffusion Stage fits the residual distribution, so the final output is:
\[
\hat{X} = \hat{R} + T
\]
**Intuition**:
- the trend provides the skeleton
- the residual provides the "realistic noise and detail"
---
## 2.4 Continuous-branch loss design
### Basic loss
In diffusion we can either predict the noise \(\epsilon\) or directly predict the original \(x_0\):
\[
\mathcal{L}_{cont} =
\begin{cases}
\|\hat{\epsilon} - \epsilon\|^2 & \text{(eps prediction)}\\
\|\hat{x}_0 - x_0\|^2 & \text{(x0 prediction)}
\end{cases}
\]
### Why predict eps or x0
- eps prediction is more stable (the original diffusion formulation)
- x0 prediction makes distribution correction easier
We therefore switch between the two across experiments; a conversion sketch follows.
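For illustration, the two parameterizations are interchangeable through the forward equation \(x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon\); a small sketch of the conversion (tensors assumed):

```python
# Minimal sketch: convert between eps-prediction and x0-prediction targets.
import torch

def eps_to_x0(x_t, eps_hat, alpha_bar_t):
    return (x_t - (1 - alpha_bar_t).sqrt() * eps_hat) / alpha_bar_t.sqrt()

def x0_to_eps(x_t, x0_hat, alpha_bar_t):
    return (x_t - alpha_bar_t.sqrt() * x0_hat) / (1 - alpha_bar_t).sqrt()
```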
---
## 2.5 Discrete-branch loss design
Discrete variables use mask diffusion. During training, a subset of tokens is randomly masked:
\[
\mathcal{L}_{disc} = \frac{1}{|\mathcal{M}|}\sum_{(i,t)\in\mathcal{M}} CE(\hat{p}_{i,t},y_{i,t})
\]
Intuition:
- cross-entropy is computed only on masked tokens
- the model learns to "recover the correct state from context"
This avoids illegal outputs for discrete variables.
---
## 2.6 Total loss and its trade-off
The final loss is a linear combination:
\[
\mathcal{L} = \lambda\mathcal{L}_{cont} + (1-\lambda)\mathcal{L}_{disc}
\]
**Role of \(\lambda\)**:
- large \(\lambda\) → better continuous distribution, the discrete branch may degrade
- small \(\lambda\) → better discrete branch, the continuous distribution may drift
This is the key trade-off parameter during training.
---
## 2.7 SNR weighting (why it is needed)
Early diffusion steps have large noise and late steps have small noise. A uniform loss over all steps leads to unbalanced training.
We therefore introduce an SNR weight:
\[
\mathcal{L}_{snr} = \frac{\text{SNR}_t}{\text{SNR}_t+\gamma}\mathcal{L}_{cont}
\]
Intuition:
- down-weight the loss in the high-noise phase
- emphasize fine structure in the low-noise phase
---
## 2.8 Quantile loss (why it helps KS)
KS is essentially a comparison of CDFs.
We therefore add a quantile loss:
\[
\mathcal{L}_Q = \frac{1}{K}\sum_k \|Q_k(x_{real}) - Q_k(x_{gen})\|_1
\]
Intuition:
- align the model on the key quantiles of the distribution
- especially improves heavy-tailed and spiky variables
---
## 2.9 Residual statistics constraint (preventing collapse)
To prevent the residual distribution from collapsing, mean/std alignment is added:
\[
\mathcal{L}_{stat} = \|\mu_{gen}-\mu_{real}\|^2 + \|\sigma_{gen}-\sigma_{real}\|^2
\]
Explanation:
- constrains the residual mean and variance
- prevents the model from learning an "overly smooth" residual (a sketch follows)
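For illustration, a small sketch of this mean/std alignment term, assuming residual tensors of shape (B, L, d):

```python
# Minimal sketch of the residual statistics constraint L_stat, per channel.
import torch

def residual_stat_loss(r_real, r_gen):
    mu_gap = (r_gen.mean(dim=(0, 1)) - r_real.mean(dim=(0, 1))) ** 2
    std_gap = (r_gen.std(dim=(0, 1)) - r_real.std(dim=(0, 1))) ** 2
    return (mu_gap + std_gap).mean()            # averaged over channels
```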
---
## 2.10 Why training is slow (a deeper explanation)
1) **Two-stage training**
- Temporal training + Diffusion training
- roughly doubles the training cost
2) **Large number of diffusion steps**
- timesteps=600
- every training/sampling pass iterates repeatedly
3) **Transformer backbone**
- self-attention cost grows quadratically with sequence length
- at seq_len=128, memory and time increase noticeably
---
## 2.11 Common training anomalies and their explanations
**(a) Loss decreases slowly**
- cause: diffusion training is inherently slow
- fix: reduce timesteps or increase the batch size
**(b) Loss fluctuates heavily**
- cause: randomness of noise sampling + distributional complexity
- fix: add EMA or lower the learning rate
**(c) Poor lag-1**
- cause: the trend is under-trained or there is a seq_len mismatch
- fix: increase temporal_epochs and keep seq_len consistent with sample_seq_len
---
## 2.12 Summary
This part highlighted the core logic of training:
- the Temporal Stage handles temporal structure
- the Diffusion Stage handles the distributional residual
- the loss design balances distribution and temporal structure through multiple terms
The next part explains the evaluation and diagnostic system in detail.
---
# Part 3 / 4 — Evaluation and Diagnostics (~5k characters)
> Focus of this part: help the reader understand how to judge model quality and why diagnostic tooling is needed.
## 3.1 Why the loss alone is not enough
The training loss only shows that "the model is fitting the training objective"; it is not equivalent to real quality.
In this project the loss can decrease while KS or lag-1 remains poor, because:
- the loss mainly optimizes local errors, while KS measures global distributional differences
- the model may learn "around the mean" while badly missing the tails
- discrete variables can be predicted incorrectly while the loss stays small
Evaluation metrics closer to the actual goals are therefore required.
---
## 3.2 Distributional consistency: KS
The KS statistic is defined as:
\[
KS_i = \sup_x |F^{(i)}_{gen}(x)-F^{(i)}_{real}(x)|
\]
**Explanation**:
- \(F_{gen}\): CDF of the generated distribution
- \(F_{real}\): CDF of the real distribution
- KS is the maximum distance between the two CDF curves
**Intuition**:
- small KS → the generated distribution almost overlaps the real one
- large KS → clear distributional shift
**Note**: for spiky / heavily discretized variables a tie-aware KS must be used, otherwise KS is overestimated.
---
## 3.3 Discrete consistency: JSD
Jensen-Shannon divergence:
\[
JSD(P,Q)=\tfrac12 KL(P\|M)+\tfrac12 KL(Q\|M)
\]
where \(M = (P+Q)/2\).
**Explanation**:
- measures the divergence between discrete distributions
- suitable for state-like variables
- smaller is better
---
## 3.4 Temporal consistency: lag-1 diff
Lag-1 correlation difference:
\[
\Delta_{lag1} = \frac{1}{d}\sum_i |\rho_1(\hat{x}_i)-\rho_1(x_i)|
\]
**Explanation**:
- computes the adjacent-step correlation of each variable
- measures whether the "temporal inertia" is correct
- smaller is better
---
## 3.5 Why diagnostics are needed
Even if the average KS drops, it may be that "some variables are very good while others are very bad", which misleads the optimization direction. We must therefore diagnose:
- which variables are worst?
- over which value ranges do their distributions differ?
- which type do they belong to?
---
## 3.6 Diagnostic tool 1: per-feature KS ranking
`ranked_ks.py` outputs:
- variables sorted by KS from high to low
- a view of the "top-N variables dominating the problem"
**Purpose**:
- pinpoint the bottleneck precisely
- avoid blind model changes
---
## 3.7 Diagnostic tool 2: CDF plots
`diagnose_ks.py` plots CDF comparisons:
- real vs. generated
- makes "missing tails / median shift / saturation pile-up" directly visible
This is the most direct way to understand a high KS.
---
## 3.8 Diagnostic tool 3: filtered KS
The filtered KS removes:
- variables with zero standard deviation
- variables with extreme, anomalous KS
**Purpose**:
- judge whether the overall model is already reasonable
- show that KS is dragged down by a small number of variables
---
## 3.9 Type-wise evaluation
To match the type-aware strategy we also report:
- mean KS of Type 1
- mean KS of Type 2
- mean KS of Type 3
- ...
This shows the contribution of each variable type.
---
## 3.10 Trade-offs among the metrics
A common situation in practice:
- KS drops → lag-1 rises
- lag-1 drops → KS rises
Experiments therefore need an explicit goal:
- if the goal is distributional consistency, accept temporal degradation
- if the goal is temporal consistency, accept distributional deviation
This trade-off must also be discussed in the paper.
---
## 3.11 Role of the KS-only baseline
The KS-only baseline is a diagnostic tool:
- replace the hard-to-learn variables with samples drawn from the real distribution
- observe the "upper bound" to which the overall KS can drop
Interpretation:
- if KS stays high → the model learns the overall distribution poorly
- if KS drops markedly → the bottleneck is concentrated in a few variables
---
## 3.12 The complete evaluation workflow
The evaluation workflow consists of:
1) generate data (generated.csv)
2) compare with real data and compute KS/JSD/lag-1
3) write eval.json
4) run additional diagnostics (CDF / rank / filtered)
5) if post-processing is enabled, also evaluate eval_post.json
---
## 3.13 Summary
This part explained:
- why KS/JSD/lag-1 must be used for evaluation
- why per-feature diagnostics are needed
- why type-wise evaluation was introduced
The next part discusses experimental observations, limitations, and future directions.
---
# Part 4 / 4 — Experimental Observations, Limitations, and Future Directions (~5k characters)
> Focus of this part: summarize the experimental phenomena, explain why they occur, and lay out directions for future work.
## 4.1 Recurring experimental observations
Across many experiments we observed several stable phenomena:
1) **KS can drop markedly, but lag-1 may worsen**
- when distribution alignment is strengthened (quantile loss, calibration), the distribution improves but temporal consistency degrades
2) **Discrete JSD is relatively stable**
- discrete mask diffusion guarantees token legality
- JSD usually does not fluctuate as wildly as KS
3) **A few variables dominate KS**
- Type 1/2/3 are the main sources of KS
- this indicates the overall distribution is actually reasonable; the problem is concentrated in a few hard variables
---
## 4.2 Why Type 1/2/3 variables are hard to learn
**Type 1 (setpoint/demand)**
- driven by external programs rather than the system response
- long constant stretches followed by sudden steps
- diffusion models are better at continuous variation and struggle to reproduce such jumps accurately
**Type 2 (controller output)**
- constrained by the control algorithm (PID)
- values frequently saturate at 0 or 100
- extremely spiky distributions
**Type 3 (actuator positions)**
- dwell, saturation, and discrete notches are common
- the real dynamics differ from simple continuous noise
These characteristics show that dedicated models are needed rather than a generic DDPM.
---
## 4.3 Value and limits of the KS-only baseline
The KS-only baseline (empirical resampling) tells us:
- **how low KS can go in principle**
- whether the bottleneck lies in a few variables
Its limitations are also clear:
- it destroys the dependencies between variables
- it cannot guarantee joint-distribution consistency
- it can only serve as a diagnostic tool, not as the final generator
---
## 4.4 The fundamental tension between temporal structure and distribution
In this project, the tension between temporal structure and distribution is very typical:
- focusing only on the distribution: the sequence loses its inertia and becomes "random sampling"
- focusing only on temporal structure: the sequence becomes smooth and loses heavy tails and multi-modality
The two-stage architecture is designed precisely to ease this tension, but it cannot eliminate it entirely.
---
## 4.5 Why conditional modeling is needed
Real industrial data typically depend on the "operating condition / mode".
For example:
- different setpoints lead to different PV distributions
- different states lead to different actuator behavior
Future work should therefore condition the model:
\[
P(X|C)\quad \text{rather than}\quad P(X)
\]
Such conditional modeling can markedly reduce distributional error.
---
## 4.6 Future directions (concrete, actionable paths)
### Direction 1: a program generator for Type 1
- use an HMM or change-point model to generate setpoint/demand
- feed it to the diffusion model as a conditioning input
### Direction 2: a controller simulator for Type 2
- train a small controller network
- output the control signal from setpoint + PV
### Direction 3: an actuator dynamics model for Type 3
- model "state + dwell + rate"
- more realistic than pure resampling
### Direction 4: conditional diffusion for Type 4 PVs
- add operating-condition embeddings
- achieve more consistent distributions across regimes
## 4.7 Suggestions for writing up the experiments (paper perspective)
When writing the paper, the material can be organized as follows:
1) present the global KS/JSD/lag-1 metrics
2) present the per-feature KS ranking
3) emphasize that "a few variables dominate the error"
4) propose the type-aware divide-and-conquer strategy
5) use the KS-only baseline as a diagnostic upper bound
This presents the method as well-motivated and thorough.
---
## 4.8 Conclusion
The project provides a clear path:
- first decouple temporal structure from distribution (two stages)
- then handle continuous vs. discrete variables (hybrid diffusion)
- finally treat the hard variables with the type-aware split
It is therefore not just a generative model, but a systematic methodology for diagnosing and improving ICS sequence generation.
---
# Overall Summary
The core contribution of maskddpm is its **structured decomposition of the problem**:
- temporal structure and distribution alignment are modeled separately
- continuous and discrete variables are diffused separately
- hard variable types are handled individually
Through these structural means, the project offers an interpretable and extensible approach to industrial time-series generation. It does not simply chase the best metric values; it provides a complete research path.

links.txt
https://my.feishu.cn/wiki/Za4dwCsG6iPD9qklRLWcoJOZnnb?from=from_copylink

## Conversation summary (so far)
* You're building a **Modbus/TCP traffic generation system for security use** with this pipeline:
**feature extraction → hybrid diffusion generates features → generator turns features into raw packets → checker validates traffic**.
* I proposed a checker that validates traffic at **multiple layers**:
1. frame/IP/TCP parseability,
2. optional TCP conversation sanity,
3. Modbus/TCP structural + semantic validity (MBAP + PDU),
4. optional “feature-fidelity” checks against generator sidecar metadata.
* I recommended a **generator → checker contract** that is easy to debug and automate:
**PCAP/PCAPNG + JSONL sidecar** (1 metadata line per packet/event).
* You then shared your existing **descriptor-driven Modbus parser** (Rust) and a short doc.
The public API/type model is: `Config → FunctionDescriptor → FieldDescriptor(FieldType, length/length_from/scale/enum_map)`
and `parse_sawp_message(...)` returns JSON containing `unit`, `function`, optional `exception`, and `fields` .
* I said: the **descriptor engine is very reusable** for the checker, but the checkers entry point should ideally work on **raw Modbus/TCP bytes** (MBAP+PDU), not require `sawp_modbus::Message`.
Also: for checker ergonomics, prefer structured error types over `Result<Value, String>` (but you can keep JSON output for the MVP).
---
# Implementation document: Modbus/TCP Checker (Rust)
This document is a practical, "not too complex" plan to implement the checker while still following good practices where they don't add much difficulty.
## 1) What the checker does (goals / non-goals)
### Goals
The checker verifies that generator output is:
1. **Parsable** as TCP/IP traffic,
2. **Modbus/TCP-valid** at the application level (MBAP + PDU rules),
3. **Consistent** in request/response pairing (Transaction ID matching),
4. Optionally **matches the expected features** (function code, unit id, quantities, payload size, timing tolerances, etc.).
### Non-goals (to keep it simple)
To avoid turning this into a full Wireshark, we deliberately **do not** implement:
* full TCP stream reassembly (segments split/merged),
* full TCP state machine with retransmits/out-of-order handling,
* IP/TCP checksum verification by default.
Instead, we enforce a **generator constraint**: **one Modbus ADU per TCP payload** (no segmentation, no coalescing). This single constraint dramatically reduces checker complexity and is realistic for generated traces.
> Trade-off: best practice would handle segmentation/coalescing and reassembly; difficulty rises a lot. The “one ADU per TCP payload” rule is the best complexity/benefit lever for this project.
---
## 2) Generator output contract (what the checker consumes)
### Recommended output (MVP-friendly and debuggable)
**(A) PCAP or PCAPNG file**
* `trace.pcapng` (or `.pcap`) containing the raw generated packets
**(B) Sidecar JSONL metadata file**
* `trace.meta.jsonl` where each line describes the corresponding packet/event (same order)
This is the easiest way to:
* reproduce failures,
* correlate packet index with expected semantic fields,
* produce actionable reports.
### JSONL schema (minimal + optional)
**Minimal fields (recommended):**
* `trace_id` (string/uuid)
* `event_id` (monotonic integer)
* `pcap_index` (or implicit by line number)
* `ts_ns` timestamp
* `direction` (`"c2s"` or `"s2c"`)
* `flow` (src/dst ip/port)
**Optional `expected` block (for feature-fidelity checks):**
* `expected.modbus.transaction_id`, `unit_id`, `function_code`, and `expected.fields` (names matching your descriptor JSON).
Example line:
```json
{
"trace_id": "c7f1...",
"event_id": 42,
"pcap_index": 42,
"ts_ns": 1736451234567890123,
"direction": "c2s",
"flow": {"src_ip":"10.0.0.10","src_port":51012,"dst_ip":"10.0.0.20","dst_port":502},
"expected": {
"modbus": {"transaction_id": 513, "unit_id": 1, "function_code": 3},
"fields": {"starting_address": 0, "quantity": 10}
}
}
```
> Trade-off: best practice is “self-describing PCAP” (pcapng custom blocks, or embedding metadata); difficulty higher. JSONL sidecar is dead simple and works well.
---
## 3) Workflow (starting from generator output)
### Step 0 — Load inputs
1. Read `trace.meta.jsonl` into a lightweight iterator (don't load everything if the trace is huge).
2. Open `trace.pcapng` and stream packets in order.
### Step 1 — Align packets and metadata
For each packet index `i`:
* read packet `i` from PCAP
* read metadata line `i` from JSONL
If mismatch (missing line/packet), record a **Fatal** alignment error and stop (or continue with “best effort”, your call).
### Step 2 — Decode packet and extract TCP payload
Decode:
* link layer (Ethernet/SLL/RAW depending on PCAP linktype),
* IPv4/IPv6,
* TCP,
* extract TCP payload bytes.
Minimal checks:
* packet parses,
* TCP payload length > 0 when direction indicates Modbus message,
* port 502 is present on either side (configurable if you generate non-502).
### Step 3 — Parse Modbus/TCP ADU
Assuming payload contains exactly one ADU:
* parse MBAP (7 bytes) + PDU
* validate basic MBAP invariants
* parse function code and PDU data
* decide request vs response based on `direction`
* parse PDU data using descriptor map (your reusable part)
### Step 4 — Stateful consistency checks
Maintain per-flow state:
* request/response pairing by `(transaction_id, unit_id)`
* outstanding request table with timeout/window limits
### Step 5 — Feature-fidelity checks (optional)
If `expected` exists in JSONL:
* compare decoded modbus header + parsed fields with expected values
* compare sizes and (optionally) timing with tolerances
### Step 6 — Emit report
Output:
* `report.json` with summary + per-finding samples (packet indices, flow key, reason, extracted fields)
* optional `report.txt` for quick reading
---
## 4) Reusing your existing parser (what to keep, what to adjust)
You already have:
* A descriptor model (`Config/FunctionDescriptor/FieldDescriptor/FieldType`)
* A function that returns a JSON representation with the shape the checker wants (`unit`, `function`, optional `exception`, `fields`)
### 4.1 What is immediately reusable
**Highly reusable for the checker:**
* Descriptor loading (serde)
* Field decoding logic (length/length_from, scale, enum_map)
* The “JSON output” idea for reporting and debugging
### 4.2 Small design adjustment to make reuse clean (recommended)
Your checker will naturally see **raw TCP payload bytes**. So the lowest-friction integration is:
* Implement a tiny **MBAP parser** in the checker:
* returns `(transaction_id, protocol_id, length, unit_id, function_code, pdu_data)`
* Then call your descriptor-based decoder on `pdu_data` (bytes **after** function code)
Your doc shows the parser conceptually returns JSON with `fields` and supports request vs response descriptors, which maps perfectly to `direction`.
**Suggested public entrypoint to expose from your parser module:**
* `parse_with_descriptor(pdu_data: &[u8], unit: u8, function: u8, fields: &Vec<FieldDescriptor>) -> Result<Value, String>`
If it's currently private, just make it `pub(crate)` or `pub` and reuse it. This avoids binding the checker to `sawp_modbus::Message` and keeps implementation simple.
> Trade-off: best practice would be to return a typed struct + typed errors; easier to maintain long term but more refactor work. For your "don't make it hard" requirement, keeping JSON output + simple error types is totally fine for the first version.
### 4.3 How the checker chooses which descriptor to use
* If `direction == c2s` → request descriptor
* If `direction == s2c` → response descriptor
This matches the intent of having `request` and `response` descriptor vectors in your model.
---
## 5) Checker internal design (simple but extensible)
### 5.1 Core data structures
* `FlowKey { src_ip, src_port, dst_ip, dst_port, ip_version }`
* `PacketCtx { trace_id, event_id, pcap_index, ts_ns, direction, flow }`
* `DecodedModbus { transaction_id, protocol_id, length, unit_id, function_code, is_exception, exception_code?, pdu_data, parsed_fields_json? }`
### 5.2 “Rules” model (optional, but keeps code tidy)
Instead of huge if/else blocks, implement a few rules that return findings:
* `RuleMbapValid`
* `RuleFunctionPduWellFormed` (basic length sanity)
* `RuleTxIdPairing`
* `RuleExpectedMatch` (only if sidecar has expected)
If you don't want a formal trait system initially, just implement these as functions that append to a `Vec<Finding>`.
### 5.3 Findings + severity
Use a compact severity scale:
* `Fatal`: cannot parse / cannot continue reliably
* `Error`: protocol invalid
* `Warn`: unusual but maybe acceptable
* `Info`: stats
A finding should include:
* `pcap_index`, `event_id`, `flow`, `severity`, `code`, `message`
* optional `observed` and `expected` snippets
---
## 6) What the checker validates (MVP vs stricter)
### MVP validations (recommended first milestone)
1. PCAP + JSONL aligned
2. Parse Ethernet/IP/TCP and extract payload
3. MBAP:
* payload length ≥ 7
* length field consistency (basic)
4. PDU:
* function code exists
* exception handling if `fc & 0x80 != 0`
5. Descriptor parse success (request/response based on direction)
6. Transaction pairing:
* every response matches an outstanding request by transaction_id/unit_id
* no duplicate outstanding txid unless you allow it
### “Strict mode” additions (still reasonable)
* enforce unit_id range (if you want)
* enforce function-code-specific invariants using parsed fields
* e.g., `byte_count == 2 * quantity` for register reads/writes (if present in descriptor)
* timeouts:
* response must arrive within configured window
### Heavy features (avoid unless needed)
* TCP reassembly and multi-ADU per segment
* checksum verification
* handling retransmits/out-of-order robustly
---
## 7) Dependencies (crates) for the checker
### Minimal set (keeps implementation easy)
* **PCAP reading**
* `pcap` (libpcap-backed; you already use it in your codebase)
* **Packet decoding**
* `pnet_packet` (you already use `pnet` patterns)
* **Config + sidecar + report**
* `serde`, `serde_json`
* **Errors + logging**
* `anyhow` (fast to integrate) and/or `thiserror` (nicer structured errors)
* `tracing`, `tracing-subscriber`
* **Utilities**
* `hashbrown` (optional; std HashMap is fine)
* `hex` (useful for debug/trailing bytes like your parser does)
### If you want to reduce external requirements (optional alternative)
* Replace `pcap` with `pcap-file` (pure Rust; no libpcap dependency)
* Replace `pnet` with `etherparse` (often simpler APIs)
> Trade-off: “best practice” for portability is pure Rust (`pcap-file` + `etherparse`).
> “Best practice” for least effort *given your current code* is reusing `pcap` + `pnet`.
---
## 8) Suggested project layout (simple)
```
checker/
src/
main.rs # CLI entry
config.rs # descriptor loading
meta.rs # JSONL reader structs
pcap_in.rs # pcap streaming
decode.rs # ethernet/ip/tcp extract payload
mbap.rs # Modbus/TCP MBAP parsing
modbus_desc.rs # reuse your parse_with_descriptor + types
state.rs # outstanding tx table
validate.rs # main validation pipeline
report.rs # report structs + JSON output
```
---
## 9) Practical implementation tips (to keep it from getting “hard”)
1. **Enforce generator constraints**:
* one ADU per TCP payload
* no splitting/coalescing
This keeps checker complexity low and makes failure reasons obvious.
2. **Keep JSON output for parsed fields** at first:
* You already have a clean JSON shape (`unit`, `function`, `fields`)
* Great for debugging mismatches with `expected.fields`
3. **Add strictness as “modes”**:
* `--mode=mvp | strict`
* or config file toggles
4. **Fail-fast vs best-effort**:
* For CI or batch filtering, fail-fast on `Fatal` is fine.
* For research/debugging, best-effort (continue and collect findings) is more useful.
---

@inproceedings{10.1145/3544216.3544251,
author = {Yin, Yucheng and Lin, Zinan and Jin, Minhao and Fanti, Giulia and Sekar, Vyas},
title = {Practical GAN-based synthetic IP header trace generation using NetShare},
year = {2022},
isbn = {9781450394208},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3544216.3544251},
doi = {10.1145/3544216.3544251},
abstract = {We explore the feasibility of using Generative Adversarial Networks (GANs) to automatically learn generative models to generate synthetic packet- and flow header traces for networking tasks (e.g., telemetry, anomaly detection, provisioning). We identify key fidelity, scalability, and privacy challenges and tradeoffs in existing GAN-based approaches. By synthesizing domain-specific insights with recent advances in machine learning and privacy, we identify design choices to tackle these challenges. Building on these insights, we develop an end-to-end framework, NetShare. We evaluate NetShare on six diverse packet header traces and find that: (1) across all distributional metrics and traces, it achieves 46\% more accuracy than baselines and (2) it meets users' requirements of downstream tasks in evaluating accuracy and rank ordering of candidate approaches.},
booktitle = {Proceedings of the ACM SIGCOMM 2022 Conference},
pages = {458--472},
numpages = {15},
keywords = {synthetic data generation, privacy, network packets, network flows, generative adversarial networks},
location = {Amsterdam, Netherlands},
series = {SIGCOMM '22}
}

# Practical GAN-based synthetic IP header trace generation using
**Question 1**: Summarize the paper, covering background and problem, objective, method, main results, and conclusions, in 150-300 words, using the paper's own terminology.
The paper targets packet/flow header traces that are hard to share for network management tasks (telemetry, anomaly detection, provisioning), and points out that existing GAN-based tabular and time-series generation methods fall short on fidelity (field correlations and large-support fields), the scalability-fidelity trade-off, and the privacy-fidelity trade-off. The goal is an end-to-end framework that automatically learns to generate synthetic IP header traces. Method (NetShare): merge traces across epochs and split them by five-tuple into flow time series; model them with a DoppelGANger-style time-series GAN; apply log(1+x) to wide-range numeric fields, bit encoding to IPs, and IP2Vec embeddings trained on public data to ports/protocols; improve efficiency via chunking + fine-tuning with parallel training; and improve DP training via public pretraining + DP-SGD. Results: across distributional metrics on six datasets NetShare is on average 46% more accurate than baselines, and it preserves accuracy and rank ordering in downstream tasks. Conclusion: GANs are feasible in this setting, but require domain-specific reformulation and privacy/efficiency design.
**Question 2**: Extract the paper's original abstract (usually after "Abstract" and before the Introduction).
We explore the feasibility of using Generative Adversarial Networks (GANs) to automatically learn generative models to generate synthetic packet- and flow header traces for networking tasks (e.g., telemetry, anomaly detection, provisioning). We identify key fidelity, scalability, and privacy challenges and tradeoffs in existing GAN-based approaches. By synthesizing domain-specific insights with recent advances in machine learning and privacy, we identify design choices to tackle these challenges. Building on these insights, we develop an end-to-end framework, NetShare. We evaluate NetShare on six diverse packet header traces and find that: (1) across all distributional metrics and traces, it achieves 46% more accuracy than baselines and (2) it meets users' requirements of downstream tasks in evaluating accuracy and rank ordering of candidate approaches.
**Question 3**: List all authors of the paper, in this format: `Author1, Author2, Author3`
Yucheng Yin, Zinan Lin, Minhao Jin, Giulia Fanti, Vyas Sekar
**Question 4**: State directly which conference or journal published this paper, without reasoning or extra information.
ACM SIGCOMM 2022 Conference (SIGCOMM '22)
**Question 5**: Describe in detail the core problem the paper addresses, and give a concise summary.
The core problem is how to generate synthetic packet/flow IP header traces that are "realistic and usable enough" for networking tasks without sharing raw data, while meeting three practical constraints: (1) Fidelity: capture cross-field and cross-record correlations (flow size, the same flow across epochs, multi-packet flows) and learn the long tails and heavy hitters of fields with large support (IP/port, pkts/bytes); (2) Scalability: keep training cost manageable on traces with millions of records, avoiding the orders-of-magnitude CPU/GPU-hours of complex time-series GANs; (3) Privacy: avoid the privacy-fidelity trade-off where enabling (ε,δ)-DP destroys fidelity. In short: NetShare aims to make synthetic network header data generation usable, trainable, and privatizable in practice.
**Question 6**: List the methods proposed in the paper, summarizing the core idea of each as concisely as possible.
(1) Reformulation as flow time-series generation: rewrite per-epoch tabular records as "merge into a global trace → split into flows by five-tuple → generate a time series per flow", to learn cross-packet/cross-epoch correlations explicitly. (2) Field representation/encoding: log(1+x) for wide-range numerics such as pkts/bytes, bitwise encoding for IPs, IP2Vec embeddings trained on public data for ports/protocols, balancing learnability and DP feasibility. (3) Scalability via chunking + fine-tuning + parallelism: cut the flow trace into M fixed-time chunks, train a seed model on the first chunk, fine-tune the rest in parallel, and use flow tags to preserve cross-chunk dependencies. (4) Practical DP via public pretraining + DP-SGD fine-tuning: pretrain on similar public traces, then fine-tune on private data with DP-SGD to reduce the number of DP-SGD rounds and improve the privacy-fidelity trade-off. (5) Post-processing for correctness: map fields back, fill in derived fields (e.g., checksums), and reassemble into PCAP/NetFlow to keep the traces usable and protocol-consistent.
**Question 7**: State the datasets used in the paper, including names and sources.
Flow (NetFlow) datasets: (1) UGR16: NetFlow v9 collected at a Spanish ISP; the paper uses the third week of March 2016; (2) CIDDS: traffic from an emulated small-business environment with injected malicious traffic, labeled benign/attack with attack types; (3) TON_IoT (TON): an IoT telemetry/security dataset; the paper uses the Train_Test_datasets subset. Packet (PCAP) datasets: (4) CAIDA Anonymized Internet Traces: anonymized PCAPs from a high-speed commercial backbone monitor; the paper uses a March 2018 New York collector subset; (5) Data Center (DC): packet captures from the "UNI1" data center studied in IMC 2010; (6) CyberAttack (CA): captures from the U.S. National CyberWatch Mid-Atlantic CCDC competition of March 2012 (public capture files from Netresec). In addition, public data used for privacy/encoding pretraining: CAIDA Chicago collector 2015 (for IP2Vec port/protocol embeddings and the DP pretraining example).
**Question 8**: List all evaluation metrics used in the paper and briefly explain their purpose.
(1) Jensen-Shannon divergence (JSD): gap between synthetic and real distributions of categorical fields (SA/DA, SP/DP, PR); lower is better. (2) Earth Mover's Distance (EMD / Wasserstein-1): distributional difference for continuous fields (TS, TD, PKT, BYT, PS, PAT, FS); lower is better. (3) Normalized EMD: EMD normalized per field to a fixed range for cross-field comparability and reporting. (4) Downstream-task accuracy (e.g., traffic-type prediction with various classifiers): tests "train on synthetic, test on real" usability. (5) Order preservation / Spearman's rank correlation coefficient: whether the relative ranking of algorithms/models on real vs. synthetic data is preserved (classifiers, sketches, NetML modes); closer to 1 is better. (6) Heavy-hitter count estimation relative error, |error_syn - error_real| / error_real: whether sketch-based telemetry errors are preserved by synthetic data. (7) Anomaly-ratio relative error, |ratio_syn - ratio_real| / ratio_real: consistency of anomaly-detection output ratios. (8) Scalability: training cost in total CPU hours, for comparing the scalability-fidelity trade-off. (9) Differential-privacy parameters (ε, δ): privacy strength (δ fixed at 10^-5), used for privacy-fidelity curves. (10) Protocol/domain correctness checks (IP address validity, pkt/byte vs. protocol constraints, port-protocol consistency, minimum packet length, etc.): semantic/syntactic correctness of the generated traces.
**Question 9**: Summarize the experimental results, including concrete numbers and conclusions.
(1) Overall fidelity: across six header traces, NetShare is "across all distributional metrics and traces" 46% more accurate than baselines; the paper further reports roughly 48% better on NetFlow distributional metrics and 41% better on PCAP, measured by JSD/normalized EMD. (2) Downstream tasks: on TON traffic-type prediction, NetShare beats CTGAN/STAN/E-WGAN-GP on all five classifiers, e.g., 12% higher accuracy than the next-best baseline (E-WGAN-GP) with MLP, reaching 84% of the real-data accuracy. For rank preservation, Spearman correlation is 0.90 on CIDDS (baselines 0.60/0.60/0.70) and 0.70 on TON (baselines 0.10/0.60/-0.60). (3) Sketch telemetry: on heavy-flow counting, NetShare's relative error is on average about 48% lower than baselines and it achieves a perfect match of the sketch-algorithm ranking. (4) Anomaly detection (NetML modes): rank correlation 1.00 on CAIDA, 0.94 on DC, 0.88 on CA, overall better than baselines. (5) Privacy: under (ε,δ)-DP, public pretraining clearly beats naive DP-SGD, but even on CAIDA at ε=24.24 the mean EMD of DP-pretrain-SAME is still about 2.3x worse than non-DP (naive DP about 3.5x); high-dimensional DP synthesis remains an open problem.
**Question 10**: Clearly describe the work done in the paper, listing the motivation, contributions, and main innovations.
Motivation: (1) real packet/flow header traces are essential for evaluating monitoring algorithms, anomaly detection, and capacity planning, but business and privacy constraints make them hard to share; (2) traditional simulation/model-driven approaches require heavy domain knowledge and generalize poorly, while existing ML/GAN methods fail on correlations, scale, and privacy.
Contributions and main innovations: (1) an end-to-end system, NetShare, that reframes header-trace synthesis from per-epoch tabular generation to global merging + per-flow time-series GAN generation, systematically fixing missing flow-size/cross-epoch correlations; (2) field-representation designs oriented to fidelity, scalability, and privacy (log transform, IP bit encoding, IP2Vec for ports/protocols trained on public data), with an explanation of why some embeddings break DP training (data-dependent dictionaries); (3) chunked parallel training + seed-model fine-tuning with flow tags to improve the scalability-fidelity trade-off; (4) bringing "public pretraining + DP-SGD fine-tuning" to header-trace generation, empirically improving the privacy-fidelity trade-off and quantifying its limits; (5) a systematic evaluation on six public datasets with distributional metrics plus multiple downstream tasks (classification, sketch telemetry, anomaly detection), demonstrating practical accuracy and rank preservation, complemented by protocol-consistency checks and a usable PCAP/NetFlow output pipeline (including derived-field generation).

@ARTICLE{11087622,
author={Liu, Xiaosi and Xu, Xiaowen and Liu, Zhidan and Li, Zhenjiang and Wu, Kaishun},
journal={IEEE Transactions on Mobile Computing},
title={Spatio-Temporal Diffusion Model for Cellular Traffic Generation},
year={2026},
volume={25},
number={1},
pages={257-271},
keywords={Base stations;Diffusion models;Data models;Uncertainty;Predictive models;Generative adversarial networks;Knowledge graphs;Mobile computing;Telecommunication traffic;Semantics;Cellular traffic;data generation;diffusion model;spatio-temporal graph},
doi={10.1109/TMC.2025.3591183}}

# Spatio-Temporal Diffusion Model for Cellular
**Question 1**: Summarize the paper, covering background and problem, objective, method, main results, and conclusions, in 150-300 words, using the paper's own terminology.
Mobile network planning and resource allocation need usable models of traffic demand, but in practice they rely heavily on historical data, operator data are restricted by privacy, and existing GAN/autoregressive methods are unstable on city-scale base-station traffic generation, prone to mode collapse, and poor at modeling the uncertainty of traffic under the same spatio-temporal context. The paper proposes STOUTER (Spatio-Temporal diffusiOn model for cellUlar Traffic genERation): a DDPM-style forward diffusion injects noise into real traffic, and reverse denoising reconstructs generated traffic distributions from a Gaussian prior. To capture long/short-term periodic patterns and spatial patterns, the authors pre-train a temporal graph and a base station graph and design a Spatio-Temporal Feature Fusion Module (STFFM), which fuses time embeddings, base-station embeddings, and the diffusion step inside a Spatio-Temporal UNet (STUnet) to predict noise. Experiments on Milan and Trentino, covering five CDR traffic types plus aggregated Call/SMS, show that on Internet traffic STOUTER improves over baselines by 19.23% MAE, 18.35% RMSE, and 52.77% JSD on average. The conclusion is that STOUTER generates city-scale base-station traffic closer to the real distribution, including uncertainty fluctuations, providing high-quality synthetic data for downstream prediction and network optimization.
**Question 2**: Extract the paper's original abstract (usually after "Abstract" and before the Introduction).
In the digital era, the increasing demand for network traffic necessitates strategic network infrastructure planning. Accurate modeling of traffic demand through cellular traffic generation is crucial for optimizing base station deployment, enhancing network efficiency, and fostering technological innovation. In this paper, we introduce STOUTER, a spatio-temporal diffusion model for cellular traffic generation. STOUTER incorporates noise into traffic data through a forward diffusion process, followed by a reverse reconstruction process to generate realistic cellular traffic. To effectively capture the spatio-temporal patterns inherent in cellular traffic, we pre-train a temporal graph and a base station graph, and design the Spatio-Temporal Feature Fusion Module (STFFM). Leveraging STFFM, we develop STUnet, which estimates noise levels during the reverse denoising process, successfully simulating the spatio-temporal patterns and uncertainty variations in cellular traffic. Extensive experiments conducted on five cellular traffic datasets across two regions demonstrate that STOUTER improves cellular traffic generation by 52.77% in terms of the Jensen-Shannon Divergence (JSD) metric compared to existing models. These results indicate that STOUTER can generate cellular traffic distributions that closely resemble real-world data, providing valuable support for downstream applications.
**Question 3**: List all authors of the paper, in this format: `Author1, Author2, Author3`
Xiaosi Liu, Xiaowen Xu, Zhidan Liu, Zhenjiang Li, Kaishun Wu
**Question 4**: State directly which conference or journal published this paper, without reasoning or extra information.
IEEE Transactions on Mobile Computing
**Question 5**: Describe in detail the core problem the paper addresses, and give a concise summary.
The core problem: given insufficient real operator data (privacy constraints, difficult acquisition) and the computation/latency cost and long-horizon error accumulation of purely history-based approaches, how to generate city-scale, base-station-level cellular traffic that approximates the real distribution while reflecting three key regularities: long/short-term periodicity (hourly/daily/weekly), spatial heterogeneity (differences across base stations/regions), and the uncertainty of traffic under the same spatio-temporal conditions. The authors note that GAN-based generation at this scale is unstable and suffers mode collapse, often depends on knowledge graphs / multi-source data, or fits only fixed patterns while ignoring uncertainty; hence a more stable generative paradigm that "recovers the distribution" from a random prior and injects spatio-temporal priors into generation is needed. In short: replace GANs with diffusion, explicitly fusing temporal-periodicity-graph and base-station-graph representations during denoising to generate base-station traffic sequences that show both regularity and stochastic fluctuation.
**Question 6**: List the methods proposed in the paper, summarizing the core idea of each as concisely as possible.
1) STOUTER: a spatio-temporal diffusion framework for cellular traffic generation; forward diffusion gradually adds Gaussian noise, and reverse denoising samples from N(0, I) and iteratively denoises to reconstruct traffic.
2) Temporal graph (G_t) + representation learning: hour nodes with two kinds of directed edges (hourly relation, daily relation) encode short-term and weekly periodic relations; a GraphMAE-style GIN encoder/decoder yields time-slot embeddings F_t.
3) Base station graph (G_s) + representation learning: base stations as nodes, binary undirected edges within a 1 km threshold, node features from counts of eight POI categories in the coverage area; a GCN encoder/decoder yields base-station embeddings F_s.
4) Pre-training objective (L_pre, SCE): graph auto-encoding reconstruction with a Scaled Cosine Error, with γ > 1 to emphasize hard samples and improve representation quality.
5) STFFM (Spatio-Temporal Feature Fusion Module): fuses the diffusion-step embedding (SinPosEmb + MLP) with Concat(F_s, F_t) through an FC layer into F_st, injected as a condition into the convolutional features of every residual block to guide denoising and preserve spatio-temporal patterns.
6) STUnet: a Spatio-Temporal UNet with Down/Mid/Up blocks as the noise-prediction network ε_θ(x_t, t); multi-head attention in the Mid block and multi-head linear attention in the Down/Up blocks fuse historical traffic, the diffusion step, and spatio-temporal semantics.
7) Diffusion training objective: trained on the noise-prediction error (the paper gives a squared error between ε_t and ε_θ).
**Question 7**: State the datasets used in the paper, including names and sources.
Public CDR datasets for Milan and Trentino ("Call Detailed Records (CDRs) from Italy", provided by the Semantics and Knowledge Innovation Lab; the paper cites Barlacchi et al., Sci. Data 2015 [52]). Traffic types include five classes (Internet, Received-SMS, Sent-SMS, Incoming-Call, Outgoing-Call) plus aggregated datasets Call (Incoming+Outgoing) and SMS (Received+Sent). POI data come from OpenStreetMap and are used to build the eight-category POI features of base-station nodes (cited as [34]).
**Question 8**: List all evaluation metrics used in the paper and briefly explain their purpose.
1) MAE (Mean Absolute Error): average absolute deviation between generated and real traffic per base station, reflecting numeric error.
2) RMSE (Root Mean Square Error): root-mean-square error, more sensitive to large errors, measuring overall reconstruction deviation.
3) KL divergence: the basic quantity used to define distributional difference (a component of JSD in the paper).
4) JSD (Jensen-Shannon divergence): a symmetric, more stable distribution-similarity measure comparing generated and real traffic distributions; one of the paper's main metrics (52.77% improvement reported). In the case study, downstream prediction is additionally evaluated with the MAE and RMSE of an LSTM forecaster.
**Question 9**: Summarize the experimental results, including concrete numbers and conclusions.
Overall (Section VI-B): on Milan and Trentino, across the five raw traffic types plus aggregated Call/SMS, STOUTER beats TCN-GAN, VAE, and diffusion baselines using DiffWave/WaveNet denoisers on MAE, RMSE, and JSD. On the Internet dataset the paper reports average improvements over baselines of 19.23% MAE, 18.35% RMSE, and 52.77% JSD. Visualization (Fig. 9, Trentino Internet at the one-month scale): STOUTER reproduces periodic trends better with smaller deviation; example metrics are STOUTER MAE 0.18 and RMSE 0.2558, better than VAE (0.2923/0.4528), DiffWave (0.2359/0.3384), and TCN-GAN (1.5976/2.5625). Ablation (Fig. 7): removing the temporal graph (w/o TG), the base station graph (w/o BSG), or both (w/o ST) degrades the metrics, showing the spatio-temporal representations and fusion are essential; on data with more randomness and weaker spatio-temporal correlation (e.g., Received-SMS) the JSD advantage shrinks. Parameter study (Fig. 8): increasing the diffusion steps t from 10 to 1000 improves quality overall, stabilizing around t=600 for Milan Internet and t=800 for Trentino Internet, with diminishing returns beyond the threshold. Case study (Fig. 10): training an LSTM traffic predictor on generated data keeps the MAE/RMSE gap to a model trained on real data within 0.1, indicating the generated data are usable downstream.
**Question 10**: Clearly describe the work done in the paper, listing the motivation, contributions, and main innovations.
Motivation: 1) base-station deployment and resource optimization need trustworthy traffic-demand data, but real data are limited by privacy and acquisition constraints; 2) forecasting methods require long histories and accumulate error over long horizons; 3) GAN-based city-scale traffic generation suffers training instability, mode collapse, strong dependence on multi-source knowledge graphs, and usually ignores uncertainty fluctuations.
Contributions: 1) STOUTER, a spatio-temporal diffusion model for large-scale base-station cellular traffic generation, which denoises from a Gaussian prior and thus captures the uncertainty pattern. 2) Two pre-trained prior graphs: a temporal graph modeling hourly/daily (including weekly) relations to capture long/short-term periodic patterns, and a base station graph combining distances with eight-category POI statistics to model spatial patterns. 3) STFFM and STUnet, which fuse diffusion-step embeddings and spatio-temporal graph embeddings at every layer of the denoiser, combined with multi-head (linear) attention, to model spatio-temporal patterns and uncertainty jointly. 4) Large-scale empirical validation: on Milan/Trentino across five traffic types, a 52.77% JSD improvement over SOTA baselines (Internet task), plus ablations, variants, parameter sensitivity, and a downstream-prediction case study.
Main innovation: injecting spatio-temporal graph prior representations (temporal graph + base station graph) into a DDPM-style denoising network (STUnet) in a modular way, reproducing periodicity, spatial heterogeneity, and uncertainty fluctuations without a city knowledge graph, and demonstrating with distributional metrics (JSD) that the generated distribution is closer to reality.

@article{Ring_2019,
title={Flow-based network traffic generation using Generative Adversarial Networks},
volume={82},
ISSN={0167-4048},
url={http://dx.doi.org/10.1016/j.cose.2018.12.012},
DOI={10.1016/j.cose.2018.12.012},
journal={Computers \& Security},
publisher={Elsevier BV},
author={Ring, Markus and Schlör, Daniel and Landes, Dieter and Hotho, Andreas},
year={2019},
month=may, pages={156--172} }

# Flow-based Network Traffic Generation using Generative Adversarial Ne
**Question 1**: Summarize the paper, covering background and problem, objective, method, main results, and conclusions, in 150-300 words, using the paper's own terminology.
Addressing the scarcity of flow-based data for NIDS evaluation and the associated privacy/labeling problems, the paper proposes using Generative Adversarial Networks to generate realistic NetFlow/IPFIX flows. The core challenge is that GANs handle only continuous attributes, while flow data contain many categorical fields (IPs, ports, protocol). The authors propose three preprocessing/generation approaches: N-WGAN-GP (numeric normalization, treating IP/port as continuous values), B-WGAN-GP (binarization, turning IP/port/bytes/packets into bit vectors), and E-WGAN-GP (embeddings, mapping IP/port/bytes/packets/duration into continuous R^m vectors learned with IP2Vec). The generator uses the Improved Wasserstein GAN with Gradient Penalty (WGAN-GP) trained with TTUR. Evaluation uses distribution visualization, per-attribute Euclidean distances, and seven domain-knowledge quality tests. Training on CIDDS-001 weeks 2-4 and generating 8.5M flows per method, experiments show that E-WGAN-GP and B-WGAN-GP reproduce conditional distributions and structural relations (subnet behavior, protocol/port consistency); E-WGAN-GP is overall best on the domain checks, while N-WGAN-GP tends to produce implausible values. Conclusion: GANs combined with suitable discrete-to-continuous representations (especially IP2Vec embeddings) can generate high-quality flow-based network traffic for NIDS training and evaluation.
**Question 2**: Extract the paper's original abstract (usually after "Abstract" and before the Introduction).
Flow-based data sets are necessary for evaluating network-based intrusion detection systems (NIDS). In this work, we propose a novel methodology for generating realistic flow-based network traffic. Our approach is based on Generative Adversarial Networks (GANs) which achieve good results for image generation. A major challenge lies in the fact that GANs can only process continuous attributes. However, flow-based data inevitably contain categorical attributes such as IP addresses or port numbers. Therefore, we propose three different preprocessing approaches for flow-based data in order to transform them into continuous values. Further, we present a new method for evaluating the generated flow-based network traffic which uses domain knowledge to define quality tests. We use the three approaches for generating flow-based network traffic based on the CIDDS-001 data set. Experiments indicate that two of the three approaches are able to generate high quality data. Keywords: GANs, TTUR WGAN-GP, NetFlow, Generation, IDS
**Question 3**: List all authors of the paper, in this format: `Author1, Author2, Author3`
Markus Ring, Daniel Schlör, Dieter Landes, Andreas Hotho
**Question 4**: State directly which conference or journal published this paper, without reasoning or extra information.
Computers & Security
**Question 5**: Describe in detail the core problem the paper addresses, and give a concise summary.
The core problem is how to let GANs, which can only process continuous variables, generate flow-based network traffic containing many categorical fields (IP addresses, ports, protocols, etc.), while ensuring the generated data approximate real traffic in distributional diversity and internal semantic relations (subnet structure, protocol-port consistency), for NIDS training and evaluation. In short: convert discrete flow attributes into continuous representations suitable for GANs (numeric, binary, or IP2Vec embeddings), learn real flow characteristics with WGAN-GP + TTUR to generate high-quality synthetic flows, and assess quality with domain-knowledge rules and distributional distances.
**Question 6**: List the methods proposed in the paper, summarizing the core idea of each as concisely as possible.
(1) N-WGAN-GP: normalize the four IP octets and ports to [0,1] as continuous values, also normalize bytes/packets/duration, then generate with WGAN-GP + TTUR. (2) B-WGAN-GP: map IPs to 32-bit and ports to 16-bit binary, bytes/packets to fixed-length binary, preserving subnet structure, then generate with WGAN-GP. (3) E-WGAN-GP: extend IP2Vec to learn m-dimensional embeddings of IP/port/protocol and bytes/packets/duration; the GAN generates embeddings, which are mapped back to original values via nearest neighbors (e.g., cosine similarity). (4) Baseline: independently sample each attribute from its empirical distribution, as a control. (5) Evaluation methodology: distribution visualization, per-attribute Euclidean distances, and seven domain-knowledge quality tests.
**Question 7**: State the datasets used in the paper, including names and sources.
CIDDS-001 (flow-based benchmark data sets for intrusion detection), a public dataset from Coburg University and the University of Würzburg, containing NetFlow captured in an OpenStack-emulated enterprise network; weeks 2-4 (about 22 million flows) are used for training and week 1 (about 8.5 million flows) as the reference for evaluation.
**Question 8**: List all evaluation metrics used in the paper and briefly explain their purpose.
(1) Distribution visualization (violin plots / temporal distributions): checks whether conditional distributions and subnet behavior match real traffic. (2) Per-attribute Euclidean distance: compares the probability distributions of generated data and the training set per attribute, measuring distributional closeness and the plausibility of concept drift. (3) Domain-knowledge checks (7 rules): protocol-flag consistency, internal/external IP constraints, HTTP/HTTPS/DNS port-protocol correspondence, broadcast/multicast placement, NetBIOS flow structure, bytes/packets range relations, etc., testing internal semantic correctness. (4) Generation scale and coverage: 8.5M flows per method, ensuring diversity and statistical stability.
**Question 9**: Summarize the experimental results, including concrete numbers and conclusions.
On CIDDS-001, each method generates about 8.5 million flows. For temporal distributions, E-WGAN-GP is smooth and close to the real week-1 working-hours/lunch-break behavior. Attribute-distribution visualizations show the baseline cannot capture conditional distributions, N-WGAN-GP wrongly concentrates on the ext subnet, while B-WGAN-GP and E-WGAN-GP distinguish client/server ports and destination IP ranges. For Euclidean distances (relative to weeks 2-4), E-WGAN-GP is close to the week-1 reference on source/destination port, bytes, and packets (e.g., destination port 0.0327, bytes 0.0278, packets 0.0251), whereas N-WGAN-GP has large distances on ports/bytes/packets (e.g., source port 0.5658, bytes 0.5858, packets 1.0416). On the seven domain checks (% passed), E-WGAN-GP is highest overall on Tests 1-7 (e.g., Test 1: 99.77, Test 2: 99.98, Test 7: 99.49); B-WGAN-GP is strong on protocol-port consistency (Test 3: 99.97, Test 4: 99.90) but weak on broadcast-related rules (Test 5: 47.13, Test 6: 40.19); N-WGAN-GP is clearly worse on UDP/TCP flags and DNS/HTTP rules (Test 4: 87.14); the baseline is lowest overall. Conclusion: E-WGAN-GP has the best quality, B-WGAN-GP is second and can generate unseen values, and N-WGAN-GP is unsuitable.
**Question 10**: Clearly describe the work done in the paper, listing the motivation, contributions, and main innovations.
Motivation: NIDS evaluation lacks up-to-date, shareable, labeled flow-based data; GANs are good at generation, but network flows contain many categorical fields and complex semantic relations, limiting direct application.
Contributions: (1) the first GAN framework for generating complete flow-based network traffic, using WGAN-GP + TTUR for stable training; (2) three discrete-to-continuous conversion strategies (numeric / binary / embedding) that systematically address the key obstacle of categorical attributes for GANs; (3) an extension of IP2Vec to learn contextual embeddings of numeric fields (bytes/packets/duration) with a mapping back from embeddings; (4) a combined evaluation method using distributional distances and seven domain-knowledge rules; (5) large-scale experiments on CIDDS-001 showing that two of the methods generate high-quality data.
Main innovations: unifying complex discrete flow attributes and numeric attributes into a continuous space via IP2Vec embeddings so the GAN can learn conditional dependencies between attributes intrinsically; domain-knowledge-driven quality tests that cover the blind spots of purely distributional evaluation; and a binarization strategy that preserves subnet and structural information while still being able to generate unseen values.

Network traffic / trace generation and the "usability" discussion (supporting semantic trace generation rather than raw-byte generation)
Yin et al. Practical GAN-based Synthetic IP Header Trace Generation using NetShare. ACM SIGCOMM 2022.
Use: it emphasizes generating usable protocol-field traces with practical evaluation, not just visual similarity; its "field-level consistency/constraints" evaluation ideas are worth borrowing.
Lin et al. Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions. ACM IMC 2020.
Use: a dedicated discussion of the challenges of sharing/synthesizing networked time series (correlations, privacy, evaluation); its viewpoints are well suited to cite when designing evaluation metrics for Modbus synthesis.
Ring et al. Flow-based Network Traffic Generation using Generative Adversarial Networks. Computers & Security 2019.
Use: serves as GAN-baseline related work, for contrasting the training stability and diversity advantages of diffusion models.
Vishwanath & Vahdat. Swing: Realistic and Responsive Network Traffic Generation. IEEE/ACM ToN 2009.
Use: a classic traditional traffic generator; used in related work as the "non-deep-learning synthesis" point of comparison.

@inproceedings{10.1145/3544216.3544251,
author = {Yin, Yucheng and Lin, Zinan and Jin, Minhao and Fanti, Giulia and Sekar, Vyas},
title = {Practical GAN-based synthetic IP header trace generation using NetShare},
year = {2022},
isbn = {9781450394208},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3544216.3544251},
doi = {10.1145/3544216.3544251},
abstract = {We explore the feasibility of using Generative Adversarial Networks (GANs) to automatically learn generative models to generate synthetic packet- and flow header traces for networking tasks (e.g., telemetry, anomaly detection, provisioning). We identify key fidelity, scalability, and privacy challenges and tradeoffs in existing GAN-based approaches. By synthesizing domain-specific insights with recent advances in machine learning and privacy, we identify design choices to tackle these challenges. Building on these insights, we develop an end-to-end framework, NetShare. We evaluate NetShare on six diverse packet header traces and find that: (1) across all distributional metrics and traces, it achieves 46\% more accuracy than baselines and (2) it meets users' requirements of downstream tasks in evaluating accuracy and rank ordering of candidate approaches.},
booktitle = {Proceedings of the ACM SIGCOMM 2022 Conference},
pages = {458--472},
numpages = {15},
keywords = {generative adversarial networks, network flows, network packets, privacy, synthetic data generation},
location = {Amsterdam, Netherlands},
series = {SIGCOMM '22}
}

View File

@@ -0,0 +1,48 @@
# Practical GAN-based synthetic IP header trace generation using NetShare
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文针对网络管理与安全中的包/流头部数据难以共享的问题，探索用GAN生成合成IP头部踪迹以替代真实PCAP/NetFlow数据。作者识别出fidelity、scalability-fidelity与privacy-fidelity三大挑战：现有表格/时间序列GAN难以捕捉字段相关性与大取值空间、训练成本高、差分隐私训练破坏保真度。为此提出端到端框架NetShare：将跨epoch的头部轨迹重构为merged flow-level time series并用DoppelGANger类时序GAN生成；结合域知识进行字段编码（log(1+x)、IP位编码、端口/协议IP2Vec嵌入）以提升可学性；通过chunk并行+seed fine-tuning与"flow tags"保留跨块相关性；引入DP-SGD并用相近公共数据预训练改善privacy-fidelity。基于6个异质数据集（UGR16、CIDDS、TON、CAIDA、数据中心、网络攻防PCAP），结果显示：在分布指标上NetShare较基线平均提升46%，在下游任务保持accuracy与rank ordering，相较基线实现更优scalability-fidelity，并在差分隐私设定下优于naive DP。结论：NetShare将域洞察与ML结合，实用地生成高保真、可扩展且具更好隐私折中的合成头部踪迹。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
We explore the feasibility of using Generative Adversarial Networks (GANs) to automatically learn generative models to generate synthetic packet- and flow header traces for networking tasks (e.g., telemetry, anomaly detection, provisioning). We identify key fidelity, scalability, and privacy challenges and tradeoffs in existing GAN-based approaches. By synthesizing domain-specific insights with recent advances in machine learning and privacy, we identify design choices to tackle these challenges. Building on these insights, we develop an end-to-end framework, NetShare. We evaluate NetShare on six diverse packet header traces and find that: (1) across all distributional metrics and traces, it achieves 46% more accuracy than baselines and (2) it meets users' requirements of downstream tasks in evaluating accuracy and rank ordering of candidate approaches.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Yucheng Yin, Zinan Lin, Minhao Jin, Giulia Fanti, Vyas Sekar
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
ACM SIGCOMM 2022
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
核心问题是如何用GAN生成高保真、可扩展、具隐私保障的PCAP/NetFlow合成头部踪迹使其在字段分布、跨包/跨epoch相关性、下游任务准确性与排序上接近真实数据同时缓解训练成本与差分隐私带来的保真损失。简述将头部生成重构为flow-level时序问题配合域感知编码与并行微调联用公共数据预训练+DP-SGD实现高保真且更实用的隐私合成。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
(1) 问题重构：merge多epoch为巨型trace，按五元组拆分flow，用时序GAN学习跨包/跨epoch相关性。 (2) 域感知编码：连续大范围字段用log(1+x)，IP用bit编码，端口/协议用IP2Vec嵌入（在公共数据上训练），提升可学性与隐私。 (3) 可扩展训练：将trace按固定时间切分为chunks，首块训练seed模型，其余并行fine-tuning，用"flow tags"保留跨块出现关系。 (4) 隐私增强：DP-SGD微调并结合公共数据预训练（相似域），改善privacy-fidelity折中。 (5) 合规后处理：最近邻反映射嵌入、补派生字段（如校验和）、按原始时间字段合并为PCAP/NetFlow。
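A hedged sketch of the domain-aware field encodings summarised in item (2): log(1+x) for heavy-tailed counters such as bytes/packets, and a 32-bit vector for IPv4 addresses so that subnet/prefix structure is preserved. The function names are our own; NetShare's actual implementation differs in detail (it additionally uses IP2Vec embeddings for ports/protocols).

```python
# Sketch of two domain-aware encodings: log(1+x) for heavy-tailed counters and
# a bit-vector encoding for IPv4 addresses. Names are ours, not NetShare's code.
import math

def encode_counter(x: int) -> float:
    """Compress heavy-tailed fields such as bytes/packets with log(1+x)."""
    return math.log1p(x)

def decode_counter(z: float) -> int:
    return int(round(math.expm1(z)))

def encode_ipv4(addr: str) -> list[int]:
    """32-dim bit encoding that keeps prefix/subnet structure visible to the model."""
    value = 0
    for octet in addr.split("."):
        value = (value << 8) | int(octet)
    return [(value >> (31 - i)) & 1 for i in range(32)]

if __name__ == "__main__":
    print(encode_counter(12345), decode_counter(encode_counter(12345)))
    print(encode_ipv4("192.168.1.7"))
```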
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
(1) UGR16：NetFlow v9，西班牙ISP，2016年3月第三周；(2) CIDDS：仿真小型企业环境，有标签的Benign/Attack与类型；(3) TON_IoT：IoT遥测子数据集，含9类攻击与正常流；(4) CAIDA：骨干链路PCAP，2018年3月纽约采集器公共数据；(5) Data Center：IMC 2010研究的数据中心UNI1 PCAP；(6) CyberAttack：美国网络攻防竞赛PCAP，2012年3月。每数据集选取连续100万样本用于评测。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
(1) 分布一致性：对SA/DA、SP/DP、PR用Jensen-Shannon Divergence，对TS、TD、PKT、BYT、PS、PAT、FS用Earth Mover's Distance（归一化），衡量合成与真实分布的距离。 (2) 下游任务保真（accuracy preservation与order preservation）：用真实/合成训练-测试的准确率与Spearman秩相关，评估分类器与测量算法的排序一致性。 (3) Sketch重估误差：重估heavy hitter计数的相对误差，评估合成数据在网络测量中的可用性。 (4) 异常检测保真：NetML多模式（IAT、SIZE、IAT_SIZE、STATS、SAMP-NUM、SAMP-SIZE）下的异常比例相对误差及秩相关。 (5) 可扩展性-保真折中：训练CPU小时 vs 平均JSD/EMD，衡量效率与质量。 (6) 隐私-保真折中：差分隐私参数(ε, δ) vs 平均JSD/EMD，评估DP训练影响。 (7) 协议/域知识一致性测试：IP有效性、字节-包关系、端口-协议一致性、最小包长等，确保语义/语法正确性。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
在6个数据集上，NetShare在分布指标相较基线平均提升46%（NetFlow约48%，PCAP约41%）；在UGR16与CAIDA示例中，NetShare的平均JSD/归一化EMD显著低于CTGAN、E-WGAN-GP、STAN及PCAP基线。下游任务方面：在TON流分类，NetShare在MLP上比次优基线高12%，准确率达到真实数据性能的84%，并在CIDDS/TON上获得更高秩相关（如CIDDS 0.90）；在Sketch重估（3个数据集）上NetShare相对误差最低（平均减少约48%）并保持算法排序；在PCAP异常检测，NetShare多数模式相对误差最低，且在CAIDA上模式排名完全匹配。可扩展性方面，NetShare通过并行微调与chunking达成较优CPU小时-保真折中，PCAP场景相对CTGAN提升近一数量级的JSD，同时较DoppelGANger近一数量级更快。隐私方面，naive DP-SGD显著劣化保真度；用相似公共数据预训练后，在ε≈24的CAIDA上将平均EMD相对naive DP从3.5×降至2.3×，但仍存在折中。协议一致性测试中，NetShare在多项测试达到≈95–99%通过率。结论：NetShare在保真、任务可用性与训练效率上优于现有方法，并以预训练提升DP情形的可用性。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机：现实PCAP/NetFlow头部踪迹因隐私/政策难共享；现有模拟/模型/ML方法在字段相关性、大取值空间、跨epoch相关与隐私保障上存在不足，且训练成本高、差分隐私显著破坏保真。
贡献点：(1) 提出NetShare，重构为flow-level时序生成，并实现端到端系统与Web原型；(2) 域感知编码策略（log变换、IP位编码、端口/协议IP2Vec公共预训练），提升GAN对网络字段的可学性与鲁棒性；(3) 并行可扩展训练（chunk+seed fine-tuning+flow tags），在保持相关性的同时显著降低CPU小时；(4) 差分隐私微调框架（公共数据预训练+DP-SGD），改善privacy-fidelity折中；(5) 系统性评测：在6数据集与多类分布/任务/一致性指标上验证较基线平均46%保真提升与排序保持。
主要创新：将跨epoch合并与按五元组时序化的生成范式用于网络头部踪迹；将端口/协议嵌入与IP位编码的混合表示用于兼顾保真与隐私；提出带flow tags的并行微调保持跨块相关；首次在网络头部生成中实证差分隐私预训练微调提升隐私-保真折中。

View File

@@ -0,0 +1,38 @@
@article{10.1145/1151659.1159928,
author = {Vishwanath, Kashi Venkatesh and Vahdat, Amin},
title = {Realistic and responsive network traffic generation},
year = {2006},
issue_date = {October 2006},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {36},
number = {4},
issn = {0146-4833},
url = {https://doi.org/10.1145/1151659.1159928},
doi = {10.1145/1151659.1159928},
abstract = {This paper presents Swing, a closed-loop, network-responsive traffic generator that accurately captures the packet interactions of a range of applications using a simple structural model. Starting from observed traffic at a single point in the network, Swing automatically extracts distributions for user, application, and network behavior. It then generates live traffic corresponding to the underlying models in a network emulation environment running commodity network protocol stacks. We find that the generated traces are statistically similar to the original traces. Further, to the best of our knowledge, we are the first to reproduce burstiness in traffic across a range of timescales using a model applicable to a variety of network settings. An initial sensitivity analysis reveals the importance of capturing and recreating user, application, and network characteristics to accurately reproduce such burstiness. Finally, we explore Swing's ability to vary user characteristics, application properties, and wide-area network conditions to project traffic characteristics into alternate scenarios.},
journal = {SIGCOMM Comput. Commun. Rev.},
month = aug,
pages = {111--122},
numpages = {12},
keywords = {burstiness, energy plot, generator, internet, modeling, structural model, traffic, wavelets}
}
@inproceedings{10.1145/1159913.1159928,
author = {Vishwanath, Kashi Venkatesh and Vahdat, Amin},
title = {Realistic and responsive network traffic generation},
year = {2006},
isbn = {1595933085},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1159913.1159928},
doi = {10.1145/1159913.1159928},
abstract = {This paper presents Swing, a closed-loop, network-responsive traffic generator that accurately captures the packet interactions of a range of applications using a simple structural model. Starting from observed traffic at a single point in the network, Swing automatically extracts distributions for user, application, and network behavior. It then generates live traffic corresponding to the underlying models in a network emulation environment running commodity network protocol stacks. We find that the generated traces are statistically similar to the original traces. Further, to the best of our knowledge, we are the first to reproduce burstiness in traffic across a range of timescales using a model applicable to a variety of network settings. An initial sensitivity analysis reveals the importance of capturing and recreating user, application, and network characteristics to accurately reproduce such burstiness. Finally, we explore Swing's ability to vary user characteristics, application properties, and wide-area network conditions to project traffic characteristics into alternate scenarios.},
booktitle = {Proceedings of the 2006 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications},
pages = {111--122},
numpages = {12},
keywords = {burstiness, energy plot, generator, internet, modeling, structural model, traffic, wavelets},
location = {Pisa, Italy},
series = {SIGCOMM '06}
}

View File

@@ -0,0 +1,58 @@
# Realistic and responsive network traffic generation
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文提出Swing，一个闭环、网络响应式的traffic generator，旨在在网络仿真环境中用简单的structural model重现真实链路上的packet/flow交互与burstiness。背景问题是现有生成器难以同时匹配多时间尺度的到达过程方差（energy plots）、应用与用户层行为以及广域网条件。Swing从单点tcpdump观察自动提取用户/会话/连接/网络四层参数分布（如numRRE、interRRE、numconn、interConn、numpairs、request/response sizes、reqthink）以及链路capacity/latency/loss，在ModelNet哑铃拓扑上用commodity stacks生成live traffic，并记录target link上的trace。方法包含被动估计RTT（SYN→SYN/ACK、响应→ACK时差）、packet-pair容量、基于序列号洞的丢包率，并用小波MRA的energy plots验证burstiness。结果表明Swing在Mawi、CAIDA、Auck多数据集上生成trace的统计特性与原trace相似，首次跨多时间尺度（含sub-RTT）重现字节/包到达能量谱；敏感性分析揭示需同时捕获用户/应用/网络特征，且可在场景投射中响应地调整延迟、容量、应用占比。结论：简单但结构化的模型加网络条件重放即可现实、响应地生成真实网络流量。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
This paper presents Swing, a closed-loop, network-responsive traffic generator that accurately captures the packet interactions of a range of applications using a simple structural model. Starting from observed traffic at a single point in the network, Swing automatically extracts distributions for user, application, and network behavior. It then generates live traffic corresponding to the underlying models in a network emulation environment running commodity network protocol stacks. We find that the generated traces are statistically similar to the original traces. Further, to the best of our knowledge, we are the first to reproduce burstiness in traffic across a range of timescales using a model applicable to a variety of network settings. An initial sensitivity analysis reveals the importance of capturing and recreating user, application, and network characteristics to accurately reproduce such burstiness. Finally, we explore Swing's ability to vary user characteristics, application properties, and wide-area network conditions to project traffic characteristics into alternate scenarios.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Kashi Venkatesh Vishwanath, Amin Vahdat
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
ACM SIGCOMM 2006
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
核心问题如何从单点观测的packet header traces构建一个语义明确且可响应的结构化模型并在网络仿真环境中生成live traffic准确重现多时间尺度含sub-RTT的burstiness与应用/用户/网络交互同时支持场景投射。简述提出Swing用四层结构模型加被动网络特性估计与仿真回放实现统计相似、跨尺度能量谱匹配的真实感流量生成。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
(1) 结构化参数化模型四层Users/Sessions/Connections/Network参数与分布如numRRE、interRRE、numconn、interConn、numpairs、request/response sizes、reqthink、capacity/latency/loss统一描述应用行为。
(2) 会话/RRE提取基于SYN/FIN/时序与RREtimeout/SESStimeout将连接聚类为RRE与会话得到分布的经验CDF。
(3) 被动网络特性估计用SYN→SYN/ACK、响应→ACK时差估RTT分解packet-pair估瓶颈容量序列号“holes”与重传估loss。
(4) 仿真生成在ModelNet哑铃拓扑配置链路MTU/容量/延迟/丢包;部署多生成器/监听器用commodity stacks按分布驱动live连接。
(5) 验证与敏感性分析用小波MRA energy plots比较字节/包到达的多尺度方差;探究网络/用户/应用参数缺省的影响。
(6) 场景投射调整延迟、容量、响应大小、应用占比等以测试trace在替代场景下的响应。
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
(1) CAIDA OC-48：MFN Backbone 1 链路（San Jose–Seattle）公共trace；(2) MAWI：WIDE项目Trans-Pacific 18Mbps CAR链路trace；(3) University of Auckland：NLANR存档的Auckland-VI（OC3c ATM）链路trace。均来自公开trace存储库（CAIDA、MAWI、NLANR）。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
(1) 应用/总体带宽与pps比较生成trace与原trace的Mbps与packets/sec验证宏观负载一致性。
(2) 参数分布一致性各模型参数的median与IQRnumconn、interConn、numpairs、numRRE、interRRE、reqthink、请求/响应大小),检验结构模型重现性。
(3) 小波MRA能量谱energy plots比较不同时间尺度下字节/包到达过程方差验证burstiness含RTT与瓶颈容量对应的特征“dip”
(4) 网络特性CDF两向延迟、上下游容量、丢包率分布的CDF对比被动估计与生成trace的网络条件一致性。
(5) 敏感性分析:移除或变更网络/应用参数(如无网络重放、仅延迟/容量、去interRRE/interConn对能量谱的影响评估必要性与重要性。
(6) 响应性实验:加倍延迟/响应大小/改变应用占比对能量谱与带宽的影响,评估场景投射能力。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
在Auck/Mawi/CAIDA三数据集上，总体带宽与pps（如CAIDA 184Mbps、Auck与Mawi多案例）在Trace/Swing对照中接近；参数分布median/IQR高度一致（例如HTTP请求/响应大小、numconn、interConn、reqthink等），少数人机"think"参数差异仍可接受。能量谱方面，Swing在HTTP、SQUID、KAZAA各应用及两个方向的字节/包到达能量谱均与原trace紧密重叠，能再现RTT尺度dip（如200ms）与瓶颈容量对应dip（如3.5Mbps→约8ms）。网络特性估计的延迟/容量/丢包CDF与生成trace吻合度高。敏感性显示：不重放网络条件会丢失sub-RTT结构，仅延迟或延迟+容量不足，去interRRE或interConn会改变大尺度或整体burstiness。响应性显示：延迟加倍使能量谱特征右移一档（log2尺度），响应大小加倍使能量谱整体升高且带宽由9Mbps增至19Mbps，增大SQUID占比使整体能量谱更类似SQUID曲线。结论：Swing首次跨多时间尺度重现burstiness，生成trace统计上与原trace相似，并能以语义明确的参数实现响应式场景投射。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机需要一个能生成真实、可响应且最大随机性的网络traffic generator既匹配多时间尺度burstiness又能从单点观测自动抽取用户/应用/网络参数,并支持投射到不同网络/应用场景。
贡献点:(1) 提出Swing一个闭环、网络响应式生成框架统一结构模型覆盖Users/Sessions/Connections/Network(2) 从单点packet headers自动提取应用会话/RRE/连接参数的经验CDF与网络特性RTT分解、packet-pair容量、序列号洞丢包(3) 在ModelNet哑铃拓扑用commodity stacks生成live traffic统计上复现原trace的带宽/pps与结构参数分布(4) 用小波MRA能量谱首次跨多时间尺度含sub-RTT重现字节/包到达burstiness(5) 系统敏感性与响应性分析,证明用户/应用/网络三类特征均为必要,并展示调整延迟/容量/应用占比的可投射能力。
创新之处以简洁结构化模型与被动网络条件估计相结合在仿真环境中重放多层交互首次在多时间尺度上重现burstiness并保持统计相似同时支持面向未来场景的参数化、响应式投射。

View File

@@ -0,0 +1,10 @@
@inproceedings{Lin_2020,
title={Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions},
url={http://dx.doi.org/10.1145/3419394.3423643},
DOI={10.1145/3419394.3423643},
booktitle={Proceedings of the ACM Internet Measurement Conference},
publisher={ACM},
author={Lin, Zinan and Jain, Alankar and Wang, Chen and Fanti, Giulia and Sekar, Vyas},
year={2020},
month=oct,
pages={464--483},
series={IMC '20} }

View File

@@ -0,0 +1,71 @@
# Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文针对网络与系统领域"数据共享难"的痛点，探讨利用Generative Adversarial Networks为多维元数据关联的长时序数据生成高保真合成数据，以最小专家介入支持结构表征、预测建模与算法评估。作者识别GAN在fidelity（长程依赖、复杂多维关系、mode collapse）与privacy（理论保证牺牲保真、记忆泄露）上的关键挑战，提出DoppelGANger工作流：解耦元数据与时序的联合生成P(A)·P(R|A)、为元数据引入auxiliary discriminator、batch generation的RNN以捕捉长时程相关、auto-normalization为每个样本生成min/max"伪元数据"缓解mode collapse。基于WWT、MBA、GCUT真实数据集，DG在分布与自相关等结构微基准和下游任务上较基线最高提升达43%，能保持算法排序；隐私方面表明DP-SGD破坏时序相关，提出用更大训练集缓解membership inference，并通过仅重训元数据生成器实现业务属性混淆。结论：DG在保真与通用性上展现初步可行性，但隐私-保真折中仍是开放问题。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
Limited data access is a longstanding barrier to data-driven research and development in the networked systems community. In this work, we explore if and how generative adversarial networks (GANs) can be used to incentivize data sharing by enabling a generic framework for sharing synthetic datasets with minimal expert knowledge. As a specific target, our focus in this paper is on time series datasets with metadata (e.g., packet loss rate measurements with corresponding ISPs). We identify key challenges of existing GAN approaches for such workloads with respect to fidelity (e.g., long-term dependencies, complex multidimensional relationships, mode collapse) and privacy (i.e., existing guarantees are poorly understood and can sacrifice fidelity). To improve fidelity, we design a custom workflow called DoppelGANger (DG) and demonstrate that across diverse real-world datasets (e.g., bandwidth measurements, cluster requests, web sessions) and use cases (e.g., structural characterization, predictive modeling, algorithm comparison), DG achieves up to 43% better fidelity than baseline models. Although we do not resolve the privacy problem in this work, we identify fundamental challenges with both classical notions of privacy and recent advances to improve the privacy properties of GANs, and suggest a potential roadmap for addressing these challenges. By shedding light on the promise and challenges, we hope our work can rekindle the conversation on workflows for data sharing.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Zinan Lin, Alankar Jain, Chen Wang, Giulia Fanti, Vyas Sekar
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
ACM Internet Measurement Conference (IMC 2020)
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
核心问题如何用GAN在不依赖大量领域专家知识的前提下为带多维元数据的网络化长时序数据生成高保真合成数据既能保留长程时间相关与测量-元数据的复杂关系、避免mode collapse又能在隐私上避免记忆泄露并支持实际下游任务。简述提出DoppelGANger通过解耦/条件生成、批量RNN、辅助判别器与样本级归一化min/max伪元数据系统性提升保真验证其在多数据集与任务上的效果同时分析差分隐私训练的破坏性与可行的隐私折中。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
- 解耦+条件生成：分解P(A,R)=P(A)·P(R|A)，先MLP生成元数据，再以元数据作为条件由RNN生成时序，保留测量-元数据依赖。
- 辅助判别器：为元数据单独设置discriminator，并与主判别器加权联合训练，提升元数据分布保真。
- Batch generation RNN：每次生成S个连续记录，降低RNN步数，捕捉长时程相关。
- Auto-normalization：对每个样本单独归一化，并学习/生成min/max为"伪元数据"，生成后再反缩放，缓解mode collapse。
- 变长长度标记：生成flag控制序列终止，学习长度分布。
- 隐私策略：用更大训练集降低membership inference成功率；仅重训元数据生成器实现属性分布混淆；评估DP-SGD对保真的影响。
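A small sketch of the auto-normalization bullet above: each sample is rescaled by its own min/max, those two values are emitted as extra "fake metadata" to be modelled together with the real metadata, and they are used to undo the scaling after generation. Names and the degenerate-range handling are our simplifications, not the DoppelGANger code.

```python
# Per-sample auto-normalization with min/max emitted as "fake metadata" (sketch).
import numpy as np

def auto_normalize(series):
    lo, hi = float(series.min()), float(series.max())
    scale = (hi - lo) if hi > lo else 1.0          # degenerate range -> no scaling (our choice)
    normed = (series - lo) / scale                 # now in [0, 1] regardless of the sample's magnitude
    fake_metadata = np.array([lo, hi])             # to be generated alongside the real metadata
    return normed, fake_metadata

def auto_denormalize(normed, fake_metadata):
    lo, hi = float(fake_metadata[0]), float(fake_metadata[1])
    scale = (hi - lo) if hi > lo else 1.0
    return normed * scale + lo

if __name__ == "__main__":
    x = np.array([10.0, 12.0, 9.0, 30.0])
    n, meta = auto_normalize(x)
    print(n, meta, auto_denormalize(n, meta))
```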
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
- Wikipedia Web Traffic (WWT)Kaggle Web Traffic Time Series Forecasting日页面浏览量与域/访问方式/agent元数据。
- FCC Measuring Broadband America (MBA)美国FCC公开数据家庭UDP ping丢包率与总流量元数据含接入技术/ISP/州。
- Google Cluster Usage Traces (GCUT)Google集群任务资源使用日志公开数据集秒级测量聚合为5分钟统计与任务结束事件类型元数据。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
- 自相关ACF与均方误差MSE衡量长短期时间相关的保真比较真实与合成序列的自相关曲线误差。
- 分布距离如Jensen-Shannon/Wasserstein-1评估测量、元数据及联合分布的接近程度如(max+min)/2、带宽CDF、Pearson相关系数CDF
- 变长分布拟合:任务时长直方图拟合(捕捉多模态)。
- 下游预测精度:在真实/合成数据训练测试的分类/回归MLP、LR、SVM等准确率或R²检验泛化。
- 排序保持Spearman秩相关比较算法在真实与合成数据上的性能排序一致性。
- 训练资源与稳定性训练时间、样本规模与MSE敏感性评估可用性。
- 隐私评估membership inference成功率与训练集规模的关系DP-SGD下自相关退化情况。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
DG在WWT上自相关MSE较最佳基线RCGAN降低约91.2%；在预测任务中，GCUT的MLP分类器用DG生成数据训练、真实数据测试的准确率比次优基线AR高43%，达到真实数据的约80%；算法排序方面，DG在GCUT与WWT的Spearman秩相关分别达1.00与0.80，保持排序优于多数基线；长度分布上DG准确复现GCUT的双峰模式（RNN等基线缺失次模态）；测量-元数据联合分布（MBA）中，DSL/Cable总带宽CDF的Wasserstein-1距离最小。隐私上，membership inference随训练样本增多降至约51%（近随机），但DP-SGD（如ε≈1.45或更小）显著破坏自相关与保真。结论：DG在结构微基准与下游任务上显著优于基线，具备通用性与可用性；隐私-保真仍需新方法突破。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机:网络/系统领域数据共享受限,亟需无需大量专家参与的通用合成数据框架,既保留长时序与多维关系,又兼顾隐私与实际任务可用性。
贡献点:
- 提出DoppelGANger，系统性解决GAN在时间序列的fidelity挑战（长程相关、mode collapse、测量-元数据关系）。
- 设计解耦/条件生成与auxiliary discriminator，提升元数据与联合分布保真。
- 引入batch generation的RNN与样本级auto-normalization（min/max伪元数据），缓解mode collapse并学习长时程依赖。
- 在WWT/MBA/GCUT多数据集与结构/预测/排序评估中实证高保真（最高43%提升）。
- 系统性分析隐私折中：揭示DP-SGD对时序保真的破坏，提出实用缓解（更大训练集、只重训元数据生成器）。
创新之处:
- 将样本级归一化与伪元数据联合到GAN工作流中解决时序值域多样性引发的mode collapse。
- 元数据-时序的解耦条件生成+辅助判别器组合,在网络化长时序场景下有效提升联合分布学习。
- 批量时序生成策略使RNN在不牺牲长程相关的前提下可训练、稳定地生成长序列。
- 面向数据共享的任务化评估框架(结构/预测/排序)与隐私实践建议。

View File

@@ -0,0 +1,9 @@
@misc{meng2025aflnetyearslatercoverageguided,
title={AFLNet Five Years Later: On Coverage-Guided Protocol Fuzzing},
author={Ruijie Meng and Van-Thuan Pham and Marcel Böhme and Abhik Roychoudhury},
year={2025},
eprint={2412.20324},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2412.20324},
}

View File

@@ -0,0 +1,47 @@
# AFLNet Five Years Later On Coverage-Guided Protocol Fuzzing
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
协议实现是stateful且message-driven：同一消息在不同内部state下可能产生不同response，使传统coverage-guided greybox fuzzing（如AFL）难以有效测试网络协议。论文旨在对AFLNet（首个code- and state-coverage-guided protocol fuzzer）给出扩展技术讨论与大规模实证评估，并回顾其五年影响。方法上，AFLNet以message sequence作为seed，基于pcap录制/回放构建初始corpus；在线学习implemented protocol state machine（IPSM），用response code等标识state并统计#fuzz/#selected/#paths；在seed selection中交织queue顺序与state heuristics以导向progressive states；对序列分割为M1/M2/M3，并在M2上施加protocol-aware与byte-level mutation；在同一bitmap中同时维护branch coverage与state transition coverage以判定interesting。结果显示：state feedback单独使用在部分对象上显著优于black-box；加入state feedback使state coverage平均提升35.67×，但对code coverage提升总体不显著；交织式seed-selection在综合code/state覆盖上最稳健。结论：state反馈能显著扩大协议状态空间探索，但"state定义"与吞吐等仍是关键挑战。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
Abstract—Protocol implementations are stateful which makes them difficult to test: Sending the same test input message twice might yield a different response every time. Our proposal to consider a sequence of messages as a seed for coverage-directed greybox fuzzing, to associate each message with the corresponding protocol state, and to maximize the coverage of both the state space and the code was first published in 2020 in a short tool demonstration paper. AFLNet was the first code- and state-coverage-guided protocol fuzzer; it used the response code as an indicator of the current protocol state. Over the past five years, the tool paper has gathered hundreds of citations, the code repository was forked almost 200 times and has seen over thirty pull requests from practitioners and researchers, and our initial proposal has been improved upon in many significant ways. In this paper, we first provide an extended discussion and a full empirical evaluation of the technical contributions of AFLNet and then reflect on the impact that our approach and our tool had in the past five years, on both the research and the practice of protocol fuzzing.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Ruijie Meng, Van-Thuan Pham, Marcel Böhme, Abhik Roychoudhury
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
文段未给出会议或期刊信息。
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
核心问题是:如何将面向“单输入/近似无状态程序”的coverage-guided greybox fuzzing扩展到网络协议这种stateful、需要message sequence驱动且状态空间巨大的目标上并同时兼顾code coverage与state space coverage。传统做法要么靠手工协议模型的stateful blackbox fuzzing依赖不完备的状态/数据模型且不保留“有趣”用例继续进化要么把消息序列拼成文件交给AFL无法聚焦关键消息、易生成大量无效序列。论文围绕AFLNet提出并系统评估的一套解法以消息序列为seed、在线推断IPSM并把状态反馈纳入引导与“interesting”判定从而更系统地探索协议实现的状态与代码。简洁概述让灰盒模糊测试“看见并利用协议状态”从而可有效fuzz stateful protocols。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
(1) 消息序列作为seed的SCGF：把sequence of messages（而非单文件输入）作为进化种子，适配stateful server。(2) 录制/回放驱动：pcap→parse→send，从真实流量提取初始语料，并可重复回放以执行fuzzing迭代。(3) 轻量协议学习IPSM（implemented protocol state machine）：从response序列抽取state transitions，在线增量构建/更新状态机，并维护#fuzz/#selected/#paths统计。(4) 面向progressive states的引导：按"盲点/新近/高产出"启发式选state，再在到达该state的子语料上做AFL式优先级选序列。(5) 交织式seed-selection：在coverage plateau时切换到state-heuristic重策略，否则按AFL队列顺序，兼顾吞吐与导向。(6) 三段式序列变异M1/M2/M3：固定前缀M1保证到达目标state，只在候选段M2做变异，并继续执行后缀M3以观察传播效应。(7) 协议感知变异算子：对消息做replacement/insertion/duplication/deletion，并与byte-level mutation堆叠。(8) 统一bitmap记录code+state覆盖：为state transition预留bitmap区域（SHIFT_SIZE），用分支与状态转移共同定义interesting seeds。
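A toy illustration of the M1/M2/M3 sequence mutation in items (6)–(7): keep the prefix that drives the server into the target state, mutate only one candidate message (with simplified protocol-aware and byte-level operators), and replay the suffix unchanged. This is a schematic re-implementation, not AFLNet's C code.

```python
# Schematic M1/M2/M3 mutation over a message sequence (simplified operators).
import random

def mutate_sequence(messages, target_index, rng=random.Random(0)):
    """M1 = messages[:target_index]              kept verbatim (reaches the target state)
    M2 = messages[target_index:target_index+1]   the region we mutate
    M3 = the remaining suffix, replayed unchanged to observe downstream effects."""
    m1 = messages[:target_index]
    m2 = list(messages[target_index:target_index + 1])
    m3 = messages[target_index + 1:]

    op = rng.choice(["duplicate", "delete", "byteflip"])
    if op == "duplicate" and m2:
        m2 = m2 + m2                              # protocol-aware: repeat the message
    elif op == "delete":
        m2 = []                                   # protocol-aware: drop the message
    elif op == "byteflip" and m2:
        b = bytearray(m2[0])
        pos = rng.randrange(len(b))
        b[pos] ^= 0xFF                            # byte-level havoc on the candidate message
        m2 = [bytes(b)]
    return m1 + m2 + m3

if __name__ == "__main__":
    seq = [b"USER anon\r\n", b"PASS guest\r\n", b"LIST\r\n", b"QUIT\r\n"]
    print(mutate_sequence(seq, target_index=2))
```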
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
基准为ProFuzzBench（Natella & Pham, ISSTA 2021，工具/基准论文"ProFuzzBench: A benchmark for stateful protocol fuzzing"）。论文在其默认集成的网络协议实现（如Bftpd、DNSmasq、OpenSSH、TinyDTLS、Live555、ProFTPD、Pure-FTPd、Exim、DCMTK、Kamailio、forked-daapd、lightFTP等）上进行评测。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
(1) Code coverage：以branch coverage（分支覆盖数）衡量探索到的代码范围（"未覆盖代码无法触发漏洞"）。(2) State space coverage：以IPSM中构建的state transitions数量（以及状态数量/覆盖）衡量探索到的协议状态空间。(3) Vargha-Delaney effect size（Â12）：衡量两组独立实验结果的优势概率/效应量，用于判断差异是否具有"显著优势"（文中以Â12≥0.71或≤0.29作为显著门槛）。(4) 时间维度覆盖趋势（branch covered over time）：24小时曲线，用于对比不同变体达到同等覆盖所需时间（如提到"约6×/4×更快达到相同分支数"）。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
RQ1（仅state反馈）：AFLNetDARK（仅state feedback）在12个ProFuzzBench对象中有6个在code coverage上显著优于AFLNetBLACK（无code/state反馈）——Bftpd、DNSmasq、Kamailio、lightFTP、ProFTPD、Pure-FTPd——并在OpenSSH与TinyDTLS上分别约6×与4×更快达到与BLACK相同的分支数；对state数量很少的对象（如DCMTK，最终仅3个state），提升不明显。结论：当state数量"足够"时，state反馈可作为无代码插桩场景的有效引导。RQ2（state+code vs 仅code，表1）：AFLNetQUEUE相对AFLNetCODE平均branch coverage提升仅+0.01%，但state coverage平均提升+35.67×（例如OpenSSH的state数从93.5提升到30480.9，+325.00×，Â12=1.00；DNSmasq从282.5到27364.0，+95.85×，Â12=1.00；Bftpd从170.5到334.0，+0.96×，Â12=1.00）。结论：额外state反馈极大扩展状态空间探索，但对代码覆盖提升整体不显著。RQ3（seed-selection策略，表2/3）：交织策略AFLNet在综合表现上最好，其相对AFLNetQUEUE平均branch coverage为-0.52%，但相对AFLNetIPSM为+1.65%；同时state coverage相对AFLNetQUEUE为+5.77%、相对AFLNetIPSM为+12.77%。结论：交织式seed-selection总体上最稳健，且state覆盖与code覆盖不呈简单正相关。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机：网络协议实现是stateful reactive systems，输入是message sequence而非单文件；AFL类CGF缺乏状态与序列结构认知，SBF又依赖手工模型且不做进化保种，导致难以深入状态空间并发现实现偏差/隐藏转移与漏洞。
贡献点：(1) 系统化阐释AFLNet：把消息序列纳入灰盒进化框架，并将state coverage与code coverage共同作为反馈信号。(2) 在线轻量IPSM学习与引导：通过response code等构造implemented protocol state machine，并用#fuzz/#selected/#paths等统计驱动state选择与导向。(3) 面向目标state的序列分割与变异策略（M1/M2/M3）：既保证可达性，又允许在关键段落聚焦变异。(4) 统一bitmap编码分支与状态转移覆盖：在AFL bitmap中为state transition预留空间，使"interesting"判定同时考虑state与code。(5) 五年后大规模评估与配置指导：在ProFuzzBench上做24h×10次的大样本实验，分离评估state反馈、state+code协同以及seed-selection策略，并用Â12报告效应量。
主要创新之处：以"状态反馈+在线状态机学习"把灰盒fuzzing从无状态程序扩展到协议实现的状态空间探索，并给出可复现的消融式实证结论与最佳实践建议。

View File

@@ -0,0 +1,17 @@
协议状态机/模糊测试/学习输入生成(支撑你“生成有效 request-response 交互序列”)
对 Modbus TCP 来说,“有效”不仅是字段合法,还包括:
request 与 response 配对
Transaction ID 一致/递增策略合理
功能码与地址范围一致(如 0x03 对 holding register 区间)
异常响应的触发条件合理
这些强约束往往在 fuzzing / protocol testing 文献里讨论得更系统。
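To make the constraints above concrete, here is a minimal Modbus/TCP request-response consistency check (transaction ID match, function-code consistency, exception-response shape, byte count vs. register quantity for function code 0x03). It follows the standard MBAP header + PDU layout but is only a sketch, not a full protocol validator.

```python
# Minimal validity check for one Modbus/TCP request-response pair (sketch).
import struct

def parse_mbap(frame: bytes):
    tid, pid, length, uid = struct.unpack(">HHHB", frame[:7])
    return {"tid": tid, "pid": pid, "len": length, "uid": uid, "pdu": frame[7:]}

def check_pair(request: bytes, response: bytes) -> list[str]:
    issues = []
    req, rsp = parse_mbap(request), parse_mbap(response)
    if req["tid"] != rsp["tid"]:
        issues.append("transaction ID mismatch")
    if req["pid"] != 0 or rsp["pid"] != 0:
        issues.append("protocol ID must be 0 for Modbus")
    req_fc, rsp_fc = req["pdu"][0], rsp["pdu"][0]
    if rsp_fc == req_fc + 0x80:                # exception response
        if len(rsp["pdu"]) != 2:
            issues.append("exception response must carry exactly one exception code")
    elif rsp_fc != req_fc:
        issues.append("function code mismatch")
    elif req_fc == 0x03:                       # Read Holding Registers
        quantity = struct.unpack(">H", req["pdu"][3:5])[0]
        byte_count = rsp["pdu"][1]
        if byte_count != 2 * quantity:
            issues.append("byte count != 2 * register quantity")
    return issues

if __name__ == "__main__":
    req = struct.pack(">HHHB", 1, 0, 6, 1) + bytes([0x03]) + struct.pack(">HH", 0x0000, 2)
    rsp = struct.pack(">HHHB", 1, 0, 7, 1) + bytes([0x03, 0x04]) + struct.pack(">HH", 10, 20)
    print(check_pair(req, rsp) or "pair looks consistent")
```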
Pham et al. AFLNet: A Greybox Fuzzer for Network Protocols. ICST 2020.
用途:面向网络协议的状态覆盖 fuzzing你可以借鉴其“状态反馈”思想把扩散生成器和协议栈反馈有效率/覆盖率)结合起来做强化。
She et al. NEUZZ: Efficient Fuzzing with Neural Program Smoothing. IEEE S&P 2019.
用途:神经网络引导 fuzzing 的代表作;可作为你未来“生成模型 + 反馈优化/引导采样”的相关工作支撑。
Godefroid, Peleg, Singh. Learn&Fuzz: Machine Learning for Input Fuzzing. ASE 2017.
用途:学习输入格式再生成;与你“语义级生成 + 确定性组装器”的理念一致(模型学语义,规则负责封包细节)。

View File

@@ -0,0 +1,9 @@
@misc{godefroid2017learnfuzzmachinelearninginput,
title={Learn&Fuzz: Machine Learning for Input Fuzzing},
author={Patrice Godefroid and Hila Peleg and Rishabh Singh},
year={2017},
eprint={1701.07232},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/1701.07232},
}

View File

@@ -0,0 +1,47 @@
# Learn&Fuzz Machine Learning for Input Fuzzing
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文聚焦grammar-based fuzzing的关键瓶颈：需要手工编写input grammar，过程费时且易错，而复杂结构化格式（如PDF）又最依赖该类fuzzing。研究目的在于用neural-network-based statistical learning自动生成可用于输入fuzzing的grammar/生成模型，并解决learn&fuzz张力（learning偏向生成well-formed inputs，fuzzing需要破坏结构以覆盖error-handling与意外路径）。方法上，对约63,000个non-binary PDF objects进行无监督训练，采用seq2seq RNN（LSTM）学习字符级概率分布，并提出NoSample/Sample/SampleSpace三种采样生成策略，及基于概率分布引导"where to fuzz"的SampleFuzz算法。实验以Microsoft Edge PDF parser为目标，用instruction coverage、pass rate与AppVerifier监测bugs评估：SampleSpace在50 epochs达97% pass rate；整体覆盖最佳为Sample-40e；在learn+fuzz组合中，SampleFuzz以567,634条指令覆盖、68.24% pass rate取得最高覆盖，并优于多种random fuzzing基线，且在更长实验中发现并修复一个stack-overflow bug。结论：统计生成模型可自动学习输入结构，并用概率信息更智能地施加fuzz以提升覆盖。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
Abstract. Fuzzing consists of repeatedly testing an application with modified, or fuzzed, inputs with the goal of finding security vulnerabilities in input-parsing code. In this paper, we show how to automate the generation of an input grammar suitable for input fuzzing using sample inputs and neural-network-based statistical machine-learning techniques. We present a detailed case study with a complex input format, namely PDF, and a large complex security-critical parser for this format, namely, the PDF parser embedded in Microsoft's new Edge browser. We discuss (and measure) the tension between conflicting learning and fuzzing goals: learning wants to capture the structure of well-formed inputs, while fuzzing wants to break that structure in order to cover unexpected code paths and find bugs. We also present a new algorithm for this learn&fuzz challenge which uses a learnt input probability distribution to intelligently guide where to fuzz inputs.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Patrice Godefroid, Hila Peleg, Rishabh Singh
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
arXiv:1701.07232v1
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
论文要解决的核心问题是如何在无需人工编写格式规范的前提下从sample inputs自动学习出“足够像grammar”的生成式输入模型用于grammar-based fuzzing复杂结构化输入以PDF为代表并进一步在“生成尽量well-formed以深入解析流程”和“刻意引入ill-formed片段以触达异常/错误处理代码”之间取得可控平衡。传统黑盒/白盒fuzz对复杂文本结构格式不如grammar-based有效但后者依赖手工grammar已有grammar/automata学习方法对PDF对象这种“相对扁平但token/键值组合极多”的格式并不理想。本文用seq2seq RNN学习字符序列的概率分布作为统计grammar并利用该分布在高置信位置定点“反向扰动”以实现learn&fuzz。简洁概述用神经网络从样本自动学输入结构并用学到的概率分布指导更有效的结构化fuzz。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
(1) seq2seq RNN统计输入建模：把PDF object当作字符序列，训练encoder-decoder（LSTM）学习p(x_t|x_<t)，并作为生成式语言模型。 (2) 训练数据构造（shift-by-one，无监督）：将拼接后的字符流分块成长度d的输入序列，输出为右移1位的序列，实现无监督next-char学习。 (3) NoSample生成：每步取argmax字符，最大化well-formed但多样性极低。 (4) Sample生成：每步按分布采样字符，提升多样性但更易产生ill-formed片段。 (5) SampleSpace生成：仅在whitespace边界采样、token内部用argmax，折中多样性与语法一致性。 (6) SampleFuzz算法：生成时先采样得到字符c及其概率p(c)，若模型高置信（p(c)>p_t）且掷币触发（p_fuzz>t_fuzz），则用分布中最低概率字符替换（argmin），在"最不该出错的位置"注入异常，以诱导解析器走入错误处理/意外路径。 (7) PDF对象嵌入整文件的host-append机制：将新对象按PDF增量更新规则附加到well-formed host，更新xref与trailer，以便对Edge PDF parser进行端到端测试。
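A toy re-implementation of the SampleFuzz idea in item (6): sample the next character from a learnt distribution, but when the model is highly confident (p(c) > p_t) and a coin flip fires (p_fuzz > t_fuzz), emit the lowest-probability character instead. The "model" below is a fixed toy distribution standing in for the paper's seq2seq RNN.

```python
# SampleFuzz sketch: inject the least likely character exactly where the model is most confident.
import random

def sample_fuzz(model_dist, length, p_t=0.9, t_fuzz=0.6, rng=random.Random(0)):
    out = []
    for _ in range(length):
        chars, probs = zip(*model_dist(out).items())
        c = rng.choices(chars, weights=probs)[0]           # normal sampling step
        p_c = dict(zip(chars, probs))[c]
        if p_c > p_t and rng.random() > t_fuzz:            # high confidence + coin flip -> fuzz here
            c = min(zip(chars, probs), key=lambda kv: kv[1])[0]
        out.append(c)
    return "".join(out)

def toy_model(prefix):
    # Pretend the learnt model is extremely confident that a digit follows a space.
    return {"0": 0.95, " ": 0.03, ">": 0.02} if prefix[-1:] == [" "] else \
           {"/": 0.2, "L": 0.2, " ": 0.2, "0": 0.2, "e": 0.2}

if __name__ == "__main__":
    print(sample_fuzz(toy_model, length=20))
```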
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
(1) PDF训练语料：从534个PDF文件中抽取约63,000个non-binary PDF objects；这534个PDF由Windows fuzzing team提供，且是对更大PDF集合做seed minimization后的结果（更大集合来源包括公开Web与历史fuzz用PDF，论文未给该数据集专有名称）。(2) 目标程序/基准：Microsoft Edge browser内嵌的Edge PDF parser，通过Windows团队提供的单进程test-driver执行。(3) Host PDF集合：从上述534个PDF中选取最小的3个作为host1/host2/host3（约26Kb/33Kb/16Kb），用于将生成对象附加成完整PDF。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
(1) Coverage（instruction coverage）：统计执行过的唯一指令集合（以(dll-name, dll-offset)标识），取集合并集，衡量一组测试的覆盖范围，是fuzzing有效性的核心指标。 (2) Pass rate：通过grep解析日志中是否有parsing-error来判定pass/fail，pass表示被解析器视为well-formed，主要用来估计学习质量与"结构保持程度"。 (3) Bugs：在AppVerifier监控下捕获内存破坏类缺陷（如buffer overflow、异常递归导致的stack overflow等），衡量真实漏洞发现能力。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
基线覆盖（host与baseline）：三份host单独覆盖约353,327（host1）到457,464（host2）条唯一指令，三者并集（host123）为494,652；将1,000个真实对象附加到host后，baseline123覆盖为553,873，且所有host自身pass rate为100%。学习质量（pass rate）：Sample在10 epochs时pass rate已>70%；SampleSpace整体更高，50 epochs最高达97% pass rate。覆盖表现（学习不加fuzz）：不同host对覆盖影响明显；总体覆盖最佳为Sample-40e（host123场景下胜出），且Sample-40e的覆盖集合几乎是其他集合的超集（相对SampleSpace-40e仅缺1,680条指令）。学习+fuzz对比（30,000个PDF/组，图8）：SampleFuzz覆盖567,634、pass rate 68.24%，为最高覆盖；次优Sample+Random覆盖566,964、pass rate 41.81%；Sample-10K覆盖565,590、pass rate 78.92%；baseline+Random覆盖564,195、pass rate 44.05%；SampleSpace+Random覆盖563,930、pass rate 36.97%。结论：存在coverage与pass rate张力，随机fuzz提升覆盖但显著降低通过率，SampleFuzz在约65%–70%通过率附近取得更佳折中并带来最高覆盖。漏洞：常规实验未发现bug（目标已被长期fuzz），但更长实验（Sample+Random，100,000对象/300,000 PDF，约5天）发现并修复一个stack-overflow bug。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机grammar-based fuzzing对复杂结构化输入最有效但手工编写input grammar“劳累/耗时/易错”限制了在真实大型解析器如浏览器PDF解析上的应用同时学习生成“规范输入”与fuzzing“破坏结构找漏洞”目标冲突需要可控融合。
贡献点:(1) 首次将neural-network-based statistical learningseq2seq RNN/LSTM用于从样本自动学习可生成的输入模型以自动化grammar生成用于fuzzing。 (2) 针对PDF这种超复杂格式明确限定范围为non-binary PDF objects并给出端到端工程方案把生成对象按PDF增量更新规则附加到host形成完整PDF以真实驱动Edge PDF parser评测。 (3) 系统分析并量化learn&fuzz张力用pass rate刻画学习质量、用instruction coverage刻画fuzz有效性展示两者此消彼长。 (4) 提出SampleFuzz利用learned input probability distribution在模型高置信位置用低概率字符替换实现“智能选择where to fuzz”在覆盖与通过率之间取得更优折中并获得最高覆盖。
主要创新:把“概率语言模型”的不确定性/置信度直接转化为fuzzing决策信号高置信处注入反常从而在保持足够结构可深入解析的同时更系统地触达错误处理与意外路径。

View File

@@ -0,0 +1,9 @@
@misc{she2019neuzzefficientfuzzingneural,
title={NEUZZ: Efficient Fuzzing with Neural Program Smoothing},
author={Dongdong She and Kexin Pei and Dave Epstein and Junfeng Yang and Baishakhi Ray and Suman Jana},
year={2019},
eprint={1807.05620},
archivePrefix={arXiv},
primaryClass={cs.CR},
url={https://arxiv.org/abs/1807.05620},
}

View File

@@ -0,0 +1,60 @@
# NEUZZ Efficient Fuzzing with Neural Program Smoothing
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文指出传统graybox fuzzing多依赖evolutionary guidance，易陷入随机变异的低效序列，难以触发深层分支与hard-to-trigger bugs；而直接用gradient-guided optimization又会被真实程序的discontinuities、plateaus与ridges卡住。研究目的在于通过program smoothing构造可微的surrogate function，使梯度方法可用于覆盖导向fuzzing。方法上，NEUZZ用feed-forward NN学习输入到edge bitmap（分支/边覆盖）的平滑近似，采用binary cross-entropy训练并做label降维（合并总是共现的边），再基于∇x f_i(θ,x)选取top-k高梯度字节并按梯度符号生成变异，同时通过coverage-based filtration进行incremental learning以避免遗忘。主要结果：在10个真实程序上，NEUZZ在24小时内相对10个SOTA fuzzers达到最高edge coverage（最高达3×，文中亦报告对AFL最高约10×），发现31个此前未知bug（含2个CVE），并在LAVA-M与DARPA CGC上分别找到更多注入/真实漏洞。结论是：神经网络程序平滑+梯度引导变异能显著提升fuzz效率与覆盖，并具备良好可扩展性。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
Abstract—Fuzzing has become the de facto standard technique for finding software vulnerabilities. However, even state-of-the-art fuzzers are not very efficient at finding hard-to-trigger software bugs. Most popular fuzzers use evolutionary guidance to generate inputs that can trigger different bugs. Such evolutionary algorithms, while fast and simple to implement, often get stuck in fruitless sequences of random mutations. Gradient-guided optimization presents a promising alternative to evolutionary guidance. Gradient-guided techniques have been shown to significantly outperform evolutionary algorithms at solving high-dimensional structured optimization problems in domains like machine learning by efficiently utilizing gradients or higher-order derivatives of the underlying function. However, gradient-guided approaches are not directly applicable to fuzzing as real-world program behaviors contain many discontinuities, plateaus, and ridges where the gradient-based methods often get stuck. We observe that this problem can be addressed by creating a smooth surrogate function approximating the target program's discrete branching behavior. In this paper, we propose a novel program smoothing technique using surrogate neural network models that can incrementally learn smooth approximations of a complex, real-world program's branching behaviors. We further demonstrate that such neural network models can be used together with gradient-guided input generation schemes to significantly increase the efficiency of the fuzzing process. Our extensive evaluations demonstrate that NEUZZ significantly outperforms 10 state-of-the-art graybox fuzzers on 10 popular real-world programs both at finding new bugs and achieving higher edge coverage. NEUZZ found 31 previously unknown bugs (including two CVEs) that other fuzzers failed to find in 10 real-world programs and achieved 3X more edge coverage than all of the tested graybox fuzzers over 24 hour runs. Furthermore, NEUZZ also outperformed existing fuzzers on both LAVA-M and DARPA CGC bug datasets.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Dongdong She, Kexin Pei, Dave Epstein, Junfeng Yang, Baishakhi Ray, Suman Jana
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
arXiv:1807.05620v4
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
核心问题是覆盖导向fuzzing本质是优化问题最大化new edge coverage/bugs但真实程序的分支行为对输入是高度离散且不光滑的目标函数导致两类主流方法各有瓶颈evolutionary algorithms无法利用梯度结构而易低效停滞gradient-guided optimization虽高效但在程序的discontinuities、plateaus、ridges处梯度不可用/不可靠而“卡住”。NEUZZ要解决的就是“如何在不引入符号执行等高开销白盒平滑的前提下把程序分支行为变成可微、可求梯度的近似函数并让梯度真正能指导变异去触达未覆盖边与隐藏漏洞”。简洁概述用可微的神经网络代理模型平滑程序分支使梯度引导变异在真实程序上可用且更高效。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
(1) Neural program smoothing：训练feed-forward surrogate NN，把输入字节序列映射为edge bitmap的"平滑近似"，从而可微、可求梯度。
(2) Edge-label降维预处理：仅保留训练集中出现过的边，并合并"总是共现"的边以缓解multicollinearity，减少输出维度（约从65,536降到~4,000）。
(3) Gradient-guided mutation（Algorithm 1）：对选定的输出边神经元计算∇x f_i(θ,x)，选top-k高梯度字节作为critical bytes，按梯度符号对字节做增/减并clip到[0,255]，生成定向变异。
(4) Exponentially grow mutation target：从少量字节开始，逐轮扩大要变异的字节数，以覆盖更大输入空间，同时保持单次搜索有效性。
(5) Incremental learning + coverage-based filtration：把新发现触发新边的输入加入，并用"只保留能带来新覆盖的旧数据摘要"控制数据规模，迭代重训以提高代理模型精度并避免灾难性遗忘。
(6) Magic-check辅助（LAVA/CGC场景）：用定制LLVM pass插桩magic byte checks，用NN梯度先定位关键字节，再对相邻字节做局部穷举（4×256），以高效触发多字节条件。
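A sketch of the gradient-guided mutation in item (3), with a logistic surrogate standing in for the trained NN: compute the gradient of one smoothed edge output with respect to the input bytes, pick the top-k bytes by gradient magnitude, and push them in the direction of the gradient sign, clipped to [0, 255]. The weights are random stand-ins, so this only illustrates the mechanics.

```python
# Gradient-guided byte mutation on a smooth surrogate f(x) = sigmoid(w.x + b) (sketch).
import numpy as np

def gradient_guided_mutants(x, w, b, k=4, step=32):
    z = float(w @ x + b)
    sig = 1.0 / (1.0 + np.exp(-z))
    grad = sig * (1.0 - sig) * w                 # d sigmoid(w.x + b) / d x
    critical = np.argsort(-np.abs(grad))[:k]     # top-k "critical bytes"
    up, down = x.copy(), x.copy()
    up[critical] = np.clip(up[critical] + step * np.sign(grad[critical]), 0, 255)
    down[critical] = np.clip(down[critical] - step * np.sign(grad[critical]), 0, 255)
    return up.astype(np.uint8), down.astype(np.uint8), critical

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    seed_input = rng.integers(0, 256, size=64).astype(np.float64)
    w, b = rng.normal(size=64), 0.0
    up, down, crit = gradient_guided_mutants(seed_input, w, b)
    print("critical byte offsets:", crit)
```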
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
(1) 10个真实世界程序集：binutils-2.30（readelf -a, nm -C, objdump -D, size, strip）、harfbuzz-1.7.6、libjpeg-9c、mupdf-1.12.0、libxml2-2.9.7、zlib-1.2.11（论文表II(b)列出，来源为对应开源项目/版本）。
(2) LAVA-M bug dataset：LAVA项目的子集（base64、md5sum、uniq、who），含注入的magic-number触发漏洞（来源引用[28] LAVA）。
(3) DARPA CGC dataset：Cyber Grand Challenge二进制/服务程序数据集，论文从中随机选50个binary评测（来源引用[26] CGC repository）。
(4) 训练数据来源：先运行AFL-2.52b 1小时生成初始seed corpus与边覆盖标签用于训练NN（平均每个程序约2K训练输入）。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
(1) Bugs found / crashes统计发现的真实漏洞与崩溃数量用于衡量漏洞挖掘能力内存类问题通过AddressSanitizer的stack trace去重整数溢出通过人工分析+UBSan验证。
(2) Edge coveragenew control-flow edges以AFL的edge coverage report统计“新增边数”作为覆盖导向fuzzing的核心效果指标。
(3) 运行时间预算下的覆盖增长曲线比较24h真实程序、5hLAVA-M、6hCGC内覆盖随时间变化体现“到达新边的速度”。
(4) 训练开销/时间NEUZZ train(s), training time sec衡量学习组件的成本如与RNN fuzzer对比时报告训练时间差异
(5) 固定变异预算下的覆盖如1M mutations在控制变异次数时比较不同方法/模型的有效性,排除训练时长差异干扰。
(6) NN预测精度test accuracy约95%平均):用于说明代理模型对分支行为预测质量(间接影响梯度指导有效性)。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
真实程序24h：NEUZZ在10个程序上均取得最高edge coverage（表VI示例：readelf -a 4,942、harfbuzz 6,081、nm -C 2,056、libxml 1,596、mupdf 487、zlib 376等），并在多程序上呈现"1小时内新增>1,000边"的领先速度；文中总结对AFL在9/10程序上分别约6×、1.5×、9×、1.8×、3.7×、1.9×、10×、1.3×、3×的边覆盖优势，并称相对次优fuzzer可达约4.2×、1.3×、7×、1.2×、2.5×等提升。真实漏洞（表III）：在6个fuzzer对比中NEUZZ总计发现60个bug（AFL 29、AFLFast 27、VUzzer 14、KleeFL 26、AFL-laf-intel 60?——表中按项目/类型给出），且NEUZZ覆盖5类bug，并额外拿到2个CVE（CVE-2018-19931/19932）。LAVA-M（5h）：NEUZZ在base64找到48、md5sum 60、uniq 29、who 1,582（表IV），整体优于Angora等对比项。CGC（6h，50 binaries）：NEUZZ触发31个有漏洞binary（AFL为21、Driller为25），且NEUZZ覆盖了AFL/Driller找到的全部并额外多6个（表V）。与RNN fuzzer对比（1M mutations，表VII）：NEUZZ在readelf/libjpeg/libxml/mupdf上分别获得约8.4×/4.2×/6.7×/3.7×更多边覆盖，且训练开销约低20×。模型消融（表VIII）：线性模型显著落后，增量学习进一步提升（如readelf -a：1,723→1,800→2,020）。结论：神经平滑+梯度定向变异在覆盖与找bug上均显著优于多种SOTA，且训练/执行开销可控、可扩展到大程序。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机evolutionary fuzzing在深层逻辑与稀疏漏洞上效率低梯度优化在高维结构化问题上更强但直接用于程序会被分支离散性导致的不可微/不连续行为阻断既有program smoothing依赖符号执行/抽象解释,开销大且不可扩展。
贡献点:(1) 提出“program smoothing对gradient-guided fuzzing至关重要”的核心观点并把fuzzing形式化为优化问题明确为何需要平滑代理。 (2) 设计首个可扩展的surrogate neural network program smoothing用feed-forward NN学习输入→edge coverage bitmap的平滑近似并通过标签降维解决训练可行性。 (3) 提出面向覆盖的gradient-guided mutation策略利用∇x f_i(θ,x)定位critical bytes与变异方向系统生成高价值变异而非均匀随机。 (4) 提出coverage-based filtration的incremental learning流程持续用新覆盖数据纠正代理模型且避免灾难性遗忘。 (5) 实现NEUZZ并在真实程序、LAVA-M、CGC上进行大规模对比证明在bug数量与edge coverage上显著超越10个SOTA fuzzers。
主要创新之处:用“可微代理模型”把离散分支行为平滑为可求梯度的函数,再把梯度直接转化为变异位置/方向的决策信号,从而以低开销获得比符号/污点等重分析更强的探索能力。

View File

@@ -0,0 +1,9 @@
@misc{rasul2021autoregressivedenoisingdiffusionmodels,
title={Autoregressive Denoising Diffusion Models for Multivariate Probabilistic Time Series Forecasting},
author={Kashif Rasul and Calvin Seward and Ingmar Schuster and Roland Vollgraf},
year={2021},
eprint={2101.12072},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2101.12072},
}

View File

@@ -0,0 +1,44 @@
# Autoregressive Denoising Diffusion Models for Multivariate Probabilistic Time Series Forecasting
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文面向高维多变量概率时间序列预测，指出现有方法常依赖可处理的分布族（如多元高斯/低秩近似、copula或normalizing flows），对真实分布的多模态/非连通结构表达受限，且高维协方差建模代价高。研究目的在于构建一个能在每个时间步学习灵活条件分布、并保持自回归预测优势的模型。方法上提出TimeGrad：用RNN（LSTM/GRU）编码历史与covariates得到隐藏状态h_{t-1}，并在每个时间步以条件denoising diffusion probabilistic model/score matching形式学习p_θ(x_t^0|h_{t-1})；训练通过变分下界等价的噪声预测损失||ε−ε_θ(·,h_{t-1},n)||²，推断用annealed Langevin dynamics式的反向马尔可夫链从白噪声采样。实验在六个真实数据集（最高达2000维）上以CRPS与CRPS_sum评测，TimeGrad在除最小数据集外均达到新的SOTA；结论是"自回归+扩散/EBM"的组合可有效学习高维相关预测分布。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
In this work, we propose TimeGrad, an autoregressive model for multivariate probabilistic time series forecasting which samples from the data distribution at each time step by estimating its gradient. To this end, we use diffusion probabilistic models, a class of latent variable models closely connected to score matching and energy-based methods. Our model learns gradients by optimizing a variational bound on the data likelihood and at inference time converts white noise into a sample of the distribution of interest through a Markov chain using Langevin sampling. We demonstrate experimentally that the proposed autoregressive denoising diffusion model is the new state-of-the-art multivariate probabilistic forecasting method on real-world data sets with thousands of correlated dimensions. We hope that this method is a useful tool for practitioners and lays the foundation for future research in this area.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Kashif Rasul, Calvin Seward, Ingmar Schuster, Roland Vollgraf
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
文段未给出会议或期刊信息。
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
论文要解决的核心问题是在多变量维度D可达上千/上万时间序列预测中如何为每个未来时间步学习“高维、相关、可能多模态且非高斯”的条件预测分布q_X(x_t^0|x_{1:t-1}^0,c_{1:T})同时避免传统多元高斯协方差带来的O(D^2)参数和O(D^3)计算、以及normalizing flows/VAEs在非连通模式上可能出现的“连接模式间的虚假密度/映射困难”。TimeGrad通过在自回归框架下把每个时间步的发射分布替换为可用Langevin采样的扩散/score-based EBM从而以更少的结构限制拟合复杂分布。简洁概述用“RNN条件 + 扩散去噪得分模型”替代传统输出分布,做高维相关概率预测。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
(1) TimeGrad总体框架：把多变量预测分解为自回归因子∏_{t=t_0}^{T} p_θ(x_t^0|h_{t-1})，其中h_{t-1}由RNN对历史与协变量编码。(2) 条件扩散发射头（conditional denoising diffusion model）：对每个时间步t，把目标向量x_t^0视为训练样本，使用固定前向扩散q(x^n|x^{n-1})加噪，学习反向p_θ(x^{n-1}|x^n,h_{t-1})去噪。(3) 噪声ε预测的训练目标：用Ho et al. (2020)的ε-parameterization，优化E[||ε−ε_θ(√ᾱ_n x_t^0+√(1−ᾱ_n)ε, h_{t-1}, n)||²]，选择Σ_θ=β̃_n以学习score/梯度。(4) 推断与采样：给定h_{t-1}，从x_t^N~N(0,I)出发，按n=N…1进行反向更新（含噪声项z）得到x_t^0，再把采样到的x_t^0回喂RNN滚动生成多步轨迹。(5) 维度缩放（Scaling）：按context window的均值对各维归一化（均值为0则用1），预测后再缩放回原尺度，以处理不同维度量纲差异。(6) 协变量建模（Covariates）：对类别特征用embedding，组合时间相关（hour/day等）、时间不变特征与lag features，作为RNN输入的一部分。
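A schematic of the autoregressive sampling loop in item (4): the RNN state h_{t-1} conditions a full reverse-diffusion chain that turns white noise into x_t^0, and the sample is fed back to advance the RNN. Both `rnn_step` and `reverse_diffusion` below are crude stand-ins for the trained networks, so only the control flow mirrors the paper.

```python
# Autoregressive rollout sketch: one reverse-diffusion chain per forecast step,
# conditioned on the RNN hidden state; the sampled value re-enters the RNN.
import numpy as np

def rnn_step(h, x):                      # stand-in for the LSTM/GRU update
    return 0.9 * h + 0.1 * x

def reverse_diffusion(h, dim, n_steps=20, rng=np.random.default_rng(0)):
    x = rng.standard_normal(dim)         # start from white noise x^N ~ N(0, I)
    for _ in range(n_steps):             # stand-in denoising: pull the sample toward the condition
        x = x + 0.1 * (h - x) + 0.05 * rng.standard_normal(dim)
    return x

def forecast(history, horizon):
    dim = history.shape[1]
    h = np.zeros(dim)
    for x in history:                    # encode the conditioning window
        h = rnn_step(h, x)
    samples = []
    for _ in range(horizon):             # autoregressive rollout
        x0 = reverse_diffusion(h, dim)
        samples.append(x0)
        h = rnn_step(h, x0)              # feed the sample back into the RNN
    return np.stack(samples)

if __name__ == "__main__":
    hist = np.random.default_rng(1).standard_normal((24, 4))
    print(forecast(hist, horizon=6).shape)
```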
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
使用六个公开数据集，并声明"preprocessed exactly as in (Salinas et al., 2019a)"：Exchange（Lai et al., 2018）、Solar（Lai et al., 2018）、Electricity（UCI ElectricityLoadDiagrams20112014，https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014）、Traffic（UCI PEMS-SF，https://archive.ics.uci.edu/ml/datasets/PEMS-SF）、Taxi（NYC TLC Trip Record Data，https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page）、Wikipedia（GluonTS仓库提供，链接指向 https://github.com/mbohlkeschneider/gluon-ts/tree/mv_release/datasets）。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
(1) CRPSContinuous Ranked Probability Score对每个维度的预测CDF F与观测x的兼容性评分CRPS越小代表预测分布越接近真实分布是proper scoring rule。(2) CRPS_sum先对D维求和得到标量序列再对其预测分布\hat F_sum计算CRPS并在预测区间上平均用于衡量模型对“总量/聚合量”不确定性的刻画能力并且同样是proper scoring function。论文强调选择CRPS类指标是因为部分对比方法不提供解析likelihood或likelihood不可比。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
论文在六个数据集上用CRPS_sum对比多类基线（VAR/VAR-Lasso/GARCH/VES、KVAE、Vec-LSTM系列、GP-Copula/GP-scaling、Transformer-MAF），TimeGrad在除最小数据集外均为最优：Exchange上TimeGrad 0.006±0.001（最优为VES 0.005±0.000，TimeGrad略逊）；Solar上0.287±0.02，优于Transformer-MAF 0.301±0.014等；Electricity上0.0206±0.001，与Transformer-MAF 0.0207±0.000接近且更好；Traffic上0.044±0.006，优于Transformer-MAF 0.056±0.001；Taxi上0.114±0.02，优于Transformer-MAF 0.179±0.002等；Wikipedia上0.0485±0.002，优于Transformer-MAF 0.063±0.003。消融显示扩散步数N可降到约10而性能无明显损失，最优大约在N≈100，继续增大无收益。结论：TimeGrad作为"自回归扩散去噪模型"在高维相关概率预测上达到SOTA，并具备对多尺度/多模态分布更友好的建模特性。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机：(1) 多变量概率预测需要刻画不确定性与跨维相关性，但多元高斯/低秩/二阶相关限制明显且计算昂贵；(2) normalizing flows/VAEs在高维、非连通模式分布上可能产生虚假密度或映射困难；(3) EBMs/score-based方法表达力强，但与时间序列条件化的结合不足。
贡献点与创新：(1) 提出TimeGrad：将每个时间步的条件发射分布建模为RNN条件化的denoising diffusion probabilistic model，通过ε-预测学习梯度/score，并用Langevin式反向链采样，从而在自回归预测中实现高维灵活分布建模；(2) 给出训练与推断算法（逐时间步训练目标、滚动采样轨迹），并加入适配真实多维量纲差异的scale normalization与covariate embedding；(3) 在六个真实基准（最高2000维）上用CRPS与CRPS_sum系统对比，取得除最小数据集外的SOTA；(4) 分析扩散长度N的影响，指出N≈10已可用、N≈100最优，为实践中的采样-质量权衡提供依据。

View File

@@ -0,0 +1,9 @@
@misc{tashiro2021csdiconditionalscorebaseddiffusion,
title={CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation},
author={Yusuke Tashiro and Jiaming Song and Yang Song and Stefano Ermon},
year={2021},
eprint={2107.03502},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2107.03502},
}

View File

@@ -0,0 +1,44 @@
# CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文研究多变量时间序列缺失值插补（probabilistic time series imputation），指出传统以autoregressive为主的插补方法虽常用，但score-based diffusion models在图像/音频生成上已显著优于同类方法；且现有"用无条件扩散模型近似后验score"的插补做法会对观测值加噪、并非精确条件分布。研究目的为直接学习q(x0^ta|x0^co)的条件分布并提升插补质量。方法提出CSDI：基于DDPM参数化，将反向过程扩展为pθ(x^ta_{t-1}|x^ta_t,x0^co)，令去噪网络εθ显式条件化于观测（含conditional mask），并用masked language modeling式自监督训练把观测拆为conditional observations与imputation targets（含random/historical/mix/testpattern策略）；实现上采用2D attention（时间Transformer+特征Transformer）捕获时序与跨特征依赖，并加入时间/特征嵌入。结果：在PhysioNet 2012与北京空气质量数据上，CSDI将CRPS相对现有概率插补方法提升40–65%，确定性插补MAE降低5–20%，并在插值与概率预测上具竞争力。结论是"为插补显式训练的条件扩散模型"显著优于后验近似式扩散插补与RNN基线。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
The imputation of missing values in time series has many applications in healthcare and finance. While autoregressive models are natural candidates for time series imputation, score-based diffusion models have recently outperformed existing counterparts including autoregressive models in many tasks such as image generation and audio synthesis, and would be promising for time series imputation. In this paper, we propose Conditional Score-based Diffusion models for Imputation (CSDI), a novel time series imputation method that utilizes score-based diffusion models conditioned on observed data. Unlike existing score-based approaches, the conditional diffusion model is explicitly trained for imputation and can exploit correlations between observed values. On healthcare and environmental data, CSDI improves by 40-65% over existing probabilistic imputation methods on popular performance metrics. In addition, deterministic imputation by CSDI reduces the error by 5-20% compared to the state-of-the-art deterministic imputation methods. Furthermore, CSDI can also be applied to time series interpolation and probabilistic forecasting, and is competitive with existing baselines. The code is available at https://github.com/ermongroup/CSDI.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Yusuke Tashiro, Jiaming Song, Yang Song, Stefano Ermon
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
35th Conference on Neural Information Processing Systems (NeurIPS 2021)
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
论文核心要解决在多变量时间序列X∈R^{K×L}存在缺失mask M如何学习缺失部分的条件分布q(x0^ta|x0^co)并进行概率插补(输出分布/采样同时充分利用观测值在时间维与特征维的相关性。现有score-based/扩散插补通常用“无条件扩散模型”近似条件反向过程在反向采样中对观测x0^co也加噪并拼接进入x_t导致观测信息被噪声破坏且不对应精确的条件分布从而限制插补质量。CSDI通过“显式条件反向过程 + 为插补专门训练的εθ(x^ta_t,t|x0^co)”来避免该近似,并用自监督策略在训练时构造(conditional, target)对。简洁概述:用显式条件扩散模型直接建模“给定观测的缺失分布”,替代对观测加噪的后验近似式扩散插补。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
(1) CSDI条件扩散插补框架：把DDPM反向链改为pθ(x^ta_{t-1}|x^ta_t,x0^co)，在每步去噪时显式输入观测条件以恢复缺失值分布。(2) 条件DDPM参数化：沿用DDPM的μ_DDPM/σ_DDPM形式，仅将εθ扩展为条件网络εθ(x^ta_t,t|x0^co)（并在实现里加入m_co）。(3) 自监督训练（masked language modeling式）：把"观测值"划分为x0^co与x0^ta，目标对x0^ta加噪得到x^ta_t，优化||ε−εθ(x^ta_t,t|x0^co)||²，无需真实缺失的ground truth。(4) 目标选择策略（target choice）：Random（随机比例选观测为目标）、Historical（用训练集中其他样本缺失模式决定目标）、Mix（随机+历史混合）、Testpattern（已知测试缺失模式时直接用）。(5) 时间序列实现的输入对齐：将x^ta_t与x0^co零填充到固定K×L形状，配合conditional mask m_co指示条件位置与输出mask。(6) 2D attention去噪网络：在残差层中同时做temporal Transformer（沿L捕获时间依赖）与feature Transformer（沿K捕获特征依赖）。(7) Side information：时间戳s的128维time embedding + 特征ID的embedding，用于增强条件建模。
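A sketch of the "random" target-choice strategy in items (3)–(4): part of the observed entries is re-labelled as imputation targets and the remainder becomes the condition mask m_co, which is all the supervision the εθ network needs. The ratio and variable names are illustrative.

```python
# Build a self-supervised (condition, target) mask pair from observed entries (sketch).
import numpy as np

def random_target_choice(observed_mask, target_ratio=0.5, rng=np.random.default_rng(0)):
    """observed_mask: 1 where x is observed, 0 where it is truly missing.
    Returns (cond_mask m_co, target_mask): disjoint subsets of the observed entries."""
    obs_idx = np.argwhere(observed_mask == 1)
    n_target = int(len(obs_idx) * target_ratio)
    chosen = obs_idx[rng.choice(len(obs_idx), size=n_target, replace=False)]
    target_mask = np.zeros_like(observed_mask)
    target_mask[tuple(chosen.T)] = 1
    cond_mask = observed_mask - target_mask
    return cond_mask, target_mask

if __name__ == "__main__":
    K, L = 4, 6                                  # features x time steps
    observed = (np.random.default_rng(1).random((K, L)) > 0.2).astype(int)
    m_co, m_ta = random_target_choice(observed)
    print("condition:\n", m_co, "\ntargets:\n", m_ta)
```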
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
插补/插值数据集：(1) PhysioNet Challenge 2012 healthcare数据集（4000个ICU临床时间序列，35变量，48小时；文中引用Silva et al., 2012）。(2) Air quality（Beijing PM2.5）数据集：北京36个站点PM2.5小时数据（文中引用Yi et al., 2016），数据覆盖2014/05/01–2015/04/30，且含结构化缺失/人工ground truth。预测（forecasting）数据集：来自GluonTS并按Salinas et al. 2019预处理的solar（137站点光伏发电记录）、electricity（UCI ElectricityLoadDiagrams20112014）、traffic（UCI PEMS-SF）、taxi（NYC TLC trip records聚合的1214位置半小时序列）、wiki（2000维Wikipedia page views）。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
(1) CRPS（continuous ranked probability score）：评估概率插补/插值的预测分布与观测值的匹配程度，越小越好；用100个样本近似分布，并按分位数损失积分近似计算（论文用于概率插补与预测，也报告逐任务CRPS）。(2) 归一化平均CRPS：对所有目标(k,l)的CRPS求和后除以∑|x_{k,l}|，用于跨特征尺度的可比性（论文Eq.18）。(3) MAE（mean absolute error）：确定性插补误差，越小越好；CSDI取100样本中位数作为确定性插补。(4) RMSE：确定性插补/插值误差补充指标，越小越好（在附录表9/10报告）。(5) CRPS-sum：概率预测任务中"对K维求和后的分布"的CRPS，越小越好，衡量联合效应/聚合量不确定性（论文Eq.19）。(6) MSE：概率预测任务的点误差补充指标，越小越好（表12）。(7) NLL（negative log likelihood）：论文做了对比分析（表8），说明其对噪声schedule敏感且与样本质量/CRPS不一定相关，因此不作为主指标。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
概率插补（CRPS，越低越好，表2）：在healthcare数据上，CSDI分别为（10%/50%/90% missing）0.238(0.001)/0.330(0.002)/0.522(0.002)，优于Multitask GP（0.489/0.581/0.942）、GP-VAE（0.574/0.774/0.998）、V-RIN（0.808/0.831/0.922），并优于"unconditional diffusion"插补（0.360/0.458/0.671）。在air quality上CSDI为0.108(0.001)，优于Multitask GP 0.301、GP-VAE 0.397、V-RIN 0.526、unconditional 0.135。确定性插补（MAE，表3）：healthcare为0.217/0.301/0.481（10/50/90% missing），优于BRITS 0.284/0.368/0.517、GLIMA（引用值10.54）；在air quality上CSDI为9.60(0.04)，优于BRITS（引用值11.56）、unconditional 12.13等，总体称提升5–20%。插值（不规则采样，CRPS，表4）：CSDI 0.380/0.418/0.556（10/50/90% missing），优于Latent ODE与mTANs。预测（CRPS-sum，表5）：CSDI在electricity与traffic上最好（0.017与0.020），在solar/taxi/wiki上与SOTA基线接近或略优（例如wiki 0.047 vs TimeGrad 0.049）。结论：显式条件扩散模型+自监督训练显著提升概率/确定性插补，并可迁移到插值与预测。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机：(1) 时间序列缺失普遍存在且影响医疗/金融等应用；(2) 既有深度插补多为autoregressive，难以达到扩散模型在生成任务中的表现；(3) 现有扩散插补多用"无条件扩散近似条件后验score"，需要对观测加噪、信息受损且不对应精确条件分布。
贡献点与创新：(1) 提出CSDI：为插补任务显式建模pθ(x0^ta|x0^co)的conditional score-based diffusion model，并在反向去噪中直接条件化于观测值以利用相关性。(2) 自监督训练机制：借鉴masked language modeling，把观测拆成条件与目标，设计random/historical/mix/testpattern等target choice策略，使得即使训练集缺少真实缺失ground truth也能训练。(3) 面向多变量时序的结构设计：提出2D attention（时间Transformer+特征Transformer）与side information（time embedding、feature embedding），强化跨时间/跨特征依赖建模。(4) 系统实证：在healthcare与air quality上CRPS提升40–65%、确定性MAE降低5–20%，并展示对插值与预测的可扩展性与竞争力，同时通过消融证明2D attention等组件关键。

View File

@@ -0,0 +1,11 @@
@inproceedings{NEURIPS2020_4c5bcfec,
author = {Ho, Jonathan and Jain, Ajay and Abbeel, Pieter},
booktitle = {Advances in Neural Information Processing Systems},
editor = {H. Larochelle and M. Ranzato and R. Hadsell and M.F. Balcan and H. Lin},
pages = {6840--6851},
publisher = {Curran Associates, Inc.},
title = {Denoising Diffusion Probabilistic Models},
url = {https://proceedings.neurips.cc/paper_files/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf},
volume = {33},
year = {2020}
}

View File

@@ -0,0 +1,44 @@
# **Denoising Diffusion Probabilistic Models**
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文研究扩散概率模型diffusion probabilistic models, diffusion models在高质量图像生成上的可行性。背景问题是此前扩散模型虽易定义与训练但缺少生成高质量样本的证明。研究目的在于改进扩散模型的训练与参数化并展示其样本质量。方法上构建固定前向扩散过程q(x_t|x_{t-1})逐步加高斯噪声并学习反向过程p_θ(x_{t-1}|x_t)的高斯转移提出与denoising score matching和Langevin dynamics的显式联系采用预测噪声ε的参数化并使用简化的加权变分目标L_simple训练模型用U-Net+自注意力与时间嵌入。结果在无条件CIFAR10上达到Inception Score 9.46、FID 3.17在256×256 LSUN上样本质量接近ProgressiveGAN。结论指出扩散模型能产生高质量样本且其采样可解释为progressive decoding/渐进式有损解码的泛化。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN. Our implementation is available at https://github.com/hojonathanho/diffusion.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Jonathan Ho, Ajay Jain, Pieter Abbeel
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
34th Conference on Neural Information Processing Systems (NeurIPS 2020)
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
核心问题是扩散概率模型作为“通过变分推断训练的马尔可夫链”能否在图像上达到与GAN/自回归模型相当的高样本质量以及如何设计反向过程参数化与训练目标来实现这一点。具体难点包括反向过程p_θ(x_{t-1}|x_t)如何学习“去噪逆扩散”训练时如何用可计算的变分下界分解L_T、L_{1:T-1}、L_0稳定优化采样链如何与score-based/annealed Langevin dynamics统一理解并解释其为何样本质量高但log likelihoodbits/dim不占优。简洁概述把扩散模型从“理论上合理、效果未证实”变成“能稳定训练并生成SOTA质量图像”的生成框架并给出与score matching/Langevin及渐进式解码的统一解释。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
(1) 扩散模型的前向/反向马尔可夫链建模：固定q(x_t|x_{t-1})逐步加高斯噪声（β_t schedule），学习高斯反向转移p_θ(x_{t-1}|x_t)。(2) 反向均值的ε-预测参数化：用网络ε_θ(x_t,t)预测噪声ε，并由此构造μ_θ(x_t,t)=1/√α_t·(x_t−β_t/√(1−ᾱ_t)·ε_θ(x_t,t))，使采样形态类似Langevin dynamics。(3) 与denoising score matching的等价联系：推导L_{t-1}在ε-参数化下变为多噪声等级的去噪平方误差，与score matching形式一致。(4) 简化训练目标L_simple：用未加权的E[||ε−ε_θ(√ᾱ_t x_0+√(1−ᾱ_t)ε,t)||²]训练（t均匀采样），作为"加权变分下界"以提升样本质量并简化实现。(5) 固定方差Σ_θ(x_t,t)=σ_t²I：不学习反向方差，实验发现更稳定且FID更好。(6) Progressive lossy decompression / progressive decoding解释：把采样过程解释为逐步恢复信息的渐进式解码，并与自回归解码作类比。
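A minimal numpy sketch of the L_simple objective in item (4): sample a step t, corrupt x_0 with the closed-form forward process, and regress the injected noise. The `eps_model` argument is a placeholder for the U-Net; the linear beta schedule mirrors the formulas quoted above.

```python
# L_simple training target sketch: predict the injected noise epsilon.
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)              # linear beta schedule
alphas_bar = np.cumprod(1.0 - betas)

def l_simple(x0, eps_model, rng=np.random.default_rng(0)):
    t = rng.integers(0, T)                       # t ~ Uniform({0..T-1})
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return float(np.mean((eps - eps_model(x_t, t)) ** 2))

if __name__ == "__main__":
    x0 = np.random.default_rng(1).standard_normal(16)
    dummy_eps_model = lambda x_t, t: np.zeros_like(x_t)   # stand-in network
    print("L_simple with a zero predictor:", l_simple(x0, dummy_eps_model))
```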
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
(1) CIFAR10无条件生成评测论文实验主基准通过TensorFlow Datasets加载。(2) LSUN 256×256Bedroom、Church、Cat类别数据准备使用StyleGAN代码FID计算亦使用StyleGAN2代码。(3) CelebA-HQ 256×256用于展示高分辨率人脸样本与插值通过TensorFlow Datasets加载
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
(1) Inception Score (IS)：衡量生成样本的可辨识性与多样性，越高越好（主要用于CIFAR10）。(2) Fréchet Inception Distance (FID)：衡量生成分布与真实分布在Inception特征空间的距离，越低越好（CIFAR10/LSUN等）。(3) Negative Log Likelihood (NLL) / bits per dimension：以离散数据的lossless codelength度量似然质量，越低越好，用于对比likelihood-based模型。(4) Rate–Distortion分析：将L_{1:T}视为rate、L_0视为distortion，画rate(bits/dim)与distortion(RMSE)随时间/码率变化，用于解释"高感知质量但似然一般"的原因。(5) Distortion：用RMSE（在[0,255]像素尺度）度量重建误差/失真。 (6) Progressive sampling quality over time：在反向步数(T−t)过程中跟踪IS与FID，观察由粗到细生成。 (7) Train–test NLL gap：训练/测试bits/dim差异，用于检查是否过拟合（论文报告gap≤0.03 bits/dim）。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
在无条件CIFAR10上，论文最佳模型（L_simple）达到IS=9.46±0.11、FID=3.17，并给出NLL Test≤3.75 bits/dim（Train 3.72）。与参数化/目标消融相比，ε-预测在L_simple下显著优于μ̃-预测（μ̃-预测在真变分界L+固定各向同性Σ时FID=13.22，而ε-预测+L_simple将FID降至3.17）。高分辨率上，LSUN 256×256生成质量与ProgressiveGAN相当：Bedroom FID可达4.90（large模型），Church FID=7.89，Cat FID=19.75。论文同时指出，尽管样本质量强，扩散模型的log likelihood不如其他likelihood-based模型；通过rate-distortion显示大量bits用于"人眼不可感知细节"，支持其"优秀有损压缩/渐进式解码"解释。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机扩散模型训练高效、定义简单但此前缺少能生成高质量样本的实证同时score matching/能量模型/Langevin采样等方向虽能生成图像但缺少统一、可计算似然与“直接训练采样器”的框架。
贡献点:(1) 证明扩散模型可达高样本质量在CIFAR10取得SOTA级FID=3.17并在LSUN 256×256上达到接近ProgressiveGAN的质量。(2) 提出关键理论联系建立扩散模型与多噪声级别的denoising score matching、以及与annealed Langevin dynamics采样的显式等价关系并将其视为主要贡献之一。(3) ε-预测反向过程参数化用预测噪声ε_θ替代直接预测μ̃_t使训练目标化简并使采样形式更接近Langevin动态。(4) 简化的加权变分目标L_simple作为对标准变分下界的重加权降低小t去噪项权重显著提升样本质量且实现更简洁。(5) 渐进式有损解码解释与自回归泛化把反向扩散看作progressive decoding并给出与自回归“比特顺序/掩码顺序”的广义对应同时用rate-distortion与逐步生成可视化支撑这一解释。

View File

@@ -0,0 +1,9 @@
@misc{wen2024diffstgprobabilisticspatiotemporalgraph,
title={DiffSTG: Probabilistic Spatio-Temporal Graph Forecasting with Denoising Diffusion Models},
author={Haomin Wen and Youfang Lin and Yutong Xia and Huaiyu Wan and Qingsong Wen and Roger Zimmermann and Yuxuan Liang},
year={2024},
eprint={2301.13629},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2301.13629},
}

View File

@@ -0,0 +1,45 @@
# DiffSTG: Probabilistic Spatio-Temporal Graph Forecasting with Denoising Diffusion Models
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文针对spatio-temporal graph (STG) forecasting中STGNN难以刻画intrinsic uncertainties的问题，研究probabilistic STG forecasting在不确定性建模与复杂ST dependencies上的挑战。目标是将denoising diffusion probabilistic models (DDPM)推广到STGs，提出非自回归框架DiffSTG，并设计首个面向STG的去噪网络UGnet。方法上构建conditional diffusion：以历史图信号与图结构为条件，在masked formulation中把[x^h,x^p]统一为x_all，并用x_all^msk条件化反向过程；训练采用噪声预测损失||ε−ε_θ(x_all^n,n|x_all^msk,G)||²；UGnet以Unet式时间尺度聚合+TCN建模时间依赖+GCN建模空间相关，并用子步采样与末k步复用加速采样。结果：在PEMS08、AIR-BJ、AIR-GZ上CRPS降低4%–14%、RMSE降低2%–7%，且推理较TimeGrad显著加速。结论表明DiffSTG兼顾ST建模能力与不确定性量化，适用于决策场景。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
Spatio-temporal graph neural networks (STGNN) have emerged as the dominant model for spatio-temporal graph (STG) forecasting. Despite their success, they fail to model intrinsic uncertainties within STG data, which cripples their practicality in downstream tasks for decision-making. To this end, this paper focuses on probabilistic STG forecasting, which is challenging due to the difficulty in modeling uncertainties and complex ST dependencies. In this study, we present the first attempt to generalize the popular denoising diffusion probabilistic models to STGs, leading to a novel non-autoregressive framework called DiffSTG, along with the first denoising network UGnet for STG in the framework. Our approach combines the spatio-temporal learning capabilities of STGNNs with the uncertainty measurements of diffusion models. Extensive experiments validate that DiffSTG reduces the Continuous Ranked Probability Score (CRPS) by 4%-14%, and Root Mean Squared Error (RMSE) by 2%-7% over existing methods on three real-world datasets. The code is in https://github.com/wenhaomin/DiffSTG.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Haomin Wen, Youfang Lin, Yutong Xia, Huaiyu Wan, Qingsong Wen, Roger Zimmermann, Yuxuan Liang
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
未在提供的论文内容中给出具体会议或期刊名称。
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
核心问题是在STG forecasting中同时做到(1)捕获跨节点的spatial correlations与跨时间的temporal dependencies复杂ST dependencies以及(2)对未来多步预测给出可用的概率分布与uncertainty quantification而非仅点预测并且(3)避免现有扩散式时间序列方法如TimeGrad在多步预测时的自回归/逐步生成带来的推理低效。论文指出时间序列扩散模型往往只在单变量或单节点上建模时间依赖缺乏显式图结构条件同时TimeGrad需要对每个未来步重复运行扩散链导致S×T_p×N级别的高开销。DiffSTG通过“图条件+非自回归多步一次生成+ST专用去噪网络”来解决。简洁概述把扩散概率模型做成“能利用图结构、一次生成多步、还能输出不确定性”的STG概率预测器。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
(1) DiffSTG非自回归概率预测框架：用一次反向扩散过程直接生成多预测步x^p（而不是逐步/逐未来步运行）。(2) Conditional Diffusion for STG：在反向过程p_θ(x_{n-1}|x_n, x^h, G)中显式条件化历史与图结构，以学习p(x^p|x^h,G)。(3) Generalized Masked Conditional Diffusion：将[x^h,x^p]拼成x_all，并用mask得到x_all^msk作为条件，统一"重建历史+预测未来"的训练目标。(4) UGnet去噪网络：时间维Unet结构捕获multi-scale temporal dependencies，块内用TCN建模时间依赖、用GCN建模空间相关，并注入noise level embedding。 (5) Sampling Acceleration：只采样子序列{τ_1…τ_M}减少步数M，并在末k步把近似样本复用，以减少需要的反向扩散链次数（S→S/k）。
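A small sketch of the masked conditional formulation in item (3): history and future are stacked into one tensor x_all, and a binary mask keeps only the history part as the condition x_all^msk, so one non-autoregressive reverse pass can denoise all future steps at once. Shapes (N nodes x T steps) and names are illustrative.

```python
# Build x_all and its masked condition x_all^msk for one graph signal (sketch).
import numpy as np

def build_masked_condition(x_history, horizon):
    n, t_h = x_history.shape
    x_future_placeholder = np.zeros((n, horizon))
    x_all = np.concatenate([x_history, x_future_placeholder], axis=1)     # N x (T_h + T_p)
    mask = np.concatenate([np.ones((n, t_h)), np.zeros((n, horizon))], axis=1)
    x_all_msk = x_all * mask                 # condition: history kept, future zeroed out
    return x_all, x_all_msk, mask

if __name__ == "__main__":
    hist = np.random.default_rng(0).standard_normal((5, 12))   # 5 nodes, 12 past steps
    x_all, x_msk, m = build_masked_condition(hist, horizon=4)
    print(x_all.shape, x_msk.shape, int(m.sum()))
```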
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
(1) PEMS08交通流量数据集来自Caltrans Performance Measurement System (PeMS)论文使用STSGCN提取/整理的版本,并依据实际路网构建邻接关系。 (2) AIR-BJ北京34个监测站PM2.5小时数据2019/01/012019/12/31来源引用Yi et al.相关工作[46]用站点距离构建空间相关矩阵A。 (3) AIR-GZ广州41个监测站PM2.5小时数据2017/01/012017/12/31同样来源[46]并用距离建图。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
(1) CRPSContinuous Ranked Probability Score评估预测分布F与观测x的兼容性衡量概率预测质量越小越好论文的主要指标。(2) MAE衡量点预测误差的平均绝对偏差论文对概率模型用多次采样的均值或聚合给出确定性结果后计算越小越好。(3) RMSE衡量点预测误差的均方根对大误差更敏感越小越好。(4) Inference Time / Time cost比较扩散式方法的推理耗时与加速效果表3给出不同样本数S下秒级耗时
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
在三数据集上的概率方法对比（表2，越小越好）：AIR-BJ上DiffSTG为MAE 17.88、RMSE 29.60、CRPS 0.34（优于TimeGrad 0.36、DeepAR 0.37等）；AIR-GZ上DiffSTG为MAE 10.95、RMSE 16.66、CRPS 0.22（优于DeepAR 0.23、TimeGrad/MC Dropout 0.25）；PEMS08上DiffSTG为MAE 17.68、RMSE 27.13、CRPS 0.06（优于DeepAR/MC Dropout 0.07、TimeGrad 0.09）。相对"各数据集最强基线"，CRPS下降5.6%/4.3%/14.3%，RMSE下降约7.1%/2.4%/7.6%（表2的Error reduction行）。推理效率（表3，AIR-GZ，N=100）：S=8时TimeGrad 9.58s，DiffSTG(M=100,k=1) 0.24s（约40×加速）；进一步加速的DiffSTG(M=40,k=2)为0.07s，且随S增大仍显著快于TimeGrad与CSDI。结论：DiffSTG在概率质量（CRPS）与确定性误差（MAE/RMSE）上均领先，并通过非自回归与采样子步显著提升推理速度。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机:(1) 现有STGNN多为deterministic难以提供uncertainty用于决策(2) 扩散式概率时间序列模型TimeGrad/CSDI等缺少图空间依赖建模且自回归式多步生成效率低。
贡献点:(1) 首次将DDPM系统推广到STG概率预测提出非自回归DiffSTG以建模p(x^p|x^h,G)并量化不确定性。(2) 提出首个面向STG的扩散去噪网络UGnet时间维Unet捕获多尺度时间模式结合TCN+GCN显式学习ST dependencies。(3) 提出masked generalized conditional diffusionx_all与x_all^msk同空间以统一“重建历史+预测未来”,提升条件利用与任务泛化(预测/生成/插值)。(4) 提出采样加速策略(子步采样+末k步复用并实证带来显著推理加速且性能保持。
主要创新之处ST专用去噪结构UGnet、非自回归多步扩散预测、以及面向STG任务的masked条件化扩散建模方式。

View File

@@ -0,0 +1,9 @@
@misc{kong2021diffwaveversatilediffusionmodel,
title={DiffWave: A Versatile Diffusion Model for Audio Synthesis},
author={Zhifeng Kong and Wei Ping and Jiaji Huang and Kexin Zhao and Bryan Catanzaro},
year={2021},
eprint={2009.09761},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2009.09761},
}

View File

@@ -0,0 +1,47 @@
# DiffWave: A Versatile Diffusion Model for Audio Synthesis
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文面向raw waveform生成指出既有likelihood-based模型WaveNet自回归、flow虽能保真但推理慢或受结构约束GAN/VAE等又常需auxiliary losses且存在mode collapse/posterior collapse尤其在unconditional generation中长序列建模困难。研究目的在于提出一种同时适用于conditional与unconditional的高保真、非自回归生成框架。方法提出DiffWave基于DDPM参数化用εθ预测噪声并以unweighted ELBO变体训练推理从白噪声通过Markov链反向去噪生成波形并提出fast sampling将训练时T步“折叠”为少量T_infer步。网络采用受WaveNet启发的bidirectional dilated convolution残差堆叠支持mel spectrogram本地条件与label全局条件。结果在LJ Speech神经声码器上MOS 4.44≈WaveNet 4.43且合成速度快多个数量级小模型2.64M参数在V100上>5×实时Fast在SC09无条件/类条件生成上相对WaveNet与WaveGAN在MOS、FID/IS/mIS/AM/NDB等质量与多样性指标显著更优。结论是扩散模型可用单一ELBO目标稳定训练并在音频生成任务上实现高质量与高效推理。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive, and converts the white noise signal into structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of variational bound on the data likelihood. DiffWave produces high-fidelity audio in different waveform generation tasks, including neural vocoding conditioned on mel spectrogram, class-conditional generation, and unconditional generation. We demonstrate that DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44 versus 4.43), while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity from various automatic and human evaluations.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, Bryan Catanzaro
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
ICLR 2021
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
论文核心要解决如何在raw waveform层面实现高保真生成同时兼顾(1)非自回归并行合成的高效率,(2)对conditional神经声码器/类条件与unconditional生成的统一建模能力(3)稳定训练且无需额外辅助损失。难点在于无条件生成需要在极长序列如16kHz下1秒=16000点上学习复杂分布WaveNet等自回归在无条件下易产生“made-up word-like sounds/样本差”GAN/ VAE又易出现训练不稳定或模式问题。简洁概述用扩散去噪的马尔可夫链把白噪声稳定地变成真实波形以更快推理和更强无条件生成质量替代传统自回归/GAN方案。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
(1) DiffWave扩散式波形生成采用DDPM前向加噪q(x_t|x_{t-1})与反向去噪pθ(x_{t-1}|x_t)从N(0,I)逐步生成x_0。(2) ε-parameterization与unweighted ELBO训练固定σθ为β̃_t^{1/2},网络εθ(x_t,t)预测噪声ε最小化E||ε−εθ(√ᾱ_t x_0+√(1−ᾱ_t)ε,t)||²。(3) Fast samplingT→T_infer折叠用用户自定义噪声日程{η_s}通过对齐噪声水平t_align把少量采样步映射到训练步显著减少合成时顺序步数如6步。(4) 去噪网络结构基于WaveNet思想的feed-forward、bidirectional dilated conv残差堆叠+skip连接因非自回归可用双向扩张卷积提升感受野利用。(5) 条件生成机制本地条件mel spectrogram经转置卷积上采样并Conv1×1映射后作为各残差层bias与全局条件label embedding经Conv1×1后作为bias。(6) 无条件生成的感受野策略通过反向扩散迭代使有效感受野扩大到T×r以覆盖长序列依赖。
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
(1) LJ Speech DatasetIto, 2017约24小时、22.05kHz、13100条女声语音用于neural vocodingmel spectrogram条件。(2) Speech Commands DatasetWarden, 2018中的SC09子集digits 0916kHz、1秒长度、训练31158条、2032说话人用于unconditional与class-conditional生成文中还用其官方提供的噪声类型做zero-shot denoising演示。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
(1) MOSMean Opinion Score5分制含95%CI人类主观语音质量评价用于声码器、无条件与类条件任务。(2) FID在ResNeXT特征空间拟合高斯衡量生成样本与训练集分布距离兼顾质量与多样性越低越好。(3) IS基于分类器输出偏好“可被清晰分类且整体多样”的样本越高越好。(4) mIS在IS基础上强调within-class diversity类内多样性越高越好。(5) AM Score考虑训练集边缘label分布与生成分布的一致性并加生成样本预测熵项缓解IS忽略先验分布的问题越低越好论文表2以↓标注。(6) NDB/KNumber of Statistically-Different BinsK-means分箱后比较生成与真实在各箱占比差异衡量多样性与模式覆盖越低越好。(7) Accuracy类条件任务用ResNeXT对生成样本分类的准确率衡量语音清晰度/类一致性(越高越好)。(8) FID-class类条件任务按每个digit分别算FID再取均值±方差衡量类内分布匹配越低越好。(9) 合成速度(如“× real-time”、或“500× slower”对比衡量推理效率特别对比WaveNet与DiffWave/Fast。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
神经声码器LJ Speech表1DiffWave LARGET=200MOS 4.44±0.07略高于WaveNet 4.43±0.10DiffWave BASE在T=40/50时MOS 4.35±0.10/4.38±0.08且参数仅2.64MFast采样下DiffWave BASE(Fast) MOS 4.37±0.07、DiffWave LARGE(Fast) MOS 4.42±0.09并报告合成可达5.6×/3.5×实时V100无工程优化而WaveNet无工程优化约500×慢于实时。无条件生成SC09表2DiffWave在MOS 3.39±0.32显著优于WaveNet-256 1.43±0.30与WaveGAN 2.03±0.33自动指标上DiffWave也最好FID 1.287优于WaveGAN 1.349、WaveNet-256 2.947IS 5.305、mIS 59.4、AM 0.636、NDB/K 0.74。类条件生成表3DiffWave MOS 3.50±0.31优于WaveNet-256 1.58±0.36Accuracy 91.20%WaveNet-256为60.70%FID-class 1.113±0.569WaveNet-256为6.954±2.114mIS 117.4远高于WaveNet。结论DiffWave用单一扩散目标实现接近/超过强声码器质量,并在无条件/类条件任务上显著提升质量与多样性,同时推理速度相对自回归大幅加速。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机:(1) WaveNet等自回归波形生成质量高但推理极慢(2) flow模型虽快但架构受可逆性/雅可比约束;(3) GAN/VAE常需辅助损失且存在训练不稳定、mode collapse/posterior collapse(4) 无条件波形生成长序列难,传统模型质量/多样性不足。
贡献点:(1) 提出DiffWave非自回归扩散概率模型使用ELBO变体噪声预测稳定训练统一支持neural vocoding、类条件与无条件生成。(2) 设计适配扩散去噪的bidirectional dilated conv残差网络与扩散步嵌入/条件注入方式。(3) 提出fast sampling将训练T步反向过程折叠为少量T_infer步显著加速合成且保持质量。(4) 大规模实证声码器MOS 4.44≈WaveNet且速度提升数量级在无条件/类条件上相对WaveNet与WaveGAN在MOS与FID/IS/mIS/AM/NDB等全面领先。
主要创新之处将DDPM的ε-parameterization系统化落地到raw audio并通过“非自回归去噪网络+噪声步对齐的快速采样”实现高质量与高效率的统一。

View File

@@ -0,0 +1,22 @@
Diffusion models (DDPM / score-based) for time-series and spatio-temporal modelling (the most direct support for "generating packet sequences with diffusion"):
Ho, Jain, Abbeel. Denoising Diffusion Probabilistic Models (DDPM). NeurIPS 2020.
Use: the basic form of diffusion models (forward noising, reverse denoising, noise-prediction training); the root theoretical citation for the diffusion part of our method section.
Song et al. Score-Based Generative Modeling through Stochastic Differential Equations. ICLR 2021.
Use: the more general score-based formulation; this line of work becomes useful if we later model continuous time (inter-arrival intervals / jitter).
Rasul et al. Autoregressive Denoising Diffusion Models for Multivariate Probabilistic Time Series Forecasting. ICML 2021.
Use: diffusion modelling of multivariate time series; corresponds to jointly generating our multiple (device, register) sequences.
Tashiro et al. CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation. NeurIPS 2021.
Use: conditional diffusion; its conditioning-injection scheme suits using device embeddings / register semantics / master-slave roles / process states as conditions to constrain generation.
Liu et al. PriSTI: A Conditional Diffusion Framework for Spatiotemporal Imputation. ICDE 2023.
Use: a spatio-temporal conditional diffusion framework; replace "space" with the (device, register) bipartite/heterogeneous graph and "time" with the polling/session position, and the structure maps over closely.
Wen et al. DiffSTG: Probabilistic Spatio-Temporal Graph Forecasting with Denoising Diffusion Models. ACM SIGSPATIAL 2023.
Use: diffusion + spatio-temporal graphs; when generating (rather than forecasting) on the (device, register) graph, its way of fusing graph features into the denoising network is worth borrowing.
Kong et al. DiffWave: A Versatile Diffusion Model for Audio Synthesis. ICLR 2021.
Use: one-dimensional signal generation (similar to "inter-arrival sequences" and "value sequences"); its WaveNet/UNet-style denoising backbone is also a useful reference for high-frequency industrial polling sequences.

View File

@@ -0,0 +1,9 @@
@misc{liu2023pristiconditionaldiffusionframework,
title={PriSTI: A Conditional Diffusion Framework for Spatiotemporal Imputation},
author={Mingzhe Liu and Han Huang and Hao Feng and Leilei Sun and Bowen Du and Yanjie Fu},
year={2023},
eprint={2302.09746},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2302.09746},
}

View File

@@ -0,0 +1,47 @@
# PriSTI: A Conditional Diffusion Framework for Spatiotemporal Imputation
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文面向真实场景的spatiotemporal data不完整问题sensor failures、transmission loss研究spatiotemporal imputation以利用spatiotemporal dependence补全缺失值。既有RNN式autoregressive插补易出现error accumulation且多为确定性输出扩散模型可从Gaussian noise生成并避免串行误差但在ST数据上“conditional information的构造与利用”困难。为此提出PriSTI先对观测做linear interpolation得到增强的conditional information X并用Conditional Feature Extraction提取含temporal attention、spatial attention与MPNN(结合geographic adjacency A)的global context prior H_pri再在Noise Estimation Module中用H_pri计算注意力权重Q,K来自H_priV来自含噪输入引导εθ(Ẋ_t,X,A,t)去噪。实验在AQI-36、METR-LA、PEMS-BAY多缺失模式下PriSTI在MAE/MSE/CRPS上均优于CSDI、GRIN等并在高缺失率与sensor failure场景保持优势结论是“先验增强的条件特征+地理关系”能显著提升扩散式时空插补质量与稳健性。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
Abstract—Spatiotemporal data mining plays an important role in air quality monitoring, crowd flow modeling, and climate forecasting. However, the originally collected spatiotemporal data in real-world scenarios is usually incomplete due to sensor failures or transmission loss. Spatiotemporal imputation aims to fill the missing values according to the observed values and the underlying spatiotemporal dependence of them. The previous dominant models impute missing values autoregressively and suffer from the problem of error accumulation. As emerging powerful generative models, the diffusion probabilistic models can be adopted to impute missing values conditioned by observations and avoid inferring missing values from inaccurate historical imputation. However, the construction and utilization of conditional information are inevitable challenges when applying diffusion models to spatiotemporal imputation. To address above issues, we propose a conditional diffusion framework for spatiotemporal imputation with enhanced prior modeling, named PriSTI. Our proposed framework provides a conditional feature extraction module first to extract the coarse yet effective spatiotemporal dependencies from conditional information as the global context prior. Then, a noise estimation module transforms random noise to realistic values, with the spatiotemporal attention weights calculated by the conditional feature, as well as the consideration of geographic relationships. PriSTI outperforms existing imputation methods in various missing patterns of different real-world spatiotemporal data, and effectively handles scenarios such as high missing rates and sensor failure. The implementation code is available at https://github.com/LMZZML/PriSTI.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Mingzhe Liu, Han Huang, Hao Feng, Leilei Sun, Bowen Du, Yanjie Fu
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
未在提供的论文内容中给出具体会议或期刊名称。
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
核心问题是将conditional diffusion model用于spatiotemporal imputation时如何“构造并利用”条件信息使模型既能避免RNN自回归插补的error accumulation又能有效学习复杂的时空依赖temporal dependencies、spatial global correlations、geographic relationships A。论文指出CSDI类方法常把observed values与noisy targets直接拼接输入仅靠mask区分导致序列趋势在高噪声步t≈T时不一致、增加学习难度且未充分利用地理邻接与跨节点相关。PriSTI通过“先插值增强条件信息→提取全局先验H_pri→用H_pri主导注意力权重、减少噪声对依赖学习的干扰”来解决。简洁概述解决扩散式时空插补中“条件信息噪声大、时空依赖难学、地理关系难融入”的关键瓶颈。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
(1) PriSTI条件扩散插补框架把缺失部分作为生成目标反向扩散pθ(Ẋ_{t-1}|Ẋ_t,X,A,t)从噪声逐步生成缺失值。(2) 线性插值增强条件信息Interpolate对每个节点时间序列做linear interpolation得到更完整、趋势一致的条件输入X。(3) Conditional Feature Extraction Moduleγ在仅含插值信息H=Conv(X)上用temporal attention + spatial attention + MPNN(A)提取global context prior H_pri。(4) Noise Estimation Module对混合输入H_in=Conv(X||Ẋ_t)学习时空依赖但用H_pri计算注意力权重Q,K来自H_priV来自H_in或H_tem以减轻噪声干扰并结合MPNN与空间注意力。(5) 空间注意力降复杂度将N个节点映射到k个virtual nodesk<N把空间注意力复杂度从O(N^2 d)降到O(N k d)。(6) 掩码训练策略point / block / hybrid含历史缺失模式在训练中随机“擦除观测”构造自监督插补目标。
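A minimal sketch of the prior-guided attention idea summarized above, assuming toy dimensions and random weights (not the paper's code): Q and K are computed from the conditional prior H_pri, V from the noisy mixed input, so the attention weights come from noise-free features.

```python
# Toy prior-guided attention: weights from the prior, values from noisy input.
import numpy as np

rng = np.random.default_rng(0)
L, d = 6, 4                                   # sequence length, feature dim
H_pri = rng.normal(size=(L, d))               # global context prior (from interpolated condition)
H_in  = rng.normal(size=(L, d))               # noisy mixed input Conv(X || X_t)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = H_pri @ Wq, H_pri @ Wk, H_in @ Wv   # Q, K from prior; V from noisy input
scores = Q @ K.T / np.sqrt(d)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)      # row-wise softmax
out = attn @ V
print(out.shape)                              # (L, d)
```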
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
(1) AQI-36air quality北京36站PM2.5小时数据12个月来源Yi et al., “ST-MVL” (IJCAI 2016)[2]。(2) METR-LAtraffic speed洛杉矶高速207传感器、5分钟采样来源Li et al., DCRNN (ICLR 2018)[3]并沿用其数据设置。(3) PEMS-BAYtraffic speed湾区325传感器、5分钟采样同来源[3]三者的geographic adjacency A按站点/传感器距离用thresholded Gaussian kernel构建。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
(1) MAE确定性插补误差的平均绝对偏差衡量点估计准确性越小越好。(2) MSE确定性插补误差的均方误差对大误差更敏感越小越好。(3) CRPS对概率插补分布D与真实值x的兼容性打分量化不确定性质量越小越好论文用100次采样近似分布并以0.05分位离散近似积分。(4) RMSE下游预测表V作为MSE的平方根用于评估imputation后对Graph WaveNet预测性能的影响越小越好
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
总体插补表III/IVPriSTI在所有数据集与缺失模式下优于CSDI与GRIN等。示例AQI-36 simulated failure(24.6%)上PriSTI MAE=9.03±0.07、MSE=310.39±7.03优于CSDI MAE=9.51±0.10、MSE=352.46±7.50对应CRPS PriSTI=0.0997优于CSDI=0.1056。METR-LA block-missing(16.6%)上PriSTI MAE=1.86±0.00、MSE=10.70±0.02优于CSDI 1.98±0.00、12.62±0.60CRPS PriSTI=0.0244优于CSDI=0.0260。PEMS-BAY block-missing(9.2%)上PriSTI MAE=0.78±0.00、MSE=3.31±0.01优于CSDI 0.86±0.00、4.39±0.02CRPS PriSTI=0.0093优于CSDI=0.0127。高缺失率图5在METR-LA缺失率90%时PriSTI相对其他方法MAE提升4.67%34.11%(block)与3.89%43.99%(point)。传感器完全失效RQ5AQI-36上两站点全程无观测时PriSTI MAE分别为10.23与15.20且优于GRIN。下游预测表V对AQI-36插补后用Graph WaveNet预测PriSTI使MAE从Ori. 36.97降至29.34RMSE从60.37降至45.08优于CSDI/GRIN/BRITS。结论PriSTI通过“条件先验+地理关系”在确定性与概率插补、极端稀疏与失效场景及下游任务上均更稳健。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机:(1) 时空数据缺失普遍且影响监测/交通等应用;(2) RNN自回归插补存在error accumulation且难输出不确定性(3) 扩散模型虽能做条件生成但在时空插补中“条件信息如何构造、如何避免噪声破坏依赖学习、如何融入地理关系A”是关键难题。
贡献点:(1) 提出PriSTI面向spatiotemporal imputation的conditional diffusion framework引入spatiotemporal global correlations与geographic relationships。(2) 提出Conditional Feature Extraction在插值后的条件信息上提取H_pri作为global context prior降低在高噪声下学习ST依赖的难度。(3) 提出Noise Estimation Module的“prior-guided attention”用H_pri产生注意力权重Q,K来自H_pri来引导对含噪输入的去噪从而缓解“观测+噪声拼接”带来的不一致与干扰。(4) 提出virtual nodes的空间注意力降复杂度方案O(N^2 d)→O(N k d)。
主要创新之处:把“先验条件特征(由插值与时空/地理编码提取)”显式用于扩散去噪阶段的注意力权重计算,实现更强的时空依赖建模与更鲁棒的扩散式插补。

View File

@@ -0,0 +1,9 @@
@misc{song2021scorebasedgenerativemodelingstochastic,
title={Score-Based Generative Modeling through Stochastic Differential Equations},
author={Yang Song and Jascha Sohl-Dickstein and Diederik P. Kingma and Abhishek Kumar and Stefano Ermon and Ben Poole},
year={2021},
eprint={2011.13456},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2011.13456},
}

View File

@@ -0,0 +1,45 @@
# Score-Based Generative Modeling through Stochastic Differential Equations
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文研究score-based generative models中“由噪声生成数据”的统一连续时间视角将数据分布通过forward SDE逐步加噪映射到已知prior并由只依赖于time-dependent score ∇x log p_t(x) 的reverse-time SDE反向去噪生成样本。研究目的在于用SDE框架统一SMLD与DDPM视为VE/VP SDE的离散化并获得更灵活的采样、似然计算与可控生成能力。方法上训练time-dependent score network sθ(x,t) 以连续版denoising score matching估计score采样用通用SDE solver并提出Predictor-Corrector(PC)将数值SDE预测与Langevin/HMC校正结合进一步推导probability flow ODEneural ODE可在相同边缘分布下进行确定性采样并用instantaneous change of variables实现exact likelihood。实验在CIFAR-10上达成IS=9.89、FID=2.20与2.99 bits/dim并首次展示1024×1024 CelebA-HQ高保真生成同时在class-conditional、inpainting与colorization等逆问题上验证可控生成。结论是SDE统一框架带来新采样器、精确似然与更强条件生成能力。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the reverse-time SDE depends only on the time-dependent gradient field (a.k.a., score) of the perturbed data distribution. By leveraging advances in score-based generative modeling, we can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. We show that this framework encapsulates previous approaches in score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities. In particular, we introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE. We also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency. In addition, we provide a new way to solve inverse problems with score-based models, as demonstrated with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, we achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 2.99 bits/dim, and demonstrate high fidelity generation of 1024×1024 images for the first time from a score-based generative model.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, Ben Poole
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
ICLR 2021
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
论文要解决的核心问题是如何把“逐步加噪—逐步去噪”的score-based/diffusion生成模型提升为一个统一、连续时间、可分析且可扩展的生成框架使得(1)不同方法SMLD、DDPM在同一理论下刻画(2)采样不再局限于特定离散更新规则而能用通用数值求解器并获得更好效率/质量,(3)在同一模型下实现exact likelihood计算与latent可逆映射(4)在无需重新训练条件模型的前提下处理class-conditional、inpainting、colorization等inverse problems。其关键技术瓶颈在于reverse dynamics只需要score ∇x log p_t(x)但该score必须对连续t准确估计并且离散化/数值误差会累积影响采样质量。简洁概述用SDE把score/diffusion统一起来并解决“怎么更好采样、怎么算精确似然、怎么做可控/逆问题生成”。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
(1) SDE统一框架用forward SDE dx=f(x,t)dt+g(t)dw把数据分布扩散到prior再用reverse-time SDE dx=[f−g²∇x log p_t(x)]dt+g dŵ从prior生成数据。(2) 连续时间score学习训练time-dependent score network sθ(x,t) 通过连续版denoising score matching目标(式(7))逼近∇x log p_t(x)。(3) VE/VP/sub-VP SDE把SMLD对应为Variance Exploding SDE把DDPM对应为Variance Preserving SDE并提出sub-VP SDE方差被VP上界约束以提升likelihood表现。(4) Reverse diffusion sampler按“与forward同型”的离散化直接构造reverse-time SDE的数值采样器避免为新SDE推导复杂ancestral规则。(5) Predictor-Corrector(PC) samplerpredictor用数值SDE solver推进一步corrector用score-based MCMC如Langevin/HMC在每个时间点校正边缘分布降低离散误差。(6) Probability flow ODE推导与SDE共享同一组边缘分布p_t的确定性ODE允许黑盒ODE solver自适应步长快速采样并支持latent可逆映射。(7) Exact likelihood对probability flow ODE使用instantaneous change of variables计算log p0(x)并用Skilling-Hutchinson trace estimator高效估计散度。(8) Controllable generation / inverse problems通过conditional reverse-time SDE在unconditional score基础上加入∇x log p_t(y|x)实现类别条件、inpainting、colorization等。
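For reference, a compact restatement of the three core equations summarized above, in the standard notation (this is the usual textbook form, not copied from the paper's layout):

```latex
% Forward SDE, reverse-time SDE, and probability-flow ODE (standard forms).
\begin{aligned}
\mathrm{d}\mathbf{x} &= \mathbf{f}(\mathbf{x},t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}
  &&\text{(forward SDE)}\\
\mathrm{d}\mathbf{x} &= \bigl[\mathbf{f}(\mathbf{x},t) - g(t)^{2}\nabla_{\mathbf{x}}\log p_t(\mathbf{x})\bigr]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}
  &&\text{(reverse-time SDE)}\\
\frac{\mathrm{d}\mathbf{x}}{\mathrm{d}t} &= \mathbf{f}(\mathbf{x},t) - \tfrac{1}{2}\,g(t)^{2}\nabla_{\mathbf{x}}\log p_t(\mathbf{x})
  &&\text{(probability-flow ODE)}
\end{aligned}
```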
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
(1) CIFAR-10Krizhevsky et al., 2009用于无条件生成、FID/IS、bits/dim与采样器对比。(2) LSUNbedroom、church outdoorYu et al., 2015用于PC采样对比、inpainting与colorization示例文中展示256×256。(3) CelebA64×64Liu et al., 2015用于架构探索中的VE设置对比。(4) CelebA-HQ1024×1024Karras et al., 2018用于首次展示score-based模型的1024×1024高分辨率生成
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
(1) FID衡量生成分布与真实分布在特征空间的距离综合质量与多样性越低越好用于CIFAR-10/LSUN等样本质量比较。(2) Inception Score (IS)衡量样本可辨识度与多样性越高越好用于CIFAR-10无条件生成。(3) NLL / bits/dim负对数似然的bit-per-dimension度量越低越好论文通过probability flow ODE给出“exact likelihood”用于CIFAR-10密度评估。(4) 采样计算量指标score function evaluations / NFE函数评估次数与solver步数如P1000/P2000/PC1000用于衡量采样效率与质量权衡。(5)任务性展示class-conditional/inpainting/colorization主要以可视化结果展示效果图4等不使用单一数值指标汇总。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
无条件CIFAR-10样本质量最佳NCSN++ cont. (deep, VE)达到IS=9.89、FID=2.20摘要与表3“Sample quality”。采样器对比表1在CIFAR-10上PC采样通常优于仅predictor或仅corrector例如在VP SDE(DDPM)侧reverse diffusion的P1000 FID≈3.21±0.02加入corrector的PC1000可到≈3.18±0.01在VE侧reverse diffusion P1000≈4.79±0.07PC1000≈3.21±0.02显示PC显著改善离散误差带来的质量损失。似然表2通过probability flow ODE实现exact likelihoodDDPM++ cont. (deep, sub-VP)达到2.99 bits/dim同时FID≈2.92并且sub-VP在同架构下通常比VP获得更好的bits/dim如DDPM cont.: VP 3.21 vs sub-VP 3.05。高分辨率生成在CelebA-HQ上首次展示1024×1024高保真样本图12与文中描述。结论SDE框架在采样PC/ODE、似然exact bits/dim与能力高分辨率/逆问题上同时带来提升并统一解释SMLD与DDPM。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机现有SMLD与DDPM都依赖“多噪声尺度的逐步扰动/去扰动”,但彼此形式分裂、采样规则受限、离散误差影响大,且难以在同一框架下获得精确似然、灵活采样与统一的条件/逆问题生成。
贡献点与创新:(1) 提出score-based生成的SDE统一框架forward SDE定义连续噪声扩散reverse-time SDE仅依赖score实现生成并把SMLD/DDPM解释为VE/VP SDE离散化。(2) 提出PC采样框架把数值SDE求解predictor与score-based MCMC校正corrector组合系统性提升采样质量。(3) 推导probability flow ODE与SDE共享边缘分布的确定性过程支持黑盒ODE自适应采样、latent可逆操控与“exact likelihood computation”。(4) 提出sub-VP SDE并在likelihood上取得2.99 bits/dim记录均匀dequantized CIFAR-10。(5) 提供无需重训的controllable generation/inverse problems方案类条件、inpainting、colorization并结合架构改进实现CIFAR-10记录级FID/IS与首次1024×1024 CelebA-HQ生成。

View File

@@ -0,0 +1,9 @@
@misc{li2022diffusionlmimprovescontrollabletext,
title={Diffusion-LM Improves Controllable Text Generation},
author={Xiang Lisa Li and John Thickstun and Ishaan Gulrajani and Percy Liang and Tatsunori B. Hashimoto},
year={2022},
eprint={2205.14217},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2205.14217},
}

View File

@@ -0,0 +1,63 @@
# **Diffusion-LM Improves Controllable Text Generation**
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文关注“在不重新训练语言模型的前提下实现可控文本生成”这一开放问题现有plug-and-play方法基于autoregressive LM较难实现复杂、细粒度控制如syntax tree/ spans且难以组合多个控制。研究目的在于构建一种更易被梯度引导的生成模型以支持结构化与全局约束控制。方法上提出Diffusion-LM基于continuous diffusion的非自回归LM从高斯噪声序列迭代去噪到word vectors产生层级连续潜变量x_T…x_0为离散文本引入embedding与rounding并提出端到端训练目标、x_0-parameterization与clamping trick以减少rounding error控制时在每个扩散步对连续潜变量做多步梯度更新优化λ·log p(x_{t-1}|x_t)+log p(c|x_{t-1})以平衡fluency与control。实验在E2E与ROCStories上覆盖6类控制任务Diffusion-LM在多项细粒度控制上显著优于PPLM/FUDGE并在syntax tree与spans控制上超过fine-tuning oracle在infilling上优于COLD/DELOREAN且接近专训AR模型。结论是连续扩散的层级潜变量为复杂可控生成提供了更强的可操控性但代价是解码更慢、困惑度更高。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
Controlling the behavior of language models (LMs) without re-training is a major open problem in natural language generation. While recent works have demonstrated successes on controlling simple sentence attributes (e.g., sentiment), there has been little progress on complex, fine-grained controls (e.g., syntactic structure). To address this challenge, we develop a new non-autoregressive language model based on continuous diffusions that we call Diffusion-LM. Building upon the recent successes of diffusion models in continuous domains, Diffusion-LM iteratively denoises a sequence of Gaussian vectors into word vectors, yielding a sequence of intermediate latent variables. The continuous, hierarchical nature of these intermediate variables enables a simple gradient-based algorithm to perform complex, controllable generation tasks. We demonstrate successful control of Diffusion-LM for six challenging fine-grained control tasks, significantly outperforming prior work.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, Tatsunori B. Hashimoto
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
Preprint. Under review.
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
论文核心要解决的是如何在“冻结语言模型参数without re-training”的plug-and-play设定下实现对生成文本的复杂、细粒度、全局结构化控制并且能够组合多个控制信号。作者指出基于autoregressive LM的控制如PPLM、FUDGE、GeDi/DExperts类思想存在结构性限制自回归的左到右生成一旦早期出错难以修复且很多控制parse tree、span类别、长度规划、infilling需要对全句全局属性或右侧上下文进行前瞻规划前缀判别器/逐token引导在结构约束下会出现“前缀一错、后续全盘崩”的问题。Diffusion-LM通过连续扩散产生一系列从粗到细的潜变量x_T…x_0使控制信号可以在不同噪声层级上对“整句”施加可微约束并允许在每个扩散步对潜变量做梯度更新从而更容易满足复杂控制同时保持流畅性。简洁概述把可控生成从“逐token操控离散输出”转为“在连续扩散潜空间上做全局梯度引导的去噪采样”以支持细粒度句法/语义/跨度/长度/填空等控制与组合。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
1) Diffusion-LMcontinuous diffusion language model将文本序列表示为连续word vectors序列从高斯噪声x_T迭代去噪生成x_0再rounding到离散词序列w实现非自回归生成
2) End-to-end embedding learning§4.1在前向过程中加入q_φ(x_0|w)=N(EMB(w),σ_0I)并在反向加入p_θ(w|x_0)的rounding项联合学习扩散模型与词向量嵌入L_e2e_simple / L_e2e_vlb
3) x_0-parameterization objective§4.2改用网络直接预测x_0f_θ(x_t,t)≈x_0让每个扩散步都显式约束x_0提高“落到词嵌入点上”的能力、减少rounding error。
4) Clamping trick§4.2在采样时把预测的x_0或中间估计Clamp到最近词嵌入再继续去噪强制中间步骤更“词化”进一步降低rounding误差。
5) Gradient-based classifier-guided control on latents§5.1在每个扩散步对x_{t-1}做多步梯度更新优化λ·log p(x_{t-1}|x_t)+log p(c|x_{t-1})用可微classifier提供控制梯度、扩散模型提供流畅性正则。
6) Multiple controls composition§7.2对多个控制项直接求和其log-prob梯度在同一潜空间联合满足多个约束。
7) Minimum Bayes Risk (MBR) decoding§5.2):对条件生成/填空等场景从多次采样集合S中选取最小期望风险的输出如基于负BLEU的风险提升单样本质量。
8) sqrt noise scheduleAppendix A提出更适合文本的噪声日程使低噪声阶段不至于过“容易”提升稳健性。
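A minimal sketch of the clamping step described in the list above, with a toy vocabulary and a random embedding table assumed (not the paper's code): each predicted x0 vector is snapped to its nearest word embedding before continuing the reverse diffusion.

```python
# Toy clamping trick: project a predicted x0 onto the nearest word embeddings.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 10, 4
E = rng.normal(size=(vocab_size, d))          # toy embedding table EMB(w)

def clamp(x0_pred):
    """Replace each position's vector with the closest row of E."""
    dists = ((x0_pred[:, None, :] - E[None, :, :]) ** 2).sum(-1)   # (seq, vocab)
    nearest = dists.argmin(axis=1)
    return E[nearest], nearest

x0_pred = rng.normal(size=(5, d))             # stands in for f_theta(x_t, t)
x0_clamped, word_ids = clamp(x0_pred)
print(word_ids)                               # indices of the snapped words
```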
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
训练数据集E2EE2E NLG datasetNovikova et al. 2017[28]ROCStoriesMostafazadeh et al. 2016[26])。控制/评测相关数据Infilling任务的左右上下文来自aNLGAbductive Natural Language GenerationBhagavatula et al. 2020[2]文中写“from the aNLG dataset”。文中还提到用于数据增强的ROCStories (+GPT-J)为“微调GPT-J后生成的合成ROCStories样本”用于额外对比似然/训练规模实验)。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
1) ctrlcontrol success成功率/得分越高越好不同控制任务有不同定义——语义内容用value exact match成功率POS用词级别的POS序列exact matchSyntax Tree用外部parser解析后与目标树的F1Syntax Spans用目标span类别匹配比例Length用生成长度在目标±2内的比例
2) lm-score记作lm越低越好把生成文本喂给teacher LM文中为fine-tuned GPT-2计算perplexity用于衡量流畅性/样本质量。
3) NLL / log-likelihood boundnats per token越低越好报告Diffusion-LM的变分界L_e2e_vlb对应的NLL上界/下界比较用于衡量语言建模拟合度文中指出Diffusion-LM似然弱于同规模AR Transformer
4) Infilling自动指标BLEU-4、ROUGE-L、CIDEr、BERTScore越高越好衡量生成句与参考的n-gram重叠、序列相似与语义相似。
5) Infilling人工评估Genie leaderboard的人类评分表中以均值及置信区间/误差形式呈现),衡量整体可读性与合理连接左右上下文的质量。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
五个classifier-guided控制任务Table 2ctrl↑/lm↓
(1) Semantic ContentDiffusion-LM 81.2 / 2.55优于PPLM 9.9 / 5.32与FUDGE 69.9 / 2.83低于FT-search 89.9 / 1.78但接近或高于FT-sample 72.5 / 2.87
(2) Parts-of-speechDiffusion-LM 90.0 / 5.16优于FUDGE 27.0 / 7.96略低于FT-search 93.0 / 3.31、接近FT-sample 89.5 / 4.72
(3) Syntax TreeDiffusion-LM 86.0 / 3.71显著优于FUDGE 17.9 / 3.39并超过FT-search 76.4 / 3.24与FT-sample 64.8 / 5.72
(4) Syntax SpansDiffusion-LM 93.8 / 2.53优于FUDGE 54.2 / 4.03并超过FT-search 54.4 / 2.19与FT-sample 26.3 / 2.88
(5) LengthDiffusion-LM 99.9 / 2.16优于FUDGE 46.9 / 3.11接近FT-search 100.0 / 1.83且优于FT-sample 98.1 / 3.84。
组合控制Table 4Semantic+Syntax Tree时Diffusion-LM语义/句法成功率69.8/74.8lm 5.92显著高于FUDGE 61.7/15.4Semantic+POS时Diffusion-LM语义/ POS成功率63.7/69.1lm 3.46优于FUDGE 64.5/24.1。InfillingTable 5Diffusion-LM BLEU-4 7.1、ROUGE-L 28.3、CIDEr 30.7、BERTScore 89.0、人评0.37+0.03/-0.02显著高于DELOREAN与COLDBLEU-4约1.61.8并与专训AR-infilling接近AR BLEU-4 6.7、ROUGE-L 27.0、CIDEr 26.9、BERTScore 89.0、人评0.39。同时语言建模似然上Diffusion-LM弱于AR TransformerE2E NLL 2.28 vs 1.77ROCStories 3.88 vs 3.05§7表明“更差似然但更强可控性”。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机1) 真实应用需要可控生成但为每个控制任务fine-tune成本高且难组合多个控制2) 现有plug-and-play控制主要基于autoregressive LM易累积错误、难做全局结构控制与右侧约束如parse tree、span、长度规划、infilling3) diffusion在连续域成功但离散文本缺少有效的continuous diffusion建模与可控机制。
贡献点1) 提出Diffusion-LM首批系统探索“continuous diffusion用于文本”的非自回归语言模型通过逐步去噪生成word vectors并得到层级连续潜变量。2) 为离散文本适配扩散提出端到端学习embedding与rounding的训练目标L_e2e_*并用x_0-parameterization与clamping trick显著减少rounding errors、提升样本质量。3) 提出面向控制的潜空间梯度引导算法在每个扩散步对连续潜变量做多步梯度更新并加入fluency regularization λ·log p(x_{t-1}|x_t)实现复杂结构控制与控制组合。4) 系统实验验证在6类细粒度控制任务上显著超越PPLM/FUDGE并在句法树/跨度控制上超过fine-tuning oracle在infilling上优于COLD/DELOREAN且接近专训AR模型。
主要创新之处把“可控生成的优化变量”从离散token/AR隐藏状态迁移到“扩散产生的层级连续潜变量x_0:T”使控制信号可以在粗到细的全句表示上施加可微约束从而天然支持全局结构控制与组合同时通过x_0参数化与clamping把连续扩散与离散词空间可靠对齐。

View File

@@ -0,0 +1,17 @@
Discrete / mixed-variable diffusion (for the discrete Modbus fields such as function codes and addresses)
Ours is "semantic-level generation", covering at least:
Discrete: Function Code, optional exception code, register address / address cluster, read/write length
Continuous / integer: register values, inter-arrival times
This kind of "mixed-type generation" usually needs to cite discrete diffusion or an embedding trick.
Austin et al. Structured Denoising Diffusion Models in Discrete State-Spaces (D3PM). NeurIPS 2021.
Use: diffusion in discrete state spaces; function codes / exception codes / address clusters (as tokens) can be generated directly with D3PM-style diffusion.
Li et al. Diffusion-LM Improves Controllable Text Generation. NeurIPS 2022.
Use: text is discrete-sequence generation; the "PDU field sequence / behaviour sequence" can be treated as a sentence, and its discussion of controllable generation supports the need for protocol/scenario constraints.
Common engineering practice (see the sketch after this list):
Discrete fields: D3PM / multinomial diffusion, or embed into a continuous space first and then run Gaussian diffusion;
Continuous fields: standard DDPM;
Finally use a deterministic assembler (pymodbus/scapy) to build the MBAP/PDU, so protocol validity stays close to 100%.
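A minimal sketch of that deterministic-assembler step, assuming the field values (function code, address, quantity, unit id) have already been sampled by the generative model; it packs a standard Modbus/TCP MBAP header plus an FC3 read-request PDU with `struct` rather than pymodbus/scapy.

```python
# Toy deterministic assembler: sampled semantic fields -> valid Modbus/TCP frame.
import struct

def assemble_read_request(tx_id: int, unit_id: int, func_code: int,
                          start_addr: int, quantity: int) -> bytes:
    pdu = struct.pack(">BHH", func_code, start_addr, quantity)    # FC, starting address, count
    mbap = struct.pack(">HHHB", tx_id, 0, len(pdu) + 1, unit_id)  # tx id, protocol=0, length, unit id
    return mbap + pdu

# Example with hypothetical sampled field values:
frame = assemble_read_request(tx_id=1, unit_id=17, func_code=3,
                              start_addr=0x0010, quantity=8)
print(frame.hex())
```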

View File

@@ -0,0 +1,9 @@
@misc{austin2023structureddenoisingdiffusionmodels,
title={Structured Denoising Diffusion Models in Discrete State-Spaces},
author={Jacob Austin and Daniel D. Johnson and Jonathan Ho and Daniel Tarlow and Rianne van den Berg},
year={2023},
eprint={2107.03006},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2107.03006},
}

View File

@@ -0,0 +1,61 @@
# Structured Denoising Diffusion Models in Discrete State-Spaces
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
Denoising diffusion probabilistic modelsDDPMs在连续空间表现突出但离散数据上的扩散模型仍难以在大规模文本与图像生成上达到竞争力。论文提出Discrete Denoising Diffusion Probabilistic ModelsD3PMs目标是通过“结构化的离散前向腐蚀过程”提升离散扩散的建模效果并分析transition matrix与noise schedule对性能的关键影响。方法上作者将离散扩散统一为基于Markov transition matrices Q_t 的类别扩散框架提出多种结构化Q_tdiscretized Gaussian模拟连续高斯核、偏向相似状态、embedding距离/nearest-neighbor在嵌入空间定义局部性、以及absorbing state[MASK]吸收态并连接BERT/AR/MLM。同时提出x0-parameterization的反向过程与新损失L_λ = L_vb + λ·CE(x0|x_t)并给出基于mutual information的噪声日程。实验显示文本上D3PM absorbing在text8取得≤1.45 bits/char1000步且在LM1B大词表上困惑度76.91000步图像上CIFAR-10的D3PM Gauss+logistic达到IS 8.56、FID 7.34、NLL≤3.435 bits/dim接近/超过连续DDPM的对数似然并改善样本质量。结论是离散扩散中Q_t设计、辅助损失与MI日程显著提升性能并揭示其与BERT/自回归/MLM的统一视角。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
Denoising diffusion probabilistic models (DDPMs) [19] have shown impressive results on image and waveform generation in continuous state spaces. Here, we introduce Discrete Denoising Diffusion Probabilistic Models (D3PMs), diffusion-like generative models for discrete data that generalize the multinomial diffusion model of Hoogeboom et al. [20], by going beyond corruption processes with uniform transition probabilities. This includes corruption with transition matrices that mimic Gaussian kernels in continuous space, matrices based on nearest neighbors in embedding space, and matrices that introduce absorbing states. The third allows us to draw a connection between diffusion models and autoregressive and mask-based generative models. We show that the choice of transition matrix is an important design decision that leads to improved results in image and text domains. We also introduce a new loss function that combines the variational lower bound with an auxiliary cross entropy loss. For text, this model class achieves strong results on character-level text generation while scaling to large vocabularies on LM1B. On the image dataset CIFAR-10, our models approach the sample quality and exceed the log-likelihood of the continuous-space DDPM model.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, Rianne van den Berg
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
35th Conference on Neural Information Processing Systems (NeurIPS 2021).
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
论文核心问题是如何把扩散模型从连续高斯噪声的DDPM有效推广到“离散状态空间”的文本与量化图像并让离散扩散在样本质量、对数似然、训练稳定性与可扩展性大词表、大序列上达到可竞争水平。此前离散扩散如multinomial diffusion多使用“uniform transition probabilities”的腐蚀过程导致前向噪声缺少结构对图像这种有序/局部性的离散值量化像素无法“优先扩散到相近值”对文本这种离散符号也难注入相似性或mask结构从而使反向去噪学习更难、效果受限。作者提出应把“前向Markov转移矩阵Q_t”视为离散扩散的关键设计自由度通过结构化Q_t如离散高斯核、嵌入邻近、吸收态mask把领域先验嵌入腐蚀过程进而塑造更可学习的反向去噪同时提出辅助交叉熵损失与互信息噪声日程缓解训练不稳定与日程选择困难。简洁概述用结构化的离散腐蚀转移矩阵+更稳的训练目标/噪声日程使离散扩散在文本与图像上显著变强并统一连接到BERT/MLM与自回归模型。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
1) D3PMDiscrete Denoising Diffusion Probabilistic Models通用框架用类别变量的Markov前向过程q(x_t|x_{t-1})=Cat(x_{t-1}Q_t)与学习的反向过程p_θ(x_{t-1}|x_t)做离散扩散生成
2) Structured transition matrices Q_t将“腐蚀结构”编码进Q_t而非仅uniform。
3) D3PM-uniformQ_t=(1−β_t)I+β_t/K·11^T向任意类别均匀扩散Hoogeboom等的multinomial diffusion推广/实现)。
4) D3PM-absorbingmask diffusion引入吸收态[MASK]每步以β_t把token/像素变为[MASK]/灰像素建立与BERT/MLM/AR的联系。
5) Discretized Gaussian Q_tD3PM-Gauss对有序离散值量化像素用离散截断高斯核定义转移使更相近的值更易互转模仿连续高斯扩散的“局部性”。
6) Token embedding distance / Nearest-neighbor diffusionD3PM-NN在预训练embedding空间构kNN图构造对称rate matrix R并令Q_t=exp(α_t R),使腐蚀沿语义邻近扩散。
7) Mutual-information-based noise schedule选择α_t/β_t使I(x_t;x_0)按t/T线性衰减到0作为“结构化Q_t下”的统一噪声标尺在absorbing情形退化为β_t=(T−t+1)^{-1}。
8) x0-parameterization反向模型网络预测~p_θ(~x_0|x_t)再与q(x_{t-1},x_t|~x_0)组合得到p_θ(x_{t-1}|x_t)自动继承Q_t的稀疏模式并支持跳步推断。
9) Truncated discretized logistic parameterization图像对有序离散像素用截断离散logistic分布参数化~p_θ(~x_0|x_t)以增强ordinal inductive bias。
10) 新训练损失L_λ在ELBO的L_vb上叠加辅助交叉熵E[-log ~p_θ(x_0|x_t)],提升训练稳定性与样本质量(尤其图像)。
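A minimal numpy sketch of the absorbing-state corruption summarized above, with a toy vocabulary assumed: build Q_t, then draw one forward step q(x_t | x_{t-1}) = Cat(x_{t-1} Q_t).

```python
# Toy absorbing-state ([MASK]) transition matrix and one corruption step.
import numpy as np

rng = np.random.default_rng(0)
K = 5                                  # vocabulary size including [MASK]
MASK = K - 1                           # index of the absorbing state
beta_t = 0.1

Q_t = (1.0 - beta_t) * np.eye(K)
Q_t[:, MASK] += beta_t                 # every state may jump to [MASK]
Q_t[MASK] = 0.0
Q_t[MASK, MASK] = 1.0                  # [MASK] is absorbing (rows sum to 1)

x_prev = np.array([0, 2, 3, MASK])     # tokens at step t-1
probs = Q_t[x_prev]                    # categorical parameters per position
x_t = np.array([rng.choice(K, p=p) for p in probs])
print(x_t)
```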
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
文本text8Matt Mahoney text8 dataset[28]字符级27 token长度256切块评测LM1B / One Billion Word BenchmarkChelba et al. 2013[6]使用sentencepiece词表大小8192、序列长度128打包训练。图像CIFAR-10Krizhevsky et al. 2009[27])。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
1) NLLnegative log-likelihood文本用bits/chartext8或由困惑度等价衡量图像用bits per dimensionbpd。作用衡量生成模型对数据分布的拟合/压缩能力,越低越好。
2) PerplexityLM1B语言建模常用指标等价于指数化的平均负对数似然越低越好。
3) ISInception ScoreCIFAR-10衡量样本“可辨识度+多样性”的启发式指标,越高越好。
4) FIDFréchet Inception DistanceCIFAR-10比较生成样本与真实样本在Inception特征空间的分布距离越低越好更常用作样本质量指标。
5) Sample time采样时间衡量生成速度与inference steps相关越低越好。
6) Model steps / inference steps推断步数不是指标本身但作为效率维度报告用于展示质量-速度折中如1000/256/20步
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
text8Table 1bits/char↓采样时间↓1000步下D3PM absorbingmaskL_{λ=0.01}达到≤1.45±0.023.4±0.3s优于D3PM uniform≤1.61±0.02与D3PM NN≤1.59±0.03在256步下D3PM absorbing仍为≤1.47±0.030.598±0.002s在20步下D3PM absorbing为≤1.56±0.040.0785±0.0003s)。
总体结论mask/absorbing扩散在字符级最强并能在少步推断下保持较好NLL但仍弱于强自回归Transformer-XL1.08等。LM1BTable 2Perplexity↓D3PM absorbing在1000/128/64步分别为76.9±2.3 / 80.1±1.2 / 83.6±6.1显著优于uniform的137.9±2.1 / 139.2±1.2 / 145.0±1.2NN更差149.5±1.3等。说明离散扩散可扩展到8192大词表且absorbing最有效embedding邻近并不一定带来收益。CIFAR-10Table 3最佳D3PM Gauss+logisticL_{λ=0.001}达到IS 8.56±0.10、FID 7.34±0.19、NLL≤3.435±0.007相比连续DDPMHo et al.的DDPM L_simpleIS 9.46±0.11、FID 3.17、NLL≤3.75D3PM在NLL上更优、样本质量接近但FID仍落后顶级连续扩散且引入L_λ能显著提升离散模型如D3PM Gauss从FID 15.30→8.34IS 7.75→8.54)。
综合结论transition matrix选择是关键设计点absorbing对文本最有效、discretized Gaussian+ordinal logistic对图像最有效辅助CE损失与合适噪声日程能明显改善训练与样本质量。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机1) 连续DDPM在图像/音频成功,但离散数据(文本、量化图像)上缺少同等级别的扩散建模框架与经验,尤其在大词表/长序列与高质量图像上2) 既有离散扩散多采用uniform腐蚀无法利用离散数据的结构图像的有序/局部性、文本的mask与潜在语义相似性3) 需要更稳定的训练目标与更可解释的噪声日程来驱动离散扩散。
贡献点1) 提出并系统化D3PM框架用一般Q_t矩阵描述离散扩散推广multinomial diffusion允许注入结构化腐蚀。2) 设计多类structured transition matricesdiscretized Gaussian、embedding kNN/矩阵指数、absorbing state等并实证“Q_t选择显著影响文本与图像效果”。3) 建立概念连接证明BERT可视为“一步扩散”构造视角说明自回归模型与生成式MLM可被表述为特定离散扩散过程。4) 提出训练改进辅助交叉熵的混合损失L_λ提升训练与样本质量提出基于互信息I(x_t;x_0)的噪声日程适配一般结构化Q_t。5) 实证结果文本上在text8强于多数非自回归基线并在LM1B证明可扩展图像上CIFAR-10达到接近连续DDPM的样本质量并在对数似然上超过连续DDPM基线。
主要创新之处把离散扩散的关键自由度从“固定uniform腐蚀”提升为“可设计的结构化转移矩阵Q_t”并配套可扩展计算低秩/矩阵指数、互信息日程与L_λ训练使离散扩散在文本与量化图像上同时获得理论统一连接BERT/AR/MLM与显著性能提升。

View File

@@ -0,0 +1,9 @@
@misc{veličković2018graphattentionnetworks,
title={Graph Attention Networks},
author={Petar Veličković and Guillem Cucurull and Arantxa Casanova and Adriana Romero and Pietro Liò and Yoshua Bengio},
year={2018},
eprint={1710.10903},
archivePrefix={arXiv},
primaryClass={stat.ML},
url={https://arxiv.org/abs/1710.10903},
}

View File

@@ -0,0 +1,62 @@
# Graph Attention Networks
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
本文提出Graph Attention NetworksGATs针对谱域图卷积方法的计算昂贵、依赖拉普拉斯特征基、难以泛化到不同图结构等问题以及空间方法在可变邻域与权重共享上的挑战。研究目的在于通过masked self-attentional层使节点对其邻域特征进行自注意从而隐式分配不同邻居的重要性且无需昂贵矩阵运算或预先知道全局图结构实现对transductive与inductive任务的统一处理。方法包括多头注意力、邻域softmax归一化系数、共享线性变换与注意力机制支持并行化复杂度与GCN相当。主要结果在Cora、Citeseer、Pubmed三大引文网络transductive与PPI蛋白互作数据inductiveGAT达到或匹配state-of-the-art在PPI上显著优于GraphSAGE并优于同架构的常数注意版本。结论GAT高效、可解释、可扩展到未见图解决谱方法局限展示了注意力在图结构数据上的潜力。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
We present graph attention networks (GATs), novel neural network architectures that operate on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. By stacking layers in which nodes are able to attend over their neighborhoods' features, we enable (implicitly) specifying different weights to different nodes in a neighborhood, without requiring any kind of costly matrix operation (such as inversion) or depending on knowing the graph structure upfront. In this way, we address several key challenges of spectral-based graph neural networks simultaneously, and make our model readily applicable to inductive as well as transductive problems. Our GAT models have achieved or matched state-of-the-art results across four established transductive and inductive graph benchmarks: the Cora, Citeseer and Pubmed citation network datasets, as well as a protein-protein interaction dataset (wherein test graphs remain unseen during training).
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, Yoshua Bengio
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
ICLR 2018
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
核心问题如何在图结构数据上高效、可泛化地进行节点表示学习与分类同时克服谱方法对拉普拉斯特征基与固定图结构的依赖、昂贵的矩阵运算以及空间方法在处理可变大小邻域与统一权重共享的困难。简述GAT通过masked self-attention让每个节点对其邻域特征分配不同权重避免昂贵谱分解支持并行化与inductive设置从而提升性能与可扩展性。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
1) 图注意力层GAT layer共享线性变换W后对邻域执行自注意力a(Wh_i, Wh_j)用softmax归一化得到α_ij加权聚合邻居特征并非线性输出。
2) 多头注意力并行K个独立注意头隐藏层拼接以增强稳定性与表达力输出层平均以做分类。
3) 掩蔽注意masked attention仅在一阶邻域内计算注意系数注入图结构同时保持操作可并行。
4) 常数注意对照Const-GAT使用a(x,y)=1的恒定权重以对比注意机制带来的增益。
5) 稀疏实现与并行化策略:采用稀疏矩阵操作降低存储与时间成本(实现层面说明)。
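A minimal single-head sketch of the GAT layer summarized in the list above, with a toy graph and random parameters assumed (explicit loops are kept for clarity rather than speed):

```python
# Toy single-head GAT layer: shared W, LeakyReLU attention scores on edges,
# softmax over each neighbourhood (masked attention), weighted aggregation.
import numpy as np

rng = np.random.default_rng(0)
N, F_in, F_out = 4, 3, 2
H = rng.normal(size=(N, F_in))          # node features
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]])            # adjacency with self-loops
W = rng.normal(size=(F_in, F_out))      # shared linear transformation
a = rng.normal(size=(2 * F_out,))       # attention vector

Wh = H @ W
scores = np.full((N, N), -np.inf)       # -inf outside the neighbourhood
for i in range(N):
    for j in range(N):
        if A[i, j]:
            e = float(np.concatenate([Wh[i], Wh[j]]) @ a)
            scores[i, j] = e if e > 0 else 0.2 * e     # LeakyReLU(0.2)

alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
alpha = alpha / alpha.sum(axis=1, keepdims=True)        # neighbourhood softmax
H_out = np.tanh(alpha @ Wh)                             # weighted aggregation + nonlinearity
print(H_out.shape)                                      # (N, F_out)
```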
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
- Coracitation network节点为文档边为引用来源Sen et al., 2008并按Yang et al., 2016的transductive设置
- Citeseercitation network来源Sen et al., 2008设置同上
- Pubmedcitation network来源Sen et al., 2008设置同上
- PPIProtein-Protein Interaction多个组织的图来源Zitnik & Leskovec, 2017使用Hamilton et al., 2017提供的预处理数据
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
- 分类准确率accuracy用于Cora、Citeseer、Pubmed的节点分类性能度量反映预测正确的比例。
- 微平均F1分数micro-averaged F1用于PPI多标签节点分类综合精确率与召回率并在样本层面微平均衡量整体多标签预测质量。
- 额外报告标准差:展示多次运行的稳定性与方差。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
- Transductive100次运行均值±标准差CoraGAT 83.0±0.7%优于GCN 81.5%与MoNet 81.7%CiteseerGAT 72.5±0.7%优于GCN 70.3%PubmedGAT 79.0±0.3%匹配GCN 79.0%与优于多数基线。
- Inductive10次运行PPIGAT 0.973±0.002 micro-F1显著优于GraphSAGE最优0.768与Const-GAT 0.934±0.006。 结论GAT在三个引文网络上达到或超越SOTA在PPI上大幅领先证明了对整个邻域进行注意加权以及自注意机制带来的显著增益与泛化能力。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
- 动机解决谱方法对图拉普拉斯特征基的依赖与计算代价空间方法在可变邻域与权重共享的局限构建能在未见图上进行inductive推理的高效模型。
- 贡献点:
1. 提出图注意力层GAT在邻域内进行masked self-attention隐式分配不同邻居权重
2. 设计多头注意力用于稳定训练与提升表达力,输出层平均以适配分类;
3. 提供与GCN同量级的时间复杂度与并行化实现适用于transductive与inductive任务
4. 在Cora、Citeseer、Pubmed与PPI上达到或刷新SOTA显著优于GraphSAGE与常数注意对照。
- 主要创新:
- 将自注意力机制引入图邻域聚合使用节点特征计算相似度并softmax归一化的掩蔽注意
- 多头图注意结构的层级堆叠与输出层平均策略;
- 不依赖全局图结构即可进行学习与推理支持完全未见测试图的inductive设置。

View File

@@ -0,0 +1,9 @@
@misc{hou2022graphmaeselfsupervisedmaskedgraph,
title={GraphMAE: Self-Supervised Masked Graph Autoencoders},
author={Zhenyu Hou and Xiao Liu and Yukuo Cen and Yuxiao Dong and Hongxia Yang and Chunjie Wang and Jie Tang},
year={2022},
eprint={2205.10803},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2205.10803},
}

View File

@@ -0,0 +1,58 @@
# **GraphMAE: Self-Supervised Masked Graph Autoencoders**
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文关注图自监督学习中“contrastive learning主导、generative SSL/graph autoencoders(GAEs)未发挥潜力”的现状指出GAEs受四类问题制约过度强调structure reconstruction、无corruption的feature reconstruction易学到trivial solution、MSE对特征范数/维度敏感导致训练不稳、MLP decoder表达力不足。研究目的在于系统缓解这些缺陷使GAE在node/graph classification与transfer learning上可与甚至超越SOTA对比学习方法。方法上提出GraphMAE采用masked feature reconstruction作为目标而非重建结构用[MASK]做输入特征遮蔽并在解码前re-mask[DMASK]引入更具表达力的GNN decoder并以scaled cosine error(SCE)替代MSE以提升稳定性与对hard samples的选择性。实验覆盖21个公开数据集、三类任务GraphMAE在节点分类上如Cora 84.2%、PubMed 81.1%、Reddit 96.0%图分类与分子迁移学习也整体优于或匹配SOTA基线。结论是经过关键设计的简单masked GAE可稳定释放generative SSL在图上的潜力。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
Self-supervised learning (SSL) has been extensively explored in recent years. Particularly, generative SSL has seen emerging success in natural language processing and other AI fields, such as the wide adoption of BERT and GPT. Despite this, contrastive learning—which heavily relies on structural data augmentation and complicated training strategies—has been the dominant approach in graph SSL, while the progress of generative SSL on graphs, especially graph autoencoders (GAEs), has thus far not reached the potential as promised in other fields. In this paper, we identify and examine the issues that negatively impact the development of GAEs, including their reconstruction objective, training robustness, and error metric. We present a masked graph autoencoder GraphMAE that mitigates these issues for generative self-supervised graph pre-training. Instead of reconstructing graph structures, we propose to focus on feature reconstruction with both a masking strategy and scaled cosine error that benefit the robust training of GraphMAE. We conduct extensive experiments on 21 public datasets for three different graph learning tasks. The results manifest that GraphMAE—a simple graph autoencoder with careful designs—can consistently generate outperformance over both contrastive and generative state-of-the-art baselines. This study provides an understanding of graph autoencoders and demonstrates the potential of generative self-supervised pre-training on graphs.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Zhenyu Hou, Xiao Liu, Yukuo Cen, Yuxiao Dong, Hongxia Yang, Chunjie Wang, Jie Tang
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
KDD 22
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
论文要解决的核心问题是为什么graph autoencodersGAEs在图自监督中长期弱于对比学习以及如何让“生成式自监督generative SSL”在图上达到与对比学习相当甚至更强的表征质量尤其面向node/graph classification与预训练迁移。作者总结了制约GAEs的关键瓶颈①目标偏差大量GAE以link/structure reconstruction为主导致对分类任务帮助有限②鲁棒性不足不做输入corruption的feature reconstruction容易学到identity function/trivial solution图特征维度通常较小使该问题更突出③误差度量不合适MSE对特征范数差异与维度灾难敏感、训练易不稳定甚至collapse且对易/难样本区分度低④decoder表达力弱常用MLP难以从低语义连续特征中有效“拉开”编码与重建目标导致潜变量退化为原始特征拷贝。简洁概述把GAE从“重建结构+MSE+弱解码”改造为“masked特征重建+稳定的SCE损失+GNN解码与re-mask正则”从根上提升生成式图预训练的可用性与效果。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
1) GraphMAEMasked Graph Autoencoder仅以feature reconstruction为自监督目标通过遮蔽部分节点特征来学习可迁移表示。
2) Masked feature reconstruction随机采样节点子集V~并用可学习向量[MASK]替换其输入特征迫使编码器利用邻域上下文恢复信息、避免trivial solution。
3) Re-mask decoding在解码前将被遮蔽节点的编码表示再次替换为[DMASK],让解码器必须依赖邻居的未遮蔽潜表示来重建被遮蔽节点特征。
4) GNN decoder用单层GNN如GAT/GIN/GCN替代传统MLP解码器提高解码表达力并引导编码器学习更高层潜表示。
5) Scaled Cosine ErrorSCE以(1-cosine(x,z))^γ作为重建损失,利用余弦误差消除范数/维度敏感性并用γ实现对hard samples的“focal”式重加权提升训练稳定性与选择性。
6) Random-substitution可选策略对被mask节点用随机替换而非“leave unchanged”减轻训练/推理的mask-token分布偏移论文指出leave-unchanged会伤害效果
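A minimal sketch of the scaled cosine error (SCE) summarized above, with toy arrays assumed: (1 − cos(x, z))^γ averaged over the masked nodes.

```python
# Toy scaled cosine error: norm-insensitive, gamma reweights hard samples.
import numpy as np

def sce_loss(x, z, gamma=2.0, eps=1e-8):
    """x: original features of masked nodes, z: reconstructed features."""
    x_n = x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)
    z_n = z / (np.linalg.norm(z, axis=-1, keepdims=True) + eps)
    cos = (x_n * z_n).sum(axis=-1)
    return ((1.0 - cos) ** gamma).mean()

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
z = x + 0.1 * rng.normal(size=(8, 16))
print(sce_loss(x, z))
```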
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
节点分类6个Cora, Citeseer, PubMed引文网络论文引用[48]等经典基准ogbn-arxivOpen Graph BenchmarkOGB论文引用[15]PPI, RedditGraphSAGE常用归纳基准论文引用其inductive setting来自GraphSAGE[10])。
图分类7个MUTAG, IMDB-B, IMDB-M, PROTEINS, COLLAB, REDDIT-B, NCI1论文引用[47]等图分类基准集合)。迁移学习/分子性质预测预训练数据为ZINC152 million无标签分子论文引用[33]下游为MoleculeNet论文引用[43]的8个分类数据集BBBP, Tox21, ToxCast, SIDER, ClinTox, MUV, HIV, BACE。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
节点分类AccuracyCora/Citeseer/PubMed/ogbn-arxiv/Reddit用于衡量单标签分类正确率Micro-F1PPI多标签用于在多标签场景下按全局TP/FP/FN统计综合衡量分类质量。
图分类Accuracy10-fold cross-validation accuracy报告均值与标准差用于衡量图级标签预测正确率并用10折交叉验证减少划分方差。
迁移学习分子性质预测ROC-AUC%)衡量二分类在不同阈值下的排序/区分能力,对类别不均衡更稳健。
论文还报告了均值±标准差、以及部分实验“20次线性分类器随机初始化平均”等作为统计稳定性呈现方式。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
节点分类Table 1GraphMAE在Cora 84.2±0.4Citeseer 73.4±0.4PubMed 81.1±0.4ogbn-arxiv 71.75±0.17PPI Micro-F1 74.50±0.29Reddit 96.01±0.08相较代表性对比方法CCA-SSG分别为Cora 84.0±0.4、PubMed 81.0±0.4、Reddit 95.07±0.02等GraphMAE整体为最佳或竞争性最强并显著超过生成式基线如GAECora 71.5±0.4与GPT-GNNCora 80.1±1.0)。
图分类Table 2accuracyGraphMAE在IMDB-B 75.52±0.66、IMDB-M 51.63±0.52、PROTEINS 75.30±0.39、COLLAB 80.32±0.46、MUTAG 88.19±1.26、REDDIT-B 88.01±0.19、NCI1 80.40±0.30在5/7数据集上优于所有自监督基线并在其余数据集上具竞争力如MUTAG略低于InfoGCL 91.20±1.30但与多基线接近)。
迁移学习Table 3ROC-AUC Avg.GraphMAE平均73.8优于GraphLoG 73.4、JOAO 71.9、GraphCL 70.8等在单项上如ClinTox 82.3±1.2、HIV 83.1±0.9等表现突出。
消融Table 4去掉maskCora 79.7、用MSE替代SCECora 79.1都会大幅降分去掉re-mask也会下降Cora 82.7)。
实验结论GraphMAE通过“mask + re-mask + GNN decoder + SCE”实现稳定训练使生成式图预训练在节点/图分类与分子迁移上系统性达到或超过SOTA对比学习。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机1) 图自监督长期依赖contrastive learning但其需要复杂训练策略动量编码器/EMA/stop-gradient、负样本与高质量图增强工程与理论负担重且跨图不稳2) GAEs天然更简单但在分类任务上落后作者希望找出其“没做对”的关键原因并让generative SSL在图上复兴。
贡献点1) 系统识别GAEs发展的关键障碍重建目标偏置结构优先、无corruption导致trivial solution、MSE引发不稳定与低选择性、decoder表达力不足。2) 提出GraphMAE框架以masked feature reconstruction为核心避免结构重建带来的任务偏置并提升分类导向表征。3) 提出Scaled Cosine ErrorSCE用余弦误差解决范数/维度敏感并用γ缩放实现对难样本的自适应重加权类focal loss思想显著提升训练鲁棒性。4) 提出re-mask decoding + GNN decoder通过[DMASK]与图解码器迫使信息从邻域潜表示流入重建增强压缩表征学习。5) 在21个公开数据集、三类任务上给出系统实证GraphMAE作为“简单但设计到位的GAE”可持续优于对比与生成式SOTA基线。
主要创新之处将“MAE式遮蔽重建”迁移到图并针对图特征连续/低语义、易trivial解的特性联合设计SCE损失与re-mask+GNN解码使生成式图自监督在稳定性与效果上同时跨越式提升。

View File

@@ -0,0 +1,9 @@
@misc{hu2020heterogeneousgraphtransformer,
title={Heterogeneous Graph Transformer},
author={Ziniu Hu and Yuxiao Dong and Kuansan Wang and Yizhou Sun},
year={2020},
eprint={2003.01332},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2003.01332},
}

View File

@@ -0,0 +1,54 @@
# Heterogeneous Graph Transformer
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文面向Web-scale heterogeneous graphs节点/边多类型且随时间演化中现有GNN多为homogeneous设计、依赖手工meta paths、难刻画异构分布差异与动态、且不具备可扩展训练的问题提出Heterogeneous Graph TransformerHGT。研究目的在于用端到端方式学习node- and edge-type dependent representations隐式学习“soft meta paths”并能处理dynamic heterogeneous graphs与超大规模训练。方法上HGT以meta relation ⟨τ(s),φ(e),τ(t)⟩分解参数构建异构mutual attention、message passing与target-specific aggregation引入Relative Temporal EncodingRTE用相对时间差∆T建模任意跨度的时序依赖提出HGSampling以保持各类型节点比例与子图稠密性实现mini-batch训练。实验在OAG179M nodes/2B edges及CS/Med子图上HGT在PaperField、PaperVenue与Author Disambiguation等任务上相对SOTA基线提升9%21%消融显示Heter与RTE均显著贡献。结论是HGT可在大规模异构动态图上高效、稳定优于现有GNN。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
Recent years have witnessed the emerging success of graph neural networks (GNNs) for modeling structured data. However, most GNNs are designed for homogeneous graphs, in which all nodes and edges belong to the same types, making them infeasible to represent heterogeneous structures. In this paper, we present the Heterogeneous Graph Transformer (HGT) architecture for modeling Web-scale heterogeneous graphs. To model heterogeneity, we design node- and edge-type dependent parameters to characterize the heterogeneous attention over each edge, empowering HGT to maintain dedicated representations for different types of nodes and edges. To handle dynamic heterogeneous graphs, we introduce the relative temporal encoding technique into HGT, which is able to capture the dynamic structural dependency with arbitrary durations. To handle Web-scale graph data, we design the heterogeneous mini-batch graph sampling algorithm—HGSampling—for efficient and scalable training. Extensive experiments on the Open Academic Graph of 179 million nodes and 2 billion edges show that the proposed HGT model consistently outperforms all the state-of-the-art GNN baselines by 9%–21% on various downstream tasks. The dataset and source code of HGT are publicly available at https://github.com/acbull/pyHGT.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Ziniu Hu, Yuxiao Dong, Kuansan Wang, Yizhou Sun
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
Proceedings of The Web Conference 2020 (WWW 20)
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
论文要解决的核心问题是在Web级别的异构信息网络中如何在不依赖手工meta paths的前提下同时1刻画不同node/edge类型带来的分布差异与语义差异heterogeneity2刻画图随时间演化带来的结构依赖dynamic并且3在数亿节点、数十亿边上实现可扩展的GNN训练scalability。现有异构GNN往往只按node type或edge type单独建模、参数难以共享导致稀有关系学习不稳对动态图常用“按时间切片建多个图”会丢失跨时间结构依赖训练上直接全图传播不可行现有采样在异构图上会产生类型极不均衡子图。简洁概述提出一种以meta relation驱动的Transformer式异构注意力GNN并配套相对时间编码与异构小批量采样使其能在超大规模异构动态图上端到端学习表示并提升下游任务效果。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
1) HGTHeterogeneous Graph Transformer用meta relation ⟨τ(s),φ(e),τ(t)⟩对注意力与消息传递参数进行分解与共享构建异构multi-head mutual attention、异构message passing与target-specific aggregation含残差以生成上下文化表示。
2) Heterogeneous Mutual Attention对不同node type使用不同的Key/Query线性投影并对不同edge type使用W_ATT变换再用meta-relation prior μ对注意力进行自适应缩放。
3) Heterogeneous Message Passing对不同source node type使用不同的消息线性投影M-Linear并用每种edge type的W_MSG注入关系语义形成多头消息。
4) RTERelative Temporal Encoding用相对时间差∆T(t,s)=T(t)-T(s)查找/生成sinusoid basis并经线性层T-Linear映射将RTE(∆T)加到source表示以建模任意跨度、可泛化到未见时间的动态依赖。
5) HGSampling为每种node type维护独立budget并用归一化度数的importance sampling采样出类型比例更均衡且更稠密的异构子图用于mini-batch训练。
6) Inductive Timestamp Assignment对无固定时间戳的plain nodes在采样时从相连event nodes继承时间戳以支持动态建模。
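A minimal sketch of the relative temporal encoding summarized above, assuming a toy hidden size and treating T-Linear as a plain random matrix: the time gap ΔT = T(t) − T(s) is mapped through a sinusoidal basis and a linear layer, then added to the source representation.

```python
# Toy relative temporal encoding: sinusoidal basis of the time gap + linear map.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                           # hidden dimension
W_t = rng.normal(size=(d, d))                   # stands in for T-Linear

def rte(delta_t: int) -> np.ndarray:
    i = np.arange(d)
    angles = delta_t / np.power(10000.0, (i - i % 2) / d)
    base = np.where(i % 2 == 0, np.sin(angles), np.cos(angles))
    return base @ W_t

h_source = rng.normal(size=(d,))
h_source_with_time = h_source + rte(delta_t=3)  # inject the relative-time signal
print(h_source_with_time.shape)
```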
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
1) Open Academic GraphOAG论文实验基础数据集引用[16,20,28]规模约178/179 million nodes与2.236/2 billion edges时间跨度19002019。
2) Computer ScienceCS学术图从OAG构建的领域子图。
3) MedicineMed学术图从OAG构建的领域子图。代码与数据入口论文给出公开仓库 https://github.com/acbull/pyHGT。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
论文使用两类排序指标NDCGNormalized Discounted Cumulative Gain衡量排序结果在Top位置的相关性质量对靠前位置赋予更高权重适用于多标签/多候选的ranking式评估MRRMean Reciprocal Rank对每个查询取正确答案排名的倒数再求均值反映模型把正确答案排到前面的能力常用于link prediction与候选排序任务。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
总体结论HGT在CS/Med/OAG三套大规模异构图、四个任务上NDCG与MRR均显著且一致优于GCN/GAT/RGCN/HetGNN/HAN并在参数量与batch time上优于多数异构GNN基线。
关键数值Table 2均值±方差列出HGT完整版 HGT(+RTE,+Heter) 与最佳基线HAN对比为主在OAG的PaperField(L1)上HAN NDCG 0.544±0.096 / MRR 0.622±0.092HGT达0.615±0.084 / 0.702±0.081在OAG的PaperVenue上HAN NDCG 0.327±0.062 / MRR 0.214±0.067HGT达0.355±0.062 / 0.247±0.061在OAG的Author Disambiguation上HAN NDCG 0.821±0.056 / MRR 0.660±0.049HGT达0.852±0.048 / 0.688±0.054。CS数据集的PaperField(L1)上HAN NDCG 0.618±0.058 / MRR 0.691±0.051HGT达0.718±0.014 / 0.823±0.019Med数据集的PaperField(L1)上HAN NDCG 0.607±0.054 / MRR 0.575±0.057HGT达0.709±0.029 / 0.688±0.048。
消融去掉异构参数化HGT(+RTE, −Heter)相对完整版有约4%性能下降去掉RTEHGT(−RTE, +Heter)约2%下降。效率表中示例在CS任务设置下HGT参数量8.20M、batch time约1.50s低于HAN的9.45M与2.27s且优于RGCN/HetGNN的参数规模与速度。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机1) 现实系统(学术网/社交网/经济网/IoT天然是heterogeneous graphshomogeneous GNN假设不成立2) 既有异构方法依赖手工meta paths或仅按node/edge单维度建模难同时捕获共享模式与特定模式3) 真实图是dynamic heterogeneous graph时间切片会丢失跨时间结构依赖4) Web-scale规模下需要可扩展mini-batch训练现有采样在异构图上会产生类型失衡与信息损失。
贡献点1) 提出HGT架构以meta relation ⟨τ(s),φ(e),τ(t)⟩为核心进行参数分解与共享定义异构mutual attention、message passing与target-specific aggregation实现无需手工meta paths的端到端异构表示学习隐式“soft meta paths”。2) 提出RTE基于相对时间差的时间编码机制可建模任意时长、并泛化到未见时间戳/未来时间。3) 提出HGSampling与配套的inductive timestamp assignment在Web级异构图上采样类型更均衡且更稠密的子图支持高效mini-batch训练与动态图时间戳分配。4) 在OAG179M/2B及CS/Med子图上进行大规模实验证明相对SOTA基线持续提升论文总结为9%21%)。
主要创新之处将Transformer式多头注意力与“基于meta relation的三元分解参数化node-type投影 + edge-type变换 + meta-relation prior”结合实现更强的异构建模与更高的参数共享效率同时把相对时间编码与异构专用采样机制纳入统一训练框架使模型真正可落地于Web-scale异构动态图。

View File

@@ -0,0 +1,9 @@
@misc{xu2019powerfulgraphneuralnetworks,
title={How Powerful are Graph Neural Networks?},
author={Keyulu Xu and Weihua Hu and Jure Leskovec and Stefanie Jegelka},
year={2019},
eprint={1810.00826},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/1810.00826},
}

View File

@@ -0,0 +1,59 @@
# How Powerful are Graph Neural Networks?
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
摘要总结本文系统分析Graph Neural NetworksGNNs的表达能力构建与WeisfeilerLehmanWL图同构测试紧密关联的理论框架。研究目的在于形式化刻画主流GNN变体如GCN、GraphSAGE的判别能力及局限并提出一个在邻域聚合message passing类方法中“最强”的架构。方法上作者将邻域表示为multiset研究不同AGGREGATE与READOUT的可区分性条件证明满足“注入”聚合与读出时GNN至多与WL等强并提出Graph Isomorphism NetworkGIN使用sum聚合与MLP实现对multiset的通用函数逼近。主要结果显示常用的mean/max聚合或1-layer感知机不足以区分简单结构GIN在多项图分类基准上达到SOTA训练拟合几乎完美且测试表现优异。结论GNN的判别力上限由WL测试界定具备注入式聚合与读出如GIN的架构在表达力上最强同时在实践中表现领先。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
Graph Neural Networks (GNNs) are an effective framework for representation learning of graphs. GNNs follow a neighborhood aggregation scheme, where the representation vector of a node is computed by recursively aggregating and transforming representation vectors of its neighboring nodes. Many GNN variants have been proposed and have achieved state-of-the-art results on both node and graph classification tasks. However, despite GNNs revolutionizing graph representation learning, there is limited understanding of their representational properties and limitations. Here, we present a theoretical framework for analyzing the expressive power of GNNs to capture different graph structures. Our results characterize the discriminative power of popular GNN variants, such as Graph Convolutional Networks and GraphSAGE, and show that they cannot learn to distinguish certain simple graph structures. We then develop a simple architecture that is provably the most expressive among the class of GNNs and is as powerful as the Weisfeiler-Lehman graph isomorphism test. We empirically validate our theoretical findings on a number of graph classification benchmarks, and demonstrate that our model achieves state-of-the-art performance.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Keyulu Xu, Weihua Hu, Jure Leskovec, Stefanie Jegelka
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
ICLR 2019
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
核心问题在邻域聚合message passing框架下GNN到底能多强能区分哪些图结构、在哪些情况下失效以及如何构造在此类GNN中表达力最强、与WeisfeilerLehman测试等强的模型。简述论文给出一个以multiset函数为基础的理论框架证明常见聚合mean/max和1-layer感知机存在不可区分的结构同时提出使用sum聚合+MLP的GIN使GNN的判别力达到WL测试的上限。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
1) 理论框架GNN表达力与WL测试将邻域表示为multiset分析AGGREGATE/READOUT的“注入性”条件给出GNN判别力的上界与等价条件。
2) Graph Isomorphism NetworkGIN用sum聚合实现对multiset的通用近似injective结合MLP与(1+ε)·self项逐层更新h_v并在图级通过各层READOUT的拼接/求和形成h_G达到与WL等强的表达力。
3) 聚合器对比分析形式化比较sum、mean、max对multiset的捕获能力分别对应完整multiset、分布、集合骨架揭示其区分能力差异。
4) 经验验证设置在图分类基准上对比GIN与“较弱”变体mean/max或1-layer验证理论结论。
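A minimal numpy sketch of one GIN update and a sum readout as summarized above, with a toy graph and a random two-layer MLP assumed:

```python
# Toy GIN layer: h_v <- MLP((1 + eps) * h_v + sum of neighbour features),
# followed by a sum readout to get a graph-level vector.
import numpy as np

rng = np.random.default_rng(0)
N, F_in, F_hid = 4, 3, 5
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])                              # adjacency (no self-loops)
H = rng.normal(size=(N, F_in))
W1, W2 = rng.normal(size=(F_in, F_hid)), rng.normal(size=(F_hid, F_in))
eps = 0.0                                                 # GIN-0 variant

agg = (1.0 + eps) * H + A @ H                             # injective sum aggregation
H_next = np.maximum(agg @ W1, 0.0) @ W2                   # two-layer MLP with ReLU
graph_readout = H_next.sum(axis=0)                        # sum readout over nodes
print(graph_readout.shape)
```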
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
- BioinformaticsMUTAG、PTC、NCI1、PROTEINS来源于Yanardag & Vishwanathan, 2015汇总的图分类基准
- Social networksCOLLAB、IMDB-BINARY、IMDB-MULTI、REDDIT-BINARY、REDDIT-MULTI5K同样来源于Yanardag & Vishwanathan, 2015。 备注:社交网络数据部分节点无特征,使用度或常数特征;生物数据含离散节点标签。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
- 图分类准确率accuracy10折交叉验证均值±标准差衡量模型在图级分类任务上的泛化性能。
- 训练准确率曲线:衡量不同聚合/架构的表示能力与拟合强度,验证表达力理论结论。
- 与WL subtree kernel对比作为强判别的非学习基线对训练拟合与测试性能进行参考。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
- 训练表现GIN-ε与GIN-0在9个数据集上几乎完美拟合训练集mean/max或1-layer变体在多数据集显著欠拟合训练准确率明显较低。
- 测试准确率10折均值±标准差例如IMDB-BINARY GIN-0为75.1±5.1%REDDIT-BINARY 92.4±2.5%REDDIT-MULTI5K 57.5±1.5%COLLAB 80.2±1.9%MUTAG 89.4±5.6%PROTEINS 76.2±2.8%PTC 64.6±7.0%NCI1 82.7±1.7%在REDDIT类数据上mean-MLP接近随机50.0±0.0% 与 20.0±0.0%。总体结论GIN在多数基准上达到或优于SOTA强表达力带来更好的训练拟合与测试表现简单聚合器存在结构不可分能力导致性能下降。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
- 动机缺乏对GNN表达力的系统理论理解现有设计依赖经验与试错未明确其能区分哪些结构、上限为何、如何构造更强模型。
- 贡献点:
1. 提出以multiset函数为核心的理论框架形式化分析GNN的判别力与其与WL测试的关系上界与等强条件
2. 证明常见变体GCN、GraphSAGE的mean/max、1-layer感知机无法区分某些简单图结构刻画其能捕获的性质分布或集合骨架
3. 设计GIN用sum聚合+MLP含(1+ε)自项实现注入式邻域聚合与图级读出达到与WL等强的表达力
4. 在9个图分类基准上实证验证GIN取得SOTA或可比表现训练几乎完美拟合支撑理论结论。
- 主要创新:
- 将GNN聚合视为multiset上的通用函数并给出注入性条件建立与WL测试的等强判别理论
- 提出GIN这一简单而最强的邻域聚合GNN架构sum+MLP+(1+ε)并配套跨层READOUT设计
- 系统揭示sum/mean/max的表达差异与失效例指导GNN聚合器选择与任务匹配。
