58 Commits

Author SHA1 Message Date
DaZuo0122
566e251743 Add: python scripts for figure generation 2026-02-09 00:24:40 +08:00
DaZuo0122
0a74d67f19 Add: ignore rules for drawio 2026-02-08 17:26:38 +08:00
DaZuo0122
096af1bcb6 Fix: Author Block 2026-02-06 15:51:07 +08:00
DaZuo0122
c146138443 Fix: reference issue, from 50 to 46 2026-02-06 15:24:43 +08:00
DaZuo0122
0b3e79841b Fix: pictures misplacement, remove the lower is better column from the table 2026-02-06 15:11:17 +08:00
DaZuo0122
67466f8e3d Add: IEEE style double column - exceeds 6 page limit and graph misplaced 2026-02-06 14:55:53 +08:00
DaZuo0122
250428b176 Add: README.md for latex compiling 2026-02-05 17:06:27 +08:00
Hongyu Yan
6f1e7a9994 Update main.tex 2026-02-04 23:02:58 +08:00
Hongyu Yan
b3280dcc19 Update main.tex 2026-02-04 22:48:59 +08:00
Hongyu Yan
de445963b5 Update main.tex 2026-02-04 22:45:41 +08:00
Hongyu Yan
5ede1a11f1 The first draft is completed. 2026-02-04 22:37:35 +08:00
Hongyu Yan
21053b4f13 Future Work Part Complete 2026-02-04 22:21:42 +08:00
Hongyu Yan
a7c8250d1a Reapply "Update references.bib"
This reverts commit 03640302db10f4f54109172e518661aad4d5cceb.
2026-02-04 22:18:35 +08:00
Hongyu Yan
03640302db Revert "Update references.bib"
This reverts commit 3a9836d15d7c4e467c408d5b7a3ed758f018ef9e.
2026-02-04 22:16:52 +08:00
Hongyu Yan
3a9836d15d Update references.bib
Solve the ref problem of HAI dataset
2026-02-04 22:16:18 +08:00
DaZuo0122
1e3eb39dea Add: benchmark, bibtex shows error in line 377, should be reference for other section than benchmark/method 2026-02-04 22:13:46 +08:00
DaZuo0122
9f8af2c67c Add: reference for benchmark 2026-02-04 21:46:53 +08:00
272e159df1 Intro and Related Work Completed
- The reference of the HAI dataset still has problems.
2026-02-04 19:39:36 +08:00
81625b5c4e Update .gitignore 2026-02-04 18:19:30 +08:00
DaZuo0122
4815d05127 Change: citations are now presented in numbers 2026-02-04 17:14:03 +08:00
DaZuo0122
5fbfd1068f Add: figure for benchmark section 2026-02-04 17:08:45 +08:00
DaZuo0122
dc5db4f3ec Fix: equation misplaced error 2026-02-04 17:06:57 +08:00
DaZuo0122
87ba7a5516 Finish: Methodology section - have error at \label{eq:masking_process} misplaced. 2026-02-04 16:48:46 +08:00
DaZuo0122
6ae0f24af3 Add: first 3 paragraphs to methodology section 2026-02-04 13:07:59 +08:00
DaZuo0122
c95fca657b Add: Dependencies for equations 2026-02-04 12:47:34 +08:00
DaZuo0122
704277f011 Remove: all custom \DeclareMathOperator and \newcommand 2026-02-04 12:38:55 +08:00
25bc355717 Update .gitignore 2026-02-04 11:23:01 +08:00
DaZuo0122
89997f1125 Add: equations of methodology in a separate file 2026-02-04 10:48:24 +08:00
Hongyu Yan
4ec1b0b7a8 Related Work Completed 2026-02-03 23:51:44 +08:00
Hongyu Yan
327fc36bf7 main tex file update 2026-02-03 23:38:31 +08:00
Hongyu Yan
72b653f35d Ignore file 2026-02-03 23:01:29 +08:00
0e0f951e12 Upload files to "arxiv-style" 2026-02-02 21:54:05 +08:00
ace1130179 Add: Methodology section 12 references 2026-02-01 11:46:16 +08:00
0613edcf61 Add arxiv style template 2026-02-01 11:44:24 +08:00
1ee85b97bc Add more paragraphs/citations to smooth the logic flow 2026-01-31 21:53:52 +08:00
2c1e211504 Version 2, slightly differs from online docs v2 (use that)
online docs v2: https://my.feishu.cn/wiki/Za4dwCsG6iPD9qklRLWcoJOZnnb?edition_id=dXsOZT
2026-01-31 21:50:02 +08:00
735ca8ab51 Update knowledges/draft-incomplete-methodology.md 2026-01-30 21:05:56 +08:00
b88a9d39da Add knowledges/draft-incomplete-methodology.md 2026-01-30 17:43:12 +08:00
Your Name
b3bbef795d Add detailed mask-ddpm documentation 2026-01-30 00:51:08 +08:00
Hongyu Yan
16a3085d14 Network Traffic Generation 2026-01-30 00:05:35 +08:00
Hongyu Yan
d5c8ace183 Generation of process time series on ICS to produce regular network data packets
Papers about topic 7 Generation of process time series on ICS to produce regular network data packets
2026-01-29 00:48:42 +08:00
Hongyu Yan
e94d4aa91f modbus generator related paper 2026-01-28 20:45:46 +08:00
Hongyu Yan
1cbfc6d53d Add a script that extracts all md files
All md files are extracted into the /papers/md folder
2026-01-26 18:22:48 +08:00
Hongyu Yan
1b86bc7927 In-depth paper reading completed 2026-01-26 15:55:39 +08:00
Hongyu Yan
dac3a2e545 Topic1 Completed 2026-01-26 15:43:12 +08:00
Hongyu Yan
7cfdeb7503 Topic2 Completed 2026-01-26 15:28:08 +08:00
Hongyu Yan
284b2dfe8f Topic3 Completed 2026-01-26 15:18:20 +08:00
Hongyu Yan
3219222204 Add paper summaries 2026-01-26 00:18:00 +08:00
Hongyu Yan
4af2d73cc8 Feishu Docs link 2026-01-24 14:19:54 +08:00
Hongyu Yan
4ef0266a71 Topic 4 & 5: in-depth paper readings filled in
All papers whose names start with U- have not yet been read in depth
2026-01-23 14:12:24 +08:00
Hongyu Yan
2208d85d80 Reference Paper 2026-01-22 16:08:52 +08:00
86852cbbbe Update README.md 2026-01-09 15:59:03 +08:00
079a138c2a Add notes4coding/checker_design.md 2026-01-09 15:49:44 +08:00
0730864c95 Add API/whatsneeded.md 2026-01-09 15:47:38 +08:00
112e5ae85e Update knowledges/Intro-Modbus-Diffusion-Tech-Route.md 2026-01-07 19:24:15 +08:00
9511e6271f Update knowledges/Intro-Modbus-Diffusion-Tech-Route.md 2026-01-07 14:53:05 +08:00
baf511a81d Update TeX formula
Change the TeX formula to the correctly rendered state
2026-01-07 14:52:09 +08:00
1728a88f4e Intro-Modbus-Diffusion-Tech-Route.md
An Intro for how to use diffusion on modbus protocol, and how to use/train it.
2026-01-07 00:14:45 +08:00
172 changed files with 24838 additions and 4 deletions

9
.gitignore vendored Normal file

@@ -0,0 +1,9 @@
papers/Topic8 Network Traffic Generation/NetDiffusion Network Data Augmentation Through Protocol-Constrained Traffic Generation/NetDiffusion Network Data Augmentation Through Protocol-Constrained Traffic Gener_1_33_translate_20260130214700.pdf
arxiv-style/*.pdf
arxiv-style/*.aux
arxiv-style/*.log
arxiv-style/*.blg
arxiv-style/*.bbl
arxiv-style/*.out
fig/
.DS_Store

32
API/whatsneeded.md Normal file

@@ -0,0 +1,32 @@
# What's needed for development & research?
Group members should describe their requirements and dependencies in detail in this document.
Current group members:
- HY YAN
- MZ YANG
- ZL CHEN
The following subheadings are the sections where each member describes their requirements.
## ZL CHEN
{placeholder}
## MZ YANG
{placeholder}
## HY YAN
{placeholder}


@@ -1 +1,12 @@
## Hi there!
## File organisation
Currently, the docs consist of 4 parts:
- **API**: contains detailed docs about API designs. (Note that `system_design.md` is a temp draft for discussing high-level design, not for API definitions).
- **knowledges**: contains theoretical knowledge/designs for the project (mainly diffusion stuff).
- **notes4coding**: contains detailed descriptions for coding; may be useful to coding agents (like codex).
- **papers**: contains useful papers we've found, in PDF
Here is the Feishu Docs link: https://my.feishu.cn/wiki/Za4dwCsG6iPD9qklRLWcoJOZnnb?from=from_copylink

6347
arxiv-style/IEEEtran.cls Normal file

File diff suppressed because it is too large

27
arxiv-style/README.md Normal file

@@ -0,0 +1,27 @@
## Files layout
- `arxiv.sty` and `template.tex`: The arxiv template we are using.
- `equations.tex`: **Duplicated**, contains the equations from the methodology section
- `main.tex` and `references.bib`: these produce our paper, currently using the arxiv template. Note that the references are template-independent.
## How to compile
It's recommended to use `MiKTeX` as the compiler on Windows.
To compile latex into pdf, follow these steps:
```bash
pdflatex ./main.tex
# Build the reference database; re-run only when references.bib changes
bibtex main
# Always run the compile command twice so cross-references resolve
pdflatex ./main.tex
pdflatex ./main.tex
```
## Troubleshooting
If you encounter warnings during compilation, simply press `Enter` to continue.
If references in the PDF appear as `[??]`, run the compile command twice more.
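
Alternatively, if your TeX distribution ships `latexmk` (an assumption; it is not required by this repo), a single command handles the pdflatex/bibtex reruns automatically:
```bash
latexmk -pdf main.tex
```
This is just a convenience sketch; the manual sequence above remains the reference workflow.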

262
arxiv-style/arxiv.sty Normal file

@@ -0,0 +1,262 @@
\NeedsTeXFormat{LaTeX2e}
\ProcessOptions\relax
% fonts
\renewcommand{\rmdefault}{ptm}
\renewcommand{\sfdefault}{phv}
% set page geometry
\usepackage[verbose=true,letterpaper]{geometry}
\AtBeginDocument{
\newgeometry{
textheight=9in,
textwidth=6.5in,
top=1in,
headheight=14pt,
headsep=25pt,
footskip=30pt
}
}
\widowpenalty=10000
\clubpenalty=10000
\flushbottom
\sloppy
\newcommand{\headeright}{A Preprint}
\newcommand{\undertitle}{A Preprint}
\newcommand{\shorttitle}{\@title}
\usepackage{fancyhdr}
\fancyhf{}
\pagestyle{fancy}
\renewcommand{\headrulewidth}{0.4pt}
\fancyheadoffset{0pt}
\rhead{\scshape \footnotesize \headeright}
\chead{\shorttitle}
\cfoot{\thepage}
%Handling Keywords
\def\keywordname{{\bfseries \emph{Keywords}}}%
\def\keywords#1{\par\addvspace\medskipamount{\rightskip=0pt plus1cm
\def\and{\ifhmode\unskip\nobreak\fi\ $\cdot$
}\noindent\keywordname\enspace\ignorespaces#1\par}}
% font sizes with reduced leading
\renewcommand{\normalsize}{%
\@setfontsize\normalsize\@xpt\@xipt
\abovedisplayskip 7\p@ \@plus 2\p@ \@minus 5\p@
\abovedisplayshortskip \z@ \@plus 3\p@
\belowdisplayskip \abovedisplayskip
\belowdisplayshortskip 4\p@ \@plus 3\p@ \@minus 3\p@
}
\normalsize
\renewcommand{\small}{%
\@setfontsize\small\@ixpt\@xpt
\abovedisplayskip 6\p@ \@plus 1.5\p@ \@minus 4\p@
\abovedisplayshortskip \z@ \@plus 2\p@
\belowdisplayskip \abovedisplayskip
\belowdisplayshortskip 3\p@ \@plus 2\p@ \@minus 2\p@
}
\renewcommand{\footnotesize}{\@setfontsize\footnotesize\@ixpt\@xpt}
\renewcommand{\scriptsize}{\@setfontsize\scriptsize\@viipt\@viiipt}
\renewcommand{\tiny}{\@setfontsize\tiny\@vipt\@viipt}
\renewcommand{\large}{\@setfontsize\large\@xiipt{14}}
\renewcommand{\Large}{\@setfontsize\Large\@xivpt{16}}
\renewcommand{\LARGE}{\@setfontsize\LARGE\@xviipt{20}}
\renewcommand{\huge}{\@setfontsize\huge\@xxpt{23}}
\renewcommand{\Huge}{\@setfontsize\Huge\@xxvpt{28}}
% sections with less space
\providecommand{\section}{}
\renewcommand{\section}{%
\@startsection{section}{1}{\z@}%
{-2.0ex \@plus -0.5ex \@minus -0.2ex}%
{ 1.5ex \@plus 0.3ex \@minus 0.2ex}%
{\large\bf\raggedright}%
}
\providecommand{\subsection}{}
\renewcommand{\subsection}{%
\@startsection{subsection}{2}{\z@}%
{-1.8ex \@plus -0.5ex \@minus -0.2ex}%
{ 0.8ex \@plus 0.2ex}%
{\normalsize\bf\raggedright}%
}
\providecommand{\subsubsection}{}
\renewcommand{\subsubsection}{%
\@startsection{subsubsection}{3}{\z@}%
{-1.5ex \@plus -0.5ex \@minus -0.2ex}%
{ 0.5ex \@plus 0.2ex}%
{\normalsize\bf\raggedright}%
}
\providecommand{\paragraph}{}
\renewcommand{\paragraph}{%
\@startsection{paragraph}{4}{\z@}%
{1.5ex \@plus 0.5ex \@minus 0.2ex}%
{-1em}%
{\normalsize\bf}%
}
\providecommand{\subparagraph}{}
\renewcommand{\subparagraph}{%
\@startsection{subparagraph}{5}{\z@}%
{1.5ex \@plus 0.5ex \@minus 0.2ex}%
{-1em}%
{\normalsize\bf}%
}
\providecommand{\subsubsubsection}{}
\renewcommand{\subsubsubsection}{%
\vskip5pt{\noindent\normalsize\rm\raggedright}%
}
% float placement
\renewcommand{\topfraction }{0.85}
\renewcommand{\bottomfraction }{0.4}
\renewcommand{\textfraction }{0.1}
\renewcommand{\floatpagefraction}{0.7}
\newlength{\@abovecaptionskip}\setlength{\@abovecaptionskip}{7\p@}
\newlength{\@belowcaptionskip}\setlength{\@belowcaptionskip}{\z@}
\setlength{\abovecaptionskip}{\@abovecaptionskip}
\setlength{\belowcaptionskip}{\@belowcaptionskip}
% swap above/belowcaptionskip lengths for tables
\renewenvironment{table}
{\setlength{\abovecaptionskip}{\@belowcaptionskip}%
\setlength{\belowcaptionskip}{\@abovecaptionskip}%
\@float{table}}
{\end@float}
% footnote formatting
\setlength{\footnotesep }{6.65\p@}
\setlength{\skip\footins}{9\p@ \@plus 4\p@ \@minus 2\p@}
\renewcommand{\footnoterule}{\kern-3\p@ \hrule width 12pc \kern 2.6\p@}
\setcounter{footnote}{0}
% paragraph formatting
\setlength{\parindent}{\z@}
\setlength{\parskip }{5.5\p@}
% list formatting
\setlength{\topsep }{4\p@ \@plus 1\p@ \@minus 2\p@}
\setlength{\partopsep }{1\p@ \@plus 0.5\p@ \@minus 0.5\p@}
\setlength{\itemsep }{2\p@ \@plus 1\p@ \@minus 0.5\p@}
\setlength{\parsep }{2\p@ \@plus 1\p@ \@minus 0.5\p@}
\setlength{\leftmargin }{3pc}
\setlength{\leftmargini }{\leftmargin}
\setlength{\leftmarginii }{2em}
\setlength{\leftmarginiii}{1.5em}
\setlength{\leftmarginiv }{1.0em}
\setlength{\leftmarginv }{0.5em}
\def\@listi {\leftmargin\leftmargini}
\def\@listii {\leftmargin\leftmarginii
\labelwidth\leftmarginii
\advance\labelwidth-\labelsep
\topsep 2\p@ \@plus 1\p@ \@minus 0.5\p@
\parsep 1\p@ \@plus 0.5\p@ \@minus 0.5\p@
\itemsep \parsep}
\def\@listiii{\leftmargin\leftmarginiii
\labelwidth\leftmarginiii
\advance\labelwidth-\labelsep
\topsep 1\p@ \@plus 0.5\p@ \@minus 0.5\p@
\parsep \z@
\partopsep 0.5\p@ \@plus 0\p@ \@minus 0.5\p@
\itemsep \topsep}
\def\@listiv {\leftmargin\leftmarginiv
\labelwidth\leftmarginiv
\advance\labelwidth-\labelsep}
\def\@listv {\leftmargin\leftmarginv
\labelwidth\leftmarginv
\advance\labelwidth-\labelsep}
\def\@listvi {\leftmargin\leftmarginvi
\labelwidth\leftmarginvi
\advance\labelwidth-\labelsep}
% create title
\providecommand{\maketitle}{}
\renewcommand{\maketitle}{%
\par
\begingroup
\renewcommand{\thefootnote}{\fnsymbol{footnote}}
% for perfect author name centering
%\renewcommand{\@makefnmark}{\hbox to \z@{$^{\@thefnmark}$\hss}}
% The footnote-mark was overlapping the footnote-text,
% added the following to fix this problem (MK)
\long\def\@makefntext##1{%
\parindent 1em\noindent
\hbox to 1.8em{\hss $\m@th ^{\@thefnmark}$}##1
}
\thispagestyle{empty}
\@maketitle
\@thanks
%\@notice
\endgroup
\let\maketitle\relax
\let\thanks\relax
}
% rules for title box at top of first page
\newcommand{\@toptitlebar}{
\hrule height 2\p@
\vskip 0.25in
\vskip -\parskip%
}
\newcommand{\@bottomtitlebar}{
\vskip 0.29in
\vskip -\parskip
\hrule height 2\p@
\vskip 0.09in%
}
% create title (includes both anonymized and non-anonymized versions)
\providecommand{\@maketitle}{}
\renewcommand{\@maketitle}{%
\vbox{%
\hsize\textwidth
\linewidth\hsize
\vskip 0.1in
\@toptitlebar
\centering
{\LARGE\sc \@title\par}
\@bottomtitlebar
\textsc{\undertitle}\\
\vskip 0.1in
\def\And{%
\end{tabular}\hfil\linebreak[0]\hfil%
\begin{tabular}[t]{c}\bf\rule{\z@}{24\p@}\ignorespaces%
}
\def\AND{%
\end{tabular}\hfil\linebreak[4]\hfil%
\begin{tabular}[t]{c}\bf\rule{\z@}{24\p@}\ignorespaces%
}
\begin{tabular}[t]{c}\bf\rule{\z@}{24\p@}\@author\end{tabular}%
\vskip 0.4in \@minus 0.1in \center{\@date} \vskip 0.2in
}
}
% add conference notice to bottom of first page
\newcommand{\ftype@noticebox}{8}
\newcommand{\@notice}{%
% give a bit of extra room back to authors on first page
\enlargethispage{2\baselineskip}%
\@float{noticebox}[b]%
\footnotesize\@noticestring%
\end@float%
}
% abstract styling
\renewenvironment{abstract}
{
\centerline
{\large \bfseries \scshape Abstract}
\begin{quote}
}
{
\end{quote}
}
\endinput
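
As a quick orientation (not part of the style file), a minimal document sketch showing how the commands defined above (`\maketitle`, `\And`, `\keywords`, the `abstract` environment) are typically invoked; the title, authors, and keywords here are placeholders, not the paper's actual content:

```latex
\documentclass{article}
\usepackage{arxiv}          % the style file above, saved as arxiv.sty
\usepackage{amsmath}

\title{Example Title}
\author{First Author \And Second Author}

\begin{document}
\maketitle
\begin{abstract}
One or two sentences of abstract text.
\end{abstract}
\keywords{Diffusion \and ICS \and Modbus}
\section{Introduction}
Body text.
\end{document}
```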

95
arxiv-style/equations.tex Normal file

@@ -0,0 +1,95 @@
\documentclass[10pt, twocolumn]{article}
\usepackage{amsmath, amssymb}
\usepackage{bm}
\usepackage[margin=1in]{geometry}
\usepackage{microtype}
\title{Equations: Mask-DDPM Methodology}
\author{}
\date{}
\begin{document}
\maketitle
\section{Problem Formulation}
Each training instance is a fixed-length window of length $L$, comprising continuous channels $\bm{X} \in \mathbb{R}^{L \times d_c}$ and discrete channels $\bm{Y} = \{y^{(j)}_{1:L}\}_{j=1}^{d_d}$, where each discrete variable satisfies $y^{(j)}_t \in \mathcal{V}_j$ for a finite vocabulary $\mathcal{V}_j$.
\section{Transformer Trend Module for Continuous Dynamics}
We posit an additive decomposition of the continuous signal:
\begin{equation}
\bm{X} = \bm{S} + \bm{R},
\label{eq:additive_decomp}
\end{equation}
where $\bm{S} \in \mathbb{R}^{L \times d_c}$ captures the smooth temporal trend and $\bm{R} \in \mathbb{R}^{L \times d_c}$ represents distributional residuals.
The causal Transformer trend extractor $f_{\phi}$ predicts the next-step trend via:
\begin{equation}
\hat{\bm{S}}_{t+1} = f_{\phi}(\bm{X}_{1:t}), \quad t = 1, \dots, L-1.
\label{eq:trend_prediction}
\end{equation}
Training minimizes the mean-squared error:
\begin{equation}
\mathcal{L}_{\text{trend}}(\phi) = \frac{1}{(L-1)d_c} \sum_{t=1}^{L-1} \bigl\| \hat{\bm{S}}_{t+1} - \bm{X}_{t+1} \bigr\|_2^2.
\label{eq:trend_loss}
\end{equation}
At inference, the residual target is defined as $\bm{R} = \bm{X} - \hat{\bm{S}}$.
\section{DDPM for Continuous Residual Generation}
Let $K$ denote diffusion steps with noise schedule $\{\beta_k\}_{k=1}^K$, $\alpha_k = 1 - \beta_k$, and $\bar{\alpha}_k = \prod_{i=1}^k \alpha_i$. The forward corruption process is:
\begin{align}
q(\bm{r}_k \mid \bm{r}_0) &= \mathcal{N}\bigl( \sqrt{\bar{\alpha}_k}\,\bm{r}_0,\; (1 - \bar{\alpha}_k)\mathbf{I} \bigr), \\
\bm{r}_k &= \sqrt{\bar{\alpha}_k}\,\bm{r}_0 + \sqrt{1 - \bar{\alpha}_k}\,\boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),
\label{eq:forward_process}
\end{align}
where $\bm{r}_0 \equiv \bm{R}$.
The reverse process is parameterized as:
\begin{equation}
p_{\theta}(\bm{r}_{k-1} \mid \bm{r}_k, \hat{\bm{S}}) = \mathcal{N}\bigl( \boldsymbol{\mu}_{\theta}(\bm{r}_k, k, \hat{\bm{S}}),\; \boldsymbol{\Sigma}(k) \bigr).
\label{eq:reverse_process}
\end{equation}
Training employs the $\epsilon$-prediction objective:
\begin{equation}
\mathcal{L}_{\text{cont}}(\theta) = \mathbb{E}_{k,\bm{r}_0,\boldsymbol{\epsilon}} \left[ \bigl\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}(\bm{r}_k, k, \hat{\bm{S}}) \bigr\|_2^2 \right].
\label{eq:ddpm_loss}
\end{equation}
Optionally, SNR-based reweighting yields:
\begin{equation}
\mathcal{L}^{\text{snr}}_{\text{cont}}(\theta) = \mathbb{E}_{k,\bm{r}_0,\boldsymbol{\epsilon}} \left[ w_k \bigl\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}(\bm{r}_k, k, \hat{\bm{S}}) \bigr\|_2^2 \right],
\label{eq:snr_loss}
\end{equation}
where $w_k = \min(\mathrm{SNR}_k, \gamma) / \mathrm{SNR}_k$ and $\mathrm{SNR}_k = \bar{\alpha}_k / (1 - \bar{\alpha}_k)$. The final continuous output is reconstructed as $\hat{\bm{X}} = \hat{\bm{S}} + \hat{\bm{R}}$.
\section{Masked Diffusion for Discrete Variables}
For discrete channel $j$, the forward masking process follows schedule $\{m_k\}_{k=1}^K$:
\begin{equation}
q(y^{(j)}_k \mid y^{(j)}_0) =
\begin{cases}
y^{(j)}_0, & \text{with probability } 1 - m_k, \\
\texttt{[MASK]}, & \text{with probability } m_k,
\end{cases}
\label{eq:masking_process}
\end{equation}
applied independently across variables and timesteps.
The denoiser $h_{\psi}$ predicts categorical distributions conditioned on continuous context:
\begin{equation}
p_{\psi}\bigl( y^{(j)}_0 \mid y_k, k, \hat{\bm{S}}, \hat{\bm{X}} \bigr) = h_{\psi}(y_k, k, \hat{\bm{S}}, \hat{\bm{X}}).
\label{eq:discrete_denoising}
\end{equation}
Training minimizes the categorical cross-entropy:
\begin{equation}
\mathcal{L}_{\text{disc}}(\psi) = \mathbb{E}_{k} \left[ \frac{1}{|\mathcal{M}|} \sum_{(j,t) \in \mathcal{M}} \mathrm{CE}\bigl( h_{\psi}(y_k, k, \hat{\bm{S}}, \hat{\bm{X}})_{j,t},\; y^{(j)}_{0,t} \bigr) \right],
\label{eq:discrete_loss}
\end{equation}
where $\mathcal{M}$ denotes masked positions at step $k$.
\section{Joint Optimization}
The combined objective balances continuous and discrete learning:
\begin{equation}
\mathcal{L} = \lambda \, \mathcal{L}_{\text{cont}} + (1 - \lambda) \, \mathcal{L}_{\text{disc}}, \quad \lambda \in [0,1].
\label{eq:joint_objective}
\end{equation}
Type-aware routing enforces deterministic reconstruction $\hat{x}^{(i)} = g_i(\hat{\bm{X}}, \hat{\bm{Y}})$ for derived variables.
\end{document}
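
For intuition only (not part of the paper sources), a minimal NumPy sketch of the two forward processes defined above, the Gaussian corruption of Eq. (eq:forward_process) and the token masking of Eq. (eq:masking_process); all shapes, schedules, and names are illustrative assumptions rather than the project's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): window length L, d_c continuous channels, K diffusion steps.
L, d_c, K = 64, 4, 1000
betas = np.linspace(1e-4, 0.02, K)        # noise schedule {beta_k}
alphas_bar = np.cumprod(1.0 - betas)      # \bar{alpha}_k = prod_i (1 - beta_i)

def forward_continuous(r0, k):
    """Sample r_k ~ N(sqrt(abar_k) * r_0, (1 - abar_k) I), as in Eq. (forward_process)."""
    eps = rng.standard_normal(r0.shape)
    return np.sqrt(alphas_bar[k]) * r0 + np.sqrt(1.0 - alphas_bar[k]) * eps

MASK = -1  # stand-in id for the [MASK] token

def forward_discrete(y0, k, mask_schedule):
    """Independently replace each token with [MASK] with probability m_k, as in Eq. (masking_process)."""
    keep = rng.random(y0.shape) >= mask_schedule[k]
    return np.where(keep, y0, MASK)

r0 = rng.standard_normal((L, d_c))        # residual window R = X - S_hat
y0 = rng.integers(0, 8, size=L)           # one discrete channel with vocabulary size 8
m_schedule = np.linspace(0.0, 1.0, K)     # masking schedule {m_k}
r_k = forward_continuous(r0, k=500)
y_k = forward_discrete(y0, k=500, mask_schedule=m_schedule)
```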

Binary file not shown (new image, 582 KiB)

Binary file not shown (new image, 669 KiB)

Binary file not shown (new image, 123 KiB)


@@ -0,0 +1 @@
3.12


@@ -0,0 +1,237 @@
#!/usr/bin/env python3
"""
Draw *separate* SVG figures for:
1) Continuous channels (multiple smooth curves per figure)
2) Discrete channels (multiple step-like/token curves per figure)
Outputs (default):
out/continuous_channels.svg
out/discrete_channels.svg
Notes:
- Transparent background (good for draw.io / LaTeX / diagrams).
- No axes/frames by default (diagram-friendly).
- Curves are synthetic placeholders; replace `make_*_channels()` with your real data.
"""
from __future__ import annotations
import argparse
from dataclasses import dataclass
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
# ----------------------------
# Data generators (placeholders)
# ----------------------------
@dataclass
class GenParams:
seconds: float = 10.0
fs: int = 200
seed: int = 7
n_cont: int = 6 # number of continuous channels (curves)
n_disc: int = 5 # number of discrete channels (curves)
disc_vocab: int = 8 # token/vocab size for discrete channels
disc_change_rate_hz: float = 1.2 # how often discrete tokens change
def make_continuous_channels(p: GenParams) -> tuple[np.ndarray, np.ndarray]:
"""
Returns:
t: shape (T,)
Y: shape (n_cont, T)
"""
rng = np.random.default_rng(p.seed)
T = int(p.seconds * p.fs)
t = np.linspace(0, p.seconds, T, endpoint=False)
Y = []
for i in range(p.n_cont):
# Multi-scale smooth-ish signals
f1 = 0.15 + 0.06 * i
f2 = 0.8 + 0.15 * (i % 3)
phase = rng.uniform(0, 2 * np.pi)
y = (
0.9 * np.sin(2 * np.pi * f1 * t + phase)
+ 0.35 * np.sin(2 * np.pi * f2 * t + 1.3 * phase)
)
# Add mild colored-ish noise by smoothing white noise
w = rng.normal(0, 1, size=T)
w = np.convolve(w, np.ones(9) / 9.0, mode="same")
y = y + 0.15 * w
# Normalize each channel for consistent visual scale
y = (y - np.mean(y)) / (np.std(y) + 1e-9)
y = 0.8 * y + 0.15 * i # vertical offset to separate curves a bit
Y.append(y)
return t, np.vstack(Y)
def make_discrete_channels(p: GenParams) -> tuple[np.ndarray, np.ndarray]:
"""
Discrete channels as piecewise-constant token IDs (integers).
Returns:
t: shape (T,)
X: shape (n_disc, T) (integers in [0, disc_vocab-1])
"""
rng = np.random.default_rng(p.seed + 100)
T = int(p.seconds * p.fs)
t = np.linspace(0, p.seconds, T, endpoint=False)
# expected number of changes per channel
expected_changes = int(max(1, p.seconds * p.disc_change_rate_hz))
X = np.zeros((p.n_disc, T), dtype=int)
for c in range(p.n_disc):
# pick change points
k = rng.poisson(expected_changes) + 1
change_pts = np.unique(rng.integers(0, T, size=k))
change_pts = np.sort(np.concatenate([[0], change_pts, [T]]))
cur = rng.integers(0, p.disc_vocab)
for a, b in zip(change_pts[:-1], change_pts[1:]):
# occasional token jump
if a != 0:
if rng.random() < 0.85:
cur = rng.integers(0, p.disc_vocab)
X[c, a:b] = cur
return t, X
# ----------------------------
# Plotting helpers
# ----------------------------
def _make_transparent_figure(width_in: float, height_in: float) -> tuple[plt.Figure, plt.Axes]:
fig = plt.figure(figsize=(width_in, height_in), dpi=200)
fig.patch.set_alpha(0.0)
ax = fig.add_axes([0.03, 0.03, 0.94, 0.94])
ax.patch.set_alpha(0.0)
return fig, ax
def save_continuous_channels_svg(
t: np.ndarray,
Y: np.ndarray,
out_path: Path,
*,
lw: float = 2.0,
clean: bool = True,
) -> None:
"""
Plot multiple continuous curves in one figure and save SVG.
Y shape: (n_cont, T)
"""
fig, ax = _make_transparent_figure(width_in=6.0, height_in=2.2)
# Let matplotlib choose different colors automatically (good defaults).
for i in range(Y.shape[0]):
ax.plot(t, Y[i], linewidth=lw)
if clean:
ax.set_axis_off()
else:
ax.set_xlabel("t")
ax.set_ylabel("value")
# Set limits with padding
y_all = Y.reshape(-1)
ymin, ymax = float(np.min(y_all)), float(np.max(y_all))
ypad = 0.08 * (ymax - ymin + 1e-9)
ax.set_xlim(t[0], t[-1])
ax.set_ylim(ymin - ypad, ymax + ypad)
out_path.parent.mkdir(parents=True, exist_ok=True)
fig.savefig(out_path, format="svg", bbox_inches="tight", pad_inches=0.0, transparent=True)
plt.close(fig)
def save_discrete_channels_svg(
t: np.ndarray,
X: np.ndarray,
out_path: Path,
*,
lw: float = 2.0,
clean: bool = True,
vertical_spacing: float = 1.25,
) -> None:
"""
Plot multiple discrete (piecewise-constant) curves in one figure and save SVG.
X shape: (n_disc, T) integers.
We draw each channel as a step plot, offset vertically so curves don't overlap.
"""
fig, ax = _make_transparent_figure(width_in=6.0, height_in=2.2)
for i in range(X.shape[0]):
y = X[i].astype(float) + i * vertical_spacing
ax.step(t, y, where="post", linewidth=lw)
if clean:
ax.set_axis_off()
else:
ax.set_xlabel("t")
ax.set_ylabel("token id (offset)")
y_all = (X.astype(float) + np.arange(X.shape[0])[:, None] * vertical_spacing).reshape(-1)
ymin, ymax = float(np.min(y_all)), float(np.max(y_all))
ypad = 0.10 * (ymax - ymin + 1e-9)
ax.set_xlim(t[0], t[-1])
ax.set_ylim(ymin - ypad, ymax + ypad)
out_path.parent.mkdir(parents=True, exist_ok=True)
fig.savefig(out_path, format="svg", bbox_inches="tight", pad_inches=0.0, transparent=True)
plt.close(fig)
# ----------------------------
# CLI
# ----------------------------
def main() -> None:
ap = argparse.ArgumentParser()
ap.add_argument("--outdir", type=Path, default=Path("out"))
ap.add_argument("--seed", type=int, default=7)
ap.add_argument("--seconds", type=float, default=10.0)
ap.add_argument("--fs", type=int, default=200)
ap.add_argument("--n-cont", type=int, default=6)
ap.add_argument("--n-disc", type=int, default=5)
ap.add_argument("--disc-vocab", type=int, default=8)
ap.add_argument("--disc-change-rate", type=float, default=1.2)
ap.add_argument("--keep-axes", action="store_true", help="Show axes/labels (default: off)")
args = ap.parse_args()
p = GenParams(
seconds=args.seconds,
fs=args.fs,
seed=args.seed,
n_cont=args.n_cont,
n_disc=args.n_disc,
disc_vocab=args.disc_vocab,
disc_change_rate_hz=args.disc_change_rate,
)
t_c, Y = make_continuous_channels(p)
t_d, X = make_discrete_channels(p)
cont_path = args.outdir / "continuous_channels.svg"
disc_path = args.outdir / "discrete_channels.svg"
save_continuous_channels_svg(t_c, Y, cont_path, clean=not args.keep_axes)
save_discrete_channels_svg(t_d, X, disc_path, clean=not args.keep_axes)
print("Wrote:")
print(f" {cont_path}")
print(f" {disc_path}")
if __name__ == "__main__":
main()
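
For reference, a typical invocation of the script above (the file name is an assumption; the flags match the argparse options defined in `main()`):

```bash
uv run python draw_channel_figures.py --outdir out --n-cont 6 --n-disc 5
```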


@@ -0,0 +1,272 @@
#!/usr/bin/env python3
"""
Option A: "Synthetic ICS Data" mini-panel (high-level features, not packets)
What it draws (one SVG, transparent background):
- Top: 2-3 continuous feature curves (smooth, time-aligned)
- Bottom: discrete/categorical feature strip (colored blocks)
- One vertical dashed alignment line crossing both
- Optional shaded regime window
- Optional "real vs synthetic" ghost overlay (faint gray behind one curve)
Usage:
uv run python draw_synthetic_ics_optionA.py --out ./assets/synth_ics_optionA.svg
"""
from __future__ import annotations
import argparse
from dataclasses import dataclass
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
@dataclass
class Params:
seed: int = 7
seconds: float = 10.0
fs: int = 300
n_curves: int = 3 # continuous channels shown
n_bins: int = 40 # discrete blocks across x
disc_vocab: int = 8 # number of discrete categories
# Layout / style
width_in: float = 6.0
height_in: float = 2.2
curve_lw: float = 2.3
ghost_lw: float = 2.0 # "real" overlay line width
strip_height: float = 0.65 # bar height in [0,1] strip axis
strip_gap_frac: float = 0.10 # gap between blocks (fraction of block width)
# Visual cues
show_alignment_line: bool = True
align_x_frac: float = 0.58 # where to place dashed line, fraction of timeline
show_regime_window: bool = True
regime_start_frac: float = 0.30
regime_end_frac: float = 0.45
show_real_ghost: bool = True # faint gray "real" behind first synthetic curve
def _smooth(x: np.ndarray, win: int) -> np.ndarray:
win = max(3, int(win) | 1) # odd
k = np.ones(win, dtype=float)
k /= k.sum()
return np.convolve(x, k, mode="same")
def make_continuous_curves(p: Params) -> tuple[np.ndarray, np.ndarray, np.ndarray | None]:
"""
Returns:
t: (T,)
Y_syn: (n_curves, T) synthetic curves
y_real: (T,) or None optional "real" ghost curve (for one channel)
"""
rng = np.random.default_rng(p.seed)
T = int(p.seconds * p.fs)
t = np.linspace(0, p.seconds, T, endpoint=False)
Y = []
for i in range(p.n_curves):
# multi-scale smooth temporal patterns
f_slow = 0.09 + 0.03 * (i % 3)
f_mid = 0.65 + 0.18 * (i % 4)
ph = rng.uniform(0, 2 * np.pi)
y = (
0.95 * np.sin(2 * np.pi * f_slow * t + ph)
+ 0.30 * np.sin(2 * np.pi * f_mid * t + 0.7 * ph)
)
# regime-like bumps
bumps = np.zeros_like(t)
for _ in range(2):
mu = rng.uniform(0.8, p.seconds - 0.8)
sig = rng.uniform(0.35, 0.85)
bumps += np.exp(-0.5 * ((t - mu) / (sig + 1e-9)) ** 2)
y += 0.55 * bumps
# mild smooth noise
noise = _smooth(rng.normal(0, 1, size=T), win=int(p.fs * 0.04))
y += 0.10 * noise
# normalize for clean presentation
y = (y - y.mean()) / (y.std() + 1e-9)
y *= 0.42
Y.append(y)
Y_syn = np.vstack(Y)
# Optional "real" ghost: similar to first curve, but slightly different
y_real = None
if p.show_real_ghost:
base = Y_syn[0].copy()
drift = _smooth(rng.normal(0, 1, size=T), win=int(p.fs * 0.18))
drift = drift / (np.std(drift) + 1e-9)
y_real = base * 0.95 + 0.07 * drift
return t, Y_syn, y_real
def make_discrete_strip(p: Params) -> np.ndarray:
"""
Piecewise-constant categorical IDs across n_bins.
Returns:
ids: (n_bins,) in [0, disc_vocab-1]
"""
rng = np.random.default_rng(p.seed + 123)
n = p.n_bins
ids = np.zeros(n, dtype=int)
cur = rng.integers(0, p.disc_vocab)
for i in range(n):
# occasional change
if i == 0 or rng.random() < 0.28:
cur = rng.integers(0, p.disc_vocab)
ids[i] = cur
return ids
def _axes_clean(ax: plt.Axes) -> None:
"""Keep axes lines optional but remove all text/numbers (diagram-friendly)."""
ax.set_xlabel("")
ax.set_ylabel("")
ax.set_title("")
ax.set_xticks([])
ax.set_yticks([])
ax.tick_params(
axis="both",
which="both",
bottom=False,
left=False,
top=False,
right=False,
labelbottom=False,
labelleft=False,
)
def draw_optionA(out_path: Path, p: Params) -> None:
# Figure
fig = plt.figure(figsize=(p.width_in, p.height_in), dpi=200)
fig.patch.set_alpha(0.0)
# Two stacked axes (shared x)
ax_top = fig.add_axes([0.08, 0.32, 0.90, 0.62])
ax_bot = fig.add_axes([0.08, 0.12, 0.90, 0.16], sharex=ax_top)
ax_top.patch.set_alpha(0.0)
ax_bot.patch.set_alpha(0.0)
# Generate data
t, Y_syn, y_real = make_continuous_curves(p)
ids = make_discrete_strip(p)
x0, x1 = float(t[0]), float(t[-1])
span = x1 - x0
# Optional shaded regime window
if p.show_regime_window:
rs = x0 + p.regime_start_frac * span
re = x0 + p.regime_end_frac * span
ax_top.axvspan(rs, re, alpha=0.12) # default color, semi-transparent
ax_bot.axvspan(rs, re, alpha=0.12)
# Optional vertical dashed alignment line
if p.show_alignment_line:
vx = x0 + p.align_x_frac * span
ax_top.axvline(vx, linestyle="--", linewidth=1.2, alpha=0.7)
ax_bot.axvline(vx, linestyle="--", linewidth=1.2, alpha=0.7)
# Continuous curves (use fixed colors for consistency)
curve_colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#9467bd"] # blue, orange, green, purple
# Ghost "real" behind the first curve (faint gray)
if y_real is not None:
ax_top.plot(t, y_real, linewidth=p.ghost_lw, color="0.65", alpha=0.55, zorder=1)
for i in range(Y_syn.shape[0]):
ax_top.plot(
t, Y_syn[i],
linewidth=p.curve_lw,
color=curve_colors[i % len(curve_colors)],
zorder=2
)
# Set top y-limits with padding
ymin, ymax = float(Y_syn.min()), float(Y_syn.max())
ypad = 0.10 * (ymax - ymin + 1e-9)
ax_top.set_xlim(x0, x1)
ax_top.set_ylim(ymin - ypad, ymax + ypad)
# Discrete strip as colored blocks
palette = [
"#e41a1c", "#377eb8", "#4daf4a", "#984ea3",
"#ff7f00", "#ffff33", "#a65628", "#f781bf",
]
n = len(ids)
bin_w = span / n
gap = p.strip_gap_frac * bin_w
ax_bot.set_ylim(0, 1)
y = (1 - p.strip_height) / 2
for i, cat in enumerate(ids):
left = x0 + i * bin_w + gap / 2
width = bin_w - gap
ax_bot.add_patch(
Rectangle(
(left, y), width, p.strip_height,
facecolor=palette[int(cat) % len(palette)],
edgecolor="none",
)
)
# Clean axes: no ticks/labels; keep spines (axes lines) visible
_axes_clean(ax_top)
_axes_clean(ax_bot)
for ax in (ax_top, ax_bot):
for side in ("left", "bottom", "top", "right"):
ax.spines[side].set_visible(True)
out_path.parent.mkdir(parents=True, exist_ok=True)
fig.savefig(out_path, format="svg", transparent=True, bbox_inches="tight", pad_inches=0.0)
plt.close(fig)
def main() -> None:
ap = argparse.ArgumentParser()
ap.add_argument("--out", type=Path, default=Path("synth_ics_optionA.svg"))
ap.add_argument("--seed", type=int, default=7)
ap.add_argument("--seconds", type=float, default=10.0)
ap.add_argument("--fs", type=int, default=300)
ap.add_argument("--curves", type=int, default=3)
ap.add_argument("--bins", type=int, default=40)
ap.add_argument("--vocab", type=int, default=8)
ap.add_argument("--no-align", action="store_true")
ap.add_argument("--no-regime", action="store_true")
ap.add_argument("--no-ghost", action="store_true")
args = ap.parse_args()
p = Params(
seed=args.seed,
seconds=args.seconds,
fs=args.fs,
n_curves=args.curves,
n_bins=args.bins,
disc_vocab=args.vocab,
show_alignment_line=not args.no_align,
show_regime_window=not args.no_regime,
show_real_ghost=not args.no_ghost,
)
draw_optionA(args.out, p)
print(f"Wrote: {args.out}")
if __name__ == "__main__":
main()


@@ -0,0 +1,318 @@
#!/usr/bin/env python3
"""
Option B: "Synthetic ICS Data" as a mini process-story strip (high-level features)
- ONE SVG, transparent background
- Two frames by default: "steady/normal" -> "disturbance/recovery"
- Each frame contains:
- Top: multiple continuous feature curves
- Bottom: discrete/categorical strip (colored blocks)
- A vertical dashed alignment line crossing both
- Optional shaded regime window
- A right-pointing arrow between frames
No text, no numbers (axes lines only). Good for draw.io embedding.
Run:
uv run python draw_synthetic_ics_optionB.py --out ./assets/synth_ics_optionB.svg
"""
from __future__ import annotations
import argparse
from dataclasses import dataclass
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle, FancyArrowPatch
@dataclass
class Params:
seed: int = 7
seconds: float = 8.0
fs: int = 250
# Two-frame story
n_frames: int = 2
# Per-frame visuals
n_curves: int = 3
n_bins: int = 32
disc_vocab: int = 8
# Layout
width_in: float = 8.2
height_in: float = 2.4
# Relative layout inside the figure
margin_left: float = 0.05
margin_right: float = 0.05
margin_bottom: float = 0.12
margin_top: float = 0.10
frame_gap: float = 0.08 # gap (figure fraction) between frames (space for arrow)
# Styling
curve_lw: float = 2.1
ghost_lw: float = 1.8
strip_height: float = 0.65
strip_gap_frac: float = 0.12
# Cues
show_alignment_line: bool = True
align_x_frac: float = 0.60
show_regime_window: bool = True
regime_start_frac: float = 0.30
regime_end_frac: float = 0.46
show_real_ghost: bool = False # keep default off for cleaner story
show_axes_spines: bool = True # axes lines only (no ticks/labels)
# ---------- helpers ----------
def _smooth(x: np.ndarray, win: int) -> np.ndarray:
win = max(3, int(win) | 1)
k = np.ones(win, dtype=float)
k /= k.sum()
return np.convolve(x, k, mode="same")
def _axes_only(ax: plt.Axes, *, keep_spines: bool) -> None:
ax.set_xlabel("")
ax.set_ylabel("")
ax.set_title("")
ax.set_xticks([])
ax.set_yticks([])
ax.tick_params(
axis="both",
which="both",
bottom=False,
left=False,
top=False,
right=False,
labelbottom=False,
labelleft=False,
)
ax.grid(False)
if keep_spines:
for s in ("left", "right", "top", "bottom"):
ax.spines[s].set_visible(True)
else:
for s in ("left", "right", "top", "bottom"):
ax.spines[s].set_visible(False)
def make_frame_continuous(seed: int, seconds: float, fs: int, n_curves: int, style: str) -> tuple[np.ndarray, np.ndarray]:
"""
style:
- "steady": smoother, smaller bumps
- "disturb": larger bumps and more variance
"""
rng = np.random.default_rng(seed)
T = int(seconds * fs)
t = np.linspace(0, seconds, T, endpoint=False)
amp_bump = 0.40 if style == "steady" else 0.85
amp_noise = 0.09 if style == "steady" else 0.14
amp_scale = 0.38 if style == "steady" else 0.46
base_freqs = [0.10, 0.08, 0.12, 0.09]
mid_freqs = [0.65, 0.78, 0.90, 0.72]
Y = []
for i in range(n_curves):
f_slow = base_freqs[i % len(base_freqs)]
f_mid = mid_freqs[i % len(mid_freqs)]
ph = rng.uniform(0, 2 * np.pi)
y = (
0.95 * np.sin(2 * np.pi * f_slow * t + ph)
+ 0.28 * np.sin(2 * np.pi * f_mid * t + 0.65 * ph)
)
bumps = np.zeros_like(t)
n_bumps = 2 if style == "steady" else 3
for _ in range(n_bumps):
mu = rng.uniform(0.9, seconds - 0.9)
sig = rng.uniform(0.35, 0.75) if style == "steady" else rng.uniform(0.20, 0.55)
bumps += np.exp(-0.5 * ((t - mu) / (sig + 1e-9)) ** 2)
y += amp_bump * bumps
noise = _smooth(rng.normal(0, 1, size=T), win=int(fs * 0.04))
y += amp_noise * noise
y = (y - y.mean()) / (y.std() + 1e-9)
y *= amp_scale
Y.append(y)
return t, np.vstack(Y)
def make_frame_discrete(seed: int, n_bins: int, vocab: int, style: str) -> np.ndarray:
"""
style:
- "steady": fewer transitions
- "disturb": more transitions
"""
rng = np.random.default_rng(seed + 111)
ids = np.zeros(n_bins, dtype=int)
p_change = 0.20 if style == "steady" else 0.38
cur = rng.integers(0, vocab)
for i in range(n_bins):
if i == 0 or rng.random() < p_change:
cur = rng.integers(0, vocab)
ids[i] = cur
return ids
def draw_frame(ax_top: plt.Axes, ax_bot: plt.Axes, t: np.ndarray, Y: np.ndarray, ids: np.ndarray, p: Params) -> None:
# Optional cues
x0, x1 = float(t[0]), float(t[-1])
span = x1 - x0
if p.show_regime_window:
rs = x0 + p.regime_start_frac * span
re = x0 + p.regime_end_frac * span
ax_top.axvspan(rs, re, alpha=0.12) # default color
ax_bot.axvspan(rs, re, alpha=0.12)
if p.show_alignment_line:
vx = x0 + p.align_x_frac * span
ax_top.axvline(vx, linestyle="--", linewidth=1.15, alpha=0.7)
ax_bot.axvline(vx, linestyle="--", linewidth=1.15, alpha=0.7)
# Curves
curve_colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#9467bd"]
for i in range(Y.shape[0]):
ax_top.plot(t, Y[i], linewidth=p.curve_lw, color=curve_colors[i % len(curve_colors)])
ymin, ymax = float(Y.min()), float(Y.max())
ypad = 0.10 * (ymax - ymin + 1e-9)
ax_top.set_xlim(x0, x1)
ax_top.set_ylim(ymin - ypad, ymax + ypad)
# Discrete strip
palette = [
"#e41a1c", "#377eb8", "#4daf4a", "#984ea3",
"#ff7f00", "#ffff33", "#a65628", "#f781bf",
]
ax_bot.set_xlim(x0, x1)
ax_bot.set_ylim(0, 1)
n = len(ids)
bin_w = span / n
gap = p.strip_gap_frac * bin_w
y = (1 - p.strip_height) / 2
for i, cat in enumerate(ids):
left = x0 + i * bin_w + gap / 2
width = bin_w - gap
ax_bot.add_patch(
Rectangle((left, y), width, p.strip_height, facecolor=palette[int(cat) % len(palette)], edgecolor="none")
)
# Axes-only style
_axes_only(ax_top, keep_spines=p.show_axes_spines)
_axes_only(ax_bot, keep_spines=p.show_axes_spines)
# ---------- main drawing ----------
def draw_optionB(out_path: Path, p: Params) -> None:
fig = plt.figure(figsize=(p.width_in, p.height_in), dpi=200)
fig.patch.set_alpha(0.0)
# Compute frame layout in figure coordinates
# Each frame has two stacked axes: top curves and bottom strip.
usable_w = 1.0 - p.margin_left - p.margin_right
usable_h = 1.0 - p.margin_bottom - p.margin_top
# Leave gap between frames for arrow
total_gap = p.frame_gap * (p.n_frames - 1)
frame_w = (usable_w - total_gap) / p.n_frames
# Within each frame: vertical split
top_h = usable_h * 0.70
bot_h = usable_h * 0.18
v_gap = usable_h * 0.06
# bottoms
bot_y = p.margin_bottom
top_y = bot_y + bot_h + v_gap
axes_pairs = []
for f in range(p.n_frames):
left = p.margin_left + f * (frame_w + p.frame_gap)
ax_top = fig.add_axes([left, top_y, frame_w, top_h])
ax_bot = fig.add_axes([left, bot_y, frame_w, bot_h], sharex=ax_top)
ax_top.patch.set_alpha(0.0)
ax_bot.patch.set_alpha(0.0)
axes_pairs.append((ax_top, ax_bot))
# Data per frame
styles = ["steady", "disturb"] if p.n_frames == 2 else ["steady"] * (p.n_frames - 1) + ["disturb"]
for idx, ((ax_top, ax_bot), style) in enumerate(zip(axes_pairs, styles)):
t, Y = make_frame_continuous(p.seed + 10 * idx, p.seconds, p.fs, p.n_curves, style=style)
ids = make_frame_discrete(p.seed + 10 * idx, p.n_bins, p.disc_vocab, style=style)
draw_frame(ax_top, ax_bot, t, Y, ids, p)
# Add a visual arrow between frames (in figure coordinates)
if p.n_frames >= 2:
for f in range(p.n_frames - 1):
# center between frame f and f+1
x_left = p.margin_left + f * (frame_w + p.frame_gap) + frame_w
x_right = p.margin_left + (f + 1) * (frame_w + p.frame_gap)
x_mid = (x_left + x_right) / 2
# arrow y in the middle of the frame stack
y_mid = bot_y + (bot_h + v_gap + top_h) / 2
arr = FancyArrowPatch(
(x_mid - 0.015, y_mid),
(x_mid + 0.015, y_mid),
transform=fig.transFigure,
arrowstyle="-|>",
mutation_scale=18,
linewidth=1.6,
color="black",
)
fig.patches.append(arr)
out_path.parent.mkdir(parents=True, exist_ok=True)
fig.savefig(out_path, format="svg", transparent=True, bbox_inches="tight", pad_inches=0.0)
plt.close(fig)
def main() -> None:
ap = argparse.ArgumentParser()
ap.add_argument("--out", type=Path, default=Path("synth_ics_optionB.svg"))
ap.add_argument("--seed", type=int, default=7)
ap.add_argument("--seconds", type=float, default=8.0)
ap.add_argument("--fs", type=int, default=250)
ap.add_argument("--frames", type=int, default=2, choices=[2, 3], help="2 or 3 frames (story strip)")
ap.add_argument("--curves", type=int, default=3)
ap.add_argument("--bins", type=int, default=32)
ap.add_argument("--vocab", type=int, default=8)
ap.add_argument("--no-align", action="store_true")
ap.add_argument("--no-regime", action="store_true")
ap.add_argument("--no-spines", action="store_true")
args = ap.parse_args()
p = Params(
seed=args.seed,
seconds=args.seconds,
fs=args.fs,
n_frames=args.frames,
n_curves=args.curves,
n_bins=args.bins,
disc_vocab=args.vocab,
show_alignment_line=not args.no_align,
show_regime_window=not args.no_regime,
show_axes_spines=not args.no_spines,
)
draw_optionB(args.out, p)
print(f"Wrote: {args.out}")
if __name__ == "__main__":
main()


@@ -0,0 +1,201 @@
#!/usr/bin/env python3
"""
Draw the *Transformer section* lower-half visuals:
- Continuous channels: multiple smooth curves (like the colored trend lines)
- Discrete channels: small colored bars/ticks along the bottom
Output: ONE SVG with transparent background, axes hidden.
Run:
uv run python draw_transformer_lower_half.py --out ./assets/transformer_lower_half.svg
"""
from __future__ import annotations
import argparse
from dataclasses import dataclass
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
@dataclass
class Params:
seed: int = 7
seconds: float = 10.0
fs: int = 300
# Continuous channels
n_curves: int = 3
curve_lw: float = 2.4
# Discrete bars
n_bins: int = 40 # number of discrete bars/ticks across time
bar_height: float = 0.11 # relative height inside bar strip axis
bar_gap: float = 0.08 # gap between bars (fraction of bar width)
# Canvas sizing
width_in: float = 5.8
height_in: float = 1.9
def _smooth(x: np.ndarray, win: int) -> np.ndarray:
win = max(3, int(win) | 1) # odd
k = np.ones(win, dtype=float)
k /= k.sum()
return np.convolve(x, k, mode="same")
def make_continuous_curves(p: Params) -> tuple[np.ndarray, np.ndarray]:
"""
Produce 3 smooth curves with gentle long-term temporal patterning.
Returns:
t: (T,)
Y: (n_curves, T)
"""
rng = np.random.default_rng(p.seed)
T = int(p.seconds * p.fs)
t = np.linspace(0, p.seconds, T, endpoint=False)
Y = []
base_freqs = [0.12, 0.09, 0.15]
mid_freqs = [0.65, 0.85, 0.75]
for i in range(p.n_curves):
f1 = base_freqs[i % len(base_freqs)]
f2 = mid_freqs[i % len(mid_freqs)]
ph = rng.uniform(0, 2 * np.pi)
# Smooth trend + mid wiggle
y = (
1.00 * np.sin(2 * np.pi * f1 * t + ph)
+ 0.35 * np.sin(2 * np.pi * f2 * t + 0.7 * ph)
)
# Add a couple of smooth bumps (like slow pattern changes)
bumps = np.zeros_like(t)
for _ in range(2):
mu = rng.uniform(0.8, p.seconds - 0.8)
sig = rng.uniform(0.35, 0.75)
bumps += np.exp(-0.5 * ((t - mu) / sig) ** 2)
y += 0.55 * bumps
# Mild smooth noise
noise = _smooth(rng.normal(0, 1, size=T), win=int(p.fs * 0.04))
y += 0.12 * noise
# Normalize and compress amplitude to fit nicely
y = (y - y.mean()) / (y.std() + 1e-9)
y *= 0.42
Y.append(y)
return t, np.vstack(Y)
def make_discrete_bars(p: Params) -> np.ndarray:
"""
Generate discrete "token-like" bars across time bins.
Returns:
ids: (n_bins,) integer category ids
"""
rng = np.random.default_rng(p.seed + 123)
n = p.n_bins
# A piecewise-constant sequence with occasional changes (looks like discrete channel)
ids = np.zeros(n, dtype=int)
cur = rng.integers(0, 8)
for i in range(n):
if i == 0 or rng.random() < 0.25:
cur = rng.integers(0, 8)
ids[i] = cur
return ids
def draw_transformer_lower_half_svg(out_path: Path, p: Params) -> None:
# --- Figure + transparent background ---
fig = plt.figure(figsize=(p.width_in, p.height_in), dpi=200)
fig.patch.set_alpha(0.0)
# Two stacked axes: curves (top), bars (bottom)
# Tight, diagram-style layout
ax_curves = fig.add_axes([0.06, 0.28, 0.90, 0.68]) # [left, bottom, width, height]
ax_bars = fig.add_axes([0.06, 0.10, 0.90, 0.14])
ax_curves.patch.set_alpha(0.0)
ax_bars.patch.set_alpha(0.0)
for ax in (ax_curves, ax_bars):
ax.set_axis_off()
# --- Data ---
t, Y = make_continuous_curves(p)
ids = make_discrete_bars(p)
# --- Continuous curves (explicit colors to match the “multi-colored” look) ---
# Feel free to swap these hex colors to match your figure theme.
curve_colors = ["#1f77b4", "#ff7f0e", "#2ca02c"] # blue / orange / green
for i in range(Y.shape[0]):
ax_curves.plot(t, Y[i], linewidth=p.curve_lw, color=curve_colors[i % len(curve_colors)])
# Set curve bounds with padding (keeps it clean)
ymin, ymax = float(Y.min()), float(Y.max())
pad = 0.10 * (ymax - ymin + 1e-9)
ax_curves.set_xlim(t[0], t[-1])
ax_curves.set_ylim(ymin - pad, ymax + pad)
# --- Discrete bars: small colored rectangles along the timeline ---
# A small palette for categories (repeats if more categories appear)
bar_palette = [
"#e41a1c", "#377eb8", "#4daf4a", "#984ea3",
"#ff7f00", "#ffff33", "#a65628", "#f781bf",
]
# Convert bins into time spans
n = len(ids)
x0, x1 = t[0], t[-1]
total = x1 - x0
bin_w = total / n
gap = p.bar_gap * bin_w
# Draw bars in [0,1] y-space inside ax_bars
ax_bars.set_xlim(x0, x1)
ax_bars.set_ylim(0, 1)
for i, cat in enumerate(ids):
left = x0 + i * bin_w + gap / 2
width = bin_w - gap
color = bar_palette[int(cat) % len(bar_palette)]
rect = Rectangle(
(left, (1 - p.bar_height) / 2),
width,
p.bar_height,
facecolor=color,
edgecolor="none",
)
ax_bars.add_patch(rect)
out_path.parent.mkdir(parents=True, exist_ok=True)
fig.savefig(out_path, format="svg", transparent=True, bbox_inches="tight", pad_inches=0.0)
plt.close(fig)
def main() -> None:
ap = argparse.ArgumentParser()
ap.add_argument("--out", type=Path, default=Path("transformer_lower_half.svg"))
ap.add_argument("--seed", type=int, default=7)
ap.add_argument("--seconds", type=float, default=10.0)
ap.add_argument("--fs", type=int, default=300)
ap.add_argument("--bins", type=int, default=40)
args = ap.parse_args()
p = Params(seed=args.seed, seconds=args.seconds, fs=args.fs, n_bins=args.bins)
draw_transformer_lower_half_svg(args.out, p)
print(f"Wrote: {args.out}")
if __name__ == "__main__":
main()


@@ -0,0 +1,202 @@
#!/usr/bin/env python3
"""
Transformer section lower-half visuals WITH AXES ONLY:
- Axes spines visible
- NO numbers (tick labels hidden)
- NO words (axis labels removed)
- Transparent background
- One SVG output
Run:
uv run python draw_transformer_lower_half_axes_only.py --out ./assets/transformer_lower_half_axes_only.svg
"""
from __future__ import annotations
import argparse
from dataclasses import dataclass
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
@dataclass
class Params:
seed: int = 7
seconds: float = 10.0
fs: int = 300
# Continuous channels
n_curves: int = 3
curve_lw: float = 2.4
# Discrete bars
n_bins: int = 40
bar_height: float = 0.55 # fraction of the discrete-axis y-range
bar_gap: float = 0.08 # fraction of bar width
# Figure size
width_in: float = 6.6
height_in: float = 2.6
def _smooth(x: np.ndarray, win: int) -> np.ndarray:
win = max(3, int(win) | 1) # odd
k = np.ones(win, dtype=float)
k /= k.sum()
return np.convolve(x, k, mode="same")
def make_continuous_curves(p: Params) -> tuple[np.ndarray, np.ndarray]:
rng = np.random.default_rng(p.seed)
T = int(p.seconds * p.fs)
t = np.linspace(0, p.seconds, T, endpoint=False)
Y = []
base_freqs = [0.12, 0.09, 0.15]
mid_freqs = [0.65, 0.85, 0.75]
for i in range(p.n_curves):
f1 = base_freqs[i % len(base_freqs)]
f2 = mid_freqs[i % len(mid_freqs)]
ph = rng.uniform(0, 2 * np.pi)
y = (
1.00 * np.sin(2 * np.pi * f1 * t + ph)
+ 0.35 * np.sin(2 * np.pi * f2 * t + 0.7 * ph)
)
bumps = np.zeros_like(t)
for _ in range(2):
mu = rng.uniform(0.8, p.seconds - 0.8)
sig = rng.uniform(0.35, 0.75)
bumps += np.exp(-0.5 * ((t - mu) / sig) ** 2)
y += 0.55 * bumps
noise = _smooth(rng.normal(0, 1, size=T), win=int(p.fs * 0.04))
y += 0.12 * noise
y = (y - y.mean()) / (y.std() + 1e-9)
y *= 0.42
Y.append(y)
return t, np.vstack(Y)
def make_discrete_bars(p: Params) -> np.ndarray:
rng = np.random.default_rng(p.seed + 123)
n = p.n_bins
ids = np.zeros(n, dtype=int)
cur = rng.integers(0, 8)
for i in range(n):
if i == 0 or rng.random() < 0.25:
cur = rng.integers(0, 8)
ids[i] = cur
return ids
def _axes_only(ax: plt.Axes) -> None:
"""Keep spines (axes lines), remove all ticks/labels/words."""
# No labels
ax.set_xlabel("")
ax.set_ylabel("")
ax.set_title("")
# Keep spines as the only axes element
for side in ("top", "right", "bottom", "left"):
ax.spines[side].set_visible(True)
# Remove tick marks and tick labels entirely
ax.set_xticks([])
ax.set_yticks([])
ax.tick_params(
axis="both",
which="both",
bottom=False,
left=False,
top=False,
right=False,
labelbottom=False,
labelleft=False,
)
# No grid
ax.grid(False)
def draw_transformer_lower_half_svg(out_path: Path, p: Params) -> None:
fig = plt.figure(figsize=(p.width_in, p.height_in), dpi=200)
fig.patch.set_alpha(0.0)
# Two axes sharing x (top curves, bottom bars)
ax_curves = fig.add_axes([0.10, 0.38, 0.86, 0.56])
ax_bars = fig.add_axes([0.10, 0.14, 0.86, 0.18], sharex=ax_curves)
ax_curves.patch.set_alpha(0.0)
ax_bars.patch.set_alpha(0.0)
# Data
t, Y = make_continuous_curves(p)
ids = make_discrete_bars(p)
# Top: continuous curves
curve_colors = ["#1f77b4", "#ff7f0e", "#2ca02c"] # blue / orange / green
for i in range(Y.shape[0]):
ax_curves.plot(t, Y[i], linewidth=p.curve_lw, color=curve_colors[i % len(curve_colors)])
ymin, ymax = float(Y.min()), float(Y.max())
ypad = 0.10 * (ymax - ymin + 1e-9)
ax_curves.set_xlim(t[0], t[-1])
ax_curves.set_ylim(ymin - ypad, ymax + ypad)
# Bottom: discrete bars (colored strip)
bar_palette = [
"#e41a1c", "#377eb8", "#4daf4a", "#984ea3",
"#ff7f00", "#ffff33", "#a65628", "#f781bf",
]
x0, x1 = t[0], t[-1]
total = x1 - x0
n = len(ids)
bin_w = total / n
gap = p.bar_gap * bin_w
ax_bars.set_xlim(x0, x1)
ax_bars.set_ylim(0, 1)
bar_y = (1 - p.bar_height) / 2
for i, cat in enumerate(ids):
left = x0 + i * bin_w + gap / 2
width = bin_w - gap
color = bar_palette[int(cat) % len(bar_palette)]
ax_bars.add_patch(Rectangle((left, bar_y), width, p.bar_height, facecolor=color, edgecolor="none"))
# Apply "axes only" styling (no numbers/words)
_axes_only(ax_curves)
_axes_only(ax_bars)
out_path.parent.mkdir(parents=True, exist_ok=True)
fig.savefig(out_path, format="svg", transparent=True, bbox_inches="tight", pad_inches=0.0)
plt.close(fig)
def main() -> None:
ap = argparse.ArgumentParser()
ap.add_argument("--out", type=Path, default=Path("transformer_lower_half_axes_only.svg"))
ap.add_argument("--seed", type=int, default=7)
ap.add_argument("--seconds", type=float, default=10.0)
ap.add_argument("--fs", type=int, default=300)
ap.add_argument("--bins", type=int, default=40)
ap.add_argument("--curves", type=int, default=3)
args = ap.parse_args()
p = Params(seed=args.seed, seconds=args.seconds, fs=args.fs, n_bins=args.bins, n_curves=args.curves)
draw_transformer_lower_half_svg(args.out, p)
print(f"Wrote: {args.out}")
if __name__ == "__main__":
main()


@@ -0,0 +1,161 @@
#!/usr/bin/env python3
"""
Generate "Noisy Residual" and "Denoised Residual" curves as SVGs.
- Produces TWO separate SVG files:
noisy_residual.svg
denoised_residual.svg
- Curves are synthetic but shaped like residual noise + denoised residual.
- Uses only matplotlib + numpy.
"""
from __future__ import annotations
import argparse
from dataclasses import dataclass
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
@dataclass
class CurveParams:
seconds: float = 12.0 # length of the signal
fs: int = 250 # samples per second
seed: int = 7 # RNG seed for reproducibility
base_amp: float = 0.12 # smooth baseline amplitude
noise_amp: float = 0.55 # high-frequency noise amplitude
burst_amp: float = 1.2 # occasional spike amplitude
burst_rate_hz: float = 0.35 # average spike frequency
denoise_smooth_ms: float = 120 # smoothing window for "denoised" (ms)
def gaussian_smooth(x: np.ndarray, sigma_samples: float) -> np.ndarray:
"""Gaussian smoothing using explicit kernel convolution (no SciPy dependency)."""
if sigma_samples <= 0:
return x.copy()
radius = int(np.ceil(4 * sigma_samples))
k = np.arange(-radius, radius + 1, dtype=float)
kernel = np.exp(-(k**2) / (2 * sigma_samples**2))
kernel /= kernel.sum()
return np.convolve(x, kernel, mode="same")
def make_residual(params: CurveParams) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
"""
Create synthetic residual:
- baseline: smooth wavy trend + slight drift
- noise: band-limited-ish high-frequency noise
- bursts: sparse spikes / impulse-like events
Returns: (t, noisy, denoised)
"""
rng = np.random.default_rng(params.seed)
n = int(params.seconds * params.fs)
t = np.linspace(0, params.seconds, n, endpoint=False)
# Smooth baseline (small): combination of sinusoids + small random drift
baseline = (
0.7 * np.sin(2 * np.pi * 0.35 * t + 0.2)
+ 0.35 * np.sin(2 * np.pi * 0.9 * t + 1.2)
+ 0.25 * np.sin(2 * np.pi * 0.15 * t + 2.0)
)
baseline *= params.base_amp
drift = np.cumsum(rng.normal(0, 1, size=n))
drift = drift / (np.max(np.abs(drift)) + 1e-9) * (params.base_amp * 0.25)
baseline = baseline + drift
# High-frequency noise: whitened then lightly smoothed to look "oscillatory"
raw = rng.normal(0, 1, size=n)
hf = raw - gaussian_smooth(raw, sigma_samples=params.fs * 0.03) # remove slow part
hf = hf / (np.std(hf) + 1e-9)
hf *= params.noise_amp
# Bursts/spikes: Poisson process impulses convolved with short kernel
expected_bursts = params.burst_rate_hz * params.seconds
k_bursts = rng.poisson(expected_bursts)
impulses = np.zeros(n)
if k_bursts > 0:
idx = rng.integers(0, n, size=k_bursts)
impulses[idx] = rng.normal(loc=1.0, scale=0.4, size=k_bursts)
# Shape impulses into spikes (asymmetric bump)
spike_kernel_len = int(params.fs * 0.06) # ~60ms
spike_kernel_len = max(spike_kernel_len, 7)
spike_t = np.arange(spike_kernel_len)
spike_kernel = np.exp(-spike_t / (params.fs * 0.012)) # fast decay
spike_kernel *= np.hanning(spike_kernel_len) # taper
spike_kernel /= (spike_kernel.max() + 1e-9)
bursts = np.convolve(impulses, spike_kernel, mode="same")
bursts *= params.burst_amp
noisy = baseline + hf + bursts
# "Denoised": remove high-frequency using Gaussian smoothing,
# but keep spike structures partially.
smooth_sigma = (params.denoise_smooth_ms / 1000.0) * params.fs / 3.0
denoised = gaussian_smooth(noisy, sigma_samples=smooth_sigma)
return t, noisy, denoised
def save_curve_svg(
t: np.ndarray,
y: np.ndarray,
out_path: Path,
*,
width_in: float = 5.4,
height_in: float = 1.6,
lw: float = 2.2,
pad: float = 0.03,
) -> None:
"""
Save a clean, figure-only SVG suitable for embedding in diagrams.
- No axes, ticks, labels.
- Tight bounding box.
"""
fig = plt.figure(figsize=(width_in, height_in), dpi=200)
ax = fig.add_axes([pad, pad, 1 - 2 * pad, 1 - 2 * pad])
ax.plot(t, y, linewidth=lw)
# Make it "icon-like" for diagrams: no axes or frames
ax.set_axis_off()
# Ensure bounds include a little padding
ymin, ymax = np.min(y), np.max(y)
ypad = 0.08 * (ymax - ymin + 1e-9)
ax.set_xlim(t[0], t[-1])
ax.set_ylim(ymin - ypad, ymax + ypad)
out_path.parent.mkdir(parents=True, exist_ok=True)
fig.savefig(out_path, format="svg", bbox_inches="tight", pad_inches=0.0)
plt.close(fig)
def main() -> None:
ap = argparse.ArgumentParser()
ap.add_argument("--outdir", type=Path, default=Path("."), help="Output directory")
ap.add_argument("--seed", type=int, default=7, help="RNG seed")
ap.add_argument("--seconds", type=float, default=12.0, help="Signal length (s)")
ap.add_argument("--fs", type=int, default=250, help="Sampling rate (Hz)")
ap.add_argument("--prefix", type=str, default="", help="Filename prefix (optional)")
args = ap.parse_args()
params = CurveParams(seconds=args.seconds, fs=args.fs, seed=args.seed)
t, noisy, denoised = make_residual(params)
noisy_path = args.outdir / f"{args.prefix}noisy_residual.svg"
den_path = args.outdir / f"{args.prefix}denoised_residual.svg"
save_curve_svg(t, noisy, noisy_path)
save_curve_svg(t, denoised, den_path)
print(f"Wrote:\n {noisy_path}\n {den_path}")
if __name__ == "__main__":
main()


@@ -0,0 +1,188 @@
#!/usr/bin/env python3
"""
DDPM-like residual curve SVGs (separate files, fixed colors):
- noisy_residual.svg (blue)
- denoised_residual.svg (purple)
"""
from __future__ import annotations
import argparse
from dataclasses import dataclass
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
@dataclass
class DDPMStyleParams:
seconds: float = 12.0
fs: int = 250
seed: int = 7
baseline_amp: float = 0.10
mid_wiggle_amp: float = 0.18
colored_noise_amp: float = 0.65
colored_alpha: float = 1.0
burst_rate_hz: float = 0.30
burst_amp: float = 0.9
burst_width_ms: float = 55
denoise_sigmas_ms: tuple[float, ...] = (25, 60, 140)
denoise_weights: tuple[float, ...] = (0.25, 0.35, 0.40)
denoise_texture_keep: float = 0.10
def gaussian_smooth(x: np.ndarray, sigma_samples: float) -> np.ndarray:
if sigma_samples <= 0:
return x.copy()
radius = int(np.ceil(4 * sigma_samples))
k = np.arange(-radius, radius + 1, dtype=float)
kernel = np.exp(-(k**2) / (2 * sigma_samples**2))
kernel /= kernel.sum()
return np.convolve(x, kernel, mode="same")
def colored_noise_1_f(n: int, rng: np.random.Generator, alpha: float) -> np.ndarray:
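# Shape white noise in the frequency domain: scaling the rFFT coefficients by
# 1/f**(alpha/2) yields a power spectrum proportional to 1/f**alpha
# (pink noise for alpha = 1); the result is then zero-meaned and unit-scaled.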
white = rng.normal(0, 1, size=n)
spec = np.fft.rfft(white)
freqs = np.fft.rfftfreq(n, d=1.0)
scale = np.ones_like(freqs)
nonzero = freqs > 0
scale[nonzero] = 1.0 / (freqs[nonzero] ** (alpha / 2.0))
spec *= scale
x = np.fft.irfft(spec, n=n)
x = x - np.mean(x)
x = x / (np.std(x) + 1e-9)
return x
def make_ddpm_like_residual(p: DDPMStyleParams) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
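# Recipe: slow sinusoidal baseline + mid-frequency wiggle + 1/f colored noise
# + Poisson-timed bursts; the "denoised" curve blends Gaussian smooths at several
# widths and keeps a small amount of high-frequency texture.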
rng = np.random.default_rng(p.seed)
n = int(p.seconds * p.fs)
t = np.linspace(0, p.seconds, n, endpoint=False)
baseline = (
0.8 * np.sin(2 * np.pi * 0.18 * t + 0.4)
+ 0.35 * np.sin(2 * np.pi * 0.06 * t + 2.2)
) * p.baseline_amp
mid = (
0.9 * np.sin(2 * np.pi * 0.9 * t + 1.1)
+ 0.5 * np.sin(2 * np.pi * 1.6 * t + 0.2)
+ 0.3 * np.sin(2 * np.pi * 2.4 * t + 2.6)
) * p.mid_wiggle_amp
col = colored_noise_1_f(n, rng, alpha=p.colored_alpha) * p.colored_noise_amp
expected = p.burst_rate_hz * p.seconds
k = rng.poisson(expected)
impulses = np.zeros(n)
if k > 0:
idx = rng.integers(0, n, size=k)
impulses[idx] = rng.normal(loc=1.0, scale=0.35, size=k)
width = max(int(p.fs * (p.burst_width_ms / 1000.0)), 7)
u = np.arange(width)
kernel = np.exp(-u / (p.fs * 0.012)) * np.hanning(width)
kernel /= (kernel.max() + 1e-9)
bursts = np.convolve(impulses, kernel, mode="same") * p.burst_amp
noisy = baseline + mid + col + bursts
sigmas_samples = [(ms / 1000.0) * p.fs / 3.0 for ms in p.denoise_sigmas_ms]
smooths = [gaussian_smooth(noisy, s) for s in sigmas_samples]
den_base = np.zeros_like(noisy)
for w, sm in zip(p.denoise_weights, smooths):
den_base += w * sm
hf = noisy - gaussian_smooth(noisy, sigma_samples=p.fs * 0.03)
denoised = den_base + p.denoise_texture_keep * (hf / (np.std(hf) + 1e-9)) * (0.10 * np.std(den_base))
return t, noisy, denoised
def save_single_curve_svg(
t: np.ndarray,
y: np.ndarray,
out_path: Path,
*,
color: str,
lw: float = 2.2,
) -> None:
fig = plt.figure(figsize=(5.4, 1.6), dpi=200)
# Make figure background transparent
fig.patch.set_alpha(0.0)
ax = fig.add_axes([0.03, 0.03, 0.94, 0.94])
# Make axes background transparent
ax.patch.set_alpha(0.0)
ax.plot(t, y, linewidth=lw, color=color)
# clean, diagram-friendly
ax.set_axis_off()
ymin, ymax = np.min(y), np.max(y)
ypad = 0.08 * (ymax - ymin + 1e-9)
ax.set_xlim(t[0], t[-1])
ax.set_ylim(ymin - ypad, ymax + ypad)
out_path.parent.mkdir(parents=True, exist_ok=True)
fig.savefig(
out_path,
format="svg",
bbox_inches="tight",
pad_inches=0.0,
transparent=True, # <-- key for transparent output
)
plt.close(fig)
def main() -> None:
ap = argparse.ArgumentParser()
ap.add_argument("--outdir", type=Path, default=Path("."))
ap.add_argument("--seed", type=int, default=7)
ap.add_argument("--seconds", type=float, default=12.0)
ap.add_argument("--fs", type=int, default=250)
ap.add_argument("--alpha", type=float, default=1.0)
ap.add_argument("--noise-amp", type=float, default=0.65)
ap.add_argument("--texture-keep", type=float, default=0.10)
ap.add_argument("--prefix", type=str, default="")
args = ap.parse_args()
p = DDPMStyleParams(
seconds=args.seconds,
fs=args.fs,
seed=args.seed,
colored_alpha=args.alpha,
colored_noise_amp=args.noise_amp,
denoise_texture_keep=args.texture_keep,
)
t, noisy, den = make_ddpm_like_residual(p)
outdir = args.outdir
noisy_path = outdir / f"{args.prefix}noisy_residual.svg"
den_path = outdir / f"{args.prefix}denoised_residual.svg"
# Fixed colors as you requested
save_single_curve_svg(t, noisy, noisy_path, color="blue")
save_single_curve_svg(t, den, den_path, color="purple")
print("Wrote:")
print(f" {noisy_path}")
print(f" {den_path}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,10 @@
[project]
name = "fig-gen-ddpm"
version = "0.1.0"
description = "Python scripts for generating the figures used in the Mask-DDPM paper"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
"numpy>=1.26",
"matplotlib>=3.8",
]

View File

@@ -0,0 +1,240 @@
#!/usr/bin/env python3
"""
3D "final combined outcome" (time × channel × value) with:
- NO numbers on axes (tick labels removed)
- Axis *titles* kept (texts are okay)
- Reduced whitespace: tight bbox + minimal margins
- White background (non-transparent) suitable for embedding into another SVG
Output:
default PNG, optional SVG (2D projected vectors)
Run:
uv run python synth_ics_3d_waterfall_tight.py --out ./assets/synth_ics_3d.png
uv run python synth_ics_3d_waterfall_tight.py --out ./assets/synth_ics_3d.svg --format svg
"""
from __future__ import annotations
import argparse
from dataclasses import dataclass
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
@dataclass
class Params:
seed: int = 7
seconds: float = 10.0
fs: int = 220
n_cont: int = 5
n_disc: int = 2
disc_vocab: int = 8
disc_change_rate_hz: float = 1.1
# view
elev: float = 25.0
azim: float = -58.0
# figure size (smaller, more "cube-like")
fig_w: float = 5.4
fig_h: float = 5.0
# discrete rendering
disc_z_scale: float = 0.45
disc_z_offset: float = -1.4
# margins (figure fraction)
left: float = 0.03
right: float = 0.99
bottom: float = 0.03
top: float = 0.99
def _smooth(x: np.ndarray, win: int) -> np.ndarray:
win = max(3, int(win) | 1)
k = np.ones(win, dtype=float)
k /= k.sum()
return np.convolve(x, k, mode="same")
def make_continuous(p: Params) -> tuple[np.ndarray, np.ndarray]:
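# Each continuous channel: a pair of low/mid-frequency sinusoids + a few Gaussian
# bumps + lightly smoothed noise, z-scored per channel.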
rng = np.random.default_rng(p.seed)
T = int(p.seconds * p.fs)
t = np.linspace(0, p.seconds, T, endpoint=False)
Y = []
base_freqs = [0.08, 0.10, 0.12, 0.09, 0.11]
mid_freqs = [0.55, 0.70, 0.85, 0.62, 0.78]
for i in range(p.n_cont):
f1 = base_freqs[i % len(base_freqs)]
f2 = mid_freqs[i % len(mid_freqs)]
ph = rng.uniform(0, 2 * np.pi)
y = (
0.95 * np.sin(2 * np.pi * f1 * t + ph)
+ 0.28 * np.sin(2 * np.pi * f2 * t + 0.65 * ph)
)
bumps = np.zeros_like(t)
for _ in range(rng.integers(2, 4)):
mu = rng.uniform(0.8, p.seconds - 0.8)
sig = rng.uniform(0.25, 0.80)
bumps += np.exp(-0.5 * ((t - mu) / (sig + 1e-9)) ** 2)
y += 0.55 * bumps
noise = _smooth(rng.normal(0, 1, size=T), win=int(p.fs * 0.05))
y += 0.10 * noise
y = (y - y.mean()) / (y.std() + 1e-9)
Y.append(y)
return t, np.vstack(Y) # (n_cont, T)
def make_discrete(p: Params, t: np.ndarray) -> np.ndarray:
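# Piecewise-constant channels: draw Poisson-many change points, then hold a random
# vocabulary id on each segment (re-drawn with probability 0.85 at each boundary
# after the first).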
rng = np.random.default_rng(p.seed + 123)
T = len(t)
expected_changes = max(1, int(p.seconds * p.disc_change_rate_hz))
X = np.zeros((p.n_disc, T), dtype=int)
for c in range(p.n_disc):
k = rng.poisson(expected_changes) + 1
pts = np.unique(rng.integers(0, T, size=k))
pts = np.sort(np.concatenate([[0], pts, [T]]))
cur = rng.integers(0, p.disc_vocab)
for a, b in zip(pts[:-1], pts[1:]):
if a != 0 and rng.random() < 0.85:
cur = rng.integers(0, p.disc_vocab)
X[c, a:b] = cur
return X
def style_3d_axes(ax):
# Make panes white but less visually heavy
try:
# Keep pane fill ON (white background) but reduce edge prominence
ax.xaxis.pane.set_edgecolor("0.7")
ax.yaxis.pane.set_edgecolor("0.7")
ax.zaxis.pane.set_edgecolor("0.7")
except Exception:
pass
ax.grid(True, linewidth=0.4, alpha=0.30)
def remove_tick_numbers_keep_axis_titles(ax):
# Remove tick labels (numbers) and tick marks, keep axis titles
ax.set_xticklabels([])
ax.set_yticklabels([])
ax.set_zticklabels([])
ax.tick_params(
axis="both",
which="both",
length=0, # no tick marks
pad=0,
)
# 3D has separate tick_params for z on some versions; this still works broadly:
try:
ax.zaxis.set_tick_params(length=0, pad=0)
except Exception:
pass
def main() -> None:
ap = argparse.ArgumentParser()
ap.add_argument("--out", type=Path, default=Path("synth_ics_3d.png"))
ap.add_argument("--format", choices=["png", "svg"], default="png")
ap.add_argument("--seed", type=int, default=7)
ap.add_argument("--seconds", type=float, default=10.0)
ap.add_argument("--fs", type=int, default=220)
ap.add_argument("--n-cont", type=int, default=5)
ap.add_argument("--n-disc", type=int, default=2)
ap.add_argument("--disc-vocab", type=int, default=8)
ap.add_argument("--disc-rate", type=float, default=1.1)
ap.add_argument("--elev", type=float, default=25.0)
ap.add_argument("--azim", type=float, default=-58.0)
ap.add_argument("--fig-w", type=float, default=5.4)
ap.add_argument("--fig-h", type=float, default=5.0)
ap.add_argument("--disc-z-scale", type=float, default=0.45)
ap.add_argument("--disc-z-offset", type=float, default=-1.4)
args = ap.parse_args()
p = Params(
seed=args.seed,
seconds=args.seconds,
fs=args.fs,
n_cont=args.n_cont,
n_disc=args.n_disc,
disc_vocab=args.disc_vocab,
disc_change_rate_hz=args.disc_rate,
elev=args.elev,
azim=args.azim,
fig_w=args.fig_w,
fig_h=args.fig_h,
disc_z_scale=args.disc_z_scale,
disc_z_offset=args.disc_z_offset,
)
t, Yc = make_continuous(p)
Xd = make_discrete(p, t)
fig = plt.figure(figsize=(p.fig_w, p.fig_h), dpi=220, facecolor="white")
ax = fig.add_subplot(111, projection="3d")
style_3d_axes(ax)
# Reduce whitespace around axes (tight placement)
fig.subplots_adjust(left=p.left, right=p.right, bottom=p.bottom, top=p.top)
# Draw continuous channels
for i in range(p.n_cont):
y = np.full_like(t, fill_value=i, dtype=float)
z = Yc[i]
ax.plot(t, y, z, linewidth=2.0)
# Draw discrete channels as steps
for j in range(p.n_disc):
ch = p.n_cont + j
y = np.full_like(t, fill_value=ch, dtype=float)
z = p.disc_z_offset + p.disc_z_scale * Xd[j].astype(float)
ax.step(t, y, z, where="post", linewidth=2.2)
# Axis titles kept
ax.set_xlabel("time")
ax.set_ylabel("channel")
ax.set_zlabel("value")
# Remove numeric tick labels + tick marks
remove_tick_numbers_keep_axis_titles(ax)
# Camera
ax.view_init(elev=p.elev, azim=p.azim)
# Save tightly (minimize white border)
args.out.parent.mkdir(parents=True, exist_ok=True)
save_kwargs = dict(bbox_inches="tight", pad_inches=0.03, facecolor="white")
if args.format == "svg" or args.out.suffix.lower() == ".svg":
fig.savefig(args.out, format="svg", **save_kwargs)
else:
fig.savefig(args.out, format="png", **save_kwargs)
plt.close(fig)
print(f"Wrote: {args.out}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,262 @@
#!/usr/bin/env python3
"""
Transformer-ish "trend" visuals with NO equations:
- attention_weights.svg : heatmap-like attention map (looks like "Transformer attends to positions")
- token_activation_trends.svg: multiple token-channel curves (continuous trends)
- discrete_tokens.svg : step-like discrete channel trends (optional)
All SVGs have transparent background and no axes (diagram-friendly).
"""
from __future__ import annotations
import argparse
from dataclasses import dataclass
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
# ----------------------------
# Synthetic data generators
# ----------------------------
@dataclass
class Params:
seed: int = 7
T: int = 24 # sequence length (positions)
n_heads: int = 4 # attention heads to blend/choose
n_curves: int = 7 # curves in token_activation_trends
seconds: float = 10.0
fs: int = 200
def _gaussian(x: np.ndarray, mu: float, sig: float) -> np.ndarray:
return np.exp(-0.5 * ((x - mu) / (sig + 1e-9)) ** 2)
def make_attention_map(T: int, rng: np.random.Generator, mode: str) -> np.ndarray:
"""
Create a transformer-like attention weight matrix A (T x T) with different visual styles:
- "local": mostly near-diagonal attention
- "global": some global tokens attend broadly
- "causal": lower-triangular (decoder-like) with local preference
"""
i = np.arange(T)[:, None] # query positions
j = np.arange(T)[None, :] # key positions
if mode == "local":
logits = -((i - j) ** 2) / (2 * (2.2 ** 2))
logits += 0.15 * rng.normal(size=(T, T))
elif mode == "global":
logits = -((i - j) ** 2) / (2 * (3.0 ** 2))
# Add a few "global" key positions that many queries attend to
globals_ = rng.choice(T, size=max(2, T // 10), replace=False)
for g in globals_:
logits += 1.2 * _gaussian(j, mu=g, sig=1.0)
logits += 0.12 * rng.normal(size=(T, T))
elif mode == "causal":
logits = -((i - j) ** 2) / (2 * (2.0 ** 2))
logits += 0.12 * rng.normal(size=(T, T))
logits = np.where(j <= i, logits, -1e9) # mask future
else:
raise ValueError(f"Unknown attention mode: {mode}")
# softmax rows
logits = logits - np.max(logits, axis=1, keepdims=True)
A = np.exp(logits)
A /= (np.sum(A, axis=1, keepdims=True) + 1e-9)
return A
def make_token_activation_trends(p: Params) -> tuple[np.ndarray, np.ndarray]:
"""
Multiple smooth curves that feel like "representations evolving across layers/time".
Returns:
t: (N,)
Y: (n_curves, N)
"""
rng = np.random.default_rng(p.seed)
N = int(p.seconds * p.fs)
t = np.linspace(0, p.seconds, N, endpoint=False)
Y = []
for k in range(p.n_curves):
# Multi-scale smooth components + some bursty response
f1 = 0.10 + 0.04 * k
f2 = 0.60 + 0.18 * (k % 3)
phase = rng.uniform(0, 2 * np.pi)
base = 0.9 * np.sin(2 * np.pi * f1 * t + phase) + 0.35 * np.sin(2 * np.pi * f2 * t + 0.7 * phase)
# "attention-like gating": a few bumps where the curve spikes smoothly
bumps = np.zeros_like(t)
for _ in range(rng.integers(2, 5)):
mu = rng.uniform(0.5, p.seconds - 0.5)
sig = rng.uniform(0.15, 0.55)
bumps += 0.9 * _gaussian(t, mu=mu, sig=sig)
noise = rng.normal(0, 1, size=N)
noise = np.convolve(noise, np.ones(11) / 11.0, mode="same") # smooth noise
y = base + 0.85 * bumps + 0.12 * noise
# normalize and vertically offset
y = (y - y.mean()) / (y.std() + 1e-9)
y = 0.75 * y + 0.18 * k
Y.append(y)
return t, np.vstack(Y)
def make_discrete_trends(p: Params, vocab: int = 9, change_rate_hz: float = 1.3) -> tuple[np.ndarray, np.ndarray]:
"""
Discrete step-like channels: useful if you want a "token-id / discrete feature" feel.
Returns:
t: (N,)
X: (n_curves, N) integers
"""
rng = np.random.default_rng(p.seed + 123)
N = int(p.seconds * p.fs)
t = np.linspace(0, p.seconds, N, endpoint=False)
expected = max(1, int(p.seconds * change_rate_hz))
X = np.zeros((p.n_curves, N), dtype=int)
for c in range(p.n_curves):
k = rng.poisson(expected) + 1
pts = np.unique(rng.integers(0, N, size=k))
pts = np.sort(np.concatenate([[0], pts, [N]]))
cur = rng.integers(0, vocab)
for a, b in zip(pts[:-1], pts[1:]):
if a != 0 and rng.random() < 0.9:
cur = rng.integers(0, vocab)
X[c, a:b] = cur
return t, X
# ----------------------------
# Plot helpers (SVG, transparent, axes-free)
# ----------------------------
def _transparent_fig_ax(width_in: float, height_in: float):
fig = plt.figure(figsize=(width_in, height_in), dpi=200)
fig.patch.set_alpha(0.0)
ax = fig.add_axes([0.03, 0.03, 0.94, 0.94])
ax.patch.set_alpha(0.0)
ax.set_axis_off()
return fig, ax
def save_attention_svg(A: np.ndarray, out: Path, *, show_colorbar: bool = False) -> None:
fig, ax = _transparent_fig_ax(4.2, 4.2)
# Using default colormap (no explicit color specification)
im = ax.imshow(A, aspect="equal", interpolation="nearest")
if show_colorbar:
# colorbar can be useful, but it adds clutter in diagrams
cax = fig.add_axes([0.92, 0.10, 0.03, 0.80])
cb = fig.colorbar(im, cax=cax)
cb.outline.set_linewidth(1.0)
out.parent.mkdir(parents=True, exist_ok=True)
fig.savefig(out, format="svg", bbox_inches="tight", pad_inches=0.0, transparent=True)
plt.close(fig)
def save_multi_curve_svg(t: np.ndarray, Y: np.ndarray, out: Path, *, lw: float = 2.0) -> None:
fig, ax = _transparent_fig_ax(6.0, 2.2)
for i in range(Y.shape[0]):
ax.plot(t, Y[i], linewidth=lw)
y_all = Y.reshape(-1)
ymin, ymax = float(np.min(y_all)), float(np.max(y_all))
ypad = 0.08 * (ymax - ymin + 1e-9)
ax.set_xlim(t[0], t[-1])
ax.set_ylim(ymin - ypad, ymax + ypad)
out.parent.mkdir(parents=True, exist_ok=True)
fig.savefig(out, format="svg", bbox_inches="tight", pad_inches=0.0, transparent=True)
plt.close(fig)
def save_discrete_svg(t: np.ndarray, X: np.ndarray, out: Path, *, lw: float = 2.0, spacing: float = 1.25) -> None:
fig, ax = _transparent_fig_ax(6.0, 2.2)
for i in range(X.shape[0]):
y = X[i].astype(float) + i * spacing
ax.step(t, y, where="post", linewidth=lw)
y_all = (X.astype(float) + np.arange(X.shape[0])[:, None] * spacing).reshape(-1)
ymin, ymax = float(np.min(y_all)), float(np.max(y_all))
ypad = 0.10 * (ymax - ymin + 1e-9)
ax.set_xlim(t[0], t[-1])
ax.set_ylim(ymin - ypad, ymax + ypad)
out.parent.mkdir(parents=True, exist_ok=True)
fig.savefig(out, format="svg", bbox_inches="tight", pad_inches=0.0, transparent=True)
plt.close(fig)
# ----------------------------
# CLI
# ----------------------------
def main() -> None:
ap = argparse.ArgumentParser()
ap.add_argument("--outdir", type=Path, default=Path("out"))
ap.add_argument("--seed", type=int, default=7)
# attention
ap.add_argument("--T", type=int, default=24)
ap.add_argument("--attn-mode", type=str, default="local", choices=["local", "global", "causal"])
ap.add_argument("--colorbar", action="store_true")
# curves
ap.add_argument("--seconds", type=float, default=10.0)
ap.add_argument("--fs", type=int, default=200)
ap.add_argument("--n-curves", type=int, default=7)
# discrete optional
ap.add_argument("--with-discrete", action="store_true")
ap.add_argument("--disc-vocab", type=int, default=9)
ap.add_argument("--disc-rate", type=float, default=1.3)
args = ap.parse_args()
p = Params(
seed=args.seed,
T=args.T,
n_curves=args.n_curves,
seconds=args.seconds,
fs=args.fs,
)
rng = np.random.default_rng(args.seed)
# 1) attention map
A = make_attention_map(args.T, rng, mode=args.attn_mode)
save_attention_svg(A, args.outdir / "attention_weights.svg", show_colorbar=args.colorbar)
# 2) continuous trends
t, Y = make_token_activation_trends(p)
save_multi_curve_svg(t, Y, args.outdir / "token_activation_trends.svg")
# 3) discrete trends (optional)
if args.with_discrete:
td, X = make_discrete_trends(p, vocab=args.disc_vocab, change_rate_hz=args.disc_rate)
save_discrete_svg(td, X, args.outdir / "discrete_tokens.svg")
print("Wrote:")
print(f" {args.outdir / 'attention_weights.svg'}")
print(f" {args.outdir / 'token_activation_trends.svg'}")
if args.with_discrete:
print(f" {args.outdir / 'discrete_tokens.svg'}")
if __name__ == "__main__":
main()

302
arxiv-style/main-ieee.tex Normal file
View File

@@ -0,0 +1,302 @@
\documentclass[conference]{IEEEtran}
\IEEEoverridecommandlockouts
\usepackage{cite}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{graphicx}
\usepackage{url}
\usepackage{textcomp}
\usepackage{xcolor}
\usepackage{booktabs}
\usepackage{bm}
\title{Mask-DDPM: Transformer-Conditioned Mixed-Type Diffusion for Semantically Valid ICS Telemetry Synthesis}
% If the date is not needed, uncomment the line below
\author{
\IEEEauthorblockN{Zhenglan Chen}
\IEEEauthorblockA{\textit{Aberdeen Institute of Data Science and Artificial Intelligence} \\
\textit{South China Normal University}\\
Guangzhou, China \\
email address or ORCID}
\and
\IEEEauthorblockN{Mingzhe Yang}
\IEEEauthorblockA{\textit{Aberdeen Institute of Data Science and Artificial Intelligence} \\
\textit{South China Normal University}\\
Guangzhou, China \\
email address or ORCID}
\and
\IEEEauthorblockN{Hongyu Yan}
\IEEEauthorblockA{\textit{Aberdeen Institute of Data Science and Artificial Intelligence} \\
\textit{South China Normal University}\\
Guangzhou, China \\
email address or ORCID}
\and
\IEEEauthorblockN{Huan Yang}
\IEEEauthorblockA{\textit{dept. name of organization (of Aff.)} \\
\textit{South China Normal University}\\
Guangzhou, China \\
email address or ORCID}
}
% Header settings
%%% PDF metadata
\begin{document}
\maketitle
\begin{abstract}
Industrial control systems (ICS) security research is increasingly constrained by the scarcity and non-shareability of realistic traffic and telemetry, especially for attack scenarios. To mitigate this bottleneck, we study synthetic generation at the protocol feature/telemetry level, where samples must simultaneously preserve temporal coherence, match continuous marginal distributions, and keep discrete supervisory variables strictly within valid vocabularies. We propose Mask-DDPM, a hybrid framework tailored to mixed-type, multi-scale ICS sequences. Mask-DDPM factorizes generation into (i) a causal Transformer trend module that rolls out a stable long-horizon temporal scaffold for continuous channels, (ii) a trend-conditioned residual DDPM that refines local stochastic structure and heavy-tailed fluctuations without degrading global dynamics, (iii) a masked (absorbing) diffusion branch for discrete variables that guarantees categorical legality by construction, and (iv) a type-aware decomposition/routing layer that aligns modeling mechanisms with heterogeneous ICS variable origins and enforces deterministic reconstruction where appropriate. Evaluated on fixed-length windows (L=96) derived from the HAI Security Dataset, Mask-DDPM achieves stable fidelity across seeds with mean KS = 0.3311 ± 0.0079 (continuous), mean JSD = 0.0284 ± 0.0073 (discrete), and mean absolute lag-1 autocorrelation difference = 0.2684 ± 0.0027, indicating faithful marginals, preserved short-horizon dynamics, and valid discrete semantics. The resulting generator provides a reproducible basis for data augmentation, benchmarking, and downstream ICS protocol reconstruction workflows.
\end{abstract}
% Keywords
\begin{IEEEkeywords}Machine Learning, Cyber Defense, ICS\end{IEEEkeywords}
% 1. Introduction
\section{Introduction}
\label{sec:intro}
Industrial control systems (ICS) form the backbone of modern critical infrastructure, which includes power grids, water treatment, manufacturing, and transportation, among others. These systems monitor, regulate, and automate physical processes through sensors, actuators, programmable logic controllers (PLCs), and monitoring software. Unlike conventional IT systems, ICS operate in real time, closely coupled with physical processes and safety-critical constraints, using heterogeneous and legacy communication protocols such as Modbus/TCP and DNP3 that were not originally designed with robust security in mind. This architectural complexity and operational criticality make ICS high-impact targets for cyber attacks, where disruptions can result in physical damage, environmental harm, and even loss of life. Recent reviews of ICS security highlight the expanding attack surface due to increased connectivity, legacy systems' vulnerabilities, and the inadequacy of traditional security controls in capturing the nuances of ICS networks and protocols \cite{10.1007/s10844-022-00753-1,Nankya2023-gp}.
While machine learning (ML) techniques have shown promise for anomaly detection and automated cybersecurity within ICS, they rely heavily on labeled datasets that capture both benign operations and diverse attack patterns. In practice, real ICS traffic data, especially attack-triggered captures, are scarce due to confidentiality, safety, and legal restrictions, and available public ICS datasets are few, limited in scope, or fail to reflect current threat modalities. For instance, the HAI Security Dataset provides operational telemetry and anomaly flags from a realistic control system setup for research purposes, but must be carefully preprocessed to derive protocol-relevant features for ML tasks \cite{shin}. Data scarcity directly undermines model generalization, evaluation reproducibility, and the robustness of intrusion detection research, especially when training or testing ML models on realistic ICS behavior remains confined to small or outdated collections of examples \cite{info16100910}.
Synthetic data generation offers a practical pathway to mitigate these challenges. By programmatically generating feature-level sequences that mimic the statistical and temporal structure of real ICS telemetry, researchers can augment scarce training sets, standardize benchmarking, and preserve operational confidentiality. Relative to raw packet captures, feature-level synthesis abstracts critical protocol semantics and statistical patterns without exposing sensitive fields, making it more compatible with safety constraints and compliance requirements in ICS environments. Modern generative modeling, including diffusion models, has advanced significantly in producing high-fidelity synthetic data across domains. Diffusion approaches, such as denoising diffusion probabilistic models, learn to transform noise into coherent structured samples and have been successfully applied to tabular or time series data synthesis with better stability and data coverage compared to adversarial methods \cite{pmlr-v202-kotelnikov23a,rasul2021autoregressivedenoisingdiffusionmodels}.
Despite these advances, most existing work either focuses on packet-level generation \cite{jiang2023netdiffusionnetworkdataaugmentation} or is limited to generic tabular data \cite{pmlr-v202-kotelnikov23a}, rather than domain-specific control sequence synthesis tailored for ICS protocols where temporal coherence, multi-channel dependencies, and discrete protocol legality are jointly required. This gap motivates our focus on protocol feature-level generation for ICS, which involves synthesizing sequences of protocol-relevant fields conditioned on their temporal and cross-channel structure. In this work, we formulate a hybrid modeling pipeline that decouples long-horizon trends and local statistical detail while preserving discrete semantics of protocol tokens. By combining causal Transformers with diffusion-based refiners, and enforcing deterministic validity constraints during sampling, our framework generates semantically coherent, temporally consistent, and distributionally faithful ICS feature sequences. We evaluate features derived from the HAI Security Dataset and demonstrate that our approach produces high-quality synthetic sequences suitable for downstream augmentation, benchmarking, and integration into packet-construction workflows that respect realistic ICS constraints.
% 2. Related Work
\section{Related Work}
\label{sec:related}
Early generation of network data oriented towards ``realism'' mostly remained at the packet/flow header level, either through replay or statistical synthesis based on single-point observations. Swing, in a closed-loop, network-responsive manner, extracts user/application/network distributions from single-point observations to reproduce burstiness and correlation across multiple time scales \cite{10.1145/1159913.1159928}. Subsequently, a series of works advanced header synthesis to learning-based generation: the WGAN-based method added explicit verification of protocol field consistency to NetFlow/IPFIX \cite{Ring_2019}, NetShare reconstructed header modeling as flow-level time series and improved fidelity and scalability through domain encoding and parallel fine-tuning \cite{10.1145/3544216.3544251}, and DoppelGANger preserved the long-range structure and downstream sorting consistency of networked time series by decoupling attributes from sequences \cite{Lin_2020}. However, in industrial control system (ICS) scenarios, the original PCAP is usually not shareable, and public testbeds (such as SWaT, WADI) mostly provide process/monitoring telemetry and protocol interactions for security assessment, but public datasets emphasize operational variables rather than packet-level traces \cite{7469060,10.1145/3055366.3055375}. This makes ``synthesis at the feature/telemetry level, aware of protocol and semantics'' more feasible and necessary in practice: we are more concerned with reproducing high-level distributions and multi-scale temporal patterns according to operational semantics and physical constraints without relying on the original packets. From this perspective, the generation paradigm naturally shifts from ``packet syntax reproduction'' to ``modeling of high-level spatio-temporal distributions and uncertainties'', requiring stable training, strong distribution fitting, and interpretable uncertainty characterization.
Diffusion models exhibit good fit along this path: DDPM achieves high-quality sampling and stable optimization through efficient $\epsilon$ parameterization and weighted variational objectives \cite{ho2020denoising}, the SDE perspective unifies score-based and diffusion, providing likelihood evaluation and prediction-correction sampling strategies based on probability flow ODEs \cite{song2021score}. For time series, TimeGrad replaces the constrained output distribution with conditional denoising, capturing high-dimensional correlations at each step \cite{rasul2021autoregressivedenoisingdiffusionmodels}; CSDI explicitly performs conditional diffusion and uses two-dimensional attention to simultaneously leverage temporal and cross-feature dependencies, suitable for conditioning and filling in missing values \cite{tashiro2021csdiconditionalscorebaseddiffusion}; in a more general spatio-temporal structure, DiffSTG generalizes diffusion to spatio-temporal graphs, combining TCN/GCN with denoising U-Net to improve CRPS and inference efficiency in a non-autoregressive manner \cite{wen2024diffstgprobabilisticspatiotemporalgraph}, and PriSTI further enhances conditional features and geographical relationships, maintaining robustness under high missing rates and sensor failures \cite{liu2023pristiconditionaldiffusionframework}; in long sequences and continuous domains, DiffWave verifies that diffusion can also match the quality of strong vocoders under non-autoregressive fast synthesis \cite{kong2021diffwaveversatilediffusionmodel}; studies on cellular communication traffic show that diffusion can recover spatio-temporal patterns and provide uncertainty characterization at the urban scale \cite{11087622}. These results overall point to a conclusion: when the research focus is on ``telemetry/high-level features'' rather than raw messages, diffusion models provide stable and fine-grained distribution fitting and uncertainty quantification, which is exactly in line with the requirements of ICS telemetry synthesis. Meanwhile, directly entrusting all structures to a ``monolithic diffusion'' is not advisable: long-range temporal skeletons and fine-grained marginal distributions often have optimization tensions, requiring explicit decoupling in modeling.
Looking further into the mechanism complexity of ICS: its channel types are inherently mixed, containing both continuous process trajectories and discrete supervision/status variables, and discrete channels must be ``legal'' under operational constraints. The aforementioned progress in time series diffusion has mainly occurred in continuous spaces, but discrete diffusion has also developed systematic methods: D3PM improves sampling quality and likelihood through absorption/masking and structured transitions in discrete state spaces \cite{austin2021structured}, subsequent masked diffusion provides stable reconstruction on categorical data in a more simplified form \cite{Lin_2020}, multinomial diffusion directly defines diffusion on a finite vocabulary through mechanisms such as argmax flows \cite{hoogeboom2021argmaxflowsmultinomialdiffusion}, and Diffusion-LM demonstrates an effective path for controllable text generation by imposing gradient constraints in continuous latent spaces \cite{li2022diffusionlmimprovescontrollabletext}. From the perspectives of protocols and finite-state machines, coverage-guided fuzz testing emphasizes the criticality of ``sequence legality and state coverage'' \cite{meng2025aflnetyearslatercoverageguided,godefroid2017learnfuzzmachinelearninginput,she2019neuzzefficientfuzzingneural}, echoing the concept of ``legality by construction'' in discrete diffusion: preferentially adopting absorption/masking diffusion on discrete channels, supplemented by type-aware conditioning and sampling constraints, to avoid semantic invalidity and marginal distortion caused by post hoc thresholding.
From the perspective of high-level synthesis, the temporal structure is equally indispensable: ICS control often involves delay effects, phased operating conditions, and cross-channel coupling, requiring models to be able to characterize low-frequency, long-range dependencies while also overlaying multi-modal fine-grained fluctuations on them. The Transformer series has provided sufficient evidence in long-sequence time series tasks: Transformer-XL breaks through the fixed-length context limitation through a reusable memory mechanism and significantly enhances long-range dependency expression \cite{dai2019transformerxlattentivelanguagemodels}; Informer uses ProbSparse attention and efficient decoding to balance span and efficiency in long-sequence prediction \cite{zhou2021informerefficienttransformerlong}; Autoformer robustly models long-term seasonality and trends through autocorrelation and decomposition mechanisms \cite{wu2022autoformerdecompositiontransformersautocorrelation}; FEDformer further improves long-period prediction performance in frequency domain enhancement and decomposition \cite{zhou2022fedformerfrequencyenhanceddecomposed}; PatchTST enhances the stability and generalization of long-sequence multivariate prediction through local patch-based representation and channel-independent modeling \cite{2023}. Combining our previous positioning of diffusion, this chain of evidence points to a natural division of labor: using attention-based sequence models to first extract stable low-frequency trends/conditions (long-range skeletons), and then allowing diffusion to focus on margins and details in the residual space; meanwhile, discrete masking/absorbing diffusion is applied to supervised/pattern variables to ensure vocabulary legality by construction. This design not only inherits the advantages of time series diffusion in distribution fitting and uncertainty characterization \cite{rasul2021autoregressivedenoisingdiffusionmodels,tashiro2021csdiconditionalscorebaseddiffusion,wen2024diffstgprobabilisticspatiotemporalgraph,liu2023pristiconditionaldiffusionframework,kong2021diffwaveversatilediffusionmodel,11087622}, but also stabilizes the macroscopic temporal support through the long-range attention of Transformer, enabling the formation of an operational integrated generation pipeline under the mixed types and multi-scale dynamics of ICS.
% 3. Methodology
\section{Methodology}
\label{sec:method}
Industrial control system (ICS) telemetry is intrinsically mixed-type and mechanistically heterogeneous: continuous process trajectories (e.g., sensor and actuator signals) coexist with discrete supervisory states (e.g., modes, alarms, interlocks), and the underlying generating mechanisms range from physical inertia to program-driven step logic. This heterogeneity is not cosmetic--it directly affects what ``realistic'' synthesis means, because a generator must jointly satisfy (i) temporal coherence, (ii) distributional fidelity, and (iii) discrete semantic validity (i.e., every discrete output must belong to its legal vocabulary by construction). These properties are emphasized broadly in operational-technology security guidance and ICS engineering practice, where state logic and physical dynamics are tightly coupled \cite{nist2023sp80082}.
We model each training instance as a fixed-length window of length $L$, comprising continuous channels $\bm{X} \in \mathbb{R}^{L \times d_c}$ and discrete channels $\bm{Y} = \{y^{(j)}_{1:L}\}_{j=1}^{d_d}$, where each discrete variable satisfies $y^{(j)}_t \in \mathcal{V}_j$ for a finite vocabulary $\mathcal{V}_j$. Our objective is to learn a generator that produces synthetic $(\hat{\bm{X}}, \hat{\bm{Y}})$ that are simultaneously coherent and distributionally faithful, while also ensuring $\hat{y}^{(j)}_t\in\mathcal{V}_j$ for all $j$, $t$ by construction (rather than via post-hoc rounding or thresholding).
A key empirical and methodological tension in ICS synthesis is that temporal realism and marginal/distributional realism can compete when optimized monolithically: sequence models trained primarily for regression often over-smooth heavy tails and intermittent bursts, while purely distribution-matching objectives can erode long-range structure. Diffusion models provide a principled route to rich distribution modeling through iterative denoising, but they do not, by themselves, resolve (i) the need for a stable low-frequency temporal scaffold, nor (ii) the discrete legality constraints for supervisory variables \cite{ho2020denoising,song2021score}. Recent time-series diffusion work further suggests that separating coarse structure from stochastic refinement can be an effective inductive bias for long-horizon realism \cite{kollovieh2023tsdiff,sikder2023transfusion}.
\begin{figure*}[t]
\centering
\includegraphics[width=\textwidth]{fig-design-v2.png}
% \caption{Description of the figure.}
\label{fig:design}
\end{figure*}
Motivated by these considerations, we propose Mask-DDPM, organized in the following order:
\begin{enumerate}
\item Transformer trend module: learns the dominant temporal backbone of continuous dynamics via attention-based sequence modeling \cite{vaswani2017attention}.
\item Residual DDPM for continuous variables: models distributional detail as stochastic residual structure conditioned on the learned trend \cite{ho2020denoising,kollovieh2023tsdiff}.
\item Masked diffusion for discrete variables: generates discrete ICS states with an absorbing/masking corruption process and categorical reconstruction \cite{austin2021structured,shi2024simplified}.
\item Type-aware decomposition: a type-aware factorization and routing layer that assigns variables to the most appropriate modeling mechanism and enforces deterministic constraints where warranted.
\end{enumerate}
This ordering is intentional. The trend module establishes a macro-temporal scaffold; residual diffusion then concentrates capacity on micro-structure and marginal fidelity; masked diffusion provides a native mechanism for discrete legality; and the type-aware layer operationalizes the observation that not all ICS variables should be modeled with the same stochastic mechanism. Importantly, while diffusion-based generation for ICS telemetry has begun to emerge, existing approaches remain limited and typically emphasize continuous synthesis or augmentation; in contrast, our pipeline integrates (i) a Transformer-conditioned residual diffusion backbone, (ii) a discrete masked-diffusion branch, and (iii) explicit type-aware routing for heterogeneous variable mechanisms within a single coherent generator \cite{yuan2025ctu,sha2026ddpm}.
\subsection{Transformer trend module for continuous dynamics}
\label{sec:method-trans}
We instantiate the temporal backbone as a causal Transformer trend extractor, leveraging self-attention's ability to represent long-range dependencies and cross-channel interactions without recurrence \cite{vaswani2017attention}. Compared with recurrent trend extractors (e.g., GRU-style backbones), a Transformer trend module offers a direct mechanism to model delayed effects and multivariate coupling--common in ICS, where control actions may influence downstream sensors with nontrivial lags and regime-dependent propagation \cite{vaswani2017attention,nist2023sp80082}. Crucially, in our design the Transformer is not asked to be the entire generator; instead, it serves a deliberately restricted role: providing a stable, temporally coherent conditioning signal that later stochastic components refine.
For continuous channels $\bm{X}$, we posit an additive decomposition:
\begin{equation}
\bm{X} = \bm{S} + \bm{R},
\label{eq:additive_decomp}
\end{equation}
where $\bm{S} \in \mathbb{R}^{L \times d_c}$ is a smooth trend capturing predictable temporal evolution, and $\bm{R} \in \mathbb{R}^{L \times d_c}$ is a residual capturing distributional detail (e.g., bursts, heavy tails, local fluctuations) that is difficult to represent robustly with a purely regression-based temporal objective. This separation reflects an explicit division of labor: the trend module prioritizes temporal coherence, while diffusion (introduced next) targets distributional realism at the residual level--a strategy aligned with ``predict-then-refine'' perspectives in time-series diffusion modeling \cite{kollovieh2023tsdiff,sikder2023transfusion}.
We parameterize the trend $\bm{S}$ using a causal Transformer $f_\phi$. With teacher forcing, we train $f_{\phi}$ to predict the next-step trend from past observations:
\begin{equation}
\hat{\bm{S}}_{t+1} = f_{\phi}(\bm{X}_{1:t}), \quad t = 1, \dots, L-1.
\label{eq:trend_prediction}
\end{equation}
The trend module is trained with the mean-squared error objective:
\begin{equation}
\mathcal{L}_{\text{trend}}(\phi) = \frac{1}{(L-1)d_c} \sum_{t=1}^{L-1} \bigl\| \hat{\bm{S}}_{t+1} - \bm{X}_{t+1} \bigr\|_2^2.
\label{eq:trend_loss}
\end{equation}
At inference, we roll out the Transformer autoregressively to obtain $\hat{\bm{S}}$, and then define the residual target for diffusion as $\bm{R} = \bm{X} - \hat{\bm{S}}$. This setup intentionally ``locks in'' a coherent low-frequency scaffold before any stochastic refinement is applied, thereby reducing the burden on downstream diffusion modules to simultaneously learn both long-range structure and marginal detail. In this sense, our use of Transformers is distinctive: it is a conditioning-first temporal backbone designed to stabilize mixed-type diffusion synthesis in ICS, rather than an end-to-end monolithic generator \cite{vaswani2017attention,kollovieh2023tsdiff,yuan2025ctu}.
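For concreteness, a minimal sketch of this rollout (the exact seeding of the initial steps is an implementation detail and is not fixed here) is
\[
\hat{\bm{S}}_{t+1} = f_{\phi}\bigl(\hat{\bm{S}}_{1:t}\bigr), \qquad t = 1, \dots, L-1,
\]
with early steps initialized from observed context.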
\subsection{DDPM for continuous residual generation}
\label{sec:method-ddpm}
We model the residual $\bm{R}$ with a denoising diffusion probabilistic model (DDPM) conditioned on the trend $\hat{\bm{S}}$ \cite{ho2020denoising}. Diffusion models learn complex data distributions by inverting a tractable noising process through iterative denoising, and have proven effective at capturing multimodality and heavy-tailed structure that is often attenuated by purely regression-based sequence models \cite{ho2020denoising,song2021score}. Conditioning the diffusion model on $\hat{\bm{S}}$ is central: it prevents the denoiser from re-learning the low-frequency scaffold and focuses capacity on residual micro-structure, mirroring the broader principle that diffusion excels as a distributional corrector when a reasonable coarse structure is available \cite{kollovieh2023tsdiff,sikder2023transfusion}.
Let $K$ denote the number of diffusion steps, with a noise schedule $\{\beta_k\}_{k=1}^K$, $\alpha_k = 1 - \beta_k$, and $\bar{\alpha}_k = \prod_{i=1}^k \alpha_i$. The forward corruption process is:
\begin{equation}
q(\bm{r}_k \mid \bm{r}_0) = \mathcal{N}\bigl( \sqrt{\bar{\alpha}_k}\,\bm{r}_0,\; (1 - \bar{\alpha}_k)\mathbf{I} \bigr)
\label{eq:forward_corruption}
\end{equation}
equivalently,
\begin{equation}
\bm{r}_k = \sqrt{\bar{\alpha}_k}\,\bm{r}_0 + \sqrt{1 - \bar{\alpha}_k}\,\boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})
\label{eq:forward_corruption_eq}
\end{equation}
The learned reverse process is parameterized as:
\begin{equation}
p_{\theta}(\bm{r}_{k-1} \mid \bm{r}_k, \hat{\bm{S}}) = \mathcal{N}\bigl( \boldsymbol{\mu}_{\theta}(\bm{r}_k, k, \hat{\bm{S}}),\; \boldsymbol{\Sigma}(k) \bigr).
\label{eq:reverse_process}
\end{equation}
where $\boldsymbol{\mu}_{\theta}$ is implemented by a Transformer denoiser that consumes (i) the noised residual $\bm{r}_k$, (ii) a timestep embedding for $k$, and (iii) conditioning features derived from $\hat{\bm{S}}$. This denoiser architecture is consistent with the growing use of attention-based denoisers for long-context time-series diffusion, while our key methodological emphasis is the trend-conditioned residual factorization as the object of diffusion learning \cite{ho2020denoising,sikder2023transfusion}.
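For completeness, in the standard $\epsilon$-parameterization the reverse mean is recovered from the predicted noise as
\[
\boldsymbol{\mu}_{\theta}(\bm{r}_k, k, \hat{\bm{S}}) = \frac{1}{\sqrt{\alpha_k}} \Bigl( \bm{r}_k - \frac{\beta_k}{\sqrt{1 - \bar{\alpha}_k}}\, \boldsymbol{\epsilon}_{\theta}(\bm{r}_k, k, \hat{\bm{S}}) \Bigr),
\]
following \cite{ho2020denoising}.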
We train the denoiser using the standard DDPM $\epsilon$-prediction objective:
\begin{equation}
\mathcal{L}_{\text{cont}}(\theta) = \mathbb{E}_{k,\bm{r}_0,\boldsymbol{\epsilon}} \left[ \bigl\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}(\bm{r}_k, k, \hat{\bm{S}}) \bigr\|_2^2 \right].
\label{eq:ddpm_loss}
\end{equation}
Because diffusion optimization can exhibit timestep imbalance (i.e., some timesteps dominate gradients), we optionally apply an SNR-based reweighting consistent with Min-SNR training:
\begin{equation}
\mathcal{L}^{\text{snr}}_{\text{cont}}(\theta) = \mathbb{E}_{k,\bm{r}_0,\boldsymbol{\epsilon}} \left[ w_k \bigl\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}(\bm{r}_k, k, \hat{\bm{S}}) \bigr\|_2^2 \right],
\label{eq:snr_loss}
\end{equation}
where $w_k = \min(\mathrm{SNR}_k, \gamma)/\mathrm{SNR}_k$ with $\mathrm{SNR}_k=\bar{\alpha}_k/(1-\bar{\alpha}_k)$, and $\gamma>0$ is a cap parameter \cite{hang2023efficient}.
After sampling $\hat{\bm{R}}$ by reverse diffusion, we reconstruct the continuous output as $\hat{\bm{X}} = \hat{\bm{S}} + \hat{\bm{R}}$. Overall, the DDPM component serves as a distributional corrector on top of a temporally coherent backbone, which is particularly suited to ICS where low-frequency dynamics are strong and persistent but fine-scale variability (including bursts and regime-conditioned noise) remains important for realism. Relative to prior ICS diffusion efforts that primarily focus on continuous augmentation, our formulation elevates trend-conditioned residual diffusion as a modular mechanism for disentangling temporal structure from distributional refinement \cite{yuan2025ctu,sha2026ddpm}.
\subsection{Masked diffusion for discrete ICS variables}
\label{sec:method-discrete}
Discrete ICS variables must remain categorical, making Gaussian diffusion inappropriate for supervisory states and mode-like channels. While one can attempt continuous relaxations or post-hoc discretization, such strategies risk producing semantically invalid intermediate states (e.g., ``in-between'' modes) and can distort the discrete marginal distribution. Discrete-state diffusion provides a principled alternative by defining a valid corruption process directly on categorical variables \cite{austin2021structured,shi2024simplified}. In the ICS setting, this is not a secondary detail: supervisory tags often encode control logic boundaries (modes, alarms, interlocks) that must remain within a finite vocabulary to preserve semantic correctness \cite{nist2023sp80082}.
We therefore adopt masked (absorbing) diffusion for discrete channels, where corruption replaces tokens with a special $\texttt{[MASK]}$ symbol according to a schedule \cite{shi2024simplified}. For each variable $j$, define a masking schedule $\{m_k\}_{k=1}^K$ (with $m_k\in[0,1]$) increasing in $k$. The forward corruption process is:
\begin{equation}
q(y^{(j)}_k \mid y^{(j)}_0) =
\begin{cases}
y^{(j)}_0, & \text{with probability } 1 - m_k, \\
\texttt{[MASK]}, & \text{with probability } m_k,
\end{cases}
\label{eq:masking_process}
\end{equation}
applied independently across $j$ and $t$. Let $\mathcal{M}$ denote the set of masked positions at step $k$. The denoiser $h_{\psi}$ predicts a categorical distribution over $\mathcal{V}_j$ for each masked token, conditioned on (i) the corrupted discrete sequence, (ii) the diffusion step $k$, and (iii) continuous context. Concretely, we condition on $\hat{\bm{S}}$ and $\hat{\bm{X}}$ to couple supervisory reconstruction to the underlying continuous dynamics:
\begin{equation}
p_{\psi}\bigl( y^{(j)}_0 \mid y_k, k, \hat{\bm{S}}, \hat{\bm{X}} \bigr) = h_{\psi}(y_k, k, \hat{\bm{S}}, \hat{\bm{X}}).
\label{eq:discrete_denoising}
\end{equation}
This conditioning choice is motivated by the fact that many discrete ICS states are not standalone; they are functions of regimes, thresholds, and procedural phases that manifest in continuous channels \cite{nist2023sp80082}. Training uses a categorical denoising objective:
\begin{equation}
\mathcal{L}_{\text{disc}}(\psi) = \mathbb{E}_{k} \left[ \frac{1}{|\mathcal{M}|} \sum_{(j,t) \in \mathcal{M}} \mathrm{CE}\bigl( h_{\psi}(y_k, k, \hat{\bm{S}}, \hat{\bm{X}})_{j,t},\; y^{(j)}_{0,t} \bigr) \right],
\label{eq:discrete_loss}
\end{equation}
where $\mathrm{CE}(\cdot,\cdot)$ is cross-entropy. At sampling time, we initialize all discrete tokens as $\texttt{[MASK]}$ and iteratively unmask them using the learned conditionals, ensuring that every output token lies in its legal vocabulary by construction. This discrete branch is a key differentiator of our pipeline: unlike typical continuous-only diffusion augmentation in ICS, we integrate masked diffusion as a first-class mechanism for supervisory-variable legality within the same end-to-end synthesis workflow \cite{shi2024simplified,yuan2025ctu}.
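As a concrete illustration (one possible instantiation rather than a prescription of the schedule used in our experiments), a linear schedule
\[
m_k = \frac{k}{K}, \qquad k = 1, \dots, K,
\]
masks an expected fraction $k/K$ of the tokens at step $k$, so the reverse pass starts from a fully masked window at $k = K$ and progressively commits tokens within $\mathcal{V}_j$ as $k$ decreases.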
\subsection{Type-aware decomposition as factorization and routing layer}
\label{sec:method-types}
Even with a trend-conditioned residual DDPM and a discrete masked-diffusion branch, a single uniform modeling treatment can remain suboptimal because ICS variables are generated by qualitatively different mechanisms. For example, program-driven setpoints exhibit step-and-dwell dynamics; controller outputs follow control laws conditioned on process feedback; actuator positions may show saturation and dwell; and some ``derived tags'' are deterministic functions of other channels. Treating all channels as if they were exchangeable stochastic processes can misallocate model capacity and induce systematic error concentration on a small subset of mechanistically distinct variables \cite{nist2023sp80082}.
We therefore introduce a type-aware decomposition that formalizes this heterogeneity as a routing and constraint layer. Let $\tau(i)\in\{1,\dots,6\}$ assign each variable $i$ to a type class. The type assignment can be initialized from domain semantics (tag metadata, value domains, and engineering meaning), and subsequently refined via an error-attribution workflow described in the Benchmark section. Importantly, this refinement does not change the core diffusion backbone; it changes which mechanism is responsible for which variable, thereby aligning inductive bias with variable-generating mechanism while preserving overall coherence.
We use the following taxonomy:
\begin{enumerate}
\item Type 1 (program-driven / setpoint-like): externally commanded, step-and-dwell variables. These variables can be treated as exogenous drivers (conditioning signals) or routed to specialized change-point / dwell-time models, rather than being forced into a smooth denoiser that may over-regularize step structure.
\item Type 2 (controller outputs): continuous variables tightly coupled to feedback loops; these benefit from conditional modeling where the conditioning includes relevant process variables and commanded setpoints.
\item Type 3 (actuator states/positions): often exhibit saturation, dwell, and rate limits; these may require stateful dynamics beyond generic residual diffusion, motivating either specialized conditional modules or additional inductive constraints.
\item Type 4 (process variables): inertia-dominated continuous dynamics; these are the primary beneficiaries of the Transformer trend + residual DDPM pipeline.
\item Type 5 (derived/deterministic variables): algebraic or rule-based functions of other variables; we enforce deterministic reconstruction $\hat{x}^{(i)} = g_i(\hat{X},\hat{Y})$ rather than learning a stochastic generator, improving logical consistency and sample efficiency.
\item Type 6 (auxiliary/low-impact variables): weakly coupled or sparse signals; we allow simplified modeling (e.g., calibrated marginals or lightweight temporal models) to avoid allocating diffusion capacity where it is not warranted.
\end{enumerate}
Type-aware decomposition improves synthesis quality through three mechanisms. First, it improves capacity allocation by preventing a small set of mechanistically atypical variables from dominating gradients and distorting the learned distribution for the majority class (typically Type 4). Second, it enables constraint enforcement by deterministically reconstructing Type 5 variables, preventing logically inconsistent samples that purely learned generators can produce. Third, it improves mechanism alignment by attaching inductive biases consistent with step/dwell or saturation behaviors where generic denoisers may implicitly favor smoothness.
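As a hypothetical illustration of a Type 5 rule (the concrete functions $g_i$ are dataset-specific and are not enumerated here), a derived control-error tag could be reconstructed deterministically from its commanded setpoint and measured process value,
\[
\hat{x}^{(\mathrm{err})}_t = \hat{x}^{(\mathrm{SP})}_t - \hat{x}^{(\mathrm{PV})}_t,
\]
rather than being sampled stochastically.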
From a novelty standpoint, this layer is not merely an engineering ``patch''; it is an explicit methodological statement that ICS synthesis benefits from typed factorization--a principle that has analogues in mixed-type generative modeling more broadly, but that remains underexplored in diffusion-based ICS telemetry synthesis \cite{shi2025tabdiff,yuan2025ctu,nist2023sp80082}.
\subsection{Joint optimization and end-to-end sampling}
\label{sec:method-joint}
We train the model in a staged manner consistent with the above factorization, which improves optimization stability and encourages each component to specialize in its intended role. Specifically: (i) we train the trend Transformer $f_{\phi}$ to obtain $\hat{\bm{S}}$; (ii) we compute residual targets $\bm{R} = \bm{X} - \hat{\bm{S}}$ for the continuous variables routed to residual diffusion; (iii) we train the residual DDPM $p_{\theta}(\bm{R}\mid \hat{\bm{S}})$ and masked diffusion model $p_{\psi}(\bm{Y}\mid \text{masked}(\bm{Y}), \hat{\bm{S}}, \hat{\bm{X}})$; and (iv) we apply type-aware routing and deterministic reconstruction during sampling. This staged strategy is aligned with the design goal of separating temporal scaffolding from distributional refinement, and it mirrors the broader intuition in time-series diffusion that decoupling coarse structure and stochastic detail can mitigate ``structure vs. realism'' conflicts \cite{kollovieh2023tsdiff,sikder2023transfusion}.
A simple combined objective is $\mathcal{L} = \lambda\mathcal{L}_{\text{cont}} + (1-\lambda)\mathcal{L}_{\text{disc}}$ with $\lambda\in[0,1]$ controlling the balance between continuous and discrete learning. Type-aware routing determines which channels contribute to which loss and which are excluded in favor of deterministic reconstruction. In practice, this routing acts as a principled guardrail against negative transfer across variable mechanisms: channels that are best handled deterministically (Type 5) or by specialized drivers (Type 1/3, depending on configuration) are prevented from forcing the diffusion models into statistically incoherent compromises.
At inference time, generation follows the same structured order: (i) trend $\hat{\bm{S}}$ via the Transformer, (ii) residual $\hat{\bm{R}}$ via DDPM, (iii) discrete $\hat{\bm{Y}}$ via masked diffusion, and (iv) type-aware assembly with deterministic reconstruction for routed variables. This pipeline produces $(\hat{\bm{X}},\hat{\bm{Y}})$ that are temporally coherent by construction (through $\hat{\bm{S}}$), distributionally expressive (through $\hat{\bm{R}}$ denoising), and discretely valid (through masked diffusion), while explicitly accounting for heterogeneous variable-generating mechanisms through type-aware routing. In combination, these choices constitute our central methodological contribution: a unified Transformer + mixed diffusion generator for ICS telemetry, augmented by typed factorization to align model capacity with domain mechanism \cite{ho2020denoising,shi2024simplified,yuan2025ctu,nist2023sp80082}.
% 4. Benchmark
\section{Benchmark}
\label{sec:benchmark}
We evaluate the proposed pipeline on feature sequences derived from the HAI Security Dataset, using fixed-length windows ($L=96$) that preserve the mixed-type structure of ICS telemetry. The goal of this benchmark is not only to report ``overall similarity'' but to justify why the proposed factorization is a better fit for protocol feature synthesis: continuous channels must match physical marginals \cite{coletta2023constrained}, discrete channels must remain semantically legal, and both must retain short-horizon dynamics that underpin state transitions and interlocks \cite{yang2001interlock}.
This emphasis reflects evaluation practice in time-series generation, where strong results are typically supported by multiple complementary views (marginal fidelity, dependency/temporal structure, and downstream plausibility), rather than a single aggregate score \cite{stenger2024survey}. In the ICS setting, this multi-view requirement is sharper: a generator that matches continuous marginals while emitting out-of-vocabulary supervisory tokens is unusable for protocol reconstruction, and a generator that matches marginals but breaks lag structure can produce temporally implausible command/response sequences.
Recent ICS time-series generators often emphasize aggregate similarity scores and utility-driven evaluations (e.g., anomaly-detection performance) to demonstrate realism, which is valuable but can under-specify mixed-type protocol constraints. Our benchmark complements these practices by making mixed-type legality and per-feature distributional alignment explicit: discrete outputs are evaluated as categorical distributions (JSD) and are constrained to remain within the legal vocabulary by construction, while continuous channels are evaluated with nonparametric distribution tests (KS) \cite{yoon2019timegan}. This combination provides a direct, protocol-relevant justification for the hybrid design, rather than relying on a single composite score that may mask discrete failures.
For continuous channels, we measure distributional alignment using the Kolmogorov--Smirnov (KS) statistic computed per feature between the empirical distributions of real and synthetic samples, and then averaged across features. For discrete channels, we quantify marginal fidelity with Jensen--Shannon divergence (JSD) \cite{lin1991divergence,yoon2019timegan} between categorical distributions per feature, averaged across discrete variables. To assess temporal realism, we compare lag-1 autocorrelation at the feature level and report the mean absolute difference between real and synthetic lag-1 autocorrelation, averaged across features. In addition, to avoid degenerate comparisons driven by near-constant tags, features whose empirical standard deviation falls below a small threshold are excluded from continuous KS aggregation; such channels carry limited distributional information and can distort summary statistics.
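For completeness, the standard per-feature statistics underlying these aggregates can be stated explicitly. Writing $F^{\text{real}}_i$ and $F^{\text{syn}}_i$ for the empirical CDFs of continuous feature $i$, $p_j$ and $q_j$ for the real and synthetic categorical marginals of discrete feature $j$ with midpoint $m_j = \tfrac{1}{2}(p_j + q_j)$, and $\rho_1^{\text{real}}(i)$, $\rho_1^{\text{syn}}(i)$ for lag-1 autocorrelations, the per-feature quantities are
\begin{equation}
\mathrm{KS}_i = \sup_{x}\,\bigl|F^{\text{real}}_i(x) - F^{\text{syn}}_i(x)\bigr|,
\label{eq:ks_def}
\end{equation}
\begin{equation}
\mathrm{JSD}_j = \tfrac{1}{2}\,\mathrm{KL}\bigl(p_j \,\|\, m_j\bigr) + \tfrac{1}{2}\,\mathrm{KL}\bigl(q_j \,\|\, m_j\bigr),
\label{eq:jsd_def}
\end{equation}
\begin{equation}
\Delta\rho_i = \bigl|\rho_1^{\text{real}}(i) - \rho_1^{\text{syn}}(i)\bigr|,
\label{eq:lag1_def}
\end{equation}
and the numbers reported below are means of these quantities over the retained features, with near-constant channels excluded from the KS average as described above.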
\subsection{Quantitative results}
\label{sec:benchmark-quant}
Across all runs, the mean continuous KS is 0.3311 (std 0.0079) and the mean discrete JSD is 0.0284 (std 0.0073), indicating that the generator preserves both continuous marginals and discrete semantic distributions at the feature level. Temporal consistency is similarly stable across runs, with a mean lag-1 autocorrelation difference of 0.2684 (std 0.0027), suggesting that the synthesized windows retain short-horizon dynamical structure \cite{ni2021sigwasserstein} instead of collapsing to marginal matching alone. The best-performing instance (by mean KS) attains 0.3224, and the small inter-seed variance shows that the reported fidelity is reproducible rather than driven by a single favorable initialization.
\begin{figure*}[t]
\centering
\includegraphics[width=\textwidth]{fig-overall-benchmark-v1.png}
% \caption{Description of the figure.}
\label{fig:benchmark}
\end{figure*}
\begin{table}[htbp]
\centering
\caption{Summary of benchmark metrics. Lower values indicate better performance.}
\label{tab:metrics}
\begin{tabular}{@{}l l c@{}}
\toprule
\textbf{Metric} & \textbf{Aggregation} & \textbf{Mean $\pm$ Std} \\
\midrule
KS (continuous) & mean over continuous features & 0.3311 $\pm$ 0.0079 \\
JSD (discrete) & mean over discrete features & 0.0284 $\pm$ 0.0073 \\
Abs $\Delta$ lag-1 autocorr & mean over features & 0.2684 $\pm$ 0.0027 \\
\bottomrule
\end{tabular}
\end{table}
To make the benchmark actionable (and comparable to prior work), we report type-appropriate, interpretable statistics instead of collapsing everything into a single similarity score. This matters in mixed-type ICS telemetry: continuous fidelity can be high while discrete semantics fail, and vice versa. By separating continuous (KS), discrete (JSD), and temporal (lag-1) views, the evaluation directly matches the design goals of the hybrid generator: distributional refinement for continuous residuals, vocabulary-valid reconstruction for discrete supervision, and trend-induced short-horizon coherence.
In addition, the seed-averaged reporting mirrors evaluation conventions in recent diffusion-based time-series generation studies, where robustness across runs is increasingly treated as a first-class signal rather than an afterthought. In this sense, the small inter-seed variance is itself evidence that the factorized training and typed routing reduce instability and localized error concentration, which is frequently observed when heterogeneous channels compete for the same modeling capacity.
% 5. Future Work
\section{Future Work}
\label{sec:future}
Future work will expand from generating legal ICS feature sequences toward data construction and adversarial evaluation for security tasks. The core contribution of this paper is the generation of feature sequences that are temporally consistent, distributionally credible, and discretely legal under mixed types and multi-scale dynamics. In practical intrusion- and anomaly-detection research, however, the more pressing bottleneck is often the lack of attack or anomaly data with clear semantics and sufficient coverage. A direct and important extension is therefore to treat the legal sequences generated here as a controllable, reproducible baseline operating flow and, while preserving sequence-level legality and engineering constraints, to inject or blend illegal behaviors according to specified attack patterns, thereby systematically constructing datasets for training and evaluating the detection of illegal data packets.
Specifically, attack injection can be upgraded from simple perturbation to semantically consistent, patterned rewriting. On continuous channels, this includes bias injection, covert manipulation near thresholds, instantaneous jumps, and intermittent bursts, which mimic the temporal signatures that attackers use for concealment without violating the basic boundary conditions of the process dynamics. On discrete channels, it includes illegal state transitions, alarm suppression or delayed triggering, and mode camouflage, reflecting trajectories that are unreachable under real control logic but forcibly created by an attacker. The injection process can further be coordinated with the type routing and constraint layer of this paper: for deterministically derived variables, illegal behavior should propagate through modifications of upstream variables to preserve consistency; for supervisory variables constrained by finite-state machines, interpretable illegal transitions can be generated via minimum-violation paths or controlled violation intensity, with violation points and types explicitly labeled so that downstream detection tasks can learn finer-grained discrimination criteria.
Methodologically, this direction also supports stronger controllability and measurability. Attack patterns can be treated as conditioning variables that jointly orchestrate legitimate generation and illegal injection, producing matched samples of different attack strategies under the same legitimate baseline and turning dataset construction into a repeatable scenario-generation process. By controlling the injection location, duration, amplitude, and coupling range, one can systematically sweep the performance-degradation curves of detectors across threat intensities and operating-condition stages, yielding a more stable benchmark than a single capture or a single script. Ultimately, this turns the legitimate-data generation capability presented in this paper into infrastructure for security research: a shareable, reproducible distribution of legitimate operation, augmented by controllable injection of clearly defined illegal patterns, producing datasets with sufficient coverage and consistent annotation for training and evaluating models that identify illegal packets and abnormal sequences, and improving reproducibility and engineering credibility in this direction.
% 6. Conclusion
\section{Conclusion}
\label{sec:conclusion}
This paper addresses the data scarcity and shareability barriers that limit machine-learning research for industrial control system (ICS) security by proposing a practical synthetic telemetry generation framework at the protocol feature level. We introduced Mask-DDPM, a hybrid generator designed explicitly for the mixed-type and multi-scale nature of ICS data, where continuous process dynamics must remain temporally coherent while discrete supervisory variables must remain categorically legal by construction.
Our main contributions are: (i) a causal Transformer trend module that provides a stable long-horizon temporal scaffold for continuous channels; (ii) a trend-conditioned residual DDPM that focuses modeling capacity on local stochastic detail and marginal fidelity without destabilizing global structure; (iii) a masked (absorbing) diffusion branch for discrete variables that guarantees in-vocabulary outputs and supports semantics-aware conditioning on continuous context; and (iv) a type-aware decomposition/routing layer that aligns model mechanisms with heterogeneous ICS variable origins (e.g., process inertia, step-and-dwell setpoints, deterministic derived tags), enabling deterministic enforcement where appropriate and improving capacity allocation.
We evaluated the approach on windows derived from the HAI Security Dataset and reported mixed-type, protocol-relevant metrics rather than a single aggregate score. Across seeds, the model achieves stable fidelity with mean KS = 0.3311 ± 0.0079 on continuous features, mean JSD = 0.0284 ± 0.0073 on discrete features, and mean absolute lag-1 autocorrelation difference 0.2684 ± 0.0027, indicating that Mask-DDPM preserves both marginal distributions and short-horizon dynamics while maintaining discrete legality.
Overall, Mask-DDPM provides a reproducible foundation for generating shareable, semantically valid ICS feature sequences suitable for data augmentation, benchmarking, and downstream packet/trace reconstruction workflows. Building on this capability, a natural next step is to move from purely legal synthesis toward controllable scenario construction, including structured attack/violation injection under engineering constraints to support adversarial evaluation and more comprehensive security benchmarks.
% References
\bibliographystyle{IEEEtran}
\bibliography{references}
\end{document}

arxiv-style/main.tex Normal file

@@ -0,0 +1,314 @@
\documentclass{article}
\usepackage{arxiv}
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc} % use 8-bit T1 fonts
\usepackage{hyperref} % hyperlinks
\usepackage{url} % simple URL typesetting
\usepackage{booktabs} % professional-quality tables
\usepackage{amsfonts} % blackboard math symbols
\usepackage{nicefrac} % compact symbols for 1/2, etc.
\usepackage{microtype} % microtypography
\usepackage{amsmath} % cleveref must be loaded after amsmath!
\usepackage{cleveref} % smart cross-referencing
\usepackage{lipsum} % Can be removed after putting your text content
\usepackage{graphicx}
\usepackage[numbers]{natbib}
\usepackage{doi}
% Packages for equations
\usepackage{amssymb}
\usepackage{bm}
\usepackage{array} % For column formatting
\usepackage{caption} % Better caption spacing
% Title
\title{Mask-DDPM: Transformer-Conditioned Mixed-Type Diffusion for Semantically Valid ICS Telemetry Synthesis}
% If a date is not needed, uncomment the line below
\date{}
\newif\ifuniqueAffiliation
\uniqueAffiliationtrue
\ifuniqueAffiliation % Standard author block
\author{
Zhenglan Chen \\
Aberdeen Institute of Data Science and Artificial Intelligence\\
South China Normal University\\
Guangzhou, Guangdong 510631, China \\
\texttt{20223803054@m.scnu.edu.cn} \\
\And
Mingzhe Yang \\
Aberdeen Institute of Data Science and Artificial Intelligence\\
South China Normal University\\
Guangzhou, Guangdong 510631, China \\
\texttt{20223803063@m.scnu.edu.cn} \\
\And
Hongyu Yan \\
Aberdeen Institute of Data Science and Artificial Intelligence\\
South China Normal University\\
Guangzhou, Guangdong 510631, China \\
\texttt{20223803065@m.scnu.edu.cn}
\And
Huan Yang \\
foo\\
South China Normal University\\
Guangzhou, Guangdong 510631, China \\
\texttt{foo@bar.com} \\
}
\fi
% Running-header settings
\renewcommand{\shorttitle}{Mask-DDPM for ICS Telemetry Synthesis}
%%% PDF metadata
\hypersetup{
pdftitle={Mask-DDPM: Transformer-Conditioned Mixed-Type Diffusion for Semantically Valid ICS Telemetry Synthesis},
pdfsubject={cs.LG, cs.CR},
pdfauthor={Zhenglan Chen, Mingzhe Yang, Hongyu Yan, Huan Yang},
pdfkeywords={Machine Learning, Cyber Defense, ICS},
}
\begin{document}
\maketitle
\begin{abstract}
Industrial control systems (ICS) security research is increasingly constrained by the scarcity and non-shareability of realistic traffic and telemetry, especially for attack scenarios. To mitigate this bottleneck, we study synthetic generation at the protocol feature/telemetry level, where samples must simultaneously preserve temporal coherence, match continuous marginal distributions, and keep discrete supervisory variables strictly within valid vocabularies. We propose Mask-DDPM, a hybrid framework tailored to mixed-type, multi-scale ICS sequences. Mask-DDPM factorizes generation into (i) a causal Transformer trend module that rolls out a stable long-horizon temporal scaffold for continuous channels, (ii) a trend-conditioned residual DDPM that refines local stochastic structure and heavy-tailed fluctuations without degrading global dynamics, (iii) a masked (absorbing) diffusion branch for discrete variables that guarantees categorical legality by construction, and (iv) a type-aware decomposition/routing layer that aligns modeling mechanisms with heterogeneous ICS variable origins and enforces deterministic reconstruction where appropriate. Evaluated on fixed-length windows (L=96) derived from the HAI Security Dataset, Mask-DDPM achieves stable fidelity across seeds with mean KS = 0.3311 ± 0.0079 (continuous), mean JSD = 0.0284 ± 0.0073 (discrete), and mean absolute lag-1 autocorrelation difference = 0.2684 ± 0.0027, indicating faithful marginals, preserved short-horizon dynamics, and valid discrete semantics. The resulting generator provides a reproducible basis for data augmentation, benchmarking, and downstream ICS protocol reconstruction workflows.
\end{abstract}
% Keywords
\keywords{Machine Learning \and Cyber Defense \and ICS}
% 1. Introduction
\section{Introduction}
\label{sec:intro}
Industrial control systems (ICS) form the backbone of modern critical infrastructure, which includes power grids, water treatment, manufacturing, and transportation, among others. These systems monitor, regulate, and automate the physical processes through sensors, actuators, programmable logic controllers (PLCs), and monitoring software. Unlike conventional IT systems, ICS operate in real time, closely coupled with physical processes and safety-critical constraints, using heterogeneous and legacy communication protocols such as Modbus/TCP and DNP3 that were not originally designed with robust security in mind. This architectural complexity and operational criticality make ICS high-impact targets for cyber attacks, where disruptions can result in physical damage, environmental harm, and even loss of life. Recent reviews of ICS security highlight the expanding attack surface due to increased connectivity, legacy system vulnerabilities, and the inadequacy of traditional security controls in capturing the nuances of ICS networks and protocols \citep{10.1007/s10844-022-00753-1, Nankya2023-gp}.
While machine learning (ML) techniques have shown promise for anomaly detection and automated cybersecurity within ICS, they rely heavily on labeled datasets that capture both benign operations and diverse attack patterns. In practice, real ICS traffic data, especially attack-triggered captures, are scarce due to confidentiality, safety, and legal restrictions, and available public ICS datasets are few, limited in scope, or fail to reflect current threat modalities. For instance, the HAI Security Dataset provides operational telemetry and anomaly flags from a realistic control system setup for research purposes, but must be carefully preprocessed to derive protocol-relevant features for ML tasks \citep{shin}. Data scarcity directly undermines model generalization, evaluation reproducibility, and the robustness of intrusion detection research, especially when training or testing ML models on realistic ICS behavior remains confined to small or outdated collections of examples \citep{info16100910}.
Synthetic data generation offers a practical pathway to mitigate these challenges. By programmatically generating feature-level sequences that mimic the statistical and temporal structure of real ICS telemetry, researchers can augment scarce training sets, standardize benchmarking, and preserve operational confidentiality. Relative to raw packet captures, feature-level synthesis abstracts critical protocol semantics and statistical patterns without exposing sensitive fields, making it more compatible with safety constraints and compliance requirements in ICS environments. Modern generative modeling, including diffusion models, has advanced significantly in producing high-fidelity synthetic data across domains. Diffusion approaches, such as denoising diffusion probabilistic models, learn to transform noise into coherent structured samples and have been successfully applied to tabular or time series data synthesis with better stability and data coverage compared to adversarial methods \citep{pmlr-v202-kotelnikov23a, rasul2021autoregressivedenoisingdiffusionmodels}.
Despite these advances, most existing work either focuses on packet-level generation \citep{jiang2023netdiffusionnetworkdataaugmentation} or is limited to generic tabular data \citep{pmlr-v202-kotelnikov23a}, rather than domain-specific control sequence synthesis tailored for ICS protocols where temporal coherence, multi-channel dependencies, and discrete protocol legality are jointly required. This gap motivates our focus on protocol feature-level generation for ICS, which involves synthesizing sequences of protocol-relevant fields conditioned on their temporal and cross-channel structure. In this work, we formulate a hybrid modeling pipeline that decouples long-horizon trends and local statistical detail while preserving discrete semantics of protocol tokens. By combining causal Transformers with diffusion-based refiners, and enforcing deterministic validity constraints during sampling, our framework generates semantically coherent, temporally consistent, and distributionally faithful ICS feature sequences. We evaluate features derived from the HAI Security Dataset and demonstrate that our approach produces high-quality synthetic sequences suitable for downstream augmentation, benchmarking, and integration into packet-construction workflows that respect realistic ICS constraints.
% 2. Related Work
\section{Related Work}
\label{sec:related}
Early generation of network data oriented towards ``realism'' mostly remained at the packet/flow header level, either through replay or statistical synthesis based on single-point observations. Swing, in a closed-loop, network-responsive manner, extracts user/application/network distributions from single-point observations to reproduce burstiness and correlation across multiple time scales \citep{10.1145/1151659.1159928,10.1145/1159913.1159928}. Subsequently, a series of works advanced header synthesis to learning-based generation: the WGAN-based method added explicit verification of protocol field consistency to NetFlow/IPFIX \citep{Ring_2019}, NetShare reconstructed header modeling as flow-level time series and improved fidelity and scalability through domain encoding and parallel fine-tuning \citep{10.1145/3544216.3544251}, and DoppelGANger preserved the long-range structure and downstream sorting consistency of networked time series by decoupling attributes from sequences \citep{Lin_2020}. However, in industrial control system (ICS) scenarios, the original PCAP is usually not shareable, and public testbeds (such as SWaT, WADI) mostly provide process/monitoring telemetry and protocol interactions for security assessment, but public datasets emphasize operational variables rather than packet-level traces \citep{7469060,10.1145/3055366.3055375}. This makes ``synthesis at the feature/telemetry level, aware of protocol and semantics'' more feasible and necessary in practice: we are more concerned with reproducing high-level distributions and multi-scale temporal patterns according to operational semantics and physical constraints without relying on the original packets. From this perspective, the generation paradigm naturally shifts from ``packet syntax reproduction'' to ``modeling of high-level spatio-temporal distributions and uncertainties'', requiring stable training, strong distribution fitting, and interpretable uncertainty characterization.
Diffusion models exhibit good fit along this path: DDPM achieves high-quality sampling and stable optimization through efficient $\epsilon$ parameterization and weighted variational objectives \citep{NEURIPS2020_4c5bcfec}, the SDE perspective unifies score-based and diffusion, providing likelihood evaluation and prediction-correction sampling strategies based on probability flow ODEs \citep{song2021scorebasedgenerativemodelingstochastic}. For time series, TimeGrad replaces the constrained output distribution with conditional denoising, capturing high-dimensional correlations at each step \citep{rasul2021autoregressivedenoisingdiffusionmodels}; CSDI explicitly performs conditional diffusion and uses two-dimensional attention to simultaneously leverage temporal and cross-feature dependencies, suitable for conditioning and filling in missing values \citep{tashiro2021csdiconditionalscorebaseddiffusion}; in a more general spatio-temporal structure, DiffSTG generalizes diffusion to spatio-temporal graphs, combining TCN/GCN with denoising U-Net to improve CRPS and inference efficiency in a non-autoregressive manner \citep{wen2024diffstgprobabilisticspatiotemporalgraph}, and PriSTI further enhances conditional features and geographical relationships, maintaining robustness under high missing rates and sensor failures \citep{liu2023pristiconditionaldiffusionframework}; in long sequences and continuous domains, DiffWave verifies that diffusion can also match the quality of strong vocoders under non-autoregressive fast synthesis \citep{kong2021diffwaveversatilediffusionmodel}; studies on cellular communication traffic show that diffusion can recover spatio-temporal patterns and provide uncertainty characterization at the urban scale \citep{11087622}. These results overall point to a conclusion: when the research focus is on ``telemetry/high-level features'' rather than raw messages, diffusion models provide stable and fine-grained distribution fitting and uncertainty quantification, which is exactly in line with the requirements of ICS telemetry synthesis. Meanwhile, directly entrusting all structures to a ``monolithic diffusion'' is not advisable: long-range temporal skeletons and fine-grained marginal distributions often have optimization tensions, requiring explicit decoupling in modeling.
Looking further into the mechanism complexity of ICS: its channel types are inherently mixed, containing both continuous process trajectories and discrete supervision/status variables, and discrete channels must be ``legal'' under operational constraints. The aforementioned progress in time series diffusion has mainly occurred in continuous spaces, but discrete diffusion has also developed systematic methods: D3PM improves sampling quality and likelihood through absorption/masking and structured transitions in discrete state spaces \citep{austin2023structureddenoisingdiffusionmodels}, subsequent masked diffusion provides stable reconstruction on categorical data in a more simplified form \citep{Lin_2020}, multinomial diffusion directly defines diffusion on a finite vocabulary through mechanisms such as argmax flows \citep{hoogeboom2021argmaxflowsmultinomialdiffusion}, and Diffusion-LM demonstrates an effective path for controllable text generation by imposing gradient constraints in continuous latent spaces \citep{li2022diffusionlmimprovescontrollabletext}. From the perspectives of protocols and finite-state machines, coverage-guided fuzz testing emphasizes the criticality of ``sequence legality and state coverage'' \citep{meng2025aflnetyearslatercoverageguided,godefroid2017learnfuzzmachinelearninginput,she2019neuzzefficientfuzzingneural}, echoing the concept of ``legality by construction'' in discrete diffusion: preferentially adopting absorption/masking diffusion on discrete channels, supplemented by type-aware conditioning and sampling constraints, to avoid semantic invalidity and marginal distortion caused by post hoc thresholding.
From the perspective of high-level synthesis, the temporal structure is equally indispensable: ICS control often involves delay effects, phased operating conditions, and cross-channel coupling, requiring models to be able to characterize low-frequency, long-range dependencies while also overlaying multi-modal fine-grained fluctuations on them. The Transformer series has provided sufficient evidence in long-sequence time series tasks: Transformer-XL breaks through the fixed-length context limitation through a reusable memory mechanism and significantly enhances long-range dependency expression \citep{dai2019transformerxlattentivelanguagemodels}; Informer uses ProbSparse attention and efficient decoding to balance span and efficiency in long-sequence prediction \citep{zhou2021informerefficienttransformerlong}; Autoformer robustly models long-term seasonality and trends through autocorrelation and decomposition mechanisms \citep{wu2022autoformerdecompositiontransformersautocorrelation}; FEDformer further improves long-period prediction performance in frequency domain enhancement and decomposition \citep{zhou2022fedformerfrequencyenhanceddecomposed}; PatchTST enhances the stability and generalization of long-sequence multivariate prediction through local patch-based representation and channel-independent modeling \citep{2023}. Combining our previous positioning of diffusion, this chain of evidence points to a natural division of labor: using attention-based sequence models to first extract stable low-frequency trends/conditions (long-range skeletons), and then allowing diffusion to focus on margins and details in the residual space; meanwhile, discrete masking/absorbing diffusion is applied to supervised/pattern variables to ensure vocabulary legality by construction. This design not only inherits the advantages of time series diffusion in distribution fitting and uncertainty characterization \citep{rasul2021autoregressivedenoisingdiffusionmodels,tashiro2021csdiconditionalscorebaseddiffusion,wen2024diffstgprobabilisticspatiotemporalgraph,liu2023pristiconditionaldiffusionframework,kong2021diffwaveversatilediffusionmodel,11087622}, but also stabilizes the macroscopic temporal support through the long-range attention of Transformer, enabling the formation of an operational integrated generation pipeline under the mixed types and multi-scale dynamics of ICS.
% 3. Methodology
\section{Methodology}
\label{sec:method}
Industrial control system (ICS) telemetry is intrinsically mixed-type and mechanistically heterogeneous: continuous process trajectories (e.g., sensor and actuator signals) coexist with discrete supervisory states (e.g., modes, alarms, interlocks), and the underlying generating mechanisms range from physical inertia to program-driven step logic. This heterogeneity is not cosmetic—it directly affects what “realistic” synthesis means, because a generator must jointly satisfy (i) temporal coherence, (ii) distributional fidelity, and (iii) discrete semantic validity (i.e., every discrete output must belong to its legal vocabulary by construction). These properties are emphasized broadly in operational-technology security guidance and ICS engineering practice, where state logic and physical dynamics are tightly coupled \citep{nist2023sp80082}.
We model each training instance as a fixed-length window of length $L$, comprising continuous channels $\bm{X} \in \mathbb{R}^{L \times d_c}$ and discrete channels $\bm{Y} = \{y^{(j)}_{1:L}\}_{j=1}^{d_d}$, where each discrete variable satisfies $y^{(j)}_t \in \mathcal{V}_j$ for a finite vocabulary $\mathcal{V}_j$. Our objective is to learn a generator that produces synthetic $(\hat{\bm{X}}, \hat{\bm{Y}})$ that are simultaneously coherent and distributionally faithful, while also ensuring $\hat{y}^{(j)}_t\in\mathcal{V}_j$ for all $j$, $t$ by construction (rather than via post-hoc rounding or thresholding).
A key empirical and methodological tension in ICS synthesis is that temporal realism and marginal/distributional realism can compete when optimized monolithically: sequence models trained primarily for regression often over-smooth heavy tails and intermittent bursts, while purely distribution-matching objectives can erode long-range structure. Diffusion models provide a principled route to rich distribution modeling through iterative denoising, but they do not, by themselves, resolve (i) the need for a stable low-frequency temporal scaffold, nor (ii) the discrete legality constraints for supervisory variables \citep{ho2020denoising,song2021score}. Recent time-series diffusion work further suggests that separating coarse structure from stochastic refinement can be an effective inductive bias for long-horizon realism \citep{kollovieh2023tsdiff,sikder2023transfusion}.
\begin{figure}[htbp]
\centering
\includegraphics[width=0.8\textwidth]{fig-design-v2.png}
% \caption{Description of the figure.}
\label{fig:design}
\end{figure}
Motivated by these considerations, we propose Mask-DDPM, organized in the following order:
\begin{enumerate}
\item Transformer trend module: learns the dominant temporal backbone of continuous dynamics via attention-based sequence modeling \citep{vaswani2017attention}.
\item Residual DDPM for continuous variables: models distributional detail as stochastic residual structure conditioned on the learned trend \citep{ho2020denoising,kollovieh2023tsdiff}.
\item Masked diffusion for discrete variables: generates discrete ICS states with an absorbing/masking corruption process and categorical reconstruction \citep{austin2021structured, shi2024simplified}.
\item Type-aware decomposition: a type-aware factorization and routing layer that assigns variables to the most appropriate modeling mechanism and enforces deterministic constraints where warranted.
\end{enumerate}
This ordering is intentional. The trend module establishes a macro-temporal scaffold; residual diffusion then concentrates capacity on micro-structure and marginal fidelity; masked diffusion provides a native mechanism for discrete legality; and the type-aware layer operationalizes the observation that not all ICS variables should be modeled with the same stochastic mechanism. Importantly, while diffusion-based generation for ICS telemetry has begun to emerge, existing approaches remain limited and typically emphasize continuous synthesis or augmentation; in contrast, our pipeline integrates (i) a Transformer-conditioned residual diffusion backbone, (ii) a discrete masked-diffusion branch, and (iii) explicit type-aware routing for heterogeneous variable mechanisms within a single coherent generator \citep{yuan2025ctu,sha2026ddpm}.
\subsection{Transformer trend module for continuous dynamics}
\label{sec:method-trans}
We instantiate the temporal backbone as a causal Transformer trend extractor, leveraging self-attention's ability to represent long-range dependencies and cross-channel interactions without recurrence \citep{vaswani2017attention}. Compared with recurrent trend extractors (e.g., GRU-style backbones), a Transformer trend module offers a direct mechanism to model delayed effects and multivariate coupling, which are common in ICS, where control actions may influence downstream sensors with nontrivial lags and regime-dependent propagation \citep{vaswani2017attention,nist2023sp80082}. Crucially, in our design the Transformer is not asked to be the entire generator; instead, it serves a deliberately restricted role: providing a stable, temporally coherent conditioning signal that later stochastic components refine.
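Concretely, causality is enforced at the attention level: writing $\bm{Q}$, $\bm{K}$, and $\bm{V}$ for the per-layer queries, keys, and values with key dimension $d_k$, and $\bm{M}$ for an additive mask with $M_{t,t'} = -\infty$ for $t' > t$ and $0$ otherwise, each attention head computes
\begin{equation}
\mathrm{Attn}(\bm{Q},\bm{K},\bm{V}) = \mathrm{softmax}\!\left(\frac{\bm{Q}\bm{K}^{\top}}{\sqrt{d_k}} + \bm{M}\right)\bm{V},
\label{eq:causal_attn}
\end{equation}
so that the trend estimate at time $t$ never attends to future observations \citep{vaswani2017attention}.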
For continuous channels $\bm{X}$, we posit an additive decomposition:
\begin{equation}
\bm{X} = \bm{S} + \bm{R},
\label{eq:additive_decomp}
\end{equation}
where $\bm{S} \in \mathbb{R}^{L \times d_c}$ is a smooth trend capturing predictable temporal evolution, and $\bm{R} \in \mathbb{R}^{L \times d_c}$ is a residual capturing distributional detail (e.g., bursts, heavy tails, local fluctuations) that is difficult to represent robustly with a purely regression-based temporal objective. This separation reflects an explicit division of labor: the trend module prioritizes temporal coherence, while diffusion (introduced next) targets distributional realism at the residual level—a strategy aligned with “predict-then-refine” perspectives in time-series diffusion modeling \citep{kollovieh2023tsdiff,sikder2023transfusion}.
We parameterize the trend $\bm{S}$ using a causal Transformer $f_\phi$. With teacher forcing, we train $f_{\phi}$ to predict the next-step trend from past observations:
\begin{equation}
\hat{\bm{S}}_{t+1} = f_{\phi}(\bm{X}_{1:t}), \quad t = 1, \dots, L-1.
\label{eq:trend_prediction}
\end{equation}
The trend module is trained with the mean-squared error objective:
\begin{equation}
\mathcal{L}_{\text{trend}}(\phi) = \frac{1}{(L-1)d_c} \sum_{t=1}^{L-1} \bigl\| \hat{\bm{S}}_{t+1} - \bm{X}_{t+1} \bigr\|_2^2.
\label{eq:trend_loss}
\end{equation}
At inference, we roll out the Transformer autoregressively to obtain $\hat{\bm{S}}$, and then define the residual target for diffusion as $\bm{R} = \bm{X} - \hat{\bm{S}}$. This setup intentionally “locks in” a coherent low-frequency scaffold before any stochastic refinement is applied, thereby reducing the burden on downstream diffusion modules to simultaneously learn both long-range structure and marginal detail. In this sense, our use of Transformers is distinctive: it is a conditioning-first temporal backbone designed to stabilize mixed-type diffusion synthesis in ICS, rather than an end-to-end monolithic generator \citep{vaswani2017attention,kollovieh2023tsdiff,yuan2025ctu}.
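A minimal sketch of this rollout, assuming the window is seeded with a short context prefix of length $c<L$ (whether the prefix is observed or itself sampled is an implementation choice not fixed by the factorization), feeds predictions back as inputs:
\begin{equation}
\hat{\bm{S}}_{t+1} = f_{\phi}\bigl(\bigl[\bm{X}_{1:c},\, \hat{\bm{S}}_{c+1:t}\bigr]\bigr), \qquad t = c, \dots, L-1.
\label{eq:rollout}
\end{equation}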
\subsection{DDPM for continuous residual generation}
\label{sec:method-ddpm}
We model the residual $\bm{R}$ with a denoising diffusion probabilistic model (DDPM) conditioned on the trend $\hat{\bm{S}}$ \citep{ho2020denoising}. Diffusion models learn complex data distributions by inverting a tractable noising process through iterative denoising, and have proven effective at capturing multimodality and heavy-tailed structure that is often attenuated by purely regression-based sequence models \citep{ho2020denoising,song2021score}. Conditioning the diffusion model on $\hat{\bm{S}}$ is central: it prevents the denoiser from re-learning the low-frequency scaffold and focuses capacity on residual micro-structure, mirroring the broader principle that diffusion excels as a distributional corrector when a reasonable coarse structure is available \citep{kollovieh2023tsdiff, sikder2023transfusion}.
Let $K$ denote the number of diffusion steps, with a noise schedule $\{\beta_k\}_{k=1}^K$, $\alpha_k = 1 - \beta_k$, and $\bar{\alpha}_k = \prod_{i=1}^k \alpha_i$. The forward corruption process is:
\begin{equation}
q(\bm{r}_k \mid \bm{r}_0) = \mathcal{N}\bigl( \bm{r}_k;\, \sqrt{\bar{\alpha}_k}\,\bm{r}_0,\; (1 - \bar{\alpha}_k)\mathbf{I} \bigr),
\label{eq:forward_corruption}
\end{equation}
equivalently,
\begin{equation}
\bm{r}_k = \sqrt{\bar{\alpha}_k}\,\bm{r}_0 + \sqrt{1 - \bar{\alpha}_k}\,\boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}).
\label{eq:forward_corruption_eq}
\end{equation}
The learned reverse process is parameterized as:
\begin{equation}
p_{\theta}(\bm{r}_{k-1} \mid \bm{r}_k, \hat{\bm{S}}) = \mathcal{N}\bigl( \bm{r}_{k-1};\, \boldsymbol{\mu}_{\theta}(\bm{r}_k, k, \hat{\bm{S}}),\; \boldsymbol{\Sigma}(k) \bigr),
\label{eq:reverse_process}
\end{equation}
where $\boldsymbol{\mu}_{\theta}$ is implemented by a Transformer denoiser that consumes (i) the noised residual $\bm{r}_k$, (ii) a timestep embedding for $k$, and (iii) conditioning features derived from $\hat{\bm{S}}$. This denoiser architecture is consistent with the growing use of attention-based denoisers for long-context time-series diffusion, while our key methodological emphasis is the trend-conditioned residual factorization as the object of diffusion learning \citep{ho2020denoising,sikder2023transfusion}.
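For the $\epsilon$-prediction parameterization adopted below, this mean takes the standard DDPM closed form
\begin{equation}
\boldsymbol{\mu}_{\theta}(\bm{r}_k, k, \hat{\bm{S}}) = \frac{1}{\sqrt{\alpha_k}} \Bigl( \bm{r}_k - \frac{\beta_k}{\sqrt{1 - \bar{\alpha}_k}}\, \boldsymbol{\epsilon}_{\theta}(\bm{r}_k, k, \hat{\bm{S}}) \Bigr),
\label{eq:reverse_mean}
\end{equation}
with $\boldsymbol{\Sigma}(k)$ fixed to a schedule-dependent isotropic covariance such as $\beta_k \mathbf{I}$ \citep{ho2020denoising}.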
We train the denoiser using the standard DDPM $\epsilon$-prediction objective:
\begin{equation}
\mathcal{L}_{\text{cont}}(\theta) = \mathbb{E}_{k,\bm{r}_0,\boldsymbol{\epsilon}} \left[ \bigl\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}(\bm{r}_k, k, \hat{\bm{S}}) \bigr\|_2^2 \right].
\label{eq:ddpm_loss}
\end{equation}
Because diffusion optimization can exhibit timestep imbalance (i.e., some timesteps dominate gradients), we optionally apply an SNR-based reweighting consistent with Min-SNR training:
\begin{equation}
\mathcal{L}^{\text{snr}}_{\text{cont}}(\theta) = \mathbb{E}_{k,\bm{r}_0,\boldsymbol{\epsilon}} \left[ w_k \bigl\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}(\bm{r}_k, k, \hat{\bm{S}}) \bigr\|_2^2 \right],
\label{eq:snr_loss}
\end{equation}
where $w_k = \min\{\mathrm{SNR}_k, \gamma\}/\mathrm{SNR}_k$ is the Min-SNR weight for $\epsilon$-prediction, $\mathrm{SNR}_k=\bar{\alpha}_k/(1-\bar{\alpha}_k)$, and $\gamma>0$ is a cap parameter \citep{hang2023efficient}.
After sampling $\hat{\bm{R}}$ by reverse diffusion, we reconstruct the continuous output as $\hat{\bm{X}} = \hat{\bm{S}} + \hat{\bm{R}}$. Overall, the DDPM component serves as a distributional corrector on top of a temporally coherent backbone, which is particularly suited to ICS where low-frequency dynamics are strong and persistent but fine-scale variability (including bursts and regime-conditioned noise) remains important for realism. Relative to prior ICS diffusion efforts that primarily focus on continuous augmentation, our formulation elevates trend-conditioned residual diffusion as a modular mechanism for disentangling temporal structure from distributional refinement \citep{yuan2025ctu,sha2026ddpm}.
\subsection{Masked diffusion for discrete ICS variables}
\label{sec:method-discrete}
Discrete ICS variables must remain categorical, making Gaussian diffusion inappropriate for supervisory states and mode-like channels. While one can attempt continuous relaxations or post-hoc discretization, such strategies risk producing semantically invalid intermediate states (e.g., “in-between” modes) and can distort the discrete marginal distribution. Discrete-state diffusion provides a principled alternative by defining a valid corruption process directly on categorical variables \citep{austin2021structured,shi2024simplified}. In the ICS setting, this is not a secondary detail: supervisory tags often encode control logic boundaries (modes, alarms, interlocks) that must remain within a finite vocabulary to preserve semantic correctness \citep{nist2023sp80082}.
We therefore adopt masked (absorbing) diffusion for discrete channels, where corruption replaces tokens with a special $\texttt{[MASK]}$ symbol according to a schedule \citep{shi2024simplified}. For each variable $j$, define a masking schedule $\{m_k\}_{k=1}^K$ (with $m_k\in[0,1]$) increasing in $k$. The forward corruption process is:
\begin{equation}
q(y^{(j)}_k \mid y^{(j)}_0) =
\begin{cases}
y^{(j)}_0, & \text{with probability } 1 - m_k, \\
\texttt{[MASK]}, & \text{with probability } m_k,
\end{cases}
\label{eq:masking_process}
\end{equation}
applied independently across $j$ and $t$. Let $\mathcal{M}$ denote the set of masked positions at step $k$. The denoiser $h_{\psi}$ predicts a categorical distribution over $\mathcal{V}_j$ for each masked token, conditioned on (i) the corrupted discrete sequence, (ii) the diffusion step $k$, and (iii) continuous context. Concretely, we condition on $\hat{\bm{S}}$ and $\hat{\bm{X}}$ to couple supervisory reconstruction to the underlying continuous dynamics:
\begin{equation}
p_{\psi}\bigl( y^{(j)}_0 \mid y_k, k, \hat{\bm{S}}, \hat{\bm{X}} \bigr) = h_{\psi}(y_k, k, \hat{\bm{S}}, \hat{\bm{X}}).
\label{eq:discrete_denoising}
\end{equation}
This conditioning choice is motivated by the fact that many discrete ICS states are not standalone; they are functions of regimes, thresholds, and procedural phases that manifest in continuous channels \citep{nist2023sp80082}. Training uses a categorical denoising objective:
\begin{equation}
\mathcal{L}_{\text{disc}}(\psi) = \mathbb{E}_{k} \left[ \frac{1}{|\mathcal{M}|} \sum_{(j,t) \in \mathcal{M}} \mathrm{CE}\bigl( h_{\psi}(y_k, k, \hat{\bm{S}}, \hat{\bm{X}})_{j,t},\; y^{(j)}_{0,t} \bigr) \right],
\label{eq:discrete_loss}
\end{equation}
where $\mathrm{CE}(\cdot,\cdot)$ is cross-entropy. At sampling time, we initialize all discrete tokens as $\texttt{[MASK]}$ and iteratively unmask them using the learned conditionals, ensuring that every output token lies in its legal vocabulary by construction. This discrete branch is a key differentiator of our pipeline: unlike typical continuous-only diffusion augmentation in ICS, we integrate masked diffusion as a first-class mechanism for supervisory-variable legality within the same end-to-end synthesis workflow \citep{shi2024simplified,yuan2025ctu}.
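One concrete instance of this unmasking procedure, following standard ancestral samplers for absorbing-state diffusion (implementation details such as the exact schedule parameterization may differ), is as follows: at the transition from step $k$ to $k-1$, each still-masked position $(j,t)$ is revealed with probability
\begin{equation}
\frac{m_k - m_{k-1}}{m_k},
\label{eq:unmask_prob}
\end{equation}
its value is drawn from $p_{\psi}\bigl(y^{(j)}_0 \mid y_k, k, \hat{\bm{S}}, \hat{\bm{X}}\bigr)$ over $\mathcal{V}_j$, and positions already revealed remain unchanged \citep{austin2021structured,shi2024simplified}. Because every revealed token is sampled from a distribution supported on $\mathcal{V}_j$, no post-hoc projection or rounding is required.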
\subsection{Type-aware decomposition as factorization and routing layer}
\label{sec:method-types}
Even with a trend-conditioned residual DDPM and a discrete masked-diffusion branch, a single uniform modeling treatment can remain suboptimal because ICS variables are generated by qualitatively different mechanisms. For example, program-driven setpoints exhibit step-and-dwell dynamics; controller outputs follow control laws conditioned on process feedback; actuator positions may show saturation and dwell; and some “derived tags” are deterministic functions of other channels. Treating all channels as if they were exchangeable stochastic processes can misallocate model capacity and induce systematic error concentration on a small subset of mechanistically distinct variables \citep{nist2023sp80082}.
We therefore introduce a type-aware decomposition that formalizes this heterogeneity as a routing and constraint layer. Let $\tau(i)\in\{1,\dots,6\}$ assign each variable $i$ to a type class. The type assignment can be initialized from domain semantics (tag metadata, value domains, and engineering meaning), and subsequently refined via an error-attribution workflow described in the Benchmark section. Importantly, this refinement does not change the core diffusion backbone; it changes which mechanism is responsible for which variable, thereby aligning inductive bias with variable-generating mechanism while preserving overall coherence.
We use the following taxonomy:
\begin{enumerate}
\item Type 1 (program-driven / setpoint-like): externally commanded, step-and-dwell variables. These variables can be treated as exogenous drivers (conditioning signals) or routed to specialized change-point / dwell-time models, rather than being forced into a smooth denoiser that may over-regularize step structure.
\item Type 2 (controller outputs): continuous variables tightly coupled to feedback loops; these benefit from conditional modeling where the conditioning includes relevant process variables and commanded setpoints.
\item Type 3 (actuator states/positions): often exhibit saturation, dwell, and rate limits; these may require stateful dynamics beyond generic residual diffusion, motivating either specialized conditional modules or additional inductive constraints.
\item Type 4 (process variables): inertia-dominated continuous dynamics; these are the primary beneficiaries of the Transformer trend + residual DDPM pipeline.
\item Type 5 (derived/deterministic variables): algebraic or rule-based functions of other variables; we enforce deterministic reconstruction $\hat{x}^{(i)} = g_i(\hat{\bm{X}},\hat{\bm{Y}})$ rather than learning a stochastic generator, improving logical consistency and sample efficiency (a worked example follows this list).
\item Type 6 (auxiliary/low-impact variables): weakly coupled or sparse signals; we allow simplified modeling (e.g., calibrated marginals or lightweight temporal models) to avoid allocating diffusion capacity where it is not warranted.
\end{enumerate}
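As a purely illustrative example of Type 5 routing (the tag names below are hypothetical and not taken from the HAI schema), consider a derived channel that reports the deviation between a commanded setpoint and its process variable. Rather than learning this channel stochastically, it is reconstructed after sampling as
\begin{equation}
\hat{x}^{(\mathrm{err})}_t = \hat{x}^{(\mathrm{sp})}_t - \hat{x}^{(\mathrm{pv})}_t, \qquad t = 1, \dots, L,
\label{eq:type5_example}
\end{equation}
where $\hat{x}^{(\mathrm{sp})}$ and $\hat{x}^{(\mathrm{pv})}$ are produced by their respective routed generators.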
Type-aware decomposition improves synthesis quality through three mechanisms. First, it improves capacity allocation by preventing a small set of mechanistically atypical variables from dominating gradients and distorting the learned distribution for the majority class (typically Type 4). Second, it enables constraint enforcement by deterministically reconstructing Type 5 variables, preventing logically inconsistent samples that purely learned generators can produce. Third, it improves mechanism alignment by attaching inductive biases consistent with step/dwell or saturation behaviors where generic denoisers may implicitly favor smoothness.
From a novelty standpoint, this layer is not merely an engineering “patch”; it is an explicit methodological statement that ICS synthesis benefits from typed factorization—a principle that has analogues in mixed-type generative modeling more broadly, but that remains underexplored in diffusion-based ICS telemetry synthesis \citep{shi2025tabdiff,yuan2025ctu,nist2023sp80082}.
\subsection{Joint optimization and end-to-end sampling}
\label{sec:method-joint}
We train the model in a staged manner consistent with the above factorization, which improves optimization stability and encourages each component to specialize in its intended role. Specifically: (i) we train the trend Transformer $f_{\phi}$ to obtain $\hat{\bm{S}}$; (ii) we compute residual targets $\hat{\bm{R}} = \bm{X} - \hat{\bm{S}}$ for the continuous variables routed to residual diffusion; (iii) we train the residual DDPM $p_{\theta}(\bm{R}\mid \hat{\bm{S}})$ and masked diffusion model $p_{\psi}(\bm{Y}\mid \text{masked}(\bm{Y}), \hat{\bm{S}}, \hat{\bm{X}})$; and (iv) we apply type-aware routing and deterministic reconstruction during sampling. This staged strategy is aligned with the design goal of separating temporal scaffolding from distributional refinement, and it mirrors the broader intuition in time-series diffusion that decoupling coarse structure and stochastic detail can mitigate “structure vs. realism” conflicts \citep{kollovieh2023tsdiff,sikder2023transfusion}.
A simple combined objective is $\mathcal{L} = \lambda\mathcal{L}_{\text{cont}} + (1-\lambda)\mathcal{L}_{\text{disc}}$ with $\lambda\in[0,1]$ controlling the balance between continuous and discrete learning. Type-aware routing determines which channels contribute to which loss and which are excluded in favor of deterministic reconstruction. In practice, this routing acts as a principled guardrail against negative transfer across variable mechanisms: channels that are best handled deterministically (Type 5) or by specialized drivers (Type 1/3, depending on configuration) are prevented from forcing the diffusion models into statistically incoherent compromises.
At inference time, generation follows the same structured order: (i) trend $\hat{\bm{S}}$ via the Transformer, (ii) residual $\hat{\bm{R}}$ via DDPM, (iii) discrete $\hat{\bm{Y}}$ via masked diffusion, and (iv) type-aware assembly with deterministic reconstruction for routed variables. This pipeline produces $(\hat{\bm{X}},\hat{\bm{Y}})$ that are temporally coherent by construction (through $\hat{\bm{S}}$), distributionally expressive (through $\hat{\bm{R}}$ denoising), and discretely valid (through masked diffusion), while explicitly accounting for heterogeneous variable-generating mechanisms through type-aware routing. In combination, these choices constitute our central methodological contribution: a unified Transformer + mixed diffusion generator for ICS telemetry, augmented by typed factorization to align model capacity with domain mechanism \citep{ho2020denoising,shi2024simplified,yuan2025ctu,nist2023sp80082}.
% 4. Benchmark
\section{Benchmark}
\label{sec:benchmark}
We evaluate the proposed pipeline on feature sequences derived from the HAI Security Dataset, using fixed-length windows (L=96) that preserve the mixed-type structure of ICS telemetry. The goal of this benchmark is not only to report “overall similarity”, but to justify why the proposed factorization is a better fit for protocol feature synthesis: continuous channels must match physical marginals \citep{coletta2023constrained}, discrete channels must remain semantically legal, and both must retain short-horizon dynamics that underpin state transitions and interlocks \citep{yang2001interlock}.
This emphasis reflects evaluation practice in time-series generation, where strong results are typically supported by multiple complementary views (marginal fidelity, dependency/temporal structure, and downstream plausibility), rather than a single aggregate score \citep{stenger2024survey}. In the ICS setting, this multi-view requirement is sharper: a generator that matches continuous marginals while emitting out-of-vocabulary supervisory tokens is unusable for protocol reconstruction, and a generator that matches marginals but breaks lag structure can produce temporally implausible command/response sequences.
Recent ICS time-series generators often emphasize aggregate similarity scores and utility-driven evaluations (e.g., anomaly-detection performance) to demonstrate realism, which is valuable but can under-specify mixed-type protocol constraints. Our benchmark complements these practices by making mixed-type legality and per-feature distributional alignment explicit: discrete outputs are evaluated as categorical distributions (JSD) and are constrained to remain within the legal vocabulary by construction, while continuous channels are evaluated with nonparametric distribution tests (KS) \citep{yoon2019timegan}. This combination provides a direct, protocol-relevant justification for the hybrid design, rather than relying on a single composite score that may mask discrete failures.
For continuous channels, we measure distributional alignment using the Kolmogorov--Smirnov (KS) statistic computed per feature between the empirical distributions of real and synthetic samples, and then averaged across features. For discrete channels, we quantify marginal fidelity with Jensen--Shannon divergence (JSD) \citep{lin1991divergence,yoon2019timegan} between categorical distributions per feature, averaged across discrete variables. To assess temporal realism, we compare lag-1 autocorrelation at the feature level and report the mean absolute difference between real and synthetic lag-1 autocorrelation, averaged across features. In addition, to avoid degenerate comparisons driven by near-constant tags, features whose empirical standard deviation falls below a small threshold are excluded from continuous KS aggregation; such channels carry limited distributional information and can distort summary statistics.
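For completeness, the standard per-feature statistics underlying these aggregates can be stated explicitly. Writing $F^{\text{real}}_i$ and $F^{\text{syn}}_i$ for the empirical CDFs of continuous feature $i$, $p_j$ and $q_j$ for the real and synthetic categorical marginals of discrete feature $j$ with midpoint $m_j = \tfrac{1}{2}(p_j + q_j)$, and $\rho_1^{\text{real}}(i)$, $\rho_1^{\text{syn}}(i)$ for lag-1 autocorrelations, the per-feature quantities are
\begin{equation}
\mathrm{KS}_i = \sup_{x}\,\bigl|F^{\text{real}}_i(x) - F^{\text{syn}}_i(x)\bigr|,
\label{eq:ks_def}
\end{equation}
\begin{equation}
\mathrm{JSD}_j = \tfrac{1}{2}\,\mathrm{KL}\bigl(p_j \,\|\, m_j\bigr) + \tfrac{1}{2}\,\mathrm{KL}\bigl(q_j \,\|\, m_j\bigr),
\label{eq:jsd_def}
\end{equation}
\begin{equation}
\Delta\rho_i = \bigl|\rho_1^{\text{real}}(i) - \rho_1^{\text{syn}}(i)\bigr|,
\label{eq:lag1_def}
\end{equation}
and the numbers reported below are means of these quantities over the retained features, with near-constant channels excluded from the KS average as described above.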
\subsection{Quantitative results}
\label{sec:benchmark-quant}
Across all runs, the mean continuous KS is 0.3311 (std 0.0079) and the mean discrete JSD is 0.0284 (std 0.0073), indicating that the generator preserves both continuous marginals and discrete semantic distributions at the feature level. Temporal consistency is similarly stable across runs, with a mean lag-1 autocorrelation difference of 0.2684 (std 0.0027), suggesting that the synthesized windows retain short-horizon dynamical structure \citep{ni2021sigwasserstein} instead of collapsing to marginal matching alone. The best-performing instance (by mean KS) attains 0.3224, and the small inter-seed variance shows that the reported fidelity is reproducible rather than driven by a single favorable initialization.
\begin{figure}[htbp]
\centering
\includegraphics[width=0.8\textwidth]{fig-overall-benchmark-v1.png}
% \caption{Description of the figure.}
\label{fig:benchmark}
\end{figure}
\begin{table}[htbp]
\centering
\caption{Summary of benchmark metrics. Lower values indicate better performance.}
\label{tab:metrics}
\begin{tabular}{@{}l l c c@{}}
\toprule
\textbf{Metric} & \textbf{Aggregation} & \textbf{Lower is better} & \textbf{Mean $\pm$ Std} \\
\midrule
KS (continuous) & mean over continuous features & \checkmark & 0.3311 $\pm$ 0.0079 \\
JSD (discrete) & mean over discrete features & \checkmark & 0.0284 $\pm$ 0.0073 \\
Abs $\Delta$ lag-1 autocorr & mean over features & \checkmark & 0.2684 $\pm$ 0.0027 \\
\bottomrule
\end{tabular}
\end{table}
To make the benchmark actionable (and comparable to prior work), we report type-appropriate, interpretable statistics instead of collapsing everything into a single similarity score. This matters in mixed-type ICS telemetry: continuous fidelity can be high while discrete semantics fail, and vice versa. By separating continuous (KS), discrete (JSD), and temporal (lag-1) views, the evaluation directly matches the design goals of the hybrid generator: distributional refinement for continuous residuals, vocabulary-valid reconstruction for discrete supervision, and trend-induced short-horizon coherence.
In addition, the seed-averaged reporting mirrors evaluation conventions in recent diffusion-based time-series generation studies, where robustness across runs is increasingly treated as a first-class signal rather than an afterthought. In this sense, the small inter-seed variance is itself evidence that the factorized training and typed routing reduce instability and localized error concentration, which is frequently observed when heterogeneous channels compete for the same modeling capacity.
% 5. Future Work
\section{Future Work}
\label{sec:future}
Future work will expand from generating legal ICS feature sequences toward data construction and adversarial evaluation for security tasks. The core contribution of this paper is the generation of feature sequences that are temporally consistent, distributionally credible, and discretely legal under mixed types and multi-scale dynamics. In practical intrusion- and anomaly-detection research, however, the more pressing bottleneck is often the lack of attack or anomaly data with clear semantics and sufficient coverage. A direct and important extension is therefore to treat the legal sequences generated here as a controllable, reproducible baseline operating flow and, while preserving sequence-level legality and engineering constraints, to inject or blend illegal behaviors according to specified attack patterns, thereby systematically constructing datasets for training and evaluating the detection of illegal data packets.
Specifically, attack injection can be upgraded from simple perturbation to semantically consistent, patterned rewriting. On continuous channels, this includes bias injection, covert manipulation near thresholds, instantaneous jumps, and intermittent bursts, which mimic the temporal signatures that attackers use for concealment without violating the basic boundary conditions of the process dynamics. On discrete channels, it includes illegal state transitions, alarm suppression or delayed triggering, and mode camouflage, reflecting trajectories that are unreachable under real control logic but forcibly created by an attacker. The injection process can further be coordinated with the type routing and constraint layer of this paper: for deterministically derived variables, illegal behavior should propagate through modifications of upstream variables to preserve consistency; for supervisory variables constrained by finite-state machines, interpretable illegal transitions can be generated via minimum-violation paths or controlled violation intensity, with violation points and types explicitly labeled so that downstream detection tasks can learn finer-grained discrimination criteria.
Methodologically, this direction also supports stronger controllability and measurability. Attack patterns can be treated as conditioning variables that jointly orchestrate legitimate generation and illegal injection, producing matched samples of different attack strategies under the same legitimate baseline and turning dataset construction into a repeatable scenario-generation process. By controlling the injection location, duration, amplitude, and coupling range, one can systematically sweep the performance-degradation curves of detectors across threat intensities and operating-condition stages, yielding a more stable benchmark than a single capture or a single script. Ultimately, this turns the legitimate-data generation capability presented in this paper into infrastructure for security research: a shareable, reproducible distribution of legitimate operation, augmented by controllable injection of clearly defined illegal patterns, producing datasets with sufficient coverage and consistent annotation for training and evaluating models that identify illegal packets and abnormal sequences, and improving reproducibility and engineering credibility in this direction.
% 6. Conclusion
\section{Conclusion}
\label{sec:conclusion}
This paper addresses the data scarcity and shareability barriers that limit machine-learning research for industrial control system (ICS) security by proposing a practical synthetic telemetry generation framework at the protocol feature level. We introduced Mask-DDPM, a hybrid generator designed explicitly for the mixed-type and multi-scale nature of ICS data, where continuous process dynamics must remain temporally coherent while discrete supervisory variables must remain categorically legal by construction.
Our main contributions are: (i) a causal Transformer trend module that provides a stable long-horizon temporal scaffold for continuous channels; (ii) a trend-conditioned residual DDPM that focuses modeling capacity on local stochastic detail and marginal fidelity without destabilizing global structure; (iii) a masked (absorbing) diffusion branch for discrete variables that guarantees in-vocabulary outputs and supports semantics-aware conditioning on continuous context; and (iv) a type-aware decomposition/routing layer that aligns model mechanisms with heterogeneous ICS variable origins (e.g., process inertia, step-and-dwell setpoints, deterministic derived tags), enabling deterministic enforcement where appropriate and improving capacity allocation.
We evaluated the approach on windows derived from the HAI Security Dataset and reported mixed-type, protocol-relevant metrics rather than a single aggregate score. Across seeds, the model achieves stable fidelity with mean KS = 0.3311 ± 0.0079 on continuous features, mean JSD = 0.0284 ± 0.0073 on discrete features, and mean absolute lag-1 autocorrelation difference 0.2684 ± 0.0027, indicating that Mask-DDPM preserves both marginal distributions and short-horizon dynamics while maintaining discrete legality.
Overall, Mask-DDPM provides a reproducible foundation for generating shareable, semantically valid ICS feature sequences suitable for data augmentation, benchmarking, and downstream packet/trace reconstruction workflows. Building on this capability, a natural next step is to move from purely legal synthesis toward controllable scenario construction, including structured attack/violation injection under engineering constraints to support adversarial evaluation and more comprehensive security benchmarks.
% References
\bibliographystyle{unsrtnat}
\bibliography{references}
\end{document}

555
arxiv-style/references.bib Normal file

@@ -0,0 +1,555 @@
@inproceedings{vaswani2017attention,
title={Attention Is All You Need},
author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, {\L}ukasz and Polosukhin, Illia},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
volume={30},
year={2017},
url={https://arxiv.org/abs/1706.03762}
}
@inproceedings{ho2020denoising,
title={Denoising Diffusion Probabilistic Models},
author={Ho, Jonathan and Jain, Ajay and Abbeel, Pieter},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
volume={33},
pages={6840--6851},
year={2020},
url={https://arxiv.org/abs/2006.11239}
}
@inproceedings{austin2021structured,
title={Structured Denoising Diffusion Models in Discrete State-Spaces},
author={Austin, Jacob and Johnson, Daniel D and Ho, Jonathan and Tarlow, Daniel and van den Berg, Rianne},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
volume={34},
pages={17981--17993},
year={2021},
url={https://arxiv.org/abs/2107.03006}
}
@article{shi2024simplified,
title={Simplified and Generalized Masked Diffusion for Discrete Data},
author={Shi, Jiaxin and Han, Kehang and Wang, Zhe and Doucet, Arnaud and Titsias, Michalis K.},
journal={arXiv preprint},
eprint={2406.04329},
archivePrefix={arXiv},
year={2024},
url={https://arxiv.org/abs/2406.04329}
}
@inproceedings{hang2023efficient,
title={Efficient Diffusion Training via Min-SNR Weighting Strategy},
author={Hang, Tiankai and Gu, Shuyang and Li, Chen and Bao, Jianmin and Chen, Dong and Hu, Han and Geng, Xin and Guo, Baining},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
pages={7407--7417},
year={2023},
doi={10.1109/ICCV51070.2023.00702},
url={https://arxiv.org/abs/2303.09556}
}
@inproceedings{kollovieh2023tsdiff,
title={Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting},
author={Kollovieh, Marcel and Ansari, Abdul Fatir and Bohlke-Schneider, Michael and Salinas, David},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
volume={36},
year={2023},
url={https://arxiv.org/abs/2307.11494}
}
@article{sikder2023transfusion,
title={TransFusion: Generating Long, High Fidelity Time Series using Diffusion Models with Transformers},
author={Sikder, M. F. and Ramachandranpillai, R. and Heintz, F.},
journal={arXiv preprint},
eprint={2307.12667},
archivePrefix={arXiv},
year={2023},
url={https://arxiv.org/abs/2307.12667}
}
@inproceedings{song2021score,
title={Score-Based Generative Modeling through Stochastic Differential Equations},
author={Song, Yang and Sohl-Dickstein, Jascha and Kingma, Diederik P and Kumar, Abhishek and Ermon, Stefano and Poole, Ben},
booktitle={International Conference on Learning Representations (ICLR)},
year={2021},
url={https://arxiv.org/abs/2011.13456}
}
@inproceedings{shi2025tabdiff,
title={TabDiff: A Mixed-type Diffusion Model for Tabular Data Generation},
author={Shi, Juntong and Xu, Minkai and Hua, Harper and Zhang, Hengrui and Ermon, Stefano and Leskovec, Jure},
booktitle={International Conference on Learning Representations (ICLR)},
year={2025},
url={https://arxiv.org/abs/2410.20626}
}
@inproceedings{yuan2025ctu,
title={CTU-DDPM: Generating Industrial Control System Time-Series Data with a CNN-Transformer Hybrid Diffusion Model},
author={Yuan, Yusong and Sha, Yun and Zhao, Wei and Zhang, Kun},
booktitle={Proceedings of the 2025 International Symposium on Artificial Intelligence and Computational Social Sciences (ACM AICSS)},
pages={123--132},
year={2025},
doi={10.1145/3776759.3776845},
url={https://dl.acm.org/doi/10.1145/3776759.3776845}
}
@misc{sha2026ddpm,
title={DDPM Fusing Mamba and Adaptive Attention: An Augmentation Method for Industrial Control Systems Anomaly Data},
author={Sha, Yun and Yuan, Yusong and Wu, Yonghao and Zhao, Haidong},
year={2026},
month={jan},
note={SSRN Electronic Journal},
eprint={6055903},
archivePrefix={SSRN},
doi={10.2139/ssrn.6055903},
url={https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6055903}
}
@techreport{nist2023sp80082,
title={Guide to Operational Technology (OT) Security},
author={{National Institute of Standards and Technology}},
institution={NIST},
type={Special Publication},
number={800-82 Rev. 3},
year={2023},
month={sep},
doi={10.6028/NIST.SP.800-82r3},
url={https://csrc.nist.gov/pubs/sp/800/82/r3/final}
}
Reference for Introduction Part
@article{10.1007/s10844-022-00753-1,
author = {Koay, Abigail M. Y. and Ko, Ryan K. L and Hettema, Hinne and Radke, Kenneth},
title = {Machine learning in industrial control system (ICS) security: current landscape, opportunities and challenges},
year = {2022},
issue_date = {Apr 2023},
publisher = {Kluwer Academic Publishers},
address = {USA},
volume = {60},
number = {2},
issn = {0925-9902},
url = {https://doi.org/10.1007/s10844-022-00753-1},
doi = {10.1007/s10844-022-00753-1},
abstract = {The advent of Industry 4.0 has led to a rapid increase in cyber attacks on industrial systems and processes, particularly on Industrial Control Systems (ICS). These systems are increasingly becoming prime targets for cyber criminals and nation-states looking to extort large ransoms or cause disruptions due to their ability to cause devastating impact whenever they cease working or malfunction. Although myriads of cyber attack detection systems have been proposed and developed, these detection systems still face many challenges that are typically not found in traditional detection systems. Motivated by the need to better understand these challenges to improve current approaches, this paper aims to (1) understand the current vulnerability landscape in ICS, (2) survey current advancements of Machine Learning (ML) based methods with respect to the usage of ML base classifiers (3) provide insights to benefits and limitations of recent advancement with respect to two performance vectors; detection accuracy and attack variety. Based on our findings, we present key open challenges which will represent exciting research opportunities for the research community.},
journal = {J. Intell. Inf. Syst.},
month = oct,
pages = {377--405},
numpages = {29},
keywords = {Operational technology, Cyber security, Dataset, Industrial control systems, Machine learning, Critical infrastructure}
}
@ARTICLE{Nankya2023-gp,
title = "Securing industrial Control Systems: Components, cyber threats,
and machine learning-driven defense strategies",
author = "Nankya, Mary and Chataut, Robin and Akl, Robert",
abstract = "Industrial Control Systems (ICS), which include Supervisory
Control and Data Acquisition (SCADA) systems, Distributed
Control Systems (DCS), and Programmable Logic Controllers (PLC),
play a crucial role in managing and regulating industrial
processes. However, ensuring the security of these systems is of
utmost importance due to the potentially severe consequences of
cyber attacks. This article presents an overview of ICS
security, covering its components, protocols, industrial
applications, and performance aspects. It also highlights the
typical threats and vulnerabilities faced by these systems.
Moreover, the article identifies key factors that influence the
design decisions concerning control, communication, reliability,
and redundancy properties of ICS, as these are critical in
determining the security needs of the system. The article
outlines existing security countermeasures, including network
segmentation, access control, patch management, and security
monitoring. Furthermore, the article explores the integration of
machine learning techniques to enhance the cybersecurity of ICS.
Machine learning offers several advantages, such as anomaly
detection, threat intelligence analysis, and predictive
maintenance. However, combining machine learning with other
security measures is essential to establish a comprehensive
defense strategy for ICS. The article also addresses the
challenges associated with existing measures and provides
recommendations for improving ICS security. This paper becomes a
valuable reference for researchers aiming to make meaningful
contributions within the constantly evolving ICS domain by
providing an in-depth examination of the present state,
challenges, and potential future advancements.",
journal = "Sensors (Basel)",
publisher = "MDPI AG",
volume = 23,
number = 21,
pages = "8840",
month = oct,
year = 2023,
keywords = "SCADA; anomaly detection; artificial intelligence; attacks;
cyber defense; cyber threats; industrial control systems;
security; vulnerabilities",
copyright = "https://creativecommons.org/licenses/by/4.0/",
language = "en"
}
@misc{shin,
title = {HAI Security Dataset},
url = {https://www.kaggle.com/dsv/5821622},
doi = {10.34740/kaggle/dsv/5821622},
publisher = {Kaggle},
author = {Shin, Hyeok-Ki and Lee, Woomyo and Choi, Seungoh and Yun, Jeong-Han and Min, Byung Gil and Kim, HyoungChun},
year = {2023}
}
@Article{info16100910,
AUTHOR = {Ali, Jokha and Ali, Saqib and Al Balushi, Taiseera and Nadir, Zia},
TITLE = {Intrusion Detection in Industrial Control Systems Using Transfer Learning Guided by Reinforcement Learning},
JOURNAL = {Information},
VOLUME = {16},
YEAR = {2025},
NUMBER = {10},
ARTICLE-NUMBER = {910},
URL = {https://www.mdpi.com/2078-2489/16/10/910},
ISSN = {2078-2489},
ABSTRACT = {Securing Industrial Control Systems (ICSs) is critical, but it is made challenging by the constant evolution of cyber threats and the scarcity of labeled attack data in these specialized environments. Standard intrusion detection systems (IDSs) often fail to adapt when transferred to new networks with limited data. To address this, this paper introduces an adaptive intrusion detection framework that combines a hybrid Convolutional Neural Network and Long Short-Term Memory (CNN-LSTM) model with a novel transfer learning strategy. We employ a Reinforcement Learning (RL) agent to intelligently guide the fine-tuning process, which allows the IDS to dynamically adjust its parameters such as layer freezing and learning rates in real-time based on performance feedback. We evaluated our system in a realistic data-scarce scenario using only 50 labeled training samples. Our RL-Guided model achieved a final F1-score of 0.9825, significantly outperforming a standard neural fine-tuning model (0.861) and a target baseline model (0.759). Analysis of the RL agents behavior confirmed that it learned a balanced and effective policy for adapting the model to the target domain. We conclude that the proposed RL-guided approach creates a highly accurate and adaptive IDS that overcomes the limitations of static transfer learning methods. This dynamic fine-tuning strategy is a powerful and promising direction for building resilient cybersecurity defenses for critical infrastructure.},
DOI = {10.3390/info16100910}
}
@InProceedings{pmlr-v202-kotelnikov23a,
title = {{T}ab{DDPM}: Modelling Tabular Data with Diffusion Models},
author = {Kotelnikov, Akim and Baranchuk, Dmitry and Rubachev, Ivan and Babenko, Artem},
booktitle = {Proceedings of the 40th International Conference on Machine Learning},
pages = {17564--17579},
year = {2023},
editor = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
volume = {202},
series = {Proceedings of Machine Learning Research},
month = {23--29 Jul},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v202/kotelnikov23a/kotelnikov23a.pdf},
url = {https://proceedings.mlr.press/v202/kotelnikov23a.html},
abstract = {Denoising diffusion probabilistic models are becoming the leading generative modeling paradigm for many important data modalities. Being the most prevalent in the computer vision community, diffusion models have recently gained some attention in other domains, including speech, NLP, and graph-like data. In this work, we investigate if the framework of diffusion models can be advantageous for general tabular problems, where data points are typically represented by vectors of heterogeneous features. The inherent heterogeneity of tabular data makes it quite challenging for accurate modeling since the individual features can be of a completely different nature, i.e., some of them can be continuous and some can be discrete. To address such data types, we introduce TabDDPM — a diffusion model that can be universally applied to any tabular dataset and handles any feature types. We extensively evaluate TabDDPM on a wide set of benchmarks and demonstrate its superiority over existing GAN/VAE alternatives, which is consistent with the advantage of diffusion models in other fields.}
}
@misc{rasul2021autoregressivedenoisingdiffusionmodels,
title={Autoregressive Denoising Diffusion Models for Multivariate Probabilistic Time Series Forecasting},
author={Kashif Rasul and Calvin Seward and Ingmar Schuster and Roland Vollgraf},
year={2021},
eprint={2101.12072},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2101.12072},
}
@misc{jiang2023netdiffusionnetworkdataaugmentation,
title={NetDiffusion: Network Data Augmentation Through Protocol-Constrained Traffic Generation},
author={Xi Jiang and Shinan Liu and Aaron Gember-Jacobson and Arjun Nitin Bhagoji and Paul Schmitt and Francesco Bronzino and Nick Feamster},
year={2023},
eprint={2310.08543},
archivePrefix={arXiv},
primaryClass={cs.NI},
url={https://arxiv.org/abs/2310.08543},
}
Reference for Related Work
@inproceedings{10.1145/1159913.1159928,
author = {Vishwanath, Kashi Venkatesh and Vahdat, Amin},
title = {Realistic and responsive network traffic generation},
year = {2006},
isbn = {1595933085},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1159913.1159928},
doi = {10.1145/1159913.1159928},
abstract = {This paper presents Swing, a closed-loop, network-responsive traffic generator that accurately captures the packet interactions of a range of applications using a simple structural model. Starting from observed traffic at a single point in the network, Swing automatically extracts distributions for user, application, and network behavior. It then generates live traffic corresponding to the underlying models in a network emulation environment running commodity network protocol stacks. We find that the generated traces are statistically similar to the original traces. Further, to the best of our knowledge, we are the first to reproduce burstiness in traffic across a range of timescales using a model applicable to a variety of network settings. An initial sensitivity analysis reveals the importance of capturing and recreating user, application, and network characteristics to accurately reproduce such burstiness. Finally, we explore Swing's ability to vary user characteristics, application properties, and wide-area network conditions to project traffic characteristics into alternate scenarios.},
booktitle = {Proceedings of the 2006 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications},
pages = {111--122},
numpages = {12},
keywords = {burstiness, energy plot, generator, internet, modeling, structural model, traffic, wavelets},
location = {Pisa, Italy},
series = {SIGCOMM '06}
}
@article{Ring_2019,
title={Flow-based network traffic generation using Generative Adversarial Networks},
volume={82},
ISSN={0167-4048},
url={http://dx.doi.org/10.1016/j.cose.2018.12.012},
DOI={10.1016/j.cose.2018.12.012},
journal={Computers \& Security},
publisher={Elsevier BV},
author={Ring, Markus and Schlör, Daniel and Landes, Dieter and Hotho, Andreas},
year={2019},
month=may, pages={156--172}
}
@inproceedings{10.1145/3544216.3544251,
author = {Yin, Yucheng and Lin, Zinan and Jin, Minhao and Fanti, Giulia and Sekar, Vyas},
title = {Practical GAN-based synthetic IP header trace generation using NetShare},
year = {2022},
isbn = {9781450394208},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3544216.3544251},
doi = {10.1145/3544216.3544251},
abstract = {We explore the feasibility of using Generative Adversarial Networks (GANs) to automatically learn generative models to generate synthetic packet- and flow header traces for networking tasks (e.g., telemetry, anomaly detection, provisioning). We identify key fidelity, scalability, and privacy challenges and tradeoffs in existing GAN-based approaches. By synthesizing domain-specific insights with recent advances in machine learning and privacy, we identify design choices to tackle these challenges. Building on these insights, we develop an end-to-end framework, NetShare. We evaluate NetShare on six diverse packet header traces and find that: (1) across all distributional metrics and traces, it achieves 46\% more accuracy than baselines and (2) it meets users' requirements of downstream tasks in evaluating accuracy and rank ordering of candidate approaches.},
booktitle = {Proceedings of the ACM SIGCOMM 2022 Conference},
pages = {458--472},
numpages = {15},
keywords = {synthetic data generation, privacy, network packets, network flows, generative adversarial networks},
location = {Amsterdam, Netherlands},
series = {SIGCOMM '22}
}
@inproceedings{Lin_2020, series={IMC '20},
title={Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions},
url={http://dx.doi.org/10.1145/3419394.3423643},
DOI={10.1145/3419394.3423643},
booktitle={Proceedings of the ACM Internet Measurement Conference},
publisher={ACM},
author={Lin, Zinan and Jain, Alankar and Wang, Chen and Fanti, Giulia and Sekar, Vyas},
year={2020},
month=oct, pages={464--483},
collection={IMC '20}
}
@INPROCEEDINGS{7469060,
author={Mathur, Aditya P. and Tippenhauer, Nils Ole},
booktitle={2016 International Workshop on Cyber-physical Systems for Smart Water Networks (CySWater)},
title={SWaT: a water treatment testbed for research and training on ICS security},
year={2016},
volume={},
number={},
pages={31-36},
keywords={Sensors;Actuators;Feeds;Process control;Chemicals;Chemical sensors;Security;Cyber Physical Systems;Industrial Control Systems;Cyber Attacks;Cyber Defense;Water Testbed},
doi={10.1109/CySWater.2016.7469060}
}
@inproceedings{10.1145/3055366.3055375,
author = {Ahmed, Chuadhry Mujeeb and Palleti, Venkata Reddy and Mathur, Aditya P.},
title = {WADI: a water distribution testbed for research in the design of secure cyber physical systems},
year = {2017},
isbn = {9781450349758},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3055366.3055375},
doi = {10.1145/3055366.3055375},
abstract = {The architecture of a water distribution testbed (WADI), and on-going research in the design of secure water distribution system is presented. WADI consists of three stages controlled by Programmable Logic Controllers (PLCs) and two stages controlled via Remote Terminal Units (RTUs). Each PLC and RTU uses sensors to estimate the system state and the actuators to effect control. WADI is currently used to (a) conduct security analysis for water distribution networks, (b) experimentally assess detection mechanisms for potential cyber and physical attacks, and (c) understand how the impact of an attack on one CPS could cascade to other connected CPSs. The cascading effects of attacks can be studied in WADI through its connection to two other testbeds, namely for water treatment and power generation and distribution.},
booktitle = {Proceedings of the 3rd International Workshop on Cyber-Physical Systems for Smart Water Networks},
pages = {25--28},
numpages = {4},
keywords = {attack detection, cyber physical systems, cyber security, industrial control systems, water distribution testbed},
location = {Pittsburgh, Pennsylvania},
series = {CySWATER '17}
}
@misc{tashiro2021csdiconditionalscorebaseddiffusion,
title={CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation},
author={Yusuke Tashiro and Jiaming Song and Yang Song and Stefano Ermon},
year={2021},
eprint={2107.03502},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={httpsarxiv.orgabs2107.03502},
}
@misc{wen2024diffstgprobabilisticspatiotemporalgraph,
title={DiffSTG: Probabilistic Spatio-Temporal Graph Forecasting with Denoising Diffusion Models},
author={Haomin Wen and Youfang Lin and Yutong Xia and Huaiyu Wan and Qingsong Wen and Roger Zimmermann and Yuxuan Liang},
year={2024},
eprint={2301.13629},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2301.13629},
}
@misc{liu2023pristiconditionaldiffusionframework,
title={PriSTI: A Conditional Diffusion Framework for Spatiotemporal Imputation},
author={Mingzhe Liu and Han Huang and Hao Feng and Leilei Sun and Bowen Du and Yanjie Fu},
year={2023},
eprint={2302.09746},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2302.09746},
}
@misc{kong2021diffwaveversatilediffusionmodel,
title={DiffWave: A Versatile Diffusion Model for Audio Synthesis},
author={Zhifeng Kong and Wei Ping and Jiaji Huang and Kexin Zhao and Bryan Catanzaro},
year={2021},
eprint={2009.09761},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2009.09761},
}
@ARTICLE{11087622,
author={Liu, Xiaosi and Xu, Xiaowen and Liu, Zhidan and Li, Zhenjiang and Wu, Kaishun},
journal={IEEE Transactions on Mobile Computing},
title={Spatio-Temporal Diffusion Model for Cellular Traffic Generation},
year={2026},
volume={25},
number={1},
pages={257-271},
keywords={Base stations;Diffusion models;Data models;Uncertainty;Predictive models;Generative adversarial networks;Knowledge graphs;Mobile computing;Telecommunication traffic;Semantics;Cellular traffic;data generation;diffusion model;spatio-temporal graph},
doi={10.1109/TMC.2025.3591183}
}
@misc{hoogeboom2021argmaxflowsmultinomialdiffusion,
title={Argmax Flows and Multinomial Diffusion: Learning Categorical Distributions},
author={Emiel Hoogeboom and Didrik Nielsen and Priyank Jaini and Patrick Forré and Max Welling},
year={2021},
eprint={2102.05379},
archivePrefix={arXiv},
primaryClass={stat.ML},
url={https://arxiv.org/abs/2102.05379},
}
@misc{li2022diffusionlmimprovescontrollabletext,
title={Diffusion-LM Improves Controllable Text Generation},
author={Xiang Lisa Li and John Thickstun and Ishaan Gulrajani and Percy Liang and Tatsunori B. Hashimoto},
year={2022},
eprint={2205.14217},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={httpsarxiv.orgabs2205.14217},
}
@misc{meng2025aflnetyearslatercoverageguided,
title={AFLNet Five Years Later: On Coverage-Guided Protocol Fuzzing},
author={Ruijie Meng and Van-Thuan Pham and Marcel Böhme and Abhik Roychoudhury},
year={2025},
eprint={2412.20324},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2412.20324},
}
@misc{godefroid2017learnfuzzmachinelearninginput,
title={Learn\&Fuzz: Machine Learning for Input Fuzzing},
author={Patrice Godefroid and Hila Peleg and Rishabh Singh},
year={2017},
eprint={1701.07232},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/1701.07232},
}
@misc{she2019neuzzefficientfuzzingneural,
title={NEUZZ: Efficient Fuzzing with Neural Program Smoothing},
author={Dongdong She and Kexin Pei and Dave Epstein and Junfeng Yang and Baishakhi Ray and Suman Jana},
year={2019},
eprint={1807.05620},
archivePrefix={arXiv},
primaryClass={cs.CR},
url={https://arxiv.org/abs/1807.05620},
}
@misc{dai2019transformerxlattentivelanguagemodels,
title={Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context},
author={Zihang Dai and Zhilin Yang and Yiming Yang and Jaime Carbonell and Quoc V. Le and Ruslan Salakhutdinov},
year={2019},
eprint={1901.02860},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/1901.02860},
}
@misc{zhou2021informerefficienttransformerlong,
title={Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting},
author={Haoyi Zhou and Shanghang Zhang and Jieqi Peng and Shuai Zhang and Jianxin Li and Hui Xiong and Wancai Zhang},
year={2021},
eprint={2012.07436},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2012.07436},
}
@misc{wu2022autoformerdecompositiontransformersautocorrelation,
title={Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting},
author={Haixu Wu and Jiehui Xu and Jianmin Wang and Mingsheng Long},
year={2022},
eprint={2106.13008},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2106.13008},
}
@misc{zhou2022fedformerfrequencyenhanceddecomposed,
title={FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting},
author={Tian Zhou and Ziqing Ma and Qingsong Wen and Xue Wang and Liang Sun and Rong Jin},
year={2022},
eprint={2201.12740},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2201.12740},
}
@article{2023,
title={A Note on Extremal Sombor Indices of Trees with a Given Degree Sequence},
volume={90},
ISSN={0340-6253},
url={http://dx.doi.org/10.46793/match.90-1.197D},
DOI={10.46793/match.90-1.197d},
number={1},
journal={Match Communications in Mathematical and in Computer Chemistry},
publisher={University Library in Kragujevac},
author={Damjanović, Ivan and Milošević, Marko and Stevanović, Dragan},
year={2023},
pages={197--202}
}
Reference for Benchmark
@article{stenger2024survey,
title={Evaluation is key: a survey on evaluation measures for synthetic time series},
author={Stenger, Michael and Leppich, Robert and Foster, Ian T and Kounev, Samuel and Bauer, Andre},
journal={Journal of Big Data},
volume={11},
number={1},
pages={66},
year={2024},
publisher={Springer}
}
@article{lin1991divergence,
title={Divergence measures based on the Shannon entropy},
author={Lin, Jianhua},
journal={IEEE Transactions on Information Theory},
volume={37},
number={1},
pages={145--151},
year={1991}
}
@inproceedings{yoon2019timegan,
title={Time-series generative adversarial networks},
author={Yoon, Jinsung and Jarrett, Daniel and van der Schaar, Mihaela},
booktitle={Advances in Neural Information Processing Systems},
volume={32},
year={2019}
}
@article{ni2021sigwasserstein,
title={Sig-Wasserstein GANs for time series generation},
author={Ni, Hao and Szpruch, Lukasz and Wiese, Magnus and Liao, Shujian and Xiao, Baoren},
journal={Proceedings of the ACM on Measurement and Analysis of Computing Systems},
volume={5},
number={3},
pages={1--25},
year={2021}
}
@inproceedings{coletta2023constrained,
title={On the constrained time-series generation problem},
author={Coletta, Alessandro and Rossi, Roberto and others},
booktitle={Advances in Neural Information Processing Systems},
volume={36},
year={2023}
}
@article{yang2001interlock,
title={Automatic verification of safety interlock systems for industrial processes},
author={Yang, Sheng-Hong and Hsieh, Min-Chi},
journal={Journal of Loss Prevention in the Process Industries},
volume={14},
number={6},
pages={473--483},
year={2001},
publisher={Elsevier}
}

214
arxiv-style/template.tex Normal file

@@ -0,0 +1,214 @@
\documentclass{article}
\usepackage{arxiv}
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc} % use 8-bit T1 fonts
\usepackage{hyperref} % hyperlinks
\usepackage{url} % simple URL typesetting
\usepackage{booktabs} % professional-quality tables
\usepackage{amsfonts} % blackboard math symbols
\usepackage{nicefrac} % compact symbols for 1/2, etc.
\usepackage{microtype} % microtypography
\usepackage{cleveref} % smart cross-referencing
\usepackage{lipsum} % Can be removed after putting your text content
\usepackage{graphicx}
\usepackage{natbib}
\usepackage{doi}
\title{A template for the \emph{arxiv} style}
% Here you can change the date presented in the paper title
%\date{September 9, 1985}
% Or remove it
%\date{}
\newif\ifuniqueAffiliation
% Comment to use multiple affiliations variant of author block
\uniqueAffiliationtrue
\ifuniqueAffiliation % Standard variant of author block
\author{ \href{https://orcid.org/0000-0000-0000-0000}{\includegraphics[scale=0.06]{orcid.pdf}\hspace{1mm}David S.~Hippocampus}\thanks{Use footnote for providing further
information about author (webpage, alternative
address)---\emph{not} for acknowledging funding agencies.} \\
Department of Computer Science\\
Cranberry-Lemon University\\
Pittsburgh, PA 15213 \\
\texttt{hippo@cs.cranberry-lemon.edu} \\
%% examples of more authors
\And
\href{https://orcid.org/0000-0000-0000-0000}{\includegraphics[scale=0.06]{orcid.pdf}\hspace{1mm}Elias D.~Striatum} \\
Department of Electrical Engineering\\
Mount-Sheikh University\\
Santa Narimana, Levand \\
\texttt{stariate@ee.mount-sheikh.edu} \\
%% \AND
%% Coauthor \\
%% Affiliation \\
%% Address \\
%% \texttt{email} \\
%% \And
%% Coauthor \\
%% Affiliation \\
%% Address \\
%% \texttt{email} \\
%% \And
%% Coauthor \\
%% Affiliation \\
%% Address \\
%% \texttt{email} \\
}
\else
% Multiple affiliations variant of author block
\usepackage{authblk}
\renewcommand\Authfont{\bfseries}
\setlength{\affilsep}{0em}
% box is needed for correct spacing with authblk
\newbox{\orcid}\sbox{\orcid}{\includegraphics[scale=0.06]{orcid.pdf}}
\author[1]{%
\href{https://orcid.org/0000-0000-0000-0000}{\usebox{\orcid}\hspace{1mm}David S.~Hippocampus\thanks{\texttt{hippo@cs.cranberry-lemon.edu}}}%
}
\author[1,2]{%
\href{https://orcid.org/0000-0000-0000-0000}{\usebox{\orcid}\hspace{1mm}Elias D.~Striatum\thanks{\texttt{stariate@ee.mount-sheikh.edu}}}%
}
\affil[1]{Department of Computer Science, Cranberry-Lemon University, Pittsburgh, PA 15213}
\affil[2]{Department of Electrical Engineering, Mount-Sheikh University, Santa Narimana, Levand}
\fi
% Uncomment to override the `A preprint' in the header
%\renewcommand{\headeright}{Technical Report}
%\renewcommand{\undertitle}{Technical Report}
\renewcommand{\shorttitle}{\textit{arXiv} Template}
%%% Add PDF metadata to help others organize their library
%%% Once the PDF is generated, you can check the metadata with
%%% $ pdfinfo template.pdf
\hypersetup{
pdftitle={A template for the arxiv style},
pdfsubject={q-bio.NC, q-bio.QM},
pdfauthor={David S.~Hippocampus, Elias D.~Striatum},
pdfkeywords={First keyword, Second keyword, More},
}
\begin{document}
\maketitle
\begin{abstract}
\lipsum[1]
\end{abstract}
% keywords can be removed
\keywords{First keyword \and Second keyword \and More}
\section{Introduction}
\lipsum[2]
\lipsum[3]
\section{Headings: first level}
\label{sec:headings}
\lipsum[4] See Section \ref{sec:headings}.
\subsection{Headings: second level}
\lipsum[5]
\begin{equation}
\xi _{ij}(t)=P(x_{t}=i,x_{t+1}=j|y,v,w;\theta)= {\frac {\alpha _{i}(t)a^{w_t}_{ij}\beta _{j}(t+1)b^{v_{t+1}}_{j}(y_{t+1})}{\sum _{i=1}^{N} \sum _{j=1}^{N} \alpha _{i}(t)a^{w_t}_{ij}\beta _{j}(t+1)b^{v_{t+1}}_{j}(y_{t+1})}}
\end{equation}
\subsubsection{Headings: third level}
\lipsum[6]
\paragraph{Paragraph}
\lipsum[7]
\section{Examples of citations, figures, tables, references}
\label{sec:others}
\subsection{Citations}
Citations use \verb+natbib+. The documentation may be found at
\begin{center}
\url{http://mirrors.ctan.org/macros/latex/contrib/natbib/natnotes.pdf}
\end{center}
Here is an example usage of the two main commands (\verb+citet+ and \verb+citep+): Some people thought a thing \citep{kour2014real, keshet2016prediction} but other people thought something else \citep{kour2014fast}. Many people have speculated that if we knew exactly why \citet{kour2014fast} thought this\dots
\subsection{Figures}
\lipsum[10]
See Figure \ref{fig:fig1}. Here is how you add footnotes. \footnote{Sample of the first footnote.}
\lipsum[11]
\begin{figure}
\centering
\fbox{\rule[-.5cm]{4cm}{4cm} \rule[-.5cm]{4cm}{0cm}}
\caption{Sample figure caption.}
\label{fig:fig1}
\end{figure}
\subsection{Tables}
See awesome Table~\ref{tab:table}.
The documentation for \verb+booktabs+ (`Publication quality tables in LaTeX') is available from:
\begin{center}
\url{https://www.ctan.org/pkg/booktabs}
\end{center}
\begin{table}
\caption{Sample table title}
\centering
\begin{tabular}{lll}
\toprule
\multicolumn{2}{c}{Part} \\
\cmidrule(r){1-2}
Name & Description & Size ($\mu$m) \\
\midrule
Dendrite & Input terminal & $\sim$100 \\
Axon & Output terminal & $\sim$10 \\
Soma & Cell body & up to $10^6$ \\
\bottomrule
\end{tabular}
\label{tab:table}
\end{table}
\subsection{Lists}
\begin{itemize}
\item Lorem ipsum dolor sit amet
\item consectetur adipiscing elit.
\item Aliquam dignissim blandit est, in dictum tortor gravida eget. In ac rutrum magna.
\end{itemize}
\bibliographystyle{unsrtnat}
\bibliography{references} %%% Uncomment this line and comment out the ``thebibliography'' section below to use the external .bib file (using bibtex) .
%%% Uncomment this section and comment out the \bibliography{references} line above to use inline references.
% \begin{thebibliography}{1}
% \bibitem{kour2014real}
% George Kour and Raid Saabne.
% \newblock Real-time segmentation of on-line handwritten arabic script.
% \newblock In {\em Frontiers in Handwriting Recognition (ICFHR), 2014 14th
% International Conference on}, pages 417--422. IEEE, 2014.
% \bibitem{kour2014fast}
% George Kour and Raid Saabne.
% \newblock Fast classification of handwritten on-line arabic characters.
% \newblock In {\em Soft Computing and Pattern Recognition (SoCPaR), 2014 6th
% International Conference of}, pages 312--318. IEEE, 2014.
% \bibitem{keshet2016prediction}
% Keshet, Renato, Alina Maor, and George Kour.
% \newblock Prediction-Based, Prioritized Market-Share Insight Extraction.
% \newblock In {\em Advanced Data Mining and Applications (ADMA), 2016 12th International
% Conference of}, pages 81--94,2016.
% \end{thebibliography}
\end{document}

66
copy.cmd Normal file

@@ -0,0 +1,66 @@
@echo off
setlocal
:: ====================================================================
:: Script name: copy_md_files.cmd
:: Purpose:     Walk every subfolder of /papers and copy all .md files
::              into the /papers/md folder.
:: Author:      [your name/company, optional]
:: Date:        2023-10-27
:: Version:     1.0
:: ====================================================================
:: Define source and target directories
set "SOURCE_DIR=%~dp0papers"
set "TARGET_DIR=%~dp0papers\md"
echo.
echo ======================================================
echo Start copying Markdown files
echo ======================================================
echo.
:: Check that the source directory exists
if not exist "%SOURCE_DIR%" (
echo Error: source directory "%SOURCE_DIR%" does not exist.
echo Make sure this script sits next to the "papers" folder, or edit SOURCE_DIR.
goto :eof
)
:: Create the target directory if it does not exist
if not exist "%TARGET_DIR%" (
echo Creating target directory: "%TARGET_DIR%"
mkdir "%TARGET_DIR%"
if errorlevel 1 (
echo Error: failed to create target directory "%TARGET_DIR%"
goto :eof
)
) else (
echo Target directory "%TARGET_DIR%" already exists.
)
echo.
echo Scanning "%SOURCE_DIR%" and its subdirectories for .md files...
echo.
:: Walk all subfolders and copy every .md file.
:: Note: the switches often quoted here (/s /i /y /d /h) belong to xcopy;
:: this script instead recurses with "for /r" and overwrites with plain "copy".
for /r "%SOURCE_DIR%" %%f in (*.md) do (
echo Copying: "%%f"
copy "%%f" "%TARGET_DIR%" >nul
if errorlevel 1 (
echo Warning: failed to copy file "%%f".
)
)
echo.
echo ======================================================
echo Markdown file copy complete
echo ======================================================
echo.
endlocal
pause


@@ -0,0 +1,429 @@
# Applying Diffusion Models to Modbus
## 🧑‍💻 User
Please introduce the content of this paper in detail, and also explain the Diffusion model in detail.
---
## 🤖 Assistant
Hello. I am an AI engineering expert. Based on the IEEE Transactions on Mobile Computing (2026) paper you provided, **"Spatio-Temporal Diffusion Model for Cellular Traffic Generation"**, I will give you an in-depth technical reading.
The paper proposes an innovative model named **STOUTER**, which aims to solve the core difficulty of generating large-scale cellular traffic data: how to capture complex spatio-temporal patterns precisely while also simulating traffic uncertainty.
Below is a detailed reading of the paper and a deep dive into diffusion model technology.
---
### Part 1: Deep dive into the STOUTER paper
#### 1. Background and pain points
* **Background**: 5G/6G network planning (e.g., base-station siting, resource allocation) needs massive amounts of traffic data.
* **Pain points**:
* **Data privacy and access**: operator data is sensitive and hard to obtain publicly.
* **Limitations of existing generative models**: traditional GANs are prone to **mode collapse** in large-scale generation, and the generated samples lack diversity.
* **Complex spatio-temporal characteristics**: cellular traffic exhibits long- and short-term periodicity (daily/weekly patterns), spatial heterogeneity (different base-station loads across regions), and high uncertainty (traffic bursts).
#### 2. Core solution: the STOUTER model
The paper proposes **STOUTER** (Spatio-Temporal diffusiOn model for cellUlar Traffic genERation), a diffusion-based generative framework built from three core modules:
##### A. Spatio-Temporal Graph Modeling
To let the model "understand" the spatio-temporal regularities of traffic, the authors build two graphs:
1. **Temporal Graph**:
* **Nodes**: the 24 hours of a day.
* **Edges**: connect adjacent hours (short-term dependence) and the same hour on different days of the week (long-term periodicity, e.g., Monday 9 am and Tuesday 9 am).
* **Role**: capture long- and short-term periodic patterns.
2. **Base Station Graph**:
* **Nodes**: base stations.
* **Edges**: connected by geographic distance (e.g., distance < 1 km).
* **Features**: fuse POI (point-of-interest) information (education, healthcare, dining, etc.), since regions with different functions have completely different traffic patterns.
* **Role**: capture spatial correlation and functional differences between regions.
A **graph autoencoder** pretrains both graphs and extracts latent spatio-temporal feature embeddings.
##### B. Diffusion-Based Generation
This is the core engine of the model (the diffusion principle is explained in detail below).
* **Forward process**: Gaussian noise is gradually added to the real traffic data until it becomes pure noise.
* **Reverse process**: starting from pure noise, a neural network denoises step by step to recover realistic traffic data.
* **Innovation**: the stochastic sampling of diffusion models is used to reproduce the **uncertainty patterns** of cellular traffic.
##### C. Spatio-Temporal Feature Fusion Module (STFFM) and STUnet
Without guidance, the reverse denoising process would only produce random noise. The authors therefore design **STUnet** (Spatio-Temporal UNet) as the denoising network:
* **STFFM (Spatio-Temporal Feature Fusion Module)**: fuses the pretrained temporal-graph and base-station-graph features and, together with the diffusion time step, injects them into the network as conditions.
* **Mechanism**: guides denoising toward outputs that follow real spatio-temporal regularities, so the generated data are both stochastic (diverse) and consistent with the spatio-temporal logic of the physical world.
#### 3. Experimental results
* **Datasets**: real telecom data from Milan and Trentino, Italy.
* **Performance**: on Jensen-Shannon divergence (JSD, a distribution-similarity metric), STOUTER improves over existing GAN and VAE methods by **52.77%**.
* **Conclusion**: the generated data are highly realistic in long-term trend, short-term fluctuation, and spatial distribution, and can directly be used to train traffic-forecasting models (downstream tasks).
---
### Part 2: Diffusion models explained
As an AI engineer, I believe understanding diffusion models is the key to understanding this paper and today's generative AI (e.g., Stable Diffusion, Sora).
#### 1. What is a diffusion model?
Diffusion models are a family of probabilistic generative models. They are inspired by the diffusion phenomenon in thermodynamics (e.g., a drop of ink in water gradually spreads until it is uniformly distributed).
In AI, this process is split into two phases:
* **Forward diffusion (destroying data)**: systematically corrupt the data until it becomes random noise.
* **Reverse diffusion (creating data)**: learn to invert the corruption and reconstruct data from noise.
#### 2. Technical principles
##### A. Forward process (diffusion process)
Suppose we have real data $x_0$ (in this paper, the real traffic matrix). We define a Markov chain that adds a small amount of Gaussian noise at every step $t$.
* Formula: $q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t I)$
* As the number of steps $T$ grows (e.g., $T=1000$), the original data $x_0$ eventually become standard Gaussian noise $x_T$.
* **Engineer's view**: this process has no learnable parameters; it is a fixed mathematical transformation.
##### B. Reverse process (denoising process)
This is the core of training. Since the forward process simply adds noise, if we can train a neural network $p_\theta$ to predict the noise added at each step, we can subtract it back out.
* **Goal**: starting from $x_T$ (pure noise), predict and remove noise step by step to obtain $x_{T-1}, \dots, x_0$.
* **Role of the network**: the paper's **STUnet** is exactly this network. Its inputs are the current noisy data $x_t$ and the spatio-temporal conditions; its output is the predicted noise $\epsilon_\theta$.
* **Loss**: usually MSE or L1 between the predicted noise and the noise actually added.
#### 3. Why diffusion models beat GANs
In the context of this paper, the advantages of diffusion models are clear:
1. **Training stability**: GANs rely on a generator-discriminator game and easily suffer from vanishing gradients or non-convergence. The diffusion training target (predicting noise) is explicit and stable.
2. **Avoiding mode collapse**: GANs tend to memorize particular samples and produce repetitive data. By estimating the whole data distribution, diffusion models generate data with broader coverage and more diversity (i.e., a better simulation of traffic **uncertainty**).
3. **High-quality generation**: through the iterative coarse-to-fine denoising process, diffusion models generate richer detail.
#### 4. How it is applied in this paper
The paper cleverly transfers diffusion models from image generation to spatio-temporal sequence generation:
* **Image pixels $\rightarrow$ traffic values**: treat base-station traffic data as a 2-D "image" (time $\times$ space).
* **Conditional guidance**: a plain diffusion model generates unconditionally. STOUTER uses **STFFM** to feed the spatio-temporal graph embeddings as conditions, telling the diffusion model: "do not generate arbitrary noise; generate traffic matching a business district at 9 am on Monday".
---
### Summary and expert comments
**"Spatio-Temporal Diffusion Model for Cellular Traffic Generation"** is a state-of-the-art work combining **graph neural networks (GNNs)** with **generative diffusion models**.
As an expert, I see its core value in two points:
1. **Methodological innovation**: it shows that diffusion models are equally strong for generating non-image spatio-temporal (time-series) data.
2. **Engineering value**: beyond theory, it addresses the practical "no data available" pain point of network planning. Guiding the generation process with physical constraints (the graph structure) guarantees the **physical interpretability** of the generated data, which purely data-driven black-box models cannot match.
The paper represents the latest direction of AI for networking.
---
## 🧑‍💻 User
Now please consider migrating this model to the Modbus protocol to generate valid data: simulate the main features of the packets while precisely capturing the spatio-temporal patterns between packets.
---
## 🤖 Assistant
This is a very challenging but highly valuable engineering transfer problem.
As an AI engineer, migrating **STOUTER** from **macro-level cellular traffic (continuous, statistical data)** to the **micro-level Modbus protocol (discrete, command-style, strongly constrained data)** requires a **semantic reconstruction** at the bottom layer. We cannot simply copy the model; the architecture must be adapted to the characteristics of industrial control systems (ICS).
Modbus data contain not just "traffic volume" but also **command logic (Function Code)**, **register state (Data/Values)**, and **device topology (Master-Slave)**.
Below is the **Modbus-Diffusion migration architecture** I designed for you.
---
### 1. Concept mapping: from Cellular to ICS
First, we redefine the notions of "space" and "time" in STOUTER so that they fit the Modbus environment.
| STOUTER concept | Meaning in cellular networks | **Meaning after migration to Modbus/ICS** |
| :--- | :--- | :--- |
| **Spatial nodes** | Base stations | **Industrial assets/registers** <br> (e.g., the temperature sensor of PLC_1, the valve coil of PLC_2) |
| **Spatial edges** | Physical distance | **Logical/process connections** <br> (e.g., Master-Slave links, or the process coupling between "temperature" and "cooling valve") |
| **POI semantics** | Schools, hospitals, malls | **Device-function semantics** <br> (e.g., sensor, actuator, HMI, alarm) |
| **Temporal cycle** | 24-hour human activity cycle | **Polling cycle & process cycle** <br> (e.g., one production-line takt, or the fixed SCADA query frequency) |
| **Traffic value** | Throughput | **Multi-modal packet features** <br> (inter-arrival time, function code, register address, register value) |
---
### 2. Architecture design: Modbus-STOUTER
To generate valid Modbus data, I suggest splitting the model into three stages: **topology-graph construction**, **spatio-temporal diffusion generation**, and **protocol-compliant reassembly**.
#### 1. Spatial modeling: building a Device-Register Graph
The original paper builds the base-station graph from POIs and distance; for Modbus we instead build an **industrial semantic topology graph**.
* **Node definition**:
* **Coarse-grained**: PLC/RTU devices as nodes.
* **Fine-grained (recommended)**: key registers (Holding Registers/Coils) as nodes, e.g., node A is "boiler temperature", node B is "feed-water valve state".
* **Edge definition**:
* **Communication edges**: the physical Master-to-Slave links.
* **Process edges**: couplings based on the physical process, e.g., when the "temperature register" rises, the "valve register" usually acts. Such couplings can be built automatically by analyzing the cross-correlation of historical data, or manually from P&ID diagrams (a minimal cross-correlation sketch follows this list).
* **Role of graph convolution (GCN)**: learn the **control-logic dependencies** of the plant, e.g., the model learns that "node B's value usually changes with a lag behind node A".
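A minimal sketch of the cross-correlation route, assuming the historical tag values are already aligned into a `(T, N)` array and z-scored per column; the lag window and the 0.6 threshold are illustrative choices, not values taken from the paper:
```python
import numpy as np

def process_adjacency(X, max_lag=30, thresh=0.6):
    """Binary process-edge adjacency from lagged cross-correlation.

    X: array of shape (T, N), one column per tag/register, z-scored per column.
    A[i, j] = 1 if some positive lag makes tag j strongly follow tag i.
    """
    T, N = X.shape
    A = np.zeros((N, N), dtype=int)
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            best = 0.0
            for lag in range(1, max_lag + 1):
                # correlation between tag i now and tag j `lag` steps later
                c = np.corrcoef(X[:-lag, i], X[lag:, j])[0, 1]
                if np.isfinite(c):
                    best = max(best, abs(c))
            A[i, j] = int(best >= thresh)
    return A
```
Manually curated P&ID edges can then be OR-ed into the resulting matrix before it is handed to the GCN.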
#### 2. Temporal modeling: a Cyclic Temporal Graph
Industrial data do not show the obvious morning/evening peaks of human behavior; instead they show **high-frequency periodicity** and **abrupt state changes**.
* **Time encoding**: no longer "hours", but units of the **polling interval** or the **process step**.
* **Long vs. short horizon**:
* **Short-term**: the request-response sequence within one TCP connection.
* **Long-term**: the device's operating-state cycle (e.g., start-up, running, shutdown, maintenance).
#### 3. Modified diffusion core
This is the hardest technical point. Modbus packets contain not only continuous values (e.g., temperature 45.3) but also discrete values (e.g., function codes 0x03, 0x10), while STOUTER natively handles continuous values only.
**Solution: embedding-based diffusion**
* **Input-vector design**: for each time step $t$ and node $i$, the input vector $x_{t,i}$ is no longer a scalar but a **concatenated vector**:
$$x_{t,i} = [\text{Embed}(\text{FunctionCode}) \oplus \text{Embed}(\text{Address}) \oplus \text{Normalize}(\text{Value}) \oplus \text{InterArrivalTime}]$$
* **Hybrid noising**:
* For the continuous parts (Value, Time): standard Gaussian noise.
* For the discrete parts (Func Code, Address): either map them into a continuous latent space before noising, or use **discrete diffusion (e.g., multinomial diffusion)**.
* **STUnet improvement**:
* Add a **Protocol Constraints Head** to the STFFM, used to learn the inherent rules of Modbus (e.g., a read request is normally followed by a response carrying the data).
---
### 3. Key innovation: from "noise" to "valid packets"
A diffusion model outputs tensors, not binary packets (PCAP). To guarantee that the generated Modbus data are not only "realistic" but also "valid", a **post-processing module** is required.
#### **Post-Processing: Deterministic Protocol Assembler**
Do not ask the AI to generate rigid fields such as the TCP checksum or the Protocol ID (the AI will make mistakes and waste capacity). The AI should generate the **semantics**.
1. **AI output**: the diffusion model generates a sequence such as `{timestamp: 10ms, source: Master, target: Slave_1, action: Write, address: 40001, value: 1234}`.
2. **Rule-based assembly**: use Python's `scapy` or `pymodbus` library as the "decoder" (a minimal sketch follows this list).
* Take the AI's semantic output.
* Compute the Length field automatically.
* Fill in the Transaction ID automatically (monotonically increasing).
* Pack everything into a binary stream that conforms to the Modbus TCP specification.
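A minimal sketch of the assembler for one event type (Write Single Register, function code 0x06), using only `struct` so the framing stays explicit; the `semantics` keys mirror the illustrative AI output in step 1 and are an assumed schema, and the 4xxxx-to-wire address mapping is one common convention rather than a universal rule:
```python
import struct

def assemble_write_single_register(semantics, transaction_id):
    """Pack one AI-emitted event into a Modbus/TCP Write Single Register frame."""
    unit_id = semantics.get("unit_id", 1)        # slave/unit identifier
    address = semantics["address"] - 40001       # 4xxxx tag name -> zero-based wire address
    value = semantics["value"] & 0xFFFF
    # PDU: function code 0x06 + register address + register value
    pdu = struct.pack(">BHH", 0x06, address, value)
    # MBAP header: transaction id, protocol id (0), length (= unit id + PDU), unit id
    mbap = struct.pack(">HHHB", transaction_id & 0xFFFF, 0, len(pdu) + 1, unit_id)
    return mbap + pdu

# Example: the illustrative event from step 1 above
frame = assemble_write_single_register({"address": 40001, "value": 1234}, transaction_id=1)
print(frame.hex())  # -> 0001000000060106000004d2
```
Once more function codes are needed, pymodbus or scapy can replace this hand-rolled packing; the point is only that the deterministic layer, not the diffusion model, owns the framing.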
---
### 4. Advantages and potential value of this migration
Compared with traditional rule-based fuzzing or simple replay-attack scripts, this spatio-temporal diffusion approach has major advantages:
1. **Capturing stealthy attack patterns**:
* Traditional attack detection relies on thresholds (e.g., sudden traffic spikes).
* Modbus-STOUTER can simulate **"slow attacks"** or **"logic attacks"**: sequences of packets that fully conform to the protocol but are anomalous in their spatio-temporal logic (e.g., a perfectly legal "open valve" command issued at a moment when the valve should not be opened). This is crucial for training advanced IDS (intrusion detection systems).
2. **Generating high-fidelity digital-twin data**:
* Industrial data are extremely hard to collect (for fear of disrupting production). The model can start from a small sample and generate massive synthetic data with **process-logic correlations** for system stress testing.
3. **Handling packet uncertainty**:
* Industrial networks exhibit jitter and packet loss. The inherently probabilistic nature of diffusion models can reproduce this **random fluctuation** of network conditions, rather than only idealized data.
# PART 2: mzyang thinks the discrete-value handling is questionable
Below I write only the part you asked for: **a detailed description of the two "noising/denoising" methods** (Gaussian diffusion for continuous fields, Mask for discrete fields), in a style you can paste straight into your document.
---
## Method 1: Gaussian diffusion for continuous fields (DDPM-style)
### What it applies to
Continuous fields are quantities that take real values, e.g.:
* `inter_arrival_time` (packet inter-arrival time)
* `value` (read/write values, or some continuous statistic)
* `payload_len` (length can be treated as continuous, or bucketed)
These variables have notions of "magnitude" and "closeness", so gradual Gaussian perturbation is appropriate.
---
### Core idea (intuition)
Gradually add noise to the real continuous data $x_0$ until it becomes pure noise;
then train a model to recover the real data from the noise step by step.
In one sentence: **first scramble the data, then learn to unscramble it.**
---
### Forward process (noising)
The classic DDPM recipe defines time steps $t=1..T$ and adds a little Gaussian noise at each step:
$$
q(x_t \mid x_{t-1}) = \mathcal{N}\big(\sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)
$$
where:
* $\beta_t$ is the noise strength at step $t$ (usually increasing)
* $I$ is the identity matrix
The commonly used closed form (more practical):
$$
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\quad \epsilon\sim \mathcal N(0,I)
$$
where:
* $\alpha_t = 1-\beta_t$
* $\bar{\alpha}_t=\prod_{s=1}^t \alpha_s$
> Intuition: the larger $t$, the larger $\sqrt{1-\bar{\alpha}_t}$, so the noise share grows and $x_t$ gets closer to pure randomness.
---
### Reverse process (denoising / generation)
The reverse direction learns a model $p_\theta$ that step by step turns $x_t$ back into $x_{t-1}$.
The usual training recipe is to let the model predict the noise $\epsilon$ (predicting $x_0$ also works):
$$
\epsilon_\theta = f_\theta(x_t, t, \text{context})
$$
The most common training loss is MSE:
$$
\mathcal L_{\text{cont}} = \mathbb E\big[\|\epsilon - \epsilon_\theta(x_t,t)\|^2\big]
$$
---
### How sampling (generation) works (what you care about most for deployment)
1. Start from Gaussian noise: $x_T \sim \mathcal N(0,I)$
2. For $t=T,T-1,\dots,1$:
* predict the noise $\epsilon_\theta(x_t,t)$ with the model
* apply the DDPM reverse-update formula to obtain $x_{t-1}$
3. The final $x_0$ is the generated continuous feature
> You do not need the full reverse formula in the document; saying "the model obtains continuous features by stepwise denoising" is enough.
---
### Pros & caveats (for reviewers/teammates)
**Pros**
* Continuous variables are usually generated well (distributions, fluctuations, and bursts look more natural)
* Training is stable (MSE is easy to optimize)
**Caveats**
* Continuous fields should be normalized/standardized (e.g., z-score), otherwise the noise scale is hard to choose
* $\beta_t$ (the noise schedule) affects quality; increasing schedules are usually more stable
---
---
## Method 2: Mask noise for discrete fields (mask corruption / discrete denoising)
### What it applies to
Discrete fields are "multiple-choice" quantities, e.g.:
* `func_code` (function code)
* `msg_type` (request/response)
* `direction` (C→S / S→C)
* `exception_code` (exception-code category)
These variables have no notion of "addition" or "intermediate values", so Gaussian noise cannot be added the way it is for continuous ones.
---
### Core idea (intuition)
The "noise" of discrete diffusion is not adding a number, but **covering up / dropping some tokens**:
* Forward: randomly replace some discrete values with `[MASK]`
* Reverse: the model guesses the `[MASK]` back from context (a classification problem)
In one sentence: **first punch holes, then fill them in.**
---
### Forward process (noising = masking)
Define a mask ratio (or mask probability) $m_t$ for each step $t$; usually the larger $t$, the more masking.
For each position of the discrete sequence $x^{\text{disc}}_0$:
* with probability $m_t$: replace it with `[MASK]`
* with probability $1-m_t$: keep the original token
This yields the corrupted $x^{\text{disc}}_t$.
> You can extend this to "mask or randomly replace with another token", but pure masking is clear and easy to implement for a first version.
---
### Reverse process (denoising = guessing tokens)
Model input: the masked discrete sequence + the step $t$ + (optionally) continuous-feature context
Model output: a class distribution (softmax) for every masked position
$$
p_\theta(x^{\text{disc}}_0 \mid x^{\text{disc}}_t, t, \text{context})
$$
The training loss is cross-entropy (more commonly computed only on the masked positions):
$$
\mathcal L_{\text{disc}} = -\sum_{j\in \text{masked}} \log p_\theta(x^{\text{disc}}_{0,j})
$$
> Intuition: the model's job is to fill the punched-out cells back in with the correct answers.
---
### How sampling (generation) works (simple version)
The simplest usable generation strategy:
1. Initialize the discrete sequence as all masks: `[MASK, MASK, ...]`
2. For $t=T,T-1,\dots,1$:
* have the model output softmax probabilities for every masked position
* sample (or take the argmax) to fill in some of the tokens
* as $t$ decreases, the number of masks shrinks (the sequence gets progressively more complete)
The end result is a complete sequence of discrete fields.
> In your document you can simply write "discrete fields are recovered progressively from a fully masked start"; do not worry about exactly how many positions are filled per step.
---
### Pros & caveats (important)
**Pros**
* Naturally correct for discrete fields: it introduces no assumption of a "numeric distance" between codes
* Especially suitable for protocol fields (function code / request-response type), because they can be predicted from context
**Caveats**
* High-cardinality discrete variables (e.g., addresses 0~65535) should not be masked over one giant softmax; that is hard to learn and the data are sparse.
It is better to split them into "region + offset" or bucket them before discrete modeling
* A reasonable mask schedule is needed; do not mask too aggressively at the start, otherwise the model can only guess blindly
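A minimal PyTorch sketch of the mask corruption and the masked-position cross-entropy; `MASK_ID` is assumed to be one extra index appended to the field's vocabulary, and `classifier` is any network returning per-position logits over the real classes:
```python
import torch
import torch.nn.functional as F

MASK_ID = 255  # assumed: one extra "absorbing" index appended to the vocabulary

def mask_corrupt(tokens, m_t):
    """Independently replace each token with MASK_ID with probability m_t."""
    mask = torch.rand(tokens.shape, device=tokens.device) < m_t
    return tokens.masked_fill(mask, MASK_ID), mask

def masked_ce_loss(classifier, tokens, t, m_t, context=None):
    """Cross-entropy computed only on the masked positions, as described above."""
    corrupted, mask = mask_corrupt(tokens, m_t)      # m_t should be > 0 so something is masked
    logits = classifier(corrupted, t, context)       # (batch, seq, vocab) over real classes
    return F.cross_entropy(logits[mask], tokens[mask])
```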
---
---
## How to describe the two methods together (one-sentence version for the document)
For each event's feature vector, we split the fields into:
* continuous fields: generated by Gaussian-diffusion (DDPM) denoising (MSE loss);
* discrete fields: recovered by discrete denoising classification under mask noise (CE loss).
A single shared backbone outputs both a continuous head and a discrete head, and the training objective is the weighted sum of the two:
$$
\mathcal L = \lambda\mathcal L_{\text{cont}} + (1-\lambda)\mathcal L_{\text{disc}}
$$
If you decide to implement this project, I suggest the following steps:
1. **Data preprocessing**: collect Modbus PCAPs and parse them into sequences of the form `(Timestamp, Source, Dest, Function, Register, Value)`.
2. **Graph construction**: build the adjacency matrix from source/destination IPs and register addresses.
3. **Model modification**: change STUnet's input/output layers to support multi-dimensional features (not just a traffic value).
4. **Hybrid training**: train value prediction with an L1 loss and function-code classification with a cross-entropy loss.
5. **Validation**: validate not only JSD (distribution similarity) but also the **Protocol Validity Rate** and **Process Logic Consistency**.
This migration turns a model originally built to predict human internet behavior into a generator that understands machine control logic, which is highly innovative for industrial security and digital twins.


@@ -0,0 +1,189 @@
(Updated from your current draft; no benchmark-metric details are introduced here, as requested.)
## Methodology
Industrial control system (ICS) telemetry is intrinsically **mixed-type** and **mechanistically heterogeneous**: continuous process trajectories (e.g., sensor and actuator signals) coexist with discrete supervisory states (e.g., modes, alarms, interlocks), and the underlying generating mechanisms range from physical inertia to program-driven step logic. This heterogeneity is not cosmetic—it directly affects what “realistic” synthesis means, because a generator must jointly satisfy (i) temporal coherence, (ii) distributional fidelity, and (iii) discrete semantic validity (i.e., every discrete output must belong to its legal vocabulary by construction). These properties are emphasized broadly in operational-technology security guidance and ICS engineering practice, where state logic and physical dynamics are tightly coupled. [12]
We formalize each training instance as a fixed-length window of length (L), consisting of (i) continuous channels (X\in\mathbb{R}^{L\times d_c}) and (ii) discrete channels (Y=\{y^{(j)}_{1:L}\}_{j=1}^{d_d}), where each discrete variable (y^{(j)}_t\in\mathcal{V}_j) belongs to a finite vocabulary (\mathcal{V}_j). Our objective is to learn a generator that produces synthetic ((\hat{X},\hat{Y})) that are simultaneously temporally coherent and distributionally faithful, while also ensuring (\hat{y}^{(j)}_t\in\mathcal{V}_j) for all (j,t) by construction (rather than via post-hoc rounding or thresholding).
A key empirical and methodological tension in ICS synthesis is that *temporal realism* and *marginal/distributional realism* can compete when optimized monolithically: sequence models trained primarily for regression often over-smooth heavy tails and intermittent bursts, while purely distribution-matching objectives can erode long-range structure. Diffusion models provide a principled route to rich distribution modeling through iterative denoising, but they do not, by themselves, resolve (i) the need for a stable low-frequency temporal scaffold, nor (ii) the discrete legality constraints for supervisory variables. [2,8] Recent time-series diffusion work further suggests that separating coarse structure from stochastic refinement can be an effective inductive bias for long-horizon realism. [6,7]
Motivated by these considerations, we propose **Mask-DDPM**, organized in the following order:
1. **Transformer trend module**: learns the dominant temporal backbone of continuous dynamics via attention-based sequence modeling. [1]
2. **Residual DDPM for continuous variables**: models distributional detail as stochastic residual structure conditioned on the learned trend. [2,6]
3. **Masked diffusion for discrete variables**: generates discrete ICS states with an absorbing/masking corruption process and categorical reconstruction. [3,4]
4. **Type-aware decomposition**: a **type-aware factorization and routing layer** that assigns variables to the most appropriate modeling mechanism and enforces deterministic constraints where warranted.
This ordering is intentional. The trend module establishes a macro-temporal scaffold; residual diffusion then concentrates capacity on micro-structure and marginal fidelity; masked diffusion provides a native mechanism for discrete legality; and the type-aware layer operationalizes the observation that not all ICS variables should be modeled with the same stochastic mechanism. Importantly, while diffusion-based generation for ICS telemetry has begun to emerge, existing approaches remain limited and typically emphasize continuous synthesis or augmentation; in contrast, our pipeline integrates (i) a Transformer-conditioned residual diffusion backbone, (ii) a discrete masked-diffusion branch, and (iii) explicit type-aware routing for heterogeneous variable mechanisms within a single coherent generator. [10,11]
---
## Transformer trend module for continuous dynamics
We instantiate the temporal backbone as a **causal Transformer** trend extractor, leveraging self-attention's ability to represent long-range dependencies and cross-channel interactions without recurrence. [1] Compared with recurrent trend extractors (e.g., GRU-style backbones), a Transformer trend module offers a direct mechanism to model delayed effects and multivariate coupling—common in ICS, where control actions may influence downstream sensors with nontrivial lags and regime-dependent propagation. [1,12] Crucially, in our design the Transformer is *not* asked to be the entire generator; instead, it serves a deliberately restricted role: providing a stable, temporally coherent conditioning signal that later stochastic components refine.
For continuous channels (X), we posit an additive decomposition
[
X = S + R,
]
where (S\in\mathbb{R}^{L\times d_c}) is a smooth **trend** capturing predictable temporal evolution and (R\in\mathbb{R}^{L\times d_c}) is a **residual** capturing distributional detail (e.g., bursts, heavy tails, local fluctuations) that is difficult to represent robustly with a purely regression-oriented temporal objective. This separation reflects an explicit *division of labor*: the trend module prioritizes temporal coherence, while diffusion (introduced next) targets distributional realism at the residual level—a strategy aligned with “predict-then-refine” perspectives in time-series diffusion modeling. [6,7]
We parameterize the trend (S) using a causal Transformer (f_{\phi}). With teacher forcing, we train (f_{\phi}) to predict the next-step trend from past observations:
[
\hat{S}_{t+1} = f_{\phi}(X_{1:t}), \qquad t=1,\dots,L-1,
]
using the mean-squared error objective
[
\mathcal{L}_{\text{trend}}(\phi)=\frac{1}{(L-1)d_c}\sum_{t=1}^{L-1}\left\| \hat{S}_{t+1} - X_{t+1}\right\|_2^2.
]
At inference, we roll out the Transformer autoregressively to obtain (\hat{S}), and define the residual target for diffusion as (R = X - \hat{S}). This setup intentionally “locks in” a coherent low-frequency scaffold before any stochastic refinement is applied, thereby reducing the burden on downstream diffusion modules to simultaneously learn both long-range structure and marginal detail. In this sense, our use of Transformers is distinctive: it is a *conditioning-first* temporal backbone designed to stabilize mixed-type diffusion synthesis in ICS, rather than an end-to-end monolithic generator. [1,6,10]
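A minimal sketch of this trend module under the teacher-forcing objective above, built on PyTorch's stock TransformerEncoder with a causal mask; the width, depth, and head counts are placeholders rather than the configuration used in the paper:
```python
import torch
import torch.nn as nn

class TrendTransformer(nn.Module):
    """Causal Transformer that predicts the next-step trend S_{t+1} from X_{1:t}."""
    def __init__(self, d_c, d_model=128, n_layers=4, n_heads=4):
        super().__init__()
        self.inp = nn.Linear(d_c, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, d_c)

    def forward(self, x):                        # x: (batch, L, d_c)
        L = x.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(L).to(x.device)
        h = self.encoder(self.inp(x), mask=causal)
        return self.out(h)                       # position t carries the prediction for t+1

def trend_loss(model, x):
    """Teacher-forced MSE: predictions from X_{1:t} regress X_{t+1}."""
    pred = model(x[:, :-1])                      # uses steps 1..L-1
    return torch.mean((pred - x[:, 1:]) ** 2)    # targets are steps 2..L
```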
---
## DDPM for continuous residual generation
We model the residual (R) with a denoising diffusion probabilistic model (DDPM) conditioned on the trend (\hat{S}). [2] Diffusion models learn complex data distributions by inverting a tractable noising process through iterative denoising, and have proven effective at capturing multimodality and heavy-tailed structure that is often attenuated by purely regression-based sequence models. [2,8] Conditioning the diffusion model on (\hat{S}) is central: it prevents the denoiser from re-learning the low-frequency scaffold and focuses capacity on residual micro-structure, mirroring the broader principle that diffusion excels as a distributional corrector when a reasonable coarse structure is available. [6,7]
Let (K) denote the number of diffusion steps, with a noise schedule (\{\beta_k\}_{k=1}^K), (\alpha_k = 1-\beta_k), and (\bar{\alpha}_k=\prod_{i=1}^k \alpha_i). The forward corruption process is
[
q(r_k\mid r_0)=\mathcal{N}\!\left(\sqrt{\bar{\alpha}_k}\,r_0,\ (1-\bar{\alpha}_k)\mathbf{I}\right),
]
equivalently,
[
r_k = \sqrt{\bar{\alpha}_k}\,r_0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon,\qquad \epsilon\sim\mathcal{N}(0,\mathbf{I}),
]
where (r_0\equiv R) and (r_k) is the noised residual at step (k).
The learned reverse process is parameterized as
[
p_{\theta}(r_{k-1}\mid r_k,\hat{S})=\mathcal{N}\!\left(\mu_{\theta}(r_k,k,\hat{S}),\ \Sigma(k)\right),
]
where (\mu_\theta) is implemented by a **Transformer denoiser** that consumes (i) the noised residual (r_k), (ii) a timestep embedding for (k), and (iii) conditioning features derived from (\hat{S}). This denoiser architecture is consistent with the growing use of attention-based denoisers for long-context time-series diffusion, while our key methodological emphasis is the *trend-conditioned residual* factorization as the object of diffusion learning. [2,7]
We train the denoiser using the standard DDPM (\epsilon)-prediction objective:
[
\mathcal{L}_{\text{cont}}(\theta)
= \mathbb{E}_{k,r_0,\epsilon}\!\left[
\left\|
\epsilon - \epsilon_{\theta}(r_k,k,\hat{S})
\right\|_2^2
\right].
]
Because diffusion optimization can exhibit timestep imbalance (i.e., some timesteps dominate gradients), we optionally apply an SNR-based reweighting consistent with Min-SNR training:
[
\mathcal{L}^{\text{snr}}_{\text{cont}}(\theta)
= \mathbb{E}_{k,r_0,\epsilon}\!\left[
w_k\left\|
\epsilon - \epsilon_{\theta}(r_k,k,\hat{S})
\right\|_2^2
\right],
\qquad
w_k=\frac{\min(\mathrm{SNR}_k,\gamma)}{\mathrm{SNR}_k},
]
where (\mathrm{SNR}_k=\bar{\alpha}_k/(1-\bar{\alpha}_k)) and (\gamma>0) is a cap parameter. [5]
After sampling \(\hat{R}\) by reverse diffusion, we reconstruct the continuous output as
\[
\hat{X} = \hat{S} + \hat{R}.
\]
Overall, the DDPM component serves as a **distributional corrector** on top of a temporally coherent backbone, which is particularly suited to ICS where low-frequency dynamics are strong and persistent but fine-scale variability (including bursts and regime-conditioned noise) remains important for realism. Relative to prior ICS diffusion efforts that primarily focus on continuous augmentation, our formulation elevates *trend-conditioned residual diffusion* as a modular mechanism for disentangling temporal structure from distributional refinement. [10,11]
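For completeness, a minimal sketch of ancestral sampling followed by trend reconstruction is shown below; it assumes a fixed reverse variance \(\Sigma(k)=\beta_k\mathbf{I}\), which is one common choice rather than a statement about the exact implementation.

```python
# Minimal sketch of reverse diffusion for the residual DDPM and reconstruction \hat{X} = \hat{S} + \hat{R}.
# alphas, alpha_bar, betas are the usual schedule tensors of shape (K,).
import torch

@torch.no_grad()
def sample_residual(eps_model, s_hat, shape, alphas, alpha_bar, betas):
    r = torch.randn(shape, device=s_hat.device)                 # r_K ~ N(0, I)
    K = alphas.size(0)
    for k in reversed(range(K)):
        kk = torch.full((shape[0],), k, device=r.device, dtype=torch.long)
        eps_hat = eps_model(r, kk, s_hat)
        mean = (r - betas[k] / (1 - alpha_bar[k]).sqrt() * eps_hat) / alphas[k].sqrt()
        noise = torch.randn_like(r) if k > 0 else torch.zeros_like(r)
        r = mean + betas[k].sqrt() * noise                      # p_theta(r_{k-1} | r_k, \hat{S})
    return r

# x_hat = s_hat + sample_residual(...)   # final continuous output
```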
---
## Masked diffusion for discrete ICS variables
Discrete ICS variables must remain categorical, making Gaussian diffusion inappropriate for supervisory states and mode-like channels. While one can attempt continuous relaxations or post-hoc discretization, such strategies risk producing semantically invalid intermediate states (e.g., “in-between” modes) and can distort the discrete marginal distribution. Discrete-state diffusion provides a principled alternative by defining a valid corruption process directly on categorical variables. [3,4] In the ICS setting, this is not a secondary detail: supervisory tags often encode control logic boundaries (modes, alarms, interlocks) that must remain within a finite vocabulary to preserve semantic correctness. [12]
We therefore adopt **masked (absorbing) diffusion** for discrete channels, where corruption replaces tokens with a special \(\texttt{[MASK]}\) symbol according to a schedule. [4] For each variable \(j\), define a masking schedule \(\{m_k\}_{k=1}^K\) (with \(m_k\in[0,1]\) increasing in \(k\)). The forward corruption process is
\[
q(y^{(j)}_k \mid y^{(j)}_0)=
\begin{cases}
y^{(j)}_0, & \text{with probability } 1-m_k,\\
\texttt{[MASK]}, & \text{with probability } m_k,
\end{cases}
\]
applied independently across \(j\) and \(t\). Let \(\mathcal{M}\) denote the set of masked positions at step \(k\). The denoiser \(h_{\psi}\) predicts a categorical distribution over \(\mathcal{V}_j\) for each masked token, conditioned on (i) the corrupted discrete sequence, (ii) the diffusion step \(k\), and (iii) continuous context. Concretely, we condition on \(\hat{S}\) and (optionally) \(\hat{X}\) to couple supervisory reconstruction to the underlying continuous dynamics:
\[
p_{\psi}\!\left(y^{(j)}_0 \mid y_k, k, \hat{S}, \hat{X}\right)
= h_{\psi}(y_k,k,\hat{S},\hat{X}).
\]
This conditioning choice is motivated by the fact that many discrete ICS states are not standalone—they are functions of regimes, thresholds, and procedural phases that manifest in continuous channels. [12]
Training uses a categorical denoising objective:
\[
\mathcal{L}_{\text{disc}}(\psi)
= \mathbb{E}_{k}\!\left[
\frac{1}{|\mathcal{M}|}
\sum_{(j,t)\in\mathcal{M}}
\mathrm{CE}\!\left(
h_{\psi}(y_k,k,\hat{S},\hat{X})_{j,t},\;
y^{(j)}_{0,t}
\right)
\right],
\]
where \(\mathrm{CE}(\cdot,\cdot)\) is cross-entropy. At sampling time, we initialize all discrete tokens as \(\texttt{[MASK]}\) and iteratively unmask them using the learned conditionals, ensuring that every output token lies in its legal vocabulary by construction. This discrete branch is a key differentiator of our pipeline: unlike typical continuous-only diffusion augmentation in ICS, we integrate masked diffusion as a first-class mechanism for supervisory-variable legality within the same end-to-end synthesis workflow. [4,10]
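A minimal sketch of one masked-diffusion training step is given below. It assumes a classifier `h_model` returning per-position logits of shape `(B, T, J, V)`, a reserved `MASK_ID` token per vocabulary, and a callable `mask_prob` mapping the step \(k\) to the masking rate \(m_k\); all of these names are illustrative assumptions.

```python
# Minimal sketch: mask tokens according to m_k, predict originals, and average cross-entropy
# over masked positions only.
import torch
import torch.nn.functional as F

MASK_ID = 0  # illustrative; in practice a reserved index per variable vocabulary

def masked_diffusion_step(h_model, y0, s_hat, x_hat, K, mask_prob):
    B = y0.size(0)
    k = torch.randint(1, K + 1, (B,), device=y0.device)
    m_k = mask_prob(k).view(B, 1, 1)                                 # masking rate per sample
    masked = torch.rand_like(y0, dtype=torch.float) < m_k
    y_k = torch.where(masked, torch.full_like(y0, MASK_ID), y0)      # forward corruption
    logits = h_model(y_k, k, s_hat, x_hat)                           # (B, T, J, V)
    ce = F.cross_entropy(logits.flatten(0, 2), y0.flatten(), reduction="none")
    ce = ce.view(y0.shape)
    return (ce * masked).sum() / masked.sum().clamp(min=1)           # CE over masked positions only
```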
---
## Type-aware decomposition as a performance refinement layer
Even with a trend-conditioned residual DDPM and a discrete masked-diffusion branch, a single uniform modeling treatment can remain suboptimal because ICS variables are generated by qualitatively different mechanisms. For example, program-driven setpoints exhibit step-and-dwell dynamics; controller outputs follow control laws conditioned on process feedback; actuator positions may show saturation and dwell; and some “derived tags” are deterministic functions of other channels. Treating all channels as if they were exchangeable stochastic processes can misallocate model capacity and induce systematic error concentration on a small subset of mechanistically distinct variables. [12]
We therefore introduce a **type-aware decomposition** that formalizes this heterogeneity as a routing and constraint layer. Let \(\tau(i)\in\{1,\dots,6\}\) assign each variable \(i\) to a type class. The type assignment can be initialized from domain semantics (tag metadata, value domains, and engineering meaning), and subsequently refined via an error-attribution workflow described in the Benchmark section. Importantly, this refinement does **not** change the core diffusion backbone; it changes *which mechanism is responsible for which variable*, thereby aligning inductive bias with variable-generating mechanism while preserving overall coherence.
We use the following taxonomy:
* **Type 1 (program-driven / setpoint-like):** externally commanded, step-and-dwell variables. These variables can be treated as exogenous drivers (conditioning signals) or routed to specialized change-point / dwell-time models, rather than being forced into a smooth denoiser that may over-regularize step structure.
* **Type 2 (controller outputs):** continuous variables tightly coupled to feedback loops; these benefit from conditional modeling where the conditioning includes relevant process variables and commanded setpoints.
* **Type 3 (actuator states/positions):** often exhibit saturation, dwell, and rate limits; these may require stateful dynamics beyond generic residual diffusion, motivating either specialized conditional modules or additional inductive constraints.
* **Type 4 (process variables):** inertia-dominated continuous dynamics; these are the primary beneficiaries of the **Transformer trend + residual DDPM** pipeline.
* **Type 5 (derived/deterministic variables):** algebraic or rule-based functions of other variables; we enforce deterministic reconstruction \(\hat{x}^{(i)} = g_i(\hat{X},\hat{Y})\) rather than learning a stochastic generator, improving logical consistency and sample efficiency.
* **Type 6 (auxiliary/low-impact variables):** weakly coupled or sparse signals; we allow simplified modeling (e.g., calibrated marginals or lightweight temporal models) to avoid allocating diffusion capacity where it is not warranted.
Type-aware decomposition improves synthesis quality through three mechanisms. First, it improves **capacity allocation** by preventing a small set of mechanistically atypical variables from dominating gradients and distorting the learned distribution for the majority class (typically Type 4). Second, it enables **constraint enforcement** by deterministically reconstructing Type 5 variables, preventing logically inconsistent samples that purely learned generators can produce. Third, it improves **mechanism alignment** by attaching inductive biases consistent with step/dwell or saturation behaviors where generic denoisers may implicitly favor smoothness.
From a novelty standpoint, this layer is not merely an engineering “patch”; it is an explicit methodological statement that ICS synthesis benefits from **typed factorization**—a principle that has analogues in mixed-type generative modeling more broadly, but that remains underexplored in diffusion-based ICS telemetry synthesis. [9,10,12]
---
## Joint optimization and end-to-end sampling
We train the model in a staged manner consistent with the above factorization, which improves optimization stability and encourages each component to specialize in its intended role. Specifically: (i) we train the trend Transformer \(f_{\phi}\) to obtain \(\hat{S}\); (ii) we compute residual targets \(R=X-\hat{S}\) for the continuous variables routed to residual diffusion; (iii) we train the residual DDPM \(p_{\theta}(R\mid \hat{S})\) and the masked diffusion model \(p_{\psi}(Y\mid \text{masked}(Y), \hat{S}, \hat{X})\); and (iv) we apply type-aware routing and deterministic reconstruction during sampling. This staged strategy is aligned with the design goal of separating temporal scaffolding from distributional refinement, and it mirrors the broader intuition in time-series diffusion that decoupling coarse structure and stochastic detail can mitigate "structure vs. realism" conflicts. [6,7]
A simple combined objective is
\[
\mathcal{L} = \lambda\,\mathcal{L}_{\text{cont}} + (1-\lambda)\,\mathcal{L}_{\text{disc}},
\]
with \(\lambda\in[0,1]\) controlling the balance between continuous and discrete learning. Type-aware routing determines which channels contribute to which loss and which are excluded in favor of deterministic reconstruction. In practice, this routing acts as a principled guardrail against negative transfer across variable mechanisms: channels that are best handled deterministically (Type 5) or by specialized drivers (Type 1/3, depending on configuration) are prevented from forcing the diffusion models into statistically incoherent compromises.
At inference time, generation follows the same structured order: (i) trend \(\hat{S}\) via the Transformer, (ii) residual \(\hat{R}\) via DDPM, (iii) discrete \(\hat{Y}\) via masked diffusion, and (iv) type-aware assembly with deterministic reconstruction for routed variables. This pipeline produces \((\hat{X},\hat{Y})\) that are temporally coherent by construction (through \(\hat{S}\)), distributionally expressive (through \(\hat{R}\) denoising), and discretely valid (through masked diffusion), while explicitly accounting for heterogeneous variable-generating mechanisms through type-aware routing. In combination, these choices constitute our central methodological contribution: a unified Transformer + mixed diffusion generator for ICS telemetry, augmented by typed factorization to align model capacity with domain mechanism. [2,4,10,12]
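A minimal sketch of this sampling order is given below. All component samplers (`trend_model.rollout`, `sample_residual_fn`, `sample_discrete_fn`, `derived_fns`) are assumed to exist and their names are illustrative; Type 5 channels are overwritten by their deterministic reconstructions.

```python
# Minimal sketch of the end-to-end sampling order: trend -> residual -> discrete -> type-aware assembly.
def generate(trend_model, sample_residual_fn, sample_discrete_fn, type_of, derived_fns, x_seed):
    s_hat = trend_model.rollout(x_seed)                     # (i) autoregressive trend \hat{S}
    r_hat = sample_residual_fn(s_hat)                       # (ii) residual \hat{R} via DDPM
    x_hat = s_hat + r_hat                                   # continuous channels \hat{X}
    y_hat = sample_discrete_fn(s_hat, x_hat)                # (iii) discrete \hat{Y} via masked diffusion
    for i, t in type_of.items():                            # (iv) type-aware assembly
        if t == 5:                                          # derived/deterministic variables
            x_hat[..., i] = derived_fns[i](x_hat, y_hat)    # \hat{x}^{(i)} = g_i(\hat{X}, \hat{Y})
    return x_hat, y_hat
```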
---
# References
[1] Vaswani, A., Shazeer, N., Parmar, N., et al. *Attention Is All You Need.* NeurIPS, 2017. arXiv:1706.03762.
[2] Ho, J., Jain, A., Abbeel, P. *Denoising Diffusion Probabilistic Models.* NeurIPS, 2020. arXiv:2006.11239.
[3] Austin, J., Johnson, D. D., Ho, J., Tarlow, D., van den Berg, R. *Structured Denoising Diffusion Models in Discrete State-Spaces (D3PM).* NeurIPS, 2021. arXiv:2107.03006.
[4] Shi, J., Han, K., Wang, Z., Doucet, A., Titsias, M. K. *Simplified and Generalized Masked Diffusion for Discrete Data.* arXiv:2406.04329, 2024.
[5] Hang, T., Wu, C., Zhang, H., et al. *Efficient Diffusion Training via Min-SNR Weighting Strategy.* arXiv:2303.09556, 2023.
[6] Kollovieh, M., Ansari, A. F., Bohlke-Schneider, M., et al. *Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting (TSDiff).* arXiv:2307.11494, 2023.
[7] Sikder, M. F., Ramachandranpillai, R., Heintz, F. *TransFusion: Generating Long, High Fidelity Time Series using Diffusion Models with Transformers.* arXiv:2307.12667, 2023.
[8] Song, Y., Sohl-Dickstein, J., Kingma, D. P., et al. *Score-Based Generative Modeling through Stochastic Differential Equations.* ICLR, 2021. arXiv:2011.13456.
[9] Zhang, H., Zhang, J., Li, J., et al. *TabDiff: a Mixed-type Diffusion Model for Tabular Data Generation.* arXiv:2410.20626, 2024.
[10] Yuan, H., Sha, K., Zhao, W. *CTU-DDPM: Conditional Transformer U-net DDPM for Industrial Control System Anomaly Data Augmentation.* ACM AICSS, 2025. DOI:10.1145/3776759.3776845.
[11] Sha, K., et al. *DDPM Fusing Mamba and Adaptive Attention: An Augmentation Method for Industrial Control Systems Anomaly Data.* SSRN preprint 6055903, posted Jan 10, 2026.
[12] NIST. *Guide to Operational Technology (OT) Security (SP 800-82r3).* 2023. https://csrc.nist.gov/pubs/sp/800/82/r3/final

knowledges/mask-ddpm.md
# maskddpm Project Documentation (internal long-form version, ~3k characters)
> Goal: in a mixed "paper narrative + manual" style, fully explain the project's background, method, workflow, evaluation, limitations, and future directions.
> Audience: readable even for readers unfamiliar with diffusion models / time-series modeling.
> Note: no concrete implementation code, only concepts and methods.
---
## Abstract
Industrial control system (ICS) data are multivariate, strongly temporal, and mixed-type. Conventional generative models typically satisfy either the distributional or the temporal requirement, but rarely both. We propose a **two-stage hybrid diffusion framework (maskddpm)**: the first stage learns the temporal trend, the second stage uses a diffusion model to learn the residual distribution, and discrete variables are handled with mask-and-recover diffusion. To address the problem of a few variables dominating the KS metric, we further propose a **type-aware divide-and-conquer strategy** that separates program-driven variables from process variables, supported by a complete diagnostic and evaluation toolkit. Experiments show that the framework substantially improves distributional consistency while retaining a reasonable temporal structure, and provides interpretable error localization.
---
## 1. Background and Motivation (Why this problem is hard)
Industrial data differ fundamentally from ordinary time series:
1) **Physical inertia**: many variables change slowly, and the previous moment strongly influences the next.
2) **Program-driven behavior**: some variables come from operators / scheduling programs and exhibit "step + long dwell" patterns.
3) **Mixed types**: continuous sensors coexist with discrete states and must be handled separately.
Using a single model to learn all variables causes obvious problems:
- to align distributions, the model sacrifices temporal structure → lag-1 degrades
- to keep the sequence stable, the model blurs the distribution → KS increases
- treating discrete variables as continuous produces illegal values → JSD increases
Therefore, a structured modeling strategy is needed to "split the conflict".
---
## 2. Problem Formulation
Given a real sequence \(X \in \mathbb{R}^{T\times D}\), where:
- \(T\): sequence length
- \(D\): number of variables (continuous + discrete)
the goal is to learn a generative model \(G\) that outputs \(\hat{X}\) such that:
**Distributional consistency**
\[
\forall i,\; F^{(i)}_{gen}(x) \approx F^{(i)}_{real}(x)
\]
**Temporal consistency**
\[
\rho_1(\hat{x}^{(i)}) \approx \rho_1(x^{(i)})
\]
**Discrete legality**
\[
\hat{x}^{(j)} \in \mathcal{V}_j
\]
where \(\rho_1\) denotes the lag-1 autocorrelation and \(\mathcal{V}_j\) is the legal vocabulary of discrete variable \(j\).
---
## 3. Method Overview
The core design is **two stages + hybrid diffusion**:
### Stage 1: temporal trend modeling
A sequence model learns the "overall trend". Intuitively, this stage only answers "how does the sequence evolve".
### Stage 2: distributional residual modeling
A diffusion model learns the "residual beyond the trend" and corrects the distribution shape. Intuitively, this stage only answers "do the values look distributionally right".
This decoupling keeps "temporal consistency" and "distributional consistency" from conflicting directly.
---
## 4. Model Details
### 4.1 Continuous-variable diffusion (DDPM)
Diffusion process:
\[
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\quad \epsilon \sim \mathcal{N}(0,I)
\]
**Explanation**:
- \(x_0\) is the true residual
- \(x_t\) is the noised data
- the model learns to recover \(x_0\) or \(\epsilon\) from \(x_t\)
### 4.2 Discrete-variable diffusion (mask diffusion)
Discrete variables are modeled by "mask + recover":
- randomly mask some tokens
- learn to predict the original tokens
This guarantees that discrete outputs are **legal and interpretable**.
### 4.3 Temporal module (GRU)
A GRU learns the trend of the continuous sequence:
- input: the continuous sequence
- output: the trend sequence
- used to build the residual: \(x_{resid} = x - trend\)
Intuition:
- the GRU handles the "direction of travel"
- diffusion handles "correcting the details"
---
## 5. Loss Design
### 5.1 Continuous loss
\[
\mathcal{L}_{cont} =
\begin{cases}
\|\hat{\epsilon}-\epsilon\|^2 & \text{(noise prediction)}\\
\|\hat{x}_0-x_0\|^2 & \text{(original-value prediction)}
\end{cases}
\]
**Explanation**: measures how close the denoised result is to the true residual.
### 5.2 Discrete loss
\[
\mathcal{L}_{disc} = \frac{1}{|\mathcal{M}|}\sum_{(i,t)\in\mathcal{M}} CE(\hat{p}_{i,t}, y_{i,t})
\]
**Explanation**: cross-entropy is computed only on masked tokens, ensuring correct discrete predictions.
### 5.3 Total loss
\[
\mathcal{L} = \lambda \mathcal{L}_{cont} + (1-\lambda)\mathcal{L}_{disc}
\]
**Explanation**: \(\lambda\) sets the trade-off between the continuous and discrete branches.
### 5.4 Optional enhancements
**SNR weighting**
\[
\mathcal{L}_{snr} = \frac{\text{SNR}_t}{\text{SNR}_t+\gamma}\,\mathcal{L}_{cont}
\]
**Quantile alignment**
\[
\mathcal{L}_Q = \frac{1}{K}\sum_k \|Q_k(x_{real}) - Q_k(x_{gen})\|_1
\]
These terms help lower KS but may sacrifice temporal consistency; a sketch of the quantile term follows.
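For illustration only (this document intentionally omits the project's implementation code), a small sketch of the quantile-alignment term, assuming 1-D samples of a single channel and K evenly spaced quantile levels:

```python
# Minimal sketch of the quantile-alignment loss L_Q on one channel.
import torch

def quantile_alignment_loss(x_real, x_gen, K=20):
    q = torch.linspace(0.0, 1.0, K, device=x_real.device)
    Q_real = torch.quantile(x_real, q)          # empirical quantiles of the real data
    Q_gen = torch.quantile(x_gen, q)            # empirical quantiles of the generated data
    return (Q_real - Q_gen).abs().mean()        # L1 gap averaged over the K quantile levels
```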
---
## 6. Evaluation Metrics
### 6.1 KS (distributional consistency)
\[
KS_i = \sup_x |F^{(i)}_{gen}(x) - F^{(i)}_{real}(x)|
\]
**Explanation**: smaller means the distributions are closer.
### 6.2 JSD (discrete consistency)
\[
JSD(P,Q)=\tfrac12 KL(P\|M)+\tfrac12 KL(Q\|M)
\]
**Explanation**: measures the divergence between discrete distributions.
### 6.3 Lag-1 diff (temporal consistency)
\[
\Delta_{lag1} = \frac{1}{d}\sum_i |\rho_1(\hat{x}_i) - \rho_1(x_i)|
\]
**Explanation**: measures the deviation in adjacent-step correlation. A sketch of all three metrics follows.
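For illustration, a small sketch of the three metrics using NumPy/SciPy (`ks_2samp`, `jensenshannon`), computed column-wise over arrays of shape (T, d); the function names here are illustrative, not the project's actual scripts:

```python
# Minimal sketch of KS, JSD, and lag-1 diff.
import numpy as np
from scipy.stats import ks_2samp
from scipy.spatial.distance import jensenshannon

def ks_per_feature(real, gen):
    return np.array([ks_2samp(real[:, i], gen[:, i]).statistic for i in range(real.shape[1])])

def jsd_per_feature(real_tokens, gen_tokens, vocab_size):
    out = []
    for i in range(real_tokens.shape[1]):
        p = np.bincount(real_tokens[:, i], minlength=vocab_size) / real_tokens.shape[0]
        q = np.bincount(gen_tokens[:, i], minlength=vocab_size) / gen_tokens.shape[0]
        out.append(jensenshannon(p, q) ** 2)    # squared JS distance = JS divergence
    return np.array(out)

def lag1_diff(real, gen):
    def rho1(x):                                 # lag-1 autocorrelation per column
        a, b = x[:-1], x[1:]
        return np.array([np.corrcoef(a[:, i], b[:, i])[0, 1] for i in range(x.shape[1])])
    return np.abs(rho1(gen) - rho1(real)).mean()
```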
---
## 7. Diagnostics and the Type-aware Divide-and-Conquer Idea
In practice, a small number of variables dominate KS:
- setpoint/demand (program-driven)
- actuators (saturation / dwell)
- derived tags (deterministic mappings)
We therefore propose a type-aware split:
| Type | Characteristics | Treatment |
|------|-----------------|-----------|
| Type 1 | program-driven | model separately / resample |
| Type 2 | controller outputs | conditional generation |
| Type 3 | actuator positions | state + dwell model |
| Type 4 | process variables (PV) | diffusion modeling |
| Type 5 | derived variables | deterministic mapping |
| Type 6 | auxiliary variables | simple statistical model |
The current implementation is a **KS-only baseline** (empirical resampling), used only as a diagnostic upper bound; a sketch follows.
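For illustration, a small sketch of the empirical-resampling baseline for one channel; it matches the marginal distribution (low KS) but deliberately ignores temporal and cross-variable structure, which is why it is a diagnostic bound rather than a generator:

```python
# Minimal sketch of the KS-only diagnostic baseline (empirical resampling of one column).
import numpy as np

def ks_only_resample(real_column, n_samples, rng=None):
    rng = rng or np.random.default_rng()
    return rng.choice(real_column, size=n_samples, replace=True)
```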
---
## 8. Strengths and Limitations
### Strengths
- decoupling temporal structure from distribution makes training more stable
- hybrid continuous/discrete diffusion guarantees legality
- a complete diagnostic toolkit localizes problem variables
### Limitations
- Type 1/2/3 variables remain difficult
- there is a trade-off between KS and temporal fidelity
- the KS-only baseline breaks the joint distribution
---
## 9. Future Directions
1) replace resampling with conditional generators for Type 1/2/3
2) add operating-condition embeddings for Type 4
3) add cross-variable correlation evaluation
---
## 10. Conclusion
Through "two-stage trend + diffusion residual", maskddpm achieves a more balanced framework for ICS sequence generation. The project already reaches reasonable distributional alignment on most variables, but program-driven variables still require dedicated modeling. The framework offers a clear and extensible research path.
---
## Appendix: Quick Reference for the Formulas
| Formula | Meaning |
|---------|---------|
| DDPM noising | how continuous variables are corrupted by noise |
| continuous loss | distance between the denoised result and the true residual |
| discrete cross-entropy | whether masked tokens are recovered correctly |
| KS | maximum distributional deviation |
| JSD | discrete distribution deviation |
| lag-1 diff | difference in temporal correlation |
---
# Part 1 / 4 — Method and Theoretical Foundations (~5k characters)
> Focus of this part: explain why the method is reasonable and build the reader's intuition and theoretical grounding for the overall framework.
## 1.1 What makes industrial time-series generation unique
Industrial control system (ICS) data differ fundamentally from ordinary time series (e.g., stock prices, weather). The three key points are:
**(a) Strong physical constraints**
Many process variables obey physical laws: a level cannot jump instantaneously, and pressure or temperature has bounded rates of change. The model must therefore respect "inertia" and "smoothness".
**(b) Strong program-driven behavior**
Industrial systems contain many "external control signals" (e.g., setpoints, demand, operating modes). These variables are not produced by the natural evolution of the process but by external programs / operator commands, and they typically show "step + dwell" patterns.
**(c) Mixed variable types**
Continuous variables (sensors, control outputs) and discrete variables (states, modes, alarms) coexist. Forcing discrete variables into a continuous treatment creates semantic errors, e.g., a meaningless state "between running and stopped".
Industrial time-series generation is therefore not just "predict the next step": it must simultaneously respect distribution, temporal structure, and discrete legality.
---
## 1.2 Why a single model fails
A single sequence model (e.g., LSTM/Transformer) asked to optimize distribution and temporal structure at the same time typically runs into the following conflicts:
- **Optimizing the distribution loss → temporal degradation**
  To match each variable's distribution, the model tends to ignore temporal structure, producing sequences that "look like the right distribution but behave like noise".
- **Optimizing the temporal loss → distributional degradation**
  To keep the sequence continuous, the generated values become overly smooth and the distribution drifts from the real data (especially heavy tails and spikes).
- **Discrete-variable errors**
  Treating discrete variables as continuous yields illegal tokens and increases JSD.
A structural split is therefore necessary: each module should only be responsible for what it is good at.
---
## 1.3 Design principles: decoupling and specialization
The project follows three core principles:
**Principle 1: decouple "temporal structure" from "distribution"**
- the temporal module handles temporal structure
- the diffusion module handles distributional detail
**Principle 2: model continuous and discrete variables separately**
- continuous variables suit Gaussian diffusion
- discrete variables suit mask diffusion
**Principle 3: treat hard-to-learn variables by category**
- program-driven variables (Type 1) are a poor fit for DDPM
- actuator variables (Type 3) need additional models
These principles form the overall methodology of the project.
---
## 1.4 Theoretical intuition for the two-stage framework
The two-stage method can be understood as:
\[
X = \text{Trend}(X) + \text{Residual}(X)
\]
**Trend part**: determines how the sequence "moves", e.g., gradual rise/fall, preserving physical inertia.
**Residual part**: determines what the values "look like", e.g., distribution shape, heavy tails, outliers.
A single model learning \(X\) directly must solve the trend and distribution problems simultaneously. After the split:
- stage 1 (GRU) learns only the trend
- stage 2 (diffusion) learns only the residual
This markedly reduces the "temporal vs. distributional" conflict.
---
## 1.5 Why diffusion models suit distribution alignment
Diffusion models (DDPMs) are generative models whose core idea is:
1) add noise step by step until the data become pure noise
2) denoise step by step to recover the true distribution
Mathematically:
\[
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon
\]
The denoising objective is to predict \(\epsilon\) or \(x_0\).
**Advantages**:
- can fit complex, multi-modal, heavy-tailed distributions
- does not rely on an explicit probability density function
This matches the reality that ICS data have complex distributions.
---
## 1.6 Why discrete variables need mask diffusion
The key issue for discrete variables is "semantic legality". Applying continuous diffusion directly produces illegal values (e.g., 0.3 or 1.7), which breaks the model.
The mask-diffusion idea:
1) randomly mask some tokens
2) the model predicts the masked tokens
The loss is cross-entropy:
\[
\mathcal{L}_{disc} = \frac{1}{|\mathcal{M}|}\sum_{(i,t)\in\mathcal{M}} CE(\hat{p}_{i,t},y_{i,t})
\]
This guarantees that outputs always come from the vocabulary and are semantically legal.
---
## 1.7 Why type-aware divide-and-conquer is needed
Even with two stages + hybrid diffusion, KS is still dominated by a few variables:
- Type 1 (setpoint/demand) is essentially a "program signal"
- Type 3 (actuators) show "dwell + saturation" behavior
- Type 5 (derived variables) are deterministic functions
Statistically, these variables are simply not suited to diffusion models, so extra design is needed.
The point of the type-aware split is:
- peel off the variables diffusion cannot learn well
- let diffusion focus on the variables it truly suits (Type 4)
---
## 1.8 Why these metrics (an intuitive explanation)
We use three core metrics:
**(a) KS (distribution alignment)**
The maximum gap between the generated and real distributions; smaller is better.
**(b) JSD (discrete consistency)**
Whether the frequencies of discrete states match the real data.
**(c) Lag-1 diff (temporal consistency)**
Whether the adjacent-step correlation structure matches the real data.
The three metrics correspond to the three core goals: distribution, temporal structure, and legality.
---
## 1.9 The theoretical trade-off
Roughly:
- low KS → good distribution
- low lag-1 diff → good temporal structure
In practice, however:
- over-optimizing KS can turn the sequence into "random sampling" → temporal structure collapses
- over-optimizing temporal structure makes the distribution too smooth → KS rises
A good model must therefore find a balance between the two. This is exactly the motivation for the two-stage framework.
---
## 1.10 Summary
This part laid out the theoretical basis of the method:
1) ICS data are complex and mixed-type; a single model cannot handle everything
2) the two-stage structure decouples temporal structure from distribution
3) hybrid diffusion handles continuous and discrete variables
4) the type-aware split handles variables that diffusion learns poorly
The next part details the **training procedure and loss design**.
---
# Part 2 / 4 — Training Procedure and Loss Design (~5k characters)
> Focus of this part: make it clear to a general reader what the training actually does, what each step optimizes, and why these losses are needed.
## 2.1 Training overview (high level)
Training is split into two stages:
1) **Temporal Stage**: train a sequence model (GRU) to learn the trend of the sequence.
2) **Diffusion Stage**: train the diffusion model to learn the distribution of the trend residual.
In other words, training is not done in one shot: first learn "how things change", then learn "the detailed values after the change".
---
## 2.2 Temporal Stage: why learn the trend
Industrial process dynamics are usually smooth and inertial. If the diffusion model were asked to learn the whole sequence directly, it would have to model "where the sequence goes" and "how the values are distributed" at the same time, which creates a conflict:
- over-optimizing the distribution → the sequence becomes random
- over-optimizing the temporal structure → the distribution narrows
The purpose of the Temporal Stage is to pin down the "direction of travel" first.
### Training setup (intuitively)
- input: the continuous-variable sequence
- target: predict the continuous sequence at the next step
- loss: MSE
The model thus learns the "trend structure" rather than the full distribution.
---
## 2.3 Diffusion Stage: learning the residual distribution
After the Temporal Stage produces the trend sequence \(T\), we construct the residual:
\[
R = X - T
\]
The Diffusion Stage fits the residual distribution, so the final output is:
\[
\hat{X} = \hat{R} + T
\]
**Intuition**:
- the trend provides the skeleton
- the residual provides the "realistic noise and detail"
---
## 2.4 Continuous-branch loss design
### Basic loss
In diffusion we can either predict the noise \(\epsilon\) or directly predict the original \(x_0\):
\[
\mathcal{L}_{cont} =
\begin{cases}
\|\hat{\epsilon} - \epsilon\|^2 & \text{(eps prediction)}\\
\|\hat{x}_0 - x_0\|^2 & \text{(x0 prediction)}
\end{cases}
\]
### Why predict eps or x0
- eps prediction is more stable (the original diffusion formulation)
- x0 prediction makes distribution correction easier
We therefore switch between the two across experiments; a conversion sketch follows.
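For illustration, the two parameterizations are interchangeable through the forward equation \(x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon\); a small sketch of the conversion (tensors assumed):

```python
# Minimal sketch: convert between eps-prediction and x0-prediction targets.
import torch

def eps_to_x0(x_t, eps_hat, alpha_bar_t):
    return (x_t - (1 - alpha_bar_t).sqrt() * eps_hat) / alpha_bar_t.sqrt()

def x0_to_eps(x_t, x0_hat, alpha_bar_t):
    return (x_t - alpha_bar_t.sqrt() * x0_hat) / (1 - alpha_bar_t).sqrt()
```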
---
## 2.5 Discrete-branch loss design
Discrete variables use mask diffusion. During training, a subset of tokens is randomly masked:
\[
\mathcal{L}_{disc} = \frac{1}{|\mathcal{M}|}\sum_{(i,t)\in\mathcal{M}} CE(\hat{p}_{i,t},y_{i,t})
\]
Intuition:
- cross-entropy is computed only on masked tokens
- the model learns to "recover the correct state from context"
This avoids illegal outputs for discrete variables.
---
## 2.6 Total loss and its trade-off
The final loss is a linear combination:
\[
\mathcal{L} = \lambda\mathcal{L}_{cont} + (1-\lambda)\mathcal{L}_{disc}
\]
**Role of \(\lambda\)**:
- large \(\lambda\) → better continuous distribution, the discrete branch may degrade
- small \(\lambda\) → better discrete branch, the continuous distribution may drift
This is the key trade-off parameter during training.
---
## 2.7 SNR weighting (why it is needed)
Early diffusion steps have large noise and late steps have small noise. A uniform loss over all steps leads to unbalanced training.
We therefore introduce an SNR weight:
\[
\mathcal{L}_{snr} = \frac{\text{SNR}_t}{\text{SNR}_t+\gamma}\mathcal{L}_{cont}
\]
Intuition:
- down-weight the loss in the high-noise phase
- emphasize fine structure in the low-noise phase
---
## 2.8 Quantile loss (why it helps KS)
KS is essentially a comparison of CDFs.
We therefore add a quantile loss:
\[
\mathcal{L}_Q = \frac{1}{K}\sum_k \|Q_k(x_{real}) - Q_k(x_{gen})\|_1
\]
Intuition:
- align the model on the key quantiles of the distribution
- especially improves heavy-tailed and spiky variables
---
## 2.9 Residual statistics constraint (preventing collapse)
To prevent the residual distribution from collapsing, mean/std alignment is added:
\[
\mathcal{L}_{stat} = \|\mu_{gen}-\mu_{real}\|^2 + \|\sigma_{gen}-\sigma_{real}\|^2
\]
Explanation:
- constrains the residual mean and variance
- prevents the model from learning an "overly smooth" residual (a sketch follows)
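For illustration, a small sketch of this mean/std alignment term, assuming residual tensors of shape (B, L, d):

```python
# Minimal sketch of the residual statistics constraint L_stat, per channel.
import torch

def residual_stat_loss(r_real, r_gen):
    mu_gap = (r_gen.mean(dim=(0, 1)) - r_real.mean(dim=(0, 1))) ** 2
    std_gap = (r_gen.std(dim=(0, 1)) - r_real.std(dim=(0, 1))) ** 2
    return (mu_gap + std_gap).mean()            # averaged over channels
```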
---
## 2.10 Why training is slow (a deeper explanation)
1) **Two-stage training**
- Temporal training + Diffusion training
- roughly doubles the training cost
2) **Large number of diffusion steps**
- timesteps=600
- every training/sampling pass iterates repeatedly
3) **Transformer backbone**
- self-attention cost grows quadratically with sequence length
- at seq_len=128, memory and time increase noticeably
---
## 2.11 Common training anomalies and their explanations
**(a) Loss decreases slowly**
- cause: diffusion training is inherently slow
- fix: reduce timesteps or increase the batch size
**(b) Loss fluctuates heavily**
- cause: randomness of noise sampling + distributional complexity
- fix: add EMA or lower the learning rate
**(c) Poor lag-1**
- cause: the trend is under-trained or there is a seq_len mismatch
- fix: increase temporal_epochs and keep seq_len consistent with sample_seq_len
---
## 2.12 Summary
This part highlighted the core logic of training:
- the Temporal Stage handles temporal structure
- the Diffusion Stage handles the distributional residual
- the loss design balances distribution and temporal structure through multiple terms
The next part explains the evaluation and diagnostic system in detail.
---
# Part 3 / 4 — Evaluation and Diagnostics (~5k characters)
> Focus of this part: help the reader understand how to judge model quality and why diagnostic tooling is needed.
## 3.1 Why the loss alone is not enough
The training loss only shows that "the model is fitting the training objective"; it is not equivalent to real quality.
In this project the loss can decrease while KS or lag-1 remains poor, because:
- the loss mainly optimizes local errors, while KS measures global distributional differences
- the model may learn "around the mean" while badly missing the tails
- discrete variables can be predicted incorrectly while the loss stays small
Evaluation metrics closer to the actual goals are therefore required.
---
## 3.2 Distributional consistency: KS
The KS statistic is defined as:
\[
KS_i = \sup_x |F^{(i)}_{gen}(x)-F^{(i)}_{real}(x)|
\]
**Explanation**:
- \(F_{gen}\): CDF of the generated distribution
- \(F_{real}\): CDF of the real distribution
- KS is the maximum distance between the two CDF curves
**Intuition**:
- small KS → the generated distribution almost overlaps the real one
- large KS → clear distributional shift
**Note**: for spiky / heavily discretized variables a tie-aware KS must be used, otherwise KS is overestimated.
---
## 3.3 Discrete consistency: JSD
Jensen-Shannon divergence:
\[
JSD(P,Q)=\tfrac12 KL(P\|M)+\tfrac12 KL(Q\|M)
\]
where \(M = (P+Q)/2\).
**Explanation**:
- measures the divergence between discrete distributions
- suitable for state-like variables
- smaller is better
---
## 3.4 Temporal consistency: lag-1 diff
Lag-1 correlation difference:
\[
\Delta_{lag1} = \frac{1}{d}\sum_i |\rho_1(\hat{x}_i)-\rho_1(x_i)|
\]
**Explanation**:
- computes the adjacent-step correlation of each variable
- measures whether the "temporal inertia" is correct
- smaller is better
---
## 3.5 Why diagnostics are needed
Even if the average KS drops, it may be that "some variables are very good while others are very bad", which misleads the optimization direction. We must therefore diagnose:
- which variables are worst?
- over which value ranges do their distributions differ?
- which type do they belong to?
---
## 3.6 Diagnostic tool 1: per-feature KS ranking
`ranked_ks.py` outputs:
- variables sorted by KS from high to low
- a view of the "top-N variables dominating the problem"
**Purpose**:
- pinpoint the bottleneck precisely
- avoid blind model changes
---
## 3.7 Diagnostic tool 2: CDF plots
`diagnose_ks.py` plots CDF comparisons:
- real vs. generated
- makes "missing tails / median shift / saturation pile-up" directly visible
This is the most direct way to understand a high KS.
---
## 3.8 Diagnostic tool 3: filtered KS
The filtered KS removes:
- variables with zero standard deviation
- variables with extreme, anomalous KS
**Purpose**:
- judge whether the overall model is already reasonable
- show that KS is dragged down by a small number of variables
---
## 3.9 Type-wise evaluation
To match the type-aware strategy we also report:
- mean KS of Type 1
- mean KS of Type 2
- mean KS of Type 3
- ...
This shows the contribution of each variable type.
---
## 3.10 Trade-offs among the metrics
A common situation in practice:
- KS drops → lag-1 rises
- lag-1 drops → KS rises
Experiments therefore need an explicit goal:
- if the goal is distributional consistency, accept temporal degradation
- if the goal is temporal consistency, accept distributional deviation
This trade-off must also be discussed in the paper.
---
## 3.11 Role of the KS-only baseline
The KS-only baseline is a diagnostic tool:
- replace the hard-to-learn variables with samples drawn from the real distribution
- observe the "upper bound" to which the overall KS can drop
Interpretation:
- if KS stays high → the model learns the overall distribution poorly
- if KS drops markedly → the bottleneck is concentrated in a few variables
---
## 3.12 The complete evaluation workflow
The evaluation workflow consists of:
1) generate data (generated.csv)
2) compare with real data and compute KS/JSD/lag-1
3) write eval.json
4) run additional diagnostics (CDF / rank / filtered)
5) if post-processing is enabled, also evaluate eval_post.json
---
## 3.13 Summary
This part explained:
- why KS/JSD/lag-1 must be used for evaluation
- why per-feature diagnostics are needed
- why type-wise evaluation was introduced
The next part discusses experimental observations, limitations, and future directions.
---
# Part 4 / 4 — Experimental Observations, Limitations, and Future Directions (~5k characters)
> Focus of this part: summarize the experimental phenomena, explain why they occur, and lay out directions for future work.
## 4.1 Recurring experimental observations
Across many experiments we observed several stable phenomena:
1) **KS can drop markedly, but lag-1 may worsen**
- when distribution alignment is strengthened (quantile loss, calibration), the distribution improves but temporal consistency degrades
2) **Discrete JSD is relatively stable**
- discrete mask diffusion guarantees token legality
- JSD usually does not fluctuate as wildly as KS
3) **A few variables dominate KS**
- Type 1/2/3 are the main sources of KS
- this indicates the overall distribution is actually reasonable; the problem is concentrated in a few hard variables
---
## 4.2 Why Type 1/2/3 variables are hard to learn
**Type 1 (setpoint/demand)**
- driven by external programs rather than the system response
- long constant stretches followed by sudden steps
- diffusion models are better at continuous variation and struggle to reproduce such jumps accurately
**Type 2 (controller output)**
- constrained by the control algorithm (PID)
- values frequently saturate at 0 or 100
- extremely spiky distributions
**Type 3 (actuator positions)**
- dwell, saturation, and discrete notches are common
- the real dynamics differ from simple continuous noise
These characteristics show that dedicated models are needed rather than a generic DDPM.
---
## 4.3 Value and limits of the KS-only baseline
The KS-only baseline (empirical resampling) tells us:
- **how low KS can go in principle**
- whether the bottleneck lies in a few variables
Its limitations are also clear:
- it destroys the dependencies between variables
- it cannot guarantee joint-distribution consistency
- it can only serve as a diagnostic tool, not as the final generator
---
## 4.4 The fundamental tension between temporal structure and distribution
In this project, the tension between temporal structure and distribution is very typical:
- focusing only on the distribution: the sequence loses its inertia and becomes "random sampling"
- focusing only on temporal structure: the sequence becomes smooth and loses heavy tails and multi-modality
The two-stage architecture is designed precisely to ease this tension, but it cannot eliminate it entirely.
---
## 4.5 Why conditional modeling is needed
Real industrial data typically depend on the "operating condition / mode".
For example:
- different setpoints lead to different PV distributions
- different states lead to different actuator behavior
Future work should therefore condition the model:
\[
P(X|C)\quad \text{rather than}\quad P(X)
\]
Such conditional modeling can markedly reduce distributional error.
---
## 4.6 Future directions (concrete, actionable paths)
### Direction 1: a program generator for Type 1
- use an HMM or change-point model to generate setpoint/demand
- feed it to the diffusion model as a conditioning input
### Direction 2: a controller simulator for Type 2
- train a small controller network
- output the control signal from setpoint + PV
### Direction 3: an actuator dynamics model for Type 3
- model "state + dwell + rate"
- more realistic than pure resampling
### Direction 4: conditional diffusion for Type 4 PVs
- add operating-condition embeddings
- achieve more consistent distributions across regimes
## 4.7 Suggestions for writing up the experiments (paper perspective)
When writing the paper, the material can be organized as follows:
1) present the global KS/JSD/lag-1 metrics
2) present the per-feature KS ranking
3) emphasize that "a few variables dominate the error"
4) propose the type-aware divide-and-conquer strategy
5) use the KS-only baseline as a diagnostic upper bound
This presents the method as well-motivated and thorough.
---
## 4.8 Conclusion
The project provides a clear path:
- first decouple temporal structure from distribution (two stages)
- then handle continuous vs. discrete variables (hybrid diffusion)
- finally treat the hard variables with the type-aware split
It is therefore not just a generative model, but a systematic methodology for diagnosing and improving ICS sequence generation.
---
# Overall Summary
The core contribution of maskddpm is its **structured decomposition of the problem**:
- temporal structure and distribution alignment are modeled separately
- continuous and discrete variables are diffused separately
- hard variable types are handled individually
Through these structural means, the project offers an interpretable and extensible approach to industrial time-series generation. It does not simply chase the best metric values; it provides a complete research path.

links.txt
https://my.feishu.cn/wiki/Za4dwCsG6iPD9qklRLWcoJOZnnb?from=from_copylink

## Conversation summary (so far)
* You're building a **Modbus/TCP traffic generation system for security use** with this pipeline:
**feature extraction → hybrid diffusion generates features → generator turns features into raw packets → checker validates traffic**.
* I proposed a checker that validates traffic at **multiple layers**:
1. frame/IP/TCP parseability,
2. optional TCP conversation sanity,
3. Modbus/TCP structural + semantic validity (MBAP + PDU),
4. optional “feature-fidelity” checks against generator sidecar metadata.
* I recommended a **generator → checker contract** that is easy to debug and automate:
**PCAP/PCAPNG + JSONL sidecar** (1 metadata line per packet/event).
* You then shared your existing **descriptor-driven Modbus parser** (Rust) and a short doc.
The public API/type model is: `Config → FunctionDescriptor → FieldDescriptor(FieldType, length/length_from/scale/enum_map)`
and `parse_sawp_message(...)` returns JSON containing `unit`, `function`, optional `exception`, and `fields` .
* I said: the **descriptor engine is very reusable** for the checker, but the checkers entry point should ideally work on **raw Modbus/TCP bytes** (MBAP+PDU), not require `sawp_modbus::Message`.
Also: for checker ergonomics, prefer structured error types over `Result<Value, String>` (but you can keep JSON output for the MVP).
---
# Implementation document: Modbus/TCP Checker (Rust)
This document is a practical, "not too complex" plan to implement the checker while still following good practices where they don't add much difficulty.
## 1) What the checker does (goals / non-goals)
### Goals
The checker verifies that generator output is:
1. **Parsable** as TCP/IP traffic,
2. **Modbus/TCP-valid** at the application level (MBAP + PDU rules),
3. **Consistent** in request/response pairing (Transaction ID matching),
4. Optionally **matches the expected features** (function code, unit id, quantities, payload size, timing tolerances, etc.).
### Non-goals (to keep it simple)
To avoid turning this into a full Wireshark, we deliberately **do not** implement:
* full TCP stream reassembly (segments split/merged),
* full TCP state machine with retransmits/out-of-order handling,
* IP/TCP checksum verification by default.
Instead, we enforce a **generator constraint**: **one Modbus ADU per TCP payload** (no segmentation, no coalescing). This single constraint dramatically reduces checker complexity and is realistic for generated traces.
> Trade-off: best practice would handle segmentation/coalescing and reassembly; difficulty rises a lot. The “one ADU per TCP payload” rule is the best complexity/benefit lever for this project.
---
## 2) Generator output contract (what the checker consumes)
### Recommended output (MVP-friendly and debuggable)
**(A) PCAP or PCAPNG file**
* `trace.pcapng` (or `.pcap`) containing the raw generated packets
**(B) Sidecar JSONL metadata file**
* `trace.meta.jsonl` where each line describes the corresponding packet/event (same order)
This is the easiest way to:
* reproduce failures,
* correlate packet index with expected semantic fields,
* produce actionable reports.
### JSONL schema (minimal + optional)
**Minimal fields (recommended):**
* `trace_id` (string/uuid)
* `event_id` (monotonic integer)
* `pcap_index` (or implicit by line number)
* `ts_ns` timestamp
* `direction` (`"c2s"` or `"s2c"`)
* `flow` (src/dst ip/port)
**Optional `expected` block (for feature-fidelity checks):**
* `expected.modbus.transaction_id`, `unit_id`, `function_code`, and `expected.fields` (names matching your descriptor JSON).
Example line:
```json
{
"trace_id": "c7f1...",
"event_id": 42,
"pcap_index": 42,
"ts_ns": 1736451234567890123,
"direction": "c2s",
"flow": {"src_ip":"10.0.0.10","src_port":51012,"dst_ip":"10.0.0.20","dst_port":502},
"expected": {
"modbus": {"transaction_id": 513, "unit_id": 1, "function_code": 3},
"fields": {"starting_address": 0, "quantity": 10}
}
}
```
> Trade-off: best practice is “self-describing PCAP” (pcapng custom blocks, or embedding metadata); difficulty higher. JSONL sidecar is dead simple and works well.
---
## 3) Workflow (starting from generator output)
### Step 0 — Load inputs
1. Read `trace.meta.jsonl` into a lightweight iterator (don't load everything if the trace is huge).
2. Open `trace.pcapng` and stream packets in order.
### Step 1 — Align packets and metadata
For each packet index `i`:
* read packet `i` from PCAP
* read metadata line `i` from JSONL
If mismatch (missing line/packet), record a **Fatal** alignment error and stop (or continue with “best effort”, your call).
### Step 2 — Decode packet and extract TCP payload
Decode:
* link layer (Ethernet/SLL/RAW depending on PCAP linktype),
* IPv4/IPv6,
* TCP,
* extract TCP payload bytes.
Minimal checks:
* packet parses,
* TCP payload length > 0 when direction indicates Modbus message,
* port 502 is present on either side (configurable if you generate non-502).
### Step 3 — Parse Modbus/TCP ADU
Assuming payload contains exactly one ADU:
* parse MBAP (7 bytes) + PDU
* validate basic MBAP invariants
* parse function code and PDU data
* decide request vs response based on `direction`
* parse PDU data using descriptor map (your reusable part)
### Step 4 — Stateful consistency checks
Maintain per-flow state:
* request/response pairing by `(transaction_id, unit_id)`
* outstanding request table with timeout/window limits
### Step 5 — Feature-fidelity checks (optional)
If `expected` exists in JSONL:
* compare decoded modbus header + parsed fields with expected values
* compare sizes and (optionally) timing with tolerances
### Step 6 — Emit report
Output:
* `report.json` with summary + per-finding samples (packet indices, flow key, reason, extracted fields)
* optional `report.txt` for quick reading
---
## 4) Reusing your existing parser (what to keep, what to adjust)
You already have:
* A descriptor model (`Config/FunctionDescriptor/FieldDescriptor/FieldType`)
* A function that returns a JSON representation with the shape the checker wants (`unit`, `function`, optional `exception`, `fields`)
### 4.1 What is immediately reusable
**Highly reusable for the checker:**
* Descriptor loading (serde)
* Field decoding logic (length/length_from, scale, enum_map)
* The “JSON output” idea for reporting and debugging
### 4.2 Small design adjustment to make reuse clean (recommended)
Your checker will naturally see **raw TCP payload bytes**. So the lowest-friction integration is:
* Implement a tiny **MBAP parser** in the checker:
* returns `(transaction_id, protocol_id, length, unit_id, function_code, pdu_data)`
* Then call your descriptor-based decoder on `pdu_data` (bytes **after** function code)
Your doc shows the parser conceptually returns JSON with `fields` and supports request vs response descriptors, which maps perfectly to `direction`.
**Suggested public entrypoint to expose from your parser module:**
* `parse_with_descriptor(pdu_data: &[u8], unit: u8, function: u8, fields: &Vec<FieldDescriptor>) -> Result<Value, String>`
If it's currently private, just make it `pub(crate)` or `pub` and reuse it. This avoids binding the checker to `sawp_modbus::Message` and keeps implementation simple.
> Trade-off: best practice would be to return a typed struct + typed errors; easier to maintain long term but more refactor work. For your "don't make it hard" requirement, keeping JSON output + simple error types is totally fine for the first version.
### 4.3 How the checker chooses which descriptor to use
* If `direction == c2s` → request descriptor
* If `direction == s2c` → response descriptor
This matches the intent of having `request` and `response` descriptor vectors in your model.
---
## 5) Checker internal design (simple but extensible)
### 5.1 Core data structures
* `FlowKey { src_ip, src_port, dst_ip, dst_port, ip_version }`
* `PacketCtx { trace_id, event_id, pcap_index, ts_ns, direction, flow }`
* `DecodedModbus { transaction_id, protocol_id, length, unit_id, function_code, is_exception, exception_code?, pdu_data, parsed_fields_json? }`
### 5.2 “Rules” model (optional, but keeps code tidy)
Instead of huge if/else blocks, implement a few rules that return findings:
* `RuleMbapValid`
* `RuleFunctionPduWellFormed` (basic length sanity)
* `RuleTxIdPairing`
* `RuleExpectedMatch` (only if sidecar has expected)
If you don't want a formal trait system initially, just implement these as functions that append to a `Vec<Finding>`.
### 5.3 Findings + severity
Use a compact severity scale:
* `Fatal`: cannot parse / cannot continue reliably
* `Error`: protocol invalid
* `Warn`: unusual but maybe acceptable
* `Info`: stats
A finding should include:
* `pcap_index`, `event_id`, `flow`, `severity`, `code`, `message`
* optional `observed` and `expected` snippets
---
## 6) What the checker validates (MVP vs stricter)
### MVP validations (recommended first milestone)
1. PCAP + JSONL aligned
2. Parse Ethernet/IP/TCP and extract payload
3. MBAP:
* payload length ≥ 7
* length field consistency (basic)
4. PDU:
* function code exists
* exception handling if `fc & 0x80 != 0`
5. Descriptor parse success (request/response based on direction)
6. Transaction pairing:
* every response matches an outstanding request by transaction_id/unit_id
* no duplicate outstanding txid unless you allow it
### “Strict mode” additions (still reasonable)
* enforce unit_id range (if you want)
* enforce function-code-specific invariants using parsed fields
* e.g., `byte_count == 2 * quantity` for register reads/writes (if present in descriptor)
* timeouts:
* response must arrive within configured window
### Heavy features (avoid unless needed)
* TCP reassembly and multi-ADU per segment
* checksum verification
* handling retransmits/out-of-order robustly
---
## 7) Dependencies (crates) for the checker
### Minimal set (keeps implementation easy)
* **PCAP reading**
* `pcap` (libpcap-backed; you already use it in your codebase)
* **Packet decoding**
* `pnet_packet` (you already use `pnet` patterns)
* **Config + sidecar + report**
* `serde`, `serde_json`
* **Errors + logging**
* `anyhow` (fast to integrate) and/or `thiserror` (nicer structured errors)
* `tracing`, `tracing-subscriber`
* **Utilities**
* `hashbrown` (optional; std HashMap is fine)
* `hex` (useful for debug/trailing bytes like your parser does)
### If you want to reduce external requirements (optional alternative)
* Replace `pcap` with `pcap-file` (pure Rust; no libpcap dependency)
* Replace `pnet` with `etherparse` (often simpler APIs)
> Trade-off: “best practice” for portability is pure Rust (`pcap-file` + `etherparse`).
> “Best practice” for least effort *given your current code* is reusing `pcap` + `pnet`.
---
## 8) Suggested project layout (simple)
```
checker/
src/
main.rs # CLI entry
config.rs # descriptor loading
meta.rs # JSONL reader structs
pcap_in.rs # pcap streaming
decode.rs # ethernet/ip/tcp extract payload
mbap.rs # Modbus/TCP MBAP parsing
modbus_desc.rs # reuse your parse_with_descriptor + types
state.rs # outstanding tx table
validate.rs # main validation pipeline
report.rs # report structs + JSON output
```
---
## 9) Practical implementation tips (to keep it from getting “hard”)
1. **Enforce generator constraints**:
* one ADU per TCP payload
* no splitting/coalescing
This keeps checker complexity low and makes failure reasons obvious.
2. **Keep JSON output for parsed fields** at first:
* You already have a clean JSON shape (`unit`, `function`, `fields`)
* Great for debugging mismatches with `expected.fields`
3. **Add strictness as “modes”**:
* `--mode=mvp | strict`
* or config file toggles
4. **Fail-fast vs best-effort**:
* For CI or batch filtering, fail-fast on `Fatal` is fine.
* For research/debugging, best-effort (continue and collect findings) is more useful.
---

@inproceedings{10.1145/3544216.3544251,
author = {Yin, Yucheng and Lin, Zinan and Jin, Minhao and Fanti, Giulia and Sekar, Vyas},
title = {Practical GAN-based synthetic IP header trace generation using NetShare},
year = {2022},
isbn = {9781450394208},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3544216.3544251},
doi = {10.1145/3544216.3544251},
abstract = {We explore the feasibility of using Generative Adversarial Networks (GANs) to automatically learn generative models to generate synthetic packet- and flow header traces for networking tasks (e.g., telemetry, anomaly detection, provisioning). We identify key fidelity, scalability, and privacy challenges and tradeoffs in existing GAN-based approaches. By synthesizing domain-specific insights with recent advances in machine learning and privacy, we identify design choices to tackle these challenges. Building on these insights, we develop an end-to-end framework, NetShare. We evaluate NetShare on six diverse packet header traces and find that: (1) across all distributional metrics and traces, it achieves 46\% more accuracy than baselines and (2) it meets users' requirements of downstream tasks in evaluating accuracy and rank ordering of candidate approaches.},
booktitle = {Proceedings of the ACM SIGCOMM 2022 Conference},
pages = {458--472},
numpages = {15},
keywords = {synthetic data generation, privacy, network packets, network flows, generative adversarial networks},
location = {Amsterdam, Netherlands},
series = {SIGCOMM '22}
}

# Practical GAN-based synthetic IP header trace generation using
**Question 1**: Summarize the paper, covering background and problem, objective, method, main results, and conclusions, in 150-300 words, using the paper's own terminology.
The paper targets packet/flow header traces that are hard to share for network management tasks (telemetry, anomaly detection, provisioning), and points out that existing GAN-based tabular and time-series generation methods fall short on fidelity (field correlations and large-support fields), the scalability-fidelity trade-off, and the privacy-fidelity trade-off. The goal is an end-to-end framework that automatically learns to generate synthetic IP header traces. Method (NetShare): merge traces across epochs and split them by five-tuple into flow time series; model them with a DoppelGANger-style time-series GAN; apply log(1+x) to wide-range numeric fields, bit encoding to IPs, and IP2Vec embeddings trained on public data to ports/protocols; improve efficiency via chunking + fine-tuning with parallel training; and improve DP training via public pretraining + DP-SGD. Results: across distributional metrics on six datasets NetShare is on average 46% more accurate than baselines, and it preserves accuracy and rank ordering in downstream tasks. Conclusion: GANs are feasible in this setting, but require domain-specific reformulation and privacy/efficiency design.
**Question 2**: Extract the paper's original abstract (usually after "Abstract" and before the Introduction).
We explore the feasibility of using Generative Adversarial Networks (GANs) to automatically learn generative models to generate synthetic packet- and flow header traces for networking tasks (e.g., telemetry, anomaly detection, provisioning). We identify key fidelity, scalability, and privacy challenges and tradeoffs in existing GAN-based approaches. By synthesizing domain-specific insights with recent advances in machine learning and privacy, we identify design choices to tackle these challenges. Building on these insights, we develop an end-to-end framework, NetShare. We evaluate NetShare on six diverse packet header traces and find that: (1) across all distributional metrics and traces, it achieves 46% more accuracy than baselines and (2) it meets users' requirements of downstream tasks in evaluating accuracy and rank ordering of candidate approaches.
**Question 3**: List all authors of the paper, in this format: `Author1, Author2, Author3`
Yucheng Yin, Zinan Lin, Minhao Jin, Giulia Fanti, Vyas Sekar
**Question 4**: State directly which conference or journal published this paper, without reasoning or extra information.
ACM SIGCOMM 2022 Conference (SIGCOMM '22)
**Question 5**: Describe in detail the core problem the paper addresses, and give a concise summary.
The core problem is how to generate synthetic packet/flow IP header traces that are "realistic and usable enough" for networking tasks without sharing raw data, while meeting three practical constraints: (1) Fidelity: capture cross-field and cross-record correlations (flow size, the same flow across epochs, multi-packet flows) and learn the long tails and heavy hitters of fields with large support (IP/port, pkts/bytes); (2) Scalability: keep training cost manageable on traces with millions of records, avoiding the orders-of-magnitude CPU/GPU-hours of complex time-series GANs; (3) Privacy: avoid the privacy-fidelity trade-off where enabling (ε,δ)-DP destroys fidelity. In short: NetShare aims to make synthetic network header data generation usable, trainable, and privatizable in practice.
**Question 6**: List the methods proposed in the paper, summarizing the core idea of each as concisely as possible.
(1) Reformulation as flow time-series generation: rewrite per-epoch tabular records as "merge into a global trace → split into flows by five-tuple → generate a time series per flow", to learn cross-packet/cross-epoch correlations explicitly. (2) Field representation/encoding: log(1+x) for wide-range numerics such as pkts/bytes, bitwise encoding for IPs, IP2Vec embeddings trained on public data for ports/protocols, balancing learnability and DP feasibility. (3) Scalability via chunking + fine-tuning + parallelism: cut the flow trace into M fixed-time chunks, train a seed model on the first chunk, fine-tune the rest in parallel, and use flow tags to preserve cross-chunk dependencies. (4) Practical DP via public pretraining + DP-SGD fine-tuning: pretrain on similar public traces, then fine-tune on private data with DP-SGD to reduce the number of DP-SGD rounds and improve the privacy-fidelity trade-off. (5) Post-processing for correctness: map fields back, fill in derived fields (e.g., checksums), and reassemble into PCAP/NetFlow to keep the traces usable and protocol-consistent.
**Question 7**: State the datasets used in the paper, including names and sources.
Flow (NetFlow) datasets: (1) UGR16: NetFlow v9 collected at a Spanish ISP; the paper uses the third week of March 2016; (2) CIDDS: traffic from an emulated small-business environment with injected malicious traffic, labeled benign/attack with attack types; (3) TON_IoT (TON): an IoT telemetry/security dataset; the paper uses the Train_Test_datasets subset. Packet (PCAP) datasets: (4) CAIDA Anonymized Internet Traces: anonymized PCAPs from a high-speed commercial backbone monitor; the paper uses a March 2018 New York collector subset; (5) Data Center (DC): packet captures from the "UNI1" data center studied in IMC 2010; (6) CyberAttack (CA): captures from the U.S. National CyberWatch Mid-Atlantic CCDC competition of March 2012 (public capture files from Netresec). In addition, public data used for privacy/encoding pretraining: CAIDA Chicago collector 2015 (for IP2Vec port/protocol embeddings and the DP pretraining example).
**Question 8**: List all evaluation metrics used in the paper and briefly explain their purpose.
(1) Jensen-Shannon divergence (JSD): gap between synthetic and real distributions of categorical fields (SA/DA, SP/DP, PR); lower is better. (2) Earth Mover's Distance (EMD / Wasserstein-1): distributional difference for continuous fields (TS, TD, PKT, BYT, PS, PAT, FS); lower is better. (3) Normalized EMD: EMD normalized per field to a fixed range for cross-field comparability and reporting. (4) Downstream-task accuracy (e.g., traffic-type prediction with various classifiers): tests "train on synthetic, test on real" usability. (5) Order preservation / Spearman's rank correlation coefficient: whether the relative ranking of algorithms/models on real vs. synthetic data is preserved (classifiers, sketches, NetML modes); closer to 1 is better. (6) Heavy-hitter count estimation relative error, |error_syn - error_real| / error_real: whether sketch-based telemetry errors are preserved by synthetic data. (7) Anomaly-ratio relative error, |ratio_syn - ratio_real| / ratio_real: consistency of anomaly-detection output ratios. (8) Scalability: training cost in total CPU hours, for comparing the scalability-fidelity trade-off. (9) Differential-privacy parameters (ε, δ): privacy strength (δ fixed at 10^-5), used for privacy-fidelity curves. (10) Protocol/domain correctness checks (IP address validity, pkt/byte vs. protocol constraints, port-protocol consistency, minimum packet length, etc.): semantic/syntactic correctness of the generated traces.
**Question 9**: Summarize the experimental results, including concrete numbers and conclusions.
(1) Overall fidelity: across six header traces, NetShare is "across all distributional metrics and traces" 46% more accurate than baselines; the paper further reports roughly 48% better on NetFlow distributional metrics and 41% better on PCAP, measured by JSD/normalized EMD. (2) Downstream tasks: on TON traffic-type prediction, NetShare beats CTGAN/STAN/E-WGAN-GP on all five classifiers, e.g., 12% higher accuracy than the next-best baseline (E-WGAN-GP) with MLP, reaching 84% of the real-data accuracy. For rank preservation, Spearman correlation is 0.90 on CIDDS (baselines 0.60/0.60/0.70) and 0.70 on TON (baselines 0.10/0.60/-0.60). (3) Sketch telemetry: on heavy-flow counting, NetShare's relative error is on average about 48% lower than baselines and it achieves a perfect match of the sketch-algorithm ranking. (4) Anomaly detection (NetML modes): rank correlation 1.00 on CAIDA, 0.94 on DC, 0.88 on CA, overall better than baselines. (5) Privacy: under (ε,δ)-DP, public pretraining clearly beats naive DP-SGD, but even on CAIDA at ε=24.24 the mean EMD of DP-pretrain-SAME is still about 2.3x worse than non-DP (naive DP about 3.5x); high-dimensional DP synthesis remains an open problem.
**Question 10**: Clearly describe the work done in the paper, listing the motivation, contributions, and main innovations.
Motivation: (1) real packet/flow header traces are essential for evaluating monitoring algorithms, anomaly detection, and capacity planning, but business and privacy constraints make them hard to share; (2) traditional simulation/model-driven approaches require heavy domain knowledge and generalize poorly, while existing ML/GAN methods fail on correlations, scale, and privacy.
Contributions and main innovations: (1) an end-to-end system, NetShare, that reframes header-trace synthesis from per-epoch tabular generation to global merging + per-flow time-series GAN generation, systematically fixing missing flow-size/cross-epoch correlations; (2) field-representation designs oriented to fidelity, scalability, and privacy (log transform, IP bit encoding, IP2Vec for ports/protocols trained on public data), with an explanation of why some embeddings break DP training (data-dependent dictionaries); (3) chunked parallel training + seed-model fine-tuning with flow tags to improve the scalability-fidelity trade-off; (4) bringing "public pretraining + DP-SGD fine-tuning" to header-trace generation, empirically improving the privacy-fidelity trade-off and quantifying its limits; (5) a systematic evaluation on six public datasets with distributional metrics plus multiple downstream tasks (classification, sketch telemetry, anomaly detection), demonstrating practical accuracy and rank preservation, complemented by protocol-consistency checks and a usable PCAP/NetFlow output pipeline (including derived-field generation).

@ARTICLE{11087622,
author={Liu, Xiaosi and Xu, Xiaowen and Liu, Zhidan and Li, Zhenjiang and Wu, Kaishun},
journal={IEEE Transactions on Mobile Computing},
title={Spatio-Temporal Diffusion Model for Cellular Traffic Generation},
year={2026},
volume={25},
number={1},
pages={257-271},
keywords={Base stations;Diffusion models;Data models;Uncertainty;Predictive models;Generative adversarial networks;Knowledge graphs;Mobile computing;Telecommunication traffic;Semantics;Cellular traffic;data generation;diffusion model;spatio-temporal graph},
doi={10.1109/TMC.2025.3591183}}

# Spatio-Temporal Diffusion Model for Cellular
**Question 1**: Summarize the paper, covering background and problem, objective, method, main results, and conclusions, in 150-300 words, using the paper's own terminology.
Mobile network planning and resource allocation need usable models of traffic demand, but in practice they rely heavily on historical data, operator data are restricted by privacy, and existing GAN/autoregressive methods are unstable on city-scale base-station traffic generation, prone to mode collapse, and poor at modeling the uncertainty of traffic under the same spatio-temporal context. The paper proposes STOUTER (Spatio-Temporal diffusiOn model for cellUlar Traffic genERation): a DDPM-style forward diffusion injects noise into real traffic, and reverse denoising reconstructs generated traffic distributions from a Gaussian prior. To capture long/short-term periodic patterns and spatial patterns, the authors pre-train a temporal graph and a base station graph and design a Spatio-Temporal Feature Fusion Module (STFFM), which fuses time embeddings, base-station embeddings, and the diffusion step inside a Spatio-Temporal UNet (STUnet) to predict noise. Experiments on Milan and Trentino, covering five CDR traffic types plus aggregated Call/SMS, show that on Internet traffic STOUTER improves over baselines by 19.23% MAE, 18.35% RMSE, and 52.77% JSD on average. The conclusion is that STOUTER generates city-scale base-station traffic closer to the real distribution, including uncertainty fluctuations, providing high-quality synthetic data for downstream prediction and network optimization.
**Question 2**: Extract the paper's original abstract (usually after "Abstract" and before the Introduction).
In the digital era, the increasing demand for network traffic necessitates strategic network infrastructure planning. Accurate modeling of traffic demand through cellular traffic generation is crucial for optimizing base station deployment, enhancing network efficiency, and fostering technological innovation. In this paper, we introduce STOUTER, a spatio-temporal diffusion model for cellular traffic generation. STOUTER incorporates noise into traffic data through a forward diffusion process, followed by a reverse reconstruction process to generate realistic cellular traffic. To effectively capture the spatio-temporal patterns inherent in cellular traffic, we pre-train a temporal graph and a base station graph, and design the Spatio-Temporal Feature Fusion Module (STFFM). Leveraging STFFM, we develop STUnet, which estimates noise levels during the reverse denoising process, successfully simulating the spatio-temporal patterns and uncertainty variations in cellular traffic. Extensive experiments conducted on five cellular traffic datasets across two regions demonstrate that STOUTER improves cellular traffic generation by 52.77% in terms of the Jensen-Shannon Divergence (JSD) metric compared to existing models. These results indicate that STOUTER can generate cellular traffic distributions that closely resemble real-world data, providing valuable support for downstream applications.
**Question 3**: List all authors of the paper, in this format: `Author1, Author2, Author3`
Xiaosi Liu, Xiaowen Xu, Zhidan Liu, Zhenjiang Li, Kaishun Wu
**Question 4**: State directly which conference or journal published this paper, without reasoning or extra information.
IEEE Transactions on Mobile Computing
**Question 5**: Describe in detail the core problem the paper addresses, and give a concise summary.
The core problem: given insufficient real operator data (privacy constraints, difficult acquisition) and the computation/latency cost and long-horizon error accumulation of purely history-based approaches, how to generate city-scale, base-station-level cellular traffic that approximates the real distribution while reflecting three key regularities: long/short-term periodicity (hourly/daily/weekly), spatial heterogeneity (differences across base stations/regions), and the uncertainty of traffic under the same spatio-temporal conditions. The authors note that GAN-based generation at this scale is unstable and suffers mode collapse, often depends on knowledge graphs / multi-source data, or fits only fixed patterns while ignoring uncertainty; hence a more stable generative paradigm that "recovers the distribution" from a random prior and injects spatio-temporal priors into generation is needed. In short: replace GANs with diffusion, explicitly fusing temporal-periodicity-graph and base-station-graph representations during denoising to generate base-station traffic sequences that show both regularity and stochastic fluctuation.
**Question 6**: List the methods proposed in the paper, summarizing the core idea of each as concisely as possible.
1) STOUTER: a spatio-temporal diffusion framework for cellular traffic generation; forward diffusion gradually adds Gaussian noise, and reverse denoising samples from N(0, I) and iteratively denoises to reconstruct traffic.
2) Temporal graph (G_t) + representation learning: hour nodes with two kinds of directed edges (hourly relation, daily relation) encode short-term and weekly periodic relations; a GraphMAE-style GIN encoder/decoder yields time-slot embeddings F_t.
3) Base station graph (G_s) + representation learning: base stations as nodes, binary undirected edges within a 1 km threshold, node features from counts of eight POI categories in the coverage area; a GCN encoder/decoder yields base-station embeddings F_s.
4) Pre-training objective (L_pre, SCE): graph auto-encoding reconstruction with a Scaled Cosine Error, with γ > 1 to emphasize hard samples and improve representation quality.
5) STFFM (Spatio-Temporal Feature Fusion Module): fuses the diffusion-step embedding (SinPosEmb + MLP) with Concat(F_s, F_t) through an FC layer into F_st, injected as a condition into the convolutional features of every residual block to guide denoising and preserve spatio-temporal patterns.
6) STUnet: a Spatio-Temporal UNet with Down/Mid/Up blocks as the noise-prediction network ε_θ(x_t, t); multi-head attention in the Mid block and multi-head linear attention in the Down/Up blocks fuse historical traffic, the diffusion step, and spatio-temporal semantics.
7) Diffusion training objective: trained on the noise-prediction error (the paper gives a squared error between ε_t and ε_θ).
**Question 7**: State the datasets used in the paper, including names and sources.
Public CDR datasets for Milan and Trentino ("Call Detailed Records (CDRs) from Italy", provided by the Semantics and Knowledge Innovation Lab; the paper cites Barlacchi et al., Sci. Data 2015 [52]). Traffic types include five classes (Internet, Received-SMS, Sent-SMS, Incoming-Call, Outgoing-Call) plus aggregated datasets Call (Incoming+Outgoing) and SMS (Received+Sent). POI data come from OpenStreetMap and are used to build the eight-category POI features of base-station nodes (cited as [34]).
**Question 8**: List all evaluation metrics used in the paper and briefly explain their purpose.
1) MAE (Mean Absolute Error): average absolute deviation between generated and real traffic per base station, reflecting numeric error.
2) RMSE (Root Mean Square Error): root-mean-square error, more sensitive to large errors, measuring overall reconstruction deviation.
3) KL divergence: the basic quantity used to define distributional difference (a component of JSD in the paper).
4) JSD (Jensen-Shannon divergence): a symmetric, more stable distribution-similarity measure comparing generated and real traffic distributions; one of the paper's main metrics (52.77% improvement reported). In the case study, downstream prediction is additionally evaluated with the MAE and RMSE of an LSTM forecaster.
**Question 9**: Summarize the experimental results, including concrete numbers and conclusions.
Overall (Section VI-B): on Milan and Trentino, across the five raw traffic types plus aggregated Call/SMS, STOUTER beats TCN-GAN, VAE, and diffusion baselines using DiffWave/WaveNet denoisers on MAE, RMSE, and JSD. On the Internet dataset the paper reports average improvements over baselines of 19.23% MAE, 18.35% RMSE, and 52.77% JSD. Visualization (Fig. 9, Trentino Internet at the one-month scale): STOUTER reproduces periodic trends better with smaller deviation; example metrics are STOUTER MAE 0.18 and RMSE 0.2558, better than VAE (0.2923/0.4528), DiffWave (0.2359/0.3384), and TCN-GAN (1.5976/2.5625). Ablation (Fig. 7): removing the temporal graph (w/o TG), the base station graph (w/o BSG), or both (w/o ST) degrades the metrics, showing the spatio-temporal representations and fusion are essential; on data with more randomness and weaker spatio-temporal correlation (e.g., Received-SMS) the JSD advantage shrinks. Parameter study (Fig. 8): increasing the diffusion steps t from 10 to 1000 improves quality overall, stabilizing around t=600 for Milan Internet and t=800 for Trentino Internet, with diminishing returns beyond the threshold. Case study (Fig. 10): training an LSTM traffic predictor on generated data keeps the MAE/RMSE gap to a model trained on real data within 0.1, indicating the generated data are usable downstream.
**Question 10**: Clearly describe the work done in the paper, listing the motivation, contributions, and main innovations.
Motivation: 1) base-station deployment and resource optimization need trustworthy traffic-demand data, but real data are limited by privacy and acquisition constraints; 2) forecasting methods require long histories and accumulate error over long horizons; 3) GAN-based city-scale traffic generation suffers training instability, mode collapse, strong dependence on multi-source knowledge graphs, and usually ignores uncertainty fluctuations.
Contributions: 1) STOUTER, a spatio-temporal diffusion model for large-scale base-station cellular traffic generation, which denoises from a Gaussian prior and thus captures the uncertainty pattern. 2) Two pre-trained prior graphs: a temporal graph modeling hourly/daily (including weekly) relations to capture long/short-term periodic patterns, and a base station graph combining distances with eight-category POI statistics to model spatial patterns. 3) STFFM and STUnet, which fuse diffusion-step embeddings and spatio-temporal graph embeddings at every layer of the denoiser, combined with multi-head (linear) attention, to model spatio-temporal patterns and uncertainty jointly. 4) Large-scale empirical validation: on Milan/Trentino across five traffic types, a 52.77% JSD improvement over SOTA baselines (Internet task), plus ablations, variants, parameter sensitivity, and a downstream-prediction case study.
Main innovation: injecting spatio-temporal graph prior representations (temporal graph + base station graph) into a DDPM-style denoising network (STUnet) in a modular way, reproducing periodicity, spatial heterogeneity, and uncertainty fluctuations without a city knowledge graph, and demonstrating with distributional metrics (JSD) that the generated distribution is closer to reality.

@article{Ring_2019,
title={Flow-based network traffic generation using Generative Adversarial Networks},
volume={82},
ISSN={0167-4048},
url={http://dx.doi.org/10.1016/j.cose.2018.12.012},
DOI={10.1016/j.cose.2018.12.012},
journal={Computers \& Security},
publisher={Elsevier BV},
author={Ring, Markus and Schlör, Daniel and Landes, Dieter and Hotho, Andreas},
year={2019},
month=may, pages={156--172} }

# Flow-based Network Traffic Generation using Generative Adversarial Ne
**Question 1**: Summarize the paper, covering background and problem, objective, method, main results, and conclusions, in 150-300 words, using the paper's own terminology.
Addressing the scarcity of flow-based data for NIDS evaluation and the associated privacy/labeling problems, the paper proposes using Generative Adversarial Networks to generate realistic NetFlow/IPFIX flows. The core challenge is that GANs handle only continuous attributes, while flow data contain many categorical fields (IPs, ports, protocol). The authors propose three preprocessing/generation approaches: N-WGAN-GP (numeric normalization, treating IP/port as continuous values), B-WGAN-GP (binarization, turning IP/port/bytes/packets into bit vectors), and E-WGAN-GP (embeddings, mapping IP/port/bytes/packets/duration into continuous R^m vectors learned with IP2Vec). The generator uses the Improved Wasserstein GAN with Gradient Penalty (WGAN-GP) trained with TTUR. Evaluation uses distribution visualization, per-attribute Euclidean distances, and seven domain-knowledge quality tests. Training on CIDDS-001 weeks 2-4 and generating 8.5M flows per method, experiments show that E-WGAN-GP and B-WGAN-GP reproduce conditional distributions and structural relations (subnet behavior, protocol/port consistency); E-WGAN-GP is overall best on the domain checks, while N-WGAN-GP tends to produce implausible values. Conclusion: GANs combined with suitable discrete-to-continuous representations (especially IP2Vec embeddings) can generate high-quality flow-based network traffic for NIDS training and evaluation.
**Question 2**: Extract the paper's original abstract (usually after "Abstract" and before the Introduction).
Flow-based data sets are necessary for evaluating network-based intrusion detection systems (NIDS). In this work, we propose a novel methodology for generating realistic flow-based network traffic. Our approach is based on Generative Adversarial Networks (GANs) which achieve good results for image generation. A major challenge lies in the fact that GANs can only process continuous attributes. However, flow-based data inevitably contain categorical attributes such as IP addresses or port numbers. Therefore, we propose three different preprocessing approaches for flow-based data in order to transform them into continuous values. Further, we present a new method for evaluating the generated flow-based network traffic which uses domain knowledge to define quality tests. We use the three approaches for generating flow-based network traffic based on the CIDDS-001 data set. Experiments indicate that two of the three approaches are able to generate high quality data. Keywords: GANs, TTUR WGAN-GP, NetFlow, Generation, IDS
**Question 3**: List all authors of the paper, in this format: `Author1, Author2, Author3`
Markus Ring, Daniel Schlör, Dieter Landes, Andreas Hotho
**Question 4**: State directly which conference or journal published this paper, without reasoning or extra information.
Computers & Security
**Question 5**: Describe in detail the core problem the paper addresses, and give a concise summary.
The core problem is how to let GANs, which can only process continuous variables, generate flow-based network traffic containing many categorical fields (IP addresses, ports, protocols, etc.), while ensuring the generated data approximate real traffic in distributional diversity and internal semantic relations (subnet structure, protocol-port consistency), for NIDS training and evaluation. In short: convert discrete flow attributes into continuous representations suitable for GANs (numeric, binary, or IP2Vec embeddings), learn real flow characteristics with WGAN-GP + TTUR to generate high-quality synthetic flows, and assess quality with domain-knowledge rules and distributional distances.
**Question 6**: List the methods proposed in the paper, summarizing the core idea of each as concisely as possible.
(1) N-WGAN-GP: normalize the four IP octets and ports to [0,1] as continuous values, also normalize bytes/packets/duration, then generate with WGAN-GP + TTUR. (2) B-WGAN-GP: map IPs to 32-bit and ports to 16-bit binary, bytes/packets to fixed-length binary, preserving subnet structure, then generate with WGAN-GP. (3) E-WGAN-GP: extend IP2Vec to learn m-dimensional embeddings of IP/port/protocol and bytes/packets/duration; the GAN generates embeddings, which are mapped back to original values via nearest neighbors (e.g., cosine similarity). (4) Baseline: independently sample each attribute from its empirical distribution, as a control. (5) Evaluation methodology: distribution visualization, per-attribute Euclidean distances, and seven domain-knowledge quality tests.
**Question 7**: State the datasets used in the paper, including names and sources.
CIDDS-001 (flow-based benchmark data sets for intrusion detection), a public dataset from Coburg University and the University of Würzburg, containing NetFlow captured in an OpenStack-emulated enterprise network; weeks 2-4 (about 22 million flows) are used for training and week 1 (about 8.5 million flows) as the reference for evaluation.
**Question 8**: List all evaluation metrics used in the paper and briefly explain their purpose.
(1) Distribution visualization (violin plots / temporal distributions): checks whether conditional distributions and subnet behavior match real traffic. (2) Per-attribute Euclidean distance: compares the probability distributions of generated data and the training set per attribute, measuring distributional closeness and the plausibility of concept drift. (3) Domain-knowledge checks (7 rules): protocol-flag consistency, internal/external IP constraints, HTTP/HTTPS/DNS port-protocol correspondence, broadcast/multicast placement, NetBIOS flow structure, bytes/packets range relations, etc., testing internal semantic correctness. (4) Generation scale and coverage: 8.5M flows per method, ensuring diversity and statistical stability.
**Question 9**: Summarize the experimental results, including concrete numbers and conclusions.
On CIDDS-001, each method generates about 8.5 million flows. For temporal distributions, E-WGAN-GP is smooth and close to the real week-1 working-hours/lunch-break behavior. Attribute-distribution visualizations show the baseline cannot capture conditional distributions, N-WGAN-GP wrongly concentrates on the ext subnet, while B-WGAN-GP and E-WGAN-GP distinguish client/server ports and destination IP ranges. For Euclidean distances (relative to weeks 2-4), E-WGAN-GP is close to the week-1 reference on source/destination port, bytes, and packets (e.g., destination port 0.0327, bytes 0.0278, packets 0.0251), whereas N-WGAN-GP has large distances on ports/bytes/packets (e.g., source port 0.5658, bytes 0.5858, packets 1.0416). On the seven domain checks (% passed), E-WGAN-GP is highest overall on Tests 1-7 (e.g., Test 1: 99.77, Test 2: 99.98, Test 7: 99.49); B-WGAN-GP is strong on protocol-port consistency (Test 3: 99.97, Test 4: 99.90) but weak on broadcast-related rules (Test 5: 47.13, Test 6: 40.19); N-WGAN-GP is clearly worse on UDP/TCP flags and DNS/HTTP rules (Test 4: 87.14); the baseline is lowest overall. Conclusion: E-WGAN-GP has the best quality, B-WGAN-GP is second and can generate unseen values, and N-WGAN-GP is unsuitable.
**Question 10**: Clearly describe the work done in the paper, listing the motivation, contributions, and main innovations.
Motivation: NIDS evaluation lacks up-to-date, shareable, labeled flow-based data; GANs are good at generation, but network flows contain many categorical fields and complex semantic relations, limiting direct application.
Contributions: (1) the first GAN framework for generating complete flow-based network traffic, using WGAN-GP + TTUR for stable training; (2) three discrete-to-continuous conversion strategies (numeric / binary / embedding) that systematically address the key obstacle of categorical attributes for GANs; (3) an extension of IP2Vec to learn contextual embeddings of numeric fields (bytes/packets/duration) with a mapping back from embeddings; (4) a combined evaluation method using distributional distances and seven domain-knowledge rules; (5) large-scale experiments on CIDDS-001 showing that two of the methods generate high-quality data.
Main innovations: unifying complex discrete flow attributes and numeric attributes into a continuous space via IP2Vec embeddings so the GAN can learn conditional dependencies between attributes intrinsically; domain-knowledge-driven quality tests that cover the blind spots of purely distributional evaluation; and a binarization strategy that preserves subnet and structural information while still being able to generate unseen values.

Network traffic / trace generation and the "usability" discussion (supporting semantic trace generation rather than raw-byte generation)
Yin et al. Practical GAN-based Synthetic IP Header Trace Generation using NetShare. ACM SIGCOMM 2022.
Use: it emphasizes generating usable protocol-field traces with practical evaluation, not just visual similarity; its "field-level consistency/constraints" evaluation ideas are worth borrowing.
Lin et al. Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions. ACM IMC 2020.
Use: a dedicated discussion of the challenges of sharing/synthesizing networked time series (correlations, privacy, evaluation); its viewpoints are well suited to cite when designing evaluation metrics for Modbus synthesis.
Ring et al. Flow-based Network Traffic Generation using Generative Adversarial Networks. Computers & Security 2019.
Use: serves as GAN-baseline related work, for contrasting the training stability and diversity advantages of diffusion models.
Vishwanath & Vahdat. Swing: Realistic and Responsive Network Traffic Generation. IEEE/ACM ToN 2009.
Use: a classic traditional traffic generator; used in related work as the "non-deep-learning synthesis" point of comparison.

@inproceedings{10.1145/3544216.3544251,
author = {Yin, Yucheng and Lin, Zinan and Jin, Minhao and Fanti, Giulia and Sekar, Vyas},
title = {Practical GAN-based synthetic IP header trace generation using NetShare},
year = {2022},
isbn = {9781450394208},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3544216.3544251},
doi = {10.1145/3544216.3544251},
abstract = {We explore the feasibility of using Generative Adversarial Networks (GANs) to automatically learn generative models to generate synthetic packet- and flow header traces for networking tasks (e.g., telemetry, anomaly detection, provisioning). We identify key fidelity, scalability, and privacy challenges and tradeoffs in existing GAN-based approaches. By synthesizing domain-specific insights with recent advances in machine learning and privacy, we identify design choices to tackle these challenges. Building on these insights, we develop an end-to-end framework, NetShare. We evaluate NetShare on six diverse packet header traces and find that: (1) across all distributional metrics and traces, it achieves 46\% more accuracy than baselines and (2) it meets users' requirements of downstream tasks in evaluating accuracy and rank ordering of candidate approaches.},
booktitle = {Proceedings of the ACM SIGCOMM 2022 Conference},
pages = {458--472},
numpages = {15},
keywords = {generative adversarial networks, network flows, network packets, privacy, synthetic data generation},
location = {Amsterdam, Netherlands},
series = {SIGCOMM '22}
}

View File

@@ -0,0 +1,48 @@
# Practical GAN-based synthetic IP header trace generation using NetShare
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文针对网络管理与安全中的包/流头部数据难以共享的问题，探索用GAN生成合成IP头部踪迹以替代真实PCAP/NetFlow数据。作者识别出fidelity、scalability-fidelity与privacy-fidelity三大挑战：现有表格/时间序列GAN难以捕捉字段相关性与大取值空间、训练成本高、差分隐私训练破坏保真度。为此提出端到端框架NetShare：将跨epoch的头部轨迹重构为merged flow-level time series并用DoppelGANger类时序GAN生成；结合域知识进行字段编码（log(1+x)、IP位编码、端口/协议IP2Vec嵌入）以提升可学性；通过chunk并行+seed fine-tuning与"flow tags"保留跨块相关性；引入DP-SGD并用相近公共数据预训练改善privacy-fidelity。基于6个异质数据集（UGR16、CIDDS、TON、CAIDA、数据中心、网络攻防PCAP），结果显示：在分布指标上NetShare较基线平均提升46%，在下游任务保持accuracy与rank ordering，相较基线实现更优scalability-fidelity，并在差分隐私设定下优于naive DP。结论：NetShare将域洞察与ML结合，实用地生成高保真、可扩展且具更好隐私折中的合成头部踪迹。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
We explore the feasibility of using Generative Adversarial Networks (GANs) to automatically learn generative models to generate synthetic packet- and flow header traces for networking tasks (e.g., telemetry, anomaly detection, provisioning). We identify key fidelity, scalability, and privacy challenges and tradeoffs in existing GAN-based approaches. By synthesizing domain-specific insights with recent advances in machine learning and privacy, we identify design choices to tackle these challenges. Building on these insights, we develop an end-to-end framework, NetShare. We evaluate NetShare on six diverse packet header traces and find that: (1) across all distributional metrics and traces, it achieves 46% more accuracy than baselines and (2) it meets users' requirements of downstream tasks in evaluating accuracy and rank ordering of candidate approaches.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Yucheng Yin, Zinan Lin, Minhao Jin, Giulia Fanti, Vyas Sekar
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
ACM SIGCOMM 2022
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
核心问题是如何用GAN生成高保真、可扩展、具隐私保障的PCAP/NetFlow合成头部踪迹使其在字段分布、跨包/跨epoch相关性、下游任务准确性与排序上接近真实数据同时缓解训练成本与差分隐私带来的保真损失。简述将头部生成重构为flow-level时序问题配合域感知编码与并行微调联用公共数据预训练+DP-SGD实现高保真且更实用的隐私合成。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
(1) 问题重构：merge多epoch为巨型trace，按五元组拆分flow，用时序GAN学习跨包/跨epoch相关性。 (2) 域感知编码：连续大范围字段用log(1+x)，IP用bit编码，端口/协议用IP2Vec嵌入（在公共数据上训练），提升可学性与隐私。 (3) 可扩展训练：将trace按固定时间切分为chunks，首块训练seed模型，其余并行fine-tuning，用"flow tags"保留跨块出现关系。 (4) 隐私增强：DP-SGD微调并结合公共数据预训练（相似域），改善privacy-fidelity折中。 (5) 合规后处理：最近邻反映射嵌入、补派生字段（如校验和）、按原始时间字段合并为PCAP/NetFlow。
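A hedged sketch of the domain-aware field encodings summarised in item (2): log(1+x) for heavy-tailed counters such as bytes/packets, and a 32-bit vector for IPv4 addresses so that subnet/prefix structure is preserved. The function names are our own; NetShare's actual implementation differs in detail (it additionally uses IP2Vec embeddings for ports/protocols).

```python
# Sketch of two domain-aware encodings: log(1+x) for heavy-tailed counters and
# a bit-vector encoding for IPv4 addresses. Names are ours, not NetShare's code.
import math

def encode_counter(x: int) -> float:
    """Compress heavy-tailed fields such as bytes/packets with log(1+x)."""
    return math.log1p(x)

def decode_counter(z: float) -> int:
    return int(round(math.expm1(z)))

def encode_ipv4(addr: str) -> list[int]:
    """32-dim bit encoding that keeps prefix/subnet structure visible to the model."""
    value = 0
    for octet in addr.split("."):
        value = (value << 8) | int(octet)
    return [(value >> (31 - i)) & 1 for i in range(32)]

if __name__ == "__main__":
    print(encode_counter(12345), decode_counter(encode_counter(12345)))
    print(encode_ipv4("192.168.1.7"))
```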
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
(1) UGR16：NetFlow v9，西班牙ISP，2016年3月第三周；(2) CIDDS：仿真小型企业环境，有标签的Benign/Attack与类型；(3) TON_IoT：IoT遥测子数据集，含9类攻击与正常流；(4) CAIDA：骨干链路PCAP，2018年3月纽约采集器公共数据；(5) Data Center：IMC 2010研究的数据中心UNI1 PCAP；(6) CyberAttack：美国网络攻防竞赛PCAP，2012年3月。每数据集选取连续100万样本用于评测。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
(1) 分布一致性：对SA/DA、SP/DP、PR用Jensen-Shannon Divergence，对TS、TD、PKT、BYT、PS、PAT、FS用Earth Mover's Distance（归一化），衡量合成与真实分布的距离。 (2) 下游任务保真（accuracy preservation与order preservation）：用真实/合成训练-测试的准确率与Spearman秩相关，评估分类器与测量算法的排序一致性。 (3) Sketch重估误差：重估heavy hitter计数的相对误差，评估合成数据在网络测量中的可用性。 (4) 异常检测保真：NetML多模式（IAT、SIZE、IAT_SIZE、STATS、SAMP-NUM、SAMP-SIZE）下的异常比例相对误差及秩相关。 (5) 可扩展性-保真折中：训练CPU小时 vs 平均JSD/EMD，衡量效率与质量。 (6) 隐私-保真折中：差分隐私参数(ε, δ) vs 平均JSD/EMD，评估DP训练影响。 (7) 协议/域知识一致性测试：IP有效性、字节-包关系、端口-协议一致性、最小包长等，确保语义/语法正确性。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
在6个数据集上，NetShare在分布指标相较基线平均提升46%（NetFlow约48%，PCAP约41%）；在UGR16与CAIDA示例中，NetShare的平均JSD/归一化EMD显著低于CTGAN、E-WGAN-GP、STAN及PCAP基线。下游任务方面：在TON流分类，NetShare在MLP上比次优基线高12%，准确率达到真实数据性能的84%，并在CIDDS/TON上获得更高秩相关（如CIDDS 0.90）；在Sketch重估（3个数据集）上NetShare相对误差最低（平均减少约48%）并保持算法排序；在PCAP异常检测，NetShare多数模式相对误差最低，且在CAIDA上模式排名完全匹配。可扩展性方面，NetShare通过并行微调与chunking达成较优CPU小时-保真折中，PCAP场景相对CTGAN提升近一数量级的JSD，同时较DoppelGANger近一数量级更快。隐私方面，naive DP-SGD显著劣化保真度；用相似公共数据预训练后，在ε≈24的CAIDA上将平均EMD相对naive DP从3.5×降至2.3×，但仍存在折中。协议一致性测试中，NetShare在多项测试达到≈95–99%通过率。结论：NetShare在保真、任务可用性与训练效率上优于现有方法，并以预训练提升DP情形的可用性。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机：现实PCAP/NetFlow头部踪迹因隐私/政策难共享；现有模拟/模型/ML方法在字段相关性、大取值空间、跨epoch相关与隐私保障上存在不足，且训练成本高、差分隐私显著破坏保真。
贡献点：(1) 提出NetShare，重构为flow-level时序生成，并实现端到端系统与Web原型；(2) 域感知编码策略（log变换、IP位编码、端口/协议IP2Vec公共预训练），提升GAN对网络字段的可学性与鲁棒性；(3) 并行可扩展训练（chunk+seed fine-tuning+flow tags），在保持相关性的同时显著降低CPU小时；(4) 差分隐私微调框架（公共数据预训练+DP-SGD），改善privacy-fidelity折中；(5) 系统性评测：在6数据集与多类分布/任务/一致性指标上验证较基线平均46%保真提升与排序保持。
主要创新：将跨epoch合并与按五元组时序化的生成范式用于网络头部踪迹；将端口/协议嵌入与IP位编码的混合表示用于兼顾保真与隐私；提出带flow tags的并行微调保持跨块相关；首次在网络头部生成中实证差分隐私预训练微调提升隐私-保真折中。

View File

@@ -0,0 +1,38 @@
@article{10.1145/1151659.1159928,
author = {Vishwanath, Kashi Venkatesh and Vahdat, Amin},
title = {Realistic and responsive network traffic generation},
year = {2006},
issue_date = {October 2006},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {36},
number = {4},
issn = {0146-4833},
url = {https://doi.org/10.1145/1151659.1159928},
doi = {10.1145/1151659.1159928},
abstract = {This paper presents Swing, a closed-loop, network-responsive traffic generator that accurately captures the packet interactions of a range of applications using a simple structural model. Starting from observed traffic at a single point in the network, Swing automatically extracts distributions for user, application, and network behavior. It then generates live traffic corresponding to the underlying models in a network emulation environment running commodity network protocol stacks. We find that the generated traces are statistically similar to the original traces. Further, to the best of our knowledge, we are the first to reproduce burstiness in traffic across a range of timescales using a model applicable to a variety of network settings. An initial sensitivity analysis reveals the importance of capturing and recreating user, application, and network characteristics to accurately reproduce such burstiness. Finally, we explore Swing's ability to vary user characteristics, application properties, and wide-area network conditions to project traffic characteristics into alternate scenarios.},
journal = {SIGCOMM Comput. Commun. Rev.},
month = aug,
pages = {111--122},
numpages = {12},
keywords = {burstiness, energy plot, generator, internet, modeling, structural model, traffic, wavelets}
}
@inproceedings{10.1145/1159913.1159928,
author = {Vishwanath, Kashi Venkatesh and Vahdat, Amin},
title = {Realistic and responsive network traffic generation},
year = {2006},
isbn = {1595933085},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1159913.1159928},
doi = {10.1145/1159913.1159928},
abstract = {This paper presents Swing, a closed-loop, network-responsive traffic generator that accurately captures the packet interactions of a range of applications using a simple structural model. Starting from observed traffic at a single point in the network, Swing automatically extracts distributions for user, application, and network behavior. It then generates live traffic corresponding to the underlying models in a network emulation environment running commodity network protocol stacks. We find that the generated traces are statistically similar to the original traces. Further, to the best of our knowledge, we are the first to reproduce burstiness in traffic across a range of timescales using a model applicable to a variety of network settings. An initial sensitivity analysis reveals the importance of capturing and recreating user, application, and network characteristics to accurately reproduce such burstiness. Finally, we explore Swing's ability to vary user characteristics, application properties, and wide-area network conditions to project traffic characteristics into alternate scenarios.},
booktitle = {Proceedings of the 2006 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications},
pages = {111--122},
numpages = {12},
keywords = {burstiness, energy plot, generator, internet, modeling, structural model, traffic, wavelets},
location = {Pisa, Italy},
series = {SIGCOMM '06}
}

View File

@@ -0,0 +1,58 @@
# Realistic and responsive network traffic generation
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文提出Swing，一个闭环、网络响应式的traffic generator，旨在在网络仿真环境中用简单的structural model重现真实链路上的packet/flow交互与burstiness。背景问题是现有生成器难以同时匹配多时间尺度的到达过程方差（energy plots）、应用与用户层行为以及广域网条件。Swing从单点tcpdump观察自动提取用户/会话/连接/网络四层参数分布（如numRRE、interRRE、numconn、interConn、numpairs、request/response sizes、reqthink）以及链路capacity/latency/loss，在ModelNet哑铃拓扑上用commodity stacks生成live traffic，并记录target link上的trace。方法包含被动估计RTT（SYN→SYN/ACK、响应→ACK时差）、packet-pair容量、基于序列号洞的丢包率，并用小波MRA的energy plots验证burstiness。结果表明Swing在Mawi、CAIDA、Auck多数据集上生成trace的统计特性与原trace相似，首次跨多时间尺度（含sub-RTT）重现字节/包到达能量谱；敏感性分析揭示需同时捕获用户/应用/网络特征，且可在场景投射中响应地调整延迟、容量、应用占比。结论：简单但结构化的模型加网络条件重放即可现实、响应地生成真实网络流量。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
This paper presents Swing, a closed-loop, network-responsive traffic generator that accurately captures the packet interactions of a range of applications using a simple structural model. Starting from observed traffic at a single point in the network, Swing automatically extracts distributions for user, application, and network behavior. It then generates live traffic corresponding to the underlying models in a network emulation environment running commodity network protocol stacks. We find that the generated traces are statistically similar to the original traces. Further, to the best of our knowledge, we are the first to reproduce burstiness in traffic across a range of timescales using a model applicable to a variety of network settings. An initial sensitivity analysis reveals the importance of capturing and recreating user, application, and network characteristics to accurately reproduce such burstiness. Finally, we explore Swing's ability to vary user characteristics, application properties, and wide-area network conditions to project traffic characteristics into alternate scenarios.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Kashi Venkatesh Vishwanath, Amin Vahdat
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
ACM SIGCOMM 2006
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
核心问题如何从单点观测的packet header traces构建一个语义明确且可响应的结构化模型并在网络仿真环境中生成live traffic准确重现多时间尺度含sub-RTT的burstiness与应用/用户/网络交互同时支持场景投射。简述提出Swing用四层结构模型加被动网络特性估计与仿真回放实现统计相似、跨尺度能量谱匹配的真实感流量生成。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
(1) 结构化参数化模型四层Users/Sessions/Connections/Network参数与分布如numRRE、interRRE、numconn、interConn、numpairs、request/response sizes、reqthink、capacity/latency/loss统一描述应用行为。
(2) 会话/RRE提取基于SYN/FIN/时序与RREtimeout/SESStimeout将连接聚类为RRE与会话得到分布的经验CDF。
(3) 被动网络特性估计用SYN→SYN/ACK、响应→ACK时差估RTT分解packet-pair估瓶颈容量序列号“holes”与重传估loss。
(4) 仿真生成在ModelNet哑铃拓扑配置链路MTU/容量/延迟/丢包;部署多生成器/监听器用commodity stacks按分布驱动live连接。
(5) 验证与敏感性分析用小波MRA energy plots比较字节/包到达的多尺度方差;探究网络/用户/应用参数缺省的影响。
(6) 场景投射调整延迟、容量、响应大小、应用占比等以测试trace在替代场景下的响应。
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
(1) CAIDA OC-48：MFN Backbone 1 链路（San Jose–Seattle）公共trace；(2) MAWI：WIDE项目Trans-Pacific 18Mbps CAR链路trace；(3) University of Auckland：NLANR存档的Auckland-VI（OC3c ATM）链路trace。均来自公开trace存储库（CAIDA、MAWI、NLANR）。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
(1) 应用/总体带宽与pps比较生成trace与原trace的Mbps与packets/sec验证宏观负载一致性。
(2) 参数分布一致性各模型参数的median与IQRnumconn、interConn、numpairs、numRRE、interRRE、reqthink、请求/响应大小),检验结构模型重现性。
(3) 小波MRA能量谱energy plots比较不同时间尺度下字节/包到达过程方差验证burstiness含RTT与瓶颈容量对应的特征“dip”
(4) 网络特性CDF两向延迟、上下游容量、丢包率分布的CDF对比被动估计与生成trace的网络条件一致性。
(5) 敏感性分析:移除或变更网络/应用参数(如无网络重放、仅延迟/容量、去interRRE/interConn对能量谱的影响评估必要性与重要性。
(6) 响应性实验:加倍延迟/响应大小/改变应用占比对能量谱与带宽的影响,评估场景投射能力。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
在Auck/Mawi/CAIDA三数据集上，总体带宽与pps（如CAIDA 184Mbps、Auck与Mawi多案例）在Trace/Swing对照中接近；参数分布median/IQR高度一致（例如HTTP请求/响应大小、numconn、interConn、reqthink等），少数人机"think"参数差异仍可接受。能量谱方面，Swing在HTTP、SQUID、KAZAA各应用及两个方向的字节/包到达能量谱均与原trace紧密重叠，能再现RTT尺度dip（如200ms）与瓶颈容量对应dip（如3.5Mbps→约8ms）。网络特性估计的延迟/容量/丢包CDF与生成trace吻合度高。敏感性显示：不重放网络条件会丢失sub-RTT结构，仅延迟或延迟+容量不足，去interRRE或interConn会改变大尺度或整体burstiness。响应性显示：延迟加倍使能量谱特征右移一档（log2尺度），响应大小加倍使能量谱整体升高且带宽由9Mbps增至19Mbps，增大SQUID占比使整体能量谱更类似SQUID曲线。结论：Swing首次跨多时间尺度重现burstiness，生成trace统计上与原trace相似，并能以语义明确的参数实现响应式场景投射。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机需要一个能生成真实、可响应且最大随机性的网络traffic generator既匹配多时间尺度burstiness又能从单点观测自动抽取用户/应用/网络参数,并支持投射到不同网络/应用场景。
贡献点:(1) 提出Swing一个闭环、网络响应式生成框架统一结构模型覆盖Users/Sessions/Connections/Network(2) 从单点packet headers自动提取应用会话/RRE/连接参数的经验CDF与网络特性RTT分解、packet-pair容量、序列号洞丢包(3) 在ModelNet哑铃拓扑用commodity stacks生成live traffic统计上复现原trace的带宽/pps与结构参数分布(4) 用小波MRA能量谱首次跨多时间尺度含sub-RTT重现字节/包到达burstiness(5) 系统敏感性与响应性分析,证明用户/应用/网络三类特征均为必要,并展示调整延迟/容量/应用占比的可投射能力。
创新之处以简洁结构化模型与被动网络条件估计相结合在仿真环境中重放多层交互首次在多时间尺度上重现burstiness并保持统计相似同时支持面向未来场景的参数化、响应式投射。

View File

@@ -0,0 +1,10 @@
@inproceedings{Lin_2020,
title={Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions},
url={http://dx.doi.org/10.1145/3419394.3423643},
DOI={10.1145/3419394.3423643},
booktitle={Proceedings of the ACM Internet Measurement Conference},
publisher={ACM},
author={Lin, Zinan and Jain, Alankar and Wang, Chen and Fanti, Giulia and Sekar, Vyas},
year={2020},
month=oct,
pages={464--483},
series={IMC '20} }

View File

@@ -0,0 +1,71 @@
# Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文针对网络与系统领域"数据共享难"的痛点，探讨利用Generative Adversarial Networks为多维元数据关联的长时序数据生成高保真合成数据，以最小专家介入支持结构表征、预测建模与算法评估。作者识别GAN在fidelity（长程依赖、复杂多维关系、mode collapse）与privacy（理论保证牺牲保真、记忆泄露）上的关键挑战，提出DoppelGANger工作流：解耦元数据与时序的联合生成P(A)·P(R|A)、为元数据引入auxiliary discriminator、batch generation的RNN以捕捉长时程相关、auto-normalization为每个样本生成min/max"伪元数据"缓解mode collapse。基于WWT、MBA、GCUT真实数据集，DG在分布与自相关等结构微基准和下游任务上较基线最高提升达43%，能保持算法排序；隐私方面表明DP-SGD破坏时序相关，提出用更大训练集缓解membership inference，并通过仅重训元数据生成器实现业务属性混淆。结论：DG在保真与通用性上展现初步可行性，但隐私-保真折中仍是开放问题。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
Limited data access is a longstanding barrier to data-driven research and development in the networked systems community. In this work, we explore if and how generative adversarial networks (GANs) can be used to incentivize data sharing by enabling a generic framework for sharing synthetic datasets with minimal expert knowledge. As a specific target, our focus in this paper is on time series datasets with metadata (e.g., packet loss rate measurements with corresponding ISPs). We identify key challenges of existing GAN approaches for such workloads with respect to fidelity (e.g., long-term dependencies, complex multidimensional relationships, mode collapse) and privacy (i.e., existing guarantees are poorly understood and can sacrifice fidelity). To improve fidelity, we design a custom workflow called DoppelGANger (DG) and demonstrate that across diverse real-world datasets (e.g., bandwidth measurements, cluster requests, web sessions) and use cases (e.g., structural characterization, predictive modeling, algorithm comparison), DG achieves up to 43% better fidelity than baseline models. Although we do not resolve the privacy problem in this work, we identify fundamental challenges with both classical notions of privacy and recent advances to improve the privacy properties of GANs, and suggest a potential roadmap for addressing these challenges. By shedding light on the promise and challenges, we hope our work can rekindle the conversation on workflows for data sharing.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Zinan Lin, Alankar Jain, Chen Wang, Giulia Fanti, Vyas Sekar
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
ACM Internet Measurement Conference (IMC 2020)
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
核心问题如何用GAN在不依赖大量领域专家知识的前提下为带多维元数据的网络化长时序数据生成高保真合成数据既能保留长程时间相关与测量-元数据的复杂关系、避免mode collapse又能在隐私上避免记忆泄露并支持实际下游任务。简述提出DoppelGANger通过解耦/条件生成、批量RNN、辅助判别器与样本级归一化min/max伪元数据系统性提升保真验证其在多数据集与任务上的效果同时分析差分隐私训练的破坏性与可行的隐私折中。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
- 解耦+条件生成：分解P(A,R)=P(A)·P(R|A)，先MLP生成元数据，再以元数据作为条件由RNN生成时序，保留测量-元数据依赖。
- 辅助判别器：为元数据单独设置discriminator，并与主判别器加权联合训练，提升元数据分布保真。
- Batch generation RNN：每次生成S个连续记录，降低RNN步数，捕捉长时程相关。
- Auto-normalization：对每个样本单独归一化，并学习/生成min/max为"伪元数据"，生成后再反缩放，缓解mode collapse。
- 变长长度标记：生成flag控制序列终止，学习长度分布。
- 隐私策略：用更大训练集降低membership inference成功率；仅重训元数据生成器实现属性分布混淆；评估DP-SGD对保真的影响。
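A small sketch of the auto-normalization bullet above: each sample is rescaled by its own min/max, those two values are emitted as extra "fake metadata" to be modelled together with the real metadata, and they are used to undo the scaling after generation. Names and the degenerate-range handling are our simplifications, not the DoppelGANger code.

```python
# Per-sample auto-normalization with min/max emitted as "fake metadata" (sketch).
import numpy as np

def auto_normalize(series):
    lo, hi = float(series.min()), float(series.max())
    scale = (hi - lo) if hi > lo else 1.0          # degenerate range -> no scaling (our choice)
    normed = (series - lo) / scale                 # now in [0, 1] regardless of the sample's magnitude
    fake_metadata = np.array([lo, hi])             # to be generated alongside the real metadata
    return normed, fake_metadata

def auto_denormalize(normed, fake_metadata):
    lo, hi = float(fake_metadata[0]), float(fake_metadata[1])
    scale = (hi - lo) if hi > lo else 1.0
    return normed * scale + lo

if __name__ == "__main__":
    x = np.array([10.0, 12.0, 9.0, 30.0])
    n, meta = auto_normalize(x)
    print(n, meta, auto_denormalize(n, meta))
```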
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
- Wikipedia Web Traffic (WWT)Kaggle Web Traffic Time Series Forecasting日页面浏览量与域/访问方式/agent元数据。
- FCC Measuring Broadband America (MBA)美国FCC公开数据家庭UDP ping丢包率与总流量元数据含接入技术/ISP/州。
- Google Cluster Usage Traces (GCUT)Google集群任务资源使用日志公开数据集秒级测量聚合为5分钟统计与任务结束事件类型元数据。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
- 自相关ACF与均方误差MSE衡量长短期时间相关的保真比较真实与合成序列的自相关曲线误差。
- 分布距离如Jensen-Shannon/Wasserstein-1评估测量、元数据及联合分布的接近程度如(max+min)/2、带宽CDF、Pearson相关系数CDF
- 变长分布拟合:任务时长直方图拟合(捕捉多模态)。
- 下游预测精度:在真实/合成数据训练测试的分类/回归MLP、LR、SVM等准确率或R²检验泛化。
- 排序保持Spearman秩相关比较算法在真实与合成数据上的性能排序一致性。
- 训练资源与稳定性训练时间、样本规模与MSE敏感性评估可用性。
- 隐私评估membership inference成功率与训练集规模的关系DP-SGD下自相关退化情况。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
DG在WWT上自相关MSE较最佳基线RCGAN降低约91.2%；在预测任务中，GCUT的MLP分类器用DG生成数据训练、真实数据测试的准确率比次优基线AR高43%，达到真实数据的约80%；算法排序方面，DG在GCUT与WWT的Spearman秩相关分别达1.00与0.80，保持排序优于多数基线；长度分布上DG准确复现GCUT的双峰模式（RNN等基线缺失次模态）；测量-元数据联合分布（MBA）中，DSL/Cable总带宽CDF的Wasserstein-1距离最小。隐私上，membership inference随训练样本增多降至约51%（近随机），但DP-SGD（如ε≈1.45或更小）显著破坏自相关与保真。结论：DG在结构微基准与下游任务上显著优于基线，具备通用性与可用性；隐私-保真仍需新方法突破。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机:网络/系统领域数据共享受限,亟需无需大量专家参与的通用合成数据框架,既保留长时序与多维关系,又兼顾隐私与实际任务可用性。
贡献点:
- 提出DoppelGANger，系统性解决GAN在时间序列的fidelity挑战（长程相关、mode collapse、测量-元数据关系）。
- 设计解耦/条件生成与auxiliary discriminator，提升元数据与联合分布保真。
- 引入batch generation的RNN与样本级auto-normalization（min/max伪元数据），缓解mode collapse并学习长时程依赖。
- 在WWT/MBA/GCUT多数据集与结构/预测/排序评估中实证高保真（最高43%提升）。
- 系统性分析隐私折中：揭示DP-SGD对时序保真的破坏，提出实用缓解（更大训练集、只重训元数据生成器）。
创新之处:
- 将样本级归一化与伪元数据联合到GAN工作流中解决时序值域多样性引发的mode collapse。
- 元数据-时序的解耦条件生成+辅助判别器组合,在网络化长时序场景下有效提升联合分布学习。
- 批量时序生成策略使RNN在不牺牲长程相关的前提下可训练、稳定地生成长序列。
- 面向数据共享的任务化评估框架(结构/预测/排序)与隐私实践建议。

View File

@@ -0,0 +1,9 @@
@misc{meng2025aflnetyearslatercoverageguided,
title={AFLNet Five Years Later: On Coverage-Guided Protocol Fuzzing},
author={Ruijie Meng and Van-Thuan Pham and Marcel Böhme and Abhik Roychoudhury},
year={2025},
eprint={2412.20324},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2412.20324},
}

View File

@@ -0,0 +1,47 @@
# AFLNet Five Years Later On Coverage-Guided Protocol Fuzzing
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
协议实现是stateful且message-driven：同一消息在不同内部state下可能产生不同response，使传统coverage-guided greybox fuzzing（如AFL）难以有效测试网络协议。论文旨在对AFLNet（首个code- and state-coverage-guided protocol fuzzer）给出扩展技术讨论与大规模实证评估，并回顾其五年影响。方法上，AFLNet以message sequence作为seed，基于pcap录制/回放构建初始corpus；在线学习implemented protocol state machine（IPSM），用response code等标识state并统计#fuzz/#selected/#paths；在seed selection中交织queue顺序与state heuristics以导向progressive states；对序列分割为M1/M2/M3，并在M2上施加protocol-aware与byte-level mutation；在同一bitmap中同时维护branch coverage与state transition coverage以判定interesting。结果显示：state feedback单独使用在部分对象上显著优于black-box；加入state feedback使state coverage平均提升35.67×，但对code coverage提升总体不显著；交织式seed-selection在综合code/state覆盖上最稳健。结论：state反馈能显著扩大协议状态空间探索，但"state定义"与吞吐等仍是关键挑战。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
Abstract—Protocol implementations are stateful which makes them difficult to test: Sending the same test input message twice might yield a different response every time. Our proposal to consider a sequence of messages as a seed for coverage-directed greybox fuzzing, to associate each message with the corresponding protocol state, and to maximize the coverage of both the state space and the code was first published in 2020 in a short tool demonstration paper. AFLNet was the first code- and state-coverage-guided protocol fuzzer; it used the response code as an indicator of the current protocol state. Over the past five years, the tool paper has gathered hundreds of citations, the code repository was forked almost 200 times and has seen over thirty pull requests from practitioners and researchers, and our initial proposal has been improved upon in many significant ways. In this paper, we first provide an extended discussion and a full empirical evaluation of the technical contributions of AFLNet and then reflect on the impact that our approach and our tool had in the past five years, on both the research and the practice of protocol fuzzing.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Ruijie Meng, Van-Thuan Pham, Marcel Böhme, Abhik Roychoudhury
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
文段未给出会议或期刊信息。
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
核心问题是:如何将面向“单输入/近似无状态程序”的coverage-guided greybox fuzzing扩展到网络协议这种stateful、需要message sequence驱动且状态空间巨大的目标上并同时兼顾code coverage与state space coverage。传统做法要么靠手工协议模型的stateful blackbox fuzzing依赖不完备的状态/数据模型且不保留“有趣”用例继续进化要么把消息序列拼成文件交给AFL无法聚焦关键消息、易生成大量无效序列。论文围绕AFLNet提出并系统评估的一套解法以消息序列为seed、在线推断IPSM并把状态反馈纳入引导与“interesting”判定从而更系统地探索协议实现的状态与代码。简洁概述让灰盒模糊测试“看见并利用协议状态”从而可有效fuzz stateful protocols。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
(1) 消息序列作为seed的SCGF：把sequence of messages（而非单文件输入）作为进化种子，适配stateful server。(2) 录制/回放驱动：pcap→parse→send，从真实流量提取初始语料，并可重复回放以执行fuzzing迭代。(3) 轻量协议学习IPSM（implemented protocol state machine）：从response序列抽取state transitions，在线增量构建/更新状态机，并维护#fuzz/#selected/#paths统计。(4) 面向progressive states的引导：按"盲点/新近/高产出"启发式选state，再在到达该state的子语料上做AFL式优先级选序列。(5) 交织式seed-selection：在coverage plateau时切换到state-heuristic重策略，否则按AFL队列顺序，兼顾吞吐与导向。(6) 三段式序列变异M1/M2/M3：固定前缀M1保证到达目标state，只在候选段M2做变异，并继续执行后缀M3以观察传播效应。(7) 协议感知变异算子：对消息做replacement/insertion/duplication/deletion，并与byte-level mutation堆叠。(8) 统一bitmap记录code+state覆盖：为state transition预留bitmap区域（SHIFT_SIZE），用分支与状态转移共同定义interesting seeds。
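A toy illustration of the M1/M2/M3 sequence mutation in items (6)–(7): keep the prefix that drives the server into the target state, mutate only one candidate message (with simplified protocol-aware and byte-level operators), and replay the suffix unchanged. This is a schematic re-implementation, not AFLNet's C code.

```python
# Schematic M1/M2/M3 mutation over a message sequence (simplified operators).
import random

def mutate_sequence(messages, target_index, rng=random.Random(0)):
    """M1 = messages[:target_index]              kept verbatim (reaches the target state)
    M2 = messages[target_index:target_index+1]   the region we mutate
    M3 = the remaining suffix, replayed unchanged to observe downstream effects."""
    m1 = messages[:target_index]
    m2 = list(messages[target_index:target_index + 1])
    m3 = messages[target_index + 1:]

    op = rng.choice(["duplicate", "delete", "byteflip"])
    if op == "duplicate" and m2:
        m2 = m2 + m2                              # protocol-aware: repeat the message
    elif op == "delete":
        m2 = []                                   # protocol-aware: drop the message
    elif op == "byteflip" and m2:
        b = bytearray(m2[0])
        pos = rng.randrange(len(b))
        b[pos] ^= 0xFF                            # byte-level havoc on the candidate message
        m2 = [bytes(b)]
    return m1 + m2 + m3

if __name__ == "__main__":
    seq = [b"USER anon\r\n", b"PASS guest\r\n", b"LIST\r\n", b"QUIT\r\n"]
    print(mutate_sequence(seq, target_index=2))
```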
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
基准为ProFuzzBench（Natella & Pham, ISSTA 2021，工具/基准论文"ProFuzzBench: A benchmark for stateful protocol fuzzing"）。论文在其默认集成的网络协议实现（如Bftpd、DNSmasq、OpenSSH、TinyDTLS、Live555、ProFTPD、Pure-FTPd、Exim、DCMTK、Kamailio、forked-daapd、lightFTP等）上进行评测。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
(1) Code coverage：以branch coverage（分支覆盖数）衡量探索到的代码范围（"未覆盖代码无法触发漏洞"）。(2) State space coverage：以IPSM中构建的state transitions数量（以及状态数量/覆盖）衡量探索到的协议状态空间。(3) Vargha-Delaney effect size（Â12）：衡量两组独立实验结果的优势概率/效应量，用于判断差异是否具有"显著优势"（文中以Â12≥0.71或≤0.29作为显著门槛）。(4) 时间维度覆盖趋势（branch covered over time）：24小时曲线，用于对比不同变体达到同等覆盖所需时间（如提到"约6×/4×更快达到相同分支数"）。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
RQ1（仅state反馈）：AFLNetDARK（仅state feedback）在12个ProFuzzBench对象中有6个在code coverage上显著优于AFLNetBLACK（无code/state反馈）——Bftpd、DNSmasq、Kamailio、lightFTP、ProFTPD、Pure-FTPd——并在OpenSSH与TinyDTLS上分别约6×与4×更快达到与BLACK相同的分支数；对state数量很少的对象（如DCMTK，最终仅3个state），提升不明显。结论：当state数量"足够"时，state反馈可作为无代码插桩场景的有效引导。RQ2（state+code vs 仅code，表1）：AFLNetQUEUE相对AFLNetCODE平均branch coverage提升仅+0.01%，但state coverage平均提升+35.67×（例如OpenSSH的state数从93.5提升到30480.9，+325.00×，Â12=1.00；DNSmasq从282.5到27364.0，+95.85×，Â12=1.00；Bftpd从170.5到334.0，+0.96×，Â12=1.00）。结论：额外state反馈极大扩展状态空间探索，但对代码覆盖提升整体不显著。RQ3（seed-selection策略，表2/3）：交织策略AFLNet在综合表现上最好，其相对AFLNetQUEUE平均branch coverage为-0.52%，但相对AFLNetIPSM为+1.65%；同时state coverage相对AFLNetQUEUE为+5.77%、相对AFLNetIPSM为+12.77%。结论：交织式seed-selection总体上最稳健，且state覆盖与code覆盖不呈简单正相关。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机：网络协议实现是stateful reactive systems，输入是message sequence而非单文件；AFL类CGF缺乏状态与序列结构认知，SBF又依赖手工模型且不做进化保种，导致难以深入状态空间并发现实现偏差/隐藏转移与漏洞。
贡献点：(1) 系统化阐释AFLNet：把消息序列纳入灰盒进化框架，并将state coverage与code coverage共同作为反馈信号。(2) 在线轻量IPSM学习与引导：通过response code等构造implemented protocol state machine，并用#fuzz/#selected/#paths等统计驱动state选择与导向。(3) 面向目标state的序列分割与变异策略（M1/M2/M3）：既保证可达性，又允许在关键段落聚焦变异。(4) 统一bitmap编码分支与状态转移覆盖：在AFL bitmap中为state transition预留空间，使"interesting"判定同时考虑state与code。(5) 五年后大规模评估与配置指导：在ProFuzzBench上做24h×10次的大样本实验，分离评估state反馈、state+code协同以及seed-selection策略，并用Â12报告效应量。
主要创新之处：以"状态反馈+在线状态机学习"把灰盒fuzzing从无状态程序扩展到协议实现的状态空间探索，并给出可复现的消融式实证结论与最佳实践建议。

View File

@@ -0,0 +1,17 @@
协议状态机/模糊测试/学习输入生成(支撑你“生成有效 request-response 交互序列”)
对 Modbus TCP 来说,“有效”不仅是字段合法,还包括:
request 与 response 配对
Transaction ID 一致/递增策略合理
功能码与地址范围一致(如 0x03 对 holding register 区间)
异常响应的触发条件合理
这些强约束往往在 fuzzing / protocol testing 文献里讨论得更系统。
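To make the constraints above concrete, here is a minimal Modbus/TCP request-response consistency check (transaction ID match, function-code consistency, exception-response shape, byte count vs. register quantity for function code 0x03). It follows the standard MBAP header + PDU layout but is only a sketch, not a full protocol validator.

```python
# Minimal validity check for one Modbus/TCP request-response pair (sketch).
import struct

def parse_mbap(frame: bytes):
    tid, pid, length, uid = struct.unpack(">HHHB", frame[:7])
    return {"tid": tid, "pid": pid, "len": length, "uid": uid, "pdu": frame[7:]}

def check_pair(request: bytes, response: bytes) -> list[str]:
    issues = []
    req, rsp = parse_mbap(request), parse_mbap(response)
    if req["tid"] != rsp["tid"]:
        issues.append("transaction ID mismatch")
    if req["pid"] != 0 or rsp["pid"] != 0:
        issues.append("protocol ID must be 0 for Modbus")
    req_fc, rsp_fc = req["pdu"][0], rsp["pdu"][0]
    if rsp_fc == req_fc + 0x80:                # exception response
        if len(rsp["pdu"]) != 2:
            issues.append("exception response must carry exactly one exception code")
    elif rsp_fc != req_fc:
        issues.append("function code mismatch")
    elif req_fc == 0x03:                       # Read Holding Registers
        quantity = struct.unpack(">H", req["pdu"][3:5])[0]
        byte_count = rsp["pdu"][1]
        if byte_count != 2 * quantity:
            issues.append("byte count != 2 * register quantity")
    return issues

if __name__ == "__main__":
    req = struct.pack(">HHHB", 1, 0, 6, 1) + bytes([0x03]) + struct.pack(">HH", 0x0000, 2)
    rsp = struct.pack(">HHHB", 1, 0, 7, 1) + bytes([0x03, 0x04]) + struct.pack(">HH", 10, 20)
    print(check_pair(req, rsp) or "pair looks consistent")
```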
Pham et al. AFLNet: A Greybox Fuzzer for Network Protocols. ICST 2020.
用途:面向网络协议的状态覆盖 fuzzing你可以借鉴其“状态反馈”思想把扩散生成器和协议栈反馈有效率/覆盖率)结合起来做强化。
She et al. NEUZZ: Efficient Fuzzing with Neural Program Smoothing. IEEE S&P 2019.
用途:神经网络引导 fuzzing 的代表作;可作为你未来“生成模型 + 反馈优化/引导采样”的相关工作支撑。
Godefroid, Peleg, Singh. Learn&Fuzz: Machine Learning for Input Fuzzing. ASE 2017.
用途:学习输入格式再生成;与你“语义级生成 + 确定性组装器”的理念一致(模型学语义,规则负责封包细节)。

View File

@@ -0,0 +1,9 @@
@misc{godefroid2017learnfuzzmachinelearninginput,
title={Learn&Fuzz: Machine Learning for Input Fuzzing},
author={Patrice Godefroid and Hila Peleg and Rishabh Singh},
year={2017},
eprint={1701.07232},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/1701.07232},
}

View File

@@ -0,0 +1,47 @@
# Learn&Fuzz Machine Learning for Input Fuzzing
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文聚焦grammar-based fuzzing的关键瓶颈：需要手工编写input grammar，过程费时且易错，而复杂结构化格式（如PDF）又最依赖该类fuzzing。研究目的在于用neural-network-based statistical learning自动生成可用于输入fuzzing的grammar/生成模型，并解决learn&fuzz张力（learning偏向生成well-formed inputs，fuzzing需要破坏结构以覆盖error-handling与意外路径）。方法上，对约63,000个non-binary PDF objects进行无监督训练，采用seq2seq RNN（LSTM）学习字符级概率分布，并提出NoSample/Sample/SampleSpace三种采样生成策略，及基于概率分布引导"where to fuzz"的SampleFuzz算法。实验以Microsoft Edge PDF parser为目标，用instruction coverage、pass rate与AppVerifier监测bugs评估：SampleSpace在50 epochs达97% pass rate；整体覆盖最佳为Sample-40e；在learn+fuzz组合中，SampleFuzz以567,634条指令覆盖、68.24% pass rate取得最高覆盖，并优于多种random fuzzing基线，且在更长实验中发现并修复一个stack-overflow bug。结论：统计生成模型可自动学习输入结构，并用概率信息更智能地施加fuzz以提升覆盖。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
Abstract. Fuzzing consists of repeatedly testing an application with modified, or fuzzed, inputs with the goal of finding security vulnerabilities in input-parsing code. In this paper, we show how to automate the generation of an input grammar suitable for input fuzzing using sample inputs and neural-network-based statistical machine-learning techniques. We present a detailed case study with a complex input format, namely PDF, and a large complex security-critical parser for this format, namely, the PDF parser embedded in Microsoft's new Edge browser. We discuss (and measure) the tension between conflicting learning and fuzzing goals: learning wants to capture the structure of well-formed inputs, while fuzzing wants to break that structure in order to cover unexpected code paths and find bugs. We also present a new algorithm for this learn&fuzz challenge which uses a learnt input probability distribution to intelligently guide where to fuzz inputs.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Patrice Godefroid, Hila Peleg, Rishabh Singh
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
arXiv:1701.07232v1
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
论文要解决的核心问题是如何在无需人工编写格式规范的前提下从sample inputs自动学习出“足够像grammar”的生成式输入模型用于grammar-based fuzzing复杂结构化输入以PDF为代表并进一步在“生成尽量well-formed以深入解析流程”和“刻意引入ill-formed片段以触达异常/错误处理代码”之间取得可控平衡。传统黑盒/白盒fuzz对复杂文本结构格式不如grammar-based有效但后者依赖手工grammar已有grammar/automata学习方法对PDF对象这种“相对扁平但token/键值组合极多”的格式并不理想。本文用seq2seq RNN学习字符序列的概率分布作为统计grammar并利用该分布在高置信位置定点“反向扰动”以实现learn&fuzz。简洁概述用神经网络从样本自动学输入结构并用学到的概率分布指导更有效的结构化fuzz。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
(1) seq2seq RNN统计输入建模：把PDF object当作字符序列，训练encoder-decoder（LSTM）学习p(x_t|x_<t)，并作为生成式语言模型。 (2) 训练数据构造（shift-by-one，无监督）：将拼接后的字符流分块成长度d的输入序列，输出为右移1位的序列，实现无监督next-char学习。 (3) NoSample生成：每步取argmax字符，最大化well-formed但多样性极低。 (4) Sample生成：每步按分布采样字符，提升多样性但更易产生ill-formed片段。 (5) SampleSpace生成：仅在whitespace边界采样、token内部用argmax，折中多样性与语法一致性。 (6) SampleFuzz算法：生成时先采样得到字符c及其概率p(c)，若模型高置信（p(c)>p_t）且掷币触发（p_fuzz>t_fuzz），则用分布中最低概率字符替换（argmin），在"最不该出错的位置"注入异常，以诱导解析器走入错误处理/意外路径。 (7) PDF对象嵌入整文件的host-append机制：将新对象按PDF增量更新规则附加到well-formed host，更新xref与trailer，以便对Edge PDF parser进行端到端测试。
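A toy re-implementation of the SampleFuzz idea in item (6): sample the next character from a learnt distribution, but when the model is highly confident (p(c) > p_t) and a coin flip fires (p_fuzz > t_fuzz), emit the lowest-probability character instead. The "model" below is a fixed toy distribution standing in for the paper's seq2seq RNN.

```python
# SampleFuzz sketch: inject the least likely character exactly where the model is most confident.
import random

def sample_fuzz(model_dist, length, p_t=0.9, t_fuzz=0.6, rng=random.Random(0)):
    out = []
    for _ in range(length):
        chars, probs = zip(*model_dist(out).items())
        c = rng.choices(chars, weights=probs)[0]           # normal sampling step
        p_c = dict(zip(chars, probs))[c]
        if p_c > p_t and rng.random() > t_fuzz:            # high confidence + coin flip -> fuzz here
            c = min(zip(chars, probs), key=lambda kv: kv[1])[0]
        out.append(c)
    return "".join(out)

def toy_model(prefix):
    # Pretend the learnt model is extremely confident that a digit follows a space.
    return {"0": 0.95, " ": 0.03, ">": 0.02} if prefix[-1:] == [" "] else \
           {"/": 0.2, "L": 0.2, " ": 0.2, "0": 0.2, "e": 0.2}

if __name__ == "__main__":
    print(sample_fuzz(toy_model, length=20))
```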
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
(1) PDF训练语料：从534个PDF文件中抽取约63,000个non-binary PDF objects；这534个PDF由Windows fuzzing team提供，且是对更大PDF集合做seed minimization后的结果（更大集合来源包括公开Web与历史fuzz用PDF，论文未给该数据集专有名称）。(2) 目标程序/基准：Microsoft Edge browser内嵌的Edge PDF parser，通过Windows团队提供的单进程test-driver执行。(3) Host PDF集合：从上述534个PDF中选取最小的3个作为host1/host2/host3（约26Kb/33Kb/16Kb），用于将生成对象附加成完整PDF。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
(1) Coverage（instruction coverage）：统计执行过的唯一指令集合（以(dll-name, dll-offset)标识），取集合并集，衡量一组测试的覆盖范围，是fuzzing有效性的核心指标。 (2) Pass rate：通过grep解析日志中是否有parsing-error来判定pass/fail，pass表示被解析器视为well-formed，主要用来估计学习质量与"结构保持程度"。 (3) Bugs：在AppVerifier监控下捕获内存破坏类缺陷（如buffer overflow、异常递归导致的stack overflow等），衡量真实漏洞发现能力。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
基线覆盖（host与baseline）：三份host单独覆盖约353,327（host1）到457,464（host2）条唯一指令，三者并集（host123）为494,652；将1,000个真实对象附加到host后，baseline123覆盖为553,873，且所有host自身pass rate为100%。学习质量（pass rate）：Sample在10 epochs时pass rate已>70%；SampleSpace整体更高，50 epochs最高达97% pass rate。覆盖表现（学习不加fuzz）：不同host对覆盖影响明显；总体覆盖最佳为Sample-40e（host123场景下胜出），且Sample-40e的覆盖集合几乎是其他集合的超集（相对SampleSpace-40e仅缺1,680条指令）。学习+fuzz对比（30,000个PDF/组，图8）：SampleFuzz覆盖567,634、pass rate 68.24%，为最高覆盖；次优Sample+Random覆盖566,964、pass rate 41.81%；Sample-10K覆盖565,590、pass rate 78.92%；baseline+Random覆盖564,195、pass rate 44.05%；SampleSpace+Random覆盖563,930、pass rate 36.97%。结论：存在coverage与pass rate张力，随机fuzz提升覆盖但显著降低通过率，SampleFuzz在约65%–70%通过率附近取得更佳折中并带来最高覆盖。漏洞：常规实验未发现bug（目标已被长期fuzz），但更长实验（Sample+Random，100,000对象/300,000 PDF，约5天）发现并修复一个stack-overflow bug。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机grammar-based fuzzing对复杂结构化输入最有效但手工编写input grammar“劳累/耗时/易错”限制了在真实大型解析器如浏览器PDF解析上的应用同时学习生成“规范输入”与fuzzing“破坏结构找漏洞”目标冲突需要可控融合。
贡献点:(1) 首次将neural-network-based statistical learningseq2seq RNN/LSTM用于从样本自动学习可生成的输入模型以自动化grammar生成用于fuzzing。 (2) 针对PDF这种超复杂格式明确限定范围为non-binary PDF objects并给出端到端工程方案把生成对象按PDF增量更新规则附加到host形成完整PDF以真实驱动Edge PDF parser评测。 (3) 系统分析并量化learn&fuzz张力用pass rate刻画学习质量、用instruction coverage刻画fuzz有效性展示两者此消彼长。 (4) 提出SampleFuzz利用learned input probability distribution在模型高置信位置用低概率字符替换实现“智能选择where to fuzz”在覆盖与通过率之间取得更优折中并获得最高覆盖。
主要创新:把“概率语言模型”的不确定性/置信度直接转化为fuzzing决策信号高置信处注入反常从而在保持足够结构可深入解析的同时更系统地触达错误处理与意外路径。

View File

@@ -0,0 +1,9 @@
@misc{she2019neuzzefficientfuzzingneural,
title={NEUZZ: Efficient Fuzzing with Neural Program Smoothing},
author={Dongdong She and Kexin Pei and Dave Epstein and Junfeng Yang and Baishakhi Ray and Suman Jana},
year={2019},
eprint={1807.05620},
archivePrefix={arXiv},
primaryClass={cs.CR},
url={https://arxiv.org/abs/1807.05620},
}

View File

@@ -0,0 +1,60 @@
# NEUZZ Efficient Fuzzing with Neural Program Smoothing
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文指出传统graybox fuzzing多依赖evolutionary guidance，易陷入随机变异的低效序列，难以触发深层分支与hard-to-trigger bugs；而直接用gradient-guided optimization又会被真实程序的discontinuities、plateaus与ridges卡住。研究目的在于通过program smoothing构造可微的surrogate function，使梯度方法可用于覆盖导向fuzzing。方法上，NEUZZ用feed-forward NN学习输入到edge bitmap（分支/边覆盖）的平滑近似，采用binary cross-entropy训练并做label降维（合并总是共现的边），再基于∇x f_i(θ,x)选取top-k高梯度字节并按梯度符号生成变异，同时通过coverage-based filtration进行incremental learning以避免遗忘。主要结果：在10个真实程序上，NEUZZ在24小时内相对10个SOTA fuzzers达到最高edge coverage（最高达3×，文中亦报告对AFL最高约10×），发现31个此前未知bug（含2个CVE），并在LAVA-M与DARPA CGC上分别找到更多注入/真实漏洞。结论是：神经网络程序平滑+梯度引导变异能显著提升fuzz效率与覆盖，并具备良好可扩展性。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
Abstract—Fuzzing has become the de facto standard technique for finding software vulnerabilities. However, even state-of-the-art fuzzers are not very efficient at finding hard-to-trigger software bugs. Most popular fuzzers use evolutionary guidance to generate inputs that can trigger different bugs. Such evolutionary algorithms, while fast and simple to implement, often get stuck in fruitless sequences of random mutations. Gradient-guided optimization presents a promising alternative to evolutionary guidance. Gradient-guided techniques have been shown to significantly outperform evolutionary algorithms at solving high-dimensional structured optimization problems in domains like machine learning by efficiently utilizing gradients or higher-order derivatives of the underlying function. However, gradient-guided approaches are not directly applicable to fuzzing as real-world program behaviors contain many discontinuities, plateaus, and ridges where the gradient-based methods often get stuck. We observe that this problem can be addressed by creating a smooth surrogate function approximating the target program's discrete branching behavior. In this paper, we propose a novel program smoothing technique using surrogate neural network models that can incrementally learn smooth approximations of a complex, real-world program's branching behaviors. We further demonstrate that such neural network models can be used together with gradient-guided input generation schemes to significantly increase the efficiency of the fuzzing process. Our extensive evaluations demonstrate that NEUZZ significantly outperforms 10 state-of-the-art graybox fuzzers on 10 popular real-world programs both at finding new bugs and achieving higher edge coverage. NEUZZ found 31 previously unknown bugs (including two CVEs) that other fuzzers failed to find in 10 real-world programs and achieved 3X more edge coverage than all of the tested graybox fuzzers over 24 hour runs. Furthermore, NEUZZ also outperformed existing fuzzers on both LAVA-M and DARPA CGC bug datasets.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Dongdong She, Kexin Pei, Dave Epstein, Junfeng Yang, Baishakhi Ray, Suman Jana
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
arXiv:1807.05620v4
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
核心问题是覆盖导向fuzzing本质是优化问题最大化new edge coverage/bugs但真实程序的分支行为对输入是高度离散且不光滑的目标函数导致两类主流方法各有瓶颈evolutionary algorithms无法利用梯度结构而易低效停滞gradient-guided optimization虽高效但在程序的discontinuities、plateaus、ridges处梯度不可用/不可靠而“卡住”。NEUZZ要解决的就是“如何在不引入符号执行等高开销白盒平滑的前提下把程序分支行为变成可微、可求梯度的近似函数并让梯度真正能指导变异去触达未覆盖边与隐藏漏洞”。简洁概述用可微的神经网络代理模型平滑程序分支使梯度引导变异在真实程序上可用且更高效。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
(1) Neural program smoothing：训练feed-forward surrogate NN，把输入字节序列映射为edge bitmap的"平滑近似"，从而可微、可求梯度。
(2) Edge-label降维预处理：仅保留训练集中出现过的边，并合并"总是共现"的边以缓解multicollinearity，减少输出维度（约从65,536降到~4,000）。
(3) Gradient-guided mutation（Algorithm 1）：对选定的输出边神经元计算∇x f_i(θ,x)，选top-k高梯度字节作为critical bytes，按梯度符号对字节做增/减并clip到[0,255]，生成定向变异。
(4) Exponentially grow mutation target：从少量字节开始，逐轮扩大要变异的字节数，以覆盖更大输入空间，同时保持单次搜索有效性。
(5) Incremental learning + coverage-based filtration：把新发现触发新边的输入加入，并用"只保留能带来新覆盖的旧数据摘要"控制数据规模，迭代重训以提高代理模型精度并避免灾难性遗忘。
(6) Magic-check辅助（LAVA/CGC场景）：用定制LLVM pass插桩magic byte checks，用NN梯度先定位关键字节，再对相邻字节做局部穷举（4×256），以高效触发多字节条件。
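A sketch of the gradient-guided mutation in item (3), with a logistic surrogate standing in for the trained NN: compute the gradient of one smoothed edge output with respect to the input bytes, pick the top-k bytes by gradient magnitude, and push them in the direction of the gradient sign, clipped to [0, 255]. The weights are random stand-ins, so this only illustrates the mechanics.

```python
# Gradient-guided byte mutation on a smooth surrogate f(x) = sigmoid(w.x + b) (sketch).
import numpy as np

def gradient_guided_mutants(x, w, b, k=4, step=32):
    z = float(w @ x + b)
    sig = 1.0 / (1.0 + np.exp(-z))
    grad = sig * (1.0 - sig) * w                 # d sigmoid(w.x + b) / d x
    critical = np.argsort(-np.abs(grad))[:k]     # top-k "critical bytes"
    up, down = x.copy(), x.copy()
    up[critical] = np.clip(up[critical] + step * np.sign(grad[critical]), 0, 255)
    down[critical] = np.clip(down[critical] - step * np.sign(grad[critical]), 0, 255)
    return up.astype(np.uint8), down.astype(np.uint8), critical

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    seed_input = rng.integers(0, 256, size=64).astype(np.float64)
    w, b = rng.normal(size=64), 0.0
    up, down, crit = gradient_guided_mutants(seed_input, w, b)
    print("critical byte offsets:", crit)
```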
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
(1) 10个真实世界程序集：binutils-2.30（readelf -a, nm -C, objdump -D, size, strip）、harfbuzz-1.7.6、libjpeg-9c、mupdf-1.12.0、libxml2-2.9.7、zlib-1.2.11（论文表II(b)列出，来源为对应开源项目/版本）。
(2) LAVA-M bug dataset：LAVA项目的子集（base64、md5sum、uniq、who），含注入的magic-number触发漏洞（来源引用[28] LAVA）。
(3) DARPA CGC dataset：Cyber Grand Challenge二进制/服务程序数据集，论文从中随机选50个binary评测（来源引用[26] CGC repository）。
(4) 训练数据来源：先运行AFL-2.52b 1小时生成初始seed corpus与边覆盖标签用于训练NN（平均每个程序约2K训练输入）。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
(1) Bugs found / crashes统计发现的真实漏洞与崩溃数量用于衡量漏洞挖掘能力内存类问题通过AddressSanitizer的stack trace去重整数溢出通过人工分析+UBSan验证。
(2) Edge coveragenew control-flow edges以AFL的edge coverage report统计“新增边数”作为覆盖导向fuzzing的核心效果指标。
(3) 运行时间预算下的覆盖增长曲线比较24h真实程序、5hLAVA-M、6hCGC内覆盖随时间变化体现“到达新边的速度”。
(4) 训练开销/时间NEUZZ train(s), training time sec衡量学习组件的成本如与RNN fuzzer对比时报告训练时间差异
(5) 固定变异预算下的覆盖如1M mutations在控制变异次数时比较不同方法/模型的有效性,排除训练时长差异干扰。
(6) NN预测精度test accuracy约95%平均):用于说明代理模型对分支行为预测质量(间接影响梯度指导有效性)。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
真实程序24h：NEUZZ在10个程序上均取得最高edge coverage（表VI示例：readelf -a 4,942、harfbuzz 6,081、nm -C 2,056、libxml 1,596、mupdf 487、zlib 376等），并在多程序上呈现"1小时内新增>1,000边"的领先速度；文中总结对AFL在9/10程序上分别约6×、1.5×、9×、1.8×、3.7×、1.9×、10×、1.3×、3×的边覆盖优势，并称相对次优fuzzer可达约4.2×、1.3×、7×、1.2×、2.5×等提升。真实漏洞（表III）：在6个fuzzer对比中NEUZZ总计发现60个bug（AFL 29、AFLFast 27、VUzzer 14、KleeFL 26、AFL-laf-intel 60?——表中按项目/类型给出），且NEUZZ覆盖5类bug，并额外拿到2个CVE（CVE-2018-19931/19932）。LAVA-M（5h）：NEUZZ在base64找到48、md5sum 60、uniq 29、who 1,582（表IV），整体优于Angora等对比项。CGC（6h，50 binaries）：NEUZZ触发31个有漏洞binary（AFL为21、Driller为25），且NEUZZ覆盖了AFL/Driller找到的全部并额外多6个（表V）。与RNN fuzzer对比（1M mutations，表VII）：NEUZZ在readelf/libjpeg/libxml/mupdf上分别获得约8.4×/4.2×/6.7×/3.7×更多边覆盖，且训练开销约低20×。模型消融（表VIII）：线性模型显著落后，增量学习进一步提升（如readelf -a：1,723→1,800→2,020）。结论：神经平滑+梯度定向变异在覆盖与找bug上均显著优于多种SOTA，且训练/执行开销可控、可扩展到大程序。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机evolutionary fuzzing在深层逻辑与稀疏漏洞上效率低梯度优化在高维结构化问题上更强但直接用于程序会被分支离散性导致的不可微/不连续行为阻断既有program smoothing依赖符号执行/抽象解释,开销大且不可扩展。
贡献点:(1) 提出“program smoothing对gradient-guided fuzzing至关重要”的核心观点并把fuzzing形式化为优化问题明确为何需要平滑代理。 (2) 设计首个可扩展的surrogate neural network program smoothing用feed-forward NN学习输入→edge coverage bitmap的平滑近似并通过标签降维解决训练可行性。 (3) 提出面向覆盖的gradient-guided mutation策略利用∇x f_i(θ,x)定位critical bytes与变异方向系统生成高价值变异而非均匀随机。 (4) 提出coverage-based filtration的incremental learning流程持续用新覆盖数据纠正代理模型且避免灾难性遗忘。 (5) 实现NEUZZ并在真实程序、LAVA-M、CGC上进行大规模对比证明在bug数量与edge coverage上显著超越10个SOTA fuzzers。
主要创新之处:用“可微代理模型”把离散分支行为平滑为可求梯度的函数,再把梯度直接转化为变异位置/方向的决策信号,从而以低开销获得比符号/污点等重分析更强的探索能力。

View File

@@ -0,0 +1,9 @@
@misc{rasul2021autoregressivedenoisingdiffusionmodels,
title={Autoregressive Denoising Diffusion Models for Multivariate Probabilistic Time Series Forecasting},
author={Kashif Rasul and Calvin Seward and Ingmar Schuster and Roland Vollgraf},
year={2021},
eprint={2101.12072},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2101.12072},
}

View File

@@ -0,0 +1,44 @@
# Autoregressive Denoising Diffusion Models for Multivariate Probabilistic Time Series Forecasting
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文面向高维多变量概率时间序列预测，指出现有方法常依赖可处理的分布族（如多元高斯/低秩近似、copula或normalizing flows），对真实分布的多模态/非连通结构表达受限，且高维协方差建模代价高。研究目的在于构建一个能在每个时间步学习灵活条件分布、并保持自回归预测优势的模型。方法上提出TimeGrad：用RNN（LSTM/GRU）编码历史与covariates得到隐藏状态h_{t-1}，并在每个时间步以条件denoising diffusion probabilistic model/score matching形式学习p_θ(x_t^0|h_{t-1})；训练通过变分下界等价的噪声预测损失||ε−ε_θ(·,h_{t-1},n)||²，推断用annealed Langevin dynamics式的反向马尔可夫链从白噪声采样。实验在六个真实数据集（最高达2000维）上以CRPS与CRPS_sum评测，TimeGrad在除最小数据集外均达到新的SOTA；结论是"自回归+扩散/EBM"的组合可有效学习高维相关预测分布。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
In this work, we propose TimeGrad, an autoregressive model for multivariate probabilistic time series forecasting which samples from the data distribution at each time step by estimating its gradient. To this end, we use diffusion probabilistic models, a class of latent variable models closely connected to score matching and energy-based methods. Our model learns gradients by optimizing a variational bound on the data likelihood and at inference time converts white noise into a sample of the distribution of interest through a Markov chain using Langevin sampling. We demonstrate experimentally that the proposed autoregressive denoising diffusion model is the new state-of-the-art multivariate probabilistic forecasting method on real-world data sets with thousands of correlated dimensions. We hope that this method is a useful tool for practitioners and lays the foundation for future research in this area.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Kashif Rasul, Calvin Seward, Ingmar Schuster, Roland Vollgraf
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
文段未给出会议或期刊信息。
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
论文要解决的核心问题是在多变量维度D可达上千/上万时间序列预测中如何为每个未来时间步学习“高维、相关、可能多模态且非高斯”的条件预测分布q_X(x_t^0|x_{1:t-1}^0,c_{1:T})同时避免传统多元高斯协方差带来的O(D^2)参数和O(D^3)计算、以及normalizing flows/VAEs在非连通模式上可能出现的“连接模式间的虚假密度/映射困难”。TimeGrad通过在自回归框架下把每个时间步的发射分布替换为可用Langevin采样的扩散/score-based EBM从而以更少的结构限制拟合复杂分布。简洁概述用“RNN条件 + 扩散去噪得分模型”替代传统输出分布,做高维相关概率预测。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
(1) TimeGrad总体框架：把多变量预测分解为自回归因子∏_{t=t_0}^{T} p_θ(x_t^0|h_{t-1})，其中h_{t-1}由RNN对历史与协变量编码。(2) 条件扩散发射头（conditional denoising diffusion model）：对每个时间步t，把目标向量x_t^0视为训练样本，使用固定前向扩散q(x^n|x^{n-1})加噪，学习反向p_θ(x^{n-1}|x^n,h_{t-1})去噪。(3) 噪声ε预测的训练目标：用Ho et al. (2020)的ε-parameterization，优化E[||ε−ε_θ(√ᾱ_n x_t^0+√(1−ᾱ_n)ε, h_{t-1}, n)||²]，选择Σ_θ=β̃_n以学习score/梯度。(4) 推断与采样：给定h_{t-1}，从x_t^N~N(0,I)出发，按n=N…1进行反向更新（含噪声项z）得到x_t^0，再把采样到的x_t^0回喂RNN滚动生成多步轨迹。(5) 维度缩放（Scaling）：按context window的均值对各维归一化（均值为0则用1），预测后再缩放回原尺度，以处理不同维度量纲差异。(6) 协变量建模（Covariates）：对类别特征用embedding，组合时间相关（hour/day等）、时间不变特征与lag features，作为RNN输入的一部分。
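A schematic of the autoregressive sampling loop in item (4): the RNN state h_{t-1} conditions a full reverse-diffusion chain that turns white noise into x_t^0, and the sample is fed back to advance the RNN. Both `rnn_step` and `reverse_diffusion` below are crude stand-ins for the trained networks, so only the control flow mirrors the paper.

```python
# Autoregressive rollout sketch: one reverse-diffusion chain per forecast step,
# conditioned on the RNN hidden state; the sampled value re-enters the RNN.
import numpy as np

def rnn_step(h, x):                      # stand-in for the LSTM/GRU update
    return 0.9 * h + 0.1 * x

def reverse_diffusion(h, dim, n_steps=20, rng=np.random.default_rng(0)):
    x = rng.standard_normal(dim)         # start from white noise x^N ~ N(0, I)
    for _ in range(n_steps):             # stand-in denoising: pull the sample toward the condition
        x = x + 0.1 * (h - x) + 0.05 * rng.standard_normal(dim)
    return x

def forecast(history, horizon):
    dim = history.shape[1]
    h = np.zeros(dim)
    for x in history:                    # encode the conditioning window
        h = rnn_step(h, x)
    samples = []
    for _ in range(horizon):             # autoregressive rollout
        x0 = reverse_diffusion(h, dim)
        samples.append(x0)
        h = rnn_step(h, x0)              # feed the sample back into the RNN
    return np.stack(samples)

if __name__ == "__main__":
    hist = np.random.default_rng(1).standard_normal((24, 4))
    print(forecast(hist, horizon=6).shape)
```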
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
使用六个公开数据集，并声明"preprocessed exactly as in (Salinas et al., 2019a)"：Exchange（Lai et al., 2018）、Solar（Lai et al., 2018）、Electricity（UCI ElectricityLoadDiagrams20112014，https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014）、Traffic（UCI PEMS-SF，https://archive.ics.uci.edu/ml/datasets/PEMS-SF）、Taxi（NYC TLC Trip Record Data，https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page）、Wikipedia（GluonTS仓库提供，链接指向 https://github.com/mbohlkeschneider/gluon-ts/tree/mv_release/datasets）。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
(1) CRPSContinuous Ranked Probability Score对每个维度的预测CDF F与观测x的兼容性评分CRPS越小代表预测分布越接近真实分布是proper scoring rule。(2) CRPS_sum先对D维求和得到标量序列再对其预测分布\hat F_sum计算CRPS并在预测区间上平均用于衡量模型对“总量/聚合量”不确定性的刻画能力并且同样是proper scoring function。论文强调选择CRPS类指标是因为部分对比方法不提供解析likelihood或likelihood不可比。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
论文在六个数据集上用CRPS_sum对比多类基线（VAR/VAR-Lasso/GARCH/VES、KVAE、Vec-LSTM系列、GP-Copula/GP-scaling、Transformer-MAF），TimeGrad在除最小数据集外均为最优：Exchange上TimeGrad 0.006±0.001（最优为VES 0.005±0.000，TimeGrad略逊）；Solar上0.287±0.02，优于Transformer-MAF 0.301±0.014等；Electricity上0.0206±0.001，与Transformer-MAF 0.0207±0.000接近且更好；Traffic上0.044±0.006，优于Transformer-MAF 0.056±0.001；Taxi上0.114±0.02，优于Transformer-MAF 0.179±0.002等；Wikipedia上0.0485±0.002，优于Transformer-MAF 0.063±0.003。消融显示扩散步数N可降到约10而性能无明显损失，最优大约在N≈100，继续增大无收益。结论：TimeGrad作为"自回归扩散去噪模型"在高维相关概率预测上达到SOTA，并具备对多尺度/多模态分布更友好的建模特性。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机：(1) 多变量概率预测需要刻画不确定性与跨维相关性，但多元高斯/低秩/二阶相关限制明显且计算昂贵；(2) normalizing flows/VAEs在高维、非连通模式分布上可能产生虚假密度或映射困难；(3) EBMs/score-based方法表达力强，但与时间序列条件化的结合不足。
贡献点与创新：(1) 提出TimeGrad：将每个时间步的条件发射分布建模为RNN条件化的denoising diffusion probabilistic model，通过ε-预测学习梯度/score，并用Langevin式反向链采样，从而在自回归预测中实现高维灵活分布建模；(2) 给出训练与推断算法（逐时间步训练目标、滚动采样轨迹），并加入适配真实多维量纲差异的scale normalization与covariate embedding；(3) 在六个真实基准（最高2000维）上用CRPS与CRPS_sum系统对比，取得除最小数据集外的SOTA；(4) 分析扩散长度N的影响，指出N≈10已可用、N≈100最优，为实践中的采样-质量权衡提供依据。

View File

@@ -0,0 +1,9 @@
@misc{tashiro2021csdiconditionalscorebaseddiffusion,
title={CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation},
author={Yusuke Tashiro and Jiaming Song and Yang Song and Stefano Ermon},
year={2021},
eprint={2107.03502},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2107.03502},
}

View File

@@ -0,0 +1,44 @@
# CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文研究多变量时间序列缺失值插补（probabilistic time series imputation），指出传统以autoregressive为主的插补方法虽常用，但score-based diffusion models在图像/音频生成上已显著优于同类方法；且现有"用无条件扩散模型近似后验score"的插补做法会对观测值加噪、并非精确条件分布。研究目的为直接学习q(x0^ta|x0^co)的条件分布并提升插补质量。方法提出CSDI：基于DDPM参数化，将反向过程扩展为pθ(x^ta_{t-1}|x^ta_t,x0^co)，令去噪网络εθ显式条件化于观测（含conditional mask），并用masked language modeling式自监督训练把观测拆为conditional observations与imputation targets（含random/historical/mix/testpattern策略）；实现上采用2D attention（时间Transformer+特征Transformer）捕获时序与跨特征依赖，并加入时间/特征嵌入。结果：在PhysioNet 2012与北京空气质量数据上，CSDI将CRPS相对现有概率插补方法提升40–65%，确定性插补MAE降低5–20%，并在插值与概率预测上具竞争力。结论是"为插补显式训练的条件扩散模型"显著优于后验近似式扩散插补与RNN基线。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
The imputation of missing values in time series has many applications in healthcare and finance. While autoregressive models are natural candidates for time series imputation, score-based diffusion models have recently outperformed existing counterparts including autoregressive models in many tasks such as image generation and audio synthesis, and would be promising for time series imputation. In this paper, we propose Conditional Score-based Diffusion models for Imputation (CSDI), a novel time series imputation method that utilizes score-based diffusion models conditioned on observed data. Unlike existing score-based approaches, the conditional diffusion model is explicitly trained for imputation and can exploit correlations between observed values. On healthcare and environmental data, CSDI improves by 40-65% over existing probabilistic imputation methods on popular performance metrics. In addition, deterministic imputation by CSDI reduces the error by 5-20% compared to the state-of-the-art deterministic imputation methods. Furthermore, CSDI can also be applied to time series interpolation and probabilistic forecasting, and is competitive with existing baselines. The code is available at https://github.com/ermongroup/CSDI.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Yusuke Tashiro, Jiaming Song, Yang Song, Stefano Ermon
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
35th Conference on Neural Information Processing Systems (NeurIPS 2021)
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
论文核心要解决在多变量时间序列X∈R^{K×L}存在缺失mask M如何学习缺失部分的条件分布q(x0^ta|x0^co)并进行概率插补(输出分布/采样同时充分利用观测值在时间维与特征维的相关性。现有score-based/扩散插补通常用“无条件扩散模型”近似条件反向过程在反向采样中对观测x0^co也加噪并拼接进入x_t导致观测信息被噪声破坏且不对应精确的条件分布从而限制插补质量。CSDI通过“显式条件反向过程 + 为插补专门训练的εθ(x^ta_t,t|x0^co)”来避免该近似,并用自监督策略在训练时构造(conditional, target)对。简洁概述:用显式条件扩散模型直接建模“给定观测的缺失分布”,替代对观测加噪的后验近似式扩散插补。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
(1) CSDI条件扩散插补框架：把DDPM反向链改为pθ(x^ta_{t-1}|x^ta_t,x0^co)，在每步去噪时显式输入观测条件以恢复缺失值分布。(2) 条件DDPM参数化：沿用DDPM的μ_DDPM/σ_DDPM形式，仅将εθ扩展为条件网络εθ(x^ta_t,t|x0^co)（并在实现里加入m_co）。(3) 自监督训练（masked language modeling式）：把"观测值"划分为x0^co与x0^ta，目标对x0^ta加噪得到x^ta_t，优化||ε−εθ(x^ta_t,t|x0^co)||²，无需真实缺失的ground truth。(4) 目标选择策略（target choice）：Random（随机比例选观测为目标）、Historical（用训练集中其他样本缺失模式决定目标）、Mix（随机+历史混合）、Testpattern（已知测试缺失模式时直接用）。(5) 时间序列实现的输入对齐：将x^ta_t与x0^co零填充到固定K×L形状，配合conditional mask m_co指示条件位置与输出mask。(6) 2D attention去噪网络：在残差层中同时做temporal Transformer（沿L捕获时间依赖）与feature Transformer（沿K捕获特征依赖）。(7) Side information：时间戳s的128维time embedding + 特征ID的embedding，用于增强条件建模。
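A sketch of the "random" target-choice strategy in items (3)–(4): part of the observed entries is re-labelled as imputation targets and the remainder becomes the condition mask m_co, which is all the supervision the εθ network needs. The ratio and variable names are illustrative.

```python
# Build a self-supervised (condition, target) mask pair from observed entries (sketch).
import numpy as np

def random_target_choice(observed_mask, target_ratio=0.5, rng=np.random.default_rng(0)):
    """observed_mask: 1 where x is observed, 0 where it is truly missing.
    Returns (cond_mask m_co, target_mask): disjoint subsets of the observed entries."""
    obs_idx = np.argwhere(observed_mask == 1)
    n_target = int(len(obs_idx) * target_ratio)
    chosen = obs_idx[rng.choice(len(obs_idx), size=n_target, replace=False)]
    target_mask = np.zeros_like(observed_mask)
    target_mask[tuple(chosen.T)] = 1
    cond_mask = observed_mask - target_mask
    return cond_mask, target_mask

if __name__ == "__main__":
    K, L = 4, 6                                  # features x time steps
    observed = (np.random.default_rng(1).random((K, L)) > 0.2).astype(int)
    m_co, m_ta = random_target_choice(observed)
    print("condition:\n", m_co, "\ntargets:\n", m_ta)
```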
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
插补/插值数据集：(1) PhysioNet Challenge 2012 healthcare数据集（4000个ICU临床时间序列，35变量，48小时；文中引用Silva et al., 2012）。(2) Air quality（Beijing PM2.5）数据集：北京36个站点PM2.5小时数据（文中引用Yi et al., 2016），数据覆盖2014/05/01–2015/04/30，且含结构化缺失/人工ground truth。预测（forecasting）数据集：来自GluonTS并按Salinas et al. 2019预处理的solar（137站点光伏发电记录）、electricity（UCI ElectricityLoadDiagrams20112014）、traffic（UCI PEMS-SF）、taxi（NYC TLC trip records聚合的1214位置半小时序列）、wiki（2000维Wikipedia page views）。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
(1) CRPS（continuous ranked probability score）：评估概率插补/插值的预测分布与观测值的匹配程度，越小越好；用100个样本近似分布，并按分位数损失积分近似计算（论文用于概率插补与预测，也报告逐任务CRPS）。(2) 归一化平均CRPS：对所有目标(k,l)的CRPS求和后除以∑|x_{k,l}|，用于跨特征尺度的可比性（论文Eq.18）。(3) MAE（mean absolute error）：确定性插补误差，越小越好；CSDI取100样本中位数作为确定性插补。(4) RMSE：确定性插补/插值误差补充指标，越小越好（在附录表9/10报告）。(5) CRPS-sum：概率预测任务中"对K维求和后的分布"的CRPS，越小越好，衡量联合效应/聚合量不确定性（论文Eq.19）。(6) MSE：概率预测任务的点误差补充指标，越小越好（表12）。(7) NLL（negative log likelihood）：论文做了对比分析（表8），说明其对噪声schedule敏感且与样本质量/CRPS不一定相关，因此不作为主指标。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
概率插补（CRPS，越低越好，表2）：在healthcare数据上，CSDI分别为（10%/50%/90% missing）0.238(0.001)/0.330(0.002)/0.522(0.002)，优于Multitask GP（0.489/0.581/0.942）、GP-VAE（0.574/0.774/0.998）、V-RIN（0.808/0.831/0.922），并优于"unconditional diffusion"插补（0.360/0.458/0.671）。在air quality上CSDI为0.108(0.001)，优于Multitask GP 0.301、GP-VAE 0.397、V-RIN 0.526、unconditional 0.135。确定性插补（MAE，表3）：healthcare为0.217/0.301/0.481（10/50/90% missing），优于BRITS 0.284/0.368/0.517、GLIMA（引用值10.54）；在air quality上CSDI为9.60(0.04)，优于BRITS（引用值11.56）、unconditional 12.13等，总体称提升5–20%。插值（不规则采样，CRPS，表4）：CSDI 0.380/0.418/0.556（10/50/90% missing），优于Latent ODE与mTANs。预测（CRPS-sum，表5）：CSDI在electricity与traffic上最好（0.017与0.020），在solar/taxi/wiki上与SOTA基线接近或略优（例如wiki 0.047 vs TimeGrad 0.049）。结论：显式条件扩散模型+自监督训练显著提升概率/确定性插补，并可迁移到插值与预测。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机：(1) 时间序列缺失普遍存在且影响医疗/金融等应用；(2) 既有深度插补多为autoregressive，难以达到扩散模型在生成任务中的表现；(3) 现有扩散插补多用"无条件扩散近似条件后验score"，需要对观测加噪、信息受损且不对应精确条件分布。
贡献点与创新：(1) 提出CSDI：为插补任务显式建模pθ(x0^ta|x0^co)的conditional score-based diffusion model，并在反向去噪中直接条件化于观测值以利用相关性。(2) 自监督训练机制：借鉴masked language modeling，把观测拆成条件与目标，设计random/historical/mix/testpattern等target choice策略，使得即使训练集缺少真实缺失ground truth也能训练。(3) 面向多变量时序的结构设计：提出2D attention（时间Transformer+特征Transformer）与side information（time embedding、feature embedding），强化跨时间/跨特征依赖建模。(4) 系统实证：在healthcare与air quality上CRPS提升40–65%、确定性MAE降低5–20%，并展示对插值与预测的可扩展性与竞争力，同时通过消融证明2D attention等组件关键。

View File

@@ -0,0 +1,11 @@
@inproceedings{NEURIPS2020_4c5bcfec,
author = {Ho, Jonathan and Jain, Ajay and Abbeel, Pieter},
booktitle = {Advances in Neural Information Processing Systems},
editor = {H. Larochelle and M. Ranzato and R. Hadsell and M.F. Balcan and H. Lin},
pages = {6840--6851},
publisher = {Curran Associates, Inc.},
title = {Denoising Diffusion Probabilistic Models},
url = {https://proceedings.neurips.cc/paper_files/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf},
volume = {33},
year = {2020}
}

View File

@@ -0,0 +1,44 @@
# **Denoising Diffusion Probabilistic Models**
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文研究扩散概率模型diffusion probabilistic models, diffusion models在高质量图像生成上的可行性。背景问题是此前扩散模型虽易定义与训练但缺少生成高质量样本的证明。研究目的在于改进扩散模型的训练与参数化并展示其样本质量。方法上构建固定前向扩散过程q(x_t|x_{t-1})逐步加高斯噪声并学习反向过程p_θ(x_{t-1}|x_t)的高斯转移提出与denoising score matching和Langevin dynamics的显式联系采用预测噪声ε的参数化并使用简化的加权变分目标L_simple训练模型用U-Net+自注意力与时间嵌入。结果在无条件CIFAR10上达到Inception Score 9.46、FID 3.17在256×256 LSUN上样本质量接近ProgressiveGAN。结论指出扩散模型能产生高质量样本且其采样可解释为progressive decoding/渐进式有损解码的泛化。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN. Our implementation is available at https://github.com/hojonathanho/diffusion.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Jonathan Ho, Ajay Jain, Pieter Abbeel
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
34th Conference on Neural Information Processing Systems (NeurIPS 2020)
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
核心问题是扩散概率模型作为“通过变分推断训练的马尔可夫链”能否在图像上达到与GAN/自回归模型相当的高样本质量以及如何设计反向过程参数化与训练目标来实现这一点。具体难点包括反向过程p_θ(x_{t-1}|x_t)如何学习“去噪逆扩散”训练时如何用可计算的变分下界分解L_T、L_{1:T-1}、L_0稳定优化采样链如何与score-based/annealed Langevin dynamics统一理解并解释其为何样本质量高但log likelihoodbits/dim不占优。简洁概述把扩散模型从“理论上合理、效果未证实”变成“能稳定训练并生成SOTA质量图像”的生成框架并给出与score matching/Langevin及渐进式解码的统一解释。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
(1) 扩散模型的前向/反向马尔可夫链建模：固定q(x_t|x_{t-1})逐步加高斯噪声（β_t schedule），学习高斯反向转移p_θ(x_{t-1}|x_t)。(2) 反向均值的ε-预测参数化：用网络ε_θ(x_t,t)预测噪声ε，并由此构造μ_θ(x_t,t)=1/√α_t·(x_t−β_t/√(1−ᾱ_t)·ε_θ(x_t,t))，使采样形态类似Langevin dynamics。(3) 与denoising score matching的等价联系：推导L_{t-1}在ε-参数化下变为多噪声等级的去噪平方误差，与score matching形式一致。(4) 简化训练目标L_simple：用未加权的E[||ε−ε_θ(√ᾱ_t x_0+√(1−ᾱ_t)ε,t)||²]训练（t均匀采样），作为"加权变分下界"以提升样本质量并简化实现。(5) 固定方差Σ_θ(x_t,t)=σ_t²I：不学习反向方差，实验发现更稳定且FID更好。(6) Progressive lossy decompression / progressive decoding解释：把采样过程解释为逐步恢复信息的渐进式解码，并与自回归解码作类比。
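A minimal numpy sketch of the L_simple objective in item (4): sample a step t, corrupt x_0 with the closed-form forward process, and regress the injected noise. The `eps_model` argument is a placeholder for the U-Net; the linear beta schedule mirrors the formulas quoted above.

```python
# L_simple training target sketch: predict the injected noise epsilon.
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)              # linear beta schedule
alphas_bar = np.cumprod(1.0 - betas)

def l_simple(x0, eps_model, rng=np.random.default_rng(0)):
    t = rng.integers(0, T)                       # t ~ Uniform({0..T-1})
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return float(np.mean((eps - eps_model(x_t, t)) ** 2))

if __name__ == "__main__":
    x0 = np.random.default_rng(1).standard_normal(16)
    dummy_eps_model = lambda x_t, t: np.zeros_like(x_t)   # stand-in network
    print("L_simple with a zero predictor:", l_simple(x0, dummy_eps_model))
```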
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
(1) CIFAR10无条件生成评测论文实验主基准通过TensorFlow Datasets加载。(2) LSUN 256×256Bedroom、Church、Cat类别数据准备使用StyleGAN代码FID计算亦使用StyleGAN2代码。(3) CelebA-HQ 256×256用于展示高分辨率人脸样本与插值通过TensorFlow Datasets加载
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
(1) Inception Score (IS)：衡量生成样本的可辨识性与多样性，越高越好（主要用于CIFAR10）。(2) Fréchet Inception Distance (FID)：衡量生成分布与真实分布在Inception特征空间的距离，越低越好（CIFAR10/LSUN等）。(3) Negative Log Likelihood (NLL) / bits per dimension：以离散数据的lossless codelength度量似然质量，越低越好，用于对比likelihood-based模型。(4) Rate–Distortion分析：将L_{1:T}视为rate、L_0视为distortion，画rate(bits/dim)与distortion(RMSE)随时间/码率变化，用于解释"高感知质量但似然一般"的原因。(5) Distortion：用RMSE（在[0,255]像素尺度）度量重建误差/失真。 (6) Progressive sampling quality over time：在反向步数(T−t)过程中跟踪IS与FID，观察由粗到细生成。 (7) Train–test NLL gap：训练/测试bits/dim差异，用于检查是否过拟合（论文报告gap≤0.03 bits/dim）。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
在无条件CIFAR10上，论文最佳模型（L_simple）达到IS=9.46±0.11、FID=3.17，并给出NLL Test≤3.75 bits/dim（Train 3.72）。与参数化/目标消融相比，ε-预测在L_simple下显著优于μ̃-预测（μ̃-预测在真变分界L+固定各向同性Σ时FID=13.22，而ε-预测+L_simple将FID降至3.17）。高分辨率上，LSUN 256×256生成质量与ProgressiveGAN相当：Bedroom FID可达4.90（large模型），Church FID=7.89，Cat FID=19.75。论文同时指出，尽管样本质量强，扩散模型的log likelihood不如其他likelihood-based模型；通过rate-distortion显示大量bits用于"人眼不可感知细节"，支持其"优秀有损压缩/渐进式解码"解释。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机扩散模型训练高效、定义简单但此前缺少能生成高质量样本的实证同时score matching/能量模型/Langevin采样等方向虽能生成图像但缺少统一、可计算似然与“直接训练采样器”的框架。
贡献点:(1) 证明扩散模型可达高样本质量在CIFAR10取得SOTA级FID=3.17并在LSUN 256×256上达到接近ProgressiveGAN的质量。(2) 提出关键理论联系建立扩散模型与多噪声级别的denoising score matching、以及与annealed Langevin dynamics采样的显式等价关系并将其视为主要贡献之一。(3) ε-预测反向过程参数化用预测噪声ε_θ替代直接预测μ̃_t使训练目标化简并使采样形式更接近Langevin动态。(4) 简化的加权变分目标L_simple作为对标准变分下界的重加权降低小t去噪项权重显著提升样本质量且实现更简洁。(5) 渐进式有损解码解释与自回归泛化把反向扩散看作progressive decoding并给出与自回归“比特顺序/掩码顺序”的广义对应同时用rate-distortion与逐步生成可视化支撑这一解释。

View File

@@ -0,0 +1,9 @@
@misc{wen2024diffstgprobabilisticspatiotemporalgraph,
title={DiffSTG: Probabilistic Spatio-Temporal Graph Forecasting with Denoising Diffusion Models},
author={Haomin Wen and Youfang Lin and Yutong Xia and Huaiyu Wan and Qingsong Wen and Roger Zimmermann and Yuxuan Liang},
year={2024},
eprint={2301.13629},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2301.13629},
}

View File

@@ -0,0 +1,45 @@
# DiffSTG: Probabilistic Spatio-Temporal Graph Forecasting with Denoising Diffusion Models
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文针对spatio-temporal graph (STG) forecasting中STGNN难以刻画intrinsic uncertainties的问题，研究probabilistic STG forecasting在不确定性建模与复杂ST dependencies上的挑战。目标是将denoising diffusion probabilistic models (DDPM)推广到STGs，提出非自回归框架DiffSTG，并设计首个面向STG的去噪网络UGnet。方法上构建conditional diffusion：以历史图信号与图结构为条件，在masked formulation中把[x^h,x^p]统一为x_all，并用x_all^msk条件化反向过程；训练采用噪声预测损失||ε−ε_θ(x_all^n,n|x_all^msk,G)||²；UGnet以Unet式时间尺度聚合+TCN建模时间依赖+GCN建模空间相关，并用子步采样与末k步复用加速采样。结果：在PEMS08、AIR-BJ、AIR-GZ上CRPS降低4%–14%、RMSE降低2%–7%，且推理较TimeGrad显著加速。结论表明DiffSTG兼顾ST建模能力与不确定性量化，适用于决策场景。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
Spatio-temporal graph neural networks (STGNN) have emerged as the dominant model for spatio-temporal graph (STG) forecasting. Despite their success, they fail to model intrinsic uncertainties within STG data, which cripples their practicality in downstream tasks for decision-making. To this end, this paper focuses on probabilistic STG forecasting, which is challenging due to the difficulty in modeling uncertainties and complex ST dependencies. In this study, we present the first attempt to generalize the popular denoising diffusion probabilistic models to STGs, leading to a novel non-autoregressive framework called DiffSTG, along with the first denoising network UGnet for STG in the framework. Our approach combines the spatio-temporal learning capabilities of STGNNs with the uncertainty measurements of diffusion models. Extensive experiments validate that DiffSTG reduces the Continuous Ranked Probability Score (CRPS) by 4%-14%, and Root Mean Squared Error (RMSE) by 2%-7% over existing methods on three real-world datasets. The code is in https://github.com/wenhaomin/DiffSTG.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Haomin Wen, Youfang Lin, Yutong Xia, Huaiyu Wan, Qingsong Wen, Roger Zimmermann, Yuxuan Liang
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
未在提供的论文内容中给出具体会议或期刊名称。
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
核心问题是在STG forecasting中同时做到(1)捕获跨节点的spatial correlations与跨时间的temporal dependencies复杂ST dependencies以及(2)对未来多步预测给出可用的概率分布与uncertainty quantification而非仅点预测并且(3)避免现有扩散式时间序列方法如TimeGrad在多步预测时的自回归/逐步生成带来的推理低效。论文指出时间序列扩散模型往往只在单变量或单节点上建模时间依赖缺乏显式图结构条件同时TimeGrad需要对每个未来步重复运行扩散链导致S×T_p×N级别的高开销。DiffSTG通过“图条件+非自回归多步一次生成+ST专用去噪网络”来解决。简洁概述把扩散概率模型做成“能利用图结构、一次生成多步、还能输出不确定性”的STG概率预测器。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
(1) DiffSTG非自回归概率预测框架：用一次反向扩散过程直接生成多预测步x^p（而不是逐步/逐未来步运行）。(2) Conditional Diffusion for STG：在反向过程p_θ(x_{n-1}|x_n, x^h, G)中显式条件化历史与图结构，以学习p(x^p|x^h,G)。(3) Generalized Masked Conditional Diffusion：将[x^h,x^p]拼成x_all，并用mask得到x_all^msk作为条件，统一"重建历史+预测未来"的训练目标。(4) UGnet去噪网络：时间维Unet结构捕获multi-scale temporal dependencies，块内用TCN建模时间依赖、用GCN建模空间相关，并注入noise level embedding。 (5) Sampling Acceleration：只采样子序列{τ_1…τ_M}减少步数M，并在末k步把近似样本复用，以减少需要的反向扩散链次数（S→S/k）。
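A small sketch of the masked conditional formulation in item (3): history and future are stacked into one tensor x_all, and a binary mask keeps only the history part as the condition x_all^msk, so one non-autoregressive reverse pass can denoise all future steps at once. Shapes (N nodes x T steps) and names are illustrative.

```python
# Build x_all and its masked condition x_all^msk for one graph signal (sketch).
import numpy as np

def build_masked_condition(x_history, horizon):
    n, t_h = x_history.shape
    x_future_placeholder = np.zeros((n, horizon))
    x_all = np.concatenate([x_history, x_future_placeholder], axis=1)     # N x (T_h + T_p)
    mask = np.concatenate([np.ones((n, t_h)), np.zeros((n, horizon))], axis=1)
    x_all_msk = x_all * mask                 # condition: history kept, future zeroed out
    return x_all, x_all_msk, mask

if __name__ == "__main__":
    hist = np.random.default_rng(0).standard_normal((5, 12))   # 5 nodes, 12 past steps
    x_all, x_msk, m = build_masked_condition(hist, horizon=4)
    print(x_all.shape, x_msk.shape, int(m.sum()))
```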
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
(1) PEMS08交通流量数据集来自Caltrans Performance Measurement System (PeMS)论文使用STSGCN提取/整理的版本,并依据实际路网构建邻接关系。 (2) AIR-BJ北京34个监测站PM2.5小时数据2019/01/012019/12/31来源引用Yi et al.相关工作[46]用站点距离构建空间相关矩阵A。 (3) AIR-GZ广州41个监测站PM2.5小时数据2017/01/012017/12/31同样来源[46]并用距离建图。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
(1) CRPSContinuous Ranked Probability Score评估预测分布F与观测x的兼容性衡量概率预测质量越小越好论文的主要指标。(2) MAE衡量点预测误差的平均绝对偏差论文对概率模型用多次采样的均值或聚合给出确定性结果后计算越小越好。(3) RMSE衡量点预测误差的均方根对大误差更敏感越小越好。(4) Inference Time / Time cost比较扩散式方法的推理耗时与加速效果表3给出不同样本数S下秒级耗时
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
在三数据集上的概率方法对比（表2，越小越好）：AIR-BJ上DiffSTG为MAE 17.88、RMSE 29.60、CRPS 0.34（优于TimeGrad 0.36、DeepAR 0.37等）；AIR-GZ上DiffSTG为MAE 10.95、RMSE 16.66、CRPS 0.22（优于DeepAR 0.23、TimeGrad/MC Dropout 0.25）；PEMS08上DiffSTG为MAE 17.68、RMSE 27.13、CRPS 0.06（优于DeepAR/MC Dropout 0.07、TimeGrad 0.09）。相对"各数据集最强基线"，CRPS下降5.6%/4.3%/14.3%，RMSE下降约7.1%/2.4%/7.6%（表2的Error reduction行）。推理效率（表3，AIR-GZ，N=100）：S=8时TimeGrad 9.58s，DiffSTG(M=100,k=1) 0.24s（约40×加速）；进一步加速的DiffSTG(M=40,k=2)为0.07s，且随S增大仍显著快于TimeGrad与CSDI。结论：DiffSTG在概率质量（CRPS）与确定性误差（MAE/RMSE）上均领先，并通过非自回归与采样子步显著提升推理速度。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机:(1) 现有STGNN多为deterministic难以提供uncertainty用于决策(2) 扩散式概率时间序列模型TimeGrad/CSDI等缺少图空间依赖建模且自回归式多步生成效率低。
贡献点:(1) 首次将DDPM系统推广到STG概率预测提出非自回归DiffSTG以建模p(x^p|x^h,G)并量化不确定性。(2) 提出首个面向STG的扩散去噪网络UGnet时间维Unet捕获多尺度时间模式结合TCN+GCN显式学习ST dependencies。(3) 提出masked generalized conditional diffusionx_all与x_all^msk同空间以统一“重建历史+预测未来”,提升条件利用与任务泛化(预测/生成/插值)。(4) 提出采样加速策略(子步采样+末k步复用并实证带来显著推理加速且性能保持。
主要创新之处ST专用去噪结构UGnet、非自回归多步扩散预测、以及面向STG任务的masked条件化扩散建模方式。

View File

@@ -0,0 +1,9 @@
@misc{kong2021diffwaveversatilediffusionmodel,
title={DiffWave: A Versatile Diffusion Model for Audio Synthesis},
author={Zhifeng Kong and Wei Ping and Jiaji Huang and Kexin Zhao and Bryan Catanzaro},
year={2021},
eprint={2009.09761},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2009.09761},
}

View File

@@ -0,0 +1,47 @@
# DiffWave: A Versatile Diffusion Model for Audio Synthesis
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文面向raw waveform生成指出既有likelihood-based模型WaveNet自回归、flow虽能保真但推理慢或受结构约束GAN/VAE等又常需auxiliary losses且存在mode collapse/posterior collapse尤其在unconditional generation中长序列建模困难。研究目的在于提出一种同时适用于conditional与unconditional的高保真、非自回归生成框架。方法提出DiffWave基于DDPM参数化用εθ预测噪声并以unweighted ELBO变体训练推理从白噪声通过Markov链反向去噪生成波形并提出fast sampling将训练时T步“折叠”为少量T_infer步。网络采用受WaveNet启发的bidirectional dilated convolution残差堆叠支持mel spectrogram本地条件与label全局条件。结果在LJ Speech神经声码器上MOS 4.44≈WaveNet 4.43且合成速度快多个数量级小模型2.64M参数在V100上>5×实时Fast在SC09无条件/类条件生成上相对WaveNet与WaveGAN在MOS、FID/IS/mIS/AM/NDB等质量与多样性指标显著更优。结论是扩散模型可用单一ELBO目标稳定训练并在音频生成任务上实现高质量与高效推理。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive, and converts the white noise signal into structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of variational bound on the data likelihood. DiffWave produces high-fidelity audio in different waveform generation tasks, including neural vocoding conditioned on mel spectrogram, class-conditional generation, and unconditional generation. We demonstrate that DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44 versus 4.43), while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity from various automatic and human evaluations.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, Bryan Catanzaro
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
ICLR 2021
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
论文核心要解决如何在raw waveform层面实现高保真生成同时兼顾(1)非自回归并行合成的高效率,(2)对conditional神经声码器/类条件与unconditional生成的统一建模能力(3)稳定训练且无需额外辅助损失。难点在于无条件生成需要在极长序列如16kHz下1秒=16000点上学习复杂分布WaveNet等自回归在无条件下易产生“made-up word-like sounds/样本差”GAN/ VAE又易出现训练不稳定或模式问题。简洁概述用扩散去噪的马尔可夫链把白噪声稳定地变成真实波形以更快推理和更强无条件生成质量替代传统自回归/GAN方案。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
(1) DiffWave扩散式波形生成采用DDPM前向加噪q(x_t|x_{t-1})与反向去噪pθ(x_{t-1}|x_t)从N(0,I)逐步生成x_0。(2) ε-parameterization与unweighted ELBO训练固定σθ为β̃_t^{1/2},网络εθ(x_t,t)预测噪声ε最小化E||ε−εθ(√ᾱ_t x_0+√(1−ᾱ_t)ε,t)||²。(3) Fast samplingT→T_infer折叠用用户自定义噪声日程{η_s}通过对齐噪声水平t_align把少量采样步映射到训练步显著减少合成时顺序步数如6步。(4) 去噪网络结构基于WaveNet思想的feed-forward、bidirectional dilated conv残差堆叠+skip连接因非自回归可用双向扩张卷积提升感受野利用。(5) 条件生成机制本地条件mel spectrogram经转置卷积上采样并Conv1×1映射后作为各残差层bias与全局条件label embedding经Conv1×1后作为bias。(6) 无条件生成的感受野策略通过反向扩散迭代使有效感受野扩大到T×r以覆盖长序列依赖。
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
(1) LJ Speech DatasetIto, 2017约24小时、22.05kHz、13100条女声语音用于neural vocodingmel spectrogram条件。(2) Speech Commands DatasetWarden, 2018中的SC09子集digits 0916kHz、1秒长度、训练31158条、2032说话人用于unconditional与class-conditional生成文中还用其官方提供的噪声类型做zero-shot denoising演示。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
(1) MOSMean Opinion Score5分制含95%CI人类主观语音质量评价用于声码器、无条件与类条件任务。(2) FID在ResNeXT特征空间拟合高斯衡量生成样本与训练集分布距离兼顾质量与多样性越低越好。(3) IS基于分类器输出偏好“可被清晰分类且整体多样”的样本越高越好。(4) mIS在IS基础上强调within-class diversity类内多样性越高越好。(5) AM Score考虑训练集边缘label分布与生成分布的一致性并加生成样本预测熵项缓解IS忽略先验分布的问题越低越好论文表2以↓标注。(6) NDB/KNumber of Statistically-Different BinsK-means分箱后比较生成与真实在各箱占比差异衡量多样性与模式覆盖越低越好。(7) Accuracy类条件任务用ResNeXT对生成样本分类的准确率衡量语音清晰度/类一致性(越高越好)。(8) FID-class类条件任务按每个digit分别算FID再取均值±方差衡量类内分布匹配越低越好。(9) 合成速度(如“× real-time”、或“500× slower”对比衡量推理效率特别对比WaveNet与DiffWave/Fast。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
神经声码器LJ Speech表1DiffWave LARGET=200MOS 4.44±0.07略高于WaveNet 4.43±0.10DiffWave BASE在T=40/50时MOS 4.35±0.10/4.38±0.08且参数仅2.64MFast采样下DiffWave BASE(Fast) MOS 4.37±0.07、DiffWave LARGE(Fast) MOS 4.42±0.09并报告合成可达5.6×/3.5×实时V100无工程优化而WaveNet无工程优化约500×慢于实时。无条件生成SC09表2DiffWave在MOS 3.39±0.32显著优于WaveNet-256 1.43±0.30与WaveGAN 2.03±0.33自动指标上DiffWave也最好FID 1.287优于WaveGAN 1.349、WaveNet-256 2.947IS 5.305、mIS 59.4、AM 0.636、NDB/K 0.74。类条件生成表3DiffWave MOS 3.50±0.31优于WaveNet-256 1.58±0.36Accuracy 91.20%WaveNet-256为60.70%FID-class 1.113±0.569WaveNet-256为6.954±2.114mIS 117.4远高于WaveNet。结论DiffWave用单一扩散目标实现接近/超过强声码器质量,并在无条件/类条件任务上显著提升质量与多样性,同时推理速度相对自回归大幅加速。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机:(1) WaveNet等自回归波形生成质量高但推理极慢(2) flow模型虽快但架构受可逆性/雅可比约束;(3) GAN/VAE常需辅助损失且存在训练不稳定、mode collapse/posterior collapse(4) 无条件波形生成长序列难,传统模型质量/多样性不足。
贡献点:(1) 提出DiffWave非自回归扩散概率模型使用ELBO变体噪声预测稳定训练统一支持neural vocoding、类条件与无条件生成。(2) 设计适配扩散去噪的bidirectional dilated conv残差网络与扩散步嵌入/条件注入方式。(3) 提出fast sampling将训练T步反向过程折叠为少量T_infer步显著加速合成且保持质量。(4) 大规模实证声码器MOS 4.44≈WaveNet且速度提升数量级在无条件/类条件上相对WaveNet与WaveGAN在MOS与FID/IS/mIS/AM/NDB等全面领先。
主要创新之处将DDPM的ε-parameterization系统化落地到raw audio并通过“非自回归去噪网络+噪声步对齐的快速采样”实现高质量与高效率的统一。

View File

@@ -0,0 +1,22 @@
Diffusion models (DDPM / score-based) for time-series and spatio-temporal modelling (the most direct support for "generating packet sequences with diffusion"):
Ho, Jain, Abbeel. Denoising Diffusion Probabilistic Models (DDPM). NeurIPS 2020.
Use: the basic form of diffusion models (forward noising, reverse denoising, noise-prediction training); the root theoretical citation for the diffusion part of our method section.
Song et al. Score-Based Generative Modeling through Stochastic Differential Equations. ICLR 2021.
Use: the more general score-based formulation; this line of work becomes useful if we later model continuous time (inter-arrival intervals / jitter).
Rasul et al. Autoregressive Denoising Diffusion Models for Multivariate Probabilistic Time Series Forecasting. ICML 2021.
Use: diffusion modelling of multivariate time series; corresponds to jointly generating our multiple (device, register) sequences.
Tashiro et al. CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation. NeurIPS 2021.
Use: conditional diffusion; its conditioning-injection scheme suits using device embeddings / register semantics / master-slave roles / process states as conditions to constrain generation.
Liu et al. PriSTI: A Conditional Diffusion Framework for Spatiotemporal Imputation. ICDE 2023.
Use: a spatio-temporal conditional diffusion framework; replace "space" with the (device, register) bipartite/heterogeneous graph and "time" with the polling/session position, and the structure maps over closely.
Wen et al. DiffSTG: Probabilistic Spatio-Temporal Graph Forecasting with Denoising Diffusion Models. ACM SIGSPATIAL 2023.
Use: diffusion + spatio-temporal graphs; when generating (rather than forecasting) on the (device, register) graph, its way of fusing graph features into the denoising network is worth borrowing.
Kong et al. DiffWave: A Versatile Diffusion Model for Audio Synthesis. ICLR 2021.
Use: one-dimensional signal generation (similar to "inter-arrival sequences" and "value sequences"); its WaveNet/UNet-style denoising backbone is also a useful reference for high-frequency industrial polling sequences.

View File

@@ -0,0 +1,9 @@
@misc{liu2023pristiconditionaldiffusionframework,
title={PriSTI: A Conditional Diffusion Framework for Spatiotemporal Imputation},
author={Mingzhe Liu and Han Huang and Hao Feng and Leilei Sun and Bowen Du and Yanjie Fu},
year={2023},
eprint={2302.09746},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2302.09746},
}

View File

@@ -0,0 +1,47 @@
# PriSTI: A Conditional Diffusion Framework for Spatiotemporal Imputation
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文面向真实场景的spatiotemporal data不完整问题sensor failures、transmission loss研究spatiotemporal imputation以利用spatiotemporal dependence补全缺失值。既有RNN式autoregressive插补易出现error accumulation且多为确定性输出扩散模型可从Gaussian noise生成并避免串行误差但在ST数据上“conditional information的构造与利用”困难。为此提出PriSTI先对观测做linear interpolation得到增强的conditional information X并用Conditional Feature Extraction提取含temporal attention、spatial attention与MPNN(结合geographic adjacency A)的global context prior H_pri再在Noise Estimation Module中用H_pri计算注意力权重Q,K来自H_priV来自含噪输入引导εθ(Ẋ_t,X,A,t)去噪。实验在AQI-36、METR-LA、PEMS-BAY多缺失模式下PriSTI在MAE/MSE/CRPS上均优于CSDI、GRIN等并在高缺失率与sensor failure场景保持优势结论是“先验增强的条件特征+地理关系”能显著提升扩散式时空插补质量与稳健性。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
Abstract—Spatiotemporal data mining plays an important role in air quality monitoring, crowd flow modeling, and climate forecasting. However, the originally collected spatiotemporal data in real-world scenarios is usually incomplete due to sensor failures or transmission loss. Spatiotemporal imputation aims to fill the missing values according to the observed values and the underlying spatiotemporal dependence of them. The previous dominant models impute missing values autoregressively and suffer from the problem of error accumulation. As emerging powerful generative models, the diffusion probabilistic models can be adopted to impute missing values conditioned by observations and avoid inferring missing values from inaccurate historical imputation. However, the construction and utilization of conditional information are inevitable challenges when applying diffusion models to spatiotemporal imputation. To address above issues, we propose a conditional diffusion framework for spatiotemporal imputation with enhanced prior modeling, named PriSTI. Our proposed framework provides a conditional feature extraction module first to extract the coarse yet effective spatiotemporal dependencies from conditional information as the global context prior. Then, a noise estimation module transforms random noise to realistic values, with the spatiotemporal attention weights calculated by the conditional feature, as well as the consideration of geographic relationships. PriSTI outperforms existing imputation methods in various missing patterns of different real-world spatiotemporal data, and effectively handles scenarios such as high missing rates and sensor failure. The implementation code is available at https://github.com/LMZZML/PriSTI.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Mingzhe Liu, Han Huang, Hao Feng, Leilei Sun, Bowen Du, Yanjie Fu
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
未在提供的论文内容中给出具体会议或期刊名称。
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
核心问题是将conditional diffusion model用于spatiotemporal imputation时如何“构造并利用”条件信息使模型既能避免RNN自回归插补的error accumulation又能有效学习复杂的时空依赖temporal dependencies、spatial global correlations、geographic relationships A。论文指出CSDI类方法常把observed values与noisy targets直接拼接输入仅靠mask区分导致序列趋势在高噪声步t≈T时不一致、增加学习难度且未充分利用地理邻接与跨节点相关。PriSTI通过“先插值增强条件信息→提取全局先验H_pri→用H_pri主导注意力权重、减少噪声对依赖学习的干扰”来解决。简洁概述解决扩散式时空插补中“条件信息噪声大、时空依赖难学、地理关系难融入”的关键瓶颈。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
(1) PriSTI条件扩散插补框架把缺失部分作为生成目标反向扩散pθ(Ẋ_{t-1}|Ẋ_t,X,A,t)从噪声逐步生成缺失值。(2) 线性插值增强条件信息Interpolate对每个节点时间序列做linear interpolation得到更完整、趋势一致的条件输入X。(3) Conditional Feature Extraction Moduleγ在仅含插值信息H=Conv(X)上用temporal attention + spatial attention + MPNN(A)提取global context prior H_pri。(4) Noise Estimation Module对混合输入H_in=Conv(X||Ẋ_t)学习时空依赖但用H_pri计算注意力权重Q,K来自H_priV来自H_in或H_tem以减轻噪声干扰并结合MPNN与空间注意力。(5) 空间注意力降复杂度将N个节点映射到k个virtual nodesk<N把空间注意力复杂度从O(N^2 d)降到O(N k d)。(6) 掩码训练策略point / block / hybrid含历史缺失模式在训练中随机“擦除观测”构造自监督插补目标。
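A minimal sketch of the prior-guided attention idea summarized above, assuming toy dimensions and random weights (not the paper's code): Q and K are computed from the conditional prior H_pri, V from the noisy mixed input, so the attention weights come from noise-free features.

```python
# Toy prior-guided attention: weights from the prior, values from noisy input.
import numpy as np

rng = np.random.default_rng(0)
L, d = 6, 4                                   # sequence length, feature dim
H_pri = rng.normal(size=(L, d))               # global context prior (from interpolated condition)
H_in  = rng.normal(size=(L, d))               # noisy mixed input Conv(X || X_t)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = H_pri @ Wq, H_pri @ Wk, H_in @ Wv   # Q, K from prior; V from noisy input
scores = Q @ K.T / np.sqrt(d)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)      # row-wise softmax
out = attn @ V
print(out.shape)                              # (L, d)
```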
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
(1) AQI-36air quality北京36站PM2.5小时数据12个月来源Yi et al., “ST-MVL” (IJCAI 2016)[2]。(2) METR-LAtraffic speed洛杉矶高速207传感器、5分钟采样来源Li et al., DCRNN (ICLR 2018)[3]并沿用其数据设置。(3) PEMS-BAYtraffic speed湾区325传感器、5分钟采样同来源[3]三者的geographic adjacency A按站点/传感器距离用thresholded Gaussian kernel构建。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
(1) MAE确定性插补误差的平均绝对偏差衡量点估计准确性越小越好。(2) MSE确定性插补误差的均方误差对大误差更敏感越小越好。(3) CRPS对概率插补分布D与真实值x的兼容性打分量化不确定性质量越小越好论文用100次采样近似分布并以0.05分位离散近似积分。(4) RMSE下游预测表V作为MSE的平方根用于评估imputation后对Graph WaveNet预测性能的影响越小越好
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
总体插补表III/IVPriSTI在所有数据集与缺失模式下优于CSDI与GRIN等。示例AQI-36 simulated failure(24.6%)上PriSTI MAE=9.03±0.07、MSE=310.39±7.03优于CSDI MAE=9.51±0.10、MSE=352.46±7.50对应CRPS PriSTI=0.0997优于CSDI=0.1056。METR-LA block-missing(16.6%)上PriSTI MAE=1.86±0.00、MSE=10.70±0.02优于CSDI 1.98±0.00、12.62±0.60CRPS PriSTI=0.0244优于CSDI=0.0260。PEMS-BAY block-missing(9.2%)上PriSTI MAE=0.78±0.00、MSE=3.31±0.01优于CSDI 0.86±0.00、4.39±0.02CRPS PriSTI=0.0093优于CSDI=0.0127。高缺失率图5在METR-LA缺失率90%时PriSTI相对其他方法MAE提升4.67%34.11%(block)与3.89%43.99%(point)。传感器完全失效RQ5AQI-36上两站点全程无观测时PriSTI MAE分别为10.23与15.20且优于GRIN。下游预测表V对AQI-36插补后用Graph WaveNet预测PriSTI使MAE从Ori. 36.97降至29.34RMSE从60.37降至45.08优于CSDI/GRIN/BRITS。结论PriSTI通过“条件先验+地理关系”在确定性与概率插补、极端稀疏与失效场景及下游任务上均更稳健。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机:(1) 时空数据缺失普遍且影响监测/交通等应用;(2) RNN自回归插补存在error accumulation且难输出不确定性(3) 扩散模型虽能做条件生成但在时空插补中“条件信息如何构造、如何避免噪声破坏依赖学习、如何融入地理关系A”是关键难题。
贡献点:(1) 提出PriSTI面向spatiotemporal imputation的conditional diffusion framework引入spatiotemporal global correlations与geographic relationships。(2) 提出Conditional Feature Extraction在插值后的条件信息上提取H_pri作为global context prior降低在高噪声下学习ST依赖的难度。(3) 提出Noise Estimation Module的“prior-guided attention”用H_pri产生注意力权重Q,K来自H_pri来引导对含噪输入的去噪从而缓解“观测+噪声拼接”带来的不一致与干扰。(4) 提出virtual nodes的空间注意力降复杂度方案O(N^2 d)→O(N k d)。
主要创新之处:把“先验条件特征(由插值与时空/地理编码提取)”显式用于扩散去噪阶段的注意力权重计算,实现更强的时空依赖建模与更鲁棒的扩散式插补。

View File

@@ -0,0 +1,9 @@
@misc{song2021scorebasedgenerativemodelingstochastic,
title={Score-Based Generative Modeling through Stochastic Differential Equations},
author={Yang Song and Jascha Sohl-Dickstein and Diederik P. Kingma and Abhishek Kumar and Stefano Ermon and Ben Poole},
year={2021},
eprint={2011.13456},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2011.13456},
}

View File

@@ -0,0 +1,45 @@
# Score-Based Generative Modeling through Stochastic Differential Equations
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文研究score-based generative models中“由噪声生成数据”的统一连续时间视角将数据分布通过forward SDE逐步加噪映射到已知prior并由只依赖于time-dependent score ∇x log p_t(x) 的reverse-time SDE反向去噪生成样本。研究目的在于用SDE框架统一SMLD与DDPM视为VE/VP SDE的离散化并获得更灵活的采样、似然计算与可控生成能力。方法上训练time-dependent score network sθ(x,t) 以连续版denoising score matching估计score采样用通用SDE solver并提出Predictor-Corrector(PC)将数值SDE预测与Langevin/HMC校正结合进一步推导probability flow ODEneural ODE可在相同边缘分布下进行确定性采样并用instantaneous change of variables实现exact likelihood。实验在CIFAR-10上达成IS=9.89、FID=2.20与2.99 bits/dim并首次展示1024×1024 CelebA-HQ高保真生成同时在class-conditional、inpainting与colorization等逆问题上验证可控生成。结论是SDE统一框架带来新采样器、精确似然与更强条件生成能力。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the reverse-time SDE depends only on the time-dependent gradient field (a.k.a., score) of the perturbed data distribution. By leveraging advances in score-based generative modeling, we can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. We show that this framework encapsulates previous approaches in score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities. In particular, we introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE. We also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency. In addition, we provide a new way to solve inverse problems with score-based models, as demonstrated with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, we achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 2.99 bits/dim, and demonstrate high fidelity generation of 1024×1024 images for the first time from a score-based generative model.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, Ben Poole
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
ICLR 2021
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
论文要解决的核心问题是如何把“逐步加噪—逐步去噪”的score-based/diffusion生成模型提升为一个统一、连续时间、可分析且可扩展的生成框架使得(1)不同方法SMLD、DDPM在同一理论下刻画(2)采样不再局限于特定离散更新规则而能用通用数值求解器并获得更好效率/质量,(3)在同一模型下实现exact likelihood计算与latent可逆映射(4)在无需重新训练条件模型的前提下处理class-conditional、inpainting、colorization等inverse problems。其关键技术瓶颈在于reverse dynamics只需要score ∇x log p_t(x)但该score必须对连续t准确估计并且离散化/数值误差会累积影响采样质量。简洁概述用SDE把score/diffusion统一起来并解决“怎么更好采样、怎么算精确似然、怎么做可控/逆问题生成”。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
(1) SDE统一框架用forward SDE dx=f(x,t)dt+g(t)dw把数据分布扩散到prior再用reverse-time SDE dx=[f−g²∇x log p_t(x)]dt+g dŵ从prior生成数据。(2) 连续时间score学习训练time-dependent score network sθ(x,t) 通过连续版denoising score matching目标(式(7))逼近∇x log p_t(x)。(3) VE/VP/sub-VP SDE把SMLD对应为Variance Exploding SDE把DDPM对应为Variance Preserving SDE并提出sub-VP SDE方差被VP上界约束以提升likelihood表现。(4) Reverse diffusion sampler按“与forward同型”的离散化直接构造reverse-time SDE的数值采样器避免为新SDE推导复杂ancestral规则。(5) Predictor-Corrector(PC) samplerpredictor用数值SDE solver推进一步corrector用score-based MCMC如Langevin/HMC在每个时间点校正边缘分布降低离散误差。(6) Probability flow ODE推导与SDE共享同一组边缘分布p_t的确定性ODE允许黑盒ODE solver自适应步长快速采样并支持latent可逆映射。(7) Exact likelihood对probability flow ODE使用instantaneous change of variables计算log p0(x)并用Skilling-Hutchinson trace estimator高效估计散度。(8) Controllable generation / inverse problems通过conditional reverse-time SDE在unconditional score基础上加入∇x log p_t(y|x)实现类别条件、inpainting、colorization等。
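For reference, a compact restatement of the three core equations summarized above, in the standard notation (this is the usual textbook form, not copied from the paper's layout):

```latex
% Forward SDE, reverse-time SDE, and probability-flow ODE (standard forms).
\begin{aligned}
\mathrm{d}\mathbf{x} &= \mathbf{f}(\mathbf{x},t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}
  &&\text{(forward SDE)}\\
\mathrm{d}\mathbf{x} &= \bigl[\mathbf{f}(\mathbf{x},t) - g(t)^{2}\nabla_{\mathbf{x}}\log p_t(\mathbf{x})\bigr]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}
  &&\text{(reverse-time SDE)}\\
\frac{\mathrm{d}\mathbf{x}}{\mathrm{d}t} &= \mathbf{f}(\mathbf{x},t) - \tfrac{1}{2}\,g(t)^{2}\nabla_{\mathbf{x}}\log p_t(\mathbf{x})
  &&\text{(probability-flow ODE)}
\end{aligned}
```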
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
(1) CIFAR-10Krizhevsky et al., 2009用于无条件生成、FID/IS、bits/dim与采样器对比。(2) LSUNbedroom、church outdoorYu et al., 2015用于PC采样对比、inpainting与colorization示例文中展示256×256。(3) CelebA64×64Liu et al., 2015用于架构探索中的VE设置对比。(4) CelebA-HQ1024×1024Karras et al., 2018用于首次展示score-based模型的1024×1024高分辨率生成
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
(1) FID衡量生成分布与真实分布在特征空间的距离综合质量与多样性越低越好用于CIFAR-10/LSUN等样本质量比较。(2) Inception Score (IS)衡量样本可辨识度与多样性越高越好用于CIFAR-10无条件生成。(3) NLL / bits/dim负对数似然的bit-per-dimension度量越低越好论文通过probability flow ODE给出“exact likelihood”用于CIFAR-10密度评估。(4) 采样计算量指标score function evaluations / NFE函数评估次数与solver步数如P1000/P2000/PC1000用于衡量采样效率与质量权衡。(5)任务性展示class-conditional/inpainting/colorization主要以可视化结果展示效果图4等不使用单一数值指标汇总。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
无条件CIFAR-10样本质量最佳NCSN++ cont. (deep, VE)达到IS=9.89、FID=2.20摘要与表3“Sample quality”。采样器对比表1在CIFAR-10上PC采样通常优于仅predictor或仅corrector例如在VP SDE(DDPM)侧reverse diffusion的P1000 FID≈3.21±0.02加入corrector的PC1000可到≈3.18±0.01在VE侧reverse diffusion P1000≈4.79±0.07PC1000≈3.21±0.02显示PC显著改善离散误差带来的质量损失。似然表2通过probability flow ODE实现exact likelihoodDDPM++ cont. (deep, sub-VP)达到2.99 bits/dim同时FID≈2.92并且sub-VP在同架构下通常比VP获得更好的bits/dim如DDPM cont.: VP 3.21 vs sub-VP 3.05。高分辨率生成在CelebA-HQ上首次展示1024×1024高保真样本图12与文中描述。结论SDE框架在采样PC/ODE、似然exact bits/dim与能力高分辨率/逆问题上同时带来提升并统一解释SMLD与DDPM。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机现有SMLD与DDPM都依赖“多噪声尺度的逐步扰动/去扰动”,但彼此形式分裂、采样规则受限、离散误差影响大,且难以在同一框架下获得精确似然、灵活采样与统一的条件/逆问题生成。
贡献点与创新:(1) 提出score-based生成的SDE统一框架forward SDE定义连续噪声扩散reverse-time SDE仅依赖score实现生成并把SMLD/DDPM解释为VE/VP SDE离散化。(2) 提出PC采样框架把数值SDE求解predictor与score-based MCMC校正corrector组合系统性提升采样质量。(3) 推导probability flow ODE与SDE共享边缘分布的确定性过程支持黑盒ODE自适应采样、latent可逆操控与“exact likelihood computation”。(4) 提出sub-VP SDE并在likelihood上取得2.99 bits/dim记录均匀dequantized CIFAR-10。(5) 提供无需重训的controllable generation/inverse problems方案类条件、inpainting、colorization并结合架构改进实现CIFAR-10记录级FID/IS与首次1024×1024 CelebA-HQ生成。

View File

@@ -0,0 +1,9 @@
@misc{li2022diffusionlmimprovescontrollabletext,
title={Diffusion-LM Improves Controllable Text Generation},
author={Xiang Lisa Li and John Thickstun and Ishaan Gulrajani and Percy Liang and Tatsunori B. Hashimoto},
year={2022},
eprint={2205.14217},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2205.14217},
}

View File

@@ -0,0 +1,63 @@
# **Diffusion-LM Improves Controllable Text Generation**
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文关注“在不重新训练语言模型的前提下实现可控文本生成”这一开放问题现有plug-and-play方法基于autoregressive LM较难实现复杂、细粒度控制如syntax tree/ spans且难以组合多个控制。研究目的在于构建一种更易被梯度引导的生成模型以支持结构化与全局约束控制。方法上提出Diffusion-LM基于continuous diffusion的非自回归LM从高斯噪声序列迭代去噪到word vectors产生层级连续潜变量x_T…x_0为离散文本引入embedding与rounding并提出端到端训练目标、x_0-parameterization与clamping trick以减少rounding error控制时在每个扩散步对连续潜变量做多步梯度更新优化λ·log p(x_{t-1}|x_t)+log p(c|x_{t-1})以平衡fluency与control。实验在E2E与ROCStories上覆盖6类控制任务Diffusion-LM在多项细粒度控制上显著优于PPLM/FUDGE并在syntax tree与spans控制上超过fine-tuning oracle在infilling上优于COLD/DELOREAN且接近专训AR模型。结论是连续扩散的层级潜变量为复杂可控生成提供了更强的可操控性但代价是解码更慢、困惑度更高。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
Controlling the behavior of language models (LMs) without re-training is a major open problem in natural language generation. While recent works have demonstrated successes on controlling simple sentence attributes (e.g., sentiment), there has been little progress on complex, fine-grained controls (e.g., syntactic structure). To address this challenge, we develop a new non-autoregressive language model based on continuous diffusions that we call Diffusion-LM. Building upon the recent successes of diffusion models in continuous domains, Diffusion-LM iteratively denoises a sequence of Gaussian vectors into word vectors, yielding a sequence of intermediate latent variables. The continuous, hierarchical nature of these intermediate variables enables a simple gradient-based algorithm to perform complex, controllable generation tasks. We demonstrate successful control of Diffusion-LM for six challenging fine-grained control tasks, significantly outperforming prior work.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, Tatsunori B. Hashimoto
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
Preprint. Under review.
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
论文核心要解决的是如何在“冻结语言模型参数without re-training”的plug-and-play设定下实现对生成文本的复杂、细粒度、全局结构化控制并且能够组合多个控制信号。作者指出基于autoregressive LM的控制如PPLM、FUDGE、GeDi/DExperts类思想存在结构性限制自回归的左到右生成一旦早期出错难以修复且很多控制parse tree、span类别、长度规划、infilling需要对全句全局属性或右侧上下文进行前瞻规划前缀判别器/逐token引导在结构约束下会出现“前缀一错、后续全盘崩”的问题。Diffusion-LM通过连续扩散产生一系列从粗到细的潜变量x_T…x_0使控制信号可以在不同噪声层级上对“整句”施加可微约束并允许在每个扩散步对潜变量做梯度更新从而更容易满足复杂控制同时保持流畅性。简洁概述把可控生成从“逐token操控离散输出”转为“在连续扩散潜空间上做全局梯度引导的去噪采样”以支持细粒度句法/语义/跨度/长度/填空等控制与组合。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
1) Diffusion-LMcontinuous diffusion language model将文本序列表示为连续word vectors序列从高斯噪声x_T迭代去噪生成x_0再rounding到离散词序列w实现非自回归生成
2) End-to-end embedding learning§4.1在前向过程中加入q_φ(x_0|w)=N(EMB(w),σ_0I)并在反向加入p_θ(w|x_0)的rounding项联合学习扩散模型与词向量嵌入L_e2e_simple / L_e2e_vlb
3) x_0-parameterization objective§4.2改用网络直接预测x_0f_θ(x_t,t)≈x_0让每个扩散步都显式约束x_0提高“落到词嵌入点上”的能力、减少rounding error。
4) Clamping trick§4.2在采样时把预测的x_0或中间估计Clamp到最近词嵌入再继续去噪强制中间步骤更“词化”进一步降低rounding误差。
5) Gradient-based classifier-guided control on latents§5.1在每个扩散步对x_{t-1}做多步梯度更新优化λ·log p(x_{t-1}|x_t)+log p(c|x_{t-1})用可微classifier提供控制梯度、扩散模型提供流畅性正则。
6) Multiple controls composition§7.2对多个控制项直接求和其log-prob梯度在同一潜空间联合满足多个约束。
7) Minimum Bayes Risk (MBR) decoding§5.2):对条件生成/填空等场景从多次采样集合S中选取最小期望风险的输出如基于负BLEU的风险提升单样本质量。
8) sqrt noise scheduleAppendix A提出更适合文本的噪声日程使低噪声阶段不至于过“容易”提升稳健性。
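A minimal sketch of the clamping step described in the list above, with a toy vocabulary and a random embedding table assumed (not the paper's code): each predicted x0 vector is snapped to its nearest word embedding before continuing the reverse diffusion.

```python
# Toy clamping trick: project a predicted x0 onto the nearest word embeddings.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 10, 4
E = rng.normal(size=(vocab_size, d))          # toy embedding table EMB(w)

def clamp(x0_pred):
    """Replace each position's vector with the closest row of E."""
    dists = ((x0_pred[:, None, :] - E[None, :, :]) ** 2).sum(-1)   # (seq, vocab)
    nearest = dists.argmin(axis=1)
    return E[nearest], nearest

x0_pred = rng.normal(size=(5, d))             # stands in for f_theta(x_t, t)
x0_clamped, word_ids = clamp(x0_pred)
print(word_ids)                               # indices of the snapped words
```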
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
训练数据集E2EE2E NLG datasetNovikova et al. 2017[28]ROCStoriesMostafazadeh et al. 2016[26])。控制/评测相关数据Infilling任务的左右上下文来自aNLGAbductive Natural Language GenerationBhagavatula et al. 2020[2]文中写“from the aNLG dataset”。文中还提到用于数据增强的ROCStories (+GPT-J)为“微调GPT-J后生成的合成ROCStories样本”用于额外对比似然/训练规模实验)。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
1) ctrlcontrol success成功率/得分越高越好不同控制任务有不同定义——语义内容用value exact match成功率POS用词级别的POS序列exact matchSyntax Tree用外部parser解析后与目标树的F1Syntax Spans用目标span类别匹配比例Length用生成长度在目标±2内的比例
2) lm-score记作lm越低越好把生成文本喂给teacher LM文中为fine-tuned GPT-2计算perplexity用于衡量流畅性/样本质量。
3) NLL / log-likelihood boundnats per token越低越好报告Diffusion-LM的变分界L_e2e_vlb对应的NLL上界/下界比较用于衡量语言建模拟合度文中指出Diffusion-LM似然弱于同规模AR Transformer
4) Infilling自动指标BLEU-4、ROUGE-L、CIDEr、BERTScore越高越好衡量生成句与参考的n-gram重叠、序列相似与语义相似。
5) Infilling人工评估Genie leaderboard的人类评分表中以均值及置信区间/误差形式呈现),衡量整体可读性与合理连接左右上下文的质量。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
五个classifier-guided控制任务Table 2ctrl↑/lm↓
(1) Semantic ContentDiffusion-LM 81.2 / 2.55优于PPLM 9.9 / 5.32与FUDGE 69.9 / 2.83低于FT-search 89.9 / 1.78但接近或高于FT-sample 72.5 / 2.87
(2) Parts-of-speechDiffusion-LM 90.0 / 5.16优于FUDGE 27.0 / 7.96略低于FT-search 93.0 / 3.31、接近FT-sample 89.5 / 4.72
(3) Syntax TreeDiffusion-LM 86.0 / 3.71显著优于FUDGE 17.9 / 3.39并超过FT-search 76.4 / 3.24与FT-sample 64.8 / 5.72
(4) Syntax SpansDiffusion-LM 93.8 / 2.53优于FUDGE 54.2 / 4.03并超过FT-search 54.4 / 2.19与FT-sample 26.3 / 2.88
(5) LengthDiffusion-LM 99.9 / 2.16优于FUDGE 46.9 / 3.11接近FT-search 100.0 / 1.83且优于FT-sample 98.1 / 3.84。
组合控制Table 4Semantic+Syntax Tree时Diffusion-LM语义/句法成功率69.8/74.8lm 5.92显著高于FUDGE 61.7/15.4Semantic+POS时Diffusion-LM语义/ POS成功率63.7/69.1lm 3.46优于FUDGE 64.5/24.1。InfillingTable 5Diffusion-LM BLEU-4 7.1、ROUGE-L 28.3、CIDEr 30.7、BERTScore 89.0、人评0.37+0.03/-0.02显著高于DELOREAN与COLDBLEU-4约1.61.8并与专训AR-infilling接近AR BLEU-4 6.7、ROUGE-L 27.0、CIDEr 26.9、BERTScore 89.0、人评0.39。同时语言建模似然上Diffusion-LM弱于AR TransformerE2E NLL 2.28 vs 1.77ROCStories 3.88 vs 3.05§7表明“更差似然但更强可控性”。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机1) 真实应用需要可控生成但为每个控制任务fine-tune成本高且难组合多个控制2) 现有plug-and-play控制主要基于autoregressive LM易累积错误、难做全局结构控制与右侧约束如parse tree、span、长度规划、infilling3) diffusion在连续域成功但离散文本缺少有效的continuous diffusion建模与可控机制。
贡献点1) 提出Diffusion-LM首批系统探索“continuous diffusion用于文本”的非自回归语言模型通过逐步去噪生成word vectors并得到层级连续潜变量。2) 为离散文本适配扩散提出端到端学习embedding与rounding的训练目标L_e2e_*并用x_0-parameterization与clamping trick显著减少rounding errors、提升样本质量。3) 提出面向控制的潜空间梯度引导算法在每个扩散步对连续潜变量做多步梯度更新并加入fluency regularization λ·log p(x_{t-1}|x_t)实现复杂结构控制与控制组合。4) 系统实验验证在6类细粒度控制任务上显著超越PPLM/FUDGE并在句法树/跨度控制上超过fine-tuning oracle在infilling上优于COLD/DELOREAN且接近专训AR模型。
主要创新之处把“可控生成的优化变量”从离散token/AR隐藏状态迁移到“扩散产生的层级连续潜变量x_0:T”使控制信号可以在粗到细的全句表示上施加可微约束从而天然支持全局结构控制与组合同时通过x_0参数化与clamping把连续扩散与离散词空间可靠对齐。

View File

@@ -0,0 +1,17 @@
Discrete / mixed-variable diffusion (for the discrete Modbus fields such as function codes and addresses)
Ours is "semantic-level generation", covering at least:
Discrete: Function Code, optional exception code, register address / address cluster, read/write length
Continuous / integer: register values, inter-arrival times
This kind of "mixed-type generation" usually needs to cite discrete diffusion or an embedding trick.
Austin et al. Structured Denoising Diffusion Models in Discrete State-Spaces (D3PM). NeurIPS 2021.
Use: diffusion in discrete state spaces; function codes / exception codes / address clusters (as tokens) can be generated directly with D3PM-style diffusion.
Li et al. Diffusion-LM Improves Controllable Text Generation. NeurIPS 2022.
Use: text is discrete-sequence generation; the "PDU field sequence / behaviour sequence" can be treated as a sentence, and its discussion of controllable generation supports the need for protocol/scenario constraints.
Common engineering practice (see the sketch after this list):
Discrete fields: D3PM / multinomial diffusion, or embed into a continuous space first and then run Gaussian diffusion;
Continuous fields: standard DDPM;
Finally use a deterministic assembler (pymodbus/scapy) to build the MBAP/PDU, so protocol validity stays close to 100%.
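A minimal sketch of that deterministic-assembler step, assuming the field values (function code, address, quantity, unit id) have already been sampled by the generative model; it packs a standard Modbus/TCP MBAP header plus an FC3 read-request PDU with `struct` rather than pymodbus/scapy.

```python
# Toy deterministic assembler: sampled semantic fields -> valid Modbus/TCP frame.
import struct

def assemble_read_request(tx_id: int, unit_id: int, func_code: int,
                          start_addr: int, quantity: int) -> bytes:
    pdu = struct.pack(">BHH", func_code, start_addr, quantity)    # FC, starting address, count
    mbap = struct.pack(">HHHB", tx_id, 0, len(pdu) + 1, unit_id)  # tx id, protocol=0, length, unit id
    return mbap + pdu

# Example with hypothetical sampled field values:
frame = assemble_read_request(tx_id=1, unit_id=17, func_code=3,
                              start_addr=0x0010, quantity=8)
print(frame.hex())
```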

View File

@@ -0,0 +1,9 @@
@misc{austin2023structureddenoisingdiffusionmodels,
title={Structured Denoising Diffusion Models in Discrete State-Spaces},
author={Jacob Austin and Daniel D. Johnson and Jonathan Ho and Daniel Tarlow and Rianne van den Berg},
year={2023},
eprint={2107.03006},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2107.03006},
}

View File

@@ -0,0 +1,61 @@
# Structured Denoising Diffusion Models in Discrete State-Spaces
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
Denoising diffusion probabilistic modelsDDPMs在连续空间表现突出但离散数据上的扩散模型仍难以在大规模文本与图像生成上达到竞争力。论文提出Discrete Denoising Diffusion Probabilistic ModelsD3PMs目标是通过“结构化的离散前向腐蚀过程”提升离散扩散的建模效果并分析transition matrix与noise schedule对性能的关键影响。方法上作者将离散扩散统一为基于Markov transition matrices Q_t 的类别扩散框架提出多种结构化Q_tdiscretized Gaussian模拟连续高斯核、偏向相似状态、embedding距离/nearest-neighbor在嵌入空间定义局部性、以及absorbing state[MASK]吸收态并连接BERT/AR/MLM。同时提出x0-parameterization的反向过程与新损失L_λ = L_vb + λ·CE(x0|x_t)并给出基于mutual information的噪声日程。实验显示文本上D3PM absorbing在text8取得≤1.45 bits/char1000步且在LM1B大词表上困惑度76.91000步图像上CIFAR-10的D3PM Gauss+logistic达到IS 8.56、FID 7.34、NLL≤3.435 bits/dim接近/超过连续DDPM的对数似然并改善样本质量。结论是离散扩散中Q_t设计、辅助损失与MI日程显著提升性能并揭示其与BERT/自回归/MLM的统一视角。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
Denoising diffusion probabilistic models (DDPMs) [19] have shown impressive results on image and waveform generation in continuous state spaces. Here, we introduce Discrete Denoising Diffusion Probabilistic Models (D3PMs), diffusion-like generative models for discrete data that generalize the multinomial diffusion model of Hoogeboom et al. [20], by going beyond corruption processes with uniform transition probabilities. This includes corruption with transition matrices that mimic Gaussian kernels in continuous space, matrices based on nearest neighbors in embedding space, and matrices that introduce absorbing states. The third allows us to draw a connection between diffusion models and autoregressive and mask-based generative models. We show that the choice of transition matrix is an important design decision that leads to improved results in image and text domains. We also introduce a new loss function that combines the variational lower bound with an auxiliary cross entropy loss. For text, this model class achieves strong results on character-level text generation while scaling to large vocabularies on LM1B. On the image dataset CIFAR-10, our models approach the sample quality and exceed the log-likelihood of the continuous-space DDPM model.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, Rianne van den Berg
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
35th Conference on Neural Information Processing Systems (NeurIPS 2021).
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
论文核心问题是如何把扩散模型从连续高斯噪声的DDPM有效推广到“离散状态空间”的文本与量化图像并让离散扩散在样本质量、对数似然、训练稳定性与可扩展性大词表、大序列上达到可竞争水平。此前离散扩散如multinomial diffusion多使用“uniform transition probabilities”的腐蚀过程导致前向噪声缺少结构对图像这种有序/局部性的离散值量化像素无法“优先扩散到相近值”对文本这种离散符号也难注入相似性或mask结构从而使反向去噪学习更难、效果受限。作者提出应把“前向Markov转移矩阵Q_t”视为离散扩散的关键设计自由度通过结构化Q_t如离散高斯核、嵌入邻近、吸收态mask把领域先验嵌入腐蚀过程进而塑造更可学习的反向去噪同时提出辅助交叉熵损失与互信息噪声日程缓解训练不稳定与日程选择困难。简洁概述用结构化的离散腐蚀转移矩阵+更稳的训练目标/噪声日程使离散扩散在文本与图像上显著变强并统一连接到BERT/MLM与自回归模型。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
1) D3PMDiscrete Denoising Diffusion Probabilistic Models通用框架用类别变量的Markov前向过程q(x_t|x_{t-1})=Cat(x_{t-1}Q_t)与学习的反向过程p_θ(x_{t-1}|x_t)做离散扩散生成
2) Structured transition matrices Q_t将“腐蚀结构”编码进Q_t而非仅uniform。
3) D3PM-uniformQ_t=(1−β_t)I+β_t/K·11^T向任意类别均匀扩散Hoogeboom等的multinomial diffusion推广/实现)。
4) D3PM-absorbingmask diffusion引入吸收态[MASK]每步以β_t把token/像素变为[MASK]/灰像素建立与BERT/MLM/AR的联系。
5) Discretized Gaussian Q_tD3PM-Gauss对有序离散值量化像素用离散截断高斯核定义转移使更相近的值更易互转模仿连续高斯扩散的“局部性”。
6) Token embedding distance / Nearest-neighbor diffusionD3PM-NN在预训练embedding空间构kNN图构造对称rate matrix R并令Q_t=exp(α_t R),使腐蚀沿语义邻近扩散。
7) Mutual-information-based noise schedule选择α_t/β_t使I(x_t;x_0)按t/T线性衰减到0作为“结构化Q_t下”的统一噪声标尺在absorbing情形退化为β_t=(T−t+1)^{-1}。
8) x0-parameterization反向模型网络预测~p_θ(~x_0|x_t)再与q(x_{t-1},x_t|~x_0)组合得到p_θ(x_{t-1}|x_t)自动继承Q_t的稀疏模式并支持跳步推断。
9) Truncated discretized logistic parameterization图像对有序离散像素用截断离散logistic分布参数化~p_θ(~x_0|x_t)以增强ordinal inductive bias。
10) 新训练损失L_λ在ELBO的L_vb上叠加辅助交叉熵E[-log ~p_θ(x_0|x_t)],提升训练稳定性与样本质量(尤其图像)。
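A minimal numpy sketch of the absorbing-state corruption summarized above, with a toy vocabulary assumed: build Q_t, then draw one forward step q(x_t | x_{t-1}) = Cat(x_{t-1} Q_t).

```python
# Toy absorbing-state ([MASK]) transition matrix and one corruption step.
import numpy as np

rng = np.random.default_rng(0)
K = 5                                  # vocabulary size including [MASK]
MASK = K - 1                           # index of the absorbing state
beta_t = 0.1

Q_t = (1.0 - beta_t) * np.eye(K)
Q_t[:, MASK] += beta_t                 # every state may jump to [MASK]
Q_t[MASK] = 0.0
Q_t[MASK, MASK] = 1.0                  # [MASK] is absorbing (rows sum to 1)

x_prev = np.array([0, 2, 3, MASK])     # tokens at step t-1
probs = Q_t[x_prev]                    # categorical parameters per position
x_t = np.array([rng.choice(K, p=p) for p in probs])
print(x_t)
```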
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
文本text8Matt Mahoney text8 dataset[28]字符级27 token长度256切块评测LM1B / One Billion Word BenchmarkChelba et al. 2013[6]使用sentencepiece词表大小8192、序列长度128打包训练。图像CIFAR-10Krizhevsky et al. 2009[27])。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
1) NLLnegative log-likelihood文本用bits/chartext8或由困惑度等价衡量图像用bits per dimensionbpd。作用衡量生成模型对数据分布的拟合/压缩能力,越低越好。
2) PerplexityLM1B语言建模常用指标等价于指数化的平均负对数似然越低越好。
3) ISInception ScoreCIFAR-10衡量样本“可辨识度+多样性”的启发式指标,越高越好。
4) FIDFréchet Inception DistanceCIFAR-10比较生成样本与真实样本在Inception特征空间的分布距离越低越好更常用作样本质量指标。
5) Sample time采样时间衡量生成速度与inference steps相关越低越好。
6) Model steps / inference steps推断步数不是指标本身但作为效率维度报告用于展示质量-速度折中如1000/256/20步
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
text8Table 1bits/char↓采样时间↓1000步下D3PM absorbingmaskL_{λ=0.01}达到≤1.45±0.023.4±0.3s优于D3PM uniform≤1.61±0.02与D3PM NN≤1.59±0.03在256步下D3PM absorbing仍为≤1.47±0.030.598±0.002s在20步下D3PM absorbing为≤1.56±0.040.0785±0.0003s)。
总体结论mask/absorbing扩散在字符级最强并能在少步推断下保持较好NLL但仍弱于强自回归Transformer-XL1.08等。LM1BTable 2Perplexity↓D3PM absorbing在1000/128/64步分别为76.9±2.3 / 80.1±1.2 / 83.6±6.1显著优于uniform的137.9±2.1 / 139.2±1.2 / 145.0±1.2NN更差149.5±1.3等。说明离散扩散可扩展到8192大词表且absorbing最有效embedding邻近并不一定带来收益。CIFAR-10Table 3最佳D3PM Gauss+logisticL_{λ=0.001}达到IS 8.56±0.10、FID 7.34±0.19、NLL≤3.435±0.007相比连续DDPMHo et al.的DDPM L_simpleIS 9.46±0.11、FID 3.17、NLL≤3.75D3PM在NLL上更优、样本质量接近但FID仍落后顶级连续扩散且引入L_λ能显著提升离散模型如D3PM Gauss从FID 15.30→8.34IS 7.75→8.54)。
综合结论transition matrix选择是关键设计点absorbing对文本最有效、discretized Gaussian+ordinal logistic对图像最有效辅助CE损失与合适噪声日程能明显改善训练与样本质量。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机1) 连续DDPM在图像/音频成功,但离散数据(文本、量化图像)上缺少同等级别的扩散建模框架与经验,尤其在大词表/长序列与高质量图像上2) 既有离散扩散多采用uniform腐蚀无法利用离散数据的结构图像的有序/局部性、文本的mask与潜在语义相似性3) 需要更稳定的训练目标与更可解释的噪声日程来驱动离散扩散。
贡献点1) 提出并系统化D3PM框架用一般Q_t矩阵描述离散扩散推广multinomial diffusion允许注入结构化腐蚀。2) 设计多类structured transition matricesdiscretized Gaussian、embedding kNN/矩阵指数、absorbing state等并实证“Q_t选择显著影响文本与图像效果”。3) 建立概念连接证明BERT可视为“一步扩散”构造视角说明自回归模型与生成式MLM可被表述为特定离散扩散过程。4) 提出训练改进辅助交叉熵的混合损失L_λ提升训练与样本质量提出基于互信息I(x_t;x_0)的噪声日程适配一般结构化Q_t。5) 实证结果文本上在text8强于多数非自回归基线并在LM1B证明可扩展图像上CIFAR-10达到接近连续DDPM的样本质量并在对数似然上超过连续DDPM基线。
主要创新之处把离散扩散的关键自由度从“固定uniform腐蚀”提升为“可设计的结构化转移矩阵Q_t”并配套可扩展计算低秩/矩阵指数、互信息日程与L_λ训练使离散扩散在文本与量化图像上同时获得理论统一连接BERT/AR/MLM与显著性能提升。

View File

@@ -0,0 +1,9 @@
@misc{veličković2018graphattentionnetworks,
title={Graph Attention Networks},
author={Petar Veličković and Guillem Cucurull and Arantxa Casanova and Adriana Romero and Pietro Liò and Yoshua Bengio},
year={2018},
eprint={1710.10903},
archivePrefix={arXiv},
primaryClass={stat.ML},
url={https://arxiv.org/abs/1710.10903},
}

View File

@@ -0,0 +1,62 @@
# Graph Attention Networks
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
本文提出Graph Attention NetworksGATs针对谱域图卷积方法的计算昂贵、依赖拉普拉斯特征基、难以泛化到不同图结构等问题以及空间方法在可变邻域与权重共享上的挑战。研究目的在于通过masked self-attentional层使节点对其邻域特征进行自注意从而隐式分配不同邻居的重要性且无需昂贵矩阵运算或预先知道全局图结构实现对transductive与inductive任务的统一处理。方法包括多头注意力、邻域softmax归一化系数、共享线性变换与注意力机制支持并行化复杂度与GCN相当。主要结果在Cora、Citeseer、Pubmed三大引文网络transductive与PPI蛋白互作数据inductiveGAT达到或匹配state-of-the-art在PPI上显著优于GraphSAGE并优于同架构的常数注意版本。结论GAT高效、可解释、可扩展到未见图解决谱方法局限展示了注意力在图结构数据上的潜力。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
We present graph attention networks (GATs), novel neural network architectures that operate on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. By stacking layers in which nodes are able to attend over their neighborhoods' features, we enable (implicitly) specifying different weights to different nodes in a neighborhood, without requiring any kind of costly matrix operation (such as inversion) or depending on knowing the graph structure upfront. In this way, we address several key challenges of spectral-based graph neural networks simultaneously, and make our model readily applicable to inductive as well as transductive problems. Our GAT models have achieved or matched state-of-the-art results across four established transductive and inductive graph benchmarks: the Cora, Citeseer and Pubmed citation network datasets, as well as a protein-protein interaction dataset (wherein test graphs remain unseen during training).
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, Yoshua Bengio
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
ICLR 2018
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
核心问题如何在图结构数据上高效、可泛化地进行节点表示学习与分类同时克服谱方法对拉普拉斯特征基与固定图结构的依赖、昂贵的矩阵运算以及空间方法在处理可变大小邻域与统一权重共享的困难。简述GAT通过masked self-attention让每个节点对其邻域特征分配不同权重避免昂贵谱分解支持并行化与inductive设置从而提升性能与可扩展性。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
1) 图注意力层GAT layer共享线性变换W后对邻域执行自注意力a(Wh_i, Wh_j)用softmax归一化得到α_ij加权聚合邻居特征并非线性输出。
2) 多头注意力并行K个独立注意头隐藏层拼接以增强稳定性与表达力输出层平均以做分类。
3) 掩蔽注意masked attention仅在一阶邻域内计算注意系数注入图结构同时保持操作可并行。
4) 常数注意对照Const-GAT使用a(x,y)=1的恒定权重以对比注意机制带来的增益。
5) 稀疏实现与并行化策略:采用稀疏矩阵操作降低存储与时间成本(实现层面说明)。
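A minimal single-head sketch of the GAT layer summarized in the list above, with a toy graph and random parameters assumed (explicit loops are kept for clarity rather than speed):

```python
# Toy single-head GAT layer: shared W, LeakyReLU attention scores on edges,
# softmax over each neighbourhood (masked attention), weighted aggregation.
import numpy as np

rng = np.random.default_rng(0)
N, F_in, F_out = 4, 3, 2
H = rng.normal(size=(N, F_in))          # node features
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]])            # adjacency with self-loops
W = rng.normal(size=(F_in, F_out))      # shared linear transformation
a = rng.normal(size=(2 * F_out,))       # attention vector

Wh = H @ W
scores = np.full((N, N), -np.inf)       # -inf outside the neighbourhood
for i in range(N):
    for j in range(N):
        if A[i, j]:
            e = float(np.concatenate([Wh[i], Wh[j]]) @ a)
            scores[i, j] = e if e > 0 else 0.2 * e     # LeakyReLU(0.2)

alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
alpha = alpha / alpha.sum(axis=1, keepdims=True)        # neighbourhood softmax
H_out = np.tanh(alpha @ Wh)                             # weighted aggregation + nonlinearity
print(H_out.shape)                                      # (N, F_out)
```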
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
- Coracitation network节点为文档边为引用来源Sen et al., 2008并按Yang et al., 2016的transductive设置
- Citeseercitation network来源Sen et al., 2008设置同上
- Pubmedcitation network来源Sen et al., 2008设置同上
- PPIProtein-Protein Interaction多个组织的图来源Zitnik & Leskovec, 2017使用Hamilton et al., 2017提供的预处理数据
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
- 分类准确率accuracy用于Cora、Citeseer、Pubmed的节点分类性能度量反映预测正确的比例。
- 微平均F1分数micro-averaged F1用于PPI多标签节点分类综合精确率与召回率并在样本层面微平均衡量整体多标签预测质量。
- 额外报告标准差:展示多次运行的稳定性与方差。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
- Transductive100次运行均值±标准差CoraGAT 83.0±0.7%优于GCN 81.5%与MoNet 81.7%CiteseerGAT 72.5±0.7%优于GCN 70.3%PubmedGAT 79.0±0.3%匹配GCN 79.0%与优于多数基线。
- Inductive10次运行PPIGAT 0.973±0.002 micro-F1显著优于GraphSAGE最优0.768与Const-GAT 0.934±0.006。 结论GAT在三个引文网络上达到或超越SOTA在PPI上大幅领先证明了对整个邻域进行注意加权以及自注意机制带来的显著增益与泛化能力。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
- 动机解决谱方法对图拉普拉斯特征基的依赖与计算代价空间方法在可变邻域与权重共享的局限构建能在未见图上进行inductive推理的高效模型。
- 贡献点:
1. 提出图注意力层GAT在邻域内进行masked self-attention隐式分配不同邻居权重
2. 设计多头注意力用于稳定训练与提升表达力,输出层平均以适配分类;
3. 提供与GCN同量级的时间复杂度与并行化实现适用于transductive与inductive任务
4. 在Cora、Citeseer、Pubmed与PPI上达到或刷新SOTA显著优于GraphSAGE与常数注意对照。
- 主要创新:
- 将自注意力机制引入图邻域聚合使用节点特征计算相似度并softmax归一化的掩蔽注意
- 多头图注意结构的层级堆叠与输出层平均策略;
- 不依赖全局图结构即可进行学习与推理支持完全未见测试图的inductive设置。

View File

@@ -0,0 +1,9 @@
@misc{hou2022graphmaeselfsupervisedmaskedgraph,
title={GraphMAE: Self-Supervised Masked Graph Autoencoders},
author={Zhenyu Hou and Xiao Liu and Yukuo Cen and Yuxiao Dong and Hongxia Yang and Chunjie Wang and Jie Tang},
year={2022},
eprint={2205.10803},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2205.10803},
}

View File

@@ -0,0 +1,58 @@
# **GraphMAE: Self-Supervised Masked Graph Autoencoders**
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文关注图自监督学习中“contrastive learning主导、generative SSL/graph autoencoders(GAEs)未发挥潜力”的现状指出GAEs受四类问题制约过度强调structure reconstruction、无corruption的feature reconstruction易学到trivial solution、MSE对特征范数/维度敏感导致训练不稳、MLP decoder表达力不足。研究目的在于系统缓解这些缺陷使GAE在node/graph classification与transfer learning上可与甚至超越SOTA对比学习方法。方法上提出GraphMAE采用masked feature reconstruction作为目标而非重建结构用[MASK]做输入特征遮蔽并在解码前re-mask[DMASK]引入更具表达力的GNN decoder并以scaled cosine error(SCE)替代MSE以提升稳定性与对hard samples的选择性。实验覆盖21个公开数据集、三类任务GraphMAE在节点分类上如Cora 84.2%、PubMed 81.1%、Reddit 96.0%图分类与分子迁移学习也整体优于或匹配SOTA基线。结论是经过关键设计的简单masked GAE可稳定释放generative SSL在图上的潜力。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
Self-supervised learning (SSL) has been extensively explored in recent years. Particularly, generative SSL has seen emerging success in natural language processing and other AI fields, such as the wide adoption of BERT and GPT. Despite this, contrastive learning—which heavily relies on structural data augmentation and complicated training strategies—has been the dominant approach in graph SSL, while the progress of generative SSL on graphs, especially graph autoencoders (GAEs), has thus far not reached the potential as promised in other fields. In this paper, we identify and examine the issues that negatively impact the development of GAEs, including their reconstruction objective, training robustness, and error metric. We present a masked graph autoencoder GraphMAE that mitigates these issues for generative self-supervised graph pre-training. Instead of reconstructing graph structures, we propose to focus on feature reconstruction with both a masking strategy and scaled cosine error that benefit the robust training of GraphMAE. We conduct extensive experiments on 21 public datasets for three different graph learning tasks. The results manifest that GraphMAE—a simple graph autoencoder with careful designs—can consistently generate outperformance over both contrastive and generative state-of-the-art baselines. This study provides an understanding of graph autoencoders and demonstrates the potential of generative self-supervised pre-training on graphs.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Zhenyu Hou, Xiao Liu, Yukuo Cen, Yuxiao Dong, Hongxia Yang, Chunjie Wang, Jie Tang
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
KDD 22
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
论文要解决的核心问题是为什么graph autoencodersGAEs在图自监督中长期弱于对比学习以及如何让“生成式自监督generative SSL”在图上达到与对比学习相当甚至更强的表征质量尤其面向node/graph classification与预训练迁移。作者总结了制约GAEs的关键瓶颈①目标偏差大量GAE以link/structure reconstruction为主导致对分类任务帮助有限②鲁棒性不足不做输入corruption的feature reconstruction容易学到identity function/trivial solution图特征维度通常较小使该问题更突出③误差度量不合适MSE对特征范数差异与维度灾难敏感、训练易不稳定甚至collapse且对易/难样本区分度低④decoder表达力弱常用MLP难以从低语义连续特征中有效“拉开”编码与重建目标导致潜变量退化为原始特征拷贝。简洁概述把GAE从“重建结构+MSE+弱解码”改造为“masked特征重建+稳定的SCE损失+GNN解码与re-mask正则”从根上提升生成式图预训练的可用性与效果。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
1) GraphMAEMasked Graph Autoencoder仅以feature reconstruction为自监督目标通过遮蔽部分节点特征来学习可迁移表示。
2) Masked feature reconstruction随机采样节点子集V~并用可学习向量[MASK]替换其输入特征迫使编码器利用邻域上下文恢复信息、避免trivial solution。
3) Re-mask decoding在解码前将被遮蔽节点的编码表示再次替换为[DMASK],让解码器必须依赖邻居的未遮蔽潜表示来重建被遮蔽节点特征。
4) GNN decoder用单层GNN如GAT/GIN/GCN替代传统MLP解码器提高解码表达力并引导编码器学习更高层潜表示。
5) Scaled Cosine ErrorSCE以(1-cosine(x,z))^γ作为重建损失,利用余弦误差消除范数/维度敏感性并用γ实现对hard samples的“focal”式重加权提升训练稳定性与选择性。
6) Random-substitution可选策略对被mask节点用随机替换而非“leave unchanged”减轻训练/推理的mask-token分布偏移论文指出leave-unchanged会伤害效果
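A minimal sketch of the scaled cosine error (SCE) summarized above, with toy arrays assumed: (1 − cos(x, z))^γ averaged over the masked nodes.

```python
# Toy scaled cosine error: norm-insensitive, gamma reweights hard samples.
import numpy as np

def sce_loss(x, z, gamma=2.0, eps=1e-8):
    """x: original features of masked nodes, z: reconstructed features."""
    x_n = x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)
    z_n = z / (np.linalg.norm(z, axis=-1, keepdims=True) + eps)
    cos = (x_n * z_n).sum(axis=-1)
    return ((1.0 - cos) ** gamma).mean()

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
z = x + 0.1 * rng.normal(size=(8, 16))
print(sce_loss(x, z))
```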
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
节点分类6个Cora, Citeseer, PubMed引文网络论文引用[48]等经典基准ogbn-arxivOpen Graph BenchmarkOGB论文引用[15]PPI, RedditGraphSAGE常用归纳基准论文引用其inductive setting来自GraphSAGE[10])。
图分类7个MUTAG, IMDB-B, IMDB-M, PROTEINS, COLLAB, REDDIT-B, NCI1论文引用[47]等图分类基准集合)。迁移学习/分子性质预测预训练数据为ZINC152 million无标签分子论文引用[33]下游为MoleculeNet论文引用[43]的8个分类数据集BBBP, Tox21, ToxCast, SIDER, ClinTox, MUV, HIV, BACE。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
节点分类AccuracyCora/Citeseer/PubMed/ogbn-arxiv/Reddit用于衡量单标签分类正确率Micro-F1PPI多标签用于在多标签场景下按全局TP/FP/FN统计综合衡量分类质量。
图分类Accuracy10-fold cross-validation accuracy报告均值与标准差用于衡量图级标签预测正确率并用10折交叉验证减少划分方差。
迁移学习分子性质预测ROC-AUC%)衡量二分类在不同阈值下的排序/区分能力,对类别不均衡更稳健。
论文还报告了均值±标准差、以及部分实验“20次线性分类器随机初始化平均”等作为统计稳定性呈现方式。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
节点分类Table 1GraphMAE在Cora 84.2±0.4Citeseer 73.4±0.4PubMed 81.1±0.4ogbn-arxiv 71.75±0.17PPI Micro-F1 74.50±0.29Reddit 96.01±0.08相较代表性对比方法CCA-SSG分别为Cora 84.0±0.4、PubMed 81.0±0.4、Reddit 95.07±0.02等GraphMAE整体为最佳或竞争性最强并显著超过生成式基线如GAECora 71.5±0.4与GPT-GNNCora 80.1±1.0)。
图分类Table 2accuracyGraphMAE在IMDB-B 75.52±0.66、IMDB-M 51.63±0.52、PROTEINS 75.30±0.39、COLLAB 80.32±0.46、MUTAG 88.19±1.26、REDDIT-B 88.01±0.19、NCI1 80.40±0.30在5/7数据集上优于所有自监督基线并在其余数据集上具竞争力如MUTAG略低于InfoGCL 91.20±1.30但与多基线接近)。
迁移学习Table 3ROC-AUC Avg.GraphMAE平均73.8优于GraphLoG 73.4、JOAO 71.9、GraphCL 70.8等在单项上如ClinTox 82.3±1.2、HIV 83.1±0.9等表现突出。
消融Table 4去掉maskCora 79.7、用MSE替代SCECora 79.1都会大幅降分去掉re-mask也会下降Cora 82.7)。
实验结论GraphMAE通过“mask + re-mask + GNN decoder + SCE”实现稳定训练使生成式图预训练在节点/图分类与分子迁移上系统性达到或超过SOTA对比学习。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机1) 图自监督长期依赖contrastive learning但其需要复杂训练策略动量编码器/EMA/stop-gradient、负样本与高质量图增强工程与理论负担重且跨图不稳2) GAEs天然更简单但在分类任务上落后作者希望找出其“没做对”的关键原因并让generative SSL在图上复兴。
贡献点1) 系统识别GAEs发展的关键障碍重建目标偏置结构优先、无corruption导致trivial solution、MSE引发不稳定与低选择性、decoder表达力不足。2) 提出GraphMAE框架以masked feature reconstruction为核心避免结构重建带来的任务偏置并提升分类导向表征。3) 提出Scaled Cosine ErrorSCE用余弦误差解决范数/维度敏感并用γ缩放实现对难样本的自适应重加权类focal loss思想显著提升训练鲁棒性。4) 提出re-mask decoding + GNN decoder通过[DMASK]与图解码器迫使信息从邻域潜表示流入重建增强压缩表征学习。5) 在21个公开数据集、三类任务上给出系统实证GraphMAE作为“简单但设计到位的GAE”可持续优于对比与生成式SOTA基线。
主要创新之处将“MAE式遮蔽重建”迁移到图并针对图特征连续/低语义、易trivial解的特性联合设计SCE损失与re-mask+GNN解码使生成式图自监督在稳定性与效果上同时跨越式提升。

View File

@@ -0,0 +1,9 @@
@misc{hu2020heterogeneousgraphtransformer,
title={Heterogeneous Graph Transformer},
author={Ziniu Hu and Yuxiao Dong and Kuansan Wang and Yizhou Sun},
year={2020},
eprint={2003.01332},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2003.01332},
}

View File

@@ -0,0 +1,54 @@
# Heterogeneous Graph Transformer
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
论文面向Web-scale heterogeneous graphs节点/边多类型且随时间演化中现有GNN多为homogeneous设计、依赖手工meta paths、难刻画异构分布差异与动态、且不具备可扩展训练的问题提出Heterogeneous Graph TransformerHGT。研究目的在于用端到端方式学习node- and edge-type dependent representations隐式学习“soft meta paths”并能处理dynamic heterogeneous graphs与超大规模训练。方法上HGT以meta relation ⟨τ(s),φ(e),τ(t)⟩分解参数构建异构mutual attention、message passing与target-specific aggregation引入Relative Temporal EncodingRTE用相对时间差∆T建模任意跨度的时序依赖提出HGSampling以保持各类型节点比例与子图稠密性实现mini-batch训练。实验在OAG179M nodes/2B edges及CS/Med子图上HGT在PaperField、PaperVenue与Author Disambiguation等任务上相对SOTA基线提升9%21%消融显示Heter与RTE均显著贡献。结论是HGT可在大规模异构动态图上高效、稳定优于现有GNN。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
Recent years have witnessed the emerging success of graph neural networks (GNNs) for modeling structured data. However, most GNNs are designed for homogeneous graphs, in which all nodes and edges belong to the same types, making them infeasible to represent heterogeneous structures. In this paper, we present the Heterogeneous Graph Transformer (HGT) architecture for modeling Web-scale heterogeneous graphs. To model heterogeneity, we design node- and edge-type dependent parameters to characterize the heterogeneous attention over each edge, empowering HGT to maintain dedicated representations for different types of nodes and edges. To handle dynamic heterogeneous graphs, we introduce the relative temporal encoding technique into HGT, which is able to capture the dynamic structural dependency with arbitrary durations. To handle Web-scale graph data, we design the heterogeneous mini-batch graph sampling algorithm—HGSampling—for efficient and scalable training. Extensive experiments on the Open Academic Graph of 179 million nodes and 2 billion edges show that the proposed HGT model consistently outperforms all the state-of-the-art GNN baselines by 9%–21% on various downstream tasks. The dataset and source code of HGT are publicly available at https://github.com/acbull/pyHGT.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Ziniu Hu, Yuxiao Dong, Kuansan Wang, Yizhou Sun
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
Proceedings of The Web Conference 2020 (WWW 20)
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
论文要解决的核心问题是在Web级别的异构信息网络中如何在不依赖手工meta paths的前提下同时1刻画不同node/edge类型带来的分布差异与语义差异heterogeneity2刻画图随时间演化带来的结构依赖dynamic并且3在数亿节点、数十亿边上实现可扩展的GNN训练scalability。现有异构GNN往往只按node type或edge type单独建模、参数难以共享导致稀有关系学习不稳对动态图常用“按时间切片建多个图”会丢失跨时间结构依赖训练上直接全图传播不可行现有采样在异构图上会产生类型极不均衡子图。简洁概述提出一种以meta relation驱动的Transformer式异构注意力GNN并配套相对时间编码与异构小批量采样使其能在超大规模异构动态图上端到端学习表示并提升下游任务效果。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
1) HGTHeterogeneous Graph Transformer用meta relation ⟨τ(s),φ(e),τ(t)⟩对注意力与消息传递参数进行分解与共享构建异构multi-head mutual attention、异构message passing与target-specific aggregation含残差以生成上下文化表示。
2) Heterogeneous Mutual Attention对不同node type使用不同的Key/Query线性投影并对不同edge type使用W_ATT变换再用meta-relation prior μ对注意力进行自适应缩放。
3) Heterogeneous Message Passing对不同source node type使用不同的消息线性投影M-Linear并用每种edge type的W_MSG注入关系语义形成多头消息。
4) RTERelative Temporal Encoding用相对时间差∆T(t,s)=T(t)-T(s)查找/生成sinusoid basis并经线性层T-Linear映射将RTE(∆T)加到source表示以建模任意跨度、可泛化到未见时间的动态依赖。
5) HGSampling为每种node type维护独立budget并用归一化度数的importance sampling采样出类型比例更均衡且更稠密的异构子图用于mini-batch训练。
6) Inductive Timestamp Assignment对无固定时间戳的plain nodes在采样时从相连event nodes继承时间戳以支持动态建模。
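A minimal sketch of the relative temporal encoding summarized above, assuming a toy hidden size and treating T-Linear as a plain random matrix: the time gap ΔT = T(t) − T(s) is mapped through a sinusoidal basis and a linear layer, then added to the source representation.

```python
# Toy relative temporal encoding: sinusoidal basis of the time gap + linear map.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                           # hidden dimension
W_t = rng.normal(size=(d, d))                   # stands in for T-Linear

def rte(delta_t: int) -> np.ndarray:
    i = np.arange(d)
    angles = delta_t / np.power(10000.0, (i - i % 2) / d)
    base = np.where(i % 2 == 0, np.sin(angles), np.cos(angles))
    return base @ W_t

h_source = rng.normal(size=(d,))
h_source_with_time = h_source + rte(delta_t=3)  # inject the relative-time signal
print(h_source_with_time.shape)
```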
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
1) Open Academic GraphOAG论文实验基础数据集引用[16,20,28]规模约178/179 million nodes与2.236/2 billion edges时间跨度19002019。
2) Computer ScienceCS学术图从OAG构建的领域子图。
3) MedicineMed学术图从OAG构建的领域子图。代码与数据入口论文给出公开仓库 https://github.com/acbull/pyHGT。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
论文使用两类排序指标NDCGNormalized Discounted Cumulative Gain衡量排序结果在Top位置的相关性质量对靠前位置赋予更高权重适用于多标签/多候选的ranking式评估MRRMean Reciprocal Rank对每个查询取正确答案排名的倒数再求均值反映模型把正确答案排到前面的能力常用于link prediction与候选排序任务。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
总体结论HGT在CS/Med/OAG三套大规模异构图、四个任务上NDCG与MRR均显著且一致优于GCN/GAT/RGCN/HetGNN/HAN并在参数量与batch time上优于多数异构GNN基线。
关键数值Table 2均值±方差列出HGT完整版 HGT(+RTE,+Heter) 与最佳基线HAN对比为主在OAG的PaperField(L1)上HAN NDCG 0.544±0.096 / MRR 0.622±0.092HGT达0.615±0.084 / 0.702±0.081在OAG的PaperVenue上HAN NDCG 0.327±0.062 / MRR 0.214±0.067HGT达0.355±0.062 / 0.247±0.061在OAG的Author Disambiguation上HAN NDCG 0.821±0.056 / MRR 0.660±0.049HGT达0.852±0.048 / 0.688±0.054。CS数据集的PaperField(L1)上HAN NDCG 0.618±0.058 / MRR 0.691±0.051HGT达0.718±0.014 / 0.823±0.019Med数据集的PaperField(L1)上HAN NDCG 0.607±0.054 / MRR 0.575±0.057HGT达0.709±0.029 / 0.688±0.048。
消融去掉异构参数化HGT(+RTE, −Heter)相对完整版有约4%性能下降去掉RTEHGT(−RTE, +Heter)约2%下降。效率表中示例在CS任务设置下HGT参数量8.20M、batch time约1.50s低于HAN的9.45M与2.27s且优于RGCN/HetGNN的参数规模与速度。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
动机1) 现实系统(学术网/社交网/经济网/IoT天然是heterogeneous graphshomogeneous GNN假设不成立2) 既有异构方法依赖手工meta paths或仅按node/edge单维度建模难同时捕获共享模式与特定模式3) 真实图是dynamic heterogeneous graph时间切片会丢失跨时间结构依赖4) Web-scale规模下需要可扩展mini-batch训练现有采样在异构图上会产生类型失衡与信息损失。
贡献点1) 提出HGT架构以meta relation ⟨τ(s),φ(e),τ(t)⟩为核心进行参数分解与共享定义异构mutual attention、message passing与target-specific aggregation实现无需手工meta paths的端到端异构表示学习隐式“soft meta paths”。2) 提出RTE基于相对时间差的时间编码机制可建模任意时长、并泛化到未见时间戳/未来时间。3) 提出HGSampling与配套的inductive timestamp assignment在Web级异构图上采样类型更均衡且更稠密的子图支持高效mini-batch训练与动态图时间戳分配。4) 在OAG179M/2B及CS/Med子图上进行大规模实验证明相对SOTA基线持续提升论文总结为9%21%)。
主要创新之处将Transformer式多头注意力与“基于meta relation的三元分解参数化node-type投影 + edge-type变换 + meta-relation prior”结合实现更强的异构建模与更高的参数共享效率同时把相对时间编码与异构专用采样机制纳入统一训练框架使模型真正可落地于Web-scale异构动态图。

View File

@@ -0,0 +1,9 @@
@misc{xu2019powerfulgraphneuralnetworks,
title={How Powerful are Graph Neural Networks?},
author={Keyulu Xu and Weihua Hu and Jure Leskovec and Stefanie Jegelka},
year={2019},
eprint={1810.00826},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/1810.00826},
}

View File

@@ -0,0 +1,59 @@
# How Powerful are Graph Neural Networks?
**第一个问题**请对论文的内容进行摘要总结包含研究背景与问题、研究目的、方法、主要结果和结论字数要求在150-300字之间使用论文中的术语和概念。
摘要总结本文系统分析Graph Neural NetworksGNNs的表达能力构建与WeisfeilerLehmanWL图同构测试紧密关联的理论框架。研究目的在于形式化刻画主流GNN变体如GCN、GraphSAGE的判别能力及局限并提出一个在邻域聚合message passing类方法中“最强”的架构。方法上作者将邻域表示为multiset研究不同AGGREGATE与READOUT的可区分性条件证明满足“注入”聚合与读出时GNN至多与WL等强并提出Graph Isomorphism NetworkGIN使用sum聚合与MLP实现对multiset的通用函数逼近。主要结果显示常用的mean/max聚合或1-layer感知机不足以区分简单结构GIN在多项图分类基准上达到SOTA训练拟合几乎完美且测试表现优异。结论GNN的判别力上限由WL测试界定具备注入式聚合与读出如GIN的架构在表达力上最强同时在实践中表现领先。
**第二个问题**请提取论文的摘要原文摘要一般在Abstract之后Introduction之前。
Graph Neural Networks (GNNs) are an effective framework for representation learning of graphs. GNNs follow a neighborhood aggregation scheme, where the representation vector of a node is computed by recursively aggregating and transforming representation vectors of its neighboring nodes. Many GNN variants have been proposed and have achieved state-of-the-art results on both node and graph classification tasks. However, despite GNNs revolutionizing graph representation learning, there is limited understanding of their representational properties and limitations. Here, we present a theoretical framework for analyzing the expressive power of GNNs to capture different graph structures. Our results characterize the discriminative power of popular GNN variants, such as Graph Convolutional Networks and GraphSAGE, and show that they cannot learn to distinguish certain simple graph structures. We then develop a simple architecture that is provably the most expressive among the class of GNNs and is as powerful as the Weisfeiler-Lehman graph isomorphism test. We empirically validate our theoretical findings on a number of graph classification benchmarks, and demonstrate that our model achieves state-of-the-art performance.
**第三个问题**:请列出论文的全部作者,按照此格式:`作者1, 作者2, 作者3`
Keyulu Xu, Weihua Hu, Jure Leskovec, Stefanie Jegelka
**第四个问题**:请直接告诉我这篇论文发表在哪个会议或期刊,请不要推理或提供额外信息。
ICLR 2019
**第五个问题**:请详细描述这篇论文主要解决的核心问题,并用简洁的语言概述。
核心问题在邻域聚合message passing框架下GNN到底能多强能区分哪些图结构、在哪些情况下失效以及如何构造在此类GNN中表达力最强、与WeisfeilerLehman测试等强的模型。简述论文给出一个以multiset函数为基础的理论框架证明常见聚合mean/max和1-layer感知机存在不可区分的结构同时提出使用sum聚合+MLP的GIN使GNN的判别力达到WL测试的上限。
**第六个问题**:请告诉我这篇论文提出了哪些方法,请用最简洁的方式概括每个方法的核心思路。
1) 理论框架GNN表达力与WL测试将邻域表示为multiset分析AGGREGATE/READOUT的“注入性”条件给出GNN判别力的上界与等价条件。
2) Graph Isomorphism NetworkGIN用sum聚合实现对multiset的通用近似injective结合MLP与(1+ε)·self项逐层更新h_v并在图级通过各层READOUT的拼接/求和形成h_G达到与WL等强的表达力。
3) 聚合器对比分析形式化比较sum、mean、max对multiset的捕获能力分别对应完整multiset、分布、集合骨架揭示其区分能力差异。
4) 经验验证设置在图分类基准上对比GIN与“较弱”变体mean/max或1-layer验证理论结论。
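A minimal numpy sketch of one GIN update and a sum readout as summarized above, with a toy graph and a random two-layer MLP assumed:

```python
# Toy GIN layer: h_v <- MLP((1 + eps) * h_v + sum of neighbour features),
# followed by a sum readout to get a graph-level vector.
import numpy as np

rng = np.random.default_rng(0)
N, F_in, F_hid = 4, 3, 5
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])                              # adjacency (no self-loops)
H = rng.normal(size=(N, F_in))
W1, W2 = rng.normal(size=(F_in, F_hid)), rng.normal(size=(F_hid, F_in))
eps = 0.0                                                 # GIN-0 variant

agg = (1.0 + eps) * H + A @ H                             # injective sum aggregation
H_next = np.maximum(agg @ W1, 0.0) @ W2                   # two-layer MLP with ReLU
graph_readout = H_next.sum(axis=0)                        # sum readout over nodes
print(graph_readout.shape)
```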
**第七个问题**:请告诉我这篇论文所使用的数据集,包括数据集的名称和来源。
- BioinformaticsMUTAG、PTC、NCI1、PROTEINS来源于Yanardag & Vishwanathan, 2015汇总的图分类基准
- Social networksCOLLAB、IMDB-BINARY、IMDB-MULTI、REDDIT-BINARY、REDDIT-MULTI5K同样来源于Yanardag & Vishwanathan, 2015。 备注:社交网络数据部分节点无特征,使用度或常数特征;生物数据含离散节点标签。
**第八个问题**:请列举这篇论文评估方法的所有指标,并简要说明这些指标的作用。
- 图分类准确率accuracy10折交叉验证均值±标准差衡量模型在图级分类任务上的泛化性能。
- 训练准确率曲线:衡量不同聚合/架构的表示能力与拟合强度,验证表达力理论结论。
- 与WL subtree kernel对比作为强判别的非学习基线对训练拟合与测试性能进行参考。
**第九个问题**:请总结这篇论文实验的表现,包含具体的数值表现和实验结论。
- 训练表现GIN-ε与GIN-0在9个数据集上几乎完美拟合训练集mean/max或1-layer变体在多数据集显著欠拟合训练准确率明显较低。
- 测试准确率10折均值±标准差例如IMDB-BINARY GIN-0为75.1±5.1%REDDIT-BINARY 92.4±2.5%REDDIT-MULTI5K 57.5±1.5%COLLAB 80.2±1.9%MUTAG 89.4±5.6%PROTEINS 76.2±2.8%PTC 64.6±7.0%NCI1 82.7±1.7%在REDDIT类数据上mean-MLP接近随机50.0±0.0% 与 20.0±0.0%。总体结论GIN在多数基准上达到或优于SOTA强表达力带来更好的训练拟合与测试表现简单聚合器存在结构不可分能力导致性能下降。
**第十个问题**:请清晰地描述论文所作的工作,分别列举出动机和贡献点以及主要创新之处。
- 动机缺乏对GNN表达力的系统理论理解现有设计依赖经验与试错未明确其能区分哪些结构、上限为何、如何构造更强模型。
- 贡献点:
1. 提出以multiset函数为核心的理论框架形式化分析GNN的判别力与其与WL测试的关系上界与等强条件
2. 证明常见变体GCN、GraphSAGE的mean/max、1-layer感知机无法区分某些简单图结构刻画其能捕获的性质分布或集合骨架
3. 设计GIN用sum聚合+MLP含(1+ε)自项实现注入式邻域聚合与图级读出达到与WL等强的表达力
4. 在9个图分类基准上实证验证GIN取得SOTA或可比表现训练几乎完美拟合支撑理论结论。
- 主要创新:
- 将GNN聚合视为multiset上的通用函数并给出注入性条件建立与WL测试的等强判别理论
- 提出GIN这一简单而最强的邻域聚合GNN架构sum+MLP+(1+ε)并配套跨层READOUT设计
- 系统揭示sum/mean/max的表达差异与失效例指导GNN聚合器选择与任务匹配。
