# NetShare: Design & Implementation Documentation
## Table of Contents
1. [Overview](#overview)
2. [Architecture](#architecture)
3. [Core Components](#core-components)
4. [Data Processing Pipeline](#data-processing-pipeline)
5. [Model Implementation](#model-implementation)
6. [Configuration System](#configuration-system)
7. [Distributed Computing](#distributed-computing)
8. [Field Processing System](#field-processing-system)
9. [Usage Examples](#usage-examples)
10. [Dependencies](#dependencies)
## Overview
NetShare is a GAN-based framework for generating synthetic network traffic traces (packet headers and flow headers) that maintains the statistical properties and privacy characteristics of real network data. The system addresses key challenges in synthetic network data generation including fidelity, scalability, and privacy.
### Key Features
- **GAN-based Generation**: Uses DoppelGANger architecture for realistic network trace generation
- **Multi-format Support**: Handles both PCAP and NetFlow formats
- **Distributed Processing**: Leverages Ray for scalable training and generation
- **Privacy Preservation**: Supports differential privacy (DP) options
- **Flexible Encoding**: Various encoding strategies for different data types
- **Quality Assessment**: Built-in visualization and evaluation tools
## Architecture
NetShare follows a modular, component-based architecture with clear separation of concerns:
```
┌─────────────────┐    ┌──────────────────────┐    ┌─────────────────┐
│    Generator    │───▶│ Model Manager Layer  │───▶│      Model      │
│                 │    │  (NetShareManager)   │    │ (DoppelGANger)  │
└─────────────────┘    └──────────────────────┘    └─────────────────┘
         │                         │                        │
         ▼                         ▼                        ▼
┌─────────────────┐    ┌──────────────────────┐    ┌─────────────────┐
│    Pre/Post     │    │   Ray Distributed    │    │    Training/    │
│    Processor    │    │      Computing       │    │   Generation    │
│   (NetShare)    │    │ (Parallel Processing)│    │    Pipeline     │
└─────────────────┘    └──────────────────────┘    └─────────────────┘
```
### Component Layers
1. **Generator Layer**: Main orchestration class that manages the complete workflow
2. **Model Manager Layer**: Handles training and generation workflows
3. **Model Layer**: Implements the actual GAN algorithms
4. **Pre/Post Processor Layer**: Handles data preparation and transformation
5. **Ray Layer**: Provides distributed computing capabilities
## Core Components
### Generator Class
The `Generator` class serves as the main entry point and workflow coordinator:
```python
from netshare import Generator
generator = Generator(config="config.json")
generator.train(work_folder="results/")
generator.generate(work_folder="results/")
generator.visualize(work_folder="results/")
```
**Key Methods**:
- `train()`: Preprocesses data and trains the GAN model
- `generate()`: Generates synthetic data using the trained model
- `train_and_generate()`: Executes both training and generation in sequence
- `visualize()`: Creates visual comparisons between real and synthetic data
### Model Manager
The `NetShareManager` handles the training and generation workflows:
- **Training Workflow**: Manages data preprocessing, model training, and checkpointing
- **Generation Workflow**: Handles attribute generation, feature generation, and data reconstruction
- **Chunked Processing**: Splits large datasets into chunks for efficient processing
### Model Implementation
The `DoppelGANgerTorchModel` implements the core GAN architecture:
- **Separate Generators**: Distinct generators for attributes and features
- **Conditional Generation**: Features generated conditioned on attributes
- **Multiple Discriminators**: Separate discriminators for attributes and features
- **Sequence Handling**: Supports variable-length sequences with padding
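To make the "conditional generation" idea above concrete, here is a minimal sketch of attribute-conditioned feature generation in PyTorch. All names and dimensions are illustrative assumptions, not NetShare's actual classes: the static attributes produced by one generator are broadcast to every timestep and fed into a recurrent feature generator.

```python
import torch
import torch.nn as nn

# Hypothetical, simplified sketch of DoppelGANger-style conditional
# generation; names and sizes are illustrative, not NetShare's real API.
ATTR_NOISE, FEAT_NOISE, ATTR_DIM, FEAT_DIM, SEQ_LEN = 8, 4, 6, 3, 10

# Attribute generator: noise -> static flow properties
attr_gen = nn.Sequential(nn.Linear(ATTR_NOISE, 32), nn.ReLU(), nn.Linear(32, ATTR_DIM))

class FeatureGenerator(nn.Module):
    """Generates a time series conditioned on the (fixed) attributes."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(FEAT_NOISE + ATTR_DIM, 32, batch_first=True)
        self.out = nn.Linear(32, FEAT_DIM)

    def forward(self, z_feat, attrs):
        # Broadcast the static attributes to every timestep before the RNN.
        cond = attrs.unsqueeze(1).expand(-1, z_feat.size(1), -1)
        h, _ = self.rnn(torch.cat([z_feat, cond], dim=-1))
        return self.out(h)

batch = 4
attrs = attr_gen(torch.randn(batch, ATTR_NOISE))   # static flow properties
feats = FeatureGenerator()(torch.randn(batch, SEQ_LEN, FEAT_NOISE), attrs)
print(attrs.shape, feats.shape)
```

The key design point is that every generated timestep sees the same attribute vector, so per-flow properties (e.g. the protocol) stay consistent across the flow's time series.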
## Data Processing Pipeline
### Preprocessing Stage
The preprocessing pipeline transforms raw network data into GAN-ready format:
1. **Data Ingestion**: Supports PCAP and CSV formats
2. **Data Chunking**: Splits large datasets by size or time windows
3. **Field Processing**: Applies appropriate encodings to different field types
4. **Normalization**: Normalizes continuous fields to [0,1] range
5. **Encoding**: Converts categorical fields using various strategies
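Two of the steps above (field processing of timestamps and normalization) can be sketched in a few lines. The interarrival encoding and `ZERO_ONE` normalization names come from the configuration example later in this document; the function below is our own illustration, not NetShare's internal code.

```python
import numpy as np

# Illustrative sketch of two preprocessing steps: interarrival encoding of
# timestamps, then min-max scaling to [0, 1] ("ZERO_ONE" in the config).
ts = np.array([0.00, 0.05, 0.30, 0.31, 1.00])   # packet timestamps (seconds)

# Interarrival encoding: store deltas instead of absolute times
# (the first value becomes 0 by convention here).
inter = np.diff(ts, prepend=ts[0])

# Min-max normalization to [0, 1]
lo, hi = inter.min(), inter.max()
norm = (inter - lo) / (hi - lo)

print(norm.min(), norm.max())  # 0.0 1.0
```

Post-processing reverses these steps exactly: denormalize with the stored `lo`/`hi`, then cumulatively sum the interarrivals back into timestamps.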
### Field Types and Encodings
NetShare supports multiple field types with specialized processing:
- **Continuous Fields**: Numerical data with min-max normalization
- **Discrete Fields**: Categorical data with one-hot encoding
- **Bit Fields**: Integer data converted to bit representations (e.g., IP addresses)
- **Word2Vec Fields**: Embedding-based representation for categorical data
### Post-processing Stage
The post-processing pipeline reconstructs synthetic data to original format:
1. **Denormalization**: Reverses normalization applied during preprocessing
2. **Decoding**: Converts encoded representations back to original format
3. **Format Conversion**: Outputs data in original format (PCAP/NetFlow)
4. **Quality Assessment**: Evaluates synthetic data quality
## Model Implementation
### DoppelGANger Architecture
The core model implements the DoppelGANger architecture which separates:
- **Attribute Generation**: Static properties of network flows (IP addresses, ports, protocol)
- **Feature Generation**: Time-series data within flows (timestamps, packet sizes)
**Key Components**:
- **Attribute Generator**: Creates static flow properties
- **Feature Generator**: Creates time-series data conditioned on attributes
- **Feature Discriminator**: Distinguishes real vs. synthetic features
- **Attribute Discriminator**: Distinguishes real vs. synthetic attributes
**Training Process**:
- Alternating optimization of generator and discriminators
- Gradient penalty for WGAN-GP stability
- Sequence packing for variable-length sequences
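The WGAN-GP gradient penalty mentioned above penalizes the discriminator's gradient norm at points interpolated between real and fake samples. A minimal PyTorch version is sketched below; `disc` is a stand-in discriminator, and the function shape is an assumption rather than NetShare's exact implementation.

```python
import torch

# Minimal WGAN-GP gradient penalty as used to stabilize GAN training.
# `disc` here is a stand-in; NetShare's real discriminators differ.
def gradient_penalty(disc, real, fake, lam=10.0):
    eps = torch.rand(real.size(0), 1)                     # per-sample mix ratio
    mixed = (eps * real + (1 - eps) * fake).requires_grad_(True)
    score = disc(mixed).sum()
    # Gradient of discriminator output w.r.t. the interpolated input
    grad, = torch.autograd.grad(score, mixed, create_graph=True)
    # Penalize deviation of the gradient norm from 1 (the Lipschitz target)
    return lam * ((grad.norm(2, dim=1) - 1) ** 2).mean()

disc = torch.nn.Linear(6, 1)
gp = gradient_penalty(disc, torch.randn(4, 6), torch.randn(4, 6))
print(gp)  # a non-negative scalar loss term
```

This term is added to the discriminator loss each step, which is what makes the alternating optimization stable without weight clipping.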
### Model Configuration
The model supports various hyperparameters:
- `batch_size`: Training batch size
- `sample_len`: Length of sequences to generate
- `epochs`: Number of training epochs
- `learning_rates`: Generator and discriminator learning rates
- `network_architecture`: Generator/discriminator layer configurations
## Configuration System
NetShare uses a hierarchical configuration system:
### Global Configuration
```json
{
  "global_config": {
    "original_data_file": "path/to/data.csv",
    "overwrite": true,
    "dataset_type": "netflow",
    "n_chunks": 2,
    "dp": false
  }
}
```
### Pre/Post Processor Configuration
Defines how to process different data fields:
- Metadata fields (static flow properties)
- Timeseries fields (dynamic flow properties)
- Encoding strategies for each field type
### Model Configuration
Specifies GAN hyperparameters and architecture:
- Network dimensions and layers
- Training parameters (epochs, learning rates)
- Privacy settings (if using DP)
## Distributed Computing
### Ray Integration
NetShare leverages Ray for distributed computing:
- **Parallel Preprocessing**: Multiple data chunks processed in parallel
- **Distributed Training**: Model training across multiple nodes/GPUs
- **Resource Management**: Automatic load balancing and resource allocation
- **Fault Tolerance**: Resilient to node failures during long-running jobs
### Chunked Processing
For large datasets, NetShare splits data into chunks:
- Each chunk processed independently
- Results merged after processing
- Memory-efficient for large datasets
- Enables parallel processing
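A toy sketch of the time-window chunking described above (NetShare's actual splitter is richer and also supports size-based splits): records are sorted by timestamp and grouped into fixed-width windows, and each resulting chunk can then be processed independently.

```python
from itertools import groupby

# Toy time-window chunking: records are (timestamp, payload) pairs,
# grouped into fixed-width windows. Names here are illustrative.
def chunk_by_time(records, window=10.0):
    keyed = sorted(records, key=lambda r: r[0])
    return [list(g) for _, g in groupby(keyed, key=lambda r: int(r[0] // window))]

records = [(1.0, "a"), (12.5, "b"), (3.2, "c"), (25.0, "d")]
chunks = chunk_by_time(records)
print([len(c) for c in chunks])  # [2, 1, 1]
```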
## Field Processing System
### Field Types
NetShare implements a flexible field processing system:
#### ContinuousField
- Handles numerical data
- Supports various normalization options (min-max, log1p, etc.)
- Preserves statistical properties during normalization/denormalization
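As an example of a non-min-max option, a log1p-style continuous field compresses heavy-tailed values (such as byte counts) before training; the class name below is hypothetical, but the invariant it demonstrates is the important part: `denormalize` must invert `normalize` exactly.

```python
import math

# Hypothetical sketch of a log1p continuous field. The roundtrip property
# (denormalize(normalize(x)) == x) is what preserves statistics end to end.
class ContinuousLogField:
    def normalize(self, x):
        return math.log1p(x)       # compress heavy-tailed values

    def denormalize(self, y):
        return math.expm1(y)       # exact inverse of log1p

f = ContinuousLogField()
vals = [0, 10, 1500, 65535]
roundtrip = [f.denormalize(f.normalize(v)) for v in vals]
print([round(v) for v in roundtrip])  # [0, 10, 1500, 65535]
```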
#### DiscreteField
- Processes categorical data
- One-hot encoding for neural network compatibility
- Maintains categorical relationships
#### BitField
- Converts integers to bit representations
- Useful for IP addresses (32-bit representation)
- Preserves bit-level patterns
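The IPv4 case can be sketched directly: an address is parsed to a 32-bit integer and expanded into a bit vector, which is what lets the GAN learn subnet-level (prefix) patterns; decoding reverses the expansion. The helper names below are ours, not NetShare's.

```python
import ipaddress

# Sketch of bit encoding for an IPv4 address (32 bits), most significant
# bit first, as a BitField might produce. Helper names are illustrative.
def ip_to_bits(ip):
    n = int(ipaddress.IPv4Address(ip))
    return [(n >> (31 - i)) & 1 for i in range(32)]

def bits_to_ip(bits):
    n = sum(b << (31 - i) for i, b in enumerate(bits))
    return str(ipaddress.IPv4Address(n))

bits = ip_to_bits("192.168.0.1")
print(bits[:8], bits_to_ip(bits))  # [1, 1, 0, 0, 0, 0, 0, 0] 192.168.0.1
```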
#### Word2VecField
- Embeds categorical data using Word2Vec models
- Captures semantic relationships between categories
- Reduces dimensionality for high-cardinality categorical data
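The decode side of an embedded field can be sketched without the embedding training itself: a vector produced by the generator is mapped back to the nearest known category embedding. The tiny embedding table below is made up for illustration; NetShare trains real embeddings (e.g. with Word2Vec over port "sentences") during preprocessing.

```python
import numpy as np

# Sketch of the decode step for embedded fields: map a generated vector
# back to the nearest known category. The table here is made up.
emb = {
    80:  np.array([0.9, 0.1]),
    443: np.array([0.1, 0.9]),
    22:  np.array([0.5, 0.5]),
}

def decode(vec):
    # Nearest neighbor in embedding space (Euclidean distance)
    return min(emb, key=lambda k: np.linalg.norm(emb[k] - vec))

print(decode(np.array([0.85, 0.2])))  # 80
```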
### Encoding Strategies
Different encoding strategies optimize for different data types:
- **Bit Encoding**: For IP addresses and other integer identifiers
- **Word2Vec**: For categorical fields with semantic relationships
- **Categorical**: Standard one-hot encoding for discrete values
- **Normalization**: Various schemes for continuous values
## Usage Examples
### Basic Usage
```python
import netshare.ray as ray
from netshare import Generator

# Disable Ray to run locally on a single machine; set to True (and point
# `address` at a cluster) to run distributed.
ray.config.enabled = False
ray.init(address="auto")

# Create generator with configuration
generator = Generator(config="config.json")

# Train the model
generator.train(work_folder="results/")

# Generate synthetic data
generator.generate(work_folder="results/")

# Visualize results
generator.visualize(work_folder="results/")

ray.shutdown()
```
### Configuration Example
```json
{
  "global_config": {
    "original_data_file": "data/netflow.csv",
    "overwrite": true,
    "dataset_type": "netflow",
    "n_chunks": 2,
    "dp": false
  },
  "pre_post_processor": {
    "class": "NetsharePrePostProcessor",
    "config": {
      "timestamp": {
        "column": "ts",
        "generation": true,
        "encoding": "interarrival",
        "normalization": "ZERO_ONE"
      },
      "metadata": [
        {
          "column": "srcip",
          "type": "integer",
          "encoding": "bit",
          "n_bits": 32
        },
        {
          "column": "srcport",
          "type": "integer",
          "encoding": "word2vec_port"
        }
      ],
      "timeseries": [
        {
          "column": "pkt",
          "type": "float",
          "normalization": "ZERO_ONE"
        }
      ]
    }
  },
  "model": {
    "class": "DoppelGANgerTorchModel",
    "config": {
      "batch_size": 100,
      "sample_len": [1, 5, 10],
      "epochs": 40
    }
  }
}
```
## Dependencies
NetShare requires the following key dependencies:
- **PyTorch**: Deep learning framework for GAN implementation
- **Ray**: Distributed computing framework
- **Pandas**: Data manipulation and analysis
- **NumPy**: Numerical computing
- **Gensim**: Word2Vec implementation
- **Scikit-learn**: Machine learning utilities
- **Matplotlib**: Visualization
- **Config_IO**: Configuration management
- **SDMetrics**: Synthetic data quality evaluation
### Installation
```bash
pip install -e NetShare/
pip install -e SDMetrics_timeseries/
```
## Performance and Scalability
### Memory Management
- Chunked processing for large datasets
- Efficient data loading and preprocessing
- Model checkpointing to handle long training runs
### Parallel Processing
- Ray-based distributed computing
- Parallel preprocessing of data chunks
- Multi-GPU training support
### Quality Assurance
- Built-in visualization tools
- Statistical similarity metrics
- Downstream task evaluation capabilities
## Privacy Considerations
### Differential Privacy
- Optional DP support for privacy-preserving generation
- Configurable privacy budget
- Trade-off between privacy and utility
### Data Handling
- Generated traces contain no records copied verbatim from the raw network data
- Statistical properties preserved while individual records are synthetic
- Compliance with privacy regulations
## Evaluation and Validation
### Built-in Metrics
- Distributional similarity metrics
- Statistical property preservation
- Downstream task performance evaluation
### Visualization Tools
- Side-by-side comparison of real vs. synthetic data
- Distribution plots
- Correlation analysis
## Extensibility
### Plugin Architecture
- Pluggable pre/post processors
- Custom model implementations
- Extendable field types
- Configurable workflows
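The plugin pattern above typically hinges on a small abstract interface that custom components implement. The sketch below is hypothetical (NetShare's actual base classes and method names may differ) but shows the shape of a pluggable pre/post processor.

```python
from abc import ABC, abstractmethod

# Hypothetical sketch of the plugin pattern described above; NetShare's
# actual base classes and method names may differ.
class PrePostProcessor(ABC):
    @abstractmethod
    def preprocess(self, input_folder, output_folder): ...

    @abstractmethod
    def postprocess(self, input_folder, output_folder): ...

class MyCustomProcessor(PrePostProcessor):
    """A custom processor plugged in via the configuration system."""
    def preprocess(self, input_folder, output_folder):
        return f"preprocess {input_folder} -> {output_folder}"

    def postprocess(self, input_folder, output_folder):
        return f"postprocess {input_folder} -> {output_folder}"

p = MyCustomProcessor()
print(p.preprocess("raw/", "prepared/"))  # preprocess raw/ -> prepared/
```

In the configuration shown earlier, the `"class"` key is what selects which concrete implementation the framework instantiates.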
### Customization Points
- Custom field encodings
- Alternative GAN architectures
- Specialized evaluation metrics
- Domain-specific preprocessing
## References
Yin, Y., Lin, Z., Jin, M., Fanti, G., & Sekar, V. (2022). Practical GAN-Based Synthetic IP Header Trace Generation Using NetShare. SIGCOMM 2022.