NetShare: Design & Implementation Documentation
Table of Contents
- Overview
- Architecture
- Core Components
- Data Processing Pipeline
- Model Implementation
- Configuration System
- Distributed Computing
- Field Processing System
- Usage Examples
- Dependencies
- Performance and Scalability
- Privacy Considerations
- Evaluation and Validation
- Extensibility
- References
Overview
NetShare is a GAN-based framework for generating synthetic network traffic traces (packet headers and flow headers) that maintains the statistical properties and privacy characteristics of real network data. The system addresses key challenges in synthetic network data generation including fidelity, scalability, and privacy.
Key Features
- GAN-based Generation: Uses DoppelGANger architecture for realistic network trace generation
- Multi-format Support: Handles both PCAP and NetFlow formats
- Distributed Processing: Leverages Ray for scalable training and generation
- Privacy Preservation: Supports differential privacy (DP) options
- Flexible Encoding: Various encoding strategies for different data types
- Quality Assessment: Built-in visualization and evaluation tools
Architecture
NetShare follows a modular, component-based architecture with clear separation of concerns:
┌─────────────────┐ ┌──────────────────────┐ ┌─────────────────┐
│ Generator │───▶│ Model Manager Layer │───▶│ Model │
│ │ │ (NetShareManager) │ │ (DoppelGANger) │
└─────────────────┘ └──────────────────────┘ └─────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌──────────────────────┐ ┌─────────────────┐
│ Pre/Post │ │ Ray Distributed │ │ Training/ │
│ Processor │ │ Computing │ │ Generation │
│ (NetShare) │ │ (Parallel Processing)│ │ Pipeline │
└─────────────────┘ └──────────────────────┘ └─────────────────┘
Component Layers
- Generator Layer: Main orchestration class that manages the complete workflow
- Model Manager Layer: Handles training and generation workflows
- Model Layer: Implements the actual GAN algorithms
- Pre/Post Processor Layer: Handles data preparation and transformation
- Ray Layer: Provides distributed computing capabilities
Core Components
Generator Class
The Generator class serves as the main entry point and workflow coordinator:
from netshare import Generator
generator = Generator(config="config.json")
generator.train(work_folder="results/")
generator.generate(work_folder="results/")
generator.visualize(work_folder="results/")
Key Methods:
- train(): Preprocesses data and trains the GAN model
- generate(): Generates synthetic data using the trained model
- train_and_generate(): Executes both training and generation in sequence
- visualize(): Creates visual comparisons between real and synthetic data
Model Manager
The NetShareManager handles the training and generation workflows:
- Training Workflow: Manages data preprocessing, model training, and checkpointing
- Generation Workflow: Handles attribute generation, feature generation, and data reconstruction
- Chunked Processing: Splits large datasets into chunks for efficient processing
Model Implementation
The DoppelGANgerTorchModel implements the core GAN architecture:
- Separate Generators: Distinct generators for attributes and features
- Conditional Generation: Features generated conditioned on attributes
- Multiple Discriminators: Separate discriminators for attributes and features
- Sequence Handling: Supports variable-length sequences with padding
Data Processing Pipeline
Preprocessing Stage
The preprocessing pipeline transforms raw network data into GAN-ready format:
- Data Ingestion: Supports PCAP and CSV formats
- Data Chunking: Splits large datasets by size or time windows
- Field Processing: Applies appropriate encodings to different field types
- Normalization: Normalizes continuous fields to [0,1] range
- Encoding: Converts categorical fields using various strategies
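The normalization and encoding steps above can be sketched in a few lines. This is a minimal, self-contained illustration (not NetShare's internal code); the field values are made up, and the key point is that each transform keeps enough state (bounds, category order) to be reversed during post-processing:

```python
import numpy as np

def minmax_normalize(values):
    """Scale a continuous field to [0, 1]; keep min/max so it can be reversed."""
    lo, hi = float(np.min(values)), float(np.max(values))
    return (np.asarray(values, dtype=float) - lo) / (hi - lo), (lo, hi)

def minmax_denormalize(scaled, bounds):
    """Reverse the normalization during post-processing."""
    lo, hi = bounds
    return np.asarray(scaled) * (hi - lo) + lo

def one_hot(labels):
    """One-hot encode a categorical field; keep category order for decoding."""
    cats = sorted(set(labels))
    idx = {c: i for i, c in enumerate(cats)}
    mat = np.zeros((len(labels), len(cats)))
    for row, label in enumerate(labels):
        mat[row, idx[label]] = 1.0
    return mat, cats

pkt_sizes = [60, 1500, 576, 60]                       # toy continuous field
scaled, bounds = minmax_normalize(pkt_sizes)
restored = minmax_denormalize(scaled, bounds)         # round-trips to the input

protos, cats = one_hot(["TCP", "UDP", "TCP", "ICMP"]) # toy categorical field
```

Storing the bounds and category list alongside each chunk is what lets the post-processing stage invert these transforms exactly.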
Field Types and Encodings
NetShare supports multiple field types with specialized processing:
- Continuous Fields: Numerical data with min-max normalization
- Discrete Fields: Categorical data with one-hot encoding
- Bit Fields: Integer data converted to bit representations (e.g., IP addresses)
- Word2Vec Fields: Embedding-based representation for categorical data
Post-processing Stage
The post-processing pipeline reconstructs synthetic data to original format:
- Denormalization: Reverses normalization applied during preprocessing
- Decoding: Converts encoded representations back to original format
- Format Conversion: Outputs data in original format (PCAP/NetFlow)
- Quality Assessment: Evaluates synthetic data quality
Model Implementation
DoppelGANger Architecture
The core model implements the DoppelGANger architecture which separates:
- Attribute Generation: Static properties of network flows (IP addresses, ports, protocol)
- Feature Generation: Time-series data within flows (timestamps, packet sizes)
Key Components:
- Attribute Generator: Creates static flow properties
- Feature Generator: Creates time-series data conditioned on attributes
- Feature Discriminator: Distinguishes real vs. synthetic features
- Attribute Discriminator: Distinguishes real vs. synthetic attributes
Training Process:
- Alternating optimization of generator and discriminators
- Gradient penalty for WGAN-GP stability
- Sequence packing for variable-length sequences
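The WGAN-GP gradient penalty mentioned above can be sketched generically in PyTorch. This is the standard penalty term, not NetShare's exact implementation; the tiny linear discriminator and 8-dimensional features are placeholders for illustration:

```python
import torch

def gradient_penalty(discriminator, real, fake):
    """WGAN-GP penalty: push the discriminator's gradient norm toward 1
    on random interpolates between real and synthetic samples."""
    alpha = torch.rand(real.size(0), 1)                     # per-sample mix ratio
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = discriminator(interp)
    grads = torch.autograd.grad(
        outputs=scores, inputs=interp,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,                                  # penalty is backpropagated
    )[0]
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

# Toy discriminator on 8-dim feature vectors
disc = torch.nn.Linear(8, 1)
real = torch.randn(4, 8)
fake = torch.randn(4, 8)
gp = gradient_penalty(disc, real, fake)                     # scalar added to the D loss
```

In training, this scalar is weighted and added to the discriminator loss each step, which is what stabilizes the alternating optimization.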
Model Configuration
The model supports various hyperparameters:
- batch_size: Training batch size
- sample_len: Length of sequences to generate
- epochs: Number of training epochs
- learning_rates: Generator and discriminator learning rates
- network_architecture: Generator/discriminator layer configurations
Configuration System
NetShare uses a hierarchical configuration system:
Global Configuration
{
"global_config": {
"original_data_file": "path/to/data.csv",
"overwrite": true,
"dataset_type": "netflow",
"n_chunks": 2,
"dp": false
}
}
Pre/Post Processor Configuration
Defines how to process different data fields:
- Metadata fields (static flow properties)
- Timeseries fields (dynamic flow properties)
- Encoding strategies for each field type
Model Configuration
Specifies GAN hyperparameters and architecture:
- Network dimensions and layers
- Training parameters (epochs, learning rates)
- Privacy settings (if using DP)
Distributed Computing
Ray Integration
NetShare leverages Ray for distributed computing:
- Parallel Preprocessing: Multiple data chunks processed in parallel
- Distributed Training: Model training across multiple nodes/GPUs
- Resource Management: Automatic load balancing and resource allocation
- Fault Tolerance: Resilient to node failures during long-running jobs
Chunked Processing
For large datasets, NetShare splits data into chunks:
- Each chunk processed independently
- Results merged after processing
- Memory-efficient for large datasets
- Enables parallel processing
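The chunking idea can be sketched with pandas. This is an illustrative splitter, not NetShare's actual chunking code: it sorts a flow table by a (hypothetical) timestamp column and cuts it into contiguous windows that can be processed independently and merged afterwards:

```python
import pandas as pd

def split_into_chunks(df, n_chunks, time_col="ts"):
    """Split a flow table into n_chunks contiguous time windows.
    time_col is an illustrative column name."""
    df = df.sort_values(time_col).reset_index(drop=True)
    bounds = [round(i * len(df) / n_chunks) for i in range(n_chunks + 1)]
    return [df.iloc[bounds[i]:bounds[i + 1]] for i in range(n_chunks)]

flows = pd.DataFrame({
    "ts":  [3.0, 1.0, 2.0, 4.0, 6.0, 5.0],
    "pkt": [60, 1500, 576, 40, 1200, 800],
})
chunks = split_into_chunks(flows, n_chunks=2)       # each chunk processed independently
merged = pd.concat(chunks, ignore_index=True)       # results merged after processing
```

Because each chunk is self-contained, Ray can schedule preprocessing and training for different chunks on different workers.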
Field Processing System
Field Types
NetShare implements a flexible field processing system:
ContinuousField
- Handles numerical data
- Supports various normalization options (min-max, log1p, etc.)
- Preserves statistical properties during normalization/denormalization
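For heavy-tailed continuous fields, the log1p option mentioned above compresses the range before scaling while remaining exactly invertible. A minimal sketch (illustrative values, not NetShare's internal code):

```python
import numpy as np

def log1p_normalize(values):
    """Compress a heavy-tailed field (e.g. bytes per flow) with log1p,
    then min-max scale; both steps are invertible."""
    logged = np.log1p(np.asarray(values, dtype=float))
    lo, hi = logged.min(), logged.max()
    return (logged - lo) / (hi - lo), (lo, hi)

def log1p_denormalize(scaled, bounds):
    """Invert the scaling, then expm1 to recover the original magnitudes."""
    lo, hi = bounds
    return np.expm1(np.asarray(scaled) * (hi - lo) + lo)

byte_counts = [64, 1_000, 50_000, 2_000_000]   # spans five orders of magnitude
scaled, bounds = log1p_normalize(byte_counts)
restored = log1p_denormalize(scaled, bounds)
```

Without the log step, the largest values would dominate the [0, 1] range and the GAN would see most samples squashed near zero.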
DiscreteField
- Processes categorical data
- One-hot encoding for neural network compatibility
- Maintains categorical relationships
BitField
- Converts integers to bit representations
- Useful for IP addresses (32-bit representation)
- Preserves bit-level patterns
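The IPv4 case can be sketched directly: encode a dotted-quad address as 32 bits (MSB first) and decode it back during post-processing. A self-contained illustration of the idea, not NetShare's BitField code:

```python
def ip_to_bits(ip):
    """Encode a dotted-quad IPv4 address as a list of 32 bits (MSB first)."""
    value = 0
    for octet in ip.split("."):
        value = (value << 8) | int(octet)
    return [(value >> (31 - i)) & 1 for i in range(32)]

def bits_to_ip(bits):
    """Decode 32 bits back to dotted-quad form during post-processing."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))

bits = ip_to_bits("192.168.0.1")   # 32 binary dimensions for the GAN
ip = bits_to_ip(bits)              # round-trips to "192.168.0.1"
```

Each bit becomes one binary dimension for the GAN, so shared prefixes (subnets) show up as shared bit patterns the model can learn.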
Word2VecField
- Embeds categorical data using Word2Vec models
- Captures semantic relationships between categories
- Reduces dimensionality for high-cardinality categorical data
Encoding Strategies
Different encoding strategies optimize for different data types:
- Bit Encoding: For IP addresses and other integer identifiers
- Word2Vec: For categorical fields with semantic relationships
- Categorical: Standard one-hot encoding for discrete values
- Normalization: Various schemes for continuous values
Usage Examples
Basic Usage
import netshare.ray as ray
from netshare import Generator
# Ray is optional: with config.enabled = False, init/shutdown are no-ops
# and everything runs on the local machine
ray.config.enabled = False
ray.init(address="auto")
# Create generator with configuration
generator = Generator(config="config.json")
# Train the model
generator.train(work_folder="results/")
# Generate synthetic data
generator.generate(work_folder="results/")
# Visualize results
generator.visualize(work_folder="results/")
ray.shutdown()
Configuration Example
{
"global_config": {
"original_data_file": "data/netflow.csv",
"overwrite": true,
"dataset_type": "netflow",
"n_chunks": 2,
"dp": false
},
"pre_post_processor": {
"class": "NetsharePrePostProcessor",
"config": {
"timestamp": {
"column": "ts",
"generation": true,
"encoding": "interarrival",
"normalization": "ZERO_ONE"
},
"metadata": [
{
"column": "srcip",
"type": "integer",
"encoding": "bit",
"n_bits": 32
},
{
"column": "srcport",
"type": "integer",
"encoding": "word2vec_port"
}
],
"timeseries": [
{
"column": "pkt",
"type": "float",
"normalization": "ZERO_ONE"
}
]
}
},
"model": {
"class": "DoppelGANgerTorchModel",
"config": {
"batch_size": 100,
"sample_len": [1, 5, 10],
"epochs": 40
}
}
}
Dependencies
NetShare requires the following key dependencies:
- PyTorch: Deep learning framework for GAN implementation
- Ray: Distributed computing framework
- Pandas: Data manipulation and analysis
- NumPy: Numerical computing
- Gensim: Word2Vec implementation
- Scikit-learn: Machine learning utilities
- Matplotlib: Visualization
- config_io: Configuration management
- SDMetrics: Synthetic data quality evaluation
Installation
pip install -e NetShare/
pip install -e SDMetrics_timeseries/
Performance and Scalability
Memory Management
- Chunked processing for large datasets
- Efficient data loading and preprocessing
- Model checkpointing to handle long training runs
Parallel Processing
- Ray-based distributed computing
- Parallel preprocessing of data chunks
- Multi-GPU training support
Quality Assurance
- Built-in visualization tools
- Statistical similarity metrics
- Downstream task evaluation capabilities
Privacy Considerations
Differential Privacy
- Optional DP support for privacy-preserving generation
- Configurable privacy budget
- Trade-off between privacy and utility
Data Handling
- No direct access to raw network data in generated traces
- Statistical properties preserved while individual records are synthetic
- Compliance with privacy regulations
Evaluation and Validation
Built-in Metrics
- Distributional similarity metrics
- Statistical property preservation
- Downstream task performance evaluation
Visualization Tools
- Side-by-side comparison of real vs. synthetic data
- Distribution plots
- Correlation analysis
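A side-by-side distribution comparison of the kind described above can be sketched with Matplotlib. The data here is synthetic stand-in data (random draws), used only to show the plotting pattern:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
real = rng.exponential(scale=500, size=1000)       # stand-in for real packet sizes
synthetic = rng.exponential(scale=520, size=1000)  # stand-in for generated ones

# One panel per dataset, shared y-axis for a fair visual comparison
fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
for ax, data, title in zip(axes, (real, synthetic), ("Real", "Synthetic")):
    ax.hist(data, bins=40)
    ax.set_title(title)
    ax.set_xlabel("packet size (bytes)")
fig.savefig("comparison.png")
```

The same pattern extends to per-field distribution plots and correlation heatmaps over real vs. synthetic traces.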
Extensibility
Plugin Architecture
- Pluggable pre/post processors
- Custom model implementations
- Extendable field types
- Configurable workflows
Customization Points
- Custom field encodings
- Alternative GAN architectures
- Specialized evaluation metrics
- Domain-specific preprocessing
References
Yin, Y., Lin, Z., Jin, M., Fanti, G., & Sekar, V. (2022). Practical GAN-Based Synthetic IP Header Trace Generation Using NetShare. SIGCOMM 2022.