internal-docs/knowledges/NETSHARE_DESIGN_DOCUMENTATION.md
2025-12-26 13:24:23 +00:00

NetShare: Design & Implementation Documentation

Table of Contents

  1. Overview
  2. Architecture
  3. Core Components
  4. Data Processing Pipeline
  5. Model Implementation
  6. Configuration System
  7. Distributed Computing
  8. Field Processing System
  9. Usage Examples
  10. Dependencies

Overview

NetShare is a GAN-based framework for generating synthetic network traffic traces (packet headers and flow headers) that preserve the statistical properties of real network data while limiting the exposure of sensitive information. The system addresses the key challenges in synthetic network data generation: fidelity, scalability, and privacy.

Key Features

  • GAN-based Generation: Uses DoppelGANger architecture for realistic network trace generation
  • Multi-format Support: Handles both PCAP and NetFlow formats
  • Distributed Processing: Leverages Ray for scalable training and generation
  • Privacy Preservation: Supports differential privacy (DP) options
  • Flexible Encoding: Various encoding strategies for different data types
  • Quality Assessment: Built-in visualization and evaluation tools

Architecture

NetShare follows a modular, component-based architecture with clear separation of concerns:

┌─────────────────┐    ┌──────────────────────┐    ┌─────────────────┐
│   Generator     │───▶│ Model Manager Layer  │───▶│    Model        │
│                 │    │ (NetShareManager)    │    │ (DoppelGANger)  │
└─────────────────┘    └──────────────────────┘    └─────────────────┘
         │                        │                         │
         ▼                        ▼                         ▼
┌─────────────────┐    ┌──────────────────────┐    ┌─────────────────┐
│ Pre/Post        │    │ Ray Distributed      │    │ Training/       │
│ Processor       │    │ Computing            │    │ Generation      │
│ (NetShare)      │    │ (Parallel Processing)│    │ Pipeline        │
└─────────────────┘    └──────────────────────┘    └─────────────────┘

Component Layers

  1. Generator Layer: Main orchestration class that manages the complete workflow
  2. Model Manager Layer: Handles training and generation workflows
  3. Model Layer: Implements the actual GAN algorithms
  4. Pre/Post Processor Layer: Handles data preparation and transformation
  5. Ray Layer: Provides distributed computing capabilities

Core Components

Generator Class

The Generator class serves as the main entry point and workflow coordinator:

from netshare import Generator

generator = Generator(config="config.json")
generator.train(work_folder="results/")
generator.generate(work_folder="results/")
generator.visualize(work_folder="results/")

Key Methods:

  • train(): Preprocesses data and trains the GAN model
  • generate(): Generates synthetic data using the trained model
  • train_and_generate(): Executes both training and generation in sequence
  • visualize(): Creates visual comparisons between real and synthetic data

Model Manager

The NetShareManager handles the training and generation workflows:

  • Training Workflow: Manages data preprocessing, model training, and checkpointing
  • Generation Workflow: Handles attribute generation, feature generation, and data reconstruction
  • Chunked Processing: Splits large datasets into chunks for efficient processing

Model

The DoppelGANgerTorchModel implements the core GAN architecture:

  • Separate Generators: Distinct generators for attributes and features
  • Conditional Generation: Features generated conditioned on attributes
  • Multiple Discriminators: Separate discriminators for attributes and features
  • Sequence Handling: Supports variable-length sequences with padding

Data Processing Pipeline

Preprocessing Stage

The preprocessing pipeline transforms raw network data into GAN-ready format:

  1. Data Ingestion: Supports PCAP and CSV formats
  2. Data Chunking: Splits large datasets by size or time windows
  3. Field Processing: Applies appropriate encodings to different field types
  4. Normalization: Normalizes continuous fields to [0,1] range
  5. Encoding: Converts categorical fields using various strategies
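
Steps 4 and 5 above can be illustrated for a continuous field. The following is a minimal sketch of min-max scaling to [0, 1] (illustrative only; NetShare's actual field classes support additional options such as log1p):

```python
import numpy as np

def minmax_normalize(values):
    """Scale a continuous field to [0, 1], returning the parameters
    needed to invert the transform during post-processing."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    span = hi - lo if hi > lo else 1.0  # guard against constant columns
    return (values - lo) / span, (lo, hi)

def minmax_denormalize(normed, params):
    """Invert minmax_normalize using the stored (min, max) pair."""
    lo, hi = params
    span = hi - lo if hi > lo else 1.0
    return np.asarray(normed) * span + lo

# Round trip over a toy packet-size column
pkt_sizes = [64, 128, 512, 1500]
normed, params = minmax_normalize(pkt_sizes)
restored = minmax_denormalize(normed, params)
```

Storing the `(min, max)` pair alongside the normalized data is what allows the denormalization step in post-processing to be exact.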

Field Types and Encodings

NetShare supports multiple field types with specialized processing:

  • Continuous Fields: Numerical data with min-max normalization
  • Discrete Fields: Categorical data with one-hot encoding
  • Bit Fields: Integer data converted to bit representations (e.g., IP addresses)
  • Word2Vec Fields: Embedding-based representation for categorical data

Post-processing Stage

The post-processing pipeline reconstructs synthetic data to original format:

  1. Denormalization: Reverses normalization applied during preprocessing
  2. Decoding: Converts encoded representations back to original format
  3. Format Conversion: Outputs data in original format (PCAP/NetFlow)
  4. Quality Assessment: Evaluates synthetic data quality
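
As one concrete decoding step: the configuration example later in this document encodes timestamps as interarrival times, so post-processing must reconstruct absolute timestamps with a cumulative sum. A minimal sketch (function names are illustrative, not NetShare's API):

```python
import itertools

def encode_interarrival(timestamps):
    """Turn absolute timestamps into deltas; the first entry keeps
    the starting timestamp so the transform is invertible."""
    return [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]

def decode_interarrival(deltas):
    """Cumulative sum restores absolute timestamps."""
    return list(itertools.accumulate(deltas))
```

Interarrival encoding is useful because deltas within a flow tend to have a much tighter, easier-to-model distribution than raw timestamps.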

Model Implementation

DoppelGANger Architecture

The core model implements the DoppelGANger architecture, which separates generation into two parts:

  • Attribute Generation: Static properties of network flows (IP addresses, ports, protocol)
  • Feature Generation: Time-series data within flows (timestamps, packet sizes)
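
The split can be pictured in terms of array shapes: attributes are one static vector per flow, while features are one padded time series per flow. A toy sketch (shapes and field choices here are illustrative):

```python
import numpy as np

n_flows, max_seq_len = 4, 10
n_attr_dims, n_feat_dims = 3, 2  # e.g. (srcip, dstip, proto) and (timestamp, pkt_size)

# One static attribute vector per flow.
attributes = np.zeros((n_flows, n_attr_dims))

# One time series per flow, padded to max_seq_len; a per-flow length
# array records each flow's true (unpadded) sequence length.
features = np.zeros((n_flows, max_seq_len, n_feat_dims))
seq_lengths = np.array([10, 7, 3, 10])
```

Conditioning the feature generator on the attribute vector is what keeps each synthetic time series consistent with its flow's static properties.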

Key Components:

  • Attribute Generator: Creates static flow properties
  • Feature Generator: Creates time-series data conditioned on attributes
  • Feature Discriminator: Distinguishes real vs. synthetic features
  • Attribute Discriminator: Distinguishes real vs. synthetic attributes

Training Process:

  • Alternating optimization of generator and discriminators
  • Gradient penalty for WGAN-GP stability
  • Sequence packing for variable-length sequences
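
The gradient penalty term can be illustrated without an autograd framework by using a linear critic D(x) = w·x, whose input gradient is simply w. Below is a NumPy sketch of the WGAN-GP penalty λ·(‖∇D(x̂)‖₂ − 1)² evaluated at interpolates x̂ between real and fake samples; NetShare's actual model computes the gradient with PyTorch autograd:

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.6, 0.8])        # linear critic D(x) = w @ x, so grad_x D = w
real = rng.normal(size=(5, 2))  # toy "real" batch
fake = rng.normal(size=(5, 2))  # toy "generated" batch

# Interpolates x_hat on lines between real and fake samples; a nonlinear
# critic's gradient would be evaluated at these points via autograd.
eps = rng.uniform(size=(5, 1))
interpolates = eps * real + (1 - eps) * fake

grad = np.tile(w, (5, 1))                  # gradient of a linear critic is constant
grad_norms = np.linalg.norm(grad, axis=1)  # here all equal ||w|| = 1.0
lam = 10.0
penalty = lam * np.mean((grad_norms - 1.0) ** 2)
# ||w|| = sqrt(0.36 + 0.64) = 1, so this particular critic incurs ~zero penalty
```

The penalty pushes the critic's gradient norm toward 1, which is what stabilizes WGAN training relative to hard weight clipping.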

Model Configuration

The model supports various hyperparameters:

  • batch_size: Training batch size
  • sample_len: Number of time steps emitted per generator RNN pass; a list of values trains one candidate model per value for selection
  • epochs: Number of training epochs
  • learning_rates: Generator and discriminator learning rates
  • network_architecture: Generator/discriminator layer configurations

Configuration System

NetShare uses a hierarchical configuration system:

Global Configuration

{
  "global_config": {
    "original_data_file": "path/to/data.csv",
    "overwrite": true,
    "dataset_type": "netflow",
    "n_chunks": 2,
    "dp": false
  }
}

Pre/Post Processor Configuration

Defines how to process different data fields:

  • Metadata fields (static flow properties)
  • Timeseries fields (dynamic flow properties)
  • Encoding strategies for each field type

Model Configuration

Specifies GAN hyperparameters and architecture:

  • Network dimensions and layers
  • Training parameters (epochs, learning rates)
  • Privacy settings (if using DP)

Distributed Computing

Ray Integration

NetShare leverages Ray for distributed computing:

  • Parallel Preprocessing: Multiple data chunks processed in parallel
  • Distributed Training: Model training across multiple nodes/GPUs
  • Resource Management: Automatic load balancing and resource allocation
  • Fault Tolerance: Resilient to node failures during long-running jobs

Chunked Processing

For large datasets, NetShare splits data into chunks:

  • Each chunk processed independently
  • Results merged after processing
  • Memory-efficient for large datasets
  • Enables parallel processing
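
The chunk-then-merge pattern can be sketched as follows. In NetShare the per-chunk work is dispatched as Ray remote tasks; a thread pool stands in here so the sketch stays self-contained, and the function names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(rows, n_chunks):
    """Split a dataset into n_chunks pieces of roughly equal size
    (cf. n_chunks in global_config)."""
    size = (len(rows) + n_chunks - 1) // n_chunks
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def process_chunk(chunk):
    # Placeholder per-chunk work; in NetShare this would be
    # preprocessing, training, or generation for one chunk.
    return [row * 2 for row in chunk]

def run_pipeline(rows, n_chunks=2):
    chunks = split_into_chunks(rows, n_chunks)
    # Each chunk is independent, so they can run in parallel;
    # Ray would schedule these across nodes and GPUs.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(process_chunk, chunks))
    # Merge per-chunk results back into a single dataset.
    return [row for chunk in results for row in chunk]
```

Because chunks share no state, this pattern bounds peak memory by chunk size and scales out linearly with available workers.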

Field Processing System

Field Types

NetShare implements a flexible field processing system:

ContinuousField

  • Handles numerical data
  • Supports various normalization options (min-max, log1p, etc.)
  • Preserves statistical properties during normalization/denormalization

DiscreteField

  • Processes categorical data
  • One-hot encoding for neural network compatibility
  • Maintains categorical relationships
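
One-hot encoding and its inverse can be sketched in a few lines (illustrative, not NetShare's DiscreteField API):

```python
import numpy as np

def one_hot_encode(values):
    """Map categorical values to one-hot rows; returns the category
    list so decoding can invert the mapping."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)))
    for row, v in enumerate(values):
        encoded[row, index[v]] = 1.0
    return encoded, categories

def one_hot_decode(encoded, categories):
    """argmax recovers the category, even from the soft (non-binary)
    outputs a GAN produces."""
    return [categories[i] for i in np.argmax(encoded, axis=1)]
```

Decoding via argmax is what makes the scheme robust during post-processing: generated rows rarely contain exact 0/1 values.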

BitField

  • Converts integers to bit representations
  • Useful for IP addresses (32-bit representation)
  • Preserves bit-level patterns
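
The 32-bit IP representation can be sketched as follows (a minimal round trip; NetShare's BitField generalizes this to arbitrary integers and bit widths):

```python
def ip_to_bits(ip):
    """Encode a dotted-quad IPv4 address as 32 bits, most significant first."""
    value = 0
    for octet in ip.split("."):
        value = (value << 8) | int(octet)
    return [(value >> (31 - i)) & 1 for i in range(32)]

def bits_to_ip(bits):
    """Decode 32 bits back to dotted-quad form."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))
```

Representing addresses bitwise lets the model learn prefix structure (e.g., subnets share high-order bits) instead of treating each address as an unrelated category.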

Word2VecField

  • Embeds categorical data using Word2Vec models
  • Captures semantic relationships between categories
  • Reduces dimensionality for high-cardinality categorical data

Encoding Strategies

Different encoding strategies optimize for different data types:

  • Bit Encoding: For IP addresses and other integer identifiers
  • Word2Vec: For categorical fields with semantic relationships
  • Categorical: Standard one-hot encoding for discrete values
  • Normalization: Various schemes for continuous values

Usage Examples

Basic Usage

import netshare.ray as ray
from netshare import Generator

# Configure the Ray wrapper (netshare.ray); with enabled = False,
# init()/shutdown() are no-ops and everything runs locally
ray.config.enabled = False
ray.init(address="auto")

# Create generator with configuration
generator = Generator(config="config.json")

# Train the model
generator.train(work_folder="results/")

# Generate synthetic data
generator.generate(work_folder="results/")

# Visualize results
generator.visualize(work_folder="results/")

ray.shutdown()

Configuration Example

{
  "global_config": {
    "original_data_file": "data/netflow.csv",
    "overwrite": true,
    "dataset_type": "netflow",
    "n_chunks": 2,
    "dp": false
  },
  "pre_post_processor": {
    "class": "NetsharePrePostProcessor",
    "config": {
      "timestamp": {
        "column": "ts",
        "generation": true,
        "encoding": "interarrival",
        "normalization": "ZERO_ONE"
      },
      "metadata": [
        {
          "column": "srcip",
          "type": "integer",
          "encoding": "bit",
          "n_bits": 32
        },
        {
          "column": "srcport",
          "type": "integer",
          "encoding": "word2vec_port"
        }
      ],
      "timeseries": [
        {
          "column": "pkt",
          "type": "float",
          "normalization": "ZERO_ONE"
        }
      ]
    }
  },
  "model": {
    "class": "DoppelGANgerTorchModel",
    "config": {
      "batch_size": 100,
      "sample_len": [1, 5, 10],
      "epochs": 40
    }
  }
}

Dependencies

NetShare requires the following key dependencies:

  • PyTorch: Deep learning framework for GAN implementation
  • Ray: Distributed computing framework
  • Pandas: Data manipulation and analysis
  • NumPy: Numerical computing
  • Gensim: Word2Vec implementation
  • Scikit-learn: Machine learning utilities
  • Matplotlib: Visualization
  • config_io: Configuration management
  • SDMetrics: Synthetic data quality evaluation

Installation

pip install -e NetShare/
pip install -e SDMetrics_timeseries/

Performance and Scalability

Memory Management

  • Chunked processing for large datasets
  • Efficient data loading and preprocessing
  • Model checkpointing to handle long training runs

Parallel Processing

  • Ray-based distributed computing
  • Parallel preprocessing of data chunks
  • Multi-GPU training support

Quality Assurance

  • Built-in visualization tools
  • Statistical similarity metrics
  • Downstream task evaluation capabilities

Privacy Considerations

Differential Privacy

  • Optional DP support for privacy-preserving generation
  • Configurable privacy budget
  • Trade-off between privacy and utility

Data Handling

  • Generated traces contain no records copied directly from the raw network data
  • Aggregate statistical properties are preserved while individual records are synthetic
  • Can help satisfy privacy-regulation requirements, depending on configuration (e.g., whether DP is enabled)

Evaluation and Validation

Built-in Metrics

  • Distributional similarity metrics
  • Statistical property preservation
  • Downstream task performance evaluation

Visualization Tools

  • Side-by-side comparison of real vs. synthetic data
  • Distribution plots
  • Correlation analysis

Extensibility

Plugin Architecture

  • Pluggable pre/post processors
  • Custom model implementations
  • Extendable field types
  • Configurable workflows

Customization Points

  • Custom field encodings
  • Alternative GAN architectures
  • Specialized evaluation metrics
  • Domain-specific preprocessing

References

Yin, Y., Lin, Z., Jin, M., Fanti, G., & Sekar, V. (2022). Practical GAN-Based Synthetic IP Header Trace Generation Using NetShare. In Proceedings of ACM SIGCOMM 2022.