Core Concepts

Overview

Surfing Weights is built around a few core concepts that enable efficient streaming of model weights: chunking, streaming, storage backends, and caching. Understanding them will help you make the most of the library.

Weight Chunking

What is Chunking?

Chunking is the process of breaking down a large model into smaller, manageable pieces that can be:

  • Stored efficiently
  • Transmitted quickly
  • Loaded on demand

How Chunking Works

  1. Model Analysis
       • The model's architecture is analyzed
       • Weights are grouped by layers

  2. Chunk Creation
       • Each layer's weights are saved separately
       • Metadata about chunks is stored
       • Configuration is preserved
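
The chunking steps above can be sketched roughly as follows. This is a minimal illustration, not the library's actual implementation: the function names `layer_key` and `chunk_weights`, the `.chunk` file extension, the `manifest.json` layout, and the use of `pickle` for serialization are all assumptions made for the example.

```python
import json
import pickle
from collections import defaultdict
from pathlib import Path


def layer_key(param_name: str) -> str:
    """Map a parameter name to a layer id (hypothetical naming rule)."""
    parts = param_name.split(".")
    # "encoder.layer.3.attn.weight" -> "encoder.layer.3"
    for i, part in enumerate(parts):
        if part.isdigit():
            return ".".join(parts[: i + 1])
    # Parameters outside a numbered block are grouped by their first component.
    return parts[0]


def chunk_weights(weights: dict, config: dict, out_dir: Path) -> dict:
    """Write one file per layer plus a manifest describing the chunks."""
    out_dir.mkdir(parents=True, exist_ok=True)

    # Model analysis: group parameters by the layer they belong to.
    groups = defaultdict(dict)
    for name, tensor in weights.items():
        groups[layer_key(name)][name] = tensor

    # Chunk creation: save each layer separately and record metadata.
    manifest = {"config": config, "chunks": {}}
    for layer, params in groups.items():
        path = out_dir / f"{layer}.chunk"
        # pickle is used only to keep the sketch dependency-free.
        path.write_bytes(pickle.dumps(params))
        manifest["chunks"][layer] = {
            "file": path.name,
            "params": sorted(params),
            "bytes": path.stat().st_size,
        }
    (out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

Because the manifest records which parameters live in which file, a client can read it first and decide later which chunks to fetch.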

Weight Streaming

The Streaming Process

  1. Initial Setup
       • Client connects to weight server
       • Model architecture is initialized
       • Only metadata is loaded initially

  2. On-Demand Loading
       • Weights are requested as needed
       • Server streams requested chunks
       • Client processes received weights

  3. Smart Caching
       • Frequently used weights are cached
       • LRU policy manages cache size
       • Cold weights are released
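
As a rough sketch of on-demand loading with an LRU cache, the client below fetches a chunk only the first time it is requested and releases the least recently used chunk once the cache is full. The class name `StreamingWeights`, the `fetch_chunk` callback (standing in for the network request to the weight server), and the `max_chunks` parameter are hypothetical and not taken from the library.

```python
from collections import OrderedDict
from typing import Callable


class StreamingWeights:
    """Fetch chunks lazily and keep only the most recently used ones."""

    def __init__(self, fetch_chunk: Callable[[str], bytes], max_chunks: int = 2):
        self._fetch = fetch_chunk      # stand-in for a request to the server
        self._max = max_chunks         # cache capacity, counted in chunks
        self._cache = OrderedDict()    # oldest entry first, newest last
        self.fetches = 0               # number of requests actually sent

    def get(self, chunk_id: str) -> bytes:
        if chunk_id in self._cache:
            # Cache hit: mark the chunk as most recently used.
            self._cache.move_to_end(chunk_id)
            return self._cache[chunk_id]
        # Cache miss: request the chunk, then evict the coldest entry if full.
        data = self._fetch(chunk_id)
        self.fetches += 1
        self._cache[chunk_id] = data
        if len(self._cache) > self._max:
            self._cache.popitem(last=False)
        return data

    def cached(self) -> list:
        """Chunk ids currently held, from least to most recently used."""
        return list(self._cache)
```

With `max_chunks=2`, requesting `layer.0`, `layer.1`, `layer.0`, and then `layer.2` sends three requests and leaves `layer.0` and `layer.2` in the cache, because `layer.1` was the least recently used entry when the third chunk arrived.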

Storage Backends

Surfing Weights supports multiple storage backends:

  1. Local Filesystem
       • Direct access to local files
       • Fastest for local deployment

  2. Amazon S3
       • Cloud-based storage
       • Scalable and reliable
       • Good for distributed setups

  3. Custom Backends
       • Extensible interface
       • Support for other storage systems
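
One way to picture an extensible backend interface is an abstract base class that every storage system implements. The sketch below is an assumption about the general shape, not the library's real API: `StorageBackend`, `read_chunk`, `list_chunks`, `LocalBackend`, and `InMemoryBackend` are hypothetical names, and the `.chunk` file extension is invented for the example.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class StorageBackend(ABC):
    """Hypothetical interface that every backend would implement."""

    @abstractmethod
    def read_chunk(self, chunk_id: str) -> bytes:
        """Return the raw bytes of one chunk."""

    @abstractmethod
    def list_chunks(self) -> list:
        """Return the ids of all available chunks."""


class LocalBackend(StorageBackend):
    """Reads chunks directly from a directory on the local filesystem."""

    def __init__(self, root):
        self.root = Path(root)

    def read_chunk(self, chunk_id: str) -> bytes:
        return (self.root / f"{chunk_id}.chunk").read_bytes()

    def list_chunks(self) -> list:
        return sorted(p.stem for p in self.root.glob("*.chunk"))


class InMemoryBackend(StorageBackend):
    """A custom backend: keeps chunks in a dict (handy for tests)."""

    def __init__(self, chunks: dict):
        self._chunks = dict(chunks)

    def read_chunk(self, chunk_id: str) -> bytes:
        return self._chunks[chunk_id]

    def list_chunks(self) -> list:
        return sorted(self._chunks)


def total_size(backend: StorageBackend) -> int:
    """Works with any backend because it relies only on the interface."""
    return sum(len(backend.read_chunk(c)) for c in backend.list_chunks())
```

An S3-backed class would implement the same two methods, so code such as `total_size` would not need to change when the storage system does.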

Caching System

Cache Levels

  1. Server-Side Cache
       • Reduces storage backend access
       • Shared across clients
       • Configurable size

  2. Client-Side Cache
       • Reduces network requests
       • Per-client caching
       • Memory-efficient

Cache Management

  • LRU (Least Recently Used) policy
  • Configurable cache sizes
  • Automatic memory management
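
To illustrate how the two cache levels and a size-limited LRU policy could fit together, the sketch below uses a cache with a byte budget and a read-through lookup that tries the client cache, then the server cache, then the storage backend. `ByteBudgetLRU`, `read_through`, and `backend_read` are hypothetical names; in a real deployment the two caches would live in separate processes, which this single-process example ignores.

```python
from collections import OrderedDict


class ByteBudgetLRU:
    """LRU cache whose size limit is expressed in bytes, not entries."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.used = 0
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value: bytes) -> None:
        if key in self._items:
            self.used -= len(self._items.pop(key))
        if len(value) > self.max_bytes:
            return  # larger than the whole budget, so never cached
        self._items[key] = value
        self.used += len(value)
        # Evict least recently used entries until the budget is met.
        while self.used > self.max_bytes:
            _, evicted = self._items.popitem(last=False)
            self.used -= len(evicted)


def read_through(key, client_cache, server_cache, backend_read):
    """Return (value, source), where source names the level that answered."""
    value = client_cache.get(key)
    if value is not None:
        return value, "client"
    value = server_cache.get(key)
    source = "server"
    if value is None:
        value = backend_read(key)  # slowest path: hit the storage backend
        server_cache.put(key, value)
        source = "backend"
    client_cache.put(key, value)
    return value, source
```

A small client budget with a larger server budget means a chunk evicted on the client can still be served from the server cache without touching the storage backend.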

Model Support

Surfing Weights supports various transformer architectures:

  • BERT
  • GPT
  • T5
  • LLaMA
  • Custom models

Each model type has specific:

  • Chunking strategies
  • Loading patterns
  • Optimization techniques

Monitoring

Built-in monitoring provides:

  1. Performance Metrics
       • Request latency
       • Cache hit rates
       • Memory usage

  2. Health Checks
       • Server status
       • Backend connectivity
       • Resource utilization
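
The kind of bookkeeping behind these metrics can be sketched in a few lines. This is not the library's monitoring API: `Metrics`, `record`, `hit_rate`, `mean_latency`, and `health_check` are hypothetical names, the returned fields are invented for illustration, and memory usage is left out to keep the example portable.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Metrics:
    """Collects per-request latency and cache hit/miss counts."""

    latencies: list = field(default_factory=list)
    hits: int = 0
    misses: int = 0

    def record(self, seconds: float, cache_hit: bool) -> None:
        self.latencies.append(seconds)
        if cache_hit:
            self.hits += 1
        else:
            self.misses += 1

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    def mean_latency(self) -> float:
        if not self.latencies:
            return 0.0
        return sum(self.latencies) / len(self.latencies)


def health_check(backend_ok: Callable[[], bool], metrics: Metrics) -> dict:
    """Summarize server status; a failing backend probe means 'degraded'."""
    try:
        connected = bool(backend_ok())
    except Exception:
        connected = False
    return {
        "status": "ok" if connected else "degraded",
        "backend_connected": connected,
        "cache_hit_rate": metrics.hit_rate(),
        "mean_latency_s": metrics.mean_latency(),
    }
```

Here `backend_ok` stands for any cheap probe of the storage backend, such as listing chunks; an exception from the probe is treated the same as a failed connection.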

Next Steps