# Caching System

Surfing Weights implements a two-level caching system to optimize performance and memory usage. This guide explains how the caching system works and how to configure it for your needs.
## Overview

The caching system operates at two levels:

- **Server-Side Caching**
  - Caches raw model weights
  - Shared across all clients
  - Memory-efficient storage
- **Client-Side Caching**
  - Caches loaded model components
  - Per-client cache
  - Optimized for inference
## Server-Side Cache

### Configuration

```python
from streaming_weights import WeightServer

server = WeightServer(
    model_path="./chunks/bert-tiny",
    cache_size_mb=200  # Set cache size in megabytes
)
```
Command-line configuration:
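A typical invocation might look like the following; the module path and flag names below are assumptions, so check your installation's `--help` output:

```bash
# Hypothetical invocation; module path and flag names are assumptions
python -m streaming_weights.server --model-path ./chunks/bert-tiny --cache-size-mb 200
```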
### Features

- **LRU (Least Recently Used) Eviction** (see the sketch after this list)
  - Automatically removes the least recently used weights
  - Optimizes memory usage
  - Adapts to access patterns
- **Size-Based Management**
  - Configurable maximum size
  - Automatic eviction when full
  - Memory usage monitoring
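To make the eviction policy concrete, here is a minimal, self-contained LRU sketch in Python; it illustrates the idea only and is not the server's actual implementation:

```python
from collections import OrderedDict

class LRUWeightCache:
    """Minimal LRU cache sketch; not the library's implementation."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._items = OrderedDict()  # key -> (weights, size)

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key][0]

    def put(self, key, weights, size):
        if key in self._items:
            self.used_bytes -= self._items.pop(key)[1]
        self._items[key] = (weights, size)
        self.used_bytes += size
        # Evict least recently used entries until back under the size limit
        while self.used_bytes > self.max_bytes and len(self._items) > 1:
            _, (_, evicted_size) = self._items.popitem(last=False)
            self.used_bytes -= evicted_size
```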
## Client-Side Cache

### Configuration

```python
from streaming_weights import StreamingBertModel

model = StreamingBertModel(
    model_name="prajjwal1/bert-tiny",
    cache_size=3  # Number of layers to cache
)
```
### Features

- **Component-Level Caching**
  - Caches entire model layers
  - Maintains layer state
  - Optimizes inference speed
- **Smart Prefetching**
  - Loads upcoming layers ahead of time for sequential access (see the sketch after this list)
- **Cache Warmup**
  - Preloads frequently accessed layers before inference
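The sketch below shows how prefetching layer `i + 1` can overlap with computation on layer `i`; `fetch_layer` is a hypothetical stand-in for the client's internal loader, not a documented API:

```python
import asyncio

async def fetch_layer(index):
    # Hypothetical stand-in for the client's internal weight fetch
    await asyncio.sleep(0.05)  # simulate network latency
    return f"layer-{index} weights"

async def run_layers(num_layers):
    # Kick off the first fetch, then always prefetch layer i + 1
    # while layer i is being computed
    next_fetch = asyncio.create_task(fetch_layer(0))
    for i in range(num_layers):
        weights = await next_fetch
        if i + 1 < num_layers:
            next_fetch = asyncio.create_task(fetch_layer(i + 1))
        print(f"computing with {weights}")  # compute overlaps the next fetch

asyncio.run(run_layers(4))
```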
## Performance Monitoring

### Cache Statistics

```python
# Get cache performance metrics
stats = model.get_inference_stats()
print(f"Cache hit rate: {stats['cache_hit_rate']:.2%}")
print(f"Average inference time: {stats['avg_inference_time']:.3f}s")

# Get current cache state
cache_info = model.get_cache_info()
print(f"Cached components: {cache_info['cached_components']}")
print(f"Cache memory usage: {cache_info['memory_usage_mb']:.2f} MB")
```
### Cache Management

```python
# Clear the cache manually
model.clear_cache()

# Update cache size at runtime
model.cache_size = 5  # Increase cache size
```
## Advanced Features

### Cache Optimization

- Access Pattern Optimization: the LRU policy adapts eviction order to observed access patterns
- Memory Management: size limits and eviction keep usage within the configured budget
### Distributed Caching

When using multiple servers:

```python
from streaming_weights import AdvancedWeightServer

server = AdvancedWeightServer(
    chunks_dir="./chunks",
    redis_url="redis://localhost:6379",  # Redis for distributed caching
    cache_size_mb=1000
)
```
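Conceptually, a lookup checks the local in-memory cache first, then the shared Redis cache, and only then falls back to disk. The sketch below (using the standard `redis` client; the helper names are hypothetical) illustrates that flow, not the server's actual internals:

```python
import redis

r = redis.Redis.from_url("redis://localhost:6379")
local_cache = {}  # per-server in-memory cache

def get_chunk(key, load_from_disk):
    # 1. Local in-memory cache (fastest path)
    if key in local_cache:
        return local_cache[key]
    # 2. Shared Redis cache, visible to all servers
    data = r.get(key)
    if data is None:
        # 3. Fall back to disk and populate Redis for the other servers
        data = load_from_disk(key)
        r.set(key, data)
    local_cache[key] = data
    return data
```

A Redis-backed second tier means a chunk fetched by one server does not have to be read from disk again by its peers.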
## Best Practices

- **Cache Size Configuration**
  - Set the server cache size based on available RAM
  - Adjust the client cache size based on the model architecture
  - Monitor cache hit rates for optimization (see the tuning sketch after this list)
- **Performance Optimization**
  - Use warmup for frequently accessed layers
  - Enable prefetching for sequential access
  - Clear the cache when switching tasks
- **Memory Management**
  - Monitor memory usage with `get_cache_info()`
  - Adjust cache sizes based on workload
  - Clear the cache when memory pressure is high
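Putting these practices together, a simple feedback loop can grow the client-side cache while the hit rate stays low. This sketch relies only on the `get_inference_stats()` call and the `cache_size` attribute shown earlier; the thresholds are example values:

```python
def tune_cache(model, target_hit_rate=0.8, max_cache_size=12):
    # Grow the client-side cache until the hit rate is acceptable
    stats = model.get_inference_stats()
    if stats['cache_hit_rate'] < target_hit_rate and model.cache_size < max_cache_size:
        model.cache_size += 1
        print(f"Increased cache size to {model.cache_size}")
```

Call `tune_cache(model)` periodically between batches and stop adjusting once the hit rate stabilizes above the target.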
## Troubleshooting

### Common Issues

- **High Memory Usage**
  - Reduce the cache size
  - Clear the cache more frequently
  - Monitor with `get_cache_info()`
- **Poor Cache Performance**
  - Check cache hit rates (see the diagnostic sketch after this list)
  - Adjust the cache size
  - Review access patterns
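A small diagnostic helper that combines the two inspection calls from earlier can narrow down which issue you are hitting; the 500 MB threshold is just an example value:

```python
def diagnose_cache(model):
    stats = model.get_inference_stats()
    info = model.get_cache_info()
    print(f"Hit rate:     {stats['cache_hit_rate']:.2%}")
    print(f"Memory usage: {info['memory_usage_mb']:.2f} MB")
    print(f"Cached:       {info['cached_components']}")
    if stats['cache_hit_rate'] < 0.5:
        print("-> Try a larger cache_size or warm up hot layers")
    if info['memory_usage_mb'] > 500:  # example threshold
        print("-> Try smaller caches or call clear_cache()")
```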
### Cache Monitoring

```python
import asyncio

async def monitor_cache(model):
    # Periodically check cache performance while inference is running
    while True:
        stats = model.get_inference_stats()
        if stats['cache_hit_rate'] < 0.5:
            print("Warning: Low cache hit rate")
        await asyncio.sleep(60)  # Check every minute
```
## Next Steps

- Learn about Error Handling
- Explore Model Support
- See Example Use Cases