# Performance Guide

This document describes BooFun's performance characteristics, optimization strategies, and benchmarks.

## Quick Summary

| Operation | Complexity | n=10 | n=14 | n=18 | n=20 |
|-----------|------------|------|------|------|------|
| Truth table creation | O(2^n) | <1ms | ~10ms | ~200ms | ~1s |
| Walsh-Hadamard Transform | O(n·2^n) | <1ms | ~5ms | ~100ms | ~500ms |
| Influence computation | O(2^n) | <1ms | ~5ms | ~80ms | ~350ms |
| Property testing | O(queries) | <1ms | <1ms | <5ms | <10ms |

## Optimization Tiers

BooFun uses multiple optimization strategies and automatically selects the best one available:

### Tier 1: NumPy Vectorization (Default)

- Always available
- 10-100x faster than pure Python
- Used for all array operations

### Tier 2: Numba JIT Compilation (Recommended)

- Requires: `pip install numba`
- 2-10x faster than NumPy for iterative operations
- JIT-compiles hot paths
- Used for: WHT, influences, sensitivity

### Tier 3: GPU Acceleration (Optional)

- Requires: `pip install cupy-cuda11x` (match your CUDA version)
- 10-100x faster for large n (n > 16)
- Used for: WHT, batch operations

## Memory Optimization

### Truth Table Representations

| Format | Memory (n=20) | Access Time | Best For |
|--------|---------------|-------------|----------|
| numpy bool | 1 MB | O(1) | n ≤ 14 |
| packed bitarray | 128 KB | O(1) | 14 < n ≤ 20 |
| sparse | ~k·12 bytes | O(1) | High sparsity |

### Auto-Selection

```python
from boofun.core.auto_representation import recommend_representation

# Get a recommendation for your use case
rec = recommend_representation(n_vars=18, sparsity=0.1)
print(rec)
# {'representation': 'sparse_truth_table',
#  'reason': 'Sparsity 10.0% < 30%'}
```

### Using Packed Truth Tables

```python
from boofun.core.representations.packed_truth_table import (
    create_packed_truth_table,
    memory_comparison,
)

# Convert an existing truth table (a numpy array of function values)
packed = create_packed_truth_table(truth_table)

# Memory savings
print(memory_comparison(20))
# packed_bitarray: 131,072 bytes (128.0 KB)
# numpy_bool: 1,048,576 bytes (1024.0 KB)
# savings: 8x
```

## Parallelization

### Batch Operations

```python
import boofun as bf
from boofun.core.optimizations import parallel_batch_influences, parallel_batch_fourier

# Compute influences for many functions at once
functions = [bf.random(n=10) for _ in range(100)]
all_influences = parallel_batch_influences(functions)

# Fourier coefficients in parallel
all_fourier = parallel_batch_fourier(functions)
```

### Numba Parallel Loops

Numba-compiled functions automatically use all CPU cores:

```python
# These are JIT-compiled with Numba:
# - _vectorized_influences_numba
# - _total_influence_numba
# - _fast_wht_numba

# Check whether Numba is being used:
from boofun.core.optimizations import HAS_NUMBA, INFLUENCES_BACKEND
print(f"Numba available: {HAS_NUMBA}")
print(f"Influences backend: {INFLUENCES_BACKEND}")
```

## Caching and Memoization

### Global Compute Cache

```python
from boofun.core.optimizations import get_global_cache

cache = get_global_cache()

# Check cache statistics
print(cache.stats())
# {'size': 42, 'max_size': 500, 'hits': 156, 'misses': 42, 'hit_rate': 0.79}

# Clear the cache if needed
cache.clear()
```

### Instance-Level Caching

BooleanFunction instances cache:

- Fourier coefficients (`_fourier_cache`)
- Influences (`_influences_cache`)
- Decision tree depth (`_dt_cache`)

```python
import boofun as bf

f = bf.majority(5)

# First call computes
_ = f.fourier()  # ~1ms

# Second call returns the cached result
_ = f.fourier()  # ~0.001ms
```

## Benchmarks
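The timings below all exercise the fast Walsh-Hadamard transform, which touches each of the 2^n table entries once per variable, hence the O(n·2^n) cost quoted in the summary table. For reference, here is a minimal NumPy sketch of that textbook in-place butterfly; the function name `wht` is illustrative, not part of BooFun's API, and the optimized `_fast_wht_numba` and GPU backends accelerate the same transform with compiled kernels.

```python
import numpy as np

def wht(values):
    """Textbook iterative Walsh-Hadamard transform, O(n * 2^n).

    `values` is a length-2^n array of +/-1 function values; entry k of
    the result, divided by 2^n, is the Fourier coefficient of mask k.
    Reference sketch only -- not BooFun's optimized implementation.
    """
    a = np.asarray(values, dtype=np.float64).copy()
    h = 1
    while h < len(a):
        # One butterfly pass: combine adjacent blocks of size h pairwise.
        for start in range(0, len(a), 2 * h):
            x = a[start:start + h].copy()
            y = a[start + h:start + 2 * h].copy()
            a[start:start + h] = x + y
            a[start + h:start + 2 * h] = x - y
        h *= 2
    return a

# Sanity check: parity on 3 variables in the +/-1 convention puts all
# Fourier weight on the top coefficient (mask 7).
parity = np.array([(-1) ** bin(x).count("1") for x in range(8)])
print(wht(parity) / 8)  # [0. 0. 0. 0. 0. 0. 0. 1.]
```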
### Walsh-Hadamard Transform

```
n=10: NumPy: 0.5ms,  Numba: 0.2ms,  GPU: 0.1ms
n=14: NumPy: 8ms,    Numba: 3ms,    GPU: 0.5ms
n=18: NumPy: 150ms,  Numba: 50ms,   GPU: 5ms
n=20: NumPy: 700ms,  Numba: 200ms,  GPU: 15ms
```

### Influence Computation

```
n=10: NumPy: 0.3ms,  Numba: 0.1ms
n=14: NumPy: 5ms,    Numba: 1ms
n=18: NumPy: 90ms,   Numba: 20ms
n=20: NumPy: 400ms,  Numba: 80ms
```

### Property Testing (1000 queries)

```
BLR linearity: ~2ms (independent of n)
Junta test:    ~5ms (for a k-junta)
Monotonicity:  ~3ms
```

## Best Practices

### 1. Install Numba

```bash
pip install numba
```

This alone provides a 2-10x speedup for most operations.

### 2. Use Appropriate Representations

```python
# For n > 14, consider a sparse or packed representation
from boofun.core.auto_representation import AdaptiveFunction

# Automatically chooses the best format for an existing truth table
# (a numpy array of function values)
f = AdaptiveFunction(truth_table, n_vars=18)
print(f.format)  # 'packed' or 'sparse'
```

### 3. Batch Operations

```python
from boofun.core.optimizations import parallel_batch_influences

# functions: a list of BooleanFunction instances

# Bad: sequential
results = [f.influences() for f in functions]

# Good: parallel
results = parallel_batch_influences(functions)
```

### 4. Reuse Functions

```python
import boofun as bf

# Bad: recreate the function on every iteration
for _ in range(100):
    f = bf.AND(10)
    print(f.total_influence())

# Good: create once and reuse
f = bf.AND(10)
for _ in range(100):
    print(f.total_influence())  # Uses cached values
```

### 5. Profile Your Code

```python
import time

import boofun as bf
from boofun.core.optimizations import WHT_BACKEND, INFLUENCES_BACKEND

print(f"WHT backend: {WHT_BACKEND}")
print(f"Influences backend: {INFLUENCES_BACKEND}")

# Time specific operations
f = bf.random(n=16)

start = time.time()
_ = f.fourier()
print(f"WHT: {time.time() - start:.3f}s")

start = time.time()
_ = f.influences()
print(f"Influences: {time.time() - start:.3f}s")
```

## Running Benchmarks

```bash
# Run all benchmarks
pytest tests/benchmarks/ -v --benchmark-only

# Run a specific benchmark
pytest tests/benchmarks/test_external_benchmarks.py -v

# With comparison to a previous run
pytest tests/benchmarks/ --benchmark-compare
```

## Docker Performance

The Docker images include Numba by default:

```bash
# Run benchmarks in Docker
docker-compose run benchmark
```

## Scaling Guidelines

| n | Recommended Approach |
|---|----------------------|
| ≤ 14 | Dense truth table, Numba |
| 15-18 | Packed/sparse, Numba; consider GPU |
| 19-22 | GPU required, sparse representation |
| > 22 | Sampling/approximation algorithms (see the sketch below) |

## Future Optimizations

- Distributed computation with Dask
- Further GPU kernel optimization
- Memory-mapped truth tables for very large n
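For the last row of the scaling table, here is a minimal sketch of the sampling approach, assuming only black-box access to the function; `estimate_influence` and `maj` are illustrative names, not BooFun APIs. The influence of coordinate i is the probability that flipping bit i changes the output, which a Monte Carlo estimate approximates without ever materializing the 2^n truth table.

```python
import numpy as np

def estimate_influence(f, n, i, samples=10_000, seed=0):
    """Monte Carlo estimate of Inf_i[f] = Pr[f(x) != f(x with bit i flipped)].

    `f` maps a length-n 0/1 numpy array to 0/1. The standard error
    shrinks like 1/sqrt(samples). Illustrative sketch, not a BooFun API.
    """
    rng = np.random.default_rng(seed)
    flips = 0
    for _ in range(samples):
        x = rng.integers(0, 2, size=n)
        y = x.copy()
        y[i] ^= 1  # flip coordinate i
        if f(x) != f(y):
            flips += 1
    return flips / samples

# Example: majority on 25 variables. The true influence of each
# coordinate is C(24, 12) / 2^24 ≈ 0.161.
def maj(x):
    return int(2 * x.sum() > len(x))

print(estimate_influence(maj, n=25, i=0))  # ≈ 0.16
```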