# Test Guidelines How to write tests that actually catch bugs. ## The Convolution Bug: A Case Study ### What Happened ```python # The bug: silently thresholded real values to Boolean result_tt = (conv_values < 0).astype(bool) # Lost magnitude! ``` ### Why Tests Didn't Catch It ```python # ORIGINAL TESTS (BAD) def test_convolution_same_function(self): result = convolution(f, f) assert result is not None # Only checks existence assert result.n_vars == 2 # Only checks structure ``` ### What Tests Should Have Done ```python # FIXED TESTS (GOOD) def test_convolution_theorem_holds(self): """Verify (f*g)^(S) = f̂(S)·ĝ(S).""" conv_coeffs = convolution(f, g) expected = f.fourier() * g.fourier() assert np.allclose(conv_coeffs, expected) # Verifies math! ``` --- ## Anti-Patterns to Avoid ### 1. Existence Tests ```python # BAD: Only checks something returned assert result is not None # GOOD: Verify the actual value assert np.allclose(result, expected_value) ``` ### 2. Structure-Only Tests ```python # BAD: Only checks shape/type assert result.n_vars == 3 assert isinstance(result, BooleanFunction) # GOOD: Also verify correctness assert result.evaluate([0,0,0]) == expected_output ``` ### 3. Missing Contract Tests ```python # BAD: Doesn't verify the mathematical claim def test_parseval(): f = bf.parity(3) result = parseval_verify(f) assert result # What does this even mean? # GOOD: Explicitly verify Parseval's identity def test_parseval_identity(): f = bf.parity(3) sum_squared = sum(c**2 for c in f.fourier()) assert np.isclose(sum_squared, 1.0) # Σ f̂(S)² = 1 ``` --- ## Test Checklist Before writing a test, answer: 1. **What is the mathematical contract?** - What theorem/property should hold? - Write the equation in the docstring 2. **What are the edge cases?** - Empty input, n=0, n=1 - Constant functions (all 0s, all 1s) - Dictator functions (single variable) 3. **What should fail?** - Test that bad inputs raise exceptions - Verify error messages are helpful 4. **Does this test the implementation or the math?** - Implementation tests: "does it run?" - Math tests: "does it compute correctly?" - **You need both, but math tests catch more bugs** --- ## Property-Based Testing with Hypothesis Parameterized tests cover *specific* cases. Property-based tests cover *invariants* that must hold across all inputs. BooFun uses [Hypothesis](https://hypothesis.readthedocs.io/) for this. Key properties to test: ```python from hypothesis import given, strategies as st @given(n=st.integers(min_value=1, max_value=8)) def test_parseval_identity_any_majority(n): """Parseval: Σ f̂(S)² = 1 for any ±1-valued f.""" if n % 2 == 0: n += 1 # Majority needs odd n f = bf.majority(n) assert np.isclose(sum(f.fourier()**2), 1.0) @given(n=st.integers(min_value=1, max_value=6)) def test_influences_are_nonnegative(n): """Influences are probabilities: always in [0, 1].""" f = bf.random(n) for inf in f.influences(): assert 0 <= inf <= 1 ``` **When to use property tests**: monotonicity, symmetry, Parseval, nonnegativity, consistency between related functions (e.g., `is_linear` ↔ `nonlinearity`). --- ## Cross-Validation Testing The strongest tests verify library output against **independent computations**: closed-form formulas, other libraries, or hand-calculated values. ```python def test_majority_total_influence_exact(self): """Cross-validate against closed-form: I[Maj_n] = n * C(n-1,(n-1)/2) / 2^(n-1).""" from math import comb for n in [3, 5, 7, 9]: exact = n * comb(n - 1, (n - 1) // 2) / (2 ** (n - 1)) assert np.isclose(bf.majority(n).total_influence(), exact) ``` **Rule**: For every mathematical function, there should be at least one test that computes the expected value *independently* of the library code being tested. --- ## Test Template ```python class TestFunctionName: """Tests for function_name following [mathematical reference].""" def test_mathematical_property(self): """Verify [specific mathematical property]. Theorem: [state the theorem being tested] """ # Arrange f = bf.some_function(3) # Act result = function_name(f) # Assert - verify the MATH, not just existence expected = compute_expected_value(f) assert np.allclose(result, expected), f"Expected {expected}, got {result}" def test_edge_case_constant(self): """Edge case: constant function.""" f = bf.create([0, 0, 0, 0]) # All zeros result = function_name(f) assert result == expected_for_constant def test_invalid_input_raises(self): """Invalid input should fail loud.""" with pytest.raises(ValueError, match="helpful error message"): function_name(invalid_input) ``` --- ## Files Status ### Fixed (Jan 2026) | File | Issue | Fix | |------|-------|-----| | test_pac_learning.py | 24 weak assertions | Added error rate verification | | test_builtins.py | 14 weak assertions | Added truth table verification | | test_module_exploration.py | 20 weak assertions | Added correctness checks | | test_quantum_complexity.py | 13 weak assertions | Added mathematical property tests | | test_circuit.py | 10 weak assertions | Added circuit evaluation tests | | test_property_tester.py | (new) | Comprehensive PropertyTester tests | ### Remaining Low Priority | File | Count | Reason | |------|-------|--------| | test_visualization_*.py | 70+ | Visual output, hard to verify | | test_gpu_*.py | 21 | Infrastructure tests | | test_benchmarks.py | 9 | Performance tests | --- ## Running Tests ```bash pytest tests/ # All tests pytest tests/ -x -q # Stop on first failure, quiet pytest tests/unit/ # Unit tests only pytest tests/analysis/ # Analysis module tests pytest tests/ --cov=boofun # With coverage report pytest tests/ -k "sensitivity" # Tests matching keyword ``` ## Test Structure ``` tests/ ├── unit/ # Individual function tests ├── integration/ # Cross-module tests ├── analysis/ # Analysis module tests (Fourier, sensitivity, etc.) ├── families/ # Built-in function family tests ├── correctness/ # Bit-ordering, numerical precision ├── fuzz/ # Edge case fuzzing (Hypothesis) ├── benchmarks/ # Performance tests └── test_*.py # Top-level test files ``` **Coverage**: 72% with 3200+ tests. Core analysis modules are 70-85%; visualization and quantum_complexity are lower priority. **CI**: All tests run on every push via GitHub Actions. Pre-commit hooks enforce formatting (black, isort, flake8, codespell).