Commit 61e22a8

refactor(pytest): align test files with article
1 parent be93d4f commit 61e22a8

32 files changed: +2835 -781 lines changed

.gitignore

Lines changed: 4 additions & 1 deletion
@@ -149,4 +149,7 @@ outputs
 marimo_notebooks
 *.csv
 *.parquet
-__marimo__/
+__marimo__/
+
+# Claude Code documentation
+CLAUDE.md

data_science_tools/pgvector_rag.ipynb

Lines changed: 1 addition & 1 deletion
@@ -602,4 +602,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 4
-}
+}

data_science_tools/pytest/README.md

Lines changed: 219 additions & 0 deletions
@@ -1,2 +1,221 @@
[![View on YouTube](https://img.shields.io/badge/YouTube-Watch%20on%20Youtube-red?logo=youtube)](https://www.youtube.com/playlist?list=PLnK6m_JBRVNoYEer9hBmTNwkYB3gmbOPO) [![View on Medium](https://img.shields.io/badge/Medium-View%20on%20Medium-blue?logo=medium)](https://towardsdatascience.com/pytest-for-data-scientists-2990319e55e6)

# Pytest for Data Scientists

Comprehensive examples and best practices for testing data science code with pytest.

## Directory Structure

```
pytest/
├── README.md                    # This file
├── get_started/                 # Basic pytest concepts
│   └── sentiment.py
├── parametrization/             # Parametrized testing
│   ├── process.py
│   ├── process_fixture.py
│   └── sentiment.py
├── test_structure_example/      # Project organization
│   ├── src/
│   └── tests/
├── advanced_fixtures/           # Advanced fixture patterns
│   ├── session_scoped.py
│   ├── autouse_fixtures.py
│   ├── conftest.py
│   └── README.md
├── temporary_files/             # Safe file I/O testing
│   ├── file_operations.py
│   ├── data_pipeline.py
│   └── README.md
├── numerical_testing/           # NumPy/DataFrame testing
│   ├── numpy_arrays.py
│   ├── dataframe_testing.py
│   └── README.md
├── mocking/                     # External dependency mocking
│   ├── api_mocking.py
│   ├── database_mocking.py
│   ├── requirements.txt
│   └── README.md
├── custom_markers/              # Test organization with markers
│   ├── pytest.ini
│   ├── marked_tests.py
│   └── README.md
└── project_config/              # Complete project configuration
    ├── pytest.ini
    ├── conftest.py
    ├── test_with_fixtures.py
    └── README.md
```

## Quick Start

### Basic Installation
```bash
pip install pytest

# For advanced features
pip install pytest-cov pytest-xdist pytest-benchmark
```

### Run Examples
```bash
# Basic examples
pytest get_started/
pytest parametrization/

# Advanced features
pytest advanced_fixtures/
pytest numerical_testing/
pytest mocking/

# Full project configuration
cd project_config && pytest
```

## Feature Overview

### 🚀 **Basic Concepts** (`get_started/`, `parametrization/`)
- Simple test functions and assertions
- Parametrized tests for multiple test cases
- Basic fixtures for data reuse
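
As a minimal sketch of the parametrized style listed above: `normalize_whitespace` is a hypothetical helper (not one of the repository's files), and `pytest.mark.parametrize` runs the same test body once per input/expected pair.

```python
import pytest


def normalize_whitespace(text: str) -> str:
    """Hypothetical helper: trim the ends and collapse inner runs of whitespace."""
    return " ".join(text.split())


@pytest.mark.parametrize(
    ("raw", "expected"),
    [
        ("  hello  world ", "hello world"),
        ("already clean", "already clean"),
        ("tabs\tand\nnewlines", "tabs and newlines"),
        ("", ""),
    ],
)
def test_normalize_whitespace(raw, expected):
    # pytest generates one test per tuple above and reports each one separately
    assert normalize_whitespace(raw) == expected
```

Because each case is reported on its own, a failure points at the exact input that broke rather than at a loop inside a single test.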

### 🔧 **Advanced Fixtures** (`advanced_fixtures/`)
- **Session-scoped fixtures**: Load expensive datasets once
- **Autouse fixtures**: Automatic setup for all tests
- **Shared fixtures**: Common test data via `conftest.py`

### 📁 **Safe File Testing** (`temporary_files/`)
- **tmp_path fixture**: Isolated temporary directories
- **File I/O testing**: CSV, JSON, model serialization
- **Pipeline testing**: End-to-end data processing

### 🔢 **Numerical Testing** (`numerical_testing/`)
- **NumPy arrays**: Floating-point comparison with tolerance
- **Pandas DataFrames**: Proper DataFrame equality testing
- **Statistical validation**: Testing model outputs and data properties
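
A short sketch of what tolerance-based comparison can look like, assuming NumPy and pandas are installed. `standardize` and the sample frames are hypothetical and are not taken from `numpy_arrays.py` or `dataframe_testing.py`.

```python
import numpy as np
import pandas as pd


def standardize(values: np.ndarray) -> np.ndarray:
    """Hypothetical helper: rescale to zero mean and unit (population) variance."""
    return (values - values.mean()) / values.std()


def test_standardize_properties():
    result = standardize(np.array([1.0, 2.0, 3.0, 4.0]))
    # Compare floats with a tolerance instead of ==
    np.testing.assert_allclose(result.mean(), 0.0, atol=1e-12)
    np.testing.assert_allclose(result.std(), 1.0, rtol=1e-12)


def test_dataframe_equality():
    raw = pd.DataFrame({"price": [1.0, 2.0], "qty": [3, 4]})
    actual = raw.assign(total=raw["price"] * raw["qty"])
    expected = pd.DataFrame({"price": [1.0, 2.0], "qty": [3, 4], "total": [3.0, 8.0]})
    # Checks values, column order, index and dtypes in one call
    pd.testing.assert_frame_equal(actual, expected)
```

Comparing DataFrames with `==` returns another DataFrame rather than a single boolean, which is why a dedicated assertion helper is usually the safer choice.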

### 🌐 **Mocking External Services** (`mocking/`)
- **API mocking**: Test without hitting real APIs
- **Database mocking**: Test queries without databases
- **Error simulation**: Test failure scenarios safely
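
Error simulation can be sketched with the standard library's `unittest.mock.Mock` and its `side_effect` argument. `fetch_with_retry` below is a hypothetical helper written for this illustration, not code from `api_mocking.py`.

```python
from unittest.mock import Mock

import pytest


def fetch_with_retry(fetch, retries=3):
    """Hypothetical helper: call fetch() until it succeeds or retries run out."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for _ in range(retries):
        try:
            return fetch()
        except ConnectionError as exc:
            last_error = exc
    raise last_error


def test_recovers_from_transient_failures():
    # With a list as side_effect, exceptions are raised and other items returned, in order
    flaky = Mock(side_effect=[ConnectionError("down"), ConnectionError("down"), {"status": "ok"}])
    assert fetch_with_retry(flaky) == {"status": "ok"}
    assert flaky.call_count == 3


def test_gives_up_when_service_stays_down():
    # A single exception as side_effect is raised on every call
    always_down = Mock(side_effect=ConnectionError("down"))
    with pytest.raises(ConnectionError):
        fetch_with_retry(always_down, retries=2)
    assert always_down.call_count == 2
```

Passing the dependency in as an argument keeps the example free of patching; the same `side_effect` values work on a mock created by `patch`.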

### 🏷️ **Custom Markers** (`custom_markers/`)
- **Test organization**: Group tests by speed, requirements, domain
- **Selective execution**: Run specific test categories
- **CI/CD integration**: Different test suites for different stages
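
A rough sketch of how markers attach to tests. The repository registers its markers in `pytest.ini`; the `pytest_configure` hook shown here is an alternative way to register them from `conftest.py`, and the file names and test bodies are hypothetical.

```python
import pytest


# --- conftest.py (hypothetical) ---
def pytest_configure(config):
    # Registering markers lets `pytest --strict-markers` reject misspelled names
    config.addinivalue_line("markers", "fast: quick unit tests")
    config.addinivalue_line("markers", "slow: long-running tests")


# --- test_speed.py (hypothetical) ---
@pytest.mark.fast
def test_mean_of_small_list():
    values = [1, 2, 3]
    assert sum(values) / len(values) == 2


@pytest.mark.slow
def test_sum_of_large_range():
    n = 1_000_000
    assert sum(range(n)) == n * (n - 1) // 2
```

With this in place, `pytest -m fast` selects only the first test and `pytest -m "not slow"` skips the second.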

### ⚙️ **Project Configuration** (`project_config/`)
- **Complete setup**: Production-ready pytest configuration
- **Centralized fixtures**: Project-wide test utilities
- **Best practices**: Logging, warnings, reproducibility

## Common Workflows

### Development Workflow
```bash
# Fast feedback during development
pytest -m fast

# Before committing changes
pytest -m "fast or (integration and not slow)"

# Full test suite
pytest
```

### Continuous Integration
```bash
# Unit tests (fast feedback)
pytest -m "unit and fast"

# Integration tests
pytest -m "integration and not gpu and not expensive"

# Performance tests (separate stage)
pytest -m "slow or expensive"
```

### Data Science Specific
```bash
# Test data processing pipelines
pytest -m data_processing

# Test model training
pytest -m model_training

# Test without external dependencies
pytest -m "not api and not database"
```

## Key Benefits for Data Scientists

### 🛡️ **Reliability**
- **Reproducible results**: Consistent random seeds
- **Isolated tests**: No interference between tests
- **Proper numerical comparison**: Handle floating-point precision

### **Performance**
- **Fast feedback**: Separate fast/slow test categories
- **Efficient fixtures**: Load expensive data once
- **Parallel execution**: Run tests concurrently

### 🔍 **Better Debugging**
- **Clear error messages**: Detailed assertion information
- **Test organization**: Easy to find and run specific tests
- **Comprehensive logging**: Track test execution

### 🤝 **Team Collaboration**
- **Standardized setup**: Consistent test environment
- **Shared fixtures**: Common test data and utilities
- **Documentation**: Clear examples and best practices

## Testing Patterns by Use Case

### Data Processing
```python
def test_data_cleaning(tmp_path):
    # Use temporary files for safe testing
    input_file = tmp_path / "dirty_data.csv"
    # Test cleaning pipeline...
```
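
The stub above could be filled in along these lines. This is only a sketch: `drop_incomplete_rows` is a hypothetical cleaning step built on the standard-library `csv` module, not the repository's pipeline.

```python
import csv
from pathlib import Path


def drop_incomplete_rows(src: Path, dst: Path) -> int:
    """Hypothetical cleaner: copy rows that have no empty field; return rows kept."""
    with src.open(newline="") as fin, dst.open("w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames, extrasaction="ignore")
        writer.writeheader()
        kept = 0
        for row in reader:
            if all(value not in (None, "") for value in row.values()):
                writer.writerow(row)
                kept += 1
    return kept


def test_drop_incomplete_rows(tmp_path):
    # tmp_path is a pathlib.Path pointing at a fresh directory unique to this test
    input_file = tmp_path / "dirty_data.csv"
    output_file = tmp_path / "clean_data.csv"
    input_file.write_text("name,age\nalice,30\nbob,\ncarol,41\n")

    assert drop_incomplete_rows(input_file, output_file) == 2
    assert output_file.read_text().splitlines() == ["name,age", "alice,30", "carol,41"]
```

Nothing is written outside `tmp_path`, so the test cannot clobber real data files and needs no manual cleanup.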

### Machine Learning
```python
import pytest


@pytest.fixture(scope="session")
def trained_model():
    # Train once, test many aspects
    # (expensive_model_training is a placeholder for your own training code)
    return expensive_model_training()


def test_model_accuracy(trained_model):
    # Use the fixture argument; there is no global `model` in this scope
    assert trained_model.accuracy > 0.9
```

### External APIs
```python
from unittest.mock import patch


@patch('requests.get')
def test_api_integration(mock_get):
    # Mock external calls for reliable testing
    mock_get.return_value.json.return_value = {'data': 'test'}
    # Test your logic...
```
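
For a version that runs without third-party packages, the sketch below swaps `requests.get` for the standard library's `urllib.request.urlopen`; the patching idea is the same. `fetch_json`, `count_records`, and the URL are hypothetical.

```python
import json
import urllib.request
from unittest.mock import MagicMock, patch


def fetch_json(url: str) -> dict:
    """Hypothetical client: download a URL and decode the JSON body."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def count_records(url: str) -> int:
    """Logic under test: how many items the service returned."""
    return len(fetch_json(url)["data"])


def test_count_records_without_network():
    # Fake response object that works as a context manager and returns canned bytes
    fake_response = MagicMock()
    fake_response.__enter__.return_value = fake_response
    fake_response.read.return_value = b'{"data": [1, 2, 3]}'

    with patch("urllib.request.urlopen", return_value=fake_response) as mock_urlopen:
        assert count_records("https://example.invalid/api") == 3

    mock_urlopen.assert_called_once_with("https://example.invalid/api")
```

The patch target is the name as the code under test looks it up (`urllib.request.urlopen` here), and the mock is removed again when the `with` block exits.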

## Getting Help

Most example directories include a detailed README with:
- Specific feature documentation
- Running instructions
- Best practices
- Troubleshooting guides

Start with the examples that match your current testing needs, then explore advanced features as your test suite grows.

## Related Resources

- **Article**: [Pytest for Data Scientists](https://towardsdatascience.com/pytest-for-data-scientists-2990319e55e6)
- **Video Series**: [YouTube Playlist](https://www.youtube.com/playlist?list=PLnK6m_JBRVNoYEer9hBmTNwkYB3gmbOPO)
- **Official Docs**: [pytest.org](https://docs.pytest.org/)

## Contributing

These examples are designed to be practical and educational. Feel free to adapt them for your specific data science testing needs.
Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
# Advanced Fixtures Examples

This directory demonstrates advanced pytest fixture patterns particularly useful for data science projects.

## Files

- `session_scoped.py` - Session-scoped fixtures for expensive operations (like loading large datasets)
- `autouse_fixtures.py` - Autouse fixtures that run before each test without being requested
- `conftest.py` - Shared fixtures available to all tests in this directory

## Key Concepts

### Session-Scoped Fixtures
- Run only once per test session
- Perfect for loading expensive datasets or training models
- Shared across all tests that request them
- Significant performance improvements for test suites

### Autouse Fixtures
- Automatically applied to all tests without explicit request
- Great for setup that should always happen (like setting random seeds)
- Ensures consistent test environments
- Reduces boilerplate code in individual tests
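
One way to check that an autouse seed fixture actually delivers reproducibility is sketched below. It uses only the standard-library `random` module to stay dependency-free; `seed_everything`, `fixed_seed`, and `draw_sample` are hypothetical names.

```python
import random

import pytest

SEED = 42  # the same seed value the repository's fixtures use


def seed_everything(seed: int = SEED) -> None:
    """Reset the stdlib RNG; a NumPy project would also call np.random.seed(seed)."""
    random.seed(seed)


@pytest.fixture(autouse=True)
def fixed_seed():
    # autouse=True: runs before every test in scope, no test argument needed
    seed_everything()


def draw_sample(n: int = 5) -> list:
    return [random.random() for _ in range(n)]


def test_sample_matches_reseeded_draw():
    first = draw_sample()  # drawn right after the fixture seeded the RNG
    seed_everything()
    assert draw_sample() == first


def test_not_affected_by_previous_test():
    # The previous test consumed random numbers, but the fixture re-seeded
    # before this test started, so the first draw is the seed's first value.
    expected = random.Random(SEED).random()
    assert random.random() == expected
```

Asserting on the drawn values, rather than only on their length, is what turns the seed into something the test suite verifies.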

### Conftest.py
- Provides fixtures to all test files in the directory
- No imports needed - fixtures are automatically available
- Can have different conftest.py files at different directory levels
- Fixtures in parent directories are available to child directories
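
The sharing mechanism can be sketched as two files shown in one block. Both file names and the `sample_scores` fixture are hypothetical; the point is that the test module never imports `conftest.py`.

```python
# --- conftest.py (hypothetical contents) ---
import pytest


@pytest.fixture
def sample_scores():
    """Small dataset offered to every test module in this directory and below."""
    return [0.2, 0.4, 0.9]


# --- test_scores.py (hypothetical, same directory, no import of conftest) ---
def test_scores_are_probabilities(sample_scores):
    # pytest matches the argument name to the fixture defined in conftest.py
    assert all(0.0 <= score <= 1.0 for score in sample_scores)


def test_scores_can_be_mutated_safely(sample_scores):
    # Function scope (the default) gives each test its own fresh list
    sample_scores.append(1.0)
    assert len(sample_scores) == 4
```

A fixture with the same name defined in a subdirectory's `conftest.py` takes precedence for tests in that subdirectory.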

## Running the Examples

```bash
# Run all advanced fixture examples
pytest advanced_fixtures/

# Run with verbose output to see fixture setup
pytest -v advanced_fixtures/

# Run only session-scoped fixture tests
pytest advanced_fixtures/session_scoped.py

# Run only autouse fixture tests
pytest advanced_fixtures/autouse_fixtures.py
```

## Key Benefits for Data Science

1. **Performance**: Session-scoped fixtures prevent reloading expensive datasets
2. **Reproducibility**: Autouse fixtures ensure consistent random seeds
3. **Organization**: Conftest.py centralizes common test data and setup
4. **Maintainability**: Reduces code duplication across test files
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
import numpy as np
import pytest


@pytest.fixture(autouse=True)
def setup_random_seeds():
    print("Setting up random seeds...")
    np.random.seed(42)
    import random
    random.seed(42)


def test_model_prediction():
    # This test will have reproducible random results
    X = np.random.randn(100, 5)
    # Your model training and prediction code here
    assert len(X) == 100


def test_data_sampling():
    # This test also gets reproducible randomness
    sample = np.random.choice([1, 2, 3, 4, 5], size=10)
    assert len(sample) == 10
Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
"""
Shared fixtures for advanced_fixtures examples.

This conftest.py file provides fixtures that can be used across
all test files in this directory without explicit imports.
"""

import numpy as np
import pandas as pd
import pytest


@pytest.fixture(scope="session")
def ml_dataset():
    """Create a machine learning dataset for testing."""
    np.random.seed(42)  # For reproducibility

    # Generate features
    n_samples = 1000
    n_features = 4

    X = np.random.randn(n_samples, n_features)
    # Create a target variable with some relationship to features
    y = (X[:, 0] + X[:, 1] * 0.5 + np.random.normal(0, 0.1, n_samples) > 0).astype(int)

    # Create DataFrame
    feature_names = [f"feature_{i + 1}" for i in range(n_features)]
    df = pd.DataFrame(X, columns=feature_names)
    df["target"] = y

    return df


@pytest.fixture(scope="module")
def data_processing_config():
    """Configuration for data processing tests."""
    return {
        "train_size": 0.8,
        "random_state": 42,
        "normalize": True,
        "remove_outliers": True,
        "outlier_threshold": 3.0,
    }


@pytest.fixture
def sample_predictions():
    """Generate sample model predictions for testing."""
    np.random.seed(42)
    return {
        "y_true": np.random.randint(0, 2, 100),
        "y_pred": np.random.rand(100),  # Probability predictions
        "y_pred_binary": np.random.randint(0, 2, 100),
    }


@pytest.fixture(autouse=True)
def reset_random_state():
    """Ensure each test starts with a known random state."""
    np.random.seed(42)
    import random

    random.seed(42)
