
Commit 5ca93a9

Add pre-commit hooks to strip Jupyter notebook outputs
Former-commit-id: b97e05ff6d7921b46f69c3bcd4a94337bbefa131
1 parent 01289e6 commit 5ca93a9

11 files changed, +558 -1325 lines changed
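In practice, this means notebook outputs are stripped automatically at commit time from here on. As a quick orientation, the one-time setup documented in the new NOTEBOOK_GUIDELINES.md (added below) amounts to:

```bash
# One-time setup after pulling this commit (see NOTEBOOK_GUIDELINES.md below)
poetry install                 # installs the project's dev dependencies, including pre-commit
poetry run pre-commit install  # registers the hooks in .git/hooks
```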

Diff for: .gitignore

+3-1
@@ -6,7 +6,10 @@ models/
 # Include specific directories
 !src/models/
 
+# Jupyter notebook
 notebooks/.ipynb_checkpoints/
+.ipynb_checkpoints/
+*/.ipynb_checkpoints/*
 
 # Python
 __pycache__/
@@ -47,4 +50,3 @@ build/
 npm-debug.log*
 yarn-debug.log*
 yarn-error.log*
-
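As a quick sanity check on the new ignore rules (a sketch; `foo.ipynb` is just a hypothetical path, and `git check-ignore -v` prints whichever .gitignore rule matches a given path):

```bash
# Confirm checkpoint paths are now matched by the new rules
git check-ignore -v .ipynb_checkpoints/foo.ipynb notebooks/.ipynb_checkpoints/foo.ipynb
```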
Diff for: .pre-commit-config.yaml

+17
@@ -0,0 +1,17 @@
+repos:
+  - repo: https://github.com/kynan/nbstripout
+    rev: 0.7.1
+    hooks:
+      - id: nbstripout
+        name: Strip Jupyter notebook output cells
+        description: Clear output from Jupyter notebooks before committing
+        files: \.ipynb$
+
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v4.5.0
+    hooks:
+      - id: trailing-whitespace
+      - id: end-of-file-fixer
+      - id: check-yaml
+      - id: check-added-large-files
+        args: ['--maxkb=500']
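With this configuration, a single hook can also be run on demand by its id; for instance, a sketch of applying just the output-stripping hook to every tracked notebook:

```bash
# Run only the nbstripout hook from .pre-commit-config.yaml across the repo
poetry run pre-commit run nbstripout --all-files
```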

Diff for: NOTEBOOK_GUIDELINES.md

+53
@@ -0,0 +1,53 @@
+# Jupyter Notebook Guidelines
+
+## Automatic Output Stripping
+
+This repository is configured with a pre-commit hook that automatically strips output cells from Jupyter notebooks before they are committed to Git. This helps keep the repository size manageable by avoiding the storage of large outputs such as images, graphs, and videos in the Git history.
+
+### How It Works
+
+1. The `nbstripout` pre-commit hook is configured to run automatically before each commit.
+2. It removes all output cells, execution counts, and metadata from notebooks.
+3. Your notebook file will be stripped only in the Git repository - your local file will keep its outputs.
+
+### Setup for New Contributors
+
+If you're newly cloning this repository, you need to set up the pre-commit hooks:
+
+```bash
+# Install poetry dependencies including pre-commit tools
+poetry install
+
+# Install the pre-commit hooks
+poetry run pre-commit install
+```
+
+### Testing the Setup
+
+To verify that the pre-commit hooks are working correctly, you can run:
+
+```bash
+poetry run pre-commit run --all-files
+```
+
+### Manual Stripping
+
+If you need to manually strip outputs from a notebook, run:
+
+```bash
+poetry run nbstripout notebooks/your_notebook.ipynb
+```
+
+## Best Practices
+
+1. **Keep Large Data Outside Git**: Store large datasets separately (e.g., data/ directory which is gitignored).
+2. **Avoid Embedding Large Files**: Don't embed videos, large images, or other binary data directly in notebooks.
+3. **Document Data Sources**: Always include information on how to obtain data needed for your notebooks.
+4. **Separate Code and Content**: Use markdown cells to document your analysis thoroughly.
+
+## Troubleshooting
+
+If you encounter issues with the pre-commit hooks, ensure:
+- You have run `poetry install` to install all dependencies
+- You have run `poetry run pre-commit install` to set up the hooks
+- You are committing from within the Poetry environment or using `poetry run git commit`
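As a concrete example of the manual workflow described in the guidelines (a sketch; the grep line is only an informal check, not part of the tooling, and roll_call.ipynb is simply one of the notebooks touched by this commit):

```bash
# Strip a single notebook in place
poetry run nbstripout notebooks/roll_call.ipynb

# Informal check: a stripped notebook should contain no "output_type" keys
grep -q '"output_type"' notebooks/roll_call.ipynb && echo "outputs still present" || echo "notebook is clean"
```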

Diff for: notebooks/meetings.ipynb

+247-599
Large diffs are not rendered by default.

Diff for: notebooks/roll_call.ipynb

+7-147
@@ -11,7 +11,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -31,17 +31,9 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": null,
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Clip successfully extracted to: ../data/video/regular_council_meeting___2025_02_26_clip_4-50_to_5-20.mp4\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "import subprocess\n",
     "from pathlib import Path\n",
@@ -97,126 +89,9 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": null,
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "INFO:src.videos:Transcribing video with speaker diarization: ../data/video/regular_council_meeting___2025_02_26_clip_4-50_to_5-20.mp4\n",
-      "INFO:src.videos:Output will be saved to: ../data/transcripts/regular_council_meeting___2025_02_26_clip_4-50_to_5-20.diarized.json\n",
-      "INFO:src.huggingface:Auto-detected device: cpu\n",
-      "INFO:src.huggingface:Auto-selected compute_type: int8\n",
-      "INFO:src.huggingface:Loading WhisperX model: tiny on cpu with int8 precision\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "168afa65d3ae4108af591eb1993fe482",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "tokenizer.json: 0%| | 0.00/2.20M [00:00<?, ?B/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "89d35faecb8e447db3ccb95407e2a775",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "config.json: 0%| | 0.00/2.25k [00:00<?, ?B/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "f616039556ee46aaaee2f975f016aeb0",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "vocabulary.txt: 0%| | 0.00/460k [00:00<?, ?B/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "50bd4e88d6084638b91847587cc9ed0a",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "model.bin: 0%| | 0.00/75.5M [00:00<?, ?B/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.5.0.post0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../../../../Library/Caches/pypoetry/virtualenvs/tgov_scraper-zRR99ne3-py3.11/lib/python3.11/site-packages/whisperx/assets/pytorch_model.bin`\n",
-      "INFO:src.huggingface:Loading diarization pipeline\n"
-     ]
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "No language specified, language will be first be detected for each audio file (increases inference time).\n",
-      ">>Performing voice activity detection using Pyannote...\n",
-      "Model was trained with pyannote.audio 0.0.1, yours is 3.3.2. Bad things might happen unless you revert pyannote.audio to 0.x.\n",
-      "Model was trained with torch 1.10.0+cu102, yours is 2.4.1. Bad things might happen unless you revert torch to 1.x.\n"
-     ]
-    },
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "INFO:src.huggingface:WhisperX model loaded in 4.50 seconds\n",
-      "INFO:src.videos:Running initial transcription with batch size 8...\n"
-     ]
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Detected language: en (0.99) in first 30s of audio...\n"
-     ]
-    },
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "INFO:src.videos:Detected language: en\n",
-      "INFO:src.videos:Loading alignment model for detected language: en\n",
-      "INFO:src.videos:Aligning transcription with audio...\n",
-      "INFO:src.videos:Running speaker diarization...\n",
-      "/Users/owner/Library/Caches/pypoetry/virtualenvs/tgov_scraper-zRR99ne3-py3.11/lib/python3.11/site-packages/pyannote/audio/models/blocks/pooling.py:104: UserWarning: std(): degrees of freedom is <= 0. Correction should be strictly less than the reduction factor (input numel divided by output numel). (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/ReduceOps.cpp:1808.)\n",
-      " std = sequences.std(dim=-1, correction=1)\n",
-      "INFO:src.videos:Assigning speakers to transcription...\n",
-      "INFO:src.videos:Processing transcription segments...\n",
-      "INFO:src.videos:Diarized transcription completed in 30.03 seconds\n",
-      "INFO:src.videos:Detailed JSON saved to: ../data/transcripts/regular_council_meeting___2025_02_26_clip_4-50_to_5-20.diarized.json\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "from src.videos import transcribe_video_with_diarization\n",
     "\n",
@@ -231,24 +106,9 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": null,
    "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "5d97ff70c1c3409da83c10c478f2bfaa",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "HTML(value='<h3>Meeting Script</h3><hr><p><b>[00:00:00] SPEAKER_01:</b><br>Thank you, Mr. Huffinds. Any counci…"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    }
-   ],
+   "outputs": [],
    "source": [
     "def format_timestamp(seconds: float) -> str:\n",
     " \"\"\"Convert seconds to HH:MM:SS format\"\"\"\n",
