boost_library_docs_tracker

Django app that scrapes Boost library documentation by version, writes metadata and content hashes into BoostDocContent (extracted text lives in workspace files, not the DB), links pages to library-versions via BoostLibraryDocumentation, and upserts to Pinecone. Sync state is tracked on BoostDocContent (is_upserted). Re-runs when new Boost versions are released; restart logic uses BoostDocContent.is_upserted to re-run upserts for failed or new rows.

Overview

Item	Value
Module	`boost_library_docs_tracker`
Management command	`run_boost_library_docs_tracker`
Workspace subfolder	`workspace/boost_library_docs_tracker/`
Models	`BoostDocContent`, `BoostLibraryDocumentation`
Service module	`boost_library_docs_tracker.services`
Fetcher module	`boost_library_docs_tracker.fetcher`

Directory structure

boost_library_docs_tracker/
├── __init__.py
├── apps.py
├── admin.py
├── models.py
├── services.py
├── fetcher.py
├── html_to_md.py
├── preprocessor.py
├── workspace.py
├── management/
│   ├── __init__.py
│   └── commands/
│       ├── __init__.py
│       └── run_boost_library_docs_tracker.py
├── migrations/
│   ├── __init__.py
│   ├── 0001_initial.py
│   ├── 0002_remove_page_content_and_status_add_is_upserted.py
│   └── 0003_redesign_per_schema_v10.py
└── tests/
    ├── __init__.py
    ├── fixtures.py
    ├── test_models.py
    ├── test_preprocessor.py
    ├── test_services.py
    └── test_fetcher.py

Database models

See Schema.md section 10 for the full ER diagram. Summary:

BoostDocContent — One row per unique document content, keyed by content_hash (SHA-256 of page text). Stores url, content_hash, first_version_id, last_version_id, is_upserted, scraped_at, created_at. Page content is not stored in the DB; it is kept in workspace files. The unique key is content_hash so the same content is never duplicated; the same URL may produce a new row if content changes. Restart/sync logic: select rows where is_upserted=False (or in failed_ids) and re-run Pinecone upserts; after success, set is_upserted=True.

BoostLibraryDocumentation — Join table between BoostLibraryVersion (section 3 of schema) and BoostDocContent. One row per (library-version, doc_content) pair; it only records which pages were found under a given (library, version). No status, page_count, or sync fields; sync state lives on BoostDocContent.

Fetcher (`fetcher.py`)

Contains all HTTP and HTML logic. Makes no database writes.

Function	Description
`crawl_library_pages(doc_root_url, max_pages, delay_secs)`	BFS from `doc_root_url`. Only follows links that stay within the same URL prefix (scoped to that library + version). Converts HTML to Markdown via `html_to_md`. Returns `list[tuple[page_url, markdown_text]]`.
`walk_library_html(source_root, lib_key, lib_documentation, version, max_pages)`	BFS-walks local HTML files from the extracted Boost source zip. Returns `list[tuple[canonical_url, markdown_text]]`.
`download_source_zip(version, dest_dir)`	Downloads the Boost source zip; skips if already present.
`extract_source_zip(zip_path, extract_dir)`	Extracts zip; returns top-level extracted directory. Skips if already extracted.
`delete_extract_dir(extract_dir)`	Deletes the extracted source tree to free disk space after scraping.

Dependencies: requests, beautifulsoup4, lxml.

Crawl boundary rule: Only URLs that begin with doc_root_url are followed. This keeps the BFS within one library's documentation tree and prevents leaking into other libraries or external pages.

Management command

Command name: run_boost_library_docs_tracker

Registered in config/boost_collector_schedule.yaml under the boost_library_docs group (after run_boost_github_activity_tracker in the overall pipeline).

Arguments

Argument	Default	Description
`--versions VERSION [VERSION ...]`	latest from DB	One or more Boost versions to scrape. Must match `BoostVersion.version` (e.g. `boost-1.85.0` when populated from GitHub tags).
`--library LIBRARY`	all	Scrape only one library by name. Useful for debugging.
`--dry-run`	off	Fetch and parse pages but do not write to DB or Pinecone.
`--skip-pinecone`	off	Write to DB but skip the Pinecone upsert step.
`--max-pages N`	100	Per-library page cap for the BFS crawl.
`--use-local`	off	Download source zip and walk local HTML instead of HTTP crawl.
`--cleanup-extract`	off	Delete extracted source tree after each version (only with `--use-local`).

Execution steps

Resolve versions — Use --versions if given; otherwise query BoostVersion for the latest version with non-null version_created_at. Sort old→new.
Discover libraries — For each version, call _get_library_list(version) to read (library_name, doc_root_url, lib_key, lib_doc) from BoostLibraryVersion and BoostLibrary.
Per-library scrape — For each library (optionally filtered by --library): fetch pages via fetcher.crawl_library_pages or fetcher.walk_library_html (if --use-local). For each page: save text to workspace, compute content_hash, call services.get_or_create_doc_content(url, content_hash, version_id=boost_version_id) → returns (BoostDocContent, change_type). Call services.link_content_to_library_version(library_version_id, doc_content_id) to create the BoostLibraryDocumentation join row (idempotent).
Pinecone sync — Unless --skip-pinecone or --dry-run: run sync_to_pinecone with preprocess_for_pinecone. The preprocessor selects BoostDocContent rows where is_upserted=False or in failed_ids, loads page content from workspace, builds vectors, and sets BoostDocContent.is_upserted=True on success. Failed IDs from the sync result are set back to is_upserted=False for retry.
Complete — Log summary and exit 0.

Restart and sync logic

Scrape: No per-row “pending” skip; each run (re)scrapes and updates or creates BoostDocContent by content_hash, and ensures BoostLibraryDocumentation join rows exist.
Pinecone: Driven by BoostDocContent.is_upserted. Rows with is_upserted=False (or in failed_ids) are (re)upserted; on success is_upserted is set to True. On sync failure, failed BoostDocContent IDs are set to is_upserted=False so the next run retries them.

Pinecone document shape

One Pinecone vector per BoostDocContent row. The preprocessor selects rows with BoostDocContent.is_upserted=False (or in failed_ids), loads page content from the workspace, and builds one document per row. Metadata is taken from BoostDocContent (url, content_hash, first_version, last_version) and from the linked BoostLibraryDocumentation / BoostLibraryVersion for library name.

Field	Value
`metadata.doc_id`	`BoostDocContent.content_hash`
`metadata.url`	Page URL
`metadata.first_version` / `metadata.last_version`	Version strings from BoostDocContent FKs
`metadata.library_name`	From a BoostLibraryDocumentation relation’s library
`metadata.ids`	BoostDocContent PK (for failure reporting)
`content`	Page text loaded from workspace

Upsert logic (driven by BoostDocContent.is_upserted): Only rows with is_upserted=False or in the retry failed_ids list are sent to Pinecone. After a successful upsert, BoostDocContent.is_upserted is set to True. If the sync reports failures, those BoostDocContent IDs are set back to is_upserted=False so the next run retries them.

Scheduling

The app runs on the on_release schedule in config/boost_collector_schedule.yaml (see Workflow.md). On most days there is no new version and the command finishes quickly; on a release day a full scrape runs for the new version. Pinecone upserts run for any BoostDocContent with is_upserted=False or in the failure retry list.

For a manual backfill of specific versions, pass values that match BoostVersion.version in the database (often the GitHub tag form with a boost- prefix):

python manage.py run_boost_library_docs_tracker --versions boost-1.85.0
python manage.py run_boost_library_docs_tracker --versions boost-1.85.0 boost-1.86.0

If your BoostVersion rows use bare version strings (e.g. 1.85.0), use those instead. The command looks up versions with BoostVersion.objects.get(version=...), so the value must match exactly.

Configuration

Settings added to settings.py (all via environment variables):

Setting	Env var	Default	Description
`BOOST_DOCS_MAX_PAGES_PER_LIBRARY`	`BOOST_DOCS_MAX_PAGES_PER_LIBRARY`	`500`	Per-library BFS page cap
`BOOST_DOCS_CRAWL_DELAY`	`BOOST_DOCS_CRAWL_DELAY`	`0.5`	Seconds to sleep between page fetches
`BOOST_DOCS_PINECONE_API_KEY`	`BOOST_DOCS_PINECONE_API_KEY`	`""`	Pinecone API key
`BOOST_DOCS_PINECONE_INDEX`	`BOOST_DOCS_PINECONE_INDEX`	`"boost-docs"`	Pinecone index name
`BOOST_DOCS_PINECONE_NAMESPACE`	`BOOST_DOCS_PINECONE_NAMESPACE`	`"boost_docs"`	Pinecone namespace

GitHub token reuses settings.GITHUB_TOKENS_SCRAPING (already configured by the project).

Project integration checklist

When adding this app to the project, do all of the following:

Add "boost_library_docs_tracker" to INSTALLED_APPS in settings.py.
Add "boost_library_docs_tracker" to _WORKSPACE_APP_SLUGS in settings.py.
Add the five BOOST_DOCS_* settings to settings.py and their env var defaults to .env.example.
Add "run_boost_library_docs_tracker" to config/boost_collector_schedule.yaml under the boost_library_docs group (see Workflow.md).
Add beautifulsoup4 and lxml to requirements.txt (if not already present).
Run python manage.py migrate to apply the app’s committed migrations.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

boost_library_docs_tracker

Overview

Directory structure

Database models

Fetcher (`fetcher.py`)

Management command

Arguments

Execution steps

Restart and sync logic

Pinecone document shape

Scheduling

Configuration

Project integration checklist

Related documentation

FilesExpand file tree

boost_library_docs_tracker.md

Latest commit

History

boost_library_docs_tracker.md

File metadata and controls

boost_library_docs_tracker

Overview

Directory structure

Database models

Fetcher (fetcher.py)

Management command

Arguments

Execution steps

Restart and sync logic

Pinecone document shape

Scheduling

Configuration

Project integration checklist

Related documentation

Fetcher (`fetcher.py`)