Fetches and converts Boost library documentation (HTML and related sources) into Markdown for storage and downstream search (Pinecone, etc.). Requires a working pandoc binary on the host (see root README).
run_boost_library_docs_tracker crawls or unpacks documentation, normalizes it to Markdown, persists structured rows, then optionally embeds the same content into Pinecone. Service details: docs/service_api/boost_library_docs_tracker.md.
Published docs (default, HTTP crawl)
Crawl starts from paths derived from each library’s metadata and stays under:
- Base: https://www.boost.org/doc/libs/<version_underscores>/
- Example: Boost 1.90.0 → https://www.boost.org/doc/libs/1_90_0/ (see boost_library_docs_tracker/fetcher.py: BOOST_ORG_BASE + /doc/libs/...).
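The version-to-URL mapping can be sketched in a few lines of Python. This is an illustrative helper, not the code in fetcher.py: the function names docs_base_url and is_under_base are hypothetical, and the value of BOOST_ORG_BASE is assumed from the example URL.

```python
# Assumed value, inferred from the example URL; the real constant lives in
# boost_library_docs_tracker/fetcher.py.
BOOST_ORG_BASE = "https://www.boost.org"


def docs_base_url(version: str) -> str:
    """Return the published-docs base URL for a dotted Boost version."""
    version_underscores = version.replace(".", "_")
    return f"{BOOST_ORG_BASE}/doc/libs/{version_underscores}/"


def is_under_base(url: str, version: str) -> bool:
    """True when a discovered link stays under the version's base URL."""
    return url.startswith(docs_base_url(version))


print(docs_base_url("1.90.0"))  # https://www.boost.org/doc/libs/1_90_0/
```

A prefix check like is_under_base is one simple way to keep a crawl from wandering into other versions or unrelated parts of the site.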
Downloaded source (--use-local)
Per version, the source zip is downloaded (then extracted under WORKSPACE_DIR) using, in order:
- https://archives.boost.io/release/<version>/source/boost_<version_underscores>.zip
- Fallback: https://github.com/boostorg/boost/archive/refs/tags/boost-<version>.zip
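The ordered download with fallback can be pictured roughly as follows. This is a minimal sketch, not the collector's actual code: source_zip_urls and first_successful are hypothetical names, and the download callable is supplied by the caller.

```python
def source_zip_urls(version: str) -> list[str]:
    """Candidate source-zip URLs for a Boost version, in the order tried."""
    version_underscores = version.replace(".", "_")
    return [
        f"https://archives.boost.io/release/{version}/source/boost_{version_underscores}.zip",
        f"https://github.com/boostorg/boost/archive/refs/tags/boost-{version}.zip",
    ]


def first_successful(urls, download):
    """Try each URL in order; return (url, payload) from the first that works.

    `download` is a caller-supplied callable (hypothetical here) that returns
    the payload or raises OSError on failure.
    """
    errors = []
    for url in urls:
        try:
            return url, download(url)
        except OSError as exc:
            # Remember the failure and move on to the next candidate.
            errors.append(f"{url}: {exc}")
    raise RuntimeError("all downloads failed: " + "; ".join(errors))
```

Catching OSError also covers urllib.error.URLError, which is a subclass of it, so a urllib-based downloader could be passed in without changing the loop.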
Which versions
Pass --versions explicitly, or omit it to use the latest row in the BoostVersion table (PostgreSQL). This command does not call the GitHub API itself for version discovery; populate versions/libraries via boost_library_tracker (and related flows) first. Scope libraries with --library when needed.
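The selection rule can be sketched as a small pure function. It assumes that "latest" means the highest version when compared numerically; the real command may order BoostVersion rows by a different field. Both function names are hypothetical.

```python
def version_key(version: str) -> tuple[int, ...]:
    """Sort key that compares dotted versions numerically (1.90.0 > 1.9.0)."""
    return tuple(int(part) for part in version.split("."))


def resolve_versions(cli_versions, known_versions):
    """Pick the versions to process.

    cli_versions: values passed via --versions (may be empty).
    known_versions: version strings already stored, e.g. from BoostVersion.
    """
    if cli_versions:
        return list(cli_versions)
    if not known_versions:
        raise ValueError("no versions stored; run boost_library_tracker first")
    # Assumption: "latest" is the numerically highest stored version.
    return [max(known_versions, key=version_key)]
```

Comparing integer tuples rather than raw strings avoids ranking 1.9.0 above 1.90.0.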
BoostDocContent, BoostLibraryDocumentation, and related rows store URLs, content hashes, version links, and sync metadata; converted Markdown lives on disk under WORKSPACE_DIR, not in these table payloads (see the model docstrings in models.py). Canonical schema: docs/Schema.md, section 10 — Boost Library Docs Tracker (ER diagram and field notes). Related docs: docs/boost_library_docs_tracker.md (commands and workspace layout) and docs/service_api/boost_library_docs_tracker.md (service API for writes to these models).
Git publishing is not part of this app’s pipeline: this collector makes no git commit and pushes nothing to a Markdown repo.
After DB + workspace writes, the collector can call run_cppa_pinecone_sync with this app’s preprocessor (unless --skip-pinecone or a dry run). That upserts into the namespace configured for Boost docs search; see docs/Pinecone_preprocess_guideline.md.
- Run the tracker: python manage.py run_boost_library_docs_tracker --help.
- Service-layer overview: docs/service_api/boost_library_docs_tracker.md.
- Confirm pandoc is on PATH before debugging conversion failures.
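One quick way to run the pandoc check from Python is sketched below, using only the standard library. The helper name pandoc_version is hypothetical and is not part of this app.

```python
import shutil
import subprocess


def pandoc_version():
    """Return the first line of `pandoc --version`, or None if not on PATH."""
    path = shutil.which("pandoc")
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=True
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    banner = pandoc_version()
    print(banner if banner else "pandoc not found on PATH")
```

If this prints "pandoc not found on PATH", conversion failures are most likely an environment issue rather than a problem with the fetched HTML.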
Scrapes Boost library docs for one or more versions, writes workspace + BoostDocContent / BoostLibraryDocumentation rows, then upserts Pinecone (unless skipped).
| Option | Description |
|---|---|
| --versions | Zero or more Boost versions (e.g. 1.86.0 1.87.0). Omitted → latest version from the BoostVersion table (run boost_library_tracker first if empty). |
| --library | Limit scrape to one library key (e.g. algorithm). Default: all libraries for each version. |
| --dry-run | Parse/fetch without writing DB, workspace, or Pinecone. |
| --skip-pinecone | Write DB + workspace but skip Pinecone upsert. |
| --max-pages | Per-library BFS page cap when crawling HTTP (default 10). |
| --use-local | Download Boost source zip and walk local HTML instead of HTTP crawl. |
| --cleanup-extract | With --use-local, delete extracted tree + downloaded zip after each version’s libraries finish. |
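To illustrate what a per-library BFS page cap means in practice, here is a minimal breadth-first crawl sketch over a link graph. It is not the collector's implementation: bfs_crawl and get_links are hypothetical names, and a real crawler would fetch and parse HTML instead of calling a supplied function.

```python
from collections import deque


def bfs_crawl(start_url, get_links, base_url, max_pages=10):
    """Visit pages breadth-first, staying under base_url, up to max_pages.

    get_links is a caller-supplied callable (hypothetical here) that returns
    the outgoing links of a page.
    """
    visited = []
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            # Skip links outside the docs base and pages already queued.
            if link.startswith(base_url) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Because the queue is processed in first-in, first-out order, a small cap keeps the pages closest to the library's entry point and drops deeper ones first.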
| Command | Purpose |
|---|---|
| run_boost_library_docs_tracker | Primary doc fetch / conversion pipeline. |
Run python manage.py COMMAND --help for options.
python -m pytest boost_library_docs_tracker/tests/ -v (from repo root; see root README).