Make the PDF build time bearable. by KubaO · Pull Request #148 · twinbasic/documentation

KubaO · 2026-05-20T14:22:09Z

From 10 minutes down to 90 seconds flat :)

The perf harness was previously gitignoring its own lockfile, leaving the puppeteer + pagedjs-cli versions reproducible only by faith in npm install. Track the lockfile so the perf numbers in README correspond to a known-pinned dependency set.

The process phase of the render pipeline was 40s of `PDFDocument.load` + `pdfDoc.save` on the 52 MB raw Chrome PDF -- just to attach an outline tree and a few `/Info` fields. The incremental writer replaces that with a PDF 1.7 §7.5.6 incremental update: parse only the trailer, xref, Catalog and Info objects (~50ms), build outline objects in a fresh context starting from the original `/Size`, mutate Catalog and Info in place, then append the new objects + a new xref subsection + a trailer with `/Prev` pointing at the original xref. Readers chain backward to resolve everything we didn't touch. Original 52 MB stays byte-identical; we append ~400 KB. Wired into the harness as `--incremental`; on the 1638-page book the process phase drops from 39.7s to 0.25s, total from 150.7s to 110.3s. The tradeoff is file size -- pdf-lib's full save() deflate-compresses content streams as a side effect (17 MB output), the incremental writer preserves Chrome's uncompressed streams (53 MB output). Both documented in perf/README.md alongside the size-reclaim options.

Two findings that aren't behind the incremental writer: 1. pdf-lib's PDFDocument.load defaults to parseSpeed: Slow (100 objects/tick, with await waitForTick() between batches) and save defaults to objectsPerTick: 50. On the 52 MB Chrome PDF that's ~500+1000 cooperative yields per round-trip; a CPU profile shows 79 % of load is V8 idle wait, not actual parsing. The default was designed for browsers (don't lock the main thread for 30 s); in the harness with no concurrent work it's pure overhead. Switching to parseSpeed: Fastest + objectsPerTick: Infinity drops the roundtrip from 40 s to 4.7 s on the same input -- 8x speedup with no behaviour change. Wired into the harness's default roundtrip path. Production book.bat still pays the cost because pagedjs-cli passes no options to load/save; that's a separate follow-up. 2. Chromium gained Page.printToPDF { generateDocumentOutline: true } in M122 (Feb 2024). Puppeteer exposes it as `outline: true` and adds the launch flag automatically since v22. Chrome walks h1..h6 from the accessibility tree and emits /Outlines directly. Wired in as --chrome-outline. Caveats documented in README: - No tag-level filter (h1..h6 unconditional). Per-heading suppression works via role="presentation" / aria-hidden -- 1 attribute, visually unchanged. - Tree depth doesn't equal heading level: Chrome collapses skipped levels (h5 under h3 becomes depth+1, not depth+2). So filtering Chrome's output to "depth <= 3" does not extract h1..h4. - Diff against the injected outline on the real book shows significant structural drift (the injected outline gets 1773 entries, Chrome 6023 unfiltered / 1820 filtered, 10 of those point at different pages by end of book). - pagedjs-cli has never tried this option in any form -- searched github issues/PRs/code, gitlab, web; no mention. Probes and analysis scripts: profile-load.mjs, profile-roundtrip.mjs, probe-chrome-outline.mjs, probe-outline-exclusions.mjs, compare-outlines.mjs. All independently runnable.

pagedjs-cli's only behavioural contribution was the paged.js browser bundle and a small set of pdf-lib helpers; the CLI wrapper around them called PDFDocument.load / save with no options and therefore inherited pdf-lib's slow defaults (~32 s of pure idle yielding per build on the 1638-page docs book -- see perf/README.md's "Profiling pdf-lib's load"). Rather than patch upstream just to pass two option objects, vendor what we actually use and drive puppeteer ourselves. docs/lib/paged.browser.js -- pagedjs-cli@0.4.3/dist/browser.js verbatim, MIT, license header preserved. docs/lib/outline.mjs -- src/outline.js, ESM-ified, MIT, attribution. docs/lib/postprocesser.mjs -- src/postprocesser.js, ESM-ified, MIT, attribution. docs/render-book.mjs -- new ~190 LOC production driver. CLI-compatible with the subset of pagedjs-cli flags book.bat actually used (-o, --outline-tags, -t, --additional-script). Calls PDFDocument.load with parseSpeed: Fastest and pdfDoc.save with objectsPerTick: Infinity inline; no plumbing required. docs/book.bat swapped to `node render-book.mjs ...`. Same flags, ~32 s faster end-to-end (process phase drops from 39.7 s to 5.1 s on the same input), one fewer transitive dependency tree. perf/measure.mjs now imports the helpers and the bundle from docs/lib/ so the harness and production share the exact same code path. Both package.json files dropped pagedjs-cli and added puppeteer + pdf-lib + html-entities as direct deps (they were previously transitive via pagedjs-cli).

A paged.js handler in docs/lib/progress-handler.js emits a line per afterPageLayout; render-book.mjs re-renders it as a `\r`-overwritten TTY status (or one line every 100 pages when stdout is piped). Generate-phase progress remains a follow-up.

A 500 ms heartbeat in render-book.mjs writes `generating: 23.4s` to a `\r`-overwritten TTY line during page.pdf(). Byte-streaming via CDP printToPDF + transferMode:'ReturnAsStream' was tried first but adds no signal on the Chromium we ship with: SkPDF buffers the whole document internally and emits all 52 MB in one tick at the end. README documents the finding so the next reader doesn't re-investigate.

Replace 16 per-page page.element.querySelector(...) calls in AtPage.finalizePage's two forEach loops with one querySelectorAll up front into a class-keyed lookup table. A/B at 3+3 interleaved: query-CPU self-time drops ~91 ms (-27%, from ~338 ms unpatched to ~247 ms patched); render wall-clock is in the noise. The trigger for looking at this was a stale 13.7 s self-time row in the post-detach CPU profile -- which was measured against the earlier afterPageLayout-hook variant of detach-pages.js (hidden subtree style-cascade penalty); the shipping finalizePage-hook variant brings the same lambda to ~1 s total render. README is updated to flag that. The patch is marked `// PATCH: consolidate` at each touch point so a future re-vendoring of the bundle can find it via grep.

Hoist the conditional getComputedStyle reads out of AtPage.finalizePage's two forEach loops into one upfront batch that fills two Maps (__maxW, __maxH). The forEach bodies now consume cached values instead of calling GCS inline, so the write-then-read pattern that previously forced a layout flush on every iteration collapses to one flush total. For this book the hoist is a no-op: only the bottom-right page-number corner has .hasContent, so one GCS read fires per page either way and the smoke render lands at 47.98 s / 1638 pages / 16.9 MB -- inside the noise band of the consolidate-querySelector A/B. The win materializes for docs with multi-corner marginalia, where N inline reads (and therefore N forced flushes) collapse to 1. Marked `// PATCH:` at all three touch points so a future re-vendoring of the bundle can find them via grep. README's "Revisiting AtPage.finalizePage" section gains a "Read/write batching" sub-section documenting the change and the scaling note.

@page

AtPage.finalizePage's grid-template-columns / grid-template-rows output is a pure function of (page.element.className, marginalia map, CSS). For a 1638-page book with ~2-4 distinct page-class signatures, the same four values are recomputed ~1634 redundant times. Cache them on this.__finalizeCache, keyed by className. Check sits between the __mLookup build and the GCS hoist; on hit, apply the four cached strings via __mLookup and return -- Phases B and C skipped. On miss, run as before and store the result by reading back the just-written .style.grid-template-* values at the end of the method. Phase A's marginalia .hasContent classifier still runs every page (needed for @page-margin CSS rules to apply on the fresh DOM). Smoke render: 1638 pages, 16.9 MB output (byte-equivalent to the pre-patch run), render 48.27 s -- inside the noise band. Cleanup- only ship: the one flush we skip per cached page is ~70 us; total saving across the render is ~0.1 s, two orders of magnitude below the 4-13 s within-variant noise from the consolidate-querySelector A/B. README documents the wall-clock framing alongside the prior two finalizePage patches. Cache assumes @page rules don't use :nth-of-type or other position-dependent selectors that would vary by page index for the same className. Common case; commented in the bundle.

Replaces the per-page Phase B and Phase C decision tree in AtPage.finalizePage with CSS rules emitted once at parse time by the polisher. AtPage.afterTreeWalk now calls a new emitMarginGridTemplates after addPageClasses; for each page-class entry it computes the effective per-cell hasContent + max-width / max-height by unioning over matching marginalia entries (Phase A OR-cascade + CSS cascade for widths), runs the same decision tree the runtime did, and writes grid-template-columns / -rows rules into the polished stylesheet using selectorsForPage as the selector. For this book that emits 24 rules (6 page-class signatures x 4 margin groups), all '0 0 1fr'. addMarginaliaStyles now captures maxWidth / maxHeight / loc on each this.marginalia[sel] entry during its existing AST walk, so emitMarginGridTemplates can run the decision tree without a rendered DOM. finalizePage collapses to Phase A + an offset-only Phase B. The four offsetWidth-based fallback branches (auto-width minmax(%) templates when >= 2 cells have content and all relevant widths are auto) stay -- they read live layout and can't be pre-computed. A couldFire gate short-circuits before any querySelector or getComputedStyle when the content pattern can't reach an offset branch (every page in this book). Phase C is deleted -- the upstream code has no offset measurement in the vertical groups, so every branch is now handled by the static emission. Removes two prior PATCH blocks whose code paths no longer exist: - // PATCH: consolidate the 16 per-iteration querySelector calls (__mLookup build) - // PATCH: cross-page memoization (__finalizeCache + replay on hit) Plus their trailing artifacts (record-what-we-just-wrote, lookup fallbacks in the forEach loops). Keeps the // PATCH: hoist max-width reads pattern, rewritten inline inside the offset fallback path where layout-flush-after-classList.add still matters. Wall-clock is in the noise as expected -- this moves work from runtime JS to parse-time CSS but the browser does the same cascade + layout work either way. Instrumented A/B confirms the new path is taken: getComputedStyle calls drop from 9179 to 5903 (-36%, the two-per-page max-width reads the prior GCS hoist batched), offsetWidth stays at 0 (gate never fires), PDF size unchanged within run-to-run timestamp variance (~27 bytes). The README's 'Cross-page memoization' section now has a follow-up 'Hoisting grid-template emission to parse time' covering the architecture, the 24 emitted rules for this book, and what's left in finalizePage.

Profiles after the recent finalizePage commits showed render still scaling with page count via Page.create's gBCR layout flush. Three profile-diffed A/Bs: - textBreak per-word loop -> binary search: wash. textBreak isn't a significant gBCR caller; the loop's calls are folded into the JS wrapper at <100 ms total. - Page.create gBCR memoization: total native gBCR unchanged (18.4 s -> 18.6 s); the 12.9 s flush just migrated to Layout's constructor at line 1443. Lesson: gBCR self-time is layout-flush attribution, not call overhead. - counter-increment on .pagedjs_page disabled: total gBCR went up, not down. Rules out CSS counter resolution as the residual. The fourth A/B is the win. Physically removeChild'ing finalized pages instead of display:none-ing them (keep-last variant, since the chunker passes lastPage.element to Page.create) cuts Page.create's gBCR self-time from 12.9 s to 1.1 s and total render from 48 s to 28 s. The residual O(n) was Chromium's selector/style sweep over .pagedjs_pages siblings -- display:none does not exclude them from style resolution, only from layout. Documents the methodology shift (compare CPU profile diffs, not wall-clock totals) and what's needed to ship the aggressive detach: counter(page) -> attr(data-page-number) swap in print.css, named-string behavior verification, re-attach order, and handler audit. Bundle and detach-pages.js are unchanged -- this commit is documentation only.

@page

The CPU profile diff showed display:none does not exclude finalized pages from Chromium's per-page style/selector sweep, which scales with sibling count -- the residual O(n) the README narrative was hunting. Physically removing finalized pages from the DOM collapses Page.create's gBCR layout flush from 12.9 s to 0.9 s (15x) and total render from 48 s to 26 s on the 1638-page book. Three coordinated changes: - perf/detach-pages.js: removeChild + ordered re-attach. Keep the most recent finalized page in DOM so the chunker's lastPage lookup works; detach one page behind. afterRendered detaches the keeper and re-appends all in finalize (document) order. - docs/lib/paged.browser.js Counters handler: track displayed page number in JS and write `--page-num: "N"` per page, since CSS counter(page) no longer accumulates across detached siblings. Increment behaviour mirrors the shipping CSS exactly: skip increment on [data-counter-page-reset] pages, don't apply the reset value -- preserves the continuous numbering the original PDF was actually producing (the part-restart logic described in print.css's comments doesn't take effect in the current cascade; that's a separate issue). - docs/assets/css/print.css: @bottom-right reads var(--page-num) instead of counter(page). The existing @page :first and @page divider suppression rules still work because they override the entire `content` declaration. Verified byte-equivalent @bottom-right and @top-right text on sample pages 1, 2, 5, 6, 10, 100, 500, 1000, 1500, 1638. Named strings (string-set chapter-title) survive detach. Render: 48.5 s -> 26.3 s. Ratio: 1.77x -> 1.18x. Total native gBCR: 18.4 s -> 7.5 s. The remaining render cost is dominated by hasOverflow:1925 (4.6 s, now flat).

The guard at paged.browser.js:1998 read `table.querySelector("[colspan]")` but the variable is named `rowspan`, the comment says rowspan, and the block walks back through rows to find the one that started a rowspan group (the only row with full column coverage). Tables with rowspan but no colspan silently skipped the rowspan-aware break path and could break in the middle of a rowspan group. One-char fix: [colspan] -> [rowspan]. Smoke render: 1638 pages, no regression in render time or output.

The per-page rule injected at paged.browser.js:30836 used counter-reset, which per CSS Lists L3 §12.4.2 creates a new counter instance scoped to the element, leaving the .pagedjs_pages ancestor's counter untouched for sibling pages. That's an upstream paged.js bug: the rule was a no-op for cross-page numbering, so books that meant to restart page numbering at part dividers (like ours -- see print.css:113-126) silently shipped continuous numbering instead. counter-set modifies the existing closest-scope counter, which is what was intended. With this fix, the page right after each part divider now reads "1" as designed (e.g., book page 6 went from "5" to "1"; subsequent pages count up from there within each part). Also realigns my JS --page-num counter (added in the aggressive-detach commit) to honour the same reset values now that the CSS path actually restarts. Both code paths agree: skip increment on data-counter-page-reset pages, then apply the reset value. Verified: 1638 pages, chapter titles in @top-right unchanged on every sampled page. Render time unchanged.

The descend-to-deepest-valid-descendant loop reads lastChild.lastChild up to 3 times per iteration: the while condition, the validNode check, and the assignment. Cache it once at the top of each iteration. CPU profile self-time (--detach-pages, 1638 pages): findEndToken: 3269.92 ms -> 3108.00 ms (-162 ms, -5%). No migration to other callers; createBreakToken and gBCR self-times also drop or stay flat. Byte-equivalent PDF: page count 1638, all sampled page-number and chapter-title extractions match.

Adds a new "What happened when we tried createBreakToken dedup" section to perf/README.md covering the two attempts from this round: - Attempt A (findEndToken cache): shipped, -162 ms self-time. - Attempt B (createBreakToken findElement dedup): reverted. The dedup is semantically incapable of regressing, but the savings fell below the ~40-80 ms per-edge noise floor (measured against an unrelated findEndToken->findElement edge that hadn't been touched between runs). The !renderedNode + !temp.nextSibling branch the dedup targets must fire rarely. Echoes the textBreak post-mortem: a correct optimization in a cold branch is unmeasurable, and the methodology rule is to revert when the profile diff doesn't show movement above noise.

Page.checkUnderflowAfterResize is fired by the wrapper's ResizeObserver every time content shrinks -- i.e. on every overflow extraction during normal render. It calls findEndToken (~2ms per call, 3.6s total over a 1638-page book) and hands the resulting endToken to this._onUnderflow. The only live onUnderflow registration in the bundle was an empty callback in Chunker.addPage (with commented-out intent). The computed endToken was discarded every time. Delete the no-op registration so _onUnderflow stays undefined by default, and add an early bail in checkUnderflowAfterResize so findEndToken doesn't run when nobody can use its result. Any future caller that needs the underflow path back just calls page.onUnderflow(realFn) -- the non-default handler is itself the activation signal. CPU profile self-time (--detach-pages, 1638 pages): findEndToken: 3108 ms -> 0 ms. findElement: 1767 ms -> 1314 ms (the findEndToken->findElement attribution edge was ~537 ms; the rest is jitter). render total: 25.75 s -> 22.26 s (-3.49 s, -14%). Byte-equivalent PDF: page count 1638, all sampled bottom-right page-number and top-right chapter-title extractions match.

Attempted to apply the Attempt-C pattern to the next gBCR caller in the per-page sequence: Footnotes.afterPageLayout (~1114 ms attributed gBCR), guarded by Object.keys(this.footnotes).length which is 0 for this document (no `float: footnote` in CSS). Result: net gBCR -169 ms, but render wall-clock +880 ms, and the per-page last/first ratio worsened 1.50x -> 1.75x. The 944 ms of "missing" gBCR re-attributed to hasOverflow, create, Layout, and a previously-invisible afterPageLayout at line 31986 -- the classic Page.create-memoize migration shape. Reverted. README adds Attempt D documenting the trap and a revised "lesson" subsection that splits the methodology in two: JS-body self-time is addressable by walking up to the consumer and deleting the production; gBCR self-time is layout-flush attribution and cannot be addressed by skipping the calling JS without also changing the mutation pattern (which is what aggressive-detach did at the chunker level).

The shorthand is used throughout the document (since the "three places non-redundant work could be made simpler" section and all subsequent post-mortems) but was never expanded. Add a note right after the first profile that introduces getBoundingClientRect as the dominant frame, explaining both the abbreviation and the broader "layout-flush attribution" concept that's load-bearing for the rest of the doc's methodology discussion.

find-callers.mjs is the caller-attribution script used throughout the perf README's post-mortems (Page.create memoize, Footnotes skip, findEndToken/onUnderflow dead path) to detect gBCR migration between callers and to spot dead consumers up the call chain. It belongs alongside analyze-profile.mjs as part of the perf toolkit, not as a one-off at the repo root. Also: - Update the README's "Methodology" pointer that referenced a stale "gBCR-callers script under time-hooks-current" -- now links to perf/find-callers.mjs. - Add a header comment to find-callers.mjs explaining what it does, when to reach for it, and the hitCount caveat. - Top-level .gitignore entries for /before/, /after-*/, and /findoverflow-baseline/ -- the conventional output names used with `node perf/measure.mjs --out <name>` from the repo root. perf/results/ (measure.mjs's default location) was already ignored in perf/.gitignore.

The per-page render loop's overflow gate fired findBreakToken on every iteration once accumulated text exceeded maxChars (1500). That meant pages holding ~3000 chars triggered ~30+ findBreakToken calls -- each one a hasOverflow gBCR (= layout flush) -- before the overflow tripped. Linear-scan with O(nodes-past-maxChars) gBCR calls per page. Replace with additive backoff on a moving lengthAtLastCheck baseline: check once per maxChars of NEW content, advance the baseline when no overflow yet, reset on overflow found. Cuts per-page hasOverflow calls from O(nodes-past-maxChars) to O(page-chars / maxChars), typically 2-3 vs 30+. Correctness holds because findBreakToken handles arbitrary overshoot: findOverflow walks the wrapper to identify the actual overflowing Range regardless of how much was appended past it, removeOverflow extracts the excess via extractContents, and createBreakToken returns a BreakToken pointing at the right source resume position. The chunker creates a fresh walker from breakToken.node on the next page, so over-appended content is re-laid-out from its proper source position. The break-inside: avoid concern (containers possibly making different decisions when they hold extra trailing content) turned out to be unfounded on this document. CPU profile diff (--detach-pages, paired): | metric | PRE | POST | Δ | | --------------------------- | --------- | --------- | ---------------- | | render wall-clock | 23.73 s | 19.48 s | -4.25 s (-18 %) | | total gBCR (attribution) | 8024 ms | 5705 ms | -2319 ms (-29 %) | | hasOverflow gBCR | 4837 ms | 2725 ms | **-2112 (-44 %)**| | findOverflow per-node gBCR| 438 ms | 166 ms | -272 | | create / Layout / Footn. | unchanged | | removeOverflow self | 457 ms | 370 ms | -87 (improved) | | per-page ratio (last/first) | 1.64x | 1.60x | improved | No migration trap: Footnotes (1126 ms), create (955), and Layout (534) gBCR slices all stayed within jitter, and the per-page ratio got better, not worse. removeOverflow dropped instead of growing because fewer findBreakToken invocations mean fewer extractContents passes, not larger ones -- the maximum overshoot per page is bounded by maxChars (1500 chars), small relative to page capacity. PDF byte-equivalence: full pdftotext-MD5 match on pages 6, 100, 500, 1000, 1500, 1638; bottom-right page numbers and top-right chapter titles match on all sampled pages; PDF byte size 126 bytes apart (metadata timestamps); page count 1638.

KubaO added 26 commits May 20, 2026 16:18

Add a basic timing harness for paged.js.

9023531

Profile pagedjs rendering system.

7602636

Make the render time linear.

ad431b9

Investigate the number of flushes before and after the fix.

e9fe585

Test further flush-related optimization opportunities.

d4c16f4

Track perf/package-lock.json.

b4acd77

The perf harness was previously gitignoring its own lockfile, leaving the puppeteer + pagedjs-cli versions reproducible only by faith in npm install. Track the lockfile so the perf numbers in README correspond to a known-pinned dependency set.

KubaO merged commit efca24f into twinbasic:main May 20, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make the PDF build time bearable.#148

Make the PDF build time bearable.#148
KubaO merged 26 commits into
twinbasic:mainfrom
KubaO:staging

KubaO commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

KubaO commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant