Skip to content

Make the PDF build time bearable.#148

Merged
KubaO merged 26 commits into
twinbasic:mainfrom
KubaO:staging
May 20, 2026
Merged

Make the PDF build time bearable.#148
KubaO merged 26 commits into
twinbasic:mainfrom
KubaO:staging

Conversation

@KubaO
Copy link
Copy Markdown
Collaborator

@KubaO KubaO commented May 20, 2026

From 10 minutes down to 90 seconds flat :)

KubaO added 26 commits May 20, 2026 16:18
The perf harness was previously gitignoring its own lockfile, leaving
the puppeteer + pagedjs-cli versions reproducible only by faith in npm
install. Track the lockfile so the perf numbers in README correspond to
a known-pinned dependency set.
The process phase of the render pipeline was 40s of `PDFDocument.load`
+ `pdfDoc.save` on the 52 MB raw Chrome PDF -- just to attach an
outline tree and a few `/Info` fields. The incremental writer replaces
that with a PDF 1.7 §7.5.6 incremental update: parse only the trailer,
xref, Catalog and Info objects (~50ms), build outline objects in a
fresh context starting from the original `/Size`, mutate Catalog and
Info in place, then append the new objects + a new xref subsection +
a trailer with `/Prev` pointing at the original xref. Readers chain
backward to resolve everything we didn't touch. Original 52 MB stays
byte-identical; we append ~400 KB.

Wired into the harness as `--incremental`; on the 1638-page book the
process phase drops from 39.7s to 0.25s, total from 150.7s to 110.3s.
The tradeoff is file size -- pdf-lib's full save() deflate-compresses
content streams as a side effect (17 MB output), the incremental
writer preserves Chrome's uncompressed streams (53 MB output). Both
documented in perf/README.md alongside the size-reclaim options.
Two findings that aren't behind the incremental writer:

1. pdf-lib's PDFDocument.load defaults to parseSpeed: Slow (100
   objects/tick, with await waitForTick() between batches) and save
   defaults to objectsPerTick: 50. On the 52 MB Chrome PDF that's
   ~500+1000 cooperative yields per round-trip; a CPU profile shows
   79 % of load is V8 idle wait, not actual parsing. The default was
   designed for browsers (don't lock the main thread for 30 s); in
   the harness with no concurrent work it's pure overhead. Switching
   to parseSpeed: Fastest + objectsPerTick: Infinity drops the
   roundtrip from 40 s to 4.7 s on the same input -- 8x speedup with
   no behaviour change. Wired into the harness's default roundtrip
   path. Production book.bat still pays the cost because pagedjs-cli
   passes no options to load/save; that's a separate follow-up.

2. Chromium gained Page.printToPDF { generateDocumentOutline: true }
   in M122 (Feb 2024). Puppeteer exposes it as `outline: true` and
   adds the launch flag automatically since v22. Chrome walks h1..h6
   from the accessibility tree and emits /Outlines directly. Wired in
   as --chrome-outline. Caveats documented in README:

   - No tag-level filter (h1..h6 unconditional). Per-heading
     suppression works via role="presentation" / aria-hidden -- 1
     attribute, visually unchanged.
   - Tree depth doesn't equal heading level: Chrome collapses skipped
     levels (h5 under h3 becomes depth+1, not depth+2). So filtering
     Chrome's output to "depth <= 3" does not extract h1..h4.
   - Diff against the injected outline on the real book shows
     significant structural drift (the injected outline gets 1773
     entries, Chrome 6023 unfiltered / 1820 filtered, 10 of those
     point at different pages by end of book).
   - pagedjs-cli has never tried this option in any form -- searched
     github issues/PRs/code, gitlab, web; no mention.

Probes and analysis scripts: profile-load.mjs, profile-roundtrip.mjs,
probe-chrome-outline.mjs, probe-outline-exclusions.mjs,
compare-outlines.mjs. All independently runnable.
pagedjs-cli's only behavioural contribution was the paged.js browser
bundle and a small set of pdf-lib helpers; the CLI wrapper around
them called PDFDocument.load / save with no options and therefore
inherited pdf-lib's slow defaults (~32 s of pure idle yielding per
build on the 1638-page docs book -- see perf/README.md's "Profiling
pdf-lib's load"). Rather than patch upstream just to pass two option
objects, vendor what we actually use and drive puppeteer ourselves.

docs/lib/paged.browser.js -- pagedjs-cli@0.4.3/dist/browser.js
verbatim, MIT, license header preserved.

docs/lib/outline.mjs       -- src/outline.js, ESM-ified, MIT, attribution.
docs/lib/postprocesser.mjs -- src/postprocesser.js, ESM-ified, MIT, attribution.

docs/render-book.mjs -- new ~190 LOC production driver. CLI-compatible
with the subset of pagedjs-cli flags book.bat actually used
(-o, --outline-tags, -t, --additional-script). Calls PDFDocument.load
with parseSpeed: Fastest and pdfDoc.save with objectsPerTick: Infinity
inline; no plumbing required.

docs/book.bat swapped to `node render-book.mjs ...`. Same flags,
~32 s faster end-to-end (process phase drops from 39.7 s to 5.1 s
on the same input), one fewer transitive dependency tree.

perf/measure.mjs now imports the helpers and the bundle from
docs/lib/ so the harness and production share the exact same code
path. Both package.json files dropped pagedjs-cli and added
puppeteer + pdf-lib + html-entities as direct deps (they were
previously transitive via pagedjs-cli).
A paged.js handler in docs/lib/progress-handler.js emits a line per
afterPageLayout; render-book.mjs re-renders it as a `\r`-overwritten
TTY status (or one line every 100 pages when stdout is piped).
Generate-phase progress remains a follow-up.
A 500 ms heartbeat in render-book.mjs writes `generating: 23.4s` to
a `\r`-overwritten TTY line during page.pdf(). Byte-streaming via
CDP printToPDF + transferMode:'ReturnAsStream' was tried first but
adds no signal on the Chromium we ship with: SkPDF buffers the whole
document internally and emits all 52 MB in one tick at the end.
README documents the finding so the next reader doesn't re-investigate.
Replace 16 per-page page.element.querySelector(...) calls in
AtPage.finalizePage's two forEach loops with one querySelectorAll up
front into a class-keyed lookup table. A/B at 3+3 interleaved:
query-CPU self-time drops ~91 ms (-27%, from ~338 ms unpatched to
~247 ms patched); render wall-clock is in the noise.

The trigger for looking at this was a stale 13.7 s self-time row in
the post-detach CPU profile -- which was measured against the
earlier afterPageLayout-hook variant of detach-pages.js (hidden
subtree style-cascade penalty); the shipping finalizePage-hook
variant brings the same lambda to ~1 s total render. README is
updated to flag that. The patch is marked `// PATCH: consolidate` at
each touch point so a future re-vendoring of the bundle can find it
via grep.
Hoist the conditional getComputedStyle reads out of AtPage.finalizePage's
two forEach loops into one upfront batch that fills two Maps (__maxW,
__maxH). The forEach bodies now consume cached values instead of
calling GCS inline, so the write-then-read pattern that previously
forced a layout flush on every iteration collapses to one flush total.

For this book the hoist is a no-op: only the bottom-right page-number
corner has .hasContent, so one GCS read fires per page either way and
the smoke render lands at 47.98 s / 1638 pages / 16.9 MB -- inside the
noise band of the consolidate-querySelector A/B. The win materializes
for docs with multi-corner marginalia, where N inline reads (and
therefore N forced flushes) collapse to 1.

Marked `// PATCH:` at all three touch points so a future re-vendoring
of the bundle can find them via grep. README's "Revisiting
AtPage.finalizePage" section gains a "Read/write batching" sub-section
documenting the change and the scaling note.
AtPage.finalizePage's grid-template-columns / grid-template-rows
output is a pure function of (page.element.className, marginalia
map, CSS). For a 1638-page book with ~2-4 distinct page-class
signatures, the same four values are recomputed ~1634 redundant
times.

Cache them on this.__finalizeCache, keyed by className. Check sits
between the __mLookup build and the GCS hoist; on hit, apply the
four cached strings via __mLookup and return -- Phases B and C
skipped. On miss, run as before and store the result by reading
back the just-written .style.grid-template-* values at the end of
the method. Phase A's marginalia .hasContent classifier still runs
every page (needed for @page-margin CSS rules to apply on the fresh
DOM).

Smoke render: 1638 pages, 16.9 MB output (byte-equivalent to the
pre-patch run), render 48.27 s -- inside the noise band. Cleanup-
only ship: the one flush we skip per cached page is ~70 us; total
saving across the render is ~0.1 s, two orders of magnitude below
the 4-13 s within-variant noise from the consolidate-querySelector
A/B. README documents the wall-clock framing alongside the prior
two finalizePage patches.

Cache assumes @page rules don't use :nth-of-type or other
position-dependent selectors that would vary by page index for the
same className. Common case; commented in the bundle.
Replaces the per-page Phase B and Phase C decision tree in
AtPage.finalizePage with CSS rules emitted once at parse time by
the polisher. AtPage.afterTreeWalk now calls a new
emitMarginGridTemplates after addPageClasses; for each page-class
entry it computes the effective per-cell hasContent + max-width /
max-height by unioning over matching marginalia entries (Phase A
OR-cascade + CSS cascade for widths), runs the same decision tree
the runtime did, and writes grid-template-columns / -rows rules
into the polished stylesheet using selectorsForPage as the
selector. For this book that emits 24 rules (6 page-class
signatures x 4 margin groups), all '0 0 1fr'.

addMarginaliaStyles now captures maxWidth / maxHeight / loc on
each this.marginalia[sel] entry during its existing AST walk, so
emitMarginGridTemplates can run the decision tree without a
rendered DOM.

finalizePage collapses to Phase A + an offset-only Phase B. The
four offsetWidth-based fallback branches (auto-width minmax(%)
templates when >= 2 cells have content and all relevant widths
are auto) stay -- they read live layout and can't be pre-computed.
A couldFire gate short-circuits before any querySelector or
getComputedStyle when the content pattern can't reach an offset
branch (every page in this book). Phase C is deleted -- the
upstream code has no offset measurement in the vertical groups,
so every branch is now handled by the static emission.

Removes two prior PATCH blocks whose code paths no longer exist:
  - // PATCH: consolidate the 16 per-iteration querySelector calls
    (__mLookup build)
  - // PATCH: cross-page memoization
    (__finalizeCache + replay on hit)
Plus their trailing artifacts (record-what-we-just-wrote, lookup
fallbacks in the forEach loops). Keeps the // PATCH: hoist
max-width reads pattern, rewritten inline inside the offset
fallback path where layout-flush-after-classList.add still
matters.

Wall-clock is in the noise as expected -- this moves work from
runtime JS to parse-time CSS but the browser does the same
cascade + layout work either way. Instrumented A/B confirms the
new path is taken: getComputedStyle calls drop from 9179 to 5903
(-36%, the two-per-page max-width reads the prior GCS hoist
batched), offsetWidth stays at 0 (gate never fires), PDF size
unchanged within run-to-run timestamp variance (~27 bytes).

The README's 'Cross-page memoization' section now has a follow-up
'Hoisting grid-template emission to parse time' covering the
architecture, the 24 emitted rules for this book, and what's left
in finalizePage.
Profiles after the recent finalizePage commits showed render still
scaling with page count via Page.create's gBCR layout flush. Three
profile-diffed A/Bs:

- textBreak per-word loop -> binary search: wash. textBreak isn't a
  significant gBCR caller; the loop's calls are folded into the JS
  wrapper at <100 ms total.

- Page.create gBCR memoization: total native gBCR unchanged
  (18.4 s -> 18.6 s); the 12.9 s flush just migrated to Layout's
  constructor at line 1443. Lesson: gBCR self-time is layout-flush
  attribution, not call overhead.

- counter-increment on .pagedjs_page disabled: total gBCR went up,
  not down. Rules out CSS counter resolution as the residual.

The fourth A/B is the win. Physically removeChild'ing finalized
pages instead of display:none-ing them (keep-last variant, since
the chunker passes lastPage.element to Page.create) cuts
Page.create's gBCR self-time from 12.9 s to 1.1 s and total render
from 48 s to 28 s. The residual O(n) was Chromium's selector/style
sweep over .pagedjs_pages siblings -- display:none does not
exclude them from style resolution, only from layout.

Documents the methodology shift (compare CPU profile diffs, not
wall-clock totals) and what's needed to ship the aggressive
detach: counter(page) -> attr(data-page-number) swap in
print.css, named-string behavior verification, re-attach order,
and handler audit. Bundle and detach-pages.js are unchanged --
this commit is documentation only.
The CPU profile diff showed display:none does not exclude finalized
pages from Chromium's per-page style/selector sweep, which scales
with sibling count -- the residual O(n) the README narrative was
hunting. Physically removing finalized pages from the DOM collapses
Page.create's gBCR layout flush from 12.9 s to 0.9 s (15x) and
total render from 48 s to 26 s on the 1638-page book.

Three coordinated changes:

- perf/detach-pages.js: removeChild + ordered re-attach. Keep the
  most recent finalized page in DOM so the chunker's lastPage
  lookup works; detach one page behind. afterRendered detaches
  the keeper and re-appends all in finalize (document) order.

- docs/lib/paged.browser.js Counters handler: track displayed
  page number in JS and write `--page-num: "N"` per page, since
  CSS counter(page) no longer accumulates across detached
  siblings. Increment behaviour mirrors the shipping CSS exactly:
  skip increment on [data-counter-page-reset] pages, don't apply
  the reset value -- preserves the continuous numbering the
  original PDF was actually producing (the part-restart logic
  described in print.css's comments doesn't take effect in the
  current cascade; that's a separate issue).

- docs/assets/css/print.css: @bottom-right reads var(--page-num)
  instead of counter(page). The existing @page :first and
  @page divider suppression rules still work because they
  override the entire `content` declaration.

Verified byte-equivalent @bottom-right and @top-right text on
sample pages 1, 2, 5, 6, 10, 100, 500, 1000, 1500, 1638.
Named strings (string-set chapter-title) survive detach.

Render: 48.5 s -> 26.3 s. Ratio: 1.77x -> 1.18x. Total native
gBCR: 18.4 s -> 7.5 s. The remaining render cost is dominated
by hasOverflow:1925 (4.6 s, now flat).
The guard at paged.browser.js:1998 read `table.querySelector("[colspan]")`
but the variable is named `rowspan`, the comment says rowspan, and the
block walks back through rows to find the one that started a rowspan
group (the only row with full column coverage). Tables with rowspan
but no colspan silently skipped the rowspan-aware break path and could
break in the middle of a rowspan group.

One-char fix: [colspan] -> [rowspan]. Smoke render: 1638 pages, no
regression in render time or output.
The per-page rule injected at paged.browser.js:30836 used
counter-reset, which per CSS Lists L3 §12.4.2 creates a new
counter instance scoped to the element, leaving the
.pagedjs_pages ancestor's counter untouched for sibling pages.
That's an upstream paged.js bug: the rule was a no-op for
cross-page numbering, so books that meant to restart page
numbering at part dividers (like ours -- see print.css:113-126)
silently shipped continuous numbering instead.

counter-set modifies the existing closest-scope counter, which
is what was intended. With this fix, the page right after each
part divider now reads "1" as designed (e.g., book page 6 went
from "5" to "1"; subsequent pages count up from there within
each part).

Also realigns my JS --page-num counter (added in the
aggressive-detach commit) to honour the same reset values now
that the CSS path actually restarts. Both code paths agree:
skip increment on data-counter-page-reset pages, then apply
the reset value.

Verified: 1638 pages, chapter titles in @top-right unchanged
on every sampled page. Render time unchanged.
The descend-to-deepest-valid-descendant loop reads lastChild.lastChild
up to 3 times per iteration: the while condition, the validNode check,
and the assignment. Cache it once at the top of each iteration.

CPU profile self-time (--detach-pages, 1638 pages):
findEndToken: 3269.92 ms -> 3108.00 ms (-162 ms, -5%).
No migration to other callers; createBreakToken and gBCR self-times
also drop or stay flat. Byte-equivalent PDF: page count 1638, all
sampled page-number and chapter-title extractions match.
Adds a new "What happened when we tried createBreakToken dedup"
section to perf/README.md covering the two attempts from this round:

- Attempt A (findEndToken cache): shipped, -162 ms self-time.
- Attempt B (createBreakToken findElement dedup): reverted. The
  dedup is semantically incapable of regressing, but the savings
  fell below the ~40-80 ms per-edge noise floor (measured against
  an unrelated findEndToken->findElement edge that hadn't been
  touched between runs). The !renderedNode + !temp.nextSibling
  branch the dedup targets must fire rarely.

Echoes the textBreak post-mortem: a correct optimization in a
cold branch is unmeasurable, and the methodology rule is to revert
when the profile diff doesn't show movement above noise.
Page.checkUnderflowAfterResize is fired by the wrapper's
ResizeObserver every time content shrinks -- i.e. on every overflow
extraction during normal render. It calls findEndToken (~2ms per
call, 3.6s total over a 1638-page book) and hands the resulting
endToken to this._onUnderflow.

The only live onUnderflow registration in the bundle was an empty
callback in Chunker.addPage (with commented-out intent). The
computed endToken was discarded every time. Delete the no-op
registration so _onUnderflow stays undefined by default, and add
an early bail in checkUnderflowAfterResize so findEndToken doesn't
run when nobody can use its result. Any future caller that needs
the underflow path back just calls page.onUnderflow(realFn) -- the
non-default handler is itself the activation signal.

CPU profile self-time (--detach-pages, 1638 pages):
findEndToken: 3108 ms -> 0 ms.
findElement:  1767 ms -> 1314 ms (the findEndToken->findElement
              attribution edge was ~537 ms; the rest is jitter).
render total: 25.75 s -> 22.26 s (-3.49 s, -14%).

Byte-equivalent PDF: page count 1638, all sampled bottom-right
page-number and top-right chapter-title extractions match.
Attempted to apply the Attempt-C pattern to the next gBCR caller
in the per-page sequence: Footnotes.afterPageLayout (~1114 ms
attributed gBCR), guarded by Object.keys(this.footnotes).length
which is 0 for this document (no `float: footnote` in CSS).

Result: net gBCR -169 ms, but render wall-clock +880 ms, and the
per-page last/first ratio worsened 1.50x -> 1.75x. The 944 ms of
"missing" gBCR re-attributed to hasOverflow, create, Layout, and
a previously-invisible afterPageLayout at line 31986 -- the
classic Page.create-memoize migration shape.

Reverted. README adds Attempt D documenting the trap and a
revised "lesson" subsection that splits the methodology in two:
JS-body self-time is addressable by walking up to the consumer
and deleting the production; gBCR self-time is layout-flush
attribution and cannot be addressed by skipping the calling JS
without also changing the mutation pattern (which is what
aggressive-detach did at the chunker level).
The shorthand is used throughout the document (since the
"three places non-redundant work could be made simpler" section
and all subsequent post-mortems) but was never expanded. Add a
note right after the first profile that introduces
getBoundingClientRect as the dominant frame, explaining both the
abbreviation and the broader "layout-flush attribution" concept
that's load-bearing for the rest of the doc's methodology
discussion.
find-callers.mjs is the caller-attribution script used throughout
the perf README's post-mortems (Page.create memoize, Footnotes
skip, findEndToken/onUnderflow dead path) to detect gBCR migration
between callers and to spot dead consumers up the call chain. It
belongs alongside analyze-profile.mjs as part of the perf toolkit,
not as a one-off at the repo root.

Also:
- Update the README's "Methodology" pointer that referenced a
  stale "gBCR-callers script under time-hooks-current" -- now
  links to perf/find-callers.mjs.
- Add a header comment to find-callers.mjs explaining what it
  does, when to reach for it, and the hitCount caveat.
- Top-level .gitignore entries for /before/, /after-*/, and
  /findoverflow-baseline/ -- the conventional output names used
  with `node perf/measure.mjs --out <name>` from the repo root.
  perf/results/ (measure.mjs's default location) was already
  ignored in perf/.gitignore.
The per-page render loop's overflow gate fired findBreakToken on
every iteration once accumulated text exceeded maxChars (1500).
That meant pages holding ~3000 chars triggered ~30+ findBreakToken
calls -- each one a hasOverflow gBCR (= layout flush) -- before
the overflow tripped. Linear-scan with O(nodes-past-maxChars) gBCR
calls per page.

Replace with additive backoff on a moving lengthAtLastCheck
baseline: check once per maxChars of NEW content, advance the
baseline when no overflow yet, reset on overflow found. Cuts
per-page hasOverflow calls from O(nodes-past-maxChars) to
O(page-chars / maxChars), typically 2-3 vs 30+.

Correctness holds because findBreakToken handles arbitrary
overshoot: findOverflow walks the wrapper to identify the actual
overflowing Range regardless of how much was appended past it,
removeOverflow extracts the excess via extractContents, and
createBreakToken returns a BreakToken pointing at the right
source resume position. The chunker creates a fresh walker from
breakToken.node on the next page, so over-appended content is
re-laid-out from its proper source position. The break-inside:
avoid concern (containers possibly making different decisions
when they hold extra trailing content) turned out to be unfounded
on this document.

CPU profile diff (--detach-pages, paired):

| metric                      | PRE       | POST      | Δ                |
| --------------------------- | --------- | --------- | ---------------- |
| render wall-clock           | 23.73 s   | 19.48 s   | -4.25 s (-18 %)  |
| total gBCR (attribution)    | 8024 ms   | 5705 ms   | -2319 ms (-29 %) |
|   hasOverflow gBCR          | 4837 ms   | 2725 ms   | **-2112 (-44 %)**|
|   findOverflow per-node gBCR|  438 ms   |  166 ms   | -272             |
|   create / Layout / Footn.  | unchanged                                  |
| removeOverflow self         |  457 ms   |  370 ms   | -87 (improved)   |
| per-page ratio (last/first) | 1.64x     | 1.60x     | improved         |

No migration trap: Footnotes (1126 ms), create (955), and Layout
(534) gBCR slices all stayed within jitter, and the per-page
ratio got better, not worse. removeOverflow dropped instead of
growing because fewer findBreakToken invocations mean fewer
extractContents passes, not larger ones -- the maximum overshoot
per page is bounded by maxChars (1500 chars), small relative to
page capacity.

PDF byte-equivalence: full pdftotext-MD5 match on pages 6, 100,
500, 1000, 1500, 1638; bottom-right page numbers and top-right
chapter titles match on all sampled pages; PDF byte size 126
bytes apart (metadata timestamps); page count 1638.
@KubaO KubaO merged commit efca24f into twinbasic:main May 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant