
docs: omit .md pages from llms.txt without removing them completely #2480

Merged
webrdaniel merged 7 commits into master from
fix/omit-pages-from-llms-txt-without-removing-them-completely
May 5, 2026

Conversation

@webrdaniel
Contributor

@webrdaniel webrdaniel commented Apr 29, 2026

Follow-up to #2470. Listing pages in the llms-txt plugin's excludeRoutes also drops their /<route>.md counterparts from the build, so URLs like https://docs.apify.com/sdk.md started returning 404 (raised in #2470 (comment)).

This PR moves the exclusion from build time to post-build:

  • docusaurus.config.js: revert excludeRoutes back to just / and /search; add a NOTE so future contributors don't re-introduce the regression.
  • scripts/joinLlmsFiles.mjs: add LLMS_INDEX_EXCLUDE_PATTERNS and a filterLlmsIndex() postbuild step that strips matching - [Title](url) entries (and now-empty ## Section headings) from the generated build/llms.txt. The .md files stay on disk and continue to serve. Also fixes a pre-existing fire-and-forget race between joinFiles() and sanitizeFile().
  • package.json: add @docusaurus/utils as a direct dependency (used for createMatcher).
  • .github/workflows/test.yaml: add regression tests asserting that /sdk.md, /open-source.md, /api/v2/actor-builds-get.md, /api/v2/dataset-get.md, and /academy/tutorials.md still serve text/markdown. Also adds assert_final_content_type so child-repo homepages (/sdk/js, /sdk/python, /api/client/{js,python}, /cli) are checked through their nginx redirects for both HTML and Accept: text/markdown responses.
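
A minimal sketch of the post-build filtering idea from the `scripts/joinLlmsFiles.mjs` bullet, not the code merged in this PR. It assumes the index uses `- [Title](url)` entries under `## Section` headings, and it uses plain regular expressions in place of `createMatcher`. The pattern list, the default argument, and the sample URLs are hypothetical.

```javascript
// Hypothetical patterns; the real LLMS_INDEX_EXCLUDE_PATTERNS may differ.
const EXCLUDE_PATTERNS = [/\/sdk\.md$/, /\/api\/v2\//];

// Matches index entries of the form "- [Title](url)", optionally followed by a description.
const ENTRY_RE = /^\s*-\s+\[[^\]]*\]\(([^)\s]+)\)/;

function filterLlmsIndex(text, patterns = EXCLUDE_PATTERNS) {
  // Pass 1: drop entries whose URL matches any exclusion pattern.
  const kept = text.split('\n').filter((line) => {
    const match = ENTRY_RE.exec(line);
    return !match || !patterns.some((re) => re.test(match[1]));
  });

  // Pass 2: drop "## Section" headings whose body is now only blank lines.
  const out = [];
  for (let i = 0; i < kept.length; i++) {
    if (kept[i].startsWith('## ')) {
      let j = i + 1;
      let hasContent = false;
      // Scan until the next heading (any level) or the end of the file.
      while (j < kept.length && !kept[j].startsWith('#')) {
        if (kept[j].trim() !== '') hasContent = true;
        j++;
      }
      if (!hasContent) {
        i = j - 1; // skip the heading and its blank lines
        continue;
      }
    }
    out.push(kept[i]);
  }
  return out.join('\n');
}
```

Because this only rewrites the text of `build/llms.txt`, the `.md` files themselves are never touched, which is the property the PR relies on.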

Net effect: same llms.txt index as #2470 produced, but the per-page .md files are restored.

Test plan

  • CI passes (the new .md-counterpart and child-repo redirect assertions exercise the regression)
  • npm run build succeeds locally
  • build/llms.txt size remains under the 100K limit enforced by npm run test:llms-size
  • Manually verify a few previously-broken URLs once deployed: https://docs.apify.com/sdk.md, https://docs.apify.com/open-source.md, https://docs.apify.com/api/v2/actor-builds-get.md

@webrdaniel webrdaniel requested review from TC-MO and marcel-rbro April 29, 2026 15:08
@webrdaniel webrdaniel self-assigned this Apr 29, 2026
@github-actions github-actions Bot added this to the 139th sprint - Web team milestone Apr 29, 2026
@github-actions github-actions Bot added the t-web Issues with this label are in the ownership of the web team. label Apr 29, 2026
@webrdaniel webrdaniel changed the title fix: omit .md pages from llms.txt without removing them completely docs: omit .md pages from llms.txt without removing them completely Apr 29, 2026
@webrdaniel webrdaniel added the adhoc Ad-hoc unplanned task added during the sprint. label Apr 29, 2026
@webrdaniel webrdaniel requested a review from B4nan April 29, 2026 15:12
@apify-service-account

Preview for this PR was built for commit 2edc700 and is ready at https://pr-2480.preview.docs.apify.com!

@jancurn
Member

jancurn commented Apr 29, 2026

Cheers. Please can we add some tests for these special pages, to ensure that the .md versions work and that the HTML versions contain the <link rel="alternate" type="text/markdown" href="https://docs.apify.com/xxx.md"> tag?

@apify-service-account

Preview for this PR was built for commit bc07d5b and is ready at https://pr-2480.preview.docs.apify.com!

@apify-service-account

Preview for this PR was built for commit 8380f85 and is ready at https://pr-2480.preview.docs.apify.com!

@TC-MO
Contributor

TC-MO commented Apr 30, 2026

Will the builds fail until we deploy this, or do we need to change the test assertions in nginx.conf?

@apify-service-account

Preview for this PR was built for commit 68594e4 and is ready at https://pr-2480.preview.docs.apify.com!

@webrdaniel
Contributor Author

The tests should now run correctly against the staging

@webrdaniel webrdaniel marked this pull request as ready for review April 30, 2026 08:08
@jancurn
Member

jancurn commented Apr 30, 2026

Great thank you

Contributor

@marcel-rbro marcel-rbro left a comment

Can we nudge the review? Is @B4nan the one to review the technical side?

include ${PWD_PATH}/nginx-test.conf;
}
EOF
sed -i 's|https://apify.github.io/apify-docs|http://localhost:3000|g' default.conf
Member

Are those changes actually necessary? The CI checks were working fine before; I was expecting you to just add a few more test cases here.

Contributor Author

Without the change, the tests fail on:
Expected 'text/markdown' in 'Content-Type' for http://localhost:8080/sdk.md
Looks like nginx is proxying to the deployed production site instead of the locally-built one.

Member

I think that was on purpose, it wasn't possible to test something locally. But I might be wrong, it's been quite some time since I was setting this up.

Contributor Author

@webrdaniel webrdaniel May 4, 2026

Ok, so keeping it as it was

@B4nan
Member

B4nan commented May 4, 2026

> The tests should now run correctly against the staging

Also what do you mean by that? We don't have any staging env for the docs.

@apify-service-account

Preview for this PR was built for commit d857b9df and is ready at https://pr-2480.preview.docs.apify.com!

@webrdaniel
Contributor Author

> > The tests should now run correctly against the staging
>
> Also what do you mean by that? We don't have any staging env for the docs.

This referred to the fix mentioned above, which has since been reverted.

@webrdaniel webrdaniel requested a review from B4nan May 4, 2026 12:00
Member

@B4nan B4nan left a comment

I don't wanna block this, feel free to put the changes back and merge.

Btw I was also confused by you renaming the nginx config to nginx-test.conf, I don't think it's necessary.

@webrdaniel webrdaniel merged commit 6a26714 into master May 5, 2026
14 of 15 checks passed
@webrdaniel webrdaniel deleted the fix/omit-pages-from-llms-txt-without-removing-them-completely branch May 5, 2026 07:07
B4nan pushed a commit that referenced this pull request May 5, 2026
#2496)

## Summary

- The `sed` substitution in the `Start Nginx with project config` step
ran against `default.conf`, which only contains the wrapper
(`worker_processes`, `events`, `http { include nginx.conf; }`) — not the
`https://apify.github.io/apify-docs` upstream URL, which lives in
`nginx.conf`. So the rewrite was a no-op and the header-assertions step
silently proxied to live prod instead of the PR's local Docusaurus serve
at port 3000.
- This masked regressions in the PR-under-test (the test could pass
purely because prod happened to serve the right content) and caused
spurious failures when the PR introduced changes that prod hadn't yet
picked up — see #2480, where `assert_header ".../sdk.md" "Content-Type"
"text/markdown"` failed because prod hadn't been redeployed yet.

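The no-op described in the summary above can be illustrated with a small sketch. The file contents below are abbreviated stand-ins reconstructed from the description, not the real configs, and the `proxy_pass` directive and the `rewriteUpstream` helper are assumptions made for illustration.

```javascript
// Stand-in for default.conf: only the wrapper, with no upstream URL in it.
const defaultConf = [
  'worker_processes 1;',
  'events {}',
  'http {',
  '  include nginx.conf;',
  '}',
].join('\n');

// Stand-in for nginx.conf: the file that actually holds the upstream URL.
// The proxy_pass line is hypothetical; the real directive may differ.
const nginxConf = [
  'location / {',
  '  proxy_pass https://apify.github.io/apify-docs;',
  '}',
].join('\n');

// Same substitution the sed command performs, expressed as a string replacement.
function rewriteUpstream(text) {
  return text.replaceAll('https://apify.github.io/apify-docs', 'http://localhost:3000');
}

// Applied to the wrapper, nothing changes, so nginx keeps proxying to prod.
const wrapperUnchanged = rewriteUpstream(defaultConf) === defaultConf;

// Applied to the file that contains the URL, the upstream becomes the local serve.
const rewrittenNginxConf = rewriteUpstream(nginxConf);
```

A substitution that matches nothing exits successfully, which is why the step gave no sign that it had done nothing.
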
## Test plan

- [ ] CI `Test / Docs build` job passes
- [ ] `Run header assertions` step actually exercises the local build
(e.g. break a `.md` route in a follow-up draft and confirm the test now
fails)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>