Move fullDomFetcher to Playwright #1144

Open: LVerneyEC wants to merge 3 commits into main

Conversation

@LVerneyEC LVerneyEC commented Apr 3, 2025

Hi,

Here is a proposed rewrite of the full DOM fetcher, moving it from Puppeteer to Playwright.

The new browser setup also supports HTTP/HTTPS proxies (e.g. a corporate proxy), and its behavior can be adjusted with two environment variables:

  • PLAYWRIGHT_NO_SANDBOX to disable all sandboxing in Chrome (required for running in Docker, depending on the Docker setup).
  • PLAYWRIGHT_NO_HEADLESS to run in headful mode (sometimes useful for debugging).
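
As a rough illustration of how these two variables and a proxy setting could translate into launch options, here is a minimal sketch. The helper name `buildLaunchOptions` is hypothetical, as are reading the proxy from `HTTPS_PROXY`/`HTTP_PROXY` and treating any non-empty value as "enabled"; the code in this PR may do it differently. The option names `headless`, `chromiumSandbox`, `args`, and `proxy.server` are standard Playwright launch options.

```javascript
// Hypothetical helper: maps environment variables to Playwright-style launch options.
// The returned object is meant to be passed to chromium.launch().
function buildLaunchOptions(env = process.env) {
  // Assumption: any non-empty value enables the flag.
  const noSandbox = Boolean(env.PLAYWRIGHT_NO_SANDBOX);
  const noHeadless = Boolean(env.PLAYWRIGHT_NO_HEADLESS);

  // Assumption: the proxy URL comes from the conventional proxy variables.
  const proxyServer = env.HTTPS_PROXY || env.HTTP_PROXY;

  const options = {
    headless: !noHeadless,
    chromiumSandbox: !noSandbox,
    args: noSandbox ? ['--no-sandbox'] : [],
  };

  if (proxyServer) {
    options.proxy = { server: proxyServer };
  }

  return options;
}
```

With patchright installed, the result would be used as `const browser = await chromium.launch(buildLaunchOptions());`.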

This uses patchright, a wrapper around Playwright that patches several obvious Playwright detection mechanisms, similar to the previously used puppeteer-extra-plugin-stealth.

Best,

@@ -94,10 +94,8 @@
"morgan": "^1.10.0",
"node-fetch": "^3.1.0",
"octokit": "2.0.2",
"patchright": "1.50.1",
Author

Just flagging that this is an explicit pin at the moment due to Kaliiiiiiiiii-Vinyzu/patchright#58. The issue is closed, but the fix is not yet part of the latest release. This version should be adjusted after review, prior to merging.

@LVerneyEC
Author

I noticed the tests are failing due to linting and commit/changelog issues. I'll fix these, but I'd be happy to get a high-level review first, to make sure this is useful and worth merging, and then fix everything at once afterwards :)

@MattiSG
Member

MattiSG commented Apr 4, 2025

Thanks @LVerneyEC for this contribution! Fully agree with a first high-level overview before ironing out details :)
The intervention seems minimal. Do you have examples of cases that were blocked with the previous implementation and are unblocked with that switch? 🙂

@LVerneyEC
Author

> Do you have examples of cases that were blocked with the previous implementation and are unblocked with that switch? 🙂

Not really. I have another PR coming for the htmlOnlyFetcher, for which this significantly increases coverage.

Here, the main benefit is moving away from puppeteer-extra-plugin-stealth, which has been unmaintained for a couple of years: https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth.

It also brings higher-level improvements, such as supporting corporate proxies and offering the ability to run headful for debugging purposes.

@Ndpnt Ndpnt closed this Apr 10, 2025
@Ndpnt Ndpnt reopened this Apr 10, 2025
@Ndpnt
Member

Ndpnt commented Apr 10, 2025

Hi @LVerneyEC,

I've conducted a series of benchmark tests to evaluate the potential benefits of switching from Puppeteer to Playwright. Below are the detailed results:

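| Browser automation tool | Run # | Total failures | 403 errors | Navigation timeouts | Selector timeouts | 404 errors | Duration |
|---|---|---|---|---|---|---|---|
| Puppeteer | 1 | 57 | 48 | 9 | 0 | 0 | 6m 8s |
| Puppeteer | 2 | 47 | 36 | 10 | 0 | 1 | 6m 22s |
| Puppeteer | 3 | 30 | 18 | 11 | 0 | 1 | 5m 50s |
| Puppeteer | 4 | 31 | 19 | 11 | 0 | 1 | 5m 37s |
| Puppeteer | 5 | 31 | 29 | 1 | 0 | 0 | 5m 54s |
| Playwright | 1 | 76 | 59 | 0 | 17 | 0 | 3m 9s |
| Playwright | 2 | 75 | 59 | 0 | 16 | 0 | 2m 46s |
| Playwright | 3 | 72 | 59 | 0 | 13 | 0 | 2m 47s |
| Playwright | 4 | 76 | 59 | 0 | 17 | 0 | 2m 54s |
| Playwright | 5 | 69 | 59 | 0 | 10 | 0 | 2m 59s |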

Observations:

  • Playwright is faster than Puppeteer (about 3 minutes per run versus about 6)
  • Playwright shows more consistent 403 error counts (59 in every run)
  • Puppeteer shows more variation in error types and counts
  • Puppeteer is less frequently blocked than Playwright (18 to 48 403 errors per run versus 59)
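
To make the comparison behind these observations easier to check, the per-tool averages can be recomputed from the table. The snippet below is only a sketch: the figures are copied from the benchmark table (durations converted to seconds), while the `mean` and `summarize` helpers are illustrative names, not part of the benchmark tooling.

```javascript
// Figures copied from the benchmark table above.
// Each entry is [totalFailures, durationInSeconds] for one run.
const runs = {
  Puppeteer: [[57, 368], [47, 382], [30, 350], [31, 337], [31, 354]],
  Playwright: [[76, 189], [75, 166], [72, 167], [76, 174], [69, 179]],
};

// Arithmetic mean of a list of numbers.
function mean(values) {
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

// Average failures and duration over all runs of one tool.
function summarize(toolRuns) {
  return {
    meanFailures: mean(toolRuns.map(([failures]) => failures)),
    meanDurationSeconds: mean(toolRuns.map(([, seconds]) => seconds)),
  };
}

const summary = Object.fromEntries(
  Object.entries(runs).map(([tool, toolRuns]) => [tool, summarize(toolRuns)]),
);

console.log(summary);
```

On these five runs each, Puppeteer averages 39.2 failures in about 358 seconds, while Playwright averages 73.6 failures in 175 seconds.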

Based on these benchmark results, I do not recommend switching to Playwright at this time. Even though it executes faster, its higher failure rate and blocking issues are a blocker for me.

Regarding the other points mentioned:

  • It seems that Puppeteer natively supports HTTP proxies through environment variables
  • Puppeteer supports non-headless mode through the headless: false parameter in the launch function
  • While recommended to keep enabled, Puppeteer allows sandbox disabling using the --no-sandbox option

Have I missed any key points in my analysis, and do you still see reasons to switch to Playwright despite these results?

@LVerneyEC
Author

Would you have more details on the benchmark and the results? I am a bit surprised by the 403 and selector errors, since they do not really match my experience so far.

3 participants