feat: Camoufox-based crawler template #2842

Merged (8 commits) on Feb 24, 2025

@barjin barjin commented Feb 12, 2025

Adds a Camoufox-based crawler template (camoufox-ts).

Unlike the basic playwright-ts template, camoufox-ts uses the camoufox-js package, which finds the latest matching Camoufox binary among the GitHub Releases assets, downloads it, and passes the correct launch options to it.

The main.ts script is modified to run the downloaded binary with the correct launchOptions.
Related to #2836

@barjin barjin self-assigned this Feb 12, 2025
@github-actions github-actions bot added this to the 108th sprint - Tooling team milestone Feb 12, 2025
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Feb 12, 2025

barjin commented Feb 12, 2025

Todo:

  • automate running npm run download-camoufox (maybe as a postinstall script?)
  • pass custom fingerprint-modifying options to Camoufox
  • maybe store binaries in a system- (or user-)wide location (~/.crawlee/binaries?)
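
For the last item, a minimal sketch of how a shared binary location could be resolved. Nothing here is part of crawlee or camoufox-js: the resolveBinaryDir helper and the CRAWLEE_BINARIES_DIR override variable are hypothetical names used only for illustration, and the ~/.crawlee/binaries default is just the path floated in the list above.

```typescript
import { homedir } from 'node:os';
import { join } from 'node:path';

// Hypothetical helper (not an existing crawlee or camoufox-js API).
// Returns the directory where the binaries of one browser would be stored.
function resolveBinaryDir(
    browserName: string,
    env: Record<string, string | undefined> = process.env,
    home: string = homedir(),
): string {
    // CRAWLEE_BINARIES_DIR is an assumed override, e.g. for a system-wide location.
    // Without it, fall back to the per-user ~/.crawlee/binaries directory.
    const base = env.CRAWLEE_BINARIES_DIR ?? join(home, '.crawlee', 'binaries');
    return join(base, browserName);
}

console.log(resolveBinaryDir('camoufox'));
```

Keeping the environment and home directory as parameters makes the lookup easy to test and leaves the choice between a user-wide and a system-wide location to whoever runs the crawler.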

@barjin barjin added the adhoc Ad-hoc unplanned task added during the sprint. label Feb 12, 2025

barjin commented Feb 14, 2025

Example code:

import { launchOptions } from 'camoufox-js';
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

const startUrls = ['https://crawlee.dev'];

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, enqueueLinks }) => {
        await page.click('h2');
        await page.click('h3');

        await enqueueLinks();
    },
    maxConcurrency: 1,
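    launchContext: {
        // Camoufox is Firefox-based, so it is driven through Playwright's firefox launcher.
        launcher: firefox,
        // camoufox-js builds the launch options for the downloaded binary.
        launchOptions: await launchOptions({
            headless: false,
            block_images: true, // load no images
            fonts: ['Times New Roman'], // the single system-installed font to use
            custom_fonts_only: true,
            humanize: true, // move the cursor with the humanizing script
        }),
    },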
});

await crawler.run(startUrls);

Execution:

Peek.2025-02-14.16-35.mp4

With these options, the browser loads no images, uses only one system-installed font (plus any fonts the page itself loads), and moves the cursor with the humanizing script.

@barjin barjin requested a review from B4nan February 14, 2025 16:14
@B4nan B4nan merged commit 7f08de4 into master Feb 24, 2025
9 checks passed
@B4nan B4nan deleted the feat/camoufox-crawler branch February 24, 2025 08:21
B4nan pushed a commit to apify/actor-templates that referenced this pull request Feb 27, 2025
Following apify/apify-sdk-js#364 and apify/crawlee#2842, this PR adds
Camoufox-enabled templates to the Apify Actor templates. The implementation
is heavily based on the existing Playwright + Chrome templates.

The only issue I'm aware of is the very large size of the resulting
images, since they already contain Chrome and we add the Camoufox binaries
on top. Installing Camoufox directly into a `node-debian` image results in
missing system dependencies. Installing those manually in the Dockerfile
might be possible, but it could make the Dockerfile too complex for a
regular user.


![image](https://github.com/user-attachments/assets/fb0050fd-fadc-4bbc-80f3-0681dcfa2b92)