Scrapy template doesn't handle imminent migration to another host #303

honzajavorek · 2024-12-03T10:40:12Z

Apparently it's normal for the actor to be restarted by the Apify platform because of an imminent migration to another host. The Scrapy integration doesn't handle this case. When an actor made in Scrapy gets interrupted, it restarts from the beginning. This drains resources, puts more load on the target websites, and results in timeouts, effectively ruining that particular actor run.

The issue has been discussed on Discord with the advice being:

...it looks like that the official Scrapy - Apify integration just allow you to run the scrapy project on the platform but nothing more, so no state persistence. In that case you need to take care of that on your own

Elsewhere, @janbuchar mentions:

The Scrapy integration just uses the cloud storage when you run it on Apify, and that is persistent by design.

I'm filing this issue to figure out how things actually stand, and whether you think this is something the integration should take care of.

Because I think it should. As an actor creator using Scrapy, so far I didn't need to know many specifics of the platform. I created a Scrapy project, added the integration, deployed to Apify, and it pretty much worked.

However, if any actor can be interrupted at any time, which is apparently a completely normal thing for the platform to do, and the interruption ruins the scraper run, then I'd argue the integration is incomplete: it doesn't do enough to help a project run successfully on the platform.

honzajavorek · 2025-01-27T15:09:00Z

For one of my scrapers, this now happens 100% of the time. The restarts make the scraper take forever. I raised the timeout to 2h, but that's not enough because the platform just restarts it mid-job. From my POV this means it's not really possible to use Apify as a seamless production environment for Scrapy.

Relates: apify/actor-templates#303

honzajavorek · 2025-02-07T13:26:50Z

I'll solve this by using Scrapy's HTTP cache, using Apify's key-value store as a backend. If my scraper gets restarted, the requests already made should be in cache and shouldn't drain resources. I have a prototype implementation. Once it's polished, I'll send a PR so that it can be a part of the SDK.
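To illustrate the idea, here is a minimal sketch of a cache storage that mirrors the method names of Scrapy's HTTP cache storage interface (`open_spider`, `close_spider`, `retrieve_response`, `store_response`). It is not the prototype mentioned above. A plain dict stands in for Apify's key-value store, and `FakeRequest`, `FakeResponse`, and `KeyValueCacheStorage` are hypothetical names so the example runs without Scrapy installed. In a real project, such a class would presumably be enabled through Scrapy's `HTTPCACHE_ENABLED` and `HTTPCACHE_STORAGE` settings.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a Scrapy request (only the fields used here)."""
    url: str
    method: str = "GET"
    body: bytes = b""


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a Scrapy response."""
    url: str
    status: int = 200
    body: bytes = b""


class KeyValueCacheStorage:
    """Cache storage sketch; method names follow Scrapy's cache storage interface."""

    def __init__(self, settings=None, store=None):
        # Assumption: `store` is any dict-like object that survives a restart.
        # Here it is an in-memory dict standing in for a key-value store.
        self.store = store if store is not None else {}

    def open_spider(self, spider):
        pass  # a real backend would open the key-value store here

    def close_spider(self, spider):
        pass  # a real backend would flush or close the store here

    @staticmethod
    def _key(request):
        # Derive a stable key from the method, URL, and body of the request.
        raw = request.method.encode() + b" " + request.url.encode() + b"\n" + request.body
        return hashlib.sha256(raw).hexdigest()

    def retrieve_response(self, spider, request):
        # Return the cached response, or None on a cache miss.
        blob = self.store.get(self._key(request))
        if blob is None:
            return None
        data = json.loads(blob)
        return FakeResponse(url=data["url"], status=data["status"],
                            body=bytes.fromhex(data["body"]))

    def store_response(self, spider, request, response):
        # Serialize the response to JSON so it can live in a key-value store.
        self.store[self._key(request)] = json.dumps(
            {"url": response.url, "status": response.status, "body": response.body.hex()}
        )


# Simulate a restart: a second storage instance reuses the same persistent store.
persistent = {}
before = KeyValueCacheStorage(store=persistent)
request = FakeRequest(url="https://example.com/page")
before.store_response(None, request, FakeResponse(url=request.url, body=b"<html>"))
after = KeyValueCacheStorage(store=persistent)
cached = after.retrieve_response(None, request)
```

After the simulated restart, `cached` holds the response stored before it, so the request would not need to hit the target website again.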

…asyncio` (#390)

### Description

- Apify (asyncio) and Scrapy (Twisted) now run on a single event loop.
- `nest-asyncio` has been completely removed.
- It seems that this change also improved the performance.
- The `ApifyScheduler`, which is synchronous, now executes asyncio coroutines (communication with RQ) in a separate thread with its own asyncio event loop.
- Logging setup had to be adjusted, and I moved it to a dedicated file in the SDK.
- The try-import functionality for optional dependencies from Crawlee was added to the `scrapy` subpackage.
- A new integration test for Scrapy Actor has been added.

### Issues

- Closes: #148
- Closes: #176
- Closes: #392
- Relates: apify/actor-templates#303
- This issue will be closed once the corresponding PR in `actor-templates` is merged.

### Tests

- A new integration test for Scrapy Actor has been added.
- And of course, it was tested manually using the Actor from guides/templates.

### Next steps

- Update Scrapy Actor template in `actor-templates`.
- Update [Actor Scrapy Books Example](https://github.com/apify/actor-scrapy-books-example).
- Add HTTP cache storage for KVS, @honzajavorek will provide his implementation.

### Follow-up issues

- There are still a few issues to be resolved.
- #391
- #395
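The pattern of a synchronous component running asyncio coroutines in a separate thread with its own event loop can be sketched with the standard library alone. This is a general illustration, not the SDK's actual code: `AsyncThread` and `fetch_next_request` are hypothetical names, and a plain list stands in for the request queue.

```python
import asyncio
import threading


class AsyncThread:
    """Owns a background thread that runs a private asyncio event loop."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        # The loop runs forever in a daemon thread until close() stops it.
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro, timeout=10.0):
        """Submit a coroutine from synchronous code and block for its result."""
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout=timeout)

    def close(self):
        """Stop the loop from the calling thread and wait for the thread to exit."""
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def fetch_next_request(queue):
    # Hypothetical stand-in for an async call to a request queue.
    await asyncio.sleep(0)
    return queue.pop(0) if queue else None


# Synchronous caller, e.g. a scheduler method that must return a value directly.
runner = AsyncThread()
pending = ["https://example.com/1", "https://example.com/2"]
next_url = runner.run(fetch_next_request(pending))
runner.close()
```

The synchronous caller never touches the main event loop, so it can block on `future.result()` without stalling coroutines scheduled elsewhere.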

Relates: #303

honzajavorek mentioned this issue Dec 3, 2024

docs: Improve the Features section in README apify/crawlee-python#772

Merged

vdusek self-assigned this Jan 29, 2025

vdusek added this to the 107th sprint - Tooling team milestone Jan 29, 2025

vdusek added the bug and t-tooling labels Jan 29, 2025

vdusek mentioned this issue Jan 29, 2025

fix: Fix RQ usage in Scrapy scheduler apify/apify-sdk-python#385

Merged

vdusek added a commit to apify/apify-sdk-python that referenced this issue Jan 29, 2025

fix: Fix RQ usage in Scrapy scheduler (#385)

3363478

Relates: apify/actor-templates#303

vdusek mentioned this issue Feb 6, 2025

feat: Unify Apify and Scrapy to use single event loop & remove nest-asyncio apify/apify-sdk-python#390

Merged

vdusek modified the milestones: 107th sprint - Tooling team, 108th sprint - Tooling team Feb 17, 2025

This was referenced Feb 18, 2025

feat: Implement Scrapy HTTP cache backend apify/apify-sdk-python#403

Open

feat: update Python Scrapy template to use new SDK #311

Merged

vdusek added a commit that referenced this issue Feb 19, 2025

feat: update Python Scrapy template to use new SDK (#311)

d3587e0

Relates: #303
