Scrapy template doesn't handle imminent migration to another host #303

Open
honzajavorek opened this issue Dec 3, 2024 · 2 comments
Labels
bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@honzajavorek

Apparently it's normal for the actor to be restarted by the Apify platform because of an imminent migration to another host. The Scrapy integration doesn't handle this case. When an actor made in Scrapy gets interrupted, it restarts from the beginning. This drains resources, puts more load on the target websites, and results in timeouts, effectively ruining that particular actor run.
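
To make the failure mode concrete, here is a minimal sketch of what "not restarting from the beginning" could look like. It uses only the standard library: `Checkpoint`, `crawl`, and the `interrupt_after` parameter are hypothetical names made up for illustration, and a JSON file stands in for whatever persistent storage the platform provides. It is not how the Apify SDK or the Scrapy integration works.

```python
import json
import tempfile
from pathlib import Path


class Checkpoint:
    """Hypothetical stand-in for a persistent store that survives a restart."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def load(self) -> dict:
        # Return the saved state, or an empty state on the very first run.
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {"pending": [], "done": []}

    def save(self, state: dict) -> None:
        self.path.write_text(json.dumps(state))


def crawl(start_urls, checkpoint, fetch, interrupt_after=None):
    """Process URLs one by one, saving progress after each so a restart can resume.

    Returns True when the crawl finished, False when it was "interrupted".
    """
    state = checkpoint.load()
    if not state["pending"] and not state["done"]:
        state["pending"] = list(start_urls)  # fresh run, nothing saved yet
    processed = 0
    while state["pending"]:
        if interrupt_after is not None and processed >= interrupt_after:
            return False  # simulate the platform stopping the run mid-job
        url = state["pending"].pop(0)
        fetch(url)
        state["done"].append(url)
        checkpoint.save(state)
        processed += 1
    return True


fetched = []
with tempfile.TemporaryDirectory() as tmp:
    cp = Checkpoint(Path(tmp) / "state.json")
    urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
    finished_first = crawl(urls, cp, fetched.append, interrupt_after=2)
    finished_second = crawl(urls, cp, fetched.append)  # the "restarted" run
```

In this sketch the second call picks up the one remaining URL instead of fetching all three again, which is the behavior the integration currently lacks.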

The issue has been discussed on Discord with the advice being:

...it looks like the official Scrapy-Apify integration just allows you to run the Scrapy project on the platform but nothing more, so no state persistence. In that case you need to take care of that on your own

Elsewhere, @janbuchar mentions:

The Scrapy integration just uses the cloud storage when you run it on Apify, and that is persistent by design.

I'm filing this issue to figure out how things actually stand and whether you think this is something the integration should take care of.

Because I think it should. As an actor creator using Scrapy, so far I haven't needed to know many specifics of the platform. I created a Scrapy project, added the integration, deployed to Apify, and it pretty much worked.

However, if any actor can be interrupted at any time (apparently a completely normal thing for the platform to do) and the interruption ruins the scraper run, then in my view the integration is incomplete: it doesn't do enough to produce a project that runs successfully on the platform.

@honzajavorek
Author

For one of my scrapers, this now happens 100% of the time. The restarts cause the scraper to take an eternity. I raised the timeout to 2h, but it's not enough, as the platform just restarts it mid-job. From my POV this means it's not really possible to use Apify as a seamless production environment for Scrapy.

@vdusek vdusek self-assigned this Jan 29, 2025
@vdusek vdusek added this to the 107th sprint - Tooling team milestone Jan 29, 2025
@vdusek vdusek added bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team. labels Jan 29, 2025
vdusek added a commit to apify/apify-sdk-python that referenced this issue Jan 29, 2025
@honzajavorek
Author

I'll solve this by using Scrapy's HTTP cache with Apify's key-value store as a backend. If my scraper gets restarted, the requests already made should be in the cache and shouldn't drain resources. I have a prototype implementation. Once it's polished, I'll send a PR so that it can be a part of the SDK.
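
A rough sketch of the idea, not the prototype itself. The method names `retrieve_response` and `store_response` mirror the interface Scrapy expects from a class configured via `HTTPCACHE_STORAGE`, but everything else is an assumption made for illustration: `FakeRequest`, `FakeResponse`, `InMemoryKeyValueStore`, and `KeyValueCacheStorage` are hypothetical names, a plain dict stands in for the key-value store, and the key is a simple hash of method and URL rather than Scrapy's request fingerprint.

```python
import hashlib
import json
import time
from dataclasses import dataclass


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a Scrapy request (only the fields used here)."""
    url: str
    method: str = "GET"


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a Scrapy response."""
    url: str
    status: int
    body: bytes


class InMemoryKeyValueStore(dict):
    """Hypothetical stand-in for a persistent key-value store."""


class KeyValueCacheStorage:
    """Cache storage sketch that keeps responses in a key-value store."""

    def __init__(self, store, expiration_secs=0):
        self.store = store
        self.expiration_secs = expiration_secs  # 0 means "never expire"

    def _key(self, request):
        # Assumption: method + URL is enough to identify a request here.
        raw = f"{request.method} {request.url}".encode()
        return hashlib.sha1(raw).hexdigest()

    def store_response(self, spider, request, response):
        record = {
            "time": time.time(),
            "url": response.url,
            "status": response.status,
            "body": response.body.hex(),  # hex keeps the record JSON-safe
        }
        self.store[self._key(request)] = json.dumps(record)

    def retrieve_response(self, spider, request):
        raw = self.store.get(self._key(request))
        if raw is None:
            return None  # cache miss, the request must be downloaded
        record = json.loads(raw)
        if 0 < self.expiration_secs < time.time() - record["time"]:
            return None  # cached entry is too old
        return FakeResponse(record["url"], record["status"], bytes.fromhex(record["body"]))


# Two "runs" sharing one store, as if the actor had been restarted in between.
store = InMemoryKeyValueStore()
request = FakeRequest("https://example.com/jobs")

first_run = KeyValueCacheStorage(store)
miss = first_run.retrieve_response(None, request)
first_run.store_response(None, request, FakeResponse(request.url, 200, b"<html>ok</html>"))

second_run = KeyValueCacheStorage(store)
hit = second_run.retrieve_response(None, request)
```

The restarted run gets the response from the store instead of hitting the target website again, so the restart costs storage reads rather than new downloads.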

vdusek added a commit to apify/apify-sdk-python that referenced this issue Feb 13, 2025
…asyncio` (#390)

### Description

- Apify (asyncio) and Scrapy (Twisted) now run on a single event loop.
  - `nest-asyncio` has been completely removed.
  - It seems that this change also improved the performance.
- The `ApifyScheduler`, which is synchronous, now executes asyncio
coroutines (communication with RQ) in a separate thread with its own
asyncio event loop.
- The logging setup had to be adjusted, and I moved it to a dedicated file in
the SDK.
- The try-import functionality for optional dependencies from Crawlee was
added to the `scrapy` subpackage.
- A new integration test for Scrapy Actor has been added.
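
The pattern of a synchronous scheduler calling asyncio code in a separate thread can be sketched with the standard library alone. This is an illustration of the general technique, not the SDK's actual code: `AsyncThread`, `FakeRequestQueue`, and `SyncScheduler` are hypothetical names, and only `enqueue_request` and `next_request` are borrowed from Scrapy's scheduler interface.

```python
import asyncio
import threading


class AsyncThread:
    """Runs a dedicated asyncio event loop in a background thread."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run_coro(self, coro, timeout: float = 10.0):
        # Submit the coroutine to the background loop and block until it finishes.
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout=timeout)

    def close(self) -> None:
        # Stop the loop from inside its own thread, then clean up.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


class FakeRequestQueue:
    """Hypothetical async request queue standing in for the real RQ client."""

    def __init__(self) -> None:
        self._items = []

    async def add(self, url: str) -> bool:
        await asyncio.sleep(0)  # pretend this is network I/O
        self._items.append(url)
        return True

    async def fetch_next(self):
        await asyncio.sleep(0)
        return self._items.pop(0) if self._items else None


class SyncScheduler:
    """Synchronous facade that delegates to the async queue."""

    def __init__(self, queue: FakeRequestQueue, async_thread: AsyncThread) -> None:
        self._queue = queue
        self._async_thread = async_thread

    def enqueue_request(self, url: str) -> bool:
        return self._async_thread.run_coro(self._queue.add(url))

    def next_request(self):
        return self._async_thread.run_coro(self._queue.fetch_next())


async_thread = AsyncThread()
scheduler = SyncScheduler(FakeRequestQueue(), async_thread)
scheduler.enqueue_request("https://example.com/1")
scheduler.enqueue_request("https://example.com/2")
first = scheduler.next_request()
second = scheduler.next_request()
empty = scheduler.next_request()
async_thread.close()
```

Because the coroutines run on their own loop in their own thread, the blocking `future.result()` call never stalls the loop that the rest of the application runs on.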

### Issues

- Closes: #148
- Closes: #176
- Closes: #392
- Relates: apify/actor-templates#303
- This issue will be closed once the corresponding PR in
`actor-templates` is merged.

### Tests

- A new integration test for Scrapy Actor has been added.
- And of course, it was tested manually using the Actor from
guides/templates.

### Next steps

- Update Scrapy Actor template in `actor-templates`.
- Update [Actor Scrapy Books
Example](https://github.com/apify/actor-scrapy-books-example).
- Add HTTP cache storage for KVS, @honzajavorek will provide his
implementation.

### Follow-up issues

- There are still a few issues to be resolved.
- #391
- #395