Scrapy template doesn't handle imminent migration to another host #303
Comments
For one of my scrapers, this now happens 100% of the time. The restarts make the scraper take an eternity. I raised the timeout to 2h, but it's not enough, as the platform just restarts it mid-job. From my POV, this means it's not really possible to use Apify as a seamless production environment for Scrapy.
I'll solve this by using Scrapy's HTTP cache, using Apify's key-value store as a backend. If my scraper gets restarted, the requests already made should be in cache and shouldn't drain resources. I have a prototype implementation. Once it's polished, I'll send a PR so that it can be a part of the SDK.
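To illustrate the idea of a cache that survives a restart, here is a minimal sketch, not the prototype mentioned above. It assumes a plain dict-like object as a stand-in for the key-value store. The class name `KeyValueCacheStorage` and the helper `request_key` are hypothetical. The method names loosely echo Scrapy's cache storage methods, but the signatures are simplified and this is not the actual Scrapy interface.

```python
import hashlib
import json
import time


def request_key(method: str, url: str, body: bytes = b"") -> str:
    """Build a stable cache key from the parts that identify a request."""
    digest = hashlib.sha256()
    digest.update(method.upper().encode())
    digest.update(b"\n")
    digest.update(url.encode())
    digest.update(b"\n")
    digest.update(body)
    return digest.hexdigest()


class KeyValueCacheStorage:
    """Response cache on top of any dict-like store.

    Assumption: `store` stands in for a persistent key-value store that
    outlives the process, so a restarted run can read earlier entries.
    """

    def __init__(self, store, expiration_secs: float = 0):
        self.store = store
        self.expiration_secs = expiration_secs  # 0 means entries never expire

    def store_response(self, method, url, status, body: bytes, request_body=b""):
        # Serialize to JSON text; the body is hex-encoded to keep it text-safe.
        record = {"status": status, "body": body.hex(), "stored_at": time.time()}
        self.store[request_key(method, url, request_body)] = json.dumps(record)

    def retrieve_response(self, method, url, request_body=b""):
        raw = self.store.get(request_key(method, url, request_body))
        if raw is None:
            return None  # cache miss: the request has to be made
        record = json.loads(raw)
        age = time.time() - record["stored_at"]
        if self.expiration_secs and age > self.expiration_secs:
            return None  # entry is too old, treat it as a miss
        return record["status"], bytes.fromhex(record["body"])
```

In this sketch, a run that is restarted creates a new `KeyValueCacheStorage` over the same store, and `retrieve_response` returns the saved status and body for any request made before the interruption, so those requests don't need to hit the target website again.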
…asyncio` (#390)

### Description

- Apify (asyncio) and Scrapy (Twisted) now run on a single event loop.
- `nest-asyncio` has been completely removed.
- It seems that this change also improved the performance.
- The `ApifyScheduler`, which is synchronous, now executes asyncio coroutines (communication with RQ) in a separate thread with its own asyncio event loop.
- The logging setup had to be adjusted, and I moved it to a dedicated file in the SDK.
- The try-import functionality for optional dependencies from Crawlee was added to the `scrapy` subpackage.
- A new integration test for Scrapy Actor has been added.

### Issues

- Closes: #148
- Closes: #176
- Closes: #392
- Relates: apify/actor-templates#303
  - This issue will be closed once the corresponding PR in `actor-templates` is merged.

### Tests

- A new integration test for Scrapy Actor has been added.
- And of course, it was tested manually using the Actor from guides/templates.

### Next steps

- Update Scrapy Actor template in `actor-templates`.
- Update [Actor Scrapy Books Example](https://github.com/apify/actor-scrapy-books-example).
- Add HTTP cache storage for KVS, @honzajavorek will provide his implementation.

### Follow-up issues

- There are still a few issues to be resolved.
  - #391
  - #395
Apparently it's normal for the actor to be restarted by the Apify platform because of an imminent migration to another host. The Scrapy integration doesn't handle this case. When an actor made in Scrapy gets interrupted, it restarts from the beginning. This drains resources, puts more load on the target websites, and results in timeouts, effectively ruining that particular actor run.
The issue has been discussed on Discord with the advice being:
Elsewhere, @janbuchar mentions:
I'm filing this issue to figure out what the situation is and whether you think this is something the integration should take care of.
Because I think it should. As an actor creator using Scrapy, so far I didn't need to know many specifics of the platform. I created a Scrapy project, added the integration, deployed to Apify, and it pretty much worked.
However, if any actor can be interrupted at any time, which is apparently a completely normal thing for the platform to do, and the interruption ruins the scraper run, then I'd argue the integration is incomplete: it doesn't do enough to help build a project that runs successfully on the platform.