Description
When running my scrapers with the latest Apify SDK (meaning fd7650a), I get timeouts on the following lines of code. These timeouts don't crash the scraper immediately, but they corrupt the scraper run: the results are incomplete, and I've also seen strange request queue behavior after these errors, which at least once looked like endless looping (I aborted the scraper after repeatedly seeing the same runtime stats).
self._async_thread.run_coro(self._rq.mark_request_as_handled(apify_request))
self._async_thread.run_coro(self._rq.fetch_next_request())
I use the same technique with the same async thread for caching requests (see #403), but I haven't seen any timeout errors related to the key-value storage (KV) I use. AFAIK all the timeout errors I've seen were related to the request queue (RQ), even though the KV is heavily used during the same scraper run. (Update: it happens with KV as well, see my comment below.)
The issue happens only occasionally, which makes it hard to track down. My scraper runs fine for 20 minutes and then spits out 5 of these errors. I've seen these timeouts with two rather different spiders, so this isn't specific to the code of a single spider class.
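For context, the failing calls follow a common pattern: a coroutine is submitted to an event loop running in a helper thread, and the calling thread blocks on the result with a timeout. Below is a minimal sketch of that pattern, not the SDK's actual code. `LoopThread` and `slow_storage_call` are hypothetical names; only the `future.result(timeout=...)` call is taken from the tracebacks in this report.

```python
import asyncio
import threading
from concurrent.futures import TimeoutError as FutureTimeoutError
from datetime import timedelta


class LoopThread:
    """Hypothetical stand-in for an async thread: one event loop in a helper thread."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run_coro(self, coro, timeout: timedelta = timedelta(seconds=60)):
        # Schedule the coroutine on the helper loop and block the calling thread
        # until it finishes or the timeout expires.
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout=timeout.total_seconds())

    def close(self) -> None:
        # Stop the loop from the outside, then wait for the helper thread to exit.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def slow_storage_call(delay: float) -> str:
    # Simulates a storage call (e.g. to a request queue) that takes `delay` seconds.
    await asyncio.sleep(delay)
    return "done"
```

Calling `run_coro(slow_storage_call(5), timeout=timedelta(seconds=0.1))` on such a thread raises the same `TimeoutError` as in the logs. Note that `future.result(timeout=...)` only stops waiting; it does not cancel the coroutine, which keeps running on the helper loop unless something cancels it. Whether that contributes to the odd request queue behavior is unverified.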
Debugging Plan & Ideas
- verify if I can get these errors locally (note for myself: run plucker with --apify) 👉 no success
- verify if these errors also happen with scrapers completely unrelated to jobs (i.e. verify this really isn't related to a certain website being touched)
- modify the code of the SDK (edit site-packages or make a git fork with changes) to see where exactly the coroutine hangs so that the thread times out
- regardless of the cause and solution, maybe the thread should crash the whole program in case of a timeout, because it's pretty important that the coroutines finish...
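The last two ideas could be combined roughly as follows. This is a sketch under assumptions, not a patch for the SDK: `await_chain`, `pending_task_report`, `run_coro_or_die`, the `on_fatal` callback, and the logger name are all hypothetical. On timeout it logs where each pending task on the helper loop is suspended (by following `cr_await` links), cancels the abandoned coroutine, and hands control to a caller-supplied callback that could stop the crawler or exit the process.

```python
import asyncio
import concurrent.futures
import logging
import threading

logger = logging.getLogger("async_thread_debug")  # hypothetical logger name


def await_chain(coro) -> list[str]:
    """Follow cr_await links to list where a suspended coroutine is waiting."""
    chain = []
    while coro is not None and getattr(coro, "cr_frame", None) is not None:
        frame = coro.cr_frame
        code = frame.f_code
        chain.append(f"{code.co_name} ({code.co_filename}:{frame.f_lineno})")
        # Only native coroutines expose cr_await; a plain Future ends the chain.
        coro = getattr(coro, "cr_await", None)
    return chain


def pending_task_report(loop) -> list[list[str]]:
    """Best-effort snapshot of every unfinished task on the given loop."""
    return [await_chain(task.get_coro()) for task in asyncio.all_tasks(loop)]


def run_coro_or_die(loop, coro, timeout_secs: float, on_fatal):
    """Run coro on loop; on timeout, log where tasks are stuck and escalate."""
    future = asyncio.run_coroutine_threadsafe(coro, loop)
    try:
        return future.result(timeout=timeout_secs)
    except concurrent.futures.TimeoutError:
        for chain in pending_task_report(loop):
            logger.error("Pending task: %s", " -> ".join(chain))
        future.cancel()  # stop the abandoned coroutine instead of leaving it running
        on_fatal()  # hypothetical hook, e.g. stop the crawler or exit the process
        raise
```

The report is taken from the calling thread, so it is a snapshot that may be slightly stale, and it only helps if the coroutine is suspended on an `await`. If the helper loop itself is blocked by synchronous code, dumping all thread stacks with `faulthandler.dump_traceback()` would likely be more informative.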
Examples
Specimen 1
2025-02-14T16:44:23.860Z [apify.scrapy._async_thread] ERROR Coroutine execution timed out.
2025-02-14T16:44:23.861Z Traceback (most recent call last):
2025-02-14T16:44:23.862Z File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-14T16:44:23.863Z return future.result(timeout=timeout.total_seconds())
2025-02-14T16:44:23.863Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-14T16:44:23.864Z File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-14T16:44:23.865Z raise TimeoutError()
2025-02-14T16:44:23.866Z TimeoutError
2025-02-14T16:44:23.866Z [apify.scrapy._async_thread] ERROR Coroutine execution timed out. ({"message": "Coroutine execution timed out."})
2025-02-14T16:44:23.869Z Traceback (most recent call last):
2025-02-14T16:44:23.869Z File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-14T16:44:23.870Z return future.result(timeout=timeout.total_seconds())
2025-02-14T16:44:23.871Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-14T16:44:23.871Z File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-14T16:44:23.872Z raise TimeoutError()
2025-02-14T16:44:23.873Z TimeoutError
Specimen 2
2025-02-17T04:24:02.098Z [apify.scrapy._async_thread] ERROR Coroutine execution timed out.
2025-02-17T04:24:02.100Z Traceback (most recent call last):
2025-02-17T04:24:02.101Z File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T04:24:02.103Z return future.result(timeout=timeout.total_seconds())
2025-02-17T04:24:02.104Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T04:24:02.105Z File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T04:24:02.106Z raise TimeoutError()
2025-02-17T04:24:02.108Z TimeoutError
2025-02-17T04:24:02.110Z [apify.scrapy._async_thread] ERROR Coroutine execution timed out. ({"message": "Coroutine execution timed out."})
2025-02-17T04:24:02.111Z Traceback (most recent call last):
2025-02-17T04:24:02.112Z File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T04:24:02.113Z return future.result(timeout=timeout.total_seconds())
2025-02-17T04:24:02.114Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T04:24:02.115Z File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T04:24:02.116Z raise TimeoutError()
2025-02-17T04:24:02.117Z TimeoutError
2025-02-17T04:24:02.118Z Traceback (most recent call last):
2025-02-17T04:24:02.119Z File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/scheduler.py", line 149, in next_request
2025-02-17T04:24:02.120Z apify_request = self._async_thread.run_coro(self._rq.fetch_next_request())
2025-02-17T04:24:02.121Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T04:24:02.122Z File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T04:24:02.123Z return future.result(timeout=timeout.total_seconds())
2025-02-17T04:24:02.124Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T04:24:02.125Z File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T04:24:02.126Z raise TimeoutError()
2025-02-17T04:24:02.127Z TimeoutError
2025-02-17T04:24:02.130Z while handling timed call
2025-02-17T04:24:02.132Z Traceback (most recent call last):
2025-02-17T04:24:02.133Z File "/usr/src/app/.venv/lib/python3.12/site-packages/twisted/internet/base.py", line 1105, in runUntilCurrent
2025-02-17T04:24:02.134Z call.func(*call.args, **call.kw)
2025-02-17T04:24:02.135Z File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/utils/reactor.py", line 70, in __call__
2025-02-17T04:24:02.136Z return self._func(*self._a, **self._kw)
2025-02-17T04:24:02.137Z File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/core/engine.py", line 179, in _next_request
2025-02-17T04:24:02.138Z and self._next_request_from_scheduler() is not None
2025-02-17T04:24:02.139Z File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/core/engine.py", line 224, in _next_request_from_scheduler
2025-02-17T04:24:02.140Z request = self.slot.scheduler.next_request()
2025-02-17T04:24:02.141Z File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/scheduler.py", line 149, in next_request
2025-02-17T04:24:02.143Z apify_request = self._async_thread.run_coro(self._rq.fetch_next_request())
2025-02-17T04:24:02.144Z File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T04:24:02.146Z return future.result(timeout=timeout.total_seconds())
2025-02-17T04:24:02.153Z File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T04:24:02.154Z raise TimeoutError()
2025-02-17T04:24:02.155Z builtins.TimeoutError:
2025-02-17T04:24:02.156Z