Scrapy scheduler emits timeout errors #404

Open
@honzajavorek

When running my scrapers with the latest Apify SDK (meaning fd7650a), I get timeouts on the following lines of code. These timeouts don't crash the scraper immediately, but they corrupt the run: the results are incomplete, and I've also seen strange request queue behavior after these errors, which at least once resembled endless looping (I aborted the scraper after repeatedly seeing the same runtime stats).

self._async_thread.run_coro(self._rq.mark_request_as_handled(apify_request))
self._async_thread.run_coro(self._rq.fetch_next_request())

I use the same technique with the same async thread for caching requests (see #403), but I can't see any timeout errors related to the key-value store (KV) I use. AFAIK all timeout errors I've seen were related to the request queue (RQ), despite the KV being heavily used as well during the same scraper run. (Update: it happens with KV as well, see my comment below.)
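For context, this is roughly the pattern the traceback below points at: an asyncio event loop running in a background thread, with the synchronous Scrapy side blocking on `future.result(timeout=...)`. It is a minimal sketch, not the SDK's actual code. The class name `AsyncThread`, the 60-second default and the `close()` helper are my assumptions for illustration; only `run_coro` and the `future.result(timeout=timeout.total_seconds())` call are taken from the logs.

```python
import asyncio
import threading
from concurrent.futures import TimeoutError as FutureTimeoutError
from datetime import timedelta


class AsyncThread:
    """Hypothetical stand-in for the SDK helper, for illustration only."""

    def __init__(self) -> None:
        # A dedicated event loop that runs forever in a daemon thread.
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run_coro(self, coro, timeout: timedelta = timedelta(seconds=60)):
        # The 60 s default is an assumption, not the SDK's known value.
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        try:
            # Blocks the calling (Scrapy) thread until the coroutine finishes.
            return future.result(timeout=timeout.total_seconds())
        except FutureTimeoutError:
            # Without this, the coroutine would keep running in the loop
            # thread even though the caller has stopped waiting for it.
            future.cancel()
            raise

    def close(self) -> None:
        # Stop the loop from its own thread, then clean up.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()
```

One thing worth keeping in mind: a timeout from `future.result()` only means the waiting thread gave up. Unless the future is cancelled, the coroutine is still scheduled in the loop and may complete later, which could be one explanation for the odd request queue state after these errors. I haven't verified whether the SDK cancels it.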

The issue happens only occasionally, which makes it hard to track down. My scraper runs fine for 20 minutes and then spits out 5 of these errors. I've got these timeouts with two rather different spiders, so this isn't specific to the code of a single spider class.

Debugging Plan & Ideas

  • verify if I can get these errors locally (note for myself: run plucker with --apify) 👉 no success
  • verify if these errors also happen with scrapers completely unrelated to jobs (i.e. verify this really isn't related to a certain website being touched)
  • modify the SDK code (edit site-packages or make a git fork with changes) to see where exactly the coroutine hangs and causes the thread to time out
  • regardless of the cause and solution, maybe the thread should crash the whole program in case of a timeout, because it's pretty important that the coroutines finish...
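To find where the coroutine hangs, something like the following could be called right after `run_coro` times out. This is only a sketch of one possible approach: `diagnose_loop` and `_describe_pending_tasks` are hypothetical helpers I made up, not part of the SDK. It asks the loop itself to report where each pending task is suspended, and if the loop doesn't even answer, it falls back to dumping the stacks of all threads with `faulthandler`.

```python
import asyncio
import faulthandler
import io
import sys
from concurrent.futures import TimeoutError as FutureTimeoutError


async def _describe_pending_tasks() -> list[str]:
    """Runs inside the loop and reports where each pending task is suspended."""
    current = asyncio.current_task()
    report = []
    for task in asyncio.all_tasks():
        if task is current or task.done():
            continue
        buffer = io.StringIO()
        task.print_stack(file=buffer)
        report.append(buffer.getvalue())
    return report


def diagnose_loop(loop: asyncio.AbstractEventLoop, timeout: float = 5.0) -> list[str]:
    """Hypothetical helper: call from the waiting thread after a timeout."""
    future = asyncio.run_coroutine_threadsafe(_describe_pending_tasks(), loop)
    try:
        return future.result(timeout=timeout)
    except FutureTimeoutError:
        # The loop did not run the diagnostic at all, so it is probably
        # blocked by synchronous code. Dump every thread's stack instead.
        future.cancel()
        faulthandler.dump_traceback(file=sys.stderr, all_threads=True)
        return []
```

If the report comes back, the loop is alive and the hang is an `await` that never completes. Note that asyncio reports only a single frame for a suspended coroutine, so the output shows the awaiting line in the outermost coroutine rather than the deepest one. If the fallback fires instead, the loop thread itself is stuck, and the thread dump should show where.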

Examples

Specimen 1
2025-02-14T16:44:23.860Z [apify.scrapy._async_thread] ERROR Coroutine execution timed out.
2025-02-14T16:44:23.861Z       Traceback (most recent call last):
2025-02-14T16:44:23.862Z         File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-14T16:44:23.863Z           return future.result(timeout=timeout.total_seconds())
2025-02-14T16:44:23.863Z                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-14T16:44:23.864Z         File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-14T16:44:23.865Z           raise TimeoutError()
2025-02-14T16:44:23.866Z       TimeoutError
2025-02-14T16:44:23.866Z [apify.scrapy._async_thread] ERROR Coroutine execution timed out. ({"message": "Coroutine execution timed out."})
2025-02-14T16:44:23.869Z       Traceback (most recent call last):
2025-02-14T16:44:23.869Z         File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-14T16:44:23.870Z           return future.result(timeout=timeout.total_seconds())
2025-02-14T16:44:23.871Z                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-14T16:44:23.871Z         File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-14T16:44:23.872Z           raise TimeoutError()
2025-02-14T16:44:23.873Z       TimeoutError
Specimen 2
2025-02-17T04:24:02.098Z [apify.scrapy._async_thread] ERROR Coroutine execution timed out.
2025-02-17T04:24:02.100Z       Traceback (most recent call last):
2025-02-17T04:24:02.101Z         File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T04:24:02.103Z           return future.result(timeout=timeout.total_seconds())
2025-02-17T04:24:02.104Z                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T04:24:02.105Z         File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T04:24:02.106Z           raise TimeoutError()
2025-02-17T04:24:02.108Z       TimeoutError
2025-02-17T04:24:02.110Z [apify.scrapy._async_thread] ERROR Coroutine execution timed out. ({"message": "Coroutine execution timed out."})
2025-02-17T04:24:02.111Z       Traceback (most recent call last):
2025-02-17T04:24:02.112Z         File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T04:24:02.113Z           return future.result(timeout=timeout.total_seconds())
2025-02-17T04:24:02.114Z                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T04:24:02.115Z         File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T04:24:02.116Z           raise TimeoutError()
2025-02-17T04:24:02.117Z       TimeoutError
2025-02-17T04:24:02.118Z Traceback (most recent call last):
2025-02-17T04:24:02.119Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/scheduler.py", line 149, in next_request
2025-02-17T04:24:02.120Z     apify_request = self._async_thread.run_coro(self._rq.fetch_next_request())
2025-02-17T04:24:02.121Z                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T04:24:02.122Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T04:24:02.123Z     return future.result(timeout=timeout.total_seconds())
2025-02-17T04:24:02.124Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T04:24:02.125Z   File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T04:24:02.126Z     raise TimeoutError()
2025-02-17T04:24:02.127Z TimeoutError
2025-02-17T04:24:02.130Z while handling timed call
2025-02-17T04:24:02.132Z Traceback (most recent call last):
2025-02-17T04:24:02.133Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/twisted/internet/base.py", line 1105, in runUntilCurrent
2025-02-17T04:24:02.134Z     call.func(*call.args, **call.kw)
2025-02-17T04:24:02.135Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/utils/reactor.py", line 70, in __call__
2025-02-17T04:24:02.136Z     return self._func(*self._a, **self._kw)
2025-02-17T04:24:02.137Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/core/engine.py", line 179, in _next_request
2025-02-17T04:24:02.138Z     and self._next_request_from_scheduler() is not None
2025-02-17T04:24:02.139Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/core/engine.py", line 224, in _next_request_from_scheduler
2025-02-17T04:24:02.140Z     request = self.slot.scheduler.next_request()
2025-02-17T04:24:02.141Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/scheduler.py", line 149, in next_request
2025-02-17T04:24:02.143Z     apify_request = self._async_thread.run_coro(self._rq.fetch_next_request())
2025-02-17T04:24:02.144Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T04:24:02.146Z     return future.result(timeout=timeout.total_seconds())
2025-02-17T04:24:02.153Z   File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T04:24:02.154Z     raise TimeoutError()
2025-02-17T04:24:02.155Z builtins.TimeoutError:
2025-02-17T04:24:02.156Z

Labels: t-tooling (issues with this label are in the ownership of the tooling team)
