Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scrapy scheduler emits timeout errors #404

Open
2 of 4 tasks
honzajavorek opened this issue Feb 17, 2025 · 1 comment
Open
2 of 4 tasks

Scrapy scheduler emits timeout errors #404

honzajavorek opened this issue Feb 17, 2025 · 1 comment
Labels
t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@honzajavorek
Copy link
Contributor

honzajavorek commented Feb 17, 2025

When running my scrapers with the latest Apify SDK (meaning fd7650a), I get timeouts on the following lines of code. These timeouts don't crash the scraper immediately, but they corrupt the scraper run: The results are incomplete, and I've also seen strange request queue behavior after these errors, which at least once resembled endless looping (I aborted the scraper after repeatedly seeing the same runtime stats).

self._async_thread.run_coro(self._rq.mark_request_as_handled(apify_request))
self._async_thread.run_coro(self._rq.fetch_next_request())

I use the same technique with the same async thread for caching requests (see #403), but I can't see any timeout errors related to the key-value storage I use. AFAIK all timeout errors I've seen were related to RQ, despite the KV being heavily used as well during the same scraper run. (It happens with KV as well, see my comment below.)

The issue happens only occasionally, which makes it hard to track down. My scraper runs for 20 minutes just okay, and then spits out 5 of these errors. I've got these timeouts with two rather different spiders, so this isn't specific to a code of a single spider class.

Debugging Plan & Ideas

  • verify if I can get these errors locally (note for myself: run plucker with --apify) 👉 no success
  • verify if these errors also happen with scrapers completely unrelated to jobs (i.e. verify this really isn't related to a certain website being touched)
  • modify code of the SDK (edit site-packages or make a git fork with changes) to see where exactly the coroutine hangs so that the thread timeouts
  • regardless the cause and solution, maybe the thread should crash the whole program in case of a timeout, because the coroutines are pretty important to finish...

Examples

Specimen 1
2025-02-14T16:44:23.860Z [apify.scrapy._async_thread] ERROR Coroutine execution timed out.
2025-02-14T16:44:23.861Z       Traceback (most recent call last):
2025-02-14T16:44:23.862Z         File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-14T16:44:23.863Z           return future.result(timeout=timeout.total_seconds())
2025-02-14T16:44:23.863Z                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-14T16:44:23.864Z         File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-14T16:44:23.865Z           raise TimeoutError()
2025-02-14T16:44:23.866Z       TimeoutError
2025-02-14T16:44:23.866Z [apify.scrapy._async_thread] ERROR Coroutine execution timed out. ({"message": "Coroutine execution timed out."})
2025-02-14T16:44:23.869Z       Traceback (most recent call last):
2025-02-14T16:44:23.869Z         File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-14T16:44:23.870Z           return future.result(timeout=timeout.total_seconds())
2025-02-14T16:44:23.871Z                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-14T16:44:23.871Z         File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-14T16:44:23.872Z           raise TimeoutError()
2025-02-14T16:44:23.873Z       TimeoutError
Specimen 2
2025-02-17T04:24:02.098Z [apify.scrapy._async_thread] ERROR Coroutine execution timed out.
2025-02-17T04:24:02.100Z       Traceback (most recent call last):
2025-02-17T04:24:02.101Z         File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T04:24:02.103Z           return future.result(timeout=timeout.total_seconds())
2025-02-17T04:24:02.104Z                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T04:24:02.105Z         File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T04:24:02.106Z           raise TimeoutError()
2025-02-17T04:24:02.108Z       TimeoutError
2025-02-17T04:24:02.110Z [apify.scrapy._async_thread] ERROR Coroutine execution timed out. ({"message": "Coroutine execution timed out."})
2025-02-17T04:24:02.111Z       Traceback (most recent call last):
2025-02-17T04:24:02.112Z         File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T04:24:02.113Z           return future.result(timeout=timeout.total_seconds())
2025-02-17T04:24:02.114Z                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T04:24:02.115Z         File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T04:24:02.116Z           raise TimeoutError()
2025-02-17T04:24:02.117Z       TimeoutError
2025-02-17T04:24:02.118Z Traceback (most recent call last):
2025-02-17T04:24:02.119Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/scheduler.py", line 149, in next_request
2025-02-17T04:24:02.120Z     apify_request = self._async_thread.run_coro(self._rq.fetch_next_request())
2025-02-17T04:24:02.121Z                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T04:24:02.122Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T04:24:02.123Z     return future.result(timeout=timeout.total_seconds())
2025-02-17T04:24:02.124Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T04:24:02.125Z   File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T04:24:02.126Z     raise TimeoutError()
2025-02-17T04:24:02.127Z TimeoutError
2025-02-17T04:24:02.130Z while handling timed call
2025-02-17T04:24:02.132Z Traceback (most recent call last):
2025-02-17T04:24:02.133Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/twisted/internet/base.py", line 1105, in runUntilCurrent
2025-02-17T04:24:02.134Z     call.func(*call.args, **call.kw)
2025-02-17T04:24:02.135Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/utils/reactor.py", line 70, in __call__
2025-02-17T04:24:02.136Z     return self._func(*self._a, **self._kw)
2025-02-17T04:24:02.137Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/core/engine.py", line 179, in _next_request
2025-02-17T04:24:02.138Z     and self._next_request_from_scheduler() is not None
2025-02-17T04:24:02.139Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/core/engine.py", line 224, in _next_request_from_scheduler
2025-02-17T04:24:02.140Z     request = self.slot.scheduler.next_request()
2025-02-17T04:24:02.141Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/scheduler.py", line 149, in next_request
2025-02-17T04:24:02.143Z     apify_request = self._async_thread.run_coro(self._rq.fetch_next_request())
2025-02-17T04:24:02.144Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T04:24:02.146Z     return future.result(timeout=timeout.total_seconds())
2025-02-17T04:24:02.153Z   File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T04:24:02.154Z     raise TimeoutError()
2025-02-17T04:24:02.155Z builtins.TimeoutError:
2025-02-17T04:24:02.156Z
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Feb 17, 2025
@honzajavorek
Copy link
Contributor Author

honzajavorek commented Feb 18, 2025

I have some new findings!

  • The errors are completely unrelated to the underlying website or type of spider. As of now I've seen these failures in all kinds of scrapers I have, even those performing just a single request (they don't strain the infrastructure).
  • The errors are related to the async thread solution without regard whether it is RQ in the scheduler or KV in the cache. While less often, I've seen timeout errors happening also with coroutines related to the HTTP cache.
  • For at least second time I've seen a case when the spider got stuck in endless loop and the platform had to kill it based on Actor timeout.
  • I wasn't able to reproduce it locally. On the platform it happens randomly, but regularly enough so that it feels like I have fifty-fifty chance with any particular run whether it finishes just fine or ends with these timeout errors.

More Examples

Specimen 3
2025-02-16T05:38:11.939Z [scrapy.utils.signal] ERROR Error caught on signal handler: <bound method HttpCacheMiddleware.spider_closed of <scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware object at 0x765bbe1a0d70>> ({"spider": "<Spider 'meetups-pyvo' at 0x765bbe7bff20>"})
2025-02-16T05:38:11.940Z       Traceback (most recent call last):
2025-02-16T05:38:11.941Z         File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/utils/defer.py", line 400, in maybeDeferred_coro
2025-02-16T05:38:11.941Z           result = f(*args, **kw)
2025-02-16T05:38:11.942Z                    ^^^^^^^^^^^^^^
2025-02-16T05:38:11.943Z         File "/usr/src/app/.venv/lib/python3.12/site-packages/pydispatch/robustapply.py", line 55, in robustApply
2025-02-16T05:38:11.943Z           return receiver(*arguments, **named)
2025-02-16T05:38:11.944Z                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16T05:38:11.944Z         File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/downloadermiddlewares/httpcache.py", line 68, in spider_closed
2025-02-16T05:38:11.945Z           self.storage.close_spider(spider)
2025-02-16T05:38:11.946Z         File "/usr/src/app/jg/plucker/cache.py", line 80, in close_spider
2025-02-16T05:38:11.946Z           self._async_thread.run_coro(expire_kv())
2025-02-16T05:38:11.947Z         File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-16T05:38:11.948Z           return future.result(timeout=timeout.total_seconds())
2025-02-16T05:38:11.948Z                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16T05:38:11.949Z         File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-16T05:38:11.950Z           raise TimeoutError()
2025-02-16T05:38:11.950Z       TimeoutError
Specimen 4
2025-02-17T05:37:11.020Z [scrapy.core.engine] INFO  Spider opened ({"spider": "<Spider 'meetups-ctvrtkon' at 0x7312f288f860>"})
2025-02-17T05:37:11.088Z [scrapy.extensions.logstats] INFO  Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) ({"spider": "<Spider 'meetups-ctvrtkon' at 0x7312f288f860>"})
2025-02-17T05:37:11.089Z [jg.plucker.cache] DEBUG Using Apify key value cache storage ({"spider": "<Spider 'meetups-ctvrtkon' at 0x7312f288f860>"})
2025-02-17T05:37:11.090Z [jg.plucker.cache] DEBUG Starting background thread for cache storage's event loop
2025-02-17T05:37:11.091Z [jg.plucker.cache] DEBUG Opening cache storage's 'httpcache-meetups-ctvrtkon' key value store
2025-02-17T05:37:11.153Z [scrapy.extensions.telnet] INFO  Telnet console listening on 127.0.0.1:6023 ({"crawler": "<scrapy.crawler.Crawler object at 0x7312f3080c50>"})
2025-02-17T05:38:11.249Z [apify.scrapy._async_thread] ERROR Coroutine execution timed out.
2025-02-17T05:38:11.250Z       Traceback (most recent call last):
2025-02-17T05:38:11.251Z         File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T05:38:11.253Z           return future.result(timeout=timeout.total_seconds())
2025-02-17T05:38:11.254Z                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T05:38:11.254Z         File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T05:38:11.255Z           raise TimeoutError()
2025-02-17T05:38:11.256Z       TimeoutError
2025-02-17T05:38:11.257Z [apify.scrapy._async_thread] ERROR Coroutine execution timed out. ({"message": "Coroutine execution timed out."})
2025-02-17T05:38:11.257Z       Traceback (most recent call last):
2025-02-17T05:38:11.258Z         File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T05:38:11.259Z           return future.result(timeout=timeout.total_seconds())
2025-02-17T05:38:11.260Z                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T05:38:11.260Z         File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T05:38:11.261Z           raise TimeoutError()
2025-02-17T05:38:11.262Z       TimeoutError
2025-02-17T05:38:11.263Z Traceback (most recent call last):
2025-02-17T05:38:11.263Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/scheduler.py", line 149, in next_request
2025-02-17T05:38:11.263Z     apify_request = self._async_thread.run_coro(self._rq.fetch_next_request())
2025-02-17T05:38:11.264Z                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T05:38:11.264Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T05:38:11.265Z     return future.result(timeout=timeout.total_seconds())
2025-02-17T05:38:11.265Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-17T05:38:11.266Z   File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T05:38:11.266Z     raise TimeoutError()
2025-02-17T05:38:11.267Z TimeoutError
2025-02-17T05:38:11.267Z while handling timed call
2025-02-17T05:38:11.268Z Traceback (most recent call last):
2025-02-17T05:38:11.268Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/twisted/internet/base.py", line 1105, in runUntilCurrent
2025-02-17T05:38:11.269Z     call.func(*call.args, **call.kw)
2025-02-17T05:38:11.269Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/utils/reactor.py", line 70, in __call__
2025-02-17T05:38:11.270Z     return self._func(*self._a, **self._kw)
2025-02-17T05:38:11.270Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/core/engine.py", line 179, in _next_request
2025-02-17T05:38:11.271Z     and self._next_request_from_scheduler() is not None
2025-02-17T05:38:11.271Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/scrapy/core/engine.py", line 224, in _next_request_from_scheduler
2025-02-17T05:38:11.272Z     request = self.slot.scheduler.next_request()
2025-02-17T05:38:11.272Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/scheduler.py", line 149, in next_request
2025-02-17T05:38:11.273Z     apify_request = self._async_thread.run_coro(self._rq.fetch_next_request())
2025-02-17T05:38:11.273Z   File "/usr/src/app/.venv/lib/python3.12/site-packages/apify/scrapy/_async_thread.py", line 62, in run_coro
2025-02-17T05:38:11.274Z     return future.result(timeout=timeout.total_seconds())
2025-02-17T05:38:11.274Z   File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 458, in result
2025-02-17T05:38:11.275Z     raise TimeoutError()
2025-02-17T05:38:11.275Z builtins.TimeoutError:
2025-02-17T05:38:11.275Z
2025-02-17T05:38:11.276Z [scrapy.extensions.logstats] INFO  Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) ({"spider": "<Spider 'meetups-ctvrtkon' at 0x7312f288f860>"})
2025-02-17T05:39:11.090Z [scrapy.extensions.logstats] INFO  Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) ({"spider": "<Spider 'meetups-ctvrtkon' at 0x7312f288f860>"})
2025-02-17T05:40:11.089Z [scrapy.extensions.logstats] INFO  Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) ({"spider": "<Spider 'meetups-ctvrtkon' at 0x7312f288f860>"})
2025-02-17T05:41:11.089Z [scrapy.extensions.logstats] INFO  Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) ({"spider": "<Spider 'meetups-ctvrtkon' at 0x7312f288f860>"})
2025-02-17T05:42:00.675Z ACTOR: The Actor run has reached the timeout of 300 seconds, aborting it. You can increase the timeout in Input > Run options.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
t-tooling Issues with this label are in the ownership of the tooling team.
Projects
None yet
Development

No branches or pull requests

1 participant