
error starting up crawlee - json.decoder.JSONDecodeError: Expecting value #1029

Open · yqiang opened this issue Feb 27, 2025 · 5 comments · May be fixed by #1042
yqiang commented Feb 27, 2025

I'm getting the following traceback when trying to start a crawl job again after exiting with "CTRL-C":

❯ CRAWLEE_PURGE_ON_START=0 python main.py
Traceback (most recent call last):
  File "proj/main.py", line 80, in <module>
    asyncio.run(main())
    ~~~~~~~~~~~^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.13.2/Frameworks/Python.framework/Versions/3.13/lib/python3.13/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ~~~~~~~~~~^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.13.2/Frameworks/Python.framework/Versions/3.13/lib/python3.13/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.13.2/Frameworks/Python.framework/Versions/3.13/lib/python3.13/asyncio/base_events.py", line 725, in run_until_complete
    return future.result()
           ~~~~~~~~~~~~~^^
  File "proj/main.py", line 76, in main
    await crawler.run(urls)
  File "proj/.venv/lib/python3.13/site-packages/crawlee/crawlers/_basic/_basic_crawler.py", line 474, in run
    await self.add_requests(requests)
  File "proj/.venv/lib/python3.13/site-packages/crawlee/crawlers/_basic/_basic_crawler.py", line 563, in add_requests
    request_manager = await self.get_request_manager()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "proj/.venv/lib/python3.13/site-packages/crawlee/crawlers/_basic/_basic_crawler.py", line 406, in get_request_manager
    self._request_manager = await RequestQueue.open()
                            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "proj/.venv/lib/python3.13/site-packages/crawlee/storages/_request_queue.py", line 165, in open
    return await open_storage(
           ^^^^^^^^^^^^^^^^^^^
    ...<5 lines>...
    )
    ^
  File "proj/.venv/lib/python3.13/site-packages/crawlee/storages/_creation_management.py", line 166, in open_storage
    storage_info = await resource_collection_client.get_or_create(name=name, id=id)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "proj/.venv/lib/python3.13/site-packages/crawlee/storage_clients/_memory/_request_queue_collection_client.py", line 35, in get_or_create
    resource_client = await get_or_create_inner(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<5 lines>...
    )
    ^
  File "proj/.venv/lib/python3.13/site-packages/crawlee/storage_clients/_memory/_creation_management.py", line 143, in get_or_create_inner
    found = find_or_create_client_by_id_or_name_inner(
        resource_client_class=resource_client_class,
    ...<2 lines>...
        id=id,
    )
  File "proj/.venv/lib/python3.13/site-packages/crawlee/storage_clients/_memory/_creation_management.py", line 102, in find_or_create_client_by_id_or_name_inner
    storage_path = _determine_storage_path(resource_client_class, memory_storage_client, id, name)
  File "proj/.venv/lib/python3.13/site-packages/crawlee/storage_clients/_memory/_creation_management.py", line 412, in _determine_storage_path
    metadata = json.load(metadata_file)
  File "/opt/homebrew/Cellar/[email protected]/3.13.2/Frameworks/Python.framework/Versions/3.13/lib/python3.13/json/__init__.py", line 293, in load
    return loads(fp.read(),
        cls=cls, object_hook=object_hook,
        parse_float=parse_float, parse_int=parse_int,
        parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/opt/homebrew/Cellar/[email protected]/3.13.2/Frameworks/Python.framework/Versions/3.13/lib/python3.13/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ~~~~~~~~~~~~~~~~~~~~~~~^^^
  File "/opt/homebrew/Cellar/[email protected]/3.13.2/Frameworks/Python.framework/Versions/3.13/lib/python3.13/json/decoder.py", line 345, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.13.2/Frameworks/Python.framework/Versions/3.13/lib/python3.13/json/decoder.py", line 363, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

It looks like it might be having trouble decoding the metadata file? FWIW, I don't see any metadata files in storage/request_queues/default.
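
For reference, the exact error message can be reproduced by pointing json.load at a zero-byte file, which is consistent with an empty or truncated metadata file being the culprit. A minimal sketch (the temporary path is illustrative, not the actual storage directory):

```python
# Minimal repro: json.load raises "Expecting value: line 1 column 1
# (char 0)" on an empty file, matching the traceback above.
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    metadata_path = Path(tmp) / "__metadata__.json"
    metadata_path.touch()  # create a zero-byte file, like a truncated write

    try:
        with metadata_path.open() as metadata_file:
            json.load(metadata_file)
    except json.JSONDecodeError as exc:
        print(exc)  # Expecting value: line 1 column 1 (char 0)
```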

@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Feb 27, 2025
@janbuchar (Collaborator)

Hello @yqiang and thanks for the bug report! This seems like the same thing as #968, but that was closed due to inactivity. I hope we can crack it this time 🙂


yqiang commented Feb 27, 2025

I hope so too. FWIW, crawlee was running with around 25 concurrent headless WebKit clients via Playwright. In the meantime, is there anything I can try to salvage this crawl job?

@Mantisus (Collaborator)

> I don't see any metadata files in storage/request_queues/default.

Could you double-check? The file should be called __metadata__.json; it might be lost among the queue files. :)

If it's not in the folder, it should be re-created automatically. However, if it is in the folder but empty, that would likely trigger your error.
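
For illustration, a more tolerant loading pattern would treat an empty or corrupt __metadata__.json the same as a missing one. This is only a sketch, not crawlee's actual implementation, and load_metadata is a hypothetical helper:

```python
# Hypothetical sketch of tolerant metadata loading; not crawlee's
# actual code. An empty or corrupt __metadata__.json is treated the
# same as a missing one, so the caller can regenerate it.
import json
from pathlib import Path


def load_metadata(storage_dir: Path) -> dict | None:
    """Return parsed metadata, or None if it is missing or unreadable."""
    metadata_path = storage_dir / "__metadata__.json"
    if not metadata_path.is_file():
        return None  # missing file: caller regenerates the metadata
    try:
        return json.loads(metadata_path.read_text())
    except json.JSONDecodeError:
        return None  # empty or truncated file: treat it as missing too
```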


yqiang commented Feb 27, 2025

The file exists, but it's empty:

storage/request_queues/default on main via 🐍 v3.13.2
❯ file __metadata__.json
__metadata__.json: empty

Otherwise the directory is populated with other request files:

❯ ls -l | wc -l
    1558

Should I remove the file so it can recreate it?

@Mantisus (Collaborator)

> The file exists, but it's empty:

Apparently the crawl was interrupted just when the file was supposed to be updated.
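
A common guard against this failure mode is an atomic write: write to a temporary file, then rename it over the target, so an interruption never leaves a half-written or empty file behind. A hedged sketch of the idea, not crawlee's actual code:

```python
# Sketch of an atomic metadata write (illustrative only). os.replace
# is an atomic rename on POSIX, so readers see either the old file or
# the complete new one, never an empty in-between state.
import json
import os
from pathlib import Path


def write_metadata_atomically(metadata_path: Path, metadata: dict) -> None:
    tmp_path = metadata_path.with_name(metadata_path.name + ".tmp")
    tmp_path.write_text(json.dumps(metadata))
    os.replace(tmp_path, metadata_path)  # atomic rename over the target
```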

> Should I remove the file so it can recreate it?

Yeah, I think that'll work for you. The queue should be restored from the request data.
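
Concretely, the recovery step is just deleting the zero-byte file before restarting; a small sketch, assuming the default storage layout shown above:

```python
# Remove the empty metadata file so crawlee can recreate it on the
# next run. The path matches the default storage layout from above.
from pathlib import Path

metadata_path = Path("storage/request_queues/default/__metadata__.json")
if metadata_path.is_file() and metadata_path.stat().st_size == 0:
    metadata_path.unlink()  # only delete if it is actually empty
```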
