19 changes: 19 additions & 0 deletions docs/guides/code_examples/storages/opening.py
@@ -0,0 +1,19 @@
import asyncio

from crawlee.storages import Dataset


async def main() -> None:
    # Named storage (persists across runs)
    dataset_named = await Dataset.open(name='my-persistent-dataset')

    # Unnamed storage with an alias (purged on start)
    dataset_unnamed = await Dataset.open(alias='temporary-results')

    # Default unnamed storage (purged on start); these two calls are equivalent
    dataset_default = await Dataset.open()
    dataset_default = await Dataset.open(alias='default')


if __name__ == '__main__':
    asyncio.run(main())
31 changes: 22 additions & 9 deletions docs/guides/storages.mdx
@@ -9,6 +9,8 @@ import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';

import OpeningExample from '!!raw-loader!roa-loader!./code_examples/storages/opening.py';

import RqBasicExample from '!!raw-loader!roa-loader!./code_examples/storages/rq_basic_example.py';
import RqWithCrawlerExample from '!!raw-loader!roa-loader!./code_examples/storages/rq_with_crawler_example.py';
import RqWithCrawlerExplicitExample from '!!raw-loader!roa-loader!./code_examples/storages/rq_with_crawler_explicit_example.py';
@@ -26,7 +28,9 @@ import KvsWithCrawlerExplicitExample from '!!raw-loader!roa-loader!./code_exampl
import CleaningDoNotPurgeExample from '!!raw-loader!roa-loader!./code_examples/storages/cleaning_do_not_purge_example.py';
import CleaningPurgeExplicitlyExample from '!!raw-loader!roa-loader!./code_examples/storages/cleaning_purge_explicitly_example.py';

Crawlee offers several storage types for managing and persisting your crawling data. Request-oriented storages, such as the <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>, help you store and deduplicate URLs, while result-oriented storages, like <ApiLink to="class/Dataset">`Dataset`</ApiLink> and <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>, focus on storing and retrieving scraping results. This guide helps you choose the storage type that suits your needs.
Crawlee offers several storage types for managing and persisting your crawling data. Request-oriented storages, such as the <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>, help you store and deduplicate URLs, while result-oriented storages, like <ApiLink to="class/Dataset">`Dataset`</ApiLink> and <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>, focus on storing and retrieving scraping results. This guide explains when to use each type, how to interact with them, and how to control their lifecycle.

## Overview

Crawlee's storage system consists of two main layers:
- **Storages** (<ApiLink to="class/Dataset">`Dataset`</ApiLink>, <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>, <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>): High-level interfaces for interacting with different storage types.
@@ -70,6 +74,21 @@ Storage --|> KeyValueStore
Storage --|> RequestQueue
```

### Named and unnamed storages

Crawlee supports two types of storages:

- **Named storages**: Storages identified by a `name` that persist across runs. Use them when you want to share data between different crawler runs or access the same storage from multiple places.
- **Unnamed storages**: Temporary storages identified by an `alias` and scoped to a single run. They are purged automatically at the start of each run (when `purge_on_start` is enabled, which is the default).
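
The same `name`/`alias` distinction applies to every storage type. Below is a minimal sketch using a <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>; it assumes `KeyValueStore.open` accepts the same `name` and `alias` parameters as `Dataset.open`, and the store names are illustrative only:

```python
import asyncio

from crawlee.storages import KeyValueStore


async def main() -> None:
    # Named storage: persists across runs and can be shared between them.
    persistent_kvs = await KeyValueStore.open(name='my-config-store')
    await persistent_kvs.set_value('last_run', '2024-01-01')

    # Unnamed storage with an alias: scoped to this run, purged on start.
    scratch_kvs = await KeyValueStore.open(alias='scratch')
    await scratch_kvs.set_value('progress', {'status': 'in-progress'})


if __name__ == '__main__':
    asyncio.run(main())
```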

### Default storage

Each storage type (<ApiLink to="class/Dataset">`Dataset`</ApiLink>, <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>, <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>) has a default instance that can be accessed without specifying an `id`, `name`, or `alias`. The default unnamed storage is accessed by calling the storage's `open` method without parameters, which is the most common way to use storages in simple crawlers. The special alias `"default"` is equivalent to calling `open` without parameters:

<RunnableCodeBlock className="language-python" language="python">
{OpeningExample}
</RunnableCodeBlock>

## Request queue

The <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink> is the primary storage for URLs in Crawlee, especially useful for deep crawling. It supports dynamic addition of URLs, making it ideal for recursive tasks where URLs are discovered and added during the crawling process (e.g., following links across multiple pages). Each Crawlee project has a **default request queue**, which can be used to store URLs during a specific run.
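
As a rough sketch of that dynamic flow (the URL and handling logic are illustrative only, assuming the `add_request`, `fetch_next_request`, and `mark_request_as_handled` methods behave as in the current API):

```python
import asyncio

from crawlee.storages import RequestQueue


async def main() -> None:
    # Open the default request queue for this run.
    rq = await RequestQueue.open()

    # Seed the queue with a starting URL.
    await rq.add_request('https://crawlee.dev')

    # Process requests one by one; a real handler would enqueue
    # newly discovered links back into the queue as it goes.
    while request := await rq.fetch_next_request():
        print(f'Processing {request.url}')
        await rq.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())
```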
@@ -186,13 +205,7 @@ Crawlee provides the following helper function to simplify interactions with the

## Cleaning up the storages

By default, Crawlee automatically cleans up **default storages** before each crawler run to ensure a clean state. This behavior is controlled by the <ApiLink to="class/Configuration#purge_on_start">`Configuration.purge_on_start`</ApiLink> setting (default: `True`).

### What gets purged

- **Default storages** are completely removed and recreated at the start of each run, ensuring that you start with a clean slate.
- **Named storages** are never automatically purged and persist across runs.
- The behavior depends on the storage client implementation.
By default, Crawlee cleans up all unnamed storages (including the default one) at the start of each run, so every crawl begins with a clean state. This behavior is controlled by <ApiLink to="class/Configuration#purge_on_start">`Configuration.purge_on_start`</ApiLink> (default: `True`). In contrast, named storages are never purged automatically and persist across runs. The exact behavior may vary depending on the storage client implementation.
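
To keep unnamed storages between runs, disable purging. A minimal sketch, assuming the crawler constructor accepts a `configuration` argument as the cleaning examples imported by this guide suggest (the start URL is illustrative):

```python
import asyncio

from crawlee.configuration import Configuration
from crawlee.crawlers import HttpCrawler


async def main() -> None:
    # Disable automatic purging so unnamed storages survive between runs.
    configuration = Configuration(purge_on_start=False)

    crawler = HttpCrawler(configuration=configuration)
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```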

### When purging happens

@@ -221,6 +234,6 @@ Note that purging behavior may vary between storage client implementations. For

## Conclusion

This guide introduced you to the different storage types available in Crawlee and how to interact with them. You learned how to manage requests using the <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink> and store and retrieve scraping results using the <ApiLink to="class/Dataset">`Dataset`</ApiLink> and <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>. You also discovered how to use helper functions to simplify interactions with these storages. Finally, you learned how to clean up storages before starting a crawler run.
This guide introduced you to the different storage types available in Crawlee and how to interact with them. You learned about the distinction between named storages (persistent across runs) and unnamed storages with aliases (temporary and purged on start). You discovered how to manage requests using the <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink> and store and retrieve scraping results using the <ApiLink to="class/Dataset">`Dataset`</ApiLink> and <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>. You also learned how to use helper functions to simplify interactions with these storages and how to control storage cleanup behavior.

If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/crawlee-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
26 changes: 18 additions & 8 deletions src/crawlee/_types.py
@@ -189,6 +189,7 @@ class PushDataFunctionCall(PushDataKwargs):
data: list[dict[str, Any]] | dict[str, Any]
dataset_id: str | None
dataset_name: str | None
dataset_alias: str | None


class KeyValueStoreInterface(Protocol):
@@ -255,7 +256,7 @@ def __init__(self, *, key_value_store_getter: GetKeyValueStoreFunction) -> None:
self._key_value_store_getter = key_value_store_getter
self.add_requests_calls = list[AddRequestsKwargs]()
self.push_data_calls = list[PushDataFunctionCall]()
self.key_value_store_changes = dict[tuple[str | None, str | None], KeyValueStoreChangeRecords]()
self.key_value_store_changes = dict[tuple[str | None, str | None, str | None], KeyValueStoreChangeRecords]()

async def add_requests(
self,
@@ -270,6 +271,7 @@ async def push_data(
data: list[dict[str, Any]] | dict[str, Any],
dataset_id: str | None = None,
dataset_name: str | None = None,
dataset_alias: str | None = None,
**kwargs: Unpack[PushDataKwargs],
) -> None:
"""Track a call to the `push_data` context helper."""
@@ -278,6 +280,7 @@
data=data,
dataset_id=dataset_id,
dataset_name=dataset_name,
dataset_alias=dataset_alias,
**kwargs,
)
)
@@ -287,13 +290,14 @@ async def get_key_value_store(
*,
id: str | None = None,
name: str | None = None,
alias: str | None = None,
) -> KeyValueStoreInterface:
if (id, name) not in self.key_value_store_changes:
self.key_value_store_changes[id, name] = KeyValueStoreChangeRecords(
await self._key_value_store_getter(id=id, name=name)
if (id, name, alias) not in self.key_value_store_changes:
self.key_value_store_changes[id, name, alias] = KeyValueStoreChangeRecords(
await self._key_value_store_getter(id=id, name=name, alias=alias)
)

return self.key_value_store_changes[id, name]
return self.key_value_store_changes[id, name, alias]


@docs_group('Functions')
@@ -424,12 +428,14 @@ def __call__(
*,
id: str | None = None,
name: str | None = None,
alias: str | None = None,
) -> Coroutine[None, None, KeyValueStore]:
"""Call dunder method.

Args:
id: The ID of the `KeyValueStore` to get.
name: The name of the `KeyValueStore` to get.
name: The name of the `KeyValueStore` to get (global scope, named storage).
alias: The alias of the `KeyValueStore` to get (run scope, unnamed storage).
"""


@@ -444,12 +450,14 @@ def __call__(
*,
id: str | None = None,
name: str | None = None,
alias: str | None = None,
) -> Coroutine[None, None, KeyValueStoreInterface]:
"""Call dunder method.

Args:
id: The ID of the `KeyValueStore` to get.
name: The name of the `KeyValueStore` to get.
name: The name of the `KeyValueStore` to get (global scope, named storage).
alias: The alias of the `KeyValueStore` to get (run scope, unnamed storage).
"""


@@ -466,14 +474,16 @@ def __call__(
data: list[dict[str, Any]] | dict[str, Any],
dataset_id: str | None = None,
dataset_name: str | None = None,
dataset_alias: str | None = None,
**kwargs: Unpack[PushDataKwargs],
) -> Coroutine[None, None, None]:
"""Call dunder method.

Args:
data: The data to push to the `Dataset`.
dataset_id: The ID of the `Dataset` to push the data to.
dataset_name: The name of the `Dataset` to push the data to.
dataset_name: The name of the `Dataset` to push the data to (global scope, named storage).
dataset_alias: The alias of the `Dataset` to push the data to (run scope, unnamed storage).
**kwargs: Additional keyword arguments.
"""

30 changes: 19 additions & 11 deletions src/crawlee/crawlers/_basic/_basic_crawler.py
@@ -557,18 +557,20 @@ async def get_dataset(
*,
id: str | None = None,
name: str | None = None,
alias: str | None = None,
) -> Dataset:
"""Return the `Dataset` with the given ID or name. If none is provided, return the default one."""
return await Dataset.open(id=id, name=name)
return await Dataset.open(id=id, name=name, alias=alias)

async def get_key_value_store(
self,
*,
id: str | None = None,
name: str | None = None,
alias: str | None = None,
) -> KeyValueStore:
"""Return the `KeyValueStore` with the given ID or name. If none is provided, return the default KVS."""
return await KeyValueStore.open(id=id, name=name)
return await KeyValueStore.open(id=id, name=name, alias=alias)

def error_handler(
self, handler: ErrorHandler[TCrawlingContext | BasicCrawlingContext]
@@ -772,6 +774,7 @@ async def get_data(
self,
dataset_id: str | None = None,
dataset_name: str | None = None,
dataset_alias: str | None = None,
**kwargs: Unpack[GetDataKwargs],
) -> DatasetItemsListPage:
"""Retrieve data from a `Dataset`.
@@ -781,20 +784,22 @@

Args:
dataset_id: The ID of the `Dataset`.
dataset_name: The name of the `Dataset`.
dataset_name: The name of the `Dataset` (global scope, named storage).
dataset_alias: The alias of the `Dataset` (run scope, unnamed storage).
kwargs: Keyword arguments to be passed to the `Dataset.get_data()` method.

Returns:
The retrieved data.
"""
dataset = await Dataset.open(id=dataset_id, name=dataset_name)
dataset = await Dataset.open(id=dataset_id, name=dataset_name, alias=dataset_alias)
return await dataset.get_data(**kwargs)

async def export_data(
self,
path: str | Path,
dataset_id: str | None = None,
dataset_name: str | None = None,
dataset_alias: str | None = None,
) -> None:
"""Export all items from a Dataset to a JSON or CSV file.

Expand All @@ -804,10 +809,11 @@ async def export_data(

Args:
path: The destination file path. Must end with '.json' or '.csv'.
dataset_id: The ID of the Dataset to export from. If None, uses `name` parameter instead.
dataset_name: The name of the Dataset to export from. If None, uses `id` parameter instead.
dataset_id: The ID of the Dataset to export from.
dataset_name: The name of the Dataset to export from (global scope, named storage).
dataset_alias: The alias of the Dataset to export from (run scope, unnamed storage).
"""
dataset = await self.get_dataset(id=dataset_id, name=dataset_name)
dataset = await self.get_dataset(id=dataset_id, name=dataset_name, alias=dataset_alias)

path = path if isinstance(path, Path) else Path(path)
dst = path.open('w', newline='')
@@ -824,6 +830,7 @@ async def _push_data(
data: list[dict[str, Any]] | dict[str, Any],
dataset_id: str | None = None,
dataset_name: str | None = None,
dataset_alias: str | None = None,
**kwargs: Unpack[PushDataKwargs],
) -> None:
"""Push data to a `Dataset`.
@@ -834,10 +841,11 @@
Args:
data: The data to push to the `Dataset`.
dataset_id: The ID of the `Dataset`.
dataset_name: The name of the `Dataset`.
dataset_name: The name of the `Dataset` (global scope, named storage).
dataset_alias: The alias of the `Dataset` (run scope, unnamed storage).
kwargs: Keyword arguments to be passed to the `Dataset.push_data()` method.
"""
dataset = await self.get_dataset(id=dataset_id, name=dataset_name)
dataset = await self.get_dataset(id=dataset_id, name=dataset_name, alias=dataset_alias)
await dataset.push_data(data, **kwargs)

def _should_retry_request(self, context: BasicCrawlingContext, error: Exception) -> bool:
@@ -1226,8 +1234,8 @@ async def _commit_key_value_store_changes(
result: RequestHandlerRunResult, get_kvs: GetKeyValueStoreFromRequestHandlerFunction
) -> None:
"""Store key value store changes recorded in result."""
for (id, name), changes in result.key_value_store_changes.items():
store = await get_kvs(id=id, name=name)
for (id, name, alias), changes in result.key_value_store_changes.items():
store = await get_kvs(id=id, name=name, alias=alias)
for key, value in changes.updates.items():
await store.set_value(key, value.content, value.content_type)

3 changes: 3 additions & 0 deletions src/crawlee/storage_clients/_base/_storage_client.py
@@ -34,6 +34,7 @@ async def create_dataset_client(
*,
id: str | None = None,
name: str | None = None,
alias: str | None = None,
configuration: Configuration | None = None,
) -> DatasetClient:
"""Create a dataset client."""
@@ -44,6 +45,7 @@ async def create_kvs_client(
*,
id: str | None = None,
name: str | None = None,
alias: str | None = None,
configuration: Configuration | None = None,
) -> KeyValueStoreClient:
"""Create a key-value store client."""
@@ -54,6 +56,7 @@ async def create_rq_client(
*,
id: str | None = None,
name: str | None = None,
alias: str | None = None,
configuration: Configuration | None = None,
) -> RequestQueueClient:
"""Create a request queue client."""