Skip to content

Conversation

@Mantisus
Copy link
Collaborator

@Mantisus Mantisus commented Oct 30, 2025

Description

This PR is inspired by this discussion.
Add support transform_request_function for SitemapRequestLoader, which works the same way as in EnqueueLinksFunction. This can be useful for setting label for correct routing or user_data with custom data.

@Mantisus Mantisus self-assigned this Oct 30, 2025
@Mantisus Mantisus requested a review from Pijukatel October 31, 2025 09:57
exclude: list[re.Pattern[Any] | Glob] | None = None,
max_buffer_size: int = 200,
persist_state_key: str | None = None,
transform_request_function: Callable[[RequestOptions], RequestOptions | RequestTransformAction] | None = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Buuut... shouldn't this also receive an URL of the origin sitemap?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that makes sense. A sitemap cannot contain links to another domain.
This way, users can easily create a mapping between the original link to the sitemap and the link inside transform_request_function. From my point of view, the most valuable thing that adding transform_request_function gives is the ability to add a label so that the request is processed by the appropriate handler.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes a lot of sense, thanks. But I'm afraid that this won't "click" for a lot of people. Perhaps we could add an example that showcases this?

@Mantisus Mantisus requested a review from janbuchar November 4, 2025 12:13
def transform_request(
request_options: RequestOptions,
) -> RequestOptions | RequestTransformAction:
request_host = URL(request_options['url']).host
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should mention that a sitemap should only contain links to the same host here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Added.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants