Skip to content

Conversation

vdusek
Copy link
Collaborator

@vdusek vdusek commented Nov 13, 2024

Description

  • The service_container module has been completely refactored. It introduces changes to its usage, and resulting in many changes across the code base.
  • While it remains possible to pass "services" directly to components as before, components now rely on the service_container internally. They no longer store service instances themselves.
  • A new force flag has been added to the service_container's setters, which is especially useful for testing purposes.
  • This also quite simplifies the memory_storage_client.
  • We now have only set_storage_client, the same approach as for the event_manager. This is more flexible (allows more envs than just local & cloud). And in the SDK Actor, we can them based on the is_at_home.
  • This is a breaking change, but it affects only the service_container interface.

Open discussion

  • Should we go further and remove the option to pass configuration, event managers, or storage clients directly to components - requiring them to be set exclusively via the service_container? Thoughts are welcome.
    • No
  • Better name for service_container? service_locator?
    • service_locator

Issues

Testing

  • Existing tests, including those covering the service_container, have been updated to reflect these changes.
  • New tests covering the MemoryStorageClient respects the Configuration.

Manual reproduction

  • This code snippet demonstrates that the Crawler and MemoryStorageClient respects the custom Configuration.
  • Note: Some fields remain non-working for now. These will be addressed in a future PR, as this refactor is already quite big. However, with the new architecture, those updates should now be easy.
import asyncio

from crawlee.configuration import Configuration
from crawlee.http_crawler import HttpCrawler, HttpCrawlingContext
from crawlee.service_container import set_configuration


async def main() -> None:
    config = Configuration(persist_storage=False, write_metadata=False)
    set_configuration(config)  # or Crawler(config=config)
    crawler = HttpCrawler()

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        await context.push_data({'url': context.request.url})

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

Checklist

  • CI passed

@vdusek vdusek requested a review from janbuchar November 13, 2024 13:36
@github-actions github-actions bot added this to the 102nd sprint - Tooling team milestone Nov 13, 2024
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics. labels Nov 13, 2024
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Pull Request Tookit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!

Copy link
Collaborator

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for kicking this off!

@janbuchar
Copy link
Collaborator

We might want to try and resolve #369 here as well.

Additional context in Apify Slack - https://apify.slack.com/archives/C02JQSN79V4/p1723708212887119

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Pull Request Tookit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!

@janbuchar
Copy link
Collaborator

Also apify/apify-sdk-python#324 (comment) is relevant here - if we could remove the need to initialize BasicCrawler under async with Actor, that would be great.

@vdusek vdusek changed the title refactor!: Refactor service container and other related components [WIP] refactor!: Refactor component management to use service_container exclusively [WIP] Nov 15, 2024
@vdusek vdusek force-pushed the refactor-service-container branch 2 times, most recently from c8747d6 to bf427bd Compare November 25, 2024 14:55
@vdusek vdusek marked this pull request as ready for review November 26, 2024 14:12
@vdusek vdusek changed the title refactor!: Refactor component management to use service_container exclusively [WIP] refactor!: Refactor component management to use service_container exclusively Nov 26, 2024
@vdusek vdusek force-pushed the refactor-service-container branch from 4c8472c to 99f75de Compare November 26, 2024 14:29
@vdusek vdusek changed the title refactor!: Refactor component management to use service_container exclusively fix!: Refactor component management to use service_container exclusively Nov 26, 2024
@vdusek vdusek changed the title fix!: Refactor component management to use service_container exclusively fix!: Refactor service usage to rely exclusively on service_container Nov 26, 2024
Copy link
Collaborator

@Pijukatel Pijukatel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, but maybe @janbuchar should approve as well.

janbuchar

This comment was marked as resolved.

@janbuchar

This comment was marked as resolved.

@vdusek

This comment was marked as outdated.

@Pijukatel

This comment was marked as resolved.

@vdusek vdusek removed this from the 103rd sprint - Tooling team milestone Dec 3, 2024
@vdusek vdusek force-pushed the refactor-service-container branch from 5e39fcb to c87c395 Compare December 13, 2024 13:32
@vdusek vdusek merged commit 1d31c6c into master Dec 13, 2024
23 checks passed
@vdusek vdusek deleted the refactor-service-container branch December 13, 2024 14:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.
Projects
None yet
3 participants