Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix!: Refactor service usage to rely on service_locator #691

Merged
merged 27 commits into from
Dec 13, 2024

Conversation

vdusek
Copy link
Collaborator

@vdusek vdusek commented Nov 13, 2024

Description

  • The service_container module has been completely refactored. It introduces changes to its usage, and resulting in many changes across the code base.
  • While it remains possible to pass "services" directly to components as before, components now rely on the service_container internally. They no longer store service instances themselves.
  • A new force flag has been added to the service_container's setters, which is especially useful for testing purposes.
  • This also quite simplifies the memory_storage_client.
  • We now have only set_storage_client, the same approach as for the event_manager. This is more flexible (allows more envs than just local & cloud). And in the SDK Actor, we can them based on the is_at_home.
  • This is a breaking change, but it affects only the service_container interface.

Open discussion

  • Should we go further and remove the option to pass configuration, event managers, or storage clients directly to components - requiring them to be set exclusively via the service_container? Thoughts are welcome.
    • No
  • Better name for service_container? service_locator?
    • service_locator

Issues

Testing

  • Existing tests, including those covering the service_container, have been updated to reflect these changes.
  • New tests covering the MemoryStorageClient respects the Configuration.

Manual reproduction

  • This code snippet demonstrates that the Crawler and MemoryStorageClient respects the custom Configuration.
  • Note: Some fields remain non-working for now. These will be addressed in a future PR, as this refactor is already quite big. However, with the new architecture, those updates should now be easy.
import asyncio

from crawlee.configuration import Configuration
from crawlee.http_crawler import HttpCrawler, HttpCrawlingContext
from crawlee.service_container import set_configuration


async def main() -> None:
    config = Configuration(persist_storage=False, write_metadata=False)
    set_configuration(config)  # or Crawler(config=config)
    crawler = HttpCrawler()

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        await context.push_data({'url': context.request.url})

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

Checklist

  • CI passed

@vdusek vdusek requested a review from janbuchar November 13, 2024 13:36
@github-actions github-actions bot added this to the 102nd sprint - Tooling team milestone Nov 13, 2024
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics. labels Nov 13, 2024
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Pull Request Tookit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!

Copy link
Collaborator

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for kicking this off!

@janbuchar
Copy link
Collaborator

We might want to try and resolve #369 here as well.

Additional context in Apify Slack - https://apify.slack.com/archives/C02JQSN79V4/p1723708212887119

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Pull Request Tookit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!

@janbuchar
Copy link
Collaborator

Also apify/apify-sdk-python#324 (comment) is relevant here - if we could remove the need to initialize BasicCrawler under async with Actor, that would be great.

@vdusek vdusek changed the title refactor!: Refactor service container and other related components [WIP] refactor!: Refactor component management to use service_container exclusively [WIP] Nov 15, 2024
@vdusek vdusek force-pushed the refactor-service-container branch 2 times, most recently from c8747d6 to bf427bd Compare November 25, 2024 14:55
@vdusek vdusek marked this pull request as ready for review November 26, 2024 14:12
@vdusek vdusek changed the title refactor!: Refactor component management to use service_container exclusively [WIP] refactor!: Refactor component management to use service_container exclusively Nov 26, 2024
@vdusek vdusek force-pushed the refactor-service-container branch from 4c8472c to 99f75de Compare November 26, 2024 14:29
@vdusek vdusek changed the title refactor!: Refactor component management to use service_container exclusively fix!: Refactor component management to use service_container exclusively Nov 26, 2024
@vdusek vdusek changed the title fix!: Refactor component management to use service_container exclusively fix!: Refactor service usage to rely exclusively on service_container Nov 26, 2024
Copy link
Contributor

@Pijukatel Pijukatel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, but maybe @janbuchar should approve as well.

janbuchar

This comment was marked as resolved.

@janbuchar

This comment was marked as resolved.

@vdusek

This comment was marked as outdated.

@Pijukatel

This comment was marked as resolved.

@vdusek vdusek removed this from the 103rd sprint - Tooling team milestone Dec 3, 2024
@vdusek vdusek force-pushed the refactor-service-container branch from 5e39fcb to c87c395 Compare December 13, 2024 13:32
@vdusek vdusek merged commit 1d31c6c into master Dec 13, 2024
23 checks passed
@vdusek vdusek deleted the refactor-service-container branch December 13, 2024 14:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.
Projects
None yet
3 participants