Coding guidelines

This file provides guidance to programming agents when working with code in this repository.

Project Overview

The Apify SDK for Python (apify package on PyPI) is the official library for creating Apify Actors in Python. It provides Actor lifecycle management, storage access (datasets, key-value stores, request queues), event handling, proxy configuration, and pay-per-event charging. It builds on top of the Crawlee web scraping framework and the Apify API Client. It supports Python 3.10–3.14 and uses hatchling as its build system.

Common Commands

# Install dependencies (including dev)
uv sync --all-extras

# Install dev dependencies + pre-commit hooks
uv run poe install-dev

# Format code (also auto-fixes lint issues via ruff check --fix)
uv run poe format

# Lint (format check + ruff check)
uv run poe lint

# Type check
uv run poe type-check

# Run all checks (lint + type-check + unit tests)
uv run poe check-code

# Unit tests (no API token needed)
uv run poe unit-tests

# Run a single test file
uv run pytest tests/unit/actor/test_actor_lifecycle.py

# Run a single test by name
uv run pytest tests/unit/actor/test_actor_lifecycle.py -k "test_name"

# Integration tests (needs APIFY_TEST_USER_API_TOKEN)
uv run poe integration-tests

# E2E tests (needs APIFY_TEST_USER_API_TOKEN, builds/deploys Actors on platform)
uv run poe e2e-tests

Code Style

  • Formatter/Linter: Ruff (line length 120, single quotes for inline strings, double quotes for docstrings)
  • Type checker: ty (targets Python 3.10)
  • All ruff rules enabled with specific ignores — see pyproject.toml [tool.ruff.lint] for the full ignore list
  • Tests are exempt from docstring rules (D), assert warnings (S101), and private member access (SLF001)
  • Unused imports are allowed in __init__.py files (re-exports)
  • Pre-commit hooks: lint check + type check run automatically on commit
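
As a small illustration of the quoting convention listed above, the following sketch uses single quotes for inline strings and double quotes for the docstring. The function name build_greeting is hypothetical and exists only for this example.

```python
def build_greeting(name: str, *, excited: bool = False) -> str:
    """Return a greeting for `name`.

    Docstrings use double quotes; inline strings below use single quotes.
    """
    suffix = '!' if excited else '.'
    return f'Hello, {name}{suffix}'
```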

Architecture

Core (src/apify/)

  • _actor.py: The _ActorType class is the central API. Actor is a lazy object proxy (lazy_object_proxy.Proxy from the lazy-object-proxy package) wrapping _ActorType, so it acts as both a class (e.g. Actor.is_at_home()) and an instance-like context manager (async with Actor:). On __aenter__, the proxy's __wrapped__ is replaced with the active _ActorType instance. _ActorType manages the full Actor lifecycle (init, exit, fail), provides access to storages (open_dataset, open_key_value_store, open_request_queue), and handles events, proxy configuration, charging, and platform API operations (start, call, metamorph, reboot).

  • _configuration.py: Configuration extends Crawlee's Configuration with Apify-specific settings (API URL, token, Actor run metadata, proxy settings, charging config). Configuration is populated from environment variables (APIFY_*).

  • _charging.py — Pay-per-event billing system. ChargingManager / ChargingManagerImplementation handle charging events against pricing info fetched from the API.

  • _proxy_configuration.py: ProxyConfiguration manages Apify proxy setup (residential, datacenter, groups, country targeting).

  • _models.py — Pydantic models for API data structures (Actor runs, webhooks, pricing info, etc.).
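
The proxy behavior described for _actor.py can be hard to picture, so here is a minimal standard-library sketch of the idea. It is not the SDK's implementation: MiniActor and LazyProxy are hypothetical names, and the real code relies on lazy_object_proxy.Proxy. The sketch only shows how one shared object can create a default instance on first use and then forward to whichever instance is currently active.

```python
import asyncio
from typing import Any, Callable


class MiniActor:
    """Hypothetical stand-in for `_ActorType`; not the SDK's class."""

    def __init__(self, name: str = 'default') -> None:
        self.name = name
        self.active = False

    async def __aenter__(self) -> 'MiniActor':
        self.active = True
        Actor._target = self  # the shared proxy now forwards to this instance
        return self

    async def __aexit__(self, *exc_info: object) -> None:
        self.active = False

    def is_at_home(self) -> bool:
        # The real SDK inspects its configuration; this sketch always says no.
        return False


class LazyProxy:
    """Forward attribute access to a lazily created, replaceable target."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory
        self._target: Any = None

    @property
    def wrapped(self) -> Any:
        if self._target is None:
            self._target = self._factory()  # created on first use only
        return self._target

    def __getattr__(self, name: str) -> Any:
        # Only called for attributes the proxy itself does not define.
        return getattr(self.wrapped, name)

    async def __aenter__(self) -> Any:
        # Dunder methods are looked up on the type, so forward them explicitly.
        return await self.wrapped.__aenter__()

    async def __aexit__(self, *exc_info: object) -> None:
        await self.wrapped.__aexit__(*exc_info)


Actor = LazyProxy(MiniActor)


async def main() -> str:
    explicit = MiniActor(name='custom')
    async with explicit:
        # Inside the block, the shared proxy resolves to `explicit`.
        return Actor.name
```

Running asyncio.run(main()) returns the name of the explicitly created instance, which shows the target swap. Calling Actor.is_at_home() on a fresh proxy would instead create a default MiniActor on demand.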

Storage Clients (src/apify/storage_clients/)

Four storage client implementations, all implementing Crawlee's abstract storage client interface:

  • _apify/: ApifyStorageClient talks to the Apify API for dataset, key-value store, and request queue operations (separate sub-clients for single vs. shared request queues). Used when running on the Apify platform.
  • _file_system/: FileSystemStorageClient (alias ApifyFileSystemStorageClient) extends Crawlee's file system client with Apify-specific key-value store behavior.
  • _smart_apify/: SmartApifyStorageClient is a hybrid client that writes to both API and local file system for resilience.
  • MemoryStorageClient — re-exported from Crawlee for in-memory storage.
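
To make the "one interface, several backends" idea above concrete, here is a simplified sketch. The names KeyValueClient, MemoryClient, FileSystemClient, and save_input are hypothetical; Crawlee's actual storage client interface is asynchronous and considerably larger, so treat this as an illustration of the pattern only.

```python
from __future__ import annotations

import json
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any


class KeyValueClient(ABC):
    """Hypothetical, simplified interface (the real one is async and larger)."""

    @abstractmethod
    def set_value(self, key: str, value: Any) -> None:
        """Store a JSON-serializable value under `key`."""

    @abstractmethod
    def get_value(self, key: str) -> Any | None:
        """Return the stored value, or None when the key is missing."""


class MemoryClient(KeyValueClient):
    """Keeps values in a dict; nothing survives the process."""

    def __init__(self) -> None:
        self._data: dict[str, Any] = {}

    def set_value(self, key: str, value: Any) -> None:
        self._data[key] = value

    def get_value(self, key: str) -> Any | None:
        return self._data.get(key)


class FileSystemClient(KeyValueClient):
    """Writes each value to `<directory>/<key>.json`."""

    def __init__(self, directory: Path) -> None:
        self._directory = directory
        directory.mkdir(parents=True, exist_ok=True)

    def set_value(self, key: str, value: Any) -> None:
        path = self._directory / f'{key}.json'
        path.write_text(json.dumps(value), encoding='utf-8')

    def get_value(self, key: str) -> Any | None:
        path = self._directory / f'{key}.json'
        if not path.exists():
            return None
        return json.loads(path.read_text(encoding='utf-8'))


def save_input(client: KeyValueClient, payload: dict[str, Any]) -> None:
    # Calling code depends only on the interface, not on a concrete backend.
    client.set_value('INPUT', payload)
```

Because save_input accepts any KeyValueClient, swapping the in-memory backend for the file-based one requires no change to the calling code.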

Storages (src/apify/storages/)

Re-exports Crawlee's Dataset, KeyValueStore, and RequestQueue classes.

Events (src/apify/events/)

  • _apify_event_manager.py: ApifyEventManager extends Crawlee's event system with platform-specific events received via WebSocket connection.
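
The following sketch shows the general shape of dispatching incoming messages to registered listeners. It is not the SDK's or Crawlee's event manager: MiniEventManager is a hypothetical class, the WebSocket transport is omitted, and the message shape ({"name": ..., "data": ...}) is an assumption made for this example.

```python
import asyncio
import json
from collections import defaultdict
from typing import Any, Awaitable, Callable

Listener = Callable[[Any], Awaitable[None]]


class MiniEventManager:
    """Hypothetical dispatcher; the transport layer is left out."""

    def __init__(self) -> None:
        self._listeners: dict[str, list[Listener]] = defaultdict(list)

    def on(self, event: str, listener: Listener) -> None:
        """Register an async listener for the given event name."""
        self._listeners[event].append(listener)

    async def handle_message(self, raw: str) -> None:
        """Decode one raw message and call every listener for its event."""
        # Assumed message shape: {"name": "<event>", "data": <payload>}.
        message = json.loads(raw)
        for listener in self._listeners.get(message['name'], []):
            await listener(message.get('data'))
```

Messages for events without listeners are decoded and then silently ignored, which keeps the dispatcher tolerant of event names it does not know.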

Request Loaders (src/apify/request_loaders/)

  • _apify_request_list.py: ApifyRequestList creates request lists from Actor input URLs (supports both direct URLs and "requests from URL" sources).
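
The two kinds of sources mentioned above can be sketched as follows. This is not ApifyRequestList itself: split_sources and extract_urls are hypothetical helpers, the input keys url and requestsFromUrl are assumed for illustration, and downloading the remote list is left out so the example stays offline.

```python
import re

# Deliberately simple pattern; a production matcher would be stricter.
URL_PATTERN = re.compile(r'https?://[^\s"\'<>]+')


def split_sources(sources: list[dict]) -> tuple[list[str], list[str]]:
    """Separate direct URLs from remote lists that must be downloaded first."""
    direct: list[str] = []
    remote: list[str] = []
    for source in sources:
        if 'requestsFromUrl' in source:
            remote.append(source['requestsFromUrl'])
        elif 'url' in source:
            direct.append(source['url'])
    return direct, remote


def extract_urls(text: str) -> list[str]:
    """Pull URLs out of the body of a downloaded remote list."""
    return URL_PATTERN.findall(text)
```

A caller would enqueue the direct URLs as they are, fetch each remote entry, and pass the response body to extract_urls to obtain the remaining requests.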

Scrapy Integration (src/apify/scrapy/)

Optional integration (apify[scrapy] extra) providing Scrapy scheduler, middlewares, pipelines, and extensions for running Scrapy spiders as Apify Actors.

Key Dependencies

  • crawlee — Base framework providing storage abstractions, event system, configuration, service locator pattern
  • apify-client — HTTP client for the Apify API (ApifyClientAsync)
  • apify-shared — Shared constants and utilities (ApifyEnvVars, ActorEnvVars, etc.)

Testing

Three test levels in tests/:

  • unit/ — Fast tests with no external dependencies. Use mocked API clients (ApifyClientAsyncPatcher fixture). Run with uv run poe unit-tests.
  • integration/ — Tests making real Apify API calls but not deploying Actors. Requires APIFY_TEST_USER_API_TOKEN. Run with uv run poe integration-tests.
  • e2e/ — Full end-to-end tests that build and deploy Actors on the platform. Slowest. Requires APIFY_TEST_USER_API_TOKEN. Use make_actor and run_actor fixtures. Run with uv run poe e2e-tests.

All test levels use pytest-asyncio with asyncio_mode = "auto" (no need for @pytest.mark.asyncio). Tests run in parallel via pytest-xdist (--numprocesses). Each test gets isolated state via the autouse _isolate_test_environment fixture which resets Actor, service_locator, and AliasResolver state. Conftest files live in each subdirectory (tests/unit/conftest.py, etc.) — there is no top-level tests/conftest.py.
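
With asyncio_mode = "auto", an async test is written as a plain coroutine function. The sketch below shows the shape of such a test; fetch_status is a hypothetical coroutine standing in for real code under test.

```python
import asyncio


async def fetch_status() -> str:
    """Hypothetical coroutine under test."""
    await asyncio.sleep(0)  # yield to the event loop once
    return 'READY'


async def test_fetch_status_returns_ready() -> None:
    # No @pytest.mark.asyncio decorator is needed in auto mode.
    assert await fetch_status() == 'READY'
```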

Key Test Fixtures

  • apify_client_async_patcher (unit): ApifyClientAsyncPatcher instance for mocking ApifyClientAsync methods. It patches by method/submethod and tracks call history in .calls.
  • make_httpserver/httpserver (unit) — session-scoped HTTPServer via pytest-httpserver for HTTP interception.
  • apify_client_async (integration/e2e) — real ApifyClientAsync using APIFY_TEST_USER_API_TOKEN.
  • make_actor (e2e) — creates a temporary Actor on the platform from a function, main_py string, or source files dict; cleans up after the session.
  • run_actor (e2e) — calls an Actor and waits up to 10 minutes for completion.
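
The patcher fixture's own API is not shown in this file, so the sketch below illustrates the same idea (mock a client method, then inspect the recorded calls) using only unittest.mock from the standard library. The names start_run and make_fake_client are hypothetical, and the mocked call chain client.actor(...).start() is an assumption about the code under test.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def start_run(client, actor_id: str) -> str:
    """Hypothetical code under test: start an Actor and return the run ID."""
    run = await client.actor(actor_id).start()
    return run['id']


def make_fake_client(run_id: str) -> MagicMock:
    """Build a fake client whose `actor(...).start()` resolves to a run dict."""
    client = MagicMock()
    client.actor.return_value.start = AsyncMock(return_value={'id': run_id})
    return client
```

In a test, call asyncio.run(start_run(client, 'my-actor')) and then assert on client.actor.call_args or use assert_awaited_once() on the mocked start method to verify the interaction.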