
Conversation

Collaborator

@Mantisus Mantisus commented Mar 7, 2025

Description

  • Add the use_table_logs parameter that allows disabling tables in logs. This makes log parsing easier when needed.

Issues

Collaborator Author

Mantisus commented Mar 7, 2025

Thinking about this task, I believe we shouldn't add any third-party logger. A flag that disables tables in logs is sufficient, since tables make parsing the data difficult.

This will allow users to use any logger that's compatible with the standard one and enables customization of log output.

Example for loguru

import inspect
import logging

from loguru import logger

from crawlee.crawlers import BeautifulSoupCrawler

class InterceptHandler(logging.Handler):
    def emit(self, record: logging.LogRecord) -> None:
        # Get corresponding Loguru level if it exists.
        try:
            level: str | int = logger.level(record.levelname).name
        except ValueError:
            level = record.levelno

        # Find the caller from which the logged message originated.
        frame, depth = inspect.currentframe(), 0
        while frame:
            filename = frame.f_code.co_filename
            is_logging = filename == logging.__file__
            is_frozen = 'importlib' in filename and '_bootstrap' in filename
            if depth > 0 and not (is_logging or is_frozen):
                break
            frame = frame.f_back
            depth += 1

        logger.opt(depth=depth, exception=record.exc_info).log(level, record.getMessage())


logger.add('crawler.log', serialize=True, level='INFO')
logging.basicConfig(handlers=[InterceptHandler()], level=logging.INFO, force=True)

crawler = BeautifulSoupCrawler(configure_logging=False, use_table_logs=False)

Log record:

{
    "text": "2025-03-07 16:51:09.947 | INFO     | crawlee.crawlers._basic._basic_crawler:run:580 - Final request statistics: requests_finished: 1; requests_failed: 0; retry_histogram: [1]; request_avg_failed_duration: None; request_avg_finished_duration: 0.795506; requests_finished_per_minute: 73; requests_failed_per_minute: 0; request_total_duration: 0.795506; requests_total: 1; crawler_runtime: 0.818803\n",
    "record": {
        "elapsed": { "repr": "0:00:01.921982", "seconds": 1.921982 },
        "exception": null,
        "extra": {},
        "file": {
            "name": "_basic_crawler.py",
            "path": "/src/crawlee/crawlers/_basic/_basic_crawler.py"
        },
        "function": "run",
        "level": { "icon": "ℹ️", "name": "INFO", "no": 20 },
        "line": 580,
        "message": "Final request statistics: requests_finished: 1; requests_failed: 0; retry_histogram: [1]; request_avg_failed_duration: None; request_avg_finished_duration: 0.795506; requests_finished_per_minute: 73; requests_failed_per_minute: 0; request_total_duration: 0.795506; requests_total: 1; crawler_runtime: 0.818803",
        "module": "_basic_crawler",
        "name": "crawlee.crawlers._basic._basic_crawler",
        "process": { "id": 32118, "name": "MainProcess" },
        "thread": { "id": 139760540858176, "name": "MainThread" },
        "time": {
            "repr": "2025-03-07 16:51:09.947345+00:00",
            "timestamp": 1741366269.947345
        }
    }
}

@janbuchar
Collaborator

I like this take. Could you add this as an example to the docs about setting up JSON logs?

@Mantisus Mantisus requested review from janbuchar and vdusek and removed request for janbuchar and vdusek March 7, 2025 17:27
@Mantisus Mantisus self-assigned this Mar 7, 2025
Collaborator

@Pijukatel Pijukatel left a comment


Nice. Maybe add some tiny test that checks the non-default option (the default option is already covered by existing tests).

Collaborator

@vdusek vdusek left a comment


Nice! Only two minor notes 🙂...

@@ -134,6 +134,11 @@ class _BasicCrawlerOptions(TypedDict):
configure_logging: NotRequired[bool]
"""If True, the crawler will set up logging infrastructure automatically."""

use_table_logs: NotRequired[bool]
Collaborator


I'm thinking, wouldn't it be better to have something like statistics_log_format: Literal["table", "inline"]? I think that most people won't know what a "table log" is...
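As a hedged sketch of what the suggested signature could look like (the class and the formatting helper are illustrative stand-ins, not the real BasicCrawler):

```python
from typing import Literal, get_args

StatisticsLogFormat = Literal['table', 'inline']


class CrawlerSketch:
    """Illustrative stand-in for the proposed parameter, not the real class."""

    def __init__(self, statistics_log_format: StatisticsLogFormat = 'table') -> None:
        if statistics_log_format not in get_args(StatisticsLogFormat):
            raise ValueError(
                f'statistics_log_format must be "table" or "inline", got {statistics_log_format!r}'
            )
        self._statistics_log_format = statistics_log_format

    def format_statistics(self, stats: dict) -> str:
        if self._statistics_log_format == 'inline':
            # Single line, trivial to parse downstream.
            return '; '.join(f'{k}: {v}' for k, v in stats.items())
        # Multi-line, human-readable table-style output.
        width = max(len(k) for k in stats)
        return '\n'.join(f'{k.ljust(width)} | {v}' for k, v in stats.items())
```

A `Literal` type makes both accepted values discoverable in the signature and lets type checkers flag typos, which a bare `bool` flag like `use_table_logs` cannot do.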

Collaborator


I like this idea

Collaborator

@vdusek vdusek left a comment


LGTM - resolve Honza's suggestions before merging

Collaborator

@janbuchar janbuchar left a comment


One nit, feel free to resolve it at will. Otherwise LGTM

@janbuchar janbuchar changed the title feat: Add use_table_logs parameter to control using tables in logs feat: Add statistics_log_format parameter to BasicCrawler.__init__ Mar 17, 2025
@Mantisus Mantisus force-pushed the disable-table-logs branch from 06bfb38 to 40c390f Compare March 17, 2025 20:17
@vdusek vdusek changed the title feat: Add statistics_log_format parameter to BasicCrawler.__init__ feat: Add statistics_log_format parameter to BasicCrawler Mar 18, 2025
@vdusek vdusek added the t-tooling Issues with this label are in the ownership of the tooling team. label Mar 18, 2025
@vdusek vdusek merged commit 635ae4a into apify:master Mar 18, 2025
24 checks passed
Development

Successfully merging this pull request may close these issues.

Add an option for JSON-compatible logs
4 participants