feat: Add `statistics_log_format` parameter to `BasicCrawler` #1061
Conversation
Thinking about this task, I believe we shouldn't add any third-party logger. A flag that disables the tables in logs, which make data parsing difficult, is sufficient. This will allow users to use any logger that's compatible with the standard one and enables customization of log output.

Example for loguru:

```python
import inspect
import logging

from loguru import logger

from crawlee.crawlers import BeautifulSoupCrawler


class InterceptHandler(logging.Handler):
    """Forward standard-library log records to loguru."""

    def emit(self, record: logging.LogRecord) -> None:
        # Get the corresponding Loguru level if it exists.
        try:
            level: str | int = logger.level(record.levelname).name
        except ValueError:
            level = record.levelno

        # Find the caller from where the logged message originated.
        frame, depth = inspect.currentframe(), 0
        while frame:
            filename = frame.f_code.co_filename
            is_logging = filename == logging.__file__
            is_frozen = 'importlib' in filename and '_bootstrap' in filename
            if depth > 0 and not (is_logging or is_frozen):
                break
            frame = frame.f_back
            depth += 1

        logger.opt(depth=depth, exception=record.exc_info).log(level, record.getMessage())


logger.add('crawler.log', serialize=True, level='INFO')
logging.basicConfig(handlers=[InterceptHandler()], level=logging.INFO, force=True)

crawler = BeautifulSoupCrawler(configure_logging=False, use_table_logs=False)
```

Log record:

```json
{
  "text": "2025-03-07 16:51:09.947 | INFO | crawlee.crawlers._basic._basic_crawler:run:580 - Final request statistics: requests_finished: 1; requests_failed: 0; retry_histogram: [1]; request_avg_failed_duration: None; request_avg_finished_duration: 0.795506; requests_finished_per_minute: 73; requests_failed_per_minute: 0; request_total_duration: 0.795506; requests_total: 1; crawler_runtime: 0.818803\n",
  "record": {
    "elapsed": { "repr": "0:00:01.921982", "seconds": 1.921982 },
    "exception": null,
    "extra": {},
    "file": {
      "name": "_basic_crawler.py",
      "path": "/src/crawlee/crawlers/_basic/_basic_crawler.py"
    },
    "function": "run",
    "level": { "icon": "ℹ️", "name": "INFO", "no": 20 },
    "line": 580,
    "message": "Final request statistics: requests_finished: 1; requests_failed: 0; retry_histogram: [1]; request_avg_failed_duration: None; request_avg_finished_duration: 0.795506; requests_finished_per_minute: 73; requests_failed_per_minute: 0; request_total_duration: 0.795506; requests_total: 1; crawler_runtime: 0.818803",
    "module": "_basic_crawler",
    "name": "crawlee.crawlers._basic._basic_crawler",
    "process": { "id": 32118, "name": "MainProcess" },
    "thread": { "id": 139760540858176, "name": "MainThread" },
    "time": {
      "repr": "2025-03-07 16:51:09.947345+00:00",
      "timestamp": 1741366269.947345
    }
  }
}
```
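Since `serialize=True` makes loguru write one JSON object per line, downstream tooling can load the log back into structured data with nothing but the standard library. A minimal sketch (using abridged sample lines instead of a real `crawler.log`, so the field set here is illustrative):

```python
import json

# Two sample serialized lines, shaped like the record above (fields abridged).
log_lines = [
    json.dumps({'record': {'level': {'name': 'INFO'},
                           'message': 'Final request statistics: requests_total: 1'}}),
    json.dumps({'record': {'level': {'name': 'WARNING'},
                           'message': 'Retrying request'}}),
]

# Parse each line independently and filter by level.
records = [json.loads(line)['record'] for line in log_lines]
warnings = [r['message'] for r in records if r['level']['name'] == 'WARNING']
print(warnings)  # ['Retrying request']
```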
I like this take. Could you add this as an example to the docs about setting up JSON logs?
Nice. Maybe add some tiny test that checks the non-default option (the default option is already covered by existing tests).
Nice! Only two minor notes 🙂...
```diff
@@ -134,6 +134,11 @@ class _BasicCrawlerOptions(TypedDict):
     configure_logging: NotRequired[bool]
     """If True, the crawler will set up logging infrastructure automatically."""

+    use_table_logs: NotRequired[bool]
```
I'm thinking, wouldn't it be better to have something like `statistics_log_format: Literal["table", "inline"]`? I think that most people won't know what a "table log" is...
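A rough sketch of what such a parameter could select between (the function and field names here are illustrative, not the actual crawlee implementation):

```python
from typing import Literal


def format_statistics(
    stats: dict[str, object],
    statistics_log_format: Literal['table', 'inline'] = 'table',
) -> str:
    """Render statistics as an aligned table or as a single parse-friendly line."""
    if statistics_log_format == 'inline':
        # One line; trivial for log processors to parse.
        return '; '.join(f'{key}: {value}' for key, value in stats.items())
    # Simple two-column table for human readers.
    width = max(len(key) for key in stats)
    return '\n'.join(f'| {key.ljust(width)} | {value} |' for key, value in stats.items())


stats = {'requests_finished': 1, 'requests_failed': 0}
print(format_statistics(stats, 'inline'))
# requests_finished: 1; requests_failed: 0
```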
I like this idea
Co-authored-by: Vlada Dusek <[email protected]>
LGTM - resolve Honza's suggestions before merging
One nit, feel free to resolve it at will. Otherwise LGTM
Changed the title: `use_table_logs` parameter to control using tables in logs → `statistics_log_format` parameter to `BasicCrawler.__init__`

Force-pushed from 06bfb38 to 40c390f

Changed the title: `statistics_log_format` parameter to `BasicCrawler.__init__` → `statistics_log_format` parameter to `BasicCrawler`
Description

Adds a `use_table_logs` parameter that allows disabling tables in logs. This makes log parsing easier when needed.

Issues