feat(docker-images): Extrapolate image build flow from benchmark runs; Move configuration files into project sources. #17
base: main
Conversation
Walkthrough

Switches docs to a dataset- and Docker-centric workflow, adds Docker image build scripts and helpers, introduces centralized project path constants, registers a docker-images Taskfile, and clears the contents of two existing Dockerfiles.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Runner as Task Runner
participant Taskfile as taskfiles/docker-images/main.yaml
participant BuildPy as build.py
participant Utils as utils.py
participant Docker as Docker Engine
Runner->>Taskfile: invoke build (parallel)
loop per service in G_BENCHMARK_DOCKER_SERVICES
Taskfile->>BuildPy: run --service-name <service>
BuildPy->>Utils: validate_service_name(service)
alt valid
Utils-->>BuildPy: ok
BuildPy->>Utils: get_image_name(service)
Utils-->>BuildPy: image_tag
BuildPy->>BuildPy: locate Dockerfile at CONFIG_DIR/docker-images/service/Dockerfile
alt Dockerfile found
BuildPy->>Docker: docker build -t image_tag -f <Dockerfile> <PACKAGE_ROOT>
Docker-->>BuildPy: build result
BuildPy-->>Taskfile: exit 0
else missing
BuildPy-->>Taskfile: raise RuntimeError
end
else invalid
Utils-->>BuildPy: raise ValueError
BuildPy-->>Taskfile: error
end
end
Taskfile-->>Runner: complete
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (2 passed)
Actionable comments posted: 3
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (11)
- README.md (1 hunks)
- assets/overhead_test/Dockerfile (0 hunks)
- assets/template/Dockerfile (0 hunks)
- pyproject.toml (1 hunks)
- src/log_archival_bench/scripts/docker_images/__init__.py (1 hunks)
- src/log_archival_bench/scripts/docker_images/build.py (1 hunks)
- src/log_archival_bench/scripts/docker_images/utils.py (1 hunks)
- src/log_archival_bench/utils/__init__.py (1 hunks)
- src/log_archival_bench/utils/path_utils.py (1 hunks)
- taskfile.yaml (1 hunks)
- taskfiles/docker-images/main.yaml (1 hunks)
💤 Files with no reviewable changes (2)
- assets/template/Dockerfile
- assets/overhead_test/Dockerfile
🧰 Additional context used
🧬 Code graph analysis (1)
src/log_archival_bench/scripts/docker_images/build.py (2)
src/log_archival_bench/scripts/docker_images/utils.py (1)
- get_image_name (6-12)

src/log_archival_bench/utils/path_utils.py (3)
- get_config_dir (24-26)
- get_package_root (14-16)
- which (29-41)
🪛 Ruff (0.13.3)
src/log_archival_bench/scripts/docker_images/build.py
53-53: subprocess call: check for execution of untrusted input
(S603)
65-65: subprocess call: check for execution of untrusted input
(S603)
🔇 Additional comments (10)
src/log_archival_bench/scripts/docker_images/__init__.py (1)
1-1: LGTM! Standard package initializer with an appropriate docstring.
pyproject.toml (1)
27-28: LGTM! The mypy configuration correctly supports the new package structure under `src/`. The `explicit_package_bases` and `mypy_path` settings enable proper type checking for the newly introduced modules.

src/log_archival_bench/utils/__init__.py (1)

1-1: LGTM! Standard package initializer with an appropriate docstring.
taskfile.yaml (2)
8-8: LGTM! Appropriately includes the new docker-images taskfile to enable the image build workflow.

12-12: LGTM! The `G_PROJECT_SRC_DIR` variable correctly points to the source directory for the Docker image build context.

README.md (1)

10-48: LGTM! The documentation updates clearly explain the new Docker-based workflow, including dataset download, image building (both concurrent and single-engine), and execution. The shift from virtual environment setup to Docker environments aligns well with the PR objectives.
src/log_archival_bench/scripts/docker_images/utils.py (1)
6-12: LGTM! The `get_image_name` function provides a clean utility for generating consistent Docker image tags. The fallback to `"clp-user"` appropriately handles environments where the `USER` variable might not be set.
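For reference, the helper is essentially a one-liner around the tag format; a sketch consistent with the implementation quoted later in this thread:

```python
import os


def get_image_name(service_name: str) -> str:
    """
    :param service_name:
    :return: The name assigned to the Docker image that contains the service.
    """
    # Fall back to "clp-user" when the USER environment variable is unset.
    user = os.getenv("USER", "clp-user")
    return f"log-archival-bench-{service_name}-ubuntu-jammy:dev-{user}"
```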
src/log_archival_bench/scripts/docker_images/build.py (1)

17-67: LGTM! The build logic is well-structured:

- Properly validates Dockerfile existence before building
- Uses the `which()` helper to safely locate the Docker executable
- Handles optional metadata dumping with appropriate error checking
- The subprocess calls are safe despite Ruff S603 warnings (the script is designed for developer/CI use with controlled inputs)
src/log_archival_bench/utils/path_utils.py (2)
8-26: LGTM! The path derivation logic is correct:

- `_PACKAGE_ROOT` correctly resolves to the package directory via `__file__.parent`
- `_BUILD_DIR` and `_CONFIG_DIR` are appropriately derived relative to the package root
- The getter functions provide clean public interfaces to these paths

29-41: LGTM! The `which()` helper provides robust executable location with clear error handling. Raising `RuntimeError` with a descriptive message when the binary is not found is appropriate for this use case.
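A rough sketch of the module's shape as described above, assuming `which()` wraps `shutil.which`; the exact directory names and the number of `.parent` hops are assumptions:

```python
"""Centralized project path constants and helpers (illustrative sketch)."""

import shutil
from pathlib import Path

# Assumed location: src/log_archival_bench/utils/path_utils.py, so two hops up
# reach the package root.
_PACKAGE_ROOT = Path(__file__).resolve().parent.parent
_BUILD_DIR = _PACKAGE_ROOT / "build"
_CONFIG_DIR = _PACKAGE_ROOT / "config"


def get_package_root() -> Path:
    """:return: The root directory of the package."""
    return _PACKAGE_ROOT


def get_config_dir() -> Path:
    """:return: The directory containing configuration files."""
    return _CONFIG_DIR


def which(binary_name: str) -> Path:
    """
    :param binary_name:
    :return: The absolute path of the binary.
    :raise RuntimeError: If the binary cannot be found on PATH.
    """
    binary_path = shutil.which(binary_name)
    if binary_path is None:
        raise RuntimeError(f"Unable to locate executable: {binary_name}")
    return Path(binary_path)
```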
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
- src/log_archival_bench/scripts/docker_images/build.py (1 hunks)
- src/log_archival_bench/utils/path_utils.py (1 hunks)
- taskfiles/docker-images/main.yaml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/log_archival_bench/scripts/docker_images/build.py (2)
src/log_archival_bench/scripts/docker_images/utils.py (1)
- get_image_name (6-12)

src/log_archival_bench/utils/path_utils.py (3)
- get_config_dir (24-26)
- get_package_root (14-16)
- which (29-41)
🪛 Ruff (0.13.3)
src/log_archival_bench/scripts/docker_images/build.py
53-53: subprocess call: check for execution of untrusted input
(S603)
65-65: subprocess call: check for execution of untrusted input
(S603)
🔇 Additional comments (4)
src/log_archival_bench/utils/path_utils.py (1)
1-41: LGTM! Clean implementation of path utilities. The module provides well-structured path helpers with appropriate error handling. The use of `Path` objects and validation in `which()` ensures robust path resolution.

taskfiles/docker-images/main.yaml (1)
1-76: LGTM! Task configuration is well-structured. The checksum validation has been correctly moved from `sources` to `deps`, resolving the previous parsing issue. The parallel build orchestration and incremental rebuild detection via checksums align well with the PR objectives.

src/log_archival_bench/scripts/docker_images/build.py (2)
43-53: Docker build implementation is correct. The subprocess call is safe given that `docker_bin` is resolved via `which()` and other arguments are constructed from validated or trusted sources. The static analysis warning (S603) is a false positive in this context, assuming the `engine_name` validation suggested earlier is implemented.

Note: This approval assumes the `engine_name` input validation is added as suggested in the earlier comment.

55-65: LGTM! Config dump implementation follows best practices. The optional JSON metadata dump correctly creates parent directories and uses a context manager for file operations. The subprocess call is safe given validated inputs.
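As an illustration, the dump step amounts to capturing `docker inspect` output and writing it under the metadata path; a sketch using the same file-handling shape quoted later in this thread (the function name and argument types are assumptions):

```python
import json
import subprocess
from pathlib import Path


def dump_image_metadata(docker_bin: Path, image_name: str, metadata_file: str) -> None:
    # Capture `docker inspect` output for the freshly built image.
    inspect_proc = subprocess.run(
        [str(docker_bin), "inspect", image_name],
        check=True,
        capture_output=True,
        text=True,
    )
    inspect_json = json.loads(inspect_proc.stdout)

    # Create parent directories and write the metadata with a context manager.
    metadata_file_path = Path(metadata_file)
    metadata_file_path.parent.mkdir(parents=True, exist_ok=True)
    with metadata_file_path.open("w", encoding="utf-8") as f:
        json.dump(inspect_json, f, indent=4)
```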
Eden-D-Zhang
left a comment
A few things.
Co-authored-by: Eden Zhang <[email protected]>
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
- src/log_archival_bench/scripts/docker_images/build.py (1 hunks)
- src/log_archival_bench/utils/path_utils.py (1 hunks)
- taskfiles/docker-images/main.yaml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/log_archival_bench/scripts/docker_images/build.py (2)
src/log_archival_bench/scripts/docker_images/utils.py (1)
- get_image_name (6-12)

src/log_archival_bench/utils/path_utils.py (3)
- get_config_dir (24-26)
- get_package_root (14-16)
- which (29-41)
🪛 Ruff (0.14.1)
src/log_archival_bench/scripts/docker_images/build.py
53-53: subprocess call: check for execution of untrusted input
(S603)
65-65: subprocess call: check for execution of untrusted input
(S603)
🔇 Additional comments (12)
src/log_archival_bench/utils/path_utils.py (4)
1-6: LGTM! Module docstring is clear and imports are appropriate.

8-11: LGTM! Package root and directory constants are correctly derived.

14-26: LGTM! The getter functions are straightforward with clear docstrings.

29-41: LGTM! The `which()` helper properly validates binary existence and provides clear error messages.

taskfiles/docker-images/main.yaml (4)
1-20: LGTM! The configuration structure is clear, and the note explaining the commented Presto-related images addresses previous review feedback.

22-35: LGTM! Task dependencies and initialisation logic are correctly structured.

37-45: LGTM! Parallel build orchestration using Task's `for` loop is correctly implemented.

47-77: LGTM! The task structure is correct with proper use of `sources`, `deps`, and `cmds`. The past issue regarding checksum validation placement has been resolved.

src/log_archival_bench/scripts/docker_images/build.py (4)
1-14: LGTM! Shebang, docstring, and imports are appropriate.

36-41: LGTM! Image name resolution and Dockerfile validation are correctly implemented. Once the allowlist validation (flagged in lines 24-34) is added, this serves as a good secondary check for configuration errors.

43-53: LGTM! Docker build command construction and execution are correct. All inputs are validated or derived from controlled sources.

55-71: LGTM! Optional configuration dump logic is correctly implemented with proper file handling and directory creation. The main guard follows Python conventions.
…eprocessing. Internal for actual docker build run cmds
```python
build_cmds = [
    docker_bin,
    "build",
    "--tag", image_name,
    str(get_package_root()),
    "--file", docker_file_path,
]
```
I might be missing some context. Why do we want to separate native/build and build?
build does a bunch of preprocessing and argument validation, while native/build calls the actual docker cmds for building. It may seem redundant now but I think it will be helpful in the future.
if that's the case, how about calling a python function rather than

```python
build_cmd = [
    which("python3"),
    "-m", "log_archival_bench.scripts.docker_images.native.build",
    "--image-name", get_image_name(engine_name),
    "--docker-file-path", str(docker_file_path),
]
```
For example, if the user isn't on Ubuntu/Debian, only `python` (not `python3`) may be available; and even on Ubuntu, if someone uses uv to run the script, this line will call the system python3.

The best solution is to simply call a native Python function rather than proxying the call through `subprocess.run`.
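Concretely, the suggestion amounts to importing the native builder and calling it in-process instead of shelling out to an interpreter; a sketch, where the keyword signature of `native_build.main()` is an assumption:

```python
from log_archival_bench.scripts.docker_images.native import build as native_build
from log_archival_bench.scripts.docker_images.utils import get_image_name


def build_image(engine_name: str, docker_file_path: str) -> None:
    # Call the native builder in-process; no dependency on a system `python3` binary.
    # NOTE: the keyword arguments of native_build.main() are assumed for illustration.
    native_build.main(
        image_name=get_image_name(engine_name),
        docker_file_path=docker_file_path,
    )
```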
Actionable comments posted: 2
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
- src/log_archival_bench/scripts/docker_images/build.py (1 hunks)
- src/log_archival_bench/scripts/docker_images/native/build.py (1 hunks)
- src/log_archival_bench/utils/project_config.py (1 hunks)
- taskfiles/docker-images/main.yaml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/log_archival_bench/scripts/docker_images/build.py (2)
src/log_archival_bench/scripts/docker_images/utils.py (1)
- get_image_name (6-12)

src/log_archival_bench/scripts/docker_images/native/build.py (1)
- main (12-49)
src/log_archival_bench/scripts/docker_images/native/build.py (1)
src/log_archival_bench/scripts/docker_images/build.py (1)
- main (16-60)
🪛 Ruff (0.14.1)
src/log_archival_bench/scripts/docker_images/build.py
58-58: subprocess call: check for execution of untrusted input
(S603)
src/log_archival_bench/scripts/docker_images/native/build.py
40-40: subprocess call: check for execution of untrusted input
(S603)
47-47: subprocess call: check for execution of untrusted input
(S603)
🔇 Additional comments (7)
src/log_archival_bench/scripts/docker_images/build.py (2)
35-38: LGTM! Engine validation properly implemented. The explicit allowlist validation addresses the path traversal concerns raised in previous reviews. The error message provides clear feedback when an invalid engine is specified.

58-58: Static analysis false positive. The S603 warning about untrusted subprocess input is a false positive. The command invokes an internal Python module (`log_archival_bench.scripts.docker_images.native.build`) with validated arguments: `engine_name` was validated against the allowlist (lines 35-38), and `docker_file_path` was verified to exist (lines 40-43).

src/log_archival_bench/utils/project_config.py (1)
1-11: LGTM! Clean configuration module. The path constants are properly defined using `pathlib.Path`, and the module provides a clear, centralized location for project-level paths.

src/log_archival_bench/scripts/docker_images/native/build.py (2)
31-40: LGTM! Docker build execution is correct. The build command construction and execution are proper. The S603 static analysis warning is a false positive—the subprocess executes `docker build` with validated arguments passed from the wrapper script.

42-47: LGTM! Metadata dump implementation is correct. The optional metadata dump properly creates directories, uses explicit UTF-8 encoding, and safely redirects `docker inspect` output. The S603 static analysis warning is a false positive—`image_name` was validated by the wrapper script.

taskfiles/docker-images/main.yaml (2)
1-46: LGTM! Well-structured task workflow. The taskfile properly orchestrates parallel Docker image builds with clear separation of concerns. The note explaining disabled Presto images (line 16) addresses the previous review feedback.

47-78: LGTM! Checksum-based rebuild detection properly implemented. The task correctly addresses previous review feedback:

- Checksum validation is in `deps` (lines 63-66), not `sources`, ensuring proper Task parsing
- Dependencies include `utils/**/*` (line 59) to detect changes in utility modules
- The workflow validates existing checksums before building and computes new checksums after successful builds
hoophalab
left a comment
one more comment
Actionable comments posted: 2
♻️ Duplicate comments (1)
src/log_archival_bench/scripts/docker_images/build.py (1)
28-31: Add a synchronization reminder comment. The `valid_engines` list must be kept in sync with `G_DOCKER_IMAGE_ENGINES` in `taskfiles/docker-images/main.yaml` (lines 12-20). When Presto images are re-enabled, both locations need updating.

Apply this diff to add a reminder comment:

```diff
+    # NOTE: Keep in sync with G_DOCKER_IMAGE_ENGINES in taskfiles/docker-images/main.yaml
     valid_engines = ["clickhouse", "clp", "elasticsearch", "sparksql", "zstandard"]
     if engine_name not in valid_engines:
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
- src/log_archival_bench/scripts/docker_images/build.py (1 hunks)
- src/log_archival_bench/scripts/docker_images/inspect.py (1 hunks)
- taskfiles/docker-images/main.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-21T17:00:54.873Z
Learnt from: hoophalab
PR: y-scope/log-archival-bench#17
File: taskfiles/docker-images/main.yaml:55-60
Timestamp: 2025-10-21T17:00:54.873Z
Learning: In the log-archival-bench project's taskfiles, when tracking Python source files in the `sources` list for Docker image builds, prefer using a glob pattern `{{.G_PROJECT_SRC_DIR}}/**/*` with exclusion `!{{.G_PROJECT_SRC_DIR}}/config/**` rather than manually listing individual Python files, to avoid having to update the taskfile every time Python imports change.
Applied to files:
taskfiles/docker-images/main.yaml
🧬 Code graph analysis (2)
src/log_archival_bench/scripts/docker_images/inspect.py (2)
src/log_archival_bench/scripts/docker_images/utils.py (1)
- get_image_name (6-12)

src/log_archival_bench/scripts/docker_images/build.py (1)
- main (13-49)
src/log_archival_bench/scripts/docker_images/build.py (1)
src/log_archival_bench/scripts/docker_images/utils.py (1)
- get_image_name (6-12)
🪛 Ruff (0.14.1)
src/log_archival_bench/scripts/docker_images/inspect.py
48-48: subprocess call: check for execution of untrusted input
(S603)
src/log_archival_bench/scripts/docker_images/build.py
47-47: subprocess call: check for execution of untrusted input
(S603)
🔇 Additional comments (5)
src/log_archival_bench/scripts/docker_images/build.py (1)
38-49: LGTM - Static analysis warning is a false positive. The subprocess call constructs the Docker build command from validated and trusted inputs:

- `engine_name` is validated against an allowlist (line 29)
- `docker_file_path` is checked for existence (line 34)
- `CONFIG_DIR` and `PACKAGE_ROOT` are trusted constants
- All other arguments are literal strings

The S603 warning can be safely disregarded.
src/log_archival_bench/scripts/docker_images/inspect.py (1)
47-49: Address static analysis warning after adding validation. Once you add the engine name validation (see previous comment), the subprocess call will be safe because `engine_name` will be validated against an allowlist and `get_image_name` will format it safely. The S603 warning will then be a false positive.

However, without validation, this represents a security concern.
taskfiles/docker-images/main.yaml (3)
12-20: LGTM - Engine list aligns with Python scripts. The `G_DOCKER_IMAGE_ENGINES` list correctly matches the `valid_engines` allowlist in both Python scripts. The comment explaining disabled Presto images is helpful.
61-67: Verify sources pattern matches intended rebuild triggers. The sources list uses `{{.G_PROJECT_SRC_DIR}}/**/*.py` (line 62) to track only Python files. The retrieved learning suggests using `{{.G_PROJECT_SRC_DIR}}/**/*` (all files) with exclusions to avoid manual updates when imports change.

However, tracking only `.py` files may be intentional if non-Python source changes (e.g., `.yaml`, `.txt`, `.sh`) in the source directory should not trigger Docker image rebuilds.

Based on learnings

If all source files should trigger rebuilds, apply this diff:

```diff
     sources:
-      - "{{.G_PROJECT_SRC_DIR}}/**/*.py"
+      - "{{.G_PROJECT_SRC_DIR}}/**/*"
       - "{{.TASKFILE}}"
       - exclude: "{{.G_PROJECT_SRC_DIR}}/build/**/*"
       - exclude: "{{.G_PROJECT_SRC_DIR}}/config/**/*"
```
85-107: LGTM - Checksum workflow is well-structured. The checksum validation and computation logic is correct:

- Line 91: Allows `docker inspect` to fail gracefully if the image doesn't exist yet (first build).
- Line 103: Expects `docker inspect` to succeed after the build completes.
- The workflow ensures checksums are validated before builds and computed after successful builds.
Actionable comments posted: 2
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
- taskfiles/docker-images/main.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-21T17:00:54.873Z
Learnt from: hoophalab
PR: y-scope/log-archival-bench#17
File: taskfiles/docker-images/main.yaml:55-60
Timestamp: 2025-10-21T17:00:54.873Z
Learning: In the log-archival-bench project's taskfiles, when tracking Python source files in the `sources` list for Docker image builds, prefer using a glob pattern `{{.G_PROJECT_SRC_DIR}}/**/*` with exclusion `!{{.G_PROJECT_SRC_DIR}}/config/**` rather than manually listing individual Python files, to avoid having to update the taskfile every time Python imports change.
Applied to files:
taskfiles/docker-images/main.yaml
🔇 Additional comments (4)
taskfiles/docker-images/main.yaml (4)
47-86: Verify the INSPECT_CMD invocation includes the `--trim` flag intentionally. Line 58 includes `--trim` in the inspect command. Please confirm that this flag is supported by inspect.py and that trimming metadata is the intended behaviour for checksum validation. If inspect.py doesn't support this flag, the taskfile will fail at runtime.
71-76: Checksum validation structure is sound. The checksum validation is correctly placed in `deps` (line 72) rather than `sources`, which addresses the previous critical issue. The validation task properly chains INSPECT_CMD, CHECKSUM_FILE, and METADATA_FILE variables through the task invocation. The `|| true` on line 93 gracefully handles the case when an image doesn't yet exist.

Also applies to: 87-98
37-45: Parallel build orchestration is well-structured. The `build-all-parallel` task correctly uses Task's `for` loop with `var: G_DOCKER_IMAGE_ENGINES` to invoke `build-single-image` for each engine with the `ENGINE_NAME` variable set. This enables proper parallelization of independent per-engine builds.
62-69: The Dockerfile path pattern on line 69 is correct and requires no changes. The repository contains all Dockerfiles in a flat structure at `src/log_archival_bench/config/docker-images/{{ENGINE_NAME}}.Dockerfile`, which exactly matches the pattern specified on line 69. All active engines (clickhouse, clp, elasticsearch, sparksql, zstandard) and commented-out engines (presto_clp, presto_parquet) have corresponding Dockerfile files in this location. The original concern about potential subdirectory structure was unfounded.

Likely an incorrect or invalid review comment.
The code looks good, but I have a few high-level concerns:

1. Is checksumming a Docker image necessary, given that images are generally immutable? `docker build` doesn't rebuild an image that is already built. We also don't checksum Docker images in the clp repo.
2. If we really want to checksum images, is docker inspect a reliable source for a checksum? Can we use the image ID?

Additionally, the README isn't the best, but we should improve that after everything is done.
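If the image ID route is taken, it can be read with `docker inspect` and a Go-template format string, e.g. (a minimal sketch; it assumes `docker` is on PATH and the helper name is illustrative):

```python
import subprocess


def get_image_id(image_name: str) -> str:
    """Return the content-addressable image ID (sha256 digest) of a local image."""
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.Id}}", image_name],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```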
taskfiles/docker-images/main.yaml
Outdated
| - "{{.G_PROJECT_SRC_DIR}}/**/*" | ||
| - "{{.ROOT_DIR}}/Taskfile.yaml" | ||
| - "{{.ROOT_DIR}}/taskfiles/**/*.yaml" | ||
| - "{{.ROOT_DIR}}/tools/yscope-dev-utils/exports/taskfiles/**/*.yaml" |
- "{{.ROOT_DIR}}/tools/yscope-dev-utils/exports/taskfiles/**/*.yaml"
We might need to confirm with Kirk whether this is needed.
```python
metadata_file_path = Path(metadata_file)
metadata_file_path.parent.mkdir(parents=True, exist_ok=True)
with metadata_file_path.open("w", encoding="utf-8") as f:
    json.dump(inspect_json, f, indent=4)
```
If we want to use inspect output as the checksum source, we might need to sort the keys. The consistent order of enumerating dictionary keys is an implementation detail of the interpreter
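For instance, passing `sort_keys=True` to `json.dump` makes the serialized output independent of key insertion order; a minimal sketch reusing the shape of the snippet above (the wrapper function name is illustrative):

```python
import json
from pathlib import Path
from typing import Any


def write_metadata(metadata_file_path: Path, inspect_json: Any) -> None:
    metadata_file_path.parent.mkdir(parents=True, exist_ok=True)
    with metadata_file_path.open("w", encoding="utf-8") as f:
        # sort_keys keeps the file byte-for-byte stable across runs, so it is a
        # reliable input for checksum computation.
        json.dump(inspect_json, f, indent=4, sort_keys=True)
```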
Actionable comments posted: 1
♻️ Duplicate comments (1)
taskfiles/docker-images/main.yaml (1)
16-18: Consider adding a tracking reference for Presto engine fixes. The note explains the Presto engines are pending fixes, but there's no reference to a tracking issue or milestone. Adding a link (e.g., "pending fixes in #123") would help maintainers know where to follow progress and when to re-enable these entries.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
- src/log_archival_bench/scripts/docker_images/inspect.py (1 hunks)
- taskfiles/docker-images/main.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-21T17:00:54.873Z
Learnt from: hoophalab
PR: y-scope/log-archival-bench#17
File: taskfiles/docker-images/main.yaml:55-60
Timestamp: 2025-10-21T17:00:54.873Z
Learning: In the log-archival-bench project's taskfiles, when tracking Python source files in the `sources` list for Docker image builds, prefer using a glob pattern `{{.G_PROJECT_SRC_DIR}}/**/*` with exclusion `!{{.G_PROJECT_SRC_DIR}}/config/**` rather than manually listing individual Python files, to avoid having to update the taskfile every time Python imports change.
Applied to files:
taskfiles/docker-images/main.yaml
🧬 Code graph analysis (1)
src/log_archival_bench/scripts/docker_images/inspect.py (2)
src/log_archival_bench/scripts/docker_images/utils.py (2)
- get_image_name (6-12)
- validate_engine_name (15-24)

src/log_archival_bench/scripts/docker_images/build.py (1)
- main (13-46)
🪛 Ruff (0.14.1)
src/log_archival_bench/scripts/docker_images/inspect.py
52-52: subprocess call: check for execution of untrusted input
(S603)
🔇 Additional comments (4)
src/log_archival_bench/scripts/docker_images/inspect.py (2)
44-52: JSON key ordering concern is mitigated by current usage. The Taskfile invokes this script with `--id-only` (line 67 in `taskfiles/docker-images/main.yaml`), so only the stable image ID digest is written to the metadata file, making JSON key-ordering irrelevant for checksum computation. However, if future callers omit `--id-only`, the full JSON output could have non-deterministic key ordering across Python versions or implementations.

Based on the discussion in past review comments, the current implementation is correct for the intended usage.
52-52: Static analysis warning is a false positive. The S603 warning about untrusted subprocess input is a false positive. The `engine_name` is validated against an allowlist at line 42 via `validate_engine_name`, and `get_image_name` constructs a safe image tag. The command is built as a list (not `shell=True`), preventing shell injection.

taskfiles/docker-images/main.yaml (2)
70-78: Excellent use of glob-based source tracking. The sources list follows the project learning to use `{{.G_PROJECT_SRC_DIR}}/**/*` with strategic exclusions, avoiding tedious manual updates when Python imports change. The explicit re-inclusion of the engine-specific Dockerfile (line 78) ensures per-engine dependencies are properly tracked.

Based on learnings.
74-74: Verify necessity of yscope-dev-utils taskfile dependency. Including `{{.ROOT_DIR}}/tools/yscope-dev-utils/exports/taskfiles/**/*.yaml` in sources means changes to utility taskfiles will trigger image rebuilds. This is reasonable if those taskfiles affect build behaviour, but hoophalab's past comment suggests confirming with Kirk whether this dependency is necessary.

Based on past review comments.
taskfiles/docker-images/main.yaml
Outdated
```yaml
# Task may fail if the image does not exist.
- "{{.INSPECT_CMD}} || rm -f '{{.METADATA_FILE}}'"
```
🧹 Nitpick | 🔵 Trivial
Clarify error-handling scope in validation task.
Line 115 uses `{{.INSPECT_CMD}} || rm -f '{{.METADATA_FILE}}'`, which removes the metadata file on any inspect failure (docker daemon down, permissions, missing image, etc.), not just "image does not exist" as the comment suggests. This broad catch-all might be intentional (any failure should trigger rebuild), but the comment could be more precise.
Consider updating the comment to reflect that any inspect failure—not just a missing image—will trigger metadata removal and subsequent rebuild.
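If the narrower behaviour is preferred, the check can look specifically for a missing image instead of treating every inspect failure the same way; a sketch in Python (the exact error text Docker prints may vary across versions, so the substring match is an assumption):

```python
import subprocess
from pathlib import Path


def remove_metadata_if_image_missing(image_name: str, metadata_file: Path) -> None:
    proc = subprocess.run(
        ["docker", "inspect", image_name],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        # Only treat "image not found" as a reason to force a rebuild; surface other
        # failures (daemon down, permissions, ...) instead of silently removing metadata.
        if "no such" in proc.stderr.lower():
            metadata_file.unlink(missing_ok=True)
        else:
            raise RuntimeError(f"docker inspect failed: {proc.stderr.strip()}")
```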
🤖 Prompt for AI Agents
In taskfiles/docker-images/main.yaml around lines 114 to 115, the comment "Task
may fail if the image does not exist." is misleading because the command
`{{.INSPECT_CMD}} || rm -f '{{.METADATA_FILE}}'` will remove the metadata file
on any inspect failure (docker daemon down, permissions, network, missing image,
etc.); update the comment to explicitly state that any inspect failure will
trigger metadata removal and a rebuild, and if you instead want to only remove
metadata for a missing image, replace the command with a more specific check
that inspects exit codes or parses inspect output to detect "no such image"
before removing the metadata.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/log_archival_bench/scripts/docker_images/utils.py (1)
6-12: Consider adding defensive Docker name validation. While the current allowlist ensures all service names are valid Docker image names, adding explicit validation would make this function more robust against future modifications to the service list.

```diff
 def get_image_name(service_name: str) -> str:
     """
     :param service_name:
     :return: The name assigned to the Docker image that contains the service.
     """
+    import re
+    # Docker image names must match [a-z0-9][a-z0-9_.-]*
+    if not re.match(r'^[a-z0-9][a-z0-9_.-]*$', service_name):
+        err_msg = (
+            f"Invalid service name for Docker image: '{service_name}'. "
+            f"Must contain only lowercase letters, digits, hyphens, underscores, and periods."
+        )
+        raise ValueError(err_msg)
     user = os.getenv("USER", "clp-user")
     return f"log-archival-bench-{service_name}-ubuntu-jammy:dev-{user}"
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
- src/log_archival_bench/scripts/docker_images/build.py (1 hunks)
- src/log_archival_bench/scripts/docker_images/utils.py (1 hunks)
- taskfiles/docker-images/main.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: hoophalab
Repo: y-scope/log-archival-bench PR: 17
File: taskfiles/docker-images/main.yaml:55-60
Timestamp: 2025-10-21T17:00:54.873Z
Learning: In the log-archival-bench project's taskfiles, when tracking Python source files in the `sources` list for Docker image builds, prefer using a glob pattern `{{.G_PROJECT_SRC_DIR}}/**/*` with exclusion `!{{.G_PROJECT_SRC_DIR}}/config/**` rather than manually listing individual Python files, to avoid having to update the taskfile every time Python imports change.
📚 Learning: 2025-10-21T17:00:54.873Z
Learnt from: hoophalab
Repo: y-scope/log-archival-bench PR: 17
File: taskfiles/docker-images/main.yaml:55-60
Timestamp: 2025-10-21T17:00:54.873Z
Learning: In the log-archival-bench project's taskfiles, when tracking Python source files in the `sources` list for Docker image builds, prefer using a glob pattern `{{.G_PROJECT_SRC_DIR}}/**/*` with exclusion `!{{.G_PROJECT_SRC_DIR}}/config/**` rather than manually listing individual Python files, to avoid having to update the taskfile every time Python imports change.
Applied to files:
- src/log_archival_bench/scripts/docker_images/build.py
- taskfiles/docker-images/main.yaml
📚 Learning: 2025-09-22T13:50:57.351Z
Learnt from: Bill-hbrhbr
Repo: y-scope/log-archival-bench PR: 12
File: taskfiles/lint/yaml.yaml:15-23
Timestamp: 2025-09-22T13:50:57.351Z
Learning: In projects using Taskfile, submodules are automatically initialized during taskfile compilation, so referenced files in submodules (like tools/yscope-dev-utils/exports/lint-configs/.yamllint.yml) will be available at runtime even if not present in the current repository state.
Applied to files:
taskfiles/docker-images/main.yaml
🧬 Code graph analysis (1)
src/log_archival_bench/scripts/docker_images/build.py (1)
src/log_archival_bench/scripts/docker_images/utils.py (2)
- get_image_name (6-12)
- validate_service_name (15-26)
🪛 Ruff (0.14.2)
src/log_archival_bench/scripts/docker_images/build.py
46-46: subprocess call: check for execution of untrusted input
(S603)
🔇 Additional comments (3)
src/log_archival_bench/scripts/docker_images/build.py (1)
13-48: LGTM! Subprocess call is secure. The implementation correctly validates the `service_name` via the allowlist (line 30) before using it in the Docker build command. The static analysis warning (S603) is a false positive—all subprocess arguments are either string literals, validated inputs, or safe project paths.

The error handling via `check=True` ensures failures propagate appropriately.
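Putting the pieces from this review together, the build flow is roughly as follows; a sketch only, since the module layout and the Dockerfile path convention changed across revisions of this PR, so the two path constants are placeholders:

```python
import subprocess
from pathlib import Path

from log_archival_bench.scripts.docker_images.utils import get_image_name, validate_service_name

# Assumed layout: this sketch lives under scripts/docker_images/, two levels below
# the package root; the Dockerfile location is likewise an assumption.
PACKAGE_ROOT = Path(__file__).resolve().parents[2]
CONFIG_DIR = PACKAGE_ROOT / "config"


def build_image(service_name: str, docker_bin: str = "docker") -> None:
    # Reject anything outside the allowlist before touching the filesystem.
    validate_service_name(service_name)

    docker_file_path = CONFIG_DIR / "docker-images" / service_name / "Dockerfile"
    if not docker_file_path.exists():
        raise RuntimeError(f"Missing Dockerfile: {docker_file_path}")

    build_cmds = [
        docker_bin,
        "build",
        "--tag", get_image_name(service_name),
        str(PACKAGE_ROOT),
        "--file", str(docker_file_path),
    ]
    # check=True propagates a non-zero `docker build` exit status as CalledProcessError.
    subprocess.run(build_cmds, check=True)
```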
taskfiles/docker-images/main.yaml (1)

9-17: Service list is properly synchronized. The `G_DOCKER_IMAGE_SERVICES` list matches the `valid_services` allowlist in `src/log_archival_bench/scripts/docker_images/utils.py` (line 21), ensuring consistent validation across the build pipeline.

src/log_archival_bench/scripts/docker_images/utils.py (1)
15-26: LGTM! Validation is thorough and well-documented. The allowlist validation correctly matches the services defined in `taskfiles/docker-images/main.yaml` (lines 9-17), and the sync reminder comment helps maintain consistency. The error message clearly identifies invalid inputs and lists valid alternatives.
taskfiles/docker-images/main.yaml
Outdated
```yaml
build-single-image:
  internal: true
  label: "{{.TASK}}:{{.SERVICE_NAME}}"
  requires:
    vars: ["SERVICE_NAME"]
  run: "when_changed"
  cmds:
    - "uv run '{{.G_DOCKER_IMAGE_SCRIPT_DIR}}/build.py' --service-name {{.SERVICE_NAME}}"
```
🛠️ Refactor suggestion | 🟠 Major
Consider adding sources for rebuild detection.
The task specifies run: "when_changed" but doesn't define any sources. This means the task will execute on every invocation rather than only when relevant files change.
Based on learnings, consider adding source tracking to enable checksum-based rebuild detection:
```diff
 build-single-image:
   internal: true
   label: "{{.TASK}}:{{.SERVICE_NAME}}"
   requires:
     vars: ["SERVICE_NAME"]
   run: "when_changed"
+  sources:
+    - "{{.G_PROJECT_SRC_DIR}}/**/*"
+    - "!{{.G_PROJECT_SRC_DIR}}/config/**"
+    - "{{.G_PROJECT_SRC_DIR}}/config/docker-images/{{.SERVICE_NAME}}/Dockerfile"
+    - "{{.TASKFILE}}"
   cmds:
     - "uv run '{{.G_DOCKER_IMAGE_SCRIPT_DIR}}/build.py' --service-name {{.SERVICE_NAME}}"
```

This aligns with the PR objective of "checksum-based rebuild detection."
Based on learnings
🤖 Prompt for AI Agents
In taskfiles/docker-images/main.yaml around lines 30 to 37, the task uses run:
"when_changed" but has no sources defined so it will run every time; add a
sources entry listing the files and directories whose checksum should trigger
rebuilds (for example: the service source directory, its Dockerfile, any build
scripts like G_DOCKER_IMAGE_SCRIPT_DIR/build.py, and any shared templates or
config used to build the image) so the task runner can compute checksums and
only run when those sources change; ensure paths use the same templating
variables already present (e.g. {{.SERVICE_NAME}} and
{{.G_DOCKER_IMAGE_SCRIPT_DIR}}) and include both files and relevant directories.
Bill-hbrhbr
left a comment
self-review and comments
README.md
Outdated
```
uv run src/log_archival_bench/scripts/docker_images/build.py --service-name <service_name>
```

Each valid service (e.g. `presto`, `clickhouse`, `elasticsearch`) has its own dedicated Docker
```diff
-Each valid service (e.g. `presto`, `clickhouse`, `elasticsearch`) has its own dedicated Docker
+Each valid service (e.g. `presto_parquet`, `clickhouse`, `elasticsearch`) has its own dedicated Docker
```
README.md
Outdated
```diff
-## Run One Tool
+## Run One Benchmark Engine

-Execute `./assets/{tool name}/main.py {path to <dataset name>.log}` to run ingestion and search on that dataset.
+Execute `./assets/{engine_name}/main.py {path to <dataset_name>.log}` to run ingestion and search on that dataset.
```
maybe no need to touch this block for now.
we can also defer README fixing once we've moved everything into src/log_archival_bench
```toml
explicit_package_bases = true
mypy_path = ["src"]
```
These are needed for mypy linting to work in its current form.
Actionable comments posted: 2
♻️ Duplicate comments (1)
README.md (1)
37-38: Update the example to use a valid service name. The example lists `presto` as a valid service, but according to `src/log_archival_bench/scripts/docker_images/utils.py` (line 21), the valid services are: `clickhouse`, `clp`, `elasticsearch`, `sparksql`, and `zstandard`. The Presto-related services (`presto_clp` and `presto_parquet`) are currently disabled.

Apply this diff to use a valid example:

```diff
-Each valid service (e.g. `presto`, `clickhouse`, `elasticsearch`) has its own dedicated Docker
+Each valid service (e.g. `clickhouse`, `elasticsearch`, `sparksql`) has its own dedicated Docker
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
- README.md (1 hunks)
- src/log_archival_bench/scripts/docker_images/build.py (1 hunks)
- src/log_archival_bench/scripts/docker_images/utils.py (1 hunks)
- taskfiles/docker-images/main.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: hoophalab
Repo: y-scope/log-archival-bench PR: 17
File: taskfiles/docker-images/main.yaml:55-60
Timestamp: 2025-10-21T17:00:54.873Z
Learning: In the log-archival-bench project's taskfiles, when tracking Python source files in the `sources` list for Docker image builds, prefer using a glob pattern `{{.G_PROJECT_SRC_DIR}}/**/*` with exclusion `!{{.G_PROJECT_SRC_DIR}}/config/**` rather than manually listing individual Python files, to avoid having to update the taskfile every time Python imports change.
📚 Learning: 2025-09-30T08:58:51.490Z
Learnt from: Bill-hbrhbr
Repo: y-scope/log-archival-bench PR: 13
File: taskfiles/lint/python.yaml:50-65
Timestamp: 2025-09-30T08:58:51.490Z
Learning: The log-archival-bench repository currently does not have unit tests for the benchmark suite, so the tests directory should not be included in PY_LINT_DIRS in taskfiles/lint/python.yaml.
Applied to files:
README.md
📚 Learning: 2025-10-21T17:00:54.873Z
Learnt from: hoophalab
Repo: y-scope/log-archival-bench PR: 17
File: taskfiles/docker-images/main.yaml:55-60
Timestamp: 2025-10-21T17:00:54.873Z
Learning: In the log-archival-bench project's taskfiles, when tracking Python source files in the `sources` list for Docker image builds, prefer using a glob pattern `{{.G_PROJECT_SRC_DIR}}/**/*` with exclusion `!{{.G_PROJECT_SRC_DIR}}/config/**` rather than manually listing individual Python files, to avoid having to update the taskfile every time Python imports change.
Applied to files:
- taskfiles/docker-images/main.yaml
- src/log_archival_bench/scripts/docker_images/build.py
📚 Learning: 2025-09-22T13:50:57.351Z
Learnt from: Bill-hbrhbr
Repo: y-scope/log-archival-bench PR: 12
File: taskfiles/lint/yaml.yaml:15-23
Timestamp: 2025-09-22T13:50:57.351Z
Learning: In projects using Taskfile, submodules are automatically initialized during taskfile compilation, so referenced files in submodules (like tools/yscope-dev-utils/exports/lint-configs/.yamllint.yml) will be available at runtime even if not present in the current repository state.
Applied to files:
taskfiles/docker-images/main.yaml
🧬 Code graph analysis (1)
src/log_archival_bench/scripts/docker_images/build.py (1)
src/log_archival_bench/scripts/docker_images/utils.py (2)
- get_image_name (6-12)
- validate_service_name (15-26)
🪛 Ruff (0.14.3)
src/log_archival_bench/scripts/docker_images/build.py
46-46: subprocess call: check for execution of untrusted input
(S603)
🔇 Additional comments (2)
src/log_archival_bench/scripts/docker_images/build.py (1)
1-52: LGTM! The implementation is secure and correct. The script properly validates the `service_name` parameter against an allowlist (line 30) before using it to construct paths, checks for Dockerfile existence (lines 32-35), and uses subprocess safely with an array-based command format and validated inputs.

The static analysis warning (S603) about untrusted input is a false positive in this context—all user-controllable values are validated before being passed to subprocess.
src/log_archival_bench/scripts/docker_images/utils.py (1)
1-26: LGTM! Clean and well-structured utility functions. The module provides focused helper functions with proper validation. The allowlist approach in `validate_service_name` prevents path traversal issues, and the sync comment (line 20) helps maintain consistency with the Taskfile configuration.
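For context, given the details in this review (the allowlist, the sync note, and the error message contents), the validator is roughly the following sketch; the constant name and exact wording are assumptions:

```python
# NOTE: Keep in sync with G_DOCKER_IMAGE_SERVICES in taskfiles/docker-images/main.yaml.
_VALID_SERVICES = ["clickhouse", "clp", "elasticsearch", "sparksql", "zstandard"]


def validate_service_name(service_name: str) -> None:
    """
    :param service_name:
    :raise ValueError: If the service name is not in the allowlist.
    """
    if service_name not in _VALID_SERVICES:
        err_msg = (
            f"Invalid service name: '{service_name}'. "
            f"Valid services are: {', '.join(_VALID_SERVICES)}."
        )
        raise ValueError(err_msg)
```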
Description
Previously, each benchmark run tore down and rebuilt the Docker image for its associated service engine, resulting in unnecessary rebuilds of identical images.
Now, Docker image building is a separate preparation step for the benchmark runs, so that future code will directly use the built images instead of repeating the setup -> tear-down process.
The `presto_clp` and `presto_parquet` images are temporarily disabled in this PR because their Dockerfiles are currently incorrect and result in failed builds:

- `presto_clp`: we will likely use the CLP package to spin up benchmark services
- `presto_parquet`: will be fixed in a future PR.

Checklist
breaking change.
Validation performed
- `task docker-images:build` builds the images successfully.

Summary by CodeRabbit
Documentation
New Features
Chores