-
Notifications
You must be signed in to change notification settings - Fork 1
Convert pipeline container to a multi-part build #220
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
…container wait for a rust build if needed
Summary of ChangesHello @afg1, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request fundamentally re-architects the RNAcentral import pipeline's container build process. By transitioning from a single, large Dockerfile to a multi-stage, multi-container approach, the aim is to drastically cut down on build times, reduce the final image footprint, and enhance maintainability by isolating tool dependencies. This change optimizes the development and deployment workflow for the pipeline. Highlights
Ignored Files
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This is an excellent and much-needed refactoring of the pipeline's Docker build process. Converting the monolithic Dockerfile to a multi-stage, multi-container setup is a great improvement for build speed, image size, and maintainability. The use of separate, versioned tool containers and a dedicated Makefile for managing them is a solid approach. My review includes several suggestions to further refine this new structure, focusing on reducing image sizes even more by using smaller base images, improving the robustness of build scripts, and ensuring consistency across the new Makefiles and Dockerfiles.
| # Copy Samtools + HTSlib from tool container | ||
| COPY --from=samtools /usr/local/bin/samtools /usr/local/bin/tabix /usr/local/bin/bgzip /usr/local/bin/ | ||
| COPY --from=samtools /usr/local/lib/libhts* /usr/local/lib/ | ||
| COPY --from=samtools /usr/local/include/htslib /usr/local/include/htslib |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The final runtime image likely does not need the C header files from HTSlib (/usr/local/include/htslib). These are only required for compiling code against the library, not for running applications that are already compiled and dynamically link to it. Removing this COPY command will slightly reduce the image size and complexity.
| docker run --rm $(INFERNAL_IMAGE) cmscan -h > /dev/null | ||
| docker run --rm $(INFERNAL_IMAGE) cmbuild -h > /dev/null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests for infernal redirect the command's standard output to /dev/null. While this confirms the command runs without error, it hides the version or help output, which can be useful for verification. The samtools test, for example, prints the version. For consistency and more informative test runs, consider removing the output redirection.
docker run --rm $(INFERNAL_IMAGE) cmscan -h
docker run --rm $(INFERNAL_IMAGE) cmbuild -h
| docker run --rm $(RUST_UTILS_IMAGE) json2fasta --help > /dev/null | ||
| docker run --rm $(RUST_UTILS_IMAGE) precompute --help > /dev/null | ||
| docker run --rm $(RUST_UTILS_IMAGE) search-export --help > /dev/null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests for rust-utils redirect the command's standard output to /dev/null. While this confirms the command runs without error, it hides the help output, which can be useful for verification. The samtools test, for example, prints the version. For consistency and more informative test runs, consider removing the output redirection.
docker run --rm $(RUST_UTILS_IMAGE) json2fasta --help
docker run --rm $(RUST_UTILS_IMAGE) precompute --help
docker run --rm $(RUST_UTILS_IMAGE) search-export --help
| make install | ||
|
|
||
| # Stage 2: Final - minimal runtime image | ||
| FROM python:3.11.14-trixie |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The final stage of this Dockerfile uses python:3.11.14-trixie as the base image. Since this container's purpose is just to provide the compiled infernal tool, which does not have a Python dependency, you could use a much smaller base image like debian:trixie-slim. This would significantly reduce the size of the final rnacentral/infernal image.
FROM debian:trixie-slim
| RUN cargo build --release && \ | ||
| mkdir /rust-bins && \ | ||
| mv target/release/json2fasta \ | ||
| target/release/split-ena \ | ||
| target/release/expand-urs \ | ||
| target/release/precompute \ | ||
| target/release/search-export \ | ||
| target/release/ftp-export \ | ||
| target/release/json2dfasta \ | ||
| target/release/bed-expander \ | ||
| /rust-bins/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explicitly listing every binary in the mv command is brittle. If a new utility is added to the workspace, it must also be manually added here. A more robust and maintainable approach would be to find and move all executable files from the release directory.
RUN cargo build --release && \
mkdir /rust-bins && \
find target/release -maxdepth 1 -type f -executable -exec mv {} /rust-bins/ \;
| /rust-bins/ | ||
|
|
||
| # Stage 2: Final - minimal runtime image | ||
| FROM python:3.11.14-trixie |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The final stage uses python:3.11.14-trixie as the base image. Since this container just provides pre-compiled Rust binaries, which are dynamically linked against glibc but don't require a Python runtime, you could switch to a smaller base like debian:trixie-slim. This would make the rnacentral/rust-utils image much smaller.
FROM debian:trixie-slim
| make install | ||
|
|
||
| # Stage 2: Final - minimal runtime image with libraries | ||
| FROM python:3.11.14-trixie |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The final stage uses python:3.11.14-trixie as the base image. The runtime dependencies for samtools are just standard C libraries, not Python. Using a smaller base image like debian:trixie-slim and installing the necessary .so libraries would make the final rnacentral/samtools image significantly smaller and more efficient.
FROM debian:trixie-slim
| COPY --from=builder /usr/local/bin/tabix /usr/local/bin/ | ||
| COPY --from=builder /usr/local/bin/bgzip /usr/local/bin/ | ||
| COPY --from=builder /usr/local/lib/libhts* /usr/local/lib/ | ||
| COPY --from=builder /usr/local/include/htslib /usr/local/include/htslib/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pmb59
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@afg1 LTGM overall. My only major comment is that we are not following any container version strategy, only latest https://hub.docker.com/r/rnacentral/rnacentral-import-pipeline/tags
Only using latest (No version history, can't rollback) it's not good. We should tag versions, and pin a specific version into the production environment, unless this approach is not feasible for any reason
| DOCKER_PASSWORD: ${{ secrets.DOCKER_PASSWORD }} | ||
| run: docker login -u $DOCKER_USER -p $DOCKER_PASSWORD | ||
|
|
||
| - name: Build Rust utilities container |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The workflow will continue pushing even if the build fails. Consider adding:
| - name: Build Rust utilities container | |
| - name: Build Rust utilities container | |
| if: success() # Only push if build succeeded |
| - '.github/workflows/rust-container.yaml' | ||
| workflow_dispatch: | ||
| inputs: | ||
| force_rebuild: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The force_rebuild input is defined but never used in the workflow?
| GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} | ||
| run: | | ||
| echo "Waiting for rust-container workflow to complete..." | ||
| sleep 10 # Give workflow time to start |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will 10 seconds be enough?
| --jq '.[0] | "\(.status):\(.conclusion)"') | ||
| if [ -z "$status" ]; then | ||
| echo "No rust-container workflow found yet, waiting..." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| echo "No rust-container workflow found yet, waiting..." | |
| echo "Warning: rust-container workflow not found yet, waiting..." |
| # These containers cache slow-to-build bioinformatics tools (Infernal, Samtools) | ||
|
|
||
| name: Build and Push Tool Containers | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like this
| env: | |
| INFERNAL_VERSION: "1.1.2" | |
| SAMTOOLS_VERSION: "1.18" | |
or creating version files might be preferable than hardcoding x3 times
# containers/infernal/VERSION
1.1.2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR converts the RNAcentral import pipeline container from a monolithic single-stage build to a multi-stage build architecture. The refactoring separates bioinformatics tools (Infernal, Samtools, Rust utilities) into independent pre-built containers, significantly reducing build time from ~25-30 minutes to ~9 minutes and decreasing the final image size by 30-40%.
Key Changes:
- Introduced separate tool containers (Infernal 1.1.2, Samtools 1.18, Rust utilities) with multi-stage builds
- Modified main Dockerfile to pull pre-built binaries from tool containers instead of compiling from source
- Added GitHub Actions workflows to build and push tool containers independently with change detection
- Updated build configuration to use explicit platform targeting (linux/amd64)
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
containers/infernal/Dockerfile |
New multi-stage build for Infernal 1.1.2 bioinformatics tool suite |
containers/samtools/Dockerfile |
New multi-stage build for Samtools + HTSlib 1.18 with runtime libraries only |
containers/rust-utils/Dockerfile |
New multi-stage build compiling all 8 Rust CLI utilities in release mode |
containers/Makefile |
Added build targets for tool containers with version management and testing |
Makefile |
Updated to use docker buildx with explicit linux/amd64 platform |
Dockerfile |
Converted to multi-stage build pulling from tool containers; removed build dependencies |
Cargo.toml |
Changed workspace members from wildcard to explicit list for better dependency tracking |
.github/workflows/tool-containers.yaml |
New workflow for building Infernal and Samtools containers with change detection |
.github/workflows/rust-container.yaml |
New workflow for building Rust utilities container on code changes |
.github/workflows/main.yaml |
Added synchronization to wait for Rust container build before main pipeline build |
.dockerignore |
Restructured to exclude build artifacts, tests, and development files |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| RUN apt update && apt upgrade -y && \ | ||
| apt install -y \ | ||
| bedtools \ | ||
| ca-certificates \ | ||
| curl \ | ||
| default-mysql-client \ | ||
| devscripts \ | ||
| freetds-dev \ | ||
| gawk \ | ||
| gcc \ | ||
| git \ | ||
| gzip \ | ||
| hmmer \ | ||
| jq \ | ||
| lftp \ | ||
| libbz2-dev \ | ||
| liblzma-dev \ | ||
| libncurses5-dev \ | ||
| libncursesw5-dev \ | ||
| libsqlite3-dev \ | ||
| libssl-dev \ | ||
| libbz2-1.0 \ | ||
| liblzma5 \ | ||
| libncurses6 \ | ||
| libssl3 \ | ||
| libxml2-utils \ | ||
| libxml2-dev \ | ||
| libzip-dev \ | ||
| moreutils \ | ||
| mysql-common \ | ||
| openssl \ | ||
| pandoc \ | ||
| patch \ | ||
| pgloader \ | ||
| postgresql-17 \ | ||
| postgresql-client-17 \ | ||
| procps \ | ||
| python3 \ | ||
| python3-dev \ | ||
| python3-pip \ | ||
| rsync \ | ||
| sbcl \ | ||
| tabix \ | ||
| tar \ | ||
| time \ | ||
| unzip \ | ||
| zlib1g-dev\ | ||
| wget | ||
|
|
||
|
|
||
| # Install Infernal | ||
| RUN \ | ||
| cd $RNA/ && \ | ||
| curl -OL http://eddylab.org/infernal/infernal-1.1.2.tar.gz && \ | ||
| tar -xvzf infernal-1.1.2.tar.gz && \ | ||
| rm infernal-1.1.2.tar.gz && \ | ||
| cd infernal-1.1.2 && \ | ||
| ./configure --prefix=$RNA/infernal-1.1.2 && \ | ||
| make && \ | ||
| make install && \ | ||
| cd easel && \ | ||
| make install | ||
|
|
||
| # Install blat | ||
| RUN \ | ||
| wget https://hgwdev.gi.ucsc.edu/~kent/exe/linux/blatSuite.38.zip -O blat.zip && \ | ||
| wget \ | ||
| zlib1g && \ | ||
| rm -rf /var/lib/apt/lists/* |
Copilot
AI
Nov 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The package tabix is installed via apt, but tabix is already provided by the samtools container and copied in line 77. This creates a redundant installation and potential version conflict between the apt-provided tabix and the one from the samtools container.
Remove tabix from the apt install list since it's already provided by the samtools container.
| run: docker login -u $DOCKER_USER -p $DOCKER_PASSWORD | ||
|
|
||
| - name: Build Infernal container | ||
| run: docker build -t rnacentral/infernal:1.1.2 -f containers/infernal/Dockerfile containers/infernal/ |
Copilot
AI
Nov 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The build context is inconsistent between the Makefile and the workflow. The Makefile (line 30) uses . as the build context, while the workflow uses containers/infernal/. Since the Infernal Dockerfile doesn't copy any local files, both work, but this inconsistency could cause confusion.
For consistency, either:
- Use
.in both places (Makefile and workflow) - Use
infernal/in the Makefile and adjust the working directory accordingly
The same issue exists for the Samtools container (line 102).
| run: docker login -u $DOCKER_USER -p $DOCKER_PASSWORD | ||
|
|
||
| - name: Build Samtools container | ||
| run: docker build -t rnacentral/samtools:1.18 -f containers/samtools/Dockerfile containers/samtools/ |
Copilot
AI
Nov 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The build context is inconsistent between the Makefile and the workflow. The Makefile (line 36 in containers/Makefile) uses . as the build context, while the workflow uses containers/samtools/. Since the Samtools Dockerfile doesn't copy any local files, both work, but this inconsistency could cause confusion and maintainability issues.
For consistency, use the same build context in both places.
| run: docker build -t rnacentral/samtools:1.18 -f containers/samtools/Dockerfile containers/samtools/ | |
| run: docker build -t rnacentral/samtools:1.18 -f containers/samtools/Dockerfile . |
| utils/ | ||
| Cargo.toml | ||
| Cargo.lock | ||
| Makefile |
Copilot
AI
Nov 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The .dockerignore file excludes utils/, Cargo.toml, Cargo.lock, and Makefile, but these files are required by the rust-utils container build. The rust-utils Dockerfile (at containers/rust-utils/Dockerfile) uses the repository root as its build context and attempts to COPY utils ./utils and COPY Cargo.toml Cargo.lock Makefile ./ (lines 14-15).
Since the .dockerignore applies to the build context (repository root for rust-utils), these COPY commands will fail with an error like "COPY failed: file not found in build context".
The main Dockerfile also uses the repository root as its build context but doesn't need these files since it pulls pre-built Rust binaries from the rust-utils container. Consider:
- Creating a separate
.dockerignorefile for the rust-utils build that doesn't exclude Rust-related files, or - Not excluding these files in the main
.dockerignore, or - Using a different approach to exclude files selectively for different builds.
| utils/ | |
| Cargo.toml | |
| Cargo.lock | |
| Makefile |
| # Copy Samtools + HTSlib from tool container | ||
| COPY --from=samtools /usr/local/bin/samtools /usr/local/bin/tabix /usr/local/bin/bgzip /usr/local/bin/ | ||
| COPY --from=samtools /usr/local/lib/libhts* /usr/local/lib/ | ||
| COPY --from=samtools /usr/local/include/htslib /usr/local/include/htslib |
Copilot
AI
Nov 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The COPY --from=samtools /usr/local/include/htslib /usr/local/include/htslib line copies header files that are typically only needed for compiling against HTSlib. Since this is a runtime-only image (no gcc or build tools), these headers are likely unnecessary and increase the image size.
Consider removing this line unless the pipeline requires compiling C code at runtime that links against HTSlib.
| COPY --from=samtools /usr/local/include/htslib /usr/local/include/htslib |
| RUN samtools --version && \ | ||
| tabix --version |
Copilot
AI
Nov 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test validates samtools and tabix but not bgzip, even though bgzip is copied from the builder (line 54) and mentioned in the container description (line 2). Consider adding bgzip --version to the test to ensure all three tools are working correctly.
| COPY --from=builder /usr/local/bin/tabix /usr/local/bin/ | ||
| COPY --from=builder /usr/local/bin/bgzip /usr/local/bin/ | ||
| COPY --from=builder /usr/local/lib/libhts* /usr/local/lib/ | ||
| COPY --from=builder /usr/local/include/htslib /usr/local/include/htslib/ |
Copilot
AI
Nov 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The header files in /usr/local/include/htslib/ are being copied to the runtime image, but these are typically only needed for compiling against HTSlib. Since this is a runtime-only container, these headers are likely unnecessary and increase the image size.
Consider removing this line unless downstream containers need to compile C code that links against HTSlib.
| COPY --from=builder /usr/local/include/htslib /usr/local/include/htslib/ |
| ENV PATH="$RNA/infernal-1.1.2/bin:$PATH" | ||
| ENV PATH="$RNA/blat_suite:$PATH" | ||
| ENV PATH="$RNA/seqkit:$PATH" | ||
| ENV PATH="$RNACENTRAL_IMPORT_PIPELINE:$PATH" |
Copilot
AI
Nov 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Rust binaries are copied to $RNACENTRAL_IMPORT_PIPELINE/bin/ (line 126), but the PATH includes $RNACENTRAL_IMPORT_PIPELINE (line 147) without the /bin suffix. This means the Rust utilities won't be found when executed.
Change line 147 to:
ENV PATH="$RNACENTRAL_IMPORT_PIPELINE/bin:$PATH"Or add a separate PATH entry:
ENV PATH="$RNACENTRAL_IMPORT_PIPELINE/bin:$RNACENTRAL_IMPORT_PIPELINE:$PATH"| ENV PATH="$RNACENTRAL_IMPORT_PIPELINE:$PATH" | |
| ENV PATH="$RNACENTRAL_IMPORT_PIPELINE/bin:$PATH" |
For a long time the RNAcentral import pipeline container was built as a single container, from one massive Dockerfile. This introduces a number of problems:
The proper way to tackle this is to make a multi-part build, in which we build bits of the pipeline in intermediate containers and pull the build artefacts and their dependencies ONLY into the final image.
This should:
This PR should implement a multi-part build of the pipeline container, and separates out tools into their own containers, from which we can pull the required build artefacts.
I've tested the new workflows, and the build now completes in 9 or so minutes, most of which time is spent pulling the docker image from the hub to convert it to singularity. We also now have explicit tool containers with tool versions attached that we can update independently if needed. Only the full on-push workflow hasn't been tested yet, because it needs to be triggered on the dev branch