Skip to content

Conversation

@afg1
Copy link
Contributor

@afg1 afg1 commented Nov 26, 2025

For a long time the RNAcentral import pipeline container was built as a single container, from one massive Dockerfile. This introduces a number of problems:

  • It takes ages to build (~25-30m) because everything is compiled every time
  • It becomes more difficult to change tool dependencies/versions
  • The container is very big (>3GB)

The proper way to tackle this is to make a multi-part build, in which we build bits of the pipeline in intermediate containers and pull the build artefacts and their dependencies ONLY into the final image.

This should:

  • Mean we only build tools when we change the versions (e.g. infernal version hasn't changed in years)
  • Build tools (e.g. cargo, gcc) are not included in the final container, reducing its size
  • The build process is faster, since we build once and pull several times

This PR should implement a multi-part build of the pipeline container, and separates out tools into their own containers, from which we can pull the required build artefacts.

I've tested the new workflows, and the build now completes in 9 or so minutes, most of which time is spent pulling the docker image from the hub to convert it to singularity. We also now have explicit tool containers with tool versions attached that we can update independently if needed. Only the full on-push workflow hasn't been tested yet, because it needs to be triggered on the dev branch

@afg1 afg1 self-assigned this Nov 26, 2025
@afg1 afg1 requested review from blakesweeney and pmb59 November 26, 2025 11:30
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @afg1, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request fundamentally re-architects the RNAcentral import pipeline's container build process. By transitioning from a single, large Dockerfile to a multi-stage, multi-container approach, the aim is to drastically cut down on build times, reduce the final image footprint, and enhance maintainability by isolating tool dependencies. This change optimizes the development and deployment workflow for the pipeline.

Highlights

  • Multi-stage Dockerfile Implementation: The monolithic Dockerfile for the RNAcentral import pipeline has been refactored into a multi-stage build process, significantly improving build efficiency and reducing the final image size.
  • Tool Containerization: Key tools like Infernal, Samtools/HTSlib, and custom Rust utilities are now built in their own dedicated Docker containers. This allows for independent updates and prevents build tools from being included in the final pipeline image.
  • Reduced Build Times and Image Size: The new multi-stage approach is expected to reduce build times from 25-30 minutes to approximately 9 minutes and decrease the final container size by 30-40% by only pulling necessary build artifacts into the final image.
  • Streamlined Python Environment: A dedicated Python environment builder stage is introduced to manage Python dependencies using uv, ensuring a clean and frozen environment for the final image.
  • Makefile Enhancements: The containers/Makefile has been updated with new targets to build, push, and test the newly created tool containers (Infernal, Samtools, Rust utilities), facilitating their management.
Ignored Files
  • Ignored by pattern: .github/workflows/** (3)
    • .github/workflows/main.yaml
    • .github/workflows/rust-container.yaml
    • .github/workflows/tool-containers.yaml
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This is an excellent and much-needed refactoring of the pipeline's Docker build process. Converting the monolithic Dockerfile to a multi-stage, multi-container setup is a great improvement for build speed, image size, and maintainability. The use of separate, versioned tool containers and a dedicated Makefile for managing them is a solid approach. My review includes several suggestions to further refine this new structure, focusing on reducing image sizes even more by using smaller base images, improving the robustness of build scripts, and ensuring consistency across the new Makefiles and Dockerfiles.

# Copy Samtools + HTSlib from tool container
COPY --from=samtools /usr/local/bin/samtools /usr/local/bin/tabix /usr/local/bin/bgzip /usr/local/bin/
COPY --from=samtools /usr/local/lib/libhts* /usr/local/lib/
COPY --from=samtools /usr/local/include/htslib /usr/local/include/htslib
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The final runtime image likely does not need the C header files from HTSlib (/usr/local/include/htslib). These are only required for compiling code against the library, not for running applications that are already compiled and dynamically link to it. Removing this COPY command will slightly reduce the image size and complexity.

Comment on lines +68 to +69
docker run --rm $(INFERNAL_IMAGE) cmscan -h > /dev/null
docker run --rm $(INFERNAL_IMAGE) cmbuild -h > /dev/null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The tests for infernal redirect the command's standard output to /dev/null. While this confirms the command runs without error, it hides the version or help output, which can be useful for verification. The samtools test, for example, prints the version. For consistency and more informative test runs, consider removing the output redirection.

	docker run --rm $(INFERNAL_IMAGE) cmscan -h
	docker run --rm $(INFERNAL_IMAGE) cmbuild -h

Comment on lines +78 to +80
docker run --rm $(RUST_UTILS_IMAGE) json2fasta --help > /dev/null
docker run --rm $(RUST_UTILS_IMAGE) precompute --help > /dev/null
docker run --rm $(RUST_UTILS_IMAGE) search-export --help > /dev/null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The tests for rust-utils redirect the command's standard output to /dev/null. While this confirms the command runs without error, it hides the help output, which can be useful for verification. The samtools test, for example, prints the version. For consistency and more informative test runs, consider removing the output redirection.

	docker run --rm $(RUST_UTILS_IMAGE) json2fasta --help
	docker run --rm $(RUST_UTILS_IMAGE) precompute --help
	docker run --rm $(RUST_UTILS_IMAGE) search-export --help

make install

# Stage 2: Final - minimal runtime image
FROM python:3.11.14-trixie
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The final stage of this Dockerfile uses python:3.11.14-trixie as the base image. Since this container's purpose is just to provide the compiled infernal tool, which does not have a Python dependency, you could use a much smaller base image like debian:trixie-slim. This would significantly reduce the size of the final rnacentral/infernal image.

FROM debian:trixie-slim

Comment on lines +18 to +28
RUN cargo build --release && \
mkdir /rust-bins && \
mv target/release/json2fasta \
target/release/split-ena \
target/release/expand-urs \
target/release/precompute \
target/release/search-export \
target/release/ftp-export \
target/release/json2dfasta \
target/release/bed-expander \
/rust-bins/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Explicitly listing every binary in the mv command is brittle. If a new utility is added to the workspace, it must also be manually added here. A more robust and maintainable approach would be to find and move all executable files from the release directory.

RUN cargo build --release && \
    mkdir /rust-bins && \
    find target/release -maxdepth 1 -type f -executable -exec mv {} /rust-bins/ \;

/rust-bins/

# Stage 2: Final - minimal runtime image
FROM python:3.11.14-trixie
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The final stage uses python:3.11.14-trixie as the base image. Since this container just provides pre-compiled Rust binaries, which are dynamically linked against glibc but don't require a Python runtime, you could switch to a smaller base like debian:trixie-slim. This would make the rnacentral/rust-utils image much smaller.

FROM debian:trixie-slim

make install

# Stage 2: Final - minimal runtime image with libraries
FROM python:3.11.14-trixie
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The final stage uses python:3.11.14-trixie as the base image. The runtime dependencies for samtools are just standard C libraries, not Python. Using a smaller base image like debian:trixie-slim and installing the necessary .so libraries would make the final rnacentral/samtools image significantly smaller and more efficient.

FROM debian:trixie-slim

COPY --from=builder /usr/local/bin/tabix /usr/local/bin/
COPY --from=builder /usr/local/bin/bgzip /usr/local/bin/
COPY --from=builder /usr/local/lib/libhts* /usr/local/lib/
COPY --from=builder /usr/local/include/htslib /usr/local/include/htslib/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This COPY command brings the HTSlib header files (.h) into the final runtime image. Header files are only needed for compiling software, not for running it. Since this is a runtime image, these files are unnecessary and add to the image size. You can safely remove this line.

Copy link

@pmb59 pmb59 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@afg1 LTGM overall. My only major comment is that we are not following any container version strategy, only latest https://hub.docker.com/r/rnacentral/rnacentral-import-pipeline/tags
Only using latest (No version history, can't rollback) it's not good. We should tag versions, and pin a specific version into the production environment, unless this approach is not feasible for any reason

DOCKER_PASSWORD: ${{ secrets.DOCKER_PASSWORD }}
run: docker login -u $DOCKER_USER -p $DOCKER_PASSWORD

- name: Build Rust utilities container
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The workflow will continue pushing even if the build fails. Consider adding:

Suggested change
- name: Build Rust utilities container
- name: Build Rust utilities container
if: success() # Only push if build succeeded

- '.github/workflows/rust-container.yaml'
workflow_dispatch:
inputs:
force_rebuild:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The force_rebuild input is defined but never used in the workflow?

GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
echo "Waiting for rust-container workflow to complete..."
sleep 10 # Give workflow time to start
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will 10 seconds be enough?

--jq '.[0] | "\(.status):\(.conclusion)"')
if [ -z "$status" ]; then
echo "No rust-container workflow found yet, waiting..."
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
echo "No rust-container workflow found yet, waiting..."
echo "Warning: rust-container workflow not found yet, waiting..."

# These containers cache slow-to-build bioinformatics tools (Infernal, Samtools)

name: Build and Push Tool Containers

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like this

Suggested change
env:
INFERNAL_VERSION: "1.1.2"
SAMTOOLS_VERSION: "1.18"

or creating version files might be preferable than hardcoding x3 times

# containers/infernal/VERSION
1.1.2

@pmb59 pmb59 requested a review from Copilot November 26, 2025 15:01
Copilot finished reviewing on behalf of pmb59 November 26, 2025 15:05
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR converts the RNAcentral import pipeline container from a monolithic single-stage build to a multi-stage build architecture. The refactoring separates bioinformatics tools (Infernal, Samtools, Rust utilities) into independent pre-built containers, significantly reducing build time from ~25-30 minutes to ~9 minutes and decreasing the final image size by 30-40%.

Key Changes:

  • Introduced separate tool containers (Infernal 1.1.2, Samtools 1.18, Rust utilities) with multi-stage builds
  • Modified main Dockerfile to pull pre-built binaries from tool containers instead of compiling from source
  • Added GitHub Actions workflows to build and push tool containers independently with change detection
  • Updated build configuration to use explicit platform targeting (linux/amd64)

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
containers/infernal/Dockerfile New multi-stage build for Infernal 1.1.2 bioinformatics tool suite
containers/samtools/Dockerfile New multi-stage build for Samtools + HTSlib 1.18 with runtime libraries only
containers/rust-utils/Dockerfile New multi-stage build compiling all 8 Rust CLI utilities in release mode
containers/Makefile Added build targets for tool containers with version management and testing
Makefile Updated to use docker buildx with explicit linux/amd64 platform
Dockerfile Converted to multi-stage build pulling from tool containers; removed build dependencies
Cargo.toml Changed workspace members from wildcard to explicit list for better dependency tracking
.github/workflows/tool-containers.yaml New workflow for building Infernal and Samtools containers with change detection
.github/workflows/rust-container.yaml New workflow for building Rust utilities container on code changes
.github/workflows/main.yaml Added synchronization to wait for Rust container build before main pipeline build
.dockerignore Restructured to exclude build artifacts, tests, and development files

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +36 to +71
RUN apt update && apt upgrade -y && \
apt install -y \
bedtools \
ca-certificates \
curl \
default-mysql-client \
devscripts \
freetds-dev \
gawk \
gcc \
git \
gzip \
hmmer \
jq \
lftp \
libbz2-dev \
liblzma-dev \
libncurses5-dev \
libncursesw5-dev \
libsqlite3-dev \
libssl-dev \
libbz2-1.0 \
liblzma5 \
libncurses6 \
libssl3 \
libxml2-utils \
libxml2-dev \
libzip-dev \
moreutils \
mysql-common \
openssl \
pandoc \
patch \
pgloader \
postgresql-17 \
postgresql-client-17 \
procps \
python3 \
python3-dev \
python3-pip \
rsync \
sbcl \
tabix \
tar \
time \
unzip \
zlib1g-dev\
wget


# Install Infernal
RUN \
cd $RNA/ && \
curl -OL http://eddylab.org/infernal/infernal-1.1.2.tar.gz && \
tar -xvzf infernal-1.1.2.tar.gz && \
rm infernal-1.1.2.tar.gz && \
cd infernal-1.1.2 && \
./configure --prefix=$RNA/infernal-1.1.2 && \
make && \
make install && \
cd easel && \
make install

# Install blat
RUN \
wget https://hgwdev.gi.ucsc.edu/~kent/exe/linux/blatSuite.38.zip -O blat.zip && \
wget \
zlib1g && \
rm -rf /var/lib/apt/lists/*
Copy link

Copilot AI Nov 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The package tabix is installed via apt, but tabix is already provided by the samtools container and copied in line 77. This creates a redundant installation and potential version conflict between the apt-provided tabix and the one from the samtools container.

Remove tabix from the apt install list since it's already provided by the samtools container.

Copilot uses AI. Check for mistakes.
run: docker login -u $DOCKER_USER -p $DOCKER_PASSWORD

- name: Build Infernal container
run: docker build -t rnacentral/infernal:1.1.2 -f containers/infernal/Dockerfile containers/infernal/
Copy link

Copilot AI Nov 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The build context is inconsistent between the Makefile and the workflow. The Makefile (line 30) uses . as the build context, while the workflow uses containers/infernal/. Since the Infernal Dockerfile doesn't copy any local files, both work, but this inconsistency could cause confusion.

For consistency, either:

  1. Use . in both places (Makefile and workflow)
  2. Use infernal/ in the Makefile and adjust the working directory accordingly

The same issue exists for the Samtools container (line 102).

Copilot uses AI. Check for mistakes.
run: docker login -u $DOCKER_USER -p $DOCKER_PASSWORD

- name: Build Samtools container
run: docker build -t rnacentral/samtools:1.18 -f containers/samtools/Dockerfile containers/samtools/
Copy link

Copilot AI Nov 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The build context is inconsistent between the Makefile and the workflow. The Makefile (line 36 in containers/Makefile) uses . as the build context, while the workflow uses containers/samtools/. Since the Samtools Dockerfile doesn't copy any local files, both work, but this inconsistency could cause confusion and maintainability issues.

For consistency, use the same build context in both places.

Suggested change
run: docker build -t rnacentral/samtools:1.18 -f containers/samtools/Dockerfile containers/samtools/
run: docker build -t rnacentral/samtools:1.18 -f containers/samtools/Dockerfile .

Copilot uses AI. Check for mistakes.
Comment on lines +7 to +10
utils/
Cargo.toml
Cargo.lock
Makefile
Copy link

Copilot AI Nov 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The .dockerignore file excludes utils/, Cargo.toml, Cargo.lock, and Makefile, but these files are required by the rust-utils container build. The rust-utils Dockerfile (at containers/rust-utils/Dockerfile) uses the repository root as its build context and attempts to COPY utils ./utils and COPY Cargo.toml Cargo.lock Makefile ./ (lines 14-15).

Since the .dockerignore applies to the build context (repository root for rust-utils), these COPY commands will fail with an error like "COPY failed: file not found in build context".

The main Dockerfile also uses the repository root as its build context but doesn't need these files since it pulls pre-built Rust binaries from the rust-utils container. Consider:

  1. Creating a separate .dockerignore file for the rust-utils build that doesn't exclude Rust-related files, or
  2. Not excluding these files in the main .dockerignore, or
  3. Using a different approach to exclude files selectively for different builds.
Suggested change
utils/
Cargo.toml
Cargo.lock
Makefile

Copilot uses AI. Check for mistakes.
# Copy Samtools + HTSlib from tool container
COPY --from=samtools /usr/local/bin/samtools /usr/local/bin/tabix /usr/local/bin/bgzip /usr/local/bin/
COPY --from=samtools /usr/local/lib/libhts* /usr/local/lib/
COPY --from=samtools /usr/local/include/htslib /usr/local/include/htslib
Copy link

Copilot AI Nov 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The COPY --from=samtools /usr/local/include/htslib /usr/local/include/htslib line copies header files that are typically only needed for compiling against HTSlib. Since this is a runtime-only image (no gcc or build tools), these headers are likely unnecessary and increase the image size.

Consider removing this line unless the pipeline requires compiling C code at runtime that links against HTSlib.

Suggested change
COPY --from=samtools /usr/local/include/htslib /usr/local/include/htslib

Copilot uses AI. Check for mistakes.
Comment on lines +62 to +63
RUN samtools --version && \
tabix --version
Copy link

Copilot AI Nov 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test validates samtools and tabix but not bgzip, even though bgzip is copied from the builder (line 54) and mentioned in the container description (line 2). Consider adding bgzip --version to the test to ensure all three tools are working correctly.

Copilot uses AI. Check for mistakes.
COPY --from=builder /usr/local/bin/tabix /usr/local/bin/
COPY --from=builder /usr/local/bin/bgzip /usr/local/bin/
COPY --from=builder /usr/local/lib/libhts* /usr/local/lib/
COPY --from=builder /usr/local/include/htslib /usr/local/include/htslib/
Copy link

Copilot AI Nov 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The header files in /usr/local/include/htslib/ are being copied to the runtime image, but these are typically only needed for compiling against HTSlib. Since this is a runtime-only container, these headers are likely unnecessary and increase the image size.

Consider removing this line unless downstream containers need to compile C code that links against HTSlib.

Suggested change
COPY --from=builder /usr/local/include/htslib /usr/local/include/htslib/

Copilot uses AI. Check for mistakes.
ENV PATH="$RNA/infernal-1.1.2/bin:$PATH"
ENV PATH="$RNA/blat_suite:$PATH"
ENV PATH="$RNA/seqkit:$PATH"
ENV PATH="$RNACENTRAL_IMPORT_PIPELINE:$PATH"
Copy link

Copilot AI Nov 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Rust binaries are copied to $RNACENTRAL_IMPORT_PIPELINE/bin/ (line 126), but the PATH includes $RNACENTRAL_IMPORT_PIPELINE (line 147) without the /bin suffix. This means the Rust utilities won't be found when executed.

Change line 147 to:

ENV PATH="$RNACENTRAL_IMPORT_PIPELINE/bin:$PATH"

Or add a separate PATH entry:

ENV PATH="$RNACENTRAL_IMPORT_PIPELINE/bin:$RNACENTRAL_IMPORT_PIPELINE:$PATH"
Suggested change
ENV PATH="$RNACENTRAL_IMPORT_PIPELINE:$PATH"
ENV PATH="$RNACENTRAL_IMPORT_PIPELINE/bin:$PATH"

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants