Version 0.7.4 - incongruent results between Conda and Biocontainers #73

Open
marchoeppner opened this issue Feb 20, 2025 · 4 comments

@marchoeppner

marchoeppner commented Feb 20, 2025

Hi,

so this is driving me a little nuts right now. I have a pipeline that provisions software either via Conda (discouraged) or a container (Singularity or Docker, mostly).

For legacy reasons, I am using ConFindr 0.7.4 with a custom database (https://zenodo.org/records/4604758), built for 0.7.4 and available here: https://gitlab.bfr.berlin/bfr_bioinformatics/aquamis_databases/-/raw/main/confindr_db.tar.gz

With a test sample, I get the results below, once with Conda and once with Singularity using the Biocontainers image (derived from Bioconda). Note that the Conda-installed version of ConFindr reports no SNVs, whereas the Singularity version (from Bioconda, https://quay.io/repository/biocontainers/confindr?tab=tags&tag=latest) reports 2 SNVs. It is my understanding that this should not happen, and I cannot see major differences in the dependencies. Any ideas?

confindr --threads 8 -i input_dir -m 50 -dt Illumina -bf 0.05 -q 20 -b 2 -fid _R1 -rid _R2 -d /work_syn/shared/references/gabi/1.0/confindr -o out_conda -verbosity debug
  2025-02-20 12:32:10  Welcome to ConFindr 0.7.4! Beginning analysis of your samples... 
  2025-02-20 12:32:10  Did not find rMLST databases, if you want to use ConFindr on genera other than Listeria, Salmonella, and Escherichia, you'll need to download them. Instructions are available at https://olc-bioinformatics.github.io/ConFindr/install/#downloading-confindr-databases
 
  2025-02-20 12:32:10  Beginning analysis of sample LC04-22-RV4-P64-C01... 
  2025-02-20 12:32:10  Sample is paired. Sample name is LC04-22-RV4-P64-C01 
  2025-02-20 12:32:10  Checking for cross-species contamination... 
  2025-02-20 12:32:13  Extracting conserved core genes... 
  2025-02-20 12:32:14  Quality trimming... 
  2025-02-20 12:32:15  Detecting contamination... 
  2025-02-20 12:32:15  Total gene length is 17385 
  2025-02-20 12:32:20  Done! Number of contaminating SNVs found: 0
 
  2025-02-20 12:32:20  Contamination detection complete! 

And with Singularity:

singularity exec confindr\:0.7.4--py_0 confindr --threads 8 -i input_dir -m 50 -dt Illumina -bf 0.05 -q 20 -b 2 -fid _R1 -rid _R2 -d /work_syn/shared/references/gabi/1.0/confindr -o out_sgt -verbosity debug
  2025-02-20 12:33:02  Welcome to ConFindr 0.7.4! Beginning analysis of your samples... 
  2025-02-20 12:33:02  Did not find rMLST databases, if you want to use ConFindr on genera other than Listeria, Salmonella, and Escherichia, you'll need to download them. Instructions are available at https://olc-bioinformatics.github.io/ConFindr/install/#downloading-confindr-databases
 
  2025-02-20 12:33:02  Beginning analysis of sample LC04-22-RV4-P64-C01... 
  2025-02-20 12:33:02  Sample is paired. Sample name is LC04-22-RV4-P64-C01 
  2025-02-20 12:33:02  Checking for cross-species contamination... 
  2025-02-20 12:33:07  Extracting conserved core genes... 
  2025-02-20 12:33:08  Quality trimming... 
  2025-02-20 12:33:09  Detecting contamination... 
  2025-02-20 12:33:09  Total gene length is 17385 
  2025-02-20 12:33:14  base qualities before filtering: {'G': [38, 38, 37, 38, 36, 38, 32, 38, 38, 38, 38, 36, 38, 22, 38, 38, 38, 38, 76, 76, 75, 76, 38, 38, 0, 0, 38, 38, 38, 38, 38, 38, 0, 38, 0, 76, 0, 76, 0, 76, 0, 76, 0, 21, 38], 'T': [23, 20]} 
  2025-02-20 12:33:14  base qualities after filtering: {'G': [38, 38, 37, 38, 36, 38, 32, 38, 38, 38, 38, 36, 38, 22, 38, 38, 38, 38, 76, 76, 75, 76, 38, 38, 38, 38, 38, 38, 38, 38, 38, 76, 76, 76, 76, 21, 38], 'T': [23, 20]} 
  2025-02-20 12:33:14  SNVs found at position 429: {'G': 37, 'T': 2}
 
  2025-02-20 12:33:14  base qualities before filtering: {'T': [37, 38, 37, 38, 38, 76, 0, 75, 0], 'C': [23, 23]} 
  2025-02-20 12:33:14  base qualities after filtering: {'T': [37, 38, 37, 38, 38, 76, 75], 'C': [23, 23]} 
  2025-02-20 12:33:14  SNVs found at position 244: {'T': 7, 'C': 2}
 
  2025-02-20 12:33:15  Done! Number of contaminating SNVs found: 2
 
  2025-02-20 12:33:15  Contamination detection complete! 
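
To put the two reported sites next to the thresholds on the command line, here is a minimal Python sketch that pulls the `SNVs found at position` lines out of a debug log and computes the fraction of the less frequent base. The helper name `minor_fractions` is made up for illustration, and how ConFindr applies `-b` and `-bf` internally is not visible in the log, so this is only arithmetic on the printed counts.

```python
import ast
import re

# Matches the debug lines from the log above, e.g.
# "SNVs found at position 429: {'G': 37, 'T': 2}"
SNV_LINE = re.compile(r"SNVs found at position (\d+): (\{.*\})")


def minor_fractions(log_text):
    """Return {position: (minor_base, minor_count, total, fraction)}.

    Hypothetical helper: it only reads the counts ConFindr printed and
    does not reproduce ConFindr's own filtering logic.
    """
    result = {}
    for match in SNV_LINE.finditer(log_text):
        position = int(match.group(1))
        counts = ast.literal_eval(match.group(2))  # dict literal from the log
        total = sum(counts.values())
        minor_base = min(counts, key=counts.get)
        minor_count = counts[minor_base]
        result[position] = (minor_base, minor_count, total, minor_count / total)
    return result


# The two SNV lines copied from the Singularity run.
log = """
2025-02-20 12:33:14  SNVs found at position 429: {'G': 37, 'T': 2}
2025-02-20 12:33:14  SNVs found at position 244: {'T': 7, 'C': 2}
"""

for pos, (base, n, total, frac) in sorted(minor_fractions(log).items()):
    print(f"position {pos}: {base} {n}/{total} = {frac:.3f}")
```

For the log above this gives 2/9 ≈ 0.222 at position 244 and 2/39 ≈ 0.051 at position 429, so position 429 sits very close to the 0.05 passed via `-bf`.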

@marchoeppner

marchoeppner commented Feb 20, 2025

I also went through the exercise of installing ConFindr manually, and that matches the results from Conda. So I am leaning towards the container version returning incorrect results. The question is: why?

Dependencies installed in each environment:

Conda from Bioconda

BBMap version 38.96
samtools 1.13
KMA-1.2.0
Mash version 2.3

Biocontainers from Bioconda

BBMap version 38.86
samtools 1.10
KMA-1.2.0
Mash version 2.2.2

Both sets seem to be in line with the requirements specified in the version-specific installation instructions (https://github.com/OLC-Bioinformatics/ConFindr/blob/fbfa7978aa800c19ea48c06ebb4ceda5e591935d/docs/install.md).
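
The two dependency lists can be diffed mechanically to narrow down which tools differ. A minimal Python sketch, with the versions copied from the lists above; `version_diff` is a hypothetical helper, not part of ConFindr or Conda:

```python
# Versions as reported in each environment (copied from the lists above).
conda_env = {"BBMap": "38.96", "samtools": "1.13", "KMA": "1.2.0", "Mash": "2.3"}
biocontainer = {"BBMap": "38.86", "samtools": "1.10", "KMA": "1.2.0", "Mash": "2.2.2"}


def version_diff(a, b):
    """Return {tool: (version_in_a, version_in_b)} for tools whose versions differ.

    A tool missing from one environment shows up with None on that side.
    """
    return {
        tool: (a.get(tool), b.get(tool))
        for tool in sorted(set(a) | set(b))
        if a.get(tool) != b.get(tool)
    }


for tool, (conda_v, container_v) in version_diff(conda_env, biocontainer).items():
    print(f"{tool}: conda={conda_v} biocontainer={container_v}")
```

Only KMA is identical in both environments; BBMap, samtools and Mash all differ, so any of the three could be involved.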

@marchoeppner

And the test data used (after trimming with Fastp):

ftp.sra.ebi.ac.uk/vol1/run/ERR117/ERR11797882/LC04-22-RV4-P64-C01_R1.fastq.gz
ftp.sra.ebi.ac.uk/vol1/run/ERR117/ERR11797882/LC04-22-RV4-P64-C01_R2.fastq.gz

@marchoeppner

I ended up building my own container here: https://hub.docker.com/repository/docker/mhoeppner/confindr/tags/0.7.4/sha256:a401ea0ea0b363fb64fcf55588aeb43b6b7fdb931bda2c5bd9ee403e47bf83a0

And that also matches the conda package.

@marchoeppner

Right, so if the point here wasn't clear ;)

The Biocontainer was built from the Conda package in 2020. If that installation of ConFindr 0.7.4 yields different results from an installation made directly from Bioconda today, the most likely interpretation (imho) is that the dependencies aren't pinned tightly enough to guarantee reproducible results. Which is a big problem.
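
As an illustration of what tighter pinning could look like on the user side, here is a minimal Python sketch that turns a set of known versions into exact `name=version` specs for `conda create`. The versions are the ones reported for the Conda environment in this thread; the lowercase package names are assumed to be the Bioconda names, the environment name `confindr-0.7.4` is made up, and whether this exact combination still resolves has not been checked.

```python
def pinned_specs(versions):
    """Render exact 'name=version' specs, sorted by package name."""
    return [f"{name}={version}" for name, version in sorted(versions.items())]


# Versions reported for the Conda environment in this thread.
# Package names are assumed to match the Bioconda package names.
conda_env_versions = {
    "confindr": "0.7.4",
    "bbmap": "38.96",
    "samtools": "1.13",
    "kma": "1.2.0",
    "mash": "2.3",
}

specs = pinned_specs(conda_env_versions)

# Hypothetical environment name; channels listed in the usual Bioconda order.
print("conda create -n confindr-0.7.4 -c conda-forge -c bioconda " + " ".join(specs))
```

Pinning like this only fixes the direct dependencies listed here; transitive dependencies would still float unless the full environment is exported and reused.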
