Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

classify plant reads in metagenomes #3172

Closed
gabridinosauro opened this issue May 24, 2024 · 11 comments
Closed

classify plant reads in metagenomes #3172

gabridinosauro opened this issue May 24, 2024 · 11 comments

Comments

@gabridinosauro
Copy link

Dear Sourmash team,

Hope you are all good. I have a project where I have some shotgun metagenomics data of wild rodents.
I want to see if I can classify reads to plant genomes, to have an idea of their diet.

Is it possible to do it with sourmash?
I suppose I would have to make my own database as I do not see any databases containing plants already available.

Thanks in advance.

Gabriele

@ctb
Copy link
Contributor

ctb commented Jul 9, 2024

Hi Gabri, sorry for ignoring your issue for so long 😭

Short version - we don't have anything formal for plants, BUT if you can find a listing of all the things you want - maybe an assembly_summary file? - we can put together a recipe for sketching it quickly. Sound good?

@gabridinosauro
Copy link
Author

Hi Titus,
sounds great. Here attached the list of all plant genomes marked as reference in the genbank, with their accession numbers and taxID. Is that fine or you need any other info?

Thanks a lot again!
Gabri

Rerence_genomes_plants_genbank.csv

@gabridinosauro
Copy link
Author

Hi Titus,

Any chances of further progresses on this? I would love to help if needed?

Thanks! Gabri

@bluegenes
Copy link
Contributor

bluegenes commented Dec 10, 2024

Hi Gabri,

I went ahead and made this database for you. You can download the files here:

K-mer size Zipfile collection
k21 download (7G)
k31 download (8.8G)
k51 download (11G)

Lineage spreadsheet for sourmash tax commands: download

Two genbank accessions did not have genome FASTAs associated with them. They are:

accession name ftp_path
GCA_033675215.1 GCA_033675215.1 ASM3367521v1 Rhodiola kirilowii https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/033/675/215/GCA_033675215.1_ASM3367521v1
GCA_033675395.1 GCA_033675395.1 ASM3367539v1 Rhodiola chrysanthemifolia https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/033/675/395/GCA_033675395.1_ASM3367539v1

The commands and files I used are here: https://github.com/bluegenes/2024-ds-plant.
In short, I used our directsketch plugin. I first created a compatible input file from your csv, which looked like this:

accession,name,ftp_path
GCF_000001735.4,GCF_000001735.4 TAIR10.1 Arabidopsis thaliana,
GCF_030704535.1,GCF_030704535.1 ASM3070453v1 Vitis vinifera,
GCF_022201045.2,GCF_022201045.2 DVS_A1.0 Citrus sinensis,

I used your first 3 columns for the "name".

Then, I ran gbsketch:

sourmash scripts gbsketch Reference_genomes_plants_genbank.directsketch.csv -c 1 -r 1 --genomes-only -o Reference_genomes_plants_genbank.directsketch.zip --failed Reference_genomes_plants_genbank.directsketch.failed.csv --checksum-fail Reference_genomes_plants_genbank.directsketch.checksum-fail.csv --batch-size 200

to help with restart since there are large genomes, I use a batching process which actually yields a number of output zips. For the final db, I ran sourmash sig cat to unite the batches. I also used a parameter -n 2 to download fewer genomes simultaneously since the are so large (helped with memory usage -- my first run-through got killed at 100G!). Unfortunately, that is currently only available in a development branch for now, but will be released soon.

There are a lot of very large genomes in here! It took a few days to run 😅 . If you have an updated version with any genomes that have been added since July, I'm happy to sketch those and update this db for you.

Please let me know how it goes - I haven't actually tested these with anything yet!

Tessa

@gabridinosauro
Copy link
Author

@bluegenes this is sooo cool!!! thanks so much!!! @ctb and thanks Titus too you rock!!!!! I will let you know how it goes!

@gabridinosauro
Copy link
Author

Hello @bluegenes. One last question. I successfully run gather between my metagenomes and the databases. Now I want to run taxonomy. But I get an error that the csv file is not formatted correctly. Nevertheless, i have troubles finding the correct way to format it from the instruction website.

This is the command I run and the error I get:

sourmash tax annotate -g ${infile} --taxonomy Rerence_genomes_plants_genbank.csv

error:

== This is sourmash version 4.6.1. ==

[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==


[KERROR: cannot read taxonomy assignments from '/xdisk/laubitz/schiro/sourmash_plant/Rerence_genomes_plants_genbank.csv': No taxonomic identifiers found; headers are '\ufeffAssembly Accession','Assembly Name','Organism Name','Organism Taxonomic ID','Organism Infraspecific Names Breed','Organism Infraspecific Names Strain','Organism Infraspecific Names Cultivar','Organism Infraspecific Names Ecotype','Organism Infraspecific Names Isolate','Organism Infraspecific Names Sex','Annotation Name','Assembly Stats Total Sequence Length','Assembly Stats Total Number of Chromosomes','Assembly Level','Assembly Release Date','WGS project accession','Assembly Stats Number of Scaffolds'

@ctb
Copy link
Contributor

ctb commented Dec 26, 2024

hi @gabridinosauro sorry, we need to format the taxdb for you/make the lineages CSV - there are some instructions in the plant repo mentioned above, but I'm not sure if they'll work. It's on our list!

@bluegenes
Copy link
Contributor

@gabridinosauro I made you a tax file that should work, though I didn't test it. Try downloading here: https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/genbank-plant-2024-07/genbank-plants-2024-07.lineages.csv.gz

let me know if it works!

@gabridinosauro
Copy link
Author

it worked thanks!!! I downloaded the file from the github repo tho, not from that link because I did not have permissions from that link. Overall, the method seem to work although I seem to get some false positives. I get hits of tropical plants when my wild rodents are living in Canada. I think it might be because there is bacterial contamination in plant genomes?
I am trying now to use kmer 51 instead of 31.
Another idea would be to use chloroplast genomes instead, I can try building my own database.

@ctb
Copy link
Contributor

ctb commented Jan 13, 2025

hi @gabridinosauro we have some new code that makes it easy to sketch all genomes under any given NCBI taxonomic rank; it's not particularly usable by others yet, but I wanted to link it for you, since you were part of the inspiration. see: #3487

basically, if you can locate one or more NCBI taxonomic IDs under which you want all the things sketched, we can produce said database :)

@ctb
Copy link
Contributor

ctb commented Jan 22, 2025

@gabridinosauro updated reference databases described here: #3504

@ctb ctb closed this as completed Jan 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants