minimal test dataset #41

Open
kfuku52 opened this issue Jun 16, 2021 · 21 comments

@kfuku52
Owner

kfuku52 commented Jun 16, 2021

To make testing easier and quicker, we should find a minimal dataset and use it to run most, if not all, of the functions from metadata to curate. Ideally, it would have:

  • small file size of .sra: some bacterial dataset?
  • multiple BioProjects: 2 or 3?
  • 2 species
  • reference transcriptome fasta files are downloadable, maybe from amalgkit repo
  • pre-calculated orthofinder outputs are downloadable, maybe from amalgkit repo
@Hego-CCTB
Collaborator

One thing of note here is that curate currently expects tissues as input. While condition or strain would be functionally equivalent to tissue, we'll need to adjust curate so that it can look at a different column when asked to.

@kfuku52
Owner Author

kfuku52 commented Jun 16, 2021

Good point. How about adding a new column such as curate_group to the metadata table? The tissue values can be copied in as defaults, but users can manually modify the column to hold other categories such as treatment, sex, genotype... whatever they want.

@kfuku52
Owner Author

kfuku52 commented Jun 16, 2021

... and, of course, amalgkit curate uses curate_group instead of tissue.
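A minimal sketch of this idea in Python, not the actual amalgkit code: a curate_group column is filled from tissue wherever it is empty, and values a user has already set are left alone. The function name add_curate_group, the run column, and the SRR identifiers are made up for illustration.

```python
import csv
import io


def add_curate_group(rows):
    """Return copies of the metadata rows with a curate_group column.

    Non-empty curate_group values are kept; otherwise tissue is copied over.
    """
    out = []
    for row in rows:
        new_row = dict(row)
        if not new_row.get("curate_group"):
            new_row["curate_group"] = new_row.get("tissue", "")
        out.append(new_row)
    return out


# Hypothetical two-row metadata table (tab-separated); not real SRA runs.
metadata_tsv = "run\ttissue\nSRR0000001\troot\nSRR0000002\tleaf\n"
rows = list(csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t"))
rows = add_curate_group(rows)

# A user can then overwrite the default by hand, e.g. with a treatment.
rows[1]["curate_group"] = "heat stressed"
```

Keeping tissue untouched and copying it into a separate column means the original annotation is never lost when users regroup their samples.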

@Hego-CCTB
Collaborator

Adding curate_group should probably be done all the way back in amalgkit metadata.
Also definitely needs an explanation in the wiki.

@kfuku52
Owner Author

kfuku52 commented Jun 16, 2021

Right, could you do it?

@Hego-CCTB
Collaborator

Hego-CCTB commented Jun 16, 2021

on it right now, should be a quick adjustment! I'll probably add the curate_group column during or after the group_tissues_by_config call and just copy the tissue column over.

@Hego-CCTB
Collaborator

Amalgkit version 0.5.1.0:

  • metadata now introduces curate_group column. By default, this contains the tissue column data
  • curate now uses curate_group column instead of tissue
  • curate --tissues is now obsolete
  • curate --curate_group takes its place, input is unchanged
    f5665c6

@kfuku52
Owner Author

kfuku52 commented Jun 16, 2021

Thanks. Please describe the default behavior of --curate_group. I assume all values will be included by default.

List of curate_group values of the curate_group metadata column to be included

@Hego-CCTB
Collaborator

Hego-CCTB commented Jun 16, 2021

amalgkit curate --curate_group works exactly as amalgkit curate --tissues did. It was just renamed to avoid confusion.
A typical command would look like:
amalgkit curate --curate_group "root,flower,leaf" [additional arguments]

Within the R script, this input string is split and read into a vector, selected_tissues, which is then passed to the main algorithm, for example to check_whithin_tissue_correlation.

EDIT:
To add to this: --curate_group (like --tissues) has no default value and is required. In theory, the selected tissues/conditions/whatever could be read from this column by default, but that can cause all kinds of problems, especially when the metadata sheet contains data from multiple species, has typos in the column, includes unused SRR entries, etc.
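The splitting and the reasoning for requiring explicit input can be sketched in Python. This is only an illustration of the logic, not the R code amalgkit uses; the function names, the whitespace trimming, and the error on unknown values are assumptions.

```python
def parse_curate_group(arg):
    """Split a comma-separated --curate_group string into a list of groups."""
    groups = [g.strip() for g in arg.split(",") if g.strip()]
    if not groups:
        raise ValueError("--curate_group must name at least one group")
    return groups


def select_rows(rows, groups):
    """Keep only rows whose curate_group was requested.

    Raises if a requested group is absent from the column, which catches
    typos that a read-everything default would silently accept.
    """
    present = {row["curate_group"] for row in rows}
    missing = [g for g in groups if g not in present]
    if missing:
        raise ValueError(f"groups not found in metadata: {missing}")
    return [row for row in rows if row["curate_group"] in groups]


# Hypothetical metadata rows; "laef" stands in for a typo in the sheet.
rows = [
    {"run": "SRR0000001", "curate_group": "root"},
    {"run": "SRR0000002", "curate_group": "leaf"},
    {"run": "SRR0000003", "curate_group": "laef"},
]
selected = select_rows(rows, parse_curate_group("root,leaf"))
```

With an explicit list, the mistyped row is simply left out of the analysis instead of becoming its own group.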

@kfuku52
Owner Author

kfuku52 commented Jun 16, 2021

OK, it makes sense to require --curate_group. Could you describe it in the option? Currently, it's not clear enough (see below). You can provide an example, otherwise, users cannot even know what separator they should use.

List of curate_group values of the curate_group metadata column to be included

@Hego-CCTB
Collaborator

Hego-CCTB commented Jun 16, 2021

Yeah, that's fair.
What about:

"Comma-separated list of values from the curate_group metadata column to be included in the analysis. Example input: "root,flower,leaf" or "heat stressed,cold stressed,light stressed"."
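For illustration, here is how a required option with this kind of help text could be declared with Python's argparse. The parser below is a stand-alone sketch; the program name curate-sketch and the way amalgkit actually wires up its options are assumptions.

```python
import argparse


def build_parser():
    """Build a toy parser with a required, documented --curate_group option."""
    parser = argparse.ArgumentParser(prog="curate-sketch")
    parser.add_argument(
        "--curate_group",
        required=True,  # no default: the user must name the groups explicitly
        metavar="STR",
        help=(
            "Comma-separated list of values from the curate_group metadata "
            "column to be included in the analysis. Example input: "
            '"root,flower,leaf" or '
            '"heat stressed,cold stressed,light stressed".'
        ),
    )
    return parser


args = build_parser().parse_args(["--curate_group", "root,flower,leaf"])
```

Because the option is marked required, running the parser without it exits with a usage error rather than falling back to a guess.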

@kfuku52
Owner Author

kfuku52 commented Jun 16, 2021

Looks good!

@Hego-CCTB
Collaborator

Updated in Ver. 0.5.1.2!
8579749

@kfuku52
Owner Author

kfuku52 commented Jul 5, 2021

@Hego-CCTB Please add any other factors which we should take into account in an ideal test dataset. I'll look for it when I have time.

  • small file size of .sra: some bacterial dataset?
  • multiple BioProjects: 2 or 3?
  • 2 species
  • reference transcriptome fasta files are downloadable, maybe from amalgkit repo
  • pre-calculated orthofinder outputs are downloadable, maybe from amalgkit repo

@Hego-CCTB
Collaborator

I was looking for bacterial datasets the other day, using various combinations of

  • Escherichia coli
  • Bacillus subtilis
  • Mycobacterium tuberculosis

with stresses or specific antibiotics as the condition. I was surprised to see that there weren't that many RNA-seq experiments. E. coli produced a couple of hits when running metadata, but the other two species didn't have much to offer.

@kfuku52
Owner Author

kfuku52 commented Jul 5, 2021

Could you share a summary (maybe a table?) of your survey?

@Hego-CCTB
Collaborator

Test_data_quick_survey.zip
Here is the last amalgkit metadata run I did, along with a summary metadata.tsv. Keywords were: stress, antibiotics, tetracycline. The species were the three I mentioned in the above comment.

I did not anticipate all the different strains, which could be a different problem. In the summary I put in some possible candidate samples, which followed these criteria:

  • same (or at least similar) treatment in at least 2 species
  • minimum 2 bioprojects for each species in their respective treatments
  • must have untreated control sample as well
  • I tried to have them all be 'wildtype' too, but there would be no candidates left at all

The best I could find was anaerobic/hypoxia stress: Escherichia coli and Mycobacterium tuberculosis each had 2 BioProjects for it. It might be a stretch, though, to put anaerobic into the same category as hypoxia.
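The selection criteria above could be applied to a metadata summary roughly as follows. This is a sketch with placeholder records (species_A, PRJ_1, and so on are not real entries from the survey), and it assumes the control requirement applies per species.

```python
from collections import defaultdict


def candidate_treatments(samples, min_species=2, min_bioprojects=2,
                         control="control"):
    """Return {treatment: [species, ...]} for treatments meeting the criteria.

    A species counts for a treatment if it has at least `min_bioprojects`
    distinct BioProjects with that treatment and at least one control sample.
    A treatment qualifies if at least `min_species` species count for it.
    """
    projects = defaultdict(set)  # (treatment, species) -> set of BioProjects
    has_control = set()          # species with an untreated control sample
    for s in samples:
        if s["treatment"] == control:
            has_control.add(s["species"])
        else:
            projects[(s["treatment"], s["species"])].add(s["bioproject"])

    by_treatment = defaultdict(list)
    for (treatment, species), bioprojects in projects.items():
        if len(bioprojects) >= min_bioprojects and species in has_control:
            by_treatment[treatment].append(species)

    return {t: sorted(sp) for t, sp in by_treatment.items()
            if len(sp) >= min_species}


# Placeholder records only; these are not results from the survey.
samples = [
    {"species": "species_A", "bioproject": "PRJ_1", "treatment": "hypoxia"},
    {"species": "species_A", "bioproject": "PRJ_2", "treatment": "hypoxia"},
    {"species": "species_A", "bioproject": "PRJ_1", "treatment": "control"},
    {"species": "species_B", "bioproject": "PRJ_3", "treatment": "hypoxia"},
    {"species": "species_B", "bioproject": "PRJ_4", "treatment": "hypoxia"},
    {"species": "species_B", "bioproject": "PRJ_3", "treatment": "control"},
    {"species": "species_B", "bioproject": "PRJ_3", "treatment": "heat"},
]
candidates = candidate_treatments(samples)
```

Whether anaerobic and hypoxia samples share one treatment label is a manual decision made before this step; the function only compares the labels it is given.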

@kfuku52
Owner Author

kfuku52 commented Jul 7, 2021

Thank you. E. coli looks promising as expected. I'll search for other species that are suitable for the comparison.

@kfuku52
Owner Author

kfuku52 commented Mar 7, 2023

@Hego-CCTB I will take care of it if you don't have time.

@Hego-CCTB
Collaborator

Yes, please help me out with this issue!

@kfuku52 kfuku52 assigned kfuku52 and unassigned Hego-CCTB Mar 29, 2023
@Hego-CCTB
Collaborator

I'd like to create a full bacterial dataset for the paper this week, so we may just be able to use a subset for this issue.
