minimal test dataset #41

kfuku52 · 2021-06-16T12:02:38Z

To make testing easier and quicker, we should find a minimal dataset and use it to run most, if not all, functions from metadata to curate. Ideally:

small file size of .sra: some bacterial dataset?
multiple BioProjects: 2 or 3?
2 species
reference transcriptome fasta files are downloadable, maybe from amalgkit repo
pre-calculated orthofinder outputs are downloadable, maybe from amalgkit repo


Hego-CCTB · 2021-06-16T12:09:36Z

One thing of note here is that curate currently wants tissues as input. While condition or strain would be functionally equivalent to tissue, we'll need to adjust curate to look at different columns if prompted to do so.

kfuku52 · 2021-06-16T12:15:58Z

Good point. How about adding a new column such as curate_group in the metadata table? tissue can be copied as default values but users can manually modify it to include other categories such as treatment, sex, genotype...whatever they want.

kfuku52 · 2021-06-16T12:16:54Z

... and, of course, amalgkit curate uses curate_group instead of tissue.

Hego-CCTB · 2021-06-16T12:22:55Z

Adding curate_group should probably be done all the way back in amalgkit metadata.
Also definitely needs an explanation in the wiki.

kfuku52 · 2021-06-16T12:27:18Z

Right, could you do it?

Hego-CCTB · 2021-06-16T12:29:51Z

on it right now, should be a quick adjustment! I'll probably add the curate_group column during or after the group_tissues_by_config call and just copy the tissue column over.
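The idea of copying tissue into a new curate_group column could be sketched roughly as follows. This is only an illustration in plain Python, not the amalgkit implementation: the helper name add_curate_group, the run column, and the toy rows are all made up for the example, and preserving values a user has already filled in is an assumption.

```python
import csv
import io


def add_curate_group(rows, source_column="tissue", target_column="curate_group"):
    """Return copies of rows with target_column filled from source_column.

    Existing non-empty values are kept (an assumption), so manual edits
    such as treatment, sex, or genotype are not overwritten.
    """
    updated = []
    for row in rows:
        new_row = dict(row)
        if not new_row.get(target_column):
            new_row[target_column] = new_row.get(source_column, "")
        updated.append(new_row)
    return updated


# Hypothetical two-row metadata table; the values are placeholders.
metadata_tsv = "run\ttissue\nrun_1\troot\nrun_2\tleaf\n"
rows = list(csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t"))
rows = add_curate_group(rows)

for row in rows:
    print(row["run"], row["tissue"], row["curate_group"])
```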

Hego-CCTB · 2021-06-16T13:20:19Z

Amalgkit version 0.5.1.0:

metadata now introduces curate_group column. By default, this contains the tissue column data
curate now uses curate_group column instead of tissue
curate --tissues is now obsolete
curate --curate_group takes its place, input is unchanged
f5665c6

kfuku52 · 2021-06-16T13:23:48Z

Thanks. Please describe the default behavior of --curate_group. I assume all values will be included by default.

List of curate_group values of the curate_group metadata column to be included

Hego-CCTB · 2021-06-16T13:29:23Z

amalgkit curate --curate_group works identically to how amalgkit curate --tissues worked. It was just renamed to avoid confusion.
A typical command would look like:
amalgkit curate --curate_group "root,flower,leaf" [additional arguments]

Within the R script, this input string is split and read into a vector selected_tissues, which is then passed to the main algorithm, for example to check_whithin_tissue_correlation.
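The splitting step happens in R in the actual tool; a rough Python equivalent of the same idea is shown here purely for illustration. The function name parse_curate_group is hypothetical, and trimming whitespace around each entry is an assumption that the R script may or may not share.

```python
def parse_curate_group(value):
    """Split a comma-separated --curate_group string into a list of names.

    Spaces inside a name ("heat stressed") are kept; surrounding
    whitespace and empty entries are dropped (an assumption).
    """
    return [item.strip() for item in value.split(",") if item.strip()]


selected_tissues = parse_curate_group("root,flower,leaf")
print(selected_tissues)
```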

EDIT:
To add to this, --curate_group (like --tissues) does not have a default input, but is required. Theoretically, it would be possible to read the selected tissues/conditions/whatever from this column as the default, but this can cause all kinds of problems, especially when the metadata sheet contains data from multiple species, or has typos in the column, unused SRR entries, etc.
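The problems mentioned here (typos, several species in one sheet) can be made visible with a small check that compares the requested values against what the column actually contains. The sketch below is not part of amalgkit: the helper names are hypothetical, the column names scientific_name and curate_group are assumed, and the rows are toy data.

```python
from collections import defaultdict


def groups_by_species(rows):
    """Collect the distinct curate_group values seen for each species."""
    found = defaultdict(set)
    for row in rows:
        found[row["scientific_name"]].add(row["curate_group"])
    return dict(found)


def missing_groups(rows, requested):
    """Return requested values that never occur in the curate_group column."""
    present = {row["curate_group"] for row in rows}
    return [group for group in requested if group not in present]


# Toy rows: "laef" is a deliberate typo, and species_B lacks "root".
rows = [
    {"scientific_name": "species_A", "curate_group": "root"},
    {"scientific_name": "species_A", "curate_group": "laef"},
    {"scientific_name": "species_B", "curate_group": "flower"},
]

print(groups_by_species(rows))
print(missing_groups(rows, ["root", "flower", "leaf"]))
```

With the column used as an implicit default, the typo "laef" would silently become its own group; with an explicit list, the same typo shows up as a requested value that matches nothing.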

kfuku52 · 2021-06-16T13:53:53Z

OK, it makes sense to require --curate_group. Could you describe it in the option? Currently, it's not clear enough (see below). You can provide an example, otherwise, users cannot even know what separator they should use.

List of curate_group values of the curate_group metadata column to be included

Hego-CCTB · 2021-06-16T14:32:26Z

Yeah, that's fair.
What about:

"Comma-separated list of values contained in the curate_group metadata column to be included in the analysis. Example input may look like "root,flower,leaf" or "heat stressed,cold stressed,light stressed"."
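For illustration, a required option with this kind of help text could be declared with Python's argparse as below. This is a standalone sketch, not amalgkit's actual parser: the program name curate-sketch and the build_parser helper are made up.

```python
import argparse


def build_parser():
    """Build a toy parser with a required, documented --curate_group option."""
    parser = argparse.ArgumentParser(prog="curate-sketch")
    parser.add_argument(
        "--curate_group",
        required=True,  # no default: the user must state the groups explicitly
        metavar="STR",
        help=(
            "Comma-separated list of values contained in the curate_group "
            "metadata column to be included in the analysis. Example input "
            'may look like "root,flower,leaf" or '
            '"heat stressed,cold stressed,light stressed".'
        ),
    )
    return parser


args = build_parser().parse_args(["--curate_group", "root,flower,leaf"])
print(args.curate_group)
```

Because the option is marked as required, calling the parser without it exits with a usage error instead of silently falling back to a guess.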

kfuku52 · 2021-06-16T14:33:44Z

Looks good!

Hego-CCTB · 2021-06-16T14:36:28Z

Updated in Ver. 0.5.1.2!
8579749

kfuku52 · 2021-07-05T10:13:10Z

@Hego-CCTB Please add any other factors which we should take into account in an ideal test dataset. I'll look for it when I have time.

small file size of .sra: some bacterial dataset?

multiple BioProjects: 2 or 3?

2 species

reference transcriptome fasta files are downloadable, maybe from amalgkit repo

pre-calculated orthofinder outputs are downloadable, maybe from amalgkit repo

Hego-CCTB · 2021-07-05T10:21:10Z

I was looking for some bacterial datasets the other day, using various combinations of

Escherichia coli
Bacillus subtilis
Mycobacterium tuberculosis

with stresses or specific antibiotics as the condition. I was surprised to see that there weren't that many RNA-seq experiments. E. coli produced a couple of hits when running metadata, but the other two species didn't have much to offer.

kfuku52 · 2021-07-05T10:23:38Z

Could you share a summary (maybe a table?) of your survey?

Hego-CCTB · 2021-07-07T10:32:09Z

Test_data_quick_survey.zip
Here is the last amalgkit metadata run I did, along with a summary metadata.tsv. Keywords were: stress, antibiotics, tetracycline. The species were the three I mentioned in the above comment.

I did not anticipate all the different strains, which could be a different problem. In the summary I put in some possible candidate samples, which followed these criteria:

same (or at least similar) treatment in at least 2 species
minimum 2 bioprojects for each species in their respective treatments
must have untreated control sample as well
I tried to have them all be 'wildtype' too, but there would be no candidates left at all

The best I could find was anaerobic/hypoxia stress: Escherichia coli and Mycobacterium tuberculosis each had 2 bioprojects for it, although it might be a stretch to put anaerobic into the same category as hypoxia.
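The selection criteria listed in this comment could be applied to a metadata table roughly as sketched below. It is an illustration only: the column names scientific_name, treatment and bioproject, the "control" label, and the toy rows are assumptions, and the control requirement is interpreted here as "each species must have at least one untreated sample".

```python
from collections import defaultdict

CONTROL = "control"  # assumed label for untreated samples


def candidate_treatments(rows, min_species=2, min_bioprojects=2):
    """Return {treatment: [species, ...]} for treatments meeting the criteria."""
    # species -> treatment -> set of bioprojects
    projects = defaultdict(lambda: defaultdict(set))
    for row in rows:
        projects[row["scientific_name"]][row["treatment"]].add(row["bioproject"])

    qualifying = defaultdict(set)  # treatment -> species that qualify
    for species, by_treatment in projects.items():
        if CONTROL not in by_treatment:
            continue  # this species has no untreated control
        for treatment, bioprojects in by_treatment.items():
            if treatment != CONTROL and len(bioprojects) >= min_bioprojects:
                qualifying[treatment].add(species)

    return {
        treatment: sorted(species_set)
        for treatment, species_set in qualifying.items()
        if len(species_set) >= min_species
    }


# Toy rows with placeholder names; not taken from the survey.
toy = [
    ("species_A", "hypoxia", "P1"), ("species_A", "hypoxia", "P2"),
    ("species_A", "control", "P1"),
    ("species_B", "hypoxia", "P3"), ("species_B", "hypoxia", "P4"),
    ("species_B", "control", "P3"), ("species_B", "heat", "P4"),
]
rows = [dict(zip(("scientific_name", "treatment", "bioproject"), t)) for t in toy]
result = candidate_treatments(rows)
print(result)
```

Whether anaerobic and hypoxia count as the same treatment would have to be decided before such a filter is run, for example by mapping both labels to one value in the treatment column.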

kfuku52 · 2021-07-07T14:45:04Z

Thank you. E. coli looks promising as expected. I'll search for other species that are suitable for the comparison.

kfuku52 · 2023-03-07T08:31:28Z

@Hego-CCTB I will take care of it if you don't have time.

Hego-CCTB · 2023-03-29T09:44:19Z

Yes, please help me out with this issue!

Hego-CCTB · 2024-01-16T10:09:41Z

I'd like to create a full bacterial dataset for the paper this week, so we may just be able to use a subset for this issue.

kfuku52 assigned Hego-CCTB Jun 16, 2021

kfuku52 mentioned this issue Jun 16, 2021

num_read_fastp #25

Closed

kfuku52 mentioned this issue Jul 5, 2021

Error in h(simpleError(msg, call)) #61

Closed

kfuku52 assigned kfuku52 and unassigned Hego-CCTB Mar 29, 2023
