
Add notebook for downloading McFarland 2020 Figure 1 data #2

Open · wants to merge 1 commit into base: main

Conversation

ethanweinberger
Contributor

This PR adds a Jupyter notebook to download the data from McFarland et al., 2020 used to produce Figure 1 (i.e., the response to idasanutlin and to the DMSO control for different cell lines). It also adds a `utils.py` file to the datasets folder containing reusable functions for downloading and preprocessing.
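As a rough illustration of the kind of reusable helper such a `utils.py` could contain, here is a minimal download function with local caching. The function name, signature, and behavior are assumptions for illustration, not the actual contents of the PR:

```python
import os
import urllib.request


def download_file(url: str, dest_path: str, overwrite: bool = False) -> str:
    """Download `url` to `dest_path`, skipping the download if the file
    already exists locally (so notebooks can be re-run cheaply)."""
    if os.path.exists(dest_path) and not overwrite:
        return dest_path  # already cached locally; no network access needed
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    urllib.request.urlretrieve(url, dest_path)
    return dest_path
```

A notebook would then call `download_file(<accession URL>, "data/raw_counts.csv")` as its first cell, keeping the raw-data provenance explicit and reproducible.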

A couple of things should probably be hashed out before this gets merged:

  1. What's the unit of abstraction that each data notebook should cover? For example, for this notebook I only included the data used to produce Fig. 1c in McFarland et al., 2020 as opposed to all of the data. This was in part because I already had code for this subset of the data ready to go, but also because it might get unwieldy to include all metadata values for all of the data even when they're not necessary (e.g. TP53 mutation status might not be relevant outside of the nutlin experiments).
  2. Similar to 1., how much of the data processing lifecycle should each notebook cover? In my PR I include downloading the raw data as part of the notebook, but I see some notebooks in the repo start off from an h5ad file.
  3. Is there a standard preprocessing/quality control workflow for all of the datasets or is the plan to do things more ad-hoc for each dataset? For now the anndata object in my notebook just contains raw counts.

@review-notebook-app

Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.

ethanweinberger pushed a commit to ethanweinberger/sc-pert that referenced this pull request Mar 25, 2022
This PR adds a notebook to download + preprocess the Norman 2019
dataset starting directly from downloading the raw counts. The
notebook currently downloads the data, and fills in various metadata
values. I made this PR because the current Norman 2019 notebook depends
on downloading another h5ad file first; I personally like being
able to see the full workflow (i.e., going from author-provided
files to the final anndata) as part of the notebooks.

As mentioned in theislab#2, I'm
not sure which QC steps you prefer, so this notebook simply produces
an anndata with raw counts.
@yugeji
Member

yugeji commented Mar 31, 2022

  1. What's the unit of abstraction that each data notebook should cover? For example, for this notebook I only included the data used to produce Fig. 1c in McFarland et al., 2020 as opposed to all of the data. This was in part because I already had code for this subset of the data ready to go, but also because it might get unwieldy to include all metadata values for all of the data even when they're not necessary (e.g. TP53 mutation status might not be relevant outside of the nutlin experiments).
  2. Similar to 1., how much of the data processing lifecycle should each notebook cover? In my PR I include downloading the raw data as part of the notebook, but I see some notebooks in the repo start off from an h5ad file.
  3. Is there a standard preprocessing/quality control workflow for all of the datasets or is the plan to do things more ad-hoc for each dataset? For now the anndata object in my notebook just contains raw counts.

Hey Ethan, great questions! I'll post the answers here for now, but ideally there would be documentation somewhere other than an obscure template.ipynb notebook.

  1. As of now, for each dataset we define an [author_year].ipynb and an [author_year]_curation.ipynb notebook. The intention is that [author_year]_curation.ipynb contains what you've currently pushed for Norman19 (accession link to .h5ad) and [author_year].ipynb contains all the preprocessing that happens to the anndata object afterwards. By the end of [author_year]_curation.ipynb, you should have an anndata which contains all author-provided metadata labels, gene names, and a raw count matrix.
    The thought process behind this is that some users may want to do the preprocessing themselves, while others may want to download several datasets knowing they've all been preprocessed similarly (e.g., when training machine learning models).
  2. Hopefully answered in 1: [author_year]_curation.ipynb notebooks should start with the exact command to download the file. The idea is that the notebook should contain everything a user needs to exactly reproduce, from a publicly available source, the data linked from the repository.
  3. There is currently a notebook called template.ipynb which calls code from the repo. Copying the notebook and adapting it to your dataset is the expected amount of standardization.
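To make the curation convention above concrete, the metadata-standardization step at the tail end of an [author_year]_curation.ipynb might look roughly like this. The column names, the `STANDARD_FIELDS` mapping, and the helper function are illustrative assumptions, not the repo's actual schema:

```python
import pandas as pd

# Hypothetical mapping from author-provided metadata column names to
# repo-standard field names (illustrative; not the actual schema).
STANDARD_FIELDS = {
    "cell_line": "cell_line",
    "drug": "perturbation",
    "dosage": "dose_value",
}


def standardize_obs(obs: pd.DataFrame) -> pd.DataFrame:
    """Rename author-provided per-cell metadata columns to standard names,
    failing loudly if an expected column is missing."""
    missing = set(STANDARD_FIELDS) - set(obs.columns)
    if missing:
        raise KeyError(f"expected author columns not found: {missing}")
    return obs.rename(columns=STANDARD_FIELDS)


# Toy per-cell metadata as an author might provide it.
obs = pd.DataFrame({
    "cell_line": ["A375", "A375"],
    "drug": ["idasanutlin", "DMSO"],
    "dosage": [1.0, 0.0],
})
obs_std = standardize_obs(obs)
```

The standardized frame would then be attached as the `.obs` of an AnnData object holding the raw count matrix, which is what the curation notebook writes out as its final .h5ad.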

yugeji pushed a commit that referenced this pull request Mar 31, 2022
* Add Norman 2019 notebook with more details

This PR adds a notebook to download + preprocess the Norman 2019
dataset starting directly from downloading the raw counts. The
notebook currently downloads the data, and fills in various metadata
values. I made this PR because the current Norman 2019 notebook depends
on downloading another h5ad file first; I personally like being
able to see the full workflow (i.e., going from author-provided
files to the final anndata) as part of the notebooks.

As mentioned in #2, I'm
not sure which QC steps you prefer, so this notebook simply produces
an anndata with raw counts.

* Add standard metadata fields

* standardize naming

Authored-by: Ethan Weinberger <[email protected]>
@ethanweinberger
Contributor Author

Got it; the distinction between the curation/preprocessing notebooks makes sense to me.

Based on that distinction, it makes sense to have the mcfarland_2020_curation notebook grab all of the potentially useful data/metadata, and then people can subset it later if they want. I'll update this PR sometime in the next few days.

@ethanweinberger
Contributor Author

Closing since this is taken care of by `mcfarland_2020_curation.ipynb`.

@ethanweinberger
Contributor Author

Reopening per @yugeji's request

2 participants