
Add notebook for downloading McFarland 2020 Figure 1 data #2

Open · wants to merge 1 commit into base: main

Conversation

ethanweinberger
Contributor

This PR adds a Jupyter notebook to download the data from McFarland et al., 2020 used to produce Figure 1 (i.e., the response to idasanutlin and to the DMSO control for different cell lines). It also adds a `utils.py` file to the datasets folder containing reusable functions for downloading and preprocessing.
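As a rough illustration of the kind of reusable helper such a `utils.py` could contain, here is a minimal download function with local caching. The function name, signature, and behavior are assumptions for illustration, not the actual contents of the PR:

```python
import os
import urllib.request


def download_file(url: str, dest_path: str, overwrite: bool = False) -> str:
    """Download `url` to `dest_path`, skipping the download if the file
    already exists locally (so notebooks can be re-run cheaply)."""
    if os.path.exists(dest_path) and not overwrite:
        return dest_path  # already cached locally; no network access needed
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    urllib.request.urlretrieve(url, dest_path)
    return dest_path
```

A notebook would then call `download_file(<accession URL>, "data/raw_counts.csv")` as its first cell, keeping the raw-data provenance explicit and reproducible.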

A couple of things should probably be hashed out before this gets merged:

  1. What's the unit of abstraction that each data notebook should cover? For example, for this notebook I only included the data used to produce Fig. 1c in McFarland et al., 2020 as opposed to all of the data. This was in part because I already had code for this subset of the data ready to go, but also because it might get unwieldy to include all metadata values for all of the data even when they're not necessary (e.g. TP53 mutation status might not be relevant outside of the nutlin experiments).
  2. Similar to 1., how much of the data processing lifecycle should each notebook cover? In my PR I include downloading the raw data as part of the notebook, but I see some notebooks in the repo start off from an h5ad file.
  3. Is there a standard preprocessing/quality control workflow for all of the datasets or is the plan to do things more ad-hoc for each dataset? For now the anndata object in my notebook just contains raw counts.

@review-notebook-app

Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.

ethanweinberger pushed a commit to ethanweinberger/sc-pert that referenced this pull request Mar 25, 2022
This PR adds a notebook to download + preprocess the Norman 2019
dataset starting directly from downloading the raw counts. The
notebook currently downloads the data, and fills in various metadata
values. I made this PR because the current Norman 2019 notebook depends
on downloading another h5ad file first; I personally like being
able to see the full workflow (i.e., going from author-provided
files to the final anndata) as part of the notebooks.

As mentioned in theislab#2, I'm
not sure which QC steps you prefer, so this notebook simply produces
an anndata with raw counts.
@yugeji
Member

yugeji commented Mar 31, 2022

  1. What's the unit of abstraction that each data notebook should cover? For example, for this notebook I only included the data used to produce Fig. 1c in McFarland et al., 2020 as opposed to all of the data. This was in part because I already had code for this subset of the data ready to go, but also because it might get unwieldy to include all metadata values for all of the data even when they're not necessary (e.g. TP53 mutation status might not be relevant outside of the nutlin experiments).
  2. Similar to 1., how much of the data processing lifecycle should each notebook cover? In my PR I include downloading the raw data as part of the notebook, but I see some notebooks in the repo start off from an h5ad file.
  3. Is there a standard preprocessing/quality control workflow for all of the datasets or is the plan to do things more ad-hoc for each dataset? For now the anndata object in my notebook just contains raw counts.

Hey Ethan, great questions! I'll post the answers here for now, but ideally there would be documentation somewhere other than an obscure template.ipynb notebook.

  1. As of now, for each dataset we define an [author_year].ipynb and an [author_year]_curation.ipynb notebook. The intention is that [author_year]_curation.ipynb contains what you've currently pushed for Norman19 (accession link to .h5ad) and [author_year].ipynb contains all the preprocessing that happens to the anndata object afterwards. By the end of [author_year]_curation.ipynb, you should have an anndata which contains all author-provided metadata labels, gene names, and a raw count matrix.
    The thought process behind this is that some users may want to do the preprocessing themselves, while others may want to download several datasets knowing they've all been preprocessed similarly (e.g., when training machine learning models).
  2. Hopefully answered in 1: [author_year]_curation.ipynb notebooks should start with the exact command to download the file. The idea is that the notebook should contain everything a user needs to exactly reproduce, from a publicly available source, the data linked from the repository.
  3. There is currently a notebook called template.ipynb which calls code from the repo. Copying the notebook and adapting it to your dataset is the expected amount of standardization.
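To make the curation convention above concrete, the metadata-standardization step at the tail end of an [author_year]_curation.ipynb might look roughly like this. The column names, the `STANDARD_FIELDS` mapping, and the helper function are illustrative assumptions, not the repo's actual schema:

```python
import pandas as pd

# Hypothetical mapping from author-provided metadata column names to
# repo-standard field names (illustrative; not the actual schema).
STANDARD_FIELDS = {
    "cell_line": "cell_line",
    "drug": "perturbation",
    "dosage": "dose_value",
}


def standardize_obs(obs: pd.DataFrame) -> pd.DataFrame:
    """Rename author-provided per-cell metadata columns to standard names,
    failing loudly if an expected column is missing."""
    missing = set(STANDARD_FIELDS) - set(obs.columns)
    if missing:
        raise KeyError(f"expected author columns not found: {missing}")
    return obs.rename(columns=STANDARD_FIELDS)


# Toy per-cell metadata as an author might provide it.
obs = pd.DataFrame({
    "cell_line": ["A375", "A375"],
    "drug": ["idasanutlin", "DMSO"],
    "dosage": [1.0, 0.0],
})
obs_std = standardize_obs(obs)
```

The standardized frame would then be attached as the `.obs` of an AnnData object holding the raw count matrix, which is what the curation notebook writes out as its final .h5ad.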

yugeji pushed a commit that referenced this pull request Mar 31, 2022
* Add Norman 2019 notebook with more details

This PR adds a notebook to download + preprocess the Norman 2019
dataset starting directly from downloading the raw counts. The
notebook currently downloads the data, and fills in various metadata
values. I made this PR because the current Norman 2019 notebook depends
on downloading another h5ad file first; I personally like being
able to see the full workflow (i.e., going from author-provided
files to the final anndata) as part of the notebooks.

As mentioned in #2, I'm
not sure which QC steps you prefer, so this notebook simply produces
an anndata with raw counts.

* Add standard metadata fields

* standardize naming

Authored-by: Ethan Weinberger <[email protected]>
@ethanweinberger
Contributor Author

Got it; the distinction between the curation/preprocessing notebooks makes sense to me.

Based on that distinction, it makes sense to have the mcfarland_2020_curation notebook grab all of the potentially useful data/metadata, and then people can subset it later if they want. I'll update this PR sometime in the next few days.

@ethanweinberger
Contributor Author

Closing since this is taken care of by `mcfarland_2020_curation.ipynb`.

@ethanweinberger
Contributor Author

Reopening per @yugeji's request

2 participants