Add CDRP dataset as BBBC047 #61

Closed
shntnu opened this issue Nov 12, 2018 · 21 comments

@shntnu
Contributor

shntnu commented Nov 12, 2018

The description can be copied from the abstract of
https://academic.oup.com/gigascience/article/6/12/giw014/2865213
and then link to that page

Additionally, add these notes

ftp://parrot.genomics.cn/gigadb/pub/10.5524/100001_101000/100351/profiles.tar.gz
comprises per-well averages of each extracted morphological feature computed across the cells.

This dataset is a superset of BBBC036 (which has only known bioactive compounds)

@shntnu
Contributor Author

shntnu commented Oct 25, 2019

@N3llz can you let @MarziehHaghighi know the BBBC dataset id for this dataset? I know it isn't going to be up for a while, but we just need a number for now so that she can use it in her Rosetta report.

@N3llz
Contributor

N3llz commented Oct 28, 2019

Yes, let's go with BBBC047. @MarziehHaghighi @shntnu

@N3llz N3llz changed the title Create a new dataset for CDRP dataset Create a new dataset for CDRP dataset (as BBBC047) Oct 28, 2019
@shntnu
Contributor Author

shntnu commented Dec 11, 2019

Question from Gabriel Musso

During our investigation, we wanted to look for any systemic biases and evaluate options to normalize accordingly. We first began looking at plate specific effects, and then progressed towards the plate groups identified (Metadata_Plate_Map_Name in the processed features data set). Sorting these plate groups alphabetically, we noticed there were larger trends present across plate groups. The groups obtained by splitting the plate group names by hyphens (e.g. C-2113-01-D39-002 or H-BIOA-005-3) were highly correlated to plate similarity (see attached figure). However, we weren’t sure what the meaning was behind these labels.

We were hoping that you or someone on your team might be able to help us with the following questions:

  • What is the difference between C and H in the first portion of the label? The C plates seem to group strongly together.
  • What is the meaning behind the remaining portions of these plate groups? Are they freezer locations? Were they run during the same day?
  • Why do some plate groups contain more than 4 plates (e.g. see plates ~130-175 in figure, column 5)?

barcode_platemap_cdrp.txt

@shntnu
Contributor Author

shntnu commented Dec 11, 2019

@shntnu
Contributor Author

shntnu commented Dec 13, 2019

Here's a summary of the prefixes of all the platemaps

# Source: https://raw.githubusercontent.com/broadinstitute/2015_Bray_GigaScience/master/barcode_platemap_25412.csv
library(tidyverse)

# Map each assay plate barcode to its platemap name
plateid_platemap <-
  read_csv("barcode_platemap_25412.csv") %>%
  rename(platemap = Plate_Map_Name,
         plateid = Assay_Plate_Barcode)

# Count rows per platemap prefix (leading letter plus the first hyphen-delimited token, e.g. "C-2113")
plateid_platemap %>%
  mutate(platemap_prefix = str_extract(platemap, "(^[A-Z]-[A-Z0-9]+)")) %>%
  count(platemap_prefix, name = "num_platemaps", sort = TRUE) %>%
  knitr::kable()
platemap_prefix num_platemaps
C-2113 130
H-BIOA 55
H-CBLE 39
H-CBLG 39
H-CBLC 27
H-CBLH 24
H-CBLB 20
H-CBLD 19
H-CBLF 12
H-CBLA 10
H-CBLJ 9
H-CBLO 9
H-CBLN 8
H-CBLP 8
H-CBLK 4
  • H-BIOA is a designation for Broad's bioactives library.
  • H-CBLx are DOS compounds (Diversity Oriented Synthesis, Stuart Schreiber lab)
  • C-2113 plates were created for this project, and most likely come from the MLSMR library described below.

Here's a summary of the categories from the data resource paper

10,080 compounds came from the Molecular Libraries Small Molecule Repository (MLSMR), 2260 were drugs, natural products, and small-molecule probes that are part of the Broad Institute known bioactive compound collection, 269 were confirmed screening hits from the Molecular Libraries Program (MLP), and 18,051 were novel compounds derived from diversity-oriented synthesis.

Also available is the file cdrp_runs.txt, which contains the run ID of each assay plate and the date on which it was run.

The data is in this format:

run_id platemap barcode date_started
2113-01-W01-01-12 C-2113-01-D39-013 AU00024329 2011-05-24
2113-01-W01-01-12 C-2113-01-D39-013 AU00024330 2011-05-24
2113-01-W01-01-12 C-2113-01-D39-013 AU00024331 2011-05-24
2113-01-W01-01-12 C-2113-01-D39-013 AU00024332 2011-05-24
2113-01-W01-01-12 C-2113-01-D39-012 AU00024333 2011-05-24
2113-01-W01-01-12 C-2113-01-D39-012 AU00024334 2011-05-24

This info was obtained today from Broad's CBIP database via this link (project = 2113 CDRP Cell painting & GE-HTS)
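
For anyone who wants to join the run dates onto the profiles, here is a minimal Python sketch of reading records in the format shown above. It assumes the file is whitespace-delimited with the four columns from the excerpt (it may actually be tab-delimited, which `split()` also handles). The function names are made up for illustration, the sample rows are the six copied from the excerpt, and the prefix pattern is the same one used in the R snippet earlier in this thread.

```python
import io
import re
from collections import defaultdict
from datetime import date

# The six sample rows shown above; the real cdrp_runs.txt has one row per assay plate.
SAMPLE = """\
run_id platemap barcode date_started
2113-01-W01-01-12 C-2113-01-D39-013 AU00024329 2011-05-24
2113-01-W01-01-12 C-2113-01-D39-013 AU00024330 2011-05-24
2113-01-W01-01-12 C-2113-01-D39-013 AU00024331 2011-05-24
2113-01-W01-01-12 C-2113-01-D39-013 AU00024332 2011-05-24
2113-01-W01-01-12 C-2113-01-D39-012 AU00024333 2011-05-24
2113-01-W01-01-12 C-2113-01-D39-012 AU00024334 2011-05-24
"""

# Same prefix pattern as the R snippet: leading letter, hyphen, first token.
PREFIX_RE = re.compile(r"^[A-Z]-[A-Z0-9]+")


def read_runs(handle):
    """Parse whitespace-delimited run records into a list of dicts."""
    header = handle.readline().split()
    rows = []
    for line in handle:
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        row = dict(zip(header, fields))
        row["date_started"] = date.fromisoformat(row["date_started"])
        match = PREFIX_RE.match(row["platemap"])
        row["platemap_prefix"] = match.group(0) if match else None
        rows.append(row)
    return rows


def barcodes_by_platemap(rows):
    """Group assay plate barcodes under their platemap name."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["platemap"]].append(row["barcode"])
    return dict(grouped)


runs = read_runs(io.StringIO(SAMPLE))
groups = barcodes_by_platemap(runs)
for platemap, barcodes in groups.items():
    print(platemap, len(barcodes), "plates")
```

To run this on the real file, replace `io.StringIO(SAMPLE)` with an open file handle for cdrp_runs.txt.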

@nishanthmerwin

Thank you @shntnu for looking into this, and further, for finding the date information. It seems that, beyond the batch ID alone, the run date is highly correlated with the average similarity between controls across plates. For clarity, here is a brief summary of the steps used to generate the figure posted below:

  1. Download pre-processed morphological features summarized per well from here: ftp://parrot.genomics.cn/gigadb/pub/10.5524/100001_101000/100351/profiles.tar.gz
  2. Keep only the control wells and compute, for each plate, the centroid across all features.
  3. Sort the plates by batch ID (Metadata_Plate_Map_Name).
  4. Compute the pairwise Manhattan distance between all plate centroids (used as the similarity measure).
  5. Plot the result as a heatmap, with outer bars that alternate colors each time the covariate of interest changes.
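
Steps 2 to 4 above could be sketched roughly as follows in plain Python. This is not the code that produced the figure. The field names (`plate`, `platemap`, `treatment`, `features`), the `"DMSO"` control label, and the toy wells are assumptions made up for illustration. The real profiles have many more features and their own column names. Plotting (step 5) is left out.

```python
from collections import defaultdict


def control_centroids(wells, control_label="DMSO"):
    """Average each feature over the control wells of every plate."""
    sums = {}
    counts = defaultdict(int)
    for well in wells:
        if well["treatment"] != control_label:
            continue  # keep controls only
        plate = well["plate"]
        feats = well["features"]
        if plate not in sums:
            sums[plate] = [0.0] * len(feats)
        sums[plate] = [s + f for s, f in zip(sums[plate], feats)]
        counts[plate] += 1
    return {p: [s / counts[p] for s in sums[p]] for p in sums}


def manhattan(a, b):
    """Manhattan (L1) distance; smaller means more similar."""
    return sum(abs(x - y) for x, y in zip(a, b))


def pairwise_manhattan(centroids, order):
    """Square distance matrix with rows and columns in the given plate order."""
    return [[manhattan(centroids[p], centroids[q]) for q in order] for p in order]


# Toy wells (hypothetical values): two plates, two features each.
wells = [
    {"plate": "P2", "platemap": "H-BIOA-005-3", "treatment": "DMSO", "features": [1.0, 2.0]},
    {"plate": "P2", "platemap": "H-BIOA-005-3", "treatment": "DMSO", "features": [3.0, 4.0]},
    {"plate": "P1", "platemap": "C-2113-01-D39-002", "treatment": "DMSO", "features": [0.0, 0.0]},
    {"plate": "P1", "platemap": "C-2113-01-D39-002", "treatment": "cmpd_x", "features": [9.0, 9.0]},
]

centroids = control_centroids(wells)
platemap_of = {w["plate"]: w["platemap"] for w in wells}
# Sort plates by platemap name, as in step 3.
order = sorted(centroids, key=lambda p: (platemap_of[p], p))
matrix = pairwise_manhattan(centroids, order)
```

The resulting `matrix` is what would be passed to a heatmap function, with `order` supplying the row and column labels.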

heatmap_controls_only

@shntnu
Contributor Author

shntnu commented Dec 19, 2019

@nishanthmerwin Thank you for posting your results on this!

One thing worth testing is how severe this effect is in the treatment signatures. Systematic effects of this sort (plate-to-plate variation or well-position effects) affect DMSO signatures much more than treatment signatures because the DMSO signal is typically weaker.

Here’s what I would test if I suspected that date was driving the effect:
Compute similarities between wells of 

  1. different compounds and 
  2. on different plate maps and 
  3. of the same category eg DOS compounds 

And then test whether the difference in dates between pairs of wells correlates with/predicts the similarity between the pairs. 

(There are other ways to do this e.g. using random/fixed effects models)
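
A rough Python sketch of the pairwise test described above, under several assumptions: the well records and their field names are hypothetical, cosine similarity stands in for whatever similarity measure is actually used, and a plain Pearson correlation is just one simple way to check the relationship. Pairs that share a well are not independent, so treat the correlation as descriptive only.

```python
from datetime import date
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two profiles."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def eligible_pairs(wells, category):
    """Pairs of wells in one category with different compounds and platemaps."""
    subset = [w for w in wells if w["category"] == category]
    for a, b in combinations(subset, 2):
        if a["compound"] != b["compound"] and a["platemap"] != b["platemap"]:
            yield a, b


def date_gap_vs_similarity(wells, category):
    """Return parallel lists: date gap in days and profile similarity per pair."""
    gaps, sims = [], []
    for a, b in eligible_pairs(wells, category):
        gaps.append(abs((a["date"] - b["date"]).days))
        sims.append(cosine(a["profile"], b["profile"]))
    return gaps, sims


# Toy wells (all values hypothetical).
wells = [
    {"compound": "c1", "platemap": "H-CBLA-001", "category": "DOS",
     "date": date(2011, 5, 24), "profile": [1.0, 0.0]},
    {"compound": "c2", "platemap": "H-CBLB-001", "category": "DOS",
     "date": date(2011, 5, 25), "profile": [0.9, 0.1]},
    {"compound": "c3", "platemap": "H-CBLC-001", "category": "DOS",
     "date": date(2011, 6, 3), "profile": [0.0, 1.0]},
    {"compound": "b1", "platemap": "H-BIOA-005-3", "category": "bioactive",
     "date": date(2011, 5, 24), "profile": [0.5, 0.5]},
]

gaps, sims = date_gap_vs_similarity(wells, "DOS")
r = pearson(gaps, sims)
print(f"{len(gaps)} pairs, correlation between date gap and similarity: {r:.2f}")
```

A clearly negative correlation on real data would suggest that wells run further apart in time look less alike, even though they hold unrelated compounds from the same category.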

@bledford87
Contributor

@shntnu @MarziehHaghighi Are the data complete for this dataset? If so, I'll just need access to the files and for someone to fill out the info on this form.

@shntnu
Contributor Author

shntnu commented Jun 7, 2021

@MarziehHaghighi – please go ahead and sort this out over the next couple of weeks because you reference this dataset in the NeurIPS paper

@MarziehHaghighi

@MarziehHaghighi – please go ahead and sort this out over the next couple of weeks because you reference this dataset in the NeurIPS paper

@shntnu Could you please clarify what I should sort out here?

@shntnu
Contributor Author

shntnu commented Jun 7, 2021

Ah, whatever @bledford87 said

#61 (comment)

@bledford87
Contributor

@MarziehHaghighi I saw you submitted the form for this, so now I just need a zip file with the images and ground truth.

@shntnu
Contributor Author

shntnu commented Jun 16, 2021 via email

@bledford87
Contributor

Oh interesting, okay thanks! I don't think we've done a BBBC entry like this in the past, linking to the images in another spot, but I'll sort it out next week.

@shntnu
Contributor Author

shntnu commented Jun 16, 2021 via email

@shntnu
Contributor Author

shntnu commented Feb 2, 2022

We might reprocess this dataset sometime this year, to make the profiles compatible with the feature set being used in https://jump-cellpainting.broadinstitute.org/

If/when we do, we will use this issue to document our progress

cc @jccaicedo @bethac07

Internal ref: https://broadinstitute.slack.com/archives/C01AF25CQLT/p1643838553629459?thread_ts=1634213872.002500&cid=C01AF25CQLT

@AnneCarpenter

Some notes from the slack msg:
Notes from reprocessing the data back in 2017 https://broadinstitute.atlassian.net/wiki/spaces/IP/pages/114638720/2017-04-19+CDP2+data+show+decent+quality+both+for+bioactive+and+DOS+compounds
The images now live in s3://cellpainting-gallery/cpg0012-wawer-bioactivecompoundprofiling

@shntnu
Contributor Author

shntnu commented Jul 13, 2022

@AnneCarpenter Per our new protocol, we will make notes here broadinstitute/cellpainting-gallery#13 (comment)

@shntnu
Contributor Author

shntnu commented Jul 13, 2022

In case someone is wondering: won't it be confusing that this dataset has an entry in both BBBC and the Cell Painting Gallery?

Yes! :)

But thankfully our Airtable
https://airtable.com/appctUGldmRNkVS19/tblXX3mTxhCR9Bxbq/viwNJfGOOJot7Wr3x/recIRNKWanTVXppE1?blocks=hide (private) will link up everything

(and Erin/Beth will decide how to make that info public)

Going forward, new profiling datasets will exist only in cellpainting-gallery. See #52 (comment)

@shntnu shntnu changed the title Create a new dataset for CDRP dataset (as BBBC047) Add CDRP dataset as BBBC047 Jul 16, 2022
@bethac07
Contributor

I think this is done, yes?

@shntnu
Contributor Author

shntnu commented Aug 17, 2022

I think this is done, yes?
Yes


7 participants