Too many join keys #177

Open
andrewdchen opened this issue Jan 30, 2025 · 9 comments

@andrewdchen

Thanks for actively maintaining this @stemangiola!

I received the following error from running sccomp on the HBCA dataset (link) with the following specification:

sccomp code

sccomp_result = 
  navin_data_nona |>
  sccomp_estimate( 
    formula_composition = ~ BPA_score + self_reported_ethnicity + age2 + tissue_location + (1|source), 
    .sample =  donor_id, 
    .cell_group = author_cell_type, 
    bimodal_mean_variability_association = TRUE,
    cores = 16
  ) |> 
  sccomp_remove_outliers(cores = 1) |> # Optional
  sccomp_test()

BPA_score is a continuous score of gene-set activity calculated from AUCell. The other covariates are from the study itself. NAs were removed before feeding the data into sccomp.
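
For reference, a score like this can be derived roughly as follows (a minimal sketch only; counts, bpa_genesets, and the gene-set names are assumed placeholders, not the exact preprocessing, which is posted further down in this thread):

library(AUCell)

# counts: genes x cells expression matrix (assumed name)
# bpa_genesets: named list with "BPA_up" and "BPA_dn" gene vectors (assumed names)
rankings <- AUCell_buildRankings(counts)
auc      <- AUCell_calcAUC(bpa_genesets, rankings)
auc_mat  <- getAUC(auc)   # gene sets x cells matrix of per-cell AUC values

# one continuous per-cell score: up-signature activity minus down-signature activity
BPA_score <- auc_mat["BPA_up", ] - auc_mat["BPA_dn", ]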

Error Message

sccomp says: From version 1.7.12 the logit fold change threshold for significance has be changed from 0.2 to 0.1.
This message is displayed once per session.
sccomp says: count column is an integer. The sum-constrained beta binomial model will be used
sccomp says: estimation
Error in `mutate()`:
i In argument: `design = map2(...)`.
Caused by error in `map2()`:
i In index: 1.
Caused by error in `left_join()`:
! This join would result in more rows than dplyr can handle.
i 4695098959 rows would be returned. 2147483647 rows is the maximum number
  allowed.
i Double check your join keys. This error commonly occurs due to a missing join
  key, or an improperly specified join condition.
Backtrace:
     x
  1. +-sccomp::sccomp_test(...)
  2. +-sccomp::sccomp_remove_outliers(...)
  3. +-sccomp::sccomp_estimate(...)
  4. +-sccomp:::sccomp_estimate.Seurat(...)
  5. | \-.data[[]] %>% ...
  6. +-sccomp::sccomp_estimate(...)
  7. +-sccomp:::sccomp_estimate.data.frame(...)
  8. | \-sccomp:::sccomp_glm_data_frame_raw(...)
  9. |   \-sccomp:::sccomp_glm_data_frame_counts(...)
 10. |     \-... %>% ...
 11. +-sccomp:::data_spread_to_model_input(...)
 12. | \-sccomp:::get_random_intercept_design2(...)
 13. |   +-dplyr::mutate(...)
 14. |   \-dplyr:::mutate.data.frame(...)
 15. |     \-dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
 16. |       +-base::withCallingHandlers(...)
 17. |       \-dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
 18. |         \-mask$eval_all_mutate(quo)
 19. |           \-dplyr (local) eval()
 20. +-purrr::map2(...)
 21. | \-purrr:::map2_("list", .x, .y, .f, ..., .progress = .progress)
 22. |   +-purrr:::with_indexed_errors(...)
 23. |   | \-base::withCallingHandlers(...)
 24. |   +-purrr:::call_with_cleanup(...)
 25. |   \-sccomp (local) .f(.x[[i]], .y[[i]], ...)
 26. |     +-dplyr::mutate(...)
 27. |     +-dplyr::mutate(...)
 28. |     +-dplyr::mutate(...)
 29. |     +-dplyr::left_join(...)
 30. |     \-dplyr:::left_join.data.frame(...)
 31. |       \-dplyr:::join_mutate(...)
 32. |         \-dplyr:::join_rows(...)
 33. |           \-dplyr:::dplyr_locate_matches(...)
 34. |             +-base::withCallingHandlers(...)
 35. |             \-vctrs::vec_locate_matches(...)
 36. +-vctrs:::stop_matches_overflow(size = 4695098959, call = `<env>`)
 37. | \-vctrs:::stop_matches(...)
 38. |   \-vctrs:::stop_vctrs(...)
 39. |     \-rlang::abort(message, class = c(class, "vctrs_error"), ..., call = call)
 40. |       \-rlang:::signal_abort(cnd, .file)
 41. |         \-base::signalCondition(cnd)
 42. \-dplyr (local) `<fn>`(`<vctrs___>`)
 43.   \-dplyr:::rethrow_error_join_matches_overflow(cnd, error_call)
 44.     \-dplyr:::stop_join(...)
 45.       \-dplyr:::stop_dplyr(...)
 46.         \-rlang::abort(...)
Warning message:
In sccomp_glm_data_frame_counts(mutate(left_join(.data %>% class_list_to_counts(!!.sample,  :
  sccomp says: the input data frame does not have the same number of `author_cell_type`, for all `donor_id`. We have made it so, adding 0s for the missing sample/feature pairs.
Execution halted

Perhaps there are too many covariates for the size of the data (119 samples, 658825 cells)? I'm trying with a subset now but would like to model jointly across all cell types if possible.
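
As a rough sanity check on that, assuming a hypothetical one-row-per-donor summary of the covariates called sample_meta, one can compare the number of donors with the number of fixed-effect columns the formula implies:

# sample_meta: hypothetical data frame with one row per donor_id and the covariates above
nrow(sample_meta)   # number of samples (should be 119)
ncol(model.matrix(
  ~ BPA_score + self_reported_ethnicity + age2 + tissue_location,
  data = sample_meta
))                  # number of fixed-effect columns in the design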

Thanks!

@stemangiola
Collaborator

The size is not the problem; we have modelled 4k samples and 10 million cells.

There must be something strange in how the dataset factors map to the samples. Do you have 119 unique donor_id values? Please send the cell metadata, anonymised if needed.
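
For example, something along these lines on the cell-level metadata (a sketch only; navin_metadata stands for the metadata frame, with the column names taken from the formula) would show the donor count and whether any sample-level covariate, including the grouping factor source, takes more than one value within a donor:

library(dplyr)

# how many unique donors?
navin_metadata |> distinct(donor_id) |> nrow()

# does any covariate vary within a donor? Each donor_id should map to exactly one
# value of each sample-level covariate, including the random-effect grouping `source`
navin_metadata |>
  summarise(
    across(c(BPA_score, self_reported_ethnicity, age2, tissue_location, source), n_distinct),
    .by = donor_id
  ) |>
  filter(if_any(-donor_id, ~ .x > 1))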

@andrewdchen
Author

navin_metadata.csv.zip

Had to filter out some columns because of size, but here's the metadata.

@stemangiola
Collaborator

stemangiola commented Feb 1, 2025

Thanks,

Please test with your own metadata first. I get "The columns BPA_score are not present in your tibble":

> library(sccomp)
> read_csv("~/Downloads/navin_metadata.csv") |>
+     sccomp_estimate( 
+         formula_composition = ~ BPA_score + self_reported_ethnicity + age2 + tissue_location + (1|source), 
+         .sample =  donor_id, 
+         .cell_group = author_cell_type, 
+         bimodal_mean_variability_association = TRUE,
+         cores = 16
+     )
New names:
• `` -> `...1`
• `...1` -> `...2`
Rows: 714331 Columns: 10
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (9): ...2, sample_id, donor_id, author_cell_type, BPA_bin, source, self_reported_ethnicity, tissue_location, age2
dbl (1): ...1
Use `spec()` to retrieve the full column specification for this data.
Specify the column types or set `show_col_types = FALSE` to quiet this message.
Error in check_columns_exist(.data, c(quo_name(.sample), quo_name(.cell_group),  : 
  The columns BPA_score are not present in your tibble

Please paste here the full code you run on your metadata, from reading it in onwards, together with the error message it produces.
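
Before re-running, a quick check along these lines (readr assumed attached) will confirm every column used in the formula is actually present in the file:

library(readr)

required_cols <- c("donor_id", "author_cell_type", "BPA_score",
                   "self_reported_ethnicity", "age2", "tissue_location", "source")

# should return character(0) if nothing is missing
setdiff(required_cols, colnames(read_csv("~/Downloads/navin_metadata.csv")))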

@andrewdchen
Author

navin_metadata .csv.zip

Sorry, I forgot to check! Here's the metadata with the right columns.

@stemangiola
Collaborator

Thanks,

can you please attach the code you run, starting from

read_csv("navin_metadata.csv") |> ...

onwards, together with the error message?

@andrewdchen
Author

The preceding code filters the metadata to find the samples with complete data for all the covariates, then drops the unused factor levels in the actual Seurat object.

Let me know if you need anything else! Thanks again.

# Loading Data
navin_data <- readRDS(file.path(HEALTHY_PATH, "navin_2023/local.rds"))
navin_bpa_auc_scores <- readRDS(file.path(PATH, "results/navin_bpa_auc_scores.rds"))
navin_data$BPA_score <- navin_bpa_auc_scores["BPA_up",] - navin_bpa_auc_scores["BPA_dn",]

# Metadata changes
navin_metadata <- navin_data@meta.data
navin_metadata$sample_id <- rownames(navin_metadata)

## Binarizing factors
navin_metadata$BPA_bin <- with(navin_metadata, dplyr::if_else(BPA_score > median(navin_metadata$BPA_score), "High", "Low"))

all.equal(colnames(navin_data), navin_metadata$sample_id)
rownames(navin_metadata) <- navin_metadata$sample_id
navin_data@meta.data <- navin_metadata

## sccomp needs complete cases
df_1 <- navin_metadata %>% 
  dplyr::select(sample_id, donor_id, author_cell_type, BPA_score, BPA_bin, source, self_reported_ethnicity, tissue_location, age2) %>%
  dplyr::filter(!if_any(everything(), ~ . == "unknown"))

df_1 <- df_1[complete.cases(df_1),]

## Finding the samples that have no NAs across all the covariates of interest
bpa_samples <- df_1$sample_id

# Running sccomp
navin_data_nona <- navin_data[, bpa_samples]
navin_data_nona$self_reported_ethnicity <- droplevels(navin_data_nona$self_reported_ethnicity)
navin_data_nona$tissue_location <- droplevels(navin_data_nona$tissue_location)
navin_data_nona$donor_id <- droplevels(navin_data_nona$donor_id)
navin_data_nona$author_cell_type <- droplevels(navin_data_nona$author_cell_type)

sccomp_result = 
  navin_data_nona |>
  sccomp_estimate( 
    formula_composition = ~ BPA_score + self_reported_ethnicity + age2 + tissue_location + (1|source), 
    .sample =  donor_id, 
    .cell_group = author_cell_type, 
    bimodal_mean_variability_association = TRUE,
    cores = 16
  ) |> 
  sccomp_remove_outliers(cores = 1) |> # Optional
  sccomp_test()

@stemangiola
Collaborator

stemangiola commented Feb 3, 2025

Thanks,

  1. Is navin_data_nona a tibble?
  2. What is the error message of
sccomp_result = 
  navin_data_nona |>
  sccomp_estimate( 
    formula_composition = ~ BPA_score + self_reported_ethnicity + age2 + tissue_location + (1|source), 
    .sample =  donor_id, 
    .cell_group = author_cell_type, 
    bimodal_mean_variability_association = TRUE,
    cores = 16
  )
  3. If (1) is a data frame and you get an error for (2), I need navin_data_nona.csv attached.

@andrewdchen
Author

  1. It's a Seurat object. Unfortunately I cannot attach it via GitHub comments because it's too large.
  2. The error message above is the result of this code.
  3. Though I can't attach it, the dataset is available to download here (link), and with some modifications of the paths above the code should run.

Hope this helps!

@stemangiola
Collaborator

Please do the following to facilitate a reproducible example

# Get cell metadata
seurat_object[[]] |> saveRDS("seurat_object_metadata.rds")


# Execute sccomp
readRDS("seurat_object_metadata.rds") |>
  sccomp_estimate( 
    formula_composition = ~ BPA_score + self_reported_ethnicity + age2 + tissue_location + (1|source), 
    .sample = donor_id, 
    .cell_group = author_cell_type, 
    bimodal_mean_variability_association = TRUE,
    cores = 16
  )

Then send me

  1. seurat_object_metadata.rds
  2. the error message with the exact code that triggers it
