Too many join keys #177

Open
andrewdchen opened this issue Jan 30, 2025 · 9 comments

@andrewdchen

Thanks for actively maintaining this @stemangiola!

I received the following error from running sccomp on the HBCA dataset (link) with the following specification:

sccomp code

sccomp_result = 
  navin_data_nona |>
  sccomp_estimate( 
    formula_composition = ~ BPA_score + self_reported_ethnicity + age2 + tissue_location + (1|source), 
    .sample =  donor_id, 
    .cell_group = author_cell_type, 
    bimodal_mean_variability_association = TRUE,
    cores = 16
  ) |> 
  sccomp_remove_outliers(cores = 1) |> # Optional
  sccomp_test()

BPA_score is a continuous score of gene-set activity calculated from AUCell. The other covariates are from the study itself. NAs were removed before feeding the data into sccomp.
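
For reference, a score like this can be derived roughly as follows (a minimal sketch only; counts, bpa_genesets, and the gene-set names are assumed placeholders, not the exact preprocessing, which is posted further down in this thread):

library(AUCell)

# counts: genes x cells expression matrix (assumed name)
# bpa_genesets: named list with "BPA_up" and "BPA_dn" gene vectors (assumed names)
rankings <- AUCell_buildRankings(counts)
auc      <- AUCell_calcAUC(bpa_genesets, rankings)
auc_mat  <- getAUC(auc)   # gene sets x cells matrix of per-cell AUC values

# one continuous per-cell score: up-signature activity minus down-signature activity
BPA_score <- auc_mat["BPA_up", ] - auc_mat["BPA_dn", ]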

Error Message

sccomp says: From version 1.7.12 the logit fold change threshold for significance has be changed from 0.2 to 0.1.
This message is displayed once per session.
sccomp says: count column is an integer. The sum-constrained beta binomial model will be used
sccomp says: estimation
Error in `mutate()`:
i In argument: `design = map2(...)`.
Caused by error in `map2()`:
i In index: 1.
Caused by error in `left_join()`:
! This join would result in more rows than dplyr can handle.
i 4695098959 rows would be returned. 2147483647 rows is the maximum number
  allowed.
i Double check your join keys. This error commonly occurs due to a missing join
  key, or an improperly specified join condition.
Backtrace:
     x
  1. +-sccomp::sccomp_test(...)
  2. +-sccomp::sccomp_remove_outliers(...)
  3. +-sccomp::sccomp_estimate(...)
  4. +-sccomp:::sccomp_estimate.Seurat(...)
  5. | \-.data[[]] %>% ...
  6. +-sccomp::sccomp_estimate(...)
  7. +-sccomp:::sccomp_estimate.data.frame(...)
  8. | \-sccomp:::sccomp_glm_data_frame_raw(...)
  9. |   \-sccomp:::sccomp_glm_data_frame_counts(...)
 10. |     \-... %>% ...
 11. +-sccomp:::data_spread_to_model_input(...)
 12. | \-sccomp:::get_random_intercept_design2(...)
 13. |   +-dplyr::mutate(...)
 14. |   \-dplyr:::mutate.data.frame(...)
 15. |     \-dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
 16. |       +-base::withCallingHandlers(...)
 17. |       \-dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
 18. |         \-mask$eval_all_mutate(quo)
 19. |           \-dplyr (local) eval()
 20. +-purrr::map2(...)
 21. | \-purrr:::map2_("list", .x, .y, .f, ..., .progress = .progress)
 22. |   +-purrr:::with_indexed_errors(...)
 23. |   | \-base::withCallingHandlers(...)
 24. |   +-purrr:::call_with_cleanup(...)
 25. |   \-sccomp (local) .f(.x[[i]], .y[[i]], ...)
 26. |     +-dplyr::mutate(...)
 27. |     +-dplyr::mutate(...)
 28. |     +-dplyr::mutate(...)
 29. |     +-dplyr::left_join(...)
 30. |     \-dplyr:::left_join.data.frame(...)
 31. |       \-dplyr:::join_mutate(...)
 32. |         \-dplyr:::join_rows(...)
 33. |           \-dplyr:::dplyr_locate_matches(...)
 34. |             +-base::withCallingHandlers(...)
 35. |             \-vctrs::vec_locate_matches(...)
 36. +-vctrs:::stop_matches_overflow(size = 4695098959, call = `<env>`)
 37. | \-vctrs:::stop_matches(...)
 38. |   \-vctrs:::stop_vctrs(...)
 39. |     \-rlang::abort(message, class = c(class, "vctrs_error"), ..., call = call)
 40. |       \-rlang:::signal_abort(cnd, .file)
 41. |         \-base::signalCondition(cnd)
 42. \-dplyr (local) `<fn>`(`<vctrs___>`)
 43.   \-dplyr:::rethrow_error_join_matches_overflow(cnd, error_call)
 44.     \-dplyr:::stop_join(...)
 45.       \-dplyr:::stop_dplyr(...)
 46.         \-rlang::abort(...)
Warning message:
In sccomp_glm_data_frame_counts(mutate(left_join(.data %>% class_list_to_counts(!!.sample,  :
  sccomp says: the input data frame does not have the same number of `author_cell_type`, for all `donor_id`. We have made it so, adding 0s for the missing sample/feature pairs.
Execution halted

Perhaps there are too many covariates for the size of the data (119 samples, 658825 cells)? I'm trying with a subset now but would like to model jointly across all cell types if possible.
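
As a rough sanity check on that, assuming a hypothetical one-row-per-donor summary of the covariates called sample_meta, one can compare the number of donors with the number of fixed-effect columns the formula implies:

# sample_meta: hypothetical data frame with one row per donor_id and the covariates above
nrow(sample_meta)   # number of samples (should be 119)
ncol(model.matrix(
  ~ BPA_score + self_reported_ethnicity + age2 + tissue_location,
  data = sample_meta
))                  # number of fixed-effect columns in the design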

Thanks!

@stemangiola
Collaborator

The size is not the problem; we have modelled 4k samples and 10 million cells.

There must be something strange in how the dataset factors map to the samples. Do you have 119 unique donor_id values? Please send the cell metadata, anonymised if needed.
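
For example, something along these lines on the cell-level metadata (a sketch only; navin_metadata stands for the metadata frame, with the column names taken from the formula) would show the donor count and whether any sample-level covariate, including the grouping factor source, takes more than one value within a donor:

library(dplyr)

# how many unique donors?
navin_metadata |> distinct(donor_id) |> nrow()

# does any covariate vary within a donor? Each donor_id should map to exactly one
# value of each sample-level covariate, including the random-effect grouping `source`
navin_metadata |>
  summarise(
    across(c(BPA_score, self_reported_ethnicity, age2, tissue_location, source), n_distinct),
    .by = donor_id
  ) |>
  filter(if_any(-donor_id, ~ .x > 1))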

@andrewdchen
Author

navin_metadata.csv.zip

Had to filter out some columns because of size, but here's the metadata.

@stemangiola
Collaborator

stemangiola commented Feb 1, 2025

Thanks,

Please test with your own metadata first. I get "The columns BPA_score are not present in your tibble":

> library(sccomp)
> read_csv("~/Downloads/navin_metadata.csv") |>
+     sccomp_estimate( 
+         formula_composition = ~ BPA_score + self_reported_ethnicity + age2 + tissue_location + (1|source), 
+         .sample =  donor_id, 
+         .cell_group = author_cell_type, 
+         bimodal_mean_variability_association = TRUE,
+         cores = 16
+     )
New names:
• `` -> `...1`
• `...1` -> `...2`
Rows: 714331 Columns: 10
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (9): ...2, sample_id, donor_id, author_cell_type, BPA_bin, source, self_reported_ethnicity, tissue_location, age2
dbl (1): ...1
Use `spec()` to retrieve the full column specification for this data.
Specify the column types or set `show_col_types = FALSE` to quiet this message.
Error in check_columns_exist(.data, c(quo_name(.sample), quo_name(.cell_group),  : 
  The columns BPA_score are not present in your tibble

Please paste here the full code you run on your metadata, from reading it in onwards, together with the error message it produces.
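
Before re-running, a quick check along these lines (readr assumed attached) will confirm every column used in the formula is actually present in the file:

library(readr)

required_cols <- c("donor_id", "author_cell_type", "BPA_score",
                   "self_reported_ethnicity", "age2", "tissue_location", "source")

# should return character(0) if nothing is missing
setdiff(required_cols, colnames(read_csv("~/Downloads/navin_metadata.csv")))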

@andrewdchen
Author

navin_metadata .csv.zip

Sorry, I forgot to check! Here's the metadata with the right columns.

@stemangiola
Collaborator

Thanks,

can you please attach the code you run, starting from

read_csv("navin_metadata.csv") |> ...

onwards, together with the error message?

@andrewdchen
Author

The preceding code filters the metadata to find the samples with complete data for all the covariates, then drops the unused factor levels in the actual Seurat object.

Let me know if you need anything else! Thanks again.

# Loading Data
navin_data <- readRDS(file.path(HEALTHY_PATH, "navin_2023/local.rds"))
navin_bpa_auc_scores <- readRDS(file.path(PATH, "results/navin_bpa_auc_scores.rds"))
navin_data$BPA_score <- navin_bpa_auc_scores["BPA_up",] - navin_bpa_auc_scores["BPA_dn",]

# Metadata changes
navin_metadata <- navin_data@meta.data
navin_metadata$sample_id <- rownames(navin_metadata)

## Binarizing factors
navin_metadata$BPA_bin <- with(navin_metadata, dplyr::if_else(BPA_score > median(navin_metadata$BPA_score), "High", "Low"))

all.equal(colnames(navin_data), navin_metadata$sample_id)
rownames(navin_metadata) <- navin_metadata$sample_id
navin_data@meta.data <- navin_metadata

## sccomp needs complete cases
df_1 <- navin_metadata %>% 
  dplyr::select(sample_id, donor_id, author_cell_type, BPA_score, BPA_bin, source, self_reported_ethnicity, tissue_location, age2) %>%
  dplyr::filter(!if_any(everything(), ~ . == "unknown"))

df_1 <- df_1[complete.cases(df_1),]

## Finding the samples that have no NAs across all the covariates of interest
bpa_samples <- df_1$sample_id

# Running sccomp
navin_data_nona <- navin_data[, bpa_samples]
navin_data_nona$self_reported_ethnicity <- droplevels(navin_data_nona$self_reported_ethnicity)
navin_data_nona$tissue_location <- droplevels(navin_data_nona$tissue_location)
navin_data_nona$donor_id <- droplevels(navin_data_nona$donor_id)
navin_data_nona$author_cell_type <- droplevels(navin_data_nona$author_cell_type)

sccomp_result = 
  navin_data_nona |>
  sccomp_estimate( 
    formula_composition = ~ BPA_score + self_reported_ethnicity + age2 + tissue_location + (1|source), 
    .sample =  donor_id, 
    .cell_group = author_cell_type, 
    bimodal_mean_variability_association = TRUE,
    cores = 16
  ) |> 
  sccomp_remove_outliers(cores = 1) |> # Optional
  sccomp_test()

@stemangiola
Collaborator

stemangiola commented Feb 3, 2025

Thanks,

  1. Is navin_data_nona a tibble?
  2. What is the error message of
sccomp_result = 
  navin_data_nona |>
  sccomp_estimate( 
    formula_composition = ~ BPA_score + self_reported_ethnicity + age2 + tissue_location + (1|source), 
    .sample =  donor_id, 
    .cell_group = author_cell_type, 
    bimodal_mean_variability_association = TRUE,
    cores = 16
  )
  3. If (1) is a data frame and you get an error for (2), I need navin_data_nona.csv attached.

@andrewdchen
Author

  1. It's a Seurat object. Unfortunately I cannot attach it via GitHub comments because it's too large.
  2. The error message above is the result of this code.
  3. Though I can't attach it, the dataset is available to download here (link), and with some modifications of the paths above the code should run.

Hope this helps!

@stemangiola
Collaborator

Please do the following to facilitate a reproducible example

# Get cell metadata
seurat_object[[]] |> saveRDS("seurat_object_metadata.rds")


# Execute sccomp
readRDS("seurat_object_metadata.rds") |>
  sccomp_estimate( 
    formula_composition = ~ BPA_score + self_reported_ethnicity + age2 + tissue_location + (1|source), 
    .sample = donor_id, 
    .cell_group = author_cell_type, 
    bimodal_mean_variability_association = TRUE,
    cores = 16
  )

Then send me

  1. seurat_object_metadata.rds
  2. the error message with the exact code that triggers it
