Conversation
@coderabbitai review

✅ Actions performed: Review triggered.
📝 Walkthrough / Summary by CodeRabbit

Walkthrough: Introduces GPU-backed spatial niche discovery with …

Changes: …

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
- 📝 Generate docstrings
- 🧪 Generate unit tests (beta)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Review rate limit: 0/1 reviews remaining, refill in 60 minutes.
Actionable comments posted: 3
🧹 Nitpick comments (4)
src/rapids_singlecell/squidpy_gpu/_gmm.py (1)
59-59: Unused variable `n_samples`.

The static analysis correctly identifies that `n_samples` is unpacked but never used. Prefix with an underscore to indicate intentional discard.

Proposed fix:

```diff
- n_samples, _ = X.shape
+ _n_samples, _ = X.shape
```

Or simply:

```diff
- n_samples, _ = X.shape
+ _, _ = X.shape
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/rapids_singlecell/squidpy_gpu/_gmm.py` at line 59, The variable n_samples from the line "n_samples, _ = X.shape" is assigned but never used; change the left-hand name to a deliberately discarded name (e.g., prefix with an underscore) so static analysis stops flagging it—update the assignment in _gmm.py (the line that unpacks X.shape) to use _n_samples or simply "_" for the first value.

tests/test_gmm.py (2)
22-28: Consider adding a sklearn reference comparison test.

While the ARI-based validation is good, the coding guidelines recommend comparing against reference implementations. Consider adding a test that compares log-likelihood or cluster assignments against `sklearn.mixture.GaussianMixture` with `covariance_type="full"` on the same synthetic data to validate numerical correctness.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_gmm.py` around lines 22 - 28, Add a sklearn reference comparison inside test_kmeans_init_recovers_well_separated_clusters: after generating X_np and y and calling gmm_fit_predict, fit sklearn.mixture.GaussianMixture(n_components=5, covariance_type="full", random_state=0) on X_np and compare either the per-sample log-likelihoods (via GaussianMixture.score_samples or overall score) or the predicted labels (via GaussianMixture.predict) against your implementation; assert they are close within a small tolerance (e.g., log-likelihood difference threshold or ARI between sklearn labels and labels from gmm_fit_predict), and ensure you use the same random_state and data from _well_separated to make the comparison deterministic.
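A minimal sketch of such a reference comparison, assuming scikit-learn is available. The blob data stands in for the `_well_separated` fixture, and the commented-out `gmm_fit_predict` call marks where the project's GPU implementation would be compared:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Stand-in for the _well_separated fixture: 5 tight, well-separated blobs.
X, y_true = make_blobs(n_samples=500, centers=5, cluster_std=0.3, random_state=0)

ref = GaussianMixture(
    n_components=5, covariance_type="full", n_init=3, random_state=0
)
ref_labels = ref.fit_predict(X)

# In the real test, also compute labels from the GPU implementation, e.g.
#   gpu_labels = gmm_fit_predict(X, n_components=5, random_state=0)
# and compare gpu_labels against ref_labels (or per-sample log-likelihoods
# via ref.score_samples) within a small tolerance. Here we only check that
# the sklearn reference itself recovers the ground truth via ARI.
ari = adjusted_rand_score(y_true, ref_labels)
```

The same `random_state` and fixture data keep the comparison deterministic, as the prompt above suggests.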
63-66: Add validation for `n_samples >= n_components` or test the constraint boundary.

The function does not validate that the number of samples exceeds the number of components. When `n_samples < n_components`:

- `init="random_from_data"` fails with an unclear error at line 94: `rng.choice(n, size=K, replace=False)` raises `ValueError: Cannot take a larger sample than population without replacement`.
- `init="kmeans"` (default) fails silently in cuML KMeans, which expects `n_clusters <= n_samples`.

The test suite covers only cases where `n_samples >> n_components` (the lowest is 100 samples with K=3). Consider either adding input validation with a clear error message (e.g., `if n < K: raise ValueError(...)`) or adding test coverage for the boundary case `n_samples == n_components` and verifying the behavior is documented.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_gmm.py` around lines 63 - 66, Add explicit input validation in the GMM entrypoint (e.g., gmm_fit_predict or the function that accepts X and n_components) to check that the number of samples n (X.shape[0]) is >= n_components (K) and raise a ValueError with a clear message like "n_samples (n) must be >= n_components (K)" when this is not true; reference the init modes ("random_from_data" and "kmeans") in the message or docs so users understand why sampling/kmeans would fail. Ensure the check runs before any code paths that call rng.choice(...) or invoke cuML KMeans so the error is deterministic and informative.

src/rapids_singlecell/squidpy_gpu/_niche.py (1)
320-320: Docstring contains an ambiguous Unicode character.

The minus sign on line 320 (`−`) is a Unicode MINUS SIGN rather than ASCII HYPHEN-MINUS (`-`). While visually similar, this can cause issues with some tools.

Proposed fix:

```diff
- - ``"variance"``: ``Âₖ @ (X·X) − (Âₖ @ X)²`` (matches squidpy's path; densifies X)
+ - ``"variance"``: ``Âₖ @ (X·X) - (Âₖ @ X)²`` (matches squidpy's path; densifies X)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/rapids_singlecell/squidpy_gpu/_niche.py` at line 320, The docstring entry for "variance" uses a Unicode minus sign (−) which can break tooling; update the string in the docstring (the line containing ``"variance"``: ``Âₖ @ (X·X) − (Âₖ @ X)²``) to replace the Unicode MINUS SIGN with the ASCII HYPHEN-MINUS character so it reads ``... (X·X) - (Âₖ @ X)²`` and save the file with UTF-8 encoding to ensure no other non-ASCII punctuation remains.
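A generic stdlib sketch for auditing a string (or a file's contents) for such lookalike characters; the sample string is illustrative, not the actual docstring:

```python
import unicodedata

def non_ascii_chars(text: str) -> list[tuple[int, str, str]]:
    """Return (index, char, Unicode name) for every non-ASCII character."""
    return [
        (i, ch, unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]

sample = "X\u00b7X \u2212 Y"  # contains MIDDLE DOT and MINUS SIGN
hits = non_ascii_chars(sample)
# Each hit names the offending character, e.g. 'MINUS SIGN' for U+2212,
# so intentional characters (like the · multiplication dot) can be whitelisted.
```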
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b1caef88-4e7a-4fe6-8340-081fff28e2d4
📒 Files selected for processing (8)
- docs/api/squidpy_gpu.md
- docs/release-notes/0.15.1.md
- docs/release-notes/index.md
- src/rapids_singlecell/squidpy_gpu/__init__.py
- src/rapids_singlecell/squidpy_gpu/_gmm.py
- src/rapids_singlecell/squidpy_gpu/_niche.py
- tests/test_gmm.py
- tests/test_niche.py
```md
* Add `rsc.gr.calculate_niche` with flavors `neighborhood` , `utag` , and `cellcharter`. Mirrors `squidpy.gr.calculate_niche` but runs {smaller}`644` {smaller}`S Dicks`
* Add a minimal full-covariance GMM (`squidpy_gpu._gmm.gmm_fit_predict`) used by the `cellcharter` {pr}`644` {smaller}`S Dicks`
```

Fix PR reference format inconsistency.

Line 5 uses ``{smaller}`644` `` while line 6 uses ``{pr}`644` ``. The PR reference should use the `{pr}` directive consistently. Also, line 5 appears to be truncated ("but runs" seems incomplete).
Proposed fix:

```diff
-* Add `rsc.gr.calculate_niche` with flavors `neighborhood` , `utag` , and `cellcharter`. Mirrors `squidpy.gr.calculate_niche` but runs {smaller}`644` {smaller}`S Dicks`
+* Add `rsc.gr.calculate_niche` with flavors `neighborhood`, `utag`, and `cellcharter`. Mirrors `squidpy.gr.calculate_niche` but runs on GPU. {pr}`644` {smaller}`S Dicks`
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```md
* Add `rsc.gr.calculate_niche` with flavors `neighborhood`, `utag`, and `cellcharter`. Mirrors `squidpy.gr.calculate_niche` but runs on GPU. {pr}`644` {smaller}`S Dicks`
* Add a minimal full-covariance GMM (`squidpy_gpu._gmm.gmm_fit_predict`) used by the `cellcharter` {pr}`644` {smaller}`S Dicks`
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/release-notes/0.15.1.md` around lines 5 - 6, Update the PR reference and
fix the truncated sentence: replace the inconsistent `{smaller}\`644\`` on the
rsc.gr.calculate_niche line with `{pr}\`644\`` to match the second line, and
complete the phrase so it reads something like "Mirrors
`squidpy.gr.calculate_niche` but runs on GPU" (or similarly clear wording)
referencing the function name `rsc.gr.calculate_niche`; also ensure the second
line still uses `{pr}\`644\`` and mentions `squidpy_gpu._gmm.gmm_fit_predict`
unchanged.
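As a one-off sanity check, a small regex sweep over the release notes can catch a stray `{smaller}` role wrapping a bare number where `{pr}` was presumably intended (illustrative sketch; the heuristic and function name are not part of the project):

```python
import re

def suspicious_pr_refs(text: str) -> list[str]:
    # Flags `{smaller}` immediately wrapping a bare number; PR references
    # should use the `{pr}` role instead (MyST role syntax assumed).
    return re.findall(r"\{smaller\}`\d+`", text)

line = "Mirrors `squidpy.gr.calculate_niche` but runs {smaller}`644` {smaller}`S Dicks`"
hits = suspicious_pr_refs(line)
# → ["{smaller}`644`"]  ({smaller}`S Dicks` is not flagged: not a bare number)
```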
```python
def _neighborhood_profile(
    adata: AnnData,
    *,
    groups: str,
    distance: int,
    weights: Sequence[float] | None,
    abs_nhood: bool,
    key: str,
) -> np.ndarray:
    """Cells x categories matrix of cell-type counts (or relative frequencies) over n-hop neighbors."""
    cats = pd.Categorical(adata.obs[groups])
    n_cats = len(cats.categories)
    n_obs = adata.n_obs

    one_hot = cp.zeros((n_obs, n_cats), dtype=cp.float32)
    one_hot[cp.arange(n_obs), cp.asarray(cats.codes, dtype=cp.int64)] = 1.0

    adj = rsc.get.X_to_GPU(adata.obsp[key]).astype(cp.float32)
    adj.eliminate_zeros()
    # Binarize so adj.data == 1: each existing edge contributes one neighbor count.
    adj_bin = adj.copy()
    adj_bin.data[:] = 1.0

    if weights is None:
        weights = [1.0] * distance
    elif len(weights) < distance:
        weights = list(weights) + [weights[-1]] * (distance - len(weights))

    profile = cp.zeros((n_obs, n_cats), dtype=cp.float32)
    adj_k = adj_bin
    for hop in range(distance):
        if hop == 0:
            adj_hop = adj_bin
        else:
            adj_k = adj_k @ adj_bin
            adj_hop = adj_k.copy()
            adj_hop.data[:] = 1.0
        counts = adj_hop @ one_hot  # (n_obs, n_cats) dense
        if not abs_nhood:
            row_sum = adj_hop.sum(axis=1).reshape(-1, 1)
            row_sum = cp.where(row_sum == 0, cp.float32(1.0), row_sum)
            counts = counts / row_sum
        profile += cp.float32(weights[hop]) * counts

    if not abs_nhood:
        profile /= cp.float32(sum(weights))

    return profile
```
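For intuition, the hop logic above can be sketched on CPU with dense NumPy in place of CuPy sparse matrices (an illustrative re-implementation, not the library's API):

```python
import numpy as np

def neighborhood_profile_cpu(adj, codes, n_cats, distance, weights=None, abs_nhood=True):
    # Dense NumPy sketch of the GPU routine above: per-hop binarized
    # adjacency powers accumulate weighted cell-type counts.
    n = adj.shape[0]
    one_hot = np.zeros((n, n_cats), dtype=np.float32)
    one_hot[np.arange(n), codes] = 1.0
    adj_bin = (adj != 0).astype(np.float32)  # binarize edges
    if weights is None:
        weights = [1.0] * distance
    profile = np.zeros((n, n_cats), dtype=np.float32)
    adj_k = adj_bin
    for hop in range(distance):
        if hop == 0:
            adj_hop = adj_bin
        else:
            adj_k = adj_k @ adj_bin                  # walks of length hop+1
            adj_hop = (adj_k != 0).astype(np.float32)  # re-binarize
        counts = adj_hop @ one_hot
        if not abs_nhood:
            row_sum = adj_hop.sum(axis=1, keepdims=True)
            row_sum[row_sum == 0] = 1.0  # guard isolated cells
            counts = counts / row_sum
        profile += np.float32(weights[hop]) * counts
    if not abs_nhood:
        profile /= np.float32(sum(weights))
    return profile

# Path graph 0-1-2 with cell types [A, B, A]: each cell's 1-hop profile
# counts its neighbors' types.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=np.float32)
prof = neighborhood_profile_cpu(adj, np.array([0, 1, 0]), 2, distance=1)
# → [[0, 1], [2, 0], [0, 1]]
```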
Return type hint mismatch: returns `cp.ndarray`, not `np.ndarray`.

The function signature indicates `-> np.ndarray`, but line 233 returns `profile`, which is a CuPy array (`cp.zeros` on line 214). This could cause issues for type checkers and documentation.
Proposed fix:

```diff
 def _neighborhood_profile(
     adata: AnnData,
     *,
     groups: str,
     distance: int,
     weights: Sequence[float] | None,
     abs_nhood: bool,
     key: str,
-) -> np.ndarray:
+) -> cp.ndarray:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/rapids_singlecell/squidpy_gpu/_niche.py` around lines 186 - 233, The
return type annotation of _neighborhood_profile is incorrect: it declares ->
np.ndarray but the function builds and returns a CuPy array (profile) using
cp.zeros and other cp operations; change the signature to return cp.ndarray (or
Union[cp.ndarray, np.ndarray] if intent is to support CPU arrays) and update any
imports/type comments accordingly so type checkers reflect the actual return
type from _neighborhood_profile.
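If CPU fallback were intended, the `Union` form mentioned in the prompt could be spelled with a guarded import (a sketch, assuming Python ≥ 3.10 union syntax; CuPy may be absent at import time):

```python
from __future__ import annotations

import numpy as np

try:
    import cupy as cp
    # GPU available: functions may return either array type.
    ArrayT = cp.ndarray | np.ndarray
except ImportError:
    # CPU-only environment: annotations fall back to NumPy arrays.
    ArrayT = np.ndarray
```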
```python
    if weights is None:
        weights = [1.0] * distance
    elif len(weights) < distance:
        weights = list(weights) + [weights[-1]] * (distance - len(weights))

    profile = cp.zeros((n_obs, n_cats), dtype=cp.float32)
    adj_k = adj_bin
    for hop in range(distance):
        if hop == 0:
            adj_hop = adj_bin
        else:
            adj_k = adj_k @ adj_bin
            adj_hop = adj_k.copy()
            adj_hop.data[:] = 1.0
        counts = adj_hop @ one_hot  # (n_obs, n_cats) dense
        if not abs_nhood:
            row_sum = adj_hop.sum(axis=1).reshape(-1, 1)
            row_sum = cp.where(row_sum == 0, cp.float32(1.0), row_sum)
            counts = counts / row_sum
        profile += cp.float32(weights[hop]) * counts

    if not abs_nhood:
        profile /= cp.float32(sum(weights))
```
Potential division by zero if `n_hop_weights` sums to zero.

If a user passes `n_hop_weights=[0.0, 0.0, ...]`, line 231 would divide by zero. Consider adding validation or documentation.
Proposed fix:

```diff
     if not abs_nhood:
+        weight_sum = sum(weights[:distance])
+        if weight_sum == 0:
+            raise ValueError("`n_hop_weights` must sum to a positive value.")
-        profile /= cp.float32(sum(weights))
+        profile /= cp.float32(weight_sum)
```

🤖 Prompt for AI Agents
+ profile /= cp.float32(weight_sum)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/rapids_singlecell/squidpy_gpu/_niche.py` around lines 209 - 231, The code
can divide by zero when sum(weights) == 0 in the final normalization of profile;
update the logic around the weights handling (the variable weights used in the
loop and final division) to validate/adjust total = sum(weights) (after
expanding/normalizing weights to length distance) and if total == 0 either raise
a clear ValueError mentioning n_hop_weights or fallback to a safe non-zero value
(e.g., 1.0) before doing profile /= cp.float32(total); ensure this validation
occurs after the existing weights expansion (the block that sets weights when
None or pads it) and before the final if not abs_nhood normalization so adj_bin,
adj_k, one_hot and abs_nhood code can assume a non-zero divisor.
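The guard suggested above, sketched with NumPy in place of CuPy; `normalize_profile` is an illustrative name, not the library's:

```python
import numpy as np

def normalize_profile(profile, weights, distance):
    # Validate after the weights list has been expanded to `distance`
    # entries, so the divisor used below is guaranteed non-zero.
    weight_sum = sum(weights[:distance])
    if weight_sum == 0:
        raise ValueError("`n_hop_weights` must sum to a positive value.")
    return profile / np.float32(weight_sum)

profile = np.ones((2, 3), dtype=np.float32)
out = normalize_profile(profile, [0.5, 0.5], distance=2)  # divides by 1.0
```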
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #644      +/-   ##
==========================================
+ Coverage   88.04%   88.43%   +0.39%
==========================================
  Files          96       98       +2
  Lines        7032     7272     +240
==========================================
+ Hits         6191     6431     +240
  Misses        841      841
```
This adds squidpy-like niche detection. It also features a minimal GMM for the `cellcharter` workflow.