
add niche detection #644

Open
Intron7 wants to merge 6 commits into main from add-niche-detection

Conversation

Member

@Intron7 Intron7 commented Apr 29, 2026

This adds squidpy-like niche detection. It also features a minimal GMM to support the cellcharter workflow.

@Intron7 Intron7 marked this pull request as ready for review April 29, 2026 14:28
Member Author

Intron7 commented Apr 29, 2026

@coderabbitai review


coderabbitai Bot commented Apr 29, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

1 similar comment


coderabbitai Bot commented Apr 29, 2026

@Intron7: Sure! I'll review the changes in this PR.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


coderabbitai Bot commented Apr 29, 2026

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Introduced spatial niche discovery capabilities with three analysis flavors: neighborhood profiling, utag-based feature propagation, and cellcharter-based clustering for GPU-accelerated spatial analysis.
  • Documentation

    • Updated API documentation to reflect new spatial niche functionality.
    • Added release notes for version 0.15.1.
  • Tests

    • Added comprehensive test coverage for spatial niche discovery and GPU-based clustering workflows.

Walkthrough

Introduces GPU-backed spatial niche discovery with calculate_niche function supporting three flavors (neighborhood, utag, cellcharter). Adds a CuPy-based full-covariance GMM implementation for the cellcharter workflow. Extends the squidpy_gpu public API and updates documentation and release notes with comprehensive test coverage.
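For intuition, the core step of the neighborhood flavor described above can be sketched on CPU with SciPy stand-ins for the CuPy sparse matrices (the adjacency, labels, and single-hop setup here are illustrative, not the PR's actual API):

```python
import numpy as np
from scipy import sparse

# Four cells with two cell-type categories; one-hot encode the labels.
codes = np.array([0, 1, 0, 1])
one_hot = np.eye(2, dtype=np.float32)[codes]

# Binarized spatial adjacency: every edge contributes one neighbor count.
adj = sparse.csr_matrix(np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
], dtype=np.float32))

counts = adj @ one_hot                                # neighbor counts per category
row_sum = np.maximum(np.asarray(adj.sum(axis=1)), 1)  # guard against isolated cells
profile = np.asarray(counts / row_sum)                # relative frequencies per cell
print(profile)  # each cell sees one neighbor of each type, so every row is [0.5, 0.5]
```

Each row of `profile` is that cell's neighborhood composition; these rows are what downstream clustering (e.g. the cellcharter flavor's GMM) operates on.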

Changes

  • Documentation (docs/api/squidpy_gpu.md, docs/release-notes/0.15.1.md, docs/release-notes/index.md): Added GPU API autosummary entry for calculate_niche, created release notes for version 0.15.1 documenting the new niche and GMM functionality, and updated the release notes index to include the 0.15.1 entry.
  • GPU Implementation Core (src/rapids_singlecell/squidpy_gpu/_niche.py, src/rapids_singlecell/squidpy_gpu/_gmm.py): Implemented calculate_niche with three spatial niche discovery flavors (neighborhood, utag, cellcharter) supporting various aggregation methods and parameters. Added gmm_fit_predict for full-covariance Gaussian mixture modeling with CuPy, supporting kmeans and random initialization strategies.
  • Module Exports (src/rapids_singlecell/squidpy_gpu/__init__.py): Extended the public API by importing and re-exporting calculate_niche from the ._niche module.
  • Test Coverage (tests/test_gmm.py, tests/test_niche.py): Added GPU-focused tests for gmm_fit_predict covering initialization strategies, determinism, and label constraints, plus an extensive test suite for calculate_niche covering all three flavors, edge cases, parameter validation, and correctness across dense/sparse inputs.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 34.78%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Title check (✅ Passed): the title 'add niche detection' directly relates to the main changeset, which implements GPU-backed niche detection functionality with multiple flavors (neighborhood, utag, cellcharter).
  • Description check (✅ Passed): the description is related to the changeset, mentioning 'squidpy like niche detection' and a 'minimal GMM' for the 'cellcharter workflow', both of which are present in the changes.
  • Linked Issues check (✅ Passed): check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Review rate limit: 0/1 reviews remaining, refill in 60 minutes.

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (4)
src/rapids_singlecell/squidpy_gpu/_gmm.py (1)

59-59: Unused variable n_samples.

The static analysis correctly identifies that n_samples is unpacked but never used. Prefix with underscore to indicate intentional discard.

Proposed fix
-    n_samples, _ = X.shape
+    _n_samples, _ = X.shape

Or simply:

-    n_samples, _ = X.shape
+    _, _ = X.shape
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/rapids_singlecell/squidpy_gpu/_gmm.py` at line 59, The variable n_samples
from the line "n_samples, _ = X.shape" is assigned but never used; change the
left-hand name to a deliberately discarded name (e.g., prefix with an
underscore) so static analysis stops flagging it—update the assignment in
_gmm.py (the line that unpacks X.shape) to use _n_samples or simply "_" for the
first value.
tests/test_gmm.py (2)

22-28: Consider adding sklearn reference comparison test.

While the ARI-based validation is good, the coding guidelines recommend comparing against reference implementations. Consider adding a test that compares log-likelihood or cluster assignments against sklearn.mixture.GaussianMixture with covariance_type="full" on the same synthetic data to validate numerical correctness.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_gmm.py` around lines 22 - 28, Add a sklearn reference comparison
inside test_kmeans_init_recovers_well_separated_clusters: after generating X_np
and y and calling gmm_fit_predict, fit
sklearn.mixture.GaussianMixture(n_components=5, covariance_type="full",
random_state=0) on X_np and compare either the per-sample log-likelihoods (via
GaussianMixture.score_samples or overall score) or the predicted labels (via
GaussianMixture.predict) against your implementation; assert they are close
within a small tolerance (e.g., log-likelihood difference threshold or ARI
between sklearn labels and labels from gmm_fit_predict), and ensure you use the
same random_state and data from _well_separated to make the comparison
deterministic.
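For concreteness, the sklearn side of the suggested comparison could look like this sketch (CPU-only synthetic data; the PR's `gmm_fit_predict` would supply the other set of labels, indicated in a comment since it needs a GPU):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Five well-separated blobs, mirroring the test's synthetic data (assumed shape).
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10], [5, 20]], dtype=np.float64)
X_np = np.vstack([c + rng.normal(scale=0.3, size=(50, 2)) for c in centers])
y = np.repeat(np.arange(5), 50)

gm = GaussianMixture(n_components=5, covariance_type="full", random_state=0)
sk_labels = gm.fit_predict(X_np)

# In the real test, labels = gmm_fit_predict(...) would provide the second
# labeling, and the assertion would compare adjusted_rand_score(sk_labels, labels).
ari = adjusted_rand_score(y, sk_labels)
assert ari > 0.95  # sklearn recovers the well-separated clusters
```

ARI is used rather than raw label equality because mixture components are only identified up to permutation.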

63-66: Add validation for n_samples >= n_components or test the constraint boundary.

The function does not validate that the number of samples exceeds the number of components. When n_samples < n_components:

  • init="random_from_data" fails with an unclear error at line 94: rng.choice(n, size=K, replace=False) raises ValueError: Cannot take a larger sample than population without replacement.
  • init="kmeans" (default) fails silently in cuML KMeans, which expects n_clusters <= n_samples.

The test suite covers only cases where n_samples >> n_components (lowest is 100 samples with K=3). Consider either adding input validation with a clear error message (e.g., if n < K: raise ValueError(...)) or adding test coverage for the boundary case n_samples == n_components and verifying the behavior is documented.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_gmm.py` around lines 63 - 66, Add explicit input validation in the
GMM entrypoint (e.g., gmm_fit_predict or the function that accepts X and
n_components) to check that the number of samples n (X.shape[0]) is >=
n_components (K) and raise a ValueError with a clear message like "n_samples (n)
must be >= n_components (K)" when this is not true; reference the init modes
("random_from_data" and "kmeans") in the message or docs so users understand why
sampling/kmeans would fail. Ensure the check runs before any code paths that
call rng.choice(...) or invoke cuML KMeans so the error is deterministic and
informative.
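A guard along the lines the comment suggests might look like the following sketch (the function name and placement are hypothetical, not the PR's actual code):

```python
import numpy as np

def check_n_samples(X: np.ndarray, n_components: int) -> None:
    # Hypothetical guard: fail fast before rng.choice or cuML KMeans can
    # produce an opaque error downstream.
    n_samples = X.shape[0]
    if n_samples < n_components:
        raise ValueError(
            f"n_samples ({n_samples}) must be >= n_components ({n_components}); "
            "both 'kmeans' and 'random_from_data' initialization need at least "
            "one sample per component."
        )

check_n_samples(np.zeros((5, 2)), n_components=5)  # boundary case passes
try:
    check_n_samples(np.zeros((2, 3)), n_components=5)
except ValueError as err:
    message = str(err)
print(message)
```

Running the check before any initialization path makes the failure deterministic and the message self-explanatory.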
src/rapids_singlecell/squidpy_gpu/_niche.py (1)

320-320: Docstring contains ambiguous Unicode character.

The minus sign on line 320 (−) is a Unicode MINUS SIGN (U+2212) rather than ASCII HYPHEN-MINUS (-). While visually similar, this can cause issues with some tools.

Proposed fix
-    - ``"variance"``: ``Âₖ @ (X·X) − (Âₖ @ X)²``  (matches squidpy's path; densifies X)
+    - ``"variance"``: ``Âₖ @ (X·X) - (Âₖ @ X)²``  (matches squidpy's path; densifies X)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/rapids_singlecell/squidpy_gpu/_niche.py` at line 320, The docstring entry
for "variance" uses a Unicode minus sign (−) which can break tooling; update the
string in the docstring (the line containing ``"variance"``: ``Âₖ @ (X·X) − (Âₖ
@ X)²``) to replace the Unicode MINUS SIGN with the ASCII HYPHEN-MINUS character
so it reads ``... (X·X) - (Âₖ @ X)²`` and save the file with UTF-8 encoding to
ensure no other non-ASCII punctuation remains.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/release-notes/0.15.1.md`:
- Around line 5-6: Update the PR reference and fix the truncated sentence:
replace the inconsistent `{smaller}\`644\`` on the rsc.gr.calculate_niche line
with `{pr}\`644\`` to match the second line, and complete the phrase so it reads
something like "Mirrors `squidpy.gr.calculate_niche` but runs on GPU" (or
similarly clear wording) referencing the function name `rsc.gr.calculate_niche`;
also ensure the second line still uses `{pr}\`644\`` and mentions
`squidpy_gpu._gmm.gmm_fit_predict` unchanged.

In `@src/rapids_singlecell/squidpy_gpu/_niche.py`:
- Around line 186-233: The return type annotation of _neighborhood_profile is
incorrect: it declares -> np.ndarray but the function builds and returns a CuPy
array (profile) using cp.zeros and other cp operations; change the signature to
return cp.ndarray (or Union[cp.ndarray, np.ndarray] if intent is to support CPU
arrays) and update any imports/type comments accordingly so type checkers
reflect the actual return type from _neighborhood_profile.
- Around line 209-231: The code can divide by zero when sum(weights) == 0 in the
final normalization of profile; update the logic around the weights handling
(the variable weights used in the loop and final division) to validate/adjust
total = sum(weights) (after expanding/normalizing weights to length distance)
and if total == 0 either raise a clear ValueError mentioning n_hop_weights or
fallback to a safe non-zero value (e.g., 1.0) before doing profile /=
cp.float32(total); ensure this validation occurs after the existing weights
expansion (the block that sets weights when None or pads it) and before the
final if not abs_nhood normalization so adj_bin, adj_k, one_hot and abs_nhood
code can assume a non-zero divisor.

---

Nitpick comments:
In `@src/rapids_singlecell/squidpy_gpu/_gmm.py`:
- Line 59: The variable n_samples from the line "n_samples, _ = X.shape" is
assigned but never used; change the left-hand name to a deliberately discarded
name (e.g., prefix with an underscore) so static analysis stops flagging
it—update the assignment in _gmm.py (the line that unpacks X.shape) to use
_n_samples or simply "_" for the first value.

In `@src/rapids_singlecell/squidpy_gpu/_niche.py`:
- Line 320: The docstring entry for "variance" uses a Unicode minus sign (−)
which can break tooling; update the string in the docstring (the line containing
``"variance"``: ``Âₖ @ (X·X) − (Âₖ @ X)²``) to replace the Unicode MINUS SIGN
with the ASCII HYPHEN-MINUS character so it reads ``... (X·X) - (Âₖ @ X)²`` and
save the file with UTF-8 encoding to ensure no other non-ASCII punctuation
remains.

In `@tests/test_gmm.py`:
- Around line 22-28: Add a sklearn reference comparison inside
test_kmeans_init_recovers_well_separated_clusters: after generating X_np and y
and calling gmm_fit_predict, fit sklearn.mixture.GaussianMixture(n_components=5,
covariance_type="full", random_state=0) on X_np and compare either the
per-sample log-likelihoods (via GaussianMixture.score_samples or overall score)
or the predicted labels (via GaussianMixture.predict) against your
implementation; assert they are close within a small tolerance (e.g.,
log-likelihood difference threshold or ARI between sklearn labels and labels
from gmm_fit_predict), and ensure you use the same random_state and data from
_well_separated to make the comparison deterministic.
- Around line 63-66: Add explicit input validation in the GMM entrypoint (e.g.,
gmm_fit_predict or the function that accepts X and n_components) to check that
the number of samples n (X.shape[0]) is >= n_components (K) and raise a
ValueError with a clear message like "n_samples (n) must be >= n_components (K)"
when this is not true; reference the init modes ("random_from_data" and
"kmeans") in the message or docs so users understand why sampling/kmeans would
fail. Ensure the check runs before any code paths that call rng.choice(...) or
invoke cuML KMeans so the error is deterministic and informative.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b1caef88-4e7a-4fe6-8340-081fff28e2d4

📥 Commits

Reviewing files that changed from the base of the PR and between 11bcd1a and 6d30c10.

📒 Files selected for processing (8)
  • docs/api/squidpy_gpu.md
  • docs/release-notes/0.15.1.md
  • docs/release-notes/index.md
  • src/rapids_singlecell/squidpy_gpu/__init__.py
  • src/rapids_singlecell/squidpy_gpu/_gmm.py
  • src/rapids_singlecell/squidpy_gpu/_niche.py
  • tests/test_gmm.py
  • tests/test_niche.py

Comment thread docs/release-notes/0.15.1.md Outdated
Comment on lines +5 to +6
* Add `rsc.gr.calculate_niche` with flavors `neighborhood` , `utag` , and `cellcharter`. Mirrors `squidpy.gr.calculate_niche` but runs {smaller}`644` {smaller}`S Dicks`
* Add a minimal full-covariance GMM (`squidpy_gpu._gmm.gmm_fit_predict`) used by the `cellcharter` {pr}`644` {smaller}`S Dicks`


⚠️ Potential issue | 🟡 Minor

Fix PR reference format inconsistency.

Line 5 uses {smaller}`644` while line 6 uses {pr}`644`. The PR reference should use the {pr} directive consistently. Also, line 5 appears to be truncated ("but runs" seems incomplete).

Proposed fix
-* Add `rsc.gr.calculate_niche` with flavors `neighborhood` , `utag` , and `cellcharter`. Mirrors `squidpy.gr.calculate_niche` but runs {smaller}`644` {smaller}`S Dicks`
+* Add `rsc.gr.calculate_niche` with flavors `neighborhood`, `utag`, and `cellcharter`. Mirrors `squidpy.gr.calculate_niche` but runs on GPU. {pr}`644` {smaller}`S Dicks`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/release-notes/0.15.1.md` around lines 5 - 6, Update the PR reference and
fix the truncated sentence: replace the inconsistent `{smaller}\`644\`` on the
rsc.gr.calculate_niche line with `{pr}\`644\`` to match the second line, and
complete the phrase so it reads something like "Mirrors
`squidpy.gr.calculate_niche` but runs on GPU" (or similarly clear wording)
referencing the function name `rsc.gr.calculate_niche`; also ensure the second
line still uses `{pr}\`644\`` and mentions `squidpy_gpu._gmm.gmm_fit_predict`
unchanged.

Comment on lines +186 to +233
def _neighborhood_profile(
    adata: AnnData,
    *,
    groups: str,
    distance: int,
    weights: Sequence[float] | None,
    abs_nhood: bool,
    key: str,
) -> np.ndarray:
    """Cells x categories matrix of cell-type counts (or relative frequencies) over n-hop neighbors."""
    cats = pd.Categorical(adata.obs[groups])
    n_cats = len(cats.categories)
    n_obs = adata.n_obs

    one_hot = cp.zeros((n_obs, n_cats), dtype=cp.float32)
    one_hot[cp.arange(n_obs), cp.asarray(cats.codes, dtype=cp.int64)] = 1.0

    adj = rsc.get.X_to_GPU(adata.obsp[key]).astype(cp.float32)
    adj.eliminate_zeros()
    # Binarize so adj.data == 1: each existing edge contributes one neighbor count.
    adj_bin = adj.copy()
    adj_bin.data[:] = 1.0

    if weights is None:
        weights = [1.0] * distance
    elif len(weights) < distance:
        weights = list(weights) + [weights[-1]] * (distance - len(weights))

    profile = cp.zeros((n_obs, n_cats), dtype=cp.float32)
    adj_k = adj_bin
    for hop in range(distance):
        if hop == 0:
            adj_hop = adj_bin
        else:
            adj_k = adj_k @ adj_bin
            adj_hop = adj_k.copy()
            adj_hop.data[:] = 1.0
        counts = adj_hop @ one_hot  # (n_obs, n_cats) dense
        if not abs_nhood:
            row_sum = adj_hop.sum(axis=1).reshape(-1, 1)
            row_sum = cp.where(row_sum == 0, cp.float32(1.0), row_sum)
            counts = counts / row_sum
        profile += cp.float32(weights[hop]) * counts

    if not abs_nhood:
        profile /= cp.float32(sum(weights))

    return profile
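The n-hop expansion inside the loop above can be checked on CPU with SciPy (illustrative path graph standing in for the CuPy sparse ops):

```python
import numpy as np
from scipy import sparse

# Path graph 0-1-2-3: the hop-1 adjacency, already binarized.
A = sparse.csr_matrix(np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=np.float32))

# Hop 2: multiply, then re-binarize so repeated paths count once,
# exactly as adj_hop.data[:] = 1.0 does in the loop above.
A2 = (A @ A).tocsr()
A2.data[:] = 1.0
print(A2.toarray())
```

Note that the two-hop matrix keeps a nonzero diagonal (a walk of length two returns to its start), matching the quoted code, which does not remove self-reachability.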

⚠️ Potential issue | 🟡 Minor

Return type hint mismatch: returns cp.ndarray, not np.ndarray.

The function signature indicates -> np.ndarray but line 233 returns profile which is a CuPy array (cp.zeros on line 214). This could cause issues for type checkers and documentation.

Proposed fix
 def _neighborhood_profile(
     adata: AnnData,
     *,
     groups: str,
     distance: int,
     weights: Sequence[float] | None,
     abs_nhood: bool,
     key: str,
-) -> np.ndarray:
+) -> cp.ndarray:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/rapids_singlecell/squidpy_gpu/_niche.py` around lines 186 - 233, The
return type annotation of _neighborhood_profile is incorrect: it declares ->
np.ndarray but the function builds and returns a CuPy array (profile) using
cp.zeros and other cp operations; change the signature to return cp.ndarray (or
Union[cp.ndarray, np.ndarray] if intent is to support CPU arrays) and update any
imports/type comments accordingly so type checkers reflect the actual return
type from _neighborhood_profile.

Comment on lines +209 to +231
    if weights is None:
        weights = [1.0] * distance
    elif len(weights) < distance:
        weights = list(weights) + [weights[-1]] * (distance - len(weights))

    profile = cp.zeros((n_obs, n_cats), dtype=cp.float32)
    adj_k = adj_bin
    for hop in range(distance):
        if hop == 0:
            adj_hop = adj_bin
        else:
            adj_k = adj_k @ adj_bin
            adj_hop = adj_k.copy()
            adj_hop.data[:] = 1.0
        counts = adj_hop @ one_hot  # (n_obs, n_cats) dense
        if not abs_nhood:
            row_sum = adj_hop.sum(axis=1).reshape(-1, 1)
            row_sum = cp.where(row_sum == 0, cp.float32(1.0), row_sum)
            counts = counts / row_sum
        profile += cp.float32(weights[hop]) * counts

    if not abs_nhood:
        profile /= cp.float32(sum(weights))

⚠️ Potential issue | 🟡 Minor

Potential division by zero if n_hop_weights sums to zero.

If a user passes n_hop_weights=[0.0, 0.0, ...], line 231 would divide by zero. Consider adding validation or documentation.

Proposed fix
     if not abs_nhood:
+        weight_sum = sum(weights[:distance])
+        if weight_sum == 0:
+            raise ValueError("`n_hop_weights` must sum to a positive value.")
-        profile /= cp.float32(sum(weights))
+        profile /= cp.float32(weight_sum)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/rapids_singlecell/squidpy_gpu/_niche.py` around lines 209 - 231, The code
can divide by zero when sum(weights) == 0 in the final normalization of profile;
update the logic around the weights handling (the variable weights used in the
loop and final division) to validate/adjust total = sum(weights) (after
expanding/normalizing weights to length distance) and if total == 0 either raise
a clear ValueError mentioning n_hop_weights or fallback to a safe non-zero value
(e.g., 1.0) before doing profile /= cp.float32(total); ensure this validation
occurs after the existing weights expansion (the block that sets weights when
None or pads it) and before the final if not abs_nhood normalization so adj_bin,
adj_k, one_hot and abs_nhood code can assume a non-zero divisor.


codecov-commenter commented Apr 29, 2026

Codecov Report

❌ Patch coverage is 97.50000% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.43%. Comparing base (11bcd1a) to head (6d30c10).

Files with missing lines Patch % Lines
src/rapids_singlecell/squidpy_gpu/_gmm.py 96.62% 3 Missing ⚠️
src/rapids_singlecell/squidpy_gpu/_niche.py 98.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #644      +/-   ##
==========================================
+ Coverage   88.04%   88.43%   +0.39%     
==========================================
  Files          96       98       +2     
  Lines        7032     7272     +240     
==========================================
+ Hits         6191     6431     +240     
  Misses        841      841              
Files with missing lines Coverage Δ
src/rapids_singlecell/squidpy_gpu/__init__.py 100.00% <100.00%> (ø)
src/rapids_singlecell/squidpy_gpu/_gmm.py 96.62% <96.62%> (ø)
src/rapids_singlecell/squidpy_gpu/_niche.py 98.00% <98.00%> (ø)

... and 2 files with indirect coverage changes
