
Conversation


andsild commented Jun 4, 2025

Hi. Let me know if you want me to try and break down the PR into smaller PRs.

Otherwise, this PR is a first step toward working with AML in DSA.
The changes are:

  1. Adds an option for a prediction cutoff, so that predictions are only made on n unlabeled superpixels per slide per round/epoch; all labeled samples are always included (see the first sketch after this list). This speeds up training in cases where you have many slides with many annotations (in our AML case, several million). The benefit is speed; the downside is that feature files may no longer contain all the data if a user wants to download them.
  2. Defines a background superpixel for sparse pixelmaps (in my case, white blood cells, where we are not interested in the space between them). The background superpixel is always ignored. This is mostly for the UI (histomics-label). I tried a cleaner approach that did not use dummy/mock values, but it requires changes to several parts of the code: histomics-label (to index from 1), large_image (to draw superpixels with indexing starting at 1), and histomicstk (to generate superpixels starting at value 1). I wanted to start simple by allowing an empty, unlabeled background superpixel of size 1, which makes it more or less invisible in the UI.
  3. Sorts all labeled superpixels last in the "confidence" order from predictions, to prevent them from re-emerging first in the filmstrip during active learning (see the second sketch below).
  4. Adds tests to make sure my code (hopefully) doesn't break anything. The tests can also be used to benchmark the code.
  5. Makes CUDA use an optional parameter. I've mostly done this so that tests can run on the CPU only, but users can now also toggle GPU/CPU in Slicer.
  6. Miscellaneous: spelling fixes and print statements.
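
As a rough illustration of the cutoff in item 1 (the names here are hypothetical, not the PR's actual code), a minimal sketch that always keeps labeled superpixels and samples at most `cutoff` unlabeled ones per slide per round:

```python
import numpy as np

def select_superpixels(labels, cutoff=500, rng=None):
    """Pick the superpixel indices to featurize/predict for one slide.

    `labels` is a 1-D array where -1 marks an unlabeled superpixel and
    values >= 0 are user-assigned labels. Every labeled superpixel is
    kept; at most `cutoff` unlabeled ones are sampled at random.
    """
    rng = rng or np.random.default_rng()
    labeled = np.flatnonzero(labels >= 0)
    unlabeled = np.flatnonzero(labels < 0)
    if unlabeled.size > cutoff:
        unlabeled = rng.choice(unlabeled, size=cutoff, replace=False)
    return np.concatenate([labeled, unlabeled])
```

The downside mentioned above falls out directly: feature files built from these indices contain only the sampled subset rather than every superpixel on the slide.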

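For item 3, a similar sketch (again with hypothetical names) of forcing already-labeled superpixels to the back of the confidence ordering, so the filmstrip surfaces low-confidence unlabeled superpixels first:

```python
import numpy as np

def confidence_order(confidences, labels):
    """Indices sorted by ascending confidence, labeled superpixels last.

    np.lexsort sorts by its last key first, so `labels >= 0` (False
    before True) is the primary key and confidence breaks ties within
    each group.
    """
    return np.lexsort((confidences, labels >= 0))
```
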
I have tested in DSA with my own AML slides and also with the default superpixels from the UI. The tests should also help establish a baseline of what works and what doesn't (test_full_predict starts from scratch with a slide of MNIST digits and reaches an accuracy of around 80%).

For the UI, see also DigitalSlideArchive/histomics-label#207

cooperlab and others added 23 commits June 4, 2025 11:43
false used to be the default; it now has to be set manually, since we have custom code in the model checkpoint.
This prints out the filename if anything fails, for further inspection.
This is to speed up the AL loop.

Not a perfect solution: the UI will now recommend that users predict
"default" for a lot of the labels. But it is a first step to make sure
we can handle large slides with millions of annotations.
When testing the code, I prefer keeping all the models on CPU.

Before this commit, the models would automatically go to the GPU if
available.

After this commit, users can pass a parameter to keep models on the CPU.
This also makes it easier to run tests in CPU-only environments.
exists, skip it.

This is to allow sparse pixelmaps, which have empty space between bounding
boxes/superpixels.
@andsild andsild marked this pull request as draft June 4, 2025 17:24
<longflag>usecuda</longflag>
<description>Whether or not to use GPU/cuda (true) or cpu (false).</description>
<label>Use CUDA</label>
<default>false</default>
andsild (Author) commented:

this should be true by default
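
On the Python side, a flag like this would typically be honored with an explicit fallback (a sketch assuming PyTorch models; `pick_device` is a hypothetical helper, not code from this PR):

```python
import torch
from torch import nn

def pick_device(usecuda: bool) -> torch.device:
    """Use CUDA only when requested and actually available."""
    if usecuda and torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

# Move a model to whichever device the flag resolves to.
model = nn.Linear(64, 2).to(pick_device(usecuda=True))
```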

<longflag>cutoff</longflag>
<label>Number of annotations per slide</label>
<default>500</default>
<description>Number of unannotated superpixels to use per slide for features, training and predictions</description>
andsild (Author) commented:

training only uses labeled samples
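
Put differently, the cutoff only shrinks the prediction/feature set; the training set is simply whatever carries a label. A hypothetical sketch:

```python
import numpy as np

features = np.random.rand(10_000, 64)  # one feature row per superpixel
labels = np.full(10_000, -1)           # -1 = unlabeled
labels[:250] = np.random.randint(0, 3, size=250)

train_X = features[labels >= 0]        # training ignores the cutoff
train_y = labels[labels >= 0]
```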


manthey commented Jun 5, 2025

@andsild I find that having multiple PRs accelerates review -- we can get simple things merged in and then the complex things look simpler. I often will do a git diff master > some_diff_file.txt, then edit the diff file to have the isolated change and apply it to a new branch based on master to tease apart the PRs (but if you use an IDE it might do all that more conveniently).


andsild commented Jun 5, 2025

> @andsild I find that having multiple PRs accelerates review -- we can get simple things merged in and then the complex things look simpler. I often will do a git diff master > some_diff_file.txt, then edit the diff file to have the isolated change and apply it to a new branch based on master to tease apart the PRs (but if you use an IDE it might do all that more conveniently).

I get that 😄 it's good practice to do anyway.

I've opened up

which is just this PR divided into multiple others

@andsild andsild closed this Jun 5, 2025