Residual GPU Memory usage #96
Follow-up issues after more experimentation, not sure if related:
@r614 A CUDA context is created on the GPU before executing any kernels; it stores some metadata and other things like loaded libraries. The CUDA context is usually initialized when calls are made to the CUDA runtime API (such as launching a kernel) and generally lasts for the lifetime of the process. This very small amount of memory (in the range of tens to hundreds of MB) is expected.

IIRC, Scanpy will copy results back to the CPU, and the GPU memory should eventually be cleaned up when the corresponding Python objects are cleaned up. However, it's always possible this might not happen immediately and might require waiting for the garbage collector.

Managed memory is a slight exception to the above. You can use it to oversubscribe the GPU memory so you don't immediately get out-of-memory errors, but that comes at the cost of increased thrashing potential as memory is paged into and out of the GPU as needed.

Unfortunately, PCA does require computing the eigenpairs of a covariance matrix, which in your case would require 24929^2 entries: that's ~2.5 GB of 32-bit float values. I recall at one point there was an additional limit imposed by the eigensolver itself (from cusolver directly), which wouldn't allow the number of columns^2 to be larger than 2^(32-1). This seems like it might be the case here. Can you print the output of `conda list`?

Another benefit of the highly variable gene feature selection we do in our examples is that we avoid these limitations in the PCA altogether.
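A minimal sketch of what triggering that cleanup can look like from user code, assuming CuPy is the backing array library; the array name here is illustrative, not an actual Scanpy object:

```python
import gc
import cupy as cp

# Stand-in for an object that still references device memory
# after a RAPIDS-backed scanpy call (illustrative only).
intermediate_result = cp.zeros((10_000, 10_000), dtype=cp.float32)

# Drop the reference, force a garbage-collection pass, then ask
# CuPy's memory pools to return their cached blocks to the driver
# so tools like nvidia-smi reflect the release.
del intermediate_result
gc.collect()
cp.get_default_memory_pool().free_all_blocks()
cp.get_default_pinned_memory_pool().free_all_blocks()
```

Note that `free_all_blocks()` only releases blocks with no live arrays pointing at them, which is why dropping the Python references has to come first.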
Thanks for the detailed reply! Do you know if there is a workaround for forcing the creation of a new context/garbage collection at the API level (maybe something akin to ...)? Will post the conda output once I get my environment up again later today.

Would love to get a stable managed memory setup working. What GPU memory size would you recommend for running computations on a dataset of this size? We ran into this on a 16 GB GPU, and hit OOM issues without unified memory.
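For the managed memory side, a sketch of one way to opt in via RMM (assuming RMM is installed alongside CuPy; the allocator import path has moved between RMM versions):

```python
import rmm
import cupy as cp
from rmm.allocators.cupy import rmm_cupy_allocator  # older RMM: rmm.rmm_cupy_allocator

# Re-initialize RMM with CUDA managed (unified) memory so device
# allocations can oversubscribe the physical GPU memory, at the
# cost of page-migration overhead when the working set is too big.
rmm.reinitialize(managed_memory=True)

# Route CuPy allocations through RMM so CuPy-backed operations
# use managed memory as well.
cp.cuda.set_allocator(rmm_cupy_allocator)
```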
Hi! I am trying to use the scanpy RAPIDS functions to run multiple parallel operations on a server.

The problem I am running into is that after running any scanpy function with RAPIDS enabled, there is some residual memory usage after the function call has ended. I am assuming this is either because of a memory leak, or because the result itself is stored on the GPU.
During the `scanpy.pp.neighbors` + `scanpy.tl.umap` call: [GPU memory usage screenshot]

Post function run: [GPU memory usage screenshot]
We aren't running any GPU load besides the UMAP function, and idle memory usage is ~75 MiB.
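For reference, a quick way to check whether residual usage is CuPy's pool cache rather than live allocations (a sketch, assuming CuPy is the backend):

```python
import cupy as cp

pool = cp.get_default_memory_pool()
# used_bytes(): memory backing live CuPy arrays.
# total_bytes(): memory the pool has reserved from the driver.
# A large gap between the two means the "residual" usage is
# cached, reusable blocks rather than a leak.
print(pool.used_bytes(), pool.total_bytes())
```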
Happy to elaborate more + help find a fix for this. Not sure if I am missing something really easy (maybe a `cupy.asnumpy` somewhere?), so any info would be super helpful!
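Along the lines of the `cupy.asnumpy` idea, a sketch of copying a result back to host memory and releasing the device copy; the array names are illustrative, not actual Scanpy internals:

```python
import cupy as cp

# Illustrative device-resident result, e.g. a UMAP embedding.
gpu_embedding = cp.random.rand(50_000, 2, dtype=cp.float32)

# Copy to host (NumPy) memory, then drop the device array and
# release the pooled blocks it occupied.
host_embedding = cp.asnumpy(gpu_embedding)
del gpu_embedding
cp.get_default_memory_pool().free_all_blocks()
```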