Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extending MongoDB auto-reconnect to all inserts (SCP-2629, SCP-5904) #379

Merged
merged 5 commits into from
Jan 27, 2025

Conversation

bistline
Copy link
Contributor

BACKGROUND & CHANGES

This update extends the current MongoDB "graceful auto-reconnect" functionality to all insert_many calls, not just those specific to gene expression data. This is in response to some instances of Connection reset by peer when ingesting other file types, such as metadata or subsampling of very large files. Additionally, now all database connections from ingest use the MongoConnection module, standardizing how scp-ingest-pipeline connects to database instances (as outlined in SCP-2629).

MANUAL TESTING

This is (sadly) nearly impossible to test, as it requires a file that is not an expression matrix that will reliably trigger the Connection reset by peer error. Testing with the fragment files from SCP2805 will take multiple days to encounter the error (original job failed after ~5 days of runtime), if it happens at all.

There is an automated test test_insert_reconnect in test_ingest.py that mirrors the reconnection test in the GeneExpression module. This test validates that when either the AutoReconnect or BulkWriteError errors occur, the reconnection code fires correctly. You can run this locally by initializing your ingest environment and then running this command from the tests directory:

pytest test_ingest.py -k 'test_insert_reconnect' --disable-pytest-warnings

You should see output similar to this:

=========================================================== test session starts ============================================================
platform darwin -- Python 3.10.14, pytest-7.2.1, pluggy-1.0.0
rootdir: /Users/bistline/Documents/Python/scp-ingest-pipeline
plugins: xdist-1.29.0, dash-2.17.1, forked-1.6.0, cov-2.8.1
collected 26 items / 25 deselected / 1 selected                                                                                            

test_ingest.py .                                                                                                                     [100%]

============================================= 1 passed, 25 deselected, 125 warnings in 21.24s ==============================================

Another manual test would be to use the ingest image gcr.io/broad-singlecellportal-staging/scp-ingest-jb-mongo-retry-all:0f9452d in your local SCP instance and validate that normal file uploads can write to the database as usual. Since all DB connections are now standardized and use the same reconnect code, we can at least prove normal operation is uninterrupted.

@bistline bistline requested review from eweitz and jlchang January 23, 2025 17:32
Copy link

codecov bot commented Jan 23, 2025

Codecov Report

Attention: Patch coverage is 72.72727% with 3 lines in your changes missing coverage. Please review.

Project coverage is 75.95%. Comparing base (410bc3e) to head (df50374).

Files with missing lines Patch % Lines
ingest/ingest_pipeline.py 50.00% 2 Missing ⚠️
ingest/mongo_connection.py 85.71% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@               Coverage Diff               @@
##           development     #379      +/-   ##
===============================================
+ Coverage        75.79%   75.95%   +0.15%     
===============================================
  Files               30       30              
  Lines             4470     4470              
===============================================
+ Hits              3388     3395       +7     
+ Misses            1082     1075       -7     
Files with missing lines Coverage Δ
ingest/mongo_connection.py 85.71% <85.71%> (+0.86%) ⬆️
ingest/ingest_pipeline.py 58.16% <50.00%> (+1.30%) ⬆️

Copy link
Member

@eweitz eweitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good! Nice consolidation to the more robust approach.

Comment on lines +70 to +76
# detect which argument is the list of inserted documents
# when called from IngestPipeline, first arg is ingest instance, and 3rd is the list
# when called from GeneExpression (static implementation), first arg is list
if args[0].__class__.__name__ == 'IngestPipeline':
docs_idx = 2
else:
docs_idx = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clever, nice solution.

ingest/mongo_connection.py Outdated Show resolved Hide resolved
Co-authored-by: Eric Weitz <[email protected]>
Copy link
Collaborator

@jlchang jlchang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified ingest of a known good AnnData file behaves as expected (as does automated test).
Keeping my fingers crossed that this will do the trick for long subsampling jobs!

@bistline bistline changed the title Extending MongoDB auto-reconnect to all inserts (SCP-2629, SCP-5908) Extending MongoDB auto-reconnect to all inserts (SCP-2629, SCP-5904) Jan 27, 2025
@bistline bistline merged commit 655d7d1 into development Jan 27, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants