Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: error handling in case of udsink failures #2372

Merged
merged 2 commits into from
Jan 29, 2025

Conversation

adarsh0728
Copy link
Contributor

@yhl25 and I were debugging a scenario where udsink panics. We observed that numa container wasn't restarting as it should. Fixed the error handling in this case.

Before PR:

2025-01-28T05:21:06.974698Z ERROR numaflow_core::source: Nak received for offset offset=String(StringOffset { offset: b"MTczODA0MTY2Njk0NzEwODU2NC0x", partition_idx: 0 })
2025-01-28T05:21:06.974694Z ERROR numaflow_core::source: Nak received for offset offset=String(StringOffset { offset: b"MTczODA0MTY2Njk0NzEwODU2NC0w", partition_idx: 0 })
2025-01-28T05:21:06.974628Z ERROR numaflow_core::sink: Error writing to sink e=Grpc("status: Unknown, message: \"panic inside sink handler: sink fn panicked on multiple of 3\", details: [], metadata: MetadataMap { headers: {} }")

With this PR:

2025-01-28T06:50:48.818759Z  INFO numaflow_core::metrics: Stopped the Lag-Reader Expose and Builder tasks
2025-01-28T06:50:48.818764Z ERROR numaflow_core: Application error running monovertex: Grpc("status: Unknown, message: \"panic inside sink handler: sink fn panicked on multiple of 3\", details: [], metadata: MetadataMap { headers: {} }")
2025-01-28T06:50:48.818768Z  INFO numaflow_core: Gracefully Exiting...

How to generate?

apiVersion: numaflow.numaproj.io/v1alpha1  
  
kind: MonoVertex  
  
metadata:  
  name: simple-mono-vertex  
spec:  
  source:  
    generator:  
      rpu: 50  
      duration: 1s  
  sink:  
    udsink:  
      container:  
        image: quay.io/adarsh0728/test/sink-log:stable

Copy link

codecov bot commented Jan 28, 2025

Codecov Report

Attention: Patch coverage is 75.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 67.96%. Comparing base (36f8357) to head (0c98071).

Files with missing lines Patch % Lines
rust/numaflow-core/src/source.rs 66.66% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2372      +/-   ##
==========================================
- Coverage   68.14%   67.96%   -0.19%     
==========================================
  Files         352      352              
  Lines       46500    46500              
==========================================
- Hits        31688    31604      -84     
- Misses      13732    13815      +83     
- Partials     1080     1081       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@adarsh0728 adarsh0728 requested a review from yhl25 January 28, 2025 07:23
@adarsh0728 adarsh0728 marked this pull request as ready for review January 28, 2025 08:07
@@ -306,6 +306,7 @@ impl SinkWriter {
// Discard the message from the tracker
this.tracker_handle.discard(offset).await?;
}
return Err(e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

were we sending the wrong metrics during error because of this logical bug?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the case when udsink container goes down because of a panic. Numa was not getting restarted immediately instead it was getting restarted after the multiple health check failure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Encountered during testing of changes I'm making in this PR

@vigith vigith changed the title chore: improve error handling in case of udsink failures fix: improve error handling in case of udsink failures Jan 29, 2025
@yhl25 yhl25 changed the title fix: improve error handling in case of udsink failures fix: error handling in case of udsink failures Jan 29, 2025
Signed-off-by: Vigith Maurice <[email protected]>
@vigith vigith enabled auto-merge (squash) January 29, 2025 01:34
@vigith vigith merged commit 5d60431 into numaproj:main Jan 29, 2025
24 checks passed
@adarsh0728 adarsh0728 deleted the err-udsink branch January 29, 2025 02:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants