SIGABRT with ssd tiering on 1.26.3 #4672
Comments
Thanks. How do you run it? With replication?
No replication. This is the second run that failed with the same error; the previous attempt was with maxmemory only.
Any other information to provide?
How long did the process run before it crashed? Trying to see how frequent it is.
About 2-3 minutes from start each time. We have a pretty big dataset; it happens when we hit the upper memory limit.
Do you use the
Also do you use any NX/XX options?
I am guessing that without
we're doing
Offloading to SSD was the only reason we switched from Redis.
Is there anything in the server logs? Are you on the cloud?
If you are on one of the major clouds, we can create a datastore for you and monitor it while your traffic crashes it.
No, there is nothing else in the server logs.
@mainpart I will build a custom container to debug it further. Will you be able to try it out?
Should help debugging #4672 Signed-off-by: Roman Gershman <[email protected]>
No, it's not necessary. The data we use is in a private cloud repository, so we can't share it externally.
Well, we would appreciate it if you decided to help us find the bug even without disclosing the data. The debug printouts will help us better understand the state during the failure.
can you build and publish image
@iamtraining can you please run
E20250307 11:19:21.232584 13 string_family.cc:878] Inconsistent state in SetCmd::SetExisting key: image.scanner:store:scan-job:b49b5a634f9724afd19c06e2:application/vnd.security.vulnerability.report; version=1.1, it.key:image.scanner:store:scan-job:b49b5a634f9724afd19c06e2:application/vnd.security.vulnerability.report; version=1.1, it->first:image.scanner:store:scan-job:b49b5a634f9724afd19c06e2:application/vnd.security.vulnerability.report; version=1.1 params.prev_val: 0 18
Thanks, this helped. I will need another run though. @mainpart can you please try again with
@romange We are facing a similar issue.
We are using the Dragonfly Kubernetes operator; here are the values:
Let me know if you need more details.
@adityaupadhyay-fynd can you please use
@romange Here are the logs, using
@adityaupadhyay-fynd it really, really helped me narrow this down, but unfortunately I need another run.
@romange The frequency of this error is quite high, sometimes occurring up to 8 times in an hour. We also tried hosting Dragonfly on a VM instead of Kubernetes, but faced the same issue. We are running this as a replacement for Redis for a self-hosted version of a tool called Sentry. Interestingly, we have another identical Sentry setup running Dragonfly, but we are not facing this issue on that setup. So I am not sure how you would be able to replicate this, but if you need more logs from either setup I will be happy to share them with you. Logs using
Logs from another run
@adityaupadhyay-fynd Are these the entire logs? Dragonfly has info logs stored in
@adityaupadhyay-fynd can you please try with
@romange Sharing entire logs, running image
Thanks, this iteration moved me another step closer. It happens almost immediately once the snapshot is loaded and yet I still do not find the root cause. I will publish another version of the image with more logs. So thankful for your help @adityaupadhyay-fynd !
@adityaupadhyay-fynd I've released thanks!
@romange Here are the logs, image used:
Thanks @adityaupadhyay-fynd . Unfortunately that image did not pinpoint the root cause. Can you please try again with ghcr.io/dragonflydb/dragonfly-dev:ubuntu-dd3e015 ? Same flags?
@romange here are the logs from
🎯
The bug: when overriding an existing external string, we called `TieredStorage::Delete` to delete the external reference. This function called `CompactObj::Reset`, which cleared all the attributes on the value, including expiry. The fix: preserve the mask but clear the external state from the object. Fixes #4672 Signed-off-by: Roman Gershman <[email protected]>
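To make the root cause easier to follow, here is a toy Python model of the failure. It is a sketch under stated assumptions, not Dragonfly's code: the flag names `HAS_EXPIRE` and `EXTERNAL`, the `Value` and `ToyDb` classes, and the `buggy` switch are all hypothetical. It also assumes, based on the stack trace in this issue (`SetCmd::SetExisting` calling `DbSlice::AddExpire`, which fails on `db.expire.Insert(...).second`), that once the expiry flag was wiped the SET path tried to insert a fresh expire entry for a key that already had one.

```python
from dataclasses import dataclass

# Hypothetical flag bits; the real CompactObj mask layout is not shown in this thread.
HAS_EXPIRE = 0x1  # the key has an entry in the expire table
EXTERNAL = 0x2    # the value is offloaded to SSD


@dataclass
class Value:
    data: str = ""
    mask: int = 0

    def reset(self):
        """Buggy path: wipe everything, including HAS_EXPIRE."""
        self.data = ""
        self.mask = 0

    def clear_external(self):
        """Fixed path: drop only the external state, keep the other flags."""
        self.mask &= ~EXTERNAL


class ToyDb:
    def __init__(self, buggy):
        self.buggy = buggy
        self.values = {}   # key -> Value
        self.expire = {}   # key -> ttl

    def add_expire(self, key, ttl):
        # Stand-in for the failed check: inserting must create a new entry.
        if key in self.expire:
            raise RuntimeError("Check failed: expire entry already exists for " + key)
        self.expire[key] = ttl
        self.values[key].mask |= HAS_EXPIRE

    def set_new(self, key, data, ttl, external=False):
        self.values[key] = Value(data, EXTERNAL if external else 0)
        self.add_expire(key, ttl)

    def set_existing(self, key, data, ttl):
        val = self.values[key]
        if val.mask & EXTERNAL:
            # Deleting the external reference before overriding the value.
            if self.buggy:
                val.reset()
            else:
                val.clear_external()
        val.data = data
        if val.mask & HAS_EXPIRE:
            self.expire[key] = ttl      # update the existing expire entry
        else:
            self.add_expire(key, ttl)   # flag is gone, so this looks like a new expiry


if __name__ == "__main__":
    for buggy in (True, False):
        db = ToyDb(buggy=buggy)
        db.set_new("k", "v1", ttl=10, external=True)
        try:
            db.set_existing("k", "v2", ttl=20)
            print("buggy" if buggy else "fixed", "-> ok, ttl =", db.expire["k"])
        except RuntimeError as err:
            print("buggy" if buggy else "fixed", "->", err)
```

In this model the crash needs three things at once: the value is offloaded, the key has an expiry, and the key is overwritten. That is at least consistent with the reports above that the failure shows up only with SSD tiering enabled and once the memory limit is reached.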
@adityaupadhyay-fynd please try ghcr.io/dragonflydb/dragonfly-dev:ubuntu-525b102 and let me know if the problem persists. You can run it without the flags I gave you.
@adityaupadhyay-fynd how is it going? Did you experience any issues? You can also try 1.28 which should have the fix. Please let us know :)
@romange Have not faced this error since yesterday (Image:
F20250228 09:25:57.923086 14 db_slice.cc:820] Check failed: db.expire.Insert(main_it->first.AsRef(), ExpirePeriod(delta)).second
*** Check failure stack trace: ***
@ 0x562e118a3923 google::LogMessage::SendToLog()
@ 0x562e1189c0e7 google::LogMessage::Flush()
@ 0x562e1189da6f google::LogMessageFatal::~LogMessageFatal()
@ 0x562e110e5338 dfly::DbSlice::AddExpire()
@ 0x562e10f059c9 dfly::(anonymous namespace)::SetCmd::SetExisting()
@ 0x562e10f05e77 dfly::(anonymous namespace)::SetCmd::Set()
@ 0x562e10f067cf _ZN4absl12lts_2024011619functional_internal12InvokeObjectIZN4dfly12_GLOBAL__N_110SetGenericERKNS4_6SetCmd9SetParamsESt17basic_string_viewIcSt11char_traitsIcEESC_bPNS3_11TransactionEEUlSE_PNS3_11EngineShardEE_NSD_14RunnableResultEJSE_SG_EEET0_NS1_7VoidPtrEDpNS1_8ForwardTIT1_E4typeE
@ 0x562e1111bc19 dfly::Transaction::RunCallback()
@ 0x562e1111c7a9 dfly::Transaction::ScheduleInShard()
@ 0x562e1111e8df dfly::Transaction::ScheduleBatchInShard()
@ 0x562e11689d05 util::fb2::FiberQueue::Run()
@ 0x562e11169b90 _ZN5boost7context6detail11fiber_entryINS1_12fiber_recordINS0_5fiberEN4util3fb219FixedStackAllocatorEZNS6_6detail15WorkerFiberImplIZN4dfly9TaskQueue5StartESt17basic_string_viewIcSt11char_traitsIcEEEUlvE_JEEC4IS7_EESF_RKNS0_12preallocatedEOT_OSG_EUlOS4_E_EEEEvNS1_10transfer_tE
@ 0x562e116a900f make_fcontext
*** SIGABRT received at time=1740734757 on cpu 2 ***
PC: @ 0x7f9721e8a9fc (unknown) pthread_kill
[failure_signal_handler.cc : 345] RAW: Signal 11 raised at PC=0x7f9721e1c898 while already in AbslFailureSignalHandler()
*** SIGSEGV received at time=1740734757 on cpu 2 ***
PC: @ 0x7f9721e1c898 (unknown) abort
--maxmemory=1GB