Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIGABRT with ssd tiering on 1.26.3 #4672

Closed
mainpart opened this issue Feb 28, 2025 · 36 comments · Fixed by #4785
Closed

SIGABRT with ssd tiering on 1.26.3 #4672

mainpart opened this issue Feb 28, 2025 · 36 comments · Fixed by #4785
Assignees
Labels
bug Something isn't working Next Up task that is ready to be worked on and should be added to working queue
Milestone

Comments

@mainpart
Copy link

F20250228 09:25:57.923086 14 db_slice.cc:820] Check failed: db.expire.Insert(main_it->first.AsRef(), ExpirePeriod(delta)).second
*** Check failure stack trace: ***
@ 0x562e118a3923 google::LogMessage::SendToLog()
@ 0x562e1189c0e7 google::LogMessage::Flush()
@ 0x562e1189da6f google::LogMessageFatal::~LogMessageFatal()
@ 0x562e110e5338 dfly::DbSlice::AddExpire()
@ 0x562e10f059c9 dfly::(anonymous namespace)::SetCmd::SetExisting()
@ 0x562e10f05e77 dfly::(anonymous namespace)::SetCmd::Set()
@ 0x562e10f067cf _ZN4absl12lts_2024011619functional_internal12InvokeObjectIZN4dfly12_GLOBAL__N_110SetGenericERKNS4_6SetCmd9SetParamsESt17basic_string_viewIcSt11char_traitsIcEESC_bPNS3_11TransactionEEUlSE_PNS3_11EngineShardEE_NSD_14RunnableResultEJSE_SG_EEET0_NS1_7VoidPtrEDpNS1_8ForwardTIT1_E4typeE
@ 0x562e1111bc19 dfly::Transaction::RunCallback()
@ 0x562e1111c7a9 dfly::Transaction::ScheduleInShard()
@ 0x562e1111e8df dfly::Transaction::ScheduleBatchInShard()
@ 0x562e11689d05 util::fb2::FiberQueue::Run()
@ 0x562e11169b90 _ZN5boost7context6detail11fiber_entryINS1_12fiber_recordINS0_5fiberEN4util3fb219FixedStackAllocatorEZNS6_6detail15WorkerFiberImplIZN4dfly9TaskQueue5StartESt17basic_string_viewIcSt11char_traitsIcEEEUlvE_JEEC4IS7_EESF_RKNS0_12preallocatedEOT_OSG_EUlOS4_E_EEEEvNS1_10transfer_tE
@ 0x562e116a900f make_fcontext
*** SIGABRT received at time=1740734757 on cpu 2 ***
PC: @ 0x7f9721e8a9fc (unknown) pthread_kill
[failure_signal_handler.cc : 345] RAW: Signal 11 raised at PC=0x7f9721e1c898 while already in AbslFailureSignalHandler()
*** SIGSEGV received at time=1740734757 on cpu 2 ***
PC: @ 0x7f9721e1c898 (unknown) abort

  1. Pod Details:
    • The pod crashed with Exit Code: 139 (SIGSEGV - Segmentation fault)
    • Resource limits: 4Gi memory, 4 CPU
    • Resource requests: 1Gi memory, 200m CPU
    • Dragonfly version: df-v1.26.3
    • Command args include --maxmemory=1GB
@mainpart mainpart added the bug Something isn't working label Feb 28, 2025
@romange
Copy link
Collaborator

romange commented Feb 28, 2025

thanks. how do you run it? with replication?

@mainpart
Copy link
Author

mainpart commented Feb 28, 2025

no replication. this is second revision of running that failed with same error. previous attempt was only with maxmemory.

Args:
  --alsologtostderr
  --bind=::
  --port=6379
  --maxmemory=14G
  --proactor_threads=4
  --request_cache_limit=8000000
  --eviction_memory_budget_threshold=0.5
  --mem_defrag_threshold=0.1
  --tiered_offload_threshold=0.1
  --dir=/data
  --dbfilename=backup
  --tiered_prefix=/data/ssd
  --version_check=false

Any other information to provide?

@romange
Copy link
Collaborator

romange commented Feb 28, 2025

how long did process run before it crashed? Trying to see how frequent it is.

@mainpart
Copy link
Author

mainpart commented Feb 28, 2025

about 2-3 mins from start each time.
it falls on update existing key
SET EX 24h

we have pretty big dataset it happens when we hitting upper memory limit.

@romange
Copy link
Collaborator

romange commented Feb 28, 2025

Do you use the GET option, i.e. SET GET EX ... or it's a pure SET?

@romange
Copy link
Collaborator

romange commented Feb 28, 2025

Also do you use any NX/XX options?

@romange
Copy link
Collaborator

romange commented Feb 28, 2025

I am guessing that without --tiered_prefix it will work, is it possible to verify?

@mainpart
Copy link
Author

we're doing
set key value ex 84600 nx
next we're getting it
get key
if it found we're trying to update
set key value ex 84600 xx
then df crashes
ttl definitely is not expiring

@mainpart
Copy link
Author

mainpart commented Feb 28, 2025

offloading to SSD was the only reason we switched from redis.
we have limits on memory and unpredictable size of data to store.
so we decided to take offloading function of redis but it turns out that it available only on enterprise version.

@romange
Copy link
Collaborator

romange commented Feb 28, 2025

Is there anything in the server logs? Are you on the cloud?

@romange
Copy link
Collaborator

romange commented Feb 28, 2025

If you are on one of the major clouds, we can create a datastore that for you and monitor it, with your traffic crashing it

@romange romange added this to the v1.18 milestone Mar 2, 2025
@iamtraining
Copy link

Is there anything in the server logs? Are you on the cloud?

no there is nothing else in the server logs
the error wont appear if you dont update the ttl but leave it untouched when updating value

@romange romange removed this from the v1.18 milestone Mar 3, 2025
@romange
Copy link
Collaborator

romange commented Mar 4, 2025

@mainpart I will build a custom container to debug it further. will you be able to try it out?

romange added a commit that referenced this issue Mar 4, 2025
Should help debugging #4672

Signed-off-by: Roman Gershman <[email protected]>
@mainpart
Copy link
Author

mainpart commented Mar 4, 2025

No, it's not necessary.
We changed our algorithms so we do not update ttl. That's enough for us.

The data we use is a private cloud repository so we can't give it outside.

@romange
Copy link
Collaborator

romange commented Mar 4, 2025

Well, we will appreciate if you decide to help us with finding the bug even without disclosing the data. The debug printings will help us to understand the state better during the failure.

romange added a commit that referenced this issue Mar 4, 2025
Should help debugging #4672

Signed-off-by: Roman Gershman <[email protected]>
@iamtraining
Copy link

Well, we will appreciate if you decide to help us with finding the bug even without disclosing the data. The debug printings will help us to understand the state better during the failure.

can you build and publish image
we ll send you logs back

@romange romange changed the title SIGABRT on 1.26.3 SIGABRT with ssd tiering on 1.26.3 Mar 7, 2025
@romange
Copy link
Collaborator

romange commented Mar 7, 2025

@iamtraining can you please run ghcr.io/dragonflydb/dragonfly-dev:ubuntu-5aa1990 ?

@iamtraining
Copy link

@iamtraining can you please run ghcr.io/dragonflydb/dragonfly-dev:ubuntu-5aa1990 ?

E20250307 11:19:21.232584 13 string_family.cc:878] Inconsistent state in SetCmd::SetExisting key: image.scanner:store:scan-job:b49b5a634f9724afd19c06e2:application/vnd.security.vulnerability.report; version=1.1, it.key:image.scanner:store:scan-job:b49b5a634f9724afd19c06e2:application/vnd.security.vulnerability.report; version=1.1, it->first:image.scanner:store:scan-job:b49b5a634f9724afd19c06e2:application/vnd.security.vulnerability.report; version=1.1 params.prev_val: 0 18
F20250307 11:19:21.232707 13 db_slice.cc:927] Check failed: db.expire.Insert(main_it->first.AsRef(), ExpirePeriod(delta)).second
*** Check failure stack trace: ***
@ 0x5597352b9ea3 google::LogMessage::SendToLog()
@ 0x5597352b2347 google::LogMessage::Flush()
@ 0x5597352b3ccf google::LogMessageFatal::~LogMessageFatal()
@ 0x559734ae2b08 dfly::DbSlice::AddExpire()
@ 0x5597348ffdbd dfly::(anonymous namespace)::SetCmd::SetExisting()
@ 0x5597349002b7 dfly::(anonymous namespace)::SetCmd::Set()
@ 0x559734900c0f _ZN4absl12lts_2024072219functional_internal12InvokeObjectIZN4dfly12_GLOBAL__N_110SetGenericERKNS4_6SetCmd9SetParamsESt17basic_string_viewIcSt11char_traitsIcEESC_bPNS3_11TransactionEEUlSE_PNS3_11EngineShardEE_NSD_14RunnableResultEJSE_SG_EEET0_NS1_7VoidPtrEDpNS1_8ForwardTIT1_E4typeE
@ 0x559734b2539a dfly::Transaction::RunCallback()
@ 0x559734b25d0a dfly::Transaction::ScheduleInShard()
@ 0x559734b26ed2 dfly::Transaction::ScheduleInternal()
@ 0x559734b27a81 dfly::Transaction::Execute()
@ 0x559734b27b9a dfly::Transaction::ScheduleSingleHop()
@ 0x5597348fb2ee dfly::StringFamily::Set()
@ 0x559734b0611f dfly::CommandId::Invoke()
@ 0x5597347f8373 dfly::Service::InvokeCmd()
@ 0x5597347fefe1 dfly::Service::DispatchCommand()
@ 0x559734c01be9 facade::Connection::DispatchSingle()
@ 0x559734c0213f facade::Connection::ParseRedis()
@ 0x559734c0277c facade::Connection::IoLoop()
@ 0x559734c02e0d facade::Connection::ConnectionFlow()
@ 0x559734c03f3f facade::Connection::HandleRequests()
@ 0x5597350718bd util::ListenerInterface::RunSingleConnection()
@ 0x559735071e25 _ZN5boost7context6detail11fiber_entryINS1_12fiber_recordINS0_5fiberEN4util3fb219FixedStackAllocatorEZNS6_6detail15WorkerFiberImplIZZNS5_17ListenerInterface13RunAcceptLoopEvENKUlvE0_clEvEUlvE_JEEC4IS7_EESt17basic_string_viewIcSt11char_traitsIcEERKNS0_12preallocatedEOT_OSC_EUlOS4_E_EEEEvNS1_10transfer_tE
@ 0x5597350beb5f make_fcontext
*** SIGABRT received at time=1741346361 on cpu 2 ***
PC: @ 0x7fe2e29749fc (unknown) pthread_kill
[failure_signal_handler.cc : 345] RAW: Signal 11 raised at PC=0x7fe2e2906898 while already in AbslFailureSignalHandler()
*** SIGSEGV received at time=1741346361 on cpu 2 ***
PC: @ 0x7fe2e2906898 (unknown) abort

@romange
Copy link
Collaborator

romange commented Mar 9, 2025

Thanks, this helped. I will need another run though. @mainpart can you please try again with ghcr.io/dragonflydb/dragonfly-dev:ubuntu-bbd0a96 ?

@adityaupadhyay-fynd
Copy link

@romange We are facing similar issue

Check failed: db.expire.Insert(main_it->first.AsRef(), ExpirePeriod(delta)).second 
*** Check failure stack trace: ***
    @     0x56632e336343  google::LogMessage::SendToLog()
    @     0x56632e32eb07  google::LogMessage::Flush()
    @     0x56632e33048f  google::LogMessageFatal::~LogMessageFatal()
    @     0x56632dbc7810  dfly::DbSlice::AddExpire()
    @     0x56632da03aa8  dfly::SetCmd::SetExisting()
    @     0x56632da0419d  dfly::SetCmd::Set()
    @     0x56632da044ff  _ZN4absl12lts_2024011619functional_internal12InvokeObjectIZN4dfly10SetGenericEPNS3_17ConnectionContextERKNS3_6SetCmd9SetParamsESt17basic_string_viewIcSt11char_traitsIcEESD_EUlPNS3_11TransactionEPNS3_11EngineShardEE_NSE_14RunnableResultEJSF_SH_EEET0_NS1_7VoidPtrEDpNS1_8ForwardTIT1_E4typeE
    @     0x56632dbfdb19  dfly::Transaction::RunCallback()
    @     0x56632dbfe639  dfly::Transaction::ScheduleInShard()
    @     0x56632dc0074f  dfly::Transaction::ScheduleBatchInShard()
    @     0x56632e120845  util::fb2::FiberQueue::Run()
    @     0x56632dc41620  _ZN5boost7context6detail11fiber_entryINS1_12fiber_recordINS0_5fiberEN4util3fb219FixedStackAllocatorEZNS6_6detail15WorkerFiberImplIZN4dfly9TaskQueue5StartESt17basic_string_viewIcSt11char_traitsIcEEEUlvE_JEEC4IS7_EESF_RKNS0_12preallocatedEOT_OSG_EUlOS4_E_EEEEvNS1_10transfer_tE
    @     0x56632e13b96f  make_fcontext
*** SIGABRT received at time=1741174146 on cpu 0 ***
PC: @     0x7b23d83f49fc  (unknown)  pthread_kill
[failure_signal_handler.cc : 345] RAW: Signal 11 raised at PC=0x7b23d8386898 while already in AbslFailureSignalHandler()
*** SIGSEGV received at time=1741174146 on cpu 0 ***
PC: @     0x7b23d8386898  (unknown)  abort

We are using dragonfly kubernetes operator, here are the values:

apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  name: dragonfly-sentry
  namespace: sentry
  labels:
    app.kubernetes.io/name: dragonfly
    app.kubernetes.io/instance: dragonfly-sentry
    app.kubernetes.io/part-of: dragonfly-operator
    app.kubernetes.io/managed-by: helm
    app.kubernetes.io/created-by: dragonfly-operator
spec:
  replicas: 1  # HA causing issues, so keeping it 1
  image: "docker.dragonflydb.io/dragonflydb/dragonfly:v1.27.1"

  resources:
    requests:
      cpu: "2"
      memory: "26Gi"
    limits:
      memory: "26Gi"

  args:
    - "--tiered_prefix=/dragonfly/snapshots/dragonfly"
    - "--maxclients=100000"
    - "--tcp_keepalive=60"

  snapshot:
    cron: ""
    persistentVolumeClaimSpec:
      accessModes:
        - ReadWriteOnce
      storageClassName: premium-rwo
      resources:
        requests:
          storage: 500Gi

  nodeSelector:
    Dragonfly: "True"

  tolerations:
    - key: dedicated
      operator: Equal
      value: dragonfly
      effect: NoExecute

Let me know if you need more details.

@romange
Copy link
Collaborator

romange commented Mar 12, 2025

@adityaupadhyay-fynd can you please use ghcr.io/dragonflydb/dragonfly-dev:ubuntu-bbd0a96 and share the full server logs with us once it crashes?

@adityaupadhyay-fynd
Copy link

@romange Here are the logs, using ghcr.io/dragonflydb/dragonfly-dev:ubuntu-bbd0a96

[dragonfly-sentry-0] E20250312 10:23:03.659698    14 db_slice.cc:573] Expired item in expire table, but not in prime table: e:7001be780b7e4cf6bb3bf05c1aeeccac:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.671336    14 db_slice.cc:573] Expired item in expire table, but not in prime table: e:7ebd3961b1864188bb617477a3bfeae0:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.677497    14 db_slice.cc:573] Expired item in expire table, but not in prime table: e:ce55f6c172484446b010a4ba1dffc787:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.688176    11 db_slice.cc:573] Expired item in expire table, but not in prime table: e:a94b3c2b990441e292accd72da12c6fd:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.690101    11 db_slice.cc:573] Expired item in expire table, but not in prime table: e:f08a156885ae4e6a858b1ea28393ea85:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.691282    13 db_slice.cc:573] Expired item in expire table, but not in prime table: e:28b7fd8db2ad47408fc8b9a89bcaf505:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.692900    14 db_slice.cc:573] Expired item in expire table, but not in prime table: e:d6f09bf0f21c4a1899f1828b7d76dfa7:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.693713    14 db_slice.cc:573] Expired item in expire table, but not in prime table: e:3c825c47e3df4818b7a1e1705529b7d9:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.694005    13 db_slice.cc:573] Expired item in expire table, but not in prime table: e:3c7173242ef74b4287399ded97752263:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.694181    11 db_slice.cc:573] Expired item in expire table, but not in prime table: e:d11db1a6120e47cf9c364678fa9a6ac7:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.694900    14 db_slice.cc:573] Expired item in expire table, but not in prime table: e:562623bd35f14a518fc638a2a5e0f755:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.696310    14 db_slice.cc:573] Expired item in expire table, but not in prime table: e:7001be780b7e4cf6bb3bf05c1aeeccac:758 0 0
[dragonfly-sentry-0] E20250312 10:23:03.696338    14 string_family.cc:888] Inconsistent state in SetCmd::SetExisting  key: e:7001be780b7e4cf6bb3bf05c1aeeccac:758, it.key:e:7001be780b7e4cf6bb3bf05c1aeeccac:758, it->first:e:7001be780b7e4cf6bb3bf05c1aeeccac:758 params.prev_val: 0 16
[dragonfly-sentry-0] F20250312 10:23:03.696354    14 db_slice.cc:935] Check failed: db.expire.Insert(main_it->first.AsRef(), ExpirePeriod(delta)).second 
[dragonfly-sentry-0] *** Check failure stack trace: ***
[dragonfly-sentry-0] E20250312 10:23:03.696529    11 db_slice.cc:573] Expired item in expire table, but not in prime table: e:6d38b8d90901435e99462496e589837a:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.697285    11 db_slice.cc:573] Expired item in expire table, but not in prime table: e:07bf6e2540fc4bb4b3f9c1f6e24dfa9e:758 1 1
[dragonfly-sentry-0]     @     0x5cecf02b7ea3  google::LogMessage::SendToLog()
[dragonfly-sentry-0] E20250312 10:23:03.697979    13 db_slice.cc:573] Expired item in expire table, but not in prime table: e:855d12875d644c46b44d351e42150d1f:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.698073    13 db_slice.cc:573] Expired item in expire table, but not in prime table: e:f67e8c4af3f445d4bdc2a29d95380e31:758 1 1
[dragonfly-sentry-0]     @     0x5cecf02b0347  google::LogMessage::Flush()
[dragonfly-sentry-0]     @     0x5cecf02b1ccf  google::LogMessageFatal::~LogMessageFatal()
[dragonfly-sentry-0] E20250312 10:23:03.699990    11 db_slice.cc:573] Expired item in expire table, but not in prime table: e:33b47b8352fb4ebabb799fbf8ce7ab06:758 1 1
[dragonfly-sentry-0]     @     0x5cecefae1318  dfly::DbSlice::AddExpire()
[dragonfly-sentry-0]     @     0x5cecef8fdddd  dfly::(anonymous namespace)::SetCmd::SetExisting()
[dragonfly-sentry-0]     @     0x5cecef8fe56b  dfly::(anonymous namespace)::SetCmd::Set()
[dragonfly-sentry-0]     @     0x5cecef8fee5f  _ZN4absl12lts_2024072219functional_internal12InvokeObjectIZN4dfly12_GLOBAL__N_110SetGenericERKNS4_6SetCmd9SetParamsESt17basic_string_viewIcSt11char_traitsIcEESC_bPNS3_11TransactionEEUlSE_PNS3_11EngineShardEE_NSD_14RunnableResultEJSE_SG_EEET0_NS1_7VoidPtrEDpNS1_8ForwardTIT1_E4typeE
[dragonfly-sentry-0]     @     0x5cecefb23c9a  dfly::Transaction::RunCallback()
[dragonfly-sentry-0] E20250312 10:23:03.703192    12 db_slice.cc:573] Expired item in expire table, but not in prime table: e:b1f337b837a6474dab19d5b10583f9f7:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.704550    11 db_slice.cc:573] Expired item in expire table, but not in prime table: e:dea0c092e23d43d595ad5f39becc098b:758 1 1
[dragonfly-sentry-0]     @     0x5cecefb2460a  dfly::Transaction::ScheduleInShard()
[dragonfly-sentry-0]     @     0x5cecefb246dc  _ZNSt17_Function_handlerIFvvEZN4dfly11Transaction16ScheduleInternalEvEUlvE0_E9_M_invokeERKSt9_Any_data
[dragonfly-sentry-0]     @     0x5cecf009acb4  util::fb2::FiberQueue::Run()
[dragonfly-sentry-0]     @     0x5cecefb7ab10  _ZN5boost7context6detail11fiber_entryINS1_12fiber_recordINS0_5fiberEN4util3fb219FixedStackAllocatorEZNS6_6detail15WorkerFiberImplIZN4dfly9TaskQueue5StartESt17basic_string_viewIcSt11char_traitsIcEEEUlvE_JEEC4IS7_EESF_RKNS0_12preallocatedEOT_OSG_EUlOS4_E_EEEEvNS1_10transfer_tE
[dragonfly-sentry-0] E20250312 10:23:03.707162    11 db_slice.cc:573] Expired item in expire table, but not in prime table: e:d7f57e655f46417ea2c95419a73e105b:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.707410    13 db_slice.cc:573] Expired item in expire table, but not in prime table: e:aaee1ffa8ea848c4b0c7b80d6d1af35f:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.707566    11 db_slice.cc:573] Expired item in expire table, but not in prime table: e:4c554bd5554245b1ad247927031640e3:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.707610    12 db_slice.cc:573] Expired item in expire table, but not in prime table: e:7b25fb9a22ac4b3f9e8d14b1adea098d:758 1 1
[dragonfly-sentry-0]     @     0x5cecf00bd45f  make_fcontext
[dragonfly-sentry-0] *** SIGABRT received at time=1741774983 on cpu 3 ***
[dragonfly-sentry-0] PC: @     0x7a9ba7cd29fc  (unknown)  pthread_kill
[dragonfly-sentry-0] [failure_signal_handler.cc : 345] RAW: Signal 11 raised at PC=0x7a9ba7c64898 while already in AbslFailureSignalHandler()
[dragonfly-sentry-0] *** SIGSEGV received at time=1741774983 on cpu 3 ***
[dragonfly-sentry-0] PC: @     0x7a9ba7c64898  (unknown)  abort
[dragonfly-sentry-0] E20250312 10:23:03.709760    11 db_slice.cc:573] Expired item in expire table, but not in prime table: e:15ed3c82b97a439699945630e2088ee2:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.709847    13 db_slice.cc:573] Expired item in expire table, but not in prime table: e:a7644431495b4ea68495f7223f2383fd:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.713244    11 db_slice.cc:573] Expired item in expire table, but not in prime table: e:d11db1a6120e47cf9c364678fa9a6ac7:758 0 0
[dragonfly-sentry-0] E20250312 10:23:03.713271    11 string_family.cc:888] Inconsistent state in SetCmd::SetExisting  key: e:d11db1a6120e47cf9c364678fa9a6ac7:758, it.key:e:d11db1a6120e47cf9c364678fa9a6ac7:758, it->first:e:d11db1a6120e47cf9c364678fa9a6ac7:758 params.prev_val: 0 16
[dragonfly-sentry-0] F20250312 10:23:03.713282    11 db_slice.cc:935] Check failed: db.expire.Insert(main_it->first.AsRef(), ExpirePeriod(delta)).second 
[dragonfly-sentry-0] *** Check failure stack trace: ***
[dragonfly-sentry-0]     @     0x5cecf02b7ea3  google::LogMessage::SendToLog()
[dragonfly-sentry-0]     @     0x5cecf02b0347  google::LogMessage::Flush()
[dragonfly-sentry-0]     @     0x5cecf02b1ccf  google::LogMessageFatal::~LogMessageFatal()
[dragonfly-sentry-0]     @     0x5cecefae1318  dfly::DbSlice::AddExpire()
[dragonfly-sentry-0]     @     0x5cecef8fdddd  dfly::(anonymous namespace)::SetCmd::SetExisting()
[dragonfly-sentry-0]     @     0x5cecef8fe56b  dfly::(anonymous namespace)::SetCmd::Set()
[dragonfly-sentry-0]     @     0x5cecef8fee5f  _ZN4absl12lts_2024072219functional_internal12InvokeObjectIZN4dfly12_GLOBAL__N_110SetGenericERKNS4_6SetCmd9SetParamsESt17basic_string_viewIcSt11char_traitsIcEESC_bPNS3_11TransactionEEUlSE_PNS3_11EngineShardEE_NSD_14RunnableResultEJSE_SG_EEET0_NS1_7VoidPtrEDpNS1_8ForwardTIT1_E4typeE
[dragonfly-sentry-0]     @     0x5cecefb23c9a  dfly::Transaction::RunCallback()
[dragonfly-sentry-0] E20250312 10:23:03.719271    12 db_slice.cc:573] Expired item in expire table, but not in prime table: e:c138e076cc7342beba9add4be8d47f8d:758 1 1
[dragonfly-sentry-0]     @     0x5cecefb2460a  dfly::Transaction::ScheduleInShard()
[dragonfly-sentry-0]     @     0x5cecefb246dc  _ZNSt17_Function_handlerIFvvEZN4dfly11Transaction16ScheduleInternalEvEUlvE0_E9_M_invokeERKSt9_Any_data

@romange
Copy link
Collaborator

romange commented Mar 12, 2025

@adityaupadhyay-fynd it really really helped me to narrow this down but unfortunately i need another run.

  1. can you please try again with ghcr.io/dragonflydb/dragonfly-dev:ubuntu-bd08528 ?
  2. how quickly does it reproduce? is it possible to reproduce it somehow in my environment?

@romange romange self-assigned this Mar 12, 2025
@adityaupadhyay-fynd
Copy link

adityaupadhyay-fynd commented Mar 13, 2025

@romange The frequency of this error is quite high, sometimes occurring upto 8 times in an hour. We tried hosting dragonfly on VM instead of Kubernetes also, but faced the same issue.

We are running this as a replacement for Redis for a self hosted version of a tool called Sentry. Interestingly, we have another identical Sentry setup running Dragonfly, however we are not facing this issue on that setup. So not sure how you would be able to replicate this, but if you need more logs from either setup I will be happy to share them with you.

Logs using ghcr.io/dragonflydb/dragonfly-dev:ubuntu-bd08528 image

E20250313 06:50:42.811220    14 tiered_storage.cc:161] Inconsistent expire state for e:596997b857c64fbe82fcdedb9af77206:758
E20250313 06:50:42.811242    14 tiered_storage.cc:161] Inconsistent expire state for e:1f7cbc5ab0ae49a2b44932a8020837a5:758
E20250313 06:50:42.806047    13 tiered_storage.cc:161] Inconsistent expire state for e:08bbe74ac3db404d8241fdc72b954537:758
E20250313 06:50:42.814425    13 tiered_storage.cc:161] Inconsistent expire state for e:8ca941ec29cd42748521e1d1b655de9d:758
E20250313 06:50:42.814450    13 tiered_storage.cc:161] Inconsistent expire state for e:8e2f9af9fa534ef281f70684bc6dea4e:758
W20250313 06:50:42.973462    14 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
E20250313 06:50:43.059314    11 db_slice.cc:573] Expired item in expire table, but not in prime table: e:21dd3197bb7d445c84d959e1c655d0f0:758 1 1
W20250313 06:50:43.168561    13 main_service.cc:1355]  EVALSHA 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e 1 758 RECORD 1741846613 sim:1 8 2592000 3 5000 758 2882967 a 1 5254,7264 1 1 4037,140 1 1 172,15442 1 1 1531,9103 1 1 2200,881 1 1 3162,8648 1 1 429,9271 1 1 387,835 1 c 1 4166,3080 1 1 12891,7430 1 1 18313,12268 1 1 3190,17277 1 1 6866,2747 1 1 15657,6402 1 1 15561,1252 1 1 7513,14809 1 failed with reason: Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
W20250313 06:50:43.179450    14 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
W20250313 06:50:43.381839    13 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
E20250313 06:50:43.446089    14 db_slice.cc:573] Expired item in expire table, but not in prime table: e:1b9fe5e9d6814e5fa5a501cc60776e41:758 1 1
W20250313 06:50:43.585512    11 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
E20250313 06:50:43.785785    11 db_slice.cc:573] Expired item in expire table, but not in prime table: e:299b97378d6b456fa1af89daa70e3017:758 1 1
W20250313 06:50:43.787040    13 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
E20250313 06:50:43.788897    14 db_slice.cc:573] Expired item in expire table, but not in prime table: e:2070833dae244d85be07945db947f86f:758 1 1
E20250313 06:50:43.828667    13 db_slice.cc:573] Expired item in expire table, but not in prime table: e:f87534c1eb6e456997a400376cda5e51:758 1 1
E20250313 06:50:43.931804    12 db_slice.cc:573] Expired item in expire table, but not in prime table: e:a398e169f4394743a8cc805efb17e9b5:758 1 1
E20250313 06:50:43.945291    11 db_slice.cc:573] Expired item in expire table, but not in prime table: e:21dd3197bb7d445c84d959e1c655d0f0:758 0 0
E20250313 06:50:43.945329    11 string_family.cc:888] Inconsistent state in SetCmd::SetExisting  key: e:21dd3197bb7d445c84d959e1c655d0f0:758, it.key:e:21dd3197bb7d445c84d959e1c655d0f0:758, it->first:e:21dd3197bb7d445c84d959e1c655d0f0:758 params.prev_val: 0 16
F20250313 06:50:43.945343    11 db_slice.cc:935] Check failed: db.expire.Insert(main_it->first.AsRef(), ExpirePeriod(delta)).second
*** Check failure stack trace: ***
    @     0x59b9e1f6bea3  google::LogMessage::SendToLog()
    @     0x59b9e1f64347  google::LogMessage::Flush()
    @     0x59b9e1f65ccf  google::LogMessageFatal::~LogMessageFatal()
    @     0x59b9e1794c48  dfly::DbSlice::AddExpire()
    @     0x59b9e15b0e2d  dfly::(anonymous namespace)::SetCmd::SetExisting()
    @     0x59b9e15b15bb  dfly::(anonymous namespace)::SetCmd::Set()
    @     0x59b9e15b1eaf  _ZN4absl12lts_2024072219functional_internal12InvokeObjectIZN4dfly12_GLOBAL__N_110SetGenericERKNS4_6SetCmd9SetParamsESt17basic_string_viewIcSt11char_traitsIcEESC_bPNS3_11TransactionEEUlSE_PNS3_11EngineShardEE_NSD_14RunnableResultEJSE_SG_EEET0_NS1_7VoidPtrEDpNS1_8ForwardTIT1_E4typeE
    @     0x59b9e17d75ca  dfly::Transaction::RunCallback()
    @     0x59b9e17d7f3a  dfly::Transaction::ScheduleInShard()
    @     0x59b9e17d800c  _ZNSt17_Function_handlerIFvvEZN4dfly11Transaction16ScheduleInternalEvEUlvE0_E9_M_invokeERKSt9_Any_data
    @     0x59b9e1d4e5e4  util::fb2::FiberQueue::Run()
    @     0x59b9e182e440  _ZN5boost7context6detail11fiber_entryINS1_12fiber_recordINS0_5fiberEN4util3fb219FixedStackAllocatorEZNS6_6detail15WorkerFiberImplIZN4dfly9TaskQueue5StartESt17basic_string_viewIcSt11char_traitsIcEEEUlvE_JEEC4IS7_EESF_RKNS0_12preallocatedEOT_OSG_EUlOS4_E_EEEEvNS1_10transfer_tE
    @     0x59b9e1d70d8f  make_fcontext
*** SIGABRT received at time=1741848643 on cpu 0 ***
PC: @     0x7a9eb72f29fc  (unknown)  pthread_kill
[failure_signal_handler.cc : 345] RAW: Signal 11 raised at PC=0x7a9eb7284898 while already in AbslFailureSignalHandler()
*** SIGSEGV received at time=1741848643 on cpu 0 ***
PC: @     0x7a9eb7284898  (unknown)  abort

Logs from another run

E20250313 07:00:43.139518    11 tiered_storage.cc:161] Inconsistent expire state for e:4eed5903c47e4ce2adca73363af15077:758
E20250313 07:00:43.139747    11 tiered_storage.cc:161] Inconsistent expire state for e:2a2e18c1cd364992962fd76be3b1b8db:758
E20250313 07:00:43.142809    11 tiered_storage.cc:161] Inconsistent expire state for e:271ebfa83e924bfd898dfe67f68b7c29:758
E20250313 07:00:43.143143    12 tiered_storage.cc:161] Inconsistent expire state for e:32b76b24167948e9986fdfe2ae3e2ad4:758
E20250313 07:00:43.151912    11 tiered_storage.cc:161] Inconsistent expire state for e:7b4e210cdc5446aea1ab1d2803cb2c17:758
E20250313 07:00:43.158300    14 tiered_storage.cc:161] Inconsistent expire state for e:bdf39ab8ad7f45748d80d26d8c444602:758
E20250313 07:00:43.202508    13 tiered_storage.cc:161] Inconsistent expire state for e:5faf95eab9f64df7b42cdc8bf1e2556b:758
E20250313 07:00:43.208093    14 tiered_storage.cc:161] Inconsistent expire state for e:395fe7d3de2b4f58981444bd837c9707:758
W20250313 07:00:43.338932    13 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
W20250313 07:00:43.546926    13 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
W20250313 07:00:43.753705    13 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
W20250313 07:00:43.957517    11 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
W20250313 07:00:44.072194    13 main_service.cc:1355]  EVALSHA 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e 1 758 RECORD 1741846939 sim:1 8 2592000 3 5000 758 2882967 a 1 5254,7264 1 1 4037,140 1 1 172,15442 1 1 1531,9103 1 1 2200,881 1 1 3162,8648 1 1 429,9271 1 1 387,835 1 c 1 4166,3080 1 1 12891,7430 1 1 18313,12268 1 1 3190,17277 1 1 6866,2747 1 1 15657,6402 1 1 15561,1252 1 1 7513,14809 1 failed with reason: Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
W20250313 07:00:44.177639    14 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
W20250313 07:00:44.427978    13 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
W20250313 07:00:44.628247    13 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
W20250313 07:00:44.834594    11 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
W20250313 07:00:45.037518    13 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{10}:a:f:352464
W20250313 07:00:45.087677    13 main_service.cc:1355]  EVALSHA 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e 1 758 RECORD 1741847080 sim:1 8 2592000 3 5000 758 2882967 a 1 5254,7264 1 1 4037,140 1 1 172,15442 1 1 1531,9103 1 1 2200,881 1 1 3162,8648 1 1 429,9271 1 1 387,835 1 c 1 4166,3080 1 1 12891,7430 1 1 18313,12268 1 1 3190,17277 1 1 6866,2747 1 1 15657,6402 1 1 15561,1252 1 1 7513,14809 1 failed with reason: Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
W20250313 07:00:45.256618    11 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
W20250313 07:00:45.488197    13 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{10}:a:f:5461
W20250313 07:00:45.689709    13 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
E20250313 07:00:45.859686    14 db_slice.cc:573] Expired item in expire table, but not in prime table: e:86871f40cd3a4c649acaffaaeea8d495:758 1 1
E20250313 07:00:45.891831    14 db_slice.cc:573] Expired item in expire table, but not in prime table: e:86871f40cd3a4c649acaffaaeea8d495:758 0 0
E20250313 07:00:45.891883    14 string_family.cc:888] Inconsistent state in SetCmd::SetExisting  key: e:86871f40cd3a4c649acaffaaeea8d495:758, it.key:e:86871f40cd3a4c649acaffaaeea8d495:758, it->first:e:86871f40cd3a4c649acaffaaeea8d495:758 params.prev_val: 0 16
F20250313 07:00:45.891933    14 db_slice.cc:935] Check failed: db.expire.Insert(main_it->first.AsRef(), ExpirePeriod(delta)).second
*** Check failure stack trace: ***
    @     0x57056b5dfea3  google::LogMessage::SendToLog()
    @     0x57056b5d8347  google::LogMessage::Flush()
    @     0x57056b5d9ccf  google::LogMessageFatal::~LogMessageFatal()
    @     0x57056ae08c48  dfly::DbSlice::AddExpire()
    @     0x57056ac24e2d  dfly::(anonymous namespace)::SetCmd::SetExisting()
    @     0x57056ac255bb  dfly::(anonymous namespace)::SetCmd::Set()
    @     0x57056ac25eaf  _ZN4absl12lts_2024072219functional_internal12InvokeObjectIZN4dfly12_GLOBAL__N_110SetGenericERKNS4_6SetCmd9SetParamsESt17basic_string_viewIcSt11char_traitsIcEESC_bPNS3_11TransactionEEUlSE_PNS3_11EngineShardEE_NSD_14RunnableResultEJSE_SG_EEET0_NS1_7VoidPtrEDpNS1_8ForwardTIT1_E4typeE
    @     0x57056ae4b5ca  dfly::Transaction::RunCallback()
    @     0x57056ae4bf3a  dfly::Transaction::ScheduleInShard()
    @     0x57056ae4c00c  _ZNSt17_Function_handlerIFvvEZN4dfly11Transaction16ScheduleInternalEvEUlvE0_E9_M_invokeERKSt9_Any_data
    @     0x57056b3c25e4  util::fb2::FiberQueue::Run()
    @     0x57056aea2440  _ZN5boost7context6detail11fiber_entryINS1_12fiber_recordINS0_5fiberEN4util3fb219FixedStackAllocatorEZNS6_6detail15WorkerFiberImplIZN4dfly9TaskQueue5StartESt17basic_string_viewIcSt11char_traitsIcEEEUlvE_JEEC4IS7_EESF_RKNS0_12preallocatedEOT_OSG_EUlOS4_E_EEEEvNS1_10transfer_tE
    @     0x57056b3e4d8f  make_fcontext
*** SIGABRT received at time=1741849245 on cpu 3 ***
PC: @     0x7db30fd999fc  (unknown)  pthread_kill
[failure_signal_handler.cc : 345] RAW: Signal 11 raised at PC=0x7db30fd2b898 while already in AbslFailureSignalHandler()
*** SIGSEGV received at time=1741849245 on cpu 3 ***
PC: @     0x7db30fd2b898  (unknown)  abort

@romange
Copy link
Collaborator

romange commented Mar 13, 2025

@adityaupadhyay-fynd Are these the entire logs? Dragonfly has info logs stored in /tmp/ and they start with

Image

@romange
Copy link
Collaborator

romange commented Mar 13, 2025

@adityaupadhyay-fynd can you please try with ghcr.io/dragonflydb/dragonfly-dev:ubuntu-a8a2e07 and run it with --tiered_experimental_cooling=false ? Also, can you send me all the logs you have, not just the last few lines?
If you want you can DM me with the logs.

@adityaupadhyay-fynd
Copy link

@romange Sharing entire logs, running image ghcr.io/dragonflydb/dragonfly-dev:ubuntu-a8a2e07 with --tiered_experimental_cooling=false flag.

dragonfly-error.log

@romange
Copy link
Collaborator

romange commented Mar 13, 2025

Thanks, this iteration moved me another step closer. It happens almost immediately once the snapshot is loaded and yet I still do not find the root cause. I will publish another version of the image with more logs. So thankful for your help @adityaupadhyay-fynd !

@romange
Copy link
Collaborator

romange commented Mar 13, 2025

@adityaupadhyay-fynd I've released ghcr.io/dragonflydb/dragonfly-dev:ubuntu-4fbbf14. Can you please run it with
--tiered_experimental_cooling=false --vmodule=tiered_storage=1,disk_storage=1 ?

thanks!

@adityaupadhyay-fynd
Copy link

@romange Here are the logs, image used: ghcr.io/dragonflydb/dragonfly-dev:ubuntu-4fbbf14, with --tiered_experimental_cooling=false --vmodule=tiered_storage=1,disk_storage=1 flags set

dragonfly-log-4fbbf14.log

@romange
Copy link
Collaborator

romange commented Mar 17, 2025

Thanks @adityaupadhyay-fynd . Unfortunately that image did not pinpoint the root cause.

Can you please try again with ghcr.io/dragonflydb/dragonfly-dev:ubuntu-dd3e015 ? Same flags?

@romange romange added this to the v1.29 milestone Mar 17, 2025
@adityaupadhyay-fynd
Copy link

@romange here are the logs from ghcr.io/dragonflydb/dragonfly-dev:ubuntu-dd3e015, using same flags as before.

dragonfly-dd3e015-2.log
dragonfly-dd3e015.log

@romange
Copy link
Collaborator

romange commented Mar 17, 2025

🎯

romange added a commit that referenced this issue Mar 17, 2025
The bug: during the override of the existing external string, we called
`TieredStorage::Delete` to delete the external reference. This function
called CompactObj::Reset that cleared all the attributes on the value, including
expiry.

The fix: preserve the mask but clear the external state from the object.
Fixes #4672

Signed-off-by: Roman Gershman <[email protected]>
@romange romange modified the milestones: v1.29, v1.28 Mar 17, 2025
@romange romange added the Next Up task that is ready to be worked on and should be added to working queue label Mar 17, 2025
romange added a commit that referenced this issue Mar 17, 2025
The bug: during the override of the existing external string, we called
`TieredStorage::Delete` to delete the external reference. This function
called CompactObj::Reset that cleared all the attributes on the value, including
expiry.

The fix: preserve the mask but clear the external state from the object.
Fixes #4672

Signed-off-by: Roman Gershman <[email protected]>
@romange
Copy link
Collaborator

romange commented Mar 17, 2025

@adityaupadhyay-fynd please try ghcr.io/dragonflydb/dragonfly-dev:ubuntu-525b102 and let me know if the problem persists. You can run it without the flags I gave you.

@romange
Copy link
Collaborator

romange commented Mar 18, 2025

@adityaupadhyay-fynd how is it going? Did you experience any issues? You can also try 1.28 which should have the fix. Please let us know :)

@adityaupadhyay-fynd
Copy link

@romange Have not faced this error since yesterday (Image: ghcr.io/dragonflydb/dragonfly-dev:ubuntu-525b102). Will try deploying image 1.28 and let you know. Thanks you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Next Up task that is ready to be worked on and should be added to working queue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants