SIGABRT with ssd tiering on 1.26.3 #4672

mainpart · 2025-02-28T09:47:58Z

F20250228 09:25:57.923086 14 db_slice.cc:820] Check failed: db.expire.Insert(main_it->first.AsRef(), ExpirePeriod(delta)).second
*** Check failure stack trace: ***
@ 0x562e118a3923 google::LogMessage::SendToLog()
@ 0x562e1189c0e7 google::LogMessage::Flush()
@ 0x562e1189da6f google::LogMessageFatal::~LogMessageFatal()
@ 0x562e110e5338 dfly::DbSlice::AddExpire()
@ 0x562e10f059c9 dfly::(anonymous namespace)::SetCmd::SetExisting()
@ 0x562e10f05e77 dfly::(anonymous namespace)::SetCmd::Set()
@ 0x562e10f067cf _ZN4absl12lts_2024011619functional_internal12InvokeObjectIZN4dfly12_GLOBAL__N_110SetGenericERKNS4_6SetCmd9SetParamsESt17basic_string_viewIcSt11char_traitsIcEESC_bPNS3_11TransactionEEUlSE_PNS3_11EngineShardEE_NSD_14RunnableResultEJSE_SG_EEET0_NS1_7VoidPtrEDpNS1_8ForwardTIT1_E4typeE
@ 0x562e1111bc19 dfly::Transaction::RunCallback()
@ 0x562e1111c7a9 dfly::Transaction::ScheduleInShard()
@ 0x562e1111e8df dfly::Transaction::ScheduleBatchInShard()
@ 0x562e11689d05 util::fb2::FiberQueue::Run()
@ 0x562e11169b90 _ZN5boost7context6detail11fiber_entryINS1_12fiber_recordINS0_5fiberEN4util3fb219FixedStackAllocatorEZNS6_6detail15WorkerFiberImplIZN4dfly9TaskQueue5StartESt17basic_string_viewIcSt11char_traitsIcEEEUlvE_JEEC4IS7_EESF_RKNS0_12preallocatedEOT_OSG_EUlOS4_E_EEEEvNS1_10transfer_tE
@ 0x562e116a900f make_fcontext
*** SIGABRT received at time=1740734757 on cpu 2 ***
PC: @ 0x7f9721e8a9fc (unknown) pthread_kill
[failure_signal_handler.cc : 345] RAW: Signal 11 raised at PC=0x7f9721e1c898 while already in AbslFailureSignalHandler()
*** SIGSEGV received at time=1740734757 on cpu 2 ***
PC: @ 0x7f9721e1c898 (unknown) abort

Pod Details:
- The pod crashed with Exit Code: 139 (SIGSEGV - Segmentation fault)
- Resource limits: 4Gi memory, 4 CPU
- Resource requests: 1Gi memory, 200m CPU
- Dragonfly version: df-v1.26.3
- Command args include --maxmemory=1GB

The text was updated successfully, but these errors were encountered:

romange · 2025-02-28T10:19:08Z

thanks. how do you run it? with replication?

mainpart · 2025-02-28T10:23:55Z

no replication. this is second revision of running that failed with same error. previous attempt was only with maxmemory.

Args:
  --alsologtostderr
  --bind=::
  --port=6379
  --maxmemory=14G
  --proactor_threads=4
  --request_cache_limit=8000000
  --eviction_memory_budget_threshold=0.5
  --mem_defrag_threshold=0.1
  --tiered_offload_threshold=0.1
  --dir=/data
  --dbfilename=backup
  --tiered_prefix=/data/ssd
  --version_check=false

Any other information to provide?

romange · 2025-02-28T10:27:36Z

how long did process run before it crashed? Trying to see how frequent it is.

mainpart · 2025-02-28T10:30:15Z

about 2-3 mins from start each time.
it falls on update existing key
SET EX 24h

we have pretty big dataset it happens when we hitting upper memory limit.

romange · 2025-02-28T10:53:32Z

Do you use the GET option, i.e. SET GET EX ... or it's a pure SET?

romange · 2025-02-28T10:54:06Z

Also do you use any NX/XX options?

romange · 2025-02-28T11:03:57Z

I am guessing that without --tiered_prefix it will work, is it possible to verify?

mainpart · 2025-02-28T11:20:21Z

we're doing
set key value ex 84600 nx
next we're getting it
get key
if it found we're trying to update
set key value ex 84600 xx
then df crashes
ttl definitely is not expiring

mainpart · 2025-02-28T11:27:32Z

offloading to SSD was the only reason we switched from redis.
we have limits on memory and unpredictable size of data to store.
so we decided to take offloading function of redis but it turns out that it available only on enterprise version.

romange · 2025-02-28T17:57:28Z

Is there anything in the server logs? Are you on the cloud?

romange · 2025-02-28T18:05:19Z

If you are on one of the major clouds, we can create a datastore that for you and monitor it, with your traffic crashing it

iamtraining · 2025-03-02T17:07:06Z

Is there anything in the server logs? Are you on the cloud?

no there is nothing else in the server logs
the error wont appear if you dont update the ttl but leave it untouched when updating value

romange · 2025-03-04T08:53:36Z

@mainpart I will build a custom container to debug it further. will you be able to try it out?

Should help debugging #4672 Signed-off-by: Roman Gershman <[email protected]>

mainpart · 2025-03-04T11:00:23Z

No, it's not necessary.
We changed our algorithms so we do not update ttl. That's enough for us.

The data we use is a private cloud repository so we can't give it outside.

romange · 2025-03-04T11:04:25Z

Well, we will appreciate if you decide to help us with finding the bug even without disclosing the data. The debug printings will help us to understand the state better during the failure.

Should help debugging #4672 Signed-off-by: Roman Gershman <[email protected]>

iamtraining · 2025-03-07T09:17:39Z

Well, we will appreciate if you decide to help us with finding the bug even without disclosing the data. The debug printings will help us to understand the state better during the failure.

can you build and publish image
we ll send you logs back

romange · 2025-03-07T10:12:05Z

@iamtraining can you please run ghcr.io/dragonflydb/dragonfly-dev:ubuntu-5aa1990 ?

iamtraining · 2025-03-07T11:22:14Z

@iamtraining can you please run ghcr.io/dragonflydb/dragonfly-dev:ubuntu-5aa1990 ?

E20250307 11:19:21.232584 13 string_family.cc:878] Inconsistent state in SetCmd::SetExisting key: image.scanner:store:scan-job:b49b5a634f9724afd19c06e2:application/vnd.security.vulnerability.report; version=1.1, it.key:image.scanner:store:scan-job:b49b5a634f9724afd19c06e2:application/vnd.security.vulnerability.report; version=1.1, it->first:image.scanner:store:scan-job:b49b5a634f9724afd19c06e2:application/vnd.security.vulnerability.report; version=1.1 params.prev_val: 0 18
F20250307 11:19:21.232707 13 db_slice.cc:927] Check failed: db.expire.Insert(main_it->first.AsRef(), ExpirePeriod(delta)).second
*** Check failure stack trace: ***
@ 0x5597352b9ea3 google::LogMessage::SendToLog()
@ 0x5597352b2347 google::LogMessage::Flush()
@ 0x5597352b3ccf google::LogMessageFatal::~LogMessageFatal()
@ 0x559734ae2b08 dfly::DbSlice::AddExpire()
@ 0x5597348ffdbd dfly::(anonymous namespace)::SetCmd::SetExisting()
@ 0x5597349002b7 dfly::(anonymous namespace)::SetCmd::Set()
@ 0x559734900c0f _ZN4absl12lts_2024072219functional_internal12InvokeObjectIZN4dfly12_GLOBAL__N_110SetGenericERKNS4_6SetCmd9SetParamsESt17basic_string_viewIcSt11char_traitsIcEESC_bPNS3_11TransactionEEUlSE_PNS3_11EngineShardEE_NSD_14RunnableResultEJSE_SG_EEET0_NS1_7VoidPtrEDpNS1_8ForwardTIT1_E4typeE
@ 0x559734b2539a dfly::Transaction::RunCallback()
@ 0x559734b25d0a dfly::Transaction::ScheduleInShard()
@ 0x559734b26ed2 dfly::Transaction::ScheduleInternal()
@ 0x559734b27a81 dfly::Transaction::Execute()
@ 0x559734b27b9a dfly::Transaction::ScheduleSingleHop()
@ 0x5597348fb2ee dfly::StringFamily::Set()
@ 0x559734b0611f dfly::CommandId::Invoke()
@ 0x5597347f8373 dfly::Service::InvokeCmd()
@ 0x5597347fefe1 dfly::Service::DispatchCommand()
@ 0x559734c01be9 facade::Connection::DispatchSingle()
@ 0x559734c0213f facade::Connection::ParseRedis()
@ 0x559734c0277c facade::Connection::IoLoop()
@ 0x559734c02e0d facade::Connection::ConnectionFlow()
@ 0x559734c03f3f facade::Connection::HandleRequests()
@ 0x5597350718bd util::ListenerInterface::RunSingleConnection()
@ 0x559735071e25 _ZN5boost7context6detail11fiber_entryINS1_12fiber_recordINS0_5fiberEN4util3fb219FixedStackAllocatorEZNS6_6detail15WorkerFiberImplIZZNS5_17ListenerInterface13RunAcceptLoopEvENKUlvE0_clEvEUlvE_JEEC4IS7_EESt17basic_string_viewIcSt11char_traitsIcEERKNS0_12preallocatedEOT_OSC_EUlOS4_E_EEEEvNS1_10transfer_tE
@ 0x5597350beb5f make_fcontext
*** SIGABRT received at time=1741346361 on cpu 2 ***
PC: @ 0x7fe2e29749fc (unknown) pthread_kill
[failure_signal_handler.cc : 345] RAW: Signal 11 raised at PC=0x7fe2e2906898 while already in AbslFailureSignalHandler()
*** SIGSEGV received at time=1741346361 on cpu 2 ***
PC: @ 0x7fe2e2906898 (unknown) abort

romange · 2025-03-09T11:20:54Z

Thanks, this helped. I will need another run though. @mainpart can you please try again with ghcr.io/dragonflydb/dragonfly-dev:ubuntu-bbd0a96 ?

adityaupadhyay-fynd · 2025-03-12T09:28:53Z

@romange We are facing similar issue

Check failed: db.expire.Insert(main_it->first.AsRef(), ExpirePeriod(delta)).second 
*** Check failure stack trace: ***
    @     0x56632e336343  google::LogMessage::SendToLog()
    @     0x56632e32eb07  google::LogMessage::Flush()
    @     0x56632e33048f  google::LogMessageFatal::~LogMessageFatal()
    @     0x56632dbc7810  dfly::DbSlice::AddExpire()
    @     0x56632da03aa8  dfly::SetCmd::SetExisting()
    @     0x56632da0419d  dfly::SetCmd::Set()
    @     0x56632da044ff  _ZN4absl12lts_2024011619functional_internal12InvokeObjectIZN4dfly10SetGenericEPNS3_17ConnectionContextERKNS3_6SetCmd9SetParamsESt17basic_string_viewIcSt11char_traitsIcEESD_EUlPNS3_11TransactionEPNS3_11EngineShardEE_NSE_14RunnableResultEJSF_SH_EEET0_NS1_7VoidPtrEDpNS1_8ForwardTIT1_E4typeE
    @     0x56632dbfdb19  dfly::Transaction::RunCallback()
    @     0x56632dbfe639  dfly::Transaction::ScheduleInShard()
    @     0x56632dc0074f  dfly::Transaction::ScheduleBatchInShard()
    @     0x56632e120845  util::fb2::FiberQueue::Run()
    @     0x56632dc41620  _ZN5boost7context6detail11fiber_entryINS1_12fiber_recordINS0_5fiberEN4util3fb219FixedStackAllocatorEZNS6_6detail15WorkerFiberImplIZN4dfly9TaskQueue5StartESt17basic_string_viewIcSt11char_traitsIcEEEUlvE_JEEC4IS7_EESF_RKNS0_12preallocatedEOT_OSG_EUlOS4_E_EEEEvNS1_10transfer_tE
    @     0x56632e13b96f  make_fcontext
*** SIGABRT received at time=1741174146 on cpu 0 ***
PC: @     0x7b23d83f49fc  (unknown)  pthread_kill
[failure_signal_handler.cc : 345] RAW: Signal 11 raised at PC=0x7b23d8386898 while already in AbslFailureSignalHandler()
*** SIGSEGV received at time=1741174146 on cpu 0 ***
PC: @     0x7b23d8386898  (unknown)  abort

We are using dragonfly kubernetes operator, here are the values:

apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  name: dragonfly-sentry
  namespace: sentry
  labels:
    app.kubernetes.io/name: dragonfly
    app.kubernetes.io/instance: dragonfly-sentry
    app.kubernetes.io/part-of: dragonfly-operator
    app.kubernetes.io/managed-by: helm
    app.kubernetes.io/created-by: dragonfly-operator
spec:
  replicas: 1  # HA causing issues, so keeping it 1
  image: "docker.dragonflydb.io/dragonflydb/dragonfly:v1.27.1"

  resources:
    requests:
      cpu: "2"
      memory: "26Gi"
    limits:
      memory: "26Gi"

  args:
    - "--tiered_prefix=/dragonfly/snapshots/dragonfly"
    - "--maxclients=100000"
    - "--tcp_keepalive=60"

  snapshot:
    cron: ""
    persistentVolumeClaimSpec:
      accessModes:
        - ReadWriteOnce
      storageClassName: premium-rwo
      resources:
        requests:
          storage: 500Gi

  nodeSelector:
    Dragonfly: "True"

  tolerations:
    - key: dedicated
      operator: Equal
      value: dragonfly
      effect: NoExecute

Let me know if you need more details.

romange · 2025-03-12T09:32:42Z

@adityaupadhyay-fynd can you please use ghcr.io/dragonflydb/dragonfly-dev:ubuntu-bbd0a96 and share the full server logs with us once it crashes?

adityaupadhyay-fynd · 2025-03-12T10:29:04Z

@romange Here are the logs, using ghcr.io/dragonflydb/dragonfly-dev:ubuntu-bbd0a96

[dragonfly-sentry-0] E20250312 10:23:03.659698    14 db_slice.cc:573] Expired item in expire table, but not in prime table: e:7001be780b7e4cf6bb3bf05c1aeeccac:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.671336    14 db_slice.cc:573] Expired item in expire table, but not in prime table: e:7ebd3961b1864188bb617477a3bfeae0:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.677497    14 db_slice.cc:573] Expired item in expire table, but not in prime table: e:ce55f6c172484446b010a4ba1dffc787:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.688176    11 db_slice.cc:573] Expired item in expire table, but not in prime table: e:a94b3c2b990441e292accd72da12c6fd:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.690101    11 db_slice.cc:573] Expired item in expire table, but not in prime table: e:f08a156885ae4e6a858b1ea28393ea85:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.691282    13 db_slice.cc:573] Expired item in expire table, but not in prime table: e:28b7fd8db2ad47408fc8b9a89bcaf505:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.692900    14 db_slice.cc:573] Expired item in expire table, but not in prime table: e:d6f09bf0f21c4a1899f1828b7d76dfa7:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.693713    14 db_slice.cc:573] Expired item in expire table, but not in prime table: e:3c825c47e3df4818b7a1e1705529b7d9:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.694005    13 db_slice.cc:573] Expired item in expire table, but not in prime table: e:3c7173242ef74b4287399ded97752263:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.694181    11 db_slice.cc:573] Expired item in expire table, but not in prime table: e:d11db1a6120e47cf9c364678fa9a6ac7:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.694900    14 db_slice.cc:573] Expired item in expire table, but not in prime table: e:562623bd35f14a518fc638a2a5e0f755:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.696310    14 db_slice.cc:573] Expired item in expire table, but not in prime table: e:7001be780b7e4cf6bb3bf05c1aeeccac:758 0 0
[dragonfly-sentry-0] E20250312 10:23:03.696338    14 string_family.cc:888] Inconsistent state in SetCmd::SetExisting  key: e:7001be780b7e4cf6bb3bf05c1aeeccac:758, it.key:e:7001be780b7e4cf6bb3bf05c1aeeccac:758, it->first:e:7001be780b7e4cf6bb3bf05c1aeeccac:758 params.prev_val: 0 16
[dragonfly-sentry-0] F20250312 10:23:03.696354    14 db_slice.cc:935] Check failed: db.expire.Insert(main_it->first.AsRef(), ExpirePeriod(delta)).second 
[dragonfly-sentry-0] *** Check failure stack trace: ***
[dragonfly-sentry-0] E20250312 10:23:03.696529    11 db_slice.cc:573] Expired item in expire table, but not in prime table: e:6d38b8d90901435e99462496e589837a:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.697285    11 db_slice.cc:573] Expired item in expire table, but not in prime table: e:07bf6e2540fc4bb4b3f9c1f6e24dfa9e:758 1 1
[dragonfly-sentry-0]     @     0x5cecf02b7ea3  google::LogMessage::SendToLog()
[dragonfly-sentry-0] E20250312 10:23:03.697979    13 db_slice.cc:573] Expired item in expire table, but not in prime table: e:855d12875d644c46b44d351e42150d1f:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.698073    13 db_slice.cc:573] Expired item in expire table, but not in prime table: e:f67e8c4af3f445d4bdc2a29d95380e31:758 1 1
[dragonfly-sentry-0]     @     0x5cecf02b0347  google::LogMessage::Flush()
[dragonfly-sentry-0]     @     0x5cecf02b1ccf  google::LogMessageFatal::~LogMessageFatal()
[dragonfly-sentry-0] E20250312 10:23:03.699990    11 db_slice.cc:573] Expired item in expire table, but not in prime table: e:33b47b8352fb4ebabb799fbf8ce7ab06:758 1 1
[dragonfly-sentry-0]     @     0x5cecefae1318  dfly::DbSlice::AddExpire()
[dragonfly-sentry-0]     @     0x5cecef8fdddd  dfly::(anonymous namespace)::SetCmd::SetExisting()
[dragonfly-sentry-0]     @     0x5cecef8fe56b  dfly::(anonymous namespace)::SetCmd::Set()
[dragonfly-sentry-0]     @     0x5cecef8fee5f  _ZN4absl12lts_2024072219functional_internal12InvokeObjectIZN4dfly12_GLOBAL__N_110SetGenericERKNS4_6SetCmd9SetParamsESt17basic_string_viewIcSt11char_traitsIcEESC_bPNS3_11TransactionEEUlSE_PNS3_11EngineShardEE_NSD_14RunnableResultEJSE_SG_EEET0_NS1_7VoidPtrEDpNS1_8ForwardTIT1_E4typeE
[dragonfly-sentry-0]     @     0x5cecefb23c9a  dfly::Transaction::RunCallback()
[dragonfly-sentry-0] E20250312 10:23:03.703192    12 db_slice.cc:573] Expired item in expire table, but not in prime table: e:b1f337b837a6474dab19d5b10583f9f7:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.704550    11 db_slice.cc:573] Expired item in expire table, but not in prime table: e:dea0c092e23d43d595ad5f39becc098b:758 1 1
[dragonfly-sentry-0]     @     0x5cecefb2460a  dfly::Transaction::ScheduleInShard()
[dragonfly-sentry-0]     @     0x5cecefb246dc  _ZNSt17_Function_handlerIFvvEZN4dfly11Transaction16ScheduleInternalEvEUlvE0_E9_M_invokeERKSt9_Any_data
[dragonfly-sentry-0]     @     0x5cecf009acb4  util::fb2::FiberQueue::Run()
[dragonfly-sentry-0]     @     0x5cecefb7ab10  _ZN5boost7context6detail11fiber_entryINS1_12fiber_recordINS0_5fiberEN4util3fb219FixedStackAllocatorEZNS6_6detail15WorkerFiberImplIZN4dfly9TaskQueue5StartESt17basic_string_viewIcSt11char_traitsIcEEEUlvE_JEEC4IS7_EESF_RKNS0_12preallocatedEOT_OSG_EUlOS4_E_EEEEvNS1_10transfer_tE
[dragonfly-sentry-0] E20250312 10:23:03.707162    11 db_slice.cc:573] Expired item in expire table, but not in prime table: e:d7f57e655f46417ea2c95419a73e105b:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.707410    13 db_slice.cc:573] Expired item in expire table, but not in prime table: e:aaee1ffa8ea848c4b0c7b80d6d1af35f:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.707566    11 db_slice.cc:573] Expired item in expire table, but not in prime table: e:4c554bd5554245b1ad247927031640e3:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.707610    12 db_slice.cc:573] Expired item in expire table, but not in prime table: e:7b25fb9a22ac4b3f9e8d14b1adea098d:758 1 1
[dragonfly-sentry-0]     @     0x5cecf00bd45f  make_fcontext
[dragonfly-sentry-0] *** SIGABRT received at time=1741774983 on cpu 3 ***
[dragonfly-sentry-0] PC: @     0x7a9ba7cd29fc  (unknown)  pthread_kill
[dragonfly-sentry-0] [failure_signal_handler.cc : 345] RAW: Signal 11 raised at PC=0x7a9ba7c64898 while already in AbslFailureSignalHandler()
[dragonfly-sentry-0] *** SIGSEGV received at time=1741774983 on cpu 3 ***
[dragonfly-sentry-0] PC: @     0x7a9ba7c64898  (unknown)  abort
[dragonfly-sentry-0] E20250312 10:23:03.709760    11 db_slice.cc:573] Expired item in expire table, but not in prime table: e:15ed3c82b97a439699945630e2088ee2:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.709847    13 db_slice.cc:573] Expired item in expire table, but not in prime table: e:a7644431495b4ea68495f7223f2383fd:758 1 1
[dragonfly-sentry-0] E20250312 10:23:03.713244    11 db_slice.cc:573] Expired item in expire table, but not in prime table: e:d11db1a6120e47cf9c364678fa9a6ac7:758 0 0
[dragonfly-sentry-0] E20250312 10:23:03.713271    11 string_family.cc:888] Inconsistent state in SetCmd::SetExisting  key: e:d11db1a6120e47cf9c364678fa9a6ac7:758, it.key:e:d11db1a6120e47cf9c364678fa9a6ac7:758, it->first:e:d11db1a6120e47cf9c364678fa9a6ac7:758 params.prev_val: 0 16
[dragonfly-sentry-0] F20250312 10:23:03.713282    11 db_slice.cc:935] Check failed: db.expire.Insert(main_it->first.AsRef(), ExpirePeriod(delta)).second 
[dragonfly-sentry-0] *** Check failure stack trace: ***
[dragonfly-sentry-0]     @     0x5cecf02b7ea3  google::LogMessage::SendToLog()
[dragonfly-sentry-0]     @     0x5cecf02b0347  google::LogMessage::Flush()
[dragonfly-sentry-0]     @     0x5cecf02b1ccf  google::LogMessageFatal::~LogMessageFatal()
[dragonfly-sentry-0]     @     0x5cecefae1318  dfly::DbSlice::AddExpire()
[dragonfly-sentry-0]     @     0x5cecef8fdddd  dfly::(anonymous namespace)::SetCmd::SetExisting()
[dragonfly-sentry-0]     @     0x5cecef8fe56b  dfly::(anonymous namespace)::SetCmd::Set()
[dragonfly-sentry-0]     @     0x5cecef8fee5f  _ZN4absl12lts_2024072219functional_internal12InvokeObjectIZN4dfly12_GLOBAL__N_110SetGenericERKNS4_6SetCmd9SetParamsESt17basic_string_viewIcSt11char_traitsIcEESC_bPNS3_11TransactionEEUlSE_PNS3_11EngineShardEE_NSD_14RunnableResultEJSE_SG_EEET0_NS1_7VoidPtrEDpNS1_8ForwardTIT1_E4typeE
[dragonfly-sentry-0]     @     0x5cecefb23c9a  dfly::Transaction::RunCallback()
[dragonfly-sentry-0] E20250312 10:23:03.719271    12 db_slice.cc:573] Expired item in expire table, but not in prime table: e:c138e076cc7342beba9add4be8d47f8d:758 1 1
[dragonfly-sentry-0]     @     0x5cecefb2460a  dfly::Transaction::ScheduleInShard()
[dragonfly-sentry-0]     @     0x5cecefb246dc  _ZNSt17_Function_handlerIFvvEZN4dfly11Transaction16ScheduleInternalEvEUlvE0_E9_M_invokeERKSt9_Any_data

romange · 2025-03-12T12:57:05Z

@adityaupadhyay-fynd it really really helped me to narrow this down but unfortunately i need another run.

can you please try again with ghcr.io/dragonflydb/dragonfly-dev:ubuntu-bd08528 ?
how quickly does it reproduce? is it possible to reproduce it somehow in my environment?

adityaupadhyay-fynd · 2025-03-13T07:01:35Z

@romange The frequency of this error is quite high, sometimes occurring upto 8 times in an hour. We tried hosting dragonfly on VM instead of Kubernetes also, but faced the same issue.

We are running this as a replacement for Redis for a self hosted version of a tool called Sentry. Interestingly, we have another identical Sentry setup running Dragonfly, however we are not facing this issue on that setup. So not sure how you would be able to replicate this, but if you need more logs from either setup I will be happy to share them with you.

Logs using ghcr.io/dragonflydb/dragonfly-dev:ubuntu-bd08528 image

E20250313 06:50:42.811220    14 tiered_storage.cc:161] Inconsistent expire state for e:596997b857c64fbe82fcdedb9af77206:758
E20250313 06:50:42.811242    14 tiered_storage.cc:161] Inconsistent expire state for e:1f7cbc5ab0ae49a2b44932a8020837a5:758
E20250313 06:50:42.806047    13 tiered_storage.cc:161] Inconsistent expire state for e:08bbe74ac3db404d8241fdc72b954537:758
E20250313 06:50:42.814425    13 tiered_storage.cc:161] Inconsistent expire state for e:8ca941ec29cd42748521e1d1b655de9d:758
E20250313 06:50:42.814450    13 tiered_storage.cc:161] Inconsistent expire state for e:8e2f9af9fa534ef281f70684bc6dea4e:758
W20250313 06:50:42.973462    14 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
E20250313 06:50:43.059314    11 db_slice.cc:573] Expired item in expire table, but not in prime table: e:21dd3197bb7d445c84d959e1c655d0f0:758 1 1
W20250313 06:50:43.168561    13 main_service.cc:1355]  EVALSHA 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e 1 758 RECORD 1741846613 sim:1 8 2592000 3 5000 758 2882967 a 1 5254,7264 1 1 4037,140 1 1 172,15442 1 1 1531,9103 1 1 2200,881 1 1 3162,8648 1 1 429,9271 1 1 387,835 1 c 1 4166,3080 1 1 12891,7430 1 1 18313,12268 1 1 3190,17277 1 1 6866,2747 1 1 15657,6402 1 1 15561,1252 1 1 7513,14809 1 failed with reason: Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
W20250313 06:50:43.179450    14 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
W20250313 06:50:43.381839    13 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
E20250313 06:50:43.446089    14 db_slice.cc:573] Expired item in expire table, but not in prime table: e:1b9fe5e9d6814e5fa5a501cc60776e41:758 1 1
W20250313 06:50:43.585512    11 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
E20250313 06:50:43.785785    11 db_slice.cc:573] Expired item in expire table, but not in prime table: e:299b97378d6b456fa1af89daa70e3017:758 1 1
W20250313 06:50:43.787040    13 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
E20250313 06:50:43.788897    14 db_slice.cc:573] Expired item in expire table, but not in prime table: e:2070833dae244d85be07945db947f86f:758 1 1
E20250313 06:50:43.828667    13 db_slice.cc:573] Expired item in expire table, but not in prime table: e:f87534c1eb6e456997a400376cda5e51:758 1 1
E20250313 06:50:43.931804    12 db_slice.cc:573] Expired item in expire table, but not in prime table: e:a398e169f4394743a8cc805efb17e9b5:758 1 1
E20250313 06:50:43.945291    11 db_slice.cc:573] Expired item in expire table, but not in prime table: e:21dd3197bb7d445c84d959e1c655d0f0:758 0 0
E20250313 06:50:43.945329    11 string_family.cc:888] Inconsistent state in SetCmd::SetExisting  key: e:21dd3197bb7d445c84d959e1c655d0f0:758, it.key:e:21dd3197bb7d445c84d959e1c655d0f0:758, it->first:e:21dd3197bb7d445c84d959e1c655d0f0:758 params.prev_val: 0 16
F20250313 06:50:43.945343    11 db_slice.cc:935] Check failed: db.expire.Insert(main_it->first.AsRef(), ExpirePeriod(delta)).second
*** Check failure stack trace: ***
    @     0x59b9e1f6bea3  google::LogMessage::SendToLog()
    @     0x59b9e1f64347  google::LogMessage::Flush()
    @     0x59b9e1f65ccf  google::LogMessageFatal::~LogMessageFatal()
    @     0x59b9e1794c48  dfly::DbSlice::AddExpire()
    @     0x59b9e15b0e2d  dfly::(anonymous namespace)::SetCmd::SetExisting()
    @     0x59b9e15b15bb  dfly::(anonymous namespace)::SetCmd::Set()
    @     0x59b9e15b1eaf  _ZN4absl12lts_2024072219functional_internal12InvokeObjectIZN4dfly12_GLOBAL__N_110SetGenericERKNS4_6SetCmd9SetParamsESt17basic_string_viewIcSt11char_traitsIcEESC_bPNS3_11TransactionEEUlSE_PNS3_11EngineShardEE_NSD_14RunnableResultEJSE_SG_EEET0_NS1_7VoidPtrEDpNS1_8ForwardTIT1_E4typeE
    @     0x59b9e17d75ca  dfly::Transaction::RunCallback()
    @     0x59b9e17d7f3a  dfly::Transaction::ScheduleInShard()
    @     0x59b9e17d800c  _ZNSt17_Function_handlerIFvvEZN4dfly11Transaction16ScheduleInternalEvEUlvE0_E9_M_invokeERKSt9_Any_data
    @     0x59b9e1d4e5e4  util::fb2::FiberQueue::Run()
    @     0x59b9e182e440  _ZN5boost7context6detail11fiber_entryINS1_12fiber_recordINS0_5fiberEN4util3fb219FixedStackAllocatorEZNS6_6detail15WorkerFiberImplIZN4dfly9TaskQueue5StartESt17basic_string_viewIcSt11char_traitsIcEEEUlvE_JEEC4IS7_EESF_RKNS0_12preallocatedEOT_OSG_EUlOS4_E_EEEEvNS1_10transfer_tE
    @     0x59b9e1d70d8f  make_fcontext
*** SIGABRT received at time=1741848643 on cpu 0 ***
PC: @     0x7a9eb72f29fc  (unknown)  pthread_kill
[failure_signal_handler.cc : 345] RAW: Signal 11 raised at PC=0x7a9eb7284898 while already in AbslFailureSignalHandler()
*** SIGSEGV received at time=1741848643 on cpu 0 ***
PC: @     0x7a9eb7284898  (unknown)  abort

Logs from another run

E20250313 07:00:43.139518    11 tiered_storage.cc:161] Inconsistent expire state for e:4eed5903c47e4ce2adca73363af15077:758
E20250313 07:00:43.139747    11 tiered_storage.cc:161] Inconsistent expire state for e:2a2e18c1cd364992962fd76be3b1b8db:758
E20250313 07:00:43.142809    11 tiered_storage.cc:161] Inconsistent expire state for e:271ebfa83e924bfd898dfe67f68b7c29:758
E20250313 07:00:43.143143    12 tiered_storage.cc:161] Inconsistent expire state for e:32b76b24167948e9986fdfe2ae3e2ad4:758
E20250313 07:00:43.151912    11 tiered_storage.cc:161] Inconsistent expire state for e:7b4e210cdc5446aea1ab1d2803cb2c17:758
E20250313 07:00:43.158300    14 tiered_storage.cc:161] Inconsistent expire state for e:bdf39ab8ad7f45748d80d26d8c444602:758
E20250313 07:00:43.202508    13 tiered_storage.cc:161] Inconsistent expire state for e:5faf95eab9f64df7b42cdc8bf1e2556b:758
E20250313 07:00:43.208093    14 tiered_storage.cc:161] Inconsistent expire state for e:395fe7d3de2b4f58981444bd837c9707:758
W20250313 07:00:43.338932    13 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
W20250313 07:00:43.546926    13 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
W20250313 07:00:43.753705    13 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
W20250313 07:00:43.957517    11 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
W20250313 07:00:44.072194    13 main_service.cc:1355]  EVALSHA 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e 1 758 RECORD 1741846939 sim:1 8 2592000 3 5000 758 2882967 a 1 5254,7264 1 1 4037,140 1 1 172,15442 1 1 1531,9103 1 1 2200,881 1 1 3162,8648 1 1 429,9271 1 1 387,835 1 c 1 4166,3080 1 1 12891,7430 1 1 18313,12268 1 1 3190,17277 1 1 6866,2747 1 1 15657,6402 1 1 15561,1252 1 1 7513,14809 1 failed with reason: Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
W20250313 07:00:44.177639    14 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
W20250313 07:00:44.427978    13 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
W20250313 07:00:44.628247    13 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
W20250313 07:00:44.834594    11 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
W20250313 07:00:45.037518    13 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{10}:a:f:352464
W20250313 07:00:45.087677    13 main_service.cc:1355]  EVALSHA 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e 1 758 RECORD 1741847080 sim:1 8 2592000 3 5000 758 2882967 a 1 5254,7264 1 1 4037,140 1 1 172,15442 1 1 1531,9103 1 1 2200,881 1 1 3162,8648 1 1 429,9271 1 1 387,835 1 c 1 4166,3080 1 1 12891,7430 1 1 18313,12268 1 1 3190,17277 1 1 6866,2747 1 1 15657,6402 1 1 15561,1252 1 1 7513,14809 1 failed with reason: Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
W20250313 07:00:45.256618    11 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
W20250313 07:00:45.488197    13 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{10}:a:f:5461
W20250313 07:00:45.689709    13 script_mgr.cc:328] Error running script (call to 7dbebb8fbb0f92b2d5b95737d60edb4bb41a845e): @user_script:453: script tried accessing undeclared key, key: sim:1:{758}:a:f:2882967
E20250313 07:00:45.859686    14 db_slice.cc:573] Expired item in expire table, but not in prime table: e:86871f40cd3a4c649acaffaaeea8d495:758 1 1
E20250313 07:00:45.891831    14 db_slice.cc:573] Expired item in expire table, but not in prime table: e:86871f40cd3a4c649acaffaaeea8d495:758 0 0
E20250313 07:00:45.891883    14 string_family.cc:888] Inconsistent state in SetCmd::SetExisting  key: e:86871f40cd3a4c649acaffaaeea8d495:758, it.key:e:86871f40cd3a4c649acaffaaeea8d495:758, it->first:e:86871f40cd3a4c649acaffaaeea8d495:758 params.prev_val: 0 16
F20250313 07:00:45.891933    14 db_slice.cc:935] Check failed: db.expire.Insert(main_it->first.AsRef(), ExpirePeriod(delta)).second
*** Check failure stack trace: ***
    @     0x57056b5dfea3  google::LogMessage::SendToLog()
    @     0x57056b5d8347  google::LogMessage::Flush()
    @     0x57056b5d9ccf  google::LogMessageFatal::~LogMessageFatal()
    @     0x57056ae08c48  dfly::DbSlice::AddExpire()
    @     0x57056ac24e2d  dfly::(anonymous namespace)::SetCmd::SetExisting()
    @     0x57056ac255bb  dfly::(anonymous namespace)::SetCmd::Set()
    @     0x57056ac25eaf  _ZN4absl12lts_2024072219functional_internal12InvokeObjectIZN4dfly12_GLOBAL__N_110SetGenericERKNS4_6SetCmd9SetParamsESt17basic_string_viewIcSt11char_traitsIcEESC_bPNS3_11TransactionEEUlSE_PNS3_11EngineShardEE_NSD_14RunnableResultEJSE_SG_EEET0_NS1_7VoidPtrEDpNS1_8ForwardTIT1_E4typeE
    @     0x57056ae4b5ca  dfly::Transaction::RunCallback()
    @     0x57056ae4bf3a  dfly::Transaction::ScheduleInShard()
    @     0x57056ae4c00c  _ZNSt17_Function_handlerIFvvEZN4dfly11Transaction16ScheduleInternalEvEUlvE0_E9_M_invokeERKSt9_Any_data
    @     0x57056b3c25e4  util::fb2::FiberQueue::Run()
    @     0x57056aea2440  _ZN5boost7context6detail11fiber_entryINS1_12fiber_recordINS0_5fiberEN4util3fb219FixedStackAllocatorEZNS6_6detail15WorkerFiberImplIZN4dfly9TaskQueue5StartESt17basic_string_viewIcSt11char_traitsIcEEEUlvE_JEEC4IS7_EESF_RKNS0_12preallocatedEOT_OSG_EUlOS4_E_EEEEvNS1_10transfer_tE
    @     0x57056b3e4d8f  make_fcontext
*** SIGABRT received at time=1741849245 on cpu 3 ***
PC: @     0x7db30fd999fc  (unknown)  pthread_kill
[failure_signal_handler.cc : 345] RAW: Signal 11 raised at PC=0x7db30fd2b898 while already in AbslFailureSignalHandler()
*** SIGSEGV received at time=1741849245 on cpu 3 ***
PC: @     0x7db30fd2b898  (unknown)  abort

romange · 2025-03-13T07:29:08Z

@adityaupadhyay-fynd Are these the entire logs? Dragonfly has info logs stored in /tmp/ and they start with

romange · 2025-03-13T09:25:28Z

@adityaupadhyay-fynd can you please try with ghcr.io/dragonflydb/dragonfly-dev:ubuntu-a8a2e07 and run it with --tiered_experimental_cooling=false ? Also, can you send me all the logs you have, not just the last few lines?
If you want you can DM me with the logs.

adityaupadhyay-fynd · 2025-03-13T13:48:08Z

@romange Sharing entire logs, running image ghcr.io/dragonflydb/dragonfly-dev:ubuntu-a8a2e07 with --tiered_experimental_cooling=false flag.

dragonfly-error.log

romange · 2025-03-13T14:24:51Z

Thanks, this iteration moved me another step closer. It happens almost immediately once the snapshot is loaded and yet I still do not find the root cause. I will publish another version of the image with more logs. So thankful for your help @adityaupadhyay-fynd !

romange · 2025-03-13T19:15:33Z

@adityaupadhyay-fynd I've released ghcr.io/dragonflydb/dragonfly-dev:ubuntu-4fbbf14. Can you please run it with
--tiered_experimental_cooling=false --vmodule=tiered_storage=1,disk_storage=1 ?

thanks!

adityaupadhyay-fynd · 2025-03-17T05:22:09Z

@romange Here are the logs, image used: ghcr.io/dragonflydb/dragonfly-dev:ubuntu-4fbbf14, with --tiered_experimental_cooling=false --vmodule=tiered_storage=1,disk_storage=1 flags set

dragonfly-log-4fbbf14.log

romange · 2025-03-17T07:04:49Z

Thanks @adityaupadhyay-fynd . Unfortunately that image did not pinpoint the root cause.

Can you please try again with ghcr.io/dragonflydb/dragonfly-dev:ubuntu-dd3e015 ? Same flags?

adityaupadhyay-fynd · 2025-03-17T08:10:17Z

@romange here are the logs from ghcr.io/dragonflydb/dragonfly-dev:ubuntu-dd3e015, using same flags as before.

dragonfly-dd3e015-2.log
dragonfly-dd3e015.log

romange · 2025-03-17T09:14:50Z

🎯

The bug: during the override of the existing external string, we called `TieredStorage::Delete` to delete the external reference. This function called CompactObj::Reset that cleared all the attributes on the value, including expiry. The fix: preserve the mask but clear the external state from the object. Fixes #4672 Signed-off-by: Roman Gershman <[email protected]>

romange · 2025-03-17T11:46:35Z

@adityaupadhyay-fynd please try ghcr.io/dragonflydb/dragonfly-dev:ubuntu-525b102 and let me know if the problem persists. You can run it without the flags I gave you.

romange · 2025-03-18T07:46:33Z

@adityaupadhyay-fynd how is it going? Did you experience any issues? You can also try 1.28 which should have the fix. Please let us know :)

adityaupadhyay-fynd · 2025-03-18T08:38:03Z

@romange Have not faced this error since yesterday (Image: ghcr.io/dragonflydb/dragonfly-dev:ubuntu-525b102). Will try deploying image 1.28 and let you know. Thanks you.

mainpart added the bug Something isn't working label Feb 28, 2025

romange added this to the v1.18 milestone Mar 2, 2025

romange removed this from the v1.18 milestone Mar 3, 2025

romange added a commit that referenced this issue Mar 4, 2025

chore: add debug printings to SetExisting

1fe439e

Should help debugging #4672 Signed-off-by: Roman Gershman <[email protected]>

romange mentioned this issue Mar 4, 2025

chore: add debug printings to SetExisting #4694

Merged

romange added a commit that referenced this issue Mar 4, 2025

chore: add debug printings to SetExisting (#4694)

028e080

Should help debugging #4672 Signed-off-by: Roman Gershman <[email protected]>

romange changed the title ~~SIGABRT on 1.26.3~~ SIGABRT with ssd tiering on 1.26.3 Mar 7, 2025

romange self-assigned this Mar 12, 2025

romange added this to the v1.29 milestone Mar 17, 2025

romange mentioned this issue Mar 17, 2025

fix: removal of Expiry bit when overriding external strings #4785

Merged

romange modified the milestones: v1.29, v1.28 Mar 17, 2025

romange added the Next Up task that is ready to be worked on and should be added to working queue label Mar 17, 2025

romange closed this as completed in #4785 Mar 17, 2025

romange closed this as completed in 66c2973 Mar 17, 2025

SIGABRT with ssd tiering on 1.26.3 #4672

SIGABRT with ssd tiering on 1.26.3 #4672

Comments

mainpart commented Feb 28, 2025

romange commented Feb 28, 2025

mainpart commented Feb 28, 2025 • edited Loading

romange commented Feb 28, 2025

mainpart commented Feb 28, 2025 • edited Loading

romange commented Feb 28, 2025

romange commented Feb 28, 2025

romange commented Feb 28, 2025

mainpart commented Feb 28, 2025

mainpart commented Feb 28, 2025 • edited Loading

romange commented Feb 28, 2025

romange commented Feb 28, 2025

iamtraining commented Mar 2, 2025

romange commented Mar 4, 2025

mainpart commented Mar 4, 2025

romange commented Mar 4, 2025

iamtraining commented Mar 7, 2025

romange commented Mar 7, 2025 • edited Loading

iamtraining commented Mar 7, 2025

romange commented Mar 9, 2025 • edited Loading

adityaupadhyay-fynd commented Mar 12, 2025

romange commented Mar 12, 2025 • edited Loading

adityaupadhyay-fynd commented Mar 12, 2025

romange commented Mar 12, 2025

adityaupadhyay-fynd commented Mar 13, 2025 • edited Loading

romange commented Mar 13, 2025

romange commented Mar 13, 2025

adityaupadhyay-fynd commented Mar 13, 2025

romange commented Mar 13, 2025

romange commented Mar 13, 2025

adityaupadhyay-fynd commented Mar 17, 2025

romange commented Mar 17, 2025

adityaupadhyay-fynd commented Mar 17, 2025

romange commented Mar 17, 2025

romange commented Mar 17, 2025

romange commented Mar 18, 2025

adityaupadhyay-fynd commented Mar 18, 2025

mainpart commented Feb 28, 2025 •

edited

Loading

mainpart commented Feb 28, 2025 •

edited

Loading

mainpart commented Feb 28, 2025 •

edited

Loading

romange commented Mar 7, 2025 •

edited

Loading

romange commented Mar 9, 2025 •

edited

Loading

romange commented Mar 12, 2025 •

edited

Loading

adityaupadhyay-fynd commented Mar 13, 2025 •

edited

Loading