set performance regression during full sync or cluster migration #4809

Open
kostasrim opened this issue Mar 21, 2025 · 7 comments
Labels: bug Something isn't working


@kostasrim (Contributor)

To reproduce:

import asyncio
import logging
import random

import pytest

# dfly_args, DflyInstanceFactory, df_seeder_factory and wait_for_replicas_state
# come from the Dragonfly pytest harness (tests/dragonfly).
@pytest.mark.asyncio
@dfly_args({"proactor_threads": 4})
async def test_set_performance(df_factory: DflyInstanceFactory, df_seeder_factory):
    master = df_factory.create()
    replica = df_factory.create()

    df_factory.start_all([master, replica])
    m_client = master.client()
    r_client = replica.client()

    logging.debug("DEBUG POPULATE")
    key_num = 1000000
    await m_client.execute_command("debug", "populate", key_num, "test", "160", "RAND")

    dbsize = await m_client.dbsize()
    assert dbsize > (key_num * 0.95)

    stop = False

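    # Each seeder task hammers the master with pipelines of 80 SETs on random
    # existing keys until the `stop` flag is flipped.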
    async def run_sets():
        client = master.client()
        while not stop:
            pipeline = client.pipeline(transaction=False)
            for _ in range(80):
                pipeline.execute_command(f"set test:{random.randint(1, key_num)} fooooo")
            val = await pipeline.execute()
        await client.close()

    seed_task = [asyncio.create_task(run_sets()) for _ in range(20)]

    await asyncio.sleep(5)
    stop = True
    for seed in seed_task:
        await seed

    info = await m_client.info("commandstats")
    logging.debug(f"info before replication is {info}")

    master.stop()
    # Restart the master to reset commandstats (and data) before the replication run.
    master.start()
    m_client = master.client()
    await m_client.execute_command("debug", "populate", key_num, "test", "160", "RAND")

    stop = False
    logging.debug("Start replication")

    assert "OK" == await r_client.execute_command(f"REPLICAOF localhost {master.port}")
    seed_task = [asyncio.create_task(run_sets()) for _ in range(20)]

    await wait_for_replicas_state(r_client, timeout=60)

    info = await m_client.info("commandstats")
    logging.debug(f"info after replication is {info}")

    stop = True
    for seed in seed_task:
        await seed
    logging.debug("Set finished")

Results on opt-build locally:

'cmdstat_set': {'calls': 1448000, 'usec': 322803, 'usec_per_call': 0.22293}

with replication during full sync:

'cmdstat_set': {'calls': 4130785, 'usec': 68150755, 'usec_per_call': 16.4983}

That's roughly an 80x slowdown in usec_per_call.

kostasrim added the bug label on Mar 21, 2025
@romange (Collaborator) commented Mar 21, 2025

We should probably introduce a similar test in cluster_test.py where we fill up a single shard, and then push a config that evicts a slot range.
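A rough sketch of what that could look like in cluster_test.py, assuming the same pytest harness as the repro above plus the DFLYCLUSTER MYID/CONFIG admin commands (issued here on the default port for brevity; the real harness may require the admin port). The config layout and slot boundaries below are illustrative assumptions, not a final test:

import json

@pytest.mark.asyncio
@dfly_args({"proactor_threads": 4, "cluster_mode": "yes"})
async def test_flush_slots_performance(df_factory: DflyInstanceFactory):
    node = df_factory.create()
    df_factory.start_all([node])
    client = node.client()

    node_id = await client.execute_command("DFLYCLUSTER", "MYID")

    def config(slot_end: int) -> str:
        # Single node owning slots [0, slot_end]; shrinking slot_end evicts the rest.
        return json.dumps([{
            "slot_ranges": [{"start": 0, "end": slot_end}],
            "master": {"id": node_id, "ip": "localhost", "port": node.port},
            "replicas": [],
        }])

    assert "OK" == await client.execute_command("DFLYCLUSTER", "CONFIG", config(16383))

    # Fill up the single shard, as in the repro above.
    await client.execute_command("debug", "populate", 1000000, "test", "160", "RAND")

    # Push a config that drops ownership of half the slot range; the node then has to
    # flush the evicted slots while serving SET traffic (e.g. the run_sets seeder above).
    assert "OK" == await client.execute_command("DFLYCLUSTER", "CONFIG", config(8191))

    info = await client.info("commandstats")
    logging.debug(f"commandstats during slot eviction: {info}")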

@adiholden (Collaborator)

@kostasrim I believe your first SET statistics are heavily skewed by the debug populate command you run in this test. You populate 1M keys, so
'cmdstat_set': {'calls': 1448000, 'usec': 322803, 'usec_per_call': 0.22293}
reflects the latency of ~1M ops from debug populate plus ~0.5M ops from the seeder in the test.
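
One way to separate the two phases in the repro above (assuming Dragonfly supports CONFIG RESETSTAT, as Redis does) is to clear commandstats right after the populate step, so the reported numbers cover only the seeder traffic:

    # Inside the test above, right after DEBUG POPULATE and the dbsize check:
    await m_client.execute_command("CONFIG", "RESETSTAT")  # assumption: supported by Dragonfly

    # ... run the seed tasks as before ...

    info = await m_client.info("commandstats")
    logging.debug(f"SET stats without the populate phase: {info.get('cmdstat_set')}")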

@kostasrim (Contributor, Author) commented Mar 24, 2025

@romange I saw your PR; you can then "drop" them and measure explicitly. +1

@kostasrim (Contributor, Author)

> ~0.5M ops from the seeder in the test

+1 will address

@romange (Collaborator) commented Mar 24, 2025

I used the following branch: https://github.com/dragonflydb/dragonfly/tree/ReproFlush
to analyze the yielding pattern during flush slots.

Attaching a partial log here.

  1. As I suspected, there is a heavy imbalance between the work we do in the FlushSlots fiber and the rest of the fibers.
  2. Based on what I saw, yielding barely lets command execution progress.
  3. The reason is that a connection requires several preemptions to finish a request, so there is a design bug where each yield accomplishes only a minimal amount of work.

I believe SleepFor should be an acceptable short-term solution. The longer-term solution is to introduce more sophisticated scheduling in helio.

flush.log.gz
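
As a loose illustration of point 3 (an asyncio analogy, not Dragonfly/helio code): a busy "flush" coroutine that only does a bare yield between buckets lets a request coroutine, which needs several suspension points per request, make very little progress, whereas a short real sleep (the SleepFor idea) frees up time for it. The durations and work sizes below are arbitrary.

import asyncio
import time


async def handle_requests(counter: list) -> None:
    while True:
        # A request needs several preemptions before it completes.
        for _ in range(4):
            await asyncio.sleep(0)
        counter[0] += 1


async def flush(duration_s: float, pause_s: float) -> None:
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        sum(range(50000))                 # simulate flushing one bucket (CPU-bound work)
        if pause_s == 0:
            await asyncio.sleep(0)        # bare yield: the request task advances one step per bucket
        else:
            await asyncio.sleep(pause_s)  # SleepFor-style pause: the request task gets real time


async def run(pause_s: float) -> int:
    counter = [0]
    req_task = asyncio.create_task(handle_requests(counter))
    await flush(duration_s=1.0, pause_s=pause_s)
    req_task.cancel()
    await asyncio.gather(req_task, return_exceptions=True)
    return counter[0]


async def main() -> None:
    for pause_s in (0.0, 0.0001):
        completed = await run(pause_s)
        print(f"pause={pause_s}: {completed} requests completed during 1s of flushing")


if __name__ == "__main__":
    asyncio.run(main())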

@adiholden (Collaborator)

When using the branch from https://github.com/dragonflydb/dragonfly/pull/4821/files (which uses SleepFor) to benchmark the migration, we actually got high CPU (90%) and worse throughput (90K) than just yielding on every bucket in flush slots (155K at 100% CPU).

@romange (Collaborator) commented Mar 25, 2025

So let's yield on every bucket.
