[QUESTION] cluster migrations #114

Open
LHLHLHE opened this issue Dec 28, 2024 · 4 comments

Comments

LHLHLHE commented Dec 28, 2024

I noticed that the MergeTree engine is always applied to the migration table, which is why migration data is not replicated to other nodes, even with "migration_cluster": "cluster" in the database settings.

class Migration(models.ClickhouseModel):

Is this done on purpose? I would like to be able to track the migration history on all nodes in the cluster.

jayvynl (Owner) commented Dec 28, 2024

Yes, it's intentional.

Consider this situation: a cluster has two nodes, A and B. A model with the MergeTree engine is defined as:

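from clickhouse_backend import models


class Student(models.ClickhouseModel):
    name = models.StringField()
    address = models.StringField()
    score = models.Int8Field()

    class Meta:
        engine = models.MergeTree()

# settings.py: "default" connects to node A, "B" connects to node B.
DATABASES = {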
    "default": {
        "ENGINE": "clickhouse_backend.backend",
        "OPTIONS": {
            "migration_cluster": "cluster",
        }
    },
    "B": {
        "ENGINE": "clickhouse_backend.backend",
        "PORT": 9001,
        "OPTIONS": {
            "migration_cluster": "cluster",
        }
    }
}

First, apply migrations on the default database; the Student table is created on node A.

python manage.py migrate

Then, apply migrations on Node B.

python manage.py migrate --database B

If the migrations table were created on the cluster, Django would see the Student migration as already applied when run against node B, so the Student table would never be created on node B.

Remember, a model with the plain MergeTree engine is only created on the node where you ran the migrations. If you want data replication, use a Replicated engine. If you want data distribution, use the Distributed engine.
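The two-node scenario above can be sketched as a toy simulation. This is not the backend's code: `run_migrations`, `simulate`, and the migration name `0001_initial` are hypothetical and only model the "skip what is already recorded" behaviour of a migration runner.

```python
# Toy model: each node owns its local tables; the migration history is either
# kept per node (local MergeTree table) or shared by every node in the cluster.

def run_migrations(node, tables, history, migrations):
    """Apply every migration not yet recorded in `history` on `node`."""
    for name, table in migrations:
        if name in history:
            continue  # already recorded as applied, so the runner skips it
        tables[node].add(table)  # a plain MergeTree table is created locally only
        history.add(name)


def simulate(shared_history):
    """Run the migrations against node A, then node B, and return the tables."""
    migrations = [("0001_initial", "student")]
    tables = {"A": set(), "B": set()}
    if shared_history:
        common = set()  # one history visible from every node
        histories = {"A": common, "B": common}
    else:
        histories = {"A": set(), "B": set()}  # one history per node
    for node in ("A", "B"):
        run_migrations(node, tables, histories[node], migrations)
    return tables


print(simulate(shared_history=False))
print(simulate(shared_history=True))
```

With per-node history both nodes end up with the `student` table; with a shared history node B is skipped, which is the failure mode described above.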

If you want to query the migration records on every node in the cluster, use the following SQL:

select app, name, applied from clusterAllReplicas('your cluster name', currentDatabase(), 'django_migrations') where not deleted
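As a rough sketch, the query above could be wrapped in a small helper and run through any DB-API style cursor. The helper names are hypothetical, and the `hostName()` column is an addition not present in the original query; it is included on the assumption that you want to see which node each row came from.

```python
import re


def migration_history_query(cluster):
    """Build the clusterAllReplicas query over django_migrations."""
    # The cluster name is interpolated into the SQL text, so accept only
    # plain identifiers to avoid injecting arbitrary SQL.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", cluster):
        raise ValueError(f"unexpected cluster name: {cluster!r}")
    return (
        "SELECT hostName() AS host, app, name, applied "
        f"FROM clusterAllReplicas('{cluster}', currentDatabase(), 'django_migrations') "
        "WHERE NOT deleted"
    )


def fetch_migration_history(cursor, cluster):
    """Run the query on an open cursor and return all rows."""
    cursor.execute(migration_history_query(cluster))
    return cursor.fetchall()
```

In a Django project the cursor would typically come from `django.db.connections["default"].cursor()`, or from whichever alias points at a node that is still reachable.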

LHLHLHE (Author) commented Jan 9, 2025

I understand this, but I would like to be able to check migrations on node B when node A is down.

PierreF commented Jan 20, 2025

I hit a similar issue: I am unable to apply a RunSQL migration that uses CREATE TABLE ... ON CLUSTER on a 4-node ClickHouse cluster that is accessed through a load balancer, i.e. the application can't connect to one specific ClickHouse node.

After thinking about it for a bit, I came to the following question: should we run python manage.py migrate --database clickhouse-node1, python manage.py migrate --database clickhouse-node2, ...? I mean, does the migration process assume that we always run python manage.py migrate against every ClickHouse node (regardless of whether the migration creates a clustered table or a local-only table)?

That would raise some concerns: the migration has to be run multiple times, and the migration runner needs to know the list of ClickHouse node addresses. But doing so should:

  • solve my issue with ClickHouse behind a load balancer, since I would use RunSQL without ON CLUSTER and apply it to every node by calling manage.py migrate against each one.
  • solve this issue, since all nodes would have the same django_migrations content (and thanks to Meta.cluster and this condition, a clustered table would only be created once).
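A minimal sketch of the per-node approach described in the bullets above, assuming each node has its own alias in DATABASES. The alias names and helper functions are hypothetical; the only real command used is Django's standard `migrate` with its `--database` option.

```python
import subprocess
import sys

# Hypothetical aliases: one DATABASES entry per ClickHouse node.
NODE_ALIASES = ["clickhouse-node1", "clickhouse-node2"]


def migrate_command(alias):
    """Return the argv list that runs `manage.py migrate` for one alias."""
    return [sys.executable, "manage.py", "migrate", "--database", alias]


def migrate_all(aliases, run=subprocess.run):
    """Run the migrations against every node, stopping at the first failure."""
    for alias in aliases:
        # check=True raises CalledProcessError if a node fails to migrate.
        run(migrate_command(alias), check=True)


if __name__ == "__main__":
    migrate_all(NODE_ALIASES)
```

The `run` parameter exists only so the loop can be exercised without a real project; in practice the default `subprocess.run` is used.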

Without running the migration on every node, python manage.py migrate only updates the local django_migrations table, so I don't see how to avoid problems if a node is lost. In that case, the record of cluster migrations applied from that node is lost, and those migrations will be retried the next time you call python manage.py migrate.

Does my thinking make sense, and is it clickhouse-backend's expectation that we call python manage.py migrate on all our ClickHouse nodes? Would it make sense to offer an option to create django_migrations as a replicated table, which would allow users who only use clustered tables (no local-only tables) to run the migration just once?

jayvynl (Owner) commented Jan 22, 2025

Maybe you are right: the migration table should be distributed, with a host field added to track which node ran each migration.
