upgraded from beat 1.0.1 to 1.4.0 and beat has died twice in two days #210

Closed
robvdl opened this issue Dec 19, 2018 · 29 comments

@robvdl

robvdl commented Dec 19, 2018

We recently upgraded a UAT system from Celery 4.1.1 to 4.2.1 and Beat to 1.4.0

At first glance everything was fine, but twice now the beat process has just randomly died and we've had to restart it.

The reason for the upgrade was to switch to version 3 of the redis-py library. With Celery 4.1 we had to pin redis-py to version 2, since the redis-py 3 release initially broke things.

Anyway, beat never died before we upgraded it. I'll see if I can get some additional information, but in the meantime we have to roll back to Celery 4.1.1 and Beat 1.0.1.

@robvdl
Author

robvdl commented Dec 19, 2018

To make things worse, the migrations are also broken: you cannot migrate back to 0001 initial.

I have a feeling the issue is caused by a squashed migration. Squashed migrations are going to mess up other people's installations if they try to go backwards, and they are something I normally try to avoid.

It looks like migration 0005 is a squashed migration, but it doesn't blow up until it gets to migration 0001.

@robvdl
Author

robvdl commented Dec 19, 2018

The issue with the broken migration 0005 shows up if you try to go from 0006 to 0005:

./manage.py migrate django_celery_beat 0005

CommandError: More than one migration matches '0005' in app 'django_celery_beat'. Please be more specific.

That is because 0005 is the squashed migration, and it seems to have messed with the order.

If you do try to be more specific, that is when it blows up:

./manage.py migrate django_celery_beat 0005_add_solarschedule_events_choices_squashed_0009_merge_20181012_1416

django.db.utils.ProgrammingError: column django_celery_beat_periodictask.priority does not exist

@robvdl
Author

robvdl commented Dec 19, 2018

There are multiple 0006 and multiple 0005 migrations. This thing is a real mess and should be resolved.

It looks like someone did try to resolve it, but you cannot reverse migrate or jump to a specific migration because so many numbers have been reused.
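
For context on the "More than one migration matches" error: the name passed to `migrate` is resolved by prefix, so two files that share a number are ambiguous. Below is a minimal sketch of that lookup in plain Python, not Django's actual code. Apart from the squashed 0005 name quoted earlier in this thread, the migration names are hypothetical placeholders.

```python
class AmbiguousPrefix(Exception):
    """Raised when a prefix matches more than one migration name."""


def find_by_prefix(names, prefix):
    """Return the single migration name that starts with `prefix`."""
    matches = [name for name in names if name.startswith(prefix)]
    if len(matches) > 1:
        raise AmbiguousPrefix(
            "More than one migration matches %r: %s" % (prefix, ", ".join(matches))
        )
    if not matches:
        raise KeyError("No migration matches %r" % prefix)
    return matches[0]


# Only the squashed name comes from this thread; the others are made up.
MIGRATIONS = [
    "0001_initial",
    "0005_hypothetical_original",
    "0005_add_solarschedule_events_choices_squashed_0009_merge_20181012_1416",
    "0006_hypothetical_a",
    "0006_hypothetical_b",
]

if __name__ == "__main__":
    print(find_by_prefix(MIGRATIONS, "0001"))
    try:
        find_by_prefix(MIGRATIONS, "0005")
    except AmbiguousPrefix as exc:
        print(exc)
```

A longer prefix that only one name starts with, such as `0005_add`, resolves cleanly, which is why the full squashed name gets past this error and fails later instead.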

@robvdl
Author

robvdl commented Dec 19, 2018

I also tried emptying all the beat-related tables in the Django admin before migrating back to 0001, and unfortunately even then it blows up.

psycopg2.ProgrammingError: column django_celery_beat_periodictask.solar_id does not exist
LINE 1: ..., "django_celery_beat_periodictask"."crontab_id", "django_ce...

Basically I'm going to have to go into the db, delete the tables, reset the migration state of the django_celery_beat app, and just start again. It looks like the migrations are broken and it simply is not possible to migrate back.
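
A rough sketch of what that manual reset amounts to: drop the app's tables and delete its rows from Django's `django_migrations` bookkeeping table. This is destructive and only an illustration. It runs against an in-memory SQLite database as a stand-in (the real system here is PostgreSQL), and the helper name `reset_app_state` and the demo rows are assumptions for the example.

```python
import sqlite3

APP = "django_celery_beat"


def reset_app_state(conn, app_label=APP):
    """Drop tables named '<app_label>_*' and forget the app's migration history."""
    prefix = app_label + "_"
    names = [
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    ]
    dropped = sorted(name for name in names if name.startswith(prefix))
    for name in dropped:
        conn.execute('DROP TABLE "%s"' % name)
    # Removing these rows makes Django treat the app as never migrated.
    conn.execute("DELETE FROM django_migrations WHERE app = ?", (app_label,))
    conn.commit()
    return dropped


def demo():
    """Build a tiny fake database, reset the app, and report what is left."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE django_migrations "
        "(id INTEGER PRIMARY KEY, app TEXT, name TEXT, applied TEXT)"
    )
    conn.execute("CREATE TABLE django_celery_beat_periodictask (id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE auth_user (id INTEGER PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO django_migrations (app, name, applied) VALUES (?, ?, ?)",
        [
            ("auth", "0001_initial", "2018-12-19"),
            (APP, "0001_initial", "2018-12-19"),
        ],
    )
    dropped = reset_app_state(conn)
    remaining = [row[0] for row in conn.execute("SELECT app FROM django_migrations")]
    return dropped, remaining


if __name__ == "__main__":
    print(demo())
```

After a reset like this, running `./manage.py migrate django_celery_beat` recreates the tables from scratch, and any stored schedules are lost.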

@justmobilize

I'm happy to help fix the migrations. Versions 1.3.0 and 1.4.0 should be pulled, because once they are installed it will be difficult to fix things. Squashing migrations doesn't work for a package because you never know which version someone is on. It's different if you are in control of all the places the code is running.

@robvdl
Author

robvdl commented Dec 23, 2018

Thanks. In the meantime I've manually rolled back to beat version 1.0.1 and rolled back Celery to 4.1.1, because this is a live production system and it can't really go down.

I'm going to do some experimentation in the new year to see if I can just upgrade Celery but hold back beat for now. It would be nice to find out why beat keeps dying on me, but it's a production system for a client, so I can't really afford to mess with it too much.

@auvipy
Member

auvipy commented Apr 23, 2019

Can anyone with Celery 4.3 or master verify this issue against celery beat master?

@robvdl
Author

robvdl commented Apr 23, 2019 via email

@auvipy
Member

auvipy commented Apr 23, 2019

I can see a migration change here: v1.4.0...master. Could you please try it locally using the master branch?

@robvdl
Author

robvdl commented Apr 23, 2019 via email

@robvdl
Author

robvdl commented Apr 23, 2019

Basically, to test this: start a project with beat 1.0.1, upgrade to beat 1.4.0 and apply the migrations, then migrate the django_celery_beat app back to migration 0001. It will fail.

I've tried migrating backwards in steps, but that also fails.
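
The reproduction steps above could be scripted roughly as follows. This is only a sketch: it assumes the package is installed from PyPI as `django-celery-beat`, that `./manage.py` belongs to a throwaway test project, and that nothing else needs configuring. By default it only prints the commands instead of running them.

```python
import subprocess

PACKAGE = "django-celery-beat"  # assumed PyPI distribution name
APP = "django_celery_beat"


def repro_commands(old="1.0.1", new="1.4.0"):
    """Return the commands for the upgrade-then-reverse scenario."""
    return [
        ["pip", "install", "%s==%s" % (PACKAGE, old)],
        ["./manage.py", "migrate", APP],
        ["pip", "install", "%s==%s" % (PACKAGE, new)],
        ["./manage.py", "migrate", APP],
        # Per this thread, this last step is the one that fails.
        ["./manage.py", "migrate", APP, "0001"],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it too when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run(repro_commands())
```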

@robvdl
Author

robvdl commented Apr 23, 2019

Actually, issue #217 doesn't accurately explain the problem; the comments in this issue do.

Issue #217 seems to end with "don't edit migrations that have already been applied". I beg to differ. Sure, that is normally the case, but if the migrations are broken then you basically have no choice but to fix them to resolve the mess in this particular case.

@justmobilize

One could clean up the migrations, and create a new one that may need some insane logic to fix past mistakes. For new installs it would be fine since there would be no data. Some people may just need to rebuild their schedules

@auvipy
Member

auvipy commented Apr 24, 2019

@robvdl can you come up with a solution?

@justmobilize

I'm happy to work through this (As I've said in #217).

I just want to make sure my time isn't wasted and that I'll get active support in getting it figured out and merged.

@auvipy
Member

auvipy commented Apr 24, 2019

I have dedicated time to help maintain these projects, so, please proceed and mention me in the new PR

@justmobilize

Awesome. Thanks!

@justmobilize

@robvdl, can we start with this?

  • Install v1.4.0
  • Apply this patch

You should be able to migrate forwards and backwards.

The next question is: migrations aside, does the beat process still die?

Also, I'm not sure about your timezone settings, but I needed to add this to my Celery app:

from django.utils import timezone

# `app` is the project's existing Celery() instance; use Django's timezone-aware clock.
app.now = timezone.now

@liquidpele
Contributor

"Some people may just need to rebuild their schedules"

I would really like to avoid that. If necessary, we can always use migrations.RunPython to fix things manually if we need to do something crazy.

@auvipy
Member

auvipy commented Apr 25, 2019

@tinylambda

@justmobilize

justmobilize commented Apr 25, 2019

I agree. First step is to solve the problem, then find out how to make it work.

In theory, if you are in a working state, tweaking existing migrations won't hurt, assuming the final state is the same.

@justmobilize

justmobilize commented Apr 25, 2019

@auvipy and @liquidpele, do you have any idea what versions you have in the wild?

Doing some testing I found:

v1.0.0 -> to my fix - okay
v1.0.1 -> to my fix - okay
v1.1.0 -> to my fix - okay
v1.1.1 -> to my fix - okay
v1.2.0 -> to my fix - not okay
v1.3.0 -> to my fix - okay
v1.4.0 -> to my fix - okay

I also found that a fresh install of v1.3.0 or v1.4.0 cannot migrate.

So, worst case, if someone is on v1.2.0 they would need to upgrade to v1.4.0 first. Does that seem acceptable?
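
The test results above boil down to a small lookup. Here is a sketch of how an upgrade helper might encode them; the function name `upgrade_path` and the label "fix" for the patched release are placeholders, and the only data is the list from this comment.

```python
# Results copied from the list above: True means migrating to the fix worked.
TESTED = {
    "1.0.0": True,
    "1.0.1": True,
    "1.1.0": True,
    "1.1.1": True,
    "1.2.0": False,
    "1.3.0": True,
    "1.4.0": True,
}

# Version that an installation failing the direct upgrade goes through first.
STEPPING_STONE = "1.4.0"


def upgrade_path(installed):
    """Return the sequence of versions to install, ending at the fix."""
    if installed not in TESTED:
        raise ValueError("untested version: %s" % installed)
    if TESTED[installed]:
        return [installed, "fix"]
    return [installed, STEPPING_STONE, "fix"]


if __name__ == "__main__":
    for version in sorted(TESTED):
        print(" -> ".join(upgrade_path(version)))
```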

@justmobilize

@robvdl, I've had it running for 24 hours and it has processed about 250k tasks without any issues. I would love to know if you saw any (realizing the crashing and the migrations have nothing to do with each other).

@robvdl
Author

robvdl commented Apr 26, 2019

Sorry I won't be able to get a chance to try this again for a while, but I will eventually. One issue is that the servers this runs on are still on trusty and we've been trying to convince the client for a year they need upgrading.

@liquidpele
Contributor

I've fixed the migration issue and added tests for it in my PR since I was adding a new migration anyway. #241

I was easily able to reproduce the issues by just migrating back to 0001 and then forward again on Django 1.11.20. @justmobilize, care to retest with my branch?

@justmobilize

@liquidpele, you deleted migrations, meaning if someone is on another version they will be totally blocked from moving forward (for example my production environment is on 1.1.1 which ends on 0005).

I would highly recommend tossing 0005_add_solarschedule_events_choices_squashed_0009_merge_20181012_1416 in favor of the patch I have above.

Thoughts?
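
Some background on why deleting the original files can block people: as far as I understand Django's migration loader, a squashed migration stands in for the ones listed in its `replaces` only when all of them, or none of them, have been applied. A partially applied database keeps using the originals, so they have to still exist. A plain-Python sketch of that rule, with hypothetical migration names:

```python
def resolve_squash(replaces, applied):
    """Say which migrations the graph would use for one squashed migration.

    `replaces` lists the migrations the squash stands in for, and
    `applied` is the set of names already recorded in the database.
    """
    statuses = [name in applied for name in replaces]
    if all(statuses) or not any(statuses):
        # Fully applied or untouched: the squashed migration takes over.
        return "squashed"
    # Partially applied: the original files are still needed.
    return "originals"


# Hypothetical names, for illustration only.
REPLACES = ["0005_a", "0006_b", "0007_c"]

if __name__ == "__main__":
    print(resolve_squash(REPLACES, set()))                 # fresh install
    print(resolve_squash(REPLACES, {"0005_a"}))            # stuck halfway
    print(resolve_squash(REPLACES, set(REPLACES)))         # fully upgraded
```

In the halfway case the loader needs `0006_b` and `0007_c` on disk, which is the situation an installation that stopped at an older release would be in.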

@justmobilize

After that, you could make a squash of 0001 through 0010 that you keep forever to make new installs quicker.

@robvdl
Author

robvdl commented May 2, 2019

I like this solution :) I'm going to give this another shot in the next few weeks when I have a chance.

@auvipy
Member

auvipy commented Mar 4, 2021

#244

@auvipy auvipy closed this as completed Mar 4, 2021