@AnnihilatorChess
Contributor

Hello,

I found a bug where resuming a run from a checkpoint incorrectly restarts the LR scheduler's warmup and cosine decay. This is because the Trainer in training.py saves and loads the optimizer state but not the lr_scheduler state.
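To make the symptom concrete, here is a minimal pure-Python sketch of a warmup-plus-cosine schedule. It is not the repository's scheduler: the function name `warmup_cosine_lr` and all hyperparameters (base learning rate, warmup length, total steps, resume step) are assumptions chosen only for illustration. It shows that if the scheduler's step counter is not restored, the learning rate falls back to the start of warmup instead of continuing along the decay curve.

```python
import math


def warmup_cosine_lr(step, base_lr=1e-3, warmup_steps=10, total_steps=100):
    """Linear warmup followed by cosine decay (illustrative hyperparameters)."""
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


resume_step = 76  # hypothetical step at which the run is resumed

# Correct resume: the scheduler's step counter was restored from the checkpoint.
lr_expected = warmup_cosine_lr(resume_step)

# Buggy resume: the scheduler was rebuilt from scratch, so its counter is 0 again
# and the schedule replays warmup and decay from the beginning.
lr_buggy = warmup_cosine_lr(0)

print(f"expected lr at resume: {lr_expected:.6f}")
print(f"lr after buggy resume: {lr_buggy:.6f}")
```

In this sketch the two values differ, and the buggy run would then climb through warmup a second time before decaying again.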

This PR updates the save_model and load_checkpoint methods so that lr_scheduler.state_dict() is written to the checkpoint and restored from it, ensuring that training resumes with the correct learning rate.

(I also fixed a small typo: optimizer_state_dit -> optimizer_state_dict.)
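The shape of the fix can be sketched as follows. This is not the actual diff to training.py: `WarmupCosine` is a tiny hypothetical stand-in that mirrors the `state_dict()` / `load_state_dict()` interface that PyTorch schedulers in `torch.optim.lr_scheduler` expose, and the checkpoint key `lr_scheduler_state_dict`, the helper names, and the hyperparameters are assumptions. A real checkpoint would also carry the model and optimizer state.

```python
import copy
import math


class WarmupCosine:
    """Hypothetical stand-in for a PyTorch-style LR scheduler."""

    def __init__(self, base_lr=1e-3, warmup_steps=10, total_steps=100):
        self.base_lr = base_lr
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps
        self.last_step = 0  # the only mutable state of this scheduler

    def step(self):
        self.last_step += 1

    def get_lr(self):
        s = self.last_step
        if s < self.warmup_steps:
            return self.base_lr * (s + 1) / self.warmup_steps
        progress = (s - self.warmup_steps) / max(1, self.total_steps - self.warmup_steps)
        return 0.5 * self.base_lr * (1.0 + math.cos(math.pi * progress))

    def state_dict(self):
        return {"last_step": self.last_step}

    def load_state_dict(self, state):
        self.last_step = state["last_step"]


def save_checkpoint(step, lr_scheduler):
    # Store the scheduler state next to the training step (assumed key name).
    return {
        "step": step,
        "lr_scheduler_state_dict": copy.deepcopy(lr_scheduler.state_dict()),
    }


def load_checkpoint(checkpoint, lr_scheduler):
    # Checkpoints written before the fix lack the key; leave the scheduler fresh then.
    state = checkpoint.get("lr_scheduler_state_dict")
    if state is not None:
        lr_scheduler.load_state_dict(state)
    return checkpoint["step"]


# Train for 76 steps (illustrative), then checkpoint.
scheduler = WarmupCosine()
for _ in range(76):
    scheduler.step()
checkpoint = save_checkpoint(76, scheduler)

# Resume: a freshly built scheduler picks up where the old one stopped.
resumed = WarmupCosine()
start_step = load_checkpoint(checkpoint, resumed)
print(start_step, resumed.get_lr() == scheduler.get_lr())
```

Reading the key with `.get` is one possible design choice: it keeps checkpoints saved before the change loadable, at the cost of silently restarting the schedule for them.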

LTMeyer and others added 30 commits December 2, 2024 14:10
- Add doc
- Center badges
Updated `shear_flow` results
* Add arXiv badge

* Update link to arXiv paper
* List all the Well dataset in utils.py

* Use the dataset list in download script

* Order MHD datasets by dimension
Data:
- Add Rayleigh Benard uniform dataset
- Edit information about Shear Flow data

Statistics and Metrics:
- Add RMS statistics
- Add Pearson correlation metrics

Code Refactoring:
- Refine video generation control
- Refactor sample load from HDF5
- Add transformation and augmentation based on resizing and roation
- Allow specifying for dataset split
- Format with ruff
* Update citation after NeurIPS release

* Update citation in docs too
* Add the Well dataset collection mention to HF card

* Ignore streamlit local runs

* Make uploaded dataset public by default

* Add option to skip repacking HDF5 file

* Increase CPU resources in the uploading script
* Factorize models with a BaseModel

* Improve AFNO typing

* Add tests for the different models

* Do not pass dataset metadata to model

* Remove unecessary arguments in super

Co-authored-by: François Rozet <[email protected]>

---------

Co-authored-by: François Rozet <[email protected]>
* Make models inherit from PytorchModelHubMixin

* Rename upload -> upload_dataset

* Add script draft for uploading models

* Add config template to upload model

* Add ReadMes for the 4 models to upload to HF

* Specify data n_inputs in model upload config

* Update FNO README

* Complete model uploading script

* Fix path issues in model uploading script

* Improve model path and name retrieval

* Change model path retrieval strategy

* Change dataset -> model in upload folder method

* Update README.md

* Factorize models with a BaseModel

* Add tests for the different models

* Do not pass dataset metadata to model

* Improve AFNO typing

* Edit model path retrieval

* Update links in FNO readme

* Update FNO Readme

* Add header to model READMEs

* Add tables to model READMEs

* Add code sample to load models to READMEs

* Fix model instantiation

* Simplify uploading script

* Simplify uploading logic

* Fix typo in spatial

* Convert Omegaconf containers to be jsonable

* Improve type checking enforcement

Co-authored-by: Miles Cranmer <[email protected]>

* Simplify model path

Co-authored-by: Miles Cranmer <[email protected]>

* Update datasetname variable in README code snippet

* Apply suggested pathlib edits

* Factorize model card generation

* Remove duplicated header from model READMEs

* Fix model card template name

* Factorize further model README files

* Fix dataset name in model card

* Make model name variable in model card

* Fix missing model name update

* Fix typo in spatial resolution of UNetConvNext

* Edit links in README with appropriate model names

* Edit links in model README files

---------

Co-authored-by: Ruben Ohana <[email protected]>
Co-authored-by: Miles Cranmer <[email protected]>
* Change HF link to point to the Well collection

* Document retrieval of checkpoints through HF
payelmuk150 and others added 21 commits April 2, 2025 11:51
- Refactor DeltaWellDataset for time step differences
- Refactor normalization
- Fix AFNO and AViT models

Co-authored-by: Payel Mukhopadhyay <[email protected]>
Co-authored-by: Mike McCabe <[email protected]>
* Increment version from 1.0.1 to 1.1.0

* Add list of maintainers

* Add 3.13 to supported Python versions

* Test max and min supported Python versions
* Add missing statistics

* Remove try-except block causing silent failure

* Add DeltaWellDataset to the list of data imports

* Add dataset tests to check delta statistics

* Round statistics to 4 decimal places

* Fix argument in round function

* Make compute statistics script parallel

* Write stats with 4 decimal scientific notation

* Edit yaml dumping for scientific notation

* Factorize dataset download tests with fixtures

* Reorganize dataset tests

* Add comments to pytest fixtures

* Simplify step selection

* Raise error when stride and normalization are set
* Rewrite normalization tests

Now only test the normalization class instead of the actual dataset stats.

---------

Co-authored-by: Lucas Meyer <[email protected]>
* added max rollout steps to dataset docstring

* Update the_well/data/datasets.py

Co-authored-by: Lucas Meyer <[email protected]>

---------

Co-authored-by: Lucas Meyer <[email protected]>
* Add template for bug reports

* Update already existing issue message

* Add version and environment to issue template

* Add code snippet to obtain version and environment

* Fix typo in code snippet
…g_page

Add missing symbolic link to rayleigh_benard_uniform
fix: stop overwriting `best.pt` every validation
fix: denominator calculation for short validation
@AnnihilatorChess
Contributor Author

This is what the learning rate currently looks like when resuming a run that uses an lr_scheduler (resumed at step 76, with the same settings).
[Image: learning rate curve for the run resumed at step 76]

PolymathicAI deleted a comment from LTMeyer on Oct 28, 2025
@mikemccabe210
Contributor

Thanks for the contribution, @AnnihilatorChess. The change looks good at a quick glance, but I think it'll be a few days before someone can do a more detailed check. For now I'll trigger the workflow and make sure it doesn't break any of the tests.

@kazewong
Collaborator

kazewong commented Nov 5, 2025

@AnnihilatorChess This was accidentally closed during a restructuring of the repo. We would love to have your contribution, so once we are done with the restructuring and the release, we will ping you to submit the PR again.
