fixed scheduler checkpoint loading, typo #63

AnnihilatorChess · 2025-10-28T12:44:39Z

Hello,

I found a bug where resuming a run from a checkpoint incorrectly restarts the LR scheduler's warmup and cosine decay. This is because the Trainer in training.py saves and loads the optimizer state but not the lr_scheduler state.

This PR fixes the save_model and load_checkpoint methods to include the lr_scheduler.state_dict() in the checkpoint, ensuring that training resumes with the correct learning rate.
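To illustrate the failure mode described above, here is a minimal, dependency-free sketch. It is not the repository's `Trainer`: `WarmupCosine`, `save_checkpoint`, `resume`, and the `lr_scheduler_state_dict` key are hypothetical names, and the toy scheduler only mimics the `state_dict()` / `load_state_dict()` convention of PyTorch schedulers. A run is interrupted at step 76 and resumed once without and once with the scheduler state.

```python
import math


class WarmupCosine:
    """Toy LR scheduler (hypothetical): linear warmup, then cosine decay."""

    def __init__(self, base_lr, warmup_steps, total_steps):
        self.base_lr = base_lr
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps
        self.last_step = 0  # the only piece of state that must survive a restart

    def step(self):
        self.last_step += 1

    def get_lr(self):
        s = self.last_step
        if s < self.warmup_steps:
            # Linear warmup towards base_lr.
            return self.base_lr * (s + 1) / self.warmup_steps
        # Cosine decay from base_lr to 0 over the remaining steps.
        span = max(1, self.total_steps - self.warmup_steps)
        progress = min(1.0, (s - self.warmup_steps) / span)
        return 0.5 * self.base_lr * (1 + math.cos(math.pi * progress))

    def state_dict(self):
        return {"last_step": self.last_step}

    def load_state_dict(self, state):
        self.last_step = state["last_step"]


def save_checkpoint(scheduler, include_scheduler=True):
    """Build a checkpoint dict; the key name below is an assumption."""
    checkpoint = {"step": scheduler.last_step}
    if include_scheduler:
        checkpoint["lr_scheduler_state_dict"] = scheduler.state_dict()
    return checkpoint


def resume(checkpoint):
    """Recreate the scheduler and restore its state if the checkpoint has it."""
    scheduler = WarmupCosine(base_lr=1e-3, warmup_steps=10, total_steps=200)
    if "lr_scheduler_state_dict" in checkpoint:
        scheduler.load_state_dict(checkpoint["lr_scheduler_state_dict"])
    return scheduler


# Train for 76 steps, then "crash" and resume from a checkpoint.
original = WarmupCosine(base_lr=1e-3, warmup_steps=10, total_steps=200)
for _ in range(76):
    original.step()

buggy = resume(save_checkpoint(original, include_scheduler=False))
fixed = resume(save_checkpoint(original, include_scheduler=True))

print("LR before interruption:", original.get_lr())
print("LR after buggy resume: ", buggy.get_lr())  # back at the start of warmup
print("LR after fixed resume: ", fixed.get_lr())  # continues the cosine decay
```

In this sketch the resume without scheduler state starts again at the first warmup step, while the resume with it continues from step 76.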

(I also fixed a small typo: optimizer_state_dit -> optimizer_state_dict.)

Data:
- Add Rayleigh Benard uniform dataset
- Edit information about Shear Flow data

Statistics and Metrics:
- Add RMS statistics
- Add Pearson correlation metrics

Code Refactoring:
- Refine video generation control
- Refactor sample load from HDF5
- Add transformation and augmentation based on resizing and rotation
- Allow specifying the dataset split
- Format with ruff

* Factorize models with a BaseModel
* Improve AFNO typing
* Add tests for the different models
* Do not pass dataset metadata to model
* Remove unnecessary arguments in super
  Co-authored-by: François Rozet <[email protected]>

Co-authored-by: François Rozet <[email protected]>

* Make models inherit from PytorchModelHubMixin
* Rename upload -> upload_dataset
* Add script draft for uploading models
* Add config template to upload model
* Add ReadMes for the 4 models to upload to HF
* Specify data n_inputs in model upload config
* Update FNO README
* Complete model uploading script
* Fix path issues in model uploading script
* Improve model path and name retrieval
* Change model path retrieval strategy
* Change dataset -> model in upload folder method
* Update README.md
* Factorize models with a BaseModel
* Add tests for the different models
* Do not pass dataset metadata to model
* Improve AFNO typing
* Edit model path retrieval
* Update links in FNO readme
* Update FNO Readme
* Add header to model READMEs
* Add tables to model READMEs
* Add code sample to load models to READMEs
* Fix model instantiation
* Simplify uploading script
* Simplify uploading logic
* Fix typo in spatial
* Convert Omegaconf containers to be jsonable
* Improve type checking enforcement
  Co-authored-by: Miles Cranmer <[email protected]>
* Simplify model path
  Co-authored-by: Miles Cranmer <[email protected]>
* Update datasetname variable in README code snippet
* Apply suggested pathlib edits
* Factorize model card generation
* Remove duplicated header from model READMEs
* Fix model card template name
* Factorize further model README files
* Fix dataset name in model card
* Make model name variable in model card
* Fix missing model name update
* Fix typo in spatial resolution of UNetConvNext
* Edit links in README with appropriate model names
* Edit links in model README files

Co-authored-by: Ruben Ohana <[email protected]>
Co-authored-by: Miles Cranmer <[email protected]>

* Add missing statistics
* Remove try-except block causing silent failure
* Add DeltaWellDataset to the list of data imports
* Add dataset tests to check delta statistics
* Round statistics to 4 decimal places
* Fix argument in round function
* Make compute statistics script parallel
* Write stats with 4 decimal scientific notation
* Edit yaml dumping for scientific notation
* Factorize dataset download tests with fixtures
* Reorganize dataset tests
* Add comments to pytest fixtures
* Simplify step selection
* Raise error when stride and normalization are set

AnnihilatorChess · 2025-10-28T12:59:37Z

This is what the learning rate currently looks like when continuing a run with a lr_scheduler (continued at step 76) with the same settings.

mikemccabe210 · 2025-10-29T20:51:03Z

Thanks for the contribution, @AnnihilatorChess. The change looks good at a quick glance, but I think it'll be a few days before someone can do a more detailed check. For now I'll trigger the workflow and make sure it doesn't break any of the tests.

kazewong · 2025-11-05T17:24:55Z

@AnnihilatorChess This was accidentally closed during a restructuring of the repo. We would love to have your contribution, so once we are done with the restructuring and release, we will ping you to submit the PR again.

LTMeyer and others added 30 commits December 2, 2024 14:10

Release of the Well

560d98c

Update README.md

ba0d3ee

Add badges for tests, PyPI and NeurIPS

f07b7aa

Update badges

d978aa8

- Add doc
- Center badges

Merge pull request PolymathicAI#1 from PolymathicAI/badges

540cc73

Add badges to README

Update benchmarks.md

77d3193

Updated `shear_flow` results

add arXiv link

19b3730

Add arXiv link (PolymathicAI#2)

77c2fe3

* Add arXiv badge
* Update link to arXiv paper

Fix readme template location for HF upload

cf6d466

Fix email addresses (PolymathicAI#3)

b665886

Fix math rendering errors (PolymathicAI#4)

512bc32

Update registry to faster location (PolymathicAI#5)

5a7250d

Increase version number

a2a9f9e

Fix time step number in RTI Readme (PolymathicAI#6)

c58ad7c

docs: fix typo in README (PolymathicAI#7)

7cf3f5e

Califronia -> California

Add FAQ link to github discussions (PolymathicAI#9)

9bf08d0

- updated the resolution in shear_flow README.md

bd1d691

- fix shear_flow README.md

e30a4df

Merge pull request PolymathicAI#13 from PolymathicAI/fix_shear_flow_readme

c93f70f

Fix shear_flow README.md

Factorize the Well dataset list (PolymathicAI#16)

1f4a34b

* List all the Well dataset in utils.py
* Use the dataset list in download script
* Order MHD datasets by dimension

Add BC passthrough (PolymathicAI#18)

f0a8ee8

Update README.md active_matter

52a7c12

Download dataset statistics files along data (PolymathicAI#22)

e8a936f

Update Citation (PolymathicAI#32)

c730c77

* Update citation after NeurIPS release
* Update citation in docs too

Upload more datasets to HF (PolymathicAI#31)

7c98aa1

* Add the Well dataset collection mention to HF card
* Ignore streamlit local runs
* Make uploaded dataset public by default
* Add option to skip repacking HDF5 file
* Increase CPU resources in the uploading script

Add link to HF datasets in README

590371a

Document HF Records (PolymathicAI#36)

177c17c

* Change HF link to point to the Well collection
* Document retrieval of checkpoints through HF

payelmuk150 and others added 21 commits April 2, 2025 11:51

Merge internal into public

18fcd05

- Refactor DeltaWellDataset for time step differences
- Refactor normalization
- Fix AFNO and AViT models

Co-authored-by: Payel Mukhopadhyay <[email protected]>
Co-authored-by: Mike McCabe <[email protected]>

Test that datasets are available on HF (PolymathicAI#38)

d32691d

Use uv in CI (PolymathicAI#39)

5bf0843

Add test to check model availability on HF (PolymathicAI#37)

2b353ca

Prepare v1.1.0 Release (PolymathicAI#40)

d5ea89f

* Increment version from 1.0.1 to 1.1.0
* Add list of maintainers
* Add 3.13 to supported Python versions
* Test max and min supported Python versions

Fix branch name in CI

c97311b

Refactor tests by removing unittest (PolymathicAI#43)

8bebdb3

Update normalization details in tutorial (PolymathicAI#47)

fdd8fbc

Co-authored-by: Payel Mukhopadhyay <[email protected]>

add normalization test (PolymathicAI#50)

c52d543

* Rewrite normalization tests
  Now only test the normalization class instead of the actual dataset stats.

Co-authored-by: Lucas Meyer <[email protected]>

Open HDF5 files on the fly (PolymathicAI#52)

6d77553

added max rollout steps to dataset docstring (PolymathicAI#53)

cfffc12

* added max rollout steps to dataset docstring
* Update the_well/data/datasets.py
  Co-authored-by: Lucas Meyer <[email protected]>

Co-authored-by: Lucas Meyer <[email protected]>

Add template for bug reports (PolymathicAI#42)

a65f5bf

* Add template for bug reports
* Update already existing issue message
* Add version and environment to issue template
* Add code snippet to obtain version and environment
* Fix typo in code snippet

Add missing symbolic link to rayleigh_benard_uniform

808c317

Merge pull request PolymathicAI#56 from francois-rozet/55_docs_missing_page

a123419

Add missing symbolic link to rayleigh_benard_uniform

fix overwriting best.pt every validation

baa8fb5

Fix denominator calculation for validation length

70b7691

Merge pull request PolymathicAI#60 from till-m/fix-validation-checkpoint

064acb0

fix: stop overwriting `best.pt` every validation

fix: linting

be8fc1f

Merge pull request PolymathicAI#61 from till-m/fix-short-validation

8ca4c33

fix: denominator calculation for short validation

fixed scheduler checkpoint loading, typo

8401823

PolymathicAI deleted a comment from LTMeyer Oct 28, 2025

kazewong closed this Nov 5, 2025

kazewong force-pushed the master branch from 8ca4c33 to 762df72 Compare November 5, 2025 16:53

11 participants