
Global step reset when restoring checkpoints with trainer.validate #17127

@nicolas-dufour


Bug description

When restoring checkpoints with trainer.validate, global_step and epoch are overwritten with 0.

It should keep the global_step and epoch stored in the checkpoint; otherwise, it messes with the loggers.

This issue prevents correctly validating a model's checkpoints as a post-processing step.

How to reproduce the bug

Run validation from a checkpoint with

```python
trainer.validate(model, datamodule=datamodule, ckpt_path=ckpt_path)
```

and log a metric; the result will be logged at step 0 instead of the step stored in the checkpoint.
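A minimal, self-contained sketch that should reproduce this (BoringModel-style stand-in code, not the reporter's actual model; names and numbers are placeholders):

```python
# Repro sketch for PyTorch Lightning 1.9. The model/dataset below are
# placeholders standing in for the reporter's setup.
import torch
from torch.utils.data import DataLoader, Dataset

import pytorch_lightning as pl


class RandomDataset(Dataset):
    def __len__(self):
        return 64

    def __getitem__(self, idx):
        return torch.randn(32)


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def validation_step(self, batch, batch_idx):
        # Logged against the trainer's current global_step.
        self.log("val_metric", self.layer(batch).sum())

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    train_loader = DataLoader(RandomDataset(), batch_size=8)
    val_loader = DataLoader(RandomDataset(), batch_size=8)

    model = BoringModel()
    trainer = pl.Trainer(max_epochs=2)
    trainer.fit(model, train_loader, val_loader)
    ckpt_path = trainer.checkpoint_callback.best_model_path
    print(trainer.global_step)  # 16 (8 batches/epoch * 2 epochs)

    # Fresh trainer, as in a post-processing/evaluation job.
    trainer = pl.Trainer()
    trainer.validate(model, dataloaders=val_loader, ckpt_path=ckpt_path)
    # Expected: the step restored from the checkpoint (16).
    # Actual: 0, so "val_metric" above is logged at step 0.
    print(trainer.global_step)
```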

Environment

Current environment

- PyTorch Lightning Version: 1.9
- PyTorch Version: 1.13
- Python version: 3.10
- OS: Linux
- CUDA/cuDNN version: 11.7

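A possible interim workaround, as a sketch: Lightning checkpoints store global_step and epoch at the top level of the checkpoint dict, so the stored step can be read back and validation metrics logged against it explicitly via the logger's log_metrics (the path and metric name below are placeholders):

```python
import torch

# Placeholder path to the checkpoint being validated.
ckpt_path = "path/to/checkpoint.ckpt"

# Standard Lightning checkpoints keep these keys at the top level.
ckpt = torch.load(ckpt_path, map_location="cpu")
stored_step = ckpt["global_step"]
stored_epoch = ckpt["epoch"]

# Inside the LightningModule's validation hooks, bypass self.log and write
# to the logger directly so metrics land at the checkpoint's step, not 0:
#     self.logger.log_metrics({"val_metric": value}, step=stored_step)
```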

cc @Borda @awaelchli @carmocca @justusschock


Labels

checkpointing (Related to checkpointing), feature (Is an improvement or enhancement), loops (Related to the Loop API), pl (Generic label for PyTorch Lightning package)
