
Mixed Precision Training #113

Open
zsxzs opened this issue May 30, 2022 · 2 comments
Comments

zsxzs commented May 30, 2022

Hi! Thank you for your wonderful work and great code!
Have you tried mixed precision training? I used mixed precision when training the deblurring model on the GoPro dataset, but the loss value becomes NaN after a few dozen epochs.
Could you help me find out what the problem is?


zsxzs commented May 30, 2022

This is the modified code. Unfortunately the batch size is only 4 (there isn't enough GPU memory).

```python
import time

import torch
from tqdm import tqdm
from torch.cuda.amp import autocast, GradScaler  # import once, not inside the loop

if opt.TRAINING.FP16:
    scaler = GradScaler()
else:
    scaler = None

for epoch in range(start_epoch, opt.OPTIM.NUM_EPOCHS + 1):
    epoch_start_time = time.time()
    total_loss = 0
    val_loss = 0
    train_id = 1
    model_restoration.train()
    for i, data in enumerate(tqdm(train_loader), 0):

        # Cheaper than optimizer.zero_grad(): freed grads skip the
        # accumulation add in the next backward pass
        for param in model_restoration.parameters():
            param.grad = None

        with torch.no_grad():
            target = data[0].cuda()
            input_ = data[1].cuda()

        if opt.TRAINING.FP16:  # scaler is not None
            with autocast():
                restored = model_restoration(input_)

                # Compute the loss at each stage; the builtin sum() keeps the
                # result a torch tensor on the autograd graph (np.sum is
                # unnecessary here)
                loss_char = sum(criterion_char(restored[j], target) for j in range(len(restored)))
                loss_edge = sum(criterion_edge(restored[j], target) for j in range(len(restored)))
                loss = loss_char + 0.05 * loss_edge

            scaler.scale(loss).backward()
            scaler.step(optimizer)  # step is skipped automatically if grads contain inf/NaN
            scaler.update()

        else:
            restored = model_restoration(input_)

            # Compute the loss at each stage
            loss_char = sum(criterion_char(restored[j], target) for j in range(len(restored)))
            loss_edge = sum(criterion_edge(restored[j], target) for j in range(len(restored)))
            loss = loss_char + 0.05 * loss_edge

            loss.backward()
            optimizer.step()

        total_loss += loss.item()
```
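A common mitigation for NaN losses under AMP (an assumption on my part, not a fix confirmed by the authors) is to unscale the gradients before stepping so they can be clipped at their true magnitude; `GradScaler.step` then skips any update whose gradients still contain inf/NaN. A minimal sketch with a stand-in linear model and a hypothetical clip value of 0.01:

```python
import torch
from torch.cuda.amp import GradScaler, autocast

# Stand-in model and optimizer; in the issue's code these would be
# model_restoration and its optimizer
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

use_amp = torch.cuda.is_available()  # autocast(cuda) needs a GPU
scaler = GradScaler(enabled=use_amp)

x, target = torch.randn(4, 8), torch.randn(4, 8)
if use_amp:
    model, x, target = model.cuda(), x.cuda(), target.cuda()

optimizer.zero_grad(set_to_none=True)
with autocast(enabled=use_amp):
    loss = torch.nn.functional.l1_loss(model(x), target)

scaler.scale(loss).backward()
scaler.unscale_(optimizer)  # grads are now fp32 and unscaled
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.01)
scaler.step(optimizer)      # skipped if inf/NaN gradients remain
scaler.update()
```

Lowering the initial loss scale (`GradScaler(init_scale=...)`) is another knob worth trying if the overflow happens early in training.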

drifterss commented

Hi! Thank you for your wonderful work and great code!
Is the FP16 hyperparameter a Boolean that should be set to true? I don't currently see this hyperparameter in the training.yml file. Thank you for your reply.
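For what it's worth, the flag read by the snippet above is `opt.TRAINING.FP16`, so (assuming the repo's usual YAML-to-options loading) one would add it manually under the `TRAINING` section of training.yml, for example:

```yaml
TRAINING:
  FP16: true   # hypothetical entry; not present in the repo's stock training.yml
```

This is only a sketch of where the flag would live; the stock config does not define it, which matches what you observed.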
