Make gradient checkpointing and offloading per-component #1476

Draft
dxqb wants to merge 2 commits into Nerogar:master from dxqb:split-offload

@dxqb dxqb commented May 25, 2026

Collaborator

@dxqb dxqb linked an issue May 25, 2026 that may be closed by this pull request
dxqb commented May 25, 2026

Collaborator Author

Claude:

  • activation_offloading should default to False. It has always defaulted to True, but it only took effect when layer offloading was enabled. Now that it works standalone, a fresh fine-tune offloads activations out of the box.
  • "Layer Offload Fraction" is shown for CLIP encoders but is a no-op: the CLIP setup discards its conductor, so the value never drives anything. Either don't render the field for CLIP or wire it up.
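
A minimal sketch of why the old default becomes a behavior change once the flag is decoupled. ToyPartConfig and both helper functions are hypothetical names made up for illustration; they are not the actual TrainConfig fields or OneTrainer code.

```python
from dataclasses import dataclass


@dataclass
class ToyPartConfig:
    """Hypothetical stand-in for one component's offloading settings."""
    offload_fraction: float = 0.0
    activation_offloading: bool = True  # the old default the comment objects to


def activations_offloaded_before(part: ToyPartConfig) -> bool:
    # Old coupling: the flag only mattered when layer offloading was active.
    return part.activation_offloading and part.offload_fraction > 0


def activations_offloaded_after(part: ToyPartConfig) -> bool:
    # After the split: the flag takes effect on its own.
    return part.activation_offloading


fresh = ToyPartConfig()  # untouched defaults, as in a fresh fine-tune
print(activations_offloaded_before(fresh))  # False
print(activations_offloaded_after(fresh))   # True
```

Under these assumptions, the same untouched defaults go from "no activation offloading" to "activation offloading on", which is the argument for flipping the default to False.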

@dxqb dxqb added the preview merged in the preview branch label May 29, 2026
dxqb added a commit that referenced this pull request Jun 3, 2026
- BaseAnimaSetup: per-component checkpointing_or_offloading_enabled(),
  remove weight_list from create_autocast_context / disable_fp16_autocast_context
- AnimaFineTune/LoRASetup: latent_caching → image_caching / text_caching
- ModelType: add ANIMA to _MODEL_PARTS and supported_training_methods

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
dxqb added a commit to TheForgotten69/OneTrainer that referenced this pull request Jun 3, 2026
dxqb added a commit that referenced this pull request Jun 4, 2026
In the upstream TrainingTab.py (PR #1476), config is stored as
self.train_config on the view. In preview's Base*/controller pattern
it lives on controller.config. One call site in __setup_stable_diffusion_ui
was translated incorrectly during the bec207a merge.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
dxqb added a commit that referenced this pull request Jun 4, 2026
@dxqb dxqb mentioned this pull request Jun 6, 2026
3 tasks
dxqb commented Jun 14, 2026

Collaborator Author

Claude: Found while testing the preview branch: the guard in create_data_loader that rejects caching_threads > 1 combined with layer offloading (introduced here, with a "TODO: narrow this to the cached components only") is now overly broad:

if config.caching_threads > 1 and any(part.offload_fraction > 0 for part in config.model_part_configs()):
    raise RuntimeError('layer offloading can not be activated if "caching_threads" > 1')

This rejects the config if any part has layer offloading enabled — including transformer/unet/prior/unconditional_transformer, none of which run inside the caching dataloader's worker threads (only the components that actually produce a cache do: text encoder(s) for text caching, VAE for image/latent caching).

In practice, after the per-component split:

  • Only text_encoder / text_encoder_2 / text_encoder_3 / text_encoder_4 (depending on model_type.model_parts()) expose an "Offload" UI control (__create_offloading_widgets in BaseTrainingTabView.py) — vae does not (__create_vae_frame has no offloading widgets), so vae.offload_fraction is always 0 in practice.
  • So the check should really be: does any text encoder part that's actually used (and being cached) have offload_fraction > 0?

This means the current check blocks perfectly valid configs — e.g. layer offloading the transformer with caching_threads > 1 — even though that combination is fine, since the transformer's conductor never runs in a caching worker thread.

Suggested narrowing: only check text-encoder parts, e.g.

if config.caching_threads > 1 and any(
    getattr(config, name).offload_fraction > 0
    for name in config.model_type.model_parts()
    if name.startswith("text_encoder")
):
    raise RuntimeError('layer offloading can not be activated for a text encoder if "caching_threads" > 1')
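
To make the difference between the two guards concrete, here is a self-contained sketch that runs both against toy configs. ToyTrainConfig, ToyPart, broad_guard and narrowed_guard are hypothetical stand-ins written for this comparison; only the two conditions themselves mirror the snippets above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPart:
    """Hypothetical per-component config with only the field the guard reads."""
    offload_fraction: float = 0.0


@dataclass
class ToyTrainConfig:
    """Hypothetical stand-in for the training config."""
    caching_threads: int = 1
    parts: dict = field(default_factory=dict)  # part name -> ToyPart

    def model_part_configs(self):
        return list(self.parts.values())


def broad_guard(config: ToyTrainConfig) -> bool:
    # Current check: trips if ANY part uses layer offloading.
    return config.caching_threads > 1 and any(
        part.offload_fraction > 0 for part in config.model_part_configs()
    )


def narrowed_guard(config: ToyTrainConfig) -> bool:
    # Suggested check: only text encoder parts are considered.
    return config.caching_threads > 1 and any(
        part.offload_fraction > 0
        for name, part in config.parts.items()
        if name.startswith("text_encoder")
    )


transformer_only = ToyTrainConfig(
    caching_threads=4,
    parts={"transformer": ToyPart(0.5), "text_encoder": ToyPart(0.0)},
)
text_encoder_offload = ToyTrainConfig(
    caching_threads=4,
    parts={"transformer": ToyPart(0.0), "text_encoder": ToyPart(0.5)},
)

print(broad_guard(transformer_only), narrowed_guard(transformer_only))          # True False
print(broad_guard(text_encoder_offload), narrowed_guard(text_encoder_offload))  # True True
```

In this toy setup the transformer-only config is rejected by the broad guard but accepted by the narrowed one, while offloading a text encoder is still rejected by both.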

dxqb added a commit that referenced this pull request Jun 14, 2026
…ent) into preview

# Conflicts:
#	modules/ui/ModelTab.py
#	modules/ui/TopBar.py
#	modules/ui/TrainUI.py
#	modules/ui/TrainingTab.py
dxqb added a commit that referenced this pull request Jun 14, 2026
Anima, Lens, and Ideogram setup files used the pre-#1476/#1462 4-arg
create_autocast_context/disable_fp16_autocast_context (with a weight-dtype
list), the old config.gradient_checkpointing.enabled() global check, and the
renamed config.latent_caching field. Update them to the current 3-arg
autocast helpers, per-part checkpointing via enable_checkpointing_for_*, and
config.image_caching/config.text_caching.
dxqb added a commit that referenced this pull request Jun 19, 2026
Add a _MODEL_PARTS table + ModelType.model_parts() as the single source
of truth for which components each model type has, keyed by TrainConfig
field names, and a ModelType.supported_training_methods() that enumerates
every type explicitly, raising on an unknown type rather than defaulting.

Collapse ModelTab's per-type __setup_*_ui methods into one __setup_ui that
derives the has_* widget flags from model_parts(), and collapse TopBar's
per-type training-method dispatch to build its dropdown from
supported_training_methods().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Rebased onto centralize-model-type: this is the offloading-only part of
split-offload, with the model-composition centralization (ModelType,
ModelTab, TopBar) excluded since it already landed separately.
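
The "single source of truth" idea from the centralization commit message above can be sketched roughly as follows. ToyModelType, its members, the part names, the method lists and widget_flags are all hypothetical and chosen for illustration; they are not copied from the actual ModelType or ModelTab code.

```python
from enum import Enum

# Hypothetical universe of component names, keyed like config field names.
ALL_PARTS = ("unet", "transformer", "text_encoder", "text_encoder_2", "vae")


class ToyModelType(Enum):
    TOY_UNET = "TOY_UNET"
    TOY_TRANSFORMER = "TOY_TRANSFORMER"
    TOY_UNREGISTERED = "TOY_UNREGISTERED"  # deliberately left out below

    def model_parts(self) -> tuple:
        # Single lookup table decides which components a type has.
        return _MODEL_PARTS[self]

    def supported_training_methods(self) -> tuple:
        # Every type is listed explicitly; unknown types raise instead of
        # silently falling back to a default list.
        if self is ToyModelType.TOY_UNET:
            return ("FINE_TUNE", "LORA", "EMBEDDING")
        if self is ToyModelType.TOY_TRANSFORMER:
            return ("FINE_TUNE", "LORA")
        raise ValueError(f"unknown model type: {self}")


_MODEL_PARTS = {
    ToyModelType.TOY_UNET: ("unet", "text_encoder", "vae"),
    ToyModelType.TOY_TRANSFORMER: ("transformer", "text_encoder", "text_encoder_2", "vae"),
}


def widget_flags(model_type: ToyModelType) -> dict:
    # One generic UI setup can derive its has_* flags from the table.
    parts = model_type.model_parts()
    return {f"has_{name}": name in parts for name in ALL_PARTS}


print(widget_flags(ToyModelType.TOY_UNET)["has_unet"])         # True
print(widget_flags(ToyModelType.TOY_UNET)["has_transformer"])  # False
```

The point of the design, as far as the commit message describes it, is that adding a model type means touching one table and one explicit method rather than several per-type UI methods.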
Labels

preview merged in the preview branch

Development

Successfully merging this pull request may close these issues.

[Bug]: Activations offloading depends on layer offload fraction
[Feat]: Separate offload settings for text encoder
