Yangyangt/try sync with internal#2

Draft
yy-code-nv wants to merge 13 commits into main from yangyangt/try_sync_with_internal
@yy-code-nv

Owner

No description provided.

lfengad and others added 4 commits June 9, 2026 14:54
### Summary
CI tests download input assets (e.g. action/video inputs) over the
network, and these intermittently fail with transient gateway errors
(502/503/504), flaking
the run. This PR makes those downloads robust and avoids re-fetching the
same assets every run.
### Changes
- **Backoff retry** (`inference/common/args.py`): wrap each input
download in an outer retry with exponential backoff + jitter (6
attempts, env-overridable via
`COSMOS_DOWNLOAD_*`). Permanent errors (400/401/403/404) fail fast.
- **Opt-in download cache**: when `COSMOS_DOWNLOAD_CACHE_DIR` is set,
downloads are cached by URL and reused across runs; unset → unchanged
behavior.
Concurrent writers use an atomic move.
- **CI wiring** (`gpu-tests.yml`): the `unittest` and `inference-smoke`
jobs point at a shared persistent cache dir
(`$RUNNER_WORKSPACE/cosmos_input_cache`,
outside the repo tree so cleanup keeps it), reused across runs and PRs
on the same runner.
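
The backoff retry described above could be sketched roughly as follows. This is a minimal illustration, not the code in `inference/common/args.py`: the names `download_with_retry`, `fetch`, and `HTTPStatusError`, and the delay parameters, are assumptions made for the example. Only the 6-attempt default and the fail-fast status codes come from the description.

```python
import random
import time

# Status codes the description lists as permanent: fail fast, never retry.
PERMANENT_STATUS = {400, 401, 403, 404}


class HTTPStatusError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def download_with_retry(fetch, url, attempts=6, base_delay=1.0,
                        max_delay=30.0, sleep=time.sleep):
    """Call fetch(url), retrying transient failures with backoff + jitter."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except HTTPStatusError as err:
            if err.status in PERMANENT_STATUS:
                raise  # permanent error: retrying cannot help
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            # Exponential backoff capped at max_delay, with full jitter.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
```

The `sleep` parameter is injectable only so the sketch can be exercised without real waiting.
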
### Impact
- Production/local behavior unchanged: cache is off unless the env var
is set; retry is transparent on success and only adds resilience on
failure.
- Only new persisted artifact is the cache dir; replaces
previously-leaked `/tmp` temp dirs in those jobs.
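
For the opt-in cache, one possible shape is shown below. It is a sketch under assumptions: the function name `cached_download`, the `fetch` callable, and the choice of a SHA-256 hash of the URL as the cache key are hypothetical. The description only states that downloads are cached by URL and that concurrent writers use an atomic move.

```python
import hashlib
import os
import tempfile


def cached_download(url, fetch, cache_dir=None):
    """Return the bytes for url, reusing an on-disk copy when cache_dir is set."""
    if cache_dir is None:
        return fetch(url)  # cache disabled: behavior unchanged
    os.makedirs(cache_dir, exist_ok=True)
    # Key the entry by a hash of the URL (an assumption of this sketch).
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    final_path = os.path.join(cache_dir, key)
    if os.path.exists(final_path):
        with open(final_path, "rb") as f:
            return f.read()
    data = fetch(url)
    # Write to a temp file in the same directory, then move it into place
    # atomically so readers never observe a partially written file.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp_path, final_path)
    return data
```

A caller could pass `os.environ.get("COSMOS_DOWNLOAD_CACHE_DIR")` as `cache_dir`, which leaves the cache off whenever the variable is unset.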

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Remove unused code for config.py (used for old toml config system)
- Add vision_sft_nano golden for GB200
## Summary

Adds a **DROID action-policy SFT recipe** for `nvidia/Cosmos3-Nano`,
mirroring the internal `droid_lerobot_8b` policy run, so users can
post-train the action-generation + action heads on DROID (LeRobot v3.0)
data.

## What's included

- **`data/vfm/action/datasets/droid_lerobot_dataset.py`** — DROID
LeRobot dataset: compact columnar load + episode-aware windowing
(replaces an eager full-table materialization), plus `joint_pos` (8D: 7
joints + gripper) and `use_state` support.
- **`data/vfm/action/datasets/action_sft_dataset.py`** (new) —
`get_action_droid_sft_dataset(...)` wrapping the dataset through
`ActionTransformPipeline`.
- **`configs/.../action/posttrain_config/action_policy_droid_nano.py`**
(new) — registered `action_policy_droid_nano` experiment (Cosmos3-Nano /
8B MoT): optimizer trains gen+action heads (5× LR on action heads),
`LambdaLinear` schedule, count-based batch, res480,
`encode_exact_durations=[33]` (chunk 32 → 33 frames).
- **`checkpoint/dcp.py`** — EMA warm-start: when `keys_to_skip_loading`
excludes `net_ema.`, initialize `net_ema = net` from the base weights so
EMA starts from the init rather than zeros.
- **`examples/toml/sft_config/action_policy_droid_{nano,repro}.toml`** —
1-GPU smoke + scaled (res480) configs.
- **`examples/launch_sft_action_policy_droid.sh`** +
**`docs/action_policy_droid_posttraining.md`** — runnable launcher and
walkthrough.

## Validation

End-to-end on H200:
- **1 node / 8×H200** — dry-run + training at res480,
`max_samples_per_batch=32` (64 OOMs at 139 GiB; internal used 128 on
GB200).
- **2 nodes / 16 ranks** — HSDP `shard 8 × replicate 2`, `TRAIN_EXIT=0`.
- Recipe faithful to internal `droid_lerobot_8b`: lr 1e-4 / betas / wd,
5× action-head LR, `LambdaLinear`, shift `{256:3,480:5,720:10}`,
`concat_view`, `chunk_length=32`.

## Notes

- Count-based batch (`max_samples_per_batch`,
`max_sequence_length=None`) lives in the experiment Python — TOML cannot
express `null`, and the loader only overrides keys present in the TOML.
- Base checkpoint: convert `nvidia/Cosmos3-Nano` → DCP and pass via
`BASE_CHECKPOINT_PATH`; action heads init fresh (skipped on load).
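
The first note can be illustrated with a small sketch. The merge helper `apply_overrides`, the `dataloader_train` nesting for this recipe, and the override value 16 are assumptions for the example; the plain dict stands in for a parsed TOML file. Because only keys present in the overrides are touched, a `None` set in Python survives the merge.

```python
def apply_overrides(config, overrides):
    """Recursively overwrite only the keys that appear in overrides."""
    merged = dict(config)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged


# Defaults as they might be set in the experiment Python.
experiment = {
    "dataloader_train": {
        "max_samples_per_batch": 32,
        "max_sequence_length": None,  # TOML has no way to spell this value
    }
}
# Stand-in for a parsed TOML file: it can change the count, nothing more.
toml_overrides = {"dataloader_train": {"max_samples_per_batch": 16}}

result = apply_overrides(experiment, toml_overrides)
```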

---------

Signed-off-by: Hao Liang <haolia@nvidia.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: lfengad <liangf@nvidia.com>
Co-authored-by: Yu-Wei Chao <82182961+ychao-nvidia@users.noreply.github.com>
…ovided (NVIDIA#33)

## Summary

`LocalBackend.join_path` accepted `Union[str, Path]` inputs but always
returned `str` (via `os.path.join`), even when `Path` objects were
passed. This violated the type contract and could cause `AttributeError`
downstream.

## Changes

- **local_backend.py**: Now checks if any input is a `Path` and returns
`Path(result)` accordingly. Removed the stale TODO that acknowledged
this issue.
- **base_backend.py, easy_io.py, file_client.py**: Updated return type
from `str` to `Union[str, Path]`.
- **boto3_backend.py, msc_backend.py, http_backend.py**: Updated return
type signature for consistency with the abstract base class.
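
A minimal sketch of the described behavior, written as a standalone function rather than the actual `LocalBackend` method (the parameter names are assumptions):

```python
import os
from pathlib import Path
from typing import Union


def join_path(filepath: Union[str, Path],
              *filepaths: Union[str, Path]) -> Union[str, Path]:
    """Join path components; return a Path if any input is a Path, else a str."""
    result = os.path.join(filepath, *filepaths)  # always a str here
    if any(isinstance(p, Path) for p in (filepath, *filepaths)):
        return Path(result)
    return result
```

With this shape, `join_path("a", "b")` stays a `str`, while `join_path(Path("a"), "b")` returns a `Path`, so callers that use `Path` methods on the result no longer hit `AttributeError`.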

## Related Issue

Closes NVIDIA#32

Co-authored-by: Maosheng Liao <maoshengl@nvidia.com>
yy-code-nv force-pushed the yangyangt/try_sync_with_internal branch from 90e7ca9 to 21230f9 on June 11, 2026 17:23
ychao-nvidia and others added 5 commits June 11, 2026 11:34
### Summary
Documents the Cosmos3-Nano-Policy-DROID policy server and aligns it with
the [cosmos
cookbook](https://github.com/NVIDIA/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_policy_with_cosmos_framework.md)
so the two stay consistent. Replaces the prior RoboLab/OpenPI WebSocket
guide with a Docker-based server-client workflow.

### Changes
- **`docs/action_policy_droid_server.md`** (new): full guide for serving
Cosmos3-Nano-Policy-DROID via a policy **Server** that streams actions
to a RoboLab **Client**, using a Docker-based setup (clone, build image,
launch container).
- **`docs/action_policy_robolab_server.md`** (removed): superseded by
the above; the old uv/OpenPI WebSocket flow no longer matches the
cookbook.
- **`README.md`**: add a TOC entry, a Policy Server section, and a
reference-table row linking the new guide.

### Impact
Docs-only change; no code paths affected.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Hi Cosmos team,

We are fixing some CVE issues found in `transformers<=5.0.0`. This PR
makes minor updates so the codebase works seamlessly with both pinned
`4.57.6` and `>=5.0.0` for T2I and T2V.

Signed-off-by: Hong-Yu Chiu <hongyuc@nvidia.com>
## Summary

Refactors the training data layer from the monolithic
`DataPackerDataLoader` /
`DataPacker` / `PackingIterableDataset` into a modular, four-role
abstraction
wired by a single loader. Behavior is preserved (golden-batch
byte-identical to
the legacy loader; resume validated live), and all existing recipes are
migrated.

DataDistributor  →  RawItemProcessor  →  SampleBatcher  →  BatchCollator
(shard/shuffle/      (raw item → one     (samples →        (group → one
 resume)             sample dict)        batch groups)     batch dict)

Each role is a small ABC with one required method; pick a built-in per
slot or
write your own. `CosmosDataLoader` is a `torch.utils.data.DataLoader`
subclass, so
it drops into the existing training loop.
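
The four-role split can be pictured with a framework-free sketch. The role class names come from the diagram above, but the method names (`iter_items`, `process`, `batch`, `collate`), the toy implementations, and `run_pipeline` are all hypothetical. The real `CosmosDataLoader` is a `torch.utils.data.DataLoader` subclass and is not reproduced here.

```python
from abc import ABC, abstractmethod


class DataDistributor(ABC):
    @abstractmethod
    def iter_items(self):
        """Yield raw items (sharding, shuffling, and resume would live here)."""


class RawItemProcessor(ABC):
    @abstractmethod
    def process(self, item):
        """Turn one raw item into one sample dict."""


class SampleBatcher(ABC):
    @abstractmethod
    def batch(self, samples):
        """Group an iterable of samples into lists of samples."""


class BatchCollator(ABC):
    @abstractmethod
    def collate(self, group):
        """Merge one group of samples into a single batch dict."""


class ListDistributor(DataDistributor):
    def __init__(self, items):
        self.items = items

    def iter_items(self):
        yield from self.items


class LengthProcessor(RawItemProcessor):
    def process(self, item):
        return {"text": item, "length": len(item)}


class FixedSizeBatcher(SampleBatcher):
    def __init__(self, batch_size):
        self.batch_size = batch_size

    def batch(self, samples):
        group = []
        for sample in samples:
            group.append(sample)
            if len(group) == self.batch_size:
                yield group
                group = []
        if group:
            yield group  # final partial group


class DictCollator(BatchCollator):
    def collate(self, group):
        return {key: [s[key] for s in group] for key in group[0]}


def run_pipeline(distributor, processor, batcher, collator):
    """Wire the four roles together in the order shown in the diagram."""
    samples = (processor.process(item) for item in distributor.iter_items())
    for group in batcher.batch(samples):
        yield collator.collate(group)


batches = list(run_pipeline(ListDistributor(["a", "bb", "ccc"]),
                            LengthProcessor(), FixedSizeBatcher(2),
                            DictCollator()))
```

Swapping any one slot (for example a packing batcher in place of `FixedSizeBatcher`) leaves the other three untouched, which is the point of the split.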

## What changed

### New dataflow package — `cosmos_framework/data/vfm/dataflow/`
- **Loaders:** `CosmosDataLoader` (+ `batch_size=` sugar →
`SimpleBatcher` +
`DefaultBatchCollator`), `JointCosmosDataLoader` (ratio-weighted
heterogeneous join).
- **Distributors:** `IterableDistributor`, `MapDistributor` (resumable),
  `RankPartitionedDistributor`, `MixtureDistributor`.
- **Processors:** `IdentityProcessor` (+ recipe-specific `VLMProcessor`,
  `VideoPhy2Processor`).
- **Batchers:** `SimpleBatcher`, `PoolPackingBatcher`,
`SequentialPackingBatcher`.
- **Collators:** `DefaultBatchCollator`, `VFMListCollator` (+ recipe
`VLMCollator`).

### Legacy removal
- Deleted `data_packer.py`, `data_packer_dataloader.py`,
`packing_iterable_dataset.py`, `test_dp_state_distributed.py` (+ old
tests).

### Experiment migrations
- VLM `llava_ov` (renamed from `llava_ov_datapacker`, streaming
`IterableDistributor`).
- VLM `videophy2_sft_nano`.
- VFM: existing path unchanged; added `vision_sft_nano_v2` (new-loader
variant).
- Added `llava_ov_mapresume` — map-style
(`load_dataset(streaming=False)` +
  `MapDistributor`) resumable example.

### Config / TOML
- `PATH_REMAPS["vlm"]`: route `dataloader_train.{max_samples_per_batch,
  max_sequence_length}` → nested `batcher.{max_batch_size, max_tokens}`.
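
A rough sketch of such a remap, assuming flat dotted-key overrides and assuming the nested `batcher` sits under `dataloader_train` (the table name `VLM_REMAPS` and the helper `remap_overrides` are hypothetical, not the loader's actual API):

```python
# Hypothetical remap table mirroring the two routes named above.
VLM_REMAPS = {
    "dataloader_train.max_samples_per_batch": "dataloader_train.batcher.max_batch_size",
    "dataloader_train.max_sequence_length": "dataloader_train.batcher.max_tokens",
}


def remap_overrides(flat_overrides, remaps):
    """Rewrite dotted keys per remaps, then expand them into a nested dict."""
    nested = {}
    for dotted_key, value in flat_overrides.items():
        target = remaps.get(dotted_key, dotted_key)  # unmapped keys pass through
        *parents, leaf = target.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested
```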

### Checkpoint / resume
- Renamed the resume-state selector value `"data_packer"` →
`"cosmos_dataloader"`
and env prefix `DP_STATE_` → `COSMOS_DL_STATE_`
(`DataLoaderStateCallback`,
`JointDataLoaderStateCallback`, `MapDistributor`). On-disk format
unchanged.

### CI / tests / docs
- Updated `tests/launch_regression_test.py` + launch scripts for the
`llava_ov`
  rename (golden loss keyed by `llava_ov`; workflow `-k llava_ov`).
- Added golden-batch, resume, and per-role unit tests.
- Replaced `docs/custom_dataset.md` with the `CosmosDataLoader`
tutorial; removed
  `docs/dataflow.md`.

## Validation

- **Golden-batch equality:** VLM / videophy2 / VFM batches
byte-identical to the
  legacy loader.
- **Live save→stop→resume** on `pre_exp012_llava_ov_mapresume` (8 dp
ranks,
`save_iter=100`): per-rank `input_ids` shapes identical across the
resume
boundary — **792 `(iter, rank)` keys, 0 mismatches** — and loss curves
match.
  No duplicated/skipped samples on any rank.
- **No CI risk:** the `llava_ov` golden recipe and its streaming data
path are
unchanged; the remap only affects the 3 VLM TOMLs, all of which compose
cleanly
  onto a real `PoolPackingBatcher`.

## Hard invariant

Dataloader resume + checkpoint saving must not regress. Held: resume is
preserved
through the existing `DataLoaderStateCallback`, with map-style
fast-forward and the
multi-sample contiguity guard, and validated end-to-end above.

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pipeline run via packages/cosmos-framework-release/release.sh:
- 220+ files changed/added/removed across guardrails, callbacks, configs,
  data, model, tools, utils to match current i4 source.
- local_datasets/ restored to match cosmos-framework main exactly; the dir
  is now CF-owned (excluded from the mapping going forward).
- Removed 4 orphan files re-introduced on this branch (multiview_dataloader,
  vlm/defaults/dataloader, nvlm_data_unify, nvlm_sample_loaders_and_part_filters)
  -- already excluded in mapping_config.toml; nothing in CF imports them.
- New modules brought in: data/imaginaire/webdataset/augmentors/image,
  data/vfm/action/action_processing, data/vfm/vlm/video_decoder_qwen,
  data/vlm/processors/{nemotron3densevl,nemotronvl}, model/tokenizer/evaluation,
  model/vfm/mot/cosmos3_vfm_qwen3_vl_network_test, utils/vfm/video_preprocess,
  others.
- Internal http(s) URLs scrubbed to https://invalid_url (s3://, github,
  pytorch, docs.nvidia, arxiv, etc. preserved). NFS/usr leak paths scrubbed
  to /invalid_dir. SPDX/OpenMDW-1.1 headers applied.
- COSMOS_INTERNAL flag now defaults to False (was inheriting TRAINING=True).
- Zero dangling cosmos_framework module imports.
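
The URL scrubbing step might look something like the sketch below. The regex, the substring-based host check, and the function name `scrub_urls` are assumptions for illustration; the release pipeline's actual rules are not shown in this PR. Because the pattern only matches `http(s)://`, `s3://` paths are left alone.

```python
import re

# Hosts left untouched, per the list in the commit message. Matching by
# substring is an assumption of this sketch.
PRESERVED_HOSTS = ("github", "pytorch", "docs.nvidia", "arxiv")
URL_RE = re.compile(r"https?://[^\s<>)\]]+")


def scrub_urls(text, placeholder="https://invalid_url"):
    """Replace http(s) URLs with a placeholder unless the host is preserved."""
    def replace(match):
        url = match.group(0)
        host = url.split("://", 1)[1].split("/", 1)[0]
        if any(name in host for name in PRESERVED_HOSTS):
            return url
        return placeholder

    return URL_RE.sub(replace, text)
```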

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…erministic to False

- utils/vfm/monkey_patch.py: rename _EXPECTED_TRANSFORMERS_VERSION
  -> _EXPECTED_TRANSFORMERS_VERSION_PREFIX (matches its "4.57." prefix-match
  semantics; the constant is a prefix, not an exact version).
- configs/base/vlm/defaults/policy_config.py: VLMModelConfig.deterministic
  default flipped True -> False. The comment already notes deterministic
  Flash-Attention kernels are slower and only needed for parity bit-exactness,
  so opt-in is the better default.
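
To make the prefix-match semantics concrete, here is a minimal sketch. The constant name and its `"4.57."` value come from the commit message; the helper `matches_expected_version` is hypothetical.

```python
_EXPECTED_TRANSFORMERS_VERSION_PREFIX = "4.57."


def matches_expected_version(installed_version,
                             prefix=_EXPECTED_TRANSFORMERS_VERSION_PREFIX):
    """True when the installed version string starts with the expected prefix."""
    return installed_version.startswith(prefix)
```

The trailing dot matters: `"4.57.6"` matches, while a string such as `"4.570.1"` does not.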

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
yy-code-nv force-pushed the yangyangt/try_sync_with_internal branch from 21230f9 to fcc90a3 on June 12, 2026 06:42
yy-code-nv and others added 4 commits June 12, 2026 00:18
- Stop migrating tests that pull in unshipped fixtures/helpers
  (configs/base/base_config_test.py, model/vfm/mot/cosmos3_vfm_qwen3_vl_network_test.py,
  model/vfm/vlm/nemotron_3_dense_vl/nemotron_3_dense_vl_test.py). Excluded in
  mapping_config.toml and removed from CF.
- inference/action.py: hand-sync from imaginaire4/packages/cosmos3/cosmos3/
  action.py. Adds ActionProcessingRecord / make_batched_action_processing_fields
  paths and moves pad_action_to_max_dim to the action_processing import group.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ActionPromptJsonFormatter can return a dict for the caption_key; downstream
consumers expect a string, so json.dumps it when needed.
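
A minimal sketch of that normalization (the helper name `caption_as_string` is hypothetical; the commit only says a dict caption is passed through `json.dumps`):

```python
import json


def caption_as_string(caption):
    """Return a str caption unchanged; serialize a dict caption to JSON text."""
    if isinstance(caption, dict):
        return json.dumps(caption)
    return caption
```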

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Loosened the REPLACE-NEXT semantics in the rewriter so a directive picks up the next *matching*
  line; this lets a directive placed above a docstring scrub URLs inside it.
  Applied to avae.py and utils/misc.py to scrub two internal
  gitlab-master.nvidia.com URLs that previously survived in module/class
  docstrings.
- cluster.py and unittest.py removed from cosmos_framework/configs/base/defaults/
  and excluded from the release mapping (CF-owned going forward).
- Other small updates picked up from current i4 source.
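
The "next matching line" behavior can be sketched as below. Everything about the directive here is hypothetical: the `# REPLACE-NEXT: <pattern> -> <replacement>` syntax, the regex, and the function name are invented for illustration, since the rewriter's real format is not shown in this PR. The sketch only demonstrates the idea that a directive stays pending until a line matches, rather than applying strictly to the following line.

```python
import re

# Hypothetical directive syntax, invented for this sketch.
DIRECTIVE_RE = re.compile(
    r"#\s*REPLACE-NEXT:\s*(?P<pattern>\S+)\s*->\s*(?P<repl>\S+)"
)


def apply_replace_next(lines):
    """Apply each directive to the first later line that matches its pattern."""
    out = []
    pending = None  # (compiled pattern, replacement) awaiting a matching line
    for line in lines:
        m = DIRECTIVE_RE.search(line)
        if m:
            pending = (re.compile(m.group("pattern")), m.group("repl"))
            continue  # directive lines are dropped from the output
        if pending and pending[0].search(line):
            line = pending[0].sub(pending[1], line)
            pending = None  # consumed by the first matching line
        out.append(line)
    return out
```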

No dangling imports.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The /cluster default group no longer exists (cluster.py was dropped and the
cluster entry removed from configs/base/config.py's defaults list). Hydra
errors with ConfigCompositionException when an experiment tries to override
a missing group, so strip the override from the three remaining experiments:
vision_sft_nano, vision_sft_super, action_policy_droid_nano.
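
As an illustration of the fix, the sketch below filters a Hydra-style defaults list so that overrides targeting a group that no longer exists are dropped. The helper `strip_missing_overrides`, the example entries, and the option names are hypothetical; the commit itself removes the override by hand in the three experiments.

```python
def strip_missing_overrides(defaults, available_groups):
    """Drop 'override /<group>' entries whose group is not available."""
    prefix = "override /"
    kept = []
    for entry in defaults:
        if isinstance(entry, dict):
            entry = {
                key: value
                for key, value in entry.items()
                if not (key.startswith(prefix)
                        and key[len(prefix):] not in available_groups)
            }
            if not entry:
                continue  # nothing left in this entry after filtering
        kept.append(entry)
    return kept
```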

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
7 participants