Yangyangt/try sync with internal by yy-code-nv · Pull Request #2 · yy-code-nv/cosmos-framework

yy-code-nv · 2026-06-09T14:54:23Z

No description provided.

### Summary

CI tests download input assets (e.g. action/video inputs) over the network, and these intermittently fail with transient gateway errors (502/503/504), flaking the run. This PR makes those downloads robust and avoids re-fetching the same assets every run.

### Changes

- **Backoff retry** (`inference/common/args.py`): wrap each input download in an outer retry with exponential backoff + jitter (6 attempts, env-overridable via `COSMOS_DOWNLOAD_*`). Permanent errors (400/401/403/404) fail fast.
- **Opt-in download cache**: when `COSMOS_DOWNLOAD_CACHE_DIR` is set, downloads are cached by URL and reused across runs; unset → unchanged behavior. Concurrent writers use an atomic move.
- **CI wiring** (`gpu-tests.yml`): the `unittest` and `inference-smoke` jobs point at a shared persistent cache dir (`$RUNNER_WORKSPACE/cosmos_input_cache`, outside the repo tree so cleanup keeps it), reused across runs and PRs on the same runner.

### Impact

- Production/local behavior unchanged: cache is off unless the env var is set; retry is transparent on success and only adds resilience on failure.
- Only new persisted artifact is the cache dir; replaces previously-leaked `/tmp` temp dirs in those jobs.

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
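The retry and cache behavior described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in `inference/common/args.py`: the function names (`backoff_delays`, `fetch_with_retry`, `cached_fetch`), the base delay, the cap, and the SHA-256 cache key are all assumptions made for the example. Only the attempt count (6), the fail-fast status codes, and the atomic move come from the description above.

```python
import hashlib
import os
import random
import tempfile
import time
import urllib.error
from pathlib import Path

# Status codes the PR description lists as permanent (fail fast, no retry).
PERMANENT_STATUS = {400, 401, 403, 404}


def backoff_delays(attempts, base=1.0, cap=30.0, rng=random.random):
    """Yield one sleep duration per retry: exponential growth plus jitter.

    `base` and `cap` are illustrative values, not taken from the PR.
    """
    for attempt in range(attempts - 1):
        delay = min(cap, base * (2 ** attempt))
        yield delay * (0.5 + 0.5 * rng())  # jitter scales the delay into [0.5, 1.0)


def fetch_with_retry(fetch, attempts=6, sleep=time.sleep):
    """Call fetch() until it succeeds, hits a permanent error, or runs out of attempts."""
    delays = backoff_delays(attempts)
    while True:
        try:
            return fetch()
        except urllib.error.HTTPError as err:
            if err.code in PERMANENT_STATUS:
                raise  # permanent error: retrying cannot help
            delay = next(delays, None)
            if delay is None:
                raise  # all attempts used
            sleep(delay)


def cached_fetch(url, fetch_bytes, cache_dir):
    """Return a cached file for `url`, downloading it only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical cache key: the PR only says downloads are "cached by URL".
    target = cache_dir / hashlib.sha256(url.encode()).hexdigest()
    if target.exists():
        return target
    fd, tmp_name = tempfile.mkstemp(dir=cache_dir)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(fetch_bytes(url))
        # Rename within the same directory, so concurrent writers never
        # expose a partially written file under the final name.
        os.replace(tmp_name, target)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target
```

Writing to a temporary file in the cache directory and renaming it keeps the move on one filesystem, which is what makes the rename atomic on POSIX systems.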

- Remove unused code for config.py (used for old toml config system)
- Add vision_sft_nano golden for GB200

## Summary

Adds a **DROID action-policy SFT recipe** for `nvidia/Cosmos3-Nano`, mirroring the internal `droid_lerobot_8b` policy run, so users can post-train the action-generation + action heads on DROID (LeRobot v3.0) data.

## What's included

- **`data/vfm/action/datasets/droid_lerobot_dataset.py`**: DROID LeRobot dataset with compact columnar load + episode-aware windowing (replaces an eager full-table materialization), plus `joint_pos` (8D: 7 joints + gripper) and `use_state` support.
- **`data/vfm/action/datasets/action_sft_dataset.py`** (new): `get_action_droid_sft_dataset(...)` wrapping the dataset through `ActionTransformPipeline`.
- **`configs/.../action/posttrain_config/action_policy_droid_nano.py`** (new): registered `action_policy_droid_nano` experiment (Cosmos3-Nano / 8B MoT): optimizer trains gen+action heads (5× LR on action heads), `LambdaLinear` schedule, count-based batch, res480, `encode_exact_durations=[33]` (chunk 32 → 33 frames).
- **`checkpoint/dcp.py`**: EMA warm-start: when `keys_to_skip_loading` excludes `net_ema.`, initialize `net_ema = net` from the base weights so EMA starts from the init rather than zeros.
- **`examples/toml/sft_config/action_policy_droid_{nano,repro}.toml`**: 1-GPU smoke + scaled (res480) configs.
- **`examples/launch_sft_action_policy_droid.sh`** + **`docs/action_policy_droid_posttraining.md`**: runnable launcher and walkthrough.

## Validation

End-to-end on H200:

- **1 node / 8×H200**: dry-run + training at res480, `max_samples_per_batch=32` (64 OOMs at 139 GiB; internal used 128 on GB200).
- **2 nodes / 16 ranks**: HSDP `shard 8 × replicate 2`, `TRAIN_EXIT=0`.
- Recipe faithful to internal `droid_lerobot_8b`: lr 1e-4 / betas / wd, 5× action-head LR, `LambdaLinear`, shift `{256:3,480:5,720:10}`, `concat_view`, `chunk_length=32`.

## Notes

- Count-based batch (`max_samples_per_batch`, `max_sequence_length=None`) lives in the experiment Python, because TOML cannot express `null` and the loader only overrides keys present in the TOML.
- Base checkpoint: convert `nvidia/Cosmos3-Nano` → DCP and pass via `BASE_CHECKPOINT_PATH`; action heads init fresh (skipped on load).

---------

Signed-off-by: Hao Liang <haolia@nvidia.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: lfengad <liangf@nvidia.com>
Co-authored-by: Yu-Wei Chao <82182961+ychao-nvidia@users.noreply.github.com>
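The EMA warm-start described for `checkpoint/dcp.py` can be illustrated with plain dictionaries standing in for a model state dict. This is only a sketch of the idea: the function name `warm_start_ema` and the `loaded_keys` argument are hypothetical, and the real code operates on DCP checkpoints rather than Python dicts.

```python
import copy


def warm_start_ema(state_dict, loaded_keys, net_prefix="net.", ema_prefix="net_ema."):
    """Copy loaded `net.*` entries into `net_ema.*` entries that were skipped on load.

    `loaded_keys` is a hypothetical set of keys that the checkpoint load
    actually populated; EMA keys missing from it still hold their initial
    (e.g. zero) values and are overwritten with the matching `net.*` weights.
    """
    for key in list(state_dict):
        if not key.startswith(net_prefix):
            continue
        ema_key = ema_prefix + key[len(net_prefix):]
        if ema_key in state_dict and ema_key not in loaded_keys:
            state_dict[ema_key] = copy.deepcopy(state_dict[key])
    return state_dict


# Usage: the base checkpoint filled `net.*`, while `net_ema.*` was skipped.
state = {"net.w": [1.0, 2.0], "net_ema.w": [0.0, 0.0]}
warm_start_ema(state, loaded_keys={"net.w"})
print(state["net_ema.w"])
```

The deep copy matters: the EMA weights must diverge from the live weights as training proceeds, so the two entries cannot share storage.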

…ovided (NVIDIA#33)

## Summary

`LocalBackend.join_path` accepted `Union[str, Path]` inputs but always returned `str` (via `os.path.join`), even when `Path` objects were passed. This violated the type contract and could cause `AttributeError` downstream.

## Changes

- **local_backend.py**: Now checks if any input is a `Path` and returns `Path(result)` accordingly. Removed the stale TODO that acknowledged this issue.
- **base_backend.py, easy_io.py, file_client.py**: Updated return type from `str` to `Union[str, Path]`.
- **boto3_backend.py, msc_backend.py, http_backend.py**: Updated return type signature for consistency with the abstract base class.

## Related Issue

Closes NVIDIA#32

Co-authored-by: Maosheng Liao <maoshengl@nvidia.com>
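The `join_path` fix can be sketched as a standalone function. This is an assumption-laden illustration rather than the actual `LocalBackend` method: the parameter names are made up, and only the behavior (return `Path` when any input is a `Path`, otherwise `str`) comes from the description above.

```python
import os
from pathlib import Path
from typing import Union


def join_path(filepath: Union[str, Path], *filepaths: Union[str, Path]) -> Union[str, Path]:
    """Join path components, returning a Path if any component is a Path."""
    # os.path.join accepts path-like objects but always returns a str.
    result = os.path.join(filepath, *filepaths)
    if any(isinstance(part, Path) for part in (filepath, *filepaths)):
        return Path(result)
    return result


# Usage: the return type now follows the input types.
print(type(join_path("data", "clip.mp4")).__name__)
print(type(join_path(Path("data"), "clip.mp4")).__name__)
```

Before the fix, a caller who passed a `Path` and then used an attribute such as `.parent` on the result would hit the `AttributeError` mentioned in the summary.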

### Summary

Documents the Cosmos3-Nano-Policy-DROID policy server and aligns it with the [cosmos cookbook](https://github.com/NVIDIA/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_policy_with_cosmos_framework.md) so the two stay consistent. Replaces the prior RoboLab/OpenPI WebSocket guide with a Docker-based server-client workflow.

### Changes

- **`docs/action_policy_droid_server.md`** (new): full guide for serving Cosmos3-Nano-Policy-DROID via a policy **Server** that streams actions to a RoboLab **Client**, using a Docker-based setup (clone, build image, launch container).
- **`docs/action_policy_robolab_server.md`** (removed): superseded by the above; the old uv/OpenPI WebSocket flow no longer matches the cookbook.
- **`README.md`**: add a TOC entry, a Policy Server section, and a reference-table row linking the new guide.

### Impact

Docs-only change; no code paths affected.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Hi Cosmos team,

We are fixing some CVE issues found in `transformers<=5.0.0`. This PR makes minor updates so the codebase works seamlessly with both pinned `4.57.6` and `>=5.0.0` for T2I and T2V.

Signed-off-by: Hong-Yu Chiu <hongyuc@nvidia.com>

## Summary

Refactors the training data layer from the monolithic `DataPackerDataLoader` / `DataPacker` / `PackingIterableDataset` into a modular, four-role abstraction wired by a single loader. Behavior is preserved (golden-batch byte-identical to the legacy loader; resume validated live), and all existing recipes are migrated.

DataDistributor (shard/shuffle/resume) → RawItemProcessor (raw item → one sample dict) → SampleBatcher (samples → batch groups) → BatchCollator (group → one batch dict)

Each role is a small ABC with one required method; pick a built-in per slot or write your own. `CosmosDataLoader` is a `torch.utils.data.DataLoader` subclass, so it drops into the existing training loop.

## What changed

### New dataflow package: `cosmos_framework/data/vfm/dataflow/`

- **Loaders:** `CosmosDataLoader` (+ `batch_size=` sugar → `SimpleBatcher` + `DefaultBatchCollator`), `JointCosmosDataLoader` (ratio-weighted heterogeneous join).
- **Distributors:** `IterableDistributor`, `MapDistributor` (resumable), `RankPartitionedDistributor`, `MixtureDistributor`.
- **Processors:** `IdentityProcessor` (+ recipe-specific `VLMProcessor`, `VideoPhy2Processor`).
- **Batchers:** `SimpleBatcher`, `PoolPackingBatcher`, `SequentialPackingBatcher`.
- **Collators:** `DefaultBatchCollator`, `VFMListCollator` (+ recipe `VLMCollator`).

### Legacy removal

- Deleted `data_packer.py`, `data_packer_dataloader.py`, `packing_iterable_dataset.py`, `test_dp_state_distributed.py` (+ old tests).

### Experiment migrations

- VLM `llava_ov` (renamed from `llava_ov_datapacker`, streaming `IterableDistributor`).
- VLM `videophy2_sft_nano`.
- VFM: existing path unchanged; added `vision_sft_nano_v2` (new-loader variant).
- Added `llava_ov_mapresume`, a map-style (`load_dataset(streaming=False)` + `MapDistributor`) resumable example.

### Config / TOML

- `PATH_REMAPS["vlm"]`: route `dataloader_train.{max_samples_per_batch, max_sequence_length}` → nested `batcher.{max_batch_size, max_tokens}`.

### Checkpoint / resume

- Renamed the resume-state selector value `"data_packer"` → `"cosmos_dataloader"` and env prefix `DP_STATE_` → `COSMOS_DL_STATE_` (`DataLoaderStateCallback`, `JointDataLoaderStateCallback`, `MapDistributor`). On-disk format unchanged.

### CI / tests / docs

- Updated `tests/launch_regression_test.py` + launch scripts for the `llava_ov` rename (golden loss keyed by `llava_ov`; workflow `-k llava_ov`).
- Added golden-batch, resume, and per-role unit tests.
- Replaced `docs/custom_dataset.md` with the `CosmosDataLoader` tutorial; removed `docs/dataflow.md`.

## Validation

- **Golden-batch equality:** VLM / videophy2 / VFM batches byte-identical to the legacy loader.
- **Live save→stop→resume** on `pre_exp012_llava_ov_mapresume` (8 dp ranks, `save_iter=100`): per-rank `input_ids` shapes identical across the resume boundary (**792 `(iter, rank)` keys, 0 mismatches**), and loss curves match. No duplicated/skipped samples on any rank.
- **No CI risk:** the `llava_ov` golden recipe and its streaming data path are unchanged; the remap only affects the 3 VLM TOMLs, all of which compose cleanly onto a real `PoolPackingBatcher`.

## Hard invariant

Dataloader resume + checkpoint saving must not regress. Held: resume is preserved through the existing `DataLoaderStateCallback`, with map-style fast-forward and the multi-sample contiguity guard, and validated end-to-end above.

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
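The four-role split described in this commit can be sketched in plain Python. The role names come from the commit message; everything else is an assumption for illustration: the method names (`process`, `batch`, `collate`), the toy implementations, and the `iterate_batches` driver are hypothetical and do not reflect the actual `cosmos_framework` signatures or the `CosmosDataLoader` internals.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, Iterator, List


class DataDistributor(ABC):
    """Decides which raw items this rank sees (shard/shuffle/resume)."""
    @abstractmethod
    def __iter__(self) -> Iterator[Any]: ...


class RawItemProcessor(ABC):
    """Turns one raw item into one sample dict."""
    @abstractmethod
    def process(self, item: Any) -> Dict[str, Any]: ...


class SampleBatcher(ABC):
    """Groups a stream of samples into batch groups."""
    @abstractmethod
    def batch(self, samples: Iterable[Dict[str, Any]]) -> Iterator[List[Dict[str, Any]]]: ...


class BatchCollator(ABC):
    """Merges one group of samples into one batch dict."""
    @abstractmethod
    def collate(self, group: List[Dict[str, Any]]) -> Dict[str, Any]: ...


class StridedDistributor(DataDistributor):
    """Toy distributor: rank r takes every world_size-th item starting at r."""
    def __init__(self, items, rank, world_size):
        self.items, self.rank, self.world_size = list(items), rank, world_size

    def __iter__(self):
        return iter(self.items[self.rank::self.world_size])


class WordCountProcessor(RawItemProcessor):
    """Toy processor: whitespace word count stands in for tokenization."""
    def process(self, item):
        return {"text": item, "num_tokens": len(item.split())}


class FixedSizeBatcher(SampleBatcher):
    """Toy batcher: emit groups of batch_size, plus a final partial group."""
    def __init__(self, batch_size):
        self.batch_size = batch_size

    def batch(self, samples):
        group = []
        for sample in samples:
            group.append(sample)
            if len(group) == self.batch_size:
                yield group
                group = []
        if group:
            yield group


class ListCollator(BatchCollator):
    """Toy collator: gather each field into a list."""
    def collate(self, group):
        return {key: [sample[key] for sample in group] for key in group[0]}


def iterate_batches(distributor, processor, batcher, collator):
    """Wire the four roles together in the order the commit describes."""
    samples = (processor.process(item) for item in distributor)
    for group in batcher.batch(samples):
        yield collator.collate(group)
```

Because each role exposes a single method, swapping one slot (for example a packing batcher in place of a fixed-size one) leaves the other three untouched, which is the modularity the summary points to.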

Pipeline run via packages/cosmos-framework-release/release.sh:

- 220+ files changed/added/removed across guardrails, callbacks, configs, data, model, tools, utils to match current i4 source.
- local_datasets/ restored to match cosmos-framework main exactly; the dir is now CF-owned (excluded from the mapping going forward).
- Removed 4 orphan files re-introduced on this branch (multiview_dataloader, vlm/defaults/dataloader, nvlm_data_unify, nvlm_sample_loaders_and_part_filters) -- already excluded in mapping_config.toml; nothing in CF imports them.
- New modules brought in: data/imaginaire/webdataset/augmentors/image, data/vfm/action/action_processing, data/vfm/vlm/video_decoder_qwen, data/vlm/processors/{nemotron3densevl,nemotronvl}, model/tokenizer/evaluation, model/vfm/mot/cosmos3_vfm_qwen3_vl_network_test, utils/vfm/video_preprocess, others.
- Internal http(s) URLs scrubbed to https://invalid_url (s3://, github, pytorch, docs.nvidia, arxiv, etc. preserved). NFS/usr leak paths scrubbed to /invalid_dir. SPDX/OpenMDW-1.1 headers applied.
- COSMOS_INTERNAL flag now defaults to False (was inheriting TRAINING=True).
- Zero dangling cosmos_framework module imports.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
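The URL scrubbing step mentioned above might look something like the sketch below. It is not the release pipeline's rewriter: the regular expression, the `PRESERVED_HOST_PARTS` tuple, and the substring-based host check are assumptions chosen for the example. Only the replacement target (`https://invalid_url`) and the preserved categories named in the commit message come from the source, and the "etc." in that list is left unfilled.

```python
import re
from urllib.parse import urlparse

# Host fragments the commit message names as preserved; the real list is longer ("etc.").
PRESERVED_HOST_PARTS = ("github", "pytorch", "docs.nvidia", "arxiv")

# Matches http(s) URLs only, so s3:// and other schemes are never touched.
URL_RE = re.compile(r"https?://[^\s\"'<>)]+")


def scrub_urls(text: str) -> str:
    """Replace http(s) URLs with a placeholder unless the host is on the preserve list."""
    def replace(match: re.Match) -> str:
        host = urlparse(match.group(0)).netloc
        if any(part in host for part in PRESERVED_HOST_PARTS):
            return match.group(0)
        return "https://invalid_url"

    return URL_RE.sub(replace, text)


# Usage: the internal host is scrubbed, the public one and the s3 path survive.
line = "see https://gitlab-master.nvidia.com/x and https://github.com/NVIDIA/cosmos, s3://bucket/key"
print(scrub_urls(line))
```

Running the function on already-scrubbed text leaves it unchanged, since the placeholder maps to itself.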

…erministic to False

- utils/vfm/monkey_patch.py: rename _EXPECTED_TRANSFORMERS_VERSION -> _EXPECTED_TRANSFORMERS_VERSION_PREFIX (matches its "4.57." prefix-match semantics; the constant is a prefix, not an exact version).
- configs/base/vlm/defaults/policy_config.py: VLMModelConfig.deterministic default flipped True -> False. The comment already notes deterministic Flash-Attention kernels are slower and only needed for parity bit-exactness, so opt-in is the better default.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
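The prefix-match semantics behind the rename can be shown in a few lines. The constant name and its `"4.57."` value come from the commit message; the helper `version_matches` is hypothetical and is not claimed to exist in `utils/vfm/monkey_patch.py`.

```python
_EXPECTED_TRANSFORMERS_VERSION_PREFIX = "4.57."


def version_matches(installed: str, prefix: str = _EXPECTED_TRANSFORMERS_VERSION_PREFIX) -> bool:
    """Return True when the installed version string starts with the expected prefix."""
    return installed.startswith(prefix)


# Usage: any 4.57.x patch release matches; the trailing dot rules out 4.570.x.
print(version_matches("4.57.6"), version_matches("4.570.0"), version_matches("5.0.0"))
```

The trailing dot in the constant is what keeps a hypothetical `4.570.0` from matching, which is one reason to name it a prefix rather than a version.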

- Stop migrating tests that pull in unshipped fixtures/helpers (configs/base/base_config_test.py, model/vfm/mot/cosmos3_vfm_qwen3_vl_network_test.py, model/vfm/vlm/nemotron_3_dense_vl/nemotron_3_dense_vl_test.py). Excluded in mapping_config.toml and removed from CF.
- inference/action.py: hand-sync from imaginaire4/packages/cosmos3/cosmos3/action.py. Adds ActionProcessingRecord / make_batched_action_processing_fields paths and moves pad_action_to_max_dim to the action_processing import group.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ActionPromptJsonFormatter can return a dict for the caption_key; downstream consumers expect a string, so json.dumps it when needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
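The normalization this commit describes amounts to a small type check. The helper name `caption_to_text` is hypothetical; the commit only states that dict captions are passed through `json.dumps` so downstream consumers receive a string.

```python
import json
from typing import Union


def caption_to_text(caption: Union[str, dict]) -> str:
    """Serialize dict captions to a JSON string; pass strings through unchanged."""
    if isinstance(caption, dict):
        return json.dumps(caption)
    return caption


# Usage: both input shapes come out as strings.
print(caption_to_text({"task": "pick up the cup"}))
print(caption_to_text("pick up the cup"))
```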

- Loosened REPLACE-NEXT semantics in the rewriter so that it picks up the next *matching* line; this lets a directive placed above a docstring scrub URLs inside it. Applied to avae.py and utils/misc.py to scrub two internal gitlab-master.nvidia.com URLs that previously survived in module/class docstrings.
- cluster.py and unittest.py removed from cosmos_framework/configs/base/defaults/ and excluded from the release mapping (CF-owned going forward).
- Other small updates picked up from current i4 source. No dangling imports.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The /cluster default group no longer exists (cluster.py was dropped and the cluster entry removed from configs/base/config.py's defaults list). Hydra errors with ConfigCompositionException when an experiment tries to override a missing group, so strip the override from the three remaining experiments: vision_sft_nano, vision_sft_super, action_policy_droid_nano.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

lfengad and others added 4 commits June 9, 2026 14:54

Remove unused code; Add golden for GB200 (NVIDIA#28)

55c6276

yy-code-nv force-pushed the yangyangt/try_sync_with_internal branch from 90e7ca9 to 21230f9 Compare June 11, 2026 17:23

ychao-nvidia and others added 5 commits June 11, 2026 11:34

yy-code-nv force-pushed the yangyangt/try_sync_with_internal branch from 21230f9 to fcc90a3 Compare June 12, 2026 06:42

yy-code-nv and others added 4 commits June 12, 2026 00:18

inference/action: stringify dict ai_caption output

d6e5db1

yy-code-nv wants to merge 13 commits into main from yangyangt/try_sync_with_internal

7 participants
