feat: Enable streaming in data preprocessor #437

Open
wants to merge 35 commits into main
Conversation

@willmj (Collaborator) commented Jan 14, 2025

Description of the change

These changes enable streaming in the data preprocessor and add tests for streaming datasets.
Added:

  • Add streaming as an arg in DataSetConfig, similar to sampling (see the config sketch after this list)
  • Add examples of DataSetConfig in tests/artifacts/predefined_data_configs/ for streaming
  • Add unit tests
  • Since IterableDatasets can't be indexed, use the first example where column names are needed
  • User must set max_steps instead of num_train_epochs when streaming, since an IterableDataset has no known length from which to compute epochs
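
A minimal sketch of a streaming data config (modeled on the dataprocessor schema shown later in this thread; the dataset name and path are hypothetical):

    dataprocessor:
        type: default
        streaming: true
    datasets:
      - name: my_dataset            # hypothetical name
        data_paths:
          - "/path/to/data/"        # hypothetical path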

Related issue number

How to verify the PR

  • Run new unit tests which verify that HF inference works and that passing streaming in the data config returns an IterableDataset
  • Run on single GPU without error
  • Run on multi GPU without error

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

willmj added 3 commits January 8, 2025 15:31
…r future tests, add streaming to config

Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@github-actions bot added the feat label Jan 14, 2025
@Abhishek-TAMU (Collaborator) left a comment

Thanks @willmj for integrating usage of Iterable datasets. Just some initial thoughts.

@ashokponkumar (Collaborator) commented:

Shouldn't streaming be a top-level object instead of a per-dataset object? Is it possible to mix streaming and non-streaming datasets using concat?

@seshapad (Contributor) commented:

@willmj Is this PR in a usable state? We need to run an EPT with large datasets. Without streaming, the data processing fails. We want the streaming feature to address this issue.

@kmehant (Collaborator) commented Jan 26, 2025

@willmj I would request your attention to this:

if "column_names" not in data or data.column_names is None:
if isinstance(data, IterableDataset):
if hasattr(data, "_resolve_features"):
data = data._resolve_features()
else:
raise ValueError(
"_resolve_features API is not available to fetch column names"
)
else:
raise ValueError(
f"not possible to fetch column names for the loaded dataset of type {type(data)}"
)
IterableDatasets often lose column information (sometimes on loading, sometimes after map operations are applied), so it's good to be defensive when retrieving columns wherever necessary.
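
For reference, a minimal sketch of this defensive pattern with the HF datasets library (the file name is hypothetical; _resolve_features is a private datasets API that infers the schema from a sample):

    from datasets import load_dataset, IterableDataset

    # Streaming loads defer schema resolution, so column_names may be None.
    ds = load_dataset("json", data_files="train.jsonl", split="train", streaming=True)
    if isinstance(ds, IterableDataset) and ds.column_names is None:
        # Private API: resolves features without materializing the full dataset.
        ds = ds._resolve_features()
    print(ds.column_names)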

Signed-off-by: Will Johnson <[email protected]>
    [
        (
            [TWITTER_COMPLAINTS_DATA_DIR_JSON],
            DATA_CONFIG_TOKENIZE_AND_APPLY_INPUT_MASKING_YAML,
@Abhishek-TAMU (Collaborator) commented Jan 28, 2025

I assume this yaml file is to be used for this test case: DATA_CONFIG_YAML_STREAMING

@willmj (Collaborator, Author) commented Jan 28, 2025

@seshapad I have now had a successful tuning job with streaming on multi GPU. You should be able to try it out; let me know if you run into any errors.

Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
@willmj (Collaborator, Author) commented Jan 29, 2025

Tuning + inference works! Only 200 steps, so the equivalent of less than an epoch, which is why the result is wrong, but the format is right.
Config:

      {
          "model_name_or_path": "/llama3/hf/8b_pre_trained",
          "data_config_path": "/testing/tuning/input/apply-custom-template-streaming-data-config.yaml",
          "output_dir": "/testing/tuning/output/llama3-8b/ft/tone_20250129_1045-streaming-dataconfig",
          "save_model_dir": "/testing/tuning/output/llama3-8b/ft/tone_20250129_1045-streaming-dataconfig/save_model",
          "max_steps": 200,
          "per_device_train_batch_size": 4,
          "gradient_accumulation_steps": 1,
          "learning_rate": 1e-4,
          "response_template": "\n### Response:",
          "dataset_text_field": "output"
      }

Inference result on "Text: @sho_help @showtime your arrive is terrible streaming is stop and start every couple mins. Get it together it's xmas\n\n### Label:":

{
  "responses": [
    {
      "generatedTokenCount": 2,
      "text": " polite\u003c|end_of_text|\u003e",
      "inputTokenCount": 34,
      "stopReason": "EOS_TOKEN",
      "stopSequence": "\u003c|end_of_text|\u003e"
    }
  ]
}

@seshapad (Contributor) commented Jan 30, 2025

@willmj The streaming option crashes. I have attached the log for debugging. Here is the data config:

dataprocessor:
    type: default
    sampling_stopping_strategy: all_exhausted
    seed: 66
    streaming: true
datasets:
  - name: pleias
    sampling: 1.0
    data_paths:
      - "/pleias_greek/"
    data_handlers:
      - name: apply_dataset_formatting
        arguments:
          remove_columns: ['source_directory', 'domain', 'document', 'subset', 'split', 'document_id', 'identifier', 'collection', 'license', '_meta_timestamp', '_meta_request_url', '_meta_final_url', '_meta_dataset', '_meta_job_id', '_meta_file_name', '_meta_json']
          fn_kwargs:
            dataset_text_field: "contents"

I can share the dataset with you if you wish to attempt reproducing this bug.
Configuration of the CLI used:

accelerate launch \
  --num_processes=8 \
  --dynamo_backend="no" \
  --fsdp_auto_wrap_policy="TRANSFORMER_BASED_WRAP" \
  --fsdp_cpu_ram_efficient_loading="true" \
  --fsdp_forward_prefetch="false" \
  --fsdp_offload_params="false" \
  --fsdp_sharding_strategy="HYBRID_SHARD" \
  --fsdp_state_dict_type="FULL_STATE_DICT" \
  --fsdp_sync_module_states="true" \
  --machine_rank="${RANK}" \
  --main_process_ip="${MASTER_ADDR}" \
  --main_process_port="${MASTER_PORT}" \
  --mixed_precision="no" \
  --num_machines="${WORLD_SIZE}" \
  --rdzv_backend="static" \
  --same_network \
  --use_fsdp \
  -m tuning.sft_trainer \
  --adam_beta2="0.95" \
  --aim_repo="${AIMSTACK_DB}" \
  --data_config="data_config.yaml" \
  --evaluation_strategy="no" \
  --experiment="train-nb-g8b-r18" \
  --gradient_accumulation_steps="1" \
  --gradient_checkpointing="true" \
  --include_tokens_per_second="true" \
  --learning_rate="0.0003" \
  --logging_steps="1" \
  --logging_strategy="steps" \
  --lr_scheduler_type="cosine" \
  --max_grad_norm="1" \
  --max_steps="100" \
  --model_name_or_path="ibm-granite/granite-3.1-8b-base" \
  --output_dir="/run18" \
  --packing="true" \
  --per_device_train_batch_size="8" \
  --save_steps="50" \
  --save_strategy="steps" \
  --split_batches="true" \
  --torch_dtype="bfloat16" \
  --tracker="aim" \
  --use_flash_attn="true" \
  --warmup_ratio="0.05" \
  --weight_decay="0.1" \
  2>&1 | tee -a "/run18/accelerate_launch_output.log"

cc: @ashokponkumar

… pretokenized case in data collator

Signed-off-by: Will Johnson <[email protected]>
    data = processor.load_dataset(
        None,
        streaming=processor.processor_config.streaming,
        splitName="train[:1]",
Contributor commented:

Do we still need to specify streaming if all we do is load just the first line of the train split?
Can you please check what the HF docs say about this.

@willmj (Collaborator, Author) commented:

For checking the columns, it seems fine in unit tests not to pass streaming; however, it then loads the example as a Dataset instead of an IterableDataset. If this is okay with you, we can either pass streaming through the kwargs of load_dataset, default streaming to false in load_dataset, or just set it to false for this load. Let me know what you think will work best.

Contributor commented:

If, for checking columns, a single sample can be loaded without streaming, you can choose that route and force-disable streaming in this call; I would be fine with it.

My question was whether a single sample can be loaded in all cases without performance concerns, even for large datasets. So I wanted to ask: 1) does HF load only the one sample from disk, or 2) does HF load all samples and then drop all but one?
In case (2), performance can take a hit.

@willmj (Collaborator, Author) commented:

It seems from the HF documentation on slice splits that load_dataset goes with option (1).
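
A minimal sketch of that slice-split load (file name hypothetical): "train[:1]" asks load_dataset to materialize only the first example as a regular, non-streaming Dataset:

    from datasets import load_dataset

    # The slice is applied at load time, so only one example is read.
    sample = load_dataset("json", data_files="train.jsonl", split="train[:1]")
    print(sample.column_names)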

"Setting `split_batches` to true - splitting batches among devices \
`per_device_train_batch_size` is now the global batch size, and \
should be treated as such."
)
Contributor commented:

While I can live with this check for now, I feel we should be clearer here; we must not waste a run where the user's job fails and they scramble through the logs to find this.

Is there a suggestion to make this explicit?

Also, please move this inside process_dataargs itself; we do have train_args available, so this can be done inside that function. And do we need to set accelerator_config to this dict? Do we not need to append to it?

@willmj (Collaborator, Author) commented:

Yes, per Mehant's suggestion we set accelerator_config to this dict. You bring up a good point that documentation should be added for this PR; I will add it soon.
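
For context, a minimal sketch of setting such an accelerator config on HF TrainingArguments (output_dir is hypothetical); with split_batches=True, Accelerate splits each fetched batch across devices, so per_device_train_batch_size acts as the global batch size:

    from transformers import TrainingArguments

    # split_batches makes Accelerate divide each fetched batch among devices.
    train_args = TrainingArguments(
        output_dir="out",  # hypothetical
        accelerator_config={"split_batches": True},
    )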

@willmj (Collaborator, Author) commented:

I added documentation in advanced-data-preprocessing.md, and made the warning more explicit.

@dushyantbehl (Contributor) commented:

@seshapad @HarikrishnanBalagopal can we take this branch now and kick off the EPT run which ended in error before?

@willmj has made all the code-correctness changes, so I request you to do a sanity check before we go for merge.

@seshapad (Contributor) commented Feb 6, 2025

@HarikrishnanBalagopal please provide an image for this branch. I will start an EPT.

@willmj (Collaborator, Author) commented Feb 11, 2025

According to this comment, once TRL is upgraded the highlighted case shouldn't be needed, so I have removed it. This may cause some tests to fail until the TRL upgrade is merged, specifically:

tests/test_sft_trainer.py::test_run_causallm_ft_and_inference_streaming_ept

This means this PR is waiting on upgrading TRL.

@willmj (Collaborator, Author) commented Feb 20, 2025

Removing the following test case:

@pytest.mark.parametrize(
    "datafiles, datasetconfigname",
    [
        (
            [TWITTER_COMPLAINTS_TOKENIZED_JSON],
            DATA_CONFIG_YAML_STREAMING_PRETOKENIZED,
        ),
    ],
)
def test_run_causallm_ft_and_inference_streaming_ept(datasetconfigname, datafiles):
    """Check if we can finetune causallm models using multiple datasets with multiple files"""
    with tempfile.TemporaryDirectory() as tempdir:
        data_formatting_args = copy.deepcopy(DATA_ARGS)

        # set training_data_path and response_template to none
        data_formatting_args.response_template = None
        data_formatting_args.training_data_path = None

        # add data_paths in data_config file
        with tempfile.NamedTemporaryFile(
            "w", delete=False, suffix=".yaml"
        ) as temp_yaml_file:
            with open(datasetconfigname, "r", encoding="utf-8") as f:
                data = yaml.safe_load(f)
                datasets = data["datasets"]
                for _, d in enumerate(datasets):
                    d["data_paths"] = datafiles
                yaml.dump(data, temp_yaml_file)
                data_formatting_args.data_config_path = temp_yaml_file.name

        train_args = copy.deepcopy(TRAIN_ARGS)
        train_args.output_dir = tempdir
        train_args.max_steps = 1
        train_args.packing = True

        sft_trainer.train(MODEL_ARGS, data_formatting_args, train_args)

        # validate full ft configs
        _validate_training(tempdir)
        _, checkpoint_path = _get_latest_checkpoint_trainer_state(tempdir)

        # Load the model
        loaded_model = TunedCausalLM.load(checkpoint_path, MODEL_NAME)

        # Run inference on the text
        output_inference = loaded_model.run(
            "### Text: @NortonSupport Thanks much.\n\n### Label:", max_new_tokens=50
        )
        assert len(output_inference) > 0
        assert "### Text: @NortonSupport Thanks much.\n\n### Label:" in output_inference

as it will be resolved by #468
