
Expected behavior of tokenize_and_apply_input_masking? #464

Open
mvcrouse opened this issue Feb 11, 2025 · 1 comment

@mvcrouse
Describe the bug


The function tokenize_and_apply_input_masking (see here) is applied as a preprocessing step that, among other things, ensures max_seq_length is enforced on the provided dataset. It is called from _process_dataset_configs here, which calls HF's dataset map function. The dataset map function is called here, where values such as max_length are passed to the mapped function via fn_kwargs.

As I've been debugging, it appears that this line here in tokenize_and_apply_input_masking is the issue: it takes the kwargs and pulls out fn_kwargs. However, HF's dataset.map function unpacks fn_kwargs into the function's keyword arguments (see documentation); it does not pass a dictionary containing a fn_kwargs key.

Should this

fn_kwargs = tokenizer_kwargs.get("fn_kwargs", {})
tokenizer_inner_kwargs = fn_kwargs.get("tokenizer_kwargs", {})

tokenized_comb_seqs = tokenizer(combined, **tokenizer_inner_kwargs)
tokenized_input = tokenizer(input_text, **tokenizer_inner_kwargs)

Instead be this

tokenizer_inner_kwargs = tokenizer_kwargs.get("tokenizer_kwargs", {})

tokenized_comb_seqs = tokenizer(combined, **tokenizer_inner_kwargs)
tokenized_input = tokenizer(input_text, **tokenizer_inner_kwargs)
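To make the unpacking behavior concrete, here is a minimal, self-contained sketch (with a hypothetical toy_map standing in for datasets.Dataset.map, which I have not reproduced exactly) showing why the nested fn_kwargs lookup comes back empty while a top-level lookup works:

```python
# Hypothetical stand-in for datasets.Dataset.map: per the HF docs, the
# fn_kwargs dict is unpacked into the mapped function's keyword
# arguments, i.e. function(example, **fn_kwargs).
def toy_map(function, examples, fn_kwargs=None):
    fn_kwargs = fn_kwargs or {}
    return [function(ex, **fn_kwargs) for ex in examples]

def tokenize_and_apply_input_masking(example, **tokenizer_kwargs):
    # Buggy lookup: there is no "fn_kwargs" key at this level, because
    # map already unpacked it, so this always yields {}.
    buggy = tokenizer_kwargs.get("fn_kwargs", {}).get("tokenizer_kwargs", {})
    # Proposed fix: the tokenizer options arrive at the top level.
    fixed = tokenizer_kwargs.get("tokenizer_kwargs", {})
    return {"buggy": buggy, "fixed": fixed}

out = toy_map(
    tokenize_and_apply_input_masking,
    [{"input": "hello"}],
    fn_kwargs={"tokenizer_kwargs": {"max_length": 128, "truncation": True}},
)
# out[0]["buggy"] is {}, so max_length is silently dropped;
# out[0]["fixed"] is {"max_length": 128, "truncation": True}.
```

This matches what I see while debugging: with the current code the inner tokenizer kwargs (and hence max_length) never reach the tokenizer.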


@Abhishek-TAMU (Collaborator)

Thank you, Max, for the clear description of this issue. This will be fixed as part of this PR.
