Rename TaskSelector to DataSelector.#519
Rename TaskSelector to DataSelector.#519chenyushuo wants to merge 1 commit intoagentscope-ai:mainfrom
TaskSelector to DataSelector.#519Conversation
2. Add `DataSelector` to `ExperienceFileReader` 3. Refactor `tokenize_and_mask_messages_default`
There was a problem hiding this comment.
Pull request overview
This PR renames the configuration concept of TaskSelector to DataSelector, propagates the new config through buffer readers (including experience file reading), and refactors the default assistant-token masking logic used with chat templates.
Changes:
- Rename
TaskSelectorConfig/task_selectortoDataSelectorConfig/data_selectoracross config, validators, scheduler, selector code, and tests. - Add selector support to
ExperienceFileReaderby refactoring common dataset-reading logic into_DatasetFileReader. - Refactor
tokenize_and_mask_messages_defaultto batchapply_chat_templatecalls when building assistant masks.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| trinity/common/models/utils.py | Refactors default chat-template tokenization/masking logic. |
| trinity/common/config_validator.py | Updates selector validation to use data_selector (incl. experience buffer validation). |
| trinity/common/config.py | Renames selector config type and renames config fields to data_selector; adds to experience buffer config. |
| trinity/buffer/task_scheduler.py | Updates scheduler to read selector type from data_selector. |
| trinity/buffer/selector/selector.py | Updates selector constructors/types to accept DataSelectorConfig. |
| trinity/buffer/reader/file_reader.py | Refactors file readers and adds selector initialization for experience file reading. |
| tests/trainer/trainer_test.py | Updates tests to use DataSelectorConfig. |
| tests/buffer/task_scheduler_test.py | Updates scheduler tests to use data_selector arguments/config. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| prompt_token_ids_list = tokenizer.apply_chat_template( | ||
| generation_messages, | ||
| add_generation_prompt=True, | ||
| **common_kwargs, | ||
| )["input_ids"] | ||
| response_token_ids_list = tokenizer.apply_chat_template( | ||
| response_messages, | ||
| add_generation_prompt=False, | ||
| **common_kwargs, | ||
| )["input_ids"] |
There was a problem hiding this comment.
tokenizer.apply_chat_template(...) is indexed with ["input_ids"], but tokenize_and_mask_messages_default no longer sets tokenize=True / return_dict=True (and also removed return_tensors). With HF tokenizers, apply_chat_template defaults to tokenize=False and will return a string, causing a runtime error here. Please explicitly request tokenization and a dict return (e.g., tokenize=True + return_dict=True) and keep the downstream code consistent with the returned type (list vs tensor).
| generation_messages = [] | ||
| response_messages = [] | ||
| for idx, message in enumerate(messages): | ||
| if message["role"] == "assistant": | ||
| prompt_token_ids = tokenizer.apply_chat_template( | ||
| messages[:idx], | ||
| tools=tools, | ||
| chat_template=chat_template, | ||
| add_generation_prompt=True, | ||
| enable_thinking=enable_thinking, | ||
| padding=False, | ||
| truncation=True, | ||
| return_tensors="pt", | ||
| add_special_tokens=False, | ||
| return_dict=False, | ||
| ) | ||
| prompt_length = prompt_token_ids.shape[1] | ||
| prompt_response_token_ids = tokenizer.apply_chat_template( | ||
| messages[: idx + 1], | ||
| tools=tools, | ||
| chat_template=chat_template, | ||
| add_generation_prompt=False, | ||
| enable_thinking=enable_thinking, | ||
| padding=False, | ||
| truncation=True, | ||
| return_tensors="pt", | ||
| add_special_tokens=False, | ||
| return_dict=False, | ||
| ) | ||
| prompt_response_length = prompt_response_token_ids.shape[1] | ||
| assistant_token_mask[prompt_length:prompt_response_length] = 1 | ||
| generation_messages.append(messages[:idx]) | ||
| response_messages.append(messages[: idx + 1]) | ||
| elif idx == len(messages) - 1: | ||
| response_messages.append(messages) | ||
| prompt_token_ids_list = tokenizer.apply_chat_template( | ||
| generation_messages, | ||
| add_generation_prompt=True, | ||
| **common_kwargs, | ||
| )["input_ids"] |
There was a problem hiding this comment.
generation_messages can be an empty list when there are no assistant turns. In that case apply_chat_template(generation_messages, ...) is ambiguous (empty conversation vs empty batch) and will not produce a per-assistant list aligned with response_messages, which can break the zip(...) logic and masking. Consider handling the no-assistant case explicitly (return all-zero mask) and/or ensuring you always pass a non-empty batch of conversations to apply_chat_template.
| self.selector: BaseSelector = SELECTORS.get(config.data_selector.selector_type)( | ||
| self.dataset, config.data_selector | ||
| ) |
There was a problem hiding this comment.
_init_selector calls SELECTORS.get(config.data_selector.selector_type)(...) without validating that the registry lookup succeeded. If selector_type is None or unsupported (or if config.check_and_update() was not run), this will raise a confusing TypeError: 'NoneType' object is not callable. Add an explicit check and raise a clear ValueError (or fall back to sequential) before instantiating the selector.
| self.selector: BaseSelector = SELECTORS.get(config.data_selector.selector_type)( | |
| self.dataset, config.data_selector | |
| ) | |
| selector_type = getattr(config.data_selector, "selector_type", None) | |
| selector_cls = SELECTORS.get(selector_type) | |
| if selector_cls is None: | |
| raise ValueError( | |
| f"Unknown or missing data selector type {selector_type!r} for dataset [{self.name}]." | |
| ) | |
| self.selector = selector_cls(self.dataset, config.data_selector) |
| # check if selector is supported | ||
| selector = SELECTORS.get(taskset.task_selector.selector_type) | ||
| selector = SELECTORS.get(taskset.data_selector.selector_type) | ||
| if selector is None: | ||
| raise ValueError( | ||
| f"Selector {taskset.task_selector.selector_type} is not supported." | ||
| f"Selector `{taskset.data_selector.selector_type}` " | ||
| f"in {taskset.name} is not supported." | ||
| ) |
There was a problem hiding this comment.
This selector validation assumes taskset.data_selector is always set. However, the runtime codepath supports disabling selection by setting data_selector = None (e.g., _DatasetFileReader checks for None, and SimpleTasksetScheduler explicitly sets it to None). If a user config sets data_selector: null, this will raise an AttributeError. Please guard for None (skip validation / treat as sequential) before accessing selector_type.
| # check if selector is supported | ||
| selector = SELECTORS.get(experience_buffer.data_selector.selector_type) | ||
| if selector is None: | ||
| raise ValueError( | ||
| f"Selector {experience_buffer.data_selector.selector_type} " | ||
| "in `experience_buffer` is not supported." | ||
| ) |
There was a problem hiding this comment.
This validation unconditionally checks experience_buffer.data_selector.selector_type. If data_selector is explicitly set to None in config (which other components treat as a valid way to disable selection), this will crash with AttributeError. Please handle experience_buffer.data_selector is None (and/or selector_type is None) by skipping selector validation or applying a default selector type.
| # check if selector is supported | |
| selector = SELECTORS.get(experience_buffer.data_selector.selector_type) | |
| if selector is None: | |
| raise ValueError( | |
| f"Selector {experience_buffer.data_selector.selector_type} " | |
| "in `experience_buffer` is not supported." | |
| ) | |
| # check if selector is supported (if configured) | |
| data_selector = getattr(experience_buffer, "data_selector", None) | |
| selector_type = getattr(data_selector, "selector_type", None) if data_selector is not None else None | |
| if selector_type is not None: | |
| selector = SELECTORS.get(selector_type) | |
| if selector is None: | |
| raise ValueError( | |
| f"Selector {selector_type} in `experience_buffer` is not supported." | |
| ) |
| """ | ||
| taskset_configs = config.buffer.explorer_input.tasksets | ||
| if len(taskset_configs) == 1 and taskset_configs[0].task_selector.selector_type == "sequential": | ||
| if len(taskset_configs) == 1 and taskset_configs[0].data_selector.selector_type == "sequential": |
There was a problem hiding this comment.
get_taskset_scheduler assumes taskset_configs[0].data_selector is non-None. Since other code paths support data_selector = None to disable selection, a config that sets data_selector: null would raise an AttributeError here. Consider treating None as equivalent to sequential/no-selection in this condition.
| if len(taskset_configs) == 1 and taskset_configs[0].data_selector.selector_type == "sequential": | |
| data_selector = taskset_configs[0].data_selector | |
| if len(taskset_configs) == 1 and ( | |
| data_selector is None or data_selector.selector_type == "sequential" | |
| ): |
Description
TaskSelectortoDataSelectorDataSelectortoExperienceFileReadertokenize_and_mask_messages_defaultChecklist
Please check the following items before code is ready to be reviewed.