[Feature] New transform to annotate with any classifier model with multi-classifier support #924
Comments
@Harmedox How would this transform be used in pre-training, fine-tuning, RAG, or any other application? Would this be a substitute for the current lang_id? If so, can you quantify the value? I understand "how" you want to do it, but I am not clear on the "why". Can you please elaborate more? Thanks. @shahrokhDaijavad any point of view on the actual use case for this new transform? Thanks.
Text classification is a common use case in data preprocessing. Language identification is one example of how this transform can be used. Another example is classification into domain categories such as medical, code, technology, or health, which enables us to filter high-quality domain-specific content for pre-training. In principle, if I have a classification model (whether it performs language identification or any other task) and I want to annotate the label and confidence score, this transform can be used.
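For illustration of the general pattern only (not the actual transform code; the model path and labels below are hypothetical assumptions), a minimal sketch of how a fasttext-style classifier yields the label and confidence score that would be annotated:

```python
import fasttext

# Hypothetical domain classifier; any fasttext-style model follows the same pattern.
model = fasttext.load_model("models/domain_classifier.bin")

text = "The patient was prescribed a short course of antibiotics."
labels, scores = model.predict(text, k=1)  # e.g. (('__label__medical',), array([0.97]))

label = labels[0].replace("__label__", "")  # strip fasttext's label prefix
score = float(scores[0])                    # confidence/probability score
# The proposed transform would write these two values back to the dataset,
# analogous to the "lang"/"score" columns produced by lang_id today.
```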
Yes, it should be. This would be a more generic version of lang_id.
Issei-san, I know that in our discussion last night, we said that creating a "new" transform parallel to the existing lang_id could be a quicker path, and we also discussed that this is problematic down the road as a "maintenance" issue for two transforms with overlapping functions. After our call, @touma-I convinced me that going the route of evolving the existing lang_id in the open repo to "fasttext_classification" (using the code from the inner "hack" code) is not only better, but probably faster. This "evolution" is exactly how the original lang_id in the inner repo became the "hack" version. The added step here is to comply with the new structure/APIs of outer transforms, which has to be done anyway, whether you create a new transform or evolve the existing one. Please comment if you think differently.
thanks @shahrokhDaijavad.
I recommend that the team study the existing code and come back with ideas for a high-level design, with concrete examples of the configuration parameters of the new transform and the constraints and boundaries for execution (i.e., what models we will support). Also, for the models to be used: are they open source, or have we made any improvements to the models that we need to consider for open sourcing? I am pretty sure we will have more back and forth on the design and implementation as we get started.
To limit the scope of the current work, let me change the sentence that @Harmedox wrote above (as we think about a new name for the transform). @Harmedox wrote: "The lang_id transform is currently limited to language identification models. What is being proposed is a transform that can be used to annotate with any classifier model irrespective of the underlying classification task." Let's limit the goal to: "The lang_id transform is currently limited to language identification models. What is being proposed is a transform that supports more classification tasks than just language identification (but still limited to the models and tasks that were used in the GneissWeb data preparation)."
thanks @shahrokhDaijavad. @Harmedox The paper talks about a very specific model that was trained specifically for this task. Is this the model that we are starting with?
Thanks @shahrokhDaijavad . This is an apt description.
@touma-I Yes, this is what we care about at this moment. This is the most important detail we have to know about the models.
Testing the GitHub IDs: @ran-iwamoto, @issei-ibm, and @takulake
@shahrokhDaijavad @touma-I I agree on the following direction.
Here are my thoughts, suggestions, and questions. Please give me your opinions.
@ran-iwamoto @takulake FYI.
@issei-ibm I am worried that "fasttext_classification" is too general and sets the expectation that we can support any fasttext model, which would require extensive testing and documentation to meet the open source community's needs. I suggest you choose a name that closely reflects/relates to the specific model used in the ablation study. @issei-ibm my recommendation is not to break backward compatibility. The current lang_id classifier produces a column named "lang" and a column named "score". We have pipelines out there that expect those column names in the metadata for the document. Ideally, we should preserve the "lang" column name for language classification, and my recommendation would be to follow the same approach and produce a new column name that is semantically close to the category classification the new code is doing. The paper talks about the following categories: science, education, technology & computing, and medical health. @Harmedox What is this column called today in the dataset used for pre-training?
Today, this is what we call those columns: The expectation is that, as long as the fasttext classifier follows a certain nomenclature (in this case, the one used in GneissWeb), the same transform should work. In the near term, there could be a case where a user wants to extend the category domain (e.g., sports) and use it for filtering, so we should design for this. In terms of UX, the names of the label and score columns should be configurable at runtime. If not set, they can default to the "lang" and "score" names currently used by the lang_id transform. For the sports example, at runtime a user should only need to set the location of the sports fasttext classifier and the column names for both label and score.
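A rough sketch of what such a runtime configuration could look like under this proposal; the parameter names and paths below are illustrative assumptions, not the actual transform API:

```python
# Defaults that preserve the current lang_id behavior (hypothetical parameter names).
default_config = {
    "model_path": "models/lid.176.bin",   # location of the fasttext classifier
    "content_column_name": "contents",    # column holding the document text
    "output_label_column_name": "lang",   # defaults keep backward compatibility
    "output_score_column_name": "score",
}

# The sports example from above: only the model location and the output column
# names change at runtime.
sports_config = {
    "model_path": "models/sports_classifier.bin",
    "content_column_name": "contents",
    "output_label_column_name": "sports_label",
    "output_score_column_name": "sports_score",
}
```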
@touma-I @Harmedox As the schedule is tight toward the release of the dataset, here is a list of possible options considering your suggestions/concerns: a) the top-level directory name remains as it is (i.e., "lang_id"), and the (possibly multiple) classifier implementations for the specific models used in the dataset are put into that directory. Maybe option b) is reasonable as a short-term solution. I would like to hear your opinions. Please note that we do not have much time to complete this task, so we should pick a solution that may not be perfect at this moment but can be expanded into a better one. My intention regarding backward compatibility is exactly the same as what @Harmedox said above: we should provide a way to configure model names, column names for labels/scores, etc. As for models, we cannot guarantee in advance that any arbitrary fasttext model will work, but we can say that the transform implementation was tested with the several models that we used for our own model creation. I wonder if this is sufficient for users to try the new transform implementation.
Thank you, @issei-ibm. Today is a US holiday, and you may not get a response from Maroun and Hamid until tomorrow. I like option b) too, but would like to hear from @touma-I and @Harmedox.
thanks @issei-ibm. I like the specific reference to gneissweb_classification in d). I think either b) or d) would be a good solution.
I second using gneissweb_classification in d). Option b) is also appealing as a quick fix.
@shahrokhDaijavad @touma-I @Harmedox Thank you for your comments.
@shahrokhDaijavad @touma-I @Harmedox I talked with @ran-iwamoto @takulake and we decided to go with d) for now. We are roughly done with the implementation and will open an initial PR after some tests.

Thank you, @issei-ibm. Sounds good. Looking forward to the PR.
Search before asking
Component
Transforms/Other
Feature
The lang_id transform allows the user to specify a model for language identification, and it annotates each document with the detected language and a score.
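As a rough illustration of the behavior being generalized (a sketch only, outside the actual data-prep-kit pipeline; the model path and column names are assumptions), annotating a batch of documents with a label column and a score column might look like this:

```python
import fasttext
import pandas as pd

# Hypothetical language-identification model path.
model = fasttext.load_model("models/lid.176.bin")

df = pd.DataFrame({"contents": ["Bonjour tout le monde", "Hello world"]})

def classify(text: str):
    # fasttext's predict() rejects newlines, so strip them first.
    labels, scores = model.predict(text.replace("\n", " "), k=1)
    return labels[0].replace("__label__", ""), float(scores[0])

# lang_id-style annotation: one label column and one score column per document.
df["lang"], df["score"] = zip(*df["contents"].map(classify))
```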
Challenge:
The lang_id transform is currently limited to language identification models. What is being proposed is a transform that can be used to annotate documents with any classifier model, irrespective of the underlying classification task. This is based on the assumption that the model outputs a label with an associated confidence/probability score. This new transform will have the following enhancements:
Are you willing to submit a PR?