
[Feature] New transform to annotate with any classifier model with multi-classifier support #924

Open
1 of 2 tasks
Harmedox opened this issue Jan 8, 2025 · 17 comments
@Harmedox

Harmedox commented Jan 8, 2025

Search before asking

  • I searched the issues and found no similar issues.

Component

Transforms/Other

Feature

The lang_id transform allows specifying a model for language identification, and it annotates each document with the detected language and a confidence score.

Challenge:
The lang_id transform is currently limited to language identification models. What is being proposed is a transform that can be used to annotate any classifier model irrespective of the underlying classification task. This is based on the assumption that the model outputs a label with an associated confidence/probability score. This new transform will have the following enhancements:

  1. a classifier model, or a list of models, can be loaded from a variety of sources, e.g., HF, S3, etc.
  2. the names of the label and score annotations (or a list of label-score pairs) are configurable
  3. when a list of models is used, a matching number of label-score name pairs must be configured
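To make the three enhancements concrete, a configuration for the proposed transform could pair each model source with its output column names. This is a purely illustrative sketch: every key, value, and path below is hypothetical, not an existing DPK parameter.

```python
# Hypothetical configuration for the proposed multi-classifier transform.
# All keys, values, and paths are illustrative, not an existing DPK API.
classification_params = {
    # enhancement 1: one or more classifier models, loadable from different sources
    "model_urls": [
        "https://huggingface.co/facebook/fasttext-language-identification",
        "s3://my-bucket/models/domain_classifier.bin",  # hypothetical S3 path
    ],
    # enhancement 2: one (label_column, score_column) pair per model, same order
    "output_columns": [
        ("lang", "score"),
        ("domain_label", "domain_prob"),
    ],
}

# enhancement 3: a matching number of label-score pairs must be configured
assert len(classification_params["model_urls"]) == len(
    classification_params["output_columns"]
)
```

The length check at the end captures the third enhancement: a mismatch between the model list and the column-pair list would be a configuration error.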

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@Harmedox Harmedox added the enhancement New feature or request label Jan 8, 2025
@touma-I
Collaborator

touma-I commented Jan 9, 2025

@Harmedox How would this transform be used in pre-training, fine-tuning, RAG, or any other application? Would this be a substitute for the current lang_id? If so, can you quantify the value? I understand "how" you want to do it; I am not clear on the "why". Can you please elaborate? Thanks.

@shahrokhDaijavad any point of view on the actual use case for this new transform? thanks

@Harmedox
Author

Harmedox commented Jan 9, 2025

@Harmedox How would this transform be used in pre-training, fine-tuning, RAG, or any other application? Would this be a substitute for the current lang_id? If so, can you quantify the value? I understand "how" you want to do it; I am not clear on the "why". Can you please elaborate? Thanks.

@shahrokhDaijavad any point of view on the actual use case for this new transform? thanks

Text classification is a common use case in data preprocessing. Language identification is one example of how this transform can be used. Another example is classification into domain categories such as medical, code, technology, and health, which enables us to filter high-quality domain-specific content for pre-training.

In principle, if I have a classification model (whether it performs language identification or any other task) and I want to annotate the label and confidence score, this transform can be used.

Would this be a substitute for the current lang_id?

Yes, it should be. This would be a more generic version of lang_id.

@shahrokhDaijavad shahrokhDaijavad self-assigned this Jan 16, 2025
@shahrokhDaijavad
Member

Issei-san, I know that in our discussion last night, we said that creating a "new" transform parallel to the existing lang_id could be a quicker path, and we also discussed that this is problematic down the road as a "maintenance" issue for two transforms with overlapping functions. After our call, @touma-I convinced me that going the route of evolving the existing lang_id in the open repo to "fasttext_classification" (using the code from the inner "hack" code) is not only better, but it is probably faster. This "evolution" is exactly how the original inner lang_id became the "hack" version. The added step here is to comply with the new structure/APIs of outer transforms, which has to be done anyway, whether you create a new transform or evolve the existing one. Please comment if you think differently.

@touma-I
Collaborator

touma-I commented Jan 16, 2025

thanks @shahrokhDaijavad.

  • Yes, the existing code already implements all the APIs required for DPK. If we augment the existing code, this portion of the code remains the same, which reduces coding time, testing, and maintenance effort.
  • I also believe the existing code already works with the fasttext library, so extending it to support additional fasttext categorization techniques will also save time during development, testing, and maintenance.
  • Designing a generic "fasttext_categorization" transform is a very ambitious effort. @Harmedox suggests that we should be able to load any model from S3 or HF. I think this is too broad. For the first iteration, it might be better to get a good sense of the one or two models that were used in our ablation study and focus on those.

I recommend that the team learn the existing code and come back with ideas for a high-level design, with concrete examples of the configuration parameters of the new transform and of the constraints and boundaries for execution (i.e., which models we will support).

Also, regarding the models to use: are they open source, or have we made any improvements to them that we need to consider before open sourcing?

I am pretty sure we will have more back and forth on the design and implementation as we get started.

@shahrokhDaijavad
Member

To limit the scope of the current work, let me change the sentence that @Harmedox wrote above (as we think about a new name for the transform):

@Harmedox wrote above:

The lang id transform is currently limited to language identification models. What is being proposed is a transform that can be used to annotate any classifier model irrespective of the underlying classification task.

Let's limit the goal to:

The lang id transform is currently limited to language identification models. What is being proposed is a transform that can be used to annotate more classifier tasks than just the language (but still limited to models and tasks that were used in the GneissWeb data preparation).

@touma-I
Collaborator

touma-I commented Jan 16, 2025

thanks @shahrokhDaijavad. @Harmedox The paper talks about a very specific model that was trained specifically for this task. Is this the model that we are starting with?
"Specifically, we use the supervised fastText package from [22] to train a classifier on ....."

@Harmedox
Author

The lang id transform is currently limited to language identification models. What is being proposed is a transform that can be used to annotate more classifier tasks than just the language (but still limited to models and tasks that were used in the GneissWeb data preparation).

Thanks @shahrokhDaijavad. This is an apt description.

The paper talks about a very specific model that was trained specifically for this task. Is this the model that we are starting with? "Specifically, we use the supervised fastText package from [22] to train a classifier on ....."

@touma-I Yes, this is what we care about at the moment. The most important detail we need to know about the models is this: each classifier takes a document as input and produces a label indicating whether the document belongs to the category, along with a confidence score in [0, 1].
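Under that assumption, the per-document annotation step could be sketched as below. The `predict` interface mirrors the fastText library's convention of `__label__`-prefixed labels with a parallel sequence of probabilities; the helper name itself is ours, not part of any existing transform.

```python
def classify_document(model, text, k=1):
    """Return (label, confidence) from a fastText-style classifier.

    fastText's predict() returns '__label__'-prefixed labels and a
    parallel sequence of probabilities; the score is clipped to 1.0
    because fastText can report values marginally above 1 due to
    floating-point rounding.
    """
    # fastText rejects newlines in input text, so flatten them first
    labels, probs = model.predict(text.replace("\n", " "), k=k)
    label = labels[0].removeprefix("__label__")
    score = min(float(probs[0]), 1.0)
    return label, score
```

With a real model this would be called as `classify_document(fasttext.load_model(path), doc)`; any model object exposing the same `predict` shape (including the GneissWeb quality and category classifiers, if they follow the fastText convention) would work unchanged.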

@shahrokhDaijavad
Member

shahrokhDaijavad commented Jan 17, 2025

Testing the GitHub IDs: @ran-iwamoto , @issei-ibm and @takulake

@issei-ibm

@shahrokhDaijavad @touma-I I agree on the following direction.

@touma-I convinced me that going the route of evolving the existing lang_id in the open repo to "fasttext_classification" (using the code from the inner "hack" code) is not only better, but it is probably faster.

Here are my thoughts, suggestions, and questions. Please give me your opinions.

  1. We will rename the lang_id directory to fasttext_classification (I hope everyone here is comfortable with this name 🙂) and update the code and configuration files as appropriate. Please let me know if my understanding is correct.
  2. Backward compatibility. The original lang_id code uses facebook/fasttext-language-identification as its base model for classification. The default configuration of, for example, local.py will use the same model, at least for the time being. If we find a better default model, we can replace it then.
  3. Variable names. For example, nlp_langid depends on language identification, and I think we should rename it to, for example, nlp_classification or simply nlp. However, this breaks backward compatibility: someone may have accessed this variable directly to use it. I wonder if we should be conservative about this.

@ran-iwamoto @takulake FYI.

@touma-I
Collaborator

touma-I commented Jan 17, 2025

@issei-ibm I am worried that "fasttext_classification" is too general: it sets the expectation that we can support any fasttext model, and it will require extensive testing and documentation to meet the open source community's needs. I suggest you choose a name that closely reflects/relates to the specific model that the ablation study used.

@issei-ibm my recommendation is not to break backward compatibility. The current lang_id classifier produces a column named "lang" and a column named "score". We do have pipelines out there that expect those column names in the metadata for the document. Ideally we should preserve the "lang" column name for language classification, and my recommendation would be to follow the same approach and produce new column names that are "semantically" closely related to the category classification that the new code is doing. The paper talks about the following categories: science, education, technology & computing, and medical health. @Harmedox What are these columns called today in the dataset for pre-training?

@Harmedox
Author

The paper talks about the following categories: science, education, technology & computing, and medical health. @Harmedox What are these columns called today in the dataset for pre-training?

Today, these are the column names:

Quality annotation:
  • labels: fasttext_dclm_oh_eli5_label;fasttext_cosmo_10k_edu_label
  • scores: fasttext_dclm_oh_eli5_prob;fasttext_cosmo_10k_edu_prob

Category annotation:
  • labels: fasttext_technology_computing_label;fasttext_medical_health_label;fasttext_education_label;fasttext_science_label
  • scores: fasttext_technology_computing_prob;fasttext_medical_health_prob;fasttext_education_prob;fasttext_science_prob

The expectation is that, as long as the fasttext classifier follows a certain nomenclature (in this case, the one used in GneissWeb), the same transform should work. In the near term, there could be a case where a user wants to extend the category domains (e.g., sports) and use one for filtering, so we should design for this.

In terms of UX, the names of the label and score columns should be configurable at run time. If not set, they can default to the "lang" and "score" names currently used by the lang_id transform. For the sports example, at runtime a user should only need to set the location of the sports fasttext classifier and the column names for both label and score.
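That run-time configurability might look like the sketch below, where the defaults reproduce lang_id's current "lang"/"score" columns; the function name, argument names, and the "contents" column assumption are all hypothetical.

```python
def annotate_rows(rows, classify, label_column="lang", score_column="score"):
    """Add classifier output to each row under configurable column names.

    `classify` is any callable mapping a document string to a
    (label, score) pair. The defaults reproduce lang_id's current
    columns; for the sports example one would pass
    label_column="sports_label", score_column="sports_prob".
    The "contents" key is an assumed document column name.
    """
    for row in rows:
        label, score = classify(row["contents"])
        row[label_column] = label
        row[score_column] = score
    return rows


# Usage sketch for the sports case: only the classifier and the two
# column names change; the annotation logic stays identical.
annotated = annotate_rows(
    [{"contents": "soccer match report"}],
    classify=lambda text: ("sports", 0.93),  # stand-in for a fasttext model
    label_column="sports_label",
    score_column="sports_prob",
)
```

Rows are plain dicts here purely to keep the sketch self-contained; the real transform would operate on its table abstraction rather than dicts.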

@issei-ibm

@touma-I @Harmedox As the schedule is tight toward the release of the dataset, here is a list of possible options that takes your suggestions and concerns into account.

a) the top-level directory name remains (i.e., "lang_id"), and (possibly multiple) classifier implementations for the specific models used in the dataset are put into the directory.
b) the top-level directory name remains (i.e., "lang_id"), and a single classifier implementation is put into the directory, configured by default for language identification and configurable for other label/score column names.
c) change the top-level directory name to "fasttext_classification", with a single classifier implementation as in b).
d) add a new top-level directory like "gneissweb_classification" that indicates it was used for data creation and does not claim general capability, with a single classifier implementation as in b).

Maybe option b) is reasonable as a short-term solution. I would like to hear your opinions. Please note that we do not have much time to complete this task, so we should pick a solution that may not be perfect at this moment but can be expanded into a better one.

My intention regarding backward compatibility is exactly the same as what @Harmedox said above. We should provide a way to configure model names, column names for labels/scores, etc. As for models, we cannot guarantee in advance that any fasttext model will work, but we can say that the transform implementation was tested with the several models that we used for our model creation. I wonder if this is sufficient for users to try the new transform implementation.

@shahrokhDaijavad
Member

Thank you, @issei-ibm. Today is a US Holiday, and you may not get a response from Maroun and Hamid until tomorrow. I like option b too, but would like to hear from @touma-I and @Harmedox.

@touma-I
Collaborator

touma-I commented Jan 21, 2025

thanks @issei-ibm. I like the specific reference to gneissweb_classification in d). I think either b) or d) would be a good solution.

@Harmedox
Author

I second using gneissweb_classification in (d). Option (b) is also appealing as a quick fix.

@issei-ibm

issei-ibm commented Jan 22, 2025

@shahrokhDaijavad @touma-I @Harmedox Thank you for your comments. Then let us proceed with option b) for now. @ran-iwamoto @takulake are working on this and will make an initial PR this week.
Update: I discussed with @ran-iwamoto and @takulake again about whether to choose b) or d), and d) might be better so that users of the current lang_id are not confused. I will update when we have a consensus on our side.

@issei-ibm

issei-ibm commented Jan 23, 2025

@shahrokhDaijavad @touma-I @Harmedox I talked with @ran-iwamoto @takulake and decided to go with d) now. We are roughly done with its implementation. We will make an initial PR after some tests.

Thank you, @issei-ibm. Sounds good. Looking forward to the PR.
