This project shows how some basic but important text preprocessing steps are carried out on data using the regular expression library. Then, using embeddings, the key of a given word is hashed to an ID so that similarity searches can be carried out much like a dictionary lookup. The Hugging Face transformers library played a big role in the project: it was used to get the embeddings of the sentences, i.e. the word meanings.
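As a rough illustration of the flow described above, here is a minimal sketch using only the Python standard library. The cleaning rules, the names `clean_text`, `word_to_id` and `most_similar`, and the sample entries are all assumptions made for this example, not the project's actual code. The three-dimensional vectors are hand-written placeholders standing in for the Hugging Face embeddings of each meaning.

```python
import math
import re


def clean_text(text: str) -> str:
    """Lowercase, drop HTML-like tags and non-letters, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML-like tags
    text = re.sub(r"[^a-z\s]", " ", text)     # keep only letters and whitespace
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical sample entries: word -> raw (dirty) meaning.
entries = {"Apple": "A round <b>fruit</b>!!", "Car": "A road vehicle, with 4 wheels."}

# Map each cleaned word to an integer ID, and keep the reverse lookup.
word_to_id = {clean_text(word): i for i, word in enumerate(entries)}
id_to_word = {i: word for word, i in word_to_id.items()}

# Placeholder vectors keyed by ID (assumption: real ones come from a model).
vectors = {0: [0.9, 0.1, 0.0], 1: [0.0, 0.2, 0.9]}


def most_similar(query_vec):
    """Return the word whose stored vector is closest to query_vec."""
    best_id = max(vectors, key=lambda i: cosine(query_vec, vectors[i]))
    return id_to_word[best_id]


print(clean_text(entries["Apple"]))
print(most_similar([1.0, 0.0, 0.0]))
```

Keeping the vectors keyed by integer ID rather than by the word itself means the lookup table and the embedding store can be built and saved separately.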
The data used for this project is a CSV dataset of words and their meanings covering the letters A to Z, which was joined together using the os library before the preprocessing stages.
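One way the joining step could look is sketched below. It assumes the data is split into several CSV files (for example one per letter) that share the same header row. The function name `merge_csv_files` and the paths in the usage comment are hypothetical.

```python
import csv
import os


def merge_csv_files(src_dir: str, out_path: str) -> int:
    """Concatenate every .csv file in src_dir into out_path; return the row count.

    Assumes all files share the same header. Write out_path outside src_dir
    so a previous output is not merged back in on a later run.
    """
    names = sorted(n for n in os.listdir(src_dir) if n.lower().endswith(".csv"))
    rows_written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out_file:
        writer = None
        for name in names:
            path = os.path.join(src_dir, name)
            with open(path, newline="", encoding="utf-8") as in_file:
                reader = csv.DictReader(in_file)
                if reader.fieldnames is None:
                    continue  # skip empty files
                if writer is None:
                    # Take the header from the first non-empty file.
                    writer = csv.DictWriter(out_file, fieldnames=reader.fieldnames)
                    writer.writeheader()
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


# Example call (hypothetical paths):
# merge_csv_files("data/letters", "data/dictionary.csv")
```

Sorting the file names keeps the merged output in a stable A to Z order regardless of how the operating system lists the directory.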
Similarity search of this kind can be used for:
- Information retrieval
- Text classification systems
- Recommendation systems for products and services
- Document clustering
This project follows an approach similar to the similarity search offered by the Cohere library. It was aimed at making use of several important libraries and at walking through the steps needed to preprocess such dirty data. Similar steps would apply wherever searches of this kind are needed.