HateSpeechDetect-text

Project Overview

In this project, I focused on benchmarking various machine learning models, deep learning architectures, and fine-tuned BERT-based models to evaluate their performance across multiple metrics. The aim was to establish a robust and efficient framework for text classification tasks, ultimately improving the overall accuracy of predictions by 25%.

Key Highlights

Data Collection:
- Scraped data from Twitter using BeautifulSoup (BS4), collecting tweets related to a specific domain for text classification tasks.
- Preprocessed the scraped data to clean, tokenize, and transform it for effective model training.
Machine Learning Models:
- Benchmarked traditional ML models, including:
  - Logistic Regression
  - Support Vector Classifier
  - Decision Tree Classifier
  - Random Forest Classifier
  - Gradient Boosting Classifier
  - AdaBoost Classifier
  - XGBoost Classifier
Deep Learning Architectures:
- Evaluated the performance of deep learning models:
  - LSTM (Long Short-Term Memory)
  - Bi-LSTM (Bidirectional LSTM)
  - GRU (Gated Recurrent Unit)
Fine-Tuned BERT-Based Models:
- Implemented and fine-tuned transformer-based models, including:
  - BERT
  - RoBERTa
  - ALBERT
  - DistilBERT
- Employed QLoRA-based fine-tuning on the Phi-2 model to enhance its performance.
Performance Metrics:
- Evaluated models on the following metrics:
  - Accuracy
  - Precision (Macro and Weighted)
  - Recall (Macro and Weighted)
  - F1-Score (Macro and Weighted)
Achievements:
- Improved the best-performing model’s accuracy by 25% through advanced fine-tuning and hyperparameter optimization.
- Achieved consistent performance with BERT-based models, maintaining 91% accuracy across multiple datasets.
Best Performing Model:
- After rigorous benchmarking, DistilBERT emerged as the top-performing model, achieving a balanced performance with 91% accuracy.
- The fine-tuned DistilBERT model has been pushed to the datafreak/hatespeech-distill-bert repository.
FastAPI App:
- The FastAPI application serving the model has been dockerized to make it easily deployable and scalable.

Benchmarking Results

Model	Accuracy	Precision (Macro)	Recall (Macro)	F1 (Macro)	Precision (Weighted)	Recall (Weighted)	F1 (Weighted)
Logistic Regression	0.75	0.58	0.67	0.59	0.86	0.75	0.79
Support Vector Classifier	0.75	0.54	0.60	0.55	0.83	0.75	0.78
Decision Tree Classifier	0.72	0.51	0.53	0.51	0.79	0.72	0.75
Random Forest Classifier	0.83	0.59	0.60	0.59	0.82	0.83	0.82
Gradient Boosting Classifier	0.75	0.57	0.63	0.57	0.84	0.75	0.79
AdaBoost Classifier	0.71	0.54	0.59	0.54	0.83	0.71	0.76
XGBoost Classifier	0.81	0.59	0.62	0.60	0.83	0.81	0.82
LSTM	0.73	0.57	0.54	0.51	0.85	0.73	0.77
Bi-LSTM	0.75	0.55	0.63	0.57	0.85	0.75	0.78
GRU	0.79	0.57	0.66	0.60	0.85	0.79	0.81
BERT	0.91	0.76	0.69	0.71	0.90	0.91	0.90
roBERTa	0.91	0.75	0.72	0.74	0.90	0.91	0.90
ALBERT	0.91	0.76	0.66	0.67	0.90	0.91	0.91
DistilBERT	0.91	0.79	0.73	0.75	0.91	0.91	0.91
Phi-2 (QLoRA Fine-Tuned)	0.90	0.75	0.68	0.70	0.89	0.90	0.89

Technical Tools and Frameworks

Data Collection: BeautifulSoup (BS4), Python for scraping and preprocessing.
Modeling: Scikit-learn, TensorFlow, PyTorch, and HuggingFace Transformers.
Fine-Tuning: QLoRA-based approach for parameter-efficient fine-tuning of large language models like Phi-2.
Visualization: Matplotlib and Seaborn for plotting evaluation metrics.

Results

The project demonstrated the superior performance of BERT-based models, particularly after applying QLoRA-based fine-tuning on Phi-2, which outperformed traditional machine learning and standard deep learning models across all key metrics. Random Forest and XGBoost were the top-performing ML models, while GRU showed the best results among deep learning approaches.

How to Use the API Locally

To run the FastAPI app locally and test the DistilBERT model for hate speech detection, follow the instructions below:

1. Clone the Repository

Clone the repository containing the FastAPI application and model files.

git clone https://github.com/devroopsaha744/HateSpeecDetect-text.git
cd HateSpeecDetect-text

2. Install Dependencies

Ensure you have Python 3.10+ installed and set up a virtual environment (recommended).

python3 -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
pip install -r requirements.txt

3. Run the Docker Container (Optional)

To run the API using Docker, you can use the following commands. Make sure you’ve built the Docker image before.

docker build -t hatespeech-api .
docker run -p 8000:8000 hatespeech-api

This will start the FastAPI app and expose it on port 8000.

4. Access the API

Once the container is running, you can access the API via the following endpoint in your browser or Postman:

http://127.0.0.1:8000

API Endpoints

1. GET `/`

Description: Health check endpoint to verify if the API is running.
Response:
```
{
  "Hello": "World!"
}
```

2. POST `/predict/`

Description: Predict whether the provided text is hate speech or not.

Body:

{
  "text": "shut the fuck up, you asshole!"
}

Response:
```
{
  "prediction": "Offensive Language",
}
```
This will return a prediction of whether the text is hate speech, offensive language or neither.

You can also access the deployed API on Hugging Face Spaces:

https://datafreak-hatespeech-detect.hf.space

Name		Name	Last commit message	Last commit date
Latest commit History 47 Commits
UI		UI
backend		backend
examples		examples
hate-speech		hate-speech
Dockerfile		Dockerfile
HateSpeechDetect_using_DeepLearning.ipynb		HateSpeechDetect_using_DeepLearning.ipynb
Hate_Speech_Transformers.ipynb		Hate_Speech_Transformers.ipynb
Hatespeech_Finetuning_LLM.ipynb		Hatespeech_Finetuning_LLM.ipynb
Hatespeech_detection_classifier.ipynb		Hatespeech_detection_classifier.ipynb
LICENSE		LICENSE
LLM-finetuning-subtask-1		LLM-finetuning-subtask-1
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

HateSpeechDetect-text

Project Overview

Key Highlights

Benchmarking Results

Technical Tools and Frameworks

Results

How to Use the API Locally

1. Clone the Repository

2. Install Dependencies

3. Run the Docker Container (Optional)

4. Access the API

API Endpoints

1. GET `/`

2. POST `/predict/`

Screenshots

About

Releases

Packages

Languages

License

devroopsaha744/HateSpeechDetect-text

Folders and files

Latest commit

History

Repository files navigation

HateSpeechDetect-text

Project Overview

Key Highlights

Benchmarking Results

Technical Tools and Frameworks

Results

How to Use the API Locally

1. Clone the Repository

2. Install Dependencies

3. Run the Docker Container (Optional)

4. Access the API

API Endpoints

1. GET /

2. POST /predict/

Screenshots

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

1. GET `/`

2. POST `/predict/`

Packages