The problem

People do not tweet about the same things all the time. Transformer models, however, are mostly applied to single-topic documents. Can we correctly infer a user's ideology from small, thematically diverse tweets using transformers and clustering? Can preprocessing tweets based on their similarity help achieve that?

The contributions

Yes, we can. However, all tested combinations of methods hit an accuracy ceiling of 62-64%.
Embedding with Sentence-BERT and filtering tweets based on cosine similarity can improve information capture by 20%. Transformer-powered clustering (BERTopic) is no better than legacy clustering for this problem; worse still, BERTopic's own clustering algorithm may fail to capture the relationships in the data correctly.

Structure

Data

Three datasets in the .zip archive "Data":

- ExtractedTweets, housing data downloaded from the Kaggle source ([43] in the thesis).
- Dataset_versions, housing abridged versions of the original dataset generated through CSF.
- Dataset_random_selection, housing a single abridged version of the dataset generated through random sampling.

Tweet processing

Reduces the volume of data to <= 256 tokens, the maximum input length of the embedding model 'all-MiniLM-L6-v2'.

"CSF" stands for "Contextual Similarity Filtering". Processes the input dataset "ExtractedTweets" into "Dataset_versions" by filtering out the least similar tweets. The output contains a csv with different volumes of data corresponding to different filtering thresholds.
"Random_selection" generates a random sample from "ExtractedTweets". Serves as a baseline to check the effectiveness of CSF. Generates a csv file "Dataset_random_selection".

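A minimal sketch of the filtering idea, not the notebook code itself. It assumes each tweet is compared against the mean embedding of the same user's tweets, which is only one plausible reading of "least similar". The toy 2-D vectors stand in for SBERT embeddings, and the names `filter_by_similarity` and `random_baseline` are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def filter_by_similarity(tweets, embeddings, threshold):
    """Keep tweets whose similarity to the centroid is at least `threshold`.

    Assumption: the reference point is the user's mean embedding.
    """
    center = centroid(embeddings)
    return [t for t, e in zip(tweets, embeddings) if cosine(e, center) >= threshold]


def random_baseline(tweets, k, seed=0):
    """Keep k tweets chosen uniformly at random (the non-CSF baseline)."""
    rng = random.Random(seed)
    return rng.sample(tweets, min(k, len(tweets)))


# Toy 2-D vectors standing in for SBERT embeddings of one user's tweets.
tweets = ["tax policy", "tax reform", "budget vote", "my dog"]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]

kept = filter_by_similarity(tweets, embeddings, threshold=0.7)
baseline = random_baseline(tweets, k=len(kept))
```

Here the off-topic tweet falls below the 0.7 threshold and is dropped, while the random baseline keeps the same number of tweets regardless of content. Matching the sample size this way is a design choice of the sketch, so that the two reduced datasets differ only in how tweets were selected.
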
Clustering

Clustering-related analysis (both embedding-based and BERTopic-based) is gathered in "Clustering". It makes use of the DBCV file, credited to [30], and of two datasets: "Dataset_random_selection" and "Dataset_versions".
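
As an illustration of how a cluster accuracy figure can be obtained, the sketch below scores a clustering by majority-label purity: each cluster is assigned its most common true label, and accuracy is the share of clustered users matching it. This is an assumption about the metric, not a description of the notebook. The function name `cluster_accuracy`, the toy labels, and the use of -1 as a noise marker (the convention of density-based clusterers such as HDBSCAN) are also assumptions.

```python
from collections import Counter, defaultdict


def cluster_accuracy(cluster_ids, true_labels, noise_id=-1):
    """Share of clustered points whose label equals their cluster's majority label.

    Points assigned to `noise_id` are excluded from the score.
    """
    by_cluster = defaultdict(list)
    for cid, label in zip(cluster_ids, true_labels):
        if cid != noise_id:
            by_cluster[cid].append(label)

    total = sum(len(labels) for labels in by_cluster.values())
    if total == 0:
        return 0.0

    # For each cluster, count how many members carry the majority label.
    correct = sum(
        Counter(labels).most_common(1)[0][1] for labels in by_cluster.values()
    )
    return correct / total


# Toy example: two clusters plus one noise point.
cluster_ids = [0, 0, 0, 1, 1, -1]
true_labels = [
    "Democrat", "Democrat", "Republican",
    "Republican", "Republican", "Democrat",
]
accuracy = cluster_accuracy(cluster_ids, true_labels)
```

Excluding noise points raises the score when many users are left unclustered, so the share of noise is worth reporting alongside the accuracy.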

Results

Legacy clustering on SBERT embeddings:

Cluster accuracy for unprocessed data

VS

Cluster accuracy for CSF-processed data (threshold_0.7 from "Dataset_versions").

Clustering with BERTopic:

Cluster accuracy for unprocessed data

Cluster accuracy for CSF-processed data (threshold_0.7 from "Dataset_versions").

References

[30] D. Moulavi, P. A. Jaskowiak, R. J. Campello, A. Zimek, and J. Sander, “Density-based clustering validation,” Proceedings of the 2014 SIAM International Conference on Data Mining, 2014. doi:10.1137/1.9781611973440.96

[43] K. Pastor, “Democrat Vs. Republican Tweets,” Kaggle, 2018. [Online]. Available: https://www.kaggle.com/datasets/kapastor/democratvsrepublicantweets/data
