This repository contains Jupyter notebooks covering distinct topics I've been researching.
For the notebooks to render properly, I recommend opening them through nbviewer by clicking the headings below.
This notebook focuses on optimizing vector search operations by comparing implementations such as Faiss, cuVS, and CuPy. It uses NVIDIA Nsight Systems for profiling and performance analysis to improve the speed and scalability of GPU-accelerated nearest-neighbor search.
Explores methods for detecting near-duplicates in data using Jaccard similarity and MinHashing.
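To give a feel for the idea, here is a minimal standard-library sketch of exact Jaccard similarity next to a MinHash estimate of it. It is not the code from the notebook: the function names, the word-shingle size `k`, the choice of salted BLAKE2b as the hash family, and `num_perm=128` are all my assumptions for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of word k-grams (shingles) in text (k is an assumed default)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(items, num_perm=128):
    """MinHash signature of a non-empty set of strings.

    Each of the num_perm salted hash functions stands in for one random
    permutation; the signature keeps the minimum hash value per function.
    """
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    doc_a = shingles("the quick brown fox jumps over the lazy dog near the river")
    doc_b = shingles("the quick brown fox jumps over the lazy cat near the river")
    print("exact:   ", jaccard(doc_a, doc_b))
    print("estimate:", estimate_jaccard(minhash_signature(doc_a), minhash_signature(doc_b)))
```

The estimate gets tighter as `num_perm` grows, at the cost of longer signatures. For real workloads a dedicated library with locality-sensitive hashing is probably a better fit than this sketch.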
Demonstrates topic modeling with BERTopic on Reddit posts related to Austria from the period of the 2024 European Parliament elections. It includes data preprocessing, topic extraction, and visualization of topic trends over time. The analysis uncovers key themes in the Reddit dataset, leveraging statistical learning and unsupervised clustering of keywords.
Covers essential steps for preparing a (bike rental) dataset for Bayesian network modeling. It includes data distribution inspection, outlier and multicollinearity checks, missing value imputation, continuous variable categorization, and calculation of Weight of Evidence (WoE) and Information Value (IV) scores.
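As a rough illustration of the last step, the sketch below computes WoE per category and the overall IV for one categorical feature against a binary target. It is not taken from the notebook. The function name, the additive `smoothing` term, the sign convention (log of event share over non-event share; some references use the inverse), and the toy `season`/`high_demand` data are assumptions on my part.

```python
import math
from collections import Counter


def woe_iv(categories, targets, smoothing=0.5):
    """Return ({category: WoE}, IV) for a categorical feature and binary target.

    Assumed convention: WoE = ln(share of events / share of non-events).
    smoothing is added to every count so empty cells do not produce log(0);
    with smoothing=0 every category needs at least one event and one non-event.
    """
    events = Counter()
    non_events = Counter()
    for category, target in zip(categories, targets):
        if target:
            events[category] += 1
        else:
            non_events[category] += 1

    cats = sorted(set(events) | set(non_events))
    total_events = sum(events.values()) + smoothing * len(cats)
    total_non_events = sum(non_events.values()) + smoothing * len(cats)

    woe = {}
    iv = 0.0
    for category in cats:
        event_share = (events[category] + smoothing) / total_events
        non_event_share = (non_events[category] + smoothing) / total_non_events
        woe[category] = math.log(event_share / non_event_share)
        # Each term is non-negative: the difference and the log share a sign.
        iv += (event_share - non_event_share) * woe[category]
    return woe, iv


if __name__ == "__main__":
    # Hypothetical toy data: season of the day vs. a made-up "high demand" flag.
    season = ["summer"] * 4 + ["winter"] * 4
    high_demand = [1, 1, 1, 0, 1, 0, 0, 0]
    woe, iv = woe_iv(season, high_demand)
    print(woe, iv)
```

A feature whose categories all have the same event rate gets an IV of zero, so higher values point to features that separate the two target classes more strongly.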