Intern: Omokhoa Oshose Tosayoname | ID: CV/A1/61250 | Role: Data Analysis Intern
Organisation: Codveda Technologies | Duration: March – May 2026 | Mode: Remote
- Project Overview
- Repository Structure
- Tools and Technologies
- Level 1 — Basic Data Analysis
- Level 2 — Intermediate Data Analysis
- Level 3 — Advanced Data Analysis
- Key Findings Summary
- Dashboards
- How to Run the Notebooks
- Connect With Me
This repository contains all completed work for the Codveda Technologies Data Analytics Internship — a hands-on, project-based programme designed to develop real-world data analytics skills across three progressive levels.
Over the course of the internship, six data analytics tasks were completed, spanning the full analytics pipeline:
| Level | Focus | Tasks Completed |
|---|---|---|
| Level 1 — Basic | Data exploration and visualisation | EDA · Basic Data Visualisation |
| Level 2 — Intermediate | Pattern discovery and time-based analysis | Time Series Analysis · K-Means Clustering |
| Level 3 — Advanced | NLP and business intelligence | Sentiment Analysis · Interactive Dashboards |
All tasks are documented in fully executable Jupyter Notebooks with step-by-step explanations, inline visualisations, and analytical insights. Dashboards were built in both Power BI and Tableau Public.
codveda-data-analytics-internship/
│
├── README.md
├── requirements.txt
│
├── Level_1_Basic/
│ ├── Level_1_Basic_Data_Analysis.ipynb
│ └── outputs/
│ ├── 01_histograms.png
│ ├── 02_boxplots.png
│ ├── 03_scatter_plots.png
│ ├── 04_correlation_heatmap.png
│ ├── 05_bar_plot.png
│ └── 06_line_charts.png
│
├── Level_2_Intermediate/
│ ├── Level_2_Intermediate_Data_Analysis.ipynb
│ └── outputs/
│ ├── 07_aapl_raw_timeseries.png
│ ├── 08_moving_averages.png
│ ├── 09_decomposition.png
│ ├── 10_multi_stock_comparison.png
│ ├── 11_elbow_method.png
│ └── 12_kmeans_clusters.png
│
├── Level_3_Advanced/
│ ├── Level_3_Advanced_Data_Analysis.ipynb
│ └── outputs/
│ ├── 13_sentiment_distribution.png
│ ├── 14_polarity_subjectivity.png
│ ├── 15_sentiment_by_platform_country.png
│ ├── 16_sentiment_over_time.png
│ ├── 17_word_clouds.png
│ ├── 18_top_words.png
│ ├── 19_polarity_boxplot.png
│ ├── powerbi_dashboard_screenshot.png
│ └── tableau_dashboard_screenshot.png
│
└── datasets/
├── iris.csv
├── Stock_Prices_Data_Set.csv
├── Sentiment_dataset.csv
├── Churn_Dashboard_Data.csv
├── churn-bigml-80.csv
└── churn-bigml-20.csv
| Category | Tools |
|---|---|
| Language | Python 3.10+ |
| Data Manipulation | pandas, numpy |
| Visualisation | matplotlib, seaborn, WordCloud |
| Machine Learning | scikit-learn (KMeans, StandardScaler, PCA) |
| Time Series | statsmodels |
| NLP | TextBlob, re (regex) |
| Business Intelligence | Microsoft Power BI Desktop, Tableau Public |
| Development Environment | Jupyter Notebook, Google Colab |
| Version Control | Git, GitHub |
Dataset: Iris Dataset (150 records, 5 columns: four numerical features plus the species label)
Notebook: Level_1_Basic_Data_Analysis.ipynb
Performed a comprehensive exploratory analysis of the Iris dataset to understand its structure, distribution, and feature relationships.
Steps taken:
- Removed 3 duplicate rows identified during initial inspection
- Calculated mean, median, mode, and standard deviation for all four numerical features
- Computed a full correlation matrix and interpreted pairwise relationships
- Generated per-species summary statistics revealing how the three flower species differ
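The steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names and the seven-row toy frame are assumptions standing in for datasets/iris.csv.

```python
import pandas as pd

# Toy stand-in for datasets/iris.csv (values and column names are assumptions).
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 5.1, 7.0, 6.4, 6.3, 5.8],
    "sepal_width":  [3.5, 3.0, 3.5, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width":  [0.2, 0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# 1. Remove exact duplicate rows and count how many were dropped.
n_before = len(df)
df = df.drop_duplicates().reset_index(drop=True)
n_removed = n_before - len(df)

# 2. Mean, median, mode, and standard deviation for each numerical feature.
numeric = df.select_dtypes("number")
summary = pd.DataFrame({
    "mean": numeric.mean(),
    "median": numeric.median(),
    "mode": numeric.mode().iloc[0],   # first mode if several values tie
    "std": numeric.std(),
})

# 3. Pairwise Pearson correlation matrix.
corr = numeric.corr()

# 4. Per-species mean of every numerical feature.
per_species = df.groupby("species")[list(numeric.columns)].mean()

print(f"Removed {n_removed} duplicate row(s)")
print(summary.round(2))
```

On the real dataset, `corr` and `per_species` are the tables from which findings such as the petal length and petal width correlation would be read.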
Key findings:
| Finding | Detail |
|---|---|
| Most variable feature | Petal length (std = 1.76 cm) |
| Most consistent feature | Sepal width (std = 0.44 cm) |
| Strongest correlation | Petal length vs Petal width (r = 0.96) |
| Most separable species | Iris setosa — distinctly small petals |
Created six publication-quality charts to communicate the dataset's structure and relationships:
| Chart | Insight |
|---|---|
| Histograms | Setosa petal measurements cluster far below other species |
| Boxplots | Virginica has widest spread; setosa most consistent |
| Scatter plots | Petal dimensions form near-perfect linear clusters by species |
| Correlation heatmap | Petal features almost perfectly correlated (r = 0.96) |
| Bar plot | Virginica largest across all features |
| Line charts | Petal bands clearly species-separated across all samples |
Datasets: Stock Prices Dataset (497,472 rows, 505 symbols) · Iris Dataset
Notebook: Level_2_Intermediate_Data_Analysis.ipynb
Analysed Apple Inc. (AAPL) daily closing prices from January 2014 to December 2017.
Steps taken:
- Filtered AAPL from 505 stock symbols
- Converted date strings to datetime and set as index
- Computed 30-day and 90-day moving averages
- Applied multiplicative seasonal decomposition (period = 252 trading days)
- Normalised and compared AAPL against MSFT, GOOGL, and AMZN
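A minimal sketch of the date conversion, moving averages, and normalisation steps is shown below. It uses a synthetic price series rather than Stock_Prices_Data_Set.csv, and the column names `date` and `close` as well as the helper `total_return_pct` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily closes on business days (a stand-in for the filtered AAPL rows).
dates = pd.bdate_range("2014-01-02", periods=300)
rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "date": dates.strftime("%Y-%m-%d"),            # dates arrive as strings
    "close": 80 + np.cumsum(rng.normal(0.1, 1.0, len(dates))),
})

# Convert date strings to datetime and use them as the index.
raw["date"] = pd.to_datetime(raw["date"])
prices = raw.set_index("date")["close"]

# 30-day and 90-day simple moving averages (NaN until the window is full).
ma30 = prices.rolling(window=30).mean()
ma90 = prices.rolling(window=90).mean()

# Normalise to a base of 100 so different stocks share one scale.
normalised = prices / prices.iloc[0] * 100


def total_return_pct(start_price: float, end_price: float) -> float:
    """Percentage change from the first to the last price."""
    return (end_price / start_price - 1) * 100


# Check against the start and end prices reported in the findings table.
print(round(total_return_pct(79.02, 169.23), 1))
```

Applying the same base-100 normalisation to each peer symbol is one way to make the multi-stock comparison independent of absolute share price.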
Key findings:
| Finding | Detail |
|---|---|
| Total price return | +114.2% ($79.02 → $169.23) |
| Price minimum | $71.40 (mid-2016) |
| Price maximum | $176.42 (late 2017) |
| Best performing peer | Amazon (AMZN) — ~4× return over same period |
| Decomposition model | Multiplicative (annual seasonal cycle detected) |
Applied unsupervised machine learning to the Iris dataset to discover natural groupings without using species labels.
Steps taken:
- Standardised all four features using StandardScaler (z = (x − μ) / σ)
- Applied the Elbow Method across K = 1 to 10 — optimal K = 3 confirmed
- Fitted K-Means with K=3 and validated against true species labels
- Visualised clusters using PCA projection (97.7% variance explained)
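The clustering workflow above can be sketched with scikit-learn as follows. This is an illustrative version under stated assumptions: it loads the 150-row copy of Iris bundled with scikit-learn instead of datasets/iris.csv (so duplicates are not removed), and `n_init=10` and `random_state=42` are choices made here, not values taken from the notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# Standardise each feature: z = (x - mean) / std.
X = StandardScaler().fit_transform(iris.data)

# Elbow method: within-cluster sum of squares (inertia) for K = 1..10.
inertias = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)

# Final model with K = 3; species labels are NOT used for fitting.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Validate afterwards: cross-tabulate true species against cluster ids.
species = pd.Series(iris.target_names[iris.target], name="species")
confusion = pd.crosstab(species, pd.Series(labels, name="cluster"))
print(confusion)

# Project to two principal components for plotting.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(f"Variance explained by 2 components: {pca.explained_variance_ratio_.sum():.3f}")
```

Plotting `inertias` against K and looking for the bend gives the elbow chart. Note that the explained-variance figure depends on whether PCA is fitted on raw or standardised features.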
Key findings:
| Finding | Detail |
|---|---|
| Optimal clusters | K = 3 (confirmed by elbow method) |
| Setosa accuracy | 100% — all 48 records correctly isolated |
| Overall accuracy | ~85% — overlap between versicolor/virginica |
| PCA variance explained | 97.7% across two components |
Datasets: Social Media Sentiment Dataset (732 posts) · Customer Churn Dataset (3,333 customers)
Notebook: Level_3_Advanced_Data_Analysis.ipynb
Built a complete text analytics pipeline to classify social media posts as Positive, Negative, or Neutral.
Pipeline stages:
- Data cleaning — stripped whitespace, dropped redundant columns
- Emotion label mapping — consolidated 279 unique labels into 3 classes
- Text preprocessing — lowercase, URL/mention removal, tokenisation, stopword removal, lemmatisation
- TextBlob sentiment scoring — polarity and subjectivity computed for all 732 posts
- Visualisation — 7 charts including word clouds, frequency analysis, and platform breakdowns
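The label mapping and text preprocessing stages might look roughly like the sketch below, which uses only the standard library. The mapping excerpt, the stopword list, the fallback to "Neutral" for unmapped labels, and the function names are all hypothetical. Lemmatisation is left out to keep the example dependency-free.

```python
import re

# Hypothetical excerpt of the emotion-to-class mapping (the real one covers 279 labels).
EMOTION_TO_CLASS = {
    "joy": "Positive",
    "excitement": "Positive",
    "sadness": "Negative",
    "fear": "Negative",
    "curiosity": "Neutral",
}

# Hypothetical mini stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "is", "at", "to", "and", "of", "in", "with"}


def map_emotion(raw_label: str) -> str:
    """Strip whitespace, normalise case, and collapse to one of three classes."""
    # Assumption: labels missing from the mapping fall back to "Neutral".
    return EMOTION_TO_CLASS.get(raw_label.strip().lower(), "Neutral")


def clean_text(text: str) -> list[str]:
    """Lowercase, remove URLs and @mentions, tokenise, drop stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"@\w+", " ", text)                   # mentions
    tokens = re.findall(r"[a-z]+", text)                # alphabetic tokens only
    return [t for t in tokens if t not in STOPWORDS]


print(map_emotion("  Joy "))
print(clean_text("Loving the sunset at https://example.com with @friend #Joy!"))
```

The cleaned tokens would then feed the word clouds and frequency charts, while the polarity and subjectivity scores come from TextBlob.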
Key findings:
| Finding | Detail |
|---|---|
| Sentiment distribution | 64.6% Positive · 28.3% Negative · 7.1% Neutral |
| TextBlob agreement rate | 46.2%; low agreement is expected because 279 nuanced emotion labels were collapsed into 3 classes |
| Mean polarity — Positive | +0.191 |
| Mean polarity — Negative | −0.106 |
| Top positive words | beautiful, enjoy, feel, amazing, love |
| Top negative words | fearful, shadows, storm, heartbreak, lost |
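For clarity on how an agreement rate like the one above can be computed, here is a small sketch. It assumes a polarity score is converted to a class with a threshold of 0 (the notebook's actual threshold is not stated), and both function names are hypothetical.

```python
def label_from_polarity(polarity: float, threshold: float = 0.0) -> str:
    """Convert a polarity score in [-1, 1] to a sentiment class.

    The default threshold of 0.0 is an assumption for this sketch.
    """
    if polarity > threshold:
        return "Positive"
    if polarity < -threshold:
        return "Negative"
    return "Neutral"


def agreement_rate(reference: list[str], predicted: list[str]) -> float:
    """Share of posts where the predicted class equals the reference class."""
    if len(reference) != len(predicted):
        raise ValueError("Both label lists must have the same length")
    if not reference:
        return 0.0
    matches = sum(r == p for r, p in zip(reference, predicted))
    return matches / len(reference)


# Toy example: mapped emotion labels vs labels derived from polarity scores.
mapped = ["Positive", "Negative", "Neutral", "Positive"]
scored = [label_from_polarity(p) for p in (0.4, 0.0, 0.0, -0.2)]
print(agreement_rate(mapped, scored))
```

In the notebook the polarity values would come from TextBlob, for example via `TextBlob(text).sentiment.polarity`.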
Built a Customer Churn Analysis Dashboard in both Power BI and Tableau Public using a combined churn dataset of 3,333 customer records with 7 engineered features.
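Combining the two churn files and engineering features could be sketched as below. Only three illustrative features are shown, not the seven actually built. The column names, the toy rows, and every band label except "Established (100-149 days)" are assumptions.

```python
import pandas as pd

# Toy stand-ins for churn-bigml-80.csv and churn-bigml-20.csv.
part_80 = pd.DataFrame({
    "Account length": [128, 45, 160],
    "Total day charge": [45.07, 27.47, 41.38],
    "Total eve charge": [16.78, 16.62, 10.30],
    "Total night charge": [11.01, 11.45, 7.32],
    "Total intl charge": [2.70, 3.70, 3.29],
    "Customer service calls": [1, 5, 0],
})
part_20 = pd.DataFrame({
    "Account length": [75, 110],
    "Total day charge": [50.90, 28.34],
    "Total eve charge": [5.26, 18.75],
    "Total night charge": [8.86, 8.41],
    "Total intl charge": [1.78, 2.73],
    "Customer service calls": [6, 2],
})

# Stack both parts into one table with a fresh index.
churn = pd.concat([part_80, part_20], ignore_index=True)

# Feature 1: total charge across day, evening, night, and international usage.
charge_cols = ["Total day charge", "Total eve charge",
               "Total night charge", "Total intl charge"]
churn["Total charge"] = churn[charge_cols].sum(axis=1)

# Feature 2: flag customers at or above the 5-call risk threshold.
churn["High service calls"] = churn["Customer service calls"] >= 5

# Feature 3: tenure band from account length in days (left-closed bins).
churn["Tenure band"] = pd.cut(
    churn["Account length"],
    bins=[0, 50, 100, 150, float("inf")],
    right=False,
    labels=["New (0-49 days)", "Growing (50-99 days)",
            "Established (100-149 days)", "Long-term (150+ days)"],
)

print(churn[["Account length", "Total charge", "High service calls", "Tenure band"]])
```

Exporting a table like this to CSV is one plausible route to a file such as Churn_Dashboard_Data.csv that both Power BI and Tableau can load.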
Dashboard visuals:
- KPI Cards (Total Customers · Churn Rate · Avg Charge · Avg Service Calls)
- Churn by International Plan (100% Stacked Bar)
- Churn by Service Call Risk (Stacked Column — staircase pattern)
- Total Charge Distribution (Histogram)
- Churn Rate by State (Filled Map)
- Churned Customers by Tenure Band (Pie/Donut Chart)
- Day Usage vs Charge (Scatter Plot)
- Interactive Filters (International Plan · Voice Mail Plan)
Key findings:
| Finding | Detail |
|---|---|
| Overall churn rate | 14.5% (483 of 3,333 customers) |
| Highest churn driver | International plan (~4× higher churn rate) |
| Critical risk threshold | 5+ customer service calls |
| Highest-risk tenure segment | Established (100-149 days): 41.4% of churned customers |
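The headline figures in this table are simple ratios. A minimal pandas sketch, assuming a boolean `Churn` column and an `International plan` column with Yes/No values (both names are assumptions, and the six rows are toy data), shows how the overall rate and the per-plan comparison can be derived.

```python
import pandas as pd

# Toy data only; the real dataset has 3,333 rows.
df = pd.DataFrame({
    "International plan": ["Yes", "Yes", "No", "No", "No", "No"],
    "Churn": [True, False, False, False, True, False],
})

# Overall churn rate: the mean of a boolean column is the share of True values.
overall_rate = df["Churn"].mean()

# Churn rate per plan group, and how many times higher it is with the plan.
rate_by_plan = df.groupby("International plan")["Churn"].mean()
lift = rate_by_plan["Yes"] / rate_by_plan["No"]

print(f"Overall churn rate: {overall_rate:.1%}")
print(rate_by_plan)
print(f"Churn is {lift:.1f}x higher with an international plan (toy data)")

# Sanity check of the reported headline figure: 483 churned out of 3,333.
reported_rate = round(483 / 3333 * 100, 1)
print(reported_rate)
```

The same group-by pattern applied to service call counts or tenure bands gives the other segment-level rates shown in the dashboard.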
Across all six tasks, three cross-cutting analytical themes emerged:
1. Feature importance is not obvious from inspection alone. In the Iris dataset, petal dimensions proved far more discriminating than sepal dimensions, something that only became apparent through EDA and was then confirmed by K-Means clustering. Similarly, in the churn dataset, the international plan variable was not the most prominent field but turned out to be the strongest churn predictor.
2. Real-world data is always messy. The sentiment dataset arrived with 279 unique emotion labels, whitespace inconsistencies, and redundant index columns. The churn dataset required feature engineering before it could support meaningful dashboard analysis. Professional data work begins with cleaning, not analysis.
3. Communication is as important as computation. The same churn insights that exist as numbers in a Python script become genuinely actionable when presented in an interactive Power BI or Tableau dashboard that a business stakeholder can explore without technical knowledge.
🔗 Customer Churn Analysis Dashboard
Click the link above to explore the fully interactive dashboard — filter by International Plan, Voice Mail Plan, and click any chart to cross-filter the entire dashboard.
Built in Microsoft Power BI Desktop. Screenshot available in Level_3_Advanced/outputs/powerbi_dashboard_screenshot.png
- Open the notebook directly in Google Colab
- Upload the required datasets to a datasets/ folder in the Colab file system
- Run all cells top to bottom — all dependencies are installed automatically
Prerequisites: Python 3.10+, pip
Step 1 — Clone the repository:
git clone https://github.com/Tosa9/codveda-data-analytics-internship.git
cd codveda-data-analytics-internship
Step 2 — Install dependencies:
pip install -r requirements.txt
Step 3 — Launch Jupyter:
jupyter notebook
Step 4 — Open any notebook from the Level folders and run all cells.
Datasets are included in the datasets/ folder. The Stock Prices dataset (~24MB) may take a moment to load.
pandas>=2.0
numpy>=1.24
matplotlib>=3.7
seaborn>=0.12
scikit-learn>=1.3
statsmodels>=0.14
textblob>=0.17
wordcloud>=1.9
jupyter>=1.0
nbformat>=5.9
Install all dependencies with:
pip install -r requirements.txt
Omokhoa Oshose Tosayoname
Data Science/Analysis Intern | Mechanical Engineering Student | Junior Project Manager
Completed as part of the Codveda Technologies Data Analytics Internship Programme Intern ID: CV/A1/61250 | March – May 2026
#Codveda #CodvedaTech #CodvedaInternship #DataAnalytics #Python #MachineLearning