This project demonstrates the application of real-world data wrangling techniques using Python. The goal was to analyze the relationship between global temperature anomalies and CO₂ emissions over time. By gathering, assessing, cleaning, and combining datasets from different sources, we aimed to uncover trends and correlations that provide insights into climate change.
The research question for this project is:
"How do global temperature anomalies relate to CO₂ emissions over time?"
To answer this question:
- Dataset 1: Global temperature anomalies were programmatically downloaded from NASA's GISTEMP dataset.
- Dataset 2: CO₂ emissions data were manually downloaded from Our World in Data.
The analysis involved cleaning and combining these datasets to explore trends in global warming and greenhouse gas emissions.
- Type: CSV File
- Source: NASA GISTEMP
- Method: Programmatically downloaded.
- Variables:
  - `Year`: The year of the observation.
  - `Temperature Anomaly`: The deviation in global temperature from the baseline average (°C).
  - Monthly temperature anomalies (`Jan`, `Feb`, etc.).
- Type: CSV File
- Source: Our World in Data
- Method: Manually downloaded.
- Variables:
  - `Year`: The year of the observation.
  - `Entity`: The country or region.
  - `Annual CO₂ emissions`: Total CO₂ emissions in metric tons.
Two datasets were gathered using different methods:
- NASA's GISTEMP dataset was programmatically downloaded using Python.
- Our World in Data's CO₂ emissions dataset was manually downloaded as a CSV file.
Both datasets were loaded into a Jupyter Notebook for analysis.
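The programmatic download can be sketched as below. This is a minimal illustration, not the notebook's actual code: the URL, the `read_gistemp` helper name, and the one-line title before the header row are assumptions about the GISTEMP file, and the inline sample uses made-up stand-in values so the snippet runs without network access.

```python
import io

import pandas as pd

# Assumed location of the GISTEMP v4 global-mean table; verify on the GISTEMP site.
GISTEMP_URL = "https://data.giss.nasa.gov/gistemp/tabledata_v4/GLB.Ts+dSST.csv"


def read_gistemp(source):
    """Read a GISTEMP-style CSV from a URL, a local path, or a file-like object."""
    # Assumption: the file begins with a one-line title, so the header is on line 2.
    return pd.read_csv(source, skiprows=1)


# Offline stand-in with the same layout (values are illustrative, not real data).
sample = io.StringIO(
    "Land-Ocean: Global Means\n"
    "Year,Jan,Feb,J-D,D-N\n"
    "1880,-.18,-.24,-.17,***\n"
    "1881,-.19,-.14,-.08,-.09\n"
)
raw_temperature = read_gistemp(sample)
print(raw_temperature.head())
```

With network access, `read_gistemp(GISTEMP_URL)` would fetch the live file, and `raw_temperature.to_csv("raw_temperature_data.csv", index=False)` would keep a local raw copy. The manually downloaded CO₂ file can be loaded with a plain `pd.read_csv("raw_co2_emissions.csv")`.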
The datasets were assessed visually (`df.head()`) and programmatically (`df.info()`, `df.isnull().sum()`), revealing:
- Quality Issues:
  - Invalid placeholder values (`***`) in the temperature dataset.
  - Missing values in the `Code` column of the CO₂ dataset.
- Tidiness Issues:
  - Wide format in the temperature dataset (monthly columns needed reshaping).
  - Redundant summary columns (`J-D`, `D-N`, etc.) in the temperature dataset.
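The assessment calls named above might look like the following sketch. The two small frames are made-up stand-ins that mimic the column layout described in this README, and the placeholder count is an extra check added here for illustration.

```python
import io

import pandas as pd

# Toy stand-ins for the raw files (values are illustrative, not real data).
temp = pd.read_csv(io.StringIO(
    "Year,Jan,Feb,J-D,D-N\n"
    "1880,-.18,-.24,-.17,***\n"
    "1881,-.19,-.14,-.08,-.09\n"
))
co2 = pd.read_csv(io.StringIO(
    "Entity,Code,Year,Annual CO₂ emissions\n"
    "World,OWID_WRL,1880,100\n"
    "Africa,,1880,1\n"
))

# Visual assessment: look at the first rows.
print(temp.head())

# Programmatic assessment: dtypes and non-null counts, then missing values per column.
temp.info()
missing_per_column = co2.isnull().sum()
print(missing_per_column)

# "***" is text, not a null, so isnull() does not see it; count it explicitly.
placeholder_counts = temp.astype(str).eq("***").sum()
print(placeholder_counts)
```

A column that `info()` reports as `object` when it should be numeric is usually the first hint that a text placeholder such as `***` is present.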
The identified issues were cleaned as follows:
- Replaced invalid placeholder values (`***`) with `NaN` in the temperature dataset.
- Dropped the unnecessary `Code` column from the CO₂ dataset.
- Reshaped the temperature dataset from wide to long format, creating a single column for months.
- Removed redundant summary columns (`J-D`, `D-N`, etc.) from the temperature dataset.

The datasets were then combined on the common column `Year`.
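The cleaning and combining steps can be sketched as follows. The helper names (`clean_temperature`, `clean_co2`), the `Month` column name, and the toy input values are assumptions made for illustration; the notebook may implement the same steps differently.

```python
import io

import numpy as np
import pandas as pd

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Toy stand-ins for the raw data (values are illustrative, not real data).
temp_raw = pd.read_csv(io.StringIO(
    "Year,Jan,Feb,J-D,D-N\n"
    "1880,-.18,-.24,-.17,***\n"
    "1881,-.19,***,-.08,-.09\n"
))
co2_raw = pd.read_csv(io.StringIO(
    "Entity,Code,Year,Annual CO₂ emissions\n"
    "World,OWID_WRL,1880,100\n"
    "World,OWID_WRL,1881,110\n"
    "Africa,,1880,1\n"
    "Africa,,1881,2\n"
))


def clean_temperature(df):
    """Replace placeholders, drop summary columns, and reshape to long format."""
    df = df.replace("***", np.nan)
    month_cols = [c for c in df.columns if c in MONTHS]
    df = df[["Year"] + month_cols]  # keeps months only, dropping J-D, D-N, etc.
    long_df = df.melt(id_vars="Year", value_vars=month_cols,
                      var_name="Month", value_name="Temperature Anomaly")
    # Columns that held "***" were read as text, so convert back to numbers.
    long_df["Temperature Anomaly"] = pd.to_numeric(long_df["Temperature Anomaly"])
    return long_df


def clean_co2(df):
    """Drop the Code column, which is not needed for the analysis."""
    return df.drop(columns=["Code"])


temp_clean = clean_temperature(temp_raw)
co2_clean = clean_co2(co2_raw)
combined = temp_clean.merge(co2_clean, on="Year", how="inner")
print(combined)
```

Because the long temperature table has one row per year and month while the CO₂ table has one row per entity and year, merging on `Year` alone repeats every emissions row for each month. Filtering to a single entity or averaging the months first yields one row per year.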
Both raw and cleaned datasets were saved locally:
- Raw datasets:
  - `raw_temperature_data.csv`
  - `raw_co2_emissions.csv`
- Cleaned datasets:
  - `cleaned_temperature_data.csv`
  - `cleaned_co2_emissions.csv`
- Combined cleaned dataset:
  - `combined_cleaned_data.csv`
Two visualizations were created to answer the research question, with an optional third:
Visualization 1
Insight: This line plot shows a clear upward trend in global temperature anomalies since the mid-20th century. The plot alone does not attribute a cause, but the trend is consistent with warming driven by human activities.
Visualization 2
Insight: This line plot demonstrates a steady increase in global CO₂ emissions since the Industrial Revolution, aligning with observed rises in global temperatures.
Visualization 3 (optional)
Insight: The scatter plot shows a strong positive association between temperature anomalies and CO₂ emissions. Correlation alone does not establish causation, but the pattern is consistent with greenhouse gas emissions being a key driver of global warming.
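The strength of the relationship in the scatter plot can also be expressed as a Pearson correlation coefficient. The sketch below assumes a combined table with the columns described in this README and an entity named `World`; the numbers are toy values chosen only to make the snippet runnable.

```python
import pandas as pd

# Toy combined data (illustrative values only, not real measurements).
combined = pd.DataFrame({
    "Year": [2000, 2000, 2001, 2001, 2002, 2002],
    "Month": ["Jan", "Feb"] * 3,
    "Temperature Anomaly": [0.2, 0.4, 0.4, 0.6, 0.5, 0.7],
    "Entity": ["World"] * 6,
    "Annual CO₂ emissions": [10.0, 10.0, 12.0, 12.0, 15.0, 15.0],
})

# Collapse to one row per year: mean monthly anomaly and that year's emissions.
yearly = (
    combined[combined["Entity"] == "World"]
    .groupby("Year")
    .agg(anomaly=("Temperature Anomaly", "mean"),
         emissions=("Annual CO₂ emissions", "first"))
)

# Pearson correlation between yearly anomaly and yearly emissions.
r = yearly["anomaly"].corr(yearly["emissions"])
print(f"Pearson r = {r:.3f}")
```

Aggregating to one row per year before correlating avoids counting the same emissions value twelve times, once per month.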
If given more time, I would:
- Investigate additional greenhouse gases (e.g., methane or nitrous oxide) to understand their impact on climate change.
- Explore regional trends in temperature anomalies and emissions to identify specific countries contributing most to global warming.
- Incorporate other factors like deforestation or renewable energy adoption rates for a more comprehensive analysis.
- Clone this repository or download it as a `.zip` file.
- Extract all files into a working directory.
- Open the Jupyter Notebook (`data_wrangling_project_three.ipynb`) using JupyterLab or Jupyter Notebook.
- Run all cells sequentially to reproduce the results.
- `data_wrangling_project_three.ipynb`: The main Jupyter Notebook containing all code and analysis.
- Raw Datasets:
  - `raw_temperature_data.csv`
  - `raw_co2_emissions.csv`
- Cleaned Datasets:
  - `cleaned_temperature_data.csv`
  - `cleaned_co2_emissions.csv`
  - `combined_cleaned_data.csv`
- Python
- Pandas
- Matplotlib
- Jupyter Notebook
Feel free to reach out if you have any questions about this project!