These short guides are meant to show you some practical examples of matplotlib and pandas, not serve as comprehensive walkthroughs.
- Introduction to matplotlib visualizations
- Basic matplotlib visualization of climate data
- Basic pandas data wrangling and matplotlib visualization of climate data
- Nicolas P. Rougier has an excellent and beautifully designed Matplotlib tutorial.
- How to make beautiful data visualizations in Python with matplotlib
- Matplotlib homepage
(Note: While the Matplotlib homepage is a place you eventually want to go to, some of the documentation may be more complicated for you than necessary...)
- 10 Minutes to pandas
- Pandas cookbook
- An Introduction to Pandas, via Michael Hansen
- 12 Useful Pandas Techniques in Python for Data Manipulation
- Things in Pandas I Wish I'd Known Earlier
The data folder contains several datasets, extracted and somewhat normalized for your convenience:
- data/climate
- Sources:
- NASA-aggregated data on global temperature and greenhouse gases
- Sources:
- data/schools
- Sources:
- data/stocks
- Source:
- Daily closing prices for top tech stocks, via Yahoo Finance.
- Source:
- data/congress
Typecasting dates during the pandas import:
from os.path import join
import matplotlib.pyplot as plt
import pandas as pd
fname = join('data', 'stocks', 'YHOO.csv')
# must specify that the 'Date' column is actually a date
# and pandas will try its best to convert it
df = pd.read_csv(fname, parse_dates=['Date'])
fig, ax = plt.subplots()
ax.plot(df['Date'], df['Adj Close'])
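To confirm that the conversion took effect, check the column's dtype. This is a minimal sketch that uses a small in-memory CSV in place of YHOO.csv. The rows are invented for illustration; only the 'Date' and 'Adj Close' column names come from the example above.

```python
from io import StringIO

import pandas as pd

# Made-up rows standing in for data/stocks/YHOO.csv (illustration only)
sample = StringIO(
    "Date,Adj Close\n"
    "2015-01-02,50.17\n"
    "2015-01-05,49.13\n"
)

df = pd.read_csv(sample, parse_dates=['Date'])

# True when pandas converted the column to a datetime type
date_is_datetime = pd.api.types.is_datetime64_any_dtype(df['Date'])
print(df.dtypes)
print(date_is_datetime)
```

If the check is False, the column was most likely left as plain strings, which usually means pandas could not parse some of its values as dates.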
Without pandas, here's what that typecasting would look like:
from os.path import join
from datetime import datetime
import csv
fname = join('data', 'stocks', 'YHOO.csv')
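with open(fname, 'r', newline='') as rf:
    # pass the open file handle, not the filename, to DictReader
    data = list(csv.DictReader(rf))
for d in data:
    # every value arrives as a string, so convert each field by hand
    d['Date'] = datetime.strptime(d['Date'], '%Y-%m-%d')
    d['Adj Close'] = float(d['Adj Close'])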
# then the visualization code...
The 2014 SAT score data is an example of annoyingly difficult dirty data. The columns contain a mix of numbers and things like asterisks, which need to be cleared out if pandas is to typecast a column as all numbers/floats/etc.
The coercion can be done when read_csv() is called; check out the documentation for all of its arguments.
One argument is na_values, which lets us specify string values that should be treated as missing ("not-a-number") values, such as 'NA' or '*'.
Here's the import both without and with na_values specified:
from os.path import join
import pandas as pd
fname = join('data', 'schools', 'sat-2014.csv')
adf = pd.read_csv(fname)
bdf = pd.read_csv(fname, na_values=['*'])
Compare the dtypes attributes of adf and bdf: many more columns of the bdf dataframe are typecast as numbers.
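The difference can be seen on a tiny example. This is a sketch under stated assumptions, not the actual SAT file: the rows and the school_name and mean_score columns are hypothetical, and only number_of_test_takers and the '*' placeholder come from the text.

```python
from io import StringIO

import pandas as pd

# Hypothetical rows; '*' marks a suppressed value, as in the SAT file
raw = (
    "school_name,number_of_test_takers,mean_score\n"
    "School A,120,1510\n"
    "School B,*,*\n"
    "School C,15,1380\n"
)

adf = pd.read_csv(StringIO(raw))                    # '*' kept as text
bdf = pd.read_csv(StringIO(raw), na_values=['*'])   # '*' read as NaN

# List the columns that pandas recognized as numeric in each dataframe
numeric_cols_a = adf.select_dtypes(include='number').columns.tolist()
numeric_cols_b = bdf.select_dtypes(include='number').columns.tolist()
print(numeric_cols_a)
print(numeric_cols_b)
```

A single '*' in a column is enough to stop pandas from reading that column as numeric, which is why adf ends up with fewer numeric columns than bdf.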
Now it's easy to filter the SAT results by schools that have a minimum number of test takers:
cdf = bdf[bdf['number_of_test_takers'] >= 20]
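Note that a comparison against NaN evaluates to False, so schools whose count was replaced by a missing value are dropped by this filter along with the small schools. This is a minimal sketch with a hypothetical dataframe; only the number_of_test_takers column name comes from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical data; NaN stands in for a value that was '*' in the CSV
bdf = pd.DataFrame({
    'school_name': ['School A', 'School B', 'School C'],
    'number_of_test_takers': [120, np.nan, 15],
})

# The mask is False both for counts below 20 and for missing counts
mask = bdf['number_of_test_takers'] >= 20
cdf = bdf[mask]
print(cdf)
```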