pandas

Topics

  • Series vs. DataFrames
  • Querying, indexing, slicing
  • Combining and splitting data
  • Handling missing data
  • Continuous vs. categorical data

Workshop: Pandas and Data Manipulation

In this workshop, we will use pandas to load, describe, and query datasets.

References

  1. https://pandas.pydata.org/pandas-docs/stable/
  2. Python Data Science Handbook by Jake VanderPlas
  3. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython by Wes McKinney
  4. https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html

Installation

Windows: Start Button -> "Anaconda Prompt"

Ubuntu / MacOS: conda should be in your path

Activate the environment

conda activate mldds01

Pandas should already be installed. If not, install it:

conda install pandas

Tip: You can check the installed version by running Python with a one-line command:

python -c "import pandas; print(pandas.__version__)"

SGD to USD Exchange Rate Data

Similar to the NumPy workshop, we'll use the historical SGD to USD exchange rates from data.gov.sg to demonstrate some Pandas concepts.

In [ ]:
from IPython.display import IFrame

IFrame('https://data.gov.sg/dataset/exchange-rates-sgd-per-unit-of-usd-average-for-period-annual/resource/f927c39b-3b44-492e-8b54-174e775e0d98/view/43207b9f-1554-4afb-98fe-80dfdd6bb4f6', width=600, height=400)

Download Instructions

You should already have this dataset from the NumPy workshop. If not, here are the instructions:

  1. Go to https://data.gov.sg/dataset/exchange-rates-sgd-per-unit-of-usd-average-for-period-annual
  2. Click on the Download button
  3. Unzip and extract the .csv file. Note the path for use below.

Note: on Windows, you may wish to rename the unzipped folder path to something shorter.

Import the package

In [ ]:
import pandas as pd
In [ ]:
pd?

Two main data structures

  1. Series
  2. DataFrame

Series

pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
  • Similar to a 1-d numpy array, but with more flexible, explicit indexing
  • Each element has two components: an index and a value
  • Conceptually similar to a dictionary (see the sketch below)
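
A minimal sketch constructing a Series with an explicit index (the values here are made up for illustration):

In [ ]:
rates = pd.Series([1.35, 1.36, 1.34],
                  index=pd.to_datetime(['2015-10-01', '2015-10-02', '2015-10-05']),
                  name='exchange_rate_usd')
rates  # each element has an index (a date) and a value (a rate)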

DataFrame

pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
  • The primary pandas data structure
  • Tabular format, similar to an Excel spreadsheet
  • Two-dimensional, potentially heterogeneous tabular data structure with labeled axes (row and column indices)
  • Can be thought of as a dict-like container for Series objects (see the sketch below)
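
A minimal sketch constructing a DataFrame from a dictionary of columns (made-up values):

In [ ]:
df_example = pd.DataFrame({'usd': [1.35, 1.36], 'cny': [4.70, 4.72]},
                          index=pd.to_datetime(['2015-10-01', '2015-10-02']))
df_example  # rows are labeled by the index; columns by their names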

Read data from CSV to a Series

You can use pandas.read_csv to read data into:

  • A Series, if the data contains only 1 column and you specify the squeeze=True option, or
  • A DataFrame, for any number of columns. This is the default behavior because the DataFrame is the most flexible.


In [ ]:
# Read data into a series
#
# parse_dates: which column(s) to parse as dates
# index_col: which column (zero-based) to use as the index
# infer_datetime_format: may speed up date parsing
sgd_usd_series = pd.read_csv('/tmp/exchange-rates/exchange-rates-sgd-per-unit-of-usd-daily.csv',
                             parse_dates=['date'], index_col=0, infer_datetime_format=True,
                             squeeze=True)

# inspect the first 10 values
sgd_usd_series.head(10)
In [ ]:
sgd_usd_series.describe()
In [ ]:
# Read data into a data frame
sgd_usd_df = pd.read_csv('/tmp/exchange-rates/exchange-rates-sgd-per-unit-of-usd-daily.csv',
                         parse_dates=['date'], index_col=0, infer_datetime_format=True)

# inspect the first 10 values
sgd_usd_df.head(10)
In [ ]:
sgd_usd_df.describe()
In [ ]:
# Read data without parsing dates
temp = pd.read_csv('/tmp/exchange-rates/exchange-rates-sgd-per-unit-of-usd-daily.csv',
                   index_col=0)

# inspect the first 10 values
temp.head(10)
In [ ]:
# without parse_dates, the dtype of the index is object (strings)
temp.index
In [ ]:
# with parse_dates=['date'], the dtype of the index is datetime64
sgd_usd_df.index
In [ ]:
# read_csv has many other options for reading CSV data,
# such as custom parsing functions or custom delimiters
pd.read_csv?

Get values or indices

Series

  • use series.values
  • use series.index

DataFrame

  • use df.values
  • use df.index

Note: these are properties, not function calls with ()

For this dataset:

  • Series.values is a 1-dimensional numpy.ndarray
  • DataFrame.values is not. It is 2-dimensional:

    number of samples (rows) x 1 (column)

This is because DataFrame is a more general data structure that can hold more columns.

In [ ]:
sgd_usd_series.values
In [ ]:
sgd_usd_df.values
In [ ]:
sgd_usd_series.values.shape
In [ ]:
sgd_usd_series.values.ndim # rank = 1
In [ ]:
sgd_usd_df.values.shape
In [ ]:
sgd_usd_df.values.ndim # rank = 2
In [ ]:
# Tip: you can flatten the 3993 x 1 numpy array
sgd_usd_df.values.flatten()

The .index is the same whether it is a Series or a DataFrame

In [ ]:
sgd_usd_series.index
In [ ]:
sgd_usd_df.index

Get the summary

Try these on sgd_usd_series and sgd_usd_df.

df.columns Describe DataFrame columns
df.info() Information on a DataFrame
s.count(), df.count() Number of non-NA values

Note: these are from the Cheatsheet. Series supports fewer methods.
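
For example, a quick sketch on the DataFrame:

In [ ]:
# column labels, a structural summary, and non-NA counts
print(sgd_usd_df.columns)
sgd_usd_df.info()
sgd_usd_df.count()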

Get statistics

Try these on sgd_usd_series and sgd_usd_df.

s.sum(), df.sum() Sum of values
s.cumsum(), df.cumsum() Cumulative sum of values
s.min()/s.max(), df.min()/df.max() Minimum/maximum values
s.idxmin()/s.idxmax(), df.idxmin()/df.idxmax() Minimum/maximum index value
s.describe(), df.describe() Summary statistics
s.mean(), df.mean() Mean of values
s.median(), df.median() Median of values

Note: these are from the Cheatsheet
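
For example, a quick sketch on the Series:

In [ ]:
# a few of the statistics above, on the exchange-rate Series
print('mean:', sgd_usd_series.mean())
print('median:', sgd_usd_series.median())
print('highest rate on:', sgd_usd_series.idxmax())
sgd_usd_series.describe()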

Joins

Let's say we also need to show Singapore Dollar to Renminbi (CNY) exchange rates, but from a different dataset.

This dataset is already downloaded for you in the data folder.

In [ ]:
# data source: https://www.exchangerates.org.uk
sgd_cny_df = pd.read_csv('data/sgd_cny_rates_daily.csv',
                          parse_dates=True, index_col=0, infer_datetime_format=True)
print('First 5 entries:')
sgd_cny_df.head(5)
In [ ]:
sgd_cny_df.info()
In [ ]:
sgd_usd_df.info()

sgd_cny_df: DatetimeIndex: 3224 entries, 2018-05-27 to 2009-10-06

sgd_usd_df: DatetimeIndex: 3993 entries, 1988-01-08 to 2015-10-19

  • have different date ranges
  • one is in decreasing time order, the other in increasing time order

Pandas DataFrames make it easy to join these datasets together based on index.

You can do this without looping over the data, using DataFrame.join()


In [ ]:
sgd_usd_df.join?
In [ ]:
# Default join is a `left` join, where the index of the left DataFrame (`sgd_usd_df`) is preserved.

sgd_usd_cny = sgd_usd_df.join(sgd_cny_df)
sgd_usd_cny

We can remove the NaN entries using dropna()

In [ ]:
sgd_usd_cny = sgd_usd_cny.dropna()
sgd_usd_cny

The result is a DataFrame with entries where both exchange rates are present.

Entries where either SGD-USD or SGD-CNY are missing are excluded.

Note that even though the indexes of the two DataFrames are in different orders, join still works because it matches the individual index values.
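
A toy illustration of this label matching (made-up values):

In [ ]:
# join aligns rows by index label, not by position
a = pd.DataFrame({'x': [1, 2, 3]}, index=[1, 2, 3])
b = pd.DataFrame({'y': [30, 20, 10]}, index=[3, 2, 1])
a.join(b)  # y comes out as 10, 20, 30, aligned by label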

When in doubt, visualize

Let's visualize what we just did by plotting the dataframes.

You should have already installed matplotlib. If not, do this:

conda install matplotlib
In [ ]:
# Find the names of the columns
sgd_usd_cny.columns
In [ ]:
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, figsize=(20,15),
                              sharex=True) # common x-axis for all subplots

ax1.set_title('Original series')
ax1.plot(sgd_usd_df, label='SGD/USD')
ax1.plot(sgd_cny_df, label='SGD/CNY')
ax1.legend(loc='upper center', shadow=True, fontsize='x-large')

ax2.set_title('After join() and dropna()')
ax2.plot(sgd_usd_cny['exchange_rate_usd'], label='SGD/USD')
ax2.plot(sgd_usd_cny['Singapore Dollar to Chinese Yuan'], label='SGD/CNY')
ax2.legend(loc='upper center', shadow=True, fontsize='x-large')

plt.show()
In [ ]:
sgd_cny_usd = sgd_cny_df.join(sgd_usd_df).dropna()
fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, figsize=(20,15),
                              sharex=True) # common x-axis for all subplots

ax1.set_title('Original series')
ax1.plot(sgd_usd_df, label='SGD/USD')
ax1.plot(sgd_cny_df, label='SGD/CNY')
ax1.legend(loc='upper center', shadow=True, fontsize='x-large')

ax2.set_title('After join() and dropna()')
ax2.plot(sgd_cny_usd['exchange_rate_usd'], label='SGD/USD')
ax2.plot(sgd_cny_usd['Singapore Dollar to Chinese Yuan'], label='SGD/CNY')
ax2.legend(loc='upper center', shadow=True, fontsize='x-large')

plt.show()

Pandas, like SQL

If you have worked with SQL or databases before, DataFrame.join() is conceptually the same as a SQL JOIN.

Here's a guide that compares pandas with SQL: https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html

This SQL:

SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00;

Becomes this in pandas:

tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)]

We'll do an example with queries to demonstrate how you can think of Pandas as conceptually equivalent to SQL.

Exercise: Querying a DataFrame

Let's say you want to query a DataFrame using something equivalent to this SQL syntax:

SELECT *
FROM sgd_usd_cny
WHERE date >= '2012-01-01' AND date < '2013-01-01';

What will you use in Pandas?

In [ ]:
# Use pandas to find all exchange rates from 2012, for the sgd_usd_cny DataFrame
#
# Refer to https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html
#
# Hint 1: use pd.to_datetime
# start = pd.to_datetime('2012-01-01')
# end = pd.to_datetime('2013-01-01')
#
# Hint 2: use sgd_usd_cny.index to compare against the date time
#
# Hint 3: add parentheses around the boolean conditions
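
If you get stuck, here is one possible solution sketch (try it yourself first):

In [ ]:
# one possible solution: boolean masks on the DatetimeIndex
start = pd.to_datetime('2012-01-01')
end = pd.to_datetime('2013-01-01')
sgd_usd_cny[(sgd_usd_cny.index >= start) & (sgd_usd_cny.index < end)]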

Indexing and Slicing: iloc, loc

iloc Select by position
loc Select by label
iat Get scalar value by position (fast iloc)
at Get scalar value by label (fast loc)

(Image: loc vs. iloc selection, from shanelynn.ie)

loc, at

In [ ]:
sgd_usd_cny.loc[:4] # error (because index is time, not integer) 
In [ ]:
# use a datetime value as the row selector
sgd_usd_cny.loc[pd.to_datetime('2009-10-08'), :]
In [ ]:
# use a column selector
sgd_usd_cny.loc[pd.to_datetime('2009-10-08'), 'exchange_rate_usd']
In [ ]:
# fast single-value access using .at
sgd_usd_cny.at[pd.to_datetime('2009-10-08'), 'exchange_rate_usd']

iloc, iat

In [ ]:
sgd_usd_cny.iloc[1, :] # select by row position
In [ ]:
sgd_usd_cny.iloc[1]['exchange_rate_usd'] # select by row position, column name
In [ ]:
sgd_usd_cny.iat[1, 0] # select by row position, column position
In [ ]:
sgd_usd_cny.iloc[:4, :] # select by row slice

Combining with boolean conditions

In [ ]:
# logical / boolean row index
sgd_usd_cny.loc[sgd_usd_cny.exchange_rate_usd < 1.21, :]
In [ ]:
# more selectors
sgd_usd_cny.loc[sgd_usd_cny.exchange_rate_usd > 1.425]['Singapore Dollar to Chinese Yuan']
In [ ]:
# iloc does not support boolean Series indexing, because it is position-based, NOT label-based

sgd_usd_cny.iloc[sgd_usd_cny.exchange_rate_usd < 1.21, :] # error

Adding and removing rows / columns

s.drop(), df.drop() Drop rows (axis=0)
df.drop('col', axis=1) Drop column (axis=1)
pd.concat([s1, s2], axis=0), pd.concat([df1, df2], axis=0) Concatenate rows (axis=0)
pd.concat([s1, s2], axis=1), pd.concat([df1, df2], axis=1) Concatenate columns (axis=1)
In [ ]:
pd.concat?
In [ ]:
pd.DataFrame.drop?
In [ ]:
# Append a few rows of empty data
index = pd.date_range(start='2015-10-20', end='2015-10-22')

new_df = pd.DataFrame(index=index,
                      columns=sgd_usd_cny.columns)
result = pd.concat([sgd_usd_cny, new_df], axis=0) # axis=0 for rows
result.tail(10)
In [ ]:
# Drop the rows
index = pd.date_range(start='2015-10-20', end='2015-10-22')

result.drop(index, inplace=True)
result.tail(10)

Exercises: add and remove columns

  1. Add a new column called 'SGD to Euro' with empty values
  2. Drop the column you just added
In [ ]:
# Add a new column called 'SGD to Euro', with empty values
#
# Hint 1: use index=sgd_usd_cny.index
# Hint 2: what value should axis be in pd.concat?
#
In [ ]:
# Drop the column you just added
# 
# Hint: use inplace=True so that the change is performed in place
# Hint: what value should axis be?
#
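
One possible solution sketch for both exercises (try them yourself first):

In [ ]:
# 1. Add a new, empty 'SGD to Euro' column by concatenating along axis=1
sgd_euro = pd.DataFrame(index=sgd_usd_cny.index, columns=['SGD to Euro'])
sgd_usd_cny = pd.concat([sgd_usd_cny, sgd_euro], axis=1)
sgd_usd_cny.head()
In [ ]:
# 2. Drop the column in place (axis=1 for columns)
sgd_usd_cny.drop('SGD to Euro', axis=1, inplace=True)
sgd_usd_cny.head()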

Missing data

We saw how to add columns and rows without any data. Now we'll explore how to deal with missing data.

Here are some ways:

dropna Drop missing values
fillna(new_value) Fill missing values with new_value
interpolate() Use linear interpolation

We've already seen how dropna() works.
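
A toy illustration of all three approaches (made-up values):

In [ ]:
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])
print(s.dropna().tolist())       # [1.0, 3.0] - the NaN row is removed
print(s.fillna(0).tolist())      # [1.0, 0.0, 3.0] - NaN replaced with 0
print(s.interpolate().tolist())  # [1.0, 2.0, 3.0] - NaN filled linearly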

In [ ]:
pd.DataFrame.dropna?
In [ ]:
pd.DataFrame.fillna?
In [ ]:
pd.DataFrame.interpolate?
In [ ]:
index = pd.date_range(start='2015-10-20', end='2015-10-22')

new_df = pd.DataFrame(index=index,
                      columns=sgd_usd_cny.columns)
result = pd.concat([sgd_usd_cny, new_df], axis=0)
result.tail(10)
In [ ]:
# Dropping the NaN values
result.dropna().tail(5)
In [ ]:
# Filling the NaN values with a meaningful value, such as the median
median = result.median()
result.fillna(median).tail(5)
In [ ]:
# Using linear interpolation
result.interpolate().tail(5)

Categorical Data

The previous dataset showed how to use pandas for datasets with continuous variables (the exchange rate).

Let's see another dataset that demonstrates how to use pandas for categorical variables (such as classes of things).

We'll use the Annual Motor Vehicle Population by Vehicle Type dataset from data.gov.sg.

Reference: https://pandas.pydata.org/pandas-docs/stable/visualization.html

Download Instructions

  1. Go to https://data.gov.sg/dataset/annual-motor-vehicle-population-by-vehicle-type
  2. Click on the Download button
  3. Unzip and extract the .csv file. Note the path for use below.
In [ ]:
df = pd.read_csv('/tmp/motor-vehicles/annual-motor-vehicle-population-by-vehicle-type.csv',
                 parse_dates=['year'], index_col=0, infer_datetime_format=True)
df

Grouping and aggregating categories

Since this dataset has multiple entries for the same date, we can group rows together using DataFrame.groupby().

For example, to find the total number of vehicles per year:

In [ ]:
df.groupby(df.index)['number'].sum()

We can also plot the total number of vehicles per year:

In [ ]:
ax = df.groupby(df.index)['number'].sum().plot(marker='o')
ax.set(ylabel='Vehicle Population',
       xlabel='Year',
       title='Annual Vehicle Population')

Processing per category

In [ ]:
# get the unique classes
df.category.unique()
In [ ]:
# get just the DataFrame for a category
buses = df[df.category =='Buses']
buses.head(5)
In [ ]:
# count of each category
for name in df.category.unique():
    print(name, df.loc[df.category == name, "category"].count())
In [ ]:
# pick a year (2009) and plot the distributions across types
df_2009 = df[df.index == pd.to_datetime('2009')]
df_2009
In [ ]:
# Create a series with the type as the index, and the numbers as values
s_2009_type = df_2009.number
s_2009_type.index = df_2009.type
s_2009_type
In [ ]:
# plot the bar chart
s_2009_type.plot.bar()
In [ ]:
# Let's ignore private cars because they outweigh everything else
s_2009_type[s_2009_type.index != "Private cars"].plot.bar()

Pivot Table

Pandas supports creating a pivot table, similar to what is available from Excel.

This is useful when we would like to group categories by year, and then display them in stacked bar plots, with year on the x-axis.

See the documentation for pandas.pivot_table() and for bar plots (https://pandas.pydata.org/pandas-docs/stable/visualization.html).
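
A minimal sketch of the mechanics, on toy data (values made up for illustration):

In [ ]:
toy = pd.DataFrame({'year': [2005, 2005, 2005, 2006, 2006],
                    'category': ['A', 'A', 'B', 'A', 'B'],
                    'number': [1, 2, 3, 4, 5]})
# one row per year, one column per category;
# the two 2005/'A' entries are summed into a single cell (3)
pd.pivot_table(toy, index='year', columns='category',
               values='number', aggfunc='sum')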

Why pivot table?

The original dataset has each category as a separate entry in the category column.

year category type number
2005-01-01 Cars & Station-wagons Private cars 401638
2005-01-01 Cars & Station-wagons Company cars 14926

For the bar plot, we need each category as a separate column. The entry under each category will be the number of vehicles for that year.

Something like:

year Buses Cars & Station-wagons Goods & Other Vehicles ...
2005-01-01 2640 438194 128193 ...
2006-01-01 ... ... ... ...
In [ ]:
# Original dataset, filtered by year 2005 as an illustration
df[df.index == pd.to_datetime('2005')]

Create a pivot table that is:

  • indexed by year
  • using category as the columns
  • numbers as values

Each category has multiple entries (one for each type), so we also need to specify:

  • an aggregate function to sum up the numbers for that category
In [ ]:
import numpy as np
pv_year_category = pd.pivot_table(df, index="year", columns="category",
                                  values="number", aggfunc=np.sum)

pv_year_category

Hmm, there are some NaN values.

On closer examination, the culprit is 2017, where "and" replaced "&" in the category name.

We will need to fix those entries so that the name matches.

Question

Can you explain why we can't just use dropna(axis=1) to drop the columns with NaN?

In [ ]:
# Show the problematic rows (year 2017)
df_2017 = df.loc[pd.to_datetime('2017'), :]
df_2017

Exercise

Fix the category names for year = 2017 by replacing "and" with "&".

Avoid looping on individual rows.

For a given category, you can:

  1. Use the syntax df.loc[row_indexer, col_indexer] to get a view for each incorrect category
  2. Assign the correct value to that view

Note on why we use DataFrame.loc:

Assigning through df.loc modifies the original DataFrame directly; chained indexing may assign into a temporary copy instead, which triggers this SettingWithCopyWarning:

Traceback (most recent call last)
     ...
SettingWithCopyWarning:
     A value is trying to be set on a copy of a slice from a DataFrame.
     Try using .loc[row_index,col_indexer] = value instead
In [ ]:
# Your code needs to fix these categories:
#
#     Incorrect                   Correct
#     ---------------------------------------------------
#     Cars and Station-wagons     Cars & Station-wagons
#     Goods and Other Vehicles    Goods & Other Vehicles
#     Motorcycles & Scooters      Motorcycles
#     ----------------------------------------------------
#

# Your code here
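
# One possible solution sketch (avoid peeking until you have tried!):
# loop over the small mapping of names, not over individual rows
fixes = {'Cars and Station-wagons': 'Cars & Station-wagons',
         'Goods and Other Vehicles': 'Goods & Other Vehicles',
         'Motorcycles & Scooters': 'Motorcycles'}
for wrong, right in fixes.items():
    df.loc[df.category == wrong, 'category'] = right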













# Let's generate the pivot_table again, the NaNs should now go away
pv_year_category = pd.pivot_table(df, index="year", columns="category",
                                  values="number", aggfunc=np.sum)

pv_year_category

Let's plot the bar chart.

We are doing a bit more matplotlib stuff here to make the bar chart readable.

In [ ]:
fig, ax = plt.subplots(figsize=(10,8))

pv_year_category.plot(kind='bar', stacked=True, ax=ax)
ax.set(title='Annual Motor Vehicle Population by Category',
       ylabel='Vehicle Population',
       xlabel='Year')
ax.legend(loc='upper left')

# Due to: https://github.com/pandas-dev/pandas/issues/1918
# Can't just do: ax.xaxis.set_major_formatter(mdates.FormatStrFormatter('%Y'))
ax.xaxis.set_major_formatter(plt.FixedFormatter(pv_year_category.index.to_series().dt.strftime("%Y")))

fig.tight_layout()
plt.show()