You'll learn the basics of manipulating time series data. Time series data are data that are indexed by a sequence of dates or times. You'll learn how to use methods built into pandas to work with this index. You'll also learn how to resample time series to change their frequency. This course will also show you how to calculate rolling and cumulative values for time series. Finally, you'll use all your new skills to build a value-weighted stock index from actual stock data.

Download Datasets and Presentation slides for this post HERE

This post is the first part of the Time Series with Python track from DataCamp. The track includes:

  1. Manipulating Time Series Data in Python
  2. Time Series Analysis in Python
  3. Visualizing Time Series Data in Python
  4. ARIMA Models in Python
  5. Machine Learning for Time Series Data in Python
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt

import matplotlib as mpl 
mpl.rcParams['figure.dpi'] = 150

pd.set_option('display.expand_frame_repr', False)


plt.style.use('ggplot')
mpl.rcParams['figure.figsize'] = (25, 20)
mpl.rcParams['axes.grid'] = True

warnings.filterwarnings("ignore")

Working with Time Series in Pandas

This chapter lays the foundations to leverage the powerful time series functionality made available by how Pandas represents dates, in particular by the DateTimeIndex. You will learn how to create and manipulate date information and time series, and how to do calculations with time-aware DataFrames to shift your data in time or create period-specific returns.

How to use dates & times with pandas

Your first time series

You have learned in the video how to create a sequence of dates using pd.date_range(). You have also seen that each date in the resulting pd.DatetimeIndex is a pd.Timestamp with various attributes that you can access to obtain information about the date.

Now, you'll create a week of data, iterate over the result, and obtain the dayofweek and day_name() for each date.

Instructions:

  • Use pd.date_range to create seven dates starting from '2017-1-1' at (default) daily frequency. Use the arguments start and periods. Assign the result to seven_days.

  • Iterate over each date in seven_days and in each iteration, print the .dayofweek and .day_name() attributes.

seven_days = pd.date_range(start='2017-1-1', periods=7)

# Iterate over the dates and print the number and name of the weekday
for day in seven_days:
    print(day.dayofweek, day.day_name())
6 Sunday
0 Monday
1 Tuesday
2 Wednesday
3 Thursday
4 Friday
5 Saturday

Indexing & resampling time series

Create a time series of air quality data

You have seen in the video how to deal with dates that are not in the correct format, but instead are provided as string types, represented as dtype object in pandas.

We have prepared a data set with air quality data (ozone, pm25, and carbon monoxide for NYC, 2000-2017) for you to practice the use of pd.to_datetime().

Instructions:

  • Inspect data using .info().
  • Use pd.to_datetime to convert the column 'date' to dtype datetime64.
  • Set the 'date' column as index.
  • Validate the changes by inspecting data using .info() again.
  • Plot data using subplots=True.
data = pd.read_csv('./datasets/nyc.csv')

# Inspect data
display(data.info())

# Convert the date column to datetime64
data.date = pd.to_datetime(data.date)

# Set date column as index
data.set_index("date",inplace = True)

# Inspect data 
display(data.info())

# Plot data
data.plot(subplots = True)
# plt.tight_layout()
plt.show()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6317 entries, 0 to 6316
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   date    6317 non-null   object 
 1   ozone   6317 non-null   float64
 2   pm25    6317 non-null   float64
 3   co      6317 non-null   float64
dtypes: float64(3), object(1)
memory usage: 197.5+ KB
None
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6317 entries, 1999-07-01 to 2017-03-31
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ozone   6317 non-null   float64
 1   pm25    6317 non-null   float64
 2   co      6317 non-null   float64
dtypes: float64(3)
memory usage: 197.4 KB
None

Compare annual stock price trends

You have seen how to select sub-periods from a time series. You'll use this to compare the performance for three years of Yahoo stock prices.

We have already imported pandas as pd and matplotlib.pyplot as plt and we have already loaded the 'yahoo.csv' file in a variable yahoo with DateTimeIndex and a single column price.

Instructions:

  • Create an empty pd.DataFrame() called prices.
  • Iterate over a list containing the three years, 2013, 2014, and 2015, as string, and in each loop:
    • Use the iteration variable to select the data for this year and the column price.
    • Use .reset_index() with drop=True to remove the DatetimeIndex.
    • Rename the price column to the appropriate year.
    • Use pd.concat() to combine the yearly data with the data in prices along axis=1.
  • Plot prices.
yahoo = pd.read_csv("./datasets/yahoo.csv")

# Convert the date column to datetime64
yahoo['date']= pd.to_datetime(yahoo['date'])
# Set date column as index
yahoo.set_index('date',inplace = True)

display(yahoo.head())
price
date
2013-01-02 20.08
2013-01-03 19.78
2013-01-04 19.86
2013-01-07 19.40
2013-01-08 19.66
prices = pd.DataFrame()

# Select data for each year and concatenate with prices here 
for year in [ '2013', '2014', '2015']:
    price_per_year = yahoo.loc[year, ['price']].reset_index(drop=True)
    price_per_year.rename(columns={'price': year}, inplace=True)
    prices = pd.concat([prices, price_per_year], axis=1)

print(prices.head())
# Plot prices
prices.plot(subplots = True)
plt.show()
    2013   2014   2015
0  20.08    NaN    NaN
1  19.78  39.59  50.17
2  19.86  40.12  49.13
3  19.40  39.93  49.21
4  19.66  40.92  48.59

The plot you just created shows Yahoo's stock price in three different years.

Set and change time series frequency

In the video, you have seen how to assign a frequency to a DateTimeIndex, and then change this frequency. Now, you'll use data on the daily carbon monoxide concentration in NYC, LA and Chicago from 2005-17.

You'll set the frequency to calendar daily and then resample to monthly frequency, and visualize both series to see how the different frequencies affect the data.

Instructions:

  • Inspect co using .info().
  • Use .asfreq() to set the frequency to calendar daily.
  • Show a plot of 'co' using subplots=True.
  • Change the frequency to monthly using the alias 'M'.
  • Show another plot of co using subplots=True.
co = pd.read_csv('./datasets/co_cities.csv', parse_dates=['date'], index_col='date')
# Inspect data
print(co.info())
co.head()

# Set the frequency to calendar daily
co = co.asfreq('D')
print(co.info())


# Plot the data
co.plot(subplots=True)
plt.show()


# Set frequency to monthly
co = co.asfreq('M')
# Plot the data
co.plot(subplots=True)
plt.show()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1898 entries, 2005-01-01 to 2010-12-31
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Chicago      1898 non-null   float64
 1   Los Angeles  1898 non-null   float64
 2   New York     1898 non-null   float64
dtypes: float64(3)
memory usage: 59.3 KB
None
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2191 entries, 2005-01-01 to 2010-12-31
Freq: D
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Chicago      1898 non-null   float64
 1   Los Angeles  1898 non-null   float64
 2   New York     1898 non-null   float64
dtypes: float64(3)
memory usage: 68.5 KB
None

How does changing the frequency of the data affect the plot output?

  • The DatetimeIndex grows from 1898 to 2191 entries: .asfreq('D') inserts a row (filled with NaN) for every missing calendar day between 2005-01-01 and 2010-12-31, and those gaps show up in the plot.

Lags, changes, and returns for stock price series

Shifting stock prices across time

The first method to manipulate time series that you saw in the video was .shift(), which allows you to shift all values in a Series or DataFrame by a number of periods to a different time along the DateTimeIndex.

Let's use this to visually compare a stock price series for Google shifted 90 business days into both past and future.

Instructions:

  • Use pd.read_csv() to import 'google.csv', parsing the 'Date' as dates, setting the result as index and assigning to google.
  • Use .asfreq() to set the frequency of google to business daily.
  • Add new columns lagged and shifted to google that contain the Close shifted by 90 business days into past and future, respectively.
  • Plot the three columns of google.
google = pd.read_csv('./datasets/google.csv', parse_dates=['Date'], index_col='Date')
print(google.info())

# Set data frequency to business daily
google = google.asfreq('B')

# Create 'lagged' and 'shifted'
google['lagged'] = google.Close.shift(-90)
google['shifted'] = google.Close.shift(90)

# Plot the google price series

google.plot(title='Google Stock Price')
plt.tight_layout()
plt.show()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1094 entries, 2014-01-02 to 2016-12-30
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Close   756 non-null    float64
dtypes: float64(1)
memory usage: 17.1 KB
None

Calculating stock price changes

You have learned in the video how to calculate returns using current and shifted prices as input. Now you'll practice a similar calculation to calculate absolute changes from current and shifted prices, and compare the result to the function .diff().

We have already imported pandas as pd and matplotlib.pyplot as plt. We have also loaded Yahoo stock prices for the years 2013 to 2015, set the frequency to business daily, and assigned the result to yahoo.

Instructions:

  • Create a new column called shifted_30 that contains the 'price' shifted by 30 business days into the future.
  • Subtract 'shifted_30' from 'price', and assign the result to a new column, 'change_30'.
  • Apply .diff(), setting periods to 30, and assign the result to a new column, 'diff_30'.
  • Inspect the last five rows of yahoo to verify the calculation.
  • Subtract diff_30 from change_30 using the .sub() method and print the .value_counts() of the result to show both columns are equal.
yahoo['shifted_30'] = yahoo.price.shift(30)

# Subtract shifted_30 from price
yahoo['change_30'] = yahoo['price'] - yahoo['shifted_30']

# Get the 30-day price difference
yahoo['diff_30'] = yahoo.price.diff(periods=30)

# Inspect the last five rows of price
print(yahoo.tail())

# Show the value_counts of the difference between change_30 and diff_30
print(yahoo.change_30.sub(yahoo.diff_30).value_counts())
            price  shifted_30  change_30  diff_30
date                                             
2015-12-25    NaN       32.19        NaN      NaN
2015-12-28  33.60       32.94       0.66     0.66
2015-12-29  34.04       32.86       1.18     1.18
2015-12-30  33.37       32.98       0.39     0.39
2015-12-31  33.26       32.62       0.64     0.64
0.0    703
dtype: int64

There's usually more than one way to get to the same result when working with data.

Plotting multi-period returns

The last time series method you have learned about in the video was .pct_change(). Let's use this function to calculate returns for various calendar day periods, and plot the result to compare the different patterns.

We'll be using Google stock prices from 2014-2016. We have already imported pandas as pd, and matplotlib.pyplot as plt. We have also loaded 'GOOG' stock prices for the years 2014-2016, set the frequency to calendar daily, and assigned the result to google.

Instructions

  • Create the columns 'daily_return', 'monthly_return', and 'annual_return' that contain the pct_change() of 'Close' for 1, 30 and 360 calendar days, respectively, and multiply each by 100.
  • Plot the result using subplots=True.
google = pd.read_csv('./datasets/google.csv', parse_dates=['Date'], index_col='Date')
# Set data frequency to calendar daily
google = google.asfreq('D')

print(google.info())
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1094 entries, 2014-01-02 to 2016-12-30
Freq: D
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Close   756 non-null    float64
dtypes: float64(1)
memory usage: 17.1 KB
None
google['daily_return'] = google.Close.pct_change(periods=1).mul(100)

# Create monthly_return
google['monthly_return'] = google.Close.pct_change(periods=30).mul(100)

# Create annual_return
google['annual_return'] = google.Close.pct_change(periods=360).mul(100)

# Plot the result
google.plot(subplots = True)
plt.show()

How do the returns for different periods compare?

Basic Time Series Metrics & Resampling

This chapter dives deeper into the essential time series functionality made available through the pandas DateTimeIndex. It introduces resampling and how to compare different time series by normalizing their start points.

Compare time series growth rates

Comparing stock performance:

  • Stock price series are hard to compare when they trade at different price levels.
  • A simple solution is to normalize each price series to start at 100.
  • Divide all prices by the first price in the series and multiply by 100 (see the short sketch below):
    • Same starting point
    • All prices relative to starting point
    • Difference to starting point in percentage points
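
A minimal, self-contained sketch of this normalization with made-up prices (toy_prices is an illustrative name, not one of the course datasets):

toy_prices = pd.Series([50.0, 55.0, 45.0, 60.0])
normalized = toy_prices.div(toy_prices.iloc[0]).mul(100)
print(normalized)  # 100.0, 110.0, 90.0, 120.0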

Normalizing a single series

google = pd.read_csv('./datasets/apple_google.csv', usecols = ['Date','GOOG'], parse_dates=['Date'], index_col='Date')
google.rename({'GOOG':'price'},axis = 1, inplace = True)
display(google.head(3))
price
Date
2010-01-04 313.06
2010-01-05 311.68
2010-01-06 303.83
first_price = google.price.iloc[0] # int-based selection
first_price
313.06
first_price == google.loc['2010-01-04', 'price']
True
normalized = google.price.div(first_price).mul(100)
plt.rcParams["figure.figsize"] = (7,5)
normalized.plot(title='Google Normalized Series')
<AxesSubplot:title={'center':'Google Normalized Series'}, xlabel='Date'>
index = pd.read_csv('./datasets/sp500.csv',parse_dates=['date'],index_col='date')
prices.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 261 entries, 0 to 260
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   2013    252 non-null    float64
 1   2014    252 non-null    float64
 2   2015    252 non-null    float64
dtypes: float64(3)
memory usage: 8.2 KB

Compare the performance of several asset classes

You have seen in the video how you can easily compare several time series by normalizing their starting points to 100, and plot the result.

To broaden your perspective on financial markets, let's compare four key assets: stocks, bonds, gold, and oil. We have already imported pandas as pd and matplotlib.pyplot as plt.

Instructions:

  • Import 'asset_classes.csv', using .read_csv() to parse dates in the 'DATE' column and set this column as the index, then assign the result to prices.
  • Select the first price for each series using .iloc[0] on prices and assign the result to first_prices.
  • Divide prices by first_prices, multiply by 100 and assign the result to normalized.
  • Plot normalized.
plt.style.use('seaborn')
mpl.rcParams['figure.figsize'] = (20, 20)
prices = pd.read_csv('./datasets/asset_classes.csv', parse_dates = ['DATE'], index_col = 'DATE')

# Inspect prices here
print(prices.info())

# Select first prices
first_prices = prices.iloc[0]

# Create normalized
normalized = prices.div(first_prices).mul(100)

# Plot normalized
normalized.plot(subplots = True)
plt.show()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2469 entries, 2007-06-29 to 2017-06-26
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   SP500   2469 non-null   float64
 1   Bonds   2469 non-null   float64
 2   Gold    2469 non-null   float64
 3   Oil     2469 non-null   float64
dtypes: float64(4)
memory usage: 96.4 KB
None

Normalizing series is a common step in time series analysis.

Comparing stock prices with a benchmark

You also learned in the video how to compare the performance of various stocks against a benchmark. Now you'll learn more about the stock market by comparing the three largest stocks on the NYSE to the Dow Jones Industrial Average, which contains the 30 largest US companies.

The three largest companies on the NYSE are:


Company Stock Ticker
Johnson & Johnson JNJ
Exxon Mobil XOM
JP Morgan Chase JPM

Instructions:

  • Use pd.read_csv() to import 'nyse.csv' and 'dow_jones.csv', creating a DatetimeIndex for each from the 'date' column using parse_dates and index_col, and assign the result to stocks and dow_jones, respectively.
  • Use pd.concat() along axis=1 to combine stocks and dow_jones and assign the result to data. Inspect the .info() of data.
  • Divide data by the first value for each series, multiply by 100 and plot the result.
stocks      = pd.read_csv('./datasets/nyse.csv', parse_dates = ['date'], index_col = 'date')
dow_jones   = pd.read_csv('./datasets/dow_jones.csv', parse_dates = ['date'], index_col = 'date')

# Concatenate data and inspect result here
data = pd.concat([stocks, dow_jones], axis = 1).dropna()
print(data.info())

# Normalize and plot your data here
data.div(data.iloc[0]).mul(100).plot(subplots = True)
plt.show()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1762 entries, 2010-01-04 to 2016-12-30
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   JNJ     1762 non-null   float64
 1   JPM     1762 non-null   float64
 2   XOM     1762 non-null   float64
 3   DJIA    1762 non-null   float64
dtypes: float64(4)
memory usage: 68.8 KB
None

Plot performance difference vs benchmark index

In the video, you learned how to calculate and plot the performance difference of a stock in percentage points relative to a benchmark index.

Let's compare the performance of Microsoft (MSFT) and Apple (AAPL) to the S&P 500 over the last 10 years.

Instructions:

  • Create the list tickers containing the two stock symbols.
  • Use pd.read_csv() to import 'msft_aapl.csv' and 'sp500.csv', creating a DatetimeIndex for each from the 'date' column using parse_dates and index_col, and assign the result to stocks and sp500, respectively.
  • Use pd.concat() to concatenate stocks and sp500 along axis=1, apply .dropna() to drop all missing values, and assign the result to data.
  • Normalize data by dividing by the first price, multiply by 100 and assign the output to normalized.
  • Select tickers from normalized, and subtract normalized['SP500'] with keyword axis=0 to align the indexes, then plot the result.
tickers = ['MSFT','AAPL']

# Import stock data here
stocks = pd.read_csv('./datasets/msft_aapl.csv', parse_dates = ['date'], index_col = 'date')

# Import index here
sp500 = pd.read_csv('./datasets/sp500.csv', parse_dates = ['date'], index_col = 'date')

# Concatenate stocks and index here
data = pd.concat([stocks, sp500], axis = 1).dropna()

# Normalize data
normalized = data.div(data.iloc[0]).mul(100)

# Subtract the normalized index from the normalized stock prices, and plot the result
normalized[tickers].sub(normalized['SP500'], axis = 0).plot()
plt.show()

Now you can compare these stocks to the overall market so you can more easily spot trends and outliers.

Changing the time series frequency: resampling

  • DateTimeIndex : set & change freq using .asfreq()
  • But frequency conversion affects the data
    • Upsampling: fill or interpolate missing data
    • Downsampling: aggregate existing data
  • pandas API:
    • .asfreq() , .reindex()
    • .resample() + transformation method

Getting started: quarterly data

dates = pd.date_range(start='2016', periods=4, freq='Q')
data = range(1, 5)
quarterly = pd.Series(data=data, index=dates)
quarterly
2016-03-31    1
2016-06-30    2
2016-09-30    3
2016-12-31    4
Freq: Q-DEC, dtype: int64

Upsampling: quarter => month

Upsampling creates missing values

monthly = quarterly.asfreq('M') # to month-end frequency
monthly
2016-03-31    1.0
2016-04-30    NaN
2016-05-31    NaN
2016-06-30    2.0
2016-07-31    NaN
2016-08-31    NaN
2016-09-30    3.0
2016-10-31    NaN
2016-11-30    NaN
2016-12-31    4.0
Freq: M, dtype: float64

Upsampling: fill methods

  • bfill : backfill
  • ffill : forward fill
  • fill_value=0
monthly = monthly.to_frame('baseline') # to DataFrame
monthly['ffill'] = quarterly.asfreq('M', method='ffill')
monthly['bfill'] = quarterly.asfreq('M', method='bfill')
monthly['value'] = quarterly.asfreq('M', fill_value=0)

monthly
            baseline  ffill  bfill  value
2016-03-31       1.0      1      1      1
2016-04-30       NaN      1      2      0
2016-05-31       NaN      1      2      0
2016-06-30       2.0      2      2      2
2016-07-31       NaN      2      3      0
2016-08-31       NaN      2      3      0
2016-09-30       3.0      3      3      3
2016-10-31       NaN      3      4      0
2016-11-30       NaN      3      4      0
2016-12-31       4.0      4      4      4

Add missing months: .reindex()

.reindex() :

  • conform DataFrame to new index
  • same filling logic as .asfreq()
dates = pd.date_range(start='2016', periods=12, freq='M')
dates
DatetimeIndex(['2016-01-31', '2016-02-29', '2016-03-31', '2016-04-30',
               '2016-05-31', '2016-06-30', '2016-07-31', '2016-08-31',
               '2016-09-30', '2016-10-31', '2016-11-30', '2016-12-31'],
              dtype='datetime64[ns]', freq='M')
quarterly.reindex(dates)
2016-01-31    NaN
2016-02-29    NaN
2016-03-31    1.0
2016-04-30    NaN
2016-05-31    NaN
2016-06-30    2.0
2016-07-31    NaN
2016-08-31    NaN
2016-09-30    3.0
2016-10-31    NaN
2016-11-30    NaN
2016-12-31    4.0
Freq: M, dtype: float64

Convert monthly to weekly data

You have learned in the video how to use .reindex() to conform an existing time series to a DateTimeIndex at a different frequency.

Let's practice this method by creating monthly data and then converting this data to weekly frequency while applying various fill logic options.

Instructions:

  • Create monthly_dates using pd.date_range with start, end and frequency alias 'M'.
  • Create and print the pd.Series monthly, passing the list [1, 2] as the data argument, and using monthly_dates as index.
  • Create weekly_dates using pd.date_range with start, end and frequency alias 'W'.
  • Apply .reindex() to monthly three times: first without additional options, then with bfill and then with ffill, print()-ing each result.
start = '2016-1-1'
end = '2016-2-29'

# Create monthly_dates here
monthly_dates = pd.date_range(start = start,end = end, freq='M')

# Create and print monthly here
monthly = pd.Series(data=[1,2], index=monthly_dates)
print(monthly)

# Create weekly_dates here
weekly_dates = pd.date_range(start = start,end = end, freq='W')

# Print monthly, reindexed using weekly_dates
print(monthly.reindex(weekly_dates))
print(monthly.reindex(weekly_dates,method='bfill'))
print(monthly.reindex(weekly_dates,method='ffill'))
2016-01-31    1
2016-02-29    2
Freq: M, dtype: int64
2016-01-03    NaN
2016-01-10    NaN
2016-01-17    NaN
2016-01-24    NaN
2016-01-31    1.0
2016-02-07    NaN
2016-02-14    NaN
2016-02-21    NaN
2016-02-28    NaN
Freq: W-SUN, dtype: float64
2016-01-03    1
2016-01-10    1
2016-01-17    1
2016-01-24    1
2016-01-31    1
2016-02-07    2
2016-02-14    2
2016-02-21    2
2016-02-28    2
Freq: W-SUN, dtype: int64
2016-01-03    NaN
2016-01-10    NaN
2016-01-17    NaN
2016-01-24    NaN
2016-01-31    1.0
2016-02-07    1.0
2016-02-14    1.0
2016-02-21    1.0
2016-02-28    1.0
Freq: W-SUN, dtype: float64

Create weekly from monthly unemployment data

The civilian US unemployment rate is reported monthly. You may need more frequent data, but that's no problem because you just learned how to upsample a time series.

You'll work with the time series data for the last 20 years, and apply a few options to fill in missing values before plotting the weekly series.

Instructions:

  • Use pd.read_csv() to import 'unemployment.csv', creating a DateTimeIndex from the 'date' column using parse_dates and index_col, and assign the result to data.
  • Convert data to weekly frequency using .asfreq() with the alias 'W' and show the first five rows.
  • Convert again to weekly frequency, adding the option 'bfill' and show the first five rows.
  • Create weekly series, now adding the option 'ffill', assign to weekly_ffill and show the first five rows.
  • Plot weekly_ffill starting in 2015.
# data = pd.read_csv('unemployment.csv', parse_dates=['date'], index_col='date')
data = pd.read_csv('./datasets/unrate_2000.csv', parse_dates=['date'] , index_col='date') 

# Show first five rows of weekly series
print(data.asfreq('W').head())

# Show first five rows of weekly series with bfill option
print(data.asfreq('W', method='bfill').head())

# Create weekly series with ffill option and show first five rows
weekly_ffill = data.asfreq('W', method='ffill')
print(weekly_ffill.head())

# Plot weekly_ffill starting 2015 here
weekly_ffill.loc['2015':].plot()
plt.show()
            UNRATE
date              
2000-01-02     NaN
2000-01-09     NaN
2000-01-16     NaN
2000-01-23     NaN
2000-01-30     NaN
            UNRATE
date              
2000-01-02     4.1
2000-01-09     4.1
2000-01-16     4.1
2000-01-23     4.1
2000-01-30     4.1
            UNRATE
date              
2000-01-02     4.0
2000-01-09     4.0
2000-01-16     4.0
2000-01-23     4.0
2000-01-30     4.0

Upsampling & interpolation with .resample()

Frequency conversion & transformation methods

  • .resample() : similar to .groupby()
  • Groups data within each resampling period and applies one or several methods to each group
  • New date determined by offset - start, end, etc.
  • Upsampling: fill from existing values or interpolate
  • Downsampling: apply an aggregation to existing data (see the toy sketch below)
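
A quick toy illustration of this groupby-like behavior, using a small quarterly series like the one built earlier (a sketch, not part of the original exercises):

dates = pd.date_range(start='2016', periods=4, freq='Q')
quarterly = pd.Series(range(1, 5), index=dates)

# Upsampling: quarterly -> monthly, forward-filling from existing values
print(quarterly.resample('M').ffill())

# Downsampling: quarterly -> annual, aggregating with the mean
print(quarterly.resample('A').mean())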
unrate = pd.read_csv('./datasets/unrate_2000.csv', parse_dates=['date'], index_col='date')
unrate.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 209 entries, 2000-01-01 to 2017-05-01
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   UNRATE  209 non-null    float64
dtypes: float64(1)
memory usage: 3.3 KB
display(unrate.head())
UNRATE
date
2000-01-01 4.0
2000-02-01 4.1
2000-03-01 4.0
2000-04-01 3.8
2000-05-01 4.0

Quarterly real GDP growth

gdp = pd.read_csv('./datasets/gdp_growth.csv', parse_dates=['date'], index_col='date')
gdp.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 41 entries, 2007-01-01 to 2017-01-01
Data columns (total 1 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   gdp_growth  41 non-null     float64
dtypes: float64(1)
memory usage: 656.0 bytes
gdp.head(5)
gdp_growth
date
2007-01-01 0.2
2007-04-01 3.1
2007-07-01 2.7
2007-10-01 1.4
2008-01-01 -2.7

Interpolate monthly real GDP growth

gdp_1 = gdp.resample('MS').ffill().add_suffix('_ffill')
gdp_1.head(10)
gdp_growth_ffill
date
2007-01-01 0.2
2007-02-01 0.2
2007-03-01 0.2
2007-04-01 3.1
2007-05-01 3.1
2007-06-01 3.1
2007-07-01 2.7
2007-08-01 2.7
2007-09-01 2.7
2007-10-01 1.4

.interpolate(): fill missing values with points on a straight line between existing data points

gdp_2 = gdp.resample('MS').interpolate().add_suffix('_inter')
gdp_2.head(10)
gdp_growth_inter
date
2007-01-01 0.200000
2007-02-01 1.166667
2007-03-01 2.133333
2007-04-01 3.100000
2007-05-01 2.966667
2007-06-01 2.833333
2007-07-01 2.700000
2007-08-01 2.266667
2007-09-01 1.833333
2007-10-01 1.400000

Plot interpolated real GDP growth

  • interpolate vs ffill
plt.style.use('ggplot')
pd.concat([gdp_1, gdp_2], axis=1).loc['2015':].plot()
<AxesSubplot:xlabel='date'>

Combine GDP growth & unemployment

gdp = pd.read_csv('./datasets/gdp_growth.csv', parse_dates=['date'], index_col='date')
gdp_inter = gdp.resample('MS').interpolate().add_suffix('_inter')
unrate = pd.read_csv('./datasets/unrate_2000.csv', parse_dates = ['date'], index_col='date')
display(unrate.head())
UNRATE
date
2000-01-01 4.0
2000-02-01 4.1
2000-03-01 4.0
2000-04-01 3.8
2000-05-01 4.0
pd.concat([unrate, gdp_inter], axis=1).plot()
<AxesSubplot:xlabel='date'>

Interpolating is a useful way to create smoother time series when resampling.

Use interpolation to create weekly employment data

You have recently used the civilian US unemployment rate, and converted it from monthly to weekly frequency using simple forward or backfill methods.

Compare your previous approach to the new .interpolate() method that you learned about in this video. We have imported pandas as pd and matplotlib.pyplot as plt for you. We have also loaded the monthly unemployment rate from 2010 to 2016 into a variable monthly.

Instructions:

  • Inspect monthly using .info().
  • Create a pd.date_range() with weekly dates, using the .min() and .max() of the index of monthly as start and end, respectively, and assign the result to weekly_dates.
  • Apply .reindex() using weekly_dates to monthly and assign the output to weekly.
  • Create new columns 'ffill' and 'interpolated' by applying .ffill() and .interpolate() to weekly.UNRATE.
  • Show a plot of weekly.
monthly = pd.read_csv('./datasets/unrate.csv', parse_dates=['DATE'], index_col='DATE')
print(monthly.info())

# Create weekly dates
weekly_dates = pd.date_range(start = monthly.index.min() , end = monthly.index.max(), freq='W')

# Reindex monthly to weekly data
weekly = monthly.reindex(weekly_dates)

# Create ffill and interpolated columns
weekly['ffill'] = weekly.UNRATE.ffill()
weekly['interpolated'] = weekly.UNRATE.interpolate()
display(weekly.loc['2011':].head())
# Plot weekly
weekly.plot()
plt.show()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 85 entries, 2010-01-01 to 2017-01-01
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   UNRATE  85 non-null     float64
dtypes: float64(1)
memory usage: 1.3 KB
None
            UNRATE  ffill  interpolated
2011-01-02     NaN    9.5      9.217949
2011-01-09     NaN    9.5      9.205128
2011-01-16     NaN    9.5      9.192308
2011-01-23     NaN    9.5      9.179487
2011-01-30     NaN    9.5      9.166667

Interpolate debt/GDP and compare to unemployment

Since you have learned how to interpolate time series, you can now apply this new skill to the quarterly debt/GDP series, and compare the result to the monthly unemployment rate.

Instructions:

  • Use pd.read_csv() to import 'debt_unemployment.csv', creating a DateTimeIndex from the 'date' column using parse_dates and index_col, and assign the result to data. print() the .info() of the data.
  • Apply .interpolate() to data and assign this to interpolated, then inspect the result.
  • Plot interpolated with 'Unemployment' on the secondary_y axis.
data = pd.read_csv('./datasets/debt_unemployment.csv', parse_dates = ['date'], index_col = 'date')
print(data.info())

# Interpolate and inspect here
interpolated = data.interpolate()
print(interpolated.info())

# Plot interpolated data here
interpolated.plot(secondary_y = 'Unemployment')
plt.show()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 89 entries, 2010-01-01 to 2017-05-01
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Debt/GDP      29 non-null     float64
 1   Unemployment  89 non-null     float64
dtypes: float64(2)
memory usage: 2.1 KB
None
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 89 entries, 2010-01-01 to 2017-05-01
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Debt/GDP      89 non-null     float64
 1   Unemployment  89 non-null     float64
dtypes: float64(2)
memory usage: 2.1 KB
None

Downsampling & aggregation


  • So far: upsampling, fill logic & interpolation
  • Now: downsampling
    • hour to day
    • day to month, etc
  • How to represent the existing values at the new date?
    • Mean, median, last value?


Air quality: daily ozone levels

ozone = pd.read_csv('./datasets/ozone_nyc.csv', parse_dates=['date'], index_col='date')
display(ozone.head())
ozone.info()
Ozone
date
2000-01-01 0.004032
2000-01-02 0.009486
2000-01-03 0.005580
2000-01-04 0.008717
2000-01-05 0.013754
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6291 entries, 2000-01-01 to 2017-03-31
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Ozone   6167 non-null   float64
dtypes: float64(1)
memory usage: 98.3 KB
ozone = ozone.resample('D').asfreq()
display(ozone.head())
ozone.info()
Ozone
date
2000-01-01 0.004032
2000-01-02 0.009486
2000-01-03 0.005580
2000-01-04 0.008717
2000-01-05 0.013754
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6300 entries, 2000-01-01 to 2017-03-31
Freq: D
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Ozone   6167 non-null   float64
dtypes: float64(1)
memory usage: 98.4 KB

Creating monthly ozone data

Example:

  • .resample().mean() : Monthly average, assigned to end of calendar month
display(ozone.resample('M').mean().head())
Ozone
date
2000-01-31 0.010443
2000-02-29 0.011817
2000-03-31 0.016810
2000-04-30 0.019413
2000-05-31 0.026535
  • .resample().median() : Monthly median, assigned to end of calendar month
display(ozone.resample('M').median().head())
Ozone
date
2000-01-31 0.009486
2000-02-29 0.010726
2000-03-31 0.017004
2000-04-30 0.019866
2000-05-31 0.026018
  • .resample().agg(): List of aggregation functions like groupby
display(ozone.resample('M').agg(['mean', 'std']).head())
               Ozone
                mean       std
date
2000-01-31  0.010443  0.004755
2000-02-29  0.011817  0.004072
2000-03-31  0.016810  0.004977
2000-04-30  0.019413  0.006574
2000-05-31  0.026535  0.008409

Plotting resampled ozone data

ozone = ozone.loc['2016':]
ax = ozone.plot()
monthly = ozone.resample('M').mean()
monthly.add_suffix('_monthly').plot(ax=ax)
plt.show()

Resampling multiple time series

data = pd.read_csv('./datasets/nyc.csv', parse_dates=['date'], index_col='date')
data = data.resample('D').asfreq()
display(data.head())
data.info()
               ozone       pm25        co
date
1999-07-01  0.012024  20.000000  1.300686
1999-07-02  0.027699  23.900000  0.958194
1999-07-03  0.043969  36.700000  1.194444
1999-07-04  0.035161  39.000000  1.081548
1999-07-05  0.038359  28.171429  0.939583
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6484 entries, 1999-07-01 to 2017-03-31
Freq: D
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ozone   6317 non-null   float64
 1   pm25    6317 non-null   float64
 2   co      6317 non-null   float64
dtypes: float64(3)
memory usage: 202.6 KB
data = data.resample('BM').mean() # 'BM' Business Month
data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 213 entries, 1999-07-30 to 2017-03-31
Freq: BM
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ozone   213 non-null    float64
 1   pm25    213 non-null    float64
 2   co      213 non-null    float64
dtypes: float64(3)
memory usage: 6.7 KB

Compare weekly, monthly and annual ozone trends for NYC & LA

You have seen in the video how to downsample and aggregate time series on air quality.

First, you'll apply this new skill to ozone data for both NYC and LA since 2000 to compare the air quality trend at weekly, monthly and annual frequencies and explore how different resampling periods impact the visualization.

Instructions:

  • Use pd.read_csv() to import 'ozone.csv' and set a DateTimeIndex based on the 'date' column using parse_dates and index_col, assign the result to ozone and inspect using .info().
  • Apply .resample() with weekly frequency ('W') to ozone, aggregate using .mean() and plot the result.
  • Repeat with monthly ('M') and annual ('A') frequencies, plotting each result.
ozone = pd.read_csv('./datasets/ozone_nyla.csv', parse_dates=['date'], index_col='date')
print(ozone.info())

# Calculate and plot the weekly average ozone trend
ozone.resample('W').mean().plot()
plt.title('Ozone Weekly')
plt.show()

# Calculate and plot the monthly average ozone trend
ozone.resample('M').mean().plot()
plt.title('Ozone Monthly')
plt.show()

# Calculate and plot the annual average ozone trend
ozone.resample('A').mean().plot()
plt.title('Ozone Annually')
plt.show()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6291 entries, 2000-01-01 to 2017-03-31
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Los Angeles  5488 non-null   float64
 1   New York     6167 non-null   float64
dtypes: float64(2)
memory usage: 147.4 KB
None

Compare monthly average stock prices for Facebook and Google

Now, you'll apply your new resampling skills to daily stock price series for Facebook and Google for the 2015-2016 period to compare the trend of the monthly averages.

Instructions:

  • Use pd.read_csv() to import 'stocks.csv' and set a DateTimeIndex based on the 'date' column using parse_dates and index_col, assign the result to stocks and inspect using .info().
  • Create monthly_average by applying .resample() with monthly frequency to data, using .mean() to aggregate. Plot the result using subplots.
# stocks =  pd.read_csv('stocks.csv', parse_dates=['date'], index_col='date')
stocks =  pd.read_csv('./datasets/goog_fb.csv', parse_dates=['date'], index_col='date')
print(stocks.info())
 
# Calculate and plot the monthly averages
monthly_average = stocks.resample('M').mean()
monthly_average.plot(subplots=True)
plt.show()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 504 entries, 2015-01-02 to 2016-12-30
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   FB      504 non-null    float64
 1   GOOG    504 non-null    float64
dtypes: float64(2)
memory usage: 11.8 KB
None

Compare quarterly GDP growth rate and stock returns

With your new skill to downsample and aggregate time series, you can compare higher-frequency stock price series to lower-frequency economic time series.

As a first example, let's compare the quarterly GDP growth rate to the quarterly rate of return on the (resampled) Dow Jones Industrial index of 30 large US stocks.

GDP growth is reported at the beginning of each quarter for the previous quarter. To calculate matching stock returns, you'll resample the stock index to quarter start frequency using the alias 'QS', aggregating with .first() to take the first observation of each quarter.

Instructions:

  • Use pd.read_csv() to import 'gdp_growth.csv' and 'djia.csv', for both set a DateTimeIndex based on the 'date' column using parse_dates and index_col, and assign the results to gdp_growth and djia respectively, then inspect using .info().
  • Resample djia using frequency alias 'QS', aggregate using .first(), and assign to djia_quarterly.
  • Apply .pct_change() to djia_quarterly and .mul() by 100 to obtain djia_quarterly_return.
  • Use pd.concat() to concatenate gdp_growth and djia_quarterly_return along axis=1, and assign to data. Rename the columns using .columns and the new labels 'gdp' and 'djia', then .plot() the results.
gdp_growth = pd.read_csv('./datasets/gdp_growth.csv', parse_dates=['date'], index_col='date')
print(gdp_growth.info())
 
# Import and inspect djia here
djia = pd.read_csv('./datasets/djia.csv', parse_dates=['date'], index_col='date')
print(djia.info())
 
 
# Calculate djia quarterly returns here 
djia_quarterly = djia.resample('QS').first()
djia_quarterly_return = djia_quarterly.pct_change().mul(100)
 
# Concatenate, rename and plot djia_quarterly_return and gdp_growth here 
data = pd.concat([gdp_growth, djia_quarterly_return], axis=1)
data.columns = ['gdp', 'djia']
 
data.plot()
plt.show()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 41 entries, 2007-01-01 to 2017-01-01
Data columns (total 1 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   gdp_growth  41 non-null     float64
dtypes: float64(1)
memory usage: 656.0 bytes
None
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2610 entries, 2007-06-29 to 2017-06-29
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   djia    2519 non-null   float64
dtypes: float64(1)
memory usage: 40.8 KB
None

Visualize monthly mean, median and standard deviation of S&P500 returns

You have also learned how to calculate several aggregate statistics from downsampled data.

Let's use this to explore how the monthly mean, median and standard deviation of daily S&P500 returns have trended over the last 10 years.

Instructions:

  • Use pd.read_csv() to import 'sp500.csv', set a DateTimeIndex based on the 'date' column using parse_dates and index_col, assign the results to sp500, and inspect using .info().
  • Convert sp500 to a pd.Series() using .squeeze(), and apply .pct_change() to calculate daily_returns.
  • .resample() daily_returns to month-end frequency (alias: 'M'), and apply .agg() to calculate 'mean', 'median', and 'std'. Assign the result to stats.
  • .plot() stats.
sp500 = pd.read_csv('./datasets/sp500.csv', parse_dates=['date'], index_col='date')
 
# Calculate daily returns here
daily_returns = sp500.squeeze().pct_change()
 
# Resample and calculate statistics
stats = daily_returns.resample('M').agg(['mean', 'median', 'std'])
 
# Plot stats here
stats.plot()
plt.show()

Now you're a pro at resampling time series!

Window Functions: Rolling & Expanding Metrics

This chapter will show you how to use window functions to calculate time series metrics for both rolling and expanding windows.
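
A small, self-contained sketch contrasting the two window types on toy data (nothing here comes from the course datasets):

s = pd.Series([1, 2, 3, 4, 5],
              index=pd.date_range('2017-01-01', periods=5, freq='D'))

# Rolling: fixed-size window of 3 observations
print(s.rolling(window=3).mean())         # NaN, NaN, 2.0, 3.0, 4.0

# Expanding: the window grows to include all observations so far
print(s.expanding(min_periods=1).mean())  # 1.0, 1.5, 2.0, 2.5, 3.0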

Rolling window functions with pandas

Rolling average air quality since 2010 for New York City

The last video was about rolling window functions. To practice this new tool, you'll start with air quality trends for New York City since 2010. In particular, you'll be using the daily Ozone concentration levels provided by the Environmental Protection Agency to calculate & plot the 90 and 360 day rolling average.

Instructions:

  • Use pd.read_csv() to import 'ozone.csv', creating a DateTimeIndex from the 'date' column using parse_dates and index_col, and assign the result to data.
  • Add the columns '90D' and '360D' containing the 90- and 360-calendar-day rolling .mean() of the column 'Ozone'.
  • Plot data starting 2010, setting 'New York City' as title.
# data = pd.read_csv('ozone.csv', parse_dates = ['date'], index_col = 'date')
data = pd.read_csv('./datasets/ozone_nyc.csv', parse_dates = ['date'], index_col = 'date')
print(data.info())

# Calculate the 90-day and 360-day rolling means of the ozone series
data['90D'] = data['Ozone'].rolling(window='90D').mean() 
data['360D'] = data['Ozone'].rolling(window='360D').mean() 

# Plot data
data.loc['2010':].plot(title = 'New York City')
plt.show()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6291 entries, 2000-01-01 to 2017-03-31
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Ozone   6167 non-null   float64
dtypes: float64(1)
memory usage: 98.3 KB
None

Do the different rolling windows help you see any long term trends that are hard to spot in the original data?

Rolling 360-day median & std. deviation for nyc ozone data since 2000

The last video also showed you how to calculate several rolling statistics using the .agg() method, similar to .groupby().

Let's take a closer look at the air quality history of NYC using the Ozone data you have seen before. The daily data are very volatile, so using a longer term rolling average can help reveal a longer term trend.

You'll be using a 360 day rolling window, and .agg() to calculate the rolling mean and standard deviation for the daily average ozone values since 2000.

Instructions:

  • Use pd.read_csv() to import 'ozone.csv', creating a DateTimeIndex from the 'date' column using parse_dates and index_col, assign the result to data, and drop missing values using .dropna().
  • Select the 'Ozone' column and create a .rolling() window using 360 periods, apply .agg() to calculate the mean and std, and assign this to rolling_stats.
  • Use .join() to concatenate data with rolling_stats, and assign to stats.
  • Plot stats using subplots.
plt.rcParams["figure.figsize"] = (8,8)
data = pd.read_csv('./datasets/ozone_nyc.csv', parse_dates=['date'], index_col='date').dropna()

# Calculate the rolling mean and std here
rolling_stats = data['Ozone'].rolling(360).agg(['mean', 'std'])

# Join rolling_stats with ozone data
stats = data.join(rolling_stats)

# Plot data
stats.plot(subplots=True)
plt.show()

How does adding the standard deviation help you understand what's happening in the original series?

Rolling quantiles for daily air quality in nyc

You learned in the last video how to calculate rolling quantiles to describe changes in the dispersion of a time series over time in a way that is less sensitive to outliers than using the mean and standard deviation.

Let's calculate rolling quantiles - at 10%, 50% (median) and 90% - of the distribution of daily average ozone concentration in NYC using a 360-day rolling window.

Instructions:

  • Apply .resample() with daily frequency 'D' to data and apply .interpolate() to fill missing values, and reassign to data.
  • Inspect the result using .info().
  • Create a .rolling() window using 360 periods, select the column 'Ozone', and assign the result to rolling.
  • Insert three new columns, 'q10', 'q50' and 'q90' into data, calculating the respective quantiles from rolling.
  • Plot data.
data = data.resample('D').interpolate()
print(data.info())

# Create the rolling window
rolling = data['Ozone'].rolling(360)

# Insert the rolling quantiles to the monthly returns
data['q10'] = rolling.quantile(0.1)
data['q50'] = rolling.quantile(0.5)
data['q90'] = rolling.quantile(0.9)

# Plot the data
data.plot()
plt.show()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6300 entries, 2000-01-01 to 2017-03-31
Freq: D
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Ozone   6300 non-null   float64
dtypes: float64(1)
memory usage: 98.4 KB
None

The rolling quantiles help show the volatility of the series.

Expanding window functions with pandas

Cumulative sum vs .diff()

In the video, you have learned about expanding windows that allow you to run cumulative calculations.

The cumulative sum method has in fact the opposite effect of the .diff() method that you came across in chapter 1.

To illustrate this, let's use the Google stock price time series, create the differences between prices, and reconstruct the series using the cumulative sum.

Instructions:

  • Apply .diff() to data, drop missing values, and assign the result to differences.
  • Use .first('D') to select the first price from data, and assign it to start_price.
  • Use .append() to combine start_price and differences, apply .cumsum() and assign this to cumulative_sum.
  • Use .equals() to compare data and cumulative_sum, and print the result.
data = pd.read_csv('./datasets/google.csv', parse_dates=['Date'], index_col='Date').dropna()
differences = data.diff().dropna()

# Select start price
start_price = data.first('D')

# Calculate cumulative sum
cumulative_sum = start_price.append(differences).cumsum()

# Validate cumulative sum equals data
print(data.equals(cumulative_sum))
True

The .cumsum() method allows you to reconstruct the original data from the differences.
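
One portability note: Series/DataFrame .append() was removed in pandas 2.0, so on recent versions the same reconstruction can be written with pd.concat() (a minimal sketch using the names above):

# Equivalent reconstruction without .append()
cumulative_sum = pd.concat([start_price, differences]).cumsum()
print(data.equals(cumulative_sum))  # True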

Cumulative return on $1,000 invested in google vs apple I

To put your new ability to do cumulative return calculations to practical use, let's compare how much $1,000 would be worth if invested in Google ('GOOG') or Apple ('AAPL') in 2010.

Instructions:

  • Define a variable investment with the value 1000.
  • Calculate returns by applying .pct_change() to data.
  • Add 1 to returns and assign this to returns_plus_one, then apply .cumprod() to returns_plus_one and assign the result to cumulative_return.
  • Multiply cumulative_return by investment, and plot the result.
data = pd.read_csv('./datasets/data.csv', parse_dates=['Date'], index_col='Date').dropna()
investment = 1000

# Calculate the daily returns here
returns = data.pct_change()

# Calculate the cumulative returns here
returns_plus_one = returns + 1
cumulative_return = returns_plus_one.cumprod()

# Calculate and plot the investment return here
cumulative_return.mul(investment).plot()
plt.show()

Now let's take a look at the rolling annual returns on this investment.

Cumulative return on $1,000 invested in google vs apple II

Apple outperformed Google over the entire period, but this may have been different over various 1-year sub periods, so that switching between the two stocks might have yielded an even better result.

To analyze this, calculate that cumulative return for rolling 1-year periods, and then plot the returns to see when each stock was superior.

Instructions:
We have already imported pandas as pd and matplotlib.pyplot as plt. We have also loaded the GOOG and AAPL close prices from the last exercise into data.

  • Define a multi_period_return() function that returns the cumulative return from an array of period returns.
  • Calculate daily_returns by applying .pct_change() to data.
  • Create a '360D' .rolling() window on daily_returns, and .apply() multi_period_return. Assign the result to rolling_annual_returns.
  • Plot rolling_annual_returns after multiplying it by 100.
import numpy as np

# Define a multi_period_return function
def multi_period_return(period_returns):
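    # Compound the period returns: e.g. +10% then -5% gives 1.10 * 0.95 - 1 = 4.5%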
    return np.prod(period_returns + 1) - 1

# Calculate daily returns
daily_returns = data.pct_change()

# Calculate rolling_annual_returns
rolling_annual_returns = daily_returns.rolling('360D').apply(multi_period_return)

# Plot rolling_annual_returns
rolling_annual_returns.mul(100).plot()
plt.show()

Do you think it's better to invest in Google or Apple?

Case study: S&P500 price simulation

Random walk I

You have seen in the video how to generate a random walk of returns, and how to convert this random return series into a random stock price path.

In this exercise, you'll build your own random walk by drawing random numbers from the normal distribution with the help of numpy.

Instructions:

  • Set seed to 42.
  • Use normal to generate 2,500 random returns with the parameters loc=.001, scale=.01 and assign this to random_walk.
  • Convert random_walk to a pd.Series object and reassign it to random_walk.
  • Create random_prices by adding 1 to random_walk and calculating the cumulative product.
  • Multiply random_prices by 1,000 and plot the result for a price series starting at 1,000.
from numpy.random import seed, normal

# Set seed here
seed(42)

# Create random_walk
random_walk = normal(loc=0.001, scale=0.01, size=2500)

# Convert random_walk to pd.series
random_walk = pd.Series(random_walk)

# Create random_prices
random_prices = random_walk.add(1).cumprod()

# Plot random_prices
random_prices.mul(1000).plot()
plt.show()

Random walk II

You have also seen how to create a random walk of returns by sampling from actual returns, and how to use this random sample to create a random stock price path.

In this exercise, you'll build a random walk using historical returns from Facebook's stock price, from its IPO through the end of May 2017. Then you'll simulate an alternative random price path in the next exercise.

Instructions:
We have already imported pandas as pd, choice and seed from numpy.random, seaborn as sns, and matplotlib.pyplot as plt. We have also imported the FB stock price series since IPO in May 2012 as the variable fb. Inspect this using .head().

  • Set seed to 42.
  • Apply .pct_change() to generate daily Facebook returns, drop missing values, and assign to daily_returns.
  • Create a variable n_obs that contains the .count() of Facebook daily_returns.
  • Use choice() to randomly select n_obs samples from daily_returns, and assign to random_walk.
  • Convert random_walk to a pd.Series and reassign it to itself.
  • Use sns.distplot() to plot the distribution of random_walk.
from numpy.random import choice, seed
import seaborn as sns

fb = pd.read_csv('./datasets/fb.csv',   names=('date', 'price'),
                                        index_col='date',
                                        parse_dates=True)
display(fb.head())
display(fb.info())
price
date
2012-05-17 38.00
2012-05-18 38.23
2012-05-21 34.03
2012-05-22 31.00
2012-05-23 32.00
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1267 entries, 2012-05-17 to 2017-05-31
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   price   1267 non-null   float64
dtypes: float64(1)
memory usage: 19.8 KB
None
seed(42)

# Calculate daily_returns here
# daily_returns = fb.pct_change().dropna()
daily_returns = fb.price.pct_change().dropna()

# Get n_obs
n_obs = daily_returns.count()

# Create random_walk
random_walk = choice(daily_returns, size=n_obs)

# Convert random_walk to pd.Series
random_walk = pd.Series(random_walk)

# Plot random_walk distribution
sns.distplot(random_walk)
plt.show()

Get ready to finish up this random walk!

Random walk III

In this exercise, you'll complete your random walk simulation using Facebook stock returns over the last five years. You'll start off with a random sample of returns like the one you've generated during the last exercise and use it to create a random stock price path.

Instructions:
We have already imported pandas as pd, choice and seed from numpy.random, and matplotlib.pyplot as plt. We have loaded the Facebook price as a pd.DataFrame in the variable fb and a random sample of daily FB returns as pd.Series in the variable random_walk.

  • Select the first Facebook price by applying .first('D') to fb.price, and assign to start.
  • Add 1 to random_walk and reassign it to itself, then .append() random_walk to start and assign this to random_price.
  • Apply .cumprod() to random_price and reassign it to itself.
  • Insert random_price as new column labeled random into fb and plot the result.
# Load the pre-sampled random returns and align them with FB's business-day dates
random_walk = pd.read_csv('./datasets/random_walk.csv', header=None, index_col=0)
date_range = pd.date_range(start=fb.index[1], periods=len(fb) - 1, freq='B')
random_walk.index = date_range
random_walk = pd.Series(random_walk[1])

# Select the first Facebook price
start = fb['price'].first('D')

# Add 1 to random walk and append to start
random_walk = random_walk.add(1)
random_price = start.append(random_walk)

# Calculate cumulative product here
random_price = random_price.cumprod()

# Insert into fb and plot
fb['random'] = random_price
fb.plot()
plt.show()

How did you like generating random stock prices?

Relationships between time series: correlation

Annual return correlations among several stocks

You have seen in the video how to calculate correlations, and visualize the result.

In this exercise, we have provided you with the historical stock prices for Apple (AAPL), Amazon (AMZN), IBM (IBM), WalMart (WMT), and Exxon Mobil (XOM) for the last 4,000 trading days, from July 2001 until the end of May 2017.

You'll calculate the year-end returns, the pairwise correlations among all stocks, and visualize the result as an annotated heatmap.

Instructions:
We have already imported pandas as pd, seaborn as sns, and matplotlib.pyplot as plt. We have loaded the daily close price for the five stocks in a variable called data.

  • Inspect using .info().
  • Apply .resample() with year-end frequency (alias: 'A') to data and select the .last() price from each subperiod; assign this to annual_prices.
  • Calculate annual_returns by applying .pct_change() to annual_prices.
  • Calculate correlations by applying .corr() to annual_returns and print the result.
  • Visualize correlations as an annotated sns.heatmap().
print(plt.style.available)
['Solarize_Light2', '_classic_test_patch', '_mpl-gallery', '_mpl-gallery-nogrid', 'bmh', 'classic', 'dark_background', 'fast', 'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn', 'seaborn-bright', 'seaborn-colorblind', 'seaborn-dark', 'seaborn-dark-palette', 'seaborn-darkgrid', 'seaborn-deep', 'seaborn-muted', 'seaborn-notebook', 'seaborn-paper', 'seaborn-pastel', 'seaborn-poster', 'seaborn-talk', 'seaborn-ticks', 'seaborn-white', 'seaborn-whitegrid', 'tableau-colorblind10']
data = pd.read_csv('./datasets/5_stocks.csv', parse_dates=['Date'], index_col='Date')
plt.style.use('seaborn')
print(data.info())

# Calculate year-end prices here
annual_prices = data.resample('A').last()

# Calculate annual returns here
annual_returns = annual_prices.pct_change()

# Calculate and print the correlation matrix here
correlations = annual_returns.corr()
print(correlations)


# Visualize the correlations as heatmap here
sns.heatmap(correlations, annot=True)
plt.show()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4001 entries, 2001-07-05 to 2017-05-31
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   AAPL    4000 non-null   float64
 1   AMZN    4000 non-null   float64
 2   IBM     4000 non-null   float64
 3   WMT     4000 non-null   float64
 4   XOM     4000 non-null   float64
dtypes: float64(5)
memory usage: 187.5 KB
None
          AAPL      AMZN       IBM       WMT       XOM
AAPL  1.000000  0.208731  0.460568 -0.183553  0.336413
AMZN  0.208731  1.000000  0.346407 -0.367620 -0.133965
IBM   0.460568  0.346407  1.000000  0.155445  0.367253
WMT  -0.183553 -0.367620  0.155445  1.000000  0.178833
XOM   0.336413 -0.133965  0.367253  0.178833  1.000000

Heatmaps are a great way to visualize correlation matrices.
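
A quick compatibility note: newer pandas releases (2.2 and later) deprecate the year-end alias 'A' in favor of 'YE', so on a current installation the resampling step above would look like this:

# Year-end resampling with the newer alias (assumes pandas >= 2.2)
annual_prices = data.resample('YE').last()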

Putting it all together: Building a value-weighted index

This chapter combines the previous concepts by teaching you how to create a value-weighted index. This index uses market-cap data contained in the stock exchange listings to calculate weights, together with 2016 stock price information. Index performance is then compared against benchmarks to evaluate the performance of the index you created.
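
The sketch below previews, in condensed form, the steps you will implement in this chapter. The listings and stock_prices names refer to the DataFrames introduced in the exercises that follow, so treat it as a roadmap rather than runnable code at this point:

# Value-weighted index construction in four steps (preview of this chapter)
shares = listings['Market Capitalization'].div(listings['Last Sale'])  # shares outstanding per ticker
market_value = stock_prices.mul(shares)                                # market cap per ticker and day
raw_index = market_value.sum(axis=1)                                   # total market value per day
index = raw_index.div(raw_index.iloc[0]).mul(100)                      # normalize to 100 at the start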

Select index components & import data

Explore and clean company listing information

To get started with the construction of a market-value based index, you'll work with the combined listing info for the three largest US stock exchanges, the NYSE, the NASDAQ and the AMEX.

In this and the next exercise, you will calculate market-cap weights for these stocks.

We have already imported pandas as pd, and loaded the listings data set with listings information from the NYSE, NASDAQ, and AMEX. The column 'Market Capitalization' is already measured in USD mn.

Instructions:

  • Inspect listings using .info().
  • Move the column 'Stock Symbol' into the index (inplace).
  • Drop all companies with missing 'Sector' information from listings.
  • Select companies with IPO Year before 2019.
  • Inspect the result of the changes you just made using .info().
  • Show the number of companies per 'Sector' using .groupby() and .size(). Sort the output in descending order.
listings_nyse = pd.read_excel('./datasets/listings.xlsx', sheet_name='nyse', na_values='n/a')
listings_amex = pd.read_excel('./datasets/listings.xlsx', sheet_name='amex', na_values='n/a')
listings_nasdaq = pd.read_excel('./datasets/listings.xlsx', sheet_name='nasdaq', na_values='n/a')
listings_nyse['Exchange'] = 'nyse'
listings_amex['Exchange'] = 'amex'
listings_nasdaq['Exchange'] = 'nasdaq'
listings = pd.concat([listings_amex, listings_nasdaq, listings_nyse], axis=0)
listings.reset_index(inplace=True)
listings.drop(['index'], axis=1, inplace=True)
listings['Market Capitalization'] /= 1e6
print(listings.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6674 entries, 0 to 6673
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Stock Symbol           6674 non-null   object 
 1   Company Name           6674 non-null   object 
 2   Last Sale              6590 non-null   float64
 3   Market Capitalization  6674 non-null   float64
 4   IPO Year               2852 non-null   float64
 5   Sector                 5182 non-null   object 
 6   Industry               5182 non-null   object 
 7   Exchange               6674 non-null   object 
dtypes: float64(3), object(5)
memory usage: 417.2+ KB
None
print(listings.info())

# Move 'stock symbol' into the index
listings.set_index('Stock Symbol', inplace=True)
 
# Drop rows with missing 'sector' data
listings.dropna(subset=['Sector'], inplace=True)
 
# Select companies with IPO Year before 2019
listings = listings.loc[listings['IPO Year'] < 2019]
 
# Inspect the new listings data
print(listings.info())
 
# Show the number of companies per sector
print(listings.groupby('Sector').size().sort_values(ascending=False))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6674 entries, 0 to 6673
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Stock Symbol           6674 non-null   object 
 1   Company Name           6674 non-null   object 
 2   Last Sale              6590 non-null   float64
 3   Market Capitalization  6674 non-null   float64
 4   IPO Year               2852 non-null   float64
 5   Sector                 5182 non-null   object 
 6   Industry               5182 non-null   object 
 7   Exchange               6674 non-null   object 
dtypes: float64(3), object(5)
memory usage: 417.2+ KB
None
<class 'pandas.core.frame.DataFrame'>
Index: 2349 entries, ACU to ZTO
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Company Name           2349 non-null   object 
 1   Last Sale              2349 non-null   float64
 2   Market Capitalization  2349 non-null   float64
 3   IPO Year               2349 non-null   float64
 4   Sector                 2349 non-null   object 
 5   Industry               2349 non-null   object 
 6   Exchange               2349 non-null   object 
dtypes: float64(3), object(4)
memory usage: 146.8+ KB
None
Sector
Health Care              445
Consumer Services        402
Technology               386
Finance                  351
Energy                   144
Capital Goods            143
Basic Industries         104
Public Utilities         104
Consumer Non-Durables     89
Miscellaneous             68
Transportation            58
Consumer Durables         55
dtype: int64

Your data is squeaky clean now!

Select and inspect index components

Now that you have imported and cleaned the listings data, you can proceed to select the index components as the largest company for each sector by market capitalization.

You'll also have the opportunity to take a closer look at the components, their last market value, and last price.

Instructions:
We have already imported pandas as pd, and loaded the listings data with the modifications you made during the last exercise.

  • Use .groupby() and .nlargest() to select the largest company by 'Market Capitalization' for each 'Sector', and assign the result to components.
  • Print components, sorted in descending order by market cap.
  • Select Stock Symbol from the index of components, assign it to tickers and print the result.
  • Create a list info_cols that holds the column names Company Name, Market Capitalization, and Last Sale. Next, use .loc[] with tickers and info_cols to print() more details about the listings, sorted in descending order by Market Capitalization.
components = listings.groupby('Sector')['Market Capitalization'].nlargest(1)

# Print components, sorted by market cap
print(components.sort_values(ascending=False))

# Select stock symbols and print the result
tickers = components.index.get_level_values('Stock Symbol')
print(tickers)

# Print company name, market cap, and last price for each components
info_cols = ['Company Name', 'Market Capitalization', 'Last Sale']
print(listings.loc[tickers,info_cols].sort_values('Market Capitalization', ascending=False))
Sector                 Stock Symbol
Technology             AAPL            740024.467000
Consumer Services      AMZN            422138.530626
Miscellaneous          BABA            275525.000000
Health Care            AMGN            118927.210535
Transportation         UPS              90180.886756
Finance                GS               88840.590477
Consumer Non-Durables  ABEV             88240.198455
Basic Industries       RIO              70431.476895
Public Utilities       TEF              54609.806092
Capital Goods          GM               50086.335099
Energy                 PAA              22223.001416
Consumer Durables      CPRT             13620.922869
Name: Market Capitalization, dtype: float64
Index(['RIO', 'GM', 'CPRT', 'ABEV', 'AMZN', 'PAA', 'GS', 'AMGN', 'BABA', 'TEF',
       'AAPL', 'UPS'],
      dtype='object', name='Stock Symbol')
                                    Company Name  Market Capitalization  Last Sale
Stock Symbol                                                                      
AAPL                                  Apple Inc.          740024.467000     141.05
AMZN                            Amazon.com, Inc.          422138.530626     884.67
BABA               Alibaba Group Holding Limited          275525.000000     110.21
AMGN                                  Amgen Inc.          118927.210535     161.61
UPS                  United Parcel Service, Inc.           90180.886756     103.74
GS               Goldman Sachs Group, Inc. (The)           88840.590477     223.32
ABEV                                  Ambev S.A.           88240.198455       5.62
RIO                                Rio Tinto Plc           70431.476895      38.94
TEF                                Telefonica SA           54609.806092      10.84
GM                        General Motors Company           50086.335099      33.39
PAA           Plains All American Pipeline, L.P.           22223.001416      30.72
CPRT                                Copart, Inc.           13620.922869      29.65

You're ready for the next step in creating this index.
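
As an aside, .idxmax() on the grouped market caps is a hedged alternative sketch that returns just the ticker of the largest company per sector, which is handy if you only need the symbols (this assumes the cleaned listings DataFrame from the previous exercise):

# One ticker per sector, e.g. Technology -> AAPL
tickers_alt = listings.groupby('Sector')['Market Capitalization'].idxmax()
print(tickers_alt)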

Import index component price information

Now you'll use the stock symbols for the companies you selected in the last exercise to calculate returns for each company.

Instructions:
We have already imported pandas as pd and matplotlib.pyplot as plt for you. We have also made the variable tickers available to you, which contains the Stock Symbol for each index component as a list.

  • Print tickers to verify the content matches your result from the last exercise.
  • Use pd.read_csv() to import 'stock_prices.csv', parsing the 'Date' column and also setting the 'Date' column as index before assigning the result to stock_prices. Inspect the result using .info().
  • Calculate the price return for the index components by dividing the last row of stock_prices by the first, subtracting 1 and multiplying by 100. Assign the result to price_return.
  • Plot a horizontal bar chart of the sorted returns with the title Stock Price Returns.
print(tickers)

# Import prices and inspect result
# stock_prices = pd.read_csv('stock_prices.csv', parse_dates=['Date'], index_col='Date')
stock_prices = pd.read_csv('./datasets/stocks_4.csv', parse_dates=['Date'], index_col='Date')
print(stock_prices.info())

# Calculate the returns
price_return = stock_prices.iloc[-1].div(stock_prices.iloc[0]).sub(1).mul(100)

# Plot horizontal bar chart of sorted price_return
price_return.sort_values().plot(kind='barh', title='Stock Price Returns')
plt.show()
Index(['RIO', 'GM', 'CPRT', 'ABEV', 'AMZN', 'PAA', 'GS', 'AMGN', 'BABA', 'TEF',
       'AAPL', 'UPS'],
      dtype='object', name='Stock Symbol')
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1762 entries, 2010-01-04 to 2016-12-30
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   AAPL    1761 non-null   float64
 1   AMGN    1761 non-null   float64
 2   AMZN    1761 non-null   float64
 3   CPRT    1761 non-null   float64
 4   EL      1762 non-null   float64
 5   GS      1762 non-null   float64
 6   ILMN    1761 non-null   float64
 7   MA      1762 non-null   float64
 8   PAA     1762 non-null   float64
 9   RIO     1762 non-null   float64
 10  TEF     1762 non-null   float64
 11  UPS     1762 non-null   float64
dtypes: float64(12)
memory usage: 179.0 KB
None

Build a market-cap weighted index

Calculate number of shares outstanding

The next step towards building a value-weighted index is to calculate the number of shares for each index component.

The number of shares will allow you to calculate the total market capitalization for each component given the historical price series in the next exercise.

Instructions:

  • Inspect listings and print tickers.
  • Use .loc[] with the list of tickers to select the index components and the columns 'Market Capitalization' and 'Last Sale'; assign this to components.
  • Print the first five rows of components.
  • Create no_shares by dividing 'Market Capitalization' by 'Last Sale'.
  • Print no_shares in descending order.
listings =  pd.read_csv('./datasets/listings.csv')
listings.set_index('Stock Symbol', inplace=True)

print(listings.info())
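# Note: the price history used in this chapter (stocks_4.csv) covers EL, ILMN, and MA
# instead of ABEV, BABA, and GM, so the tickers list is redefined here to match the
# columns of stock_prices.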
tickers = [ 'RIO',
            'ILMN',
            'CPRT',
            'EL',
            'AMZN',
            'PAA',
            'GS',
            'AMGN',
            'MA',
            'TEF',
            'AAPL',
            'UPS']
<class 'pandas.core.frame.DataFrame'>
Index: 1015 entries, ACU to YPF
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Exchange               1015 non-null   object 
 1   Company Name           1015 non-null   object 
 2   Last Sale              1015 non-null   float64
 3   Market Capitalization  1015 non-null   float64
 4   IPO Year               1015 non-null   float64
 5   Sector                 1015 non-null   object 
 6   Industry               1015 non-null   object 
dtypes: float64(3), object(4)
memory usage: 63.4+ KB
None
print(listings.info())
print(tickers)

# Select components and relevant columns from listings
components = listings[['Market Capitalization', 'Last Sale']].loc[tickers]

# Print the first rows of components
print(components.head(5))

# Calculate the number of shares here
no_shares = components['Market Capitalization'].div(components['Last Sale'])

# Print the sorted no_shares
print(no_shares.sort_values(ascending=False))
<class 'pandas.core.frame.DataFrame'>
Index: 1015 entries, ACU to YPF
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Exchange               1015 non-null   object 
 1   Company Name           1015 non-null   object 
 2   Last Sale              1015 non-null   float64
 3   Market Capitalization  1015 non-null   float64
 4   IPO Year               1015 non-null   float64
 5   Sector                 1015 non-null   object 
 6   Industry               1015 non-null   object 
dtypes: float64(3), object(4)
memory usage: 63.4+ KB
None
['RIO', 'ILMN', 'CPRT', 'EL', 'AMZN', 'PAA', 'GS', 'AMGN', 'MA', 'TEF', 'AAPL', 'UPS']
              Market Capitalization  Last Sale
Stock Symbol                                  
RIO                    70431.476895      38.94
ILMN                   25409.384000     173.68
CPRT                   13620.922869      29.65
EL                     31122.510011      84.94
AMZN                  422138.530626     884.67
Stock Symbol
AAPL    5246.540000
TEF     5037.804990
RIO     1808.717948
MA      1108.884100
UPS      869.297154
AMGN     735.890171
PAA      723.404994
AMZN     477.170618
CPRT     459.390316
GS       397.817439
EL       366.405816
ILMN     146.300000
dtype: float64

Now you know which companies have the most shares.
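
A quick note on units: 'Market Capitalization' is measured in USD mn and 'Last Sale' in USD per share, so the share counts above are in millions. A one-line sanity check for Apple:

# 740024.467 USD mn / 141.05 USD per share ≈ 5246.5, i.e. roughly 5.2 billion shares
print(components.loc['AAPL', 'Market Capitalization'] / components.loc['AAPL', 'Last Sale'])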

Create time series of market value

You can now use the number of shares to calculate the total market capitalization for each component and trading date from the historical price series.

The result will be the key input to construct the value-weighted stock index, which you will complete in the next exercise.

Instructions:

  • Select the 'Number of Shares' from components, assign to no_shares, and print the result, sorted in the default (ascending) order.
  • Multiply stock_prices by no_shares to create a time series of market cap per ticker, and assign it to market_cap.
  • Select the first and the last row of market_cap and assign these to first_value and last_value.
  • Use pd.concat() to concatenate first_value and last_value along axis=1 and plot the result as horizontal bar chart.
components = pd.read_csv('./datasets/components.csv',index_col=[0])
print(components.info())
<class 'pandas.core.frame.DataFrame'>
Index: 12 entries, RIO to UPS
Data columns (total 3 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Market Capitalization  12 non-null     float64
 1   Last Sale              12 non-null     float64
 2   Number of Shares       12 non-null     float64
dtypes: float64(3)
memory usage: 384.0+ bytes
None
no_shares = components['Number of Shares']
print(no_shares.sort_values())

# Create the series of market cap per ticker
market_cap = stock_prices.mul(no_shares)

# Select first and last market cap here
first_value = market_cap.iloc[0]
last_value = market_cap.iloc[-1]

# Concatenate and plot first and last market cap here
pd.concat([first_value, last_value], axis=1).plot(kind='barh')
plt.show()
Stock Symbol
ILMN     146.300000
EL       366.405816
GS       397.817439
CPRT     459.390316
AMZN     477.170618
PAA      723.404994
AMGN     735.890171
UPS      869.297154
MA      1108.884100
RIO     1808.717948
TEF     5037.804990
AAPL    5246.540000
Name: Number of Shares, dtype: float64

You've made one of the essential ingredients of the index.
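
The key mechanic in this step is that DataFrame.mul(Series) aligns the Series index with the DataFrame columns, so each price column is scaled by that ticker's share count. A tiny self-contained sketch with made-up numbers illustrates the alignment:

# Column-wise multiplication: the Series labels are matched against the column names
prices = pd.DataFrame({'AAA': [10.0, 11.0], 'BBB': [20.0, 19.0]})
shares = pd.Series({'AAA': 3.0, 'BBB': 2.0})
print(prices.mul(shares))   # AAA column multiplied by 3, BBB column by 2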

Calculate & plot the composite index

By now, you have all the ingredients you need to calculate the aggregate stock performance for your group of companies.

Use the time series of market capitalization that you created in the last exercise to aggregate the market value for each period, and then normalize this series to convert it to an index.

Instructions:
We have already imported pandas as pd and matplotlib.pyplot as plt for you. We have also loaded components and market_cap_series, which you worked with in the last exercise.

  • Aggregate the market cap per trading day by applying .sum() to market_cap_series with axis=1, assign to raw_index and print the result.
  • Normalize the aggregate market cap by dividing by the first value of raw_index and multiplying by 100. Assign this to index and print the result.
  • Plot the index with the title 'Market-Cap Weighted Index'.
market_cap_series = pd.read_csv('./datasets/market_cap_series.csv', index_col=[0])
display(market_cap_series.info())
<class 'pandas.core.frame.DataFrame'>
Index: 1761 entries, 2010-01-04 to 2016-12-30
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   AAPL    1761 non-null   float64
 1   AMGN    1761 non-null   float64
 2   AMZN    1761 non-null   float64
 3   CPRT    1761 non-null   float64
 4   EL      1761 non-null   float64
 5   GS      1761 non-null   float64
 6   ILMN    1761 non-null   float64
 7   MA      1761 non-null   float64
 8   PAA     1761 non-null   float64
 9   RIO     1761 non-null   float64
 10  TEF     1761 non-null   float64
 11  UPS     1761 non-null   float64
dtypes: float64(12)
memory usage: 178.9+ KB
None
display(market_cap_series.head())
                   AAPL          AMGN          AMZN         CPRT           EL            GS      ILMN            MA           PAA            RIO            TEF           UPS
Date
2010-01-04  160386.7278  42475.580670  63893.145750  2090.225938  8892.669154  68854.242342  4469.465  28476.143688  19531.934838  101342.466626  143829.332465  50575.708420
2010-01-05  160701.5202  42107.635585  64270.110538  2090.225938  8859.692631  70071.563705  4440.205  28398.521801  19748.956336  102916.051241  143728.576365  50662.638135
2010-01-06  158130.7156  41791.202811  63105.814231  2081.038131  8885.341038  69323.666920  4713.786  28343.077596  19741.722286  106063.220471  142217.234868  50288.840359
2010-01-07  157815.9232  41408.539922  62032.180340  2067.256422  8998.926841  70680.224387  4794.251  28154.567299  19502.998638  106081.307650  139799.088473  49906.349611
2010-01-08  158865.2312  41776.485008  63711.820915  2076.444228  9035.567423  69343.557792  4849.845  28165.656140  19568.105088  107256.974316  138892.283574  52305.609756
raw_index = market_cap_series.sum(axis=1)
print(raw_index)

# Normalize the aggregate market cap here
index = raw_index.div(raw_index.iloc[0]).mul(100)
print(index)

# Plot the index here
index.plot(title='Market-Cap Weighted Index')
plt.show()
Date
2010-01-04    6.948176e+05
2010-01-05    6.979957e+05
2010-01-06    6.946857e+05
2010-01-07    6.912416e+05
2010-01-08    6.958476e+05
                  ...     
2016-12-23    1.588874e+06
2016-12-27    1.599280e+06
2016-12-28    1.593635e+06
2016-12-29    1.589422e+06
2016-12-30    1.574862e+06
Length: 1761, dtype: float64
Date
2010-01-04    100.000000
2010-01-05    100.457394
2010-01-06     99.981005
2010-01-07     99.485328
2010-01-08    100.148231
                 ...    
2016-12-23    228.675001
2016-12-27    230.172669
2016-12-28    229.360223
2016-12-29    228.753821
2016-12-30    226.658267
Length: 1761, dtype: float64

Now you have an index to work with!

Evaluate index performance

Calculate the contribution of each stock to the index

You have successfully built the value-weighted index. Let's now explore how it performed over the 2010-2016 period.

Let's also determine how much each stock has contributed to the index return.

Instructions:

  • Divide the last index value by the first, subtract 1 and multiply by 100. Assign the result to index_return and print it.
  • Select the 'Market Capitalization' column from components.
  • Calculate the total market cap for all components and assign this to total_market_cap.
  • Divide the components' market cap by total_market_cap to calculate the component weights, assign it to weights, and print weights with the values sorted in default (ascending) order.
  • Multiply weights by the index_return to calculate the contribution by component, sort the values in ascending order, and plot the result as a horizontal bar chart.
index_return = (index.iloc[-1] / index.iloc[0] - 1) * 100
print(index_return)

# Select the market capitalization
market_cap = components['Market Capitalization']

# Calculate the total market cap
total_market_cap = market_cap.sum()

# Calculate the component weights , and print the result
weights = market_cap.div(total_market_cap)
print(weights.sort_values())

# Calculate and plot the distribution by component
weights.mul(index_return).sort_values().plot(kind='barh')
plt.show()
126.65826661173818
Stock Symbol
CPRT    0.007564
PAA     0.012340
ILMN    0.014110
EL      0.017282
TEF     0.030324
RIO     0.039110
GS      0.049332
UPS     0.050077
AMGN    0.066039
MA      0.068484
AMZN    0.234410
AAPL    0.410929
Name: Market Capitalization, dtype: float64
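
A quick sanity check: the value weights are each component's share of total market capitalization, so they should sum to one.

print(weights.sum())   # ≈ 1.0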

The next step is to take a look at how your index stacks up against a benchmark!

Compare index performance against benchmark I

The next step in analyzing the performance of your index is to compare it against a benchmark.

In the video, we used the S&P 500 as a benchmark. You can also use the Dow Jones Industrial Average, which contains 30 large US stocks and would also be a reasonable benchmark for the largest stocks from all sectors across the three exchanges.

Instructions:
We have already imported pandas as pd, matplotlib.pyplot as plt for you. We have also loaded your index and the DJIA data into variables index and djia, respectively, both as a pd.Series().

  • Convert index to a pd.DataFrame with the column name 'Index' and assign the result to data.
  • Normalize djia to start at 100 and add it as new column to data.
  • Show the total return for both index and djia by dividing the last row of data by the first, subtracting 1 and multiplying by 100.
  • Show a plot of both of the series in data.
djia = pd.read_csv('./datasets/djia2.csv', parse_dates=['DATE'], index_col='DATE')['DJIA']
data = index.to_frame('Index')

# Normalize djia series and add as new column to data
djia = djia.div(djia.iloc[0]).mul(100)
data['DJIA'] = djia

# Print total return for both index and djia
print((data.iloc[-1] / data.iloc[0] - 1) * 100)

# Plot both series
data.plot()
plt.show()
Index    126.658267
DJIA      86.722172
dtype: float64

How do they compare?

Compare index performance against benchmark II

The next step in analyzing the performance of your index is to compare it against a benchmark.

In the video, we used the S&P 500 as a benchmark. You can also use the Dow Jones Industrial Average, which contains 30 large US stocks and would also be a reasonable benchmark for the largest stocks from all sectors across the three exchanges.

Instructions:

  • Inspect data and print the first five rows.
  • Define a function multi_period_return that takes a numpy array of period returns as input, and returns the total return for the period. Use the formula from the video - add 1 to the input, pass the result to np.prod(), subtract 1 and multiply by 100.
  • Create a .rolling() window of length '360D' from data, and apply multi_period_return. Assign to rolling_return_360.
  • Plot rolling_return_360 using the title 'Rolling 360D Return'.
data.index = pd.to_datetime(data.index)
print(data.info())
print(data.head(5))

# Create multi_period_return function here
def multi_period_return(r):
    return (np.prod(r + 1) - 1) * 100

# Calculate rolling_return_360
rolling_return_360 = data.pct_change().rolling('360D').apply(multi_period_return)

# Plot rolling_return_360 here
rolling_return_360.plot(title='Rolling 360D Return')
plt.show()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1761 entries, 2010-01-04 to 2016-12-30
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Index   1761 non-null   float64
 1   DJIA    1761 non-null   float64
dtypes: float64(2)
memory usage: 41.3 KB
None
                 Index        DJIA
Date                              
2010-01-04  100.000000  100.000000
2010-01-05  100.457394   99.887188
2010-01-06   99.981005   99.902872
2010-01-07   99.485328  100.216365
2010-01-08  100.148231  100.323414

How do the returns of your index compare to the Dow Jones?
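
To see the compounding at work, here is a tiny check of multi_period_return on three made-up daily returns; the rolling version above applies exactly this calculation to every 360-day window.

# (1 + 0.01) * (1 - 0.02) * (1 + 0.03) - 1 = 0.019494, i.e. about 1.95 percent
print(multi_period_return(np.array([0.01, -0.02, 0.03])))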

display(data.tail(5))
                 Index        DJIA
Date
2016-12-23  228.675001  188.339809
2016-12-27  230.172669  188.445912
2016-12-28  229.360223  187.393754
2016-12-29  228.753821  187.262424
2016-12-30  226.658267  186.722172

Index correlation & exporting to Excel

Visualize your index constituent correlations

To better understand the characteristics of your index constituents, you can calculate the return correlations.

Use the daily stock prices of your index companies, and show a heatmap of the daily return correlations!

Instructions:
We have already imported pandas as pd, matplotlib.pyplot as plt, and seaborn as sns. We have also loaded the historical price series of your index constituents into the variable stock_prices.

  • Inspect stock_prices using .info().
  • Calculate the daily returns for stock_prices and assign the result to returns.
  • Calculate the pairwise correlations for returns, assign them to correlations and print the result.
  • Plot a seaborn annotated heatmap of the daily return correlations with the title 'Daily Return Correlations'.
stock_prices = pd.read_csv('./datasets/stocks_4.csv', parse_dates=['Date'], index_col='Date')
plt.style.use('seaborn')
print(stock_prices.info())

# Calculate the daily returns
returns = stock_prices.pct_change()

# Calculate and print the pairwise correlations
correlations = returns.corr()
display(correlations)

# Plot a heatmap of daily return correlations
sns.heatmap(correlations, annot=True)
plt.title('Daily Return Correlations')
plt.show()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1762 entries, 2010-01-04 to 2016-12-30
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   AAPL    1761 non-null   float64
 1   AMGN    1761 non-null   float64
 2   AMZN    1761 non-null   float64
 3   CPRT    1761 non-null   float64
 4   EL      1762 non-null   float64
 5   GS      1762 non-null   float64
 6   ILMN    1761 non-null   float64
 7   MA      1762 non-null   float64
 8   PAA     1762 non-null   float64
 9   RIO     1762 non-null   float64
 10  TEF     1762 non-null   float64
 11  UPS     1762 non-null   float64
dtypes: float64(12)
memory usage: 179.0 KB
None
          AAPL      AMGN      AMZN      CPRT        EL        GS      ILMN        MA       PAA       RIO       TEF       UPS
AAPL  1.000000  0.286898  0.327611  0.346616  0.306770  0.344981  0.264791  0.391421  0.212960  0.361684  0.325309  0.366039
AMGN  0.286898  1.000000  0.323408  0.355892  0.349893  0.390076  0.336927  0.400230  0.229255  0.313878  0.374555  0.432468
AMZN  0.327611  0.323408  1.000000  0.298929  0.334031  0.333402  0.242726  0.428330  0.182438  0.326229  0.331867  0.378399
CPRT  0.346616  0.355892  0.298929  1.000000  0.371763  0.423160  0.265665  0.401352  0.221273  0.384944  0.376767  0.462716
EL    0.306770  0.349893  0.334031  0.371763  1.000000  0.358318  0.214027  0.431556  0.206056  0.415416  0.428925  0.456952
GS    0.344981  0.390076  0.333402  0.423160  0.358318  1.000000  0.266063  0.466796  0.271982  0.527298  0.498230  0.506407
ILMN  0.264791  0.336927  0.242726  0.265665  0.214027  0.266063  1.000000  0.301392  0.162796  0.234445  0.231173  0.267801
MA    0.391421  0.400230  0.428330  0.401352  0.431556  0.466796  0.301392  1.000000  0.243761  0.437778  0.448438  0.486512
PAA   0.212960  0.229255  0.182438  0.221273  0.206056  0.271982  0.162796  0.243761  1.000000  0.337448  0.253598  0.217523
RIO   0.361684  0.313878  0.326229  0.384944  0.415416  0.527298  0.234445  0.437778  0.337448  1.000000  0.559264  0.509809
TEF   0.325309  0.374555  0.331867  0.376767  0.428925  0.498230  0.231173  0.448438  0.253598  0.559264  1.000000  0.516242
UPS   0.366039  0.432468  0.378399  0.462716  0.456952  0.506407  0.267801  0.486512  0.217523  0.509809  0.516242  1.000000

Save your analysis to multiple Excel worksheets

Now that you have completed your analysis, you may want to save all results into a single Excel workbook.

Let's practice exporting various DataFrames to multiple Excel worksheets.

Instructions:
We have already imported pandas as pd for you. We have also loaded both the historical price series of your index constituents into the variable stock_prices, and the index as index.

  • Inspect both index and stock_prices using .info().
  • Use .join() to combine index with stock_prices, and assign to data.
  • Apply .pct_change() to data and assign to returns.
  • Create a pd.ExcelWriter and use a with statement to export data and returns to Excel, using sheet names that match the variable names.
index = index.to_frame('Index')
print(index.info())
print(stock_prices.info())

# Join index to stock_prices, and inspect the result
data = stock_prices.join(index)
print(data.info())

# Create index & stock price returns
returns = data.pct_change()

# Export data and returns to Excel
with pd.ExcelWriter('data.xls') as writer:
    data.to_excel(writer, sheet_name='data')
    returns.to_excel(writer, sheet_name='returns')
<class 'pandas.core.frame.DataFrame'>
Index: 1761 entries, 2010-01-04 to 2016-12-30
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Index   1761 non-null   float64
dtypes: float64(1)
memory usage: 27.5+ KB
None
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1762 entries, 2010-01-04 to 2016-12-30
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   AAPL    1761 non-null   float64
 1   AMGN    1761 non-null   float64
 2   AMZN    1761 non-null   float64
 3   CPRT    1761 non-null   float64
 4   EL      1762 non-null   float64
 5   GS      1762 non-null   float64
 6   ILMN    1761 non-null   float64
 7   MA      1762 non-null   float64
 8   PAA     1762 non-null   float64
 9   RIO     1762 non-null   float64
 10  TEF     1762 non-null   float64
 11  UPS     1762 non-null   float64
dtypes: float64(12)
memory usage: 179.0 KB
None
<class 'pandas.core.frame.DataFrame'>
Index: 1762 entries, 2010-01-04 00:00:00 to 2016-12-30 00:00:00
Data columns (total 13 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   AAPL    1761 non-null   float64
 1   AMGN    1761 non-null   float64
 2   AMZN    1761 non-null   float64
 3   CPRT    1761 non-null   float64
 4   EL      1762 non-null   float64
 5   GS      1762 non-null   float64
 6   ILMN    1761 non-null   float64
 7   MA      1762 non-null   float64
 8   PAA     1762 non-null   float64
 9   RIO     1762 non-null   float64
 10  TEF     1762 non-null   float64
 11  UPS     1762 non-null   float64
 12  Index   0 non-null      float64
dtypes: float64(13)
memory usage: 257.3+ KB
None
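
One caveat about the result above: the joined 'Index' column reports 0 non-null values because index was read back from CSV with string dates, while stock_prices has a DatetimeIndex, so the row labels never align. A minimal fix, sketched on the assumption that the same variables are still in memory, is to convert the dates before joining. Recent pandas versions have also dropped the legacy xlwt engine, so writing to an .xlsx file is the safer target:

# Align the date indexes before joining (sketch, assuming the variables defined above)
index.index = pd.to_datetime(index.index)
data = stock_prices.join(index)
returns = data.pct_change()
print(data['Index'].count())   # should now report 1761 non-null values

# Write to .xlsx, which current pandas handles via the openpyxl engine
with pd.ExcelWriter('data.xlsx') as writer:
    data.to_excel(writer, sheet_name='data')
    returns.to_excel(writer, sheet_name='returns')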