Manipulating Time Series Data in Python
You'll learn the basics of manipulating time series data. Time series data are data that are indexed by a sequence of dates or times. You'll learn how to use methods built into pandas to work with this index. You'll also learn how to resample time series to change the frequency. This course will also show you how to calculate rolling and cumulative values for time series. Finally, you'll use all your new skills to build a value-weighted stock index from actual stock data.
Download Datasets and Presentation slides for this post HERE
This post is the 1st part of the Time Series with Python track from DataCamp.
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 150
pd.set_option('display.expand_frame_repr', False)
plt.style.use('ggplot')
mpl.rcParams['figure.figsize'] = (25, 20)
mpl.rcParams['axes.grid'] = True
warnings.filterwarnings("ignore")
Working with Time Series in Pandas
This chapter lays the foundations to leverage the powerful time series functionality made available by how Pandas represents dates, in particular by the DateTimeIndex. You will learn how to create and manipulate date information and time series, and how to do calculations with time-aware DataFrames to shift your data in time or create period-specific returns.
How to use dates & times with pandas
Your first time series
You have learned in the video how to create a sequence of dates using pd.date_range(). You have also seen that each date in the resulting pd.DatetimeIndex is a pd.Timestamp with various attributes that you can access to obtain information about the date.
Now, you'll create a week of data, iterate over the result, and obtain the dayofweek and day_name() for each date.
Instructions:
- Use pd.date_range to create seven dates starting from '2017-1-1' at (default) daily frequency. Use the arguments start and periods. Assign the result to seven_days.
- Iterate over each date in seven_days and in each iteration, print the .dayofweek and .day_name() attributes.
seven_days = pd.date_range(start='2017-1-1', periods=7)
# Iterate over the dates and print the number and name of the weekday
for day in seven_days:
    print(day.dayofweek, day.day_name())
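For reference, January 1, 2017 fell on a Sunday, and dayofweek counts Monday as 0 and Sunday as 6, so the loop output should look like this (a sketch, not captured from a run):
# 6 Sunday
# 0 Monday
# 1 Tuesday
# 2 Wednesday
# 3 Thursday
# 4 Friday
# 5 Saturday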
Indexing & resampling time series
Create a time series of air quality data
You have seen in the video how to deal with dates that are not in the correct format, but instead are provided as string types, represented as dtype object in pandas.
We have prepared a data set with air quality data (ozone, pm25, and carbon monoxide for NYC, 2000-2017) for you to practice the use of pd.to_datetime().
Instructions:
- Inspect data using .info().
- Use pd.to_datetime to convert the column 'date' to dtype datetime64.
- Set the 'date' column as index.
- Validate the changes by inspecting data using .info() again.
- Plot data using subplots=True.
data = pd.read_csv('./datasets/nyc.csv')
# Inspect data
display(data.info())
# Convert the date column to datetime64
data.date = pd.to_datetime(data.date)
# Set date column as index
data.set_index("date",inplace = True)
# Inspect data
display(data.info())
# Plot data
data.plot(subplots = True)
# plt.tight_layout()
plt.show()
Compare annual stock price trends
You have seen how to select sub-periods from a time series. You'll use this to compare the performance for three years of Yahoo stock prices.
We have already imported pandas as pd and matplotlib.pyplot as plt and we have already loaded the 'yahoo.csv' file in a variable yahoo with DateTimeIndex and a single column price.
Instructions:
- Create an empty pd.DataFrame() called prices.
- Iterate over a list containing the three years, 2013, 2014, and 2015, as string, and in each loop:
- Use the iteration variable to select the data for this year and the column price.
- Use .reset_index() with drop=True to remove the DatetimeIndex.
- Rename the price column to the appropriate year.
- Use pd.concat() to combine the yearly data with the data in prices along axis=1.
- Plot prices.
yahoo = pd.read_csv("./datasets/yahoo.csv")
# Convert the date column to datetime64
yahoo['date']= pd.to_datetime(yahoo['date'])
# Set date column as index
yahoo.set_index('date',inplace = True)
display(yahoo.head())
prices = pd.DataFrame()
# Select data for each year and concatenate with prices here
for year in [ '2013', '2014', '2015']:
    price_per_year = yahoo.loc[year, ['price']].reset_index(drop=True)
    price_per_year.rename(columns={'price': year}, inplace=True)
    prices = pd.concat([prices, price_per_year], axis=1)
print(prices.head())
# Plot prices
prices.plot(subplots = True)
plt.show()
The plot you just created shows Yahoo's stock price in three different years.
Set and change time series frequency
In the video, you have seen how to assign a frequency to a DateTimeIndex, and then change this frequency. Now, you'll use data on the daily carbon monoxide concentration in NYC, LA and Chicago from 2005-17.
You'll set the frequency to calendar daily and then resample to monthly frequency, and visualize both series to see how the different frequencies affect the data.
Instructions:
- Inspect co using .info().
- Use .asfreq() to set the frequency to calendar daily.
- Show a plot of 'co' using subplots=True.
- Change the frequency to monthly using the alias 'M'.
- Show another plot of co using subplots=True.
co = pd.read_csv('./datasets/co_cities.csv', parse_dates=['date'], index_col='date')
# Inspect data
print(co.info())
co.head()
# Set the frequency to calendar daily
co = co.asfreq('D')
print(co.info())
# Plot the data
co.plot(subplots=True)
plt.show()
# Set frequency to monthly
co = co.asfreq('M')
# Plot the data
co.plot(subplots=True)
plt.show()
How does changing the frequency of the data affect the plot output?
- Setting calendar-daily frequency expands the DatetimeIndex from 1898 to 2191 entries; the added dates hold missing values, and the subsequent .asfreq('M') keeps only the month-end observations.
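To make this concrete, here is a minimal sketch on toy data (not the CO dataset) showing that upsampling with .asfreq() only reindexes and inserts missing values:
ts = pd.Series([1.0, 2.0], index=pd.to_datetime(['2017-01-01', '2017-01-03']))
print(ts.asfreq('D'))   # 2017-01-02 appears as a new row holding NaN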
Lags, changes, and returns for stock price series
Shifting stock prices across time
The first method to manipulate time series that you saw in the video was .shift(), which allows you to shift all values in a Series or DataFrame by a number of periods to a different time along the DateTimeIndex.
Let's use this to visually compare a stock price series for Google shifted 90 business days into both past and future.
Instructions:
- Use pd.read_csv() to import 'google.csv', parsing the 'Date' as dates, setting the result as index and assigning to google.
- Use .asfreq() to set the frequency of google to business daily.
- Add new columns lagged and shifted to google that contain the Close shifted by 90 business days into past and future, respectively.
- Plot the three columns of google.
google = pd.read_csv('./datasets/google.csv', parse_dates=['Date'], index_col='Date')
print(google.info())
# Set data frequency to business daily
google = google.asfreq('B')
# Create 'lagged' and 'shifted'
google['lagged'] = google.Close.shift(-90)
google['shifted'] = google.Close.shift(90)
# Plot the google price series
google.plot(title='Google Stock Price')
plt.tight_layout()
plt.show()
Calculating stock price changes
You have learned in the video how to calculate returns using current and shifted prices as input. Now you'll practice a similar calculation to calculate absolute changes from current and shifted prices, and compare the result to the function .diff().
We have already imported pandas as pd and matplotlib.pyplot as plt. We have also loaded Yahoo stock prices for the years 2013 to 2015, set the frequency to business daily, and assigned the result to yahoo.
Instructions:
- Create a new column called shifted_30 that contains the 'price' shifted by 30 business days into the future.
- Subtract 'shifted_30' from 'price', and assign the result to a new column, 'change_30'.
- Apply .diff(), setting periods to 30, and assign the result to a new column, 'diff_30'.
- Inspect the last five rows of yahoo to verify the calculation.
- Subtract diff_30 from change_30 using the .sub() method and print the .value_counts() of the result to show both columns are equal.
yahoo['shifted_30'] = yahoo.price.shift(30)
# Subtract shifted_30 from price
yahoo['change_30'] = yahoo['price'] - yahoo['shifted_30']
# Get the 30-day price difference
yahoo['diff_30'] = yahoo.price.diff(periods=30)
# Inspect the last five rows of yahoo
print(yahoo.tail())
# Show the value_counts of the difference between change_30 and diff_30
print(yahoo.change_30.sub(yahoo.diff_30).value_counts())
There's usually more than one way to get to the same result when working with data.
Plotting multi-period returns
The last time series method you have learned about in the video was .pct_change(). Let's use this function to calculate returns for various calendar day periods, and plot the result to compare the different patterns.
We'll be using Google stock prices from 2014-2016. We have already imported pandas as pd, and matplotlib.pyplot as plt. We have also loaded 'GOOG' stock prices for the years 2014-2016, set the frequency to calendar daily, and assigned the result to google.
Instructions:
- Create the columns 'daily_return', 'monthly_return', and 'annual_return' that contain the pct_change() of 'Close' for 1, 30 and 360 calendar days, respectively, and multiply each by 100.
- Plot the result using subplots=True.
google = pd.read_csv('./datasets/google.csv', parse_dates=['Date'], index_col='Date')
# Set data frequency to calendar daily
google = google.asfreq('D')
print(google.info())
google['daily_return'] = google.Close.pct_change(periods=1).mul(100)
# Create monthly_return
google['monthly_return'] = google.Close.pct_change(periods=30).mul(100)
# Create annual_return
google['annual_return'] = google.Close.pct_change(periods=360).mul(100)
# Plot the result
google.plot(subplots = True)
plt.show()
How do the returns for different periods compare?
Basic Time Series Metrics & Resampling
This chapter dives deeper into the essential time series functionality made available through the pandas DatetimeIndex. It introduces resampling and how to compare different time series by normalizing their start points.
Compare time series growth rates
Comparing stock performance:
- Stock price series are hard to compare at different levels.
- A simple solution is to normalize price series to start at 100: divide all prices by the first price in the series and multiply by 100.
  - Same starting point
  - All prices relative to the starting point
  - Difference to the starting point in percentage points
Normalizing a single series
google = pd.read_csv('./datasets/apple_google.csv', usecols = ['Date','GOOG'], parse_dates=['Date'], index_col='Date')
google.rename({'GOOG':'price'},axis = 1, inplace = True)
display(google.head(3))
first_price = google.price.iloc[0] # int-based selection
first_price
first_price == google.loc['2010-01-04', 'price']
normalized = google.price.div(first_price).mul(100)
plt.rcParams["figure.figsize"] = (7,5)
normalized.plot(title='Google Normalized Series')
index = pd.read_csv('./datasets/sp500.csv',parse_dates=['date'],index_col='date')
prices.info()
Compare the performance of several asset classes
You have seen in the video how you can easily compare several time series by normalizing their starting points to 100, and plot the result.
To broaden your perspective on financial markets, let's compare four key assets: stocks, bonds, gold, and oil. We have already imported pandas as pd and matplotlib.pyplot as plt.
Instructions:
- Import 'asset_classes.csv', using .read_csv() to parse dates in the 'DATE' column and set this column as the index, then assign the result to prices.
- Select the first price for each series using .iloc[0] on prices and assign the result to first_prices.
- Divide prices by first_prices, multiply by 100 and assign the result to normalized.
- Plot normalized.
plt.style.use('seaborn')
mpl.rcParams['figure.figsize'] = (20, 20)
prices = pd.read_csv('./datasets/asset_classes.csv', parse_dates = ['DATE'], index_col = 'DATE')
# Inspect prices here
print(prices.info())
# Select first prices
first_prices = prices.iloc[0]
# Create normalized
normalized = prices.div(first_prices).mul(100)
# Plot normalized
normalized.plot(subplots = True)
plt.show()
Normalizing series is a common step in time series analysis.
Comparing stock prices with a benchmark
You also learned in the video how to compare the performance of various stocks against a benchmark. Now you'll learn more about the stock market by comparing the three largest stocks on the NYSE to the Dow Jones Industrial Average, which contains the 30 largest US companies.
The three largest companies on the NYSE are:
| Company | Stock Ticker |
| --- | --- |
| Johnson & Johnson | JNJ |
| Exxon Mobil | XOM |
| JP Morgan Chase | JPM |
Instructions:
- Use pd.read_csv() to import 'nyse.csv' and 'dow_jones.csv', creating a DatetimeIndex for each from the 'date' column using parse_dates and index_col, and assign the result to stocks and dow_jones, respectively.
- Use pd.concat() along axis=1 to combine stocks and dow_jones and assign the result to data. Inspect the .info() of data.
- Divide data by the first value for each series, multiply by 100 and plot the result.
stocks = pd.read_csv('./datasets/nyse.csv', parse_dates = ['date'], index_col = 'date')
dow_jones = pd.read_csv('./datasets/dow_jones.csv', parse_dates = ['date'], index_col = 'date')
# Concatenate data and inspect result here
data = pd.concat([stocks, dow_jones], axis = 1).dropna()
print(data.info())
# Normalize and plot your data here
data.div(data.iloc[0]).mul(100).plot(subplots = True)
plt.show()
Plot performance difference vs benchmark index
In the video, you learned how to calculate and plot the performance difference of a stock in percentage points relative to a benchmark index.
Let's compare the performance of Microsoft (MSFT) and Apple (AAPL) to the S&P 500 over the last 10 years.
Instructions:
- Create the list tickers containing the two stock symbols.
- Use pd.read_csv() to import 'msft_aapl.csv' and 'sp500.csv', creating a DatetimeIndex for each from the 'date' column using parse_dates and index_col, and assign the result to stocks and sp500, respectively.
- Use pd.concat() to concatenate stocks and sp500 along axis=1, apply .dropna() to drop all missing values, and assign the result to data.
- Normalize data by dividing by the first price, multiply by 100 and assign the output to normalized.
- Select tickers from normalized, and subtract normalized['SP500'] with keyword axis=0 to align the indexes, then plot the result.
tickers = ['MSFT','AAPL']
# Import stock data here
stocks = pd.read_csv('./datasets/msft_aapl.csv', parse_dates = ['date'], index_col = 'date')
# Import index here
sp500 = pd.read_csv('./datasets/sp500.csv', parse_dates = ['date'], index_col = 'date')
# Concatenate stocks and index here
data = pd.concat([stocks, sp500], axis = 1).dropna()
# Normalize data
normalized = data.div(data.iloc[0]).mul(100)
# Subtract the normalized index from the normalized stock prices, and plot the result
normalized[tickers].sub(normalized['SP500'], axis = 0).plot()
plt.show()
Now you can compare these stocks to the overall market so you can more easily spot trends and outliers.
Changing the time series frequency: resampling
- DateTimeIndex: set & change freq using .asfreq()
- But frequency conversion affects the data
- Upsampling: fill or interpolate missing data
- Downsampling: aggregate existing data
- pandas API:
  - .asfreq(), .reindex()
  - .resample() + transformation method
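As a minimal sketch on toy quarterly data (assumed values, not one of the course datasets), here is how the two approaches differ when upsampling:
q = pd.Series([1, 2], index=pd.to_datetime(['2017-03-31', '2017-06-30']))
print(q.asfreq('M'))            # month-end index; the new months appear as NaN
print(q.resample('M').ffill())  # same month-end index, but gaps are forward-filled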
Getting started: quarterly data
dates = pd.date_range(start='2016', periods=4, freq='Q')
data = range(1, 5)
quarterly = pd.Series(data=data, index=dates)
quarterly
Upsampling: quarter => month
Upsampling creates missing values
monthly = quarterly.asfreq('M') # to month-end frequency
monthly
Upsampling: fill methods
- bfill : backfill
- ffill : forward fill
- fill_value=0
monthly = monthly.to_frame('baseline') # to DataFrame
monthly['ffill'] = quarterly.asfreq('M', method='ffill')
monthly['bfill'] = quarterly.asfreq('M', method='bfill')
monthly['value'] = quarterly.asfreq('M', fill_value=0)
monthly
Add missing months: .reindex()
.reindex():
- conform DataFrame to new index
- same filling logic as .asfreq()
dates = pd.date_range(start='2016', periods=12, freq='M')
dates
quarterly.reindex(dates)
Convert monthly to weekly data
You have learned in the video how to use .reindex() to conform an existing time series to a DateTimeIndex at a different frequency.
Let's practice this method by creating monthly data and then converting this data to weekly frequency while applying various fill logic options.
Instructions:
- Create monthly_dates using pd.date_range with start, end and frequency alias 'M'.
- Create and print the pd.Series monthly, passing the list [1, 2] as the data argument, and using monthly_dates as index.
- Create weekly_dates using pd.date_range with start, end and frequency alias 'W'.
- Apply .reindex() to monthly three times: first without additional options, then with bfill and then with ffill, print()-ing each result.
start = '2016-1-1'
end = '2016-2-29'
# Create monthly_dates here
monthly_dates = pd.date_range(start = start,end = end, freq='M')
# Create and print monthly here
monthly = pd.Series(data=[1,2], index=monthly_dates)
print(monthly)
# Create weekly_dates here
weekly_dates = pd.date_range(start = start,end = end, freq='W')
# Print monthly, reindexed using weekly_dates
print(monthly.reindex(weekly_dates))
print(monthly.reindex(weekly_dates,method='bfill'))
print(monthly.reindex(weekly_dates,method='ffill'))
Create weekly from monthly unemployment data
The civilian US unemployment rate is reported monthly. You may need more frequent data, but that's no problem because you just learned how to upsample a time series.
You'll work with the time series data for the last 20 years, and apply a few options to fill in missing values before plotting the weekly series.
Instructions:
- Use pd.read_csv() to import 'unemployment.csv', creating a DateTimeIndex from the 'date' column using parse_dates and index_col, and assign the result to data.
- Convert data to weekly frequency using .asfreq() with the alias 'W' and show the first five rows.
- Convert again to weekly frequency, adding the option 'bfill' and show the first five rows.
- Create weekly series, now adding the option 'ffill', assign to weekly_ffill and show the first five rows.
- Plot weekly_ffill starting in 2015.
# data = pd.read_csv('unemployment.csv', parse_dates=['date'] , index_col='date'
data = pd.read_csv('./datasets/unrate_2000.csv', parse_dates=['date'] , index_col='date')
# Show first five rows of weekly series
print(data.asfreq('W').head())
# Show first five rows of weekly series with bfill option
print(data.asfreq('W', method='bfill').head())
# Create weekly series with ffill option and show first five rows
weekly_ffill = data.asfreq('W', method='ffill')
print(weekly_ffill.head())
# Plot weekly_ffill starting 2015 here
weekly_ffill.loc['2015':].plot()
plt.show()
Upsampling & interpolation with .resample()
Frequency conversion & transformation methods
.resample(): similar to .groupby()
- Groups data within resampling period and applies one or several methods to each group
- New date determined by offset: start, end, etc.
- Upsampling: fill from existing or interpolate values
- Downsampling: apply aggregation to existing data
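As a minimal sketch on toy data, .resample() behaves like a time-based .groupby(): it buckets observations into periods and then applies an aggregation or fill method to each bucket.
s = pd.Series(range(4), index=pd.date_range('2017-01-01', periods=4, freq='D'))
print(s.resample('2D').sum())   # two-day buckets: 0+1=1 and 2+3=5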
unrate = pd.read_csv('./datasets/unrate_2000.csv', parse_dates=['date'], index_col='date')
unrate.info()
display(unrate.head())
Quarterly real GDP growth
gdp = pd.read_csv('./datasets/gdp_growth.csv', parse_dates=['date'], index_col='date')
gdp.info()
gdp.head(5)
Interpolate monthly real GDP growth
gdp_1 = gdp.resample('MS').ffill().add_suffix('_ffill')
gdp_1.head(10)
.interpolate(): find points on a straight line between existing data
gdp_2 = gdp.resample('MS').interpolate().add_suffix('_inter')
gdp_2.head(10)
Plot interpolated real GDP growth
interpolate vs ffill
plt.style.use('ggplot')
pd.concat([gdp_1, gdp_2], axis=1).loc['2015':].plot()
Combine GDP growth & unemployment
gdp = pd.read_csv('./datasets/gdp_growth.csv', parse_dates=['date'], index_col='date')
gdp_inter = gdp.resample('MS').interpolate().add_suffix('_inter')
unrate = pd.read_csv('./datasets/unrate_2000.csv', parse_dates = ['date'], index_col='date')
display(unrate.head())
pd.concat([unrate, gdp_inter], axis=1).plot()
Interpolating is a useful way to create smoother time series when resampling.
Use interpolation to create weekly employment data
You have recently used the civilian US unemployment rate, and converted it from monthly to weekly frequency using simple forward or backfill methods.
Compare your previous approach to the new .interpolate() method that you learned about in this video. We have imported pandas as pd and matplotlib.pyplot as plt for you. We have also loaded the monthly unemployment rate from 2010 to 2016 into a variable monthly.
Instructions:
- Inspect monthly using .info().
- Create a pd.date_range() with weekly dates, using the .min() and .max() of the index of monthly as start and end, respectively, and assign the result to weekly_dates.
- Apply .reindex() using weekly_dates to monthly and assign the output to weekly.
- Create new columns 'ffill' and 'interpolated' by applying .ffill() and .interpolate() to weekly.UNRATE.
- Show a plot of weekly.
monthly = pd.read_csv('./datasets/unrate.csv', parse_dates=['DATE'], index_col='DATE')
print(monthly.info())
# Create weekly dates
weekly_dates = pd.date_range(start = monthly.index.min() , end = monthly.index.max(), freq='W')
# Reindex monthly to weekly data
weekly = monthly.reindex(weekly_dates)
# Create ffill and interpolated columns
weekly['ffill'] = weekly.UNRATE.ffill()
weekly['interpolated'] = weekly.UNRATE.interpolate()
display(weekly.loc['2011':].head())
# Plot weekly
weekly.plot()
plt.show()
Interpolate debt/GDP and compare to unemployment
Since you have learned how to interpolate time series, you can now apply this new skill to the quarterly debt/GDP series, and compare the result to the monthly unemployment rate.
Instructions:
- Use pd.read_csv() to import 'debt_unemployment.csv', creating a DateTimeIndex from the 'date' column using parse_dates and index_col, and assign the result to data. print() the .info() of the data.
- Apply .interpolate() to data and assign this to interpolated, then inspect the result.
- Plot interpolated with 'Unemployment' on the secondary_y axis.
data = pd.read_csv('./datasets/debt_unemployment.csv', parse_dates = ['date'], index_col = 'date')
print(data.info())
# Interpolate and inspect here
interpolated = data.interpolate()
print(interpolated.info())
# Plot interpolated data here
interpolated.plot(secondary_y = 'Unemployment')
plt.show()
ozone = pd.read_csv('./datasets/ozone_nyc.csv', parse_dates=['date'], index_col='date')
display(ozone.head())
ozone.info()
ozone = ozone.resample('D').asfreq()
display(ozone.head())
ozone.info()
Creating monthly ozone data
Examples:
.resample().mean(): monthly average, assigned to the end of the calendar month
display(ozone.resample('M').mean().head())
.resample().median(): monthly median, assigned to the end of the calendar month
display(ozone.resample('M').median().head())
.resample().agg(): a list of aggregation functions, like groupby
display(ozone.resample('M').agg(['mean', 'std']).head())
Plotting resampled ozone data
ozone = ozone.loc['2016':]
ax = ozone.plot()
monthly = ozone.resample('M').mean()
monthly.add_suffix('_monthly').plot(ax=ax)
plt.show()
Resampling multiple time series
data = pd.read_csv('./datasets/nyc.csv', parse_dates=['date'], index_col='date')
data = data.resample('D').asfreq()
display(data.head())
data.info()
data = data.resample('BM').mean() # 'BM' Business Month
data.info()
Compare weekly, monthly and annual ozone trends for NYC & LA
You have seen in the video how to downsample and aggregate time series on air quality.
First, you'll apply this new skill to ozone data for both NYC and LA since 2000 to compare the air quality trend at weekly, monthly and annual frequencies and explore how different resampling periods impact the visualization.
Instructions:
- Use pd.read_csv() to import 'ozone.csv' and set a DateTimeIndex based on the 'date' column using parse_dates and index_col, assign the result to ozone and inspect using .info().
- Apply .resample() with weekly frequency ('W') to ozone, aggregate using .mean() and plot the result.
- Repeat with monthly ('M') and annual ('A') frequencies, plotting each result.
ozone = pd.read_csv('./datasets/ozone_nyla.csv', parse_dates=['date'], index_col='date')
print(ozone.info())
# Calculate and plot the weekly average ozone trend
ozone.resample('W').mean().plot()
plt.title('Ozone Weekly')
plt.show()
# Calculate and plot the monthly average ozone trend
ozone.resample('M').mean().plot()
plt.title('Ozone Monthly')
plt.show()
# Calculate and plot the annual average ozone trend
ozone.resample('A').mean().plot()
plt.title('Ozone Annually')
plt.show()
Compare monthly average stock prices for Facebook and Google
Now, you'll apply your new resampling skills to daily stock price series for Facebook and Google for the 2015-2016 period to compare the trend of the monthly averages.
Instructions:
- Use pd.read_csv() to import 'stocks.csv' and set a DateTimeIndex based on the 'date' column using parse_dates and index_col, assign the result to stocks and inspect using .info().
- Create monthly_average by applying .resample() with monthly frequency to data, using .mean() to aggregate. Plot the result using subplots.
# stocks = pd.read_csv('stocks.csv', parse_dates=['date'], index_col='date')
stocks = pd.read_csv('./datasets/goog_fb.csv', parse_dates=['date'], index_col='date')
print(stocks.info())
# Calculate and plot the monthly averages
monthly_average = stocks.resample('M').mean()
monthly_average.plot(subplots=True)
plt.show()
Compare quarterly GDP growth rate and stock returns
With your new skill to downsample and aggregate time series, you can compare higher-frequency stock price series to lower-frequency economic time series.
As a first example, let's compare the quarterly GDP growth rate to the quarterly rate of return on the (resampled) Dow Jones Industrial index of 30 large US stocks.
GDP growth is reported at the beginning of each quarter for the previous quarter. To calculate matching stock returns, you'll resample the stock index to quarter-start frequency using the alias 'QS', aggregating with .first() to take the first observation of each quarter.
Instructions:
- Use pd.read_csv() to import 'gdp_growth.csv' and 'djia.csv', for both set a DateTimeIndex based on the 'date' column using parse_dates and index_col, and assign the results to gdp_growth and djia respectively, then inspect using .info().
- Resample djia using frequency alias 'QS', aggregate using .first(), and assign to djia_quarterly.
- Apply .pct_change() to djia_quarterly and .mul() by 100 to obtain djia_quarterly_return.
- Use pd.concat() to concatenate gdp_growth and djia_quarterly_return along axis=1, and assign to data. Rename the columns using .columns and the new labels 'gdp' and 'djia', then .plot() the results.
gdp_growth = pd.read_csv('./datasets/gdp_growth.csv', parse_dates=['date'], index_col='date')
print(gdp_growth.info())
# Import and inspect djia here
djia = pd.read_csv('./datasets/djia.csv', parse_dates=['date'], index_col='date')
print(djia.info())
# Calculate djia quarterly returns here
djia_quarterly = djia.resample('QS').first()
djia_quarterly_return = djia_quarterly.pct_change().mul(100)
# Concatenate, rename and plot djia_quarterly_return and gdp_growth here
data = pd.concat([gdp_growth, djia_quarterly_return], axis=1)
data.columns = ['gdp', 'djia']
data.plot()
plt.show()
Visualize monthly mean, median and standard deviation of S&P500 returns
You have also learned how to calculate several aggregate statistics from downsampled data.
Let's use this to explore how the monthly mean, median and standard deviation of daily S&P500 returns have trended over the last 10 years.
Instructions:
- Use pd.read_csv() to import 'sp500.csv', set a DateTimeIndex based on the 'date' column using parse_dates and index_col, assign the results to sp500, and inspect using .info().
- Convert sp500 to a pd.Series() using .squeeze(), and apply .pct_change() to calculate daily_returns.
- .resample() daily_returns to month-end frequency (alias: 'M'), and apply .agg() to calculate 'mean', 'median', and 'std'. Assign the result to stats.
- .plot() stats.
sp500 = pd.read_csv('./datasets/sp500.csv', parse_dates=['date'], index_col='date')
# Calculate daily returns here
daily_returns = sp500.squeeze().pct_change()
# Resample and calculate statistics
stats = daily_returns.resample('M').agg(['mean', 'median', 'std'])
# Plot stats here
stats.plot()
plt.show()
Now you're a pro at resampling time series!
Window Functions: Rolling & Expanding Metrics
This chapter will show you how to use window functions to calculate time series metrics for both rolling and expanding windows.
Rolling window functions with pandas
Rolling average air quality since 2010 for New York City
The last video was about rolling window functions. To practice this new tool, you'll start with air quality trends for New York City since 2010. In particular, you'll be using the daily Ozone concentration levels provided by the Environmental Protection Agency to calculate & plot the 90 and 360 day rolling average.
Instructions:
- Use pd.read_csv() to import 'ozone.csv', creating a DateTimeIndex from the 'date' column using parse_dates and index_col, and assign the result to data.
- Add the columns '90D' and '360D' containing the 90- and 360-calendar-day rolling .mean() of the column 'Ozone'.
- Plot data starting 2010, setting 'New York City' as title.
# data = pd.read_csv('ozone.csv', parse_dates = ['date'], index_col = 'date')
data = pd.read_csv('./datasets/ozone_nyc.csv', parse_dates = ['date'], index_col = 'date')
print(data.info())
# Calculate the 90-day and 360-day rolling means of the ozone series
data['90D'] = data['Ozone'].rolling(window='90D').mean()
data['360D'] = data['Ozone'].rolling(window='360D').mean()
# Plot data
data.loc['2010':].plot(title = 'New York City')
plt.show()
Do the different rolling windows help you see any long term trends that are hard to spot in the original data?
Rolling 360-day median & std. deviation for NYC ozone data since 2000
The last video also showed you how to calculate several rolling statistics using the .agg() method, similar to .groupby().
Let's take a closer look at the air quality history of NYC using the Ozone data you have seen before. The daily data are very volatile, so using a longer term rolling average can help reveal a longer term trend.
You'll be using a 360 day rolling window, and .agg() to calculate the rolling mean and standard deviation for the daily average ozone values since 2000.
Instructions:
- Use pd.read_csv() to import 'ozone.csv', creating a DateTimeIndex from the 'date' column using parse_dates and index_col, assign the result to data, and drop missing values using .dropna().
- Select the 'Ozone' column and create a .rolling() window using 360 periods, apply .agg() to calculate the mean and std, and assign this to rolling_stats.
- Use .join() to concatenate data with rolling_stats, and assign to stats.
- Plot stats using subplots.
plt.rcParams["figure.figsize"] = (8,8)
data = pd.read_csv('./datasets/ozone_nyc.csv', parse_dates=['date'], index_col='date').dropna()
# Calculate the rolling mean and std here
rolling_stats = data['Ozone'].rolling(360).agg(['mean', 'std'])
# Join rolling_stats with ozone data
stats = data.join(rolling_stats)
# Plot data
stats.plot(subplots=True)
plt.show()
How does adding the standard deviation help you understand what's happening in the original series?
Rolling quantiles for daily air quality in NYC
You learned in the last video how to calculate rolling quantiles to describe changes in the dispersion of a time series over time in a way that is less sensitive to outliers than using the mean and standard deviation.
Let's calculate rolling quantiles - at 10%, 50% (median) and 90% - of the distribution of daily average ozone concentration in NYC using a 360-day rolling window.
Instructions:
- Apply .resample() with daily frequency 'D' to data and apply .interpolate() to fill missing values, and reassign to data.
- Inspect the result using .info().
- Create a .rolling() window using 360 periods, select the column 'Ozone', and assign the result to rolling.
- Insert three new columns, 'q10', 'q50' and 'q90' into data, calculating the respective quantiles from rolling.
- Plot data.
data = data.resample('D').interpolate()
print(data.info())
# Create the rolling window
rolling = data['Ozone'].rolling(360)
# Insert the rolling quantiles into the ozone data
data['q10'] = rolling.quantile(0.1)
data['q50'] = rolling.quantile(0.5)
data['q90'] = rolling.quantile(0.9)
# Plot the data
data.plot()
plt.show()
The rolling quantiles help show the volatility of the series.
Expanding window functions with pandas
Cumulative sum vs .diff()
In the video, you have learned about expanding windows that allow you to run cumulative calculations.
The cumulative sum method has in fact the opposite effect of the .diff() method that you came across in chapter 1.
To illustrate this, let's use the Google stock price time series, create the differences between prices, and reconstruct the series using the cumulative sum.
Instructions:
- Apply .diff() to data, drop missing values, and assign the result to differences.
- Use .first('D') to select the first price from data, and assign it to start_price.
- Use .append() to combine start_price and differences, apply .cumsum() and assign this to cumulative_sum.
- Use .equals() to compare data and cumulative_sum, and print the result.
data = pd.read_csv('./datasets/google.csv', parse_dates=['Date'], index_col='Date').dropna()
differences = data.diff().dropna()
# Select start price
start_price = data.first('D')
# Calculate cumulative sum
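# Note: Series/DataFrame.append was removed in pandas 2.0; pd.concat([start_price, differences]).cumsum() is the modern equivalent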
cumulative_sum = start_price.append(differences).cumsum()
# Validate cumulative sum equals data
print(data.equals(cumulative_sum))
The .cumsum() method allows you to reconstruct the original data from the differences.
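Since this section is about expanding windows, here is a minimal sketch on toy data showing that .expanding() generalizes these cumulative calculations: .expanding().sum() matches .cumsum(), and other aggregations work the same way.
s = pd.Series([1.0, 2.0, 4.0])
print(s.expanding().sum())    # 1.0, 3.0, 7.0 -- same as s.cumsum()
print(s.expanding().mean())   # running mean: 1.0, 1.5, 2.33...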
Cumulative return on $1,000 invested in Google vs Apple I
To put your new ability to do cumulative return calculations to practical use, let's compare how much $1,000 would be worth if invested in Google ('GOOG') or Apple ('AAPL') in 2010.
Instructions:
- Define a variable investment with the value 1000.
- Calculate returns by applying .pct_change() to data.
- Add 1 to returns and assign this to returns_plus_one, then apply .cumprod() to returns_plus_one and assign the result to cumulative_return.
- Multiply cumulative_return by investment, and plot the result.
data = pd.read_csv('./datasets/data.csv', parse_dates=['Date'], index_col='Date').dropna()
investment = 1000
# Calculate the daily returns here
returns = data.pct_change()
# Calculate the cumulative returns here
returns_plus_one = returns + 1
cumulative_return = returns_plus_one.cumprod()
# Calculate and plot the investment return here
cumulative_return.mul(investment).plot()
plt.show()
Now let's take a look at the rolling annual returns on this investment.
Cumulative return on $1,000 invested in Google vs Apple II
Apple outperformed Google over the entire period, but this may have been different over various 1-year sub-periods, so that switching between the two stocks might have yielded an even better result.
To analyze this, calculate the cumulative return for rolling 1-year periods, and then plot the returns to see when each stock was superior.
Instructions:
We have already imported pandas as pd and matplotlib.pyplot as plt. We have also loaded the GOOG and AAPL close prices from the last exercise into data.
- Define a multi_period_return() function that returns the cumulative return from an array of period returns.
- Calculate daily_returns by applying .pct_change() to data.
- Create a '360D' .rolling() window on daily_returns, and .apply() multi_period_return. Assign the result to rolling_annual_returns.
- Plot rolling_annual_returns after multiplying it by 100.
import numpy as np
# Define a multi_period_return function
def multi_period_return(period_returns):
    return np.prod(period_returns + 1) - 1
# Calculate daily returns
daily_returns = data.pct_change()
# Calculate rolling_annual_returns
rolling_annual_returns = daily_returns.rolling('360D').apply(multi_period_return)
# Plot rolling_annual_returns
rolling_annual_returns.mul(100).plot()
plt.show()
Do you think it's better to invest in Google or Apple?
Case study: S&P500 price simulation
Random walk I
You have seen in the video how to generate a random walk of returns, and how to convert this random return series into a random stock price path.
In this exercise, you'll build your own random walk by drawing random numbers from the normal distribution with the help of numpy.
Instructions:
- Set seed to 42.
- Use normal to generate 2,500 random returns with the parameters loc=.001, scale=.01 and assign this to random_walk.
- Convert random_walk to a pd.Series object and reassign it to random_walk.
- Create random_prices by adding 1 to random_walk and calculating the cumulative product.
- Multiply random_prices by 1,000 and plot the result for a price series starting at 1,000.
from numpy.random import seed, normal
# Set seed here
seed(42)
# Create random_walk
random_walk = normal(loc=0.001, scale=0.01, size=2500)
# Convert random_walk to pd.series
random_walk = pd.Series(random_walk)
# Create random_prices
random_prices = random_walk.add(1).cumprod()
# Plot random_prices
random_prices.mul(1000).plot()
plt.show()
Random walk II
You have also seen how to create a random walk of returns by sampling from actual returns, and how to use this random sample to create a random stock price path.
In this exercise, you'll build a random walk using historical returns from Facebook's stock price, from its IPO through May 31, 2017. Then you'll simulate an alternative random price path in the next exercise.
Instructions:
We have already imported pandas as pd, choice and seed from numpy.random, seaborn as sns, and matplotlib.pyplot as plt. We have also imported the FB stock price series since IPO in May 2012 as the variable fb. Inspect this using .head().
- Set seed to 42.
- Apply .pct_change() to generate daily Facebook returns, drop missing values, and assign to daily_returns.
- Create a variable n_obs that contains the .count() of Facebook daily_returns.
- Use choice() to randomly select n_obs samples from daily_returns, and assign to random_walk.
- Convert random_walk to a pd.Series and reassign it to itself.
- Use sns.distplot() to plot the distribution of random_walk.
from numpy.random import choice, seed
import seaborn as sns
fb = pd.read_csv('./datasets/fb.csv', names=('date', 'price'),
index_col='date',
parse_dates=True)
display(fb.head())
display(fb.info())
seed(42)
# Calculate daily_returns here
# daily_returns = fb.pct_change().dropna()
daily_returns = fb.price.pct_change().dropna()
# Get n_obs
n_obs = daily_returns.count()
# Create random_walk
random_walk = choice(daily_returns, size=n_obs)
# Convert random_walk to pd.Series
random_walk = pd.Series(random_walk)
# Plot random_walk distribution
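# Note: sns.distplot is deprecated in recent seaborn versions; sns.histplot(random_walk, kde=True) is the modern equivalent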
sns.distplot(random_walk)
plt.show()
Get ready to finish up this random walk!
Random walk III
In this exercise, you'll complete your random walk simulation using Facebook stock returns over the last five years. You'll start off with a random sample of returns like the one you've generated during the last exercise and use it to create a random stock price path.
Instructions:
We have already imported pandas as pd, choice and seed from numpy.random, and matplotlib.pyplot as plt. We have loaded the Facebook price as a pd.DataFrame in the variable fb and a random sample of daily FB returns as pd.Series in the variable random_walk.
- Select the first Facebook price by applying .first('D') to fb.price, and assign to start.
- Add 1 to random_walk and reassign it to itself, then .append() random_walk to start and assign this to random_price.
- Apply .cumprod() to random_price and reassign it to itself.
- Insert random_price as new column labeled random into fb and plot the result.
random_walk = pd.read_csv('./datasets/random_walk.csv', header=None, index_col=0)
date_range = pd.date_range(start=fb.index[1], periods= len(fb) - 1, freq='B')
random_walk.index = date_range
random_walk = pd.Series(random_walk[1])
start = fb['price'].first('D')
# Add 1 to random walk and append to start
random_walk = random_walk.add(1)
random_price = start.append(random_walk)
# Calculate cumulative product here
random_price = random_price.cumprod()
# Insert into fb and plot
fb['random'] = random_price
fb.plot()
plt.show()
How did you like generating random stock prices?
Relationships between time series: correlation
Annual return correlations among several stocks
You have seen in the video how to calculate correlations, and visualize the result.
In this exercise, we have provided you with the historical stock prices for Apple (AAPL), Amazon (AMZN), IBM (IBM), WalMart (WMT), and Exxon Mobil (XOM) for the last 4,000 trading days from July 2001 until the end of May 2017.
You'll calculate the year-end returns, the pairwise correlations among all stocks, and visualize the result as an annotated heatmap.
Instructions:
We have already imported pandas as pd, seaborn as sns, and matplotlib.pyplot as plt. We have loaded the daily close price for the five stocks in a variable called data.
- Inspect using .info().
- Apply .resample() with year-end frequency (alias: 'A') to data and select the .last() price from each subperiod; assign this to annual_prices.
- Calculate annual_returns by applying .pct_change() to annual_prices.
- Calculate correlations by applying .corr() to annual_returns and print the result.
- Visualize correlations as an annotated sns.heatmap().
print(plt.style.available)
data = pd.read_csv('./datasets/5_stocks.csv', parse_dates=['Date'], index_col='Date')
plt.style.use('seaborn')
print(data.info())
# Calculate year-end prices here
annual_prices = data.resample('A').last()
# Calculate annual returns here
annual_returns = annual_prices.pct_change()
# Calculate and print the correlation matrix here
correlations = annual_returns.corr()
print(correlations)
display(correlations)
# Visualize the correlations as heatmap here
sns.heatmap(correlations, annot=True)
plt.show()
Heatmaps are a great way to visualize correlation matrices.
Putting it all together: Building a value-weighted index
This chapter combines the previous concepts by teaching you how to create a value-weighted index. This index uses market-cap data from the stock exchange listings to calculate weights, together with 2016 stock price information. Index performance is then compared against benchmarks to evaluate the performance of the index you created.
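Before working with the real listings, here is a minimal sketch with made-up numbers of what a value-weighted (market-cap weighted) index boils down to: multiply prices by shares outstanding, sum across components per period, and normalize to a base of 100.
# Toy sketch with assumed values (not the actual listings data)
toy_prices = pd.DataFrame({'A': [10.0, 11.0], 'B': [20.0, 22.0]})  # prices for two periods
toy_shares = pd.Series({'A': 100, 'B': 50})                        # shares outstanding
toy_market_cap = toy_prices.mul(toy_shares)       # market value per component and period
toy_index = toy_market_cap.sum(axis=1)            # aggregate market value per period
toy_index = toy_index.div(toy_index.iloc[0]).mul(100)
print(toy_index)                                  # 100.0, then 110.0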
Explore and clean company listing information
To get started with the construction of a market-value based index, you'll work with the combined listing info for the three largest US stock exchanges, the NYSE, the NASDAQ and the AMEX.
In this and the next exercise, you will calculate market-cap weights for these stocks.
We have already imported pandas as pd, and loaded the listings data set with listings information from the NYSE, NASDAQ, and AMEX. The column 'Market Capitalization' is already measured in USD mn.
Instructions:
- Inspect listings using .info().
- Move the column 'Stock Symbol' into the index (inplace).
- Drop all companies with missing 'Sector' information from listings.
- Select companies with IPO Year before 2019.
- Inspect the result of the changes you just made using .info().
- Show the number of companies per 'Sector' using .groupby() and .size(). Sort the output in descending order.
listings_nyse = pd.read_excel('./datasets/listings.xlsx', sheet_name='nyse', na_values='n/a')
listings_amex = pd.read_excel('./datasets/listings.xlsx', sheet_name='amex', na_values='n/a')
listings_nasdaq = pd.read_excel('./datasets/listings.xlsx', sheet_name='nasdaq', na_values='n/a')
listings_nyse['Exchange'] = 'nyse'
listings_amex['Exchange'] = 'amex'
listings_nasdaq['Exchange'] = 'nasdaq'
listings = pd.concat([listings_amex, listings_nasdaq, listings_nyse], axis=0)
listings.reset_index(inplace=True)
listings.drop(['index'], axis=1, inplace=True)
listings['Market Capitalization'] /= 1e6
print(listings.info())
# Move 'Stock Symbol' into the index
listings.set_index('Stock Symbol', inplace=True)
# Drop rows with missing 'Sector' data
listings.dropna(subset=['Sector'], inplace=True)
# Select companies with IPO Year before 2019
listings = listings.loc[listings['IPO Year'] < 2019]
# Inspect the new listings data
print(listings.info())
# Show the number of companies per sector
print(listings.groupby('Sector').size().sort_values(ascending=False))
Your data is squeaky clean now!
Select and inspect index components
Now that you have imported and cleaned the listings data, you can proceed to select the index components as the largest company for each sector by market capitalization.
You'll also have the opportunity to take a closer look at the components, their last market value, and last price.
Instructions:
We have already imported pandas as pd, and loaded the listings data with the modifications you made during the last exercise.
- Use .groupby() and .nlargest() to select the largest company by 'Market Capitalization' for each 'Sector', and assign the result to components.
- Print components, sorted in descending order by market cap.
- Select Stock Symbol from the index of components, assign it to tickers and print the result.
- Create a list info_cols that holds the column names Company Name, Market Capitalization, and Last Sale. Next, use .loc[] with tickers and info_cols to print() more details about the listings, sorted in descending order by Market Capitalization.
components = listings.groupby('Sector')['Market Capitalization'].nlargest(1)
# Print components, sorted by market cap
print(components.sort_values(ascending=False))
# Select stock symbols and print the result
tickers = components.index.get_level_values('Stock Symbol')
print(tickers)
# Print company name, market cap, and last price for each components
info_cols = ['Company Name', 'Market Capitalization', 'Last Sale']
print(listings.loc[tickers,info_cols].sort_values('Market Capitalization', ascending=False))
display(listings.loc[tickers,info_cols].sort_values('Market Capitalization', ascending=False))
You're ready for the next step in creating this index.
Import index component price information
Now you'll use the stock symbols for the companies you selected in the last exercise to calculate returns for each company.
Instructions:
We have already imported pandas as pd and matplotlib.pyplot as plt for you. We have also made the variable tickers available to you, which contains the Stock Symbol for each index component as a list.
- Print tickers to verify the content matches your result from the last exercise.
- Use pd.read_csv() to import 'stock_prices.csv', parsing the 'Date' column and also setting the 'Date' column as index before assigning the result to stock_prices. Inspect the result using .info().
- Calculate the price return for the index components by dividing the last row of stock_prices by the first, subtracting 1 and multiplying by 100. Assign the result to price_return.
- Plot a horizontal bar chart of the sorted returns with the title Stock Price Returns.
print(tickers)
# Import prices and inspect result
# stock_prices = pd.read_csv('stock_prices.csv', parse_dates=['Date'], index_col='Date')
stock_prices = pd.read_csv('./datasets/stocks_4.csv', parse_dates=['Date'], index_col='Date')
print(stock_prices.info())
# Calculate the returns
price_return = stock_prices.iloc[-1].div(stock_prices.iloc[0]).sub(1).mul(100)
# Plot horizontal bar chart of sorted price_return
price_return.sort_values().plot(kind='barh', title='Stock Price Returns')
plt.show()
Build a market-cap weighted index
Calculate number of shares outstanding
The next step towards building a value-weighted index is to calculate the number of shares for each index component.
The number of shares will allow you to calculate the total market capitalization for each component given the historical price series in the next exercise.
Instructions:
- Inspect listings and print tickers.
- Use .loc[] with the list of tickers to select the index components and the columns 'Market Capitalization' and 'Last Sale'; assign this to components.
- Print the first five rows of components.
- Create no_shares by dividing Market Capitalization by 'Last Sale'.
- Print no_shares in descending order.
listings = pd.read_csv('./datasets/listings.csv')
listings.set_index('Stock Symbol', inplace=True)
print(listings.info())
tickers = [ 'RIO',
'ILMN',
'CPRT',
'EL',
'AMZN',
'PAA',
'GS',
'AMGN',
'MA',
'TEF',
'AAPL',
'UPS']
print(listings.info())
print(tickers)
# Select components and relevant columns from listings
components = listings[['Market Capitalization', 'Last Sale']].loc[tickers]
# Print the first rows of components
print(components.head(5))
# Calculate the number of shares here
no_shares = components['Market Capitalization'].div(components['Last Sale'])
# Print the sorted no_shares
print(no_shares.sort_values(ascending=False))
Now you know which companies have the most shares.
Create time series of market value
You can now use the number of shares to calculate the total market capitalization for each component and trading date from the historical price series.
The result will be the key input to construct the value-weighted stock index, which you will complete in the next exercise.
Instructions:
- Select the 'Number of Shares' from components, assign to no_shares, and print the result, sorted in the default (ascending) order.
- Multiply stock_prices by no_shares to create a time series of market cap per ticker, and assign it to market_cap.
- Select the first and the last row of market_cap and assign these to first_value and last_value.
- Use pd.concat() to concatenate first_value and last_value along axis=1 and plot the result as horizontal bar chart.
components = pd.read_csv('./datasets/components.csv',index_col=[0])
print(components.info())
no_shares = components['Number of Shares']
print(no_shares.sort_values())
# Create the series of market cap per ticker
market_cap = stock_prices.mul(no_shares)
# Select first and last market cap here
first_value = market_cap.iloc[0]
last_value = market_cap.iloc[-1]
# Concatenate and plot first and last market cap here
pd.concat([first_value, last_value], axis=1).plot(kind='barh')
plt.show()
You've made one of the essential ingredients of the index.
Calculate & plot the composite index
By now you have all ingredients that you need to calculate the aggregate stock performance for your group of companies.
Use the time series of market capitalization that you created in the last exercise to aggregate the market value for each period, and then normalize this series to convert it to an index.
Instructions:
We have already imported pandas as pd and matplotlib.pyplot as plt for you. We have also loaded components and market_cap_series, which you worked with in the last exercise.
- Aggregate the market cap per trading day by applying .sum() to market_cap_series with axis=1, assign to raw_index and print the result.
- Normalize the aggregate market cap by dividing by the first value of raw_index and multiplying by 100. Assign this to index and print the result.
- Plot the index with the title 'Market-Cap Weighted Index'.
market_cap_series = pd.read_csv('./datasets/market_cap_series.csv', index_col=[0])
display(market_cap_series.info())
display(market_cap_series.head())
raw_index = market_cap_series.sum(axis=1)
print(raw_index)
# Normalize the aggregate market cap here
index = raw_index.div(raw_index.iloc[0]).mul(100)
print(index)
# Plot the index here
index.plot(title='Market-Cap Weighted Index')
plt.show()
Now you have an index to work with!
Evaluate index performance
Calculate the contribution of each stock to the index
You have successfully built the value-weighted index. Let's now explore how it performed over the 2010-2016 period.
Let's also determine how much each stock has contributed to the index return.
Instructions:
- Divide the last index value by the first, subtract 1 and multiply by 100. Assign the result to index_return and print it.
- Select the 'Market Capitalization' column from components.
- Calculate the total market cap for all components and assign this to total_market_cap.
- Divide the components' market cap by total_market_cap to calculate the component weights, assign it to weights, and print weights with the values sorted in default (ascending) order.
- Multiply weights by the index_return to calculate the contribution by component, sort the values in ascending order, and plot the result as a horizontal bar chart.
index_return = (index.iloc[-1] / index.iloc[0] - 1) * 100
print(index_return)
# Select the market capitalization
market_cap = components['Market Capitalization']
# Calculate the total market cap
total_market_cap = market_cap.sum()
# Calculate the component weights and print the result
weights = market_cap.div(total_market_cap)
print(weights.sort_values())
# Calculate and plot the contribution by component
weights.mul(index_return).sort_values().plot(kind='barh')
plt.show()
The next step is to take a look at how your index stacks up against a benchmark!
Compare index performance against benchmark I
The next step in analyzing the performance of your index is to compare it against a benchmark.
In the video, we used the S&P 500 as benchmark. You can also use the Dow Jones Industrial Average, which contains the 30 largest stocks, and would also be a reasonable benchmark for the largest stocks from all sectors across the three exchanges.
Instructions:
We have already imported pandas as pd, matplotlib.pyplot as plt for you. We have also loaded your index and the DJIA data into variables index and djia, respectively, both as a pd.Series().
- Convert index to a pd.DataFrame with the column name 'Index' and assign the result to data.
- Normalize djia to start at 100 and add it as new column to data.
- Show the total return for both index and djia by dividing the last row of data by the first, subtracting 1 and multiplying by 100.
- Show a plot of both of the series in data.
djia = pd.read_csv('./datasets/djia2.csv', parse_dates=['DATE'], index_col='DATE')['DJIA']
data = index.to_frame('Index')
# Normalize djia series and add as new column to data
djia = djia.div(djia.iloc[0]).mul(100)
data['DJIA'] = djia
# Print total return for both index and djia
print((data.iloc[-1] / data.iloc[0] - 1) * 100)
# Plot both series
data.plot()
plt.show()
How do they compare?
Compare index performance against benchmark II
The next step in analyzing the performance of your index is to compare it against a benchmark.
In the video, we used the S&P 500 as a benchmark. You can also use the Dow Jones Industrial Average, which contains the 30 largest stocks, and would also be a reasonable benchmark for the largest stocks from all sectors across the three exchanges.
Instructions:
- Inspect data and print the first five rows.
- Define a function multi_period_return that takes a numpy array of period returns as input, and returns the total return for the period. Use the formula from the video - add 1 to the input, pass the result to np.prod(), subtract 1 and multiply by 100.
- Create a .rolling() window of length '360D' from data, and apply multi_period_return. Assign to rolling_return_360.
- Plot rolling_return_360 using the title 'Rolling 360D Return'.
data.index = pd.to_datetime(data.index)
print(data.info())
print(data.head(5))
# Create multi_period_return function here
def multi_period_return(r):
    return (np.prod(r + 1) - 1) * 100
# Calculate rolling_return_360
rolling_return_360 = data.pct_change().rolling('360D').apply(multi_period_return)
# Plot rolling_return_360 here
rolling_return_360.plot(title='Rolling 360D Return')
plt.show()
How do the returns of your index compare to the Dow Jones?
display(data.tail(5))
Index correlation & exporting to Excel
Visualize your index constituent correlations
To better understand the characteristics of your index constituents, you can calculate the return correlations.
Use the daily stock prices of your index companies, and show a heatmap of the daily return correlations!
Instructions:
We have already imported pandas as pd, matplotlib.pyplot as plt, and seaborn as sns. We have also loaded the historical price series of your index constituents into the variable stock_prices.
- Inspect stock_prices using .info().
- Calculate the daily returns for stock_prices and assign the result to returns.
- Calculate the pairwise correlations for returns, assign them to correlations and print the result.
- Plot a seaborn annotated heatmap of the daily return correlations with the title 'Daily Return Correlations'.
stock_prices = pd.read_csv('./datasets/stocks_4.csv', parse_dates=['Date'], index_col='Date')
plt.style.use('seaborn')
print(stock_prices.info())
# Calculate the daily returns
returns = stock_prices.pct_change()
# Calculate and print the pairwise correlations
correlations = returns.corr()
display(correlations)
# Plot a heatmap of daily return correlations
sns.heatmap(correlations, annot=True)
plt.title('Daily Return Correlations')
plt.show()
Save your analysis to multiple excel worksheets
Now that you have completed your analysis, you may want to save all results into a single Excel workbook.
Let's practice exporting various DataFrames to multiple Excel worksheets.
Instructions:
We have already imported pandas as pd for you. We have also loaded both the historical price series of your index constituents into the variable stock_prices, and the index as index.
- Inspect both index and stock_prices using .info().
- Use .join() to combine index with stock_prices, and assign to data.
- Apply .pct_change() to data and assign to returns.
- Create a pd.ExcelWriter and use with to export data and returns to Excel, using sheet names matching the variable names.
index = index.to_frame('Index')
print(index.info())
print(stock_prices.info())
# Join index to stock_prices, and inspect the result
data = stock_prices.join(index)
print(data.info())
# Create index & stock price returns
returns = data.pct_change()
# Export data and returns to Excel
# Note: recent pandas versions no longer write legacy .xls files, so .xlsx is used here
with pd.ExcelWriter('data.xlsx') as writer:
    data.to_excel(writer, sheet_name='data')
    returns.to_excel(writer, sheet_name='returns')
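As a quick check (assuming the export above succeeded), the two worksheets can be read back into a dict of DataFrames keyed by sheet name:
saved = pd.read_excel('data.xlsx', sheet_name=['data', 'returns'], index_col=0)
print(saved['data'].head())
print(saved['returns'].head())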