TL;DR: Run a short Python script in the IPython Shell to print a link from which you can download the dataset you are after.

Motivation: While studying DataCamp courses, there were many times when I could not reproduce the course results in my local environment with the provided datasets, which was quite frustrating. I reached out to the support team, but the results were not satisfactory. After hours of searching on Stack Overflow and many failed attempts with several scripts, I finally managed to download the dataset that the course was using.

I decided to share my workaround here. Hopefully, it helps anyone who is facing the same problem.

I have summarized how I did it in the following steps:

  • 1. Check the general info of the dataset
  • 2. Put the download script in script.py
  • 3. Extract the "href" link and download the data from it

Quick EDA

Run the following command to do a quick check on the data:

print(df.head())

                   y
2013-01-01  1.624345
2013-01-02 -0.936625
2013-01-03  0.081483
2013-01-04 -0.663558
2013-01-05  0.738023

then a quick look with df.info()

print(df.info())


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1000 entries, 2013-01-01 to 2015-09-27
Freq: D
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   y       1000 non-null   float64
dtypes: float64(1)
memory usage: 55.6 KB
None

Okay, the dataset has 1000 rows, a daily frequency (Freq: D), a single column y of dtype float64, and so on.

Here are the first 5 rows of data:

2013-01-01  1.624345
2013-01-02 -0.936625
2013-01-03  0.081483
2013-01-04 -0.663558
2013-01-05  0.738023

We will remember this, so we can check later once we download the dataset to our local environment.

Save the dataset to the cloud storage

Next, we will save the dataset from the DataFrame to a CSV file in the cloud storage. Try the following code in script.py.

CODE:

# Get the filename fullpath
from pathlib import Path

filename = "data.csv"
filename = Path.cwd() / filename
df.to_csv(filename)


Check the result by typing the following commands to the IPython Shell:

!pwd
!ls

Next, we run the following code to generate the HTML link data.

import base64
import pandas as pd
from IPython.display import HTML

# Read the saved CSV file back as raw bytes
with open(filename, "rb") as in_file:
    csv = in_file.read()
# print(csv) # Uncomment this if you want to check the csv content

title = "Download CSV file"  # text shown for the link
b64 = base64.b64encode(csv)  # encode the bytes as base64
payload = b64.decode()       # convert the base64 bytes to a str
html = '<a download="{filename}" href="data:text/csv;base64,{payload}" target="_blank">{title}</a>'
html = html.format(payload=payload, title=title, filename=filename)
# Print the link
print("data:text/csv;base64,{}".format(payload))
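If the browser does not open the printed link, the same string can also be decoded locally. The sketch below is only an illustration, not part of the original workflow: the helper name data_uri_to_frame is hypothetical, and the small sample frame stands in for the course data.

```python
import base64
import io

import pandas as pd


def data_uri_to_frame(data_uri):
    """Decode a 'data:text/csv;base64,...' string back into a DataFrame."""
    # Everything after the first comma is the base64 payload.
    header, payload = data_uri.split(",", 1)
    if not header.endswith(";base64"):
        raise ValueError("expected a base64-encoded data URI")
    csv_bytes = base64.b64decode(payload)
    # The first CSV column holds the index written by df.to_csv().
    return pd.read_csv(io.BytesIO(csv_bytes), index_col=0, parse_dates=True)


# Round trip on a tiny stand-in frame (not the course data).
sample = pd.DataFrame(
    {"y": [1.5, -0.25, 0.75]},
    index=pd.date_range("2013-01-01", periods=3, freq="D"),
)
uri = "data:text/csv;base64," + base64.b64encode(sample.to_csv().encode()).decode()
restored = data_uri_to_frame(uri)
print(restored)
```

To use it, copy the printed link from the Shell, paste it into a local string variable, and pass that string to the helper.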


Result:

Copy the link printed in the Shell output and paste it into the address bar of a new browser tab.


Voila! That is the dataset that we want.

Check the downloaded dataset in your local Jupyter notebook

Use this code:

df = pd.read_csv('./datasets/download.csv', parse_dates=True, index_col=[0])
df = df.asfreq('D')  # Set the frequency to daily
df.info()

then

df.head()

Results:


As you can see, the downloaded dataset looks exactly like what we saw in the quick EDA section. You can now freely run your experiment locally with no worries about preprocessing the raw datasets.
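The visual comparison can also be done in code. Here is a minimal sketch, assuming we noted the row count and the first few values during the quick EDA; the function name matches_eda and the tolerance are hypothetical choices, and the stand-in frame only contains the five rows printed earlier.

```python
import pandas as pd


def matches_eda(df, expected_rows, expected_head, column="y", tol=1e-6):
    """Return True if df has the expected row count and first values."""
    if len(df) != expected_rows:
        return False
    head = df[column].head(len(expected_head)).tolist()
    if len(head) != len(expected_head):
        return False
    # Compare with a tolerance because printed values are rounded.
    return all(abs(a - b) <= tol for a, b in zip(head, expected_head))


# First five values noted from print(df.head()) in the Shell.
noted_head = [1.624345, -0.936625, 0.081483, -0.663558, 0.738023]

# Stand-in for the downloaded data: only the five printed rows.
stand_in = pd.DataFrame(
    {"y": noted_head},
    index=pd.date_range("2013-01-01", periods=5, freq="D"),
)
check = matches_eda(stand_in, expected_rows=5, expected_head=noted_head)
print(check)
```

With the real download, we would pass the DataFrame loaded from the CSV file and expected_rows=1000 instead.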

Full Code

Here is the full code of the post. Put this script in script.py, and you should be able to download the targeted datasets.

import base64
import pandas as pd
from IPython.display import HTML
from pathlib import Path


def create_download_link( df, title = "Download CSV file", filename = "data.csv"):

    filename = Path.cwd() / filename
    df.to_csv(filename)

    in_file  = open(filename, "rb")
    csv = in_file.read()
    # print(csv) # Uncomment this if you want to check the csv content
    in_file.close()

    b64 = base64.b64encode(csv)
    payload = b64.decode()
    html = '<a download="{filename}" href="data:text/csv;base64,{payload}" target="_blank">{title}</a>'
    html = html.format(payload=payload,title=title,filename=filename)
    # print the link
    print("data:text/csv;base64,{}".format(payload))
    return HTML(html)

create_download_link(df)
print(df.info())

BONUS:  Here is another version of the code, modified to download NumPy arrays.

import base64
import pandas as pd
import numpy as np
from IPython.display import HTML
from pathlib import Path


def create_download_link( numpy_arr, title = "Download CSV file", filename = "data.csv"):
    # CONVERT numpy array to pandas frame. 
    df = pd.DataFrame(numpy_arr) 

    filename = Path.cwd() / filename
    df.to_csv(filename)

    in_file  = open(filename, "rb")
    csv = in_file.read()
    # print(csv) # Uncomment this if you want to check the csv content
    in_file.close()

    b64 = base64.b64encode(csv)
    payload = b64.decode()
    html = '<a download="{filename}" href="data:text/csv;base64,{payload}" target="_blank">{title}</a>'
    html = html.format(payload=payload,title=title,filename=filename)
    # print the link
    print("data:text/csv;base64,{}".format(payload))
    return HTML(html)

create_download_link(new_inputs)

Then we can reload the downloaded numpy array as follows.

new_inputs_df = pd.read_csv('./datasets/data.csv', index_col=[0])
new_inputs = new_inputs_df.to_numpy()
new_inputs = new_inputs.reshape((100,))  # CONVERT it back to its original shape (reshape returns a new array).

OTHER NOTE:   To see the source of a pre-defined function from script.py, use the following script:

import inspect
source = inspect.getsource(pre_defined_function_name)
print(source)

UPDATE (2022 Oct 13)

In the above methods, base64.b64encode seems to create a string that is too long when encoding a large table (for example, tables with more than 6500 rows). The download link becomes too long for the browser to load, so the data cannot be retrieved this way. Downloading the data in these cases is trickier, because we have to do it without base64 encoding. One could think of compressing the string to make it shorter, but that is not feasible for huge tables.
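To judge in advance whether a table is likely to hit this problem, we can measure how long the link would be. Base64 emits 4 characters for every 3 input bytes, so the payload is roughly one third longer than the raw CSV. The sketch below is an illustration only: the helper name data_uri_length is hypothetical, and the length a browser accepts varies, so no fixed limit is assumed.

```python
import base64

import pandas as pd

PREFIX = "data:text/csv;base64,"


def data_uri_length(df):
    """Return the character count of the base64 data URI for df as CSV."""
    csv_bytes = df.to_csv().encode()
    payload = base64.b64encode(csv_bytes)
    return len(PREFIX) + len(payload)


# Stand-in table with 1000 rows (not the course data).
sample = pd.DataFrame({"y": range(1000)})
raw_size = len(sample.to_csv().encode())
uri_size = data_uri_length(sample)
print("raw CSV bytes:", raw_size)
print("data URI characters:", uri_size)
```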

As an alternative, we can download the CSV data as a printed bytes string by:

  • Manually copying the data from the terminal
  • Saving it in .txt format
  • Finally, converting it back to its original form.

In detail:

  • STEP 1: Print the data in the Shell, then copy it into a "data.txt" file.
import pandas as pd
from pathlib import Path

def create_download_link(numpy_arr, filename = "data.csv"):
    # CONVERT numpy array to pandas frame. 
    df = pd.DataFrame(numpy_arr) 

    filename = Path.cwd() / filename
    df.to_csv(filename)

    with open(filename, "rb") as in_file:
        csv_data = in_file.read()

        # print("data:text/csv;charset=utf-8,{}".format(csv_data)) --> does not work, because the printed '\\n' is no longer an actual newline character.
        # print("data:text/plain;charset=utf-8,{}".format(csv_data)) # save the byte data as plain text
        print(csv_data, '\n\n\n') # if the browser cannot handle all the characters, we will paste the printed data into a .txt file directly

create_download_link(abnormal)
  • STEP 2: Load the "data.txt" file, and convert it back to normal csv format file.
import re, csv, os
file_path = './datasets/data.txt'

with open(file_path, "r") as f:
    plain_str = f.read()
    print(plain_str[:10], plain_str[-10:])

    # remove byte traces
    csv_data = re.match(r".*?b\'(.*)\'.*?", plain_str).group(1) 
    print("START:\n",csv_data[:250],"\nEND:", csv_data[-250:])

    # replace the escaped \\n with actual newline characters
    csv_data = csv_data.replace('\\n', '\n') 
    print("START:\n",csv_data[:250],"\nEND:", csv_data[-250:])

    # write the result to a file with the same name and a .csv extension
    with open(os.path.splitext(file_path)[0] + '.csv', 'w') as out:
        out.write(csv_data)
  • STEP 3: Load the csv data as usual, we can also convert it to numpy array if we want.
# Read in the data
df = pd.read_csv('./datasets/data.csv', index_col=[0])
print(df.info())
display(df.tail())

new_np_arr = df.to_numpy()
print(new_np_arr.shape)
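As a variation on STEP 2, the pasted text can be parsed as a Python bytes literal with ast.literal_eval instead of a regular expression, which also takes care of other escape sequences such as \\r or \\'. This is a sketch under the assumption that the .txt file contains exactly what print(csv_data, ...) wrote to the Shell; the helper name bytes_repr_to_text is hypothetical.

```python
import ast


def bytes_repr_to_text(plain_str, encoding="utf-8"):
    """Parse the pasted output of print(bytes) back into CSV text."""
    # Drop the trailing blank lines and spaces added when printing.
    literal = plain_str.strip()
    # literal_eval only evaluates Python literals, so b'...' becomes bytes.
    value = ast.literal_eval(literal)
    if not isinstance(value, bytes):
        raise ValueError("expected a bytes literal such as b'...'")
    return value.decode(encoding)


# Stand-in for the content copied from the Shell (not the course data).
original = b",0\n0,1.5\n1,-2.0\n"
pasted = repr(original) + " \n\n\n\n"
csv_text = bytes_repr_to_text(pasted)
print(csv_text)
```

The returned text can then be written to a .csv file and loaded with pd.read_csv as in STEP 3.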

That's it. Thank you for reading. If you find the blog post useful, please give the GitHub blog repo a star to show your support and share it with others. Also, please let me know in the comments section of the post if you have any questions.