Improving the Speed of YASR

5 minute read

Previous Performance of YASR

After adding a feature that shows users similarity scores for all new releases based on what they've previously listened to, generating these recommendations ended up taking nearly a minute. To solve this, I first tried some small improvements, such as removing unnecessary and costly DataFrame operations like merges. This helped only marginally; each recommendation still took about 55 seconds to create. After some additional analysis of my code, I realized that even though TensorFlow and my GPU were handling a lot of the calculations, much of the slowdown came from creating and storing DataFrames in Pandas so the recommendations could be used in Flask. Since Pandas only uses the CPU for DataFrame manipulation, I went searching for a way to manipulate DataFrames on my GPU instead. That search led me to cuDF and cuML, libraries created by RAPIDS that are designed to mimic Pandas and scikit-learn on NVIDIA GPUs. Woohoo!


Working with cuDF and cuML

If you followed my previous guide on setting up your NVIDIA GPU to use TensorFlow, then getting cuDF and cuML is pretty easy. RAPIDS has a handy tool that generates an install command after you select the options that match your machine. Click here to go to the tool. As you might see, RAPIDS has quite a few other tools, but I will be focusing on cuDF and cuML. After you have the packages installed, make sure to link the virtual environment you created to Jupyter; a guide for doing this can be found here. Now let's start using cuDF and cuML. To start off, import the two libraries as shown below:

import cudf
from cuml.metrics import pairwise_distances
from cuml.experimental.preprocessing import MinMaxScaler

Similar to how we only import what we need from scikit-learn, we also only want to import what we need from cuML. For YASR, we only need MinMaxScaler and pairwise_distances. As cuML is meant to mimic scikit-learn, most of the names for everything are exactly the same.

Now create a DataFrame in cuDF. You will notice the syntax is exactly the same as in Pandas.

new_features_df = cudf.read_csv('../data/new_track_features.csv')
new_features_df.head()


Now we will compare creating a DataFrame from a JSON file in Pandas and in cuDF. This demonstrates that cuDF has some limitations that Pandas does not. These limitations exist because storing data on a GPU works differently from storing it in system memory, so some data types do not work properly in cuDF.

Pandas:

import json
import pandas as pd

# load in user json
with open('/media/jesse/Number3/json/Jesse.p.tao.json') as f:
    data = json.load(f)
user_df = pd.DataFrame(data['items'])

track_name = []
artist_uri = []
artist_name = []
track_uri = []
genre = []

for i in range(len(user_df)):
    track_name.append(user_df.iloc[i]['track']['name'])
    track_uri.append('spotify:track:' + user_df.iloc[i]['track']['id'])
    artist_uri.append('spotify:artist:' + user_df.iloc[i]['track']['artists'][0]['id'])
    artist_name.append(user_df.iloc[i]['track']['artists'][0]['name'])
user_df['track_uri'] = track_uri
user_df['track_name'] = track_name
user_df['artist_uri'] = artist_uri
user_df['artist'] = artist_name
user_df.drop(['track', 'played_at', 'context'], axis = 1, inplace = True)

cuDF:

# load in user json
with open('/media/jesse/Number3/json/Jesse.p.tao.json') as f:
    data = json.load(f)
track_data = [i['track'] for i in data['items']]
user_df = cudf.DataFrame(track_data)
user_df['artists'] = [i['track']['artists'][0]['name'] for i in data['items']]
user_df.rename({'artists': 'artist', 'uri': 'track_uri', 'name': 'track_name'}, axis = 1, inplace = True)
user_df['artist_id'] = [i['track']['artists'][0]['id'] for i in data['items']]
user_df.drop(['album', 'external_urls', 'external_ids'], axis = 1, inplace = True)

The important thing to note is that the JSON file being used here is a dictionary of dictionaries. In Pandas, manipulating each of these nested dictionaries inside the DataFrame works fine; in cuDF, however, dictionaries are not a supported type, so we have to work with the values inside each nested dictionary instead. While working in cuDF you will likely notice that lists are not supported either, and applying user-defined functions to strings is currently unsupported as well. So if you are mainly working with numeric data and have an NVIDIA GPU, cuDF is a great option; when it comes to more complex data types and strings, sticking to Pandas might be more performant. Luckily, cuDF has a to_pandas method, so if you do have to modify a column of strings in your DataFrame, that is an option.

Another important thing to note is that series in cuDF are not iterable, due to the way they are stored, so you will need to use the to_pandas or to_arrow methods, or the values_host property, to iterate through your series. In my testing, I found that values_host was the most performant, being about 2 times faster than to_pandas or to_arrow.

Now let's look at the cuML side of things. cuML has far fewer caveats than cuDF and works nearly the same as scikit-learn; any code you've written in scikit-learn can easily be converted to work with cuML. The only scikit-learn feature I noticed missing from cuML is fit_transform, so you do need to make separate fit and transform calls for any sort of preprocessing. Other than that, everything else stays the same, and when calculating pairwise distances with cuML I measured about a 6x improvement over scikit-learn and about a 1.2x improvement over TensorFlow.

So, How Much Better are cuDF and cuML?

After converting all my code for YASR to use cuDF and cuML, I was able to cut the time to generate recommendations from one minute down to 20 seconds or less. While that's not the crazy 1200x speed improvement you see others boasting about when they make the switch, a 3x improvement is remarkable considering that, aside from some of the string-handling caveats I had to work through in cuDF, all I had to do was switch my import statements and modify a few lines of code, and I am very satisfied with it. I hope this guide has convinced you to switch from Pandas and scikit-learn to cuDF and cuML for your next project!