We’ve all been in that situation where we are looking for some new music to listen to, but what Spotify recommends just doesn’t cut it. My goal is to build a recommender that takes all the new releases from Spotify and recommends songs from artists that aren’t well known. This saves a lot of the time otherwise spent manually searching through Spotify’s massive library just to find something interesting. I will demonstrate the different methods I used to do this and give a link to a live demo of the recommender at the end.
As I’m using TensorFlow to analyze audio files on an NVIDIA RTX 3090, along with a myriad of other libraries, I set up a virtual environment to ensure things go smoothly. I will outline the steps to do this on Ubuntu 20.04 below. The first thing you will want to do is install Python 3.8 and the Python 3.8 venv module. To do this, simply enter the following on the command line:
sudo apt-get install python3.8 python3.8-venv python3-venv
Now set up your virtual environment. I use Jupyter notebooks for most of my visualizations, so the commands below include a few extra steps to register the virtual environment as a Jupyter kernel, but otherwise these are the general steps for creating a virtual environment on the command line.
python3.8 -m venv /path/to/venv
source /path/to/venv/bin/activate
pip install ipykernel
python -m ipykernel install --user --name=nameofvenv
If it is not already active, activate your virtual environment by typing the following in the command line:
source /path/to/venv/bin/activate
Navigate to the code folder and install all the packages needed by running:
pip install -r requirements.txt
This may take a while as TensorFlow and some of the other packages are quite large.
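To confirm that the packages went into the virtual environment rather than the system Python, a quick check like the one below can help. This is a minimal sketch using only the standard library; the function name `in_virtualenv` is my own and not part of the project code.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside a virtual environment:", in_virtualenv())
```

If the second line prints False, the environment is probably not activated in the current shell.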
You will now need to install the NVIDIA drivers to make sure all GPU-dependent code runs smoothly. Start with the base NVIDIA driver. To do this, first check the newest NVIDIA driver your graphics card supports by entering on the command line:
ubuntu-drivers devices
As of writing, the newest driver available for my graphics card is nvidia-driver-460. Do note that to run TensorFlow with your GPU you will need nvidia-driver-450 or higher. To install the driver, run the following:
sudo apt-get install nvidia-driver-4xx
Replace the x's with whatever your newest version is. Once the drivers finish installing, reboot your system using:
sudo reboot
After your system has come back up, you will need to install the CUDA Toolkit and cuDNN. Unless you absolutely know what you are doing, I recommend going with CUDA 11.0 and cuDNN 8.0.5. While newer versions might work, these are the most stable versions for TensorFlow at the time of writing. As NVIDIA has great documentation for installing these tools, I will not go over them here; links to installing CUDA Toolkit and cuDNN are below.
Now you should be able to run through the notebooks without any issues. It is recommended you go through the notebooks in the code folder in sequential order to understand the entire process. The app folder contains the code required to deploy the project to the web using Flask.
I gathered data from three main sources for this project: Spotify, Everynoise, and YouTube Music. To start off, I pulled the week's new releases from Everynoise, a great site that also tracks the different genres of music on Spotify. Next, I used the SpotDL library to download the audio files from YouTube Music, both for the new releases and for the songs I personally listened to, as the Spotify API does not allow downloading tracks directly from Spotify. I also used the Spotify API to get the audio features for all tracks and the popularity of their artists.
The EDA part of this project was not very involved as I only needed to see what the distribution of the popularity of artists on Spotify was and take a quick look at what audio features Spotify’s API includes. Below is a histogram showing this distribution.
Audio File Modeling
The first approach I took was to compare the audio files of all the new releases to the tracks the user has listened to. The first step in this process was to look at the raw audio signals of each song to see if anything was obviously different. Below is the code used to compare the raw audio signals of 2 tracks, as well as a visualization of them.
import os

import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_io as tfio

# loading in the audio
user_audio = tfio.audio.AudioIOTensor('/media/jesse/Number4/user_tracks/(K)NoW_NAME - KNOCK on the CORE.mp3')
user_audio_slice = user_audio[0:]
audio_test = tfio.audio.AudioIOTensor('/media/jesse/Number2/tracks/ - 50 Razones - Tony Aguirre.mp3')
audio_test_slice = audio_test[0:]

# plotting raw audio waveforms
os.chdir('/home/jesse/dsir-1116/projects/capstone/code')
user_tensor = tf.cast(user_audio_slice, tf.float32)
test_tensor = tf.cast(audio_test_slice, tf.float32)
fig, ax = plt.subplots(1, 2, figsize = (20, 12))
# each panel shows the element-wise maximum of the two stereo channels
ax[0].plot(tf.math.maximum(user_tensor.numpy()[:, 0], user_tensor.numpy()[:, 1]))
ax[1].plot(tf.math.maximum(test_tensor.numpy()[:, 0], test_tensor.numpy()[:, 1]))
fig.suptitle('Audio Waveforms', fontweight = 'bold', fontsize = 32)
ax[0].set_xlabel('Time', fontsize = 16)
ax[0].set_ylabel('Amplitude', fontsize = 16)
ax[1].set_xlabel('Time', fontsize = 16)
ax[1].set_ylabel('Amplitude', fontsize = 16)
plt.savefig('../images/raw_audio_signal.png');
In the raw waveforms, we can not really see anything that distinguishes the 2 signals. One important thing we can see in these signals is the silence that is common at the beginning and end of songs. As we have no interest in comparing signals with no sound, we trim the beginning and end of each song where the amplitude of the signal is small. Below is the code to do this and a visualization of it.
# take the maximum signal of each audio channel
user_tensor = tf.math.maximum(user_tensor[:, 0].numpy(), user_tensor[:, 1].numpy())
test_tensor = tf.math.maximum(test_tensor[:, 0].numpy(), test_tensor[:, 1].numpy())

# trimming insignificant noise; trim returns the [start, stop] positions to keep
user_audio_slice = tfio.experimental.audio.trim(user_tensor, axis = 0, epsilon = 0.1)
test_audio_slice = tfio.experimental.audio.trim(test_tensor, axis = 0, epsilon = 0.1)
user_start = user_audio_slice[0]
user_end = user_audio_slice[1]
test_start = test_audio_slice[0]
test_end = test_audio_slice[1]
user_tensor = user_tensor[user_start:user_end]
test_tensor = test_tensor[test_start:test_end]

# plotting trimmed audio
fig, ax = plt.subplots(1, 2, figsize = (20, 12))
ax[0].plot(user_tensor)
ax[1].plot(test_tensor)
fig.suptitle('Trimmed Audio Waveforms', fontweight = 'bold', fontsize = 32)
ax[0].set_xlabel('Time', fontsize = 16)
ax[0].set_ylabel('Amplitude', fontsize = 16)
ax[1].set_xlabel('Time', fontsize = 16)
ax[1].set_ylabel('Amplitude', fontsize = 16)
plt.savefig('../images/trimmed_audio_signal.png');
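To make the trimming idea concrete, here is a small plain-Python sketch on a toy signal. It is a simplified illustration, not tensorflow-io's implementation: the helper `trim_silence` and its use of the absolute amplitude are my own assumptions.

```python
def trim_silence(samples, epsilon=0.1):
    """Return (start, stop) so that samples[start:stop] drops the quiet edges."""
    # Indices of every sample whose absolute amplitude exceeds the threshold.
    loud = [i for i, s in enumerate(samples) if abs(s) > epsilon]
    if not loud:
        # Nothing is loud enough: keep an empty slice.
        return 0, 0
    # Keep everything between the first and last loud sample, inclusive.
    return loud[0], loud[-1] + 1


signal = [0.0, 0.02, -0.05, 0.4, -0.7, 0.05, 0.3, 0.01, 0.0]
start, stop = trim_silence(signal)
trimmed = signal[start:stop]
print(start, stop, trimmed)
```

Note that quiet samples in the middle of the track (the 0.05 here) are kept; only the leading and trailing quiet stretches are removed.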
As these raw audio signals are not very useful on their own, I moved on to using Fourier transforms to change these signals from the time domain to the frequency domain. Specifically, I used the Short-Time Fourier Transform (STFT) to convert signals to the frequency domain while keeping some of the time component using windowing. This helps us see where frequencies change quickly in a song, such as when there is a bass drop. Below is the code to do STFTs and an image of our audio signals transformed using STFT.
import librosa as lr
import librosa.display

# short-time Fourier transforms; tf.abs keeps the magnitude of each complex bin
stft_user = tf.signal.stft(user_tensor, frame_length = 1024, frame_step = 512)
spectrograms_user = tf.abs(stft_user)
stft_test = tf.signal.stft(test_tensor, frame_length = 1024, frame_step = 512)
spectrograms_test = tf.abs(stft_test)

# transpose so that frequency runs along the vertical axis
U = spectrograms_user.numpy().T
T = spectrograms_test.numpy().T

plt.figure(figsize = (15, 8))
plt.subplot(2, 1, 1)
lr.display.specshow(U, sr = user_audio.rate.numpy(), y_axis = 'linear', x_axis = 'time')
plt.colorbar()
plt.xlabel('Time')
plt.ylabel('Hz')
plt.title('Magnitude Spectrogram')
plt.subplot(2, 1, 2)
lr.display.specshow(T, sr = audio_test.rate.numpy(), y_axis = 'linear', x_axis = 'time')
plt.colorbar()
plt.xlabel('Time')
plt.ylabel('Hz')
plt.title('Magnitude Spectrogram')
plt.tight_layout()
_ = plt.savefig('../images/magnitude_spectrogram.png');
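For readers who want to see what an STFT does under the hood, the sketch below computes a magnitude spectrogram in plain Python: slice the signal into overlapping frames, apply a Hann window, and take a discrete Fourier transform of each frame. The tiny frame sizes and the helper names are assumptions for illustration; real audio needs an FFT-based routine such as the one used above.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft_magnitude(samples, frame_length=8, frame_step=4):
    """Magnitude spectrogram: one list of frame_length // 2 + 1 bins per frame."""
    window = hann(frame_length)
    n_bins = frame_length // 2 + 1
    frames = []
    for start in range(0, len(samples) - frame_length + 1, frame_step):
        frame = [samples[start + i] * window[i] for i in range(frame_length)]
        spectrum = []
        for k in range(n_bins):
            # Naive DFT of one windowed frame; fine for a demo, too slow for real audio.
            total = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_length)
                        for n in range(frame_length))
            spectrum.append(abs(total))
        frames.append(spectrum)
    return frames


# A pure tone that completes exactly 2 cycles per 8-sample frame.
tone = [math.sin(2 * math.pi * 2 * n / 8) for n in range(32)]
spec = stft_magnitude(tone)
peak_bins = [frame.index(max(frame)) for frame in spec]
print(len(spec), "frames, peak bin per frame:", peak_bins)
```

Every frame peaks at bin 2, which is the tone's frequency; a song would instead show the peak moving from frame to frame as the music changes.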
In the magnitude spectrograms we can’t really tell much, as most of our signal lies below 5000 Hz. To solve this issue we log-scale the magnitudes (converting them to decibels) to make our features much more pronounced. Here is the code and a figure showing the result of this.
import numpy as np

plt.figure(figsize = (15, 8))
plt.subplot(2, 1, 1)
# amplitude_to_db expresses each magnitude in decibels relative to the maximum
lr.display.specshow(lr.amplitude_to_db(U, ref = np.max), sr = user_audio.rate.numpy(), y_axis = 'linear', x_axis = 'time', hop_length = 512)
plt.colorbar(format = '%+2.0f dB')
plt.xlabel('Time')
plt.ylabel('Hz')
plt.title('Log-Magnitude Spectrogram')
plt.subplot(2, 1, 2)
lr.display.specshow(lr.amplitude_to_db(T, ref = np.max), sr = audio_test.rate.numpy(), y_axis = 'linear', x_axis = 'time', hop_length = 512)
plt.colorbar(format = '%+2.0f dB')
plt.xlabel('Time')
plt.ylabel('Hz')
plt.title('Log-Magnitude Spectrogram')
plt.tight_layout()
_ = plt.savefig('../images/log_magnitude_spectrogram.png');
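The decibel conversion itself is a short formula: 20 times the base-10 logarithm of each magnitude divided by a reference value. The sketch below is a plain-Python approximation of that idea, not librosa's code; the `amin` and `top_db` defaults are assumptions chosen to resemble a typical setup.

```python
import math


def amplitude_to_db(magnitudes, amin=1e-5, top_db=80.0):
    """Convert magnitudes to decibels relative to the largest value."""
    # The reference is the loudest magnitude, so the loudest value maps to 0 dB.
    ref = max(max(magnitudes), amin)
    # Clamp tiny values to amin so the logarithm never sees zero.
    db = [20.0 * math.log10(max(m, amin) / ref) for m in magnitudes]
    # Limit the dynamic range to top_db below the peak.
    floor = max(db) - top_db
    return [max(value, floor) for value in db]


magnitudes = [1000.0, 100.0, 1.0, 0.001, 0.0]
db_values = amplitude_to_db(magnitudes)
print(db_values)
```

A magnitude ten times smaller is only 20 dB lower, which is why quiet details that vanish on a linear scale become visible after the conversion.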
Now that we have a signal that is much more workable, we convert it into Mel-Frequency Cepstral Coefficients (MFCCs), which group the frequencies into a small number of bins. This reduces the dimensionality of our data and so shortens our training times. Below is the code to convert two audio signals to MFCCs and a visualization of the result.
# Warp the linear scale spectrograms into the mel-scale.
user_new_num_spectrogram_bins = stft_user.shape[-1]
lower_edge_hertz, upper_edge_hertz, num_mel_bins = 20.0, 15000.0, 12
user_new_linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(num_mel_bins, user_new_num_spectrogram_bins, user_audio.rate.numpy(), lower_edge_hertz, upper_edge_hertz)
user_new_mel_spectrograms = tf.tensordot(spectrograms_user, user_new_linear_to_mel_weight_matrix, 1)
user_new_mel_spectrograms.set_shape(spectrograms_user.shape[:-1].concatenate(user_new_linear_to_mel_weight_matrix.shape[-1:]))
# Compute a stabilized log to get log-magnitude mel-scale spectrograms.
user_new_log_mel_spectrograms = tf.math.log(user_new_mel_spectrograms + 1e-6)
# Compute MFCCs from log_mel_spectrograms (12 mel bins give 12 coefficients).
user_mfccs = tf.signal.mfccs_from_log_mel_spectrograms(user_new_log_mel_spectrograms)

# Warp the linear scale spectrograms into the mel-scale.
test_new_num_spectrogram_bins = stft_test.shape[-1]
lower_edge_hertz, upper_edge_hertz, num_mel_bins = 20.0, 15000.0, 12
test_new_linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(num_mel_bins, test_new_num_spectrogram_bins, audio_test.rate.numpy(), lower_edge_hertz, upper_edge_hertz)
test_new_mel_spectrograms = tf.tensordot(spectrograms_test, test_new_linear_to_mel_weight_matrix, 1)
test_new_mel_spectrograms.set_shape(spectrograms_test.shape[:-1].concatenate(test_new_linear_to_mel_weight_matrix.shape[-1:]))
# Compute a stabilized log to get log-magnitude mel-scale spectrograms.
test_new_log_mel_spectrograms = tf.math.log(test_new_mel_spectrograms + 1e-6)
# Compute MFCCs from log_mel_spectrograms (12 mel bins give 12 coefficients).
test_mfccs = tf.signal.mfccs_from_log_mel_spectrograms(test_new_log_mel_spectrograms)

plt.figure(figsize = (15, 8))
plt.subplot(2, 1, 1)
lr.display.specshow(user_mfccs.numpy().T, x_axis = 'time')
plt.colorbar()
plt.xlabel('Time')
plt.title('MFCC')
plt.subplot(2, 1, 2)
lr.display.specshow(test_mfccs.numpy().T, x_axis = 'time')
plt.colorbar()
plt.xlabel('Time')
plt.tight_layout()
_ = plt.savefig('../images/mfcc.png');
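To give a feel for what warping into the mel scale means, the sketch below computes where the edges of 12 mel bands between 20 Hz and 15000 Hz would fall. It uses the common HTK-style mel formula and assumes the bands are equally spaced in mel; the function names are my own, and TensorFlow's exact weight matrix may differ in detail.

```python
import math


def hz_to_mel(hz):
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(num_mel_bins=12, lower_hz=20.0, upper_hz=15000.0):
    """Return num_mel_bins + 2 frequencies equally spaced on the mel scale."""
    low, high = hz_to_mel(lower_hz), hz_to_mel(upper_hz)
    step = (high - low) / (num_mel_bins + 1)
    # Band i spans edges[i] to edges[i + 2] and is centered on edges[i + 1].
    return [mel_to_hz(low + i * step) for i in range(num_mel_bins + 2)]


edges = mel_band_edges()
widths = [b - a for a, b in zip(edges, edges[1:])]
print([round(e) for e in edges])
```

The bands are narrow at low frequencies and wide at high frequencies, so hundreds of STFT bins collapse into a dozen values per frame while keeping the most detail where most of the signal lies.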
After converting the audio files to a form that can be worked with, I took the cosine similarities between each of the audio signals. The main issue with using the cosine similarity, however, is that audio signals must be trimmed further, as the cosine similarity can only be taken between two signals of the same length. Due to this additional step, it took 2 hours just to calculate the cosine similarities of 1 song to all the new releases. Below is a function that can be used to compare the audio signals.
import numpy as np
import tensorflow as tf
import tensorflow_io as tfio

def compare_audio(user, new_release):
    # load audio files as Tensors
    user_audio = tfio.audio.AudioIOTensor(user)
    user_audio_tensor = user_audio.to_tensor()
    # tf.linalg.normalize returns (normalized tensor, norm); keep the tensor
    user_audio_tensor = tf.linalg.normalize(user_audio_tensor)[0]
    new_audio = tfio.audio.AudioIOTensor(new_release)
    new_audio_tensor = new_audio.to_tensor()
    new_audio_tensor = tf.linalg.normalize(new_audio_tensor)[0]

    # trim beginning and end of audio where there is no significant noise;
    # trim returns [start, stop], reduced here to a single pair across channels
    user_audio_trim = tfio.experimental.audio.trim(user_audio_tensor, axis = 0, epsilon = 0.0001)
    new_audio_trim = tfio.experimental.audio.trim(new_audio_tensor, axis = 0, epsilon = 0.0001)
    start_user = tf.reduce_min(user_audio_trim[0])
    end_user = tf.reduce_max(user_audio_trim[1])
    start_new = tf.reduce_min(new_audio_trim[0])
    end_new = tf.reduce_max(new_audio_trim[1])
    user_audio_tensor = user_audio_tensor[start_user:end_user]
    new_audio_tensor = new_audio_tensor[start_new:end_new]

    # check size of both songs and center-crop the longer one to match the shorter
    user_len = user_audio_tensor.shape[0]
    new_len = new_audio_tensor.shape[0]
    if user_len > new_len:
        user_start = (user_len // 2) - (new_len // 2)
        user_end = (user_len // 2) + int(np.ceil(new_len / 2))
        user_audio_tensor = user_audio_tensor[user_start:user_end]
    elif user_len < new_len:
        new_start = (new_len // 2) - (user_len // 2)
        new_end = (new_len // 2) + int(np.ceil(user_len / 2))
        new_audio_tensor = new_audio_tensor[new_start:new_end]

    # create spectrograms for each song
    user_spectrogram = tf.signal.stft(tf.math.maximum(user_audio_tensor[:, 0], user_audio_tensor[:, 1]), frame_length = 2048, frame_step = 2048)
    new_spectrogram = tf.signal.stft(tf.math.maximum(new_audio_tensor[:, 0], new_audio_tensor[:, 1]), frame_length = 2048, frame_step = 2048)

    # take absolute value of spectrograms
    user_spectrogram = tf.abs(user_spectrogram)
    new_spectrogram = tf.abs(new_spectrogram)

    # calculate and return cosine similarity
    # (this Keras loss returns the negative cosine similarity, averaged over frames,
    # so scores closer to -1 mean the songs are more similar)
    cosine_loss = tf.keras.losses.CosineSimilarity(axis = -1)
    score = cosine_loss(user_spectrogram, new_spectrogram).numpy()
    return score
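The two ideas in that function, center-cropping the longer signal and then taking a cosine similarity, can be seen in isolation in the plain-Python sketch below. The helper names and toy signals are hypothetical, and unlike the Keras loss this version returns the similarity itself, so 1.0 means identical direction.

```python
import math


def center_crop(values, length):
    """Keep `length` items from the middle of `values`."""
    start = (len(values) - length) // 2
    return values[start:start + length]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A silent signal has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def compare(a, b):
    """Crop both signals to the shorter length, then compare them."""
    n = min(len(a), len(b))
    return cosine_similarity(center_crop(a, n), center_crop(b, n))


long_track = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]
short_track = [2.0, 4.0, 6.0, 4.0, 2.0]
score = compare(long_track, short_track)
print(score)
```

The short track is just a louder copy of the middle of the long one, so the score is 1.0: cosine similarity ignores overall volume, but it does require the two signals to line up sample for sample.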
To solve the issues of the cosine similarity, I moved on to using Dynamic Time Warping (DTW). This allowed for comparisons between audio signals of different lengths and cut the time needed to compare a user’s song to the new releases down to 1 hour. Of course, at an hour per song, users probably don’t want to wait around for 50 hours to get a new music recommendation, so I had to figure out a different way to compare songs.
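For reference, here is a textbook DTW distance in plain Python. It is a minimal sketch of the general technique, not the implementation used in this project, and the function name and toy sequences are my own.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the cheapest alignment of the first i items of a
    # with the first j items of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a only, step in b only, step in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


slow = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 0.0, 0.0]
fast = [0.0, 1.0, 2.0, 1.0, 0.0]
other = [2.0, 2.0, 2.0, 2.0, 2.0]
print(dtw_distance(slow, fast), dtw_distance(fast, other))
```

The slow and fast sequences have the same shape at different speeds, so their distance is 0 even though their lengths differ. The nested loops also show why DTW is expensive: the work grows with the product of the two lengths, which adds up quickly for full-length songs.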
That does it for part 1 of Yet Another Spotify Recommender. In part 2, I will go over how I used the Spotify API’s audio features to make the recommender much more performant, as well as how I deployed my recommender to the internet using Flask. Click here for Part 2.