Control Spotify with Hand Gestures

Marcus Yatim
7 min read · May 23, 2021


Haven’t you always wanted to control Spotify with your hands?

  • Yes, how else would I use my mouse and keyboard?

No, no. I mean wouldn’t you like to wave hand gestures in front of a webcam to open up the Spotify app or to skip to the next song?

  • Err, not really.

You could look like Iron Man controlling his computer?

Credits: The Avengers, Marvel Studios
  • Oh snap, now we’re talking!

Full disclosure, we won’t be reaching the absurd standards of Tony Stark’s brilliance in this Medium article. The day that such a technology becomes a commonplace reality, we might even be looking at a nigh invincible full-bodied flying iron suit with lasers. At which point, we’d best heed the wise words of Jeff Goldblum’s character in Jurassic Park:

“Yeah, but your scientists were so preoccupied with whether or not they could, that they didn’t stop to think if they should.”

In anticipation of that, this article shall not dive too deeply into technical programming and computer science shenanigans. I shall instead only give a light read of the topic at hand (and perhaps this is just a clever excuse to do less writing. Who knows!). That topic being human-computer interaction with a vision-based hand gesture recognition system. My apologies to the students or researchers looking this up for quick answers. It is, however, my sincere hope that this article could inspire an interest in the topic, or perhaps point the kind reader in the right direction to continue with more research. Therefore, this article is for readers of all coding levels. Also, a disclaimer that this article only presents one of many innovative solutions possible. It is by no means a gold standard or the only way to go about the problem.

So what exactly are we trying to achieve? Basically, we are going to create an intelligent system to recognise single hand gestures as well as sequences of hand gestures to control your computer’s Spotify music app. This is a project that I did with a couple of mates. Shoutout to Teoh Yee Seng and Lim Wee Kiat! You can check out this video demo to know exactly what I’m talking about.

Be sure to LIKE and SUBSCRIBE!

Ah, you’re still reading? Excellent, with that, let us begin!

Prerequisites

Firstly, we’ll cover some prerequisites. These are for developers looking to recreate the system, as the task will involve a lot of technical knowledge. Python familiarity is a definite must as the whole system will be written in that language. Further familiarity with the modules NumPy, Pandas and OpenCV will also be beneficial, as these are used extensively in preprocessing the video inputs. Machine learning and neural net knowledge (along with the Scikit-learn library for Python) rounds off the list. Fret not if you do not meet these prerequisites or are not a developer. This article will do its best to give a high-level overview and explanation so that all readers can enjoy the journey. Of course, I highly encourage further reading/learning into the topics that interest you.

Dataset

First, let us define our dataset. You will create videos of various hand gestures of your own design and creativity. For example, in the project I was involved in, I defined nine sequences, each consisting of three hand gestures. I then filmed myself performing each of the nine sequences, not all at once, but one sequence per video, with at least 30 video recordings per sequence.

Example of a training video for one of the sequences

The dataset will be used to train the machine learning model (explained further later). The model will then be able to classify the nine sequences and we can define a specific operation to control Spotify for each sequence.
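If you would like a head start on building such a dataset, here is a minimal sketch of a webcam recording script using OpenCV. The clip length, frame rate and file naming are purely illustrative assumptions, not what my project used:

```python
import cv2

def record_sequence(label: str, take: int, seconds: int = 3, fps: int = 30):
    """Record a short webcam clip for one gesture sequence and save it to disk."""
    cap = cv2.VideoCapture(0)  # default webcam
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    out = cv2.VideoWriter(f"{label}_take{take}.mp4", fourcc, fps, (width, height))

    for _ in range(seconds * fps):
        ok, frame = cap.read()
        if not ok:
            break
        out.write(frame)
        cv2.imshow("Recording", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):  # press q to stop early
            break

    cap.release()
    out.release()
    cv2.destroyAllWindows()

# Example: record 30 takes of a hypothetical "sequence_1"
# for take in range(30):
#     record_sequence("sequence_1", take)
```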

Pose Estimation

OpenPose

The next aspect we shall cover is pose estimation. It is a deep learning technique that reads in an image (or video frame), detects the human subjects in it and derives the person’s anatomical points and links, as shown in the gif above. The pose estimation performed in the gif was done using OpenPose, a real-time system for multi-person 2D pose estimation. However, for our case, we would like to focus specifically on the hands and as such will not be using OpenPose.

MediaPipe Hands

This article proposes the use of MediaPipe Hands @ https://google.github.io/mediapipe/solutions/hands. It is a powerful open source tool developed by Google that performs 3D pose estimation for hands (very impressively, it captures depth, adding a third dimension). It will give the anatomical (or landmark) points and links of your hand. Check out the link provided to find out more on how the MediaPipe solution works.

In your code, you are going to read in all the videos in your dataset using OpenCV. You can then pass each video frame to the MediaPipe Hands solution to acquire the 21 landmarks of the detected hand. Save the data into a NumPy array and create a separate list to hold the label belonging to each training video (e.g. Sequence 1, Sequence 2, etc.).
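To make that concrete, here is a hedged sketch of the landmark extraction step (assuming MediaPipe’s Python package; the variable names and the commented dataset-building loop are illustrative):

```python
import cv2
import numpy as np
import mediapipe as mp

mp_hands = mp.solutions.hands

def extract_landmarks(video_path: str) -> np.ndarray:
    """Return an array of shape (num_frames, 21, 3) of hand landmarks for one video."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    with mp_hands.Hands(static_image_mode=False, max_num_hands=1) as hands:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB images; OpenCV reads BGR
            result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.multi_hand_landmarks:
                hand = result.multi_hand_landmarks[0]
                frames.append([[lm.x, lm.y, lm.z] for lm in hand.landmark])
    cap.release()
    return np.array(frames)

# Build the dataset: one landmark array and one label per training video
# X = [extract_landmarks(path) for path in video_paths]
# y = ["Sequence 1", "Sequence 1", "Sequence 2", ...]
```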

Feature Extraction

Here, you are going to hand-craft your own distance-based feature extractor, which calculates the Euclidean distance between various pairs of landmark points of the hand. The type of features you want to extract will depend on the hand gestures you have defined. To explain this briefly, we have to refer to the MediaPipe chart:

Landmarks as per defined by MediaPipe

For instance, in my example project, we wish to know how close the fingertips are to the thumb, in order to perform volume control by adjusting the gap between the fingers. This information is embedded in the pairs [4,8], [4,12], [4,16] and [4,20], so these are the features I would want to extract. Likewise, you have to extract the relevant points you need for your own unique gestures. Each pair generates a distance value, and all the gathered values should then be normalised. The features from each frame of the video are appended together and reshaped into a 1D NumPy array, which can subsequently be fed further down the pipeline.
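As a rough illustration, a distance-based extractor for those thumb-to-fingertip pairs might look like the following. The max-based normalisation is my own assumption here; the point is simply that the raw distances get scaled so they are less sensitive to hand size and camera distance:

```python
import numpy as np

# Landmark index pairs: thumb tip (4) against each fingertip (8, 12, 16, 20)
FEATURE_PAIRS = [(4, 8), (4, 12), (4, 16), (4, 20)]

def frame_features(landmarks: np.ndarray) -> np.ndarray:
    """Euclidean distances between chosen landmark pairs for one frame of shape (21, 3)."""
    dists = np.array([np.linalg.norm(landmarks[a] - landmarks[b])
                      for a, b in FEATURE_PAIRS])
    # Normalise the distances (illustrative choice: divide by the largest value)
    return dists / (dists.max() + 1e-8)

def video_features(video_landmarks: np.ndarray) -> np.ndarray:
    """Append per-frame features together into a single 1D vector for the whole clip."""
    return np.concatenate([frame_features(f) for f in video_landmarks])
```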

Training the Model

Next, we have to design a neural net model to learn from the features we have just extracted. The project I worked on used a Long Short-Term Memory (LSTM) model and a Multi-Layer Perceptron (MLP) model for the gesture-sequence and single-gesture learning respectively. We found that these two models suited our needs best, but you can experiment with other types. For the gesture-sequence, we are dealing with the extraction of temporal patterns from the movement of the landmarks within a sequence, so an LSTM is suitable. For the single gesture, we only need to classify each gesture’s embedded features, and an MLP is more than enough for the job. I will not be discussing the designs of either model in this article.
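Purely as a hedged sketch of what such a setup could look like (assuming scikit-learn’s MLPClassifier for the single gesture and a Keras LSTM for the sequence; the layer sizes, window length and class count are placeholders, not my project’s actual architecture):

```python
from sklearn.neural_network import MLPClassifier
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Single-gesture classifier: a small MLP over one frame's feature vector
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
# mlp.fit(X_single, y_single)          # X_single: (samples, features)

# Gesture-sequence classifier: an LSTM over a window of per-frame features
def build_lstm(timesteps: int, num_features: int, num_classes: int) -> Sequential:
    model = Sequential([
        LSTM(64, input_shape=(timesteps, num_features)),
        Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# lstm = build_lstm(timesteps=90, num_features=4, num_classes=9)
# lstm.fit(X_seq, y_seq, epochs=50)    # X_seq: (samples, timesteps, features)
```

Note that the LSTM consumes a window of per-frame feature vectors, so for the sequence model you would keep the temporal dimension rather than flattening everything to 1D; how you window the data is part of your own design.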

Determining the Output

This part would again depend greatly on the gestures that you have defined. Using the project I did as an example, we first try to determine the single gesture. This is done by having the system recognise a palm (for performing a swiping motion to skip the track) and a pinching sign (for controlling the volume). Once either of these is recognised, the system can proceed to deliver the required operation (skip track or adjust volume).

Then comes the gesture-sequence. In order for the system to recognise and recall up to three gestures for a sequence, we implemented a first-in-first-out (FIFO) queue to save the preceding gestures, as sketched below. Once a proper sequence is detected in the queue, the model can proceed with classifying it. As mentioned earlier, there are many ways to go about crafting a solution and each situation is unique to your own defined gestures.
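A minimal sketch of such a queue, using Python’s collections.deque (the classify_sequence call is a hypothetical stand-in for the sequence model above):

```python
from collections import deque

SEQUENCE_LENGTH = 3
gesture_queue = deque(maxlen=SEQUENCE_LENGTH)  # oldest gesture falls out automatically

def on_gesture(gesture: str):
    """Push each newly recognised single gesture and check for a full sequence."""
    gesture_queue.append(gesture)
    if len(gesture_queue) == SEQUENCE_LENGTH:
        sequence = tuple(gesture_queue)
        # hypothetical: hand the 3-gesture window to the sequence classifier
        # label = classify_sequence(sequence)
        gesture_queue.clear()                  # start collecting the next sequence
        return sequence
    return None
```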

Controlling Spotify

Once the outputs have been determined, we have to link them to the desired operation to perform on Spotify. These operations can be defined in our code easily with the help of a couple of external tools.

Spotify provides its own official REST-based Web API @ https://api.spotify.com for developers to access user-related data, such as playlists and music. Additionally, Spotipy is a lightweight Python library that works in tandem with the Spotify Web API, providing many useful functions to control the Spotify app on the computer. We shall use both to their advantage in our code. We can also use the following Python modules to control the Spotify application window: os, subprocess and ctypes. os and subprocess both contain functions that allow the opening and closing of the Spotify application, while ctypes contains functions that allow the minimising and restoring of a window.
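A hedged sketch of how these pieces might fit together (the credentials, install path and window title are placeholders, and the ctypes calls assume Windows):

```python
import ctypes
import subprocess
import spotipy
from spotipy.oauth2 import SpotifyOAuth

# Credentials come from your own Spotify developer app (placeholders here)
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    redirect_uri="http://localhost:8888/callback",
    scope="user-modify-playback-state user-read-playback-state",
))

def skip_track():
    sp.next_track()                        # swipe gesture -> next song

def set_volume(percent: int):
    sp.volume(max(0, min(100, percent)))   # pinch gesture -> volume 0-100

def open_spotify():
    # Path is platform specific; this assumes a hypothetical Windows install location
    subprocess.Popen(["C:\\Users\\you\\AppData\\Roaming\\Spotify\\Spotify.exe"])

def minimise_spotify():
    # Windows only: find the Spotify window by title and minimise it (SW_MINIMIZE = 6)
    hwnd = ctypes.windll.user32.FindWindowW(None, "Spotify")
    if hwnd:
        ctypes.windll.user32.ShowWindow(hwnd, 6)
```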

Conclusion

And there we have it! You have just read through a high-level tutorial on how to create an entire, robust system. What do you think of it? Exciting, hopefully! If so, I encourage you to pick up this project on your own, regardless of your coding level. I may not have provided an in-depth tutorial, but if you need more help, do feel free to leave a comment. Hopefully, you will be on your way to an exciting realm of discovery! I bid thee, kind reader, farewell.
