Hand Gesture Recognition using Python and OpenCV :

Hand gesture recognition is a cool project to start for a Computer Vision enthusiast as it involves an intuitive step-by-step procedure which could be easily understood, so that you could build more complex stuff on top of these concepts.

Introduction :

Gesture recognition has been a very interesting problem in Computer Vision community for a long time. This is particularly due to the fact that segmentation of foreground object from a cluttered background is a challenging problem in real-time. The most obvious reason is because of the semantic gap involved when a human looks at an image and a computer looking at the same image. Humans can easily figure out what’s in an image but for a computer, images are just 3-dimensional matrices. It is because of this, computer vision problems remains a challenge. Look at the image below.

Figure 1. Semantic Segmentation
Figure 1. Semantic Segmentation

This image describes the semantic segmentation problem where the objective is to find different regions in an image and tag its corresponding labels. In this case, “sky”, “person”, “tree” and “grass”. A quick Google search will give you the necessary links to learn more about this research topic. As this is a very difficult problem to solve, we will restrict our focus to nicely segment one foreground object from a live video sequence.

AIM : 

We are going to recognize hand gestures from a video sequence. To recognize these gestures from a live video sequence, we first need to take out the hand region alone removing all the unwanted portions in the video sequence. After segmenting the hand region, we then count the fingers shown in the video sequence to instruct a robot based on the finger count. Thus, the entire problem could be solved using 2 parts -

  1. PART 1 : Find and segment the hand region from the video sequence. (segment.py)
  2.  PART 2 : Count the number of fingers from the segmented hand region in the video sequence. (recognize.py)

Let’s get started !!

                                                                PART 1 : 

Segment the Hand region :

The first step in hand gesture recognition is obviously to find the hand region by eliminating all the other unwanted portions in the video sequence. This might seem to be frightening at first. But don’t worry. It will be a lot easier using Python and OpenCV!

Before getting into further details, let us understand how could we possibly figure out the hand region.

Background Subtraction :

First, we need an efficient method to separate foreground from background. To do this, we use the concept of running averages. We make our system to look over a particular scene for 30 frames. During this period, we compute the running average over the current frame and the previous frames. By doing this, we essentially tell our system that -  "Ok robot! The video sequence that you stared at (running average of those 30 frames) is the background."

After figuring out the background, we bring in our hand and make the system understand that our hand is a new entry into the background, which means it becomes the foreground object. But how are we going to take out this foreground alone? The answer is Background Subtraction.

Look at the image below which describes how Background Subtraction works.

Figure 2. Background Subtraction
Figure 2. Background Subtraction

After figuring out the background model using running averages, we use the current frame which holds the foreground object (hand in our case) in addition to the background. We calculate the absolute difference between the background model (updated over time) and the current frame (which has our hand) to obtain a difference image that holds the newly added foreground object (which is our hand). This is what Background Subtraction is all about.

Motion Detection and Thresholding :

To detect the hand region from this difference image, we need to threshold the difference image, so that only our hand region becomes visible and all the other unwanted regions are painted as black. This is what Motion Detection is all about.

Thresholding : Thresholding is the assigment of pixel intensities to 0's and 1's based a particular threshold level so that our object of interest alone is captured from an image.

Contour Extraction :

After thresholding the difference image, we find contours in the resulting image. The contour with the largest area is assumed to be our hand.

Contour : Contour is the outline or boundary of an object located in an image

So, our first step to find the hand region from a video sequence involves three simple steps.

  1. Background Subtraction
  2. Motion Detection and Thresholding
  3. Contour Extraction

Implementation : 


# organize imports
import cv2
import imutils
import numpy as np

# global variables
bg = None

First, we import all the essential packages to work with and initialize the background model. In case, if you don’t have these packages installed in your computer. Install the following packages :

  1. NUMPY : Use command  "pip install numpy"
  2. OPENCV : Use command "pip install opencv-python"
  3. IMUTILS :  Use command  "pip install imutils"
  4. SCIKIT-LEARN :  Use command "pip install scikit-learn"

# To find the running average over the background

def run_avg(image, aWeight):
    global bg
    # initialize the background
    if bg is None:
        bg = image.copy().astype("float")

    # compute weighted average, accumulate it and update the background
    cv2.accumulateWeighted(image, bg, aWeight)

Next, we have our function that is used to compute the running average between the background model and the current frame. This function takes in two arguments - current frame and aWeight, which is like a threshold to perform running average over images. If the background model is None (i.e if it is the first frame), then initialize it with the current frame. Then, compute the running average over the background model and the current frame using cv2.accumulateWeighted() function. Running average is calculated using the formula given below -


  • src(x,y) - Source image or input image (1 or 3 channel, 8-bit or 32-bit floating point)
  • dst(x,y) - Destination image or output image (same channel as source image, 32-bit or 64-bit floating point)
  • a - Weight of the source image (input image)

# To segment the region of hand in the image

def segment(image, threshold=25):
    global bg
    # find the absolute difference between background and current frame
    diff = cv2.absdiff(bg.astype("uint8"), image)

    # threshold the diff image so that we get the foreground
    thresholded = cv2.threshold(diff, threshold, 255, cv2.THRESH_BINARY)[1]

    # get the contours in the thresholded image
    (_, cnts, _) = cv2.findContours(thresholded.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    # return None, if no contours detected
    if len(cnts) == 0:
        # based on contour area, get the maximum contour which is the hand
        segmented = max(cnts, key=cv2.contourArea)
        return (thresholded, segmented)

Our next function is used to segment the hand region from the video sequence. This function takes in two parameters - current frame and threshold used for thresholding the difference image.

First, we find the absolute difference between the background model and the current frame using cv2.absdiff() function.

Next, we threshold the difference image to reveal only the hand region. Finally, we perform contour extraction over the thresholded image and take the contour with the largest area (which is our hand).

We return the thresholded image as well as the segmented image as a tuple. The math behind thresholding is pretty simple. If x(n) represents the pixel intensity of an input image at a particular pixel coordinate, then threshold decides how nicely we are going to segment/threshold the image into a binary image.



if __name__ == "__main__":
    # initialize weight for running average
    aWeight = 0.5

    # get the reference to the webcam
    camera = cv2.VideoCapture(0)

    # region of interest (ROI) coordinates
    top, right, bottom, left = 10, 350, 225, 590

    # initialize num of frames
    num_frames = 0

    # keep looping, until interrupted
        # get the current frame
        (grabbed, frame) = camera.read()

        # resize the frame
        frame = imutils.resize(frame, width=700)

        # flip the frame so that it is not the mirror view
        frame = cv2.flip(frame, 1)

        # clone the frame
        clone = frame.copy()

        # get the height and width of the frame
        (height, width) = frame.shape[:2]

        # get the ROI
        roi = frame[top:bottom, right:left]

        # convert the roi to grayscale and blur it
        gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
        gray = cv2.GaussianBlur(gray, (7, 7), 0)

        # to get the background, keep looking till a threshold is reached
        # so that our running average model gets calibrated
        if num_frames < 30:
            run_avg(gray, aWeight)
            # segment the hand region
            hand = segment(gray)

            # check whether hand region is segmented
            if hand is not None:
                # if yes, unpack the thresholded image and
                # segmented region
                (thresholded, segmented) = hand

                # draw the segmented region and display the frame
                cv2.drawContours(clone, [segmented + (right, top)], -1, 	     (0, 0, 255))
                cv2.imshow("Thesholded", thresholded)

        # draw the segmented hand
        cv2.rectangle(clone, (left, top), (right, bottom), (0,255,0), 2)

        # increment the number of frames
        num_frames += 1

        # display the frame with segmented hand
        cv2.imshow("Video Feed", clone)

        # observe the keypress by the user
        keypress = cv2.waitKey(1) &amp;amp; 0xFF

        # if the user pressed "q", then stop looping
        if keypress == ord("q"):

# free up memory

The above code sample is the main function of our program. We initialize the aWeight to 0.5. As shown eariler in the running average equation, this threshold means that if you set a lower value for this variable, running average will be performed over larger amount of previous frames and vice-versa. We take a reference to our webcam using cv2.VideoCapture(0), which means that we get the default webcam instance in our computer.

Instead of recognizing gestures from the overall video sequence, we will try to minimize the recognizing zone (or the area), where the system has to look for hand region. To highlight this region, we use cv2.rectangle() function which needs top, right, bottom and left pixel coordinates.

To keep track of frame count, we initialize a variable num_frames. Then, we start an infinite loop and read the frame from our webcam using camera.read() function. We then resize the input frame to a fixed width of 700 pixels maintaining the aspect ratio using imutils library and flip the frame to avoid mirror view.

Next, we take out only the region of interest (i.e the recognizing zone), using simple NumPy slicing. We then convert this ROI into grayscale image and use gaussian blur to minimize the high frequency components in the image. Until we get past 30 frames, we keep on adding the input frame to our run_avg function and update our background model. Please note that, during this step, it is mandatory to keep your camera without any motion. Or else, the entire algorithm fails.

After updating the background model, the current input frame is passed into the segment function and the thresholded image and segmented image are returned. The segmented contour is drawn over the frame using cv2.drawContours() and the thresholded output is shown using cv2.imshow().

Finally, we display the segmented hand region in the current frame and wait for a keypress to exit the program. Notice that we maintain bg variable as a global variable here. This is important and must be taken care of.

Executing code :

Copy all the code given above and put it in a single file named segment.py.

Then, open up a Terminal or a Command prompt, get into the correct directory where you have saved the file and type python segment.py.

Note: Remember to update the background model by keeping the camera static without any motion. After 5-6 seconds, show your hand in the recognizing zone to reveal your hand region alone. Below you can see how our system segments the hand region from the live video sequence effectively.

Figure 3. Segmenting hand region in a real-time video sequence
Figure 3. Segmenting hand region in a real-time video sequence

Upto now, we have completed Background Subtraction, Motion Detection, Thresholding and Contour Extraction to nicely segment hand region from a real-time video sequence using OpenCV and Python.

Now, we will extend this simple technique to make our system (intelligent enough) to recognize hand gestures by counting the fingers shown in the video sequence. Using this, you could build an intelligent robot that performs some operations based on your gesture commands.

Let's move to complete the next part.

                                                              PART 2 :

Now we will write the code for "recognize.py".

Count My Fingers :

Having segmented the hand region from the live video sequence, we will make our system to count the fingers that are shown via a camera/webcam. We cannot use any template (provided by OpenCV) that is available to perform this, as it is indeed a challenging problem.

We have obtained the segmented hand region by assuming it as the largest contour (i.e. contour with the maximum area) in the frame. If you bring in some large object inside this frame which is larger than your hand, then this algorithm fails. So, you must make sure that your hand occupies the majority of the region in the frame.

We will use the segmented hand region which was obtained in the variable hand. Remember, this hand variable is a tuple having thresholded (thresholded image) and segmented (segmented hand region). We are going to utilize these two variables to count the fingers shown. How are we going to do that?

Figure 4. Hand-Gesture Recognition algorithm to count the fingers
Figure 4. Hand-Gesture Recognition algorithm to count the fingers

As you can see from the above image, there are four intermediate steps to count the fingers, given a segmented hand region. All these steps are shown with a corresponding output image (shown in the left) which we get, after performing that particular step.

Four Intermediate Steps :

  1. Find the convex hull of the segmented hand region (which is a contour) and compute the most extreme points in the convex hull (Extreme Top, Extreme Bottom, Extreme Left, Extreme Right).
  2. Find the center of palm using these extremes points in the convex hull.
  3. Using the palm’s center, construct a circle with the maximum Euclidean distance (between the palm’s center and the extreme points) as radius.
  4. Perform bitwise AND operation between the thresholded hand image (frame) and the circular ROI (mask). This reveals the finger slices, which could further be used to calcualate the number of fingers shown.

Below you could see the entire function used to perform the above four steps.

  • Input - thresholded (thresholded image) and segmented (segmented hand region or contour)
  • Output - count (Number of fingers).

# To count the number of fingers in the segmented hand region

def count(thresholded, segmented):
    # find the convex hull of the segmented hand region
    chull = cv2.convexHull(segmented)

    # find the most extreme points in the convex hull
    extreme_top    = tuple(chull[chull[:, :, 1].argmin()][0])
    extreme_bottom = tuple(chull[chull[:, :, 1].argmax()][0])
    extreme_left   = tuple(chull[chull[:, :, 0].argmin()][0])
    extreme_right  = tuple(chull[chull[:, :, 0].argmax()][0])

    # find the center of the palm
    cX = int((extreme_left[0] + extreme_right[0]) / 2)
    cY = int((extreme_top[1] + extreme_bottom[1]) / 2)

    # find the maximum euclidean distance between the center of the palm
    # and the most extreme points of the convex hull
    distance = pairwise.euclidean_distances([(cX, cY)], Y=[extreme_left, extreme_right, extreme_top, extreme_bottom])[0]
    maximum_distance = distance[distance.argmax()]

    # calculate the radius of the circle with 80% of the max euclidean distance obtained
    radius = int(0.8 * maximum_distance)

    # find the circumference of the circle
    circumference = (2 * np.pi * radius)

    # take out the circular region of interest which has 
    # the palm and the fingers
    circular_roi = np.zeros(thresholded.shape[:2], dtype="uint8")
    # draw the circular ROI
    cv2.circle(circular_roi, (cX, cY), radius, 255, 1)

    # take bit-wise AND between thresholded hand using the circular ROI as the mask
    # which gives the cuts obtained using mask on the thresholded hand image
    circular_roi = cv2.bitwise_and(thresholded, thresholded, mask=circular_roi)

    # compute the contours in the circular ROI
    (_, cnts, _) = cv2.findContours(circular_roi.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)

    # initalize the finger count
    count = 0

    # loop through the contours found
    for c in cnts:
        # compute the bounding box of the contour
        (x, y, w, h) = cv2.boundingRect(c)

        # increment the count of fingers only if -
        # 1. The contour region is not the wrist (bottom area)
        # 2. The number of points along the contour does not exceed
        #     25% of the circumference of the circular ROI
        if ((cY + (cY * 0.25)) > (y + h)) and ((circumference * 0.25) > c.shape[0]):
            count += 1

    return count

Each of the intermediate step requires some understanding of image processing fundamentals such as Contours, Bitwise-AND, Euclidean Distance and Convex Hull.

Contours :

The outline or the boundary of the object of interest. This contour could easily be found using OpenCV’s cv2.findContours() function. Be careful while unpacking the return value of this function, as we need three variables to unpack this tuple in OpenCV 3.1.0


Performs bit-wise logical AND between two objects. You could visually think of this as using a mask and extracting the regions in an image that lie under this mask alone. OpenCV provides cv2.bitwise_and() function to perform this operation - Bitwise-AND.

Euclidean Distance

This is the distance between two points given by the equation shown here. Scikit-learn provides a function called pairwise.euclidean_distances() to calculate the Euclidean distance from one point to multiple points in a single line of code - Pairwise Euclidean Distance. After that, we take the maximum of all these distances using NumPy’s argmax() function.

Convex Hull

You can think of convex hull as a dynamic, stretchable envelope that wraps around the object of interest.

Now again copy the above i.e. (recognize.py ) code and save it in the folder.

Then, open up a Terminal or a Command prompt, get into the correct directory where you have saved the file and type "python recognize.py".

Note: Do not shake your webcam during the calibration period of 30 frames. If shaken during the first 30 frames, the entire algorithm will not perform as we expect.

Required Output :

Figure 5. Hand-Gesture Recognition || Counting the fingers || Demo
Figure 5. Hand-Gesture Recognition || Counting the fingers || Demo

 You can download or clone the entire code from my GitHub to perform Hand  Gesture Recognition.


1.Clone it using the following command in your Windows Terminal or Git Bash.

git clone https://github.com/AbhiSinghDeveloper/Gesture-Recognition

2. Open the Terminal/Command prompt or Git Bash. Then, get into the correct directory and type

python recognize.py

3. After this, the program will run successfully.

HURRAY !! You did it.


Or you can manually download the files from the repository by clicking on the following link and pressing Clone or download button.

Link to GitHub repository : https://github.com/AbhiSinghDeveloper/Gesture-Recognition

You can also fork repository by following the same link and clicking the Fork button,

And, If This project is useful for you then please give a Star to this repository.

If you like the blog then please Like it and follow me on GitHub.

Follow me on GitHub  :  https://github.com/AbhiSinghDeveloper

Credits :  Gogul Ilango



                                     THANK YOU !!