“LUCIAN BLAGA” [anonimizat] A RASPBERRY PI
SCIENTIFIC ADVISOR:
COORDINATOR:
GRADUATE:
Master program:
Sibiu, 2019
"LUCIAN BLAGA" UNIVERSITY OF SIBIU APPROVED (date)
FACULTY OF ENGINEERING __________________
[anonimizat], Electrical
Electrical and Electronics Engineering and Electronics Engineering
_____________
THEMATIC PLAN FOR DISSERTATION
Student’s name and surname_____________________________________.
Study programme______________________________________________.
1.Subject
_________________________________________________________________________________________________________________________.
2. Deadline ___________________________________________________.
3. Basic elements for the project
________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________.
4. Problems to be solved
____________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
______________________________________________________________________________________________________________________________________________________________________________________.
6. Advising schedule (weekly)
_________________________________________________________________________________________________________________________.
7. Theme released in ________________.
Theme received by student: [anonimizat],
Date _________ _____________________
Student's signature ________________ (name and signature)
Table of Contents
1. ABSTRACT
2. PART I: THEORY
2.1. State of the Art
2.2. Raspberry Pi
2.3. OpenCV and Python
2.4. Image processing
2.5. Neural Networks – Backpropagation algorithm
3. PART II: APPLICATION
3.1. Block diagram
3.2. Creating the dataset
3.3. Processing the images
3.4. Learning algorithm
3.4.1. Set parameters for backpropagation algorithm
3.4.2. Create folds for cross-validation
3.4.3. Create train set and test set
3.4.4. Initialize the network
3.4.5. Train the network
3.4.6. Make predictions on test set
3.4.7. Compute accuracy
3.5. Test the model with Raspberry Pi
3.6. Results and discussions
4. CONCLUSIONS
5. REFERENCES
ABSTRACT
The fast development of technology over the last decades has made computers and processing systems present in all aspects of modern life. Therefore, human-computer interaction has become an important part of day-to-day life. Gestures are a non-verbal form of communication. Consequently, a major interest of engineers, scientists and bioinformaticians is to develop gesture detection and recognition algorithms in order to make the communication between computers and humans more natural.
Gesture recognition implies natural human-machine communication without interaction with other mechanical devices, and therefore without the need to learn how to use them, making classic devices such as the keyboard, mouse or touchscreen redundant.
The gestures have to be captured and recognized using special acquisition methods. Among them, the following three are the main ones:
By using special devices with wearable gloves
By acquiring data from different 3D positions of points on the hand
By processing a raw captured image
In the first case, the speed of recognition and the accuracy of the gesture made are very good, but the person whose gestures must be detected has to wear auxiliary devices and, therefore, extra wires.
In the second case, the accuracy is still very good, but, because acquiring data from the hand points takes extra processing time, the overall speed is lower than in the first case.
The third method proves to be the most efficient one overall, since the person whose gestures must be detected doesn't have to wear any other devices or wires. In this case, we only need a good camera or sensor for capturing the raw image. Of course, the raw image must be processed before making the prediction of the gesture. The accuracy is not as high as in the first case, but, with a good training algorithm, it can reach a satisfying result.
According to , the processing system on which the training is performed and which predicts the gesture has to fulfil the following requirements:
To have good classification criteria, leading to high accuracy
To be able to detect and predict a certain gesture quickly
To have enough computational resources, since the training algorithm is resource-intensive
This new domain of gesture detection and recognition has a wide range of applications in:
Gaming: virtual reality interaction, 3D designing
Television: smart TVs, smart displays
Healthcare: sign language recognition for deaf people
Forensics: face recognition software, lie detector
Industry: equipment handling, robots
The purpose of this dissertation is to develop a hand gesture recognition software using OpenCV and Python and having as hardware support a Raspberry Pi equipped with a Pi Camera.
The study was done for seven different hand gestures: palm, fist, peace, ok, thumbs up, left and right. For each hand sign, a dataset was created containing 20 photos captured with the Pi Camera, which were further processed in order to be used in the training of the neural network.
After the feature generation, the next step is to train the machine learning model using a backpropagation algorithm. Then, the model is validated and its accuracy is computed.
After the training is done, the neural network obtained is tested to see how well it can predict. This testing is done on the Raspberry Pi. The raw image is captured with the same Pi Camera, processed in the same way as for the training dataset and then forward propagated through the network. The prediction is shown on screen.
This paper is structured in three chapters, followed by the references.
The first part is a theoretical study containing the state of the art related to gesture detection using neural networks and a short presentation of the technology used for developing the application, namely the hardware (Raspberry Pi 3, Pi Camera), the programming language and environment used (OpenCV and Python) and the backpropagation algorithm for neural networks.
The second part contains the detailed development of the application, in all its stages, described by a block diagram. In short, the main stages in developing the application are the following: creating the dataset of images, processing the images, implementing the learning algorithm, testing the model on a Raspberry Pi and presenting the results obtained.
The third chapter contains the conclusions and presents future development possibilities.
Of course, there are a lot of challenges related to the efficiency of a gesture detection and recognition software, mainly based on the quality of the equipment used for capturing the images (resolution, algorithm calibrated for a specific camera) and on the background environment (image noise, lighting, location, distance, partial occlusions, skin color).
PART I: THEORY
State of the Art
There are numerous methods for recognizing hand gestures in static images, which have been proposed over the years.
In , Kopuklu et al. proposed a hierarchical architecture for the detection and recognition of gestures in real time, with a separate integration of the trained models into the respective architecture.
The structure consists of an independent three dimensional convolutional neural network (CNN) which is trained in order to obtain a good prediction of gestures and, also, of another convolutional neural network trained for a good detection of a gesture. Therefore, they have a classifier and a detector. The detector works only when a gesture is taking place and, when this happens, the classifier also starts working. The detection is done by analyzing continuously the input stream of images, using a moveable frame.
In , Dadashzadeh et al. proposed a new two-stage convolutional neural network architecture, where the first stage performs accurate pixel-level semantic segmentation to determine the hand regions, and the second stage identifies the hand gesture.
The recognition stage deploys a two-stream CNN which fuses the information from the RGB and segmented images by combining their deep representations in a new fully connected layer before classification.
Pisharady et al. utilized a Bayesian model able to identify hand regions within images. Hand gestures were recognized based on characteristics related to the texture, color and shape of the hand, using a Support Vector Machine (SVM) classifier.
Priyal and Bora used geometry-based normalizations and properties of the Krawtchouk moments in order to generate a binarized model of the hand.
Nalepa and Kawulok proposed the idea of having a parallel algorithm used for classifying the shape of the hand, which performed in real time over the segmented hand gestures. They combined multiple techniques related to the shape of the hand in order to improve the classification score.
Raspberry Pi
In order to test how well the network performs, a Raspberry Pi board was used. The Raspberry Pi was developed in the UK by the Raspberry Pi Foundation. It is a small single-board computer originally created to be used in science classes in schools. Over the years, the Raspberry Pi Foundation has launched several generations of models. All of them include an integrated ARM (Advanced RISC Machine) compatible CPU (central processing unit), a Broadcom SoC (System on Chip) and an on-chip GPU (graphics processing unit).
The model used for this application is the Raspberry Pi 3 model B+ board, illustrated in the figure below:
Figure 1: Raspberry Pi 3 board model B+, top view of the board
The board is like a miniature computer and it can be controlled with any type of peripheral (mouse or keyboard) connected through the generic USB ports. It is also possible to connect a USB storage device, converters or any other devices or components that use USB. Alternatively, other auxiliary devices or peripherals can be connected using the 40-pin connector with which the Raspberry Pi 3 is equipped.
The general purpose input/output (GPIO) connector has 40 pins, with different descriptions and functionalities. The connector schematic is presented in Figure 2, and the pin descriptions are presented in Figure 3.
Figure 2: GPIO connector schematic
Figure 3: GPIO connector, pins description
The Raspberry Pi 3 board model B+ has the following specifications (Table 1), according to :
Table 1: Raspberry Pi 3 model B+ specifications
The photos for the dataset and for testing the application are captured using a Raspberry Pi Camera module illustrated in Figure 4.
The Pi Camera is a small circuit board having the following dimensions: 25mm x 20mm x 9mm. It can be connected to Raspberry Pi board using the CSI (Camera Serial Interface) connector presented in Figure 5.
This connection is done through a flexible ribbon cable which the camera comes equipped with.
Figure 4: Raspberry Pi Camera Module
Figure 5: Camera Serial Interface (CSI) bus connector
OpenCV and Python
OpenCV stands for Open Source Computer Vision Library and is a software library mainly used in the image processing and machine learning domains.
It was created with the purpose of providing a common foundation and architecture for applications making use of computer vision, and of making it easier to use machine learning in the emerging technology of artificial intelligence.
Figure 6: OpenCV and Python logos
OpenCV was founded in 1999 at Intel. The first developer was Gary Bradsky, who made the first release in 2000. He was later joined by Vadim Pisarevsky in leading the team at Intel, in Russia.
The major breakthrough for the library came in 2005, at the DARPA Grand Challenge, where a group of developers programmed the winning vehicle, Stanley, using OpenCV. The two led the project for several years, with help and support from Willow Garage.
Today, the library has accumulated a large number of functions, features and algorithms in the fields of machine learning and computer vision and, of course, it is expected to grow even further in the upcoming years.
OpenCV supports numerous programming languages, for example Python, Java and C++, and it can be used on different operating systems such as Windows, Linux or Android.
In our case, the operating system on the Raspberry board is Linux and the OpenCV-Python duo gives the best features of the C++ API supported by OpenCV and of the Python programming language.
The OpenCV library offers more than 2000 optimized functions and algorithms, containing both classic textbook algorithms and innovative approaches in the fields of machine learning and image processing.
Nowadays, there are a lot of applications for face detection and recognition, movement tracking, object identification, following the movement of the eye, red eye removal, virtual reality or Kinect applications which successfully use the above mentioned algorithms.
The programming language used in this work is Python, a general-purpose language developed by Guido van Rossum. It is a user-friendly, simple and readable programming language, which helped it quickly gain wide popularity.
A programmer using Python has the freedom to express ideas in a few lines of code, with ease and high readability.
Figure 7 illustrates the simplicity and readability of an OpenCV-Python code for capturing a still image from the Pi Camera and displaying it.
Figure 7: Sample code for capturing and displaying an image
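A minimal sketch of this kind of code, assuming the picamera and OpenCV packages are installed and using an illustrative output path (not the exact code from Figure 7), could look like this:

# Capture a still image with the Pi Camera and display it with OpenCV.
# The file path and window name below are illustrative.
import cv2
from picamera import PiCamera
from time import sleep

camera = PiCamera()
camera.start_preview()
sleep(2)                                  # give the sensor time to adjust exposure
camera.capture('/home/pi/still.jpg')      # save a still image to disk
camera.stop_preview()

image = cv2.imread('/home/pi/still.jpg')  # load the saved image as a NumPy array (BGR)
cv2.imshow('Still image', image)
cv2.waitKey(0)
cv2.destroyAllWindows()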
The disadvantage of Python is that it is slower than other programming languages such as C++. This is not a major problem, because Python can be extended with C++ thanks to the support provided by OpenCV: performance-critical code is written in C++ and exposed through Python wrappers that are used as ordinary Python modules. In this way, OpenCV-Python combines the speed of the underlying C++ implementation with a fast and very easy to use programming language, since OpenCV-Python is essentially a wrapper around the original C++ development.
Taking into account that the application in this work makes heavy use of arrays and array operations, OpenCV and Python benefit from NumPy support, which makes the processing faster and easier.
NumPy is a library used for numerical operations, similar to those used in MATLAB, with quite similar syntax. When NumPy is used together with OpenCV, all OpenCV images are represented as NumPy arrays. This reduces the computation time in image processing applications.
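As a small illustration (the file name below is hypothetical), an image loaded with OpenCV is simply a NumPy array and can be manipulated with vectorized array operations:

import cv2
import numpy as np

img = cv2.imread('hand.jpg')                         # hypothetical file; the result is a NumPy array
print(type(img), img.shape, img.dtype)               # e.g. <class 'numpy.ndarray'> (480, 640, 3) uint8
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)         # 2D array of 8-bit intensities
bright = np.clip(gray.astype(np.int16) + 40, 0, 255).astype(np.uint8)  # vectorized brightness shift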
Besides NumPy, we will also use the functionality of the picamera library offered by the Raspberry Pi Foundation.
Image processing
Amongst all of the information that the human brain is able to acquire, filter and interpret, the visual information is the most significant. Processing an image has the purpose of extracting desired properties from its representation, using specific methods.
Any image can be represented as a matrix: 2D for grayscale images or 3D for color images. The matrix can be seen as a function f(x,y) of two variables, which represent the vertical and horizontal coordinates.
In order to determine the value of a pixel at a certain position in the image, we simply look at the value of the function for the coordinates of that position. These values give information about the color of the pixel or its brightness.
If we are dealing with grayscale images, the matrix will have two dimensions and the value of any pixel shows how bright the pixel is. Usually, image processing deals with byte-type images, represented as 8-bit integers. A pixel in this kind of image can take values between 0 and 255, where 0 is considered black and 255 white. All the values in between are shades of gray.
OpenCV together with Python offer a lot of support in the field of image processing, having a wide library of algorithms and procedures related to computer vision. All the procedures that can be applied to a certain image can be grouped in the following categories:
Acquiring, storing and transmitting the images; this can be done using special techniques: compression, quantization, encoding etc.
Filtering for obtaining better quality (ex. Enhancing or restoring)
Gathering features for using them in other purposes (ex. For training a network).
For this application, the image acquisition is done using the Pi Camera connected to the Raspberry Pi board, and all the training and test images are stored in a dataset, classified on the type of hand gesture.
During real time testing, the image is captured, stored locally and further processed.
The images are then processed: all images are resized in order to have the same dimensions and, after that, the images are binarized, in order to have only two possible pixel values.
To binarize or to threshold an image means to process the image in such a way to obtain a “black/white” image, more specifically to have only two allowed values for the pixels in the matrix representation of the image. This obtained image is the “binary” image.
We can also refer to this procedure as “segmentation” because we want to separate/segment an object from the rest of the picture.
The success of this operation is highly dependent on the image content (brightness, background, resolution etc.), therefore it is not as simple a task as it may seem.
The idea is to find a value called the threshold, based on which each pixel value will be assigned to one of the two groups (the background or the searched object).
For our application, the threshold value will be chosen as the pixel value in the center of the image, assuming the hand is centered in the image.
Of course, this is not the only decision criterion when selecting the threshold pixel value and its location in the image. Most of the time, the matrix of pixel values undergoes pre-processing (calculating entropies, mean values etc.). We can mention here Nobuyuki Otsu, who introduced in 1979 an algorithm which searches for the optimal threshold based on variance.
It is not so important which algorithm is used as long as the results are as expected. The success of the result is directly proportional to the number of objects to be segmented in the image. For images with many objects, the binarization is more difficult.
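As an illustration only, since Otsu's method is mentioned above but is not the method used in this application, a minimal OpenCV sketch of variance-based thresholding (with a hypothetical input file) could look as follows:

import cv2

img = cv2.imread('hand.jpg', cv2.IMREAD_GRAYSCALE)   # hypothetical grayscale input image
# Otsu's method picks the threshold that separates the two pixel classes based on variance;
# passing 0 as the threshold lets OpenCV compute it automatically.
threshold_value, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite('hand_binary.jpg', binary)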
For this application, the binarization is done in the following way: first, the image is converted to YCrCb format and the central pixel is selected as the skin color pixel. Pixels whose values lie within a certain range around the skin color pixel are set to white (foreground), while the remaining background pixels are set to black. As a result, the saved images show a white hand on a black background; when the images are converted to arrays for the learning algorithm, the hand pixels are encoded as 0 and the background pixels as 1.
The images will be stored as arrays of zeros and ones and this is the way in which they will further be used in the training algorithm.
Neural Networks – Backpropagation algorithm
An artificial neural network is a computing system inspired by the biological neural networks that form the brain.
A system like this is able to learn to perform tasks by considering examples, without being programmed with any explicit set of rules.
Figure 8: Neural network architecture
The neural network is constructed from three types of layers (Figure 8):
Input layer: initial data for the neural network
Hidden layers: intermediate layer between input and output layer, where all the computations are made
Output layer: result for the given inputs
Each node is connected with each node from the next layer and each connection has a particular weight. Weight can be seen as the impact that a node has on the node from the next layer.
Figure 9 presents one particular node from the architecture.
Figure 9: Particular node from neural network
For a neural network with 4 input nodes (x0…x3), one hidden layer with 4 nodes (a0…a3) and one output node (hθ), the weights are denoted with θ. The weights between the input and hidden layer form a 3×4 matrix and the weights between the hidden layer and the output layer form a 1×4 matrix.
The activation nodes are computed by multiplying the input vector X with the weight matrix θ of the first layer and then applying the activation function g (Equation 1).
Equation 1: Compute activation nodes
The output is obtained by applying the activation function g again to the product of the hidden-layer activations and the second weight matrix (Equation 2).
Equation 2: Compute output node value
In a neural network, the activation function defines whether a given node should be activated or not based on the weighted sum. The sigmoid function (Figure 10) is one of the most widely used activation functions today. It is given by the formula below.
Equation 3: Sigmoid equation
Figure 10: Sigmoid function
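For reference, the logistic sigmoid used as the transfer function (see the transfer_function code in Part II) and its derivative can be written as:

g(z) = \frac{1}{1 + e^{-z}}, \qquad g'(z) = g(z)\,\bigl(1 - g(z)\bigr)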
The cost function represents the sum of the errors, i.e. the differences between the predicted values and the real values.
Equation 4: General cost function
The goal is to minimize the cost function, meaning to find the minimum of J(θ) using the optimal set of values for θ (the weights). The backpropagation method is used to compute the partial derivatives of J(θ).
The algorithm used is the backpropagation algorithm, which belongs to the domain of artificial neural networks (ANN), more specifically to feed-forward networks.
The main component of such a network, just like in the human brain, is the neuron, the main processor of information. In the human body, connection and dissemination of data between neurons are made through dendrites. Each neuron is connected to the dendrites of another neuron through synapses.
The idea of this algorithm is to force the network to deliver the desired output by updating the weights of each neuron accordingly. This updating is done at each step and it is proportional to the error between what we obtained and what we want to obtain.
Therefore, we can state that the back propagation procedure is in fact a procedure for updating the weight of each neuron in the neural network. This is why, from the beginning, the architecture of the network has to be clearly initialized.
Backpropagation is about determining how changing the weights impacts the overall cost in the neural network. The algorithm has 5 steps:
Set a(1) = X, for the training examples
Perform forward propagation and compute a(l) for the other layers
Use y and compute the delta value for the last layer
Compute the delta values backwards for each layer
Calculate derivative values for each layer
What it does is propagate the error backwards through the neural network. On the way back, it finds how much each weight contributes to the overall error.
The weights that contribute more to the overall error will have a larger derivative value, which means that they will change more.
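As a compact summary, written here from the Python functions in Part II (g is the sigmoid, a denotes neuron outputs, y the expected outputs and α the learning rate), the quantities computed by these steps are:

\delta_j^{(L)} = \bigl(y_j - a_j^{(L)}\bigr)\, a_j^{(L)}\bigl(1 - a_j^{(L)}\bigr)

\delta_j^{(l)} = \Bigl(\textstyle\sum_k \theta_{kj}^{(l+1)}\, \delta_k^{(l+1)}\Bigr)\, a_j^{(l)}\bigl(1 - a_j^{(l)}\bigr)

\theta_{ji}^{(l)} \leftarrow \theta_{ji}^{(l)} + \alpha\, \delta_j^{(l)}\, a_i^{(l-1)}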
PART II: APPLICATION
Block diagram
Figure 11: Block diagram of the application
Creating the dataset
This application has to be able to detect seven different types of gestures: fist, hand, left, ok, peace, right and thumbs-up. Since the detection is based on neural network learning, we have to create a training dataset. Therefore, for each type of gesture, 20 images were recorded.
The training images were captured with the PiCamera installed on the Raspberry Pi, the same camera which will be used when testing the model.
Some examples of the captured images can be seen below:
Table 2: Examples of photos from the learning dataset
In order to use the Pi Camera, the camera must first be enabled from the Raspberry Pi configuration tool (Figures 12 and 13).
Figure 12: Raspberry Pi configuration tool
Figure 13: Camera enable property
For each gesture, in order to have a rapid capture of the 20 images, the following small piece of code was used:
from picamera import PiCamera
from time import sleep

new_camera = PiCamera()
new_camera.start_preview()
for index in range(20):
    sleep(5)   # leave time to change the hand pose between captures
    new_camera.capture('/home/pi/Desktop/dataset/fist/fist%s.jpg' % index)
new_camera.stop_preview()
Processing the images
In order to use the images in the learning algorithm, they must be processed and adjusted in a suitable form for the network. We start with the original, full sized, colored image and we end up with a binarized image.
The goal is to represent it as an array of zeros and ones with a length of 2500 (50×50 resized images).
First, we create output folders for each set of images. Here is where we will store the resized binarized images.
import os
import shutil

def create_folders():
    # remove any previous output and recreate one folder per gesture class
    shutil.rmtree('train_images_resized')
    os.mkdir('train_images_resized')
    os.mkdir('./train_images_resized/fist/')
    os.mkdir('./train_images_resized/hand/')
    os.mkdir('./train_images_resized/left/')
    os.mkdir('./train_images_resized/ok/')
    os.mkdir('./train_images_resized/peace/')
    os.mkdir('./train_images_resized/right/')
    os.mkdir('./train_images_resized/thumbsup/')
From the path, we extract the name of the directory in which the image is located and then set the value of label according to Table 3.
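For clarity, the label values assigned to each gesture (as they are later interpreted in section 3.5) are assumed to be:

fist = 1, hand = 2, left = 3, ok = 4, peace = 5, right = 6, thumbs-up = 7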
The current image is opened for processing. We resize the image to 50×50 pixels and convert it from BGR to YCrCb format (the default format for images in OpenCV-Python is BGR, not RGB).
We assume the hand is in the center of the frame, so we pick the central pixel as the skin color pixel. For each pixel in the image, we check whether its value is within a range around the skin color pixel (the range was selected after a number of trials); if it is, the pixel is marked as white.
The binarized image is saved in the corresponding output folder.
As it was said before, at the input we will have the original image (Figure 14) and at the output the resized binarized image (Figure 15).
The image processing algorithm is presented below:
filename = os.path.realpath(f)[:30]+'train_images/'+folder+'/'+os.path.basename(f)
img = cv2.imread(filename)
M = 50
N = 50
img_resized = cv2.resize(img, (M, N))
img_out = np.zeros((M, N))
# convert to YCrCb (OpenCV's YCR_CB conversion orders the channels as Y, Cr, Cb)
img_ycrcb = cv2.cvtColor(img_resized, cv2.COLOR_BGR2YCR_CB)
cb = img_ycrcb[:, :, 1]
cr = img_ycrcb[:, :, 2]
center = img_ycrcb[int(M/2), int(N/2), :]   # central pixel taken as the skin color reference
center_cb = center[1]
center_cr = center[2]
delta_cb = 5
delta_cr = 5
for i in range(M):
    for j in range(N):
        # mark as white (255) every pixel whose chrominance is close to the reference skin color
        if (cb[i,j] >= (center_cb-delta_cb)) & (cb[i,j] <= (center_cb+delta_cb)) & (cr[i,j] >= (center_cr-delta_cr)) & (cr[i,j] <= (center_cr+delta_cr)):
            img_out[i,j] = 255
output_folder = os.path.realpath(f)[:30]+'train_images_resized/'+folder+'/'+os.path.basename(f)
cv2.imwrite(output_folder, img_out)
Each image will be a row from the input layer in the network. We will convert the 50×50 output image to an array of 0’s and 1’s and append it to the input layer. The last element in each row represents the label or the expected output value of each input.
x = list()
img_out10 = np.zeros(2501)
index = 0
for i in range(M):
    for j in range(N):
        if img_out[i][j] == 255:
            img_out10[index] = int(0)   # hand (white) pixels are encoded as 0
            index += 1
        else:
            img_out10[index] = int(1)   # background pixels are encoded as 1
            index += 1
img_out10[index] = label                # the last (2501st) element stores the class label
x.append(img_out10)
return x
Learning algorithm
Set parameters for backpropagation algorithm
The first stage in the learning algorithm is to set the parameters of the neural network. The network will have three layers:
one input layer
one hidden layer
one output layer.
The input to the network is the list returned by the image processing step, with 140 rows of 2501 values each (2500 pixel values plus 1 label value). Therefore, the input layer size will be 2500. The hidden layer size was chosen to be 25, after a number of trials. The output layer size equals the number of labels, so it will be 7.
iLayer_size = 2500
hLayer_size = 25
oLayer_size = 7
fold_size = 5
learning_rate = 0.1
epochs = 500
Create folds for cross-validation
The resampling procedure of cross-validation is widely used for evaluating the learning stage of a neural network on a given piece of data. The procedure has only one parameter, "k", which indicates the number of folds or sets into which the initial learning dataset is split.
The reason why the procedure is called cross-validation is that, each time, the current fold is kept as the test set, while the other "k-1" folds form the training set. Taking into account all the previous explanations, the cross-validation algorithm can be described by the following succession of steps :
Divide the initial set of data into “k” sets, taking randomly values from the initial set
Given each division, do the following:
Keep the current fold as the set used for testing the network
Concatenate the remaining “k-1” sets and use them for training the network
Perform the learning algorithm using the set for training and test the output network with the set for testing
Compute the accuracy of the current division of folds and repeat the above steps, cycling the set for testing
The code for the above algorithm is further presented:
def cross_validation(X, nr_folds):
    folds = list()
    X_temp = list(X)                       # working copy so the original dataset is not modified
    fold_length = int(len(X)/nr_folds)
    for index in range(nr_folds):
        f = list()
        while len(f) < fold_length:
            # draw samples at random (without replacement) into the current fold
            position = random.randrange(len(X_temp))
            f.append(X_temp.pop(position))
        folds.append(f)
    return folds
Create train set and test set
For each fold, we will create the set for training and the set for testing. The current fold will be the test_set and the remaining folds will constitute the train_set.
folds = cross_validation(X, nr_folds)
results = list()
i = 0
for fold in folds:
    # the current fold is the test set; all the other folds form the training set
    training_set = list()
    for j in range(nr_folds):
        if j != i:
            training_set.append(folds[j])
    i += 1
    training_set = sum(training_set, [])   # flatten the list of folds into one list of rows
    testing_set = list()
    for el in fold:
        el = el[:-1]                       # drop the label from the test rows
        testing_set.append(el)
Initialize the network
As mentioned before, the network is organized into three layers: input, output and hidden layer. Every neuron in the network is defined by a set of weights, its output and its delta value (sum of errors from back propagation). Therefore, the neurons will be stored as dictionaries, having the following three properties: ‘output’, ‘weight’ and ‘delta’.
The layers are organized as arrays containing dictionaries, therefore the network will be organized as an array containing layers. The initial weights are computed according to the Glorot initialization , using a uniform distribution.
def initialize_network(iLayer_size, hLayer_size, oLayer_size):
    network = list()
    # weights drawn from a uniform distribution in [-epsilon, epsilon]
    epsilon = math.sqrt(6)/(iLayer_size+hLayer_size)
    hidden_layer = [{'weights': [random.random()*2*epsilon-epsilon for i in range(iLayer_size+1)]} for i in range(hLayer_size)]
    network.append(hidden_layer)
    epsilon = math.sqrt(6)/(hLayer_size+oLayer_size)
    output_layer = [{'weights': [random.random()*2*epsilon-epsilon for i in range(hLayer_size+1)]} for i in range(oLayer_size)]
    network.append(output_layer)
    return network
Train the network
The neural network training has the following three steps which are repeated for as many times as indicated by the chosen number of epochs:
Forward propagation
Back propagation
Weights updating
The array ‘exp_value’ is an array of size equal to the number of labels. It is initialized with 0 values and then updated to have a value of 1 at the index corresponding to the label. For example, if we have ‘ok’ sign, expected value will be [0, 0, 0, 1, 0, 0, 0]. Since indexing in Python starts at 0 and labeling starts at 1, we must subtract 1 from the index at which we update the ‘exp_value’ array.
for e in range(epochs):
    for line in training_set:
        out_value = f_prop(network, line)           # forward propagation
        exp_value = [0 for i in range(oLayer_size)]
        exp_value[int(line[-1])-1] = 1              # one-hot expected output built from the label
        b_prop_err(network, exp_value)              # back propagation of the error
        up_weights(network, line, l_rate)           # weight update
Forward propagation works on each layer of the network, computing the output value for every neuron in the layer. The inputs of one layer are actually the output values of the previous layer. The output value of a neuron is stored under the dictionary key "output".
The first step in forward propagation is to compute the activation, which is the weighted sum of the inputs. The output of the neuron is the transfer function applied to the activation. The transfer function used is the sigmoid activation function.
def activation(w, layer):
    # weighted sum of the inputs; the last weight acts as the bias
    a = w[-1]
    for i in range(len(w)-1):
        a += w[i]*layer[i]
    return a

def transfer_function(a):
    # sigmoid activation
    tf = 1.0/(1.0+math.exp(-a))
    return tf

def tf_deriv(output):
    # derivative of the sigmoid, expressed in terms of the neuron output: g'(z) = g(z)*(1-g(z))
    return output*(1.0-output)

def f_prop(network, line):
    values = line
    for layer in network:
        new_values = []
        for neuron in layer:
            a = activation(neuron['weights'], values)
            neuron['output'] = transfer_function(a)
            new_values.append(neuron['output'])
        values = new_values   # the outputs of this layer become the inputs of the next one
    return values
Back propagation works backwards over the network layers and computes the error for every output neuron. The error is calculated as the difference between the expected output value from the "exp_value" array and the predicted output value obtained by forward propagation, multiplied by the slope of the neuron output.
The slope is calculated with the tf_deriv function. The back propagated error is accumulated and stored in the neuron delta value. This value reflects the change that the error implies on the neuron.
def b_prop_err(network, exp_value):
    l = len(network)
    for index in reversed(range(l)):
        networkLayer = network[index]
        errors = list()
        if index != l-1:
            # hidden layer: accumulate the weighted deltas coming from the next layer
            for j in range(len(networkLayer)):
                e = 0.0
                for neuron in network[index+1]:
                    e += (neuron['weights'][j]*neuron['delta'])
                errors.append(e)
        else:
            # output layer: error is the difference between expected and predicted output
            for j in range(len(networkLayer)):
                neuron = networkLayer[j]
                errors.append(exp_value[j]-neuron['output'])
        for j in range(len(networkLayer)):
            neuron = networkLayer[j]
            neuron['delta'] = errors[j]*tf_deriv(neuron['output'])
The procedure of updating of weight modifies the current weight of each neuron by adding to it the previously computed error multiplied with the learning rate and with the neuron output.
The learning rate is an indicator of how much the weights change when correcting the error. A learning rate of 0.2 will apply 20% of the computed correction to the weight.
def up_weights(network, line, lrate):
    for index in range(len(network)):
        values = line[:-1]
        if index != 0:
            # for deeper layers, the inputs are the outputs of the previous layer
            values = [neuron['output'] for neuron in network[index-1]]
        for n in network[index]:
            for j in range(len(values)):
                n['weights'][j] += lrate*n['delta']*values[j]
            n['weights'][-1] += lrate*n['delta']   # bias weight update
Make predictions on test set
Once the network has finished the training stage, the values of the weights are optimized. Making predictions on a test set means taking each row from the test set and performing forward propagation. The predicted value is the index of the maximum output. This assumes that the labels start from 0; since we started the label notation from 1, we add 1 to the resulting index.
def prediction(network, line):
    predict = f_prop(network, line)
    return predict.index(max(predict))+1   # +1 because labels start at 1
Compute accuracy
For each fold, we can compute the accuracy of the predicted values. These values are compared to the expected values for each row in the testing set. The accuracy is obtained by dividing the number of correct predictions by the number of tests made, multiplied by 100.
def acc(actual_value, predicted_value):
    ok = 0
    for index in range(len(actual_value)):
        if actual_value[index] == predicted_value[index]:
            ok += 1
    result = ok/float(len(actual_value))*100.0
    return result
The network is now trained and we can save it to a file for further use on the Raspberry Pi application. We save the network for each fold and we will use the one with the highest accuracy. The writing to file is done in the following way:
with open('list%s.txt' % i, 'w') as filehandle:
    for listitem in network:
        filehandle.write('%s\n' % listitem)
filehandle.close()   # not strictly needed: the with statement already closes the file
Given all of the presented sub-functions, the main program will be structured in the following way:
functions.create_folders()
X = list()
X = functions.process_images()
iLayer_size = 2500
hLayer_size = 25
oLayer_size = 7
fold_size = 5
learning_rate = 0.05
epochs = 1000
results = nn.train(X, fold_size, learning_rate, epochs, iLayer_size, hLayer_size, oLayer_size)
print(results)
Test the model with Raspberry Pi
At this point, the network is trained and saved to a file. The first step is to read the file and properly extract the useful information. This is done in the function extract() from gui_functions.
The network is organized into layers: input, output and hidden layer. Each neuron is defined by a set of weights, its output and its delta value (sum of errors from back propagation). Therefore, the neurons will be stored as dictionaries, having the following three properties: ‘output’, ‘weight’ and ‘delta’.
Therefore, we will form the two layers from the file, the hidden layer and the output layer, together with their properties: output, weights and delta. For example, for the hidden layer, the syntax is the following:
for line in file:
    splitted = line.split(":")
    inputs = len(splitted)-1
    if k == 1:
        # rebuild each hidden-layer neuron dictionary from the text file
        for i in range(0, inputs, 3):
            h_dict = dict()
            h_dict['output'] = float(splitted[i+1][1:-11].strip())
            h_dict['weights'] = []
            weights = splitted[i+2][2:-10].split(", ")
            for w in weights:
                h_dict['weights'].append(float(w.strip()))
            if i == inputs-3:
                h_dict['delta'] = float(splitted[i+3][1:-3].strip())
            else:
                h_dict['delta'] = float(splitted[i+3][1:-12].strip())
            hidden_layer.append(h_dict)
        k = 2
After the two layers are extracted, we add them to the network:
network.append(hidden_layer)
network.append(output_layer)
return network
For the prediction, we will use the following functions from the neural networks training algorithm: prediction, forward propagations, activation and transfer.
The next step is to open the Pi Camera and to have a continuous capture of images. In order to properly have the gesture detected, the user is asked to place the hand inside a rectangle drawn on the screen.
Writing the text to the screen and drawing the rectangle are done using the cv2 functions 'putText' and 'rectangle'. At each image capture, the variable 'text' is updated according to the gesture made inside the rectangle. Initially, it is an empty string.
camera = PiCamera()
rawCapture = PiRGBArray(camera)
text = ''
for frame in camera.capture_continuous(rawCapture, format="bgr", use_video_port=True):
    stream = frame.array
    rawCapture.truncate(0)   # clear the stream for the next frame
    cv2.rectangle(stream, (400,100), (1000,700), (0,255,0), 0)
    cv2.putText(stream, text, (500,850), cv2.FONT_HERSHEY_SIMPLEX, 5.0, (0,255,0), 3, lineType=cv2.LINE_AA)
    cv2.putText(stream, "Place hand inside rectangle", (100,80), cv2.FONT_HERSHEY_SIMPLEX, 3.0, (0,255,0), 3, lineType=cv2.LINE_AA)
    cv2.imshow('Frame', stream)
Below is a preview of the application at start-up:
Figure 16: Raspberry Pi Application
When the hand is placed in the rectangle, the prediction can be made by pressing the 'c' key (c for Capture) on the keyboard.
Before making the prediction, the image must be processed in the same way as when creating the training dataset (resizing, converting to YCrCb, binarizing, creating output image as array).
In order to make the prediction, the resulting array of 0's and 1's is forward propagated through the network and its output is computed. Depending on the prediction, the text on the screen is updated according to Table 3.
if key == ord("c"):   # 'key' is assumed to come from cv2.waitKey() in the capture loop
    img = stream
    out = np.zeros(2500)
    out = gui_functions.processImage(img)
    prediction = gui_functions.predict(network, out)
    if prediction == 1:
        text = 'FIST'
    if prediction == 2:
        text = 'HAND'
    if prediction == 3:
        text = 'LEFT'
    if prediction == 4:
        text = 'OK'
    if prediction == 5:
        text = 'PEACE'
    if prediction == 6:
        text = 'RIGHT'
    if prediction == 7:
        text = 'THUMBS-UP'
The application is tested for all the 7 gestures, and it is successfully predicting the correct output. The following images show how the application works:
Figure 17: Raspberry Pi Application: predicting 'fist'
Figure 18: Raspberry Pi Application: predicting 'hand'
Figure 19: Raspberry Pi Application: predicting 'left'
Figure 20: Raspberry Pi Application: predicting 'ok'
Figure 21: Raspberry Pi Application: predicting 'peace'
Figure 22: Raspberry Pi Application: predicting 'right'
Figure 23: Raspberry Pi Application: predicting 'thumbs-up'
Results and discussions
Discussing the results of this work mainly comes down to discussing the accuracy obtained when predicting with the trained neural network. The factors which influence the result are:
The number of hidden layers chosen and the size of each hidden layer
The learning rate
The number of epochs
The diversity of the input images.
For the same learning rate and number of epochs, a change in the size of the hidden layer did not improve significantly the accuracy. The size was chosen to be 25, similar to related works from the literature.
Therefore, we varied the learning rate and the number of epochs.
Initially, the learning rate was set to 0.3 and the number of epochs to 500. The accuracy obtained was around 70 – 75%. After some variations, the algorithm was able to give a decent accuracy of 85 – 90% for a learning rate of 0.05 and 1000 epochs.
When testing the application on the Raspberry Pi, the background conditions are quite different from the background of the training images, which can lead to a decrease in the accuracy of the predictor. This aspect was not taken into account when starting this work and it is clearly one of the first things to address in the future development of the application.
Even though the overall accuracy is between 80-85%, the application on the Raspberry Pi was able to identify all seven gestures: fist, hand, left, ok, peace, right and thumbs-up.
CONCLUSIONS
The goal of this paper was to implement gesture recognition software using a neural network trained with the backpropagation algorithm and to test the application on a Raspberry Pi 3 model B+ equipped with a Pi Camera.
The Raspberry Pi 3 was used, at the beginning, for creating the dataset of training images and, at the end, for testing how well the network can predict the correct gesture from an image captured also by the Pi Camera.
In between, due to the large resources required by the backpropagation algorithm, the network was trained offline, on another computer. The resulting network was saved to a file and loaded at application start-up on the Raspberry Pi.
All gestures were tested and the predictor gave the correct label most of the time, with a small number of errors due to the different background of the test images. In future development of the application, this can be improved by using a better hand segmentation algorithm.
Also, regarding future developments, the purpose is to extend the number of gestures to include all signs of the ASL (American Sign Language) alphabet and to be able to interpret a message sent in ASL.
REFERENCES