
Real Time Sign Language Estimation System
Pushkar Kurhekar
Department of Computer Engineering
Vidyalankar Institute of Technology
Mumbai, India
[anonimizat]

Janvi Phadtare
Department of Computer Engineering
Vidyalankar Institute of Technology
Mumbai, India
[anonimizat]

Sourav Sinha
Department of Computer Engineering
Vidyalankar Institute of Technology
Mumbai, India
[anonimizat]

Kavita P. Shirsat
Department of Computer Engineering
Vidyalankar Institute of Technology
Mumbai, India
[anonimizat]
Abstract— This paper presents a system which assists people with speech and hearing disabilities to communicate freely. The model extracts signs from video by processing it frame by frame against a minimally cluttered background, and presents the recognized sign as readable text. The system uses a Convolutional Neural Network (CNN) and fastai, a deep learning library, along with OpenCV for webcam input and for displaying the predicted sign. Experimental results show reasonable segmentation of signs under different backgrounds and high accuracy in gesture recognition.

Keywords— CNN, OpenCV, ASL, sign language, classification, deep learning
I. INTRODUCTION
People with impaired speech and hearing use sign language for communication. However, only a small percentage of people know sign language. Those who know it have no trouble communicating with each other, but people who do not understand sign language find it difficult to communicate with them. To overcome this obstacle and to facilitate smooth communication, there should be a system which allows everyone to communicate irrespective of their knowledge of sign language.

This paper attempts to solve this problem by providing a real-time sign language translator. It takes video as input and provides the classified text as output. The system consists of two major parts: 1. video capture (the video is converted to frames which are given to the model for classification) and 2. deep learning (used for classification and for producing the output). The system uses a Convolutional Neural Network (CNN) for classification and the fastai deep learning library for training the CNN.

Our system consists of the following tasks:
1. Getting the input video
2. Converting the video to frames
3. Classifying the frames using the CNN
4. Showing the text output

All these tasks are performed simultaneously to make it a real-time system. The system is trained for ASL (American Sign Language), using 87,000 images covering the letters A-Z, space, nothing and delete. The system can be used on a desktop or laptop computer with a webcam.

II. RELATED WORK
Y. Ji, S. Kim and K. Lee [1] have proposed a novel sign language learning system based on 2D image sampling and concatenating to solve the problems of conventional sign recognition. The system constructs the training data by sampling and concatenating frames from a sign language demonstration video at a certain sampling rate. The learning process is implemented with a convolutional neural network.

C. Chuan, E. Regina and C. Guardino [2] present an ASL recognition system using a Leap Motion sensor. The sensor reports data such as the position and speed of the palm and fingers based on the sensor's coordinate system. The frame rate of data transmission is set at 15 frames per second. They have employed k-nearest neighbors and support vector machine algorithms to classify the 26 letters of the English alphabet in ASL using data obtained from the sensor. They have obtained average classification rates of 72.78% and 79.83% for k-nearest neighbors and support vector machine respectively.
A. Kar and P. S. Chatterjee [3] explain their implementation of a system that converts a video with sign language into text. They first prepared the dataset by converting multiple videos containing sign languages into frames, which they then put into a folder for a particular sign. The input video is then given, converted into frames and matched with the folder containing that sign. Thus, they have proposed an approach for minimizing the time required to match each frame of the input video to the dataset. This approach checks the first frame of the input video against the frames present in the dataset. This first frame would get matched with some number of folders in the dataset. These matched folders are then taken and the rest of the frames of the input video are matched with these folders. After this, the final matched frames are obtained, which are then converted to text. This approach reduces the time complexity considerably.
M. Ahmed, M. Idrees, Z. ul Abideen, R. Mumtaz and S. Khalique [4] have proposed a system using Microsoft Kinect V2. This system has independent modules for gesture-to-speech conversion and speech-to-sign conversion. The system receives performed gestures through the Kinect sensor and interprets them by comparing them with trained gestures, after which they are mapped to corresponding keywords and thus the sign is recognized.

This system further displays animated gestures for the signs. The advantage of this system is the Microsoft Kinect V2 itself, since it has depth sensing and motion sensing, which helps with classification of sign languages: the hand can be differentiated from the background, and motion-dependent letters like J and Z can be classified with their natural signs.
G. A. Rao, K. Syamala, P. V. V. Kishore and A. S. C. S. Sastry [5] have achieved 92.88% accuracy on an Indian Sign Language (ISL) dataset which they created due to the unavailability of ISL datasets. The dataset has five different subjects performing 200 signs each, representing commonly used words in ISL, in 5 different viewing angles under various background environments. Each sign occupies 60 frames or images in a video. To prove the capability of CNNs in recognition, the results are compared with other traditional state-of-the-art techniques like the Mahalanobis distance classifier (MDC), AdaBoost, ANN and Deep ANN.
H. Suk, B. Sin and S. Lee [7] propose a method for recognition of hand gestures in a video using a Dynamic Bayesian Network. They attempt to classify moving hand gestures, such as making a circle or waving, and achieve an accuracy of over 99%. However, the gestures are all distinct from each other and are not American Sign Language letters. The motion-tracking feature would be relevant for classifying the motion-dependent letters of ASL: J and Z.
A. Thongtawee, O. Pinsanoh and Y. Kitjaidure [11] have presented a system for sign language recognition. Features are extracted from video images after preprocessing them, using contour-based techniques. An Artificial Neural Network (ANN) is then used to classify the signs. The ANN is a two-layer feed-forward network with a sigmoid activation function at the hidden layer and a softmax function at the output layer. It classifies all 26 letters of the alphabet. They have achieved an accuracy of 95% in gesture recognition.
V. N. T. Truong, C. Yang and Q. Tran [12] have developed a system that can automatically detect static hand signs of ASL letters, using AdaBoost to select the important features representing each sign and Haar-like features for detecting the static hand sign in an image frame. A webcam is used to capture a live video feed at a rate of about 15 to 20 frames per second, with a resolution of 640×480 pixels. Their system has an accuracy of 98.7%. The output text is also converted to speech using SAPI 5.3, which is integrated into the main Windows SDK provided by Microsoft.
III. PROPOSED SYSTEM
Following is a block diagram of the proposed system that converts ASL hand gestures into text:

Fig. 1. Block diagram of the system

A. Classifier
Our ASL letter classification is done using a Convolutional Neural Network (CNN). CNNs have been proven to handle video and image processing with great accuracy. The main advantage of using CNNs as classifiers is that they can learn features as well as the weights of those features.
B. Transfer Learning
Transfer Learning is a technique in Machine Learning in which models trained on larger, more generic datasets are reused to fit niche or specific data. This is done by reusing the weights from the earlier layers and changing the weights in the later layers to better fit the new data.

The prime advantage of this method is the large reduction in data and time requirements. However, this method may not be very useful if there is a large difference between the data the model was initially trained on and the specific data it is trying to fit. If this difference is large, weight re-initialization might be required, or the learning rate may need to be adjusted accordingly.
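To make the idea concrete, the following is a minimal plain-PyTorch sketch (not the code used for our system, which is built with fastai) showing how pretrained ImageNet weights can be reused: the earlier layers are frozen and only a newly created final layer, sized for the 29 ASL classes, is trained at first.

```python
# Minimal transfer learning sketch in plain PyTorch (illustrative only;
# our system uses fastai, which wraps these steps).
import torch.nn as nn
import torchvision.models as models

model = models.resnet34(pretrained=True)      # weights learned on ImageNet

# Freeze the pretrained layers so their weights are reused as-is.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with a new one for our 29 classes
# (26 letters plus 'space', 'delete' and 'nothing'); only this layer is
# trained initially.
model.fc = nn.Linear(model.fc.in_features, 29)
```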
C. ResNet-34 with fastai
We have used fastai, a deep learning library, to train our CNN. We used a pre-trained ResNet-34 [6] (Residual Network with 34 layers) model trained on the ImageNet [9] dataset, which contains over 1 million images. ResNet won 1st place in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2015 classification competition and the Microsoft Common Objects in Context (MS COCO) 2015 competition.

Since it has been proven to be highly accurate at image processing and classification tasks, we decided to use ResNet-34 as provided by PyTorch, on which the fastai library is based.

The fastai library provides a layer of abstraction over the PyTorch deep learning library, which makes data preparation and model training much simpler. The library is based on research into deep learning best practices and includes good default values for learning rates, data augmentations and transformations that work well for most data, while still having the flexibility for fine-tuning.
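As an illustration, the snippet below sketches how such a model can be trained with the fastai v1 API. The dataset path, batch size and number of epochs are assumptions made for the example, not the exact settings behind our reported results; the images are assumed to be arranged in one sub-folder per class.

```python
# Hedged training sketch using the fastai v1 API; hyperparameters are
# illustrative, not the exact values used for the reported accuracy.
from fastai.vision import *

data = ImageDataBunch.from_folder(
    'asl_alphabet_train',                     # assumed dataset directory
    valid_pct=0.2,                            # hold out 20% for validation
    ds_tfms=get_transforms(flip_vert=False),  # fastai's default augmentations
    size=224,                                 # input size expected by ResNet-34
    bs=64,
).normalize(imagenet_stats)

learn = cnn_learner(data, models.resnet34, metrics=accuracy)
learn.fit_one_cycle(4)                        # train the new head for a few epochs
```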
D. Overall Technique
Our overall approach was to use the pre-trained ResNet-34 model to classify ASL images. Our data consists of images of only hands performing various gestures, in only one orientation, while ImageNet consists of 1000 distinct object classes.

Even though ImageNet has some classes that are quite similar to each other, our data comprised the same object (the hand) in different positions. Due to the large difference between ImageNet and our dataset, we re-initialized the weights and adjusted the learning rate to obtain better accuracy.
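One way to realize such an adjustment with fastai v1 is sketched below: the network is unfrozen and retrained with learning rates chosen from the learning-rate finder. This continues the training sketch above and is an illustrative variant, not necessarily the exact procedure we followed.

```python
# Illustrative fine-tuning step, continuing the `learn` object from the
# training sketch above.
learn.unfreeze()                    # make all layers trainable again
learn.lr_find()                     # sweep learning rates to find a good range
learn.recorder.plot()               # inspect the loss-vs-learning-rate curve

# Discriminative learning rates: earlier layers get smaller updates than
# the newly added head. The exact range is an assumption for this sketch.
learn.fit_one_cycle(4, max_lr=slice(1e-5, 1e-3))
```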
E. Application
Our main goal was to create an application that takes video input from the user and displays the predicted sign as text in real time. To do this, we used OpenCV [10], an extremely popular and robust computer vision library, to capture live video from the native camera (webcam) on laptop computers. The default resolution for most webcams is 640×480, therefore our application uses this as the capture resolution.

Fig. 2. Real-time application

Our application has a predefined region of interest of 224×224 pixels; this ensures that the saved images are exactly the size required by ResNet, so no resizing is needed. The user is expected to perform the signs in this region of interest. After a specified time, the image from that region is captured, saved, and given to the model for classification. Once the model has classified the image, we display the character underneath the region of interest so the user sees it immediately.

As described in Section III-F, we also have 3 special characters: ‘Space’, ‘Delete’ and ‘Nothing’. The user can use the special signs ‘Space’ and ‘Delete’ to type full sentences in real time: signing space inserts a space after the previous character, and delete works like backspace, removing the last character from the string. This makes for an effective way of communicating, as no interaction with a keyboard is required while performing the signs.
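The following sketch shows how such a loop can be put together with OpenCV and a trained fastai learner. The exported model filename, the ROI position and the decision to classify every frame (rather than saving an image after a fixed interval, as our application does) are assumptions made for brevity.

```python
# Hedged sketch of a real-time capture-and-classify loop with OpenCV and a
# fastai v1 learner. File names and ROI position are illustrative.
import cv2
import numpy as np
from fastai.vision import load_learner, Image, pil2tensor

learn = load_learner('.', 'asl_resnet34.pkl')   # assumed exported model file
cap = cv2.VideoCapture(0)                       # default webcam (640x480)
x0, y0 = 100, 100                               # top-left corner of the ROI

while True:
    ok, frame = cap.read()
    if not ok:
        break
    roi = frame[y0:y0 + 224, x0:x0 + 224]       # fixed 224x224 region of interest
    rgb = cv2.cvtColor(roi, cv2.COLOR_BGR2RGB)  # OpenCV gives BGR, model expects RGB
    img = Image(pil2tensor(rgb, dtype=np.float32).div_(255))
    pred_class, _, _ = learn.predict(img)       # predicted ASL character

    cv2.rectangle(frame, (x0, y0), (x0 + 224, y0 + 224), (0, 255, 0), 2)
    cv2.putText(frame, str(pred_class), (x0, y0 + 260),
                cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
    cv2.imshow('ASL translator', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):       # press 'q' to quit
        break

cap.release()
cv2.destroyAllWindows()
```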
F. Dataset
We used a dataset from Kaggle [8], which consists of the 26 letters (A-Z) along with 3 special signs: ‘Space’, ‘Delete’ and ‘Nothing’. This dataset consists of 3000 images per character, which comes to a total of 87,000 images for the whole dataset. The size of the dataset is about 1 gigabyte.

Fig. 3. Dataset

The 3 special signs stated above make the dataset very useful for real-time applications like ours, as the user is able to interactively type text using these signs. All the signs in this dataset are performed with the right hand.
ASL contains two letters, J and Z, which require motion. The dataset we have chosen contains these two letters, but in a specific orientation without motion. There were other datasets available which excluded these letters; however, we picked this one because it includes the full alphabet. Our goal was to create a real-time application, and not being able to type J or Z would lead to a suboptimal user experience. The test data included with this dataset contains only 29 images, to encourage the use of real-world test images. These images were very similar to the ones in the training data.

Since we wanted to test our model on data that simulated more realistic conditions, we used another dataset from Kaggle which was created specifically to serve as testing data for the first dataset [13]. This testing data consists of 30 images per character, making a total of 870 testing images. This enabled us to get a better idea of the model's performance in real-world scenarios.
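For reference, accuracy on this separate test set can be computed with a simple loop such as the one below. It assumes the test images are arranged in one sub-folder per class and reuses the trained `learn` object from the earlier sketch.

```python
# Hedged evaluation sketch over the 870-image test set [13], assuming one
# sub-folder per class and a trained fastai learner named `learn`.
from pathlib import Path
from fastai.vision import open_image

test_dir = Path('asl_alphabet_test')            # assumed test set location
correct = total = 0
for class_dir in sorted(p for p in test_dir.iterdir() if p.is_dir()):
    for img_path in class_dir.glob('*.jpg'):
        pred_class, _, _ = learn.predict(open_image(img_path))
        correct += int(str(pred_class) == class_dir.name)
        total += 1

print(f'Real-world test accuracy: {correct / total:.3f}')
```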
G. Pre-processing and data augmentation
We have not used any typical pre-processing techniques for our application, such as background elimination or skin masking. Data augmentation is one of the most important regularization techniques when training a model for computer vision. We found that fastai has predefined transform values, and after some experimentation, concluded that the default values worked best for our dataset.

By default, fastai performs the following transformations: horizontal flipping, random rotation of up to 10 degrees, zoom, lighting adjustment and warp. Vertical flipping is off by default, and since we do not expect users to sign the symbols upside down, we have not employed this transformation. This augmentation enables us to classify most of the ASL letters signed with either the left or the right hand. However, some letters, like G and Z (for our dataset), tend to be misclassified when signed with the left hand.
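For clarity, the default fastai v1 transform values referred to above are written out explicitly below; the only change from the library defaults is that vertical flipping stays disabled.

```python
# fastai v1's default augmentation values, written out explicitly;
# vertical flipping remains disabled since signs are never upside down.
from fastai.vision import get_transforms

tfms = get_transforms(
    do_flip=True,        # random horizontal flip
    flip_vert=False,     # no vertical flip
    max_rotate=10.0,     # rotate by up to 10 degrees
    max_zoom=1.1,        # zoom in by up to 10%
    max_lighting=0.2,    # brightness/contrast jitter
    max_warp=0.2,        # small perspective warp
)
# `tfms` is the (train, valid) transform pair passed as ds_tfms when
# building the ImageDataBunch in the training sketch above.
```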
H. Real-world testing, results and discussion
Our image classifier reported an accuracy of approximately 78.5% on the testing set (870 images). Some points to note are that no pre-processing of the images is done, and that we have not made use of any specialized hardware such as motion cameras, sensors or special motion-tracking gloves. We tested our application in a variety of locations, considering simple as well as complex backgrounds.

Since the model was trained on a dataset which contained very plain backgrounds, it naturally performs better in such conditions. We found that lighting is an extremely important factor for the speed of the application. In well-lit areas, the rate at which the image is saved and given for classification is much higher, because the native camera runs at a higher frame rate when more light reaches the sensor, whereas in dim or poorly lit areas the camera runs at a much lower frame rate. So, the rate at which images are saved needs to be adjusted according to the lighting conditions.
We observed misclassification in certain pairs of letters which have very similar shapes and orientations. The most frequently confused pairs are V and K, along with S and A. The remaining letters are quite distinct, so misclassification is rare.

The distance of the hand from the camera should not be very large, as this may cause the background to be included in the region of interest, which may lead to misclassification. In conditions where the lighting is good and the hand is near the camera, the entire message is predicted perfectly, whereas in suboptimal conditions misclassification is observed, especially for the pairs specified above.

For example, if the distance is too large, the letter ‘E’ may be predicted as either ‘A’ or ‘S’. We attribute this to the fact that none of these letters have any fingers protruding from the palm area. In such cases, the quality of the device's native camera is extremely important.
With the current default resolution of 640×480, it is entirely possible that the model may misclassify a sign simply because the image was not clear enough. Another point to note is that the dataset contains images which are quite dark and lacking in detail.

Therefore, the distance of the hand from the camera, the lighting conditions and the quality of the camera are all important factors for proper identification of ASL signs in real-world conditions. The quality of the dataset also plays an important role in the accuracy of the model.

IV. CONCLUSION
We implemented and trained an American Sign Language translation application using OpenCV and fastai, with the help of a ResNet-34 CNN classifier. We are able to produce a model with an accuracy of 78.5% on the testing set, and we observe higher accuracy provided the appropriate lighting, distance and camera quality conditions are met. We hypothesize that with a dataset of better quality images, taken in better lighting conditions, and with preprocessing done on the images, the model would be able to perform much better.
V. FUTURE SCOPE AND FURTHER IMPROVEMENTS
The task of classification can be made much simpler by performing heavier pre-processing on the images. Steps such as background removal and skin masking, among others, can be applied to further increase the accuracy of the model and to eliminate the problem of interference from complex backgrounds.
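As an example of the kind of pre-processing meant here, the sketch below applies a simple HSV-based skin mask with OpenCV; the threshold values are assumptions and would need tuning for different skin tones and lighting conditions.

```python
# Illustrative skin-masking step with OpenCV; the HSV thresholds are assumed
# values and would need tuning in practice.
import cv2
import numpy as np

def skin_mask(bgr_roi):
    """Suppress non-skin pixels in a BGR region of interest."""
    hsv = cv2.cvtColor(bgr_roi, cv2.COLOR_BGR2HSV)
    lower = np.array([0, 40, 60], dtype=np.uint8)     # assumed lower HSV bound
    upper = np.array([25, 255, 255], dtype=np.uint8)  # assumed upper HSV bound
    mask = cv2.inRange(hsv, lower, upper)
    # Remove small speckles so the hand region stays connected.
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    return cv2.bitwise_and(bgr_roi, bgr_roi, mask=mask)
```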
Our current implementation is based on a predefined, fixed region of interest. By using object detection to locate the hand, specifically the palm area, there would be no need for a fixed region of interest, as it could then be drawn with the required dimensions around the user's hand itself rather than at fixed coordinates. This would vastly improve the user experience and make the application feel more natural.
Currently, once the text has reached the end of the video capture window, it does not wrap around to the next line. Implementing text wrapping would enable the user to write long sentences, where they are currently limited to a certain character count.
The text can also be converted to speech by using a text-to-speech tool such as Google's Cloud Text-to-Speech if required. This would make conversations possible between a person with a speech or hearing impairment and one without.
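A hedged sketch of this extension is shown below, using the google-cloud-texttospeech Python client; it assumes Google Cloud credentials are configured, and the exact class names may differ between client library versions.

```python
# Hedged text-to-speech sketch using Google's Cloud Text-to-Speech client.
# Requires Google Cloud credentials; API details may vary across versions.
from google.cloud import texttospeech

def speak(text, out_path='output.mp3'):
    client = texttospeech.TextToSpeechClient()
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=text),
        voice=texttospeech.VoiceSelectionParams(language_code='en-US'),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3),
    )
    with open(out_path, 'wb') as f:
        f.write(response.audio_content)   # MP3 bytes of the spoken sentence

speak('HELLO WORLD')   # e.g. a sentence typed with the ASL signs
```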
REFERENCES

[1] Y. Ji, S. Kim and K. Lee, "Sign Language Learning System with Image Sampling and Convolutional Neural Network," 2017 First IEEE International Conference on Robotic Computing (IRC), Taichung, 2017, pp. 371-375.
[2] C. Chuan, E. Regina and C. Guardino, "American Sign Language Recognition Using Leap Motion Sensor," 2014 13th International Conference on Machine Learning and Applications, Detroit, MI, 2014, pp. 541-544.
[3] A. Kar and P. S. Chatterjee, "An Approach for Minimizing the Time Taken by Video Processing for Translating Sign Language to Simple Sentence in English," 2015 International Conference on Computational Intelligence and Networks, Bhubaneshwar, 2015, pp. 172-177.
[4] M. Ahmed, M. Idrees, Z. ul Abideen, R. Mumtaz and S. Khalique, "Deaf talk using 3D animated sign language: A sign language interpreter using Microsoft's kinect v2," 2016 SAI Computing Conference (SAI), London, 2016, pp. 330-335.
[5] G. A. Rao, K. Syamala, P. V. V. Kishore and A. S. C. S. Sastry, "Deep convolutional neural networks for sign language recognition," 2018 Conference on Signal Processing And Communication Engineering Systems (SPACES), Vijayawada, 2018, pp. 194-197.
[6] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 770-778.
[7] H. Suk, B. Sin and S. Lee, "Recognizing hand gestures using dynamic Bayesian network," 2008 8th IEEE International Conference on Automatic Face & Gesture Recognition, Amsterdam, 2008, pp. 1-6.
[8] American Sign Language Training Dataset, https://www.kaggle.com/grassknoted/asl-alphabet
[9] J. Deng, W. Dong, R. Socher, L. Li, Kai Li and Li Fei-Fei, "ImageNet: A large-scale hierarchical image database," 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, 2009, pp. 248-255.
[10] A. Kaehler and G. Bradski, Learning OpenCV 3. Sebastopol, CA: O'Reilly Media, 2016.
[11] A. Thongtawee, O. Pinsanoh and Y. Kitjaidure, "A Novel Feature Extraction for American Sign Language Recognition Using Webcam," 2018 11th Biomedical Engineering International Conference (BMEiCON), Chiang Mai, 2018, pp. 1-5.
[12] V. N. T. Truong, C. Yang and Q. Tran, "A translator for American sign language to text and speech," 2016 IEEE 5th Global Conference on Consumer Electronics, Kyoto, 2016, pp. 1-2.
[13] American Sign Language Testing Dataset, https://www.kaggle.com/danrasband/asl-alphabet-test