
On Real Time Implementation of Emotion
Detection Algorithms in Internet of Things
Sorin Zoican
Electronics, Telecommunications and Information Technology Faculty
Telecommunication Department
POLITEHNICA University of Bucharest
Romania
Email: [anonimizat]
Abstract
This chapter describes methods for detecting human emotion using signal processing techniques and their implementation in real time. The first sections present the basic approaches to emotion detection from face images and from speech signals. The tradeoff between detection performance and algorithm complexity is highlighted. The next sections describe architectures of microcontrollers used to carry out the algorithms for emotion detection (including preprocessing and basic tasks) and methods for code optimization. These optimizations aim at real-time operation on mobile devices (e.g. smart phones, tablets). Preprocessing tasks run on the mobile device, while the basic tasks may run on a server or on the device. The last section estimates the computational effort and memory requirements for the image and speech processing involved in emotion detection.
Motivation for detecting emotions in the Internet of Things
Nowadays many smart devices and appliances are connected to one another using high technology with much cognitive intelligence but no emotional intelligence. If the technology could understand human emotion, these devices could encourage positive behavior change by suggesting how our actions could become better. It is therefore necessary for the devices to detect our emotions. Possible applications of emotion-aware wearables are: intelligent homes, automotive, telemedicine, on-line education, shopping recommendations based on emotion, social robots and interactive games.
An intelligent home has various sensors (video and acoustic) that may capture the human face image and voice; these are used to discover the mood, create an adequate ambiance and possibly interact with the person to help him or her take the best action (such as adjusting the lights according to the mood).
In automotive applications, detecting the driver's emotion is useful because road rage can be managed. In this situation the car engine can be adequately controlled despite the manual controls of the driver. In such an application other signals (electrocardiogram, sweating) may also be used to detect the driver's emotion.
Monitoring mental health is another important application of emotion detection. The personal smart phone can detect an ill condition of the person and alert the medical center to manage the situation.
In on-line education, the educational content can be adapted to the mood of the students so that the educational process performs better.
The last three possible applications are related to social behavior: we would be advised what to buy and how to interact with machines, robots or games.
An emotion-aware Internet of Things will bring these applications into the "smart society" and has the potential to transform major industries.
This chapter intends to describe how human emotion may be detected using signal processing techniques and how these techniques can be implemented in a real-time system.
Emotion detection can be achieved by face image processing (eyes, lips) and by speech processing. For both methods, specific features are extracted from the image or audio signal and mapped by a classifier during the training session. The collected data are preprocessed in the device (e.g. windowing, pre-filtering, conditioning, edge detection, segmentation), then the preprocessed data are sent to a server (or a cloud) for complex processing to decide what the emotional state is; the result is sent back to the device, which will suggest a change in our behavior or propose a proper action. The detection process has many complex tasks to be completed, so the detection does not reach 100% accuracy.
Important problems to tackle are what an emotion is and how emotions are measured. The following issues should be analyzed [1]: (a) the event that triggers an emotion, (b) what the physiological symptoms are, (c) what the motor expression of the person is, (d) what the action tendencies are and (e) what feelings are triggered by that event.
For a given event that triggers an emotion the following items should be considered: the frequency and the suddenness of the event occurrence, the importance of the event, the cause of the event (natural or artificial) and how the event determines an emotion. Knowledge of the event that determines a specific emotion is very important because it confirms the detected emotion.
The emotion will trigger physiological symptoms (such as feeling cold or warm, weak limbs, pale face, heart beats slowing or speeding up, muscles relaxing or tensing, breathing slowing or speeding up, sweating). Most of these symptoms can be measured by observing the corresponding motor expression (face changes: mouth, eyes and eyebrow expressions, voice volume changes, body movements). The measurements of the motor expression underlie the emotion detection algorithms. The action tendencies (moving towards or away from the event) may confirm the detected emotion. After measuring the motor expression a training process is performed to detect the emotion correctly. Based on this training process the emotions are classified into neutral, joy, sadness, anger, happiness, pride, hostility etc. Feelings such as intensity, duration, valence, arousal and tension should be determined to classify the emotion better. The key word here is real-time. The device must respond to our emotion fast enough and should have an architecture based on a powerful microcontroller and quick network communication. There are many algorithms involved in the detection of emotions [8]. In this chapter we focus on the algorithms that can be implemented on a wearable device (smart phone or tablet) without sensors attached to the body (e.g. electrodes). The only input devices used in emotion detection will be the microphone and video camera existing on the portable device, so the emotion detection algorithms will use face and voice changes. We focus on the possibility to detect the emotion in real time, based on the best algorithms described in the literature. These algorithms may be simplified, without degrading the performance, to meet this goal.
Emotion detection using frontal face image processing
Most human emotions can be visualized through facial expression [2, 6, 7]. The human face gives very useful clues to detect our mood. The eyes and the mouth in particular best express emotions such as joy, sadness, anger, fear, disgust etc. The emotion detection process from a frontal face image is depicted in Fig. 1.
Fig. 1 The emotion detection process: image acquisition, face recognition, cropping of the lips and eyes regions, extraction of the lips and eyes features, and emotion classification based on the lips and eyes features
The challenging tasks in the above process are face recognition, extraction of the lips and eyes features, and emotion classification. Many methods to detect a face in an image exist. The best known algorithm, Viola-Jones [3, 4], is highly time-consuming and is usually implemented on desktop computers. Other simpler, but still effective, methods are based on skin detection followed by image segmentation to find the face region [7]. These methods have the advantage of being less time-consuming and are suitable for implementation on mobile devices. Face recognition is performed by scanning the picture with given patterns and exploiting the symmetry of the face. The patterns used are shown in Fig. 2 and Fig. 3.
Fig. 2 The five patterns used in face detection
Fig. 3 The symmetry properties of the face
The basic idea of the Viola-Jones algorithm is to find the features associated with each pattern, repeated at different scales. All pixels in each region S1, S2, S3 and S4 are summed, and the features are calculated as illustrated in Table 1. The thresholds T1 to T5 are determined in the training process. The complexity of the Viola-Jones algorithm is high despite the use of specific techniques to speed up the process (such as the integral image and the Ada-boost algorithm) [4]. Specific assumptions are made to make the face detection easier.
Table 1.
Feature    Condition to be fulfilled
a)         S1 - S2 < T1
b)         S1 - S2 > T2
c)         S1 + S3 - S2 < T3
d)         S1 + S3 - S2 > T4
e)         S1 + S4 - S2 - S3 < T5

Images should contain faces with no beard, moustache or glasses, the illumination should be constant and, in particular cases, the ethnicity may affect the recognition process.
The face detection process can be made more robust if preliminary processing techniques such as color equalization and edge detection are involved. The color image (usually an RGB image) is transformed into a binary image and then the lips and eyes are identified by scanning the original image. The image scanning estimates the regions where the eyes and lips should be located, based on the symmetry properties of the human face. The lips are modeled as an ellipse and the eyes are modeled as irregular ellipses [2, 5, 6], as shown in Fig. 4.
Fig. 4 The lips and eyes models
After the regions of the eyes and lips are determined, the parameters a and b (for the lips), respectively a, b1 and b2 (for the eyes), are measured using an edge detector and the training process is run. The training is necessary to determine the ellipse parameters in particular conditions such as neutral, surprise, fear, joy, sadness etc. Based on the training process the emotions are classified. Table 2 shows the detailed operations in emotion recognition from the frontal face image.
The image preprocessing involves normalization in the RGB space (R, G and B are the values of the pixel):

r = R / (R + G + B),  g = G / (R + G + B),  b = B / (R + G + B)

The normalized RGB representation ignores the ambient illumination and may be invariant to changes of the face orientation relative to the light source.
Table 2.
Detailed operations in emotion recognition from frontal face image
Image preprocessing (color and illumination normalization)
Skin detection
Face segmentation
Eyes and lips segmentation
Estimate the parameters of the eyes and lips models (a, b) and (a, b1, b2)
Training process (associate the above parameters with a class (emotion))
Classify the emotion using the current face image

The goal of skin detection is to build a decision rule that distinguishes skin pixels from non-skin pixels. There are many methods to decide if a pixel is a skin or non-skin pixel [8]:

1. Explicit rule (based on the RGB color space).
Considering that R, G and B are the values of a pixel, it is classified as skin if the following conditions are fulfilled:

R > 95 and G > 40 and B > 20 and
max(R, G, B) - min(R, G, B) > 15 and
|R - G| > 15 and R > G and R > B
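As an illustration, this rule translates directly into a per-pixel test. The following is a minimal C sketch; the function name and types are illustrative and not taken from any particular library:

```c
#include <stdlib.h>
#include <stdbool.h>

/* Explicit RGB skin rule with the thresholds listed above. */
static bool is_skin_pixel(unsigned char r, unsigned char g, unsigned char b)
{
    unsigned char max = r, min = r;

    if (g > max) max = g;
    if (b > max) max = b;
    if (g < min) min = g;
    if (b < min) min = b;

    return (r > 95) && (g > 40) && (b > 20) &&
           ((max - min) > 15) &&
           (abs((int)r - (int)g) > 15) &&
           (r > g) && (r > b);
}
```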
 
   
2. Nonparametric skin distribution models
These methods estimate the skin color distribution without finding an explicit model. A set of M lookup tables (LUT), each with N bins (a bin corresponds to a range of a color component), is built; each bin stores the number of times a particular color occurred in the training images. The value

P_skin(c) = LUT(c) / (Σ_{k=1}^{M} Σ_{i=1}^{N} LUT(k, i))

is the probability that the current pixel c is a skin pixel. A pixel is classified as a skin pixel if P_skin(c) ≥ T_skin, where T_skin is a given threshold.
3. Bayes classifier
The value calculated above is a conditional probability P(c | skin), the probability of observing a color knowing that it represents skin. A probability of observing skin given a concrete value of the color, P(skin | c), could be more suitable for skin detection. This probability is given by the Bayes rule:

P(skin | c) = P(c | skin) P(skin) / (P(c | skin) P(skin) + P(c | ~skin) P(~skin))

where P(c | ~skin) is the probability of observing a color knowing that it is not skin. The probabilities P(c | skin), P(c | ~skin), P(skin) and P(~skin) are computed using the skin and non-skin pixel histograms and the total number of skin and non-skin pixels.
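A small C sketch of this computation is given below, assuming the skin and non-skin color histograms and pixel counts were collected during training; all names are illustrative:

```c
/* Bayes-rule skin probability from two color histograms.
   hist_skin[c] and hist_nonskin[c] count how often color bin c occurred in
   skin and non-skin training pixels; n_skin and n_nonskin are the totals. */
double skin_posterior(const unsigned long *hist_skin,
                      const unsigned long *hist_nonskin,
                      unsigned long n_skin, unsigned long n_nonskin,
                      unsigned int c)
{
    double p_c_skin    = (double)hist_skin[c]    / (double)n_skin;      /* P(c|skin)  */
    double p_c_nonskin = (double)hist_nonskin[c] / (double)n_nonskin;   /* P(c|~skin) */
    double p_skin      = (double)n_skin / (double)(n_skin + n_nonskin); /* P(skin)    */
    double p_nonskin   = 1.0 - p_skin;                                  /* P(~skin)   */
    double denom = p_c_skin * p_skin + p_c_nonskin * p_nonskin;

    return denom > 0.0 ? (p_c_skin * p_skin) / denom : 0.0;             /* P(skin|c)  */
}
```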

4. Parametric skin distribution models
The non-parametric models need a large storage space and their performance depends on the data in the training set. A parametric model is more compact (the memory requirements are smaller) and it has the ability to generalize the training data. The skin color distribution is modeled by a probability density function (usually Gaussian):

p(c | skin) = 1 / (2π |σ_S|^{1/2}) · exp[-(1/2) (c - μ_S)^T σ_S^{-1} (c - μ_S)]

where c is the color vector, μ_S is the mean vector and σ_S is the covariance matrix. These parameters are estimated using the training set:

μ_S = (1/n) Σ_{j=1}^{n} c_j   and   σ_S = (1/(n-1)) Σ_{j=1}^{n} (c_j - μ_S)(c_j - μ_S)^T

where n is the number of color vectors c_j.
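For illustration, a minimal C sketch of the Gaussian likelihood evaluation for a two-component color vector (for instance the normalized r and g components) is given below; the two-dimensional choice and all names are assumptions made here:

```c
#include <math.h>

/* Gaussian skin likelihood p(c|skin) for a two-component color vector.
   mu[2] is the mean and cov[2][2] the covariance estimated from training. */
double skin_likelihood(const double c[2], const double mu[2],
                       const double cov[2][2])
{
    const double TWO_PI = 6.283185307179586;
    double det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0];
    /* inverse of the 2x2 covariance matrix */
    double i00 =  cov[1][1] / det, i01 = -cov[0][1] / det;
    double i10 = -cov[1][0] / det, i11 =  cov[0][0] / det;
    double d0 = c[0] - mu[0], d1 = c[1] - mu[1];
    /* Mahalanobis distance (c - mu)^T cov^-1 (c - mu) */
    double m = d0 * (i00 * d0 + i01 * d1) + d1 * (i10 * d0 + i11 * d1);

    return exp(-0.5 * m) / (TWO_PI * sqrt(det));
}
```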
Emotion detection using speech signals
Speech carries information about the message and about emotions too. To detect emotion one should understand how the speech signal is generated. Voice is produced by varying the vocal tract and the excitation source [9, 11]. During emotion both components behave differently compared with the neutral state. There are many features of the speech signal that characterize the vocal tract and the source excitation. Extracting the most important and robust of these features leads to emotion recognition. There are many kinds of emotions in voice: happiness, surprise, joy, sadness, fear, anger, disgust etc. Systems for emotion detection must capture the voice using a microphone, reduce or remove the background noise and associate speech features with a specific emotion. Emotion detection based on the speech signal analyses features such as:
pitch
energy (computing the Teager Energy Operator - TEO)
energy fluctuation
average level crossing rate (ALCR)
extrema based signal track length (ESTL)
linear prediction cepstrum coefficients (LPCC)
mel frequency cepstrum coefficients (MFCC)
formants
consonant vowel transition
The pitch in emotional speech is compared with neutral speech and the discriminative power of the pitch is measured. Specific emotions, such as disgust, may be detected using semantic and prosodic features. The energy measurement reflects the changes in the airflow structure of speech. Emotion detection uses temporal and spectral properties of speech. These features are due to the syllabic, harmonic and formant structure of speech. This processing also improves the quality of speech (it reduces noise and reverberation).
Temporal processing such as ALCR and ESTL finds the temporal features which differ in amplitude and frequency.
The LPCC, MFCC and formants are related to the vocal tract information that changes with emotion. If the above features are combined, the robustness of the emotion detector increases.
The following paragraphs describe in more detail how these algorithms work [12, 14] and analyze their computational complexity.
The LPCC algorithm estimates the speech parameters by predicting the current sample as a linear combination of the past samples, as shown below:
1. Signal windowing with the Hamming window

w(n) = 0.54 - 0.46 cos(2πn / N), 0 ≤ n < N

where N is the speech window length.
2. Pre-filtering with the transfer function H_p(z) = 1 - a z^{-1}, a high pass filter with a = 0.97.
3. Linear prediction: find the linear prediction coefficients (LPC) a_k such that the spectrum of the speech signal S(z) is given by S(z) = E(z)·V(z), where E(z) is the excitation (periodic pulses at the pitch period, or noise) and

V(z) = G / (1 - Σ_{k=1}^{p} a_k z^{-k})

with G the gain and p the number of LPC coefficients. Usually the LPC coefficients are determined using the Levinson-Durbin method, described briefly below:
a) Compute the autocorrelation of the speech signal s(n):

R(j) = Σ_{n=0}^{N-1-j} s(n) s(n+j), j = 0, ..., m

with m the number of LPC coefficients.
b) Compute A_1 = [1, a_1]^T with a_1 = -R(1)/R(0).
c) Calculate E_1 = R(0) + R(1) a_1.
d) For k = 1 to m-1:
- compute the reflection coefficient λ_k = -(Σ_{j=0}^{k} a_j R(k+1-j)) / E_k
- calculate the updated coefficient vector

A_{k+1} = [A_k; 0] + λ_k [0; Ã_k]

where Ã_k denotes the vector A_k with its elements in reverse order
- compute E_{k+1} = (1 - λ_k^2) E_k
4. Cepstral analysis: the real cepstrum c_s(n) is the inverse Fourier transform (IFFT) of the logarithm of the speech spectrum amplitude:

c_s(n) = IFFT[ln |S(ω)|]
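A compact C sketch of steps 1-3 (windowing, pre-filtering and LPC analysis) is given below. It follows the common textbook form of the Levinson-Durbin recursion, which is equivalent to the vector formulation above; buffer sizes and names are illustrative:

```c
#include <math.h>

#define PI 3.14159265358979323846

/* Hamming windowing, pre-emphasis and LPC analysis by the Levinson-Durbin
   recursion (steps 1-3 above). 's' holds N speech samples and is modified
   in place; a[1..p] receives the LPC coefficients (p <= 63 assumed). */
void lpc_analysis(double *s, int N, double *a, int p)
{
    double R[64], E, lambda;
    int i, j, n;

    for (n = 0; n < N; n++)                   /* Hamming window */
        s[n] *= 0.54 - 0.46 * cos(2.0 * PI * n / N);
    for (n = N - 1; n > 0; n--)               /* pre-emphasis, H(z) = 1 - 0.97 z^-1 */
        s[n] -= 0.97 * s[n - 1];

    for (j = 0; j <= p; j++) {                /* autocorrelation R(0)..R(p) */
        R[j] = 0.0;
        for (n = 0; n < N - j; n++)
            R[j] += s[n] * s[n + j];
    }

    E = R[0] > 0.0 ? R[0] : 1e-12;            /* Levinson-Durbin recursion */
    for (i = 1; i <= p; i++) {
        lambda = R[i];
        for (j = 1; j < i; j++)
            lambda -= a[j] * R[i - j];
        lambda /= E;
        a[i] = lambda;
        for (j = 1; j <= i / 2; j++) {        /* symmetric coefficient update */
            double aj = a[j] - lambda * a[i - j];
            a[i - j] -= lambda * a[j];
            a[j] = aj;
        }
        E *= (1.0 - lambda * lambda);
    }
}
```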
The MFCC algorithm determines the Mel coefficients based on audio perception, following the steps:
1. Windowing (Hamming window)
2. Pre-emphasis - spectrally flatten the windowed input signal
3. Compute the spectral power of the speech window, |S(k)|², using the fast Fourier transform (FFT):

|S(k)|² = Re²(FFT[s(n)]) + Im²(FFT[s(n)])
4. Apply the Mel filter to the spectral power. The Mel filter comprises a series of overlapped triangular band pass filter banks that map the powers of the spectrum onto the Mel scale, which resembles the way the human ear perceives sound. The formula of the Mel filter is:

H_m(k) = 0                                  for k < f(m-1)
         (k - f(m-1)) / (f(m) - f(m-1))     for f(m-1) ≤ k ≤ f(m)
         (f(m+1) - k) / (f(m+1) - f(m))     for f(m) ≤ k ≤ f(m+1)
         0                                  for k > f(m+1)

and

Σ_{m=0}^{M-1} H_m(k) = 1
M is the number of Mel filters and f(·) is the set of M+2 Mel frequencies. The Mel frequencies are calculated using the Mel scale

f_Mel = 2595 log10(1 + f / 700)  and  f = 700 (10^{f_Mel / 2595} - 1)

as:

f(m) = (N / F_s) · Mel^{-1}( Mel(f_l) + m · (Mel(f_h) - Mel(f_l)) / (M + 1) )

where f_Mel is the Mel frequency and f is the frequency, N is the window length, F_s is the sampling frequency, and f_l and f_h are the lowest and the highest frequencies in the Mel filter bank.
The output of the Mel filter bank is:

Y(m) = Σ_{k=0}^{N-1} |S(k)|² H_m(k), 0 ≤ m < M

Before the next step the logarithmic spectral power is calculated as:

Y*(m) = log10(Y(m)), 0 ≤ m < M
5. Apply the inverse discrete cosine transform (IDCT) to the Mel filter bank outputs to return to the time domain. A reduction in computational effort is achieved due to the fact that the DCT accumulates most of the information contained in the signal in its lower order coefficients. For K coefficients the Mel cepstrum coefficients are:

c(k) = Σ_{m=0}^{M-1} Y*(m) cos(π k (m + 1/2) / M), 0 ≤ k < K

The first two steps are the same as in the LPCC algorithm.
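A C sketch of steps 4 and 5 is given below, assuming the power spectrum |S(k)|² and the triangular Mel filters H_m(k) have already been computed; the fixed array bounds and names are illustrative:

```c
#include <math.h>

#define PI 3.14159265358979323846

/* Mel filtering, log compression and DCT (steps 4 and 5 above).
   power[N]  - power spectrum |S(k)|^2 of one speech window (N <= 512)
   H[M][512] - precomputed triangular Mel filters H_m(k) (M <= 64)
   mfcc[K]   - resulting Mel cepstrum coefficients */
void mel_cepstrum(const double *power, int N,
                  double H[][512], int M,
                  double *mfcc, int K)
{
    double Y[64];                       /* log Mel filter bank outputs */
    int m, k;

    for (m = 0; m < M; m++) {           /* Y(m) = sum_k |S(k)|^2 H_m(k) */
        double acc = 0.0;
        for (k = 0; k < N; k++)
            acc += power[k] * H[m][k];
        Y[m] = log10(acc > 1e-12 ? acc : 1e-12);   /* avoid log(0) */
    }

    for (k = 0; k < K; k++) {           /* c(k) = sum_m Y(m) cos(pi k (m + 1/2)/M) */
        double acc = 0.0;
        for (m = 0; m < M; m++)
            acc += Y[m] * cos(PI * k * (m + 0.5) / M);
        mfcc[k] = acc;
    }
}
```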
The classifier process
A generic classifier process [10, 15] is illustrated in Fig. 5.
Fig. 5 The classifier process
An input x is analyzed using a set of discrimination functions g_i: R^n → R, i = 1..c, each giving a score corresponding to a class from 1 to c. The input is then classified in the class with the highest score. The classifier is characterized by a classification error calculated as

(number of misclassifications) / (total number of inputs)

and a classification accuracy given by

1 - (number of misclassifications) / (total number of inputs)

The classifier is built in the training phase using as many input combinations as possible to configure the parameters of the discrimination functions. The classifier may be designed using an input set S and tested on a set R ⊆ S. If R is close to S the classifier's performance will increase. The main alternatives are: re-substitution or R-method (R = S), hold-out or H-method (R is half of S), and cross-validation or rotation method (divide S into K subsets of the same size, use K-1 subsets to train and the remaining subset to test; repeat this procedure K times and average the K estimates).
The best known classifier algorithms are summarized below [16, 17, 18].
The k-nearest neighbor classifier (kNN)
Given a query point, a distance function and a training set, find the k training points closest to the query. The similarities (inverse of the distances) of the neighbors in the same class are summed, and the class with the highest score is assigned to the input. Let C = {c_1, c_2, ..., c_p} be a set of p classes and X = {(x_i, y_i) | x_i ∈ R^m, y_i ∈ C, i = 1..n} a training set, where the point (x_i, y_i) is an m-dimensional value x_i classified in class y_i, one of the p classes. For a given input x the distances d_i to each point in the training set are computed as d_i = ||x - x_i|| and the first k distances (sorted in ascending order) are taken. The chosen class for the input x is the class of the majority of the selected k points. The kNN search technique is widely used in various areas such as image processing, multimedia databases and classification. In general, nearest neighbor classifiers typically have good predictive accuracy in low dimensions, but might not in high dimensions.
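A minimal C sketch of the kNN decision (majority vote over the k nearest training points, squared Euclidean distance) is shown below; the fixed array bounds and names are illustrative assumptions:

```c
#define MAX_TRAIN 256   /* illustrative bounds: n <= 256, dim <= 16, classes <= 8 */
#define MAX_DIM   16
#define MAX_CLASS 8

/* Majority vote among the k nearest training points (k <= n assumed).
   train[i] is the i-th training vector with class label[i]. */
int knn_classify(double train[][MAX_DIM], const int *label, int n,
                 const double *x, int dim, int k, int n_classes)
{
    double dist[MAX_TRAIN];
    int used[MAX_TRAIN] = {0};
    int votes[MAX_CLASS] = {0};
    int i, j, best_class = 0;

    for (i = 0; i < n; i++) {                /* squared Euclidean distances */
        double d = 0.0;
        for (j = 0; j < dim; j++) {
            double diff = x[j] - train[i][j];
            d += diff * diff;
        }
        dist[i] = d;
    }

    for (j = 0; j < k; j++) {                /* pick the k smallest distances */
        int nearest = -1;
        for (i = 0; i < n; i++)
            if (!used[i] && (nearest < 0 || dist[i] < dist[nearest]))
                nearest = i;
        used[nearest] = 1;
        votes[label[nearest]]++;
    }

    for (i = 1; i < n_classes; i++)          /* class with the most votes */
        if (votes[i] > votes[best_class])
            best_class = i;
    return best_class;
}
```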
Support vector machine (SVM)
The SVM algorithm finds the decision surface (hyperplane) that maximizes the margin between the data points of the two classes.
Let X = {(x_i, y_i) | x_i ∈ R^m, y_i ∈ {-1, 1}, i = 1..n} be a set where the point (x_i, y_i) is an m-dimensional value x_i classified in class y_i. A hyperplane separating the m-dimensional values x_i into the two classes is given by the relation w·x = b. The optimization problem minimizes ||w|| for the hyperplane w·x = b with the constraint

y_i (w·x_i - b) ≥ 1, i = 1..n

The SVM is a binary classifier. For a problem with more than two classes, we need to create several SVM classifiers. The number of classifiers is equal to the number of classes and the algorithm is applied one class versus the remaining classes. Such a multiclass SVM algorithm has better performance than a kNN algorithm.
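The classification stage of a trained linear one-versus-rest SVM is inexpensive: each class has a weight vector w and bias b obtained offline, and the class with the largest score w·x - b wins. A minimal C sketch, with illustrative names and dimensions, is shown below:

```c
/* Linear one-versus-rest SVM decision: the class with the largest score wins.
   w[c] and b[c] are the weight vector and bias trained offline for class c;
   dim <= 16 assumed. */
int svm_classify(double w[][16], const double *b, int n_classes,
                 const double *x, int dim)
{
    int c, j, best = 0;
    double best_score = -1e30;

    for (c = 0; c < n_classes; c++) {
        double score = -b[c];
        for (j = 0; j < dim; j++)
            score += w[c][j] * x[j];          /* score = w.x - b */
        if (score > best_score) {
            best_score = score;
            best = c;
        }
    }
    return best;
}
```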

Decision tree
A decision tree is a procedure for classifying data based on their attributes. Based on the features of the input, a binary tree is built. The input is classified after tree traversal (the leaves of the decision tree are the possible classes). The construction of a decision tree needs no settings, so exploratory knowledge discovery can use such trees. Given a data set D and an attribute A having k values, the decision tree is built according to the following steps:
a) The data set D is split into k subsets (or classes) D_j
b) Calculate the information gain G(D, A) associated with the attribute A:

G(D, A) = E(D) - Σ_{j=1}^{k} (card(D_j) / card(D)) · E(D_j)

where E(·) is the entropy, calculated as

E(x) = -Σ_{i=1}^{k} p_i log2(p_i)

The parameter p_i is the probability that x belongs to class i and card(y) is the number of elements in the set y
c) Choose the attribute with the highest gain. This attribute becomes the splitting attribute for the current node in the binary tree
d) Step b) is repeated, choosing next the attribute with the next highest gain
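A C sketch of the entropy and information gain computation from step b) is given below, assuming the class counts of each subset D_j have already been tallied; names and the fixed bounds are illustrative:

```c
#include <math.h>

/* Entropy of a subset described by its per-class counts (n_classes <= 8). */
double entropy(const int *counts, int n_classes, int total)
{
    double e = 0.0;
    int i;

    for (i = 0; i < n_classes; i++) {
        if (counts[i] > 0) {
            double p = (double)counts[i] / (double)total;
            e -= p * log2(p);
        }
    }
    return e;
}

/* Information gain G(D, A) for an attribute with k values: entropy of the
   full set minus the weighted entropy of the k subsets D_j it induces. */
double information_gain(int subset_counts[][8], const int *subset_size,
                        int k, int n_classes)
{
    int total_counts[8] = {0}, total = 0;
    double gain;
    int j, i;

    for (j = 0; j < k; j++) {
        total += subset_size[j];
        for (i = 0; i < n_classes; i++)
            total_counts[i] += subset_counts[j][i];
    }
    gain = entropy(total_counts, n_classes, total);
    for (j = 0; j < k; j++)
        gain -= ((double)subset_size[j] / total) *
                entropy(subset_counts[j], n_classes, subset_size[j]);
    return gain;
}
```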
Digital Signal Processor (DSP) architectures
A digital signal processor must have a high-performance, power-efficient architecture that can be used in real-time applications with a high level of computational complexity. The following features characterize a DSP architecture: fast and flexible arithmetic, large dynamic range for arithmetic computation, the ability to load two operands in one processor cycle, hardware circular memory addressing and hardware control of iterations. Most emotion detection algorithms (described in the previous sections) need a large dynamic scale for number representation. The processor arithmetic is either fixed point or floating point. Floating point processors are easier to program than fixed point processors, because there is no need to adapt the algorithm to a fixed point representation or to emulate the floating point operations using fixed point instructions, but they consume more power [23].
Emotion detection algorithms are complex and they should be implemented with no changes, thus flexible computational units are required.
The computational power of DSP processors is measured by the instruction set and the methods which increase the parallelism (hardware loops, computational units working simultaneously, DMA channels, dynamic range). The organization of a microcontroller's memory subsystem has a great impact on its performance. DSP processors have a modified Harvard architecture with separate memories for the program and data. The memories are large enough to accommodate the data and program and they are accessed through DMA channels. The emotion detection algorithms may need a large data memory (especially those algorithms based on image processing). The memories' capacity is up to thousands of Mbytes. These amounts are enough for the instruction code because the processor architecture has a high level of parallelism, the instruction set is powerful and the code density is high. For data memory, the above amounts are necessary to store images, speech signals and the training data set. Parallelism is achieved using a multi-core architecture with at least two processor cores, with two or three computational units each, that can run in parallel. The specialized blocks in the program sequencer that allow hardware control of circular addressing and iterations increase the level of parallelism. A digital signal processor has specialized arithmetic units (arithmetic and logic unit - ALU, multiply and accumulate unit - MAC, and shifter unit - SHIFTER). A rich set of peripherals (such as serial and parallel ports controlled by DMA channels, and timers) used to communicate with external devices such as digital cameras, microphones and various sensors must complete a powerful DSP architecture. These goals can be achieved by a multi-core processor with joint architectures, DSP and ARM, that delivers peak performance up to 24 Giga floating-point operations per second.
This section presents two architectures that meet the above requirements: the Blackfin dual processor family [19] and the SHARC dual core from Analog Devices [20, 21], as representative fixed point and floating point microcontroller architectures.
Blackfin BF561 microcomputers
The core of the Blackfin microcomputer is a 16-bit fixed-point processor based on the unified Micro Signal Architecture (MSA), developed by Analog Devices and Intel to carry out power-sensitive applications with complex computations. An embedded fast Ethernet controller that offers the ability to connect directly to a network, high peripheral support and memory management are provided in the Blackfin architecture. The clock rate and operating voltages can be switched dynamically for specified tasks via software to reduce power, between 350 MHz at 1.6 V and 750 MHz at 1.45 V.
The main features of the Blackfin core are:
dual multiply-accumulate (MAC) units
an orthogonal reduced instruction-set computer (RISC) instruction set
single instruction multiple data (SIMD) programming capabilities
multimedia processing capabilities
The Blackfin processor uses a modified Harvard architecture which allows multiple memory accesses per clock cycle, multifunction instructions and control of vector operations. There are several computational units: the multiply and accumulate MAC unit; the arithmetic unit ALU, which supports single instruction multiple data (SIMD) operations and has a Load/Store architecture; the video ALU, which offers parallel computational power for video operations (quad 8-bit addition and subtraction, quad 8-bit average, Subtract-Absolute-Accumulate); and the addressing unit with dual data fetches in a single instruction. The program sequencer controls the program flow (subroutines, jumps, idles, interrupts, exceptions, hardware loops).
The Blackfin processors have a pipeline architecture which best allocates the workload among the processor's units, leading to efficient parallel processing among the processor's hardware. A single hierarchical memory space shared between data and instruction memory is used in the Blackfin architecture. Fig. 6 and Fig. 7 illustrate the general architecture of the Blackfin processor family and the Blackfin core.
Fig. 6 The Blackfin BF561 general architecture
Fig. 7 The Blackfin core

The multi-core ARM-SHARC architecture
Higher performance may be achieved using a multi-core architecture. These architectures are systems on chip (SoC) or multiprocessor systems that include three interconnected processors (one general purpose and two DSPs). As an example, the ARM-SHARC architecture ADSP-SC58x is shown in Fig. 8.
Fig. 8 The multi-core ARM-SHARC general architecture (ADSP-SC58x)
The ADSP-SC58x architecture includes two SHARC+ cores and one ARM Cortex-A5 core. The SHARC+ cores share a crossbar switch with a shared cache memory and interfaces with internal and external memory. The switch provides access to an array of peripheral I/O, including USB. Two high-speed link ports allow multiple systems on chip or DSP processors to be connected. Communication links include serial ports and analog to digital convertors. The ARM Cortex-A5 handles secured communication between the SHARC+ cores. Cryptographic engines (AES-128 and AES-256 standards) are involved for secure boot and network security support. Operating systems such as Linux and Micrium µC/OS-III are available for the ARM Cortex-A5 and/or SHARC+ cores. System services and device drivers are provided.
The SHARC processor integrates a SHARC+ SIMD core, level one (L1) memory that runs at the full processor speed, a crossbar, instruction and data cache memory and communication ports. It has a modified Harvard architecture with a hierarchical memory structure. The SHARC+ core has two computational processing elements that work as a single-instruction, multiple-data (SIMD) engine. Each processing element contains an ALU, multiplier, shifter and register file. In SIMD mode the processing elements execute the same instruction but they use different data. The architecture is efficient for math-intensive DSP algorithms. The SIMD mode doubles the bandwidth between memory and the processing elements: two data values are transferred between memory and registers. There is a set of pipelined computational units in each processing element: arithmetic/logic unit (ALU), multiplier, and shifter. These units support 32-bit single-precision floating-point, 40-bit extended precision floating-point, 64-bit double-precision floating-point and 32-bit fixed-point data formats and may run in parallel, maximizing computational power. In SIMD mode, multifunction instructions enable the parallel execution of ALU and multiplier operations in both processing elements per core, in one processor cycle for fixed point data or at most six processor cycles for floating point data, depending on the data format. The instruction's execution involves an interlocked pipeline and data dependency check.
A general-purpose data register file exists in each processing element for transferring data between the computational units and allowing unconstrained data flow between the computation units and internal memory. Most units have secondary registers that can be used for a fast context switch. As in the Blackfin processors, hardware loop control is provided. The instruction is coded on 48 bits and accommodates various parallel operations (multiply, add, and subtract in both processing elements while branching and fetching up to four 32-bit values from memory). A new instruction set architecture, named Variable Instruction Set Architecture (VISA), drops redundant/unused bits within the 48-bit instruction to create more compact code.
With its separate program and data memory buses (Harvard architecture) and on-chip instruction conflict-cache, the processor can simultaneously fetch four operands (two over each data bus) and one instruction (from the conflict-cache bus arbiter), all in a single cycle. The program sequencer supports efficient branching (with low delays) for conditional and unconditional instructions using a hardware branch predictor based on a branch target buffer (BTB).
The general architecture of the SHARC+ core is illustrated in Fig. 9.
Fig. 9 The SHARC general architecture
ARM Cortex-A5
The ARM Cortex-A5 is a low-power, high-performance processor based on the ARMv7 architecture with full virtual memory capabilities. This processor runs 32-bit ARM instructions, 16-bit and 32-bit Thumb instructions, and 8-bit Java codes. The architecture is illustrated in Fig. 10.

Fig. 10 The ARM Cortex-A5 core architecture
The main functional units in the ARM Cortex-A5 are:
- floating point unit (integrated with the processor pipeline, optimized for scalar operation and able to emulate vector operation)
- media processing engine (with instructions for audio, video and 3D graphics processing)
- memory management unit (separate data and program memories in a Harvard architecture)
- cache controller (improving performance of the level two cache memory in the hierarchical memory sub-system)
- in-order pipeline with dynamic branch prediction
- intelligent energy manager
The Data Processing Unit (DPU) has general-purpose registers, status registers and control registers. This unit decodes and executes the instructions. Instructions using data in the registers are fed from the Pre Fetch Unit (PFU). Load and store instructions that move data to or from memory are managed by the Data Cache Unit (DCU). The Pre Fetch Unit extracts instructions from the instruction cache or from external memory and predicts the branches in the instruction flow before it transmits the instructions to the DPU for processing. The Data Cache Unit has an L1 data cache controller, a pipeline with load/store architecture and a system coprocessor which performs cache maintenance operations both for the data and the instruction cache. The Bus Interface Unit (BIU) contains the external interfaces and provides a single 64-bit port shared between the instruction side and the data side.
The ARM Cortex-A5 media processing engine features are:
SIMD and scalar single-precision floating-point computation
Scalar double-precision floating-point computation
SIMD and scalar half-precision floating-point conversion
SIMD 8, 16, 32, and 64-bit signed and unsigned integer computation
8 or 16-bit polynomial computation for single-bit coefficients
Structured data load capabilities
Large, shared register file, addressable as 32-bit, 64-bit and 128-bit registers
The instruction set includes: addition and subtraction, multiplication with optional accumulation, maximum or minimum value, inverse square-root, and comprehensive data-structure load (e.g. register-bank-resident table lookup implementation). The floating point unit is characterized by:
Single-precision and double-precision floating-point formats
Conversion between half-precision and single-precision
Fused Multiply Accumulate operations
Normalized and de-normalized data handled in hardware
The memory management unit controls the L1 and L2 memory system, translates virtual addresses to physical addresses and accesses the external memory. The Translation Lookaside Buffer operations are managed by a coprocessor integrated with the core that provides a mechanism for configuring the memory system. An AXI (Advanced eXtensible Interface) provides high bandwidth for data transfers to L2 caches, on-chip RAM, peripherals, and external memory. It comprises a single 64-bit AXI port for instructions and data.
The computational power, memory and peripheral interfaces should be tested to see whether they are sufficient for implementing complex image and speech processing algorithms such as emotion detection.
The experimental results
This section presents the main results that prove the feasibility of implementing the emotion detection algorithms. We tested simplified but effective algorithms using the VisualDSP++ IDE, in C language optimized for speed with code profiling enabled. For each algorithm we measure the execution time for various image sizes and algorithm-specific parameters.
The microcontroller used was the Blackfin BF561, operating at a 750 MHz clock frequency with one core active. Program performance would be better if both cores in the BF561 were enabled or a multi-core chip were used.
The memory amount (between 16 Mbytes and 128 Mbytes) is large enough to store medium size images, and the rich set of input-output peripherals (such as PPI, USB) ensures proper communication between the processor and external devices.
The mouth and eyes detection algorithm
The following algorithm was implemented to detect the mouth and eyes (Fig.
11).

Read image
For the whole image do
  Apply skin filter
  Create binary image
  Scan binary image to determine the face region
For the face region do
  Apply mouth discriminator
  Determine mouth region
  Measure the mouth ellipse parameters
  Apply eyes histogram
  Determine eyes regions
  Measure the eyes ellipse parameters
Apply classifier (or update the database in the training phase)
Determine the type of emotion
Fig. 11 The mouth and eyes detection algorithm
The discriminators for the mouth and eyes are presented below.
Mouth (lips) discrimination
- compute l(r) = -0.776 r² + 0.560 r + 0.2123
- compute f(r) = -0.776 r² + 0.560 r + 0.1766
- the pixel is in the mouth region if

f(r) ≤ g ≤ l(r) and R > 20 and G > 20 and B > 20

where r and g are the normalized color components defined earlier.
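A per-pixel C sketch of this mouth discriminator, using the quadratic bounds given above, is shown below; the function name is illustrative:

```c
/* Mouth (lip) discriminator based on the quadratic bounds l(r) and f(r);
   r and g are the normalized components r = R/(R+G+B), g = G/(R+G+B). */
int is_mouth_pixel(unsigned char R, unsigned char G, unsigned char B)
{
    double sum = (double)R + G + B;
    double r, g, l_r, f_r;

    if (sum == 0.0 || R <= 20 || G <= 20 || B <= 20)
        return 0;
    r = R / sum;
    g = G / sum;
    l_r = -0.776 * r * r + 0.560 * r + 0.2123;   /* upper bound l(r) */
    f_r = -0.776 * r * r + 0.560 * r + 0.1766;   /* lower bound f(r) */
    return (g >= f_r) && (g <= l_r);
}
```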
Eyes discrimination
- convert the RGB image into a grayscale image using the formula

0.2989 R + 0.587 G + 0.114 B

- apply histogram equalization to the grayscale image according to the algorithm below:
a) compute the probability that a pixel x has the value n ∈ {0, 1, ..., L-1}:

p_n = (number of pixels with value n) / (total number of pixels)

b) transform the pixels with value k into pixels with value

T(k) = round((L - 1) Σ_{n=0}^{k} p_n)

- determine the eyes region by applying a threshold operation on the histogram equalized image
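A C sketch of the histogram equalization from steps a) and b), for an 8-bit grayscale image (L = 256), is given below; names are illustrative:

```c
/* In-place histogram equalization of an 8-bit grayscale image. */
void histogram_equalize(unsigned char *img, int n_pixels)
{
    unsigned long hist[256] = {0};
    unsigned char map[256];
    unsigned long cum = 0;
    int i, k;

    for (i = 0; i < n_pixels; i++)           /* histogram of pixel values */
        hist[img[i]]++;

    for (k = 0; k < 256; k++) {              /* T(k) = round(255 * sum_{n<=k} p_n) */
        cum += hist[k];
        map[k] = (unsigned char)((255.0 * cum) / n_pixels + 0.5);
    }

    for (i = 0; i < n_pixels; i++)           /* apply the mapping */
        img[i] = map[img[i]];
}
```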

The results are illustrated in Fig. 12.
Fig. 12 a) Original image; b) binary image after skin detection, binary image after eyes detection, binary image after mouth detection, and face, eyes and mouth region determination (from top left to right)
Measurements of the mouth and eyes parameters are made using the binary images illustrated in Fig. 12. The binary images are scanned horizontally and vertically to find the transitions from black pixels to white pixels and from white pixels to black pixels. The differences between the pixel positions corresponding to these two transitions represent the parameters of the ellipses that characterize the mouth and the eyes. Percentage differences between the mouth and eyes parameters in the image corresponding to the neutral state and the same parameters in the analyzed image are calculated. The emotion is determined after measuring the parameters illustrated in the previous section, as in Table 3.
Table 3.
increase
(relative to
neutral)unchanged
(relative to
neutral)decrease (relative to
neutral)average variation of1 2, ,b b b
(relative to neutral)
Fear 2b1,b b
Happy 2b1,b b
Sad 1 2, ,b b b25%
Angry 1 2, ,b b b25%and1bvariation is the
greatest
Dislike 1 2, ,b b b25%and1bvariation is the
smallest
Surprise 1 2,b bb

The "fear", "happy" and "surprise" emotions are detected more easily, using the change direction of b, b1 and b2 relative to their values in the "neutral" state.
For "sad", "angry" and "dislike" the average difference is computed to discriminate between "sad" and "angry" or "dislike". The discrimination between "angry" and "dislike" is made using the variation of the parameter b1.
The execution time is illustrated in Fig. 13.
Fig. 13 Execution time of emotion detection by face image processing, for various image sizes
One can see that for medium image sizes the image processing can be done in seconds.
The code uses the rules presented in [23] to optimize for speed: hardware loops, multi-function instructions and video ALU instructions.
We analyzed the MFCC algorithm for emotion detection from speech signals. This algorithm is robust and has better performance than the LPCC algorithm [13]. The speech processing takes about 10 microseconds (optimized code) for a speech window length of 256 samples, as shown in Table 4.
Table 4.
Algorithm step    Time (microseconds)
Windowing         0.295111
Pre-filtering     0.003222
Spectral power    4.90889
Mel filter        2.018222
IDCT              2.926222
TOTAL             10.15167

The squared error between the mean MFCC coefficients for the neutral state and the emotion states is illustrated in Fig. 14.
Fig. 14 The mean MFCC coefficients
One can see that the error is greater for the first coefficient (except for the "sad" emotion). Discrimination between emotions may use only the first two or three coefficients. Emotions such as "joy", "fear", "angry" and "sad" are discriminated using the first coefficient.
The kNN classifier can be implemented in hundreds of microseconds (as one can see in Fig. 15).
Fig. 15 The execution time of the kNN classifier
The kNN classifier has been tested for 20 neighbors, 6 classes and up to 15 parameters.

Conclusions
This chapter presents an overview of the methods for emotion detection by image and speech processing and focuses on implementing such algorithms in real time using digital signal processors. Emotion detection implies frontal face image and/or speech processing and should be implemented in real time. These processing algorithms include face detection, determination of regions of interest (e.g. mouth, eyes), and spectral analysis (for speech signals). After extracting the relevant parameters a classifier process is completed to determine the emotion. This process calculates the distances between the current parameters (corresponding to the emotion) and the parameters in a training set corresponding to the possible classes, and determines the most probable class for the current emotion. The main goal of the chapter was to investigate the algorithms for emotion detection and to see whether those algorithms can be realized in real time using DSP processors. The results show that it is possible to carry out most of the existing emotion detection algorithms in real time using code optimization methods. For more complex algorithms the data processing may be performed on a server with more computational power. The LwIP protocol [23] will transfer the needed data and the results to and from the mobile device. The analysis performed in this chapter shows that this approach is not necessary if the analyzed data (that is, the frontal face image size) are not too large (e.g. images of medium size) and the chosen image and speech processing algorithms are not too complex but still effective. The analyzed processors have enough processing power to implement emotion detection algorithms with a reasonable increase of their basic computational effort for voice and data communication, operating system functions and basic input output operations. Future work will analyze methods to increase the accuracy of emotion detection from 70%-80% to over 90% using multi-core DSP processors that can carry out more advanced image processing algorithms and classifiers such as SVM, with operating system support [22, 23] (e.g. the VDK operating system or the Micrium operating system).
References
[1]. Klaus R. Scherer, "What are emotions? And how can they be measured?", Social Science Information, SAGE Publications (London, Thousand Oaks, CA and New Delhi), ISSN 0539-0184, DOI: 10.1177/0539018405058216, Vol. 44(4), 2005, pp. 695-729
[2]. A. Habibizad Navin, Mir Kamal Mirnia, "A New Algorithm to Classify Face Emotions through Eye and Lip Features by Using Particle Swarm Optimization", 2012 4th International Conference on Computer Modeling and Simulation (ICCMS 2012), IPCSIT vol. 22 (2012), IACSIT Press, Singapore, pp. 268-274
[3]. Paul Viola, Michael J. Jones, "Real-Time Face Detection", International Journal of Computer Vision, 57(2), 2004, Kluwer Academic Publishers, pp. 137-154
[4]. Yi-Qing Wang, "An Analysis of the Viola-Jones Face Detection Algorithm", Image Processing On Line, 2014, http://dx.doi.org/10.5201/ipol.2014.104, pp. 129-148
[5]. Cheng-Chin Chiang, Wen-Kai Tai, Mau-Tsuen Yang, Yi-Ting Huang, Chi-Jaung Huang, "A novel method for detecting lips, eyes and faces in real time", Real-Time Imaging, 2003, pp. 277-287
[6]. M. Karthigayan, M. Rizon, R. Nagarajan and Sazali Yaacob, "Genetic Algorithm and Neural Network for Face Emotion Recognition", pp. 57-68, book chapter in "Affective Computing", edited by Jimmy Or, Intech, 2008, ISBN 978-3-902613-23-3
[7]. Moon Hwan Kim, Young Hoon Joo, and Jin Bae Park, "Emotion Detection Algorithm Using Frontal Face Image", 2005 International Conference on Control, Automation and Systems (ICCAS 2005), June 2-5, 2005, Kintex, Gyeong Gi, Korea, pp. 2373-2378
[8]. Vinay Kumar, Arpit Agarwal, Kanika Mittal, "Tutorial: Introduction to Emotion Recognition for Digital Images", Technical Report 2011, inria-00561918
[9]. Rahul B. Lanjewar, D. S. Chaudhari, "Speech Emotion Recognition: A Review", International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Volume 2, Issue 4, March 2013, pp. 68-71
[10]. Vaishali M. Chavan, V. V. Gohokar, "Speech Emotion Recognition by using SVM-Classifier", International Journal of Engineering and Advanced Technology (IJEAT), ISSN: 2249-8958, Volume 1, Issue 5, June 2012, pp. 11-15
[11]. Dipti D. Joshi, M. B. Zalte, "Speech Emotion Recognition: A Review", IOSR Journal of Electronics and Communication Engineering (IOSR-JECE), ISSN: 2278-2834, ISBN: 2278-8735, Volume 4, Issue 4 (Jan.-Feb. 2013), pp. 34-37
[12]. J. Přibil, A. Přibilová, "An Experiment with Evaluation of Emotional Speech Conversion by Spectrograms", Measurement Science Review, Volume 10, No. 3, 2010, pp. 72-77
[13]. Siddhant C. Joshi, A. N. Cheeran, "MATLAB Based Feature Extraction Using Mel Frequency Cepstrum Coefficients for Automatic Speech Recognition", International Journal of Science, Engineering and Technology Research (IJSETR), Volume 3, Issue 6, June 2014, ISSN: 2278-7798, pp. 1820-1823
[14]. Taabish Gulzar, Anand Singh, "Comparative Analysis of LPCC, MFCC and BFCC for the Recognition of Hindi Words using Artificial Neural Networks", International Journal of Computer Applications (0975-8887), Volume 101, No. 12, September 2014, pp. 22-27
[15]. Ludmila I. Kuncheva, "Combining Pattern Classifiers: Methods and Algorithms", John Wiley & Sons, Hoboken, New Jersey, 2004
[16]. Xindong Wu et al., "Top 10 algorithms in data mining", Knowl Inf Syst (2008) 14:1-37, DOI 10.1007/s10115-007-0114-2, Springer-Verlag London Limited 2007
[17]. Xuchun Li, Lei Wang, Eric Sung, "AdaBoost with SVM-based component classifiers", Engineering Applications of Artificial Intelligence, vol. 21 (2008), pp. 785-795
[18]. Niklas Lavesson, "Evaluation and Analysis of Supervised Learning Algorithms and Classifiers", Blekinge Institute of Technology, printed by Kaserntryckeriet, Karlskrona, Sweden, 2006, ISBN 91-7295-083-8
[19]. Analog Devices, BF561 - Blackfin Embedded Symmetric Multiprocessor, Data sheet, 2009
[20]. Analog Devices, ADSP-SC58x/ADSP-2158x - SHARC+ Dual Core DSP with ARM Cortex-A5, Data sheet, 2015
[21]. ARM, Cortex-A5 - Technical Reference Manual, 2010
[22]. Sorin Zoican, "Networking Applications for Embedded Systems", pp. 1-20, book chapter in "Real Time Systems, Architecture, Scheduling, and Application", INTECH, ISBN 978-953-51-0510-7, edited by Seyed Morteza Babamir, 2012
[23]. Sorin Zoican, "Embedded Systems and Applications in Telecommunications", book chapter in "Embedded Systems and Wireless Technology: Theory and practical applications", edited by Raúl Aquino Santos and Arthur Edwards Block, Science Publishers, 2012
