List of Figures

Figure 1 – Biometrics Classification (© [20]) 12

Figure 2 – Biometrics Enhance Identity Verification in a Border Management and Homeland Security Context (© [21]) 12

Figure 3 – Fingerprint Patterns Ridge, Terminations, Valley, Bifurcations (© [24]) 14

Figure 4 – Example of face recognition scan (© [25]) 14

Figure 5 – Iris normalization (© [27]) 15

Figure 6 – Retina Feature Vector Extraction Process (© [28]) 16

Figure 7 – Features used in Hand and finger geometry recognition (© [29]) 16

Figure 8 – The Iannarelli Identification System Entails the Extraction of 12 Geometric Measurements of the Ear Based On the Crus of Helix (© [31]) 17

Figure 9 – Variations in the Signature of an Individual (© [32]) 17

Figure 10 – Schematic Diagram of the Lab-Made E-Nose System (© [33]) 18

Figure 11 – Architecture Of A Typical Biometric Recognition System. 19

Figure 12 – Automatic Identification System 20

Figure 13 – Automatic Verification System 20

Figure 14 – Crossover Error Rate Compared to False Accept and False Reject Rates 21

Figure 15 – Comparison between DET and ROC Curves of 2 Systems (© [37]) 22

Figure 16 – Simplified General Architecture for Bimodal Recognition. 23

Figure 17 – Human Vocal Apparatus used to Produce Speech 25

Figure 18 – Block Diagram of Human Speech Production (© [38]) 26

Figure 19 – Spectrogram of Speech Signal (Voiced and Unvoiced) 28

Figure 20 – The Phonemes of British English (© [40]) 28

Figure 21 – Discrete-time speech production model (© [41]) 30

Figure 22 – Vocal-Tract Model Comprised of Concatenated Acoustic Tubes (© [41]) 31

Figure 23 – Speaker Identification System (© [43]) 33

Figure 24 – Levels of Features for Speaker Recognition 34

Figure 25 – Extraction of spectral envelope using cepstral analysis and linear prediction (LP) (© [43]) 35

Figure 26 – Speaker information contained in word bigrams, tabulated over the whole SwitchBoard corpus (© [45]) 36

Figure 27 – Conceptual Diagram to Illustrate the VQ Process (© [50]) 39

Figure 28 – Example of a Three-State Hidden Markov Model 40

Figure 29 – Depiction of an M Component Gaussian Mixture Density (© [53]) 41

Figure 30 – Diagram of the Proposed SOM System (© [56]) 42

Figure 31 – Main Components of a Generic Face Recognition System 43

Figure 32 – Classification of face detection methods 45

Figure 33 – Face Detection and Locating Features by Vertical and Horizontal Projections 46

Figure 34 – The Basic Algorithm Used for Face Detection (© [69]) 49

Figure 35 – A set of Principal Components (© [71]) 50

Figure 36 – Left to Right HMM for Face Recognition (© [72]) 50

Figure 37 – Classification of face recognition methods 51

Figure 38 – Eigenfaces Recognition Results 52

Figure 39 – An Example of Five Images of the Same Face. 53

Figure 40 – First Proposed Architecture (© [65]) 54

Figure 41 – Second Proposed Architecture (© [65]) 54

Figure 42 – A Hybrid EPP/PPR Neural Network (© [78]) 55

Figure 43 – Proposed Architecture (© [80]) 55

Figure 44 – Training Phase of the Neural Network (© [84]) 56

Figure 45 – Set of 21 Face Features (© [87]) 58

Figure 46 – Sequence of the Analysis Steps (© [60]) 58

Figure 47 – The Graph Representation of a Face Based on Gabor Jets (© [88]) 59

Figure 48 – Classifier Combination System Framework (© [91]) 60

Figure 49 – Late Fusion Recognition System 61

Figure 50 – Late Fusion Recognition Algorithm Architecture 62

Figure 51 – Second Level Feature Fusion (© [95]) 64

Figure 52 – Spectral Features Extraction 66

Figure 53 – Audio Signal Before and After Pre-Emphasis (© [97]) 67

Figure 54 – Short-Term Spectral Analysis for Speech. 68

Figure 55 – Hamming Window 68

Figure 56 – The Nonlinear Mel-frequency versus Hertz frequency 69

Figure 57 – Used MFC Filterbank 70

Figure 58 – Block Diagram of MFCC Extraction 70

Figure 59 – Signal Spectrogram (© [97]) 71

Figure 60 – MFCCs (© [97]) 72

Figure 61 – Visual Features Extraction 74

Figure 62 – The CbCr plane at constant luma Y=0.5 75

Figure 63 – Used Membership Functions (© [102]) 76

Figure 64 – Skin Segmentation Results 77

Figure 65 – Haar-Like Features 78

Figure 66 – The Haar-Like Patterns used by Viola-Jones Algorithm 78

Figure 67 – Schematic Depiction of a Detection Cascade 79

Figure 68 – Example of Gabor Filter (Real and Imaginary Part) 81

Figure 69 – Gabor Filters (Real Part) 82

Figure 70 – Gabor Filter Output (Convolution Result) 83

Figure 71 – Comparison between a Natural Neuron (top) and an Artificial Neuron (bottom) 84

Figure 72 – SOM Architecture 85

Figure 73 – 2-Dimensional SOM Rectangular or Hexagonal Lattice 85

Figure 74 – Example of Machine Vision Processing (© [111]) 89

Figure 75 – Typical Convolutional Neural Network Architecture (© [116]) 90

Figure 76 – Example of 2-D Convolution without Kernel Flipping (© [117]) 91

Figure 77 – Example of Max-Pooling with 2×2 Filters and Stride 2 (© [118]) 92

Figure 78 – Example of a Fully-Connected Layer 93

Figure 79 – Dropout Neural Net Model (© [120]) 95

Figure 80 – Schematic Structure of an Autoencoder with 3 Fully-Connected Hidden Layers. 96

Figure 81 – Block Diagram of Proposed Face Recognition Procedure 100

Figure 82 – Features Vector Extraction 101

Figure 83 – Stacked Autoencoder Architecture 102

Figure 84 – SOM-based Speaker Recognition Block Diagram 103

Figure 85 – SOM Architecture 104

Figure 86 – Proposed Algorithm Stages 105

Figure 87 – CNN Features Vectors Extraction 107

Figure 88 – Used CNN Architecture 108

Figure 89 – Quantization Error (Euclidean Distance) 112

Figure 90 – Quantization Error (Cityblock Distance) 112

Figure 91 – Quantization Error (Chebychev Distance) 113

Figure 92 – Quantization Error (Spearman Distance) 113

Figure 93 – Optimisation Criteria vs. Frame Rate 113

Figure 94 – Optimization Criterion Evolution 114

Figure 95 – System Mean Quantization Error No Noise 115

Figure 96 – System Mean Quantization Error SNR 20dB 116

Figure 97 – Identification Rate vs. SNR using Training Data as Input 116

Figure 98 – Mean Quantization Error for Retell Recordings (Without Noise) 117

Figure 99 – Identification Rate vs. SNR using SOLO Recordings 118

Figure 100 – Identification Rate vs. SNR using RETELL Recordings 118

Figure 101 – Identification Rate vs. SNR using WHISPER Recordings 118

Figure 102 – System Mean Quantization Error Using Retell Sound Files and SNR 20dB 119

Figure 103 – Influence of Noise on the Mean Spearman Distance between the Speaker Model and Corresponding Test Data 120

Figure 104 – CNN Training Progress 121

Figure 105 – CNN ROC Curve for Validation Data 121

Figure 106 – Identification Rate vs. SNR using Training Data as Input 122

Figure 107 – CNN ROC Curve for Test Data 123

Figure 108 – Identification Rate vs. SNR using SOLO Recordings 123

Figure 109 – Identification Rate vs. SNR using RETELL Recordings 124

Figure 110 – Identification Rate vs. SNR using WHISPER Recordings 124

Figure 111 – Sample images from Caltech face database 125

Figure 112 – Sample images from Yale face database 125

Figure 113 – Sample images from Yale B face database 126

Figure 114 – Sample images from ORL face database 126

Figure 115 – CUAVE Speaker Sample 128

Figure 116 – VidTIMIT Speaker Sample 129

Figure 117 – Multimodal CNN Architecture 130

Figure 118 – Voice Model ROC Curve for Testing Data 132

Figure 119 – Face Model ROC Curve for Testing Data 132

Figure 120 – Final Classifier Architecture 133

Figure 121 – Multimodal CNN Error Rate for Different Configuration Parameters 133

Figure 122 – Multimodal CNN ROC Curve (logarithmic scale plot) 134

Figure 123 – ROC curve for Voice Model 135

Figure 124 – ROC curve for Face Model 135

Figure 125 – Multimodal CNN Error Rate for Different Configuration Parameters 136

Figure 126 – ROC curve for Multimodal CNN (logarithmic scale plot) 136

Abstract

Automatic speaker recognition is the process by which a machine identifies an individual from a spoken sentence. In recent years, due to the increasing security risks and constraints that appear in our daily lives, the demand for reliable person identification systems has increased significantly.

Over the past twenty years, the face and speaker recognition fields within the biometrics community have matured steadily and produced several methods that address the problem of person identification. Unfortunately, automatic speech recognition systems (ASR) and automatic face recognition systems (AFR), when treated separately, are difficult to improve further.

While the existing systems offer high recognition rates under controlled laboratory conditions, their performance is significantly affected by the wide variety of disturbances that occur in real-life situations. Data corrupted by noise from low-quality sensors or environmental conditions, low-quality biometric information or a lack of cooperation from the subject significantly degrade performance.

In order to cope with real-life usage scenarios, a new trend has emerged in the field of automatic biometric recognition systems: the so-called multimodal systems. These systems combine different biometric keys in order to perform the recognition and achieve a higher success rate in challenging usage scenarios.

This thesis addresses the problem of a bimodal biometric system for person identification that combines two widely accepted biometric markers that are inexpensive to collect and process: the human face image (face recognition) and the voice (speaker recognition). Such a selection of biometric information represents a natural combination inspired by the human speaker recognition process. Audio-visual speech recognition (AVSR) systems add visual cues in order to compensate for reduced signal quality (e.g. noisy environments, low-quality acquisition systems).

In our work we develop and test two independent classifiers, one for the voice recognition problem and one for face recognition. Afterwards, their outputs are fuzzified in order to obtain the final result.

In order to show their performance, the above machine learning algorithms are tested using different databases: Caltech 101, Yale Face, Extended Yale B and ORL for face recognition, the CHAINS speech corpus for speaker recognition, and CUAVE and VidTIMIT for bimodal recognition. Two working scenarios are also taken into consideration: a database composed of perturbation-free faces and voices (the ideal case), and a database perturbed with variable Gaussian noise, airport noise and occlusions.

Keywords: face recognition, speaker recognition, bimodal recognition, deep learning, autoencoder, convolutional neural networks

Introduction

The field of automatic biometric recognition has been studied for more than twenty years and is still an active research area due to a variety of potential applications in various domains. In recent years, technological progress has encouraged the development and deployment of automatic recognition systems. The fact that biometric sensors have become cheaper and widely available, correlated with an increase in computational power, has led to biometric recognition systems being included even in mobile devices such as smartphones and tablets.

In order for a recognition method to be “successful”, the process must be carried out in a discreet and non-invasive manner and offer high recognition rates. The demand for reliable biometric identification systems has increased and the technology is now widely applicable in domains such as:

Access control: with the introduction of biometric passports, face recognition systems have been deployed for access control in high-security areas such as airports. The picture of a person is compared in real time with the one stored on his or her biometric passport, validating the person’s identity. Speaker recognition technology is applied in services like tele-banking, mobile commerce and mobile phone user authentication.

Surveillance and law enforcement: closed-circuit television (CCTV) systems have been deployed in many public places, storing huge amounts of information. This information is of great interest in forensic science when criminal investigations are performed. Face recognition algorithms are able to assist law enforcement officers in their investigations by reducing the quantity of information to be reviewed. With the help of speaker recognition algorithms, law enforcement agencies can link recordings obtained from anonymous calls or telephone tapping to criminal or terrorist activities.

Data management: companies such as Google, Microsoft, Facebook or Apple provide, in their image organizer and image viewer software, automatic person identification capabilities (photo tagging) by applying face recognition algorithms. Speech data management is being deployed in “smart meeting rooms” that automatically track and label the conversation.

Personalization: in recent years, a wide variety of smart devices that organize and facilitate our daily life have appeared on the market. Devices such as Google Assistant, Amazon Alexa and Google Home are voice controlled and perform better with good personal customization.

Stallkamp presented a person identification system for vehicular environments that uses audio and visual sensors. Such a system can be used to automatically identify the driver of a car and adjust his or her personal settings (seat, steering wheel, mirrors, driving preferences, etc.).

Overview

In order to identify a person from an a priori established population, recognition systems use one or more biometric “keys”. The choice of the particular biometric key depends on the final application. In the past, machine implementations of person identification focussed on single-“key” techniques such as visual cues (face identification, iris identification), audio cues (speaker recognition) or other biometrics (fingerprints). Although some methods, like fingerprints and retina scanners, present a high recognition rate, their use is limited by the need for cooperative individuals. Other techniques, like face recognition and speaker recognition, have a lower identification rate, but their big advantage is that they rely on natural biometric “keys” that can be obtained without the subject’s cooperation.

Recognition systems based on facial information are strongly influenced by lighting conditions, background, and the relative position of the human face to the camera. Speaker identification systems are affected by background noise and by distortions like echo and reverberation. Starting from the idea that, in order to recognize a person, humans use a variety of attributes, such as acoustic attributes, visual cues and behavioural characteristics (gestures, motor or vocal tics), new identification systems and techniques have been developed that combine several biometric "keys".

Recently, researchers have attempted to combine multiple modalities for person identification. The use of multiple complementary biometric "keys" can improve system performance and robustness, since the degradations affecting each “key” are uncorrelated. The most natural combination of biometric “keys” is the use of image (the human face) and sound (the speaker’s voice).

The combination of voice and image is not only a natural one, but it also improves system performance by using inputs from two complementary domains (audio and video). Thus, while background noise has a negative influence on the identification performance of voice recognition systems, it does not affect the image component. On the other hand, a change in lighting, which degrades the performance of face recognition systems, does not have any effect on the quality of human speech.

“Bimodal recognition exploits the synergy between acoustic speech and visual speech, particularly under adverse conditions. It is motivated by the need – in many potential applications of speech-based recognition – for robustness to speech variability, high recognition accuracy, and protection against impersonation” (Chibelushi, Deravi and Mason).

Thesis Structure

This section provides a brief overview of the contents of the rest of the thesis.

Chapter II: offers a brief overview of the bimodal biometric recognition area. It also provides information about speech production and about face / speech recognition algorithms previously used, such as vector quantization, Gaussian mixture models, hidden Markov models and artificial neural networks.

Chapter III: presents the theory behind speech recognition. It focuses on audio feature extraction methods based on short-term spectral features. We discuss Mel-frequency cepstral coefficients (MFCCs) and the associated filterbanks.

Chapter IV: gives an overview of the face localization and facial feature extraction methods used in our work. It presents the main steps of face localization: skin segmentation (using a colour-based fuzzy classifier) and face detection (the Viola-Jones algorithm). It also introduces Gabor wavelet filters and the Gabor face representation framework for face feature extraction.

Chapter V: is a short presentation of neural networks and the applied methods. We address Self-Organizing Maps (SOMs), deep learning, convolutional neural networks and autoencoders.

Chapter VI: presents the proposed algorithm for the bimodal audio-facial recognition system.

Chapter VII: is the case study; we test our system in different configurations and compare the results with classical approaches.

Chapter VIII: concludes this thesis and provides a summary of the main contributions that can be drawn from the work.

Background in Multimodal Speaker Recognition

Automatic speaker recognition is the process by which an individual is identified using information extracted from the speech signal. The features used are related to the sound generation process that occurs in the larynx and to the filtering of the speech sounds that occurs in the vocal and nasal tracts. In the speech generation process, the larynx acts as the sound source, while the vocal and nasal tracts perform a filtering process. Speech features can also be used in multimodal person recognition systems by combining them with other human traits like face, fingerprints, iris, hand geometry, etc.

Biometrics offers greater security than traditional methods of person recognition. The accuracy of automatic speaker verification (ASV) systems has steadily increased in recent years, producing relatively low to medium error rates, and the technology has a high public acceptance rate due to its unobtrusive nature. The deployment of ASV technologies in sensitive domains such as voice dialling, banking over the telephone and security access has started to raise concerns about potential vulnerabilities of the methods. One such vulnerability is voice mimicking, first studied by Lau et al. Voice mimicking refers to a dedicated effort to manipulate one’s speech so that an ASV system misclassifies the attacker’s sample as originating from the target (client). In order to overcome such vulnerabilities, multiple methods for speaker classification must be used.

Biometrics

The origin of the word “biometrics” can be traced to ancient Greek, where it is composed of two words: “bio” – life, and “metrics” – to measure.

Biometric recognition refers to the use of distinctive characteristics to recognise individuals. The idea of using body measurements to establish the identity of a person is a very old one; it was introduced more than a century ago by Alphonse Bertillon.

Biometric identifiers, or biometrics, can be classified into physiological and behavioural characteristics.

Figure 1 – Biometrics Classification (© )

Physiological biometrics represent physical characteristics that can be measured, such as fingerprints, retina or iris scans, the face and hand geometry. Behavioural biometrics, such as signature or voice, depend on the action that is carried out and can be easily and deliberately changed.

Figure 2 – Biometrics Enhance Identity Verification in a Border Management and Homeland Security Context (© )

Although the classification of biometric identifiers into behavioural and physiological characteristics seems simple and straightforward, most biometric identifiers are in fact a combination of both classes. Behavioural biometrics are related to movement or dynamics, but they are highly dependent on the physiological structure of the individual; voice and gait, for example, depend on the anatomical structure of the human vocal mechanism and the legs, respectively.

According to Jain, any human physiological and/or behavioural characteristic can be used as a biometric identifier to recognise a person as long as the following requirements are satisfied:

Universality: each person should have the biometric.

Distinctiveness: any two individuals should be sufficiently different in terms of their biometric identifiers.

Permanence: the biometric should be sufficiently invariant over a period of time.

Collectability: the biometric can be measured quantitatively.

Biometric Identifiers

Each biometric has its strengths and weaknesses, and in practical biometric systems the choice of identifiers depends on the application, the required accuracy, the degree of intrusiveness (undesirable contact with the subject in order to acquire the biometric data) and the acceptability (the extent to which people are willing to accept a particular biometric identifier in their daily lives). A brief introduction (as presented in ) to the commonly used biometrics is given below.

Deoxyribonucleic Acid (DNA): is the one-dimensional (1-D) ultimate unique code for one’s individuality. It is mostly used in forensic applications for person recognition due to its high performance and permanence over time. As a biometric identifier, DNA is characterised by low collectability and low acceptability due to the very intrusive acquisition method; the recognition process is also very sensitive to contamination and cannot be used in real time.

Fingerprints: have been used as biometrics since 1893, when the Home Ministry Office (UK) accepted that no two individuals have the same fingerprints. A fingerprint is the pattern of ridges and valleys on the surface of a fingertip, the formation of which is determined during the first seven months of fetal development. Fingerprints of identical twins are different, and so are the prints on each finger of the same person. Fingerprints are characterised by medium universality, acceptability and collectability, and high accuracy. One problem with current fingerprint recognition systems is the large amount of computational resources needed when operating in identification mode.

Figure 3 – Fingerprint Patterns Ridge, Terminations, Valley, Bifurcations (© )

Face: is one of the most common biometrics used by humans in their visual interactions and recognition. Unlike other, more reliable biometrics (fingerprints, retinal or iris scans), face recognition is non-intrusive and does not rely on the cooperation of the participants. The applications of facial recognition range from static, controlled “mug-shot” verification to dynamic, uncontrolled face identification against a cluttered background. The main challenge is to develop a face recognition technique that is tolerant to variations in the pose of the face with respect to the camera, light and background variations, facial expressions and the effects of aging.

Figure 4 – Example of face recognition scan (© )

Voice: is a non-intrusive biometric with high acceptability. It is a combination of physiological and behavioural biometrics. The features of an individual’s voice are based on the shape and size of the appendages (e.g., vocal tract, mouth, nasal cavities, and lips) that are used in the synthesis of the sound. These physiological characteristics of human speech are invariant for an individual, but the behavioural part of a person’s speech changes over time due to age, medical conditions (such as a common cold), emotional state, etc. Voice is not expected to have enough distinctiveness to allow the recognition of an individual from a large database of speakers. Voice recognition presents three main disadvantages: 1) the speech signal quality is influenced by the microphone and background noise; 2) speech can be heavily influenced by its behavioural component; 3) it has been shown that some people are extraordinarily skilled in mimicking others’ voices.

Iris: iris recognition technology is highly accurate and fast. During embryonic development, a chaotic morphogenetic process determines the visual texture of the human iris. It is believed that the human iris represents a distinctive and unique feature for each person and even for each eye. The iris is a protected internal organ of the eye; it is immune to the environment and presents a high permanence over time. The biggest weakness of iris recognition systems derives from the intrusive manner required to capture an iris image. Iris recognition applications have recently been used for access control in hospitals and at international frontiers.

Figure 5 – Iris normalization (© )

Retinal scan: the retinal vasculature is, like the iris, a characteristic of each individual and each eye. It is claimed to be the most difficult biometric to forge. However, the image acquisition is quite intrusive and requires considerable cooperation and effort from the user; therefore, this method is characterised by a low public acceptability.

Figure 6 – Retina Feature Vector Extraction Process (© )

Hand and finger geometry: recognition based on hand and finger geometry is characterised by a relatively high permanence. The proposed features related to the human hand are, however, not very distinctive across individuals. Systems that use hand and/or finger geometry are also characterised by a medium acceptability, since the acquisition systems require the cooperation of the user and are perceived as quite intrusive. Their strength lies in collectability, since the representational requirements of the hand are very small. Even so, finger geometry systems are sometimes preferred due to their more compact size.

Figure 7 – Features used in Hand and finger geometry recognition (© )

Ear: ear images can be acquired using the same techniques as face images. Recent studies have shown that the shape of the ear and the structure of the cartilaginous tissue of the pinna are unique enough to allow their use in biometric recognition. The ear changes relatively little during an individual’s life span. Medical reports show that the variation over time is most noticeable from four months to eight years of age and over 70 years of age.

Figure 8 – The Iannarelli Identification System Entails the Extraction of 12 Geometric Measurements of the Ear Based On the Crus of Helix (© )

Signature: handwritten signatures are one of the oldest biometric identification methods. Each individual has a characteristic way of signing and handwriting. In everyday life, the individual’s signature is used for authentication purposes in many sensitive domains such as legal and commercial transactions, banking and official documents. The signature is already a socially accepted form of authentication. The weaknesses of using handwritten signatures in automatic recognition systems are low universality, distinctiveness and permanence: a signature is not unique enough, it is easily forged, and one’s signature may change over time [11] [14].

Figure 9 – Variations in the Signature of an Individual (© )

Gait: humans can often identify a familiar person simply by recognizing the way that person walks. A person’s gait is not very distinctive, but it can be used in some low-security applications. Gait is not permanent over long time periods; several factors such as footwear, terrain, fatigue and injury tend to cause variations.

Odour: every human body generates an odour (a chemical signature). The odour is distinctive to a particular individual and seems to be permanent over time: it contains chemical components that are stable regardless of diet and environmental factors. Although it presents all the characteristics of a good biometric identifier, odour data is not easy to acquire; sensors are expensive and not widely available. The schematic of an electronic nose (E-nose) designed and equipped with software that can detect and classify human armpit body odour can be seen in Figure 10.

Figure 10 – Schematic Diagram of the Lab-Made E-Nose System (© )

Keystroke dynamics: each individual has a unique way of typing on a keyboard. “Keystroke dynamics is a behavioural biometric that aims to identify humans based on the analysis of their typing rhythms on a keyboard”. For each keyboard action, two dynamic parameters are extracted: dwell time (the length of time a key is held down) and fly time (the duration from one key being released to the next key being pressed). Keystroke dynamic authentication systems are non-intrusive and have a high acceptance rate among users [11] [14]. Unfortunately, the method presents significant weaknesses: it is not permanent over time and has a low degree of uniqueness.

Automatic Biometric Recognition Systems

A typical biometric recognition system (Figure 11) consists, in general, of two phases: a training or enrolment phase and a testing or recognition phase. Using different sensors, relevant biometric measurements are captured from the users. The captured data is processed, and relevant features are extracted and stored in a database as user models.

Figure 11 – Architecture Of A Typical Biometric Recognition System.

During the recognition phase, the same type of biometric sensor captures the information from the “unknown” user. Afterwards, the information is processed by the “feature extraction” step. The features are compared, using a “pattern matching” algorithm, with the stored user models from the system database. In this step a degree of similarity, or score, is calculated. It represents the likelihood that the user corresponds to one of the users whose models are stored in the database. Based on the likelihood score, a decision is taken at the end of the recognition process.

ASR Systems Classification

Depending on the final application type, an automatic biometric recognition system can be classified into two distinct categories: identification and verification.

Subject identification: is the process by which an individual is identified, by means of his or her biometric information, from a known population (set of users) (Figure 12).

Figure 12 – Automatic Identification System

In the case of identification systems, the known population can be a closed or an open set. In closed-set identification, the assumption is that the unknown user must match one of the already enrolled users, and the recognition system simply selects the known user that best matches. In open-set identification, the unknown subject may or may not be enrolled in the population set. First, the system selects the user from the known population that presents the maximum likelihood with respect to the unknown subject; afterwards, a verification process is applied in order to ensure that the two users are close enough.
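
To make the distinction concrete, the following Python sketch illustrates the decision logic of closed-set versus open-set identification; the similarity scores and the threshold value are placeholders, not parameters of the system described in this thesis.

import numpy as np

def identify(scores, threshold=None):
    """Closed-set vs. open-set identification from similarity scores.

    scores    : similarity between the unknown sample and each enrolled
                speaker model (higher = more similar).
    threshold : None -> closed-set mode (always return the best match);
                otherwise open-set mode (reject when the best score is too low).
    """
    best = int(np.argmax(scores))
    if threshold is None:            # closed-set: the unknown must be enrolled
        return best
    if scores[best] >= threshold:    # open-set: verify the selected candidate
        return best
    return -1                        # -1 = "unknown / not enrolled"

# Example: three enrolled speakers, one unknown probe
print(identify(np.array([0.31, 0.82, 0.45])))        # closed-set -> speaker 1
print(identify(np.array([0.31, 0.52, 0.45]), 0.7))   # open-set  -> -1 (rejected)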

Subject verification: is the process of accepting or rejecting the identity claimed by a person (Figure 13). Applications of this recognition mode are related to access control in restricted areas. The database contains two models: the enrolled user model and an impostor model. If the unknown user’s biometric information is close enough to the enrolled model, the system accepts the claimed identity; otherwise the user is classified as an impostor.

Figure 13 – Automatic Verification System

Both modes, identification and verification, contain the same functional blocks: acquisition of the biometric “keys”, feature extraction, a database that stores the characteristics of the population, computation of the similarities, a selection process based on maximum selection / a threshold value and, as the last block, the result output.

Evaluation of a Biometric Verification System

In a biometric verification system, two types of errors can occur: false rejections (measured by the false rejection rate, FRR) and false acceptances (or false alarms, measured by the false acceptance rate, FAR). A false rejection error happens when a valid claim is rejected by the system. A false acceptance error is generated when the system accepts the identity claimed by an impostor. Both errors depend on the threshold value used in the decision-making process. The selection of the threshold can reduce one type of error at the expense of the other. The value of FRR and FAR where the two curves intersect (see Figure 14) is called the Equal Error Rate (EER).

Figure 14 – Crossover Error Rate Compared to False Accept and False Reject Rates

By setting a low threshold, the system tends to accept every identity claim, thus making few false rejections and many false acceptances. On the contrary, if the threshold is set to a high value, the system will reject almost every claim and make very few false acceptances but many false rejections. Therefore, when designing a biometric verification system, a trade-off between FRR and FAR must be established by choosing the threshold value.

Two of the most common representations for the performance of a system are: Receiver Operating Characteristic (ROC) curve and Detection Error Trade-offs (DET) curve.

The ROC curve is created by plotting the FAR (false acceptance rate) against the FRR (false rejection rate) at various threshold settings. This curve is monotonically decreasing. ROC analysis represents a direct and natural way to perform a cost/benefit analysis of the system. The DET (Detection Error Trade-offs) curve represents the plot of the error trade-off on a normal deviate scale. DET curves are more easily readable and comparable than ROC curves; a comparison can be seen in Figure 15. The better the system, the closer its curve is to the origin.

Figure 15 – Comparison between DET and ROC Curves of 2 Systems (© )

The DET and ROC curves represent a good way to compare different recognition methods. Unfortunately, plotting the error rates as a function of the threshold is suited only for “theoretical use”. In “real-life” working situations, the operating threshold is set a priori at a given value. In such a case, DET and ROC curves are not sufficient and the system must be evaluated according to a cost function that takes into account the two error rates weighted by their respective costs, that is:

C = Cfa · Pfa + Cfr · Pfr (1)

In this equation, Cfa and Cfr are the costs assigned to false acceptances and false rejections, and Pfa and Pfr are the corresponding false acceptance and false rejection rates, respectively. The cost function is minimal if the threshold is correctly set to the desired operating point.

Another popular measure of the performance of a recognition system is the equal error rate (EER). It corresponds to the operating point where Pfa = Pfr. On the DET curve, the EER operating point can be easily identified by intersecting the DET curve with the first bisector. The EER is a quite popular measure of the ability of a system to separate impostors from true speakers.

Another popular measure is the half total error rate (HTER), which is the average of the two error rates Pfa and Pfr. It can also be seen as the normalized cost function assuming equal costs for both errors.
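
As a rough illustration of these measures, the Python sketch below computes the FAR/FRR pair, the cost function of Eq. (1), the EER and the HTER from two sets of toy similarity scores; the score distributions are synthetic and only serve to exercise the formulas.

import numpy as np

def error_rates(genuine, impostor, threshold):
    """FRR and FAR for one decision threshold (accept if score >= threshold)."""
    p_fr = np.mean(genuine < threshold)     # valid claims rejected
    p_fa = np.mean(impostor >= threshold)   # impostor claims accepted
    return p_fa, p_fr

def metrics(genuine, impostor, c_fa=1.0, c_fr=1.0):
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    rates = np.array([error_rates(genuine, impostor, t) for t in thresholds])
    p_fa, p_fr = rates[:, 0], rates[:, 1]
    cost = c_fa * p_fa + c_fr * p_fr                  # cost function of Eq. (1)
    eer_idx = np.argmin(np.abs(p_fa - p_fr))          # operating point Pfa ~= Pfr
    eer = (p_fa[eer_idx] + p_fr[eer_idx]) / 2
    hter = (p_fa[eer_idx] + p_fr[eer_idx]) / 2        # HTER at that operating point
    return cost.min(), eer, hter

rng = np.random.default_rng(0)
genuine  = rng.normal(2.0, 1.0, 1000)   # scores of true-speaker trials (toy data)
impostor = rng.normal(0.0, 1.0, 1000)   # scores of impostor trials (toy data)
print(metrics(genuine, impostor))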

Bimodal Recognition Systems

A bimodal recognition system combines two biometric “keys” in order to perform the recognition task. The main goal of bimodal systems is to increase the overall performance. Thus, the biometric parameters used in developing such systems must be complementary in order to reduce common error sources and vulnerabilities.

Bimodal recognition systems can be classified, according to the stage at which the fusion of the biometric characteristics occurs, into three classes: early fusion or feature fusion (Figure 16 – a), late fusion or decision fusion (Figure 16 – b) and middle fusion or hybrid fusion.

The most natural combination of biometric “keys” is the use of image (the human face) and sound (the speaker’s voice). Such systems are known as audio-visual speaker recognition (AVSPR) systems.

Figure 16 – Simplified General Architecture for Bimodal Recognition.

(a) Early Fusion (Feature Fusion) (b) Late Fusion (Decision Fusion) (© )

Late fusion, or decision (score) fusion, implies the existence of two classification systems, one for the voice features and one for the video features. Each classifier operates independently of the other; the result of the overall system (the system decision) is, in this case, a linear or nonlinear combination of the outputs of the two classifiers. Because decisions or scores are combined in the last step of the recognition system, many correlations between the properties of the audio-video ensemble are lost. Thus, decision fusion cannot take advantage of the temporal dependencies that exist between the two domains, audio and video.

On the other hand, early fusion, or feature fusion, can exploit the temporal correlations existing between the audio and the video, thus providing a richer set of features by which a speaker can be identified. These methods extract features from the raw data and then combine them to generate the speaker model. By taking the temporal correlation into account, methods using early fusion can achieve better performance than late fusion based methods. They can model the static and dynamic components of a person while speaking, which leads to systems that are less vulnerable to impersonation. A disadvantage of these systems is that their performance depends heavily on the noise level; in addition, they are less robust to sensor failures than decision fusion methods. Moreover, feature fusion techniques generally require more training and are harder to develop due to the difficulty of modelling the audio-visual speech asynchronicity.

The problems arising from decision fusion and feature fusion based techniques have led to the development of a third class of systems, those using hybrid or middle fusion. This method accepts two streams of features as input and then combines them inside the model, thus producing a single system decision.
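
A minimal sketch of the two extreme fusion strategies is given below, assuming that per-speaker scores (for late fusion) or synchronised feature vectors (for early fusion) are already available; the weights and toy scores are illustrative, not values used in this work.

import numpy as np

def early_fusion(audio_feat, video_feat):
    """Feature fusion: concatenate synchronised audio and video feature vectors."""
    return np.concatenate([audio_feat, video_feat], axis=-1)

def late_fusion(audio_scores, video_scores, w_audio=0.5):
    """Decision/score fusion: weighted combination of the per-speaker scores of
    two independent classifiers (the weight could be adapted to the noise level)."""
    return w_audio * audio_scores + (1.0 - w_audio) * video_scores

audio_scores = np.array([0.2, 0.7, 0.1])   # toy per-speaker scores, audio classifier
video_scores = np.array([0.3, 0.5, 0.2])   # toy per-speaker scores, face classifier
print(late_fusion(audio_scores, video_scores).argmax())   # fused decision -> speaker 1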

Speech Recognition Techniques

This section presents several aspects related to speaker recognition: speech production, general presentation of speaker recognition and speaker modelling techniques.

Speech Production

Speech represents the vocalized form of human communication and is humans’ primary means of communication. Through speech, individual information representing the speaker’s identity and gender, and sometimes also the speaker’s emotions, can be transmitted.

Speech production is a complex procedure carried out by the speaker’s respiratory system (Figure 17). It is the process by which spoken words are selected, have their phonetics formulated, and are then articulated by the motor system in the vocal apparatus. The speech production system in humans is composed of the lungs, trachea, larynx and the vocal tract (oral tract and nasal tract). There are two distinct and independent stages in the speech production mechanism: the sound generation in the larynx and the acoustic filtering of the speech sounds in the vocal tract.

Figure 17 – Human Vocal Apparatus used to Produce Speech

Speech initiation is the moment when air is expelled from the lungs. The pulmonary pressure generated by the lungs passes through the larynx, creating a phonation that is then modified into different vowels and consonants by the vocal tract. The larynx, also called the “voice box”, is a complicated system of cartilages, muscles and ligaments. It contains the vocal cords (two horizontal folds of tissue) and the glottis (the gap between the vocal cords).

The shape of the vocal tract is determined by the position of the tongue, lips, jaw and velum. The size and shape of the mouth opening (through which speech sounds are radiated) are controlled by the lips. The size of the opening at the velum controls the acoustic coupling between the nasal and vocal tracts. By lowering the velum, the nose becomes part of the vocal tract to produce the “nasal” sounds of speech. When the velum is drawn up tightly toward the back of the pharyngeal cavity, the nasal cavity is decoupled from the speech production system and non-nasal sounds are produced. The shape of the vocal tract determines the characteristics of each phoneme and acts as a resonator in vowels, semivowels and nasals.

Figure 18 – Block Diagram of Human Speech Production (© )

The vocal cords act in several different ways during speech. Depending on whether or not the vocal cords vibrate during the flow of air from the trachea, speech can be divided into voiced and unvoiced sounds.

Voiced speech is generated when the air stream from the lungs is modulated by the vocal cords through the rapid opening and closing of the vocal folds, causing a buzzing sound from which vowels and some consonants are produced. The oscillation frequency of the vocal cords is called the fundamental frequency, f0, and it depends on physical characteristics of the vocal folds such as mass and tension. The fundamental frequency differs between men, women and children. In the case of an adult male, the pitch ranges between 50 Hz and 250 Hz with an average of 120 Hz; for adult females the upper pitch reaches 500 Hz with an average of 200 Hz. Hence the fundamental frequency is an important physical distinguishing factor, which has been found effective for automatic speech and speaker recognition.
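
As a small illustration of how the fundamental frequency can be estimated from a voiced frame, the following Python sketch applies the classical autocorrelation method to a synthetic 120 Hz signal; the 50-500 Hz search range mirrors the pitch ranges quoted above and is an illustrative choice.

import numpy as np

def estimate_f0(frame, fs, f_min=50.0, f_max=500.0):
    """Rough F0 estimate of a voiced frame using the autocorrelation method."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min, lag_max = int(fs / f_max), int(fs / f_min)
    lag = lag_min + np.argmax(ac[lag_min:lag_max])   # lag of strongest periodicity
    return fs / lag

# Synthetic voiced frame: 120 Hz harmonic signal sampled at 16 kHz
fs = 16000
t = np.arange(0, 0.03, 1 / fs)
frame = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 240 * t)
print(round(estimate_f0(frame, fs)))    # ~120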

After exiting the larynx, the airflow enters the nasal or the oral cavity (vocal tract), which acts as a band-pass filter, allowing some harmonics of the fundamental frequency to pass while suppressing others. Through the oro-nasal process we can differentiate between the nasal consonants (/m/, /n/) and other sounds.

The last stage of speech production, the articulation process, takes place in the oral cavity. The size, shape and acoustics of the oral cavity can be varied by the movements of the palate, tongue, lips, cheeks and teeth. In the case of speech production, the mouth acts as a resonator.

The second type of speech is unvoiced speech; it is generated when the vocal cords do not vibrate. When the vocal tract is constricted during the speech process and the vocal cords do not vibrate as airflow passes through them, a turbulent airflow is generated, which results in noise or a breathy voice. The various positions at which the vocal tract can be constricted determine whether the unvoiced sound is aspirated or fricative. When the constriction in the vocal tract is near the front of the mouth, a fricative sound is produced. In the case of aspirated sounds, the constriction takes place at the beginning of the vocal tract, the glottis, so the excitation signal is modulated by the vocal tract. Because the vocal cords are not used, it is impossible to detect a fundamental frequency, and unvoiced speech can be regarded and modelled as white noise.

By combining phonation and frication, a new type of speech, called mixed voice speech, is produced. Unlike phonation, which takes place at the vocal folds (the vibration of the vocal folds), frication takes place inside the vocal tract.

The classification of the speech signal into voiced and unvoiced provides a preliminary acoustic segmentation for speech processing applications, such as speech synthesis, speech enhancement and speech recognition.

If voiced and unvoiced speech is analysed from the energy point of view using signal frames, we can see that the voiced part contains more energy than the unvoiced one (Figure 19).

Figure 19 – Spectrogram of Speech Signal (Voiced and Unvoiced)
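
This energy difference can be illustrated with a simple short-time energy computation, sketched below in Python on synthetic "voiced" and "unvoiced" segments; the frame and hop sizes (30 ms / 15 ms at 16 kHz) are illustrative choices.

import numpy as np

def short_time_energy(signal, frame_len=480, hop=240):
    """Frame-wise energy; voiced frames typically show much higher energy
    than unvoiced (noise-like) frames."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([np.sum(f.astype(float) ** 2) for f in frames])

fs = 16000
t = np.arange(0, 0.5, 1 / fs)
voiced   = np.sin(2 * np.pi * 120 * t)                              # toy "voiced" segment
unvoiced = 0.1 * np.random.default_rng(0).standard_normal(len(t))   # toy "unvoiced" segment
energy = short_time_energy(np.concatenate([voiced, unvoiced]))
print(energy[:3], energy[-3:])   # high values first, low values last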

Another way to classify speech is through phonetics. In most languages, speech can be divided into phrases, then words, syllables and finally phonemes. A phoneme represents the smallest perceived acoustic speech unit. The set of phonemes is much smaller than the number of sounds that can be produced by a human, and it varies between 20 and 40 phonemes per language.

Figure 20 – The Phonemes of British English (©)

By considering the acoustic characterization of the various sounds, including the place of articulation, the waveforms and the spectrographic properties, the phonemes can be divided into four broad classes: vowels, diphthongs, semivowels, and consonants (see Figure 20).

Vowels: are always voiced sounds, produced by the vibration of the vocal cords with the mouth fully opened. They present the highest intensity and have a duration between 50 and 400 ms. Depending on the tongue position in the mouth, vowels can be classified into front, mid and back. Vowels can also be classified in terms of other factors, such as low or high tongue position, rounded or unrounded lips, and nasalised or non-nasalised. Vowel frequencies are affected by the vocal tract shape.

Diphthongs: a diphthong is a “gliding monosyllabic speech item that starts at or near the articulatory position for one vowel and moves to or toward the position for another”.

Semivowels: “are generally characterized by a gliding transition in the vocal tract area function between adjacent phonemes” . The acoustic characteristics are strongly influenced by the context in which they appear, and are better described as transitional vowel-like sounds.

Consonants: represent speech sounds in which the vocal tract is at least partly obstructed. They can be voiced or unvoiced. With respect to the manner of articulation, consonants can be divided into affricates, nasals, plosives and fricatives:

Nasals: nasal consonants or nasal stops are “produced with glottal excitation and the vocal tract totally constricted” . The velum is lowered and the air is allowed to escape freely through the nose. In the case of nasal consonants, the mouth still acts as a resonance chamber.

Plosives: or stops; are produced by closing the vocal tract and allowing the air pressure to build up before suddenly releasing it. Depending on whether or not the vocal cords vibrate, they can be voiced or unvoiced.

Fricatives: are produced by turbulent airflow passing through a constriction in the vocal tract. Like plosive sounds, fricatives can be divided into voiced and unvoiced. Unvoiced fricatives are produced by turbulent airflow that excites the vocal tract while the vocal cords do not vibrate. The location of the constriction determines the type of sound: the constriction can occur near the lips, near the teeth, in the middle of the oral tract or near the back of the oral tract. In the case of unvoiced fricatives, the waveform has a non-periodic nature.

In the production of voiced fricatives, two excitation sources are involved: the glottis by vibrations of the vocal cords and the frication noise source, produced downstream of a supra-glottal constriction. Thus, the spectrum of the voiced fricatives is made up of two distinctive components.

Affricates: represent dynamic sounds that can be modelled as the concatenation of a short plosive sound and a fricative consonant.

The Source-Filter Model for Speech Production

Figure 21 – Discrete-time speech production model (© )

The short study presented in the previous chapter helped us to understand the differences in how individual phonemes are produced. It characterized the speech sounds in terms of the position and movement of the vocal-tract, variation in their time waveform characteristics, and frequency domain properties such as formant location and bandwidth.

From the anatomical perspective, speech production can be divided into three principal components: excitation production, vocal tract articulation and radiation at the lips and/or nostrils. The main assumptions are that these three components are linear and separable and that sound propagation is planar.

Figure 21 offers a mathematical representation of the human speech production system. The model has two excitation sources: one for voiced speech production, which models the vibration of the vocal cords, and a second source for unvoiced speech, modelled as a random noise generator.

With this model a voiced phoneme, such as a vowel, can be represented as the product of three transfer functions:

S(z) = E(z) · V(z) · R(z) (2)

where E(z) represents the source excitation (voice waveform), V(z) the dynamics of the vocal tract and R(z) the radiation effects. The source excitation can be white Gaussian noise in the case of unvoiced speech or a convolution between noise and the quasi-periodic impulse train generated in the glottis.

The general approach is that the excitation spectrum and the radiation are mostly constant and well known a priori; therefore, the characterization of speech is done using the vocal tract component.
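
A minimal Python sketch of this discrete-time source-filter idea is given below: a quasi-periodic impulse train (voiced) or white noise (unvoiced) excitation is passed through an all-pole vocal-tract filter and a first-order radiation term. The formant frequencies, bandwidth and filter orders are illustrative assumptions, not the model parameters used later in this thesis.

import numpy as np
from scipy.signal import lfilter

fs = 16000
n = int(0.5 * fs)

# Excitation E(z): quasi-periodic impulse train (voiced) or white noise (unvoiced)
f0 = 120
voiced_exc = np.zeros(n)
voiced_exc[::fs // f0] = 1.0
unvoiced_exc = np.random.default_rng(0).standard_normal(n)

# Vocal tract V(z): all-pole filter with resonances near 500, 1500 and 2500 Hz
formants, bandwidth = [500, 1500, 2500], 100
a = np.array([1.0])
for f in formants:
    r = np.exp(-np.pi * bandwidth / fs)
    pole = np.array([1.0, -2 * r * np.cos(2 * np.pi * f / fs), r ** 2])
    a = np.convolve(a, pole)

# Radiation R(z): approximated here by a first-order differentiator 1 - 0.98 z^-1
vowel_like = lfilter([1.0, -0.98], a, voiced_exc)     # synthetic voiced sound
noise_like = lfilter([1.0, -0.98], a, unvoiced_exc)   # synthetic unvoiced sound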

Because speech production is characterized by changes in the shape of the vocal tract, a realistic vocal-tract model consists of a tube that varies as a function of time and of displacement along the axis of sound propagation. Such an approach is quite complex, and a simpler modelling solution can be applied: representing the vocal tract as a series of concatenated lossless acoustic tubes (Figure 22).

Figure 22 – Vocal-Tract Model Comprised of Concatenated Acoustic Tubes (© )

The vocal tract spectral response is characterised by the presence of certain resonant frequencies, or formants. Using the concatenated acoustic tubes model, we can approximate the resonant frequencies of human speech. This approximation depends on the number of tubes, the cross-sectional areas (A1, A2 … An) and the length of each tube (l1, l2 … ln). In reality, the number of formants is unlimited; however, in practice only a finite and reduced number is considered.
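
As a worked example, the simplest limiting case of this model, a single uniform lossless tube closed at the glottis and open at the lips, has resonances at odd quarter-wavelength frequencies; for a tube of about 17 cm this already predicts formants close to the classical 500 / 1500 / 2500 Hz values.

c = 343.0    # speed of sound in air (m/s)
L = 0.17     # approximate vocal-tract length of an adult male (m)

# Single uniform tube, closed at the glottis and open at the lips:
# resonances at F_k = (2k - 1) * c / (4 * L)
formants = [(2 * k - 1) * c / (4 * L) for k in (1, 2, 3)]
print([round(f) for f in formants])    # approximately [504, 1513, 2522] Hz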

General Presentation of Speaker Recognition

Speaker identification is the task of selecting the identity of a speaker from a known population by using his or her voice, based on the individual acoustic features included in the speech. A speaker recognition process is always concerned with extracting information about individuals from their speech.

Speaker recognition methods can be classified into text-dependent methods and text-independent methods. In text-dependent speaker recognition systems, the recognition task is characterized by the use of a known set of words (or lexicon) during the testing phase, which is a subset of the words presented during the enrolment phase. By having a restricted lexicon, the system can offer very short enrolment (or registration) and testing sessions while still delivering an accurate result. In this mode, the identity of the speaker is established using one or more specific predefined phrases, such as passwords, codes or pre-set phrases, making it ideal for applications with strong control over user input. The initial enrolment lexicon, however, is not restricted to any specific set.

Text-independent methods do not impose any constraints on the words which the speaker must use during the testing phase; the recognition is performed on the characteristics extracted from the speaker’s voice.

Figure 23 – Speaker Identification System (© )

All speaker identification systems contain three modules: an acoustic feature extraction module, a speaker modelling technique and a feature matching module.

Selection of Features

The acoustic feature extraction module converts the speech signal from a waveform into a parametric representation. The voice features used for speaker recognition can be divided into different categories with different degrees of complexity (see Figure 24): short-term spectral features / voice source features, spectral-temporal features, prosodic features and high-level features. High-level features, such as semantics, diction, pronunciation and idiosyncrasies, are related to the speaker’s socioeconomic status, education and place of birth, and are more difficult to extract in an automatic manner.

Figure 24 – Levels of Features for Speaker Recognition

Spectral features: represent low-level features generated by the anatomical structure of the vocal tract. Due to this anatomical structure, similar sounds from different speakers present different spectra (location and magnitude of peaks). Spectral features, like the location and magnitude of peaks, are easy to extract in an automatic manner.

Speech is a quasi-stationary signal, so when it is examined over a short period of time (usually between 5 and 100 ms), its characteristics are stationary. Using short frames of about 20-30 ms, statistical spectral analysis is performed on the quasi-stationary speech signal, and a spectral feature vector is extracted from each frame. As a result, short-time spectral analysis represents a good method to characterize the speech signal.

The state-of-the-art speaker identification algorithms are based on statistical models of short-term acoustic measurements. The most popular feature extraction methods are Perceptual Linear Prediction (PLP), Linear Predictive Coding (LPC) and Mel-Frequency Cepstral Coefficients (MFCC). Frequency-domain methods, like MFCC, present an interesting property: they mimic the functional properties of the human ear by using a logarithmic scale. By doing so, methods that use speech spectral analysis tend to present better performance.

Figure 25 – Extraction of spectral envelope using cepstral analysis and linear prediction (LP) (© )

The most used representation of the short-term power spectrum of a human sound is the Mel-frequency cepstrum (MFC), which represents a linear cosine transform of a log power spectrum on a nonlinear Mel scale of frequency. Mel-Frequency Cepstral Coefficients (MFCCs) are the coefficients that collectively make up an MFC. By using the Mel scale, which is a perceptual scale of pitches judged by listeners to be equal in distance from one another, the MFC approximates the human auditory system more closely than the normal cepstrum.
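
As an off-the-shelf illustration (not the MFCC pipeline described later in this thesis), the following Python sketch extracts 13 MFCCs per frame using the librosa library, assumed to be available; the frame length (25 ms), hop (10 ms) and number of Mel bands are typical, illustrative choices.

import numpy as np
import librosa   # assumed available; any MFCC implementation would do

fs = 16000
signal = np.sin(2 * np.pi * 120 * np.arange(0, 1.0, 1 / fs))   # placeholder audio

# 13 MFCCs per ~25 ms frame with a 10 ms hop and 26 Mel filters
mfccs = librosa.feature.mfcc(y=signal.astype(np.float32), sr=fs,
                             n_mfcc=13, n_mels=26, n_fft=400, hop_length=160)
print(mfccs.shape)   # (13, number_of_frames)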

Prosodic features: refer to non-segmental aspects of speech and attempt to capture speaker-specific variation in intonation, timing and loudness, such as syllable stress, accent, intonation patterns, speaking rate and rhythm. The estimation of prosodic features is based on pitch, energy and duration information. Prosodic features can be used at two levels: at the lower level, the direct values of pitch, energy or duration can be used; at a higher level, the system might compute co-occurrence probabilities of certain recurrent patterns and check them in the recognition phase. A challenge in text-independent ASR systems is the modelling of such features: the features must be long enough to capture speaker differences, and they should also be robust against impersonation or other effects that the speaker can voluntarily alter.

The fundamental frequency F0 is one of the most important prosodic parameters. Besides F0-related features such as mean and standard deviation, other prosodic features include: dynamics of the fundamental frequency, energy trajectories, duration (e.g. pause statistics), speaking rate, and energy distribution / modulations.

Phonetic features: the identity of a speaker can be established using speaking patterns and speaker-specific pronunciations extracted from phoneme sequences. Without changing the semantics of an utterance, the same phonemes can be pronounced in different ways, and this high variability in the pronunciation of a given phoneme can be used for speaker recognition. In order to capture dialectal characteristics of the speaker, each variant of each phoneme is recognized, and the frequency of co-occurrence of the phonemes of an utterance (N-grams of phone sequences) is then compared with the N-grams of each speaker.

Syntactical (idiolectal) features: it is well known that some people overuse certain words. By using sequences of recognized words, the speaker’s identity can be established. Each utterance is modelled and converted into a sequence of tokens; sequences of n consecutive tokens are called n-grams. The speaker is characterized using statistics of the co-occurrence of n consecutive tokens. The n-grams represent general models of the language, trained on very large corpora that include different sources from numerous speakers.

Figure 26 – Speaker information contained in word bigrams, tabulated over the whole SwitchBoard corpus (© )
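
A toy Python sketch of such an idiolectal profile, counting word-bigram frequencies over a hypothetical recognizer transcript, is shown below; real systems use large corpora and background language models rather than a single utterance.

from collections import Counter

def bigram_profile(tokens):
    """Relative frequency of consecutive token pairs (word bigrams) --
    a very rough idiolectal 'profile' of a speaker."""
    pairs = list(zip(tokens, tokens[1:]))
    counts = Counter(pairs)
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}

# Toy transcript produced by a speech recognizer (hypothetical data)
speaker_a = "you know i mean you know it is like you know".split()
print(bigram_profile(speaker_a)[("you", "know")])   # high frequency of "you know"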

Lower-level traits, or physical traits, such as spectral information, are easy to extract but can be easily corrupted by noise. Meanwhile, higher-level traits, such as speaking and pause rate, pitch and timing patterns, idiosyncratic word / phrase usage, idiosyncratic pronunciations, etc., are more robust to noise but difficult to extract in an automatic manner. So far, features derived from the speech spectrum have proven to be the most effective in automatic systems. A comparison between MFCC and prosodic features is presented in the table below.

Table 1 – MFCC & Prosodic: A comparative Chart (© )

Speaker Modelling Techniques

By using feature vectors extracted from speaker’s voice data, a speaker model is trained and stored. Modelling techniques used in speaker identification include the usage of: Gaussian Mixture Model (GMM), Hidden Markov Model (HMM), Support Vector Machines (SVM), Vector Quantization (VQ), Dynamic Time Warping (DTW) and, recently, Self-Organizing Maps (SOMs – Kohonen neural networks).

Speaker models can be divided into template models and stochastic models, also known as nonparametric and parametric models, respectively. In order to establish the identity, template-based techniques, such as VQ or DTW, directly compare the training feature vectors against the unknown feature vectors; the distortion between them represents their degree of similarity.

In stochastic models, such as the Gaussian Mixture Model (GMM) or the Hidden Markov Model (HMM), each speaker is modelled as a probabilistic source with an unknown but fixed probability density function. The training phase estimates the parameters of the probability density function from a training sample. Matching is usually done by evaluating the likelihood of the test utterance with respect to the model.

Long-Term Averaging

One of the first modelling techniques used for speaker recognition was the long-term averaging of features extracted from the speech signal in both the time and frequency domains (). Using a large number of feature vectors obtained from each of the known speakers, the method computes the mean and variance of each component of the feature vector over all the samples of a speaker. The similarity between speakers is then determined by calculating a weighted distance between the average feature vectors of two speakers.
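As an illustration, a minimal Python/NumPy sketch of this approach is given below; the synthetic feature matrices, the speaker names and the inverse-variance weighting are assumptions chosen for the example, not details of the original systems.

```python
import numpy as np

def long_term_profile(features):
    """Mean and variance of each feature dimension over all frames (frames x dims)."""
    return features.mean(axis=0), features.var(axis=0)

def weighted_distance(profile_a, profile_b):
    """Weighted Euclidean distance between two average feature vectors.
    The weights are the inverse of the pooled per-dimension variance (an assumption)."""
    mean_a, var_a = profile_a
    mean_b, var_b = profile_b
    weights = 1.0 / (0.5 * (var_a + var_b) + 1e-8)
    return np.sqrt(np.sum(weights * (mean_a - mean_b) ** 2))

# toy usage with random "feature vectors" standing in for per-frame cepstral coefficients
rng = np.random.default_rng(0)
enrolled = {"spk1": long_term_profile(rng.normal(0.0, 1.0, (500, 12))),
            "spk2": long_term_profile(rng.normal(0.5, 1.0, (500, 12)))}
test = long_term_profile(rng.normal(0.5, 1.0, (200, 12)))
identified = min(enrolled, key=lambda s: weighted_distance(enrolled[s], test))
print("identified speaker:", identified)
```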

Markel et al. discussed the applicability of long-term feature averaging using three different feature sets: fundamental frequency features, gain features (gain variation in the speaker’s voice over time) and spectral features.

Intra-speaker variability increases considerably for short utterances; therefore, when the long-term averaging approach is used, the accuracy of the system is highly dependent on the amount of training and testing data.

Dynamic Time Warping (DTW)

Dynamic time warping (DTW) is the most popular method to compensate for speaking-rate variability in template-based systems ().

A text-dependent template model consists of a sequence of N templates that must be matched against an input sequence of M feature vectors, where N is normally not equal to M because of timing inconsistencies in speech. A DTW algorithm performs a constrained, piece-wise linear mapping of one or both time axes to align the two speech signals while minimising the accumulated distance between the aligned frames.
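A minimal dynamic-programming sketch of DTW over two feature-vector sequences is shown below (NumPy, Euclidean local distances); the symmetric step pattern and the absence of global path constraints are simplifying assumptions made for the example.

```python
import numpy as np

def dtw_distance(X, Y):
    """Accumulated DTW distance between two sequences of feature vectors
    X (N x d) and Y (M x d), using Euclidean local distances."""
    N, M = len(X), len(Y)
    cost = np.full((N + 1, M + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            local = np.linalg.norm(X[i - 1] - Y[j - 1])
            # symmetric step pattern: insertion, deletion or match
            cost[i, j] = local + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[N, M]

# toy example: a template and a time-stretched version of it
template = np.sin(np.linspace(0, 3, 40))[:, None]
test = np.sin(np.linspace(0, 3, 55))[:, None]
print("DTW distance:", dtw_distance(template, test))
```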

Vector Quantization (VQ)

Vector quantization (VQ) is a classical quantization technique that uses a small set of templates (code-words) to represent a large set of vectors.

In the case of speaker recognition, the feature vectors extracted from the speech frames are mapped into a predefined number of clusters, each of which is defined by its centroid.

In the training phase, a VQ codebook is created from the training data of each known speaker. In the testing phase, the quantization distortion of the sequence of feature vectors obtained from the unknown speaker is calculated against the whole set of codebooks obtained in the training phase. The codebook that yields the smallest quantization distortion indicates the identified user. One way to define the distortion measure is to use the average of the Euclidean distances.
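The sketch below illustrates this VQ approach with k-means codebooks (scikit-learn) and the average Euclidean distortion as the matching score; the codebook size and the synthetic training data are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebook(features, codebook_size=64):
    """Build a VQ codebook (set of centroids) from a speaker's training vectors."""
    return KMeans(n_clusters=codebook_size, n_init=10,
                  random_state=0).fit(features).cluster_centers_

def avg_distortion(features, codebook):
    """Average Euclidean distance between each vector and its nearest code-word."""
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return dists.min(axis=1).mean()

rng = np.random.default_rng(1)
codebooks = {spk: train_codebook(rng.normal(loc=i, size=(400, 12)), 16)
             for i, spk in enumerate(["spk1", "spk2"])}
test = rng.normal(loc=1, size=(150, 12))
print("identified:", min(codebooks, key=lambda s: avg_distortion(test, codebooks[s])))
```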

R. Hasan, in , proposed a speaker recognition method that uses MFCCs for feature vector extraction and VQ to create the speaker model. The system was tested on clean speech from 21 speakers with different codebook sizes and framing windows, and achieved a recognition rate of 100% for a codebook size of 64.

Figure 27 – Conceptual Diagram to Illustrate the VQ Process (©)

Hidden Markov models (HMM)

Introduced in the late 1960s , the Hidden Markov model (HMM) is a stochastic model commonly used for modelling dynamic processes with unobserved (hidden) states, where the observations are a probabilistic function of the state . In simpler Markov models (such as a Markov chain), the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters. In a Hidden Markov model, the state is not directly visible, but the output, which depends on the state, is visible. Each state has a probability distribution over the possible output tokens; therefore, the sequence of tokens generated by an HMM gives some information about the sequence of states.

The HMM is an N-state finite state machine in which a pdf (or feature-vector stochastic model) is associated with each state (the main underlying model). The states are connected by a transition network defined by the state transition probabilities a_{ij} = P(q_t = j \mid q_{t-1} = i).

Figure 28 – Example of a Three-State Hidden Markov Model

(Ergodic Model and Left-To-Right Model)

The probability that a sequence of speech frames was generated by this model can be computed with the forward procedure; the model parameters themselves are estimated with an iterative expectation-maximization (EM) procedure over the complete training data, also known as Baum-Welch training. This likelihood is the score for L frames of input speech given the model:

p(\vec{x}_1, \ldots, \vec{x}_L \mid \lambda) = \sum_{q_1, \ldots, q_L} \pi_{q_1} b_{q_1}(\vec{x}_1) \prod_{t=2}^{L} a_{q_{t-1} q_t} b_{q_t}(\vec{x}_t)    (3)
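For illustration, the forward algorithm below computes the likelihood of an observation sequence for a small discrete HMM in NumPy; the three states, the transition matrix and the discrete emission table are made-up values, and real speech or speaker models would use continuous (e.g. Gaussian mixture) emissions instead.

```python
import numpy as np

def forward_likelihood(obs, pi, A, B):
    """Likelihood P(obs | model) of a discrete observation sequence for an HMM
    with initial probabilities pi (N), transitions A (N x N) and emissions B (N x K)."""
    alpha = pi * B[:, obs[0]]                 # initialisation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]         # induction step
    return alpha.sum()                        # termination

# toy 3-state left-to-right model with 4 discrete observation symbols
pi = np.array([1.0, 0.0, 0.0])
A = np.array([[0.7, 0.3, 0.0],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])
B = np.array([[0.5, 0.3, 0.1, 0.1],
              [0.1, 0.5, 0.3, 0.1],
              [0.1, 0.1, 0.3, 0.5]])
print("P(O | model) =", forward_likelihood([0, 1, 1, 2, 3], pi, A, B))
```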

Gaussian Mixture Models (GMM)

The Gaussian mixture model (GMM) was introduced by Reynolds and Rose ; it represents a weighted sum of Gaussian density functions used to model the distribution of the feature vectors extracted from the speech signal.

For speaker identification, each speaker is represented by a Gaussian mixture model λ. Using GMM modelling, a sequence of speech frames can be modelled as a sum of M Gaussian components, where M depends on the size and content of the training data set:

p(\vec{x} \mid \lambda) = \sum_{i=1}^{M} w_i \, b_i(\vec{x})    (4)

where \vec{x} is a D-dimensional feature vector, b_i(\vec{x}), i = 1, \ldots, M, are the component densities and w_i are the mixture weights, with \sum_{i=1}^{M} w_i = 1 ().

A typical value is M = 32 for characteristic feature dimensions in the range of 12 – 36.

Figure 29 – Depiction of an M Component Gaussian Mixture Density (© )

Each component density b_i(\vec{x}) is a D-variate Gaussian function of the form:

b_i(\vec{x}) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left(-\frac{1}{2} (\vec{x} - \vec{\mu}_i)^{T} \Sigma_i^{-1} (\vec{x} - \vec{\mu}_i)\right)    (5)

where \vec{\mu}_i is the mean vector and \Sigma_i the covariance matrix.

Thus, a GMM is defined by the parameter set \lambda = \{ w_i, \vec{\mu}_i, \Sigma_i \}, i = 1, \ldots, M. The most popular technique for estimating the model’s parameters is maximum likelihood (ML) estimation. Given a sequence of T training vectors X = \{ \vec{x}_1, \ldots, \vec{x}_T \}, the ML estimation determines the model parameters that maximise the likelihood of the GMM:

p(X \mid \lambda) = \prod_{t=1}^{T} p(\vec{x}_t \mid \lambda)    (6)

In the recognition step, a group of S speakers is represented by the GMMs \lambda_1, \lambda_2, \ldots, \lambda_S. Given a sequence of test feature vectors extracted from an unknown user’s speech, the objective is to find the speaker model with the maximum a posteriori probability for the given test sequence; with equal prior probabilities this reduces to:

\hat{S} = \arg\max_{1 \le k \le S} \sum_{t=1}^{T} \log p(\vec{x}_t \mid \lambda_k)    (7)
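A compact sketch of GMM-based identification with scikit-learn is given below; the number of components, the diagonal covariances and the synthetic training data are assumptions chosen for the example, not values prescribed by the method.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)

# enrolment: one GMM per speaker, trained on that speaker's feature vectors (frames x dims)
models = {}
for i, spk in enumerate(["spk1", "spk2", "spk3"]):
    train = rng.normal(loc=i, scale=1.0, size=(800, 12))
    models[spk] = GaussianMixture(n_components=8, covariance_type="diag",
                                  random_state=0).fit(train)

# identification: pick the model with the highest total log-likelihood of the test frames
test = rng.normal(loc=1, scale=1.0, size=(300, 12))
scores = {spk: gmm.score_samples(test).sum() for spk, gmm in models.items()}
print("identified speaker:", max(scores, key=scores.get))
```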

L. Wang et al., in , combined MFCCs and phase information in order to perform speaker identification in noisy environments. The speaker model is computed by combining two GMMs, one for the MFCCs and one for the phase information. By doing so, the error rate is reduced by 20% – 70%. On clean speech, the system has a detection rate of 99.3%, but it drops to 59.7% in a noisy environment with a signal-to-noise ratio (SNR) of 10dB.

Artificial Neural Networks (ANN)

Artificial neural networks (ANN) can also be used for speaker recognition applications. Oglesby and Mason presented an approach to speaker identification using feedforward neural models. Their approach is to generate and train, for each known speaker, a feedforward neural network representing the speaker’s voice model. The training data must also contain speech from other users in order to capture the differences between speakers. During the testing phase, the input feature vectors are fed through each neural model and the identity is assigned to the speaker whose network produces the highest output value.

A method that uses Self-Organizing Maps for speaker modelling is presented in by E. Monte et al. The system uses the VQ function of the SOM and its topological property, i.e. the fact that neighbouring code-words in the feature space are neighbours in the topological map. The feature vectors were computed using two methods, LPC and MFCC (24 coefficients for both methods), with an analysis window of 30ms and a 10ms overlap between consecutive windows. In the training stage, an occupancy histogram is computed for each speaker and then smoothed with a low-pass filter. In the identification stage, the occupancy histogram of the unknown speaker is computed and compared with the known models from the database. Using the relative entropy, the system determines the degree of similarity between the unknown speaker’s histogram and the reference.

Figure 30 – Diagram of the Proposed SOM System (©)

The system performance was as follows: 98.2% for clean speech using LPC, 100% for clean speech using MFCC, 8% in a noisy environment with an SNR of 10dB using LPC, and 19.5% in a noisy environment with an SNR of 10dB using MFCC.

Face Recognition systems

In this section we summarise the techniques used for face recognition, highlighting the basic approaches and detailing some algorithms for a better understanding.

A generic face recognition system (Figure 31) consists, in general, of three major processing steps:

Face detection: defined as the process of extracting faces from the scene.

Feature extraction: the process of extracting relevant information from a face, information which is later used for identification.

Face recognition: represents the process in which a subject’s identity is established or verified based on his facial features. The identification task involves a comparison method, a classification algorithm and an accuracy measure.

Figure 31 – Main Components of a Generic Face Recognition System

Face Detection

Many face recognition methods assume that the faces have been previously localised, are frontal pictures and have the same size. But in realistic scenarios, facial images are presented under different poses, scales, lighting conditions and with a complex background.

The face detection problem can be defined as follows: given a still image, the goal is to identify the image regions that contain a face. The algorithm must deal with several well-known challenges caused by:

Pose variation: can occur due to the subject’s movements or camera movement. Ideally, face detection would involve only frontal images, but this is very unlikely in real-life applications.

Facial features: human faces present elements such as beards, moustaches, glasses or hats, all of which increase the degree of variability.

Facial expressions: the appearance of a face is directly affected by the subject’s facial expression.

Feature occlusion: faces may be partially occluded by other objects (scarves, hats, other people)

Imaging conditions.

High degree of variability in size, shape, colour and texture.

A classification of face detection methods is presented by Yang, Kriegman and Ahuja in . There are four proposed categories as follows:

Knowledge-based methods. These rule-based methods encode human knowledge of what constitutes a typical face, such as the relationships between facial features.

Feature-invariant methods. These methods try to find structural features of a face that remain invariant despite changes in angle, position or lighting conditions.

Template matching methods. These algorithms compare input images with stored patterns of faces or features.

Appearance-based methods. A template matching method whose pattern database is learnt from a set of training images.

Figure 32 – Classification of face detection methods

Knowledge-Based Methods

Knowledge-based methods are rule-based methods that try to capture the researchers’ knowledge of human faces and translate it into a set of rules. For example, a face consists of two symmetric eyes, a mouth and a nose; the eye area is darker than the cheeks; the mouth is lower than the eyes and nose, etc. The relationships between features can be represented by their relative distances and positions. The biggest problem of knowledge-based methods is the difficulty of building a relevant set of rules: if the rules are too general, the number of false positives is too high; if the rules are too restrictive, the false negative ratio increases.

In 1997, Kotropoulos and Pitas presented a rule-based face localization method which is similar to and . In the first phase, the algorithm locates facial features by applying the projection method that Kanade described and used to locate the boundary of a face . Let I(x, y) be the intensity value of an image at position (x, y); the horizontal and vertical projections of the image are defined as:

HI(x) = \sum_{y} I(x, y), \qquad VI(y) = \sum_{x} I(x, y)    (8)

First, the horizontal profile HI(x) of the input image is obtained, and the two local minima, determined by detecting abrupt changes in HI(x), are located; they correspond to the left and right sides of the head. Similarly, the vertical profile VI(y) is obtained, and its local minima determine the locations of the mouth lips, nose tip and eyes. A face candidate region is composed of all these features (Figure 33).

Figure 33 – Face Detection and Locating Features by Vertical and Horizontal Projections
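A minimal NumPy sketch of these projections on a synthetic grayscale image is given below; the threshold-based localisation of the dark region is a simplification of the abrupt-change detection used by the original method.

```python
import numpy as np

def projections(gray):
    """Horizontal profile HI(x): sum of intensities over each column x;
    vertical profile VI(y): sum over each row y (gray is an H x W array)."""
    HI = gray.sum(axis=0)   # one value per column (x direction)
    VI = gray.sum(axis=1)   # one value per row (y direction)
    return HI, VI

# toy "image": a dark face-like block on a bright background
gray = np.full((120, 100), 200.0)
gray[30:100, 25:75] -= 120.0
HI, VI = projections(gray)
# the span of columns where HI drops below average approximates the head's left/right sides
dark_cols = np.where(HI < HI.mean())[0]
print("head spans columns", dark_cols.min(), "to", dark_cols.max())
print("darkest row (candidate feature row):", VI.argmin())
```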

The proposed method was tested using a set of frontal-view faces extracted from the European ACTS M2VTS (MultiModal Verification for Teleservices and Security applications) database, which contains video sequences of 37 different people. Each image sequence contains only one face on a uniform background. The method provided correct face candidates in all tests; the detection rate is 86.5 percent if a successful detection is defined as correctly identifying all facial features. In the case of complex backgrounds and multiple faces, locating faces using vertical and horizontal projections becomes difficult.

Feature-Invariant Methods

Humans can easily detect faces regardless of pose and lighting conditions, which suggests that all human faces present features and properties that are invariant. Feature-based methods represent a bottom-up approach that relies on these invariant face features for detection. Many algorithms use facial features, such as the eyebrows, eyes, nose, mouth and hair line, to build a statistical model. The model describes the relationships that exist between the extracted facial features and is used to verify the existence of a face. Facial features are usually extracted using edge detectors, so poor illumination, noise and shadows alter the feature boundaries, resulting in inferior performance.

One such algorithm was developed by Han, Liao, Yu and Chen . The proposed system consists of three main steps. The first step performs an eye-analogue segmentation using a morphology-based technique.

Using morphological operations applied to the original image, eye-analogue pixels are located and unwanted pixels are removed. In the second stage, each previously located eye-analogue segment is used as a guide to search for a potential face region; a face region is temporarily marked if a possible geometrical combination of eyes, nose, eyebrows and mouth exists. The last step, face verification, validates a face candidate region and extracts the pose. In this step, every face candidate region is normalized to a 20–by–20 pixel image and then verified using a trained back-propagation neural network. The reported success rate was 94%.

Other feature-based methods use skin colour and/or skin texture in order to detect faces. Skin colour has proven to be an effective face feature, and multiple colour spaces have been used to describe skin pixels: RGB, normalized RGB, YCbCr, etc. A simple and effective skin colour model is presented in . It uses two colour representations for the skin pixels: normalized RGB, in order to reduce the influence of intensity, and normalized HSV. The following parameters are used to describe skin pixels:

(9)

Because skin colour is heavily affected by lighting conditions, skin-colour-based techniques are not effective when the spectrum of the light source varies significantly; they cannot be used alone, but only in combination with other methods, such as local symmetry or structure and geometry.

Template Matching Methods

In template matching techniques, the face is defined by a parametric function. Using a standard face pattern, in general a frontal picture, the parameters of the function are determined. For example, a face can be divided into eyes, face contour, nose and mouth; another face model can be built from edges or as a silhouette. The face detection algorithm computes the correlation value between the standard predefined patterns and the input image, and based on the correlation values a region is identified as face or non-face.

An early algorithm was developed by Sakai . The face model consists of several sub-templates: the eyes, nose, mouth and face contour. In the first stage of the algorithm, lines are extracted from the input image based on the greatest gradient change; the lines are then matched against the sub-templates in order to detect face candidate regions. In the second phase, the candidate regions are matched against the remaining sub-templates in order to validate the face.

This approach is simple to implement, but is affected by variations in pose, scale and shape. In order to correct these weaknesses, deformable templates have been proposed.

Appearance-Based Methods

Appearance-based methods rely, in general, on techniques from statistical analysis and machine learning in order to identify the relevant characteristics of human faces. These methods use face templates that are generated / learned from sample face images.

Many appearance-based methods treat the face in a probabilistic framework. From the input image, a feature vector is extracted. This is treated as a random variable with some probability of belonging to a face or not. Bayesian classification or maximum likelihood can be used to classify a candidate image location as face or non-face.

Another approach is to define a discriminant function (i.e., decision surface, separating hyperplane, threshold function) between face and non-face classes. The most relevant / known methods are:

Eigenface-based: In 1987, Sirovich and Kirby presented a method for efficiently representing faces using PCA (Principal Component Analysis). With this approach, a face is represented in a coordinate system whose basis vectors are the eigenvectors, referred to as eigenpictures.

Distribution-based: Sung , in his PhD thesis, proposed a learning based method for classifying objects and patterns with “variable image appearance but highly predictable image boundaries”. By collecting a sufficiently large number of samples for the pattern class that we wish to detect, a distribution-based model is generated in an appropriate feature space. New patterns are matched against the distribution-based target model. The system matches the candidate picture against the distribution-based canonical face model. Finally, using a trained classifier that is based on a set of distance measurements between the input pattern and the distribution-based class representation in the chosen feature space, the target pattern class instances are correctly identified from background image patterns. Algorithms like PCA or Fisher’s Discriminant can be used to define the subspace representing facial patterns.

Neural Networks: They have been successfully applied on many pattern recognition problems. Rowley et al. present a neural network-based upright frontal face detection system. The Neural Network was trained to recognize face and non-face patterns. The algorithm applies one or more neural networks on portions of the input image and the results are afterwards arbitrated in order to establish the existence of a face.

Figure 34 – The Basic Algorithm Used for Face Detection (© )

Support Vector Machines: SVMs are linear classifiers based on the concept of decision planes that define decision boundaries. A decision plane, or decision boundary, is a hypersurface that partitions a set of objects having different class memberships. SVMs maximize the margin between the decision hyperplane and the examples in the training set. Osuna et al. investigated the application of Support Vector Machines (SVMs) to the face detection problem.

Bayes Classifiers: Schneiderman and Kanade proposed a “probabilistic model for object recognition based primarily on local appearance” and applied it for face detection. The local appearance space is partitioned into a finite number of patterns. The image is divided into 16 by 16 pixels sub-regions and projected onto a 12 dimensional space. By counting the occurrence frequency of the selected patterns over the training images, they computed the probability of a face to be present or not in the picture. The classifier captured the joint statistics of local appearance and position of the face as well as the statistics of local appearance and position in the visual world.

Figure 35 – A set of Principal Components (© )

Hidden Markov Model: Nefian and Hayes presented a Hidden Markov Model (HMM)-based framework for face recognition and face detection. The states of the HMM are described using observation vectors obtained from the coefficients of the Karhunen-Loeve transform. Each face image is divided into overlapping blocks, and each facial region (hair, forehead, eyes, nose and mouth) is assigned to a state in a left-to-right one-dimensional continuous HMM.

Figure 36 – Left to Right HMM for Face Recognition (© )

Face Recognition

The goal of face recognition is to identify or verify a person’s identity from images, based on their facial features. The recognition process must succeed despite wide variations in pose, facial expression and illumination .

Face recognition methods can be classified into three main categories: analytic or feature-based methods, holistic methods and hybrid methods (Figure 37).

Figure 37 – Classification of face recognition methods

Holistic Methods

Holistic methods treat the image as a matrix of pixels; the subject’s identification is based on the entire face image rather than on local face features. Typically these methods use statistical or artificial intelligence approaches.

Principal Components Analysis (PCA) or Eigenfaces Method

The Principal Components Analysis (PCA) is a statistical method designed to model linear variation. Applied to a set of feature vectors, it extracts a set of mutually orthogonal basis vectors that best represent the original subjects. These basis vectors represent the most significant variations in the dataset.

Sirovich and Kirby first utilised the PCA method to form a set of basis features from a collection of face images. Applying the PCA method to face images, an image is decomposed into a reduced set of basis vectors, also known as “eigenfaces”. If the training set consists of M images, principal component analysis can form a basis set of N images, where N < M. The reconstruction error is reduced by increasing the number of eigenpictures; however, the number needed is always less than M.

The eigenfaces method is perhaps the most commonly used holistic method for face recognition. The PCA method was first applied to face recognition by Turk and Pentland in . The system builds “eigenfaces” using eigenvectors extracted from known faces, and the identity is then established by comparing the projections of an unknown face onto the known “eigenfaces”. Turk and Pentland’s paper demonstrated ways to extract the eigenvectors based on matrices sized by the number of images rather than the number of pixels.
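A condensed eigenfaces-style sketch using scikit-learn’s PCA and a nearest-neighbour match in the projected space is shown below; the image size, the number of components and the random “gallery” are assumptions for illustration, not the setup of Turk and Pentland.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)

# gallery: flattened face images (n_faces x n_pixels), e.g. 32x32 grayscale
gallery = rng.normal(size=(40, 32 * 32))
labels = np.repeat(np.arange(10), 4)          # 10 identities, 4 images each

pca = PCA(n_components=20).fit(gallery)       # principal components play the role of "eigenfaces"
gallery_proj = pca.transform(gallery)         # projections of the known faces

def identify(face_vector):
    """Project the probe face onto the eigenfaces and return the closest gallery identity."""
    probe = pca.transform(face_vector.reshape(1, -1))
    return labels[np.linalg.norm(gallery_proj - probe, axis=1).argmin()]

probe = gallery[7] + 0.1 * rng.normal(size=32 * 32)   # a slightly perturbed known face
print("identified as person", identify(probe))
```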

An advantage of using such representations is their reduced sensitivity to noise. As long as the topological structure does not change, the recognition algorithm is not affected. For example, Figure 38 illustrates the recognition / reconstruction of the original image after applying different distortions such as: blurring, partial occlusion and changes in background.

(a)

(b)

Figure 38 – Eigenfaces Recognition Results

(a) Electronically Modified Images Which Were Correctly Identified

(b) Reconstructed Images Using 300 PCA Projection Coefficients for Electronically Modified Images (© )

Linear Discriminant Analysis (LDA)

Fisher’s Linear Discriminant Analysis (LDA) is another technique that is used for face recognition. It works on the same principle as the eigenfaces method.

Adini et al. first proposed the use of the LDA method for face recognition, trying to compensate for the variations of the same face caused by lighting and facial expressions. LDA explicitly attempts to model the difference between the classes of data: it searches for the projection axes on which the data points of different classes are far from each other, while requiring data points of the same class to be close to each other. LDA encodes discriminating information in a linearly separable space using bases that are not necessarily orthogonal.

Figure 39 – An Example of Five Images of the Same Face.

(a) Frontal View with Left Illumination.

(b) Frontal View with Right Illumination.

(c) 34° Away From the Frontal View on the Horizontal Axis with Left Illumination.

(d) A Smile Expression Taken from Frontal View with Left Illumination.

(e) A Drastic Expression Taken from the Frontal View Left Illumination. (© )

Independent Component Analysis (ICA)

Independent Component Analysis (ICA) is a generalisation of the PCA method; it tries to separate a multivariate signal into additive subcomponents. The ICA method finds the independent components by maximizing the statistical independence of the estimated components, under the assumption that the subcomponents are non-Gaussian signals that are statistically independent from each other. In , Bartlett et al. used a version of ICA derived from the principle of optimal information transfer through sigmoidal neurons for the face recognition task. In their experiments, two different architectures were used: “one which treated the images as random variables and the pixels as outcomes, and a second which treated the pixels as random variables and the images as outcomes”. The first architecture found spatially local basis images for the faces; the second produced a factorial face code.

Figure 40 – First Proposed Architecture (© [65])

Figure 41 – Second Proposed Architecture (© [65])

Artificial Intelligence (AI)

A popular approach for pattern recognition and classification is the use of artificial intelligence (neural networks). In face recognition systems, neural networks can be used in two ways: to perform feature extraction, or to perform modelling and classification.

In 1995, Intrator et al. presented a hybrid method. The first step was the detection of the eyes and mouth, followed by a spatial normalization of the image such that the eyes and the centre of the mouth are moved to predefined positions. Then, by combining unsupervised methods for extracting features with supervised methods for finding features, they were able to reduce the classification error. The classification was done using a feed-forward neural network (see Figure 42).

Figure 42 – A Hybrid EPP/PPR Neural Network (© )

A mixed method that uses Gabor filters and neural networks was proposed in by Bhuiyan et al. The neural network used for face recognition is based on the multilayer perceptron and trained with Gabor features. With this approach they achieved a recognition rate of 84.5%.

Melin et al. presented a new approach for face recognition using modular neural networks with a fuzzy logic method for result integration. The face is divided into three regions (the eyes, the mouth, and the nose) and each region is assigned to a module of the neural network (Figure 43). The outputs of the three modules are combined in order to obtain the final result using a fuzzy Sugeno integral. They applied the new approach on a small database of 20 people and reported that the modular network offered better results than a monolithic one.

Figure 43 – Proposed Architecture (© )

In 2009, Bhattacharjee et al. combined fuzzy mathematics with neural networks in order to improve the recognition accuracy. The face recognition system they developed is based on a fuzzy multilayer perceptron (MLP) trained with feature vectors extracted using the Gabor wavelet transform. The idea behind this approach is to identify decision surfaces in the case of nonlinear overlapping classes, a task that a standard MLP cannot complete (it is restricted to crisp boundaries). The algorithm shows a 97.875% recognition rate on the ORL database.

In the face recognition domain, the Self-Organising Map (SOM) usually has the role of dimension reduction and feature extraction. In 2005, Tan et al. addressed the problem of “one training image per person” and proposed a SOM-based algorithm. The algorithm offered a lower recognition error rate than the PCA version and showed robust performance against partial occlusions and varying expressions.

An appearance-based method for face recognition that uses SOMs is presented in by Aly et al. The Self-Organizing Map is utilized to transform the high-dimensional face image into a low-dimensional topological space; the similarity between the input and codebook images is calculated using the Mahalanobis distance.

A mixture of neural networks and PCA methods is presented by Eleyan and Demirel . Feature projection vectors are extracted using Principal Components Analysis from face images, and then they are classified using a feed-forward neural network.

Figure 44 – Training Phase of the Neural Network (©)

A system that uses the wavelet transform in order to decompose the face image is presented by Li and Yin . Applying the wavelet transform, three low-frequency sub-images result; on each of them the Fisherfaces method is then applied. The individual classifiers are fused using the RBF neural network.

Other approaches to the face recognition problem utilize Support Vector Machines (SVMs) and Hidden Markov Models.

Analytic or Feature-Based Methods

Analytic or feature-based methods represent the most natural and intuitive approach for the facial recognition problem. These methods use the input face image in order to identify, extract and measure different local facial features, and then establish the geometric relationships that exist among them. The input face image is converted into a set of geometric features vectors, based on which the subject’s recognition can be performed using standard pattern recognition techniques. Major facial features such as nose, mouth, eyes as well as other sub-features are used by these techniques.

Using feature-based techniques, in 1971 Goldstein, Harmon and Lesk developed a semi-automatic facial recognition system that relied on 21 specific markers (see Figure 45). These 21-dimensional vectors were shown to be sufficient for accurate individual identification, both by humans and by computer.

Figure 45 – Set of 21 Face Features (© )

Kanade, in , presented the first fully automated feature-based face recognition system. The recognition was performed by calculating the Euclidean distance between the feature vectors of a reference image and an unknown image. The feature vector consists of 16 facial parameters extracted from the distances and angles between marker points (facial features) such as the eyes, ears, nose and mouth.

Figure 46 – Sequence of the Analysis Steps (© )

The facial feature localisation algorithm is based on the “integral projection” method and has eight steps, as shown in Figure 46, each of them used for detecting a distinct part of the face. The outputs of the processing steps are combined using a feedback procedure. Each step consists of three parts: prediction, detection and evaluation. The result of each step is evaluated to decide whether the program can proceed to the next step; in case of an unsatisfactory result, the algorithm is restarted at a previous phase. Two types of feedback are used:

direct feedback in order to modify the parameters of the analysis step just executed,

feedback to modify former processing steps.

Recognition techniques based on facial features are robust against changes in illumination, but their accuracy is low because the coordinates of the marker points are very difficult to establish accurately.

Elastic Bunch Graph Matching (EBGM)

A more modern approach to facial recognition using local features is the use of Elastic Bunch Graph Matching (EBGM) algorithms. Such a method is presented in by Wiskott et al., where the face recognition system combines an elastic graph-matching algorithm with a neural network. Faces are viewed as flexible graphs (the nodes are shifted to match feature locations), with nodes positioned at fiducial points. Each graph node contains a set of visual features extracted using Gabor filters.

Figure 47 – The Graph Representation of a Face Based on Gabor Jets (© )

A graph for an individual face is generated as follows: a set of fiducial points on the face is chosen; each fiducial point is a node of a fully connected graph and is labelled with the responses of the Gabor filters applied to a window around that fiducial point; each edge is labelled with the distance between the corresponding fiducial points. A representative set of such graphs is combined into a stack-like structure called a face bunch graph. Once the system has a face bunch graph, graphs for new face images can be generated automatically by Elastic Bunch Graph Matching. Recognition of a new face image is performed by comparing its image graph with those of all the known face images and picking the one with the highest similarity value. As a variation of this approach, the Gabor features can be replaced by Histograms of Oriented Gradients (HOGs) combined with a graph matching strategy .

Multiple Classifier Systems

A newer approach in face recognition is the fusion of multiple classifier systems. Since each classic classifier is more sensitive to some types of disturbances than to others, a natural approach is to combine different classifiers, thus resulting in a more robust system. Such an example is the face recognition system described by Lu et al. . The system fuses the results of PCA, ICA and LDA using the sum rule and an RBF-network-based integration strategy (see Figure 48).

Figure 48 – Classifier Combination System Framework (©)

State of the Art Bimodal Recognition Systems

In , M. Benoit et al. present several late fusion techniques for a simultaneous audio-video recognition system. The proposed system performs face identification and voice identification independently of each other; the results of the two models are then unified, and the end result is compared with a list containing the scores of all speakers in order to detect the similarities with the current speaker.

Figure 49 – Late Fusion Recognition System

The system finds and tracks human faces in the video sequence. Since human faces can have different sizes and positions, the following assumptions are made for face detection: the face is close to vertical, and the minimum height of a face is 66 pixels. Based on these assumptions, the system searches a multi-resolution pyramid for the positions showing the highest degree of similarity with a predefined human face template of fixed size (11–by–11 pixels), thus identifying the potential areas where human faces may be located. A potential face area is classified as face or non-face based on whether the region contains a high proportion of skin-tone pixels and by using a Fisher linear discriminant trained to differentiate between background and human faces.

Once a face is found, facial features are extracted, starting with large-scale features such as the eyes, nose and mouth; then sub-features such as the hairline, chin, ears, and the corners of the mouth, nose, eyes and eyebrows are located, 29 sub-features in total. At each of the estimated sub-feature locations, a Gabor jet representation is generated using eight orientations and five scales, resulting in a total of 40 complex coefficients. The similarity between the unknown face and the known face feature vectors is computed using a simple distance metric.

The audio speaker identification is performed using a frame-based approach. For each speaker, a Gaussian mixture model is created using cepstral feature vectors. The model is created from training data consisting of a sequence of K frames of d-dimensional cepstral feature vectors; the model corresponding to speaker i is defined by the mean vectors, covariance matrices and mixture weights of each of the n components of speaker i's model.

Using 154 news clips with 76 speakers, the system presented 93.5% recognition rate accuracy in noise free environment and 74.7% accuracy in a noisy environment with a 10dB SNR.

Another late fusion recognition system is described by Alberto Albiol in . The system has two parallel processing chains: one for face recognition, the other for voice recognition.

Figure 50 – Late Fusion Recognition Algorithm Architecture

The face recognition path is divided into two subsystems: face detection and face recognition. Using only colour information, the image is segmented into skin / non-skin areas; then, using an unsupervised segmentation algorithm, the skin areas are grouped. By filtering the potential face areas using shape, colour and texture constraints, the face is localised. The face recognition algorithm is based on the “self-eigenfaces” technique, a variant of the PCA method also known as “eigenfaces”.

GMMs (Gaussian mixture models) are used for the voice recognition. Two types of models are trained: one model specific to each speaker and another model for the background noise. The speaker GMM is built from feature vectors consisting of 12 Mel-frequency cepstral coefficients and their derivatives, extracted from 2-3 minutes of clean speech files; for the extraction, a 34ms Hamming window with 50% overlap is used. The background noise GMM is built from 1 hour of recordings. In order to establish the identity of the speaker, the following log-likelihood ratio is used:

\Lambda_m(\vec{x}) = \log p(\vec{x} \mid \lambda_m) - \log p(\vec{x} \mid \lambda_{BG})    (10)

where p(\vec{x} \mid \lambda_m) and p(\vec{x} \mid \lambda_{BG}) are the probability density functions of person m and of the background model, respectively.

For the fusion block, several types of classifiers were tested, including Bayesian classifiers and MSE classifiers. The best results were obtained using a classifier based on a linear discriminant function, found by MSE using the pseudoinverse matrix. The system was tested using TV news recordings with 10 speakers, and with the proposed method they obtained 98% true positives and 99% true negatives.

Geng et al. propose an audio-visual speaker recognition system that uses face and voice features fused via multi-modal correlated neural networks (at different levels). In order to learn heterogeneous multi-modal features, the facial features must be compatible at a high level with the audio features.

In their work, the face features are learned using convolutional networks. The input image is resized to 50–by–50 pixels and divided by 255 in order to bring the pixel values between 0 and 1; it is then passed to a convolutional network consisting of four layers: three convolutional layers and one fully-connected layer. Each convolutional layer is followed by a pooling layer to down-sample its inputs. In order to use backpropagation for learning the kernel parameters, the standard 2–by–2 max pooling layer is replaced by a 3–by–3 convolutional layer with stride 2. The first convolutional layer has a kernel size of 15–by–15 and 48 filters; it is implemented using two convolutional layers with kernel sizes 1–by–15 and 15–by–1. The second convolutional layer is 5–by–5 with 256 filters, implemented using two 3–by–3 convolutional layers with a stride of 1. The final convolutional layer has a kernel size of 7–by–7 and 1024 filters. The first convolutional layer, with its large kernel, extracts the major information from the image, while the second layer captures the details of the feature maps. The face features are represented by the outputs of the last convolutional layer.

As audio features, Mel-Frequency Cepstral Coefficients (MFCCs) with a 20ms window size and a 10ms frame shift are used. Each audio clip is characterized by a 75-dimensional audio feature vector consisting of the mean and standard deviation of the 25 1–Δ MFCCs, and the standard deviation of the 2–Δ MFCCs.

In their work, Geng et al. propose multiple methods for fusing the face and audio features. The method with the best recognition rate performs data fusion at the second level (see Figure 51) and achieves a 93.2% recognition rate.

Figure 51 – Second Level Feature Fusion (© )

Audio Features

Short-Term Spectral Features

Speaker identity is correlated with the physiological and behavioural characteristics of the speaker, and these characteristics derive both from the vocal tract and from the source of speech (the vocal cords). Thus, the first step in designing an automatic speech recognition system is to identify the components of the audio signal that are useful for identifying the person and to discard irrelevant information such as noise and emotion (i.e. to extract speech features from the raw speech signal). The selection of the best representation of the acoustic data is an important task in the design of any speech recognition system.

The feature extraction component transforms the speech signal into a parametric representation. This section gives a brief overview of the main concepts of acoustic feature extraction used in the thesis. Many techniques have been developed during the 50 years of ASR research for the extraction of suitable feature vectors used in speech processing systems; this domain is a mature but still active research area.

The human voice is subject to variations caused by several factors: the speaker’s physical characteristics, health condition, emotional state, the dialect, sociolect and idiolect spoken by the individual, age, or even the recording and transmission conditions.

In order to guarantee the correct performance of the system, the parameters used to describe the speaker's voice should satisfy the following desired conditions:

occur naturally and frequently in normal speech

be easily measurable

vary as much as possible among speakers (provide good discrimination)

be as consistent as possible for each speaker (have a low variability)

not change over time or be affected by the speaker's health

not be affected by reasonable background noise nor depend on specific transmission characteristics (unaffected by environmental and transmission noise)

not be modifiable by conscious effort of the speaker, or, at least, be unlikely to be affected by attempts to disguise the voice (not subject to mimicry)

In practice, all of the above criteria cannot be fulfilled simultaneously; thus, a partial or complete relaxation of some of them must be accepted.

As seen in section 2.1, the spectral shape of the voice signal contains information about the vocal tract and about the excitation source in the glottis, by means of the formants and the fundamental frequency respectively. Short-term spectral features offer a good characterization of the vocal tract particularities as well as of the short-term spectral envelope of the signal. The spectral parameters used for speaker recognition are usually the same as the ones used in speech recognition techniques, although the objectives of the two applications differ.

The process of audio features extraction can be divided into three processing stages: pre-processing, filter-bank analysis and features extraction.

Figure 52 – Spectral Features Extraction

Pre-Processing

The first step in acoustic features extraction is represented by the pre-processing. Two operations are carried out in this step: pre-emphasis and windowing.

In speech processing, the voice signal has a low-pass behaviour: the glottal source rolls off at about -12dB/octave, while lip radiation introduces a +6dB/octave amplification of the spectrum. Thus, in order to flatten the signal spectrum, a high-pass filter is applied; it must be designed to emphasize the frequency region above 1kHz. The most commonly used filter for this task is the first-order filter described by the following equation:

H(z) = 1 - a \cdot z^{-1}    (11)

where the pre-emphasis coefficient a is typically chosen between 0.95 and 0.97.

Besides balancing the frequency spectrum, the use of a pre-emphasis filter offers other benefits, such as improving the signal-to-noise ratio (SNR) and avoiding numerical problems during the Fourier transform operations.

Figure 53 – Audio Signal Before and After Pre-Emphasis (© )
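A one-line NumPy sketch of this first-order pre-emphasis filter is given below; the coefficient 0.97 and the synthetic test signal are assumptions chosen for illustration.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]; boosts the high-frequency part of the spectrum."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

# toy usage on a synthetic 16 kHz signal mixing a low and a high frequency component
fs = 16000
t = np.arange(0, 0.02, 1.0 / fs)
x = np.sin(2 * np.pi * 200 * t) + 0.3 * np.sin(2 * np.pi * 3000 * t)
y = pre_emphasis(x)
print(x.shape, y.shape)
```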

Human speech is a naturally continuous signal that changes constantly over time. Fortunately, when analysed over short time scales (20 – 100ms), it is, from the statistical point of view, a quasi-stationary signal: the audio signal does not change much within such an interval.

In order to obtain the quasi-stationary acoustic signal, a framing and windowing technique is used. Windowing represents the process of multiplying a given signal by a window function. A window function is a function that is zero-valued outside of some chosen interval. Given a signal function and a window function, the product is also zero-valued outside the interval.

The voice signal is divided into frames of N samples, with adjacent frames separated by M samples. For speech processing, a typical frame length is 20 – 25ms and the overlap between adjacent frames is set to 10ms. If the frame is much shorter, there are not enough samples to obtain a reliable spectral estimate; if it is longer, the signal changes too much within the frame. The acoustic feature extraction is performed over the entire length of the window.

Figure 54 – Short-Term Spectral Analysis for Speech.

The window function is chosen in a way that the values near the edges approach zero. This prevents discontinuities at the edges which would negatively affect the result of further processing. A popular function is the Hamming window, represented by the following equation:

w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1    (12)

where N is the number of samples in each frame and w(n) is the Hamming window.

Figure 55 – Hamming Window
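Framing and Hamming windowing can be sketched as below; the 25 ms frame length, 10 ms shift and 16 kHz sampling rate are typical values assumed for the example.

```python
import numpy as np

def frame_and_window(signal, fs, frame_ms=25, shift_ms=10):
    """Split the signal into overlapping frames and apply a Hamming window to each."""
    frame_len = int(fs * frame_ms / 1000)
    frame_shift = int(fs * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    window = np.hamming(frame_len)            # 0.54 - 0.46 cos(2*pi*n / (N-1))
    frames = np.stack([signal[i * frame_shift: i * frame_shift + frame_len] * window
                       for i in range(n_frames)])
    return frames

fs = 16000
speech = np.random.default_rng(4).normal(size=fs)   # 1 s of noise as a stand-in signal
frames = frame_and_window(speech, fs)
print(frames.shape)   # (n_frames, 400) for 25 ms frames at 16 kHz
```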

Filter Bank

After the pre-processing stage has divided the input speech signal into quasi-stationary windows, each frame is converted from the time domain into the frequency domain using the discrete Fourier transform (computed with the FFT algorithm), and the resulting frequency spectrum is analysed in order to generate the final speech features. The frequency spectrum analysis used in this thesis is the filter bank technique, a method inspired by human auditory perception.

Studies on the perception of speech have shown that the human ear resolves frequencies non-linearly across the audio spectrum. In 1937, Stevens et al. introduced the “Mel scale”, a perceptual scale of pitches judged by listeners to be equal in distance from one another, based on how speech perception works in the human ear.

This non-linear behaviour can be approximated by a finite triangular filter bank (usually 20 – 24 filters) spaced across the spectrum according to the Mel scale. The triangular filters overlap by 50%; their spacing is approximately linear for the frequency range 0 to 1000Hz and logarithmic at higher frequencies. By applying the Mel-scale filterbank, the lower frequencies of human speech are given greater weight (representation). The popular formula to convert between the Hertz scale and the Mel scale is:

Mel(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)    (13)

Figure 56 – The Nonlinear Mel-frequency versus Hertz frequency

In this thesis, a filter bank of 40 equal-area filters is used (see Figure 57), as implemented by Slaney in . It covers the frequency range 133 – 6854Hz and consists of the following (a simplified construction of such a filterbank is sketched after Figure 57):

13 linearly spaced filters, range 100 – 1000Hz, with a step of 133.33Hz

27 logarithmically spaced filters, range 1071 – 6854Hz, with a logarithmic step of 1.0711703

Figure 57 – Used MFC Filterbank
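A triangular Mel filterbank can be built directly from the Hz/Mel conversion formula, as in the NumPy sketch below; the 26 filters, 512-point FFT and 16 kHz sampling rate are example values and this simplified equal-height construction is not Slaney's equal-area implementation used in the thesis.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, fs=16000, f_low=0, f_high=8000):
    """Triangular filters spaced uniformly on the Mel scale, as an (n_filters x n_fft//2+1) matrix."""
    mel_points = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):                        # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):                       # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

fbank = mel_filterbank()
print(fbank.shape)   # (26, 257)
```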

Mel-Frequency Cepstral Coefficients (MFCCs)

The Mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear Mel scale of frequency. The coefficients that make up an MFC are called Mel-frequency cepstral coefficients (MFCCs); they are widely used in automatic speech and speaker recognition. MFCCs were introduced by Davis and Mermelstein in the 1980s and have been state-of-the-art ever since. Prior to the introduction of MFCCs, Linear Prediction Coefficients (LPCs) and Linear Prediction Cepstral Coefficients (LPCCs) were the main feature types for automatic speech recognition.

Figure 58 – Block Diagram of MFCC Extraction

There are many implementations of the original MFCC algorithm; they mainly differ in the number of filters, the shape of the filters, the way the filters are spaced, the bandwidth of the filters and the manner in which the spectrum is warped. The algorithm used in this work is the one implemented by Slaney in .

In order to extract the Mel-frequency cepstral coefficients, the working domain of the speech signal must be changed from time to frequency. The frequency spectrum of the speech signal is calculated using the N-point Fast Fourier Transform (FFT), i.e. the discrete Fourier transform. From the frequency spectrum, the power spectrum (periodogram) is computed:

P_i(k) = \frac{1}{N} \left| S_i(k) \right|^2    (14)

where s_i(n) is the i-th frame of the speech signal s(n) and S_i(k) is its N-point DFT.

The resulting power spectrum is mapped onto the Mel scale by applying the triangular filters of the Mel filter bank. This step extracts the spectral magnitude for each filter bank channel: each FFT magnitude coefficient is multiplied by the corresponding filter gain and the results are accumulated.

Figure 59 – Signal Spectrogram (© )

Once the filter bank energies have been obtained, their logarithm is taken in order to mimic the human perception of loudness: loudness is not perceived on a linear scale, and at high amplitudes small differences in amplitude are less perceptible. The log operation also acts as a smoothing function and makes the feature estimates less sensitive to input variations.

The final step is to compute the Discrete Cosine Transform (DCT) of the log filter bank energies. There are two main reasons for this: first, the filter bank coefficients are highly correlated, because the overlapping of the filter banks leads to correlated filter bank energies; second, the DCT compresses the features, reducing the number of components.

The DCT coefficients represent the Mel-frequency cepstral coefficients. Usually in ASR, only the first 12 – 19 DCT coefficients are kept. This is because the higher DCT coefficients represent fast changes in the filter bank energies and this tends to degrade the ASR performance.

Figure 60 – MFCCs (© )
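Putting the steps together, a condensed MFCC computation over the windowed frames might look as follows; it reuses the framing and filterbank sketches given earlier, and the 512-point FFT and 13 retained coefficients are example choices rather than the exact settings of the thesis.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(frames, fbank, n_fft=512, n_ceps=13):
    """frames: windowed frames (n_frames x frame_len); fbank: Mel filterbank (n_filters x n_fft//2+1)."""
    spectrum = np.fft.rfft(frames, n=n_fft)                 # N-point FFT of each frame
    power = (np.abs(spectrum) ** 2) / n_fft                 # periodogram, Eq. (14)
    energies = np.maximum(power @ fbank.T, 1e-10)           # filter bank energies
    log_energies = np.log(energies)                         # mimic the perception of loudness
    return dct(log_energies, type=2, axis=1, norm="ortho")[:, :n_ceps]  # decorrelate and compress

# usage together with the earlier sketches (frame_and_window, mel_filterbank):
# feats = mfcc(frame_and_window(speech, 16000), mel_filterbank())
```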

The MFCC feature vector described above contains the so-called static coefficients and describes only the power spectral envelope of a single frame, without taking into account the temporal variation across frames. The assumption that the feature vectors of consecutive frames are uncorrelated is false: human speech also carries information about the speaker’s identity in its dynamics, i.e. in the trajectories of the MFCC coefficients over time. Furui first added this dynamic information to the original feature vector, improving the ASR performance. These dynamic coefficients, or deltas, are calculated with the following formula:

d_t = \frac{\sum_{n=1}^{N} n \left( c_{t+n} - c_{t-n} \right)}{2 \sum_{n=1}^{N} n^2}    (15)

where d_t is the delta coefficient of frame t, computed from the static coefficients c_{t-N} to c_{t+N}, and N is the window size used to compute the dynamic parameters. A typical value for N is 2.

Applying Equation (15) to the delta parameters yields the delta-delta coefficients, also known as acceleration coefficients.

The proposed algorithm uses a feature vector of 64 components, made up of the first 32 MFCCs and their corresponding delta MFCCs (DMFCCs), in order to add dynamic information to the static cepstral features.
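A direct NumPy implementation of Equation (15), with the edge frames padded by repetition (an assumption of this sketch), is shown below; the random static coefficients simply illustrate the 32 + 32 = 64-component layout described above.

```python
import numpy as np

def deltas(ceps, N=2):
    """Delta coefficients of a cepstral matrix (n_frames x n_ceps), window size N (Eq. 15)."""
    padded = np.pad(ceps, ((N, N), (0, 0)), mode="edge")     # repeat edge frames
    denom = 2 * sum(n * n for n in range(1, N + 1))
    d = np.zeros_like(ceps)
    for t in range(len(ceps)):
        for n in range(1, N + 1):
            d[t] += n * (padded[t + N + n] - padded[t + N - n])
    return d / denom

static = np.random.default_rng(5).normal(size=(100, 32))     # e.g. 32 MFCCs per frame
feature_vector = np.hstack([static, deltas(static)])          # 64 components per frame
print(feature_vector.shape)    # (100, 64)
```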

Visual Features

Human faces provide vital information about a person’s identity and characteristics such as gender and age. Even though we all have the same basic features (one nose, one mouth, two eyes), each person has his or her own distinctive features.

Face recognition represents the preferred method that the humans use in order to recognize each other; the human brain has a specialized region dedicated to face processing. It is a natural and non-invasive technique for identifying a person.

The schematic of visual features extraction, as presented in Figure 61, consists of two stages:

Face localization in video frames:

Skin / Non-skin segmentation

Morphological filtering

Face detection

Features Vector extraction.

Figure 61 – Visual Features Extraction

Face Localisation

In any facial feature extraction algorithm, the first essential step is face detection / localisation: faces must first be located and registered in order to facilitate further processing. A large body of research exists in the field of face detection; generally, the methods can be classified into four categories: knowledge-based methods, feature-invariant approaches, template matching methods and appearance-based methods.

The face detection problem, in general, has a high complexity and requires substantial computing and memory resources. In our work, we combine two approaches in order to reduce the computational time: first, using skin segmentation and morphological filtering, a number of “face candidate” areas are identified; second, each candidate area is analysed in order to detect the presence of a face.

Skin Segmentation

For the “Skin Segmentation” stage we have chosen the colour-based fuzzy classifier presented by Iraji and Tosinia . The skin / non-skin segmentation is performed in the YCbCr colour space (Y is the brightness or luma, Cb the blue chrominance component and Cr the red chrominance component).

Figure 62 – The CbCr plane at constant luma Y=0.5

Given an image in the RGB colour space, the YCbCr transformation can be calculated as:

(14)

The usage of YCbCr colour space for skin segmentation offers three advantages:

According to , in the YCbCr colour space the skin cluster is more compact

The appearance of skin-tone colour is dependent on the lighting conditions. In the YCbCr colour space, the luminance component (Y) is independent of the colour. Thus, it can be used to solve the illumination variation problem.

The YCbCr colour space presents the smallest overlap between skin and non-skin data under various illumination conditions.

The fuzzy classifier designed by Iraji has:

three input values (Y, Cb and Cr); each input has two linguistic variables, low and high, with the membership functions shown in Figure 63;

one output with two Linguistic variables: skin and non-skin.

eight fuzzy rules, as follows:

If Y is Low AND Cb is Low AND Cr is Low Then Pixel is Non-Skin

If Y is Low AND Cb is Low AND Cr is High Then Pixel is Non-Skin

If Y is Low AND Cb is High AND Cr is Low Then Pixel is Non-Skin

If Y is Low AND Cb is High AND Cr is High Then Pixel is Non-Skin

If Y is High AND Cb is Low AND Cr is Low Then Pixel is Non-Skin

If Y is High AND Cb is Low AND Cr is High Then Pixel is Non-Skin

If Y is High AND Cb is High AND Cr is Low Then Pixel is Non-Skin

If Y is High AND Cb is High AND Cr is High Then Pixel is Skin.

Figure 63 – Used Membership Functions (© )

After the skin pixels are segmented using the described classifier, morphological filtering is applied in order to reduce the false positive responses. The morphological filtering consists in applying two morphological operators: image erosion and filling of image regions and holes.

Figure 64 – Skin Segmentation Results
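For illustration, a compact OpenCV/NumPy sketch of this stage is shown below; it replaces the fuzzy classifier with a crisp Cb/Cr threshold (the threshold values are common values from the literature, not those of Iraji and Tosinia) and applies erosion and flood-fill-based hole filling as morphological clean-up.

```python
import cv2
import numpy as np

def skin_mask(bgr_image):
    """Rough skin / non-skin segmentation in the YCbCr colour space with morphological clean-up."""
    ycrcb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb)   # OpenCV orders the channels as Y, Cr, Cb
    cr, cb = ycrcb[:, :, 1], ycrcb[:, :, 2]
    mask = ((cr > 133) & (cr < 173) & (cb > 77) & (cb < 127)).astype(np.uint8) * 255
    mask = cv2.erode(mask, np.ones((3, 3), np.uint8), iterations=1)   # remove isolated pixels
    # fill holes: flood-fill the background from a corner and combine with the original mask
    flood = mask.copy()
    cv2.floodFill(flood, np.zeros((mask.shape[0] + 2, mask.shape[1] + 2), np.uint8), (0, 0), 255)
    return mask | cv2.bitwise_not(flood)

# usage on a video frame read from disk (file name is a placeholder):
# face_candidates = skin_mask(cv2.imread("frame.png"))
```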

Face Detection using Viola-Jones Algorithm

For face detection, in our work, we use the Viola-Jones algorithm, which represents the first real-time face detection framework.

The main task of a face detector is to establish whether or not an image contains a human face and, if it does, where it is located. Such a task requires an accurate numerical description of “what sets the human face apart from other objects”. The algorithm must minimize both the false negative and the false positive error rates, and it must also work within a reasonable computational time.

For the classification task, the Viola-Jones algorithm uses Haar-like features: given an image and a set of Haar-like templates, the “feature vector” is formed by the scalar products between them (see Figure 65). Let I denote an image and P a Haar pattern of the same size; the feature associated with pattern P and image I is defined as:

f(I, P) = \sum_{x} \sum_{y} I(x, y) \, P(x, y)    (16)

Figure 65 – Haar-Like Features

Figure 66 – The Harr-Like Patterns used by Viola-Jones Algorithm

The algorithm uses five patterns (Figure 66), evaluated within a detection sub-window of 24×24 pixels:

two-rectangle feature: difference between the sum of the pixels within two rectangular regions (cases (a) and (b))

three-rectangle feature: the sum within two outside rectangles subtracted from the sum in a centre rectangle (cases (c) and (d))

four-rectangle feature: difference between diagonal pairs of rectangles (case (e)).

In order to evaluate the features very fast, Viola and Jones introduced in their paper a new image representation called the “integral image”. The integral image (II) at a given location contains the sum of the pixels above and to the left of that location:

(17)
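A minimal NumPy sketch of this representation is given below: the integral image is obtained with two cumulative sums, and the sum over any rectangle can then be read with at most four array look-ups. The function and variable names are illustrative.

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img over all rows <= y and all columns <= x."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, height, width):
    """Sum of the pixels inside a rectangle, using at most four look-ups."""
    bottom, right = top + height - 1, left + width - 1
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total
```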

The feature extraction method generates roughly 180,000 rectangle features for each image sub-window. Only a small number of these features needs to be combined to form an effective classifier. For this task, a weak classifier is designed for each feature, and the selected learning algorithm is a variant of AdaBoost: at each round of boosting, the algorithm selects one feature from the 180,000 potential features.

AdaBoost training is lengthy and, in order to produce good results, it requires very large negative training sets. Thus, in order to increase the computational speed and the system performance, a multi-layer cascade of detectors is used. The cascade consists of smaller, and therefore more efficient, classifiers that reject many of the negative sub-windows while detecting almost all positive instances:

simpler classifiers are used to reject the majority of sub-windows

then, more complex classifiers are called upon to achieve low false positive rates.

Figure 67 – Schematic Depiction of a Detection Cascade

In general, a classifier that includes more features will achieve a higher detection rate and a lower false positive error rate. Nevertheless, as the number of features increases, so does the computational time. A trade-off must therefore be made between the two parameters: processing time and detection rate.

In their algorithm, Viola and Jones proposed a very simple framework in order to build an effective and highly efficient classifier. “Each stage in the cascade reduces the false positive rate and decreases the detection rate. A target is selected for the minimum reduction in false positives and the maximum decrease in detection. Each stage is trained by adding features until the target detection and false positives rates are met (these rates are determined by testing the detector on a validation set). Stages are added until the overall target for false positive and detection rate is met.” The final face detection cascade contains 38 stages with over 6000 features.
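In practice, trained Viola-Jones cascades are readily available; the sketch below shows how such a cascade can be applied using OpenCV’s pretrained frontal-face Haar cascade. The input file name is a hypothetical placeholder, and the scale factor and neighbour threshold are common default-style choices rather than the exact settings used in this work.

```python
import cv2

# OpenCV ships Haar cascades trained with the Viola-Jones framework; this only
# illustrates how such a cascade is applied, not the detector configuration
# used in this work. "face_candidate.png" is a placeholder file name.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

gray = cv2.cvtColor(cv2.imread("face_candidate.png"), cv2.COLOR_BGR2GRAY)
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                      minSize=(24, 24))
for (x, y, w, h) in faces:
    print("face at", (x, y), "size", (w, h))
```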

Gabor Wavelet Filters

First introduced in 1946 by the Hungarian-born engineer Dennis Gabor , the Gabor functions, or atoms, have been used in many different fields, such as audio signal processing, image compression, edge detection, filter design and object recognition.

In image processing, a Gabor Wavelet Filter represents a band-pass linear filter whose impulse response is defined by a harmonic function multiplied by a Gaussian function. Basically, they are a group of wavelets, with each wavelet capturing energy at a specific frequency and a specific direction.

Marcelja and Daugman advanced the idea that simple cells in the visual cortex of mammalian brains can be modelled using Gabor functions. Thus, image analysis using Gabor filters is thought to be similar to perception in the human visual system. Gabor filters present optimal localization properties in both the spatial and the frequency domain and are therefore well suited for texture segmentation, edge detection and image representation problems.

A bi-dimensional Gabor filter can be viewed as a Gaussian kernel function modulated by a sinusoidal plane wave of a particular frequency and orientation, as follows:

(18)

where the parameters denote, respectively, the orientation of the wave plane, the spatial frequency of the sinusoidal wave at that orientation, the phase, and the standard deviations of the Gaussian envelope along the two axes.

The filter has two components (Figure 68), a real and an imaginary component, representing orthogonal directions:

(19)

(20)

Figure 68 – Example of Gabor Filter (Real and Imaginary Part)

The 2–D filters defined by relation (18) represent a group of wavelets and can optimally capture both local orientation and frequency information from an image. Using them, the image is filtered at various orientations, frequencies and standard deviations. Thus, in order to design a Gabor filter bank, we must define the phase, the orientations, the frequencies and the standard deviations.

When using Gabor filters for facial feature extraction, the proposed methods typically fix the phase, while the orientation angle is defined by:

(21)

where the parameter in (21) denotes the number of orientations.

Figure 69 depicts the real parts of Gabor wavelets for five scales and eight orientations.

Figure 69 – Gabor Filters (Real Part)

Gabor Face Representation

Let a grey-scale image of a given size in pixels be given, together with a Gabor filter with a given central frequency and orientation. The Gabor feature representation of the grey-scale image is obtained by convolving the input face image with the created Gabor filters (Figure 70), defined as:

(22)

where the operator in (22) denotes the convolution operation.

The resulting complex output of the filtering operation can be decomposed into its real and imaginary parts:

(23)

Based on relation (23), we can define the magnitude and the phase responses of the filtering operation:

(24)

For Gabor-based feature extraction, the magnitude of the filtering output is used by the majority of techniques, while the phase information is discarded.

In order to extract the Gabor magnitude face representation, the Gabor filter bank must first be constructed; thus, the variance values, spatial frequencies and orientations must be set a priori. In this thesis, for feature extraction we use the real part of the Gabor representation. Given (24) and (21), we consider five spatial frequencies and eight orientations, together with the corresponding variance values. Therefore, the 2–D Gabor filter bank is composed of 40 channels.

Figure 70 – Gabor Filter Output (Convolution Result)
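The sketch below illustrates how such a 40-channel bank (five scales, eight orientations, real part only) can be built and applied to a face image with OpenCV. The kernel size, wavelengths, sigma-to-wavelength ratio and down-sampling factor are illustrative assumptions, not the exact parameters used in the thesis.

```python
import cv2
import numpy as np

def gabor_bank(ksize=31, wavelengths=(4, 6, 8, 12, 16), n_orientations=8):
    """Real-part Gabor bank: 5 scales x 8 orientations = 40 kernels."""
    bank = []
    for lam in wavelengths:
        for k in range(n_orientations):
            theta = k * np.pi / n_orientations
            # args: ksize, sigma, theta, lambda (wavelength), gamma, psi (phase)
            bank.append(cv2.getGaborKernel((ksize, ksize), 0.56 * lam,
                                           theta, lam, 1.0, 0))
    return bank

def gabor_magnitude_features(face, bank, downsample=4):
    """Filter the face with every kernel, down-sample, normalise, concatenate."""
    feats = []
    for kernel in bank:
        response = cv2.filter2D(face.astype(np.float32), -1, kernel)
        response = response[::downsample, ::downsample]
        response = (response - response.mean()) / (response.std() + 1e-8)
        feats.append(response.ravel())
    return np.concatenate(feats)
```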

Neural networks

An artificial neural network (ANN) is a computational model inspired by the information processing method of biological nervous systems (the brain).

Artificial neural networks are composed of highly interconnected processing elements, called “neurons”, which compute values from inputs. The computational model of the artificial neuron is inspired by natural neurons. A natural neuron receives signals through synapses; if the signals are strong enough, the neuron is activated and emits a signal through its axon. The same concept is modelled by artificial neurons: the signals are received by the inputs, multiplied by weights, and then processed by a mathematical function which determines the activation of the neuron.

Figure 71 – Comparison between a Natural Neuron (top) and an Artificial Neuron (bottom)

By adjusting the connections that exist between the neurons (the weights of each input), an ANN is capable of machine learning as well as pattern recognition. The process of adjusting the weights is called learning or training.

Self-Organizing Maps

The basic idea of the Self-organizing map (SOM) or Kohonen network   is to mimic the human brain identification process of certain types of patterns. A Self-organizing map (SOM) is a single layer artificial neural network (ANN). It is trained using unsupervised learning and produces a low-dimensional (typically two-dimensional), discrete representation of the input space of the training samples. Thus it provides a way of representing multidimensional data in much lower dimensional spaces – usually one or two dimensions.

The main advantage of using Self-Organizing Maps is that they preserve the topological properties / relationships of the input space.

Figure 72 – SOM Architecture

A SOM consists of components called nodes or neurons arranged in a 1–D or 2–D, usually hexagonal or rectangular, grid (Figure 73). Higher order structures are uncommon. Each neuron has a weight vector associated with it and a specific topological position in the output map space; each neuron is connected to all the inputs, so the dimension of the weight vector matches the dimension of the input data vector. Distances between neurons are calculated from their positions with a distance function.

Figure 73 – 2–Dimensional SOM Rectangular or Hexagonal Lattice

Being a neural network, a SOM needs to be trained. The training process is unsupervised and it uses a competitive learning algorithm. The goal of learning in the self-organizing map is to cause different parts of the network to respond similarly to certain input patterns. This is partly motivated by how visual, auditory or other sensory information is handled in separate parts of the cerebral cortex in the human brain.

Each neuron in the SOM structure has an associated weight vector whose dimension equals that of the input vector. Initially, the weights of the neurons are set either to small random values or to samples drawn from the subspace spanned by the two largest principal component eigenvectors.

At each training step, an input pattern is selected and fed to the network, and the distance between this pattern and every neuron is computed. Given an input and a distance function, a “winning” neuron is identified as the node that best matches the input:

(25)

where the minimisation is performed over the set of neurons and the winner is the best matching unit (BMU).

The weights of the “winning” neuron and of its neighbours are updated toward the current input. Thus, the “winning” neuron and its close neighbours move towards the input vector.

(26)

where:

the learning rate: a training parameter that controls the size of the weight updates during SOM learning; its value decreases over time.

the neighbourhood function: it determines the rate of change of the neighbourhood around the winner neuron. The most widely used neighbourhood functions are the Bubble (27) and the Gaussian (28) functions.

(27)

(28)

Here, the Bubble neighbourhood is defined over the index set of nodes around the winning node, the positions of the nodes on the grid are given by two-dimensional index vectors, and the last parameter is the radius of the neighbourhood.

The training stage ends after a predefined number of epochs.

If the SOM training was successful, the input topology is preserved; the patterns that are close in the input space are mapped to neurons that are close in the output space.
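The following minimal NumPy sketch summarises this competitive training loop for an 8-by-8 map with a Gaussian neighbourhood; the initial learning rate, initial radius and exponential decay schedule are illustrative assumptions rather than the exact settings used in this work.

```python
import numpy as np

def train_som(data, grid_h=8, grid_w=8, epochs=100, lr0=0.5, sigma0=2.0):
    """Minimal SOM training loop with a Gaussian neighbourhood function."""
    rng = np.random.default_rng(0)
    n_features = data.shape[1]
    weights = rng.random((grid_h * grid_w, n_features))
    # Grid coordinates of every neuron, used by the neighbourhood function
    coords = np.array([(i, j) for i in range(grid_h) for j in range(grid_w)], float)

    n_steps, step = epochs * len(data), 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            lr = lr0 * np.exp(-step / n_steps)        # decaying learning rate
            sigma = sigma0 * np.exp(-step / n_steps)  # shrinking neighbourhood radius
            # Best matching unit: the neuron whose weights are closest to the input
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            # Gaussian neighbourhood around the BMU, measured on the output grid
            d2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
            h = np.exp(-d2 / (2.0 * sigma ** 2))
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights.reshape(grid_h, grid_w, n_features)
```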

Deep Learning

Since 2006, a new area of machine learning techniques has emerged: deep learning, also known as deep structured learning or hierarchical learning . In contrast to task-specific algorithms, deep learning methods are based on “learning data representations”. In the technical community, multiple definitions of deep learning have been proposed:

“A class of machine learning techniques that exploit many layers of non-linear information processing for supervised or unsupervised feature extraction and transformation, and for pattern analysis and classification”

“A sub-field within machine learning that is based on algorithms for learning multiple levels of representation in order to model complex relationships among data. Higher-level features and concepts are thus defined in terms of lower-level ones, and such a hierarchy of features is called a deep architecture. Most of these models are based on unsupervised learning of representations.”

“A sub-field of machine learning that is based on learning several levels of representations, corresponding to a hierarchy of features or factors or concepts, where higher-level concepts are defined from lower-level ones, and the same lower-level concepts can help to define many higher-level concepts. Deep learning is part of a broader family of machine learning methods based on learning representations. An observation (e.g., an image) can be represented in many ways (e.g., a vector of pixels), but some representations make it easier to learn tasks of interest (e.g., is this the image of a human face?) from examples, and research in this area attempts to define what makes better representations and how to learn them.”

“Deep learning is a set of algorithms in machine learning that attempt to learn in multiple levels, corresponding to different levels of abstraction. It typically uses artificial neural networks. The levels in these learned statistical models correspond to distinct levels of concepts, where higher-level concepts are defined from lower-level ones, and the same lower-level concepts can help to define many higher-level concepts.”

“Deep Learning is a new area of Machine Learning research, which has been introduced with the objective of moving Machine Learning closer to one of its original goals: Artificial Intelligence. Deep Learning is about learning multiple levels of representation and abstraction that help to make sense of data such as images, sound, and text.”

In terms of information processing and communication patterns, deep learning models tend to mimic the human information processing mechanism (the visual and auditory cortex). The human brain is organized as a deep architecture : a given stimulus input is transformed by layered hierarchical structures into high-level abstract information (e.g. the speech waveform is converted into linguistic information by the human speech perception system). Inspired by the brain architecture, researchers have tried to design and train multi-level neural networks. Until 2006, positive results were reported only for “shallow architectures” composed of two or three levels. The first breakthrough in “deep network architectures” was realized by Hinton et al. , who introduced Deep Belief Networks (DBNs).

Figure 74 – Example of Machine Vision Processing (© )

Consisting of multiple levels of non-linear operations, such as neural nets with multiple hidden layers, deep architectures can learn complex structures that represent high-level abstractions of the input data. The higher levels of the features are formed by the composition of lower level features.

When analysed from the point of view of the training method, deep learning architectures can be categorized into three major classes:

Deep networks for unsupervised or generative learning: the goal is to capture high-order correlations of the input data patterns. They are used for pattern analysis or synthesis, where no information related to the target class labels is available. They include architectures such as: Deep belief network (DBN), Boltzmann machine (BM), Restricted Boltzmann machine (RBM), Deep Autoencoder.

Deep networks for supervised learning (discriminative deep networks): they “directly provide discriminative power for pattern classification purposes, often by characterizing the posterior distributions of classes conditioned on the visible data” . Target label data must always be available, directly or indirectly. One of the most popular architectures is the Convolutional Neural Network (CNN).

Hybrid deep networks: “the goal is discrimination which is assisted, often in a significant way, with the outcomes of generative or unsupervised deep networks. This can be accomplished by better optimization or/and regularization of the deep networks in category (2). The goal can also be accomplished when discriminative criteria for supervised learning are used to estimate the parameters in any of the deep generative or unsupervised deep networks in category (1) above.”

As a summary, deep learning methods represent a class of machine learning algorithms characterized by:

feature extraction and transformation are performed by a cascade of multiple non-linear processing layers, where each successive layer uses the output of the previous layer as input;

learning can be supervised, unsupervised or hybrid;

having a hierarchical architecture, it can learn multiple levels of representations, each level corresponding to a different abstraction level;

the term “deep” refers to the number of hidden layers (greater than 3, it can easily reach 150).

Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are one of the most popular types of deep neural architectures. Convolutional nets are biologically-inspired variants of Multilayer Perceptron (MLP), which mimic the visual system structure proposed by Hubel et al. in 1962 .

Figure 75 – Typical Convolutional Neural Network Architecture (© )

From an architectural point of view, CNNs have a grid-like topology made up of multiple hidden layers (see Figure 75) stacked one on top of the other, in which each module consists of a convolutional layer and a pooling layer, optionally followed by a fully connected layer and/or a normalization layer. The neurons in the convolutional and pooling layers are connected only to small regions of the previous layer and are independent inside a single layer (the neurons inside a single layer don’t share any connections between them). The name “convolutional” indicates that the network uses the linear operation of convolution.

“Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.”

The first step in creating a new Convolutional Neural Network is to define the architecture. Depending on the type of application or on the data used for training, the architecture can vary in the number and types of included layers. CNNs can consist of 1-dimensional grids that process samples at regular time intervals (sound or time series), 2-dimensional grids that process grey-scale image pixels or audio data pre-processed with a Fourier transform, or 3-dimensional grids (volumetric data) such as colour images. Depending on the data complexity, the number of layers can vary from only one or two layers (for processing a small number of data patterns) to hundreds of layers (for more complex processing). Regardless of the number of layers, a CNN uses three main types of layers to build the architecture: the Convolutional Layer, the Pooling Layer, and the Fully-Connected Layer.

Convolutional Layer: is the core building block of a CNN. It consists of a small number of neurons, known as kernels or filters, which connect to sub-regions of the input (input data or outputs of the previous layer). In this layer, the previous layer’s outputs are convolved with the learnable kernels. Every filter is small spatially (along width and height) and extends through the full depth of the input volume. The filters are moved vertically and horizontally along the input. As the filter is moved along the input, at each location the dot product of the weights and the input is computed, and then a bias term is added. The output of the convolution operation is passed through the activation function, and a 2–D feature map is generated.

Figure 76 – Example of 2-D Convolution without Kernel Flipping (© )

In order to define a Convolutional Layer, the following parameters must be set: the number of filters, the filter size (height and width) and the stride (the step size with which the layer moves across the input). Given a filter with a certain height and width and an input with a given number of channels, the number of weights contained in a filter is the product of the three. The total number of parameters in a convolutional layer is:

(29)

Another important parameter is the padding size. It specifies the number of zeros to be added to the input borders in order to control the layer’s output size. The output height and width of a convolutional layer are:

(30)
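These bookkeeping rules can be summarised in a few lines of Python; the helper below follows the usual (input − filter + 2·padding)/stride + 1 rule and the “weights per filter plus one bias” parameter count, and the example values are purely illustrative.

```python
def conv_layer_stats(input_h, input_w, in_channels,
                     filter_h, filter_w, num_filters, stride=1, padding=0):
    """Parameter count and output size of a 2-D convolutional layer (cf. (29), (30))."""
    # Each filter holds filter_h * filter_w * in_channels weights plus one bias term
    n_params = (filter_h * filter_w * in_channels + 1) * num_filters
    out_h = (input_h - filter_h + 2 * padding) // stride + 1
    out_w = (input_w - filter_w + 2 * padding) // stride + 1
    return n_params, (out_h, out_w, num_filters)

# Illustrative example: 64 filters of 7x7 on a 128x128 single-channel input
print(conv_layer_stats(128, 128, 1, 7, 7, 64))   # -> (3200, (122, 122, 64))
```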

Pooling layers: they follow the convolutional layers and perform a non-linear down-sampling, thus reducing the number of connections to the next layers and the computational complexity of the model. They also help reduce overfitting. Pooling layers operate independently on every channel of the input. A specific property is that they do not perform any learning.

There are three types of pooling:

Max pooling: the input is divided into rectangular regions. The max-pooling operation returns the maximum values of these regions.

Average pooling: in the case of average-pooling, the layer returns the average value within the rectangular neighbourhood.

General pooling: the neurons included in this layer can perform different common operations. Some popular pooling functions include the L1-normalisation or the L2-normalisation.

Figure 77 – Example of Max-Pooling with 2×2 Filters and Stride 2 (© )

To define a pooling layer we must specify the height and width of the rectangular regions (also known as Pool Size), the step size (vertical and horizontal) used for scanning the inputs (Stride), and the function used to process the input (some examples are Max, L1-normalisation, L2-normalisation, and Average).

Regardless of the selected pooling method, the pooling layer increases the robustness of the network by providing a form of translation invariance; in the case of small translations of the input data, the values of most of the pooled outputs do not change. “Invariance to local translation can be a useful property if we care more about whether some feature is present than exactly where it is”.
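As an illustration of the down-sampling performed by such a layer, the following NumPy snippet implements the non-overlapping 2×2 max pooling with stride 2 shown in Figure 77; the function name is an illustrative choice.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling with stride 2 (cf. Figure 77)."""
    h, w = feature_map.shape
    h2, w2 = (h // 2) * 2, (w // 2) * 2              # drop odd borders, if any
    blocks = feature_map[:h2, :w2].reshape(h2 // 2, 2, w2 // 2, 2)
    return blocks.max(axis=(1, 3))                   # one maximum per 2x2 region
```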

Fully-Connected Layer: after a series of convolutional and pooling layers, the high-level processing of the neural network is performed using fully-connected layers. As the name suggests, the neurons that make up this layer are directly and fully connected to all activations from the previous layer, as in a regular ANN. By combining all the features (local information) learned by the previous layer, the fully-connected layer can identify larger patterns.

Figure 78 – Example of a Fully-Connected Layer

The fully-connected layer takes a number of inputs and produces a number of output units (Figure 78). The layer’s outputs represent linear combinations of its inputs. The neuron activations are computed using a matrix multiplication (with the neuron weights matrix), followed by a bias vector addition.

(31)

where the terms denote, in order, the j-th output value, the i-th input value, the i-th weight of the j-th neuron, the bias value of the j-th neuron and the neuron activation function.

The size of the last fully-connected layer depends on the problem type that the model must solve. For classification problems, the output size of the last fully-connected layer equals the number of classes of the data set. In this case the layer extracts features in order to classify the input data. For regression problems, the output size must be equal to the number of response variables.
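A one-line NumPy version of the forward pass described by relation (31) is given below; the choice of tanh as the activation function is an illustrative assumption.

```python
import numpy as np

def fully_connected(x, weights, bias, activation=np.tanh):
    """Relation (31): matrix multiplication, bias addition, then the activation."""
    return activation(weights @ x + bias)            # weights: (n_out, n_in), x: (n_in,)
```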

Other types of layers that can appear in a Convolutional Neural Network are:

ReLU Layer: a rectified linear unit (ReLU) represents an elementwise non-linear non-saturating activation function. The output size equals the input size:

(32)

Batch Normalization Layer: first introduced by Ioffe and Szegedy , this layer normalizes each input channel across a mini-batch. First, the activations of each channel are normalized by subtracting the mini-batch mean and dividing by the mini-batch standard deviation. Then, the layer shifts the input by an offset and scales it by a scale factor; both parameters are learned during network training. “Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. By normalizing activations throughout the network, it prevents small changes in layer parameters from amplifying as the data propagates through a deep network. Batch Normalization also makes training more resilient to the parameter scale”

Dropout Layer: fully connected layers are prone to overfitting, and one method to alleviate this issue is to use a dropout layer. At each training stage, this layer randomly sets some of its input elements to zero with a given probability. Dropping a unit at training time means that the neuron and all its incoming and outgoing connections are temporarily removed from the network (Figure 79). This operation is done randomly, with a given fixed probability.

Figure 79 – Dropout Neural Net Model (©)

Autoencoder Networks

Normally, neural networks used for classification are trained to map an input vector to an output that represents the known classification classes. An alternative approach is to use auto-associative neural networks, also called Autoencoders or Diabolo networks in order to learn a model of each class.

An Autoencoder, auto-associator or Diabolo network is a specific type of feedforward neural network used for unsupervised learning of efficient codings . Typically, the aim of an Autoencoder is to generate (learn) a compressed representation (also known as an encoding) of a dataset, in order to obtain a dimensionality reduction. This initial concept has evolved, and Autoencoders have become widely used for learning generative models of data.

Figure 80 summarizes the basic architecture of an Autoencoder. From an architectural point of view, the simplest form of an Autoencoder is a feed-forward, non-recurrent neural network, very similar to the multilayer perceptron (MLP). It consists of three major components: an input layer, an output layer and one or more hidden layers connecting them.

The main difference between an MLP and a Diabolo Network consists in the number of nodes contained in the output layer. In the case of Diabolo Networks the output layer has the same number of nodes as the input layer.

Figure 80 – Schematic Structure of an Autoencoder with 3 Fully-Connected Hidden Layers.

Both the encoder and the decoder are fully-connected feedforward neural networks and can have multiple layers. The code is a single ANN layer; the number of its hidden neurons (the code size) is a design parameter.

From the architectural point of view, an Autoencoder is defined when the following parameters are configured:

Code size: it represents the number of neurons / nodes in the middle layer. A smaller size generates a greater compression.

Number of layers: the minimum number of layers for an Autoencoder is two, but it can be as deep as we like. By having multiple layers for encoding and decoding, a so called “Stacked Autoencoder” is generated. With each subsequent layer of the encoder, the number of neurons per layer decreases. The decoder part is symmetric to the encoder, thus, with each layer the number of nodes per layer increases.

Number of nodes per layer: the number of nodes per layer decreases with each subsequent layer of the encoder, and increases back in the decoder. Also the decoder is symmetric to the encoder in terms of layer structure. As noted above this is not necessary and we have total control over these parameters.

Loss function: it is used during the training stage in order to compare the output of the Autoencoder with the input. Two loss functions are widely used: the Mean Squared Error (MSE) and the Binary Cross-Entropy. When the input values lie in the unit interval, the Cross-Entropy is typically used; in all other cases, the MSE is used.

Increasing the number of layers, the number of nodes per layer and / or the code size allows the Autoencoder to learn more complex codings.

The training of an Autoencoder is done using one of the many back-propagation algorithms (conjugate gradient method, steepest descent, etc.), with the goal that the network’s output reconstructs its own input.

An Autoencoder always consists of two parts, an encoder and a decoder. The “encoder” part of the network produces the code by compressing the inputs; the decoder then reconstructs the inputs using only this code.

From a functional point of view, the Autoencoder tries to learn an approximation of the identity function: given an input, it tries to produce an output that is similar to that input. For simplicity, we assume that the encoder and the decoder each have only one layer. For a given code size, the encoder maps the input onto another (code) vector as follows:

(33)

where the terms denote an element-wise activation function (transfer function) of the encoder, a weight matrix and a bias vector.

By using the decoder, the encoded representation is mapped onto the reconstruction vector, an estimate of the original input vector.

(34)

where the terms denote the transfer function of the decoder, a weight matrix and a bias vector.
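A minimal NumPy sketch of relations (33) and (34), assuming a single-layer encoder and decoder and a sigmoid transfer function (an illustrative choice), is given below.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encode(x, W_enc, b_enc):
    """Relation (33): map the input onto the code vector, z = f(W x + b)."""
    return sigmoid(W_enc @ x + b_enc)

def decode(z, W_dec, b_dec):
    """Relation (34): reconstruct the input from the code, x_hat = g(W' z + b')."""
    return sigmoid(W_dec @ z + b_dec)
```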

Because an Autoencoder is designed to replicate the input data at its outputs, the biggest risk is that it will learn the identity function, the trivial case. In order to prevent this from happening, different constraints can be placed on the network. The simplest one consists in limiting the number of hidden neurons: e.g. if the input vector has 100 elements, then the hidden layer has 50 neurons. With this constraint, the encoder compresses the 100-element vector into a 50-element vector, and the decompression phase expands the 50-element vector back into a 100-element vector that is ideally close to the original input. Such a simple Autoencoder often ends up learning a low-dimensional representation of the inputs very similar to PCA (when the input data is correlated).

Another type of constraint, which is more often used, is the sparsity constraint. In this case, the hidden layer has a large number of neurons (compared with the number of inputs), but for each input only a small fraction of them will be activated (produce an activation value close to “1”) during the training of the network. This constraint forces the Autoencoder to represent each input as a combination of a small number of nodes. Such encoders are called “Sparse Autoencoders” and they produce a sparse representation of the inputs. The “Sparsity Parameter” is a small value close to zero (e.g. 0.05).

In order to obtain a Sparse Autoencoder, the training criterion must include a sparsity penalty on the code layer, and the reconstruction error is calculated as:

(35)

where the first term represents the loss function, i.e. the error between the input and the estimated output.

A widely used sparsity regularization term is the Kullback-Leibler divergence. It is imposed on the average output activation value of a neuron. For a given neuron, we can define its average output activation value over the training set as:

(36)

where the terms denote the total number of training samples, the j-th training sample, the i-th row of the weight matrix and the i-th entry of the bias vector.

In this case the sparsity regularization term becomes:

(37)

The Kullback-Leibler divergence is an efficient function for measuring how different two distributions are. When the average activation value of a neuron and its desired value (the sparsity parameter) are close to each other, the Kullback-Leibler divergence is near zero; otherwise, it grows as the two values diverge from each other.
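The sketch below computes this penalty for a matrix of hidden-layer activations; it assumes activations in (0, 1), e.g. produced by a sigmoid transfer function, and a sparsity parameter of 0.05 as in the text.

```python
import numpy as np

def kl_sparsity_penalty(hidden_activations, rho=0.05):
    """Kullback-Leibler sparsity term over the average activation of each hidden unit.

    hidden_activations: array of shape (n_samples, n_hidden), values in (0, 1).
    rho: the desired average activation (the sparsity parameter).
    """
    rho_hat = hidden_activations.mean(axis=0)        # average activation per neuron
    kl = (rho * np.log(rho / rho_hat)
          + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))
    return kl.sum()
```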

The cost function for training a sparse Autoencoder adjusted with the sparsity regularization term becomes:

(38)

One possible choice of loss function for the Autoencoder is a norm of the reconstruction error:

(39)

Combined Audio-Visual Recognition System

In this work, an “early fusion” audio-visual recognition system intended to work in noisy environments is presented.

In order to achieve the recognition task, two classifiers, one for visual recognition, another for audio recognition, are developed.

In our work, we generate the subject’s face and voice models using two different neural techniques: for the face model, our approach is to use Autoencoder neural networks, while for voice modelling we explore two possibilities: the first uses Self-Organizing Maps, and the second uses Convolutional Neural Networks in order to generate the speaker model.

Finally, after testing both approaches, only one will be retained.

Face Detection and Recognition System

Proposed Algorithm

Our proposed algorithm (presented in Figure 81) consists of the following stages: (I) multi-level features vector extraction using Gabor filters, (II) subject’s face model generation using autoencoders and (III) unknown subject’s identification.

Figure 81 – Block Diagram of Proposed Face Recognition Procedure

The algorithm is designed to use grey-scale images; thus, in the case of a colour image, we convert from the RGB (red – green – blue) colour space to the HSV (hue – saturation – value) colour space and keep the V component. Afterwards, each face image from the database is resized to 128–by–128 pixels.

The “Features Vector Extraction” phase generates the features vector from each known subject image using Gabor filters in a multi-level approach. For this task we use two sets of Gabor filters. Both sets of filters have five scales and eight orientations; the difference between them lies in the size of the filters.

The first set of Gabor filters has a size of 128–by–128 pixels; we perform the convolution between the full face image and the first filter bank, resulting in a response of size 128–by–128–by–40 that is down-sampled in order to reduce its dimensions. The down-sampled response is normalized to zero mean and unit variance, then converted into row vectors and concatenated.

The second set of Gabor filters has a size of 64–by–64 pixels. The face image is divided, from left to right and top to bottom, into nine rectangular regions, each consisting of 64–by–64 pixels, with an overlap of 50% between adjacent regions. Each of the nine regions is then filtered using the second set of Gabor filters. The convolution result is down-sampled and stored in the features vector (Figure 82). The final features vector consists of 33,640 elements.

Figure 82 – Features Vector Extraction

Subject’s Face Modelling and Matching

“Subject’s Face modelling” represents the second step of our algorithm. Having as input multiple feature vectors from the same subject, we generate the “Face model” with the help of a stacked neural network architecture that combines a Diabolo network and SoftMax layer (Figure 83).

Using features vectors extracted in the “Features Vector Extraction” phase, we train an Autoencoder network. The hidden layer of the Autoencoder will map the input data to a reduced subspace that represents the compressed version of the input.

The encoder of the trained network is used to generate the feature vectors for the final step of the algorithm. In this stage, a SoftMax network is trained to classify the compressed feature vectors into the different subject classes. The number of output classes of the SoftMax layer is equal to the number of known subjects.

Figure 83 – Stacked Autoencoder Architecture

Speaker Recognition

For the voice-based speaker recognition task, in this thesis we present and validate two approaches: in the first method, the speaker modelling is based on Self-Organizing Maps (SOMs) , while in the second approach we propose the use of Convolutional Neural Networks (CNNs), a newer state-of-the-art technique in line with current research interests.

SOM-Based Speaker Recognition

As mentioned earlier, the first designed speaker identification system proposes the use of SOMs for speaker voice modelling (Figure 84); as feature vectors, the MFCCs are combined with the DMFCCs. In the recognition stage, the unknown speaker’s identity is established by determining the degree of similarity between the feature vectors of the speech sample and the known speaker models. This is done by computing the pair-wise distance between the SOM representation and the feature vectors extracted from the speech signal.

Figure 84 – SOM-based Speaker Recognition Block Diagram

Speaker’s Voice Modelling

The system that we propose uses the Mel-Frequency Cepstral Coefficients as feature vectors and Self-Organizing Maps for creating the speaker model. The training of the neural network is done using noise free recordings, while the testing of the system is performed using noise corrupted speech files.

In the feature vector extraction stage, we calculate 32 Mel-Frequency Cepstral Coefficients and their deltas, as described in Chapter III (Figure 58), resulting in a feature vector with 64 components. In the pre-emphasis stage, the speech waveform, sampled at 16 kHz with 16 bits per sample, is passed through a second-order high-pass filter. The aim of this stage is to boost the high-frequency components of the human voice that are suppressed during speech production. The pre-emphasis filter is given by the following transfer function:

(40)

After pre-emphasis, the signal is blocked into frames with a given overlap between consecutive frames; typical values are a 30 ms frame width and a 10 ms overlap. The proposed algorithm uses a 32 ms frame, while the optimal overlap between frames is determined through testing. By applying a Hamming window to each speech frame, the discontinuities caused by framing are minimized.
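A compact Python sketch of this front-end (pre-emphasis, framing and Hamming windowing) is given below; the first-order pre-emphasis coefficient of 0.97 is a common textbook choice used here only as a stand-in for the second-order filter of relation (40), and the 5 ms stride anticipates the value selected later through testing.

```python
import numpy as np

def frame_signal(speech, fs=16000, frame_ms=32, stride_ms=5, preemph=0.97):
    """Pre-emphasis, framing and Hamming windowing of a speech signal.

    The first-order pre-emphasis coefficient is a common textbook choice and
    only stands in for the second-order filter of relation (40).
    """
    emphasized = np.append(speech[0], speech[1:] - preemph * speech[:-1])

    frame_len = int(fs * frame_ms / 1000)            # 32 ms -> 512 samples at 16 kHz
    stride = int(fs * stride_ms / 1000)              # 5 ms  -> 80 samples
    n_frames = 1 + (len(emphasized) - frame_len) // stride

    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i * stride: i * stride + frame_len] * window
                       for i in range(n_frames)])
    return frames                                     # shape: (n_frames, frame_len)
```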

The algorithm uses the VQ functionality of the self-organizing maps in order to establish the identity of the speaker. The identity of the unknown speaker is determined by calculating the mean quantization distortion between the unknown speech input data and all the known speaker models from the database. The SOM model that presents the smallest quantization distortion establishes the unknown speaker’s identity.

For the speaker model, we propose the use of Kohonen neural networks or Self-Organizing Maps (SOMs). For each of the known speakers, we train a SOM with 64 neurons arranged in an 8–by–8 hexagonal pattern, using the 64-component feature vectors.

Figure 85 – SOM Architecture

Figure 86 – Proposed Algorithm Stages

The proposed algorithm associates with each speaker an 8×8 hexagonal SOM with 64 inputs. The number of training epochs is set to 500 and, as the distance function, we use “linkdist”.

Voice Model Matching

One of the main goals of a SOM is to quantize the input values into a finite number of centroids or output vectors. A measure of how well the neurons represent the input pattern is the quantisation error. The network quantization error is defined as the distance between an input vector and its nearest centroid. Summing the quantization error over the input data represents the network distortion:

(41)

In this work, the similarity between the input and codebook is determined using the mean quantization error of the SOM:

(42)

where: represents the number of input vectors; is a distance function.

Several metrics will be used in order to compute the mean quantization error: Euclidean distance, City-block distance, Chebychev distance, Cosine distance and Spearman distance.
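To make the matching rule explicit, the NumPy sketch below computes the mean quantization error of relation (42) with the Euclidean distance and assigns the unknown speech sample to the speaker model with the smallest error; the other metrics listed below can be substituted in the same place, and the function names are illustrative.

```python
import numpy as np

def mean_quantization_error(features, som_weights):
    """Relation (42): average distance from each feature vector to its nearest neuron.

    features:    (N, d) MFCC + delta vectors extracted from the test utterance.
    som_weights: (M, d) codebook, i.e. the weight vectors of the M SOM neurons.
    The Euclidean distance is used here; the other metrics can be substituted.
    """
    dists = np.linalg.norm(features[:, None, :] - som_weights[None, :, :], axis=2)
    return dists.min(axis=1).mean()

def identify_speaker(features, speaker_models):
    """The speaker model with the smallest mean quantization error gives the identity."""
    errors = {name: mean_quantization_error(features, w.reshape(-1, features.shape[1]))
              for name, w in speaker_models.items()}
    return min(errors, key=errors.get)
```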

Given two data matrices, each treated as a set of row vectors, the distance between a row vector xs of the first matrix and a row vector yt of the second matrix is defined, for each metric, as:

Euclidean distance

(43)

City-block distance

(44)

Chebychev distance

(45)

Cosine distance

(46)

Spearman distance

(47)

where, for the Spearman distance, the rank of each element of xs is taken over the elements of xs, and the rank of each element of yt is taken over the elements of yt.

Using CNN for Voice Modelling

In our second approach to speaker modelling, we propose the use of 2-D convolutional filters for learning speaker-dependent traits from MFCC features. In this approach, speaker identification is modelled as an image classification problem and solved with the help of a CNN.

Figure 87 – CNN Features Vectors Extraction

The feature vectors needed to train the network are constructed as follows (Figure 87): using a moving-window approach, each speech sample is divided into segments of 500 milliseconds in length with an overlap of 475 milliseconds. Then, from each speech segment we extract the MFCC features using a 25 ms frame width; the stride between MFCC frames is set to 5 ms. The number of MFCC features was determined experimentally during the testing of the proposed CNN architecture. The MFCC features corresponding to one speech segment are packed together into a matrix, resulting in a feature patch whose width equals the number of MFCC frames contained in a speech segment.
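The segmentation and patch construction can be sketched as follows in Python, using librosa only as an example MFCC implementation; the 36 coefficients match the value reported later in the training section, while the exact MFCC configuration (filter bank, liftering, etc.) of the thesis is not reproduced here.

```python
import numpy as np
import librosa

def mfcc_patches(speech, fs=16000, seg_ms=500, seg_step_ms=25, n_mfcc=36):
    """Cut the signal into 500 ms segments (475 ms overlap) and turn each segment
    into an MFCC "image" used as one CNN training sample."""
    seg_len = int(fs * seg_ms / 1000)
    seg_step = int(fs * seg_step_ms / 1000)
    patches = []
    for start in range(0, len(speech) - seg_len + 1, seg_step):
        segment = np.asarray(speech[start:start + seg_len], dtype=float)
        mfcc = librosa.feature.mfcc(y=segment, sr=fs, n_mfcc=n_mfcc,
                                    win_length=int(0.025 * fs),   # 25 ms frame
                                    hop_length=int(0.005 * fs))   # 5 ms stride
        patches.append(mfcc)                          # shape: (n_mfcc, n_frames)
    return np.stack(patches)[..., np.newaxis]         # add a channel dimension
```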

A Deep Convolutional Neural Network consists of multiple layers stacked one on top of the other. The input data is transformed by each layer and passed on to the next network layer. A crucial step in the development of a CNN architecture capable of learning the “relevant” features needed to distinguish between different speakers is choosing the shape and placement of the filters along the various layers. The CNN architecture used in this work is presented in Figure 88 and consists of 22 layers organized in 5 groups, as follows (a code sketch of this stack is given after the list):

Figure 88 – Used CNN Architecture

Group 1:

2-D Convolutional Layer with 64 filters of size 7–by–7 and stride of 1–by–1;

2-D Convolutional Layer with 64 filters of size 5–by–5 and stride of 1–by–1;

2-D Convolutional Layer with 64 filters of size 3–by–3 and stride of 2–by–2;

Dropout Layer with a probability of 0.3

Group 2:

2-D Convolutional Layer with 256 filters of size 7–by–7 and stride of 1–by–1;

2-D Convolutional Layer with 256 filters of size 3–by–3 and stride of 1–by–1;

2-D Convolutional Layer with 256 filters of size 3–by–3 and stride of 1–by–1;

2-D Convolutional Layer with 256 filters of size 3–by–3 and stride of 2–by–2;

Dropout Layer with a probability of 0.3

Group 3:

2-D Convolutional Layer with 1024 filters of size 5–by–5 and stride of 1–by–1;

Dropout Layer with a probability of 0.3

Group 4:

Fully-Connected Layer with 1024 neurons

Fully-Connected Layer with the number of neurons equal to the number of known speakers

Group 5:

SoftMax Layer

Classification layer as the network’s output
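The group structure above can be written down, for illustration, as the following Keras sketch. The activation functions, padding mode, input shape and the flattening step before the fully-connected layers are assumptions, while the filter counts, kernel sizes, strides and dropout probabilities follow the list; the model would then be compiled, in line with the training section below, with an SGD-with-momentum optimizer.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_speaker_cnn(input_shape, n_speakers):
    """Keras sketch of the five-group convolutional architecture listed above."""
    return keras.Sequential([
        keras.Input(shape=input_shape),               # e.g. (n_mfcc, n_frames, 1)
        # Group 1
        layers.Conv2D(64, 7, strides=1, padding="same", activation="relu"),
        layers.Conv2D(64, 5, strides=1, padding="same", activation="relu"),
        layers.Conv2D(64, 3, strides=2, padding="same", activation="relu"),
        layers.Dropout(0.3),
        # Group 2
        layers.Conv2D(256, 7, strides=1, padding="same", activation="relu"),
        layers.Conv2D(256, 3, strides=1, padding="same", activation="relu"),
        layers.Conv2D(256, 3, strides=1, padding="same", activation="relu"),
        layers.Conv2D(256, 3, strides=2, padding="same", activation="relu"),
        layers.Dropout(0.3),
        # Group 3
        layers.Conv2D(1024, 5, strides=1, padding="same", activation="relu"),
        layers.Dropout(0.3),
        # Group 4 (Keras needs an explicit flattening step before the dense layers)
        layers.Flatten(),
        layers.Dense(1024, activation="relu"),
        layers.Dense(n_speakers),
        # Group 5
        layers.Softmax(),
    ])

# Illustrative training setup (shapes are assumptions):
# model = build_speaker_cnn((36, 100, 1), 36)
# model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.005, momentum=0.9),
#               loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```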

Case Study

Speaker Verification Experiments

The aim of the experiments described in this section is to identify the optimal configuration and to examine the classification precision provided by the ASR architectures proposed in the previous chapter. The classification performance of both systems will be established under both clean and noisy conditions.

The database used in this work is the CHAINS Speech corpus , and it contains 36 speakers (18 male and 18 female). The recordings are obtained in two different sessions that are about two months apart. The recording sessions provide a range of six different speaking styles and voice modifications, as follows:

The first recording session, which contains the SOLO, SYNCHRONOUS and RETELL conditions, was performed in a professional recording studio.

In the SOLO condition, subjects read a prepared text (the Cinderella fable) at a comfortable pace. This condition serves as a baseline.

For SYNCHRONOUS condition recordings, the subjects read a prepared text in synchronicity with a co-speaker.

The RETELL condition represents spontaneous speech. After reading the Cinderella fable in the SOLO condition, the subjects are asked to retell the story using their own words. This mode provides the opportunity to test the ASR methods on a radically different speech style (with a high probability of containing specific short phrases and specific lexical items).

The second recording session was performed in a quiet office environment and consists of the “REPETITIVE SYNCHRONOUS IMITATION (RSI)”, “WHISPER” and “FAST RATE” conditions.

For the WHISPER condition, the subjects read the text in a whisper. During whispering, the glottis is constricted, which generates a turbulent airflow and “a characteristic hissing quality”. During whispering, the fundamental frequency (F0) cannot be used as a marker for speaker identity, and the information related to vocal intensity and voice quality is also reduced.

In the FAST RATE condition, a reference speaker reads a short text at a relatively fast rate and the subjects are asked to read the text at the same rate. The increased speech rate causes complex changes to the speech signal.

Validating the SOM Model Approach

The training stage of our proposed SOM model is performed using four recordings: two SOLO reading recordings (“The Cinderella Story” and “The Rainbow Text”), where subjects simply read a prepared text at a comfortable rate, and another two in the WHISPER condition (“The North Wind” and “The Members of the Body”).

In order to test the trained model, we used recordings from the SOLO, RETELL and WHISPER speech conditions. The original sound files are down-sampled from 44.1 kHz to 16 kHz, 16 bit mono PCM.

Determining the optimal frame rate

In order to compare two Self-Organizing Maps, a performance index must be defined. In this work, we are using the mean quantization error of the SOM as performance index. One of the goals of a SOM is to quantize the input values into a finite number of centroids or output vectors. The quantization error is defined as the distance between an input vector and its nearest centroid. The sum of the quantization error over the input data represents the network distortion:

(48)

As mentioned earlier, we are using the mean value of the distortion as performance index for the speaker model:

(49)

Where: N represents the number of input vectors, a distance function.

To determine the optimal frame rate, we will generate speaker models using several frame strides (12.5ms, 10ms, 8ms, 6ms, 5ms, 4ms and 3ms) and, afterwards, compute for each of the models, the mean quantization error for the training data. The SOM performance index is computed using the following metrics: Euclidean distance, Cityblock distance, Chebychev distance and Spearman distance.

Figure 89 to Figure 92 present, for each speaker model, the evolution of the average quantization error as a function of the frame rate. By analysing them, we can observe that, as the frame rate increases (i.e. the frame stride decreases), the quantization error for each metric decreases. However, with the increase of the frame rate, the number of MFCC vectors also increases, and thus the system processing time is affected. In order to determine an optimal frame rate, we must define an optimization criterion.

We propose the following optimization criterion:

(50)

where the first term represents the variation (in %) of the performance index defined in (49), and the second the variation (in %) of the number of MFCC vectors due to the frame rate increase.

Figure 89 – Quantization Error (Euclidian Distance)

Figure 90 – Quantization Error (Cityblock Distance)

Figure 91 – Quantization Error (Chebychev Distance)

Figure 92 – Quantization Error (Spearman Distance)

The mean quantization error for each speaker model, as can be seen from the previous figures, decreases as the frame rate increases (for higher frame rates we obtain a lower quantization error). This can also be observed in Figure 93, where the evolution of the mean performance index over all the speaker models is shown.

Figure 93 – Optimisation Criteria vs. Frame Rate

From Figure 94, which illustrates the evolution of the optimization criterion in relation to the frame rate, we can conclude that the optimal frame rate for extracting the MFCCs is 200 fps (a frame stride of 5 ms). The Spearman metric presents a maximum when the frame rate is 200 fps; all the other metrics have a downward trend, since the performance increases more slowly than the number of Mel-Frequency Cepstral Coefficients.

Figure 94 – Optimization Criterion Evolution

Noise robust metric identification

To determine a noise-robust metric, we test the speaker models with a noise-corrupted version of the training data. Because the CHAINS corpus is recorded in a sound-proof booth (no noise is present), we have manually altered the training recordings by adding Gaussian white noise at different SNRs (40dB, 35dB, 30dB, 25dB, 20dB, 15dB, 10dB, 5dB and 0dB). In these tests we use a frame width of 32ms and a frame stride of 5ms, as determined in the previous section.

By analysing the system outputs in Figure 95 (blue represents lower values, red higher values) for the case where the input data contains no noise, we can observe that the minimum values of the mean quantization error are obtained when the speech data and the speaker model come from the same person (the minimum values lie on the minor diagonal); thus we have a recognition rate of 100%.

Figure 95 – System Mean Quantization Error No Noise

When noise is added, the minimum values of the performance index no longer appear on the minor diagonal (where the speaker model and the speech sample correspond to the same person), as can be seen from Figure 96, which shows the system outputs for an SNR of 20dB; consequently, the identification rate decreases.

Figure 96 – System Mean Quantization Error SNR 20dB

By analysing Figure 95 (a – d) and Figure 96 (a – d) we can observe that, when the Spearman metric is used to compute the system mean quantization error, the noise influence on the results is diminished.

The system presents roughly the same distribution of the output values for the two scenarios presented, Figure 95 (d) no noise and Figure 96 (d) 20dB SNR, in which the Spearman metric is used as the distance function.

The identification rate for a SNR of 20dB is 95% when Spearman metric is used; for the other metrics the rate is below 50%.

Figure 97 – Identification Rate vs. SNR using Training Data as Input

Figure 97 shows the evolution of the identification rate for all the four metric functions in relation to the SNR value. It can be easily seen that:

when the Chebychev metric is used the performance of the system is heavily influenced by the Gaussian white noise

using the Cityblock metric, the system outperforms the case when the performance index is computed using the Euclidian metric

the Spearman metric is the least affected by noise.

System validation

We evaluate the system performance using three recordings of the CHAINS corpus database from different speech conditions: SOLO Condition (“The Members of the Body”), WHISPER Condition (“The Cinderella Story”) and RETELL Condition (“The Cinderella Story”). The recordings are corrupted using different noise sources (Gaussian white noise, airport noise, restaurant noise and street noise) and multiple SNRs. The Mel-Frequency Cepstral Coefficients are extracted using a frame width of 32ms and a frame overlap of 27ms. In order to determine the speaker’s model quantization error, we use the Spearman metric as distance function.

As benchmark, to compare the degradation of the system performance with noise, we determine the system mean quantization error using the retell speech files without noise corruption. Afterwards, the input files are mixed with different noise sources using multiple signal to noise ratios. All the files are down-sampled to 16000Hz.

Figure 98 – Mean Quantization Error for Retell Recordings (Without Noise)

By looking at Figure 98 and Figure 95 (d), we observe that the system quantization error presents nearly the same distribution in both cases (input data retell sound files and solo reading sound files).

Figure 99 – Identification Rate vs. SNR using SOLO Recordings

Figure 100 – Identification Rate vs. SNR using RETELL Recordings

Figure 101 – Identification Rate vs. SNR using WHISPER Recordings

The evolution of the system identification rate for different types of noise sources and signal to noise ratios is presented in Figure 99 to Figure 101.

In the case of SOLO recordings (Figure 99), the system presents a 100% identification rate, even for a SNR value of 25dB and for all types of noise sources. In the case of RETELL recordings, the maximum identification rate is about 97%. For the WHISPER recordings, our proposed system has an identification rate of 100%, when the “clean sources” are used, and it decreases when noise is added.

By analysing the identification rate for all the possible speech conditions, we can see that Gaussian white noise produces the biggest perturbation of the identification rate. When the source is corrupted using Gaussian white noise, the identification rate drops to 85% at an SNR of 20dB (for the SOLO speech condition), and the performance decrease is more rapid for this type of noise.

Figure 102 – System Mean Quantization Error Using Retell Sound Files and SNR 20dB

Figure 103 – Influence of Noise on the Mean Spearman Distance between the Speaker Model and Corresponding Test Data

According to (8), the Spearman distance is one minus the sample Spearman's rank correlation between observations, treated as sequences of values. The rank correlation coefficient is between -1 and 1:

−1 if the correlation between the two rankings is perfect; one ranking is the reverse of the other.

0 if the rankings are completely independent.

1 if the correlation between the two rankings is perfect; the two rankings are the same

An increased rank correlation coefficient implies a high correlation between the rankings of the two observations; the two observations, the speaker model and the speech feature vectors, have similar shapes.

Validating the CNN Model Approach

Training the CNN Model

We perform the training of the CNN using, for each speaker, two recordings from the SOLO condition (“The Cinderella Story” and “The Rainbow Text”) and another two from the WHISPER condition (“The North Wind” and “The Members of the Body”). The total recording length, per speaker, is about 150 seconds. The network training is done using the stochastic gradient descent with momentum (SGDM) optimizer with a momentum of 0.9, an initial learning rate of 0.005 and an L2 regularization factor of 0.0001. After each learning epoch, the learning rate drops by a factor of 0.1.

The MFCCs are calculated using a window of 25ms and 5ms frame shift. For each frame we extract the first 36 MFCCs, including C0. The MFCC spectrum is extracted from 500ms speech segments.

Figure 104 – CNN Training Progress

The progress of the CNN training can be seen in Figure 104. After approximately 700 training epochs, the CNN presents a recognition rate greater than 90%. After the training was completed, for the proposed model we obtained a raw recognition error of 0.16%.

Figure 105 – CNN ROC Curve for Validation Data

In order to evaluate the noise robustness of our CNN system, we test the trained CNN using as input the noise corrupted version of the training data. The recordings used during the training phase are corrupted using multiple noise sources and 7 SNRs. The recognition rate evolution in the context of noise corruption is presented in the figure below (Figure 106).

Figure 106 – Identification Rate vs. SNR using Training Data as Input

We can observe that the presence of noise in the training data has little to no influence on the CNN recognition rate. Down to an SNR of 20dB, the system recognition rate remains constant.

Testing the CNN Model

The testing is performed using recordings from the SOLO (“The Members of the Body”), WHISPER (“The Cinderella Story”) and RETELL (“The Cinderella Story”) conditions. In order to simulate real-life conditions, we also perform tests with noise-corrupted data from different noise sources (Gaussian white noise, airport noise, restaurant noise and street noise) and with multiple SNRs. For the testing procedure, the 500ms speech segments have an overlap of 350ms. This oversampling allows us to filter the network’s output.

Figure 107 – CNN ROC Curve for Test Data

For the trained CNN, during the testing phase we obtain the following recognition error rates: in SOLO condition 1.19% error rate, in WHISPER condition 2.14% error rate and in RETELL condition 16.99%.

We also test our system under “noisy” operating conditions in order to evaluate the degradation of the recognition rate in the presence of different noise sources and levels. The selected test audio data is corrupted using Gaussian white noise, airport noise, restaurant noise and street noise, and is afterwards fed through the CNN. Figures 108 to 110 present the evolution of the system performance versus the SNR.

Figure 108 – Identification Rate vs. SNR using SOLO Recordings

Figure 109 – Identification Rate vs. SNR using RETELL Recordings

Figure 110 – Identification Rate vs. SNR using WHISPER Recordings

By analysing the evolution of the identification rate in a noisy environment, we can observe that the CNN approach is more robust than the SOM modelling. However, the SOM model outperforms the CNN in the case of the RETELL recordings.

Face Verification Experiments

All experiments were executed using one colour face database, Caltech 101 , and three grey-scale face databases: Yale Face , Extended Yale B and ORL . The Caltech database contains 450 frontal face colour images (896 x 592 pixels) of 26 unique subjects with different lighting, expressions and backgrounds (Figure 111). The face regions are extracted from the images using a Viola-Jones face detector. Some images have very poor lighting conditions (they are too dark or too bright) and were excluded. After face segmentation, more than 200 faces of 26 unique individuals remain.

Figure 111 – Sample images from Caltech face database

The Yale Face database consists of 11 different grey-scale pictures for 15 unique individuals. The images were acquired with several configurations and facial expressions: normal, centre-light, with glasses, happy, left-light, no glasses, right-light, sad, sleepy, surprised, and wink (Figure 112).

Figure 112 – Sample images from Yale face database

The Extended Yale Face Database B contains 16128 grey-scale images of 28 human subjects under 9 poses and 64 illumination conditions, resulting in 576 viewing conditions for every subject. The images are aligned, cropped and resized. The nine recorded poses are as follows: pose 0 is the frontal pose; poses 1, 2, 3, 4, and 5 were about 12 degrees from the camera optical axis, while poses 6, 7, and 8 were about 24 degrees. In this thesis, only a subset of the database was used, consisting of 63 pictures per subject under the following conditions: nine poses (0 to 8), five azimuth positions (0, +0.5, -0.5, +1, -1 degrees) and three elevation positions (0, +10, -10 degrees) (see Figure 113).

Figure 113 – Sample images from Yale B face database

The ORL (Olivetti Research Laboratory) face database contains a set of grey-scale faces taken between April 1992 and April 1994. There are 10 different images for each of 40 distinct subjects. The images were taken at different times, with slight variations in facial expressions (open / closed eyes, smiling / non-smiling) and facial details (glasses / no glasses). All images are taken against a dark homogeneous background, with the subjects in an upright, frontal position.

Figure 114 – Sample images from ORL face database

By running the experiments with different configuration parameters and measuring the error rates, we determined the optimal neural network configuration: L2 Weight Regularization 0.04, Sparsity Regularization 1.6, Sparsity Proportion 0.1 and Hidden Layer Size 200.
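These parameters correspond to the usual sparse autoencoder objective: reconstruction error plus an L2 weight penalty and a KL-divergence sparsity penalty. The PyTorch sketch below uses the values quoted above, but the module and loss implementation are our own illustration, not the thesis code.

import torch
import torch.nn as nn

HIDDEN_SIZE = 200          # hidden layer size
L2_WEIGHT_REG = 0.04       # L2 weight regularization
SPARSITY_REG = 1.6         # weight of the sparsity penalty
SPARSITY_PROPORTION = 0.1  # target average activation of the hidden units

class SparseAutoencoder(nn.Module):
    def __init__(self, input_size, hidden_size=HIDDEN_SIZE):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_size, hidden_size), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(hidden_size, input_size), nn.Sigmoid())

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

def sparse_ae_loss(model, x, x_hat, code):
    # Reconstruction error + L2 weight decay + KL-divergence sparsity penalty.
    mse = nn.functional.mse_loss(x_hat, x)
    l2 = sum(p.pow(2).sum() for p in model.parameters() if p.dim() > 1)
    rho = SPARSITY_PROPORTION
    rho_hat = code.mean(dim=0).clamp(1e-6, 1 - 1e-6)
    kl = (rho * torch.log(rho / rho_hat)
          + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()
    return mse + L2_WEIGHT_REG * l2 + SPARSITY_REG * kl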

For testing our algorithm, each database was divided into two parts: the first part, containing 4 pictures per subject, was used to generate the face model; the second part, consisting of the remaining pictures of each person, was used as test data. With this setup we measured the effectiveness of the proposed algorithm using several standard error and recognition rates, as follows:

the rank one recognition rate (ROR) – the nearest neighbour is an image of the same person;

ROR = \frac{N_c}{N_t} \times 100\%    (51)

where N_c represents the number of correctly classified identities and N_t the total number of test images;

the false acceptance rate (FAR), indicates the percentage of accepted non-authorized users;

the false rejection rate (FRR), indicates the percentage of incorrectly rejected authorized users;

the minimal half total error rate, obtained by minimising the average of FAR and FRR over the decision threshold \theta:

minHTER = \min_{\theta} \frac{FAR(\theta) + FRR(\theta)}{2}    (52)

In order to measure and prove the efficiency of our proposed algorithm, we trained and tested, using the same data set, four classical face recognition algorithms (a minimal baseline sketch is given after the list):

Gabor – PCA and the nearest neighbour classifier;

Gabor – LDA and the nearest neighbour classifier;

Gabor – KFA and the nearest neighbour classifier;

Gabor – KPCA and the nearest neighbour classifier.
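The sketch below illustrates the first of these baselines (PCA followed by a 1-nearest-neighbour classifier) with scikit-learn; the Gabor filtering stage is assumed to have been applied beforehand, and the number of principal components is an arbitrary choice of ours.

from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

def pca_nearest_neighbour(X_train, y_train, X_test, n_components=100):
    # X_train / X_test: Gabor-filtered face images flattened to 1-D row vectors;
    # y_train: the corresponding subject identities.
    model = make_pipeline(PCA(n_components=n_components),
                          KNeighborsClassifier(n_neighbors=1))
    model.fit(X_train, y_train)
    return model.predict(X_test)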

The experimental results are presented in Table 2 to Table 5.

Bimodal Speaker Verification Experiments – Early Fusion

CUAVE and VidTIMIT Datasets

For the multimodal experiments we evaluate our models using the CUAVE and VidTIMIT datasets. The CUAVE database is a speaker-independent database of isolated and connected digits, with high-quality video and audio of a representative group of speakers. The database consists of two major sections, one with individual speakers and the second with speaker pairs. For our experiments we use the section with individual speakers. The database has 36 distinct speakers, chosen so as to have an even representation of male and female speakers with different skin tones and accents.

Figure 115 – CUAVE Speaker Sample

In each recording the speaker was framed to include the shoulders and head, and was asked to speak several digits in different styles. A recording has 4 parts:

first, the speaker stands naturally still and pronounces 50 isolated digits;

for the second part, each individual speaker was asked to move side-to-side, back-and-forth, or tilt the head while speaking 30 isolated digits;

the third part includes 20 isolated digits spoken in both profile views;

in the fourth part the speaker faces the camera and pronounces 60 telephone-number-like sequences (30 while standing still and 30 while moving).

Figure 116 – VidTIMIT Speaker Sample

The VidTIMIT database contains 43 volunteers (19 female and 24 male) reciting short sentences. The dataset was recorded in 3 sessions, with a mean delay of 7 days between Sessions 1 and 2, and 6 days between Sessions 2 and 3. This delay allows for changes in the voice, hair style, make-up, clothing and mood / pronunciation of the speakers. During the recording sessions, each speaker was asked to pronounce ten sentences, chosen from the test section of the NTIMIT corpus as follows:

Table 6 – Examples Sentences Used in the VidTIMIT Database

The first two sentences (Sa1 and Sa2) are the same for all persons, while the remaining eight differ for each person. The average duration of each sentence is 4.25 seconds, approximately 106 video frames at 25 fps.

Multimodal CNN Architecture

Our approach is an Early Fusion (Feature Fusion) learning-based system. First, we train two models, one for voice and a second one for face. The face and audio models are based on the presented Autoencoder and CNN architectures. After the models are trained, we use them to extract the face and audio cues, which are fused at the feature-extraction level into a large feature vector. This fused feature vector is then used to train the final classification network (Figure 117).

Figure 117 – Multimodal CNN Architecture

As audio features, we use the outputs of the last Fully-Connected layer of the CNN model (Figure 88). According to the proposed network architecture, this layer consists of 1024 neurons, so the audio feature vector is a 1-D vector of size 1024. For the face feature vector, we use the output of the first layer of the Stacked Autoencoder neural model, resulting in a 1-D feature vector with 200 elements (the number of neurons). The resulting concatenated fused feature vector has a total of 1224 elements. This vector is used to train the last stage of our multimodal CNN recognition system.
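The feature-level fusion itself is a straightforward concatenation of the two vectors; the helper below (the function name is ours) makes the 1224-element layout explicit.

import numpy as np

def fuse_features(audio_features, face_features):
    # audio_features: 1024-D vector from the CNN's last fully-connected layer.
    # face_features: 200-D code from the first layer of the stacked autoencoder.
    assert audio_features.shape[-1] == 1024 and face_features.shape[-1] == 200
    return np.concatenate([audio_features, face_features], axis=-1)  # 1224 elements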

The last stage of our Multimodal CNN recognition system consists of a 2-layer Stacked Autoencoder network with the following configuration parameters (established during the final experiments): 400 hidden neurons for the Autoencoder, a Sparsity Proportion of 0.075 – 0.125, an L2 Weight Regularization factor of 0.004, and a SoftMax neural network as the second (and final) layer.
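Structurally, this final stage can be pictured as the 400-unit encoder of the trained autoencoder followed by a SoftMax output over the enrolled speakers. The PyTorch layout below is a simplified illustration; the number of speakers is set for the CUAVE individual-speaker section, and in practice the SoftMax layer would be trained with a cross-entropy loss on the logits.

import torch.nn as nn

N_SPEAKERS = 36   # CUAVE individual-speaker section

final_classifier = nn.Sequential(
    nn.Linear(1224, 400),        # encoder half of the stacked autoencoder
    nn.Sigmoid(),
    nn.Linear(400, N_SPEAKERS),  # SoftMax classification layer
    nn.Softmax(dim=-1),
)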

In order to measure the recognition performance of the Multimodal CNN, we use two commonly used values: the standard error rate and the recognition rate. For the verification experiments, the false acceptance rate (FAR), the false rejection rate (FRR) and the half total error rate (HTER) are used to measure the performance. The FAR, FRR and HTER are defined as follows:

FAR = \frac{N_{IA}}{N_{I}} \times 100\%    (53)

FRR = \frac{N_{LR}}{N_{L}} \times 100\%    (54)

HTER = \frac{FAR + FRR}{2}    (55)

where N_{LR} denotes the number of rejected legitimate identity claims, N_{L} the total number of legitimate identity claims, N_{IA} the number of accepted illegitimate identity claims and N_{I} the total number of illegitimate identity claims.

Both FAR and FRR values depend on the decision threshold applied on the output of the final SoftMax neural network. Selecting a threshold that ensures a small value of the FAR will result in a large value of the FRR and vice versa.
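This trade-off can be made explicit with a small sweep over the SoftMax scores; the helper below (names and threshold grid are ours) computes FAR and FRR for each threshold and returns the minimal HTER.

import numpy as np

def far_frr_curves(scores, is_legitimate, thresholds=np.linspace(0.0, 1.0, 1001)):
    # scores: SoftMax output for each verification claim;
    # is_legitimate: True for genuine claims, False for impostor claims.
    scores = np.asarray(scores)
    legit = np.asarray(is_legitimate, dtype=bool)
    far, frr = [], []
    for t in thresholds:
        accepted = scores >= t
        far.append(np.mean(accepted[~legit]))   # accepted impostor claims
        frr.append(np.mean(~accepted[legit]))   # rejected legitimate claims
    return np.array(far), np.array(frr)

def minimal_hter(far, frr):
    hter = (far + frr) / 2
    return hter.min(), int(hter.argmin())       # value and index of the best threshold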

CUAVE Multimodal Experiments

For the CUAVE dataset we use the first 70 seconds of the recordings for training the Neural Models and the final 30 seconds for testing the full algorithm.

The first step is the training of the Face and Voice Neural Models described in this thesis (see Chapter VI). From the video files we extract the MFCC features using a processing window of 500 ms with an overlap of 430 ms. The MFCCs are calculated using a 25 ms window and a 5 ms frame shift. For each frame we extract the first 36 MFCCs.
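An equivalent MFCC extraction for one 500 ms processing window can be written with librosa; the 25 ms window, 5 ms frame shift and 36 coefficients come from the text, while the 16 kHz sampling rate is an assumption.

import librosa

def extract_mfcc(segment, sr=16000, n_mfcc=36, win_ms=25, hop_ms=5):
    # Returns one row of 36 MFCCs per 25 ms frame of the 500 ms segment.
    mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(sr * win_ms / 1000),
                                hop_length=int(sr * hop_ms / 1000))
    return mfcc.T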

For each position of the processing window, we also extract the associated face, in order to process it according to the proposed face model algorithm (Chapter VI – Section 1).

After training the Voice CNN model (using the architecture presented in Figure 88), the resulting model offers a 99.9% recognition rate on the training data and 91.4% accuracy on the testing data.

Figure 118 – Voice Model ROC Curve for Testing Data

The training experiment for the Stacked Autoencoder Face model (Figure 83) yields a 96.7% recognition rate when the training data is used and an 85.11% recognition rate on the test data.

Figure 119 – Face Model ROC Curve for Testing Data

The final classifier used by our Multimodal CNN Architecture consists of an Autoencoder stacked on top of a SoftMax neural network. The training of the final classifier (Figure 120) is performed using the fused “super-feature” vectors obtained from the trained Voice and Face Speaker Models.

Figure 120 – Final Classifier Architecture

The architecture of the Stacked Autoencoder classifier mainly depends on two configuration parameters: the Autoencoder Hidden Layer size (code size) and the Sparsity Proportion. These two parameters were identified experimentally: we tested different combinations of the Sparsity Proportion (0.010, 0.050, 0.075, 0.100 and 0.125) and Hidden Layer sizes (160, 200, 250, 300, 350 and 400 neurons).
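The search reduces to an exhaustive grid over these two parameter lists; in the sketch below, train_and_evaluate is a placeholder for training the final classifier on the fused features and returning its test error rate.

import itertools

SPARSITY_VALUES = [0.010, 0.050, 0.075, 0.100, 0.125]
HIDDEN_SIZES = [160, 200, 250, 300, 350, 400]

def grid_search(train_and_evaluate):
    # Evaluate every (sparsity, hidden size) pair and keep the lowest error rate.
    results = {}
    for sparsity, hidden in itertools.product(SPARSITY_VALUES, HIDDEN_SIZES):
        results[(sparsity, hidden)] = train_and_evaluate(sparsity, hidden)
    best = min(results, key=results.get)
    return best, results[best]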

Figure 121 – Multimodal CNN Error Rate for Different Configuration Parameters

After training the last classifier, the smallest recognition error rate (see Figure 121) for the complete system was obtained when the number of hidden neurons of the Autoencoder is set to 400 and the Sparsity Proportion parameter to 0.125. In this case we obtained a recognition error rate of 1.55%, which gives an accuracy (ROR) of 98.45%. The ROC curve for the multimodal system is depicted in Figure 122 on a logarithmic scale.

Figure 122 – Multimodal CNN ROC Curve (logarithmic scale plot)

During the verification / authentication experiments, the multimodal system presented the following performance characteristics:

Equal Error Rate: 0.19%

Minimal Half Total Error Rate: 0.19%

False Rejection Error Rate at the minimal Half Total Error Rate: 0.09%

False Acceptance Error Rate at the minimal Half Total Error Rate: 0.29%

Verification Rate at 1% FAR: 99.97%

Verification Rate at 0.1% FAR: 99.40%

Verification Rate at 0.01% FAR: 97.05%

VidTIMIT Multimodal Experiments

As we mentioned earlier, the VidTIMIT dataset consists of 43 speakers, with 10 recordings per speaker. In the case of the VidTIMIT dataset experiment, the Face and Voice Neural Models are trained using the first 9 recordings (in total 20 seconds of speech) and the validation is done using the last recording file. The same segmentation algorithm and parameters (frame size, frame shift) are used during the neural network models training.

In the case of the VidTIMIT dataset, the Voice CNN model offered a recognition rate of 66.23%, while the Face Stacked Autoencoder model obtained a 91.83% recognition rate. The low recognition rate of the Voice CNN model is due to the small number of training samples used (approximately 18 seconds of speech). Figure 123 and Figure 124 show the ROC curves generated during the validation experiments of the Voice and Face neural models.

Figure 123 – ROC curve for Voice Model

Figure 124 – ROC curve for Face Model

In the case of the complete system, the recognition rate varies between 84% and 96.91%, depending on the final configuration parameters (Figure 125). The highest recognition rate is obtained when the final classifier (Stacked Autoencoder) has 400 neurons and a Sparsity Proportion of 0.075. For this configuration, the multimodal system offers an accuracy of 96.91%, an improvement of 5.08% over the best unimodal model.

Figure 125 – Multimodal CNN Error Rate for Different Configuration Parameters

When combining the two feature types (voice and face), the resulting Multimodal Recognition System at the “optimal” configuration (400 hidden neurons, a Sparsity Proportion of 0.075 and an L2 Weight Regularization factor of 0.004) is quite robust, and the recognition performance improves compared to that of the individual classifiers, as illustrated by the ROC curve plot (Figure 126).

Figure 126 – ROC curve for Multimodal CNN (logarithmic scale plot)

The verification / authentication experiments carried out on the test samples generated the following results for the Multimodal CNN architecture:

Equal Error Rate: 0.41%

Minimal Half Total Error Rate: 0.40%

False Rejection Error Rate at the minimal Half Total Error Rate: 0.34%

False Acceptance Error Rate at the minimal Half Total Error Rate: 0.47%

Verification Rate at 1% FAR: 99.86%

Verification Rate at 0.1% FAR: 97.74%

Verification Rate at 0.01% FAR: 89.16%

Conclusions

This thesis explored the issues involved in multimodal speaker recognition system design. In order to tackle the complex task of speaker recognition, we proposed a feature based multimodal recognition architecture that combines CNN and Autoencoders.

In this thesis we used only fusion at the feature level; we chose this approach in order not to discard the data correlation that exists between the speaker's face and voice models.

The proposed approach aims to learn high-level features from audio and faces, in which the audio features learned by Convolutional Neural Networks are concatenated with compatible face features extracted by Autoencoders. The experiments conducted in this final chapter have shown that, by fusing face and audio cues at the feature level, the proposed multimodal architecture offers higher accuracy than single-mode approaches. Table 7 gives a short summary of the obtained recognition rates for both datasets.

Table 7 – Summary of Recognition Accuracy

By fusing the multimodal features at a high level, the correlation between face and speech features is better exploited, which offers a much more discriminative representation for the speaker recognition system and increases the recognition rate. For both datasets (CUAVE and VidTIMIT), the overall performance increased by at least 5%.

If we compare the Audio CNN model recognition rates across the datasets used, we can see that the VidTIMIT database yields the lowest value (approximately 80%). This is caused by the small number of samples used during training: for the VidTIMIT dataset we trained the CNN model with approximately 18 seconds of speech. A CNN model becomes more accurate as the training set becomes larger and more diverse.

In the case of the verification / authentication experiments, our Multimodal CNN architecture offers very high verification rates, as depicted in the table below.

Table 8 – Summary of Multimodal CNN Verification Rate

If we set the operating point at 1% FAR, we obtain for both systems a verification rate above 99%.

The CNN model was chosen over the SOM approach for our multimodal recognition system due to the following factors:

Both methods have roughly the same recognition rate;

The CNN accuracy can be improved by simple filtering of oversampled inputs;

The CNN is easily extended to accommodate a higher number of speakers;

The high-level features needed for the multimodal system are easily extracted from the CNN;

The high-level features are independent of the number of speakers (in the case of the CNN).

The experiments conducted on the CNN voice model in Chapter VII – Section 1.2 also showed that the proposed CNN architecture, when trained with sufficient data, offers high accuracy. The recognition rate can easily be improved if we oversample the audio data and filter the network outputs.

We also tested the noise robustness of the CNN model by altering the input data with different noise sources and SNR values. For an SNR above 25 dB, the model offered a recognition rate between 80% and 100% (depending on the noise source and speech conditions).

Because the CNN is designed to use long speech frames (500 ms), our Voice model tends to learn high-level speech features. Our experiments showed that a good training sample should contain at least 70 seconds of speech, in order to be as varied as possible.

Conclusions and Personal Contribution
