Mono- and Multilingual Estimation of Parkinson's Disease Severity Using
Voiced Ratio and Nonlinear Parameters
Dávid Sztahó a,b,*, Juan Rafael Orozco-Arroyave c, Klára Vicsi a
a Budapest University of Technology and Economics, Budapest, Hungary
b Hungarian Academy of Sciences, Budapest, Hungary
c Universidad de Antioquia, Medellín, Colombia
Abstract
A Parkinson's disease severity estimation analysis was carried out using speech databases of Spanish and
Hungarian speakers separately. Correlations were measured between acoustic features and the
UPDRS severity score. The applied acoustic features were the following: voicing ratio (VR); nonlinear
recurrence, expressed as the normalized recurrence probability density entropy ($H_{norm}$); and fractal
scaling, expressed as the scaling exponent ($\alpha$). High diversity was found according to text and
gender, therefore multiple experiments were carried out taking these phenomena into consideration. Based
on the results of the correlation calculations, the UPDRS values were predicted by regression using neural
networks. The results showed that the applied features are capable of estimating the severity of PD. By
assigning the mean predicted UPDRS to each speaker using the best correlated linguistic contents, the
result of the Interspeech 2015 Sub-challenge winner was exceeded in the case of the Spanish database. By
training NN models separately for male and female speakers, the accuracy was further increased.
Cross-lingual tests were also performed, with lower but promising performance.
Keywords: Parkinson, speech, regression
1. Introduction
Parkinson's disease (PD) is one of the most common neurodegenerative diseases. Its prevalence
rate is about 20/100 000 Rajput et al. (2007). The main cause of the disease is the incorrect behaviour of
the dopamine system in the substantia nigra: 70-80% of the neurons that produce dopamine are damaged or
dead. Dopamine is a neurotransmitter that is responsible (among other things) for the stimuli sent from
the brain to the muscles (enabling smooth movement regulation). A low level of dopamine results in a loss
of stimuli, causing the main symptoms of the disease: resting tremor, muscle rigidity, slowness and
cognitive impairment Ho et al. (1999).
Speech may also be an important factor in predicting PD Little et al. (2009), Sapir et al. (2010),
Arias-Vergara et al. (2016), Orozco-Arroyave et al. (2016b). Most PD patients experience some kind of speech
impairment, therefore it may be an early sign Harel et al. (2004). Patients report reduced loudness, increased
vocal tremor, breathiness and reduced air during speech production. The two vocal impairments linked to PD
are dysphonia (inability to produce normal vocal sounds) and dysarthria (difficulty in pronouncing words)
Baken and Orlikoff (2000).
While there are other factors that contribute to the development of the disease (medical, genetic), age
plays the most important role. Therefore, in a continuously aging population, early detection of the disease
is very important.
* Corresponding author
Email addresses: [anonimizat] (Dávid Sztahó), [anonimizat] (Juan Rafael Orozco-Arroyave),
[anonimizat] (Klára Vicsi)
Preprint submitted to Elsevier March 8, 2017

Because of the importance of early diagnosis, researchers are motivated to develop diagnostic support
tools in order to identify PD. Various speech features have been applied in order to obtain relevant
information about PD and to facilitate the separation of speech samples of PD patients from those of
healthy controls.
Decision support tools discriminate PD patients from healthy controls by means of automatic
classification methods. Many classifiers are available with different discriminative strength. Most
commonly statistical classifiers are used Tsanas et al. (2012), Sakar et al. (2013), Alemami and Almazaydeh
(2014), Frid et al. (2014), such as Bayesian classifiers, support vector machines Orozco-Arroyave et al.
(2016a), (deep) neural networks, Gaussian mixture models, random forests and k-nearest neighbors. In Tsanas
et al. (2012) binary classification was performed on sustained vowels only, for two classes: PD and healthy
control. In another study Khan et al. (2014) a highest accuracy of about 85% was achieved with speech
intelligibility features from running speech using three classes: healthy, mild and severe.
In the present paper, instead of categorizing patients into two groups (healthy, ill), the severity of
PD is estimated on a continuous scale. Two databases are used: a Spanish speech sample collection
(PC-GITA Orozco-Arroyave et al. (2014)) and a Hungarian database of PD patients.
Due to tremor and muscle movement disorders, patients with PD show difficulty in producing fast
repetitions of voiced and unvoiced speech segments. They cannot stop voicing, or have difficulty doing so.
This suggests that the main acoustic features that characterize the syndrome are related to parameters
that measure voicing quality, for example in the close vicinity of unvoiced plosives preceded or followed
by vowels (VCV connections, where the consonant is an unvoiced sound). This phenomenon has been studied,
but the nonlinearities examined in this paper have not been analyzed so far Orozco-Arroyave et al. (2016a).
Vocal production has nonlinear dynamics with certain randomness, and any changes in muscles and
nerves will affect both the stochastic and deterministic components of the system. Rahn and colleagues
have shown that aperiodic segments, which are perceived as hoarse or breathy phonation, have an elevated
incidence and are more prevalent in PD subjects Rahn et al. (2007). Nonlinear recurrence and fractal scaling
Little et al. (2007) are two properties that have been successfully applied to voice disorder detection
(differentiation between healthy and pathological speech).
Based on these two works we hypothesized that nonlinear recurrence and fractal scaling features may
also correlate with the severity of Parkinson's disease (UPDRS score). Thus in our
work we investigate how nonlinear recurrence and fractal scaling parameters, together with the ratio of
voiced and unvoiced speech segments, contribute to the severity prediction of PD.
In Section 2 the applied databases are described. In Section 3 the acoustic features are detailed.
In the following Sections, first the correlations between the measurements of each feature and the UPDRS
values are presented separately, then regression is performed to predict PD severity (UPDRS) using neural
networks. In Section 6 cross-lingual regression possibilities are investigated. In Section 7 the results are
discussed, and in Section 8 conclusions and plans for the future are given.
2. Database
2.1. PC-GITA
The Spanish database of the speech of patients with Parkinson's disease was created by the GITA research
group at Universidad de Antioquia, Medellín, Colombia Orozco-Arroyave et al. (2014). The database was
used as the common database for the Interspeech 2015 Special Session: PC Sub-challenge Schuller et al. (2015).
It contains speech in Spanish from 50 people (25 male, 25 female) suffering from PD. The age of the
male speakers ranges from 33 to 77 (mean 62.2 ± 11.2), the age of the female speakers ranges from
44 to 75 (mean 60.1 ± 7.8). The data comprises a total of 42 speech tasks per speaker, including 24
isolated words, 10 sentences, one read text, one monologue, and the rapid repetition of syllables. The
total duration of the recordings is 1.4 hours. The neurological state of the patients was evaluated by an
expert neurologist according to the unified Parkinson's disease rating scale (motor subscale): UPDRS-III.
The values of the neurological evaluations of the patients range from 5 to 92. The audio files
are divided into training and development sets. An additional test partition with labels unknown to the
participants comprises a further eleven subjects. In the present study the training and development sets of
the database are used because of their additional meta-information (speaker, linguistic content); the test
dataset was omitted because this meta-information is missing.
A more detailed description of the database can be found in Orozco-Arroyave et al. (2014).
2.2. Hungarian Parkinson Speech Database (HPSDb)
The other database on which the results of this paper are based is a Hungarian speech database of PD
patients. The samples were recorded in two health institutions under the supervision of head neurologists
(Annamária Takáts at Semmelweis University and István Valálik at Virányos Clinic). The recordings took place
in doctors' offices in a quiet environment using microport microphones at 44 kHz sampling frequency
and 16 bit quantization. 21 speech tasks were recorded from each speaker: sustained vowels, sustained
vowels with pitch change, 6 fast repetitions of syllables, 7 isolated words, 4 sentences with emphasized parts,
one read text (the tale entitled "The north wind and the sun") and a short monologue. The database contains
speech samples from 25 male speakers (age: 62 ± 12) and 21 female speakers (age: 65 ± 9). The neurological
state of the patients was also evaluated by neurologists according to the UPDRS scale. 35 speakers had electrode
implants that realize bilateral subthalamic stimulation in order to improve the patients' condition. Speakers
with implants were recorded in two stages: all speech tasks were recorded with the implant both ON and OFF.
Only recordings with the implant OFF were used in the study.
3. Methods
Patients suffering from Parkinson's disease have disordered muscle movements and are therefore presumed
to have trouble starting and stopping vocal cord vibrations. This may be reflected at the starts and stops of
voiced speech segments. At these parts, such as VCV connections where unvoiced consonants are present,
the unvoiced part becomes voiced due to the incorrect constant voicing. This can be measured with the
ratio of the voiced parts of the speech. This ratio is highly dependent on linguistic content and text: in a
text in which voiced and unvoiced sounds alternate frequently, the ratio differs more between the
speech of PD patients and healthy controls.
In order to calculate the voiced ratio (VR) the following equation was used:
$$VR = \frac{\sum_{i=1}^{n_v} d_v^i}{d_s} \qquad (1)$$
where $d_v^i$ is the duration of voiced segment $i$, $n_v$ is the number of voiced segments, and $d_s$ is
the total duration of the speech in the sample.
To calculate VR, the speech samples were first segmented into speech and non-speech parts using
MAUS Beringer and Schiel (2000), then the voiced parts were determined using Voicebox for Matlab Brookes
(2016). Fundamental frequency was calculated with a 10 ms time step and a 40 ms window length.
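As an illustration of equation (1), the following minimal Python sketch computes VR from a list of voiced segments. The `(start, end)` segment format, the function name `voiced_ratio` and the toy numbers are assumptions made for this example; in the study the segmentation came from MAUS and Voicebox, which are not reproduced here.

```python
def voiced_ratio(voiced_segments, speech_duration):
    """Return VR = (sum of voiced segment durations) / (total speech duration).

    voiced_segments: list of (start, end) times in seconds (hypothetical format).
    speech_duration: total duration of speech in the sample, in seconds.
    """
    if speech_duration <= 0:
        raise ValueError("speech_duration must be positive")
    voiced_total = sum(end - start for start, end in voiced_segments)
    return voiced_total / speech_duration


# Toy example (made-up values): three voiced stretches inside 2.0 s of speech.
segments = [(0.10, 0.50), (0.70, 1.20), (1.40, 1.90)]
vr = voiced_ratio(segments, 2.0)
```

If unvoiced consonants in VCV connections become voiced, the voiced segments merge and lengthen, so the value returned for the same text would be expected to rise.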
Traditional voice quality measures, such as jitter and shimmer, capture the quality of voice production.
However, voice production combines deterministic and stochastic elements. The
deterministic element is attributable to the nonlinear movement of the vocal fold and to the bulk of air in
the vocal fold, whereas the stochastic components are the high-frequency aeroacoustic pressure fluctuations
caused by vortex shedding at the top of the vocal folds, whose frequency and intensity are modulated
by the bulk air movement Little et al. (2007). Time-delay embedding is a measurement for recurrence that
captures this non-linearity. For a detailed description of the concept of recurrence, see Little et al. (2007).
Nonlinear recurrence ($H_{norm}$) was calculated using the Matlab tool of Max Little Little et al. (2007). The
recurrence time probability density was calculated by
$$P(T) = \frac{R(T)}{\sum_{i=1}^{T_{max}} R(i)}, \qquad (2)$$
where $R(T)$ are the recurrence times and $T_{max}$ is the maximum recurrence time found in the embedded
time-state space. The normalized recurrence probability density entropy (RPDE) scale is computed by
$$H_{norm} = -\frac{\sum_{i=1}^{T_{max}} P(i) \ln P(i)}{\ln T_{max}} \qquad (3)$$

Table 1: Spearman correlations between features and UPDRS scores on the total PC-GITA and HPSDb datasets.
Dataset    Variable pairs         Spearman
PC-GITA    UPDRS – VR             .215
           UPDRS – $\alpha$       .260
           UPDRS – $H_{norm}$     -.061
HPSDb      UPDRS – VR             .127
           UPDRS – $\alpha$       .073
           UPDRS – $H_{norm}$     -.112
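The normalization in equations (2) and (3) can be sketched in a few lines of Python. This is only an illustration under a strong assumption: the recurrence-time histogram $R(T)$ is taken as given (a list whose entry $T-1$ holds the count for recurrence time $T$), so the time-delay embedding step of the Matlab tool used in the study is not reproduced. The function names are hypothetical.

```python
import math


def recurrence_density(recurrence_counts):
    """P(T) = R(T) / sum_i R(i) for T = 1..T_max (equation 2)."""
    total = sum(recurrence_counts)
    if total == 0:
        raise ValueError("no recurrences found")
    return [r / total for r in recurrence_counts]


def normalized_rpde(recurrence_counts):
    """H_norm = -sum_i P(i) ln P(i) / ln T_max (equation 3)."""
    t_max = len(recurrence_counts)
    if t_max < 2:
        raise ValueError("need at least two recurrence-time bins")
    density = recurrence_density(recurrence_counts)
    # Bins with P(i) = 0 contribute nothing to the entropy.
    entropy = -sum(p * math.log(p) for p in density if p > 0)
    return entropy / math.log(t_max)


# Toy histograms (made-up counts, T_max = 4).
periodic = normalized_rpde([0, 0, 10, 0])  # all recurrences at a single period
uniform = normalized_rpde([5, 5, 5, 5])    # recurrence times evenly spread
```

The two toy cases mark the ends of the scale: a perfectly periodic signal concentrates all recurrences in one bin and gives zero entropy, while evenly spread recurrence times give the maximum value of one.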
Another aspect is the increased breath noise in disordered speech. A practical, robust approach to
measure this phenomenon is detrended fluctuation analysis (DFA) Little et al. (2007). The DFA scaling
exponent ($\alpha$) was calculated as described in Little et al. (2006). First, the time series $s_n$ is
integrated:
$$y_n = \sum_{j=1}^{n} s_j, \qquad (4)$$
for $n = 1, 2, \ldots, N$, where $N$ is the number of samples in the signal. Then, $y_n$ is divided into windows
of length $L$ samples. A least-squares straight-line local trend is calculated by analytically minimizing the
squared error $E^2$ over the slope and intercept parameters $a$ and $b$:
$$\underset{a,b}{\arg\min}\; E^2 = \sum_{n=1}^{L} (y_n - a n - b)^2 \qquad (5)$$
Next, the root-mean-square deviation from the trend, the fluctuation, is calculated over every window
at every time scale:
$$F(L) = \left[ \frac{1}{L} \sum_{n=1}^{L} (y_n - a n - b)^2 \right]^{1/2} \qquad (6)$$
This process is repeated over the whole signal at a range of different window sizes $L$, and a log-log
graph of $L$ against $F(L)$ is constructed. A straight line on this graph indicates self-similarity expressed as
$F(L) \propto L^{\alpha}$. The scaling exponent $\alpha$ is calculated as the slope of a straight line fit to
the log-log graph of $L$ against $F(L)$ using least-squares as above.
Both features, $H_{norm}$ and $\alpha$, were calculated on the voiced segments of the speech samples. Voicebox
was applied to determine the voiced parts.
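The DFA steps of equations (4) to (6) can be sketched in plain Python as follows. This is a simplified illustration, not the tool used in the study: the function names, the choice of non-overlapping windows, the pooling of the squared error over all windows of a given size, and the list of window sizes are all assumptions made for the example.

```python
import math
import random


def linear_fit(xs, ys):
    """Least-squares slope a and intercept b of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def fluctuation(y, window):
    """F(L): RMS deviation from the local linear trend, pooled over full windows."""
    ns = list(range(1, window + 1))
    squared_error = 0.0
    count = 0
    for start in range(0, len(y) - window + 1, window):
        segment = y[start:start + window]
        a, b = linear_fit(ns, segment)  # equation (5)
        squared_error += sum((v - a * n - b) ** 2 for n, v in zip(ns, segment))
        count += window
    return math.sqrt(squared_error / count)  # equation (6)


def dfa_alpha(signal, windows):
    """Scaling exponent: slope of log F(L) against log L."""
    y = []
    running = 0.0
    for s in signal:  # equation (4): cumulative sum
        running += s
        y.append(running)
    log_l = [math.log(w) for w in windows]
    log_f = [math.log(fluctuation(y, w)) for w in windows]
    alpha, _ = linear_fit(log_l, log_f)
    return alpha


# Synthetic test signal (not speech): seeded Gaussian white noise.
rng = random.Random(0)
white_noise = [rng.gauss(0.0, 1.0) for _ in range(4096)]
alpha_white = dfa_alpha(white_noise, [16, 32, 64, 128, 256])
```

For uncorrelated noise the exponent is expected to come out near 0.5, which gives a simple sanity check before applying such a routine to voiced speech segments.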
4. Correlation Measurements
The main task in the detection of Parkinson's disease is the estimation of its severity. According to the
subjective opinion of medical staff, disorders in speech may precede other muscle tremor symptoms, therefore
they can be a source of severity estimation in an early stage of the disease. The correlations between the
measured features and the UPDRS scores were calculated using SPSS Corp (2013).
The Spearman correlations are summarized in Table 1 for the two databases separately. The measured
correlations are relatively low. Only the DFA feature ($\alpha$) shows a slightly higher correlation (.260 on
PC-GITA), which is still considered low.
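For readers without SPSS, the Spearman coefficient used throughout this Section can be sketched as the Pearson correlation of the ranks. The code below is a standalone illustration with hypothetical function names and made-up toy values; it is not the procedure run in the study and it does not compute significance levels.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy data (made up): UPDRS scores and VR values of five hypothetical speakers.
updrs = [10, 25, 40, 55, 70]
vr_values = [0.50, 0.55, 0.53, 0.70, 0.80]
rho = spearman(updrs, vr_values)
```

Because only the ranks enter the calculation, the coefficient is insensitive to the compressed range of a feature, which is one reason a rank correlation suits clinical scores such as UPDRS.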
Because Parkinson's disease mainly influences muscle movement, those speech parts that have
the most variability in vocal cord loads may show higher correlation with the measured data. To see the
effect of the different linguistic contents on the correlation between the acoustic features and the UPDRS
values, each sample group with different linguistic content was examined separately. Correlations are shown
in Tables 2 and 3. Significant values are marked by * (p < 0.05) and ** (p < 0.01).
Figure 1: VR (top) and $\alpha$ (bottom) feature values as a function of UPDRS for male (left) and female
(right) samples of the word "flecha".
UPDRS values of samples of the PC-GITA database tend to correlate more with the individual acoustic
features. Table 4 shows the Spearman correlation values of the UPDRS – VR and UPDRS – $\alpha$ pairs
for the words that have the top six correlations (about 0.38 or higher in the undivided dataset). In the case
of the VR feature the top six correlated samples all contain VCV parts with an unvoiced consonant. This
suggests that VR may be able to capture whether unvoiced sounds become voiced due to PD. Table 4 also shows
the Spearman correlation values between UPDRS and $\alpha$. $\alpha$ shows higher correlation not only in
the above-mentioned VCV segments, but also where the whole sample is pronounced as voiced (not counting
unvoiced parts at the beginning or at the end), thus giving information from voiced segments.
The gender of the speakers may also be an influential factor in the correlation. Therefore, the samples
were divided by gender and the same correlations were calculated as in the previous case.
Tables 2, 3 and 4 also show the correlation values broken down by gender.
Feature VR has higher measured correlation in the case of the male speakers on PC-GITA, whereas it
has no significant value among the female samples. In contrast, $\alpha$ has significant values in both cases.
In the case of the HPSDb database lower correlation values are observed. In contrast to PC-GITA, the
$H_{norm}$ feature has significant correlation values.
The VR and $\alpha$ values are depicted as a function of UPDRS in Fig. 1 for the word "flecha"
uttered by male and female speakers, which shows high correlation for both features. The correlation is lower
in the case of female speakers.
The differences that can be observed according to textual content and gender suggest that the estimation
of the severity of Parkinson's disease should be performed separately for each gender and taking only
specific textual content into consideration.
5. Regression with Ensemble of Neural Networks
Estimation of the UPDRS values was carried out using feed-forward artificial neural networks (NN), each
with one hidden layer of 5 neurons. Based on the considerations of the previous Section, different test
setups were applied: NN models were created using (1) all samples altogether, (2) samples with different
Table 2: Spearman correlations of samples with different textual content for the undivided, male and female datasets of the
PC-GITA database. Linguistic groups are marked by the filename notation of the database. ('*': p < 0.05; '**': p < 0.01)
UPDRS – VR   UPDRS – $\alpha$   UPDRS – $H_{norm}$
undivided male female undivided male female undivided male female
,,apto" 0.236 0.275 0.034 0.426** 0.647** 0.299 -0.020 -0.072 0.171
,,atleta" 0.146 0.299 -0.093 0.379* 0.589** 0.249 -0.380 -0.354 -0.223
,,blusa" 0.273 0.443* -0.309 0.331 0.418 0.281 0.093 0.092 0.248
,,bodega" 0.023 0.078 -0.198 0.194 0.229 0.240 0.266 0.350 0.322
,,braso" 0.397* 0.555** 0.173 0.060 0.270 -0.310 0.219 0.246 -0.062
,,campana" 0.134 0.208 0.013 0.301* 0.358 0.301 -0.022 0.100 -0.160
,,caucho" 0.448** 0.576** 0.190 0.250 0.533* 0.177 -0.130 0.066 -0.353
,,clavo" -0.137 -0.394 0.220 0.380 0.424 0.302 -0.112 -0.051 -0.124
,,coco" 0.499** 0.610** 0.140 0.350* 0.243 0.652** -0.146 -0.388 0.288
,,crema" -0.032 -0.044 -0.024 0.428** 0.419 0.453 -0.131 -0.256 0.070
,,drama" 0.019 0.089 0.011 0.318* 0.476* 0.246 0.149 0.131 0.209
,,flecha" 0.510** 0.739** -0.206 0.490** 0.454* 0.305 -0.276 -0.435* 0.141
,,gato" 0.382* 0.622** 0.035 0.116 0.282 -0.015 0.049 -0.156 0.363
,,globo" 0.204 0.202 0.187 0.441** 0.442* 0.544* -0.153 -0.181 -0.126
,,grito" 0.284 0.451* 0.115 0.245 0.121 0.499* 0.096 0.252 -0.125
,,ka" 0.221 0.284 0.213 0.278 0.433* 0.172 -0.105 -0.155 0.213
,,llueve" -0.020 -0.003 -0.182 0.086 0.083 0.146 -0.048 -0.110 0.257
monologue 0.304* 0.530** -0.045 0.445** 0.389 0.566** 0.142 0.156 0.187
,,name" 0.237 0.612** -0.209 0.229 0.171 0.293 0.231 0.355 -0.012
,,pa" 0.397* 0.543* -0.029 0.323 0.248 0.326 -0.183 -0.195 -0.108
,,pato" 0.298 0.515* -0.096 0.299 0.440 0.433 0.035 0.214 -0.243
,,plato" 0.276 0.437* -0.036 0.197 0.439* -0.203 -0.063 -0.017 -0.108
,,presa" 0.164 0.299 0.001 0.119 0.290 0.016 0.037 -0.068 0.256
readtext 0.290* 0.651** -0.264 0.330* 0.451* 0.241 -0.146 -0.092 -0.140
,,reina" 0.242 0.309 0.102 0.322* 0.261 0.587** -0.002 -0.028 0.033
,,trato" 0.214 0.254 0.017 0.152 0.608** -0.299 0.103 -0.010 0.327
,,viaje" 0.196 0.269 0.122 0.355* 0.472* 0.212 -0.296 -0.320 -0.355
sentence1 0.287 0.539** -0.238 0.127 0.227 0.213 -0.063 0.123 -0.213
sentence2 0.286 0.568** -0.258 0.331* 0.220 0.438 -0.066 0.131 -0.235
sentence3 0.348* 0.622** 0.001 0.383* 0.376 0.453* 0.029 0.244 -0.200
sentence4 0.234 0.519* -0.281 0.272 0.243 0.457* -0.066 -0.004 -0.114
sentence5 0.386** 0.622** -0.016 0.403** 0.412 0.360 -0.074 0.048 -0.148
sentence6 0.265 0.487* -0.215 0.401** 0.425* 0.518* -0.156 -0.089 -0.173
sentence7 0.226 0.452* -0.081 0.425** 0.508* 0.550** -0.158 -0.154 -0.107
sentence8 0.160 0.388 -0.320 0.231 0.357 0.284 -0.232 -0.201 -0.241
sentence9 0.359* 0.545** -0.149 0.225 0.255 0.375 -0.092 -0.089 -0.064
sentence10 0.332* 0.589** -0.124 0.101 0.008 0.291 -0.030 0.147 -0.154
Table 3: Spearman correlations of samples with different textual content for the undivided, male and female datasets of the
HPSDb database. Linguistic groups are marked by the filename notation of the database. ('*': p < 0.05; '**': p < 0.01)
UPDRS – VR   UPDRS – $\alpha$   UPDRS – $H_{norm}$
undivided male female undivided male female undivided male female
sust. vowel – – – .092 .105 .246 .063 .283 -.011
sust. vowel pitch change – – – .318 .247 .361 .042 -.012 -.120
,,kakaka" .243 .338 .211 -.027 -.146 .350 .066 -.164 .331
,,pakata" .106 -.097 .158 -.027 -.015 .416 -.275 -.250 -.168
,,papapa" .198 .009 .274 -.098 -.082 .276 -.208 -.183 -.165
,,pataka" .195 -.034 .283 -.017 .171 .305 -.231 -.063 -.203
,,petaka" .215 .028 .335 -.149 -.262 .434 -.445** -.392 -.388
,,tatata" .182 .033 .306 -.085 -.225 .425 -.095 -.229 .158
,,agárral" .476* .530** .376* .214 .072 .385* -.325** -.374* -.211
,,baba" .104* .094 .052 -.013 -.040 .133 -.289* -.466** -.244
,,cicácska" .347** .093 .435** -.015 .030 .320 -.325** .234 -.378*
,,lópatkó" .173 .128 .045 .016 .041 .350* -.426** -.568** -.383*
,,megles" .435** .574** .269 .092 -.012 .582* -.275* -.165 -.327*
,,ropogós" .143 .188 .004 -.037 -.071 .272 -.233 -.279 -.211
,,trombita" .414** .233 .495** .012 .099 .274 -.359** -.296 -.410*
sentence1 .409* .287 .533* .008 .103 .207 -.373* -.443 -.324
sentence2 .369* .092 .696** -.018 .090 .140 -.477** -.541* -.326
sentence3 .155 .092 .106 .051 .103 .409 -.499** -.483 -.527*
sentence4 .391* .238 .695** .191 .328 .382 -.581** -.549* -.552*
read text .338 .355 .493 -.132 -.179 .128 -.481** -.566* -.330
monologue .399* .473* .381 -.074 .088 .488* -.495** -.625** -.465*
Table 4: Spearman correlations between UPDRS – VR and UPDRS – $\alpha$ for the words with the top six correlation values
among the different textual contents, for the undivided, male and female datasets.
UPDRS – VR                                      UPDRS – $\alpha$
text        undivided  male     female          text        undivided  male     female
,,flecha"   0.510**    0.739**  -0.206          ,,flecha"   0.490**    0.454*   0.305
,,coco"     0.499**    0.610**  0.140           ,,globo"    0.441**    0.442*   0.544*
,,caucho"   0.448**    0.576**  0.190           ,,crema"    0.428**    0.419    0.453
,,braso"    0.397*     0.555**  0.173           ,,apto"     0.426**    0.647**  0.299
,,papapa"   0.397*     0.543*   -0.029          ,,clavo"    0.380      0.424    0.302
,,gato"     0.382*     0.622**  0.035           ,,atleta"   0.379*     0.589**  0.249
Table 5: Prediction errors between the original and the predicted UPDRS values of the PC-GITA. NN models were created
using (1) all samples altogether, (2) samples with different textual content separately and (3) samples with separate gender
and textual content. Baseline RMSE values are the standard deviations of the original UPDRS values.
Predicted UPDRS calculation        Method     RMSE    Spearman  Pearson
predicted UPDRS per utterance      Baseline   17.39   –         –
                                   (1)        16.33   .303      .362
                                   (2)        15.16   .420      .498
                                   (3)        14.14   .540      .586
mean predicted UPDRS per speaker   Baseline   17.39   –         –
                                   (1)        15.82   .574      .573
                                   (2)        13.53   .811      .760
                                   (3)        12.25   .846      .763
Table 6: Prediction errors between the original and the predicted UPDRS values of the HPSDb. NN models were created
using (1) all samples altogether, (2) samples with different textual content separately and (3) samples with separate gender
and textual content. Baseline RMSE values are the standard deviations of the original UPDRS values.
Predicted UPDRS calculation        Method     RMSE    Spearman  Pearson
predicted UPDRS per utterance      Baseline   12.30   –         –
                                   (1)        13.31   .232      .200
                                   (2)        12.99   .311      .312
                                   (3)        11.96   .530      .503
mean predicted UPDRS per speaker   Baseline   12.30   –         –
                                   (1)        13.24   .227      .269
                                   (2)        12.21   .556      .586
                                   (3)        10.50   .889      .816
linguistic content separately and (3) samples with separate gender and linguistic content. In each case 70%
of the speech samples were used for training the NN models and the remaining 30% for testing. This
database partitioning was identical to the one used at the Interspeech (IC) 2015 PC Sub-challenge
on the PC-GITA database (called training and development sets) Schuller et al. (2015). The databases were
used separately (cross-lingual trials are described in Section 6). The NN was implemented in
SPSS. In cases (2) and (3) multiple models were trained and applied during prediction (the corresponding
model for each speech sample). For each case the total prediction RMSE and the correlation between the
original and predicted UPDRS values were calculated (Table 5 and Table 6). Besides the Spearman correlation,
Pearson's r value is also reported.
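The relation between the prediction RMSE and the baseline used in the tables can be sketched as follows. The sketch assumes that the baseline is the population standard deviation of the original UPDRS values, which equals the RMSE of always predicting their mean; whether the study used the population or the sample standard deviation is not stated. Function names and numbers are hypothetical.

```python
import math


def rmse(targets, predictions):
    """Root-mean-square error between original and predicted values."""
    squared = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    return math.sqrt(squared / len(targets))


def baseline_rmse(targets):
    """RMSE of always predicting the mean (population standard deviation)."""
    mean = sum(targets) / len(targets)
    return rmse(targets, [mean] * len(targets))


# Toy values (made up): original and predicted UPDRS of four test utterances.
original = [10, 20, 30, 40]
predicted = [14, 18, 33, 36]
model_error = rmse(original, predicted)
baseline_error = baseline_rmse(original)
```

A model is only informative in this sense when its RMSE falls below the baseline value, which is how the RMSE columns of Tables 5 and 6 can be read.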
Because each sample group with different linguistic content has samples from all the speakers, it is possible
to create a single UPDRS prediction for each speaker by averaging the NN outcomes. Applying this
method, we get an ensemble containing multiple NNs. The outputs of the NNs are averaged for each
speaker, therefore it may be assumed that by combining predictors that perform (not necessarily
significantly, but independently) better than random, we get increased prediction performance. Tables 5
and 6 include the results with and without averaging the UPDRS values per speaker for the two databases.
Averaging resulted in increased correlations and lower RMSE, because combining the predictors reduces the
variance of the prediction error.
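The per-speaker averaging can be sketched as a simple grouping step. The speaker identifiers, the `(speaker_id, predicted_updrs)` input format and the values below are hypothetical; the NN models that would produce the per-utterance predictions are not included.

```python
from collections import defaultdict


def mean_prediction_per_speaker(utterance_predictions):
    """Average the per-utterance predictions that belong to the same speaker.

    utterance_predictions: iterable of (speaker_id, predicted_updrs) pairs,
    one pair per utterance (hypothetical input format).
    """
    grouped = defaultdict(list)
    for speaker_id, predicted in utterance_predictions:
        grouped[speaker_id].append(predicted)
    return {spk: sum(vals) / len(vals) for spk, vals in grouped.items()}


# Toy predictions (made up) from models trained on different linguistic contents.
predictions = [
    ("spk01", 30.0), ("spk01", 38.0), ("spk01", 34.0),
    ("spk02", 52.0), ("spk02", 60.0),
]
per_speaker = mean_prediction_per_speaker(predictions)
```

If the errors of the individual models are roughly independent, averaging them is expected to shrink the error variance, which is consistent with the lower RMSE reported for the per-speaker rows.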
Tables 2 and 3 showed that not all of the sample groups (with different linguistic content) have the same
effect on the performance of the UPDRS prediction. This suggests that the final prediction can be improved
by taking into consideration only those samples whose linguistic content has significant correlations (in the
case of the undivided dataset). The same test sessions were performed as in the previous case. The results are
shown in Tables 7 and 8 as mean predicted UPDRS values per speaker for the two databases. The
baseline value for the PC-GITA samples in the Interspeech 2015 Sub-challenge (best performance on
the development set) is also shown, along with the results of the top two performing participants Grósz
Table 7: Prediction errors between the original and the mean predicted UPDRS per speaker using the best correlated linguistic
content only of the PC-GITA. NN models were created using (1) all samples altogether, (2) samples with different linguistic
content separately and (3) samples with separate gender and linguistic content. Baseline RMSE values (in brackets) are the
standard deviation of the original UPDRS values.
Method  Dataset                    RMSE            Spearman  Pearson
(1)     undivided                  15.14 (18.43)   .572      .626
        male                       17.83 (22.76)   .811      .710
        female                     12.32 (15.32)   .252      .552
(2)     undivided                  13.79 (18.43)   .717      .757
        male                       15.80 (22.76)   .811      .858
        female                     11.75 (15.32)   .571      .656
(3)     total                      11.96 (18.43)   .799      .831
        male                       13.31 (22.76)   .811      .844
        female                     10.65 (15.32)   .786      .891
Orozco-Arroyave et al. (2016b)                     .620      –
Interspeech 2015 baseline                          0.492     –
Grósz et al. (2015)                                .691      –
Williamson et al. (2015)                           .670      .671
Table 8: Prediction errors between the original and the mean predicted UPDRS per speaker using the best correlated linguistic
content only of the HPSDb. NN models were created using (1) all samples altogether, (2) samples with different linguistic
content separately and (3) samples with separate gender and linguistic content. Baseline RMSE values (in brackets) are the
standard deviation of the original UPDRS values.
Method  RMSE            Spearman  Pearson
(1)     15.03 (12.93)   .489      .404
(2)     10.95 (12.93)   .707      .709
(3)     9.96 (12.93)    .898      .835
et al. (2015), Williamson et al. (2015). Fig. 2 shows the best predicted and original UPDRS values with
averaging per speaker in the best performing cases for both databases.
6. Cross-lingual regression
In the previous test runs monolingual prediction was applied. In order to investigate the possibilities
and the language dependency of the acoustic features used, cross-lingual prediction trials were carried out.
The training in this case was done using all (or selected) samples of one of the databases and the testing was
performed with the other one. Both train/test combinations (PC-GITA/HPSDb and HPSDb/PC-GITA)
were applied. Four setups were used in the test trials according to the datasets used: NN models were
created using (1) all samples altogether, (2) read texts and monologues only, (3) all samples for genders
separately and (4) read texts and monologues only for genders separately. Table 9 contains the results of the
different setups. In all cases, the mean UPDRS score was calculated and used for each speaker, as described
in the previous Section.
The results show lower prediction performance compared to the monolingual predictions. Relatively high
RMSE values can be observed. In the case of males, lower prediction performance was achieved than in
the case of females. This tendency holds for both Train set/Test set setups. Fig. 3 shows the relation of
the original and the predicted mean UPDRS scores. As in the monolingual case, the predicted values have
a compressed scale compared to the original values. This is particularly true for the male samples, where
the prediction power is lower. The female samples, however, have higher correlation and decreased relative
RMSE values.
Figure 2: Scatter plot of the original UPDRS and the predicted mean UPDRS values per speaker with NN models trained for
male and female speakers separately using the best correlated linguistic content only for both (a) PC-GITA and (b) HPSDb
databases.
Table 9: Prediction results of the cross-lingual regression. Correlations between the original and the mean predicted UPDRS
per speaker. NN models were created using (1) all samples altogether, (2) read texts and monologues only, (3) all samples for
genders separately and (4) read texts and monologues only for genders separately. Baseline RMSE values (in brackets) are the
standard deviation of the original UPDRS values.
Train set/Test set  Method  Dataset     RMSE            Spearman  Pearson
PC-GITA/HPSDb       (1)     undivided   13.23 (12.98)   .306      .300
                            male        13.85 (12.30)   .239      .293
                            female      12.68 (13.39)   .451      .456
                    (2)     undivided   13.20 (12.98)   .217      .212
                            male        13.60 (12.30)   .302      .357
                            female      12.81 (13.39)   .354      .393
                    (3)     undivided   13.08 (12.98)   .222      .192
                            male        13.54 (12.30)   .308      .335
                            female      12.68 (13.39)   .437      .427
                    (4)     undivided   12.74 (12.98)   .246      .305
                            male        13.61 (12.30)   -.121     -.199
                            female      11.86 (13.39)   .493      .563
HPSDb/PC-GITA       (1)     undivided   17.95 (17.88)   .433      .443
                            male        13.73 (12.86)   .190      .309
                            female      21.04 (21.43)   .562      .547
                    (2)     undivided   16.54 (17.88)   .388      .429
                            male        13.82 (12.86)   .023      .006
                            female      19.01 (21.43)   .628      .576
                    (3)     undivided   17.33 (17.88)   .388      .421
                            male        14.03 (12.86)   .127      .043
                            female      19.84 (21.43)   .664      .640
                    (4)     undivided   15.73 (17.88)   .444      .498
                            male        14.22 (12.86)   -.059     -.084
                            female      17.19 (21.43)   .654      .660
Figure 3: Scatter plot of the original UPDRS and the predicted mean UPDRS values per speaker with cross-lingual NN models
trained for male and female speakers separately using monologues and read texts.
7. Discussion
The correlation analysis in Section 4 indicates that the measured features are applicable to
estimating the severity of Parkinson's disease in a monolingual environment. However, not all of the speech
samples contribute equally, and neither do all the features. High diversity is found according to the
type of speech sound production, linguistic content and text, for example according to how many VCV
segments with an unvoiced consonant are present.
The measured correlation values of the male and female speakers show differences. Males tend to have
higher correlation between the computed features and the UPDRS values than females in the examined speech
material. Whether this is due to a general tendency in the disease or to some other phenomenon is not clear.
In case of regression, the estimation of the UPDRS values showed that the applied features are capable of
estimating the severity of the disease. The RMSE values were signi cantly lower than the standard deviation
of the original UPDRS values (that is predicting the mean of UPDRS values for all utterances).
Because the database contained the speaker IDs and every linguistic content was uttered by all of the
speakers, it was possible to create one predicted UPDRS value per speaker by averaging the predictions
that belong to the same speaker. This resulted in a much better UPDRS prediction because averaging reduces
the variance of the prediction error. In a real-world diagnostic application the same can be done by
recording utterances with varying linguistic content from the same speaker. By keeping only the best
correlated linguistic contents, an even more accurate predictor could be trained. Based on the correlations
of the individual features, only the linguistic contents with the highest values were selected. Performing
the same prediction tests on these units as in the previous case yielded increased correlations and lower
RMSE values. The baseline result of the Interspeech 2015 Sub-challenge was surpassed. Although the predicted
values have a smaller range than the original ones (Fig. 2), the order (rank) of the UPDRS values remained
mostly the same.
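The per-speaker averaging and the observation about preserved rank order can be sketched as follows. This is a minimal Python illustration under stated assumptions: the speaker IDs, predictions and UPDRS values are hypothetical, and the helpers `mean_per_speaker` and `ranks` are not part of the original experiments.

```python
from collections import defaultdict
from statistics import fmean


def mean_per_speaker(speaker_ids, predictions):
    """Average utterance-level predictions into one value per speaker."""
    grouped = defaultdict(list)
    for speaker, value in zip(speaker_ids, predictions):
        grouped[speaker].append(value)
    return {speaker: fmean(values) for speaker, values in grouped.items()}


def ranks(values):
    """Rank positions (0 = smallest); assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


# Hypothetical data: three speakers, three utterances each.
speaker_ids = ["s1", "s1", "s1", "s2", "s2", "s2", "s3", "s3", "s3"]
utterance_predictions = [22.0, 30.0, 26.0, 35.0, 41.0, 32.0, 44.0, 39.0, 49.0]
true_updrs = {"s1": 10.0, "s2": 38.0, "s3": 71.0}

speaker_predictions = mean_per_speaker(speaker_ids, utterance_predictions)
speakers = sorted(true_updrs)
predicted = [speaker_predictions[s] for s in speakers]
original = [true_updrs[s] for s in speakers]

# The predicted range is narrower than the original one, yet the order is kept.
print(max(predicted) - min(predicted) < max(original) - min(original))
print(ranks(predicted) == ranks(original))
```

A rank-based measure such as the Spearman correlation depends only on the output of `ranks`, which is why it can remain high even when the predicted values are compressed toward the mean.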
The female/male differences were observed here as well. This suggested that training NN models on both
genders at the same time performs worse than training them separately. The results in Table 5 and Table 7
confirm this hypothesis. Interestingly, the performance increased more for the female samples than for
the male speakers. The reason for this is not clear.
The cross-lingual predictions showed results that are not yet satisfactory but are promising. The compression
of the final predicted UPDRS values and the difference between male and female samples could also be observed.
Better prediction performance (lower relative RMSE and higher correlation) was found for the female
samples.
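As a worked example of the relative RMSE, the sketch below divides the RMSE by the value given in parentheses for linguistic content set (4) of the HPSDb/PC-GITA rows in the table above. Two assumptions are made here and are not confirmed by the text: that the value in parentheses is the standard deviation of the original UPDRS values of the group, and that "relative RMSE" means the ratio of the two.

```python
# (RMSE, value in parentheses) for set (4) of the HPSDb/PC-GITA rows.
# Assumption: the value in parentheses is the standard deviation of the
# original UPDRS values of the corresponding group.
results = {
    "undivided": (15.73, 17.88),
    "male": (14.22, 12.86),
    "female": (17.19, 21.43),
}


def relative_rmse(rmse, std):
    """RMSE divided by the standard deviation of the targets (assumed definition)."""
    return rmse / std


for group, (rmse, std) in results.items():
    print(f"{group}: {relative_rmse(rmse, std):.2f}")
```

Under these assumptions the ratios are about 0.88 (undivided), 1.11 (male) and 0.80 (female). A ratio below 1 means the model beats the mean predictor, so only the female and undivided cases do so for this set.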

8. Conclusion
In this paper a Parkinson's disease severity estimation analysis was carried out using speech databases
of Spanish and Hungarian speakers separately. Correlation measurements were performed between acoustic
features and the UPDRS severity values to examine the suitability of the selected parameters. The applied
acoustic features were the following: the voicing ratio (VR); nonlinear recurrence: the normalized recurrence
probability density entropy (Hnorm); and fractal scaling: the scaling exponent (α). High
diversity is found according to the type of speech sound production, and hence according to the text and
gender. Increased correlation was obtained by separating male and female speakers. Males tend to have
higher correlation between the computed features and the UPDRS values than females. The reason for this
phenomenon should be investigated thoroughly in future work.
Based on the results of the correlation calculations, prediction of the UPDRS values was performed using
a regression technique applying neural networks. The results showed that the applied features are capable
of estimating the severity of PD. By assigning the mean predicted UPDRS to each corresponding
speaker using the best correlated linguistic contents, the result of the Interspeech 2015 Sub-challenge
winner was exceeded in the case of the Spanish database. This means that a more accurate predictor could
be trained by composing well-selected, varied linguistic content. Moreover, by training NN models separately
for males and females the accuracy could be increased further in the case of the examined Spanish database.
The cross-lingual prediction tests showed results that are not yet satisfactory but are promising. The set
of acoustic features could be extended in order to improve the prediction performance. The extension of the
HPSDb speech material is in continuous progress. With extended speech material and an extended acoustic
feature set, a model with more accurate prediction could be trained. Thus, a computer-aided diagnostic system
could be developed to help medical staff or to support home self-diagnosis.
Acknowledgement
The research was partly funded by the Postdoctoral Fellowship Programme of the Hungarian Academy of
Sciences (POSTDOC-77). The PC-GITA database was created by the GITA research group at Universidad
de Antioquia, Medellín, Colombia.
References
Alemami, Y., Almazaydeh, L., 2014. Detection of Parkinson disease through voice signal features. The Journal of American
Science 10, 44-47.
Arias-Vergara, T., Vasquez-Correa, J., Orozco-Arroyave, J.R., Vargas-Bonilla, J., Nöth, E., 2016. Parkinson's disease progression
assessment from speech using GMM-UBM. Interspeech 2016, 1933-1937.
Baken, R.J., Orlikoff, R.F., 2000. Clinical measurement of speech and voice. Cengage Learning.
Beringer, N., Schiel, F., 2000. The quality of multilingual automatic segmentation using German MAUS, in: Proc. of the
International Conference on Spoken Language Processing.
Brookes, M., 2016. Voicebox. http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html/.
Corp, I., 2013. IBM SPSS Statistics for Windows, version 22.0.
Frid, A., Hazan, H., Hilu, D., Manevitz, L., Ramig, L.O., Sapir, S., 2014. Computational diagnosis of Parkinson's disease directly
from natural speech using machine learning techniques, in: Software Science, Technology and Engineering (SWSTE), 2014
IEEE International Conference on, IEEE. pp. 50-53.
Grósz, T., Busa-Fekete, R., Gosztolya, G., Tóth, L., 2015. Assessing the degree of nativeness and Parkinson's condition using
Gaussian processes and deep rectifier neural networks, in: Proceedings of Interspeech, pp. 1339-1343.
Harel, B., Cannizzaro, M., Snyder, P.J., 2004. Variability in fundamental frequency during speech in prodromal and incipient
Parkinson's disease: a longitudinal case study. Brain and Cognition 56, 24-29.
Ho, A.K., Iansek, R., Marigliani, C., Bradshaw, J.L., Gates, S., 1999. Speech impairment in a large sample of patients with
Parkinson's disease. Behavioural Neurology 11, 131-137.
Khan, T., Westin, J., Dougherty, M., 2014. Classification of speech intelligibility in Parkinson's disease. Biocybernetics and
Biomedical Engineering 34, 35-45.
Little, M., McSharry, P., Moroz, I., Roberts, S., 2006. Nonlinear, biophysically-informed speech pathology detection, in: 2006
IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, IEEE. pp. II-II.
Little, M.A., McSharry, P.E., Hunter, E.J., Spielman, J., Ramig, L.O., et al., 2009. Suitability of dysphonia measurements for
telemonitoring of Parkinson's disease. IEEE Transactions on Biomedical Engineering 56, 1015-1022.
Little, M.A., McSharry, P.E., Roberts, S.J., Costello, D.A., Moroz, I.M., 2007. Exploiting nonlinear recurrence and fractal
scaling properties for voice disorder detection. BioMedical Engineering OnLine 6, 1.
Orozco-Arroyave, J., Hönig, F., Arias-Londoño, J., Vargas-Bonilla, J., Daqrouq, K., Skodda, S., Rusz, J., Nöth, E., 2016a.
Automatic detection of Parkinson's disease in running speech spoken in three different languages. The Journal of the
Acoustical Society of America 139, 481-500.
Orozco-Arroyave, J.R., Arias-Londoño, J.D., Bonilla, J.F.V., Gonzalez-Rátiva, M.C., Nöth, E., 2014. New Spanish speech
corpus database for the analysis of people suffering from Parkinson's disease, in: LREC, pp. 342-347.
Orozco-Arroyave, J.R., Vásquez-Correa, J., Hönig, F., Arias-Londoño, J.D., Vargas-Bonilla, J., Skodda, S., Rusz, J., Nöth,
E., 2016b. Towards an automatic monitoring of the neurological state of Parkinson's patients from speech, in: Acoustics,
Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, IEEE. pp. 6490-6494.
Rahn, D.A., Chou, M., Jiang, J.J., Zhang, Y., 2007. Phonatory impairment in Parkinson's disease: evidence from nonlinear
dynamic analysis and perturbation analysis. Journal of Voice 21, 64-71.
Rajput, M., Rajput, A., Rajput, A.H., 2007. Epidemiology, in: Pahwa, R., Lyons, K.E. (Eds.), Handbook of Parkinson's
disease. 4th ed. CRC Press, New York.
Sakar, B.E., Isenkul, M.E., Sakar, C.O., Sertbas, A., Gurgen, F., Delil, S., Apaydin, H., Kursun, O., 2013. Collection and
analysis of a Parkinson speech dataset with multiple types of sound recordings. IEEE Journal of Biomedical and Health
Informatics 17, 828-834.
Sapir, S., Ramig, L.O., Spielman, J.L., Fox, C., 2010. Formant centralization ratio: a proposal for a new acoustic measure of
dysarthric speech. Journal of Speech, Language, and Hearing Research 53, 114-125.
Schuller, B., Steidl, S., Batliner, A., Hantke, S., Hönig, F., Orozco-Arroyave, J.R., Nöth, E., Zhang, Y., Weninger, F., 2015.
The INTERSPEECH 2015 computational paralinguistics challenge: Nativeness, Parkinson's & eating condition, in: Proceedings
of INTERSPEECH.
Tsanas, A., Little, M.A., McSharry, P.E., Spielman, J., Ramig, L.O., 2012. Novel speech signal processing algorithms for
high-accuracy classification of Parkinson's disease. IEEE Transactions on Biomedical Engineering 59, 1264-1271.
Williamson, J.R., Quatieri, T.F., Helfer, B.S., Perricone, J., Ghosh, S.S., Ciccarelli, G., Mehta, D.D., 2015. Segment-dependent
dynamics in predicting Parkinson's disease, in: Sixteenth Annual Conference of the International Speech Communication
Association.