
BABES -BOLYAI UNIVERSITY
FACULTY OF MATHEMATICS AND COMPUTER SCIENCE
Contributions to solving real world
problems using machine learning
models
Ph.D. Thesis
Ph.D. student: [anonimizat]
Scientific supervisor: Prof. Dr. Czibula Gabriela
2017

Contents
Keywords 7
Acknowledgements 8
List of publications 9
Introduction 11
1 Background for Archaeological problems 14
1.1 Predicting stature from archaeological skeletal remains using long bone lengths 14
1.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.1.2 Related work on the stature prediction problem . . . . . . . . . . . . . 15
1.2 Body mass estimation in bioarchaeology . . . . . . . . . . . . . . . . . . . . . 18
1.2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.2.2 Literature review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.3 Age at death estimation from long bone lengths . . . . . . . . . . . . . . . . . 19
1.3.1 Literature review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2 Contributions to machine learning models for archaeology 21
2.1 Proposed machine learning models . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.1 Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.2 Support Vector Regression . . . . . . . . . . . . . . . . . . . . . . . . 22
2.1.3 Locally Weighted Regression . . . . . . . . . . . . . . . . . . . . . . . 23
2.1.4 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.1.5 Evaluation methodology . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2 Computational Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2.1 Stature prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2.2 Body mass estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.2.3 Age at death estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3 Background for Software Engineering problems 63
3.1 Software defect prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.1.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2 Software development effort estimation . . . . . . . . . . . . . . . . . . . . . . 65
3.2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.2.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4 Contributions to machine learning models for software engineering 71
4.1 Proposed machine learning models . . . . . . . . . . . . . . . . . . . . . . . . 71
4.1.1 Fuzzy Self Organizing Maps . . . . . . . . . . . . . . . . . . . . . . . . 71
4.1.2 Fuzzy Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.1.3 Learning models for software development effort estimation . . . . . . 78
4.2 Computational experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.2.1 Software defect detection using FSOM . . . . . . . . . . . . . . . . . . 84
4.2.2 Software Defect Prediction using FuzzyDT . . . . . . . . . . . . . . . 89
4.2.3 Software Development Effort Estimation . . . . . . . . . . . . . . . . . 97
Conclusions 111

List of Figures
2.1 Pearson correlation for the features on the set of European-American instances 30
2.2 PCA on the set of European-American instances . . . . . . . . . . . . . . . . 30
2.3 Pearson correlation for the features on the set of African-American instances 32
2.4 PCA on the set of African-American instances . . . . . . . . . . . . . . . . . 32
2.5 Pearson correlation for the features on the set of all European-American and
African-American instances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.6 PCA on the set of all European-American and African-American instances. . 34
2.7 Pearson correlation for the features on the set of all European-American and
African-American individuals (without considering the sex and race of the
individuals). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.8 Data set reduced to a single feature using Principal Components Analysis. . . 42
2.9 Learning curves for the mixed case study, with the RBF kernel under the M1
methodology and considering MAE scores. . . . . . . . . . . . . . . . . . . . . 45
2.10 Pearson correlation for the features of the first case study. . . . . . . . . . . 47
2.11 Pearson correlation for the features for the second experiment. . . . . . . . . 48
2.12 Pearson correlation and U-Matrix visualization. . . . . . . . . . . . . . . . . . 51
2.13 PCA graph for the subadults case study. . . . . . . . . . . . . . . . . . . . . . 59
2.14 PCA graph for the young adults case study. . . . . . . . . . . . . . . . . . . . 59
4.1 The fuzzy functions defined for the total loc software metric. . . . . . . . . . 76
4.2 Selected software metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.3 Flowchart of the used machine learning pipeline. . . . . . . . . . . . . . . . . 82
4.4 t-SNE plot for the Ar1 data set. . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.5 t-SNE plot for the Ar6 data set. . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.6 U-Matrix built on the Ar1 data set. . . . . . . . . . . . . . . . . . . . . . . . 86
4.7 U-Matrix built on the Ar3 data set. . . . . . . . . . . . . . . . . . . . . . . . 86
4.8 U-Matrix built on the Ar4 data set. . . . . . . . . . . . . . . . . . . . . . . . 87
4.9 U-Matrix built on the Ar5 data set. . . . . . . . . . . . . . . . . . . . . . . . 87
4.10 U-Matrix built on the Ar6 data set. . . . . . . . . . . . . . . . . . . . . . . . 87
4.11 Comparison to related work. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.12 Two dimensional representations using t-SNE of our transformed data sets. . 91
4.13 Counts of related work methods that are better and worse than FuzzyDT on
the considered data sets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.14 Visualizations of TF-IDF and doc2vec transformers reduced to two dimensions
on the T data set, with initial preprocessing (parsing). . . . . . . . . . . . . 101
4.15 Visualizations of TF-IDF and doc2vec transformers reduced to two dimensions
on the d2 data set, with initial preprocessing (parsing). . . . . . . . . . . . . 102
4.16 Results and comparison to human estimates – chart form. . . . . . . . . . . . 106
4.17 Mean MMRE across all data sets for each learning model and vectorization
method, with initial text preprocessing. 95% confidence intervals for the mean
are depicted. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.18 Mean MMRE across all data sets for each learning model and vectorization
method, without initial text preprocessing. 95% confidence intervals for the
mean are depicted. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

List of Tables
2.1 MAE and SEE values obtained using the GA model on the set of European-
American instances, with 95% confidence intervals for the mean. . . . . . . . 30
2.2 MAE and SEE values obtained using the GA on the set of African-American
instances, using 95% confidence intervals for the mean. . . . . . . . . . . . . 32
2.3 MAE and SEE values obtained using GA on the set of all European-American
and African-American instances, using 95% confidence intervals for the mean. 34
2.4 Detailed results obtained on the considered case studies, considering all 20
cross-validations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.5 MAE and SEE values obtained using the final GA formula on the considered
case studies, with no cross-validation. . . . . . . . . . . . . . . . . . . . . . . 35
2.6 MAE and SEE values obtained using the GA model on the considered case
studies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.7 Average MAE and SEE values obtained using the GA model on the set of all
European-American and African-American individuals (without considering
the sex and race of the individuals). . . . . . . . . . . . . . . . . . . . . . . . 36
2.8 Comparison between the SEE values reported by our approaches and the ap-
proach of Trotter and Gleser [TG52]. . . . . . . . . . . . . . . . . . . . . . . . 38
2.9 Average MAE and SEE for the proposed models against other important re-
lated articles in the bibliography. . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.10 Results for the Caucasian case study. 95% confidence intervals are used for the
results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.11 Results obtained for the African-American data. 95% confidence intervals are
used for the results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.12 Results for the mixed case study. 95% confidence intervals are used for the
results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.13 Overview of literature results on the data set we have used. . . . . . . . . . . 45
2.14 LWR model performance measures obtained for the first experiment, with 95%
confidence intervals for the MAE mean. . . . . . . . . . . . . . . . . . . . . . 47
2.15 LWR model performance measures obtained for the second experiment, with
95% confidence intervals for the MAE mean. . . . . . . . . . . . . . . . . . . 47
2.16 Results on the second case study with the M45 instance removed. GENDER
was included. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.17 Comparison between the known results on the open source Trotter data set
and our own IBL method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.18 Results obtained using the GA. 95% confidence intervals for the mean are used. 52
2.19 Results obtained using the SVM. 95% confidence intervals for the mean are
used. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.20 Results obtained using the LWR model. 95% confidence intervals for the mean
are used. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.21 MAE values obtained using the GA, SVR and LWR models on the considered
case studies, with 95% confidence intervals for the mean. . . . . . . . . . . . 54
2.22 Comparative MAE values – with and without outlier removal. . . . . . . . . 55
2.23 Comparison between our approaches and similar related work. 95% confidence
intervals for the mean are used for the results. . . . . . . . . . . . . . . . . . . 56
2.24 Results using the GA, SVR and LWR learning models on the 5 case studies.
95% confidence intervals for the mean are used. . . . . . . . . . . . . . . . . . 60
2.25 Summary of our comparison to related work. . . . . . . . . . . . . . . . . . . 61
2.26 Results obtained by applying some of the formulas in [PFSL12] on our own
data set's subadults. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.1 Constants used in the COCOMO formulas for each project type [BCH+00]. . 67
4.1 Different percentile thresholds used for defining the fuzzy functions. . . . . . . 77
4.2 Description of the data sets used in the experiments. . . . . . . . . . . . . . . 85
4.3 Results obtained using FSOM on all experimented data sets. . . . . . . . . . 88
4.4 Comparison of our AUC values with the related work. . . . . . . . . . . . . . 89
4.5 Description of the Ar1-Ar6 data sets. . . . . . . . . . . . . . . . . . . . . . . 90
4.6 Detailed results obtained for the Ar1 data set. . . . . . . . . . . . . . . . . . 92
4.7 Detailed results obtained for the Ar3 data set. . . . . . . . . . . . . . . . . . 93
4.8 Detailed results obtained for the Ar4 data set. . . . . . . . . . . . . . . . . . 93
4.9 Detailed results obtained for the Ar5 data set. . . . . . . . . . . . . . . . . . 93
4.10 Detailed results obtained for the Ar6 data set. . . . . . . . . . . . . . . . . . 94
4.11 Minimum, maximum, average and population standard deviations of the ob-
tained values on each data set. . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.12 The configurations for which the highest AUC values are achieved. . . . . . 94
4.13 Comparison of our results based on the considered comparison criteria. . . . . 95
4.14 Comparison of our average AUC with related work on the same data sets. . . 96
4.15 Number of instances and short presentations for each data set. . . . . . . . . 99
4.16 Summary of data sets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.17 Results using TF-IDF and initial text preprocessing. The best results are
highlighted. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.18 Results using TF-IDF without the initial text preprocessing. The best results
are highlighted. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.19 Results using doc2vec and initial text preprocessing. The best results are
highlighted. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.20 Results using doc2vec without the initial text preprocessing. The best results
are highlighted. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.21 Results using TF-IDF + doc2vec and initial text preprocessing. The best
results are highlighted. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.22 Results using TF-IDF + doc2vec without the initial text preprocessing. The
best results are highlighted. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.23 Hyperparameter search descriptions. . . . . . . . . . . . . . . . . . . . . . . . 107
4.24 Results and comparison to human estimates – table form. . . . . . . . . . . . 108
4.25 Results using each text vectorizer with the initial text preprocessing. The best
SVR and GNB results are highlighted across the three different text vectorizers. 108
4.26 Results using each text vectorizer without initial text preprocessing. The best
SVR and GNB results are highlighted across the three different text vectorizers. 109
4.27 Comparison to related work. The related works with higher average MMRE
values than our best result on the T data set are marked in green. The works
for which we provide an MMRE interval and for which we do better than the
upper bound on the T set are marked in yellow. Works that clearly obtain
better MMRE values on their data sets than we do on T are left in white. . . 110

Keywords
Machine learning, genetic algorithms, instance based learning, support
vector machines, archaeology, software engineering, software defect pre-
diction, software development effort estimation, doc2vec, self organizing
maps, natural language processing, fuzzy decision trees, fuzzy self or-
ganizing maps, principal components analysis, t-distributed stochastic
neighbour embedding.

Acknowledgements
This thesis has been the result of three years of research at the Babeș-Bolyai University of
Cluj-Napoca. I would like to thank all of my colleagues there who have supported me during
these years and have offered me suggestions and feedback regarding my research, especially
Gabi Mircea, Iuliana Bocicor, Zsuzsanna Marian, Istvan Czibula and Laura Diosan. Their
knowledge and dedication have definitely made me into a better researcher.
I also thank the computer science department leadership for making it a mostly fair and
meritocratic department that strives to attract and keep good talent and understands the
need to be supportive towards all members of the department. The colleagues that I have
worked with for teaching activities were all great people and professionals and I thank you
all for sharing your experience with me.
I would also like to thank my family, who has always been by my side and helped me with
anything I needed, and my girlfriend Alexandra who was always encouraging and supportive.
Most importantly, I would like to thank my Ph.D. supervisor, Professor Gabriela Czibula,
for her incredible passion and dedication, unwavering support, and indispensable scientific
experience and knowledge. She has made me far more passionate about and dedicated to
scientific research, has managed to share a great deal of her knowledge with me in a
relatively short time, and without her I would not be presenting this thesis here today.

List of publications
All rankings are listed according to the 2014 classification of journals [1] and conferences [2]
in Computer Science and the associated web service [3].
Publications in ISI Web of Knowledge
Publications in ISI Science Citation Index Expanded
1. [CIMM16] Gabriela Czibula, Vlad-Sebastian Ionescu, Diana-Lucia Miholca and
Ioan-Gabriel Mircea. Machine learning-based approaches for predicting stature from
archaeological skeletal remains using long bone lengths. Journal of Archaeological
Science, volume 69, pp. 85–99, 2016 (2016 IF = 2.602).
Rank A, 4 points.
2. [IDC17] Vlad-Sebastian Ionescu, Horia Demian and Istvan-Gergely Czibula. Natural
language processing and machine learning methods for software development effort
estimation. Studies in Informatics and Control, volume 26(2), pp. 219–228, 2017
(2016 IF = 0.776).
Rank C, 2 points.
3. [CCMI16] Istvan-Gergely Czibula, Gabriela Czibula, Zsuzsanna-Edit Marian and Vlad-
Sebastian Ionescu. A Novel Approach Using Fuzzy Self-Organizing Maps for Detecting
Software Faults. Studies in Informatics and Control, volume 25(2), pp. 207–216, 2016
(2016 IF = 0.776).
Rank C, 1 point.
4. [LI14] Georgia Irina Oros, Gheorghe Oros, Alina Alb Lupaș and Vlad Ionescu. Differ-
ential subordinations obtained by using a generalization of the Marx-Strohhäcker theorem.
Journal of Computational Analysis and Applications, volume 20(1), pp. 135–139, 2016
(2016 IF = 0.609).
Rank C, 1 point.
5. [OOLI16] Andrei Loriana and Vlad Ionescu. Some differential superordinations using
the Ruscheweyh derivative and the generalized Sălăgean operator. Journal of Computational
Analysis and Applications, volume 20(1), pp. 437–444, 2014 (2016 IF = 0.609).
Rank C, 2 points.
[1] http://informatica-universitaria.ro/getpfile/16/CSafisat2.pdf
[2] http://informatica-universitaria.ro/getpfile/16/CORE2013_Exported.xlsx
[3] http://informatica-universitaria.ro/php/index.html

Publications in ISI Conference Proceedings Citation Index
1. [Ion17] Vlad-Sebastian Ionescu. An approach to software development effort esti-
mation using machine learning. In 2017 IEEE International Conference on Intelligent
Computer Communication and Processing, IEEE Computer Society, accepted for pub-
lication, 2017 (indexed IEEE).
Rank C, 2 points.
2. [ITV16] Vlad-Sebastian Ionescu, Mihai Teletin and Estera-Maria Voiculescu. Ma-
chine learning techniques for age at death estimation from long bone lengths. In IEEE
11th International Symposium on Applied Computational Intelligence and Informatics
(SACI 2016), IEEE Hungary Section, pp. 457–462, 2016.
Rank C, 2 points.
3. [ICT16] Vlad-Sebastian Ionescu, Gabriela Czibula and Mihai Teletin. Supervised
learning techniques for body mass estimation in bioarchaeology. In IEEE 7th International
Workshop on Soft Computing Applications, accepted, 2016, in press.
Rank C, 2 points.
Publications in international journals and conferences
1. [IMMC15] Vlad-Sebastian Ionescu, Ioan-Gabriel Mircea, Diana-Lucia Miholca, and
Gabriela Czibula. Novel instance based learning approaches for stature estimation in
archaeology. In 2015 IEEE International Conference on Intelligent Computer Commu-
nication and Processing, IEEE Computer Society, pp. 309–316, 2015 (indexed IEEE).
Rank C, 1 point.
2. [Ion15a] Vlad-Sebastian Ionescu. New supervised learning approaches for sex iden-
tification in archaeology. In International Virtual Research Conference In Technical
Disciplines, EDIS – Publishing Institution of the University of Zilina, pp. 56–64, 2015
(indexed Google Scholar).
Rank D, 1 point.
3. [Ion15b] Vlad-Sebastian Ionescu. Support vector regression methods for height esti-
mation in archaeology. Studia Universitatis Babes-Bolyai Series Informatica, volume
LX(2), pp. 70–82, 2015 (indexed Mathematical Reviews).
Rank D, 1 point.
4. [MCMI16] Zsuzsanna-Edit Marian, Istvan-Gergely Czibula, Ioan-Gabriel Mircea and
Vlad-Sebastian Ionescu. A study on software defect prediction using fuzzy decision
trees. Studia Universitatis Babes-Bolyai Series Informatica, volume LXI(2), pp. 5–20,
2016 (indexed Mathematical Reviews).
Rank D, 0 points.
Publications score: 19 points.

Introduction
This Ph.D. thesis consists of research on applying computational intelligence techniques
for solving real world problems in the fields of Archaeology and Software Engineering.
All of the original research conducted was done under the supervision of Prof. Dr. Gabriela
Czibula.
The two fields were chosen due to the existence in both of several difficult practical problems
that we hypothesized would be of significant importance to researchers in archaeology
and software engineering and that could be approached under a unified machine learning
theme. Our hypothesis proved to be correct: no similar approaches existed for the chosen
problems, the results obtained are better than existing ones and useful to researchers in those
fields, and our approaches share a central machine learning element.
For archaeology, the focus was on the application of several machine learning models –
Genetic Algorithms (GA), Support Vector Regression (SVR) and Locally Weighted Regres-
sion (LWR) – to three important problems from the field of archaeology: stature prediction,
weight (body mass) estimation and age at death estimation using bone lengths. The models
we have chosen are known to provide very good results on similar regression problems, and
the problems we have decided to approach are of significant interest in the archaeological
literature, due to the information that a good solution to them can provide to researchers
in the field. Moreover, to our knowledge, machine learning had not been used on the
approached archaeological problems until now.
Each of our learning models has been applied to all of the researched archaeological tasks,
resulting in a unified approach to the field of archaeology through machine learning. Our
approaches are novel in relation to the archaeological field and have been published in journal
articles and conference papers [CIMM16, ITV16, ICT16, IMMC15, Ion15b].
Concerning Software Engineering, we have focused on two important problems from
the Search-based Software Engineering literature: Software Defect Prediction (SDP) and
Software Development Effort Estimation (SDEE). We have proposed several machine learning
models for solving the above mentioned problems and we have performed experiments on
various data sets relating to these problems using the Fuzzy Self Organizing Maps (FSOM),
Fuzzy Decision Trees (FDT) and Support Vector Regression (SVR) machine learning models.
Our models are either novel in and of themselves or have not been applied to these problems
before in the way that we apply them. We have obtained very good results with all three approaches,
surpassing most approaches from the literature and introducing new research directions for
some important problems in the eld of Software Engineering.
Our novel solutions for these problems result in a unified machine learning approach for
various Software Engineering tasks, which can help developers and project managers alike.
Our approaches are novel in relation to the Software Engineering field and have been published
in the papers [CCMI16, MCMI16, Ion17, IDC17].
The rest of the thesis is structured as follows.
A description of the archaeological problems approached and the necessary background
for understanding our solutions are presented in Chapter 1. We present the problems we
target from the archaeological domain: stature prediction through skeletal remains using long
bone lengths, estimating the body mass from human skeletal remains and age at death es-
timation from long bone lengths. For each of the three problems, we start by emphasizing
their importance in paleontological and archaeological research. Then, a motivation for ap-
plying machine learning methods and a brief literature review on existing approaches will be
provided.
The original approaches introduced in the field of archaeology are presented in Chapter 2.
The chapter begins with a presentation of the novel applications of machine learning models
used to approach the above problems, in Subchapter 2.1. In Section 1.1, the problem of
stature prediction from long bone lengths is presented. This is the first of the three real world
problems from archaeology approached in this thesis, and the one for which our results were
best received by the archaeological community. In Section 1.2, the problem of weight prediction
is introduced. This is a problem with similar usefulness to archaeologists in their work.
Section 1.3 presents the age at death prediction problem, a quantity that again
has important implications in archaeology. In Subchapter 2.2, our experimental results on
publicly available case studies are presented for each of the above problems. Comparisons
to related work are conducted for each problem, highlighting that our methods outperform
existing ones.
Chapter 3 presents the necessary background on the software engineering problems ap-
proached: software defect prediction and software development effort estimation. For each
of the problems, we present their practical relevance for software development, also giving a
motivation for our machine learning approaches. A review of related work on existing approaches
for solving the above-mentioned problems is provided in Subchapters 3.1 and 3.2. In Section 3.1,
the problems of software defect prediction and detection are presented: the first
one concerns predicting where defects will show up in a codebase, while the latter deals with
detecting existing defects. Section 3.2 introduces the software development effort estimation
(SDEE) problem, which involves automatically providing estimates for development tasks.
The original contributions in the field of software engineering are presented in Chapter 4.
The chapter starts with an overview of the machine learning elements used in our research
on the two software engineering problems, in Subchapter 4.1. The experimental evaluation,
performed on freely available software engineering data sets, is presented in Subchapter 4.2
for each of the approached problems. We also compare the obtained results with the ones
reported by similar approaches from the software engineering literature, highlighting that our
results are better than existing ones.
We have also conducted research that is not part of the central theme of this thesis. This was
done on the sex identification problem in archaeology [Ion15a] and on two problems in
mathematical analysis [LI14, OOLI16]. That research contributed to gaining
better insights into the problems around which this thesis is written.
The thesis concludes with our future research directions regarding the application of
machine learning to real world problems.
The original contributions of this thesis are contained in Chapters 2 and 4 and are sum-
marized in the following. Our contributions have been well received by the archaeological
community, with one publication being accepted in a top archaeological journal. In the soft-
ware engineering field, our contributions are extensive and have also been appreciated by
industry professionals.
Original contributions in applying machine learning in the archaeological field, for
solving important archaeological problems: stature prediction, body mass estimation
and age at death estimation. These contributions are presented in Chapter 2 and they
are as follows:
– Novel approaches for solving archaeological problems using machine learning mod-
els are introduced in Subchapter 2.1 [CIMM16, Ion15b, ITV16, ICT16, IMMC15].

The proposed machine learning models (genetic algorithm, support vector re-
gression, instance based learning) are introduced in Sections 2.1.1, 2.1.2 and
2.1.3. We have successfully applied a robust genetic algorithm that is both
easier to apply than previous archaeological formulas and more accurate.
This GA approach has been used on the other two archaeological
problems as well.
Sections 2.1.4 and 2.1.5 present the preprocessing approaches and experimen-
tal methodologies used in experiments involving our approaches to the archae-
ological problems mentioned.
– A thorough experimental evaluation on the above-mentioned archaeological prob-
lems and comparison to other approaches from the literature are provided in Sec-
tion 2.2 [CIMM16, Ion15b, ITV16, ICT16, IMMC15].
Original contributions in applying machine learning in the software engineering field,
for solving important problems related to software development: software defect pre-
diction and software development effort estimation. These contributions are presented
in Chapter 4 and they are as follows:
– Novel approaches for solving software engineering problems using machine learning models
are introduced in Subchapter 4.1 [CCMI16, MCMI16, IDC17, Ion17].
The proposed machine learning models – fuzzy self organizing map (FSOM),
fuzzy decision trees, support vector regression and Gaussian Naive Bayes – are
introduced in Sections 4.1.1, 4.1.2, 4.1.3.
Section 4.1.3 presents the data preprocessing methods and experimental method-
ologies employed in the computational experiments regarding the software
engineering problems approached.
– Computational results on the above-mentioned software engineering problems and
comparison to other approaches from the literature are provided in Section 4.2
[CCMI16, MCMI16, IDC17, Ion17].

Chapter 1
Background for Archaeological
problems
In this chapter we present the archaeological background knowledge regarding stature,
weight and age required for our approaches. This background knowledge has been collected
in order to facilitate the original research published in [CIMM16, ITV16, ICT16].
1.1 Predicting stature from archaeological skeletal remains
using long bone lengths
From the perspective of forensic anthropology and archaeology, predicting the stature of
an individual based on osteological information is fundamental. The classical mathematical
approaches to stature estimation focus on the use of regression methods [SS97] based on
statistical analysis of the data.
Estimating stature is important in bioarchaeological and forensic research, firstly because
stature is a standard biological attribute together with age and weight. It also enables
researchers to assess sexual dimorphism or the body size of the past population under study
[RR06]. Moreover, stature is an important indicator of an individual's physical growth and
development within its social and natural environment [BR14]. Despite the individual's
natural genetic potential for physical growth, it is the society that nurtures its members
through nutrition, hygiene, physical education etc., to reach their potential [BR14].
1.1.1 Motivation
Stature estimation is a common topic in anthropological analysis, the generated results
having wide applications for making biocultural assessments with regard to archaeological
populations. Drawing upon methods and theories from human behavioural ecology, physical
and cultural anthropology, sociology, and economy, scholars have used stature as a quality-
of-life indicator for inferring the complex relationships between skeletal development and
ecology, diet, nutrition, genetics, and physical activity [LW10, BR14].
Together with porotic hyperostosis, cribra orbitalia, and dental enamel hypoplasia, living
height is used as a measure of health in bioarchaeological studies [Aue11, Mor09, PDS+14],
allowing for inferences about subsistence strategies or social inequality [BR14]. The mean
height of a population is considered in [Mor09] to be a marker of its nutritional and health
status. In [PDS+14], Pietrusewsky et al. classified stature as an indicator of health, due
to non-specific systemic stress during growth.
Along with the development of computational intelligence and machine learning, it is
natural to research the possibility of building computer programs that do not have predefined
algorithms for generating predictions, but instead learn from the available data and adapt
their models according to the new data samples that are being processed.
1.1.2 Related work on the stature prediction problem
Different approaches have been proposed in the archaeological literature for estimating
the stature of human skeletons.
As suggested by Thomas Dwight in [Dwi84] (1884), the following methods can be used
for stature reconstruction: anatomical methods and mathematical methods. The anatomical
approach consists of reconstructing the cadaveric stature by summing the lengths of several
skeletal elements, from the skull through the calcaneus, and converting them to living
stature by incorporating soft tissue correction factors [Ful56]. Therefore, as stated by Lundy
in [Lun85] (1985), when applicable, the anatomical method is superior to the mathematical
one. Its applicability obviously diminishes in situations in which well-preserved and nearly
complete skeletons are not retrieved. In these contexts, the mathematical method is effective
even with few bones available, since it employs regression formulae based on how correlated
specific skeletal elements are with the living stature (e.g. [TG52]). An additional advantage
of both the mathematical method and the novel method that we propose over the anatomical
one is the speed of investigation [RR06].
Thomas Dwight proposed in [Dwi84] (1884) the first anatomical method for estimating
the stature, a method that still causes high degrees of error. The French scientist Etienne
Rollet initiated, in the late 19th century, a series of approaches aiming to estimate stature
[Rol88] (1888). Using the raw data of Rollet, Manouvrier [Man92] (1892) introduced another
method for stature estimation, by determining the average value of statures of individuals
sharing the same length of a given bone.
Introduced by Karl Pearson [Pea99] (1899), the regression formulae virtually replaced the
abovementioned methods, and subsequent studies performed by Stevenson [Ste29] (1929),
Telka [Tel50] (1950), and Dupertuis and Hadden [DH51] identified a limitation regarding the
inter-populational applicability of these formulae.
An approach considered to be a milestone with respect to stature prediction was intro-
duced by Trotter and Gleser [TG52] (1952), which is also the source of the data used in the
present thesis. Different formulae based on measurements of important bones in the human
body such as the femur or tibia were computed based on the Terry Collection [Ter15] of
both European-American and African-American male and female osteological remains. The
regression formulae were tested on actual data obtained from the military with signi cant
accuracy rates. Many of the subsequent approaches, like the one introduced by Jantz and
Ousley in [OJ05] (2005), proposed formulae based on the ones introduced by Trotter and
Gleser [TG52].
An anatomical method that makes use of the sum of all skeletal elements that directly
influence the living stature was proposed by Fully in [Ful56] (1956). Soft tissue corrections
were added to these so-called skeletal heights to derive living statures. This correction was
done by simple addition or via a regression approach, and provided a superior performance
compared to the methods that exclude soft tissue corrections.
Lundy performed in [Lun85] (1985) a study comparing the anatomical and the mathemat-
ical methods for stature prediction. Based on the experimental results, the usage of anatom-
ical methods is recommended in cases where skeletal remains are sufficiently complete.
Otherwise, if this constraint is unfulfilled, the suggested approach remains the mathematical
one.
Trying to alleviate the difficulties arising from the cross-application of stature regres-
sion equations on populations having different body proportions, Sjovold proposed in [Sjo90]
(1990) a new regression-based method called line of organic correlation for estimating stature
from stature/long bone proportions, considering that virtually the same slopes characterise

the regression lines of distinct populations. The proposed method uses major axis regression
instead of the most commonly used ordinary least squares method. The independence with
respect to sex and ethnic group constitutes an advantage of this approach.
An alternative method, based on the femur/stature ratio, was later rediscovered, being
presented by Feldesman in [Fel96] (1996). The fact that the femur/stature ratio is a sex-
and-ethnicity-independent ratio represents a significant advantage, along with a small degree
of improvement in performance (except, however, in the particular cases when equations
specifically designed for the target population have been derived).
Kozak analyzed in [Koz96] (1996) several stature prediction methods (Fully and Pineau
[FP60], Pearson [Pea99], Trotter and Gleser [TG52], Gralla [Gra76], Sjovold [Sjo90], Bre-
itinger [Bre38], Dupertuis [DH51]) on skeletal materials from 9th to 19th century Poland.
In that study, 4400 individuals from 43 skeletal populations were used, divided into rural,
urban, and city dwellers, and members of the higher levels of society. The analysis of [Koz96]
is divided into two steps: first, the accuracy of the methods was studied on samples from
the island of Ostrów Lednicki dating back to the 10th–14th centuries. The authors compared
the estimations of the humerus, radius, femur, and tibia of 50 male and 50 female skeletons
given by the application of seven methods with the estimations provided by the anatomical
method of Fully and Pineau. The second step consisted of applying the stature reconstruc-
tion methods of Dupertuis and Hadden (1951) [DH51], Breitinger (1937) [Bre38], and Gralla
(1976) [Gra76] on the 4400 individuals from Poland.
Konigsberg et al. contrasted in [KHJJ98] the Bayesian and maximum likelihood ap-
proaches for estimating stature. It has been shown that inverse calibration (regression of
stature on the lengths of the bones) is preferable when the evaluated individuals
originate from the same stature distribution as the one represented within the archaeological
data sample from which the regression equation is calculated. If this constraint cannot be en-
sured, then the classical calibration approach, which obtains the stature by inverting the
regression of bone lengths on stature, is recommended.
Belcastro and Facchini (2001) addressed in [BF01] the problem of analysing some morpho-
metric features of the horsemen recognised in the medieval necropolis of Vicenne compared
with those of the rest of the population. The aim of the study was to investigate whether, besides
the cultural features, there are other anthropological characteristics different from those of
the other males. The comparison was also performed in terms of stature, with the methods pro-
posed by Manouvrier and Pearson, as well as the method proposed by Telka [Tel50], being
comparatively used. As a result, heterogeneity was observed both among the horsemen and
among the other males, with the different methods leading to sufficiently different estimations to
reconfirm the problem of inter-populational applicability.
Raxter et al. in [RR06] (2006) tested and detailed the applicability of the anatomical
method proposed by Fully [Ful56]. The data set exploited for these purposes consisted of
119 individuals from the Terry Collection available at the National Museum of Natural His-
tory, Smithsonian Institution. The individuals resided in the St. Louis area, their mean age
being 54 years. The experimental results showed a correlation coefficient of 0.96 with living
statures, as well as a systematic underestimate averaging 2.4 cm. As a consequence, the
authors proposed adjusting the soft tissue correction factors. With the application of
the proposed formulae, the directional bias was removed and the errors were within 4.5 cm for
95% of the cases.
One year later, through a technical note [RA07], Raxter et al. highly recommended the
use of the formulae including an age-dependent correction term rather than the one
inherently incorporating the average age of the sample data. To highlight the potential problem
of using the age-independent alternative, a subset of the data previously
exploited by Raxter in [RR06] was reconsidered. The subset consisted of 48 adults with ages
between 21 and 49 years, averaging 38 years. The experimental results showed that the estimation

based on the Fully method without the age term performed significantly less effectively owing
to a systematic underestimate of the living stature.
Auerbach examined in [Aue11] (2011) methods to estimate the missing measurements
from the skeletons and to include these estimations into anatomical stature reconstruction
using the revised Fully anatomical method for estimating stature. The author provided rules
and formulae for estimating missing skeletal dimensions, thus extending the application
of the revised Fully method. Menezes et al. performed in [MNM+11] (2011) a study using
linear regression formulae for approximating the stature of South Indian females from the
length of the sternum.
Ruff et al. introduced in [RHN+12a] (2012) equations for stature prediction based on
samples ranging in dates from the Mesolithic to the 20th century. After the anatomical re-
construction of stature is achieved, the data are used to obtain formulae for stature estimation
based on long bone lengths. The obtained equations are applicable to European Holocene
skeletons. In 2012, Pomeroy and Stock in [PS12] evaluated the accuracy of existing formulae
for stature estimation on skeletal remains from the central Andean coast and highlands
of South America. The authors proposed new sample-specific regression equations, showing
that the existing formulae are not clearly appropriate.
Research on the living standards in the Roman Empire was conducted by Klein Gold-
ewijk and Jacobs in [KGJ13] (2013). The authors collected data on thousands of skeletal
remains dated between 500 BCE and 750 CE. Their methodology involves using existing ap-
proaches to estimate the long bone length proportions and comparing those to the recorded
measurements. The authors consider this approach superior to pure stature reconstruction.
Panel regression formulae for estimating stature in immature human skeletons were intro-
duced by Schug et al. in [SGC+13] (2013). The proposed formulae were applied to a cadaver
sample from Franklin County, Ohio, and to a large sample of immature skeletons from diverse
populations. The obtained results indicated that panel regression formulae provide accurate
stature estimates in immature skeletons, without needing an estimate for age at death.
A predictive regression equation for stature prediction was also recently proposed by
Pal and Datta in [PD14] (2014). The equation was derived from multiple linear regression
analysis using both the radius length and the age, with the values for these two attributes
being obtained from healthy adult Bengali men. The achieved performance has surpassed
that of the previously proposed methods.
Vercellotti et al. analyzed in [VBA+14] (2014) the interrelationship between stature and a
variety of biocultural factors. Information was collected for living samples from South Amer-
ica and a study was conducted for detecting differences in stature across different archaeological
populations.
The study of the previously reviewed literature on stature estimation highlights the im-
portance of a priori knowledge regarding the studied individuals (within the population they
are part of) for a more accurate establishment of stature. Anthropological factors such as age,
sex, and race influence most of the proposed regression formulae (such as the ones introduced
by Trotter and Gleser [TG52]). The age factor, unlike sex and race [RR06], significantly
improves the accuracy of anatomical methods (as shown, for instance, by Raxter et al. in
[RA07]), which require, however, nearly complete skeletons. Geographical, natural, and his-
torical factors affect the overall stature trend of a population, whereas social and political
factors inside a society [BR14] lead to intra-populational heterogeneity of stature with respect
to socio-political strata. This heterogeneity also impacts the stature estimation, as certain
methods achieve better performance when applied on superior social strata, whereas others
provide more accurate estimations on inferior ones [Koz96].
The limitations of the existing methods (both mathematical and anatomical) for stature
approximation, arising primarily from the lack of prior knowledge (such as age, sex, race,
and complete skeleton) may be overcome with intelligent systems capable of learning stature

estimation formulae based on the particularities of any given training population.
1.2 Body mass estimation in bioarchaeology
Estimating the body mass from human skeletal remains represents a problem of major
importance in paleontological and archaeological research. There is an agreement in the
bioarchaeological literature that postcranial measurements are directly related to the body
size and provide the most accurate estimates [B.00].
1.2.1 Motivation
Estimation of the body mass from human skeletons represents a challenge in forensic death
investigations concerning unidentified remains [Moo08]. A major problem in research related
to body mass estimation is caused by a lack of publicly available benchmarks. The existing
methods for body mass estimation from skeletons are: mechanical, morphometric [AR04],
and a combination of biomechanical and morphometric methods [Moo08]. The morphometric
methods consist of directly reconstructing the body size from the skeletal elements, while the
mechanical methods provide functional associations between skeletal elements and body mass
[AR04].
Body mass estimation is a very important problem for modern archaeology. It can pro-
vide knowledge about past populations, such as indicators of [RHN+12b]: the past
population's health, the effects of different environmental factors on past populations (e.g.
subsistence strategy, climatic factors), social aspects etc. The ability to obtain accurate body
mass estimates from skeletons is also essential in forensic death investigations concerning
unidentified skeletal remains [Moo08]. Consequently, it is essential for bioarchaeologists to
develop and use body mass estimation methods that are as accurate as possible.
However, designing an accurate method for solving this problem remains a great chal-
lenge, because there are many factors which should be taken into account [AR04]. Some
decisions that have to be made in this process are [RHN+12b]: which are the most relevant skele-
tal measurements to use, which is the appropriate statistical approach to apply, and which
skeletal sample should be used, etc.
1.2.2 Literature review
In the following, a brief review of the recent human body mass estimation literature is
given. Most of the existing statistical methods for body mass estimation use linear regression
formulas that usually consider one or a few bone measurements. These formulas are usually
developed on particular data sets, and it is questionable whether or not they would perform
well on previously unknown data.
A comparison between several body mass estimation methods was conducted by Auerbach
and Ruff in [AR04] (2004). The authors proposed to test some existing methods on a great
variety of subjects. They used the skeletal remains of 1173 adult skeletons of different origins
and body sizes, both males and females. Three femoral head-based regression formulas were
tested and compared on the considered skeletal sample: Ruff et al. [RSL91] (1991), McHenry
[McH92] (1992) and Grine et al. [GJTP95] (1995). The authors concluded that for small body
sizes (Pygmoids), the formula of McHenry (1992) can provide a good body mass estimation.
For very large body sizes, the formula introduced by Grine et al. in [GJTP95] should be
used, whereas for the other samples the formula of Ruff (1991), or the average of the three
techniques, would be the best approach.
Ruff et al. provided in [RHN+12b] (2012) new equations for estimating the body mass
that are generally applicable to European Holocene adult skeletons. The equations for ap-
proximating the body mass were based on femoral head breadth. 1145 skeletal specimens were

collected from European museums, from various time periods (from the Mesolithic until the 20th
century) [RHN+12b]. On these data sets, the regression formulas introduced in [RHN+12b]
provided better results than the previous formulas from Ruff et al. [RSL91] (1991), McHenry
[McH92] (1992) and Grine et al. [GJTP95] (1995).
It has to be mentioned that most researchers in the bioarchaeological field
develop regression formulas for body mass estimation based on a data set which is also used
for testing the developed formulas, without using a testing set independent from the training
data or any type of cross-validation. This may lead to overfitting, as happened for the
regression formulae from the literature which provided good performance on the data they
were trained on but performed worse when applied to unseen test data.
We consider supervised machine learning-based models to be a good alternative to existing
methods for body mass estimation, since they can be retrained on new data sets easily.
Moreover, particular techniques to avoid the overfitting problem can be used for developing
models that are likely to generalize well on unseen data.
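As a minimal sketch of this point (assuming scikit-learn and using synthetic, hypothetical femoral head measurements rather than any data set analysed in this thesis), the gap between the error measured on the training data itself and the cross-validated error indicates how much a fitted formula overfits:

# Minimal sketch: training error vs. cross-validated error for a body mass
# regression formula. The data below are synthetic placeholders, not the
# skeletal measurements used in this thesis.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
femoral_head_breadth = rng.normal(45.0, 3.0, size=80)                    # mm (hypothetical)
body_mass = 2.2 * femoral_head_breadth - 30 + rng.normal(0, 4, size=80)  # kg (hypothetical)

X = femoral_head_breadth.reshape(-1, 1)
formula = LinearRegression().fit(X, body_mass)

# Error on the data the formula was fitted on (optimistic) versus a
# 10-fold cross-validated error (estimate of performance on unseen data).
train_mae = mean_absolute_error(body_mass, formula.predict(X))
cv_mae = -cross_val_score(LinearRegression(), X, body_mass,
                          cv=10, scoring="neg_mean_absolute_error").mean()

print(f"MAE on the training data:    {train_mae:.2f} kg")
print(f"10-fold cross-validated MAE: {cv_mae:.2f} kg")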
1.3 Age at death estimation from long bone lengths
Age at death estimation is the problem of finding a good estimate for the age at death
of an individual from their biological remains. These remains can be bones, teeth, a well-preserved
body etc. In our case, we consider long bone lengths and are interested in finding, using machine
learning, a mathematical function of these lengths that gives us the age at death of the
instance whose bone lengths are input to the function.
Estimating age at death is essential for determining population demographics and per-
forming individual analysis of human remains. For children and young adults, the problem
is easier due to the fact that growth and biological development occur during these stages
[CD13].
The age at death estimation problem is a complicated one for adult skeletons. This is
due to the fact that age indicators are biologically variable and that certain features respond
differently to environmental factors over the course of someone's life, so many more individual
differences that do not help to pinpoint an exact age (noise) accumulate for older individuals
[CD13].
Because the problem is so complex, there are no exact algorithms for solving it. Simple
regression formulas that take one or a few bones into consideration exist, but they are usually
created for each particular data set, and it is unclear whether they would work well on other
data. By using machine learning, we can build models that are easy to retrain on new
data sets if needed. By using validation methods, we can estimate how well our algorithms
will perform on unseen data of the same kind.
In this thesis, we deal with age at death estimation in multiple age categories, but we
advise the reader to interpret our results relative to what is known in the literature as being
feasible to achieve for each age group in particular.
1.3.1 Literature review
A brief review of age-at-death estimation from long bone lengths is provided in the
following.
The study in [MUC+07] presents a comparison between four different approaches for
estimating age at death. The authors also proposed Principal Components Analysis (PCA)
for a combination of these four methods. The research was conducted on 218 skeletons
uniformly distributed in terms of age, sex and race. The study was carried out both on the
full sample and separated by age group (25–40 years, 41–60 years, >60 years), race and
sex. It was observed that the most accurate method depended on the target group. On the

full set, the most precise method was PCA, obtaining a mean absolute error of 6.7 years.
They also observed that the worst results (10–16 years, depending on the method) were obtained
on the elders group (>60 years).
In [RGM+07], it is shown that the best method to detect age from skeletons for subadults
is mineralisation of teeth, but for other categories, long bone length and epiphysis devel-
opment is used, and for adults, the best method is aspartic acid racemisation – a chemical
method [RTC02]. Usually, a specific method is used for a specific age category, and it is
impossible to find a method that works well for all categories. For adults, it is more difficult
to estimate age: the estimation error grows with age. Using the aspartic acid racemisation
method, Rösing and Kvaal obtained in 1998 results with an average mean error of estimate
of 2.1 years. While this is better than what we obtained, our method is strictly mathematical.
In [Sak08], the author used a method to determine the sex of the subjects and then used
the same approach for calculating the age at death. This method is based on measurements
of the patella bone. The author tried to apply the procedure on 283 Japanese skeletons, 183
males and 100 females. The method consists of trying to find a discriminant function for
sex determination and a method for estimating age at death. Only measurements of patella
were used. It was observed that lipping on this bone develops gradually. Three stages
(young, middle-aged and old) which can be used for age estimation were identified. The
approach was tested on 26 young adult, adult and old adult subjects and the results were not
satisfactory, obtaining a mean absolute error of 8.6 years. It was concluded that there are
better approaches such as those based on pubic symphysis or cranial sutures. Nevertheless,
the proposed method could be used to classify subjects in the previously mentioned three
stages.
In [SDIL94], experiments were performed on 328 instances, with ages between 15 and
97 years, 90.5% of them of European ancestry. The obtained standard error of the estimate was
11.13 for males, 9.77 for females and 10.70 for females and males combined, all above our own results.
The review in [CBM+09] presents the authors' experience with age estimation on different
age groups, from fetuses to elders, both for living and dead individuals. In it, an explanation
is given for why age estimation methods applied on older groups are less accurate: because
for these groups, the discrepancy between chronological age and physiological age is larger.
For subadults (fetuses, newborns, children and adolescents) and the transition phase, it is stated
that dentition is the most reliable method for age estimation, from its formation during the
fetal state to calcification and eruption during childhood and adolescence. Also, development
of the skull and hands are good indicators. However, especially during puberty, gender is
also important because it is known that girls tend to develop faster than boys. In case of
age estimation on adults, dentition and skeletal development are not so reliable anymore.
Age estimation is based on skeletal degeneration during aging. Also, the authors mention
the Lamendin strategy, which, although considered one of the most precise methods for
estimating the age of individuals over 40 years old, provides an underestimation of the real
age with a mean error of 19 years. On our data set, we obtained better results than this.
In [PFSL12], multiple formulas, of the form L · x + b, are presented for measurements L of
a single bone. Formulas are presented for the clavicle, humerus, ulna, radius, femur, tibia,
fibula and some sums of these. Although these formulas were created with a different data
set in mind, they also target subadults, so we decided to try them on our own subadults
subset. The lowest mean absolute error obtained was 1.788, which is above both our SVR
and LWR mean absolute error for the subadults case study.

Chapter 2
Contributions to machine learning
models for archaeology
The first part of this chapter presents the learning models, our methodology and other
complementary algorithms that we have used for solving the problems detailed in Chapter 1.
The second part presents the computational results obtained. Our models were introduced
in the original research papers [CIMM16, ITV16, ICT16, IMMC15, Ion15b] published in
scienti c journals and conferences. The following chapter is based on these publications.
2.1 Proposed machine learning models
In this subchapter, we present the proposed machine learning models for solving the
archaeological tasks detailed in Chapter 1.
2.1.1 Genetic Algorithms
We have applied Genetic Algorithms (GAs) for finding better stature estimation formulas
given long bone lengths [CIMM16]. The GA model we propose is based on the stochastic
acceptance technique [LL12].
Genetic Algorithms represent a broad class of algorithms used for solving optimization
and search problems. GAs model the biological processes of natural selection and evolution,
whereby the fittest individuals of a population survive longer, and the population as a whole
adapts to changes in the environment over successive generations [Mel99].
Although there is no proven or agreed-upon standard approach that performs well on
all problems, GAs usually adopt the following methodology: a list of candidate solutions
(individuals) is maintained, with all solutions initially being randomly generated (the population).
Afterwards, each individual in the population is given a fitness value, which tells
us how good that individual (candidate solution) is at solving our problem. The population
is iteratively evolved for a number of generations using genetic operators (like crossover and
mutation) until acceptable solutions are found [Mel99].
The exact manner in which genetic operators are used and how the population is
maintained is an issue best dealt with on a problem-by-problem basis.
Since GAs are known to be useful in solving search problems, it is only natural to try
applying them to certain supervised learning problems.
Our proposed GA attempts to evolve a real-valued vector $w$ that represents the coefficients
of a linear regression model, as shown in Formula (2.1).

$Prediction(s_i) = \sum_{k=1}^{m} w_k \cdot s_{ik}$    (2.1)

In Formula (2.1), $m$ represents the number of features of instance $s_i$ and $s_{ik}$ is the value
of the $k$-th feature of the instance $s_i$. In our case, the instances are skeletons and the features
are measurements of long bones. Experiments revealed that a bias term leads to slightly
worse results, so one is not used.
An individual from the GA population is a real-valued vector whose size is the number of
features used for our problem. The dot product between such a vector and an instance gives
us the prediction for that instance. Note that, so far, this does not necessarily represent a final
prediction, since we could be working with preprocessed data. The reverse operations have
to be performed to obtain the final predicted values in case the data has been preprocessed,
which is usually the case.
We have defined our fitness function for a given individual as in Formula (2.2), where $n$
represents the number of training instances, $m$ is the number of features, $target_i$ is the real
(preprocessed) target value of the $i$-th training instance (i.e. $s_i$) and $c_j$ is the $j$-th component
of the vector (individual) $c$ (the one that needs to be multiplied by the value of the $j$-th
feature in our instances).

$fitness(c) = \sum_{i=1}^{n} \left| \sum_{j=1}^{m} c_j \cdot s_{ij} - target_i \right|$    (2.2)
Therefore, the lower the fitness value, the better that individual is at predicting targets
(because the sum of errors is smaller). Our stochastic acceptance implementation involves
copying each individual $c$ to the new generation with probability $\frac{fitness(M)}{fitness(c)}$, where $M$ is
the fittest individual (the one with the lowest fitness value). Thus, the fittest individual will always
survive to the next generation. Then, parents are randomly selected and accepted with
the same probability measure to produce offspring (using the single point crossover genetic
operator) until the new generation has the maximum number of individuals in it. Finally,
each individual in the new generation is subjected to mutation with a certain mutation
probability.
It was experimentally determined that the absolute error used in Formula (2.2) provides
better results than other error metrics, such as the squared error.
Once the algorithm stops, the components of the fittest individual give us the values
of the vector $w$ that will be used for stature estimation (see Formula (2.1)).
Given this model, our GA can attempt to achieve any precision we desire, even zero error,
where precision is defined as the sum of absolute differences between predicted values and
real targets. Naturally, this is not feasible in practice, so we will use more realistic precision
values in our experiments. For example, a precision of $p = 0.3$ will cause the algorithm to
return the first vector for which the sum of absolute differences on the preprocessed training
data is at most 0.3. If no such vector is evolved within a given number of generations, the
best one found is returned and the search stops.
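A minimal NumPy sketch of the evolutionary loop described above is given below. It is an illustration only: the function and parameter names are ours, the default values are placeholders rather than the hyperparameters found by our grid search, and X is assumed to hold the preprocessed feature matrix, with targets holding the preprocessed statures.

import numpy as np

def fitness(c, X, targets):
    # Formula (2.2): sum of absolute errors of the linear model with coefficients c.
    return np.abs(X @ c - targets).sum()

def evolve(X, targets, pop_size=100, generations=200, mut_prob=0.3,
           mut_coef=1.0, init=0.5, precision=0.3, seed=0):
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    pop = rng.uniform(-init / 2, init / 2, size=(pop_size, m))
    for _ in range(generations):
        fits = np.array([fitness(c, X, targets) for c in pop])
        best = pop[fits.argmin()].copy()
        if fits.min() <= precision:                 # desired precision reached
            return best
        accept_p = fits.min() / fits                # stochastic acceptance probabilities
        new_pop = [best]                            # the fittest individual always survives
        for i in rng.permutation(pop_size):         # survivors are copied stochastically
            if len(new_pop) < pop_size and rng.random() < accept_p[i]:
                new_pop.append(pop[i].copy())
        while len(new_pop) < pop_size:              # fill up with offspring (1-point crossover)
            i, j = rng.integers(pop_size, size=2)
            if rng.random() < accept_p[i] and rng.random() < accept_p[j]:
                point = rng.integers(1, m) if m > 1 else 0
                new_pop.append(np.concatenate([pop[i][:point], pop[j][point:]]))
        pop = np.array(new_pop)
        for i in np.where(rng.random(pop_size) < mut_prob)[0]:
            pop[i, rng.integers(m)] += rng.uniform(-mut_coef / 2, mut_coef / 2)
    fits = np.array([fitness(c, X, targets) for c in pop])
    return pop[fits.argmin()]

The returned vector plays the role of $w$ in Formula (2.1): its dot product with a (preprocessed) instance gives the predicted (preprocessed) stature.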
2.1.2 Support Vector Regression
We have applied Support Vector Regression for stature, age at death and body mass
estimation [ITV16, ICT16, Ion15b].
Support vector machines are models originally introduced for classification [CV95], but
they have also seen success in regression. The regression method is called $\varepsilon$-support vector
regression ($\varepsilon$-SVR), due to the existence of an extra hyperparameter $\varepsilon$ that controls the level
of error allowed by the algorithm.
The SVR model will be used in a supervised learning scenario, because we want to learn
a mathematical function that is able to estimate the age at death when presented with some
bone measurements.

The SVR algorithm solves the problem in Formula (2.3), where $x_i$ is a training instance,
$y_i$ is its target, $b$ is a bias term, $w$ is a weights vector and $C$ is a regularization hyperparameter.
The variables $\xi_i^-$ and $\xi_i^+$ are "slack variables" that allow for errors in the learning process,
in order to make the problem feasible [SS04]. Various kernel functions can be introduced in
order to obtain non-linear regression surfaces [CV95].

$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i^- + \xi_i^+)$
subject to
$y_i - (w \cdot x_i + b) \le \varepsilon + \xi_i^-$
$(w \cdot x_i + b) - y_i \le \varepsilon + \xi_i^+$
$\xi_i^-, \xi_i^+ \ge 0$    (2.3)
Formula (2.3) can be manipulated in various ways in order to make computation faster.
Various kernels can also be applied, resulting in a nonlinear learning model, but potentially
introducing new hyperparameters. Such kernels are:
1. Linear: $K(p, q) = \langle p, q \rangle = p \cdot q$. Results in a linear model, which will find a regression
hyperplane, the simplest kind of model.
2. Radial Basis Function (RBF), or Gaussian: $K(p, q) = \exp(-\gamma \|p - q\|^2)$. With the
RBF kernel, the algorithm is very sensitive to the $\gamma$ hyperparameter. An overly large
value will result in all applications of the kernel being very close to zero, so each
instance will only influence itself and the model will be very likely to overfit. Too low
a value will overconstrain the model, resulting in behavior similar to the linear kernel.
3. Polynomial: $K(p, q) = (\gamma \langle p, q \rangle + c)^d$. Attempts to fit a polynomial to the data.
4. Sigmoid: $K(p, q) = \tanh(\gamma \langle p, q \rangle + c)$. Inspired by the logistic units of artificial neural
networks, this kernel can make Support Vector Machines behave similarly to Artificial
Neural Networks.
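As an illustration of how such a model can be instantiated in practice, the sketch below uses the SVR class from scikit-learn (which wraps libsvm). The bone measurements and statures are toy numbers, and the kernel and hyperparameter values are placeholders, not the ones selected by the grid searches described later.

import numpy as np
from sklearn.svm import SVR

# One row per skeleton (toy long bone lengths in cm), one known stature per row.
X = np.array([[33.5, 24.8, 26.7, 46.2, 38.1, 37.4],
              [30.1, 22.0, 23.9, 42.5, 34.9, 34.2],
              [32.0, 23.5, 25.1, 44.0, 36.3, 35.8]])
y = np.array([172.0, 158.5, 165.2])

# epsilon-SVR with an RBF kernel; C, epsilon and gamma should be tuned by cross-validation.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=0.05)
model.fit(X, y)
print(model.predict(X))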
2.1.3 Locally Weighted Regression
We have applied Locally Weighted Regression (LWR) for estimating stature in archaeology
[IMMC15].
Locally weighted regression is a form of instance-based learning (also known as lazy learning
and memory-based learning), a family of machine learning methods that delay the processing
of labeled data until the arrival of a query that needs to be processed. Once presented
with a query, the algorithm computes the answer by ranking the stored labeled data by relevance
to the query point. Relevance is usually calculated by using a distance function, with
points that are farther away from the query point being less relevant [Mit97, CGA97].
Locally weighted regression involves fitting local linear models to nearby data for each
query point. Nearby data can refer either to a fixed number of labeled in-memory points that
are close to the query point, or to all the available training data instances weighted such
that points that are closer to the query are given more relevance [CGA97]. In our research,
we considered the latter approach, where all training data is used to answer a query.
An LWR model can be defined by at least two parameters: a distance function, which
specifies the distance metric to be used in computing relevance, and a kernel function, which
specifies how exactly the distance values are used. Common choices for the two are the
Euclidean distance and the Gaussian kernel, shown in Formula (2.4) [CGA97]. Depending on

how we plan to solve the underlying linear regression problem, we can have other parameters
as well, such as a learning rate for gradient descent.

$K(d) = \exp(-\gamma \cdot d)$    (2.4)

In Formula (2.4), $\gamma$ is a positive hyperparameter whose optimal value can be obtained by
cross-validation.
We consider in the following that there are $n$ training instances and $m$ features describing
the instances.
Once we decide on a model, let our training instances be the rows of a matrix $X$ (with
$n$ rows), with the first column consisting of the constant 1 (for the constant regression term,
so the matrix has $m + 1$ columns), and let our known outputs be stored in a
column vector $y$ of length $n$. Once presented with a row vector $q$ (to which we also prepend
a 1) that represents a query, we run the following steps for each labeled training instance
$(X_i, y_i)$ [CGA97]:
1. Compute $w_i$, the weighting coefficient for sample $i$ ($i = \overline{1, n}$), as the square root of the
kernel of the distance between the labeled instance $X_i$ and the query point $q$ (Formula
(2.5)), ignoring the constant terms in this computation.

$w_i = \sqrt{K(d(X_i, q))}, \quad i = \overline{1, n}$    (2.5)
2. Multiply each row $i$ of $X$ ($1 \le i \le n$, excluding the first column, corresponding to the
constant term) by $w_i$ and store the result in a new matrix $Z$ (Formula (2.6)). The first
column of matrix $Z$ is still all ones.
3. Multiply $y_i$ by $w_i$ and store the result in a new vector $v$ (Formula (2.6)).

$Z_{i,j} = X_{i,j} \cdot w_i, \quad Z_{i,1} = 1, \quad v_i = y_i \cdot w_i, \quad i = \overline{1, n}, \; j = \overline{2, m+1}$    (2.6)

After these steps are completed, we solve a standard linear regression problem for $(Z, v)$
and use the resulting regression line to predict the answer for the given $q$. We employ the
classic pseudoinverse solution for this, as shown in Formula (2.7).

$\hat{v} = \left(Z^T Z\right)^{-1} Z^T v = Z^+ v$    (2.7)
Locally weighted regression has the advantages of being less affected by changing data
patterns while maintaining most of the simplicity of linear regression.
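The steps above can be condensed into a short NumPy sketch. It is a minimal illustration assuming the Gaussian kernel of Formula (2.4); the function name and the toy data are ours.

import numpy as np

def lwr_predict(X, y, query, gamma=1.0):
    # X: (n, m) training features, y: (n,) targets, query: (m,) feature vector.
    n = X.shape[0]
    Xb = np.hstack([np.ones((n, 1)), X])          # prepend the constant-term column
    qb = np.concatenate([[1.0], query])
    d = np.linalg.norm(X - query, axis=1)         # Euclidean distances, constant term ignored
    w = np.sqrt(np.exp(-gamma * d))               # Formula (2.5): sqrt of the kernel of the distance
    Z = Xb * w[:, None]                           # Formula (2.6): weight every row...
    Z[:, 0] = 1.0                                 # ...but keep the constant column equal to 1
    v = y * w
    coef = np.linalg.pinv(Z) @ v                  # Formula (2.7): pseudoinverse solution
    return qb @ coef

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
y = np.array([3.0, 3.0, 6.0])
print(lwr_predict(X, y, np.array([2.0, 2.0]), gamma=0.5))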
2.1.4 Data Preprocessing
An important first step in any learning task is the data preprocessing step. This ensures
that the data is in a format that will help the actual learning algorithms fit it better. The
preprocessing step refers to filling in missing values for features, manipulating the feature values
so that they have certain statistical properties (statistical normalization), removing features
that can be statistically shown not to facilitate learning (feature selection), and eliminating
outliers.

Statistical normalization
For statistical normalization, we consider two methods.
1. Standardization: Formula (2.8). Normalizes the data such that it has zero mean and
unit variance. Support Vector models are generally known to perform better in this
case, and we have found this to be true for our case studies as well.

$X_{standardized} = \frac{X - \mu}{\sigma}$    (2.8)

In the above formula, $\mu$ represents the mean of $X$ and $\sigma$ is the population standard deviation
of $X$.
2. Feature scaling: Formula (2.9). Proportionately scales each value such that it falls in the
interval $[0, 1]$. In some cases, we have found this approach to perform better.

$X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}$    (2.9)
Note that normalization is performed on each feature individually: first the measurements
$f_{1,1}, f_{1,2}, \ldots, f_{1,n}$ of feature $f_1$ are normalized, then those of feature $f_2$, and so on.
A priori domain knowledge can be incorporated into Formula (2.9) by providing constant
values for $X_{max}$ and $X_{min}$, either for each feature individually or globally across the data set.
This has the potential to make scaling new (unseen) data less error-prone in the prediction
stage of the algorithms, because there will be less risk of the $[X_{min}, X_{max}]$ interval changing
for new test data.
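In practice, both normalizations can be obtained with scikit-learn's scalers; a brief sketch with toy measurements (the values are illustrative only):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[33.5, 46.2], [30.1, 42.5], [31.8, 44.0]])   # toy bone measurements (cm)

X_std = StandardScaler().fit_transform(X)      # standardization, Formula (2.8)
X_scaled = MinMaxScaler().fit_transform(X)     # feature scaling into [0, 1], Formula (2.9)

print(X_std.mean(axis=0), X_std.std(axis=0))   # approximately 0 and 1 per feature
print(X_scaled.min(axis=0), X_scaled.max(axis=0))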
Feature selection
For feature selection we are interested in a statistical relationship between each feature
and the targets for learning. We have used the Pearson Correlation Coefficient (PCC) for
performing feature selection, whose computation is given in Formula (2.10) [Tuf11]. The
thresholds for keeping and removing features will be given for each case study in its respective
section.

$PCC(a, b) = \frac{\sum_{i=1}^{n} (a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_{i=1}^{n} (a_i - \bar{a})^2} \sqrt{\sum_{i=1}^{n} (b_i - \bar{b})^2}}$    (2.10)
The Pearson correlation is a statistical measure expressing the linear relationship between
two variables, indicating the extent of correlation between them. A value of 0 for the
Pearson correlation between two variables suggests the absence of a linear relationship between
the two variables. A value of 1 or $-1$ suggests the existence of a (negative, in the case of $-1$)
linear correlation between the variables.
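A short sketch of this feature selection step follows; the data and the keep/remove threshold are hypothetical, the actual thresholds being given per case study.

import numpy as np

def pcc(a, b):
    # Pearson correlation coefficient, Formula (2.10).
    a, b = a - a.mean(), b - b.mean()
    return (a * b).sum() / (np.sqrt((a ** 2).sum()) * np.sqrt((b ** 2).sum()))

X = np.array([[33.5, 46.2, 0.1], [30.1, 42.5, 0.9],
              [31.8, 44.0, 0.4], [34.2, 47.1, 0.2]])   # toy features (columns)
y = np.array([172.0, 158.5, 164.0, 175.5])             # toy target (stature)

threshold = 0.3                                        # hypothetical threshold
kept = [j for j in range(X.shape[1]) if abs(pcc(X[:, j], y)) >= threshold]
print(kept)                                            # indices of the retained features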
Dimensionality reduction
Dimensionality reduction is used for decreasing the number of features by projecting
them into a lower dimensional space, thus obtaining a new set of features that approximate
the original ones. This can help reduce overfitting and enables us to visualize our data in
two dimensions. By enabling us to visualize the data, we can visually remove outliers as

a preprocessing step and get an intuitive idea about how well our learning algorithms will
perform.
We have used Principal Components Analysis (PCA) [TB99, AW10] in order to perform
dimensionality reduction and to visualize our data sets. PCA reduces an $n \times m$ data set $X$ of
$n$ instances and $m$ features to a data set of $p < m$ features by performing the following steps.
1. Reduce each column of $X$ to zero mean by subtracting its mean from each element.
2. The new $n \times p$ data set $X_{new}$ will be constructed by building an $m \times p$ matrix $W$ such
that $X_{new} = X W$.
3. The matrix $W$ is computed by first building the covariance matrix $C$, as in Formula
(2.11).

$C = \frac{X^T X}{n - 1}$    (2.11)

4. Determine the eigenvectors and eigenvalues of the covariance matrix $C$. By ordering
the eigenvectors in descending order of their eigenvalues, we obtain the principal components,
which will be used as the columns of $W$.
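The steps above correspond to the following NumPy sketch; $p = 1$ is used here because the visualizations in this chapter reduce each instance to a single value (the data are toy numbers).

import numpy as np

def pca_reduce(X, p=1):
    Xc = X - X.mean(axis=0)                    # step 1: zero-mean each column
    C = (Xc.T @ Xc) / (Xc.shape[0] - 1)        # step 3: covariance matrix, Formula (2.11)
    eigvals, eigvecs = np.linalg.eigh(C)       # step 4: eigenvalues and eigenvectors of C
    order = np.argsort(eigvals)[::-1]          # order components by decreasing eigenvalue
    W = eigvecs[:, order[:p]]                  # keep the top-p principal components
    return Xc @ W                              # step 2: project, X_new = X W

X = np.array([[33.5, 46.2], [30.1, 42.5], [31.8, 44.0], [34.2, 47.1]])
print(pca_reduce(X, p=1).ravel())              # one value per instance, ready to plot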
2.1.5 Evaluation methodology
When evaluating our learning models, we must make sure that they generalize well
to unseen data. In order to do this, we perform cross-validation [Mit97]. The Leave-One-Out
method of cross-validation (LOO) and 10-fold cross-validation (10CV) were used.
In LOO, we select the first instance as a test instance, train our model on the remaining
$n - 1$ instances, and store the result of the model on the held-out test instance. We repeat
this another $n - 1$ times, such that each instance is picked exactly once as a test instance, and
report the average results.
By results, we usually refer to either the Mean Absolute Error (MAE) or the Standard
Error of the Estimate (SEE), as defined in Formula (2.12),

$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|, \qquad SEE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2},$    (2.12)

where $\hat{y}_i$ is the value predicted by our model and $y_i$ is the true value.
In 10CV, we first shuffle our data and then split it into 10 subcollections of sizes as close
to equal as possible. We train on 9 of the subcollections and test our model on the held-out part of
the data. We repeat this another 9 times, so that each subcollection acts as the test set
exactly once, and report the average results. This has the advantage of being computationally
faster than LOO. Usually, we let the existing literature dictate which method we use, so that our
comparisons are more relevant.
Because most of the time our models involve the use of random numbers (either for
initializing the weights of models or when shuffling the data), we will usually also repeat each
run of LOO or 10CV a number of times and provide a statistical analysis of the obtained
results, so that we increase the stability and relevance of our results.

When appropriate, we will also provide confidence intervals (CI) for our results. For
example, by calculating the 95% confidence interval of the mean of 20 10CV runs, we
obtain an interval in which future means of 10CV runs are 95% likely to fall [BCD01].
We can consider confidence intervals to be reliability indicators.
We mention that the scikit-learn machine learning library [PVG+11] is used to aid our
experimental evaluation.
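A condensed sketch of a single 10CV run that reports the MAE and SEE of Formula (2.12), using scikit-learn's KFold splitter, is shown below; the regressor and the randomly generated data are placeholders.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVR

def run_10cv(X, y, model, seed=0):
    maes, sees = [], []
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=seed).split(X):
        model.fit(X[train_idx], y[train_idx])
        err = y[test_idx] - model.predict(X[test_idx])
        maes.append(np.abs(err).mean())              # MAE, Formula (2.12)
        sees.append(np.sqrt((err ** 2).mean()))      # SEE, Formula (2.12)
    return np.mean(maes), np.mean(sees)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=50)
print(run_10cv(X, y, SVR(kernel="linear")))
# Repeating the call with different seeds and aggregating the results gives the
# multi-run statistics (mean, standard deviation, confidence intervals) reported later.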
2.2 Computational Experiments
This section presents computational experiments using genetic algorithms (GA), support
vector regression (SVR) and locally weighted regression (LWR) on three important bioarchaeological
problems: stature prediction, body mass estimation and age at death estimation. Our
machine learning models were described in Section 2.1, and the problems we have applied
them to were presented in Chapter 1.
The content of this section comprises research results published in scientific journals and
conferences [CIMM16, ITV16, ICT16, IMMC15, Ion15b].
2.2.1 Stature prediction
This section presents our computational experiments and results for the stature prediction
problem, conducted as described in Section 2.1.5. The stature prediction problem
was presented in Section 1.1.
We mention that the 20 runs of 10-fold cross-validation were considered only for the GA and
LWR models, and the obtained MAE and SEE values were averaged over all runs. For the
GA, the grid search was run only once, but the resulting hyperparameters were then used for
the 20 runs mentioned. The SVR experiments use a single run of cross-validation.
In [CIMM16], we introduced genetic algorithms for the problem of stature prediction
from human skeletons using long bone lengths. In [Ion15b] and [IMMC15], we
applied support vector regression and locally weighted regression, respectively, for the same problem, but
considering different data sets.
Data sets and case studies
The first data set we used in our experiments, and to which we applied genetic algorithms
in [CIMM16], is part of the Terry Collection Postcranial Osteometric Database, recently made
available at [Ter15]. This database represents data obtained from the Robert J. Terry Anatomical
Skeletal Collection. The individuals in the database represent raw data collected from
European-American and African-American individuals (most of them Americans) from the
19th and 20th centuries, with dates of birth between 1822 and 1943. We thank professor
David Hunt, who provided us with the cadaveric statures of the skeletons, since this information
does not exist in the original database from [Ter15].
For the stature estimation task, we used the lengths of the lower and upper limb long bones
[TG52] showing the highest correlation to stature. These measurements were previously used
by Trotter and Gleser in a paper that is a milestone for forensic analysis [TG52].
Since the data set from [Ter15] contains missing values for measurements, we extracted
from the database a subset of 183 instances for which the measurements of the humerus, radius,
ulna, femur, tibia, and fibula bones are given. For each bone, we considered the maximum
length between the left and the right bone; if only one measurement was available for a
bone (i.e., left or right), we used it. We note that, for each instance, the race and the sex of
the individual are also known.
The following case studies will be considered starting from the previously described data
set:

The first case study consists of the subset of instances corresponding to the European-American
males and females, characterized by six features representing the bone measurements
described above (labelled as HUM, RAD, ULNA, FEM, TIB, and FIB).
The second case study contains the subset of instances corresponding to the African-American
males and females. The features used for this experiment are the same six features
as in our first case study.
The third case study contains all European-American and African-American instances,
characterised by seven features: the measurements of the previously described bones,
together with a feature corresponding to the race of the individual. Thus, seven
features are considered, labelled as HUM, RAD, ULNA, FEM, TIB, FIB, and RACE.
The second data set, to which we applied support vector regression and locally weighted
regression in [Ion15b, IMMC15], is based on the same Terry Collection, but preprocessed
by Trotter, so it does not consist of the raw skeletal data [TG52]. It consists of seven bone
measurements for 92 caucasians and 92 afro-americans (47 males and 45 females for each
race), and was used in the stature estimation literature by Trotter and Gleser in [TG52].
The features characterizing the human skeletal remains are extracted from [TG52] and are
labeled as: HUM, RAD, ULNA, FEM, TIB, FIB, FEM+TIB and GENDER. HUM represents
the measurement of the humerus, while RAD and ULNA describe the radius and ulna, the most
important bones of the human forearm. FEM, TIB and FIB represent the lengths of the
most important bones of the human leg. FEM+TIB is the length of the entire
leg and GENDER represents the gender of the skeleton. We mention that for each bone
measurement, the maximum length between the left and the right bone was considered.
Two experiments were conducted starting from this data set:
The first case study we considered from the data set described above consists of the
subset of instances corresponding to the caucasian males and females.
The second case study consists of the subset of instances corresponding to the afro-
american males and females.
Data preprocessing
This step consists of normalizing the input data according to a traditional algorithm.
After data normalization, a statistically based feature selection is applied to determine a
subset of features (measurements) that are highly correlated with the target output (stature).
Through feature selection, we aim to reduce (if needed) the dimension of the input data
by eliminating the characteristics that do not have a significant influence on the stature
estimation process. The dependencies between the features are determined using the Pearson
correlation coefficient, as described in Section 2.1.4. A measure of linear correlation is used because
most of our models are also linear, so this is what matters most for the problem at hand.
Preprocessing the data for the subsequent application of the learning algorithms also involves
normalizing the stature according to the same algorithm used for normalizing the features.
Therefore, it is necessary to unscale the stature once a result is provided for the test
data, in order to obtain a meaningful result.
Moreover, as a data preprocessing step, we applied principal component analysis (PCA)
[TB99, AW10] to get a two-dimensional view of the data we were working with. By doing this,
we were able to visually eliminate outliers and therefore improve our results. This process
will be detailed for each case study we present results on.
After the training data is preprocessed as indicated above, the models (GA, SVR, LWR)
are built during the training step. For preprocessing, we normalize the input data by

subtracting, from each measurement, the mean of all the measurements of that type and
then dividing it by the standard deviation of all measurements of that type. The same is
done for the known statures we are training with. We have done this for all three of our
models, on each data set, as described in Section 2.2.1.
The GA, SVR and LWR learning models that we are using were described in Sections
2.1.1, 2.1.2 and 2.1.3, respectively.
The performance of the models built after training was evaluated during the testing step,
as described in Section 2.1.5.
Genetic algorithm results on the Terry data set
For the GA used in the experiments, we used a random grid search to obtain the best
hyperparameters. This has recently been shown to outperform other hyperparameter search
strategies [BB12a]. We ran the random grid search for 400 iterations, with each iteration
using a run of 10-fold cross-validation to measure the mean absolute error of a set of sampled
hyperparameters. The best set was kept after the 400 iterations. These hyperparameters
were then used to perform the 10-fold cross-validation runs from which we report our MAE
and SEE values.
The hyperparameters being searched for, and the sets from which they were sampled, are
given below.
Population size: sampled from the set $\{40, 80, 90, 100, 120, 140\}$.
Initialisation: each individual's coefficients are initialised with random values in $[-\frac{init}{2}, \frac{init}{2}]$,
where $init$ is sampled from $[0, 1)$.
Mutation probability: sampled from $[0, 1)$.
Mutation coefficient: the value $mt$ to mutate by is sampled from $[0, 2)$. If an individual
is selected for mutation (based on the mutation probability), a random value from
$[-\frac{mt}{2}, \frac{mt}{2}]$ is added to a random coefficient.
Search termination: if the desired precision has not been achieved after a specified
number of generations, the GA stops. The number of generations to stop after is sampled
from the set $\{200, 400\}$.
Precision: the precision is always fixed at 0.34 during the grid searches.
We use standard scaling (zero mean and unit variance) for our GA experiments.
First case study For the set of European-American individuals from the data set consid-
ered for evaluation, Figure 2.1 shows the correlations between the stature and our chosen
features.
All the features showed a good correlation with the target stature; thus they were kept
for building the GA learning model.
Figure 2.2 illustrates a two dimensional visualization of the data set, as obtained by
the PCA algorithm. The feature computed by PCA is represented on the horizontal axis,
while the vertical axis represents the known height of the instances. Visually, the data
show potential for machine learning applications, even when reduced to a single feature for
visualisation.
Using the PCA graph, we visually determined the instance at the approximate coordinates
of (23, 145) to be an outlier, since it is significantly outside the formation determined by the
rest of the instances.

Figure 2.1: Pearson correlation for the features on the set of European-American instances
Figure 2.2: PCA on the set of European-American instances
We used a GA as described in Section 2.1.1, with the settings presented in Section 2.2.1,
to evolve a formula for computing the stature of the European-American individuals from
the Terry Collection data set (Section 2.2.1).
The values for MAE and SEE measured for all the 20 runs of the GA using 10-fold cross-
validation, along with the minimum, maximum, mean value, standard deviation and 95%
confidence intervals for the mean, are presented in Table 2.1.
Measure   Min (cm)   Max (cm)   Mean (cm)        Stdev (cm)
MAE       1.6937     4.5972     2.9953 ± 0.37    0.841
SEE       2.1899     5.9051     3.9741 ± 0.5     1.1374

Table 2.1: MAE and SEE values obtained using the GA model on the set of European-American
instances, with 95% confidence intervals for the mean.
The obtained MAE and SEE values, separately for males and females, considering all the
20 runs, as well as the mean, standard deviation of the values and 95% confidence intervals
for the mean, are given in Table 2.4.
Formula (2.14) was obtained by training the GA model on the entire data set consisting
of European-American (EA) instances, using the best hyperparameters found in the random
grid search phase. This formula is applicable to bone measurements in centimeters and can
be applied as-is, resulting in a predicted stature (PS). Note that the actual GA generates another raw formula
that is applicable to the preprocessed data. The form presented here is obtained by reversing
the preprocessing of the data, which is why this form can be applied to actual measurements.

The equations in Formula (2.13) describe how to obtain a formula that can be applied
as-is on measurements in centimeters. The notations are: $f$ and $y$, the standardized features
and targets; $f^{cm}$ and $y^{cm}$, the features and stature in centimeters; $\sigma_y$, $\sigma_i$, $\mu_y$, $\mu_i$, the standard
deviations and means for the $i$-th feature and for the targets, used in the standardization
process. $c_i$ are the trained coefficients and $w_i$ are those that can be applied to the features in
centimeters.

If:
$\sum_{i=1}^{m} c_i f_i = y$
Then:
$\sum_{i=1}^{m} (w_i f^{cm}_i) + b = y^{cm}$
Where:
$w_i = \frac{\sigma_y}{\sigma_i} c_i, \qquad b = \mu_y - \sum_{i=1}^{m} w_i \mu_i$    (2.13)
$PS_{EA} = 1.43668623 \cdot HUM - 0.7146025 \cdot RAD + 0.46254589 \cdot ULNA + 0.60431697 \cdot FEM - 0.40105835 \cdot TIB + 1.55755349 \cdot FIB + 57.49507424$    (2.14)
The first two lines of Table 2.5 show the MAE and SEE values obtained by applying
Formula (2.14) on the entire data set, without performing cross-validation. Although these
values are not entirely relevant, since they show a training error rather than a test error
(which the values obtained through cross-validation show), we have decided to include them
for each case study for two reasons. First, this methodology is closer to the one Trotter
and Gleser used to report their errors, so by including our results for the same methodology
(even if, by machine learning standards, it is not very relevant), a better comparison can be
made. Second, because the cross-validation results are good and close to the results obtained
by applying this formula, it is to be expected that our new formula will generalise well, so
the results are not fully irrelevant. Moreover, since it is a linear formula, overfitting should
not be an issue.
Second case study Figure 2.3 illustrates the correlations between the target stature for
the set of African-American individuals and the six considered features (bone measurements).
The correlations between all the features and the output stature are large enough; consequently,
all six features are kept for building the GA learning model.
Figure 2.4 shows a view of the data set in two dimensions, as obtained by the PCA
algorithm. Once more, the data show potential for machine learning applications, even when
reduced to a single feature for visualisation.

Figure 2.3: Pearson correlation for the features on the set of African-American instances
According to the PCA graph, we deemed the instances around the points (51, 140) and
(53, 148) to be far enough from the rest of the group to warrant removal as outliers.
Figure 2.4: PCA on the set of African-American instances
Using the same GA model and settings as for the previous case study, we obtained the
MAE and SEE values presented in Table 2.2.
Measure   Min (cm)   Max (cm)   Mean (cm)        Stdev (cm)
MAE       1.6993     3.8059     2.758 ± 0.35     0.7983
SEE       1.9555     4.5236     3.3652 ± 0.4     0.9206

Table 2.2: MAE and SEE values obtained using the GA on the set of African-American
instances, with 95% confidence intervals for the mean.
The obtained MAE and SEE values, considered for all the 20 runs, as well as the mean
and the standard deviation of the values, with 95% confidence intervals for the mean, are
given in Table 2.4.
Formula (2.15) was obtained by training the GA model on the entire data set of African-

American (AA) instances, using the best hyperparameters found by the random grid search.
$PS_{AA} = 0.90622849 \cdot HUM + 0.86447619 \cdot RAD + 0.69811532 \cdot ULNA + 0.63420389 \cdot FEM + 0.59202374 \cdot TIB + 0.17618173 \cdot FIB + 37.77939141$    (2.15)
The third and fourth rows of Table 2.5 show the error measurements obtained by applying
Formula (2.15) on the entire data set of African-American instances, without cross-validation.
Third case study As we described in Section 2.2.1, in this case study all European-American
and African-American individuals were considered. There are seven initial features
used for the stature prediction task: the measurements of the six bones considered in the
first two case studies and a feature for the race. We note that the sex of the individuals is
not used as a feature in the stature regression process. After scaling the data, we determined
the dependencies between the features and the stature using the Pearson correlation
coefficient [Tuf11]. The absolute correlation values between the features and the output
stature are depicted in Figure 2.5. It can easily be seen that the race feature has the smallest
correlation with the stature, 0.125. Although this is a binary feature, we opted not to single
it out for a separate correlation analysis, since the other features are all continuous.
Figure 2.5: Pearson correlation for the features on the set of all European-American and
African-American instances.
Figure 2.6 shows a two-dimensional view of the data set, as obtained by the PCA algorithm.
Again, the data seem to follow a learnable pattern.
After analyzing the PCA graph for this case study, we decided to remove the three instances
with a stature below 150 and a value of about 50 for the single feature computed
by PCA.
For this case study, the values provided by our learning models for the
MAE and SEE evaluation measures, considered over all 20 independent runs, are described
in Table 2.3.
The obtained MAE and SEE values, considered for all the 20 runs, as well as the mean
and the standard deviation of the values, with 95% confidence intervals for the mean, are
given in Table 2.4.

Figure 2.6: PCA on the set of all European-American and African-American instances.
Measure   Min (cm)   Max (cm)   Mean (cm)        Stdev (cm)
MAE       1.5495     3.5399     2.8109 ± 0.26    0.5926
SEE       2.2227     4.7148     3.4863 ± 0.35    0.7991

Table 2.3: MAE and SEE values obtained using the GA on the set of all European-American
and African-American instances, with 95% confidence intervals for the mean.
Formula (2.16) gives the stature estimation formula applicable for this case study.
$PS_{All} = 1.03098743 \cdot HUM + 0.25947151 \cdot RAD + 0.05093836 \cdot ULNA + 0.98251099 \cdot FEM + 0.80198147 \cdot TIB + 0.12059372 \cdot FIB + 4.62192995 \cdot RACE + 44.58112393$    (2.16)
In Formula (2.16), the value for the RACE feature has to be set to 1 for the European-
American individuals and 0 for the African-American individuals.
The last two lines of Table 2.5 show the error measurements obtained by applying Formula
(2.16) on the data set consisting of all European-American and African-American individuals,
without cross-validation.
Discussion In this section, an analysis of the genetic algorithm model introduced for
stature estimation of human skeletal remains is presented. First, a detailed analysis of the
obtained experimental results is given, followed by a comparison with related work.
As shown in the experimental part, our GA model provided very good performance
for stature estimation. To address the problem of possible unrepresentativeness of either the
training data set or the test data set, we opted for 20 experiments with disjoint training and
test data sets, with each experiment consisting of a 10-fold cross-validation process. Over all
the performed case studies, the maximum average value obtained for the MAE is 3.18 cm,
while the maximum average value for the SEE is 4 cm for our GA model.
The experimental values obtained for the average MAE and SEE considering
all case studies are summarised in Table 2.6. The last column of Table 2.6 contains the
overall mean error (MAE/SEE), averaged over all instances independent of their race and sex.
The values for these measures are very small and reveal that our machine learning-based
GA model provided very good stature estimates.

Case study   Class                        Measure   Min (cm)   Max (cm)   Mean (cm)        Stdev (cm)
First        European-American Males      MAE       2.1989     2.7786     2.5192 ± 0.08    0.1828
                                          SEE       2.6111     3.3745     3.0565 ± 0.1     0.2377
             European-American Females    MAE       2.8011     4.3617     3.5917 ± 0.16    0.3759
                                          SEE       3.084      4.8843     4.0973 ± 0.19    0.4317
Second       African-American Males       MAE       2.4058     2.601      2.4832 ± 0.02    0.0469
                                          SEE       3.0455     3.2182     3.1159 ± 0.02    0.0439
             African-American Females     MAE       2.8803     3.9403     3.4436 ± 0.11    0.2502
                                          SEE       3.3218     4.4203     3.9339 ± 0.12    0.2846
Third        European-American Males      MAE       2.3736     2.8585     2.6734 ± 0.06    0.1295
                                          SEE       2.7865     3.4546     3.225 ± 0.07     0.1528
             European-American Females    MAE       2.8977     4.0991     3.302 ± 0.12     0.2727
                                          SEE       3.1736     4.7546     3.7397 ± 0.15    0.3422
             African-American Males       MAE       2.5538     2.9871     2.7098 ± 0.05    0.1042
                                          SEE       3.0997     3.6542     3.3079 ± 0.06    0.1317
             African-American Females     MAE       2.8418     4.0111     3.4577 ± 0.12    0.2762
                                          SEE       3.317      4.4031     3.9577 ± 0.11    0.2508

Table 2.4: Detailed results obtained on the considered case studies, considering all 20 cross-validation runs.
Data set                      Measure   Value (cm)
European-American instances   MAE       2.7095
                              SEE       3.6577
African-American instances    MAE       2.6776
                              SEE       3.5292
All instances                 MAE       2.7385
                              SEE       3.4772

Table 2.5: MAE and SEE values obtained using the final GA formula on the considered case
studies, with no cross-validation.
Analysis of the experimental results While unable to perfectly fit the data, as is to be
expected from looking at the PCA plots, the GA has the advantage of delivering a very easy
to use formula, in the same style as Trotter and Gleser's formulae. Since the errors are small,
we believe that this advantage makes the method useful for the archaeological community.
We also noted that, for all considered case studies, the FEM and the HUM features
proved to be the most correlated with stature and, therefore, are the most relevant for stature
estimation. This is expected from a bioarchaeological perspective and is reflected by the
coefficients of the formulae obtained by the GA model.
For a better analysis of the results obtained by our machine learning-based models, we
performed an additional experiment, which is described in the following paragraphs.
To study how the race of the individuals influences the stature estimation task, we removed
the RACE feature from the third case study. In this experiment, there were 183 individuals
characterised only by six bone measurements, i.e., six features labelled as HUM, RAD,
ULNA, FEM, TIB, and FIB.
Figure 2.7 illustrates the correlation values between the considered characteristics and the target stature.

Data                              Measure   Overall (cm)
Instances separated by race       MAE       2.8767
(First and second case studies)   SEE       3.6697
All individuals                   MAE       2.8109
(Third case study)                SEE       3.4863

Table 2.6: MAE and SEE values obtained using the GA model on the considered case studies.
Figure 2.7: Pearson correlation for the features on the set of all European-American and
African-American individuals (without considering the sex and race of the individuals).
The average errors obtained in this additional experiment, over the 20 runs performed,
are shown in Table 2.7. The last column of this table represents the average MAE
and SEE values obtained considering all individuals from the data set.
Formula (2.17) shows the final GA formula applicable to this situation.
Measure   Average value (cm)
MAE       3.1746
SEE       3.9319

Table 2.7: Average MAE and SEE values obtained using the GA model on the set of all
European-American and African-American individuals (without considering the sex and race
of the individuals).
$PS = 1.71781015 \cdot HUM - 0.29506528 \cdot RAD - 0.66794333 \cdot ULNA + 1.25152012 \cdot FEM + 0.07329773 \cdot TIB + 0.63599258 \cdot FIB + 52.9587286$    (2.17)
One can observe from Table 2.7 that slightly better estimations were obtained when using
the race of the individuals as a feature in learning (see Table 2.3). However, the errors without

using the RACE feature are not much larger than the errors obtained considering the race of
the individuals (the MAEs are about 4 mm larger and the SEEs about 5 mm larger). The good results
obtained in this additional experiment show that the proposed machine learning-based
model would be very useful in practice, since it is very likely that stature estimations
have to be made without knowing the sex and race of the individuals.
We have to emphasize the fact that our machine learning-based system can be trained
on data characterized by only a subset of features (bone measurements). This can have
major practical relevance when several bone measurements are missing. In this case, our
systems can be trained to learn multiple formulae, considering different bone combinations.
For example, considering our additional experiment (all European-American and African-American
individuals without their sex and race), the best formula provided by the GA
model for estimating stature based only on the measurements of the femur and tibia is
given in Formula (2.18). Using this formula, we obtained an average MAE of 3.1534 and an
average SEE of 3.8753. These values are very close to those reported considering all the bone
measurements (see Table 2.7), confirming in this way the adaptability of our learning system.
$PS = 2.04390026 \cdot FEM + 0.14116985 \cdot TIB + 68.77568858$    (2.18)
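Formula (2.18) can be applied directly to measurements in centimeters; a two-line illustration with invented femur and tibia lengths:

FEM, TIB = 45.2, 36.8                        # hypothetical measurements in cm
PS = 2.04390026 * FEM + 0.14116985 * TIB + 68.77568858
print(round(PS, 2))                          # estimated stature in cm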
Based on the above analysis of the experimental results obtained on the performed case
studies, we can conclude that the GA-based model has shown a very good performance.
The GA model has the advantage of producing an easy-to-use linear formula. An important
general advantage of our work is that it is applicable to new data sets: we can retrain
our models very easily on new data representing homogeneous populations. This process is
automated and can produce new formulae or trained models as needed.
Comparison to related work As shown in Section 1.1.2, most of the existing approaches
to stature estimation rely on bone measurements and basic statistical methods. As far as we
know, there are no existing machine learning-based models (like the GA model introduced
in [CIMM16]) for the problem of stature estimation. An exact comparison of our approaches
to similar existing ones is hard to make, since different case studies are used in the experiments,
the data sets used are not publicly available, and the measures used to evaluate the
performance of the methods are not the same for all approaches.
Comparing our approach with the different methods found in the literature, we can report
a higher degree of flexibility and a superior stature prediction accuracy as improvements.
Flexibility involves adaptability, a desirable feature given that the development of applicable
standards for local populations is highly recommended. The flexible nature of machine
learning-based methods derives from the possibility of uncovering unobvious patterns in,
possibly regional, data, in contrast to the limitations in this regard of a pre-established
formula.
To the best of our knowledge, only [TG52] and [RR06] present research results on the
Terry Collection data set. The regression formulae proposed by Trotter and Gleser in [TG52]
were tested on European-American and African-American instances belonging to the Terry
Collection Database. For an accurate comparison between our machine learning-based approaches
and the approach of Trotter and Gleser [TG52], we applied the regression formulae
from [TG52] to the instances from our data set and computed the mean values for both
the MAE and the SEE measures. For our GA learning model, the results obtained on the
first two case studies are reported, in which the classes of European-American and African-American
instances were considered independently, but without using sex as a feature in the regression
process. It has to be noted that the formulae of Trotter and Gleser were adjusted for individuals aged over 30 years, as indicated in [TG52].

Approach                          Measure   European-American   European-American   African-American   African-American   Mean
                                            Males               Females             Males              Females
Our GA                            MAE       2.5192              3.5917              2.4832             3.4436             3.0094
                                  SEE       3.0565              4.0973              3.1159             3.9339             3.5509
Trotter's regression formulae     MAE       2.675               3.762               2.92               4.219              3.394
[TG52]                            SEE       3.233               5.316               3.612              5.297              4.365

Table 2.8: Comparison between the MAE and SEE values reported by our approach and the approach
of Trotter and Gleser [TG52].
The results of the comparison with [TG52]
are given in Table 2.8.
From Table 2.8, we observe that both the MAE and the SEE values obtained by our GA
method are smaller than those reported in [TG52]. This is the best measure of the performance
of the present approach, since both approaches were applied to the same data set. We
note that our GA model provided better results than those provided by Trotter's formulae
from [TG52], even though the evaluation of the machine learning-based models was obtained using
multiple cross-validation runs (to avoid overfitting), whereas the formulae from [TG52] were
obtained using the entire data set. If we were to use the evaluation method from [TG52], which
computes the MAE and SEE errors by applying Trotter and Gleser's formulae on the entire
data, we would have obtained even lower errors for our machine learning approaches. Another
major advantage of our approaches with respect to [TG52] is that our estimations are made
without knowing two characteristics essential for the formulae given by Trotter and Gleser.
More precisely, our GA learning model does not use the age and the sex of the individuals,
which are indispensable for Trotter and Gleser's formulae to provide accurate estimations. It
has to be noted also that even the results we obtained without considering the race of
the individuals (see Table 2.7) are better than the results provided by [TG52].
Given the experimentally proven sensitivity of the anatomical method detailed by Raxter
et al. in [RR06] with respect to the age of individuals, the age independence of our methods
represents a real advantage. This is all the more true given that precise ages are difficult
to estimate [RA07]. Moreover, several skeletal elements from the cranium through
the calcaneus have to be measured to enable the use of the anatomical method. The existence
of incomplete skeletons, for which bone measurements are missing, significantly affects the
applicability of anatomical methods like the one proposed in [RR06]. In contrast to this, the
two machine learning-based methods are applicable for any subset of bone measurements, provided
a training data set (for which these measurements are known) is available. A numerical comparison
cannot be accurately performed and would not be truly meaningful, given that Raxter
et al. aim, through a significantly different methodology, to estimate the living statures of
individuals from a data set with the same origin as ours, but characterised by fewer instances
and more bone measurements.
The data sets used for evaluating the rest of the stature estimation methods existing in the
literature are different from those used for testing our GA model. Still, our
comparison will be further conducted using the value of the SEE evaluation measure, which is a
good measure for estimating the accuracy of a regression model. For our GA model, the best
values are reported, namely the ones obtained when the estimation is made separately for each
race, but without considering the sex of the individuals (the first two case studies). Thus, the
average values obtained over the 20 runs of the cross-validations for the first two case studies

are reported for the GA model.
The regression models employed in [OJ05] achieve a far better accuracy than the previous
methods and, therefore, constitute the basis for the FORDISC software used in forensic
analysis.
The SEE values reported for evaluating the proposals detailed in [PD14] are
between 1.426 cm and 3.538 cm. The SEE values obtained using our GA approach
for the two case studies (see Table 2.9) are significantly lower than those reported in [PD14].
The methods proposed in [Ful56] reported absolute errors that did not exceed 3.5 cm,
whereas previous methods such as those of Rollet and Manouvrier registered errors of up to
9 cm. We remark that the maximum value of the absolute error is significantly smaller in
our case. A major difference is preserved as well when comparing our results with the MAE
values reported in [Fel96] (3.4 cm using the femur/stature ratio method and 2.15 cm using
the regression equations calculated specifically for particular populations) or with the one
evaluating the solution proposed in [PD14] (1.69 cm).
The authors of [Koz96] obtained errors of stature estimates between 2.0 cm (Pearson
[Pea99]) and 8.5 cm (Dupertius [DH51]) for males on the data from Ostrów Lednicki. This
is relative to a baseline given by the method of Fully and Pineau. The errors for females
were between 0.7 cm (Fully and Pineau [FP60]) and 8.2 cm (Breitinger [Bre38]). These are
all higher than our results. However, it must be noted that they worked with a much larger
volume of data that was also more scattered than ours, and that they only compared the
methods relative to each other. The authors even stated that the problem of selecting
the best method for stature estimation is still unsolved.
In [KGJ13], a number of formulae based on leg and arm bones were applied to thousands
of instances from the period of the Roman Empire. All of their standard errors were above
6 cm.
The literature review conducted in [Nav10] presents a number of results for various populations,
all using the SEE measure, such as the following: as high as 32 cm for Asiatic
Indians using the clavicle (Singh and Sohal 1952 [SS52]; Jit and Singh 1956 [JS56]); between
67.89 and 54.72 mm for black American males and between 68.22 and 53.09 mm for black
American females using spine bones (Tibbetts 1981 [Tib81]); between 6.59 and 8.59 cm for
Japanese cadavers using skull measurements (Chiba and Terazawa 1998 [CT98]); from 4.37
to 6.24 cm in indigenous South African samples (Ryan and Bidmos 2007 [IR07]); from 3.69
to 5.92 cm using fragmented tibia (Holland 1992 [Hol92]); and between 4.033 and 5.127 for
an Indian population using hand and finger length (Jasuja and Singh 2004 [JS04]). The SEE
values obtained in our experiments are lower than those described above.
On the basis of the comparisons we have presented above and summarized in Table 2.9,
we notice that the errors obtained using our machine learning-based models are smaller than
most of the errors already reported in the literature.
Our GA model proved to outperform (see Tables 2.8 and 2.9) most of the previous approaches
used for the stature prediction task and is a strong and effective machine learning
tool that can assist bioarchaeologists in estimating the
stature of human skeletal remains. The major advantage of our GA-based learning system is
that it is able to provide sample-specific stature estimation formulae, since it is adaptive and
able to learn to estimate the stature considering exclusively the data set that was provided
as training data. Therefore, the stature of the members of a new population described by
a homogeneous set of osteological measurements can be estimated using the proposed machine
learning-based models obtained after training only on the given data, without any other
domain-specific knowledge. As mentioned before, we can provide new trained models and
formulae by using the systems and methodologies developed as part of this research.
The machine learning GA approach introduced in this work is proposed as an alternative
to the mathematical methods used for stature estimation. The machine learning perspective

Approach            Average MAE (cm)   Average SEE (cm)
Our GA              2.8767             3.6697
Approach [SS52]     -                  32
Approach [Tib81]    -                  6.1
Approach [Hol92]    -                  4.81
Approach [Koz96]    -                  4.85
Approach [CT98]     -                  7.59
Approach [KGJ13]    -                  >6
Approach [JS04]     -                  4.58
Approach [IR07]     -                  5.31
Approach [RRA+08]   -                  3.05
Approach [PD14]     -                  2.482
Approach [OJ05]     2.04               -
Approach [Ful56]    <3.5               -
Approach [Fel96]    2.78               -

Table 2.9: Average MAE and SEE for the proposed model against other important related
articles in the bibliography.
offers estimation methods that are adaptive, age-independent, and with a performance that
was experimentally shown to be better than that of existing regression formulae. Our
GA-based learning system is able to adapt the formulae provided for stature estimation to the
particularities of a given population (training data) by automatically identifying population-specific
characteristics that are relevant for stature estimation. The machine learning-based
model proposed in [CIMM16] can incorporate additional intra-populational characteristics
such as geographic, biocultural [VBA+14], and other factors by simply adding them as features in
the training data set, to better differentiate stature within the studied population. Moreover,
considering that the applicability of anatomical methods for stature estimation is diminished
in the case of incomplete skeletons, the GA-based method represents a practical and
automated tool for bioarchaeologists.
Conclusions and future work We have introduced in [CIMM16] a regression model
based on machine learning for the problem of stature estimation of archaeological human
remains. The experiments performed for evaluating the proposed model have provided very
good results, highlighting the potential of our proposal.
Considering the good performance of the machine learning-based model introduced in
[CIMM16], one can conclude that machine learning-based methods are useful for detecting
and recognising patterns in archaeological data that are hard to identify using other conven-
tional techniques.
Future work will be done to extend the experimental evaluation of the GA model on other
real-world data, to better test its performance. The application of other machine learning-
based models [Mit97] (such as decision trees, instance-based learning, clustering, relational
association rules mining [SCC06] etc.) for the problem of stature prediction of skeletal remains
will be further investigated.
Another direction of future research is that of automating the PCA outlier removal by
using domain knowledge. The visual method we have currently employed is rather subjec-
tive and might prove difficult to perform on other data sets, so we would like a way to
automatically perform this step.

Support vector regression results on the Trotter data set
In this section, we present results using the SVR method introduced in [Ion15b] on the
Trotter data set, which is described in Section 2.2.1.
Software libraries and experimental methods All of our experiments are performed
using the scikit-learn machine learning library [PVG+11]. For Support Vector Machines,
this in turn uses the libsvm library [CL11], which is known to be a very powerful library for
SVMs. Using well-known, well-documented open source libraries helps ensure bug-free
experiments and makes it easy for other researchers to reproduce our experiments.
We run experiments using the linear, RBF (Radial Basis Function), polynomial and sigmoid
kernels. We use a randomized grid search to tune the hyperparameters of our SVM
model (such as the $C$ and $\varepsilon$ values). As mentioned before, it has recently been shown that
using a random search is better than using a standard grid search [BB12a]. We use 10-fold
cross-validation [Mit97] as the model evaluation method within the random search. The parameter
configuration which leads to the best results, according to the Mean Absolute Error
(MAE), is returned after 1000 random parameter samplings from a uniform distribution over
$[0, 1)$ for each model parameter, except the $d$ parameter of the polynomial kernel, which was
sampled from the set $\{1, 2, 3, 4, 5\}$, and the $C$ parameter, which was sampled from $[0, 10000)$
when optimizing the RBF kernel, from $[0, 10)$ when optimizing the sigmoid kernel and from
$[0, 100)$ when optimizing the linear kernel. For the polynomial kernel, the grid searches found
$d = 1$, which essentially reduces it to a linear kernel, so we do not include it in the presentation.
For finding the optimal hyperparameters, we added the normalization step as the
first step of a pipeline, with the second and final step being the Support Vector Regressor. Our
normalizer only scales the features. The resulting pipeline is then used as the final regressor given
to the randomized search for optimization. This entails that the mean and standard deviation
are computed on the training folds and the same values are used to normalize the test fold
during testing. After normalization, the test fold is fed to the actual regressor part of the
pipeline.
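A compact sketch of this pipeline-based search (the M1 method) with scikit-learn and an RBF kernel follows; the sampling distributions mirror those listed above, the number of samplings is reduced for brevity, and X, y (measurements and statures) are assumed to be defined elsewhere.

from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# The scaler is the first pipeline step, so it is fit only on the training folds
# and reused, unchanged, on the corresponding test fold.
pipe = Pipeline([("scale", StandardScaler()), ("svr", SVR(kernel="rbf"))])

search = RandomizedSearchCV(
    pipe,
    param_distributions={
        "svr__C": uniform(0, 10000),      # C sampled from [0, 10000) for the RBF kernel
        "svr__epsilon": uniform(0, 1),    # epsilon sampled from [0, 1)
        "svr__gamma": uniform(0, 1),      # gamma sampled from [0, 1)
    },
    n_iter=100,                           # 1000 samplings were used in the actual experiments
    cv=10,
    scoring="neg_mean_absolute_error",
)
# search.fit(X, y); search.best_params_ then holds the selected hyperparameters.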
The way in which we optimize hyperparameters using randomized grid searching is fixed
(we will refer to it as the M1 method). However, we also consider the following methods,
which we optimize using a standard grid search over a feasible set of hyperparameters.
1. Method M2 involves normalizing the entire data set prior to the 10-fold cross-validation
and also normalizing the targets (the correct statures). Performance scores are reported
on the values returned by the model after unscaling. This approach can potentially
overestimate the performance of a model, because it does not fully mimic a real world
scenario: it assumes we can normalize the test data together with the training data;
2. Method M3 uses the pipeline approach of M1, but it also normalizes the targets. We
believe this to be a realistic application scenario that can help improve performance
(a minimal sketch follows this list).
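One possible way to realise M3 with a recent scikit-learn version is to wrap the pipeline in a TransformedTargetRegressor, which scales the statures on the training folds and converts the predictions back to centimeters; the sketch below is only one such realisation with illustrative hyperparameter values, not the exact implementation used in [Ion15b].

```python
# Sketch of M3: the same pipeline as in M1, with the targets also standardized.
# Predictions are transformed back automatically, so MAE/SEE stay in centimeters.
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

m3_regressor = TransformedTargetRegressor(
    regressor=Pipeline([
        ("scale", StandardScaler()),
        ("svr", SVR(kernel="rbf", C=910, epsilon=0.0001, gamma=0.001)),  # illustrative values
    ]),
    transformer=StandardScaler(),   # scales the statures using training-fold statistics
)
# m3_regressor.fit(X_train, y_train)
# predictions_cm = m3_regressor.predict(X_test)
```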
Testing method For each methodology and kernel, we use a single run of 10-fold cross-
validation per iteration, storing and using the hyperparameters that define the model which
minimizes the Mean Absolute Error (MAE) over all iterations. That model is then used
to report the MAE and the Standard Error of the Estimate (SEE).
A 95% confidence interval for the mean over the 10 test folds is also reported, as described
in [BCD01].
We present our results for two case studies representing different populations, as described
in Section 2.2.1, as well as for their concatenation.

(a) Caucasian case study. (b) Afro-american case study. (x axis: instance reduced to one dimension by PCA; y axis: stature in centimeters.)
Figure 2.8: Data set reduced to a single feature using Principal Components Analysis.
First case study: caucasians Figure 2.8 presents the two case studies reduced to a
single dimension using Principal Components Analysis (PCA) [TB99]. The x axis of each
plot represents the value of the single feature computed by PCA and the y axis represents
the statures. It can be seen that even under this setting it is easy to find a linear function
that approximates the data, which makes us expect very good results from the linear SVM
kernel: if a single PCA-derived feature already shows good potential, the linear kernel, which
can make use of all the features, should do at least as well.
Table 2.10 presents results for the first case study under all three evaluation methodologies.
For the caucasian case study, the linear kernel proved to be the best under the M1 testing
methodology, while the RBF kernel proved to be the best under M2 and M3.
For the M2 and M3 testing methodologies, we obtained the best results with identical
model hyperparameters. This is understandable, since the methods are very similar and the
scaled targets cannot differ too much in our data set. For larger data sets, the differences could
potentially be bigger, thus requiring different hyperparameters. Therefore, one might want
to employ a randomized grid search for the other methodologies as well. For our purposes,
however, we wanted to showcase both kinds of searches.
We note that all three kernels give very good results under all three methodologies: the
errors are expressed in centimeters, so even in the worst case (0.881 MAE for the sigmoid
kernel under the M1 methodology) our average mistake is under one centimeter, on a data
set of almost 100 instances.
Overall, the best results are obtained under the M3 methodology.
Second case study: afro-americans Table 2.11 presents results for the second case study
under all three evaluation methodologies.
Once more, the linear kernel is the best under the M1 methodology. Compared to the
first case study's M1 results, scores are better with the linear and sigmoid kernels and worse
with the RBF kernel, although the differences are very small for any practical concern.
The RBF kernel took the top spot under the M2 and M3 methods once again. Compared
with the first case study, the M2 and M3 results are worse for the Afro-american case study,
with only the sigmoid kernel in the M2 setting surpassing its direct competitor.
Overall, results are worse on the second case study than on the first, but not in a significant
fashion.

M    Kernel    MAE (cm)       SEE (cm)       Hyperparameters
M1   linear    0.046±0.010    0.056±0.011    C=91.92, ε=0.07
M1   RBF       0.088±0.061    0.101±0.068    C=9625.77, ε=0.07, γ=0.004
M1   sigmoid   0.881±0.504    0.913±0.504    C=4.064, ε=0.114, γ=0.0162, r=0.164
M2   linear    0.049±0.009    0.057±0.009    C=5, ε=0.001
M2   RBF       0.042±0.014    0.050±0.015    C=910, ε=0.0001, γ=0.001
M2   sigmoid   0.490±0.350    0.515±0.355    C=5, ε=0.0001, γ=0.01, r=0.0001
M3   linear    0.049±0.010    0.057±0.010    C=5, ε=0.001
M3   RBF       0.040±0.012    0.048±0.014    C=910, ε=0.0001, γ=0.001
M3   sigmoid   0.735±0.515    0.756±0.520    C=5, ε=0.0001, γ=0.01, r=0.0001
Table 2.10: Results for the caucasian case study. 95% confidence intervals are used for the
results.
M    Kernel    MAE (cm)       SEE (cm)       Hyperparameters
M1   linear    0.031±0.008    0.041±0.017    C=28.41, ε=0.017
M1   RBF       0.107±0.097    0.141±0.131    C=7304.752, ε=0.037, γ=0.02
M1   sigmoid   0.268±0.099    0.339±0.171    C=9.844, ε=0.013, γ=0.0062, r=0.479
M2   linear    0.056±0.052    0.114±0.154    C=0.1, ε=0.001
M2   RBF       0.051±0.030    0.081±0.072    C=910, ε=0.001, γ=0.001
M2   sigmoid   0.146±0.090    0.187±0.141    C=0.9, ε=0.0001, γ=0.01, r=0.0001
M3   linear    0.057±0.053    0.116±0.157    C=0.1, ε=0.001
M3   RBF       0.055±0.037    0.090±0.091    C=910, ε=0.001, γ=0.001
M3   sigmoid   0.195±0.147    0.220±0.167    C=0.9, ε=0.0001, γ=0.01, r=0.0001
Table 2.11: Results obtained for the African-american data. 95% confidence intervals are
used for the results.

M    Kernel    MAE (cm)       SEE (cm)       Hyperparameters
M1   linear    0.340±0.103    0.417±0.129    C=98.508, ε=0.394
M1   RBF       0.243±0.152    0.347±0.223    C=9563.38, ε=0.164, γ=0.009
M1   sigmoid   2.604±1.177    2.855±1.208    C=5.824, ε=0.0972, γ=0.01, r=0.7156
M2   linear    0.458±0.225    0.551±0.254    C=0.29, ε=0.001
M2   RBF       0.261±0.137    0.367±0.215    C=10000, ε=0.0001, γ=0.001
M2   sigmoid   1.246±0.391    1.305±0.389    C=6, ε=0.1, γ=0.01, r=0.0001
M3   linear    0.469±0.223    0.577±0.270    C=0.29, ε=0.001
M3   RBF       0.252±0.130    0.376±0.268    C=10000, ε=0.0001, γ=0.001
M3   sigmoid   1.469±0.498    1.585±0.527    C=6, ε=0.1, γ=0.01, r=0.0001
Table 2.12: Results for the mixed case study. 95% confidence intervals are used for the
results.
This time, the best results are obtained under the M1 methodology.
Mixed case study The mixed case study is obtained by concatenating the data sets cor-
responding to the previous two case studies, resulting in a bigger data set that contains both
populations. We have not added any new feature to distinguish the two populations.
Since the two populations look similar in the PCA plots (Figure 2.8), we expect their
concatenation to yield good results as well.
Table 2.12 presents results for the mixed study. This time, the radial basis function kernel
clearly outperforms the other kernels taken into consideration. This suggests that the RBF
kernel is the most robust of the three, being able to deal with more data instances even when
they come from different populations, and that it may therefore be the best choice in practice.
For the M2 and M3 methodologies, we again did not obtain significant differences in
results with different model hyperparameters.
We obtained the best results under the M1 methodology.
Another testing scenario that we plan to research in the future involves the concatenation
of the two data sets, but with a new feature added that specifies the population each instance
belongs to.
Figure 2.9 illustrates the learning curves for the RBF kernel on the mixed case study,
under the M1 testing methodology. It can be seen that the model generalizes very well and
there is no overfitting. Because of the good generalization, more data is unlikely to produce
better results; because the training score error increases only very slowly, more data is also
unlikely to lead to worse results. It can also be observed that the model achieves its optimal
performance with few training instances.
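Learning curves such as those in Figure 2.9 can be produced, for example, with scikit-learn's learning_curve utility; the sketch below assumes the M1 pipeline from the earlier sketch and the mixed case study data in X and y, and is only an illustration of the idea.

```python
# Sketch: MAE learning curves for an estimator, similar in spirit to Figure 2.9.
import numpy as np
from sklearn.model_selection import learning_curve

train_sizes, train_scores, test_scores = learning_curve(
    pipeline,                              # e.g. the scaled SVR pipeline from M1
    X, y,                                  # mixed case study data (assumed arrays)
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=10,
    scoring="neg_mean_absolute_error",
)
train_mae = -train_scores.mean(axis=1)     # average MAE on the training folds
test_mae = -test_scores.mean(axis=1)       # average MAE on the validation folds
# A small gap between train_mae and test_mae indicates little overfitting.
```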
Comparison to related work All of our results outperform the existing literature results.
As far as we are aware, only [TG52] presents results on these data sets.

Figure 2.9: Learning curves for the mixed case study, with the RBF kernel under the M1
methodology and considering MAE scores.
Method    Best SEE    Worst SEE    Short description
SVM       0.048       0.913        First case study; RBF kernel under the M3 methodology for the best SEE, sigmoid kernel under the M1 methodology for the worst.
SVM       0.041       0.339        Second case study; linear and sigmoid kernels, both under the M1 methodology.
SVM       0.347       2.855        Mixed case study; RBF kernel and sigmoid kernel, both under the M1 methodology.
[TG52]    3.05        3.66         Regression formulas based on basic statistical methods.
Table 2.13: Overview of literature results on the data set we have used.
Their reported SEE errors are between 3.05 and 3.66 cm. Our worst result is an SEE of 2.855
on the mixed case study, using the sigmoid kernel under the M1 testing methodology.
On the individual case studies, all of our SEE values are well below 1, with most of them
under 0.5 and the best of them under 0.1.
Taking into consideration work done on other data sets for the same problem, such as
[McC01, PD14, Fel96, Ful56] and others, we note that no previously existing method manages
to achieve errors of less than a centimeter under any scoring metric. This can be attributed
to the fact that previous methods only seem to consider a few features, to which they apply
rather basic statistical procedures. It is impossible for us to test SVM applications on those
data sets, since they are not public. Therefore, a direct comparison between our results and
the existing results on the data sets in [McC01, PD14, Fel96, Ful56] would be meaningless.
However, it stands to reason that, given the results presented in this section, SVM-based
models can potentially outperform existing methods on other data sets as well.
Table 2.13 presents a quick comparison between our methods and other literature results
on the data set we have worked with.

Therefore, our Support Vector Regression approach is a very reliable solution to the
problem of skeletal height estimation given the lengths of certain bones, leading to much
better results than previous approaches and having the potential to be easily adapted to new
data sets from the field.
Conclusions and future work We have presented in [Ion15b] how SVM regression methods
can be applied to estimating the height of archaeological remains. We have run extensive
experiments on two freely available archaeological data sets, obtaining very good results that
surpass previous results from the literature.
Our results also compare favourably with results reported on other data sets. While this
comparison requires further investigation, it is a plausible conjecture: SVMs make few
assumptions about the data they learn from, so if they learn well on one data set, it stands to
reason that they can do the same on another data set containing a similar kind of data.
We have applied three testing methodologies, which we believe mimic certain real world
scenarios well. We also used a randomized grid search for one of them, which recent research
has shown to outperform a standard grid search for hyperparameter optimization.
Therefore, we consider the SVM-based methods that we have applied to offer significant
contributions to the fields of archaeology and forensic analysis, and we believe that they will
generalize well to other problems of a similar nature.
As a future research direction, we plan to experiment with more machine learning libraries
and on more data sets. We also plan to find different kernels to test. Another possible
direction is refining our hyperparameter searches, by reducing the intervals we search in and
by running the randomized search for more iterations, increasing the probability of finding
better hyperparameters.
If we can obtain more data, we believe that researching online learning options would also
be a useful undertaking.
Since the PCA reduction did not seem to generate useless data, it is also worth investigating
the results that can be obtained using fewer features, since needing fewer measurements is
always helpful in practice.
Locally weighted regression results on the Trotter data set
In this section, we present results using the LWR method introduced in [IMMC15] on the
Trotter data set, which is described in Section 2.2.1. We mention that we have used our own
implementation of the LWR machine learning method.
First case study: caucasian males and females In the data set, 8 features are initially
provided for the stature prediction task. After the data is scaled (see Section 2.1.4), a subset
of features correlated with the target stature is determined through a statistical analysis.
The dependencies between the features and the stature (target output) are measured using
the Pearson correlation coefficient [Tuf11].
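As an illustration of this correlation analysis, the feature–target correlations can be computed as follows, assuming the scaled data is in a pandas DataFrame with numeric columns; the file and column names are purely illustrative.

```python
# Sketch: Pearson correlation of each feature with the target stature.
import pandas as pd

df = pd.read_csv("trotter_caucasian.csv")              # hypothetical file name
correlations = df.corr(method="pearson")["STATURE"].drop("STATURE")
print(correlations.sort_values(ascending=False))
# A weakly correlated feature (here, GENDER) can then be evaluated by running
# the regression with and without it, as done in Tables 2.14 and 2.15.
```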
Figure 2.10 depicts the Pearson correlations between the features and the stature (target
output) for the first case study.
We note that the lowest correlation is for the GENDER feature, so experiments are
performed with and without this feature for the regression task.
Table 2.14 presents results for the first experiment. It can be seen that the LWR model
provides better results with the GENDER feature included.
Second case study: afro-american males and females
The Pearson correlation values for the second experiment are depicted in Figure 2.11.

Figure 2.10: Pearson correlation for the features of the first case study.
GENDER included    MAE           SEE      MAE Stdev
yes                0.022         0.022    0.004
no                 0.166±0.01    0.166    0.026
Table 2.14: LWR model performance measures obtained for the first experiment, with 95%
confidence intervals for the MAE mean.
We observe that the GENDER feature again has the smallest correlation with the target
stature, so we will present results with and without it for this experiment as well.
Table 2.15 presents results on the second experiment. Once more, the LWR model obtains
better results with the GENDER feature included.
GENDER included    MAE           SEE      MAE Stdev
yes                0.024±0.01    0.024    0.017
no                 0.137±0.02    0.137    0.036
Table 2.15: LWR model performance measures obtained for the second experiment, with 95%
confidence intervals for the MAE mean.
Discussion and comparison to related work The LWR approach introduced in [IMMC15]
for predicting the heights of human skeletal remains is analyzed in the following. Then, a
comparison with related work from the archaeological literature is presented.
As shown in the experimental part of this chapter, the LWR model has provided good
performance for the height prediction problem on two case studies conducted on open data
available in the archaeological literature.
Analysing the results obtained on the second case study (the set of afro-american in-
stances), we observed that the performance is highly influenced by one male instance (the 45th
male instance from the data set). Looking into the data set, we noticed that this instance is
somewhat different from the other male instances, since its FIB feature (the maximum length
of the fibula bone) has a considerably larger value than in the other male instances. This
instance may therefore be considered an outlier in the data set of afro-american instances.
We removed the instance M45 from the data set and obtained better results using LWR, as
shown in Table 2.16, confirming the benefit of outlier removal for our Instance Based Learning
(IBL) algorithm.
Our results surpass those from the literature on the same data set. To the best of our
knowledge, only Trotter, in [TG52], presents results on the data set that we have worked
with. Trotter's SEE values range between 3.05 and 3.66 centimeters, clearly inferior to our
own results.

Figure 2.11: Pearson correlation for the features of the second experiment.
MAE      SEE
0.014    0.014
Table 2.16: Results on the second case study with the M45 instance removed. GENDER
was included.
This shows that our approach can perform well on this type of task. Because such data sets
are usually relatively small, the slower prediction speed of IBL models compared to other
models is not a big disadvantage: the small MAE and SEE values we obtain are worth the
minor memory and execution time costs.
Because open data sets are hard to find for the problem at hand, we have only been able
to apply our algorithm to the Trotter data set. However, results are available for other,
non-public data sets, and compared with those we still obtain better values. In [Ste29], a
Chinese population is considered, in [HR96] a population from Ontario is considered and in
[RRA+08] an ancient Egyptian population is used. Our results on the Trotter data set are
an order of magnitude better than all other results that we are aware of, even considering
these other data sets.
Our worst MAE results are under 0.5 centimeters and our worst SEE results are under
1.2 centimeters, still better than the existing results.
Table 2.17 presents a comparison between the best results obtained using our method and
the existing literature results on the data set that we used.
Method                     Best MAE    Best SEE
LWR – first case study     0.022       0.022
LWR – second case study    0.024       0.024
Trotter [TG52]             N/A         3.05
Table 2.17: Comparison between the known results on the open source Trotter data set and
our own IBL method.
To the best of our knowledge, the other stature estimation methods existing in the literature
are evaluated on data sets that are different from the one used in this thesis and that are not
publicly available. Still, we briefly present several methods which report the value of the
standard error of the estimate (SEE) evaluation measure.
A regression-based method for the stature estimation problem was introduced in [McC01],
while a predictive regression equation for stature prediction was recently proposed in [PD14].
The values of the standard error of the estimate reported for the methods

from [PD14] and [McC01] are 1.69 cm on average and between 1.426 cm and 3.538 cm,
respectively. The SEE values obtained using our LWR approach for the two case studies are
significantly lower than those reported in [PD14]. In [KGJ13], a number of formulas based
on leg and arm bones were applied to instances from the period of the Roman Empire; all of
the obtained SEE values were above 6 centimeters. Raxter et al. introduce in [RRA+08] a
regression formula for directly reconstructing the stature of ancient Egyptians and obtained
standard errors of the estimate ranging from 1.9 to 4.2 cm.
Considering the results presented above, we notice that our SEE results are significantly
lower than the standard errors in the literature.
Conclusions and future work In [IMMC15], we presented an IBL model that solves
the problem of stature prediction for human skeletal remains. Our idea of applying instance
based learning to archaeological data is, as far as we know, a novel one. We have succeeded
in outperforming classical statistics and domain knowledge-based methods that have been
applied on the same data set that we have used or on similar data sets, which makes our
approach very useful for the field at hand.
Further work will be done in order to detect and remove, during the data preprocessing
step, the outliers from the training data set. We also plan to experiment with other memory-
based methods, such as k-nearest neighbors. Perhaps most importantly, however, we plan to
apply these methods to as many data sets as we can, in order to develop them into robust
methods that work well on any data set from this domain.
2.2.2 Body mass estimation
This section introduces our machine learning based approaches (GA, SVR and LWR) for
estimating the body mass of human skeletal remains, based on bone measurements. We have
applied SVR for this problem in [ICT16]. The body mass estimation problem was presented
in Section 1.2.
After the data is preprocessed, as in any supervised learning scenario, our regression models
are trained and then tested in order to assess how well the obtained models perform. We
use ε-SVR with multiple choices of kernels and a randomized hyperparameter search; a
random search has been shown to generally outperform classical grid searches for hyperparameter
optimization [BB12a]. The random search consists of specifying either a probability distribution
or a discrete set of possible values for each hyperparameter. Then, for a given number of
iterations, random values are selected for each hyperparameter, according to these probability
distributions or discrete sets of values. A model is constructed at each iteration and evaluated
using ten-fold cross-validation. The best model returned by the random search is then tested
using Leave One Out cross-validation, as described in Section 2.1.5.
The SVM is tested using a single LOO cross-validation, since there is no randomness in
building the SVM regressor, so the obtained results are always the same. There is randomness
in the initial shuffling of the data, but we did not find that this influenced our results in any
significant manner. The random search for hyperparameter optimization is run for 200
iterations. The same methodology is applied for LWR.
For the GA, we repeat the LOO cross-validation 20 times, in order to account for the
random numbers used in the algorithm itself. The LOO cross-validation is repeated 20 times
only after the grid search has finished; the grid search itself is executed once.
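The following sketch illustrates the Leave One Out evaluation described above, assuming the preprocessed measurements are in X, the body masses in y and the tuned regressor returned by the random search in best_model; it is an illustration of the protocol, not the original experiment code.

```python
# Sketch: Leave One Out cross-validation of the tuned regressor, reporting MAE in kg.
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import mean_absolute_error

loo = LeaveOneOut()
predictions = cross_val_predict(best_model, X, y, cv=loo)   # one prediction per instance
mae = mean_absolute_error(y, predictions)
print("LOO MAE: %.4f kg" % mae)

# For the GA, the whole LOO run would be repeated 20 times to average out its
# randomness, e.g. collecting one MAE value per repetition and reporting the
# mean and a confidence interval (repetition loop omitted here).
```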
We have used our own implementations for the GA and LWR models as well as for the
self organizing feature map (SOM) used in the data pre-processing step. The scikit-learn
machine learning library was used for the SVM implementation [PVG+11].

Data sets and case studies
The data set considered in our experiments is an archaeological data set publicly available
at [JMJ00]. The database at [JMJ00] was developed through a research project [JMJ98] and
consists of skeletal remains from forensic cases in the United States [JMJ00]. The database
contains 1009 human skeletons and 86 bone measurements representing postcranial
measurements of the clavicle, scapula, humerus, radius, ulna, sacrum, innominate, femur,
tibia, fibula, and calcaneus [JMJ00]. We extracted from the database only those instances
for which the (forensic and/or cadaver) body mass was available.
An analysis of the bioarchaeological literature revealed 10 measurements which are used
for human body mass estimation and have a good correlation with body size (stature and/or
body mass) [AR04]: (i) the femoral head diameter; (ii) the iliac breadth and the femur
bicondylar length; (iii) the lengths of the clavicle, humerus, radius, ulna, femur, tibia and
fibula. Among these, the femoral measurements seem to produce the most precise estimations
of body mass [AR04].
When two measurements were available for a bone (left and right), we used their mean.
If only one measurement existed in the database, we used that one. Instances containing
missing bone measurements were not considered in our case studies.
We consider the following case studies, with the aim of analyzing the relevance of the
previously mentioned measurements for the problem of body mass estimation.
The first case study consists of 200 instances characterized by 3 measurements – (i) and (ii).
The second case study consists of 146 instances characterized by 8 measurements – (i) and (iii).
The third case study consists of 135 instances characterized by all 10 measurements – (i), (ii) and (iii).
Data preprocessing
Before applying the machine learning based methods, the data set is preprocessed. After
the data is normalized using the Min-Max method, a statistically based feature selection is
applied for determining how well the measurements are correlated with the target body mass
output. The dependencies between the features and the target body mass are determined
using the Pearson correlation coefficient [Tuf11].
Also as a data preprocessing step, we have used a self organizing feature map (SOM)
[SK99] to obtain a two dimensional view of the data we are working with. The trained
SOM is then visualized using the U-Matrix method [KK96]. The U-Matrix values can be
interpreted as heights, resulting in a U-Matrix landscape in which high places separate
dissimilar instances, while instances grouped in the same valley represent similar data. On
the U-Matrix we are able to observe outliers, i.e. input instances that are isolated on the
map. The visually observed outliers are then eliminated from the training data. This process
is detailed in the following experimental part.
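We used our own SOM implementation; purely for illustration, a comparable U-Matrix can be obtained with the third-party MiniSom package, as sketched below under the assumption that the Min-Max scaled measurements are in X. This is not the implementation used in the thesis.

```python
# Illustrative sketch: training a SOM and inspecting its U-Matrix with MiniSom.
from minisom import MiniSom   # third-party package, not the thesis implementation

som = MiniSom(10, 10, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X, 5000)              # X: Min-Max scaled measurements (assumed array)

u_matrix = som.distance_map()          # average distance of each node to its neighbours
# High values ("ridges") separate dissimilar regions of the map; instances mapped
# to small, isolated regions are candidate outliers.
winners = [som.winner(row) for row in X]   # winning node coordinates per instance
```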
Figure 2.12 (a) illustrates the correlations between the 10 measurements and the target
output on the third data set (case study).
Figure 2.12 (a) shows that the first three features have the highest correlation with the
body mass. The femoral head diameter has the maximum correlation with the body mass
(0.5086), while the length of the tibia has a correlation of only 0.3858 with the body mass.
Analysing the correlations from Figure 2.12 (a), we would expect the best performance of our
machine learning models to be obtained on the first case study (using only the first three
measurements).

(a) Pearson correlations between the features and the target weights.
(b) The U-Matrix visualization.
Figure 2.12: Pearson correlation and U-Matrix visualization.
In order to determine possible outliers within the training data, we trained a self organizing
map on the first data set (consisting of 200 instances). The U-Matrix view of the trained
map is depicted in Figure 2.12 (b). On the map, one can see four small regions (delimited by
white boundaries). The instances from these regions may be viewed as possible outliers in the
data set, since they are isolated from the other instances. In this way we have visually identified
8 instances as possible outliers. These instances are removed from all three training data
sets, resulting in data sets of sizes 192 (first case study), 139 (second case study) and 128
(third case study).
Results using our machine learning approaches
This section presents our results using our GA, SVR and LWR models. The SVR model
was used in [ICT16].
GA body mass estimation results In this section, we perform the experimental evalu-
ation of the proposed GA machine learning model on the case studies described in Section
2.2.2.
For the random search of GA hyperparameters, we have used the uniform probability
distribution. The intervals or sets from which we randomly draw each hyperparameter are
as follows:

Population size: sampled from the set {40, 80, 90, 100, 120, 140}.
Initialisation: each individual's coefficients are initialised with random values in
[-init/2, init/2], where init is sampled from [0, 1).
Mutation probability: sampled from [0, 1).
Mutation coefficient: the value mt to mutate by is sampled from [0, 2). If an individual
is selected for mutation (based on the mutation probability), a random value from
[-mt/2, mt/2] is added to a randomly chosen coefficient (see the sketch after this list).
Search termination: if the desired precision has not been achieved after a certain number
of generations, the GA stops. The number of generations to stop after is sampled from
the set {200, 400}.
Precision: the precision is always fixed at 5 during the grid searches.
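To make the mutation setting concrete, the sketch below shows one possible implementation of the mutation step described above, assuming an individual is simply a list of regression coefficients; it is an illustration, not the exact operator used in our GA.

```python
# Illustrative sketch of the mutation step: with probability mutation_prob,
# add a random value from [-mt/2, mt/2] to one randomly chosen coefficient.
import random

def mutate(individual, mutation_prob, mt):
    if random.random() < mutation_prob:
        i = random.randrange(len(individual))
        individual[i] += random.uniform(-mt / 2.0, mt / 2.0)
    return individual

# Example usage with illustrative values:
# individual = [0.3, -1.2, 0.8]
# mutate(individual, 0.76, 1.16)
```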
The results obtained using the GA are presented in Table 2.18. The best values used for
the hyperparameters during the 20 LOO runs are depicted in the last column of the table.
Case study           MAE (kg)          Example best hyperparameters
First case study     9.0711±1.1913     Population size: 140; Initialisation: 0.7384331304323912; Mutation probability: 0.7601788328422227; Mutation coefficient: 1.1590351357712945; Search termination: 400
Second case study    9.6077±1.5448     Population size: 80; Initialisation: 0.4378496082777412; Mutation probability: 0.3788103953145435; Mutation coefficient: 1.4614684567584595; Search termination: 200
Third case study     10.0630±1.5751    Population size: 120; Initialisation: 0.451871905511117; Mutation probability: 0.8090272666122902; Mutation coefficient: 0.6132309553757227; Search termination: 400
Table 2.18: Results obtained using the GA. 95% confidence intervals for the mean are used.
SVR body mass estimation results In this section, an experimental evaluation of the
proposed SVM machine learning regressor on the case studies described in Section 2.2.2 is
conducted.
For the random search of SVM hyperparameters, we have used the uniform probability
distribution. The intervals or sets from which we randomly draw each hyperparameter are
as follows:
kernel: from the set {rbf, linear, sigmoid, poly} (i.e. RBF, linear, sigmoid, polynomial kernels).
ε: from [0, 1).
γ: from [0, 1).
b: from [0, 1).
Polynomial degree for the polynomial kernel: from the set {1, 2, 3, 4, 5}.
C: from [0, 1).
Note that not all hyperparameters apply to every kernel choice.
The results obtained using the SVM are presented in Table 2.19. The best values used
for the hyperparameters (including the used kernel function) are depicted in the last column
of the table.
Case study           MAE (kg)          Example best hyperparameters
First case study     8.4897±1.1854     kernel: linear; ε: 0.027578797176389447; C: 0.641027338498186
Second case study    9.4918±1.5060     kernel: linear; ε: 0.09576656837738184; C: 0.24862712124238084
Third case study     8.6843±1.5545     kernel: polynomial; ε: 0.04798752786733451; C: 0.1465703832031966; degree: 2; γ: 0.7681948068876795; b: 0.42975272058542213
Table 2.19: Results obtained using the SVM. 95% confidence intervals for the mean are used.
As shown in Table 2.19, the SVM obtained the best performance on the first case study
and the worst performance on the second case study. The SVM generally performs better
than the GA, with the smallest difference being on the second case study. It is worth noting
that the values are close enough that the GA might occasionally outperform the SVM if it is
run enough times.
LWR body mass estimation results This section presents our LWR experimental evaluation
on the case studies introduced in Section 2.2.2.
For the random search of the single LWR hyperparameter, we sampled its value from [0, 30].
Table 2.20 presents the results obtained using our LWR learning model.
Case study           MAE (kg)          Example best hyperparameter value
First case study     8.6357±1.1833     2.6891837777270666
Second case study    10.5200±1.6008    14.696571566405868
Third case study     9.5676±1.6043     29.83486109946865
Table 2.20: Results obtained using the LWR model. 95% confidence intervals for the mean
are used.
It can be seen in Table 2.20 that the LWR machine learning model outperforms the GA
on the first and third case studies, while providing the worst results on the second case study.
The differences are not large enough to be significant, however; a lucky GA run might do
better, while there is no randomness involved in the LWR model.

Discussion and comparison to related work
In this section we provide an analysis of the three machine learning approaches we have
introduced for body mass estimation from bone measurements. Then, a comparison with
existing literature approaches is shown.
As shown in Section 2.2.2, the LWR and SVR models provide roughly the same
performance for the body mass estimation problem. The SVR slightly outperforms the
LWR model (by a little over 1 kg MAE in the worst case) and the GA model (by about 1.5 kg
MAE in the worst case). The experimental values which we obtained for the (average) MAE
over all case studies are summarized in Table 2.21. The last column of Table 2.21 contains the
MAE value (averaged over all LOO cross-validations). The best value for each case study is
marked with an asterisk.
Case study           Model    MAE (kg)
First case study     GA       9.0711±1.1913
First case study     SVR      8.4897±1.1854 *
First case study     LWR      8.6357±1.1833
Second case study    GA       9.6077±1.5448
Second case study    SVR      9.4918±1.5060 *
Second case study    LWR      10.5200±1.6008
Third case study     GA       10.0630±1.5751
Third case study     SVR      8.6843±1.5545 *
Third case study     LWR      9.5676±1.6043
Table 2.21: MAE values obtained using the GA, SVR and LWR models on the considered
case studies, with 95% confidence intervals for the mean. The best value for each case study
is marked with an asterisk.
From Table 2.21 we conclude that the best machine learning based regressor for estimating
human body mass from skeletal remains is the SVM, when only 3 measurements (femoral
head diameter, iliac breadth and femur bicondylar length) are used. This is to be expected,
since these three measurements showed the highest correlation with the target body mass
(Figure 2.12 (a)).
Analysing the results from Table 2.21, one can also observe that the worst results were
obtained (for all three learning models) on the second case study. We can conclude that the
iliac breadth and femur bicondylar length are also important for estimating the body mass,
while the measurements of the clavicle, humerus, radius, ulna, femur, tibia and fibula do not
improve the body mass estimation results.
It is worth mentioning that the outlier removal step we have performed during data
preprocessing only slightly increased the performance of our ML regressors, and only in some
cases. Table 2.22 illustrates the effect of removing the outliers from the training data for all
three of our regression methods. Most of the gains are within the confidence intervals, so
they are not statistically relevant. The highest reduction of the MAE value was obtained for
the second case study using LWR; the worst MAE increase was also obtained using LWR, on
the third case study. We are forced to conclude that removing outliers does little to improve
results.
In the following, a brief review of the recent human body mass estimation literature is
given, with the aim of comparing our ML regressors to the existing related work. As far as
we know, there are no existing machine learning based models (like the GA, SVM and LWR
models proposed in our research) for the problem of body mass estimation from skeletal
remains.
A comparison between several body mass estimation methods was conducted by Auerbach and Ruff in [AR04] (2004).

Model    Case study           MAE with outlier removal    MAE without outlier removal    MAE reduction (%)
GA       First case study     8.747±1.279                 9.071±1.191                    3.57
GA       Second case study    10.033±1.700                9.6077±1.5060                  -4.42
GA       Third case study     9.780±1.726                 10.063±1.575                   2.81
SVR      First case study     8.478±1.265                 8.490±1.185                    0.14
SVR      Second case study    9.449±1.577                 9.492±1.506                    0.45
SVR      Third case study     9.094±1.679                 8.684±1.555                    -4.72
LWR      First case study     8.708±1.274                 8.636±1.183                    -0.83
LWR      Second case study    10.127±1.660                10.520±1.600                   3.74
LWR      Third case study     9.958±1.774                 9.568±1.604                    -4.08
Table 2.22: Comparative MAE values (kg) – with and without outlier removal.
The authors proposed to test several existing methods on a great variety of subjects. They
used the skeletal remains of 1173 adult skeletons of different origins and body sizes, both
males and females. Three femoral head based regression formulas were tested and compared
on the considered skeletal sample: Ruff et al. [RSL91] (1991), McHenry [McH92] (1992) and
Grine et al. [GJTP95] (1995). The authors concluded that for small body sizes (Pygmoids),
the formula of McHenry (1992) can provide a good body mass estimation. For very large
body sizes, the formula introduced by Grine et al. in [GJTP95] should be used, whereas for
the other samples the formula of Ruff (1991), or the average of the three techniques, would
be the best approach.
Ruff et al. provided in [RHN+12b] (2012) new equations for estimating the body mass
that are generally applicable to European Holocene adult skeletal samples. The body mass
estimation equations were based on femoral head breadth. 1145 skeletal specimens were
collected from European museums, from time periods varying from the Mesolithic to the
20th century [RHN+12b]. On these data sets, the regression formulas introduced in [RHN+12b]
provided better results than the previous formulas of Ruff et al. [RSL91] (1991), McHenry
[McH92] (1992) and Grine et al. [GJTP95] (1995).
The data sets used in the previously mentioned papers are not publicly available, which is
why an exact comparison of our approaches to them cannot be performed. Since we have not
found in the literature any body mass estimation experiments on the same data set as in
our paper [ICT16], we conducted the following comparison to related work. We applied the
regression formulas introduced by Ruff et al. [RSL91], McHenry [McH92], Grine et al.
[GJTP95] and Ruff et al. [RHN+12b] to our data sets (all three case studies) and compared
the obtained MAE values with the ones provided by our proposed regression models. The
results of the comparison are given in Table 2.23, with 95% confidence intervals for the mean
[BCD01]. All three of our models outperform the literature formulas on every case study.
From Table 2.23 we observe that the MAE values obtained by all three of our learning methods are smaller than those obtained using the regression formulas from the literature.

Approach                       First case study MAE (kg)    Second case study MAE (kg)    Third case study MAE (kg)
Our GA                         9.0711±1.1913                9.6077±1.5448                 10.0630±1.5751
Our SVR                        8.4897±1.1854                9.4918±1.5060                 8.6843±1.5545
Our LWR                        8.6357±1.1833                10.5200±1.6008                9.5676±1.6043
Ruff et al. (2012) [RHN+12b]   10.115                       11.43                         11.202
Ruff et al. (1991) [RSL91]     10.476                       11.556                        11.359
McHenry (1992) [McH92]         10.176                       11.514                        11.324
Grine et al. (1995) [GJTP95]   10.609                       11.656                        11.431
Table 2.23: Comparison between our approaches and similar related work. 95% confidence
intervals for the mean are used for the results.
One can notice that for the SVR, even the upper limit of the 95% confidence interval of the
mean is below the MAE values from the literature. This is somewhat predictable, because
the previously stated regression formulas only use one measurement, the femoral head
anterior-posterior breadth, while we are using multiple measurements. This is the fairest
measure of the performance of the present approach, since the experiments are performed
on the same data sets. We note that our learning models provided better results even though
the machine learning based models were evaluated using multiple cross-validation runs (in
order to avoid overfitting), whereas the formulas from the literature were obtained using the
entire data set. Another major advantage of our approaches with respect to the literature is
that our estimations are made without knowing the sex feature, which is commonly used in
the existing literature.
From Table 2.23 we can also notice that, for both the approaches from the literature and
our learning models, the best performance was obtained on the first case study.
It has to be mentioned that most researchers in the bioarchaeological field develop
regression formulas for body mass estimation based on a data set which is also used for testing
the developed formulas, without using a testing set independent from the training data and
without any type of cross-validation. This may lead to overfitting, as for the regression
formulae from the literature, which provided good performances on the data they were
trained on (about 4-5 MAE) but, when applied on unseen test data (our case studies), gave
large MAE values. It is likely that the methods from the body mass estimation literature
would provide larger errors under the testing methodology used for our machine learning
methods.
The main advantage of the machine learning approaches proposed in [ICT16] over the
mathematical ones is that the machine learning models can be retrained on new data, and
they are able to generalize well if specific methods for avoiding overfitting are used, like the
ones described in Section 2.1.5. Moreover, it can be costly to develop mathematical formulas
on new data sets, and such formulas will likely not generalize well anyway. Instead, a machine
learning regressor can easily build, from new data sets, learning models which are able to
perform well on unseen data having the same features as the training data.

Conclusions and future work
We have presented three supervised learning regression models for body mass estimation
from human skeletal remains based on bone measurements. The idea of applying machine
learning to the body size estimation problem is, as far as we know, a novel one. Several case
studies were conducted on a publicly available data set from the bioarchaeological literature.
Our results outperformed classical statistical methods for body size estimation, which makes
our approaches useful for the field of bioarchaeology.
Further work will be done in order to improve the data preprocessing step of our
approaches, by automatically detecting and removing the outliers from the training data set.
Since archaeological data sets usually contain missing measurements, we will further
investigate methods to deal with these missing values. We also plan to apply other machine
learning based regression models to the body mass estimation problem, like radial basis
function networks and k-nearest neighbors.
2.2.3 Age at death estimation
This section presents our machine learning based approaches (GA, SVR, LWR) for ap-
proximating the age at death of human skeletal remains, based on bone measurements. We
have applied support vector regression for this problem in [ITV16]. The age at death esti-
mation problem was introduced in Section 1.3.
For evaluating the performance of our GA, SVR and LWR models, 10-fold cross-validation
is used [WLZ00].
For the GA experiment, the cross-validation has to be executed multiple times, due to
the randomness of the weights initialization, the mutation, the selection of instances within
the folds during the cross-validation process and the initial shuffling of the data. We execute
the cross-validation step 20 times and report the average, minimum, maximum and standard
deviation over the executions. We do the same for the LWR and SVR experiments, which
may also be affected by the shuffling of the data.
For all of our learning models, we have also used a random grid search in order to obtain
optimal hyperparameters. As mentioned before, it is shown in [BB12a] that a random grid
search generally outperforms a classical grid search.
We run the random grid search for 200 iterations, using the scikit-learn machine learning
library [PVG+11]. After the grid search, the best hyperparameters found are used for the 20
cross-validation runs, with the data being reshuffled before each run. We found that the
results do not improve when increasing the number of iterations.
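A sketch of this repeated cross-validation protocol is given below, assuming the tuned estimator is in best_model and the preprocessed data in X and y; the names are illustrative and the code is only an example of the idea.

```python
# Sketch: 20 runs of 10-fold cross-validation with reshuffled folds,
# reporting the average, minimum, maximum and standard deviation of the MAE.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score

run_maes = []
for run in range(20):
    folds = KFold(n_splits=10, shuffle=True, random_state=run)   # reshuffle each run
    scores = cross_val_score(best_model, X, y, cv=folds,
                             scoring="neg_mean_absolute_error")
    run_maes.append(-scores.mean())

print("mean %.3f  min %.3f  max %.3f  std %.3f"
      % (np.mean(run_maes), np.min(run_maes), np.max(run_maes), np.std(run_maes)))
```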
Data sets and case studies
The data set considered in our experiments is an archaeological data set publicly available
at [JMJ00]. The database at [JMJ00] was developed through a research project [JMJ98] and
contains human skeletons from the United States [JMJ00]. The database contains 1009
skeletal remains and 86 bone measurements (features) representing postcranial measurements
(length, diameter, breadth, circumference) of the clavicle, scapula, humerus, radius, ulna,
sacrum, innominate, femur, tibia, fibula, and calcaneus [JMJ00]. Besides the bone
measurements, the sex and the age at death are available for the skeletons.
We have extracted from the database a set of 446 individuals and the measurements for the
clavicle, humerus, radius, ulna, femur, tibia and fibula. When two measurements were
available for any of these bones (left and right), we used their mean. If only one measurement
existed, we used that one. The age at death was given as an interval [minage, maxage] for
some instances; in such cases, we considered the maxage.
This data set is then divided into 4 subsets according to age. The groups are taken
according to what is usually used in the literature, although there are minor differences
between authors [RGM+07, Sak08, PFSL12]. The literature also deals with each group
individually, so we decided to do the same.
subadults: age at most 21 years – 89 instances.
young adults: age between 22 and 35 years inclusively – 154 instances.
adults: age between 36 and 50 years inclusively – 112 instances.
old adults: age above 50 years – 91 instances.
We will also consider a case study on the concatenation of young adults and adults.
Data preprocessing
The first stage of our approach is the data preprocessing step, in which the data is
normalized using the Min-Max approach. In the future, we plan to also fill in missing values,
but so far we have only selected the instances with no missing values.
Also as a data preprocessing step, we have applied Principal Components Analysis (PCA)
[TB99, AW10] in order to obtain a two dimensional representation of the data we are working
with. By doing this, we were able to gain some intuition about how our models would behave
on the original data, and also to present the data in a way that might be of interest to
researchers in the forensic science fields. Although it is not something that we have considered
for this research, the PCA graph might also be able to help us identify and remove outliers,
either through an automated process or visually, with the help of domain knowledge.
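As an illustration of this visualization step, the two dimensional PCA representation can be obtained as sketched below, assuming the Min-Max scaled measurements are in X and the ages at death in y; the plotting details (colouring by age) are illustrative.

```python
# Sketch: projecting the bone measurements onto two principal components
# and plotting them, coloured by age at death.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

components = PCA(n_components=2).fit_transform(X)   # X: scaled measurements (assumed array)
plt.scatter(components[:, 0], components[:, 1], c=y, cmap="viridis")
plt.colorbar(label="Age at death (years)")
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.show()
```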
After the data was preprocessed, as in any supervised learning scenario, the regression
model was trained and then tested in order to measure its performance.
Results using our machine learning approaches
This section presents our results using the GA, SVR and LWR models. The SVR results
were introduced in [ITV16].
Data visualizations Figure 2.13 presents the 2-dimensional PCA graph for the subadults
subset. The PCA graph suggests that the problem should be solvable even by simple linear
regression, due to the fact that a lot of the instances are clustered together along a line, with
few outliers.
Figure 2.14 presents the PCA plot for young adults. This time, it is harder to identify a
clear regression line. The plots for the other age categories are similar.
Experimental setup For the SVR, the following intervals for the random search are used.
For the GA and LWR, we use the same intervals presented in Section 2.2.2 for the body mass
estimation problem.
Within the random search, we consider the following hyperparameters and possible choices
for them:
kernel: chosen among RBF, linear, sigmoid and polynomial.
ε: chosen in [0, 1).
γ: chosen in [0, 2).
b – the bias term: chosen in [0, 2).
degree – for the polynomial kernel: an integer from 1 to 5 inclusively.
C – the regularization coefficient: chosen in [0, 6).

Figure 2.13: PCA graph for the subadults case study.
Figure 2.14: PCA graph for the young adults case study.
Experimental results Table 2.24 presents our results on each case study, considering all
three learning models, with the best results highlighted.
It can be seen that our SVR approach generally leads to better results, with the LWR
model very close behind and actually obtaining better results on the young adults case study.
The best results are obtained for the subadults case study.
The results also tend to confirm what the PCA graphs suggested: the problem is a lot
easier for subadults than for the other categories.
For the SVM, the RBF kernel generally provides the best results, although there are no
significant differences between kernels if we run the grid search with just one choice for the
kernel.
Discussion and comparison to related work
An analysis of the approaches we have introduced for predicting the age at death from
bone measurements is provided below, together with a comparison to related work from the
literature.
Table 2.25 summarizes the comparison of our results to related work and shows that our
approaches lead to better results in general. We further detail the comparison to each related
work item below.

Case study                 Method    Min MAE    Max MAE    Mean MAE       Stdev MAE
subadults                  GA        1.743      1.956      1.851±0.029    0.060
subadults                  SVR       1.578      1.616      1.598±0.023    0.011
subadults                  LWR       2.524      3.056      2.811±0.068    0.141
young adults               GA        3.826      3.967      3.902±0.018    0.038
young adults               SVR       3.442      3.511      3.470±0.040    0.019
young adults               LWR       3.401      3.542      3.465±0.017    0.035
adults                     GA        3.998      4.270      4.104±0.035    0.072
adults                     SVR       3.498      3.928      3.660±0.201    0.096
adults                     LWR       3.860      4.096      3.987±0.028    0.058
old adults                 GA        7.394      8.032      7.667±0.093    0.194
old adults                 SVR       6.886      7.326      7.105±0.293    0.140
old adults                 LWR       7.217      7.854      7.442±0.065    0.136
young adults and adults    GA        7.524      7.821      7.671±0.043    0.090
young adults and adults    SVR       6.709      6.855      6.778±0.069    0.033
young adults and adults    LWR       6.721      6.910      6.818±0.025    0.052
Table 2.24: Results using the GA, SVR and LWR learning models on the 5 case studies. 95%
confidence intervals for the mean are used.
The study in [MUC+07] presents a comparison between four different approaches for
estimating the age at death: the Suchey–Brooks method using the pubic symphysis, the
Lovejoy method using the auricular surface, the Lamendin method using monoradicular teeth
and the Iscan fourth rib method. The authors also proposed a Principal Components Analysis
(PCA) based combination of these four methods.
The research was conducted on 218 skeletons uniformly distributed in terms of age, sex
and race. The study was performed both on the full sample and separated by age group
(25–40 years, 41–60 years, >60 years), sex and race. It was observed that the most
accurate method depended on the target group. On the full set, the most accurate method
was the PCA combination, obtaining a mean absolute error of 6.7 years. They also observed
that the worst results (10–16 years, depending on the method) were obtained on the elders
group (>60 years). Our worst results are on the oldest group as well, but our mean absolute
errors are still better on that category.
In [RGM+07], it is shown that the best method for estimating age from skeletons is the
mineralisation of teeth for subadults, while for other categories long bone length and epiphysis
development are used; for adults, the best method is aspartic acid racemisation, a chemical
method [RTC02]. Usually, a specific method is used for a specific age category, and it is
impossible to find a method that works well for all categories. For adults it is more difficult
to estimate age: the error of the estimate increases with age. Using the aspartic acid
racemisation method, Rösing and Kvaal obtained, in 1998, results with an average mean error
of estimate of 2.1 years. While this is better than what we obtained, our method is strictly
mathematical.
In [Sak08], the author used a method to determine the sex of the subjects and then used
the same approach for estimating the age at death. The method is based on measurements
of the patella bone. The author applied the procedure to 283 Japanese skeletons, 183 males
and 100 females. The method consists of finding a discriminant function for sex determination
and a method for age at death estimation. Only measurements of the patella were used. It
was observed that a lipping on this bone gradually develops with age, and three stages (young,
middle-aged and old) which can be used for age estimation were identified.

Reference    Age group                   Our best mean MAE                                              Reference mean MAE
[MUC+07]     old adults                  7.105                                                          10-16
[MUC+07]     All                         —                                                              6.7
[Sak08]      young adults and adults     5.72                                                           5.77
[Sak08]      old adults                  7.105                                                          12.3
[SDIL94]     15-97                       smaller error on age groups, about the same error on our       10-12 (standard error)
                                         entire data set
[CBM+09]     old adults, >40 years old   7.020 / 7.105                                                  19
[PFSL12]     subadults                   1.598                                                          1.788
Table 2.25: Summary of our comparison to related work.
The approach was tested on 26 young adult, adult and old adult subjects and the results were
not satisfactory, with a mean absolute error of 8.6 years. It was concluded that there are
better approaches, such as those based on the pubic symphysis or cranial sutures. Nevertheless,
the proposed method could be used to classify subjects into the previously mentioned three
stages. This also explains why our results on non-subadults were considerably worse than on
subadults: predicting age from bone lengths alone is a much harder problem. If we only consider
the young adult and adult instances that have the same ages as those tested by the author,
we obtain a slightly better MAE of 5.72, as opposed to their 5.77. If we only consider the
old adult instances, we obtain an error of about 7, which is much better than their 12.3.
In [SDIL94], experiments are performed on 328 instances with ages between 15 and 97
years, 90.5% of them of European ancestry. The SEE for regression was 11.13 for males, 9.77
for females and 10.70 for females and males together, all above our own errors.
The review in [CBM+09] presents the authors' experience with age estimation for different
age groups, from fetuses to elders, both for living and dead individuals. It explains why age
estimation methods applied to older groups are less accurate: for these groups, the discrepancy
between chronological age and physiological age is larger. For subadults (fetuses, newborns,
children and adolescents) and the transition phase, it is stated that dentition is the most
reliable method for age estimation, from its formation during the fetal stage to calcification
and eruption during childhood and adolescence. The development of the skull and hands are
also good indicators. However, especially during puberty, gender is important as well, because
girls are known to develop faster than boys. For age estimation on adults, dentition and
skeletal development are not so reliable anymore, and age estimation is based on skeletal
degeneration during aging. The authors also mention the Lamendin strategy, which is
considered one of the most precise methods for estimating the age of individuals over 40
years old.
In [PFSL12], multiple formulas of the form L·x + b are presented, where L is a measurement
of a single bone. Formulas are given for the clavicle, humerus, ulna, radius, femur, tibia,
fibula and some sums of these. Although these formulas were created with a different data
set in mind, they also target subadults, so we decided to try them on our own subadults
subset. The lowest mean absolute error obtained was 1.788, which is above both our SVR
and our LWR mean absolute errors for the subadults case study.
Table 2.26 presents the results obtained by applying some of the formulas in [PFSL12] on
our own data set.

Bone               Formula               Mean MAE
clavicle           L · 0.1968 − 9.596    2.08
humerus            L · 0.0787 − 6.944    1.856
ulna               L · 0.0923 − 7.18     2.136
radius             L · 0.1058 − 6.728    2.107
femur              L · 0.0565 − 6.901    1.936
tibia              L · 0.0639 − 5.458    1.982
fibula             L · 0.0675 − 5.827    2.319
humerus + femur    L · 0.0331 − 7.087    1.788
Table 2.26: Results obtained by applying some of the formulas in [PFSL12] on our own data
set's subadults.
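As an example of how such a formula is applied to our subadults subset, the humerus + femur formula from Table 2.26 can be evaluated as sketched below; the file and column names are illustrative, and the bone lengths are assumed to be expressed in the units used by [PFSL12].

```python
# Sketch: evaluating the humerus + femur formula from Table 2.26
# (age = 0.0331 * L - 7.087, with L the summed bone length) on the subadults.
import pandas as pd
from sklearn.metrics import mean_absolute_error

subadults = pd.read_csv("subadults.csv")          # hypothetical file with raw lengths and ages
L = subadults["HUMERUS"] + subadults["FEMUR"]     # illustrative column names
predicted_age = 0.0331 * L - 7.087
mae = mean_absolute_error(subadults["AGE"], predicted_age)
print("MAE of the humerus + femur formula: %.3f years" % mae)
```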
It is worth mentioning that, during our literature review, we noticed that researchers
in non-machine learning fields tend to build their mathematical formulas by looking at the
entire data set at once, without setting aside a test set or performing any kind of cross-
validation. Even for simple methods such as the formulas in [PFSL12], this can lead to
overfitting and a false sense of good performance. Therefore, it is likely that many of the
methods presented in the literature we are aware of would perform at least slightly worse
under the testing methodology employed in our research.
Conclusions and future work
Three machine learning models for estimating the age at death of human skeletal remains
from bone measurements were presented. We have shown that our approaches obtain better
results than the current mathematical methods that exist in the literature for similar data
sets. We have also applied some previous mathematical formulas to the task of age estimation
on our data set, and they provided worse results than those of our machine learning methods.
The main advantage of machine learning approaches over other mathematical approaches,
such as the basic formulas in [PFSL12], is that we can easily retrain machine learning models
on new data, which means that we can always have a model that is optimized for the data
set we are working with. Mathematical formulas tailored to a specific data set will likely not
generalize well, and they can be time consuming to derive for each new data set, especially
when one formula has to be built for each of multiple measurements. It is better if we can
feed all available measurements into a machine learning regressor that can learn to make
sense of them.
Further e orts will be made for applying our methods to more data, to determine what,
if any, bone measurements can help improve the MAE for adult age categories, to nd better
methods for lling in missing values in a given data set and to nd ways to automatically
remove outliers from the data sets.

Chapter 3
Background for Software Engineering problems
In this chapter we present the Software Engineering background knowledge regarding Software Defect Prediction and Software Development Effort Estimation. The literature survey conducted in this chapter has made possible the original research published in [CCMI16, MCMI16, IDC17, Ion17].
3.1 Software defect prediction
Software defect prediction is the activity of identifying software modules which are likely to develop errors in a forthcoming version of a software system; it is of major importance for software testing and for assuring software quality. Methods for detecting faulty software entities are useful for suggesting to developers the software modules that should be rigorously tested. These software entities can be software components, modules, packages, classes, methods, functions or other software artifacts.
3.1.1 Motivation
The software maintenance process represents a major part of the software life cycle, requiring a large software engineering effort. The software engineering literature reveals that understanding the software accounts for about half of the effort allocated to the maintenance activity. Fixing defects is one of the main software maintenance activities, also referred to as corrective maintenance [KMFO01].
To increase the efficiency of the software defect-fixing process, defect prediction models are useful for anticipating locations in a software system where future defects may appear. Identifying software defects is difficult, mainly for complex software projects. The main difficulty in building supervised defect predictors is the fact that defective entities in software projects are far fewer than non-defective ones, and thus the training data is highly imbalanced [ATS15].
Another important problem related to software fault identification is software defect detection, which is the process of identifying software modules that contain errors; it contributes to increasing the effectiveness of the quality assurance process. Fault detection methods are helpful for suggesting to developers which software modules should be focused on during testing, particularly when, due to lack of time, the modules cannot be systematically tested.
Code review is an activity frequently used in agile development processes for maintaining the quality of the software system. During code review, an experienced programmer reviews the source code in order to identify vulnerabilities, security issues and other problems
overlooked by the initial implementer. Since code review is a time-consuming and costly activity, detecting software defects can be used to guide the code review process by identifying sections of the code that would likely be flagged for revision during a code review session, due to various issues with that code.
3.1.2 Related work
In [CCMI16] we conducted a literature review on software defect detection. The presentation in this section is based on the original paper [CCMI16].
In the following, we briefly review several machine learning-based approaches from the defect detection literature that are related to our approaches.
Software defect detection is intensively investigated in the literature and is an active area in the software engineering field, as shown by a systematic literature review published in 2011, which collected 208 fault prediction studies published between 2000 and 2010 [HBB+11]. Detecting software faults is a complex and difficult task, mainly for large-scale software projects.
In the search-based software engineering literature there are many machine learning-based approaches for predicting faulty software entities, for example [HDF12], [MGF07] and [Mal14a]. From a supervised learning perspective, defect prediction is a hard problem, particularly because of the imbalanced nature of the training data. Moreover, identifying a set of software metrics that are relevant for discriminating between faulty and non-faulty modules is not a trivial task.
An approach that uses a combination of self-organizing maps and threshold values is presented in [ARS13]. After the SOM is trained, threshold values are used to label the trained nodes: if any of the values from the weight vector is greater than the corresponding threshold, the node will represent the defective entities. Classification is done by determining the best matching unit for the given instance and using the label of that node.
An approach for detecting defective entities using self-organizing maps is introduced in [MCCS15]. After an attribute selection based on the Information Gain [Mit97] of the attributes, a map was trained to visualize the defective entities. While we had encouraging results, we realized that in many cases defective and non-defective entities are quite similar and lie close to each other on the map. These observations led us to the use of fuzzy self-organizing maps, which can handle such situations.
There are several approaches in the literature that use different clustering algorithms to group defective and non-defective entities. One such approach is presented in [BB12b], where the K-Means algorithm is used and the centers of the initial clusters are found using Quad Trees. Varade and Ingle in [VI13] use K-Means as well, but they use Hyper-Quad Trees for the cluster center initialization. Since identifying the optimal number of clusters is not a simple task, some approaches use clustering algorithms which automatically determine the number of clusters. Such an approach is described in [SD09], where the XMeans algorithm from Weka [HFH+09] is used for clustering. After the clusters are created, the approach makes use of software metric threshold values to classify the clusters into clusters of defective and non-defective entities. The XMeans algorithm (together with EM, a second clustering algorithm which automatically determines the optimal number of clusters) is used by the authors in [PH14] as well, together with different attribute selection techniques implemented in Weka.
Yu and Mishra in [YM12] investigate the problem of building cross-project detection models, which are models built from data taken from one software system, but used and tested on a different software system. They use binary logistic regression on the Ar data sets and build two models: self-assessment, when the model is tested on the data set from which it was built, and forward-assessment, when some data sets are used for building a model and a different one is used for testing it. They conclude that self-assessment leads to better performance measures, but forward-assessment gives a more realistic measure of the real performance of the binary logistic regression model.
The problem of cross-project defect detection is approached in [NK15] as well. The authors consider situations where the software metrics from the data sets on which a model was built are not the same as the metrics computed for the system to be tested. They introduce an approach which tries to match the software metrics from the different sets to each other, based on correlation, distribution and other characteristics. To compare this approach to existing ones, they use 28 data sets (including the Ar data sets) and Logistic Regression from Weka.
Multiple Linear Regression and Genetic Programming are used in [ATF12] to evaluate the influence and performance of different resampling methods for the problem of defect detection. The Ar data sets are used as case studies to compare five different resampling methods. The results of the study show that, considering the AUC performance measure, the resampling methods do not differ significantly, but the authors claim that this can be caused by the imbalanced data sets or the high number of attributes.
A comparison of statistical and machine learning methods for defect prediction is presented in [Mal14b]. They compare logistic regression with six machine learning approaches: Decision Trees, Artificial Neural Networks, Support Vector Machines, Cascade Correlation Networks, GMDH polynomial networks and Gene Expression Programming. The models were evaluated on two Ar data sets, and the best performance was obtained using Decision Trees.
3.2 Software development effort estimation
Software development effort estimation (SDEE) is the activity of estimating the time it will take for each part of a software system to be completed during the development phase of the product. Such estimations are done either at a management level or at the level of the programmer(s) responsible for the actions that will need to be performed. Accurate estimates are important in order to properly plan the development process and allocate human resources accordingly.
3.2.1 Motivation
Multiple estimation methodologies and protocols exist, including software-aided ones. Two popular ones are simple ad-hoc estimations, in which a programmer provides a time estimate based on the task description and his or her experience with what is required, and group methods, such as planning poker, which are more involved methodologies that require a consensus from a team of programmers [Coh05]. Both of these are considered expert estimations, because experts in the field are the ones providing the estimates.
Group methods have the advantage of being more accurate and, in the case of planning poker, allow refinement of estimates over time. More complex methods generally yield better estimates than ad-hoc methods, but they take considerably more time that could otherwise be spent on actual development. As far as we know, there are no studies that analyze this trade-off and whether or not it is a worthwhile endeavour to employ more complex estimation approaches.
According to one study, between 60% and 80% of projects have actual effort values between 30% and 89% of the estimated values [MJ03]. While more recent data is not available, judging from our experience, similar percentages likely still hold.
Software-aided methodologies consist of a computer system that provides an estimate, or helps programmers arrive at one, by algorithmically analysing parts of the requirements or the software under development. The most common methods used nowadays are parametric estimation models, such as COCOMO and the Putnam model.
COCOMO uses three formulas that give the effort applied in man-months, the development time in months and the number of people required. It uses predefined constants for each of three types of projects. It does not account for developer experience, available hardware resources and some other factors [Boe01, BCH+00]. The Putnam model uses a formula for effort (expressed in person-years) that includes the size of the product being developed, the productivity of the company and the total scheduled time for the project [PM03].
Most of the existing computer-aided methods rely on software metrics of questionable
relevance and on other human estimates, such as complexity and productivity. This makes
them highly prone to large errors and far removed from the intuitive approach that most
programmers use when providing estimates.
3.2.2 Related work
Planning Poker
Planning Poker is a method based on consensus for estimating the effort required to solve programming tasks. Its main goal is to force developers to think independently and reach a proper consensus regarding the effort required, without one person influencing the rest [Coh05].
Typical Planning Poker uses a deck of cards with Fibonacci numbers on them, starting from 0 up to 89. These represent effort, measured in any unit agreed upon, such as hours [Coh05].
First, the team can discuss the requirements in order to clarify any uncertainties, without mentioning any estimates, so as to avoid influencing each other. Then, each member lays a card face down. Once everyone has decided on a card, they all turn them over at the same time.
Discussion then resumes, with developers getting the chance to state arguments that support their estimate. The process continues until a consensus is reached.
An advantage of Planning Poker over more ad-hoc methods is that it can reduce personal biases and it forces developers to properly defend their choices. An important disadvantage is that it takes more time, due to the multiple rounds and people involved.
Algorithmic approaches to software development effort estimation
Most computational approaches to SDEE rely on a mathematical formula that considers certain project metrics and on domain knowledge from experts. This makes them unlikely to perform well on a large variety of projects, and their use is mostly avoided in practice. Nevertheless, they provide a needed introduction to the field, so we give a short presentation of the most popular ones. These are generally called parametric models.
COCOMO COCOMO [BCH+00] starts by dividing projects into three types:
1. Projects with small teams having the necessary expertise and working with lax requirements (called Organic);
2. Projects with medium-sized teams having mixed experience levels and working with mixed requirement types (called semi-detached);
3. A combination of the previous two (called embedded).
Then, COCOMO provides three formulas:

E = a_b · KLOC^(b_b),    D = c_b · E^(d_b),    P = E / D    (3.1)

where:
Project type      a_b    b_b    c_b    d_b
Organic           2.4    1.05   2.5    0.38
Semi-detached     3.0    1.12   2.5    0.35
Embedded          3.6    1.20   2.5    0.32
Table 3.1: Constants used in the COCOMO formulas for each project type [BCH+00].
E represents the effort applied, expressed in man-months;
D is the development time, expressed in months;
P is the number of people required;
KLOC estimates the number of delivered thousands of lines of code for the entire project.
The other coefficients are constants that depend on the project type and are given in Table 3.1.
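The basic COCOMO computation described above translates directly into code. The following Python sketch applies Formula (3.1) with the constants from Table 3.1; the 50 KLOC example project is purely illustrative.

# Basic COCOMO constants from Table 3.1: (a_b, b_b, c_b, d_b) per project type.
COCOMO_CONSTANTS = {
    "organic":       (2.4, 1.05, 2.5, 0.38),
    "semi-detached": (3.0, 1.12, 2.5, 0.35),
    "embedded":      (3.6, 1.20, 2.5, 0.32),
}

def basic_cocomo(kloc, project_type):
    """Return (effort in man-months, development time in months, people required)."""
    a_b, b_b, c_b, d_b = COCOMO_CONSTANTS[project_type]
    effort = a_b * kloc ** b_b        # E = a_b * KLOC^b_b
    dev_time = c_b * effort ** d_b    # D = c_b * E^d_b
    people = effort / dev_time        # P = E / D
    return effort, dev_time, people

# Illustrative call for a hypothetical 50 KLOC organic project.
print(basic_cocomo(50, "organic"))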
The basic COCOMO method above can be extended to take many more factors into consideration, such as hardware restrictions, changing requirements and other human resource attributes.
An important disadvantage of the model is that it tries to accommodate many important project factors into its estimations by providing either tables of values or human estimates. Tables are not robust, cannot easily be adapted to one's own situation and are subjective. The necessity of human estimates and evaluations means that it does not provide a fully automated solution to the SDEE problem, which is our goal.
Putnam model The Putnam model [PM03] bases its estimations on Formula (3.2), where:
S is the project size, usually measured in Effective Source Lines of Code [NDrTB07];
B is a scaling factor dependent on the size S;
Pr is the process productivity, defined as the ability of the development team to produce software of a certain size at a given defect rate (this is distinct from the more conventional definition of size divided by effort);
E is the effort, expressed as the total person-years allocated to the project;
T is the total number of years allocated to the project.
(B^(1/3) · S) / Pr = E^(1/3) · T^(4/3)    (3.2)

To provide an effort estimate for a given task, the equation in Formula (3.2) is solved for E, as shown in Formula (3.3).

E = (S / (Pr · T^(4/3)))^3 · B    (3.3)
It is known that the method is sensitive to the S and Pr parameters, which must be estimated by humans. The process productivity can be calibrated according to Formula (3.4) [PM03].

Pr = S / ((E / B)^(1/3) · T^(4/3))    (3.4)
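A minimal Python sketch of how Formulas (3.3) and (3.4) could be used together is given below: the productivity Pr is first calibrated on a hypothetical completed project and then reused to estimate a new one. The project sizes, schedules and the scaling factor B are illustrative assumptions, not values taken from [PM03].

def putnam_effort(size, productivity, total_time, b):
    # Formula (3.3): E = (S / (Pr * T^(4/3)))^3 * B, with E in person-years.
    return (size / (productivity * total_time ** (4.0 / 3.0))) ** 3 * b

def putnam_productivity(size, effort, total_time, b):
    # Formula (3.4): calibrate Pr from a completed project.
    return size / ((effort / b) ** (1.0 / 3.0) * total_time ** (4.0 / 3.0))

# Hypothetical completed project: 100,000 ESLOC, 40 person-years, 2.5 years, B = 0.39.
pr = putnam_productivity(100_000, 40, 2.5, b=0.39)
# Estimate a new 150,000 ESLOC project scheduled over 3 years, assuming the same Pr.
print(putnam_effort(150_000, pr, 3.0, b=0.39))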
An advantage of the Putnam model is its calibration simplicity; however, it still suffers from the need for human estimations and can be inaccurate in practice.
Many other approaches similar to COCOMO and the Putnam model exist in the literature, such as SEER-SEM [GE06]; however, they mostly rely on fixed equations involving a number of subjective, user-inputted parameters, which reduces their robustness and resilience to human error.
Quantifying the accuracy of estimates
The most widespread accuracy metric for effort estimation systems is the Mean Magnitude of Relative Error (MMRE), shown in Formula (3.5), where n is the number of tasks estimated, EA_i is the actual effort for task i and EE_i is the estimated effort for task i. We want to minimize the MMRE.

MMRE = (1/n) · Σ_{i=1}^{n} |EA_i − EE_i| / EA_i    (3.5)
Note that the equation in Formula (3.5) is sometimes multiplied by 100, in order to express the estimation error as a percentage.
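Formula (3.5) is straightforward to express in code. The following Python sketch computes the MMRE for a small set of hypothetical actual and estimated effort values.

def mmre(actual_efforts, estimated_efforts):
    # Formula (3.5): mean of |EA_i - EE_i| / EA_i over all n estimated tasks.
    relative_errors = [abs(ea - ee) / ea
                       for ea, ee in zip(actual_efforts, estimated_efforts)]
    return sum(relative_errors) / len(relative_errors)

# Hypothetical actual vs. estimated effort values (e.g. in hours).
print(mmre([10, 40, 8], [12, 30, 8]))        # 0.15
print(100 * mmre([10, 40, 8], [12, 30, 8]))  # the same error as a percentage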
Related work on algorithmic effort estimation
We divide the SDEE literature into three categories, with regard to how close the techniques used are to our own approaches.
Classical parametric models The first category consists of the classical frameworks discussed in the previous section, such as COCOMO. There are many studies that apply these frameworks to various projects, usually private ones on which, unfortunately, we cannot apply any of our own methods in order to provide a direct comparison, as the project data is not publicly available. In the study in [BP10], Basha and Ponnurangam apply the COCOMO, SEER, COSEEKMO, REVIC, SASET, Cost Model, SLIM, FP, Estimac and Cosmic frameworks to a set of applications of various types, such as Flight Software and Business Applications, obtaining MMRE values between 0.373% and 771.87%. The authors conclude that there is no single best framework and that they are all very sensitive to the input data, the application type and the various abilities of the development team. A lack of proper bookkeeping regarding effort can also hamper proper validation of the obtained results.
Popović and Bojić analyze in [JP12] 94 projects developed between 2004 and 2010 by a single company. These are mostly Microsoft .NET Web projects with many available metrics and much documentation. The obtained MMRE values are between 10% and 46%, using linear and non-linear models with various metrics and phases at which effort is estimated. Once again, the data set used is not available for public use.
A set of open source projects is experimented on by Toka in [Tok13] using COCOMO II, SEER-SEM, SLIM and TruePlanning. The MMRE values range from 34% using TruePlanning to 74% using COCOMO II.
In a very recent literature survey on Software Effort Estimation Models, Tharwon presents in [Tha16] an overview of experimental research that uses the Function Point Analysis (FPA), Use Case Point Analysis (UCPA) and COCOMO models. The MMRE value obtained by the FPA model in the surveyed case studies is at least 13.8% and at most 1624.31%, with an average across the case studies of 90.38%. Considering UCPA, the minimum MMRE value of four surveyed experimental papers is 27.30%, the maximum is 88.01% and the average is 39.11%. Considering COCOMO, the average is 281.61%. The survey also presents multiple causes of inaccuracies in estimates. Judging by the wildly varying error rates within the same estimation model, we can also conclude that the software development effort estimation task is highly reliant on the data available within a project. There is no method that will perform well under all circumstances.
A comparison that also includes human estimates, provided through planning poker or by an expert, is performed by Usman et al. in [UMWB14]. On the considered projects, it was found that planning poker obtains an MMRE of 48%, UCPA methods obtain MMRE values between 2% and 11% and expert judgments between 28% and 38%.
As confirmed by the literature, the vast majority of the time parametric models do not provide useful effort estimates.
Machine learning models using software metrics By machine learning models using software metrics, we understand frameworks such as COCOMO that are used together with more advanced, machine learning-oriented elements, such as fuzzy logic, neural networks, Bayesian statistics and the like. We also include approaches that use pure machine learning algorithms applied exclusively to various project metrics and indicators.
A Neuro-Fuzzy approach is used by Du et al. in conjunction with SEER-SEM in [DHC15] in order to obtain lower MMRE values on four case studies consisting of COCOMO-specific data. The obtained MMRE values using the classical SEER-SEM approach are between 42.05% and 84.39%. Using the Neuro-Fuzzy enhancement, they are between 29.01% and 69.05%, which is a significant improvement.
Han et al. apply in [HJLZ15] a larger set of machine learning algorithms: linear regression, neural networks, M5P tree learning, Sequential Minimal Optimization, Gaussian Processes, Least Median Squares and REPtree. The study is conducted on 59 projects having between 6 and 28 developers and between 3 and 320 KLOC. The obtained MMRE values are between 87.5% for the Linear Regression approach and 95.1% for the Gaussian Process model.
Bayesian networks, Regression trees, Backward elimination and Stepwise selection are applied on various metrics from two software project data sets by van Koten and Gray in [vKG06]. The best obtained MMRE is 97.2% on one of the projects, using Bayesian networks, and 0.392%, using Stepwise selection, on the other project.
In a literature review of machine learning models applied to the SDEE problem [WLL+12], Wen et al. show that MMRE values fluctuate considerably between different projects as well as different learning algorithms. For example, for Case Based Reasoning, the survey found experiments with MMRE values between 13.55% and 143%. Similar ranges were found for Artificial Neural Networks, Decision Trees, Bayesian Networks, Support Vector Regression and Gaussian Processes. The authors recommend that more empirical research be performed regarding the feasibility of ML for SDEE, with a focus on industry applicability, which is found to be low in the surveyed research.
In [UMWB14], Usman et al. obtain MMRE values between 66% and 90% using linear
regression. Using Radial Basis Function networks, MMRE values between 6% and 90% are
obtained.
According to our literature review, machine learning models applied to software metrics provide better estimates than pure parametric models. The MMRE values are also less spread out between different data sets, which makes machine learning models more reliable and predictable from an accuracy point of view.
However, a remaining drawback of these approaches is the need for project software metrics, which are not always available or would take substantial effort to collect properly. Sometimes, various parameters must still be inputted by the developers, which takes about as much time as it would take developers to provide their own estimates.
Machine learning models using text processing To the best of our knowledge, the thesis by Alhad in [Sap12] is the only other research that approaches the SDEE problem by inputting task descriptions directly to ML learning pipelines. It uses a bag of words approach on keywords extracted from Agile story cards, which it then feeds to multiple learning models, such as Naive Bayes, J48, Random Forests and Logistic Model Trees. Experiments are conducted both with the Planning Poker estimates included in the actual learning part of the pipeline and without. The author reports a 106.81% MMRE for Planning Poker estimates, and a 92.32% MMRE using J48 (which outperforms the other models) with the Planning Poker estimates excluded from the learning stage. Including the Planning Poker estimates leads to slightly better results, although not by enough to justify defeating the purpose of an automatic approach.
The approach classifies instances into classes representing Fibonacci numbers, in the same way that Planning Poker estimates are provided.
Because we also use a similar text-based approach, we consider [Sap12] to be the most relevant related work to compare ourselves with, although our data sets are different and neither is publicly available.

Chapter 4
Contributions to machine learning models for software engineering
This chapter presents our machine learning models and computational experiments using Fuzzy Self Organizing Maps (FSOM), Fuzzy Decision Trees (FuzzyDT) and Word Vector models followed by Support Vector Regression and Gaussian Naive Bayes on two important Software Engineering problems: Software Defect Prediction and Software Development Effort Estimation. Our machine learning models are described in Subchapter 4.1, and the problems we have applied them to were presented in Chapter 3.
The content of this chapter contains research results published in scientific journals [CCMI16, MCMI16, IDC17, Ion17].
4.1 Proposed machine learning models
This subchapter presents the learning models, our methodology and other complementary algorithms that we have used for solving the problems detailed in Chapter 3.
We have introduced in [CCMI16] a novel unsupervised learning approach centered around fuzzy self-organizing maps for software fault detection. We have also introduced an original approach to software defect prediction using Fuzzy Decision Trees in [MCMI16].
In [IDC17], we introduce an automated solution for the SDEE problem consisting of a supervised machine learning framework that, after training, takes as input textual descriptions of required tasks and returns a numeric value representing an estimate of the effort required for completing those tasks. According to the literature review presented in Section 3.2.2, our approach is novel, with only one other approach even considering using textual descriptions of tasks in order to provide effort estimates. Our results are consistent and encouraging across a software company's entire project base. We also analyze ways in which our results can be improved, both from a machine learning perspective and from a data collection point of view at the company management level.
4.1.1 Fuzzy Self Organizing Maps
We have applied Fuzzy Self Organizing Maps (FSOM) for software fault detection in [CCMI16]. In this section, we describe this learning model. The presentation in this section is based on the original paper [CCMI16].
Self-organizing maps (SOMs), also known as Kohonen maps [SK99], are unsupervised learning methods which reduce the dimensionality of the input space by mapping the input instances into a low (usually two) dimensional output space, the so-called map [EDB08]. Both the input layer and the output layer consist of neurons which are completely connected, and these connections are weighted. The main characteristic of a SOM is that it preserves the
topology of the input space, by mapping neighboring input instances onto neighboring neurons of the map [KOS09]. After it has been built, a SOM can be used for clustering [LO92], but it is also helpful for visualizing high-dimensional data.
It is well known that fuzzy methods, unlike crisp ones, are better able to deal with noise and to increase the robustness of the developed systems.
There are various perspectives in the literature on combining SOMs with fuzzy theory [Zad65]. In [TBP94], Tsao et al. combined the fuzzy c-means algorithm with the classical Kohonen network into the so-called fuzzy Kohonen clustering map. A fuzzy self-organizing map based on Kohonen's algorithm was introduced by Lei and Zheng in [LZ95]. In [KP12], Khalilia and Popescu considered the problem of clustering relational data and proposed an algorithm, FRSOM, which combined the relational RSOM [HH07] with a fuzzy clustering algorithm for relational data presented in [HB94]. Another approach, in which the neurons of the map were substituted by fuzzy rules, was proposed in [Vuo94].
We further present our FSOM proposal introduced in [CCMI16] for identifying software defects.
Motivation
Since software systems are continuously growing in size and complexity, predicting the reliability of software is fundamental in the software development process [Zhe09]. Clark and Zubrow consider in [CZ01] that there are three main reasons for which the analysis and prediction of software defects is essential. The first one is to help the project manager measure the progress of a software project and plan activities for defect detection. The second reason is to contribute to process management, by evaluating the quality of the software product and measuring the process performance [CZ01]. Finally, information about software faults, their location within the software and the distribution of defects may contribute to increasing the effectiveness of the testing process and the quality of the next version of the software.
Many of the software defect predictors from the literature which use machine learning rely on historical data collected by extracting relevant information from software repositories [KZWG11]. Unfortunately, studies carried out in the defect prediction literature (such as [AV09]) have revealed that data obtained from change logs and filed bug reports is likely to contain noise [KZWG11]. Other machine learning-based defect predictors use openly available data sets, like the NASA data sets, where only the software metric values computed for the modules of the software system are available, but not the source code. Unfortunately, there can be noise in these data sets as well, as shown by [GBD+11]. Therefore, there is a need to build classifiers which can cope with missing information, imprecision and noise.
Methodology
In this section, we introduce our fuzzy self-organizing map model for detecting faults in existing software systems.
The software entities (application classes, software modules, functions, etc.) from a software system are represented as high-dimensional vectors (an element of such a vector represents the value obtained by applying a software metric to the considered entity). As shown in [MCCS15], the software system Soft is a set of instances (referred to as entities), Soft = {e_1, e_2, ..., e_n}. A set of software metrics represents the feature set that describes the software system's entities, M = {m_1, m_2, ..., m_l}. Therefore, a software entity e_i ∈ Soft can be expressed as an l-dimensional vector, e_i = (e_i1, e_i2, ..., e_il), where e_ij denotes the value obtained by applying the software metric m_j to the software entity e_i.
For the software entities from the system Soft, the instance label is known (D = defect or N = non-defect). The labels of the instances are not used for building the fuzzy SOM model, since the learning process is completely unsupervised. The labels are used exclusively for preprocessing the input data and for evaluating the quality of the resulting classification model.
Before applying the fuzzy SOM approach, the data is preprocessed. First, the data is normalized using Min-Max scaling, after which a feature selection step is used for determining a subset of features (software metrics) that are highly significant for the fault detection task (details are given in the experimental chapter). As a result of the feature selection step, p features (software metrics) are selected and further used for building the fuzzy SOM.
The fuzzy SOM model. Our proposal
The data set, preprocessed as indicated above, is used for the unsupervised training of the map. As for the classical SOM approach, a distance function that takes as input two instances is required. We use as the distance between two software entities e_i and e_j the Euclidean distance between their corresponding vectors.
We propose, in the following, a fuzzy self-organizing map algorithm (FSOM) for building the fuzzy map. Our algorithm does not reproduce any existing algorithm from the literature, but combines existing viewpoints related to fuzzy SOM approaches. The underlying idea of FSOM is the classical SOM algorithm, combined with the concept of fuzziness employed in fuzzy clustering [KH03].
The FSOM algorithm enhances the classical Kohonen algorithm for building a SOM with the idea (employed in fuzzy clustering) of using a fuzzy membership matrix. In fuzzy clustering, instead of using a crisp assignment of an object to a cluster, an object can belong to multiple clusters. The degree to which an input object belongs to the clusters is indicated by the set of membership levels expressed by the columns of the membership matrix. In building the fuzzy SOM, we use the fuzzy membership idea for the computation of the "winning neuron". Instead of using a crisp best-matching unit (BMU), as in the classical SOM algorithm, the membership matrix is used to specify the degree to which an input instance belongs to an output neuron (cluster). This means that an input instance is not mapped to a single neuron (its BMU), but to all the neurons (clusters) of the map, each with a certain membership degree.
Intuitively, an input instance will have the largest membership degree to the neuron representing its BMU. The idea of updating the weights connecting the winning neuron and its neighboring neurons is kept from the classical SOM, but if the input instance has a larger membership degree (level) to a neighboring neuron, this neuron will be "moved" closer to the input instance than the other neurons (i.e., the updating rule considers the computed membership levels). Through these updating rules, the FSOM algorithm maintains the main characteristic of the classical SOM of "moving" the winning neuron and its neighborhood towards the input instance, but it may provide a better updating scheme than the crisp approach.
We consider that the input layer of the map consists of p neurons (the data dimensionality after the feature selection step) and the computational layer of the map consists of c neurons arranged on a two-dimensional grid, in which an output neuron i is characterized by a p-dimensional vector of weights, w_i = (w_i1, w_i2, ..., w_ip), where w_ij represents the weight of the link between the j-th input neuron and the i-th output neuron (from the computational layer).
Let us denote by u the membership matrix, where u_ik ∈ [0, 1] for all 1 ≤ i ≤ c, 1 ≤ k ≤ n. These values are used to define a set of fuzzy c-partitions for the n entities, and u_ik expresses the membership degree of entity e_k to the output neuron (cluster) i.
The main steps of the FSOM algorithm are described in the following.
Step 1. Weights initialization. The weights are initialized with small random values from [0, 1].
Step 2. Membership degrees computation. The values from the membership matrix are computed as in Formula (4.1) (as for the fuzzy c-means clustering algorithm [KH03]). m is a real number greater than 1 and represents the fuzzifier. The role of the fuzzifier is to control the overlap between the clusters [KH03].

u_ik = 1 / Σ_{j=1}^{c} ( ||e_k − w_i|| / ||e_k − w_j|| )^(2/(m−1))    (4.1)
Step 3. Sampling. Select a random input entity e_t and send it to the map.
3.1 Matching. The "winning" neuron j* is determined as the output neuron which maximizes the membership degree of the input entity e_t, i.e., j* = argmax_{1 ≤ j ≤ c} u_jt.
3.2 Updating. After identifying the "winning" neuron, update its connection weights followed by those of its neighboring neurons, such that the neurons are "shifted" closer to the input instance. When updating the weights of a particular neuron, we consider the membership degree of the considered entity to that neuron. More precisely, for each output neuron j (for all 1 ≤ j ≤ c), its weights w_ji (for all 1 ≤ i ≤ p) will be updated with a value Δw_ji computed as in Formula (4.2)

Δw_ji = η · T_{j,j*} · (e_ti − w_ji) · (u_jt)^m    (4.2)

where η is the learning rate and T_{j,j*} denotes the neighborhood function usually used in the classical Kohonen algorithm [SK99], whose radius decreases over time.
Step 4. Iteration. Repeat Steps 2-3 for a given number of iterations.
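A minimal NumPy sketch of the FSOM training loop described in Steps 1-4 is given below. The grid size, fuzzifier, learning-rate and neighbourhood-radius schedules and the number of iterations are illustrative assumptions, not the settings used in our experiments; a Gaussian neighbourhood function is assumed for T.

import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 9                    # entities and selected software metrics (assumed sizes)
c_rows, c_cols = 5, 5            # 5x5 output grid, so c = 25 neurons
m, iterations = 2.0, 2000        # fuzzifier and number of training iterations

entities = rng.random((n, p))                 # stands in for preprocessed (Min-Max scaled) data
weights = rng.random((c_rows * c_cols, p))    # Step 1: small random initial weights
grid = np.array([(r, col) for r in range(c_rows) for col in range(c_cols)], dtype=float)

def memberships(e, w, m):
    # Step 2 / Formula (4.1): membership of entity e to every output neuron.
    d = np.linalg.norm(e - w, axis=1) + 1e-12
    ratios = (d[:, None] / d[None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratios.sum(axis=1)

for it in range(iterations):
    frac = it / iterations
    lr = 0.5 * (1.0 - frac)              # decaying learning rate (eta)
    sigma = 2.0 * (1.0 - 0.9 * frac)     # decaying neighbourhood radius
    e = entities[rng.integers(n)]        # Step 3: sampling
    u = memberships(e, weights, m)
    winner = int(np.argmax(u))           # Step 3.1: matching
    # Step 3.2 / Formula (4.2): update weighted by neighbourhood and membership degree.
    neigh = np.exp(-np.sum((grid - grid[winner]) ** 2, axis=1) / (2.0 * sigma ** 2))
    weights += lr * neigh[:, None] * (u ** m)[:, None] * (e - weights)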
Looking at Step 2 of the FSOM algorithm, we observe that an input entity will have the largest membership degree to the neuron (cluster) representing its BMU. Intuitively, the degrees to which the entity belongs to the other neurons of the map (other than its BMU) have to decrease as the distance between the entity and the neurons increases. Another characteristic of the fuzzy algorithm (compared to the crisp variant) is the fact that the weights of particular neurons from the neighborhood of the "winning" neuron (see Step 3) are updated differently, considering the degree to which the current entity belongs to the neuron. This updating method may lead to final weights which give a better representation of the input space.
After the map is trained using the FSOM algorithm described above, the U-Matrix approach [KK96] is employed in order to obtain a visualization of the FSOM. The value of a neuron in the U-Matrix is calculated by taking the average distance to its 4 nearest neighbors. If one interprets these distances as heights, the U-Matrix may be read as follows [KK96]: large values in the U-Matrix separate entities that are dissimilar from those associated with low values, while data with similar heights represent entities that are similar and can be clustered together.
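The U-Matrix computation described above can be sketched as follows, assuming the trained weights are arranged on the two-dimensional grid; border neurons are simply averaged over the neighbours that exist.

import numpy as np

def u_matrix(weights_grid):
    # weights_grid has shape (rows, cols, p); the value of each neuron is the average
    # distance of its weight vector to its up/down/left/right neighbours on the grid.
    rows, cols, _ = weights_grid.shape
    umat = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = [np.linalg.norm(weights_grid[r, c] - weights_grid[rr, cc])
                     for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                     if 0 <= rr < rows and 0 <= cc < cols]
            umat[r, c] = float(np.mean(dists))
    return umat

# e.g. trained FSOM weights reshaped onto a hypothetical 5x5 grid of 9-dimensional neurons
print(u_matrix(np.random.default_rng(0).random((5, 5, 9))))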
Since the fault prediction problem is a binary classification one, our goal is to cluster the trained map into defective and non-defective entities, corresponding to our binary classification targets.
Even though the fuzzy SOM was built using unsupervised learning, once created it can be used for classifying previously unseen software entities. First, the "winning" neuron associated
with this entity is determined (Step 3.1). Then, the computed class (defect or non-defect) of the winning neuron will also be the class of the previously unseen software entity.
For evaluating the performance of the FSOM model trained as shown above, we first compute the confusion matrix for the defect classification task. The class of defects is considered the positive class, while the non-defects are considered negative instances. The known class labels of the training entities are used in the computation of the confusion matrix.
Since defect prediction data are highly imbalanced (there are far fewer defects than non-defects), the most important goal in software fault prediction is to maximize the true positive rate (i.e., maximize the number of defective entities that are classified as faults) or, equivalently, to decrease the false negative rate (i.e., minimize the number of defective entities that are wrongly classified as non-faults). For the problem of defect detection, having false negatives is a more serious problem than having false positives: the first situation denotes an undetected fault in the system, which can cause serious problems later, while in the second situation some time is lost thoroughly testing a fault-free entity that was classified as faulty. In the case of imbalanced data, the evaluation measure that is relevant for representing the performance of the classifiers is the Area Under the ROC Curve (AUC) [Faw06] (a larger AUC value indicates a better classifier).
4.1.2 Fuzzy Decision Trees
In this section, we present our Fuzzy Decision Tree model introduced in [MCMI16] for the problem of Software Defect Prediction.
A fuzzy decision tree appears to be an effective choice for solving the software defect prediction problem for the following reasons. Most importantly, the nature of the data concerning software metrics makes a clear differentiation between the defective and non-defective classes virtually impossible, and therefore a certain degree of uncertainty must be taken into account in the decision process. That is why it is important that accurate fuzzy functions are defined and incorporated into the classical decision tree paradigm, thus transforming it into a fuzzy decision tree.
A fuzzy decision tree [Jan98] follows the classical decision tree paradigm for classification in the sense that, starting from the entire data set, a tree is constructed by selecting at each decision step the most relevant attribute with respect to an impurity measure and splitting the remaining attribute information over several branches according to the distinct values that underlie the chosen attribute. The internal nodes of the fuzzy tree contain all the training instances; however, each instance is given a membership degree for each class. Leaf nodes in the fuzzy tree, instead of indicating a single classification as they do in the crisp case, contain cumulative membership values for each of the classes.
However, in the case of the fuzzy approach, the distinct values that enable the decision branching process of the tree are replaced by fuzzy functions over the attribute. The entropy and information gain measures, on which the decision process depends, are very sensitive not only to how well balanced the target classes used in training are, but also to the construction of the fuzzy membership functions for each attribute, since these functions need to be established in such a way that they better enable the defect classification process.
Methodology
Our first action is to investigate how different software metric thresholds for the fuzzy membership functions influence the results of the algorithm.
All the data sets used for the experiments contain 29 software metrics and the class label
(defective or non-defective). For building the fuzzy decision trees, we define for each software metric two trapezoidal fuzzy functions: the first fuzzy function determines the degree to which a software metric value belongs to the faulty class, and the second fuzzy function gives the degree of membership to the non-faulty class.

Figure 4.1: The fuzzy functions defined for the total loc software metric.
For defining the fuzzy functions we take inspiration from the work presented in [FBF15]. The authors created a large data set by computing multiple software metrics for 111 software systems and identified thresholds for grouping the values of these software metrics into three categories: Good, Regular and Bad. They defined the threshold between the first two categories at the 70th percentile and between the second and the third categories at the 90th percentile of the data. Similar to this work, we merge the five Ar data sets used for the experimental evaluation and, for each software metric, we compute the values of the 70th and 90th percentiles and use these values for defining the fuzzy functions. For example, the two fuzzy functions defined for the total loc software metric, where the two percentile values are 49.9 and 117.6, are presented in Figure 4.1.
Denoting by a and b the threshold values used to define the fuzzy functions, the degree to which a software metric value x belongs to the non-defective class can be computed using Formula (4.3). Similarly, the membership degree to the defective class can be computed using Formula (4.4) [MMCC16].

nondefect(x) = 1,                 if x < a
               (b − x) / (b − a), if a ≤ x ≤ b
               0,                 if x > b    (4.3)

defect(x) = 0,                 if x < a
            (x − a) / (b − a), if a ≤ x ≤ b
            1,                 if x > b    (4.4)
Besides using the values of the 70th and 90th percentiles as thresholds for defining the fuzzy functions, we use two other pairs of thresholds as well. These thresholds are presented in Table 4.1. By modifying the threshold values, we are actually modifying the region where the two functions overlap. We consider the first threshold pair Regular, and we defined one pair where the overlap region is narrower (second row of Table 4.1) and one where the overlap region is wider (third row of Table 4.1).
Name              Percentile threshold for a    Percentile threshold for b
Regular           70                            90
Narrow overlap    75                            85
Wide overlap      65                            95
Table 4.1: Different percentile thresholds used for defining the fuzzy functions.
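The two trapezoidal functions in Formulas (4.3) and (4.4), together with the percentile-based thresholds from Table 4.1, can be sketched in Python as follows. The thresholds 49.9 and 117.6 are the total loc values mentioned in the text; any other metric would use its own percentiles computed from the merged data sets.

import numpy as np

def fuzzy_thresholds(metric_values, low_percentile=70, high_percentile=90):
    # Thresholds a and b computed from the pooled metric values (cf. Table 4.1).
    return (np.percentile(metric_values, low_percentile),
            np.percentile(metric_values, high_percentile))

def mu_nondefect(x, a, b):
    # Formula (4.3): full membership below a, linear decrease between a and b.
    if x < a:
        return 1.0
    if x > b:
        return 0.0
    return (b - x) / (b - a)

def mu_defect(x, a, b):
    # Formula (4.4): the complementary trapezoidal function.
    return 1.0 - mu_nondefect(x, a, b)

# The total loc thresholds mentioned in the text (70th and 90th percentiles).
a, b = 49.9, 117.6
print(mu_nondefect(80.0, a, b), mu_defect(80.0, a, b))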
Software metrics
In this section, the effect of different software metrics on the fuzzy decision tree is investigated.
All 29 software metrics We use the values of 29 different McCabe and Halstead software metrics: halstead vocabulary, unique operators, unique operands, total operands, total operators, executable loc, halstead length, total loc, halstead volume, halstead error, halstead difficulty, halstead effort, halstead time, blank loc, condition count, multiple condition count, branch count, decision count, cyclomatic complexity, halstead level, comment loc, code and comment loc, decision density, call pairs, design complexity, cyclomatic density, normalized cyclomatic complexity, design density, formal parameters.
A subset of 9 software metrics To reduce the dimensionality of the feature set characterizing the software entities, we use the analysis performed in [MCCS15] on the Ar3, Ar4 and Ar5 data sets for selecting relevant software metrics for the software defect prediction task. For determining the importance of the software metrics, the information gain (IG) measure was used. In the study performed by Marian et al. in [MCCS15], 9 software metrics (whose IG values are higher than a specified threshold) were selected as relevant for detecting software defects: total operands, halstead vocabulary, executable loc, total operators, halstead length, condition count, total loc, decision count, branch count [MCCS15]. These software metrics are used as features in our classification task.
The selected software metrics are shown in Figure 4.2.
Impurity functions
In this section, the effect of applying different impurity functions when constructing the fuzzy decision tree is investigated.
For creating the fuzzy decision tree, the heterogeneity of software entities labeled as defects and non-defects must be measured. This is accomplished by using a pair of impurity functions.
As described in Section 4.1.2, every non-leaf node of the tree stores every instance in the training data set (let us denote it by D), but the instances have an associated degree of membership to each class.
The first impurity function is the one usually used when building decision trees, namely the entropy.
The entropy at a node of the fuzzy decision tree is computed as shown in Formula (4.5).

Entropy(node) = − (n_defect / (n_defect + n_nondefect)) · log(n_defect / (n_defect + n_nondefect))
                − (n_nondefect / (n_defect + n_nondefect)) · log(n_nondefect / (n_defect + n_nondefect))    (4.5)

where n_defect sums the degrees of membership of the defective entities from D and n_nondefect sums the membership degrees of the non-defective entities from D.
Figure 4.2: Selected software metrics
The next impurity function used is the misclassification function. The misclassification at a given node is calculated as shown in Formula (4.6) and represents a generalization of the misclassification function from the crisp case.

misclassification(node) = n_nondefect / (n_defect + n_nondefect),  if n_defect > n_nondefect
                          n_defect / (n_defect + n_nondefect),     otherwise    (4.6)

The notations in Formula (4.6) are the same as in Formula (4.5).
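A small Python sketch of the two impurity functions is given below; n_defect and n_nondefect are the summed membership degrees of a node, and a base-2 logarithm is assumed for the entropy (the formula above does not fix the base).

import math

def fuzzy_entropy(n_defect, n_nondefect):
    # Formula (4.5); the arguments are summed membership degrees in the node.
    total = n_defect + n_nondefect
    entropy = 0.0
    for n in (n_defect, n_nondefect):
        p = n / total
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy

def fuzzy_misclassification(n_defect, n_nondefect):
    # Formula (4.6): the proportion of the minority membership mass in the node.
    return min(n_defect, n_nondefect) / (n_defect + n_nondefect)

print(fuzzy_entropy(3.2, 12.8), fuzzy_misclassification(3.2, 12.8))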
4.1.3 Learning models for software development effort estimation
In this section, we start with the motivation of our approach, then we present the word vector models and the regression models used in [IDC17] for the problem of Software Development Effort Estimation.
Motivation
In an effort to mimic the real-life effort estimation process, we introduce a natural language processing and machine learning based approach for the effort estimation problem. Our hypothesis is that, in a software project, for every resolved task we can find textual descriptions of the task in the form of comments in the source repository, together with the actual time spent by developers on the task, obtained by analyzing the logs. Using machine learning, we try to discover relations between the textual description and the needed time. Intuitively, this corresponds to the domain knowledge and experience factors in the actual effort estimation done by a software developer.
In most Agile development methodologies, the smallest unit to estimate is a task. A task has a description and is usually linked to a user story, a feature request or a usage scenario [Coh05]. Software developers are asked to give an estimate based on this textual information. At first, the only input for the estimation problem is some textual representation of the problem to be solved. Over the lifetime of a software project, the programmer gains more and more domain knowledge, so the estimate will be based not only on the description of the task but also on past experiences with related or similar tasks within the project. This can also be generalized to non-Agile methodologies which still require developers to provide time estimates for their tasks.
Automated software effort estimation can be used as a means to verify and correct actual estimates made by developers. This can be especially useful in the case of outsourcing or remote teams, where establishing trust is difficult. The machine learning based effort estimation can be used to spot discrepancies in estimates (or even reported time) given by different team members. By comparing the estimate (or reported effective time) given by the developer with the predicted one, project managers can identify problematic tasks or incorrect reporting.
Another possible use case for automated software estimation is task allocation. One can build separate predictor models for every team member (based on every finished task completed by each team member separately) and, for any new task, estimate the time for every team member and assign the task to the one with the lowest estimated time.
Since accurately estimating software development effort is a difficult and important task, for which many human estimation methodologies, as well as some automated methodologies, exist, this research considers machine learning based regression models to be appropriate for providing estimates, due to their capability of capturing relevant patterns from software development related data. In order to solve the issue of needing project metrics and human input for SDEE systems, we want to feed machine learning models with only the textual descriptions of the tasks that need solving. This is the same raw data that developers work with when they have to provide their estimates, combined with historical knowledge of past projects and intrinsic technical skills. In our approaches, we work on the assumption that the more exact and complete the historical knowledge is, the more it can compensate for a lack of actual technical knowledge, whose integration into a machine learning model is beyond the scope of this research.
Word vector models
Term frequency-inverse document frequency (TF-IDF) TF-IDF is a weighting scheme for terms in a text corpus. It represents the multiplication between the term frequency statistic (tf(d, t)), which counts how many times a term t appears in a document d, and the inverse document frequency statistic (idf(t)), which is the inverse fraction of the documents that contain the term. This is outlined in Formula (4.7), where n_d is the total number of documents and n_dt is the number of documents that include the term t. The idf is usually scaled logarithmically in a number of ways. We exemplify only one such way, the default implemented by the scikit-learn library [PVG+11].

idf(t) = 1 + log((1 + n_d) / (1 + n_dt))
tf-idf(d, t) = tf(d, t) · idf(t)    (4.7)

The resulting tf-idf values are usually normalized by the Euclidean norm, described in Formula (4.8) [PVG+11].

v_norm = v / sqrt(v_1^2 + v_2^2 + ... + v_n^2)    (4.8)

The tf-idf step is usually the first stage of a text processing machine learning pipeline. Its results are then fed to classifiers and regressors, such as those presented later in this section.
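As a minimal illustration of this step, the following sketch uses scikit-learn's TfidfVectorizer, whose defaults correspond to the smoothed idf of Formula (4.7) and the L2 normalization of Formula (4.8); the three task descriptions are hypothetical.

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical task descriptions standing in for a real corpus.
task_descriptions = [
    "Fix the login form validation error",
    "Add validation for the registration form",
    "Refactor the reporting module",
]

vectorizer = TfidfVectorizer()                  # smoothed idf + L2 normalization by default
tfidf_matrix = vectorizer.fit_transform(task_descriptions)

print(tfidf_matrix.shape)                       # (3, vocabulary size)
print(vectorizer.get_feature_names_out())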
Distributed representations of documents (doc2vec) Models such as word2vec [MSC+13] and doc2vec [LM14] address a key weakness of bag of words models like TF-IDF: words lose their semantics in the process. For example, in TF-IDF, there is no necessarily stronger relationship between the words "Paris" and "London" than between the words "Car" and "Skyscraper". In distributed vector representation models, the model would learn, from a large enough corpus, that "Paris" and "London" are both capital cities, and their vectors would be closer together than those of words with less meaning in common.
This is achieved in word2vec [MSC+13] by training a model to predict a word given a context, which is a set of words around it. For example, for the sentence "The city of London has many places to visit during a weekend trip", a training session might consist of learning to predict "London" given "city of has many", or, alternatively, learning to predict "city of has many" given "London". This leads to words with similar meaning being positioned closer together in the vector space, because it is likely that the names of capital cities will appear in similar contexts.
Because semantics are kept, such models allow for arithmetic on the word vectors. A famous example [LM14] is given in Formula (4.9), which expresses the fact that subtracting the vector for "man" from the vector for "king" and adding the vector for "woman" will lead to a vector that is very similar to the vector for "queen" [LM14].

vect("king") − vect("man") + vect("woman") = vect("queen")    (4.9)

As shown in [LM14], this can be extended to paragraphs as well, allowing us to obtain vectors for entire documents and to infer new vectors for unseen paragraphs.
Because semantics are kept, feeding these vectors into classifiers and regressors, in a similar manner to how the TF-IDF vectors are used, leads to better results for some tasks [LM14]. We use doc2vec vectors for the same purpose.
Doc2vec vectors can be combined with TF-IDF vectors by multiplying the two, thus potentially keeping the information provided by both approaches. Our experiments consider this case as well.
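A minimal sketch of training a doc2vec model and inferring a vector for an unseen task description is shown below, using the gensim library; the corpus, vector size and number of epochs are illustrative assumptions (real use requires a much larger corpus).

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical task descriptions; a real corpus would be far larger.
task_descriptions = [
    "fix the login form validation error",
    "add validation for the registration form",
    "refactor the reporting module",
]
corpus = [TaggedDocument(words=text.split(), tags=[i])
          for i, text in enumerate(task_descriptions)]

model = Doc2Vec(vector_size=50, min_count=1, epochs=40)   # illustrative hyperparameters
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Infer a vector for an unseen task; this vector can then be fed to a regressor.
print(model.infer_vector("fix the broken validation message".split()).shape)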
Support vector regression
Cortes and Vapnik originally developed support vector machine models for supervised classification [CV95], but they have also been successfully applied to regression. The regression method is known as ε-support vector regression (SVR), since an extra hyperparameter ε is used for controlling the algorithm's error level.
The SVR algorithm solves the optimization problem given in Formula (4.10) [CV95].

minimize    (1/2) · ||w||^2 + C · Σ_{i=1}^{m} (ξ_i^- + ξ_i^+)
subject to  y_i − (w · x_i + b) ≤ ε + ξ_i^-
            (w · x_i + b) − y_i ≤ ε + ξ_i^+
            ξ_i^-, ξ_i^+ ≥ 0    (4.10)
In Formula (4.10), x_i represents a training instance, y_i is its target value, b is a bias term, w is a vector of weights and C is a regularization hyperparameter. ξ_i^- and ξ_i^+ are non-negative "slack variables" which allow regression errors in the learning process [SS04]. In order to obtain non-linear regression surfaces, kernel functions (RBF, sigmoid, etc.) can be introduced for mapping the input data points into a higher dimension [CV95].
Methods such as stochastic gradient descent [Mit97] can be used to solve the SVR problem faster in the linear case.
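As a hedged illustration of how SVR can sit at the end of such a text pipeline, the following scikit-learn sketch fits an ε-SVR with an RBF kernel on TF-IDF features of hypothetical task descriptions; the hyperparameters C and epsilon are placeholders that would normally be tuned.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR

# Hypothetical task descriptions with recorded completion times (hours).
descriptions = [
    "fix login form validation error",
    "add registration form validation",
    "refactor reporting module",
    "write unit tests for the reporting module",
]
hours = np.array([3.0, 4.0, 16.0, 8.0])

# TF-IDF features followed by epsilon-SVR; C, epsilon and the kernel would be tuned.
model = make_pipeline(TfidfVectorizer(), SVR(kernel="rbf", C=10.0, epsilon=0.5))
model.fit(descriptions, hours)

print(model.predict(["fix validation error on the registration form"]))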
Gaussian Naive Bayes
The Gaussian Naive Bayes (GNB) classifier is a classification algorithm that assumes that the feature values in each class follow a Gaussian distribution. This allows the algorithm to function without having to discretize the features.
First, the algorithm computes μ_c and σ_c^2, representing the mean and variance of all instances in class c, for each class c. When having to classify a new instance v, the algorithm uses the Gaussian distribution parameterized by μ_c and σ_c^2 to find the probability of v belonging to class c, as shown in Formula (4.11).

p(v | c) = exp(−(v − μ_c)^2 / (2σ_c^2)) / sqrt(2π·σ_c^2)    (4.11)
By not having to discretize the feature values, which would be required in order to apply the classical Naive Bayes algorithm, GNB makes use of all the available information in the data.
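A minimal scikit-learn sketch of GNB on continuous features is given below; the feature vectors and the Fibonacci-style class labels are hypothetical.

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical continuous features (e.g. dense doc2vec vectors) and effort classes.
X = np.array([[0.2, 1.1], [0.1, 0.9], [2.3, 3.4], [2.1, 3.9]])
y = np.array([1, 1, 5, 5])                 # e.g. Fibonacci effort classes

clf = GaussianNB()                         # fits a per-class mean and variance per feature
clf.fit(X, y)
print(clf.predict([[2.0, 3.5]]))           # -> [5]
print(clf.predict_proba([[2.0, 3.5]]))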
Experimental methodology
Our experiments consist of a machine learning pipeline with multiple steps, a hyperparameter search and a final model evaluation.
Before the learning starts, the data set is first of all randomly shuffled.
The machine learning pipeline for the data sets in [IDC17] All of our experiments start with and without a text preprocessing step that differs based on the type of data set: the T set or a d_i set. They then proceed in the same manner.
For the T set, this preprocessing step consists of transforming the text of each instance as follows. Since the fields for each action can appear multiple times, we concatenate them such that the keywords Description, Type etc. only appear a single time, followed by the contents of all of them. We also copy the numeric contents of the Complexity, Number of entities and Estimation numeric fields into a separate numerics vector that will be carried over to the next stages of the pipeline together with the preprocessed text. Note that the preprocessed text still contains these numeric fields.
For the d_i data sets, the preprocessing consists of simply copying the numeric fields into the separate numerics vector.
Experiments are performed with and without the initial preprocessing step. Without it,
the raw text data, as described in the previous subsection, is fed to the next stages.
The next part of the learning pipeline is the transformation of the text into a vector model. We perform experiments using TF-IDF and doc2vec, described in Section 4.1.3. In case the first preprocessing step is applied, the vector model is fed to the next part of the pipeline concatenated with the numerics vector and scaled to zero mean and unit variance.
The final part of the pipeline is using an actual learning model to learn relations between the text (now represented as a numeric vector provided either by TF-IDF or doc2vec) and the real completion times. The learning models we use are also presented in Section 4.1.3. Figure 4.3 shows a flowchart of our machine learning pipeline.
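A minimal scikit-learn sketch of this kind of pipeline is shown below; the texts, completion times and hyperparameter values are hypothetical, and the separate numerics vector is omitted for brevity. Note that scaling keeps the TF-IDF matrix sparse (variance only) instead of centering it.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

texts = ["Description: Insert new columns Type: Change",
         "Description: Make new column admin only Type: Change"]
minutes = [60.0, 90.0]

pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    StandardScaler(with_mean=False),   # scale to unit variance while preserving sparsity
    SVR(kernel="linear"),
)
pipeline.fit(texts, minutes)
print(pipeline.predict(texts))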
Our proposed method relies on the SVR algorithm. We also run tests with GNB in order to compare our methodology with the one used in [Sap12], where the authors employ a bag of words approach that results in discrete features. Since TF-IDF and doc2vec do not produce discrete features, we opted for GNB so as not to lose information by discretization.
The authors employ classification into Fibonacci classes, as used in Planning Poker. Each training instance is put in the class corresponding to the Fibonacci number closest to its actual effort. MMRE values are computed by using the Fibonacci numbers associated with each class. We do the same on our data sets.
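A small illustrative sketch of this class assignment is given below; the Fibonacci values used are the usual Planning Poker ones, and the upper limit of the list is an assumption for illustration only.

FIBONACCI = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89]

def fibonacci_class(effort):
    # the class of an instance is the Fibonacci number closest to its actual effort
    return min(FIBONACCI, key=lambda f: abs(f - effort))

print(fibonacci_class(7))    # -> 8
print(fibonacci_class(40))   # -> 34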

Figure 4.3: Flowchart of the used machine learning pipeline.

The machine learning pipeline for the data sets in [Ion17] Our approach consists
of the following steps, which make up our machine learning pipeline:
1. Vocabulary and output preprocessing. We sort the words in our training set by the standard deviation of the known completion times of the tasks in the training set in which each word appears. For each task, we only keep a percentage of the words in it (as long as the number of resulting words is above a given value) based on this statistic. We also offer the possibility, based on a hyperparameter setting, of duplicating some of these words (a small illustrative sketch is given after this list).
The resulting tasks are concatenated with the available, one-hot encoded, project metrics.
This is an original statistical preprocessing method that we have experimentally determined to lead to improved results.
We also take the logarithm of our targets, since they are exponentially distributed in our data set.
2. TF-IDF or doc2vec. After our custom vocabulary preprocessing, we feed the textual part of our tasks to a TF-IDF or doc2vec [MSC+13, LM14] (experiments are performed for each) model that outputs numerical data.
3. Min-max normalization. The fully numerical output so far is normalized using the min-max method. This has the effect of converting every feature value of an instance to a value in the interval [0, 1]. Formula (4.12) presents this process.

\[
X_{normalized} = \frac{X - X_{min}}{X_{max} - X_{min}} \tag{4.12}
\]
4. Artificial neural network. The resulting features are given to a neural network that learns to perform the regression task of predicting the completion times. The quality of the predictions is measured using MMRE, which is defined as in Formula (3.5). Note that this is sometimes multiplied by 100 in order to obtain a percentage.
Artificial neural networks can be viewed as a generalization of linear regression. They are made of neurons arranged in successive layers, starting from the input layer to the output one, through a number of computational layers. Each input layer neuron takes as input one feature and outputs it to all the neurons in the first computational layer. Each neuron in a computational layer performs a weighted sum of its inputs, applies an activation function, and outputs the results to all the neurons in the next layer [Mit97].
Various activation functions exist, such as the Logistic Sigmoid (Formula (4.13)) and the Hyperbolic Tangent (Formula (4.14)).

\[
f(x) = \frac{1}{1 + \exp(-x)} \tag{4.13}
\]

\[
f(x) = \tanh(x) \tag{4.14}
\]
The network is then trained using backpropagation with gradient descent or other
optimizers [Mit97].
5. Random search. A random search [BB12a] is performed in order to optimize the hyperparameters for each of the above elements, over a fixed number of iterations.
We perform this search on 67% of our data set, and report results on evaluating the
resulting optimized pipeline by using it to make predictions on the rest of the data.
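The following sketch, referenced in step 1 above, illustrates the statistical vocabulary preprocessing on hypothetical data. Which end of the ranking is kept and the exact word duplication mechanism are hyperparameters of our pipeline, so the concrete choices below are assumptions made only for illustration.

import numpy as np
from collections import defaultdict

tasks = [("fix login bug", 30.0), ("add login audit log", 120.0), ("fix typo", 10.0)]

# standard deviation of the completion times of the tasks each word appears in
times_per_word = defaultdict(list)
for text, minutes in tasks:
    for word in set(text.split()):
        times_per_word[word].append(minutes)
word_std = {w: float(np.std(t)) for w, t in times_per_word.items()}

def keep_words(text, fraction=0.5, min_words=2):
    # keep a fraction of each task's words, ranked by the statistic above
    words = sorted(text.split(), key=lambda w: word_std.get(w, 0.0))
    kept = max(min_words, int(len(words) * fraction))
    return words[:kept]

print(keep_words("fix login bug"))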

Hyperparameter search
Our pipeline for the SDEE problem involves many hyperparameters for the TF-IDF or doc2vec stages and for the SVR stage; there are no hyperparameters for GNB. These need proper values in order to obtain good results. Since there are so many, it is impossible to run a full grid search over them, so we opt for a random search, which has been shown to provide good results in general [BB12a]. We run this search for 5000 iterations on each experiment, unless otherwise specified, using ten-fold cross validation (10CV) to evaluate each configuration.
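A minimal sketch of such a random search with scikit-learn is given below. The sampled hyperparameters and distributions are only a small illustrative subset of the ones we actually use, and the scoring function is a stand-in for MMRE.

from scipy.stats import loguniform, uniform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("svr", SVR(kernel="linear"))])
param_distributions = {
    "tfidf__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "tfidf__norm": ["l1", "l2"],
    "svr__C": loguniform(1e-2, 1e2),
    "svr__epsilon": uniform(0.0, 1.0),
}
search = RandomizedSearchCV(pipeline, param_distributions, n_iter=5000, cv=10,
                            scoring="neg_mean_absolute_error")
# search.fit(texts, minutes) would then evaluate the 5000 sampled configurations with 10CV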
Model evaluation
Once the random hyperparameter search completes, we report the best result it has found for some configuration of hyperparameters, according to our sampling sets and distributions. The best hyperparameters are applied to a new pipeline, the data is reshuffled and the pipeline is evaluated on the data set again using 10CV. These are the reported results.
Technical implementation details
All of our experiments are performed with the help of the scikit-learn [PVG+11] and Gensim [RS10] libraries. Scikit-learn provides implementations and helpers for TF-IDF, support vector regression, Gaussian Naive Bayes, random hyperparameter searches and cross validation, while Gensim implements a doc2vec algorithm.
For the hyperparameter search, in the case of TF-IDF, we sample values for 11 TF-IDF-specific hyperparameters, such as the level of ngrams (character or word), the ngram range, the maximum number of features to keep, the normalization method etc., either from discrete sets of values that are likely to perform well or from uniform distributions over known good ranges. In the case of doc2vec, we sample 11 doc2vec-specific hyperparameters, such as the window size when learning the representations, the resulting vector sizes, the learning rate, the minimum count of words and others, from the same kind of distributions.
For the SVR model, we sample hyperparameters such as C, ε and other SVR-specific ones from the same type of distributions. We only consider the linear kernel in our experiments, having found that others take much longer to evaluate and provide almost no improvements. The GNB model has no hyperparameters.
In total, we consider and sample values for all hyperparameters available for the models implemented in scikit-learn [PVG+11] and Gensim [RS10] that we used. Thus, our approach is robust, comprehensive and likely to find near-optimal hyperparameters for solving the SDEE problem using a machine learning pipeline.
4.2 Computational experiments
This subchapter presents our computational experiments, using the models and techniques presented in the previous subchapter, on the software engineering problems approached in this thesis: software defect detection, software defect prediction and software development effort estimation. Comparisons to related work are also presented.
4.2.1 Software defect detection using FSOM
In [CCMI16], the FSOM model (presented in Section 4.1.1) is experimentally evaluated
on five open-source data sets that are popular in the software defect detection literature. The
implementation of the proposed FSOM is fully original and does not make use of any third
party tools or software libraries.
The presentation from this section is based on the original paper [CCMI16].

Data set   Defects       Non-defects    Difficulty
Ar1        9 (7.4%)      112 (92.6%)    0.666
Ar3        8 (12.7%)     55 (87.3%)     0.625
Ar4        20 (18.69%)   87 (81.31%)    0.7
Ar5        8 (22.22%)    28 (77.78%)    0.375
Ar6        15 (14.85%)   86 (85.15%)    0.666
Table 4.2: Description of the data sets used in the experiments.
Data sets
The data sets used in our experiments are freely available for download at [dat] and are called Ar1, Ar3, Ar4, Ar5 and Ar6. All five data sets are well known and originate from software of a Turkish manufacturer, written in C [MCCS15]. The software entities from these data sets are functions and methods from the considered software and are represented as 29-dimensional vectors containing the values of multiple McCabe and Halstead software metrics. For each instance within the data sets, we also know the class label, which identifies the entity as defective or not defective.
We depict in Table 4.2 the description of the Ar1-Ar6 data sets used in our case studies. For each data set, the number of defects and non-defects are illustrated, as well as the difficulty of the data set. The measure of difficulty for a data set was introduced by Boetticher in [Boe07]. The difficulty is given by the percentage of entities that have their nearest neighbor with a different class label. Due to the data sets being intrinsically imbalanced, only the ratio of faulty entities with a non-defective nearest neighbor was considered for computing the difficulty.
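The following sketch computes this difficulty measure on hypothetical data, under the interpretation described above: the fraction of defective entities whose nearest neighbor, excluding the entity itself, is non-defective.

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.rand(50, 9)                               # hypothetical software metric vectors
y = rng.randint(0, 2, size=50)                    # 1 = defective, 0 = non-defective

nn = NearestNeighbors(n_neighbors=2).fit(X)       # the first neighbor of each point is the point itself
_, idx = nn.kneighbors(X)
defective = np.where(y == 1)[0]
difficulty = float(np.mean(y[idx[defective, 1]] == 0))
print(difficulty)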
From Table 4.2 it can be seen that all data sets are strongly imbalanced, with non-defects always much more numerous than defects. Moreover, it can be seen that the task of accurately classifying the defective entities is very difficult. Ar1, Ar4 and Ar6 seem to be the most difficult data sets from the defect classification viewpoint. The complexity of the software fault prediction task for the Ar1 and Ar6 data sets is highlighted in Figures 4.4 and 4.5, which depict a two dimensional view of the data obtained using t-SNE [vdMH08]. T-distributed Stochastic Neighbor Embedding (t-SNE) is a method used to visualize high-dimensional data in a way that better reflects the initial structure of the data compared to other techniques, such as PCA. From a visualization point of view, the method has been shown to produce better results than its competitors on a significant number of data sets.
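A small sketch of such a two dimensional reduction with the scikit-learn t-SNE implementation is given below; the data is hypothetical (our figures were produced from the actual Ar data sets).

import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
X = rng.rand(100, 29)                              # hypothetical 29-dimensional metric vectors
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_2d.shape)                                  # (100, 2)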
Results
For the fuzzy self-organizing map, we used in our experiments the torus topology, due to the fact that this topology leads to better neighborhoods than the lattice topology [KTO+07]. The parameters used for building the map are the following: 200000 training epochs and a learning coefficient of 0.7. For controlling the overlapping degree in the fuzzy approach, the fuzzifier was set to 2 (shown in the literature to be a good value for controlling the fuzziness degree [KH03]).
For the feature selection step, we have used the analysis that was performed in [MCCS15] on the Ar3, Ar4 and Ar5 data sets. For determining the importance of the software metrics for the defect detection task, the information gain (IG) measure was used. From the software metrics whose IG values were higher than a given threshold, a subset of metrics that measure features of the software system were finally selected. Therefore, 9 software metrics were selected in [MCCS15] to be representative for the defect detection process: total operands, halstead vocabulary, executable loc, total operators, halstead length, condition count, total loc, decision count, branch count [MCCS15]. The previously mentioned features (software metrics) will also be used in our FSOM approach.

Figure 4.4: t-SNE plot for the Ar1 data set.
Figure 4.5: t-SNE plot for the Ar6 data set.
The following presents the results obtained by applying the proposed FSOM model (see Section 4.1.1) on the Ar1, Ar3, Ar4, Ar5 and Ar6 data sets. After the data is preprocessed, the FSOM algorithm introduced in Section 4.1.1 is applied and the U-Matrix associated with the trained FSOM is used to identify the classes of defects and non-defects. Then, for each instance from the training data set, we compare the class provided by our FSOM with the entity's true class label (known from the training data). Finally, the AUC measure is computed.
Figures 4.6, 4.7, 4.8, 4.9 and 4.10 depict the U-Matrix visualization of the best FSOMs obtained on the five data sets used in the experimental evaluation. On each neuron from the maps we represent the training instances (software entities) which were mapped (using the FSOM algorithm) on that neuron, i.e., instances for which the neuron was their BMU. The red circles are the defects and the green circles are the non-defects. Each neuron is also marked with the number of defects (D) and non-defects (N) which are represented on it.
Figure 4.6: U-Matrix built on the Ar1 data set.
Figure 4.7: U-Matrix built on the Ar3 data set.

Visualizing the U-Matrices from Figures 4.6, 4.7, 4.8, 4.9 and 4.10, one can identify two distinct areas: one containing lightly colored neurons, whereas the second area consists of darker neurons. The two areas represented on the maps correspond to the clusters of defective and non-defective software entities. Since the percentage of software faults from the software systems is significantly smaller than the percentage of non-faulty entities (see Table 4.2), the area from the map containing a larger number of elements is considered to be the non-defective cluster. The remaining area from the map corresponds to the defective cluster.
Figure 4.8: U-Matrix built on the Ar4 data set.
Figure 4.9: U-Matrix built on the Ar5 data set.
Table 4.3 illustrates, for each data set, the configuration used for the FSOMs (number of rows and columns of the maps) as well as the values from the confusion matrix.

Data set   Rows x Columns   FP   FN   TP   TN
Ar1        3x2              26   1    8    86
Ar3        2x3              1    2    6    54
Ar4        2x3              18   4    16   69
Ar5        3x2              4    0    8    24
Ar6        3x3              18   4    11   68
Table 4.3: Results obtained using FSOM on all experimented data sets.
Figure 4.10: U-Matrix built on the Ar6 data set.
Discussion and comparison to related work
As presented in Section 4.2.1 and graphically illustrated in Figures 4.6, 4.7, 4.8, 4.9 and
4.10, our FSOM approach was able to provide a good topological mapping of the entities
from the software system and successfully identified two clusters corresponding to the faulty and non-faulty entities. Even if the separation was not perfect, which is extremely difficult to achieve for the software defect detection task, for all five data sets we obtained good enough true positive rates (at least a 73% detection rate for the defects). For the Ar5 data set, our FSOM succeeded in obtaining a perfect defect detection rate, misclassifying only 4 non-defective entities.
The AUC measure is often considered to be the best performance measure for comparing classifiers [Faw06]. However, it is usually suitable for methods which provide a value that is converted into a classification based on a threshold. In such cases, different thresholds lead to different (sensitivity, 1-specificity) points on the ROC curve, and AUC computes the area under this curve. For methods where no threshold is used (for example, our approach) the ROC curve contains one single point, which is linked to the points (0,0) and (1,1), thus providing a curve and making the computation of the AUC measure possible.
Table 4.4 presents the AUC values computed for the results obtained using the proposed approach, together with literature results for certain similar approaches ([ARS13], [SD09], [PH14], [BB12b], [YM12]). If an approach does not report results on a particular data set, we marked it with "n/a" (not available). In the case of approaches that do not report the value of the AUC measure, but report other measures (for example false positive rate, false negative rate), we used, where possible, the confusion matrix to compute the AUC measure. The best results obtained for the AUC measure are marked in bold in the table.
We would like to mention that the results from [ATF12] for the Multiple Linear Regression and Genetic Programming approaches are the best values reported by the authors and they were usually achieved for different resampling settings. In the case of the cross-project defect prediction approach [YM12], we have reported only the results of the experiments in which the same data set was used both for building the model and testing it.
From Table 4.4 we observe that our FSOM approach has better results than those of related literature approaches for this problem. Out of 54 comparisons, in 48 cases our algorithm reports better or equal AUC values, which corresponds to 89% of the comparisons.
It has to be noted that the fuzzy SOM method introduced in [CCMI16] proved to have a better or equal performance, for all data sets, than the crisp approach previously introduced in [MCCS15]. For the Ar3 and Ar6 data sets, the FSOM performed similarly to the classical SOM, while for the other three data sets the FSOM outperformed the SOM. For the Ar1 data set, the FSOM obtained a significantly better AUC value than the classical SOM. These results highlight the effectiveness of using a fuzzy approach with respect to the crisp one.
An analysis of the results depicted in Table 4.4 reveals that our FSOM approach obtained its maximum AUC value on the Ar5 data set, the second highest values on the Ar1 and Ar3 data sets and the third highest on the Ar6 data set. Interestingly, the results that we have obtained are perfectly correlated with the difficulties of the considered data sets (given in Table 4.2). More precisely, the best result was obtained for the "easiest" data set, Ar5, while the worst results were obtained for the data sets which are more "difficult", Ar6 and Ar4. Even for the hardest data sets, the AUC values obtained by the FSOM are larger than most of the AUC values from the literature.

Approach Ar1 Ar3 Ar4 Ar5 Ar6
Our FSOM 0.829 0.87 0.80 0.93 0.762
SOM [MCCS15] 0.695 0.87 0.74 0.92 0.726
SOM with Threshold [ARS13] n/a 0.88 0.95 0.84 n/a
K-means with Quad-Trees [BB12b] n/a 0.70 0.75 0.87 n/a
Clustering Xmeans [PH14] n/a 0.84 0.69 0.86 n/a
Clustering EM [PH14] n/a 0.82 0.69 0.80 n/a
Clustering Xmeans [SD09] n/a 0.70 0.75 0.87 n/a
Genetic Programming [ATF12] 0.530 0.67 0.65 0.67 0.630
Multiple Linear Regression [ATF12] 0.550 0.61 0.62 0.55 0.590
Binary Logistic Regression [YM12] 0.551 0.87 0.73 0.39 0.722
Logistic Regression [NK15] 0.734 0.82 0.82 0.91 0.640
Logistic Regression [Mal14b] 0.494 n/a n/a n/a 0.538
Artificial Neural Networks [Mal14b] 0.711 n/a n/a n/a 0.774
Support Vector Machines [Mal14b] 0.717 n/a n/a n/a 0.721
Decision Trees [Mal14b] 0.865 n/a n/a n/a 0.948
Cascade Correlation Networks [Mal14b] 0.786 n/a n/a n/a 0.758
GMDH Network [Mal14b] 0.744 n/a n/a n/a 0.702
Gene Expression Programming [Mal14b] 0.547 n/a n/a n/a 0.688
Table 4.4: Comparison of our AUC values with the related work.
Figure 4.11 depicts, for each data set we have considered, the AUC value obtained by our FSOM and the average AUC value reported in related work from the literature for that data set (see Table 4.4). The first dashed bar in this figure corresponds to our FSOM. One can observe that the AUC value provided by our approach is better, for each data set, than the average AUC value from the existing related work.
Conclusions and future work
A fuzzy self-organizing feature map was introduced for detecting, in an unsupervised manner, those software entities which are likely to be defective. The experiments we have performed on five open-source data sets used in the literature provided results that are better than most of the existing approaches for detecting software defects. Moreover, the fuzzy approach introduced proved to outperform the crisp one on the considered case studies.
More case studies and real world software systems are still needed in order to exper-
imentally strengthen the applicability of the fuzzy self-organizing map model proposed in
[CCMI16]. Identifying software metrics appropriate for software fault detection is also an
important research direction [RHT13].
4.2.2 Software Defect Prediction using FuzzyDT
In this section, the FuzzyDT approach introduced in Section 4.1.2 is experimentally evaluated on available data sets used in the software defect detection literature. For each case study, the comparison criteria presented above are applied.
First, the data sets used in our case studies are described, then the obtained experimental results are provided. The obtained results are analyzed and compared to related work in Section 4.2.2.

Figure 4.11: Comparison to related work.
Data set Defective Non-defective Difficulty
Ar1 9 112 0.666
Ar3 8 55 0.625
Ar4 20 87 0.7
Ar5 8 28 0.375
Ar6 15 86 0.666
Table 4.5: Description of the Ar1-Ar6 data sets.
Data sets
The data sets used in our experiments are Ar1, Ar3, Ar4, Ar5 and Ar6; they are open-source and available at [dat]. These data sets were obtained from software of a Turkish manufacturer, written in C [MCCS15]. From these software products, the functions and methods were extracted, and these entities are represented as 29-dimensional vectors containing several McCabe and Halstead software metrics values. The data sets used in our case studies are composed of these high-dimensional representations. For each software entity from the data sets, the defective or non-defective nature of the entity is known and serves as the class label.
Table 4.5 describes the Ar1-Ar6 data sets considered in our experiments. For each data set, its difficulty, as well as the number of defects and non-defects, are shown. The measure of difficulty for a data set was proposed in [Boe07] and represents the percentage of software entities in the data for which the nearest neighbor has a different class label. In order to compute the difficulty measure, only the ratio of defective entities having a non-defective nearest neighbor is used.
One can observe from Table 4.5 the imbalanced nature of the data sets, with a much smaller number of defective entities than non-defective ones. We also observe large values for the difficulty measure, which confirm the complexity of the defect classification task.
In order to make the data sets less imbalanced, we decided to add to each data set more defective entities, but instead of oversampling or creating synthetic instances, we use the actual defective instances from the other data sets.

For example, to the 9 defective entities from Ar1 we add all the entities labeled as defective from the other four data sets. In this way, all data sets contain the same 60 defective entities, while the number of non-defective entities remains the same as in Table 4.5. The only exception is the Ar5 data set, which has only 28 non-defective entities and to which we only add the 20 defective entities from Ar4, making it perfectly balanced.

Figure 4.12: Two dimensional representations using t-SNE of our transformed data sets.
Figure 4.12 shows the Ar1-Ar6 data sets reduced to two dimensions using t-SNE [vdMH08], after the aforementioned transformations. It can be seen that the defective and non-defective instances are clustered very close together, with no clear way to separate them in two dimensions. This is further evidence of the problem's intrinsic difficulty.
Results
For evaluating the performance of the fuzzy decision tree, we have used a leave-one-out cross validation technique [WLZ00]. For each data set, a fuzzy decision tree is built using all but one of the instances and the tree is tested on the instance not used for training. This process is repeated until every instance from the data set has been used once for testing.
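A generic sketch of this leave-one-out loop is shown below; the FuzzyDT implementation itself is original, so a scikit-learn decision tree stands in here only to make the loop runnable on hypothetical data.

import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(40, 9)                          # hypothetical metric vectors
y = rng.randint(0, 2, size=40)               # 1 = defective, 0 = non-defective

predictions = np.empty_like(y)
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    predictions[test_idx] = clf.predict(X[test_idx])

tp = int(np.sum((predictions == 1) & (y == 1)))   # one cell of the confusion matrix
print(tp)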
The confusion matrix is computed during the cross-validation process. It contains the true positives (TP; defective instances classified as defective), true negatives (TN; non-defective instances classified as non-defective), false positives (FP; non-defective entities classified as defective) and false negatives (FN; defective entities classified as non-defective).
In the literature there are different evaluation measures used for estimating the performance of classifiers. The accuracy (Formula (4.15)) is frequently used for indicating the performance of a classifier, but it is not relevant for imbalanced training data sets. The Area under the ROC curve (AUC) measure [Faw06] is an evaluation measure that is more relevant in this case. The AUC measure is generally used when a classifier does not return the class directly, but a value that is transformed into a class label using a threshold. In the case of such approaches, different values for the detection probability (Formula (4.16)) and the probability of false alarm (Formula (4.17)) measures can be obtained by changing the threshold value. For

each possible value of the threshold, the point (Pf, Pd) is plotted, and AUC computes the area under the curve [MMCC16].

\[
Acc = \frac{TP + TN}{TP + TN + FP + FN} \tag{4.15}
\]

\[
Pd = \frac{TP}{TP + FN} \tag{4.16}
\]

\[
Pf = \frac{FP}{FP + TN} \tag{4.17}
\]

If an algorithm directly outputs the class label, as the proposed approach does, there is a single (Pf, Pd) point, but it can be connected to the points (0, 0) and (1, 1) and the area under the obtained curve can be computed as shown in Formula (4.18) [MMCC16].

\[
AUC = \frac{Pd + 1 - Pf}{2} \tag{4.18}
\]
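As a small numeric check of Formulas (4.15)-(4.18), the confusion matrix reported for the Ar5 data set in Table 4.3 (FP=4, FN=0, TP=8, TN=24) yields the AUC value of 0.93 listed for the FSOM in Table 4.4:

tp, fp, tn, fn = 8, 4, 24, 0                 # Ar5 row of Table 4.3

acc = (tp + tn) / (tp + tn + fp + fn)        # Formula (4.15)
pd = tp / (tp + fn)                          # Formula (4.16)
pf = fp / (fp + tn)                          # Formula (4.17)
auc = (pd + 1 - pf) / 2                      # Formula (4.18)
print(round(acc, 3), round(pd, 3), round(pf, 3), round(auc, 3))   # 0.889 1.0 0.143 0.929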
The results achieved for the five data sets used in the experimental evaluation, for all the parameter combinations, are presented in Tables 4.6, 4.7, 4.8, 4.9 and 4.10. In these tables, besides the values of the Acc and AUC metrics, we provide the complete confusion matrices as well.
Thresholds   #Metrics   Impurity function   TP   FP   TN    FN   Acc     AUC
a=70, b=90   29         Entropy             38   8    104   22   0.826   0.781
a=70, b=90   29         Misclassification   38   8    104   22   0.826   0.781
a=70, b=90   9          Entropy             35   5    107   25   0.826   0.769
a=70, b=90   9          Misclassification   34   5    107   26   0.820   0.761
a=75, b=85   29         Entropy             36   11   101   24   0.797   0.751
a=75, b=85   29         Misclassification   39   12   100   21   0.808   0.771
a=75, b=85   9          Entropy             34   5    107   26   0.820   0.761
a=75, b=85   9          Misclassification   34   5    107   26   0.820   0.761
a=65, b=95   29         Entropy             39   4    108   21   0.855   0.807
a=65, b=95   29         Misclassification   36   6    106   24   0.826   0.773
a=65, b=95   9          Entropy             31   3    109   29   0.814   0.745
a=65, b=95   9          Misclassification   31   3    109   29   0.814   0.745
Table 4.6: Detailed results obtained for the Ar1 data set.
Discussion and comparison to Related Work
To get an overall view of the results, Table 4.11 presents the minimum, maximum, average and population standard deviation of the Accuracy and AUC values across all configurations for each data set. It can be seen that the best average accuracy is obtained on Ar1, while the best average AUC is obtained on Ar5.
We also record, in Table 4.12, for each data set the configurations for which the highest AUC values are achieved. The column T contains the percentile thresholds, M contains the number of metrics, and I contains the impurity function of each configuration.
From Table 4.12 we can see that for each data set the highest AUC value was achieved for a different configuration.
Since looking at the whole configuration does not lead to a conclusion regarding the best configuration, in the following we compare the results separately for each of the three comparison criteria presented initially. Table 4.13 shows the results of the comparison.

Thresholds   #Metrics   Impurity function   TP   FP   TN   FN   Acc     AUC
a=70, b=90   29         Entropy             38   26   29   22   0.583   0.580
a=70, b=90   29         Misclassification   40   20   35   20   0.652   0.652
a=70, b=90   9          Entropy             36   12   43   24   0.687   0.691
a=70, b=90   9          Misclassification   36   15   40   24   0.661   0.664
a=75, b=85   29         Entropy             42   26   29   18   0.617   0.614
a=75, b=85   29         Misclassification   42   26   29   18   0.617   0.614
a=75, b=85   9          Entropy             40   14   41   20   0.704   0.706
a=75, b=85   9          Misclassification   37   18   37   23   0.644   0.645
a=65, b=95   29         Entropy             44   24   31   16   0.652   0.649
a=65, b=95   29         Misclassification   40   23   32   20   0.626   0.624
a=65, b=95   9          Entropy             31   15   40   29   0.617   0.622
a=65, b=95   9          Misclassification   33   18   37   27   0.609   0.611
Table 4.7: Detailed results obtained for the Ar3 data set.
Thresholds   #Metrics   Impurity function   TP   FP   TN   FN   Acc     AUC
a=70, b=90   29         Entropy             40   10   77   20   0.796   0.776
a=70, b=90   29         Misclassification   39   15   72   21   0.755   0.739
a=70, b=90   9          Entropy             27   7    80   33   0.728   0.685
a=70, b=90   9          Misclassification   26   7    80   34   0.721   0.676
a=75, b=85   29         Entropy             32   15   72   28   0.708   0.681
a=75, b=85   29         Misclassification   41   13   74   19   0.782   0.767
a=75, b=85   9          Entropy             29   11   76   31   0.714   0.678
a=75, b=85   9          Misclassification   30   13   74   30   0.708   0.675
a=65, b=95   29         Entropy             36   6    81   24   0.796   0.766
a=65, b=95   29         Misclassification   34   9    78   26   0.762   0.732
a=65, b=95   9          Entropy             25   5    82   35   0.728   0.680
a=65, b=95   9          Misclassification   26   5    82   34   0.735   0.688
Table 4.8: Detailed results obtained for the Ar4 data set.
Thresholds   #Metrics   Impurity function   TP   FP   TN   FN   Acc     AUC
a=70, b=90   29         Entropy             24   4    24   4    0.857   0.857
a=70, b=90   29         Misclassification   24   2    26   4    0.893   0.893
a=70, b=90   9          Entropy             21   6    22   7    0.768   0.768
a=70, b=90   9          Misclassification   22   6    22   6    0.786   0.786
a=75, b=85   29         Entropy             22   5    23   6    0.804   0.804
a=75, b=85   29         Misclassification   19   5    23   9    0.750   0.750
a=75, b=85   9          Entropy             21   6    22   7    0.768   0.768
a=75, b=85   9          Misclassification   21   6    22   6    0.768   0.768
a=65, b=95   29         Entropy             21   5    23   7    0.786   0.786
a=65, b=95   29         Misclassification   22   4    24   6    0.821   0.821
a=65, b=95   9          Entropy             20   6    22   8    0.750   0.750
a=65, b=95   9          Misclassification   22   6    22   6    0.786   0.786
Table 4.9: Detailed results obtained for the Ar5 data set.

Thresholds   #Metrics   Impurity function   TP   FP   TN   FN   Acc     AUC
a=70, b=90   29         Entropy             40   13   73   20   0.774   0.758
a=70, b=90   29         Misclassification   42   11   75   18   0.801   0.786
a=70, b=90   9          Entropy             39   6    80   21   0.815   0.790
a=70, b=90   9          Misclassification   39   6    80   21   0.815   0.790
a=75, b=85   29         Entropy             35   16   70   25   0.719   0.699
a=75, b=85   29         Misclassification   37   14   72   23   0.747   0.727
a=75, b=85   9          Entropy             38   6    80   22   0.808   0.782
a=75, b=85   9          Misclassification   38   4    82   22   0.822   0.793
a=65, b=95   29         Entropy             39   5    81   21   0.822   0.796
a=65, b=95   29         Misclassification   40   5    81   20   0.829   0.804
a=65, b=95   9          Entropy             35   5    81   25   0.795   0.763
a=65, b=95   9          Misclassification   35   5    81   25   0.795   0.763
Table 4.10: Detailed results obtained for the Ar6 data set.
Data set   Acc Min   Acc Max   Acc Avg   Acc Stdev   AUC Min   AUC Max   AUC Avg   AUC Stdev
Ar1        0.797     0.855     0.821     0.013       0.745     0.807     0.767     0.017
Ar3        0.583     0.704     0.639     0.033       0.580     0.706     0.639     0.034
Ar4        0.708     0.796     0.744     0.032       0.675     0.776     0.712     0.039
Ar5        0.750     0.893     0.795     0.042       0.750     0.893     0.795     0.042
Ar6        0.719     0.829     0.795     0.032       0.699     0.804     0.771     0.030
Table 4.11: Minimum, maximum, average and population standard deviations of the obtained values on each data set.
Data set T M I
Ar1 65-95 29 Entropy
Ar3 75-85 9 Entropy
Ar4 70-90 29 Entropy
Ar5 70-90 29 Misclassification
Ar6 65-95 29 Misclassification
Table 4.12: The configurations for which the highest AUC values are achieved.

Thresholds for the fuzzy functions.
The second column of Table 4.13 contains, for each of the three threshold pairs used for the fuzzy functions, the number of cases in which the highest AUC is achieved for those threshold values, the other parameters having the same value. For example, for the Ar1 data set, we compare the AUC values achieved for 9 software metrics with Entropy, for the three possible threshold values. When two thresholds have the same maximum AUC value, both are considered.
Number of software metrics used.
The third column of Table 4.13 contains, for both values of the number of software metrics used, the number of cases in which the value of the AUC measure is higher for that number of software metrics, the other parameters having the same value.
The impurity function used.
The last column of Table 4.13 contains, for both impurity functions used, the number of cases in which the value of the AUC measure was higher for that impurity function, the other parameters having the same value; the Ties row counts the number of cases in which the AUC value is the same for the two impurity functions. From Table 4.13 we can see that, even if Misclassification has the highest number of wins, there is no significant difference between the two impurity functions. While in the case of the other two criteria one of the parameter values always has about twice as many wins as the other(s), in this case the difference between the number of wins for the two impurity functions is only one, and there are also 7 ties. What is interesting is how these wins are achieved: on the Ar1, Ar3 and Ar4 data sets together, Entropy has 10 wins (from a total of 11 wins) and Misclassification only 4 (and there are 4 ties), while on the other two data sets, Ar5 and Ar6, Misclassification has 8 wins and Entropy only 1.
                 Threshold value           # Software metrics   Impurity function
                 70-90   75-85   65-95     9      29            Entropy   Misclassification
Number of wins   12      5       6         10     20            11        12
Ties             -                         -                    7
Table 4.13: Comparison of our results based on the considered comparison criteria.
Comparison to related work.
Table 4.14 compares our best AUC values with supervised learning methods from the literature. It can be seen that our fuzzy decision tree approach leads to better results than most of the other approaches. Out of 11 other approaches, our approach is the second best on four of the data sets (Ar1, Ar4, Ar5 and Ar6), and the third best on the remaining one, Ar3.
Figure 4.13 presents, for each data set, how many of the other approaches our method outperforms. One can observe that the presented FuzzyDT method outperforms most of the approaches considered for comparison.
We note that some of the authors we compare ourselves to report average AUCs, while
others, such as [ATF12], report the best values. Since our standard deviations are small, we
consider our comparisons to still be relevant and insightful.

Approach Ar1 Ar3 Ar4 Ar5 Ar6
Our FuzzyDT 0.807 0.706 0.776 0.893 0.804
Genetic Programming [ATF12] 0.530 0.67 0.65 0.67 0.630
Multiple Linear Regression [ATF12] 0.550 0.61 0.62 0.55 0.590
Binary Logistic Regression [YM12] 0.551 0.87 0.73 0.39 0.722
Logistic Regression [NK15] 0.734 0.82 0.82 0.91 0.640
Logistic Regression [Mal14b] 0.494 n/a n/a n/a 0.538
Artificial Neural Networks [Mal14b] 0.711 n/a n/a n/a 0.774
Support Vector Machines [Mal14b] 0.717 n/a n/a n/a 0.721
Decision Trees [Mal14b] 0.865 n/a n/a n/a 0.948
Cascade Correlation Networks [Mal14b] 0.786 n/a n/a n/a 0.758
GMDH Network [Mal14b] 0.744 n/a n/a n/a 0.702
Gene Expression Programming [Mal14b] 0.547 n/a n/a n/a 0.688
Table 4.14: Comparison of our best AUC values with related work on the same data sets.
Figure 4.13: Counts of related work methods that are better and worse than FuzzyDT on
the considered data sets.

Conclusions and Further Work
In the above, we have conducted a study on the effect of changing different parameters for the FuzzyDT method we have previously introduced in [MMCC16] for software defect prediction. We have considered two possible variations for the FuzzyDT, and we applied them on the Ar data sets. We highlighted that both variations can perform well, depending on the data set we are working with. This is why we recommend trying both and choosing the best performing one for the problem at hand.
Considering the thresholds for the fuzzy functions comparison criterion, we can observe that the best thresholds for defining the fuzzy functions seem to be 70-90. This pair provides a higher AUC than the other two threshold pairs 12 times, which is slightly more than half of the cases. From the number of software metrics point of view, we can see that using all the software metrics from the data set leads to better results than using only the 9 software metrics selected in [MCCS15]. The best impurity function can depend on the exact data set; therefore, it is impossible to indicate the best impurity function for a data set without performing experiments that consider both Entropy and Misclassification.
The experimental results we have obtained for the best parameter setting show that the
FuzzyDT approach performs better than most of the existing approaches for the software
defect prediction task. Further work will be done to use function approximation methods
(like neural networks, radial basis function networks, etc.) to learn the fuzzy functions.
4.2.3 Software Development E ort Estimation
This section presents our experiments and results on the SDEE problem, using various
machine learning elements [IDC17, Ion17].
Data sets
Description of data sets. Our first data sets are the ones introduced in [IDC17]. They consist of a templated data set (i.e. a data set following a given template, described below), which we call T, and eight other data sets, which we refer to as d1, d2, ..., d8.
All data sets are provided by a software company that deals with software development and general IT maintenance work. The software development activities are desktop and web-related, using Microsoft .NET technologies, and the maintenance work consists of network management, printer servicing and other such work.
The T data set consists of tasks that are described by team members in a certain format that is meant to help learning algorithms infer estimates more accurately. For our T set, all of the following hold true:
A task has one or more actions that have to be performed with the goal of completing
the task.
A task can take one or more days to complete.
An action refers to an indivisible set of development activities that have to be performed
during a single work day.
One or more actions, done by one or more developers, can be necessary to complete a
task.
Each instance of T describes a task and its associated actions.
The content of a task represents a development goal.

The content of an activity represents what was done in order to achieve the goal (reinstalling a program, installing some hardware, restoring a backup etc.).
In T, we only describe development (programming) tasks.
In T, an instance consists of text representing:
1. For the task:
The Interface worked on: the name of a database table, class, source file etc.
The Complexity: as estimated by the project manager, an integer between 1 (trivial) and 5 (very difficult).
The Number of entities: how many entities the change will affect, usually an objective measure.
The Estimation: a very rough human estimate, in minutes.
The Functionality: a brief textual description of the goal.
2. For each action:
The Description: a brief textual description of the action.
The Type: one of "Creation" or "Change", representing whether something new was added to the project (a file or database table, for example), or whether an existing entity was somehow changed.
3. The last number represents the actual time in minutes it took to complete the task.
For example, an instance of T can look like this:
Interface: Catalog
Complexity: 4
NumEntities: 2
Estimation: 100
Functionality: Add a new admin only field for the internal
product rating
Description: Insert new columns
Type: Change
Description: Make new column admin only
Type: Change
150
This describes a task with two actions, since "Description" and "Type", which are particular to actions, appear twice. It took 150 minutes to complete the task.
The d1 through d8 data sets have a simpler format and they refer to non-programming tasks. The following is what each instance of a di data set contains:
A textual description of the task, similar to the "Functionality" field of the instances of T.
A number representing the count of physical systems the client has that are managed
by the company.
Another number representing the licensed software count that the client has and that
must be managed by the company.

Data set   Number of instances   Short presentation
T          203                   The templated data set, representing data collected over a five-month period.
d1         147                   Contains data referring to network administration tasks, since the company started tracking them (a few years).
d2         1756                  Contains data referring to financial software activities, such as receipts, billings etc., since the company started tracking them (a few years).
d3         138                   Contains various maintenance activities, since the start of internal tracking.
d4         318                   Same as above.
d5         564                   Same as above.
d6         220                   Same as above.
d7         194                   Same as above.
d8         862                   More general hardware maintenance tasks.
Table 4.15: Number of instances and short presentations for each data set.
A final number representing the actual time in minutes it took to complete the task.
Table 4.15 presents the number of instances in each data set, along with a short presentation of each data set's contents.
Each di data set contains instances collected up until the same time as those for T.
We note that our data sets are very diverse: T was collected over a relatively short period of time with the express purpose of being adequate for the SDEE problem, while the others represent a simple, ad-hoc, internal tracking of the company's business. Moreover, the di data sets were collected over a period of a few years, with multiple developers introducing data in their own particular styles, since there was no guideline for how descriptions should be written.
For both the T and di data sets, the descriptions are very short, usually not containing more than 10 words.
Our next data set is the one introduced in [Ion17]. It contains real world data from a local software company. It consists of 7826 instances obtained from a project management software. Each instance describes a task and contains features such as the task title, description, type, reporter, team, developer responsible, severity and others.
There are 5 possible task types, based on whether they are for fixing bugs, implementing features and other such development classifications. We run experiments on the entire data set as well as on each task type separately.
For each task, we know the time it took to complete it. This is what we will learn in our
supervised learning task. We also know the human estimates, which we use to compute the
human MMRE value for each data set.
Table 4.16 presents a summary of our data sets. We only use 2000 iterations on the full
data set because the execution time is too high otherwise for so many instances and with a
cross validation for each hyperparameter sample.

Task type Instances Search iterations Human MMRE
All 7826 2000 0.466
s1 1679 5000 0.445
s2 1191 5000 0.554
s3 4366 5000 0.452
s4 353 5000 0.422
s5 237 5000 0.474
Table 4.16: Summary of data sets.
Visualization of data sets. We have decided to visualize some of our data sets in two dimensions for a better understanding of the complexity and difficulty of the problem we are dealing with. We achieve this by applying t-SNE [vdMH08] to the values returned by applying either TF-IDF, doc2vec or TF-IDF × doc2vec (with and without parsing) on all instances of a data set. t-SNE is an algorithm that can reduce a given data set to fewer dimensions, while still keeping relevant information. Among others, it is useful as a data visualization tool. By visualizing the data in two dimensional space, we can gain intuition about how well learning is likely to work on a certain data set: if the data set looks easy to learn from in two dimensions, it is most likely going to remain that way in the original space as well (or, if not, we can just reduce it first). If it looks difficult to learn from in the reduced space, then it is still possible that it will be easier in the original space.
We only considered the T and d2 data sets: T because it is the templated set and d2 because it is the largest "raw" set.
Figure 4.14 shows visualizations for the T data set using TF-IDF and doc2vec with parsing, and Figure 4.15 shows the same for the d2 data set. We can see that we are likely to have a difficult problem on our hands, since simple linear regression does a very poor job of fitting the reduced data sets. Moreover, it does not look like non-linear models would do a better job either, so we must rely on the extra information in the original space.
Experimental results
In this section, we present our experimental results on the real world data for SDEE, described in Section 4.2.3 and introduced in [IDC17], using machine learning. Our experiments are divided by the text representation method used (TF-IDF, doc2vec and TF-IDF × doc2vec) and, within each representation method, by whether or not we use the initial text preprocessing.
For each experiment, we report MMRE (Formula (3.5)) and the standard deviation (SD)
on the test folds within 10 fold cross validation.
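For reference, the sketch below computes MMRE under its standard definition (the mean magnitude of relative error), which is how we read Formula (3.5); the numbers are hypothetical.

import numpy as np

def mmre(actual, predicted):
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted) / actual))

print(mmre([100, 200, 50], [120, 150, 60]))   # ~0.217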
Using TF-IDF. Table 4.17 presents results on all data sets using TF-IDF on the initially preprocessed text instances.
Table 4.18 shows our results under the same conditions, but without the initial text
preprocessing.
We can see that using regression, we obtain significantly better results than with classification into Fibonacci classes on all data sets.
Using doc2vec. Table 4.19 presents our results on all data sets with the Gensim implementation of doc2vec, using the initial text preprocessing.
Table 4.20 presents results for the same case, but without the initial text preprocessing.
This time, using GNB, we obtain a slightly better result on the templated data set when not using the initial text preprocessing. On the other data sets, SVR still provides considerably better results.

Figure 4.14: Visualizations of TF-IDF and doc2vec transformers reduced to two dimensions on the T data set, with initial preprocessing (parsing).

Figure 4.15: Visualizations of TF-IDF and doc2vec transformers reduced to two dimensions on the d2 data set, with initial preprocessing (parsing).

Data set   Set size   MMRE SVR   MMRE GNB   SD SVR   SD GNB
T 203 0.53 0.668 0.094 0.171
d1 147 0.593 0.705 0.091 0.094
d2 1756 0.641 0.743 0.072 0.069
d3 138 0.588 1.262 0.073 0.822
d4 318 0.571 0.674 0.043 0.071
d5 564 0.594 0.849 0.06 0.173
d6 220 0.587 0.973 0.049 0.353
d7 194 0.597 1.033 0.287 0.252
d8 862 0.643 0.707 0.039 0.062
Table 4.17: Results using TF-IDF and initial text preprocessing. The best results are highlighted.
Data set   Set size   MMRE SVR   MMRE GNB   SD SVR   SD GNB
T 203 0.535 0.677 0.124 0.166
d1 147 0.627 0.664 0.054 0.055
d2 1756 0.644 0.67 0.043 0.071
d3 138 0.573 0.636 0.122 0.137
d4 318 0.585 0.661 0.067 0.096
d5 564 0.588 0.891 0.046 0.405
d6 220 0.609 0.747 0.09 0.083
d7 194 0.603 0.802 0.052 0.236
d8 862 0.657 0.803 0.028 0.121
Table 4.18: Results using TF-IDF without the initial text preprocessing. The best results
are highlighted.
Using TF-IDF × doc2vec. Table 4.21 gives the results using TF-IDF × doc2vec with the initial text parsing.
Table 4.22 shows the results under the same circumstances, but without the initial text preprocessing.
Once more, GNB only provides a better result on the templated data set when not using text preprocessing. On the other data sets, our regression approach is clearly superior.
Approach and results on the data set from [Ion17]. Table 4.23 describes which hyperparameters we consider in our random search and the possible values we sample them from. All of our experiments are performed with the help of the scikit-learn library [PVG+11]. Gensim is used for the doc2vec implementation [RS10].
We have gradually tuned our hyperparameter search over multiple runs, eliminating variants that never showed up in the top solutions.
As mentioned in the previous sections, our experiments are performed on each subset of the data, using both doc2vec and TF-IDF. Table 4.24 shows our results on each subset together with the MMRE values of the human estimates. The best results are marked in green and the second best in yellow.
Figure 4.16 shows our results in graphical form.

Data set   Set size   MMRE SVR   MMRE GNB   SD SVR   SD GNB
T 203 0.589 0.592 0.104 0.081
d1 147 0.683 0.731 0.113 0.09
d2 1756 0.657 0.713 0.058 0.095
d3 138 0.618 0.981 0.063 0.642
d4 318 0.62 0.666 0.072 0.058
d5 564 0.606 0.755 0.064 0.254
d6 220 0.623 0.748 0.077 0.124
d7 194 0.576 0.759 0.131 0.314
d8 862 0.662 0.681 0.036 0.043
Table 4.19: Results using doc2vec and initial text preprocessing. The best results are highlighted.
Data set   Set size   MMRE SVR   MMRE GNB   SD SVR   SD GNB
T 203 0.625 0.62 0.085 0.114
d1 147 0.661 0.726 0.093 0.16
d2 1756 0.66 0.727 0.06 0.077
d3 138 0.627 0.922 0.051 0.466
d4 318 0.627 0.638 0.072 0.031
d5 564 0.603 0.661 0.061 0.093
d6 220 0.63 0.931 0.096 0.293
d7 194 0.585 0.78 0.122 0.161
d8 862 0.663 0.697 0.041 0.035
Table 4.20: Results using doc2vec without the initial text preprocessing. The best results
are highlighted.
From Table 4.24 and Figure 4.16 we can see that our method obtains the best results on
two subsets, both times using doc2vec. While most of the time we are unable to outperform
the human estimates, our results are encouraging considering that we mostly rely only on
text data.
Moreover, our method performs best on the subsets with fewer instances, which represent more specialized types of tasks. This suggests that more specialized tasks are easier for machine learning algorithms to learn. Our hypothesis is that such tasks have a clearer structure, deducible from their task descriptions without requiring much knowledge of the entire code base.
More general tasks, on the other hand, might span a wider project area, and thus require
intimate knowledge of multiple areas of the project in order to make proper estimates. Since
our method does not incorporate any global project knowledge in its learning, and since our
available instances are quite few for such general tasks, learning might be hindered.
Discussion and comparison to related work
Since we obtain better results in [IDC17], only these are considered in the following comparisons.
In order to see how the choice of text vectorizer influences the results on each data set, Table 4.25 presents the MMRE results using SVR and GNB for each of TF-IDF, doc2vec and TF-IDF × doc2vec, with the initial text preprocessing. Table 4.26 presents the same data, except without the initial text parsing.

Data set   Set size   MMRE SVR   MMRE GNB   SD SVR   SD GNB
T 203 0.565 0.727 0.093 0.269
d1 147 0.637 0.716 0.115 0.34
d2 1756 0.66 0.695 0.077 0.058
d3 138 0.611 1.463 0.077 1.849
d4 318 0.606 0.614 0.047 0.086
d5 564 0.597 0.925 0.079 0.223
d6 220 0.621 1.128 0.073 0.716
d7 194 0.618 1.045 0.139 0.696
d8 862 0.663 0.712 0.034 0.032
Table 4.21: Results using TF-IDF × doc2vec and initial text preprocessing. The best results are highlighted.
Data set   Set size   MMRE SVR   MMRE GNB   SD SVR   SD GNB
T 203 0.607 0.589 0.046 0.117
d1 147 0.669 0.732 0.102 0.186
d2 1756 0.656 0.681 0.06 0.079
d3 138 0.623 0.639 0.086 0.089
d4 318 0.596 0.674 0.054 0.06
d5 564 0.605 0.845 0.063 0.166
d6 220 0.645 0.743 0.084 0.092
d7 194 0.629 0.779 0.202 0.265
d8 862 0.667 0.787 0.047 0.102
Table 4.22: Results using TF-IDF × doc2vec without the initial text preprocessing. The best results are highlighted.
We can see from Tables 4.25 and 4.26 that TF-IDF followed by SVR provides the best results on all but one data set. When using GNB, doc2vec and TF-IDF × doc2vec lead to better results most of the time, without, however, surpassing the regression approach. This does suggest that doc2vec might be a better vectorizer for problems involving classification than for those involving regression, at least in the context of SDEE. The training is also faster with TF-IDF than with doc2vec. While doc2vec vectors might store more data about the semantics of the documents in question, this did not improve our regression results. It only slightly helped with the classification results.
Figures 4.17 and 4.18 show the mean of the MMRE values across all of our data sets, obtained with each learning model and vectorization model, together with confidence intervals, with and without initial text preprocessing, respectively. We can see that, with initial text preprocessing, SVR performs noticeably better than GNB for all three text vectorization approaches. Without initial preprocessing, the means are very close between SVR and GNB. This suggests that using initial preprocessing helps regression more than it helps classification.
We note that, for both methods, there are no big differences between the data sets. This shows that our machine learning approach to SDEE is robust and is likely to perform well on various data sets. Similarly, we note that we have generally obtained the best results on the T data set, which was specifically built with care to how the tasks are described, in order to help our algorithms perform better. This was successful, and shows that better written task descriptions can help improve MMRE results.

Figure 4.16: Results and comparison to human estimates – chart form.
Figure 4.17: Mean MMRE across all data sets for each learning model and vectorization method, with initial text preprocessing. 95% confidence intervals for the mean are depicted.

Pipeline element             Considered hyperparameters
Vocabulary preprocessing     Minimum number of words to keep, method of word duplication, parameters associated with the method of duplication.
TF-IDF or doc2vec            All hyperparameters pertaining to these models: whether to use IDF, smoothing, how many features to keep, ngram types and their range, doc2vec type, learning rates, number of learning steps and others. Most of the parameters available in [RS10] and some others that we introduced for our purposes.
Artificial neural networks   Discrete configurations of the number of layers (maximum 2), the number of nodes in each layer, learning rates, activation functions, solving algorithms. Most of the parameters available in [PVG+11].
Table 4.23: Hyperparameter search descriptions.
Considering the results we have found in the literature and presented in Section 3.2.2,
the results obtained on our data sets are better than most of the results we found for this
problem. Table 4.27 presents a comparison to the related work reviewed in Section 3.2.2.
The most relevant comparison is with [Sap12], due to the fact that it also uses raw text data
for the experiments.
We note that, compared to the literature approach that also uses raw text data [Sap12], we obtain significantly better results with both regression and classification. Moreover, our regression results on our data sets are considerably better than the classification results, which indicates that regression could be the better choice for the SDEE problem.
Conclusions
We have shown that using text vectorization methods such as TF-IDF and doc2vec,
together with regression algorithms, can obtain better results for the SDEE problem than
classical parametric models such as COCOMO.
Our method is also, as far as we know, one of only two that use text data for providing estimates. It is the only one that uses modern machine learning algorithms and regression in order to do this.
We are confident that better structured text can significantly reduce the MMRE values, as indicated by the generally lower errors on the T data set, which has a decent structure compared to the others. More data would be needed in order to properly assess this, however.

Task type   Instances   MMRE doc2vec   MMRE TF-IDF   Human MMRE
All         7826        0.889          0.814         0.466
1           1679        0.424          0.673         0.445
2           1191        0.779          0.747         0.554
3           4366        0.818          0.834         0.452
4           353         0.538          1.310         0.422
5           237         0.408          0.409         0.474
Table 4.24: Results and comparison to human estimates – table form.
Data set   Set size   TF-IDF SVR   TF-IDF GNB   doc2vec SVR   doc2vec GNB   TF-IDF×doc2vec SVR   TF-IDF×doc2vec GNB
T 203 0.53 0.668 0.589 0.592 0.565 0.727
d1 147 0.593 0.705 0.683 0.731 0.637 0.716
d2 1756 0.641 0.743 0.657 0.713 0.66 0.695
d3 138 0.588 1.262 0.618 0.981 0.611 1.463
d4 318 0.571 0.674 0.62 0.666 0.606 0.614
d5 564 0.594 0.849 0.606 0.755 0.597 0.925
d6 220 0.587 0.973 0.623 0.748 0.621 1.128
d7 194 0.597 1.033 0.576 0.759 0.618 1.045
d8 862 0.643 0.707 0.662 0.681 0.663 0.712
Table 4.25: Results using each text vectorizer with the initial text preprocessing. The best SVR and GNB results are highlighted across the three different text vectorizers.
In the same way, our experiments also suggest the following:
- Regression approaches perform better for SDEE than classification into Planning Poker Fibonacci classes.
- Preprocessing the initial text using basic parsing strategies and extraction of numeric values helps obtain better results, especially when using regression.
- Basic TF-IDF vectorization leads to better results than more advanced methods, such as doc2vec, at least for regression. For classification, doc2vec and the combined TF-IDF+doc2vec obtain slightly better results (see the sketch after this list).
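As an illustration of the comparison behind these observations, the sketch below builds the three kinds of feature representations (TF-IDF, doc2vec via gensim [RS10], and their concatenation) and feeds them to an SVR regressor and a Gaussian Naive Bayes classifier, as in Tables 4.25 and 4.26. It is a simplified, assumed setup: the tokenization, the hyperparameters and the Fibonacci binning of effort values are placeholders, not the exact thesis pipeline.

# Simplified, assumed sketch of the TF-IDF / doc2vec / combined comparison
# feeding SVR (regression) and Gaussian Naive Bayes (classification into
# Planning Poker Fibonacci classes). Not the exact thesis pipeline.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument   # [RS10]
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVR

FIBONACCI = np.array([1, 2, 3, 5, 8, 13, 21])

def to_fibonacci_class(hours):
    """Map an effort value to the nearest Planning Poker Fibonacci class."""
    return FIBONACCI[np.abs(FIBONACCI - hours).argmin()]

def build_features(texts, vector_size=100):
    """Return TF-IDF, doc2vec and concatenated TF-IDF+doc2vec feature matrices."""
    tokenized = [t.lower().split() for t in texts]            # naive tokenization
    tfidf = TfidfVectorizer().fit_transform(texts).toarray()
    d2v_model = Doc2Vec([TaggedDocument(words, [i]) for i, words in enumerate(tokenized)],
                        vector_size=vector_size, min_count=1, epochs=20)
    d2v = np.array([d2v_model.infer_vector(words) for words in tokenized])
    return {"tfidf": tfidf, "doc2vec": d2v,
            "tfidf+doc2vec": np.hstack([tfidf, d2v])}

def evaluate(texts, hours):
    """Fit SVR and GNB on each representation (illustration only, no held-out split)."""
    hours = np.asarray(hours, dtype=float)
    classes = np.array([to_fibonacci_class(h) for h in hours])
    for name, X in build_features(texts).items():
        svr_pred = SVR().fit(X, hours).predict(X)
        gnb_pred = GaussianNB().fit(X, classes).predict(X)
        mmre_svr = np.mean(np.abs(hours - svr_pred) / hours)
        mmre_gnb = np.mean(np.abs(hours - gnb_pred) / hours)
        print(f"{name}: SVR MMRE={mmre_svr:.3f}  GNB MMRE={mmre_gnb:.3f}")

In the actual experiments the models are of course evaluated on held-out data; the sketch only shows how the three representations are constructed and plugged into the two learners.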
In the future, our plan is to gather more data from more companies, in order to better determine which models work best and in which cases. We also plan to incorporate metrics into our data sets, so as to combine the information they provide with the information carried by the textual description of a task. We also intend to use different methods to process the textual descriptions of the input data and to apply other machine learning models to the SDEE problem.

                        TF-IDF            doc2vec           TF-IDF+doc2vec
Data set   Set size     SVR      GNB      SVR      GNB      SVR      GNB
T          203          0.535    0.677    0.625    0.62     0.607    0.589
d1         147          0.627    0.664    0.661    0.726    0.669    0.732
d2         1756         0.644    0.67     0.66     0.727    0.656    0.681
d3         138          0.573    0.636    0.627    0.922    0.623    0.639
d4         318          0.585    0.661    0.627    0.638    0.596    0.674
d5         564          0.588    0.891    0.603    0.661    0.605    0.845
d6         220          0.609    0.747    0.63     0.931    0.645    0.743
d7         194          0.603    0.802    0.585    0.78     0.629    0.779
d8         862          0.657    0.803    0.663    0.697    0.667    0.787

Table 4.26: MMRE results using each text vectorizer without initial text preprocessing. The best SVR and GNB results are highlighted across the three different text vectorizers.
Figure 4.18: Mean MMRE across all data sets for each learning model and vectorization method, without initial text preprocessing. 95% confidence intervals for the mean are depicted.

Related work | Description | MMRE
[BP10] | COCOMO, SEER, COSEEKMO, REVIC, SASET, Cost Model, SLIM, FP, Estimac and Cosmic frameworks on flight software and business applications. | 0.373% – 771.87%
[JP12] | COCOMO with project metrics. | 10% – 46%
[Tok13] | COCOMO and TruePlanning on open source projects. | 30% – 74%
[Tha16] | Function Point Analysis survey. | 13.8% – 1624.31%, 90.38 average
[Tha16] | Use Case Point Analysis survey. | 27.30% – 88.01%, 39.11% average
[UMWB14] | Planning poker, Use Case Point Analysis and human estimates. | 48% for planning poker, 2% – 11% for UCPA, 28% – 38% for human estimates
[DHC15] | Neuro-fuzzy enhancement for SEER-SEM. | 29.01% – 69.05%
[HJLZ15] | Multiple machine learning models. | 87.5% – 95.1%
[vKG06] | Bayesian networks, regression trees, backward elimination and stepwise selection on metrics using two software projects. | 97.2% on one of the projects and 0.392% on the other
[WLL+12] | Literature survey on machine learning for SDEE. | 13.55% – 143%
[UMWB14] | Linear regression and radial basis function networks on metrics. | 66% – 90% for linear regression, 6% and 90% for RBF networks
[Sap12] | Machine learning applied on raw text data from Agile story cards. Uses bag of words and classification into classes corresponding to planning poker Fibonacci estimates. | 92.32%

Table 4.27: Comparison to related work. The related works with higher average MMRE values than our best result on the T data set are marked in green. The works for which we provide an MMRE interval and for which we do better than the upper bound on the T set are marked in yellow. Works that clearly obtain better MMRE values on their data sets than we do on T are left in white.

Conclusions and future work
We have presented three machine learning models (GA, SVR and LWR) applied to three important archaeological tasks: stature prediction, body mass estimation and age at death estimation. For each problem, at least one of our learning methods obtained better results than the previous state of the art. Our approaches have been validated by publications in reputable journals and conferences, which shows that they are useful for both the machine learning and the archaeological communities.
As stated in Chapter 1, the problems we approached with machine learning models are very important to the archaeological community. Our applied machine learning approaches are novel, because our work is the first large-scale application of computational intelligence techniques to problems from the archaeological fields.
Regarding our results, we have obtained the lowest error measurements for the stature prediction task. We have published our stature prediction results in [CIMM16, IMMC15, Ion15b]. Our second-lowest error metrics are on the age at death estimation problem, especially on the subadult and young adult case studies. These results have been published in [ITV16]. Lastly, our largest errors are on the body mass estimation data sets, although our results are still an improvement over the previous state of the art. Our body mass estimation results have been submitted for publication in [ICT16].
Our future work will focus on extending the machine learning approaches we have introduced in order to better handle outliers and missing data, on obtaining more data sets and on developing new models and algorithms that are better suited for handling the specifics of archaeological measurements. For this, we will require better data preprocessing algorithms, and a way to make use of domain knowledge in our algorithms.
Regarding software engineering, we have presented various machine learning models applied to two important software engineering tasks: Software Defect Prediction and Software Development Effort Estimation. For each problem, at least one of our learning methods obtained better results than the previous state of the art. Our approaches have been validated by publications in reputable journals and conferences, which shows that they are useful for both the machine learning and the Software Engineering communities.
As stated throughout the thesis, the problems we approached with machine learning models are very important to the Software Engineering community. Our applied machine learning approaches are either novel with respect to the models used (in the case of the FSOM and FDT) or with respect to the way we have approached a problem with existing concepts (in the case of the SDEE problem). In both cases, we have outperformed previous approaches from the literature.
Regarding our results, we have obtained significantly lower error measurements than the previous state of the art, especially on the SDEE problem. We have published our SDP results in [CCMI16, MCMI16] and our SDEE results in [IDC17, Ion17].
We will further focus on extending the machine learning approaches we have introduced in this thesis in order to better handle outliers and missing data, on obtaining more data sets and on developing new models and algorithms that are better suited for handling the specifics of both archaeological measurements and Software Engineering data. For this, we will require better data preprocessing algorithms, and a way to make use of domain knowledge in our algorithms.

We plan to further study fuzzy approaches, because they can naturally make use of domain knowledge and are resilient to outliers, as we have already seen in our current applications.
Regarding SDEE, we plan to use much larger data sets, so that we have more text to
learn from, and also to combine text with metrics, which are known to produce good results.

Bibliography
[AR04] B.M. Auerbach and C.B. Ruff. Human body mass estimation: a comparison of "morphometric" and "mechanical" methods. American Journal of Physical Anthropology, 125(4):331–342, Dec 2004.
[ARS13] G. Abaei, Z. Rezaei, and A. Selamat. Fault prediction by utilizing self-organizing
map and threshold. In 2013 IEEE International Conference on Control System,
Computing and Engineering (ICCSCE) , pages 465{470, Nov 2013.
[ATF12] Wasif Afzal, Richard Torkar, and Robert Feldt. Resampling methods in software quality classification. International Journal of Software Engineering and Knowledge Engineering, 22(2):203–223, 2012.
[ATS15] Ishani Arora, Vivek Tetarwal, and Anju Saha. Open issues in software defect
prediction. Procedia Computer Science , 46:906 { 912, 2015.
[Aue11] Benjamin M. Auerbach. Methods for estimating missing human skeletal element
osteometric dimensions employed in the revised fully technique for estimating
stature. American Journal of Physical Anthropology , 145:67{80, 2011.
[AV09] Jorge Aranda and Gina Venolia. The secret life of bugs: Going past the errors
and omissions in software repositories. In Proceedings of the 31st International
Conference on Software Engineering , ICSE '09, pages 298{308, Washington, DC,
USA, 2009. IEEE Computer Society.
[AW10] Hervé Abdi and Lynne J. Williams. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4):433–459, 2010.
[B.00] Christopher B. Ruff. Body mass prediction from skeletal frame size in elite athletes. American Journal of Physical Anthropology, 113(4):507–517, Dec 2000.
[BB12a] James Bergstra and Yoshua Bengio. Random search for hyper-parameter opti-
mization. J. Mach. Learn. Res. , 13:281{305, February 2012.
[BB12b] P.S. Bishnu and V. Bhattacherjee. Software fault prediction using quad tree-
based k-means clustering algorithm. IEEE Transactions on Knowledge and Data
Engineering , 24(6):1146{1150, June 2012.
[BCD01] Lawrence D. Brown, T. Tony Cai, and Anirban Dasgupta. Interval estimation
for a binomial proportion. Statistical Science , 16:101{133, 2001.
[BCH+00] Barry W. Boehm, Clark, Horowitz, Brown, Reifer, Chulani, Ray Madachy, and
Bert Steece. Software Cost Estimation with Cocomo II with Cdrom . Prentice
Hall PTR, Upper Saddle River, NJ, USA, 1st edition, 2000.
[BF01] M. G. Belcastro and F. Facchini. Anthropological and cultural features of a
skeletal sample of horsemen from the medieval necropolis of vicenne-campochiaro
(molise, italy). Collegium antropologicum , 25:387{401, 2001.

[Boe01] Barry W. Boehm. Software Engineering Economics , pages 99{150. Springer
Berlin Heidelberg, Berlin, Heidelberg, 2001.
[Boe07] Gary D. Boetticher. Advances in Machine Learning Applications in Software
Engineering , chapter Improving the Credibility of Machine Learner Models in
Software Engineering. IGI Global, 2007.
[BP10] M. S. Saleem Basha and Dhavachelvan Ponnurangam. Analysis of empirical
software e ort estimation models. CoRR , abs/1004.1239, 2010.
[BR14] CARLES BOIX and FRANCES ROSENBLUTH. Bones of contention: The
political economy of height inequality. American Political Science Review , 108:1{
22, 2 2014.
[Bre38] E. Breitinger. Zur berenchnung der korperhohe aus den langen gliedmassen-
knochen. Anthropol. Anz. , 14:249{274, 1938.
[CBM+09] E. Cunha, E. Baccino, L. Martrille, F. Ramsthaler, J. Prieto, Y. Schuliar, N. Lyn-
nerup, and C. Cattaneo. The problem of aging human remains and living indi-
viduals: A review. Forensic Science International , 193(1-3):1{13, 2009.
[CCMI16] Istvan-Gergely Czibula, Gabriela Czibula, Zsuzsanna-Edit Marian, and Vlad-
Sebastian Ionescu. A novel approach using fuzzy self-organizing maps for de-
tecting software faults. Studies in Informatics and Control , 25(2):207{216, 2016.
[CD13] Christian Crowder and Victoria M. Dominguez. Estimation of age at death using cortical bone histomorphometry. Research Report, 2013. Funded by the U.S. Department of Justice, Office of Justice Programs, National Institute of Justice.
[CGA97] Christopher G. Atkeson, Andrew W. Moore, and Stefan Schaal. Locally weighted learning. Artificial Intelligence Review, pages 11–73, 1997.
[CIMM16] Gabriela Czibula, Vlad-Sebastian Ionescu, Diana-Lucia Miholca, and Ioan-
Gabriel Mircea. Machine learning-based approaches for predicting stature from
archaeological skeletal remains using long bone lengths. Journal of Archaeologi-
cal Science , 69:85{99, 2016.
[CL11] Chih-Chung Chang and Chih-Jen Lin. Libsvm: A library for support vector
machines. ACM Trans. Intell. Syst. Technol. , 2(3):27:1{27:27, May 2011.
[Coh05] Mike Cohn. Agile Estimating and Planning . Prentice Hall PTR, Upper Saddle
River, NJ, USA, 2005.
[CT98] M. Chiba and K. Terazawa. Estimation of stature from somatometry of skull.
Forensic Sci. Int. , 97:87{92, 1998.
[CV95] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learn-
ing, 20(3):273{297, 1995.
[CZ01] B. Clark and D. Zubrow. How good is the software: A review of defect prediction techniques. In Software Engineering Symposium, pages 1–35, Carnegie Mellon University, 2001.
[dat] Tera-promise repository. http://openscience.us/repo/.
[DH51] C. Wesley Dupertuis and John A. Hadden. On the reconstruction of stature
from long bones. American Journal of Physical Anthropology , 9(1):15{54, 1951.

[DHC15] Wei Lin Du, Danny Ho, and Luiz Fernando Capretz. A neuro-fuzzy model with SEER-SEM for software effort estimation. CoRR, abs/1508.00032, 2015.
[Dwi84] T. Dwight. Methods of estimating height from parts of skeleton. Medical Recon-
struction New York , 46:293{296, 1884.
[EDB08] N. Elfelly, J.-Y. Dieulot, and P. Borne. A neural approach of multimodel repre-
sentation of complex processes. International Journal of Computers, Communi-
cations & Control , III(2):149{160, 2008.
[Faw06] Tom Fawcett. An introduction to ROC analysis. Pattern Recogn. Lett. ,
27(8):861{874, 2006.
[FBF15] Tarc sio G. S. Fil o, Mariza A. S. Bigonha, and Kecia A. M. Ferreira. A cata-
logue of thresholds for object-oriented software metrics. In First International
Conference on Advances and Trends in Software Engineering , pages 48{55, 2015.
[Fel96] M. R. Feldesman. Race specificity and the femur/stature ratio. American Journal of Physical Anthropology, 100(2):207–224, 1996.
[FP60] G. Fully and H. Pineau. Determination de la stature au moyen du squelette.
Ann Med Legale , 40:145{154, 1960.
[Ful56] G. Fully. Une nouvelle methode de determination de la taille. Annales de
Medecine et de Criminologie , 36:266{273, 1956.
[GBD+11] David Gray, David Bowes, Neil Davey, Yi Sun, and Bruce Christianson. The misuse of the NASA metrics data program data sets for automated software defect prediction. In Proceedings of the Evaluation and Assessment in Software Engineering, pages 96–103, 2011.
[GE06] Daniel D. Galorath and Michael W. Evans. Software Sizing, Estimation, and
Risk Management . Auerbach Publications, Boston, MA, USA, 2006.
[GJTP95] F.E. Grine, W.L. Jungers, P.V. Tobias, and O.M. Pearson. Fossil homo femur
from berg aukas, northern namibia. American Journal of Physical Anthropology ,
26:67{78, 1995.
[Gra76] G. Gralla. Rekonstrukcja dlugos ci ciala z kos ci dlugich. Przegla d Antropolog-
iczny , 42(2):259{264, 1976.
[HB94] Richard J. Hathaway and James C. Bezdek. Nerf c-means: Non-Euclidean rela-
tional fuzzy clustering. Pattern Recognition , 27(3):429{437, 1994.
[HBB+11] Tracy Hall, Sarah Beecham, David Bowes, David Gray, and Steve Counsell. A
systematic literature review on fault prediction performance in software engi-
neering. IEEE Transactions on Software Engineering , 38(6):1276{1304, 2011.
[HDF12] A.A. Shahrjooi Haghighi, M. Abbasi Dezfuli, and S.M. Fakhrahmad. Applying
mining schemes to software fault prediction: A proposed approach aimed at test
cost reduction. In Proceedings of the World Congress on Engineering 2012 Vol I ,
WCE 2012,, pages 1{5, Washington, DC, USA, 2012. IEEE Computer Society.
[HFH+09] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: An update. SIGKDD Explorations, 11(1):10–18, 2009.

[HH07] A. Hasenfuss and Barbara Hammer. Relational topographic maps. In Michael R.
Berthold, John Shawe-Taylor, and Nada Lavrac, editors, Advances in Intelligent
Data Analysis VII, Proceedings of the 7th International Symposium on Intelligent
Data Analysis , volume 4723. Springer, 2007.
[HJLZ15] WanJiang Han, LiXin Jiang, TianBo Lu, and XiaoYan Zhang. Comparison of
machine learning algorithms for software project time prediction. International
Journal of Multimedia and Ubiquitous Engineering , 10(9):1{8, 2015.
[Hol92] T.D. Holland. Estimation of adult stature from fragmentary tibias. J Forensic
Sci., 37(5):1223{1229, 1992.
[HR96] Gruspier KL. Hoppa RD. Estimating diaphyseal length from fragmentary
subadult skeletal remains: implications for palaeodemographic reconstructions of
a southern ontario ossuary. American Journal of Physical Anthropology , 3:341{
354, 1996.
[ICT16] Vlad-Sebastian Ionescu, Gabriela Czibula, and Mihai Teletin. Supervised learn-
ing techniques for body mass estimation in bioarchaeology. In IEEE 7th Inter-
national Workshop on Soft Computing Applications . accepted, 2016.
[IDC17] Vlad-Sebastian Ionescu, Horia Demian, and Istvan-Gergely Czibula. Natural language processing and machine learning methods for software development effort estimation. Studies in Informatics and Control, 26(2):219–228, 2017.
[IMMC15] Vlad-Sebastian Ionescu, Ioan-Gabriel Mircea, Diana-Lucia Miholca, and Gabriela Czibula. Novel instance based learning approaches for stature estimation in archaeology. In 2015 IEEE International Conference on Intelligent Computer Communication and Processing, pages 309–316. IEEE Computer Society, 2015.
[Ion15a] Vlad-Sebastian Ionescu. New supervised learning approaches for sex identification in archaeology. In International Virtual Research Conference In Technical Disciplines, pages 56–64. EDIS – Publishing Institution of the University of Zilina, 2015.
[Ion15b] Vlad-Sebastian Ionescu. Support vector regression methods for height estimation in archaeology. Studia Universitatis Babes-Bolyai Series Informatica, LX(2):70–82, 2015.
[Ion17] Vlad-Sebastian Ionescu. An approach to software development effort estimation using machine learning. In 2017 IEEE International Conference on Intelligent Computer Communication and Processing, page To be published. IEEE Computer Society, 2017.
[IR07] I. Ryan and M.A. Bidmos. Skeletal height reconstruction from measurements of the skull in indigenous South Africans. Forensic Sci. Int., 167:16–21, 2007.
[ITV16] Vlad-Sebastian Ionescu, Mihai Teletin, and Estera-Maria Voiculescu. Machine
learning techniques for age at death estimation from long bone lengths. In
IEEE 11th International Symposium on Applied Computational Intelligence and
Informatics (SACI 2016) , pages 457{462. IEEE Hungary Section, 2016.
[Jan98] C. Z. Janikow. Fuzzy decision trees: issues and methods. IEEE Transactions on
Systems, Man, and Cybernetics, Part B (Cybernetics) , 28(1):1{14, 1998.

[JMJ98] Richard J. Jantz and Peer H. Moore-Jansen. Database for Forensic Anthropology
in the United States, 1962-1991 (ICPSRversion), 1998. Knoxville, TN: University
of Tennessee, Dept. of Anthropology [producer], 1998. Ann Arbor, MI: Inter-
university Consortium for Political and Social Research, 2000.
[JMJ00] Richard J. Jantz and Peer H. Moore-Jansen. Database for Forensic Anthropol-
ogy in the United States, 1962-1991 (ICPSR), 2000. University of Tennessee,
Department of Anthropology.
[JP12] Dragan Boji Jovan Popovi. A comparative evaluation of e ort estimation meth-
ods in the software life cycle. Computer Science and Information Systems ,
9(1):455{484, 2012.
[JS56] I. Jit and S. Singh. Estimation of stature from clavicles. Indian J Med Res. ,
44:137{155, 1956.
[JS04] O.P. Jasuja and G. Singh. Estimation of stature from hand and phalange length.
J Ind Acad Forensic Med , 26(3):100{106, 2004.
[KGJ13] Geertje Klein Goldewijk and Jan Jacobs. The relation between stature and long
bone length in the roman empire. Research Report EEF-13002, University of
Groningen, Research Institute SOM (Systems, Organisations and Management),
2013.
[KH03] Frank Klawonn and Frank Höppner. What is fuzzy about fuzzy clustering? Understanding and improving the concept of the fuzzifier. In Michael R. Berthold, Hans-Joachim Lenz, Elizabeth Bradley, Rudolf Kruse, and Christian Borgelt, editors, Advances in Intelligent Data Analysis V, volume 2810 of Lecture Notes in Computer Science, pages 254–264. Springer, 2003.
[KHJJ98] L.W. Konigsberg, S.M. Hens, L.M. Jantz, and W.L. Jungers. Stature estima-
tion and calibration: Bayesian and maximum likelihood perspectives in physical
anthropology. Am J Phys Anthropol , Suppl 27:65{92, 1998.
[KK96] S. Kaski and T. Kohonen. Exploratory data analysis by the self-organizing
map: Structures of welfare and poverty in the world. In Neural Networks in
Financial Engineering. Proceedings of the Third International Conference on
Neural Networks in the Capital Markets , pages 498{507. World Scienti c, 1996.
[KMFO01] Mira Kajko-Mattsson, Stefan Forssander, and Ulf Olsson. Corrective mainte-
nance maturity model (cm3): Maintainer's education and training. In Proceed-
ings of the 23rd International Conference on Software Engineering , ICSE '01,
pages 610{619, Washington, DC, USA, 2001. IEEE Computer Society.
[KOS09] Andreas Köhler, Matthias Ohrnberger, and Frank Scherbaum. Unsupervised feature selection and general pattern discovery using self-organizing maps for gaining insights into the nature of seismic wavefields. Computers & Geosciences, 35(9):1757–1767, 2009.
[Koz96] J. Kozak. Stature reconstruction from long bones. the estimation of the useful-
ness of some selected methods for skeletal populations from poland. Variability
and Evolution , 5:83{94, 1996.
[KP12] M. Khalilia and M. Popescu. Fuzzy relational self-organizing maps. In 2012 IEEE
International Conference on Fuzzy Systems (FUZZ-IEEE) , pages 1{6, June 2012.

[KTO+07] Peter K. Kihato, Heizo Tokutaka, Masaaki Ohkita, Kikuo Fujimura, Kazuhiko
Kotani, Yoichi Kurozawa, and Yoshio Maniwa. Spherical and torus som ap-
proaches to metabolic syndrome evaluation. In Masumi Ishikawa, Kenji Doya,
Hiroyuki Miyamoto, and Takeshi Yamakawa, editors, ICONIP (2) , volume 4985
ofLecture Notes in Computer Science , pages 274{284. Springer, 2007.
[KZWG11] Sunghun Kim, Hongyu Zhang, Rongxin Wu, and Liang Gong. Dealing with
noise in defect prediction. In Proceedings of the 33rd International Conference
on Software Engineering , ICSE '11, pages 481{490, New York, NY, USA, 2011.
ACM.
[LI14] Andrei Loriana and Vlad Ionescu. Some di erential superordinations using
ruscheweyh derivative and generalized salagean operator. Journal of Compu-
tational Analysis and Applications , 17(3):437{444, 2014.
[LL12] Adam Lipowski and Dorota Lipowska. Roulette-wheel selection via stochastic
acceptance. Physica A: Statistical Mechanics and its Applications , 391(6):2193
{ 2196, 2012.
[LM14] Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and
documents. CoRR , abs/1405.4053, 2014.
[LO92] J. Lampinen and E. Oja. Clustering properties of hierarchical self-organizing
maps. Journal of Mathematical Imaging and Vision , 2(3):261{272, 1992.
[Lun85] J. K. Lundy. The mathematical versus anatomical methods of stature estimate
from long bones. The American Journal of Forensic Medicine and Pathology ,
6:73{75, 1985.
[LW10] C.S. Larsen and L.P. Walker. Bioarchaeology: Health, Lifestyle, and Society
in Recent Human Evolution, A Companion to Biological Anthropology . Wiley-
Blackwell, 2010.
[LZ95] Peng Lei and Hu Zheng. Clustering properties of fuzzy kohonen's self-organizing
feature maps. Journal of Electronics , 12(2):124 { 133, 1995.
[Mal14a] Ruchika Malhotra. Comparative analysis of statistical and machine learning
methods for predicting faulty modules. Applied Soft Computing , 21:286{297,
2014.
[Mal14b] Ruchika Malhotra. Comparative analysis of statistical and machine learning
methods for predicting faulty modules. Applied Soft Computing , 21:286{297,
2014.
[Man92] L. Manouvrier. Determination de la taille dapres les grands os des membres.
Rev. Men. de l'Ecole dAnnthrop. , 2:227{233, 1892.
[McC01] Donna McCarthy. The long and short of it: The reliability and inter-populational
applicability of stature regression equations. Master's thesis, Oregon State Uni-
versity, 2001.
[MCCS15] Zsuzsanna Marian, Gabriela Czibula, Istvan-Gergley Czibula, and Sergiu Sotoc.
Software defect detection using self-organizing maps. Studia Universitatis Babes-
Bolyai, Informatica , LX(2):55 { 69, 2015.
[McH92] H.M. McHenry. Body size and proportions in early hominids. American Journal
of Physical Anthropology , 87:407{431, 1992.

[MCMI16] Zsuzsanna-Edit Marian, Istvan-Gergely Czibula, Ioan-Gabriel Mircea, and Vlad-Sebastian Ionescu. A study on software defect prediction using fuzzy decision trees. Studia Universitatis Babes-Bolyai Series Informatica, LXI(2):5–20, 2016.
[Mel99] Melanie Mitchell. An Introduction to Genetic Algorithms. The MIT Press, 1999.
[MGF07] Tim Menzies, Jeremy Greenwald, and Art Frank. Data mining static code at-
tributes to learn defect predictors. IEEE Transactions on Software Engineering ,
33(1):2{13, 2007.
[Mit97] Thomas M. Mitchell. Machine learning . McGraw-Hill, Inc. New York, USA,
1997.
[MJ03] Kjetil Moløkken and Magne Jørgensen. A review of surveys on software effort estimation. In Proceedings of the 2003 International Symposium on Empirical Software Engineering, ISESE '03, pages 223–, Washington, DC, USA, 2003. IEEE Computer Society.
[MMCC16] Z. Marian, I.G. Mircea, I.G. Czibula, and G. Czibula. A novel approach for
software defect prediction using fuzzy decision trees. pages 1{8, Timisoara, Ro-
mania, 2016. IEEE Computer Science.
[MNM+11] R.G. Menezes, K.R. Nagesh, F.P. Monteiro, G.P. Kumar, T. Kanchan, S. Uysal,
P.P. Rao, P. Rastogi, S.W. Lobo, and S.G. Kalthur. Estimation of stature from
the length of the sternum in south indian females. Journal of Forensic and Legal
Medicine , 18:242{245, 2011.
[Moo08] Megan K. Moore. Body Mass Estimation from the Human Skeleton . PhD thesis,
The University of Tennessee, Knoxville, 2008.
[Mor09] Alexander Moradi. Towards an objective account of nutrition and health in colonial Kenya: A study of stature in African army recruits and civilians, 1880–1980. The Journal of Economic History, 69:719–754, 9 2009.
[MSC+13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546, 2013.
[MUC+07] Laurent Martrille, Douglas H. Ubelaker, Cristina Cattaneo, Fabienne Seguret,
Marie Tremblay, and Eric Baccino. Comparison of four skeletal methods for
the estimation of age at death on white and black adults. Journal of Forensic
Sciences , 52(2):302{307, 2007.
[Nav10] Nadia Navsa. Skeletal morphology of the human hand as applied in forensic
anthropology . PhD thesis, University of Pretoria, September 2010.
[NDrTB07] Vu Nguyen, Sophia Deeds-rubin, Thomas Tan, and Barry Boehm. A sloc count-
ing standard. In COCOMO II Forum 2007 , 2007.
[NK15] Jaechang Nam and Sunghun Kim. Heterogeneous defect prediction. In Proceed-
ings of the 2015 10th Joint Meeting on Foundations of Software Engineering ,
pages 508{519. ACM, 2015.
[OJ05] S.D Ousley and R.L. Jantz. Fordisc 3.0 personal computer forensic discriminant
functions. Forensic Anthropology Center , 2005.

[OOLI16] Georgia Irina Oros, Gheorghe Oros, Alina Alb Lupa, and Vlad Ionescu. Dif-
ferential subordinations obtained by using a generalization of marx-strohhacker
theorem. Journal of Computational Analysis and Applications , 20(1):135{139,
2016.
[PD14] Dolan Champa Pal and Asis Kumar Datta. Estimation of stature from radius
length in living adult bengali males. Indian Journal of Basic and Applied Medical
Research , 3:380{389, march 2014.
[PDS+14] M. Pietrusewsky, M. Douglas, M.K. Swift, R.A. Harper, and M.A. Fleming.
Health in ancient mariana islanders: a bioarchaeological perspective. Journal of
Island and Coastal Archaeology , 9:319{340, 2014.
[Pea99] K. Pearson. Iv. mathematical contributions to the theory of evolution. v. on the
reconstruction of the stature of prehistoric races. Philos. Trans. R. SOC., Series
A, 192:169{244, 1899.
[PFSL12] Charlotte Primeau, Laila Friis, Birgitte Sejrsen, and Niels Lynnerup. A method
for estimating age of danish medieval sub-adults based on long bone length.
Anthropologischer Anzeiger , 69(3):317{333, 2012.
[PH14] Mikyeong Park and Euyseok Hong. Software fault prediction model using cluster-
ing algorithms determining the number of clusters automatically. International
Journal of Software Engineering and Its Applications , 8(7):199{205, 2014.
[PM03] Lawrence H. Putnam and Ware Myers. Five Core Metrics: Intelligence Behind
Successful Software Management . Dorset House Publishing Co., Inc., New York,
NY, USA, 2003.
[PS12] E. Pomeroy and J.T. Stock. Estimation of stature and body mass from the skele-
ton among coastal and mid-altitude andean populations. Am J Phys Anthropol ,
147:264{79, 2012.
[PVG+11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,
M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos,
D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine
learning in Python. Journal of Machine Learning Research , 12:2825{2830, 2011.
[RA07] M.H. Raxter, C.B. Ruff, and B.M. Auerbach. Technical note: Revised Fully stature estimation technique. Am J Phys Anthropol, 133:817–818, 2007.
[RGM+07] F.W. Rösing, M. Graw, B. Marré, S. Ritz-Timme, M.A. Rothschild, K. Rötzscher, A. Schmeling, I. Schröder, and G. Geserick. Recommendations for the forensic diagnosis of sex and age from skeletons. HOMO – Journal of Comparative Human Biology, 58(1):75–89, 2007.
[RHN+12a] C.B. Ruff, B.M. Holt, M. Niskanen, V. Sladek, M. Berner, E. Garofalo, H.M. Garvin, M. Hora, H. Maijanen, S. Niinimäki, K. Salo, E. Schuplerova, and D. Tompkins. Stature and body mass estimation from skeletal remains in the European Holocene. Am J Phys Anthropol., 148:601–617, 2012.
[RHN+12b] C.B. Ruff, B.M. Holt, M. Niskanen, V. Sládek, M. Berner, E. Garofalo, H.M. Garvin, M. Hora, H. Maijanen, S. Niinimäki, K. Salo, E. Schuplerová, and D. Tompkins. Stature and body mass estimation from skeletal remains in the European Holocene. American Journal of Physical Anthropology, 182(4):601–617, Aug 2012.

[RHT13] Danijel Radjenović, Marjan Heričko, Richard Torkar, and Aleš Živković. Software fault prediction metrics: A systematic literature review. Information and Software Technology, 55(8):1397–1418, 2013.
[Rol88] E. Rollet. De la mensuration des os longs des membres. Thesis pour le doc. en
med., 43:1{128, 1888.
[RR06] M.H. Raxter, B.M. Auerbach, and C.B. Ruff. Revision of the Fully technique for estimating statures. Am J Phys Anthropol, 130:374–384, 2006.
[RRA+08] M.H. Raxter, C. Ruff, A. Azab, M. Erfan, M. Solliman, and A. El-Sawaf. Stature estimation in ancient Egyptians: a new technique based on anatomical reconstruction of stature. American Journal of Physical Anthropology, pages 147–155, 2008.
[RS10] Radim Řehůřek and Petr Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. http://is.muni.cz/publication/884893/en.
[RSL91] C.B. Ruff, W.W. Scott, and A.Y-C. Liu. Articular and diaphyseal remodeling of the proximal femur with changes in body mass in adults. American Journal of Physical Anthropology, 86:397–413, 1991.
[RTC02] Stefanie Ritz-Timme and Matthew J Collins. Racemization of aspartic acid in
human proteins. Ageing Research Reviews , 1(1):43 { 59, 2002.
[Sak08] Kazuhiro Sakaue. New method for diagnosis of the sex and age-at-death of an
adult human skeleton from the patella. Bulletin of the National Museum of
Nature and Science, Series D , 34:43{51, 2008.
[Sap12] Alhad Vinayak Sapre. Feasibility of automated estimation of software development effort in agile environments. Master's thesis, The Ohio State University, 2012.
[SCC06] Gabriela Serban, Alina Câmpan, and Istvan Gergely Czibula. A programming interface for finding relational association rules. International Journal of Computers, Communications & Control, I(S.):439–444, June 2006.
[SD09] C. Catal, U. Sevim, and B. Diri. Software fault prediction of unlabeled program modules. In Proceedings of the World Congress on Engineering (WCE), pages 212–217, Dec 2009.
[SDIL94] Sam Stout, W.H. Dietze, M.Y. İşcan, and S.R. Loth. Estimation of age at death using cortical histomorphometry of the sternal end of the fourth rib. Journal of Forensic Sciences, 39(3):778–784, 1994.
[SGC+13] Gwen Robbins Schug, Sat Gupta, Libby W. Cowgill, Paul W. Sciulli, and Samantha H. Blatt. Panel regression formulas for estimating stature and body mass from immature human skeletons: a statistical approach without reference to specific age estimates. Journal of Archaeological Science, 40(7):3076–3086, 2013.
[Sjo90] T. Sjovold. Estimation of stature from long bones utilizing the line of organic
correlation. Human Evolution , 5:431{447, 1990.
[SK99] Panu Somervuo and Teuvo Kohonen. Self-organizing maps and learning vector
quantization for feature sequences. Neural Processing Letters , 10:151{159, 1999.

[SS52] B. Singh and H.S. Sohal. Estimation of stature from the length of clavicle in
punjabis. Indian J Med Res. , 40:67{71, 1952.
[SS97] A. Sen and M. Srivastava. Regression Analysis: Theory, Methods, and Applica-
tions . Springer Texts in Statistics. Springer New York, 1997.
[SS04] Alex J. Smola and Bernhard Schölkopf. A tutorial on support vector regression. Statistics and Computing, 14(3):199–222, 2004.
[Ste29] Paul Huston Stevenson. On racial di erences in stature long bone regression for-
mulae, with special reference to stature reconstruction formulae for the chinese.
Biometrika Trust , 21:303{318, 1929.
[TB99] Michael E. Tipping and Chris M. Bishop. Probabilistic principal component
analysis. Journal of the Royal Statistical Society, Series B , 61:611{622, 1999.
[TBP94] Eric Chen-Kuo Tsao, James C. Bezdek, and Nikhil R. Pal. Fuzzy kohonen
clustering networks. Pattern Recognition , 27(5):757 { 764, 1994.
[Tel50] A. Telkka. On the prediction of human stature from the long bones. Acta
Anatomica , 9(1-2):103{117, 1950.
[Ter15] R.J. Terry. Terry collection postcranial osteometric database, national museum
of natural history, 2015.
[TG52] M. Trotter and G.C. Gleser. Estimation of stature from long bones of american
whites and negroes. American Journal of Physical Anthropology , 10(4):463{514,
1952.
[Tha16] Tharwon Arnuphaptrairong. A literature survey on the accuracy of software effort estimation models. In Proceedings of the International MultiConference of Engineers and Computer Scientists 2016, volume II, 2016.
[Tib81] G.L. Tibbetts. Estimation of stature from the vertebral column in american
blacks. J Forensic Sci , 26:715{723, 1981.
[Tok13] Derya Toka. Accuracy of contemporary parametric software estimation models:
A comparative analysis. In Proceeding of the 39th Euromicro Conference Series
on Software Engineering and Advanced Applications , pages 313{316, Santander,
Spain, 2013.
[Tuf11] Stéphane Tufféry. Data Mining and Statistics for Decision Making. John Wiley and Sons, 2011.
[UMWB14] Muhammad Usman, Emilia Mendes, Francila Weidt, and Ricardo Britto. Effort estimation in agile software development: A systematic literature review. In Proceedings of the 10th International Conference on Predictive Models in Software Engineering, PROMISE '14, pages 82–91, New York, NY, USA, 2014. ACM.
[VBA+14] G. Vercellotti, B.A. Piperata, A.M. Agnew, W.M. Wilson, D.L. Dufour, J.C. Reina, R. Boano, H.M. Justus, C.S. Larsen, S.D. Stout, and P.W. Sciulli. Exploring the multidimensionality of stature variation in the past through comparisons of archaeological and living populations. Am J Phys Anthropol, 155:229–242, 2014.
[vdMH08] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.

[VI13] Swati Varade and Madhav Ingle. Hyper-quad-tree based k-means clustering
algorithm for fault prediction. International Journal of Computer Applications ,
76(5):6{10, August 2013.
[vKG06] C. van Koten and A. R. Gray. An application of bayesian network for predict-
ing object-oriented software maintainability. Inf. Softw. Technol. , 48(1):59{67,
January 2006.
[Vuo94] Petri Vuorimaa. Fuzzy self-organizing map. Fuzzy Sets and Systems , 66:223{231,
1994.
[WLL+12] Jianfeng Wen, Shixian Li, Zhiyong Lin, Yong Hu, and Changqin Huang. Systematic literature review of machine learning based software development effort estimation models. Inf. Softw. Technol., 54(1):41–59, January 2012.
[WLZ00] G. Wahba, Y. Lin, and H. Zhang. Gacv for support vector machines, or, another
way to look at margin-like quantities. Advances in Large Margin classi ers ,
pages 297{309, 2000.
[YM12] Liguo Yu and Alok Mishra. Experience in predicting fault-prone software mod-
ules using complexity metrics. Quality Technology & Quantitative Management ,
9(4):421{433, 2012.
[Zad65] L.A. Zadeh. Fuzzy sets. Information and Control , 8(3):338 { 353, 1965.
[Zhe09] Jun Zheng. Predicting software reliability with neural network ensembles. Expert
Systems with Applications , 36(2, Part 1):2116 { 2122, 2009.
