Man vs. Computer:
Benchmarking Machine Learning Algorithms for Traffic Sign Recognition
J. Stallkamp a,∗, M. Schlipsing a, J. Salmen a, C. Igel b
a Institut für Neuroinformatik, Ruhr-Universität Bochum, Universitätsstraße 150, 44780 Bochum, Germany
b Department of Computer Science, University of Copenhagen, Universitetsparken 1, 2100 Copenhagen, Denmark
Abstract
Traffic signs are characterized by a wide variability in their visual appearance in real-world environments. For example, changes
of illumination, varying weather conditions and partial occlusions impact the perception of road signs. In practice, a large number
of different sign classes needs to be recognized with very high accuracy. Traffic signs have been designed to be easily readable
for humans, who perform very well at this task. For computer systems, however, classifying traffic signs still seems to pose a
challenging pattern recognition problem. Both image processing and machine learning algorithms are continuously refined to
improve on this task. However, little systematic comparison of such systems exists. What is the status quo? Do today's algorithms reach
human performance? For assessing the performance of state-of-the-art machine learning algorithms, we present a publicly available
traffic sign dataset with more than 50,000 images of German road signs in 43 classes. The data was considered in the second stage
of the German Traffic Sign Recognition Benchmark held at IJCNN 2011. The results of this competition are reported and the
best-performing algorithms are briefly described. Convolutional neural networks (CNNs) showed particularly high classification
accuracies in the competition. We measured the performance of human subjects on the same data, and the CNNs outperformed
the human test persons.
Keywords: traffic sign recognition, machine learning, convolutional neural networks, benchmarking
1. Introduction
Traffic sign recognition is a multi-category classification
problem with unbalanced class frequencies. It is a challenging
real-world computer vision problem of high practical relevance,
which has been a research topic for several decades. Many studies
have been published on this subject, and multiple systems,
which often restrict themselves to a subset of relevant signs,
are already commercially available in new high- and mid-range
vehicles. Nevertheless, there has been little systematic unbiased
comparison of approaches, and comprehensive benchmark
datasets are not publicly available.
Road signs are designed to be easily detected and recognized
by human drivers. They follow clear design principles using
color, shape, icons and text. These allow for a wide range of
variations between classes. Signs with the same general meaning,
such as the various speed limits, have a common general
appearance, leading to subsets of traffic signs that are very similar
to each other. Illumination changes, partial occlusions, rotations,
and weather conditions further increase the range of
variations in visual appearance a classifier has to cope with.
Humans are capable of recognizing the large variety of existing
road signs in most situations with near-perfect accuracy.
This applies not only to real-world driving, where rich context
information and multiple views of a single traffic sign are
available, but also to the recognition of individual, clipped
images.

∗Corresponding author. Tel.: +492343225566; Fax: +492343214210
Email addresses: [anonimizat] (J. Stallkamp), [anonimizat] (M. Schlipsing),
[anonimizat] (J. Salmen), [anonimizat] (C. Igel)
In this paper, we compare the traffic sign recognition performance
of humans to that of state-of-the-art machine learning
algorithms. These results were generated in the context of the
second stage of the German Traffic Sign Recognition Benchmark
(GTSRB) held at IJCNN 2011. We present the extended
GTSRB dataset with 51,840 images of German road signs in 43
classes. A website with a public leaderboard was set up and will
be permanently available for submission of new results. Details
about the competition design and analysis of the results of the
first stage are described by Stallkamp et al. (2011).
The paper is organized as follows: Section 2 presents related
work. Section 3 provides details about the benchmark
dataset. Section 4 explains how the human traffic sign recognition
performance is determined, whereas the benchmarked machine
learning algorithms are presented in Sec. 5. The evaluation
procedure is described in Sec. 6, together with the associated
public leaderboard. Benchmarking results are reported and
discussed in Sec. 7 before conclusions are drawn in Sec. 8.
2. Related work
It is difficult to compare the published work on traffic sign
recognition. Studies are based on different data and either consider
the complete task chain of detection, classification and
tracking or focus on the classification part only. Some articles
concentrate on subclasses of signs, for example on speed limit
signs and digit recognition.
Bahlmann et al. (2005) present a holistic system covering
all three processing steps. The classifier itself is claimed to
operate with a correct classification rate of 94% on images from
23 classes. Training was conducted on 4,000 traffic sign images
featuring an unbalanced class frequency of 30 to 600 examples.
The individual performance of the classification component is
evaluated on a test set of 1,700 samples.
Moutarde et al. (2007) present a system for recognition of
European and U.S. speed limit signs. Their approach is based
on single digit recognition using a neural network. Including
detection and tracking, the proposed system obtains a performance
of 89% for U.S. and 90% for European speed limits,
respectively, on 281 traffic signs. Individual classification results
are not provided.
Another traffic sign detection framework is presented by
Ruta et al. (2010). The overall system including detection and
classification of 48 different signs achieves a performance of
85.3% while obtaining classification error rates below 9%.
Broggi et al. (2007) apply multiple neural networks to classify
different traffic signs. In order to choose the appropriate
network, shape and color information from the detection stage
is used. The authors only provide qualitative classification results.
In the work by Keller et al. (2008), a number-based speed
limit classifier is trained on 2,880 images. It achieves a correct
classification rate of 92.4% on 1,233 images. However, it is
not clear whether images of the same traffic sign instance are
shared between sets.
Gao et al. (2006) propose a system based on color features
inspired by human vision. They report recognition rates up to
95% on 98 British traffic sign images.
Various approaches are compared on a dataset containing
1,300 preprocessed examples from 6 classes (5 speed limits and
1 noise class) by Muhammad et al. (2009). The best classification
performance observed was 97%.
In the study by Maldonado Bascón et al. (2010), a classification
performance of 95.5% is achieved using support vector
machines. The database comprises ∼36,000 Spanish traffic
sign samples of 193 sign classes. However, it is not clear
whether the training and test sets can be assumed to be independent,
as the random split only took care of maintaining the
distribution of traffic sign classes (see Sec. 3). To our knowledge,
this database is not publicly available.
Obviously, the results reported above are not comparable, as
all systems are evaluated on proprietary data, most of which is
not publicly available. Therefore, we present a freely available,
extensive traffic sign dataset to allow unbiased comparison of
traffic sign recognition approaches.
3. Dataset
This section describes our publicly available benchmark
dataset. We explain the process of data collection and the provided
data representation.

3.1. Data collection
The dataset was created from approx. 10 hours of video
that were recorded while driving on different road types in Germany
during daytime. The sequences were recorded in March,
October and November 2010. For data collection, a Prosilica
GC1380CH camera was used with automatic exposure control
and a frame rate of 25 fps. The camera images, from which the
traffic sign images are extracted, have a resolution of 1360 ×
1024 pixels. The video sequences are stored in raw Bayer-pattern
format (Bayer, 1975).
Data collection, annotation and image extraction was performed
using the NISYS Advanced Development and Analysis
Framework (ADAF)1, an easily extensible, module-based software
system (see Fig. 1).
Figure 1: Screenshot of the software used for the manual annotation. We made
use of the NISYS Advanced Development and Analysis Framework (ADAF).
We will use the term traffic sign instance to refer to a physical
real-world traffic sign in order to discriminate against traffic
sign images, which are captured when passing the traffic sign by
car. The sequence of images originating from one traffic sign
instance will be referred to as track. Each instance is unique.
In other words, the dataset only contains a single track for each
physical traffic sign.
3.2. Data organization
From 144,769 labelled traffic sign images of 2,416 traffic
sign instances in 70 classes, the GTSRB dataset was compiled
according to the following criteria:
1. Discard tracks with less than 30 images.
2. Discard classes with less than 9 tracks.
3. For the remaining tracks: If the track contains more than
30 images, equidistantly sample 30 images.
Step 3 was performed for two reasons. First of all, the car passes
different traffic sign instances with different velocities depending
on sign position and the overall traffic situation. In the
recording, this leads to different numbers of traffic sign images
per track (approximately 5–250 images per track). Consecutive
images of a traffic sign that was passed with low velocity are
very similar to each other. They do not contribute to the diversity
of the dataset. On the contrary, they cause an undesired imbalance
of dependent images. Since the different velocities are
not uniformly distributed over all traffic sign types, this would
strongly favour image classes that are present in low-speed traffic
(Stop, Yield-right-of-way, low speed limits).

1 http://www.nisys.de

Figure 2: A traffic sign track, which contains traffic sign images captured when
passing a particular traffic sign instance.
Secondly, the question arises why to keep multiple images
per track at all. Although consecutive images in long tracks
are nearly identical, the visual appearance of a traffic sign can
vary significantly over the complete track, as can be seen in
Fig. 2. Traffic signs at high distance result in low resolution
while closer ones are prone to motion blur. The illumination
may change, and the motion of the car affects the perspective
with respect to occlusions and background. Selecting a
fixed number of images per traffic sign both increases the diversity
of the dataset in terms of the variations mentioned above
and avoids an undesired imbalance caused by large numbers of
nearly identical images.
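For illustration, the equidistant sampling of step 3 can be written in a few lines. The following Python sketch is illustrative only and not part of the original GTSRB tooling:

```python
import numpy as np

def subsample_track(track_images, n_keep=30):
    """Keep at most n_keep images from a track, sampled at equidistant positions.

    track_images: list of images in recording order; n_keep: target number of images.
    """
    n = len(track_images)
    if n <= n_keep:
        return list(track_images)          # short tracks are kept as they are
    # n_keep equidistant positions over [0, n-1], including first and last frame
    idx = np.linspace(0, n - 1, n_keep).round().astype(int)
    return [track_images[i] for i in idx]
```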
The selection procedure outlined above reduced the number
to 51,840 images of the 43 classes that are shown in Fig. 3.
The relative class frequencies of the classes are shown in Fig. 4.

Figure 3: Random representatives of the 43 traffic sign classes in the GTSRB
dataset.

Figure 4: Relative class frequencies in the dataset (relative class frequency in % per class ID). The class ID results from
enumerating the classes in Fig. 3 from top-left to bottom-right.
The set contains images of more than 1,700 traffic sign instances.
The size of the traffic signs varies between 15 × 15
and 222 × 193 pixels. The images contain a 10% margin (at
least 5 pixels) around the traffic sign to allow for the usage of
edge detectors. The original size and location of the traffic sign
within the image (region of interest, ROI) is preserved in the
provided annotations. The images are not necessarily squared.
Figure 5 shows the distribution of traffic sign sizes taking into
account the larger of both dimensions of the traffic sign ROI.
The GTSRB dataset was split into three subsets according to
Fig. 6. We applied stratified sampling. The split was performed
at random, but taking into account class and track membership.
This makes sure that (a) the overall class distribution is preserved
for each individual set and that (b) all images of one
traffic sign instance are assigned to the same set, as otherwise
the datasets could not be considered stochastically independent.
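The GTSRB split itself was produced with internal tooling; purely for illustration, a track-aware, class-stratified split along these lines could be sketched as follows (function and variable names are illustrative):

```python
import random
from collections import defaultdict

def split_tracks(track_ids, track_class, ratios=(0.5, 0.25, 0.25), seed=0):
    """Assign whole tracks to (train, validation, test) so that class proportions
    are roughly preserved and no track is split across subsets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for t in track_ids:
        by_class[track_class[t]].append(t)

    subsets = ([], [], [])
    for cls, tracks in by_class.items():
        rng.shuffle(tracks)                      # random split within each class
        n = len(tracks)
        n_train = int(round(ratios[0] * n))
        n_val = int(round(ratios[1] * n))
        subsets[0].extend(tracks[:n_train])
        subsets[1].extend(tracks[n_train:n_train + n_val])
        subsets[2].extend(tracks[n_train + n_val:])
    return subsets
```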
The main split separates the data into the full training set
and the test set. The training set is ordered by class. Furthermore,
the images are grouped by tracks to preserve temporal information,
which may be exploited by algorithms that are capable
of using privileged information (Vapnik and Vashist, 2009).
It can be used for final training of the classifier after all necessary
design decisions were made, or for training of parameter-free
classifiers.
For the test set, in contrast to the training set, temporal information
is not available. It is consecutively numbered and
shuffled to prevent deduction of class membership from other
images of the same track.
Figure 5: Distribution of traffic sign sizes in pixels (frequency per size range, from ≤ 25 to > 95 pixels).
Figure 6: For the two stages of the competition, the data was split into three
sets: the basic training set (50%), the validation set (25%), and the test set (25%).
The basic training set and the validation set together form the full training set.
For the first stage of the GTSRB (see Sec. 5 and Stallkamp
et al., 2011), the full training set was partitioned into two sets.
The validation set is a subset of the full training set and is
still provided for convenience. It is generated according to the
aforementioned criteria and, thus, ensures consistent class distribution
and clean separation from the other sets. It allows for
classifier selection, parameter search and optimization. Data in
the validation set is available in two different configurations:
(a) shuffled like the test set, which allows a fixed system setup
for training and testing, and (b) appended to the basic training
set, sorted by class and grouped by track, as part of the full
training set. The validation set played the role of the test set in
the online competition (see Stallkamp et al., 2011 and Sec. 5).
3.3. Data representation
To allow participants without image processing background
to benchmark their machine learning approaches on the data,
all sets are provided in different representations. The following
pre-calculated features are included:
3.3.1. Color images
Originally, the videos are recorded by a Bayer sensor array.
All extracted traffic sign images are converted into RGB color
images employing an edge-adaptive, constant-hue demosaicking
method (Gunturk et al., 2005; Ramanath et al., 2002). The
images are stored in PPM format alongside the corresponding
annotations in a text file.
3.3.2. HOG features
Histograms of Oriented Gradient (HOG) descriptors have
been proposed by Dalal and Triggs (2005) for pedestrian detection.
Based on gradients of color images, different weighted
and normalized histograms are calculated: first for small non-overlapping
cells of multiple pixels that cover the whole image,
and then for larger overlapping blocks that integrate over multiple
cells.
We provided three sets of features from differently configured
HOG descriptors, which we expected to perform well
when used for classification. To compute HOG features, all images
were scaled to a size of 40 × 40 pixels. For sets 1 and 3
the sign of the gradient response was ignored. Sets 1 and 2 use
cells of size 5 × 5 pixels, a block size of 2 × 2 cells and an orientation
resolution of 8, resulting in feature vectors of length
1568. In contrast, for "HOG 3" cells of size 4 × 4 pixels and
9 orientations resulted in 2916 features.
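For readers who want to reproduce a comparable representation, a "HOG 1"-style descriptor can be approximated with scikit-image. The library choice and the grayscale conversion below are illustrative and may differ from the exact pipeline used to generate the official feature sets, but this configuration yields the 1568-dimensional vectors mentioned above:

```python
from skimage import color, transform
from skimage.feature import hog

def hog1_features(rgb_image):
    """Approximate 'HOG 1'-style descriptor: 40x40 image, 5x5-pixel cells,
    2x2-cell blocks, 8 unsigned orientations -> 7*7 blocks * 4 cells * 8 = 1568 values."""
    gray = color.rgb2gray(rgb_image)   # note: the official sets are based on color gradients
    gray = transform.resize(gray, (40, 40), anti_aliasing=True)
    return hog(gray,
               orientations=8,
               pixels_per_cell=(5, 5),
               cells_per_block=(2, 2),
               block_norm='L2-Hys',
               feature_vector=True)
```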
Figure 7: The "HOG 1" training data projected on its first two principal components.
HOG descriptors provide a good representation of the traffic
signs. As can be seen in Fig. 7, the first two principal components
already provide a clear and meaningful separation of different
sign shapes (e.g., the diamond-shaped signs are located
between the upwards and downwards pointing triangular signs).
3.3.3. Haar-like features
The popularity of Haar features is mainly due to the efficient
computation using the integral image proposed by Viola and
Jones (2001) and their outstanding performance in real-time object
detection employing a cascade of weak classifiers.

Figure 8: Haar feature types used to generate one of the representations provided
by the competition organizers.
Just as for the HOG features, images were rescaled to 40 × 40
pixels. In order to compute Haar features, they were converted
to grayscale after rescaling. We computed five different types
(see Fig. 8) in different sizes to a total of 11,584 features per image.
While one would usually apply feature selection (Salmen
et al., 2010), we provide all Haar-feature responses in the set.
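As an illustration of the underlying computation (this is not the organizers' feature extraction code), the integral image reduces every rectangle sum to four look-ups; a simple two-rectangle edge-type feature then reads:

```python
import numpy as np

def integral_image(gray):
    """Summed-area table with a zero first row/column for easy indexing."""
    ii = np.zeros((gray.shape[0] + 1, gray.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = gray.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of pixel values inside the given rectangle, using four look-ups."""
    return (ii[top + height, left + width] - ii[top, left + width]
            - ii[top + height, left] + ii[top, left])

def haar_two_rect_horizontal(ii, top, left, height, width):
    """Edge-type Haar feature: left half minus right half of the rectangle."""
    half = width // 2
    return (rect_sum(ii, top, left, height, half)
            - rect_sum(ii, top, left + half, height, half))
```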
3.3.4. Color histograms
This set of features was provided to complement the gradient-based
feature sets with color information. It contains a
global histogram of the hue values in HSV color space, resulting
in 256 features per image.
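Such a feature can be sketched as follows; bin edges and normalization of the officially provided set may differ from this illustration:

```python
import numpy as np
from skimage import color

def hue_histogram(rgb_image, n_bins=256):
    """Global histogram of the hue channel in HSV space (256 values per image)."""
    hsv = color.rgb2hsv(rgb_image)             # hue channel lies in [0, 1]
    hue = hsv[..., 0].ravel()
    hist, _ = np.histogram(hue, bins=n_bins, range=(0.0, 1.0))
    return hist.astype(np.float64) / hue.size  # normalize to relative frequencies
```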
4. Human Performance
Traffic signs are designed to be easily distinguishable and
readable by humans. Once spotted, recognition of the majority
of traffic signs is not a challenging problem for them. Although
real-life traffic provides rich context, it is not required for the
task of pure classification. Humans are well capable of recognizing
the type of a traffic sign from clipped images such as in the
GTSRB dataset (e.g., see Fig. 3).
In order to determine the human traffic sign recognition performance,
two experiments were conducted. During these experiments,
images were presented to the test person in three
different versions (see Fig. 9): the original image, an enlarged
version to improve readability of small images, and a contrast-enhanced,
enlarged version to improve readability of dark and
low-contrast samples like the example in Fig. 9. The test
person assigned a class ID by clicking the corresponding button.
Please note that this class ID assignment was for testing
purposes only, not for generation of the ground-truth data, as
this was done on the original camera images (see Sec. 3.1 and
Fig. 1).

Figure 9: User interface of the human performance application.
For the first experiment, the images in the test set were presented
in chunks of 400 randomly chosen images each to 32 test
persons. Over the complete course of the experiment, each image
was presented exactly once for classification. This yielded
an average traffic sign recognition performance over all subjects.
This experiment was executed in analogy to the online
competition (Stallkamp et al., 2011).
As shown in Fig. 10, there is some variance w.r.t. the individual
performance. To some extent, this can be explained by
the random selection of images that were presented to each of
the subjects. Somebody with a lower performance might just
have got more difficult images than somebody else with higher
performance.
To eliminate this possibility, we set up another experiment
to determine the traffic sign recognition performance of individual
subjects on the full test set (12,630 images). As manual
classification of this amount of data is a very tedious, time-consuming
and concentration-demanding task, the experiment
was limited to a single well-performing test person.
To find a suitable candidate, we performed a model selection
step, very much in the same sense as it is used when choosing
or tuning a classifier for a problem. Eight test persons
were confronted with a randomly selected, but fixed subset of
500 images of the validation set. The best-performing one was
selected to classify the test set. In addition to selecting a candidate,
the model selection step served as an initial training phase
to get used to the sometimes unfamiliar appearance of traffic
signs in the dataset. To reduce a negative impact of decreasing
concentration on recognition performance, the experiment on
the full test set was split into multiple sessions.

Figure 10: Distribution of individual performance in the average human performance
experiment (frequency of subjects per correct classification rate bin, from ≤ 0.975 to ≤ 1).
5. Benchmarked methods
This section describes the machine learning algorithms that
were evaluated on the GTSRB dataset. This evaluation constituted
the second stage of the IJCNN 2011 competition The German
Traffic Sign Recognition Benchmark and was performed at
the conference. The first stage of the competition, conducted
online before the conference, attracted more than 20 teams
from all around the world (Stallkamp et al., 2011). A wide
range of state-of-the-art machine learning methods was employed,
including (but not limited to) several kinds of neural
networks, support vector machines, linear discriminant analysis,
subspace analysis, ensemble classifiers, slow feature analysis,
kd-trees, and random forests. The top teams were invited to
the conference for a final competition session. However, participation
was not limited to these teams. Any researcher or team
could enter regardless of their participation or performance in
the first stage of the competition. The second stage was set to reproduce
or improve the results of the online stage and to prevent
potential cheating.
In addition to a baseline algorithm, we present the approaches
of the three best-performing teams.
5.1. Baseline: LDA
As a baseline for comparison, we provide results of a linear
classifier trained by linear discriminant analysis (LDA). Linear
discriminant analysis is based on a maximum a posteriori estimate
of the class membership. The classification rule is derived
under the assumption that the class densities are multi-variate
Gaussians having a common covariance matrix. Linear discrimination
using LDA gives surprisingly good results in practice
despite its simplicity (Hastie et al., 2001). The LDA was
based on the implementation in the Shark Machine Learning
Library (Igel et al., 2008), which is publicly available2.

2 http://shark-project.sourceforge.net
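The published baseline relies on the Shark C++ library; for readers working in Python, an analogous experiment can be sketched with scikit-learn (this is a substitute for illustration, not the original implementation), training the linear classifier on one of the precomputed HOG feature sets:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_baseline(train_features, train_labels, test_features):
    """Fit LDA (shared covariance, MAP decision rule) on HOG features and predict."""
    lda = LinearDiscriminantAnalysis()   # default solver estimates a common covariance matrix
    lda.fit(np.asarray(train_features), np.asarray(train_labels))
    return lda.predict(np.asarray(test_features))
```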
5.2. Team Sermanet: Multi-Scale CNN
Sermanet and LeCun (2011) employed a multi-scale convolutional
neural network (CNN or ConvNet). CNNs are biologically
inspired multi-layer feed-forward networks that are
able to learn task-specific invariant features in a hierarchical
manner, as sketched in Fig. 11. The multiple feature extraction
stages are trained using supervised learning. The raw images
are used as input. Each feature extraction stage of the network
consists of a convolutional layer, a non-linear transformation
layer and a spatial pooling layer. The latter reduces the spatial
resolution, which leads to improved robustness against small
translations, similar to "complex cells" in the standard models
of the visual cortex. In contrast to traditional CNNs, not only
the output of the last stage but the outputs of all feature extraction stages
are fed into the classifier. This results in a combination of different
scales of the receptive field, providing both global and
local features. Moreover, Sermanet and LeCun employed alternative
non-linearities. They used a combination of a rectified
sigmoid followed by subtractive and divisive local normalization
inspired by computational neuroscience models of vision
(Lyu and Simoncelli, 2008; Pinto et al., 2008).

Figure 11: CNN architecture employed by Sermanet and LeCun (2011), who
kindly provided this figure.
The input was scaled to a size of 32 × 32 pixels. Color information
was discarded and the resulting grayscale images were
contrast-normalized. To increase the robustness of the classifier,
Sermanet and LeCun increased the training set size five-fold
by perturbing the available samples with small, random
changes of translation, rotation and scale.
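The preprocessing and jittering can be sketched as follows; the perturbation ranges below are illustrative placeholders, and the original values are given by Sermanet and LeCun (2011):

```python
import numpy as np
from skimage import color, transform

def preprocess(rgb_image):
    """Grayscale conversion, resize to 32x32 and simple global contrast normalization."""
    gray = transform.resize(color.rgb2gray(rgb_image), (32, 32), anti_aliasing=True)
    return (gray - gray.mean()) / (gray.std() + 1e-8)

def jitter(image, rng):
    """One random perturbation: small scale change, rotation and translation (illustrative ranges)."""
    s = rng.uniform(0.9, 1.1)
    tform = transform.AffineTransform(
        scale=(s, s),
        rotation=np.deg2rad(rng.uniform(-5, 5)),
        translation=(rng.uniform(-2, 2), rng.uniform(-2, 2)))
    return transform.warp(image, tform.inverse, mode='edge')

def augment_five_fold(images, seed=0):
    """Return the original images plus four jittered copies of each (five-fold increase)."""
    rng = np.random.default_rng(seed)
    out = []
    for img in images:
        out.append(img)
        out.extend(jitter(img, rng) for _ in range(4))
    return out
```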
5.3. Team IDSIA: Committee of CNNs
Team IDSIA used a committee of CNNs in the form of a
multi-column deep neural network (MCDNN). It is based on
a flexible, high-performance GPU implementation. The approach
in Ciresan et al. (2011) won the first stage of the GTSRB
competition by using a committee of CNNs trained on
raw image pixels and multi-layer perceptrons (MLPs) trained on
the three provided HOG feature sets. For the second and final
competition stage, for which results are presented in this paper,
the authors dropped the MLPs. In turn, they increased the number
of DNNs, because an MCDNN with more columns showed
improved performance. The details of the architecture of one
DNN are shown in Tab. 1.
Table 1: 8-layer DNN architecture used by Team IDSIA.

Layer  Type             #Maps  Neurons/Map  Kernel
0      input            3      48 × 48      –
1      convolutional    100    42 × 42      7 × 7
2      max pooling      100    21 × 21      2 × 2
3      convolutional    150    18 × 18      4 × 4
4      max pooling      150    9 × 9        2 × 2
5      convolutional    250    6 × 6        4 × 4
6      max pooling      250    3 × 3        2 × 2
7      fully connected  300    1 × 1        –
8      fully connected  43     1 × 1        –
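The layer sizes of Tab. 1 map directly onto standard deep learning toolkits. The following PyTorch sketch reproduces the dimensions of one column; it is an illustrative re-implementation without the authors' training code or GPU-specific details, and the tanh non-linearity is an assumption since the table does not specify the activation function:

```python
import torch.nn as nn

class IdsiaDNN(nn.Module):
    """Single column of the MCDNN, following the layer sizes of Tab. 1."""
    def __init__(self, n_classes=43):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 100, kernel_size=7), nn.Tanh(),   # 3x48x48 -> 100x42x42
            nn.MaxPool2d(2),                               # -> 100x21x21
            nn.Conv2d(100, 150, kernel_size=4), nn.Tanh(), # -> 150x18x18
            nn.MaxPool2d(2),                               # -> 150x9x9
            nn.Conv2d(150, 250, kernel_size=4), nn.Tanh(), # -> 250x6x6
            nn.MaxPool2d(2),                               # -> 250x3x3
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(250 * 3 * 3, 300), nn.Tanh(),        # fully connected, 300 units
            nn.Linear(300, n_classes),                     # 43 output classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```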
In contrast to team Sermanet (see Sec. 5.2), team IDSIA only
uses the central ROI containing the traffic sign and ignores the
margin. This region is scaled to a size of 48 × 48 pixels. In
comparison to their approach for the online competition, the authors
improved the preprocessing of the data by using four image
adjustment methods. Histogram stretching increases image
contrast by remapping pixel intensities so that the full range
of available values is used. Histogram equalization transforms pixel
intensities so that the histogram of the resulting image is approximately
uniform. Adaptive histogram equalization applies
the same principle, but to non-overlapping tiles rather than the
full image. Contrast normalization enhances edges by filtering
the image with a difference of Gaussians. The latter was inspired
by the approach of team Sermanet. Each preprocessing
step was applied individually to the training data, resulting in a
five-fold increase of the number of training samples. The generalization
of the individual networks is further increased by
random perturbations of the training data in terms of translation,
rotation and scale. However, in contrast to team Sermanet,
these distortions are computed on-the-fly every time an image is
passed through the network during training. Thus, every image
is distorted differently in each epoch. The training of each DNN
requires about 25 epochs and takes about 2 hours. This leads to
a total training time of approximately 50 hours for the MCDNN.
5.4. Team CAOR: Random Forests
The competition entry of team CAOR is based on a Random
Forest of 500 trees. A Random Forest is an ensemble classifier
that is based on a set of non-pruned random decision trees
(Breiman, 2001). Each decision tree is built on a randomly chosen
subset of the training data. The remaining data is used to
estimate the classification error. In each node of a tree, a small,
randomly chosen subset of features is selected and the best split
of the data is determined based on this selection. For classification,
a sample is passed through all decision trees. The outcome
of the Random Forest is a majority vote over all trees. Team
CAOR used the official HOG 2 dataset. More details on this
approach are reported by Zaklouta et al. (2011).
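In scikit-learn terms, such a classifier on the HOG 2 features amounts to the following sketch (an illustration, not team CAOR's implementation; the feature-subset size per split is a default choice of ours):

```python
from sklearn.ensemble import RandomForestClassifier

def train_random_forest(hog2_train, labels_train):
    """Random Forest of 500 unpruned trees; each tree is grown on a bootstrap sample of
    the training data and considers a random feature subset at every split (Breiman, 2001)."""
    forest = RandomForestClassifier(
        n_estimators=500,       # number of trees, as used by team CAOR
        max_features='sqrt',    # random feature subset per split (illustrative default)
        bootstrap=True,         # per-tree bootstrap sample; out-of-bag data estimates the error
        oob_score=True,
        n_jobs=-1)
    forest.fit(hog2_train, labels_train)
    return forest
```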
Figure 12: The GTSRB submission website, which is open for new contributions.
6. Evaluation procedure
Participating algorithms need to classify the single images
of the GTSRB test set. For model selection and training of
the final classifier, the basic training set and the validation set
(cf. Sec. 3) can be used either independently or combined (full
training set).
Here, we explain how the performance of the algorithms
is assessed and introduce the benchmark website featuring a
public leaderboard and detailed result analysis.
6.1. Performance metric
The performance is evaluated based on the 0/1 loss, that is,
by basically counting the number of misclassifications. Therefore,
we are able to rank algorithms based on their empirical
correct classification rate (CCR).
The loss is chosen equal for all misclassifications, although
the test set is strongly unbalanced w.r.t. the number of samples
per class. This accounts for the fact that every sign is equally
important, independent of variable frequencies of appearance.
Nevertheless, the performance for the different subsets is additionally
considered separately (see Sec. 7.4).
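The metric is straightforward to compute; the following sketch also allows restricting the evaluation to a subset of classes as used in Sec. 7.4 (the helper itself is illustrative):

```python
import numpy as np

def correct_classification_rate(y_true, y_pred, subset_classes=None):
    """Fraction of correctly classified samples (1 - mean 0/1 loss).

    If subset_classes is given, the rate is computed only over test samples
    whose true class belongs to that subset.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if subset_classes is not None:
        mask = np.isin(y_true, list(subset_classes))
        y_true, y_pred = y_true[mask], y_pred[mask]
    return float(np.mean(y_true == y_pred))
```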
6.2. Public leaderboard
In addition to the benchmark dataset itself, we provide an
evaluation website3 featuring a public leaderboard. It was inspired
by a similar website for comparison of stereo vision algorithms4
established by Scharstein and Szeliski (2002). Figure
12 shows a screenshot of the GTSRB submission website.
Our benchmark website will remain permanently open for
submissions. It allows participants to upload result files (in a
simple CSV format) and get immediate feedback about their
performance. The results can be made publicly visible as soon
as publication details are provided. Approaches are ranked based
on their performance on the whole test dataset. Nevertheless,
we allow re-sorting based on subset evaluation.

3 http://benchmark.ini.rub.de
4 http://vision.middlebury.edu/stereo
The website provides a more detailed result analysis, for instance
online computation of the confusion matrix and a list of
all misclassified images. For even more detailed offline analysis,
an open-source software application can be downloaded
that additionally enables participants to compare multiple approaches.
We encourage researchers to continue submitting their results.
While different machine learning algorithms have already
been shown to achieve very high performance, there is a particular
interest in having more real-time capable methods or approaches
focusing on difficult subsets.
7. Results & Discussion
We report the classification performance of the three best-performing
machine learning approaches complemented with
the results of the baseline algorithm as described in Sec. 5. Furthermore,
we present the results of the experiments on human
traffic sign recognition performance (see Sec. 4). The results
that are reported in this section are summarized in Tab. 2.
Table 2: Result overview for the final stage of the GTSRB.

CCR (%)   Team       Method
99.46     IDSIA      Committee of CNNs
99.22     INI-RTCV   Human (best individual)
98.84     INI-RTCV   Human (average)
98.31     Sermanet   Multi-Scale CNN
96.14     CAOR       Random Forests
95.68     INI-RTCV   LDA (HOG 2)
93.18     INI-RTCV   LDA (HOG 1)
92.34     INI-RTCV   LDA (HOG 3)
7.1. Human performance
For a human observer, the images in the dataset vary strongly
in terms of quality and readability. This is, to a large extent,
caused by visual artifacts, such as low resolution, low contrast,
motion blur, or reflections, which originate from the
data acquisition process and hardware. Although the machine
learning algorithms have to deal with these issues as well, the
visual appearance of traffic signs in deficient images can be very
unfamiliar to human observers compared to traffic signs they
encounter in reality.
As noted in Sec. 4, the first experiment on human performance
yields an average traffic sign recognition rate over all
subjects. The distribution of individual classification performances
of the 32 test persons is shown in Fig. 10. However,
this does not give a clear picture of human traffic sign recognition
performance, as the individual image sets that were presented
to the test subjects could vary significantly in difficulty
due to the aforementioned reasons. Although the test application
is designed to improve readability of low-quality images
and, thus, reduce the impact of this variation of difficulty, it
cannot resolve the issues completely. Therefore, the variations
of individual performance are caused both by varying difficulty
of the selected images and by differing ability of the subjects
to cope with these issues and to actually recognize the traffic
signs. The model selection step of the second human performance
experiment prevents the former issue by using a random
but fixed dataset. Thus, the varying performance in this experiment
is due to individual ability of the test persons. As can
be seen in Tab. 2, the single best test person performs significantly
better (McNemar's test, p < 0.001) than the average,
reaching an accuracy of 99.22%. Therefore, future references
in this section refer to the human performance of the single best
individual.
7.2. Machine learning algorithms
As can be seen in Tab. 2, most of the machine learning algorithms
achieved a correct recognition rate of more than 95%,
with the committee of CNNs reaching near-perfect accuracy,
outperforming the human test persons.
From an application point of view, processing time and resource
requirements are important aspects when choosing a classifier.
In this context, it is notable how well LDA, a very simple
and computationally cheap classifier, performs in comparison
to the more complex approaches. Especially the convolutional
networks are computationally demanding, both during
training and testing. Not surprisingly, the performance of LDA
was considerably dependent on the feature representation. In
the following, we just refer to the best LDA results achieved
with the HOG 2 representation.
The performance results of the machine learning algorithms
are all significantly different from each other. With the exception
of the comparison of Random Forests and LDA (p = 0.00865),
all pairwise p-values are smaller than 10^-10. The values were
calculated with McNemar's test for paired samples5.
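For two classifiers evaluated on the same test images, the exact McNemar test considers only the discordant pairs; the following sketch implements the standard exact two-sided version (an illustration, not the exact evaluation script used here):

```python
import numpy as np
from scipy.stats import binom

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test for paired classification results.

    b: samples correct under A but wrong under B; c: wrong under A but correct under B.
    Under the null hypothesis of equal error rates, b ~ Binomial(b + c, 0.5).
    """
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    correct_a = pred_a == y_true
    correct_b = pred_b == y_true
    b = int(np.sum(correct_a & ~correct_b))
    c = int(np.sum(~correct_a & correct_b))
    p = 2.0 * binom.cdf(min(b, c), b + c, 0.5)
    return min(p, 1.0)
```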
7.3. Man vs. Computer
Both the best human individual and the best machine learning
algorithm achieve a very high classification accuracy. The
Committee of CNNs performs significantly better than the best
human individual (McNemar's test, p = 0.01366). However,
even without taking into account that the experimental setup
for the human performance was unfamiliar for the test subjects
and did not reflect real-life traffic scenarios, it needs to be noted
that the best human test person significantly outperformed all
other machine learning algorithms in this comparison. All pairwise
p-values, as calculated with McNemar's test, are smaller
than 10^-10.
5 We provide and discuss p-values instead of confidence levels to show that
correcting for multiple testing still leads to significant results.

7.4. Subsets
In order to gain a deeper insight into the results, we split
the dataset into groups of similar traffic sign classes as shown
in Fig. 13. The individual results per approach and subset are
listed in Tab. 3. A more detailed view is provided by the confusion
matrices for the different approaches in Fig. 14. The
classes are ordered by subsets as defined in Fig. 13a to 13f,
from left-to-right and top-to-bottom respectively. Rows denote
the true class, columns the assigned class. The subsets are separated
by the grey lines. The confusion matrices show the distribution
of error over the different classes.
Common to all approaches except the multi-scale CNN, although
to different extents, is a clustering in two areas: in the
top-left corner, which corresponds to the subset of speed limit
signs (see Fig. 13a), and in the large area in the lower right (second
last row/column), which corresponds to the subset of triangular
danger signs (see Fig. 13e). As can be seen in Fig. 14,
the signs in these subsets are mostly mistaken for signs in the
same subset. So the general shape is matched correctly, but the
contained number or icon cannot be discriminated. If a traffic
sign contains less detailed content, like the blue mandatory
signs (see Fig. 13d), or if the sign has a very distinct shape such
as the unique signs (see Fig. 13f), the recognition rate is usually
above average, with humans even achieving perfect accuracy.
The HOG-based LDA is able to discriminate the round signs
from the triangular ones. However, it easily confuses all round
signs (and some of the unique signs as well) for speed limits.
This is caused by the strongly imbalanced dataset, in which a
third of all signs belong to this subset.
Although similar in overall performance, the Random Forest
approach is not affected by this imbalance. Each decision
tree in the forest is trained on a different, random sample of the
training data. Therefore, the class distribution in this sample
can be very different from the overall dataset.
7.5. Incorrectly classified images
Visual inspection of errors allows a better understanding of why
a certain approach failed at correct classification. Figure 15
shows the images that were incorrectly classified by the best
machine learning approach and by the best individual in the
human performance experiment. For presentation purposes, all
images were contrast-enhanced and scaled to a fixed size.
It is notable that a large part of the error of the committee
of CNNs is caused by a single traffic sign instance, a diamond-shaped
right-of-way sign. It accounts for more than 15% of the
total error. However, not all images of this traffic sign track
were misclassified, but only half of them. In fact, the committee
misclassified those images in this track that were so overexposed
that the yellow center is mostly lost. For humans, this
sign class generally poses no problem due to its unique shape.
Furthermore, the algorithm misclassified a few images due
to occlusion (such as reflections and graffiti) and two images
due to inaccurate annotation that resulted in a non-centered view
of the traffic sign. These images are easily classified by humans.
In contrast, the most difficult classes for humans are speed
limit signs, especially at low resolution, which impairs discrimination
of single digits and, thus, correct recognition.
Figure 13: Subsets of traffic signs: (a) speed limit signs, (b) other prohibitory signs,
(c) derestriction signs, (d) mandatory signs, (e) danger signs, (f) unique signs.
Figure 14: Confusion matrices for (a) IDSIA (Committee of CNNs), (b) human performance (best individual),
(c) human performance (average), (d) Sermanet (Multi-Scale CNN), (e) CAOR (Random Forests), and (f) LDA.
The grid lines separate the traffic sign subsets defined in Fig. 13. The encoded values are normalized per class and in the range [0, 1].
Table 3: Individual results for subsets of traffic signs. Bold type denotes the best result(s) per subset.

Method                    Speed limits  Other prohibitions  Derestriction  Mandatory  Danger  Unique
Committee of CNNs         99.47         99.93               99.72          99.89      99.07   99.22
Human (best individual)   98.32         99.87               98.89          100.00     99.21   100.00
Human (average)           97.63         99.93               98.89          99.72      98.67   100.00
Multi-Scale CNN           98.61         99.87               94.44          97.18      98.03   98.63
Random Forests (HOG 2)    95.95         99.13               87.50          99.27      92.08   98.73
LDA (HOG 2)               95.37         96.80               85.83          97.18      93.73   98.63
Figure 15: Incorrectly classified images: (a) Committee of CNNs, (b) human performance (best individual).
More than 70% of the error can be attributed to this subset of
traffic signs. Misclassification of danger signs causes the major
part of the remaining error for the same reasons. Typical
examples of confusions are caused by similar structures,
for example the exclamation mark (general danger sign) being
confused for the traffic light sign and vice versa (second and
ninth traffic sign in Fig. 13e), or the curvy road sign being confused
with crossing deer (fifth and last traffic sign in Fig. 13e),
which both show a diagonal distribution of black pixels in the
icon area.
7.6. Image size
As shown in Fig. 5, the images in the dataset vary strongly
in size. Smaller images provide lower resolution by definition,
whereas the very large images, i.e., the ones of traffic
signs in close proximity to the ego vehicle, often show blurring
or ghost images (showing the sign twice, blurry and slightly
shifted) due to the larger relative motion in the image plane.
Figure 16 shows the classification performance of all presented
approaches in dependency of the image size. It is not surprising
that, for all approaches, the recognition rate is the lowest
for the smallest images. The low resolution strongly impairs
discriminability of fine details such as the single digits
on speed limit signs or the icons on danger signs. The human
performance continuously increases with increasing image size,
reaching perfect accuracy for images larger than 45 pixels (in
the larger of both dimensions) for the best individual and for
images larger than 75 pixels in the average case. The algorithmic
approaches, however, show reduced performance for very
close images. Possible reasons are the strong motion blur or
the presence of ghost images such as in the lower left images in
Fig. 15a.
This reduction of performance is strongest for Random Forests
and LDA, which generally show a very similar performance
when different image sizes are considered. In addition, both approaches
show a major impact on recognition performance for
very small images. Contrary to expectation, the smallest error
does not occur for mid-size images, which often are of good
quality in terms of resolution and blurring. As the number of
images per size level is strongly decreasing with increasing image
size (see Fig. 5), the sensitivity to single misclassified tracks
(or large parts thereof) increases and impairs performance.

Figure 16: Recognition performance depending on image size (correct classification rate per size range,
from ≤ 25 to > 95 pixels, for the Committee of CNNs, human performance (best and average),
Multi-Scale CNN, Random Forests (HOG 2), and LDA (HOG 2)).
8. Conclusions
We presented a detailed comparison of the traffic sign recognition
performance of state-of-the-art machine learning algorithms
and humans. Although the best individual in the human
performance experiment achieved a close-to-perfect accuracy
of 99.22%, it was outperformed in this challenging task by
the best-performing machine learning approach, a committee
of convolutional neural networks, with a 99.46% correct classification
rate. In contrast to traditional computer vision, where
hand-crafted features are common, convolutional neural networks
are able to learn task-specific features from raw data.
However, in return, "finding the optimal architecture of a ConvNet
for a given task remains mainly empirical" (Sermanet and
LeCun, 2011, Sec. II.B).
Moreover, convolutional neural networks are still computationally
very demanding. Taking into account potential constraints
on hardware capabilities and processing time, as they
are common in the domain of driver assistance systems, it is
striking to see how well linear discriminant analysis, a computationally
cheap classifier, performs on this problem, reaching a
correct recognition rate of 95.68%.
However, none of the machine learning approaches is able
to handle input images of variable size and aspect ratio as present
in the dataset. The usual approach is scaling of the images to
a fixed size. This can cause problems when the aspect ratio is
different between the original and target sizes. Furthermore, it
discards information in larger images or introduces artifacts if
very small images are strongly magnified. Humans are well capable
of recognizing traffic signs of different size, even if viewed
from sharp angles.
The public leaderboard on the competition website will be
permanently open for submission and analysis of new results
on the GTSRB dataset. For the future, we plan to add more
benchmark tasks and data to the competition website. In particular,
we are currently working on a benchmark dataset for the
detection of traffic signs in full camera images.
9. Acknowledgements
We thank Lukas Caup, Sebastian Houben, Stefan Tenbült,
and Marc Tschentscher for their labelling support, Bastian Petzka
for creating the competition website, and NISYS GmbH for
supplying the data collection and annotation software. We thank
Fatin Zaklouta, Pierre Sermanet and Dan Ciresan for the valuable
comments on their approaches. Furthermore, we want to
thank all our test persons for the human performance experiment,
especially Lisa Kalbitz, and all others that contributed
to this competition. We acknowledge support from the German
Federal Ministry of Education and Research within the National
Network Computational Neuroscience, Bernstein Fokus:
"Learning behavioral models: From human experiment to technical
assistance", grant FKZ 01GQ0951.
References
Bahlmann, C., Zhu, Y., Ramesh, V., Pellkofer, M., and Koehler, T. (2005).
A system for traffic sign detection, tracking, and recognition using color,
shape, and motion information. In Proceedings of the IEEE Intelligent Vehicles
Symposium, pages 255–260. IEEE Press.
Bayer, B. E. (1975). U.S. Patent 3971065: Color imaging array. Eastman
Kodak Company.
Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.
Broggi, A., Cerri, P., Medici, P., Porta, P. P., and Ghisio, G. (2007). Real
time road signs recognition. In Proceedings of the IEEE Intelligent Vehicles
Symposium, pages 981–986. IEEE Press.
Ciresan, D. C., Meier, U., Masci, J., and Schmidhuber, J. (2011). A committee
of neural networks for traffic sign classification. In Proceedings of
the IEEE International Joint Conference on Neural Networks, pages 1918–1921.
IEEE Press.
Dalal, N. and Triggs, B. (2005). Histograms of oriented gradients for human
detection. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 886–893.
Gao, X. W., Podladchikova, L., Shaposhnikov, D., Hong, K., and Shevtsova, N.
(2006). Recognition of traffic signs based on their colour and shape features
extracted using human vision models. Journal of Visual Communication and
Image Representation, 17(4):675–685.
Gunturk, B. K., Glotzbach, J., Altunbasak, Y., Schafer, R. W., and Mersereau,
R. M. (2005). Demosaicking: Color filter array interpolation. IEEE Signal
Processing Magazine, 22(1):44–54.
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical
Learning: Data Mining, Inference, and Prediction. Springer-Verlag.
Igel, C., Glasmachers, T., and Heidrich-Meisner, V. (2008). Shark. Journal of
Machine Learning Research, 9:993–996.
Keller, C. G., Sprunk, C., Bahlmann, C., Giebel, J., and Baratoff, G. (2008).
Real-time recognition of U.S. speed signs. In Proceedings of the IEEE Intelligent
Vehicles Symposium, pages 518–523. IEEE Press.
Lyu, S. and Simoncelli, E. P. (2008). Nonlinear image representation using
divisive normalization. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 1–8. IEEE Press.
Maldonado Bascón, S., Acevedo Rodríguez, J., Lafuente Arroyo, S., Caballero,
A., and López-Ferreras, F. (2010). An optimization on pictogram identification
for the road-sign recognition task using SVMs. Computer Vision and
Image Understanding, 114(3):373–383.
Moutarde, F., Bargeton, A., Herbin, A., and Chanussot, A. (2007). Robust
on-vehicle real-time visual detection of American and European speed limit
signs with a modular traffic signs recognition system. In Proceedings of the
IEEE Intelligent Vehicles Symposium, pages 1122–1126. IEEE Press.
Muhammad, A. S., Lavesson, N., Davidsson, P., and Nilsson, M. (2009). Analysis
of speed sign classification algorithms using shape based segmentation
of binary images. In Proceedings of the International Conference on Computer
Analysis of Images and Patterns, volume 5702, pages 1220–1227.
Springer-Verlag.
Pinto, N., Cox, D. D., and DiCarlo, J. J. (2008). Why is real-world visual
object recognition hard? PLoS Computational Biology, 4(1):e27.
Ramanath, R., Snyder, W. E., Bilbro, G. L., and Sander, W. A. (2002). Demosaicking
methods for Bayer color arrays. Journal of Electronic Imaging,
11:306–315.
Ruta, A., Li, Y., and Liu, X. (2010). Real-time traffic sign recognition
from video by class-specific discriminative features. Pattern Recognition,
43(1):416–430.
Salmen, J., Schlipsing, M., and Igel, C. (2010). Efficient update of the covariance
matrix inverse in iterated linear discriminant analysis. Pattern Recognition
Letters, 31(1):1903–1907.
Scharstein, D. and Szeliski, R. (2002). A taxonomy and evaluation of dense
two-frame stereo correspondence algorithms. International Journal of Computer
Vision, 47:7–42.
Sermanet, P. and LeCun, Y. (2011). Traffic sign recognition with multi-scale
convolutional networks. In Proceedings of the IEEE International Joint
Conference on Neural Networks, pages 2809–2813. IEEE Press.
Stallkamp, J., Schlipsing, M., Salmen, J., and Igel, C. (2011). The German
Traffic Sign Recognition Benchmark: A multi-class classification competition.
In Proceedings of the IEEE International Joint Conference on Neural
Networks, pages 1453–1460. IEEE Press.
Vapnik, V. and Vashist, A. (2009). A new learning paradigm: Learning using
privileged information. Neural Networks, 22(5-6):544–557.
Viola, P. and Jones, M. (2001). Robust real-time object detection. International
Journal of Computer Vision, 57(2):137–154.
Zaklouta, F., Stanciulescu, B., and Hamdoun, O. (2011). Traffic sign classification
using k-d trees and random forests. In Proceedings of the IEEE International
Joint Conference on Neural Networks, pages 2151–2155. IEEE Press.