Accepted, Conference on Computer Vision and Pattern Recognition 2001
Rapid Object Detection using a Boosted Cascade of Simple Features

Paul Viola                              Michael Jones
viola@merl.com                          mjones@crl.dec.com
Mitsubishi Electric Research Labs       Compaq CRL
201 Broadway, 8th FL                    One Cambridge Center
Cambridge, MA 02139                     Cambridge, MA 02142
Abstract
This paper describes a machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates. This work is distinguished by three key contributions. The first is the introduction of a new image representation called the "Integral Image" which allows the features used by our detector to be computed very quickly. The second is a learning algorithm, based on AdaBoost, which selects a small number of critical visual features from a larger set and yields extremely efficient classifiers [6]. The third contribution is a method for combining increasingly more complex classifiers in a "cascade" which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions. The cascade can be viewed as an object specific focus-of-attention mechanism which, unlike previous approaches, provides statistical guarantees that discarded regions are unlikely to contain the object of interest. In the domain of face detection the system yields detection rates comparable to the best previous systems. Used in real-time applications, the detector runs at 15 frames per second without resorting to image differencing or skin color detection.
1. Introduction

This paper brings together new algorithms and insights to construct a framework for robust and extremely rapid object detection. This framework is demonstrated on, and in part motivated by, the task of face detection. Toward this end we have constructed a frontal face detection system which achieves detection and false positive rates which are equivalent to the best published results [16, 12, 15, 11, 1]. This face detection system is most clearly distinguished from previous approaches in its ability to detect faces extremely rapidly. Operating on 384 by 288 pixel images, faces are detected at 15 frames per second on a conventional 700 MHz Intel Pentium III. In other face detection systems, auxiliary information, such as image differences in video sequences, or pixel color in color images, has been used to achieve high frame rates. Our system achieves high frame rates working only with the information present in a single grey scale image. These alternative sources of information can also be integrated with our system to achieve even higher frame rates.
There are three main contributions of our object detection framework. We will introduce each of these ideas briefly below and then describe them in detail in subsequent sections.
The first contribution of this paper is a new image representation called an integral image that allows for very fast feature evaluation. Motivated in part by the work of Papageorgiou et al., our detection system does not work directly with image intensities [10]. Like these authors we use a set of features which are reminiscent of Haar basis functions (though we will also use related filters which are more complex than Haar filters). In order to compute these features very rapidly at many scales we introduce the integral image representation for images. The integral image can be computed from an image using a few operations per pixel. Once computed, any one of these Haar-like features can be computed at any scale or location in constant time.
The second contribution of this paper is a method for constructing a classifier by selecting a small number of important features using AdaBoost [6]. Within any image sub-window the total number of Haar-like features is very large, far larger than the number of pixels. In order to ensure fast classification, the learning process must exclude a large majority of the available features, and focus on a small set of critical features. Motivated by the work of Tieu and Viola, feature selection is achieved through a simple modification of the AdaBoost procedure: the weak learner is constrained so that each weak classifier returned can depend on only a single feature [2]. As a result each stage of the boosting process, which selects a new weak classifier, can be viewed as a feature selection process. AdaBoost provides an effective learning algorithm and strong bounds on generalization performance [13, 9, 10].
The third major contribution of this paper is a method for combining successively more complex classifiers in a cascade structure which dramatically increases the speed of the detector by focusing attention on promising regions of the image. The notion behind focus of attention approaches is that it is often possible to rapidly determine where in an image an object might occur [17, 8, 1]. More complex processing is reserved only for these promising regions. The key measure of such an approach is the "false negative" rate of the attentional process. It must be the case that all, or almost all, object instances are selected by the attentional filter.
We will describe a process for training an extremely simple and efficient classifier which can be used as a "supervised" focus of attention operator. The term supervised refers to the fact that the attentional operator is trained to detect examples of a particular class. In the domain of face detection it is possible to achieve fewer than 1% false negatives and 40% false positives using a classifier constructed from two Haar-like features. The effect of this filter is to reduce by over one half the number of locations where the final detector must be evaluated.
Those sub-windows which are not rejected by the initial classifier are processed by a sequence of classifiers, each slightly more complex than the last. If any classifier rejects the sub-window, no further processing is performed. The structure of the cascaded detection process is essentially that of a degenerate decision tree, and as such is related to the work of Geman and colleagues [1, 4].
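As an informal illustration of this rejection chain, the Python sketch below evaluates a hypothetical list of (classifier, threshold) stage pairs over a single sub-window; the stage functions and thresholds are placeholders, not the trained classifiers described in this paper.

def evaluate_cascade(sub_window, stages):
    # stages is a list of (classifier, threshold) pairs ordered from the
    # simplest (fewest features) to the most complex; classifier(sub_window)
    # is assumed to return a real-valued score.
    for classifier, threshold in stages:
        if classifier(sub_window) < threshold:
            return False   # rejected: no further processing of this window
    return True            # accepted by every stage: report a detection

Because the vast majority of sub-windows are rejected by the earliest (and cheapest) stages, the average cost per window is dominated by the first one or two classifiers.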
An extremely fast face detector will have broad practical applications. These include user interfaces, image databases, and teleconferencing. In applications where rapid frame-rates are not necessary, our system will allow for significant additional post-processing and analysis. In addition our system can be implemented on a wide range of small low power devices, including hand-helds and embedded processors. In our lab we have implemented this face detector on the Compaq iPaq handheld and have achieved detection at two frames per second (this device has a low power 200 MIPS StrongARM processor which lacks floating point hardware).
The remainder of the paper describes our contributions and a number of experimental results, including a detailed description of our experimental methodology. Discussion of closely related work takes place at the end of each section.
2. Features

Our object detection procedure classifies images based on the value of simple features. There are many motivations for using features rather than the pixels directly. The most common reason is that features can act to encode ad-hoc domain knowledge that is difficult to learn using a finite quantity of training data. For this system there is also a second critical motivation for features: the feature based system operates much faster than a pixel-based system.

Figure 1: Example rectangle features shown relative to the enclosing detection window. The sum of the pixels which lie within the white rectangles is subtracted from the sum of pixels in the grey rectangles. Two-rectangle features are shown in (A) and (B). Figure (C) shows a three-rectangle feature, and (D) a four-rectangle feature.
The simple features used are reminiscent of Haar basis functions which have been used by Papageorgiou et al. [10]. More specifically, we use three kinds of features. The value of a two-rectangle feature is the difference between the sums of the pixels within two rectangular regions. The regions have the same size and shape and are horizontally or vertically adjacent (see Figure 1). A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a center rectangle. Finally a four-rectangle feature computes the difference between diagonal pairs of rectangles.
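To make the three feature types concrete, the following Python sketch computes each value directly from pixel sums (the integral image speed-up is described in Section 2.1). The coordinate convention (top-left corner plus the width and height of one constituent rectangle) and the sign of each difference are illustrative assumptions rather than the exact conventions of our implementation.

def rect_sum(img, x, y, w, h):
    # img is assumed to be a 2-D array (e.g. a NumPy array) of grey-scale
    # pixel values; returns the sum of the w-by-h rectangle whose top-left
    # corner is (x, y).
    return int(img[y:y + h, x:x + w].sum())

def two_rect_feature(img, x, y, w, h):
    # Difference between the sums of two horizontally adjacent regions of
    # identical size and shape.
    return rect_sum(img, x, y, w, h) - rect_sum(img, x + w, y, w, h)

def three_rect_feature(img, x, y, w, h):
    # Sum within the two outside rectangles subtracted from the sum in the
    # center rectangle.
    return (rect_sum(img, x + w, y, w, h)
            - rect_sum(img, x, y, w, h)
            - rect_sum(img, x + 2 * w, y, w, h))

def four_rect_feature(img, x, y, w, h):
    # Difference between diagonal pairs of rectangles.
    return ((rect_sum(img, x, y, w, h) + rect_sum(img, x + w, y + h, w, h))
            - (rect_sum(img, x + w, y, w, h) + rect_sum(img, x, y + h, w, h)))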
Given that the base resolution of the detector is 24x24, the exhaustive set of rectangle features is quite large, over 180,000. Note that unlike the Haar basis, the set of rectangle features is overcomplete¹.
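To see why the set is so large, one can count the placements of each feature template within the window; the short sketch below does this for a generic template whose width and height are multiples of the template shape. The exact total depends on which templates are enumerated, so this is an illustration of the order of magnitude rather than a derivation of the 180,000 figure.

def count_placements(window=24, shape_w=1, shape_h=1):
    # Count every position and scale of a template whose width is a multiple
    # of shape_w and height a multiple of shape_h inside a square window.
    count = 0
    for w in range(shape_w, window + 1, shape_w):          # template widths
        for h in range(shape_h, window + 1, shape_h):      # template heights
            count += (window - w + 1) * (window - h + 1)   # top-left positions
    return count

# Summing over the template shapes of Figure 1 (horizontal and vertical two-
# and three-rectangle features plus the four-rectangle feature) gives a count
# on the order of 10^5 for a 24x24 window.
total = (count_placements(24, 2, 1) + count_placements(24, 1, 2)
         + count_placements(24, 3, 1) + count_placements(24, 1, 3)
         + count_placements(24, 2, 2))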
2.1. Integral Image

Rectangle features can be computed very rapidly using an intermediate representation for the image which we call the integral image.² The integral image at location (x, y) contains the sum of the pixels above and to the left of (x, y), inclusive:

ii(x, y) = \sum_{x' \le x, \, y' \le y} i(x', y'),

where ii(x, y) is the integral image and i(x, y) is the original image.
¹A complete basis has no linear dependence between basis elements and has the same number of elements as the image space, in this case 576. The full set of 180,000 features is many times over-complete.

²There is a close relation to "summed area tables" as used in graphics [3]. We choose a different name here in order to emphasize its use for the analysis of images, rather than for texture mapping.
Figure 2: The sum of the pixels within rectangle D can be computed with four array references. The value of the integral image at location 1 is the sum of the pixels in rectangle A. The value at location 2 is A + B, at location 3 is A + C, and at location 4 is A + B + C + D. The sum within D can be computed as 4 + 1 - (2 + 3).
Using the following pair of recurrences:

s(x, y) = s(x, y - 1) + i(x, y)    (1)
ii(x, y) = ii(x - 1, y) + s(x, y)    (2)

(where s(x, y) is the cumulative row sum, s(x, -1) = 0, and ii(-1, y) = 0) the integral image can be computed in one pass over the original image.
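A direct, unoptimized rendering of these recurrences in Python might look as follows; the row-major indexing img[y, x] and the integer accumulator type are assumptions of this sketch.

import numpy as np

def integral_image(img):
    # img is assumed to be a 2-D grey-scale array indexed as img[y, x].
    h, w = img.shape
    s = np.zeros((h, w), dtype=np.int64)    # cumulative sum over y, recurrence (1)
    ii = np.zeros((h, w), dtype=np.int64)   # integral image, recurrence (2)
    for y in range(h):
        for x in range(w):
            s[y, x] = (s[y - 1, x] if y > 0 else 0) + int(img[y, x])   # (1)
            ii[y, x] = (ii[y, x - 1] if x > 0 else 0) + s[y, x]        # (2)
    return ii

An equivalent vectorized form is a cumulative sum along each axis in turn, e.g. np.cumsum(np.cumsum(img, axis=0), axis=1).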
Using the integral image any rectangular sum can be computed in four array references (see Figure 2). Clearly the difference between two rectangular sums can be computed in eight references. Since the two-rectangle features defined above involve adjacent rectangular sums they can be computed in six array references, eight in the case of the three-rectangle features, and nine for four-rectangle features.
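Concretely, the four-reference sum of Figure 2 can be written as below; the inclusive corner coordinates (x0, y0) to (x1, y1) are an assumed convention of this sketch.

def rect_sum_ii(ii, x0, y0, x1, y1):
    # Sum of the pixels with x0 <= x <= x1 and y0 <= y <= y1, using four
    # references into the integral image ii (locations 4, 1, 2 and 3 in
    # Figure 2).
    total = int(ii[y1, x1])                       # location 4: A + B + C + D
    if y0 > 0 and x0 > 0:
        total += int(ii[y0 - 1, x0 - 1])          # location 1: A
    if y0 > 0:
        total -= int(ii[y0 - 1, x1])              # location 2: A + B
    if x0 > 0:
        total -= int(ii[y1, x0 - 1])              # location 3: A + C
    return total

A two-rectangle feature then needs only six references because the two sums share the pair of corners along their common edge.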
2.2. Feature Discussion

Rectangle features are somewhat primitive when compared with alternatives such as steerable filters [5, 7]. Steerable filters, and their relatives, are excellent for the detailed analysis of boundaries, image compression, and texture analysis. In contrast rectangle features, while sensitive to the presence of edges, bars, and other simple image structure, are quite coarse. Unlike steerable filters the only orientations available are vertical, horizontal, and diagonal. The set of rectangle features does however provide a rich image representation which supports effective learning. In conjunction with the integral image, the efficiency of the rectangle feature set provides ample compensation for its limited flexibility.
3. Learning Classification Functions

Given a feature set and a training set of positive and negative images, any number of machine learning approaches could be used to learn a classification function. In our system a variant of AdaBoost is used both to select a small set of features and to train the classifier [6]. In its original form, the AdaBoost learning algorithm is used to boost the classification performance of a simple (sometimes called weak) learning algorithm. There are a number of formal guarantees provided by the AdaBoost learning procedure. Freund and Schapire proved that the training error of the strong classifier approaches zero exponentially in the number of rounds. More importantly a number of results were later proved about generalization performance [14]. The key insight is that generalization performance is related to the margin of the examples, and that AdaBoost achieves large margins rapidly.
Recall that there are over 180,000 rectangle features associated with each image sub-window, a number far larger than the number of pixels. Even though each feature can be computed very efficiently, computing the complete set is prohibitively expensive. Our hypothesis, which is borne out by experiment, is that a very small number of these features can be combined to form an effective classifier. The main challenge is to find these features.
In support of this goal, the weak learning algorithm is designed to select the single rectangle feature which best separates the positive and negative examples (this is similar to the approach of [2] in the domain of image database retrieval). For each feature, the weak learner determines the optimal threshold classification function, such that the minimum number of examples are misclassified. A weak classifier h_j(x) thus consists of a feature f_j, a threshold \theta_j and a parity p_j indicating the direction of the inequality sign:

h_j(x) = \begin{cases} 1 & \text{if } p_j f_j(x) < p_j \theta_j \\ 0 & \text{otherwise} \end{cases}

Here x is a 24x24 pixel sub-window of an image. See Table 1 for a summary of the boosting process.
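Table 1 summarizes the boosting procedure itself; as a hedged illustration of just the weak-learner step, the Python sketch below chooses the threshold and parity for one feature by scanning the examples in order of their feature values. The variable names and the tie handling are our own choices, not the exact procedure used in our implementation.

import numpy as np

def train_stump(feature_values, labels, weights):
    # Select the threshold and parity for a single feature that minimize the
    # weighted misclassification error.  labels are 1 (object) or 0
    # (non-object), weights is the current boosting weight distribution, and
    # the returned classifier is h(x) = 1 iff parity * f(x) < parity * threshold.
    order = np.argsort(feature_values)
    f = np.asarray(feature_values, dtype=float)[order]
    y = np.asarray(labels)[order]
    w = np.asarray(weights, dtype=float)[order]

    total_pos = np.sum(w[y == 1])            # total positive weight
    total_neg = np.sum(w[y == 0])            # total negative weight
    pos_below = np.cumsum(w * (y == 1))      # positive weight at or below each value
    neg_below = np.cumsum(w * (y == 0))      # negative weight at or below each value

    # parity = +1: examples below the threshold are classified as the object.
    err_plus = neg_below + (total_pos - pos_below)
    # parity = -1: examples above the threshold are classified as the object.
    err_minus = pos_below + (total_neg - neg_below)

    best = int(np.argmin(np.minimum(err_plus, err_minus)))
    parity = 1 if err_plus[best] <= err_minus[best] else -1
    error = float(min(err_plus[best], err_minus[best]))
    # Place the threshold between the chosen example and the next one so that
    # everything up to and including it falls strictly below the threshold.
    if best + 1 < len(f):
        threshold = 0.5 * (f[best] + f[best + 1])
    else:
        threshold = f[best] + 1.0
    return threshold, parity, error

In a boosting round this scan is repeated for every candidate feature, and the feature with the lowest weighted error is kept as the weak classifier for that round.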
In practice no single feature can perform the classification task with low error. Features which are selected in early rounds of the boosting process have error rates between 0.1 and 0.3. Features selected in later rounds, as the task becomes more difficult, yield error rates between 0.4 and 0.5.
3.1. Learning Discussion

Many general feature selection procedures have been proposed (see chapter 8 of [18] for a review). Our final application demanded a very aggressive approach which would discard the vast majority of features. For a similar recognition problem Papageorgiou et al. proposed a scheme for feature selection based on feature variance [10]. They demonstrated good results selecting 37 features out of a total of 1734 features.

Roth et al. propose a feature selection process based on the Winnow exponential perceptron learning rule [11]. The Winnow learning process converges to a solution where many of these weights are zero. Nevertheless a very large